Over the past ten years, I’ve successfully implemented various types of screen scraping in order to provide data to my clients. Most of these implementations have involved accessing HTML and parsing out the data we needed for the web application.
My latest implementation of this made use of the HTML Agility Pack and managed to incorporate the e-Labels For Education site into the Labels For Education site. (No links, because the e-Labels program is being phased out.) Recently, I’ve been spending a lot of time on some sites doing the same thing over and over again. But most of the sites I visit now implement some kind of AJAX, so a simple web request to a page, without also loading and running its JavaScript, gives me a page with no useful data at all. In the work I’ve done in the past, a plain request and some HTML parsing was sufficient.
This, combined with my recent work implementing Jasmine unit tests for JavaScript and running them in the PhantomJS headless browser, has had me thinking: wouldn’t it be great if I could do similar kinds of screen scraping, or even browser automation, using something like an embedded version of PhantomJS to get the work done?