Scraping Made Easy with jQuery and SelectorGadget
A few days ago I was doing a TON of scraping, and as you know, without the right tools, scraping can be a REAL pain. Out of my pain comes your pleasure — here’s a list of scraping tools and resources which will make your life MUCH easier the next time you need some information from a crufty old website. If you’re short on time, skip to the end and read the tl;dr.
Scraping with jQuery and Node.js
Scraping with jQuery is a real pleasure. Here’s some example code to get you started: it gets the top three articles, with their point values, from Hacker News. The key part:
// agent.body is the page's HTML (how it is fetched is not shown in this excerpt)
var window = jsdom(agent.body).createWindow()
  , $ = require('jquery').create(window);
// scrape!
var titles = $('.title a')
  , points = $('.subtext span');
// keep the first three rows; $.map drops items for which nothing is returned
var printme = $.map(points, function(el, i) {
  if (i < 3) {
    return $(el).text() + '\t' + $(titles[i]).text();
  }
});
console.log(printme.join('\n'));
To get the example working, make sure you’ve installed node.js and npm. You may find run.js very helpful for auto-rerunning your scraper whenever you make a change.
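The pairing step in the example leans on a detail of `$.map`: when the callback returns nothing (`undefined`) or `null`, jQuery leaves that item out of the result, which is how the `i < 3` check trims the list to three rows. Here is a minimal plain-JavaScript sketch of the same logic, with no jsdom or jQuery needed. The function name `topRows` and the sample titles and point strings are made up for illustration; they are not real Hacker News data.

```javascript
// Hypothetical helper: pair each points string with the title at the same
// index, keeping only the first `limit` rows (mirrors the `i < 3` check).
function topRows(points, titles, limit) {
  const rows = [];
  points.forEach(function (pointText, i) {
    if (i < limit) {
      rows.push(pointText + '\t' + titles[i]);
    }
  });
  return rows;
}

// Made-up sample data standing in for the text of the scraped elements.
const samplePoints = ['120 points', '98 points', '75 points', '40 points'];
const sampleTitles = ['First story', 'Second story', 'Third story', 'Fourth story'];

console.log(topRows(samplePoints, sampleTitles, 3).join('\n'));
```

Keeping this step in a small function of its own makes it easy to check without hitting the network every time.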
Test your code live against the page with jqueryify
Now that you can run any jQuery code you’d like, you can start testing your code against the page. I like to use Firebug or Google Chrome’s built-in console. Not all pages include jQuery, and sometimes the included version is old, so it’s best to overwrite it with your own version using the jqueryify bookmarklet. It’s also a good idea to change the version of jQuery used by the bookmarklet to match the one used by jsdom, to avoid any strange bugs.
Find the shortest selector with Selector Gadget
Next up is finding the correct selectors for the information you want to extract. Selector Gadget makes it very easy to find the least complicated selector that still does the job. Of course, you can always choose selectors by hand if you don’t fancy Selector Gadget, but it is super helpful for crufty, nasty sites.
Scrape dynamic pages that use javascript to populate information (arg!)
At this point you can easily find selectors and test on the page live, but there is danger ahead if you encounter a site that uses javascript or AJAX to populate the information. Your tests in the browser will work just fine, but when you load the page programmatically, the page’s javascript won’t be run, so the information never shows up in the HTML you scrape. There are a few ways to get around this:
- Use a regex to extract the values from the javascript written to script tags in the page.
- Pretend to be the page and make requests to the AJAX URLs to get the information you need.
- URL hack. There may still be a URL hack you haven’t tried which will give you the information you need. (In this case, request is your friend.)
- Run the page’s javascript programmatically, then scrape the information after it’s been slotted into the page. (See jsdom’s documentation.)
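As a rough sketch of the first option, suppose the page embeds its data as a JSON literal assigned to a variable inside a script tag. The HTML string, the variable name `pageData`, and the helper `extractInlineJson` below are all hypothetical; a real page will need its own pattern. This simple regex also assumes the object sits on a single line, is valid JSON, and ends at the first `};`.

```javascript
// Hypothetical HTML, standing in for the body of a fetched page.
const html = [
  '<html><body>',
  '<script>var pageData = {"title": "Example", "points": 42};</script>',
  '</body></html>'
].join('\n');

// Pull the object literal assigned to `varName` out of the page source.
// Assumes `varName` is a plain identifier and the value is single-line JSON.
function extractInlineJson(source, varName) {
  const pattern = new RegExp('var\\s+' + varName + '\\s*=\\s*(\\{.*?\\});');
  const match = pattern.exec(source);
  return match ? JSON.parse(match[1]) : null;
}

const data = extractInlineJson(html, 'pageData');
console.log(data.title + '\t' + data.points);
```

Returning `null` when nothing matches lets the scraper notice when the page layout has changed, rather than crashing inside `JSON.parse`.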
Regex: tools to make your life easier
txt2re.com is a regex generator. That’s right, generator, not tester. This means you enter a string, click the parts you’d like to match, and then copy and paste the regex into your code. For me, this is pure bliss.
regexpal.com helps you quickly test your regular expressions against test data. Protip: type only the regex itself; omit the / at the front and end that you’d normally use in your javascript.
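To make the protip concrete: in javascript source a regex literal is wrapped in slashes, but the slashes are delimiters rather than part of the pattern. The pattern and the sample string below are made up for illustration.

```javascript
// In code, the regex literal is written with surrounding slashes...
const literal = /(\d+) points/;

// ...but the pattern itself is only the part between them. That bare
// pattern is what goes into a tester, and it is what `source` reports.
console.log(literal.source);

// Building a regex from the bare pattern string matches the same text.
// (Inside a string, the backslash has to be doubled.)
const fromString = new RegExp('(\\d+) points');
const sample = '120 points by someone';
console.log(literal.exec(sample)[1] === fromString.exec(sample)[1]);
```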
tl;dr
- Use jQuery and node.js for your scraping (see the example code above)
- Test on the live page with the jqueryify bookmarklet
- Find the best selectors with Selector Gadget
- For pages with inline script tags containing information, use txt2re.com to generate regular expressions, and regexpal.com to test them.
- And finally, don’t forget that URL hacking is your friend!
Thanks for reading — if you have any feedback, please leave a comment below!
PS: If things are going wrong, here’s a list of common gotchas:
- use the same version of jQuery to test as the one you use in your scraper
- make sure the information you need is not loaded by javascript on the page
