David Trejo's Thoughts

Scraping Made Easy with jQuery and SelectorGadget

A few days ago I was doing a TON of scraping, and as you know, without the right tools, scraping can be a REAL pain. Out of my pain comes your pleasure — here’s a list of scraping tools and resources which will make your life MUCH easier the next time you need some information from a crufty old website. If you’re short on time, skip to the end and read the tl;dr.

Scraping with jQuery and Node.js

Scraping with jQuery is a real pleasure. Here’s some example code to get you started; it gets the top three articles, with their point values, from Hacker News. The key part:

  // agent.body holds the HTML of the fetched page
  var window = jsdom(agent.body).createWindow()
    , $ = require('jquery').create(window);

  // scrape!
  var titles = $('.title a')
    , points = $('.subtext span');

  // keep only the first three entries; $.map drops undefined return values
  var printme = $.map(points, function(el, i) {
    if (i < 3) {
      return $(el).text() + '\t' + $(titles[i]).text();
    }
  });

  console.log(printme.join('\n'));

To get the example working, make sure you’ve installed node.js and npm. You may find run.js very helpful for auto-rerunning your scraper whenever you make a change.

Test your code live against the page with jqueryify

Now that you can run any jQuery code you’d like, you can start testing your code against the page. I like to use Firebug or Google Chrome’s built-in console. Not all pages include jQuery, and sometimes the version of jQuery will be old, so it’s best to overwrite it with your own version using the jqueryify bookmarklet. It’s also a good idea to change the version of jQuery used by the bookmarklet to match the one used by jsdom, to avoid any strange bugs.
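A bookmarklet of this kind boils down to appending a script tag to the page. Here’s a rough sketch of the idea. It is not jqueryify’s actual source, and the function name and the version number are assumptions for illustration, so pick whichever version your scraper uses.

```javascript
// Sketch only: loads a chosen jQuery build into whatever document it is given.
// injectJQuery is a made-up name; the version string is an assumption.
function injectJQuery(doc, version) {
  var script = doc.createElement('script');
  script.src = 'https://code.jquery.com/jquery-' + version + '.min.js';
  doc.head.appendChild(script);
  return script;
}

// In a browser console you would call:
//   injectJQuery(document, '3.7.1');
// Once the script loads, window.$ and window.jQuery point at that version.
```

Passing the document in as an argument keeps the function easy to try outside a browser too.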

Find the shortest selector with Selector Gadget

Next up is finding the correct selectors for the information you want to extract. Selector Gadget makes it very easy to find the least complicated selector that still does the job. Of course you can always choose selectors by hand if you don’t fancy Selector Gadget, but it is super helpful for crufty, nasty sites.

Scrape dynamic pages that use javascript to populate information (arg!)

At this point you can easily find selectors and test on the page live, but there is danger ahead if you encounter a site that uses javascript or AJAX to populate the information. Your tests in the browser will work just fine, but when you load the page programmatically, the page’s javascript won’t be run. There are a few ways to get around this:

  1. Use a regex to extract the values from the javascript written to script tags in the page.
  2. Pretend to be the page and make requests to the AJAX urls to get the information you need.
  3. URL hack. There may still be a URL hack you haven’t tried which will give you the information you need. (In this case, request is your friend.)
  4. Run the page’s javascript programmatically, then scrape the information after it’s been slotted into the page. (See jsdom’s documentation.)
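Here’s a minimal sketch of option 1. It assumes the page assigns its data as a JSON object to a variable inside a script tag. The variable name `pageData`, the helper `extractPageData`, and the sample HTML are all made up for illustration.

```javascript
// Hypothetical page source: the data we want lives in an inline script,
// not in the rendered markup.
var html = [
  '<html><body>',
  '<script>var pageData = {"title": "Example", "points": 42};</script>',
  '</body></html>'
].join('\n');

// Capture the object assigned to pageData.
// [\s\S] matches across newlines; the lazy *? stops at the first "};".
function extractPageData(source) {
  var match = /var pageData\s*=\s*(\{[\s\S]*?\});/.exec(source);
  return match ? JSON.parse(match[1]) : null;
}

var data = extractPageData(html);
console.log(data.title + '\t' + data.points);
```

This only works when the embedded value is valid JSON and contains no nested `};`. For anything messier, options 2 through 4 are probably the safer bet.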

Regex: tools to make your life easier

txt2re.com is a regex generator. That’s right, generator, not tester. This means you enter in a string, click the parts you’d like to match, and then copy and paste the regex into your code. For me, this is pure bliss.

regexpal.com helps you quickly test your regular expressions against test data. Protip: only type the regex, omit the / at the front and end which you’d normally use in your javascript.
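To make that protip concrete, here’s a small sketch of how the bare pattern you type into a tester relates to what ends up in your code. The pattern and the sample string are invented for the example.

```javascript
// What you would type into the tester: (\d+) points
// In a javascript string the backslash has to be doubled.
var pattern = '(\\d+) points';

// Two equivalent ways to use that pattern in code:
var fromLiteral = /(\d+) points/;      // slashes mark the start and end
var fromString = new RegExp(pattern);  // no slashes, just the pattern

var sample = '128 points by someone';
console.log(fromLiteral.exec(sample)[1]);
console.log(fromString.exec(sample)[1]);
```

Both lines log the same captured number, because the slashes are just delimiters and not part of the pattern itself.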

tl;dr

  • Use jQuery and node.js for your scraping. Example code
  • Test on the live page with the jqueryify bookmarklet
  • Find the best selectors with Selector Gadget
  • For pages with inline script tags containing information, use txt2re.com to generate regular expressions, and regexpal.com to test them.
  • And finally, don’t forget that URL hacking is your friend!

Thanks for reading — if you have any feedback, please leave a comment below!

PS If things are going wrong, here’s the list of common gotchas:

  • use the same version of jQuery to test as the one you use in your scraper
  • make sure the information you need is not loaded by javascript on the page

Filed under  //   jquery   nodejs  

$1100 clock made with jQuery

The QlockTwo, made in Germany, is a beautiful device.

I set out to steal some of that beauty and made a version of the QlockTwo using jQuery. You can see it live right here.

Building this clock was a ton of fun. It took me a couple of days, and I learned about the relativity of time (ooh sophisticated).

Here are some fancy psychological things that go on when you use the clock:

The first half of the hour creeps along — you don't notice the dots as much as the words, so most of the time it's later than what you see in words. The only way to stay accurate is to pay lots of attention to the dots, which requires more thought and means it takes you longer to read the clock.

The second half of the hour races. The words are always either in sync with or ahead of the real time. This means that you'll finish meetings earlier, and be more punctual because you'll give yourself extra time to get places. Just make sure you always schedule your appointments on the hour.

Technical Details

I select the words to be highlighted in a really jenky way. It is very dependent on the structure of the table (yeah I tried to use divs, but I don't have the skill to keep all the words aligned horizontally and vertically and keep everything square too). Here's roughly how the code works:

  • Javascript asks for the time 
  • Time is transformed into words by a for loop which used to be a GIGANTIC composite switch and if else statement. It was a mess.
  • The text in all the table cells is put together into one long string. There is a one-to-one relationship between the table cells and the letters in the string. This means that if I match the word "OCLOCK" in the string, and the O in OCLOCK is the 50th character in the string, then I know I can just light up the 50th table cell and it will contain the "O." That wasn't the most clear way of saying things. Here's how it goes: I want to highlight OCLOCK, the code finds the position of OCLOCK in the string, then highlights the table cells in that position. 
  • Every 400 milliseconds the clock's face is refreshed. It is important to me that the clock change times in a smooth manner, and it is important that the clock light up in a visible way when you load the page. To achieve a smooth fade-in, there is a CSS transition of .12 seconds. This makes things smooth when the clock changes times. The 400 millisecond interval leaves the page time to load when you first fire it up, so you get to experience the initial glow when the clock lights up. 
  • Now you might be thinking, if the CLOCK REFRESHES EVERY 400 MILLISECONDS THEN WHY ISN'T IT FLASHING LIKE CRAAAAAZYYY?? That's how it was initially, and it was ugly as sin (if you consider sin ugly). Here's the solution: compare the previously lighted letters with the new ones that need to be lighted, unlight the ones not present in the new list, and then light up the new ones. This turns out to be a bit complicated and it turns out Venn diagrams made with water colors are a key tool of the modern-day programmer. 
  • What other cool stuff does the clock have in it? I stole a glow from midtonedesign.com, which adds the halo behind the clock (cheers Jonotan). I also chose a font from the Google Font Directory that mimics the QlockTwo's font quite well. And of course I threw in some text-shadows to make things glow. Mandatory.
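The string-position trick and the light/unlight comparison described above can be sketched roughly like this. It's an illustration of the idea, not the clock's actual code: the tiny `letters` face and the names `cellsFor` and `diffCells` are made up.

```javascript
// Hypothetical clock face: one character per table cell, read row by row.
var letters = 'ITXISZFOURFIVEQOCLOCK';

// Find each word in the string and return the cell indices it covers.
function cellsFor(letters, words) {
  var cells = [];
  words.forEach(function (word) {
    var start = letters.indexOf(word);
    if (start === -1) return; // word is not on this face
    for (var i = 0; i < word.length; i++) cells.push(start + i);
  });
  return cells;
}

// Compare the previously lit cells with the new ones, so that only the
// difference is touched and cells present in both are left alone.
function diffCells(prev, next) {
  return {
    turnOff: prev.filter(function (i) { return next.indexOf(i) === -1; }),
    turnOn: next.filter(function (i) { return prev.indexOf(i) === -1; })
  };
}

var four = cellsFor(letters, ['IT', 'IS', 'FOUR', 'OCLOCK']);
var five = cellsFor(letters, ['IT', 'IS', 'FIVE', 'OCLOCK']);
var change = diffCells(four, five);

console.log(change.turnOff); // the cells spelling FOUR
console.log(change.turnOn);  // the cells spelling FIVE
```

Because IT, IS and OCLOCK appear in both lists, they never get unlit, which is what stops the flashing. One thing to watch in a real face: `indexOf` returns the first match, so a word that appears twice (say, FIVE for minutes and FIVE for hours) needs a starting offset.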

Conclusion

jQuery is tons of fun to work with, and the #jquery channel is full of helpful people. I emailed QlockTwo and apparently my version is a "Nice Application with JQuery and CSS!"

The time-distorting aspects of the clock are quite fun, so make sure to put one up in your office so you can wrap up boring meetings more quickly.

Cheers,
David

 

Filed under  //   css   html   jquery