Web Scraping with CSS Selectors in Node JS using JSDOM or Cheerio

January 22, 2013 in Data Blog, HowTo, NodeJS, Scraping, Tech

I’ve traditionally used python for web scraping but I’d been increasingly thinking about using Node JS given that it is based on a browser JS engine and therefore would appear to be a more natural fit when getting info out of web pages.

In particular, my first steps when scraping information from a website is to open up the Chrome Developer tools (or Firebug in Firefox) and try and extract information by inspecting the page and playing around in the console – the latter is especially attractive if jQuery is available (and if it’s not available there are plenty of ways to inject it).

Here’s an example of inspecting the http://police.uk/data webpage with Chrome Developer tools:

Inspecting webpage with chrome developer tools

The result of this experimentation is usually a few lines of jQuery selectors.

What I want to be able to do next is reuse these css selectors I found with my in-browser experimentation directly in the scraping script. Now, things like pyquery do exist in python (and there is some css selector support in the brilliant BeautifulSoup) but a connection with something like Node seems even more natural – it is after the JS engine from a browser!

UK Crime Data

My immediate motivation for this work was wanting to play around with the UK Crime data (all open data now!). To do this I needed to:

Get the data in consolidated form by scraping the file list and data files from http://police.uk/data/ – while they commendably provide the data in bulk there is no single file to download, instead there is one file per force per month.
Do data cleaning and analysis – this included some fun geo-conversion and csv parsing

I’m just going to talk about the first part in what folllows – though I hope to cover the second part in a follow up post.

I should also note that all the code used for scraping and working with this data can be found in the [UK Crime dataset data package on GitHub] on Github - scrape.js file is here. You can also see some of the ongoing results of these data experiments in an experimental UK crime "dashboard" here.

: https://github.com/datasets/crime-uk

Scraping using CSS Selectors in Node

Two options present themselves when doing simple scraping using css selectors in node.js:

Using jsdom (+ jquery)
Using cheerio (which provides jquery like access to html) + something to retrieve html (my preference is request but you can just uses node's built in http request)

For the UK crime work I used jsdom but I've subsequently used cheerio as it is substantially faster so I'll cover both here (I didn't discover cheerio until I'd started on the crime work!).

Here's an excerpted code example (full example in the scrape.js):

var url = 'http://police.uk/data';
// holder for results
var out = {
  'streets': []
}
jsdom.env({
  html: url,
  scripts: [
    'http://code.jquery.com/jquery.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    // find all the html links to the street zip files
    $('#downloads .months table tr td:nth-child(2) a').each(function(idx, elem) {
      // push the url (href attribute) onto the list
      out['streets'].push( $(elem).attr('href') );
    });
  });
});

As an example of Cheerio scraping here's an example from work scraping info the EU's TED database (sample sample):

var url = http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html;
// place to store results
var data = {};
// do the request using the request library
request(url, function(err, resp, body){
  $ = cheerio.load(body);

  data.winnerDetails = $('.txtmark .addr').html();

  $('.mlioccur .txtmark').each(function(i, html) {
    var spans = $(html).find('span');
    var span0 = $(spans[0]);
    if (span0.text() == 'Initial estimated total value of the contract ') {
      var amount = $(spans[4]).text()
      data.finalamount = cleanAmount(amount);
      data.initialamount = cleanAmount($(spans[1]).text());
    }
  });
});

← Data roundup, January 16

Data roundup, January 23 →