MG


scraping with node.js

June 21, 2013

Last year, I wrote a post about web scraping. The app I was working on was using PHP, so the HTML parsing code required a library that is probably not too well known. Since then I’ve started using node.js more, which makes scraping much easier. I thought I’d do another example with node, where it’s possible to use the more common jQuery syntax for HTML parsing.

For this example, I’ll be scraping song lyrics from LyricWiki. By exploring the site, I found that the URL pattern for a given track is http://lyrics.wikia.com/[artist_name]:[track_name]. So for this example, I’ll pass the artist and track as arguments to my script.

var artist = process.argv[2] && process.argv[2].replace(/\s+/g, "_");
var track = process.argv[3] && process.argv[3].replace(/\s+/g, "_");
if(!artist || !track) {
	console.log("Usage: node lyrics_scrape.js [artist] [track]");
	process.exit(1);
}
var url = "http://lyrics.wikia.com/" + artist + ":" + track;

Next I include a couple of node libraries for fetching the html, and parsing HTML with jquery syntax.

var request = require("request"); // npm install request
var cheerio = require("cheerio"); // npm install cheerio

There are a couple options for the parser library. jsdom will give you a full W3C DOM environment to work with, which has quite a bit of overhead. You can then load the actual jQuery library, and start using it. Cheerio uses a simpler DOM, and implements a subset of the jQuery API without all the extra browser dependent code, resulting in much better performance. This is a more appropriate solution for scraping.

After we load the libraries, we request the URL, pass the HTML to cheerio, and start parsing out the information we want. I target the div with class “lyricbox”, remove a couple of nodes that I don’t want, replace <br>’s with newlines, and grab the text. For my application, I wanted the lyrics as an array of lines. Here’s the rest of the script.

request(url, function(err, response, html) {
	if(err) return console.error(err);
	var $ = cheerio.load(html);
	$("div.lyricbox > .rtMatcher, div.lyricbox > .lyricsbreak").remove();
	$("div.lyricbox > br").replaceWith("\n");
	var lyrics = $("div.lyricbox").text();
	console.log(lyrics.split("\n"));
	process.exit(0);
});

Then we run it on the command line:

> node lyrics_scrape.js weezer "only in dreams"

[ 'You can\'t resist her',
  'She\'s in your bones',
  'She is your marrow',
  'And your ride home',
  '',
  'You can\'t avoid her',
  'She\'s in the air',
  'In between molecules of',
  'Oxygen and carbon dioxide',
  '',
  'Only in dreams',
  'We see what it means',
  'Reach out our hands',
  'Hold on to hers',
  'But when we wake',
  'It\'s all been erased',
  'And so it seems',
  'Only in dreams',
  '',
  'You walk up to her',
  'Ask her to dance',
  'She says, "Hey, baby',
  'I just might take the chance"',
  '',
  'You say it\'s a good thing',
  'That you float in the air',
  'That way there\'s no way',
  'I will crush your pretty toenails into a thousand pieces',
  '',
  'Only in dreams',
  'We see what it means',
  'Reach out our hands',
  'Hold on to hers',
  'But when we wake',
  'It\'s all been erased',
  'And so it seems',
  'Only in dreams',
  '',
  'Only in dreams',
  'Only in dreams',
  'Only in dreams',
  'Only in dreams',
  'Only in dreams',
  'Only in dreams',
  '',
  '',
  '' ]

And that’s it, much simpler than the PHP library, and using the same language and syntax as the browser. My workflow now is to start in the browser, use the jQuerify bookmarklet to add jQuery to any web page. Then you can experiment in the browser console and once you have your selectors figured out, transfer them over to the node server.