Of Spiders and Scrapers: Decomposing Web Pages 101

Not all sites proffer slick RESTful interfaces and XML feeds. In those cases, collecting data requires some good, old-fashioned scraping. This week, let's look at some of the tools and techniques required to scrape a site.

With so many different platforms connecting to the Internet these days, the traditional HTML Web page is just one of many outlets of information. RSS syndicates content to aggregators and specialized readers; messaging services such as Twitter and Facebook keep audiences engaged with frequent, even real-time alerts; and programmatic interfaces, or APIs, provide automated access and further blur the distinction between client and server. If you’re authoring a specialized client or a “mashup” application for a new site, there’s likely no shortage of methods to collect and repurpose content.

Of course, not all sites proffer slick RESTful interfaces and XML feeds. Indeed, most don’t. In those cases, collecting data requires some good, old-fashioned scraping: identify the pages you want, download the content, and sift through the text or HTML of each page to extract the pertinent data. Depending on the complexity of the source, scraping can be simple or extremely difficult; nonetheless, the tools required are largely the same from task to task.

This week, let’s look at some of the tools and techniques required to scrape a site and explore some of the strategies. The examples shown here are written in Ruby and are based on libraries available as Ruby Gems; however, nearly every scripting language offers analogs. To find scraping code for your favorite language, type “language web scraping” or “language html parser” (where language is Perl or Ruby or Python or whatever) into a search engine and scan the results.

The General Approach

Over the years, I’ve scraped dozens of Web sites. In fact, in 2005, I scraped the entirety of the then-current Linux Magazine site to port its content to a new content management system. I’ve also scraped sites to aggregate and analyze sales data and to watch for breaking events. For example, I recently wrote a small piece of code to monitor a site and alert my daughter when the collectible toy she wanted went on sale.

In general, and as alluded to above, scraping boils down to a few fundamental tasks.

  1. Identify the content you are interested in.
  2. Identify the Web pages that contain the content. Determine if and how the pages are interconnected.
  3. Analyze the structure of a representative Web page. Determine if the data is accessible.
  4. Find or write the tools to collect the pages and extract the data. Identify special cases and exceptions and modify the solution to suit.

For example, an online catalog is typically simple to scrape. A catalog usually has no more than four or five unique page styles (excluding the rest of the site, such as the editorial, order processing, and customer service pages); the individual item pages are uniform; and an index or collection of categories serves as navigation, linking to subcategories and ultimately to each item page itself.

At the opposite extreme, a site that has grown organically over time and is replete with user-generated content is difficult. If pages and content are irregular in structure, no pattern can serve globally or even widely to decode the HTML. Variation is anathema if you want to excise specific data from a large set of pages. (On the other hand, if you simply want to index all the content on each page for your search engine, lack of predictability is moot.)

Assuming the material is largely accessible through page links, the next issue is page construction. The rub? Any parts of a page generated by Ajax after the initial load are effectively invisible to crawlers and scrapers. After all, unless your scraper embeds a JavaScript interpreter and can execute the Ajax calls, it cannot hope to reproduce what the consumer sees in the browser, so that data is effectively out of reach. This isn’t the norm for data-driven sites, but it isn’t rare either.

In some cases, you cannot circumvent the Ajax. In other cases, though, it actually makes scraping simpler. If you can deduce and extract the Ajax call (it’s just a URL and either a GET or a POST) and no complex authentication is required, you can collect those URLs and call each one directly. Presto! Instant data feed!
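
For instance, here is a minimal Ruby sketch of that shortcut. It assumes you have spotted (say, in the browser’s network monitor) that a page populates itself from a JSON endpoint; the URL, parameters, and field names below are hypothetical placeholders, not taken from any real site.

require 'rubygems'
require 'open-uri'
require 'json'

# Hypothetical Ajax endpoint deduced by watching the browser's requests;
# the host, path, and query parameters are placeholders.
url = 'http://www.example.com/catalog/items.json?category=42&page=1'

raw  = open(url).read     # a plain GET -- no JavaScript interpreter needed
data = JSON.parse(raw)

# The 'items', 'name', and 'price' keys are likewise assumptions about
# what the endpoint returns.
data['items'].each do |item|
  puts "#{item['name']}: #{item['price']}"
end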

The more common case is data generated and embedded in the page sent from the server. Look for patterns and distinct, “landmark” tags suitable as references. For instance, if the catalog page renders the product’s description, price, and image in a table, is the table’s ID the same on each page, or does it follow a predictable naming scheme? If you can find a proximate, stable element, your parsing or XPath expression can be that much simpler.

Tools for scraping are readily available. curl and wget can download a list of URLs, and wget can also start at a root URL and follow links to a given depth to capture, download, and save some or all of a site. Programming libraries such as hpricot for Ruby make light work of parsing HTML and finding nodes.

The real work is deciding which nodes to extract and writing the proper XPath (or CSS selector) expression.
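
To make that concrete, here is a short Hpricot sketch in the spirit of the catalog example above. The URL, the table ID, and the cell classes are hypothetical stand-ins for whatever landmarks your own target pages provide.

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Hypothetical product page; the markup it expects (a table with
# id="product" and cells classed "price" and "description") is a placeholder.
doc = open('http://www.example.com/catalog/item/1234') { |f| Hpricot(f) }

price       = (doc/"table#product td.price").inner_text.strip
description = (doc/"table#product td.description").inner_text.strip

puts "#{description}: #{price}"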

Spider the Web

Let’s look at a small (and somewhat contrived) scraping application. This application spiders the Linux Magazine Web site to find the titles of all my blog posts. The approach mirrors the steps described above.

The site provides an RSS feed of my blog posts at http://www.linux-mag.com/blogs/mstreicher/feeds. I’ll use that as my starting point to find the URLs of all my articles. Using wget, I download that file and extract the URLs from the text.

$ wget --quiet http://www.linux-mag.com/blogs/mstreicher/feed -O - | \
  grep '<guid' | perl -pe 's|\s*</?guid[^>]*>||g' | \
  wget --quiet -i -

All that mumbo jumbo downloads my articles. The first instance of wget pulls the feed file down and writes it to standard output (the -O -). The grep command and the Perl one-liner pull the URLs out of the text to form a list of addresses to retrieve. The list is then fed to another instance of wget (the -i - reads from standard input) to pull the pages down. (If you want to see what happens along the way, omit the --quiet options.)
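
If the shell pipeline feels too brittle, the same step can be approximated in a few lines of Ruby with open-uri and hpricot. This is only a sketch; it assumes each article’s URL appears in a guid element, as it does in typical WordPress feeds.

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Fetch the feed and parse it as XML; assumes each article's URL
# lives in a <guid> element, as in typical WordPress feeds.
feed = open('http://www.linux-mag.com/blogs/mstreicher/feed') { |f| Hpricot.XML(f) }

(feed/'guid').each do |guid|
  url = guid.inner_text.strip
  # Save each page locally, naming the file after the URL's last path segment
  File.open(File.basename(url) + '.html', 'w') { |out| out.write(open(url).read) }
end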

The next step is to extract the title from the HTML pages that are now resident on the local drive. To do that, I use Ruby and hpricot to parse the HTML and pull out the content of the title element found in the head. Here is the Ruby code to do that.

require 'rubygems'
require 'hpricot'
# Flush output as soon as each title is printed
STDOUT.sync = true

ARGV.each do |argv|
  # Parse each downloaded HTML file and grab the <title> inside <head>
  doc = open(argv) { |f| Hpricot(f) }
  # Drop everything after the first '|' (trailing site cruft), then print
  puts (doc/"head > title").inner_html.gsub( /\|.*$/m, '' ).strip
end

The code imports the hpricot gem, forces standard output to flush immediately, and then iterates over each argument, opening each file and extracting the HTML within the title element. The latter part of the line—.gsub( /\|.*$/m, '').strip—strips off some trailing cruft. If you run this over the files, you should see the titles.

$ ruby lm.rb *html*
Don't Repeat Yourself. Use Rails Templates.
Extend Your Scripting Language with SWIG
Rip: A New Way to Package Ruby Software
Google Web Elements: Essential as Fire, Water, Earth, Air
Balsamiq Mockups: Pencil and Paper 2.0
Linux Magazine
Hands-On with Adobe Browserlab
Sifting Through Billions and Billions of Bytes
Building Small Sites with Webby
Sunspot: A Solr-Powered Search Engine for Ruby
The Ugly Truth About the Web

One command-line incantation and eight lines of Ruby code later, you’re done.

More Challenging Tasks

There is a lot more to scraping than this. The biggest complication is navigation and interaction. For example, what do I do if I have to submit a form to get to the target page? What if I have to log in to get to the form? Both are very common scenarios.

Luckily, there is software to automate Web page navigation, too. Typically called “Mechanize” or some variation, the software behaves like a browser: downloading pages, interpreting the document, robotically filling in fields, and submitting the form data. Mechanize must be scripted to click here, enter text there, and so on. Once you arrive at the target page, you can use the same techniques shown here to grab the page and extract the data.
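
For example, here is a minimal sketch using Ruby’s mechanize gem (older releases call the class WWW::Mechanize). The URLs and the form field names (“username” and “password”) are hypothetical; substitute whatever the target site’s login form actually uses.

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

# Hypothetical login page; the URL and field names are placeholders.
login_page = agent.get('http://www.example.com/login')
form = login_page.forms.first
form['username'] = 'me@example.com'
form['password'] = 'secret'
agent.submit(form)

# Once logged in, fetch the protected page and scrape it as before.
target = agent.get('http://www.example.com/members/report')
puts target.title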

It may take a little snooping and sleuthing, but if you can find a common structure or discernible pattern, the software can do the heavy lifting.

Happy tinkering.
