
The Wrong Stuff

More and more these days, you get faced with a problem with angle brackets somewhere in the data. How do you find what you're looking for in HTML or XML data?

At first glance, the question has an obvious answer. If you have an HTML task, you use HTML::Parser or some derived or wrapper class. If you have an XML task, you use XML::Parser or XML::LibXML. But maybe the obvious answer isn’t always the best. Let’s look at a couple of cases.

Parsing XML with HTML::Parser

My friend Doug LaFarge was recently working on an e-commerce website. Part of the task involved computing shipping charges by connecting to a remote web service via HTTP, passing the size and weight of the packages and their destination address, and getting back a response.

Now, I won’t embarrass the service provider by giving their name, but they really did a poor job of designing and documenting their service. First, their “sample Perl code” didn’t work, because it used + for string concatenation. (It was apparently copied from their JavaScript example, except they weren’t paying attention.) Second, the service returned something that was nearly XML, but had extra leading and trailing whitespace, so a true XML parser would always abort. (You had to trim the whitespace before feeding the parser.) And finally, they returned XML, but weren’t using SOAP, which was odd because it looked like a natural SOAP application. So, if you can get around the fact that their example programs didn’t run, the response required massaging before parsing, and it wasn’t SOAP, their service worked fine.
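Both of those fixes are one-liners in Perl. Here is a minimal sketch, assuming the raw reply sits in a variable I'm calling $raw (the sample reply is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical raw reply: well-formed XML wrapped in stray whitespace.
my $raw = "\n\n  <response><amount>3.95</amount></response>  \n";

# Trim leading and trailing whitespace so a strict XML parser will accept it.
(my $xml = $raw) =~ s/^\s+|\s+$//g;

# Perl concatenates strings with ".", not "+".
# "+" would treat both operands as numbers.
my $label = "Reply: " . $xml;
print "$label\n";
```

The copy-then-substitute idiom leaves $raw untouched, which is handy if you want to log exactly what the service sent.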

After we had informed the company that their sample program didn’t work, they asked us if we could suggest some improvements to it. At first, I reached for XML::Parser, and then realized that this would be bad as model code, because in my experience, XML::Parser is a bit finicky to install, requiring expat to be installed as well. And there was still that nasty bit of needing to trim the whitespace.

But I had noticed some time ago that the friendly HTML::Parser module has an “XML mode,” which modifies the parser so that it can deal mostly with XHTML, but also works neatly on generic well-formed XML. And, since the sample code we were developing presumed that LWP was installed, we could also presume that in nearly all cases, we’d have HTML::Parser as well.

I quickly started hacking up code and within a half-hour, was happily fetching the data, recognizing the start/end tags and content. Let’s take a look at some of the code snippets.

First, we need to construct the URL containing the shipping parameters, including credentials for authorization. See Figure One.




Figure One: Calling the shipping quote service


my $API_URL =
  "http://name.of.shipping.company/calculate.cgi";
my $USERNAME = "doug"; my $PASSWORD = "password";

use URI;
my $uri = URI->new($API_URL);
$uri->query_form(
  Username => $USERNAME, Password => $PASSWORD,
  FromAddress => …, FromCity => …,
  FromState => …, FromZip => …,
  ToAddress => …,

  Package1Name => 'big box',
  Package1Weight => 10,
  Package1Width => 20,

  Package2Name => 'small tube',
  Package2Weight => 5,

  Carrier1Name => 'MonkeyFlingers',
  Carrier2Name => 'StarvingEngineers',

  Method1Name => 'Overnight',
  Method2Name => 'SlowBoatToChina',

);

I’m leaving a lot out here. Let’s just say we end up with a URL that’s about 300 to 1,000 characters long. Ugh. A very dumb interface.
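If you want to see what query_form is doing for you, here is a small sketch with just two of the parameters. The host name shipping.example.com is a placeholder of mine, not the real service:

```perl
use strict;
use warnings;
use URI;

# Placeholder endpoint and a tiny subset of the parameters.
my $uri = URI->new("http://shipping.example.com/calculate.cgi");
$uri->query_form(
    Package1Name   => 'big box',
    Package1Weight => 10,
);

# query_form escapes each value, so the space in 'big box' is encoded.
print $uri->as_string, "\n";

# Called without arguments, query_form hands back the decoded pairs.
my %params = $uri->query_form;
print "$params{Package1Name}\n";
```

Letting URI do the escaping means you never hand-build a query string, which matters once addresses with spaces, commas, and ampersands show up.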

Now, we make the request:


use LWP::Simple;
my $response = get $uri;

At this point, $response is either undef (if the fetch failed), or some XML-like string (with the ugly extra whitespace). Again, simplifying things a bit for brevity, the XML-like string looks something like Figure Two.




Figure Two: Shipping quotes returned as XML-like text


<?xml version="1.0"?>
<response>
  <package id="big box">
    <quote id=1>
      <carrier>MonkeyFlingers</carrier>
      <method>Overnight</method>
      <amount>123.95</amount>
    </quote>
    <quote id=2>
      <carrier>MonkeyFlingers</carrier>
      <method>SlowBoatToChina</method>
      <amount>3.95</amount>
    </quote>
    <quote id=3>
      <carrier>StarvingEngineers</carrier>
      <method>Overnight</method>
      <amount>99.50</amount>
    </quote>

  </package>
  <package id="small tube">
    <quote id=1>

    </quote>
  </package>
</response>

Because the response is promised to be well-formed, we know that parsing it will yield nicely matching pairs of start and end tags. We can parse it with HTML::Parser using a program structure like the one in Figure Three.




Figure Three: A template for the HTML parser


use HTML::Parser;

my @state; ## add other variables after here

my $p = HTML::Parser->new(
  xml_mode => 1,
  start_h =>
    [sub {
       my ($tagname, $attr) = @_;
       push @state, $tagname;
       ## We are beginning state "@state"
     }, "tagname, attr"
    ],
  text_h =>
    [sub {
       my ($text) = @_;
       ## We see content within state "@state"
     }, "dtext"
    ],
  end_h =>
    [sub {
       my ($tagname) = @_;
       ## We are ending state "@state"
       pop @state;
     }, "tagname"
    ],
);

$p->parse($response);
$p->eof;

The array @state, when interpolated within double quotes, becomes a space-separated list of element names showing where we are in the XML hierarchy. For example, at the beginning of a particular package, "@state" will be “response package” in the start handler. This is the basic pattern. For our specific application, we’ll need to aggregate the resulting data into our final data structure. See Figure Four.
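To watch the state stack in action before adding any real logic, you can record each path as it is entered. This is only a sketch: the one-line document and the @seen array are my own illustration, not part of the shipping code.

```perl
use strict;
use warnings;
use HTML::Parser;

my @state;   # stack of currently open element names
my @seen;    # every path we enter, in document order

my $p = HTML::Parser->new(
    xml_mode => 1,
    start_h  => [ sub { push @state, $_[0]; push @seen, "@state" }, "tagname" ],
    end_h    => [ sub { pop @state }, "tagname" ],
);

# Tiny made-up document, just to exercise the handlers.
$p->parse('<response><package id="a"><quote id="1">'
        . '<amount>1.00</amount></quote></package></response>');
$p->eof;

print "$_\n" for @seen;
```

Each printed line is one of the strings you would compare against in a handler, such as “response package quote”. When parsing finishes, @state should be empty again, which makes a cheap sanity check for balanced tags.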




Figure Four: A parser for the XML


my @state;
my %quotes; # all quotes, keyed by package name
my $package; # the current package name
my %quote; # the current quote being accumulated for $package

use HTML::Parser;
my $p = HTML::Parser->new(
  xml_mode => 1,
  start_h =>
    [sub {
       my ($tagname, $attr) = @_;
       push @state, $tagname;
       ## We are beginning state "@state"
       if ("@state" eq "response package") { # beginning of package
         $package = $attr->{id}; # pick out the package id
       } elsif ("@state" eq "response package quote") { # beginning of quote
         %quote = (); # empty out the quote info
       }
     }, "tagname, attr"
    ],
  text_h =>
    [sub {
       my ($text) = @_;
       ## We see content within state "@state"
       if ("@state" eq "response package quote carrier") {
         $quote{"carrier"} = $text; # carrier for this quote
       } elsif ("@state" eq "response package quote method") {
         $quote{"method"} = $text; # method for this quote
       } elsif ("@state" eq "response package quote amount") {
         $quote{"amount"} = $text; # amount for this quote
       }
     }, "dtext"
    ],
  end_h =>
    [sub {
       my ($tagname) = @_;
       ## We are ending state "@state"
       if ("@state" eq "response package quote") { # end of a quote
         push @{$quotes{$package}}, { %quote }; # save hash copy
       }
       pop @state;
     }, "tagname"
    ],
);

$p->parse($response);
$p->eof;

Wow. Lots of stuff there. Basically, I looked at each beginning, middle, and end of each state, and attached actions to perform at that step. Beginning states are used to reset accumulator variables or save the attributes of the start tag. Middles are used to extract the text content between elements. Ends merge the accumulators into larger structures. If you keep that pattern in mind, it’s pretty easy to come up with the locations for things. The resulting data structure when dumped with Data::Dumper looks something like Figure Five.




Figure Five: The XML converted to a data structure


$VAR1 = {
  'big box' =>
    [
      {
        'carrier' => 'MonkeyFlingers',
        'amount' => '123.95',
        'method' => 'Overnight'
      },
      {
        'carrier' => 'MonkeyFlingers',
        'amount' => '3.95',
        'method' => 'SlowBoatToChina'
      },
      {
        'carrier' => 'StarvingEngineers',
        'amount' => '99.50',
        'method' => 'Overnight'
      }
    ],
  'small tube' => [
    ...
  ]
};

And then we’d wander through that structure in the rest of the application. The problem is solved by using HTML::Parser to parse XML.
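As one example of that wandering, here is a sketch that picks the cheapest quote for each package. The %cheapest hash is a name I made up, and the sample data is hard-coded from the “big box” quotes in Figure Two so the snippet runs on its own:

```perl
use strict;
use warnings;

# Sample data in the same shape the parser builds.
my %quotes = (
    'big box' => [
        { carrier => 'MonkeyFlingers',    method => 'Overnight',       amount => '123.95' },
        { carrier => 'MonkeyFlingers',    method => 'SlowBoatToChina', amount => '3.95'   },
        { carrier => 'StarvingEngineers', method => 'Overnight',       amount => '99.50'  },
    ],
);

my %cheapest;   # package name => cheapest quote hashref
for my $package (sort keys %quotes) {
    # Sort numerically by amount and keep the first (lowest) quote.
    my ($best) = sort { $a->{amount} <=> $b->{amount} } @{ $quotes{$package} };
    $cheapest{$package} = $best;
    print "$package: $best->{carrier} $best->{method} at $best->{amount}\n";
}
```

Note the numeric comparison with <=>. The amounts arrive as strings, and a string sort would put “123.95” ahead of “3.95”.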

Parsing HTML with XML::LibXML

The XML::LibXML module is a wrapper around the GNOME libxml2 parser, which is perhaps even more finicky to install than expat, but I seem to have managed. And it’s worth it, because of the additional functionality (and I’m told, speed) over the older expat.

First, the XML::LibXML module can parse HTML (including dealing with the optional close tags for the elements) and return a nice node tree, suitable for spitting out as XHTML. For example, parsing and cleaning up the http://www.perl.org web page looks like this:


use LWP::Simple; use XML::LibXML;
my $html = get "http://www.perl.org";
my $doc = XML::LibXML->new->parse_html_string($html);
print $doc->toStringHTML;

The result is clean enough to be valid XHTML, with all the tags nicely balanced. But another nice feature of XML::LibXML is the built-in XPath processor. For web-scraping, this is a very powerful tool. For example, let’s say I want to find the current rank of Learning Perl in O’Reilly’s top-25 book sales page (updated weekly).


use LWP::Simple; use XML::LibXML;
my $html = get "http://www.oreilly.com/catalog/top25.html";
my $doc = XML::LibXML->new->parse_html_string($html);

I now have a DOM object of the page. I’m interested in the table in the middle of the page that has the book rankings. In the table, the td cell containing “Learning Perl” is in the same row as the cell containing the ranking. With a simple bit of XPath magic, I can first locate the cell containing the title, then from there go to the closest enclosing row and pick out the first table cell’s content, and then get the string value of that node.


//text()[contains(., "Learning Perl")]/ancestor::tr[1]/td[1]/text()

The nice thing about this XPath is that it’s relatively immune to layout changes or added information or reformatting. Back to our DOM, this would be simply:


use LWP::Simple; use XML::LibXML;
my $html = get "http://www.oreilly.com/catalog/top25.html";
my $doc = XML::LibXML->new->parse_html_string($html);
my $location =
  '//text()[contains(., "Learning Perl")]' .
  '/ancestor::tr[1]/td[1]/text()';
print $doc->findvalue($location);
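Since web pages change, it helps to try the XPath against a fixed string first. The sketch below uses a made-up table in the layout I'm assuming (rank in the first cell, title in a later one), with the close tags for td and tr deliberately left out. The recover option is my addition, asking the parser to be lenient about sloppy markup:

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up HTML, not the real page: rank in the first cell of each row.
my $html = <<'HTML';
<html><body><table>
<tr><td>1<td><a href="/a">Some Other Book</a>
<tr><td>2<td><b>Learning Perl</b>, 3rd Edition
</table></body></html>
HTML

# recover => 2 asks libxml2 to repair bad markup without complaining.
my $doc = XML::LibXML->new( recover => 2 )->parse_html_string($html);

my $location =
    '//text()[contains(., "Learning Perl")]'
  . '/ancestor::tr[1]/td[1]/text()';

my $rank = $doc->findvalue($location);
print "rank: $rank\n";
```

Notice that the title sits inside a b element here. The XPath still finds it, because it looks for a matching text node anywhere and then climbs to the nearest enclosing row, rather than spelling out the exact path down from the table.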

I got the data I needed, relatively easily. And that’s why you should consider parsing HTML using an XML parser, especially if you’re webscraping. Sometimes, using the wrong tool for the right reasons can be useful. Until next time, enjoy!



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com.
