Even though the Web is roughly a decade old and there are now many options for developing Web applications, Perl is still regarded by many as “the darling language of Web programming.” Perl’s text-wrangling abilities still exceed those of any other popular open source language, and a wealth of Perl modules (from the core distribution and the CPAN) makes Web applications a snap to construct and maintain.
A frequent task for Perl is Web scraping, or getting data from a browser-facing Web site. While Web services are slowly gaining a foothold, scraping tools will always be necessary to glean information that isn’t yet (or never will be) offered through some SOAP-like interface.
One emerging Web scraping tool is WWW::Mechanize by Andy Lester. (WWW::Mechanize builds on WWW::Automate, an earlier module created by Kirrily ‘Skud’ Robert.) With WWW::Mechanize, you get a “virtual browser” that can load pages, fill out form elements by name, “click” on “Submit” buttons or image maps, follow links by name or position, and even press “Back” when needed. Although Lester developed the module primarily to automate Web site testing, the features required to test Web sites are precisely what’s needed to scrape sites, too.
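To make the “virtual browser” idea concrete, here is a minimal sketch of those steps: load a page, fill in a form field by name, submit, follow a link by its text, and press “Back.” It is only an illustration, not code from any real scraper. The helper name `search_and_follow`, the form field `q`, the query string, and the link pattern are all assumptions, and the start URL is whatever you pass on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Drive a "virtual browser" through the steps described above.
# $mech is expected to be a WWW::Mechanize object. The URL, field name,
# and link pattern are placeholders, not taken from any real site.
sub search_and_follow {
    my ($mech, %arg) = @_;

    $mech->get($arg{url});                  # load a page

    $mech->submit_form(                     # fill in a field by name, then "click" Submit
        form_number => 1,                   # assumes the first form on the page
        fields      => { $arg{field} => $arg{query} },
    );

    $mech->follow_link(text_regex => $arg{link});   # follow a link by its text
    my $title = $mech->title;               # look at the page we landed on

    $mech->back;                            # press "Back" to return to the results
    return $title;
}

# Run only when a start URL is given on the command line.
if (@ARGV) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on HTTP errors

    my $title = search_and_follow(
        $mech,
        url   => $ARGV[0],
        field => 'q',          # hypothetical form field name
        query => 'perl',       # hypothetical search term
        link  => qr/photo/i,   # hypothetical link text
    );
    print "Visited: ", (defined $title ? $title : '(untitled)'), "\n";
}
```

Passing the browser object into the helper keeps the navigation steps separate from the setup, so the same routine can be pointed at a different site or exercised with a stand-in object.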
To try this interesting tool, I picked a problem that I faced just the other day. I frequently pop over to the Yahoo! news pages to search the news photos, looking for photos with particular keywords. As I search, I pick out the pictures that interest me and then…