Screen Scraping for Fun and Profit

Even though the Web is roughly a decade old and there are now many options for developing Web applications, Perl is still regarded by many as "the darling language of Web programming." Perl's text-wrangling abilities still exceed those of any other popular open source language, and a wealth of Perl modules (from the core distribution and the CPAN) makes Web applications a snap to construct and maintain.


A frequent task for Perl is Web scraping, or getting data from a browser-facing Web site. While Web services are slowly gaining a foothold, scraping tools will always be necessary to glean information that isn’t yet (or never will be) offered through some SOAP-like interface.

One emerging Web scraping tool is WWW::Mechanize by Andy Lester. (WWW::Mechanize builds on WWW::Automate, an earlier work created by Kirrily ‘Skud’ Robert.) With WWW::Mechanize, you get a “virtual browser” that can load pages, fill out form elements by name, “click” on “Submit” buttons or image maps, follow links by name or position, and even press “Back” when needed. Although Lester has developed the module primarily to automate Web site testing, the features required to test Web sites are precisely what’s needed to scrape sites, too.

To try this interesting tool, I picked a problem that I faced just the other day. I frequently pop over to the Yahoo! news pages to search the news photos, looking for photos with particular keywords. As I search, I pick out the pictures that interest me and then go through a series of more-or-less routine keystrokes and clicks to save those images to my hard drive for later access. (Of course, I respect the copyright of the images.) For example, I often search on the keyword “Oregon” to find all of the images related to my home state (usually pictures of our sports teams).

So, having been introduced to WWW::Mechanize, I thought this would be a perfect opportunity to reduce the amount of time I spend in a day doing repetitive tasks. And that’s what programming is really all about: when we’re presented with an overwhelming array of non-repetitive tasks, leave the overwhelming array of repetitive tasks to the CPU.

The strategy for automation is straightforward. I let the “virtual browser” visit the Yahoo “Advanced Search” page and let it complete the search form, entering the keywords and selecting photos rather than news stories. To minimize round trips, the virtual browser also asks for one hundred images per response. Then, the virtual browser clicks the submit button, and we’re off.

For each response page, I look for any links that are an image and link to the news detail page. For each of the URLs for those links, I follow the link, then locate the full-sized image URL in the response content.

Two optimizations make this easy on both bandwidth and the Yahoo! news engine. First, every news link is noted in a DBM cache and not followed again for another thirty days. Second, the image is downloaded using “mirror” logic, which means that if the image already exists and is current enough, no data is actually transferred. Using these optimizations, a “no new images” run comes back in a second or two, fast enough to run hourly from a cron job. And there’s an additional benefit to using mirror logic: because the timestamp of each image is altered to match the source data, I can quickly see what images have been added recently according to the source, regardless of when I actually downloaded the data.

So, let’s take a look at the program, presented in Listing One. Lines 1 through 3 start nearly every program I write, enabling warnings, turning on compiler restrictions regarding variable declaration, barewords, and references, and also disabling the buffering of STDOUT.

Listing One: The WWW::Mechanize Web scraper

1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
5 ## user configurable parts
7 my $BASEDIR = "/home/merlyn/Yahoo-news-images";
9 my $SEARCHES = <<'END';
10 oregon oregon
11 camel camel
12 shania shania twain
13 END
15 ## tinker parts
17 my $INDEX = ".index";
19 ## no serviceable parts below
21 use WWW::Mechanize 0.33;
22 use File::Basename;
24 my $m = WWW::Mechanize->new;
25 $m->quiet(1); # I'll handle my own errors, thank you
27 for (grep !/^\#/, split /\n/, $SEARCHES) {
28   my ($subdir, @keywords) = split;
30   print "--- updating $subdir from a search for @keywords ---\n";
32   $subdir = "$BASEDIR/$subdir" unless $subdir =~ m{^/};
33   -d $subdir or mkdir $subdir, 0755 or die "Cannot mkdir $subdir: $!";
35   dbmopen(my %seen, "$subdir/$INDEX", 0644) or die "can't index: $!";
37   ## clean any expired %seen tags
38   {
39     my $now = time;
40     for (keys %seen) {
41       delete $seen{$_} if $seen{$_} < $now;
42     }
43   }
45   $m->get("http://search.news.yahoo.com/search/news/options?p=");
47   $m->field("c", "news_photos");
48   $m->field("p", "@keywords");
49   $m->field("n", 100);
50   $m->click();
52   {
53     print "looking at ", $m->uri, "\n";
54     my @links = @{$m->extract_links};
56     my @image_links = grep {
57       $links[$_][0] =~ m{^http://story\.news\.yahoo\.com/} and
58         $links[$_][1] eq "[IMG]";
59     } 0..$#links;
61     for my $image_link (@image_links) {
62       my $seen_key = "$links[$image_link][0]";
63       if ($seen{$seen_key}) {
64         print " saw $seen_key\n";
65         next;
66       }
68       $m->follow($image_link);
70       print " looking at ", $m->uri, "\n";
71       if (my ($image_url) = $m->res->content =~ m{<img src=(http:\S+) align=middle}) {
72         print " mirroring $image_url... ";
73         my $response = $m->mirror($image_url, "$subdir/".basename($image_url));
74         print $response->message, "\n";
75         $seen{"$seen_key"} = time + 30 * 86400; # ignore for 30 days
76       }
78       $m->back;
79     }
81     redo if $m->follow(qr{next \d});
82   }
84 }

Lines 7 through 13 hold the configuration parameters for this program. Since the program is typically invoked from a cron job, command-line parameters just won’t do. $BASEDIR gives the top directory in which all the images will be saved. $SEARCHES defines the various searches. Each line consists of a directory name, and then one or more keywords. For example, the line beginning with shania defines a subdirectory called shania, and then selects the keywords shania and twain for the search.
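The configuration format can be tried in isolation. The following is a minimal sketch that reuses the parsing idiom from lines 27 and 28 of the listing; the %keywords_for hash and the commented-out sample line are assumptions added for illustration, not part of the original program.

```perl
use strict;
use warnings;

# Sample configuration in the same format as $SEARCHES: the first word
# is the subdirectory, the remaining words are the search keywords.
# The commented-out line is an illustrative addition.
my $SEARCHES = <<'END';
oregon oregon
# camel camel
shania shania twain
END

# %keywords_for is a hypothetical name used only in this sketch.
my %keywords_for;
for (grep !/^\#/, split /\n/, $SEARCHES) {
    my ($subdir, @keywords) = split;    # splits $_ on whitespace
    $keywords_for{$subdir} = [@keywords];
}

for my $subdir (sort keys %keywords_for) {
    print "$subdir: @{ $keywords_for{$subdir} }\n";
}
```

Lines beginning with a hash never reach the split, so a search can be disabled temporarily without deleting it.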

Line 17 defines a constant that names the DBM database (within the given directory) that persists whether or not we’ve already drilled down into a specific subpage. (The DBM file name begins with a dot so that we won’t see it in a normal ls command.)

Lines 21 and 22 pull in the WWW::Mechanize and File::Basename modules, the latter of which is a core module. You can install WWW::Mechanize from the CPAN if you don’t already have it.

Line 24 creates the WWW::Mechanize object, our “virtual browser.” This object class inherits from LWP::UserAgent, so I have full control over such things as proxies, user-agent names, and cookies.

Line 27 begins the outer loop. For each line in the $SEARCHES configuration string that doesn’t begin with a hash (#) (commented out), we extract the subdirectory name and the keywords in line 28. Line 30 traces progress for the impatient invoker.

Lines 32 and 33 establish the directory to receive the images, creating it if necessary. Line 35 opens the DBM database within this directory as the tied hash %seen.

Lines 38 to 43 remove any stale entries in the %seen hash. While this doesn’t affect the outcome of the algorithm, letting the stale entries accumulate causes the unbounded growth of the DBM files. Each key is a subpage URL I obtained from a search, while each value is an expiration time expressed as Unix internal time (seconds since the epoch). If the value associated with each key is older than the current time, I nuke the entry.
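The expiry logic is easy to check without a DBM file. In this sketch a plain hash stands in for the tied %seen, and the two URLs are made up for illustration.

```perl
use strict;
use warnings;

# A plain hash stands in for the DBM-tied %seen; the keys are invented URLs
# and the values are expiration times in seconds since the epoch.
my $now  = time;
my %seen = (
    "http://example.com/story/1" => $now - 60,            # expired a minute ago
    "http://example.com/story/2" => $now + 30 * 86400,    # valid for 30 more days
);

# Same purge as lines 38 to 43 of the listing.
for (keys %seen) {
    delete $seen{$_} if $seen{$_} < $now;
}

print "still cached: $_\n" for sort keys %seen;
```

Deleting inside the loop is safe here because keys returns the complete list of keys before the first deletion happens.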

Line 45 instructs my “virtual browser” to go fetch the given URL. I got the URL from pressing “Advanced Search” on Yahoo’s opening page. If I wanted to be clever, I could have simply gone to “news.yahoo.com” and followed the “Advanced Search” link, which might have been safer in the long run (in case this particular URL ever changes).

Lines 47 to 49 “fill in” various parts of the first form present on the page. The first form is the form I want, even though there’s another form later on in the page. Had I wanted something other than the first form, I could have requested that before starting to fill in the fields.

The names c and p and n came from staring at a “view source” of the page in question. This is where screen scraping takes a bit of talent: I need to figure out exactly what gets set when a user fills in the various form elements, including the names as given in the form description, not necessarily as presented to the user. Field c is “what kind of search”, and p is the keyword blank. Field n is “how many responses per page”.

Once I’ve updated the form elements to specific values, I “click” on the submit button in line 50. This causes the WWW::Mechanize object to encode the form and submit a GET or POST request as needed, noting any response from the Web server.

Lines 52 to 82 process a single response page, advancing to the next page and repeating as necessary. A trace message is printed in line 53, again for the impatient observer.

Line 54 extracts the links of the page. The WWW::Mechanize object scans the response automatically, looking for A and FRAME elements. In lines 56 to 59, I look at this array to see if there are any links to stories. Each link is checked to see if the URL begins with story.news.yahoo.com and is merely an image. If so, I save the array indices for those links of interest.
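The index-based filtering in lines 56 to 59 can be sketched on its own. The @links data below is invented, and it assumes the same shape the listing relies on: each entry is an array reference holding the URL followed by the link text.

```perl
use strict;
use warnings;

# Hypothetical link list in the [URL, text] shape used by the listing.
my @links = (
    [ "http://story.news.yahoo.com/a", "[IMG]" ],
    [ "http://story.news.yahoo.com/b", "Some headline" ],
    [ "http://www.example.com/c",      "[IMG]" ],
    [ "http://story.news.yahoo.com/d", "[IMG]" ],
);

# Keep the indices rather than the links themselves, so that each one
# can later be followed by position.
my @image_links = grep {
    $links[$_][0] =~ m{^http://story\.news\.yahoo\.com/}
        and $links[$_][1] eq "[IMG]";
} 0 .. $#links;

print "index $_ -> $links[$_][0]\n" for @image_links;
```

Only entries 0 and 3 satisfy both conditions: entry 1 is a text link and entry 2 points at a different host.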

The resulting image links are processed in the loop in lines 61 to 79. For each image link (a small integer, indexing into the @links array), I extract the subpage URL that follows in line 62. Lines 63 to 66 skip over the subpage URLs that I’ve already seen, noting them as such. This is an important step, because if I’ve already visited the subpage, I’ve already extracted the image from that page and there’s no new information to obtain.

Line 68 has the “virtual browser” follow the link indicated by the numeric index. Because I’ve pulled the links from the same place that the virtual browser is looking at, I know that the numbers are synchronized. The subpage is then visited and parsed, as reported in line 70.

Lines 71 through 76 look for the image URL, using a convention that I “reverse engineered” by staring at the web page HTML (obtained by calling $m->res->content). If an image element appeared with an unquoted src attribute holding an http: URL, immediately followed by align=middle (the shape matched by the regular expression in line 71), then it was definitely the large image for the news story. It’s important to distinguish between the image of interest and any other incidental images on the page, since there are almost always other images leading to other stories on the page as well. This will break if Yahoo! changes the page layout, but that’s the price of screen scraping.
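The regular expression from line 71 can be exercised against a small fragment. The HTML below is made up for this sketch and is not taken from a real Yahoo! page; only the second image has the shape the pattern expects.

```perl
use strict;
use warnings;

# Invented page fragment: a thumbnail that should be ignored, followed by
# an image in the unquoted "src=... align=middle" form.
my $content = <<'HTML';
<a href=/other><img src=http://img.example.com/thumb.jpg width=60></a>
<img src=http://img.example.com/full/photo123.jpg align=middle border=0>
HTML

# In list context the match returns the captured URL, or an empty list
# (leaving $image_url undefined) when nothing matches.
my ($image_url) = $content =~ m{<img src=(http:\S+) align=middle};

print defined $image_url ? "found $image_url\n" : "no match\n";
```

Because \S+ cannot cross whitespace, the thumbnail is rejected: its URL is followed by width=60 rather than align=middle.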

If I find an image URL in line 71, I announce it in line 72. Line 73 uses the LWP::UserAgent mirror method to “mirror” the URL to a local filename. The local filename is the “basename” of the URL preceded by the directory path. This use of basename was a quick-and-dirty shortcut, valid within the Unix world. A more portable method would have been to create a URI object from the URL and then extract the final path segment, as in:

my $basename = (URI->new($image_url)->path_segments)[-1];

but this seemed like overkill for me.
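For reference, here is a small sketch of what the File::Basename shortcut produces for a URL. It assumes a Unix-like system, and both URLs are invented. The second one shows a limitation of the shortcut: basename knows nothing about URL syntax, so a query string would end up in the local filename.

```perl
use strict;
use warnings;
use File::Basename;

# Invented URLs, for illustration only.
my $plain      = "http://img.example.com/full/photo123.jpg";
my $with_query = "http://img.example.com/full/photo123.jpg?size=large";

# basename() simply returns whatever follows the last slash.
my $plain_name = basename($plain);
my $query_name = basename($with_query);

print "$plain_name\n";
print "$query_name\n";
```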

The result of the mirror ends up as an HTTP::Response object in $response in line 73. Line 74 shows the response text as part of the tracing messages. Line 75 puts a “do not visit this URL for 30 days” flag into the DBM database. The 30 days figure comes from knowing that Yahoo! keeps only 30 days of historical stories and images. If that figure ever increases, I’ll bump this value up as well.

Line 78 pushes the virtual browser’s “Back” button, taking me back to the previous page. A WWW::Mechanize object remembers all pages from an initial GET as a stack, so this takes me back to the query result page.

Line 81 follows any link that has text matching next \d as a regular expression. If there are more pictures, the search result page contains exactly such a link. If such a link is found, the method returns a true value, and I loop around again to line 52. Otherwise, I drop out of the block, and loop again to the next search keywords.
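The “bare block plus redo” idiom in lines 52 to 82 may be unfamiliar, so here is a minimal sketch of it with no network access. The @pages data and the follow_next() helper are hypothetical stand-ins for the result pages and for $m->follow(qr{next \d}).

```perl
use strict;
use warnings;

# Hypothetical stand-in for result pages: each page lists some items and
# records whether a "next" link exists.
my @pages = (
    { items => [qw(a b)], has_next => 1 },
    { items => [qw(c)],   has_next => 1 },
    { items => [qw(d e)], has_next => 0 },
);

my $current = 0;
my @collected;

# A bare block is a loop that runs once; redo restarts it from the top.
{
    my $page = $pages[$current];
    push @collected, @{ $page->{items} };

    # follow_next() plays the role of $m->follow(qr{next \d}):
    # true if it advanced to another page, false if there was none.
    redo if follow_next();
}

sub follow_next {
    return 0 unless $pages[$current]{has_next};
    $current++;
    return 1;
}

print "collected: @collected\n";
```

The block keeps restarting while a next page exists and falls through once the last page has been processed.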

Now that I’ve set this up, the only maintenance required is making up my mind about the keywords to search or perhaps some slight changes to the regular expressions if Yahoo! changes their page layout.

Have fun scraping the Web! Until next time, enjoy.

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. The code shown in this column can be downloaded from http://www.linux-mag.com/downloads/2003-04/perl.

“Perl 5 was my rewrite of Perl. I want Perl 6 to be the community’s rewrite of Perl… and of the community.”

– Larry Wall, State of the Onion speech, OSCon 2000.

Many Perl aficionados ask, “If Perl ain’t broke, why fix it?” First of all, Perl 5 ain’t broke. Those of us working on the design of Perl 6 are doing so precisely because we like Perl 5 so much. In fact, we use Perl 5 every day, for everything from filtering our email, to maintaining our servers, to formatting this very paragraph.

Indeed, it’s because we like Perl 5 so much that we want Perl 6 to be even better. Perl 5’s goal was to make “easy things easy, and hard things possible”. We want Perl 6 to make easy things trivial, hard things easy, and impossible things merely hard.

Moreover, our love of Perl doesn’t blind us to its flaws. Those $, @, and % prefixes on variables are confusing. Some of Perl’s other syntax is unnecessarily cluttered, it lacks some basic language features like named subroutine parameters, strong typing, and a simple case statement, and its object-oriented model isn’t strong enough for most production environments. And the list goes on.

So the Perl 6 design process is about keeping what works in Perl 5, fixing what doesn’t, and adding what’s missing. That means there will be a few fundamental changes to the language, a large number of extensions to existing features, and a handful of completely new ideas.

This article showcases some of the ways that these modifications, enhancements, and innovations will work together to make the future Perl even more insanely great — without, we hope, making it even more greatly insane.

Sigils Simplified

Let’s start with those mysterious $’s, @’s, %’s, &’s and *’s that can be such a source of grief for newcomers to Perl (and can occasionally trip up experts, too!).

They’re called sigils and the most important news is that Perl 6 will keep most of them. We considered removing them completely (as some people requested), but ultimately concluded that they’re far too valuable to remove. Sigils make it much easier to interpolate a variable into a character string or regular expression and they provide important sanity checks on the arguments of various built-ins (such as finding logical errors like pop %hash and keys @array).

While sigils are staying, we’re modifying how a sigil relates to its variable and doing it in a way that actually reduces mistakes rather than causes them. In Perl 5, the type of sigil a variable requires depends on how it’s being used, and in particular, what kind of value that use is supposed to produce. Consider the code in Listing One.

Listing One: Accessing hashes and arrays in Perl 5

1 # Perl 5 code...
3 print keys %hash;
4 print $hash{"name"};
5 print @hash{"name", "rank", "cereal preference"};
7 print @array;
8 print $array[0];

If you’re using a hash as a full hash (as in line 3), you use the normal hash sigil (%). But, if you’re looking up a single entry in the hash (line 4), and expect a single scalar value, you use the scalar sigil ($). And if you’re looking up several entries at once (line 5), and expecting a list of values, you use the array sigil (@). Likewise, when you want the full array (as in line 7), you use @, but when you want just a single element of it (line 8) you use $.

It’s a logical enough system — at least until we throw subroutine references and method calls into the mix, at which point the whole thing breaks down completely. Worse, it doesn’t fit well into many people’s brains.

That’s because sigils act rather like English demonstrative articles (“that value”, “these values”, “those values”). But an English article always agrees with the underlying plurality of the object it demonstrates, not with the plurality of the bit(s) of the object you’re currently interested in. So when you say “pass me those apples” and “pass me one of those apples” the “those” stays plural, whether you’re asking for one or all of the fruits. But in Perl the equivalent requests are:

pass(@apples); # “Pass those apples”
pass($apples[1]); # “Pass one of that apples”

To most programmers that’s simply counter-intuitive.

So Perl 6 will change how sigils operate. In the new version of Perl, sigils will cease to be adjectives and become an indivisible part of the nouns themselves. Listing Two shows what that means for the five print statements shown previously in Listing One.

Listing Two: Accessing hashes and arrays in Perl 6

1 # Perl 6 code...
3 print keys %hash;
4 print %hash{"name"};
5 print %hash{"name", "rank", "cereal preference"};
7 print @array;
8 print @array[0];

The previous complicated rules about selecting the sigil according to the nature of the value(s) being returned are replaced with a single rule that simply says: “If it’s a hash, use %. If it’s an array, use @. If it’s a scalar, use $. Always.”

Not only is that vastly easier to teach, to learn, and to remember, it also has the elegant side-effect of silently fixing one of the most common mistakes made by Perl programmers. Many programs include code that locates and returns a particular data structure (say a hash) by reference. That reference is then usually stored in a local scalar variable, through which the hash is later accessed, like so:

$data = locate_hash_for("required data");

# then later...

print $data{"particular_entry"};

In Perl 5 that’s a nasty and subtle error. The first line stores a reference to a hash in the scalar variable $data. But later a particular entry is looked up in the hash %data. Same name, different variables. In fact, the $data{…} syntax has nothing to do with the variable $data. Instead, it means: “Look up the entry in %data, using the $ prefix on the variable because it’s returning a single value”.

The correct code is print $data->{"particular_entry"}, which dereferences the hash reference in $data (using the -> operator) and then looks up the particular entry.
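The difference can be seen in a few lines of Perl 5. This is a sketch only: locate_hash_for() is a stand-in that returns a fixed hash reference, and %data is declared explicitly (with our) so that the mistaken lookup compiles under strict.

```perl
use strict;
use warnings;

# Perl 5 code. locate_hash_for() is a hypothetical stand-in.
sub locate_hash_for { return { particular_entry => "found it" } }

our %data = ();    # an unrelated hash that happens to share the name
my  $data = locate_hash_for("required data");

my $wrong = $data{"particular_entry"};      # looks in %data, not in $data
my $right = $data->{"particular_entry"};    # dereferences the reference in $data

print "wrong: ", (defined $wrong ? $wrong : "undef"), "\n";
print "right: $right\n";
```

Without the our declaration, strict would reject the mistaken line at compile time, which is one reason the bug tends to survive mostly in code that runs without strict.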

But people just don’t think that way, so they’re constantly being bitten by this mistake. Fortunately, in Perl 6, it’s not a mistake at all. Because the sigil is determined by the variable, rather than the value(s) it’s providing, $data always means the scalar variable. So $data{"particular_entry"} always means “look up the entry in the hash whose reference is stored in $data”.

We suspect that when people port their code from Perl 5 to Perl 6 many of these kinds of hidden bugs will simply “evaporate,” because the language semantics will have been changed to match how people actually think and code.

A Swiss Army Case Statement

Perl’s problem isn’t the lack of a case statement. It’s the opposite problem: Perl has too many. Because Perl doesn’t have one standard case statement, people have invented 23 alternative “case patterns,” everything from the pedestrian…

# Perl 5 code...

$val = 'G4';

if ($val eq 'A4') { print "paper" }
elsif ($val eq 'B4') { print "prior" }
elsif ($val eq 'C4') { die "BOOM!" }
else { print "huh??" }

… to the baroque …

# Perl 5 code...

$val = 'G4';

({ 'A4' => sub { print "paper" },
   'B4' => sub { print "prior" },
   'C4' => sub { die "BOOM!" },
 }->{$val} || sub { print "huh??" })->();

In Perl 6, there’s no need to jury-rig such awkward, inefficient solutions. Instead, there is a single, standard, built-in, control statement that does the job. It’s called given.

# Perl 6 code...
$val = 'G4';
given $val {
    when 'A4' { print "paper" }
    when 'B4' { print "prior" }
    when 'C4' { die "BOOM!" }
    default { print "huh??" }
}

The given statement associates the value in $val with the special Perl “current topic” variable $_. This association lasts for the duration of the associated block. Then each successive when statement within the block compares its associated value ('A4', 'B4', etc.) against the current value of $_. The first when statement whose value matches $_ has its associated block executed, after which control passes straight to the end of the surrounding block.

Though this seems no different in essence from the case statement found in many other programming languages, there is far more power here than first meets the eye. The way each when compares its value against $_ is determined by the (runtime) types of the two values being compared. In the example above, each when value is a string, so the string in $val is compared against each when using Perl’s eq string comparison operator. However, if the code had been:

# Perl 6 code...
$val = 'G4';
given $val {
    when 'A4' { print "paper" }
    when %B4 { print "prior" }
    when /C4/ { die "BOOM!" }
    when &D4 { print "huh??" }
}

then the first when would still compare against $val using eq, but the second when would treat the string in $_ as a key into the %B4 hash and consider the match successful if the corresponding element in the hash contained a true value. The third when, on the other hand, would note that the $_ was being compared against a regular expression, so it would use pattern matching to compare the two. And the final when, finding a subroutine (&D4), would pass $_ as an argument to that subroutine and consider the match successful if D4($_) returned a true value.

It may sound complex, but it isn’t. In all instances, the when just automatically chooses the most appropriate way to compare the “given” value (i.e., $_) against the current case. In other words, it just Does What You Mean.
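To make the dispatch rule concrete, here is a rough Perl 5 emulation of it. This is only a sketch of the idea, not how Perl 6 implements when: the matches() and classify() helpers and the sample data are invented for illustration, and only the four kinds of case described above are handled.

```perl
use strict;
use warnings;

# Perl 5 emulation (not Perl 6): choose the comparison from the type of the case.
sub matches {
    my ($topic, $case) = @_;
    my $type = ref $case;
    return $topic eq $case  ? 1 : 0 if $type eq '';        # plain string: eq
    return $case->{$topic}  ? 1 : 0 if $type eq 'HASH';    # hash: entry is true
    return $topic =~ $case  ? 1 : 0 if $type eq 'Regexp';  # pattern: match
    return $case->($topic)  ? 1 : 0 if $type eq 'CODE';    # subroutine: call it
    return 0;
}

# Hypothetical cases mirroring the example above.
my %B4    = (B4 => 1);
my @cases = (
    [ 'A4',      'paper' ],
    [ \%B4,      'prior' ],
    [ qr/C4/,    'BOOM!' ],
    [ sub { 1 }, 'huh??' ],
);

# Return the label of the first case that matches, as a when chain would.
sub classify {
    my ($val) = @_;
    for my $case (@cases) {
        return $case->[1] if matches($val, $case->[0]);
    }
    return;
}

print "$_: ", classify($_), "\n" for qw(A4 B4 XC4 G4);
```

The four sample values exercise the four comparison styles in turn: string equality, hash lookup, pattern match, and subroutine call.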

A more practical example of the power, flexibility, and convenience of given/when can be seen in the following code, which guesses an encoding scheme based on the first character in the data:

given $first_char {
    when [0..9] { $guess = 'dec' }
    when /<[A-F]>/ { $guess = 'hex' }
    when &is_ASCII { $guess = '7-bit' }
    when %known { $guess = %known{$_} }
    default { die Cannot::Guess }
}

Interestingly, because a given statement is really a function — one that returns the final value from any successful nested when statement — this example could also have been written as:

$guess = given $first_char {
    when [0..9] { 'dec' }
    when /<[A-F]>/ { 'hex' }
    when &is_ASCII { '7-bit' }
    when %known { %known{$_} }
    default { die Cannot::Guess }
};

Considering Perl’s predominantly C/Unix background, people often wonder why we chose given and when as keywords, rather than switch and case. There were two reasons. First, given and when read much more naturally and are therefore much easier for non-C/Unix programmers (now the majority of Perl’s users) to understand. But more importantly, we chose new keywords because the constructs they label are vastly more powerful than a mere switch statement.

For a start, a when always compares its value against $_, whether or not it’s inside a given. It doesn’t care whether $_ was set by something else entirely. So a when can be used in any context that has an active $_, not just inside a given.

For example, in Perl, a for loop successively aliases $_ to the list of values it’s iterating on, so in Perl 6, it’s possible to combine looping and selection in a very efficient and readable manner. This code snippet shows how.

# Perl 6 code...

for (@events) {
    when Mouse::Over { change_focus($_) }
    when Mouse::Click { make_selection() }
    when Window::Enter { change_focus($_) }
    when Window::Close { delete_window() }
    when /unknown\s+event/ { log_event($_) }
}

Here, every case except the last has a class name as its value, so the associated when statements attempt to match each event object by checking whether it belongs to that class. The last when, on the other hand, uses a pattern match as its test. So, that when treats the event object as a string (by implicitly invoking the object’s coercion-to-string method) and checks whether that string matches the specified pattern.

Just as a when doesn’t have to be associated with a given, a given doesn’t have to depend on nested whens. A given statement always sets $_, whether or not that $_ is ever examined by a when. So, within a given block any of the Perl constructs that default to operating on $_ — including the new unary dereference operator (.) — can be used. As Listing Three illustrates, that gives Perl 6 the equivalent (and more) of a Pascal-like with statement.

Listing Three: Another use for given in Perl 6

1 given $obj_ref {
2     .synchronize();
3     %data = .get_data;
4     given %data {
5         .{name} = uc .{name};
6         .{addr} //= "unknown";
7         print;
8     }
9     .set_data(%data);
10 }

Within the lexical scope of its associated block, the outer given (line 1) aliases $_ to an object reference. Then lines 2, 3, and 9 use the new “unary dot” notation to call the synchronize(), get_data(), and set_data() methods of that object without having to explicitly and repeatedly re-refer to the $obj_ref variable. Similarly, the inner given (line 4) lexically aliases $_ to the %data hash, enabling its various entries to be accessed without having to explicitly write %data everywhere. So too, the print statement at line 7 defaults to printing $_, printing the updated hash.

The Perl 5 equivalent of this code is considerably more cluttered with repeated referents, as Listing Four demonstrates.

Listing Four: Repetitious referents in Perl 5

1 do {
2     $obj_ref->synchronize();
3     %data = $obj_ref->get_data;
4     do {
5         $data{name} = uc $data{name};
6         $data{addr} = "unknown" if !defined $data{addr};
7         print %data;
8     };
9     $obj_ref->set_data(%data);
10 };

Perl has waited a long time for a real case statement, and the one it’s finally getting is the most powerful, flexible, and generalized tool we could invent.

It Takes All Types

Data types and type specifications take a much more prominent role in Perl 6. It isn’t that Perl 5 is weakly typed (as many people seem to think). It’s just that it lacks some important compile-time type specification mechanisms.

For example, in Perl 5, there’s no (easy) way to set up a variable that’s only permitted to store integers or set up a reference that only refers to objects of class Widget. Likewise, there’s no way to set up an array whose elements must be character strings or create a hash whose values must be references to arrays of numbers. In Perl 6 there is. When declaring a variable (with a my or our keyword), the type of the variable can be specified immediately after the keyword:

my Int $number;
my Widget $obj_ref;
my Str @strings;
my Array of Num %counters;

In fact, Perl 6 variables can be more precisely typed than variables in most other languages because Perl 6 allows you to specify both the storage type of a variable (what kinds of values it can contain) and the implementation type of the variable (how the variable itself is actually implemented). For example, a declaration like my Num @observations is SparseArray specifies that the @observations variable is required to store numbers, but takes the necessary internal structure and behavior to do so from the SparseArray class (rather than from the usual Array class).

Explicit typing extends to Perl 6 subroutines as well. For example:

sub Num mean(Int @vals) {
    return sum(@vals)/@vals;
}

specifies that the mean() subroutine takes an array of integers and returns a number. We could also write that as:

sub mean(Int @vals) returns Num {
    return sum(@vals)/@vals;
}

This extended form is handy when the return type is more complicated. For example, the following subroutine definition specifies that hist() takes an array of integers and returns another array of integers (namely, the frequency histogram it creates):

sub hist(Int @vals) returns Array of Int {
    my Int @histogram;
    for @vals { @histogram[$_]++ }
    return @histogram;
}

The Array of Int notation is an example of the way that compound types are composed in Perl 6. The compound type is constructed by passing the “inner” type (the one after the of) to the constructor of the “outer” type (the one before the of). This allows the outer type (in this example, Array) to determine how to implement the required storage and behavior to allow it to hold integers.

Note that most of the subroutines shown throughout this article have proper named parameter lists. Those parameters, just like all other variables, may be given simple or compound types, just as the @vals parameter was in the first line of hist(). That’s “may be given,” not “must be given”. Explicit typing is entirely optional in Perl 6. It’s still perfectly valid to specify a subroutine with neither formal parameter list, nor return type:

sub hist {
    my @histogram;
    for @_ { @histogram[$_]++ }
    return @histogram;
}

Of course, without named parameters, you have to access the subroutine’s arguments via the standard @_ variable. And there’s no guarantee that each element of that argument list is an integer suitable for indexing the @histogram array.

But the point to remember is that this “untyped” version of hist() isn’t untyped at all. It’s simply using the standard default types (as Perl 5 always does). We could get precisely the same “untyped” effect with explicit typing:

sub hist(Scalar @_) returns Array of Scalar {
    my Scalar @histogram;
    for @_ { @histogram[$_]++ }
    return @histogram;
}

So, strong typing isn’t optional in Perl 6. Only explicit strong typing is. In the absence of type specifications, Perl will simply use its default types. And, in those situations where more type-precision is called for, you can explicitly provide it.

This is a distinctly Perlish approach to typing: it doesn’t get in the way unnecessarily, it tries to “Do The Right Thing” automatically, and it provides a specific syntax to override the defaults for those times when Doing the Right Thing isn’t quite the right thing to do.

Industrial-strength Objects

Perl 5’s object-oriented features are powerful, flexible, mutable, extensible, minimalist, and thoroughly in keeping with Perl’s “There’s More Than One Way to Do It” philosophy. And to many people that makes them totally unsuitable for implementing production code.

Listing Five shows a simple, but typical, Perl 5 class definition that illustrates why. Staff::Record is a class (even though the keyword says package). new(), name(), rank(), and check_rank() are polymorphically dispatched methods of the class (even though the keywords say sub).

Listing Five: Class declaration in Perl 5

1 package Staff::Record;
3 sub new {
4     my ($class, $name, $rank, $cereal_pref) = @_;
5     my $self = bless { name=>$name, rank=>$rank, pref=>$cereal_pref }, $class;
6     $self->check_rank($rank);
7     return $self;
8 }
10 sub name {
11     my ($self) = @_;
12     return $self->{name};
13 }
15 sub rank {
16     my ($self, $new_rank) = @_;
17     if (@_ > 1) {
18         $self->check_rank($new_rank);
19         $self->{rank} = $new_rank;
20     }
21     return $self->{rank};
22 }
24 sub check_rank {
25     my ($self, $rank) = @_;
26     die "Invalid rank for ", ref($self), " object: $rank"
27         unless 0 < $rank && $rank < 10;
28 }
30 # Other methods here
30 # Other methods here

Objects of the class are really just hashes, which have been associated with the class by the bless function (line 5). Hence, all the other methods of the class access the data stored in those hashes via hash entry look-ups (lines 12, 19, and 21).

Unfortunately, because they’re really just hashes, the entries of all Staff::Record objects are equally accessible outside the methods of the class. There’s absolutely nothing to stop anyone from bypassing the name() and rank() methods completely and, out in the main program, writing something like:

$obj->{name} = "Mudd";
$obj->{rank} = 1000000;

The new() method acts like a constructor, and hence expects to be called on the class itself, as opposed to name() and rank(), which expect to operate on objects. The check_rank() method is really just an internal utility, and not intended to be part of the class’s public interface. Notice however that there is nothing in the way the methods are defined that indicates (or enforces) these important distinctions.
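To make the point concrete, here is a minimal Perl 5 sketch of what enforcing those distinctions by hand might look like. It is an illustration only, not code from the listing: the ref() and caller checks are added assumptions, and the class is trimmed to two fields. Even with the extra checks, the underlying hash remains open to anyone.

```perl
use strict;
use warnings;

package Staff::Record;
use Carp qw(croak);

sub new {
    my ($class, $name, $rank) = @_;
    # Constructor: must be called on the class name, not on an object.
    croak "new() is a class method" if ref $class;
    my $self = bless { name => $name, rank => $rank }, $class;
    $self->check_rank($rank);
    return $self;
}

sub rank {
    my ($self, $new_rank) = @_;
    croak "rank() is an object method" unless ref $self;
    if (@_ > 1) {
        $self->check_rank($new_rank);
        $self->{rank} = $new_rank;
    }
    return $self->{rank};
}

sub check_rank {
    my ($self, $rank) = @_;
    # Hand-rolled privacy: reject calls made from outside this package.
    my $calling_package = caller;
    croak "check_rank() is private"
        unless $calling_package eq __PACKAGE__;
    croak "Invalid rank for ", ref($self), " object: $rank"
        unless 0 < $rank && $rank < 10;
    return 1;
}

package main;

my $obj = Staff::Record->new("Kirk", 5);

# The manual check blocks outside calls to the utility method...
my $private_ok = eval { $obj->check_rank(5); 1 } ? "allowed" : "refused";
print "check_rank from main: $private_ok\n";

# ...but the object's hash can still be modified directly.
$obj->{rank} = 1000000;
print "rank is now: ", $obj->rank, "\n";
```

Every one of these checks is boilerplate that the class author has to remember to write, and none of them protects the data itself.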

As you can see, there are numerous opportunities for this kind of code to go horribly (and subtly) wrong. Which is why we’re adding a new, declarative object-oriented (OO) mechanism to Perl 6. Listing Six shows the corresponding class under Perl 6.

Listing Six: Class declaration in Perl 6

1 class Staff::Record {
2 has Str $.name;
3 has Int $.rank;
4 has Hash $.pref;
6 method CREATE(Str $name, Int $rank, Hash $cereal_pref) {
7 .check_rank($rank);
8 ($.name, $.rank, $.pref) = ($name, $rank, $cereal_pref);
9 }
11 method name() returns Str {
12 return $.name;
13 }
15 method rank() returns Int {
16 return $.rank;
17 }
19 method rank(Int $new_rank) {
20 .check_rank($new_rank);
21 $.rank = $new_rank;
22 }
24 method check_rank(Staff::Record $obj: Int $rank) is private {
25 die "Invalid rank for $obj.class() object: $rank"
26 unless 0 < $rank < 10;
27 }
29 # Other methods here
30 }

Notice first that the keywords now correspond to the constructs they define: class for classes, method for methods. The class itself is confined to the scope of a block. The attributes of the class are explicitly defined using the has keyword (lines 2 to 4) with special names that start with a dot. And they’re no longer directly accessible outside the class’s block.

There is no need to define a new() constructor; it’s generated automatically by the class definition. Instead, a specially named initializer method (lines 6 to 9) is defined and simply assumes that the object it is setting up will already have been created for it.

Methods can be declared private (line 24) using a property. They can have proper parameter lists and return types. They no longer need to refer to their object explicitly; within a method, attributes can be accessed directly, by their (dotted) variable name (as at lines 12, 16, and 21).

However, when a named invocant (i.e., an explicit reference to the calling object) is required, it can still be specified at the start of the method’s parameter list (as it is for check_rank() at line 24).

The declaration of the invocant is separated from the method’s normal parameters by a colon. Whether or not an object reference has been explicitly declared, within a method, $_ is always aliased to the invoking object. That means that the unary dot operator can be used to call other methods from within a method (as check_rank() is at lines 7 and 20).

Methods can be overloaded within a class (for example, the two rank() methods defined at lines 15 and 19), so long as the variants can be distinguished by their parameter lists. Method calls can even be directly interpolated within a character string (as $obj.class() is at line 25).

Many of the details of the new OO mechanism are yet to be finalized. For example, Perl 6 will also provide delegated dispatch of methods, multiple dispatch (i.e., method selection based on the types of two or more parameters at once), anonymous classes, interfaces, multiple inheritance, and parametric classes. The syntax and semantics of all those features are not yet locked down.

In the meantime, it’s certain that Perl 6 will make object-oriented Perl coding more declarative, more secure, more robust, and far more standardized. Ironically, it does all that in its usual postmodern way: by extending the existing language to increase the programmer’s choice of tools.

Parallel Data

Perl 6 will introduce an entirely new scalar data-type: the junction. Junctions are like a collision between set theory, boolean logic, quantum state superpositions, and SIMD parallel processing.

That probably sounds terrifying but, curiously, in practice it isn’t at all. The way junctions are used is very straightforward and convenient. Consider the following two if statements:

if ( any(@new_values) > 10 ) {
    print "Too big\n";
}

if ( all(@new_values) < 0 ) {
    print "Too small\n";
}

You can almost certainly work out what’s happening just by reading the code aloud. The any function takes a list of scalar values as its arguments and returns a “junction” of those values. That is, it returns a single scalar value that is equivalent to any of the arguments it was given. So the first if statement means exactly what it says: “If any of the new values is greater than 10, print a message.” In other words, comparing the junction returned by any against the value 10 implicitly compares all of the values from @new_values against 10. And if any of them is greater than 10, then the overall comparison is true.

Likewise, in the second if, the call to all creates a junction that is equivalent to all of the values at once. Hence, the subsequent comparison against zero is true only if all the new values are less than zero.
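For comparison, the same two tests can be approximated in Perl 5 today. This sketch assumes a List::Util recent enough to export any and all (version 1.33 or later, as bundled with Perl 5.20 and up); the sample data is made up for illustration. Unlike a junction, the test has to be wrapped in a block rather than written as a plain comparison.

```perl
use strict;
use warnings;
use List::Util qw(any all);

# Sample data, invented for this sketch.
my @new_values = (3, 12, -4);

# any(@new_values) > 10 : true as soon as one element passes the test.
my $too_big = any { $_ > 10 } @new_values;

# all(@new_values) < 0 : true only if every element passes the test.
my $too_small = all { $_ < 0 } @new_values;

print "Too big\n"   if $too_big;
print "Too small\n" if $too_small;
```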

“Any” junctions are known as disjunctions because they act like a Boolean OR: “this OR that OR the other.” “All” junctions are known as conjunctions, because they imply an “AND” between their values — “this AND that AND the other.”

There are two other types of junctions available in Perl 6: abjunctions, which represent exactly one of their possible values at any given time:

if ( one(@roots) == 0 ) {
    print "Unique root to polynomial\n";
}

and injunctions, which represent none of their values:

if ( $passwd eq none(@previous_passwds) ) {
    print "New password accepted\n";
}
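A rough Perl 5 counterpart to these two tests, again only as a sketch: none comes from List::Util (1.33 or later is assumed), while "exactly one" is emulated here by counting matches with grep. The sample roots and passwords are invented.

```perl
use strict;
use warnings;
use List::Util qw(none);

# Sample data, invented for this sketch.
my @roots            = (0, 2.5, -1);
my @previous_passwds = qw(hunter2 letmein swordfish);
my $passwd           = "correct horse";

# one(@roots) == 0 : count the matching elements and require exactly one.
my $zero_roots  = grep { $_ == 0 } @roots;
my $unique_root = $zero_roots == 1;

# $passwd eq none(@previous_passwds) : no element may match.
my $is_new = none { $_ eq $passwd } @previous_passwds;

print "Unique root to polynomial\n" if $unique_root;
print "New password accepted\n"     if $is_new;
```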

Junctive comparisons like these are ideal candidates for compiler optimization — by distributing the computations involved to parallel threads, or processes, or processors. The notation also provides an easily understandable (and hence maintainable) way to incorporate parallel processing into a serial language.

Nor is that parallelization restricted to linear processing. In the following Perl 6 example, the NxM comparisons required between the two sets of values could all be performed in parallel:

if ( any(@new_values) > all(@old_values) ) {
    print "New maximum value recorded\n";
}

The equivalent Perl 5 code is much slower, far less clear, four times as long, and consequently has been banished to Listing Seven.

Listing Seven: NxM comparisons in Perl 5

# Perl 5 code…

my $any_true = 0;
for my $new (@new_values) {
    my $all_true = 1;
    for my $old (@old_values) {
        $all_true &&= ($new > $old) or last;
    }
    $any_true ||= $all_true and last;
}
if ($any_true) {
    print "New maximum value recorded\n";
}
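The nested loops can be tightened somewhat in Perl 5 by nesting the any and all functions from List::Util (1.33 or later assumed). The sketch below wraps the test in a helper called new_maximum, a name chosen purely for illustration. It is shorter than the explicit loops but still serial, and still noisier than the junction form.

```perl
use strict;
use warnings;
use List::Util qw(any all);

# Hypothetical helper: true if some new value exceeds every old value.
sub new_maximum {
    my ($new, $old) = @_;    # two array references
    return any {
        my $candidate = $_;  # remember the outer element before $_ is re-aliased
        all { $candidate > $_ } @$old;
    } @$new;
}

# Sample data, invented for this sketch.
my @old_values = (4, 8, 15);
my @new_values = (7, 16, 2);

print "New maximum value recorded\n"
    if new_maximum(\@new_values, \@old_values);
```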

Junctions also act as mathematical sets. For example, you could read in text lines and print them out without repetitions (like the Linux uniq utility does, only without the requirement that repeated lines be adjacent):

for (<>) {
    next when $seen;
    print $_;
    $seen = any($seen,$_);
}

The for (<>) reads each input line in turn and aliases it to the $_ variable. The when then compares that line against the junction in $seen. If the current line matches any of the previously seen lines, the when’s comparison succeeds and the next command causes Perl to skip immediately to the next iteration of the for loop. Otherwise, the input line is printed. Then the final statement in the block updates the set of “seen” lines by creating a new set (i.e. a junction) consisting of all the previously seen lines plus the current input line.

That last line is, admittedly, a little clunky. But Perl 6 also provides a binary operator (|) that creates an “any” junction from its two arguments. So you could rewrite the last line much more cleanly as:

$seen = $seen | $_;

or, better still, as:

$seen |= $_;
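For readers who want the same behaviour in Perl 5, the customary idiom uses a hash as the set of lines seen so far. The sketch below puts it in a small function named unique_lines (an illustrative name, not part of any module) so that it can be tried on a literal list instead of standard input.

```perl
use strict;
use warnings;

# Hypothetical helper: return the input lines with later repeats removed.
sub unique_lines {
    my @lines = @_;
    my %seen;
    # Keep a line only the first time it appears; %seen stands in for
    # the "any" junction of previously seen lines.
    return grep { !$seen{$_}++ } @lines;
}

print unique_lines("apple\n", "pear\n", "apple\n", "fig\n", "pear\n");
```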

Alternatively, by changing the sense of the comparison you could use an “all” junction instead:

for (<>) {
    if $_ ne $seen {
        print $_;
        $seen = all($seen,$_);
    }
}

Because $seen now contains an “all” junction, the comparison $_ ne $seen means “the current line is not equal to all the previous lines,” which is exactly what’s wanted. Here too, Perl 6 provides a binary operator (&) to facilitate creating “all” junctions, so the last line could be written more cleanly as:

$seen &= $_;

By the way, if you’re wondering what happened to the bitwise boolean operators that | and & represent in Perl 5 (and C), they’re still available. But their dual Perl behaviours have been “factored out” and renamed as follows:

  • +| for bitwise OR on a number,

  • ~| for bitwise OR on a string,

  • +& for bitwise AND on a number,

  • ~& for bitwise AND on a string.
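The “dual behaviour” being factored out is easy to observe in Perl 5, where the same | operator works on numbers or on strings depending on its operands. The following sketch assumes classic Perl 5 semantics, that is, without the bitwise feature that later Perl 5 releases can enable.

```perl
use strict;
use warnings;

# Numeric operands: an ordinary bitwise OR of the integers 12 and 3.
my $numeric_or = 12 | 3;        # 0b1100 | 0b0011 gives 15

# String operands: the OR is applied character by character instead,
# so "1"|"3" gives "3" and the unpaired "2" is kept: the result is "32".
my $string_or = "12" | "3";

print "numeric: $numeric_or\n";
print "string:  $string_or\n";
```

In Perl 6 notation these would be written 12 +| 3 and "12" ~| "3" respectively, so the operator itself says which behaviour is intended.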

The junctive binary operators are particularly handy for tests on a fixed number of values. For example, the ability to code tests of the form “if [A or B or C] is zero” is an often-requested language feature, one that junctions provide in a very natural way:

if ($A|$B|$C == 0) {
    print "Coefficients must be non-zero\n";
}

Another feature that is often asked for is the ability to easily detect if a particular value appears in a list. With junctions, that’s just:

if ( $value == any(@list) ) {…}

This solution is far superior to providing an explicit in operator:

# NOT Perl 6 code…
if ( $value in @list ) {…}

because the use of junctions allows you to select the most appropriate form of comparison for a particular list:

if ( $value == any(@list) ) {…}
if ( $value eq any(@list) ) {…}
if ( fuzzy_match($value,any(@list)) ) {…}

The first if checks if any element of @list is numerically equal to $value, whereas the second checks if any element is equal to $value under string comparison, while the third looks for list membership as defined by some user-supplied subroutine.
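The same three flavours of membership test can be sketched in Perl 5 with List::Util’s any (1.33 or later assumed). The fuzzy_match subroutine here is a made-up stand-in for whatever user-supplied comparison is wanted, and the data is chosen so that the three tests give different answers.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Hypothetical user-supplied comparison: values within 1 of each other "match".
sub fuzzy_match {
    my ($left, $right) = @_;
    return abs($left - $right) < 1;
}

# Sample data, invented for this sketch.
my @list  = ("10", "2.50", "7");
my $value = "2.5";

my $numeric_match = any { $value == $_ } @list;             # 2.5 == 2.50
my $string_match  = any { $value eq $_ } @list;             # "2.5" ne "2.50"
my $fuzzy_found   = any { fuzzy_match($value, $_) } @list;

print "numeric: ", ($numeric_match ? "found" : "not found"), "\n";
print "string:  ", ($string_match  ? "found" : "not found"), "\n";
print "fuzzy:   ", ($fuzzy_found   ? "found" : "not found"), "\n";
```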

Junctions also provide “union types.” That is, if you need a variable that can store either integers or file descriptor objects, but nothing else, you can declare precisely that:

my Int|FileDesc $variable;

The type specified for $variable is a junction of the Int and FileDesc types, so whenever a value is assigned, its run-time type is compared against both types simultaneously, and the overall comparison succeeds if either type comparison succeeds.

But, of course, junctions give you more than just unions. You can also specify “intersection types,” such as:

my FloorWax&DessertTopping $shimmer;

This declaration requires that any object assigned to $shimmer must belong to both the FloorWax and DessertTopping classes. (Like Perl 5, Perl 6 allows multiple inheritance, so it’s perfectly possible to create objects that inherit from two unrelated classes.)

Using an injunction, you can even create a variable that will store anything but a particular type of value:

my none(Soup) $for_you;

This declaration means that you’re allowed to assign any type of scalar value to $for_you, as long as it isn’t a reference to a Soup object.
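Perl 5 has no declarative equivalent, but the run-time checks implied by these three declarations can be imitated by hand. The sketch below is only an approximation under several assumptions: the classes are empty placeholders named after the examples above, the satisfies_any, satisfies_all, and satisfies_none helpers are invented for illustration, and List::Util 1.33 or later is required.

```perl
use strict;
use warnings;
use List::Util qw(any all none);
use Scalar::Util qw(blessed);

# Placeholder classes, named after the examples in the text.
package FloorWax;       sub new { return bless {}, shift }
package DessertTopping; sub new { return bless {}, shift }
package Shimmer;        our @ISA = ('FloorWax', 'DessertTopping');
package Soup;           sub new { return bless {}, shift }

package main;

# True if $value is an object belonging to class $type.
sub is_a {
    my ($value, $type) = @_;
    return blessed($value) && $value->isa($type);
}

# Union type (A|B): at least one of the types must match.
sub satisfies_any  { my ($v, @types) = @_; return any  { is_a($v, $_) } @types; }

# Intersection type (A&B): every type must match.
sub satisfies_all  { my ($v, @types) = @_; return all  { is_a($v, $_) } @types; }

# Exclusion type (none(A)): no listed type may match.
sub satisfies_none { my ($v, @types) = @_; return none { is_a($v, $_) } @types; }

my $shimmer = Shimmer->new;
print "shimmer is both a floor wax and a dessert topping\n"
    if satisfies_all($shimmer, 'FloorWax', 'DessertTopping');
print "no soup for you\n"
    unless satisfies_none(Soup->new, 'Soup');
```

A real implementation would have to hook such checks into assignment (for instance with tie), which is exactly the bookkeeping the Perl 6 declarations are meant to remove.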

In addition to Perl 6’s threads, coroutines, and vector operators, junctions provide yet another kind of parallelism: for data, types, and procedures. And, though junctions can be enormously powerful, they’re also unexpectedly intuitive.

Impatience is a virtue

When working programmers see the power and elegance that Perl 6 will provide, their most common response is “I want it now!” Although there is a prototype Perl 6 compiler already available to download from http://www.parrotcode.org, we’re still a few years away from a full release of Perl 6.

But, as it turns out, you can sample many of the tasty new Perl 6 features today… in Perl 5. Some of the modules listed below change Perl 5’s syntax to match that of Perl 6. Others just provide similar functionality, but with non-Perl 6 syntax. You can grab any or all of them from the Comprehensive Perl Archive Network (http://search.cpan.org).

  • Perl6::Currying enables you to use Perl 6’s higher-order functions in Perl 5.

  • Perl6::Interpolators allows you to use the Perl 6 $(…) function-interpolation syntax in Perl 5.

  • Perl6::Placeholders makes Perl 6’s implicitly declared parameter mechanism available in Perl 5.

  • Perl6::Parameters allows Perl 5 subroutines to take typed parameter lists just like Perl 6.

  • Perl6::Variables changes Perl 5’s sigil rules to match those of Perl 6.

  • Attribute::Handlers provides some of the functionality of Perl 6 properties.

  • Class::Contract implements declaratively specified, properly encapsulated classes, much as Perl 6 will.

  • Class::Delegation provides OO delegation similar to that in Perl 6.

  • Class::Multimethods supports Perl6-style multimethods and function overloading in Perl 5.

  • NEXT provides Perl6-ish method redispatch in Perl 5.

  • Quantum::Superpositions is the forerunner to Perl 6 junctions.

  • Switch adds the Perl 6 given and when statements to Perl 5.

  • Text::Reform is the prototype for the Perl 6 replacement for Perl 5’s format statement.

  • Want provides Perl 6’s extended call-context detection in Perl 5.

  • Bundle::Perl6 is a convenient collection of many of the above modules.

The Once and Future Perl

20/20 hindsight is a tremendous advantage when you’re creating a language. In designing Perl 6, we’re re-examining fifteen years of Perl usage to see what we got right and what we could do better.

Apart from the new mechanisms described so far, the Perl 6 project is also about filing off those few remaining rough edges of the language. Fine-tuning will bring long-hoped-for features such as multiway comparisons, simplification of the parenthesizing rules, generalized string interpolators, and auto-dereferencing in suitable contexts.

None of those minor new features is, by itself, a quantum leap for Perl 6. But programming in Perl is supposed to be easy, natural, and fun. Eliminating those trivial everyday annoyances and facilitating less intrusive, more intuitive ways of solving the same problems is an essential part of making Perl programmers even more productive.

Many of the tools and techniques that are important to the modern Perl community have been developed as CPAN modules. Many of these have become so widely used that it makes sense to integrate their features into the language itself. So PDL vector operations become Perl 6 vector operators; the any and all data types from Quantum::Superpositions become Perl 6 disjunctions and conjunctions; the Memoize module becomes the is cached property; and Parse::RecDescent-like grammars become part of Perl 6’s greatly enhanced pattern matching facilities (which are evolving so significantly that we’re planning a separate article on them in a forthcoming issue of Linux Magazine).

Like Perl 1 through 5, Perl 6 is evolving to meet the needs of its users, taking that which has proven widely useful or commonly necessary and making it part of the core language, relegating specialized and occasionally-used features to external modules, and consigning to oblivion those ideas that just didn’t work.

Our goal is to make the next version of Perl an even more powerful tool for today’s programmer. Perl 6 will be a tool that can grow and adapt to meet the needs of programmers over the next two decades.


Larry Wall’s Perl 6 design documents: http://dev.perl.org/perl6/apocalypse/

Damian Conway’s tutorials on Perl 6: http://dev.perl.org/perl6/exegesis/

Why Perl 6 will still be Perl: http://damian.conway.org/Articles/ANFSCS.html

Search the CPAN for Perl 6 related code and documentation: http://search.cpan.org/search?query=Perl6&mode=all

Everything else Perl6ish: http://dev.perl.org/perl6

The Perl Foundation: http://www.perlfoundation.org

Damian Conway (http://damian.conway.org) is a professional Perl boffin. His company, Thoughtstream, offers Perl training and consultancy world-wide. You can find additional information on the changes in Perl 6 at http://www.linux-mag.com/downloads/2003-04/perl6.
