The Basics of Hashes

When I first started playing with awk more than two decades ago, I was amazed at the ease with which common tasks could be easily solved through the use of its "array" datatype. Prior to that, I had experienced only BASIC and C arrays, where the only index available was a small integer. But awk arrays could have arbitrary strings as keys!

When I first started playing with awk more than two decades ago, I was amazed at the ease with which common tasks could be easily solved through the use of its “array” datatype. Prior to that, I had experienced only BASIC and C arrays, where the only index available was a small integer. But awk arrays could have arbitrary strings as keys!

I remained mystified at how awk could manage to find the value for my key quickly. It took some study later to learn about hash tables and complex data structures, but luckily I didn’t need to know how it was done “under the hood” to be able to use them effectively.

As Larry Wall copied a lot of what was right about awk when designing and building the first version of Perl, one of the things he brought along was awk’s idea of an array with arbitrary string keys. He chose to call those associative arrays, to distinguish them from the more traditional small-integer-index arrays, which he also implemented in this new Perl language.

After a half-dozen years of Perl development, the term associative array was optimized to just hash for the Perl version 5 releases. I’m glad for that, because I use the term frequently when describing nearly every complex program I write.

A hash is a mapping of keys to values. We create elements in hashes by telling the hash that “this value goes with that key.” Later on, we can ask the hash, “Now what value goes with this key?” and the hash can respond quickly. Even though that sounds simple, just being able to do that helps us solve lots of problems with relative efficiency.

The values of a hash can be any scalar or reference, including strings, numbers, the special undef value, or a reference to another data structure. Let’s put the references aside for a moment, and stay with just the basics.

The keys of a hash must be a string. If we try to use a number as a key, it gets “stringified.” So, using any of 2 or 2.0 or 2e0 as keys would result in accessing the same hash element because they all produce the same string.

However, using the strings “2″ and “2.0″ and “2e0″ access different elements, because the strings themselves are different. Also, if undef is used as a key, it becomes the empty string. If you have warnings enabled, you’ll likely get a warning message on that.

The keys of a hash are always unique, because the keys uniquely identify a particular element. If you try to insert a new element that has an existing key, it simply overwrites the previous value. The values of a hash may be duplicated, however. More than one element of a hash can have a value of fred, for example. Keys must be unique; values do not.

The name of a hash is like all other Perl variables: alphanumerics and underscore, of any length, possibly including package markers (::, the double-colon). The hash name is preceded by a percent-sign (%) when referring to the entire hash (for certain functions that access the hash as a whole), such as %age. The hash name is preceded by a dollar sign and followed by a key expression enclosed in curly-braces when referring to an individual element, such as $age{“fred”}.

Inserting items into a hash is as simple as assigning them:

$age{“fred”} = 27;
$age{“wilma”} = 24;

If a key is reused, the new value overwrites the old value:

$age{“fred”} = 28;

If the key is entirely an alphanumeric string (including underscore), the quotes can often be dropped:

$age{fred} = 28; # same as $age{“fred”} = 28;

But don’t get into this habit in general, because this breaks when there’s no longer a simple alphanumeric string:

$age{Mr. Slate} = 35; # BROKEN
$age{“Mr. Slate”} = 35; # correct

And now for the magic of the hash: to get the value back, just ask for the key:

print “Fred is “, $age{fred}, ” years old.\n”;

The hash can quickly locate the 28 value corresponding to the fred key. OK, so I was mystified at how it did this when I was younger, and I can still pretend to be mystified by it now. In practical terms, while it’ll get a bit slower and slower as you put hundreds and thousands of items into the hash, it won’t ever be as slow as a linear search for the item in a list of items. It just works, and works well.

If you ask for a key that isn’t in the hash, the response is an undef value. And unlike awk, this doesn’t change the hash at all. But undef can also be a value in the hash! To tell the difference between the undef that is returned when the key is absent from the undef that is returned because you stored it there, use exists:

if (exists $age{dino}) {
print “We know dino’s age… it’s
} else {
print “We don’t know dino’s age.\n”;

If we had a lot of ages to initialize into the hash, we could certainly assign each one individually, as shown above. But a faster way to initialize an entire hash is to assign a list to the %-form of the hash, as in:

%age = (‘fred’, 28, ‘wilma’, 24, ‘Mr. Slate’, 35);

Here, the list of six items is broken into three pairs. Within each pair, the first item is a key, and the second is a corresponding value. Later items override earlier items, so this assignment

%age = (‘fred’, 27, ‘wilma’, 24, ‘fred’, 28);

results in Fred having the age 28. As an alternative syntax, we can write the original initialization using the “fat-arrow,” or =>:

%age = (‘fred’ => 28, ‘wilma’ => 24, ‘Mr. Slate’ => 35);

=> clarifies which items are paired. As a further simplification, any item to the left of a fat arrow which is simply an alphanumeric-and-underscore name can have its quotes removed:

%age = (fred => 28, wilma => 24, ‘Mr. Slate’ => 35);

Again note that Mr. Slate requires special quoting.

We can process all the names without knowing which names got inserted by using the keys() function:

foreach my $key (keys %age) {
print “$key is $age{$key} years old.\n”;

The list of keys comes out as some random ordering of the three keys. The foreach loop puts each of these names in turn into the $key variable, and then the corresponding value is looked up inside the double-quoted string.

Why is it a random order? Well, the primary purpose in life for a hash is to find a value for a given key. To do that, it organizes the keys internally into “buckets,” and when I ask for all of the keys, it just dumps the buckets in their internal order. We can fix the indeterminacy by adding sort in front of the keys operator:

foreach my $key (sort keys %age) {
print “$key is $age{$key} years old.\n”;

This output is sorted alphabetically by keys (the names). Another common task is to sort by values, not by keys. Let’s say we wanted the output sorted in terms of increasing ages. We can do that with a user-defined sort:

sort { $age{$a} <=> $age{$b} ) keys %age

Here, we’re taking the list of keys (the names), and sorting them according to their corresponding age. If we put this expression inside the foreach loop, we get our desired “sort by age” display.

A useful bonus effect of having a missing key return undef for a value is that undef is the “null element” for both addition and concatenation. For example, let’s increase an element by three:

$age{wilma} += 3;

The existing value of $age{wilma} is consulted (24), incremented by 3 (yielding 27), and then stored back as the updated value for the given key. But what if the key is absent?

$age{dino} += 3;

The existing value of $age{dino} doesn’t exist, so we start with undef. But undef plus 3 is 3, so we store a 3 back into $age{dino}, creating an element where an element didn’t previously exist!

This is very useful when counting or summarizing. For example, counting words that appear in a file is as simple as:

%count = (); # clear the hash out
while (<>) {
my @words = split /\W+/, $_;
foreach my $word (@words) {
$count{$word} += 1;

The first time a word is encountered, the hash lookup for that word as a key returns undef, which is incremented to 1, and the value of 1 is stored for that word. When the same word is encountered a second time, the existing value of 1 is incremented and updated to 2, and so on. When the while() loop is done, we can then display the final results:

foreach my $word (
sort { $count{$b} <=> $count{$a} } keys
) {
print “$word was seen $count{$word}

We can also take advantage of the uniqueness property of hash keys to filter out duplicates from a list. For example, suppose we need to examine lines of a file to generate output that consists of only one line, even if the line is duplicated. Easy enough:

%seen = ();
while (<>) {$seen{$_} = 1;}

foreach (sort keys %seen) {print;}

For each line being read, we create a hash element in %seen with a key of that line to a value of 1. If the same line is seen more than once, we’re just overwriting the one with another one. When we’re done looking at all the lines, we dump all the keys. Of course, this loses the original input order. If that’s important, we can print each line as we see it, then ignore any subsequent appearances:

%seen = ();
while (<>) {
next if $seen{$_};
$seen{$_} = 1;

On the first appearance of a line, $seen{$_} will be undef, which appears to be false, so we won’t take the next operation. After printing the line, we set the value to 1, and remaining lines will be ignored.

I hope this little overview of a small, but powerful feature of Perl has inspired and enlightened you.

Until next time, enjoy!

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. You can download the code described in this column from http://www.linux-mag.com/downloads/2003-09/perl.

Comments are closed.