Packing It In

Take a look at Perl's confusing, but important pack and unpack functions.
The other authors of the wildly popular Learning Perl book and I dropped a few things between the second and third editions of the tome, simply because we wanted to make room for a few more relevant topics. As Perl has matured, Perl users migrated from being primarily system administrators to traditional developers, authoring complete mission-critical applications, such as some of the code behind many of the websites that you visit frequently. One of the sections that didn’t make the cut was the section on the confusing, but important pack and unpack functions. This month, let’s take a look at this small corner of the Perl universe.

Bits and Pieces

The pack function turns Perl-manageable data (numbers and strings) into a sequence of bits that might make sense to some external application. unpack generally goes in the other direction, taking a bag-of-bits from some hostile, real-world interface, and turning the parade of ones and zeroes into nice strings and numbers for further processing.
The number of options for pack and unpack is dizzying. In fact, as I was researching this article, I realized that I hadn’t read the related documentation for a few Perl releases, and it seems that they’ve snuck in about twice as many features as when I last looked. Too bad a pack format isn’t quite Turing-complete, although I’m happy they aren’t self-aware, as regular expressions seem to have become.
The easiest way to get into how pack and unpack work is to dive right in. Take, for example, packing a character:
my $string = pack "CC", 65, 66;
If you print $string, you see an uppercase A followed by an uppercase B, presuming a nice ASCII environment and not something odd like EBCDIC. This pack invocation works much like sprintf(): the first argument is a template, which defines how to interpret the remaining arguments, and the function returns the result.
In this case, the template consists of two C characters, each of which denotes an unsigned character. For each one of these characters, pack takes the next element from the arguments (first 65, then 66), and “packs” each one into the result. If you put a 65 value into an ASCII byte, you get a capital letter A, just as if you’d said chr(65), and so that’s the first byte of the result.
You can continue this, adding more characters:
my $string = pack "CCCC", 65, 66, 67, 68;
But when you have the same item repeated, such as the C above, you can use a shortcut:
my $string = pack "C4", 65, 66, 67, 68;
For many of the formatting letters, a trailing numeric value means “repeat this many times”. (Some of the others interpret a numeric value as a width. More on that in a moment.)
As with sprintf(), if you ask for more values than you have, you get 0 padding. And if you don’t use up everything, those extra values are simply ignored. To keep from having to count the number of elements, a special value of * means “as many as you need”:
my $string = pack "C*", @some_numbers;
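To make the repeat-count rules concrete, here is a small sketch (the variable names are mine, and an ASCII platform is assumed) showing an explicit count, a star, and a template that asks for more values than it is given:

```perl
use strict;
use warnings;

# Assumes an ASCII (or ASCII-compatible) platform.
my $counted = pack "C4", 65, 66, 67, 68;   # explicit repeat count
my $starred = pack "C*", 72, 105, 33;      # as many as supplied
my $short   = pack "C4", 65, 66;           # two values short of the template

print "$counted\n";              # ABCD
print "$starred\n";              # Hi!
print length($short), "\n";      # 4: "AB" followed by two NUL bytes
```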

Time to Unpack

You use unpack to go the other way, from a string to a list of individual numbers:
my @numbers = unpack "C*", "Hello world!";
In this case, I end up with a series of 12 numbers beginning with 72 and finishing with 33, being the ASCII values of each character of the string. If I want to skip the first two characters, I can use the “x” format to skip a byte, and either “xx” or “x2” to skip two bytes:
my @part = unpack "x2 C*", "Hello world!";
Now I get the values starting with the third character. What if I wanted only every other character? Like a regular expression, I can group parts of the format in parentheses (which can be nested):
print pack "C*", unpack "(x C)*", "Hello world!";
And now I’ve picked out every other character, resulting in el ol!. Had I swapped the x and C, I’d get Hlowrd instead.
Both of the last two formats used whitespace. Whitespace can be introduced between format constructs for clarity. In fact, you can even add Perl-style comments, beginning with a pound-sign and terminated by a newline.
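As a sketch of what a commented template can look like (the layout and names here are my own invention, and template comments need a reasonably modern Perl), the format can be spread over several lines inside a single-quoted string:

```perl
use strict;
use warnings;

# A multi-line template; unpack ignores the whitespace and the
# "#" comments inside it.
my $template = q{
    x2      # skip the first two bytes
    C3      # then take three bytes as unsigned values
};

my @values = unpack $template, "Hello world!";
print "@values\n";    # 108 108 111, the byte values of "llo"
```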
Another common format is n, which stands for a 16-bit integer in “network” (big-endian) order. The corresponding data element is again expected to be a numeric value, but the result is now two bytes of the string instead of one. The first byte is the “high” part of the 16-bit value, while the second byte is the “low” byte. For example, both…
my $data = pack "n", 1;
my $data = pack "C*", 0, 1;
… result in the same string, with a NUL byte followed by a byte having the ASCII value of 1 (a Control-A). Similarly, the statements…
my $data2 = pack "n", 256;
my $data2 = pack "C*", 1, 0;
… result in the same string as well, namely, a byte with the ASCII value of 1 (Control-A) followed by a NUL byte. The first byte represents the high half of the 16-bit value.
The N value is similar, but packs a 32-bit value into four bytes, most significant byte first. Again, both of the next two statements result in the same value:
my $data = pack "N", 65536 + 256 * 2 + 3;
my $data = pack "C*", 0, 1, 2, 3;
If you want the little-end first, you can use v and V in place of n and N:
my $data_reversed = pack "V", 65536 + 256 * 2 + 3;
my $data_reversed = pack "C*", 3, 2, 1, 0;
Here, the low byte comes first, followed by successively more significant bytes. The letter “v” comes from “vax” ordering, since little-endian order was used on the DEC VAX computer system (and probably because “v” was one of the few characters not already taken).
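One way to see the byte orders side by side is to dump the packed bytes as hex. The sketch below does that, and then packs a made-up two-field header in network order; the header layout is purely an illustration, not any real protocol:

```perl
use strict;
use warnings;

my $value = 0x01020304;

# Hex dumps of the same number in each byte order.
my $big    = unpack "H*", pack("N", $value);   # 01020304
my $little = unpack "H*", pack("V", $value);   # 04030201

# A hypothetical wire header: 16-bit length, then 32-bit identifier,
# both in network order.
my $header = pack "n N", 513, $value;
my ($length, $id) = unpack "n N", $header;

print "$big $little\n";
printf "%d bytes, length=%d, id=0x%08x\n", length($header), $length, $id;
```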
The L letter is a “native,” unsigned 32-bit value. On big-endian machines, this letter acts like N, but on little-endian machines, this letter acts like V. You can use this to figure out your native byte order:
print unpack "C*", pack "L", 0x04030201;
On little-endian machines, this prints 1234, but on big-endian machines, it prints 4321.
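If you would rather have a word than a string of digits, the same probe can be compared against the two fixed byte orders. This is only a sketch, and the labels are my own:

```perl
use strict;
use warnings;

# Compare the native layout of a 32-bit value with the two fixed orders.
my $probe  = 0x04030201;
my $native = pack "L", $probe;

my $order = $native eq pack("V", $probe) ? "little-endian"
          : $native eq pack("N", $probe) ? "big-endian"
          :                                "something more exotic";

print "This machine looks $order\n";
```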
And as long as I introduced a hex value there, let’s look at how to get that hex value into and out of a string. The only one I really use is H*, which unpacks a single string into its corresponding hex representation (of any length), or packs the hex string back into the original string:
my $hello_as_hex = unpack "H*", "hello"; # "68656c6c6f"
print pack "H*", $hello_as_hex; # say hello!
If you wanted those as pairs of characters, you can use the repetition marker:
my @hexes = unpack "(H2)*", "hello";
Now you have qw(68 65 6c 6c 6f) as five separate elements. Joy.
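The same two formats make a serviceable hex dump and its inverse. A minimal sketch, using four arbitrary bytes of my own choosing:

```perl
use strict;
use warnings;

my $data = pack "n C2", 0xCAFE, 0, 255;        # four arbitrary bytes

# Spaced-out hex dump, one pair of digits per byte.
my $dump = join " ", unpack "(H2)*", $data;    # "ca fe 00 ff"

# Strip the spaces, and H* rebuilds the original bytes.
(my $digits = $dump) =~ tr/ //d;
my $copy = pack "H*", $digits;

print "$dump\n";
print $copy eq $data ? "round trip ok\n" : "mismatch\n";
```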
Similarly, I can break a string into bits with B*:
print unpack "B*", "hi!"; 
The first eight bits represent the letter h from the highest to lowest bit. The other two characters similarly follow. Again, to see this more easily, use a grouping:
print "$_\n" for unpack "(B8)*", "hello world!";
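The previous unpack results in:
01101000
01100101
01101100
01101100
01101111
00100000
01110111
01101111
01110010
01101100
01100100
00100001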
(Wow. With a bit more work, I could turn that into an old-style, 8-level paper tape.)
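The B format also works in the pack direction, which gives a quick way to move between a number and its bit string. A small sketch, assuming ASCII for the final chr:

```perl
use strict;
use warnings;

# Number to bit string: pack as one byte, then unpack the bits.
my $bits = unpack "B8", pack("C", 104);            # "01101000", the letter h

# Bit string back to a number: the reverse trip.
my $number = unpack "C", pack("B8", "01101000");   # 104

print "$bits $number ", chr($number), "\n";
```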
Another really useful format is A, denoting a space-padded ASCII string, used nearly always with a specific width:
my $value = pack "A10", "some";
The output value will be the string some followed by six spaces. The value is truncated if necessary. Replacing the A with a lowercase a results in a NUL-padded value. Using Z insists that there be at least one NUL, so Z10 stores up to nine characters from the value, reserving the last character for a NUL. When you use these formats in an unpack, the corresponding value will have trailing spaces trimmed for A, and everything from the first NUL onward trimmed for Z. The a format does no trimming at all.
For example:
my ($hello, $world) = unpack "A6 A5", "Hello world";
In this case, Hello (with a trailing space) is considered for the first output, but the trailing space is stripped, resulting in Hello and world for the two values. For a6 a5, the trailing space would have been kept.
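The differences among A, a, and Z are easiest to see in hex. The following sketch packs the same word three ways and then unpacks one made-up six-byte field three ways; the field contents and variable names are mine:

```perl
use strict;
use warnings;

# Pack "some" three ways and show each result as hex.
my %hex;
for my $format (qw(A6 a6 Z4)) {
    $hex{$format} = unpack "H*", pack($format, "some");
    printf "%-2s => %s\n", $format, $hex{$format};
}
# A6 pads with spaces (20), a6 pads with NULs (00), and Z4 keeps
# only "som" so that the final byte can be a NUL.

# Unpack one six-byte field three ways.
my $field = "hi \0\0 ";
my ($trimmed) = unpack "A6", $field;   # trailing spaces and NULs go
my ($raw)     = unpack "a6", $field;   # nothing is stripped
my ($c_style) = unpack "Z6", $field;   # stops at the first NUL

printf "%d %d %d\n", length($trimmed), length($raw), length($c_style);
```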
What if you wanted five characters, skip one, and get the next five? Just throw in an x to skip over the unwanted byte:
my ($hello, $world) = unpack "a5 x a5", "Hello-world!";
You can also skip to an absolute position with @:
my ($world) = unpack '@6 a5', "Hello world";
By skipping to position six (numbered starting from 0), we’ll start picking up characters with the w.
We can skip to the end of the string with x*, and then back up one or more characters with X:
my ($last) = unpack 'x* X a', "Hello world!";
$last ends up with the last character of the string. You can even use X to interpret the same byte two different ways:
my @pairs = unpack '(a X C)*', "Hello";
Now you’ll get pairs out for each character in the string, consisting of the original character (from a), and then its byte value (from C), because we back up between interpreting the two formats.
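Putting the positioning formats together, here is a sketch that picks apart an invented fixed-width record. The record layout is hypothetical, and the templates containing @ are single-quoted so that Perl does not try to interpolate an array:

```perl
use strict;
use warnings;

# An invented record: 10-byte name, 1 filler byte, 5-byte code.
my $record = "Widget    :A1234";

# x steps over the filler byte between the two fields.
my ($name, $code) = unpack "A10 x A5", $record;

# @11 jumps straight to the code, ignoring everything before it.
my ($code_only) = unpack '@11 A5', $record;

# X backs up one byte, so the first byte of the code is read twice:
# once as a character and once as a number.
my ($letter, $byte) = unpack '@11 a X C', $record;

print "$name/$code/$code_only/$letter/$byte\n";
```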
The output of the Unix who command consists of an 8-character, trailing-space-padded username field, followed by an 8-character, trailing-space-padded terminal (TTY) field, followed by the date and time of login. Using unpack, you can easily pull the lines apart:
foreach (`who`) {
  chomp; # throw away trailing newline
  my ($user, $tty, $time) = unpack "A8 A8 A*", $_;
  # do something with $user, $tty, and $time here
}

However, who is merely interpreting the information from the utmp file, which on my system is defined by a C struct that looks like:
#define UT_NAMESIZE 8
#define UT_LINESIZE 8
#define UT_HOSTSIZE 16
struct utmp {
  char ut_line[UT_LINESIZE];
  char ut_name[UT_NAMESIZE];
  char ut_host[UT_HOSTSIZE];
  time_t ut_time;
};
You can translate this into a pack format rather directly. Because utmp is composed of NUL-terminated strings, you can use Z, and since time_t is a native “long” (generally), the format is something like Z8 Z8 Z16 L.
You can open up the utmp file (/var/run/utmp on my system), read it 36 bytes at a time, and unpack it to get at the data:
open UTMP, "/var/run/utmp" or die "Cannot open utmp: $!";
while (read(UTMP, my $buf, 36) > 0) {
  my ($line, $name, $host, $time) = unpack "Z8 Z8 Z16 L", $buf;
  next unless $name; # skip unused slots
  printf "%-8s %-8s %s", $name, $line, scalar localtime $time;
  printf " (%s)", $host if $host;
  print "\n";
}
close UTMP;
And there’s a working who program. Of course, your utmp structure is likely different from mine, but the principles are similar.
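If you want to experiment without touching a real utmp file, you can pack a couple of records yourself and read them back from an in-memory filehandle. This sketch assumes the same 36-byte layout as above; the user, terminal, and host values are invented, and in-memory filehandles need Perl 5.8 or later:

```perl
use strict;
use warnings;

my $template = "Z8 Z8 Z16 L";
my $size     = length(pack($template, "", "", "", 0));   # 36 for this layout

# Two invented records, packed the way the format above describes.
my $fake_utmp = join "",
    pack($template, "ttyp0", "alice", "example.com", 1_000_000_000),
    pack($template, "ttyp1", "",      "",            0);   # unused slot

# Read the records back, one fixed-size chunk at a time.
open my $fh, "<", \$fake_utmp or die "Cannot open in-memory file: $!";
my @logins;
while (read($fh, my $buf, $size) == $size) {
    my ($line, $name, $host, $time) = unpack $template, $buf;
    next unless $name;          # skip unused slots
    push @logins, "$name on $line from $host at $time";
}
close $fh;

print "$size bytes per record\n";
print "$_\n" for @logins;
```

Computing the record size from the template, rather than hard-coding 36, keeps the read loop in step if you adjust the format to match your own system's struct.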
I hope you’ve enjoyed this little trip into the world of pack and unpack. For more information, pack and unpack are described in mind-numbing detail in perlfunc, and recent versions of Perl include perlpacktut as a gentle tutorial.
Until next time, enjoy!

Randal Schwartz is the Chief Perl Guru at Stonehenge Consulting. You can reach Randal at merlyn@stonehenge.com.
