Using Filehandles with Perl

In the past two columns, I looked at using references in Perl and showed the basic syntax for creating references to arrays, hashes, scalars, and subroutines. I also described the canonical form of converting a non-reference expression into a reference, and how to use the shortcut rules to make this simpler.

Let’s take a look now at filehandles and directory handles. These handles let us look at or talk to the “outside world,” a worthwhile endeavor if we want our program to have some permanent impact on the computing environment.

First, recall that a filehandle doesn’t really have a syntax to make it a full-fledged variable. You can’t assign it, use it (directly) with local() or my(), pass it to or from a subroutine, or store it into a data structure. What does that leave? Well, there are about a dozen operations that use a filehandle or a directory handle, specified by a bareword (a sequence of alphanumeric symbols, optionally separated by double colons, like STDIN or MyPack::Output). As a refresher, that looks like this:


while (<INPUT>) {
    last unless /\S/;
    print OUTPUT $_;
}

Now, that’s a nice ordinary chunk of code. In fact, it’s a nicely useful chunk of code that copies all of the contents on the filehandle INPUT to the filehandle OUTPUT, until the first blank line. What makes this useful is that it’s the formula for “ARPA text messages,” the structure of e-mail and Usenet postings. Both of these are transmitted in three parts: a header of “meta-content,” such as subject and originating address; a single blank line; and the text of the message.

We could drop this code into a subroutine so that it can be used from various places around the program:


sub copyheader {
    while (<INPUT>) {
        last unless /\S/;
        print OUTPUT $_;
    }
}

But to use this subroutine, I have to specifically use the filehandles called INPUT and OUTPUT. What if I wanted, say, to copy MAILMSG to STDOUT? I can’t assign INPUT from MAILMSG. But I can get almost the same thing with a glob assignment:


*INPUT = *MAILMSG;
*OUTPUT = *STDOUT;
copyheader();

Now, a bit of explanation is needed here. The prefix asterisk operator in *INPUT means “access (or alter) a magic value that denotes the symbol-table entry for everything named INPUT.” Now you don’t have to know too much about how Perl stores things; suffice it to say that when we execute *INPUT = *MAILMSG, any reference to anything named INPUT is automatically redirected to the current corresponding item named MAILMSG. This is true for $INPUT, @INPUT, %INPUT, and &INPUT. We don’t care about those, but any use of the INPUT filehandle is automatically mapped to the MAILMSG filehandle instead! Now, that’s useful.

So, while the subroutine thinks it is copying from INPUT to OUTPUT, we’re actually in effect copying from MAILMSG to STDOUT. The downside is that we can no longer access anything that was originally named INPUT, so we should choose the name wisely.

And, the change is permanent. Or is it? Not if we use local() in the right way:


{
    local *INPUT = *MAILMSG;
    local *OUTPUT = *STDOUT;
    copyheader();
}

Here, the assignment that aliases “all things INPUT” to “all things MAILMSG” is done as a local operation, meaning it will be undone at the end of the enclosing block. That’s good news, because outside this block everything is as it once was (except that the MAILMSG and STDOUT filehandles are now in new positions within their respective files).
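To watch the alias come and go, here is a small self-contained sketch. The sample message, the @INPUT and @MAILMSG arrays, and the in-memory open (which assumes Perl 5.8 or later) are made up for illustration and are not part of the examples above.

```perl
use strict;
use warnings;

# Made-up sample message, read through an in-memory filehandle.
my $message = "Subject: hello\n\nBody line\n";

our @INPUT   = ('original');       # package variables that share the names
our @MAILMSG = ('from MAILMSG');

open MAILMSG, '<', \$message or die "in-memory open failed: $!";

my @seen;
{
    local *INPUT = *MAILMSG;       # alias everything named INPUT to MAILMSG
    my $first = <INPUT>;           # actually reads from MAILMSG
    push @seen, $first, $INPUT[0]; # @INPUT is @MAILMSG in here
}
push @seen, $INPUT[0];             # the alias is undone at block exit
close MAILMSG;

print "$_\n" for map { my $s = $_; chomp $s; $s } @seen;
```

The three values collected in @seen are the first header line, the element from @MAILMSG, and then the original @INPUT element again once the block has ended.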

We’ve still aliased a bit too much, though. Inside the block, @INPUT now means @MAILMSG and $INPUT means $MAILMSG, when we really needed to alias only the filehandles across. That’s just a bit trickier, with:


{
    local *INPUT = *MAILMSG{IO};
    local *OUTPUT = *STDOUT{IO};
    copyheader();
}

The {IO} suffix indicates that we don’t want the symbol access for everything named MAILMSG, but just the filehandle and directory handle named MAILMSG. The alias assignment therefore (temporarily) redirects just those, and does not point the scalar, array, hash, or subroutine named INPUT at their MAILMSG counterparts. (The local *INPUT itself still hides whatever was previously named INPUT until the block exits.) To select those other entries individually, we can use *FOO{SCALAR}, *FOO{ARRAY}, *FOO{HASH} or *FOO{CODE}, respectively. Note that if any of these have not yet been used in a program, the value will be undef, and it won’t make any sense to alias to another symbol entry.
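A small sketch of what the {IO} form does and does not touch. The sample message, both arrays, and the in-memory open (Perl 5.8 or later) are assumptions made up for this illustration.

```perl
use strict;
use warnings;

my $message = "Subject: hello\n\nBody\n";   # made-up sample message
our @MAILMSG = ('mailmsg array');
our @INPUT   = ('input array');

open MAILMSG, '<', \$message or die "in-memory open failed: $!";

my ($line, @inside);
{
    local *INPUT = *MAILMSG{IO};   # fill only the handle slot of the localized INPUT
    $line   = <INPUT>;             # reads through MAILMSG's handle
    @inside = @INPUT;              # empty: not @MAILMSG, and the outer @INPUT is hidden
}
my @after = @INPUT;                # the outer @INPUT is back once the block exits
close MAILMSG;

print "read: $line";
print "inside: ", scalar(@inside), " elements; after: @after\n";
```

Only the handle is shared with MAILMSG here; the array named INPUT is neither aliased to @MAILMSG nor permanently disturbed.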

We can also use this syntax to pass these entries into a subroutine:


copyheader(*MAILMSG{IO}, *STDOUT{IO});

sub copyheader {
    local *INPUT = shift;
    local *OUTPUT = shift;
    while (<INPUT>) {
        last unless /\S/;
        print OUTPUT $_;
    }
}

Hey, nearly like assignable filehandles now, albeit with an ugly syntax. We can simplify this even further with another trick. Nearly any place you have a filehandle, you can also stick a simple scalar variable. A normal lexical variable certainly can serve as such a simple scalar. This means we can use a pair of these to grab the subroutine’s input parameters, like so:



sub copyheader {
    my $in = shift;
    my $out = shift;
    while (<$in>) {
        last unless /\S/;
        print $out $_;
    }
}

Wow. Much cleaner. Notice the use of the filehandle read operator (angle brackets) around the outside of the scalar variable. Some of the documentation refers to this as indirect filehandles, but that’s just fancy talk. Now, these subroutines have been using existing filehandles passed in by the caller. Can we likewise return a filehandle? Sure, by using the same syntax, roughly speaking.


sub get_body_handle {
    my $filename = shift;
    local *IN;
    open IN, $filename or die "$filename:$!";
    while (<IN>) {
        last unless /\S/;
    }
    return *IN{IO};
}

{
    my $handle = get_body_handle("/home/merlyn/Mail/inbox/101");
    print "body: $_" while <$handle>;
}

Here we’re creating a local symbol-table entry for IN, which won’t mess up any global use of the same name. Then a normal open() connects the filehandle, and we read forward until we’ve found the blank line. The return passes back the filehandle portion of the symbol-table entry, and that’s captured in $handle. And nicely enough, when the $handle variable goes out of scope, the filehandle is implicitly closed, freeing up resources.

But that local *IN still bugs me. If the subroutine had needed to access @IN at the same time, we’d have been in trouble. Worse yet, all the normal problems with local come into play: If this subroutine calls another subroutine, all things named IN are still obscured. So, let’s be slightly trickier, and we’ll get all the same goodies without any of the downsides:


sub get_body_handle {
    my $filename = shift;
    my $in = do { local *IN };
    open $in, $filename or die "$filename:$!";
    while (<$in>) {
        last unless /\S/;
    }
    return $in;
}

{
    my $handle = get_body_handle("/home/merlyn/Mail/inbox/101");
    print "body: $_" while <$handle>;
}

Ah, so the first thing you might notice is that I’ve gone back to indirect filehandle notation, using a simple scalar variable. But this variable is being initialized using a do-block. Inside this do-block we’ll create a temporary symbol-table entry, then return it.

This is a very quick operation, and it is very unlikely to mess up anyone (except for signal handlers that are executed in that small window, but that situation has its own troubles).

The symbol-table name (here IN) is arbitrary. Also, if the filehandle container variable $in had gone out of scope before returning a value, the filehandle would have been closed automatically.
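To check that each call really does hand back its own independent handle, here is a sketch that swaps the mail files for in-memory strings. The helper name body_handle_for, the sample messages, and the three-argument in-memory open (Perl 5.8 or later) are assumptions for illustration only.

```perl
use strict;
use warnings;
no warnings 'once';   # the glob name IN is mentioned only once, by design

# Hypothetical variant of get_body_handle that reads from a string reference.
sub body_handle_for {
    my $text_ref = shift;
    my $in = do { local *IN };            # a fresh glob for every call
    open $in, '<', $text_ref or die "cannot open in-memory text: $!";
    while (<$in>) {
        last unless /\S/;                 # skip past the header
    }
    return $in;
}

my $first  = "Subject: one\n\nfirst body\n";
my $second = "Subject: two\n\nsecond body\n";

my $h1 = body_handle_for(\$first);
my $h2 = body_handle_for(\$second);      # does not disturb $h1

my @bodies = (scalar <$h1>, scalar <$h2>);
print @bodies;
```

Both handles stay open and positioned at their own bodies, even though they were created from the same symbol name.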

Well, we now have passing filehandles into subroutines, returning them from subroutines, and even creating local filehandles. We can also store these filehandles into an arbitrary data structure and create directory handles the same way.

For example, let’s list a directory returning the names of the 10 most recently modified files using a localized directory handle:


sub get_ten_newest_files {
    my $dirname = shift;
    my $handle = do { local *X };
    opendir $handle, $dirname or die "$dirname:$!";
    my @names = map "$dirname/$_", readdir $handle;
    @names = map { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map { [$_, (stat)[9]] }
        grep { /\d$/ }
        @names;
    splice @names, 10 if @names > 10;
    @names;
}
my @newest = get_ten_newest_files("/home/merlyn/Mail/inbox");
print "$_\n" for @newest;

Here we create a localized directory handle (as $handle), then open it onto our selected directory. After fetching all the names, I do a Schwartzian Transform (named after me, but not by me; it’s a long story) to select only the message files (names ending in a digit) and order them by descending modification time.
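The transform reads bottom to top: pair each item with its sort key, sort on the key, then strip the key off again. Here is the same shape on made-up data, sorting words by descending length instead of files by modification time. The word list and the alphabetical tie-break are assumptions for illustration.

```perl
use strict;
use warnings;

my @words = qw(pear fig banana kiwi);    # made-up sample data

my @by_length_desc =
    map  { $_->[0] }                     # 3. keep only the original item
    sort { $b->[1] <=> $a->[1]           # 2. sort by the precomputed key, descending
               or
           $a->[0] cmp $b->[0] }         #    ties broken alphabetically
    map  { [$_, length $_] }             # 1. pair each item with its key
    @words;

print "@by_length_desc\n";
```

The payoff is that the key (length here, stat in the directory listing) is computed once per item rather than once per comparison.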

Could we have also already opened all those files? I mean, can we stick the filehandles into a data structure and pass it around?

Sure enough. Let’s make the return value a 10-element list where each element is an arrayref to a two-element array of a filename and its already-opened filehandle. For grins, the filehandle will be already positioned to its body (past the header). So, here goes:


sub get_ten_newest_files {
    my $dirname = shift;
    my $handle = do { local *X };
    opendir $handle, $dirname or die "$dirname:$!";
    my @names = map "$dirname/$_", readdir $handle;
    @names = map { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map { [$_, (stat)[9]] }
        grep { /\d$/ }
        @names;
    splice @names, 10 if @names > 10;
    return map {
        my $name = $_;
        my $fh = do { local *X };
        open $fh, $name or die "Cannot open $name:$!";
        while (<$fh>) {
            last unless /\S/;
        }
        [$name, $fh];
    } @names;
}
my @newest = get_ten_newest_files("/home/merlyn/Mail/inbox");
for (@newest) {
    my ($name, $handle) = @$_;
    print "$name: $_" for <$handle>;
}

Wow, lots of stuff, but hopefully you can see the meat in the middle. For each name in @names being returned by the subroutine, we transform the name into a two-element array, the second element of which is a brand new filehandle for each filename. The main code pulls out the filenames and dumps the filehandles, which generates just the bodies.

If I’ve been following the development direction correctly, the next major release of Perl after 5.005_03 will eliminate the need for all those


my $x = do { local *X };

steps, by treating any undef variable used by open as if it has a filehandle symbol already installed. Joy!
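For readers on a later Perl: releases from 5.6 onward do let open fill in an undefined scalar with a fresh filehandle. A sketch of the body-handle routine in that style follows; the helper name body_handle_for, the sample message, and the in-memory open (which needs 5.8 or later) are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of get_body_handle that reads from a string reference.
sub body_handle_for {
    my $text_ref = shift;
    open my $in, '<', $text_ref           # $in starts out undef; open installs the handle
        or die "cannot open in-memory text: $!";
    while (<$in>) {
        last unless /\S/;                 # skip past the header
    }
    return $in;
}

my $message = "Subject: hello\n\nthe body\n";   # made-up sample message
my $handle  = body_handle_for(\$message);
my $body    = <$handle>;
print "body: $body";
```

No symbol-table name is involved at all, and the handle is still closed when the last variable holding it goes out of scope.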

I hope you’ve enjoyed this little excursion into filehandles. For further information, check the documentation that comes with Perl, as well as chapter 4 of the O’Reilly & Associates book Programming Perl, Second Edition. Until next time, enjoy!




Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com.