Finding Files Incrementally

The traditional File::Find module included with Perl is nice. I use it frequently. However, it’s got this interesting feature (or rather, limitation): it wants to be in control until it has found everything of interest. Now, that’s perfectly fine for most applications, but I’ve occasionally wanted to turn a hierarchical search “inside out.” That is, I’d set up the search, including the start directories and the “wanted” criteria, and then repeatedly call some function to get the “next” item of interest. This is similar to how you could call the find program externally:

open FIND, "find /home -atime +60 -print|";
while (<FIND>) {
  chomp;
  printf "%s aged %d days using %d blocks\n",
    $_, -A $_, (-s $_) / 1024;
}

Recently, I saw Mark-Jason Dominus, Perl hacker extraordinaire, give a talk on using streams and iterators; he pulled out an example that was very close to what I wanted. He showed off a subroutine that acted as a factory, generating subroutines that, when repeatedly invoked, provided the “next” file to be found in sequence. (It’s one of the examples from his forthcoming book, some of which can already be found at http://perl.plover.com/book/.) He then went on to show how to put this inside a filter to throw away the files that did not meet the necessary criteria.

I spent some time analyzing this and decided to take a different, but similar, approach. Invoking a subroutine repeatedly is functional but not a natural Perl construct. Iteration in Perl is more likely accomplished by reading from a filehandle, so I first pondered the idea of using a tied filehandle to hold the iterator. (If you’re not familiar with the concept of a tied object, see the perltie man page, included with standard Perl documentation.) But then I got hung up on what “seek” should mean (if supported at all), how to return structured data from a filehandle read, and how to reset the search back to the beginning.

But after a while, it dawned on me: the tie mechanism was right, but I should use a hash instead of a filehandle! A hash has a nice iteration interface in the form of calls to each. And a tied hash automatically sequences through the entire iteration when keys is called, so getting the full list was simple as well.
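
To see just how little machinery each and keys need, here’s a minimal stand-alone sketch (my own toy Counter class, not part of this month’s code): iteration over a tied hash is driven entirely by the FIRSTKEY and NEXTKEY methods, with FETCH supplying values on demand.

package Counter;
sub TIEHASH  { my ($class, $max) = @_; bless { n => 0, max => $max }, $class }
sub FIRSTKEY { my $self = shift; $self->{n} = 1; $self->{n} <= $self->{max} ? $self->{n} : undef }
sub NEXTKEY  { my $self = shift; ++$self->{n} <= $self->{max} ? $self->{n} : undef }
sub FETCH    { my ($self, $key) = @_; $key * $key }  # value: the square of the key

package main;
tie my %squares, Counter::, 3;
print "$_ => $squares{$_}\n" for keys %squares;  # 1 => 1, 2 => 4, 3 => 9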

After puttering with it a bit, I came up with the code for this month’s column. The interface provides a virtual hash. The keys of the hash, in their natural order, are the filenames meeting the criteria, appearing in the same order as they are found. The corresponding values are the results from the wanted subroutine, defaulting to an arrayref of the stat values. The criteria and starting directories are established when the hash is tied. For example, the find code above would be rewritten as:

use TieFinder;
tie my %f, TieFinder::, sub { -A > 60 }, "/home";
while (defined($_ = each %f)) {
  printf "%s aged %d days using %d blocks\n",
    $_, -A $_, (-s $_) / 1024;
}

So, let’s look at Listing One and see how I did it.

Lines 1 through 3 turn on warnings, enable compiler restrictions, and unbuffer standard output.

Line 5 pulls in the Data::Dumper routines so I can show what the hash looks like during testing.

Lines 7 through 56 form a virtual use block. I could extract all of this part of the file, stick it into TieFinder.pm somewhere along my @INC path, and replace that with use TieFinder. However, for testing purposes it was easier to include it into the same file.
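
If I did extract it, the module file would look something like this sketch (the same code, plus the true value that every module file must return when loaded by use):

# TieFinder.pm -- the contents of the braces from lines 7 to 56:
package TieFinder;
# ... TIEHASH, NEXTKEY, FIRSTKEY, and FETCH exactly as in the listing ...
1;  # a module file must end by returning a true value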

To tie a hash, we have to respect a certain set of standard method calls in the object class we’re creating. The most important of these is TIEHASH, defined in lines 10 through 18. This constructor must return a blessed object in this (or a related) class. The arguments in @_ are taken from the tie parameter list.

After shifting off the class name in line 11, we next see if the first parameter is a coderef. If so, this is the wanted subroutine, which we save in $wanted. If not, we use a default wanted, which merely returns the first parameter passed to it. (We’ll see that this is an arrayref of the stat call on the file, but I’m getting ahead of myself.) The default wanted selects every entry by returning a true value each time.
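
This optional-leading-coderef idiom is handy beyond tie; here it is in isolation (a hypothetical demo subroutine of my own, not part of the listing):

sub demo {
  my $callback = (@_ && ref $_[0] eq 'CODE') ? shift : sub { "default" };
  print $callback->($_), "\n" for @_;
}
demo(sub { uc shift }, "a", "b");  # coderef given: prints A and B
demo("a", "b");                    # no coderef: prints "default" twice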

Lines 13 to 17 construct the object as a blessed hashref, saving the wanted coderef and the original top-level dirs (useful when we reset the iterator), and “priming the pump” by setting those dirs as the current to-be-dispensed names. We put them in reverse order, because the normal behavior is to pop the queue array, and we want them to come out in the original expected order.
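
A quick stand-alone check shows why the reverse is there; popping a reversed list hands the items back in their original order:

my @queue = reverse qw(bin etc home);
print pop(@queue), "\n" while @queue;  # bin, etc, home -- original order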

Lines 20 to 41 define the NEXTKEY method, used when the each operator acts upon the tied hash or when some construct (like keys) needs to repeatedly act on the entire hash. This is the guts of the iterator, identifying files to return, directories to descend, and generally managing the flow.

The loop in lines 22 to 39 finds the first element of the queue that is suitable as a returned key. Initially, the queue will be the top-level directory (or directories) given during the tie operation. The first item is popped from the queue in line 23 and is stat’ed in line 24 (as an lstat, so we can differentiate a symbolic link from what it points at).

Lines 25 to 33 generate the descendants of a directory if needed. First, in line 26 we skip over any symbolic links, because a symbolic link pointing back at a superior directory could lead to infinite recursion. (I’m using the stat-less shortcut for the symlink-testing operator here, since I know the most recent stat was our file in question.) Lines 27 and 28 attempt to open a directory handle on the path, which will simply fail if the path is not a readable directory, keeping us from having to pre-test that possibility.
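
That underscore trick deserves a stand-alone illustration: after any stat or lstat, the special filehandle _ holds the buffered results, so further file tests against _ cost no additional system calls. A small sketch (the path is arbitrary):

my $path = "/tmp";             # any path will do for the demonstration
my @lstat = lstat($path);      # one lstat system call
print "symlink\n"   if -l _;   # reuses the buffered results: no new call
print "directory\n" if -d _;   # ditto: still no extra system call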

Lines 29 through 31 read the directory, throw away the dot and dot-dot entries, sort them alphabetically, prepend the directory path, and then reverse the order of the names. These names are pushed onto the end of the queue, meaning that they’ll be the next names to be processed. This creates a “depth-first” traversal. To get a “breadth-first” traversal, we’d put names on one end of the queue and pull them off the other end, as sketched below. A more flexible routine would allow this to be selected via a configuration parameter. We could also do some jiggering here to get pre-traversal versus post-traversal handling of the directory names themselves.
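
Here’s a hedged sketch of that breadth-first variant as a stand-alone walker (my own illustration, not part of the listing): children are appended to the back of the queue while items are shifted off the front.

sub bfs_walk {
  my @queue = @_;                  # top-level starting directories
  while (@queue) {
    my $item = shift @queue;       # FIFO: take from the front
    print "$item\n";
    next if -l $item or not -d _;  # skip symlinks and non-directories
    local *DH;
    opendir DH, $item or next;
    push @queue,                   # children go on the back
      map "$item/$_",
      sort grep { $_ ne "." and $_ ne ".." } readdir DH;
    closedir DH;
  }
}
bfs_walk(".");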

Lines 34 to 37 set up the calling of the wanted routine. First, the item is aliased into $_. Next, the wanted routine is called, passing the stat structure as a single arrayref parameter. The return value is kept in a private hash belonging to the object. If the user of the tied hash later asks for the value corresponding to the given key, we’ll need to return it, so we have to stash it in a cache hash. If the value is not true, the next in line 34 tries the next item in the queue. Otherwise, we return the item, which becomes the next key of our virtual hash. If the queue drains, we’ve used up all the keys, and we return undef in line 40 to show that.
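
Put another way, a wanted subroutine sees the world like this (a hedged stand-alone sketch of the calling convention, with an arbitrary size criterion):

my $wanted = sub {
  my $lstat = shift;        # the one argument: arrayref of lstat values
  # $_ is aliased to the full path for the duration of the call
  $lstat->[7] > 0;          # element 7 is the size: keep non-empty files
};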

Lines 43 to 48 define the FIRSTKEY method, used when the tied hash interface wants to reset the key generation to the first item. In this case, we reset the queue to the initial top directories, flush the cache hash, and call NEXTKEY to fetch the first key.

Lines 50 to 55 define the FETCH method, used when the tied hash interface wants to get a value for a given key. I inserted a debug statement here for tracing, and it was quite informative. The value comes from the cache hash, set to the return value from wanted for that particular path name. Note that a user can make an explicit call to fetch a value that might not yet have been populated, in which case we return undef. Perhaps a more robust implementation would die if the key did not exist in the cache hash.
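
That more robust FETCH would be a one-line change; a sketch of the variant (not what the listing does):

sub FETCH {
  my ($self, $key) = @_;
  die "no value cached for $key" unless exists $self->{hash}{$key};
  $self->{hash}{$key};
}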

That’s it. We now have the means to implement a hash that appears to have the results of a recursive file find operation as keys, and the corresponding values being the return values of a user-defined function, defaulting to the stat-values for that file. The user-defined function can also return a false value, indicating that a file is not desired.

Let’s play with it a bit, starting in line 58. The @ARGV array here provides the top-level directory starting points. The wanted subroutine returns either the size in bytes or, for empty files, the true-but-numerically-zero string “0e0”. This selects everything but gives us an interesting integer to look at. Other subroutines I used while testing included:

sub { -d _ ? 2 : -T _ ? 1 : 0 }

which returned 2 for directories, 1 for text files, and skipped over any binary files. In addition:

sub { /\d/ }

which kept only the files that had digits somewhere in the pathname. The $_ variable has the full path here, unlike File::Find, in which $_ has only the basename. Another choice might be:

sub { shift->[2] & 2 }

for world-writable files, looking at the mode word of the stat structure and ANDing it with 2, the world-write bit. And there’s even:

sub { sprintf "0%03o", shift->[2] & 07777 }

which selects all files but provides a text string of the permission values. Lastly, there’s:

sub { localtime shift->[9] }

which gives back the modification time as a string in local time format. The possibilities are endless.
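
In the same spirit, one more of my own (not from the original set): the file’s age in whole days, using the same 0e0 trick so that files modified today still count as found:

sub { int(-M _) || "0e0" }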

Lines 60 to 63 show an example of a complete scan using the each operator. The normal scan causes the queue to be built up and drained as needed, thus allowing us to “process” the entry as quickly as it is known by the routine. The keys are obtained by calling FIRSTKEY and NEXTKEY, and the values are fetched with FETCH. Uncomment the debug line in FETCH to see this as it happens.

Lines 65 to 68 show an example of getting just the keys by using the each operator in a scalar context to kick the iterator without fetching the value. Again, uncommenting the debug line in FETCH shows that no calls are made for the values, making this a painless run through the data; in addition, we’ve got control as soon as some of the data has been returned.

The last three demonstrations use Dumper to show that this acts as an ordinary hash with respect to the keys and values operators, even when passed as a hashref directly to Dumper. Oddly, I noticed the very last example seems to call FETCH twice for each element, at least under Perl version 5.005_03. Perhaps that will be altered in a later version.

There you have it, a powerful model for painless filewalking and a nice demonstration of using Perl’s built-in interfaces to capture the notion of an iterator.

Listing One: TieFinder

1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
5 use Data::Dumper;
7 {
8 package TieFinder;
10 sub TIEHASH {
11 my $class = shift;
12 my $wanted = (@_ && ref $_[0] eq 'CODE') ? shift : sub { shift };
13 bless {
14 wanted => $wanted,
15 dirs => [@_],
16 queue => [reverse @_],  # prime the pump
17 }, $class;
18 }
20 sub NEXTKEY {
21 my $self = shift;
22 while ( @{$self->{queue}} ) {
23 my $item = pop @{$self->{queue}};
24 my @lstat = lstat($item);
25 {
26 next if -l _;  # don't recurse on symlinks
27 local *DH;
28 opendir DH, $item or next;
29 push @{$self->{queue}},
30   reverse map "$item/$_",
31   sort grep { $_ ne "." and $_ ne ".." } readdir DH;
32 closedir DH;
33 }
34 next unless do {
35   local $_ = $item;
36   $self->{hash}{$item} = $self->{wanted}->(\@lstat);
37 };
38 return $item;
39 }
40 return undef;
41 }
43 sub FIRSTKEY {
44 my $self = shift;
45 $self->{queue} = [reverse @{$self->{dirs}}];
46 $self->{hash} = {};
47 $self->NEXTKEY;
48 }
50 sub FETCH {
51 my $self = shift;
52 my $key = shift;
53 ## warn "DEBUG: fetching $key\n";
54 $self->{hash}{$key};
55 }
56 }
58 tie my %f, TieFinder::, sub { -s _ or "0e0" }, @ARGV;
60 print "key value\n";
61 while (my ($k, $v) = each %f) {
62   print "k = $k, v = $v\n";
63 }
65 print "key\n";
66 while (defined(my $k = each %f)) {
67   print "k = $k\n";
68 }
70 print "dumper keys\n";
71 print Dumper([keys %f]);
73 print "dumper values\n";
74 print Dumper([values %f]);
76 print "dumper hash\n";
77 print Dumper(\%f);

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at: http://www.stonehenge.com/merlyn/LinuxMag/.
