
A Great Find

In last month's column, I introduced the File::Find module that's included as part of the core Perl distribution. File::Find provides a framework to recursively catalog or manipulate directories and their contents.

To use File::Find, you have to create a wanted subroutine. For every entry found beneath one or more starting directories, File::Find calls the wanted subroutine, letting it do the work of processing or skipping that entry. Except for communication through the $File::Find::prune variable, the wanted subroutine’s return value is ignored. This is called a callback model.
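As a reminder of what the callback model looks like in practice, here is a minimal sketch using only core modules. The temporary tree and the directory name skipme are invented for illustration; they are not from last month’s column.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a tiny throwaway tree so the example is self-contained.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/skipme" or die "mkdir: $!";
for my $path ("$root/a.txt", "$root/skipme/b.txt") {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh;
}

my @found;
my $wanted = sub {
    # $_ is the basename; $File::Find::name is the full path.
    if (-d $_ && $_ eq 'skipme') {
        $File::Find::prune = 1;    # do not descend into this directory
        return;
    }
    push @found, $File::Find::name if -f $_;
};

# The return value of the callback is ignored; results are collected by hand.
find($wanted, $root);
print "$_\n" for sort @found;
```

Everything the traversal produces has to be gathered by the callback itself, which is the bookkeeping that a filter-style wrapper takes over.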

Recently, Richard Clamp was inspired to write a wrapper around File::Find called File::Find::Rule that turns the actions of descending into a directory into more of a filter model. A rule object is created and a series of methods are called against it to set up ever-narrowing filters, separating those items of interest from the rest.

For example, to create a rule that finds only files (and not directories or other things) that were last accessed more than 14 days ago, we start by creating a filter:


use File::Find::Rule;
my $filter = File::Find::Rule->new;

Initially, this filter finds everything. So we must restrict it:


$filter->file; # find only files
$filter->atime('>14'); # accessed more than 14 days ago

Now we have a $filter that rejects any entry that doesn’t meet both criteria. (The default connector is and, if you want to think of it as a boolean expression.) All that remains is to give it a starting point:


my @results = $filter->in("/tmp");

The in() method takes a starting point and constructs the appropriate wanted subroutine for a call to File::Find. It gathers up all of the entries that meet the conditions and returns the result. The filter still remains, however, and can be reused on another directory:


push @results, $filter->in("/usr/tmp");

However, we could have done this all at once with:


my @results = $filter->in("/tmp", "/usr/tmp");

So far, this looks a bit messier than simply writing an appropriate wanted routine, but we can simplify it knowing about two shortcuts:

1. Nearly all filter routines can be called as class methods (rather than instance methods), which will automatically instantiate a new filter and then add the rule.

2. Nearly all filter routines return the instance directly.

The effect of these conventions is that we can “chain” most of the filter rules. For example:


use File::Find::Rule;
my $filter = File::Find::Rule->file->atime('>14');
my @results = $filter->in(qw(/tmp /usr/tmp));

Or even more simply:


use File::Find::Rule;
my @results = File::Find::Rule
  ->file->atime('>14')
  ->in(qw(/tmp /usr/tmp));

(Oddly enough, while writing this column, I found a bug that prevents this code from working for version 0.08 of the module. Hopefully by the time you read this, the author will have repaired the bug and uploaded it to the CPAN.)
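The two shortcuts can be checked side by side. The sketch below assumes File::Find::Rule is installed from the CPAN. It uses the name() rule instead of atime() so that the result does not depend on access times, and the two temporary directories are hypothetical stand-ins for /tmp and /usr/tmp.

```perl
use strict;
use warnings;
use File::Find::Rule;
use File::Temp qw(tempdir);

# Two throwaway starting directories (stand-ins for /tmp and /usr/tmp).
my @roots = (tempdir(CLEANUP => 1), tempdir(CLEANUP => 1));
for my $path ("$roots[0]/one.txt", "$roots[0]/skip.dat", "$roots[1]/two.txt") {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh;
}

# Step by step: each call narrows the same object.
my $filter = File::Find::Rule->new;
$filter->file;
$filter->name('*.txt');
my @stepwise = $filter->in(@roots);

# Chained: the class-method call creates the object, and each rule returns it.
my @chained = File::Find::Rule->file->name('*.txt')->in(@roots);

print "$_\n" for @chained;
```

Both forms build the same pair of rules, so both lists should hold the same two .txt files.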

File::Find::Rule is a generator/filter wrapper around the callback scheme of File::Find. Because of the added layer, calling File::Find directly is always going to be a bit faster, but may be harder to understand, so take your pick.

A Handy Rule

As a comparison, let’s go through each of the tasks presented in last month’s column, and see what they’d look like using File::Find::Rule rather than a hand-crafted File::Find wanted routine. Then we’ll look at a couple of the already available “plug-ins” for File::Find::Rule to simplify some common tasks.

For starters, we have the very common “print everything below a given directory” task, which translates as:


use File::Find::Rule;
print "$_\n" for File::Find::Rule->in('.');

Without any restrictions, this code generates every name.

The next utility tallies the disk blocks used by each user below a given starting point. The code builds the entire file listing and then iterates over that.


use File::Find::Rule;
my %blocks;
for (File::Find::Rule->file->in('.')) {
  next unless my @stat = stat;
  # $stat[4] is the owner's uid; $stat[12] is the number of blocks
  $blocks{$stat[4]} += $stat[12];
}
for (sort {$blocks{$b} <=> $blocks{$a}} keys %blocks) {
  printf "%16s %8d\n", scalar getpwuid($_), $blocks{$_};
}

Here, the only restriction is that entries in the directory be a file (and not something like a directory or a symbolic link). The remainder of the code is a typical “take this filename and summarize some information about it” loop.

Many of the remaining examples from last month’s column ignored the contents of CVS directories, using the $File::Find::prune flag of the wanted callback. For these entries, we either noticed it was a CVS directory and returned immediately or we continued to see if it was a file, and if so, processed the file.

In the File::Find::Rule logic, rules are normally chained together with and. We can perform a logical or using the any() method. For example, if we have three filters:


my $filter1 = File::Find::Rule->method1;
my $filter2 = File::Find::Rule->method2a->method2b;
my $filter3 = File::Find::Rule->method3;

Then we can add an alternative of any of these choices to an existing filter chain as:


$filter->any($filter1, $filter2, $filter3);

This is recursive, allowing us to construct arbitrarily complex rules. So, to get all files that aren’t within CVS directories, we can use:


use File::Find::Rule;
my $prune_if_cvs = File::Find::Rule
  ->directory->name("CVS")->prune->discard;
my $file = File::Find::Rule->file;
my @files = File::Find::Rule
  ->any($prune_if_cvs, $file)
  ->in("/cvs/bigproject1");

The first filter ($prune_if_cvs) identifies a directory named CVS. If that’s true, the prune special filter sets the prune flag, but continues to accept the file. The discard special filter always fails, causing the entry to be rejected. The second filter ($file) accepts the entry only if it is a file.

With any(), all filters are evaluated for each entry in the directory /cvs/bigproject1. If prune and discard get executed, the result is false. (In fact, it’s impossible for that branch to return true, since it ends in discard.) However, we’ll then evaluate whether the entry is a file, and if so, the filename is accepted. Of course, if the entry is a directory named CVS, both filters return false and the entry is skipped as we intended.
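To watch the prune-and-discard branch at work without a real CVS checkout, the following sketch builds a small throwaway tree and applies the same two filters. It assumes File::Find::Rule is installed; the file names are invented for illustration.

```perl
use strict;
use warnings;
use File::Find::Rule;
use File::Spec;
use File::Temp qw(tempdir);

# Throwaway tree: two real files plus CVS bookkeeping at two levels.
my $root = tempdir(CLEANUP => 1);
for my $dir ('CVS', 'src', 'src/CVS') {
    mkdir "$root/$dir" or die "mkdir $dir: $!";
}
for my $name ('README', 'CVS/Root', 'src/main.c', 'src/CVS/Entries') {
    open my $fh, '>', "$root/$name" or die "open $name: $!";
    close $fh;
}

my $prune_if_cvs = File::Find::Rule->directory->name('CVS')->prune->discard;
my $file         = File::Find::Rule->file;

my @files = File::Find::Rule->any($prune_if_cvs, $file)->in($root);

# Report paths relative to the starting point, in a stable order.
my @relative = sort map { File::Spec->abs2rel($_, $root) } @files;
print "$_\n" for @relative;
```

Only README and src/main.c should be listed: the files under both CVS directories are never visited, because the traversal is pruned as soon as each CVS directory is seen.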

Putting this into a larger context, let’s identify the total size used by the various MIME types within the CVS tree:


use File::MMagic;
my $mm = File::MMagic->new;
my %total;

use File::Find::Rule;
my $prune_if_cvs = File::Find::Rule
  ->directory
  ->name("CVS")
  ->prune->discard;
my $file = File::Find::Rule->file;

for (File::Find::Rule->any($prune_if_cvs, $file)->in("/cvs/bigproject1")) {
  my $type = $mm->checktype_filename($_);
  $total{$type}{count}++;
  $total{$type}{size} += (stat($_))[12];
  ## push @{$total{$type}{names}}, $_;
}
for (sort keys %total) {
  print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
  ## print map "  $_\n", sort @{$total{$_}{names}};
}

Again, we use any() to skip CVS directories. The resulting list of files is then used in the first foreach loop. Within the loop, the MIME type is identified and used as a key to distinguish the summarizing items. (As in the previous version of this program, the commented lines can be uncommented to include a complete list of all files that share a given MIME type.)

So far, all of these examples have created an entire list before processing the values. But just like the Unix find command, File::Find::Rule filters have an -exec “switch,” too.

The exec() filter executes a subroutine reference (a coderef), passing the basename, directory name, and the full path name as the first three parameters. For convenience, the basename of the file is also present in $_. If the subroutine returns a true value, then the name is still considered “accepted,” and the filters continue. However, we’ll typically use this as the final stage in a filter, so the return value doesn’t matter.

The advantage of using an exec() filter is most evident when we’re iterating over a large portion of the disk (a search from the root directory, for example). If we rely on the list returned by in(), we get nothing until the entire filter chain has run over all of the directories. With an exec() filter, we get the names one at a time as they are found, leading to more immediate results and a slightly more efficient use of resources.
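Here is a minimal sketch of the exec() calling convention, assuming File::Find::Rule is installed. The temporary files and the %size_of hash are invented for illustration. The callback names its three arguments and records each file’s size as soon as the file is found.

```perl
use strict;
use warnings;
use File::Find::Rule;
use File::Temp qw(tempdir);

# Throwaway tree with two files of known size.
my $root = tempdir(CLEANUP => 1);
my %content = ('a.txt' => 'hello', 'b.txt' => 'hi');
for my $name (keys %content) {
    open my $fh, '>', "$root/$name" or die "open $name: $!";
    print {$fh} $content{$name};
    close $fh or die "close $name: $!";
}

my %size_of;
File::Find::Rule
    ->file
    ->exec(sub {
        my ($basename, $dir, $fullname) = @_;
        # The current directory is $dir here, so the short name in $_ resolves.
        $size_of{$fullname} = (stat($_))[7];
        return 1;    # keep the entry "accepted"
    })
    ->in($root);

printf "%6d %s\n", $size_of{$_}, $_ for sort keys %size_of;
```

The list returned by in() is simply thrown away here; all of the work happens inside the callback while the traversal is still running.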

Rewriting that last example using exec(), we get:


use File::MMagic;
my $mm = File::MMagic->new;
my %total;

use File::Find::Rule;
my $prune_if_cvs = File::Find::Rule
  ->directory
  ->name("CVS")
  ->prune->discard;
my $file = File::Find::Rule->file;
File::Find::Rule
  ->any($prune_if_cvs, $file)
  ->exec(sub {
    my $type = $mm->checktype_filename($_);
    $total{$type}{count}++;
    $total{$type}{size} += (stat($_))[12];
  })
  ->in("/cvs/bigproject1");

for (sort keys %total) {
  print "$_ has $total{$_}{count} items with $total{$_}{size} blocks\n";
}

Most of the code is the same — we just put the first foreach loop body inside the exec() filter.

Plug-in Filters

The design of File::Find::Rule is “pluggable,” meaning that the author has provided a place to add additional filter rules should we find something lacking in the core. At the moment, the CPAN already includes four such plug-ins.

  • The ImageSize plug-in adds a filter to use the Image::Size module to select or reject images based on their size.
  • The MP3Info plug-in uses MP3::Info to allow selection of MP3 audio files based on things like artist and album name.
  • The Digest plug-in uses the Digest module to compute various digest functions, like MD5 or SHA1, helping to locate or reject files based on their digest values.

Let’s look at the fourth plug-in in a bit more detail to give you an idea of how to use the other three. The MMagic plug-in gives us a magic() filter, which accepts only those MIME types that match one or more glob patterns.

For example, a filter that finds only text files can be constructed with:


use File::Find::Rule::MMagic;
my $files = File::Find::Rule->magic('text/*');

Here, we specifically include the plug-in module. The plug-in automatically loads File::Find::Rule and extends the filter list to include magic().
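To give a rough idea of how a plug-in extends the filter list, here is a sketch of a made-up nonempty() rule. The package name and the rule are hypothetical and do not exist on the CPAN. The sketch assumes the convention described in the module’s own extending notes: a plug-in installs a method into the File::Find::Rule namespace and calls _force_object so that the rule works as either a class or an instance method.

```perl
use strict;
use warnings;
use File::Find::Rule;
use File::Temp qw(tempdir);

# Hypothetical plug-in: adds a nonempty() rule to every File::Find::Rule object.
{
    package File::Find::Rule::NonEmpty;

    sub File::Find::Rule::nonempty {
        # Turn a class-method call into a fresh object, as the built-in rules do.
        my $self = shift()->_force_object;
        # Accept the entry only if it has a nonzero size.
        return $self->exec(sub { -s $_ });
    }
}

# Throwaway tree: one empty file and one file with content.
my $root = tempdir(CLEANUP => 1);
open my $empty, '>', "$root/empty.txt" or die "open: $!";
close $empty;
open my $full, '>', "$root/full.txt" or die "open: $!";
print {$full} "some text\n";
close $full or die "close: $!";

my @found = File::Find::Rule->file->nonempty->in($root);
print "$_\n" for @found;
```

A real plug-in would live in its own file and be loaded with use, in the same way that the MMagic plug-in is loaded above.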

So, looking at the last example from last time, let’s print out all of the plain text files in the CVS tree:


use File::Find::Rule::MMagic;
my $prune_if_cvs = File::Find::Rule
  ->directory->name("CVS")->prune->discard;
my $file = File::Find::Rule->file;
@ARGV = sort File::Find::Rule
  ->any($prune_if_cvs, $file)
  ->magic('text/plain')
  ->in("/cvs/bigproject1");
while (<>) {
  print "$ARGV\t$_";
}

Again, we’re loading @ARGV, so that the final while loop can iterate over the newly-found file list, opening each file and displaying the contents.

Clamp’s Champ

I think Richard Clamp is on to something here. I like the ease with which a filter object can be constructed, and I understand that he’s also working on a command-line version of the interface. That would really be full circle: from Unix find to Perl’s File::Find, then to a rule-based syntax, and then back to the command line. Amazing.

Well, until next time, enjoy your new file finding skills!



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. You can download the Perl scripts shown in this month’s column from http://www.linux-mag.com/downloads/2003-03/perl.