Find Files Easily

Sooner or later, every Perl hacker ends up wanting to process a collection of files contained within a directory, including all the files in all the subdirectories. Thankfully, Perl comes with the File::Find module to perform this task in a tested, portable manner.
The basic File::Find interface is simple:
use File::Find;
sub wanted {
  return unless -f;
  print $File::Find::name, "\n";
}

find \&wanted, ".";
Here, File::Find exports the find() routine, which takes a coderef and a list of starting points (here, “dot” for the current directory). The mechanism inside File::Find locates all filesystem entries below — and including — each starting point and calls the subroutine referenced by the coderef for each entry. File::Find doesn’t filter anything; it’s up to the subroutine to ignore the entries that are not of interest.
The wanted() routine gets the full pathname in $File::Find::name and the basename in $_. For efficiency, the current process changes directory to the directory being examined, so either $File::Find::name or $_ can be used to access the filesystem entry being examined. However, if you want to use the names afterward, you should always collect the $File::Find::name values, because you’ll no longer be in the proper directory for $_.
In the code snippet above, wanted() returns if the entry for $_ is not a file (the -f test conveniently defaults to $_). Otherwise, the full pathname is printed followed by a newline. Hence, the output is the names of all the files within the current directory (and below), printed to stdout. The equivalent Linux find command line is:
$ find . -type f -print
find starts at “dot”, recursing and printing all files found.
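Returning to the earlier advice about keeping $File::Find::name rather than $_, here is a minimal sketch that collects names for later use. It relies only on the core File::Find module; the helper name files_under() is made up for illustration.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper (not part of File::Find): return the full
# pathnames of all plain files at or below the given starting points.
sub files_under {
    my @starts = @_;
    my @found;
    find(sub {
        return unless -f;                  # -f tests $_, the basename
        push @found, $File::Find::name;    # save the full path, not $_
    }, @starts);
    return @found;
}

# Example use, once find() has returned to the original directory:
#   my @files = files_under('.');
```

Because the saved names include the starting point, they remain usable after the traversal is over, which would not be true of the bare $_ values.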

Better Finding Through Technology

Some time ago, I wrote File::Finder to translate find commands rather directly into Perl code, which would then use File::Find behind the scenes to do the work.
The equivalent code using File::Finder (available from CPAN) looks like:
use File::Finder;
File::Finder->type('f')->print->in('.');
The type and print method calls correspond exactly to the find arguments. Only the in() call is out of order, specifying a list of starting points after the conditions are specified.
But how does it work? Calling type() on the File::Finder class returns a File::Finder object, as if you had said:
use File::Finder;
my $ff1 = File::Finder->new->type('f');
Inside the File::Finder object, the type method call has recorded a step: a coderef that will ultimately check a pathname to see if it is a file or not. The code to create this step is in the File::Finder::Steps class, automatically selected by sneaky delegation inside the File::Finder object.
Next, the File::Finder object is duplicated by the print method call, adding a second step to ultimately print the pathname in question:
my $ff2 = $ff1->print;
The value of $ff1 is untouched. In fact, we can use it as the starting point of another File::Finder rule.
At this point, $ff2 can be used as the wanted() routine in File::Find directly:
use File::Find;
find $ff2, ".";
The File::Finder object recognizes that it’s being used in a place where a coderef is wanted, and turns itself into a wanted() routine that executes the series of steps it contains. Thus, you get the series of files printed on standard output.
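One way Perl lets an object stand in for a coderef is by overloading the code-dereference operator. The toy class below sketches that idea, along with the "each call returns a new object" behavior described earlier. It is an illustration under those assumptions, not File::Finder's actual source, and the names My::Steps, add_step(), and as_wanted() here are hypothetical.

```perl
use strict;
use warnings;

package My::Steps;    # hypothetical toy class, not File::Finder itself

# Used where a coderef is expected, the object hands back a closure.
use overload '&{}' => sub { $_[0]->as_wanted }, fallback => 1;

sub new {
    my $class = shift;
    return bless { steps => [] }, $class;
}

# Return a copy with one more step; the original object is unchanged.
sub add_step {
    my ($self, $code) = @_;
    return bless { steps => [ @{ $self->{steps} }, $code ] }, ref $self;
}

# Build a closure that runs the steps in order and stops at the first
# false one, like an implicit short-circuit "and".
sub as_wanted {
    my $self  = shift;
    my @steps = @{ $self->{steps} };
    return sub {
        for my $step (@steps) {
            return 0 unless $step->();
        }
        return 1;
    };
}

package main;

my $dotted    = My::Steps->new->add_step(sub { /^\./ });
my $dotted_rc = $dotted->add_step(sub { /rc\z/ });   # $dotted keeps one step

# Calling the objects like coderefs triggers the '&{}' overload.
my @names = qw(.bashrc .profile notes.txt vimrc);
print join(' ', grep { $dotted->() } @names), "\n";      # .bashrc .profile
print join(' ', grep { $dotted_rc->() } @names), "\n";   # .bashrc
```

Copying the step list on every call is what keeps earlier objects reusable as starting points for other rules.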
But, continuing on with the example, you can also call in() on this object:
$ff2->in('.');
This does effectively the same thing, loading File::Find and calling find() with the constructed wanted routine. However, in() has an additional feature: it returns the matching names in list context, or a count of the names in scalar context:
my @names = $ff2->in('.');
Of course, we’ve printed them all. If we didn’t want them printed, we could go back to the previous File::Finder object:
my @names = $ff1->in('.');
The in() routine is actually a specialization of the gather() routine, which returns a list of the concatenated return values of the coderef executed for each entry:
my %size_of = $ff1->gather(sub { $File::Find::name => -s }, '.');
Here, for each file, the coderef runs and returns a two-element list of the name and its corresponding size. When you concatenate the resulting lists, you get key/value pairs in the right shape to initialize the hash.
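To make the gathering idea concrete, here is a rough equivalent built on core File::Find alone. It is a sketch of the concept, not how File::Finder implements gather() or in(); the helper names gather_files() and files_in() are hypothetical.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical stand-in for the gather() idea: run $code for every plain
# file and concatenate whatever lists it returns.
sub gather_files {
    my ($code, @starts) = @_;
    my @results;
    find(sub {
        return unless -f;
        push @results, $code->();    # list context keeps every value
    }, @starts);
    return @results;
}

# Hypothetical stand-in for in(): gather just the full pathnames.
# In scalar context the returned array yields the number of matches.
sub files_in {
    my @names = gather_files(sub { $File::Find::name }, @_);
    return @names;
}

# Example use:
#   my %size_of = gather_files(sub { $File::Find::name => -s }, '.');
#   my $count   = files_in('.');
```

Each call to the coderef contributes a name and a size, so the flattened list alternates keys and values, which is exactly what a hash assignment expects.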
What if you wanted more conditions, like all files that start with a dot? In find, you’d type that as:
$ find . -type f -name '.*' -print
And similarly, using File::Finder:
File::Finder->type('f')->name('.*')->print->in('.');
Again, the File::Finder representation is a straightforward translation from the find command line. The name() step takes a string that’s treated as a filename glob. If you pass a regexp object instead, you get a regexp match:
File::Finder->type('f')->name(qr/^\./)->print->in('.');
How about files that don’t begin with a dot? While you could simply change the glob to *, let’s introduce a not() instead, as in…
$ find . -type f \! -name '.*' -print
The equivalent File::Finder is similar again:
File::Finder->type('f')->not->name('.*')->print->in('.');
not() negates the test of the step that immediately follows.
The default connection between type() and name() is a logical “and.” You can spell that out directly with find’s -a:
$ find . -type f -a -name '.*' -print
The logical and here is a “short-circuit” logical and, meaning that if the left side of the and is false, the right side is ignored. Short-circuiting also controls whether the -print is executed, which we can see by adding the second -a:
$ find . -type f -a -name '.*' -a -print
You can write this expanded version in File::Finder as well:
File::Finder->type('f')->and->name('.*')->and->print->in('.');
In both cases, the and() is merely a syntax helper and doesn’t change the execution. The expression is computed from left to right, and the first false step stops the execution, preventing the pathname from being printed.
We can introduce a logical or condition, which also short-circuits. Logical or is typically used to say “everything except”:
$ find . -type f -o -print
Here, if the path is a file, the or stops, because a true value on the left keeps the expression on the right from executing. So, you end up with everything that isn’t a file. In File::Finder, there’s still a direct correspondence:
File::Finder->type('f')->or->print->in('.');
But what if you want to print all entries that are either a file or begin with a dot? Because of the relative precedence of logical and and logical or, you need to use parentheses in the find command line:
$ find . '(' -type f -o -name '.*' ')' -print
To indicate parentheses in File::Finder, add left() and right():
File::Finder->left->type('f')->or->name('.*')->right->print->in('.');
Again, a direct correspondence with the find command.
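For comparison, the same grouping can be written by hand in a plain File::Find wanted() routine, where Perl’s own and and or short-circuit the same way. This is only a sketch using the core module; the helper name files_or_dotted() is hypothetical.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: full pathnames of entries that are plain files or
# whose basename begins with a dot, in the spirit of
#   find . '(' -type f -o -name '.*' ')' -print
sub files_or_dotted {
    my @starts = @_;
    my @found;
    find(sub {
        # The parentheses group the "or" before the "and" is applied;
        # without them, "and" would bind more tightly than "or".
        (-f or /^\./) and push @found, $File::Find::name;
    }, @starts);
    return @found;
}

# Example use:
#   print "$_\n" for files_or_dotted('.');
```

If the left side of the or is true, the regexp match never runs; if the whole parenthesized test is false, the push never runs.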
The print() operation returns a true value and prints the name, which is useful to know if you chain any further steps after print().

Don’t Go There!

The find command supports a -prune option: if -prune is executed and the entry is a directory, the directory is skipped and not processed recursively.
Let’s say we’re looking at a Subversion tree, and we don’t want to descend into (or consider) any .svn directories:
$ find . -type d -name '.svn' -prune -o -type f -print
The command says: “If looking at a directory and the directory is named .svn, execute -prune.” This tells find not to descend into the directory. Moreover, since -prune also returns true, the -o skips the remaining evaluation. If the and-ed expression to the left of the -o is false and the entry is a file, its name is printed. In File::Finder, again the correspondence is straightforward:
my $prune_svn = File::Finder->type('d')->name('.svn')->prune;
$prune_svn->or->type('f')->print->in('.');
Why save $prune_svn as a separate object? You can reuse it to collect only directories:
my @dirs = $prune_svn->or->type('d')->in('.');
Being able to reuse these components also allows building the condition in manageable pieces.
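For comparison, core File::Find exposes pruning through the $File::Find::prune variable, which a wanted() routine can set to keep find() from descending into the current directory. The sketch below uses that to skip .svn directories; the helper name files_outside_svn() is hypothetical.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: full pathnames of plain files below the starting
# points, skipping the contents of any directory named .svn.
sub files_outside_svn {
    my @starts = @_;
    my @found;
    find(sub {
        if (-d and $_ eq '.svn') {
            $File::Find::prune = 1;    # do not descend into this directory
            return;
        }
        push @found, $File::Find::name if -f;
    }, @starts);
    return @found;
}

# Example use:
#   print "$_\n" for files_outside_svn('.');
```

The early return plays the role of the -o in the find command: once the prune branch has fired, nothing else is evaluated for that entry.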
You can also evaluate arbitrary Perl code at a particular step. The code is executed as part of a File::Find “wanted” operation, so it gets all the same treatment. If the code returns true, then the step is also considered true.
For example, suppose you want to make sure that symlinks point at a valid file entry. You can add a step made with eval() to check -l and not -e for dangling symlinks:
my @danglers = $prune_svn->or->eval(sub { -l and not -e })->in('.');
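The same test works in a plain File::Find wanted() routine, because -l examines the link itself while -e follows it to the target. Here is a minimal sketch with the core module only (it does not prune .svn directories); the helper name dangling_symlinks() is hypothetical.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: full pathnames of symlinks whose targets are missing.
sub dangling_symlinks {
    my @starts = @_;
    my @found;
    find(sub {
        # -l uses lstat (the link itself); -e uses stat (its target).
        push @found, $File::Find::name if -l and not -e;
    }, @starts);
    return @found;
}

# Example use:
#   print "$_\n" for dangling_symlinks('.');
```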
The eval() step also accepts File::Finder objects, allowing you to create subroutines:
my $file = File::Finder->type('f');
my $begins_with_dot = File::Finder->name('.*');

my $file_or_begins_with_dot = File::Finder
  ->eval($file)->or->eval($begins_with_dot);
my @dotfiles = $prune_svn->or
  ->eval($file_or_begins_with_dot)->in('.');
This is an alternative to using parentheses to achieve the same result, because you can consider the eval() subcomponent to be parenthesized.

It’s Only Natural

Although File::Finder operates similarly to the older File::Find::Rule, I personally find that the syntax of File::Finder is more natural. I might explain this as having spent years writing find commands, dealing with the slightly weird and/or/not/paren syntax for complex rules.
However, File::Find::Rule supports conditions that File::Finder doesn’t understand (yet!). So, to leverage the existing File::Find::Rule conditions and plugins, you can use an ffr() step with a File::Find::Rule object, and the appropriate condition is interpreted.
For example, to find images that have greater than 1000 pixels in both directions, create the File::Find::Rule object first…
use File::Find::Rule;
use File::Find::Rule::ImageSize;

my $ffr_big_images = File::Find::Rule
  ->image_x('>1000')->image_y('>1000');
… Then use that File::Find::Rule step with File::Finder:
use File::Finder;
my $big_images = File::Finder->ffr($ffr_big_images);
my %sizes = $big_images->gather(sub { $File::Find::name => -s }, 'Pictures');
I hope File::Finder finds its way into your toolkit.
Until next time, enjoy!

Randal Schwartz is the chief Perl guru at Stonehenge Consulting. You can reach Randal at merlyn@stonehenge.com.
