
Getting Some Directory Assistance

Most Perl scripts aren't doing anything glamorous. They're the workhorse of your system, moving things around and handling those mundane repetitive tasks while you aren't necessarily looking. Those tasks are often on a series of filenames, perhaps not known in advance but obtained by looking at the contents of a directory. Perl has a few primary means of getting lists of names, so let's take a look at them.

The simplest to use and understand is globbing. Globbing is what the shell does when you use echo *.c to get a list of all the C source files in a directory. The term globbing comes from the use of the old /etc/glob program in early versions of Unix, with a name derived from something like “global expansion.”

Most programs running from the shell don’t have to know how to do globbing for themselves. For example, the rm command in:


$ rm *.c

never sees the *.c. Instead, the shell expands (globs) the filename pattern, comes up with a list of names, and then hands those names to rm as its arguments. This is why the rm command can’t help you when you’ve accidentally typed a space between the asterisk and the period; it never sees the asterisk, but rather sees a list of explicit names, just as if you’d laboriously typed all of them directly.

Similarly, if you invoke your Perl program with a glob pattern on the command line:


$ my_perl_prog *.c

then your Perl program already has the expanded values, and nothing further needs to be done to process the elements of @ARGV.

However, sometimes you don’t have the luxury of having all the files already passed to your Perl program from the command line. What to do then? Use the glob operator from within Perl!


my @c_source = glob "*.c";

Here @c_source will be loaded up with all the names in the current directory that don’t begin with a dot but do end in .c, just as if I had handed that to the shell for expansion. To get all the C source files and object files, I can use either:


my @c_source_and_object = glob "*.c *.o";

or


my @c_source_and_object = glob "*.[co]";

Notice that multiple patterns can be specified in one glob by separating them with whitespace (similar to the shell) or a character-class-like entry. Another way to write the glob operator is to put angle brackets around the glob pattern:


my @c_source_and_object = <*.c *.o>;

The value between the angle brackets is interpreted as if it were a double-quoted string, so Perl variables become their current Perl values before the glob is evaluated. This lets us vary the patterns at runtime:


my %files_with; # suffix => arrayref of matching names
for my $suffix (qw(.c .o .out)) {
    $files_with{$suffix} = [<*$suffix>];
}

Here I’m creating a hash of arrayrefs, so $files_with{".o"} will be an arrayref of all matching files. Either syntax is fine; the glob named operator is a fairly recent invention, so legacy programs tend to use the angle-bracket version.

A word of caution about the angle bracket syntax: if the only thing inside the angle brackets is just a simple scalar variable, then angle brackets take on their more familiar meaning of “read a line from a filehandle.” Here the filehandle is an “indirect” filehandle, meaning that the variable contains the name of, or a reference to, a filehandle. If you’re not sure whether you’ll be getting a glob or not, always use the glob named operator.
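As a rough illustration of that caveat, here is a minimal sketch that keeps the pattern in a scalar and calls glob by name. The scratch directory and the file names are made up for the example, and it assumes the temporary directory's path contains no whitespace (glob would split the pattern there).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory with a few made-up file names (assumption for the demo).
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(main.c util.c main.o)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

# The pattern is held in a plain scalar. <$pattern> would be parsed as a
# read from an indirect filehandle, so call glob by name instead.
my $pattern  = "$dir/*.c";
my @c_source = glob $pattern;

print "$_\n" for @c_source;
```

Only the two .c files come back; main.o is left out because it does not match the pattern.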

Globbing can perform anything the shell normally does. For example, to get all the files in the current directory or any first-level nested subdirectory that end in .c:


my @many_c_files = <*.c */*.c>;

So here we’re potentially reading many subdirectories. Directories that are two levels down are still ignored however. The normal Perl globbing syntax doesn’t have an entry for “recursively descend” in spite of many modern shell extended globbing forms that can handle that.
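To see the one-level limit in action, the following sketch builds a small throwaway tree and globs it with the same two patterns. The directory layout and file names are invented for the demonstration, and the patterns are anchored at the scratch directory rather than the current directory.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Made-up tree: one file at the top, one a level down, one two levels down.
my $dir = tempdir(CLEANUP => 1);
make_path("$dir/sub/deep");
for my $name ("top.c", "sub/one_down.c", "sub/deep/two_down.c") {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

# Same shape as <*.c */*.c>, but rooted at the scratch directory.
my @many_c_files = glob "$dir/*.c $dir/*/*.c";

print "$_\n" for @many_c_files;
```

The file under sub/deep is not listed, since neither pattern reaches two levels down.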

Also, just as in the shell, files that begin with a dot will not have their dot matched by a wildcard character. Instead, the dot must be matched explicitly, giving us the easy equivalent of the “hidden file.” To get all the files, we need two separate glob patterns:


my @everything = <.* *>;

The resulting list includes all files, with or without dots. The separate lists are sorted individually but not merged. If you want the entire list sorted together, you’ve got to manage that on your own:


my @sorted_everything = sort <* .*>;

Of course, the output of glob can easily be used as the input to other operations. Here’s the equivalent of rm -i *:


for my $filename (<*>) {
    print "remove $filename? ";
    next unless <STDIN> =~ /^y/i;
    unlink $filename or warn "Cannot unlink $filename: $!";
}

As simple as globbing is to use and understand, it does not come without its drawbacks. Prior to the 5.6 release of Perl, globbing was implemented by literally forking off a C-shell behind the scenes (or a Bourne-style shell if C-shell was not available) and asking that shell to expand the globs. This had several consequences.

For one thing, the globbing syntax was actually slightly dependent on the particular shell being used behind the scenes. As long as you stayed with the simple star, question mark, square brackets stuff, you’d be fine; however, if you took advantage of curly-brace alternatives and then moved to a box without that, your program would blow up.

Second, the syntax was sensitive to shell special characters. For example, one of my “Just another Perl hacker” signatures read something like this:


print <;echo Just another Perl hacker,>;

which works because the child shell’s glob operation was terminated by the semicolon. We then began a new operation that would show up as a single filename to the shell-to-Perl interface, which then becomes the return value from the globbing operation and is dumped out to STDOUT via print — scary, considering the full security implications of passing an arbitrary string as part of a glob pattern.

Third, because the shell was a separate process, each glob incurred the expense of a fork/exec operation. This is fine if you do it once or twice in a program, but prohibitively expensive to get every file of every directory below a given large directory.

Finally, and perhaps most significantly, the classic C-shell had a fixed-size buffer for globbing expansion (roughly 10 K if memory serves me right). If you’ve ever gone into a “fat” directory (with lots of long names) and typed rm * only to be greeted with “NCARGMAX exceeded” or an equally obscure error message, you’ve seen this in action. So the C-shell can expand only so many names, but since Perl is counting on the C-shell for a complete expansion, Perl also loses.

This led most people who were wanting to write robust, efficient, and secure directory lookups to avoid glob entirely, instead jumping directly to a lower-level mechanism for directory access — the directory handle.

A directory handle is like a filehandle. You open it (with opendir), read from it (with readdir), and perhaps close it when you are done (with closedir). I say “perhaps” because directory handles, like filehandles, close automatically at the end of the program or whenever the handle is successfully reopened.

In a scalar context, readdir returns one item at a time. In a list context, readdir returns all items, again, just like a filehandle. But what items?

Well, we’ll get back the contents of the directory as a list of names. This list is not sorted in any particular order (for speed) and consists of only the basenames (everything after the final slash of a pathname) of the entries within that directory. These entries include everything: plain files, directories, and even Unix-domain sockets. They also include files that begin with a dot, in particular the mandatory entries of “.” and “..”. So to dump everything in the current directory, we could use:


opendir HERE, "." or die "Cannot opendir .: $!";
foreach my $name (readdir HERE) {
    print "one name in the current directory is $name\n";
}
closedir HERE;

The closedir isn’t necessary here but does free up a few resources that would otherwise be tied up until program’s end. The names of this listing will be in the same order and will have the same contents as an ls -f command or a find . -print if there were no subdirectories. To get just the same thing as ls with no options, we will need to toss the entries that begin with dot and sort them alphabetically:


opendir HERE, "." or die "opendir: $!";
foreach my $name (sort grep !/^\./, readdir HERE) {
    print "$name\n";
}
closedir HERE;
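The examples so far call readdir in list context. For comparison, here is a small sketch of the scalar-context form, reading one entry per call until readdir returns undef. It uses a lexical directory handle ($dh) instead of a bareword, which is a stylistic choice of this sketch, and the scratch directory and its file names are invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory with two made-up files.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

opendir my $dh, $dir or die "Cannot opendir $dir: $!";

# Scalar context: one entry per call, undef once the directory is exhausted.
# The defined() test matters, because an entry named "0" is false.
my @one_at_a_time;
while (defined(my $entry = readdir $dh)) {
    push @one_at_a_time, $entry;
}

# Go back to the start so the handle can be read a second time.
rewinddir $dh;

# List context: every remaining entry in one call.
my @all_at_once = readdir $dh;
closedir $dh;

printf "%d entries one at a time, %d all at once\n",
    scalar @one_at_a_time, scalar @all_at_once;
```

The scalar form is handy for very large directories, where building the whole list first would be wasteful.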

Because the names are simply the names within the directory, and not the full pathnames, they aren’t directly useable or testable. For example, consider this incorrect code to pick out all the directories of a given directory:


opendir THERE, "/usr" or die "opendir: $!";
foreach my $name (readdir THERE) {
    next unless -d $name; # THIS IS WRONG
    print "one directory in /usr is $name\n";
}

This is wrong because one of the names returned by readdir will be, for example, lib, which we are then testing for directory-ness as if it were in the current directory! One solution is to patch up the name to include the full path before we use it with file tests or further access. Here’s a refined solution that skips over dot-files as well, setting all directories immediately under /usr to mode 755 (read/write/execute for the owner and read/execute for group and others):


opendir THERE, "/usr" or die "opendir: $!";
foreach my $name (readdir THERE) {
    next if $name =~ /^\./; # skip over dot files
    my $fullname = "/usr/$name"; # get full name
    next unless -d $fullname;
    chmod 0755, $fullname or warn "Cannot chmod $fullname: $!";
}
closedir THERE;
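The same "prepend the directory before testing" idea can be wrapped in a small function. The sketch below is one possible shape, not the column's own code: subdirs_of is a hypothetical helper name, File::Spec is used to join the path pieces, and the scratch tree stands in for /usr so that nothing real is touched.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# subdirs_of is a hypothetical helper, not part of any module.
# It returns the sorted names of the non-dot subdirectories of $dir.
sub subdirs_of {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot opendir $dir: $!";
    my @names = grep { !/^\./ } readdir $dh;    # drop dot entries
    closedir $dh;

    my @subdirs;
    for my $name (@names) {
        # Test the full path, not the bare name relative to the cwd.
        my $full = File::Spec->catdir($dir, $name);
        push @subdirs, $name if -d $full;
    }
    my @sorted = sort @subdirs;
    return @sorted;
}

# Made-up tree: two directories and one plain file.
my $top = tempdir(CLEANUP => 1);
for my $sub (qw(lib bin)) {
    mkdir "$top/$sub" or die "Cannot mkdir $top/$sub: $!";
}
open my $fh, '>', "$top/README" or die "Cannot create $top/README: $!";
close $fh;

my @found = subdirs_of($top);
print "@found\n";
```

README is skipped because the -d test is made against its full path and it is a plain file.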

What about subdirectories? What if we wanted to examine every directory recursively below /usr looking for world-writable entries? Well, we could certainly use find for that, but in Perl it’s not much harder to write this:


use File::Find;
find sub {
    return unless -d; # is it a directory?
    return unless (stat)[2] & 2; # world writable?
    print "$File::Find::name is world writable!\n";
}, "/usr";

The first use defines the find subroutine. This subroutine expects a “coderef” as its first argument, which we’re providing by using an anonymous subroutine. The remaining arguments to find are a list of top-level starting points for which find will locate all names recursively. For each found entry, find calls the subroutine, passing the basename of the entry in $_ and the full name in $File::Find::name. In addition, the working directory has been changed to that of the entry (for speed on further file tests).

So in this example, I tested $_ to see if it was a directory, and if so, then further tested its “stat 2” element (the tricky one with the type encoded along with the permissions values) to see if the second bit from the right was set. That’s the world-writable bit. If both of those were successful tests, we print out the full name. (Printing $_ there would not be very helpful, since that’s just the basename.)
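If the bare "& 2" feels too magical, the standard Fcntl module exports symbolic names for the same bits. The following is a sketch under that assumption, aimed at POSIX-style systems; is_world_writable_dir is a hypothetical helper name, and it works on a mode value you have already obtained from stat.

```perl
use strict;
use warnings;
use Fcntl qw(:mode);    # S_ISDIR, S_IMODE, S_IWOTH, S_IFDIR, S_IFREG

# is_world_writable_dir is a hypothetical helper: it takes the mode field
# (element 2 of stat) and checks both the file type and the "other write" bit.
sub is_world_writable_dir {
    my ($mode) = @_;
    return 0 unless S_ISDIR($mode);      # do the type bits say "directory"?
    return ($mode & S_IWOTH) ? 1 : 0;    # S_IWOTH is the 0002 bit
}

# Try it on the current directory.
my $mode = (stat ".")[2];
printf "permissions of . are %04o, world-writable directory: %s\n",
    S_IMODE($mode), is_world_writable_dir($mode) ? "yes" : "no";
```

S_IMODE strips the type bits, leaving only the permission bits for display.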

Note that in its simplicity, this subroutine will actually print each name twice — once while we are looking at the directory “from above” and once when the name is passed as “dot” in the “current directory.” To reject that, you could add the following:


return if $_ eq "." or $_ eq "..";

near the beginning of the subroutine. Now we’ll get just the names, although we’d never find /usr as a world-writable directory. For that, it would take a little more sophisticated juggling.

The File::Find module is included with Perl (since Perl 5.000), so there’s no excuse not to use it whenever you think of anything to do with recursing down directories. There’s a version that does “depth first” recursion (giving you the names before the containing directory) and a mechanism for pruning the tree if you head into areas of non-interest. The version included with Perl 5.6 also has the ability to follow symlinks and provide sorted names, so check the documentation to stay up-to-date.
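Here is a brief sketch of those two features: pruning through $File::Find::prune, and the depth-first variant finddepth. The tree, the directory name skipme, and the variable names are all invented for the illustration.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Made-up tree: src/main.c is interesting, skipme/junk.c is not.
my $top = tempdir(CLEANUP => 1);
for my $sub (qw(src skipme)) {
    mkdir "$top/$sub" or die "Cannot mkdir $top/$sub: $!";
}
for my $file (qw(src/main.c skipme/junk.c)) {
    open my $fh, '>', "$top/$file" or die "Cannot create $top/$file: $!";
    close $fh;
}

# Pruning: setting $File::Find::prune tells find not to descend further.
my @kept;
find(sub {
    if ($_ eq 'skipme' && -d $_) {
        $File::Find::prune = 1;    # do not look inside this directory
        return;
    }
    push @kept, $File::Find::name if -f $_;
}, $top);

# finddepth: contents are reported before the directory that holds them,
# so the starting directory itself arrives last.
my @order;
finddepth(sub { push @order, $File::Find::name }, $top);

print "kept: @kept\n";
print "last name seen by finddepth: $order[-1]\n";
```

Note that pruning has no effect under finddepth, because by the time a directory is reported its contents have already been visited.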

I hope this directory assistance has got your number now. Until next time, enjoy!



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at: http://www.stonehenge.com/merlyn/LinuxMag/.
