Serving Up a Piece of Tar

The great thing about Web servers is that they can serve more than Web pages. They can serve stuff. Sometimes that stuff is inside tarballs, those little bundles of joy that efficiently hold many files (sometimes numbering in the thousands) for convenient transferring or archiving. A recent message on the Perl Monastery (http://www.perlmonks.org) inspired me. The person known as "Screamer" posted a little note titled "Serving tarball contents as part of your webspace." It was very short and appears in Listing One.

The great thing about Web servers is that they can serve more than Web pages. They can serve stuff. Sometimes that stuff is inside tarballs, those little bundles of joy that efficiently hold many files (sometimes numbering in the thousands) for convenient transferring or archiving. A recent message on the Perl Monastery (http://www.perlmonks.org) inspired me. The person known as “Screamer” posted a little note titled “Serving tarball contents as part of your webspace.” It was very short and appears in Listing One.

Listing One: Screamer’s Script

#!/usr/bin/perl -w
use strict;
use CGI;
use CGI::Carp qw(fatalsToBrowser);
use Archive::Tar;
my $cgi = CGI->new();
my $tar = Archive::Tar->new(‘foobar.tar.gz’);
my $file = $cgi->path_info();
$file =~ s|^/||;
print $cgi->header(), $tar->get_content($file);

This script accepts the URL http://host/cgi/tarserver/fred/barney/betty.txt and returns the file fred/barney/betty.txt from inside the tar archive foobar.tar.gz. I liked the code, but Screamer went on to give a few limitations, and I thought of a few myself.

So, I sat down and wrote a program I thought would be a bit more flexible and useful and present it in this month’s column (see Listing Two). In particular, I wanted to serve many files from many different tarballs and be able to browse the files within the tarballs instead of having to get the filename correct on the first try. As always, this is a work-in-progress (which is at the point where I want to show it off) that could use more features and once-overs before being put into production.

Listing Two: Data Sharer — Part I

1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
5 use CGI::Carp qw(fatalsToBrowser); # DEBUG only
7 ## begin config
8 my $DIR = “/home/merlyn/Web/Tarserver”;
9 sub VALID {
10 local $_ = shift;
11 /(\.tgz|\.tar(\.gz)?)\z/ && !/\A\./;
12 }
13 ## end config
15 use CGI qw(:all);
17 (my $path = path_info()) =~ s/\A\///;
18 my @path = split ‘/’, $path;
20 my @choices;
22 if (@path) { # first element must be tar.gz
23 die “bad tar name: $path[0]” unless VALID($path[0]);
24 my $tarchive = “$DIR/$path[0]“;
25 die “missing tarchive: $tarchive” unless -f $tarchive and -r $tarchive;
27 ## must look in contents now
28 my @names = do {
29 require Cache::FileCache;
31 my $cache = Cache::FileCache->new
32 ({namespace => ‘tarserver’,
33 username => ‘nobody’,
34 default_expires_in => ’10 minutes’,
35 auto_purge_interval => ’1 hour’,
36 }) or die “Cannot connect to cache”;
37 if (my $names = $cache->get($tarchive)) {
38 @$names;
39 } else {
40 require Archive::Tar;
42 die “Cannot list archive $tarchive”
43 unless my @n = Archive::Tar->list_archive($tarchive);
44 $cache->set($tarchive, \@n);
45 @n;
46 }
47 };
49 for my $step (1..$#path) {
50 @names = map /\A\/?\Q$path[$step]\E(?:\/(.*))?\z/s, @names;
51 die “no such name” unless @names;
52 if (grep !defined $_, @names) {
53 die “trailing stuff after name” if $step != $#path;
54 require Archive::Tar;
56 my $at = Archive::Tar->new($tarchive)
57 or die “Cannot open archive $tarchive”;
58 my $file = join “/”, @path[1..$#path];
59 defined(my $contents = $at-> get_content($file))
60 or die “Cannot get $file from $tarchive”;
62 require File::MMagic;
63 my $mimetype = File::MMagic-> new->checktype_contents($contents);
64 print header($mimetype), $contents;
65 exit 0;
66 }
67 }
69 {
70 my %choices = ();
71 $choices{$_}++ for map /\A([^\/]+\/?)/, @names;
73 @choices = sort keys %choices;
74 }
76 } else { # choose a top-level item
77 opendir D, $DIR;
78 @choices = sort grep VALID($_), readdir D;
79 closedir D;
80 }
82 print header(‘text/html’), start_html(‘tar server’), h1(‘tar server’);
84 ## show path
85 print “from “, a({href => url()}, “Top”);
86 {
87 my $link = “”;
88 for (@path) {
89 $link .= “/$_”;
90 print ” / “, a({href => url().$link}, escapeHTML(“$_”));
91 }
92 }
93 print br;
95 ## show sublinks
96 my $prefix = @path ? join(“/”, @path, “”) : “”;
97 print ul(map {
98 li(a({href => url().”/$prefix$_”}, escapeHTML($_)));
99 } @choices);
101 print end_html;

This CGI script uses the “path info” of a URL to specify both the name of the tarchive and the file within it that is desired. For example, the URL http://host/cgi/tarserver?foo.tar.gz/a/b would ask for the file a/b in the tarchive foo.tar.gz.

Lines 1 through 3 start nearly every program I write, enabling warnings for debugging, turning on the common restrictions (variables need to be declared, references cannot be soft references, and barewords are not allowed), and disabling the buffering on standard output.

Line 5 pulls in the CGI::Carp module for debugging, turning all those die messages into browser responses. This makes it a bit easier for me to debug through my browser. As this program is still representative, and not complete, my next step would be to use reasonable responses rather than aborting with die(). But for now, the following will suffice.

Lines 7 through 13 define two configuration items — the location of the tar files and a subroutine to select which files from that directory are permitted. The subroutine returns true for .tar files, .tar.gz files, and .tgz files; it rejects any file name that begins with a dot for safety reasons.

Line 15 pulls in the CGI module, with all the shortcuts enabled as subroutines rather than method calls. This is slightly more expensive than using the object-oriented interface, but I’m willing to trade programmer-typing time for CPU time.

Lines 17 and 18 extract the “path info” data. The CGI script will be repeatedly invoked, with the “parameter” coming from information trailing the name, as if the script name were a directory. path_info() provides the data after the name from which I strip an initial slash if present. A simple split() then pulls it apart and stores it in @path.

Line 20 declares @choices, which will contain a list of subentries below the current path (if any) for the HTML code beginning in line 82 to display.

If path information has been included, line 22 notes that, and so we’ll need to look within an archive. Obviously, this must be a tar file with a good name (validated in line 23) and reasonably readable as a file (validated in line 25).

Parsing through the contents of an archive for the list of names is relatively expensive (since it also typically involves a decompression), so we want to cache that. The @names variable declared in line 28 will most likely come from that cache, except for the first hit.

Line 29 brings in the Cache::FileCache module (found in CPAN) to handle the caching interface. Lines 31 to 36 create the cache object. We cache for only 10 minutes, because the purpose of this cache is to prevent an expensive parse of a file that is likely being browsed through many multiple hits over a short period of time. There’s no point in caching it longer, and that might cause confusion if the file is updated from time to time. Once an hour, one lucky user also gets to run the cleaning cycle to erase old cached items.

If the item is found in the cache (in line 37), then we return that list for @names. If the item is not found, we’ll have to parse the tar archive, and that’s handled in lines 40 to 45. Line 40 pulls in Archive::Tar (found in the CPAN), which also automatically uses Compress::Gzip for gzipped tar files.

Line 43 lists the archive. Line 44 stores the listing into the cache so we don’t have to do this for another 10 minutes, and line 45 returns the list to become @names.

Once we have the contents of the archive, we have to search in @names for the desired file (whose path is stored in @path). Line 49 starts to loop over each element of @path (we start at index 1 because $path[0] is the name of the archive we’re looking in).

The map() and ugly regular expression in line 50 perform some interesting magic. The match tries to find all the file names remaining in @names whose first component matches where we are in @path. If the match fails, the map ignores that file name.

If the match succeeds, then the value of $1 is retained in @names. But each item in @names has had the matching component “removed” due to the location of the parentheses. Thus, not only does @names get shorter as non-matching files get dropped, but each element of @names gets shorter as each matching prefix is removed from the element. If there’s no trailing item to match, then $1 is undef, which we’ll use in a moment.

Line 51 takes care of a path given to us that doesn’t map to any entries; if the set of names in @names ever reduces to empty, there can’t be any matches, so the path must be incorrect.

Line 52 detects any undef value in the list. If there are any, then the optional trailing part of the regular expression in line 50 must have failed, and thus we’ve matched the entire name exactly. Line 53 ensures that we didn’t match a name and still have trailing path steps (e.g., the tarchive has a/b but was asked for a/b/c). If there aren’t any, it’s time to grab the contents using Archive::Tar again. Line 56 opens the archive, then line 58 constructs the path, and line 59 grabs the contents into a scalar.

The HTTP protocol wants a MIME type delivered in the header, so we’ll bring in the File::MMagic module to help identify the type (also from the CPAN). This type is used in line 64, along with sending out the contents, and we’re done delivering the specific file.

However, if we don’t have a specific file, we’ll fall down out of the loop and set up @choices. The hash %choices created in line 70 will have all the files left over from walking the name list, as well as the top-level of any subdirectory, indicated with a trailing slash.

By the way, the trailing slash wound up not messing up any of the logic (accidentally, as I was pleasantly surprised, rather than by design).

If no path was included, lines 77 to 79 set up @choices as being all the tar files in the directory.

Line 82 prints out a normal HTML header and a simple “h1″ for a label.

Lines 84 to 93 show the “path” we’re currently browsing. The first link, labeled Top, will bring us back to the list of tar files and is also the initial entry point. Otherwise, each step in the path list becomes an individual link to bring us to that place in the hierarchy (a fairly standard and intuitive interface). I’m careful to escape the HTML entities (using the CGI.pm subroutine) in case a filename contains HTML-sensitive characters (particularly, like an ampersand).

Lines 95 through 99 show the sublinks. At the top level, this is a list of tar files in the designated directory. If one of those files is clicked, the contents are determined, and the top-level directories or files of that archive are displayed. As each directory is selected, we reuse the listing to generate the subdirectories.

If a file is selected instead, then the content is displayed, much like the normal index-generation facilities of a typical Web server. An unordered list (typically represented with bullets) is used to organize the display.

Line 101 ends the HTML output, and we’re done!

The tar archive contains information about size and timestamps, which we’re ignoring. The Archive::Tar interface can give those to us nicely, so one improvement could be to include that in the output. Also, for simple security, I’ve required all the tar files to be in the same directory, but there’s nothing stopping the path from being partially a list of qualified Unix directories, then the archive, then some more directories within the archive, then the final file. Be careful not to permit dot-dot when you do this, though, or you could end up serving your entire disk.

I hope this program illustrates a couple of concepts you can use in your own code or at least provides you with a quick and dirty tool to browse a few tar-gz’s you have lying around. Until next time, enjoy!

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at http://www.stonehenge.com/merlyn/LinuxMag/.

Comments are closed.