Mirroring Your Own Mini-CPAN

The Comprehensive Perl Archive Network, known as "the CPAN," is the "one stop shopping center" for all things Perl. This 1.2 GB archive contains over 13,000 modules for your Perl programs, as well as scripts, documentation, many non-Unix Perl binaries, and other interesting things.

Although there’s nearly always a good, fast CPAN archive nearby when you’re connected to the Net, sometimes you’re connected to the Net at different speeds (like quickly at work, but slowly at home, or vice versa), or not at all. And what do you do then when you’re like me, at 30,000 feet jetting off to yet another conference or customer site, and you realize you need a module that you haven’t yet installed on your laptop? (This is especially an issue when a deadline for a magazine column looms close.)

Well, for the past year or so, I’ve been mirroring the entire CPAN to my laptop, thanks to the permission and cooperation of the owner of one of the major archive sites (and a few carefully constructed rsync commands). But at a recent conference, someone said, “Hey, can you just burn that to a CD for me?” and I was stuck. The current CPAN exceeds the size of a CD-ROM, even though only a small portion of the files are needed for module installation!

So that got me thinking: if I brought down only the files that were needed by CPAN.pm to install the latest release of a module, how big would that be? And the answer was wonderfully surprising: a bit more than 200 MB, which easily fits on a CD-ROM.

Unfortunately, I didn’t see any clean, easy-to-use, efficient “mirror only the latest modules of the CPAN” program out there, so I wrote my own, shown in Listing One.

Listing One: A script to create your own mirror of the CPAN

1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
5 ### CONFIG
7 my $REMOTE = "http://www.cpan.org/";
8 # my $REMOTE = "http://fi.cpan.org/";
9 # my $REMOTE = "http://au.cpan.org/";
10 # my $REMOTE = "file://Users/merlyn/MIRROR/CPAN/";
12 ## warning: unknown files below this dir are deleted!
13 my $LOCAL = "/Users/merlyn/MIRROR/MINICPAN/";
15 my $TRACE = 1;
19 ## core -
20 use File::Path qw(mkpath);
21 use File::Basename qw(dirname);
22 use File::Spec::Functions qw(catfile);
23 use File::Find qw(find);
25 ## LWP -
26 use URI ();
27 use LWP::Simple qw(mirror RC_OK RC_NOT_MODIFIED);
29 ## Compress::Zlib -
30 use Compress::Zlib qw(gzopen $gzerrno);
32 ## first, get index files
33 my_mirror($_) for qw(
34   authors/01mailrc.txt.gz
35   modules/02packages.details.txt.gz
36   modules/03modlist.data.gz
37 );
39 ## now walk the packages list
40 my $details = catfile($LOCAL, qw(modules 02packages.details.txt.gz));
41 my $gz = gzopen($details, "rb") or die "Cannot open details: $gzerrno";
42 my $inheader = 1;
43 while ($gz->gzreadline($_) > 0) {
44   if ($inheader) {
45     $inheader = 0 unless /\S/;
46     next;
47   }
49   my ($module, $version, $path) = split;
50   next if $path =~ m{/perl-5}; # skip Perl distributions
51   my_mirror("authors/id/$path", 1);
52 }
54 ## finally, clean the files we didn't stick there
55 clean_unmirrored();
57 exit 0;
59 BEGIN {
60   ## %mirrored tracks the already done, keyed by filename
61   ## 1 = local-checked, 2 = remote-mirrored
62   my %mirrored;
64   sub my_mirror {
65     my $path = shift; # partial URL
66     my $skip_if_present = shift; # true/false
68     my $remote_uri = URI->new_abs($path, $REMOTE)->as_string; # full URL
69     my $local_file = catfile($LOCAL, split "/", $path); # native absolute file
70     my $checksum_might_be_up_to_date = 1;
72     if ($skip_if_present and -f $local_file) {
73       ## upgrade to checked if not already
74       $mirrored{$local_file} = 1 unless $mirrored{$local_file};
75     } elsif (($mirrored{$local_file} || 0) < 2) {
76       ## upgrade to full mirror
77       $mirrored{$local_file} = 2;
79       mkpath(dirname($local_file), $TRACE, 0711);
80       print $path if $TRACE;
81       my $status = mirror($remote_uri, $local_file);
83       if ($status == RC_OK) {
84         $checksum_might_be_up_to_date = 0;
85         print " ... updated\n" if $TRACE;
86       } elsif ($status != RC_NOT_MODIFIED) {
87         warn "\n$remote_uri: $status\n";
88         return;
89       } else {
90         print " ... up to date\n" if $TRACE;
91       }
92     }
94     if ($path =~ m{^authors/id}) {
95       # maybe fetch CHECKSUMS
96       my $checksum_path = URI->new_abs("CHECKSUMS", $remote_uri)->rel($REMOTE);
97       if ($path ne $checksum_path) {
98         my_mirror($checksum_path, $checksum_might_be_up_to_date);
99       }
100     }
101   }
103   sub clean_unmirrored {
104     find sub {
105       return unless -f and not $mirrored{$File::Find::name};
106       print "$File::Find::name ... removed\n" if $TRACE;
107       unlink $_ or warn "Cannot remove $File::Find::name: $!";
108     }, $LOCAL;
109   }
110 }

Lines 1 through 3 start nearly every long program that I write, enabling warnings, compiler restrictions, and disabling buffering on STDOUT. Lines 5 through 17 form the configuration section of this program. There are really only three things to set here:

  • $REMOTE is the URL prefix leading to the nearest CPAN archive. The uncommented value is the main United States CPAN archive. The next value is the Finland archive, which also happens to be the master archive. If you want the most up-to-date sources, use the U.S. and Finland URLs. And because I was initially developing this program at the annual SAGE-AU conference in Australia, the value following Finland’s is the Australian CPAN archive. Finally, I have a complete CPAN archive on my laptop’s disk already, so I can point to that with a file: URL as well, as shown by the fourth value.

  • $REMOTE is the source, so we need to define a destination, $LOCAL. This is a simple Unix path. If you’re on a non-Unix system, you can specify this in the local directory syntax, since we’ll be using the cross-platform File::Spec library to manipulate this path. And, as the comment warns, this program owns the contents of that directory, and is free to delete anything it sees fit, so keep that in mind when you choose a path.

  • Finally, a simple true/false $TRACE flag decides whether this program is noisy by default or quiet by default. The noise is limited to actual activity, and reassures me that something is happening during execution.

Next, from lines 20 to 30, we’ll pull in the necessary modules. The standard Perl bundle gives us the mkpath, dirname, catfile, and find routines. The optional CPAN-installable LWP library gives us the URI object module and the mirror routine (and some associated status values). And Compress::Zlib lets us expand the gzip-compressed index file so we know what distributions are needed for the mirror.

Once we’ve got everything set up, it’s time to transfer everything needed for a typical operation of the core CPAN module (described by perldoc CPAN in a typical Perl installation). First, we need the index files, defined in lines 34 to 36. We’ll call my_mirror (defined later) on each of those. For now, we’ll presume that this creates or refreshes each of those files below the $LOCAL-identified directory.

The 02packages.details.txt.gz file is a flat text file with a short header that contains the path to each distribution for each module in the CPAN. However, this file is gzip-compressed, so we need to expand the file to process the contents. Stealing the example out of the Compress::Zlib man page nearly directly, lines 40 to 52 expand this file and extract the necessary information.

Line 40 constructs the filename in a platform-independent way by using the catfile routine. Note that we’re actually passing three parameters. The first parameter is the value of $LOCAL, which serves as the starting point, from which we descend further to the subdirectory called modules, and finally to a file within that directory called 02packages.details.txt.gz. I’ve tested this only on Unix, but I’ll presume that the program is portable to other platforms, because I’ve used the portable functions.
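As a rough illustration of what catfile is doing in lines 40 and 69, here is a small sketch. The /tmp/minicpan/ directory and the Foo-1.0 distribution path are made up for the example; only catfile itself comes from the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec::Functions qw(catfile);

# Hypothetical local mirror root, standing in for $LOCAL.
my $local = "/tmp/minicpan/";

# Fixed components, as in line 40 of the listing.
my $index = catfile($local, qw(modules 02packages.details.txt.gz));

# A slash-separated partial URL split into components, as in line 69.
my $dist = catfile($local, split "/", "authors/id/X/XY/XYZ/Foo-1.0.tar.gz");

print "$index\n$dist\n";
```

On a Unix system both results come out as ordinary slash-separated paths, with the doubled slash after $local cleaned up. I would expect other platforms to get their native separators instead.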

Line 41 takes this constructed path, and creates a Compress::Zlib object, which can be asked to deliver the uncompressed file line-by-line. If that fails, we’re in an unrecoverable state, and we’ll abort.

The data contains a header, delimited by a blank line, so we need to skip over all the data up to and including that blank line. We’ll do this by setting a flag to an initial 1 value in line 42. Line 43 reads a line at a time into $_, stopping when there is no more data (or there’s an I/O error). Lines 44 to 47 look for the end of the header as long as we’re still in the header. A header ends on a line that doesn’t contain a non-blank character, hence the unless.

If we make it to line 49, we’re staring at a standard line from the index, which looks something like

Parse::RecDescent 1.80

The first column is the module name (here Parse::RecDescent), and is not very interesting to us. Neither is the second column, which is the current version number. But the third column contains (the unique part of) the path to the distribution for this module, and that’s what the CPAN module will be looking for, and what we need to mirror.

Note that many module names will share the same common distribution file, so we’ll need logic to avoid downloading duplicates. We’ll defer that problem to the my_mirror subroutine.
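To make the header skipping and the duplicate problem concrete, here is a minimal sketch that runs the same kind of loop over an in-memory string instead of the gzip stream. The index rows and the distributions helper are invented for the illustration, and the %seen hash is a simpler stand-in for the bookkeeping that my_mirror really does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up text in the general shape of the packages index:
# header lines, one blank line, then "module version path" rows.
my $index = <<'END';
File:         02packages.details.txt
Description:  hypothetical sample data

Foo::Bar        1.02  X/XY/XYZ/Foo-Bar-1.02.tar.gz
Foo::Bar::Util  0.40  X/XY/XYZ/Foo-Bar-1.02.tar.gz
Baz             undef A/AB/ABC/Baz-0.01.tar.gz
strict          1.02  X/XY/XYZ/perl-5.8.0.tar.gz
END

# Return each distribution path once, in order of first appearance.
sub distributions {
    my ($text) = @_;
    my $in_header = 1;
    my (%seen, @paths);
    for my $line (split /\n/, $text) {
        if ($in_header) {
            # the first blank line ends the header
            $in_header = 0 unless $line =~ /\S/;
            next;
        }
        my ($module, $version, $path) = split ' ', $line;
        next unless defined $path;        # ignore malformed rows
        next if $path =~ m{/perl-5};      # skip Perl distributions
        push @paths, $path unless $seen{$path}++;
    }
    return @paths;
}

print "$_\n" for distributions($index);
```

Two module rows share one distribution file here, so only two paths survive, and the row pointing at a Perl distribution is dropped.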

A few of the modules are listed as belonging to a core Perl distribution. To avoid mirroring the various Perl distributions (and wasting space in our mirror), we’ll skip over them in line 50. The regular expression is somewhat ad-hoc, but seems to do the right thing.

Line 51 mirrors the requested distribution into our local mirror. The 1 parameter says “if it already exists, it’s up to date,” and is an optimization based on external knowledge that a given distribution will never be updated in place. Rather, a new file will be created with a new version number. Of course, like any optimization, we do this with some hesitation and a bit of caution.

Once we’ve passed through the entire module list, we need to delete any outdated modules. A CPAN contributor has the option of leaving older versions of modules in the CPAN, or deleting them. To keep in sync with the master archive we need to keep track of everything that is current, and delete anything not mentioned. And that’s it, as line 57 confirms.

But of course, that’s not the whole story. We need to manage the mirroring. There are two steps to mirroring: fetching the files, and throwing away anything left over. These need to share a common hash, which we’ll define as a closure variable inside a BEGIN block starting in line 59. The %mirrored hash in line 62 is keyed by the filename, and has a value of 1 to indicate that the file has been at least checked for existence, and 2 to indicate that it has been mirrored from the remote site and brought up to date. At the end of the run, any file under $LOCAL that has neither a 1 nor a 2 recorded for it is either a distribution that has since been deleted from the CPAN or a leftover temp file, and should be deleted from our mirror.
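The shared-hash idea can be sketched in isolation like this. The note_file and is_stale names are hypothetical helpers for this example, not part of Listing One, but the 1-versus-2 levels and the "never downgrade" rule mirror what lines 74 to 77 do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

BEGIN {
    # Visible only to the two subs below.
    # filename => 1 (checked locally) or 2 (fully mirrored)
    my %seen;

    # Record a file at the given level, never lowering an existing level.
    sub note_file {
        my ($file, $level) = @_;
        $seen{$file} = $level if ($seen{$file} || 0) < $level;
        return $seen{$file};
    }

    # True if the file was never noted, so a cleanup pass may delete it.
    sub is_stale {
        my ($file) = @_;
        return !$seen{$file};
    }
}

note_file("modules/02packages.details.txt.gz", 2);
note_file("authors/id/X/XY/XYZ/Foo-1.0.tar.gz", 1);
note_file("modules/02packages.details.txt.gz", 1);    # stays at 2

for my $file ("modules/02packages.details.txt.gz", "old/Foo-0.9.tar.gz") {
    printf "%s: %s\n", $file, is_stale($file) ? "stale" : "keep";
}
```

Because the block is a BEGIN block, the hash is set up at compile time, so the subroutines can safely be called from code that appears earlier in the file, which is how Listing One is laid out.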

The my_mirror routine starting in line 64 does the hard work. The two parameters are the partial URL path and the “skip if present” flag.

In line 68, we use the URI module to construct the full URL, based on the $REMOTE value and the partial path. Line 69 constructs the local file path, based on $LOCAL and the partial path as well. The task for the remainder of the subroutine is to make the local file be up to date with respect to the remote URL.

Line 70 manages the checksum file. Each distribution is checksummed to ensure proper, complete transfer. We’ll first pretend that the checksum file doesn’t need updating, but later remove that assumption if we end up transferring the distribution file.

Starting in line 72, we look at what to do to bring this file up to date. If $skip_if_present is true, then we’ll never worry about the remote timestamp being out of sync. If the file is present, it’s good enough, noted by the -f flag in line 72. Line 74 records that the file was at least checked for existence, so we don’t delete it during the cleanup phase.

If $skip_if_present is not true, or the file doesn’t exist, then it’s time to do a full mirror on this distribution. We note that in line 77. Line 79 creates the directory to receive the file. (I would argue that LWP should do this for me, but that’s not the way it works.) The $TRACE value causes a series of mkdir command-lines to be traced to the output; otherwise, this operation is silent. Line 80 also puts out some noise if $TRACE is set. Note the absence of a newline, because we’re going to follow on with a result status.

Line 81 is where the real work happens. We’ll call mirror to bring the remote URL to the local file. This is done in such a way that the existing modification timestamp (if any) is noted and respected, minimizing the load on the remote server. And, the file is actually written into a temp file, and then renamed only when the transfer is complete, thus ensuring that other users of this directory will not see partially transferred files at normal locations. (If one of these transfers aborts mid-way, the cleanup phase at the end of this program will delete the partial transfer.) The modification time is also updated to reflect the remote data, so that a later mirror will again note that the file is up to date.

The result of mirror is an HTTP status value. If it’s RC_OK, then we’ve got a new version of the remote file. In this case, the checksum file may now be out of date: we can’t merely check for its existence, so we’ll flag that by setting the variable to 0 in line 84.

If the response is RC_NOT_MODIFIED, then we already had an up-to-date version of the file, and the remote server has informed us of such without even sending us a new version. In that case, we end up in line 90, finishing out the tracing message if needed.

However, if the status is neither of these, then something wrong has happened, and we’ll generate a warn noting the status, and abort any further operation on this path by returning from the subroutine.

Once the distribution has been transferred, it’s time to grab the checksum file. If the path is a distribution (checked in line 94), we’ll compute the path to the CHECKSUMS file in lines 95 and 96. We must be careful to perform URL calculations here, not native path calculations. And, to keep the algorithm easy, we need to compute the path relative to the original CPAN mirror base, not a full path. Thankfully, this is also trivial with the URI module.
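Here is the same pair of URL calculations pulled out on their own, as a sketch that assumes the URI module is installed. The base URL and the distribution path are the ones that appear in the listing and in the sample output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI ();

my $remote = "http://www.cpan.org/";
my $path   = "authors/id/M/MA/MARCEL/Devel-SearchINC-0.01.tar.gz";

# Partial path plus base gives the full URL (line 68).
my $remote_uri = URI->new_abs($path, $remote)->as_string;

# "CHECKSUMS" resolved against the distribution URL replaces the last
# path segment; rel() then strips the base again (line 96).
my $checksum_path =
    URI->new_abs("CHECKSUMS", $remote_uri)->rel($remote)->as_string;

print "$remote_uri\n$checksum_path\n";
```

The second value is a path relative to the mirror base, which is exactly the form my_mirror expects as its first argument.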

In line 97, if we’re not already looking at a CHECKSUMS file, we need to call back to ourselves to transfer the file. This is a clean tail-recursion, so I could have simply used a goto or a loop, but the subroutine call seemed easier and clearer at the time. If the checksum might already be up to date, it will merely be checked for its presence. If a transfer has taken place, a full mirror call will be issued instead.

Finally, we have the cleanup phase routine. We’ll start at $LOCAL using the File::Find recursion. If a file exists, and it’s not noted as such in the %mirrored hash (line 105), then we remove it (line 107).
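If you want to try the cleanup idea without risking a real directory, a sketch along these lines works in a throwaway temp directory. The clean_unlisted helper and the two file names are made up for the example, and I am assuming a Unix-style system, where the keys built with catfile match what File::Find reports in $File::Find::name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find qw(find);
use File::Spec::Functions qw(catfile);
use File::Temp qw(tempdir);

# Scratch directory, removed automatically when the program exits.
my $root = tempdir(CLEANUP => 1);

# Create two empty files: one we will claim to have mirrored, one not.
for my $name (qw(keep.tar.gz stale.tar.gz)) {
    open my $fh, ">", catfile($root, $name)
        or die "Cannot create $name: $!";
    close $fh;
}

my %mirrored = (catfile($root, "keep.tar.gz") => 2);

# Delete every plain file below $dir that is not a key of %$keep.
sub clean_unlisted {
    my ($dir, $keep) = @_;
    my @removed;
    find(sub {
        return unless -f $_ and not $keep->{$File::Find::name};
        if (unlink $_) {
            push @removed, $File::Find::name;
        } else {
            warn "Cannot remove $File::Find::name: $!\n";
        }
    }, $dir);
    return @removed;
}

my @removed = clean_unlisted($root, \%mirrored);
print "removed: $_\n" for @removed;
```

The important detail is that the hash keys and $File::Find::name have to be spelled the same way, or everything looks unmirrored and gets deleted.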

And there you have it. Set up the configuration, and let it rip. Some sample output from the script is shown in Figure One. On the first execution, you will want to be on a fast link (or a relatively unloaded time of day), because it downloads about 200 megabytes of data. After that, it’s about 2-5 minutes (average) per day on a 28.8 link, which is completely tolerable for me from my hotel room when I’m on the road. And don’t forget: you’re downloading only installable modules, not the rest of the CPAN.

Figure One: Mirroring your own CPAN

mkdir /Users/merlyn
mkdir /Users/merlyn/MIRROR
mkdir /Users/merlyn/MIRROR/MINICPAN
mkdir /Users/merlyn/MIRROR/MINICPAN/authors
authors/01mailrc.txt.gz ... updated
mkdir /Users/merlyn/MIRROR/MINICPAN/modules
modules/02packages.details.txt.gz ... updated
modules/03modlist.data.gz ... updated
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/M
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/M/MA
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/M/MA/MARCEL
authors/id/M/MA/MARCEL/Devel-SearchINC-0.01.tar.gz ... updated
authors/id/M/MA/MARCEL/CHECKSUMS ... updated
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/D
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/D/DU
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/D/DU/DUNCAND
authors/id/D/DU/DUNCAND/CGI-Portable-0.47.tar.gz ... updated
authors/id/D/DU/DUNCAND/CHECKSUMS ... updated
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/M/MI
mkdir /Users/merlyn/MIRROR/MINICPAN/authors/id/M/MI/MIYAGAWA
authors/id/M/MI/MIYAGAWA/abbreviation-0.02.tar.gz ... updated

To use this mini-CPAN mirror with CPAN.pm, you’ll need to enter at the CPAN prompt:

o conf urllist unshift file://$LOCAL
o conf commit
reload index

Here, $LOCAL is replaced by the value you’ve set in $LOCAL but specified as a URL path (forward slashes for directory delimiters, and percent-escaped unusual characters). That’s because CPAN.pm is expecting a URL, not a file path.
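If you would rather compute that URL than type it, a minimal sketch for a Unix-style absolute path might look like the following. The path_to_file_url helper is hypothetical and deliberately simple: it assumes forward slashes already separate the directories and that the path is a plain byte string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn an absolute Unix path into a file: URL, percent-escaping
# every character outside the URL "unreserved" set in each segment.
sub path_to_file_url {
    my ($path) = @_;
    my @parts;
    for my $segment (split m{/}, $path) {
        (my $escaped = $segment) =~
            s/([^A-Za-z0-9._~-])/sprintf("%%%02X", ord($1))/ge;
        push @parts, $escaped;
    }
    return "file://" . join("/", @parts);
}

print path_to_file_url("/Users/merlyn/MIRROR/MINICPAN/"), "\n";
print path_to_file_url("/tmp/my mirror"), "\n";
```

The result of the first call is what goes after "unshift" at the CPAN prompt. On a non-Unix system the local directory syntax would need converting to forward slashes first, which this sketch does not attempt.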

At the risk of repeating myself: this won’t make CPAN installations any faster, unless you happen to be a road-warrior like me, needing to do CPAN installations when you are on a very slow net link (or no link at all). Of course, you could burn a daily CD for your friends, and “hand them a CPAN archive on a disk,” providing a gateway between your bandwidth and the sneakernet. At least you won’t have to worry about how to fit the full 1.2+ GB CPAN on a CD-ROM!

Until next time, enjoy!

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. Code listings for this month’s column can be found at http://www.stonehenge.com/merlyn/LinuxMag/ and at http://www.linux-mag.com/downloads/2002-11/perl.
