Browsing a Local CPAN Mirror

Last month, I showed how to fetch a subset of the CPAN (Comprehensive Perl Archive Network) to create a local mini-mirror. The subset included just the latest distribution of each module, plus the index files, so that the CPAN.pm module could install and update your local modules.

I can use the mini-mirror to install CPAN modules when I’m disconnected from the net (like when I’m on a cruise ship for Geek Cruises, or at 30,000 feet, jetting off to another Perl training site).

But I often find myself just browsing through the recent additions to the CPAN to see what’s new, what’s cool, and what’s being updated. That’s easy to do online, because http://search.cpan.org provides a “Recent Additions” link. But offline, the data is much less readily available as the RECENT file shows only a few days of past activity. Worse, my mini-mirror doesn’t download either the RECENT file or any of the README extractions for the distributions.

So, I started wondering if there was a way I could use just my mini-CPAN and still browse the newest distributions, or even better, dump out the README files for those distributions if they existed. Surely, the information was there in the form of the mirrored timestamps on the distributions themselves. And the README files, while not extracted, were certainly present inside the tar.gz files. And that led me to create the program presented in Listing One.

Listing One: Browse your local mini-CPAN – Part 1

1   #!/usr/bin/perl -w
2   use strict;
3   $|++;
5   ### CONFIG
7   my $LOCAL = "/Users/merlyn/MIRROR/MINICPAN/";
11  ## core -
12  use File::Spec::Functions qw(catfile devnull);
13  use Safe qw();
15  ## Compress::Zlib -
16  use Compress::Zlib qw(gzopen $gzerrno);
18  ## Archive::Tar -
19  use Archive::Tar qw();
21  my $days_ago = 0;
22  for my $distro (get_distro_sorted_by_age()) { # list of hashrefs
24    ## paging by days old
25    unless ((time - $distro->{modtime})/86400 < $days_ago + 1) {
26      print "[more]\n";
27      <STDIN>;
28      print ++$days_ago, " days ago:\n";
29    }
31    show_distro($distro);
32  }
34  exit 0;
36  sub get_distro_sorted_by_age {
37    my %seen;
38    my @distros;
40    my $details = catfile($LOCAL, qw(modules 02packages.details.txt.gz));
41    for (uncompress_and_grab_after_blank($details)) {
42      my ($module, $version, $path) = split;
43      next if $path =~ m{/perl-5}; # skip Perl distributions
45      next if $seen{$path}++;
47      ## native absolute file:
48      my $local_file = catfile($LOCAL, split "/", "authors/id/$path");
50      push @distros, {
51        filename => $local_file,
52        path => $path,
53        module => $module,
54        modtime => (stat($local_file))[9],
55      };
56    }
57    ## return distros sorted by descending modtimes
58    sort {$b->{modtime} <=> $a->{modtime}} @distros;
59  }
61  sub show_distro {
62    my $distro = shift;
64    my $data = get_module_data();
65    my $at = get_archive_tar_for($distro->{filename}) or return;
67    my $description = $data->{$distro->{module}}{description} || "";
68    print "$distro->{path} ($description)\n";
70    my @readmes = sort grep m{/README\z}, $at->list_files();
72    for my $readme (@readmes) {
73      print "| $readme\n";
74      my $content = $at->get_content($readme);
75      for ($content =~ /([^\cM\cJ]*)\cM?\cJ?/g) {
76        print "| | $_\n";
77      }
78    }
80  }
82  BEGIN {
83    my $data; # cached value
85    sub get_module_data {
86      $data ||= do {
87        my $modlist = catfile($LOCAL, qw(modules 03modlist.data.gz));
88        no strict;
89        my $ret = Safe->new("CPAN::Safe1")->
90          reval(join("",
91                     uncompress_and_grab_after_blank($modlist),
92                     "CPAN::Modulelist->data"));
93        die $@ if $@;
94        $ret;
95      };
96    }
97  }
99  sub uncompress_and_grab_after_blank {
100   my $file = shift;
101   my $inheader = 1;
102   my @return = ();
104   my $gz = gzopen($file, "rb") or die "Cannot open $file: $gzerrno";
106   while ($gz->gzreadline($_) > 0) {
107     if ($inheader) {
108       $inheader = 0 unless /\S/;
109       next;
110     }
112     push @return, $_;
113   }
114   @return;
115 }
117 sub get_archive_tar_for {
118   my $filename = shift;
119   my $at = eval {
120     local *STDERR;
121     open STDERR, ">".devnull();
122     Archive::Tar->new($filename) or die "Archive::Tar failed on $filename\n";
123   };
124   warn $@ if $@;
125   $at;
126 }

First off, lines 1-3 start nearly every Perl program I write, enabling warnings, compile-time restrictions, and unbuffering STDOUT.

Line 7 is the only configuration parameter of this script. It’s a Unix path to the top of my CPAN mirror (the directory with subdirectories authors and modules immediately within it). You can use either a full CPAN mirror, or the mini-CPAN mirror created with last month’s code.

Lines 11 to 19 pull in the various modules needed for the program. From the core distribution, we need the catfile() and devnull() functions of File::Spec::Functions, and the Safe module (reasons explained later). From the Compress::Zlib module (found in the CPAN), we need the gzopen() function and the $gzerrno variable. And finally, we’ll pull in the Archive::Tar module, also found in the CPAN.

The main program lives in lines 21 to 34. We’re going to page the output by days. The first output is everything that was uploaded within the past 24 hours. The program then pauses, and waits for me to hit RETURN before showing the second day’s result, and so on. The $days_ago variable defined in line 21 manages this part of the process.

Lines 22 to 32 loop over each “distro”, consisting of a hash reference to the particulars of a given distribution. We call a subroutine (defined later) in line 22 to get all the distros as a big list, sorted by age from newest to oldest.

The modification age in days of each distro is computed in line 25 and compared to the current upper boundary. If we’ve gone too far, line 26 presents a prompt, and line 27 waits for a RETURN (discarding the actual input). A heading printed in line 28 lets us know how far back we’ve gone.
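The paging test is easier to see with the clock frozen. This is only a sketch: the fixed $now value, the sample modification times, and the age_in_days helper are made up for illustration, and the interactive prompt is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: whole days between a modification time and "now".
sub age_in_days {
    my ($now, $modtime) = @_;
    return int(($now - $modtime) / 86400);    # 86400 seconds per day
}

my $now = 1_000_000_000;                      # frozen "current time" for the demo
# Made-up modification times, already sorted newest first.
my @modtimes = map { $now - $_ } (3_600, 80_000, 90_000, 200_000);

my $days_ago = 0;
for my $modtime (@modtimes) {
    # Same comparison as line 25: has this entry crossed the page boundary?
    unless (($now - $modtime) / 86400 < $days_ago + 1) {
        print ++$days_ago, " days ago:\n";
    }
    print "  entry is ", age_in_days($now, $modtime), " day(s) old\n";
}
```

Note that the boundary advances by only one day per entry, which is the behavior of the original loop as well.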

Of course, once we’ve finished hassling with the date stamping, the next step is to do the real work, triggered from line 31. We’ll see that subroutine defined later.

Lines 36 to 59 handle the scanning of the existing distros, determining the order of the distros from most recent to oldest. Lines 37 and 38 declare a “process this only once” hash and the array to hold the list of resulting distros.

Line 40 computes the “package details” pathname as a native filepath. Line 41 calls a subroutine with this pathname to fetch the file, uncompress it, and return a list of all the lines of the file after the first blank line. Each line represents one package name. Line 42 splits the whitespace-delimited line into the interesting parts. Line 43 skips over anything that looks like a Perl distribution, and line 45 rejects any distros we’ve seen already to ensure one pass each.
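Here is a small sketch of the split-and-skip-duplicates step on its own. The sample lines are invented, but follow the module, version, path layout described above; unique_paths is a hypothetical name.

```perl
use strict;
use warnings;

# Made-up index lines: several modules can share one distribution path.
my @sample_lines = (
    "Foo::Bar        1.02  A/AU/AUTHOR/Foo-Bar-1.02.tar.gz\n",
    "Foo::Bar::Util  0.01  A/AU/AUTHOR/Foo-Bar-1.02.tar.gz\n",
    "Baz             undef O/OT/OTHER/Baz-0.5.tar.gz\n",
);

# Hypothetical helper: return each distribution path once, in first-seen order.
sub unique_paths {
    my %seen;
    my @paths;
    for (@_) {
        my ($module, $version, $path) = split;   # split $_ on whitespace
        next if $seen{$path}++;                  # true on the second and later sightings
        push @paths, $path;
    }
    return @paths;
}

print "$_\n" for unique_paths(@sample_lines);
```

Because of the post-increment, the first module listed for a distribution is the one that ends up in the record.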

Line 48 computes a native filename for the CPAN-mirror location of this particular distro. Note that the indices always show Unix-style forward slashes, but we’ll pass separate elements to catfile() so that it just does the right thing to construct a full path.
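To make the path handling concrete, here is a minimal sketch. The mirror root and the index path are placeholders, and local_file_for is a hypothetical wrapper around the expression from line 48.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(catfile);

# Hypothetical wrapper: turn a Unix-style index path into a native filename.
sub local_file_for {
    my ($root, $path) = @_;
    # Split on "/" so catfile() sees individual path elements.
    return catfile($root, split "/", "authors/id/$path");
}

# Placeholder values for illustration only.
print local_file_for("/tmp/minicpan", "A/AU/AUTHOR/Foo-Bar-1.02.tar.gz"), "\n";
```

On a Unix-like system the result looks just like the input joined with slashes; the split only pays off on platforms with a different separator.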

Line 50 constructs a distro record for this distro as a new anonymous hash (reference) pushed onto the @distros array. We note the filename of the CPAN mirror file, the identification path, the module name (or at least one of the modules of this distribution), and the modification time as an internal timestamp (the usual “seconds since the Unix epoch” value).

Line 58 does a simple sort-block sort to bring out all the distros ordered by their modification timestamp order. Note that $b appears before $a, so we get a descending sort, resulting in newest-first, just as we promised.
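The same sort, applied to a few made-up records (newest_first is a hypothetical name):

```perl
use strict;
use warnings;

# Made-up distro records carrying only the fields the sort needs.
my @sample_distros = (
    { path => "old", modtime => 100 },
    { path => "new", modtime => 300 },
    { path => "mid", modtime => 200 },
);

# Hypothetical helper: $b before $a gives a descending numeric sort.
sub newest_first {
    return sort { $b->{modtime} <=> $a->{modtime} } @_;
}

print join(" ", map { $_->{path} } newest_first(@sample_distros)), "\n";
# prints "new mid old"
```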

Lines 61 to 80 show each chosen distro, which is passed as a parameter, and shifted off the @_ array in line 62.

Line 64 fetches the module information data, calling a subroutine to return the value. On the first call, this subroutine performs a lot of work, but the subroutine caches the value for subsequent calls. The format of the response is a reference to a hash of records, where each record is itself a keyed hash.

Line 65 opens up the distro file, presuming it can be opened with the Archive::Tar module (Archive::Tar can open both ordinary tar files and compressed tar files). The resulting Archive::Tar object can then be queried and extracted.

Line 67 fetches the module description of the current distro, which is always a short phrase if it’s present. Since some modules have no description, we’ll “or” this with an empty string to keep from getting an “uninitialized value” warning later, particularly in line 68, which displays the short path for the distro and its description.

Line 70 queries the Archive::Tar object for all of the contained files that look like a README file. These usually have some cool information in them, and that’s what we’re looking for: to see how cool the module can be.

Lines 72 to 78 display each README file. The name is prefixed with a pipe-space. The content is then fetched in line 74, and processed to have two pipe-spaces in front of each line. Regardless of whether the README file uses Mac, Unix, or DOS line endings, and whether we’re running on any of those architectures, this code pretty much does the right thing. Note the use of \cM instead of \r, because on some platforms (classic Mac OS being the usual example), the meanings of \r and \n are swapped around. See perldoc perlport for all the gruesome details.
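A quick way to convince yourself is to feed the pattern a string that mixes all three line-ending styles. The sample string and the split_lines name are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from line 75.
# In list context, a /g match returns every captured line body.
sub split_lines {
    my $content = shift;
    return $content =~ /([^\cM\cJ]*)\cM?\cJ?/g;
}

# Unix (LF), DOS (CR LF), and old Mac (CR) endings in one string.
my $mixed = "unix\cJdos\cM\cJmac\cMlast";

print "| | $_\n" for split_lines($mixed);
```

As far as I can tell, the pattern also succeeds one final time with an empty string at the very end of the input, so each README gets one extra empty "| | " line. That is harmless for browsing.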

Next, we’ve got the get_module_data subroutine defined in lines 85 to 96. This subroutine performs a rather expensive and messy operation, so we’ll cache the output in a static, lexically local variable defined in line 83. The variable and the subroutine are enclosed in a BEGIN block to ensure proper closure between them.

The meat of the subroutine is the block between lines 87 and 94, which generates the value we want cached. The ||= do at the start of the block ensures that we’ll perform the block only when the current value of $data is false, such as when it is undef initially. If the value is not false, we simply return it.
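The caching idiom can be tried in isolation. In this sketch the "expensive" work is replaced by a counter, and get_expensive_data and $calls are hypothetical names.

```perl
use strict;
use warnings;

my $calls = 0;    # counts how often the expensive block really runs

BEGIN {
    my $data;     # visible only to the subroutine below

    sub get_expensive_data {
        $data ||= do {
            $calls++;              # stand-in for the expensive work
            +{ answer => 42 };     # leading + makes this an anonymous hash, not a block
        };
    }
}

get_expensive_data() for 1 .. 3;
print "block ran $calls time(s)\n";
```

One caveat with ||= is that a legitimately false result (0, the empty string) would be recomputed on every call. A hash reference is always true, so that does not matter here.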

The block first computes the name of the “modlist data” file in line 87, again in a way that should be portable regardless of the filename syntax. The modlist data file contains Perl code to be executed. However, we don’t want to execute arbitrary code from a Web site, lest our box be owned by some “scriptkiddie” who manages to replace the file with nefarious code. So we’ll execute this code in a Safe environment. (I’ll admit that I stole most of this code’s semantics from the existing CPAN.pm module source, although I rewrote most of the syntax.) The code to be executed comes from uncompressing the modlist data file in line 91. A call to CPAN::Modulelist->data returns the value from the code, which ends up in $ret declared in line 89. If anything goes wrong, we abort with an appropriate error message.
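To see what the Safe compartment buys us, here is a simplified sketch using the default compartment rather than the CPAN::Safe1 namespace from the listing. The code strings and the run_in_compartment helper are invented for illustration.

```perl
use strict;
use warnings;
use Safe qw();

# Hypothetical helper: evaluate a string in a fresh compartment and
# return both the result and any error message.
sub run_in_compartment {
    my $code = shift;
    my $compartment = Safe->new;
    my $result = $compartment->reval($code);
    my $error  = $@;                 # grab the error before anything can reset it
    return ($result, $error);
}

# Plain data construction is allowed under the default operator mask.
my ($ok, $ok_error) = run_in_compartment('+{ description => "demo" }');
print "description: $ok->{description}\n";

# Running an external command is not.
my ($bad, $bad_error) = run_in_compartment('system("echo owned")');
print "rejected: $bad_error" if $bad_error;
```

The second call never runs the command: the compartment refuses to compile code containing the system operator, and reval() returns undef with the reason in $@.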

Lines 99 to 115 define the weirdly-named uncompress_and_grab_after_blank routine. Oddly enough, both the “package details” and the “modlist data” files are compressed, and have an unneeded header delimited from the body by a blank line. So, we get to use this routine twice. The file name comes in from the first parameter (line 100). Line 101 defines the state flag (we’re in the header initially, so all we’re doing is looking for the blank line signifying the end of the header). Line 102 defines the return value (initially an empty list).

Line 104 opens up the compressed file, returning a gzip object handle.

Lines 106 to 113 read the file line-by-line until end-of-file. If we’re in the header, we’re looking for a blank line. If we’re not in the header, then the line is attached to the end of the @return value. When we’re done, the list of lines is returned in line 114.
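The header-skipping logic does not depend on compression, so it can be tried on an ordinary in-memory filehandle. The sample text and the grab_after_blank name are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: same state machine as lines 106 to 113,
# but reading from any filehandle instead of a gzip handle.
sub grab_after_blank {
    my $fh = shift;
    my $inheader = 1;
    my @body;
    while (my $line = <$fh>) {
        if ($inheader) {
            $inheader = 0 unless $line =~ /\S/;   # a blank line ends the header
            next;                                 # header and blank line are both dropped
        }
        push @body, $line;
    }
    return @body;
}

# Made-up file contents: two header lines, a blank line, two body lines.
my $text = join "",
    "File: demo\n",
    "Columns: module, version, path\n",
    "\n",
    "Foo 1.0 A/AU/AUTHOR/Foo-1.0.tar.gz\n",
    "Bar 2.0 O/OT/OTHER/Bar-2.0.tar.gz\n";

open my $fh, "<", \$text or die "Cannot open in-memory file: $!";
print grab_after_blank($fh);
close $fh;
```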

Lines 117 to 126 define the last subroutine, get_archive_tar_for. The incoming parameter is the filename to open, shifted off in line 118. Lines 119 to 123 create the Archive::Tar object.

However, I found that the library sometimes spits raw compressed data to standard error, and that was pretty nasty on my terminal. To get around that, I create a local STDERR glob, then reopen the STDERR filehandle to a /dev/null-ish sort of thing, which took care of the junk on my screen from a bad open. Line 124 turns any die from inside the eval block into a mere warn. Line 125 returns the final Archive::Tar object, if any.
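The STDERR trick works for any noisy code, not just Archive::Tar. In this sketch, quietly is a hypothetical helper that takes a code reference, and the noisy callback is invented for the demonstration.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(devnull);

# Hypothetical helper: run a code reference with STDERR pointed at the
# null device, then let STDERR come back when the eval block exits.
sub quietly {
    my $code = shift;
    my $result = eval {
        local *STDERR;                 # temporary glob, restored on block exit
        open STDERR, ">", devnull()
            or die "Cannot open null device: $!\n";
        $code->();
    };
    warn $@ if $@;                     # a die inside becomes a mere warning
    return $result;
}

my $value = quietly(sub { print STDERR "noise you will not see\n"; 42 });
print "got $value\n";
print STDERR "STDERR works again here\n";
```

Because the warn happens after the eval block has finished, its message goes to the restored STDERR rather than being swallowed.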

And there it is: my mini-CPAN mini-readme-browser. In playing with this for the past few weeks, I’ve already seen about a dozen new cool modules that I’m now investigating. You can be sure that I’ll be writing about them in future columns. Stay tuned!

So until next time, enjoy!

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. Code listings for this month’s column can be found at http://www.stonehenge.com/merlyn/LinuxMag/ and at http://www.linux-mag.com/downloads/2002-12/perl.
