Web 2.0, Meet Usenet 1.0

The" new" Web is all shiny and collaborative, but" old school" Usenet is still chugging along. Here, Randal Schwartz connects some of the new with some of the old, scraping CPAN for news of novel Perl modules.
The “new” Web is all shiny, with user-collaborative reviews and comments, AJAX interactions, and RSS feeds to track all those blogs and podcasts. But before all of that nice IP traffic, we communicated “over the net” via email, mailing lists, and Usenet. Yes, Usenet, the original “distributed bulletin board” system, gave “netizens” soapboxes to scream and rant, ask and answer, and distribute messages to “thousands of machines around the net,” as Larry Wall’s rn program used to warn prior to every request to post.
Although the new generation of “net” users focuses on direct-IP communication (through the Web, in blogs, and with instant messaging and IRC), Usenet is still chugging along behind the scenes, being operated more or less as it has been since its inception in 1979.
For example, a portion of the Usenet newsgroups are moderated, meaning that articles posted to those groups aren’t immediately distributed to the world, but instead are mailed to a moderator for approval. comp.lang.perl.moderated operates in this fashion (as its name implies), which theoretically means the group has more light (of knowledge) and less heat (from flames). A dedicated group of moderators manages the group, including Stonehenge trainer Tad McClellan.
Similarly, I’m the primary moderator for comp.lang.perl.announce (CLPA), an announcements mailing list for new and updated Perl software. I was selected into this position when the newsgroup was being created, and I spend a few minutes a day making sure announcements get out in a timely fashion. At one point, CLPA was also gatewayed into a mailing list, allowing people to get frequent new-Perl-code announcements directly via their email without having to find the Usenet group.
Over the years, CLPA has become a bit quieter, getting a posting only every few days from a handful of dedicated CPAN contributors. On my long list of items waiting for “round tuits,” I observed that the list of new and updated modules in the CPAN would be well within the charter of CLPA, but didn’t want to write the necessary tools to scrape the frequently updated module list to find the differences, and certainly wasn’t interested in doing such work by hand.
However, I recently noticed that search.cpan.org, my favorite view into the CPAN, has a public RSS feed of new modules going back a few days, along with a direct link to get more information. Aha! Finally, with a bit of automation, I could start pumping timely data into CLPA. By bolting together a few CPAN modules, I produced a nightly “CPAN 2 CLPA” program, presented in Listing One.
LISTING ONE: A script to forward RSS items to email
01 #!/usr/bin/perl -w
02 use strict;
04 ### config
06 my ($HOME) = glob "~";
07 my $RSS_TEMP_DIR = "$HOME/lib/xml-rss-feed";
08 my $HTTP_CACHE_TEMP_DIR = "$HOME/lib/httpcache";
09 my $SIGNATURE = "$HOME/.signature";
11 ## for news posting:
12 my ($HOST,$USER,$PASS) = qw(nntp.example.com merlyn guesswhat);
14 ### end config
16 use Encode qw(encode);
17 use XML::RSS::Feed ();
18 use HTTP::Cache::Transparent ();
19 use LWP::Simple qw(get);
20 use News::NNTPClient ();
22 mkdir $RSS_TEMP_DIR, 0755 unless -e $RSS_TEMP_DIR; # one time init
23 HTTP::Cache::Transparent::init({BasePath => $HTTP_CACHE_TEMP_DIR});
25 my $feed = XML::RSS::Feed->new
26   (url => "http://search.cpan.org/uploads.rdf",
27    name => "search.cpan.org",
28    tmpdir => $RSS_TEMP_DIR,
29   );
31 my @OUTPUT;
33 my $xml = get($feed->url);
34 $feed->parse($xml);
35 for my $headline ($feed->late_breaking_news) {
36   push @OUTPUT, $headline->headline . "\n";
37   push @OUTPUT, $headline->url . "\n";
38   my $desc = encode('ascii' => $headline->description);
39   push @OUTPUT, "$desc\n" if defined $desc;
40   push @OUTPUT, "----\n";
41 }
43 exit 0 unless @OUTPUT; # we have something to say
45 pop @OUTPUT; # remove final ---- line
47 my $c = News::NNTPClient->new(split /:/, $HOST);
48 if ($USER) {
49   $c->authinfo($USER, $PASS);
50 }
51 $c->postok or die "Cannot post to $HOST: $!";
53 @ARGV = ($SIGNATURE);
55 @OUTPUT = split /\n/, <<"END";
56 Newsgroups: comp.lang.perl.announce
57 Followup-to: poster
58 From: merlyn\@stonehenge.com (Randal Schwartz)
59 Subject: new CPAN modules on @{[unpack 'A10 x10 A*', gmtime]}
61 The following modules have recently been added to or updated in the
62 Comprehensive Perl Archive Network (CPAN). You can install them using the
63 instructions in the 'perlmodinstall' page included with your Perl
64 distribution.
66 @{[join '', @OUTPUT]}
68 If you're an author of one of these modules, please submit a detailed
69 announcement to comp.lang.perl.announce, and we'll pass it along.
71 print "Just another Perl hacker," # the original
74 @{[join '', <>]}
75 END
77 warn map "$_\n", @OUTPUT;
79 $c->post(@OUTPUT) or warn "failed post!";
Lines 1-2 define the path to Perl, enable warnings, and turn on strict mode, as always.
Lines 4-14 provide the “user serviceable parts” for things below. I get my home directory using a glob trick, although this is probably on par with the mystery of:
my $HOME = (getpwuid $<)[7];
(I’m not sure whether counting on a glob of tilde is more or less portable than getting the eighth value of the password file entry for the current user, but in any case there’s more than one way to do it.)
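Both approaches can be tried side by side. This little sketch is mine, not from the column, and on systems with no passwd database the second approach simply comes up empty:

```perl
use strict;
use warnings;

# Approach 1: glob a bare tilde; the shell-style expansion
# consults the HOME environment variable.
my ($home_from_glob) = glob "~";

# Approach 2: field 8 (index 7) of the passwd entry for the
# current real user ID is the home directory.
my $home_from_passwd = (getpwuid $<)[7];

print "glob:   ", $home_from_glob, "\n";
print "passwd: ",
  (defined $home_from_passwd ? $home_from_passwd : "(no passwd entry)"), "\n";
```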
From the home directory, I derive paths for the two data directories used by this program. The XML::RSS::Feed module needs a place to keep information about RSS headlines that have been already seen, so those are thrown into $RSS_TEMP_DIR. And HTTP::Cache::Transparent needs a local cache area, which I put into $HTTP_CACHE_TEMP_DIR. Finally, once I have a posting, I have to push it into the news network, so I list my (not-real) NNTP host from my ISP, along with my personal authentication credentials.
Lines 16-20 load the needed modules. The Encode module is included with the core Perl distribution, and is used here to force the description text down to plain ASCII. The remaining modules are found in the CPAN, and are described as they are used below.
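The Encode call used in the listing behaves like this: with the default check argument, characters that have no place in the target encoding are swapped for a substitution character rather than raising an error. (The sample string here is my own invention.)

```perl
use strict;
use warnings;
use Encode qw(encode);

# An e-acute makes this string non-ASCII.
my $desc = "Caf\x{e9} module";

# With the default check argument, unmappable characters are
# replaced with a substitution character, so the result is
# guaranteed to contain only ASCII bytes.
my $ascii = encode('ascii' => $desc);

print "$ascii\n";
```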
Line 22 works around a bug in the XML::RSS::Feed module: if you give a path that doesn’t exist (which I’ve done, more than once), XML::RSS::Feed does not create the directory for you, and doesn’t tell you that it’s not there, so you simply get confusing behavior (for instance, all headlines are always marked new).
Line 23 enables the transparent web cache. Most modern RSS generators can take advantage of client-side caching to reduce the traffic and CPU load. If a web client already has a prior fetch of an RSS feed, the client can include the modification time of that fetch along with the next request, and the server can say, “Nope, you’ve already got the latest version.” Normally, LWP::UserAgent does no caching of prior fetches, but dropping in HTTP::Cache::Transparent modifies the behavior of LWP so that caching is performed automatically, much as if a proxy cache server were inserted upstream. HTTP::Cache::Transparent is quite a nice module, and can be used to improve many web-fetching scenarios for cooperating servers.
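The server's side of that conversation boils down to a timestamp comparison. Here's a toy decision function, with names and structure entirely my own, purely to illustrate the idea:

```perl
use strict;
use warnings;

# Decide, as a server would, whether a client needs a fresh copy.
# $resource_mtime and $if_modified_since are epoch seconds; the
# latter is undef when the client sent no If-Modified-Since header.
sub response_status {
    my ($resource_mtime, $if_modified_since) = @_;
    return 200 unless defined $if_modified_since;   # first fetch
    return $resource_mtime > $if_modified_since
        ? 200        # resource changed: send the full body
        : 304;       # "Nope, you've already got the latest version"
}

print response_status(1_000_000, undef), "\n";       # 200
print response_status(1_000_000, 1_000_000), "\n";   # 304
print response_status(1_000_001, 1_000_000), "\n";   # 200
```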
Lines 25-29 set up the XML::RSS::Feed object, representing our source data stream. The URL was obtained from the “RSS 1.0” button on the http://search.cpan.org/recent page, although like many modern Web sites, the RSS information is also in a metatag link, available in modern browsers through a separate user interface for easy grabbing.
Line 31 declares a buffer for the output of this program. I had originally just printed the information to STDOUT, but then I realized I didn’t want to post an article if there were no new items, so I replaced all the print operations with push @OUTPUT, to save the data.
Line 33 uses LWP::Simple’s get() function to grab the RSS data. Because LWP::Simple uses LWP::UserAgent underneath, and I’ve modified LWP::UserAgent to cache the fetches, I’m actually performing a cached fetch.
Line 34 parses the RSS feed data, as copied from the XML::RSS::Feed man page example. Lines 35-41 process each “new” headline, as determined by XML::RSS::Feed to be something that we haven’t seen before.
For each new headline, lines 36-40 grab the text of the headline, the URL for further information (here, the detailed page on the updated module), and the one-line text description as provided by search.cpan.org, and push them onto the end of @OUTPUT, followed by a separator.
Now, if the code makes it all the way to line 43 and there still isn’t any output, there’s no point in posting a news message, because it’ll be empty. There might be no output if people stopped submitting things to the CPAN (unlikely), if something has broken in the CPAN indexer or CPAN mothership (rare, but it can and has happened), or if something is broken in search.cpan.org’s update of the RSS feed (also rare, but it also has happened). Hopefully, on the next day’s run, though, the code picks up everything that was missed the time before.
Line 45 cleans up the output just a bit, removing the trailing dash line, since a separator is needed only between entries. At this point, @OUTPUT is the guts of a news posting that I want to make into CLPA, but I’ll still need some wrapper headers and footers to make it nice.
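In miniature, that build-then-trim pattern looks like this, with invented headline strings standing in for the feed data:

```perl
use strict;
use warnings;

# Stand-in items, where the real program uses RSS headlines.
my @items = ("Foo-Bar-1.23", "Baz-Quux-0.01");

my @OUTPUT;
for my $item (@items) {
    push @OUTPUT, "$item\n";
    push @OUTPUT, "----\n";   # separator after every item
}
pop @OUTPUT if @OUTPUT;        # drop the dangling final separator

print @OUTPUT;
```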
Lines 47-51 fire up a connection to my ISP’s news host, including verifying that I can post something, which is essential here. Line 53 shoves the name of my dot-signature file into @ARGV, so that I can easily open and read it with a “diamond” (<>) read below.
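That @ARGV trick is worth a tiny demonstration. Here I point the diamond at a throwaway temp file instead of a real .signature file (the file contents are invented):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway stand-in for the .signature file.
my ($fh, $signature_file) = tempfile();
print $fh "Randal L. Schwartz\nStonehenge Consulting\n";
close $fh;

# The trick: load @ARGV, then the "diamond" reads that file
# as if it had been named on the command line.
@ARGV = ($signature_file);
my $signature = join '', <>;

print $signature;
```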
Lines 55-75 wrap the @OUTPUT variable with the boilerplate headers and footers for a full news posting. Splitting (by newlines) the single string of the here-document updates the value in @OUTPUT. The here-document is double-quote interpolated, because the keyword END is enclosed in double-quotes. This gives me a simple templating strategy, because any scalar or array variables within the here-document will be expanded.
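A stripped-down version of that templating strategy, with invented module names:

```perl
use strict;
use warnings;

my @modules = ("Foo-1.0\n", "Bar-2.0\n");

# Any expression works inside @{[ ... ]}: the inner [ ... ]
# builds an anonymous array, and the outer @{ ... } dereferences
# it so the result interpolates into the double-quoted heredoc.
my $posting = <<"END";
Recent uploads:
@{[join '', @modules]}
END

print $posting;
```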
Lines 56-59 provide the news posting header text. Note that the @ in line 58 had to be escaped or else a variable named @stonehenge would have been needed, and would have failed to compile, because this program uses strict.
Line 59 requires some explanation. The outer @{..} is an array interpolation, but the value that it interpolates results from the square-bracketed expression […]. Thus, we have an expression computed within a double-quoted string, providing some data for the interpolation. The unpack extracts the day of week and the date from a scalar-value gmtime() expansion, using unpack operations that I described in last month’s column (online after June 15, 2006, at http://www.linux-mag.com/2006-05/perl_01.html).
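With a fixed timestamp in place of gmtime's actual output, the unpack is easy to verify:

```perl
use strict;
use warnings;

# Scalar-context gmtime yields a fixed-width ctime-style string;
# this sample stands in for a real timestamp.
my $stamp = "Thu Jun 15 12:34:56 2006";

# A10: take the first ten characters ("Thu Jun 15"),
# x10: skip the next ten (the time of day),
# A*:  take whatever remains ("2006").
my @parts = unpack 'A10 x10 A*', $stamp;

print "@parts\n";   # day-of-week and date, plus the year
```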
Lines 61-64 include some boilerplate text above the list of headlines. Line 66 interpolates the original @OUTPUT variable into this string. I can’t let the original elements remain separate, or I’ll get the single-space-between-elements mess that seems to trouble the beginners. (Honestly, I originally had @OUTPUT there, and couldn’t figure out where the space was coming from myself!)
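The space problem is easy to reproduce:

```perl
use strict;
use warnings;

my @OUTPUT = ("Foo-1.0\n", "Bar-2.0\n");

# Bare array interpolation joins the elements with $" (a single
# space by default), smearing a stray space into the output:
my $with_spaces = "@OUTPUT";

# Joining explicitly inside @{[ ... ]} keeps the elements flush:
my $flush = "@{[join '', @OUTPUT]}";

print $flush;
```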
Lines 68-74 finish up the posting text, using the diamond read to grab my signature below the signature marker automatically.
All that’s left to do is post the message! For debugging, I dump the contents to STDERR (line 77), which my cron job happily emails me each night. And then, I push the button in line 79, which posts my automatically generated CLPA message to “thousands of machines” within the space of mere minutes.
Mission accomplished. All that’s left to do is point a nightly cron task at this program and put everything on autopilot.
Obviously, the program as-is has limited use. But consider taking a blogsearch.google.com RSS feed and posting the search results to your group’s internal news server every few hours. By distributing the results as news postings, you can minimize the hit on Google’s resources, as well as have a historical record of searches to see when things first appeared. I hope you have fun adapting these techniques.
Until next time, enjoy!

Randal Schwartz is the Chief Perl guru at Stonehenge Consulting. You can reach Randal at merlyn@stonehenge.com.
