Moving Your News Service

Several months ago in this space, I talked about how my ISP was looking at the performance of their news server. I wrote a program to see just how bad the news service was compared to the other local ISPs, using Deja as a baseline. Well, the ISP just got bought out by a big national chain. They decided not to fight the spotty news service any more and just convert over to the conglomerate's big service. The problem with moving from one news server to another is that the article numbers are not in sync, so a .newsrc file will have the right newsgroups but the wrong "read" marks. And since I read a lot of newsgroups, I don't have time to reread existing articles, and I don't want to just throw away any new articles.

The solution is a bit complicated and requires extensive bookkeeping, but that’s what computers are for, and Perl in particular. What you need to do is mark as read any articles you’ve already seen. Messages are uniquely identified by a message ID, and you can get that mapped into article numbers via the appropriate XHDR request to the NNTP server.

So, basically, for every subscribed newsgroup, we fetch the message IDs of the last 500 articles from the new server (500 being the maximum number of unread articles per group I’d care to face). Then, we fetch the last 1,500 or so message IDs from the old server. Then, for every message ID on the new server, if I’ve already read it on the old server, mark it as read in the new newsrc.
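The matching step above can be sketched with plain in-memory hashes. This is only an illustration under stated assumptions: the message IDs, article numbers, and the helper name read_on_new are made up for the example, and no news server is contacted.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for what the servers would report.
# Old server: message ID => article number.
my %old_msgid_to_art = (
    '<a@example>' => 101,
    '<b@example>' => 102,
    '<c@example>' => 103,
);
# New server: article number => message ID.
my %new_art_to_msgid = (
    7 => '<b@example>',
    8 => '<c@example>',
    9 => '<d@example>',    # never seen on the old server
);
# Article numbers already read on the old server.
my %old_read = map { $_ => 1 } (101, 102);

# Return the new-server article numbers whose message ID maps to an
# old-server article that has already been read.
sub read_on_new {
    my ($new_map, $old_map, $read) = @_;
    my @marked;
    for my $art (sort { $a <=> $b } keys %$new_map) {
        my $old_art = $old_map->{ $new_map->{$art} };
        push @marked, $art if defined $old_art and $read->{$old_art};
    }
    return @marked;
}

my @marked = read_on_new(\%new_art_to_msgid, \%old_msgid_to_art, \%old_read);
print "mark as read on new server: @marked\n";
```

Only article 7 ends up marked here: article 8 exists on both servers but was unread on the old one, and article 9 has no counterpart at all.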

The newsrc file is the classic rn format. Most modern newsreaders can import and export this format, so it’s a nice least common denominator of exchange. And there’s a good module or two in the CPAN to deal with this format as well.
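For readers who have not looked inside a newsrc file, each newsgroup line is a group name, a colon (subscribed) or exclamation mark (unsubscribed), and a comma-separated list of read article numbers and ranges. The sketch below parses one such line by hand. It is not how News::Newsrc works internally, and parse_newsrc_line is a hypothetical helper written just for this illustration.

```perl
use strict;
use warnings;

# Parse one newsrc line into (group, subscribed flag, list of [low, high]).
# Returns an empty list if the line is not a newsgroup line.
sub parse_newsrc_line {
    my ($line) = @_;
    my ($group, $mark, $list) = $line =~ /^(\S+)([:!])\s*(.*)$/
        or return;
    my @ranges = map {
        my ($low, $high) = split /-/;
        [ $low, defined $high ? $high : $low ];   # "105" becomes [105, 105]
    } split /,/, $list;
    return ($group, $mark eq ':', \@ranges);
}

my ($group, $subscribed, $ranges) =
    parse_newsrc_line("comp.lang.perl.misc: 1-100,105\n");
print "$group is ", ($subscribed ? "subscribed" : "unsubscribed"), "\n";
print "read: ", join(", ", map { join "-", @$_ } @$ranges), "\n";
```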

There was one more requirement, just to make this even more interesting. My newsreading and general information processing happen at yet another ISP, separate from both the old and the new news servers. So I use ssh tunneling to go back to the shell-account machine of the old ISP to get to its news server and also to get to the new server at the takeover ISP’s machine, which is permitted access only to its customers. It’s almost as bad as trying to figure out those spy novels with all the odd names, but most of the time this is transparent to me.

After my ISP converted, I had a fairly nice-looking new newsrc with all my previously read articles punched out already. And the program to do this all is in Listing One, which goes as follows.




Listing One: The .newsrc Converter – Part I


 1  #!/usr/bin/perl
 2  use strict;
 3  $|++;
 4
 5  use Net::NNTP;
 6  use News::Newsrc;
 7  use IO::File;
 8
 9  ## config
10
11  my $DST_MAX = 500;
12  my $SRC_MAX = $DST_MAX * 3;
13
14  my $OLD = "news.old-isp.comm";
15  my $NEW = "news.big-mega-isp.comm";
16
17  my $VIA = "shell.old-isp.comm";
18  my $VIA_OLD_PORT = 42001;
19  my $VIA_NEW_PORT = 42002;
20
21  my $VERBOSE = 2; # 0 quiet, 1 expected errors, 2 noisy
22
23  ## end config
24
25  system join " ",
26    "ssh -f -q",
27    "-L $VIA_OLD_PORT:$OLD:119",
28    "-L $VIA_NEW_PORT:$NEW:119",
29    "$VIA",
30    "exec sleep 60",
31    "&",
32    "sleep 5" if $VIA;
33
34  my $SRC_NNTP = $VIA ? "localhost:$VIA_OLD_PORT" : $OLD;
35  my $DST_NNTP = $VIA ? "localhost:$VIA_NEW_PORT" : $NEW;
36
37  my $src = Net::NNTP->new($SRC_NNTP) or die "src: $!";
38  my $dst = Net::NNTP->new($DST_NNTP) or die "dst: $!";
39
40  my $src_rc = News::Newsrc->new or die "Cannot new newsrc for src";
41  my $dst_rc = News::Newsrc->new or die "Cannot new newsrc for dst";
42  my @extra_lines = ();
43
44  {
45    my $newsrc = IO::File->new("$ENV{HOME}/.newsrc", "r")
46      or die "Cannot open .newsrc: $!";
47    my @all = <$newsrc>;
48    @extra_lines = grep !/^\S+[:!]\s/, @all;
49    $src_rc->_scan(join "", grep /^\S+[:!]\s/, @all); # dies if fail
50  }
51
52  for my $group ($src_rc->groups) {
53    eval {
54      if ($src_rc->subscribed($group)) {
55        print "subscribed to $group\n" if $VERBOSE > 1;

Line 2 turns on strict mode — needed for every program that is longer than 10 lines or used for longer than 10 minutes. In this case, the first applies but not the second, since I hope I’m not changing servers frequently.

Line 3 unbuffers standard output. There’s not a lot of output from this program, and I want to see it as it comes along.

Lines 5, 6, and 7 pull in the modules we’ll need. Net::NNTP comes from the CPAN, and lets us talk to NNTP servers. News::Newsrc also comes from the CPAN, and provides parsing and updating of newsrc-format files. IO::File is a core module installed with Perl, and lets us have generic filehandles as objects.

Lines 9 through 23 provide the most-likely-to-be-tweaked settable variables. As always, I’m providing my programs not as ready-to-run robust programs, but as snippets for your own inspiration (steal the ideas, not the code). However, since I’ll probably brush the dust off this program in another year or so when this ISP merges with another one, I’ll make it easy to remember my thinking by providing a distinct configuration area.

Here, $DST_MAX is the greatest number of unread articles we’re willing to tolerate on the new server. You could probably crank this up to 20000 or so if you wanted to be sure to read everything the new server has to offer, but if you have a lot of groups, the bigger numbers will mean slower operations. (I had about 120 subscribed newsgroups, and it took about 10 minutes to process at my value of 500 here.) $SRC_MAX is how many articles to map in the old news server. Because articles come in a scrambled order, this should be a number bigger than $DST_MAX to ensure that we don’t miss an article number mapping on the old server that we’ll need.
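The window arithmetic is easy to check in isolation. The following is a small sketch of the clamping that the listing applies to the low end of each article range before asking for message IDs. The helper name clamp_low and the sample article numbers are assumptions made for the example.

```perl
use strict;
use warnings;

# Raise the low end of a range so that it reaches back no more than
# $max article numbers below $high. Ranges already inside the window
# are left alone.
sub clamp_low {
    my ($low, $high, $max) = @_;
    return $low < $high - $max ? $high - $max : $low;
}

my $busy  = clamp_low(1,     10_000, 500);   # long history: trimmed
my $quiet = clamp_low(9_900, 10_000, 500);   # short history: unchanged
print "busy group starts at $busy, quiet group starts at $quiet\n";
```

With $DST_MAX at 500 and $SRC_MAX at three times that, the old-server window reaches back 1,500 article numbers, which is the slack that absorbs articles arriving in a different order on the two servers.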

$OLD and $NEW are the old and new news server hosts, respectively. I presumed that I’d always be using port 119 (the NNTP port) on the addresses, although I see that it wouldn’t be hard to parameterize that. No sense in making everything too flexible for such a little-used program!

$VIA is used when I need to ssh-tunnel the connections. It’s the hostname of the shell machine at the old ISP. (Please note that these are not the real hostnames; the comm suffix should be enough of a clue not to try them.) If $VIA is false (such as 0, undef, or the empty string), tunnelling won’t be used, so this is an optional step. However, if it’s used, we need to select two hopefully unused port numbers for the local tunnel ports, and those are given in $VIA_OLD_PORT and $VIA_NEW_PORT.

Finally, $VERBOSE says how noisy to be. If we turn on all the noise, we get a pretty good complete description of where we are in the process and what we’ve accomplished.

Lines 25 to 33 set up the tunnel if needed. For this to work, I have to have ssh trained to accept connections from my workhorse ISP to my newsreading ISP, which I needed to do for my newsreader anyway. The crucial parts are the selection of the tunnels (the -L parameters), the command to run (sleep 60), and the additional sleep for five seconds after firing off the ssh to let everything warm up. The sleep 60 is executed on the remote host and needs to be longer than it takes for my program to connect to the local tunnel ports. Once the connections are established, the remote command can terminate without any problem.
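To see exactly what gets handed to the shell, it can help to build the command string and print it rather than run it. The sketch below does that with the same placeholder hostnames and ports as the config section. The lowercase variable names are just for this example, and nothing is executed.

```perl
use strict;
use warnings;

# Placeholder hosts and ports, copied from the listing's config section.
my ($old, $new, $via) =
    ("news.old-isp.comm", "news.big-mega-isp.comm", "shell.old-isp.comm");
my ($old_port, $new_port) = (42001, 42002);

# Same pieces the listing joins together, but kept in a variable
# instead of being passed to system.
my $command = join " ",
    "ssh -f -q",
    "-L $old_port:$old:119",    # local port => old news server
    "-L $new_port:$new:119",    # local port => new news server
    $via,
    "exec sleep 60",            # runs on the remote shell machine
    "&",
    "sleep 5";                  # runs locally while the tunnel warms up

print "$command\n";
```

Printing the string first is a cheap way to confirm the port forwards point where you expect before letting the script open any connections.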

$SRC_NNTP and $DST_NNTP, defined in lines 34 and 35, set up the hostname and port number (if needed) for the old and new news servers. Lines 37 and 38 attempt the connection to those servers, dying if things are bad.

Lines 40 and 41 create News::Newsrc objects to hold the newsrc for the old server and the newsrc for the new server. Line 42 sets aside a place for lines from the old newsrc that aren’t really about subscribed or unsubscribed newsgroups, because News::Newsrc apparently blows up on these.

Lines 44 to 50 grab the old newsrc information into the newsrc object. As you can tell, this is pretty inflexible, grabbing the file directly from my home directory. Maybe this should have been a parameter, but I don’t care, because the job got done. @extra_lines gets all the stuff that’s not about a newsgroup, while the remaining lines are sucked into the newsrc object.
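The split between newsgroup lines and everything else comes down to one regular expression used twice, once plain and once negated. Here is a minimal sketch on made-up input. The sample lines, including the non-group line, are hypothetical.

```perl
use strict;
use warnings;

# Made-up newsrc contents: two newsgroup lines and one line that is
# not about a newsgroup.
my @all = (
    "comp.lang.perl.misc: 1-12345\n",
    "alt.test! 1-99\n",
    "options -n\n",
);

# A newsgroup line is non-whitespace, then ":" or "!", then whitespace.
my @group_lines = grep  /^\S+[:!]\s/, @all;
my @extra_lines = grep !/^\S+[:!]\s/, @all;

print "group lines:\n", @group_lines;
print "extra lines:\n", @extra_lines;
```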




Listing One: The .newsrc Converter – Part II


56        $dst_rc->subscribe($group);
57
58        (undef, my $src_low, my $src_high) = $src->group($group)
59          or die "Cannot get info for src $group\n";
60        $src_low = $src_high - $SRC_MAX if $src_low < $src_high - $SRC_MAX;
61        my %src_msgid_to_art = reverse %{$src->xhdr("Message-Id", "$src_low-$src_high")};
62        (undef, my $dst_low, my $dst_high) = $dst->group($group)
63          or die "Cannot get info for dst $group\n";
64
65        $dst_low = $dst_high - $DST_MAX if $dst_low < $dst_high - $DST_MAX;
66        my %dst_art_to_msgid = %{$dst->xhdr("Message-Id", "$dst_low-$dst_high")};
67
68        for my $dst_art ($dst_low..$dst_high) {
69          eval {
70            my $msgid = $dst_art_to_msgid{$dst_art} or
71              die "no msgid for $dst_art in $group at dst\n";
72            ## next;
73            my $src_art = $src_msgid_to_art{$msgid} or
74              die "no art for $msgid in $group at src\n";
75            ## next;
76            next unless $src_rc->marked($group, $src_art);
77            print "mapping $msgid from $src_art to $dst_art\n" if $VERBOSE > 1;
78            $dst_rc->mark($group, $dst_art);
79          }; warn $@ if $@ and $VERBOSE;
80        }
81        $dst_rc->mark_range($group, 1, $dst_low - 1);
82      } else {
83        print "unsubscribed to $group\n" if $VERBOSE > 1;
84        $dst_rc->unsubscribe($group);
85        (undef, my $dst_low, my $dst_high) = $dst->group($group)
86          or die "Cannot get info for dst $group\n";
87        $dst_rc->mark_range($group, 1, $dst_low - 1) if $dst_low;
88      }
89    }; warn $@ if $@ and $VERBOSE;
90  }
91
92  print "==== RESULT ====\n";
93  print @extra_lines, $dst_rc->_dump;

Lines 52 to 90 do the bulk of the job. For every newsgroup mentioned in the old newsrc, we loop once with $group set to that group. A large eval block protects us from premature death on any particular newsgroup, giving us instead a group that won’t be transferred to the new newsrc.

Line 54 determines if it’s a subscribed newsgroup, and if so, sends us through the bulk of lines 55 to 81 (described in a moment). If not, we skip down to lines 83 to 87 and mark the group as unsubscribed in line 84. Line 85 grabs the lowest article number still active on the news server, and line 87 ensures that we don’t try reading any article number below that. (Most newsreaders do the equivalent already, but I’m trying to make an accurate newsrc here.)

Now, back to the harder part. Line 58 gets the info from the old server about the article number range present in the group. Line 60 computes a range not to exceed $SRC_MAX items for which we must get a “message-id-to-article-number” map constructed. Line 61 creates a hash from the hashref returned by calling the NNTP XHDR operation for all the message IDs in the given article number range. Sure, you can get this info one article at a time, but the XHDR command is very fast since it reads directly from the .overview file that most news servers now maintain. The result is that we have a hash called %src_msgid_to_art that we can feed a message ID and get back the article number. Since we can then see if this article number has already been read, we’ll be able to tell if we should mark it as having been read in the new newsrc. Lines 62 to 66 do the same thing in the other direction for the new news server, mapping its article numbers to the corresponding message IDs.
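The reverse trick on line 61 is worth a closer look, since it is what turns an article-number-to-message-ID mapping into the lookup we actually need. The sketch below uses a hand-built hashref as a stand-in for the XHDR result, so the data is made up and no server is involved.

```perl
use strict;
use warnings;

# Stand-in for the XHDR result: article number => message ID.
my $xhdr_result = {
    101 => '<a@example>',
    102 => '<b@example>',
};

# In list context a hash flattens to (key, value, key, value, ...).
# reverse flips that list end to end, so each value now precedes its
# key, and assigning to a hash swaps keys and values.
my %msgid_to_art = reverse %$xhdr_result;

my $art = $msgid_to_art{'<b@example>'};
print "message <b\@example> is article $art\n";
```

This inversion is safe only when the values are unique. Message IDs are meant to be unique, so that holds here, but if two articles ever shared one, only one of the two article numbers would survive the inversion.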

And then it’s time for the heavy bookkeeping. Lines 68 to 80 check each article in the new server for its message ID number (line 70). If that same article (line 73) has been read on the old server (line 76), we mark it as read on the new server (line 78). Not rocket science, but a lot of details to get right. At this point, we’re not talking to either of the servers — all of the information is in hashes in memory.

Line 81 then marks as read anything below the articles we’ve considered. This means we can never have more than $DST_MAX articles unread.

And now that we’re all done, line 93 dumps the result! I could have made it save the new newsrc directly, but I’m running this program inside a window that I can cut-n-paste, so it didn’t matter.

So there you have it. I wish you the luxury of never having to move from one news server to another, but at least if you have this program and a short period of overlap, it’ll ease the pain a bit when you must move.

This is my last column for Linux Magazine that will be strictly about Perl. Next month, I’ll begin writing about general Webmaster topics. I hope you enjoy the new format as much as you’ve said you’ve liked this column in the past. Until next time, go forth and be Perl-y!



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com.
