Monitoring Your News Server

Usenet news has been around since 1979. I've been reading news nearly daily since 1980, except for a brief hiatus in 1984 when I missed the "great renaming" that gave us our current Usenet naming scheme. Because news is important (and familiar) to me, it's important for me to read news from a news server that has fairly decent article coverage.

I’m a charter subscriber to the largest ISP in town. Recently, there were some complaints on the ISP-only internal newsgroups that the newsfeed seemed a little less than normal. I wondered if it was a summertime slowdown or an actual problem, and since I like to help out the admins of this ISP when I can, I took it upon myself to hack out a Perl tool to verify whether the problem was real or merely perceived.

Because I wanted some quantitative data, I decided to ask Deja (formerly Deja News) and AltaVista about all the articles they’d seen in a given time frame. I figured that if my ISP also had all of those articles, there wasn’t a problem. If only some of those articles had shown up, however, then it would be time to figure out how to have the ISP solve the feed issues. And while I was at it, I could also compare three ISPs to which I have news access all at the same time.

Now, doing this all from scratch would have been quite difficult. I’d have to parse the output of the Deja and AltaVista search engines, looking for links, then carefully extract the message-ID from each result. Thankfully, I worked smart on this one and noted that there’s a nice CPAN (Comprehensive Perl Archive Network) module called WWW::Search that does the search-result parsing for me. So, in fewer than 150 lines of code, I could do all the research I needed and still have some time left over to actually read the news that was there.

Also, this program builds on the very nice LWP package from Gisle Aas and friends, which lets me trivially fetch data from a given URL and take apart the returned URLs.

Now, even if you don’t suspect you are living with a flaky newsspool, you can still use the techniques presented here to discover other interesting news-related events. So, let’s take a look at the code in Listing One.

Lines 1 through 3 start nearly every program I write, enabling warnings, turning on the most common compiler restrictions, and disabling buffering on STDOUT.

Lines 5 through 8 pull in the modules that I’ll be using, all found in the CPAN. Net::NNTP comes from Graham Barr’s libnet. LWP::Simple and URI are both in the Bundle::LWP group. And WWW::Search is on its own. If you don’t have these modules, use perl -MCPAN -eshell to fetch and install them for you.

Lines 10 through 30 are the configuration area. I tend to lump things I might want to change between runs in a special area at the top of the program, and mark it as such. I also generally use uppercase variables for these constants.

Line 12 defines the verbosity level of this program. Here it’s set to 1, meaning that we’ll know as each article is being fetched from one of the Net sources. While this is reassuring, it can also be a bit noisy. By setting this to 0, we’ll see only the final report.

Line 14 defines the newsgroups that we’ll be checking, as a list. I wanted a representative sampling, so I picked a few of the newsgroups I read. A newsgroup has to be carried by Deja and AltaVista or it can’t be checked, so using internal or very local newsgroups won’t work well.

Lines 16 and 17 define the window of articles to be considered. Because the primary source feed is Deja, which doesn’t honor cancels, I pick dates that are old enough to be in AltaVista as well, which seems to take two weeks to get new articles into its search database. That way, if an article is seen in Deja but nowhere else, I can presume it’s a canceled article instead of worrying that it never got to my server. The downside is that my news server might have expired the article by now, so I’ll get a false “missing.” It’s too bad AltaVista doesn’t have current databases as it had originally.
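The reasoning above can be sketched as a tiny classifier. This is an illustration only, not part of Listing One: it assumes the %id layout described later in the column (message-ID first, then a sub-key for each source that has the article), and the classify name and its labels are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: interpret one %id entry for one local server.
# $seen is a hash ref such as { DJ => 1, AV => 1, in => 1 }.
sub classify {
    my ($seen, $server) = @_;
    return 'present' if $seen->{$server};
    # Deja keeps canceled articles, so a Deja-only sighting is weak evidence.
    return 'probably canceled' if $seen->{DJ} && !$seen->{AV};
    # Either it never arrived, or the local server has already expired it.
    return 'missing or expired';
}

# Made-up sample data.
my %id = (
    '<a@example.com>' => { DJ => 1, AV => 1, in => 1 },
    '<b@example.com>' => { DJ => 1 },
    '<c@example.com>' => { DJ => 1, AV => 1 },
);

for my $msg_id (sort keys %id) {
    print "$msg_id: ", classify($id{$msg_id}, 'in'), "\n";
}
```

Only the last category is worth chasing with the ISP, and even then expiry has to be ruled out first.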

Lines 19 through 28 define the particular news servers that I’m scanning. I defined the three local news servers from the three ISPs to which I’m subscribed. For each ISP, I must define a host, giving the hostname and optional port number on which the NNTP server is located. (I’ve obscured the actual hostnames here so as not to make my ISPs mad.)

For two of these news servers, I access them via an SSH tunnel. The tunnel command will be executed prior to attempting to connect to the news server. This particular SSH tunnel command establishes (for 180 seconds) a local port (like 1190 or 1191) that is connected to a news server via a remote command-line host. So, for example, connecting to localhost at port 1191 will really be connecting to news.teleXXXX.com. The purpose of the SSH tunnel is mostly to improve security by communicating without allowing a real password out on the wire, where there is always a chance that it can be sniffed.
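To make the shape of that tunnel command concrete, here is a small sketch that assembles one from its parts. The tunnel_command helper and the example.com host names are placeholders of mine. The program itself simply stores the finished command string in the configuration hash.

```perl
use strict;
use warnings;

# Hypothetical helper: build an ssh port-forwarding command like the ones
# in the configuration area. All host names used below are placeholders.
sub tunnel_command {
    my ($local_port, $news_host, $gateway, $seconds) = @_;
    # -f puts ssh in the background once it has authenticated,
    # -q suppresses warnings, and -L forwards the local port to
    # port 119 (NNTP) on the news host as seen from the gateway.
    # The remote "sleep" keeps the tunnel open for $seconds seconds.
    return "ssh -f -q -L $local_port:$news_host:119 $gateway sleep $seconds";
}

my $cmd = tunnel_command(1191, 'news.example.com', 'shell.example.com', 180);
print "$cmd\n";
# The program would hand this string to system() and then connect to
# localhost:1191 instead of the real news server.
```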

If you have an ISP that requires authinfo-style authentication, you may also include user and pass as parameters here. Be aware that those passwords are transmitted in the clear, so wire-snoopers will see them.
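As a rough sketch of what such an entry might look like, and of how the program decides whether to authenticate, consider the following. The host, user, and pass values are invented, maybe_authenticate is a hypothetical helper, and FakeClient merely stands in for a Net::NNTP object so the example runs without a network connection.

```perl
use strict;
use warnings;

# Stand-in for a Net::NNTP connection; it only records authinfo calls.
{
    package FakeClient;
    sub new      { return bless { calls => [] }, shift }
    sub authinfo { my ($self, @args) = @_; push @{ $self->{calls} }, [@args]; return 1 }
}

# Placeholder configuration: only 'ex' needs a login.
my %NNTP = (
    'ex' => { host => 'news.example.com', user => 'alice', pass => 'secret' },
    'in' => { host => 'news.example.net' },
);

# Hypothetical helper: send authinfo only when a user name is configured.
sub maybe_authenticate {
    my ($client, $info) = @_;
    return 0 unless $info->{user};
    $client->authinfo($info->{user}, $info->{pass});
    return 1;
}

my $client = FakeClient->new;
for my $short_host (sort keys %NNTP) {
    my $did = maybe_authenticate($client, $NNTP{$short_host});
    print "$short_host: ", ($did ? "sent authinfo" : "no authinfo needed"), "\n";
}
```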

Lines 32 through 34 set up global variables. %id holds the information about each message-ID. $FROM and $TO are the human-readable start and end dates for this particular report.

Lines 36 through 65 handle the initial Deja lookup. Line 39 holds a hash used to ensure that we look at a particular Deja article number only once. Deja breaks up long articles into multiple hits, and we need only the first hit to find the message-ID.

Lines 41 through 49 set up a WWW::Search object, looking for the right articles in the designated groups in the indicated date range. We’ll set the maximum to a number that is nice and high (10,000), although I probably wouldn’t have the patience to fetch more than 1,000 or so hits from Deja. Because of redundancy, having more than 1,000 hits probably wouldn’t be much more useful anyway.

Lines 50 to 64 discover the matching hits and look for message-IDs in those hits. Each hit will come out in $result in line 51, with its URL extracted in line 52.

Because multiple hits can refer to the same article, I have to process the query-form parameters of the URL to determine which Deja article number is being fetched. If article 9876 is too long, Deja will return successive chunks in hits 9876.1, 9876.2, 9876.3, and so on, but we want only the whatever-dot-1 part. So, lines 57 to 59 will determine the article number and will skip any later hits on the same article number.
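Here is a minimal, offline sketch of that de-duplication step. The sample URLs are invented, and query_params is a simplified stand-in I wrote for URI’s query_form method (it does no percent-decoding), so treat it as an illustration of the idea rather than a copy of the listing.

```perl
use strict;
use warnings;

# Simplified stand-in for URI's query_form: split the query string of a
# URL into key/value pairs. No percent-decoding is attempted.
sub query_params {
    my ($url) = @_;
    my ($query) = $url =~ /\?(.*)$/;
    return () unless defined $query;
    return map { my ($k, $v) = split /=/, $_, 2; ($k, defined $v ? $v : '') }
           split /&/, $query;
}

# Made-up hits: two chunks of article 9876, one short article, one non-article.
my @hits = (
    'http://search.example.com/getdoc?AN=9876.1&fmt=text',
    'http://search.example.com/getdoc?AN=9876.2&fmt=text',
    'http://search.example.com/getdoc?AN=1234&fmt=text',
    'http://search.example.com/about',
);

my (%seen, @articles);
for my $url (@hits) {
    my %query = query_params($url);
    next unless exists $query{AN};       # not an article link
    my ($an) = $query{AN} =~ /(\d+)/;    # 9876.2 becomes 9876
    next if $seen{$an}++;                # a later chunk of a known article
    push @articles, "$an.1";             # always ask for the first chunk
}
print "@articles\n";                     # prints "9876.1 1234.1"
```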

Lines 60 to 62 fetch the text of the article (or just the first part of a long article) and extract the message-ID, noting it into the %id two-level hash with a first-level key of the message-ID and a second-level key of DJ (for DeJa).
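The extraction and bookkeeping can be tried without touching the network. In this sketch the article text is invented and note_message_id is a hypothetical wrapper. The regular expression and the two-level %id update are the same ones the listing uses.

```perl
use strict;
use warnings;

my %id;

# Hypothetical helper: find the Message-ID header in raw article text and
# record that the given source (DJ or AV, for instance) has seen it.
sub note_message_id {
    my ($text, $source) = @_;
    return unless $text =~ /^Message-ID:\s+(.*\S)\s*$/m;
    $id{$1}{$source}++;
    return $1;
}

# Made-up article text.
my $article = <<'END';
From: someone@example.com
Newsgroups: comp.lang.perl.misc
Message-ID: <abc123@news.example.com>
Subject: test

Body text here.
END

note_message_id($article, 'DJ');
note_message_id($article, 'AV');
print join(",", sort keys %{ $id{'<abc123@news.example.com>'} }), "\n";  # prints "AV,DJ"
```

Because the first-level key is the message-ID itself, the same article reported by two sources lands in a single entry.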

Lines 67 to 88 do basically the same thing for AltaVista. The biggest change is that AV wants the newsgroups in the query string (not as a separate field), and the date format is incompatibly different (month/day/year for Deja, day-month-year for AV). Additionally, the returned URL needs to be hacked in a slightly different way to get a good text file to search (see line 82). And we’ll record message-IDs found from AV in the %id hash again, this time with a sub-key of AV.

Lines 90 to 94 extract the known message-IDs from the %id hash. Initially, I just sorted the IDs to make the report consistent between runs, but I thought it might be nice to see the messages grouped by originating host to see if there was a pattern. So, we have a classic Schwartzian Transform here (named for me, but not by me) to sort the message-IDs by their hostname first, then localpart second. The result is a list of messages that the big two archivers have seen, for which we now need to scan our local news servers for confirmation.
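For readers who haven’t met the transform before, here it is on a handful of invented message-IDs. Read it from the bottom up: the last map computes the sort keys once per ID, the sort compares those precomputed keys, and the first map throws the keys away again.

```perl
use strict;
use warnings;

# Made-up message-IDs, including one with no @ sign.
my @ids = (
    '<zzz@alpha.example.com>',
    '<aaa@beta.example.com>',
    '<mmm@alpha.example.com>',
    'no-at-sign',
);

my @sorted =
    map  { $_->[0] }                                     # 3. keep only the ID
    sort { $a->[2] cmp $b->[2]                           # 2. host part first,
           or $a->[1] cmp $b->[1]                        #    then local part,
           or $a->[0] cmp $b->[0] }                      #    then the whole ID
    map  { /(.*)\@(.*)/ ? [$_, $1, $2] : [$_, "", ""] }  # 1. compute keys once
    @ids;

print "$_\n" for @sorted;
```

The ID without an @ sign gets empty keys and sorts first, followed by the two alpha.example.com IDs in local-part order and then the beta.example.com one.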

Lines 96 through 123 look at each of the news servers being tested. $short_host gets a two-character unique identification in line 98. We’ll extract the access information in line 99 from the %NNTP hash into %INFO. If it’s tunneled, that tunnel program is launched in lines 100 to 103.

Line 105 attempts a connection to the NNTP host and port. If that doesn’t work, lines 106 to 109 discover that and move on to the next one. Lines 110 through 112 provide authinfo connection information if that’s designated in the %NNTP hash above for this host. Note that this may fail, but that’ll just make the later stuff exit early.

Lines 113 to 123 ask each particular NNTP server if it has seen each message-ID. The return value from the nntpstat method will be true if the server has the article, so we’ll note in line 116 that this was so. The code controlled by $NOISY notes our progress and results.
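The per-server loop is easy to exercise offline if you swap in a stand-in for the connection. In this sketch FakeNNTP is an invented class whose nntpstat answers true only for the IDs it was given, and check_server is a hypothetical helper. A real run would pass a Net::NNTP object instead.

```perl
use strict;
use warnings;

# Stand-in for a Net::NNTP connection so the sketch runs offline.
{
    package FakeNNTP;
    sub new {
        my ($class, @have) = @_;
        my %have;
        $have{$_} = 1 for @have;
        return bless \%have, $class;
    }
    sub nntpstat { my ($self, $msg_id) = @_; return $self->{$msg_id} }
}

# Hypothetical helper: record which of @$msg_ids this server has.
sub check_server {
    my ($client, $short_host, $msg_ids, $id) = @_;
    for my $msg_id (@$msg_ids) {
        $id->{$msg_id}{$short_host}++ if $client->nntpstat($msg_id);
    }
}

# Made-up IDs that Deja is assumed to have reported.
my @msg_id = ('<a@example.com>', '<b@example.com>');
my %id     = map { ($_ => { DJ => 1 }) } @msg_id;

check_server(FakeNNTP->new('<a@example.com>'), 'in', \@msg_id, \%id);

for my $msg_id (@msg_id) {
    print "$msg_id at in: ", ($id{$msg_id}{in} ? "yes" : "no"), "\n";
}
```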

Lines 125 to 134 dump out the final report. First, we’ll get the host list from the %NNTP hash in line 125. Line 127 dumps out a nice banner. Lines 129 to 132 dump out the results for each message-ID for each host, including Deja and AltaVista, as a nice two-character code if found or spaces if absent, and make a nice set of columns in front of each message-ID.
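The column layout is easier to picture with a small sample. The data here is invented, and report_line is a hypothetical helper that builds each line as a string (the listing prints piece by piece instead), with a two-character code per source, or two spaces where the article is absent.

```perl
use strict;
use warnings;

# Made-up sample data in the shape the program builds.
my %id = (
    '<a@example.com>' => { DJ => 1, AV => 1, in => 1 },
    '<b@example.com>' => { DJ => 1, te => 1 },
);
my @hosts = ('in', 'te');

# Hypothetical helper: one report line for one message-ID.
sub report_line {
    my ($msg_id) = @_;
    my $line = '';
    for my $host ('DJ', 'AV', @hosts) {
        $line .= ($id{$msg_id}{$host} ? $host : '  ') . ' ';
    }
    return $line . $msg_id;
}

print report_line($_), "\n" for sort keys %id;
```

A gap in one of the local-server columns next to a filled DJ and AV pair is the pattern to look for.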

Finally, lines 136 to 148 turn a Unix timestamp into an appropriate date in the incompatible Deja and AV formats.
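To see the two formats side by side, here is a variant of those subroutines that takes the “now” timestamp as an argument. That extra parameter and the shorter names are my own changes for the sake of a repeatable example, since the originals call time directly.

```perl
use strict;
use warnings;

# Deja-style date: month/day/year.
sub deja_date {
    my ($now, $days) = @_;
    my @gm = gmtime($now - 86400 * $days);
    return sprintf "%02d/%02d/%04d", 1 + $gm[4], $gm[3], 1900 + $gm[5];
}

# AltaVista-style date: day-month-year.
sub alta_date {
    my ($now, $days) = @_;
    my @gm = gmtime($now - 86400 * $days);
    return sprintf "%02d-%02d-%04d", $gm[3], 1 + $gm[4], 1900 + $gm[5];
}

my $now = 941414400;               # 1999-11-01 00:00:00 UTC
print deja_date($now, 21), "\n";   # prints "10/11/1999"
print alta_date($now, 21), "\n";   # prints "11-10-1999"
```

The same day comes out as 10/11 for one engine and 11-10 for the other, which is exactly why the program needs two subroutines.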

So, if you suspect propagation problems with your news server, now you can see just how much of the news you’re really getting. Until next time, enjoy!




Listing One: Randal’s Newsfeed Checker (part I)


1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
4
5 use Net::NNTP;
6 use WWW::Search;
7 use LWP::Simple;
8 use URI;
9
10 ## CONFIG ##
11
12 my $NOISY = 1;
13
14 my @GROUPS = qw(comp.lang.perl.misc rec.humor.funny
pdx.general comp.risks);
15
16 my $DAYS_AGO_FROM = 21;
17 my $DAYS_AGO_TO = 19;
18
19 my %NNTP =
20 (
21 'in' => {host => 'news.inetXXXXX.com'},
22 'te' => {host => 'localhost:1191',
23 tunnel => 'ssh -f -q -L 1191:news.teleXXXX.com:119 teleXXXX.com sleep 180',
24 },
25 'ag' => {host => 'localhost:1190',
26 tunnel => 'ssh -f -q -L 1190:herXXX.rXXXX.com:119 agXXX.rXXXX.com sleep 180',
27 },
28 );
29
30 ## END CONFIG ##
31
32 my %id;
33 my $FROM = days_ago_to_deja_date($DAYS_AGO_FROM);
34 my $TO = days_ago_to_deja_date($DAYS_AGO_TO);
35
36 ## deja phase
37
38 {
39 my %seen;
40
41 my $search = WWW::Search->new('Dejanews');
42 $search->native_query
43 ("",
44 {
45 groups => join(',', @GROUPS),
46 fromdate => $FROM,
47 todate => $TO,
48 });
49 $search->maximum_to_retrieve(10000);
50 print "Deja: " if $NOISY;
51 while (my $result = $search->next_result) {
52 my $url = $result->url;
53 my $uri = URI->new($url);
54 my %query = $uri->query_form;
55 next unless exists $query{AN};
56 print "." if $NOISY;
57 my($an) = $query{AN} =~ /(\d+)/;
58 next if $seen{$an}++;
59 $uri->query_form(AN => "$an.1", fmt => 'raw');
60 next unless $_ = get "$uri";
61 next unless /^Message-ID:\s+(.*\S)\s*$/m;
62 $id{$1}{DJ}++;
63 }
64 print "\n" if $NOISY;
65 }
66
67 ## alta phase
68
69 {
70 my $search = WWW::Search->new
('AltaVista::AdvancedNews');
71 $search->native_query
72 (join(" OR ", map "newsgroups:$_", @GROUPS),
73 {
74 d0 => days_ago_to_alta_date($DAYS_AGO_FROM),
75 d1 => days_ago_to_alta_date($DAYS_AGO_TO),
76 });
77 $search->maximum_to_retrieve(10000);
78 print "Alta: " if $NOISY;
79 while (my $result = $search->next_result) {
80 my $url = $result->url;
81 print "." if $NOISY;
82 $url =~ s/news\?msg/news?plain\@msg/;
83 next unless $_ = get $url;
84 next unless /^Message-ID:\s+(.*\S)\s*$/m;
85 $id{$1}{AV}++;
86 }
87 print "\n" if $NOISY;
88 }
89
90 my @msg_id =
91 map { $_->[0] }
92 sort { $a->[2] cmp $b->[2] or $a->[1] cmp $b->[1]
or $a->[0] cmp $b->[0] }
93 map { /(.*)\@(.*)/ ? [$_, $1, $2] : [$_, "", ""] }
94 keys %id;
95
96 ## nntp phase
97
98 for my $short_host (sort keys %NNTP) {
99 my %INFO = %{$NNTP{$short_host}};
100 if (my $tun = $INFO{tunnel}) {
101 print "launching $tun\n" if $NOISY;
102 system $tun;
103 }
104
105 my $c = Net::NNTP->new($INFO{host});
106 unless (defined $c) {
107 warn "cannot connect to $short_host, skipping\n";
108 next;
109 }
110 if ($INFO{user}) {
111 $c->authinfo($INFO{user},$INFO{pass});
112 }
113 for my $msg_id (@msg_id) {
114 print "$msg_id at $short_host: " if $NOISY;
115 if ($c->nntpstat($msg_id)) {
116 $id{$msg_id}{$short_host}++;
117 print "yes" if $NOISY;
118 } else {
119 print "no" if $NOISY;
120 }
121 print "\n" if $NOISY;
122 }
123 }
124
125 my @hosts = sort keys %NNTP;
126
127 print "report from $FROM to $TO for @GROUPS\n";
128 for my $msg_id (@msg_id) {
129 for my $host ("DJ","AV", @hosts){
130 print $id{$msg_id}{$host} ? $host : "  ";
131 print " ";
132 }
133 print "$msg_id\n";
134 }
135
136 ## subroutines
137
138 sub days_ago_to_deja_date {
139 my $days = shift;
140 my @gm = gmtime(time - 86400 * $days);
141 return sprintf "%02d/%02d/%04d", 1 + $gm[4],
$gm[3], 1900 + $gm[5];
142 }
143
144 sub days_ago_to_alta_date {
145 my $days = shift;
146 my @gm = gmtime(time - 86400 * $days);
147 return sprintf "%02d-%02d-%04d", $gm[3], 1 + $gm[4],
1900 + $gm[5];
148 }





Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-authored Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com.









Linux Magazine /
November 1999 / PERL OF WISDOM
No News Is Not Good News



