Determining Text Comprehensibility

Ahh, manpages. Some of them are great. But, a few of them are just, well, incomprehensible. So I was sitting back a few days ago, wondering if there was a way to locate the really ugly ones for some sort of award. Then I remembered that I had seen a neat module called Lingua::EN::Fathom that could compute various statistics about a chunk of text or a file, including the relative readability indices, such as the "Fog" index. The "Fog" index is interesting in particular because it was originally calibrated to be an indication of "grade level," with 1.0 being "first grade text" and 12.0 being "high school senior." At least that's the way I remember it.

While I don’t believe in the irrational religion sometimes applied to these indices (“We shall have no documentation with a Fog factor of higher than 10.0”), I do think they can indicate that something is amiss.
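For reference, here is a minimal sketch of the usual Gunning Fog formula, computed from raw counts. The sub name fog_index and the sample counts are made up for illustration; a real analyzer such as Lingua::EN::Fathom does the counting for you, and its idea of a “complex” word may differ in detail.

```perl
use strict;
use warnings;

# Gunning Fog index from raw counts:
#   0.4 * (average words per sentence + percentage of complex words)
# "Complex" conventionally means a word of three or more syllables.
sub fog_index {
    my ($words, $sentences, $complex_words) = @_;
    return 0 unless $words and $sentences;    # avoid dividing by zero
    my $words_per_sentence = $words / $sentences;
    my $percent_complex    = 100 * $complex_words / $words;
    return 0.4 * ($words_per_sentence + $percent_complex);
}

# Hypothetical counts: 100 words, 5 sentences, 10 complex words,
# which works out to 0.4 * (20 + 10).
printf "%.1f\n", fog_index(100, 5, 10);
```

Long sentences and long words both push the number up, which is why terse manpages full of jargon can still score badly.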

So, in an hour or so, I hacked out a program that wanders through all of the manpages in my MANPATH, extracts the text information (discarding the troff markup), and, for my amusement, sorts them by Fog index. Since I brought together a lot of different technologies and CPAN modules to do it, I thought I’d share this program with you in Listing One. Without further ado, I will go over some of it in detail:

Line 1 gives the path to Perl and turns on warnings. I usually don’t trigger any warnings, but it’s nice to run occasionally with the safeties on.

Line 2 enables the normal compiler restrictions, requiring me to declare all variables and quote all strings (rather than using barewords), and preventing me from using those hard-to-debug symbolic references.

Line 3 ensures that each print operation on STDOUT results in an immediate I/O operation. Normally, we’d like STDOUT to be buffered to minimize the number of system calls, but since this program produces a trace of what’s happening, I would kind of like to know that it’s happening while it is happening (not after we got 8000 bytes to throw from a buffer). As an aside, some people would prefer that I use $| = 1; here, because it would be clearer. But, I find the $|++ form easier to type, and I saw Larry do it once, so it must be blessed.
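A small sketch of the two spellings, for the curious. It relies on the fact that $| only ever holds 0 or 1, so incrementing it repeatedly is harmless; the variable names are just for illustration.

```perl
use strict;
use warnings;
use IO::Handle;

my $before = $|;        # normally 0: STDOUT is buffered
$|++;                   # any true value turns autoflush on ...
my $after = $|;         # ... and Perl keeps it as exactly 1

$|++;                   # incrementing again still leaves 1
my $again = $|;

STDOUT->autoflush(1);   # the spelled-out IO::Handle equivalent

print "before=$before after=$after again=$again\n";
```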

Line 6 provides the only configuration variable for this program: the location of the memory to be used between invocations. Running the text analysis on the data each time is expensive (especially when testing the report generator at the bottom of the program), so I’m keeping a file in my home directory to hold the results. The filename will have an extension appended to it, depending on the chosen DBM library.

Line 9 is what got me started on this program: a module from the CPAN to compute readability scores. As this is not part of the standard distribution, you’ll need to install this yourself.

Line 10 provides the two constants I needed for the later DBM association.

Line 11 pulls in the “multi-level database” adapter. MLDBM wraps the fetch and store routines for a DBM tied hash so that any reference (meaning a data structure) is first “serialized.” The result is a complex data structure that is turned into a simple byte string during storage. When it is retrieved, the reverse occurs, so that we get a similar data structure again. There are some interesting limitations, but in writing this particular program, none of them managed to get in my way.

The args to the use indicate that we want to use DB_File as our DBM and Storable as our serializer. DB_File is found in the Perl distribution, but you must have installed “Berkeley DB” before building Perl for this to be useful. Replace that with SDBM_File if you can’t find DB_File. Storable is also found in the CPAN, and is my preferred serializer due to its robustness and speed. However, Data::Dumper can also be used here, with the advantage that it’s the default.
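To make the serialization step concrete, here is a rough sketch of the round trip using Storable’s freeze and thaw directly. This is not MLDBM’s actual code: a plain hash named %store stands in for the DBM file, and the key and values are made up.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my %store;                              # stand-in for the DBM file (assumption)
my $key  = "/usr/man/man1/tar.1";
my %info = ( mtime => 946684800, fog => 12.5 );   # made-up values

# "Store": the hashref is flattened into a single byte string.
$store{$key} = freeze(\%info);

# "Fetch": the byte string is expanded into a fresh, similar structure.
my $copy = thaw($store{$key});

print "stored value is a ",
    (ref $store{$key} ? "reference" : "plain string"), "\n";
print "fog is $copy->{fog}\n";

# The fetched copy is detached: changing it does not touch the store.
$copy->{fog} = 99;
my $again = thaw($store{$key});
print "fog in the store is still $again->{fog}\n";
```

The last three lines hint at the kind of limitation mentioned above: with a serializing layer, modifying a nested element of a fetched value does not write anything back. You have to assign a whole new value to the top-level key.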

Line 12 selects the ever-popular File::Find module (included in the distribution) to recurse downward through the man directories to scout out the manpage files.

Line 13 enables simple trapping of signals with a trivial die operation. I found that, without this line, if I pressed control-C too early in the process, none of my database had been updated. After thinking about it, this actually made sense. An interrupt stops everything, not even giving Perl a chance to write the data to the file by closing the database cleanly.
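Here is a small sketch of what that one line buys you. The script sends itself an interrupt to simulate control-C; because sigtrap turns the signal into an ordinary die, it can be caught by eval, and in a real program the normal shutdown (including closing the database) gets its chance to run. The variable $caught is just for illustration.

```perl
use strict;
use warnings;
use sigtrap qw(die normal-signals);

my $caught = "";
eval {
    kill 'INT', $$;   # simulate control-C by signalling ourselves
    sleep 1;          # give Perl a moment to dispatch the handler
    1;
} or $caught = $@;

# The signal arrived as a regular exception instead of killing us outright.
print "caught: $caught";
```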

Line 15 associates %DB with the multilevel database named in $DATAFILE. The remaining parameters are passed to the underlying DB_File tie operation. Those parameters create the database (if necessary) and set the permissions for the file (again, if necessary).
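The same tie parameters can be tried out without MLDBM or Berkeley DB. This sketch ties directly to SDBM_File (which ships with Perl) in a temporary directory, so only flat strings are stored; the file name and key are made up.

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use SDBM_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/demo";   # SDBM appends its own extensions to this name

# O_CREAT creates the database if needed; 0644 is used only on creation.
tie my %DB, 'SDBM_File', $file, O_CREAT|O_RDWR, 0644 or die "Cannot tie: $!";
$DB{greeting} = "hello";
untie %DB;                # flush and close cleanly

# A later "run" reopens the same file and finds the value still there.
tie my %again, 'SDBM_File', $file, O_RDWR, 0644 or die "Cannot tie: $!";
my $value = $again{greeting};
untie %again;

print "$value\n";
```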

Line 17 sets up a global @manpages variable which holds the manpages that were found by the subroutine in lines 20 through 23.

Lines 19 through 24 walk through the directories named in my MANPATH, looking for manpages. First, the MANPATH is split on colons. Then, each element is suffixed with slash-period-slash. As far as File::Find is concerned, this doesn’t change the starting directories. But, as we’ll see in line 29, the presence of this marker is needed later to distinguish the prefix directory from the location within that directory.

The anonymous subroutine starting in line 19 is called repeatedly by File::Find’s find routine. The full name of each file can be found in $File::Find::name, while $_ is set up properly together with the current directory to perform file stat tests. The conditions that I’m using here declare that we’re looking for a plain file (not a symbolic link) that isn’t named whatis and is neither too big nor too small. If it passes the test, the name gets stuffed at the end of @manpages.
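The filter and the slash-period-slash marker can be tried on a throwaway directory tree. In this sketch the helper make_file and the file names are hypothetical; the wanted subroutine is the same test as in the listing.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

my $top = tempdir(CLEANUP => 1);
mkdir "$top/man1" or die "mkdir: $!";

# Hypothetical helper: write a file of the given size.
sub make_file {
    my ($path, $bytes) = @_;
    open my $fh, '>', $path or die "open $path: $!";
    print $fh "x" x $bytes;
    close $fh or die "close $path: $!";
}
make_file("$top/man1/ls.1",   200);   # kept
make_file("$top/man1/tiny.1",  10);   # rejected: too small
make_file("$top/whatis",      200);   # rejected by name

my @manpages;
find sub {
    return unless -f and not -l and $_ ne "whatis";
    my $size = -s;
    return if $size < 80 or $size > 16384;
    push @manpages, $File::Find::name;
}, "$top/./";

# Each reported name still carries the marker after the starting directory.
print "$_\n" for @manpages;
```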

Line 26 creates the text analyzer object. I humored myself at the time by calling it $fat, which originally was a shortened form of “fathom.” As I write this column the following day, I can no longer remember why I found that funny. I guess it’s meta-funny.

And now for the first big loop in lines 28 to 48. This is where we’ve got the list of manpages. It’s time to go see just how awful they are.

Line 29 pulls apart the $dir (which is the original element of my MANPATH) from the $file (which is the path below that directory). This is possible because we included the slash-dot-slash marker in the middle of the path during the filename searching.

This is necessary because the troff commands of the manpages presume that the current directory is at the top of the manpage tree during processing — particularly for the .so command, which can bring in another manpage like an include file.

Line 30 refixes the name to avoid the marker. Line 31 shows us our progress with that updated name.
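The split is easy to try in isolation. In this sketch the sub name split_marker is made up; the regular expression is the one from line 29.

```perl
use strict;
use warnings;

# Split "prefix/./rest" at the first marker; returns an empty list if
# the marker is absent.
sub split_marker {
    my ($name) = @_;
    return $name =~ m{(.*?)/\./(.*)}s;
}

my $name = "/usr/man/./man1/tar.1";   # as File::Find would report it
if (my ($dir, $file) = split_marker($name)) {
    my $clean = "$dir/$file";         # the name without the marker
    print "dir=$dir file=$file clean=$clean\n";
}
```

The non-greedy .*? matters: it stops at the first marker, so $dir is exactly the MANPATH element and $file is everything below it.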

Lines 32 through 36 keep us from rescanning the same file. First, the modification timestamp is grabbed into $mtime. Next, we check the existing database entry (if any) to see if the recorded modification time from a previous run is the same as the modification time we’ve just seen. If they’re the same, we’ve already done this file on a prior run, and we can skip this one altogether. If not, we have to get our hands dirty with it.
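The skip test boils down to something like the following sketch, where a plain hash %cache stands in for the tied database and the sub name already_computed is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my %cache;   # stand-in for the tied %DB (assumption)

# True if we have a record for this file with the same modification time.
sub already_computed {
    my ($name) = @_;
    my $mtime = (stat $name)[9];
    return exists $cache{$name} && $cache{$name}{mtime} == $mtime;
}

my ($fh, $name) = tempfile(UNLINK => 1);
close $fh;

my $first = already_computed($name) ? "skip" : "scan";   # nothing recorded yet
$cache{$name} = { mtime => (stat $name)[9] };            # record this version
my $second = already_computed($name) ? "skip" : "scan";  # same mtime now

print "$first, then $second\n";
```

Checking exists first keeps the lookup from creating an empty entry for a file we have never seen.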

Line 38 is where this program actually spends most of its time. We have a deroff command that reads a troff file and removes most of the troff embedded control sequences and commands. While it’s certainly not perfect, it is fairly useful and close enough for this demonstration. Also, we need to be in that parent directory so that relative filenames work; that’s handled nicely with the simple one-liner shell command inside the back quotes.

“But wait,” you may ask. “I don’t have a deroff!” Never fear. I ran into the same problem myself. A quick search on the Net (thank you http://www.google.com!) revealed that this had been one of the already completed commands in the Perl Power Tools project, archived at http://language.perl.com/ppt/. So, I downloaded the .tar.gz file from that page, extracted the pure Perl implementation of deroff, and installed it quite nicely. Yes, I know that there are a few open source C versions out there, but I just didn’t want to futz around.

Line 39 detects a failure in the attempt to deroff the text and moves along if something broke. Absolutely nothing went wrong in the hundreds of files I analyzed, but you never know.
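Here is a sketch of the same back-quote pattern with its status check. Since deroff may not be installed, cat stands in for it here, and the directory and file names are made up; the point is the cd-then-run shape and the test of $? afterward.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/page.1" or die "open: $!";
print $fh "hello\n";
close $fh or die "close: $!";

# cat stands in for deroff (assumption). The cd happens in the child
# shell only, so our own working directory is untouched.
my $text   = `cd $dir && cat page.1`;
my $status = $?;

# A failing cd short-circuits the && and leaves a nonzero wait status.
my $bad        = `cd $dir/no-such-dir 2>/dev/null && cat page.1`;
my $bad_status = $?;

print "text: $text";
print "second command failed, exit code ", $bad_status >> 8, "\n"
    if $bad_status;
```

Note that $? is a wait status, so the exit code proper lives in the high byte, hence the shift by 8.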

Line 40 is where this program does some heavy CPU on its own. Looking for the various statistics (including our readability scores), the text of the deroff’ed manpage is crunched. There didn’t appear to be any possible error return from this call, so I didn’t try to detect one.

Line 42 creates the %info data structure to hold the attributes of the particular file that we want to store in the database. We’ll start with the modification time that we fetched earlier to ensure that later passes will go, “Hey, I’ve already seen this version.”

Lines 43 through 45 use the $fat object to access the three scores, via the fog, flesch, and kincaid methods. I’ve used a nice trick here: an “indirect method call,” where the name of the method comes from a variable. The result is as if I had said:


$info{"fog"} = $fat->fog();
$info{"flesch"} = $fat->flesch();
$info{"kincaid"} = $fat->kincaid();

but with considerably less typing (that is, until I put together this illustration).
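The indirect call works on any object, so it can be tried without installing anything. In this sketch, FakeScores is a hypothetical stand-in class that returns fixed numbers; only the loop mirrors the listing.

```perl
use strict;
use warnings;

package FakeScores;   # hypothetical stand-in for the analyzer object
sub new     { return bless {}, shift }
sub fog     { return 12.5 }
sub flesch  { return 60.1 }
sub kincaid { return 9.8 }

package main;

my $fat = FakeScores->new;
my %info;
for my $meth (qw(fog flesch kincaid)) {
    # The method name is taken from the string in $meth at run time.
    $info{$meth} = $fat->$meth();
}
print "$_ => $info{$_}\n" for sort keys %info;
```

This form is allowed under use strict, since a method name held in a scalar is not a symbolic reference.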

Line 46 stores the information into the database. The value is a reference to the hash, but the MLDBM triggers a serialization so that the actual DBM value stored is just a byte string that can be reconverted into a similar data structure upon access.

In fact, an access already potentially occurred up in line 33. The access to $DB{$name} fetched a byte string from the disk database, which was then reconverted into a hashref so that the subsequent access to the hashref element with a key of mtime would succeed.

Line 47 lets us know we did the deed for this file and are moving on.

That completes the data-gathering phase. It’s now time to do the report, as indicated in line 50.

Line 54 is a quick trick with interesting performance consequences. The hash of %DB acts like a normal hash but actually involves two levels of tied data structures. This can be quite slow, especially when performing repeated accesses for sorting. So, in one brief operation, we copy the entire database as an in-memory hash to the %db hash. Now we can use %db in the same way we would have used %DB, but without the same access expense. Of course, since it’s a copy, we can’t change the real database. But, that’s not needed here.

Lines 55 to 57 sort the database by the key specified in $kind, defined in line 52. We’ve got a descending numeric sort to put the worst offenders first. A simple printf makes the columns line up nicely.
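The copy-then-sort step can be sketched with an ordinary hash standing in for the tied database. The three entries below borrow their scores from the sample output; everything else mirrors lines 52 through 57.

```perl
use strict;
use warnings;

my %DB = (    # stand-in for the tied database (assumption)
    "/usr/man/man8/kbdrate.8" => { fog => 5.881 },
    "/usr/man/man1/tar.1"     => { fog => 45.591 },
    "/usr/man/man7/samba.7"   => { fog => 6.899 },
);

my $kind = "fog";
my %db   = %DB;   # one pass over the slow hash, then work in memory

# Descending numeric sort: $b before $a puts the largest score first.
my @ranked = sort { $db{$b}{$kind} <=> $db{$a}{$kind} } keys %db;

printf "%10.3f %s\n", $db{$_}{$kind}, $_ for @ranked;
```

The sort block runs many times per key, which is why paying for each tied fetch only once, during the copy, is worth it.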

My output from running this program looks something like Listing Two. Yeah, that first file ranked a whopping “grade 167” education needed to read it. That is, in theory. Approximately a 5th or 6th grade education was required for the simplest few. As a comparison, the text of this column (before editing) came out at around 13.3 on the fog index. Hmmm…I hope you all made it through high school! Until next time, keep your sentences short and to the point. Enjoy!




Listing One: Comprehensibility Checker


 1  #!/usr/bin/perl -w
 2  use strict;
 3  $|++;
 4
 5  ## config
 6  my $DATAFILE = "/home/merlyn/.manfog";
 7  ## end config
 8
 9  use Lingua::EN::Fathom;
10  use Fcntl qw(O_CREAT O_RDWR);
11  use MLDBM qw(DB_File Storable);
12  use File::Find;
13  use sigtrap qw(die normal-signals);
14
15  tie my %DB, 'MLDBM', $DATAFILE, O_CREAT|O_RDWR, 0644 or die "Cannot tie: $!";
16
17  my @manpages;
18
19  find sub {
20    return unless -f and not -l and $_ ne "whatis";
21    my $size = -s;
22    return if $size < 80 or $size > 16384;
23    push @manpages, $File::Find::name;
24  }, map "$_/./", split /:/, $ENV{MANPATH};
25
26  my $fat = Lingua::EN::Fathom->new;
27
28  for my $name (@manpages) {
29    next unless my ($dir, $file) = $name =~ m{(.*?)/\./(.*)}s;
30    $name = "$dir/$file";
31    print "$name ==> ";
32    my $mtime = (stat $name)[9];
33    if (exists $DB{$name} and $DB{$name}{mtime} == $mtime) {
34      print "... already computed\n";
35      next;
36    }
37
38    my $text = `cd $dir && deroff $file`;
39    (print "cannot deroff: exit status $?"), next if $?;
40    $fat->analyse_block($text);
41
42    my %info = ( mtime => $mtime );
43    for my $meth (qw(fog flesch kincaid)) {
44      $info{$meth} = $fat->$meth();
45    }
46    $DB{$name} = \%info;
47    print "... done\n";
48  }
49
50  print "final report:\n\n";
51
52  my $kind = "fog";
53
54  my %db = %DB; # speed up the cache
55  for my $page (sort { $db{$b}{$kind} <=> $db{$a}{$kind} } keys %db) {
56    printf "%10.3f %s\n", $db{$page}{$kind}, $page;
57  }




Listing Two: Output of Comprehensibility Checker


final report:

   167.341 /usr/lib/perl5/5.00503/man/man3/WWW::Search::Euroseek.3
   154.020 /usr/lib/perl5/5.00503/man/man3/GTop.3
    65.528 /usr/lib/perl5/5.00503/man/man3/Tk::X.3
    56.616 /usr/man/man1/mh-chart.1
    45.591 /usr/man/man1/tar.1
    40.133 /usr/lib/perl5/5.00503/man/man3/Bio::SeqFeatureI.3
    39.012 /usr/lib/perl5/5.00503/man/man3/XML::BMEcat.3
    37.714 /usr/lib/perl5/5.00503/man/man3/less.3
    37.200 /usr/lib/perl5/5.00503/man/man3/Business::UPC.3
    36.809 /usr/lib/perl5/5.00503/man/man3/Number::Spell.3
[...many lines omitted...]
     7.179 /usr/man/man1/tiffsplit.1
     7.174 /usr/lib/perl5/5.00503/man/man3/Tie::NetAddr::IP.3
     7.018 /usr/lib/perl5/5.00503/man/man3/DaCart.3
     6.957 /usr/man/man3/form_driver.3x
     6.899 /usr/man/man7/samba.7
     6.814 /usr/lib/perl5/5.00503/man/man3/Array::Reform.3
     6.740 /usr/lib/perl5/5.00503/man/man3/Net::GrpNetworks.3
     6.314 /usr/man/man5/rcsfile.5
     6.210 /usr/lib/perl5/5.00503/man/man3/Network::IPv4Addr.3
     6.002 /usr/lib/perl5/5.00503/man/man3/Net::IPv4Addr.3
     5.881 /usr/man/man8/kbdrate.8
     5.130 /usr/lib/perl5/5.00503/man/man3/Net::Netmask.3



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at: http://www.stonehenge.com/merlyn/LinuxMag/.
