Processing Footnotes

I wrote a Web page the other day and realized that I wanted footnotes. I wanted to keep the main message in the main text and have annotations for some of the side points. It's easy enough to do, right? Just put some text in a table at the end, use those cute little sup tags around the footnote numbers, and hack away.

Oops…those little numbers! I started to dread getting six footnotes inserted and then having to go back to insert yet another between numbers 2 and 3. It was going to be a maintenance nightmare. Could Perl help? Of course!

Expending about 10 times the amount of labor I would have spent doing this manually, I hacked out the program in Listing One. This obviously wasn’t very efficient, so to make the time that I invested worthwhile, I’ll pass the program along to you. Besides, it illustrates how to create an angly-bracket metalanguage for your HTML and XML processing. Yeah, that justifies it.

The idea is to insert a footnote into the main flow using a made-up tag of foot. The processor pass then takes those out, replacing them with an anchor link and a unique number. Then, at the end of the file, all the footnotes are dumped out. For an example, look at the end of the program. And, I couldn’t stop there, so I decided to allow nested footnotes (like those found on the alt.sysadmin.recovery newsgroup). About half of my coding time was spent getting those to work. Someday, I will learn to prioritize.

Let’s see what I wasted an hour on, starting with the first few lines that begin nearly every program I write. These lines enable warnings, turn on the normal compiler restrictions for non-trivial programs, and disable buffering on STDOUT.

Line 5 pulls in HTML::Parser, a wonderful piece of work maintained by Gisle Aas. This is a C-based module for lightning-fast parsing of angly-bracket input, which is normally HTML text, and it is much faster than hand-rolled regular expressions. You’ll find it as part of the LWP module family in the CPAN.

Lines 7 and 8 contain the footnote list and footnote stack index, respectively. The first item of @feet is the text of the first footnote, numbered 1. The last item of @feet_index (if any) is the subscript in @feet of the footnote we are currently building. As each new foot tag is seen, we create a new empty footnote in @feet and put its index at the end of @feet_index. When the note is ended, we pop off @feet_index, thus resuming the previous note. If there are no items in @feet_index, we’re in the main body, and we can just copy the data through.
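The bookkeeping described above is easier to follow in isolation. Here is a minimal sketch that uses plain Perl and no parser at all: the @events list is a made-up stand-in for the parser callbacks, and $body is a hypothetical accumulator where the real program prints to STDOUT. It also leaves out the reference marker that the real program drops into the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @feet;        # footnote texts, in order of their start tags
my @feet_index;  # stack of subscripts into @feet; the last one is the open note
my $body = '';   # main text (assumption: collected here instead of printed)

# Hypothetical event stream standing in for the parser callbacks.
my @events = (
    [ text  => 'Main ' ],
    [ start => 'foot' ],
    [ text  => 'outer ' ],
    [ start => 'foot' ],
    [ text  => 'inner' ],
    [ end   => 'foot' ],
    [ text  => 'tail' ],
    [ end   => 'foot' ],
    [ text  => 'text.' ],
);

for my $event (@events) {
    my ($type, $data) = @$event;
    if ($type eq 'start') {
        push @feet, '';                      # new empty note, numbered by start order
        push @feet_index, $#feet;            # it becomes the current note
    }
    elsif ($type eq 'end') {
        pop @feet_index;                     # resume the enclosing note, if any
    }
    elsif (@feet_index) {
        $feet[ $feet_index[-1] ] .= $data;   # text inside a note
    }
    else {
        $body .= $data;                      # text in the main body
    }
}

print "body: $body\n";
print "note $_: $feet[$_ - 1]\n" for 1 .. @feet;
```

The outer note is numbered 1 even though it closes last, and the text that follows the inner note ("tail") lands back in note 1 because the pop exposes the previous index.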

Yes, this is the logic that took me the better part of an hour to get right. I wanted the footnotes to be numbered in the order of the start tags. I kept creating algorithms that numbered them on the basis of the end tag instead, until I formulated the indirection table.

Line 10 keeps track of our nesting of elements. With the way I’m using HTML::Parser, it wouldn’t matter to the parser that I have mismatched tags. However, since my footnote processing is fragile under these circumstances, I enforced the XML-notion of “well-formedness” and required properly balanced tags.
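The nesting check boils down to a stack of open tag names. The sketch below shows that idea on its own; nesting_error is a hypothetical helper, not a routine from the listing, and it returns a message where the real program dies. The final "never closed" check is my addition here and is not something Listing One does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes [type, tagname] pairs and returns an error
# string for the first nesting problem, or undef if the tags balance.
sub nesting_error {
    my @events = @_;
    my @elements;                       # stack of currently open tags
    for my $event (@events) {
        my ($type, $tagname) = @$event;
        if ($type eq 'start') {
            push @elements, $tagname;
            next;
        }
        return "saw </$tagname> outside of element" unless @elements;
        return "saw </$tagname> nested inside <$elements[-1]>"
            unless $elements[-1] eq $tagname;
        pop @elements;
    }
    # Assumption: also report tags left open at the end of input.
    return @elements ? "never closed <$elements[-1]>" : undef;
}

my $ok  = nesting_error([start => 'li'], [start => 'foot'],
                        [end => 'foot'], [end => 'li']);
my $bad = nesting_error([start => 'li'], [start => 'foot'], [end => 'li']);

print defined $ok ? $ok : 'balanced', "\n";
print "$bad\n";
```

The second call closes li while foot is still open, which is exactly the case that would confuse the footnote stack.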

You may ask why I did not just use XML::Parser instead of HTML::Parser. Well, I like the callback flexibility of HTML::Parser for small projects.

Speaking of which, lines 12 through 17 define my parser object. I defined three callbacks. The first one is for text items, which gets the text as its first parameter. The second one is for start tags, which gets both the original text and the extracted tagname. The third one is for end tags, which again gets both the original text and the tagname.

Line 19 turns on xml_mode in the HTML::Parser, which keeps the tags in their original case and slightly alters the handling of a few other constructs. Again, this is yet more evidence that I really wanted an XML Parser.

Line 20 pulls in the contents of the DATA filehandle, which is the contents of this file under the __END__ token below, with my sample data (the description of how I write my Perl columns). The result of this parsing pass is a number of calls to the three callback subroutines, which after completion will have printed the main part of the text to STDOUT already. We’ll see how this works shortly.

The other effect of the parse is the extraction of the footnotes into @feet. Lines 22 to 27 dump this out in a nice way. I’m using an HTML table for layout, with a column of the footnote numbers and a column of the footnote text. Each footnote also has an anchor assigned to it, which we can use as the target of an internal anchor reference using a fragment identifier. Lacking inspiration, I numbered these note1, note2, and so on.
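To see the shape of that table without running the whole program, here is a small sketch of the same layout. footnote_table is a hypothetical helper that returns the markup as a string rather than printing it, and the two sample notes are borrowed from the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: builds the footnote table for a list of note texts.
sub footnote_table {
    my @feet = @_;
    return '' unless @feet;             # no footnotes, no table
    my $html = "<hr><table border='0' cellspacing='0' cellpadding='2'>\n";
    for my $n (1 .. @feet) {
        # Each number carries an anchor (note1, note2, ...) for the
        # in-text references to point at.
        $html .= "<tr><td valign='top'><sup><a name='note${n}'>${n}</a></sup></td>"
               . "<td>$feet[$n - 1]</td></tr>\n";
    }
    return $html . "</table>";
}

print footnote_table('for some!', 'About 10,000 characters.'), "\n";
```

A reference in the text such as href='#note2' then jumps to the second row.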

Line 29 is an exit, redundant because there are only subroutines from here down; it exists just to make it clear where the program ends.

Lines 31 to 38 define the text handler, which is called whenever HTML::Parser finds some text outside a tag. I’ve chosen to pass the text itself as the first parameter, which I copy into $text in line 32. If we’re currently in a footnote, this text is part of the footnote, so we append it to the right footnote. The right footnote is the one whose index is in the rightmost element of @feet_index, hence the indirection. Remember, this is not necessarily the highest indexed footnote (I might have nested footnotes, but this would be the exception and not the rule). If I’m not in a footnote, the text is simply dumped to STDOUT instead.

Lines 40 to 54 define the handler for the beginning of every element. The incoming parameters are the original text and the tagname for that start tag (we don’t need the attributes). These are assigned to $text and $tagname in line 41.

Line 44 notes the current element name by pushing the tag onto the stack. I’ll check this on the close tag in order to make sure that the tags are nested properly to make valid elements.

Lines 46 to 51 do the work for a footnote start tag. First, I create the new footnote as empty in line 47. Next, I insert the reference to the footnote in line 48 by faking a text event containing the reference. I can’t just print this because I might still be in another footnote; faking the text event “does the right thing.” (For better maintenance, I’d probably pull the “add text” operation to a separate subroutine that both the text handler and this handler call, but this worked for this quick-and-dirty program.)

Line 49 adds the footnote index onto the footnote stack. Note that we cannot do this before the previous line because the footnote reference would end up inside itself.
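For the curious, here is roughly what that separate “add text” subroutine might look like, along with the ordering constraint just described. This is a sketch and not the code from the listing: add_text, start_foot, end_foot, and $main are hypothetical names, and $main collects the main flow where the real program prints to STDOUT.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @feet;        # footnote texts
my @feet_index;  # stack of subscripts into @feet
my $main = '';   # assumption: stands in for STDOUT so the result is easy to inspect

# Hypothetical shared helper: send text to the open footnote or the main flow.
sub add_text {
    my ($text) = @_;
    if (@feet_index) { $feet[ $feet_index[-1] ] .= $text }
    else             { $main .= $text }
}

sub text_h { add_text($_[0]) }

sub start_foot {
    push @feet, '';                  # the new, empty note
    my $n = @feet;                   # its 1-based number
    # The reference goes to whatever is current *before* the new note opens.
    add_text("<sup><a href='#note${n}'>${n}</a></sup>");
    push @feet_index, $#feet;        # only now does the new note become current
}

sub end_foot { pop @feet_index }

# A nested footnote, modeled on the galleys item in the sample data.
text_h('galleys');
start_foot();
text_h('Usually a PDF');
start_foot();
text_h('Portable Document Format');
end_foot();
text_h('.');
end_foot();
text_h(' to come back.');

print "$main\n";
print "$_\n" for @feet;
```

The reference to note 2 ends up inside note 1, which is where the reader needs it. Swapping the last two statements of start_foot would put each reference inside the note it points to.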

Line 53 handles the start tags that are of no interest (everything except foot tags) by simply copying them as-is to the current output (either a footnote or STDOUT).

Lines 56 to 72 handle the end tags. Again, the text and tagname end up in variables, defined in line 57.

Lines 60 through 64 handle the verification of properly nested tags. If there’s no start tag, or the tags don’t match, a swift and painless death is the result.

Lines 66 to 69 handle the foot end tag, which is the only one of interest. If I find one, I simply pop an entry off the @feet_index array, which puts us back in the previous footnote for the next text item seen. If that leaves the array empty, we’re back to dumping to STDOUT.

Line 71 dumps the other uninteresting end tags as needed.

That’s all there is. It’s not rocket science, but it gets the job done. As sample text, I included an outline of what it takes to write a Perl column. If you run the program, you will get the HTML output shown in Listing Two.

Note how the footnotes have been replaced with internal fragment references and the content of the footnotes has become a table at the end. Yes, I could have done all this by hand, but it was more fun to write the program and get it done right, once and for all. So, don’t fear footnotes and writing tiny metalanguages for those odd tasks.




Listing One: Using Footnotes in HTML — Part I

1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
4
5 use HTML::Parser;
6
7 my @feet; # final footnote list
8 my @feet_index; # indexes into @feet
9
10 my @elements; # ensure nested tags match
11
12 my $parser = HTML::Parser->new
13 (
14 text_h => [\&text_h, "text"],
15 start_h =>[\&start_h, "text, tagname"],
16 end_h => [\&end_h, "text, tagname"],
17 );
18
19 $parser->xml_mode(1); # keep tags case-sensitive
20 $parser->parse_file(\*DATA); # prints main part to STDOUT
21
22 if (@feet) { # we had footnotes?
23 print "<hr><table border='0' cellspacing='0' cellpadding='2'>\n";
24 print "<tr><td valign='top'><sup><a name='note$_'>$_</a></sup></td>",
25 "<td>$feet[$_-1]</td></tr>\n" for 1..@feet;
26 print "</table>";
27 }
28
29 exit 0; # end of code
30
31 sub text_h {
32 my ($text) = @_;
33 if (@feet_index){ # are we inside a footnote?
34 $feet[$feet_index[-1]] .= $text; # append to that
35 } else {
36 print $text; # just show it
37 }
38 }
39
40 sub start_h {
41 my ($text, $tagname) = @_;
42
43 ## ensure proper nesting
44 push @elements, $tagname;
45
46 if ($tagname eq "foot") {
47 push @feet, ""; # the note itself
48 text_h("<sup><a href='#note".@feet."'>".@feet."</a></sup>");
49 push @feet_index, $#feet; # pointer to note
50 return;
51 }
52
53 text_h($text); # uninteresting start tag
54 }
55
56 sub end_h {
57 my ($text, $tagname) = @_;
58
59 ## ensure proper nesting
60 die "saw $text outside of element"
61 unless @elements;
62 die "saw $text nested inside <$elements[-1]>"
63 unless $elements[-1] eq $tagname;
64 pop @elements;
65
66 if ($tagname eq "foot") {
67 pop @feet_index; # no longer accumulating here
68 return;
69 }
70
71 text_h($text); # uninteresting end tag
72 }
73
74 __END__
75 <h2>Writing a Perl column</h2>
76 Writing a magazine column about Perl is
a simple<foot>for some!</foot> task.
77 Just perform the following steps:
78 <ol>
79 <li>Think of a problem to
80 solve<foot>You can ask around for help here.
81 I keep an archive of “todo” ideas,
and it really helps.</foot>.</li>
82 <li>Write the code to solve it.<foot>The
code should be between 50 and 200
83 lines for optimum column
84 length<foot>About 10,000 characters.
</foot>.</foot></li>
85 <li>Fret over the code for a few hours
<foot>Or a few minutes.</foot>.</li>
86 <li>Write the column<foot>I use POD<foot>See
<tt>perldoc perlpod</tt>.</foot>
87 format.</foot>.</li>
88 <li>Show the column to a group of friends
89 on IRC<foot>Usually the
<tt>#perl</tt> channel.</foot>
90 for a quick peer review.</li>
91 <li>Turn it in<foot>By email.</foot>
to the editor.</li>
92 <li>Wait a few days for the
galleys<foot>Usually a
93 PDF<foot><i>Portable Document
Format</i> from
94 Adobe<foot>See <tt>www.adobe.com</tt> for
downloads.</foot>.</foot>.</foot>
95 to come back.</li>
96 <li>Grimace over the hacks to your lovely
97 prose<foot>just kidding, guys!</foot>
and provide corrections to
98 the corrections.
99 </li>
100 <li>Wait a few months<foot>Or so it
seems, since the deadline
101 for an April cover is usually the first
week of January.</foot> for it
102 to “hit the stands”.</li>
103 <li>Wave the magazine in front of your
friends<foot>Or the cute girl
104 at the bookstore checkout
counter.</foot>!</li>
105 </ol>




Listing Two: The HTML Output of Listing One

<h2>Writing a Perl column</h2>
Writing a magazine column about Perl is a simple<sup>[1]</sup> task.
Just perform the following steps:
<ol>
<li>Think of a problem to
solve<sup>[2]</sup>.</li>
<li>Write the code to solve it.<sup>[3]</sup></li>
<li>Fret over the code for a few hours<sup>[5]</sup>.</li>
<li>Write the column.</li>
<li>Show the column to a group of friends
on IRC<sup>[6]</sup>
for a quick peer review.</li>
<li>Turn it in<sup>[7]</sup> to the editor.</li>
<li>Wait a few days for the galleys<sup>[8]</sup>
to come back.</li>
<li>Grimace over the hacks to your lovely
prose<sup>[9]</sup> and provide corrections to
the corrections.
</li>
<li>Wait a few months<sup>[10]</sup> for it
to “hit the stands”.</li>
<li>Wave the magazine in front of your friends<sup>[11]</sup>!</li>
</ol>
<hr><table border='0' cellspacing='0' cellpadding='2'>
<tr><td><sup>[1]</sup></td><td>for some!</td></tr>
<tr><td><sup>[2]</sup></td><td>You can ask around for help here.</td></tr>
<tr><td><sup>[3]</sup></td><td>The code should be between 50 and 200
lines for optimum column
length<sup>[4]</sup>.</td></tr>
<tr><td><sup>[4]</sup></td><td>About 10,000 characters.</td></tr>
<tr><td><sup>[5]</sup></td><td>or a few minutes</td></tr>
<tr><td><sup>[6]</sup></td><td>Usually the <tt>#perl</tt> channel</td></tr>
<tr><td><sup>[7]</sup></td><td>By email.</td></tr>
<tr><td><sup>[8]</sup></td><td>usually a PDF.</td></tr>
<tr><td><sup>[9]</sup></td><td>just kidding, guys!</td></tr>
<tr><td><sup>[10]</sup></td><td>Or so it seems, since the deadline
for an April cover is usually the first week of January.</td></tr>
<tr><td><sup>[11]</sup></td><td>Or the cute girl
at the bookstore checkout counter.</td></tr>
</table>



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at: http://www.stonehenge.com/merlyn/LinuxMag/.
