Long Processes Through CGI

The CGI protocol is wonderful for the remote execution of short tasks. But how do you execute a longer task? A task can't just run without giving some kind of feedback to the user -- eventually either the user will get bored or Apache will drop the connection.

I’ve seen (and written) some solutions that depend on “server push”, but not all browsers support that feature. Other solutions I’ve seen slowly write simple HTML and rely on the browser to incrementally render a page to reflect activity. Again, you can’t count on that behavior across the browser spectrum.

But one solution that minimizes server overhead and dependence on browser peculiarities is client pull, also called “meta refresh”. In client pull, the initial request forks a process to perform the real work and redirects the browser to a new URL that “pulls” the results obtained so far. While the results remain incomplete, an additional header instructs the browser to “refresh” the data after a predetermined number of seconds.
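To make the refresh idea concrete, here is a minimal sketch using only core Perl. The helper name render_page and its arguments are assumptions for illustration and are not part of the listing; the text passed in is assumed to be HTML-escaped already.

```perl
use strict;
use warnings;

# Hypothetical helper: build a results page, adding a meta refresh tag
# only while the work is still unfinished.
sub render_page {
    my ($done, $escaped_text, $seconds) = @_;
    my $refresh = $done
        ? ""
        : qq{<meta http-equiv="refresh" content="$seconds">\n};
    return "<html><head><title>Results</title>\n$refresh</head>\n"
         . "<body><pre>$escaped_text</pre></body></html>\n";
}

# Unfinished: the browser is told to ask again in 5 seconds.
print render_page(0, "working", 5);

# Finished: no refresh tag, so the polling stops by itself.
print render_page(1, "all done", 5);
```

Once the flag is true the tag is simply left out, which is all it takes to end the polling.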

All that sounds messy. For example, how will each CGI invocation know what data to display? Where will this data be? And, how will requests manage inter-process communication? Believe it or not, the solutions are not as daunting as they might seem.

First, each CGI request will be assigned a unique “session key” that’s hard to guess, but easy to hand around. This session key provides all the differentiation we need. In my sample code, I’m using the MD5 hash of unpredictable data.

Next, I could use temporary files to store session data, but that requires some sort of cleaner to zap stale files. An easier solution is to use Cache::Cache from the CPAN, a Perl module I’ve sung praises about in the past.

So, the basic strategy is this: the browser hits the form and the user fills out that form; the browser submits the form; after verifying good information, the response forks to run the task and redirect the browser back with a session key; the forked process runs the task, collects output as it arrives, updates a cache, and sets a flag when the task is complete; the CGI script pulls data from the cache and displays it, sending a refresh as long as the data is not complete.

For purposes of demonstration, we’ll use traceroute, a typical system administration task. Obviously, traceroute consumes system and network resources, so you shouldn’t set this up in a public place exactly as I’ve written it (as shown in Listing One) unless you want angry glares from your network neighbors.

Listing One: tracerouter-cgi.pl – Part 1

1 #!/usr/bin/perl -T
2 use strict;
3 $|++;
5 $ENV{PATH} = "/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin";
7 use CGI qw(:all delete_all escapeHTML);
9 if (my $session = param('session')) {
10   my $cache = get_cache_handle();
11   my $data = $cache->get($session);
12   unless ($data and ref $data eq "ARRAY") {
13     show_form();
14     exit 0;
15   }
16   print header;
17   print start_html(-title => "Traceroute Results",
18                    ($data->[0] ? () : (-head =>
19                     ["<meta http-equiv=refresh content=5>"])));
20   print h1("Traceroute Results");
21   print pre(escapeHTML($data->[1]));
22   print p(i("... continuing ...")) unless $data->[0];
23   print end_html;
24 } elsif (my $host = param('host')) {
25   if ($host =~ /^([a-zA-Z0-9.\-]{1,100})\z/) {
26     $host = $1;
27     my $session = get_session_id();
28     my $cache = get_cache_handle();
29     $cache->set($session, [0, ""]);
31     if (my $pid = fork) {
32       delete_all();
33       param('session', $session);
34       print redirect(self_url());
35     } elsif (defined $pid) {
36       close STDOUT;
37       unless (open F, "-|") {
38         open STDERR, ">&=1";
39         exec "/usr/sbin/traceroute", $host;
40         die "Cannot execute traceroute: $!";
41       }
42       my $buf = "";
43       while (<F>) {
44         $buf .= $_;
45         $cache->set($session, [0, $buf]);
46       }
47       $cache->set($session, [1, $buf]);
48       exit 0;
49     } else {
50       die "Cannot fork: $!";
51     }
52   } else {
53     show_form();
54   }
55 } else {
56   show_form();
57 }
59 exit 0;
61 sub show_form {
62   print header, start_html("Traceroute"), h1("Traceroute");
63   print start_form;
64   print submit('traceroute to this host:'), " ", textfield('host');
65   print end_form, end_html;
66 }
68 sub get_cache_handle {
69   require Cache::FileCache;
71   Cache::FileCache->new
72     ({
73       namespace => 'tracerouter',
74       username => 'nobody',
75       default_expires_in => '30 minutes',
76       auto_purge_interval => '4 hours',
77     });
78 }
80 sub get_session_id {
81   require Digest::MD5;
83   Digest::MD5::md5_hex(Digest::MD5::md5_hex(time().{}.rand().$$));
84 }

Lines 1 through 3 begin nearly every CGI program I write: enable taint checking, set compiler restrictions, and disable the buffering of standard output.

Line 5 sets the shell execution path. Because we’re tainted, any use of a child command will be forbidden unless the PATH itself is also untainted — the simplest way to do that is to set PATH directly.

Line 7 pulls in the CGI shortcuts, including a couple of unusual entries that don’t get pulled in with :all (for reasons I can’t fathom).

Lines 9 to 57 form the three-way switch that determines what the CGI program does for a particular invocation. Since the code segments are listed in the reverse order of their normal invocation sequence, I’ll start at the bottom and work backwards.

Line 56 shows a Web form that accepts a single parameter, the host to which we are traceroute-ing. This comes from a subroutine defined in lines 61 to 66. Simply put, we print the HTTP (actually CGI) header, the beginning of the HTML document (titling the page as Traceroute), and then a first-level head also titled Traceroute. The form comes next (with an action that defaults to the same script again) along with a single submit button and a text field. The fieldname is host, which we note for the next part of the description. Then the form is closed and the HTML is completed. This is your standard trivial form.

When the user submits this form, we come back to the same script and end up in the code starting in line 24. Here’s where it gets interesting.

First, lines 25 and 26 validate the input parameter and untaint it by extracting the host name from it via a narrowly defined regex. Note that I limit the size of the hostname to 100 characters (to prevent a denial-of-service or buffer-overflow attack) and the range of characters to prevent other messiness. Be very conservative when accepting web form parameters. If the validation fails, we redisplay the form in line 53.
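To see what that pattern accepts and rejects, here is a small sketch that wraps the regex from line 25 in a helper. The name clean_host and the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern from line 25 of the listing.
# Returns the captured (and therefore untainted) host, or undef.
sub clean_host {
    my ($raw) = @_;
    return undef unless defined $raw;
    return $raw =~ /^([a-zA-Z0-9.\-]{1,100})\z/ ? $1 : undef;
}

for my $candidate ("www.example.com", "example.com; rm -rf /", "a" x 101) {
    my $host = clean_host($candidate);
    printf "%-25.25s => %s\n", $candidate,
        defined $host ? "accepted" : "rejected";
}
```

One thing the pattern does not rule out is a value that begins with a hyphen, which a command-line tool would read as an option. A stricter variant might require the first character to be a letter or digit.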

Line 27 fetches a unique session ID. The session ID is 32 hex characters and reasonably hard to predict. The subroutine in lines 80 through 84 pulls in the Digest::MD5 module (found in the CPAN) and hashes some random and unpredictable data to generate the ID. I stole the routine from Apache::Session; if it’s good enough for them, it’s good enough for me.
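The generator from line 83 can be tried on its own, as in the sketch below. The second helper, looks_like_session_id, is an assumption that is not in the listing: it shows one way a script could check the shape of an incoming session parameter before using it as a cache key.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Same recipe as line 83: hash the time, the address of a fresh
# anonymous hash, a random number, and the process ID.
sub get_session_id {
    return md5_hex(md5_hex(time() . {} . rand() . $$));
}

# Hypothetical check: a session key should be exactly 32 lowercase hex digits.
sub looks_like_session_id {
    my ($candidate) = @_;
    return defined($candidate) && $candidate =~ /\A[0-9a-f]{32}\z/ ? 1 : 0;
}

my $id = get_session_id();
print "$id\n";
print looks_like_session_id($id)               ? "ok\n" : "bad\n";
print looks_like_session_id("../../etc/passwd") ? "ok\n" : "bad\n";
```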

Line 28 gets a Cache::Cache object to hold the information for the interprocess communication. The subroutine beginning on line 68 defines this object: we’ll cache in the filespace, naming the application tracerouter. The data will be good for 30 minutes before purging, and a purging run will be executed automatically on the first hit after four hours have passed.

Line 29 puts the initial data into the cache. The cache is always a two element arrayref. The first element is a flag that’s true if the output is complete, false if it isn’t. The second element is the data so far.
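The life cycle of one cache entry can be sketched without installing anything. The cache_set and cache_get helpers below are hypothetical stand-ins built on the core Storable module, not the Cache::FileCache interface, and "abc123" is a made-up session key. They do no expiry or locking.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Storable qw(store retrieve);

# Stand-in for the cache: one Storable file per session key.
my $dir = tempdir(CLEANUP => 1);

sub cache_set {
    my ($key, $value) = @_;
    store($value, "$dir/$key");
}

sub cache_get {
    my ($key) = @_;
    my $file = "$dir/$key";
    return (-e $file) ? retrieve($file) : undef;
}

cache_set("abc123", [0, ""]);                 # like line 29: nothing yet
cache_set("abc123", [0, "hop 1\n"]);          # like line 45: partial output
cache_set("abc123", [1, "hop 1\nhop 2\n"]);   # like line 47: finished

my $data   = cache_get("abc123");
my $status = $data->[0] ? "complete" : "still running";
print "$status\n", $data->[1];
```

The reader only ever needs the flag and the text, which is why a two-element array is enough.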

And now the fun part. We’re going to fork on line 31. This permits the parent process to tell Apache that we’re done responding to the request, while letting the child go off to perform the long traceroute. If we’re the parent, we need to construct a URL that points back to us, but with the session ID as a CGI parameter. So, we clear all the stored CGI parameters (line 32), set the session ID (line 33), and then print a CGI-redirect of “ourself” (as modified), which becomes an external redirect to the browser (line 34), and we’re done.
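What the parent sends back amounts to a short CGI response. The sketch below builds a comparable response by hand. The helper name redirect_response and the example URL are assumptions, and the exact headers CGI.pm emits may differ in detail. No URL escaping is done, which is only safe because the session key is plain hex.

```perl
use strict;
use warnings;

# Hypothetical helper: a CGI response that sends the browser back to
# the script with the session key as its only parameter.
sub redirect_response {
    my ($script_url, $session) = @_;
    return "Status: 302 Found\r\n"
         . "Location: $script_url?session=$session\r\n"
         . "\r\n";
}

print redirect_response(
    "http://www.example.com/cgi/tracerouter",
    "0123456789abcdef0123456789abcdef",
);
```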

The child goes on, but it must first close STDOUT. If it doesn’t, Apache (thinking that output might still be coming for the browser) won’t respond to the browser or release the connection until this is all resolved. Next, we have to launch a child process of the child to execute the traceroute.

We’ll do this with a pipe-open, which includes an implicit fork, on line 37. The grandchild process merges STDERR to STDOUT, and then executes traceroute, passing it the validated host parameter from before. Line 40 is reached only if the exec fails; in that case we die, and the error message becomes the single line of output in our response.

The child (the parent of the traceroute) reads from the file handle opened from the STDOUT (and STDERR) of the traceroute starting in line 42. We declare a buffer ($buf), and as each line is read (line 43), the line is added to the buffer (line 44) and shoved into the cache storage (line 45). When the command is complete, we get end-of-file, drop out of the loop, store the entire buffer again with an “I’m done” flag (line 47), and exit (line 48).
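The pipe-open and read loop can be tried outside of CGI. The sketch below follows the shape of lines 37 to 47 but swaps the cache for a callback and uses a lexical filehandle. The name capture_incrementally, the callback, and the demo command (Perl itself, via $^X) are all assumptions for illustration.

```perl
use strict;
use warnings;
use POSIX ();

$| = 1;

# Fork via open "-|", merge the command's STDERR into the pipe, and
# report the accumulated output after every line read.
sub capture_incrementally {
    my ($on_update, @command) = @_;
    my $pid = open(my $fh, "-|");
    die "Cannot fork: $!" unless defined $pid;
    unless ($pid) {
        # Forked process: fd 1 is the pipe, so point STDERR at it too.
        open STDERR, ">&=1" or POSIX::_exit(126);
        { exec { $command[0] } @command; }
        print STDERR "Cannot execute $command[0]: $!\n";
        POSIX::_exit(127);
    }
    my $buf = "";
    while (my $line = <$fh>) {
        $buf .= $line;
        $on_update->(0, $buf);    # partial result, like line 45
    }
    close $fh;
    $on_update->(1, $buf);        # final result, like line 47
    return $buf;
}

my $report = sub {
    my ($done, $buf) = @_;
    print(($done ? "done" : "partial") . ": " . length($buf) . " bytes\n");
};
capture_incrementally($report, $^X, "-e", 'print "out\n"; warn "err\n"');
```

The order in which the two lines arrive depends on buffering in the command being run, so the sketch reports only sizes rather than assuming an order.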

In short, the child process scurries off to execute the command. The parent tells the server to tell the browser to “please revisit me with this session key”. So, the browser comes back of its own volition and ends up starting in line 9 for the third and final part of this program.

Line 10 gets the cache handle, opening the same cache to which the forked child is writing. Line 11 gets the cache data for that session key. Now if the data is missing, either the data has expired or someone is trying to jimmy up a session key to hijack someone else’s session. In either case, we show the form (again) and stop.

Line 16 generates the CGI header. Lines 17 to 19 follow that with the HTML header. If the “data complete” flag is not set, then we need to keep going after this display, so we’ll add a meta-refresh tag to the head info. This instructs the browser to poll the same URL in a number of seconds (here 5 seconds).

Lines 20-23 dump the data that we have so far. If the data is incomplete, an italicized “continuing” paragraph is appended, to let the user know that we’re still working on the answer. And that’s it! That’s a basic strategy to watch a long-running CGI program.

Note that the child process has no way of knowing whether anyone is still polling for results, so it could continue merrily chugging away to produce a result that no one will see. Perhaps that can be fixed in another revision. But until next time, enjoy!
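One possible direction for such a revision, offered only as a sketch and not as anything in the listing: the viewing branch could record the time of each poll under a separate key, and the worker could check that key between lines of output. Everything below is hypothetical, including the helper names, the 30-second grace period, and the plain hash standing in for the shared cache.

```perl
use strict;
use warnings;

my $GRACE = 30;    # assumed: seconds without a poll before giving up

# Viewer side: remember when this session was last polled.
sub note_poll {
    my ($cache, $session, $now) = @_;
    $cache->{"$session.seen"} = $now;
}

# Worker side: true once no poll has arrived within the grace period.
sub abandoned {
    my ($cache, $session, $now) = @_;
    my $seen = $cache->{"$session.seen"};
    return (defined($seen) && $now - $seen > $GRACE) ? 1 : 0;
}

my %cache;
note_poll(\%cache, "abc123", 1000);
for my $now (1010, 1100) {
    my $verdict = abandoned(\%cache, "abc123", $now) ? "stop" : "keep going";
    print "t=$now: $verdict\n";
}
```

A real worker would also have to stop the running command when it decides to quit, and a check between lines of output cannot fire while the command is silent.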

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. Code listings for this column can be found at http://www.stonehenge.com/merlyn/LinuxMag/.
