Profiling in Template Toolkit

The Template Toolkit does not support any profiling tools “out of the box.” However, that didn’t stop Randal from getting the numbers — and the performance boost — he needed.
Over the past year or so, I’ve talked a lot about Andy Wardley’s Template Toolkit (TT, http://http://www.template-toolkit.org/), because I’ve been using the software extensively to build a number of significant web sites. Of course, “build” means “solving problems,” or scratching my head until that precious “aha! ” moment comes along.
For example, I recently completed the features of a deliverable web site, but as the customer started playing with it, one of his first comments was, “This part is too slow, and I spend all my time on this screen.” That wasn’t a show-stopper, though, because I’d deliberately chosen to program the application with flexibility and ease-of-implementation first, knowing that I’d need to make a performance pass later to optimize anything that was too slow. (Avoid the perils of premature optimization, or making decisions about performance before actually understanding where the bottlenecks are.)
“No problem”, I said, “because I’ll just trot out Devel::DProf to find the hot spots and slow spots and optimize those.” But, alas, adding Devel::DProf caused the program to dump core. Ugh. Even moving the code to a different architecture (from my Mac OS X development platform to the OpenBSD deployment platform) proved to be fruitless.
This wasn’t looking good — for the application or my income.
After fretting a bit and asking for help in all the usual places online, I decided that I still needed to know where the program was spending its time. I suspected that 90 percent or more of the time was in the “rendering” phase of CGI::Prototype, so I could accept a set of per-template timings as a first-order diagnostic tool. However, Template Toolkit doesn’t have any profiling tools “out of the box.” But this is Perl, and the code is open. Also, Andy designed TT to be easily subclassed to alter portions of the behavior.
I spent about an hour tracing the flow of how a template is processed and decided that my per-template statistics could be obtained by noting the execution time of each invocation of the process() method of Template::Context. The first parameter to process() seemed to be the same kind of value as the first parameter of the process() method of the top-level Template object, or the template handed to the PROCESS or INCLUDE or WRAPPER Template directives. Thus, I could use that value to aggregate results over different invocations of the same template.
My strategy was simple: create a subclass of Template::Context called My::Template::Context, override the process() method to “start the timer” by noting the wall clock time (via time()) and the four CPU times (user CPU, system CPU, child user CPU, and child system CPU) accumulated thus far (via times()), and then call the “real” process() method to do the actual work, capturing its return value. When the method returned, I’d subtract the initial timer values from new values to yield the amount of time spent processing the template. Finally, I’d tell Template to use my new class instead of the original class by setting $Template::Config::CONTEXT to my class name (another nice configuration hook provided by Andy).
Or at least it looked simple… But then I realized that what I really wanted was the duration spent only in the “parent” template, not in all of the subordinate templates that it calls. This is important because I’m also interested in the immediate cost of all of the callbacks made by a template, which I could measure only indirectly. (To measure those more directly, I would have had to perform a similar trick with Template::Stash, and that looked even hairier.)
So, I decided that I’d simply maintain a stack of calls into process(), and on each return, compute the deltas for the current call and also subtract that value from each of the parent calls. After all, if the time was being charged to this invocation, it’s no longer affecting the parent. To do that, I push forward the starting time counters of every parent call, pretending they started later. It all works out in the end, although I got the math wrong at least twice while I was hacking.
The result is shown in Listing One.
LISTING ONE: Profiling the Template Toolkit
01package My::Template::Context;
02use base qw(Template::Context);
04my @stack;
05my %totals;
07sub process {
08 my $self = shift;
010 my $template = $_[0];
011 if (UNIVERSAL::isa($template, “Template::Document”)) {
012 $template = $template->name || $template;
013 }
015 push @stack, [time, times];
017 my @return = wantarray ?
018 $self->SUPER::process(@_) :
019 scalar $self->SUPER::process(@_);
021 my @delta_times = @{pop @stack};
022 @delta_times = map { $_ – shift @delta_times } time, times;
023 for (0..$#delta_times) {
024 $totals{$template}[$_] += $delta_times[$_];
025 for my $parent (@stack) {
026 $parent->[$_] += $delta_times[$_] if @stack; # parent adjust
027 }
028 }
029 $totals{$template}[5] ++; # count of calls
031 unless (@stack) {
032 ## top level again, time to display results
033 print STDERR “– $template at “. localtime, “:\n”;
034 printf STDERR “%3s %3s %6s %6s %6s %6s %s\n”,
035 qw(cnt clk user sys cuser csys template);
036 for my $template (sort keys %totals) {
037 my @values = @{$totals{$template}};
038 printf STDERR “%3d %3d %6.2f %6.2f %6.2f %6.2f %s\n”,
039 $values[5], @values[0..4], $template;
040 }
041 print STDERR “– end\n”;
042 %totals = (); # clear out results
043 }
045 # return value from process:
046 wantarray ? @return : $return[0];
049$Template::Config::CONTEXT = __PACKAGE__;
Line 1 defines the beginning of my module. Obviously, in a mod_perl enviroment, the name My:: isn’t a good choice, but this is still in CGI, so I didn’t have trouble with the name (for the time being).
Line 2 both pulls in the Template::Context module and declares the current module to be a derived class by setting @ISA automatically.
Line 4 declares the @stack of times for each nested invocation. Initially, the empty stack shows that the application is at the top level. As each new invocation is entered, a new arrayref is pushed onto the stack. When the stack again becomes empty, the code has returned to the top-level, indicating that it’s time to dump the results (simply to STDERR and thus into my web log).
Line 5 accumulates the results. The key is the template name, with the value being an arrayref of the accumulated sums of the times. This hash is reset as each top-level is reached.
Line 7 to Line 47 define the process() method that overrides the method of the same name found in Template::Context.
Line 8 pulls out the $self object from the arguments.
Line 10 to Line 13 attempt to figure out the name for this template. The first parameter to process() can either be a template name, a Template::Document if it’s a BLOCK inside another template, or even a coderef. Luckily, I didn’t have any coderefs in my application, but I still had to do something sensible with BLOCK values. Line 11 determines if the template is an object that inherits from Template::Document. If so, name() provides its name.
Another way that might have worked just as well would be:
my $template = $_[0]; 
eval { $template = $template->name };
This simpler version tries the name() method, and if it doesn’t work, leaves the template name alone. But I coded what worked first and moved on.
Line 15 captures the initial wall clock time and the CPU times into a five-element list, then pushes that as an arrayref onto the stack. This” starts” the timer.
Line 17 to Line 19 call SUPER::process() on the arguments to do the job. However, I was unsure of the invocation context of process(), so I had to be careful to pass that same context on to the SUPER call. If in a list context, my@return… is already good; if not, then the scalar prefix ensures the proper context. (I’m intentionally not paying attention to the void context here, as it would have complicated the code even further.)
Now that the real work has been done, it’s time to figure out how long it took. Line 21 pops the stack, retrieving the five starting values. These might have been altered by nested calls to process() — more on that in a moment.
Line 22 subtracts the old values from the current values, using a crazy little idiom I invented a while back. The map executes the block five times. Each value is subtracted from the next element shifted from @delta_times, creating the delta value. But the map accumulates these five deltas, which then get assigned right back into @delta_times. Yes, the code is slightly obfuscated, but I’ve used it enough times now that it begins to feel automatically like the right thing to reach for.
Line 23 through Line 28 process each of the five deltas (wall clock, and the four CPU times). First, Line 24 adds them into the appropriate accumulator, keyed by the name in $template; then, the loop from Line 25 to Line 27 pushes each of the “start time” counters forward by that same amount in the stack, thus discounting that amount of time from those accumulations. That’s the trickiest part of the code, I’d say.
Line 29 accumulates a count of the calls, just so I can see if something was called once or a hundred times. In general, you worry about total CPU time used, not the per-usage CPU time, but having the call count might be useful to understand “Big O” scaling issues to determine an asymptotic upper bound.
If at the top of the stack, it’s time to show how much the hit cost. Line 31 to Line 43 dump out the totals. The dump goes to STDERR so that I see it on my web error log, and I’m using printf() to try to make it at least slightly pretty. Line 42 clears out the totals (the stack is already clear) so that a new invocation starts it all over again.
Line 46 returns an appropriate value to the caller: if in a scalar context, only $return[0] has the proper value. Otherwise, the entire list is returned.
Finally, Line 49 tells Template that this is the kind of Template::Context to use, and Line 51 provides the traditional true value for the require embedded in the invoking use.
Speaking of which, using this module is now as simple as putting a few extra lines at the top of my CGI script:

use lib “/my/local/lib”; # location of this module
use My::Template::Context; # turn on logging

use My::App; # CGI::Prototype
Now, as this CGI program is invoked, I get extra lines in my web
log showing exactly where time is being spent. Yeay!
Within a few hours, I was able to cut the CPU time by over 70% and keep my customer happy and paying. Happy endings all around.
Until next time, enjoy!

Randal Schwartz is the author of several books on Perl and is the chief Perl guru at Stonehenge Consulting. You can reach Randal at class="emailaddress">merlyn@stonehenge.com.

Comments are closed.