dcsimg

Creating an Inline Language

Back in one of the very first issues of Linux Magazine (September 1999, available online at http://www.linux-mag.com/1999-09/perl_01.html), I wrote about the Spew language. Given a description of text, sentences, and paragraphs, Spew generates random prose based on that description. The grammar is specified using a simple BNF-like format, with extensions to give weighting to more-favored choices.

Back in one of the very first issues of Linux Magazine (September 1999, available online at http://www.linux-mag.com/1999-09/perl_01.html), I wrote about the Spew language. Given a description of text, sentences, and paragraphs, Spew generates random prose based on that description. The grammar is specified using a simple BNF-like format, with extensions to give weighting to more-favored choices.

For example, given the Spew input…


START: sentence
sentence: person ” ” action “.”
person: “He” | “She” | “Barnacle Bill”
action: “walks” | “sails” | “swims”

… we’d end up with random combinations of people walking, sailing, and swimming.

I recently rewrote that code, and my new implementation of Spew consists of two parts: a parser to translate the grammar (such as the above) into a compiled internal representation, and an execution engine to walk through that representation, selecting choices and subrules as needed. To speed my development, I initially chose to use Parse::RecDescent to write the first part. While this allowed me to save time while writing my grammar translator, the cost came with a penalty: to go from a Spew grammar to Spew output, I had to first load Parse::RecDescent, parse and translate my grammar’s grammar (written in Parse::RecDescent‘s own description language), execute that translation to parse the Spew input to get a compiled representation of that grammar, and finally run through that final data structure to generate the random output.

For toy operation, that sufficed. But for anything significant, like a random mission statement generated on every RSS hit to my webserver, the Parse::RecDescent setup would be prohibitively CPU-intensive.

But consider that ninety percent of the process is the same for every hit. All we need is a cache of that intermediate data structure, and we’re done — as long as we ensure that we’re using a cache based on the current grammar.

One nice framework that’s come along over the years is the Inline framework, originally written by Brian Ingerson. Brian realized that people didn’t write a lot of interfaces to code written in languages like C because Perl’s XS language was so awkward to use. Also, XS required a shim to be created between Perl and the other library, and this required firing up a C compiler on the right input files. For this to be fast and transparent, the shim had to be easily located and cached and not rebuilt needlessly. So, Brian set out to simplify the mechanism.

The Inline framework creates an MD5 checksum of the source code and uses that value as part of a directory name located in a writable location. The directory holds the results of analyzing and translating the source code so that it can be quickly reused.

Initially, the only Inline module was Inline::C, for building quick XS glue for arbitrary C code and libraries. But over the following months, various other languages were also supported, including hooks to handle interpreted languages by caching their translation.

And that’s where we get back to Spew. By following the Inline-API man page (and staring at some of the other examples in the CPAN), I created the Inline::Spew module, which I’ve placed on the CPAN as well. Once installed, I can take a Spew grammar and cache the translation in a transparent way. For example …


use Inline Spew => q{
START: sentence
sentence: person ” ” action “.”
person: “He” | “She” | “Barnacle Bill”
action: “walks” | “sails” | “swims”
};

print spew(), “\n” for 1..5;

… generates (randomly):


She swims.
He sails.
Barnacle Bill sails.
Barnacle Bill walks.
He walks.

The first time I invoke the script, the Inline mechanism computes the MD5 checksum of the source text, and notes that I don’t have a compiled version of the grammar. So the Spew compiler gets loaded, along with Parse::RecDescent and everything it needs. The “grammar grammar” gets compiled, and is then used to convert my grammar to a data structure. And here’s the cool part: the data structure is then saved (as a YAML file). On subsequent invocations, we start by loading the data structure and then performing the final random walk. On my laptop, this results in about a half a second of CPU time saved per invocation, as long as I don’t change the grammar.

And how does this all work? Let’s examine the module code in Listing One. This listing has been slightly abridged for inclusion here, but I’ve left the essential pieces in place.




Listing One: myspew.pl, a custom Spew interpreter


1 package Inline::Spew;
2
3 require Inline;
4 require YAML;
5
6 our @ISA = qw(Inline);
7
8 sub register {
9 return {
10 language => ‘Spew’,
11 type => ‘interpreted’,
12 suffix => ‘spew’,
13 };
14 }
15
16 sub validate {
17 }
18
19 sub build {
20 my $o = shift;
21 my $code = $o->{API}{code};
22 my $location = “$o->{API}{location}”;
23
24 require File::Basename;
25 my $directory = File::Basename::dirname
($location);

26 $o->mkpath($directory) unless -d $directory;
27
28 my $spew = spew_compile($code);
29
30 YAML::DumpFile($location, $spew);
31 }
32
33 sub load {
34 my $o = shift;
35
36 my $sub = do {
37 my $s = $o->{CONFIG}{SUB} || “spew”;
38 unless ($s =~ /::/) {
39 $s = $o->{API}{pkg}.”::$s”;
40 }
41 $s;
42 };
43 my $location = $o->{API}{location};
44 my @result = YAML::LoadFile($location);
45
46 {
47 no strict ‘refs’;
48 *$sub = sub {
49 my $start = shift || “START”;
50 return spew_show($result[0], $start);
51 };
52 }
53 }
54
55 sub spew_show {
56 my ($parsed, $defn) = @_;
57 die “missing defn for $defn” unless exists
$parsed->{$defn};

58
59 my @choices = @{$parsed->{$defn}{is}};
60 my $weight = 0;
61 my @keeper = ();
62 while (@choices) {
63 my ($thisweight, @thisitem) = @{pop @choices};
64 $thisweight = 0 if $thisweight < 0;
# no funny stuff

65 $weight += $thisweight;
66 @keeper = @thisitem if rand($weight) <
$thisweight;

67 }
68 my $result;
69 for (@keeper) {
70 ## should be a list of ids or defns
71 die “huh $_ in $defn” if ref $defn;
72 if (/^ (.*)/s) {
73 $result .= $1;
74 } elsif (/^(\w+)$/) {
75 $result .= spew_show($parsed, $1);
76 } else {
77 die “Can’t show $_ in $defn\n”;
78 }
79 }
80 return $result;
81 }
82
83 BEGIN {
84
85 my $parser;
86 my $GRAMMAR = q{
87
88 { my %grammar; my $internal = 0; }
89
90 grammar: rule(s) /\Z/ { \%grammar; }
91
92 ## rule returns identifier (not used)
93 rule: identifier “:” defn {
94 push @{$grammar{$item[1]}{is}}, @{$item[3]};
95 $grammar{$item[1]}{defined}{$item
pos[1]{line}{to}}++;

96 $item[1];
97 }
98 | <error>
99
100 ## defn returns listref of choices
101 defn: <leftop: choice “|” choice>
102
103 ## choice returns a listref of [weight => @items]
104 choice: weight unweightedchoice { [ $item[1]
=> @{$item[2]} ] }

105
106 ## weight returns weight if present, 1 if not
107 weight: /\d+(\.\d+)?/ <commit> /\@/ {
$item[1] } | { 1 }

108
109 ## unweightedchoice returns a listref of @items
110 unweightedchoice: item(s)
111
112 ## item returns ” literal text” or “identifier”
113 item:
114 { $_ = extract_quotelike($text) and
” ” . eval }

115 | identifier <commit> …!/:/ { # must
not be followed by colon!

116 $grammar{$item[1]}{used}{$item
pos[1]{line}{to}}++;

117 $item[1]; # non-leading space flags
an identifier

118 }
119 | “(” defn “)” { # parens for recursion,
gensym an internal

120 ++$internal;
121 push @{$grammar{$internal}{is}},
@{$item[2]};

122 $internal;
123 }
124 | <error>
125
126 identifier: /[^\W\d]\w*/
127 };
128
129 sub spew_compile {
130 my $source = shift;
131
132 unless ($parser) {
133 require Parse::RecDescent;
134 $parser = Parse::RecDescent->new($GRAMMAR)
or die “internal bad”;

135 }
136
137 my $parsed = $parser->grammar($source) or
die “bad spew grammar”;

138
139 for my $id (sort keys %$parsed) {
140 next if $id =~ /^\d+$/; # skip
internals

141 my $id_ref = $parsed->{$id};
142 unless (exists $id_ref->{defined}) {
143 die “$id used in @{[sort keys %{$id_ref->{used}}]} but not defined”;
144 }
145 }
146
147 return $parsed;
148 }
149 }
150
151 1;
152 __END__

Line 1 defines this module as belonging to Inline::Spew. Lines 3 and 4 bring in the Inline and YAML modules. Line 6 declares this module to be a subclass of Inline.

Lines 8 to 14 define the subroutine that registers this particular Inline module to the Inline framework. Inline goes through Perl’s include path when invoked, finding all modules that match Inline::*, and brings them in to find out the languages they support. The register subroutine clarifies the language that this module supports, whether it is a compiled or interpreted language, and any filename suffix that should be used. These are returned as a hashref by protocol.

Line 16 defines the validation routine used to validate the configuration parameters. I’m lazy, and didn’t write one, but this is where I would check that the only configuration parameter is SUB and that its value is a respectable subroutine name. Maybe in the next release…

Lines 19 to 31 form the compilation phase of this module. This routine is called if there doesn’t yet exist a cached compilation result for a given piece of Spew grammar. The only input parameter is the Inline object, copied here to $o in line 20.

Lines 21 and 22 extract the Spew source text into $code, and the computed desired location of the Spew “compiled” file into $location.

One thing I figured out by looking at other examples is that the directory containing the to-be-created file defined in $location might not yet exist. So, I’ve added code to create it in lines 24 through 26. (Oddly enough, there’s an undocumented mkpath() call to handle the creation).

Next, it’s time to compile the Spew input grammar into the intermediate data structure, via the call to spew_compile() in line 28 (defined later in this file.) Line 30 dumps this data structure as a YAML-formatted file to the location expected by Inline. And that ends the compilation step.

After the compilation step, Inline then loads the compiled module, defined here with the load() subroutine beginning in line 33. First, the Inline object is shifted off into $o, as before (in line 34). Next, we determine the name of the subroutine to be bound to the compiled object. Line 37 sets $s to the value of the SUB configuration value, defaulting to spew if absent. If the name doesn’t contain a double colon, then we need to prefix the proper package. We can get the package name from the pkg variable in the API object hash, shown in line 39.

Lines 43 and 44 extract the compiled Spew grammar from the $location file. At this point, $result[0] should be the same as $spew at the end of the build subroutines.

Lines 46 to 53 create the subroutine. First, in line 47, we have to turn off the strictness about symbolic references because we’re about to use a string value ($sub) as a reference. Next, we create an anonymous subroutine and assign it to the glob defined using the name $sub. This effectively creates the subroutine as a named subroutine.

When the subroutine is invoked, it takes its only parameter (defaulting to START if not present) and calls the spew_show() subroutine, defined below, along with the grammar data structure.

The first time a program with a particular grammar is run, the initial call to Inline invokes register() to understand what Inline::Spew is about, validate() on the configuration parameters (doing nothing here), build to translate the grammar into the stored file, then load() to load that stored file and connect it up with the subroutine name of choice. On subsequent calls, the build() step is skipped, but that’s good, because that’s the expensive one.

The remainder of the module listing is adapted very slightly from the original spew program presented in the September 1999 issue. As such, I won’t bother re-describing the code in detail, except to point out that the code in lines 132 through 135 computes the grammar grammar “lazily.” The expensive Parse::RecDescent invocation and execution is avoided until we actually need to compile a spew grammar. Also, my error checking is extremely clumsy: if a bad grammar is given, the program simply dies (line 137). This too could use a bit of cleaning up. But it works as it is.

Have fun with this module by creating varying web pages or automatic email responders. (I’d be interested to hear from you if you do.) Use the module as an example for your own Inline modules. And, until next time, enjoy!



Randal Schwartz is the chief Perl guru at Stonehenge Consulting. He can be reached at merlyn@stonehenge.com.

Comments are closed.