dcsimg

Writing Nonsense with Perl

I'm looking for a Perl script that uses recursive grammar techniques to generate random sentences. I've found several scripts that will throw up a string of text chosen from a pre-made list, but I'd really like to find something that generates sentences on the fly.

I’m looking for a Perl script that uses recursive grammar techniques to generate random sentences. I’ve found several scripts that will throw up a string of text chosen from a pre-made list, but I’d really like to find something that generates sentences on the fly.

I know there’ve been a few non-Perl solutions like Spew and the Dada Engine; what I’m looking for doesn’t have to be that complex — Don Cross’s JavaScript implementation is along the scale I’m thinking.

Anybody here know of something like that?

When I read this posting, I was reminded of the time, 20 years back, I wrote just such a program as a test of the UNIX YACC and LEX tools that I had just read about, called yak (for Yet Another Klugey generator, as well as the meaning of having something that yaks at you). So, I whipped out Damian Conway’s wonderful Parse::RecDescent module, with which I had been playing recently, and generated the spew program in Listing One.




Listing One: Spew


 1      #!/usr/bin/perl -w
2 use strict;
3 $|++;
4
5 use Parse::RecDescent 1.65;
6 use Getopt::Long;
7
8 GetOptions(
9 “start=s” => \ (my $START = “START”),
10 ) or die “see code for usage\n”;
11
12 ## define the grammar of the spew grammar:
13
14 (my $parser = Parse::RecDescent->
new(<<’END_OF_GRAMMAR’)) or die “bad!”;
15
16 ## return hashref
17 ## { ident => {
18 ## is => [
19 ## [weight => item, item, item, ...],
20 ## [weight => item, item, item, ...], …
21 ## ],
22 ## defined => { line-number => times }
23 ## used => { line-number => times }
24 ## }, …
25 ## }
26 ## item is ” literal” or ident
27 ## ident is C-symbol or number (internal for
nested rules)
28
29 { my %grammar; my $internal = 0; }
30
31 grammar: rule(s) /\Z/ { \%grammar; }
32
33 ## rule returns identifier (not used)
34 rule: identifier “:” defn {
35 push @{$grammar{$item[1]}{is}}, @{$item[3]};
36 $grammar{$item[1]}{defined}{$itempos[1]{line}
{to}}++;
37 $item[1];
38 }
39 | <error>
40
41 ## defn returns listref of choices
42 defn: <leftop: choice “|” choice>
43
44 ## choice returns a listref of [weight =>
@items]
45 choice: weight unweightedchoice { [ $item[1] =>
@{$item[2]} ] }
46
47 ## weight returns weight if present, 1 if not
48 weight: /\d+(\.\d+)?/ <commit> /\@/
{ $item[1] } | { 1 }
49
50 ## unweightedchoice returns a listref of @items
51 unweightedchoice: item(s)
52
53 ## item returns ” literal text” or “identifier”
54 item:
55 { $_ = extract_quotelike($text) and ” ” . eval }
56 | identifier <commit> …!/:/ { # must not
be followed by colon!
57 $grammar{$item[1]}{used}{$itempos[1]{line}
{to}}++;
58 $item[1]; # non-leading space flags an
identifier
59 }
60 | “(” defn “)” { # parens for recursion,
gensym an internal
61 ++$internal;
62 push @{$grammar{$internal}{is}}, @{$item[2]};
63 $internal;
64 }
65 | <error>
66
67 identifier: /[^\W\d]\w*/
68
69 END_OF_GRAMMAR
70
71 my @data = <>;
72 for (@data) {
73 s/^\s*#.*//;
74 }
75
76 (my $parsed = $parser->grammar(join ”
@data)) or die “bad parse”;
77
78 for my $id (sort keys %$parsed) {
79 next if $id =~ /^\d+$/; # skip internals
80 my $id_ref = $parsed->{$id};
81 unless (exists $id_ref->{defined}) {
82 print “$id used in @{[sort keys %{$id_ref->
{used}}]} but not defined – FATAL\n”;
83 }
84 unless (exists $id_ref->{used} or $id eq
$START) {
85 print “$id defined in @{[sort keys %{$id_ref->
{defined}}]} but not used – WARNING\n”;
86 }
87 }
88
89 #DEBUGGING:# use Data::Dumper; print
Dumper($parsed);
90 show($START);
91
92 sub show {
93 my $defn = shift;
94 die “missing defn for $defn” unless exists
$parsed->{$defn};
95
96 my @choices = @{$parsed->{$defn}{is}};
97 my $weight = 0;
98 my @keeper = ();
99 while (@choices) {
100 my ($thisweight, @thisitem) = @{pop @choices};
101 $thisweight = 0 if $thisweight < 0; # no
funny stuff
102 $weight += $thisweight;
103 @keeper = @thisitem if rand($weight) <
$thisweight;
104 }
105 for (@keeper) {
106 ## should be a list of ids or defns
107 die “huh $_ in $defn” if ref $defn;
108 if (/^ (.*)/s) {
109 print $1;
110 } elsif (/^(\w+)$/) {
111 show($1);
112 } else {
113 die “Can’t show $_ in $defn\n”;
114 }
115 }
116 }

Before looking at the details of the implementation, let’s quickly summarize the elements involved. For comparison, Cross’s demonstration is over six hundred lines of JavaScript, which produce the results visible at http://www.intersrv.com/~dcross/sentence.html. It’s a single, fixed procedural client-side program which implicitly encodes an involved random grammar. Among the selections it makes: a NounPhrase might be a Noun, or Adjective plus Noun, or a Noun plus Prepositional Phrase. At the lowest level, the grammar chooses between such nouns as “crane,” “umbrella,” “plaintiff,” “bowling ball,” and so on.

spew does much the same thing. It’s a command-line application, of course, written in Perl rather than JavaScript. The biggest difference, though, is that the grammar it embeds is used to read other grammars. spew is a machine that takes a “random sentence grammar” as input, and emits a randomly-chosen text that conforms to the grammar. The Dilbert Mission Generator I show in Listing Two is an example of such a grammar. Notice the use of the 2 @ numeric weight: this grammar produces twice as many sentences with “… because …” clauses as those without.




Listing Two: Clone of the Dilbert Mission Generator


  1     ## Our challenge is to effectively reverse-
engineer the output of:
2 ## http://www.dilbert.com/comics/dilbert/
career/bin/ms 2.cgi
3 ## as well as dynamically provide interactive
feedback. :-)
4
5 START: missions
6
7 missions: mission “\n\n” mission “\n\n” mission
“\n\n” mission “\n”
8
9 mission:
10 Our_job_is_to ” ” do_goals “.” |
11 2 @ Our_job_is_to ” ” do_goals ” ” because “.”
12
13 Our_job_is_to:
14 (“It is our ” | “It’s our “) job ” to” |
15 “Our ” job (” is to” | ” is to continue to”) |
16 “The customer can count on us to” |
17 (“We continually ” | “We “) (“strive” |
“envision” | “exist”) ” to” |
18 “We have committed to” |
19 “We”
20
21 job:
22 “business” | “challenge” | “goal” | “job” |
“mission” | “responsibility”
23
24 do_goals:
25 goal | goal ” ” in_order_to ” ” goal
26
27 in_order_to:
28 “as well as to” |
29 “in order that we may” |
30 “in order to” |
31 “so that we may endeavor to” |
32 “so that we may” |
33 “such that we may continue to” |
34 “to allow us to” |
35 “while continuing to” |
36 “and”
37
38 because:
39 “because that is what the customer expects” |
40 “for 100% customer satisfaction” |
41 “in order to solve business problems” |
42 “to exceed customer expectations” |
43 “to meet our customer’s needs” |
44 “to set us apart from the competition” |
45 “to stay competitive in tomorrow’s world” |
46 “while promoting personal employee growth”
47
48 goal: adverbly ” ” verb ” ” adjective ” ” noun
49
50 adverbly:
51 “quickly” | “proactively” | “efficiently” |
“assertively” |
52 “interactively” | “professionally” |
“authoritatively” |
53 “conveniently” | “completely” | “continually” |
“dramatically” |
54 “enthusiastically” | “collaboratively” |
“synergistically” |
55 “seamlessly” | “competently” | “globally”
56
57
58 verb:
59 “maintain” | “supply” | “provide access to” |
“disseminate” |
60 “network” | “create” | “engineer” |
“integrate” | “leverage other’s” |
61 “leverage existing” | “coordinate” |
“administrate” | “initiate” |
62 “facilitate” | “promote” | “restore” |
“fashion” | “revolutionize” |
63 “build” | “enhance” | “simplify” | “pursue” |
“utilize” | “foster” |
64 “customize” | “negotiate”
65
66 adjective:
67 “professional” | “timely” | “effective” |
“unique” | “cost-effective” |
68 “virtual” | “scalable” | “economically sound” |
69 “inexpensive” | “value-added” | “business” |
“quality” | “diverse” |
70 “high-quality” | “competitive” | “excellent” |
“innovative” |
71 “corporate” | “high standards in” |
“world-class” | “error-free” |
72 “performance-based” | “multimedia-based” |
“market-driven” |
73 “cutting edge” | “high-payoff” | “low-risk
high-yield” |
74 “long-term high-impact” | “prospective” |
“progressive” | “ethical” |
75 “enterprise-wide” | “principle-centered” |
“mission-critical” |
76 “parallel” | “interdependent” | “emerging” |
77 “seven-habits-conforming” | “resource-leveling”
78
79 noun:
80 “content” | “paradigms” | “data” |
“opportunities” |
81 “information” | “services” | “materials” |
“technology” | “benefits” |
82 “solutions” | “infrastructures” | “products” |
“deliverables” |
83 “catalysts for change” | “resources” |
“methods of empowerment” |
84 “sources” | “leadership skills” |
“meta-services” | “intellectual capital”

Damian’s Parse::RecDescent forms the algorithmic heart of spew. Its author is a lecturer in Computer Science at Monash University with an interest in language as interface. His previous experiments on metaprogramming — that is, program-writing programs — led to Getopt::Declare, also available from CPAN. For him, Parse::RecDescent serves as a test bed for a variety of advanced parsing techniques and grammatical features. It also serves as a parser generator as powerful and easy to use as Perl itself, which makes it perfect for spew.

A Look at Spew

The first three lines start most of my long programs, enabling warnings, turning on compiler restrictions, and disabling buffering on STDOUT.

Line 5 brings in the Parse::RecDescent module. You need a fairly late-model version of this rapidly developing module (I used version 1.65), so if the program doesn’t work the first time, try an upgrade. Line 6 brings in the standard Getopt::Long module.

Lines 8 through 10 look for the designated “start” symbol for the random sentence grammar. A grammar can define multiple starting places, defaulting to START unless specified. $START will contain this name later.

Lines 14 through 69 define input to Parse::RecDescent to build a parser for the small BNF random sentence grammar describing the random sentences. Keep in mind that we’re building a parser on top of which we construct an un-parser; the same terminology appears at more than one level of the design. In fact, I tried to make the random-sentence BNF look very much like Parse::RecDescent grammar as well. If Parse::RecDescentcan create this random sentence grammar parser successfully, we get a parsing object in $parser. The Parse::RecDescent invocation fails with the simple bad error only if you’ve mangled the spew-grammar in lines 15 to 68.

Lines 16 through 27 describe the return value from the parser once it is executed against a random sentence grammar. In brief, we’ll have a hashref with keys being the various non-terminal symbols of the grammar, and the corresponding values representing the information about that non-terminal. The is nested sub-key contains a series of choices, which will be randomly selected during the random walk. Some of the item entries will be literal strings to be printed (they are indicated by a leading space, stripped before printing), and some are further non-terminal symbols to be walked recursively. To generate this data structure, we parse the random sentence grammar with the grammar that follows.

The excellent documentation for Parse::RecDescent details the inputs it recognizes. If you’re familiar with general BNF, you’ll probably be able to pick out most of the elements of spew‘s grammar between lines 14 and 69 even without his specification. Highlights include:

* Non-terminals of both the spew grammar and the random sentence grammar are C-symbols (the same as Perl identifiers).

* The spew grammar uses both regular expressions and constant strings as terminals, and those terminals are separated by optional white space.

* The spew grammar keeps track of the line numbers where non-terminals are defined and used (for error reporting).

* When a rule has a subrule (a parenthesized part), a non-terminal entry is generated for the subrule, but assigned a sequentially increasing integer instead of a real name. In all other respects, the non-terminal acts the same a user-defined non-terminal. This means that:


foo: a ( b c | d e ) f | g h

is the same as


foo: a 1 f | g h
1: b c | d e

except that you can’t use “1″ as a legal identifier.

* Weights are added by prefixing a choice with a positive floating-point number followed by an @, as in:


foo: 5 @ fred | 2 @ barney | 1.5 @ dino | betty

* which is five times as likely to pick fred as betty (or a total of 5 out of 9.5 times). This is simpler than repeating a grammar choice multiple times, as I’ve seen in other similar programs, and it lets you do fine-grained ratio definitions.

* Some productions use <error> to present a hint to the user if nothing is found. For example, if while looking for an item, the parser finds neither a Perl-style quoted string, nor a non-terminal, nor a subrule, the error flag says that this is worthy of an error message at this level, and reminds the user of the valid choices. Without <error>, a higher level of the RecDescent parser tells the user nothing more than that the grammar is bad.

After the parser is constructed, lines 71 through 74 fetch the random sentence grammar from the filenames specified on the command line. Standalone comment lines — those beginning with a ‘#’ — are discarded, but the spew grammar doesn’t recognize same-line comments.

Line 76 uses the constructed parser to parse the random sentence grammar, yielding the definitions for the non-terminals. Lines 78 to 87 walk through the parsed output, looking for inconsistencies in non-terminals that are used but not defined, or vice versa. We don’t actually abort the program if something is used but not defined, because it might not be selected on this particular random walk. It’s a bit like playing Russian roulette though.

Line 89 is a debugging stub. If you want to see the output of the parser, uncomment the line. I used it quite a bit while I was debugging the parser. Line 90 starts the random walk through the random sentence grammar, by generating exactly one item of the top-level non-terminal.

Lines 92 through 116 define the show subroutine, extracting the right portions of the parsed random sentence grammar. Lines 96 to 104 select one of the choices, based on their weights. Lines 105 through 115 dump each item of that choice. If the item is a terminal, it gets printed in line 109. If the item is a symbol, it’s a non-terminal, and show gets called recursively to dump the new non-terminal.

That’s the spew program. For the fun of seeing it in action, we need input: a random sentence grammar. I searched for examples by typing random sentence into www.google.com. The most interesting hit I got led me to the Dilbert Mission Generator, located at http://www.dilbert.com/comics/dilbert/career/bin/ms2.cgi. I spent about an hour hitting reload repeatedly to reverse-engineer the output, and came up with Listing Two. I’ve cleaned up some of the choices, and fixed a few misspellings, so this grammar isn’t exactly what you see at www.dilbert.com.

The Dilbert Mission Generator

It’s convenient to use the default start symbol, START, in line 5. The output is four missions (one for each fiscal quarter), as shown in line 7. I named the non-terminals similar to the language they evoke, because that makes it easier for me to see what fits together.

Note that lines 14, 15, and 17 use subrules to eliminate some typing, and that line 11 has a weight of 2, choosing the missions that have reasons twice as likely as the missions that are unjustified.

The lists of terms in lines 50 to 84 came off the “word list” page linked from the mission generator, with a small bit of modification.

If you stick this into mission. spew (a simple invocation of spew) mission.spew can yield something like this on standard output:

We have committed to collaboratively integrate ethical sources while promoting personal employee growth.

Our challenge is to professionally simplify progressive services to set us apart from the competition.

We assertively restore virtual data so that we may synergistically create parallel infrastructures while promoting personal employee growth.

It is our business to authoritatively network prospective products so that we may endeavor to efficiently customize seven-habits-conforming services while promoting personal employee growth.

Hmm, I think I heard those very words at the last shareholder meeting.Those employees sure are growing too.

One of the other results of my research into random grammars led to a nice little stash on Julie Zelenski’s site (http://www-cs-faculty.stanford.edu/~zelenski/rsg/grammars/). The trouble is that they were all in a weird format that spew doesn’t recognize. Well, in a short period of time, I had written a conversion program to go from one format to the other, given in Listing Three. Now, I can use Julie’s work to generate Bionic Woman episodes and Microsoft Press Releases. How fun!




Listing Three: Program for Converting Julie Zelenski’s Random Grammars to C


  1     #!/usr/bin/perl -w
2 use strict;
3 $|++;
4
5 use Parse::RecDescent;
6
7 (my $parser = Parse::RecDescent->
new(<<’END_OF_GRAMMAR’)) or die “bad!”;
8
9 start: <leftop: junk rule junk> /\Z/ { 1 }
10
11 junk: /[^{}]*/ { 1 }
12
13 rule: “{” defining choices “}” { print
“$item[2]: $item[3]\n\n” }
14 | <error>
15
16 defining: symbol “;” <commit> { $item[1] } |
symbol
17
18 symbol: /^<(\S+)>/ {
19 my $x = $1;
20 $x =~ s/([^a-zA-Z])/sprintf “_%02x_”, ord $1/ge;
21 if ($x eq “start”) {
22 $x = “START”;
23 }
24 ## warn “saw symbol $x\n”;
25 $x;
26 }
27
28 choices: choice(s) { join ” |\n “, @
{$item[1]} } | { q{“”} }
29
30 choice: item(s) “;” { join q{ ” ” }, @
{$item[1]} }
31 | <error>
32
33 item: symbol | wordlist | <error>
34
35 wordlist: word(s) { “q{” . join( ” “, @
{$item[1]} ) . “}” }
36
37 word: /^”(\S+)”/ <commit> { $1 } |
/^[^\s<>{};]+/
38
39 END_OF_GRAMMAR
40
41 (my $parsed = $parser->start(join ”, <>))
or die “bad parse”;
42
43 ## see http://www-cs-faculty.stanford.edu/~zelenski/rsg/grammars/

Anyway, I hope you’re now convinced that Randal is actually a figment of his own imagination, and all of the work is really being done by clever programs. (Note to editor: please continue to send checks for columns; I’m still paying for hardware maintenance).



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-authored Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com.

Comments are closed.