dcsimg




Putting PostgreSQL Through its Paces

The folks at Red Hat recently selected the open source PostgreSQL database as the foundation for their commercial Red Hat Database product. This decision, however, was not made without a good deal of whining from the ranks of the MySQL faithful, who weren't able to fully comprehend why it was that their baby had been passed over.

The folks at Red Hat recently selected the open source PostgreSQL database as the foundation for their commercial Red Hat Database product. This decision, however, was not made without a good deal of whining from the ranks of the MySQL faithful, who weren’t able to fully comprehend why it was that their baby had been passed over.

After all, MySQL was faster, better, cheaper, easier to install, and all those other important buzzwords. PostgreSQL was an oddity: buggy, unsupported, no longer in development, unable to easily deal with large data, and (gasp) requiring a separate process for each connection instead of being nicely threaded.

It’s obvious to me that those who discount PostgreSQL in favor of MySQL haven’t taken a look at PostgreSQL recently. I too was a “MySQL all the way” person until some fellow Perl hackers convinced me to take another look. Making one last comparison before getting to the meat of this month’s column, my conclusion is that MySQL is “putting a bit of structure onto a flat file,” where PostgreSQL is “an open source Oracle replacement at a fraction of the cost.”

To commemorate the successful installation of PostgreSQL on my system, I wanted to tackle a little project. While racking my brain for a worthy project, someone on the Perl IRC channel mentioned a “word of the day” program; this inspired me to create one with PostgreSQL.

A DBM hash could easily hold the words and their definitions, but a database could also easily hold a per-user “I’ve already seen that word” table. That way, we wouldn’t end up seeing the same word twice. In addition, someone recently showed me the trick:


SELECT word FROM list ORDER BY random() LIMIT 1;

to get a random entry, so I was hoping to put that into a program at some point. This works by first sorting the list according to a random value, effectively shuffling it, and then selecting just the first entry.

Of course, after tinkering with it a bit, I decided that there was absolutely no reason not to make it a CGI script and a crontab-able program, so I went ahead and threw in HTML-cleaning of the definitions.

My next step was then to figure out which dictionary I should use. I was originally pointed at Lingua::Wordnet and got detoured for an hour, trying to install and understand that. I discussed this online, and someone mentioned FOLDOC, the dictionary of computing terms (both historical and current), as a good source of pertinent information. So I checked FOLDOC, and it indeed turned out to be a very good source for this little project.

The program flows as follows: On each invocation, the current FOLDOC flat file is mirrored into a local cache. If the local cache was updated, the file is parsed into words and definitions and placed in the PostgreSQL database in a simple two-column table.

Next, a “deck” is consulted for every user (keyed by Unix userid) who invokes the program. Initially, the deck is empty, so a new wordlist is created in a random order by “shuffling” a copy of the terms from the dictionary. This is done using SQL similar to (for the user “merlyn”):


INSERT INTO deck (word, person)
SELECT word, merlyn
FROM foldoc
ORDER BY random()

The first entry for this user is pulled off and deleted from the deck. The definition is then pulled from the dictionary table. If the invocation was from the command line, the word and definition are displayed with minimal reformatting. However, if CGI invocation is detected, then an HTML massaging locates URLs and e-mail addresses, as well as fixes up HTML entities.

Now comes the really cool part of PostgreSQL. All this shuffling and updating of the dictionary is being done within transactions that do not block the other readers! If the dictionary is stale, the other readers see the dictionary instantaneously change from the old dictionary to the new dictionary, without blocking. Try that with MySQL. PostgreSQL provides the concurrent, consistent views that only big guys like Oracle and Interbase have been able to provide in the past.

Enough on that? Let’s look at the code in Listing One.








feedback



       page 1 2 3 4   next >>

 

Linux Magazine /
December 2001 / PERL OF WISDOM
Putting PostgreSQL Through its Paces







Listing One: Word of the Day — Part I


1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
4
5 use DBI;
6
7 ## configuration
8 my $REMOTE = “http://foldoc.doc.ic.ac.uk/foldoc/Dictionary“;
9 my $LOCAL = “/home/merlyn/Web/Dictionary.txt”;
10 my @DSN = qw(dbi:Pg:dbname= foldoc_word_of_the_day USER PASS);
11 ## end configuration
12
13 my $dbh = DBI->connect(@DSN, {RaiseError => 1, PrintError => 0});
14
15 refresh_wordlist() if -w $LOCAL;
16 my ($word, $meaning) = get_word_and_meaning_for(scalar getpwuid $<);
17
18 if ($ENV{GATEWAY_INTERFACE}) { # running under CGI
19 require HTML::FromText;
20
21 print “Content-type: text/html\n\n”;
22 print HTML::FromText::text2html(“$word\n\n$meaning”,
23 map { $_ => 1 }
24 qw(title urls email paras));
25 } else {
26 print “$word\n$meaning”;
27 }
28
29 $dbh->disconnect;
30
31 exit 0;
32
33 sub refresh_wordlist {
34 require LWP::Simple;
35 return unless LWP::Simple::mirror($REMOTE, $LOCAL) == 200;
36
37 eval {
38 $dbh->do(q{CREATE TABLE foldoc (word text, meaning text)});
39 };
40 die $@ if $@ and $@ !~ /already exists/;
41
42 eval {
43 $dbh->begin_work;
44
45 $dbh->do(q{DELETE FROM foldoc}); # clean it out
46
47 my $insert = $dbh->prepare
48 (q{INSERT INTO foldoc(word, meaning) VALUES (?, ?)});
49
50 open LOCAL, $LOCAL or die;
51 my $entry;
52 {
53 $_ = <LOCAL>;
54 if (not defined $_ or /^\S/) { # end of definition
55 if (defined $entry) { # save any cached definition
56 $entry =~ s/^(\S.*)\n([ \t]*\n) *// or die;
57 my $key = $1; # get key
58 $entry =~ s/\s+\z/\n/; # clean up definition
59
60 unless ($key =~ /Free On-line Dictionary|Acknowledgements/) {
61 print “$key -> “;
62 print $insert-> execute($key, $entry);
63 print “\n”;
64 }
65
66 undef $entry;
67 }
68 last unless defined $_;
69 }
70 $entry .= $_;
71 redo;
72 }
73 $dbh->commit;
74 };
75 if ($@) {
76 $dbh->rollback;
77 die $@;
78 }
79
80 ## create and reset word
81
82 eval {
83 $dbh->do(q{CREATE TABLE deck (word text, person text)});
84 };
85 die $@ if $@ and $@ !~ /already exists/;
86
87 eval {
88 $dbh->begin_work;
89 $dbh->do(q{DELETE FROM deck}); # clean it out
90 $dbh->commit;
91 };
92 if ($@) {
93 $dbh->rollback;
94 die $@;
95 }
96
97 }
98
99 sub get_word_and_meaning_for {
100 my $person = shift;
101
102 for (my $tries = 0; $tries <= 2; $tries++) {
103 $dbh->begin_work;
104 if (my ($word) =
105 $dbh->selectrow_array(q{
106 SELECT word FROM deck
107 WHERE person = ?
108 FOR UPDATE OF deck
109 LIMIT 1
110 },
111 undef, $person)) {
112
113 ## got a good word
114 $dbh->do(q{DELETE FROM deck WHERE (word, person) = (?, ?)},
115 undef, $word, $person);
116 $dbh->commit;
117
118 if (my ($meaning) =
119 $dbh->selectrow_array(q{
120 SELECT meaning FROM foldoc
121 WHERE word = ?
122 }, undef, $word)) {
123 return ($word, $meaning);
124 }
125
126 die “missing meaning for $word\n”;
127 } else {
128 ## no words left, shuffle the deck
129
130 $dbh->do(q{
131 INSERT INTO deck (word, person)
132 SELECT word, ?
133 FROM foldoc
134 ORDER BY random()
135 }, undef, $person);
136 $dbh->commit;
137 }
138 }
139 die “Cannot get a word for $person\n”;
140 }








feedback



<< prev   page 1 2 3 4   next >>

 

Linux Magazine /
December 2001 / PERL OF WISDOM
Putting PostgreSQL Through its Paces





Lines 1 through 3 start most of the programs I write, turning on warnings, enabling compiler restrictions, and disabling the buffering on STDOUT.

Line 5 accesses the DBI module, found in the CPAN. You’ll also need to have the DBD::Pg module, found both in the CPAN and with the PostgreSQL source distribution.

Lines 8 through 10 define the configuration parameters I might want to change in this program. The URL from which I’m fetching the FOLDOC file is in $REMOTE. The local file in which this is being cached is in $LOCAL. And @DSN defines the DBI identifier, user, and password. The PostgreSQL database foldoc_word_of_the_day must already exist but can be empty, because the script creates the necessary tables. (Setting up a PostgreSQL database is beyond the scope of this column, but information is available at http://www.postgresql.org/. Make sure you have the latest version, as some of the syntax here requires at least version 7.1. Also, make sure your default permissions for tables are adequate, or you may need to GRANT permissions so, for example, the user nobody has access to the deck.)

Line 13 establishes the connection to the PostgreSQL database. RaiseError is set, causing all serious errors to throw an exception. If uncaught, the exceptions cause the program to die, but for those rare steps where some of the serious errors might be expected, we can use an eval block to catch them.

Line 15 causes the database to be refreshed from the master remote wordlist, but only if the invoker of the script can write to the local cache. This means that if the file is writable only by me, then the CGI invocations won’t refresh the file, and I don’t have to make any directory or file “world writable” just to make it work with CGI.

Line 16 fetches the word and its definition from the database, for the user running the script. This is nobody on my Web server, or merlyn for me. The “already seen” lists are maintained per-user.

Lines 18 to 27 format the response. If we’re running as CGI, then the environment variable GATEWAY_INTERFACE is set. The text2html subroutine from HTML::FromText is pulled in, and appropriate headers are added for a CGI response. If it’s not CGI, then the word and meaning are simply dumped to STDOUT.

Line 29 disconnects from the database, and line 31 keeps us from accidentally executing further code below.

And now for the subroutines, starting with refresh_ wordlist in line 33. We’ll fetch the list with LWP::Simple (found in the LWP library in the CPAN), using the mirror routine to mirror the file to the local cache. If the return code is 200, then we’ve got a new version, and it’s time to refresh the PostgreSQL database as well.

Lines 37 to 40 ensure that the PostgreSQL table holding the dictionary has been created. Both the word and meaning columns are of type “text,” which is a text string of unlimited size, stored compactly.

Lines 42 to 74 form a commit-block for a transaction. If anything fails in the block, then the rollback in line 76 erases the actions as if nothing had happened. Also, any changes made within the block are not visible to other users until line 73 is executed, so it’s as if we’re working on our own private copy of the database.








feedback



<< prev   page 1 2 3 4   next >>

 

Linux Magazine /
December 2001 / PERL OF WISDOM
Putting PostgreSQL Through its Paces






Line 45 clears out the foldoc table. Good thing we’re doing this in private, or other users would no longer be able to get the definitions for their words. (In MySQL, this could be accomplished by blocking other users while we’re doing this, but in PostgreSQL, the other users see the old version until we commit.)

Lines 47 and 48 define a statement handle to insert the entries into the foldoc table. Lines 50 to 72 parse the FOLDOC file. The file consists of many thousands of entries that look like the following:


Artistic license

<legal> The {open source license}to {Perl}.
(1999-12-29)

Note that the term is flush-left and the definition is indented. This is a trivial parsing problem for Perl. The business step is down in line 62, where the term and definition are inserted into the database. A trace of the term and “number of lines inserted” (usually 1) accompanies the insertion, which usually scrolls by faster than I can read it.

Once a new dictionary is inserted, it’s time to also invalidate the existing decks. First, the deck is created in line 83 if needed. The deck has two columns, a word (identical to the word column of the other table), and a person. The deck is then cleaned in line 89, forcing the next hit for each individual requestor to shuffle a clean deck for them.

To get a word and definition, we call the routine starting in line 99, passing the individual into $person in line 100. We’ll try to get the first “card” of the deck twice, failing after the second try.

Lines 105 through 111 attempt to grab that first card. A placeholder is used to identify the person, ensuring that we don’t have to be aware of the quoting conventions for that string. If this succeeds, then we delete the card from the deck, look up the definition in lines 119 to 122, and return the word and the definition in line 123. If it fails, then it’s time to shuffle, so lines 131 to 135 create the shuffled deck using the random number generator to insert the items.

Again, the updates to the deck are done inside transaction begin-end brackets, so other users of the database will not see the partial updates. In fact, even if two hits for the Web user (“nobody”) come in at the same time, they’ll be dealt different words (through a serialization, thanks to the “FOR UPDATE” in the select of line 108).

There’s a handful of uncaught die operations in the program that will make some unexpected and unfriendly errors in a CGI environment, but we’ll leave those for future tweaking.

In conclusion, it’s not hard to use PostgreSQL’s advanced features. It has finally arrived as a real, practical database. Until next time, enjoy!


Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at http://www.stonehenge.com/merlyn/LinuxMag/.

Comments are closed.

Software
 Stick a Fork in Flock: Why it Failed
 CentOS 5.6 Finally Arrives: Is It Suitable for Business Use?
 Rooting a Nook Color: Is it Worth It?
System Administration
 Scripting, Part Two: Looping for Fun and Profit
 Command Line Magic: Scripting, Part One
 Making the Evolutionary Leap from Meerkat to Narwhal
Storage
 Extended File Attributes Rock!
 Checksumming Files to Find Bit-Rot
 What's an inode?
Mobile
 Putting Text to Speech to Work
 Look Who's Talking: Android Edition
 Upgrading Android: A Guided Tour
HPC
 A Little (q)bit of Quantum Computing
 Emailing HPC
 Chasing The Number