Saving Complex Data

A Perl program alters the outside world in some manner. Otherwise, there'd be no point in running the program. But sometimes, our Perl programs need a little "memory" to do their job, something that persists information from one invocation to the next. But how do you keep such values around?

If the value is simple enough, you can just write it out to a file, as this snippet does.

my $MEMORY = "memory-file";

# … at beginning of program …
open M, "<$MEMORY" or die "Cannot open $MEMORY for reading: $!";
{ local $/; $value = <M> }
close M;

# … at end of program …
open M, ">$MEMORY" or die "Cannot open $MEMORY for writing: $!";
print M $value;
close M;

But there are a few problems with this technique.

  • First, the value has to be a simple scalar. That means it’s not very interesting, and keeping a separate file for every value doesn’t really scale to multiple values.

  • Second, because the program is writing the string version of the value to a file, it’ll run into slight problems storing floating point numbers accurately, simply because internal floating point numbers do not correlate with decimal strings on a precise one-to-one basis.

  • Third, this technique breaks down if there are multiple instances of the program using the data. For example, two programs might both read the value, then update it, then write their respective new values back to the file. The last one wins, and no trace is seen of the other invocation. And for a brief moment between opening the file for writing and closing the filehandle, the file is empty (or holds only part of the data), so any other instance of the program reading at that instant will get wrong results.
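The floating point problem is easy to demonstrate. Here’s a minimal sketch (the variable names are mine) showing that a value can change after a round trip through its default string form:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $value  = 0.1 + 0.2;   # not exactly 0.3 in binary floating point
my $string = "$value";    # default stringification keeps only ~15 digits
print "stored string: $string\n";
print $string == $value
  ? "round trip is exact\n"
  : "round trip lost precision\n";   # this is what actually prints
```

The string looks like 0.3, but converting it back yields a slightly different double than the one we started with.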

Just to keep it simple, let’s ignore the multiple reader/writer problem for the moment. What can you do to save and restore complex data?

The oldest method of storing complex Perl data, and one that’s still used today, is the Data::Dumper module, which has enough power to convert a nearly-arbitrary data structure into the code that’s necessary to recreate the data. The usage is rather simple:

my $MEMORY = "memory-file";

#… assign values to $complex_variable …

use Data::Dumper;
open M, ">$MEMORY" or die;
print M Data::Dumper->Dump([$complex_variable], ['$complex_variable']);
close M;

To restore the value later, simply say do $MEMORY;. That one line recreates the value of $complex_variable into a package variable with the same name. You have to use a package variable because a lexical variable defined in the $MEMORY file would not persist beyond the do operation.
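Putting the save and restore halves together, a complete round trip might look like this sketch (the sample data and the fruit/count keys are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $MEMORY = "memory-file";

# save: write Perl code that recreates the structure
my $data = { fruit => "apple", count => 3 };   # hypothetical sample data
open my $out, ">", $MEMORY or die "Cannot open $MEMORY for writing: $!";
print $out Data::Dumper->Dump([$data], ['complex_variable']);
close $out;

# restore: run that code; 'our' lets strict code see the package variable
our $complex_variable;
do $MEMORY or die "Cannot restore from $MEMORY: " . ($@ || $!);
print "count is $complex_variable->{count}\n";   # prints "count is 3"
```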

The data within the $MEMORY file is reasonably human readable. In fact, one of the most common uses of Data::Dumper is to dump the data for people to interpret the values.

While Data::Dumper is powerful and can reconstruct nearly any complex data structure, it is also limited by its design.

  • First, because complete Perl code is used to recreate the data, you must invoke a full Perl parser on the file to restore it. This can be slow in some cases, as well as a security risk. (I addressed the security aspect of this problem in the October 2001 column, available online at http://www.linux-mag.com/2001-10/perl_01.html.)

  • Second, because a dump is Perl code, it can’t be cross-platform, sharing data with Python, Ruby, Java, and C, for example.

  • Third, a dump (of Perl code) is not an ideal format for a human to edit.

To address these issues, Brian Ingerson (of Inline fame) created “Yet Another Markup Language,” also known as YAML. A YAML file contains a serialization of complex data, similar to the Data::Dumper file. However, instead of using Perl syntax constructs to delimit the data and define its structure, YAML uses a simple-to-parse set of punctuation and indentation to define the data. This structure is rich enough to represent arrays, hashes, and even blessed objects.

Because the YAML markup is not Perl code, there are no potential security problems. Also, YAML handlers have already been written for Python, Ruby, Java, and C, as well as Perl. A complex data structure can be constructed in Ruby, then YAML-stored and read into a Perl program, modified, and then written back out for a Python program to read!

The simplest interface to YAML from the Perl world is the YAML module. We can drop it in as a replacement for Data::Dumper quite simply:

my $MEMORY = "memory-file";
# assign values to $complex_variable …
use YAML qw(Dump);
open M, ">$MEMORY" or die;
print M Dump($complex_variable);
close M;

But it’s a bit simpler to use the DumpFile interface:

use YAML qw(DumpFile);
# assign values to $complex_variable …
DumpFile($MEMORY, $complex_variable);

And restoring it is nearly as easy:

use YAML qw(LoadFile);
my ($complex_variable) = LoadFile($MEMORY);

The resulting YAML file is human readable and somewhat human editable. It’s actually quite a nice design.
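For a taste of that markup, here’s a sketch of a dump and reload using the CPAN YAML module (the sample data is hypothetical, and the exact layout may vary slightly between YAML versions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use YAML qw(Dump Load);   # the YAML module from the CPAN

# hypothetical sample data
my $data = { name => "sample", items => [ 1, 2, 3 ] };

my $text = Dump($data);
print $text;    # indentation and leading dashes mark the structure,
                # roughly:
                #   ---
                #   items:
                #     - 1
                #     - 2
                #     - 3
                #   name: sample

my $copy = Load($text);   # parse it right back
print "round trip ok\n" if $copy->{items}[2] == 3;
```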

One disadvantage of YAML is that it’s not included in the Perl core distribution; you must install it from the CPAN. YAML also still has the disadvantage that floating point numbers are not precisely represented, and some overhead is added to a complex data structure so that the structure can be recognized by humans. To solve these two issues, let’s look at Yet Another Way to store complex data: the Storable module.

Like Data::Dumper, Storable is included as part of the core Perl installation for recent versions of Perl. (Older versions of Perl should probably be upgraded, or you can install Storable from the CPAN.)

But unlike Data::Dumper and YAML, the Storable interface produces a serialization that is not intended to be read by humans. Instead, it’s a compact byte stream that accurately records scalars (both strings and numbers), complex data structures, and blessed objects. The usage is similar to YAML:

use Storable qw(store retrieve);
# … change $complex_variable …
store $complex_variable, $MEMORY;
# … later …
$complex_variable = retrieve $MEMORY;

Note that the data is written with some sensitivity to the byte order (endianness) of the processor architecture running the code. At a slight speed penalty, you can replace store with nstore, which writes the data in an architecture-independent manner. And retrieve is smart enough to recognize either format.
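A portable round trip is thus a one-word change from the earlier snippet; here’s a sketch with hypothetical sample data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

my $MEMORY = "memory-file";
my $complex_variable = { pi => 3.14159, rows => [ [1, 2], [3, 4] ] };  # sample data

nstore $complex_variable, $MEMORY;   # byte-order-independent on disk
my $copy = retrieve $MEMORY;         # retrieve detects which format was used
print "restored ", scalar @{ $copy->{rows} }, " rows\n";   # prints "restored 2 rows"
```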

So far, all of these modules store the entire complex value in a single data file, and any change to the data rewrites the whole file. There are times when you know you’ll be predominantly accessing and updating only a portion of the data. For those cases, you can segment the data using the DBM mechanism.

The DBM mechanism essentially puts a hash out on disk. In its simplest form, you can associate a hash with a diskfile using dbmopen:

dbmopen %db, $MEMORY, 0644 or die "Cannot tie $MEMORY: $!";

From this point until the end of the program, any access to the %db hash is mapped into DBM calls against the disk file. Because it’s a hash, you can update arbitrary keys with new values, remove key/value pairs, and even iterate over the entire hash.
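For instance, here’s a sketch of a persistent hit counter built on dbmopen (the file and key names are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MEMORY = "memory-db";   # may appear on disk as memory-db.db or .dir/.pag

my %db;
dbmopen %db, $MEMORY, 0644 or die "Cannot open DBM $MEMORY: $!";

$db{visits}++;                        # a counter that persists across runs
$db{last_run} = scalar localtime;     # update an arbitrary key
delete $db{stale_key};                # remove a key/value pair (hypothetical key)

while (my ($key, $value) = each %db) {    # iterate over the whole hash
    print "$key => $value\n";
}
dbmclose %db;
```

Run it twice and the visits count climbs, because every hash access goes straight to the disk file.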

The DBM hash is implemented using the tie mechanism, using one of the many DBM modules, as listed in the AnyDBM_File man page. The dbmopen finds a suitable DBM implementation, then ties the variable to the appropriate interface. Depending on the DBM implementation, you may end up with a file with a name that ends in .db, or a pair of files that end in .dir and .pag. Also, some implementations have a limited key or limited key/value pair size, as small as 1024 bytes.

For example, if SDBM is the most sophisticated DBM you have installed for Perl, the dbmopen call above is automatically translated into:

use Fcntl;
use SDBM_File;
tie %db, SDBM_File, $MEMORY, O_RDWR|O_CREAT, 0644 or die "…";

But unless you need to tweak some of these settings, the dbmopen call is often easier to type.

Using a DBM allows you to store a hash onto the disk. But the hash must have simple scalars for the keys and values, and cannot contain further complex structures. But if the values of a hash can be arbitrary scalars, couldn’t you take a complex data structure, serialize it, and then store that value as an element of the hash stored within a DBM? Certainly. But before you scurry off to write the code, you might want to look at the MLDBM module, available from the CPAN. This module does precisely that.

Using the tie interface again, replace the SDBM_File module with MLDBM:

use Fcntl;
use MLDBM;
tie %db, MLDBM, $MEMORY, O_RDWR|O_CREAT, 0644 or die "…";

Here, every time an assignment is made to an element of %db, the scalar is serialized using Data::Dumper (by default). If it’s a simple scalar, then nothing much happens. But if it’s a reference to a complex data structure, you get a single string that recreates that data structure, and that string is stored into the DBM.

When a value is pulled from an element of %db, the process is reversed: the value is eval'ed, resulting in a complex data structure in memory.
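Here’s a sketch of that in action with MLDBM’s defaults (the file name and data are hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use MLDBM;    # from the CPAN; defaults to SDBM_File plus Data::Dumper

my $MEMORY = "memory-mldbm";
my %db;
tie %db, 'MLDBM', $MEMORY, O_RDWR|O_CREAT, 0644 or die "Cannot tie $MEMORY: $!";

# the whole structure is serialized to one string on assignment
$db{config} = { color => "blue", sizes => [ "S", "M", "L" ] };

my $config = $db{config};              # fetching deserializes it again
print "color is $config->{color}\n";   # prints "color is blue"
untie %db;
```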

Because Data::Dumper is a bit slow, a bit dangerous, and a bit noisy, a better choice for the serializer is Storable. You can get that by replacing the use line above with:

use MLDBM qw(Storable);

Now Storable is used rather than Data::Dumper. You can also pick a specific DBM module with:

use MLDBM qw(DB_File Storable);

It’s important to note with MLDBM that the DBM is updated only when an element of the hash is written. If you assign a complex data structure as an element of the hash and then later update a part of that structure, the change is not reflected in the DBM! The MLDBM man page provides examples and advice for working around this.
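The usual idiom is to pull the structure out, change the copy, and assign the whole thing back. A sketch, using hypothetical data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use MLDBM qw(SDBM_File Storable);   # CPAN module, with the Storable serializer

my %db;
tie %db, 'MLDBM', "memory-mldbm2", O_RDWR|O_CREAT, 0644
  or die "Cannot tie: $!";

$db{config} = { color => "blue" };

# WRONG: this modifies only a temporary in-memory copy, not the DBM:
# $db{config}{color} = "red";

# RIGHT: read, modify, and write back the whole value
my $entry = $db{config};    # fetching deserializes a fresh copy
$entry->{color} = "red";    # change that copy
$db{config} = $entry;       # assigning the element reserializes and stores it

print "color is $db{config}{color}\n";   # prints "color is red"
untie %db;
```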

Well, hopefully that’ll get you started on your persistent data quest. If these means are not sophisticated enough for you, be sure to check the CPAN for cool tools like Tie::DBI (linking your tied hash to a DBI table), Tangram (mapping complex objects directly to a collection of tables), Attribute::Persistence (for a dirt-simple persistent interface), and Inline::Files (for a novel rewrite-your-program persistent storage). Until next time, enjoy!

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com.
