Hacking with CouchDB

Working with CouchDB is very straightforward. There's virtually no setup involved and no complicated libraries to hassle with.

Last week we started looking into CouchDB, a document-oriented database with many advanced features and a snowballing user base. We looked at installation on Ubuntu (trivial), high-level features, and the built-in Web interface called Futon. This week we’ll get some data into CouchDB so that we can eventually play with indexing, views, and querying.

Upgrade Time

Thanks to installing the Ubuntu Netbook Remix 9.10 on my Samsung NC10 Netbook, we can look at the latest and greatest CouchDB, version 0.10. What a difference a week makes! (Oh, and UNR runs surprisingly well on this little machine!)

The main difference from last week’s output is the version number:

$ curl http://localhost:5984/
{"couchdb":"Welcome","version":"0.10.0"}

All I needed to do was sudo apt-get install couchdb.

Test Script

My primary development language is Perl, so I’ll show examples using Perl. We’ll make use of the Net::CouchDb distribution from CPAN.

$ sudo cpan Net::CouchDb

That will also install a JSON module, since the client and server speak to each other using JSON documents over HTTP.
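To make the "JSON documents over HTTP" part concrete, here is a minimal sketch of what a record looks like on the wire. It uses the core JSON::PP module, which is an assumption on my part; Net::CouchDb may well pull in a different JSON module, and the field names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;   # core module since Perl 5.14 (assumed here for illustration)

# canonical() sorts the keys so the output is stable from run to run
my $json = JSON::PP->new->canonical(1);

# An ordinary hash reference with illustrative field names
my $record = {
    foo     => 'bar',
    message => 'hello, world!',
};

# This string is what travels in the body of an HTTP request
my $wire = $json->encode($record);
print "$wire\n";

# Decoding a response body gives back an ordinary Perl structure
my $decoded = $json->decode($wire);
print $decoded->{message}, "\n";
```

The client library does this encoding and decoding for you; the point is only that nothing more exotic than a JSON string ever crosses the connection.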

With that done, let’s whip up a simple script that connects to the local CouchDB server, creates a database, and stores a trivial document.

#!/usr/bin/perl -w

use strict;
use Net::CouchDb;
use Net::CouchDb::Document;

$|++;

my $couch_db = 'test_foo';
my $cdb = Net::CouchDb->new(host => 'localhost', port => 5984) or die "$!";
$cdb->create_db($couch_db) or die "db already exists?: $!";
my $test_db = $cdb->db($couch_db);
my $record = {
    foo     => 'bar',
    message => 'hello, world!',
};
my $doc = Net::CouchDb::Document->new(1001, $record);
$test_db->put($doc) or die "$!";
print "OK\n";

exit;

Basically, we create a CouchDB connection object ($cdb) and then call create_db() to create a new database. In this example, the script simply dies if the database already exists; in a real program you’d do something a bit more sophisticated. Once connected, we construct a hash with the keys foo and message and keep a reference to it in $record. We then pass that reference, along with a document id (1001), to the Net::CouchDb::Document constructor to get a document object. Then it’s just a matter of calling the put() method on the database object ($test_db) and passing in that document object.
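One way to be a bit more sophisticated about an existing database is to treat "already exists" as success. The sketch below bypasses Net::CouchDb and talks to CouchDB's HTTP API directly with the core HTTP::Tiny module. The helper name ensure_db() is hypothetical, and the sketch assumes the server answers a PUT on a database URL with 201 when it creates the database and 412 when the database is already there.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;   # core module since Perl 5.14

# ensure_db() is a hypothetical helper, not part of Net::CouchDb.
# $http can be any object whose put($url) returns a hash reference
# with 'status' and 'reason' keys, which is what HTTP::Tiny returns.
sub ensure_db {
    my ($http, $base_url, $db_name) = @_;
    my $res = $http->put("$base_url/$db_name");

    return 'created' if $res->{status} == 201;   # new database
    return 'exists'  if $res->{status} == 412;   # already there, which is fine
    die "could not create $db_name: $res->{status} $res->{reason}\n";
}

# Assumes a CouchDB server listening on localhost:5984.
# The eval keeps the script alive if the server cannot be reached.
my $state = eval {
    ensure_db(HTTP::Tiny->new, 'http://localhost:5984', 'test_foo');
};
print defined $state ? "test_foo: $state\n" : "error: $@";
```

Because the helper only looks at the status code, the script can be run repeatedly without editing the database name or deleting the database first.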

To verify that the document actually got there, you can visit the Futon interface and click on the test_foo database. The URL should look like this: http://localhost:5984/_utils/database.html?test_foo. You’ll see a document with the id 1001 that you can click on and manipulate.

Any changes you make in the Futon interface can be committed to the database by clicking the “Save Document” link in the upper left. Doing so doesn’t actually update or replace the document. Instead it stores a new version of the document.

Load Some Data

In order to do anything interesting, we need some data to load into CouchDB. So let’s write a simple tool that can extract messages from a Unix mailbox file (mbox). We’ll treat each message as a document with multiple fields: each message header as well as the body.

Let’s install a few Perl modules to make that task easier.

$ sudo apt-get install libmail-mbox-messageparser-perl libmailtools-perl

With those installed we can use the following code, which builds on the previous example:

#!/usr/bin/perl -w
$|++;

use strict;
use Net::CouchDb;
use Net::CouchDb::Document;
use Mail::Mbox::MessageParser;
use Mail::Internet;
use FileHandle;    # needed for "new FileHandle" below

my $db_name = 'test_mail';
my $cdb = Net::CouchDb->new(host => 'localhost', port => 5984) or die "$!";
$cdb->create_db($db_name);
my $mail_db = $cdb->db($db_name);

my $file_name = shift || 'mbox';
my $file_handle = new FileHandle($file_name);

my $folder_reader = Mail::Mbox::MessageParser->new({
    'file_name'    => $file_name,
    'file_handle'  => $file_handle,
    'enable_cache' => 0,
    'enable_grep'  => 1,
});

# skip anything before 1st message
my $prologue = $folder_reader->prologue;
print $prologue;

# read one message at a time
while(not $folder_reader->end_of_file())
{
    my $email = $folder_reader->read_next_email();
    my $msg = Mail::Internet->new();
    my @lines = split /\n/, $$email;
    $msg->extract(\@lines);

    my $body = join "\n", @{$msg->body};
    my $id   = $msg->get("Message-Id:");

    my $message = {
        Body    => $body,
    };

    for my $field (qw[From: To: Cc: Subject:]) {
        my $value = $msg->get($field);
        if ($value) {
            $message->{$field} = $value;
        }
    }

    my $doc = Net::CouchDb::Document->new($id, $message);
    $mail_db->put($doc) or die "$!";
}

exit;

This time, we create a database named test_mail to hold our email messages. Then we use Mail::Mbox::MessageParser to parse through the mailbox file given as the first argument on the command line (it defaults to mbox in the current directory).

We then iterate over each message in the mailbox, using Mail::Internet->extract() to parse the message into an object from which we can extract headers and the body. We then construct a $message hash that will represent the document to store in CouchDB. We include the body and then any of the following header fields if they exist: From, To, Cc, Subject. You could easily add additional fields like User-Agent, Precedence, and so on.
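The header-to-document step can be sketched on its own, which makes it easier to see how extra fields such as User-Agent would be added. The helper message_to_doc() below is hypothetical, and it parses the headers by hand so that it runs with no CPAN modules; it is a stand-in for the Mail::Internet calls, not a replacement for a real mail parser. The sample message is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# message_to_doc() is a hypothetical helper. It takes the raw text of
# one message plus a list of header names, and returns a hash reference
# suitable for storing as a document.
sub message_to_doc {
    my ($raw, @fields) = @_;

    # Headers end at the first blank line; the rest is the body
    my ($head, $body) = split /\n\n/, $raw, 2;
    $head =~ s/\n[ \t]+/ /g;            # unfold continuation lines

    my %header;
    for my $line (split /\n/, $head) {
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;
        $header{lc $1} //= $2;          # keep the first occurrence only
    }

    my %doc = (Body => defined $body ? $body : '');
    for my $field (@fields) {
        my $value = $header{lc $field};
        $doc{$field} = $value if defined $value && length $value;
    }
    return \%doc;
}

# A made-up message with a folded Subject header and no Cc header
my $raw = join "\n",
    'From: alice@example.com',
    'To: bob@example.com',
    'Subject: CouchDB',
    ' views',
    'User-Agent: Mutt/1.5',
    '',
    'Hello from the mailbox.',
    '';

my $doc = message_to_doc($raw, qw(From To Cc Subject User-Agent));
print "$_: $doc->{$_}\n" for sort keys %$doc;
```

Headers that are missing from a message (Cc in this sample) simply never become fields, which mirrors the "if they exist" check in the loop.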

Once that document is created, we use it as the basis of a Net::CouchDb::Document object that is then stored in the database.

I ran this code against an mbox file containing 46 messages as delivered by procmail and read with mutt. But it just as well could have worked against a mailbox file from Mozilla Thunderbird or Evolution.

Now, I should note that there’s a lot more we could do here. There’s little error checking, no scrubbing or normalization of the data, etc. The reality is that nowadays a lot of email is really a multipart MIME message that may contain a plain-text piece and an HTML piece (and possibly attachments for the images that make up an annoying animated signature or “stationery” background). We don’t deal with any of that. The point is to get some data into CouchDB, not to write a fully functional email preservation tool.
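As one small example of the scrubbing we skipped, consider the document id. Header values typically arrive with a trailing newline and angle brackets, and some messages have no Message-Id at all. The sketch below shows one possible cleanup; doc_id_for() is a hypothetical helper, and the fallback of hashing the raw message is my own assumption, not something the script above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);   # core module since Perl 5.10

# doc_id_for() is a hypothetical helper: it turns a Message-Id header
# value into a tidy document id, or derives one from the message text.
sub doc_id_for {
    my ($message_id, $raw_message) = @_;

    if (defined $message_id) {
        $message_id =~ s/^\s+|\s+$//g;    # trim, including a trailing newline
        $message_id =~ s/^<(.*)>$/$1/;    # drop the surrounding angle brackets
        return $message_id if length $message_id;
    }

    # No usable Message-Id: fall back to a stable id based on the content
    return 'sha1-' . sha1_hex($raw_message);
}

# Made-up values for illustration
print doc_id_for("<20091102.1234\@mail.example.com>\n", ''), "\n";
print doc_id_for(undef, "Subject: no id\n\nbody\n"), "\n";
```

Hashing the content means that loading the same mailbox twice yields the same ids, so duplicates show up as conflicts rather than as extra documents.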

See What You’ve Done?

Now is a good time to hit the Futon interface to see what you’ve done. You should see one record per message and can navigate through the set to spot-check the script: http://localhost:5984/_utils/database.html?test_mail. You should see that every record has a Body field as well as some of the others.
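If you would rather spot-check from a script than from the browser, the database's _all_docs resource lists every document id. Here is a minimal sketch using the core HTTP::Tiny and JSON::PP modules rather than Net::CouchDb; summarize_all_docs() is a hypothetical helper, and the code assumes the server is on localhost:5984 and that the response carries total_rows and rows fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;   # core module since Perl 5.14
use JSON::PP;     # core module since Perl 5.14

# summarize_all_docs() is a hypothetical helper: given the JSON text of
# an _all_docs response, it returns the total row count and the list of ids.
sub summarize_all_docs {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    my @ids  = map { $_->{id} } @{ $data->{rows} || [] };
    return ($data->{total_rows}, \@ids);
}

# Assumes the test_mail database created by the loader script
my $url = 'http://localhost:5984/test_mail/_all_docs';
my $res = HTTP::Tiny->new->get($url);

if ($res->{success}) {
    my ($total, $ids) = summarize_all_docs($res->{content});
    print "$total documents\n";
    print "  $_\n" for @$ids;
}
else {
    print "request failed: $res->{status} $res->{reason}\n";
}
```

The count should match the number of messages in the mailbox, provided every message had a distinct id.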

More To Come

So far we’ve covered the basics of CouchDB, installed it, and loaded some data in with a Perl script that extracts email messages from a traditional mbox file. Next week we’ll finish up by playing a bit with server-side JavaScript for views and indexing.

Have you been working with CouchDB already? If so, drop a note in the comments.
