
Database Comparison

Choosing the right database involves balancing a number of factors. Here are some of the most important ones.

by Brian Jepson

Persistence is all too often the drudge work of any project. It’s fun to design and develop an application, but when it comes time to figure out how to store and fetch data to and from persistent (or long-term) storage, the work can get complicated, unruly, and downright aggravating.

In most cases, you don’t need to “roll your own” persistent storage engine. Instead, you can pick from a number of different databases that are out there, some of which may already be installed on your Linux system.

Generally speaking, there are four different types of databases available on Linux: key/value, relational, object-oriented, and XML. Each has strengths and weaknesses. When choosing a database you should consider whether it:


  • Can handle simultaneous access from multiple users

  • Can be easily accessed from programs written in different languages (especially object-oriented languages)

  • Can handle large amounts of data

  • Can ensure that multiple operations either all happen at once, or not at all (“transactions”)

  • Can perform complicated searches (“queries”)

To help compare the different types of databases, let’s imagine that we need to store data about employees, projects, and what project(s) each employee is working on. As we examine each type of database to see how it stores this information, we’ll point out its respective strengths and weaknesses.

Key/Value Databases

Key/value databases are the simplest form of databases and store data in key/value pairs. Each key (such as an employee’s ID number) corresponds to or is associated with one value (such as their name). There can’t be two entries in the database with the same key; every key must be unique.

Berkeley DB (http://www.sleepycat.com) and GDBM (http://www.gnu.org/software/gdbm/gdbm.html) are the most widely known key/value databases and may already be installed on your Linux system.








Figure One: The layout of a key/value database.

Figure One shows three separate key/value databases used to store information about employees and the projects they’re working on.

Each database is stored as a separate file in the file system (called Employee.db, Project.db, and Xref.db). Each database file contains one collection of key/value pairs, so if you need to store more information you’ll need to create another file.

The advantages of key/value databases are that they are incredibly easy to use, usually come installed with the operating system, and don’t involve significant user management since they use Linux file permissions to control access.

The programs you write to access key/value databases are also likely to be portable, as long as a copy of the particular database software is on the target system. Their “one key, one value” paradigm also maps easily to many programming languages (such as Perl’s associative arrays and Java’s Hashtable class). The keys are also stored in a quickly-accessible data structure (such as a B-tree or a hash table), so they can handle even very large databases.

But key/value databases do have drawbacks. Although they use Linux’s file permissions to control access to the database, there’s no protection against multiple users trying to write to the same database simultaneously. You must access the database from the system it’s running on; if you want remote access, you’ll have to program that yourself. Also, the size of the database is limited to the maximum size of a file or file system.

Also, there isn’t a separate language for performing queries on key/value databases. The only fetch operations are “get the value for this key” or “get the next key/value pair.” To perform even a simple query, such as finding all the employees who have been with the company for three or more years, you have to write the code yourself. You’d have to similarly search through the entire database to find the key for a specific value. This is known as a “one-way query.”
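To see what “write the code yourself” means in practice, here is a minimal sketch of the three-years-or-more query. An in-memory Map stands in for a DBM file, and the value packs the employee’s name and years of service with colons; all class, method, and data names are illustrative, not part of any real DBM API.

```java
import java.util.*;

public class ScanQuery {
    // A stand-in for Employee.db: the key is the employee ID, and the
    // value packs first name, last name, and years of service with colons.
    static Map<Integer, String> sampleDb() {
        Map<Integer, String> db = new HashMap<>();
        db.put(100, "Brian:Jepson:3");
        db.put(200, "Joan:Peckham:3");
        return db;
    }

    // Find employees with at least minYears of service: with no query
    // language, the only option is to visit every key/value pair.
    public static List<String> veterans(Map<Integer, String> db, int minYears) {
        List<String> names = new ArrayList<>();
        for (String value : db.values()) {
            String[] f = value.split(":");          // "first:last:years"
            if (Integer.parseInt(f[2]) >= minYears)
                names.add(f[0] + " " + f[1]);
        }
        return names;
    }

    public static void main(String[] argv) {
        System.out.println(veterans(sampleDb(), 3));
    }
}
```

Every query of this kind costs a full scan, no matter how clever the key lookup structure is.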

And since every key must be unique, you can’t easily use the Xref.db database from Figure One to assign an employee to more than one project. If you tried to add a value of 2000 for the key of 100, you’d overwrite the already-existing value of 1000.

(Slightly) Cheating

However, you can work around this limitation. Although key/value databases normally store only a single value (e.g., an int, char, or string), it’s possible to store “extra” data in the value field. One way to do this would be to put all the data into a single string and place a delimiter character between the bits of data. For instance, you could use a colon to delimit your data, just as it’s done in the /etc/passwd file. In our example, the value in Xref.db for the key 100 would be “1000:2000”.
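The packing and unpacking side of this workaround is simple enough to sketch in a few lines of Java. The class and method names are illustrative; a real program would store the packed string through its DBM library of choice.

```java
// Sketch of the colon-delimiter workaround: several project IDs are
// packed into the single value slot a key/value database allows.
public class XrefValue {
    // Pack project IDs into one delimited string, e.g. "1000:2000".
    public static String pack(int[] projectIds) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < projectIds.length; i++) {
            if (i > 0) sb.append(':');
            sb.append(projectIds[i]);
        }
        return sb.toString();
    }

    // Unpack the delimited string back into individual project IDs.
    public static int[] unpack(String value) {
        String[] parts = value.split(":");
        int[] ids = new int[parts.length];
        for (int i = 0; i < parts.length; i++)
            ids[i] = Integer.parseInt(parts[i]);
        return ids;
    }

    public static void main(String[] argv) {
        // Employee 100 works on projects 1000 and 2000.
        System.out.println("Xref.db value for key 100: "
            + pack(new int[] {1000, 2000}));
    }
}
```

Note that the database itself knows nothing about the delimiter; every program that reads the value must agree on the convention.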

Another way to do this is to realize that from the point of view of database routines, the value stored in the database is just a pointer to the start of the data and a count of how many bytes the data is. Using the pointer and byte count, we can actually store a C struct as the value of a key/value database.

However, if you often find yourself working around the limitations of key/value databases in this manner, you should probably consider moving to a relational database.

Relational Databases

Relational databases break complex data into a collection of tables. Tables are groups of similar items, such as employees or projects, where each item in the table has the same attributes (for example, every employee has a first name, last name, and ID number; each project has a name and ID number). A specific item is called a row, and each attribute is a column. If you were to visualize a table in a relational database, it would look very much like a spreadsheet or HTML table.

Most Linux distributions (including Red Hat, Mandrake, and Debian) include two open source relational database systems, PostgreSQL (http://www.postgresql.org/) and MySQL (http://www.mysql.com/). Proprietary alternatives are also available, such as Oracle (http://www.oracle.com) and Sybase (http://www.sybase.com).

In a relational database, not only can tables be used to hold information about a single item (an employee or a project), but they can also hold information that expresses the relationship between the employees and projects (thus the name “relational database”).

Consider the DBM tables shown in Figure One. The Xref.db file implies a relationship between Employee.db and Project.db, but there’s nothing about Xref.db that causes the database engine to enforce these relationships. For example, you’d need to use program logic to enforce a rule such as “you cannot delete an employee who is assigned to a project.” In a relational database, you can set such a rule when you create the tables and let the database engine enforce them whenever a change is made.
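Here is a rough sketch of what enforcing such a rule “in program logic” looks like, using in-memory Maps as stand-ins for the DBM files; every name here is illustrative. A relational database would let you declare the rule once and forget about it.

```java
import java.util.*;

public class ManualIntegrity {
    // Stand-ins for Employee.db and Xref.db.
    static Map<Integer, String> employeeDb = new HashMap<>();
    static Map<Integer, Integer> xrefDb = new HashMap<>();

    // "You cannot delete an employee who is assigned to a project,"
    // enforced by hand because the DBM engine won't do it for us.
    public static boolean deleteEmployee(int emplId) {
        if (xrefDb.containsKey(emplId)) return false;  // still assigned: refuse
        return employeeDb.remove(emplId) != null;
    }

    public static void main(String[] argv) {
        employeeDb.put(100, "Brian Jepson");
        xrefDb.put(100, 1000);                      // Brian is on project 1000
        System.out.println(deleteEmployee(100));    // refused
        xrefDb.remove(100);                         // unassign him first...
        System.out.println(deleteEmployee(100));    // ...now the delete works
    }
}
```

Every program that touches the database must repeat this check, which is exactly why letting the engine enforce the rule is preferable.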

Every table in a relational database has one or more columns identified as keys. These keys are similar to keys in a key/value database, but they need not be unique as they are in a key/value database. Any key that must be unique in a relational database is called a primary key.

Any column in a table can also have an index associated with it. Indexes greatly speed up searching for a value in that key or column (for an even faster type of database, see the “Main Memory Databases” sidebar). A primary key requires the creation of a “unique index,” which enforces the uniqueness of the primary key.




Main Memory Databases

The architecture of a main-memory database (MMDBMS) is similar to that of relational databases, but the data engines are tuned for in-memory data structures rather than files. Instead of creating indexes with B+ trees, MMDBMSs use data structures such as hashtables. This makes a huge difference: a lot of overhead is thrown out along with disk I/O, and the result is that main-memory databases are blazingly fast and are thus becoming popular with ultra-high-traffic Web sites as well as in real-time and embedded systems.

But this performance comes with limitations. For example, MySQL’s HEAP table type implements these in-memory features, but HEAP tables do not support partial matching in queries: you can search for all employees whose last name is ‘Jones’, but you can’t search for all employees whose last names begin with ‘Jo’. Also, should something happen to cause an unexpected shutdown (such as a power loss), all data is lost. With disk-based databases, a transaction log can be used to recover updates that were in progress when the system failed. Main-memory database vendors realize this is a problem and offer configurations that minimize exposure to such failures (for example, you can replicate the main-memory database to a disk-based database).
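The hash-versus-tree tradeoff behind that limitation can be seen with Java’s own collections: a hash table answers only exact-match lookups, while sorted keys (as in a B-tree, or Java’s TreeMap) make prefix searches a cheap range scan. This is just an illustration of the data structures involved, not any database’s actual API; the sample names are made up.

```java
import java.util.*;

public class PrefixSearch {
    // A tiny last-name index with illustrative data.
    static TreeMap<String, String> sample() {
        TreeMap<String, String> idx = new TreeMap<>();
        idx.put("Jepson", "Brian");
        idx.put("Jones", "Indiana");
        idx.put("Peckham", "Joan");
        return idx;
    }

    // Exact match is all a hash-style index can answer.
    public static String exact(Map<String, String> idx, String lastName) {
        return idx.get(lastName);
    }

    // A prefix match needs sorted keys: every key in ["Jo", "Jp")
    // begins with "Jo", so one range scan finds them all.
    public static Collection<String> prefix(TreeMap<String, String> idx, String p) {
        int last = p.charAt(p.length() - 1);
        String end = p.substring(0, p.length() - 1) + (char) (last + 1);
        return idx.subMap(p, end).values();
    }

    public static void main(String[] argv) {
        System.out.println(exact(sample(), "Jepson"));  // hash-friendly lookup
        System.out.println(prefix(sample(), "Jo"));     // needs an ordered index
    }
}
```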

One thing that remains the same is the query language, SQL, and popular APIs such as ODBC and JDBC are typically supported as well. This makes it relatively easy for relational database developers to transition to a main-memory DBMS.

Two examples of MMDBMSs are Polyhedra (http://www.polyhedra.com/) and TimesTen (http://www.timesten.com/).

Although you don’t have to use primary keys and indexes in a relational database, it’s usually a good idea. Without a primary key, it may not be possible to uniquely distinguish rows from one another. Suppose you have two employees named John Smith. By specifying a numeric primary key in the employee table’s design, and by using a different number each time you add an employee, you can distinguish between your John Smiths.

Relational databases also use foreign keys, whose values match the values of a primary key in a different (or foreign) table. Foreign keys express relationships between tables.

Representing Data in Relational Databases








Figure Two: The employee-project database

Figure Two shows the logical structure of our database, stored in three tables, employee, project, and proj_empl. The primary key in each table is marked in bold.

Note that these three tables don’t need to be stored in three separate files (and in fact almost always won’t be). In relational databases, you work with tables, not files. The database engine manages how and where it stores the tables.

The employee table has a primary key called empl_id, and the project table has a primary key called prj_id. The proj_empl table cross-references the data between the employee and project tables. The proj_empl table contains two foreign keys, fk_empl_id and fk_prj_id (the “fk” stands for “foreign key”).

Relational databases do not mandate that a primary key be a single column. The primary key in the proj_empl table actually consists of two combined columns (the foreign keys fk_empl_id and fk_prj_id). This means a given combination of employee and project IDs can only appear once in that table.

The tables themselves consist of a set of rows and columns as shown in Figure Three. If you trace fk_empl_id in proj_empl to empl_id in employee, and fk_prj_id in proj_empl to prj_id in project, you’ll see that Brian is assigned to one project (“Our Next Big Thing”), and that Joan is assigned to two projects (“Our Next Big Thing” and “Big Secret Project”).





Figure Three: The employee and proj_empl tables


THE EMPLOYEE TABLE

first_name  last_name  empl_id  years
Brian       Jepson     100      3
Joan        Peckham    200      3

THE PROJ_EMPL TABLE

fk_prj_id  fk_empl_id
1000       100
1000       200
2000       200


Programming with Relational Databases

When programming with key/value databases, you interact directly with the database, calling separate functions to add a key/value pair, fetch a value or pair (for your own queries), and delete or modify values.

In a relational database, though, these operations are handled through a special-purpose database programming language called SQL (Structured Query Language). The database is manipulated by sending SQL statements to the database server. The SQL language also contains commands that can work on entire tables, as well as the data within them (for more on SQL, see “About SQL”).




About SQL

While SQL can be used to create a database as well as create, modify, and remove rows, its most important task is to perform queries (“find me rows that match these criteria”). When querying a database with SQL, you should think of the tables as though they were mathematical sets. You can get results that are intersections, unions, or subsets of the tables.

For example, you can use SQL to generate a result set that consists of the names of all employees who have been with the company three years or less and who are also assigned to the “Our Next Big Thing” project. Such a query might look like this:


SELECT employee.first_name, employee.last_name
FROM employee, proj_empl, project
WHERE employee.empl_id = proj_empl.fk_empl_id
AND proj_empl.fk_prj_id = project.prj_id
AND project.prj_id = 1000
AND employee.years <= 3

If we were to translate this into plain English, we might get:

Join (or connect) the employee table to the proj_empl table by linking empl_id in the table employee to fk_empl_id in the table proj_empl. Then, join the proj_empl table to the project table by linking fk_prj_id in proj_empl to prj_id in project. After the tables are joined, filter them so that we only see projects where the prj_id is equal to 1000, and where the value of years is less than or equal to three. After the joins and filters are in place, retrieve the first names and last names of the matching employees.
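To make that join-then-filter reading concrete, here is the same query done by hand over in-memory rows holding the Figure Three data. The project table is omitted because the query only uses its ID; this is a sketch of what the database engine does for you, not code you would normally write.

```java
import java.util.*;

public class JoinByHand {
    // Rows from Figure Three, flattened to arrays for brevity:
    // employee: {empl_id, first, last, years}; proj_empl: {fk_prj_id, fk_empl_id}
    static Object[][] employee = {
        {100, "Brian", "Jepson", 3}, {200, "Joan", "Peckham", 3}};
    static int[][] projEmpl = {{1000, 100}, {1000, 200}, {2000, 200}};

    // Names of employees on the given project with years <= maxYears.
    public static List<String> query(int prjId, int maxYears) {
        List<String> result = new ArrayList<>();
        for (int[] link : projEmpl) {            // scan the cross-reference table
            if (link[0] != prjId) continue;      // filter: prj_id = 1000
            for (Object[] emp : employee) {      // join on empl_id...
                if ((Integer) emp[0] == link[1]  // ...then filter on years
                        && (Integer) emp[3] <= maxYears)
                    result.add(emp[1] + " " + emp[2]);
            }
        }
        return result;
    }

    public static void main(String[] argv) {
        System.out.println(query(1000, 3));  // Brian and Joan
        System.out.println(query(2000, 3));  // Joan only
    }
}
```

One SQL statement replaces both nested loops, and the server can use indexes instead of scanning every row.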

To pass SQL statements to the database server (and receive the results), an API is used to provide connectivity to the database, even across a network. A database access API might be written in a single language, but will generally be accessible from many others (e.g., the MySQL library libmysqlclient.so can be accessed from C, C++, Perl, and PHP). Alternatively, the API might also be written for a single language, such as the MySQL library for Java, which is written in Java, but can run on any system with a Java virtual machine.

Perhaps the most well-known APIs are ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity). Most database vendors support one or both of these APIs for accessing their database products. The ODBC API is available in almost every programming language.

This combination of SQL and APIs provides nearly complete interoperability between a client and the database. The programmer performs adds, deletes, and updates by writing SQL statements that are passed to the database by the API. The database vendor, type of system the database lives on, and type of client that’s accessing the database are irrelevant, because they all support SQL and the various APIs.

Listing One shows an example of querying a MySQL database in Java. After loading the JDBC driver and connecting to the MySQL database (lines 5-8), the SQL command is built and passed to the database (lines 10-13). In this case, the query is asking for the employee ID number and names of all employees in the employee table.




Listing One: Querying a MySQL database in Java


1  import java.sql.*;
2  public class GetEmpl {
3      public static void main(String[] argv) {
4          try {
5              Class.forName("org.gjt.mm.mysql.Driver").newInstance();
6
7              Connection conn = DriverManager.getConnection(
8                  "jdbc:mysql:///bjepson", "bjepson", "secret");
9
10             Statement stmt = conn.createStatement();
11             String sql =
12                 "SELECT empl_id, first_name, last_name FROM employee";
13             ResultSet rs = stmt.executeQuery(sql);
14
15             while (rs.next()) {
16                 int empl_id = rs.getInt("empl_id");
17                 String first_name = rs.getString("first_name");
18                 String last_name = rs.getString("last_name");
19                 System.out.println(first_name + " " + last_name +
20                     " is employee #" + empl_id);
21             }
22
23             stmt.close();
24             conn.close();
25         } catch (Exception e) {
26             e.printStackTrace();
27         }
28     }
29 }

Once the query has been passed to the database server, the result is returned as a set of rows. The while loop (beginning on line 15) retrieves all the rows that match the query criteria (since no criteria were specified, every row will be returned), and their data is printed out (lines 16-20).

When we connected to the database (lines 7 and 8), we had to provide a login id and password. Relational databases support multiple concurrent users. Just like a Unix user account, each database user has a username and password. Some databases even give each user their own separate place to store tables. This is quite handy in a system that supports virtual hosts, since you can give each user a private database separate from all other users on that server.

Another strength of relational databases is their ability to handle “transactions.” If several changes to a database (additions, modifications, or deletions) need to either all be done at once or else none of them should be done, a relational database can ensure that this happens. The programmer tells the database server that a transaction has started. Any changes are stored and not actually saved in the tables until the transaction is finished, at which time the programmer would tell the server to “commit” the transaction. At any point during the transaction, the programmer can tell the server to “roll back” the tables to the state they were before the transaction began.
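The commit/rollback contract described above can be sketched in a few lines with a copy-on-write map. This is a single-user illustration only; real servers use locks and transaction logs, and all names here are made up.

```java
import java.util.*;

public class TinyTransaction {
    private Map<String, Integer> committed = new HashMap<>();
    private Map<String, Integer> working = null;  // non-null inside a transaction

    public void begin() { working = new HashMap<>(committed); }

    // Changes are staged in the working copy, not the committed tables.
    public void put(String key, int value) {
        if (working == null) throw new IllegalStateException("no transaction");
        working.put(key, value);
    }

    public void commit()   { committed = working; working = null; }
    public void rollback() { working = null; }    // discard staged changes
    public Integer get(String key) { return committed.get(key); }

    public static void main(String[] argv) {
        TinyTransaction db = new TinyTransaction();
        db.begin(); db.put("balance", 100); db.commit();
        db.begin(); db.put("balance", 50);  db.rollback();  // change discarded
        System.out.println(db.get("balance"));              // still 100
    }
}
```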

What’s Normal, Anyway?

Before you actually start the process of creating a database, you should go through a process known as normalization, in which you examine the kind of information that your application will track (such as employees and projects), and organize the structure of your tables in a way that is optimized for the database. The proj_empl table from Figure Two is a result of the normalization process.

For instance, when you started thinking about how to store your data, you might have considered assigning employees to a project through a list associated with each project (represented in a table with multiple columns, such as employee_id1, employee_id2, etc.). After normalization, you’d discover that you should represent this information by using a cross-reference table (proj_empl). This also allows you to write bi-directional queries (you can find all the employees assigned to a project and all projects assigned to a particular employee) and removes what would otherwise be an arbitrary limit (i.e., the number of columns that were named employee_idn).

A database that’s been normalized will generally scale well as the number of users and the amount of data increase. However, there can come a time when you’ll need to look at how the database server is processing your queries, and tune them to achieve maximum efficiency. This can sometimes lead you to reorganize the logical structure of the database, which can have a ripple effect through an application as you modify the code that depends on the reorganized tables. For more on database normalization, see “MySQL Performance Tuning” in the June 2001 issue (available online at http://www.linux-mag.com/2001-06/mysql_01.html).

Storing Objects in Databases

However, one of the drawbacks of normalization is that you often end up with a database whose tables do not map well to the data structures you would use in an object-oriented application. Consider the tables shown in Figure Two and let’s assume we’re writing in Java. If you followed a typical object-oriented design, your application would likely have a class that represents an employee and another class that represents a project.

But you wouldn’t normally have a cross-reference class that corresponds to the proj_empl table. Instead, each instance of the Project class would have a collection of employees, such as Project.EmployeeCollection. Although the same employee may participate in different projects, this does not mean you must maintain multiple copies of each employee. Instead, the EmployeeCollection would contain references to each object instance (this is the default behavior of Java’s collection classes).
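A short sketch shows this reference sharing in action; the Employee and Project classes here are illustrative stand-ins, not part of any framework.

```java
import java.util.*;

public class SharedEmployee {
    static class Employee { String name; Employee(String n) { name = n; } }
    static class Project  { List<Employee> employees = new ArrayList<>(); }

    // Put one Employee in two projects and confirm both collections
    // hold a reference to the very same object, not a copy.
    public static boolean demo() {
        Employee joan = new Employee("Joan Peckham");
        Project nextBigThing = new Project();
        Project bigSecret = new Project();
        nextBigThing.employees.add(joan);
        bigSecret.employees.add(joan);
        joan.name = "Joan P.";   // one update is visible from both projects
        return nextBigThing.employees.get(0) == bigSecret.employees.get(0);
    }

    public static void main(String[] argv) {
        System.out.println("Same object in both projects: " + demo());
    }
}
```

This is exactly the structure a normalized relational schema cannot store directly, which is where the impedance mismatch discussed below comes from.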

Because of the mismatch between the design of your object model and the design of your database, you will spend a lot of time moving data into and out of objects when you store and fetch data from the database. This is sometimes called an impedance mismatch, a term borrowed from electrical engineering that describes a condition where a signal encounters an unexpected change in the medium through which it travels.

Fortunately, there are a variety of solutions. Most people agree it would be nice if you could tell an object to save itself via some magical function “SaveYourself()” and not have to worry about how it deals with the database. Object-oriented databases and XML databases address this issue in different manners. Object-oriented databases store objects directly in the database, whereas XML databases store objects by translating and storing them as XML.

Object-Oriented Databases

In the object-oriented database world, the Object Data Management Group (ODMG) has created the ODMG 3.0 standard for storing objects in a database (http://www.odmg.org/). This standard includes an Object Definition Language (ODL), an Object Query Language (OQL), and APIs for SmallTalk, Java, and C++. These APIs are different from ODBC, but many object-oriented databases support ODBC as a way to connect to them.

ODL is used to create class definitions for objects that will be stored in the database, whereas OQL replaces SQL in performing queries. OQL can retrieve collections of objects, individual objects, or object fields, and is just as powerful as SQL. Many object-oriented databases also support SQL for backward compatibility.

There are many commercial and open source object-oriented databases to choose from. Christopher Browne has catalogued many of these on his Web site at http://www.cbbrowne.com/info/oodbms.html. There are also a number of object-oriented database systems listed at Cetus Links (http://www.cetus-links.org/oo_db_systems_1.html). A particularly interesting product is Ozone (http://www.ozone-db.org/), an open source system written in Java that supports the ODMG 3.0 interface.

Because object-oriented databases typically follow the client-server model, they usually offer the same authentication features supported by relational database servers. They also support multiple simultaneous users and can do transactions.

Object-oriented databases also hide the underlying details of how objects are stored in the database. The storage format depends on the implementation of the database and is generally unimportant from a programmer’s point of view.

XML Databases

A different approach to storing objects is to first convert them into XML (Extensible Markup Language) and store the resulting XML document. Two ways to do this are with Open Source project Castor (http://castor.exolab.org/) or Sun’s JAXB (http://java.sun.com/xml/jaxb/).

Both of these can generate code to “marshal” and “unmarshal” a Java object to and from an XML document, but they both require an XML schema that defines the valid values of the object fields. Castor can do this with any XML schema, but JAXB only works with DTDs. The XML code in Listing Two shows one way that the data from our employee and project tables could be marshaled. For more on XML, see the July 2001 and October 2001 issues (available online at http://www.linux-mag.com/2001-07/xml_basics_01.html and http://www.linuxmagazine.com/2001-10/xmldom_01.html, respectively). You can also look at the February 2002 issue for more on XML schemas, including DTDs (online at http://www.linux-magazine.com/2002-02/xml_01.html).




Listing Two: XML representation of employee-project database


<?xml version="1.0"?>
<DataSet>
  <employee>
    <first_name>Brian</first_name>
    <last_name>Jepson</last_name>
    <empl_id>100</empl_id>
    <years>3</years>
  </employee>
  <employee>
    <first_name>Joan</first_name>
    <last_name>Peckham</last_name>
    <empl_id>200</empl_id>
    <years>3</years>
  </employee>
  <project>
    <prj_id>2000</prj_id>
    <prj_name>Big Secret Project</prj_name>
    <prj_empl_id>200</prj_empl_id>
  </project>
  <project>
    <prj_id>1000</prj_id>
    <prj_name>Our Next Big Thing</prj_name>
    <prj_empl_id>100</prj_empl_id>
    <prj_empl_id>200</prj_empl_id>
  </project>
</DataSet>
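Going the other direction, unmarshalling a document of this shape can be sketched with Java’s standard DOM parser. This is a hand-rolled illustration, not how Castor or JAXB actually work (they generate this plumbing for you); the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class UnmarshalSketch {
    // Pull every <first_name> out of a DataSet document.
    public static List<String> firstNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            NodeList nodes = doc.getElementsByTagName("first_name");
            for (int i = 0; i < nodes.getLength(); i++)
                names.add(nodes.item(i).getTextContent());
        } catch (Exception e) {
            e.printStackTrace();   // a real application would handle this
        }
        return names;
    }

    public static void main(String[] argv) {
        String xml = "<?xml version=\"1.0\"?><DataSet>"
            + "<employee><first_name>Brian</first_name></employee>"
            + "<employee><first_name>Joan</first_name></employee>"
            + "</DataSet>";
        System.out.println(firstNames(xml));   // [Brian, Joan]
    }
}
```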

Once Castor or JAXB has converted your objects into XML documents, those documents must be stored. Xindice (http://xml.apache.org/xindice/) is a database system that’s part of the Apache XML project that can manage XML documents like the one shown in Listing Two, as well as arbitrary XML documents that are not representations of an object.

Like relational database systems, Xindice and other XML databases hide the actual data files behind the facade of the server (you can check out a nice page on XML databases at http://www.rpbourret.com/xml/XMLDatabaseProds.htm). The database server manages the data, controls user authentication for multiple users, and has an API that lets clients connect to the database.

However, the various XML databases are not standardized. Xindice uses XML-RPC as its access protocol, XPath as its query language, and a language of its own called XUpdate to perform updates. This can place limitations on the kind of clients and servers you can use as well as the programming languages you can work with.

This is where Castor has a particular advantage, because it can work with both relational databases and XML databases. Castor can define classes that know how to store and retrieve themselves from a relational database. This can free you from the previous restrictions.

XML databases are also behind the curve when it comes to transactions. There’s no standardization, but one database, Tamino (http://www.softwareag.com/tamino/) does provide transaction functionality; Xindice has transactions on its list of things to do.

Because XML is based on text, and because it uses human-readable tags to provide structure, it’s not ideally suited for large amounts of data. For example, the integer value “255” is represented in XML as three Unicode characters, “2”, “5”, and “5” (six bytes). In a database that doesn’t need to store data in a human-readable format, this can be stored as a single byte (0xFF).

And although working with XML thus incurs a certain amount of space overhead, it has proven to be highly expressive, and has the backing of industry, academia, and the World Wide Web Consortium. If you’re developing an application to interoperate with others, XML should almost certainly figure in your plan somewhere.

Interoperability is the Future

So, how should you decide which type of database is right for you? The most important factors to consider are ease of programming and interoperability with other systems. For ease of programming, an object-oriented database will allow you to use your object model as-is (straight from your UML model to the layout of the database). For interoperability, XML offers the most promise, even if the technology isn’t completely “ready for prime time.”

XML data bindings such as Castor provide a glimpse of the future: objects that can cross programming language boundaries. This would make it easier for a system written in one language (such as Java) to communicate with a system written in another (such as Perl). If both systems use the same XML schema, you could move data seamlessly between systems.

On the other hand, relational databases bring a lot to the table because of the nearly universal support for APIs such as ODBC. There is no doubt that relational databases will continue to be popular. Most off-the-shelf Web log, message board, or groupware packages use relational databases to store their data. This is not because relational databases are the purest or most efficient way to represent data, but because they’re ubiquitous and represent an acceptable tradeoff between performance and the complexity of the programming model.

Table One provides some generalized comparisons of how the different classes of databases deal with the various factors discussed here. Of course, the characteristics of your dataset, your programming skills, and the database you choose will affect all of these issues.




Table One: Capabilities of different databases

KEY:


Multi-User: How well the database performs as the number of users increases.


Interop: How well the database supports access from different programming languages.


Large Tables: How well the database performs with very large amounts of data.


Transactions: Can the database handle transactions?


Queries: Can the database perform complex searches?


OO Integration: How well the database integrates with Object-Oriented programming languages.

SCORING:

*: Acceptable
**: Good
***: Excellent







DATABASE    MULTI-USER  INTEROP  LARGE TABLES  TRANSACTIONS  QUERIES  OO INTEGRATION
Key/Value   *           **       ***           *             *        ***[a]
Relational  ***         ***      ***           ***           ***      *
OO          ***         **       **            ***           ***      ***
XML         ***         ***[b]   *             n/a[c]        ***      ***

[a] Key/value databases get a high score because they map directly to collection classes, such as Java’s Hashtable and Perl’s associative arrays.

[b] XML databases get a high score because there is widespread support for XML in many languages, even though they may only expose an API in a few programming languages.

[c] Implementation-dependent: check with the XML database developer.




Resources


PostgreSQL: http://www.postgresql.org


MySQL: http://www.mysql.com


Ozone Object-Oriented Database: http://www.ozone-db.org/


Xindice XML Database: http://www.dbxml.org


Tamino XML Database: http://www.softwareag.com/tamino/


Castor: http://castor.exolab.org/


JAXB: http://java.sun.com/xml/jaxb/


Christopher Browne’s Directory of Object-Oriented Databases: http://www.cbbrowne.com/info/oodbms.html


Slashdot Thread on Object-Oriented vs. Relational Databases: http://slashdot.org/article.pl?sid=01/05/03/1434242&mode=thread


Cetus Links of Object-Oriented Database Management Systems: http://www.cetus-links.org/oo_db_systems_1.html


Ronald Bourret’s Listing of XML Database Products: http://www.rpbourret.com/xml/XMLDatabaseProds.htm


XML Data Binding with Castor: http://www.onjava.com/pub/a/onjava/2001/10/24/xmldatabind.html


Polyhedra Main-Memory Database: http://www.polyhedra.com/


TimesTen Main-Memory Database: http://www.timesten.com/



Brian Jepson is a writer and editor with O’Reilly & Associates, Inc. He can be reached at bjepson@jepstone.net.
