Introducing MongoDB

Relational databases are not necessarily the best choice for modern Web applications. A new generation of alternatives is emerging. MongoDB is one of the best emergent solutions. And it's open source.

Web applications and traditional relational databases are nearing an end to their tumultuous relationship. For over a decade now, most Web applications have been built on top of relational databases, with various layers of indirection to simplify coding and boost the productivity of developers. For every Web programming language, there are any number of object-relational mapping (ORM) choices, each with pros and cons, yet none good enough that a developer can forget about SQL or ignore protecting the database. Moreover, as Web applications grow more complicated and sites need to be created faster, adapt instantly, and scale massively, these old solutions are no longer satisfying the demands of the Web.

There are a number of different projects working on new database technologies, all of which forgo the stalwart relational model. Relational databases are difficult to scale, largely because distributed joins are difficult to perform efficiently. Further, mapping from the many popular dynamically-typed languages to SQL is complicated, inefficient, and time-consuming. Although this trend is often called the “NoSQL” movement, the need for new technologies stems from the relational model rather than from SQL itself.

Beyond the relational model, there are a number of data model choices: key-value stores, tabular databases, graph databases, and document databases. Key-value stores are simple and easy to scale. If all you need is get and put access, a key-value store works great. However, most applications need more features, including secondary indexes, dynamic queries and sorting. Tabular or column databases can also scale, but don’t offer any improvement when mapping to programming languages. Document databases can be scaled and do in fact map well to programming languages.

MongoDB (from “humongous”) is a document database designed to be easy to work with, fast, and very scalable. It was also designed to be ideal for website infrastructure. It is perfect for user profiles, sessions, product information, and all forms of Web content (blogs, wikis, comments, messages, and more). It’s not great for transactions or perfect durability, as required by something like a banking system. Good fits for MongoDB include applications with complex objects or real-time reporting requirements, and agile projects where the underlying database schema changes often. MongoDB does not suit software with complex (multiple object) transactions.

Inside MongoDB

MongoDB stores BSON, essentially a JSON document in an efficient binary representation with more data types. BSON documents readily persist many data structures, including maps, structs, associative arrays, and objects in any dynamic language. Using MongoDB and BSON, you can just store your class data directly in the database, removing a whole slew of problems. MongoDB is also schema-free. You can add fields whenever you need to without performing an expensive change on your database. Adding fields is also quick, which is ideal for agile environments. You need not revise schemas from staging to production or have scripts ready to roll changes back.
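
As a rough illustration of what schema-free storage means for application code, the sketch below keeps class data as plain documents and adds a field to new documents without touching old ones. The UserProfile class, its field names, and the in-memory array standing in for a collection are all assumptions made for this example; no MongoDB server or driver is involved.

```javascript
// A hypothetical class from application code.
class UserProfile {
  constructor(name, email) {
    this.name = name;
    this.email = email;
    this.created = new Date(0); // BSON has a native date type; plain JSON does not
    this.tags = [];
  }
}

// A stand-in "collection": just an array of documents.
const profiles = [];

// Store the object's own fields directly as a document.
profiles.push({ ...new UserProfile("Bob", "bob@example.com") });

// Later the application grows a new field. Older documents are left alone,
// and no schema change is needed before storing the new shape.
profiles.push({ ...new UserProfile("Joe", "joe@example.com"), twitter: "@joe" });

// Documents in the same collection need not share the same fields.
const withTwitter = profiles.filter((p) => "twitter" in p);
console.log(withTwitter.map((p) => p.name));
```

The point of the sketch is only the shape of the data: the second document carries a field the first one lacks, and both live side by side.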

MongoDB uses a custom network protocol for client/server chatter. While this approach is far faster than REST, it demands a native driver for each programming language. Currently, production drivers exist for Python, Ruby, PHP, C++, Java, and Perl, and drivers for Erlang, Factor, and C# are in the works.

MongoDB has an auto-sharding implementation in alpha. Just specify a shard key for a collection and scale horizontally as much as you need to.

MongoDB is an interesting combination of modern Web usage semantics and proven database techniques. In some ways MongoDB is closer to MySQL than to other so-called “NoSQL” databases: It has a query optimizer, ad-hoc queries, and a custom network layer. It also lets you organize documents into collections, similar to SQL tables, for speed, efficiency, and organization.

To get great performance and horizontal scalability, however, MongoDB gives something up: transactions. MongoDB does not support transactions that span multiple collections. You can do atomic operations on a single object, but you can’t modify objects from two collections atomically.

Internally, MongoDB uses memory mapped files to store data on disk. This keeps the main database code clean and simple, but causes some problems on 32-bit platforms. If you run MongoDB on a 32-bit system, you can only store about 2 GB of data. 64-bit systems are effectively unbounded.

Versions of MongoDB exist for Linux, Mac OS X, Windows, Solaris, and FreeBSD. See http://www.mongodb.org/display/DOCS/Downloads for all platform downloads.

Running MongoDB

Let's download and run MongoDB, and march through some simple examples. To install MongoDB, connect to its project host and download the latest binaries for your system. Unpack the tarball, create a directory for the databases, and launch the daemon and the shell.

$ curl -O http://downloads.mongodb.org/linux/mongodb-linux-x86_64-1.0.0.tgz
$ tar zxvf mongodb-linux-x86_64-1.0.0.tgz
$ cd mongodb-linux-x86_64-1.0.0
$ mkdir -p /data/db
$ ./bin/mongod &
$ ./bin/mongo

The last command, ./bin/mongo, starts the MongoDB shell. It’s based on JavaScript, so you can use any JavaScript that you want.

// let's insert some data
> db.foo.save( { "name" : "Bob" } )
> db.foo.save( { "name" : "Joe" } )

// now let's retrieve it
> db.foo.find()
{"_id" :  ObjectId( "4a6e88194fb6b7661627ad47")  , "name" : "Bob"}
{"_id" :  ObjectId( "4a6e881d4fb6b7661627ad48")  , "name" : "Joe"}

// you can get a list of all the collections
> show collections
foo
> db.foo.save( { "name" : "Lisa" } )

// this is a dynamic query matching all documents where "name" is "Joe"
> db.foo.find( { "name" : "Joe" } )
{"_id" :  ObjectId( "4a6e881d4fb6b7661627ad48")  , "name" : "Joe"}

// you can use .explain() to see how a query will be executed.
// In this case, since there are no indexes, a table scan is necessary
> db.foo.find( { "name" : "Joe" } ).explain()
{"cursor" : "BasicCursor" , "startKey" : {} , "endKey" : {} ,
"nscanned" : 3 , "n" : 1 , "millis" : 0 ,
"oldPlan" : {"cursor" : "BasicCursor" , "startKey" : {} , "endKey" : {}} ,
"allPlans" : [{"cursor" : "BasicCursor" , "startKey" : {} , "endKey" : {}}]}

// let's create an index on name
> db.foo.ensureIndex( { "name" : 1 } )

// now when we do the explain, we'll see that we only looked at 1 object
> db.foo.find( { "name" : "Joe" } ).explain()
{"cursor" : "BtreeCursor name_1" , "startKey" : {"name" : "Joe"} , "endKey" : {"name" : "Joe"} , "nscanned" : 1 , "n" : 1 , "millis" : 0 , "allPlans" : [{"cursor" : "BtreeCursor name_1" , "startKey" : {"name" : "Joe"} , "endKey" : {"name" : "Joe"}}]}

// let's delete something
> db.foo.remove( { "name" : "Joe" } )

For more developer documentation, see http://www.mongodb.org/display/DOCS/Developer+Zone.

Thinking in Documents

Designing a schema for a document database is very different from designing one for a relational database. In general, the schema closely resembles an object you might have in your code, rather than a mapping of that object to a table.

A big question that comes up is when to embed a document in a parent document and when to link to it and have it stored in a separate collection. For starters, embedding is great when you have a one-to-many relationship.
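
To make the two options concrete, here is a minimal sketch that models a blog post with embedded comments and, alternatively, with comments kept in a separate collection that point back to the post. The field names (title, comments, post_id) and the plain arrays standing in for collections are assumptions for illustration, not a prescribed MongoDB schema.

```javascript
// Option 1: embed. The "many" side lives inside the parent document,
// so a single read returns the post together with its comments.
const embeddedPost = {
  _id: 1,
  title: "Introducing MongoDB",
  comments: [
    { author: "Bob", text: "Nice overview." },
    { author: "Joe", text: "What about joins?" },
  ],
};

// Option 2: link. Comments are stored in their own collection (modeled here
// as a plain array) and refer to the parent document by its _id.
const posts = [{ _id: 1, title: "Introducing MongoDB" }];
const comments = [
  { _id: 10, post_id: 1, author: "Bob", text: "Nice overview." },
  { _id: 11, post_id: 1, author: "Joe", text: "What about joins?" },
  { _id: 12, post_id: 2, author: "Lisa", text: "A comment on another post." },
];

// With linking, the application performs the second lookup itself.
function commentsForPost(postId) {
  return comments.filter((c) => c.post_id === postId);
}

console.log(embeddedPost.comments.length, commentsForPost(1).length);
```

Embedding trades flexibility for locality: the comments cannot be queried independently of their post as easily, but fetching a post needs no second lookup.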

In MongoDB, as in a traditional RDBMS, you must choose how many indexes to have and which fields to index. MongoDB has a novel query optimizer that protects against worst-case performance. Instead of relying on statistics, which can change over time, it occasionally tries different plans in parallel, stops as soon as one of them finishes, and remembers the best index to use.
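
The idea of racing plans can be sketched in a few lines. This is a toy model under stated assumptions, not MongoDB's actual optimizer code: each candidate plan is a generator that does one unit of work per step, the plans are advanced in lockstep, and the first to finish is recorded as the winner. The function names, the planCache map, and the sample data are all hypothetical.

```javascript
// Toy model only: each candidate plan yields once per document it is about
// to examine and returns its matches when finished.
function* collectionScan(docs, name) {
  const matches = [];
  for (const doc of docs) {
    yield;
    if (doc.name === name) matches.push(doc);
  }
  return matches;
}

function* indexLookup(nameIndex, name) {
  const matches = [];
  for (const doc of nameIndex.get(name) || []) {
    yield;
    matches.push(doc);
  }
  return matches;
}

// Advance every plan one step per round; the first plan to finish wins.
function racePlans(plans) {
  const running = Object.entries(plans);
  if (running.length === 0) throw new Error("no candidate plans");
  for (;;) {
    for (const [label, plan] of running) {
      const step = plan.next();
      if (step.done) return { winner: label, results: step.value };
    }
  }
}

const docs = [{ name: "Bob" }, { name: "Joe" }, { name: "Lisa" }];
const nameIndex = new Map(docs.map((d) => [d.name, [d]]));

// Remember the winner per query shape so later queries can skip the race.
const planCache = new Map();
const race = racePlans({
  scan: collectionScan(docs, "Joe"),
  index: indexLookup(nameIndex, "Joe"),
});
planCache.set("find by name", race.winner);
console.log(race.winner, race.results);
```

Because the race stops when the cheapest plan completes, a bad plan costs at most about as much work as the good one, which is the worst-case protection the paragraph above describes.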

Compound indexes are also similar to those in a relational database, and have similar properties. For example, if you have two fields you often query together, say last and first name, you may want an index on { last : 1 , first : 1 }. Because the last name comes first in the index specification, you can also query quickly on the last name alone. Queries on first name only would not be optimized via an index unless you specifically created an index on that field. This mirrors traditional relational systems.
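
Why the leading field matters can be seen in a small model. The sketch below treats an index on { last : 1 , first : 1 } as an array sorted by last name and then first name; that representation, the sample names, and the helper functions are assumptions for illustration (a real index is a B-tree, but the ordering argument is the same).

```javascript
function cmp(x, y) {
  return x < y ? -1 : x > y ? 1 : 0;
}

const people = [
  { last: "Smith", first: "Ann" },
  { last: "Jones", first: "Bob" },
  { last: "Smith", first: "Bob" },
  { last: "Adams", first: "Joe" },
  { last: "Jones", first: "Ann" },
];

// Model of an index on { last : 1 , first : 1 }: sorted by last, then first.
const index = [...people].sort(
  (a, b) => cmp(a.last, b.last) || cmp(a.first, b.first)
);

// Query on the leading field: binary search to the first match, then read
// adjacent entries until the last name changes.
function findByLast(last) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].last < last) lo = mid + 1;
    else hi = mid;
  }
  const matches = [];
  for (let i = lo; i < index.length && index[i].last === last; i++) {
    matches.push(index[i]);
  }
  return matches;
}

// Query on the trailing field alone: matching entries are scattered through
// the ordering, so every entry has to be checked.
function findByFirst(first) {
  return index.filter((entry) => entry.first === first);
}

console.log(findByLast("Jones"), findByFirst("Bob"));
```

All entries for one last name sit next to each other in the sorted order, while entries sharing a first name do not, which is why only the first kind of query benefits from this index.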

Replication

MongoDB replication should feel very familiar, but is a lot simpler to set up than in traditional database software. If you want to start a server as a master, run mongod --master. This causes the server to keep a transaction log. To start a slave, run mongod --slave --source 192.168.0.2. The slave syncs all of the data and then reads the transaction log. You can have as many slaves reading from a master as you want.

One caveat: if a slave gets too far out of sync, it has to re-clone the data. How long a slave can be down before getting out of sync depends on the transaction log size. By default, the transaction log is the larger of 1 GB and 5 percent of free disk space. This usually stores many hours of operations, and potentially even more depending on your use case. You can also customize the size with --oplogSize.
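
The sizing rule is simple arithmetic, sketched here for two disk sizes. The write rate used to estimate how many hours of operations fit in the log is a made-up figure for illustration; real workloads vary widely.

```javascript
const GB = 1024 ** 3;

// Default log size as described above: the larger of 1 GB and
// 5 percent (one twentieth) of free disk space.
function defaultLogBytes(freeDiskBytes) {
  return Math.max(1 * GB, freeDiskBytes / 20);
}

// Rough window a slave can be offline before it must re-clone,
// given an average rate at which the log fills.
function hoursOfHistory(logBytes, bytesPerSecond) {
  return logBytes / bytesPerSecond / 3600;
}

const smallDisk = defaultLogBytes(10 * GB);  // 5% is 0.5 GB, so the 1 GB floor applies
const largeDisk = defaultLogBytes(100 * GB); // 5% is 5 GB

// Hypothetical write rate of 100 KB per second.
const hours = hoursOfHistory(largeDisk, 100 * 1024);
console.log(smallDisk / GB, largeDisk / GB, hours.toFixed(1));
```

At the assumed rate, a 5 GB log holds a little over 14 hours of operations, which matches the "many hours" described above.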

One of the frustrating management tasks with traditional replication is handling failover in client code and then getting things back in sync. To make this easier, MongoDB offers a feature called replica pairs. This is basically a pair of servers in which, at any given time, one system is the master and the other is the slave. The pair negotiates who is what at startup, and then guarantees that only one is master at any given time. A forthcoming feature allows a pair to have multiple slaves.

Sharding

MongoDB version 1.0 has an early implementation of sharding. If you decide to shard a collection, all you have to do is specify a shard key. Sharding is order-preserving, so records with close shard keys will likely be on the same shard. For example, an e-commerce application might choose to shard based on user ID. Queries and sorts within a shard are very fast.

Queries and sorts across shards also work. You can sort by a non-shard key, and the system will do a merge sort of the results. This works well for two to twenty nodes or so; a thousand nodes might slow down too much.
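
The merge step can be sketched as follows, assuming each shard returns its own results already sorted by the requested field. The function name, the sample documents, and the choice of age as the non-shard sort key are hypothetical; this illustrates the technique rather than MongoDB's internal code.

```javascript
// Merge several individually sorted result lists into one sorted list by
// repeatedly taking the smallest head element among the shards.
function mergeSorted(shardResults, key) {
  const positions = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < shardResults.length; s++) {
      if (positions[s] >= shardResults[s].length) continue; // shard exhausted
      if (
        best === -1 ||
        shardResults[s][positions[s]][key] <
          shardResults[best][positions[best]][key]
      ) {
        best = s;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][positions[best]]);
    positions[best] += 1;
  }
}

// Two shards split by user ID; each has sorted its own results by age.
const shardA = [{ user: 1, age: 25 }, { user: 3, age: 41 }];
const shardB = [{ user: 12, age: 19 }, { user: 15, age: 33 }, { user: 11, age: 50 }];

const sortedByAge = mergeSorted([shardA, shardB], "age");
console.log(sortedByAge.map((d) => d.age));
```

Each output document requires looking at the head of every shard's list, so the cost of this simple merge grows with the number of shards, which is consistent with the warning about very large clusters.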

Sharding consists of multiple regular database instances (mongod) and any number of sharding processes (mongos). mongos is basically a database router. Every request goes through mongos; it decides how many and which mongods should receive the query. mongos collates the results and sends them back to the client. You can have one mongos for the whole system no matter how many mongods you have, or you can have one local mongos for every client if you want to minimize network latency.
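
A router's core decision can be modeled with a small range table. In this sketch the shard names, the user ID ranges, and the query format are all invented for illustration; it shows only how order-preserving ranges let a router narrow a query to the shards that can hold matching records.

```javascript
// Hypothetical range table: each shard owns a contiguous, half-open
// range [min, max) of the shard key (user ID in this example).
const ranges = [
  { shard: "mongod-a", min: 0, max: 1000 },
  { shard: "mongod-b", min: 1000, max: 2000 },
  { shard: "mongod-c", min: 2000, max: Infinity },
];

// Decide which shards must receive a query. query.userId may be a number
// (exact match), a { from, to } range with `to` exclusive, or absent.
function targetShards(query) {
  const key = query.userId;
  if (key === undefined) {
    // No shard key in the query: every shard has to be asked.
    return ranges.map((r) => r.shard);
  }
  if (typeof key === "number") {
    // Exact match: exactly one range can contain the key.
    return ranges.filter((r) => r.min <= key && key < r.max).map((r) => r.shard);
  }
  // Range query: every shard whose range overlaps [from, to).
  return ranges
    .filter((r) => key.from < r.max && key.to > r.min)
    .map((r) => r.shard);
}

console.log(targetShards({ userId: 42 }));
console.log(targetShards({ userId: { from: 900, to: 1100 } }));
console.log(targetShards({ name: "Joe" }));
```

Queries that include the shard key touch one or a few shards, while queries that omit it fan out to all of them and rely on the router to collate the results.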

Go Mongo!

MongoDB offers many advantages over more traditional databases. It is easy to install, simple to manage, and fast. It efficiently stores binary files and complex data models. It is also resilient: because it’s schema-free, changes can be made instantaneously and one record need not be identical to another. MongoDB is being used in production now and is ready for you to explore.
