Harness the power of distributed computing using everyday Unix command-line tools and a clever little bash script.
In late 2004, Google surprised the world of computing with the release of the paper MapReduce: Simplified Data Processing on Large Clusters. That paper ushered in a new model for data processing across clusters of machines that had the benefit of being simple to understand and incredibly flexible. Once you adopt a MapReduce way of thinking, dozens of previously difficult or long-running tasks suddenly start to seem approachable–if you have sufficient hardware.
If you’ve managed to somehow miss most of the MapReduce revolution, Wikipedia describes it pretty well:
MapReduce is a framework for computing certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster. Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
“Map” step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
The worker node processes that smaller problem, and passes the answer back to its master node.
“Reduce” step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output – the answer to the problem it was originally trying to solve.
In fact, the MapReduce model has proven so useful that the Apache Hadoop project (an Open Source implementation of the infrastructure described in the Google paper) has become very popular in the last few years. Yahoo, which employs numerous Hadoop committers, recently hosted their annual Hadoop Summit which attracted over 500 users and developers.
Amazon.com offers Hadoop as the Elastic MapReduce service on their EC2 platform. And startup Cloudera (founded by former Yahoo, Facebook, and Google employees) has begun to offer training, support, and consulting services around Hadoop and their own packaged Hadoop distribution. (Yahoo has since published their internal version on Github too.)
Until recently, there hasn’t been an easy way, aside from possibly Amazon’s offerings, to try out MapReduce–especially on your own hardware.
bash for the win!
Earlier this year, Erik Frey of last.fm, announced bashreduce on the last.fm blog.
More than just a toy project, bashreduce lets us address a common scenario around these parts: we have a few analysis machines lying around, and we have data from various systems that are not in Hadoop. Rather than go through the rigmarole of sending it to our Hadoop cluster and writing yet another one-off Java or Dumbo program, we instead fire off a one-liner bashreduce using tools we already know in our reducer: sort, awk, grep, join, and so on.
It sounds almost comical but this makes a lot of sense, really. Like most of the Unix shell tools, bash is nearly everywhere. So why not build up enough of a bash script to facilitate basic MapReduce style processing for periodic or one-off jobs? It’s really quite handy.
bashreduce is new enough that it’s not packaged up for popular distributions yet, but you can pull a copy from github easily enough:
$ git clone git://github.com/erikfrey/bashreduce.git
Initialized empty Git repository in /home/jzawodn/cvs/bashreduce/.git/
remote: Counting objects: 136, done.
remote: Compressing objects: 100% (136/136), done.
Receiving objects: 100% (136/136), 21.17 KiB, done.
Resolving deltas: 100% (49/49), done.
remote: Total 136 (delta 49), reused 0 (delta 0)
And let’s build the optional performance boosting utilities it comes with:
$ cd bashreduce/brutils
cc -O3 -Wall -c -o brp.o brp.c
cc -o brp brp.o
cc -O3 -Wall -c -o brm.o brm.c
cc -o brm brm.o
$ sudo make install
install -c brp /usr/local/bin
install -c brm /usr/local/bin
Let’s give it a run to see the command-line help:
$ ./br -h
Usage: br: [-m host1 [host2...]] [-c column] [-r reduce] [-i input] [-o output]
bashreduce. Map an input file to many hosts, sort/reduce, merge
-m: hosts to use, can repeat hosts for multiple cores
default hosts from /etc/br.hosts
-c: column to partition, default = 1 (1-based)
-r: reduce function, default = identity
-i: input file, default = stdin
-o: output file, default = stdout
-t: tmp dir to use, default = /tmp
-S: memory to use for sort, default = 256M
-h: this help message
As you can see,
br needs a few arguments and possibly a config file setup before it’s useful. First, you need to specify the list of hosts (nodes) which distribute data to and run on. You can either list them as a quoted
-m argument, like
"host1 host2 host3" or list them in a
/etc/br.hosts file, one host per line. If you have multi-core hosts, you can list them more than once to take advantage of additional CPU cores.
You’ll need password-less ssh access to all the hosts, but you’re already running keychain anyway, right? Good.
Let’s start with a useless but instructive example:
$ br -m "localhost localhost" < /etc/passwd > /tmp/sorted
That will take your
/etc/passwd file, chop it into two pieces, sort them, and them merge and sort the results. Nobody needs a sorted
/etc/passwd file but if you had a much larger file in need of sorting or, preferably, a more CPU-intensive bit of processing, this approach would make some sense. The point is that you’ve just distributed this work among both CPU cores on your machine without having to do a lot of extra work.
Suppose you have a multi-field whitespace delimited log file and wanted to extract a single column, count up the occurrences of each value in that column, and see the results.
$ cut -d' ' -f6 /var/log/messages | ./br -m "host1 host2 host3 host4" -r "uniq -c"
The choice of
/var/log/messages here is primarily motivated by the fact that you’re likely to have it on your system. A multi-gigabyte Apache log or application server log would lend itself to this type of processing.
These examples are fairly trivial but serve to show you how to get started. The real power comes when you’re using your own code instead of a
Since the release of bashreduce, developer Richard Crowley has extended bashreduce, adding several useful features:
- the ability to pass a filename (rather than the actual file data via an
nc pipe) to each process. This assumes that each machine has a local copy of the data or access to a shared filesystem. This will greatly reduce the network bandwidth required.
- supports processing a directory full of files rather than a single file. Any of the files may be compressed using gzip and bashreduce will detect that and transparently handle decompression.
-M option allows you to specify your own merge program instead of the default (
You can find his copy on github too. He’s actively using it for some data analysis tasks at OpenDNS.
Whether you use Erik’s original bashreduce or Richard’s fork, you end up with the ability to extend the basic Unix philosophy of standard tools speaking to each other on stdin/stdout to many hosts all doing work in parallel. Not bad for a little bash scripting, huh?