What can we learn from the ants and bees? Perhaps we can take something from the ants that may be helpful for clustering. If nodes were cheap and plentiful, then who cares? It would be kind of like stepping on ants: there always seem to be more.
It seems that after last week’s rant, poll participation on Today’s HPC Cluster has increased. Great job, everyone! If you haven’t applied your valuable expertise to these polls, please do so as soon as possible. I plan to discuss the results next week. How was that for positive reinforcement? Let’s move on to this week’s topic: boxes and bugs.
Let’s talk about the boxes first. There seems to be an interesting trend developing in the PC market. Wal-Mart recently started selling Everex Linux computers for $200. There’s also news that Shuttle will be making a small “sub-$200” box.
Of course, these systems are a little thin on resources, but there seems to be a $200 price barrier that the industry is trying to crack. My interest is not in touting the “Linux makes this possible” story, but in pointing out that such systems are selling quickly. If the trend continues, a large number of these systems will be produced, and any time a large number of “computing anything” gets produced, my “Can you build a cluster with these?” antennae go up.
Of course, the $200 systems are mere toys compared with a modern-day multi-core cluster node, but bear with me for a moment. The $200 price point also means that there will be up-selling to better systems. So instead of $200, suppose that for around $400 you could get such a system with 2GB of RAM, a reasonably fast dual-core processor, and a good Gigabit networking card. The smallish hard disk drive can be considered optional. At this point you have a respectable, if not super-fast, cluster node.
A perfectly logical question to ask is why build a cluster out of throw-away hardware? Well, why indeed? Which brings us to the second part of this discussion — bugs.
I’ve always been fascinated by colonizing insects. You know, those deforesting ants or swarms of killer bees. After you get over the “Wow, that is a lot of bugs!” reaction, you may wonder how the heck they work together so well. Parallel processing at its finest, I figure. We assume the ants are not calculating next week’s weather, but they are solving some difficult logistics problems using a large number of individual worker units.
What can we learn from the ants and bees? Well, first, it seems everyone has their own job and a set of rules to follow, dare I say a program to follow. Years ago, when I played Sim-Ant on the computer, I recall there were three types of ants: diggers, foragers, and soldiers. Each did its own thing, and the survival of the colony depended upon the right balance. The other thing I noticed was that no individual ant was essential to the survival of all the ants. In other words, ants were disposable, or rather there was redundancy in their tasks.
There is also the queen ant, which is responsible for producing more ants and could loosely be considered a central point of control, or perhaps a single but highly effective ant foundry. The queen does not, however, direct the 20,000,000 ants reported to be in some colonies. There is no redundant queen, but there are redundant colonies.
Perhaps we can take something from the ants that may be helpful for clustering. I have long thought about the idea of disposable nodes. Well, not “throw them in the trash” disposable, but redundant disposable. That is, if a node fails, it doesn’t matter. What if codes were written with a dynamic redundancy so they could tolerate one or even several nodes dropping out of sight? Or what if low-cost nodes were “compute mirrored”, kind of like a RAID 1 for cluster nodes? If nodes were cheap and plentiful, then who cares? It would be kind of like stepping on ants: there always seem to be more. There are other possible scenarios as well. When nodes are plentiful and cheap, rather than limited and expensive, the rules of the game may change.
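To make the “compute mirrored” idea a little more concrete, here is a minimal Python sketch. It is a toy simulation, not a real cluster scheduler: the node names, the `dead` set, and the `run_mirrored` helper are all hypothetical and exist only for illustration. Each piece of work is handed to a pair of mirrored nodes, and the answer survives as long as at least one node in the pair is still alive.

```python
def square(x):
    """Stand-in for a real compute kernel."""
    return x * x


def run_mirrored(task, arg, mirrors, dead_nodes):
    """Run the same task on every node in a mirror group.

    Nodes listed in dead_nodes contribute nothing (their copy of the
    work is lost). Any surviving copy is good enough, so the first
    result is returned. Hypothetical helper for illustration only.
    """
    results = []
    for node in mirrors:
        if node in dead_nodes:
            continue  # this mirror has dropped out of sight
        results.append(task(arg))
    if not results:
        raise RuntimeError(f"all mirrors failed: {mirrors}")
    return results[0]


# Six cheap nodes arranged as three mirrored pairs (assumed layout).
pairs = [("n0", "n1"), ("n2", "n3"), ("n4", "n5")]
dead = {"n1", "n4"}  # two nodes have failed, one in each of two pairs
work = [10, 20, 30]

totals = [run_mirrored(square, x, pair, dead) for x, pair in zip(work, pairs)]
print(totals)  # [100, 400, 900]
```

Two of the six nodes are gone, yet every result still comes back. The price, as with RAID 1, is that half of the hardware is doing duplicate work, which only looks attractive if nodes really are cheap and plentiful.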
One of the advantages of clustering is that the loss of a node does not bring down the entire cluster. With today’s ever-growing multi-core behemoth nodes (e.g., 8 cores with 8GB of RAM), a failed node now brings down a big chunk of computation. Compare that to eight of the cheap dual-core nodes I mentioned above (for a total of 16 cores and 16GB of RAM at about the same price) and you might be inclined to start thinking like an ant.
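The arithmetic behind that comparison can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: it takes the column’s figures at face value (one 8-core, 8GB node versus eight dual-core, 2GB nodes at roughly the same price), and the `NodeType` class and `after_one_failure` function are names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    cores: int
    ram_gb: int


def after_one_failure(node, count):
    """Return (cores_left, ram_left_gb, fraction_of_cores_lost)
    when one of `count` identical nodes fails."""
    total_cores = node.cores * count
    cores_left = node.cores * (count - 1)
    ram_left = node.ram_gb * (count - 1)
    return cores_left, ram_left, node.cores / total_cores


big = NodeType("8-core behemoth", cores=8, ram_gb=8)
small = NodeType("cheap dual-core", cores=2, ram_gb=2)

big_result = after_one_failure(big, count=1)      # (0, 0, 1.0)
small_result = after_one_failure(small, count=8)  # (14, 14, 0.125)

print(big_result)
print(small_result)
```

For the same spend, losing the single big node takes out everything that money bought, while losing one of the eight small nodes removes an eighth of the cores and still leaves 14 cores and 14GB of RAM running.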