Tying a LAN of computers together to do cooperative work isn't a new idea. Combining several thousand computers spread across the globe into a commodity service like water or electricity is. But is the "computing grid" ready for us to plug in? Here's a report.
In May of this year, IBM and a little-known video game developer named Butterfly.net launched the first ever computing grid for online video games. Butterfly.net’s service is an intriguing one: sell computer cycles as a commodity. Like a utility company, the Butterfly Grid promises to make processing power available to video game companies looking to outsource some of the back-end processing that their online games require. Besides being a novel idea, the Butterfly Grid is an interesting milestone for the grid community. For the first time, a commercial enterprise is betting its business on a grid — a technology that’s grown up and come of age in the scientific and research communities.
And nobody is more excited by the grid’s commercial prospects than IBM — the company whose data centers are physically hosting the Butterfly Grid. In the New York Times, IBM Vice President Scott Penberthy called the Butterfly Grid “the first ray of light” in the commercial grid market.
Now, after years of nurturing this offshoot of supercomputing, mainstream vendors like IBM, Sun, and HP are pronouncing grid as the future of networked computing. But what is the grid, really? Is it simply a few old computer science concepts dusted off and spiffed up with a new moniker, bigger bandwidth, and faster computers? Is the Internet a grid? What about a cluster of workstations? Why is that not a grid? Well, the answer to these questions really depends on what you mean by grid.
What is the Grid?
By its loosest definition, grid can be taken to mean “distributed resource management,” and there are many people creating and using grids for just that. There are data grids, rendering grids, bio grids, even sensor grids — and all of these common network architectures employ grid technology in one way or another. But, according to Associate Director of Math and Computer Science at Argonne National Labs Ian Foster, to be considered a capital “G” Grid, you need to be delivering non-trivial qualities of service, using standard protocols over a network that is not subject to centralized control. One of the chief architects of emerging grid computing standards and software, Foster has created a three-point grid checklist (see “What’s In a Grid?”) that serves as a useful way of defining what grid computing is and is not.
According to Ian Foster, Associate Director of Math and Computer Science at Argonne National Labs, a grid must be:
- Based on standard and open interfaces. (“Otherwise we’re dealing with an application-specific system,” he says.)
- Made up of decentralized resources that are not controlled by one single “control domain.” So, computers in a grid will have different security policies, usage policies, and management tools.
- Able to deliver non-trivial qualities of service.
By Foster’s definition, the recently announced National Science Foundation (NSF) TeraGrid, a virtual supercomputer linking clusters at four different research facilities across America, is a capital “G” Grid. Toronto’s Axyz Animation, a 25-person design shop that has built an image rendering farm out of five workstations linked together by the Sun Grid Engine, is doing distributed resource management, not full-fledged grid computing. As far as the grid gurus are concerned, you’ve got to get outside of the LAN to be on the grid.
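To make the checklist concrete, here is a minimal Python sketch that encodes the three criteria and applies them to the two examples above. The `Deployment` class, its field names, and the values assigned to each site are illustrative assumptions, not data from Foster or from either organization. In particular, the render farm is marked as meeting the interface and quality-of-service tests purely to show that a single control domain is enough to fail the definition.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a distributed computing setup."""
    name: str
    open_standard_interfaces: bool  # criterion 1: standard, open interfaces
    control_domains: int            # criterion 2: independent administrative domains
    nontrivial_qos: bool            # criterion 3: non-trivial qualities of service


def is_capital_g_grid(d: Deployment) -> bool:
    """Return True only if all three checklist items hold."""
    return (
        d.open_standard_interfaces
        and d.control_domains > 1
        and d.nontrivial_qos
    )


# Assumed values: four research facilities, each treated as its own domain.
teragrid = Deployment("TeraGrid", True, 4, True)
# Assumed values: five workstations on one LAN, so one control domain.
render_farm = Deployment("five-workstation render farm", True, 1, True)

for site in (teragrid, render_farm):
    print(site.name, "->", is_capital_g_grid(site))
```

Under these assumptions the render farm fails only the second test, which matches the point that a single LAN under one administrator is distributed resource management rather than a Grid.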
Cynics say that grid computing is just the technology industry’s way of rationalizing the over-muscled machines that have been sold onto corporate desktops over the last decade. But as the debate over network computing proved in the late 1990s, personal computers provided benefits that companies really did want — greater and more flexible access to computing power. What is sometimes forgotten, however, is that the pre-PC computing model was pretty darned efficient. Say what you want about those old timesharing mainframes, but as anyone who ever had to wait until 3 a.m. to run a program knows, mainframes were rarely idle. Workstations, alas, are almost always idle.
Table One: Development for building grids
I-Way to Heaven
The idea of siphoning off idle cycles and delivering them to others has been around at least since the mid-1980s. That’s when a team of computer scientists at the University of Wisconsin (eventually led by Professor Miron Livny) began work on the appropriately named Condor project.
Condor is still in active use today. Here’s how it works: like a bird of prey, Condor circles around a network of workstations, looking to swoop down and snatch up idle CPUs for the compute-hungry users at the University. Condor includes client software (which must be installed on all the workstations on the network), a job scheduler that negotiates for processing cycles on remote machines, a priority system called the Up-Down algorithm, and a checkpoint management system that lets Condor drop and resume calculations mid-cycle without having to restart. Wildly successful, Condor collects and disburses hundreds of CPU days of computer power each day.
Based on that yield, Condor has been adapted in a number of other academic environments, but it’s still a small-scale solution compared to some of the grid architectures currently being developed. Condor shares resources on a single homogeneous network — the real challenge is sharing resources across disparate networks. That work began with the I-Way.
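The checkpointing idea behind Condor’s ability to drop and resume work can be illustrated with a toy example. This is a sketch of application-level checkpointing only; it is not Condor’s code or its actual mechanism. The function name `run_job`, the JSON checkpoint file, and the `owner_is_back` callback are all hypothetical names invented for the illustration.

```python
import json
import os
import tempfile


def run_job(checkpoint_path, total_steps, owner_is_back):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    owner_is_back(step) is a hypothetical callback that returns True when
    the workstation's owner needs the machine again. When it does, the job
    gives up the CPU and returns None; the checkpoint file keeps its state.
    """
    state = {"next_step": 0, "partial_sum": 0}
    if os.path.exists(checkpoint_path):
        # Resume from the last saved state instead of starting over.
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["next_step"] < total_steps:
        if owner_is_back(state["next_step"]):
            return None  # evicted; progress so far is already on disk
        state["partial_sum"] += state["next_step"]
        state["next_step"] += 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)

    return state["partial_sum"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_job(path, 10, lambda step: step == 4)  # evicted before step 4
second = run_job(path, 10, lambda step: False)     # picks up at step 4
print(first, second)
```

The second call does not repeat steps 0 through 3; it reads the saved state and finishes the remaining work, which is the property that makes scavenging idle workstations practical.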
The I-Way was the first project to implement the software behind today’s capital “G” Grid. Begun in 1995 as a joint effort between Argonne National Labs, the University of Illinois at Chicago, and the National Center for Supercomputing Applications (NCSA), the I-Way was a high-speed ATM (asynchronous transfer mode) network that connected 17 different points on ten disparate networks into one computing environment. One of the people signed up to produce the software infrastructure for this new network was Argonne’s Foster. A year later, he and coworker Steven Tuecke got funded to continue the software development they started on I-Way, and the Globus toolkit was born.
The Apache of Grid Computing
Though there is a dizzying array of technologies that make up the foundation of any grid computing environment, the Globus toolkit is usually the cornerstone. Unlike distributed resource management tools like Platform Computing’s LSF, GNU Queue, and the Condor project, Globus attempts to solve a lot of the sticky security, resource management, data management, communication, and cross-platform issues that come into play when different networks try to connect and share processing power. These are issues that need to be solved before we can ever think of computing power as a utility service, like gas or water.
Globus developers would like their toolkit to become, like Apache, a production-ready, high-quality reference implementation. Based on an emerging set of standards called the Open Grid Services Architecture (OGSA), the Globus toolkit aims to virtualize network resources so that applications can have a standardized way of discovering, scheduling, authenticating, and getting the authority to run on networks with different topologies, security policies, and owners across the globe. With Globus, for example, users only need to sign on one time to get access to machines on a variety of networks that may have completely different security policies.
Though Globus remains, essentially, a research project, and is not exactly “production-ready” today, the toolkit is being used in a growing number of test environments, including the TeraGrid, the Department of Energy’s Science Grid, the European DataGrid, and GriPhyN (the Grid Physics Network).
While most would agree that grid computing presents some intriguing possibilities, not everyone is as eager to jump on the grid bandwagon as Sun and IBM. According to one developer at a major scientific research facility who has looked at the Globus toolkit (and asked not to be identified), grid computing’s approach is far too generalized to deliver acceptable performance to a wide variety of applications on such a great range of systems. “It has to be done on a per-job basis,” he says. “You can use Globus to manage batch processing, but you’ve still got to write a considerable amount of software to glue it all together. There are not generalities that apply to how you connect these different processes.”
Another concern for the grid community is the threat of inflated expectations. With IBM running ads in The Wall Street Journal, and vendors like Microsoft, HP, and Sun all trumpeting grid as the Next Big Thing, developers are beginning to worry about a backlash as frustrated IT managers discover that grid is really just in its infancy. According to the developer quoted earlier, many of “the same people that were pushing Java as the solution to all the world’s ills are now pushing grid, so I’m really highly suspicious.”
That skepticism may not be entirely misplaced. After all, if grid computing is really just about freeing up unused computer cycles and saving money, why are so many hardware vendors encouraging it?
Because they see new corporate applications of the grid technology, says Vice President and Chief Analyst Nathan Palmer, of Delphi Group, a Boston research firm. Palmer concedes that while “hype is high enough that expectations can’t possibly be met,” there is still a “major market opportunity” for vendors looking to sell grid computing to the corporate world — particularly for any applications that need to do analytical computation.
So, what should you be thinking about actually doing with the Globus toolkit? “This year it is feasible to start building systems that can dynamically manage workloads across multiple systems,” says Argonne’s Foster. “That’s what people are doing with a combination of Globus and other technologies like Platform, NMI, and the Sun Grid Engine.” Things like dynamic outsourcing (the kind of service that Butterfly is providing to the online gaming industry) are years away, though. As with the Butterfly example, it seems likely that companies will begin to experiment with new kinds of grid-enabled services before we see generalized dynamic provisioning, or the kind of utility computing that companies like Sun, HP, and IBM would like to develop into a new line of business. Right now, “that [business] is more a marketing concept,” says Foster.
While distributed resource management is becoming more and more popular in smaller shops doing things like CAD, digital content, and bioinformatics, the domain of capital “G” Grid users is likely to be large corporations, according to Hewlett-Packard Research Department Manager John Sontag. “All of our Fortune 100 customers are saying that to be competitive they need to reduce their costs.” He says that the appeal of world-class supercomputer performance at commodity desktop prices will eventually be enticing enough to bring grid into the enterprise.
Take Sontag’s own company, for example. The post-merger HP has 120,000 servers in 400 data centers — almost one server per employee. Sontag says that HP would very much like to take the operational costs of maintaining all of those disparate machines off of its balance sheets by making its applications grid-aware, essentially becoming an application service provider (ASP) to itself.
But Sontag admits that there are a number of barriers preventing this from happening right away. OGSA security still needs work, he says, and HP’s applications have to be made grid-aware.
Of course, the most complex problem of all may be the social engineering required to get departmental managers to allow the all-consuming grid onto their desktops. Then there’s the anxiety that grid users may have over guaranteed qualities of service. “What we haven’t overcome yet is anxiety about not controlling the resources,” says Sontag.
Version 3.0 of the Globus toolkit will go some of the way toward solving the security problem. Available in beta now, and with final code expected by year’s end, it will incorporate the emerging WS-Security framework (a SOAP-based standard that specifies how to secure Web services) and will provide interfaces and protocols for things like service discovery and lifetime management (which guarantees that grid resources don’t accidentally persist indefinitely). It will also include a new version of the OGSA resource management protocol, which will support service level agreement negotiations for the first time.
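The lifetime management idea, that a resource disappears unless someone keeps asking for it, can be sketched as a simple lease table. This is only an illustration of the general soft-state technique and makes no claim about the actual OGSA or Globus interfaces; the `Registry` class and its `create`, `keep_alive`, and `sweep` methods are hypothetical names, and time is passed in explicitly so the example is deterministic.

```python
class Registry:
    """Toy lease table: each resource lives only until its lease expires."""

    def __init__(self):
        self.expires_at = {}  # resource name -> expiry time (seconds)

    def create(self, name, lifetime, now):
        """Register a resource whose lease ends `lifetime` seconds from `now`."""
        self.expires_at[name] = now + lifetime

    def keep_alive(self, name, lifetime, now):
        """Extend the lease of a resource that is still registered."""
        if name in self.expires_at:
            self.expires_at[name] = now + lifetime

    def sweep(self, now):
        """Remove every resource whose lease has run out; return their names."""
        expired = sorted(n for n, t in self.expires_at.items() if t <= now)
        for name in expired:
            del self.expires_at[name]
        return expired


registry = Registry()
registry.create("render-job", lifetime=10, now=0)
registry.create("forgotten-job", lifetime=10, now=0)
registry.keep_alive("render-job", lifetime=10, now=8)  # client is still alive

reclaimed = registry.sweep(now=12)
print(reclaimed, sorted(registry.expires_at))
```

The client that crashed or walked away never renews its lease, so its resource is reclaimed without anyone having to issue an explicit cleanup request.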
Outside of resource sharing, grid computing is also beginning to play a larger role in the field of collaborative computing. One of the first applications of the I-Way back in 1995 was as a teleconferencing backbone, and it appears that advanced collaboration environments may be another area where enterprises begin to embrace the grid.
Boeing and Johnson & Johnson are both actively involved in work in this area. Boeing is looking at applications for its Phantom Works advanced research and development arm, and Johnson & Johnson is “very interested in building access grid nodes at their locations so their people can interact with each other over the network in a natural way,” says Global Grid Forum (GGF) Chair Charlie Catlett. The GGF is a community-initiated forum of individual researchers and practitioners working on grid technologies and standards (for more information on the GGF, see “The Global Grid Forum — The IETF of the Future?”).
The Global Grid Forum: The IETF of the Future?
Because grid computing is Internet-based, many grid standards are simply Internet standards, and members of the grid community are participating in a number of Internet standards bodies: the IETF for networking, the W3C for Web services, and OASIS for security. The developers of the Globus standards and the authors of the first drafts of what is now called the OGSA considered doing their work under the auspices of the IETF, or the more exploratory IRTF (Internet Research Task Force), but in the end decided to create their own standards body, the Global Grid Forum (GGF), modeled very directly on the IETF.
“We were concerned that the grid community was largely not the same as the IETF community,” says Charlie Catlett, Chair of the Global Grid Forum. “The people who go to the IETF are networking people. They’re not applications and middleware people, in general,” he adds. Catlett says that the grid community worked closely with members of the IETF to have the same intellectual property policies and document processes. “Our strategy has been to design the Grid Forum process so that it’s completely compatible with the Internet standards process,” he says.
The Global Grid Forum meets several times a year, and while hard-core grid developers view it as primarily a place to get standards work done, it has also evolved into a kind of educational resource for those yearning to learn about grid computing. The GGF’s next meeting will be held in Chicago, Illinois, October 15-17.
Web Services: The Next Frontier?
Some in the industry say that as the OGSA and Web services standards advance, the two will eventually merge, with grid computing and Web services coming to mean exactly the same thing. Whether this is true or not, it is clear that Web services, like networking and security, represent a key component of the grid platform that will most likely be designed outside of the auspices of the GGF.
Grid developers are contributing to the W3C’s Web services standardization efforts, and they’re betting heavily that Web services will solve a major problem for grid computing: distributed resource discovery.
When grid applications like Condor are created, administrators must install client software (called cannonballs in the jargon of the grid) on every machine in the grid. Because these cannonballs are deployed in such a wide variety of environments, they tend to use very safe but generalized code that often cannot take advantage of specialized system resources.
Grid developers are looking for a way to tell applications exactly what individual client environments look like, and according to GGF’s Catlett, “it turns out that this description of an environment is exactly the same problem that the Web services people have been looking at.” He continues, “they may deploy different types of applications on a resource, and maybe ask different kinds of questions, but the processes and the interface specifications that you need to ask questions about the hosting environment in the first place apply to the Web as well as grid computing.”
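The problem Catlett describes, asking questions about a hosting environment before deciding where to run, can be pictured with a small matching routine. This is a hypothetical sketch: the host records, their field names, and the `matching_hosts` function are invented for illustration and do not correspond to any Web services or OGSA schema.

```python
# Hypothetical descriptions that each host might publish about itself.
hosts = [
    {"name": "node-a", "os": "linux", "cpus": 2, "features": {"sse"}},
    {"name": "node-b", "os": "linux", "cpus": 8, "features": {"sse", "gpu"}},
    {"name": "node-c", "os": "solaris", "cpus": 4, "features": set()},
]


def matching_hosts(hosts, os_name, min_cpus, needed_features=frozenset()):
    """Return the names of hosts whose description satisfies the request."""
    return [
        h["name"]
        for h in hosts
        if h["os"] == os_name
        and h["cpus"] >= min_cpus
        and set(needed_features) <= h["features"]  # all needed features present
    ]


generic = matching_hosts(hosts, "linux", 1)
specialized = matching_hosts(hosts, "linux", 4, {"gpu"})
print(generic, specialized)
```

With a description like this available, an application can ask for specialized hardware where it exists rather than falling back to code that runs everywhere but uses nothing special.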
J2EE will also come to play a larger role in grid computing in the next year. IBM has committed to incorporating OGSA support into WebSphere, and Foster’s group at Argonne Labs is hoping to have an open source prototype of J2EE bindings for OGSA ready by year’s end. Foster says that grid-enabled applications need to have a basic idea of what resources will be available on the remote systems they will be using. The way they do this is by “virtualizing” remote resources.
“In the scientific community,” he says, “Linux serves as that virtualization technology, but only in a fairly primitive sense… in the commercial world,” he adds, “applications may be running Linux, but they assume more than that. They assume a set of J2EE services or something like that.” So the combination of OGSA and J2EE should enable a whole new class of applications to become grid-enabled.
The vendors are clearly supporting grid computing. Microsoft alone has funded Globus development to the tune of $1 million in 2002. And the developers meeting at the Global Grid Forum are clearly pushing the envelope of grid services quickly and far.
However, whether or not grid computing will be the panacea that some have suggested remains to be proved. As the history of high technology reveals, being all things to everybody is much more easily said than ever done.
Robert McMillan is Editor at Large for Linux Magazine. He can be reached at firstname.lastname@example.org