Natural Open Source

Open Source is about to change your life. Find out how it could even affect your genetic structure.


The automated sequencing machine changed everything. Before it came along, the hackneyed image of the biologist in a white lab coat staring intently through a microscope with caged rats scurrying in the background was not so far from the truth.

Twenty years ago, like today, the most interesting work in biology revolved around studying DNA (deoxyribonucleic acid), the double-helix of genetic code that living cells use to create new cells. Back then, DNA research involved a painfully slow process of manually dismantling DNA into its component nucleotides, the molecules that make up genes. In the mid-1980s a company called Applied Biosystems, Inc. began selling a machine that vastly sped things up: the automated sequencing machine. Like the textile industry after the introduction of the mechanical loom, biology was changed forever.

Within ten years, sequencing machines were fast enough that researchers could seriously consider cataloging the sequence of three billion nucleotides that make up the 30,000 to 35,000 human genes and creating a kind of blueprint of humanity. The Human Genome Project was born. And with it came the task of making sense of the three gigabytes of data that comprise our DNA. Add to that all of the contextual information about the human genome (published research about certain sequences and information on the relationship between sequences), the various algorithms for analyzing the data, and the fact that the human genome is merely one of many genomes to be mapped (the mouse genome is being finished now), and you begin to have some very large and messy data management problems. This is why, today, the computer has joined the microscope and the rat’s cage as an essential part of the biologist’s toolbox.
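The three-gigabyte figure corresponds to storing each of the three billion nucleotides as one text character. The arithmetic can be sketched as follows. The 2-bit packing is an illustrative assumption (a four-letter alphabet needs only two bits per base), not a storage format described in this article, and the function name is hypothetical.

```python
# Back-of-the-envelope storage estimate for a genome of a given length.
NUCLEOTIDES = 3_000_000_000  # the "three billion nucleotides" quoted in the text


def genome_bytes(n_bases: int, bits_per_base: int) -> int:
    """Bytes needed to store n_bases at a fixed number of bits per base, rounded up."""
    return (n_bases * bits_per_base + 7) // 8


# One ASCII character (8 bits) per base: the "three gigabytes" in the text.
as_text = genome_bytes(NUCLEOTIDES, 8)

# Assumption for illustration: A, C, G, T packed into 2 bits each.
packed = genome_bytes(NUCLEOTIDES, 2)

print(f"as text: {as_text / 1e9:.2f} GB, packed: {packed / 1e9:.2f} GB")
```

Either way, the raw sequence is the small part of the problem. The annotations, literature, and cross-genome comparisons layered on top are what make the data management messy.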

University research environments; vast amounts of data that need to be manipulated in customizable ways; a community of technical people with shared goals; new data analysis techniques cropping up on a regular basis. These are the hallmarks of the open source problem set, and if ever there was a world ready for open source software, the biological sciences are it. In the last ten years bioinformaticists (people who use computers to process biological information) have wholeheartedly embraced open source tools; in turn, the work done by biologists has begun to have an impact in the larger open source world.

Established open source projects have proved particularly useful to biologists in two areas: in number crunching, where Linux-based Beowulf clusters are providing a high-performance and inexpensive alternative to proprietary RISC systems, and in scripting, where biologically-focused scripting libraries like BioPerl and BioPython have become extremely popular tools for writing quick queries to the numerous publicly available genomic databases.

Josh Harr, the CTO of Linux Networx, a Beowulf system vendor, says his company first started seeing an interest in Linux systems for biotech around 2000, when biotech researchers at companies like Rosetta and Sequenome, and scientists at the University of California at Berkeley began replacing their RISC-based genomic analysis systems with Intel machines running Linux. Though Linux systems have yet to achieve the performance speeds of proprietary supercomputers, they’re gaining traction at the lower end of the spectrum because of their great price-vs.-performance figures.

Horst D. Simon, one of the authors of the Top500.org list of the world’s fastest supercomputers (http://www.top500.org/) says that, “Generally clusters are very productive for single workgroups and single applications, and are comparatively small. As far as I know there are no Linux clusters with more than 1,024 processors, yet state of the art supercomputers such as the IBM SP (Scalable Processor) at NERSC (the National Energy Research Scientific Computing Division where Simon works) in Berkeley has around three thousand.”

“People started using Linux when they realized it was a lot cheaper than getting a bunch of Alpha machines with Tru64 on them,” says Harr. He explains that a typical bioinformatics configuration will have a very large database on a high-performance storage subsystem connected to a number of Linux clients, all furiously trying to match a genetic sequence to data in the database. With Intel systems costing about one-fourth the price of their RISC counterparts, the open source option is often compelling.

It’s The Data, Stupid

While it may come as no surprise that open source software has gained support in the bread-and-butter computational and scripting areas that made it so successful on the Web, it has also proved to be surprisingly popular in the more vertically aligned BioIT (Bio Information Technology) tools market — an area that was filled with high-flying software startups just a few years ago. “Commercial models for bioinformatics software companies have yet to be proven successful,” says Bernadette Toner, an editor with Bioinform, a weekly newsletter for the bioinformatics crowd. “A lot of people are realizing that the software is not what’s important,” she adds, “it’s the data.”

Noted open source advocate Ewan Birney concurs. “Frankly, compared to the data openness issue, code openness is minor and a bit of a no-brainer. You should always open your code as it costs you nothing, enhances your reputation and generally helps you (and everyone else) tackle the interesting problems out there.” Nowhere was the importance of open access to data more strongly felt than in the race to map the human genome.

After beginning a publicly funded project in 1990 to map the tens of thousands of genes that make up a human, the international scientific community was thrown for a loop eight years later when an American company, Celera Genomics, announced plans to produce a similar map, but with the intention of patenting the genes it discovered and selling access to its database — even though researchers on the public genome sequencing team alleged that Celera had depended on public data to complete its human genome map.

According to J.W. Bizzaro, the founder of the open source advocacy group Bioinformatics.org (http://bioinformatics.org/), Celera’s arrival helped focus the attention of the bioinformatics community on the importance of openness, of both the data and the software. “When Celera published their version of the human genome in Science Magazine, they were still permitted to charge a subscription fee for people to access their data,” says Bizzaro. “If you put something out in the open, other people are supposed to be able to replicate the work. That’s the whole scientific method.”

Celera expects to generate $129 million in revenues from its proprietary databases this year, but the commoditizing effect of open databases has industry observers expecting this line of business to be a shrinking concern for Celera. The company is instead focusing on leveraging its genetic research and patents into drug development. Birney says that he no longer sees companies like Celera as a major threat. “Business models like Celera’s are just not viable,” he says. “Data is best used as a precompetitive resource. There’s little point in keeping it locked up. However people don’t always see it that way.”

Shades of ’94

Though open genomic databases may seem pretty far removed from the concerns of the average Linux user, there’s no doubt that bioinformatics is having an impact on the wider open source world. Perl, in particular, is being influenced by the field. One of the most popular Perl modules, CGI.pm, was written by the most famous of all bioinformatics hackers, Lincoln Stein (see “Stein at Your Service” for a conversation with Lincoln). Bioinformatics is also capturing the attention of established Perl developers looking for a more interesting challenge than writing the backend for yet another e-commerce Web site.

Stein at Your Service: Q&A With Lincoln Stein

Lincoln Stein is the best-known open source advocate in bioinformatics. From his research offices at the Cold Spring Harbor Laboratory, he has made a number of important contributions to the Perl community, including the CPAN modules CGI.pm, HTTPD-User-Manage, and GD. These days his work is focused on improving the way the various genome databases interface with each other.

Linux Magazine: You’ve spoken in the past of the need for the bioinformatics community to standardize interfaces between its many different data sources and to develop a better Web services infrastructure. What other components of the Web services puzzle are important?

Stein: In addition to standardizing interfaces, we need to standardize our data objects so that we can exchange information. Shared data objects include DNA sequences, proteins, genes, gene functions, anatomies, taxonomies, and various standard types of experiments.

Security is important, but even more of an issue is proper attribution. The one thing that is anathema to the scientific community is to have someone else “borrow” ideas without giving acknowledgment to the proper source.

LM: Do you think there’s a risk of there being a Microsoft of the BioIT world?

Stein: Microsoft is heading towards controlling the APIs for the developers’ community as a whole, and it’s hard for bioinformatics to buck that trend. What happens in bioinformatics will reflect what happens in the larger world.

LM: Is .NET having an impact?

Stein: Yes. The Omnigene project (http://omnigene.sourceforge.net), for example, is developing a C# API for genomic information on the theory that C# will become just as significant a player as Java is now.

LM: Perl seems to be the most obvious area where work done by bioinformatics people such as yourself has benefitted people outside the field. Are there others?

Stein: Some of the most interesting work in Java and Python is also coming out of bioinformatics. I think that bioinformatics attracts top-notch developers.

LM: Tim O’Reilly has said that the bioinformatics field attracts these developers, like Nat Torkington, because it represents the most “hard core” problem sets in computer science today. Do you expect that there will be a lot more crossover with non-biologist IT people coming into the field?

Stein: Physicists do well in that crossover capacity. So far there hasn’t been much precedent for computer scientists or IT people, but I’m hopeful that Nat is the leading edge of a tidal wave rather than just a ripple.

LM: So is this like the early days of the Internet? In what ways is it not like 1993?

Stein: I think it is. But there’s nothing exactly like the Internet was in 1993.

“People don’t want to do the same old same old,” says O’Reilly & Associates President Tim O’Reilly, “and bioinformatics is a field where people can really prove their chops.” He adds, “This is about status. What better way to get status than to show off that you’re a real hard ass.” Perl 6 maintainer Nathan Torkington and noted hacker Damian Conway have become active in the bioinformatics community. Torkington, who was one of the organizers of a recent O’Reilly & Associates bioinformatics conference, says that the field is exciting in the same way the Internet was back in 1994. “There was no shared knowledge and history for people who wanted to come into the field,” he says. “And I get the feeling that we’re in the same situation with bioinformatics right now.”

Ewan Birney’s introduction to the open source world was typical. One day, while working at Cold Spring Harbor Laboratory in Cold Spring Harbor, NY (the employer of fellow open source bioinformatics guru Stein), he picked up Kernighan and Ritchie’s seminal book, “The C Programming Language” in hopes of figuring out a way to solve a problem he couldn’t get the local IT people interested in. Before he knew it, he had become immersed in the world of open source.

Today Birney is a noted Perl contributor as well as the coordinator of the open source Ensembl project (http://www.ensembl.org/), one of the most widely used open source tools in bioinformatics today. Funded by the Wellcome Trust, a medical research charity, Ensembl is both a publicly available database of genomic information and a collection of open source tools that can be used to analyze genomic data from a variety of publicly available databases.

One of the most popular pieces of open software in bioinformatics is BLAST (Basic Local Alignment Search Tool, http://www.ncbi.nlm.nih.gov/BLAST/). Today’s bioinformaticians are constantly in search of patterns: connecting patterns found in DNA is the key to today’s biological research. Scientists learn about new gene or protein sequences by analyzing the similarities they may have with other sequences. The standard set of algorithms that biologists use to spot these sequence patterns is implemented in BLAST, which was developed in 1990 by researchers at the National Center for Biotechnology Information (NCBI), Penn State, and the University of Arizona. BLAST’s algorithms make it much faster than other software, and it quickly became the standard tool for comparing sequences.
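To give a feel for what “comparing sequences” means in practice, here is a minimal sketch of local alignment scoring. It is not BLAST’s own method: BLAST uses a faster heuristic, while this sketch uses the exhaustive Smith-Waterman dynamic program that such heuristics approximate. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Never go below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share the stretch "ACGT" but differ elsewhere.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The work grows with the product of the two sequence lengths, which is why searching a query against a multi-gigabyte database calls for faster heuristics and the clusters described earlier.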

New Frontiers

There are no established business plans for companies looking to make money with open source bioinformatics software, but a few are emerging, and the plans look a lot like Red Hat’s. One such company, Montreal’s Sequence Bioinformatics, Inc. (http://www.seqbio.com/), plans to begin offering an integrated suite of already-available open source tools called OpenGene. Sequence will package together popular open source tools and add its own open source help, as well as installation and management software.

If this sounds a bit like a Linux distribution, that’s because it is. People will be able to purchase the complete OpenGene suite on a CD-ROM if they like, but Sequence’s real money will come from support contracts and custom configurations. For companies that don’t want to spend the time or undergo the security risk of accessing public databases over the Internet, Sequence will set up anything from a single-processor server preconfigured with an open database and Red Hat Linux, for about $40,000, to a multiprocessor Beowulf cluster for a lot more money.

Sequence founder Shibl Mourad, who became interested in bioinformatics after selling an Internet company that created a combination of proprietary and GPL-licensed collaboration software, says that the opportunity for commercial vendors is in creating the products that open source developers may not be so interested in, but which are of great importance to corporate clients: support, security, and visualization tools, for example. Mourad says that security is especially important to hyper-competitive biotech companies who fear that Internet queries to public databases might reveal important information to competitors.

But it will take more than technology to convince the big biotech companies to invest more heavily in open source tools, says Tania Broveak Hide, the CEO of another open source bioinformatics company, Electric Genetics (http://www.egenetics.com/) of Cape Town, South Africa. “Commercial executives who make big buying decisions are not people who are generally up on the value of open source,” she says. “They’re very wary of community development. They need to have their vendor be highly accountable.” And that’s where companies like Sequence and Electric Genetics (which at the time of this writing had yet to reveal details of its business model) hope to come in.


In one very important area, the commercial open source vendors, developers, and the larger open source community share a common interest: Web services. When asked if it might be possible for a “Microsoft of BioIT” to somehow rise up and control the APIs that all bioinformatics software is written to, Bioinformatics.org’s Bizzaro doesn’t skip a beat. “Yes, and it’s Microsoft,” he says. “As Microsoft software proliferates, I think it’s going to proliferate into the scientific world as well. Things like .NET could potentially dominate the way things work on the Internet, which would affect the way bioinformaticists work.”

Because of the distributed nature of the bioinformatics field — the public human genome was mapped by a world-wide organization of researchers — the push to create more structured Web services is a major focus of open source developers these days. Stein has called for the wide variety of open data providers (databases like Ensembl, NCBI, and FlyBase) to support a wider and more standardized set of interfaces to their data and has suggested the establishment of a formalized service registry so that developers wouldn’t have to worry about a completely different set of APIs for every database they wanted to query.
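The idea of a service registry with one standardized interface can be sketched in a few lines. This is a toy illustration only, not Stein’s actual proposal: the `ServiceRegistry` class, the `Fetcher` signature, and the stand-in providers are all hypothetical, and no real database API is called.

```python
from typing import Callable, Dict, List

# Assumed common interface: a fetcher takes an identifier and returns a record.
Fetcher = Callable[[str], dict]


class ServiceRegistry:
    """Maps provider names to fetchers that all share one call signature."""

    def __init__(self) -> None:
        self._services: Dict[str, Fetcher] = {}

    def register(self, name: str, fetcher: Fetcher) -> None:
        self._services[name] = fetcher

    def names(self) -> List[str]:
        return sorted(self._services)

    def fetch(self, name: str, identifier: str) -> dict:
        if name not in self._services:
            raise KeyError(f"no service registered under {name!r}")
        return self._services[name](identifier)


# Stand-in providers: real ones would wrap each database's own API.
registry = ServiceRegistry()
registry.register("ensembl", lambda ident: {"source": "ensembl", "id": ident})
registry.register("flybase", lambda ident: {"source": "flybase", "id": ident})

for name in registry.names():
    print(registry.fetch(name, "example-id"))
```

The point of the design is that client code loops over providers without knowing anything provider-specific. Each database hides its own API behind the shared signature.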

O’Reilly says that Stein’s advocacy of registries is definitely giving the whole discussion of Web services development “traction” in the open source world. “It’s pushing the various language communities,” he says. In fact, according to O’Reilly, the two most important battles for Open Source right now are in network computing and bioinformatics, and the work of developers like Stein is crucial to getting open standards established in both arenas. At the end of the day, Linux users should care about both because one battle will change the way they compute; the other will affect their very lives.

Robert McMillan is editor at large with Linux Magazine. He can be reached at bob@linux-mag.com.
