Something Borrowed: Finding Open Source

The best place to look for open source code depends on what you’re looking for. A small routine? A sophisticated PDF manipulation library? An entire application? No matter the size, chances are that the source code you need is online and freely available — free as in beer and free as in speech. The trick is finding it.
Assuming no single application or shell script does precisely what you need, there are hundreds, even thousands, of open source components (functions, libraries, subsystems, packages, and more) from which to draw solutions. Indeed, if the components’ software licenses permit, you can take a little from here and a lot from there to create a solution far more cheaply and quickly than building the code from scratch. For modern programmers, lazy is the new savvy.
Of course, nothing is completely free: you still have to do a little work to be lazy. At present, there are no widely accepted standards for annotating or indexing source code, and the traditional heuristics used by search engines to scour and catalog text don’t work especially well with source code. And while some efforts are attempting to standardize source code distribution (see the sidebar “Making Open Source More Open”), because source code and requirements are so varied, finding just the right snippet of code or the perfect module will likely always remain something of an art.
Short of a repeatable method, let’s look at some tips and tricks to find open source code.

Finding the Needle in the Haystack

The first place to start a search for a module, such as a function, an algorithm, or a class, is Google (http://www.google.com/). Using the filetype:xxx predicate, you can restrict your search to files that end with xxx. For instance, providing the filetype:java modifier would limit a search to Java source code files.
It helps if the source has descriptive comments, but if not, try to search for single word identifiers. For example, if you’re searching for a Quick Sort (http://en.wikipedia.org/wiki/Quicksort) implementation, you may receive better results if, instead of typing quick sort, you try quicksort or quick_sort. Although Google isn’t adept at indexing source code, it remains a superlative search engine, and its popularity ranking system yields very relevant results.
Another search engine that (unlike Google) actually logs into Concurrent Versions System (CVS) repositories, such as those found at SourceForge.net, is Koders (http://www.koders.com; shown in Figure One). The Koders search engine is designed specifically to search open source code and can limit searches to specific languages or licenses. Koders.com is still relatively new and lacks great sophistication in its search heuristics, but is nonetheless able to yield matches that Google otherwise misses.
FIGURE ONE: The Koders source code search engine home page

If you’re looking for a relatively small snippet of code (a script or a Perl one-liner, say), you may have the best success searching Usenet newsgroups, online discussion forums, and mailing lists. Such venues tend to be more personal, places where an individual might look for a solution to a specific problem. In many cases, community members post working source in response to queries.
When looking for code on newsgroups, discussion forums, and mailing lists, invest some time searching the archive associated with each venue. Chances are, especially on Usenet, that if you’re looking for something, someone else has already posted the same inquiry. For example, typing bubble sort in the Ruby Language’s official ruby-talk archive search engine (http://blade.nagaokaut.ac.jp/ruby/ruby-talk/index.shtml) yields twenty-five results. Chances are good that you’ll find a reply without proceeding further.
However, some mailing list archives, such as the various Python mailing lists at http://mail.python.org/, don’t have a search engine. The omission seems odd, but it is not insurmountable. You can search them very effectively using the Google site search feature. For instance, if you type site:mail.python.org quicksort, you’ll receive hundreds of hits.
Be thorough in your search and try numerous searches with alternative words and phrases. Expect to have to parse a lot of results. Google’s index of Usenet postings is quite expansive, but providing just the right phrase can often separate fruitless searches from bountiful ones. Once you’ve exhausted the search (or yourself in the attempt), post a new entry describing your problem, your environment (platform, compiler, language, and so on), and your requirements.
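The search tips in this section (the filetype: and site: predicates, plus single-word identifier spellings) are easy to script. The following Python sketch is only an illustration: the helper names identifier_variants, build_queries, and search_url are made up for this example, and the Google search URL format is an assumption about how you would submit the queries.

```python
from urllib.parse import urlencode


def identifier_variants(phrase):
    """Return the phrase plus the single-word spellings a programmer might use."""
    words = phrase.lower().split()
    variants = [" ".join(words)]
    if len(words) > 1:
        variants.append("".join(words))    # e.g. quicksort
        variants.append("_".join(words))   # e.g. quick_sort
    return variants


def build_queries(phrase, filetype=None, site=None):
    """Combine each spelling with optional filetype: and site: predicates."""
    queries = []
    for variant in identifier_variants(phrase):
        terms = [variant]
        if filetype:
            terms.append(f"filetype:{filetype}")
        if site:
            terms.append(f"site:{site}")
        queries.append(" ".join(terms))
    return queries


def search_url(query, base="http://www.google.com/search"):
    """Build a search URL for a query (the base URL is an assumption)."""
    return base + "?" + urlencode({"q": query})
```

Calling build_queries("quick sort", filetype="java") produces one query per spelling, each restricted to Java files, while build_queries("quicksort", site="mail.python.org") reproduces the mailing list search described above.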

Shopping for Source

If you need a large, sophisticated component, code repositories tend to be more effective than the previous two techniques. Two of the most well-known source code repositories are SourceForge (http://www.sourceforge.net) and O’Reilly’s CodeZoo (http://www.codezoo.com).
SourceForge is very popular — chances are you’ve downloaded something from its vast collections of projects. But SourceForge can also be difficult to use. It isn’t pretty to look at, and SourceForge is intended to be a collaborative tool for producers of open source code, not a portal for consumers of open source code. Still, it’s a gold mine.
Be wary, though: searching SourceForge often yields handfuls of abandoned, immature, and unfinished projects. Be prudent and selective. Ideally, you should download a component or library with an activity level greater than 90 percent, an actively monitored and maintained bug reporting system, a supportive and responsive forum or mailing list, and a maturity level of five (production/stable).
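If you collect these facts by hand for a few candidate projects, the checklist above reduces to a simple filter. The Python sketch below is hypothetical throughout: the Project record, its field names, and the shortlist function are invented for illustration and do not correspond to any SourceForge interface.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Facts about a candidate project, gathered by hand (hypothetical fields)."""
    name: str
    activity_percent: float       # activity level reported by the hosting site
    maturity: int                 # 5 = production/stable
    has_active_bug_tracker: bool
    has_responsive_forum: bool


def is_promising(project, min_activity=90.0, required_maturity=5):
    """Apply the article's rule of thumb to a single project."""
    return (
        project.activity_percent > min_activity
        and project.maturity == required_maturity
        and project.has_active_bug_tracker
        and project.has_responsive_forum
    )


def shortlist(projects):
    """Keep only the projects that satisfy every criterion."""
    return [p for p in projects if is_promising(p)]
```

The thresholds are parameters because the criteria are ideals; relax them when nothing passes, but do so knowingly.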
A relatively new source code repository is CodeZoo. CodeZoo provides hundreds of components and libraries for Java, Python, and Ruby. Unlike SourceForge, CodeZoo is intended specifically for the distribution of open source code, and it comes with a useful rating and review system. (See Figure Two.)
FIGURE TWO: O’Reilly’s new CodeZoo source code repository

CodeZoo isn’t the only language-specific repository. If you’re a Perl programmer, you have no doubt used the Comprehensive Perl Archive Network, more commonly referred to as CPAN. And if you code in PHP, there’s the growing PHP Extension and Application Repository, or PEAR.

Evaluating Open-Source Code

You can invest a lot of time downloading, integrating, and testing open source code, just to find the code is worthless in the end. Unfortunately, there’s no magic formula for predetermining whether the code you’ve found will serve you well. However, here are some guidelines that may help:
* Is the code adequately documented? Without documentation, it’s very hard to figure out what code is supposed to do, let alone whether it accomplishes that. At a minimum, read the project’s Frequently Asked Questions (FAQ) list. An FAQ is crucial reading for evaluating source code, because it may state directly and plainly what the software is capable of, including its limitations. Often, an FAQ describes inadequacies as “known issues.” Flowcharts, UML diagrams, design documents, and sample code are invaluable, too.
* Does the code include tests? An extensive, up-to-date test suite is the proper way to prove that code does what it’s supposed to. Moreover, fine-grained test cases can act as requirements, demonstrating specific assumptions and expected behavior. Additionally, it doesn’t hurt to write your own tests to make sure that the code you are evaluating does what you need.
* Are bugs managed methodically and addressed systematically? Virtually all source code has bugs, so it’s very important to track defects. At the very least, bugs should be reported, replicated, and discussed on either the mailing list or discussion forums. Better yet, a project can maintain a Bugzilla database (http://www.bugzilla.org/) of known issues and requests. Healthy projects tend to have organized, helpful, visible, and responsive development teams.
Finally, read the code’s README, CHANGELOG, and TODO documents. Most open source projects include one or more of these files with source code, which summarize the purpose, history, and plans for the code, respectively. Look for these files in the topmost level of the project’s source code “tree”, and try to determine how many releases have been made, both over the project’s lifetime and in the recent past. Code with few revisions tends to be unstable.
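Writing your own tests for borrowed code need not take long. As a minimal sketch, suppose you have found a sort routine and want to confirm that it behaves before integrating it. The check_sort function below is a hypothetical helper; it assumes the borrowed routine takes a list and returns the sorted result, and it uses Python’s built-in sorted() as the reference.

```python
import random


def check_sort(sort_fn, trials=100, seed=0):
    """Run a borrowed sort routine against known and random inputs.

    Returns a list of failure descriptions; an empty list means every
    check passed. Assumes sort_fn returns the sorted sequence.
    """
    # Edge cases first: empty, single element, reversed, duplicates, negatives.
    fixed_cases = [[], [1], [2, 1], [3, 3, 3], [5, -1, 0, 5, 2]]

    # Then reproducible random cases, so a failure can be replayed.
    rng = random.Random(seed)
    random_cases = [
        [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        for _ in range(trials)
    ]

    failures = []
    for case in fixed_cases + random_cases:
        result = sort_fn(list(case))  # pass a copy so the input stays intact
        if list(result) != sorted(case):
            failures.append(f"input {case!r} produced {result!r}")
    return failures
```

Pass the routine under evaluation as sort_fn. Any entry in the returned list is a concrete input you can attach to a bug report.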

Software Licenses

Of course, you also need to carefully consider each and every component’s license. Software, whether captured in source code, object files, or executables, is protected by copyright. Unless the software’s accompanying license grants you adequate and necessary rights for your needs, that “perfect” module must remain off-limits.
Using open source code can be intimidating. There is a plethora of open source licenses (the Open Source Initiative, http://www.opensource.org, offers over fifty “OSI-Approved Licenses”), seemingly scary legal jargon, the even more ominous threat of litigation and penalties, and a good deal of Fear, Uncertainty, and Doubt (FUD), not to mention misinformation. Different licenses have varying intents and conditions, and many licenses, including the venerable GNU General Public License (GPL), have not been litigated extensively and therefore lack vetting and substantiation.
For example, the Free Software Foundation claims that linking your code against a GPL’d library requires your entire software to be licensed according to the terms of the GPL (see http://www.gnu.org/licenses/gpl-faq.html#TOCLinkingWithGPL). But this is just one, broad interpretation; for another, slightly more refined opinion, see this month’s “On the Docket” column beginning on page XX.
Moreover, it is common for a licensor (the copyright holder) to improperly license software. (See the January 2006 “On the Docket” column, available online at http://www.linux-mag.com/2006-01/docket_01.html, for examples and tips on what not to do with your original source code.) Before you use open source code, determine if the source code is labelled with the correct license. Even if a README says that the code is available under the BSD license, the source itself may indicate otherwise. A header file may have been released under another license, perhaps under the GPL. Determining if one software license is compatible with another is best left to your attorney. Remember, ignorance is not a valid legal defense.
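A quick scan of file headers can reveal when the source disagrees with the README. The Python sketch below is a crude heuristic, not a license audit: it looks for a handful of phrases that appear in the standard text of common licenses, and the function names and the list of file extensions are assumptions made for this example. Treat its output as a list of files to hand to a human reviewer.

```python
import os
import re

# Phrases taken from the standard wording of each license or its usual header.
LICENSE_MARKERS = {
    "GPL": re.compile(r"GNU General Public License", re.IGNORECASE),
    "LGPL": re.compile(r"GNU (Lesser|Library) General Public License", re.IGNORECASE),
    "BSD": re.compile(r"Redistribution and use in source and binary forms", re.IGNORECASE),
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE),
    "Apache": re.compile(r"Apache License", re.IGNORECASE),
}


def detect_licenses(text):
    """Return the sorted names of licenses whose marker phrase appears in text."""
    # Collapse whitespace and comment decoration so that phrases broken
    # across comment lines ("Public\n * License") still match.
    flattened = re.sub(r"[\s*#/]+", " ", text)
    return sorted(
        name for name, pattern in LICENSE_MARKERS.items()
        if pattern.search(flattened)
    )


def scan_tree(root, extensions=(".c", ".h", ".java", ".py", ".pl"), head_chars=4096):
    """Map each source file under root to the licenses named near its top."""
    findings = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(extensions):
                path = os.path.join(dirpath, filename)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    findings[path] = detect_licenses(handle.read(head_chars))
    return findings
```

Run scan_tree on the unpacked project and compare the result with what the README claims. A file that reports a different license, or none at all, deserves a closer look.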
However, many licenses are well-written and quite generous. The best advice: Do your homework. And if you’re considering adopting open source code in your project, it’s best to seek the services of qualified counsel. (For more information on open source licenses, see Linux Magazine Editor-in-Chief Martin Streicher’s excellent series of articles at http://www-128.ibm.com/developerworks/opensource/library/os-license/.)

Language Interoperability

Oftentimes, you may find a module tailored to your needs, albeit in a programming language you’re not familiar with. Depending on the languages, mixing code from different languages can be an effective technique, since many programming languages are designed to solve specific kinds of problems. For instance, it’s hard to beat C for speed, while Perl is the undisputed champion of parsing.
One simple technique to mix languages is to invoke a “foreign” language’s interpreter. Virtually every language supports the equivalent of the C system() function call, allowing you to launch any application or shell script in any language.
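In Python, for example, the standard subprocess module plays the role of system(). The sketch below shows one way to hand text to an external interpreter and collect its output. The helper names run_foreign and grep_errors_with_perl are hypothetical, and the second function assumes that a perl executable is installed and on your PATH.

```python
import shutil
import subprocess


def run_foreign(argv, stdin_text=""):
    """Run an external program, feed it stdin_text, and return its output."""
    result = subprocess.run(
        argv,
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the program exits non-zero
    )
    return result.stdout


def grep_errors_with_perl(log_text):
    """Keep only lines mentioning 'error', using a Perl one-liner.

    Assumes perl is available; raises RuntimeError otherwise.
    """
    if shutil.which("perl") is None:
        raise RuntimeError("perl not found on PATH")
    return run_foreign(["perl", "-ne", "print if /error/i"], log_text)
```

Passing the command as a list of arguments, rather than as one shell string, avoids quoting problems when the borrowed script receives user-supplied input.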
Additionally, many languages, such as Perl, Python, and even Java, support some kind of integration with C. In fact, C underlies most programming languages and is often used exclusively to extend the core features of a language (for instance, PHP and Perl). As a result, C is a very effective “glue.”
Another approach is to embed a small programming language interpreter in your application and use its capabilities to build more effective solutions. For example, Lua (http://www.lua.org) is an open secret in the games industry, as it’s an extremely lightweight, efficient, and full-featured programming language with coroutines (Lua’s lightweight threads) and rich data structures. Frink (http://futureboy.homeip.net/frinkdocs/) is a highly specialized language for calculations involving units of measure.
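As one illustration of C as glue, Python’s standard ctypes module can call functions in a compiled C library directly, with no wrapper code to compile. The sketch below calls strlen() from the C standard library. It assumes a POSIX system such as Linux, and the helper names load_libc and c_strlen are invented for this example.

```python
import ctypes
import ctypes.util


def load_libc():
    """Load the C standard library (assumes a POSIX system)."""
    # find_library may return None; on POSIX, CDLL(None) then exposes the
    # symbols already loaded into the running process, which include libc.
    return ctypes.CDLL(ctypes.util.find_library("c"))


def c_strlen(data):
    """Return the length of a byte string as computed by C's strlen()."""
    libc = load_libc()
    # Declare the C signature so ctypes converts arguments and results correctly.
    libc.strlen.argtypes = [ctypes.c_char_p]
    libc.strlen.restype = ctypes.c_size_t
    return libc.strlen(data)
```

The same pattern applies to a borrowed C library built as a shared object: load it by path, declare each function’s argument and return types, and call it like any other function.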

Borrower Beware

Just because open source code is “free” (as in beer and speech) doesn’t mean it’s free of cost. Often, manipulating and adapting source code requires a significantly larger investment of time and effort, and as the old adage says, “Time is money.”
Before searching for, evaluating, integrating, testing, and debugging open source code, perform a cost-benefit analysis. Sometimes, well-known commercial alternatives exist that cost the equivalent of a few hours’ wages. It can easily take one or two developer-days to evaluate even trivial open-source code. Lazy and frugal is the new savvy.
Of course, commercial code quality varies just as open source code quality varies, but commercial developers have extra incentive to provide support, documentation, and a high level of quality.
And don’t forget the “usual suspects.” If a programming problem can be trivially solved with a combination of shell utilities and scripts, go no further.
Returning to search, an intriguing possibility lies with Google Base, a public database for making virtually any kind of data searchable. Currently the technology is quite immature, but it can already be used for publishing open source code.

Christopher Diggins is a freelance software developer and designer of the Heron programming language (http://www.heron-language.com). Christopher can be contacted through his web page at http://www.cdiggins.com. Special thanks to Kris Unger for taking the time to carefully review and comment on this article.
