Parrot is the name of a new virtual machine (VM). Parrot is object-oriented, introspective, and language-agnostic. Like other modern virtual machines, Parrot is driven by bytecode and supports continuations. However, unlike almost all of today’s popular, stack-based, RISC-like virtual machines (such as Java’s and .NET’s), Parrot is register-based and a CISC-analog.
While Parrot was originally conceived of as the virtual machine for Perl 6, it now encompasses the whole family of dynamic programming languages. For all intents and purposes, Perl is a proper superset of a wide variety of languages. Name a language feature, any language feature — objects, dynamic compilation, runtime binding, exceptions, threads, lexical scoping — Perl has it (and Perl 6 adds more). It makes sense then that an engine that runs Perl 6 would work just as well for any language with similar characteristics, including Ruby and Python.
Oddly enough, while Ruby and Python are established languages, the design of Perl 6 is still in flux. So why build Parrot now? While the syntax of Perl 6 has yet to be finalized, the semantics of Perl 6 are more or less well known (for a preview of Perl 6, see the accompanying article on page 24). That is, while we don’t know exactly what Perl 6 looks like, we do know how it should act, at least well enough to build a good engine.
What started out as a prank (see the sidebar “The Perl Posse’s Parrot Prank”) is now a promising new target platform for dynamic languages of all kinds. Here’s a look at the soul of a new virtual machine.
The Perl Posse’s Parrot Prank
Each year on April Fool’s Day, the Perl community plays a big, practical joke on the rest of us. In 2001, Simon Cozen’s planned and executed a whopper.
The big news that day: Larry Wall and Guido van Rossum were burying their linguistic differences and merging Perl and Python to produce “Parrot,” a language with all the characteristics of each of its parents.
Of course, the story wasn’t true, but as jokes go, Parrot was a pretty good one: there was a fake entry in O’Reilly’s book catalog for “Programming Parrot in a Nutshell,” a gag interview, and statements from prominent developers from both languages advocating the merger.
So, there never was a Parrot language, but as life so often imitates satire, when it came time to choose a name for our new virtual machine — a machine capable of supporting Perl and Python (and others) — Parrot seemed the logical choice. |
Parrot Wants A Magic Cookie
When we say Parrot is language-agnostic, we mean that it’s both human language-agnostic and computer language-agnostic. Parrot places few restrictions on text or on computer languages that might target the VM.
The two fundamental data types in Parrot are Parrot Magic Cookies (PMCs) and strings. A PMC is Parrot’s base variable type. It’s a (mostly) opaque data type, meant to be the most primitive type that a language targeting Parrot uses. A PMC can hold a single value, an aggregate such as a hash or an array, or an object. PMCs are typed in a very simple, object-oriented way, but it’s a fixed, low-level sort of abstraction. (Parrot has facilities to manipulate platform-native integers and floats, but they’re meant mostly for internal bookkeeping and the optimizer)
Each PMC has a set of functions, stored in a vtable that it must provide. The functions are a combination of high-level, object-oriented functions like making a method call; medium level functions, such as adding an integer to the current PMC and returning a result; and low-level functions, like returning the PMC class that the current PMC belongs to. These abstractions let us define and quickly access the features that each PMC must express, without specifying the mechanism that implements those features. The Parrot engine doesn’t care if these fixed functions are written in C, Parrot assembly code, some other high-level language targeting Parrot, or Fortran.
Parrot also provides a string data type as a fundamental construct. Parrot’s native string type is a complex one, reflecting the complexity of the problem. For example, it’s no longer adequate to call a series of 7-bit integers a “string,” and in many cases, Unicode is insufficient: there’s a lot of text in the world that isn’t Unicode and won’t ever be Unicode. Our final solution was an abstract string type.
Like any sane string, a Parrot string tracks its length, but a Parrot string also tracks its encoding, which tells how to turn a series of bytes into abstract integer code points. A Parrot string also tracks the character set, which describes what those abstract integer codes mean, and tracks the source language, which provides sufficient information to perform sorting and case-shifting. The encoding and character set libraries provide the few bits of specific information that Parrot actually needs to know.
Registers Galore and Stacks of Stacks
Parrot has four sets of registers for programs to use, one set each for integers, floating point numbers, strings, and PMCs. Each register set has 32 registers, numbered 0 to 31. Registers act as a sort of fast-access, named, temporary location.
While registers in a VM aren’t as fast as registers in an actual hardware system and a register-based architecture is somewhat more complex to program for, there are advantages, too. (For a brief list of the pros and cons of stack-based and register-based systems, see the sidebar “Registers vs. Stacks”). VM registers are generally faster to access than the top of a stack, and on systems with lots of hardware registers, one can map some of the commonly used interpreter registers to actual hardware registers when the engine JITs the code. (For information on JIT in Parrot, see the sidebar “The Just In Time Compiler”). This is a potential performance win that comes without penalty to more register-starved systems such as those with an x86-compatible processor. And unless you are working at the very lowest levels of Parrot — such as writing a compiler to target Parrot or writing Parrot assembly directly — the underlying architecture of the virtual machine is essentially irrelevant.
Parrot is a register-based system. Java, .NET, Perl 5, Ruby, and Python are stack-based systems. What’s the difference?
In a stack-based system, all operations take place on a stack. Want to add two numbers? Put each of them on the stack and issue the add command. Add takes the top two entries off the top of the stack, adds them together, and puts the result back on the stack. This is a very simple system to build and a simple one to target when writing a compiler.
The downside is that the machine spends a lot of time fiddling with the stack, since every operation, in addition to actually doing work, needs to tweak the stack pointer.
A register architecture, on the other hand, doesn’t have to do any stack manipulation unless it’s actually executing code that does something with the stack. In a register architecture, there’s less housekeeping, meaning that more of your code gets executed.
The downside here is that the code stream is less dense (the instructions have to specify the registers being affected) and code generation is more complex; in a stack-based system, you don’t have to worry about which registers are available to use.
Luckily, register allocation is one that’s been studied for decades, with many good solutions available. Parrot’s IMCC module, part of the compiler code, handles register allocation for you. |
Even though Parrot is a register machine, there’s still a need for stacks. There are six stacks in Parrot — one stack for each set of registers, one general-purpose stack, and the control stack. Each stack is segmented, avoiding the need for a contiguous piece of memory (as hardware stacks require), and each stack can grow as large as it has to, up to the limit of the amount of memory.
Segmented stacks are also great for threads (we don’t have to pre-allocate a fixed-size chunk of memory for each thread, something that’s either wasteful or insufficient or both), and having a segmented stack also makes implementing continuations and coroutines much easier.
The general-purpose stack is used to save individual registers. It’s used primarily for register spilling, when a segment of code needs more registers than are available. Admittedly, this is an uncommon occurrence with 32 registers, but it can happen. The general-purpose stack is typed, and the engine throws an exception if, for example, you try to pop an integer into a string register.
The register stacks are backing store for the register sets and each stack frame holds all 32 registers of a particular type. These stacks are used when a set of registers must be saved — for example, to save the contents of a register set before calling a subroutine. Having a dedicated stack for each register set makes saving and restoring all of a particular set of registers faster. Not only can we skip type checking when we pop a register frame, but we can also save and restore register sets with a single memory copy operation, something that’s heavily optimized on most systems.
The control stack is for the interpreter’s use. It’s where the interpreter pushes exception handler information, lexical pad information, stack marks, return addresses, and a variety of other housekeeping data. Program code generally doesn’t concern itself directly with this stack, though information can be queried through the interpreter’s introspection facilities.
The Just In Time Compiler
It’s almost obligatory these days to have a Just In Time (JIT) compiler built into any new VM. Popularized by Java and .NET, whose core designs require a JIT to achieve reasonable performance, JITs are relatively new to the worlds of Perl, Python, and Ruby. In general, those languages never really needed them — those languages have had acceptable performance without adding the extra complexity and platform-specificity that JITs require. JITs, by their very nature, are CPU- and, often, operating system-dependent.
While we had always planned to add a JIT to Parrot, it wasn’t one of our primary design goals, as we figured that the engineering effort to implement a JIT framework would be better spent elsewhere. We were surprised then, when in December 2001, Daniel Grunblatt, a computer science student in Argentina, sent in patches to implement a rudimentary JIT for Parrot. Originally written for x86 Unix, it’s since grown into a nice JIT framework that works on PPC, x86, Alpha, ARM, and SPARC CPU architectures.
Having a JIT available has turned out to be a very good thing. While it’s one thing to try and take a hypothetical JIT compiler into account when working on a design, it’s entirely another thing to have a real one handy, with all the real limitations actual code imposes. |
Garbage Collection
Managing memory, objects, and resources is one of those tasks that nobody likes to do, and one that’s more and more error-prone the larger a program gets. As an alternative, virtual machines generally include some form of garbage collection, and Parrot is no different.
Perl 5 uses reference counting to determine if a variable’s alive or dead. Using reference counting, each variable has a counter that tracks the number of other variables or structures that point to it. When the count goes to zero, the interpreter considers the variable dead and destroys it. Reference counting has some advantages, including the immediate destruction of dead objects.
Unfortunately, reference counting is deceptively expensive and very error-prone. One mistake means that a variable dies too soon — which is usually fatal — or lives forever, which leaks memory. Reference counting also can’t handle a self-referential data structure, which live forever because the reference count never drops to zero.
Parrot, on the other hand, uses a tracing garbage collector. Whenever the interpreter notices that it’s out of some resource or hits a point where a trace is in order, it walks the tree of live objects, starting from a set of known live things, such as the registers and the stack. Any object that the interpreter can’t eventually reach from this root set is assumed to be dead data and is freed. At the moment, Parrot is using a modified mark and sweep scheme and a compacting memory allocator for memory allocations, but those will eventually change since the solution does not scale: millions of live objects cause unacceptable pauses.
Safety and Security
Parrot, by default, trusts the bytecode it runs. It trusts that the bytecode is well-formed, has no malicious constructs, and was correctly generated. Excluding bugs, we assume that the bytecode being executed isn’t broken. That is, the bytecode doesn’t reference invalid registers, exit improperly, have bogus opcodes, or branch to bad places.
Assuming that code is safe improves speed. By trusting code, Parrot can avoid the additional overhead of validating bytecode on load, checking opcode parameters during execution, imposing security checks and resource limits, and double-checking branch destinations. In the default “trusting” mode, code executes at maximum speed, trusting that the compilers have emitted correct bytecode — a reasonable, valid trust. As a result, Parrot dies in horrible ways if you feed the default interpreter bad bytecode.
However, there are times when you don’t or shouldn’t trust the code you’re executing. You may have gotten a piece of bytecode from some third party or you may be executing bytecode with elevated privileges. In this case, bad behavior can be far more serious than just crashing the interpreter — you might execute unexpected or malicious code, compromising your system.
For this, Parrot will eventually have a “safe mode” where the interpreter doesn’t trust the code it’s executing and may put limits on the resources the code can use, including CPU time and memory. Safe mode will be slower, since security isn’t free, but it will allow your program the opportunity to protect itself from malicious code.
Speed
One of the fundamental goals of Parrot is to improve performance. Our original, stated goal was 20% faster, but we’re aggressively shooting for a 2X or 3X improvement.
Interpreters, especially for dynamic languages such as Perl and Python, have significant overhead. In many cases, recoding a Perl program in C results in a 200-fold increase — obvious headroom for significant speed gains. While some folks have disagreed with speed as a primary design goal, it has, so far, served us well, with the current code running faster than comparable Perl 5 code by a factor of five to eight, and that’s without the JIT or any micro-optimizations.
It’s tough to tell how overall performance will be compared to Perl 5 as we don’t have a complete regex implementation, the part of Perl 5 that’s seen the most aggressive and mind-warping optimizations. At the moment speed is quite good compared to the other interpreters in our class, as Parrot handily outstrips both Ruby and Python in areas where we’ve got completed functionality. That bodes well for the future.
Is There Anything New?
If you look at Parrot, it’s tempting to ask, “What are the big innovations here?” The answer is simple: there aren’t any. There isn’t anything innovative in Parrot’s design. All of the core elements are at least a decade old, if not older. (One of the running jokes among Parrot designers is that the Lisp folks solved all the problems we’ve run across over 30 years ago.) Parrot is a software engineering project, not a research project.
However, just because we’re not doing anything truly new doesn’t mean that we’ve ignored all of the good research that other people have done. Parrot’s garbage collection, register allocation, and feedback-driven dynamic optimization algorithms are all based on existing research.
Programming Parrot
There are two ways to write code for Parrot, either directly in Parrot assembly or in a language that compiles to Parrot assembly or Parrot bytecode.
Parrot currently executes bytecode files, either from disk or piped into stdin. Compilation and execution are separate, at least for now. Later in the development cycle, the interpreter will generate bytecode on-the-fly and support things like Perl’s string eval, that compiles and executes source code on-the-fly.
Building Parrot is straightforward. You’ll need a working version of Perl (V5.005_03 or later), a make tool, and a C compiler. (The dependencies on Perl and make are temporary). Currently, Parrot builds on many flavors of Unix (including Linux, Mac OS X, Solaris, and Tru64), Microsoft Windows, and VMS. The list of platforms should expand as time goes on — all you’ll eventually need to build Perl 6 is an ANSI89-compliant C compiler and reasonably-sized filenames.
To build Parrot, download the source and accompanying instructions from the Parrot Web site at http://parrotcode.org. The full Parrot source is currently about 48 MB, where more than half of that is from IBM’s ICU Unicode library. You’ll need about 100 MB of free disk space to download and build Parrot. That number will be drastically reduced for release, but for now it’s not an onerous amount.
After you’ve fetched and unpacked the source, you need to configure and build the code. Go to the top of the Parrot tree and issue the command:
This command probes your environment and generates makefiles in the appropriate places. Then, just issue make and wait for the build to complete. When it’s all done you’ll have a single Parrot executable, ready to execute Parrot programs.
Parrot comes with an ever-growing suite of tests. To test your build, run:
Of course, no test suite is ever complete, and ours is no exception. You may find bugs that the suite does not test. Extra tests are always welcome.
Parrot also comes with a number of languages you can use with it, including Integer Basic, Forth, a partial Perl 6 implementation, Scheme, and Befunge. We’ve also got a custom language, “Jako,” that’s being developed along with the interpreter. It’s a nice little language, and it makes it easier to write code for Parrot without having to drop all the way to assembly. After you’ve built and tested the install, run the command:
to build the languages included in the distribution. Once you’ve done this, you can fiddle with Jako, play Hunt the Wumpus, or write 2D programs in Befunge.
If, as you work with Parrot, you come across bugs or have patches, send mail to bugs6@bugs.perl.org. Email sent to that address goes to the development list and is logged in our RT bug tracking system.
A Simple Example
Since all languages must have a “Hello, World!”, here’s Parrot’s:
print “Hello, World!n”
end
This Parrot example is remarkably shorter than one written in C, but this is about the only time that’ll be true. Save the program to a file named hello.pasm and run the Parrot assembler against it, like so:
% perl assemble.pl hello.pasm -output=hello.pbc
The assembler will emit the compiled bytecode to stdout or you can redirect it with the -output switch. It’s generally best to use -output, as it’s a good cross-platform habit. (Different platforms will sometimes do odd things to line breaks, and throwing the standard filehandles into binary mode is somewhat unreliable in some cases).
Finally, run the program with:
% Parrot hello.pbc
Parrot will load in the bytecode file, print Hello, World!, and exit.
Some Assembly Required
Parrot’s assembly format owes a lot to that of older hardware, such as Digital Equipment’s VAX and Motorola’s 68000 CPUs. The format is relatively simple:
opcode [destination[, source1[, source2[, …]]]
That is, first comes the opcode, followed by an optional destination, followed by a set of optional sources. Only the first parameter to an opcode is changed, the rest are generally read-only.
Some opcodes, such as end, take no parameters. print, as you’ve seen, takes one parameter, while others, such as add, may take three. (And some, such as dlfunc, take four or more). The Parrot documentation describes all of the opcodes.
As we said earlier, Parrot is a register machine with some CISC-like characteristics. print, for example, can print string, integer, and float constants, as well as the contents of string, integer, float, and PMC registers.
So, let’s try an example that uses some registers and has some control flow. The following Parrot code prints Hello, world #10 through Hello, world #1 and then exits:
1 set I0, 10
2 PRINTER: print “Hello, world #”
3 print I0
4 print “n”
5 sub I0, I0, 1
6 ge I0, 1, PRINTER
7 end
Line 1 sets integer register 0, I0, to 10. Line 2 prints the constant string. Line 2 has a label so we can refer to it in other parts of the program. Line 3 prints the contents of register I0, automatically converting the integer to a string for printing (the content of I0 are left untouched.) Line 4 prints a newline.
Line 5 subtracts 1 from I0. I0 is there twice, once as the destination, and once as the source, which is perfectly valid. If you recall the Parrot opcode formats, the first parameter to an opcode is the destination, while the rest are usually sources. If the instruction were sub I0, I1, 1, it would have subtracted one from register I1 and put the result in I0.
Line 6 does a test and branch. In this case, the test is greater-than-or-equal, checking to see if the first parameter is greater than or equal to 1. If so, it branches to the label that is the third parameter, PRINTER. If not, it falls through to the next opcode.
The last opcode on line 7 ends the program, exiting cleanly. It’s important that all your programs exit with end rather than falling off the end of the bytecode segment. Unless you’re running in a safe compartment, Parrot tends to crash without a final end.
A full tutorial on writing Parrot assembly isn’t practical here. For more examples, look at the examples/assembly directory in the Parrot distribution. There are quite a few example programs, including a full XML parser and a Mandelbrot set generator.
The Road Ahead
Parrot isn’t complete. This is partly because the design of Perl 6 isn’t complete, partly because we’re not sure how how to implement some of the more interesting and conflicting pieces of the design (such as a good unified object system), and partly because we just haven’t had the time to finish it yet. Despite the best efforts of a lot of good folks, this is an almost entirely volunteer effort. We make do with the time folks have to contribute.
Still, we have a roadmap of features we need to add, and we’re working hard on getting them designed and implemented. Here are some features that lie ahead:
- Objects. Parrot must have good support for objects. Perl 5, Perl 6, Ruby, and Python are all object-oriented, as are Java and C#. If Parrot wants to efficiently support those languages, it needs to have a good object system built in. Unfortunately, those six languages all have different object semantics, making the design of a unified system something of a challenge. While supporting each individual scheme isn’t that tough, building something so that Perl 6 classes can inherit from Ruby or Java classes, or so C# classes can inherit from Perl 5 classes, is the tricky bit.
- Dynamic Loading. The ability to load in shared libraries on-the-fly is a fundamental one to Parrot. Extension modules, PMC classes, and opcode libraries all need to be loaded on demand. Dynamic loading, unfortunately, is fairly platform-specific, both in the actual mechanism to load in the libraries and in the way those libraries need to be built in the first place. It’s not difficult, just tedious and something we’ve not needed desperately enough to build yet.
- Threads. While threads have been kept in mind during the design of Parrot, we don’t, as yet, have working threads. This is something that we need to get done reasonably soon, though, so we don’t start building thread-unsafe code.
- Asynchronous I/O. Our intent is to have an asynchronous I/O system as Parrot’s base I/O model. While asynchronous I/O adds complexity, systems that support asynchronous I/O properly can have a 3X or 4X increase in throughput — a substantial gain well worth the effort. And layering synchrony on an asynchronous system is far easier than layering asynchrony on top of a synchronous system. So starting with an asynchronous model is the better way to go in the long run.
- Event. Any system with async I/O also needs a good event model. Even without async I/O, any program that does GUI programming has to deal with events in some form, so having a unified event model makes sense (without one everyone who needs an event system will implement their own, which can lead to programs that have two or more different event loops, something that’s often untenable). With a single event model, tying GUI events, async file I/O, network I/O, and signal handling together is much easier, giving the programmer a single place to coordinate event handling. Building in an event system means we also provide an abstract, platform-independent, signal- and thread-safe queueing system, which can come in handy at times.
Given magazine lead times, it’s distinctly possible that some of these features will be in by the time you read this. Check out the distribution to see where things stand.
Want to Know More?
Most of Parrot’s documentation can be found in the docs subdirectory of the main distribution. Look in the pdds subdirectory. This holds the “PDDs,” the Parrot Design Documents. These are the canonical documents that describe how Parrot is supposed to work, although they are occasionally at odds with how Parrot actually works.
If you want to join in and help work on Parrot, or are interested in targeting Parrot as a back-end for a language you’re working on, consider joining us. The mailing list is at perl6-internals@perl.org, and, despite the Perl in the name, (a remnant of the project’s origins which, at some point, will be dealt with) everyone who wants to help or watch is welcome.
While we’re not done yet, Parrot is coming along nicely, becoming far more than we’d originally planned. While the leap we had to make technically from being just for Perl to encompassing dynamic languages in general was a small one, the conceptual leap was big, and one we hadn’t imagined when we started the project. The results are both encouraging and exciting, though, and we hope that Parrot will serve as the foundation for many years of interesting language design.
The original impetus to create Parrot was the state of Perl 5’s virtual machine (P5VM). Perl 5’s interpreter was clever when it was first created, but that was ten years ago. In the past decade, it’s been assaulted by scores of programmers, has had a multitude of features grafted on (making new grafts more difficult), and has been bombarded by more gamma radiation and mutagenic chemicals than your average angst-ridden teenage superhero.
Still, the Parrot team is often asked why did we elect to implement a new virtual machine instead of repurposing an existing one. Developers frequently ask, “Why didn’t you just do this?”, where “do this” includes everything from “compile to .NET code,” to “compile to Java bytecode,” and even “use the Scheme-48 engine.” So, what was our rationale? Ultimately, it made the most sense. Here’s a recap of our analysis.
Existing interpreters fall into three general categories: first, there are those interpreters that didn’t exist when we started working on Parrot; next are those interpreters that we didn’t know about when we started working on Parrot; and finally, there are those interpreters that didn’t work well enough when we started working on Parrot.
.NET falls in the first category. The design work for Parrot started in July 2000, right after the meeting that launched the Perl 6 project. At that time, .NET wasn’t publicly available. While .NET is available now, it hasn’t been widely ported, making it untenable as a primary platform (Perl 5 runs on dozens of platforms and Parrot needs to run on the vast majority of them). .NET would also negatively affect performance: static languages like C# perform well within .NET, but dynamic languages like Perl and Ruby would not.
We decided against Java’s JVM for similar reasons. While the JVM has since been ported to more platforms than .NET, it wasn’t as widely available when we started Parrot. Moreover, the JVM cannot process un-closures and continuations, and it’s rather slow (even with the heroic efforts many people have made to get it running faster). Perl 6 is supposed to be faster than Perl 5.
The Scheme-48 interpreter was one we didn’t know about, though it has a number of interesting features and a collection of scary-smart folks working on it. We’d likely consider it if we were starting over. (While Scheme-48 has portability issues, too, the Scheme and Lisp communities have dealt with far more bizarre hardware and operating system software than even the Perl folks have.) |
Dan Sugalski is the designer of the Parrot engine and knows far more than any same human being should about the internals of Perl 5. You can reach Dan at dan@sidhe.org.
No comments yet.