
Exclusive: Xen Grows Up

Xen 3.0 provides vastly improved stability and a wealth of new features, and it also supports unmodified operating systems and enterprise hardware. Here’s a look at the latest version, one that’s ready for production environments.

In the past year, development of the open source Xen
virtualization platform (http://www.cl.cam.ac.uk/netos/xen/) has forged
ahead at a rapid pace, adding support for hardware virtualization
and for features of large-scale enterprise server hardware such as
symmetric multiprocessor (SMP) guests and physical
address extensions (PAE). Simultaneously, the Xen project has
amassed a substantial community of developers and refined the
software to be stable and robust. Now, with its third major release,
Xen is ready for “The Big Show”: production use.

Up until the recent release of Xen 3.0, a
major obstacle to the adoption of Xen in some environments was the
software’s lack of support for unmodified operating systems.
Xen’s original approach of paravirtualization, modifying an operating system to
facilitate virtualization, yielded great performance, but failed to
host operating systems for which source code is unavailable.

Fortunately, with the launch of new x86
CPUs that provide hardware support for virtualization
(Intel’s VT extensions and AMD’s
SVM both provide on-chip support for
creating virtualized processor contexts), unmodified operating
systems can now be hosted on Xen. In fact, developers from the
Intel Open Source Technology Center have been working diligently
with the Xen source over the past year and have incorporated
support for VT into the Xen tree. The Intel team recently announced
the ability to host unmodified Linux kernels above Xen.

Hardware support for virtualization is not a panacea, though.
While necessary to host unmodified operating systems, CPU
extensions require software support. Two problems left unsolved by
hardware virtualization extensions are bootstrap code and access to
devices. Bring-up code runs in 16-bit real mode, but VT and SVM only
provide support for the virtualization of 32-bit protected mode CPU
contexts. Device interfaces remain unvirtualized.

Bootstrapping the Operating System

The x86 processor has evolved considerably over its lifetime
(the 16-bit 8086 was first introduced in
1978, making the architecture almost thirty years old). However,
Intel has maintained complete backward compatibility through the
entire history of the chip. A Pentium 4
comfortably runs binaries written for an 8086.

As the CPU evolved, the need for backward compatibility led to
special considerations to support new features, such as 32-bit
operation and hardware support for virtual memory. Hence, Intel
introduced the notion of a mode-change,
effectively escalating the backward-compatible 16-bit startup mode
of the processor to full 32-bit mode with extra instructions for
features like virtual memory support and special instructions for
multimedia.

Consistently throughout this evolution, though, the x86
architecture has presented a serious barrier to virtualization. The
CPU provides a set of four privilege “rings”: the
operating system typically runs in ring zero, the most privileged level of
execution, while applications usually run in ring three. (Modern
operating systems make no use of rings one and two.) However, when
virtualizing hardware, the virtual machine monitor (VMM) must run
in the most privileged mode, and the OS is typically demoted to
ring one, a technique known as ring
compression.

Removing an OS from the most privileged mode of execution
changes the behavior of many instructions. Many instructions are
available only in ring zero, and trap into the VMM when issued in
ring one. These are handled by the VMM, which can validate the
safety of the instruction and virtualize it as necessary.
Unfortunately, there is a small set of instructions (such as
POPF) that behave differently in ring one,
but do not trap to the VMM. Without the virtualization extensions,
this problem requires either paravirtualization (manually replacing
the problematic instructions with explicit calls to the VMM) or
slower solutions such as emulation, or code scanning and binary
rewriting, as used in other virtualization software.

The privileged instruction problem is one of the issues solved
by the x86 virtualization extensions, which handle these
problematic instructions either by correctly virtualizing them (as
in the case of POPF) or by properly trapping
to the VMM.
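
The POPF problem can be made concrete with a small model. The sketch below is not Xen code; it is a simplified Python model of how protected-mode POPF treats only the interrupt-enable flag (IF, bit 9 of EFLAGS), and the function names are hypothetical. When the current privilege level is numerically greater than the I/O privilege level, the processor leaves IF unchanged and raises no fault, so a VMM never learns that the guest tried to disable interrupts.

```python
IF_FLAG = 1 << 9  # interrupt-enable flag, bit 9 of EFLAGS


def popf(current_flags, popped_flags, cpl, iopl=0):
    """Simplified model of POPF in protected mode (IF bit only).

    cpl  -- current privilege level (0 = ring zero, 1 = ring one, ...)
    iopl -- I/O privilege level; assumed to be 0 here
    """
    if cpl <= iopl:
        # Privileged enough: IF takes the value popped from the stack.
        return (current_flags & ~IF_FLAG) | (popped_flags & IF_FLAG)
    # Not privileged enough: IF is silently preserved and nothing traps.
    return current_flags


def interrupts_enabled(flags):
    return bool(flags & IF_FLAG)


flags = IF_FLAG  # interrupts start out enabled
print(interrupts_enabled(popf(flags, 0, cpl=0)))  # ring zero: IF cleared
print(interrupts_enabled(popf(flags, 0, cpl=1)))  # ring one: IF unchanged
```

A guest kernel demoted to ring one would believe interrupts are off after the second call, while its virtual processor state says otherwise. That mismatch is what paravirtualization, binary rewriting, or the hardware extensions have to resolve.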

Two other CPU-related issues that remain unsolved are the
handling of virtual memory and the initial bootstrapping of an OS
into 32-bit protected mode with paging enabled, which occurs when
an OS starts up. Xen assists a VM in managing virtual memory
through the use of shadow paging, which is
beyond the scope of this article.

Unfortunately, the period of time between CPU start and when the
operating system requests a transition to 32-bit protected mode
with paging enabled is not fully handled by the CPU extensions. The
code that runs in this duration typically just sets up initial CPU
and virtual memory states prior to switching modes, but some
operating systems, such as Windows, appear
to do considerably more, perhaps due to their heritage running on
16-bit platforms.

The solution to this problem is to take advantage of some
virtualization support that’s been on x86 processors since
the 80386. VM86 mode
is a virtualization of 16-bit mode that was originally intended to
support legacy 16-bit applications under 32-bit operating systems.
Xen 3.0 introduces a new tool, called VMX assist,
that runs in ring zero. On starting an unmodified OS,
VMX assist first boots the OS in VM86 mode, inside a
hardware-virtualized CPU. Certain 16-bit instructions (for
instance, control register interactions) are not available in VM86
mode and cause the OS to trap. VMX assist catches these traps and
emulates the instructions with help from Xen. VMX assist also helps
the OS through 16-bit execution and then hands it off to execution
on the hardware-virtualized CPU when it switches to 32-bit
protected mode.
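
The trap-and-emulate flow described above can be sketched as a toy loop. This is an illustration only and does not reflect the actual VMX assist source: the instruction names, the `run_16bit_boot` function, and the choice to treat only control-register writes as sensitive are all assumptions. The one architectural fact relied on is that setting bit 0 of CR0 (the protection-enable bit) requests protected mode.

```python
CR0_PE = 0x1  # protection-enable bit of control register CR0


def run_16bit_boot(instructions):
    """Toy monitor loop for a guest's 16-bit startup code.

    `instructions` is a list of (name, operand) pairs. Only "mov_cr0" is
    treated as sensitive, which is a simplification for illustration.
    Returns the final state and the list of instructions that were emulated.
    """
    cr0 = 0
    emulated = []
    for name, operand in instructions:
        if name == "mov_cr0":
            # Sensitive instruction: it traps, and the monitor emulates it.
            emulated.append(name)
            cr0 = operand
            if cr0 & CR0_PE:
                # Guest asked for protected mode: hand it off to the
                # hardware-virtualized CPU.
                return "handoff", emulated
        # Every other instruction is assumed to run directly in VM86 mode.
    return "still_16bit", emulated


boot_code = [("mov_ax", 0x10), ("out_port", 0x70), ("mov_cr0", CR0_PE)]
print(run_16bit_boot(boot_code))
```

The point of the sketch is the division of labor: ordinary 16-bit instructions run unassisted, and the monitor only steps in for the few that VM86 mode cannot execute.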

Device Drivers

While the new CPU extensions allow unmodified operating systems
to run in isolated CPU contexts, they’re unable to solve
the virtualization problem for hardware external to the CPU,
specifically devices. For this reason, software support is required
to allow unmodified virtualized operating systems to safely share
access to physical devices.

The first approach to this problem has been to provide device
support that the unmodified OS code can directly use. The system
must appear to have devices such as hard disks and network
interface cards that the OS has drivers for. Xen takes advantage of
the device emulation provided by emulators such as QEMU to present the expected device interface to the
virtual machine. The emulator code is modified to convert the
device interactions into requests to Xen’s existing device
infrastructure. Requests are served by an isolated device domain,
and then passed back to the emulator to be delivered to the virtual
machine.

While this approach works well enough, it should be obvious that
a more efficient solution is possible. High-performance device
access is made possible by providing paravirtualized device drivers
for the unmodified OS. The administrator of the OS can install
Xen-specific drivers for disk, network, and USB, and see the same high-speed I/O available to
paravirtualized guest operating systems.

Figure One shows a variety of device
configurations above Xen 3.0:

*An unmodified Linux domain uses standard IDE driver interfaces to
access its disk, with requests forwarded through the hardware
emulator and then to the device domain.

*A Windows XP guest, also running on virtualization extensions, and a
paravirtualized NetBSD virtual machine both
use paravirtualized network device drivers for improved I/O.

*Finally, a paravirtualized
Linux VM uses a native device driver to interact directly with a
video capture card.

Xen in the Enterprise

Since the initial public release of Xen, there has been
considerable interest from organizations with large enterprise
networks that want support for high-performance virtualization. Over
the past year, much effort has been spent addressing the needs of
these “big iron” systems.

The Xen development team has added hardware support for
architectural features of server-class hardware, including SMP
support for multiprocessor virtual machines, PAE support for very
large memories, and ports of Xen to 64-bit architectures.
Additionally, the team has taken steps to ensure that Xen is a
reliable platform. Through work with the Linux Test Project (LTP,
http://www.linuxtestproject.org/), IBM, and Intel,
Xen now includes a variety of automated testing and verification
suites. Finally, XenSource (http://www.xensource.com/), a Xen-based
startup, provides support for organizations that want to use Xen in
production environments.

For 3.0, Xen has been updated to support multiple “virtual
CPUs” (VCPUs) per VM, allowing SMP-aware operating systems to
fully exploit large multiprocessor systems. The virtual CPUs that
the VM runs on typically execute on different physical CPUs.
However, running multiple VCPUs on a single physical CPU is an
excellent way to debug the guest kernel. Xen currently allows up to
32 VCPUs in a single VM, all of which can be scheduled on a
uniprocessor system.

As with memory, Xen can dynamically adjust the amount of CPU
resource allocated to each VM by adjusting the scheduling allowance
of each VCPU, and by “hotplugging” VCPUs. This allows a
VM to be allocated more VCPUs during peaks in workload, which can
then be reclaimed when the peak subsides.

Live relocation of multiprocessor VMs is very similar to the
simpler uniprocessor case. When a guest OS shuts down, it first
voluntarily “unplugs” all but its primary VCPU. The
suspend record that is saved and restored on the target system need
not change format.

Beyond the 4 Gigabyte Barrier

Modern server workloads typically demand lots of memory to work
at peak performance. Until recently, x86 systems were limited to at
most 4 GB of RAM, a limit imposed by 32-bit addressing.
However, this limitation has been worked around in recent
years by extending the physical address lines on processors to 36
(or 40) bits, allowing up to 64 GB (or 1 TB) of memory.

Xen 3.0 has been updated to be aware of the new PAE operating
mode that’s required to take advantage of extra addressing.
This mode extends the page-table format to three levels, and
increases the size of each page-table entry to permit physical
addresses larger than 32 bits.
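
The arithmetic behind these limits, and the three-level lookup, can be checked with a few lines of Python. This is an illustrative sketch rather than Xen code, and the function names are hypothetical. The bit split shown (2 + 9 + 9 + 12 bits) is the standard PAE layout for 4 KB pages, in which each table holds 512 eight-byte entries instead of 1,024 four-byte ones.

```python
GB = 1 << 30


def max_physical_memory(address_bits):
    """Bytes addressable with the given number of physical address lines."""
    return 1 << address_bits


def split_pae_address(vaddr):
    """Split a 32-bit virtual address into PAE table indices (4 KB pages)."""
    pdpt_index = (vaddr >> 30) & 0x3    # top level: 4 entries
    pd_index = (vaddr >> 21) & 0x1FF    # page directory: 512 entries
    pt_index = (vaddr >> 12) & 0x1FF    # page table: 512 entries
    offset = vaddr & 0xFFF              # byte offset within the 4 KB page
    return pdpt_index, pd_index, pt_index, offset


for bits in (32, 36, 40):
    print(bits, "address bits:", max_physical_memory(bits) // GB, "GB")

print(split_pae_address(0xC0101234))
```

Note that the virtual address is still 32 bits wide. PAE widens only the physical addresses stored in the page-table entries, which is why each VM's address space stays at 4 GB even though the machine can hold far more.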

The 4 GB limitation is not restricted to CPUs, though. Just
as old ISA systems could address
only a few megabytes of system memory, many PCI-based systems and adapter cards are restricted to
32-bit addressing. To work around this, Xen’s memory
allocator is augmented with a “32-bit safe” memory
pool, the contents of which are guaranteed to be accessible by
legacy PCI hardware. Guest operating systems that manage hardware
can then use this new memory pool in two ways: they can ensure that
certain I/O buffers (network packet buffers, for example) are
always allocated from that pool, or in cases where that isn’t
possible, a “bounce buffer” can be allocated from the
pool, allowing data to be copied between it and the original
device-inaccessible buffer.
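
The bounce-buffer decision can be sketched as follows. This is a toy model and not Xen's allocator: the `SafePool` class, the `prepare_dma` function, and the dictionary standing in for physical memory are all hypothetical names chosen for illustration.

```python
FOUR_GB = 1 << 32


class SafePool:
    """Hypothetical pool of memory guaranteed to lie below 4 GB."""

    def __init__(self, base, size):
        if base + size > FOUR_GB:
            raise ValueError("pool must lie entirely below 4 GB")
        self.next_free = base
        self.end = base + size

    def alloc(self, length):
        if self.next_free + length > self.end:
            raise MemoryError("32-bit safe pool exhausted")
        addr = self.next_free
        self.next_free += length
        return addr


def prepare_dma(buffer_addr, length, pool, memory):
    """Return (address the device should use, whether a copy was needed).

    `memory` is a dict mapping addresses to bytes, standing in for RAM.
    """
    if buffer_addr + length <= FOUR_GB:
        return buffer_addr, False          # device can reach it directly
    bounce_addr = pool.alloc(length)       # allocate a reachable buffer
    memory[bounce_addr] = memory[buffer_addr]  # copy data into it
    return bounce_addr, True


pool = SafePool(base=0x10000000, size=0x10000)
memory = {FOUR_GB + 0x2000: b"packet"}
print(prepare_dma(FOUR_GB + 0x2000, 6, pool, memory))
```

The extra copy is the cost of the second strategy, which is why allocating I/O buffers from the safe pool in the first place is preferable when the guest OS can arrange it.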

These problems are somewhat eased on modern systems that have an
IOMMU, which allows high-memory buffers to be
mapped into a small (typically 64 MB) window below the 4 GB
addressing boundary. This allows legacy devices to access such
buffers without the copying overhead of bounce buffers. Also, most
server-class systems have included 64-bit PCI slots for several
years now, and high-performance network adaptors and SCSI
controllers invariably have the ability to use those extra address
lines. Thus, the best way to ensure high performance is to buy
suitably equipped hardware.

Architecture Support for 64-bit Computing

Although an effective “band aid,” PAE is really no
more than a workaround to prolong the life of a dated processor
architecture on server systems. The future of the x86 family lies
with its more recent and far-reaching 64-bit extensions, generally
called x86/64. Supporting what is
effectively a new sub-architecture of x86 required some significant
changes to Xen. Crucially, many arcane features of x86 such as
hardware task switching and memory segmentation have been largely
swept away by x86/64.

Included in this revamp is the omission of the feature that
allowed Xen to protect itself from guest operating systems by
running them in protection ring 1. On x86/64, because of the
simpler segmentation model, it is not possible to make memory
that’s accessible from ring 0 inaccessible from ring 1. For
safety, guest operating systems now run in ring 3, which is usually
reserved for user applications.

But that raises an obvious question: How does Xen protect the
guest OS from guest applications?

The answer is to run the OS in a separate “address
space” from its applications. When a new address space is
created, the OS usually allocates a new page-table directory that
includes not only application memory mappings, but also mappings
that are private to the OS and should be inaccessible to the
application. However, because the OS runs in ring 3 it’s
impossible to use hardware to protect those mappings. Instead, a
second directory is created that contains only the application mappings. Both page directories are
registered with Xen, and when switching between user and
guest-kernel contexts, Xen switches the page-table base
pointer.
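
A minimal model of the two-directory scheme looks like this. It is a sketch under stated assumptions, not the Xen implementation: the class and method names are hypothetical, page tables are reduced to Python dictionaries, and a failed lookup (returning `None`) stands in for a page fault.

```python
class GuestAddressSpace:
    """Two page directories for one process, as registered with the VMM."""

    def __init__(self, app_mappings, kernel_mappings):
        # Directory used while the application runs: application pages only.
        self.user_table = dict(app_mappings)
        # Directory used while the guest kernel runs: everything.
        self.kernel_table = {**app_mappings, **kernel_mappings}


class ToyHypervisor:
    """Switches the active page-table base on user/kernel transitions."""

    def __init__(self, space):
        self.space = space
        self.active_table = space.user_table

    def enter_guest_kernel(self):
        self.active_table = self.space.kernel_table

    def return_to_user(self):
        self.active_table = self.space.user_table

    def translate(self, vaddr):
        # None models a page fault: the address is not mapped right now.
        return self.active_table.get(vaddr)


space = GuestAddressSpace({0x1000: 0xA000}, {0xC000: 0xB000})
vmm = ToyHypervisor(space)
print(vmm.translate(0xC000))  # application context: kernel page is hidden
vmm.enter_guest_kernel()
print(vmm.translate(0xC000))  # guest-kernel context: kernel page is visible
```

Protection here comes from which table is installed, not from the privilege ring, which is why both the guest kernel and its applications can run in ring 3.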

Although this may sound expensive, the transitions to and from
Xen are actually very cheap, because of the performance-tuned
SYSCALL and SYSRET
instructions. In contrast, to switch between rings 1 and 0 on
32-bit x86, Xen is forced to use a flexible but much more expensive
trap gate. Even the cost of the page-table
switch is ameliorated on AMD processors, which contain a
TLB flush filter, a piece of hardware that
prevents many page-table switches from causing a complete flush of
the TLB cache. However, even on CPUs without this filter, the cost
of refilling the TLB, with page-table entries that are typically in
the CPU’s data cache, is quite small.

Ensuring Stability

There have been many advances in the area of testing and
verification of Xen over the past year. The Linux Test Project
(http://www.linuxtestproject.org) has built an automated set of
tests for Xen called XenTest. This test
suite builds the Xen and Linux trees, and performs a variety of
tasks intended to ensure that Xen remains fully functional.

The Xen development team has also produced a testing CD, which
can be used to carry out platform testing of Xen across a wide
range of hardware, and several test labs have expressed their
willingness to run it. These CDs
are being developed as part of a larger effort to build a
comprehensive regression suite for Xen. For example, the developers
now run ongoing test instances of Xen using industry standard
benchmarks such as SPECWeb to generate
workload.

Ongoing Research Work at the University of Cambridge

Xen was the product of research work at the University of
Cambridge Computer Laboratory. As Xen has matured into a piece of
production software, the University’s research efforts into
virtualization have continued. Here is a quick look at some of the
ongoing, Xen-related research projects.

*Parallax: Storage Management for Virtual Machines. While virtual
machines enable innovative management of operating systems, they
also present some serious challenges to managing storage. VMs can
be suspended and restarted at any time, migrated across physical
machines, duplicated, and “rewound” to historical
checkpoints. Unfortunately, existing storage managers, both for
local storage and in the cluster, do not provide the flexibility
required to manage virtual machines. Parallax is a distributed
storage system for virtual machines. It saves storage space by
allowing large numbers of VMs to share common, copy-on-write
template images, and allows virtual disks to be efficiently
checkpointed and duplicated. Storage bandwidth, a major concern in
clusters of virtual machines, is reduced by aggressively caching
data on local disks. Parallax is still in its infancy, but is
available in the Xen development tree. The team expects Parallax to
mature a great deal over the next year.

*Pervasive Debugging: Whole-system Debugging with Virtual Machines. The
Pervasive Debugging (PDB) project takes advantage of Xen to allow
the debugging of entire systems. PDB allows a running OS, or even a
distributed system consisting of many operating systems, to be
debugged on a single physical host. Two unique aspects of PDB are
the ideas of vertical and horizontal debugging. In vertical debugging, a developer
may follow events through the layers of a system, allowing the
transmission of a packet, for instance, to be traced through an
application, across a system call, and through the OS network
stack. Horizontal debugging allows a developer to validate
assertions across a set of hosts in a distributed system, greatly
assisting in the identification of concurrency problems. PDB also
takes advantage of the VMM’s control over devices, allowing
developers to set watchpoints on the contents of disk blocks and
network packets.

More information on these projects, and others, is available
through the Xen project website.

What A Difference A Year Makes

Since writing for Linux Magazine a year ago (read the October
2004 feature story “Now and Xen” online at
http://www.linux-mag.com/2004-10/xen_01.html), Xen has matured from a
research VMM into a stable virtualization package. Xen is now
included in major Linux distributions and is hosting production
servers in industrial settings.

Thanks to a growing developer community and support from several
industrial labs, Xen now features support for unmodified operating
systems on newer CPUs with virtualization support, and runs well on
enterprise-class servers.

Andrew Warfield is a Ph.D. student at the University
of Cambridge and hopes to finish his degree later this year. Keir
Fraser is a Lecturer at the Computer Laboratory and a founder of
XenSource.
