Implementation

A hypervisor, the software agent that manages the logical partitioning of the hardware, forms the foundation of our system software stack. It provides high-level memory, page table, and interrupt management as well as partition scheduling policies. We based our initial prototype on the architecture employed by the IBM research hypervisor, rHype. rHype is a small (~30k lines of code), low-latency, modular, multi-platform (x86 and PowerPC) para-virtualization engine. Logical partitions (LPARs) access rHype services through a "system call"-like mechanism known as an hcall. We have also ported our libOS framework to run on the Xen hypervisor on the PowerPC platform.

Figure 1: I/O design



A mainstream operating system, Linux, provides the primary user interface for launching, monitoring, and controlling libOS partitions. This operating system partition is designated the "controller partition" or Dom0, and has special authority to allocate and control the partitioned resources of the machine. A wrapper script, which varies with the underlying hypervisor, manages the creation of new logical partitions, starts the necessary support applications, and redirects standard I/O.

Within rHype, a special device is used to issue hcalls and communicate with peer partitions. This device can be memory mapped to provide a window into the private memory of the libOS partitions. In our Xen implementation, we use the Xentools API to set up a shared memory window into the application partition's memory. Ring buffers are established within the libOS partition and can be used for communication with peer partitions, including the controller. In the prototype, these ring buffers are statically compiled into the application. Before launching the application, the wrapper scripts examine the application image's symbol information (using nm(1)) to determine the location of the ring buffer within it. Since the prototype I/O is handled by user-space applications on the controller, simple polling is used to detect I/O activity.
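As an illustration of the ring buffer and polling scheme described above, the following is a minimal sketch rather than the prototype's code. It assumes a layout of two 32-bit little-endian indices (head, then tail) followed by the data area; the source does not specify the real layout. The names Ring and poll_ring, the offset, and the capacity are hypothetical. In the prototype the offset would come from the symbol lookup and the buffer would be the mapped shared memory window; here a plain bytearray stands in for it.

```python
import struct
import time


class Ring:
    """Single-producer, single-consumer byte ring inside a shared buffer.

    Assumed layout at `offset`: head (u32), tail (u32), then `capacity` data bytes.
    One slot is always left empty so that head == tail means "empty".
    """

    def __init__(self, buf, offset, capacity):
        self.buf = buf
        self.off = offset
        self.cap = capacity
        self.data = offset + 8  # data area starts after the two indices

    def _load(self):
        return struct.unpack_from("<II", self.buf, self.off)

    def write(self, payload):
        """Producer side: copy payload in, then publish the new tail."""
        head, tail = self._load()
        free = self.cap - 1 - (tail - head) % self.cap
        if len(payload) > free:
            return False  # not enough room; caller may retry later
        for byte in payload:
            self.buf[self.data + tail] = byte
            tail = (tail + 1) % self.cap
        struct.pack_into("<I", self.buf, self.off + 4, tail)
        return True

    def read(self):
        """Consumer side: drain everything available, then publish the new head."""
        head, tail = self._load()
        out = bytearray()
        while head != tail:
            out.append(self.buf[self.data + head])
            head = (head + 1) % self.cap
        struct.pack_into("<I", self.buf, self.off, head)
        return bytes(out)


def poll_ring(ring, interval=0.01, timeout=1.0):
    """Poll the ring until data arrives or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = ring.read()
        if data:
            return data
        time.sleep(interval)
    return b""


# Stand-in for the shared memory window; offset 16 is an arbitrary example.
shared = bytearray(64)
producer = Ring(shared, 16, 8)
consumer = Ring(shared, 16, 8)
```

Because each side writes only its own index, the producer and consumer need no lock between them in this model. A real shared memory implementation would also need memory barriers, which this sketch omits.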

Resources and services are provided to libOS partitions via the 9p resource sharing protocol, originally developed as part of the Plan 9 operating system. 9p is the protocol that runs over the I/O channels described above. Some resources, such as the controller's file system and access to its character and block devices, may be provided by 9p directly. However, most resources are gatewayed by synthetic file server applications. These servers export hierarchical file systems representing system services and resources such as the TCP/IP stack, console I/O, or windowing system. More information on providing system services and resources through Plan 9 synthetic file systems can be found in the various Plan 9 technical papers and documentation.
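To give a sense of what travels over these channels, the sketch below encodes the first message of a 9P2000 session, Tversion. It follows the published 9P2000 framing (a 4-byte size that counts itself, a 1-byte type, a 2-byte tag, little-endian integers, and strings prefixed with a 2-byte length). The helper names and the msize value of 8192 are illustrative assumptions, not taken from the prototype.

```python
import struct

TVERSION = 100   # 9P2000 message type for the version request
NOTAG = 0xFFFF   # tag value used for Tversion


def pack_string(text):
    """9P strings are a 2-byte length followed by UTF-8 bytes (no terminator)."""
    raw = text.encode("utf-8")
    return struct.pack("<H", len(raw)) + raw


def pack_message(mtype, tag, body):
    """Frame a message as size[4] type[1] tag[2] body; size includes itself."""
    size = 4 + 1 + 2 + len(body)
    return struct.pack("<IBH", size, mtype, tag) + body


def tversion(msize=8192, version="9P2000"):
    """Build a Tversion request proposing a maximum message size and version."""
    body = struct.pack("<I", msize) + pack_string(version)
    return pack_message(TVERSION, NOTAG, body)


def unpack_header(message):
    """Return (size, type, tag) from the fixed 7-byte header."""
    return struct.unpack_from("<IBH", message, 0)


request = tversion()
```

The resulting bytes could be written into a channel such as the ring buffers described earlier; the server would answer with an Rversion message using the same framing.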

We implemented two stand-alone synthetic file system gateways as part of the file system prototype. The first is a network gateway which provides socket services in a fashion similar to the Plan 9 devip device. The second is a partial implementation of the cons device, which provides console interaction with the libOS partition and gives the libOS access to system information such as time. The synthetic file systems are organized within the global Linux name space by mounting them using the v9fs Linux kernel module. This name space is then re-exported to the libOS partition by the npfs application, which gateways POSIX file system name spaces to 9p.

Stability and performance concerns, along with the desire to quickly support a broader range of service and resource gateways, led us to adopt the Inferno environment as a replacement for the v9fs/npfs stack. Inferno is a distributed operating system which also runs as hosted middleware on many more traditional operating systems (Linux, Windows, etc.). It also uses the 9p protocol for resource sharing, and in its hosted environment it contains a large number of gateway synthetic file systems to the underlying operating system's resources and services. We ported the Inferno hosted environment to Linux/PPC and added a new device file system (devxenchan) which is used to manage shared memory channels to the underlying application partitions. Like the Plan 9 network stacks (such as devip), devxenchan operates as a 'clone' file system, allowing a single Inferno instance to communicate with multiple child partitions.
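The 'clone' convention can be pictured with a small in-memory model. In Plan 9 network devices, opening the clone file allocates a new numbered connection directory with its own ctl and data files. The sketch below mimics only that allocation pattern. The source does not describe the actual file layout or control messages of devxenchan, so the class name CloneDir, the method names, and the "connect 3" command are all hypothetical.

```python
class CloneDir:
    """Toy model of a clone-style device: each open of 'clone' yields a new channel."""

    def __init__(self):
        self.conns = {}    # channel number -> per-channel state
        self.next_id = 0

    def open_clone(self):
        """Allocate a fresh channel, as opening the clone file would."""
        conn_id = self.next_id
        self.next_id += 1
        self.conns[conn_id] = {"ctl": [], "data": bytearray()}
        return conn_id

    def paths(self, conn_id):
        """Files that would appear under the new channel's directory."""
        return [f"{conn_id}/ctl", f"{conn_id}/data"]

    def write_ctl(self, conn_id, command):
        """Record a control message for one channel (command text is hypothetical)."""
        self.conns[conn_id]["ctl"].append(command)

    def write_data(self, conn_id, payload):
        """Append bytes to one channel without touching the others."""
        self.conns[conn_id]["data"] += payload


dev = CloneDir()
first = dev.open_clone()     # channel for one child partition
second = dev.open_clone()    # independent channel for another
dev.write_ctl(first, "connect 3")
dev.write_data(second, b"ping")
```

Each channel keeps its own state, which is what lets one server instance serve several partitions at once.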

A side benefit of using Inferno as the Dom0 infrastructure is a natural path to cluster solutions. The Dom0 partitions can export control interfaces over TCP/IP, which can then be leveraged by a front-end node. By mounting resources (such as file systems or networking stacks) from the front-end node, a whole cluster of machines can be managed as a single node. This provides a very attractive solution for blade clusters and other scale-out systems.