Monday, October 29, 2007

TPR patching

I'm heading off to Japan tomorrow morning for the Linux Foundation Japan Symposium but instead of packing like I should, I figured I'd post about an exciting new feature in KVM.

First, a little background. Even with hardware accelerated virtualization (using VT or SVM), there is a lot of emulation required for IO devices. While there are probably at least 15-20 different devices that must be emulated for a virtual machine, only a few are performance sensitive. The two most notable are the network card and the disk controller. Since all Operating Systems support a wide variety of these devices, we can invent a fake network card that is designed to be emulated in a high performance way, write guest drivers for it, and everything works out nicely (these are commonly called paravirtual device drivers).

There are some devices in the modern PC that you cannot take this approach with because there simply aren't many kinds of them. For instance, there are really only a couple of kinds of interrupt controller, so most Operating Systems don't provide a mechanism for loading interrupt controller device drivers. Instead, these devices are baked deeply into the Operating System's core.

For the most part, none of these devices affect performance significantly. The notable exception is the local APIC. The local APIC is a per-processor interrupt controller whose interface is memory-mapped. This means that an OS communicates with the local APIC by writing to a special memory location. In particular, the local APIC has a register called the TPR (task priority register). Certain OSes (namely, Windows) access the TPR extremely frequently. If you've used Windows under KVM, you may be familiar with the ACPI work-around which effectively tricks Windows into thinking there isn't a local APIC. The result is a significant increase in performance since we no longer have to emulate thousands of TPR accesses per second. Unfortunately, ACPI is a useful thing. You can't have SMP without it. Disabling it is not really a great solution to the problem.
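To make the "memory-mapped register" part concrete, here is a minimal sketch of how an OS touches the TPR. On real hardware the local APIC lives at a fixed physical address (typically 0xFEE00000) and the TPR sits at offset 0x80 within it; this sketch takes the base as a parameter so it can be exercised against an ordinary buffer instead of a real MMIO page. The function names are mine, not from any OS or from KVM.

```c
#include <stdint.h>

/* TPR offset within the local APIC's memory-mapped register page
 * (per the x86 local APIC layout). */
#define APIC_TPR_OFFSET 0x80

/* Set the task priority. On real hardware this single store is an MMIO
 * write; when the guest runs virtualized, every such store is a trap
 * that the hypervisor must catch and emulate -- which is exactly why
 * frequent TPR accesses are so expensive. */
static inline void apic_set_tpr(volatile uint8_t *apic_base, uint8_t prio)
{
    *(volatile uint32_t *)(apic_base + APIC_TPR_OFFSET) = prio;
}

static inline uint8_t apic_get_tpr(volatile uint8_t *apic_base)
{
    return (uint8_t)*(volatile uint32_t *)(apic_base + APIC_TPR_OFFSET);
}
```

An OS raising and lowering the TPR around interrupt handling will execute this kind of store thousands of times per second, and each one is a full VM exit when emulated.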

At this past KVM Forum, Ben Serebrin, from AMD, shared an interesting observation. Windows guests only access the TPR with instructions that are at least 5 bytes long. The significance of 5 bytes is that it happens to be the size of a call instruction on the x86. This means that you can replace any of the TPR access instructions with a call without the need for fancy dynamic translation. If you're very clever about hiding routines within the BIOS (it turns out, Windows always has a valid virtual mapping to the BIOS), you can actually rewrite the TPR access instructions to instead be calls to functions, which you provide, that access the TPR in a more efficient way.
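The patching step itself can be sketched as follows. The 5-byte call in question is the x86 near call, opcode 0xE8 followed by a 32-bit displacement relative to the next instruction. This is an illustrative sketch of the idea, not the actual KVM code: the function name and interface are made up, and a real implementation has to deal with finding the instruction and placing the helper routine in the BIOS.

```c
#include <stdint.h>
#include <string.h>

#define CALL_OPCODE 0xE8
#define CALL_LEN    5
#define NOP_OPCODE  0x90

/* Overwrite a TPR-accessing instruction in place with a call to a
 * helper routine. insn points at the host copy of the instruction
 * bytes, insn_addr is its guest virtual address, insn_len its length,
 * and target the guest address of the helper. Returns -1 if the
 * original instruction is too short for the call to fit -- this is
 * why Serebrin's "always at least 5 bytes" observation matters. */
static int patch_tpr_access(uint8_t *insn, uint32_t insn_addr,
                            uint32_t insn_len, uint32_t target)
{
    int32_t rel;

    if (insn_len < CALL_LEN)
        return -1;          /* can't patch without clobbering a neighbor */

    /* The displacement is relative to the instruction *after* the call. */
    rel = (int32_t)(target - (insn_addr + CALL_LEN));

    insn[0] = CALL_OPCODE;
    memcpy(&insn[1], &rel, sizeof(rel));    /* x86 is little-endian */

    /* Pad any leftover bytes with NOPs so execution falls through
     * cleanly after the call returns. */
    memset(&insn[CALL_LEN], NOP_OPCODE, insn_len - CALL_LEN);
    return 0;
}
```

Because the original instruction was at least 5 bytes, the rewritten call plus NOP padding occupies exactly the same space, so no surrounding code has to move.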

Avi Kivity posted an implementation of this to KVM recently. The results are quite dramatic. Windows XP installs are at least twice as fast--perhaps even faster. The very latest Intel processors have a hardware feature that ends up with the same result but the nice thing about a purely software approach is that it will work with older processors.

This code hasn't made its way into a KVM release yet as it needs a bit more testing and clean-up. I suspect we won't see it in a release for a couple more weeks but once it's there, you can re-enable ACPI in your Windows guests and enjoy good performance :-)

Monday, October 08, 2007

The Myth of Type I and Type II Hypervisors

This is something that has bothered me for a while but that I've never gotten a chance to articulate. In the virtualization community, the terms "type-1" and "type-2" hypervisors get thrown around a lot--often carrying different meanings. Lately, "type-2" is being used as a derogatory term suggesting that a virtualization solution is "lesser" than a true "type-1" hypervisor.

The most common definition of "type-1" and "type-2" seems to be that "type-1" hypervisors do not require a host Operating System. In actuality, all hypervisors require an Operating System of some sort. Usually, "type-1" is used for hypervisors that have a micro-kernel based Operating System (like Xen and VMware ESX). In this case, a macro-kernel Operating System is still required for the control partition (Linux for both Xen and ESX).

The whole argument of micro-kernel vs macro-kernel hosts is a different blog post (just as a spoiler, I think one can make a better argument for macro-kernel hypervisors). I want to focus, instead, on why we have these terms and what they really mean.

Virtualization theory really started with a paper from Gerald Popek and Robert Goldberg called Formal Requirements for Virtualizable Third Generation Architectures. The paper is a mathematical proof of the architectural requirements to allow virtualization. It is very terse and I don't expect most people have read it. The paper focuses on implementing full virtualization on native hardware and focuses on things like whether privileged instructions are trappable. It was written in 1974 and Operating Systems were not actually all that common back then. Many people think the terms "type-1" and "type-2" originated from this paper but that is simply not the case. The paper does mention the concept of recursive virtualization and briefly discusses the requirements to allow one virtual machine to run within another virtual machine.

As best as I can tell, the terms "type-1" and "type-2" originate from a paper by John Robin called Analyzing the Intel Pentium's Capability to Support a Secure Virtual Machine Monitor. This paper was Robin's master's thesis at the Naval Postgraduate School. There are two versions of the paper available, the actual master's thesis and a condensed version for USENIX 2000.

This paper is really an application of the Popek/Goldberg proof to the Pentium architecture. A few points were missed, but it does a rather good analysis of why the Pentium architecture did not satisfy the Popek/Goldberg requirements for virtualization. Now, some folks at VMware have made a rather compelling case that this is in fact incorrect because the Popek/Goldberg proof does not eliminate the possibility of using dynamic translation. At any rate, Robin makes a distinction between "type-1" and "type-2" VMMs. The reason for the distinction is simple. When discussing "type-1" VMMs that access hardware directly, the set of requirements to enable Secure Virtualization entirely depends on the hardware. When discussing "type-2" VMMs, however, you do not have direct access to hardware so the requirements to enable virtualization are actually at the Operating System interface. A true "type-2" VMM is just a process in an Operating System and is not capable of accessing hardware directly.

The important point to take away here is that all modern virtualization solutions (except for unaccelerated QEMU maybe) are technically "type-1" VMMs according to Robin. The things commonly cited as "type-2" VMMs like VMware Workstation, Parallels, VirtualPC, and KVM all rely on kernel modules which means they do have direct access to hardware. This makes all of these solutions "type-1" VMMs. What's more important though is that the distinction of "type-1" and "type-2" has absolutely no bearing on performance, robustness, or any other qualitative factor. It is merely a distinction made when attempting to formulate a proof about whether virtualization is possible or not. It starts to lose meaning too when an Operating System is capable of supporting a true "type-2" VMM (which, arguably, the KVM interface in Linux enables). Does that mean that Linux is a "type-1" VMM and QEMU using the KVM interface is a "type-2" VMM? How can the same solution be both though? IMHO, the introduction of the term "type-2" was really a mistake on Robin's part, perhaps a misunderstanding of the section of the Popek paper regarding recursive virtualization. That's just speculation though. The distinction really just doesn't make much sense in my mind.

So if you've made it this far, I hope you'll agree that these terms really have no practical meaning and will join me in refraining from using them in the future :-)