TPR patching
I'm heading off to Japan tomorrow morning for the Linux Foundation Japan Symposium but instead of packing like I should, I figured I'd post about an exciting new feature in KVM.
First, a little background. Even when doing hardware accelerated virtualization (using VT or SVM), there is a lot of emulation that is required for IO devices. While there are probably at least 15-20 different devices that must be emulated for a virtual machine, there are only a few that are performance sensitive. The two most notable are the network card and disk controller. Since all Operating Systems support loading drivers for a wide variety of these devices, we can invent a fake network card, write a driver for it, and emulate it in a high performance way, and everything works out nicely (these are commonly called paravirtual device drivers).
There are some devices in the modern PC that you cannot write drivers for because there simply aren't that many of them. For instance, there are really only a couple kinds of interrupt controllers so most Operating Systems don't provide a mechanism for loading interrupt controller device drivers. Instead, these devices are baked in deeply within the Operating System's core.
For the most part, none of these devices affect performance significantly. The notable exception is the local APIC. The local APIC is a per-processor interrupt controller whose interface is memory-mapped. This means that an OS communicates with the local APIC by reading and writing special memory locations. In particular, the local APIC has a register called the TPR (task priority register). Certain OSes (namely, Windows) access the TPR extremely frequently. If you've used Windows under KVM, you may be familiar with the ACPI work-around, which effectively tricks Windows into thinking there isn't a local APIC. The result is a significant increase in performance since we no longer have to emulate thousands of TPR accesses per second. Unfortunately, ACPI is a useful thing. You can't have SMP without it. Disabling it is not really a great solution to the problem.
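To make the cost concrete, here is a toy model of a trapped TPR, as a rough sketch rather than anything KVM actually does. The register offset (0x80 within the APIC page) and the default physical base come from the x86 architecture; the TrappingApic class, its exit counter, and the helper names are made up for illustration. The masking rule is simplified too: it ignores in-service interrupts, which also raise the processor's effective priority.

```python
APIC_DEFAULT_BASE = 0xFEE00000  # default physical address of the local APIC page
APIC_TPR_OFFSET = 0x80          # offset of the TPR within that page


def priority_class(value):
    """Return the priority class (bits 7:4) of a vector or a TPR value."""
    return (value >> 4) & 0xF


def is_masked_by_tpr(vector, tpr):
    """True if the TPR alone holds back an interrupt with this vector.

    Simplification: in-service interrupts are ignored.
    """
    return priority_class(vector) <= priority_class(tpr)


class TrappingApic:
    """Hypothetical model: every guest TPR access costs one exit to the hypervisor."""

    def __init__(self):
        self.tpr = 0
        self.exits = 0

    def write_tpr(self, value):
        self.exits += 1          # the memory-mapped write traps
        self.tpr = value & 0xFF

    def read_tpr(self):
        self.exits += 1          # so does the read
        return self.tpr


if __name__ == "__main__":
    apic = TrappingApic()
    for _ in range(1000):        # a guest raising and lowering its priority
        apic.write_tpr(0x20)
        apic.write_tpr(0x00)
    print("exits:", apic.exits)
```

Each of those exits is a full round trip out of the guest, which is why a guest that touches the TPR thousands of times per second pays so dearly.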
At this past KVM Forum, Ben Serebin, from AMD, shared an interesting observation. Windows guests only access the TPR with instructions that are at least 5 bytes long. The significance of 5 bytes is that it happens to be the size of a near call with a 32-bit displacement on the x86. This means that you can replace any of the TPR access instructions with a call, in place, without the need to do fancy dynamic translation. If you're very clever about hiding routines within the BIOS (it turns out Windows always has a valid virtual mapping to the BIOS), you can actually rewrite the TPR access instructions to instead be calls to functions that you provide, which access the TPR in a more efficient way.
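Here is a minimal sketch of the byte-level rewrite, assuming a 32-bit guest. It is not Avi's code. The helper names, the addresses, and the idea of padding longer instructions with NOPs are my own assumptions for illustration; the only hard facts used are that opcode 0xE8 followed by a 32-bit displacement is a 5-byte near call, and that the displacement is measured from the end of the call instruction.

```python
import struct

CALL_OPCODE = 0xE8   # near call with a 32-bit relative displacement
CALL_LEN = 5         # one opcode byte plus four displacement bytes
NOP = 0x90


def encode_call(site, target):
    """Encode a call to `target` as it would appear at address `site`."""
    # The displacement is relative to the end of the call instruction.
    # Masking to 32 bits models address wrap-around in a 32-bit guest.
    rel = (target - (site + CALL_LEN)) & 0xFFFFFFFF
    return bytes([CALL_OPCODE]) + struct.pack("<I", rel)


def patch_site(code, base, site, insn_len, target):
    """Overwrite the instruction at `site` with a call, padding with NOPs.

    `code` is a bytearray holding guest code loaded at address `base`.
    """
    if insn_len < CALL_LEN:
        raise ValueError("instruction is too short to hold a call")
    start = site - base
    padding = bytes([NOP]) * (insn_len - CALL_LEN)
    code[start:start + insn_len] = encode_call(site, target) + padding


if __name__ == "__main__":
    # A 5-byte load of EAX from an absolute address, followed by a ret.
    # The address 0xFFFE0080 stands in for the guest's TPR mapping and is
    # an assumption, as are the load address and the routine address.
    code = bytearray(bytes.fromhex("a18000feff" "c3"))
    patch_site(code, base=0x1000, site=0x1000, insn_len=5, target=0x2000)
    print(code.hex())
```

Because the call is never longer than the instruction it replaces, nothing after it has to move, which is the whole reason dynamic translation isn't needed. A real implementation would also need a separate routine for each register and access type, since the replacement call carries no operands of its own.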
Avi Kivity posted an implementation of this to KVM recently. The results are quite dramatic. Windows XP installs are at least twice as fast--perhaps even faster. The very latest Intel processors have a hardware feature that ends up with the same result but the nice thing about a purely software approach is that it will work with older processors.
This code hasn't made its way into a KVM release yet as it needs a bit more testing and clean-up. I suspect we won't see it in a release for a couple more weeks, but once it's there, you can re-enable ACPI in your Windows guests and enjoy good performance :-)