10 Years of Virtual Machine Performance (Semi) Demystified
By Michael Mullany | October 5th, 2009 at 1:10PM
First, a solid statement: virtualization has always levied a CPU "tax." Early on it was very high; recently, not so much. Probably the most comprehensive non-vendor benchmark of virtualized vs. native performance is AnandTech's, which recently showed anywhere from a 2% to a 7% CPU tax on a fully loaded system running mixed-workload 4-CPU virtual machines on recent hardware.
The seminal project inaugurating this generation of x86 virtual machines was the Disco project at Stanford, which published its key paper in 1997. That project (three of its four authors were future founders of VMware) built a virtual machine monitor that ran the Irix operating system on the FLASH research multiprocessor.
The Stone Age: VMware Workstation
The reasons for the overheads were outlined by members of the VMware engineering team in a 2001 Usenix paper. The paper gamely showed that, with several optimizations, it was possible to reach full native throughput for networking workloads (10/100BaseT), although the CPU work spent processing that workload was about 4x what was required in the native environment. The paper also pointed out several possible further optimizations.
The Iron Age: Paravirtualization and Virtual SMP
Meanwhile, VMware was introducing the first multi-CPU guest virtual machines. This was a long performance optimization task. In the early stages of development in 2002, Virtual SMP achieved only about 5% of native performance, but over 18 months of steady optimization it reached about 75% of native, and it shipped at around that level. Around the same time (about early 2004), Intel shipped the first generation of its VT technology, slightly ahead of AMD's equivalent. Ironically, this initially decreased VMware's performance on some workloads, and VT did not see much adoption at first. A great backgrounder on the impact of VT technology is Ole Agesen's primer from VMworld 2007.
Since 2005, VMware and Xen have gradually reduced the performance overheads of virtualization, aided by the Moore's law doubling in transistor count, which inexorably shrinks overheads over time. AMD's Rapid Virtualization Indexing (RVI, 2007) and Intel's Extended Page Tables (EPT, 2009) substantially improved performance for a class of recalcitrant workloads by moving the mapping from Guest OS "physical" memory pages to machine-level pages out of software and into silicon. For operations that stress the MMU, like an Apache compile with lots of short-lived processes and intensive memory access, performance doubled with RVI/EPT. (Xen showed similar challenges on compilation benchmarks prior to RVI/EPT.)
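To see why that mapping matters, here is a toy sketch of the two-level translation that RVI/EPT perform in hardware. The table contents and page numbers are invented for illustration, and real hardware walks multi-level radix page tables on TLB misses rather than dictionaries; this only shows the shape of the lookup that previously had to be emulated in software via shadow page tables.

```python
# Toy model of nested paging (RVI/EPT-style), for illustration only.
# Page numbers and table contents are made up.

# Guest page table: guest-virtual page -> guest-"physical" page.
# The guest OS maintains this and believes it is the real mapping.
guest_pt = {0x10: 0x20, 0x11: 0x21}

# Nested table: guest-"physical" page -> machine page.
# With RVI/EPT the hardware consults this second table itself;
# before, the hypervisor had to fold both levels into shadow
# page tables in software, trapping on guest page-table updates.
nested_pt = {0x20: 0x90, 0x21: 0x91}

def translate(gva_page):
    """Walk both levels: guest-virtual -> guest-physical -> machine."""
    gpa_page = guest_pt[gva_page]   # the guest's own mapping
    return nested_pt[gpa_page]      # the hypervisor-controlled mapping

print(hex(translate(0x10)))
```

Workloads with many short-lived processes hurt most under shadow paging because every new address space forced the hypervisor to rebuild these combined mappings; hardware nested paging removes those traps, which is why compile-style benchmarks roughly doubled.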
Some of the other performance advances have included interrupt coalescing, IPv6 TCP segmentation offloading, and NAPI support in the new VMware vmxnet3 driver. However, the last year has also seen two big advances: direct device mapping, enabled by the current generation of CPUs (e.g. Intel VT-d, first described back in 2006), and the first generation of I/O adapters that are truly virtualization-aware.
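Of those driver-level advances, interrupt coalescing is the easiest to picture: instead of raising one virtual interrupt per packet completion, the device model batches completions and raises one interrupt per batch. This is only a conceptual sketch, not VMware's implementation; the batch size of 32 is an arbitrary example.

```python
# Conceptual sketch of interrupt coalescing; not any vendor's actual
# algorithm. Real drivers also bound the delay before an interrupt
# fires so latency stays acceptable under light load.

def interrupts_needed(packets, batch_size=1):
    """Interrupts required to signal `packets` completions,
    delivering at most `batch_size` completions per interrupt."""
    return -(-packets // batch_size)  # ceiling division

# One interrupt per packet vs. coalescing 32 completions at a time:
uncoalesced = interrupts_needed(10_000, batch_size=1)   # 10,000
coalesced = interrupts_needed(10_000, batch_size=32)    # 313
```

Since each virtual interrupt costs a guest exit and re-entry, cutting the interrupt count by an order of magnitude or more directly reduces the CPU tax on network-heavy workloads.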
Before Intel VT-d, 10GigE workloads became CPU-limited at around 3.5 Gb/s of throughput. Afterwards (and with appropriate support in the hypervisor), throughputs above 9.6 Gb/s have been achieved. More important, however, is the next generation of I/O adapters, which spin up mini virtual NICs in hardware and connect them directly to virtual machines, eliminating the need to copy networking packets around. This is one of the gems in Cisco's UCS hardware, which tightly couples a new NIC design with matching switch hardware. We're now at the stage where, if you're using this year's VMware or Xen technology, Intel Nehalems and Shanghai Opterons, and the new I/O adapters, virtualization has most performance issues pretty much beat.