.. SPDX-License-Identifier: GPL-2.0

pKVM on Intel Platform Introduction
===================================

Protected KVM (pKVM) on the Intel platform is designed as a thin
hypervisor that extends KVM to support VMs isolated from the host.

The concept of pKVM was first introduced by Google for the ARM platform
[1][2][3]. It aims to extend the Trusted Execution Environment (TEE)
from the ARM secure world to virtual machines (VMs). Such VMs are
protected by pKVM so that neither the host OS nor other VMs can access
the payloads running inside them (hence the name "protected VM"). More
details about the overall idea, design, and motivation can be found in
Will Deacon's talk at KVM Forum 2020 [4].

There are similar use cases on x86 platforms that request a protected
environment, isolated from the host OS, for confidential computing.
Meanwhile, the host OS still presents the primary user interface, and
users expect the same bare-metal experience as before in terms of both
performance and functionality (such as rich-I/O usages), so the host OS
needs to retain the ability to manage all system resources. At the same
time, the Trusted Computing Base (TCB) shall be minimized in order to
mitigate attacks against the confidential computing environment.

Hardware solutions, e.g. TDX [5], also exist to support the above use
cases, but they are available only on very new platforms. Hence a
software solution for the massive number of existing platforms is also
desirable.

pKVM has the merit of both providing an isolated environment for
protected VMs and sustaining the rich bare-metal experience expected by
the host OS. This is achieved by creating a small hypervisor below the
host OS which contains only the minimal functionality (e.g. VMX, EPT,
IOMMU, etc.) needed for isolating protected VMs from the host OS and
other VMs. Meanwhile the host kernel retains access to most of the
system resources and plays the role of managing VM life cycles,
allocating VM resources, etc. The existing KVM module calls into the
hypervisor (via emulation or enlightened PV ops) to complete the
functionality that has been moved down into the hypervisor, as shown
below::

  +--------------------+     +-----------------+
  |                    |     |                 |
  |      host VM       |     |  protected VM   |
  |     (act like      |     |                 |
  |   on bare metal)   |     |                 |
  |                    |     +-----------------+
  |                    +---------------------+
  |  +--------------------+                  |
  |  | vVMX, vEPT, vIOMMU |                  |
  |  +--------------------+                  |
  +------------------------------------------+
  +------------------------------------------+
  |        pKVM (own VMX, EPT, IOMMU)        |
  +------------------------------------------+

[note: the above figure uses Intel terminology]

The terminology used in this document:

- host VM: the native Linux which boots pKVM and is then deprivileged
  into a VM
- protected VM: a VM launched by the host but protected by pKVM
- normal VM: a VM launched and protected by the host

The pKVM binary is compiled as an extension of the KVM module, but
resides in a separate, dedicated memory section of the vmlinux image.
This makes pKVM easy to release and to verified-boot together with the
Linux kernel image. It also means pKVM is a post-launched hypervisor,
since it is started by the KVM module.

The ARM platform naturally supports different exception levels (ELs),
and the host kernel can be set to run at EL1 during the early boot
stage before launching the pKVM hypervisor, so pKVM just needs to be
installed at EL2. On the Intel platform, the host Linux kernel
originally runs in VMX root mode, and is then deprivileged to run in
VMX non-root mode as a host VM, whereas pKVM keeps running in VMX root
mode. Compared with pKVM on ARM, pKVM on the Intel platform needs this
deprivilege stage to prepare and set up the VMX environment in VMX
root mode.

As a hypervisor, pKVM on the Intel platform leverages virtualization
technologies (see below) to guarantee the isolation between itself and
the lower-privilege guests (including host Linux) on top of it:

- pKVM manages the CPU state/context switches between the hypervisor
  and the different guests. This is largely done through the VMCS.

- pKVM owns the EPT page tables that manage the GPA to HPA mappings of
  its host VM and guest VMs, which ensures they cannot touch the
  hypervisor's memory and are isolated from each other. This is similar
  to pKVM on ARM, which owns the stage-2 MMU page tables to isolate
  memory among the hypervisor, host, protected VMs and normal VMs. To
  allow the host to manage EPT or stage-2 page tables, pKVM can choose
  to provide either PV ops or emulation for these page tables. pKVM on
  ARM chose PV ops, providing hypervisor calls (HVCs) in pKVM for
  stage-2 MMU page table changes. pKVM on the Intel platform provides
  emulation for EPT page table management - this avoids code changes
  in the x86 KVM MMU.

- pKVM owns the IOMMU (VT-d on the Intel platform, SMMU on the ARM
  platform) to manage device DMA buffer mappings and isolate DMA
  accesses. To allow the host to manage the IOMMU page tables, similar
  to the EPT/stage-2 page table management, either the PV ops or the
  emulation method could be chosen. pKVM on ARM chose PV ops [6], while
  pKVM on the Intel platform uses IOMMU emulation.

A KVM Forum 2022 talk about supporting TEEs on x86 client platforms
with pKVM [7] may help you understand more details about the framework
of pKVM on Intel platforms and the deltas between pKVM on the Intel
and ARM platforms.

Deprivilege Host OS
===================

The primary motivation of pKVM on the Intel platform is to be able to
protect a VM's memory from the host, the same as pKVM on ARM. To
achieve this, the pKVM hypervisor shall run at a higher privilege
level, while the Linux host kernel runs at a lower privilege level,
which allows the isolation to be controlled by the pKVM hypervisor. On
the ARM platform with the nVHE architecture, the Linux kernel runs at
EL1 and pKVM runs at EL2, so that pKVM on ARM can use stage-2 MMU
translation to isolate guest memory from the host kernel. Similarly,
on the Intel architecture only the pKVM hypervisor code runs in VMX
root mode and the Linux kernel runs in VMX non-root mode. But the host
Linux kernel boots and runs in VMX root mode, so it needs to be
deprivileged to VMX non-root mode. After that, the host becomes a VM
and its code/data is untrusted by the pKVM hypervisor. Based on the
above, the pKVM code for the Intel platform is divided into two parts:
the deprivilege code (at arch/x86/kvm/vmx/pkvm/) and the hypervisor
code (at arch/x86/kvm/vmx/pkvm/hyp/). The deprivilege code is pKVM
initialization code in the Linux kernel which helps the Linux kernel
deprivilege itself and ensures the pKVM hypervisor keeps running at
the high privilege level. The hypervisor code is the pKVM hypervisor
runtime code, which is independent, self-contained, runs in VMX root
mode and is isolated from the host Linux kernel.

1. Basic common infrastructure
------------------------------

As the pKVM hypervisor is independent of and isolated from the host
Linux, the memory resource it uses shall be reserved and maintained by
itself. On the ARM platform, the memory used by pKVM is reserved
during bootmem_init() from the memblocks, and managed by pKVM through
its own buddy allocator, which is quite generic for the Intel platform
as well. So the memory reservation and buddy allocator are stripped
out of pKVM on ARM to make them a common infrastructure, and the code
is moved to virt/kvm/pkvm/.

1) Memory Reservation
---------------------

The reserved memory size is calculated by pkvm_total_reserve_pages(),
which is architecture-dependent. For the Intel platform, pKVM reserves
the memory for its data structures, the vmemmap metadata of the buddy
allocator, the MMU of the hypervisor, the EPT of the host VM, and the
shadow EPTs of the guests. The reserved memory is physically
contiguous.
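
For illustration, such a calculation boils down to summing the
per-component page counts; a minimal sketch, where all the helper names
are hypothetical rather than the actual in-tree functions::

  unsigned long pkvm_total_reserve_pages(void)
  {
          unsigned long nr_pages = 0;

          nr_pages += pkvm_data_struct_pages();  /* pkvm_hyp data structures */
          nr_pages += pkvm_vmemmap_pages();      /* buddy allocator metadata */
          nr_pages += pkvm_mmu_pages();          /* hypervisor MMU page tables */
          nr_pages += pkvm_host_ept_pages();     /* host VM EPT page tables */
          nr_pages += pkvm_shadow_ept_pages();   /* guest shadow EPTs */

          return nr_pages;
  }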

2) Buddy Allocator
------------------

The buddy allocator is designed and implemented in pKVM on the ARM
platform [8] and is used as a common infrastructure. It is a
conventional 'buddy allocator' working at page granularity: it allows
allocating and freeing physically contiguous pages from memory
'pools', with a guaranteed order alignment in the PA space. Each page
in a memory pool is associated with a struct pkvm_page which holds the
page's metadata, including its refcount as well as its current order,
hence mimicking the kernel's buddy system in the GFP infrastructure.
The pkvm_page metadata is made accessible through a pkvm_vmemmap,
following the concept of SPARSE_VMEMMAP in the kernel.
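
As a sketch, the per-page metadata and its vmemmap lookup could look
like the following (the exact field layout and helper names are
assumptions, not the in-tree definitions)::

  struct pkvm_page {
          unsigned short refcount;  /* number of users of this page */
          unsigned char order;      /* current buddy order of this page */
          unsigned char flags;
  };

  /* One pkvm_page per reserved PFN, like SPARSE_VMEMMAP in the kernel. */
  extern struct pkvm_page *pkvm_vmemmap;

  static inline struct pkvm_page *pkvm_phys_to_page(unsigned long phys)
  {
          return &pkvm_vmemmap[phys >> PAGE_SHIFT];
  }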

Although the buddy allocator is a common infrastructure, it may still
need some architecture-specific APIs, like spinlocks and VA<->PA
translations. These are wrapped into generic APIs, like pkvm_spin_lock,
__pkvm_va(phys) and __pkvm_pa(va), with different architecture
implementations behind them.

The buddy allocator is used by the pKVM hypervisor code to dynamically
allocate and free memory at runtime.

2. Independent binary of pKVM hypervisor
----------------------------------------

As the Linux kernel runs in VMX non-root mode, its code/data is
untrusted by the pKVM hypervisor, and the symbols in the Linux kernel
address space cannot be used by the pKVM hypervisor. To build an
independent pKVM hypervisor binary, a linker script is introduced to
put the hypervisor code and data in separate sections. Doing so makes
it easy to isolate all of the pKVM hypervisor's code/data memory from
the host Linux kernel. This is different from the pKVM deprivilege
code - such code only executes for the deprivilege but not at
hypervisor runtime, so it does not need to be an independent binary.
The deprivilege code is therefore compiled as usual and is able to use
Linux kernel symbols.

As the pKVM hypervisor can only link against its own symbols, while
some common libraries from the Linux kernel are expected to be used by
the pKVM hypervisor as well, those are pulled into pKVM's code section,
e.g., memset, memcpy, find_bit, etc.

To avoid symbol clashes between the pKVM hypervisor code and the Linux
kernel, the prefix '__pkvm_' is added to all pKVM hypervisor symbols.
Doing so also helps to catch, at build time, the case where pKVM links
against symbols without the '__pkvm_' prefix. To reduce redundant code
in pKVM, some of the pKVM hypervisor symbols may be used by the pKVM
deprivilege code. As all the pKVM hypervisor symbols are prefixed with
'__pkvm_', the deprivilege code needs to explicitly add the '__pkvm_'
prefix when calling these symbols, which is implemented by a simple
macro, pkvm_sym(symbol).
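
A minimal sketch of such a macro, mirroring the kvm_nvhe_sym() approach
used by pKVM on ARM (the exact definition and the example symbol name
are illustrative)::

  /*
   * Prepend the hypervisor prefix when the deprivilege code refers to
   * a hypervisor symbol.
   */
  #define pkvm_sym(sym)   __pkvm_##sym

  /* Declaration in the deprivilege code; resolves to __pkvm_init_mmu. */
  extern void pkvm_sym(init_mmu)(void);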

To simplify, the pKVM hypervisor build also removes the ftrace, Shadow
Call Stack and CFI CFLAGS, and disables the stack protector. As the
pKVM hypervisor shouldn't export any symbols, 'EXPORT_SYMBOL' is
disabled as well.

3. pKVM Initialization
----------------------

With CONFIG_PKVM_INTEL=y, pKVM is compiled into the Linux kernel. At
boot time, the Linux kernel reserves physically contiguous memory for
the pKVM hypervisor according to the size calculated by
pkvm_total_reserve_pages(). The reserved memory is used as a memory
pool for pKVM to dynamically allocate its own memory at deprivilege
time and at runtime.

The pKVM deprivilege code starts to run when the kvm-intel module is
loaded, and after the deprivilege finishes, the pKVM hypervisor code
runs in VMX root mode while the rest of the Linux kernel is
deprivileged to VMX non-root mode. The host Linux must be trusted
until pKVM has deprivileged it, so CONFIG_PKVM_INTEL=y selects
kvm-intel as built-in, which can be loaded earlier than user space
booting, so that pKVM can start the deprivilege earlier.

The buddy allocator is not ready until the pKVM hypervisor has set up
the pkvm_vmemmap. Before that, pKVM uses an early_alloc mechanism to
contiguously allocate memory from the reserved area while holding a
lock to avoid races. Unlike the buddy allocator, which can release
allocated memory by putting the reference count in the pkvm_vmemmap,
the early_alloc mechanism has no reference counting, so the memory
allocated by early_alloc is not expected to be released.
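
A minimal sketch of such a bump-style early allocator, using the
pkvm_spin_lock wrapper mentioned earlier (the variable and function
names are hypothetical)::

  static unsigned long early_cur;   /* next free byte in the reserved area */
  static unsigned long early_end;   /* end of the reserved area */
  static pkvm_spinlock_t early_lock;

  void *pkvm_early_alloc_contig(unsigned int nr_pages)
  {
          unsigned long va = 0;

          pkvm_spin_lock(&early_lock);
          if (early_cur + nr_pages * PAGE_SIZE <= early_end) {
                  va = early_cur;
                  early_cur += nr_pages * PAGE_SIZE;  /* never given back */
          }
          pkvm_spin_unlock(&early_lock);

          return (void *)va;
  }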

1) Allocate/Setup pkvm_hyp
--------------------------

pkvm_hyp is a data structure allocated by early_alloc at deprivilege
time. It contains the vmcs_config, the vmx_capability, the MMU/EPT
capabilities, the hypervisor MMU, the physical CPU instances, the host
VM vCPU instances and the host VM EPT.
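
Put together, pkvm_hyp could be sketched as below (the field names are
illustrative; the following paragraphs describe what each part holds)::

  struct pkvm_hyp {
          struct vmcs_config vmcs_config;
          struct vmx_capability vmx_cap;
          /* MMU/EPT capabilities, hypervisor MMU, host VM EPT, ... */

          int num_cpus;                        /* real CPU count */
          struct pkvm_pcpu **pcpus;            /* physical CPU instances */
          struct pkvm_host_vcpu **host_vcpus;  /* host VM vCPU instances */
  };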

The vmcs_config and vmx_capability are set up with the mandatory
capabilities, like EPT and shadow VMCS. To give the best performance
to the host VM, most of the I/O and MSR accesses are configured as
passthrough, as are the interrupts and exceptions. So almost all the
I/O devices (e.g., LAPIC/IOAPIC, serial port I/O, all the PCI/PCIe
devices) can be directly accessed by the host VM, and external
interrupts can be directly injected to the host VM without causing any
vmexit. Only a few necessary vmexits can be triggered by the host VM,
like CPUID, CR accesses and intercepted MSRs. These setups are used
later to configure the VMCS.

Unlike the vmcs_config/vmx_capability structures in pkvm_hyp, the
physical/virtual CPU instances are defined as pointer arrays, and the
instances are allocated by early_alloc according to the real CPU
count. This is because the CPU count differs from platform to
platform, and predefining a structure array with the maximum CPU
count, CONFIG_NR_CPUS, would waste a lot of memory. So the instances
are allocated according to the real CPU count of the platform pKVM is
running on, and each CPU has one physical CPU instance and one virtual
CPU instance.

The physical CPU instance stores the hypervisor's state, e.g., stack
pages, GDT, TSS, IDT, CR3. These states are used to configure the VMCS
host state. As mentioned above, external interrupts are directly
injected to the host VM, so the hypervisor runs with interrupts
disabled and doesn't handle any interrupt. The hypervisor should also
not cause any exception at runtime, so the IDT is initialized with
noop handlers for all the vectors except NMI. NMIs are unmaskable and
may happen while the hypervisor is running, so a valid NMI handler in
the hypervisor code is necessary.
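
The IDT setup can be sketched as below (the gate-setting helper and the
entry-point names are illustrative, not the in-tree ones)::

  #define NR_VECTORS 256

  void pkvm_init_idt(gate_desc *idt)
  {
          int vector;

          /* Every vector is a noop; only NMI gets a real handler. */
          for (vector = 0; vector < NR_VECTORS; vector++) {
                  if (vector == NMI_VECTOR)
                          pkvm_set_idt_gate(idt, vector, pkvm_nmi_entry);
                  else
                          pkvm_set_idt_gate(idt, vector, pkvm_noop_entry);
          }
  }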

The virtual CPU instance stores the host vCPU states using the VMX
structure vcpu_vmx. The VMCS pages and the MSR bitmap page are also
allocated through early_alloc.

4. Deprivilege the Linux Kernel
-------------------------------

Deprivileging the Linux kernel finally makes it run in VMX non-root
mode on each CPU, while the pKVM hypervisor code runs in VMX root
mode. To achieve this, each physical CPU needs to turn on VMX and
vmlaunch into VMX non-root mode.

1) Setup VMCS
-------------

After VMX is on, each CPU can load and set up a VMCS. The VMCS setup
mainly covers the guest state, the host state, and the control states
(execution controls, vmentry/vmexit controls).

The guest state is for the host VM. It is configured with the current
native platform state, including the CR registers, segment registers
and MSRs, so that the Linux kernel can smoothly run in VMX non-root
mode after the deprivilege.

The host state is for the pKVM hypervisor. It is configured with the
hypervisor's own GDT/IDT/TSS for the segment registers, while reusing
the CR registers and MSRs of the current native platform. Reusing the
Linux kernel's CR3 is temporary: CR3 is updated in the finalize phase
once the hypervisor's MMU page table is ready.

The control state is configured according to pkvm_hyp.vmcs_config,
which passes through most of the I/O and MSRs as well as the
interrupts and exceptions. Some resources which are controlled by the
hypervisor need to be intercepted, like the VMX MSRs and the CR4.VMXE
bit. EPT is not enabled at this moment, as the EPT page table is
created in the finalize phase by the pKVM hypervisor code, so EPT will
be enabled later, similar to CR3.

2) Deprivilege
--------------

After the VMCS is set up, pKVM can start the deprivilege by executing
vmlaunch on each CPU. As the Linux kernel will continue to run right
after the vmlaunch, GUEST_RFLAGS/GUEST_RSP are configured from the
current native rflags/rsp registers and GUEST_RIP is set to the code
next to the vmlaunch. Meanwhile, HOST_RSP/HOST_RIP are also properly
configured for running the hypervisor vmexit handlers. With these
setups, after executing vmlaunch the CPU enters VMX non-root mode and
jumps to the place pointed to by GUEST_RIP. At this point, the Linux
kernel runs in VMX non-root mode.
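
The idea can be sketched with KVM's vmcs_writel() accessors (the glue
around the vmlaunch is heavily simplified, and host_rsp,
pkvm_hyp_stack_top and pkvm_vmexit_entry are illustrative names)::

  static void pkvm_deprivilege_cpu(unsigned long host_rsp)
  {
          /*
           * The guest (host VM) resumes right after the vmlaunch with
           * its native rsp/rflags, so the kernel simply keeps running,
           * now in VMX non-root mode.
           */
          vmcs_writel(GUEST_RSP, host_rsp);
          vmcs_writel(GUEST_RFLAGS, native_save_fl());
          vmcs_writel(GUEST_RIP, (unsigned long)&&after_vmlaunch);

          /* On any vmexit the CPU lands on the hypervisor's handler. */
          vmcs_writel(HOST_RSP, pkvm_hyp_stack_top());
          vmcs_writel(HOST_RIP, (unsigned long)pkvm_vmexit_entry);

          asm volatile("vmlaunch");
  after_vmlaunch:
          /* Deprivileged: the Linux kernel is now the host VM. */
          return;
  }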

3) Finalize Phase
-----------------

Although the Linux kernel now runs in VMX non-root mode, the pKVM
hypervisor is not fully ready yet, as the MMU/EPT still need to be
updated to guarantee the isolation between the pKVM hypervisor and the
Linux kernel. At this point, the host VM and the hypervisor are using
the same CR3, without EPT enabled. So after vmlaunch, each CPU uses a
vmcall to enter VMX root mode and trigger the pKVM hypervisor to
complete the last step of the deprivilege, which is to finalize it.

The finalize vmcall takes a struct pkvm_section as its input parameter,
which contains the range of the reserved memory and the hypervisor's
code/data sections. The reserved memory is divided into several parts
through the early_alloc mechanism: #1 pkvm_hyp data structures; #2
vmemmap metadata of the buddy allocator; #3 hypervisor MMU pages; #4
host EPT pages; #5 shadow EPT pages. (Note: part #1 is already
allocated before the deprivilege, and the rest of the parts must not
overlap with part #1.) Then the hypervisor sets up the MMU/EPT with
the divided memory pages.

To enable the buddy allocator for more flexible memory management, the
vmemmap metadata should be mapped in the hypervisor's MMU first. So
creating the hypervisor's MMU is the first thing to do after dividing
the reserved memory. To simplify, the MMU is created by mapping all
the memblocks with the kernel direct mapping VA, and the hypervisor's
code/data sections with the symbol VA. The vmemmap metadata is mapped
at a VA starting from 0. Once all the required mappings are ready, the
hypervisor can update its CR3 register with the new MMU page table,
and after that the hypervisor runs with its own CR3. With the buddy
allocator enabled, the hypervisor page-table management framework can
be used to dynamically manage the map/unmap for the hypervisor MMU and
the host VM's EPT. The page-table management is introduced in the next
section.

To guarantee the isolation, the hypervisor sets up the EPT for the
host VM. The EPT identity-maps all the memblocks. As MMIO is usually
out of the range of the memblocks, all the possible holes between the
memblocks are identity-mapped as well. However, some MMIO may live at
high-end addresses which are difficult to cover by mapping these
holes, so the hypervisor still needs to handle such EPT violations at
runtime. With EPT, the hypervisor can be isolated from the host VM:
the memory which is not expected to be accessed by the host VM, like
the reserved memory and the hypervisor's code/data sections, is
unmapped from the EPT in the finalize phase.

At the end of the finalize phase, the hypervisor code also initializes
the nested-related data, like the shadow vmcs fields, the emulated
vmcs fields and the shadow EPT pages pool.

Although each CPU executes the finalize vmcall, only the first
finalize vmcall needs to divide the reserved memory and set up the
buddy allocator/MMU/EPT, as these are one-time jobs. Once these are
done, the finalize vmcalls on the other CPUs only need to do the
per-CPU work: switching CR3 and enabling EPT.

* Page-table management
-----------------------

As mentioned above, the pKVM hypervisor eventually needs to manage
page tables for its MMU, the host VM EPT, and the shadow EPTs of the
guest VMs. To help support these different page tables, pKVM provides
a general page table walker framework. The framework provides
interfaces for different operations through pgtable_ops and mm_ops.
The pgtable_ops provide operations for page table management, like
setting page table entries, checking whether a page table entry is
present or whether it is a leaf, or getting the entry size per page
table level, etc. Meanwhile the mm_ops provide page-table-related mm
operations, like page allocation, phys<->virt translation and TLB
flushing. MMU and EPT can have different implementations of
pgtable_ops & mm_ops, so they can use the same page table walker
framework to manage their page tables.
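
A sketch of the two ops tables (the member names are illustrative)::

  struct pkvm_pgtable_ops {
          void (*set_entry)(void *ptep, unsigned long pte);
          bool (*entry_present)(unsigned long pte);
          bool (*entry_is_leaf)(unsigned long pte, int level);
          unsigned long (*level_entry_size)(int level);
  };

  struct pkvm_mm_ops {
          void *(*zalloc_page)(void);              /* page allocation */
          void *(*phys_to_virt)(unsigned long pa); /* phys<->virt helpers */
          unsigned long (*virt_to_phys)(void *va);
          void (*flush_tlb)(void);
  };

With these, a single walker implementation can traverse either the
hypervisor MMU page table or an EPT, calling back into the
table-specific ops at each level.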

5. Isolated pKVM hypervisor
---------------------------

At the end of the host deprivilege, the pKVM hypervisor runs as an
independent binary with its own MMU page table. The host VM runs with
EPT enabled, and the EPT unmaps the pKVM hypervisor's code/data
sections as well as the reserved memory. With this, accessing any pKVM
hypervisor memory from the host VM causes an EPT violation into the
hypervisor, which guarantees that the pKVM hypervisor is isolated from
the host VM.


VMX Emulation (Shadow VMCS)
===========================

The host VM wants the capability to run its own guests, so it needs
VMX support.

pKVM is designed to emulate VMX for the host VM based on shadow vmcs.
This requires the "VMCS shadowing" feature in the VMX secondary
processor-based VM-execution controls field [9]::

  +--------------------+       +-----------------+
  |      host VM       |       |    guest VM     |
  |                    |       |                 |
  |     +--------+     |       |                 |
  |     | vmcs12 |     |       |                 |
  |     +--------+     |       |                 |
  +--------------------+       +-----------------+
  +------------------------------------------+      +---------+
  |  +--------+    +--------+                |      | shadow  |
  |  | vmcs01 |    | vmcs02 +------+---------+----->|  vcpu   |
  |  +--------+    +--------+      |         |      |  state  |
  |      +---------------+         |         |      +---------+
  |      | cached_vmcs12 +---------+         |
  | pKVM +---------------+                   |
  +------------------------------------------+

"VMCS shadowing" uses a shadow vmcs page (vmcs02) to cache the vmcs
fields accessed by the host VM through VMWRITE/VMREAD, avoiding
vmexits. The fields cached in vmcs02 are pre-defined by the
VMREAD/VMWRITE bitmaps. Meanwhile, for the other fields not in the
VMREAD/VMWRITE bitmaps, accesses from the host VM cause VMREAD/VMWRITE
vmexits and pKVM needs to cache them in another place - cached_vmcs12
is introduced for this purpose.

The vmcs02 page used in root mode is kept in the structure
shadow_vcpu_state, which is allocated and then donated from the host
VM when it initializes the vcpus for its launched guest (nested). The
same holds for the cached_vmcs12 field.

pKVM uses vmcs02 for two purposes: one, mentioned above, is as the
shadow vmcs page of the nested guest while the host VM programs its
vmcs fields; the other is as the ordinary (or active) vmcs for the
same guest during vmlaunch/vmresume.

For a nested guest, while its vmcs is being programmed from the host
VM, its virtual vmcs (vmcs12) is, according to the above, saved in two
places: vmcs02 for the shadowed fields and cached_vmcs12 for the
non-shadowed fields. The cached_vmcs12 fields in turn come in two
groups: the emulated fields and the host state fields. The emulated
fields shall be emulated to their physical values and filled into
vmcs02 before vmcs02 is activated to do vmlaunch/vmresume for the
nested guest. The host state fields are the guest state of the host
vcpu; they shall be restored to the guest state of the host vcpu's
vmcs (vmcs01) before returning to the host VM.

Below is a summary of the contents of the different vmcs fields in
each of the above mentioned vmcs::

                host state      guest state     control
  ---------------------------------------------------------------
  vmcs12*:      host VM         nested guest    host VM
  vmcs02*:      pKVM            nested guest    host VM + pKVM*
  vmcs01*:      pKVM            host VM         pKVM

  [*]vmcs12: virtual vmcs of a nested guest
  [*]vmcs02: vmcs of a nested guest
  [*]vmcs01: vmcs of host VM
  [*]the security related control fields of vmcs02 are controlled
     by pKVM (e.g., EPT_POINTER)

Below is the vmcs emulation method for the different vmcs fields of a
nested guest::

                 host state       guest state    control
  ---------------------------------------------------------------
  virtual vmcs:  cached_vmcs12*   vmcs02*        emulated*

  [*]cached_vmcs12: vmexit, then get the value from cached_vmcs12
  [*]vmcs02: no vmexit, directly shadowed from vmcs02
  [*]emulated: vmexit, then do the emulation

The vmcs02 & cached_vmcs12 are synced back to vmcs12 during VMCLEAR
emulation, and updated from vmcs12 when emulating VMPTRLD. And before
the nested guest's vmentry (vmlaunch/vmresume emulation), vmcs02
additionally syncs the dirty fields (caused by vmwrite) from
cached_vmcs12 and updates the emulated fields through emulation.
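
The three sync points can be sketched as below (the helper names are
illustrative)::

  /* VMPTRLD emulation: populate the shadow/cache structures from vmcs12. */
  copy_shadowed_fields(/* to */ vmcs02, /* from */ vmcs12);
  copy_nonshadowed_fields(/* to */ cached_vmcs12, /* from */ vmcs12);

  /* Nested vmentry (vmlaunch/vmresume emulation): make vmcs02 runnable. */
  sync_dirty_fields(/* to */ vmcs02, /* from */ cached_vmcs12);
  emulate_fields(/* to */ vmcs02, /* from */ cached_vmcs12);

  /* VMCLEAR emulation: flush everything back to vmcs12. */
  copy_shadowed_fields(/* to */ vmcs12, /* from */ vmcs02);
  copy_nonshadowed_fields(/* to */ vmcs12, /* from */ cached_vmcs12);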


EPT Emulation (Shadow EPT)
==========================

The host VM launches its guests and manages each guest's memory
through an EPT page table maintained in host KVM. But this EPT page
table is untrusted by pKVM, so pKVM shall not directly use it as the
guest's active EPT. To ensure the isolation of guest memory for a
protected VM, the pKVM hypervisor shadows the guest EPT in host KVM
and builds the active EPT page table after the necessary checks (the
checks are based on the page state management which is introduced
later). This is effectively an emulation of the guest EPT page table:
the guest EPT page table in host KVM is called the "virtual EPT",
while the active EPT page table in pKVM is called the "shadow EPT".

How is the shadow EPT built?
----------------------------

Natively, the guest EPT is mostly populated during guest EPT_VIOLATION
vmexit handling:

1. the guest accesses a memory page which has no mapping in the guest
   EPT, triggering an EPT_VIOLATION;
2. the KVM MMU handles the page fault for the EPT_VIOLATION, allocates
   a page, then creates the corresponding EPT mapping.

For pKVM, the majority of the guest EPT population is still the same
as the native case, but more steps are added for the shadowing:

1. the guest accesses a memory page which has no mapping in the shadow
   EPT, triggering an EPT_VIOLATION;
2. pKVM checks whether there is a mapping in the virtual EPT:

   - if yes, go to 5;
   - if no, go to 3;

3. pKVM forwards the EPT_VIOLATION to the host VM;
4. the KVM MMU in the host handles the page fault for the
   EPT_VIOLATION, allocates a page, creates the corresponding virtual
   EPT mapping, then vmresumes back to the guest - back to 1;
5. pKVM shadows the mapping from the virtual EPT into the shadow EPT
   after the page state check.

Emulate INVEPT
--------------

The simplest way to emulate INVEPT is to remove all the mappings in
the shadow EPT. This leads to EPT_VIOLATIONs for all GPAs, after which
all the mappings in the shadow EPT are re-created based on the updated
virtual EPT. It causes a lot of unnecessary shadow EPT_VIOLATIONs, as
most of the entries in the virtual EPT are unchanged. An optimized way
is to add a PV method to do INVEPT with a specific range, so the
shadow EPT only needs to remove the mappings of the necessary range
each time.
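
Such a ranged PV interface could be sketched as below (the hypercall
and helper names are hypothetical, not an existing ABI)::

  /*
   * Host KVM tells pKVM which GPA range an INVEPT covers, so only that
   * range is dropped from the shadow EPT and re-shadowed on the next
   * EPT_VIOLATION.
   */
  int pkvm_pv_invept_range(u64 eptp, u64 gpa, u64 size)
  {
          struct shadow_ept *sept = find_shadow_ept(eptp);

          if (!sept)
                  return -EINVAL;

          shadow_ept_unmap_range(sept, gpa, size);
          return 0;
  }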


Memory Protection based on Page State
=====================================

To protect guest memory, pKVM introduces a mechanism called page state
management. pKVM uses two ignored bits in the EPT page table entry
(bits 56 & 57) to track page states. At the same time, when the page
is non-present in the EPT page table entry, the ignored bits (12~31)
may be used to record the page owner id::

   63 ... 58 |   57 56    | ... | 31 ... 12  | 11 ... 0
  ------------------------------------------------------
      ...    | page state | ... | [owner_id] |   ...

Page state - bits [57:56]:

- PKVM_NOPAGE (00b):
  the page has no mapping in the page table. Under this page state,
  the host EPT uses the pte ignored bits [31:12] to record the
  owner_id.
- PKVM_PAGE_OWNED (01b):
  the page is owned exclusively by the page-table owner.
- PKVM_PAGE_SHARED_OWNED (10b):
  the page is owned by the page-table owner, but is shared with
  another.
- PKVM_PAGE_SHARED_BORROWED (11b):
  the page is shared with, but not owned by, the page-table owner.

Owner_id - bits [31:12] (only valid in the host EPT when PKVM_NOPAGE):

- 0: PKVM_ID_HYP
- 1: PKVM_ID_HOST
- others: PKVM_ID_GUEST

The page states can be recorded in different entities:

- the host EPT (with identity mapping)
- a guest shadow EPT (with gpa to hpa mappings)

After a page is donated from the host VM, the donee's id (the page's
owner id) can be recorded in the corresponding host EPT page table
entry.

Based on these, it's easy to find out a physical page's current owner
and state.
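
The encoding described above could be expressed as below (the macro
names follow the document but are otherwise illustrative)::

  #define PKVM_PAGE_STATE_SHIFT       56
  #define PKVM_PAGE_STATE_MASK        (3ULL << PKVM_PAGE_STATE_SHIFT)

  #define PKVM_NOPAGE                 (0ULL << PKVM_PAGE_STATE_SHIFT)
  #define PKVM_PAGE_OWNED             (1ULL << PKVM_PAGE_STATE_SHIFT)
  #define PKVM_PAGE_SHARED_OWNED      (2ULL << PKVM_PAGE_STATE_SHIFT)
  #define PKVM_PAGE_SHARED_BORROWED   (3ULL << PKVM_PAGE_STATE_SHIFT)

  /* owner_id lives in the ignored bits 31:12, only when PKVM_NOPAGE. */
  #define PKVM_OWNER_ID_SHIFT         12
  #define PKVM_OWNER_ID_MASK          (0xfffffULL << PKVM_OWNER_ID_SHIFT)

  #define PKVM_ID_HYP                 0
  #define PKVM_ID_HOST                1   /* ids above 1 identify guests */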

Page state transition
---------------------

The state machine below defines how page states transition among the
different entities (the host EPT, and the guest shadow EPTs - covering
both normal VMs and protected VMs)::

  +------------------+                  +------------------+
  | host  : NOPAGE   | <--------------- | host : OWNED     |
  | guestA: OWNED    | ---------------> | guest: NOPAGE    |
  +------------------+                / +------------------+
      |  ^                           /      |  ^
      |  |                          /       |  |
      |  |                         /        |  |
      |  |                        /         |  |
      |  |                       /          |  |
      v  |                      /           v  |
  +----------------------------+      +----------------------------+
  | host  : SHARED_BORROWED    |      | host  : SHARED_OWNED       |
  | guestA: SHARED_OWNED       |      | guestB: SHARED_BORROWED    |
  +----------------------------+      +----------------------------+

Initially, all pages except the pKVM-owned ones are owned by the host
VM, so these pages are marked PKVM_PAGE_OWNED in the host EPT.
Meanwhile, before a guest's first EPT_VIOLATION, there is no page
mapped in the guest's shadow EPT, so all the page states in its shadow
EPT are PKVM_NOPAGE.

When a guest EPT_VIOLATION happens, pKVM needs to do the EPT shadowing
to build the shadow EPT page mapping based on the virtual EPT. During
this, the corresponding page's state shall follow the above state
machine to do page donation or page sharing.

1) page donation
----------------

For a protected VM (guestA), during the EPT shadowing, the page
assigned to guestA shall be donated from the host VM, which means the
page's ownership moves from the host to guestA. So in the host EPT,
the mapping of the corresponding page table entry (host_gpa to
hpa (== host_gpa)) is removed and its page state is marked as
PKVM_NOPAGE (meanwhile guestA is recorded as the owner_id). In
guestA's shadow EPT, the mapping of the corresponding page table entry
(gpa to hpa) is set up and its page state is marked as
PKVM_PAGE_OWNED.

Once a page is donated to a guest, it cannot be donated or shared to
other guests before being donated back (undonated) to the host.

Sometimes the host also needs to donate pages to the pKVM hypervisor
(e.g., when creating a VM, its shadow VM data structure is allocated
in the host and then donated to the pKVM hypervisor).

2) page sharing
---------------

For a normal VM (guestB), during the EPT shadowing, the page assigned
to guestB shall be shared from the host VM, which means both the host
VM and guestB can access this page. So in the host EPT, the mapping of
the corresponding page table entry is kept and its page state is
marked as PKVM_PAGE_SHARED_OWNED. In guestB's shadow EPT, the mapping
of the corresponding page table entry is set up and its page state is
marked as PKVM_PAGE_SHARED_BORROWED.

Once a page is shared to a guest, it cannot be donated or shared to
other guests before being unshared back to the host.

For a protected VM (guestA), a page can be shared back to the host VM
after being donated to this guest (e.g., to support virtio). In this
case, in the host EPT, the mapping of the corresponding page table
entry is set up again and its page state is marked as
PKVM_PAGE_SHARED_BORROWED. Meanwhile, in guestA's shadow EPT, the
mapping of the corresponding page table entry is kept but its page
state is changed to PKVM_PAGE_SHARED_OWNED.

Once a page is shared back to the host after being donated, guestA is
allowed to unshare it, and this page can also be returned back to the
host directly.


Misc
====


NMI handling in pKVM
--------------------

Normally pKVM shall not trigger any exception, but NMIs cannot be
masked in VMX root mode, so pKVM shall provide an appropriate handler
for them. The NMI handler needs to ensure that an NMI which happens in
VMX root mode is captured and then injected back to the host VM, to
avoid losing NMIs.

The NMI injection is done as the last step before the vmentry to the
host VM, but there is still the case that an NMI happens just after
the injection step. To avoid a long delay, pKVM enables the irq window
whenever an NMI happens in VMX root mode, so that the NMI injection
flow can be quickly completed at the next vmentry. After that, the
host VM will vmexit once it enables interrupts, no matter whether the
NMI was already injected or not. This may cause a spurious vmexit, but
does not cause any trouble.
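
The capture-and-inject flow can be sketched as below (the structure and
helper names are illustrative; the VMCS field and interrupt-info macros
are the standard KVM ones)::

  /*
   * NMI handler in VMX root mode: record the event and open the irq
   * window, so the injection can complete at the next vmentry even if
   * this NMI raced with the injection step below.
   */
  void pkvm_nmi_handler(struct pkvm_host_vcpu *vcpu)
  {
          vcpu->pending_nmi = true;
          pkvm_enable_irq_window(vcpu);
  }

  /* Last step before the vmentry back to the host VM. */
  static void pkvm_inject_pending_nmi(struct pkvm_host_vcpu *vcpu)
  {
          if (vcpu->pending_nmi) {
                  vcpu->pending_nmi = false;
                  vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
                               INTR_INFO_VALID_MASK | INTR_TYPE_NMI_INTR |
                               NMI_VECTOR);
          }
  }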


[1]: https://lwn.net/Articles/836693/
[2]: https://lwn.net/Articles/837552/
[3]: https://lwn.net/Articles/895790/
[4]: https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
[5]: https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[6]: https://lore.kernel.org/linux-arm-kernel/[email protected]/T/
[7]: https://kvmforum2022.sched.com/event/15jKc/supporting-tee-on-x86-client-platforms-with-pkvm-jason-chen-intel
[8]: https://lore.kernel.org/r/[email protected]
[9]: SDM: Virtual Machine Control Structures chapter, VMCS TYPES.