Shedding Light on Huawei's Security Hypervisor

All recent Huawei devices ship with a security hypervisor, a defense-in-depth measure designed to enhance kernel security. Unlike other OEMs, Huawei encrypts this privileged piece of software, which is why it has received little to no public scrutiny. With this blog post, we aim to cast light on its inner workings and provide an in-depth analysis of its implementation, from its entry point to the functions dedicated to protecting the kernel at runtime.

Introduction

In the mobile device world, long gone are the days when exploiting a single kernel vulnerability to execute arbitrary code or modify kernel data structures was enough to fully compromise the system. Vendors have addressed these concerns by introducing new technologies to enhance the security of their products. For ARM-based devices, the most prominent security feature is ARM TrustZone, a secure zone in the CPU that executes concurrently with the main OS in an isolated environment.

Yet there exist other lesser-known components designed to harden a device, including security hypervisors that rely on ARM's virtualization extensions. Virtualization is traditionally used to emulate multiple hardware environments in which operating systems execute in complete isolation. In the case of a security hypervisor, the isolation features are leveraged to watch over a single kernel and ensure its integrity at run-time, mainly by filtering physical memory accesses and intercepting critical operations.

Most of the major Android phone manufacturers, such as Qualcomm, Samsung and Huawei, have developed their own security hypervisor implementations. In the past, we've detailed the internals of Samsung RKP, Samsung's implementation, and explained how we compromised it. In this blog post, we will investigate Huawei's security hypervisor, a component deployed on all recent Huawei mobile phones that has never, to our knowledge, received any public scrutiny.

We begin with an overview of the ARM virtualization extensions and introduce the concept of a security hypervisor. We then dive into Huawei's implementation and detail the hypervisor's boot process as well as its memory management system. We continue by explaining how the interception of system register writes, as well as instruction and data aborts resulting from faulting memory accesses, enables the hypervisor to retain control over the kernel page tables. In addition, we describe how the hypervisor and monitor calls are used to protect various pieces of kernel data.

Overview

ARM Virtualization Extensions

Modern software is split into different components, each requiring its own level of access to system and processor resources. In the ARMv8-A architecture, the privileges are divided into different levels, known as Exception Levels (EL), numbered from EL0 (the least privileged) to EL3 (the most privileged). Exception levels are also divided into two security states, Secure and Non-Secure, that separate the system into two Worlds in which:

  • regular user applications, which we directly interact with, are unprivileged and run at EL0;
  • the kernel, which requires a higher level of access, executes at EL1;
  • a security-focused hypervisor, or even a fully-fledged one, runs at EL2;
  • and finally, the secure monitor, which is responsible for switching between the two worlds and has access to both secure and non-secure resources, runs at EL3.

armv8-a exception levels

The current EL can only get higher when the processor takes an exception or lower when it returns from one using the Exception Return (ERET) instruction. User applications (EL0) can call into the kernel (EL1) using the Supervisor Call (SVC) instruction that triggers a synchronous exception that is then handled by the kernel. Likewise, one can call into the hypervisor (EL2) using the Hypervisor Call (HVC) instruction and into the secure monitor using the Secure Monitor Call (SMC) instruction.
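
As an illustration, here is a minimal sketch of how EL1 code typically issues an HVC, assuming GCC-style inline assembly and the SMC Calling Convention (arguments in x0-x1, result in x0); an SMC is issued the same way with the smc instruction. The helper name is ours.

#include <stdint.h>

// Issues an HVC from EL1. The hypervisor typically dispatches on the function
// identifier passed in x0 rather than on the instruction's immediate.
static inline uint64_t hvc_call(uint64_t func_id, uint64_t arg0) {
  register uint64_t x0 asm("x0") = func_id;
  register uint64_t x1 asm("x1") = arg0;
  asm volatile("hvc #0" : "+r"(x0) : "r"(x1) : "memory");
  return x0;
}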

To prevent components running at lower ELs from breaking the isolation, a hypervisor needs to emulate critical operations, including reads and writes to privileged system registers. To that effect, the virtualization extensions introduce the trapping of specific actions. When an operation is trapped, instead of being performed at EL1 like it normally would, an exception is raised and the operation is handled at EL2 instead. A dedicated system register, the Hypervisor Configuration Register (HCR_EL2), controls which operations should be trapped.
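
For reference, the HCR_EL2 bits that matter for the rest of this post can be written as the following C constants (bit positions taken from the ARM Architecture Reference Manual; the macro names are ours):

#define HCR_EL2_VM  (1ULL << 0)   // Enables stage 2 translation for EL1&0.
#define HCR_EL2_PTW (1ULL << 2)   // Stage 1 walks that access device memory fault at stage 2.
#define HCR_EL2_TSC (1ULL << 19)  // Traps SMC instructions executed at EL1.
#define HCR_EL2_TVM (1ULL << 26)  // Traps EL1 writes to the virtual memory control registers.
#define HCR_EL2_RW  (1ULL << 31)  // EL1 executes in AArch64 state.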

Restrictions are also applied to the kernel's virtual address space. To that effect, the address translation is extended with a second stage that uses its own set of page tables, which are fully under the hypervisor's control. When this mechanism is enabled, the standard walk of the stage 1 page tables, starting at the base address held in the Translation Table Base Register (TTBR0_EL1 for EL0 accesses and TTBR1_EL1 for EL1 accesses), is still performed. But instead of translating a Virtual Address (VA) directly into a Physical Address (PA), the resulting output address is an Intermediate Physical Address (IPA). This IPA then needs to be translated into the final PA by walking the stage 2 page tables, starting at the base address held in the Virtualization Translation Table Base Register (VTTBR_EL2). At each of the two stages, different access controls can be specified in the translation tables and applied to the memory pages.

address translation process

After this brief description of the basic building blocks provided by the ARM architecture, let's focus on how Huawei used them to construct their security hypervisor.

Huawei Hypervisor Execution Environment

Huawei Hypervisor Execution Environment, or HHEE, is Huawei's security hypervisor implementation. As explained in the previous section, its main role is to supervise the kernel at runtime and prevent alterations to the system that would lead to privilege escalation. In order to analyze HHEE, the first step is to retrieve its binary.

Retrieving the Hypervisor Binary

Security research on Huawei devices has a high entry cost. Critical binaries, whether stored on the device or as part of an OTA update, are always encrypted. Unless you manage to leak the secret key, decrypting them requires taking over the device pretty early in the bootchain by attacking components such as the Bootrom, Xloader, or Fastboot that are themselves encrypted. However, finding and exploiting a vulnerability without the cleartext versions of the targeted binaries can be quite challenging.

Luckily, security research teams such as Taszk and Pangu have published the details of an exploitable vulnerability in Huawei's Bootrom. While it doesn't affect the latest Kirin 9000 chipsets, all the older devices were vulnerable. However, Huawei managed to fix this issue with a software update through an interesting trick detailed on Taszk's blog.

Once we have control over the Bootrom, the decryption method depends on the chipset version. For older versions, the encryption key can be extracted directly from the fuses, while on newer chipsets, the device must be used as a decryption oracle. In both cases, we end up with cleartext binaries ready to be analyzed.

Analyzing the Hypervisor Binary

In this blog post, we will document the inner workings of the HHEE image extracted from the 11.0.0.260 firmware for the Huawei P40 Pro. The base address of this binary, needed to load it into one's favorite SRE framework, can be found in the kernel sources that are made available on Huawei's Open Source website.

#define HISI_RESERVED_HHEE_PHYMEM_BASE 0x10F00000
#define HISI_RESERVED_HHEE_PHYMEM_SIZE (0x600000)

A quick glance at the disassembled binary shows that it is self-contained and fairly small (around 300 functions). However, it has no symbols and no useful strings that would have helped with reverse engineering. Fortunately, because the kernel has to call into the hypervisor, the kernel-side definitions can be used to infer the intended functionality of the corresponding hypervisor's functions. All the names of the HVCs can be found in the include/linux/hisi/hkip_hhee.h and include/linux/hisi/hhee_prmem.h files.

The rest of this article is dedicated to providing a thorough description of the hypervisor's design choices and inner workings, resulting from our best reverse engineering efforts. It should be noted that all the names given to functions, variables, and types are our own.

Initialization

Powering-On the Primary Core

When a Huawei device is powered on, booting Android (or rather, HarmonyOS) is a multi-stage process. It starts off in the Bootrom, which performs the first initialization operations and prepares the device for the second stage of the bootchain: Xloader (and UCE). Xloader is decrypted and authenticated by the Bootrom before execution is handed over to it to continue the initialization of Kirin-specific components and settings. Execution is then yielded to the third stage, Fastboot, where things get interesting for us. This component, part of the Android ecosystem, can be interacted with using a protocol of the same name. It implements different commands that allow a user to perform multiple operations, such as flashing a partition, unlocking their device, or changing vendor-specific settings. However, during the normal boot process, it is mainly responsible for loading the Android kernel and all the components it depends on, namely the hypervisor.

Up to this point, every component has been running at EL3 on a single CPU core – the primary core. During the boot process, when Fastboot is tasked with loading the kernel in memory, the primary core will first load the secure monitor (or BL31 if we follow ATF's nomenclature), the security hypervisor, and the trusted OS. We then enter the boot phase, which makes a little detour through the monitor and the hypervisor before finally jumping to the kernel at EL1.

huawei devices bootchain

The next section explains exactly what happens in the hypervisor, from the moment it's started by the monitor to the moment it hands over control to the kernel. Although it only details the operations performed by the primary core, we will come back to the secondary ones afterwards.

Global and Local State

Before digging into the initialization code, we first need to introduce two structures that are extensively used by HHEE.

The first structure is the cpu_list, which is actually a linked list that can be iterated upon using the next field. This structure is used to store the per-CPU state, and each booted core has its own instance.

typedef struct cpu_list {
  uint64_t mpidr_el1;
  cpu_list_t* next;
  sys_regs_t* sys_regs;
  uint8_t is_booting;
  uint8_t ttbr1_el1_processed;
} cpu_list_t;

The head of the linked list is stored in the g_cpu_list global variable.

cpu_list_t g_cpu_list;

The current core's instance is also stored in its TPIDR_EL2 system register, which can be retrieved using the current_cpu macro.

#define current_cpu (cpu_list_t*)get_tpidr_el2()

The second structure is sys_regs, which contains the saved values of the most important system registers.

typedef struct sys_regs {
  uint8_t regs_inited;
  uint64_t mair_el1;
  uint64_t tcr_el1;
  uint64_t ttbr1_el1;
  uint64_t unkn_20;
  uint64_t elr_el2;
  uint32_t unkn_30;
  uint32_t unkn_34;
} sys_regs_t;

Since the saved values are the same for all cores, the sys_regs field of every cpu_list entry points to a single instance, g_sysregs.

sys_regs_t g_sysregs;

Primary Core Boot

As said previously, control is handed over from the monitor, and the primary CPU core starts the hypervisor's execution with entrypoint_primary. This function performs the operations listed below.

  • Since we currently only have a physical address space, we make sure the MMU, WXN, and PAN are disabled. They will be enabled once the virtual address space has been set up.
  • The global value for the stack cookie is initialized in stack_chk_guard_setup using the current CPU physical timer.
  • The EL2 thread local storage is set to a pointer to g_cpu_list, which contains register values and information related to the execution.
  • The Exception Vector Table and the heap are initialized.
void entrypoint_primary(saved_regs_t args) {
  // SCTLR_EL2, System Control Register (EL2).
  //
  //   - M,    bit [0]  = 0: MMU disabled.
  //   - WXN,  bit [19] = 0: WXN disabled.
  //   - SPAN, bit [23] = 1: The value of PSTATE.PAN is left unchanged on taking an exception to EL2.
  set_sctlr_el2(0x30c5103a);
  // Initializes the stack cookie using the CNTPCT_EL0 system register.
  stack_chk_guard_setup();
  // TPIDR_EL2 holds a per-CPU object that contains saved register values and information related to the execution.
  set_tpidr_el2(&g_cpu_list);
  // Sets the exception vector table (VBAR_EL2) for the current CPU.
  set_vbar_el2(SynchronousExceptionSP0);
  // ...
  // Initializes the heap.
  g_heap_start = &heap_start;
  g_heap_size = 0x5b7680;
  // Boots the primary core.
  boot_primary_core(&args);
  // Now that the hypervisor has been initialized, the execution can continue in the kernel at EL1.
  asm("eret");
}

We then reach the boot_primary_core function, which calls init_memory_mappings to create the hypervisor's virtual address space, as well as map_stage2 to set up the second translation stage.

void boot_primary_core(saved_regs_t* regs) {
  // Initializes the shared buffers used to send logs from the hypervisor to other components.
  init_log_buffers(regs);
  // ...
  // Allocates the hypervisor page tables, maps the hypervisor internal ranges, and enables addresses translation.
  init_memory_mappings();
  // ...
  current_cpu->sys_regs = &g_sysregs;
  // Maps the whole physical memory in the stage 2 translation tables, except for protected regions, including the
  // hypervisor memory ranges.
  map_stage2();
  // Sets a second batch of per-core system registers.
  hyp_set_el2_and_enable_stage_2_per_cpu();
  regs->x0 = hyp_set_elr_el2_spsr_el2_sctlr_el1(regs->x3, regs->x0);
  // ...
}

Hypervisor's Virtual Address Space

For the moment, HHEE has direct access to physical memory, and consequently no memory protections are enabled. init_memory_mappings fixes this by:

  • allocating memory for the hypervisor's page tables using allocate_hypervisor_page_tables;
  • mapping the corresponding physical pages in a 1:1 mapping with map_hypervisor_memory to make the physical to virtual memory transition as smooth as possible;
  • calling hyp_config_per_cpu to set the relevant system registers and render the virtual mappings effective.
void init_memory_mappings() {
  allocate_hypervisor_page_tables();
  // First set all hypervisor physical memory as read-write.
  map_hypervisor_memory(0x10f00000, 0x10f00000, 0x00600000, HYP_READ | HYP_WRITE | INNER_SHAREABLE);
  // Then set the code + rodata range as read-only executable.
  map_hypervisor_memory(0x10f00000, 0x10f00000, 0x00010000, HYP_READ | HYP_EXEC | INNER_SHAREABLE);
  // Maybe this was meant to be the rodata range?
  map_hypervisor_memory(0x10f10000, 0x10f10000, 0x00000000, HYP_READ | INNER_SHAREABLE);
  // Map the device internal memory as read-write.
  map_hypervisor_memory(0xe0000000, 0xe0000000, 0x20000000, HYP_READ | HYP_WRITE);
  // Sets a first batch of per-core system registers.
  hyp_config_per_cpu();
}

The allocation stage implemented in allocate_hypervisor_page_tables starts by calling get_parange_in_bits to retrieve the number of bits needed to encode the size of the physical address space. In our case, the device uses 48-bit physical addresses.

Based on this size, it prepares the values that will be stored in EL2 system registers.

  • Sets TTBR0_EL2 to the address of the Page Global Directory (PGD), the first page table used during address translation.
  • Registers memory attributes in MAIR_EL2.
  • Configures TCR_EL2 to set the sizes of the physical and virtual address spaces as well as the size of the granule, the smallest block of memory that can be translated.
  • Enables the MMU, WXN, and caches by changing SCTLR_EL2.

These values will be effectively written to the corresponding registers in hyp_config_per_cpu once physical memory has been mapped.

The function then initializes the global bitmap g_bitmaps_array, which tracks virtual allocations in the hypervisor's address space. And finally, the actual memory allocations are performed by calling allocate_page_tables_per_addr_range.

We will come back to both the bitmap and the allocation function in a later section about the page tables to describe them in a broader context.

uint64_t allocate_hypervisor_page_tables() {
  // ...

  // Retrieves the number of bits used for the physical address range.
  parange_in_bits = get_parange_in_bits();
  // Default values used for a 48-bit address space, the most common size.
  if (parange_in_bits == 48) {
    g_page_size_log2 = 0xc;
    g_hyp_first_pt_level = 0;
    pgd_size = 0x1000;
    g_nb_pt_entries = 0x200;
    g_va_range_start = 0x800000000000;
    phys_addr_size_bits = 0b101 /* 48 bits */;
  } else {
    /* ... */
  }

  // TTBR0_EL2, Translation Table Base Register 0 (EL2).
  g_ttbr0_el2 = alloc_memory(pgd_size, pgd_size);
  // ...
  // MAIR_EL2, Memory Attribute Indirection Register (EL2).
  g_mair_el2 = 0x4ff;
  // TCR_EL2, Translation Control Register (EL2).
  //
  //   - T0SZ, bits [5:0]   = 16: TTBR0_EL2 region size is 2^48.
  //   - TG0,  bits [15:14] = 0b00: TTBR0_EL2 granule size is 4KB.
  //   - PS,   bits [18:16] = 0b101: PA size is 48 bits.
  g_tcr_el2 = (0x40 - parange_in_bits) | 0x80803500 | (phys_addr_size_bits << 0x10);
  // SCTLR_EL2, System Control Register (EL2).
  //
  //   - M,   bit [0]  = 1: MMU enabled.
  //   - C,   bit [2]  = 1: Data access Cacheability control for accesses at EL2.
  //   - WXN, bit [19] = 1: WXN enabled.
  g_sctlr_el2 = get_sctlr_el2() | 0x80005;

  // Initializes a global bitmap that tracks virtual allocations in the hypervisor's address space.
  for (i = 0; i != 8; ++i) {
    atomic_set(-1, &g_bitmaps_array[i]);
  }

  // Allocates the page tables needed for the mapping from a pool of physical pages.
  return allocate_page_tables_per_addr_range(g_ttbr0_el2, g_va_range_start, 0, 0x1000, g_hyp_first_pt_level,
                                             g_page_size_log2, g_nb_pt_entries, 0);
}

We will also hold off on detailing map_hypervisor_memory. For now, just know that the following ranges are mapped:

Physical Address  Virtual Address  Size        R    W    X    Inner Shareable  Description
0x10F00000        0x10F00000       0x10000     Yes  No   Yes  Yes              Code and read-only data
0x10F10000        0x10F10000       0x5F0000    Yes  Yes  No   Yes              Read-write data
0xE0000000        0xE0000000       0x20000000  Yes  Yes  No   No               Read-write internal device memory

Finally, as said earlier, hyp_config_per_cpu is called to modify EL2 system registers and activate the virtual memory range and the protections that go along with it.

void hyp_config_per_cpu() {
  // Puts the global/shared values into the per-core system registers.
  set_ttbr0_el2(g_ttbr0_el2);
  set_mair_el2(g_mair_el2);
  set_tcr_el2(g_tcr_el2);
  // Invalidates all TLBs for EL2.
  tlbi_alle2();
  set_sctlr_el2(g_sctlr_el2);
}

Initializing the Second Translation Stage

The second role of boot_primary_core is to setup the second stage of the virtual address translation system by calling map_stage2.

The first step is to map the entire physical memory in the second stage, except for the DDR holes, which are the memory regions that are not physically backed by the external memory chip. Mappings are performed using map_stage2_memory and unmap_stage2_memory, but just like the other memory mapping functions, we leave this one for the page tables section.

The second step is to unmap critical and protected ranges from the second stage. By doing this, we make sure the kernel will be unable to map certain physical regions, even if it is fully compromised. The regions in question are as follows:

Physical Memory Region Start End
Hypervisor 0x10F00000 0x114CF000
LPMCU 0x11A40000 0x11B00000
SENSORHUB 0x11B00000 0x12800000
SUPERSONIC 0x2C200000 0x2CBB0000
SEC_CAMERA 0x2CE00000 0x2DA00000
HIFI 0x2DA00000 0x2E980000
NPU_SEC 0x30660000 0x31060000
MODEM 0xA0000000 0xB1280000

The third and final step is to remap some ranges with different permissions.

  • The device internal memory range 0xE0000000-0x100000000 is remapped as device memory.
  • The memory range 0x114CF000-0x114F1000, right after the hypervisor, which contains global variables used by the hypervisor and log buffers (except for the monitor log buffer), is remapped as read-only.
  • The memory range 0x114F1000-0x114FF000 containing the monitor log buffer is remapped as read-only.
void map_stage2() {
  // ...
  // Sets global variables related to the second translation stage, in particular g_vttbr_el2, which contains the second
  // stage's PGD.
  init_stage2_globals();
  // ...
  // Maps all the physical memory in the stage 2 except the DDR holes.
  last_end_paddr = 0;
  for (int32_t i = 0; i < 32; i++) {
    if (g_ddr_holes_table[i].valid) {
      size = g_ddr_holes_table[i].beg_paddr - last_end_paddr;
      map_stage2_memory(last_end_paddr, last_end_paddr, size, NORMAL_MEMORY | READ | WRITE | EXEC_EL0 | EXEC_EL1, 0);
      // ...
      last_end_paddr = g_ddr_holes_table[i].last_end_paddr;
    }
  }

  // Maps the device internal memory range.
  map_stage2_memory(0xe0000000, 0xe0000000, 0x20000000, READ | WRITE | EXEC_EL0 | EXEC_EL1, 1);
  // ...

  // Unmap critical/protected memory ranges.
  for (int32_t i = 0; i < 32; i++) {
    if (g_unmapped_regions[i].valid) {
      beg_paddr = g_unmapped_regions[i].beg_paddr;
      unmap_stage2_memory(beg_paddr, g_unmapped_regions[i].end_paddr - beg_paddr);
      // ...
    }
  }

  // Unmap the hypervisor memory range.
  unmap_stage2_memory(0x10f00000, 0x5cf000);
  // ...

  // Sets the memory range containing g_ddr_holes_table, g_unmapped_regions as well as all the log buffers but the
  // monitorlog buffer as read-only.
  set_memory_config_as_ro();
  // Sets the memory range containing the monitorlog buffer as read-only.
  set_monitorlog_buffers_as_ro();
}

We then move on to the next function called by boot_primary_core, namely hyp_set_el2_and_enable_stage_2_per_cpu. This function sets multiple EL2 system registers, such as HCR_EL2, to perform, in particular, the following operations:

  • disable the second translation stage (for now);
  • trap SMCs, as well as EL1 writes to the virtual memory control registers (e.g. SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, etc.), so that they are handled at EL2.
void hyp_set_el2_and_enable_stage_2_per_cpu() {
  uint64_t hypervisor_config = 0x84080004;
  uint64_t attrs = get_id_aa64isar1_el1();

  // Configuration for PAC-compatible devices.
  //
  // ID_AA64ISAR1_EL1, AArch64 Instruction Set Attribute Register 1.
  //
  // The checked bits indicate whether QARMA or Architected algorithm, or support for an implementation defined
  // algorithm are implemented in the PE for address authentication and generic code authentication.
  if ((attrs & 0xff000ff0) != 0) {
    hypervisor_config = 0x30084080004;
  }

  // HCR_EL2, Hypervisor Configuration Register.
  //
  //   - VM,  bit [0]  = 0: EL1&0 stage 2 address translation disabled.
  //   - PTW, bit [2]  = 1: translation table walks that end up accessing device memory generate stage 2 permission
  //                        faults.
  //   - TSC, bit [19] = 1: traps SMC instructions.
  //   - TVM, bit [26] = 1: traps EL1 writes to the virtual memory control registers to EL2 (SCTLR_EL1, TTBR0_EL1,
  //                        TTBR1_EL1, TCR_EL1, ESR_EL1, FAR_EL1, AFSR0_EL1, AFSR1_EL1, MAIR_EL1, AMAIR_EL1,
  //                        CONTEXTIDR_EL1).
  //   - RW,  bit [31] = 1: the Execution state for EL1 is AArch64, and the Execution state for EL0 is determined by the
  //                        current value of PSTATE.nRW when executing at EL0.
  //   - E2H, bit [34] = 0: The facilities to support a Host Operating System at EL2 are disabled (influences the format
  //                        of the TCR_EL2 system register).
  set_hcr_el2(hypervisor_config);
  set_cptr_el2(0x33ff);
  set_hstr_el2(0);
  set_cnthctl_el2(3);
  set_cntvoff_el2(0);
  set_vpidr_el2(get_midr_el1());
  set_vmpidr_el2(get_mpidr_el1());
  set_vttbr_el2(0);
  enable_stage2_addr_translation();
}

Eventually, enable_stage2_addr_translation is called to set VTTBR_EL2 with the address of the second stage's PGD. The second stage is finally configured and enabled by the function.

void enable_stage2_addr_translation() {
  // ...
  parange_in_bits = get_parange_in_bits();
  if (parange_in_bits == 48) {
    vtcr_el2 = (0b101 << 0x10) | 0x80003580;
  } else {
    /* ... */
  }

  // VTCR_EL2, Virtualization Translation Control Register.
  //
  //   - T0SZ, bits [5:0]   = 16: VTTBR_EL2 region size is 2^48.
  //   - SL0,  bits [7:6]   = 0b10: start at level 0.
  //   - TG0,  bits [15:14] = 0b00: VTTBR_EL2 granule size is 4KB.
  //   - PS,   bits [18:16] = 0b101: PA size is 48 bits.
  set_vtcr_el2(vtcr_el2 | (0x40 - parange_in_bits));
  set_vttbr_el2(g_vttbr_el2);
  // HCR_EL2, Hypervisor Configuration Register.
  //
  // VM, bit [0] = 1: EL1&0 stage 2 address translation enabled.
  set_hcr_el2(get_hcr_el2() | 1);
}

At last, we have reached the end of the primary core's initialization. All that remains is to prepare the hypervisor to jump to the kernel. This is achieved with a call to hyp_set_elr_el2_spsr_el2_sctlr_el1 with the kernel's entry point as the first argument. This function begins by initializing EL1 system registers using hyp_set_ttbr_el1_tcr_el1_mair_el1. It then sets:

  • the execution level to EL1h (SPx Stack Pointer) in SPSR_EL2;
  • the exception return address to the kernel's entry point;
  • and makes sure that the MMU, WXN, and PAN are disabled.
uint64_t hyp_set_elr_el2_spsr_el2_sctlr_el1(uint64_t entrypoint, uint64_t context_id) {
  hyp_set_ttbr_el1_tcr_el1_mair_el1();
  // ELR_EL2, Exception Link Register (EL2).
  set_elr_el2(entrypoint);
  // SPSR_EL2, Saved Program Status Register (EL2).
  //
  // M, bits [3:0] = 0b0101: EL1h (SPx Stack Pointer).
  set_spsr_el2(0x3c5);
  // SCTLR_EL1, System Control Register (EL1).
  //
  //   - M,    bit [0]  = 0: MMU disabled.
  //   - WXN,  bit [19] = 0: WXN disabled.
  //   - SPAN, bit [23] = 1: the value of PSTATE.PAN is left unchanged on taking an exception to EL1.
  set_sctlr_el1(0x30d50838);
  return context_id;
}
void hyp_set_ttbr_el1_tcr_el1_mair_el1() {
  // ...
  sys_regs = current_cpu->sys_regs;
  if (atomic_get(&sys_regs->regs_inited)) {
    ttbr1_el1 = sys_regs->ttbr1_el1;
    tcr_el1 = sys_regs->tcr_el1;
    mair_el1 = sys_regs->mair_el1;
  } else {
    ttbr1_el1 = tcr_el1 = mair_el1 = 0;
  }

  set_ttbr0_el1(0);
  set_ttbr1_el1(ttbr1_el1);
  set_tcr_el1(tcr_el1);
  set_mair_el1(mair_el1);

  // Ensures that the MMU, WXN are enabled, among other things.
  check_sctlr_el1(get_sctlr_el1());
  current_cpu->ttbr1_el1_processed = 0;
}

At this point, when the hypervisor reaches the ERET instruction in entrypoint_primary, the execution level will drop down to EL1 and enter the kernel. The kernel will do its thing and eventually start the secondary cores.

Secondary Cores Boot

To boot the secondary cores, the kernel uses the ARM Power State Coordination Interface (PSCI) entry from the device tree.

psci {
    compatible = "arm,psci";
    cpu_off = <0x84000002>;
    cpu_on = <0xC4000003>;
    cpu_suspend = <0xC4000001>;
    method = "smc";
    system_off = <0x84000008>;
    system_reset = <0x84000009>;
};

We can see that the kernel makes the SMC 0xC4000003 to power on a CPU core, which corresponds to the PSCI_CPU_ON_AARCH64 command in the secure monitor. However, when the hypervisor is enabled, this SMC is trapped and handled by boot_hyp_cpu.

This function first retrieves the thread local storage of the CPU core to be booted, or allocates it if it does not exist yet. It then calls do_smc_psci_cpu_on to forward the PSCI_CPU_ON_AARCH64 SMC to the secure monitor, passing entrypoint_secondary as the core's entry point. The secure monitor handles the SMC, eventually powers on the corresponding CPU core, and jumps back to EL2 in the entrypoint_secondary function.

Note: you can have a look at ATF's source code and follow the execution flow in psci_cpu_on if you want more details about the operations performed by the secure monitor when a CPU core is started up.
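
For context, here is a hedged sketch of what the EL1 side of this call boils down to, following the PSCI specification and the SMC Calling Convention (the helper is ours; the actual kernel goes through its generic PSCI driver). With HHEE enabled, the SMC is trapped thanks to HCR_EL2.TSC and lands in boot_hyp_cpu instead of reaching the monitor directly.

#include <stdint.h>

#define PSCI_CPU_ON_AARCH64 0xC4000003ULL

// Asks the firmware to power on the core identified by target_mpidr and make
// it start executing at entrypoint, with context_id placed in its x0.
static uint64_t psci_cpu_on(uint64_t target_mpidr, uint64_t entrypoint, uint64_t context_id) {
  register uint64_t x0 asm("x0") = PSCI_CPU_ON_AARCH64;
  register uint64_t x1 asm("x1") = target_mpidr;
  register uint64_t x2 asm("x2") = entrypoint;
  register uint64_t x3 asm("x3") = context_id;
  asm volatile("smc #0" : "+r"(x0) : "r"(x1), "r"(x2), "r"(x3) : "memory");
  return x0;
}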

uint64_t boot_hyp_cpu(uint64_t target_cpu, uint64_t entrypoint, uint64_t context_id) {
  // ...

  // Try to find the thread local storage of the target CPU.
  spin_lock(&g_cpu_list_lock);
  for (cpu_info = &g_cpu_list; cpu_info != NULL; cpu_info = cpu_info->next) {
    if (target_cpu == cpu_info->mpidr_el1) {
      break;
    }
  }

  // If no thread local storage exists for this CPU, allocate one.
  if (cpu_info == NULL) {
    // Allocate the CPU stack and TLS structure.
    cpu_stack = alloc_memory(0x40, 0x1480);
    // ...
    cpu_info = (cpu_list_t*)(cpu_stack + 0x1440);
    memset_s(cpu_info, 0x40, 0, 0x40);
    cpu_info->mpidr_el1 = target_cpu;
    // Add the TLS structure to the global CPU list.
    cpu_info->next = g_cpu_list.next;
    g_cpu_list.next = cpu_info;
  }

  // If the CPU hasn't already booted, save the original arguments into the CPU stack, replace them with the secondary
  // entry point in the hypervisor and execute the PSCI_CPU_ON_AARCHXX SMC.
  if (!cpu_info->is_booting) {
    cpu_info->is_booting = 1;
    spin_unlock(&g_cpu_list_lock);
    *(uint64_t*)((void*)cpu_info - 0x10) = entrypoint;
    *(uint64_t*)((void*)cpu_info - 8) = context_id;
    // ...
    return do_smc_psci_cpu_on(target_cpu, entrypoint_secondary, cpu_info);
  }

  spin_unlock(&g_cpu_list_lock);
  return 0xfffffffc;
}

At this stage, we can finally run code on secondary cores and initialize their EL2 state using entrypoint_secondary. The function initializes the thread local storage, the exception vector table, and EL2 system registers related to memory control, which puts this secondary core in the same state as the primary one.

void entrypoint_secondary(saved_regs_t args) {
  // TPIDR_EL2 holds a per-CPU object that contains saved register values and information related to the execution.
  set_tpidr_el2(args.x0);
  // Sets the exception vector table (VBAR_EL2) for the current CPU.
  set_vbar_el2(SynchronousExceptionSP0);
  // Sets a first batch of per-core system registers.
  hyp_config_per_cpu();
  // Boot the secondary core.
  boot_secondary_cores(&args);
  // Continue the execution in the kernel at EL1.
  asm("eret");
}

All that remains is to enable the second translation stage by calling, once again, hyp_set_el2_and_enable_stage_2_per_cpu.

void boot_secondary_cores(saved_regs_t* regs) {
  // Flags the CPU as no longer booting, now that the secure monitor has powered it on.
  spin_lock(&g_cpu_list_lock);
  current_cpu->is_booting = 0;
  spin_unlock(&g_cpu_list_lock);
  current_cpu->sys_regs = &g_sysregs;
  // Sets a second batch of per-core system registers.
  hyp_set_el2_and_enable_stage_2_per_cpu();
  uint64_t entrypoint = *(uint64_t*)(regs->x0 - 0x10);
  uint64_t context_id = *(uint64_t*)(regs->x0 - 0x8);
  regs->x0 = hyp_set_elr_el2_spsr_el2_sctlr_el1(entrypoint, context_id);
}

Once entrypoint_secondary hits the ERET instruction, the core returns to the kernel at EL1 and continues its execution. The hypervisor is initialized on all cores and ready to supervise the kernel. The whole process described in this section about powering on the device is summarized in the figure below.

hypervisor initialization

With the initialization taken care of, we can move on to the next topic: virtual memory management.

Virtual Memory Management

Arguably, the most important element of the hypervisor is its second translation stage implementation and, more broadly, how it handles page tables. In this section, we explain components we had put aside until now, namely memory mappings in the second stage and the hypervisor's address space.

Refresher Course on ARM Virtual Memory Management

virt to phys translation

By default, when the device starts, every exception level has direct access to physical memory. There is no isolation nor memory protection. Code running at EL0 could access the address space of other EL0 applications or even modify kernel resources, for example. This can be prevented by implementing a virtual memory system, where one or multiple virtual addresses map to a given physical address.

Translation from a virtual address to a physical one is performed by the Memory Management Unit, or MMU. But to perform this translation, the MMU needs a way to make virtual and physical addresses correspond. This is where page tables come in. Physical memory locations are subdivided and organized into multi-level tables.

physical memory subdivision in page tables

Another important aspect of ARM's virtual memory implementation is the distinction between the translation tables used by the kernel at EL1 and those used by processes running at EL0. Because tasks at EL0 usually run concurrently, they each have their own set of translation tables, and the kernel switches between them whenever a context switch occurs. The kernel's own mappings, on the other hand, are much less likely to change. This is why, to make things more efficient, the virtual address space is divided into two regions:

  • low addresses, usually attributed to processes running at EL0, which use TTBR0_EL1 to store the base address of their translation table;
  • high addresses, generally reserved for the kernel at EL1, which uses TTBR1_EL1 to store the base address of its translation table.

This distinction does not exist at EL2 and EL3.

ttbr0 and ttbr1 address space division
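
In practice, deciding which translation table base register applies only requires looking at the top bit of the virtual address, which is exactly what HHEE does later in this post when it distinguishes kernel addresses from userland ones. A minimal sketch of that check, with a helper name of our choosing:

#include <stdint.h>

// Kernel virtual addresses have their most significant bit set and are
// translated through TTBR1_EL1; the others go through TTBR0_EL1.
static int is_kernel_address(uint64_t virt_addr) {
  return (virt_addr >> 63) != 0;
}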

Finally, a virtual address under translation is divided into chunks that form indices in the page tables. This allows us to walk the page tables, going from the highest level to the lowest one, and retrieve the associated physical page.

page table translation process

Both the hypervisor and the Android kernel use 4-level translation tables. In the ARM specification, page tables are referred to as Level n, where n ranges, in our case, from 0, the highest level, to 3, the lowest. However, the Linux kernel uses a different nomenclature given below:

  • Page Global Directory, or PGD, for a level 0 page table;
  • Page Upper Directory, or PUD, for a level 1 page table;
  • Page Middle Directory, or PMD, for a level 2 page table;
  • Page Table, or PT, for a level 3 page table.

In the rest of this article, we might use these terms interchangeably since some functions can process both the hypervisor and kernel page tables.
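
To make the walk more concrete, here is a small sketch of the index computation, assuming the configuration used throughout this post (4-level tables, a 4KB granule, and 48-bit virtual addresses); the macros are ours.

#include <stdint.h>

// Each level consumes 9 bits of the virtual address and the last 12 bits are
// the offset into the 4KB page.
#define PT_IDX(va, level) (((va) >> (39 - 9 * (level))) & 0x1ff)
#define PAGE_OFFSET(va)   ((va) & 0xfff)

// For example, with va = 0xffff800012345678:
//   PT_IDX(va, 0) = 0x100 (PGD index)
//   PT_IDX(va, 1) = 0x000 (PUD index)
//   PT_IDX(va, 2) = 0x091 (PMD index)
//   PT_IDX(va, 3) = 0x145 (PT index)
//   PAGE_OFFSET(va) = 0x678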

The page tables we deal with in this blog post all share the same configuration. They are 4096 bytes long and thus contain 512 64-bit entries. Each entry, which is called a descriptor, can only be of one of the types listed below.

  • Page descriptor:
    • stores the address of a physical page as well as the memory attributes and protections applied;
    • must have its least significant bits set to 0b11;
    • can only be used in level 3 page tables.

page descriptor format

  • Block descriptor:
    • stores the address of the beginning of a physical address range as well as the memory attributes and protections applied;
    • must have its least significant bits set to 0b01;
    • can only be used in level 1 and level 2 page tables;
    • a block descriptor used at level n maps the whole address range that a table at level n+1 would cover; for example, a level 2 block covers 0x200000 bytes because a level 3 page table references 0x200 physical pages.

block descriptor format

  • Table descriptor:
    • stores the physical address of the page table at the next level;
    • must have its least significant bits set to 0b11;
    • can only be used in level 0, level 1, and level 2 page tables;
    • memory attributes and protections can be applied at the table level, meaning they will apply regardless of the values in the next level tables.

table descriptor format

  • Invalid/Reserved descriptor:
    • any value that does not fit the above categories will be considered invalid.

invalid descriptor format
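
The rules above can be summarized with a small classification helper; this is a sketch of ours, not hypervisor code.

#include <stdint.h>

typedef enum { DESC_INVALID, DESC_BLOCK, DESC_TABLE, DESC_PAGE } desc_type_t;

// Classifies a descriptor based on its two least significant bits and the
// level of the page table it was read from.
static desc_type_t classify_descriptor(uint64_t desc, unsigned int level) {
  if ((desc & 1) == 0) {
    return DESC_INVALID;
  }
  if ((desc & 2) == 0) {
    // 0b01: block descriptors are only valid at levels 1 and 2.
    return (level == 1 || level == 2) ? DESC_BLOCK : DESC_INVALID;
  }
  // 0b11: page descriptor at level 3, table descriptor otherwise.
  return (level == 3) ? DESC_PAGE : DESC_TABLE;
}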

The page and block descriptors contain attributes that apply to the memory they are mapping. These attributes are split into an upper block and a lower block, as shown below.

stage 1 descriptor attributes

In particular, the following attributes are of interest to us:

  • the NS, or Non-Secure bit, specifying whether the output address is secure or non-secure;
  • the AP[2:1], or Access Permissions bits, controlling if memory is readable/writable from EL0, and writable from higher ELs;
  • the UXN, or Unprivileged eXecute-Never bit, controlling if memory is executable from EL0;
  • the PXN, or Privileged eXecute-Never bit, controlling if memory is executable from higher ELs.

In addition, the following lesser-known attributes are relevant to a security hypervisor:

  • the Contiguous bit, which indicates that the entry is one of a contiguous set of entries that might be cached in a single TLB entry;
  • the DBM bit, or Dirty Bit Modifier, which indicates whether a page or block of memory has been modified and changes the function of the AP[2] bit so that it records the dirty state instead of an access permission;
  • the AF, or Access Flag bit, which tracks whether a region covered by the entry has been accessed.
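
Written as C constants, the stage 1 attribute bits listed above are located as follows (positions from the ARM Architecture Reference Manual; the names are ours):

#define S1_ATTR_NS   (1ULL << 5)   // Output address is Non-Secure.
#define S1_ATTR_AP1  (1ULL << 6)   // AP[1]: accessible from EL0.
#define S1_ATTR_AP2  (1ULL << 7)   // AP[2]: read-only (or dirty state when DBM is used).
#define S1_ATTR_AF   (1ULL << 10)  // Access Flag.
#define S1_ATTR_DBM  (1ULL << 51)  // Dirty Bit Modifier.
#define S1_ATTR_CONT (1ULL << 52)  // Contiguous hint.
#define S1_ATTR_PXN  (1ULL << 53)  // Privileged eXecute-Never.
#define S1_ATTR_UXN  (1ULL << 54)  // Unprivileged eXecute-Never.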

Mapping Memory in the Hypervisor

map_pa_into_hypervisor is the primary function encountered when it comes to mapping physical memory into the hypervisor. It maps the physical page that contains the physical address phys_addr and returns an address in the hypervisor's virtual space. However, the choice of virtual address is left to the hypervisor.

To keep track of virtual memory in use, the hypervisor uses a simple linear allocator starting from the virtual address stored in g_va_range_start. The allocator is implemented with an array of eight 64-bit integers, where one bit represents a virtual page. If the bit is set, the allocation is free; otherwise, it is in use. Thus, the allocator can map at most 0x200000 bytes.

When the hypervisor wants to map a physical address, it checks in the bitmap whether a virtual page is available by iterating over each integer. The page is then mapped into the page tables using map_hypervisor_memory before its virtual address is returned.

uint64_t map_pa_into_hypervisor(uint64_t phys_addr) {
  // Column and row indices in the bitmap used to compute the resulting virtual address.
  bitmap_idx = 0;
  tracker = 0xffffffffffffffff;
  for (;;) {
    expected_tracker = tracker;
    // RBIT reverses the bit order of an integer, e.g. RBIT(0x1) returns 0x8000000000000000.
    //
    // CLZ counts the number of leading zeroes of an integer, e.g. CLZ(0xffff) returns 48.
    //
    // Using RBIT followed by CLZ returns the index of the last bit equal to 0 starting from the right. In our context,
    // it returns the bit position that corresponds to the free virtual page with the lowest possible address.
    next_alloc_bit_idx = clz(rbit(tracker));
    // Generates the new tracker value with the allocated bit cleared to mark the page as in use.
    new_tracker = tracker & ~(1ULL << next_alloc_bit_idx);
    // Atomically checks if the tracker is in the expected state. If it's the case, then atomically stores the new state
    // that takes the allocation into account.
    tracker = exclusive_load(g_bitmaps_array[bitmap_idx]);
    if (tracker == expected_tracker) {
      if (!exclusive_store(new_tracker, g_bitmaps_array[bitmap_idx])) {
        break;
      }
    }
    // If no more allocations can be made with the current bitmap integer, we move to the next one. If the end of the
    // bitmaps is reached, start over from the beginning.
    if (!tracker) {
      if (++bitmap_idx >= 8) {
        bitmap_idx = 0;
      }
      tracker = 0xffffffffffffffff;
    }
  }

  // Maps the physical page and returns the virtual address that corresponds to the input physical address.
  page_size = 1 << g_page_size_log2;
  offset_mask = page_size - 1;
  addr_mask = -page_size;
  phys_page_addr = phys_addr & addr_mask;
  virt_page_addr = g_va_range_start + (bitmap_idx * 64 + next_alloc_bit_idx) * page_size;
  map_hypervisor_memory(virt_page_addr, phys_page_addr, 0x1000, HYP_WRITE | HYP_READ | INNER_SHAREABLE);
  virt_addr = virt_page_addr | phys_addr & offset_mask;
  return virt_addr;
}

map_hypervisor_memory takes the physical address passed as an argument and starts building a page (or block) descriptor that will be stored in the page table. It sets the following attributes depending on the permissions provided:

  • maps the range as non-executable if HYP_EXEC is not specified;
  • maps the range as inner shareable if INNER_SHAREABLE is specified, and as outer shareable otherwise;
  • sets the range as read-write if HYP_WRITE was specified, and as read-only otherwise.

The execution then continues to map_memory_range.

uint64_t map_hypervisor_memory(uint64_t virt_addr, uint64_t phys_addr, uint64_t size, hyp_attrs_t perms) {
  // ...

  desc = phys_addr;
  // If a range is not executable, we set XN[54] to 1.
  if (!(perms & HYP_EXEC)) {
    desc |= 1 << 54;
  }
  // Sets the Access Flag to 1.
  desc |= 1 << 10;
  if (perms & INNER_SHAREABLE) {
    // Sets SH[9:8] to 0b11 if our range should be inner Shareable.
    desc |= 0b11 << 8;
  } else {
    // Sets SH[9:8] to 0b10 if our range should be outer Shareable.
    desc |= 0b10 << 8;
  }
  if (perms & HYP_WRITE) {
    // Sets AP[7] to 0 if our range should be read/write.
    desc |= 0b0 << 7;
  } else {
    // Sets AP[7] to 1 if our range should be read-only.
    desc |= 0b1 << 7;
  }

  map_memory_range(g_ttbr0_el2, virt_addr, desc | 0b01 /* Block descriptor by default */, size, g_hyp_first_pt_level,
                   g_page_size_log2, g_nb_pt_entries, 0);

  // ...
}

map_memory_range first allocates the page tables necessary for the mapping from a pool of physical addresses and sets them up using allocate_page_tables_per_addr_range, a function that was also used in allocate_hypervisor_page_tables.

The function then adds the block and/or page descriptors to these tables using add_page_table_entries.

uint64_t map_memory_range(uint64_t page_table_addr,
                          uint64_t virt_addr,
                          uint64_t desc,
                          uint64_t size,
                          uint32_t pt_level,
                          uint32_t next_table_size_in_bits,
                          uint64_t nb_pt_entries,
                          uint8_t invalidate_by_ipa) {
  phys_addr = desc & 0xfffffffff000;
  // ...
  // Allocates physical memory pages for the page tables needed to map the requested range.
  ret = allocate_page_tables_per_addr_range(page_table_addr, virt_addr, phys_addr, size, pt_level,
                                            next_table_size_in_bits, nb_pt_entries, invalidate_by_ipa);
  if (ret) {
    return 0xffffffff;
  }

  // Updates the page tables and adds the page table descriptors to effectively map the requested range.
  add_page_table_entries(page_table_addr, virt_addr, desc, size, pt_level, next_table_size_in_bits, nb_pt_entries,
                         invalidate_by_ipa);
  return 0;
}

allocate_page_tables_per_addr_range takes the address range we want to map, recursively performs a page table walk on it, allocates any page tables that don't already exist, and stores the corresponding page table descriptors. However, it doesn't register any pages or blocks in the page tables. This is the role of add_page_table_entries.

uint64_t allocate_page_tables_per_addr_range(uint64_t page_table_addr,
                                             uint64_t vaddr,
                                             uint64_t paddr,
                                             uint64_t size,
                                             uint32_t pt_level,
                                             uint32_t table_size_log2,
                                             uint64_t nb_pt_entries,
                                             uint8_t invalidate_by_ipa) {
  // ...

  // If we've reached the third level of the page table, we're done, other functions will take care of the page
  // allocations.
  if (pt_level == 3 || size == 0) {
    return 0;
  }
  vaddr_end = vaddr + size;
  nb_pt_entries_log2 = table_size_log2 - 3;
  nb_pt_entries = 1 << nb_pt_entries_log2;
  next_pt_level = pt_level + 1;

  // Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
  // at level 2, etc.).
  pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;
  next_pt_level_idx_pos = (3 - pt_level) * nb_pt_entries_log2 + 3;

  is_not_level0_or_level1 = pt_level_idx_pos < 31;

  // Computes the size mapped by the current level (e.g. 0x1000 at level 3, 0x200000 at level 2, etc.).
  pt_level_size = 1 << pt_level_idx_pos;
  next_pt_level_size = 1 << next_pt_level_idx_pos;

  // Virtual address mask for the current page table level (e.g. 0xffff_ffff_ffff_f000 at level 3).
  pt_level_mask = -pt_level_size;

  // Mask to apply to get a page table index from a virtual address (e.g. 0x1ff for page tables with 0x200 entries).
  pt_idx_mask = nb_pt_entries - 1;

  // Current page table level descriptor address of the virtual address we want to map.
  desc_addr = page_table_addr + 8 * ((vaddr >> pt_level_idx_pos) & pt_idx_mask);

  for (;;) {
    // Checks if the range covered by the current page table level is large enough to contain the region we want to map.
    next_range_addr = (vaddr + pt_level_size) & pt_level_mask;
    range_vaddr_end = (next_range_addr > vaddr_end) ? vaddr_end : next_range_addr;
    size_to_map = range_vaddr_end - vaddr;

    // If we're at level 2 or 3 and the whole region is covered, we continue to the next one without any allocation.
    // Because in these cases, we would either allocate a block at level 2 or a page at level 3, but this is taken care
    // of in another function.
    if (is_not_level0_or_level1 && pt_level_size == size_to_map && ((vaddr | paddr) & (pt_level_size - 1)) == 0) {
      goto CONTINUE;
    }
    desc = *desc_addr;

    // If the descriptor is invalid or doesn't exist, we allocate a page table for the lower level and make the
    // descriptor point to it.
    if ((desc & 1) == 0) {
      new_pt_addr = alloc_memory(next_pt_level_size, next_pt_level_size);
      *desc_addr = new_pt_addr | 3;
      goto HANDLE_NEXT_PT_LEVEL;
    }

    // If the descriptor is a block descriptor, we split the block into lower level blocks (e.g. a 0x4000_0000-long
    // block at level 1 would be split into 512 0x20_0000-long blocks at level 2).
    if ((desc & 2) == 0) {
      new_pt_addr = alloc_memory(next_pt_level_size, next_pt_level_size);
      next_desc = desc;
      // Transforms block descriptors into page descriptors if the next level is 3.
      if (next_pt_level == 3) {
        next_desc = desc | 2;
      }
      // Splits the block into lower-level blocks (or pages if we're currently at level 2).
      if (nb_pt_entries) {
        for (idx = 0; idx != nb_pt_entries; idx++) {
          *(uint64_t*)(new_pt_addr + 8 * idx) = next_desc;
          next_desc += next_pt_level_size;
        }
      }
      // ...
      *desc_addr = new_pt_addr | 3;
      goto HANDLE_NEXT_PT_LEVEL;
    }

    // Otherwise, the descriptor is already a table descriptor and we walk into the existing next-level table.
    new_pt_addr = desc & 0xfffffffff000;

HANDLE_NEXT_PT_LEVEL:
    if (allocate_page_tables_per_addr_range(new_pt_addr, vaddr, paddr, size_to_map, next_pt_level, table_size_log2,
                                            nb_pt_entries, invalidate_by_ipa)) {
      return 0xffffffff;
    }

CONTINUE:
    size -= size_to_map;
    paddr += size_to_map;
    ++desc_addr;
    vaddr = range_vaddr_end;
    if (!size) {
      return 0L;
    }
  }
}

add_page_table_entries also performs a page table walk, but this time it is to register the page and/or block descriptor passed as an argument. Once this function returns, memory has been successfully mapped in the hypervisor.

void add_page_table_entries(uint64_t page_table_addr,
                            uint64_t vaddr,
                            uint64_t desc,
                            uint64_t size,
                            uint32_t pt_level,
                            uint32_t table_size_log2,
                            uint64_t nb_pt_entries,
                            uint8_t invalidate_by_ipa) {
  vaddr_end = vaddr + size;
  nb_pt_entries_log2 = table_size_log2 - 3;
  nb_pt_entries = 1 << nb_pt_entries_log2;

  // Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
  // at level 2, etc.).
  pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;

  // Computes the size mapped by the current level (e.g. 0x1000 at level 3, 0x200000 at level 2, etc.).
  pt_level_size = 1 << pt_level_idx_pos;

  // Virtual address mask for the current page table level (e.g. 0xffff_ffff_ffff_f000 at level 3).
  pt_level_mask = -pt_level_size;

  // Mask to apply to get a page table index from a virtual address (e.g. 0x1ff for page tables with 0x200 entries).
  pt_idx_mask = nb_pt_entries - 1;

  // Current page table level descriptor address of the virtual address we want to map.
  desc_addr = page_table_addr + 8 * ((vaddr >> pt_level_idx_pos) & pt_idx_mask);
  is_not_level0_or_level1 = pt_level_idx_pos < 31;

  while (size) {
    // Checks if the range covered by the current page table level is large enough to contain the region we want to map.
    next_range_addr = (vaddr + pt_level_size) & pt_level_mask;
    range_addr_end = (next_range_addr > vaddr_end) ? vaddr_end : next_range_addr;
    size_to_map = range_addr_end - vaddr;

    // If we're at level 3, we add our input descriptor desc into the page table. All descriptors passed to this
    // function have their least significant bit set, so we set bit 1 to make it a page descriptor.
    if (pt_level == 3) {
      *desc_addr = desc | 2;
      // ...
    } else {
      // Since all descriptors are blocks by default, we deviate from a block mapping if:
      //
      //   - we're at page table level 0 or 1.
      //   - the size we're trying to map doesn't cover the region at this page table level.
      //   - the address we're trying to map is not block-aligned.
      if (!is_not_level0_or_level1 || pt_level_size != size_to_map ||
          ((vaddr | (desc & 0xfffffffff000)) & (pt_level_size - 1)) != 0) {
        lower_level_page_table_addr = *desc_addr & 0xfffffffff000;
        add_page_table_entries(lower_level_page_table_addr, vaddr, desc, size_to_map, pt_level + 1, table_size_log2,
                               nb_pt_entries, invalidate_by_ipa);
      }
      // Otherwise, we map a block.
      else {
        *desc_addr = desc;
      }
      // ...
    }
    size -= size_to_map;
    desc += size_to_map;
    desc_addr++;
    vaddr = next_range_addr;
  }
}

Unmapping a virtual address from the hypervisor is handled by unmap_hva_from_hypervisor and is straightforward. It first calls unmap_hypervisor_memory, which simply replaces the descriptors that correspond to the memory range we want to unmap with invalid ones. It then updates the bitmap that tracks mappings in the hypervisor, and we are done.

void unmap_hva_from_hypervisor(uint64_t virt_addr) {
  page_size = 1 << g_page_size_log2;
  addr_mask = -page_size;
  virt_addr_aligned = virt_addr & addr_mask;
  unmap_hypervisor_memory(virt_addr_aligned, page_size);
  page_idx = (virt_addr_aligned - g_va_range_start) / page_size;
  // Gets a pointer to the integer tracking the allocations for the corresponding page index.
  tracker_idx = page_idx / 64;
  tracker_p = &g_bitmaps_array[tracker_idx];
  // Retrieves the current tracker integer, updates it and tries storing it until it works.
  do {
    tracker = exclusive_load(tracker_p);
    new_tracker = tracker | (1ULL << (page_idx % 64));
  } while (exclusive_store(new_tracker, tracker_p));
}
uint64_t unmap_hypervisor_memory(uint64_t addr, uint64_t size) {
  ret = map_memory_range(g_ttbr0_el2, addr, 0, size, g_hyp_first_pt_level, g_page_size_log2, g_nb_pt_entries, 0);
  if (ret) {
    panic();
  }
}

Mapping Memory in the Second Stage

Stage 2 translation allows a hypervisor to control how physical memory is perceived by a virtual machine. It makes sure the guest only has access to specific parts of physical memory by either removing access to address ranges, changing their permissions, or remapping them to other locations.

We have explained in the introduction that stage 2 is a second translation stage that uses its own page tables, of which the base address is stored in the VTTBR_EL2 system register. These page tables control the mappings seen by a given guest, as well as the associated memory protections and attributes. Stage 2 maps physical addresses into intermediate physical addresses, which can then be used in the first translation stage of an OS to map virtual addresses.

stages of address translation

The stage 2 page tables are very similar to stage 1's, although the descriptors have different attributes. The upper attributes of the table descriptors are all reserved to 0. The attributes of the page and block descriptors have different fields that are specific to the second stage and are reported below.

stage 2 descriptor attributes

The stage 2 attributes are similar to the stage 1 attributes:

  • the S2AP, or Stage 2 Access Permissions bits, control whether memory is readable/writable at all lower ELs;
  • the XN[1:0], or eXecute-Never bits, control if memory is executable at EL0 and executable at EL1;
  • the Contiguous, DBM, and AF bits serve the same purpose as their first stage counterparts.
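
As with stage 1, these bits can be expressed as C constants (the names are ours); they match the descriptors built by map_stage2_memory below.

#define S2_MEMATTR_MASK (0xfULL << 2)  // MemAttr[3:0]: memory type and cacheability.
#define S2_S2AP_READ    (1ULL << 6)    // S2AP[0]: readable at EL1&0.
#define S2_S2AP_WRITE   (1ULL << 7)    // S2AP[1]: writable at EL1&0.
#define S2_AF           (1ULL << 10)   // Access Flag.
#define S2_DBM          (1ULL << 51)   // Dirty Bit Modifier.
#define S2_CONT         (1ULL << 52)   // Contiguous hint.
#define S2_XN0          (1ULL << 53)   // XN[0].
#define S2_XN1          (1ULL << 54)   // XN[1]: together with XN[0], encodes EL0/EL1 executability.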

Huawei has also implemented an additional security feature using the "software reserved" bits 55 to 58 of the descriptor. These bits store usage information about the underlying physical memory that is mapped by the descriptor, and their possible values are listed below.

Software Attribute Description
0b0000 Unmarked
0b0100 Level 0 Page Table
0b0101 Level 1 Page Table
0b0110 Level 2 Page Table
0b0111 Level 3 Page Table
0b1000 OS Read-Only
0b1001 OS Module Read-Only
0b1010 Hypervisor-mediated OS Read-Only
0b1011 Hypervisor-mediated OS Module Read-Only
0b1100 Shared Object Protection Execute-Only

For example, by setting the software attributes of a page to 0b0101, the hypervisor indicates that it is a kernel page table of level 1. It can later use this information to prevent prohibited changes to protected memory, such as making this page table writable again.

The get_software_attrs macro can be used to retrieve the software attributes from a stage 2 descriptor.

#define get_software_attrs(desc_s2) ((desc_s2 >> 55) & 0b1111)
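
As a hedged example of how these markings can be used, the sketch below (our own code, not the hypervisor's) refuses to make a physical page writable again once it has been marked as anything other than unmarked memory, which is the kind of check performed before stage 2 permissions are relaxed.

#define SW_ATTR_UNMARKED 0b0000

// Returns 1 if the physical page mapped by this stage 2 descriptor may be
// made writable by the kernel, i.e. if it is not a page table, protected
// kernel memory, or any other marked region.
static int may_become_writable(uint64_t desc_s2) {
  return get_software_attrs(desc_s2) == SW_ATTR_UNMARKED;
}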

To map memory in the second translation stage, the hypervisor uses the map_stage2_memory function. It starts by creating a descriptor using the physical address pa, the permissions perms, and the software attributes software_attrs passed as arguments.

  • Sets the mapping as executable at EL0 if perms contain EXEC_EL0.
  • Sets the mapping as executable at EL1 if perms contain EXEC_EL1.
  • Sets the mapping as writable if perms contain WRITE.
  • Sets the mapping as readable if perms contain READ.
  • Sets the mapping as normal memory if perms contain NORMAL_MEMORY, and as device memory otherwise.

map_stage2_memory then calls map_memory_range, the same function used by map_hypervisor_memory to map memory in the hypervisor's address space. Only this time, instead of TTBR0_EL2, we are adding our mappings in the page tables that VTTBR_EL2 points to.

void map_stage2_memory(uint64_t ipa,
                       uint64_t pa,
                       uint64_t size,
                       stage2_attrs_t perms,
                       software_attrs_t software_attrs) {
  // ...

  // Sets the descriptor's software attributes.
  desc = pa | (software_attrs << 55);
  // If execution is disabled for EL0, we set XN[1] to 1.
  if (!(perms & EXEC_EL0)) {
    desc |= 1 << 54;
  }
  // If execution is disabled at EL1, but permitted at EL0 (or conversely), we set XN[0] to 1.
  if (!(perms & EXEC_EL0) != !(perms & EXEC_EL1)) {
    desc |= 1 << 53;
  }
  // Sets the Access Flag bit.
  desc |= 1 << 10;
  // If our range is writable, S2AP[1] is set to 1.
  if (perms & WRITE) {
    desc |= 1 << 7;
  }
  // If our range is readable, S2AP[0] is set to 1.
  if (perms & READ) {
    desc |= 1 << 6;
  }
  // Sets the memory attributes that correspond to the type of memory being mapped (i.e. normal or device memory).
  desc |= (perms & NORMAL_MEMORY) ? 0x3c : 0xc;

  // Maps the descriptor in the second translation stage.
  map_memory_range(g_vttbr_el2, ipa, desc | 0b01 /* Block descriptor by default */, size, g_start_s2_pt_level, 0xc,
                   g_nb_s2_pt_entries, 1);
}

To unmap memory in the second stage, the hypervisor calls unmap_stage2_memory, which is a wrapper around map_memory_range that replaces the corresponding descriptors with all-zero ones. This removes the memory mappings as well as the software attributes the region was marked with.

uint64_t unmap_stage2_memory(uint64_t ipa, uint64_t size) {
  // Replaces the descriptor that corresponds to the IPA by 0 in the second translation stage.
  return map_memory_range(g_vttbr_el2, ipa, 0, size, g_start_s2_pt_level, 0xc, g_nb_s2_pt_entries, 1);
}

The hypervisor also implements functions to change the permissions and software attributes of a given memory range. change_stage2_software_attrs_per_va_range operates on kernel virtual addresses and walks its page tables to find the corresponding IPA ranges, before calling change_stage2_software_attrs_per_ipa_range on each of them, which does most of the work.

uint64_t change_stage2_software_attrs_per_va_range(uint64_t virt_addr,
                                                   uint64_t size,
                                                   stage2_attrs_t perms,
                                                   software_attrs_t software_attrs,
                                                   uint8_t set_attrs) {
  // An error is returned for userland addresses.
  if ((virt_addr & (1 << 0x3f)) == 0) {
    return 0xfffffffe;
  }
  // Retrieves page table information using system register values configured by the kernel.
  ret = get_kernel_pt_info(1, &pgd, &pt_size_log2, &pt_level, &nb_pt_entries);
  if (ret) {
    return 0xfffffffe;
  }
  page_offset_mask = (1 << pt_size_log2) - 1;
  // Makes sure the address and size are page-aligned.
  if ((virt_addr | size) & page_offset_mask) {
    return 0xfffffffe;
  }

  while (size) {
    // Retrieves page table information depending on which address space the address is part of (userland or kernel).
    // However, because userland addresses are ignored by this function, virt_addr is always a kernel address.
    is_kernel = virt_addr >> 0x3f;
    ret = get_kernel_pt_info(is_kernel, &pgd, &pt_size_log2, &pt_level, &nb_pt_entries);
    if (ret) {
      return 0xfffffff7;
    }

    nb_pt_entries_log2 = pt_size_log2 - 3;

    // Mask to apply to get a page table index from a virtual address (e.g. 0x1ff for page tables with 0x200 entries).
    pt_idx_mask = nb_pt_entries - 1;

    // Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3,
    // 21 at level 2, etc.).
    pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;

    pt_level_max_va = -(nb_pt_entries << pt_level_idx_pos);
    if (pt_level_max_va > virt_addr) {
      return 0xfffffff7;
    }

    desc_oa = pgd;

    for (;;) {
      // Maps the page table of the current level into the hypervisor's address space.
      pt_hyp_va = map_page_table_hyp_and_stage2(desc_oa, pt_size_log2, pt_level, nb_pt_entries, &was_not_marked);
      if (!pt_hyp_va) {
        return 0xfffffff7;
      }

      desc = *(uint64_t*)(pt_hyp_va + 8 * ((virt_addr >> pt_level_idx_pos) & pt_idx_mask));
      unmap_va_from_hypervisor(pt_hyp_va);
      desc_oa = desc & 0xfffffffff000;
      // ...

      // Checks if the descriptor is valid.
      if ((desc & 1) == 0) {
        return 0xfffffff7;
      }

      // Checks if it's a page or block descriptor.
      if (pt_level == 3 || (desc & 2) == 0) {
        pt_level_size = 1 << pt_level_idx_pos;
        // Descriptor adjusted to contain the IPA offset into a block, if it's a block descriptor.
        ipa = desc_oa | (pt_level_size - 1) & virt_addr;
        if (ipa == 0xffffffffffffffff) {
          return 0xfffffff7;
        }
        // Finds the smallest range covered by a page table level that contains the address range to map.
        while (pt_level_size > size || ((pt_level_size - 1) & ipa) != 0) {
          pt_level_size >>= nb_pt_entries_log2;
        }
        ret = change_stage2_software_attrs_per_ipa_range(ipa, pt_level_size, perms, software_attrs, set_attrs);
        if (ret) {
          return ret;
        }
        // Once we have reached the last level entry of a memory mapping the size and address are updated and we
        // continue walking the page tables for the next entries.
        size -= pt_level_size;
        virt_addr += pt_level_size;
        break;
      }
      ++pt_level;
      pt_level_idx_pos -= nb_pt_entries_log2;
    }
  }
  return 0;
}

change_stage2_software_attrs_per_ipa_range first ensures that the IPA range is present in the stage 2 page tables by calling allocate_page_tables_per_addr_range. It then gets all the stage 2 descriptors that the IPA range spans using get_stage2_page_table_descriptor_for_ipa.

The main check performed by change_stage2_software_attrs_per_ipa_range involves the software attributes. The function is called with the following arguments:

  • set_attrs, a boolean that signifies whether we want to set or clear the software attributes;
  • software_attrs, the software attributes to set or to be cleared.

The function iterates over each stage 2 descriptor and makes sure the current software attributes are either:

  • unmarked (i.e. equal to 0) when they are being set,
  • equal to software_attrs when they are being cleared.

If everything is as expected, then it calls map_stage2_memory to make the actual changes.

uint64_t change_stage2_software_attrs_per_ipa_range(uint64_t ipa,
                                                    uint64_t size,
                                                    stage2_attrs_t perms,
                                                    software_attrs_t software_attrs,
                                                    uint8_t set_attrs) {
  // Allocates the stage 2 page tables that correspond to the input address range, in case they have not been mapped
  // yet.
  ret =
      allocate_page_tables_per_addr_range(g_vttbr_el2, ipa, ipa, size, g_start_s2_pt_level, 0xc, g_nb_s2_pt_entries, 1);
  if (ret) {
    return 0xfffffffa;
  }

  for (offset = 0; offset < size; offset += pt_level_size) {
    desc = get_stage2_page_table_descriptor_for_ipa(ipa + offset, &pt_level_size);
    // Checks if the descriptor is valid.
    if ((desc & 1) == 0) {
      return 0xfffffff7;
    }
    software_attrs_from_desc = get_software_attrs(desc);
    // If set_attrs is true, we want to change the software attributes, so we make sure the memory range is currently
    // unmarked.
    //
    // If set_attrs is false, we want to clear the software attributes, so we make sure the memory range is marked with
    // the expected software attributes before unmarking it.
    expected_software_attrs = set_attrs ? 0 : software_attrs;
    if (software_attrs_from_desc != expected_software_attrs) {
      if (software_attrs_from_desc == OS_RO) {
        return 0;
      } else {
        return 0xfffffff8;
      }
    }
  }

  perms |= NORMAL_MEMORY;
  if (!set_attrs) {
    software_attrs = 0;
    perms = EXEC_EL0 | WRITE | READ | NORMAL_MEMORY;
  }

  // Remaps the input range with the new software and stage 2 attributes.
  for (offset = 0; offset < size; offset += pt_level_size) {
    desc = get_stage2_page_table_descriptor_for_ipa(ipa + offset, &pt_level_size);
    desc_oa = desc & 0xfffffffff000;
    map_stage2_memory(ipa + offset, desc_oa, pt_level_size, perms, software_attrs);
  }

  return ret;
}

The last function dealing with stage 2 page tables that we want to detail is get_stage2_page_table_descriptor_for_ipa. It is used every time the hypervisor needs the stage 2 descriptor that corresponds to a given IPA. When called, it walks the stage 2 page tables and returns the page or block descriptor mapping the IPA.

uint64_t get_stage2_page_table_descriptor_for_ipa(uint64_t ipa, uint64_t* desc_va_range) {
  // ...

  // Gets stage 2 page table information.
  pt_level = g_start_s2_pt_level;
  nb_pt_entries = g_nb_s2_pt_entries;
  nb_pt_entries_log2 = g_page_size_log2 - 3;

  // Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
  // at level 2, etc.).
  pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;

  // Computes the size mapped by the current level (e.g. 0x1000 at level 3, 0x200000 at level 2, etc.).
  pt_level_size = 1 << pt_level_idx_pos;

  // Checks that the IPA is in the correct range.
  if (g_nb_s2_pt_entries << pt_level_idx_pos <= ipa) {
    return 0;
  }

  pt_addr = g_vttbr_el2;

  for (;;) {
    desc = *(uint64_t*)(pt_addr + (8 * (ipa >> pt_level_idx_pos)));
    // Checks if the descriptor is valid.
    if ((desc & 1) == 0) {
      return 0;
    }
    // Computes the offset into the address range of the current page table level.
    addr_offset = (pt_level_size - 1) & ipa;
    // If it is a block descriptor, returns a descriptor adjusted to contain the IPA offset into the block.
    if ((desc & 2) == 0) {
      return desc | addr_offset & (0xffffffffffffffff << g_page_size_log2);
    }
    // If we have reached level 3, returns the corresponding page descriptor.
    if (pt_level == 3) {
      return desc;
    }
    // ...
    pt_addr = desc & 0xfffffffff000;
    pt_level_idx_pos -= nb_pt_entries_log2;
    pt_level++;
  }
}

This wraps up the section about memory management, covering the hypervisor's own memory mappings and its stage 2 implementation. The next part is about exceptions and how the hypervisor intercepts them to enhance the device's security at runtime and to interact with the kernel.

Exception Handling

The ARM specification defines an exception as "any event that can cause the currently executing program to be suspended and cause a change in state to execute code to handle that exception". In the list below, you can see the different types of exceptions implemented on AArch64.

  • Synchronous: an exception generated as the result of executing or trying to execute an instruction. They are generally raised by instructions such as SVC, HVC, or SMC, but can also happen because of alignment errors, the execution of PXN instructions, etc.
  • IRQ, or Interrupt ReQuest: this exception is usually raised when a hardware device sends an interrupt to the CPU. This exception can either be physical or be generated virtually by software.
  • FIQ, or Fast Interrupt reQuest: identical to an IRQ but with a higher priority.
  • SError, or System Error: an exception type usually generated in response to erroneous memory accesses.

Exceptions are the most common way for a given exception level to interact with the others. When an exception is raised, it is always taken to the current EL or a higher one; exceptions coming from EL0 are always taken to a higher EL, since exceptions are never handled at EL0. Likewise, returning from an exception can only target the current EL or a lower one.

When an exception is raised to a particular EL, depending on its type, it will be handled by a specific function referenced in the exception vector table. The address of this table is stored in the VBAR_ELn system register, where n is the number of the current EL. The layout of an exception vector table is given below:

Address Exception Type Source
+0x000 Synchronous Current EL with SP0
+0x080 IRQ/vIRQ Current EL with SP0
+0x100 FIQ/vFIQ Current EL with SP0
+0x180 SError/vSError Current EL with SP0
+0x200 Synchronous Current EL with SPx
+0x280 IRQ/vIRQ Current EL with SPx
+0x300 FIQ/vFIQ Current EL with SPx
+0x380 SError/vSError Current EL with SPx
+0x400 Synchronous Lower EL using AArch64
+0x480 IRQ/vIRQ Lower EL using AArch64
+0x500 FIQ/vFIQ Lower EL using AArch64
+0x580 SError/vSError Lower EL using AArch64
+0x600 Synchronous Lower EL using AArch32
+0x680 IRQ/vIRQ Lower EL using AArch32
+0x700 FIQ/vFIQ Lower EL using AArch32
+0x780 SError/vSError Lower EL using AArch32

By default, the hypervisor is able to handle:

  • HVCs, which are explicit requests from the kernel to access EL2 APIs;
  • faults that occur in the second stage during address translation (unmapped addresses, permissions issues, etc.).

However, these operations are not enough to provide proper kernel supervision, and this is where trapping comes in. The hypervisor can be configured to make certain actions that would normally be allowed at EL1 cause an exception to a higher EL. It is also possible to redirect specific exceptions from the kernel and have the hypervisor handle them instead. All of this can be customized using the HCR_EL2 system register, whose value was set in hyp_set_el2_and_enable_stage_2_per_cpu.

If we take all this into account, we can deduce that HHEE will handle the following exceptions:

  • HVCs;
  • stage 2 faults;
  • traps from secure monitor calls that originate from the kernel (they can be forwarded to the secure monitor afterwards);
  • traps from accesses to the following system registers:
    • SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, TCR_EL1, ESR_EL1, FAR_EL1, AFSR0_EL1, AFSR1_EL1, MAIR_EL1, AMAIR_EL1, CONTEXTIDR_EL1.

Now going back to the hypervisor's implementation, when an EL1 exception is raised to EL2, it will be handled by the corresponding entry in the exception vector table. In our case, all the asynchronous exceptions, namely IRQs, FIQs, and SErrors, that come from EL1 are not rerouted to EL2. Indeed, in HCR_EL2, the corresponding bits IMO, FMO, and AMO are set to 0. This means that all exceptions the hypervisor will have to deal with are synchronous. Therefore, we only have to focus on the entries at addresses VBAR_EL2 + 0x400 and VBAR_EL2 + 0x600, which are respectively SynchronousExceptionA64, the handler for synchronous exceptions from a lower EL using AArch64, and SynchronousExceptionA32, for exceptions from a lower EL using AArch32.
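
As a rough sketch, this configuration corresponds to an HCR_EL2 value along the following lines; the bit positions come from the ARMv8-A specification, but the actual value written by hyp_set_el2_and_enable_stage_2_per_cpu may contain additional bits.

// Hypothetical HCR_EL2 configuration consistent with the behavior described above.
#define HCR_VM  (1UL << 0)   // Enables the second stage of translation for EL1&0.
#define HCR_FMO (1UL << 3)   // Routes physical FIQs to EL2 (left unset here).
#define HCR_IMO (1UL << 4)   // Routes physical IRQs to EL2 (left unset here).
#define HCR_AMO (1UL << 5)   // Routes SErrors to EL2 (left unset here).
#define HCR_TSC (1UL << 19)  // Traps SMC instructions executed at EL1.
#define HCR_TVM (1UL << 26)  // Traps writes to the EL1 virtual memory control registers
                             // (SCTLR_EL1, TTBRn_EL1, TCR_EL1, MAIR_EL1, etc.).
#define HCR_RW  (1UL << 31)  // EL1 executes in AArch64 state.

uint64_t hcr_el2 = HCR_VM | HCR_TSC | HCR_TVM | HCR_RW;  // IMO, FMO and AMO stay cleared.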

SynchronousExceptionA64 reads the EC field of the EL2 Exception Syndrome Register (ESR_EL2) to determine the class of exception we are dealing with. It first checks if it comes from a SVC, a HVC, or a SMC. However, the SVC check is not needed, since SVC trapping is disabled by default (the SVC_EL1 and SVC_EL0 bits of HFGITR_EL2 reset to 0). For exceptions of any other origin, we call hhee_handle_abort_from_aarch64.

void SynchronousExceptionA64(saved_regs_t args) {
  // Enables asynchronous exceptions.
  asm("msr daifclr, #4");
  // Determines which exception occurred by checking the exception class.
  ec = (get_esr_el2() >> 26) & 0b111111;
  if (ec == 0b010101    /* SVC instruction execution in AArch64 state */
      || ec == 0b010110 /* HVC instruction execution in AArch64 state */
      || ec == 0b010111 /* SMC instruction execution in AArch64 state */) {
    hhee_handle_hvc_smc_instructions(&args);
  } else {
    hhee_handle_abort_from_aarch64(&args);
  }
  asm("eret");
}

hhee_handle_hvc_smc_instructions redirects the execution flow to hhee_handle_hvc_instruction if the exception originates from a HVC, or hhee_handle_smc_instruction if it comes from a SMC. The handling of these instructions will be detailed in a later section.

saved_regs_t* hhee_handle_hvc_smc_instructions(uint64_t x0,
                                               uint64_t x1,
                                               uint64_t x2,
                                               uint64_t x3,
                                               saved_regs_t* saved_regs) {
  // ...
  call_type = (get_esr_el2() >> 0x1a) & 3;
  if (call_type == 2 /* HVC */) {
    hhee_handle_hvc_instruction(x0, x1, x2, x3, saved_regs);
    return saved_regs;
  }
  if (call_type == 3 /* SMC */) {
    // Updates the exception return address to the instruction after the SMC, because trapping an instruction does not
    // update the ELR_EL2 register like a regular HVC would.
    set_elr_el2(get_elr_el2() + 4);
    hhee_handle_smc_instruction(x0, x1, x2, x3, saved_regs);
    return saved_regs;
  }
  // Normally unreachable.
  log_and_wait_for_interrupt();
}

hhee_handle_abort_from_aarch64 checks the exception class again and dispatches to the corresponding handler:

void hhee_handle_abort_from_aarch64(saved_regs_t* regs) {
  // ...
  // Determines which exception occurred by checking the exception class.
  esr_el2 = get_esr_el2();
  ec = (esr_el2 >> 26) & 0b111111;
  switch (ec) {
    case 0b011000: /* Trapped MSR, MRS or System instruction execution in AArch64 state */
      hhee_trap_system_registers(regs, esr_el2);
      break;
    case 0b100000: /* Instruction Abort from a lower Exception level */
      handle_inst_abort_from_lower_el(regs, esr_el2);
      break;
    case 0b100100: /* Data Abort from a lower Exception level */
      handle_data_abort_from_lower_el(regs, esr_el2);
      break;
    default:
      log_and_wait_for_interrupt();
  }
}

SynchronousExceptionA32, on the other hand, only handles instruction and data aborts. It first calls hhee_handle_abort_from_aarch32, which then redirects the execution to handle_inst_abort_from_lower_el if an instruction abort occurred, or to handle_data_abort_from_lower_el, if a data abort occurred.

void SynchronousExceptionA32(saved_regs_t args) {
  // Enables asynchronous exceptions.
  asm("msr daifclr, #4");
  hhee_handle_abort_from_aarch32(&args);
  asm("eret");
}
void hhee_handle_abort_from_aarch32(saved_regs_t* regs) {
  // ...
  esr_el2 = get_esr_el2();
  ec = (esr_el2 >> 26) & 0b111111;
  switch (ec) {
    case 0b100000: /* Instruction Abort from a lower Exception level */
      handle_inst_abort_from_lower_el(regs, esr_el2);
      break;
    case 0b100100: /* Data Abort from a lower Exception level */
      handle_data_abort_from_lower_el(regs, esr_el2);
      break;
    default:
      log_and_wait_for_interrupt();
  }
}

To make the explanations easier, we can split exception handling into three main categories:

  • trapped instruction handling;
  • instruction and data abort handling;
  • HVC and SMC handling.

In the next sections, we will have a look at each of them and explain how they tie in with the security assurances we listed at the beginning of this article.

Trapped Instruction Handling

As seen previously, the hypervisor has been configured to trap a set of system registers. Accesses to these registers are handled by hhee_trap_system_registers. The main goal of this function is to add an additional layer of security when modifying critical registers. The hypervisor checks which operation the kernel is trying to perform and decides whether it is allowed or not.

To know which operation we are dealing with, the CPU encodes information about the corresponding instruction into ESR_EL2:

esr_el2 for trapped instructions

This information is extracted and handled accordingly. For starters, depending on the value of CRn, a different function from the sysregs_handlers_by_crn array is called.

uint64_t (*sysregs_handlers_by_crn[16])(uint64_t, uint64_t, uint64_t) = {
    0, hhee_sysregs_crn1,  hhee_sysregs_crn2,  0,
    0, hhee_sysregs_crn5,  hhee_sysregs_crn6,  0,
    0, 0,                  hhee_sysregs_crn10, 0,
    0, hhee_sysregs_crn13, 0,                  0,
};
void hhee_trap_system_registers(saved_regs_t* regs, uint64_t esr_el2) {
  // ...
  // Direction, bit [0] = 0: Write access, including MSR instructions.
  if ((esr_el2 & 1) == 0) {
    op0 = (esr_el2 >> 20) & 0b11;
    op2 = (esr_el2 >> 17) & 0b111;
    op1 = (esr_el2 >> 14) & 0b111;
    crn = (esr_el2 >> 10) & 0b1111;
    rt = (esr_el2 >> 5) & 0b11111;
    crm = (esr_el2 >> 1) & 0b1111;

    reg_val = *(uint64_t*)(&regs->x0 + rt);
    if (op0 == 0b11 /* Moves to and from Non-debug System registers */
        && op1 == 0b000 /* Accessible from EL1 or higher */) {
      handler = sysregs_handlers_by_crn[crn];
      if (handler) {
        handler(reg_val, crm, op2);
      }
    }
  }

  // Updates the exception return address by adding the size of the instruction to the current value of ELR_EL2.
  //
  // IL, bit [25]: Instruction Length for synchronous exceptions.
  il = (esr_el2 >> 25) & 1;
  set_elr_el2(get_elr_el2() + 2 * (il + 1));
}

From this point on, it's relatively straightforward. We know which register accesses are trapped and where the handlers are. Now we just need to map the registers to their handlers, depending on their CRn value.

System Register Op0 Op1 CRn CRm Op2 Handler
SCTLR_EL1 3 0 1 0 0 hhee_sysregs_crn1
TTBR0_EL1 3 0 2 0 0 hhee_sysregs_crn2
TTBR1_EL1 3 0 2 0 1 hhee_sysregs_crn2
TCR_EL1 3 0 2 0 2 hhee_sysregs_crn2
AFSR0_EL1 3 0 5 1 0 hhee_sysregs_crn5
AFSR1_EL1 3 0 5 1 1 hhee_sysregs_crn5
ESR_EL1 3 0 5 2 0 hhee_sysregs_crn5
FAR_EL1 3 0 6 0 0 hhee_sysregs_crn6
MAIR_EL1 3 0 10 2 0 hhee_sysregs_crn10
AMAIR_EL1 3 0 10 3 0 hhee_sysregs_crn10
CONTEXTIDR_EL1 3 0 13 0 1 hhee_sysregs_crn13
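
As a concrete (and hypothetical) example, if the kernel executes msr ttbr1_el1, x3, the write traps to EL2 with an exception class of 0b011000 and an ISS encoding Op0=3, Op1=0, CRn=2, CRm=0, Op2=1, Rt=3 and Direction=0, so hhee_trap_system_registers ends up doing the equivalent of:

reg_val = regs->x3;                    // Value the kernel tried to write to TTBR1_EL1.
handler = sysregs_handlers_by_crn[2];  // hhee_sysregs_crn2.
handler(reg_val, 0 /* crm */, 1 /* op2 */);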

CRn 1: SCTLR_EL1

The first trap handler we are going to look at is hhee_sysregs_crn1, which verifies modifications made to the SCTLR_EL1 system register. The function first checks if the system registers structure stored in the thread local storage has been initialized. This structure is initialized once the kernel page tables have been processed by the hypervisor.

  • If the structure has not been initialized yet, the function directly skips to the call to check_sctlr_el1;
  • otherwise, it checks which register, TTBR0_EL1 or TTBR1_EL1, sets the address space ID;
    • if it's TTBR0_EL1, we move on to the check_sctlr_el1 call;
    • if it's TTBR1_EL1, and the ASID is not 0, we make sure the kernel is not trying to disable the MMU, we process the kernel page tables using process_ttbr1_el1, and then we call check_sctlr_el1.

If check_sctlr_el1 returns successfully, it means all operations have been allowed, and we update SCTLR_EL1 with the value provided by the kernel.

uint64_t hhee_sysregs_crn1(uint64_t rt_val, uint32_t crm, uint32_t op2) {
  if (crm) {
    return 0xfffffffe;
  }

  // ACTLR_EL1 - should be unreachable with the current configuration.
  if (op2 == 1) {
    return 0xffffffff;
  }
  // CPACR_EL1 - should be unreachable with the current configuration.
  else if (op2 == 2) {
    return 0xfffffffe;
  }

  sys_regs = current_cpu->sys_regs;

  // If the structure storing system register values has not been initialized yet, jumps to the SCTLR_EL1 validation
  // routine.
  if (!atomic_get(&sys_regs->regs_inited)) {
    goto SCTLR_CHECK;
  }

  // Checks which translation table base register defines the address space ID. If it's TTBR0_EL1, we move on to the
  // SCTLR_EL1 validation routine.
  //
  // A1, bit [22]: Selects whether TTBR0_EL1 or TTBR1_EL1 defines the ASID.
  if (!(sys_regs->tcr_el1 & (1 << 22))) {
    goto SCTLR_CHECK;
  }

  ttbr1_el1 = get_ttbr1_el1();
  // If no ASID has been specified, we go to the SCTLR_EL1 validation routine.
  if (!(ttbr1_el1 & 0xffff000000000000)) {
    goto SCTLR_CHECK;
  }

  // If we try to disable the MMU, an error is returned.
  //
  // M, bit [0]: MMU enable for EL1&0 stage 1 address translation.
  if (!(rt_val & 1)) {
    debug_print(0x100, "Disallowed turning off of MMU");
    return 0xfffffff8;
  }

  // Processes the translation table referenced by TTBR1_EL1.
  ret = process_ttbr1_el1(sys_regs, ttbr1_el1);
  if (ret) {
    return ret;
  }

  // If no error occurred, mark the kernel page tables as processed.
  current_cpu->kernel_pt_processed = 1;

SCTLR_CHECK:
  // Checks the value of SCTLR_EL1 and updates the system register if changes were allowed by the hypervisor.
  ret = check_sctlr_el1(rt_val);
  if (!ret) {
    set_sctlr_el1(rt_val);
  }

  return ret;
}

If system registers have already been saved in the g_sysregs global structure, process_ttbr1_el1 won't allow changing the address stored in TTBR1_EL1. Otherwise, it processes the kernel page tables using process_kernel_page_tables and, on success, stores the values of TCR_EL1, MAIR_EL1, and TTBR1_EL1 in this structure. The internals of process_kernel_page_tables are detailed later in this article in the section dealing with page table management.

uint64_t process_ttbr1_el1(sys_regs_t* sys_regs, uint64_t ttbr1_el1) {
  // ...
  // Checks if the ASID is determined by TTBR1_EL1.
  //
  // A1, bit [22]: Selects whether TTBR0_EL1 or TTBR1_EL1 defines the ASID.
  if (((sys_regs->tcr_el1 >> 22) & 1) != 0) {
    // Remove the ASID from TTBR1_EL1.
    ttbr1_el1 &= 0xffffffffffff;
  }

  // Check if the system registers have been stored.
  if (atomic_get(&sys_regs->regs_inited)) {
    // Changing TTBR1_EL1 is not allowed afterwards.
    if (sys_regs->ttbr1_el1 == ttbr1_el1 && sys_regs->tcr_el1 == get_tcr_el1() &&
        sys_regs->mair_el1 == get_mair_el1()) {
      return 0;
    } else {
      return 0xfffffff8;
    }
  } else {
    // Locking is required to modify the global structure containing the system registers values.
    spin_lock(&g_sys_regs_lock);
    // Retry the atomic read after taking the lock.
    if (!atomic_get(&sys_regs->regs_inited)) {
      // Store the system registers values.
      sys_regs->tcr_el1 = get_tcr_el1();
      sys_regs->mair_el1 = get_mair_el1();
      sys_regs->ttbr1_el1 = ttbr1_el1;
      // Only mark the registers as stored if the processing of the kernel page tables succeeded.
      ret = process_kernel_page_tables();
      if (!ret) {
        atomic_set(1, &sys_regs->regs_inited);
      }
    }
    spin_unlock(&g_sys_regs_lock);
    return ret;
  }
}

Then, regarding the function check_sctlr_el1, it performs multiple verifications listed below.

  • Checks that the kernel is not trying to disable the MMU if it has been enabled.
  • Checks that the kernel is not trying to disable WXN.
  • Checks that the kernel is not trying to disable PAN.
  • Checks that data accesses at EL1 are little-endian.
  • Checks that data accesses at EL0 are little-endian, or that mixed-endian support is implemented on the device.

Other miscellaneous checks are carried out but are not explained here. You can take a look at the code below for more information.

uint64_t check_sctlr_el1(int32_t rt_val) {
  // If the kernel page tables have been processed by the hypervisor and the MMU is enabled, the kernel is not allowed to
  // disable it.
  //
  // M, bit [0]: MMU enable for EL1&0 stage 1 address translation.
  if (!(rt_val & 1) && current_cpu->kernel_pt_processed && get_sctlr_el1() & 1) {
    debug_print(0x100, "Disallowed turning off of MMU");
    return 0xfffffff8;
  }

  // If WXN is enabled, the kernel is not allowed to disable it.
  //
  // WXN, bit [19]: Write permission implies XN (Execute-never).
  if (!(rt_val & (1 << 19)) && get_sctlr_el1() & (1 << 19)) {
    debug_print(0x102, "Disallowed turning off of WXN");
    return 0xfffffff8;
  }

  // The kernel is not allowed to set SPAN, since that would prevent PAN from being set automatically on taking an
  // exception to EL1.
  //
  // SPAN, bit [23]: Set Privileged Access Never, on taking an exception to EL1.
  if (rt_val & (1 << 23) && !(get_sctlr_el1() & (1 << 23))) {
    return 0xfffffff8;
  }
  // If PAN is not supported on the platform, returns an error.
  //
  // PAN, bits [23:20]: Privileged Access Never, indicates support for the PAN bit.
  else if (!((get_id_aa64mmfr1_el1() >> 20) & 0xf)) {
    return 0xfffffffe;
  }

  // Makes sure endianness of data accesses are little-endian at EL1.
  //
  // EE, bit [25]: Endianness of data accesses at EL1.
  if (rt_val & (1 << 25)) {
    return 0xfffffffd;
  }

  // Checks the endianness at EL0. Keeps going if we're in little endian, otherwise checks in ID_AA64MMFR0_EL1 if mixed-
  // endian is supported on the platform, and if it's not, returns an error.
  //
  // E0E, bit [24]: Endianness of data accesses at EL0.
  if ((rt_val & (1 << 24)) && !((get_id_aa64mmfr0_el1() >> 8) & 0xf)) {
    return 0xfffffffe;
  }

  // Having SCTLR_EL1 & 0x520c40 equal to 0x500800 means we want the following bits to have fixed values:
  //
  //   - nAA, bit [6]: Non-aligned access.
  //     * 0b0: certain load/store instructions generate an Alignment fault if all bytes being accessed are not
  //            16-byte aligned.
  //   - EnRCTX, bit [10]: Enable EL0 Access to the CFP RCTX, DVP RCT and CPP RCTX instructions.
  //     * 0b0: EL0 access to these instructions is disabled and are trapped to EL1.
  //   - EOS, bit [11]: Exception Exit is Context Synchronizing.
  //     * 0b1: An exception return from EL1 is a context synchronizing event.
  //   - TSCXT, bit [20]: Trap EL0 Access to the SCXTNUM_EL0 register.
  //     * 0b1: EL0 access to SCXTNUM_EL0 is disabled, causing an exception to EL1.
  //   - EIS, bit [22]: Exception Entry is Context Synchronizing.
  //     * 0b1: The taking of an exception to EL1 is a context synchronizing event.
  if ((rt_val & 0x520c40) != 0x500800) {
    return 0xfffffffe;
  }

  // Returns without error if the kernel wants to disable the SETEND instruction at EL0.
  //
  // SED, bit [8]: SETEND instruction disable.
  if (rt_val & (1 << 8)) {
    return 0;
  }

  // Otherwise it checks:
  //
  //   - in ID_AA64PFR0_EL1, whether EL0 can execute in both AArch64 and AArch32 states;
  //   - in ID_AA64MMFR0_EL1, whether mixed-endian support is implemented.
  if ((get_id_aa64pfr0_el1() & 0xf) > 1 && !((get_id_aa64mmfr0_el1() >> 8) & 0xf)) {
    return 0;
  }

  return 0xfffffffe;
}

CRn 2: TTBR0_EL1, TTBR1_EL1, and TCR_EL1

hhee_sysregs_crn2 verifies the modifications made to TTBR0_EL1, TTBR1_EL1, and TCR_EL1.

  • For TTBR0_EL1, modifications to the ASID are not allowed until the hypervisor has processed the kernel page tables, but other than that and some sanity checks, all changes are permitted.
  • For TTBR1_EL1, the hypervisor performs sanity checks, and once the page table address has been set for the first time, the kernel won't be allowed to change it again, unless it is to switch from the identity mapping's PGD idmap_pg_dir to the swapper's PGD swapper_pg_dir. When the hypervisor allows the modification of TTBR1_EL1, the kernel page tables are processed using process_ttbr1_el1.
  • For TCR_EL1, the verifications are mostly sanity checks, and once the kernel page tables have been processed, the kernel can't change its value.
uint64_t hhee_sysregs_crn2(uint64_t rt_val, uint32_t crm, uint32_t op2) {
  if (crm) {
    return 0xfffffffe;
  }

  sys_regs = current_cpu->sys_regs;

  // TTBR0_EL1.
  if (op2 == 0) {
    // The Common not Private bit is ignored.
    //
    // CnP, bit [0]: Common not Private.
    new_ttbr = rt_val & 0xfffffffffffffffe;
    cnp = rt_val & 1;
    asid = rt_val & 0xffff000000000000;

    sys_regs = current_cpu->sys_regs;
    if (atomic_get(&sys_regs->regs_inited)) {
      // Checks if the ASID is determined by TTBR1_EL1 and if the ASID the kernel wants to set in TTBR0_EL1 is not null.
      // Verifications are made using the saved value of TCR_EL1, if it's available.
      //
      // A1, bit [22]: Selects whether TTBR0_EL1 or TTBR1_EL1 defines the ASID.
      if (asid && sys_regs->tcr_el1 & (1 << 22)) {
        new_ttbr &= 0xffffffffffff;
      }
    }
    // Checks if the ASID is determined by TTBR1_EL1 and if the ASID the kernel wants to set in TTBR0_EL1 is not null.
    // Verifications are made using the current value of TCR_EL1, because the saved value has not been set yet.
    else if (asid && get_tcr_el1() & (1 << 22)) {
      return 0xfffffffe;
    }
    // If the kernel wants to enable Common not Private translations but it is not supported by the device, it returns
    // an error.
    else if (cnp && !(get_id_aa64mmfr2_el1() & 0xf)) {
      return 0xfffffffe;
    }

    // Updates TTBR0_EL1 with the new value.
    set_ttbr0_el1(new_ttbr);
    return 0;
  }

  // TTBR1_EL1.
  if (op2 == 1) {
    if (atomic_get(&sys_regs->regs_inited)) {
      curr_ttbr = sys_regs->ttbr1_el1;
      new_ttbr = rt_val & 0xffffffffffff;
      // If the ASID is determined by TTBR1_EL1.
      if (sys_regs->tcr_el1 & (1 << 22)) {
        // Switches from the identity mapping PGD to the swapper PGD.
        if (new_ttbr - curr_ttbr == 0x2000) {
          asid = rt_val & 0xffff000000000000;
          goto PROCESS_KERNEL_PT;
        }
      }
      // If we are not switching from IDMAP to SWAPPER, the page table address can't be changed once it has been set.
      if (new_ttbr != curr_ttbr) {
        debug_printf(0x101, "Disallowed change of privileged page table to 0x%016lx", new_ttbr);
        return 0xfffffff8;
      }
      asid = rt_val & 0xffff000000000000;
    } else {
      // If TTBR0_EL1 defines the ASID, the kernel is not allowed to change it in TTBR1_EL1.
      asid = rt_val & 0xffff000000000000;
      if (asid && !(get_tcr_el1() & (1 << 22))) {
        return 0xfffffffe;
      }

      // If the kernel wants to enable Common not Private translations but it is not supported by the device, it returns
      // an error.
      cnp = rt_val & 1;
      if (cnp && !(get_id_aa64mmfr2_el1() & 0xf)) {
        return 0xfffffffe;
      }
    }

PROCESS_KERNEL_PT:
    // If TTBR1_EL1's ASID is non-null, kernel page tables have not been processed by the hypervisor and the MMU is
    // enabled, then we process the translation table referenced by TTBR1_EL1.
    if (asid && !current_cpu->kernel_pt_processed && get_sctlr_el1() & 1) {
      ret = process_ttbr1_el1(sys_regs, rt_val);
      if (ret) {
        return ret;
      }
      // If no error occurred, marks the kernel page tables as processed.
      current_cpu->kernel_pt_processed = 1;
    }
    // Updates TTBR1_EL1 with the new value.
    set_ttbr1_el1(rt_val);
    return 0;
  }

  // TCR_EL1.
  if (op2 == 2) {
    if (atomic_get(&sys_regs->regs_inited)) {
      // If the saved values of the system registers have already been initialized in the hypervisor, changing TCR_EL1
      // returns an error.
      if (rt_val != sys_regs->tcr_el1) {
        return 0xfffffff8;
      }
      // Otherwise we just rewrite the value in the register.
      set_tcr_el1(rt_val);
      return 0;
    }

    // For translations using TTBR0_EL1 and TTBR1_EL1, makes sure the kernel does not enable the IMPLEMENTATION DEFINED
    // hardware use of bits 59, 60, 61 and 62 of the stage 1 translation table Block or Page entries (the other reserved
    // bits covered by the mask must also remain zero).
    //
    //   - HWU059-HWU062, bits [46:43]: Indicates IMPLEMENTATION DEFINED hardware use of bits 59, 60, 61 and 62 of the
    //                                  stage 1 translation table Block or Page entry for translations using TTBR0_EL1.
    //   - HWU159-HWU162, bits [50:47]: Indicates IMPLEMENTATION DEFINED hardware use of bits 59, 60, 61 and 62 of the
    //                                  stage 1 translation table Block or Page entry for translations using TTBR1_EL1.
    if (rt_val & 0xfffff80800000040) {
      return 0xfffffffe;
    }

    // If 16-bit ASIDs are not supported by the device, returns an error if the kernel tries to configure an ASID of
    // size 16 bits.
    //
    // AS, bit [36]: ASID Size.
    if (rt_val & (1 << 36) && !((get_id_aa64mmfr0_el1() >> 4) & 0xf)) {
      return 0xfffffffe;
    }

    // If hardware updates of access flag and dirty states are not supported by the device, returns an error if the
    // kernel tries to enable them.
    //
    //   - HA, bit [39]: Hardware Access flag update in stage 1 translations.
    //   - HD, bit [40]: Hardware management of dirty state in stage 1 translations.
    if ((rt_val & (1 << 39) || rt_val & (1 << 40)) && !(get_id_aa64mmfr1_el1() & 0xf)) {
      return 0xfffffffe;
    }

    // If hierarchical permission disables are not supported by the device, returns an error if the kernel tries to
    // enable one of them.
    //
    //   - HPD0, bit [41]: Hierarchical Permission Disables in TTBR0_EL1.
    //   - HPD1, bit [42]: Hierarchical Permission Disables in TTBR1_EL1.
    if ((rt_val & (1 << 41) || rt_val & (1 << 42)) && !(get_id_aa64mmfr1_el1() >> 4 & 0xf)) {
      return 0xfffffffe;
    }

    // Updates TCR_EL1 with the new value if no error was encountered.
    set_tcr_el1(rt_val);
    return 0;
  }
  return 0xfffffffe;
}

CRn 5: AFSR0_EL1, AFSR1_EL1, and ESR_EL1

The next trapped registers we are going to look at are the ones with a CRn value of 5, which are handled by hhee_sysregs_crn5.

  • Changes to AFSR0_EL1 are ignored, and no error is returned.
  • Changes to AFSR1_EL1 are ignored, and an error is returned.
  • Changes to ESR_EL1 are accepted without checking.
uint64_t hhee_sysregs_crn5(uint64_t rt_val, uint32_t crm, uint32_t op2) {
  if (crm == 1 && op2 == 0) {
    // AFSR0_EL1 - Nothing happens, but no error is returned.
    return 0;
  }

  if (crm == 1 && op2 != 0) {
    // AFSR1_EL1 - An error is returned.
    return 0xffffffff;
  }

  if (crm == 2 && op2 == 0) {
    // Updates ESR_EL1 without further checks.
    set_esr_el1(rt_val);
    return 0;
  }

  // The rest of the registers should be unreachable with the current configuration.
  return 0xffffffff;
}

CRn 6: FAR_EL1

Then we have FAR_EL1, handled by hhee_sysregs_crn6, which can be changed however the kernel wants.

uint64_t hhee_sysregs_crn6(uint64_t rt_val, uint32_t crm, uint32_t op2) {
  if (crm != 0 || op2 != 0) {
    return 0xfffffffe;
  }
  set_far_el1(rt_val);
  return 0;
}

CRn 10: MAIR_EL1 and AMAIR_EL1

The next trapped registers on our list are MAIR_EL1 and AMAIR_EL1, which are handled by hhee_sysregs_crn10.

  • MAIR_EL1 can only be changed if the hypervisor structure storing kernel system register values has not been initialized yet. Otherwise, the function returns an error if we try to change its value and it differs from the one stored in the hypervisor.
  • Changes to AMAIR_EL1 are ignored, but no error is returned.
uint64_t hhee_sysregs_crn10(uint64_t rt_val, uint32_t crm, uint32_t op2) {
  if (op2 != 0) {
    return 0xfffffffe;
  }

  // MAIR_EL1 - Changes are only allowed if sys_regs->mair_el1 has not been set yet. If it has been, returns an error if
  // it differs from the value the kernel wants to set.
  if (crm == 2) {
    sys_regs = current_cpu->sys_regs;
    regs_inited = atomic_get(&sys_regs->regs_inited);
    if (regs_inited && rt_val != sys_regs->mair_el1) {
      return 0xfffffff8;
    } else {
      set_mair_el1(rt_val);
      return 0;
    }
  }

  // AMAIR_EL1 - Changes ignored, but doesn't return an error.
  if (crm == 3) {
    return 0;
  }

  return 0xfffffffe;
}

CRn 13: CONTEXTIDR_EL1

Finally, CONTEXTIDR_EL1, handled by hhee_sysregs_crn13, can be changed to any value.

uint64_t hhee_sysregs_crn13(uint64_t rt_val, uint32_t crm, uint32_t op2) {
  if (crm != 0) {
    return 0xfffffffe;
  }

  // Updates CONTEXTIDR_EL1 without further checks.
  if (op2 == 1) {
    set_contextidr_el1(rt_val);
    return 0;
  }

  return 0xffffffff;
}

Instruction and Data Abort Handling

As we explained earlier, the hypervisor is also able to handle the instruction and data aborts that occur if userland or the kernel tries to read, write, or execute protected memory. In particular, this is used by the hypervisor to check what the kernel writes to its page tables, as we will see in the next section.

Similarly to the trapped instruction handler, the abort handlers check fault-specific information present in the ESR_EL2 system register.

Instruction Abort Handling

When an instruction abort occurs, caused by EL0 or EL1 trying to execute invalid/protected memory, it is handled by either SynchronousExceptionA64 or SynchronousExceptionA32, both of which end up calling handle_inst_abort_from_lower_el.

handle_inst_abort_from_lower_el simply calls trigger_instr_data_abort_handling_in_el1 to let the kernel handle the abort. There is also a check for the special case of a stage 2 fault occurring during a stage 1 page table walk, which results in the handle_s2_fault_during_s1ptw function being called, but we will detail this function at a later point.

void handle_inst_abort_from_lower_el(saved_regs_t* regs, uint64_t esr_el2) {
  // S1PTW, bit [7] = 1: Stage 2 fault on an access made for a stage 1 translation table walk.
  if (((esr_el2 >> 7) & 1) == 1) {
    handle_s2_fault_during_s1ptw(0);
  } else {
    // Let the kernel handle the abort.
    trigger_instr_data_abort_handling_in_el1();
  }
}

trigger_instr_data_abort_handling_in_el1, which is also called by handlers other than the instruction abort one, retrieves the Instruction Specific Syndrome, or ISS, of ESR_EL2 to determine which fault occurred. It also reads the M[3:0] (exception level and selected stack pointer) and M[4] (AArch32/AArch64) bits of the SPSR_EL2 register to determine which exception vector of the kernel to return to. The actual return is done by writing the ELR_EL2 register and then executing the ERET instruction back in SynchronousExceptionA64 or SynchronousExceptionA32.

void trigger_instr_data_abort_handling_in_el1() {
  // ...
  esr_el2 = get_esr_el2();
  ec = (esr_el2 >> 26) & 0b111111;
  switch (ec) {
    case 0b001000: /* Trapped VMRS access, from ID group trap */
      // IL, bit [25] = 1: 32-bit instruction trapped.
      esr_el1 = 1 << 25;
      break;

    case 0b010010: /* HVC instruction execution in AArch32 state */
    case 0b010011: /* SMC instruction execution in AArch32 state */
    case 0b010110: /* HVC instruction execution in AArch64 state */
    case 0b010111: /* SMC instruction execution in AArch64 state */
      // EC, bits [31:26] = 0b001110: Illegal Execution state.
      esr_el1 = (0b001110 << 26) | (1 << 25);
      break;

    case 0b100000: /* Instruction Abort from a lower Exception level */
    case 0b100100: /* Data Abort from a lower Exception level */
      //   - IFSC, bits [5:0]: Instruction Fault Status Code.
      //   - DFSC, bits [5:0]: Data Fault Status Code.
      esr_el1 = esr_el2 & ~0b111111;

      el = (get_spsr_el2() >> 2) & 0b1111;
      if (el != 0b0000 /* EL0t */) {
        // EC, bits [31:26]: Exception Class.
        //
        //   - 0b100000 -> 0b100001: Instruction Abort taken without a change in Exception level.
        //   - 0b100100 -> 0b100101: Data Abort without a change in Exception level.
        esr_el1 |= (1 << 26);
      }

      // S1PTW, bit [7] = 1: Fault on the stage 2 translation of an access for a stage 1 translation table walk.
      if (((esr_el2 >> 7) & 1) != 0) {
        //   - IFSC, bits [5:0]: Instruction Fault Status Code.
        //   - DFSC, bits [5:0]: Data Fault Status Code.
        //
        // If the S1PTW bit is set, then the level refers to the level of the stage 2 translation that is translating a
        // stage 1 translation walk.
        esr_el1 |= 0b101000 | (esr_el2 & 3);
      } else {
        esr_el1 |= 0b100000;
      }
      break;

    default:
      esr_el1 = esr_el2;
  }

  far_el1 = get_far_el2();
  elr_el1 = get_elr_el2();
  spsr_el1 = get_spsr_el2();

  el = (spsr_el1 >> 2) & 0b1111;
  if (el != 0b0000 /* EL0t */) {
    if ((spsr_el1 & 1) == 0) {
      offset = 0x000 /* Synchronous, Current EL with SP0 */;
    } else {
      offset = 0x200 /* Synchronous, Current EL with SPx */;
    }
  } else {
    // M[4], bit [4]: Execution state.
    //
    //   - 0b0: AArch64 execution state.
    //   - 0b1: AArch32 execution state.
    if (((spsr_el1 >> 4) & 1) == 0) {
      offset = 0x400 /* Synchronous, Lower EL using AArch64 */;
    } else {
      offset = 0x600 /* Synchronous, Lower EL using AArch32 */;
    }
  }

  set_elr_el2(get_vbar_el1() + offset);
  //   - M[3:0], bits [3:0] = 0b0101: EL1h.
  //   - F,      bit [6]    = 1: FIQ interrupt mask.
  //   - I,      bit [7]    = 1: IRQ interrupt mask.
  //   - A,      bit [8]    = 1: SError interrupt mask.
  //   - D,      bit [9]    = 1: Debug exception mask.
  set_spsr_el2((0b0101 | (1 << 6) | (1 << 7) | (1 << 8) | (1 << 9)));
  set_elr_el1(elr_el1);
  set_spsr_el1(spsr_el1);
  set_esr_el1(esr_el1);
  set_far_el1(far_el1);
}

Data Abort Handling

When a data abort occurs, caused by EL0 or EL1 trying to read or write invalid/protected memory, it is handled by either SynchronousExceptionA64 or SynchronousExceptionA32, both of which end up calling handle_data_abort_from_lower_el.

handle_data_abort_from_lower_el also checks for the special case of a stage 2 fault occurring during a stage 1 page table walk, which we will see in detail later. Otherwise, in the general case, it inspects the Instruction Specific Syndrome field, or ISS, of the system register ESR_EL2. If it contains a valid instruction syndrome, it extracts access information from the ISS and then calls either handle_data_abort_on_write if the exception resulted from a write or handle_data_abort_on_read if it was a read.

esr_el2 for data aborts

If the ISS is unknown and the exception was raised while in AArch32, the hypervisor forwards it to the kernel by calling trigger_instr_data_abort_handling_in_el1. If we were in AArch64, handle_data_abort_on_unknown_iss is called instead.

void handle_data_abort_from_lower_el(saved_regs_t* regs, uint64_t esr_el2) {
  // ...
  far_el2 = get_far_el2();
  // Special case if the abort occurred during a stage 1 PTW.
  //
  // S1PTW, bit [7] = 1: Stage 2 fault on an access made for a stage 1 translation table walk.
  if (((esr_el2 >> 7) & 1) == 1) {
    // WnR, bit [6].
    //
    //   - 0: Abort caused by an instruction reading from memory.
    //   - 1: Abort caused by an instruction writing to memory.
    handle_s2_fault_during_s1ptw(((esr_el2 >> 6) & 1) == 1);
  }

  // Common case: there is a valid instruction syndrome that contains all the information about the access that caused
  // the abort.
  //
  // ISV, bit [24] = 1: ISS hold a valid instruction syndrome.
  else if (((esr_el2 >> 24) & 1) == 1) {
    // SRT, bits [20:16]: Syndrome Register Transfer, this field holds register specifier, Xt.
    srt = (esr_el2 >> 16) & 0b11111;
    reg_val = *(uint64_t*)(&regs->x0 + srt);
    // SAS, bits [23:22]: Syndrome Access Size, indicates the size of the access attempted by the faulting operation.
    sas = (esr_el2 >> 22) & 0b11;
    // SSE, bit [21]: Syndrome Sign Extend, indicates whether the data item must be sign extended.
    sse = (esr_el2 >> 21) & 1;
    // SF, bit [15]: Indicates if the width of the register accessed by the instruction is 64 bits.
    sf = (esr_el2 >> 15) & 1;
    // AR, bit [14]: Acquire/Release, did the instruction have acquire/release semantics?
    ar = (esr_el2 >> 14) & 1;
    wnr = (esr_el2 >> 6) & 1;
    // Call a different handler depending on the access type (read or write).
    if (wnr == 1) {
      ret = handle_data_abort_on_write(reg_val, far_el2, sas, ar);
    } else {
      ret = handle_data_abort_on_read(&regs->x0 + srt, far_el2, sas, sse, sf, ar);
    }
    // Move the instruction pointer past the faulting instruction.
    //
    // IL, bit [25]: Instruction Length for synchronous exceptions.
    if (ret == 0) {
      il = (esr_el2 >> 25) & 1;
      set_elr_el2(get_elr_el2() + 2 * (il + 1));
    }
  }

  // Uncommon case: there is no valid instruction syndrome and the data abort came from an AArch32 process, let the
  // kernel handle the fault.
  //
  // M[4], bit [4] = 1: AArch32 execution state.
  else if (((get_spsr_el2() >> 4) & 1) == 1) {
    trigger_instr_data_abort_handling_in_el1();
  }

  // Uncommon case: there is no valid instruction syndrome, the hypervisor will try to figure out what instruction
  // caused the abort by decoding it.
  else {
    handle_data_abort_on_unknown_iss(regs);
  }
}

Data Abort on Writes

For data aborts caused by writes, handle_data_abort_on_write calls handle_data_abort_on_write_aligned on addresses aligned on the access size:

  • if the address is already aligned on the access size, it calls handle_data_abort_on_write_aligned directly;
  • otherwise, it splits the access into multiple sub-accesses of smaller sizes that end up being aligned.
int32_t handle_data_abort_on_write(uint64_t reg_val,
                                   uint64_t fault_va,
                                   int32_t access_size_log2,
                                   bool acquire_release) {
  // Simple case: if the fault VA is aligned on the access size.
  access_size = 1 << access_size_log2;
  if ((fault_va & (access_size - 1)) == 0) {
    fault_ipa = virt_to_phys_el1(fault_va);
    if (fault_ipa == -1) {
      trigger_instr_data_abort_handling_in_el1();
      return -1;
    }
    if (handle_data_abort_on_write_aligned(reg_val, fault_ipa, access_size_log2) != 0) {
      return -1;
    }
    if (acquire_release) {
      dmb_ish();
    }
    return 0;
  }

  // Complicated case: if the fault VA is not aligned on the access size, then iterate on each aligned chunk contained
  // in the access.
  chunk_access_size_log2 = __clz(__rbit(fault_va));
  chunk_access_size = 1 << chunk_access_size_log2;
  chunk_fault_va = fault_va;
  while (access_size != 0) {
    chunk_fault_ipa = virt_to_phys_el1(chunk_fault_va);
    if (chunk_fault_ipa == -1) {
      trigger_instr_data_abort_handling_in_el1();
      return -1;
    }
    if (handle_data_abort_on_write_aligned(reg_val, chunk_fault_ipa, chunk_access_size_log2) != 0) {
      return -1;
    }
    reg_val >>= 8 * chunk_access_size;
    chunk_fault_va += chunk_access_size;
    access_size -= chunk_access_size;
  }

  // Emulate A/R semantics.
  if (acquire_release) {
    dmb_ish();
  }
  return 0;
}

handle_data_abort_on_write_aligned checks if the write, while it raised an exception, should be allowed nonetheless. It does so by calling check_kernel_page_table_write, which returns a verdict value:

  • if it is zero, the write is allowed, and it calls perform_faulting_write to perform the write on behalf of the kernel;
  • if it is non-zero and the faulting instruction is in the kernel, it resumes execution at the next instruction;
  • if it is non-zero and the faulting instruction is in a userland process, it lets the kernel handle the fault.
int32_t handle_data_abort_on_write_aligned(uint64_t reg_val, uint64_t fault_ipa, int32_t access_size_log2) {
  // ...
  hvc_lock_acquire();
  // Check if the write, most likely into kernel page tables, is allowed.
  verdict = check_kernel_page_table_write(&reg_val, &fault_ipa, &access_size_log2, &fault_pa);
  hvc_lock_set();

  if (verdict != 0) {
    // If the write is not allowed, let the kernel handle the fault, whether it came from user-land or kernel-land.
    //
    //   - M[3:0], bits [3:0]: AArch64 Exception level and selected Stack Pointer (...) M[3:2] is set to the value of
    //                         PSTATE.EL.
    el = (get_spsr_el2() & 0b1111) >> 2;
    if (el != 0 /* EL0 */) {
      hvc_lock_dec();
      return 0;
    } else {
      trigger_instr_data_abort_handling_in_el1();
      hvc_lock_dec();
      return verdict;
    }
  } else {
    // If the write is allowed, performs it on behalf of the kernel.
    perform_faulting_write(fault_pa, access_size_log2, reg_val);
    hvc_lock_dec();
    return 0;
  }
}

As mentioned above, perform_faulting_write simply maps the target physical address into the hypervisor, performs the write on behalf of the kernel, and unmaps it.

void perform_faulting_write(uint64_t fault_pa, int32_t access_size_log2, uint64_t reg_val) {
  // ...
  fault_hva = map_pa_into_hypervisor(fault_pa);
  switch (access_size_log2) {
    case 0:
      *(uint8_t*)fault_hva = reg_val;
      break;
    case 1:
      *(uint16_t*)fault_hva = reg_val;
      break;
    case 2:
      *(uint32_t*)fault_hva = reg_val;
      break;
    case 3:
      *(uint64_t*)fault_hva = reg_val;
      break;
  }
  unmap_hva_from_hypervisor(fault_hva);
}

Data Abort on Reads

For data aborts caused by reads, handle_data_abort_on_read calls handle_data_abort_on_read_aligned on addresses aligned on the access size:

  • if the address is already aligned on the access size, it calls handle_data_abort_on_read_aligned directly;
  • otherwise, it splits the access into one or two aligned 8-byte accesses covering the target address (for example, accessing the quadword at address 0x123 results in an 8-byte access at 0x120 and another one at 0x128).

A little extra processing is needed if the access size is smaller than 8 bytes, the instruction sign extends the value read, or the instruction has acquire-release semantics.

int32_t handle_data_abort_on_read(uint64_t* reg_val_ptr,
                                  uint64_t fault_va,
                                  int32_t access_size_log2,
                                  bool sign_extend,
                                  bool sixty_four,
                                  bool acquire_release) {
  // Simple case: if the fault VA is aligned on the access size.
  access_size = 1 << access_size_log2;
  if ((fault_va & (access_size - 1)) == 0) {
    if (handle_data_abort_on_read_aligned(reg_val_ptr, fault_va, access_size_log2) != 0) {
      return -1;
    }
  }

  // Complicated case: if the fault VA is not aligned on the access size then split it into one or two aligned accesses
  // of 8 bytes each.
  else {
    // Handle the first 8-byte access.
    fault_va_align = fault_va & ~7;
    if (handle_data_abort_on_read_aligned(&g_reg_val, fault_va_align, 3) != 0) {
      return -1;
    }
    // Handle the second 8-byte access if necessary.
    if (fault_va + access_size > fault_va_align + 8 &&
        handle_data_abort_on_read_aligned(&g_reg_val + 8, fault_va_align + 8, 3) != 0) {
      return -1;
    }
    // Copy the resulting value at the right place.
    if (memcpy_s(reg_val_ptr, 8, &g_reg_val + (fault_va & 7), 8) != 0) {
      return -1;
    }
  }

  // If the access size is smaller than 8 bytes, extra work might be needed.
  if (access_size < 8) {
    // If no sign extension is requested, zero-extend the value.
    if (!sign_extend) {
      *reg_val_ptr &= (1ULL << (8 * access_size)) - 1;
    }
    // Downcast the value if needed.
    if (!sixty_four) {
      *reg_val_ptr = *(uint32_t*)reg_val_ptr;
    }
  }

  // Emulate A/R semantics if needed.
  if (acquire_release) {
    dmb_ishld();
  }
  return 0;
}

handle_data_abort_on_read_aligned checks whether the read, while it raised an exception, should be allowed nonetheless. It does so by calling check_kernel_page_table_read, which returns a verdict value:

  • if it is zero, the read is allowed, and it calls perform_faulting_read to perform the read on behalf of the faulting process;
  • if it is non-zero and the faulting instruction is in the kernel, it resumes execution at the next instruction;
  • if it is non-zero and the faulting instruction is in a userland process, it lets the kernel handle the fault.
int32_t handle_data_abort_on_read_aligned(uint64_t* reg_val_ptr, uint64_t fault_va, int32_t access_size_log2) {
  // ...
  fault_ipa = virt_to_phys_el1(fault_va);
  if (fault_ipa == -1) {
    trigger_instr_data_abort_handling_in_el1();
    return -1;
  }

  hvc_lock_inc();
  // Check if the read, most likely into kernel page tables, is allowed.
  verdict = check_kernel_page_table_read(&fault_ipa);
  if (verdict != 0) {
    // If the read is not allowed, let the kernel handle the fault, whether it came from user-land or kernel-land.
    //
    //   - M[3:0], bits [3:0]: AArch64 Exception level and selected Stack Pointer (...) M[3:2] is set to the value of
    //                         PSTATE.EL.
    el = (get_spsr_el2() & 0b1111) >> 2;
    if (el != 0 /* EL0 */) {
      hvc_lock_dec();
      return 0;
    } else {
      trigger_instr_data_abort_handling_in_el1();
      hvc_lock_dec();
      return verdict;
    }
  } else {
    // If the read is allowed, performs it on behalf of the kernel.
    *(uint64_t*)reg_val_ptr = perform_faulting_read(fault_ipa, access_size_log2);
    hvc_lock_dec();
    return 0;
  }
}

check_kernel_page_table_read retrieves the stage 2 descriptor of the faulting IPA. If the address is unmapped or marked as execute-only, the access is denied. Otherwise, it returns the verdict of report_error_on_invalid_access.

int32_t check_kernel_page_table_read(uint64_t* fault_ipa_ptr) {
  // ...
  // Check if the fault IPA is mapped in the stage 2.
  desc_s2 = get_stage2_page_table_descriptor_for_ipa(*fault_ipa_ptr, &desc_s2_va_range);
  if ((desc_s2 & 1) == 0) {
    return -1;
  }

  // Decide what to do depending on the software attributes.
  verdict = report_error_on_invalid_access(desc_s2);
  sw_attrs = get_software_attrs(desc_s2);
  // Change the return code for userland execute-only memory.
  if (verdict != 0 && sw_attrs == SOP_XO) {
    return -2;
  }
  return verdict;
}

report_error_on_invalid_access only allows accesses to unmarked addresses and logs all other accesses.

int32_t report_error_on_invalid_access(uint64_t desc_s2) {
  // ...
  phys_addr = desc_s2 & 0xfffffffff000;
  sw_attrs = get_software_attrs(desc_s2);
  switch (sw_attrs) {
    case 0:
      return 0 /* allowed */;
    case 1:
      return -1 /* disallowed */;
    case OS_RO:
    case OS_RO_MOD:
      // If execution at EL1 was not permitted.
      //
      // XN, bits [54:53]: Execute-Never.
      if (((desc_s2 >> 53) & 0b11) == 0b01 || ((desc_s2 >> 53) & 0b11) == 0b10) {
        code = 0x202;
        segment = "data";
      } else {
        code = 0x201;
        segment = "code";
      }
      debug_printf(code, "Invalid write to %s read-only %s location 0x%lx", "operating system", segment, phys_addr);
      return -1;
    case HYP_RO:
    case HYP_RO_MOD:
      debug_printf(0x203, "Invalid write to %s read-only %s location 0x%lx", "hypervisor-mediated", "data", phys_addr);
      return -1;
    case SOP_XO:
      debug_printf(0x204, "Invalid read or write to %s execution-only %s location 0x%lx", "sop-protected", "code",
                   phys_addr);
      return -1;
  }
}

perform_faulting_read maps the target physical address into the hypervisor, performs the read on behalf of the process, and unmaps it.

uint64_t perform_faulting_read(uint64_t fault_pa, int32_t access_size_log2) {
  // ...
  fault_hva = map_pa_into_hypervisor(fault_pa);
  switch (access_size_log2) {
    case 0:
      reg_val = *(uint8_t*)fault_hva;
      break;
    case 1:
      reg_val = *(uint16_t*)fault_hva;
      break;
    case 2:
      reg_val = *(uint32_t*)fault_hva;
      break;
    case 3:
      reg_val = *(uint64_t*)fault_hva;
      break;
  }
  unmap_hva_from_hypervisor(fault_hva);
  return reg_val;
}

Data Abort on Unknown ISS

The handle_data_abort_on_unknown_iss function is called when no valid ISS is present in the ESR_EL2 register. In that case, the hypervisor will try to figure out which instruction caused the abort and decide the course of action.

The handler starts by retrieving the faulting PC from the ELR_EL2 register and returns early if it is not aligned on 4 bytes. It then reads the faulting instruction and calls a different handler depending on its Op0 field. If the handler returns a zero value, it sets the execution to resume at the next instruction.

int32_t handle_data_abort_on_unknown_iss(saved_regs_t* regs) {
  // ...
  // Get the IPA of the faulting instruction.
  //
  // ELR_EL2, Exception Link Register (EL2).
  return_address_va = get_elr_el2();
  return_address_ipa = virt_to_phys_el1(return_address_va);
  if (return_address_ipa == -1) {
    return 0;
  }

  // Unaligned instruction.
  if ((return_address_ipa & 3) != 0) {
    return -1;
  }

  // Read the faulting instruction.
  if (read_physical_memory(return_address_ipa, &fault_instr, 4) == 0) {
    return -1;
  }

  // Extract the Op0 field of the instruction and call the appropriate handler.
  op0 = (fault_instr >> 25) & 0b1111;
  if (instr_handlers_by_op0_field[op0](regs, fault_instr) != 0) {
    return -1;
  }

  // Move the instruction pointer to the next instruction.
  set_elr_el2(return_address_va + 4);
  return 0;
}
Op0 Handler
x1x0 handle_loads_and_stores_instructions
101x handle_branches_exception_generating_and_system_instructions

The function handle_loads_and_stores_instructions also calls a different handler, this time depending on the op0 field specific to the encoding of load and store instructions (not to be confused with the Op0 field of the general instruction encoding).

uint64_t handle_loads_and_stores_instructions(saved_regs_t* regs, uint64_t fault_instr) {
  op0 = (fault_instr >> 28) & 0b11;
  return by_ldr_str_instr_field_op0[op0](regs);
}
Op0 Handler
0bxx00 handle_load_store_exclusive
0bxx10 handle_load_store_register_pair
0bxx11 handle_load_store_register_immediate

We won't detail the code of the specific instruction handlers, but here is a summary of what they do.

  • handle_load_store_exclusive handles the load/store exclusive register instructions, such as STXR and LDXR. It is basically a wrapper around handle_data_abort_on_write that handles the synchronization issues.
  • handle_load_store_register_pair handles the load/store register pair instructions, such as STP and LDP. It is basically a wrapper around two calls to handle_data_abort_on_write or handle_data_abort_on_read, depending on whether the instruction loads or stores a register, one for each of the two registers.
  • handle_load_store_register_immediate handles the load/store register immediate instructions, such as STR (immediate) and LDR (immediate). It is basically a wrapper around handle_data_abort_on_write or handle_data_abort_on_read that can decode the immediate value (a decoding sketch is given after this list).
  • handle_branches_exception_generating_and_system_instructions mainly handles the DC IVAC and DC ZVA cache maintenance instructions. For DC IVAC, it emulates it by executing DC CIVAC on the target virtual address, and for DC ZVA, it calls handle_data_abort_on_write on each of the data cache blocks targeted.
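
To give an idea of the decoding work these handlers have to perform, here is a minimal standalone sketch, in the style of the pseudocode above, that extracts the fields of an LDR/STR (immediate, unsigned offset) instruction. The structure and helper are ours and purely illustrative (signed loads and the pre/post-indexed forms are ignored for brevity); they are not taken from the hypervisor.

typedef struct {
  int is_load;           // opc, bit [22]: 0 for STR, 1 for LDR (signed loads ignored here).
  int access_size_log2;  // size, bits [31:30].
  int rt;                // Rt, bits [4:0]: transferred register.
  int rn;                // Rn, bits [9:5]: base register.
  uint64_t offset;       // imm12, bits [21:10], scaled by the access size.
} ldst_imm_t;

int decode_ldst_unsigned_imm(uint32_t instr, ldst_imm_t* out) {
  // Load/store register (unsigned immediate): size 111 0 01 opc imm12 Rn Rt.
  if ((instr & 0x3f000000) != 0x39000000) {
    return -1;
  }
  out->access_size_log2 = (instr >> 30) & 0b11;
  out->is_load = (instr >> 22) & 1;
  out->offset = (uint64_t)((instr >> 10) & 0xfff) << out->access_size_log2;
  out->rn = (instr >> 5) & 0b11111;
  out->rt = instr & 0b11111;
  return 0;
}

For instance, 0xf9000841 (STR X1, [X2, #16]) decodes to is_load = 0, access_size_log2 = 3, rt = 1 and rn = 2, with an offset of 16; the emulation then boils down to a call to handle_data_abort_on_write with the value of X1 and an access size of 8 bytes.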

Kernel Page Table Management

This section explains how the second stage page tables are populated and managed by the hypervisor. Among other things, it must enforce that only the kernel code is executable at EL1 and that kernel memory cannot be accessed in user space. This requires the hypervisor to be aware of the kernel memory layout. This information can be acquired either by having the kernel specify its layout through an HVC or by having the hypervisor parse the kernel page tables. Huawei chose to implement the latter, more generic approach, which doesn't require cooperation from the kernel.

However, the second stage is unable to distinguish between EL0 and EL1 accesses. For a given physical memory page, data access permissions are always the same for the two ELs, and originally, so were the access permissions for instruction execution. If the hypervisor were to set all pages as non-executable in the second stage by default, this would render them unusable for executable code at EL0. In the same way, if it made the kernel memory inaccessible in the second stage to ensure it cannot be accessed by EL0, the kernel itself would also not be able to access it. Thus, the hypervisor is forced to use the stage 1 permissions to that effect and needs to fully control the kernel page tables.

stage 2 permissions

Note: The shortcoming of the second stage's access permissions for instruction execution was addressed by the introduction of FEAT_XNX, or ARMv8.2-TTS2UXN ("Execute-never control distinction by Exception level at stage 2"). When this feature is implemented, the XN[0] bit is no longer ignored by the hardware, and pages can be made non-executable at EL0 only or at EL1 only.
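
For reference, with FEAT_XNX the two XN bits of a stage 2 descriptor encode four execution policies, which is what allows the checks on (desc_s2 >> 53) & 0b11 seen throughout this post. The values below follow the standard Armv8-A encoding, and the helper mirrors the ID_AA64MMFR1_EL1.XNX check performed in process_kernel_page_tables further down; both are given for illustration only.

// Stage 2 XN[1:0], descriptor bits [54:53]. Without FEAT_XNX, only XN[1] (bit 54) is honored.
enum stage2_xn {
  S2_XN_EXEC_EL1_EL0 = 0b00,  // Executable at EL1 and EL0 (stage 1 permitting).
  S2_XN_EXEC_EL0     = 0b01,  // Not executable at EL1, executable at EL0.
  S2_XN_EXEC_NONE    = 0b10,  // Not executable at EL1 nor EL0.
  S2_XN_EXEC_EL1     = 0b11,  // Executable at EL1, not at EL0.
};

// ID_AA64MMFR1_EL1.XNX, bits [31:28]: non-zero if the EL0/EL1 execute-never distinction at stage 2 is
// supported (the hypervisor checks for the value 1).
static inline int has_feat_xnx(void) {
  uint64_t mmfr1;
  asm volatile("mrs %0, id_aa64mmfr1_el1" : "=r"(mmfr1));
  return ((mmfr1 >> 28) & 0xf) != 0;
}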

The hypervisor intervenes in two cases when it comes to the kernel page tables:

  • when they are first initialized, by trapping accesses to SCTLR_EL1 and TTBR1_EL1;
  • and when they are modified or a translation fault occurs.

We will now detail how the hypervisor processes the kernel page tables and what changes are allowed to be made.

Processing the Kernel Page Tables

When the kernel starts, it initializes its virtual address space and sets up its page tables. As seen previously, once the base address of the translation table is set in TTBR1_EL1, the hypervisor calls process_kernel_page_tables. This function first calls get_kernel_pt_info to retrieve information about the kernel page tables, such as the PGD's IPA, the number of entries per table, etc. The PGD is then mapped into the second stage using map_page_table_hyp_and_stage2 and each of its entries is processed using process_kernel_page_table to walk the corresponding page tables and ensure the descriptors they store have sensible values.

uint64_t process_kernel_page_tables() {
  // ...

  ret = 0;
  was_not_marked = 0;
  pt_size_log2 = 0;
  pt_level = 0;
  nb_pt_entries = 0;
  kernel_page_tables_processed = atomic_get(&g_kernel_page_tables_processed);

  hvc_lock_acquire();

  // Returns if kernel page tables have already been processed.
  if (kernel_page_tables_processed) {
    goto EXIT;
  }

  // Retrieves page table information using system register values configured by the kernel.
  ret = get_kernel_pt_info(1, &pgd, &pt_size_log2, &pt_level, &nb_pt_entries);
  if (ret) {
    ret = 0xffffffff;
    goto EXIT;
  }

  // Maps the kernel PGD in the hypervisor's address space and in the second translation stage.
  pgd_hva = map_page_table_hyp_and_stage2(pgd, pt_size_log2, pt_level, nb_pt_entries, &was_not_marked);
  if (!pgd_hva) {
    unmap_hva_from_hypervisor(pgd_hva);
    ret = 0xffffffff;
    goto EXIT;
  }

  // Checks that the PGD has the expected size.
  if (pt_size_log2 != 0xc) {
    unmap_hva_from_hypervisor(pgd_hva);
    ret = 0xffffffff;
    goto EXIT;
  }

  if (nb_pt_entries) {
    ret = 0;
    pt_entries_count = 0;
    for (;;) {
      pt_ret = process_kernel_page_table(pgd_hva + 8 * pt_entries_count, pt_size_log2, pt_level, 0);
      ret += pt_ret;
      if (pt_ret < 0) {
        break;
      }
      if (nb_pt_entries <= ++pt_entries_count) {
        goto SUCCESS;
      }
    }
  } else {
SUCCESS:
    success = 1;
    unmap_hva_from_hypervisor(pgd_hva);
  }

  // EL0/EL1 execute control distinction at Stage 2 is supported.
  if (get_id_aa64mmfr1_el1() >> 0x1c == 1) {
    // Disables execution at EL1 for writable pages in the second stage.
    process_stage2_page_tables((void (*)(void))set_non_executable_at_el1);
    // ...
  }
  // ...
  if (success) {
    atomic_set(&g_kernel_page_tables_processed);
    ret = 0;
  }

EXIT:
  hvc_lock_release();
  return ret;
}

In the snippet below, you can see the code of the function get_kernel_pt_info, which retrieves information about the kernel page tables by parsing the content of the TCR_EL1 and TTBR1_EL1 registers. The code path using TTBR0_EL1 is never used by the hypervisor.

uint64_t get_kernel_pt_info(int32_t is_kernel,
                            uint64_t* kernel_pgd_p,
                            int32_t* page_size_log2_p,
                            uint32_t* pt_level_p,
                            uint64_t* nb_pt_entries_p) {
  // ...
  tcr_el1 = get_tcr_el1();
  if (is_kernel) {
    // Granule size for the TTBR1_EL1.
    tg1 = (tcr_el1 >> 30) & 0b11;
    // The size offset of the memory region addressed by TTBR1_EL1.
    ttbr_txsz = (tcr_el1 >> 16) & 0x3f;
    // TTBR1_EL1, Translation Table Base Register 1.
    ttbrx_el1 = get_ttbr1_el1();
    page_size_log2 = (tg1 >= 1 && tg1 <= 3) ? g_tg_to_page_size_log2[tg1] : 0;
    txsz = (page_size_log2 <= 0xf) ? 0x10 : 0xc;
    // ...
  } else {
    // Granule size for the TTBR0_EL1.
    tg0 = (tcr_el1 >> 14) & 0b11;
    // The size offset of the memory region addressed by TTBR0_EL1.
    ttbr_txsz = tcr_el1 & 0x3f;
    // TTBR0_EL1, Translation Table Base Register 0.
    ttbrx_el1 = get_ttbr0_el1();
    page_size_log2 = (tg0 >= 1 && tg0 <= 3) ? g_tg_to_page_size_log2[tg0] : 0;
    txsz = (page_size_log2 <= 0xf) ? 0x10 : 0xc;
    // ...
  }
  // ...
  if (txsz < ttbr_txsz) {
    txsz = ttbr_txsz;
  }
  ttbr_addr = ttbrx_el1 & 0xfffffffffffe;
  // The page table entries log2.
  nb_pt_entries_log2 = page_size_log2 - 3;
  // The page size log2.
  *page_size_log2_p = page_size_log2;
  // Computes the number of page table levels based on the address space size.
  *pt_level_p = 4 - (0x3c - txsz) / nb_pt_entries_log2;
  // The number of entries in a page table.
  *nb_pt_entries_p = 1 << nb_pt_entries_log2;
  // Checks if the level 0 page table address is page aligned.
  if ((ttbr_addr & ((1 << page_size_log2) - 1)) == 0) {
    *kernel_pgd_p = ttbr_addr;
    return 0;
  }
  return 0xffffffff;
}
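
As a concrete example, assuming the common configuration of a 4KB translation granule and a 48-bit kernel address space (TG1 = 0b10, T1SZ = 16), we get page_size_log2 = 12 and nb_pt_entries_log2 = 9, and thus nb_pt_entries = 0x200 and pt_level = 4 - (0x3c - 0x10) / 9 = 0, i.e. a page table walk starting at level 0. These are the values that show up again later in the comments of hhee_lkm_update.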

Once we have gathered this information from the kernel, we call map_page_table_hyp_and_stage2. This function gets the stage 2 descriptor of the stage 1 page table, which is the kernel's PGD in our case. The page table is mapped into the hypervisor before checking the descriptor's software attributes:

  • if it has already been marked, we make sure it's a PTABLE_Ln, where n is the page table level, and we return while noting that we didn't change the second stage permissions;
  • otherwise, we map the page table as read-only in stage 2 and also iterate over its entries to:
    • prevent reusing existing page tables that have already been marked;
    • verify that contiguous descriptors share the same attributes and point to contiguous memory.
uint64_t map_page_table_hyp_and_stage2(uint64_t ptable,
                                       uint32_t page_size_log2,
                                       uint32_t pt_level,
                                       uint64_t nb_pt_entries,
                                       uint8_t* was_not_marked) {
  // ...
  // Retrieves the descriptor of the kernel's page table in the second translation stage.
  ptable_desc_s2 = get_stage2_page_table_descriptor_for_ipa(ptable, 0);
  if ((ptable_desc_s2 & 1) == 0) {
    return 0;
  }
  page_size = 1 << page_size_log2;
  ptable_pa = ptable_desc_s2 & 0xfffffffff000;
  ptable_s2_perms = READ | NORMAL_MEMORY;
  offset_in_ptable = 0;
  // ...
  // If the software attributes for the page table have already been set, we just map the address in the hypervisor and
  // return the corresponding virtual address.
  ptable_software_attrs = get_software_attrs(ptable_desc_s2);
  software_attrs = pt_level + PTABLE_L0;
  if (software_attrs == ptable_software_attrs) {
    *was_not_marked = 0;
    return map_pa_into_hypervisor(offset_in_ptable + ptable_pa);
  }
  // Sets was_not_marked to signal that the stage 2 mapping is new and other initialization procedures outside of this
  // function must be applied to this mapping.
  *was_not_marked = 1;
  // Tries to map the page table in the stage 2 and returns if an error occurs.
  if (ptable_software_attrs || map_stage2_memory(ptable, ptable_pa, page_size, ptable_s2_perms, software_attrs)) {
    return 0;
  }
  // Maps the page table's physical address in the hypervisor's address space.
  ptable_hva = map_pa_into_hypervisor(offset_in_ptable + ptable_pa);
  // Gets a function pointer to a function that checks a page table descriptor based on its level.
  pt_check_ops = g_pt_check_ops_by_level[pt_level];
  // ...
  num_contiguous_entries = 0x10;
  // ...
  // Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
  // at level 2, etc.).
  nb_pt_entries_log2 = page_size_log2 - 3;
  pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;
  if (!nb_pt_entries) {
    return ptable_hva;
  }

  desc_va_range = 1 << pt_level_idx_pos;
  for (desc_idx = 0; desc_idx < nb_pt_entries; desc_idx += num_contiguous_entries) {
    contiguous_count = 0;
    desc_addr = ptable_hva + 8 * desc_idx;
    // Counts the number of contiguous entries in the range being mapped.
    for (addr_idx = 0; addr_idx != num_contiguous_entries; addr_idx++) {
      contig_desc = desc_addr[addr_idx];
      if (contig_desc & 1) {
        is_contiguous = pt_check_ops(contig_desc, page_size_log2);
        contiguous_count += is_contiguous;
        if (is_contiguous < 0) {
          // If one of the descriptors is deemed invalid in the page table we mapped, the stage 2 entry of the page
          // table gets reset.
          goto RESET_PTABLE_FROM_S2;
        }
      }
    }
    // If there are enough contiguous entries, check that they are all mapped with the same attributes.
    if (contiguous_count) {
      if (num_contiguous_entries <= contiguous_count && num_contiguous_entries != 1) {
        desc = *(uint64_t*)(ptable_hva + 8 * desc_idx);
        desc_attrs = desc & 0x7f000000000b7f;
        desc_oa = desc & 0xfffffffff000;
        remaining_entries = num_contiguous_entries;
        curr_oa = desc_oa & -(desc_va_range * num_contiguous_entries);
        // ...
        for (;;) {
          iter_desc = *desc_addr++;
          iter_desc_attrs = iter_desc & 0x7f000000000b7f;
          iter_desc_oa = iter_desc & 0xfffffffff000;
          // Checks if attributes are the same for the whole range and makes sure we haven't reached the end of the
          // contiguous region.
          if (desc_attrs != iter_desc_attrs || iter_desc_oa != curr_oa) {
            goto RESET_PTABLE_FROM_S2;
          }
          curr_oa += desc_va_range;
          if (!--remaining_entries) {
            break;
          }
        }
      }
    }
    // Every descriptor in this range is valid, move on to the next one.
    continue;
RESET_PTABLE_FROM_S2:
    // A descriptor was deemed invalid: reset the stage 2 entry of the page table and bail out.
    map_stage2_memory(ptable, ptable_pa, page_size, EXEC_EL0 | WRITE | READ | NORMAL_MEMORY, 0);
    // ...
    unmap_hva_from_hypervisor(ptable_hva);
    return 0;
  }
  return ptable_hva;
}

We then move on to the main kernel page table processing function, process_kernel_page_table. It recursively walks the page tables and each call returns a value, called verdict in the snippet below, indicating:

  • if it is negative, that an error occurred during the processing of the page tables;
  • if it is zero, that no region in the physical memory that the table maps is executable at EL1 or accessible from EL0;
  • if it is positive, that at least one region in the physical memory that the table maps is executable at EL1 or accessible at EL0.

During the recursion, depending on the page table level and descriptor type, process_kernel_page_table performs the operations listed below.

  • Table Descriptors

    • If the current and previous levels prevent physical memory from being executable at EL1 or accessible from EL0, it returns a verdict of 0 without processing the current and next levels' tables.
    • Otherwise, the current table is mapped as read-only and marked as PTABLE_Ln in the second stage, before calling process_kernel_page_table on each of its descriptors. If no region that the table maps is executable at EL1 or accessible from EL0, it simply sets the APTable[0] and PXNTable bits of the current descriptor and unmarks/unprotects the table in the second stage. But if there is at least one region, it must keep the table read-only.
  • Page and Block Descriptors

    • If the previous levels prevent physical memory from being executable at EL1 or accessible from EL0, it returns a verdict of 0.
    • If the current or previous levels prevent physical memory from being executable at EL1, it returns a verdict that depends on whether the region is accessible at EL0 or not.
    • If the previous levels made the region read-only, then it is made read-only and executable at EL1 in the second stage, inaccessible at EL0 in the current descriptor, and a verdict of 1 is returned. The same thing happens if the physical memory is read-only and its dirty state is not tracked.
    • Otherwise, the PXN bit of the current descriptor is set, and the verdict returned depends on whether the region is accessible at EL0 or not.
int32_t process_kernel_page_table(uint64_t* desc_p, uint32_t table_size_log2, uint32_t pt_level, uint64_t prev_attrs) {
  // ...

  was_not_marked = 0;
  verdict = 0;
  desc = *desc_p;
  table_size = 1 << table_size_log2;
  nb_pt_entries_log2 = table_size_log2 - 3;
  nb_pt_entries = 1 << nb_pt_entries_log2;

  // Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
  // at level 2, etc.).
  pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;

  // Checks if the descriptor is valid.
  if (!(desc & 1)) {
    return 0;
  }

  // Level 0, 1 or 2 tables.
  if (pt_level <= 2 && (desc & 2)) {
    ptable_ipa = desc & 0xfffffffff000;
    // ...
    // Retrieves the stage 1 upper attributes of the table descriptor.
    desc_ua = desc >> 59;
    prev_ua = prev_attrs >> 59;
    // Page tables that don't allow access from EL0 and execution at EL1, e.g. kernel data, are ignored. We don't need
    // to set the upper attributes in the next levels page tables.
    //
    //   - APTable,  bits [62:61]: APTable[0] == 1 -> No access at EL0.
    //   - PXNTable, bit [59]: PXNTable == 1 -> No exec at EL1.
    if ((desc_ua & 5 | prev_ua & 5) == 5) {
      return 0;
    }
    // Gets the upper attributes for the current level.
    current_level_attrs = desc & 0x7800000000000000 | prev_attrs;
    next_pt_level = pt_level + 1;
    // Maps the current page table level.
    ptable_hva =
        map_page_table_hyp_and_stage2(ptable_ipa, table_size_log2, pt_level + 1, nb_pt_entries, &was_not_marked);
    // ...
    for (int32_t i = 0; i < 0x1000; i += 8) {
      // Processes all the page table entries found in the current level we've just mapped.
      ret = process_kernel_page_table(ptable_hva + i, table_size_log2, next_pt_level, current_level_attrs);
      if (ret < 0) {
        unmap_hva_from_hypervisor(ptable_hva);
        if (was_not_marked) {
          goto UNMARK_PAGE_TABLE;
        }
        return ret;
      }
      verdict += ret;
    }
    unmap_hva_from_hypervisor(ptable_hva);
    // If the page table was made read-only by the call to map_page_table_hyp_and_stage2 and maps no region that is
    // executable at EL1 or accessible from EL0, then it doesn't need to be protected as long as the previous levels are
    // protected.
    if (was_not_marked && !verdict) {
      // Sets APTable[0] = 1 and PXNTable = 1 on the descriptor.
      *desc_p = desc | (1 << 61) | (1 << 59);
UNMARK_PAGE_TABLE:
      // Unmark the page table in the second translation stage.
      ptable_s2_desc = get_stage2_page_table_descriptor_for_ipa(ptable_ipa, 0);
      ptable_pa = ptable_s2_desc & 0xfffffffff000;
      map_stage2_memory(ptable_ipa, ptable_pa, table_size, EXEC_EL0 | WRITE | READ | NORMAL_MEMORY, 0);
      // ...
      return verdict;
    }
  }
  // Level 3 page or block.
  else {
    // Table attributes - Bit 59, PXNTable == 1 -> No exec at EL1.
    if (prev_attrs >> 59 & 1) {
      // Table attributes - Bit 61, APTable[0] == 1 -> No access at EL0.
      if (prev_attrs >> 61 & 1) {
        return 0;
      }
      non_executable = 1;
      // Page attributes - Bit 6, AP[1]: EL0 Access.
      verdict = (desc >> 6) & 1;
    }
    // Table attributes - Bit 59, PXNTable == 0 -> Exec at EL1.
    else {
      // Page descriptor - Bit 53, PXN.
      non_executable = (desc >> 53) & 1;
      verdict = 0;
      // Table attributes - Bit 61, APTable[0] == 0 -> Access at EL0.
      if (!(prev_attrs >> 61 & 1)) {
        // Page attributes - Bit 6, AP[1]: EL0 Access.
        verdict = (desc >> 6) & 1;
      }
    }
    // If the region should be executable, the page will be set as RX in the second stage if the dirty state is not
    // tracked and the page is not writable.
    if (!non_executable) {
      // Table attributes - Bit 62, APTable[1] == 1 -> Read-only.
      if (prev_attrs >> 62 & 1) {
        goto SET_RX;
      }
      // Page attributes - Bit 52, Contiguous.
      if (desc >> 52 & 1) {
        num_contiguous_entries = 0x10;
        // Gets a pointer to the descriptor of the start of the contiguous block.
        desc_contiguous_start_p = (uint64_t)desc_p & 0xffffffffffffff80;
        // ...
        for (int32_t idx = 0; idx < num_contiguous_entries; idx++) {
          // Retrieves the next descriptor in the contiguous range.
          next_desc = *(uint64_t*)(desc_contiguous_start_p + 8 * idx);
          // If one of the descriptors from the contiguous range has its dirty state tracked or is writable, the PXN bit
          // is forced and we return.
          //
          //   - DBM,   bit [51]: Dirty Bit Modifier.
          //   - AP[2], bit [7]: read / write access.
          if (!(next_desc >> 7 & 1) || (next_desc >> 51 & 1)) {
            // Page descriptor - Bit 53, PXN.
            *desc_p = desc | (1 << 53);
            return verdict;
          }
        }
        // Otherwise we set it as RX in the second stage.
        goto SET_RX;
      }
      // Otherwise, if the dirty state is not tracked and the page is read-only, it is set as RX in the second stage.
      //
      //   - DBM,   bit [51]: Dirty Bit Modifier.
      //   - AP[2], bit [7]: read / write access.
      else if ((desc >> 7 & 1) && !(desc >> 51 & 1)) {
SET_RX:
        ipa = desc & 0xfffffffff000;
        va_range_size = 1 << pt_level_idx_pos;
        // ...
        // Sets the current va range as read-only / executable.
        ret = change_stage2_software_attrs_per_ipa_range(ipa, va_range_size, EXEC_EL1 | EXEC_EL0 | READ, OS_RO,
                                                         1 /* sets the software attributes in the descriptor */);
        if (!ret) {
          if (pt_level == 3 && !g_has_set_l3_ptable_ro) {
            g_has_set_l3_ptable_ro = 1;
          }
          // Disables EL0 access.
          *desc_p = desc & 0xffffffffffffffbf;
          return 1;
        }
        return -1;
      }
      // Page descriptor - Bit 53, PXN.
      *desc_p = desc | (1 << 53);
    }
  }
  return verdict;
}

Kernel Page Table Writes

As we have just seen, kernel page tables that do not contain regions executable at EL1 or accessible from EL0 were made read-only in the second stage when they were initially processed in map_page_table_hyp_and_stage2. When the kernel writes to these tables, an exception is raised, allowing the hypervisor to validate the changes. It is now time to detail the check_kernel_page_table_write function that the hypervisor calls to get a verdict when a data abort occurs.

check_kernel_page_table_write ensures that we are in the situation above, i.e. the exception was triggered by the kernel and the page written to was marked as PTABLE_Ln. If the kernel writes to a PGD other than the one currently in use, the write is immediately allowed, since verifications will be performed once the kernel sets it as its new translation table. Additionally, if the write is smaller than 8 bytes, it is transformed into a bigger access spanning the whole descriptor.

In case the descriptor being written to is currently invalid, the access is allowed, but permissions are changed to enforce non-executability at EL1 (by setting PXN or PXNTable) and non-accessibility from EL0 (by unsetting AP[1] or setting APTable[0]). Otherwise, for block and page descriptors, the verdict is left to is_kernel_write_to_block_page_allowed, and for table descriptors, to is_kernel_write_to_table_allowed. If the verdict is negative or zero, it is simply returned. However, if it is positive, the current and next levels' descriptors are processed by the reset_executable_memory function, and the permissions are changed to enforce non-executability at EL1 and non-accessibility from EL0.
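
To make the widening of sub-8-byte accesses more concrete, here is a minimal standalone sketch of the mask-and-merge computation performed by the function below (the helper is ours and purely illustrative).

// For example, a 32-bit store of 0xdeadbeef at PA 0x80001234 becomes a 64-bit store at 0x80001230
// that only replaces bytes 4 to 7 of the descriptor.
uint64_t widen_write(uint64_t old_qword, uint64_t fault_pa, uint64_t reg_val, int32_t access_size_log2) {
  uint32_t offset_in_bits = 8 * (fault_pa & 7);      // 32 in the example above.
  uint32_t access_size_in_bits = 8 << access_size_log2;  // 32 bits.
  uint64_t update_mask = (0xffffffffffffffff >> (64 - access_size_in_bits)) << offset_in_bits;
  uint64_t new_val = reg_val << offset_in_bits;
  return (old_qword & ~update_mask) | (new_val & update_mask);
}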

int32_t check_kernel_page_table_write(uint64_t* reg_val_ptr,
                                      uint64_t* fault_ipa_ptr,
                                      int32_t* access_size_log2_ptr,
                                      uint64_t* fault_pa_ptr) {
  // ...
  // Userspace is not allowed to write kernel page tables.
  //
  //   - M[3:0], bits [3:0]: AArch64 Exception level and selected Stack Pointer (...) M[3:2] is set to the value of
  //                         PSTATE.EL.
  el = (get_spsr_el2() & 0b1111) >> 2;
  if (el == 0 /* EL0 */) {
    debug_print(0x201, "Illegal operation from user space");
    return -1;
  }

  // Check if the fault IPA is mapped in the stage 2.
  fault_ipa = *fault_ipa_ptr;
  desc_s2 = get_stage2_page_table_descriptor_for_ipa(fault_ipa, &desc_s2_va_range);
  if ((desc_s2 & 1) == 0) {
    return -1;
  }

  // Translate the fault IPA into a PA for the caller.
  *fault_pa_ptr = (desc_s2 & 0xfffffffff000) | (fault_ipa & (desc_s2_va_range - 1));

  // Check if the fault IPA belongs to a stage 1 PT of any level.
  sw_attrs = get_software_attrs(desc_s2);
  if (!is_software_attrs_ptable(sw_attrs)) {
    // Decide what to do for the other software attributes.
    return report_error_on_invalid_access(desc_s2);
  }

  // Get the information about the kernel page tables that was gathered when the kernel page tables where processed.
  if (get_kernel_pt_info(desc_s2 & 1, &kernel_pgd, &kernel_granule_size_log2, &kernel_pt_level,
                         &kernel_nb_pt_entries) != 0) {
    return -1;
  }

  // Allow writes to other PGDs than the one currently in use by the kernel.
  //
  // Page tables will be checked once the kernel switches to this new PGD.
  fault_pt_level = sw_attrs & 3;
  if (fault_pt_level == kernel_pt_level && fault_ipa - kernel_pgd >= 8 * kernel_nb_pt_entries) {
    return 0;
  }

  // If the access is smaller than 8 bytes, the fault IPA and written value will be adjusted to fake a qword access that
  // is easier to work with.
  if (*access_size_log2_ptr != 3) {
    // Compute a mask of the bits updated by the write and the bits values updated by the write.
    offset_in_bits = 8 * (*fault_pa_ptr & 7);
    access_size_in_bits = 8 << *access_size_log2_ptr;
    update_mask = (0xffffffffffffffff >> (64 - access_size_in_bits)) << offset_in_bits;
    new_val = *reg_val_ptr << offset_in_bits;
    // Align the fault PA on 8 bytes.
    *fault_pa_ptr = *fault_pa_ptr & 0xfffffffffffffff8;
    // Get the old value from memory.
    old_val_hva = map_pa_into_hypervisor(*fault_pa_ptr);
    old_val = *(uint64_t*)old_val_hva;
    unmap_hva_from_hypervisor(old_val_hva);
    // Update the written value and access size.
    *reg_val_ptr = (old_val & ~update_mask) | (new_val & update_mask);
    *access_size_log2_ptr = 3;
  }

  // Get the current value of the descriptor in the stage 1 page tables.
  desc_s1_hva = map_pa_into_hypervisor(*fault_pa_ptr);
  cur_desc_s1 = *desc_s1_hva;
  // Pointer to the value the kernel wants to write at address fault_ipa.
  new_desc_s1_ptr = reg_val_ptr;

  // If the descriptor is invalid, the write is allowed but the PXN bit will be set and access from EL0 disallowed.
  if ((cur_desc_s1 & 1) == 0) {
    goto CHANGE_PXN_AND_AP_BITS;
  }

  // Check if the write is allowed by calling a different function depending on the descriptor type: block/page or
  // table.
  if (fault_pt_level == 3 || (cur_desc_s1 & 0b10) == 0b00) {
    verdict =
        is_kernel_write_to_block_page_allowed(desc_s1_hva, *new_desc_s1_ptr, kernel_granule_size_log2, fault_pt_level);
  } else {
    verdict = is_kernel_write_to_table_allowed(desc_s1_hva, *new_desc_s1_ptr, kernel_granule_size_log2, fault_pt_level);
  }

  // If verdict is:
  //
  //   - < 0  -> the write is not allowed.
  //   - == 0 -> the write is allowed.
  //   - > 0  -> the write is allowed but the PXN bit will be set and access from EL0 disallowed.
  if (verdict <= 0) {
    unmap_hva_from_hypervisor(desc_s1_hva);
    return verdict;
  }

CHANGE_PXN_AND_AP_BITS:
  // Recursively walk the page tables starting from the stage 1 descriptor and for each executable memory region found,
  // remove it from the stage 1 page tables, set the memory as read-write non-exec in the stage 2 and also resets the
  // software attributes, effectively unmarking the region.
  reset_executable_memory(desc_s1_hva, fault_pt_level);
  unmap_hva_from_hypervisor(desc_s1_hva);

  // If the kernel page tables have not been processed yet, then the access is allowed.
  if (!atomic_get(&g_kernel_page_tables_processed)) {
    return 0;
  }

  // If the descriptor is set to invalid, the access is also allowed.
  if (*new_desc_s1_ptr == 0) {
    return 0;
  }

  // If the descriptor is a reserved descriptor (a would-be level 3 block descriptor), change it into an invalid
  // descriptor.
  if (fault_pt_level == 3 && (*new_desc_s1_ptr & 0b10) == 0b00) {
    *new_desc_s1_ptr = 0;
  }

  // If the descriptor is a table descriptor, set the memory as not executable at EL1 and not accessible from EL0.
  //
  //   - PXNTable, bit [59]     = 1: the PXN bit is treated as 1 in all subsequent levels of lookup, regardless of the
  //                                 actual value of the bit.
  //   - APTable,  bits [62:61] = 0bx1: Access at EL0 not permitted, regardless of permissions in subsequent levels of
  //                                    lookup.
  else if (fault_pt_level != 3 && (*new_desc_s1_ptr & 0b10) != 0b00) {
    *new_desc_s1_ptr = *new_desc_s1_ptr | (1 << 59) | (1 << 61);
  }

  // If the descriptor is a page descriptor, set the memory as not executable at EL1 and not accessible from EL0.
  //
  //   - AP,  bits [7:6] = 0bx0: Access at EL0 not permitted.
  //   - PXN, bit [53]   = 1: Execution at EL1 not permitted.
  else {
    *new_desc_s1_ptr = (*new_desc_s1_ptr & ~(1 << 6)) | (1 << 53);
  }

  return 0;
}

The role of is_kernel_write_to_block_page_allowed is to examine the changes made to the input block or page descriptor and determine whether or not they are allowed. It does so by computing which bits differ between the old and the new values, as well as different masks to allow the following modifications of the descriptor:

  • toggling the AF bit;
  • setting the PXN and UXN bits;
  • clearing the AP[1] bit;
  • toggling the AP[2] and DBM bits if the page is read-write or the dirty state is tracked.

The function returns early if no disallowed changes were found. Otherwise, it ensures that none of the physical memory regions covered by the descriptor are protected by the hypervisor.
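
For example, if the kernel rewrites a page descriptor only to set its AF bit, all the differing bits are masked out, changed_bits ends up being zero and the write is allowed outright. If, on the other hand, it tries to clear the PXN bit of a descriptor (only setting PXN is ignored, not clearing it), changed_bits is non-zero and the write is only allowed, with PXN forced back by the caller, if none of the physical memory covered by the descriptor is protected.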

int32_t is_kernel_write_to_block_page_allowed(uint64_t* desc_s1_hva,
                                              uint64_t reg_val,
                                              uint32_t kernel_granule_size_log2,
                                              uint8_t fault_pt_level) {
  // ...
  // If the new descriptor is also a block descriptor.
  if ((reg_val & 1) == 1) {
    // Allow setting or unsetting the AF bit and the reserved bits.
    //
    // AF, bit [10]: Access Flag.
    changed_bits = (*desc_s1_hva ^ reg_val) & ~((1 << 10) | (0b111111111 << 55));

    // Allow setting the PXN and UXN bits.
    //
    //   - PXN, bit [53]: Privileged eXecute-Never.
    //   - UXN, bit [54]: Unprivileged eXecute-Never.
    ignored_bits_set = ~(reg_val & ((1 << 53) | (1 << 54)));
  }

  // Otherwise it is an invalid descriptor.
  else {
    changed_bits = *desc_s1_hva;
    ignored_bits_set = 0xffffffffffffffff;
  }

  // Allow unsetting the AP[1] bit, disallowing access from EL0.
  //
  // AP, bits [7:6]: AP[1] selects between EL0 or EL1 control.
  ignored_bits_unset = ~(*desc_s1_hva & (1 << 6));

  // Compute the changed bits to check if the access is allowed.
  changed_bits = changed_bits & ignored_bits_set & ignored_bits_unset;

  // If access from EL1 is read-write or the dirty state is tracked, ignore setting or unsetting the AP[2] and DBM bits.
  //
  //   - AP,  bits [7:6]: AP[2] selects between read-only and read/write access.
  //   - DBM, bit [51]: Dirty Bit Modifier.
  if (((*desc_s1_hva >> 7) & 1) == 0 || ((*desc_s1_hva >> 51) & 1) == 1) {
    changed_bits &= ~((1 << 7) | (1 << 51));
  }

  // No disallowed changes were made.
  if (changed_bits == 0) {
    return 0 /* allowed */;
  }

  // Get the output memory range (IPAs) from the descriptor.
  region_ipa = *desc_s1_hva & 0xfffffffff000;
  if (kernel_granule_size_log2 > 16) {
    region_ipa |= (*desc_s1_hva & 0xf000) << 36;
  }
  region_size = 1 << ((4 - fault_pt_level) * (kernel_granule_size_log2 - 3) + 3);
  // ...

  // Check if the current output memory range contains protected regions.
  for (offset = 0; offset < region_size; offset += desc_s2_va_range) {
    desc_s2 = get_stage2_page_table_descriptor_for_ipa(region_ipa + offset, &desc_s2_va_range);
    if (!desc_s2_va_range || (desc_s2 & 1) == 0) {
      break;
    }

    sw_attrs = get_software_attrs(desc_s2);
    switch (sw_attrs) {
      case OS_RO:
      case HYP_RO:
      case HYP_RO_MOD:
        return -1 /* disallowed */;
      case OS_RO_MOD:
        // If execution at EL1 was not permitted.
        //
        // XN[1:0], bits [54:53]: Execute-Never.
        if (((desc_s2 >> 53) & 0b11) == 0b01 || ((desc_s2 >> 53) & 0b11) == 0b10) {
          return -1 /* disallowed */;
        }
    }
  }
  return 1 /* allowed, but PXN will be set and access from EL0 disallowed */;
}

The role of is_kernel_write_to_table_allowed is to examine the changes made to the input table descriptor and determine whether or not they are allowed. It does so in a similar manner to is_kernel_write_to_block_page_allowed and allows setting the PXNTable, UXNTable, and APTable bits.

The function returns early if no disallowed changes were found. Otherwise, it calls is_table_hyp_mediated, which will recursively check if the tables pointed by the descriptor map protected memory. Depending on the return value of this function, the write is either disallowed, or allowed but the descriptor is changed to make the memory non-executable at EL1 and not accessible from EL0.

int32_t is_kernel_write_to_table_allowed(uint64_t* desc_s1_hva,
                                         uint64_t reg_val,
                                         uint32_t kernel_granule_size_log2,
                                         uint8_t fault_pt_level) {
  // If the new descriptor is also a table descriptor.
  if ((reg_val & 0b11) == 0b11) {
    // Allow setting or unsetting the reserved bits.
    changed_bits = reg_val & ~((0b1111111111 << 2) | (0b11111111 << 51));
    // Allow setting the PXNTable, UXNTable and APTable bits, which effectively allows disabling execution and access.
    //
    //   - PXNTable, bit [59]: the PXN bit is treated as 1 in all subsequent levels of lookup.
    //   - UXNTable, bit [60]: the UXN bit is treated as 1 in all subsequent levels of lookup.
    //   - APTable,  bits [62:61]: access permissions limit subsequent levels of lookup.
    ignored_bits_set = ~(reg_val & ((1 << 59) | (1 << 60) | (0b11 << 61)));
  }

  // Otherwise it is an invalid descriptor.
  else {
    changed_bits = *desc_s1_hva ^ reg_val;
    ignored_bits_set = 0xffffffffffffffff;
  }

  // No disallowed changes were made.
  if ((changed_bits & ignored_bits_set) == 0) {
    return 0 /* allowed */;
  }

  // Check if the table, or any of the subsequent level tables, contains memory that is protected by the hypervisor.
  if (is_table_hyp_mediated(desc_s1_hva, kernel_granule_size_log2, fault_pt_level + 1) < 0) {
    return -1 /* disallowed */;
  } else {
    return 1 /* allowed, but PXN will be set and access from EL0 disallowed */;
  }
}

is_table_hyp_mediated starts by retrieving the page table IPA from the input descriptor, makes sure it is marked as PTABLE_Ln in the second stage, and maps it into the hypervisor's address space. It then iterates over each of its descriptors, and for:

  • table descriptors, it makes a recursive call to is_table_hyp_mediated on the next level table;
  • page or block descriptors, it ensures that none of the physical memory regions covered by the descriptor are protected by the hypervisor.

If the write is allowed, it sets the PXNTable bit of the stage 1 descriptor and unmarks the table's physical memory page from the second stage.

int32_t is_table_hyp_mediated(uint64_t* desc_s1_hva, uint32_t kernel_granule_size_log2, uint8_t ptable_level) {
  // ...
  // Get the page table IPA from the descriptor.
  ptable_ipa = *desc_s1_hva & 0xfffffffff000;
  if (kernel_granule_size_log2 > 16) {
    ptable_ipa |= (*desc_s1_hva & 0xf000) << 36;
  }

  // Check if the page is mapped in the stage 2 and belongs to any stage 1 PT.
  desc_s2 = get_stage2_page_table_descriptor_for_ipa(ptable_ipa, 0);
  sw_attrs = get_software_attrs(desc_s2);
  if ((desc_s2 & 1) == 0 || !is_software_attrs_ptable(sw_attrs)) {
    return 0;
  }

  // Map the page table into the hypervisor.
  ptable = map_pa_into_hypervisor(ptable_ipa);
  ptable_size = 1 << kernel_granule_size_log2;

  // And iterate over each of its descriptors.
  for (desc_index = 0; desc_index < ptable_size / 8; ++desc_index) {
    ptable_desc_ptr = (uint64_t*)(ptable + desc_index * 8);

    // If it is an invalid descriptor, do nothing.
    if ((*ptable_desc_ptr & 1) == 0) {
      continue;
    }

    // If it is a table descriptor, call is_table_hyp_mediated to check it.
    if (ptable_level < 3 && (*ptable_desc_ptr & 0b10) == 0b10) {
      disallowed = is_table_hyp_mediated(ptable_desc_ptr, kernel_granule_size_log2, ptable_level + 1);
      if (disallowed < 0) {
        return disallowed;
      }
    }

    // If it is a block or page descriptor.
    else {
      // Get the output memory range (IPAs) from the descriptor.
      output_addr = *ptable_desc_ptr & 0xfffffffff000;
      // ...
      output_size = 1 << ((4 - ptable_level) * (kernel_granule_size_log2 - 3) + 3);

      // Check if the output memory range contains protected regions.
      for (offset = 0; offset < output_size; offset += desc_s2_va_range) {
        out_desc_s2 = get_stage2_page_table_descriptor_for_ipa(output_addr + offset, &desc_s2_va_range);
        if (!desc_s2_va_range || (out_desc_s2 & 1) == 0) {
          break;
        }

        sw_attrs = get_software_attrs(out_desc_s2);
        switch (sw_attrs) {
          case OS_RO:
          case HYP_RO:
          case HYP_RO_MOD:
            return -1 /* disallowed */;
          case OS_RO_MOD:
            // If execution at EL1 was not permitted.
            //
            // XN, bits [54:53]: Execute-Never.
            if (((out_desc_s2 >> 53) & 0b11) == 0b01 || ((out_desc_s2 >> 53) & 0b11) == 0b10) {
              return -1 /* disallowed */;
            }
        }
      }
    }
  }

  // Set the PXNTable bit of the descriptor.
  //
  // PXNTable, bit [59]: the PXN bit is treated as 1 in all subsequent levels of lookup.
  *desc_s1_hva |= 1 << 59;
  dsb_ishst();

  // Resets the stage 2 attributes and permissions of the current page table before it gets changed to the new one.
  map_stage2_memory(ptable_ipa, desc_s2 & 0xfffffffff000, 1 << kernel_granule_size_log2,
                    NORMAL_MEMORY | READ | WRITE | EXEC_EL0, 0);
  // ...
  tlbi_vmalle1is();
  return 1 /* allowed, but PXN will be set and access from EL0 disallowed */;
}

As we've seen previously, reset_executable_memory is called by check_kernel_page_table_write when the verdict is a positive value. If the current descriptor is:

  • a table descriptor, and the memory region is executable at EL1, it calls reset_executable_memory on each of the entries of the table;
  • a block or page descriptor, and the memory region is executable at EL1, it sets the descriptor as invalid and resets the stage 2 permissions of the physical memory region covered by the descriptor (making it writable again, but not executable at EL1).
void reset_executable_memory(uint64_t* desc_s1_hva, uint8_t fault_pt_level) {
  // ...
  // If the descriptor is invalid, nothing to do.
  if ((*desc_s1_hva & 1) == 0) {
    return;
  }

  // If it is a table descriptor.
  if (fault_pt_level < 3 && (*desc_s1_hva & 0b10) == 0b10) {
    // And the memory region is executable at EL1.
    //
    // PXNTable, bit [59].
    //
    //   - 0b1: the PXN bit is treated as 1 in all subsequent levels of lookup.
    //   - 0b0: has no effect.
    if (((*desc_s1_hva >> 59) & 1) == 0) {
      // Get the page table IPA from the descriptor.
      ptable_ipa = *desc_s1_hva & 0xfffffffff000;

      // Get the page table PA in the stage 2.
      ptable_desc_s2 = get_stage2_page_table_descriptor_for_ipa(ptable_ipa, &desc_s2_va_range);
      ptable_pa = (ptable_desc_s2 & 0xfffffffff000) | (ptable_ipa & (desc_s2_va_range - 1));

      // Map the page table into the hypervisor.
      ptable = map_pa_into_hypervisor(ptable_pa);
      ptable_size = 0x1000;

      // And iterate over each of its descriptors and call reset_executable_memory recursively.
      for (desc_index = 0; desc_index < ptable_size / 8; ++desc_index) {
        ptable_desc_ptr = (uint64_t*)(ptable + desc_index * 8);
        reset_executable_memory(ptable_desc_ptr, fault_pt_level + 1);
      }

      // Unmap the page table from the hypervisor.
      unmap_hva_from_hypervisor(ptable);
    }
  }

  // If it is a block or a page descriptor.
  else {
    // And the memory is executable at EL1.
    //
    // PXN, bit [53]: Privileged eXecute-Never.
    if (((*desc_s1_hva >> 53) & 1) == 0) {
      // And the kernel page tables have been processed.
      if (atomic_get(&g_kernel_page_tables_processed)) {
        // Set the descriptor as invalid.
        *desc_s1_hva &= ~1;
        dsb_ish();
        tlbi_vmalle1is();
        dsb_ish();
        isb();
        ic_ialluis();
        // Set the PXN bit.
        *desc_s1_hva |= 1 << 53;
        dsb_ish();
        isb();
        // Remap the memory region pointed by the descriptor and resets permissions in the second stage.
        region_ipa = *desc_s1_hva & 0xfffffffff000;
        region_desc_s2 = get_stage2_page_table_descriptor_for_ipa(region_ipa, 0);
        region_size = 1 << (39 - 9 * fault_pt_level);
        region_pa = region_desc_s2 & 0xfffffffff000;
        map_stage2_memory(region_ipa, region_pa, region_size, NORMAL_MEMORY | READ | WRITE | EXEC_EL0, 0);
        // ...
      }
    }
  }
}

Stage 2 Faults During Stage 1 Page Table Walk

Although unrelated to the security guarantees, there is another hypervisor feature related to the kernel page tables that we still need to talk about: the emulation of hardware-managed bits implemented in handle_s2_fault_during_s1ptw.

If the hardware update of the dirty state and access flag is enabled, the corresponding bits are automatically changed by the CPU when a memory region is accessed or modified. However, in the second stage, we have seen that most kernel page tables are mapped as read-only. Thus, when the hardware tries to update the bits of the descriptors, an exception will be raised. The hypervisor needs to update the bits manually by handling this exception, which is what the handle_s2_fault_during_s1ptw function is used for.

In practice, handle_s2_fault_during_s1ptw is a wrapper around update_hw_bits_on_s2_fault_during_s1ptw that propagates the exception to EL1 if the call returned an error.

void handle_s2_fault_during_s1ptw(bool is_write) {
  // Updates the hardware-managed bits and lets the kernel handle the fault.
  //
  //   - HPFAR_EL2: Hypervisor IPA Fault Address Register, holds the IPA of the fault that occurred during the stage 1
  //                translation table walk. This is NOT the IPA of the VA under translation.
  //   - FAR_EL2: Fault Address Register (EL2), holds the VA that was being translated in the stage 1.
  uint64_t fault_ipa = get_hpfar_el2() << 8;
  if (update_hw_bits_on_s2_fault_during_s1ptw(fault_ipa, get_far_el2(), is_write)) {
    trigger_instr_data_abort_handling_in_el1();
  }
}

update_hw_bits_on_s2_fault_during_s1ptw does the actual update of the hardware-managed bits under the condition that the faulting stage 1 descriptor is valid and mapped as PTABLE_Ln in the second stage.

int32_t update_hw_bits_on_s2_fault_during_s1ptw(uint64_t fault_ipa, uint64_t translated_va, bool is_write) {
  hvc_lock_inc();

  // Check if the fault IPA is mapped in the stage 2.
  desc_s2 = get_stage2_page_table_descriptor_for_ipa(fault_ipa, 0);
  if ((desc_s2 & 1) == 0) {
    hvc_lock_dec();
    return -1;
  }

  // Check if the fault IPA belongs to a stage 1 PT of any level.
  sw_attrs = get_software_attrs(desc_s2);
  if (!is_software_attrs_ptable(sw_attrs)) {
    hvc_lock_dec();
    // Decide what to do for the other software attributes.
    return report_error_on_invalid_access(desc_s2);
  }

  // Map the stage 1 page table descriptor that caused the fault on access.
  fault_pt_level = sw_attrs & 3;
  desc_index = get_pt_desc_index(translated_va, fault_pt_level);
  fault_page_pa = desc_s2 & 0xfffffffff000;
  desc_s1_hva = map_pa_into_hypervisor((8 * desc_index) | fault_page_pa);
  desc_s1 = *desc_s1_hva;

  // Check if the stage 1 page table descriptor was valid.
  if ((desc_s1 & 1) == 0) {
    unmap_hva_from_hypervisor(desc_s1_hva);
    hvc_lock_dec();
    return 0;
  }

  // Check if the hypervisor needs to update the access flag.
  //
  //   - HA, bit [39] = 1: Stage 1 Hardware Access flag update enabled.
  //   - AF, bit [10] = 0: Access Flag.
  if (((get_tcr_el1() >> 39) & 1) == 1 && ((desc_s1 >> 10) & 1) == 0) {
    // Change the AF to 1.
    if (exclusive_load(desc_s1_hva) == desc_s1) {
      exclusive_store(desc_s1 | (1 << 10), desc_s1_hva);
    }
  }

  // Check if the hypervisor needs to update the dirty state.
  //
  //   - HD,  bit [40]   = 1: Stage 1 hardware management of dirty state enabled.
  //   - AP,  bits [7:6] = 0b1x: Read-only access from EL1.
  //   - DBM, bit [51]   = 1: Dirty Bit Modifier.
  else if (is_write && ((get_tcr_el1() >> 40) & 1) == 1 && ((desc_s1 >> 7) & 1) == 1 && ((desc_s1 >> 51) & 1) == 1) {
    // Change the AP bits to read/write from EL1.
    if (exclusive_load(desc_s1_hva) == desc_s1) {
      exclusive_store(desc_s1 & ~(1 << 7), desc_s1_hva);
    }
    tlbi_vaale1is(desc_s1);
  }

  unmap_hva_from_hypervisor(desc_s1_hva);
  hvc_lock_dec();
  return 0;
}

In this section, we have seen how the hypervisor enforces the following restrictions on the kernel page tables.

  • The kernel cannot change its PGD once its page tables have been processed by the hypervisor.
  • While processing the page tables, the hypervisor finds all physical memory regions that need to be executable at EL1 or accessible at EL0. Their page tables are made read-only in the second stage. For the other regions, non-executability at EL1 and non-accessibility at EL0 are enforced using the APTable[0] and PXNTable bits for table descriptors, in addition to the AP and PXN bits for page/block descriptors, and the page tables are left writable.
  • When the kernel modifies its page tables, the hypervisor ensures that it doesn't create a region executable at EL1 or accessible at EL0 and that it doesn't remove a protected region.

Hypervisor and Secure Monitor Calls

To communicate with higher ELs, the kernel can send requests using the HVC and SMC instructions. While SMCs are not intended for the hypervisor, we have seen in Exception Handling that HHEE is configured such that it can intercept SMCs to filter arguments from the kernel, get notified of power management events, etc.

Hypervisor Calls

The Exception Handling section introduced hhee_handle_hvc_instruction, the handler for HVC calls from the kernel. HVCs are divided into four groups based on their ID, each managed by its own function:

  • handle_hvc_c0xxxxxx: related to the Arm Architectural Service SMCs;
  • handle_hvc_c4xxxxxx: related to the Arm Power State Coordination Interface SMCs;
  • handle_hvc_c6xxxxxx: related to the HHEE security functionalities;
  • handle_hvc_c9xxxxxx: related to the HHEE logging functionality.

Since SMC handlers are implemented in the secure monitor and because the logging system is not particularly relevant, the rest of this section focuses on HVC handlers in the range 0xC6001000-0xC60010FF that implement memory protection features for the kernel and userland.

void hhee_handle_hvc_instruction(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
  if ((x0 & 0x80000000) != 0) {
    switch (x0 & 0x3f000000) {
      case 0:
        handle_hvc_c0xxxxxx(x0, x1, x2, x3, regs);
        return;
      case 0x4000000:
        handle_hvc_c4xxxxxx(x0, x1, x2, x3, regs);
        return;
      case 0x6000000:
        handle_hvc_c6xxxxxx(x0, x1, x2, x3, regs);
        return;
      case 0x9000000:
        handle_hvc_c9xxxxxx(x0, x1, x2, x3, regs);
        return;
    }
  }
  // ...
}

When the execution reaches handle_hvc_c6xxxxxx, the hypervisor dispatches the call to the corresponding HVC handler in the hvc_handlers array.

void handle_hvc_c6xxxxxx(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
  if ((x0 & 0xff00) == 0x1000) {
    // Calls the corresponding handler with the arguments stored in the general purpose registers X0, X1, X2 and X3 set
    // by the kernel before making the HVC.
    hvc_handlers[x0 & 0xff](x0, x1, x2, x3, regs);
  } else {
    /* ... */
  }
}

The table below lists the implemented HVC handlers and gives a brief description of their functionality.

HVC ID Name Description
0xC6001030 HKIP_HVC_RO_REGISTER Sets a memory region as read-only and marks it OS_RO.
0xC6001031 HKIP_HVC_RO_UNREGISTER Does nothing (unused in the kernel).
0xC6001032 HKIP_HVC_RO_MOD_REGISTER Sets a memory region as read-only and marks it OS_RO_MOD.
0xC6001033 HKIP_HVC_RO_MOD_UNREGISTER Resets a OS_RO_MOD memory region and unmarks it.
0xC6001040 HKIP_HVC_ROWM_REGISTER Sets a memory region as read-only and marks it HYP_RO.
0xC6001041 HKIP_HVC_ROWM_UNREGISTER Does nothing (unused in the kernel).
0xC6001042 HKIP_HVC_ROWM_MOD_REGISTER Sets a memory region as read-only and marks it HYP_RO_MOD.
0xC6001043 HKIP_HVC_ROWM_MOD_UNREGISTER Resets a HYP_RO_MOD memory region and unmarks it.
0xC6001050 HKIP_HVC_ROWM_SET_BIT Sets a bit of a page in a HYP_RO or HYP_RO_MOD memory region.
0xC6001051 HKIP_HVC_ROWM_WRITE Copies a buffer into a HYP_RO or HYP_RO_MOD memory region.
0xC6001054-0xC6001057 HKIP_HVC_ROWM_WRITE_{8,16,32,64} Writes a value in a HYP_RO or HYP_RO_MOD memory region.
0xC6001058 HKIP_HVC_ROWM_SET Sets a HYP_RO or HYP_RO_MOD memory region to a value.
0xC6001059 HKIP_HVC_ROWM_CLEAR Sets a HYP_RO or HYP_RO_MOD memory region to zero.
0xC600105A-0xC600105D HKIP_HVC_ROWM_CLEAR_{8,16,32,64} Zeroes a value in a HYP_RO or HYP_RO_MOD memory region.
0xC6001060 HKIP_HVC_XO_REGISTER Sets a memory region as execute-only at EL0 and marks it SOP_XO.
0xC6001082 HHEE_LKM_UPDATE Sets a memory region as executable at EL1 and marks it OS_RO_MOD.
0xC6001089 HHEE_HVC_TOKEN Returns a random value called the clarify token needed for HHEE_LKM_UPDATE.
0xC600108A HHEE_HVC_ENABLE_TVM Sets a global variable in the hypervisor that re-enables HCR_EL2.TVM on a CPU-suspend.
0xC6001090 HHEE_HVC_ROX_TEXT_REGISTER Sets a memory region as executable at EL1 in the second stage only and marks it OS_RO (unused in the kernel).
0xC6001091 HHEE_HVC_VMMU_ENABLE Enables the second stage of address translation (unused in the kernel).
0xC60010C8-0xC60010CB HKIP_HVC_ROWM_XCHG_{8,16,32,64} Exchanges a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010CC-0xC60010CF HKIP_HVC_ROWM_CMPXCHG_{8,16,32,64} Compares and exchanges a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010D0-0xC60010D3 HKIP_HVC_ROWM_ADD_{8,16,32,64} Adds to a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010D4-0xC60010D7 HKIP_HVC_ROWM_OR_{8,16,32,64} ORs a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010D8-0xC60010DB HKIP_HVC_ROWM_AND_{8,16,32,64} ANDs a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010DC-0xC60010DF HKIP_HVC_ROWM_XOR_{8,16,32,64} XORs a value in a HYP_RO or HYP_RO_MOD memory region.
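
On the kernel side, these services are invoked like standard SMCCC fast calls: the function ID goes into X0, up to three arguments into X1-X3, and the status comes back in X0, as shown by the handlers above. The wrapper below is a minimal hypothetical sketch of such a call; it is not Huawei's kernel code, and a production wrapper would mark the full list of SMCCC-clobbered registers.

static inline uint64_t hhee_hvc(uint64_t id, uint64_t arg1, uint64_t arg2, uint64_t arg3) {
  register uint64_t x0 asm("x0") = id;
  register uint64_t x1 asm("x1") = arg1;
  register uint64_t x2 asm("x2") = arg2;
  register uint64_t x3 asm("x3") = arg3;
  asm volatile("hvc #0" : "+r"(x0), "+r"(x1), "+r"(x2), "+r"(x3) : : "memory");
  return x0;
}

// e.g. registering a kernel data region as read-only:
//
//   hhee_hvc(0xc6001030 /* HKIP_HVC_RO_REGISTER */, (uint64_t)region_vaddr, region_size, 0);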

Hypervisor Side

Let's first have a look at the hypervisor call handlers. This will give us a good overview of the different protections that the kernel can apply to a memory region using the hypervisor.

Kernel Read-Only Memory Protection

There are two sets of HVC handlers that can be used to manage read-only memory regions at EL1. The first one, composed of hkip_hvc_ro_register and hkip_hvc_ro_unregister, is used for kernel data.

hkip_hvc_ro_register makes a memory range read-only and marks it as OS_RO in the second stage.

void hkip_hvc_ro_register(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
  hvc_lock_acquire();
  change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0 | READ, OS_RO, 1);
  hvc_lock_release();
  tlbi_vmalle1is();
  // ...
}

hkip_hvc_ro_unregister always returns an error. It is not possible to unmark memory marked by hkip_hvc_ro_register.

void hkip_hvc_ro_unregister(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
  regs->x0 = -2;
  // ...
}

The second set of HVC handlers, composed of hkip_hvc_ro_mod_register and hkip_hvc_ro_mod_unregister, is used for kernel modules, which can be loaded and unloaded after kernel initialization.

hkip_hvc_ro_mod_register makes a memory range read-only and marks it as OS_RO_MOD in the second stage.

void hkip_hvc_ro_mod_register(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
  hvc_lock_acquire();
  change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0 | READ, OS_RO_MOD, 1);
  hvc_lock_release();
  tlbi_vmalle1is();
  // ...
}

hkip_hvc_ro_mod_unregister unmarks a memory range marked as OS_RO_MOD in the second stage, effectively making it writable again.

void hkip_hvc_ro_mod_unregister(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
  hvc_lock_acquire();
  change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0 | READ, OS_RO_MOD, 0);
  hvc_lock_release();
  tlbi_vmalle1is();
  // ...
}
Kernel Executable Memory Protection

Similarly to the kernel read-only memory protection, memory regions can also be made executable at EL1. For the kernel code, this is done automatically when the hypervisor processes the page tables. For kernel modules, this is done using the hhee_lkm_update HVC. This HVC can only be called if the caller knows a random value generated at boot that can be obtained using the hhee_hvc_token HVC.
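
In other words, the expected flow is that the kernel fetches the token once, very early during boot, and keeps it around for later module loads. A hypothetical kernel-side sequence, reusing the hhee_hvc wrapper sketched earlier plus a variant that returns X1 instead of X0, could look like the following (again, this is our illustration, not the actual kernel code).

// Variant of the earlier wrapper that returns X1, the register hhee_hvc_token stores the token in.
static inline uint64_t hhee_hvc_ret1(uint64_t id, uint64_t arg1, uint64_t arg2, uint64_t arg3) {
  register uint64_t x0 asm("x0") = id;
  register uint64_t x1 asm("x1") = arg1;
  register uint64_t x2 asm("x2") = arg2;
  register uint64_t x3 asm("x3") = arg3;
  asm volatile("hvc #0" : "+r"(x0), "+r"(x1), "+r"(x2), "+r"(x3) : : "memory");
  return x1;
}

static uint64_t g_hhee_token;

void hhee_fetch_token_early(void) {
  // Must run before the hypervisor processes the kernel page tables, otherwise HHEE_HVC_TOKEN fails.
  g_hhee_token = hhee_hvc_ret1(0xc6001089 /* HHEE_HVC_TOKEN */, 0, 0, 0);
}

uint64_t hhee_make_module_text_executable(uint64_t vaddr, uint64_t size) {
  // Marks [vaddr, vaddr + size) as OS_RO_MOD and executable at EL1.
  return hhee_hvc(0xc6001082 /* HHEE_LKM_UPDATE */, vaddr, size, g_hhee_token);
}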

hhee_hvc_token returns a 64-bit random value called the clarify token. If it has not been initialized yet, it is randomly generated, stored in a global variable, and returned to the caller. The clarify token is unobtainable once the kernel page tables have been processed, meaning it can only be retrieved very early in the kernel boot process.

void hhee_hvc_token(uint64_t hvc_id, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
  // The token cannot be obtained after the kernel page tables have been processed.
  if (current_cpu->kernel_pt_processed || atomic_get(&current_cpu->sys_regs->regs_inited)) {
    regs->x0 = 0xfffffff8;
    return;
  }

  spin_lock(&g_clarify_token_lock);
  if (!g_clarify_token_set) {
    // Checks if a clarify token was set earlier in the bootchain, if not derive one from the Counter-timer Physical
    // Count register.
    g_clarify_token = g_boot_clarify_token;
    if (!g_clarify_token) {
      g_clarify_token = (0x5deece66d * get_cntpct_el0() + 0xd) & 0xffffffffffff;
    }
    g_clarify_token_set = 1;
  }
  spin_unlock(&g_clarify_token_lock);
  regs->x0 = 0;
  regs->x1 = g_clarify_token;
  // ...
}

hhee_lkm_update first verifies that the clarify token given by the caller matches the one stored in the global variable. If it does, it makes the target memory range executable at EL1 and marks it as OS_RO_MOD in the second stage. It finally calls unset_pxn_stage1_by_ipa on each page of the range, which unsets the PXN bit in the stage 1 page tables, effectively making the memory executable at EL1.

uint64_t hhee_lkm_update(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t clarify_token, saved_regs_t* regs) {
  // ...
  spin_lock(&g_clarify_token_lock);
  if (!g_clarify_token_set || clarify_token != g_clarify_token) {
    spin_unlock(&g_clarify_token_lock);
    return 0xfffffff8;
  }
  spin_unlock(&g_clarify_token_lock);

  hvc_lock_acquire();

  ret = 0xfffffffd;
  if (get_kernel_pt_info(1, &pgd, &pt_size_log2, &pt_level, &nb_pt_entries)) {
    goto EXIT;
  }

  ret = 0xffffffff;
  if (pt_size_log2 != 0xc) {
    goto EXIT;
  }

  ret = change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL1 | EXEC_EL0 | READ, OS_RO_MOD, 1);
  if (ret) {
    goto EXIT;
  }

  // On our device, nb_pt_entries = 0x200 and pt_level = 0, so the computation below adds 2^48 to vaddr, which, modulo
  // 2^64, subtracts 0xffff000000000000 from it. It effectively converts a virtual address of the upper VA range into
  // an offset within it.
  offset = vaddr + (nb_pt_entries << (39 - 9 * pt_level));

  // Walk the stage 1 page tables and remove the PXN bit from the descriptor of each page contained in the virtual
  // memory region.
  pgd_desc = pgd | 3;
  while (size > 0) {
    ret = unset_pxn_stage1_by_ipa(&pgd_desc, offset, pt_level, nb_pt_entries);
    if (ret) {
      goto EXIT;
    }
    offset += 0x1000;
    size -= 0x1000;
  }

EXIT:
  hvc_lock_release();
  return ret;
}

unset_pxn_stage1_by_ipa maps the input table into the hypervisor. If the PXNTable bit of the input descriptor is set, it iterates over all the next-level descriptors and sets their PXNTable or PXN bit, depending on their type, before unsetting the PXNTable bit of the input descriptor. This pushes the execute-never restriction one level down, so that clearing PXNTable on the input descriptor does not inadvertently make the rest of the range executable.

uint64_t unset_pxn_stage1_by_ipa(uint64_t* pt_desc_ptr, uint64_t offset, uint32_t pt_level, uint64_t nb_pt_entries) {
  // ...
  pt_desc = *pt_desc_ptr;
  pt_ipa = pt_desc & 0xfffffffff000;
  pt_s2_desc = get_stage2_page_table_descriptor_for_ipa(pt_ipa, NULL);

  // Returns an error if the stage 2 descriptor for the page table is invalid.
  if ((pt_s2_desc & 1) == 0) {
    return 0xfffffff7;
  }

  // Otherwise, map the table into the hypervisor to process it.
  pt_pa = pt_s2_desc & 0xfffffffff000;
  pt_hva = map_pa_into_hypervisor(pt_pa);

  // PXNTable, bit [59]: PXN limit for subsequent levels of lookup.
  if (((pt_desc >> 59) & 1) != 0) {
    for (uint64_t i = 0; i < nb_pt_entries; i++) {
      sub_desc = pt_hva[i];
      // If the current subsequent descriptor is valid...
      if ((sub_desc & 1) != 0) {
        // If the current descriptor is a table descriptor, sets the PXNTable bit.
        if (pt_level < 3 && (sub_desc & 0b10) != 0) {
          desc = sub_desc | (1ULL << 59);
        }
        // Otherwise, sets the PXN bit.
        else {
          desc = sub_desc | (1ULL << 53);
        }
        pt_hva[i] = desc;
      }
    }
    // Unsets PXNTable in the input descriptor.
    *pt_desc_ptr = pt_desc & ~(1ULL << 59);
  }

  pt_level_idx_pos = 39 - 9 * pt_level;
  pt_level_next_mask = (1ULL << pt_level_idx_pos) - 1;

  sub_desc_idx = offset >> pt_level_idx_pos;
  sub_desc = pt_hva[sub_desc_idx];

  // Returns an error if the subsequent descriptor is invalid.
  if ((sub_desc & 1) == 0) {
    ret = 0xfffffff7;
  }

  else if (pt_level <= 2) {
    // If the subsequent descriptor is a table, call unset_pxn_stage1_by_ipa recursively.
    if ((sub_desc & 0b10) != 0) {
      ret = unset_pxn_stage1_by_ipa(&pt_hva[sub_desc_idx], offset & pt_level_next_mask, pt_level + 1, 0x200);
    }
    // Otherwise, returns an error if it's a block descriptor.
    else {
      ret = 0xfffffff8;
    }
  }

  // Returns an error if the subsequent descriptor is reserved (i.e. level 3 block descriptor).
  else if ((sub_desc & 0b10) == 0) {
    ret = 0xfffffff7;
  }

  // Returns an error if the page is writable.
  //
  // AP, bits [7:6]: AP[2] selects between read-only and read/write access.
  else if (((sub_desc >> 7) & 1) == 0) {
    ret = 0xfffffff8;
  }

  // Otherwise, unsets the PXN bit and flushes the page tables.
  //
  // PXN, bit [53]: The Privileged execute-never field.
  else {
    pt_hva[sub_desc_idx] = sub_desc & ~(1ULL << 53);
    dsb_ish();
    tlbi_vaale1is();
    dsb_ish();
    isb();
    ret = 0;
  }

  unmap_hva_from_hypervisor(pt_hva);
  return ret;
}
Kernel Write-Rare Memory Protection

Similarly to the kernel read-only memory protection, there are two sets of HVC handlers that can be used to manage write-rare memory regions at EL1.

  • hkip_hvc_rowm_register marks kernel memory as read-only (READ) and hypervisor-mediated (HYP_RO);
  • hkip_hvc_rowm_unregister doesn't allow unmarking memory marked by hkip_hvc_rowm_register;
  • hkip_hvc_rowm_mod_register marks kernel module memory as read-only (READ) and hypervisor-mediated (HYP_RO_MOD);
  • hkip_hvc_rowm_mod_unregister unmarks memory marked by hkip_hvc_rowm_mod_register.

In addition, there are HVCs that allow modifying these hypervisor-mediated memory regions (i.e. all the other HVC IDs starting with HKIP_HVC_ROWM in the table from the Hypervisor Calls section). They implement functionality equivalent to memcpy, memset, bzero, and various arithmetic and logical operators.

Let's take a look at one of these functions, hkip_hvc_rowm_set_bit:

void hkip_hvc_rowm_set_bit(uint64_t hvc_id, uint64_t bits, uint64_t pos, uint64_t value, saved_regs_t* regs) {
  // ...
  target_ipa = virt_to_phys_el1(bits + pos / 8);
  // ...
  hvc_lock_inc();
  desc_s2 = get_stage2_page_table_descriptor_for_ipa(target_ipa, 0);
  sw_attrs = get_software_attrs(desc_s2);
  // ...
  if (sw_attrs == 0 || sw_attrs == HYP_RO || sw_attrs == HYP_RO_MOD) {
    target_hva = map_pa_into_hypervisor((desc_s2 & 0xfffffffff000) | (target_ipa & 0xfff));
    mask = 1 << (pos & 7);
    if (value != 0) {
      do {
        cur_value = exclusive_load(target_hva);
      } while (exclusive_store(cur_value | mask, target_hva));
    } else {
      do {
        cur_value = exclusive_load(target_hva);
      } while (exclusive_store(cur_value & ~mask, target_hva));
    }
    unmap_hva_from_hypervisor(target_hva);
  }
  hvc_lock_dec();
  // ...
}

After retrieving the stage 2 descriptor of the byte being written to and mapping the corresponding page in the hypervisor's address space, hkip_hvc_rowm_set_bit sets one of the byte's bits. All other functions related to write-rare memory regions follow a similar implementation and won't be detailed in this article.

Userland Execute-Only Memory Protection

Finally, there is one more kind of memory protection enabled by the hypervisor: the creation of userland execute-only memory that is unreachable from the kernel. This protection can be applied to a memory region using the hkip_hvc_xo_register HVC handler that marks it executable at EL0 (EXEC_EL0) and shared-object protected (SOP_XO) in the second stage.

hkip_hvc_xo_register is implemented as follows:

void hkip_hvc_xo_register(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
  hvc_lock_acquire();
  change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0, SOP_XO, 1);
  hvc_lock_release();
  tlbi_vmalle1is();
  // ...
}
Miscellaneous HVC Handlers

In addition, there are three miscellaneous HVC handlers.

The first one, hhee_hvc_enable_tvm, appears to be used to set the TVM bit of the HCR_EL2 system register on CPU suspend. It is triggered by the daemon /vendor/bin/hisecd writing to /proc/kernel_stp. However, in its configuration file, /vendor/etc/hisecd_scd.conf, the HheeTvmPolicy appears to be disabled. It is unclear to us what the true purpose of this HVC is, as the TVM bit is set in hyp_set_el2_and_enable_stage_2_per_cpu.

The other two, hhee_hvc_rox_text_register and hhee_hvc_vmmu_enable, are unused by the kernel and are suspected to be used only for debugging. hhee_hvc_rox_text_register sets a memory region as readable and executable at EL0/EL1 (READ | EXEC_EL0 | EXEC_EL1) and marks it as a kernel read-only region (OS_RO) in the second stage. hhee_hvc_vmmu_enable simply calls enable_stage2_addr_translation.

Kernel Side

We can now move on to the kernel side and explain how features implemented in the hypervisor are leveraged to protect sensitive information accessible at EL1.

Kernel Read-Only Data

The simplest use of the protections offered by the hypervisor is the enforcement of read-only memory permissions for the kernel data. The kernel code, as we have seen in the Processing the Kernel Page Tables section, was set as read-only in the second stage when the kernel page tables were processed by the hypervisor.

The kernel data is set as read-only in two steps. Both steps make use of the hkip_register_ro function, which simply does a HKIP_HVC_RO_REGISTER HVC.

static inline int hkip_register_ro(const void *base, size_t size)
{
    return hkip_reg_unreg(HKIP_HVC_RO_REGISTER, base, size);
}
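
The hkip_reg_unreg helper itself is not shown in this article. A minimal sketch, assuming it simply forwards the HVC ID and the virtual address range to the hypervisor through arm_smccc_hvc, in the same way as the other wrappers shown later, could look like the following; the return-value handling is an assumption on our part.

/* Hypothetical sketch of hkip_reg_unreg: forward an HVC ID and a virtual
 * address range to the hypervisor. */
static int hkip_reg_unreg(unsigned long hvc_id, const void *base, size_t size)
{
    struct arm_smccc_res res;

    if (hhee_check_enable() != HHEE_ENABLE)
        return 0;

    arm_smccc_hvc(hvc_id, (unsigned long)base, size, 0, 0, 0, 0, 0, &res);
    return (int)res.a0;
}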

Different kernel sections are set as read-only by the mark_constdata_ro function, in particular .rodata, .notes and the exception table.

void mark_constdata_ro(void)
{
    unsigned long start = (unsigned long)__start_rodata;
    unsigned long end = (unsigned long)__init_begin;
    unsigned long section_size = (unsigned long)end - (unsigned long)start;

    /*
     * mark .rodata as read only. Use __init_begin rather than __end_rodata
     * to cover NOTES and EXCEPTION_TABLE.
     */
    update_mapping_prot(__pa_symbol(__start_rodata), start,
                section_size, PAGE_KERNEL_RO);
    hkip_register_ro((void *)start, ALIGN(section_size, PAGE_SIZE));
    // ...
}

This function is called from start_kernel, right after the architecture-specific initialization.

asmlinkage __visible void __init start_kernel(void)
{
    // ...
    /* setup_arch is the last function to alter the constdata content */
    mark_constdata_ro();
    // ...
}

Lastly, the kernel section .ro_after_init_data is set as read-only by the mark_rodata_ro function.

void mark_rodata_ro(void)
{
    // ...
    section_size = (unsigned long)__end_data_ro_after_init -
            (unsigned long)__start_data_ro_after_init;
    update_mapping_prot(__pa_symbol(__start_data_ro_after_init), (unsigned long)__start_data_ro_after_init,
                section_size, PAGE_KERNEL_RO);
    hkip_register_ro((void *)__start_data_ro_after_init,
             ALIGN(section_size, PAGE_SIZE));
    // ...
}

This function is called from mark_readonly.

static void mark_readonly(void)
{
    if (rodata_enabled) {
        /*
         * load_module() results in W+X mappings, which are cleaned up
         * with call_rcu_sched().  Let's make sure that queued work is
         * flushed so that we don't hit false positives looking for
         * insecure pages which are W+X.
         */
        rcu_barrier_sched();
        mark_rodata_ro();
        rodata_test();
    } else {
        pr_info("Kernel memory protection disabled.\n");
    }
}

Unsurprisingly, it is called from kernel_init before spawning the initial process.

static int __ref kernel_init(void *unused)
{
    // ...
    mark_readonly();
    // ...
}
Modules Read-Only Code

Another protection enforced by the hypervisor is making module code both read-only and executable. The module data doesn't seem to have any permissions enforced at the hypervisor level. The memory layout of a kernel module is as follows:

memory layout of a module

The code path to protect module code is a little less straightforward than for the kernel data. The first step is to obtain the clarify token that is needed to use the protection function.

The clarify token is retrieved at initialization by the module_token_init function, which stores it into the clarify_token global variable.

static unsigned long clarify_token;

static int __init module_token_init(void)
{
    struct arm_smccc_res res;

    if (hhee_check_enable() == HHEE_ENABLE) {
        arm_smccc_hvc(HHEE_HVC_TOKEN, 0, 0,
            0, 0, 0, 0, 0, &res);
        clarify_token = res.a1;
    }
    return 0;
}
module_init(module_token_init);

The token can then be used by the kernel-side hhee_lkm_update function, which makes a module layout's text section read-only and executable in the second stage.

static inline void hhee_lkm_update(const struct module_layout *layout)
{
    struct arm_smccc_res res;

    if (hhee_check_enable() != HHEE_ENABLE)
        return;
    arm_smccc_hvc(HHEE_LKM_UPDATE, (unsigned long)layout->base,
            layout->text_size, clarify_token, 0, 0, 0, 0, &res);

    if (res.a0)
        pr_err("service from hhee failed test.\n");
}

This function is called from module_enable_ro for both of the module's layouts.

void module_enable_ro(const struct module *mod, bool after_init)
{
    // ...

    /*
     * Note: make sure this is the last time
     * u change the page table to x or RO.
     */
    hhee_lkm_update(&mod->init_layout);
    hhee_lkm_update(&mod->core_layout);
}

module_enable_ro is called from:

  • complete_formation during initialization;
  • do_init_module after initialization.
static int complete_formation(struct module *mod, struct load_info *info)
{
    // ...
    module_enable_ro(mod, false);
    // ...
}
static noinline int do_init_module(struct module *mod)
{
    // ...
    module_enable_ro(mod, true);
    // ...
}

Both functions are called from load_module, which is called when a module is loaded.

static int load_module(struct load_info *info, const char __user *uargs,
               int flags)
{
    // ...
    /* Finally it's fully formed, ready to start executing. */
    err = complete_formation(mod, info);
    // ...
    err = do_init_module(mod);
    // ...
}
Protected Memory

The features provided by the hypervisor are used by the kernel to create protected memory sections:

  • prmem_wr: read-only memory that can be written to using dedicated functions calling into the hypervisor;
  • prmem_wr_after_init: similar to prmem_wr, but permissions are only applied after the kernel is initialized;
  • prmem_rw: read-write memory for development purposes; it should be replaced in production by prmem_wr and prmem_wr_after_init.

The prmem_wr section is registered as read-only write-mediated in the mark_wr_data_wr function.

void mark_wr_data_wr(void)
{
    unsigned long section_size;

    if (prmem_bypass())
        return;

    section_size = (unsigned long)(uintptr_t)__end_data_wr -
            (unsigned long)(uintptr_t)__start_data_wr;
    update_mapping_prot(__pa_symbol(__start_data_wr),
                (unsigned long)(uintptr_t)__start_data_wr,
                section_size, PAGE_KERNEL_RO);
    hkip_register_rowm((void *)__start_data_wr,
               ALIGN(section_size, PAGE_SIZE));
}

mark_wr_data_wr is called during the initialization of the protected memory subsystem.

void __init prmem_init(void)
{
    // ...
    mark_wr_data_wr();
    // ...
}

The prmem_wr_after_init section is registered similarly to the prmem_wr section, by the mark_wr_after_init_data_wr function. The latter is called from the previously-seen mark_rodata_ro, which runs after the kernel is initialized.

void mark_rodata_ro(void)
{
    // ...
    mark_wr_after_init_data_wr();
}

Security-critical global variables can be moved at compile-time into the prmem_wr and prmem_wr_after_init memory sections using the __wr and __wr_after_init macros, respectively. In addition, the prmem_wr section contains the control structures of protected memory pools. These pools, defined with the PRMEM_POOL macro, can have the following types:

  • ro_no_recl: read only non reclaimable;
  • wr_no_recl: write rare non reclaimable;
  • start_wr_no_recl: pre-protected write rare non reclaimable;
  • start_wr_recl: pre-protected write rare reclaimable;
  • wr_recl: write rare reclaimable;
  • ro_recl: read only reclaimable;
  • rw_recl: read write reclaimable.

The protected memory pools are combined with a runtime allocator similar to vmalloc. The pmalloc function returns memory that is initially kernel-writable, unless it is pre-protected. This memory must then be protected using the prmem_protect_addr or prmem_protect_pool functions, and the permissions applied depend on the pool's type. Finally, if it is reclaimable, the memory can be freed using the pfree function.

If greater performance is needed, the protected memory feature provides object caches similar to the SLUB allocator. Caches for a specific object type can be defined using the PRMEM_CACHE macro. Objects can then be allocated using the prmem_cache_alloc function and freed using the prmem_cache_free function.

Objects returned from either allocator, once protected, are no longer kernel-writable. To modify their fields, the kernel must use dedicated functions, such as wr_memcpy and wr_assign, that end up calling into the hypervisor.
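
To make this workflow more concrete, here is a hypothetical usage sketch. The pool and cache definitions mirror those found in Huawei's kernel sources, but the exact signatures of pmalloc, pfree, prmem_cache_alloc, prmem_cache_free and wr_memcpy are assumptions on our part.

/* Hypothetical sketch of the protected memory APIs described above. */
struct foo { unsigned long a; };

PRMEM_POOL(example_pool, wr_recl, sizeof(void *), PAGE_SIZE, PRMEM_NO_CAP);
PRMEM_CACHE(example_cache, &example_pool, sizeof(struct foo), sizeof(void *));

static void prmem_example(void)
{
    u64 new_value = 42;
    struct foo *obj;
    u64 *value;

    /* Memory from a wr_recl pool is kernel-writable until it is protected. */
    value = pmalloc(&example_pool, sizeof(*value));
    *value = 1;
    prmem_protect_addr(value);

    /* Once protected, updates must go through the write-rare helpers, which
     * end up issuing HKIP_HVC_ROWM_* hypervisor calls. */
    wr_memcpy(value, &new_value, sizeof(*value));

    /* The pool is reclaimable, so the memory can be returned to it. */
    pfree(value);

    /* Object caches provide faster allocations of fixed-size objects. */
    obj = prmem_cache_alloc(&example_cache);
    prmem_cache_free(&example_cache, obj);
}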

Credentials Protection

The tasks' credentials are protected by allocating the struct cred from a cache of pre-protected write-rare non-reclaimable memory. Because some members of this structure are written frequently, they are moved into a new dedicated struct cred_rw, allocated from a cache of read-write reclaimable memory.

PRMEM_POOL(cred_pool, start_wr_no_recl, CRED_ALIGNMENT, kB(8), CRED_POOL_SIZE);
PRMEM_CACHE(cred_cache, &cred_pool, sizeof(struct cred), CRED_ALIGNMENT);

PRMEM_POOL(cred_rw_pool, rw_recl, CRED_ALIGNMENT, kB(8), CRED_RW_POOL_SIZE);
PRMEM_CACHE(cred_rw_cache, &cred_rw_pool, sizeof(struct cred_rw),
        CRED_ALIGNMENT);

The frequently-written members of struct cred are changed into pointers to the corresponding members of struct cred_rw, and a pointer to the task is added for consistency checking.

struct cred {
    atomic_t    *usage;
// ...
    struct rcu_head *rcu;           /* RCU deletion hook */
    struct task_struct *task;
// ...
};

In addition to the frequently-written members, the struct cred_rw also contains a pointer to the read-only struct cred.

struct cred_rw {
    atomic_t usage;
    struct rcu_head rcu;
    struct cred *cred_wr;
};

The validate_task_creds function ensures the consistency of the task->cred->task loop. It is called from many places, including the current_cred macro, and the prepare_creds and commit_creds functions.

void validate_task_creds(const struct task_struct *task)
{
    struct cred *cred;
    struct cred *real_cred;

    WARN_ON(!task);
    cred = (struct cred *)rcu_dereference_protected(task->cred, 1);
    WARN_ON(!cred);
    real_cred = (struct cred *)
        rcu_dereference_protected(task->real_cred, 1);
    WARN_ON(!real_cred);
    if (likely(task != &init_task)) {
        BUG_ON(!is_wr(real_cred, sizeof(struct cred)));
        if (cred != real_cred)
            BUG_ON(!is_wr(cred, sizeof(struct cred)));
    } else {
        BUG_ON(real_cred != &init_cred);
        BUG_ON(cred != &init_cred);
    }
    BUG_ON(real_cred->task != task);
    if (cred != real_cred)
        BUG_ON(cred->task != task);
}

Similarly, the validate_cred_rw function ensures the consistency of the cred_rw->cred_wr->rcu loop. It is also called from many places, including the prepare_creds and commit_creds functions.

void validate_cred_rw(struct cred_rw *cred_rw)
{
    BUG_ON(!cred_rw->cred_wr);
    if (likely(cred_rw->cred_wr != &init_cred))
        BUG_ON(!is_rw(cred_rw, sizeof(struct cred_rw)));
    BUG_ON(cred_rw->cred_wr->rcu != &cred_rw->rcu);
}
SELinux Protection

The selinux_enabled global variable is protected by making it read-only after initialization.

#define __selinux_enabled_prot  __ro_after_init
int selinux_enabled __selinux_enabled_prot = 1;

The selinux_enforcing global variable is replaced at compile-time by a constant value.

#define selinux_enforcing 1

The ss_initialized global variable is protected by moving it into the prmem_wr section.

int ss_initialized __wr;

When loading a security policy, the SELinux data structures are allocated from a special pool of read-only reclaimable memory.

#define selinux_ro_recl  ro_recl
PRMEM_POOL(selinux_pool, selinux_ro_recl, SELINUX_POOL_ALIGNMENT,
       PAGE_SIZE, SELINUX_POOL_CAP);

This memory pool is write-protected after the loading of the security policy is complete.

int security_load_policy(void *data, size_t len)
{
    // ...
    prmem_protect_pool(&selinux_pool);
    return rc;
}

It should be noted that this doesn't prevent overwriting the AVC cache to bypass a negative decision. However, reloading the SELinux policy is not possible because of additional kernel hardening that prevents tasks whose TGID is not 1 from writing to /sys/fs/selinux/load.
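
We have not reproduced this hardening here, but a hypothetical sketch of such a check, assuming it is implemented directly in the selinuxfs write handler, could be:

/* Hypothetical sketch: only a task whose TGID is 1 (init) may load a policy
 * through /sys/fs/selinux/load. The actual check in Huawei's kernel may
 * differ. */
static ssize_t sel_write_load(struct file *file, const char __user *buf,
                  size_t count, loff_t *ppos)
{
    if (task_tgid_nr(current) != 1)
        return -EPERM;

    /* ... original policy loading path, ending in security_load_policy ... */
}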

Poweroff Command Protection

The poweroff command, previously used in a Samsung RKP bypass, is protected by moving it into the prmem_wr section.

char poweroff_cmd[POWEROFF_CMD_PATH_LEN] __wr = "/sbin/poweroff";
BPF Protection

The BPF programs are also protected by the hypervisor. struct bpf_prog and struct bpf_binary_header are allocated from a read only reclaimable pool.

#define bpf_cap round_up((CONFIG_HKIP_PROTECT_BPF_CAP * SZ_1M), PAGE_SIZE)
PRMEM_POOL(bpf_pool, ro_recl, sizeof(void *), PAGE_SIZE, bpf_cap);

The memory permissions are applied in the bpf_prog_lock_ro and bpf_jit_binary_lock_ro functions.

static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
{
    fp->locked = 1;
    prmem_protect_addr(fp);
}
static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
{
    prmem_protect_addr(hdr);
}

It should be noted that this mechanism doesn't seem to prevent calling ___bpf_prog_run to execute arbitrary bytecode. However, in our understanding, such a call would be prevented by the kernel CFI implementation.

CFI Hardening

The kernel CFI implementation is hardened by moving the cfi_check function pointer of struct module into a separate, dynamically allocated struct safe_cfi_area. This structure also contains a pointer back to the owning struct module.

struct module {
    // ...
    struct safe_cfi_area {
        cfi_check_fn cfi_check;
        struct module *owner;
    } *safe_cfi_area;
    // ...
};

The struct safe_cfi_area is allocated from a dedicated cache using a pre-protected write-rare non-reclaimable memory pool.

#define CFI_CHECK_POOL_SIZE  \
    (CONFIG_CFI_CHECK_CACHE_NUM * sizeof(struct safe_cfi_area))
#define CFI_GUARD_SIZE       (2 * PAGE_SIZE)
#define CFI_CHECK_MAX_SIZE   (CFI_CHECK_POOL_SIZE + CFI_GUARD_SIZE)

PRMEM_POOL(cfi_check_pool, start_wr_no_recl, sizeof(void *),
       CFI_CHECK_POOL_SIZE, CFI_CHECK_MAX_SIZE);
PRMEM_CACHE(cfi_check_cache, &cfi_check_pool,
        sizeof(struct safe_cfi_area), sizeof(void *));

Instead of simply dereferencing mod->cfi_check, the fetch_cfi_check_fn function is used to retrieve the cfi_check function pointer.

cfi_check_fn fetch_cfi_check_fn(struct module *mod)
{
    /*
     * In order to prevent forging the entire malicious module,
     * the verification of the module is also necessary in the future.
     */
    if (WARN(!is_cfi_valid(mod->safe_cfi_area, sizeof(struct safe_cfi_area)) ||
         (mod != mod->safe_cfi_area->owner),
         "Attempt to alter cfi_check!"))
        return 0;
    return mod->safe_cfi_area->cfi_check;
}

In addition, the CFI shadow pages global variable is moved into the prmem_wr section.

static struct cfi_shadow __rcu *cfi_shadow __wr;

And the shadow pages themselves are allocated from a write-rare reclaimable memory pool.

#define CFI_SHADOW_POOL_SIZE (SHADOW_PAGES * PAGE_SIZE)

PRMEM_POOL(cfi_shadow_pool, wr_recl, sizeof(void *),
       CFI_SHADOW_POOL_SIZE, PRMEM_NO_CAP);

The shadow pages are protected by the cfi_protect_shadow_pages function.

void cfi_protect_shadow_pages(void *addr)
{
    prmem_protect_addr(addr);
}
Privilege Escalation Detection

The hypervisor also enables the detection of privilege escalation methods.

It protects the addr_limit of tasks by keeping, for each of them, a boolean indicating whether it is USER_DS or KERNEL_DS. These booleans are stored in the hkip_addr_limit_bits array.

#define DEFINE_HKIP_BITS(name, count) \
    u8 hkip_##name[ALIGN(DIV_ROUND_UP(count, 8), PAGE_SIZE)] \
        __aligned(PAGE_SIZE)
#define DEFINE_HKIP_TASK_BITS(name) DEFINE_HKIP_BITS(name, PID_MAX_DEFAULT)

DEFINE_HKIP_TASK_BITS(addr_limit_bits);

This array is protected by the hypervisor at initialization by the hkip_critdata_init function, which calls hkip_register_bits, a wrapper around the HKIP_HVC_ROWM_REGISTER HVC.

static int __init hkip_critdata_init(void)
{
    hkip_register_bits(hkip_addr_limit_bits, sizeof (hkip_addr_limit_bits));
    // ...
}

The set_fs function, used to set the addr_limit field, calls hkip_set_fs.

static inline void set_fs(mm_segment_t fs)
{
    current_thread_info()->addr_limit = fs;
    hkip_set_fs(fs);
    // ...
}

The hkip_set_fs macro updates the corresponding bit of the hkip_addr_limit_bits array.

#define hkip_set_fs(fs) \
    hkip_set_current_bit(hkip_addr_limit_bits, (fs) == KERNEL_DS)
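
The hkip_set_current_bit helper is not listed above. A plausible sketch, assuming it uses the current task's PID as the bit position and goes through the HKIP_HVC_ROWM_SET_BIT HVC handled by hkip_hvc_rowm_set_bit (the constant name and calling convention are assumptions), is the following:

/* Hypothetical sketch: update the bit of the current task in a
 * hypervisor-mediated bit array. The write itself is performed at EL2. */
static inline void hkip_set_current_bit(u8 bits[], bool value)
{
    struct arm_smccc_res res;

    if (hhee_check_enable() != HHEE_ENABLE)
        return;

    arm_smccc_hvc(HKIP_HVC_ROWM_SET_BIT, (unsigned long)bits,
              (unsigned long)task_pid_nr(current), value, 0, 0, 0, 0, &res);
}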

Similarly, the get_fs macro, used to retrieve the addr_limit field, calls hkip_get_fs directly.

#define get_fs()    (mm_segment_t)hkip_get_fs()

And the hkip_get_fs macro checks the boolean value for the current thread contained in the hkip_addr_limit_bits array.

#define hkip_is_kernel_fs() \
    ((current_thread_info()->addr_limit == KERNEL_DS) \
    && hkip_get_current_bit(hkip_addr_limit_bits, true))
#define hkip_get_fs() \
    (hkip_is_kernel_fs() ? KERNEL_DS : USER_DS)

In addition, a new field is added to the struct pt_regs to store the thread's PID and corresponding boolean value while in the kernel. These values are saved into the structure by the kernel_entry assembly macro and restored by kernel_exit.

struct pt_regs {
    // ...
    u32 orig_addr_limit_hkip[2];
    // ...
};

The hypervisor also protects the UID and GID of tasks using two arrays of booleans: hkip_uid_root_bits which denotes if a task has the root UID, and hkip_gid_root_bits if it has the root GID.

static DEFINE_HKIP_TASK_BITS(uid_root_bits);
static DEFINE_HKIP_TASK_BITS(gid_root_bits);
static int __init hkip_critdata_init(void)
{
    // ...
    hkip_register_bits(hkip_uid_root_bits, sizeof (hkip_uid_root_bits));
    hkip_register_bits(hkip_gid_root_bits, sizeof (hkip_gid_root_bits));
    return 0;
}

The hkip_check_uid_root function uses hkip_uid_root_bits to check whether a task has escalated to the root UID, that is, whether its bit in the array indicates a non-root UID while its current credentials now hold the root UID or a non-empty capability set.

int hkip_check_uid_root(void)
{
    const struct cred *creds = NULL;

    if (hkip_get_current_bit(hkip_uid_root_bits, true)) {
        return 0;
    }

    /*
     * Note: In principles, FSUID cannot be zero if EGID is non-zero.
     * But we check it separately anyway, in case of memory corruption.
     */
    creds = (struct cred *)current_cred();/*lint !e666*/
    if (unlikely(hkip_compute_uid_root(creds) ||
            uid_eq(creds->fsuid, GLOBAL_ROOT_UID))) {
        pr_alert("UID root escalation!\n");
        force_sig(SIGKILL, current);
        return -EPERM;
    }

    return 0;
}

The check on the current credentials is performed by the hkip_compute_uid_root function.

static bool hkip_compute_uid_root(const struct cred *creds)
{
    return uid_eq(creds->uid, GLOBAL_ROOT_UID) ||
        uid_eq(creds->euid, GLOBAL_ROOT_UID) ||
        uid_eq(creds->suid, GLOBAL_ROOT_UID) ||
    /*
     * Note: FSUID can only change when EUID is zero. So a change of FSUID
     * will not affect the overall root status bit: it will remain true.
     */
        !cap_isclear(creds->cap_inheritable) ||
        !cap_isclear(creds->cap_permitted);
}

There exists a similar pair of functions, hkip_check_gid_root and hkip_compute_gid_root, for checking if a task has escalated to the root GID. In addition, the hkip_check_xid_root function performs checks on both the UID and GID.

These checks are performed in the following functions (a sketch of one such hook is given after the list):

  • in acl_permission_check, which is called on file accesses, the UID is checked if the file belongs to the root UID, and the GID is checked if it belongs to the root GID;
  • in prepare_creds, both the UID and GID are checked;
  • in copy_process, which is called from fork, both the UID and GID are checked;
  • in __cap_capable, which is called on syscalls, the UID is checked.
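
As an illustration of how these hooks integrate, here is a hypothetical sketch of the __cap_capable case, assuming the detection is simply invoked before the regular capability computation; the function signature is that of the mainline cap_capable and may differ in Huawei's kernel.

/* Hypothetical sketch: deny the syscall (and let hkip_check_uid_root kill the
 * task) if an illegitimate root UID is detected. */
static int __cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
             int cap, int audit)
{
    int ret = hkip_check_uid_root();

    if (ret)
        return ret;

    /* ... original capability computation ... */
}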
Execute Only Userland

One last feature enabled by the hypervisor is the creation of execute-only userland memory that is inaccessible from the kernel. This memory is used to store the code of shared libraries that are kept encrypted on the file system. The decryption key is stored in the HW_KEYMASTER trustlet.

The /vendor/bin/hw/vendor.huawei.hardware.sop@1.0-service system service is responsible for loading the encrypted shared libraries into user memory from disk when it is started. The loading flow, for which a user-space sketch is given after the list, is as follows:

  • the service opens the first /dev/sop_privileged device exposed by the kernel xodata subsystem;
  • it makes an SOPLOCK_SET_TAG IOCTL, that creates a control structure on the kernel side, associated with the tag given as argument;
  • it calls mmap using the file descriptor to allocate user memory of the specified size and map it into its address space;
  • it decrypts the shared library in TrustZone and stores the plaintext code in the mmap'd memory;
  • it makes an SOPLOCK_SET_XOM IOCTL, which calls hkip_register_xo to make the memory executable-only at EL0 and inaccessible at EL1.
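
Below is a hypothetical user-space sketch of this privileged loading flow. The device path and IOCTL names come from the description above, while the IOCTL argument types and error handling are assumptions.

/* Hypothetical sketch of the loading flow performed by the sop@1.0-service. */
static void sop_load_library(unsigned long tag, size_t size)
{
    int fd = open("/dev/sop_privileged", O_RDWR);

    /* Create the kernel-side control structure associated with the tag. */
    ioctl(fd, SOPLOCK_SET_TAG, &tag);

    /* Allocate user memory of the requested size and map it. */
    void *code = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* ... decrypt the shared library into `code` via the HW_KEYMASTER trustlet ... */

    /* Make the region execute-only at EL0 and inaccessible at EL1. */
    ioctl(fd, SOPLOCK_SET_XOM, NULL);
}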

When the plaintext shared library is needed in a process, it is loaded as follows:

  • the process opens the second /dev/sop_unprivileged device exposed by the kernel xodata subsystem;
  • it makes an SOPLOCK_SET_TAG IOCTL, passing the same tag as an argument, to get the size of the shared library;
  • it makes an SOPLOCK_MMAP_KERNEL IOCTL, passing it the address and size of the memory region where the plaintext code will be remapped.

Secure Monitor Calls

By design, SMCs are handled by the secure monitor. But, if the configuration permits it, the hypervisor can intercept these calls to filter them or to perform additional operations before calling the corresponding SMC. Since the TSC bit is set in HCR_EL2, SMCs are trapped by HHEE.

When an SMC instruction is used at EL1, the execution is redirected to the hypervisor, where hhee_handle_smc_instruction handles the monitor call. In Huawei's secure monitor implementation, which is based on the ARM Trusted Firmware, SMC handlers are grouped in runtime services, which are identified by Owning Entity Number, or OEN. The OEN is specified in the upper bits of the command ID sent in the first argument of the SMC. It is then used by the hypervisor in hhee_handle_smc_instruction to call the corresponding handler for the intercepted SMC.

void hhee_handle_smc_instruction(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
  // Call a handler specific to the Owning Entity Number (OEN).
  smc_handlers[(x0 >> 24) & 0x3f](x0, x1, x2, x3, regs);
}

Depending on the OEN provided by the user, different operations are performed by the hypervisor and are summarized in the table below.

OEN Action
0x00 Does some processing for the arch workaround-related SMCs
0x01 Allows the SMC to be called without further processing
0x02 Disallows the ARM_SIP_SVC_EXE_STATE_SWITCH SMC
0x03 Allows the SMC to be called without further processing
0x04 Calls handle_smc_oen_std
0x05-0x0A Allows the SMC to be called without further processing
0x0B-0x2F Returns an error to the kernel
0x30-0x3F Allows the SMC to be called without further processing

Let's have a look at handle_smc_oen_std, which calls handle_smc_psci to perform additional processing for commands with an OEN of 4 and a function ID lower than 0x20.

void handle_smc_oen_std(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
  // ...
  // A dedicated function is called to handle the function IDs of the standard service calls that belong to the PSCI
  // implementation.
  if ((x0 & 0xffff) < 0x20) {
    regs->x0 = handle_smc_psci(x0, x1, x2, x3, regs);
    // ...
  }
  // ...
}

If the SMC ID meets the conditions checked by handle_smc_psci, it calls the handler from smc_handlers_psci that corresponds to the input function ID.

uint64_t handle_smc_psci(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
  // SMC must be of type FAST.
  if (((x0 >> 31) & 1) == 0) {
    return -1;
  }
  // SMC must be 32-bit or in the white-list.
  if (((x0 >> 30) & 1) != 0 && ((0x88f45 >> x0) & 1) != 0) {
    return -1;
  }
  return smc_handlers_psci[x0 & 0xffff](x0, x1, x2, x3, regs);
}

The possible handlers called by handle_smc_psci and the operations they perform are listed below.

Function ID Action
0x00 Allows the SMC to be called without further processing
0x01 Calls handle_smc_psci_cpu_suspend
0x02 Allows the SMC to be called without further processing
0x03 Calls handle_smc_psci_cpu_on
0x04-0x09 Allows the SMC to be called without further processing
0x0A Either allows or denies the SMC to be called without further processing
0x0B Allows the SMC to be called without further processing
0x0C Calls handle_smc_psci_cpu_default_suspend
0x0D Allows the SMC to be called without further processing
0x0E Calls handle_smc_psci_system_suspend
0x0F-0x10 Either allows or denies the SMC to be called without further processing
0x11 Allows the SMC to be called without further processing
0x12 Either allows or denies the SMC to be called without further processing
0x13-0x14 Allows the SMC to be called without further processing
0x15-0x1F Returns an error to the kernel

The handle_smc_psci_cpu_on handler intercepts the CPU_ON call used to power up a CPU. It saves the kernel entrypoint given as an argument and calls boot_hyp_cpu.

uint64_t handle_smc_psci_cpu_on(uint64_t smc_id,
                                uint64_t target_cpu,
                                uint64_t entrypoint,
                                uint64_t context_id,
                                saved_regs_t* regs) {
  // Saves the entrypoint in the sys_regs structure, but it doesn't seem to be used anywhere (?).
  if (!set_sys_reg_value(0x20, entrypoint)) {
    boot_hyp_cpu(target_cpu, entrypoint, context_id);
  }
  // ...
}

The handle_smc_psci_cpu_suspend handler, which is called when a CPU wants to suspend its execution, will save the entrypoint and all registers before calling the original SMC to actually suspend the CPU. After resuming the execution, the entrypoint value is restored.

uint64_t handle_smc_psci_cpu_suspend(uint64_t smc_id,
                                     uint64_t power_state,
                                     uint64_t entrypoint,
                                     uint64_t context_id,
                                     saved_regs_t* regs) {
  // ...
  set_tvm();
  // ...
  // If the power state is valid and the requested state is POWERDOWN.
  if ((power_state & 0xfcfe0000) == 0 && (power_state & 0x10000) != 0) {
    // Set the saved ELR_EL2 to the entrypoint.
    set_sys_reg_value(0x28, entrypoint);
    // ...
    // Save all EL2 registers and call the original SMC to suspend the CPU.
    if (save_el2_registers_and_cpu_suspend(/* ... */) == 0x80000000) {
      // If we correctly returned from the hypervisor PSCI entrypoint restore the entrypoint value that came from the
      // kernel.
      return hyp_set_elr_el2_spsr_el2_sctlr_el1(entrypoint, context_id);
    }
    // ...
  }
  // ...
}

save_el2_registers_and_cpu_suspend simply calls save_el2_registers and passes as an argument the do_smc_pcsi_cpu_suspend function, which performs the original SMC.

uint64_t save_el2_registers_and_cpu_suspend(/* ... */) {
  return save_el2_registers(/* ..., */ do_smc_pcsi_cpu_suspend);
}

save_el2_registers saves the general and system registers at EL2 before calling the function given as an argument.

uint64_t save_el2_registers(/* ..., */ cb_func_t* func) {
  // Saves TPIDR_EL2 (containing the current CPU informations).
  //
  // Saves general registers: x18 to x30.
  //
  // Saves system registers: CNTHCTL_EL2, CNTVOFF_EL2, CPTR_EL2, HCR_EL2 HSTR_EL2, VTCR_EL2, VTTBR_EL2, VMPIDR_EL2,
  // VPIDR_EL2.
  //
  // Calls the function pointer given as argument.
  return func(/* ... */);
}

As explained previously, do_smc_pcsi_cpu_suspend executes the original SMC.

uint64_t do_smc_pcsi_cpu_suspend(uint64_t smc_id, uint64_t power_state /* ,... */) {
  make_smc(0xc4000001, power_state, psci_entrypoint, get_stack_pointer());
}

Finally, when execution is resumed, the psci_entrypoint given as an argument to the SMC is executed. It sets the stack pointer, sets some system registers to their global value using hyp_config_per_cpu, and restores the general and the rest of the system registers at EL2.

uint64_t psci_entrypoint(uint64_t context_id) {
  set_stack_pointer(context_id);
  hyp_config_per_cpu();
  // Restores TPIDR_EL2 (containing the current CPU info).
  tpidr_el2 = *(uint64_t*)context_id;
  set_tpidr_el2(tpidr_el2);
  set_vbar_el2(SynchronousExceptionSP0);
  // Restores system registers CNTHCTL_EL2, CNTVOFF_EL2, CPTR_EL2, HCR_EL2, HSTR_EL2, VTCR_EL2, VTTBR_EL2, VMPIDR_EL2,
  // and VPIDR_EL2.
  //
  // Restores general registers from x18 to x30.
  return 0x80000000;
}

The handlers handle_smc_psci_cpu_default_suspend and handle_smc_psci_system_suspend are similar to handle_smc_psci_cpu_suspend, but end up making the corresponding SMC.

With SMCs explained, we have covered the entire exception handling system implemented by Huawei. We are now done with the explanation of HHEE's internals and can finally wrap up this article.

Conclusion

In this post, we thoroughly detailed the internals of HHEE, Huawei's security hypervisor, and have seen how the mechanisms it implements help against kernel exploitation. As a defense-in-depth measure, it makes it much harder for an attacker with kernel read/write primitives to do anything useful.

To recap the various assurances offered by the hypervisor:

  • kernel read-only and executable memory is protected by:
    • using a second stage of address translation;
    • keeping watch over the kernel page tables to enforce the permissions at EL0 and EL1;
  • using a dedicated allocator and cache system, various kernel structures are moved into protected memory:
    • the tasks' credentials, thus preventing UID, GID, capabilities, and security context changes;
    • the SELinux global variables and structures, to avoid modifying the policy;
    • the power-off command, to prevent abusing it to easily gain code execution;
    • the BPF programs, to prevent any changes that would lead to code execution;
    • the CFI check function and shadow pages, to harden the CFI implementation;
  • privilege escalation, if achieved somehow, would be detected by:
    • two read-only arrays that keep track of each task's root UID and GID status and are checked on file accesses and system calls;
    • an array keeping track of each task's addr_limit field, which is checked whenever it is used.

We have found HHEE to have a well-thought-out architecture and a robust implementation, thus providing effective security assurances to the kernel. During our evaluation, we found a single vulnerability that would allow compromising the security hypervisor. However, the kernel mitigations make it difficult, if not impossible, to exploit it. More details about this vulnerability are available in the corresponding advisory.

We know that this blog post can be a lot to digest, but hopefully it can be used in the future as a reference to bootstrap further research on Huawei's hypervisor or even another OEM's implementation. Feel free to reach out to us if you've spotted any mistake; we will happily update this blog post.

References