Privileged Docker containers—do you really need them?
November 5, 2020
This week, I fell down a rabbit hole while doing some testing with Podman, trying to work out why running a certain container in a rootless configuration required the --privileged flag. Quite rightly, my colleague Eric Smalling asked why it should require the flag.
Ultimately, --privileged is shorthand for granting All The Things, and whilst you may think this doesn’t matter that much when running rootless, it does somewhat break the paradigms of both least privilege and zero trust. Even when running rootless, it’s always good practice to understand exactly what your container needs, and only give it those minimum permissions.
Although setting this flag on a process running rootless doesn’t actually give the process any more privileges than the user has, I was intrigued about which capabilities this process actually required, and what had led to that configuration recommendation. In the great spirit of exploration, I donned my spelunking gear and set off into the heart of darkness.
Let's dig in
Firstly, let’s look at what running containers rootless means. This is all down to the magic of user namespaces in the Linux kernel, which allow unprivileged users to create new user namespaces. When a user creates and enters a new user namespace, they become root in the context of that namespace and gain most of the privileges required to spawn a functioning container. Inside a user namespace, root obviously doesn’t have the same privileges as the system root, but it does mean that we can run containers without requiring raised privileges, and so we significantly reduce the potential attack surface for any vulnerabilities.
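To make the "root in the namespace" idea a little more concrete, here is a minimal sketch of how IDs inside a user namespace translate to IDs on the host. The kernel exposes the mapping in /proc/<pid>/uid_map as lines of three numbers: the first ID inside the namespace, the first ID on the host, and the length of the range. The function name ns_to_host_uid and the example mapping (host UID 1000, subordinate range starting at 100000) are assumptions for illustration, not values taken from my test machine.

```shell
#!/bin/sh
# Translate a UID inside a user namespace to the UID it maps to on the host.
# Reads uid_map-style lines on stdin: <first-id-in-ns> <first-id-on-host> <count>
# Exits non-zero if the UID is not covered by any range.
ns_to_host_uid() {
    awk -v uid="$1" '
        $1 <= uid && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Hypothetical mapping for a rootless user (assumed values):
# namespace root maps to host UID 1000, UIDs 1..65536 map to 100000 onwards.
example_map='0 1000 1
1 100000 65536'

printf '%s\n' "$example_map" | ns_to_host_uid 0   # root inside the namespace
printf '%s\n' "$example_map" | ns_to_host_uid 1   # first subordinate UID
```

With this mapping, root inside the namespace is simply the unprivileged user on the host, which is why a process that escapes the container gains nothing beyond what that user already had.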
The system I am using is a virtual machine with a minimal CentOS 8 install, running in VirtualBox on my MacBook. I first installed podman using dnf:
dnf install podman
Then I created a directory in my user’s home directory and generated a self-signed certificate there:
mkdir certs
openssl req -newkey rsa:4096 -nodes -sha256 -keyout certs/domain.key -x509 -days 365 -out certs/domain.crt
Now I want to use that certificate to run a local Docker registry for testing purposes, and here is where our journey really starts. All the documentation I’d come across for doing this suggested using the --privileged flag:
[matt@localhost ~]$ podman run -d --name registry -p 5000:5000 -v "$(pwd)"/certs:/certs --restart=always -e REGISTRY_HTTP_ADDR=0.0.0.0:5000 -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/domain.crt -e REGISTRY_HTTP_TLS_KEY=/certs/domain.key --privileged registry:2
As we can see from the command above, we’re running the registry image tagged 2, bind-mounting the certs directory from my current directory as /certs in the container, passing in some environment variables to configure the registry, and happily adding the --privileged flag, which tells podman to run this container in privileged mode.
To use the registry with podman, we then need to add an entry into /etc/containers/registries.conf:
[registries.insecure]
registries = ['localhost:5000']
Running in privileged mode works fine, so the first thing I wanted to see was what happens if we run it unprivileged. The result was fairly simple: the container spawns, then crashes and respawns, with its logs showing that it can’t access the certificate:
[matt@localhost log]$ podman logs bd323f90c60b
time="2020-10-20T18:24:27.806128235Z" level=fatal msg="open /certs/domain.crt: permission denied"
At this point, I assumed this was related to Linux capabilities, as one of the major things that the --privileged flag does is to allow the container to access all the capabilities provided by the kernel. We can see that using podman when running this container in privileged mode:
[matt@localhost ~]$ podman top -l capeff
EFFECTIVE CAPS
full
From here I went to the Linux man pages where I could look in detail at what each of the capabilities allows processes to do. Again we can use podman to check what capabilities our running unprivileged container has:
[matt@localhost ~]$ podman top b7cea04eb70e
USER PID PPID %CPU ELAPSED TTY TIME COMMAND
root 1 0 0.000 1.400417458s ? 0s registry serve /etc/docker/registry/config.yml
[matt@localhost ~]$ podman top -l capeff
EFFECTIVE CAPS
AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT
Comparing this list to the Linux man page, the effective capabilities in unprivileged mode should be enough to allow the container to read files. In fact, this list probably gives us more capabilities than this particular container needs, but let’s come back to that later.
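As a small illustration of that comparison, the sketch below checks whether a given capability appears in a comma-separated list like the one podman prints. The helper name has_cap is my own. The capability to look for here is DAC_OVERRIDE, which the capabilities man page describes as bypassing file read, write, and execute permission checks, so its presence suggests ordinary permission bits are unlikely to be the cause of the failure.

```shell
#!/bin/sh
# Return success if capability $2 appears in the comma-separated list $1.
# Wrapping both sides in commas avoids partial matches (SET vs SETUID).
has_cap() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Effective capabilities of the unprivileged container, copied from the output above.
caps="AUDIT_WRITE,CHOWN,DAC_OVERRIDE,FOWNER,FSETID,KILL,MKNOD,NET_BIND_SERVICE,NET_RAW,SETFCAP,SETGID,SETPCAP,SETUID,SYS_CHROOT"

if has_cap "$caps" DAC_OVERRIDE; then
    echo "DAC_OVERRIDE is in the effective set"
else
    echo "DAC_OVERRIDE is missing"
fi
```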
The next port of call is the audit.log on the host, and sure enough:
type=AVC msg=audit(1603218166.554:4674): avc: denied { read } for pid=223435 comm="registry" name="domain.crt" dev="dm-0" ino=6582402 scontext=system_u:system_r:container_t:s0:c127,c779 tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=0
What this entry tells us is that SELinux has blocked the read call for the domain.crt file. On CentOS and RHEL systems, SELinux is configured in Enforcing mode by default, and we can check that by using the sestatus tool:
[matt@localhost ~]$ sudo sestatus
[sudo] password for matt:
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Memory protection checking: actual (secure)
Max kernel policy version: 31
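Going back to the AVC denial for a moment, the interesting parts are the source context (the process), the target context (the file), and the object class. Here is a rough sketch of pulling those out with sed. The helper name avc_field is my own, and it assumes the simple space-separated key=value layout seen in the entry above, so it only works for keys that follow a space.

```shell
#!/bin/sh
# Print the value of key $1 (for example scontext) from the audit record in $2.
# Assumes space-separated key=value pairs, as in the entry shown earlier.
avc_field() {
    printf '%s\n' "$2" | sed -n "s/.* $1=\([^ ]*\).*/\1/p"
}

# The denial from the audit log above.
avc='type=AVC msg=audit(1603218166.554:4674): avc: denied { read } for pid=223435 comm="registry" name="domain.crt" dev="dm-0" ino=6582402 scontext=system_u:system_r:container_t:s0:c127,c779 tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=0'

echo "source: $(avc_field scontext "$avc")"
echo "target: $(avc_field tcontext "$avc")"
echo "class:  $(avc_field tclass "$avc")"
```

Read this way, the denial says that a process labeled container_t tried to read a file labeled user_home_t.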
SELinux labels files and directories to manage access to them in Enforcing mode, and this labeling is also reflected in the behavior of container engines. They launch container processes with the container_t label and label the container’s own files container_file_t. SELinux enforces that these processes can only interact with files labeled this way, denying access by default to files outside of the container. When we run with the --privileged flag, labels are disabled and the container runs with the label that the container engine was started with. We can see this by looking at our containers using podman. Here’s a privileged container:
[matt@localhost ~]$ podman top -l label
LABEL
unconfined_u:system_r:container_runtime_t:s0
And here’s one running unprivileged:
[matt@localhost ~]$ podman top -l label
LABEL
system_u:system_r:container_t:s0:c23,c603
When we look at our certs directory in the host filesystem, we can see it has the label user_home_t:
[matt@localhost ~]$ ls -lZ certs
total 8
-rw-rw-r--. 1 matt matt unconfined_u:object_r:user_home_t:s0 1944 Oct 20 17:54 domain.crt
-rw-------. 1 matt matt unconfined_u:object_r:user_home_t:s0 3272 Oct 20 17:53 domain.key
What this means in practice is that in unprivileged mode those files are not accessible to the container, even if they are bind-mounted into it.
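An SELinux context has the form user:role:type:level, and the type is the part that matters here. The sketch below extracts it and checks it against container_file_t. The helper names selinux_type and has_container_file_type are my own, and the check is deliberately simplified: it mirrors the rule of thumb described above and ignores both the MCS categories in the level field and the other types the real policy may allow.

```shell
#!/bin/sh
# Print the type (third field) of an SELinux context: user:role:type:level
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Succeed only if the context in $1 carries the container_file_t type.
# Simplified check: MCS categories and other permitted types are ignored.
has_container_file_type() {
    [ "$(selinux_type "$1")" = "container_file_t" ]
}

label='unconfined_u:object_r:user_home_t:s0'   # from the ls -lZ output above

if has_container_file_type "$label"; then
    echo "labeled for container use"
else
    echo "type $(selinux_type "$label") is not container_file_t"
fi
```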
So, rather than running our container with --privileged, we have a couple of different options to fix this. Firstly, we can disable labels entirely by using --security-opt label=disable on our podman command line. This is obviously not ideal from a security perspective, so both podman and Docker have a mechanism to relabel mounts: privately, by appending :Z to the volume, or, if the mount is shared between containers, by appending :z.
So to fix our container, we can just run it with:
podman run -d --name registry -p 5000:5000 -v "$(pwd)"/certs:/certs:Z --restart=always -e REGISTRY_HTTP_ADDR=0.0.0.0:5000 -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/domain.crt -e REGISTRY_HTTP_TLS_KEY=/certs/domain.key registry:2
Finally, to return to capabilities, it’s never a bad idea to run containers with the absolute minimum of capabilities enabled. For this registry container, in my testing configuration, it’s actually possible to run with all capabilities dropped:
podman run -d --name registry -p 5000:5000 -v "$(pwd)"/certs:/certs:Z --restart=always -e REGISTRY_HTTP_ADDR=0.0.0.0:5000 -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/domain.crt -e REGISTRY_HTTP_TLS_KEY=/certs/domain.key --cap-drop=all registry:2
There are various ways to work out which capabilities your container needs, for example by looking in the audit log at which capabilities SELinux blocks during operation, then adding the required ones back in using the --cap-add argument. The rule of least privilege is always the best option! The moral of this story is that you shouldn’t throw the baby out with the bathwater.
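Once you know which capabilities are needed, the drop-everything-then-add-back pattern can be sketched as a small helper that builds the flags. The function name cap_flags is my own, and NET_BIND_SERVICE in the usage comment is only a hypothetical example; as noted above, the registry in my test setup needed none.

```shell
#!/bin/sh
# Build podman/docker capability flags: drop everything, then add back
# only the capabilities listed in $1 (comma-separated, may be empty).
cap_flags() {
    flags="--cap-drop=all"
    old_ifs=$IFS
    IFS=,
    for cap in $1; do
        flags="$flags --cap-add=$cap"
    done
    IFS=$old_ifs
    printf '%s\n' "$flags"
}

cap_flags ""                        # nothing added back
cap_flags "NET_BIND_SERVICE,CHOWN"  # hypothetical requirements

# Hypothetical usage (left unquoted on purpose so the flags split into words):
#   podman run -d $(cap_flags "NET_BIND_SERVICE") registry:2
```

Starting from an empty list and adding capabilities only when something demonstrably fails keeps the final set as small as the workload allows.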
To sum it up
Flagging containers as --privileged, even in user namespaces, is not good practice, and breaks the paradigms of least privilege and zero trust. Find out what your container actually needs before running it, using the output of tools like SELinux to audit what capabilities and permissions your container image is asking for, and set just those when you run the container in production. You’ll often find that the requirements are actually pretty minimal.