In the previous post, you traced the path used by the Docker daemon through the socket, image builds, and BuildKit. All of these steps required root. The security concern is straightforward: the daemon runs as root and the socket at /var/run/docker.sock grants full access to anyone in the docker group. If the daemon has a vulnerability or an attacker escapes a container, they suddenly have root privileges on the host.
This post explores the alternative: rootless Docker. You’ll see how it works under the hood, what it genuinely improves, and why this solution introduces its own set of security trade-offs.
How rootless Docker works
Rootless Docker addresses the root-daemon risk by running dockerd itself as an unprivileged user. Each user who wants Docker runs their own personal daemon process, started under their own account. The socket is also per-user, typically at $XDG_RUNTIME_DIR/docker.sock (a per-user directory such as /run/user/1000/docker.sock) rather than /var/run/docker.sock.
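As a small illustration of the per-user socket location, here is a minimal Python sketch that derives the path described above. The function names are hypothetical, and the fallback to /run/user/<uid> when XDG_RUNTIME_DIR is unset is an assumption based on the example path in this post, not something every system guarantees.

```python
import os
from pathlib import PurePosixPath


def rootless_socket_path(env=None, uid=None):
    """Return the conventional location of a rootless daemon's socket."""
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid
    # Prefer XDG_RUNTIME_DIR; the /run/user/<uid> fallback is an assumption.
    runtime_dir = env.get("XDG_RUNTIME_DIR") or f"/run/user/{uid}"
    return PurePosixPath(runtime_dir) / "docker.sock"


def docker_host_url(env=None, uid=None):
    """Build a unix:// URL of the kind the DOCKER_HOST variable accepts."""
    return f"unix://{rootless_socket_path(env, uid)}"
```

Calling docker_host_url() with no arguments reads the current environment. Exporting the result as DOCKER_HOST is one way to point the Docker CLI at the per-user daemon instead of /var/run/docker.sock.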
That daemon still needs to create PID (Process ID), mount, and network namespaces – operations that require CAP_SYS_ADMIN, as you saw in the first post. How does an unprivileged daemon pull this off? It takes advantage of a special namespace you saw in the previous post.
The user namespace is the one namespace type you can create without root. You previously saw that creating a user namespace without proper UID mapping lands you as “nobody.” Rootless Docker uses a helper called RootlessKit to solve this. RootlessKit bootstraps a user namespace and uses helpers to configure UID (User ID) mappings so the process inside appears to be root. These helper programs – newuidmap and newgidmap – are installed with the setuid bit, which makes them run with the permissions of the file’s owner – typically root – instead of those of the calling user. With the mappings in place, the daemon can create all the nested namespaces it needs, because the kernel permits nested namespace creation when a user namespace is the outermost boundary.
This architecture meaningfully limits the damage that a compromise can cause:
- If the daemon is exploited, the attacker gets the privileges of the unprivileged user – not root on the host.
- If a container escape occurs, the escaped process lands inside the user namespace as UID 0. This UID is mapped to a non-root UID on the host. That means it can’t install kernel modules, modify system files, or access other users’ data.
- The daemon and its containers are mostly isolated from other processes on the host.
For these protections to hold, the unprivileged user must not be able to alter the mappings, since that would allow it to create processes that are mapped to root on the host and gain those privileges.
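Because the mappings are written by the setuid helpers rather than by the user directly, it can be useful to see how those helpers are installed on a given host. The following is a rough Python sketch using only the standard library. The function names are hypothetical, and it only detects the setuid bit; a distribution that grants the helpers privileges another way (for example through file capabilities) would show up here as "not setuid".

```python
import os
import shutil
import stat


def runs_as_file_owner(st_mode):
    """True if the setuid bit is set, so the program runs with its owner's UID."""
    return bool(st_mode & stat.S_ISUID)


def describe_helper(name):
    """Summarize how a helper such as newuidmap is installed on this host."""
    path = shutil.which(name)
    if path is None:
        return f"{name}: not found on PATH"
    info = os.stat(path)
    kind = "setuid" if runs_as_file_owner(info.st_mode) else "not setuid"
    return f"{name}: {path} owner_uid={info.st_uid} {kind}"
```

Calling describe_helper("newuidmap") and describe_helper("newgidmap") on a host prepared for rootless mode would typically report an owner UID of 0, though the exact result depends on how the distribution packages these tools.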
How rootless BuildKit works
BuildKit, Docker’s modern build engine, can also operate in rootless mode using the same approach. RootlessKit bootstraps a user namespace, sets up UID mappings, and then starts buildkitd inside that namespace. From there, buildkitd can create the child PID, mount, and UTS (hostname) namespaces it needs for each build step.
A key part of this is UID mapping. When rootless BuildKit runs a RUN step, that step appears to execute as UID 0 (root) inside its container. But that UID 0 is actually mapped to the unprivileged host user’s UID through the user namespace. For example, if your host UID is 1000, the kernel maps UID 0 in the namespace to 1000 on the host, and UIDs 1 through 65535 map to a range defined in /etc/subuid. Files the build creates appear owned by root inside the container, but are actually owned by a non-root user on the host.
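The translation described above can be sketched in a few lines of Python. The kernel exposes a process's mapping in /proc/<pid>/uid_map, where each line holds three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The helper names and the sample values below (host UID 1000 and a subordinate range starting at 100000) are illustrative assumptions, not values read from a real system.

```python
def parse_uid_map(text):
    """Parse uid_map-style lines: '<inside_start> <outside_start> <length>'."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append((inside, outside, length))
    return ranges


def to_host_uid(ranges, container_uid):
    """Translate a UID inside the namespace to a host UID, or None if unmapped."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# Hypothetical mapping: root in the namespace is host UID 1000, and
# UIDs 1 through 65535 come from a subordinate range starting at 100000.
EXAMPLE_MAP = """\
0 1000 1
1 100000 65535
"""
```

With this example mapping, to_host_uid(parse_uid_map(EXAMPLE_MAP), 0) returns 1000, which is why a file that looks root-owned inside the build is owned by the ordinary user on the host. A UID outside every range has no host equivalent, so the function returns None.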
This means that the build step isn’t actually running as root on the host, so it also isn’t able to access the host system’s privileged APIs. It can access many of the same interfaces to create namespaces, but it does so through the user namespace, which limits the scope of its access.
The limitations of rootless mode
Rootless mode comes with practical limitations. These apply to both rootless Docker and rootless BuildKit:
- The host kernel must permit unprivileged user namespace creation (controlled by kernel.unprivileged_userns_clone, or AppArmor’s userns restriction on Ubuntu 24.04 and later). Many newer distributions often have this enabled by default to support browsers and containers. In 2025, Kubernetes enabled userns by default as well. Some distributions or hardened configurations may restrict this.
- When running BuildKit inside a container (for example in a Kubernetes pod), the container needs seccomp=unconfined.
- On Ubuntu 24.04 and later, AppArmor restricts unprivileged user namespace creation by default; this needs to be relaxed with apparmor=unconfined or by setting kernel.apparmor_restrict_unprivileged_userns=0.
- Rootless mode may not be able to use the kernel’s native overlayfs on most kernels because mounting overlayfs requires CAP_SYS_ADMIN. Instead, it generally falls back to fuse-overlayfs. This is a userspace reimplementation that works through FUSE (Filesystem in Userspace). It may also utilize a slower native snapshot mode with a small performance overhead. Linux kernels 5.11 and later allow unprivileged overlayfs in a user namespace, removing this limitation.
- Networking cannot use Docker’s standard bridge driver, because configuring isolated network namespaces (creating virtual Ethernet pairs, setting up bridges) requires root-level network capabilities. Instead, rootless Docker uses tools like slirp4netns and pasta to provide isolated networking. They simulate a full network stack entirely in userspace, avoiding the need for kernel-level network configuration but having some performance tradeoffs.
- Containers cannot run with --privileged, because a daemon running inside the user namespace doesn’t have actual host capabilities to grant.
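Several of the limitations above depend on kernel version and sysctl settings, so a quick check can save some guesswork. The following Python sketch is an assumption-laden illustration: the function names are hypothetical, the sysctls named in this post do not exist on every kernel (so a missing file is reported as None rather than treated as an error), and the 5.11 threshold is simply the version mentioned above.

```python
import re
from pathlib import Path


def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def allows_unprivileged_overlayfs(release):
    """True if the kernel is at least 5.11, the version cited in this post."""
    return kernel_version(release) >= (5, 11)


def read_sysctl(name, proc_sys="/proc/sys"):
    """Return a sysctl's value as a string, or None if this kernel lacks it."""
    path = Path(proc_sys) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

On a live host, platform.release() from the standard library supplies the release string, and read_sysctl("kernel.apparmor_restrict_unprivileged_userns") reports whether Ubuntu's AppArmor restriction is switched on. A version check like this is only a hint, since distributions sometimes backport or disable features independently of the version number.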
The practical implication is that “rootless” doesn’t mean “zero privileges.” It means the privileges required are reduced, moved inside a user namespace, and no longer permanently owned by a system-wide root daemon.
It’s important to understand, however, that the user namespace itself has security implications.
Why user namespaces concern kernel developers
When a process enters a user namespace, it gains a full set of capabilities within that namespace – including CAP_SYS_ADMIN and CAP_NET_ADMIN. These capabilities allow code to reach kernel interfaces that were historically only accessible to trusted root processes. The networking configuration API, the mount system, and rule processing for iptables (the kernel’s firewall rule engine) all become reachable by any unprivileged user who creates a user namespace. In other words, that user has gained the privileges needed to call those kernel interfaces.
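To make the capability gain concrete, the sketch below decodes the effective capability mask that the kernel reports on the CapEff line of /proc/<pid>/status. The bit positions for CAP_NET_ADMIN (12) and CAP_SYS_ADMIN (21) come from the kernel's capability header; the function names are hypothetical and the code is an illustration rather than a complete capability decoder.

```python
CAP_NET_ADMIN = 12  # bit positions as defined in linux/capability.h
CAP_SYS_ADMIN = 21


def effective_caps(status_text):
    """Return the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask, one bit per capability.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, cap):
    """True if the capability with the given bit position is set in the mask."""
    return bool((mask >> cap) & 1)
```

Feeding it the contents of /proc/self/status from an ordinary shell will normally show neither capability. Running the same check from a process that has entered a new user namespace as mapped root (for example, one started with unshare --user --map-root-user) should show both, even though the process holds no extra privileges on the host.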
The problem isn’t user namespaces themselves. In a perfect world, the namespaces and groups would still provide limits and restrictions on what could be done with these privileges. The problem is that much of the kernel was written under the assumption that only a fully trusted root process would ever call these interfaces.
When developers write code that only trusted callers can reach, they sometimes apply less rigorous input validation or fewer safety checks. The caller is expected to know what it’s doing if it has that level of access. User namespaces remove that gate, letting untrusted callers access that same code. Any latent vulnerability in those code paths – a buffer overflow (writing past the end of allocated memory) in the networking stack, a use-after-free (accessing memory the program has already released) in mount handling, a logic error in iptables processing – suddenly becomes exploitable by anyone on the system. In addition, changes at this level can still affect the entire system. To be fair, kernel developers are continuously reviewing and hardening this code to avoid these kinds of issues.
Andy Lutomirski, a Linux kernel developer who worked extensively on user namespaces, captured the concern well when discussing the ability of unprivileged users to program iptables through user namespaces: “I’ll eat my hat if there are no privilege escalations in there.”
That isn’t a theoretical worry. In 2025, Qualys disclosed three methods to bypass Ubuntu’s restrictions on unprivileged namespaces, allowing attackers to gain full administrative capabilities. There were also several CVEs raised in 2025 and 2026 that relied on user namespace exploits to compromise the system. This means that user namespaces still require some care to properly secure them.
What unconfined really means
Earlier you saw that rootless mode (especially when running BuildKit inside a container) needs seccomp=unconfined and apparmor=unconfined. It’s worth understanding what those flags actually do, because they are not minor configuration tweaks.
To make sense of these flags, you need to understand two security layers that Docker applies to every container by default. Both act as filters between a running process and the Linux kernel.
The first layer involves system calls. Every time a program needs the operating system to do something – open a file, allocate memory, create a process – it makes a system call. These are the only way a program can talk to the kernel. Seccomp is a kernel feature that lets you define a filter controlling which system calls a process is allowed to make. Think of it as a bouncer at a door: the filter has a list of allowed requests, and anything not on the list gets rejected before the kernel even processes it.
The second layer is AppArmor, a Linux security module that restricts what a specific program can do based on a predefined profile. While seccomp filters which system calls are permitted, AppArmor controls what resources – files, network access, capabilities – a program can access. A program might be allowed to make the open system call (seccomp lets it through), but AppArmor can still deny access to a specific file or directory. The two systems complement each other: seccomp limits how a process talks to the kernel, and AppArmor limits what it can reach.
With that context, here is what the unconfined flags do:
- seccomp=unconfined disables the system call filter entirely. Docker’s default seccomp profile blocks over 40 system calls that are considered dangerous – including unshare, mount, and others that rootless mode needs. Disabling seccomp means the process can invoke any system call the kernel supports, including the ones the profile was specifically designed to block.
- apparmor=unconfined removes all AppArmor restrictions from the process. On Ubuntu systems where AppArmor is the mechanism that restricts unprivileged user namespace creation, this flag directly removes the protection designed to limit the attack surface that user namespaces expose.
In other words, running rootless builds inside a container can require you to disable some important security mechanisms. You’re trading the risk of a root-running daemon for the risk of an unrestricted unprivileged process that can reach deeply into kernel interfaces without the usual safeguards. It’s limited to the permissions of the account making the calls, but it still gets a significant level of access.
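One way to observe whether these two layers are active for a process is to read what the kernel reports about it. The Python sketch below parses the Seccomp line of /proc/<pid>/status (0 for disabled, 1 for strict mode, 2 for filter mode) and the AppArmor label found in /proc/<pid>/attr/current. The function names are hypothetical, and the label format shown assumes a host where AppArmor is the active security module.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Name the seccomp mode from the 'Seccomp:' line of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES.get(int(line.split()[1]), "unknown")
    return "unknown"


def apparmor_confinement(attr_text):
    """Split a label such as 'docker-default (enforce)' into (profile, mode)."""
    label = attr_text.strip()
    if label == "unconfined":
        return ("unconfined", None)
    if label.endswith(")") and " (" in label:
        profile, mode = label[:-1].rsplit(" (", 1)
        return (profile, mode)
    # Unrecognized format: return the raw label without a mode.
    return (label, None)
```

Inside a container started with Docker's defaults, you would expect a result along the lines of "filter" and a confined profile. With seccomp=unconfined and apparmor=unconfined, the same checks should come back as "disabled" and "unconfined", which is a quick way to confirm what a CI job is actually running with.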
What this means for you
None of this means rootless Docker is a bad idea. It is genuinely better than running a root daemon in many scenarios, because a successful container escape lands the attacker as an unprivileged user rather than as root. That’s an important difference.
At the same time, you should not treat rootless mode as a complete security solution. It trades one form of elevated access (a root-running daemon) for another (an unprivileged process with broad, root-like access to the same privileged kernel interfaces). The risks are different, but they are still real. The privileges involved are still significant, and each approach has tradeoffs.
As general security suggestions:
- Keep your kernel patched. User namespace exploits often depend on vulnerabilities in the kernel subsystems they expose. Up-to-date kernels close those holes.
- Follow the principle of least privilege – give each process only the minimum permissions it actually needs to do its job. If you don’t need user namespaces on a system, consider restricting them. If you do need them, be mindful that they are not a foolproof security strategy. They are one part of a solution.
- Understand what seccomp=unconfined and apparmor=unconfined actually disable. If your threat model (the set of threats you’re designing your defenses around) involves untrusted code running inside containers – for CI/CD, for example – removing those filters weakens your defenses. At the same time, a fully privileged environment essentially lifts these restrictions as well.
- Don’t assume that “rootless” means “safe to run untrusted workloads.” Rootless mode can reduce the blast radius (the scope of damage a security compromise can cause) of an attack, but it does not eliminate the need for trust and careful configuration. It also doesn’t prevent other types of attacks such as privilege escalation (where an attacker gains higher permissions than they started with) or lateral movement (where an attacker moves from one compromised system to others on the same network).
GitHub ARC
This post wouldn’t be complete without mentioning GitHub’s Actions Runner Controller (ARC), which lets you run scalable CI/CD agents on Kubernetes. Many teams that want to build images with the Docker CLI try to avoid Docker-in-Docker (DinD) because it uses privileged containers. You can see why this is ultimately a losing battle: if they want to use the Dockerfile approach to building containers, they need privilege from somewhere. They have to grant access – rootless or privileged – to APIs that can impact all of the containers on the same node (or in extreme cases, across the cluster). The only way to avoid this is to build without a containerization solution.
The same is true of dynamically creating containers as part of a build process (for example, to run jobs within). If you use Kubernetes mode, the privilege comes from granting the runner broad access to the Kubernetes API server – a RESTful endpoint for accessing privileged services across the cluster. This allows your jobs to create new containers on nodes outside of the job-runner pod. If you use DinD, you create a privileged Docker daemon on a node, granting it access to create and manage the container processes on the node with the runner pod. It’s scoped access, but used improperly, it could expose the node (or the cluster) to exploitation.
Hopefully now you have more clarity as to why this is necessary and can make more informed decisions about the implications!
It’s all about the privilege
Across this series, you’ve traced the full chain from kernel primitives to Docker’s architecture to the security boundaries of rootless mode. Rootless Docker and rootless BuildKit genuinely reduce risk by wrapping the daemon inside a user namespace, so a compromise lands as an unprivileged user rather than root.
But the trade-offs are real. User namespaces expand the kernel’s attack surface by exposing privileged interfaces to unprivileged users, and running rootless builds inside containers often requires disabling seccomp and AppArmor protections. No single configuration eliminates the need for careful, informed judgment about the privileges your containers require.
