In the
previous post, you built container-style isolation from scratch using unshare and control groups (cgroups). Every operation that isolated a process required root, because the kernel deliberately enforces that boundary. That isolation is part of what fools a container into thinking it is running on its own machine.
This post picks up where that one left off. You’ll learn more about why the kernel demands those privileges and how Docker’s architecture depends on them.
Why creating namespaces requires root
Every unshare command in the previous post needed sudo (or to run as the root user). That isn’t a limitation of the unshare tool – it’s the kernel enforcing a security policy. Try the same thing without root:
```
$ unshare --pid --fork --mount-proc bash -c 'ps aux'
unshare: unshare failed: Operation not permitted
```

The kernel refuses with "Operation not permitted." The mechanism behind this is Linux capabilities. Rather than a simple on/off root flag, the kernel splits elevated permissions into over 40 individual capabilities. For example:
- `CAP_NET_ADMIN` – allows configuring network interfaces and routing.
- `CAP_SYS_CHROOT` – allows calling `chroot` to change the root filesystem.
- `CAP_SYS_ADMIN` – the broadest capability; it covers mounting filesystems, creating most namespace types, and dozens of other privileged operations.
Creating PID, mount, network, and UTS namespaces all require one or more of these system capabilities. This is a deliberate security decision. These namespaces can fundamentally change what a process perceives about the system. A process with its own mount namespace could hide files or mount malicious filesystems. A process in its own PID namespace could conceal itself from monitoring tools.
Restricting these sensitive operations to privileged accounts prevents attackers from using namespaces to evade detection or manipulate shared resources. This is also why privileged containers are considered a security risk: if an attacker compromises a container running with elevated privileges, they can use those privileges to escape its namespace and gain access to the entire host system and any processes running on it.
The one exception: user namespaces
There is one namespace we haven’t really discussed that works differently: the user namespace. The big difference is that you can create one without any elevated privileges. Try this:
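The original command isn’t preserved here, but output of this shape comes from something like the following (the echo labels are my guess):

```shell
# Show identity outside, then inside, a freshly created user namespace.
# With no UID mapping configured, the kernel reports the overflow UID
# (65534, "nobody") inside the new namespace.
echo "Outside: $(id)"
unshare --user sh -c 'echo "Inside user namespace: $(id)"'
```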
```
Outside: uid=1001(runner) gid=1001(runner) groups=1001(runner),4(adm),100(users),118(docker),999(systemd-journal)
Inside user namespace: uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
```

Inside the user namespace, the process has become user ID 65534, a reserved ID the kernel uses for “nobody” – an account with no meaningful group memberships or privileges. The kernel created the namespace but deliberately mapped you away from any useful identity. This behavior is the foundation of rootless containers: you can enter a user namespace, but you arrive as nobody unless user ID (UID) mappings are explicitly configured.
A mapping tells the kernel to translate a range of UIDs in the namespace to a different range on the host. For example, it might map UID 0 in the namespace to UID 1001 on the host. Inside the namespace, the code appears to be running as root, but on the host it is actually running as a normal user. This lets you make root-level changes inside the container without holding that level of privilege on the host, and it is the basis of “rootless” runtimes.
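You can watch a mapping get set up with unshare’s `--map-root-user` flag (a sketch, assuming util-linux’s `unshare`), which writes the `uid_map` file for you:

```shell
# Create a user namespace and map UID 0 inside it to our own UID outside.
# id -u reports 0 (root) inside, but uid_map shows the translation:
# column 1 is the UID inside the namespace, column 2 the UID on the host,
# column 3 the length of the mapped range.
unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
```

Every privileged-looking action taken by that “root” shell is translated back to your unprivileged host UID.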
How Docker uses these primitives
To keep root privileges away from ordinary users, Docker runs as two separate programs that work together:
- The Docker daemon (`dockerd`) runs continuously as root in the background. It manages containers, images, volumes, and networks – work that requires `CAP_SYS_ADMIN` or similar capabilities. Internally, the daemon delegates the low-level namespace creation to helper programs called `containerd` and `runc`, but the full root privileges required are the same.
- The Docker command-line interface (CLI) (`docker`) runs as your user. When you type a `docker` command, the CLI translates it into a request and sends that request to the daemon.
The Docker CLI sends these commands over a Unix domain socket. If you’re not familiar with those, a Unix socket is essentially a special file on disk that two programs use to communicate with each other – similar to a network connection, but only between programs on the same machine.
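You can speak to that socket yourself. If curl is available and the daemon is running at the default socket path, the daemon’s `/_ping` health-check endpoint answers over it:

```shell
# Speak HTTP to the Docker daemon over its Unix socket instead of TCP.
# /_ping is the daemon's health-check endpoint; it replies with "OK".
curl --silent --unix-socket /var/run/docker.sock http://localhost/_ping
echo
```

This is exactly what the `docker` CLI does under the hood, just with a richer API.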
When the daemon receives a “run this container” request, it performs the same steps you saw with unshare in the previous post: it creates PID, mount, network, and UTS namespaces, sets up the container’s filesystem using an overlay of image layers (thin, stacked filesystem snapshots), configures the network, and then starts the container process inside all those namespaces.
You can confirm this directly by comparing a live Docker container with the DIY namespace you built in the previous post:
```
=== Host PID 1 ===
1 root systemd
=== Inside Docker container ===
PID USER COMMAND
1 root ps
=== Container hostname ===
53275435cf0e
```

Inside the Docker container, PID 1 is the ps command itself (not systemd), and the hostname is a generated container ID. The output formatting looks slightly different because Alpine uses BusyBox – a lightweight reimplementation of common Linux tools – rather than the GNU versions on the host. This is the same pattern the manual unshare experiment produced. Docker automated the whole thing using its privileged access.
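The comparison above can be reproduced with commands along these lines (the exact invocation from the post is an assumption):

```shell
# Host side: what is PID 1?
echo "=== Host PID 1 ==="
ps -o pid,user,comm -p 1

# Container side: the process list and hostname, as seen from inside
# a minimal Alpine container.
echo "=== Inside Docker container ==="
docker run --rm alpine ps
echo "=== Container hostname ==="
docker run --rm alpine hostname
```

Note that the two `docker run` invocations start separate containers, so the hostname shown belongs to the second one.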
Each namespace has a unique kernel identifier, visible via the /proc filesystem. Comparing those identifiers between the host and a container confirms that they truly live in separate namespaces:
```
Host PID namespace: pid:[4026531836]
Container PID namespace: pid:[4026532251]
```

Different numbers mean different namespaces. The host process and the container process each have their own isolated world.
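These identifiers come from the `ns` entries under `/proc`. You can read your own shell’s PID-namespace ID directly; reading a container process’s requires its host-side PID (and usually root):

```shell
# Print the PID namespace this shell lives in. Every process exposes its
# namespace memberships as symlinks under /proc/<pid>/ns/.
readlink /proc/$$/ns/pid

# For a container process, substitute its host-side PID, e.g.:
#   sudo readlink /proc/<container-pid>/ns/pid
```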
The Docker socket: where privilege lives in practice
For your ordinary user account to run docker commands, it needs a way to talk to the root-owned daemon. That way is the Unix socket at /var/run/docker.sock:
```
$ ls -la /var/run/docker.sock
srw-rw---- 1 root docker 0 Apr 4 18:06 /var/run/docker.sock
```

The `s` at the start of the mode string means this is a socket file (not a regular file). It is owned by root and the group docker, with permissions that allow only the owner (root) and members of the docker group to read from and write to it (shown as 660 in Linux permission notation). Anyone else gets nothing.
When you connect to this socket, you’re talking directly to the daemon, and the daemon will execute any valid Docker command you send without further authentication. Membership in the docker group is therefore functionally equivalent to full root access on the machine. A user in the docker group can, for example, start a container that bind-mounts the host’s root filesystem (mapping the host’s entire file tree into the container so the container can read and write it) and modify any file on the system. Adding someone to the docker group is a significant trust decision, not a minor permission grant.
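To make the risk concrete, here is a sketch of that bind-mount escape: a docker-group member reading a root-only host file with no sudo involved.

```shell
# Bind-mount the host's entire filesystem at /host inside the container,
# then read a file that only root can normally read. The container process
# runs as root, and through the bind mount that is root on the host too.
docker run --rm -v /:/host alpine cat /host/etc/shadow
```

The same trick works for writing: swap `cat` for any command that edits files under `/host`.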
With Kubernetes, the same principle applies to the kubelet process, which also runs with elevated privileges. Kubernetes uses Role-Based Access Control (RBAC) to manage who can talk to the kubelet and what they can do, but the underlying principle is the same: if you can reach the privileged interface that creates sandboxed processes, you effectively have the power of root.
Why building images also requires root
Running a container isn’t the only Docker operation that needs root. Building an image does too, for exactly the same reasons. Each RUN instruction in a Dockerfile runs inside its own temporary container. Save this Dockerfile as /tmp/Dockerfile.test:
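The Dockerfile itself isn’t reproduced here, but the build output below implies it looked roughly like this (the base image and exact contents are assumptions reconstructed from the step labels):

```dockerfile
# Reconstructed sketch -- four steps matching [1/4]..[4/4] in the build output.
FROM alpine
RUN echo "Build step PID: $$" && ps -eo pid,comm && echo "---"
RUN whoami
RUN hostname
```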
Next, build it. The final /tmp/ argument tells Docker where to find files referenced in the Dockerfile (the build context). Since this test Dockerfile doesn’t reference any local files, any directory works here:
```
$ docker build --no-cache -f /tmp/Dockerfile.test -t test-build /tmp/
#5 [2/4] RUN echo "Build step PID: $$" && ps -eo pid,comm && echo "---"
#5 0.168 Build step PID: 1
#5 0.169 PID COMMAND
#5 0.169 1 sh
#5 0.169 6 ps
#5 0.169 ---
#5 DONE 0.2s
#6 [3/4] RUN whoami
#6 0.196 root
#6 DONE 0.2s
#7 [4/4] RUN hostname
#7 0.201 buildkitsandbox
#7 DONE 0.2s
```

Each RUN step runs as PID 1 in its own PID namespace and gets its own UTS namespace (hence the hostname buildkitsandbox). After each step completes, the daemon snapshots the changes to the container’s filesystem as a new image layer. Layering works using a kernel feature called an overlay filesystem (overlayfs), which stacks read-only layers on top of each other and presents them as a single unified file tree. Creating an overlay mount also requires CAP_SYS_ADMIN.
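A minimal sketch of such an overlay mount, assuming root (or sudo) and a kernel with overlayfs; the directory names are arbitrary:

```shell
# Build a two-layer overlay by hand. "lower" plays the role of a read-only
# image layer; "upper" collects writes; "merged" is the unified view.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "from the image layer" > /tmp/ovl/lower/file.txt

# The mount itself is the privileged step (CAP_SYS_ADMIN).
sudo mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged

cat /tmp/ovl/merged/file.txt   # the lower layer shows through
sudo umount /tmp/ovl/merged
```

Any file written under `merged` lands in `upper` – which is exactly the “filesystem diff” the daemon captures as a new layer after each RUN step.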
The build process is therefore: create a container, run the instruction, capture the filesystem diff, unmount the container, and store the changed files as a .tar.gz “layer”. This repeats for every RUN. All of this is orchestrated by the Docker daemon.
If you haven’t read my post on layers, it walks through how this works in more detail, including how the process makes it easy to capture exactly which files changed.
BuildKit: Docker’s modern build engine
The build output in the example above shows a #5 [2/4] style format – that’s BuildKit at work. BuildKit has been Docker’s default build engine since Docker 23.0, replacing the older “classic” builder. You can tell it’s active because build steps are shown with hash prefixes and can run in parallel.
BuildKit separates build concerns into a dedicated daemon called buildkitd. When you run docker build, the Docker daemon hands the build off to a buildkitd process, which handles parsing the Dockerfile, scheduling the build steps, managing the layer cache, and producing the final image. For most users this is invisible – it just makes builds faster and smarter.
Under the hood, BuildKit represents each RUN instruction as an “execution operation” (ExecOp). For each ExecOp, BuildKit:
- Prepares a snapshot of the current filesystem state (the layers built so far).
- Creates a new container around it, using fresh PID, mount, UTS, and network namespaces, just like a regular `docker run`.
- Executes the `RUN` command inside that container.
- Captures the filesystem diff as a new layer.
- Discards the container.
When you later run the image as a container, all of the individual layers are stacked together, the namespaces are created, and the entry point command is called. This is why running containers and building containers with these tools both require root privileges – they both rely on the same kernel features to create namespaces and manage filesystems.
To completely avoid needing this level of privilege at build time, you have to eliminate the need for containers, namespaces, and cgroups. The only way to do that is to build the image by organizing and packaging the files yourself or by using a tool like Buildroot to do that for you.
The big picture
Every path through Docker – running containers, building images, or using BuildKit – leads back to the same requirement for elevated privileges. The daemon needs CAP_SYS_ADMIN to create namespaces and overlay mounts, the socket at /var/run/docker.sock grants anyone in the docker group the equivalent of root access, and every RUN instruction in a Dockerfile creates a fresh set of privileged namespaces behind the scenes.
But what if you could do all of this without the daemon running as root? In the next post, you’ll see how rootless systems shift the privilege boundary – and why that shift introduces its own set of security trade-offs that are important to understand before you rely on it. You’ll also learn why “rootless” doesn’t mean privilege-free.
