Ken Muse

Why You Should Use Dedicated Clusters For GitHub ARC

Working with GitHub ARC (the Actions Runner Controller) requires a fairly good understanding of Kubernetes to be implemented successfully. Many people are only exposed to the deployment aspects of Kubernetes. They become fluent in the YAML required to create resources (or at least in how to utilize Helm charts), but don’t have the opportunity to build an understanding of how to configure Kubernetes itself. As a result, they start with a disadvantage when trying to implement ARC. This post will try to help. We’ll quickly look at one of the most important configuration recommendations for running ARC: a dedicated cluster.

One of my first recommendations for getting the best performance and security with ARC is to consider a dedicated Kubernetes cluster. This may seem counterintuitive. After all, the goal with Kubernetes is often to minimize the number of clusters and maximize the density of the running containers. While this is a good rule of thumb, there are exceptions to this practice. To understand why, we need to understand Kubernetes and containers.

Containers and Kubernetes

Containers inherently have limitations on their ability to control hardware. They provide process-level isolation and control. In fact, containers are basically just processes which apply a set of restrictions to the code being run. By reducing isolation (compared to a virtual machine), containers can be used as a lightweight mechanism for subdividing the resources on a server. There are limits to what containers can control, however. For example, containers can’t account for or monitor their impact on I/O bus bandwidth, CPU cache, or port exhaustion. These hardware-level limitations still exist, but they fall outside the container’s control. Kubernetes builds on the features provided by containers, so it is similarly limited in its ability to manage these resources when orchestrating workloads.

This is part of why Kubernetes works best when the loads are somewhat predictable. Understanding resource utilization allows for better planning, enabling jobs to scale with demand. Managed properly, the other hardware resources remain within acceptable limits. This is why systems that need full control of the hardware resources — such as database servers — are often a poor fit for Kubernetes. These systems expect to be able to utilize the full resources of the system; they don’t expect to be constrained by the container or Kubernetes.
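To make this concrete, the resources Kubernetes can actually constrain are the ones exposed in a pod spec. A minimal sketch (the pod and image names are hypothetical) shows what is and isn’t visible to the scheduler:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-build-pod       # hypothetical name
spec:
  containers:
    - name: build
      image: ubuntu:22.04
      resources:
        requests:               # what the scheduler uses for placement
          cpu: "2"
          memory: 4Gi
        limits:                 # enforced ceilings for this container
          cpu: "4"
          memory: 8Gi
# Note: there are no fields here for disk bandwidth, CPU cache,
# or ephemeral port usage -- those hardware-level resources are
# outside what the container runtime (and Kubernetes) can meter.
```

CPU and memory are the only guarantees the scheduler reasons about; everything else on the node is shared on a first-come, first-served basis.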

Kubernetes manages containers, so it’s actually just managing and orchestrating processes on a shared OS kernel. This can create potential security issues. One of the most concerning is container escape exploits. In these situations, a particular container process manages to make changes which affect other processes or the OS as a whole. Containers have a soft security boundary which can be exploited under the right conditions. They are not natively isolated through virtualized hardware. This is why Kubernetes’ security domain is considered to be the entire cluster.

Some teams wrongly assume that they can eliminate the risk by using specific nodes to isolate sensitive processes, but some risks still remain. For example, processes can inadvertently get scheduled to other, less secure nodes. This is more likely when scaling or replacing nodes. A bigger potential vulnerability comes from containers which are allowed to interact with the Kubernetes APIs. These containers can alter other containers or system settings, allowing them to change the cluster configuration more broadly.
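Node-level isolation typically relies on taints, tolerations, and node selectors — and every piece has to be configured correctly. A sketch (labels and names are hypothetical) illustrates how fragile this is:

```yaml
# The node must be both tainted and labeled, e.g.:
#   kubectl taint nodes secure-node-1 workload=sensitive:NoSchedule
#   kubectl label nodes secure-node-1 workload=sensitive
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-workload      # hypothetical name
spec:
  nodeSelector:
    workload: sensitive         # pins the pod to labeled nodes
  tolerations:
    - key: workload             # allows scheduling onto the tainted node
      operator: Equal
      value: sensitive
      effect: NoSchedule
```

Note that a toleration only *permits* scheduling onto the tainted node; without the matching `nodeSelector`, the pod can still land on any untainted node. Forget one of these pieces — or replace a node without re-applying the taint and label — and the isolation silently disappears.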

To minimize these issues, RBAC, AppArmor, and seccomp profiles are used to restrict a given container’s access to resources. These add layers of protection, but they don’t change the fundamental system behavior. This is why it’s recommended not to colocate trusted and untrusted processes/containers in the same cluster. Any configuration mistake can expose the cluster’s resources to the untrusted code.
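These protections are opt-in settings on each pod, which is part of the problem: they only help if every deployment applies them. A hardened pod sketch (the image name is a hypothetical placeholder for an image that runs as a non-root user):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restricted-pod          # hypothetical name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault      # filter syscalls with the runtime's default profile
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical non-root image
      securityContext:
        allowPrivilegeEscalation: false     # block setuid-style escalation
        runAsNonRoot: true
        capabilities:
          drop: ["ALL"]                     # remove all Linux capabilities
```

Each field here narrows what the process can do, but none of them changes the fact that the container still shares the node’s kernel with every other pod scheduled there.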

There are ongoing efforts to create more secure boundaries by integrating Kubernetes with virtual machines and micro-VMs. This doesn’t eliminate the security issues entirely; VM escape exploits and hyperjacking are possible. That said, it adds a much stronger protection boundary. It also adds substantially more complexity, a topic my colleague Natalie has explored.

ARC processes

The loads for build processes are highly variable and can require significant hardware resources. Most build systems are optimized to use as many CPU cores and as much memory as needed to parallelize the build and maximize performance. They expect to have access to the full resources of the system and will experience failures when resource-constrained. In fact, even the GitHub runners can stop responding if they don’t have enough resources available for key activities. This doesn’t mean that build containers can’t (or shouldn’t) have constraints. It just means that there’s no single rule of thumb for the resources needed by a particular build.

Builds also tend to create large numbers of child processes and utilize substantial amounts of I/O (disk and network), depending on the process. This can saturate hardware and OS resources quickly. This can be even more challenging when networked storage is used. Hopefully you can see why these would be a poor fit for colocating with other development services or workloads. And you definitely wouldn’t want these running side-by-side with production workloads! Because ARC orchestrates build runners which execute untrusted code with highly variable resource requirements, it’s best to isolate them from other workloads. It’s worth mentioning that you can generally constrain resources within a given set of runners (or runner group), allowing administrators some control over the resources used by the builds.
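With ARC, those per-runner constraints are typically set in the runner scale set’s Helm values by customizing the pod template. A hedged sketch (the organization URL and secret name are hypothetical, and field names can vary between chart versions — check the chart’s own values.yaml):

```yaml
# values.yaml for a gha-runner-scale-set Helm release
githubConfigUrl: "https://github.com/my-org"   # hypothetical org
githubConfigSecret: "github-app-secret"        # hypothetical secret name
minRunners: 0
maxRunners: 10                                 # caps concurrent runner pods
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            memory: 8Gi        # jobs exceeding this are OOM-killed
```

The `maxRunners` cap bounds how many builds run at once, while the pod-level requests and limits bound what each build can consume — but as discussed above, neither can meter disk or network bandwidth.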

ARC security

Developers typically need highly elevated privileges on their machines in order to build and debug code. The requirements for build environments are very similar. They often rely on elevated privileges to support parts of the build process or the build engine itself. A build is also inherently untrusted code, especially if you have third-party dependencies. A bad build definition or a malicious third-party dependency could easily over-consume resources or even compromise the system. All of these issues mean that build processes are inherently unsafe to run with other workloads on the same system.

This is one of my favorite reasons to rely on GitHub-hosted runners. Builds occur on an isolated network in a hardened VM which has limited access to other resources. The untrusted code is completely sandboxed.

Another consideration is that GitHub Actions can utilize service containers, container-based Actions, and containerized builds. When any of these will be used, you have to enable one of two features in ARC: Docker-in-Docker (DinD) or Kubernetes mode. Both of these create container resources dynamically. DinD requires a privileged container running Docker (giving it direct access to the host system). Kubernetes mode creates resources using the Kubernetes APIs, which means the runner pod must have permissions to control parts of the cluster. To support containers in builds, ARC requires one of these privileged approaches. This is one more reason to isolate ARC from other workloads.
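In the runner scale set’s Helm values, this choice is a single setting. A sketch of both options (field names are from recent chart versions and may vary — verify against the chart’s values.yaml):

```yaml
# gha-runner-scale-set values.yaml: pick one container mode.

# Option 1: a privileged Docker-in-Docker sidecar on each runner pod
containerMode:
  type: "dind"

# Option 2: job containers created via the Kubernetes APIs, which
# requires a work volume shared between the runner and job pods
# containerMode:
#   type: "kubernetes"
#   kubernetesModeWorkVolumeClaim:
#     accessModes: ["ReadWriteOnce"]
#     storageClassName: "standard"     # hypothetical storage class
#     resources:
#       requests:
#         storage: 1Gi
```

Either way, the chart provisions elevated privileges on your behalf: a privileged container in the first case, or a service account with pod-creation rights in the second.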

It’s also worth noting that by design, containerized Actions must be able to run as the default user (typically root). This is important so that the runner can create appropriate mounts for the functionality. It also ensures that the generated files have appropriate permissions within each container.

Not using containers doesn’t completely eliminate the challenge. ARC itself requires cluster-level permissions in order to dynamically create, monitor, and clean up pods for the runners. While not the same security challenge, it is a consideration for other workloads on the same cluster. The elevated permissions required are more than most containers need, and systems should follow a principle of least privilege. By default, ARC uses naming conventions for the resources, including service accounts, roles, and assignments. You don’t want other deployments to be able to take advantage of known names to gain elevated permissions.

There’s one more important security consideration. Builds may need access to internal servers, systems, or resources. That means that ARC may need to be deployed on a Kubernetes cluster that has access to protected, internal resources. Other workloads shouldn’t have the same needs. Keeping with a principle of least privilege, they should not be deployed into an environment that grants network access to those resources.
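If workloads do end up sharing a cluster, about the best you can do is fence off network access with policy. A minimal sketch (the namespace name is hypothetical, and enforcement depends on the cluster’s CNI plugin actually supporting network policies):

```yaml
# Deny all outbound traffic from pods in a non-ARC namespace,
# so they cannot reach the internal resources the cluster can route to.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: other-workloads    # hypothetical namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: ["Egress"]       # no egress rules listed = deny all egress
```

This limits the blast radius, but it depends on the policy being present, correct, and enforced — a dedicated cluster removes the need to rely on it.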

A world apart

As you can see, there are quite a few reasons why I recommend isolating ARC in its own cluster. For development teams with source code in multiple security boundaries, multiple clusters may be needed. ARC has very specific needs that often do not align with the security or resource postures recommended for other typical Kubernetes use cases. It’s generally a bad idea to try to treat ARC as just another workload. In fact, one of the leading causes of problems in ARC deployments is trying to enforce security postures that are incompatible with ARC.

Ultimately, isolation is the best way to ensure that ARC can meet the needs of the developers without compromising the security of the rest of the organization. It provides ephemeral environments that are isolated from other company resources. It also helps ensure that the ARC deployment can be scaled to meet the build (and deployment) needs of the developers.