Ken Muse

Best Practices for Deploying GitHub ARC

This is a post in the series Deploying ARC.

Happy 2024! Hope that it brings the very best for you and your family!

Today, I want to dive a bit deeper into some best practices for deploying ARC. There are a few common recommendations I make to teams that are setting up ARC. These help to minimize issues and make the deployment easier to maintain. That doesn’t mean these are the only things to consider, of course.

If you have a lot of experience with Kubernetes, you know these are just the start of the discussions.

Use the right versions

It should go without saying, but the controllers and runner scale sets for ARC should both be running the same version. This ensures the various components will work together correctly. For example, if you’re running gha-runner-scale-set-controller:0.8.1, then your runner scale sets should all be using gha-runner-scale-set:0.8.1. Try to keep the versions up-to-date to have access to the latest features along with security and bug fixes.
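With Helm, one way to keep the versions aligned is to drive both installs from a single version variable. This is a minimal sketch, not a complete install guide: the chart paths follow GitHub's quickstart, while the release names and namespaces (`arc`, `arc-systems`, `arc-runners`) are just common conventions.

```shell
#!/bin/sh
# Pin the controller and the runner scale set to one shared chart version.
ARC_VERSION="0.8.1"   # example version; use the latest release
CHARTS="oci://ghcr.io/actions/actions-runner-controller-charts"

install_arc() {
  # Both installs read ARC_VERSION, so the components cannot drift apart.
  helm upgrade --install arc \
    "${CHARTS}/gha-runner-scale-set-controller" \
    --namespace arc-systems --create-namespace \
    --version "${ARC_VERSION}"

  helm upgrade --install arc-runner-set \
    "${CHARTS}/gha-runner-scale-set" \
    --namespace arc-runners --create-namespace \
    --version "${ARC_VERSION}"
}

# Run install_arc against a cluster where helm is configured:
# install_arc
```

Upgrading then becomes a matter of bumping `ARC_VERSION` in one place and re-running both commands.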

The images used by the runners should always use the latest version of the GitHub runner. Each new runner release also publishes a new image. You can use that image as a base to ensure you’re always including the right version. If you’re building images from scratch, be sure to keep the runner on the latest version or the runner might not be allowed to receive jobs.
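A sketch of that approach, using GitHub's published runner image as the base. The tag and the added packages here are illustrative; pin to the runner release that matches your environment.

```shell
# Write a Dockerfile that layers your tooling on top of the official runner image.
cat > Dockerfile <<'EOF'
# Base image published with each runner release (tag is an example)
FROM ghcr.io/actions/actions-runner:2.312.0

# Add your own build tooling on top of the runner layer
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl git jq \
    && rm -rf /var/lib/apt/lists/*
USER runner
EOF

# Build and push to your registry, e.g.:
# docker build -t registry.example.com/ci/arc-runner:2.312.0 .
```

Rebuilding on each runner release keeps the image current without maintaining the runner installation yourself.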

If you’re using ARC version 0.23.7 or lower, then you’re using the legacy version of ARC and may want to consider moving to the GitHub-supported version. This will simplify the experience, improve the performance, and make it easier to maintain.

Consider a dedicated cluster

I’ve already explained some of the technical reasons to consider a dedicated cluster. This isn’t a hard requirement, but it certainly makes the environment easier to troubleshoot and configure.

More advanced Kubernetes administrators may choose to use an existing cluster. As my colleague Blaize Stewart points out, network policies and dedicated node pools can help give you better isolation. It doesn’t eliminate all of the issues, but it can help mitigate some of the biggest. If you’re going that route, I strongly discourage having production resources on the same cluster. Build workloads can quickly consume a lot of resources, especially networking. The last thing you want to worry about is port exhaustion or losing production pods due to resource-intensive builds. And you never want to run untrusted code on the same cluster as production systems!
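If you do share a cluster, a sketch of that node-pool isolation might look like the following. The pool label, taint name, and image tag are hypothetical; `template.spec` is the runner pod spec from the scale set chart's values.yaml, and overriding it means restating the runner container as well.

```shell
# Taint the dedicated CI nodes once (label/taint names are examples):
#   kubectl taint nodes -l pool=ci-runners dedicated=arc:NoSchedule

# Then pin the runner pods to that pool in the scale set's values:
cat > runner-values.yaml <<'EOF'
template:
  spec:
    nodeSelector:
      pool: ci-runners
    tolerations:
      - key: dedicated
        operator: Equal
        value: arc
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
EOF

# Apply with: helm upgrade --install arc-runner-set ... -f runner-values.yaml
```

The taint keeps other workloads off the build nodes, and the toleration plus node selector keep the runners on them.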

Don’t default the security

Tools like Open Policy Agent (OPA) Gatekeeper and Kyverno can do a lot to help you secure your cluster. By rejecting resource requests that don’t meet certain criteria, you can help minimize vulnerabilities. Similarly, admission webhooks can be used to modify resource requests. These are often used to implement default configurations, validate resource requests, or enforce required configurations for resources.

Unfortunately, all of these can interfere with ARC. ARC needs to be able to manage resources dynamically. This is even more true when you configure it to support containers, services, or containerized Actions. When you do that, the interactions can get a bit more complex.

Normally, ARC is responsible for creating most of the resources. The templates it generates are well-documented in both GitHub Docs and the Helm chart’s values.yaml. This part is user-configurable. Once you enable Kubernetes mode or Docker-in-Docker, the runner itself gets involved. In these modes, the runner is responsible for creating additional resources using a feature called Runner Container Hooks. These hooks do not use the configurations defined in the Helm chart. Instead, the runners programmatically create the resources using TypeScript. As a result, these containers may not match the company’s policy definitions. While it’s possible to customize that code, it’s not usually where companies want to put their developer resources.
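For reference, the user-configurable part looks like this in the scale set's values (a sketch assuming Kubernetes mode; the storage class name is hypothetical). Note that the pods created later by the container hooks are not governed by this file.

```shell
# Values fragment enabling Kubernetes mode for a runner scale set
cat > k8s-mode-values.yaml <<'EOF'
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "local-path"   # hypothetical storage class
    resources:
      requests:
        storage: 1Gi
EOF
```

Anything beyond this, such as the spec of the hook-created job pods, comes from the hook code rather than the chart.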

While these security standards may work for managing and securing production and development resources, they may also prevent ARC from working. Of course, all of this is avoided inherently if you use a dedicated cluster for ARC. It allows you to customize the security needs to the use case.

This does not mean you should avoid these tools or native features like RBAC/ABAC, network policies, and AppArmor! In fact, these are important tools for securing the cluster. I recommend starting with a vanilla environment, then tightening the security from there. This allows you to assess each change and determine whether or not it interferes with ARC’s ability to run. This is the fastest way to both get ARC running and secure the cluster.
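As one illustration of incremental tightening, assuming Kyverno (the policy name and namespaces are examples): a rule can be enforced cluster-wide while excluding the ARC namespaces until you have verified it doesn't break the runners.

```shell
# A Kyverno policy requiring resource limits, scoped away from ARC's namespaces
cat > require-limits.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds: ["Pod"]
      exclude:
        any:
          - resources:
              # Exempt ARC until the policy is validated against it
              namespaces: ["arc-systems", "arc-runners"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
EOF
# kubectl apply -f require-limits.yaml
```

Once you've confirmed ARC behaves under the rule, the exclusion can be narrowed or removed.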

Pre-baked security requirements are the leading cause of problems for teams setting up ARC. In nearly every case, we can trace their problems to a “standard security policy”. They had security restrictions applied to the cluster based on general guidance, but that guidance did not consider ARC, how it operates, or build practices. While they might be good practices for production workloads, they prevent ARC from being able to create and manage runners. In extreme cases, I’ve seen policies that prevented the runner pods from being created, blocked the listeners, or stopped the required roles or bindings from being created.

A variation of this problem is trying to start with Kata containers or Firecracker because “it’s how we secure containers”. These tools add a lot of complexity, and they add an additional layer for troubleshooting problems. My colleague Natalie covers this topic (and more) on her blog. Definitely worth the read.

It’s always more challenging if you need to troubleshoot ARC, the security policies, and the cluster settings simultaneously. You can save yourself a lot of time if you apply your security standards after installing and configuring ARC. Once you have ARC working, it’s much less challenging to tighten the security controls.

Prefer local storage

If you want to get the most out of ARC, keep the builds local. Build processes are I/O intensive. They read and write lots of small files. Because of this high file churn, local storage will provide the best performance and the least overhead.

Generally speaking, build processes do not work well with networked shares; protocols like NFS and CIFS are not optimal for high-IOPS workloads with lots of small files, and they can slow the process substantially. This leads many teams to consider other storage providers, like OpenEBS.

While container-attached storage can offer benefits for production workloads, it frequently works against ARC and build systems. The main purpose of these systems is to provide highly reliable storage with replication and resiliency. Build processes on stateless runners do not need this. If a storage resource fails, there’s typically no critical data loss, although you may need to rerun the build process.

In short, when it comes to storage, simpler is better.
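As a minimal sketch of "simple": the scale set chart's default work volume is already a node-local `emptyDir`, and spelling it out in the values makes that choice explicit (the image tag is illustrative; `template.spec` is the runner pod spec, so the runner container must be restated).

```shell
# Values fragment pinning the runner's work directory to node-local storage
cat > local-work-values.yaml <<'EOF'
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    volumes:
      - name: work
        emptyDir: {}   # node-local scratch; discarded with the pod
EOF
```

No replication, no network round-trips: the files live and die with the runner pod.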

Some teams have a concern that using local drives could expose data on the node to other processes. Exploiting this usually requires one of two things: node access or a container escape. If the node has been compromised, then all aspects of the machine are already insecure (and you have much bigger issues). Similarly, a container escape is generally unlikely to occur with build systems if your dependency chain is secure. In the event of an issue, having an isolated cluster limits the potential blast radius.

With a good pull request process, malicious internal actors cannot exploit the build process through direct coding; their changes would be reviewed and rejected by one or more team members. If you have more extreme concerns, remember that stateless ephemeral builds have another advantage. Nodes can be tainted and replaced at any time.
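Rotating a node out is a short, routine operation when everything on it is ephemeral. A sketch, with the node name passed as a parameter:

```shell
#!/bin/sh
# Rotate a suspect node out of service; ephemeral runners make this cheap.
rotate_node() {
  node="$1"
  kubectl cordon "$node"            # stop new runner pods landing on it
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  kubectl delete node "$node"       # let the autoscaler/provider replace it
}

# Usage against a real cluster:
# rotate_node my-suspect-node
```

In-flight jobs on the drained node simply fail and can be re-run on a fresh runner.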

Continuing the journey

This gives you a starting point for thinking about your deployments. They aren’t the only considerations, and these aren’t my only recommendations. Next week, we’ll discuss a few more considerations that I’ve found can create deployments that are more manageable and stable.