Ken Muse

Top 5 Things To Know About ARC

Actions Runner Controller (ARC) is a powerful way to manage and scale ephemeral, self-hosted GitHub runners. To be successful with it, however, you need to understand five key things. They are often overlooked, but they are critical to ensuring that you have a successful experience with ARC.

1. Expect to be a Kubernetes expert

One of the most important aspects of the modern version of ARC is that it was designed with the Unix philosophy: “Write programs that do one thing and do it well.” Beyond the scaling decisions, it delegates everything else to the Kubernetes platform. That makes it incredibly versatile, but it also incorporates a key assumption: the team responsible for the Kubernetes cluster has the deep expertise required to maintain, troubleshoot, and manage the cluster. That core expectation is often overlooked, and it’s critical to using ARC.

Most books and videos on Kubernetes will teach you how to install components, run Helm charts, or change settings. That’s the easy part. Kubernetes starts off with a generalized configuration, but to get the most out of it (or to scale to hundreds or thousands of nodes), you have to know how to “tune” the environment. That means understanding what the cluster is doing, then adjusting the settings to optimize it for your workloads. Unless you have a single CI/CD workload that never changes, this is a continuously shifting target that requires ongoing care. There is no single setting, node size, or configuration that will work for all processes. That’s why there is no “best practices” guide to answer your questions about deploying or scaling the system.

The majority of cluster scaling issues can be traced to resource contention, excessive load on the API Server (or etcd), or scheduling problems. You want someone who is comfortable finding and resolving these issues. Downtime is expensive. Sometimes the fix may involve creating more clusters. Other times, it will be lower-level configurations and settings. Knowing what to change and how to change it is a critical skill.

There is one more reason this is important. Most guidance you will find covers using Kubernetes for hosting web applications or public-facing services, which often have less variable load characteristics. GitHub Runner pods are not the same thing. They have very different performance and security characteristics. Attempting to use the same settings for both approaches will likely lead to performance and scalability issues. In fact, ARC should be set up on a vanilla cluster without any security policies, network policies, admission controllers, or other restrictions. From there, you can add the necessary security gradually to ensure it doesn't impact what you're doing with ARC.
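When you do begin layering security back in, add one narrowly scoped control at a time and confirm the runners still scale before adding the next. As an illustrative sketch (the namespace name is a placeholder, not something ARC requires), a first incremental step might be a default-deny ingress policy on the runner namespace:

```yaml
# Example of a single incremental control: deny inbound traffic to runner
# pods. "arc-runners" is a placeholder namespace name.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-ingress-to-runners
  namespace: arc-runners
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
    - Ingress            # runner pods rarely need inbound connections
```

Egress restrictions are a separate (and riskier) step, since runners need outbound access to GitHub and to whatever your jobs download.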

2. GitHub-hosted runners may be cheaper (and will always have a higher SLA)

Adopting ARC because you think it will be cheaper? Hoping to have a higher uptime than GitHub? In most cases, you will be very disappointed. I covered how to understand the SLA of ARC in a previous post. The short version: because ARC uses the GitHub control plane, it is mathematically impossible to have a higher SLA than it provides. It is, however, quite easy to have a lower SLA due to the additional points of failure created by hosting your own cluster. If you're not running pairs of highly available clusters with two or more nodes, you are almost certainly going to see a lower SLA.

In terms of cost, teams often erroneously think of ARC as “free”. It's not. You are paying for compute, log storage, and the expertise of someone who can maintain the cluster. Enterprise customers also start with 50,000 minutes of runner time per month. All of the GitHub-hosted runners are dedicated, ephemeral virtual machines, so they have a lot more power than a container. They also have the security benefit of being isolated from internal systems. Runners start at $0.005 per minute, so the costs are surprisingly low.

To put this in perspective: let's assume you set up ARC on an EKS cluster ($78/month) with two nodes ($400/month each). That's $878 per month for compute, ignoring the costs of storage, NAT gateways, and other services. That's the equivalent of 175,600 minutes on a dedicated Linux machine, or almost 4 runner-months of time. Even if the node cost were half as much, you would still have the equivalent of more than two months of runner time. That's in addition to the 50,000 free minutes provided each month to enterprise accounts.

A modest ARC cluster will also log multiple GB of data per day. Ingesting 10GB per day – 300GB per month – could add another $147 (or nearly 30K minutes). That doesn't include the costs to query those logs or any metrics. Now assume you can find a salaried Kubernetes expert at $85/hr who only needs to spend 2 hours per week maintaining and optimizing the clusters (which is very low, especially in the first few months). That just added another 136,000 minutes.

In short, a single two-node cluster with modest logging could easily cost $1,700 per month, or the equivalent of 341K minutes (7.8 runner-months). That’s a lot of minutes, and you still aren’t highly available! To be clear, that doesn’t mean there aren’t situations where ARC is the right choice. It also doesn’t mean there aren’t situations where ARC is the less expensive choice (assuming you have the right skill sets). There are absolutely times to consider using self-hosted runners. It just means that you need to be aware of the costs and the trade-offs. There’s a reason that even some of the largest users of ARC will still use GitHub-hosted runners for some workloads.
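The arithmetic above is easy to sanity-check. A small sketch, using the illustrative figures from this post (not quoted prices):

```python
# Illustrative ARC cost comparison. All dollar amounts are the example
# figures from this post, not quoted cloud or GitHub prices.

RUNNER_RATE = 0.005                  # $/minute, GitHub-hosted Linux runner

eks_control_plane = 78               # $/month, managed control plane
nodes = 2 * 400                      # $/month, two worker nodes
compute = eks_control_plane + nodes  # $878/month

logging = 147                        # $/month to ingest ~300 GB of logs
expert = 85 * 2 * 4                  # $/month, ~2 hrs/week of expertise

total = compute + logging + expert   # $1,705/month

def as_runner_minutes(dollars: float) -> int:
    """Convert a monthly dollar cost into equivalent hosted-runner minutes."""
    return round(dollars / RUNNER_RATE)

print(as_runner_minutes(compute))    # 175600
print(as_runner_minutes(total))      # 341000
```

At roughly 43,800 minutes in a month, 341,000 minutes works out to the 7.8 runner-months mentioned above.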

3. Observability is key

This goes hand-in-hand with the need to be an expert. As I mentioned in The Importance of Kubernetes Logs, observability is key to understanding what is happening in your cluster. Kubernetes provides a wealth of information about the state of the cluster, the nodes, and the pods. It also provides information about the network, storage, and the API Server. This information is provided via log files and metrics. You need to be able to collect, store, and analyze this information to understand what is happening in your cluster. In fact, the top-performing teams usually have processes in place to transform the raw logs into actionable information. If you don't have a system that is aggregating the logs and metrics in real time, you will struggle to manage or scale the cluster. You will also likely find yourself frequently reacting to issues after they have already caused problems. Observability tools are often the very first thing a Kubernetes expert will set up on a new cluster.

Did you know that the Runner will throw an expected exception as part of its lifecycle? It actually may log several exception stacks … and that’s expected. Have you seen these in your logs? Are you able to ignore them to find actual issues? If not, you are likely missing critical information about the health of your cluster and runners.

4. Expect ongoing maintenance

Set it and forget it? Not likely. An ARC cluster requires ongoing care and maintenance. While it is possible to automate everything, it’s typically something that happens over time as the cluster reaches a steady state. There are three areas that you are likely to need to maintain more frequently: the image, the cluster, and ARC itself.

The base image for ARC has very little on it. To get the best performance and ensure you have the latest runner, you’ll need to build your own base image. This process is easily automated, but it does require some planning. Properly configured, you should be able to automate the process to run daily or weekly updates to ensure you always have the latest fixes and features. By including a tool cache and Actions cache, you can significantly reduce network traffic and improve the performance of your runners.
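As a sketch of what that custom image might look like, assuming the official ghcr.io/actions/actions-runner base image (the tool list is illustrative, and you should pin a specific tag rather than latest):

```dockerfile
# Sketch of a custom ARC runner image. The tools installed here are
# examples -- install whatever your workflows actually need.
FROM ghcr.io/actions/actions-runner:latest

USER root

# Pre-install commonly used tools so jobs don't download them on every run
RUN apt-get update \
 && apt-get install -y --no-install-recommends git curl jq unzip zip \
 && rm -rf /var/lib/apt/lists/*

# Return to the unprivileged runner user the base image expects
USER runner
```

Rebuilding this image on a daily or weekly schedule (for example, from a workflow) keeps the runner binary and your tooling current without manual effort.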

The cluster and its nodes also require ongoing care. Kubernetes should be updated frequently to fix bugs and performance issues. Kubernetes releases a new minor version roughly every 15 weeks, or three per year. Because of the support policy, a given release typically has support for 12 months (after which CVEs are no longer reported against it). During that time, there may be multiple patches released for security issues and fixes. Even if you only take a new minor release annually, you are expected to take patches as they are made available (typically on a monthly cadence), although CVEs can cause an out-of-band release. Similarly, the nodes should be frequently updated (or better, replaced) with nodes running the latest OS patches.

Of course, the ARC controllers and scale sets need to be periodically updated to ensure that they have the latest fixes and support the current versions of the GitHub services. These updates do not happen on a fixed schedule. They are typically driven by issues or backend changes.

5. Not everything is supported

ARC continues to evolve, but it currently lacks some features that are available with traditional hosted runners. This may require some special considerations. Obviously, since it manages resources as containers instead of virtual machines, it doesn't have quite the same protections and boundaries. That also introduces some additional security considerations.

ARC also does not officially support Windows runners. While it is possible to run Windows containers, it is not a supported configuration. In my experience, these containers also tend to need more resources than a comparable Linux runner. They may also have additional dependencies that need to be considered to compile Windows applications. For those cases, it may be worth considering a virtual machine or hosted runner.

ARC also doesn’t support labeled runners. While it is possible to customize the environment to use labels, GitHub has been moving away from labels for quite a while now. I’d recommend avoiding them at this point. That said, the name of the scale set is treated like a label. That means using the name dependabot or code-scanning will allow ARC to support those features in runners.

The modern version of ARC does not support all of the features that were part of the original community version. For example, it doesn’t natively support schedules for scaling. In most cases, this is not a bad thing. It’s better to use native Kubernetes tools for this functionality (remember the need to be a Kubernetes expert?). I’ll cover this more in a future post.
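As one hedged illustration of the native-tools approach, a plain Kubernetes CronJob can patch a scale set's minRunners on a schedule to warm up runners before business hours. This is a sketch, not a supported ARC feature; the names, namespace, and service account (which needs RBAC permission to patch the resource) are all placeholders:

```yaml
# Hypothetical scheduled scaling: raise minRunners at 8am on weekdays.
# AutoscalingRunnerSet is the resource created by the scale set Helm chart.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-runners
  namespace: arc-runners
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: runner-scaler   # placeholder; needs patch RBAC
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - patch
                - autoscalingrunnerset/arc-runner-set
                - --type=merge
                - -p
                - '{"spec":{"minRunners":10}}'
```

A matching CronJob would scale back down in the evening. Either way, the scaling logic lives in Kubernetes, not in ARC.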

Finally, ARC does not fully support load balancing between clusters today. I expect that will be resolved in the near future, but until then just realize that the jobs may be unevenly distributed. In practice, this is often not a serious issue. You do, however, want to be aware of it.

Final thoughts

There are many things to consider with Actions Runner Controller, but with careful planning you can make your environment very effective. It’s a powerful tool that can help you scale your CI/CD processes to meet the demands of your team. Just remember that it’s not a magic bullet. It requires ongoing care and feeding to ensure that it continues to meet your needs. When it doesn’t meet the needs, then GitHub-hosted runners may be a better choice. It’s all about finding the right balance for your team.