Ken Muse

More Best Practices for Deploying GitHub ARC

Last week we looked at some of my recommendations for deploying and managing GitHub Actions Runner Controller. This week, I’ll add a few more recommendations that can help to improve your experience with ARC. As a reminder, these are not official GitHub recommendations; they are based on some of my experiences working with customers and partners to deploy ARC. This also isn’t an exhaustive list or a guide to configuring Kubernetes.

TL;DR. The list of recommendations:

  • Use the right version of ARC
  • Dedicated clusters
  • Don’t default the security
  • Prefer local storage
  • Use a namespace for each runner scale set
  • Don’t think in labels
  • Start without resource limits
  • Monitor for resource exhaustion
  • Don’t overthink high availability

Use a namespace for each runner scale set

A runner scale set represents a configuration with a homogeneous set of runners. Runner scale sets need to be unique per runner group, but what about the namespaces? A best practice is to separate the controller and each runner scale set into separate namespaces. This ensures maximum isolation and configurability for the related resources. Because Kubernetes doesn’t recognize the GitHub concepts of organizations, teams, and repositories, namespaces can also provide an additional mechanism for reporting or analyzing resource consumption on the cluster.

Don’t think in labels

Labels are often misused and poorly understood. It’s not uncommon to see teams use them as a way of targeting specific runner instances, an anti-pattern that doesn’t work well with ephemeral runners. GitHub has been moving away from using labels as the primary means of identifying the runners to be targeted. Instead, the recommended approach is to target a runner group (or the name of the scale set). Larger hosted runners started to enforce this approach, and ARC has followed suit. Since scale sets must be homogeneous, every member will have the same features (and the set itself provides configurable levels of scale).

What about Dependabot and CodeQL on GitHub Enterprise Server (GHES)? They have historically relied on the labels dependabot and code-scanning being present. If you’re using GHES 3.9 or higher, simply set the installation name to one of those values when deploying the Helm chart. GitHub Docs cover using ARC with Dependabot and code scanning in more detail. The docs also highlight that Dependabot support uses the Dependabot Action to run updates, which requires Docker as a dependency; Docker-in-Docker mode must be enabled for the runner scale set.

Start without resource limits

Normally, it’s important to configure the resource management for your containers. With ARC, the charts do not specify requests or limits. Instead, they have a comment about resources in the values.yaml:

We usually recommend not to specify default resources and to leave this as a conscious choice for the user. This also increases chances charts run on environments with little resources, such as Minikube.

There is no one-size-fits-all rule for the resources required by the listener, the controller manager, or the pods hosting the runners. For example, enabling ARC’s metrics will require additional memory and CPU. Those requirements will grow as the number of jobs, workflows, and listeners increases, because each new definition increases the number of tracked metrics. Without enough memory resources, the controller will be killed. Without enough CPU resources, it will be throttled, slowing down the provisioning of runners.

Similarly, if the Actions runner pods have very low resource limits, it can lead to problems that are VERY hard to diagnose. For example, I’ve seen runners accept a job and then stop processing. The runners spawn several processes and make multiple outbound network connections. If a runner fails to spawn a required process, it fails (often silently). When this happens, it may be unable to notify the pipeline service of the failure. The job appears to be stuck in a “waiting to start” state until it eventually times out.

I’ve also seen situations where the logs contain failures or authentication issues which are caused by restrictive network policies. Because runners are optimized to use the full machine’s resources, they don’t always provide the clearest logging for unexpected resource restrictions. For example, you might see authentication errors (403) in the logs. In this case, the runner failed to make a required service call to get an access token. Future calls to services that required authentication then failed.

These issues can lead to a situation where one or more pods become “stuck”. They remain in a loop, retrying the connections or waiting on an update from a failed process. The runner is no longer processing the build, it hasn’t terminated, and it may have already started processing a job. Recovering requires both fixing the resource constraint and terminating the zombie pod.

This doesn’t mean that you can’t (or shouldn’t) apply resource constraints. In fact, you can configure constraints throughout ARC, down to the runner itself (although it’s tougher to configure the containers created dynamically in Kubernetes mode). Kubernetes works best when it has constraints to help it schedule.

I recommend starting without any constraints, then iteratively identifying and configuring appropriate minimum, maximum, and default values. This is similar to applying security restrictions. If your pods are getting close to their limits or failing, adjust the constraints. Monitoring will be your best friend in this process. If you’re setting constraints on the resources, make sure to have adequate monitoring.
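One way to picture the iterative step is a small helper that turns observed usage into candidate values. This is only a sketch under my own assumptions: the function name, the use of the median as the request, and the 25% headroom above the observed peak are illustrative choices, not ARC or Kubernetes recommendations.

```python
import math


def suggest_constraints(samples_mib, request_percentile=0.5, limit_headroom=1.25):
    """Suggest a memory request and limit (in MiB) from observed usage samples.

    Hypothetical heuristic: the request covers typical usage and the limit
    sits above the observed peak. Both knobs are assumptions to tune.
    """
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mib)
    # Nearest-rank percentile: pick the sample at or above the requested rank.
    rank = max(1, math.ceil(request_percentile * len(ordered)))
    request = ordered[rank - 1]
    # Leave headroom above the peak so ordinary spikes are not killed.
    limit = math.ceil(ordered[-1] * limit_headroom)
    return {"request_mib": request, "limit_mib": limit}


if __name__ == "__main__":
    # Example samples gathered from monitoring a runner pod (made-up numbers).
    print(suggest_constraints([400, 450, 500, 900, 1200]))
```

Re-running a calculation like this as new monitoring data arrives keeps the constraints tied to what the pods actually use rather than to a guess made on day one.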

Monitor for resource exhaustion

In general, monitoring resources is a good practice with Kubernetes. It enables you to stay ahead of situations where pods are scheduled but later fail due to resource constraints or exhaustion. The limits exist to ensure fair scheduling and to prevent pods (or OS processes) from consuming too many system resources. If a process is consuming excessive resources, it can starve other processes and prevent them from working properly. Exhaustion occurs when a given resource is no longer available for other processes.

Exhaustion is a common issue, especially if your ARC system will be hosting large numbers of builds. This is because the OS (or Kubernetes) is not always optimized for creating and destroying large numbers of resources in a short time. While cloud providers preconfigure several settings for you, ultimately the cluster’s administrator is responsible for monitoring and configuring these. In some cases, exhaustion can indicate that additional nodes or resources are required by the cluster.

What are some of the resources?

  • Process ID (PID). There is a limit to the number of child processes the K8S runtime components and pods can create. On some systems, the default may be as low as 32K. When the limit is exceeded, new child processes and threads won’t be able to spawn. These limits can exist at multiple levels. There are more details about PID limiting here.
  • Network ports. Every network connection consumes resources. A given IP address has 65,535 ports for inbound and outbound communication (of which about 64,000 are available). To avoid conflicts, a port must wait to be reused after a connection. This allows late-arriving traffic to be handled gracefully. Each runner creates a new connection, and its build processes may create additional connections for downloading dependencies or uploading. Kubernetes itself also requires network communications. If no ports are available, these connections fail.
  • Storage. Every pod needs a place to store its data. If this runs out, the build operations will fail. If it’s too slow, the builds will take an excessive amount of time and limit the performance of the entire system.
  • CPU. The CPU time is being shared between all of the running processes. Without enough available CPU time, the processes can become starved and perform their work very slowly.
  • Memory. When the memory runs low, it can cause child processes to fail to start. When it runs out, the current process will suddenly fail. At the same time, without enough memory available, Kubernetes may refuse to schedule the pods.
  • Runners. ARC is designed to support your builds. If there aren’t enough runners to handle the jobs, builds will be queued. This increases the amount of time developers are waiting to complete a process. Not having enough runners may mean increasing the limits on the runner scale set, adding additional nodes, or increasing the node size to ensure enough resources are available for scheduling the requests.
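The network port item in the list above lends itself to some quick arithmetic. The sketch below uses the roughly 64,000 available ports mentioned in that item; the 60 second wait before a port can be reused is my assumption (your OS settings may differ), and treating every connection as drawing from a single pool of ports is a simplification.

```python
def max_connection_rate(usable_ports=64_000, reuse_wait_seconds=60):
    """Upper bound on sustained new connections per second before ports run out.

    reuse_wait_seconds is an assumed value; check your own OS configuration.
    """
    if reuse_wait_seconds <= 0:
        raise ValueError("reuse_wait_seconds must be positive")
    return usable_ports / reuse_wait_seconds


def ports_held(connections_per_second, reuse_wait_seconds=60):
    """Ports reserved at steady state: each stays unavailable for the wait period."""
    return connections_per_second * reuse_wait_seconds


if __name__ == "__main__":
    print(f"ceiling: {max_connection_rate():.0f} new connections per second")
    print(f"500 connections per second holds {ports_held(500)} ports")
```

Under these assumptions, a burst of builds that each open many short-lived connections can approach the ceiling sooner than the raw port count suggests.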

The pods themselves are also a resource. They are scheduled based on perceived resource availability. If the resources get too low (for example, too many pods on the same node), the pod may not get scheduled (or it may schedule, but then terminate). By monitoring resource consumption — from the node to the individual OS resources — you can stay ahead of these issues.
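As a rough illustration of watching an individual OS resource, here is a minimal sketch that compares PID usage against the node-wide limit on Linux. The function names and the 80% warning threshold are hypothetical. Counting numeric entries in /proc only counts processes, so it understates usage when threads are heavy, and a container may be subject to a lower limit than the node-wide value read here.

```python
from pathlib import Path


def pid_utilization(in_use, limit):
    """Fraction of the PID limit currently consumed."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return in_use / limit


def needs_attention(in_use, limit, threshold=0.8):
    """True when usage reaches the (assumed) warning threshold."""
    return pid_utilization(in_use, limit) >= threshold


def read_linux_pid_usage(proc_root="/proc"):
    """Read (processes in use, node-wide limit) from a Linux /proc filesystem."""
    root = Path(proc_root)
    limit = int((root / "sys/kernel/pid_max").read_text().strip())
    # Each numeric directory name under /proc is a running process.
    in_use = sum(1 for entry in root.iterdir() if entry.name.isdigit())
    return in_use, limit


if __name__ == "__main__":
    used, maximum = read_linux_pid_usage()
    print(f"{used}/{maximum} PIDs in use, attention needed: {needs_attention(used, maximum)}")
```

In practice you would likely rely on your monitoring stack for this rather than a script, but the comparison it makes is the same: current consumption against a known ceiling, with an alert before the ceiling is reached.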

Hopefully you can see why monitoring and handling resource allocations is a key responsibility for a Kubernetes administrator. They need to proactively monitor and adjust resource consumption to stay ahead of issues. Beyond optimizing ARC, it also ensures that teams are well-positioned to identify and remediate resource and security issues. You can’t fix what you can’t see.

Don’t overthink high availability

If you want the best uptime with ARC, you need to have a plan for handling system downtime. While you can’t do anything about the GitHub services, you do have control over your use of the services and your self-hosted runners. Compared to most other Kubernetes deployments, the strategy for highly available runners is simple.

ARC is using stateless resources. That means that high availability does not require data replication or complex failover strategies. Instead, you just need a second Kubernetes cluster running ARC in a different region. To make this work, create scale sets with the same name in each cluster. Each scale set should be assigned to a different runner group. For example, you might create a runner scale set named java-build in each cluster, with one assigned to the group java-build-east and the other assigned to java-build-west. This is an active-active configuration, so both clusters will receive and process jobs while they are online. This is covered in the GitHub Docs.

If you’re using GitHub Enterprise Cloud, an easy way to have high uptime is to consider using GitHub-hosted runners (standard or larger runners) whenever possible. The runners are network isolated. By default, they have no access to company-critical resources and infrastructure. This limits the ability for malicious actors to compromise internal resources. At the same time, larger runners support VNET injection (in beta), allowing them to be dynamically created within Azure private networks, giving administrators some ability to control the network access. Hosted runners also utilize full virtual machines instead of containers, giving you access to the full resources of the machine.

To be clear, this isn’t an all-or-nothing approach. There’s nothing stopping you from using both GitHub-hosted and self-hosted runners. In fact, there are times when you might need both in a single workflow.

Sadly, there is no way to directly configure a “fallback” approach. For example, you can’t directly configure a workflow to use self-hosted runners if they are available, but fall back to hosted runners if not. Similarly, you can’t have a workflow natively select the approach based on a queue, wait time, or other criteria. It is possible, however, to implement logic inside the workflow which determines the appropriate runner for a job. You can learn more about that from my article on dynamic build matrices.
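To make the idea of in-workflow selection logic more concrete, here is a minimal sketch of the decision itself. Everything in it is an assumption for illustration: the function name, the java-build and ubuntu-latest targets, and the input shape, which I modeled on runner records carrying a status and a busy flag. Note that a scale set that scales from zero may have no idle runner registered at all, in which case a check like this would always choose the hosted option.

```python
def choose_runs_on(runners, self_hosted_target="java-build",
                   hosted_fallback="ubuntu-latest"):
    """Pick a target for a job: self-hosted if one is idle, otherwise hosted.

    `runners` is a list of dicts with "status" and "busy" keys (assumed shape).
    """
    for runner in runners:
        # An idle runner must be online and not already running a job.
        if runner.get("status") == "online" and not runner.get("busy", False):
            return self_hosted_target
    return hosted_fallback


if __name__ == "__main__":
    sample = [
        {"name": "runner-a", "status": "online", "busy": True},
        {"name": "runner-b", "status": "offline", "busy": False},
    ]
    print(choose_runs_on(sample))
```

A setup job could run a script like this and publish the result as a job output, and a later job could then use that output as its runner target.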

Final thoughts

All of the typical Kubernetes considerations also come into play, and this can include some other best practices (including some that teams may have built up over time). Overall, the process is a typical one for DevOps – iterative development and improvement. Start with the smallest working unit, then iterate until you reach your goal.

As you can see, there is a broad expectation with ARC that you have a strong understanding of infrastructure, networking, Kubernetes administration, and security (among other things). For teams with this expertise, this should give them a starting point for their discussions. For everyone else, this should give you some important recommendations for ensuring that you are getting started successfully.