Ken Muse

GitHub Actions and Monitoring

Whether you’re developing drivers, building appliances, or testing system deployments, you need to test your code. For many, this means installing GitHub Actions Runners (or other agents) on the system under test. But is this really the best way to test your code?

For many teams, this doesn’t seem to create any direct issues. By monitoring the system as they deploy, they can directly assess whether the deployment is working correctly. However, this approach has several significant drawbacks. If you’re testing system drivers (or other low-level components), the first issue is obvious: when the system restarts or crashes, the runner or agent is killed along with the pipeline process. Because the system is no longer accessible, teams look for ways to reconnect to the environment to determine the root cause. The common workaround is to treat those as “failures” and then try to assess what happened manually. When manual processes are required, we lose efficiency. We need to be able to assess the outcome and report it as easily as possible so that the problem can be remediated.

In many cases, the problems are a bit more insidious. By running an additional process on the system, you’re testing both the system and the deployment process. With Actions, this means that the Actions Runner is consuming memory and CPU resources, potentially impacting the stability and behavior of the environment. If these processes utilize additional tools and services to deploy the code, then those additional components are also being included in the test. There’s nothing worse than spending hours triaging the problem only to discover the failure was caused by a zombie process consuming too many resources.

An additional challenge is that all of these components add security concerns to the environment. Modern agents – like Actions Runners – rely on outbound HTTPS connections. This avoids the need to open inbound ports. However, this doesn’t entirely eliminate the risk, especially if you’re deploying to production resources. The running agents are generally highly privileged, so they have access to perform activities on the system. This means that the tasks running on the agent create an additional attack surface. In addition, any tools used to support the deployment add to that attack surface. In most of these cases, the runners are also long-lived, so extra care must be taken to avoid the runner being compromised by “ambient” resources. For example, binaries or scripts that are left on the system after the deployment process completes may be executed by future processes, inadvertently impacting the system stability.

The best solution to all of these problems is to run the deployment from a separate environment. This isolates that process from the system under test. It also allows the deployment process to monitor that system continuously throughout the process. Consider driver testing. If the system under test fails, the deployment process can automatically restart the system and connect to it (or access a disk image). From there, it can capture any relevant logs, crash dumps, or process details. None of this is possible if the deployment service is running on the system under test. Similarly, when deploying to a production system, the deployment process can monitor the system to detect and report any issues. If the system fails, then the deployment process can repair the state. This minimizes the need for manual processes and intervention.
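To make the idea concrete, here is a minimal Python sketch of a monitor that runs on a separate machine. It is only an illustration, not a definitive implementation: the names `monitor_deployment`, `MonitorResult`, and `tcp_probe` are hypothetical, and the health check and diagnostics collector are passed in as callables because the right probe (a TCP port, an HTTP endpoint, a hypervisor API) depends on your environment.

```python
import socket
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MonitorResult:
    healthy: bool
    checks_run: int
    diagnostics: List[str] = field(default_factory=list)


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the system under test succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def monitor_deployment(
    is_healthy: Callable[[], bool],
    collect_diagnostics: Callable[[], List[str]],
    checks: int = 5,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> MonitorResult:
    """Poll the system under test from outside of it.

    Stops at the first failed probe and gathers diagnostics. Because this
    code does not run on the system under test, it survives a crash there.
    """
    for attempt in range(1, checks + 1):
        if not is_healthy():
            # Hypothetical hook: restart the system, pull logs or crash dumps.
            return MonitorResult(False, attempt, collect_diagnostics())
        if attempt < checks:
            sleep(interval_seconds)
    return MonitorResult(True, checks)
```

In a workflow, a job on a separate runner could call `monitor_deployment(lambda: tcp_probe(host, 22), collect)` after starting the deployment and fail the job when `healthy` is false, attaching whatever `collect` returned.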

For deployments to Windows servers, Web Deploy can facilitate the process. Available for more than a decade, this tool provides a way to remotely install and configure web applications (and more) on the server. Historically, it was the optimal way to configure and deploy to large Windows Server farms. Alternatively, Windows supports PowerShell remoting. With Linux machines, the tool of choice is often SSH.
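For the SSH case, a rough sketch of driving a remote command from the deployment environment might look like the following. It assumes the OpenSSH client is installed and key-based authentication is already configured; the helper names `build_ssh_command` and `run_remote` are hypothetical, as are the host and user in the usage note.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_ssh_command(
    host: str,
    user: str,
    remote_argv: Sequence[str],
    connect_timeout: int = 10,
) -> List[str]:
    """Build an ssh invocation that runs remote_argv on the target host."""
    # ssh hands the command to the remote shell as a single string,
    # so quote each argument to keep it intact on the other side.
    remote = shlex.join(remote_argv)
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail instead of hanging a pipeline
        "-o", f"ConnectTimeout={connect_timeout}",
        f"{user}@{host}",
        remote,
    ]


def run_remote(
    host: str,
    user: str,
    remote_argv: Sequence[str],
    timeout: float = 60.0,
) -> subprocess.CompletedProcess:
    """Run a command on the remote system and capture its output."""
    return subprocess.run(
        build_ssh_command(host, user, remote_argv),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # let the caller decide how to treat a non-zero exit code
    )
```

For example, `run_remote("sut.example.test", "deploy", ["systemctl", "is-active", "myapp"])` would report the state of a service without anything extra being installed on the target beyond its SSH server.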

An alternative approach is to create infrastructure using features such as disk images or containers. By baking the components into the system, you can create complete environments for testing and deployment. Instead of performing a rollback by uninstalling components or attempting to restore older configurations, rollbacks are as simple as reverting to the previous system or components. The rollback then uses the original environment instead of trying to reconfigure an updated environment. This approach also gives you the opportunity to adopt immutable file systems. This can be a significant security advantage since the main file system can never be modified while the system is running. An additional benefit is that it makes it easier to apply OS updates and patches. Instead of updating the live system, you can update the disk image and deploy a new system, then retire the original system. This eliminates downtime and reduces the risk from the update process.
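The rollback model described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than a real provisioning tool: `Image` and `Environment` are hypothetical types, and `validate` stands in for whatever smoke test you would run against the newly deployed system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Image:
    """An immutable, versioned disk image or container image."""
    version: str


class Environment:
    """Tracks which image is live and which one it replaced."""

    def __init__(self, initial: Image) -> None:
        self.active = initial
        self.previous: Optional[Image] = None

    def deploy(self, candidate: Image, validate: Callable[[Image], bool]) -> bool:
        """Switch to the candidate image; revert if validation fails."""
        self.previous, self.active = self.active, candidate
        if validate(candidate):
            return True
        self.rollback()
        return False

    def rollback(self) -> None:
        """Return to the image that was live before the last deployment."""
        if self.previous is None:
            raise RuntimeError("no previous image to roll back to")
        self.active, self.previous = self.previous, None
```

Nothing is uninstalled or reconfigured during a rollback. The environment simply points back at the image that was already known to work.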

An advantage to this approach is that it also positions the teams for dynamic testing. By creating a complete environment, deployments can be robustly tested without affecting the production systems. For example, you can perform controlled load testing or smoke testing in a fully isolated environment. This also makes it easier to triage production issues since the entire environment can be quickly recreated and analyzed. If you’re used to long-running QA servers, changing to dynamic systems can substantially reduce the cost of testing and the chances of false test outcomes. I can recall more than a few times where a QA team indicated that a release was production ready, only to discover after countless hours of repair time that their environment had drifted and relied on a component that was no longer present in the production environment. It also enables greater parallelization of efforts since multiple environments can be created and tested in parallel. Automation anyone? 😄
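As a rough sketch of that parallelization, the following Python creates several short-lived environments, runs a test suite against each, and always tears them down afterward. The function name `run_in_ephemeral_environments` and the `create`, `run_tests`, and `destroy` callables are hypothetical placeholders for whatever provisioning and test tooling you already use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def run_in_ephemeral_environments(
    names: Iterable[str],
    create: Callable[[str], Any],
    run_tests: Callable[[Any], bool],
    destroy: Callable[[Any], None],
    max_workers: int = 4,
) -> Dict[str, bool]:
    """Create, test, and destroy one environment per name, in parallel."""

    def lifecycle(name: str) -> bool:
        env = create(name)
        try:
            return run_tests(env)
        finally:
            # Always tear down so no long-lived environment can drift.
            destroy(env)

    ordered = list(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        return dict(zip(ordered, pool.map(lifecycle, ordered)))
```

Because each environment is built fresh from the same image, a failure in one of them points at the release rather than at leftovers from a previous test run.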

This doesn’t mean we ignore concepts like testing in production. It just creates additional opportunities for how the systems can be built and tested. In fact, it makes it easier to test in production since we can validate systems as they deploy (and quickly revert to previous versions when necessary). This means that you can deploy more frequently without increasing the risk of failure.

In summary, there are quite a few advantages to avoiding installing deployment agents or runners on production systems and systems under test. By moving away from this historical pattern, you can create more robust (and secure) deployment practices, reduce your risk, and improve your monitoring.