Ken Muse

The Importance of Kubernetes Logs

If you’re trying to build a system for any purpose, observability is the key to understanding how it performs. In fact, it’s very difficult to know what’s happening if you don’t have mechanisms for seeing it in real time. This is doubly true in the world of Kubernetes, where the complexity of the system can make it difficult to understand what’s happening. Kubernetes is a powerful tool, but that power comes with a cost: there is no single configuration that “just works” for a given use case.

This is an interesting problem, because it means that you need to “tune” your Kubernetes cluster to work well for the specific workload you are running on it. There is no one-size-fits-all solution. As a result, you need to be able to see what’s happening in the cluster to understand how to configure it (and the workloads) for optimal management and scale. This is where logs come in. They provide visibility into what’s happening.

What is Kubernetes doing?

Kubernetes is not like virtual machines in the cloud or on premises. In fact, it lacks the hardware-level boundary enforcement that powers virtual machines. Kubernetes is classified as a container orchestrator, but it’s easier to think of it as a process orchestrator. Containers are nothing more than processes running on a host with software-managed boundaries on how they access memory, network, CPU, and storage. While that sounds similar to a VM, the mechanisms are significantly different. The management of these resources and processes relies on containers (more processes!) that provide APIs supporting CRUDL (Create, Read, Update, Delete, and List) activities on the resources.

Under the covers, the CPU is providing slices of time to each running process. Essentially, each container gets some amount of time to run, then it pauses briefly while the next container gets its turn. With more cores available, more containers can run simultaneously. This functionality is managed by the operating system, meaning that it applies to the services and containers that Kubernetes is using as well. Why does that matter? It means that key services and functionality can be directly impacted by how many containers are requesting (and using) CPU time at any given moment on the same node. In addition, the Kubernetes services are coordinating these settings and activities across all of the nodes. As the number of nodes, scheduled Pods, and running Pods increases, the additional activities can impact the performance of the cluster.

Depending on what features are installed, you may have operators and controllers that are frequently requesting details or sending requests to create/update/delete resources. These activities run through the API Server. To help coordinate that, it uses queues to manage the requests and ensure fairness. If the API Server becomes overwhelmed, it can take longer to process these queues. That, in turn, can slow down the entire cluster as components wait for responses to their requests.
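You can get a rough sense of this queuing from the cluster itself. A sketch, assuming you have kubectl access and a Kubernetes version with API Priority and Fairness enabled (the mechanism behind that request queuing):

```shell
# List the API Priority and Fairness objects that control how the
# API Server queues and prioritizes incoming requests.
kubectl get flowschemas
kubectl get prioritylevelconfigurations

# The API Server also exposes flow-control metrics. Grepping the raw
# metrics endpoint shows current queue depth and rejected requests.
kubectl get --raw /metrics | grep apiserver_flowcontrol_current_inqueue_requests
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total
```

Sustained growth in the in-queue counts (or any rejected requests) is a sign the API Server is struggling to keep up.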

With managed Kubernetes services – such as Elastic Kubernetes Service (EKS) or Azure Kubernetes Service (AKS) – the control plane components are outside of your control. In fact, aside from the API Server (which exposes some metrics), on EKS you only have access to the logs for the other components. Azure is similar, but has a preview feature that allows you to collect metrics from api-server, etcd, kube-scheduler, cluster-autoscaler, and kube-controller-manager. That means that one of the best ways to get visibility into what’s happening is through the logs.
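On EKS, those control plane logs aren’t shipped anywhere until you turn them on. A sketch using the AWS CLI – the cluster name and region here are placeholders:

```shell
# Hypothetical cluster name and region. This enables all five EKS
# control plane log types so they are delivered to CloudWatch Logs.
aws eks update-cluster-config \
  --region us-east-1 \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
```

Note that delivering these logs to CloudWatch incurs standard ingestion and storage charges, so review retention settings as well.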

The Kubernetes log limit

Now that you understand why these logs matter, it’s important to know that you can’t just rely on the logs stored on the Kubernetes node. By default, Kubernetes rotates logs when they reach 10MB. If a container logs 35MB of information, you will only see the most recent information. Since the logs would have rotated three times, you will likely only have 5MB of captured logs available. In most cases, that’s not enough for diagnosing problems under a full load.
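That 10MB limit comes from the kubelet’s configuration, and it can be raised. A sketch of the relevant KubeletConfiguration fragment – the actual config file path varies by distribution, so this just writes the fragment locally:

```shell
# Sketch: raise the kubelet's per-container log rotation limits.
# The defaults are containerLogMaxSize: "10Mi" and containerLogMaxFiles: 5.
cat <<'EOF' > kubelet-config-fragment.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "50Mi"   # rotate each container log at 50MB instead of 10MB
containerLogMaxFiles: 10      # keep up to 10 rotated files per container
EOF
```

Raising the limits buys you time, but it consumes node disk and still loses data eventually – exporting the logs is the real fix.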

What makes this more challenging is that Kubernetes can generate a LOT of logs. The Actions Runner Controller, for example, can easily generate over 10GB of logs per scale set each day. The more active a system is, the more logs it will generate. If you aren’t capturing those logs as they are written and exporting them from Kubernetes, you may have less than 0.1% of the information available when you need it most!
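Capturing logs “as they are written” typically means tailing the node’s log files with an agent. A minimal Fluent Bit sketch, assuming a CRI-based runtime – the stdout output is a placeholder for a real destination such as CloudWatch or Loki:

```shell
# Sketch of a minimal Fluent Bit configuration that tails container logs
# as they are written, so entries are exported before rotation discards them.
cat <<'EOF' > fluent-bit-fragment.conf
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  cri
    Tag     kube.*

[OUTPUT]
    Name    stdout
    Match   *
EOF
```

In practice, an agent like this runs as a DaemonSet so every node’s logs are collected.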

How do I find the logs?

If you’re running EKS, then consider using CloudWatch (with Logs Insights) and Container Insights. These tools capture the numerous logs (as log streams) and key metrics. AWS has prescriptive guidance for monitoring EKS that you will want to review. It contains details about where to find the logs.
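Once the logs are in CloudWatch, Logs Insights lets you query them. A hedged sketch from the CLI – the log group name follows the usual /aws/eks/&lt;cluster&gt;/cluster convention, and the cluster name and epoch timestamps are placeholders:

```shell
# Hypothetical example: search the EKS control plane log group for recent
# kube-apiserver entries. Times are Unix epoch seconds.
aws logs start-query \
  --log-group-name /aws/eks/my-cluster/cluster \
  --start-time 1700000000 \
  --end-time 1700003600 \
  --query-string 'fields @timestamp, @message | filter @logStream like /kube-apiserver/ | sort @timestamp desc | limit 50'
```

The command returns a query ID; you retrieve the results with `aws logs get-query-results`.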

With Azure monitoring and Container Insights, you can access these logs using Log Analytics (with each component captured in a specific Table, Category and Namespace). Make sure to review the Azure container monitoring best practices. Both environments offer managed Prometheus for additional metrics scraping. Azure provides a list of Log Analytics tables and resource logs with the logged information. The kubelet logs are available from Container Insights in the Syslog table.
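As an illustration of querying that Syslog table, here is a hedged sketch using the Azure CLI – the workspace GUID is a placeholder, and the command may require the log-analytics CLI extension:

```shell
# Hypothetical workspace ID: pull recent kubelet entries from the Syslog
# table that Container Insights populates.
az monitor log-analytics query \
  --workspace 00000000-0000-0000-0000-000000000000 \
  --analytics-query "Syslog | where ProcessName == 'kubelet' | sort by TimeGenerated desc | take 50"
```

The same KQL query works directly in the Log Analytics portal experience.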

Of course, you can also capture the various folders that will exist in /var/log on each node.
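For reference, these are the usual node-level locations – exact layout can vary by container runtime and distribution:

```shell
# Per-container log symlinks, named <pod>_<namespace>_<container>-<id>.log
ls /var/log/containers/

# Per-Pod directories holding the actual (rotated) log files
ls /var/log/pods/

# On systemd-based hosts, the kubelet's own logs live in the journal
journalctl -u kubelet --since "1 hour ago"
```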

Is there another option?

While the cloud providers offer managed solutions that are optimized to be low-overhead, you can also use self-managed solutions. This is especially important if you’re managing your own clusters. The most common solution for capturing (“scraping”) metrics is Prometheus. It’s not the only option, but it’s one of the most common (so much so that the cloud providers offer managed options). For log scraping, you have even more options. Loki (with Promtail or the Grafana Agent), Fluentd, and Fluent Bit are some of the frequent choices. Each has its own strengths and weaknesses, but all are capable of capturing logs and sending them to a central location for analysis. There are three key considerations:

  1. Do you have a solution that can scrape the metrics for system and user components?
  2. Do you have a solution that can tail the logs to ensure you capture all of the information from the user and system pods?
  3. Do you have a solution that makes it easy to query and analyze the metrics and logs? This may include a system that can extract key information from the logs to make it easier to query.

The final word

If you really want to master Kubernetes, take the time to review the information you’re capturing and understand how to use it. There’s a lot there! You don’t have to know everything (and most people only know a fraction of the numerous metrics available). You do want to know what “healthy” logs and metrics look like, and you want to understand how these change as the system scales up. You also want to understand how to navigate the logs to understand scheduling issues and problems that impact the API server (and other lower-level services). As you understand these aspects, you’ll be better positioned to manage (and grow) your Kubernetes cluster effectively.