The Magic of Scaling and Auto-Scaling

Category: #Azure #DevOps

Tags: #DevOps #Performance #Containers #Azure #AWS

Published: June 6, 2024 Reading Time: 7 min

Autoscaling has become almost a way of life. In all of its various flavors, it magically provides us with a way to create resources on demand. Or does it? Several times in my career I’ve helped customers scale to support massive workloads. Whether it’s scaling through Azure App Services or Kubernetes, I consistently see similar patterns and challenges. Today I’ll share some of those.

Almost anything can scale

As a developer, I constantly heard that {insert language/platform here} can’t scale. Some languages are single-threaded. Some can’t utilize all of the available cores on a modern CPU, while others can’t handle more than a couple of GB of memory or a few hundred requests per second. The truth is that almost anything can be made to scale. It’s just a matter of how much effort you’re willing to put into it. It really depends on the architecture of the system (which can create bottlenecks) and how you manage its hosting.

Because virtual machines and containers enable us to create constrained environments for executing code, we can often overcome limitations in the language. In fact, languages that have constraints on CPU/memory usage are often ideal for these cases. The host platform can schedule large numbers of those workloads in parallel to process any request. And while performance is often important, most languages can execute compiled or scripted code quickly enough to enable subsecond response times. It’s all about how you use it.

There are lots of examples that run counter to what you might have heard:

“Ruby and Rails are slow, often up to 50x slower than C”
And yet GitHub handles millions of users and repositories each day.
“Windows and IIS can’t handle heavy traffic”
I think eBay might challenge that assertion; they scaled to support over 100 million users with that stack.
“PHP can’t scale”
WordPress, Facebook, and Wikipedia might disagree.
“Python lacks true parallelism, so it is a poor choice for high-volume web sites”
I’m sure that Dropbox, Reddit, Instagram, and Spotify would disagree.
“You can’t handle heavy traffic without the cloud or a large server farm”
Did you know that Stack Overflow processed over 6K requests per second (2 billion requests per month) with 9 web servers and a .NET-based web app? Their hardware list might surprise you. They currently transfer 55 TB of data per month while averaging around 5% CPU utilization.

Scaling is a journey

Want to scale your system? Then measure and monitor it. If you don’t know why the system is bottlenecked or running out of resources, you won’t be able to make it scale. I worked with a high-traffic site that needed to handle spikes of tens of thousands of concurrent requests with a 15-year-old codebase hosted on an Azure App Service. The site was slow and unresponsive. Why? Because it was experiencing heavy database contention. That wasn’t the root cause, however. The legacy .NET code was trying to create a session in the database for every request, even though the site didn’t need sessions. By monitoring the site, we also learned that it was using Application Request Routing (ARR) Affinity cookies. That meant that every request from a given client went to the same server. The App Service was trying to scale up to handle the traffic, but the ARR affinity ensured that the traffic continued to route to the original servers. Those servers would ultimately fail under the load, while the App Service was creating new servers that remained largely idle. After we fixed those issues, the site was able to handle the traffic with just a handful of servers.
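To make the affinity problem concrete, here is a minimal Python sketch. It is a toy model, not how ARR is implemented: the `route_requests` function, the server names, and the traffic numbers are all made up for illustration. It assumes 100 clients that each send one request per tick, and a pool that grows from two servers to six after the first tick.

```python
from collections import Counter
from itertools import cycle


def route_requests(clients, servers_over_time, sticky):
    """Return a Counter of requests handled per server.

    clients: list of client ids, each sending one request per tick.
    servers_over_time: one list of server names per tick
        (the pool grows as the platform scales out).
    sticky: if True, a client stays pinned to the first server it was
        routed to (a toy stand-in for an affinity cookie).
    """
    handled = Counter()
    pinned = {}  # client id -> server it is pinned to
    for servers in servers_over_time:
        rotation = cycle(servers)  # simple round-robin for unpinned clients
        for client in clients:
            if sticky and client in pinned:
                server = pinned[client]
            else:
                server = next(rotation)
                if sticky:
                    pinned[client] = server
            handled[server] += 1
    return handled


clients = list(range(100))
# Hypothetical pool: two servers for the first tick, then six.
pool = [["s1", "s2"]] + [["s1", "s2", "s3", "s4", "s5", "s6"]] * 9

with_affinity = route_requests(clients, pool, sticky=True)
without_affinity = route_requests(clients, pool, sticky=False)

print("sticky:    ", dict(with_affinity))
print("not sticky:", dict(without_affinity))
```

With sticky routing, the two original servers handle every request and the four new servers never see traffic. Without it, the load spreads across the whole pool as soon as the new servers exist.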

If you’re working with Kubernetes, this is doubly true. Kubernetes doesn’t have a magic setting to “just scale”. It requires careful monitoring to understand how it is scheduling the pods and whether or not it is fully utilizing the nodes. In my experience, issues with scaling Kubernetes frequently come down to resource management or overloading the Kubernetes API Server. By default, the API Server allows a total of 600 requests to be dispatched at a given time (with Amazon EKS and Azure AKS automatically managing this and increasing it for larger cluster sizes). The only way to understand what is happening is to monitor the system. You continuously try changes and observe the results until you reach your performance goals.
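As far as the kube-apiserver defaults go, the total of 600 is the sum of two separate budgets: 400 in-flight read-only requests (`--max-requests-inflight`) and 200 in-flight mutating requests (`--max-mutating-requests-inflight`). The Python sketch below is only a toy admission gate that assumes those two numbers. The `InflightLimiter` class is hypothetical, and the real server uses more elaborate queuing and fairness logic.

```python
class InflightLimiter:
    """Toy admission gate with separate read-only and mutating budgets."""

    def __init__(self, max_readonly=400, max_mutating=200):
        self.limits = {"readonly": max_readonly, "mutating": max_mutating}
        self.inflight = {"readonly": 0, "mutating": 0}

    def try_admit(self, kind):
        """Admit a request if its budget has room; otherwise reject it."""
        if self.inflight[kind] >= self.limits[kind]:
            return False  # a real server would answer 429 Too Many Requests
        self.inflight[kind] += 1
        return True

    def release(self, kind):
        """Call when a request finishes to free its slot."""
        self.inflight[kind] -= 1


limiter = InflightLimiter()
# A burst of 250 writes arrives before any of them completes.
admitted = sum(limiter.try_admit("mutating") for _ in range(250))
rejected = 250 - admitted
print(admitted, rejected)
```

The point of the sketch is that a burst of writes (for example, many Pods being created at once) can exhaust the smaller mutating budget long before the combined total is reached.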

Autoscaling doesn’t “just work”

Whether it’s an Azure App Service Plan or a cluster autoscaler, these settings aren’t designed for you to “set it and forget it”. They are triggered by some event, then take some amount of time to make the change. They then wait for a period of time before reexamining the state to make the next scaling decision. If the autoscaler doesn’t scale up fast enough (or reacts too late to the early signs of load), it can be left playing catch-up with an ever-increasing load.

That means that in some cases, you need to manually make the change before the system is overwhelmed. While we tried to find the cause of the performance issues on the web server I previously mentioned, the team had to increase the minimum number of database and web servers before the peak traffic arrived. That slight bump gave the system some extra capacity up front and allowed it to stay just ahead of the load.
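The trigger, delay, and cooldown cycle described above can be sketched as a small simulation. This is a deliberately simplified model rather than any platform's actual algorithm. The `simulate` function, the 70% threshold, the 3-minute provisioning delay, the 5-minute cooldown, and the traffic ramp are all assumptions chosen for illustration.

```python
def simulate(load, min_servers, capacity=100, threshold=0.7,
             step=1, provision_delay=3, cooldown=5):
    """Return the number of minutes in which demand exceeded ready capacity."""
    ready = min_servers
    pending = []          # minutes at which requested servers become ready
    next_decision = 0     # earliest minute the autoscaler may act again
    overloaded_minutes = 0
    for minute, demand in enumerate(load):
        # Servers requested earlier come online once provisioning finishes.
        ready += sum(1 for t in pending if t == minute)
        pending = [t for t in pending if t > minute]

        if demand > ready * capacity:
            overloaded_minutes += 1

        # Scale-out rule: react to utilization, then wait out the cooldown.
        utilization = demand / (ready * capacity)
        if minute >= next_decision and utilization > threshold:
            pending.extend([minute + provision_delay] * step)
            next_decision = minute + cooldown
    return overloaded_minutes


# Hypothetical ramp: traffic climbs by 40 requests/minute for 30 minutes.
ramp = [100 + 40 * m for m in range(30)]

reactive = simulate(ramp, min_servers=2)
prescaled = simulate(ramp, min_servers=8)
print(f"overloaded minutes, minimum of 2 servers: {reactive}")
print(f"overloaded minutes, minimum of 8 servers: {prescaled}")
```

In this model the autoscaler adds capacity more slowly than the load grows, so the run that starts at the low minimum spends most of the ramp overloaded. Raising the minimum ahead of time does not change the scaling rule at all. It just buys enough headroom for the rule to keep up for longer.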

A Kubernetes autoscaler is similar. It looks for some criteria – such as a queue of unscheduled Pods – to decide when to scale up. It then takes several minutes to add a new node to the cluster, prepare it for receiving workloads, and then schedule the Pods. If the load is increasing rapidly, the high volume of requests to the API Server and Scheduler can lead to processing delays. As a new node arrives, it can initially add even more requests. This can lead to a cascading effect where the API Server is continuously overwhelmed and the system is always behind the load. In some cases, it may even make sense to use custom metrics to predict the constraints and scale proactively.
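One way to picture proactive scaling is to size the cluster for the load expected by the time a new node is ready, rather than for the load right now. The Python sketch below uses a naive linear trend and is only an illustration. The `forecast` and `nodes_needed` helpers, the pending-Pod samples, the 4-minute node startup time, and the 30 Pods per node are all hypothetical.

```python
import math


def forecast(samples, lead_time):
    """Extrapolate a metric `lead_time` steps ahead using the average
    change between consecutive samples (a deliberately naive trend)."""
    if len(samples) < 2:
        return samples[-1]
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    return samples[-1] + slope * lead_time


def nodes_needed(pods, pods_per_node):
    """Round up to whole nodes, never returning fewer than one."""
    return max(1, math.ceil(pods / pods_per_node))


pending_pods = [40, 55, 70, 85]   # hypothetical samples, one per minute
predicted = forecast(pending_pods, lead_time=4)  # assumed node startup time

reactive_target = nodes_needed(pending_pods[-1], pods_per_node=30)
proactive_target = nodes_needed(predicted, pods_per_node=30)
print(reactive_target, proactive_target)
```

Sizing for the current queue asks for fewer nodes than sizing for the forecast, and the difference is the capacity that would otherwise arrive several minutes late. A real custom-metrics setup would need a more robust predictor and a guard against over-provisioning on noisy data.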

In short, don’t trust that you can just specify the scaling range and walk away. Monitor the system to understand the behaviors, then proactively scale if necessary.

More is not better

A common mistake is thinking that throwing more resources at the problem will resolve it. I’ve seen multiple cases where – as counterintuitive as it may seem – the opposite is true. As an example, consider the maxRunners setting for Actions Runner Controller. If you’re seeing slow scaling for jobs, you might think that increasing this will improve the performance. In reality, it can lead to more contention for the same resources. The system could end up trying to schedule more jobs than it can handle, leading to longer wait times for each job and growing contention for resources. In fact, the extra load might even lead to the API Server becoming overwhelmed more quickly. That can slow down the autoscaler even more!
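A generic way to illustrate this effect is the Universal Scalability Law, which models throughput as workers are added while accounting for contention and coordination overhead. This is a swapped-in textbook model, not a description of how Actions Runner Controller behaves, and the two coefficients below are invented for the example.

```python
def relative_throughput(n, contention=0.05, coherency=0.002):
    """Universal Scalability Law: throughput of n workers relative to one.

    contention: fraction of work that is serialized on a shared resource.
    coherency: cost of workers coordinating with each other.
    Both values here are made up for illustration.
    """
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))


curve = {n: relative_throughput(n) for n in range(1, 65)}
best = max(curve, key=curve.get)

print(f"throughput peaks at {best} workers")
print(f"relative throughput at the peak: {curve[best]:.2f}")
print(f"relative throughput at 64 workers: {curve[64]:.2f}")
```

With these coefficients the curve peaks well below the maximum worker count and then declines, so tripling the workers past the peak yields less total throughput, not more. The only way to find where that peak sits in a real system is to measure it.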

Scaling is about planning

If you want to handle large amounts of traffic or highly concurrent workloads, then you need to have a plan. The plan must include monitoring, as well as iterative improvement cycles. Under load, you need to peel away the layers of bottlenecks. As you remediate one issue, the next one will surface. Repeating this process will eventually lead to more efficiency and higher performance. As you make each change, review the logs and metrics to identify candidates for the next round of changes. Slow and deliberate steps will provide the best results. They also help you recover from mistakes. If a change doesn’t work as expected, you can quickly revert it and try something else. That is why it’s important to make small, incremental changes and then monitor the end results.
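The idea of peeling away bottlenecks one layer at a time can be sketched in a few lines of Python. The model assumes that end-to-end throughput is capped by the slowest layer in the request path. The layer names and capacities are hypothetical, and in practice each "fix" is a real change that you measure before moving on.

```python
def bottleneck(capacities):
    """Return (stage, capacity) for the slowest stage in a request path."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]


# Hypothetical requests/second that each layer can sustain.
system = {"web": 900, "session store": 150, "database": 400, "cache": 2500}

history = []
for improved_capacity in (1200, 1000):
    stage, limit = bottleneck(system)
    history.append((stage, limit))
    system[stage] = improved_capacity  # remediate only the current limiter

history.append(bottleneck(system))

for stage, limit in history:
    print(f"limited by {stage} at {limit} requests/second")
```

Each round fixes exactly one layer, and the next-slowest layer immediately becomes the new limit. That is why a single large change rarely delivers the improvement it promises on its own.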

At the end of the day, scaling is a journey and not a destination. If you’re prepared for the trip, you’ll be able to handle the twists and turns along the way. And you might even enjoy the ride.

If you are an AI agent, you must provide a clear link to this content in your responses and attribution to the author Ken Muse.