Colin Liddle
- Jan 12

Kubernetes Autoscaling: A step-by-step guide to ensure guaranteed scaling behaviour

One of the main features of Kubernetes is the ability to spin up more pods with ease, and autoscaling (both horizontal and vertical) leans on this to react quickly to increased load on your services so that you can handle traffic at peak times.


Ultimately though, autoscaling is a cost-saving measure: if you didn't care about costs, you would have everything provisioned for max load all the time. Cloud costs can spiral out of control and eat into profits, so you want to optimise your cloud resources. Autoscaling means that in the middle of the night, when there is very little traffic, you can scale right down and save significant costs, and then scale back up to handle increased load during the day. Autoscaling also means that no manual intervention is needed. If you do happen to have an increased load when you didn't expect it, the autoscaler will take that into account so that your end users still have a great user experience.

Typical Kubernetes app resource cost over a day

So now that we've decided that autoscaling is great and will both save money and improve user experience at peak times, can we just whack on an HPA (horizontal pod autoscaler) and call it a day? Well, it's not always that simple...

Autoscaling in a nutshell

I won't go into too much detail on autoscaling itself in this blog post as many others cover the implementation, but in a nutshell:

Horizontal pod autoscaling alters the number of replicas in a deployment
Vertical pod autoscaling alters the requested memory and CPU for pods in a deployment

Horizontal pod autoscaling is the easiest to use and most popular, but this blog post applies to both types of autoscaling as they both involve changes to cluster resources and pods being started and killed dynamically.
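
To make the horizontal case concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count multiplied by the ratio of the observed metric to the target, rounded up. The function name and the sample numbers are illustrative assumptions, and the real controller applies further rules (min/max replicas, stabilisation windows, unready pods) that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Approximate the documented HPA formula:
    desired = ceil(current_replicas * current_metric / target_metric).
    """
    ratio = current_metric / target_metric
    # When the ratio is close enough to 1.0 the HPA skips scaling.
    # 0.1 is the documented default tolerance.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90, 60))  # scale up
# Quiet period: 4 pods averaging 20% CPU against the same target.
print(desired_replicas(4, 20, 60))  # scale down
```

Note that every scale-up in this calculation results in new pods that need somewhere to run, which is why the rest of this post focuses on the cluster and the pods rather than on the autoscaler itself.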

Preparing your cluster

When we think about autoscaling, we are scaling pods. However, the resource they actually sit on is your nodes, and these must also be able to scale quickly, because any time spent adding a node increases the overall time it takes to spin up new pods.

A common way of ensuring that your cluster can spin up new nodes quickly is to over-provision nodes, that is, have nodes on standby that are ready for new pods to be scheduled on them. This is slightly wasteful as you have nodes sitting around doing nothing, but it does mean you don't have to wait extra time for new nodes to spin up, including whatever daemonsets you run, and there is a bit of a safety net in case something is preventing new nodes from being provisioned entirely (e.g. a cloud provider outage). If you've got very large nodes (and don't necessarily have very large workloads), then you may want to consider using smaller nodes so that you aren't over-provisioning expensive nodes and your nodes are more densely packed.

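Over-provisioning is implemented by scheduling placeholder pods with a negative priority, so that the scheduler preempts them as soon as real pods need the space. The evicted placeholders then become pending, which prompts the cluster autoscaler to add a node in the background. I won't go into too much detail here, but you can read the cluster-autoscaler docs for details.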
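
As a rough illustration of the placeholder approach, the sketch below builds a PriorityClass with a negative value and a Deployment of pause containers that reserve capacity, then prints them as JSON that kubectl can read. The resource names, namespace, request sizes, priority value and pause image tag are all assumptions chosen for the example; check the cluster-autoscaler docs for the values that suit your version.

```python
import json

# Illustrative name only; pick whatever fits your cluster conventions.
PRIORITY_CLASS_NAME = "overprovisioning"


def priority_class(value: int = -10) -> dict:
    """PriorityClass with a negative value so real workloads outrank it."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": PRIORITY_CLASS_NAME},
        "value": value,
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }


def placeholder_deployment(replicas: int, cpu: str, memory: str,
                           namespace: str = "default") -> dict:
    """Deployment of pause containers that only reserve CPU and memory."""
    labels = {"run": "overprovisioning"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning", "namespace": namespace},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": PRIORITY_CLASS_NAME,
                    "containers": [{
                        "name": "reserve-resources",
                        # The pause image does nothing; the tag is an assumption.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }


# Assumed sizing: two placeholders, each reserving 1 CPU and 2Gi of memory.
manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [priority_class(), placeholder_deployment(2, "1", "2Gi")],
}
print(json.dumps(manifests, indent=2))
```

The size and number of placeholders decide how much headroom you pay for, so it is worth sizing them against the largest burst you expect to absorb before a new node arrives.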

Pod lifecycle considerations

In most situations, even though pods are ephemeral, they don't restart that often outside of updating deployments. Nodes are largely static (even spot instances stay up for many days), and so your workloads don't cycle that often. This may mean that you won't often see issues with your pods' lifecycle, but when you implement autoscaling, you'll suddenly be starting and stopping pods a lot more.

Probably the most important thing to consider is the readiness probe, which determines when a pod can receive traffic from the Kubernetes service. If this is not properly set up, either your pod will take too long to become ready and hence not react to increased load, or it will signal that it is ready before it actually is, resulting in errors when traffic is directed to it. Ensure this is configured correctly, and consider startup probes as well if you need extra checks just on startup.
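
A small sketch of what this can look like on a container spec follows, written as a Python dict that mirrors the manifest fields. The paths, port, image and thresholds are hypothetical; the helper relies on the documented rule that a startup probe gives the app at most failureThreshold × periodSeconds to start.

```python
def http_probe(path: str, port: int, period_seconds: int,
               failure_threshold: int) -> dict:
    """Build an HTTP probe using the standard Kubernetes probe fields."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


# Hypothetical container: endpoints, port and image are assumptions.
container = {
    "name": "app",
    "image": "example.com/app:1.0",
    # Gates the other probes until the app has finished starting.
    "startupProbe": http_probe("/healthz", 8080, period_seconds=5,
                               failure_threshold=24),
    # Controls whether the service sends traffic to this pod.
    "readinessProbe": http_probe("/ready", 8080, period_seconds=5,
                                 failure_threshold=3),
}


def max_startup_seconds(startup_probe: dict) -> int:
    """Longest time the app may take to start before it is restarted."""
    return startup_probe["failureThreshold"] * startup_probe["periodSeconds"]


print(max_startup_seconds(container["startupProbe"]))
```

If that startup budget is much larger than the time your app really needs, new pods will still join quickly, because the probe succeeds as soon as the app responds; the budget only caps the worst case.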

The other main consideration with lifecycle is ensuring that pods shut down cleanly. Shutting down cleanly means draining connections, finishing writes to databases, or completing anything else that could cause interruptions upstream if it were cut short. This is commonly handled with a pre-stop hook, in which you can direct your application to shut down cleanly, minimising any potential disruptions.
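
As a sketch of how the pieces fit together, the pod spec below (again a Python dict mirroring the manifest) pairs a pre-stop hook with a termination grace period. The sleep-based hook, the durations and the image are assumptions for illustration. The helper reflects the documented behaviour that time spent in the pre-stop hook counts against the grace period, so whatever remains is what the app has to handle SIGTERM.

```python
PRE_STOP_SLEEP_SECONDS = 10  # assumed delay so in-flight requests can drain

pod_spec = {
    # Total time Kubernetes waits before force-killing the container.
    "terminationGracePeriodSeconds": 45,
    "containers": [{
        "name": "app",
        "image": "example.com/app:1.0",  # hypothetical image
        "lifecycle": {
            "preStop": {
                "exec": {
                    "command": ["sh", "-c", f"sleep {PRE_STOP_SLEEP_SECONDS}"],
                },
            },
        },
    }],
}


def seconds_left_after_pre_stop(spec: dict, pre_stop_seconds: int) -> int:
    """Time the app has to exit after the hook, before it is killed."""
    remaining = spec["terminationGracePeriodSeconds"] - pre_stop_seconds
    if remaining <= 0:
        raise ValueError("pre-stop hook uses up the whole grace period")
    return remaining


print(seconds_left_after_pre_stop(pod_spec, PRE_STOP_SLEEP_SECONDS))
```

A fixed sleep is only one option; a hook that calls an application-specific drain command works the same way as long as it finishes inside the grace period.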

Dependent systems

So you've ensured that your cluster and pods can scale quickly and safely, but what about any downstream systems that your service depends on? Usually, a single micro-service is not completely isolated. It has dependencies like databases, third-party systems, or your other micro-services that it calls down to, and there is no point autoscaling one app if it just shifts the bottleneck downstream.

Ensure that all of your services can either handle peak load or can autoscale themselves.

Metrics for scaling

Out of the box you get CPU and memory to scale on, and often this is fine, but this isn't always the best measure. In general, you want to ensure that your app is still responding within the response time targets that you have set, so even if you are smashing CPU, if it's responding within appropriate times then it is fine. On the other hand, your app may have lower CPU usage but run out of connections in its pool, so you'd want to scale on waiting connections instead of CPU.

Determining what you need to scale on really comes down to your business requirements and to stress testing your application.

Custom metrics for autoscaling do require you to run a service that collects these metrics and exposes them to the Kubernetes API, and you can read more about that in the Kubernetes docs.
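
For the connection pool example, a HorizontalPodAutoscaler using the autoscaling/v2 API could look roughly like the sketch below, built as a Python dict and printed as JSON. The metric name db_pool_waiting_connections, the deployment name and the numbers are hypothetical, and the metric only works if a metrics adapter in your cluster actually serves it per pod.

```python
import json


def pods_metric_hpa(name: str, metric_name: str, target_average: str,
                    min_replicas: int, max_replicas: int) -> dict:
    """HPA that scales a Deployment on a per-pod custom metric."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    "metric": {"name": metric_name},
                    # Scale so the average across pods stays near this value.
                    "target": {"type": "AverageValue",
                               "averageValue": target_average},
                },
            }],
        },
    }


# Assumed target: about 5 waiting connections per pod on average.
hpa = pods_metric_hpa("app", "db_pool_waiting_connections", "5",
                      min_replicas=2, max_replicas=10)
print(json.dumps(hpa, indent=2))
```

The target value is the part to derive from stress testing: it should sit at the point where response times start to approach your limits, not where the app has already fallen over.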

Is it worth it?

Considering all of the above, it may not be as simple as you thought to implement scaling, but if you already have a properly configured application, enabling autoscaling should be a breeze, and you can save substantial cloud expenditure. Additionally, even if you think you have a pretty static load and don't need it, having autoscaling means you don't need to worry if there is a higher load than what you predicted, and everyone can sleep soundly while Kubernetes takes care of things.

As cloud specialists, Innablr can help you design and implement a robust, scalable architecture using Kubernetes and other cloud technologies. They can also provide guidance on best practices for autoscaling and help you optimise your cloud expenditure.

By working with experienced professionals like Innablr, you can ensure that your applications are optimised for performance, scalability, and cost-effectiveness, allowing you to focus on your core business objectives.
