Istio Service Mesh: Use Sidecar to partition the mesh and save cost

Saurabh Pandit
Jan 11, 2024
4 min read

Updated: Jan 12, 2024

Over the last few years, the service mesh ecosystem has evolved considerably. The good part of this evolution is that it is now easy to install a service mesh like Istio on Kubernetes, and hundreds of blogs walk through how to do so step by step. Hardly any, however, talk about the issues of running the mesh, especially as your cluster grows. I am going to explore one such problem I faced as our cluster grew to about 2000 services.

Istio Service Mesh - Why?

Imagine your application as a bustling city. Each building is a microservice, a small, independent piece of software that works together with others to provide a service. But just like in a city, these microservices need a way to communicate and collaborate efficiently. This is where Istio comes in.

Istio service mesh is a sophisticated traffic management system for your microservices. It lets you control service-to-service communication in a distributed application, covering load balancing, failure recovery, security, monitoring and much more.

As mentioned before, there are many blogs and resources if you wish to delve further, and I recommend starting with the Istio Docs. So let's get into the actual issue we encountered.

Memory hungry istio-proxy in large clusters

The Istio data plane is made up of istio-proxy containers (Envoy) deployed as sidecars. The istio-proxy intercepts and manages all network communication to and from the application container. The control plane manages the Envoy config applied to the istio-proxy containers.

Over the few months after the cluster was stood up, teams deployed various microservices. Within a year or so we were looking at ~1500 microservices across the cluster. As we deployed more and more services on our Kubernetes cluster, we noticed istio-proxy consuming a lot more memory than before. Even for the same pod running the same version of a microservice, memory consumption kept going up.

To demonstrate the issue, I created a GKE cluster and deployed 5 sample nginx applications.

The namespaces are named ns<i> (ns1, ns2 and so on) and each contains a deployment named nginx-deployment. I deployed the same deployment in each of these namespaces just to demonstrate the issue.

Below is the memory consumption of an istio-proxy container in the ns1 namespace.
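A setup like this could be described with manifests along the following lines. This is a sketch and not the exact manifests used for the demo: the istio-injection label, the app: nginx labels, the image tag and the Service are all assumptions, added so that the example is complete and the mesh has a service to discover.

```yaml
# Hypothetical manifests for one demo namespace; repeat with ns2, ns3, ...
apiVersion: v1
kind: Namespace
metadata:
  name: ns1
  labels:
    # Assumption: automatic sidecar injection is enabled per namespace
    istio-injection: enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: ns1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25   # assumed tag
        ports:
        - containerPort: 80
---
# Assumption: each application is exposed through a ClusterIP Service
apiVersion: v1
kind: Service
metadata:
  name: nginx-deployment
  namespace: ns1
spec:
  selector:
    app: nginx
  ports:
  - name: http
    port: 80
    targetPort: 80
```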

Memory usage by container when mesh has smaller footprint

I then deployed the same microservice 400 times, all the way to ns400.

Here is the memory consumption of the same istio-proxy container in the ns1 namespace. Memory consumption had more than doubled.

Memory usage by container when mesh has larger footprint

The increase in memory consumption by istio-proxy is a common issue in larger deployments, because the configuration each proxy has to hold grows with the number of services. In the above example the numbers may not be as alarming, but as the number and complexity of services increases, memory consumption can easily go above 1 GiB. That is a lot for a proxy that does not even cache the request or response in its default configuration.

Diagnosis

The Istio control plane discovers all the services deployed in the mesh, builds the config required to reach every one of them, and pushes it to all istio-proxy containers. Even if Kubernetes network policies are used to isolate certain workloads or groups of workloads, Istio does not take this into account while building the istio-proxy configuration. Each proxy therefore holds config for the whole mesh, so its memory use grows with the number of services in the mesh and not with what the workload actually calls.

This config can be inspected using the istioctl command, for example with its proxy-config subcommand.

e.g. sample config from istio-proxy container in ns1 namespace

Solution

Istio provides a way to fine-tune the mesh or Envoy config: the Sidecar resource. The Istio documentation does not emphasise using the Sidecar resource to relieve memory pressure, but it certainly helps. The documentation does mention that a Sidecar resource can restrict the set of services a proxy can reach when forwarding outbound traffic from workload instances, but it presents this more from a security standpoint.

In our demo setup, I deployed the Sidecar resource below in all namespaces. It restricts outbound traffic to services in the ns1-ns5 namespaces only.
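A Sidecar resource expressing that restriction might look like the following. Treat it as a sketch under assumptions and not the exact manifest from the demo: the resource name default and the networking.istio.io/v1beta1 API version are my choices, and the same resource would be repeated with a different metadata.namespace for every namespace.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default        # assumed name
  namespace: ns1       # one copy per namespace (ns1 ... ns400)
spec:
  # No workloadSelector: applies to every workload in this namespace
  egress:
  - hosts:
    # Format is <namespace>/<dnsName>; "*" matches every service in it
    - "ns1/*"
    - "ns2/*"
    - "ns3/*"
    - "ns4/*"
    - "ns5/*"
```

With this in place, the control plane only pushes config for services in the listed namespaces to the proxies in ns1. Many setups also add "istio-system/*" to the host list so that workloads can still reach components running there; whether that is needed depends on your installation.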

You can deploy a default Sidecar resource in the istio-system namespace that applies to all other namespaces. However, remember that this default resource cannot include a workload selector.
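A mesh-wide default of this kind could be sketched as below. It assumes istio-system is the root namespace (the Istio default) and limits every proxy to services in its own namespace plus istio-system; the host list is an illustration, not the configuration used in the demo.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: istio-system   # assumed root namespace
spec:
  # A mesh-wide default must not set workloadSelector
  egress:
  - hosts:
    - "./*"              # services in the workload's own namespace
    - "istio-system/*"   # services in the istio-system namespace
```

A Sidecar resource defined inside a namespace takes precedence over this default for the workloads in that namespace.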

After applying these Sidecar resources, here is the memory consumption graph.

Memory usage by container when mesh is partitioned

Recommendation

I would recommend thinking about partitioning the mesh from the start; don’t rely on network policy alone to isolate services. A good baseline is to isolate each namespace and open outgoing traffic to other namespaces only as required. It is also possible to create and manage this Sidecar resource in each namespace automatically by writing a Kubernetes operator and a CRD.
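As an illustration of "isolate the namespace, then open up as required", a per-namespace Sidecar could start from the namespace itself and list each extra dependency explicitly. The namespaces here are hypothetical: the sketch assumes workloads in ns1 need to call services in ns2 and nothing else.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: ns1
spec:
  egress:
  - hosts:
    - "./*"              # everything in ns1 itself
    - "istio-system/*"   # assumed to be required by the installation
    - "ns2/*"            # hypothetical dependency, added on request
```

Each new cross-namespace dependency then becomes a one-line, reviewable change, which is also the kind of change an operator could generate.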

While security is one benefit, restricting outbound traffic for your Istio proxies can also reduce their memory footprint. This allows you to run pods with lower memory requests, resulting in node cost savings. With smaller memory needs per pod, you may be able to fit more pods on each node, potentially reducing the total number of nodes needed and therefore the overall node cost.
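Once the proxy footprint has dropped, the memory request for istio-proxy can be lowered per workload through pod annotations. The fragment below is a sketch of the pod template section of a Deployment; the values are placeholders and should be derived from the memory usage you actually observe after partitioning.

```yaml
# Fragment of a Deployment spec, not a complete manifest
spec:
  template:
    metadata:
      annotations:
        # Memory request and limit for the injected istio-proxy container
        sidecar.istio.io/proxyMemory: "128Mi"        # placeholder value
        sidecar.istio.io/proxyMemoryLimit: "256Mi"   # placeholder value
```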

About Innablr

Finally, Innablr is a Kubernetes Certified Service Provider and leading consultancy for cloud native, Kubernetes, and serverless technologies. Frequently championing community events, delivering thought leadership and leading practices, Innablr is recognised in the Australian market as one of the most experienced providers of Kubernetes solutions.

Continuing our successful approach of building repeatable and extensible frameworks, Innablr has built a blueprint for Google Cloud and Amazon Web Services Kubernetes deployment whether it is Google Kubernetes Engine (GKE) or Elastic Kubernetes Service (EKS).

Get in touch to learn more about how we’ve been helping businesses innovate with Kubernetes.

Saurabh Pandit, Lead Engineer @ Innablr
