The average Kubernetes cluster we inherit is running at around 20–35% CPU utilisation and 40–50% memory utilisation. In other words, for every dollar of compute doing useful work, the team is paying two to four. Cloud bills are not a pricing problem; they're an engineering problem.
This playbook covers the interventions we apply, in rough priority order, when taking over or auditing a production Kubernetes environment. We consistently recover 50–65% of compute spend without touching application code or SLOs.
Step 0: Get Visibility With OpenCost
You cannot optimise what you cannot see. Before touching anything, install OpenCost (or Kubecost if you want the commercial edition) to get per-namespace, per-workload cost visibility.
# Add the OpenCost chart repository first
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  --set opencost.exporter.cloudProviderApiKey="YOUR_KEY"
OpenCost gives you:
- Cost allocation per deployment, namespace, label, and team.
- Efficiency scores (requested vs actual utilisation).
- Idle resource cost — the single most important number for most clusters.
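To pull these numbers programmatically rather than through the UI, port-forward the OpenCost service and query its allocation API. A minimal sketch, assuming the default service name and API port from the Helm install above:

# Forward the OpenCost API locally (name and port are the chart defaults)
kubectl -n opencost port-forward service/opencost 9003:9003 &

# Per-namespace cost allocation for the last 7 days
curl -s "http://localhost:9003/allocation?window=7d&aggregate=namespace"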
Get this in front of engineering leads and finance. In our experience, seeing "a substantial monthly spend on idle pods in the staging namespace" is the fastest way to get organisation-wide buy-in for what comes next.
Step 1: Right-Size Workloads With VPA
Developers consistently over-request CPU and memory. The Vertical Pod Autoscaler (VPA) in recommendation mode analyses actual usage and tells you what your containers actually need.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-service-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Off"  # Recommendation only; don't auto-apply yet
Run VPA in Off mode for one week to collect recommendations, then review. We typically see CPU requests that can be halved and memory requests that can be cut by 30–40% with no performance regressions and no OOMKills.
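The recommendations land on the VPA object's status; a quick way to read them, using the api-service-vpa defined above:

kubectl describe vpa api-service-vpa

# Or just the recommendation block, as JSON
kubectl get vpa api-service-vpa \
  -o jsonpath='{.status.recommendation.containerRecommendations}'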
Once you're confident, switch to Auto mode for non-critical workloads. Keep stateful services on Off — VPA evicts pods to resize them, which is fine for stateless APIs but painful for databases.
Step 2: Spot and Preemptible Nodes
Spot instances (AWS) and preemptible VMs (GCP) are 60–80% cheaper than on-demand equivalents. The catch: they can be reclaimed with as little as 2 minutes' notice on AWS and 30 seconds on GCP.
The correct architecture is a mixed node pool:
- On-demand nodes: 20–30% of base capacity, used for stateful workloads, daemonsets, and system components.
- Spot nodes: 70–80% of capacity, used for stateless application pods.
Pod disruption budgets and graceful shutdown handlers (SIGTERM → drain in-flight requests → exit) are prerequisites. Most web applications tolerate this well. Batch jobs and ML training runs actually benefit — you checkpoint and resume, which improves fault tolerance anyway.
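As a concrete sketch of the first prerequisite, here is a PodDisruptionBudget that keeps a floor of ready replicas while spot nodes drain (the app label and replica floor are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2            # never voluntarily evict below two ready pods
  selector:
    matchLabels:
      app: api-service

Drains triggered by spot reclamation handlers go through the eviction API, so they honour this budget.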
Step 3: Replace Cluster Autoscaler With Karpenter
Karpenter (now a CNCF project, AWS-native but increasingly portable) replaces the Cluster Autoscaler with a fundamentally better model: instead of scaling pre-defined node groups, Karpenter provisions the exact instance type that best fits the pending pods.
Results from a recent migration (300-node EKS cluster):
- Node provisioning time: 7 minutes → 90 seconds.
- Instance type utilisation efficiency: improved by 22% (better bin packing).
- Monthly compute spend: down 18% from Karpenter alone, before other changes.
# Karpenter NodePool: mixed on-demand + spot
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
With consolidationPolicy: WhenEmptyOrUnderutilized and consolidateAfter: 30s, Karpenter consolidates aggressively: it waits only 30 seconds before acting on an underutilised node, bin-packing its pods onto fewer, fuller nodes and terminating the empties.
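The NodePool above references an EC2NodeClass named default. A minimal sketch of one, assuming the common karpenter.sh/discovery tagging convention; the IAM role name and cluster tag value are hypothetical:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest               # track the latest AL2023 AMI
  role: "KarpenterNodeRole-my-cluster"   # hypothetical node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster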
Step 4: Event-Driven Scaling With KEDA
Out of the box, HPA scales on CPU and memory. KEDA (Kubernetes Event-Driven Autoscaler) scales on external signals: queue depth, database row count, Datadog metrics, Prometheus queries, whatever actually reflects your workload.
A concrete example: a batch service that consumes messages from an SQS queue. With HPA you'd scale on CPU, which lags the actual work by several minutes. With KEDA:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor
spec:
  scaleTargetRef:
    name: queue-processor
  minReplicaCount: 0   # Scale to zero when queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456/jobs
        queueLength: "5"  # Target 5 messages per replica
        awsRegion: us-east-1
minReplicaCount: 0 is the key win: the service consumes zero compute between batch runs. For workloads with variable or bursty traffic patterns, this alone can reduce compute spend by 40–70%.
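One prerequisite the manifest above glosses over: the SQS scaler needs AWS credentials. If the workload already has an AWS identity (IRSA or EKS Pod Identity), a TriggerAuthentication can delegate to it; the resource name here is ours:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: sqs-pod-identity
spec:
  podIdentity:
    provider: aws    # reuse the pod's own AWS identity

Reference it from the trigger via authenticationRef: {name: sqs-pod-identity}.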
Step 5: Namespace Resource Quotas
Without quotas, a single misconfigured deployment can consume the entire cluster's capacity. Resource quotas enforce limits at the team or environment level:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    count/pods: "100"
Pair quotas with LimitRanges to set default requests/limits on containers that don't specify them. This matters more than it looks: once a quota constrains a resource, the API server rejects any pod that omits an explicit request or limit for it, so defaults keep careless deployments schedulable instead of broken.
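A minimal LimitRange for the same namespace; the default values are illustrative, not a recommendation:

apiVersion: v1
kind: LimitRange
metadata:
  name: staging-defaults
  namespace: staging
spec:
  limits:
    - type: Container
      defaultRequest:    # applied as requests when a container sets none
        cpu: "100m"
        memory: "128Mi"
      default:           # applied as limits when a container sets none
        cpu: "500m"
        memory: "512Mi"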
Key Takeaways
- Install OpenCost first — visibility is the foundation of all FinOps work.
- VPA in recommendation mode typically shows workloads over-requesting CPU and memory by 30–50%.
- Spot/preemptible nodes at 70–80% of capacity save 60–80% on those node costs.
- Karpenter bin-packing + consolidation is a 15–25% improvement over Cluster Autoscaler alone.
- KEDA scale-to-zero is transformative for batch and event-driven workloads.
- Namespace quotas prevent runaway consumption and enforce team accountability.