The Kubernetes Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics. Instead of manually scaling workloads when traffic increases at 2 AM, HPA handles it for you. In this guide you will learn how HPA works internally, how to configure it with CPU, memory, and custom metrics, and how to tune scaling policies for production workloads.

Prerequisites

Before configuring HPA, make sure you have:

  • A running Kubernetes cluster (v1.23+) — minikube, kind, or a managed cluster (EKS, GKE, AKS)
  • kubectl installed and configured to talk to your cluster
  • metrics-server deployed (most managed clusters include it by default)
  • Basic familiarity with Deployments and Services
  • A deployment with CPU/memory resource requests defined (HPA cannot calculate utilization without them)

Verify metrics-server is running:

kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods

If kubectl top returns metrics, you are ready to proceed.

How HPA Works

HPA runs as a control loop in the Kubernetes controller manager, checking metrics at a configurable interval (default: 15 seconds). On each iteration it:

  1. Fetches current metric values from the metrics API for all pods in the target workload
  2. Calculates the desired replica count using the formula:

$$ \text{desiredReplicas} = \lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \rceil $$

  3. Compares the desired replica count with the current replica count
  4. Scales the workload up or down if the ratio between current and desired metric values exceeds a configurable tolerance (default ±10%)

For example, if your deployment has 3 replicas running at an average of 75% CPU and your target is 50%:

$$ \text{desired} = \lceil 3 \times \frac{75}{50} \rceil = \lceil 4.5 \rceil = 5 $$

HPA would scale the deployment to 5 replicas.
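
The arithmetic above, including the default 10% tolerance, can be sketched in a few lines of Python (an illustrative model of the published formula, not the actual controller code):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling formula with the default 10% tolerance."""
    ratio = current_metric / target_metric
    # If the ratio is within tolerance of 1.0, HPA skips scaling entirely.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(3, 75, 50))  # 3 pods at 75% CPU, 50% target -> 5
print(desired_replicas(3, 52, 50))  # ratio 1.04 is within tolerance -> stays 3
```

The tolerance check is why a deployment sitting at 52% CPU with a 50% target does not flap between replica counts.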

Cooldown behavior: HPA uses stabilization windows to prevent flapping. By default, scale-down events use a 5-minute stabilization window — HPA picks the highest recommended replica count from the last 5 minutes. Scale-up has no default stabilization window, reacting immediately to load increases.
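
The scale-down stabilization window amounts to a rolling maximum over recent recommendations. A hypothetical sketch (timestamps in seconds; not controller source):

```python
def stabilized_recommendation(history, now, window_seconds=300):
    """Pick the highest replica recommendation within the stabilization window.

    history: list of (timestamp_seconds, recommended_replicas) tuples.
    """
    recent = [r for t, r in history if now - t <= window_seconds]
    return max(recent, default=0)

# Load dropped minutes ago, but the window still remembers the peak:
history = [(0, 10), (120, 6), (240, 4)]
print(stabilized_recommendation(history, now=270))  # -> 10
```

Only once the 10-replica recommendation ages out of the window does HPA act on the lower values.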

Configuring HPA with CPU Metrics

Quick Setup with kubectl

The fastest way to create an HPA:

kubectl autoscale deployment my-app \
  --cpu-percent=50 \
  --min=2 \
  --max=10

This tells Kubernetes: keep average CPU utilization at 50% across all pods, with a minimum of 2 and maximum of 10 replicas.

Full YAML Manifest

For production, define HPA as a YAML manifest you can version-control:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

Apply and monitor:

kubectl apply -f hpa.yaml
kubectl get hpa my-app-hpa --watch

Output looks like:

NAME         REFERENCE           TARGETS           MINPODS   MAXPODS   REPLICAS
my-app-hpa   Deployment/my-app   32%/50%, 45%/70%   2         10        3

Important: Your deployment containers must have resource requests defined, or HPA cannot compute utilization percentages:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Custom Metrics and External Metrics

CPU and memory are often insufficient for scaling decisions. A web service might have low CPU but a large request queue. HPA v2 supports several metric types; the three most commonly used are:

| Metric Type | Source               | Example                          |
| ----------- | -------------------- | -------------------------------- |
| Resource    | metrics-server       | CPU, memory utilization          |
| Pods        | custom metrics API   | requests-per-second per pod      |
| External    | external metrics API | SQS queue depth, Pub/Sub backlog |

Setting Up Custom Metrics with Prometheus Adapter

Install prometheus-adapter to expose Prometheus metrics through the Kubernetes custom metrics API:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc

Then configure an HPA using a custom metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"

Scaling on External Queue Depth

For event-driven architectures, scale based on an external metric like an SQS queue:

metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: "order-processing"
      target:
        type: Value
        value: "30"

This scales your consumer pods to keep the queue length around 30 messages.
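
Note the difference between a Value and an AverageValue target for queue metrics. With Value, the whole-queue number feeds the usual ratio formula; with AverageValue, HPA divides the metric by the target to get a replica count directly, independent of how many replicas are currently running. An illustrative sketch of both calculations:

```python
import math

def desired_for_value_target(current_replicas, metric_value, target_value):
    # Value target: scale by the ratio of the whole metric to the target.
    return math.ceil(current_replicas * metric_value / target_value)

def desired_for_average_value_target(metric_value, target_average):
    # AverageValue target: replicas needed so each pod's share hits the target.
    return math.ceil(metric_value / target_average)

# 150 messages in the queue, target 30, 2 consumers running:
print(desired_for_value_target(2, 150, 30))       # -> 10
print(desired_for_average_value_target(150, 30))  # -> 5
```

For queue consumers, AverageValue is often the more predictable choice, since the recommendation does not compound with the current replica count.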

HPA vs VPA vs Cluster Autoscaler

| Feature          | HPA                                   | VPA                                     | Cluster Autoscaler          |
| ---------------- | ------------------------------------- | --------------------------------------- | --------------------------- |
| What it scales   | Number of pod replicas                | CPU/memory requests per pod             | Number of cluster nodes     |
| Direction        | Horizontal (more pods)                | Vertical (bigger pods)                  | Horizontal (more nodes)     |
| Metrics used     | CPU, memory, custom, external         | Historical resource usage               | Pending pod scheduling      |
| Requires restart | No                                    | Yes (recreates pods)                    | No (adds nodes)             |
| Best for         | Stateless, horizontally scalable apps | Single-instance, resource-variable apps | Accommodating HPA scale-up  |
| Can combine      | Yes, with Cluster Autoscaler          | Not with HPA on same metric             | Yes, with HPA               |
| Reaction speed   | Seconds to minutes                    | Minutes (pod restart)                   | Minutes (node provisioning) |

Recommended combination: Use HPA for workload scaling + Cluster Autoscaler for infrastructure scaling. Avoid using HPA and VPA on the same metric for the same deployment — they will conflict.

Real-World Scenario

Consider an e-commerce platform that sees 10x traffic during flash sales. Your checkout-service deployment normally handles 50 requests/second with 3 replicas.

Without HPA: Your on-call engineer gets paged at midnight, manually scales to 20 replicas, then forgets to scale down. You pay for idle resources for 3 days.

With HPA configured:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

This configuration:

  • Scales up aggressively — doubling replicas every 60 seconds when CPU exceeds 60% or request rate exceeds 50 req/s per pod
  • Scales down conservatively — removing only 10% of replicas per minute with a 5-minute stabilization window
  • Prevents premature scale-down after short traffic bursts

During the flash sale, HPA scales from 3 to 25 replicas in under 5 minutes. After traffic subsides, it gradually reduces to 3 over approximately 30 minutes.
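
That trajectory can be sanity-checked with a quick simulation of the 100%-per-60-seconds scale-up policy (an illustrative sketch, assuming sustained load keeps the desired count pinned at 25):

```python
import math

def apply_scale_up(current, desired, percent_per_period=100, max_replicas=30):
    """One scale-up period under a Percent policy: grow by at most percent%."""
    limit = math.floor(current * (1 + percent_per_period / 100))
    return min(desired, limit, max_replicas)

replicas, steps = 3, []
while replicas < 25:
    replicas = apply_scale_up(replicas, desired=25)
    steps.append(replicas)
print(steps)  # [6, 12, 24, 25] -> four 60-second periods, under 5 minutes
```

Doubling each period reaches the target in four periods, which matches the "under 5 minutes" figure above.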

Scaling Policies and Behavior

The behavior field (autoscaling/v2) gives you fine-grained control over scaling speed:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    selectPolicy: Min

Key fields:

  • stabilizationWindowSeconds — how long HPA looks back to pick the most conservative recommendation. Set to 0 for immediate scale-up; 300 (5 min) for cautious scale-down.
  • policies — define scaling rate as a percentage of current replicas or an absolute number of pods per time period.
  • selectPolicy — Max picks the policy that allows the most change (aggressive), Min picks the most conservative, and Disabled prevents scaling in that direction.
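
How selectPolicy arbitrates between multiple policies can be sketched as follows (a hypothetical helper, not controller code; Percent policies are evaluated against the current replica count):

```python
import math

def allowed_scale_up(current, policies, select_policy="Max"):
    """Compute how many pods one period may add under a set of policies."""
    if select_policy == "Disabled":
        return 0
    changes = []
    for ptype, value in policies:
        if ptype == "Percent":
            changes.append(math.floor(current * value / 100))
        elif ptype == "Pods":
            changes.append(value)
    return max(changes) if select_policy == "Max" else min(changes)

policies = [("Percent", 100), ("Pods", 4)]
print(allowed_scale_up(10, policies, "Max"))  # 100% of 10 beats 4 pods -> 10
print(allowed_scale_up(10, policies, "Min"))  # conservative choice -> 4
```

At small replica counts the Pods policy dominates under Max; at large counts the Percent policy does, which is why the two are commonly paired.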

Common Policy Patterns

Fast scale-up, slow scale-down (recommended for most web apps):

scaleUp:
  stabilizationWindowSeconds: 0
  policies:
    - type: Percent
      value: 100
      periodSeconds: 15
scaleDown:
  stabilizationWindowSeconds: 300
  policies:
    - type: Pods
      value: 1
      periodSeconds: 60

Prevent scale-down entirely (batch jobs):

scaleDown:
  selectPolicy: Disabled

Gotchas and Edge Cases

Metrics delay: metrics-server scrapes data every 15 seconds, and HPA evaluates every 15 seconds, so there can be a 15–30 second gap between a load spike and the first scaling decision. For latency-sensitive services, consider setting a higher minReplicas baseline.

Missing resource requests: If your containers do not define CPU or memory requests, HPA reports <unknown> for utilization and will not scale. Always define requests.

Readiness probes matter: HPA counts only Ready pods when calculating metrics. If new pods take 60 seconds to become ready, HPA may over-provision because existing pods remain overloaded during startup. Tune readiness probes and consider startup probes for slow-starting applications.
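
For slow-starting applications, a startup probe prevents the readiness probe from flapping during boot without loosening it for steady state. A sketch with hypothetical endpoints and timings (tune to your application):

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # tolerate up to 150s (30 x 5s) of startup time
  periodSeconds: 5
```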

PodDisruptionBudget conflicts: A PDB that allows zero unavailable pods can block scale-down if HPA tries to remove replicas during a voluntary disruption. Ensure your PDB minAvailable is below your HPA minReplicas.

Multiple metrics interaction: When HPA evaluates multiple metrics, it calculates the desired replica count for each metric independently and picks the highest value. This means a single metric spike can trigger scale-up even if other metrics are low.
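
The multiple-metric rule is simply a max over per-metric recommendations (an illustrative sketch using the scaling formula):

```python
import math

def desired_across_metrics(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs, one per metric."""
    return max(math.ceil(current_replicas * cur / tgt) for cur, tgt in metrics)

# CPU is quiet (30% vs 60% target) but request rate spiked (120 vs 50 per pod):
print(desired_across_metrics(4, [(30, 60), (120, 50)]))  # -> 10
```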

HPA and Cluster Autoscaler latency: HPA creates new pods immediately, but if no nodes have capacity, pods stay Pending until Cluster Autoscaler provisions new nodes (typically 2–5 minutes on cloud providers). Factor this into your scaling strategy.

Object metric vs pods metric: Use Pods type when the metric is per-pod (request rate). Use Object type when the metric comes from a single Kubernetes object like an Ingress.
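
As an illustration of the Object type, an HPA can target a metric reported for a single Ingress (a sketch; the metric name and Ingress name are hypothetical and must match what your metrics adapter actually exposes):

```yaml
metrics:
  - type: Object
    object:
      metric:
        name: requests_per_second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: "2k"
```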

Summary

  • HPA automatically scales pod replicas based on CPU, memory, custom, or external metrics
  • The scaling formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)
  • Always define CPU and memory requests in your containers — HPA cannot work without them
  • Use autoscaling/v2 API for multi-metric scaling and behavior policies
  • Configure aggressive scale-up and conservative scale-down for production workloads
  • Combine HPA with Cluster Autoscaler for full elastic scaling — avoid combining HPA and VPA on the same metric
  • Custom metrics via prometheus-adapter unlock scaling on business metrics like request rate or queue depth
  • Test your HPA configuration under realistic load before relying on it in production