The Kubernetes Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics. Instead of manually scaling workloads when traffic increases at 2 AM, HPA handles it for you. In this guide you will learn how HPA works internally, how to configure it with CPU, memory, and custom metrics, and how to tune scaling policies for production workloads.
Prerequisites
Before configuring HPA, make sure you have:
- A running Kubernetes cluster (v1.23+) — minikube, kind, or a managed cluster (EKS, GKE, AKS)
- `kubectl` installed and configured to talk to your cluster
- `metrics-server` deployed (most managed clusters include it by default)
- Basic familiarity with Deployments and Services
- A deployment with CPU/memory resource `requests` defined (HPA cannot calculate utilization without them)
Verify metrics-server is running:
```shell
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods
```
If `kubectl top` returns metrics, you are ready to proceed.
How HPA Works
HPA runs as a control loop in the Kubernetes controller manager, checking metrics at a configurable interval (default: 15 seconds). On each iteration it:
- Fetches current metric values from the metrics API for all pods in the target workload
- Calculates the desired replica count using the formula:
$$ \text{desiredReplicas} = \lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \rceil $$
- Compares desired replicas with current replicas
- Scales the workload up or down if the ratio exceeds a configurable tolerance (default ±10%)
For example, if your deployment has 3 replicas running at an average of 75% CPU and your target is 50%:
$$ \text{desired} = \lceil 3 \times \frac{75}{50} \rceil = \lceil 4.5 \rceil = 5 $$
HPA would scale the deployment to 5 replicas.
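The core calculation can be sketched in a few lines of Python. This is a simplification: the real controller also accounts for pod readiness, missing metrics, and the min/max replica bounds.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Sketch of the HPA core formula: ceil(current * current/target),
    skipping the change when the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas  # within +/-10%: no scaling
    return math.ceil(current_replicas * ratio)

print(desired_replicas(3, 75, 50))  # 5, matching the worked example above
print(desired_replicas(3, 52, 50))  # 3: 4% over target is within tolerance
```

Note how the tolerance band keeps small metric wobbles from triggering constant replica churn.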
Cooldown behavior: HPA uses stabilization windows to prevent flapping. By default, scale-down events use a 5-minute stabilization window — HPA picks the highest recommended replica count from the last 5 minutes. Scale-up has no default stabilization window, reacting immediately to load increases.
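The scale-down stabilization can be sketched as follows; this is a simplification of the real controller, which timestamps each recommendation and expires old ones as the window slides.

```python
def stabilized_scale_down(recent_recommendations: list[int],
                          new_recommendation: int) -> int:
    """Scale-down stabilization, sketched: the effective desired count is
    the highest recommendation seen within the window, so brief dips in
    load don't shrink the workload."""
    return max(recent_recommendations + [new_recommendation])

# Recommendations over the last 5 minutes: load dipped briefly, then returned.
print(stabilized_scale_down([8, 8, 3, 4], 5))  # 8: HPA holds the high-water mark
```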
Configuring HPA with CPU Metrics
Quick Setup with kubectl
The fastest way to create an HPA:
```shell
kubectl autoscale deployment my-app \
  --cpu-percent=50 \
  --min=2 \
  --max=10
```
This tells Kubernetes: keep average CPU utilization at 50% across all pods, with a minimum of 2 and maximum of 10 replicas.
Full YAML Manifest
For production, define HPA as a YAML manifest you can version-control:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
```
Apply and monitor:
```shell
kubectl apply -f hpa.yaml
kubectl get hpa my-app-hpa --watch
```
Output looks like:
```
NAME         REFERENCE           TARGETS            MINPODS   MAXPODS   REPLICAS
my-app-hpa   Deployment/my-app   32%/50%, 45%/70%   2         10        3
```
Important: Your deployment containers must have resource `requests` defined, or HPA cannot compute utilization percentages:

```yaml
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```
Custom Metrics and External Metrics
CPU and memory are often insufficient for scaling decisions. A web service might have low CPU but a large request queue. HPA v2 supports three metric types:
| Metric Type | Source | Example |
|---|---|---|
| Resource | metrics-server | CPU, memory utilization |
| Pods | custom metrics API | requests-per-second per pod |
| External | external metrics API | SQS queue depth, Pub/Sub backlog |
Setting Up Custom Metrics with Prometheus Adapter
Install prometheus-adapter to expose Prometheus metrics through the Kubernetes custom metrics API:
```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc
```
Then configure an HPA using a custom metric:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```
Scaling on External Queue Depth
For event-driven architectures, scale based on an external metric like an SQS queue:
```yaml
metrics:
- type: External
  external:
    metric:
      name: sqs_queue_length
      selector:
        matchLabels:
          queue: "order-processing"
    target:
      type: Value
      value: "30"
```
This scales your consumer pods to keep the queue length around 30 messages.
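As a hedged sketch of how the two external target types translate into replica counts under the v2 semantics: a `Value` target feeds the raw metric into the usage-ratio formula, while an `AverageValue` target divides the metric total across pods. The numbers below are hypothetical.

```python
import math

def desired_for_value(current_replicas: int, metric: float, target: float) -> int:
    # type: Value -- compare the raw metric against the target via the ratio
    return math.ceil(current_replicas * metric / target)

def desired_for_average_value(metric_total: float, target_average: float) -> int:
    # type: AverageValue -- spread the metric total across pods
    return math.ceil(metric_total / target_average)

# Queue at 150 messages, 2 consumers, target Value of 30:
print(desired_for_value(2, 150, 30))       # 10
# Same queue with a target AverageValue of 30 messages per pod:
print(desired_for_average_value(150, 30))  # 5
```

The comparison shows why `AverageValue` is often the better fit for queue consumers: the recommendation scales with the backlog itself rather than being amplified by the current replica count.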
HPA vs VPA vs Cluster Autoscaler
| Feature | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| What it scales | Number of pod replicas | CPU/memory requests per pod | Number of cluster nodes |
| Direction | Horizontal (more pods) | Vertical (bigger pods) | Horizontal (more nodes) |
| Metrics used | CPU, memory, custom, external | Historical resource usage | Pending pod scheduling |
| Requires restart | No | Yes (recreates pods) | No (adds nodes) |
| Best for | Stateless, horizontally scalable apps | Single-instance, resource-variable apps | Accommodating HPA scale-up |
| Can combine | Yes, with Cluster Autoscaler | Not with HPA on same metric | Yes, with HPA |
| Reaction speed | Seconds to minutes | Minutes (pod restart) | Minutes (node provisioning) |
Recommended combination: Use HPA for workload scaling + Cluster Autoscaler for infrastructure scaling. Avoid using HPA and VPA on the same metric for the same deployment — they will conflict.
Real-World Scenario
Consider an e-commerce platform that sees 10x traffic during flash sales. Your checkout-service deployment normally handles 50 requests/second with 3 replicas.
Without HPA: Your on-call engineer gets paged at midnight, manually scales to 20 replicas, then forgets to scale down. You pay for idle resources for 3 days.
With HPA configured:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
This configuration:
- Scales up aggressively — doubling replicas every 60 seconds when CPU exceeds 60% or request rate exceeds 50 req/s per pod
- Scales down conservatively — removing only 10% of replicas per minute with a 5-minute stabilization window
- Prevents premature scale-down after short traffic bursts
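To see why the wind-down is gradual, here is a rough Python trace of a 10%-per-minute Percent scale-down policy. It ignores the 5-minute stabilization window and HPA's exact rounding, so the real wind-down takes longer than the raw policy steps counted here.

```python
import math

def scale_down_step(current: int, percent: int = 10) -> int:
    """One period under a Percent scale-down policy: remove at most
    `percent` of the current replicas (rounded up here for simplicity)."""
    return current - math.ceil(current * percent / 100)

# Trace the flash-sale wind-down from 25 replicas toward minReplicas=3.
replicas, minutes = 25, 0
while replicas > 3:
    replicas = max(3, scale_down_step(replicas))
    minutes += 1
print(replicas, minutes)  # reaches 3 after 13 policy periods
```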
During the flash sale, HPA scales from 3 to 25 replicas in under 5 minutes. After traffic subsides, it gradually reduces to 3 over approximately 30 minutes.
Scaling Policies and Behavior
The `behavior` field (`autoscaling/v2`) gives you fine-grained control over scaling speed:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
    selectPolicy: Min
```
Key fields:
- `stabilizationWindowSeconds` — how long HPA looks back to pick the most conservative recommendation. Set to 0 for immediate scale-up; 300 (5 min) for cautious scale-down.
- `policies` — define scaling rate as a percentage of current replicas or an absolute number of pods per time period.
- `selectPolicy` — `Max` picks the policy that allows the most change (aggressive), `Min` picks the most conservative, `Disabled` prevents scaling in that direction.
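A minimal sketch of how selectPolicy arbitrates between policies; the helper below is hypothetical, not the controller's actual code, and it ignores the per-policy `periodSeconds` bookkeeping.

```python
import math

def allowed_change(current: int, policies: list[dict], select: str = "Max") -> int:
    """Each policy yields a maximum replica change per period; Max takes
    the most permissive result, Min the most restrictive."""
    changes = []
    for p in policies:
        if p["type"] == "Percent":
            changes.append(math.ceil(current * p["value"] / 100))
        else:  # Pods
            changes.append(p["value"])
    return max(changes) if select == "Max" else min(changes)

policies = [{"type": "Percent", "value": 100}, {"type": "Pods", "value": 4}]
print(allowed_change(3, policies, "Max"))   # 4: adding 4 pods beats +100% (3)
print(allowed_change(10, policies, "Max"))  # 10: +100% now beats 4 pods
```

This is why a Percent policy is often paired with a Pods policy: the absolute-pods rule dominates at small replica counts, and the percentage rule takes over as the deployment grows.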
Common Policy Patterns
Fast scale-up, slow scale-down (recommended for most web apps):
```yaml
scaleUp:
  stabilizationWindowSeconds: 0
  policies:
  - type: Percent
    value: 100
    periodSeconds: 15
scaleDown:
  stabilizationWindowSeconds: 300
  policies:
  - type: Pods
    value: 1
    periodSeconds: 60
```
Prevent scale-down entirely (batch jobs):
```yaml
scaleDown:
  selectPolicy: Disabled
```
Gotchas and Edge Cases
Metrics delay: metrics-server collects data every 15 seconds, and HPA evaluates every 15 seconds. This means there is a 15–30 second delay between a load spike and the first scaling decision. For latency-sensitive services, consider starting with a higher `minReplicas`.
Missing resource requests: If your containers do not define CPU or memory requests, HPA reports `<unknown>` for utilization and will not scale. Always define requests.
Readiness probes matter: HPA counts only Ready pods when calculating metrics. If new pods take 60 seconds to become ready, HPA may over-provision because existing pods remain overloaded during startup. Tune readiness probes and consider startup probes for slow-starting applications.
PodDisruptionBudget conflicts: A PDB that allows zero unavailable pods can block scale-down if HPA tries to remove replicas during a voluntary disruption. Ensure your PDB `minAvailable` is below your HPA `minReplicas`.
Multiple metrics interaction: When HPA evaluates multiple metrics, it calculates the desired replica count for each metric independently and picks the highest value. This means a single metric spike can trigger scale-up even if other metrics are low.
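That highest-wins rule can be sketched as follows; the metric values here are hypothetical pairs of (current, target).

```python
import math

def desired_from_metrics(current_replicas: int,
                         metrics: dict[str, tuple[float, float]]) -> int:
    """Each metric produces its own recommendation via the core formula;
    HPA acts on the highest one."""
    recommendations = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics.values()
    ]
    return max(recommendations)

# CPU is calm, but the request-rate metric alone drives a scale-up:
print(desired_from_metrics(4, {"cpu": (30, 60), "rps": (120, 50)}))  # 10
```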
HPA and Cluster Autoscaler latency: HPA creates new pods immediately, but if no nodes have capacity, pods stay Pending until Cluster Autoscaler provisions new nodes (typically 2–5 minutes on cloud providers). Factor this into your scaling strategy.
Object metric vs pods metric: Use the `Pods` type when the metric is per-pod (request rate). Use the `Object` type when the metric comes from a single Kubernetes object like an Ingress.
Summary
- HPA automatically scales pod replicas based on CPU, memory, custom, or external metrics
- The scaling formula is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`
- Always define CPU and memory `requests` in your containers — HPA cannot work without them
- Use the `autoscaling/v2` API for multi-metric scaling and behavior policies
- Configure aggressive scale-up and conservative scale-down for production workloads
- Combine HPA with Cluster Autoscaler for full elastic scaling — avoid combining HPA and VPA on the same metric
- Custom metrics via prometheus-adapter unlock scaling on business metrics like request rate or queue depth
- Test your HPA configuration under realistic load before relying on it in production