Kubernetes pods can fail in many ways, and each failure state tells a different story. Whether you are facing CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled, or other error states, knowing how to systematically diagnose and fix these issues is essential for any Kubernetes operator. This guide walks you through the most common pod failure states, the kubectl commands to diagnose them, and proven strategies to resolve each one.
## Prerequisites
- A running Kubernetes cluster (v1.24 or later recommended)
- kubectl installed and configured with cluster access
- Basic understanding of Kubernetes objects (pods, deployments, services)
- Permissions to read pods, events, and node resources in your target namespace
- Familiarity with container concepts (images, registries, resource limits)
## Understanding Pod Lifecycle and States
Before diving into troubleshooting, it helps to understand the Kubernetes pod lifecycle. A pod moves through several phases:
| Phase | Description |
|---|---|
| Pending | Pod accepted by the cluster but one or more containers are not yet running |
| Running | Pod bound to a node and all containers started |
| Succeeded | All containers terminated successfully (exit code 0) |
| Failed | All containers terminated and at least one exited with an error |
| Unknown | Pod state cannot be determined, usually due to node communication failure |
Within these phases, containers can enter specific waiting states that indicate what went wrong. These are the status messages you see in kubectl get pods output — and they are your first diagnostic clue.
## Common Pod Failure States

### CrashLoopBackOff
CrashLoopBackOff is the most common pod failure you will encounter. It means the container starts, crashes, and Kubernetes keeps restarting it with increasing delays (10s, 20s, 40s, up to 5 minutes).
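The delay schedule is plain doubling with a cap, which can be sketched in a few lines of shell (an illustration of the documented 10s-to-5m progression, not Kubernetes source code):

```shell
# CrashLoopBackOff restart delays: double after each crash, capped at 300s (5 minutes).
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart ${restart}: wait ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```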
Common causes:
- Application error causing immediate exit (missing config, unhandled exception)
- Missing environment variables or mounted secrets
- Incorrect command or entrypoint in the container spec
- Health check (liveness probe) failing too aggressively
- Dependency on a service that is not available
Diagnostic commands:
```shell
# Check the pod status and restart count
kubectl get pods -o wide

# View the last container's logs
kubectl logs <pod-name> --previous

# Check events for the pod
kubectl describe pod <pod-name>
```
The --previous flag is critical — without it, you may get empty or partial logs because the new container instance just started. The describe output shows the Last State with the exit code, which tells you whether the process crashed (exit code 1) or was killed (exit code 137 for OOM, 143 for SIGTERM).
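Those exit codes follow the standard Unix convention of 128 plus the signal number, which you can verify in any local shell:

```shell
# A process killed by signal N exits with status 128 + N.
sh -c 'kill -KILL $$' || kill_code=$?   # SIGKILL (9)  -> 128 + 9  = 137, what the OOM killer sends
sh -c 'kill -TERM $$' || term_code=$?   # SIGTERM (15) -> 128 + 15 = 143, normal shutdown signal
echo "SIGKILL exit: $kill_code, SIGTERM exit: $term_code"
```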
### ImagePullBackOff
ImagePullBackOff means Kubernetes cannot download the container image from the registry. The pod stays in this state, retrying with exponential backoff.
Common causes:
- Typo in the image name or tag
- Image tag does not exist (e.g., referencing latest when only versioned tags are pushed)
- Missing or expired imagePullSecrets
- Private registry with no credentials configured
- Network policy or firewall blocking access to the registry
- Registry rate limits (Docker Hub throttling)
Diagnostic commands:
```shell
# Check the exact image reference
kubectl describe pod <pod-name> | grep -A 5 "Image:"

# Look for pull errors in events
kubectl get events --field-selector involvedObject.name=<pod-name>

# Verify imagePullSecrets exist
kubectl get secrets -n <namespace>
```
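When the cause is registry credentials, the fix is a pull secret referenced from the pod spec. A minimal sketch — the pod name, image, registry, and secret name are all placeholders:

```yaml
# Illustrative only: substitute your own registry, image tag, and secret name.
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
spec:
  imagePullSecrets:
    - name: regcred                 # must exist in the same namespace as the pod
  containers:
    - name: app
      image: registry.example.com/payment-service:1.4.2   # pin a version, avoid latest
      imagePullPolicy: IfNotPresent # reuse the node's cached image when present
```

The referenced secret is typically created with kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<password> in the pod's namespace.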
### Pending
A Pending pod means the scheduler has not yet assigned it to a node. This can persist indefinitely if the underlying issue is not resolved.
Common causes:
- Insufficient CPU or memory across all nodes
- Node selectors, taints, or affinities that no node satisfies
- PersistentVolumeClaim (PVC) not bound — no matching PersistentVolume available
- ResourceQuota exceeded in the namespace
- Too many pods on the cluster (max-pods limit on nodes)
Diagnostic commands:
```shell
# Check why the pod is pending
kubectl describe pod <pod-name>

# View scheduler events
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check PVC status if the pod uses volumes
kubectl get pvc -n <namespace>
```
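Several of these causes come down to what the pod spec asks of the scheduler. A hypothetical spec showing the fields the scheduler evaluates — all names and values are placeholders:

```yaml
# Sketch only: requests, labels, and taint keys must match your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: example-schedulable
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      resources:
        requests:              # the scheduler places pods based on requests, not limits
          cpu: 250m
          memory: 256Mi
  nodeSelector:                # every key here must match a label on some node
    kubernetes.io/os: linux
  tolerations:                 # required only if the target nodes carry matching taints
    - key: dedicated
      operator: Equal
      value: batch
      effect: NoSchedule
```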
### OOMKilled
OOMKilled (exit code 137) means the Linux kernel’s Out-Of-Memory killer terminated the container because it exceeded its memory limit.
Common causes:
- Memory limit set too low for the application
- Memory leak in the application
- JVM heap size not aligned with container memory limit
- Sidecar containers consuming shared memory
- Loading large datasets into memory
Diagnostic commands:
```shell
# Check the termination reason
kubectl describe pod <pod-name> | grep -A 3 "Last State"

# View current memory usage (requires metrics-server)
kubectl top pod <pod-name>

# Check configured limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'
```
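A common fix is to raise the memory limit and, for JVM workloads, tie the heap size to it. A sketch with illustrative values — the image name, sizes, and heap percentage are placeholders to tune for your application:

```yaml
# Illustrative starting point, not production-ready numbers.
apiVersion: v1
kind: Pod
metadata:
  name: example-memory-tuned
spec:
  containers:
    - name: app
      image: registry.example.com/java-app:1.0.0
      resources:
        requests:
          memory: 512Mi
        limits:
          memory: 1Gi               # the OOM killer fires when usage hits this boundary
      env:
        - name: JAVA_TOOL_OPTIONS   # keep the JVM heap safely below the container limit
          value: "-XX:MaxRAMPercentage=75.0"
```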
### Other Failure States
| State | Meaning | Typical Fix |
|---|---|---|
| CreateContainerConfigError | Missing ConfigMap or Secret | Verify referenced ConfigMaps and Secrets exist |
| RunContainerError | Container runtime failure | Check security context, volume mounts, and node container runtime logs |
| Evicted | Node under resource pressure | Check node disk/memory pressure conditions and set proper resource requests |
| Init:Error | Init container failed | Check init container logs with kubectl logs <pod> -c <init-container> |
| Terminating (stuck) | Finalizers blocking deletion | Check finalizers with kubectl get pod -o json and remove if safe |
## Diagnosing Issues with kubectl
The diagnosis workflow follows a consistent pattern regardless of the failure type:
### Step 1: Get the Big Picture

```shell
# All pods in the namespace with status
kubectl get pods -n <namespace> -o wide

# Recent events sorted by time
kubectl get events --sort-by='.lastTimestamp' -n <namespace> | tail -20
```
### Step 2: Deep Dive into the Pod

```shell
# Full pod description with events
kubectl describe pod <pod-name> -n <namespace>

# Container logs (current instance)
kubectl logs <pod-name> -n <namespace>

# Container logs (previous crashed instance)
kubectl logs <pod-name> -n <namespace> --previous

# Logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>
```
### Step 3: Check the Node

```shell
# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node-name> | grep -A 10 "Conditions"

# Resource allocation on the node
kubectl describe node <node-name> | grep -A 20 "Allocated resources"
```
### Step 4: Interactive Debugging

```shell
# Exec into a running container
kubectl exec -it <pod-name> -- /bin/sh

# Use an ephemeral debug container (K8s 1.23+)
kubectl debug -it <pod-name> --image=busybox --target=<container-name>

# Run a standalone debug pod (its own pod, same cluster network) for connectivity tests
kubectl run debug --rm -it --image=busybox -- /bin/sh
```
## kubectl Commands Comparison Table
| Failure State | First Command | Key Information |
|---|---|---|
| CrashLoopBackOff | kubectl logs <pod> --previous | Application error output before crash |
| ImagePullBackOff | kubectl describe pod <pod> | Image name, pull errors, secret references |
| Pending | kubectl describe pod <pod> | Scheduler failure reason in Events section |
| OOMKilled | kubectl describe pod <pod> | Last State termination reason and exit code |
| CreateContainerConfigError | kubectl get configmap,secret -n <ns> | Missing referenced resources |
| Evicted | kubectl describe node <node> | Node resource pressure conditions |
| Init:Error | kubectl logs <pod> -c <init-container> | Init container failure logs |
## Real-World Scenario
You manage a production cluster running a microservices application. After a deployment, the payment-service pod keeps restarting and shows CrashLoopBackOff. Here is how you diagnose it:
```
$ kubectl get pods -n production
NAME                              READY   STATUS             RESTARTS   AGE
payment-service-7d4f8b9c6-x2k9m   0/1     CrashLoopBackOff   5          8m
api-gateway-5c8f7d6b4-h3j7n       1/1     Running            0          2d
user-service-6b7c8d9e5-m4n8p      1/1     Running            0          2d
```
You check the previous container’s logs:
```
$ kubectl logs payment-service-7d4f8b9c6-x2k9m --previous
2026-02-28 10:15:03 ERROR: Failed to connect to database
ConnectionRefused: tcp://db-service:5432
2026-02-28 10:15:03 FATAL: Cannot start without database connection. Exiting.
```
The application requires a database connection at startup, but db-service is unreachable. You check the service:
```
$ kubectl get svc db-service -n production
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
db-service   ClusterIP   10.96.45.123   <none>        5432/TCP   30d

$ kubectl get endpoints db-service -n production
NAME         ENDPOINTS   AGE
db-service   <none>      30d
```
No endpoints — the database pod is down. You find it was evicted due to node disk pressure:
```
$ kubectl get pods -n production | grep db
db-postgresql-0   0/1   Evicted   0   30d
```
The fix: clear disk space on the node (or add more nodes), restart the database pod, and the payment service recovers automatically. You also add a startupProbe with a generous timeout so the payment service waits for the database instead of immediately crashing.
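A startup probe like the following keeps the liveness probe from killing the container while it waits on its dependencies — note that the application must also retry its connection rather than exit. The port and timing values here are illustrative placeholders:

```yaml
# Illustrative container-spec fragment: adjust port, period, and threshold to your app.
startupProbe:
  tcpSocket:
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 300s for dependencies like the database
livenessProbe:           # only takes over once the startup probe has succeeded
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
```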
## Gotchas and Edge Cases

- Exit code 137 vs 143: Exit code 137 means the container was killed by SIGKILL (usually OOMKilled). Exit code 143 means SIGTERM (graceful shutdown). Do not confuse them — 137 requires memory investigation, 143 is usually normal during rollouts.
- CrashLoopBackOff delay: Kubernetes uses exponential backoff up to 5 minutes between restarts. If you fix the issue, you may still need to wait or delete the pod to restart it immediately.
- ImagePullPolicy: Always: If your pod spec uses imagePullPolicy: Always (the default for latest tags), every pod restart triggers an image pull. This can cause ImagePullBackOff if the registry is temporarily unreachable, even though the image was previously cached on the node.
- Resource requests vs limits: Pods are scheduled based on requests, not limits. A pod requesting 100Mi but limited to 500Mi can be OOMKilled at 500Mi even if the node has 2Gi free — the limit is enforced regardless of node capacity.
- Multi-container pods: In a pod with sidecars, kubectl logs defaults to the first container. Always specify -c <container-name> when debugging multi-container pods.
- Ephemeral storage evictions: Even if CPU and memory are fine, high ephemeral storage usage (logs, temp files) can trigger eviction. Check with kubectl describe node under Conditions.
- PVC in wrong availability zone: In cloud environments, a PVC bound to a volume in us-east-1a cannot be mounted by a pod scheduled to a node in us-east-1b. The pod stays Pending with no obvious error.
- DNS resolution lag: Newly created services may not resolve immediately. If your container crashes because it cannot resolve a service name, add a startup delay or retry logic instead of relying on instant DNS propagation.
## Summary

- CrashLoopBackOff means your container keeps crashing — check logs with --previous to see the error before the crash
- ImagePullBackOff indicates an image pull failure — verify image name, tag, and registry credentials
- Pending means the scheduler cannot place the pod — check resource availability, PVC status, and node affinity rules
- OOMKilled (exit code 137) means the container exceeded its memory limit — increase limits or optimize memory usage
- Always start with kubectl describe pod and kubectl get events to understand the failure context
- Use kubectl debug for ephemeral containers when you need interactive troubleshooting without modifying the pod spec
- Set proper resource requests, limits, and probes to prevent many common pod failures before they happen