Kubernetes pods can fail in many ways, and each failure state tells a different story. Whether you are facing CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled, or other error states, knowing how to systematically diagnose and fix these issues is essential for any Kubernetes operator. This guide walks you through the most common pod failure states, the kubectl commands to diagnose them, and proven strategies to resolve each one.

Prerequisites

  • A running Kubernetes cluster (v1.24 or later recommended)
  • kubectl installed and configured with cluster access
  • Basic understanding of Kubernetes objects (pods, deployments, services)
  • Permissions to read pods, events, and node resources in your target namespace
  • Familiarity with container concepts (images, registries, resource limits)

Understanding Pod Lifecycle and States

Before diving into troubleshooting, it helps to understand the Kubernetes pod lifecycle. A pod moves through several phases:

Phase       Description
Pending     Pod accepted by the cluster but one or more containers are not yet running
Running     Pod bound to a node and all containers started
Succeeded   All containers terminated successfully (exit code 0)
Failed      All containers terminated and at least one exited with an error
Unknown     Pod state cannot be determined, usually due to node communication failure

Within these phases, containers can enter specific waiting states that indicate what went wrong. These are the status messages you see in kubectl get pods output — and they are your first diagnostic clue.

Common Pod Failure States

CrashLoopBackOff

CrashLoopBackOff is the most common pod failure you will encounter. It means the container starts, crashes, and Kubernetes keeps restarting it, doubling the backoff delay each time (10s, 20s, 40s, ...) up to a cap of 5 minutes.

Common causes:

  • Application error causing immediate exit (missing config, unhandled exception)
  • Missing environment variables or mounted secrets
  • Incorrect command or entrypoint in the container spec
  • Health check (liveness probe) failing too aggressively
  • Dependency on a service that is not available

Diagnostic commands:

# Check the pod status and restart count
kubectl get pods -o wide

# View the last container's logs
kubectl logs <pod-name> --previous

# Check events for the pod
kubectl describe pod <pod-name>

The --previous flag is critical: without it you get logs from the freshly restarted container, which are often empty or incomplete. The describe output shows the Last State with the exit code, which tells you whether the process exited on its own (for example, exit code 1 from an application error) or was killed by a signal (137 for SIGKILL, typically an OOM kill; 143 for SIGTERM).
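
Two of the causes listed above (missing environment variables and an incorrect command) live directly in the pod spec. A minimal, hypothetical fragment showing where both are set; the image, command, and Secret names are placeholders, not from a real manifest:

```yaml
# Hypothetical container spec fragment. A typo in command/args or a
# missing Secret here is a classic source of immediate-exit crash loops.
containers:
  - name: app
    image: registry.example.com/app:1.0.0       # placeholder image
    command: ["/usr/local/bin/server"]          # overrides the image ENTRYPOINT
    args: ["--config", "/etc/app/config.yaml"]  # overrides the image CMD
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: app-secrets                   # hypothetical Secret; must exist
            key: database-url
```

If the referenced Secret itself is missing, the pod fails with CreateContainerConfigError rather than CrashLoopBackOff, which is a useful way to tell configuration errors apart from application errors.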

ImagePullBackOff

ImagePullBackOff means Kubernetes cannot download the container image from the registry. The pod stays in this state, retrying with exponential backoff.

Common causes:

  • Typo in the image name or tag
  • Image tag does not exist (e.g., referencing latest when only versioned tags are pushed)
  • Missing or expired imagePullSecrets
  • Private registry with no credentials configured
  • Network policy or firewall blocking access to the registry
  • Registry rate limits (Docker Hub throttling)

Diagnostic commands:

# Check the exact image reference
kubectl describe pod <pod-name> | grep -A 5 "Image:"

# Look for pull errors in events
kubectl get events --field-selector involvedObject.name=<pod-name>

# Verify imagePullSecrets exist
kubectl get secrets -n <namespace>
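
For the private-registry case, the usual fix is a pull secret referenced from the pod spec. A sketch, assuming a hypothetical secret named regcred created beforehand with kubectl create secret docker-registry:

```yaml
# Sketch: authenticate pulls from a private registry. "regcred" must
# already exist in the same namespace, created with something like:
#   kubectl create secret docker-registry regcred \
#     --docker-server=registry.example.com \
#     --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo                        # example name
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.0.0  # must match the secret's registry
```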

Pending

A Pending pod has been accepted by the cluster but has not started running, most often because the scheduler cannot find a node that satisfies its requirements. The pod can stay Pending indefinitely until the underlying issue is resolved.

Common causes:

  • Insufficient CPU or memory across all nodes
  • Node selectors, taints, or affinities that no node satisfies
  • PersistentVolumeClaim (PVC) not bound — no matching PersistentVolume available
  • ResourceQuota exceeded in the namespace
  • Too many pods on the cluster (max-pods limit on nodes)

Diagnostic commands:

# Check why the pod is pending
kubectl describe pod <pod-name>

# View scheduler events
kubectl get events --sort-by='.lastTimestamp' -n <namespace>

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check PVC status if the pod uses volumes
kubectl get pvc -n <namespace>
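
Since scheduling decisions are made against resource requests, an over-sized request is a frequent cause of Pending. A fragment with illustrative values; the point is that the requests, not the limits, must fit on some node:

```yaml
# Illustrative values only. The scheduler needs a node with at least
# 250m CPU and 256Mi memory unreserved; limits play no role in placement.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
```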

OOMKilled

OOMKilled (exit code 137) means the Linux kernel’s Out-Of-Memory killer terminated the container because it exceeded its memory limit.

Common causes:

  • Memory limit set too low for the application
  • Memory leak in the application
  • JVM heap size not aligned with container memory limit
  • Sidecar containers consuming shared memory
  • Loading large datasets into memory

Diagnostic commands:

# Check the termination reason
kubectl describe pod <pod-name> | grep -A 3 "Last State"

# View current memory usage (requires metrics-server)
kubectl top pod <pod-name>

# Check configured limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'
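
For the JVM case above, one common approach is to size the heap as a percentage of the container limit rather than hard-coding it. A hedged sketch; the image name and values are placeholders:

```yaml
# Sketch: keep the JVM heap inside the container memory limit so the
# kernel OOM killer (exit code 137) does not fire first.
containers:
  - name: app
    image: registry.example.com/java-app:1.0.0  # placeholder
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"      # heap capped at ~75% of the limit
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"                           # headroom left for non-heap memory
```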

Other Failure States

State                        Meaning                         Typical Fix
CreateContainerConfigError   Missing ConfigMap or Secret     Verify referenced ConfigMaps and Secrets exist
RunContainerError            Container runtime failure       Check security context, volume mounts, and node container runtime logs
Evicted                      Node under resource pressure    Check node disk/memory pressure conditions and set proper resource requests
Init:Error                   Init container failed           Check init container logs with kubectl logs <pod> -c <init-container>
Terminating (stuck)          Finalizers blocking deletion    Check finalizers with kubectl get pod -o json and remove if safe

Diagnosing Issues with kubectl

The diagnosis workflow follows a consistent pattern regardless of the failure type:

Step 1: Get the Big Picture

# All pods in the namespace with status
kubectl get pods -n <namespace> -o wide

# Recent events sorted by time
kubectl get events --sort-by='.lastTimestamp' -n <namespace> | tail -20

Step 2: Deep Dive into the Pod

# Full pod description with events
kubectl describe pod <pod-name> -n <namespace>

# Container logs (current instance)
kubectl logs <pod-name> -n <namespace>

# Container logs (previous crashed instance)
kubectl logs <pod-name> -n <namespace> --previous

# Logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>

Step 3: Check the Node

# Node conditions (disk pressure, memory pressure, PID pressure)
kubectl describe node <node-name> | grep -A 10 "Conditions"

# Resource allocation on the node
kubectl describe node <node-name> | grep -A 20 "Allocated resources"

Step 4: Interactive Debugging

# Exec into a running container
kubectl exec -it <pod-name> -- /bin/sh

# Use ephemeral debug container (K8s 1.23+)
kubectl debug -it <pod-name> --image=busybox --target=<container-name>

# Run a standalone debug pod to test from inside the cluster
# (unlike kubectl debug, it gets its own network namespace, not the target pod's)
kubectl run debug --rm -it --image=busybox -- /bin/sh

kubectl Commands Comparison Table

Failure State                First Command                           Key Information
CrashLoopBackOff             kubectl logs <pod> --previous           Application error output before crash
ImagePullBackOff             kubectl describe pod <pod>              Image name, pull errors, secret references
Pending                      kubectl describe pod <pod>              Scheduler failure reason in Events section
OOMKilled                    kubectl describe pod <pod>              Last State termination reason and exit code
CreateContainerConfigError   kubectl get configmap,secret -n <ns>    Missing referenced resources
Evicted                      kubectl describe node <node>            Node resource pressure conditions
Init:Error                   kubectl logs <pod> -c <init-container>  Init container failure logs

Real-World Scenario

You manage a production cluster running a microservices application. After a deployment, the payment-service pod keeps restarting and shows CrashLoopBackOff. Here is how you diagnose it:

$ kubectl get pods -n production
NAME                              READY   STATUS             RESTARTS   AGE
payment-service-7d4f8b9c6-x2k9m   0/1     CrashLoopBackOff   5          8m
api-gateway-5c8f7d6b4-h3j7n       1/1     Running            0          2d
user-service-6b7c8d9e5-m4n8p      1/1     Running            0          2d

You check the previous container’s logs:

$ kubectl logs payment-service-7d4f8b9c6-x2k9m --previous
2026-02-28 10:15:03 ERROR: Failed to connect to database
  ConnectionRefused: tcp://db-service:5432
2026-02-28 10:15:03 FATAL: Cannot start without database connection. Exiting.

The application requires a database connection at startup, but db-service is unreachable. You check the service:

$ kubectl get svc db-service -n production
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
db-service   ClusterIP   10.96.45.123   <none>        5432/TCP   30d

$ kubectl get endpoints db-service -n production
NAME         ENDPOINTS   AGE
db-service   <none>      30d

No endpoints — the database pod is down. You find it was evicted due to node disk pressure:

$ kubectl get pods -n production | grep db
db-postgresql-0   0/1     Evicted   0   30d

The fix: clear disk space on the node (or add more nodes), restart the database pod, and the payment service recovers automatically. You also add a startupProbe with a generous timeout so the payment service waits for the database instead of immediately crashing.
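
The startupProbe mentioned above might look like the following; the /healthz path, port, and thresholds are assumptions about the payment service, not taken from its real manifest:

```yaml
# Assumed probe config: give payment-service up to ~2 minutes to reach
# its database before liveness checks take over and restart it.
startupProbe:
  httpGet:
    path: /healthz        # hypothetical health endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 24    # 24 x 5s = up to 120s of startup grace
```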

Gotchas and Edge Cases

  • Exit code 137 vs 143: Exit code 137 means the container was killed by SIGKILL (usually OOMKilled). Exit code 143 means SIGTERM (graceful shutdown). Do not confuse them — 137 requires memory investigation, 143 is usually normal during rollouts.

  • CrashLoopBackOff delay: Kubernetes uses exponential backoff up to 5 minutes between restarts. If you fix the issue, you may still need to wait or delete the pod to restart it immediately.

  • ImagePullPolicy: Always: If your pod spec uses imagePullPolicy: Always (the default for latest tags), every pod restart triggers an image pull. This can cause ImagePullBackOff if the registry is temporarily unreachable, even though the image was previously cached on the node.

  • Resource requests vs limits: Pods are scheduled based on requests, not limits. A pod requesting 100Mi but limited to 500Mi can be OOMKilled at 500Mi even if the node has 2Gi free — the limit is enforced regardless of node capacity.

  • Multi-container pods: In a pod with sidecars, kubectl logs defaults to the first container. Always specify -c <container-name> when debugging multi-container pods.

  • Ephemeral storage evictions: Even if CPU and memory are fine, high ephemeral storage usage (logs, temp files) can trigger eviction. Check with kubectl describe node under Conditions.

  • PVC in wrong availability zone: In cloud environments, a PVC bound to a volume in us-east-1a cannot be mounted by a pod scheduled to a node in us-east-1b. The pod stays Pending with no obvious error.

  • DNS resolution lag: Newly created services may not resolve immediately. If your container crashes because it cannot resolve a service name, add a startup delay or retry logic instead of relying on instant DNS propagation.
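
For the dependency and DNS cases in this list, an init container that waits for the upstream service is a common pattern. A sketch, assuming a db-service listening on port 5432; the names and image tag are illustrative:

```yaml
# Sketch: block the main containers until db-service accepts TCP
# connections, instead of letting the app crash-loop at startup.
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command:
      - sh
      - -c
      - until nc -z db-service 5432; do echo waiting for db-service; sleep 2; done
```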

Summary

  • CrashLoopBackOff means your container keeps crashing — check logs with --previous to see the error before the crash
  • ImagePullBackOff indicates an image pull failure — verify image name, tag, and registry credentials
  • Pending means the scheduler cannot place the pod — check resource availability, PVC status, and node affinity rules
  • OOMKilled (exit code 137) means the container exceeded its memory limit — increase limits or optimize memory usage
  • Always start with kubectl describe pod and kubectl get events to understand the failure context
  • Use kubectl debug for ephemeral containers when you need interactive troubleshooting without modifying the pod spec
  • Set proper resource requests, limits, and probes to prevent many common pod failures before they happen