TL;DR — Quick Summary
etcd backup and restore for Kubernetes disaster recovery: snapshot methods, verification, multi-node restore, kubeadm certs, and monitoring.
etcd is the heartbeat of every Kubernetes cluster: a strongly consistent, distributed key-value store that holds the entire desired and observed state of your cluster. When etcd is healthy, kubectl commands return in milliseconds and controllers reconcile continuously. When etcd is lost without a backup, your cluster is gone — every Deployment definition, every Secret, every RBAC binding, every CRD, every ConfigMap disappears. This guide covers everything you need to build a production-grade etcd backup and restore strategy, from understanding the internals to running a full disaster recovery on a multi-node kubeadm cluster.
Prerequisites
- A Kubernetes cluster managed with kubeadm (v1.22+) or access to etcd certificates on a managed cluster
- etcdctl installed on the control plane node (version must match your etcd version)
- Root or sudo access on the control plane node
- kubectl configured with cluster-admin permissions
- Basic familiarity with Kubernetes control plane components
- For automated backups: access to S3, GCS, or equivalent offsite storage
etcd’s Role in Kubernetes
Every time you run kubectl apply, the Kubernetes API server validates the request and writes the resulting object to etcd. Every controller (Deployment controller, ReplicaSet controller, Scheduler) watches etcd for changes via the API server’s watch mechanism and reconciles the cluster accordingly. etcd is the only stateful component in the Kubernetes control plane — all other components are stateless and can be restarted from scratch as long as etcd is intact.
What lives in etcd:
- All API object definitions: Pods, Deployments, StatefulSets, DaemonSets, Services, Ingresses
- Secrets and ConfigMaps
- RBAC: Roles, ClusterRoles, RoleBindings, ClusterRoleBindings
- Custom Resource Definitions and all custom resource instances
- Namespace definitions, ResourceQuotas, LimitRanges
- ServiceAccounts and associated tokens
- Node registrations and lease objects
- Leader election records for kube-controller-manager and kube-scheduler
What does not live in etcd: the actual data stored in PersistentVolumes. etcd only stores the PersistentVolumeClaim and PersistentVolume API objects (metadata and binding), not the bytes on disk.
etcd Architecture: Raft, WAL, and Snapshots
etcd uses the Raft consensus algorithm to replicate state across a cluster with an odd number of members (typically 3 or 5). Raft elects a leader that processes all writes; followers replicate the leader’s log. The cluster tolerates (n-1)/2 member failures — a 3-node cluster survives 1 failure, a 5-node cluster survives 2.
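The quorum arithmetic above can be sketched in shell (an illustrative helper, not part of etcd):

```shell
#!/bin/bash
# Quorum math for an n-member etcd cluster: a majority of n/2+1
# members must be healthy, so the cluster tolerates n - (n/2 + 1)
# simultaneous failures.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

quorum 3      # prints 2
tolerated 3   # prints 1
quorum 5      # prints 3
tolerated 5   # prints 2
```

Note that 4 members tolerate no more failures than 3 (both survive only 1), which is why even-sized clusters are avoided.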
Writes are first appended to the Write-Ahead Log (WAL) on disk, then applied to bbolt, a memory-mapped B+tree database file that serves as etcd’s backend store. Periodically, etcd takes an internal snapshot of its state and truncates old WAL segments to prevent unbounded growth. The combination of WAL + snapshot means etcd can recover from a crash without losing committed data.
The etcdctl snapshot save command writes a consistent, point-in-time copy of the entire bbolt database to a file. This snapshot is a complete, self-contained backup of all etcd data at the moment it was taken — no WAL required for restore.
Backup Methods
Method 1: etcdctl Snapshot Save (Recommended)
The canonical backup method. On a kubeadm cluster, etcd runs as a static pod with TLS. Certificates live at /etc/kubernetes/pki/etcd/.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
You can also exec into the etcd container directly:
kubectl exec -n kube-system etcd-<control-plane-node> -- \
etcdctl snapshot save /tmp/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
kubectl cp kube-system/etcd-<control-plane-node>:/tmp/etcd-backup.db /backup/etcd-backup.db
Method 2: Automated CronJob
Deploy a Kubernetes CronJob on the control plane that mounts the host’s etcd certs and writes snapshots to a mounted PVC or cloud storage:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: etcd-backup
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup-dir
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup-dir
            persistentVolumeClaim:
              claimName: etcd-backup-pvc
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
Method 3: Velero with etcd Plugin
Velero is a full cluster backup solution. The velero-plugin-for-etcd takes etcd snapshots and stores them in object storage alongside Velero’s PV snapshots, giving you a unified backup for both cluster state and persistent data. Velero is better suited for application-level backup (namespace + PV together); for control-plane-only DR, etcdctl remains the preferred approach.
| Tool | etcd State | PV Data | Restore Granularity | Complexity |
|---|---|---|---|---|
| etcdctl snapshot | Yes | No | Full cluster | Low |
| Velero + etcd plugin | Yes | Yes | Namespace or full | Medium |
| etcd-backup-operator | Yes | No | Full cluster | Medium |
| kube-backup | Yes | No | Full cluster | Low |
| Manual CronJob | Yes | No | Full cluster | Low |
Snapshot Verification
Never trust a backup you have not verified. After every snapshot:
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db \
--write-out=table
Sample output:
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3b0d7ab2 | 1234567 | 3821 | 12 MB |
+----------+----------+------------+------------+
If TOTAL KEYS is 0 or the hash is malformed, the snapshot is corrupt. A healthy production cluster typically has 2000–8000 keys. Build snapshot verification into your backup CronJob and alert on failures.
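To automate that check, a small helper can parse the JSON form of snapshot status and fail below a key-count threshold (a sketch; the helper name and the default threshold of 100 are assumptions, tune them to your cluster):

```shell
#!/bin/bash
# check_snapshot_keys JSON [THRESHOLD]
# Parses the output of `etcdctl snapshot status --write-out=json`
# and returns non-zero when totalKey is below the threshold.
check_snapshot_keys() {
  local json="$1" threshold="${2:-100}" keys
  keys=$(printf '%s' "$json" | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['totalKey'])")
  [ "$keys" -ge "$threshold" ]
}

# Example with a captured status document:
status='{"hash":990793171,"revision":1234567,"totalKey":3821,"totalSize":12582912}'
check_snapshot_keys "$status" 100 && echo "snapshot looks sane"
```

Wire this into the CronJob or backup script so a zero-key snapshot fails loudly instead of rotating good backups away.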
Restore Procedures
Single-Node Kubeadm Cluster
Step 1: Stop the API server and etcd
# Move static pod manifests out of the manifests directory
mkdir -p /tmp/k8s-manifests-backup
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/k8s-manifests-backup/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/k8s-manifests-backup/
# Wait for processes to exit
sleep 10
ps aux | grep etcd
Step 2: Back up the existing (corrupt) data directory
mv /var/lib/etcd /var/lib/etcd.bak
Step 3: Restore the snapshot
The member name and peer URLs passed to the restore must mirror the --name, --initial-cluster, and --initial-advertise-peer-urls values in your etcd static pod manifest (/etc/kubernetes/manifests/etcd.yaml) — kubeadm names the member after the node hostname, so substitute the placeholders accordingly:
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd \
--name=<node-hostname> \
--initial-cluster=<node-hostname>=https://<node-ip>:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://<node-ip>:2380
Step 4: Fix ownership and restore manifests
chown -R etcd:etcd /var/lib/etcd # if running as non-root
mv /tmp/k8s-manifests-backup/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/k8s-manifests-backup/kube-apiserver.yaml /etc/kubernetes/manifests/
# Wait for control plane to come up
sleep 30
kubectl get nodes
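Rather than a fixed sleep 30, a retry loop is more robust when waiting for the control plane to come back; a generic sketch (the helper name, try count, and probed command are yours to choose):

```shell
#!/bin/bash
# wait_for TRIES CMD [ARGS...]
# Retries CMD once per second until it succeeds or TRIES is exhausted;
# returns the final status so callers can abort on timeout.
wait_for() {
  local tries="$1"; shift
  local i
  for i in $(seq 1 "$tries"); do
    "$@" >/dev/null 2>&1 && return 0
    sleep 1
  done
  return 1
}

# e.g. after moving the manifests back:
# wait_for 60 kubectl get --raw /readyz
```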
Multi-Node Kubeadm Cluster
For a 3-node control-plane cluster (HA), you must restore the same snapshot on every etcd member before starting any of them, using consistent --initial-cluster and --initial-cluster-token values:
# On each control plane node, run restore with the SAME snapshot and token
# but with the correct --name and --initial-advertise-peer-urls for that node.
# Generate the token ONCE and use the identical value on every node
# (a fresh $(date +%s) on each node would produce mismatched tokens):
RESTORE_TOKEN="etcd-restore-$(date +%s)"
# Node 1
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--name=etcd-cp1 \
--data-dir=/var/lib/etcd \
--initial-cluster=etcd-cp1=https://10.0.0.1:2380,etcd-cp2=https://10.0.0.2:2380,etcd-cp3=https://10.0.0.3:2380 \
--initial-cluster-token="${RESTORE_TOKEN}" \
--initial-advertise-peer-urls=https://10.0.0.1:2380
# Node 2 (same token value)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--name=etcd-cp2 \
--data-dir=/var/lib/etcd \
--initial-cluster=etcd-cp1=https://10.0.0.1:2380,etcd-cp2=https://10.0.0.2:2380,etcd-cp3=https://10.0.0.3:2380 \
--initial-cluster-token="${RESTORE_TOKEN}" \
--initial-advertise-peer-urls=https://10.0.0.2:2380
Use a unique --initial-cluster-token different from the original to prevent the restored cluster from accidentally joining the old (degraded) cluster.
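Because the membership string must be byte-identical on every member, it helps to build it once and distribute the same value; a sketch (the helper name is made up, node names and IPs are the examples above):

```shell
#!/bin/bash
# build_initial_cluster "name=peer-url" ... -> comma-joined membership
# string suitable for --initial-cluster on every member.
build_initial_cluster() {
  local out="" entry
  for entry in "$@"; do
    out="${out:+${out},}${entry}"
  done
  printf '%s\n' "$out"
}

build_initial_cluster \
  etcd-cp1=https://10.0.0.1:2380 \
  etcd-cp2=https://10.0.0.2:2380 \
  etcd-cp3=https://10.0.0.3:2380
```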
Managed Kubernetes Considerations
EKS, GKE, AKS — The cloud provider manages etcd entirely. You cannot access etcd directly. Use provider-native backup mechanisms:
- EKS: Velero with S3; AWS does not expose etcd directly
- GKE: Velero; Google manages etcd with automatic backups on Autopilot
- AKS: Velero + Azure Blob; Microsoft manages etcd for managed node pools
For managed clusters, focus on application-level backup (Velero namespaces + PV snapshots) rather than etcd-level backup.
etcd Health Monitoring
Monitor etcd continuously — do not wait for a disaster to discover problems:
# Check endpoint health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check leader and member status
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
--endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Prometheus alerts to configure for etcd:
- etcd_server_has_leader == 0 — no leader elected (critical)
- etcd_disk_wal_fsync_duration_seconds{quantile="0.99"} > 0.01 — slow WAL writes (storage degradation)
- etcd_disk_backend_commit_duration_seconds{quantile="0.99"} > 0.25 — slow B-tree commits
- etcd_server_proposals_failed_total > 0 — consensus failures
- etcd_mvcc_db_total_size_in_bytes > 8589934592 — DB approaching the 8 GB limit
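As a sketch, the same thresholds expressed as Prometheus alerting rules (group and alert names are illustrative, and the fsync quantile is computed from the histogram buckets etcd exposes):

```yaml
groups:
- name: etcd.rules
  rules:
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader == 0
    for: 1m
    labels:
      severity: critical
  - alert: EtcdSlowWalFsync
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
    for: 10m
    labels:
      severity: warning
  - alert: EtcdProposalsFailing
    expr: increase(etcd_server_proposals_failed_total[10m]) > 0
    labels:
      severity: warning
  - alert: EtcdDbNearQuota
    expr: etcd_mvcc_db_total_size_in_bytes > 0.8 * 8589934592
    for: 15m
    labels:
      severity: warning
```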
Compaction and Defragmentation
etcd keeps a history of all key revisions to support watch semantics. Over time this consumes significant disk space. Enable auto-compaction in the etcd configuration:
# In etcd static pod manifest or etcd.conf
--auto-compaction-mode=periodic
--auto-compaction-retention=1h
Even with auto-compaction, the on-disk B-tree file (bbolt) does not shrink automatically because bbolt does not reclaim free pages. Run defragmentation periodically during off-peak hours:
ETCDCTL_API=3 etcdctl defrag \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
In a multi-node cluster, defrag one member at a time. Defragging the leader causes a brief leadership change. Schedule defrag monthly or when etcd_mvcc_db_total_size_in_bytes is significantly larger than etcd_mvcc_db_total_size_in_use_in_bytes.
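The "significantly larger" condition can be scripted by comparing the two gauge values; a sketch (the helper name and the default ratio of 2 are assumptions):

```shell
#!/bin/bash
# needs_defrag TOTAL_BYTES IN_USE_BYTES [RATIO]
# Succeeds when the on-disk size exceeds RATIO times the in-use size,
# i.e. when defragmentation would reclaim meaningful space.
needs_defrag() {
  local total="$1" in_use="$2" ratio="${3:-2}"
  [ "$total" -gt $(( in_use * ratio )) ]
}

# 6 GiB on disk, 1.5 GiB logically in use: worth defragging
needs_defrag 6442450944 1610612736 && echo "schedule a defrag"
```

Feed it the scraped values of etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes, or query them from the Prometheus API.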
Performance Tuning
etcd is extremely sensitive to disk latency. The WAL fsync must complete before a Raft entry is considered committed. Recommendations:
- Dedicated SSD: Never share etcd’s data disk with application workloads. Use a dedicated NVMe or SSD with sustained random write IOPS > 2000.
- Heartbeat and election timeouts: Default heartbeat-interval is 100ms and election-timeout is 1000ms. In high-latency environments (cloud VMs with noisy neighbors), increase to 250ms / 1250ms:
--heartbeat-interval=250
--election-timeout=1250
- DB quota: Default is 2 GB. Increase to 8 GB for large clusters with many namespaces or frequent object churn:
--quota-backend-bytes=8589934592
- Network: etcd peer traffic should be on a dedicated, low-latency network path. Do not route etcd peer traffic through a shared application load balancer.
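The quota value is just GiB expressed in bytes; a one-liner avoids typos in the flag (illustrative helper):

```shell
#!/bin/bash
# gib_to_bytes N -> N GiB in bytes, for flags like --quota-backend-bytes
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

gib_to_bytes 8   # prints 8589934592
gib_to_bytes 2   # prints 2147483648
```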
Production Backup Script with Alerting
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/opt/etcd-backups"
RETENTION_COUNT=24 # keep last 24 snapshots (24h at hourly)
S3_BUCKET="s3://my-cluster-etcd-backups"
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx/yyy/zzz"
ETCD_ENDPOINTS="https://127.0.0.1:2379"
CACERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="${BACKUP_DIR}/etcd-${TIMESTAMP}.db"
alert() {
local msg="$1"
curl -s -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[etcd-backup] ${msg}\"}" "${SLACK_WEBHOOK}" || true
}
mkdir -p "${BACKUP_DIR}"
if ! ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT_FILE}" \
--endpoints="${ETCD_ENDPOINTS}" \
--cacert="${CACERT}" --cert="${CERT}" --key="${KEY}"; then
alert "CRITICAL: etcd snapshot save FAILED on $(hostname) at ${TIMESTAMP}"
exit 1
fi
# Verify snapshot
KEYS=$(ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT_FILE}" \
--write-out=json | python3 -c "import sys,json; print(json.load(sys.stdin)['totalKey'])")
if [ "${KEYS}" -lt 100 ]; then
alert "WARNING: snapshot has only ${KEYS} keys — possible empty or corrupt snapshot"
exit 1
fi
# Upload to S3
aws s3 cp "${SNAPSHOT_FILE}" "${S3_BUCKET}/$(basename ${SNAPSHOT_FILE})" \
--storage-class STANDARD_IA
# Rotate local copies
ls -t "${BACKUP_DIR}"/etcd-*.db | tail -n +$((RETENTION_COUNT + 1)) | xargs -r rm -f
echo "Backup complete: ${SNAPSHOT_FILE} (${KEYS} keys)"
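The rotation one-liner in the script can be factored into a reusable, testable function; a sketch assuming the same directory layout and etcd-*.db naming as the script above:

```shell
#!/bin/bash
# rotate_backups DIR KEEP
# Deletes all but the newest KEEP files matching DIR/etcd-*.db,
# sorted by modification time (newest first).
rotate_backups() {
  local dir="$1" keep="$2"
  ls -t "$dir"/etcd-*.db 2>/dev/null | tail -n +$((keep + 1)) | xargs -r rm -f
}
```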
Disaster Recovery Scenarios
Scenario 1: Single member failure (quorum intact)
The cluster continues operating with 2/3 or 3/5 healthy members. Replace the failed member using etcdctl member remove + etcdctl member add and join a new etcd process to the cluster without a restore.
Scenario 2: Quorum loss (majority of members down)
The cluster becomes read-only and kubectl writes will fail. If members can be recovered (disk intact, network issue), bring them back online. If data is lost, restore from snapshot on all members.
Scenario 3: Full cluster restore (all data lost)
Stop all control plane components on all nodes, restore the snapshot on each node with consistent --initial-cluster-token, restart control plane components in order: etcd first, then kube-apiserver, then kube-controller-manager and kube-scheduler. Verify all nodes re-register and all Pods report correct status.
Gotchas and Edge Cases
etcdctl version mismatch — Always set ETCDCTL_API=3. Run etcdctl version and verify the client version matches the server version. Mismatches cause silent failures or corrupted restores.
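A quick guard before taking or restoring a snapshot is to compare the client and server versions on major.minor; a sketch (the helper name is made up, and matching on major.minor is a common convention rather than a hard etcd rule):

```shell
#!/bin/bash
# same_minor V1 V2 -> succeeds when both versions share major.minor,
# e.g. etcdctl client version vs etcd server version.
same_minor() {
  [ "$(echo "$1" | cut -d. -f1,2)" = "$(echo "$2" | cut -d. -f1,2)" ]
}

same_minor 3.5.9 3.5.12 && echo "client/server minor versions match"
```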
Snapshot from non-leader — Snapshots taken from a follower member are valid but may lag behind the leader by a few entries. For critical restores, take the snapshot from the leader.
Restore overwrites the data directory — etcdctl snapshot restore writes to --data-dir. If the directory already exists, the restore fails. Always move the existing data directory out of the way first.
CronJob on control plane node — etcd CronJobs must tolerate the node-role.kubernetes.io/control-plane: NoSchedule taint and use nodeSelector to land on control plane nodes where the certs are mounted via hostPath.
Clock skew between members — etcd peer TLS certificates are time-sensitive. If node clocks diverge by more than a few minutes, certificate validation fails. Ensure NTP is configured and synchronised on all control plane nodes.
Managed cluster surprises — On GKE or EKS, attempting to exec into the etcd pod will fail or be blocked. If you are on a managed cluster, shift to Velero immediately and do not rely on etcd-level backup.
Summary
- etcd stores all Kubernetes cluster state; losing it without a backup means rebuilding from scratch
- Use etcdctl snapshot save with TLS flags pointing to /etc/kubernetes/pki/etcd/ for kubeadm clusters
- Always run etcdctl snapshot status to verify snapshots after creation
- Restore requires stopping the API server and etcd, running etcdctl snapshot restore, and restarting the control plane
- Multi-node restore requires a consistent --initial-cluster-token and the correct per-node --initial-advertise-peer-urls on all members
- Enable auto-compaction (--auto-compaction-retention=1h) and run etcdctl defrag monthly
- Dedicate a low-latency SSD to etcd data; monitor WAL fsync latency with Prometheus
- Store snapshots offsite (S3/GCS) with at least 24h retention; automate with a CronJob + alerting script
- For EKS, GKE, AKS: etcd is managed internally — use Velero for application-level backup instead