etcd is the distributed key-value store at the heart of every Kubernetes cluster, storing all cluster state from pod definitions to secrets. When etcd goes down or loses data, the entire control plane stops working. This guide walks you through deploying etcd as a standalone service and as a three-node cluster, securing it with mutual TLS, taking and restoring snapshots, and understanding how Kubernetes uses it — so you can operate it confidently in production.
Prerequisites
- Linux server(s) running Ubuntu 22.04 or later (one for single-node, three for cluster)
- Root or sudo access
- Basic familiarity with systemd service management
- curl and tar installed
- For TLS: cfssl or openssl available
- Ports 2379 (client) and 2380 (peer) open between etcd nodes
What Is etcd and How It Works
etcd is an open-source, strongly consistent distributed key-value store originally built by CoreOS and now a Cloud Native Computing Foundation (CNCF) graduated project. It uses the Raft consensus algorithm to ensure all nodes agree on the current state even when network partitions or node failures occur.
The key properties you need to understand:
- Strong consistency: every read returns the latest committed write, never a stale value
- Watch API: clients subscribe to key changes and receive notifications in real time
- Atomic transactions: compare-and-swap operations let you build distributed locks
- Leases: keys can expire automatically, enabling heartbeat-based leader election
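Leases are the easiest property to see in action. A minimal sketch with etcdctl, assuming a plain local test instance on 127.0.0.1:2379 (no TLS; a laptop experiment, not a production setup):

```shell
# Sketch: lease-based key expiry with etcdctl. Assumes a local,
# non-TLS test endpoint; add the --cacert/--cert/--key flags from
# later in this guide for a real cluster.
ENDPOINT="http://127.0.0.1:2379"
LEASE_TTL=10   # seconds before the key disappears if the lease is not kept alive

if command -v etcdctl >/dev/null 2>&1; then
  # Grant a lease; the second output field is the lease ID
  LEASE_ID=$(etcdctl --endpoints="$ENDPOINT" lease grant "$LEASE_TTL" \
    | awk '{print $2}')
  # Attach a key to the lease; it is deleted when the lease expires
  etcdctl --endpoints="$ENDPOINT" put /demo/heartbeat alive --lease="$LEASE_ID"
  # Watchers on /demo/ see the PUT now and a DELETE when the TTL runs out
  etcdctl --endpoints="$ENDPOINT" get /demo/heartbeat
fi
```

Renewing the lease with `etcdctl lease keep-alive` is the basis of heartbeat-style leader election.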
Kubernetes relies on etcd for every piece of cluster state. The API server is the only component that reads and writes to etcd directly — all other components (scheduler, controller manager, kubelet) communicate through the API server.
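You can see this layout directly: Kubernetes stores every object under the /registry key prefix. A hedged peek, assuming you run it on a control plane node with the certificate paths used later in this guide:

```shell
# Sketch: list Kubernetes object keys stored in etcd. Cert paths are
# the ones set up elsewhere in this guide; adjust for your cluster.
export ETCDCTL_API=3
PREFIX="/registry"
if command -v etcdctl >/dev/null 2>&1; then
  etcdctl get "$PREFIX" --prefix --keys-only \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/etcd/pki/ca.pem \
    --cert=/etc/etcd/pki/etcd.pem \
    --key=/etc/etcd/pki/etcd-key.pem | head -20
fi
# Typical keys: /registry/pods/<namespace>/<name>, /registry/secrets/<namespace>/<name>
```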
| Feature | etcd | Redis | Consul | ZooKeeper |
|---|---|---|---|---|
| Consensus | Raft | None (standalone) | Raft | ZAB |
| Strong consistency | Yes | No (eventual) | Yes | Yes |
| Watch API | Yes | Pub/sub only | Yes | Yes |
| Kubernetes native | Yes | No | No | No |
| TLS built-in | Yes | Optional | Yes | Optional |
| Operational complexity | Low | Very low | Medium | High |
etcd wins for Kubernetes because it was purpose-built for control plane use cases: small values, high read rate, infrequent writes, and correctness over raw throughput.
Installing etcd
Download the latest release from the official GitHub repository. At the time of writing, 3.5.x is the stable series recommended for Kubernetes 1.29+.
ETCD_VER=v3.5.12
DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download
curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz \
-o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcd /usr/local/bin/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcdctl /usr/local/bin/
sudo mv /tmp/etcd-${ETCD_VER}-linux-amd64/etcdutl /usr/local/bin/
etcd --version
etcdctl version
Create the data directory and a dedicated system user:
sudo groupadd --system etcd
sudo useradd -s /sbin/nologin --system -g etcd etcd
sudo mkdir -p /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd
sudo chmod 700 /var/lib/etcd
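Before wiring up TLS and systemd, you can sanity-check the binaries with a throwaway single-node instance. This is a smoke test on localhost only, never a deployment pattern:

```shell
# Smoke test: run a temporary single-node etcd in the background,
# write and read a key, then clean up. Uses a throwaway data dir.
DATA_DIR=$(mktemp -d)
if command -v etcd >/dev/null 2>&1; then
  etcd --data-dir "$DATA_DIR" \
    --listen-client-urls http://127.0.0.1:2379 \
    --advertise-client-urls http://127.0.0.1:2379 >/dev/null 2>&1 &
  ETCD_PID=$!
  sleep 2
  etcdctl --endpoints=http://127.0.0.1:2379 put /smoke/test ok
  etcdctl --endpoints=http://127.0.0.1:2379 get /smoke/test
  kill "$ETCD_PID"
fi
rm -rf "$DATA_DIR"
```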
Generating TLS Certificates
Never run etcd without TLS in any environment beyond a local laptop. etcd stores secrets and credentials — any unencrypted listener is an immediate security risk.
Install cfssl:
curl -Lo /usr/local/bin/cfssl \
https://github.com/cloudflare/cfssl/releases/latest/download/cfssl_linux-amd64
curl -Lo /usr/local/bin/cfssljson \
https://github.com/cloudflare/cfssl/releases/latest/download/cfssljson_linux-amd64
chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson
Create a CA and server certificate. Save the following as ca-config.json:
{
"signing": {
"default": { "expiry": "87600h" },
"profiles": {
"etcd": {
"expiry": "87600h",
"usages": ["signing","key encipherment","server auth","client auth"]
}
}
}
}
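The gencert commands below also reference two CSR files not shown above. Minimal sketches follow; the CN, O, and hosts values here are placeholders, and the hosts array must list every node IP and hostname (plus 127.0.0.1) that will appear in certificate validation.

Save as ca-csr.json:

```json
{
  "CN": "etcd-ca",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "O": "example-org" }]
}
```

Save as etcd-csr.json:

```json
{
  "CN": "etcd",
  "hosts": ["127.0.0.1", "10.0.0.11", "10.0.0.12", "10.0.0.13"],
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "O": "example-org" }]
}
```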
Generate the CA and server certificates:
# CA
cfssl gencert -initca ca-csr.json | cfssljson -bare ca
# Server + peer cert (add all node IPs/hostnames in the CSR hosts array)
cfssl gencert \
-ca=ca.pem -ca-key=ca-key.pem \
-config=ca-config.json \
-profile=etcd \
etcd-csr.json | cfssljson -bare etcd
sudo mkdir -p /etc/etcd/pki
sudo cp ca.pem etcd.pem etcd-key.pem /etc/etcd/pki/
sudo chown -R etcd:etcd /etc/etcd/pki
sudo chmod 600 /etc/etcd/pki/*-key.pem
Setting Up a Three-Node etcd Cluster
A production etcd cluster needs three or five members. With three members, it tolerates one failure and still maintains quorum. With five, it tolerates two simultaneous failures.
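The arithmetic behind these numbers is simple majority voting: a cluster of n members needs floor(n/2) + 1 votes for quorum, and tolerates n minus quorum failures. A quick sketch:

```shell
# Quorum = majority of members; fault tolerance = members - quorum.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum $n) tolerates=$(tolerance $n)"
done
# members=3 tolerates 1 failure; members=4 also tolerates only 1,
# which is why even cluster sizes buy nothing.
```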
Assume three nodes with the following IPs:
- etcd-1: 10.0.0.11
- etcd-2: 10.0.0.12
- etcd-3: 10.0.0.13
Create /etc/etcd/etcd.conf.yml on each node, substituting the node-specific values:
# /etc/etcd/etcd.conf.yml — example for etcd-1
name: etcd-1
data-dir: /var/lib/etcd
listen-peer-urls: https://10.0.0.11:2380
listen-client-urls: https://10.0.0.11:2379,https://127.0.0.1:2379
advertise-client-urls: https://10.0.0.11:2379
initial-advertise-peer-urls: https://10.0.0.11:2380
# keep initial-cluster on one line: a folded scalar would insert spaces after the commas
initial-cluster: etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380,etcd-3=https://10.0.0.13:2380
initial-cluster-token: etcd-cluster-prod-01
initial-cluster-state: new
client-transport-security:
cert-file: /etc/etcd/pki/etcd.pem
key-file: /etc/etcd/pki/etcd-key.pem
trusted-ca-file: /etc/etcd/pki/ca.pem
client-cert-auth: true
peer-transport-security:
cert-file: /etc/etcd/pki/etcd.pem
key-file: /etc/etcd/pki/etcd-key.pem
trusted-ca-file: /etc/etcd/pki/ca.pem
peer-client-cert-auth: true
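Only the name and URL fields change per node. For example, on etcd-2 the node-specific values would be (a sketch; every other key is identical to etcd-1):

```yaml
# /etc/etcd/etcd.conf.yml on etcd-2: only these fields differ from etcd-1
name: etcd-2
listen-peer-urls: https://10.0.0.12:2380
listen-client-urls: https://10.0.0.12:2379,https://127.0.0.1:2379
advertise-client-urls: https://10.0.0.12:2379
initial-advertise-peer-urls: https://10.0.0.12:2380
```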
Create the systemd unit /etc/systemd/system/etcd.service on all nodes:
[Unit]
Description=etcd distributed key-value store
Documentation=https://etcd.io
After=network.target
[Service]
User=etcd
Group=etcd
Type=notify
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf.yml
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Start the cluster. Start all three nodes in quick succession: with Type=notify, systemctl blocks until the member has joined a quorum, so the command on the first node will appear to hang until the other members come up. This is normal:
# Run on all three nodes in quick succession
sudo systemctl daemon-reload
sudo systemctl enable --now etcd
Verify the cluster formed correctly:
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379
export ETCDCTL_CACERT=/etc/etcd/pki/ca.pem
export ETCDCTL_CERT=/etc/etcd/pki/etcd.pem
export ETCDCTL_KEY=/etc/etcd/pki/etcd-key.pem
etcdctl endpoint health
etcdctl endpoint status --write-out=table
etcdctl member list --write-out=table
etcd Backup and Snapshot Restore
etcd backup is non-negotiable in production. A snapshot captures the entire key-value store at a point in time. Always back up before upgrading etcd or upgrading Kubernetes.
Taking a Snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/etcd/pki/ca.pem \
--cert=/etc/etcd/pki/etcd.pem \
--key=/etc/etcd/pki/etcd-key.pem
# Verify the snapshot (etcdutl replaces the deprecated etcdctl subcommand in 3.5+)
etcdutl snapshot status /backup/etcd-snapshot-*.db --write-out=table
Automate this with a cron job that rotates old snapshots. cron does not support multi-line commands, so put the logic in a script:
#!/bin/bash
# /usr/local/bin/etcd-backup.sh (chmod +x; /backup must be writable by the etcd user)
set -euo pipefail
etcdctl snapshot save "/backup/etcd-$(date +%Y%m%d).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.pem \
  --cert=/etc/etcd/pki/etcd.pem \
  --key=/etc/etcd/pki/etcd-key.pem
find /backup/ -name "etcd-*.db" -mtime +7 -delete
Then schedule it:
# /etc/cron.d/etcd-backup
0 2 * * * etcd /usr/local/bin/etcd-backup.sh
Restoring from a Snapshot
Restoration replaces the data directory. Perform this on every member in the cluster:
# 1. Stop etcd on ALL nodes first
sudo systemctl stop etcd
# 2. Restore on each node (use a unique --name and --initial-advertise-peer-urls per node)
ETCDCTL_API=3 etcdutl snapshot restore /backup/etcd-snapshot.db \
--name etcd-1 \
--initial-cluster "etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380,etcd-3=https://10.0.0.13:2380" \
--initial-cluster-token etcd-cluster-prod-01 \
--initial-advertise-peer-urls https://10.0.0.11:2380 \
--data-dir /var/lib/etcd-restored
# 3. Swap the data directory
sudo mv /var/lib/etcd /var/lib/etcd-old
sudo mv /var/lib/etcd-restored /var/lib/etcd
sudo chown -R etcd:etcd /var/lib/etcd
# 4. Start etcd on ALL nodes
sudo systemctl start etcd
Real-World Scenario: Recovering a Kubernetes Control Plane
You have a three-node Kubernetes cluster where one control plane node lost its disk. The etcd member on that node is gone, and kubectl get nodes now hangs because the API server cannot reach quorum on writes.
Here is the recovery path:
- Check current membership from a healthy node: etcdctl member list. The dead member still appears in the list even though it is unreachable.
- Remove the dead member: etcdctl member remove <MEMBER_ID>
- Provision a new node with the same hostname and IP (or update DNS).
- Add the replacement: etcdctl member add etcd-1-new --peer-urls=https://10.0.0.11:2380
- Start etcd on the new node with initial-cluster-state: existing in the config (or ETCD_INITIAL_CLUSTER_STATE=existing in the environment). Do not use new.
- Watch the new member sync: etcdctl endpoint status will show the revision number catch up in real time.
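The steps above can be sketched as a command sequence. The member ID must come from your own member list output, and the name and peer URL are this guide's example values:

```shell
# Sketch: replace a dead etcd member. Run from a healthy node with
# the ETCDCTL_* TLS environment variables from earlier exported.
MEMBER_ID="<paste dead member hex ID here>"   # from `etcdctl member list`
NEW_NAME="etcd-1-new"
PEER_URL="https://10.0.0.11:2380"

if command -v etcdctl >/dev/null 2>&1; then
  etcdctl member list --write-out=table
  etcdctl member remove "$MEMBER_ID"
  etcdctl member add "$NEW_NAME" --peer-urls="$PEER_URL"
fi
# On the replacement node: set initial-cluster-state: existing in the
# config, start etcd, and watch `etcdctl endpoint status` converge.
```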
The entire process typically takes five to ten minutes for a cluster with less than 1 GB of etcd data. During removal and before the new member reaches quorum, Kubernetes continues to serve reads but blocks writes.
Gotchas and Edge Cases
Never run an even number of etcd members. A two-member cluster has no fault tolerance — losing one member immediately loses quorum. A four-member cluster tolerates only one failure, same as three members, but requires an extra node. Stick to 3 or 5.
Disk I/O is the primary etcd bottleneck. etcd calls fdatasync on every write. If your disk latency exceeds 10ms consistently, you will see leader elections and request timeout errors in Kubernetes. Use SSDs or NVMe storage. Check disk latency with: fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-bench.
etcd is not a general-purpose database. The default storage quota is 2 GB. Kubernetes clusters with many objects or heavy use of custom resources can hit this limit. Monitor etcd_mvcc_db_total_size_in_bytes in Prometheus and compact/defragment regularly:
# Compact old revisions (keep last 1000)
rev=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
etcdctl compaction $((rev - 1000))
etcdctl defrag --cluster
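A sketch of the kind of check worth automating: convert the reported byte count into a percentage of the 2 GB default quota. The sample value here is made up; in practice you would read it from endpoint status or the Prometheus metric above.

```shell
# Compute how full the etcd backend is relative to the default quota.
QUOTA_BYTES=$(( 2 * 1024 * 1024 * 1024 ))   # default --quota-backend-bytes (2 GB)
DB_BYTES=1610612736                          # hypothetical sample: 1.5 GB
# In practice, e.g.: etcdctl endpoint status --write-out=json | jq '.[0].Status.dbSize'

PCT=$(( DB_BYTES * 100 / QUOTA_BYTES ))
echo "backend at ${PCT}% of quota"
if [ "$PCT" -ge 75 ]; then
  echo "time to compact and defragment"
fi
```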
Clock skew causes leader elections. etcd uses wall-clock time for lease expiry. If nodes diverge by more than a few hundred milliseconds, you will see spurious leader elections. Run NTP (chrony or systemd-timesyncd) on all etcd nodes and verify with chronyc tracking.
Snapshot restoration clears cluster membership. After restoring, all members are treated as a brand new cluster with the membership encoded in the snapshot. Never restore a snapshot onto a running cluster without stopping all members first — you will create a split-brain situation.
Common Issues and Troubleshooting
etcdserver: request timed out — Usually disk latency. Check iostat -x 1 on etcd nodes. Also check peer connectivity: etcdctl endpoint health from each node individually.
etcdserver: mvcc: database space exceeded — The 2 GB storage quota was hit. Run compaction and defrag as shown above, or increase the quota with --quota-backend-bytes=4294967296 (4 GB) in the etcd config.
raft: failed to send message — Firewall blocking port 2380 between peers, or a peer certificate CN mismatch. Verify with openssl s_client -connect 10.0.0.12:2380 from another etcd node.
certificate has expired — etcd peer and client certificates must be rotated before expiry. kubeadm-managed clusters issue one-year certificates and renew them during control plane upgrades (or on demand with kubeadm certs renew), but manually managed clusters rotate nothing automatically. Check expiry: openssl x509 -in /etc/etcd/pki/etcd.pem -noout -dates.
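A runnable sketch of an expiry check using openssl x509 -checkend, which exits 0 if the certificate remains valid past the given window. Shown here against a freshly generated throwaway certificate; on a real node, point CERT at /etc/etcd/pki/etcd.pem instead:

```shell
# Demo: check whether a certificate expires within 30 days.
# Uses a throwaway self-signed cert so the sketch is self-contained;
# set CERT=/etc/etcd/pki/etcd.pem on an actual etcd node.
CERT=$(mktemp)
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -subj "/CN=etcd-demo" -keyout /dev/null -out "$CERT" 2>/dev/null

THIRTY_DAYS=$(( 30 * 24 * 3600 ))
if openssl x509 -in "$CERT" -noout -checkend "$THIRTY_DAYS"; then
  echo "certificate valid for at least 30 more days"
else
  echo "certificate expires within 30 days: rotate now"
fi
rm -f "$CERT"
```

Dropping this check into the backup cron job gives you rotation warnings for free.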
etcd using too much memory — etcd caches the working set in memory. On clusters with many objects, RSS can exceed 8 GB. This is normal. Set --snapshot-count=5000 (default 100000) to trigger more frequent snapshots and reduce the raft log size in memory.
Summary
- etcd stores all Kubernetes cluster state; it is the most critical service in the control plane
- Always deploy three or five members for production — odd numbers only
- Enable mutual TLS on both client and peer ports; never expose etcd without authentication
- Take daily snapshots with etcdctl snapshot save and verify them with snapshot status
- Restore by stopping all members, running etcdutl snapshot restore on each with unique node parameters, swapping the data directory, then restarting
- Monitor disk latency (keep below 10ms), database size (compact when approaching 2 GB), and leader stability
- Use etcdctl member remove + member add to replace a failed node without a full restore