How do I recover a MariaDB Galera Cluster after a full shutdown?

After a full cluster shutdown, you must bootstrap from the node with the most recent data. Run galera_recovery on each node to find the one with the highest seqno, then start that node with galera_new_cluster. Start the remaining nodes normally and they will join via SST or IST.

What is the difference between SST and IST in Galera?

SST (State Snapshot Transfer) is a full data transfer used when a node joins for the first time or has been offline too long. IST (Incremental State Transfer) transfers only the missing transactions from the gcache and is much faster. IST is preferred but requires the needed transactions to still be in the donor gcache.

MariaDB Galera Cluster Setup and Troubleshooting

MariaDB Galera Cluster is a synchronous multi-master database cluster that allows you to read and write to any node. It uses the Galera replication library for virtually synchronous replication, automatic node provisioning, and automatic membership control with node failure detection.

What is MariaDB Galera Cluster?

MariaDB Galera Cluster provides a synchronous multi-master replication solution for MariaDB. Unlike traditional asynchronous replication, where writes go to a single master, Galera allows reads and writes on any node in the cluster. Replication is virtually synchronous: a transaction's write set reaches every node before the commit returns, so all nodes hold the same committed data and the cluster keeps serving requests when a single node fails.

Key features:

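Virtually synchronous replication: A transaction commits only after its write set has been replicated to all nodes, so committed transactions are not lost when a node fails and replica lag is limited to a short apply delay.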
Multi-master: Read and write to any node, simplifying application load balancing.
Automatic node provisioning: New nodes joining the cluster automatically receive a full dataset copy (SST) or incremental updates (IST).
Automatic membership control: Failed nodes are detected and removed from the cluster automatically.

This guide covers installation, configuration, bootstrapping, and troubleshooting of a 3-node MariaDB Galera Cluster on Ubuntu/Debian.

Prerequisites

Three Linux servers (Ubuntu 22.04+ or Debian 12+ recommended) with static IPs.
MariaDB 10.6+ or 11.x installed on all nodes.
Ports open between all nodes: 3306 (MySQL), 4567 (Galera replication), 4568 (IST), 4444 (SST).
Root or sudo access on all servers.
Firewall configured to allow inter-node communication.
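
The port list above translates fairly directly into firewall rules. The following is a minimal sketch that assumes ufw is the firewall in use (the guide does not name one) and uses a hypothetical helper name, galera_ufw_rules. It only prints the commands so they can be reviewed before being applied.

```shell
# Print (not run) ufw rules that open the Galera ports to each peer.
# Assumption: ufw is the firewall in use; adapt for nftables or firewalld.
galera_ufw_rules() {
    for peer in "$@"; do
        # 3306 client traffic, 4567 replication, 4568 IST, 4444 SST
        echo "ufw allow from $peer to any port 3306,4567,4568,4444 proto tcp"
        # 4567/udp is only needed when multicast replication is used
        echo "ufw allow from $peer to any port 4567 proto udp"
    done
}

# Example: on node 1, list the rules for the other two nodes
galera_ufw_rules 192.168.1.102 192.168.1.103
```

Restricting each rule to a peer address keeps the replication ports closed to the rest of the network, which matters because Galera traffic is not encrypted unless TLS is configured separately.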

Step-by-Step Solution

1. Install MariaDB and Galera on All Nodes

# Ubuntu/Debian
sudo apt update
sudo apt install -y mariadb-server galera-4 mariadb-backup

# Verify installation
mariadbd --version

Important: Run the same MariaDB version on all nodes. Version mismatches, especially across major releases, can cause SST failures because mariabackup on the donor and on the joiner must be compatible.

2. Configure Galera on Each Node

Create or edit /etc/mysql/mariadb.conf.d/60-galera.cnf on each node:

[galera]
# Galera provider
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so

# Cluster configuration
wsrep_cluster_name       = "my_galera_cluster"
wsrep_cluster_address    = "gcomm://192.168.1.101,192.168.1.102,192.168.1.103"

# Node-specific settings (change on each node)
wsrep_node_address       = "192.168.1.101"   # This node's IP
wsrep_node_name          = "node1"            # This node's name

# SST method (mariabackup is recommended for production)
wsrep_sst_method         = mariabackup
wsrep_sst_auth           = "sstuser:sstpassword"

# InnoDB settings (required for Galera)
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
innodb_force_primary_key = 1

# Performance tuning
wsrep_slave_threads      = 4
innodb_flush_log_at_trx_commit = 2
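
Only the two node-specific lines differ between nodes, so they are easy to get wrong when the file is copied around. As a small sketch, a hypothetical helper (galera_node_settings, not part of any package) can print them for a given node name and IP:

```shell
# Print the node-specific lines of 60-galera.cnf for one node.
# Hypothetical helper: the name and two-argument interface are illustrative.
galera_node_settings() {
    node_name="$1"
    node_ip="$2"
    printf 'wsrep_node_address       = "%s"\n' "$node_ip"
    printf 'wsrep_node_name          = "%s"\n' "$node_name"
}

# Example for the second node
galera_node_settings node2 192.168.1.102
```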

3. Create the SST User

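On Node 1 only. The user must exist on the donor before any other node requests an SST. Once the Galera configuration from step 2 is active, a plain start fails on a cluster that has not been bootstrapped, because the node cannot reach a Primary component. Either run this step before applying the Galera configuration, or skip the start and stop commands and run only the SQL statements right after bootstrapping in step 4.

# Start MariaDB normally (before the Galera configuration is active)
sudo systemctl start mariadb

-- In a mariadb client session: create the SST user
-- (on MariaDB 10.5+, REPLICATION CLIENT is an alias for BINLOG MONITOR)
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 'sstpassword';
GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;

# Stop MariaDB again before bootstrapping
sudo systemctl stop mariadb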

4. Bootstrap the First Node

On Node 1 only:

sudo galera_new_cluster

Verify the cluster has started:

SHOW STATUS LIKE 'wsrep_cluster_size';
-- Should return: 1

SHOW STATUS LIKE 'wsrep_cluster_status';
-- Should return: Primary

SHOW STATUS LIKE 'wsrep_ready';
-- Should return: ON
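
The three checks above can be combined into one pass over the status output. This is a sketch, not an official tool: wsrep_health is a hypothetical function that reads tab-separated name/value lines (the format produced by the mariadb client with -N -B) and compares them with the expected cluster size.

```shell
# Read "name<TAB>value" status lines on stdin and report whether the node
# looks healthy for the expected cluster size (first argument).
wsrep_health() {
    awk -F '\t' -v want="$1" '
        $1 == "wsrep_cluster_size"   { size = $2 }
        $1 == "wsrep_cluster_status" { status = $2 }
        $1 == "wsrep_ready"          { ready = $2 }
        END {
            if (size == want && status == "Primary" && ready == "ON") {
                print "OK"
            } else {
                printf "DEGRADED size=%s status=%s ready=%s\n", size, status, ready
                exit 1
            }
        }'
}

# Usage on a live node (needs a client login, so it is left commented out):
# sudo mariadb -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | wsrep_health 1
```

The function exits non-zero when any of the three values is off, so it can be used as a condition in a monitoring script.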

5. Join Remaining Nodes

On Node 2 and Node 3, start MariaDB normally, one node at a time, and wait for each node to finish syncing before starting the next:

sudo systemctl start mariadb

Each node will automatically connect to the cluster and receive data via SST. Monitor the process:

# Watch the MariaDB error log
sudo tail -f /var/log/mysql/error.log
# If log_error is not set, follow the systemd journal instead
sudo journalctl -u mariadb -f

Verify the cluster size increases:

SHOW STATUS LIKE 'wsrep_cluster_size';
-- Should return: 3 (after all nodes join)

Troubleshooting Common Issues

Split-Brain Recovery

A network partition can leave nodes disagreeing about cluster membership. Galera guards against split-brain with a quorum: only the partition that holds a majority of the nodes stays Primary, and nodes in the minority partition stop accepting queries (wsrep_ready = OFF).

-- Check which side of the partition this node is on
SHOW STATUS LIKE 'wsrep_cluster_status';
-- "non-Primary" means this node is in the minority partition
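
Whether a partition stays Primary comes down to a majority vote. The sketch below is a simplification that assumes every node has the default weight and that no arbitrator is involved; has_quorum is a hypothetical helper used only to illustrate the arithmetic.

```shell
# Print "yes" if a partition holding `reachable` of `total` equally weighted
# nodes keeps quorum, i.e. strictly more than half of the votes.
has_quorum() {
    reachable="$1"
    total="$2"
    if [ $((reachable * 2)) -gt "$total" ]; then
        echo yes
    else
        echo no
    fi
}

has_quorum 2 3   # majority side of a 3-node cluster: prints yes
has_quorum 1 3   # isolated node: prints no
has_quorum 1 2   # either half of a 2-node cluster: prints no
```

The last case shows why a two-node cluster cannot survive a partition: neither half has more than half of the votes, so both go non-Primary.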

Recovery:

# On the minority partition node, stop MariaDB
sudo systemctl stop mariadb

# Fix the network issue, then restart
sudo systemctl start mariadb
# The node will rejoin the majority partition

Full Cluster Crash Recovery

When all nodes crash or are stopped simultaneously:

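# On each node (with MariaDB stopped), recover the last committed position
sudo galera_recovery
# The recovered position has the form <cluster uuid>:<seqno>;
# the node with the highest seqno has the latest data

# On that node ONLY: galera_new_cluster refuses to run while
# safe_to_bootstrap is 0 in /var/lib/mysql/grastate.dat, which is the
# case on every node after a crash, so set it to 1 there first
sudo sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
sudo galera_new_cluster

# Start remaining nodes normally
sudo systemctl start mariadb   # On other nodes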

Warning: Never bootstrap from a node that is not the most up-to-date. This can lead to data loss.
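
Comparing recovered positions by eye is error-prone when the seqnos are close. As a sketch, the positions can be collected by hand into "node uuid:seqno" lines and passed to a small helper. pick_bootstrap_node is a hypothetical name, and the function assumes all lines carry the same cluster UUID; it does not verify that.

```shell
# Read lines of "node uuid:seqno" on stdin (collected by hand from each
# node's recovery output) and print the node with the highest seqno.
pick_bootstrap_node() {
    awk '{
        n = split($2, parts, ":")
        seqno = parts[n] + 0          # seqno is the part after the last colon
        if (NR == 1 || seqno > best) { best = seqno; node = $1 }
    }
    END { if (NR > 0) print node, best }'
}

# Example with made-up positions (the UUID and seqnos are illustrative only)
printf '%s\n' \
    'node1 11111111-2222-3333-4444-555555555555:1203' \
    'node2 11111111-2222-3333-4444-555555555555:1210' \
    'node3 11111111-2222-3333-4444-555555555555:1198' | pick_bootstrap_node
```

For the sample input this prints node2 1210, so node2 would be the one to bootstrap from.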

SST Failures

If SST fails during node joining:

# Check the error log
sudo tail -100 /var/log/mysql/error.log | grep -i "sst\|wsrep"

# Common causes:
# 1. Wrong SST user credentials — verify wsrep_sst_auth
# 2. mariabackup not installed — install mariadb-backup
# 3. Firewall blocking port 4444 — open it
# 4. Disk full on joining node — free space
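
Cause 4 can be checked before retrying the join. The sketch below assumes the joiner needs roughly as much free space as the donor's data directory occupies, which is an approximation; has_free_space is a hypothetical helper built on df.

```shell
# Succeed if `dir` has at least `need_kb` kilobytes free.
# On the donor, `sudo du -sk /var/lib/mysql` gives a rough value for need_kb.
has_free_space() {
    need_kb="$1"
    dir="$2"
    [ -d "$dir" ] || return 1
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge "$need_kb" ]
}

# Example: is there room for a 50 GB dataset on the joiner's datadir volume?
if has_free_space $((50 * 1024 * 1024)) /var/lib/mysql; then
    echo "enough space for SST"
else
    echo "not enough space for SST"
fi
```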

Cluster Won’t Start After Safe Shutdown

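If you gracefully stopped all nodes and the cluster refuses to start, check which node is marked as safe to bootstrap:

# Check seqno and safe_to_bootstrap in grastate.dat on every node
cat /var/lib/mysql/grastate.dat

# The node that was stopped last normally has safe_to_bootstrap: 1;
# bootstrap from that node.
# If it is 0 on all nodes, pick the node with the highest seqno,
# set the flag to 1 on that node only, and bootstrap from it
sudo sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
sudo galera_new_cluster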
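
Reading the two relevant fields out of grastate.dat can be scripted, which helps when comparing several nodes. This is a sketch with a hypothetical helper, grastate_field; the sample file contents are illustrative values, not output from a real cluster.

```shell
# Print the value of one field ("seqno", "uuid", "safe_to_bootstrap")
# from a grastate.dat file.
grastate_field() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}

# Example against a throwaway copy with illustrative values
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    11111111-2222-3333-4444-555555555555
seqno:   -1
safe_to_bootstrap: 0
EOF
echo "seqno=$(grastate_field seqno "$tmp") safe=$(grastate_field safe_to_bootstrap "$tmp")"
rm -f "$tmp"
```

A seqno of -1, as in the sample, means the position was not written out at shutdown, so the file cannot be used to compare nodes and the position has to be recovered with galera_recovery instead.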

Gotchas and Edge Cases

No ALTER TABLE on large tables during traffic: Large DDL operations will block the entire cluster due to Total Order Isolation (TOI). Use pt-online-schema-change or rolling schema upgrades instead.
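Only InnoDB: Galera reliably replicates only InnoDB tables. MyISAM, Aria, and other engines are not supported for production use (replication of MyISAM and Aria is experimental).
Auto-increment gaps: Auto-increment values will have gaps, because Galera gives each node its own auto_increment_increment and auto_increment_offset (wsrep_auto_increment_control) and innodb_autoinc_lock_mode = 2 interleaves allocations. This is expected and necessary for multi-master writes.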
Minimum 3 nodes: Always run an odd number of nodes (3, 5, 7) to avoid split-brain scenarios where both partitions are equal in size.
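gcache sizing: IST only works while the missing transactions are still in the donor's gcache, which defaults to 128M. Set wsrep_provider_options = "gcache.size=1G" (or larger on write-heavy clusters) so that nodes that were briefly offline can rejoin via IST, which is much faster than a full SST.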
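
A rough way to choose a gcache size is to multiply the replicated write rate by the longest outage a node should survive without a full SST. The sketch below is a rule of thumb only: gcache_mib is a hypothetical helper, and the 20% safety margin is an assumption of this example. The write rate can be estimated by sampling the wsrep_replicated_bytes and wsrep_received_bytes status counters twice and dividing the difference by the interval.

```shell
# Estimate a gcache size in MiB that covers `window_s` seconds of downtime
# at `bytes_per_s` of replicated write traffic, with a 20% safety margin.
# Rule of thumb only; the 20% margin is an assumption of this sketch.
gcache_mib() {
    bytes_per_s="$1"
    window_s="$2"
    echo $(( bytes_per_s * window_s * 12 / 10 / 1024 / 1024 + 1 ))
}

# Example: 200 KiB/s of writes, one hour of tolerated downtime
gcache_mib 204800 3600
```

For the example values this prints 844, which would suggest a gcache.size of about 1G once rounded up.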

Summary

MariaDB Galera Cluster provides synchronous multi-master replication with automatic failover.
Always bootstrap from the node with the most recent data after a full cluster shutdown.
Use mariabackup as the SST method for production clusters.
Monitor wsrep_cluster_size, wsrep_cluster_status, and wsrep_ready for cluster health.
Run an odd number of nodes (minimum 3) to avoid split-brain scenarios.