Operating Patroni H/A Across Multiple Regions — Part 4: Split-Brain Prevention
In multi-region HA, the real danger is not the failure itself but Split-Brain — two nodes simultaneously acting as Primary and accepting writes. This post explains Patroni's three-layer defense architecture (DCS Leader Lock + Quorum, Linux Watchdog, STONITH/Fencing), TTL/loop_wait/safety_margin tuning principles, DCS Failsafe Mode caveats, and emergency recovery procedures for an actual Split-Brain event.
Series — Operating Patroni H/A Across Multiple Regions
- Part 1 — Fundamentals and Architecture Design Principles
- Part 2 — Synchronous Multi-DC Setup in Practice
- Part 3 — Async Replication + Standby Cluster Setup
- Part 4 — Split-Brain Prevention (this post)
- Part 5 — Failover Runbook and DR Drills
- Part 6 — Monitoring, Operational Automation, and Best Practices
Table of Contents
- Why Is Split-Brain So Dangerous?
- How Split-Brain Actually Happens
- Patroni's Three-Layer Defense Architecture
- Layer 1 — DCS(etcd) Leader Lock and Quorum
- Layer 2 — Linux Watchdog
- Layer 3 — STONITH (Cloud Fencing)
- Bonus: DCS Failsafe Mode
- Defense Layer Timeline Visualization
- TTL, loop_wait, safety_margin Tuning Guide
- Common Troubleshooting
- References
1. Why Is Split-Brain So Dangerous?
Split-Brain occurs when two or more nodes in a cluster simultaneously believe they are the Primary (Leader). Both nodes start accepting client writes, and the data diverges into separate, incompatible timelines: a Diverging Timeline.
Normal state:
Client --> [ Primary A (Seoul) ] --WAL--> [ Replica B (Tokyo) ]
Split-Brain state:
Client --> [ Primary A (Seoul) ] --writes--> Timeline #3
Client --> [ Primary B (Tokyo) ] --writes--> Timeline #4 (diverged!)
The two timelines can never be automatically merged -> permanent data loss
This is worse than simple downtime because:
| Consequence | Description |
|---|---|
| Data conflict | Different values written to the same primary key |
| Data loss | Transactions on the diverged timeline cannot be recovered |
| Referential integrity violation | FK relationships committed in different states on each node |
| Audit / regulatory violation | Consistency guarantees for financial or healthcare data broken |
| Business trust erosion | Customer data inconsistency may create legal liability |
Split-Brain is also hard to detect after the fact. Both nodes appear healthy and produce no errors in their logs. The damage typically surfaces much later as application-level data inconsistencies.
2. How Split-Brain Actually Happens
Scenario 1: Patroni Process Crash (most common)
t=0s Primary A renews Leader Key in etcd (TTL=30s) and sends a Watchdog keepalive
t=5s Patroni process killed by the OOM killer (SIGKILL)
t=5s PostgreSQL is still alive -> continues accepting client writes
t=25s (with Watchdog) timeout, TTL - safety_margin after the last keepalive -> Primary A node force-rebooted <- the defense line
t=30s etcd TTL expires -> Replica B enters leader election
t=32s Replica B -> promoted to Primary B
t=32s (without Watchdog) Primary A (PG) + Primary B (PG) both accepting writes = SPLIT-BRAIN!
Even when the Patroni process dies, PostgreSQL continues running independently. Without a Watchdog, Split-Brain begins the moment the new Primary is promoted and lasts until something stops the old PostgreSQL instance.
Scenario 2: Network Partition (most dangerous in multi-region)
With three etcd nodes distributed as DC1=2, DC2=1, a partition leaves DC1 holding Quorum, so Primary A keeps renewing its Leader Key. DC2 cannot promote, so there is no Split-Brain.
But if etcd is misplaced as DC1=1, DC2=2:
t=0s Network partition
t=0s DC1 etcd-1: loses Quorum -> Primary A's Leader Key renewal fails
t=10s DC1: Patroni starts PostgreSQL demotion
t=10s DC2 etcd-2,3: holds Quorum -> Replica B attempts promotion
t=12s DC2: Primary B promotion completes
t=12s DC1: if the demotion hangs, PostgreSQL is still accepting writes = SPLIT-BRAIN! (without Watchdog)
Scenario 3: VM Hypervisor Pause (cloud-specific)
In AWS/GCP, a hypervisor may suspend a VM during Live Migration or a snapshot. While paused, Patroni cannot respond. When the VM wakes up, Patroni tries to renew its Leader Lock — but the TTL may have already expired and a new Primary may have been elected.
3. Patroni's Three-Layer Defense Architecture
Patroni stacks multiple protection layers. Each one works independently, so if one fails, the next still catches the stale Primary.
4. Layer 1 — DCS(etcd) Leader Lock and Quorum
The most fundamental defense is etcd's Leader Lock (a TTL-based key). The Primary node renews its Leader Key in etcd via CAS (Compare-And-Swap) every loop_wait seconds. If renewal keeps failing for retry_timeout seconds, Patroni demotes PostgreSQL.
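The lock semantics described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToyDCS`, `LeaderKey`, and the node name `pg-tokyo-1` are hypothetical, time is passed in explicitly as `now`, and none of this is etcd's or Patroni's real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderKey:
    owner: str         # node currently holding the lock
    expires_at: float  # time at which the key disappears


class ToyDCS:
    """In-memory stand-in for a single TTL-based leader key (hypothetical)."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.key: Optional[LeaderKey] = None

    def _expire(self, now: float) -> None:
        # Drop the key once its TTL has elapsed.
        if self.key is not None and now >= self.key.expires_at:
            self.key = None

    def acquire(self, node: str, now: float) -> bool:
        """Create the key only if it is absent (compare-and-swap on 'no key')."""
        self._expire(now)
        if self.key is None:
            self.key = LeaderKey(node, now + self.ttl)
            return True
        return False

    def renew(self, node: str, now: float) -> bool:
        """Extend the TTL only if `node` still owns the key."""
        self._expire(now)
        if self.key is not None and self.key.owner == node:
            self.key.expires_at = now + self.ttl
            return True
        return False


dcs = ToyDCS(ttl=30)
print(dcs.acquire("pg-seoul-1", now=0))   # True: the key was absent
print(dcs.acquire("pg-tokyo-1", now=10))  # False: the key is held and not expired
print(dcs.renew("pg-seoul-1", now=10))    # True: TTL now runs until t=40
print(dcs.acquire("pg-tokyo-1", now=41))  # True: the old key expired at t=40
print(dcs.renew("pg-seoul-1", now=45))    # False: the old leader must step down
```

The last line is the case that matters for Split-Brain: a former leader that comes back late finds that its renewal fails and must not keep acting as Primary.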
How etcd Quorum Prevents Split-Brain
3-node etcd (DC1=Seoul, DC2=Tokyo, DC3=Singapore):
Normal: [Seoul etcd] <-- Raft --> [Tokyo etcd] <-- Raft --> [Singapore etcd]
(Quorum = at least 2/3 agreement required)
DC1 fails:
Tokyo + Singapore = 2/3 Quorum maintained -> can issue new Leader Key (OK)
Seoul alone = 1/3 Quorum lost -> cannot renew Leader Key -> demoted (OK)
DC1 + DC2 fail simultaneously:
Singapore alone = 1/3 Quorum lost -> no node can become Primary
(availability sacrificed, but no Split-Brain) (OK)
Key TTL Parameters
# patroni.yml -> bootstrap.dcs section
bootstrap:
  dcs:
    # TTL: Leader Key validity period in seconds. Lock expires if not renewed within TTL.
    ttl: 30
    # loop_wait: Patroni HA loop interval in seconds.
    # Primary attempts to renew Leader Key every loop_wait seconds.
    loop_wait: 10
    # retry_timeout: DCS reconnect time limit in seconds.
    # Primary demotes if DCS connection is not restored within this window.
    retry_timeout: 10
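Watchdog timing budget (ttl=30, safety_margin=5, loop_wait=10, retry_timeout=10):
Watchdog timeout = TTL - safety_margin = 30 - 5 = 25 seconds
Time for the HA loop to finish before a reset = TTL - safety_margin - loop_wait = 30 - 5 - 10 = 15 seconds
Time left to terminate client connections when DCS is unreachable = TTL - safety_margin - loop_wait - retry_timeout = 30 - 5 - 10 - 10 = 5 seconds
Either Patroni stops writes within this budget, or the Watchdog resets the node 25 seconds after the last keepalive, which is still before the Leader Key expires.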
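The arithmetic above can be checked with a small helper. This is a sketch, not part of Patroni: the function name `watchdog_budget` and the dictionary keys are hypothetical, and the only inputs are the four settings already discussed.

```python
def watchdog_budget(ttl: int, loop_wait: int, retry_timeout: int,
                    safety_margin: int = 5) -> dict:
    """Derive the watchdog timeout and the time budgets that follow from it."""
    # safety_margin = -1 means "expire after half of the TTL".
    if safety_margin == -1:
        timeout = ttl // 2
    else:
        timeout = ttl - safety_margin
    return {
        "watchdog_timeout": timeout,
        # ttl - safety_margin - loop_wait
        "ha_loop_budget": timeout - loop_wait,
        # ttl - safety_margin - loop_wait - retry_timeout
        "shutdown_budget": timeout - loop_wait - retry_timeout,
    }


print(watchdog_budget(ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5))
# {'watchdog_timeout': 25, 'ha_loop_budget': 15, 'shutdown_budget': 5}
```

A zero or negative `shutdown_budget` would mean the Watchdog can fire before Patroni has had any time to stop PostgreSQL cleanly after giving up on the DCS.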
etcd Placement Golden Rules
| Placement | Quorum maintained | Split-Brain risk | Recommended? |
|---|---|---|---|
| All 3 in DC1 | Fails if DC1 goes down | None | Not recommended (SPOF) |
| DC1=2, DC2=1 | Fails if DC1 goes down | None, but DC2 can never promote on its own | Use with caution |
| 1 per DC (3 DCs) | Survives any single DC failure | None | Recommended |
| 5 nodes, 2-2-1 across 3 DCs | Survives any single DC failure (plus one more node failure if the 1-node DC is the one lost) | None | Best |
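Placement claims like these are easy to verify by brute force. The sketch below assumes plain majority quorum (more than half of all etcd members) and checks how many whole-DC failures each layout tolerates. The function names and the DC labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def has_quorum(total: int, alive: int) -> bool:
    """Majority quorum: strictly more than half of all members."""
    return alive >= total // 2 + 1


def survives(placement: Dict[str, int], failed_dcs: Set[str]) -> bool:
    """True if the members outside the failed DCs still form a quorum."""
    total = sum(placement.values())
    alive = sum(n for dc, n in placement.items() if dc not in failed_dcs)
    return has_quorum(total, alive)


def tolerated_dc_failures(placement: Dict[str, int]) -> int:
    """Largest k such that quorum survives the loss of ANY k data centers."""
    dcs = list(placement)
    tolerated = 0
    for k in range(1, len(dcs)):
        if all(survives(placement, set(combo)) for combo in combinations(dcs, k)):
            tolerated = k
        else:
            break
    return tolerated


placements = {
    "all 3 in DC1": {"DC1": 3},
    "DC1=2, DC2=1": {"DC1": 2, "DC2": 1},
    "1 per DC":     {"DC1": 1, "DC2": 1, "DC3": 1},
    "2-2-1":        {"DC1": 2, "DC2": 2, "DC3": 1},
}
for name, placement in placements.items():
    print(f"{name}: tolerates {tolerated_dc_failures(placement)} DC failure(s)")
# all 3 in DC1: tolerates 0 DC failure(s)
# DC1=2, DC2=1: tolerates 0 DC failure(s)
# 1 per DC: tolerates 1 DC failure(s)
# 2-2-1: tolerates 1 DC failure(s)
```

The 2-2-1 layout needs 3 of 5 members, so losing both 2-node DCs at once leaves a single member and no quorum. Its advantage over the 3-node layout is tolerance of an extra node failure, not of a second DC failure.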
5. Layer 2 — Linux Watchdog
When DCS communication fails and Patroni tries to demote PostgreSQL, if Patroni itself crashes or becomes unresponsive, it cannot stop PostgreSQL. That is the exact gap the Watchdog seals.
Patroni activates the Watchdog before promoting to Primary. With watchdog.mode: required, if Watchdog activation fails, the node refuses to become Leader.
How the Watchdog Works
Normal operation:
Patroni -> sends keepalive to /dev/watchdog (every loop_wait seconds)
Watchdog timer resets -> system stays up
Abnormal (Patroni crashed / unresponsive):
Patroni -> keepalive stops
Watchdog timer expires (TTL - safety_margin = 25 seconds)
Linux kernel -> hard-resets the node (no clean shutdown)
PostgreSQL stops -> no more writes accepted (OK)
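One detail worth making explicit is that the Watchdog timer counts from the last keepalive, not from the moment Patroni dies. The toy model below illustrates that. `ToyWatchdog` and `first_reset_time` are hypothetical names, and the code never touches a real /dev/watchdog device.

```python
from typing import Iterable, Optional


class ToyWatchdog:
    """Simulated countdown timer that is re-armed by keepalives (hypothetical)."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.deadline: Optional[float] = None  # None means "not activated yet"

    def keepalive(self, now: float) -> None:
        # Activate or re-arm the timer. A keepalive that arrives after the
        # deadline is ignored: the node has already been reset by then.
        if self.deadline is None or now < self.deadline:
            self.deadline = now + self.timeout

    def has_fired(self, now: float) -> bool:
        return self.deadline is not None and now >= self.deadline


def first_reset_time(keepalive_times: Iterable[float],
                     timeout: float) -> Optional[float]:
    """Time at which the node is reset, given when keepalives were sent."""
    wd = ToyWatchdog(timeout)
    for t in keepalive_times:
        wd.keepalive(t)
    return wd.deadline


# Keepalives every loop_wait=10s; Patroni crashes at t=22, so t=20 is the last one.
print(first_reset_time([0, 10, 20], timeout=25))  # 45
```

In this run the node is reset at t=45, which is 23 seconds after the crash but exactly 25 seconds after the last keepalive.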
Step 1 — Load the softdog Kernel Module
# Run on all Primary candidate nodes (as root)
modprobe softdog
# Grant /dev/watchdog access to Patroni's user (postgres)
chown postgres:postgres /dev/watchdog
# Persist across reboots
echo "softdog" >> /etc/modules-load.d/patroni-watchdog.conf
# Auto-grant permissions via udev rule
cat > /etc/udev/rules.d/99-patroni-watchdog.rules <<'EOF'
KERNEL=="watchdog", OWNER="postgres", GROUP="postgres", MODE="0600"
EOF
# Verify
ls -la /dev/watchdog
# crw------- 1 postgres postgres 10, 130 Apr 26 10:00 /dev/watchdog
Test environment tip: To verify Watchdog behavior without an actual reboot, use the soft_noboot=1 option. On timeout, instead of rebooting, the kernel only writes a warning to dmesg.
# Test mode: log only, no reboot
modprobe softdog soft_noboot=1
Step 2 — Add Watchdog Config to patroni.yml
# /etc/patroni/patroni.yml
watchdog:
  # mode options:
  #   off       - Watchdog disabled (not recommended for production)
  #   automatic - Use if available, but node can still become Leader without it (default)
  #   required  - Watchdog mandatory; refuses Leader if not available (most secure)
  mode: required # Use required in production
  device: /dev/watchdog # Watchdog device path
  # safety_margin: how many seconds before TTL expiry to trigger Watchdog timeout
  # Default: 5 seconds
  # Set to -1 for Watchdog timeout = TTL / 2 (absolute guarantee mode)
  safety_margin: 5
Step 3 — Hardware Watchdog (recommended for on-premises)
In cloud environments softdog is usually sufficient. It runs inside the kernel, though, so if the kernel itself hangs, softdog stops working too. On bare-metal servers, use a hardware Watchdog to cover that case.
# IPMI-based hardware Watchdog (HP iLO, Dell iDRAC, etc.)
# Identify device
ls /dev/watchdog*
# /dev/watchdog0 <- hardware Watchdog
# Change the device path in Patroni config:
# watchdog.device: /dev/watchdog0
# Check Watchdog status (wdctl ships with util-linux)
wdctl /dev/watchdog0
Verify Watchdog Is Active
# Check Patroni logs for Watchdog activation
journalctl -u patroni | grep -i watchdog
# Expected output:
# INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
# INFO: Watchdog activated successfully
# Check via Patroni REST API
curl -s http://localhost:8008/patroni | python3 -m json.tool | grep watchdog
# "watchdog_failed": false
6. Layer 3 — STONITH (Cloud Fencing)
STONITH (Shoot The Other Node In The Head) is an external mechanism to forcibly terminate a problem node. While the Watchdog is the node's internal self-defense, STONITH is external forced isolation. It serves as the last resort when promoting a 2-DC Standby Cluster or in environments without a Watchdog.
AWS EC2 STONITH Example
#!/usr/bin/env python3
# /usr/local/bin/stonith-aws.py
# Usage: stonith-aws.py <instance-id> [stop|terminate]
import sys
import boto3
import logging
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger(__name__)
def stonith_instance(instance_id: str, action: str = "stop") -> bool:
    """
    Force-stop or terminate an AWS EC2 instance.
    action: "stop" (reusable) or "terminate" (permanent deletion)
    """
    ec2 = boto3.client('ec2')
    try:
        logger.info(f"STONITH: {action} targeting instance {instance_id}")
        if action == "terminate":
            response = ec2.terminate_instances(InstanceIds=[instance_id])
            state = response['TerminatingInstances'][0]['CurrentState']['Name']
            waiter_name = 'instance_terminated'
        else:
            response = ec2.stop_instances(
                InstanceIds=[instance_id],
                Force=True  # Skip graceful shutdown
            )
            state = response['StoppingInstances'][0]['CurrentState']['Name']
            waiter_name = 'instance_stopped'
        logger.info(f"STONITH initiated: instance {instance_id} -> {state}")
        waiter = ec2.get_waiter(waiter_name)
        waiter.wait(
            InstanceIds=[instance_id],
            WaiterConfig={'Delay': 5, 'MaxAttempts': 24}  # Wait up to 2 minutes
        )
        logger.info(f"STONITH confirmed: instance {instance_id} fully stopped")
        return True
    except Exception as e:
        logger.error(f"STONITH failed: {e}")
        return False


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: stonith-aws.py <instance-id> [stop|terminate]")
        sys.exit(1)
    instance_id = sys.argv[1]
    action = sys.argv[2] if len(sys.argv) > 2 else "stop"
    success = stonith_instance(instance_id, action)
    sys.exit(0 if success else 1)
Before deploying this script to production, verify three things: the IAM Role running the script has ec2:StopInstances or ec2:TerminateInstances permissions; there is input validation to guard against incorrect instance IDs; and the script has been tested in a non-production environment to rule out misfires.
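The input validation mentioned above could look roughly like the following. This is a minimal sketch: `validate_stonith_args` and the `fence_targets` allow-list are hypothetical additions, not part of the script above, and the instance IDs are placeholders. The pattern relies on EC2 instance IDs being `i-` followed by 8 or 17 lowercase hexadecimal characters.

```python
import re
from typing import Iterable, Optional, Tuple

# EC2 instance IDs: "i-" plus 8 (legacy) or 17 lowercase hex characters.
INSTANCE_ID_RE = re.compile(r"^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$")
ALLOWED_ACTIONS = ("stop", "terminate")


def validate_stonith_args(instance_id: str, action: str,
                          fence_targets: Optional[Iterable[str]] = None
                          ) -> Tuple[str, str]:
    """Return (instance_id, action) if both are acceptable, else raise ValueError."""
    if not INSTANCE_ID_RE.match(instance_id):
        raise ValueError(f"malformed instance ID: {instance_id!r}")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # Optional allow-list: only fence instances known to be cluster members.
    if fence_targets is not None and instance_id not in set(fence_targets):
        raise ValueError(f"{instance_id} is not a registered fencing target")
    return instance_id, action


# Placeholder IDs for illustration only.
cluster_nodes = ["i-0123456789abcdef0", "i-0fedcba9876543210"]
print(validate_stonith_args("i-0123456789abcdef0", "stop", cluster_nodes))
# ('i-0123456789abcdef0', 'stop')
```

Calling this before any EC2 API request means a typo or an unexpected argument fails fast instead of stopping an unrelated instance.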
Automate STONITH via Patroni on_stop Callback
Patroni supports callback scripts that run on role changes. Using the on_stop event, the Primary can immediately block client connections the moment it is demoted.
# patroni.yml
postgresql:
  callbacks:
    on_stop: /usr/local/bin/patroni-on-stop.sh
    on_role_change: /usr/local/bin/patroni-on-role-change.sh
#!/bin/bash
# /usr/local/bin/patroni-on-stop.sh
# Args: $1=action(on_stop), $2=role, $3=scope
# Newer Patroni releases pass "primary" as the role; older ones pass "master".
ACTION="$1"
ROLE="$2"
if [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; then
  logger -t patroni-callback "Primary demotion detected: blocking port 5432 immediately"
  iptables -I INPUT -p tcp --dport 5432 -j DROP
  iptables -I OUTPUT -p tcp --sport 5432 -j DROP
  sleep 30
  iptables -D INPUT -p tcp --dport 5432 -j DROP
  iptables -D OUTPUT -p tcp --sport 5432 -j DROP
fi
GCP / Azure / On-Premises STONITH Examples
# GCP: stop instance via gcloud CLI
gcloud compute instances stop INSTANCE_NAME \
  --zone=asia-northeast3-a \
  --project=MY_PROJECT
# Azure: force-stop VM via az CLI
# (no --no-wait: fencing is only complete once the stop is confirmed)
az vm stop \
  --resource-group MY_RG \
  --name MY_VM \
  --skip-shutdown
# On-premises: cut power via IPMI
ipmitool -H 10.1.0.10 -U admin -P password \
chassis power off
7. Bonus: DCS Failsafe Mode
DCS Failsafe Mode works in the opposite direction from Split-Brain prevention. Where Split-Brain arises from a Primary that lives too long, DCS Failsafe Mode addresses the problem of a Primary being unnecessarily demoted during a transient DCS instability.
How DCS Failsafe Mode Works
When Failsafe Mode is enabled and the Primary fails to renew its Leader Lock in the DCS, instead of demoting immediately it sends a POST /failsafe request directly to every cluster member via the REST API. If all members respond, the Primary stays up until the DCS recovers. If even one member fails to respond, the Primary demotes immediately.
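The all-or-nothing rule described above reduces to a very small decision function. The sketch below is illustrative only: `primary_may_continue` is a hypothetical name, the member names are made up, and the `responds` callable stands in for the real REST call so that no network access is needed.

```python
from typing import Callable, Iterable


def primary_may_continue(members: Iterable[str],
                         responds: Callable[[str], bool]) -> bool:
    """True only if every other cluster member acknowledged the failsafe request.

    `responds` is a stand-in for "POST /failsafe to this member succeeded".
    """
    return all(bool(responds(member)) for member in members)


# Simulated acknowledgements (hypothetical member names).
all_ok = {"pg-tokyo-1": True, "pg-singapore-1": True}
one_down = {"pg-tokyo-1": True, "pg-singapore-1": False}

print(primary_may_continue(all_ok, all_ok.get))      # True  -> keep running as Primary
print(primary_may_continue(one_down, one_down.get))  # False -> demote
```

Requiring every member, rather than a majority, is what keeps this safe: a member that cannot be reached might be on the other side of a partition with a working DCS and could be promoted there.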
Enable DCS Failsafe Mode
patronictl -c /etc/patroni/patroni.yml edit-config
# Dynamic Configuration change
failsafe_mode: true
# Before enabling, verify:
# 1. All cluster members are running the same Patroni version
# 2. All members can reach each other on the REST API port (8008)
# 3. Every member must answer /failsafe, so confirm that no member is routinely unreachable
When NOT to Use DCS Failsafe Mode
DCS Failsafe Mode supplements — it does not replace — quorum design, Watchdog, and STONITH.
Avoid when:
- Kubernetes API is used as DCS (K8s API instability is common)
- Direct REST API communication between members is blocked by a firewall
- Only 2 cluster members (one failure = instant demotion)
Effective when:
- etcd runs on dedicated nodes and the network is stable
- 3+ cluster members spread across AZs
- Preventing Primary demotion during a planned DCS maintenance window
8. Defense Layer Timeline Visualization
The sequence below shows when and how each defense layer fires when a Primary node loses contact with DCS. Times are approximate and counted from the last successful Leader Key renewal. (ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5)
t=0s DCS(etcd) contact lost detected
t=0-10s Patroni: retry reconnection for retry_timeout seconds
t=10s Reconnect failed -> Patroni: starts PostgreSQL demotion
[!] Watchdog keepalive stops (if Patroni is unresponsive)
t=25s [!] Watchdog timeout (TTL - safety_margin = 30 - 5)
-> kernel force-reboots node -> PostgreSQL stops
t=30s [!] etcd TTL expires -> other nodes enter Leader Race
t=32s New Primary promoted -> normal operation restored
[Layer 1] etcd TTL (30s): |<----------- 30s ----------->|
[Layer 2] Watchdog (25s): |<--------- 25s -------->|
[Layer 3] STONITH: operator-triggered (when needed)
9. TTL, loop_wait, safety_margin Tuning Guide
In multi-region environments, high network latency (RTT) can make the defaults unstable.
Tuning Principles
1. loop_wait >= inter-region RTT x 3
2. retry_timeout >= loop_wait
3. ttl >= loop_wait + 2 x retry_timeout (the rule Patroni itself validates)
4. safety_margin = ttl - loop_wait - retry_timeout - buffer(5s)
5. watchdog timeout = ttl - safety_margin
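A candidate set of values can be sanity-checked before it goes into edit-config. The helper below is a sketch with a hypothetical name, `check_timeouts`. It tests the rule Patroni itself validates, loop_wait + 2 x retry_timeout <= ttl, and also that the Watchdog timeout leaves some time after loop_wait and retry_timeout.

```python
from typing import List


def check_timeouts(ttl: int, loop_wait: int, retry_timeout: int,
                   safety_margin: int = 5) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if loop_wait + 2 * retry_timeout > ttl:
        problems.append(
            f"loop_wait + 2*retry_timeout = {loop_wait + 2 * retry_timeout} "
            f"exceeds ttl = {ttl}"
        )
    # safety_margin = -1 selects a watchdog timeout of ttl // 2.
    timeout = ttl // 2 if safety_margin == -1 else ttl - safety_margin
    slack = timeout - loop_wait - retry_timeout
    if slack <= 0:
        problems.append(
            f"watchdog timeout {timeout}s leaves {slack}s after "
            f"loop_wait + retry_timeout"
        )
    return problems


print(check_timeouts(ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5))
# []
for problem in check_timeouts(ttl=40, loop_wait=15, retry_timeout=15, safety_margin=10):
    print("PROBLEM:", problem)  # two problems are reported for this combination
```

Note that `safety_margin: -1` with the default ttl=30, loop_wait=10, retry_timeout=10 also trips the second check (15 - 10 - 10 < 0), so the half-TTL mode usually needs a larger ttl or shorter loop and retry intervals.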
Recommended Settings by Environment
# -- Single DC (local, RTT < 1ms) -------------------------------
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
# watchdog timeout = 30 - 5 = 25s
watchdog:
  mode: required
  safety_margin: 5
# -- Domestic multi-region (Seoul-Busan, RTT ~10ms) -------------
bootstrap:
  dcs:
    ttl: 45 # loop_wait + 2 x retry_timeout = 15 + 30 = 45
    loop_wait: 15
    retry_timeout: 15
watchdog:
  mode: required
  safety_margin: 10 # watchdog timeout = 45 - 10 = 35s
# -- Global multi-region (Seoul-Singapore, RTT ~60ms) -----------
bootstrap:
  dcs:
    ttl: 60
    loop_wait: 20
    retry_timeout: 20
watchdog:
  mode: required
  safety_margin: 15 # watchdog timeout = 60 - 15 = 45s
# -- Absolute guarantee mode (Split-Brain never tolerated) ------
watchdog:
  mode: required
  safety_margin: -1 # watchdog timeout = ttl / 2
Raising ttl and loop_wait makes the cluster more tolerant of inter-region latency and transient DCS hiccups, but it also lengthens the time to detect failures and trigger automatic failover. Balance the values against your RTO requirements.
Check and Apply Current Settings
# View current DCS dynamic config
patronictl -c /etc/patroni/patroni.yml show-config
# Apply changes (live, no downtime required)
patronictl -c /etc/patroni/patroni.yml edit-config
# Edit ttl, loop_wait, retry_timeout then save
# Verify changes took effect
patronictl -c /etc/patroni/patroni.yml show-config
10. Common Troubleshooting
Issue 1: Watchdog Activation Failure — Node Refuses Leader Role
CRITICAL: watchdog.device /dev/watchdog not found
WARNING: Could not activate watchdog. Refusing to be promoted
# Verify softdog module is loaded
lsmod | grep softdog
# If not loaded:
modprobe softdog
# Check /dev/watchdog exists
ls -la /dev/watchdog
# Confirm ownership (must be postgres)
stat /dev/watchdog
chown postgres:postgres /dev/watchdog
# Reload udev rules
udevadm control --reload-rules && udevadm trigger
Issue 2: Frequent Primary Demotion Due to Transient DCS Instability
INFO: DCS not accessible, stopping Postgres
INFO: demoted self because DCS is not accessible and I was a leader
# Option 1: Enable DCS Failsafe Mode (tolerate brief DCS instability)
patronictl -c /etc/patroni/patroni.yml edit-config
# Add: failsafe_mode: true
# Option 2: Raise retry_timeout (more tolerant of transient DCS hiccups)
# retry_timeout: 15 -> retry_timeout: 30
# Raise ttl as well so that loop_wait + 2 x retry_timeout <= ttl still holds
patronictl -c /etc/patroni/patroni.yml edit-config
Issue 3: Old Primary Attempts Promotion After Network Partition Heals
WARNING: Stale leader key found, ignoring
INFO: following a different leader because my replay LSN is behind
This is expected behavior. Patroni automatically rejoins the old Primary as a Replica of the current Leader, running pg_rewind first if use_pg_rewind is enabled and the timelines have diverged. If pg_rewind fails:
# Force re-initialize without pg_rewind
patronictl -c /etc/patroni/patroni.yml reinit \
pg-seoul-cluster pg-seoul-1 --force
Issue 4: Two Nodes Both Claiming to Be Primary (Real Split-Brain)
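# -- Emergency response procedure --------------------------------
# 1. Identify the stale Primary: the node that does NOT hold the Leader Key
#    in "patronictl list". Record both LSNs for later reconciliation; the
#    stale node can have the higher LSN if it kept taking writes.
psql -h 10.1.0.10 -U postgres -c "SELECT pg_current_wal_lsn(), pg_is_in_recovery();"
psql -h 10.2.0.10 -U postgres -c "SELECT pg_current_wal_lsn(), pg_is_in_recovery();"
# 2. Stop the stale node immediately (run on that node)
systemctl stop patroni
systemctl stop postgresql
# 3. Block port 5432 on that node
iptables -I INPUT -p tcp --dport 5432 -j DROP
# 4. Check the surviving Primary's cluster state
patronictl -c /etc/patroni/patroni.yml list
# 5. Keep a copy of the stale data directory (example path): it holds the
#    diverged writes needed for manual reconciliation
cp -a /var/lib/postgresql/17/main /var/lib/postgresql/17/main.splitbrain
# 6. Re-initialize the stopped node as a Replica, then lift the port block
rm -rf /var/lib/postgresql/17/main/*
systemctl start patroni
iptables -D INPUT -p tcp --dport 5432 -j DROP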
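When comparing the LSN values returned in step 1, compare them numerically. An LSN is printed as two hexadecimal numbers separated by a slash, so a plain string comparison can give the wrong answer. The helpers below are a small sketch with hypothetical names, and the LSN values are made up for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lsn_diff_bytes(a: str, b: str) -> int:
    """How many bytes of WAL `a` is ahead of `b` (negative if behind)."""
    return lsn_to_int(a) - lsn_to_int(b)


seoul = "0/A000000"    # example value only
tokyo = "0/10000000"   # example value only

print(seoul > tokyo)                          # True  (string comparison, misleading)
print(lsn_to_int(seoul) > lsn_to_int(tokyo))  # False (numeric comparison)
print(lsn_diff_bytes(tokyo, seoul))           # 100663296
```

Inside PostgreSQL the same comparison can be done with the pg_lsn type, but a helper like this is handy when the two values were collected from separate psql sessions during an incident.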
References
- Patroni Official Documentation — Watchdog Support
- Patroni Official Documentation — DCS Failsafe Mode
- Percona — Patroni Split-Brain Prevention Architecture
- Stormatics — Understanding Split-Brain Scenarios in HA PostgreSQL Clusters
- Medium (Oz) — Avoiding Split-Brain Using Watchdog/softdog (2025)
- Patroni POSETTE 2025 — What is Patroni, really?