2026.06.04 · 22 min read · Lv.3 Intermediate

Operating Patroni H/A Across Multiple Regions — Part 4: Split-Brain Prevention

In multi-region HA, the real danger is not the failure itself but Split-Brain — two nodes simultaneously acting as Primary and accepting writes. This post explains Patroni's three-layer defense architecture (DCS Leader Lock + Quorum, Linux Watchdog, STONITH/Fencing), TTL/loop_wait/safety_margin tuning principles, DCS Failsafe Mode caveats, and emergency recovery procedures for an actual Split-Brain event.

Series — Operating Patroni H/A Across Multiple Regions

Part 1 — Fundamentals and Architecture Design Principles

Part 2 — Synchronous Multi-DC Setup in Practice

Part 3 — Async Replication + Standby Cluster Setup

Part 4 — Split-Brain Prevention (this post)

Part 5 — Failover Runbook and DR Drills

Part 6 — Monitoring, Operational Automation, and Best Practices

Why Is Split-Brain So Dangerous?
How Split-Brain Actually Happens
Patroni's Three-Layer Defense Architecture
Layer 1 — DCS(etcd) Leader Lock and Quorum
Layer 2 — Linux Watchdog
Layer 3 — STONITH (Cloud Fencing)
Bonus: DCS Failsafe Mode
Defense Layer Timeline Visualization
TTL, loop_wait, safety_margin Tuning Guide
Common Troubleshooting
References

1. Why Is Split-Brain So Dangerous?

Split-Brain occurs when two or more nodes in a cluster simultaneously believe they are the Primary (Leader). Both nodes start accepting client writes, and the data diverges into separate, incompatible timelines: a Diverging Timeline.

Normal state:
  Client --> [ Primary A (Seoul) ] --WAL--> [ Replica B (Tokyo) ]

Split-Brain state:
  Client --> [ Primary A (Seoul) ] --writes--> Timeline #3
  Client --> [ Primary B (Tokyo) ] --writes--> Timeline #4 (diverged from #3!)

  The two timelines can never be automatically merged -> permanent data loss

This is worse than simple downtime because:

Consequence	Description
Data conflict	Different values written to the same primary key
Data loss	Transactions on the diverged timeline cannot be recovered
Referential integrity violation	FK relationships committed in different states on each node
Audit / regulatory violation	Consistency guarantees for financial or healthcare data broken
Business trust erosion	Customer data inconsistency may create legal liability

Split-Brain is also hard to detect after the fact. Each node, viewed on its own, can look healthy and may log no errors. The damage typically surfaces much later as application-level data inconsistencies.

2. How Split-Brain Actually Happens

Scenario 1: Patroni Process Crash (most common)

t=0s   Primary A successfully renews Leader Key in etcd (TTL=30s)
t=5s   Patroni process killed by OOM (kill -9)
t=5s   PostgreSQL is still alive -> continues accepting client writes
t=25s  Watchdog timeout (TTL - safety_margin) -> Primary A node force-rebooted <- the defense line
t=30s  etcd TTL expires -> Replica B enters leader election
t=32s  Replica B -> promoted to Primary B
t=32s  Without a Watchdog: Primary A (PG) + Primary B (PG) both accepting writes = SPLIT-BRAIN!

Even when the Patroni process dies, PostgreSQL continues running independently. Without a Watchdog, nothing stops the old PostgreSQL, so the cluster is in a live Split-Brain state from the moment the new Primary is promoted until the old one is shut down.
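To make the role of the Watchdog in this scenario concrete, here is a minimal sketch that computes how long both nodes accept writes. It is a simplified model, not Patroni code: the function name and the promote_delay parameter are assumptions for illustration, and time is counted from the old Primary's last successful renewal, which is also assumed to be its last Watchdog keepalive.

```python
import math
from typing import Optional


def split_brain_window(ttl: float,
                       promote_delay: float,
                       watchdog_timeout: Optional[float] = None) -> float:
    """Seconds during which the old and the new Primary both accept writes.

    t=0 is the old Primary's last successful Leader Key renewal,
    assumed to coincide with its last Watchdog keepalive.
    """
    # The new Primary appears only after the key expires and the election ends.
    new_primary_up = ttl + promote_delay
    if watchdog_timeout is None:
        # Nothing ever stops the old PostgreSQL: the overlap is unbounded.
        return math.inf
    # The Watchdog resets the old node this long after the last keepalive.
    old_primary_down = watchdog_timeout
    return max(0.0, old_primary_down - new_primary_up)


# No Watchdog: the overlap never ends on its own.
print(split_brain_window(ttl=30, promote_delay=2))
# Watchdog firing at ttl - 5: the old node is gone before the new one starts.
print(split_brain_window(ttl=30, promote_delay=2, watchdog_timeout=25))
```

The second call returns zero only because the Watchdog timeout is shorter than the TTL. A timeout longer than ttl + promote_delay would reopen the window.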

Scenario 2: Network Partition (most dangerous in multi-region)

With a 3-node etcd cluster split 2-1 (DC1=2, DC2=1) and the Primary in DC1: if a partition occurs, DC1 holds Quorum and Primary A keeps renewing its Leader Key. DC2 cannot promote, so there is no Split-Brain.

But if etcd is misplaced as DC1=1, DC2=2 while the Primary stays in DC1 (ttl=30, retry_timeout=10):

t=0s   Network partition
t=0s   DC1 etcd-1: loses Quorum -> Primary A's Leader Key renewal fails
t=10s  DC1: retry_timeout elapses -> Patroni starts PostgreSQL demotion
t=30s  DC2 etcd-2,3: holds Quorum, Leader Key TTL expires -> Replica B attempts promotion
t=32s  DC2: Primary B promotion completes
t=32s  DC1: if the demotion hung and PostgreSQL is still accepting writes = SPLIT-BRAIN! (without Watchdog)

Scenario 3: VM Hypervisor Pause (cloud-specific)

In AWS/GCP, a hypervisor may suspend a VM during Live Migration or a snapshot. While paused, Patroni cannot respond. When the VM wakes up, Patroni tries to renew its Leader Lock, but the TTL may have already expired and a new Primary may have been elected. Until Patroni's next HA loop notices this and demotes, the resumed PostgreSQL can still accept writes alongside the new Primary.

3. Patroni's Three-Layer Defense Architecture

Patroni stacks multiple independent protection layers. Each layer acts independently: if one fails, the next takes over.

4. Layer 1 — DCS(etcd) Leader Lock and Quorum

The most fundamental defense is etcd's Leader Lock (TTL-based key). The Primary node renews its Leader Key in etcd via CAS (Compare-And-Swap) every loop_wait seconds. If renewal keeps failing for retry_timeout seconds, Patroni demotes PostgreSQL.
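The semantics of a TTL-based key with compare-and-swap renewal can be illustrated with a toy in-memory model. This is only a sketch of the idea. ToyDCS and its methods are hypothetical names, and a real etcd lease involves Raft consensus and server-side clocks that are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderKey:
    owner: str
    expires_at: float


class ToyDCS:
    """In-memory stand-in for a DCS that holds one TTL-based leader key."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.key: Optional[LeaderKey] = None

    def _current_owner(self, now: float) -> Optional[str]:
        # An expired key behaves as if it had been deleted.
        if self.key is not None and now >= self.key.expires_at:
            self.key = None
        return self.key.owner if self.key else None

    def acquire(self, node: str, now: float) -> bool:
        """Create the key only if nobody currently holds it."""
        if self._current_owner(now) is not None:
            return False
        self.key = LeaderKey(node, now + self.ttl)
        return True

    def renew(self, node: str, now: float) -> bool:
        """Compare-and-swap: extend the TTL only if `node` still owns the key."""
        if self._current_owner(now) != node:
            return False
        self.key = LeaderKey(node, now + self.ttl)
        return True


dcs = ToyDCS(ttl=30)
print(dcs.acquire("A", now=0))    # A becomes leader
print(dcs.renew("A", now=10))     # normal renewal, key now valid until t=40
print(dcs.acquire("B", now=20))   # B cannot take a live key
print(dcs.acquire("B", now=45))   # key expired at t=40, B wins the race
print(dcs.renew("A", now=46))     # A's CAS fails, so A must demote
```

The last call is the important one: once another node owns the key, the old leader's renewal is rejected rather than silently overwriting it.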

How etcd Quorum Prevents Split-Brain

3-node etcd (DC1=Seoul, DC2=Tokyo, DC3=Singapore):

  Normal:  [Seoul etcd] <-- Raft --> [Tokyo etcd] <-- Raft --> [Singapore etcd]
                          (Quorum = at least 2/3 agreement required)

  DC1 fails:
    Tokyo + Singapore = 2/3 Quorum maintained -> can issue new Leader Key (OK)
    Seoul alone       = 1/3 Quorum lost       -> cannot renew Leader Key -> demoted (OK)

  DC1 + DC2 fail simultaneously:
    Singapore alone = 1/3 Quorum lost -> no node can become Primary
    (availability sacrificed, but no Split-Brain) (OK)

Key TTL Parameters

# patroni.yml -> bootstrap.dcs section
bootstrap:
  dcs:
    # TTL: Leader Key validity period in seconds. Lock expires if not renewed within TTL.
    ttl: 30

    # loop_wait: Patroni HA loop interval in seconds.
    # Primary attempts to renew Leader Key every loop_wait seconds.
    loop_wait: 10

    # retry_timeout: DCS reconnect time limit in seconds.
    # Primary demotes if DCS connection is not restored within this window.
    retry_timeout: 10

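HA loop time budget formula (ttl=30, safety_margin=5, loop_wait=10):

Time the HA loop has before the Watchdog fires = TTL - safety_margin - loop_wait
                                               = 30  - 5             - 10        = 15 seconds

If the Primary loses contact with DCS, the HA loop has at least 15 seconds to either renew the key or demote PostgreSQL before the Watchdog force-resets the node. With retry_timeout=10, that leaves at least 5 seconds to terminate client connections. Either way, the old Primary stops accepting writes no later than TTL - safety_margin = 25 seconds after its last keepalive, which is before the Leader Key expires.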
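The arithmetic can be checked for any combination of settings with a few lines of Python. This is a small sketch: the function name is an assumption for illustration, and the safety_margin=-1 branch follows the ttl / 2 behavior described for the Watchdog configuration later in this post.

```python
from typing import Tuple


def watchdog_timing(ttl: int, loop_wait: int, safety_margin: int = 5) -> Tuple[int, int]:
    """Return (watchdog_timeout, slack) in seconds.

    slack is the time left for one HA loop after subtracting loop_wait
    from the Watchdog timeout.
    """
    if safety_margin == -1:
        timeout = ttl // 2              # strictest setting: half the TTL
    else:
        timeout = ttl - safety_margin   # expire safety_margin seconds before the TTL
    slack = timeout - loop_wait
    if slack <= 0:
        raise ValueError("Watchdog would fire before a single HA loop can finish")
    return timeout, slack


print(watchdog_timing(ttl=30, loop_wait=10, safety_margin=5))
print(watchdog_timing(ttl=30, loop_wait=10, safety_margin=-1))
```

With the defaults this yields a 25-second timeout and 15 seconds of slack. With safety_margin=-1 the slack shrinks to 5 seconds, which is why that mode needs a fast, reliable DCS connection.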

etcd Placement Golden Rules

Placement	Quorum on DC failure	Split-Brain risk	Recommended?
All 3 in DC1	Lost if DC1 goes down	None	Not recommended (SPOF)
DC1=2, DC2=1	Lost if DC1 goes down; kept if DC2 goes down	Risk if the Primary runs in DC2 (minority side)	Use with caution
1 per DC (3 DCs)	Survives any single DC failure	None	Recommended
5 nodes, 2-2-1 across 3 DCs	Survives any single DC failure (at least 3 of 5 nodes remain); does not survive two simultaneous DC failures	None	Best
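Any placement can be sanity-checked by enumerating the data center failures and counting the etcd members that remain. The sketch below does that with the majority rule (more than half of all members). The helper names and the DC labels are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict


def quorum(total: int) -> int:
    """Smallest majority of `total` members."""
    return total // 2 + 1


def survives_dc_loss(placement: Dict[str, int], lost: int = 1) -> bool:
    """True if quorum holds no matter which `lost` data centers fail together."""
    total = sum(placement.values())
    # Losing more DCs than exist is treated as losing all of them.
    lost = min(lost, len(placement))
    for failed in combinations(placement, lost):
        remaining = total - sum(placement[dc] for dc in failed)
        if remaining < quorum(total):
            return False
    return True


placements = {
    "All 3 in DC1": {"DC1": 3},
    "DC1=2, DC2=1": {"DC1": 2, "DC2": 1},
    "1 per DC (3 DCs)": {"DC1": 1, "DC2": 1, "DC3": 1},
    "5 nodes, 2-2-1": {"DC1": 2, "DC2": 2, "DC3": 1},
}
for name, placement in placements.items():
    print(f"{name}: any 1 DC lost -> {survives_dc_loss(placement, 1)}, "
          f"any 2 DCs lost -> {survives_dc_loss(placement, 2)}")
```

Only the two three-DC layouts survive the loss of an arbitrary single data center, and no layout with three data centers keeps a majority when two of them fail at once.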

5. Layer 2 — Linux Watchdog

When DCS communication fails and Patroni tries to demote PostgreSQL, if Patroni itself crashes or becomes unresponsive, it cannot stop PostgreSQL. That is the exact gap the Watchdog seals.

Patroni activates the Watchdog before promoting to Primary. With watchdog.mode: required, if Watchdog activation fails, the node refuses to become Leader.

How the Watchdog Works

Normal operation:
  Patroni -> sends keepalive to /dev/watchdog (every loop_wait seconds)
  Watchdog timer resets -> system stays up

Abnormal (Patroni crashed / unresponsive):
  Patroni -> keepalive stops
  Watchdog timer expires (TTL - safety_margin = 25 seconds)
  Linux kernel -> force-reboots the node (hard reset, no graceful shutdown)
  PostgreSQL stops -> no more writes accepted (OK)

Step 1 — Load the softdog Kernel Module

# Run on all Primary candidate nodes (as root)
modprobe softdog

# Grant /dev/watchdog access to Patroni's user (postgres)
chown postgres:postgres /dev/watchdog

# Persist across reboots
echo "softdog" >> /etc/modules-load.d/patroni-watchdog.conf

# Auto-grant permissions via udev rule
cat > /etc/udev/rules.d/99-patroni-watchdog.rules <<'EOF'
KERNEL=="watchdog", OWNER="postgres", GROUP="postgres", MODE="0600"
EOF

# Verify
ls -la /dev/watchdog
# crw------- 1 postgres postgres 10, 130 Apr 26 10:00 /dev/watchdog

Test environment tip: To verify Watchdog behavior without an actual reboot, use the soft_noboot=1 option. On timeout, instead of rebooting, the kernel only writes a warning to dmesg.

# Test mode: log only, no reboot
modprobe softdog soft_noboot=1

Step 2 — Add Watchdog Config to patroni.yml

# /etc/patroni/patroni.yml

watchdog:
  # mode options:
  #   off       - Watchdog disabled (not recommended for production)
  #   automatic - Use if available, but node can still become Leader without it (default)
  #   required  - Watchdog mandatory; refuses Leader if not available (most secure)
  mode: required          # Use required in production

  device: /dev/watchdog   # Watchdog device path

  # safety_margin: how many seconds before TTL expiry to trigger Watchdog timeout
  # Default: 5 seconds
  # Set to -1 for Watchdog timeout = TTL / 2 (absolute guarantee mode)
  safety_margin: 5

Step 3 — Hardware Watchdog (recommended for on-premises)

In cloud environments softdog is sufficient, but on bare-metal servers the kernel itself can hang, at which point softdog also stops working. Use a hardware Watchdog in those cases.

# IPMI-based hardware Watchdog (HP iLO, Dell iDRAC, etc.)
# Identify device
ls /dev/watchdog*
# /dev/watchdog0  <- hardware Watchdog

# Change the device path in Patroni config:
# watchdog.device: /dev/watchdog0

# Check Watchdog status (wdctl ships with util-linux, which is normally preinstalled)
wdctl /dev/watchdog0

Verify Watchdog Is Active

# Check Patroni logs for Watchdog activation
journalctl -u patroni | grep -i watchdog

# Expected output (exact wording may vary by Patroni version):
# INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
# INFO: Watchdog activated successfully

# Check via Patroni REST API
curl -s http://localhost:8008/patroni | python3 -m json.tool | grep watchdog
# "watchdog_failed": false

6. Layer 3 — STONITH (Cloud Fencing)

STONITH (Shoot The Other Node In The Head) is an external mechanism to forcibly terminate a problem node. While the Watchdog is the node's internal self-defense, STONITH is external forced isolation. It serves as the last resort when promoting a 2-DC Standby Cluster or in environments without a Watchdog.

AWS EC2 STONITH Example

#!/usr/bin/env python3
# /usr/local/bin/stonith-aws.py
# Usage: stonith-aws.py <instance-id> [stop|terminate]

import sys
import boto3
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger(__name__)


def stonith_instance(instance_id: str, action: str = "stop") -> bool:
    """
    Force-stop or terminate an AWS EC2 instance.
    action: "stop" (reusable) or "terminate" (permanent deletion)
    """
    ec2 = boto3.client('ec2')

    try:
        logger.info(f"STONITH: {action} targeting instance {instance_id}")

        if action == "terminate":
            response = ec2.terminate_instances(InstanceIds=[instance_id])
            state = response['TerminatingInstances'][0]['CurrentState']['Name']
            waiter_name = 'instance_terminated'
        else:
            response = ec2.stop_instances(
                InstanceIds=[instance_id],
                Force=True   # Skip graceful shutdown
            )
            state = response['StoppingInstances'][0]['CurrentState']['Name']
            waiter_name = 'instance_stopped'

        logger.info(f"STONITH initiated: instance {instance_id} -> {state}")

        waiter = ec2.get_waiter(waiter_name)
        waiter.wait(
            InstanceIds=[instance_id],
            WaiterConfig={'Delay': 5, 'MaxAttempts': 24}  # Wait up to 2 minutes
        )
        logger.info(f"STONITH confirmed: instance {instance_id} fully stopped")
        return True

    except Exception as e:
        logger.error(f"STONITH failed: {e}")
        return False


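if __name__ == "__main__":
    import re  # Only needed here, for command-line argument validation

    if len(sys.argv) < 2:
        print("Usage: stonith-aws.py <instance-id> [stop|terminate]", file=sys.stderr)
        sys.exit(1)

    instance_id = sys.argv[1]
    action = sys.argv[2] if len(sys.argv) > 2 else "stop"

    # Reject malformed input before touching the EC2 API.
    # EC2 instance IDs are "i-" followed by 8 or 17 lowercase hex digits.
    if not re.fullmatch(r"i-(?:[0-9a-f]{8}|[0-9a-f]{17})", instance_id):
        logger.error(f"Invalid instance ID: {instance_id}")
        sys.exit(2)
    if action not in ("stop", "terminate"):
        logger.error(f"Invalid action: {action} (expected stop or terminate)")
        sys.exit(2)

    success = stonith_instance(instance_id, action)
    sys.exit(0 if success else 1)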

Before deploying this script to production, verify three things: the IAM Role running the script has ec2:StopInstances or ec2:TerminateInstances permissions; there is input validation to guard against incorrect instance IDs; and the script has been tested in a non-production environment to rule out misfires.
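The permission check can be rehearsed without stopping anything by using the EC2 API's DryRun parameter. With DryRun=True the call always raises: the error code DryRunOperation means the real call would have been authorized, and UnauthorizedOperation means it would not. The sketch below is one possible way to wrap that. The function name can_stop_instance is an assumption, and the client is passed in so that the logic stays independent of how boto3 is configured.

```python
from typing import Any


def can_stop_instance(ec2: Any, instance_id: str) -> bool:
    """Return True if the caller is allowed to stop `instance_id`.

    `ec2` is expected to be a boto3 EC2 client. With DryRun=True the API
    never stops anything: it raises, and the error code says whether the
    real call would have been authorized.
    """
    try:
        ec2.stop_instances(InstanceIds=[instance_id], DryRun=True)
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code == "DryRunOperation":
            return True
        if code == "UnauthorizedOperation":
            return False
        raise  # bad instance ID, throttling, missing credentials, ...
    return False  # a dry run that returns normally is unexpected; assume "no"


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   print(can_stop_instance(boto3.client("ec2"), "i-0123456789abcdef0"))
```

Running this from the host and role that will execute the fencing script confirms the IAM setup before an incident, rather than during one.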

Automate STONITH via Patroni on_stop Callback

Patroni supports callback scripts that run on lifecycle events such as start, stop, and role change. Using the on_stop event, a node that was running as Primary can block client connections the moment its PostgreSQL is stopped, for example during a demotion.

# patroni.yml
postgresql:
  callbacks:
    on_stop: /usr/local/bin/patroni-on-stop.sh
    on_role_change: /usr/local/bin/patroni-on-role-change.sh

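#!/bin/bash
# /usr/local/bin/patroni-on-stop.sh
# Args: $1=action(on_stop), $2=role, $3=scope
# Newer Patroni releases pass "primary" where older ones passed "master"; accept both.
# iptables needs root: if Patroni runs as postgres, call it through sudo with a matching sudoers rule.

ACTION="$1"
ROLE="$2"

if [ "$ROLE" = "primary" ] || [ "$ROLE" = "master" ]; then
    logger -t patroni-callback "Primary stop detected: blocking port 5432 immediately"
    iptables -I INPUT  -p tcp --dport 5432 -j DROP
    iptables -I OUTPUT -p tcp --sport 5432 -j DROP

    # Lift the block after 30 seconds so the node can serve again as a Replica
    sleep 30
    iptables -D INPUT  -p tcp --dport 5432 -j DROP
    iptables -D OUTPUT -p tcp --sport 5432 -j DROP
fi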

GCP / Azure / On-Premises STONITH Examples

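# GCP: stop instance via gcloud CLI (waits for the operation to finish unless --async is given)
gcloud compute instances stop INSTANCE_NAME \
  --zone=asia-northeast3-a \
  --project=MY_PROJECT

# Azure: power off VM via az CLI without a graceful guest shutdown
# Do not add --no-wait: fencing is only complete once the VM is confirmed stopped
az vm stop \
  --resource-group MY_RG \
  --name MY_VM \
  --skip-shutdown

# On-premises: cut power via IPMI
# -E reads the password from the IPMI_PASSWORD environment variable
# instead of exposing it on the command line
ipmitool -H 10.1.0.10 -U admin -E \
  chassis power off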

7. Bonus: DCS Failsafe Mode

DCS Failsafe Mode works in the opposite direction from Split-Brain prevention. Where Split-Brain arises from a Primary that lives too long, DCS Failsafe Mode addresses the problem of a Primary being unnecessarily demoted during a transient DCS instability.

How DCS Failsafe Mode Works

When Failsafe Mode is enabled and the Primary fails to renew its Leader Lock in the DCS, instead of demoting immediately it sends a POST /failsafe request directly to every cluster member via the REST API. If all members respond, the Primary stays up until the DCS recovers. If even one member fails to respond, the Primary demotes immediately.
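The decision rule described above is strict: one silent member is enough to force a demotion. A minimal sketch of that rule follows. It models only the all-members condition stated in this section, not Patroni's implementation, and the function name, the acknowledges callback, and the member names are hypothetical.

```python
from typing import Callable, Iterable


def primary_may_continue(members: Iterable[str],
                         acknowledges: Callable[[str], bool]) -> bool:
    """Return True only if every other member acknowledged the failsafe request."""
    for member in members:
        try:
            if not acknowledges(member):
                return False
        except Exception:
            # A timeout or connection error counts as "did not respond".
            return False
    return True


# Hypothetical member names; the dict stands in for real HTTP responses.
responses = {"pg-tokyo-1": True, "pg-singapore-1": True}
print(primary_may_continue(responses, responses.__getitem__))   # everyone answered

responses["pg-singapore-1"] = False
print(primary_may_continue(responses, responses.__getitem__))   # one silent member
```

Because the check is a logical AND over all members, adding members makes it harder, not easier, for the Primary to stay up during a DCS outage.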

Enable DCS Failsafe Mode

patronictl -c /etc/patroni/patroni.yml edit-config

# Dynamic Configuration change
failsafe_mode: true

# Before enabling, verify:
# 1. All cluster members are running the same Patroni version
# 2. All members can reach each other on the REST API port (8008)
# 3. Can every member answer? /failsafe needs a response from all members,
#    so each additional member is one more node that can force a demotion

When NOT to Use DCS Failsafe Mode

DCS Failsafe Mode supplements — it does not replace — quorum design, Watchdog, and STONITH.

Avoid when:
  - Kubernetes API is used as DCS (K8s API instability is common)
  - Direct REST API communication between members is blocked by a firewall
  - Only 2 cluster members (one failure = instant demotion)

Effective when:
  - etcd runs on dedicated nodes and the network is stable
  - 3+ cluster members spread across AZs
  - Preventing Primary demotion during a planned DCS maintenance window

8. Defense Layer Timeline Visualization

The sequence below shows when and how each defense layer fires when a Primary node loses contact with DCS. (ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5)

t=0s     DCS(etcd) contact lost detected
t=0-10s  Patroni: retry reconnection for retry_timeout seconds
t=10s    Reconnect failed -> Patroni: starts PostgreSQL demotion
         [!] Watchdog keepalive stops (if Patroni is unresponsive)
t=25s    [!] Watchdog timeout (TTL - safety_margin = 30 - 5)
             -> kernel force-reboots node -> PostgreSQL stops
t=30s    [!] etcd TTL expires -> other nodes enter Leader Race
t=32s    New Primary promoted -> normal operation restored

[Layer 1] etcd TTL (30s):  |<----------- 30s ----------->|
[Layer 2] Watchdog (25s):  |<--------- 25s -------->|
[Layer 3] STONITH:         operator-triggered (when needed)

9. TTL, loop_wait, safety_margin Tuning Guide

In multi-region environments, high network latency (RTT) can make the defaults unstable.

Tuning Principles

1. loop_wait >= inter-region RTT x 3
2. retry_timeout >= loop_wait
3. ttl >= loop_wait + 2 x retry_timeout (the rule recommended in the Patroni documentation)
4. safety_margin = ttl - loop_wait - retry_timeout - buffer(5s)
5. watchdog timeout = ttl - safety_margin
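Principles 1, 2, 4, and 5 are mechanical enough to script. The sketch below checks the first two and derives safety_margin and the Watchdog timeout from the other two. The function name and the rtt_ms parameter are assumptions for illustration, and the 5-second buffer is the value used in principle 4.

```python
from typing import Dict


def derive_watchdog_settings(ttl: int, loop_wait: int, retry_timeout: int,
                             rtt_ms: float = 0.0, buffer: int = 5) -> Dict[str, int]:
    """Derive safety_margin and the Watchdog timeout from the DCS timings."""
    if loop_wait * 1000 < 3 * rtt_ms:
        raise ValueError("principle 1 violated: loop_wait < 3 x RTT")
    if retry_timeout < loop_wait:
        raise ValueError("principle 2 violated: retry_timeout < loop_wait")
    # Principle 4: what is left of the TTL after one loop, one retry window, and the buffer.
    safety_margin = ttl - loop_wait - retry_timeout - buffer
    if safety_margin < 0:
        raise ValueError("ttl is too small to leave a safety margin")
    # Principle 5: the Watchdog fires safety_margin seconds before the TTL expires.
    return {"safety_margin": safety_margin, "watchdog_timeout": ttl - safety_margin}


print(derive_watchdog_settings(ttl=30, loop_wait=10, retry_timeout=10, rtt_ms=1))
```

For the single-DC defaults this reproduces safety_margin=5 and a 25-second Watchdog timeout. A ValueError for a candidate configuration is a sign that ttl needs to grow before the other values do.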

Recommended Settings by Environment

# -- Single DC (local, RTT < 1ms) -------------------------------
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    # watchdog timeout = 30 - 5 = 25s
watchdog:
  mode: required
  safety_margin: 5

# -- Domestic multi-region (Seoul-Busan, RTT ~10ms) -------------
bootstrap:
  dcs:
    ttl: 45                # principle 3 requires ttl >= 45 with these values
    loop_wait: 15
    retry_timeout: 15
watchdog:
  mode: required
  safety_margin: 10        # watchdog timeout = 45 - 10 = 35s

# -- Global multi-region (Seoul-Singapore, RTT ~60ms) -----------
bootstrap:
  dcs:
    ttl: 60
    loop_wait: 20
    retry_timeout: 20
watchdog:
  mode: required
  safety_margin: 15        # principle 4: 60 - 20 - 20 - 5 = 15; watchdog timeout = 60 - 15 = 45s

# -- Absolute guarantee mode (Split-Brain never tolerated) ------
watchdog:
  mode: required
  safety_margin: -1        # watchdog timeout = ttl / 2

Raising ttl and loop_wait makes the cluster more tolerant of inter-region latency and reduces spurious demotions, but it also increases the time needed to detect a failure and trigger automatic failover. Balance the two against your RTO requirements.

Check and Apply Current Settings

# View current DCS dynamic config
patronictl -c /etc/patroni/patroni.yml show-config

# Apply changes (live, no downtime required)
patronictl -c /etc/patroni/patroni.yml edit-config
# Edit ttl, loop_wait, retry_timeout then save

# Verify changes took effect (the new values should appear in the dynamic config)
patronictl -c /etc/patroni/patroni.yml show-config

10. Common Troubleshooting

Issue 1: Watchdog Activation Failure — Node Refuses Leader Role

CRITICAL: watchdog.device /dev/watchdog not found
WARNING: Could not activate watchdog. Refusing to be promoted

# Verify softdog module is loaded
lsmod | grep softdog
# If not loaded:
modprobe softdog

# Check /dev/watchdog exists
ls -la /dev/watchdog

# Confirm ownership (must be postgres)
stat /dev/watchdog
chown postgres:postgres /dev/watchdog

# Reload udev rules
udevadm control --reload-rules && udevadm trigger

Issue 2: Frequent Primary Demotion Due to Transient DCS Instability

INFO: DCS not accessible, stopping Postgres
INFO: demoted self because DCS is not accessible and i had the leader lock

# Option 1: Enable DCS Failsafe Mode (tolerate brief DCS instability)
patronictl -c /etc/patroni/patroni.yml edit-config
# Add: failsafe_mode: true

# Option 2: Raise retry_timeout (more tolerant of transient DCS hiccups)
# retry_timeout: 15  ->  retry_timeout: 30
# Raise ttl along with it so that loop_wait + 2 x retry_timeout <= ttl still holds
patronictl -c /etc/patroni/patroni.yml edit-config

Issue 3: Old Primary Attempts Promotion After Network Partition Heals

WARNING: Stale leader key found, ignoring
INFO: following a different leader because my replay LSN is behind

This is expected behavior. When the old Primary finds that another node holds the Leader key, Patroni rejoins it as a Replica, running pg_rewind first if use_pg_rewind is enabled and the timelines have diverged. If pg_rewind fails:

# Force re-initialize without pg_rewind
patronictl -c /etc/patroni/patroni.yml reinit \
  pg-seoul-cluster pg-seoul-1 --force

Issue 4: Two Nodes Both Claiming to Be Primary (Real Split-Brain)

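# -- Emergency response procedure --------------------------------

# 1. Determine which node is the stale Primary
#    LSN alone is not conclusive because both nodes keep writing: also compare
#    the timeline (the promoted node is on the higher one, as of its last checkpoint)
#    and check which node holds the Leader key in patronictl list
psql -h 10.1.0.10 -U postgres -c "SELECT pg_current_wal_lsn(), pg_is_in_recovery(), timeline_id FROM pg_control_checkpoint();"
psql -h 10.2.0.10 -U postgres -c "SELECT pg_current_wal_lsn(), pg_is_in_recovery(), timeline_id FROM pg_control_checkpoint();"

# 2. Stop the stale node immediately (run on that node)
systemctl stop patroni
# Only needed if a PostgreSQL instance is still running outside Patroni's control
systemctl stop postgresql

# 3. Block port 5432 on that node
iptables -I INPUT  -p tcp --dport 5432 -j DROP

# 4. Check the surviving Primary's cluster state
patronictl -c /etc/patroni/patroni.yml list

# 5. Preserve the diverged data for later reconciliation (needs disk space for a
#    second copy), then re-initialize the stopped node as a Replica
mv /var/lib/postgresql/17/main /var/lib/postgresql/17/main.split-brain
iptables -D INPUT  -p tcp --dport 5432 -j DROP
systemctl start patroni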
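When comparing the two LSNs from step 1, note that an LSN is two hexadecimal numbers separated by a slash, so comparing them as plain strings gives wrong answers (for example, "0/A000000" sorts after "0/10000000" as text). A small helper that converts them to byte positions avoids that. This is a sketch, and the LSN values in the example are made up.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.strip().split("/")
    # The part before the slash is the upper 32 bits, the part after it the lower 32.
    return (int(high, 16) << 32) + int(low, 16)


def lsn_gap_bytes(lsn_a: str, lsn_b: str) -> int:
    """Bytes of WAL by which lsn_a is ahead of lsn_b (negative if it is behind)."""
    return lsn_to_int(lsn_a) - lsn_to_int(lsn_b)


# Made-up values standing in for the output of the two psql calls in step 1.
seoul_lsn = "0/5000000"
tokyo_lsn = "0/3000060"
print(f"Seoul is ahead of Tokyo by {lsn_gap_bytes(seoul_lsn, tokyo_lsn)} bytes of WAL")
```

The gap is only a rough measure of how much each side wrote. After a Split-Brain both nodes have advanced independently, so the node with the higher LSN is not automatically the one holding the data you want to keep.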

References

Patroni Official Documentation — Watchdog Support
Patroni Official Documentation — DCS Failsafe Mode
Percona — Patroni Split-Brain Prevention Architecture
Stormatics — Understanding Split-Brain Scenarios in HA PostgreSQL Clusters
Medium (Oz) — Avoiding Split-Brain Using Watchdog/softdog (2025)
Patroni POSETTE 2025 — What is Patroni, really?