2026.06.04 · 24 min read · Lv.3 Intermediate

Operating Patroni H/A Across Multiple Regions — Part 5: Incident Runbooks and DR Drill Scenarios

From single-node failure to full-DC outage and Split-Brain — five ready-to-use Patroni multi-region HA runbooks, DR drill scenarios, and a Post-Mortem checklist for real incident response.

Part 4 built the three-layer defense against Split-Brain — DCS Leader Lock, Watchdog, and STONITH. Even with that architecture in place, if the decision sequence and the exact commands to run have not been defined in advance, the defense will not hold when a real incident strikes.

This article is the runbook collection that fills that gap. For each of the five major failure scenarios in a multi-region HA environment, it provides prerequisite checks, decision criteria, step-by-step commands, and a post-recovery checklist. A Runbook is a safety system, not just a document. An undrilled Runbook will not work during an actual incident.

1. Prerequisites Common to All Runbooks

Verify the following items before executing any Runbook.

Common Environment Variable Setup

Set these variables at the start of every terminal session. Every command in every Runbook references them.

# -- Common environment variables --
export PATRONI_CONF="/etc/patroni/patroni.yml"
export DC1_NODES="10.1.0.10 10.1.0.11 10.1.0.12"   # Seoul PG nodes
export DC2_NODES="10.2.0.10 10.2.0.11 10.2.0.12"   # Busan PG nodes
export DC1_CLUSTER="pg-seoul-cluster"
export DC2_CLUSTER="pg-busan-standby"
export HAPROXY_HOST="10.1.0.20"
export ETCD_CERTS="--cacert=/etc/etcd/ssl/ca.pem \
  --cert=/etc/etcd/ssl/etcd-seoul.pem \
  --key=/etc/etcd/ssl/etcd-seoul-key.pem"

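# -- Common health check function --
# Usage: check_cluster <cluster-name> <patroni-conf> [node-list]
check_cluster() {
  local CLUSTER=$1
  local CONF=$2
  local NODES=${3:-$DC1_NODES}   # nodes to query; defaults to the DC1 list
  echo "=== Cluster topology check: $CLUSTER ==="
  patronictl -c "$CONF" topology
  echo ""
  # Prints the current WAL position (LSN) of each node; compare the values to judge lag
  echo "=== WAL position per node ==="
  for NODE in $NODES; do
    echo -n "  $NODE: "
    psql -h "$NODE" -U postgres -t -c \
      "SELECT CASE WHEN pg_is_in_recovery()
       THEN pg_last_wal_replay_lsn()::text
       ELSE pg_current_wal_lsn()::text END;" 2>/dev/null || echo "connection failed"
  done
}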
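
Because every later command depends on these exports, a small preflight check can catch a forgotten variable before an incident command runs with an empty argument. The following is a minimal sketch; `require_env` is a hypothetical helper name, not something defined elsewhere in this series.

```shell
# Hypothetical helper (an assumption, not part of the original runbook):
# returns non-zero and names each variable that is unset or empty.
require_env() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage at the start of a terminal session:
#   require_env PATRONI_CONF DC1_NODES DC2_NODES DC1_CLUSTER DC2_CLUSTER \
#               HAPROXY_HOST ETCD_CERTS || echo "fix the exports before continuing"
```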

Common Notification Checklist Before Incident Response

Before starting incident response:
□ Post incident acknowledgment to alert channels (Slack, PagerDuty, etc.)
□ On-call DBA/infrastructure engineer contacted
□ Current service traffic level and business impact assessed
□ Application-level read-only mode switch confirmed
□ Runbook version confirmed (version number and last-modified date required)

2. Runbook A — Single-Node Failure (Automatic Failover Confirmation)

Trigger Condition: One Primary node unresponsive / HAProxy health check failure alert received

Expected RTO: 30 seconds – 2 minutes (automatic failover)

Expected RPO: 0 (Synchronous mode) / a few seconds (Asynchronous mode)

Step 1: Assess Failed Node Status (Target: within 2 minutes)

# Check entire cluster status immediately
patronictl -c $PATRONI_CONF list

# Attempt direct access to the failed node
ping -c 3 10.1.0.10
ssh 10.1.0.10 "systemctl status patroni postgresql"

# Check Patroni REST API response
curl -s --max-time 5 http://10.1.0.10:8008/health | python3 -m json.tool

# Check current Leader key in etcd
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  get /db/pg-seoul-cluster/leader

Step 2: Monitor Automatic Failover Progress

After TTL expiry, Patroni elects a new Leader automatically. Observe without intervening.

# Real-time cluster status monitoring (refresh every 5 seconds)
watch -n 5 "patronictl -c $PATRONI_CONF list"

# Failover completion checkpoints:
# 1. New Leader node appears in 'running' state
# 2. Failed node disappears from the list or shows 'stopped' state
# 3. Remaining Replicas connect to new Primary in streaming state

Step 3: Post-Failover Verification

# Confirm new Primary
NEW_PRIMARY=$(patronictl -c $PATRONI_CONF list | grep Leader | awk '{print $4}')
echo "New Primary: $NEW_PRIMARY"

# Test write capability on new Primary
psql -h $NEW_PRIMARY -U postgres -c "
  CREATE TABLE IF NOT EXISTS failover_test (
    id SERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ DEFAULT now(),
    note TEXT
  );
  INSERT INTO failover_test (note) VALUES ('failover_$(date +%s)');
  SELECT * FROM failover_test ORDER BY created_at DESC LIMIT 1;
"

# Confirm HAProxy routes to new Primary
psql -h $HAPROXY_HOST -p 5000 -U postgres -c \
  "SELECT inet_server_addr(), pg_is_in_recovery();"
# Expected: new Primary IP | f (false)

# Check replication status
psql -h $NEW_PRIMARY -U postgres -c "
  SELECT application_name, state, sync_state, write_lag
  FROM pg_stat_replication;
"

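The `grep Leader | awk '{print $4}'` extraction above depends on whitespace splitting of the table. As a sketch of a slightly sturdier variant, the function below splits on the `|` column separator instead. `leader_host` is a hypothetical name, and it assumes the usual `Member | Host | Role | State` column order of the `patronictl list` table.

```shell
# Hypothetical helper: reads `patronictl list` table output on stdin and prints
# the Host column of the first row whose Role column contains "Leader"
# (this also matches "Standby Leader" in a Standby Cluster).
leader_host() {
  awk -F'|' '$4 ~ /Leader/ { gsub(/ /, "", $3); print $3; exit }'
}

# Usage:
#   NEW_PRIMARY=$(patronictl -c "$PATRONI_CONF" list | leader_host)
```

If a member listens on a non-default port, the Host column may carry a `:port` suffix, which this sketch leaves in place.
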
Step 4: Failed Node Recovery and Rejoin

# After resolving the failure cause (service restart, HW replacement, etc.)
# On the failed node:
systemctl start patroni

# Patroni resyncs the node automatically where it can
# (pg_rewind is used only if use_pg_rewind: true is set in the Patroni config)
# Monitor progress:
journalctl -fu patroni

# Confirm successful rejoin
patronictl -c $PATRONI_CONF list
# Failed node should appear as Replica in streaming state

# If automatic rejoin fails, manual reinitialize:
patronictl -c $PATRONI_CONF reinit $DC1_CLUSTER pg-seoul-1 --force

3. Runbook B — Full-DC Failure (2DC Standby Cluster Manual Promotion)

Trigger Condition: DC1 (Seoul) entirely unreachable / All DC1 nodes unresponsive

Architecture: Part 3 baseline (async replication + Standby Cluster)

⚠️ This Runbook is a manual failover. Never rush through it. Complete each step in order before proceeding to the next.

Expected RTO: 5 – 20 minutes (depends on operator proficiency)

Expected RPO: Most recent replication lag value (seconds to minutes)

Pre-Promotion Checklist (Complete Before Every Promotion)

□ 1. Have all PostgreSQL instances in DC1 fully stopped?
      -> DC1 nodes: SSH access failed + Patroni REST API timeout confirmed
□ 2. Has the DC1 etcd cluster fully stopped?
      -> etcdctl endpoint health: no response confirmed
□ 3. Have you recorded the current LSN and replication lag of the DC2 Standby Leader?
□ 4. Has all application traffic to DC1 been blocked?
□ 5. Has STONITH been performed if needed? (see Part 4)
□ 6. Has the promotion authority (DBA lead / infrastructure manager) approved?

Step 1: Confirm DC1 is Fully Down

Eliminate any chance that DC1 is still partially alive before promoting. Promoting while DC1 is partially up will cause Split-Brain.

# Confirm all DC1 nodes are unreachable
for NODE in $DC1_NODES; do
  echo -n "DC1 $NODE: "
  timeout 5 bash -c "echo >/dev/tcp/$NODE/5432" 2>/dev/null \
    && echo "Connected - still alive!" \
    || echo "Unreachable OK"
done

# DC1 Primary connectivity as seen from DC2
# (the replicator password is read from ~/.pgpass; keep credentials off the command line)
psql "host=10.1.0.10,10.1.0.11,10.1.0.12 port=5432 \
      user=replicator connect_timeout=5 \
      target_session_attrs=read-write sslmode=require" \
  -c "SELECT 1;" 2>&1 | grep -E "error|FATAL|could not"
# A connection error for every host -> DC1 fully stopped OK
# No output at all -> the query succeeded, so a writable DC1 Primary is still alive. Do NOT promote.
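
The reachability loop above only prints a line per node, so the operator has to read the result. A minimal sketch of a version that returns an exit status, and can therefore gate the promotion step, is shown below. `dc_fully_down` and `tcp_probe_5432` are hypothetical names; the probe is passed in as an argument so the logic can be rehearsed without touching real nodes.

```shell
# Hypothetical helper: returns 0 only if the probe fails for every node.
# $1 = name of a probe function/command that takes a host and exits 0 if reachable
dc_fully_down() {
  local probe=$1 node alive=0
  shift
  for node in "$@"; do
    if "$probe" "$node"; then
      echo "ALIVE: $node"
      alive=1
    else
      echo "unreachable: $node"
    fi
  done
  [ "$alive" -eq 0 ]
}

# Same TCP check as the loop above (needs bash for /dev/tcp)
tcp_probe_5432() {
  timeout 5 bash -c "echo >/dev/tcp/$1/5432" 2>/dev/null
}

# Usage:
#   dc_fully_down tcp_probe_5432 $DC1_NODES || echo "STOP: do not promote"
```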

Step 2: Check DC2 Current State and Replication Lag

# DC2 Standby Cluster status
patronictl -c /etc/patroni/patroni.yml topology  # Run on DC2 node

# Check last received WAL position of DC2 Standby Leader
psql -h 10.2.0.10 -U postgres -c "
  SELECT
    pg_last_wal_receive_lsn()                    AS receive_lsn,
    pg_last_wal_replay_lsn()                     AS replay_lsn,
    now() - pg_last_xact_replay_timestamp()      AS replay_delay,
    pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS fully_replayed;
"

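# Record these values (for Post-Mortem)
# fully_replayed = t  -> all WAL received by DC2 has been applied; only WAL that
#                        never arrived from DC1 can be lost
# fully_replayed = f  -> received WAL is still being replayed; replay_delay shows
#                        how far behind the applied data currently is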
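
To record lag as a number rather than as two raw LSNs, it helps to convert an LSN to a byte position. An LSN such as `16/B374D848` is two hexadecimal values, the high and low 32 bits of the WAL position. The helper names below are hypothetical; on a reachable server, the built-in `pg_wal_lsn_diff()` function does the same subtraction.

```shell
# Hypothetical helper: convert an LSN ("HI/LO" in hexadecimal) to a byte position
lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Bytes by which LSN $1 is ahead of LSN $2 (negative if it is behind)
lsn_diff() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Usage: received but not yet replayed WAL, in bytes
#   lsn_diff "$RECEIVE_LSN" "$REPLAY_LSN"
```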

Step 3: STONITH (If DC1 May Be Partially Alive)

# AWS example: force-stop DC1 instances
python3 /usr/local/bin/stonith-aws.py i-0123456789abcdef0 stop  # pg-seoul-1
python3 /usr/local/bin/stonith-aws.py i-0123456789abcdef1 stop  # pg-seoul-2
python3 /usr/local/bin/stonith-aws.py i-0123456789abcdef2 stop  # pg-seoul-3

# Reconfirm after STONITH completes
for INSTANCE in i-0123456789abcdef0 i-0123456789abcdef1 i-0123456789abcdef2; do
  aws ec2 describe-instances --instance-ids $INSTANCE \
    --query 'Reservations[].Instances[].State.Name' --output text
  # All must be "stopped" or "terminated"
done
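
The reconfirmation loop prints one state per instance and leaves the comparison to the operator. As a sketch, the check can be turned into an exit status that a script can act on. `all_instances_down` is a hypothetical helper name; it only inspects the state strings passed to it.

```shell
# Hypothetical helper: returns 0 only when every state passed as an argument
# is "stopped" or "terminated".
all_instances_down() {
  local state
  # No states at all is not proof that the nodes are down
  [ "$#" -gt 0 ] || return 1
  for state in "$@"; do
    case "$state" in
      stopped|terminated) ;;
      *) echo "NOT DOWN: $state" >&2; return 1 ;;
    esac
  done
  return 0
}

# Usage (same query as above, all three instances in one call):
#   all_instances_down $(aws ec2 describe-instances \
#     --instance-ids i-0123456789abcdef0 i-0123456789abcdef1 i-0123456789abcdef2 \
#     --query 'Reservations[].Instances[].State.Name' --output text) \
#     || echo "STOP: DC1 is not fully fenced"
```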

Step 4: Promote DC2 Standby Cluster

# Recommended method for Patroni 4.1+
patronictl -c /etc/patroni/patroni.yml promote-cluster $DC2_CLUSTER

# Older version or manual method
# patronictl -c /etc/patroni/patroni.yml edit-config \
#   --set standby_cluster=null --force

# Check status after promotion
patronictl -c /etc/patroni/patroni.yml topology

# Expected result:
# pg-busan-1  Leader  running  TL+1
# pg-busan-2  Replica running  TL+1
# pg-busan-3  Replica running  TL+1

# Direct confirmation from PostgreSQL
psql -h 10.2.0.10 -U postgres -c "SELECT pg_is_in_recovery();"
# f (false) -> Promotion to Primary complete OK

Step 5: Switch Application Endpoints

# Update HAProxy to DC2 nodes (or switch DNS)
# Method 1: Validate the DC2 config first, then replace the live config and reload
haproxy -c -f /etc/haproxy/haproxy-dc2.cfg \
  && cp /etc/haproxy/haproxy-dc2.cfg /etc/haproxy/haproxy.cfg \
  && systemctl reload haproxy

# Method 2: Update DNS record if TTL was pre-lowered
# (AWS Route53 example)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "db.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "10.2.0.10"}]
      }
    }]
  }'

# Confirm actual traffic routing after switch
psql -h db.example.com -p 5000 -U appuser -c \
  "SELECT inet_server_addr(), pg_is_in_recovery();"

Step 6: Immediate Post-Promotion Checklist

□ Write test on DC2 Primary succeeded
□ Application connections normal (no error logs)
□ Replication from the new DC2 Primary to the DC2 Replicas confirmed resumed
□ "DC2 promotion complete" posted to alert channels
□ DC1 recovery plan initiated (see Part 3 demote-cluster procedure)

4. Runbook C — Network Partition (Partial Connectivity Failure)

Trigger Condition: Some nodes unreachable / Part of the cluster is isolated

Key Decision Criterion: Is etcd Quorum maintained?

# Check etcd cluster status immediately
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  endpoint health --write-out=table

# Output example:
# +------------------------+--------+---------------------------+
# |       ENDPOINT         | HEALTH |    ERROR                  |
# +------------------------+--------+---------------------------+
# | https://10.1.0.10:2379 |  true  |                           |
# | https://10.2.0.10:2379 |  true  |                           |
# | https://10.3.0.10:2379 | false  | context deadline exceeded |
# +------------------------+--------+---------------------------+
# -> 2/3 nodes healthy = Quorum maintained -> cluster can operate normally

etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  member list -w table

Response by Case

Case 1: etcd Quorum maintained (2/3 or more healthy)
  -> Handled automatically. Continue monitoring only.
  -> If the isolated node was Primary, confirm automatic failover is proceeding
  -> After network recovery, confirm isolated node automatically rejoins

Case 2: etcd Quorum lost (fewer than a majority healthy, e.g. 1/3)
  -> Patroni demotes the Primary, so the entire cluster becomes read-only
  -> Network recovery is the highest priority
  -> If recovery impossible, see Runbook D (etcd Failure)
  -> If DCS Failsafe Mode is enabled: the Primary stays writable only while it can reach every cluster member over the Patroni REST API, so verify that connectivity
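
The "2/3" rule generalizes to any etcd cluster size: Quorum requires floor(n/2) + 1 healthy members, so exactly half of an even-sized cluster is not enough. A minimal sketch of that arithmetic follows; `has_quorum` is a hypothetical helper name.

```shell
# Hypothetical helper: has_quorum HEALTHY TOTAL
# Returns 0 when HEALTHY is a strict majority of TOTAL.
has_quorum() {
  local healthy=$1 total=$2
  [ "$healthy" -ge $(( total / 2 + 1 )) ]
}

# Usage:
#   has_quorum 2 3 && echo "Case 1: monitor only" || echo "Case 2: Quorum lost"
```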

# Monitor Patroni logs during network partition
journalctl -fu patroni | grep -E "DCS|leader|promote|demote|failover"

# Check each node status after partition clears
for NODE in $DC1_NODES; do
  echo "=== $NODE ==="
  curl -s --max-time 3 http://$NODE:8008/patroni 2>/dev/null \
    | python3 -c "import sys,json; d=json.load(sys.stdin); \
                  print(f'role={d[\"role\"]}, timeline={d[\"timeline\"]}')" \
    || echo "no response"
done

5. Runbook D — etcd Cluster Failure

Trigger Condition: All or a majority of etcd nodes are down / Patroni logs frequently show "DCS not accessible"

Immediate Effect: All Patroni clusters switch to read-only mode (when DCS Failsafe Mode is disabled)

Step 1: Assess etcd Status

# Check etcd cluster status
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  endpoint status --write-out=table

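# Count healthy etcd nodes
# (stderr is merged so the count works whichever stream etcdctl writes health lines to;
#  "is unhealthy" lines do not match the pattern)
HEALTHY=$(etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  endpoint health 2>&1 | grep -c "is healthy")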
echo "Healthy etcd nodes: $HEALTHY / 3"

if [ "$HEALTHY" -ge 2 ]; then
  echo "Quorum maintained -> waiting for Patroni auto-recovery"
else
  echo "Quorum lost! Immediate recovery required"
fi

Step 2: Attempt etcd Node Restart

# On the failed etcd node (e.g., Singapore node)
ssh pg-singapore-1

# Check disk space and memory (common causes of etcd failure)
df -h /var/lib/etcd
free -h
journalctl -u etcd --since "10 minutes ago" | tail -50

# Restart etcd
systemctl restart etcd

# Confirm cluster rejoining after restart
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379 \
  member list

Step 3: Full etcd Reconstruction (Last Resort)

Use only when etcd data is corrupted and restart-based recovery is impossible. Patroni restores cluster configuration from patroni.dynamic.json, so wiping the data directory alone allows automatic re-registration.

# This procedure completely wipes etcd data.

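# 1. Pause Patroni (prevent automatic failover)
# Note: the pause flag is stored in the DCS, so this command only succeeds
# while etcd still has Quorum.
patronictl -c $PATRONI_CONF pause $DC1_CLUSTER --wait

# 2. Stop service and delete data on all etcd nodes
# (the etcd members are the three endpoints used above, not the DC1 PG node list)
for NODE in 10.1.0.10 10.2.0.10 10.3.0.10; do
  ssh $NODE "systemctl stop etcd && rm -rf /var/lib/etcd/data"
done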

# 3. Initialize new cluster on Seoul node (first node)
# etcd.conf.yml: verify initial-cluster-state: new
ssh 10.1.0.10 "systemctl start etcd"

# 4. Add the remaining etcd members (10.2.0.10 and 10.3.0.10)
# Change initial-cluster-state to "existing" in etcd.conf.yml, then start
ssh 10.2.0.10 "systemctl start etcd"
ssh 10.3.0.10 "systemctl start etcd"

# 5. Resume Patroni
patronictl -c $PATRONI_CONF resume $DC1_CLUSTER --wait

# 6. Confirm Patroni restores config from patroni.dynamic.json
patronictl -c $PATRONI_CONF show-config
patronictl -c $PATRONI_CONF list

6. Runbook E — Split-Brain Detection and Emergency Recovery

Trigger Conditions:

Monitoring system (PMM, etc.) detects two or more Primary nodes
HAProxy stats show both nodes responding 200 to /primary
Application-level data inconsistency detected

⚠️ Split-Brain is the highest-severity incident. Block all writes immediately and begin recovery. Keeping the node with the lower LSN alive means more data loss.

Step 1: Confirm Split-Brain

# Immediately check role and LSN of all nodes
echo "=== Split-Brain Diagnosis ==="
for NODE in $DC1_NODES $DC2_NODES; do
  echo -n "$NODE: "
  psql -h $NODE -U postgres -t --no-align -c \
    "SELECT inet_server_addr()::text || ' | is_recovery=' || pg_is_in_recovery()::text
          || ' | lsn=' || CASE WHEN pg_is_in_recovery()
                         THEN pg_last_wal_replay_lsn()::text
                         ELSE pg_current_wal_lsn()::text END;" 2>/dev/null \
    || echo "connection failed"
done

# If two nodes return is_recovery=false, Split-Brain is confirmed
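
When the diagnosis output is captured to a file or piped, the "two Primaries" rule can be checked mechanically. The sketch below assumes the line format produced by the loop above (one line per node containing `is_recovery=false` or `is_recovery=true`); `split_brain_suspected` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the diagnosis lines on stdin, prints the number of
# nodes that are not in recovery, and returns 0 when there are two or more.
split_brain_suspected() {
  local primaries
  primaries=$(grep -c 'is_recovery=false')
  echo "primaries=$primaries"
  [ "$primaries" -ge 2 ]
}

# Usage:
#   split_brain_suspected < /tmp/split_brain_diagnosis.txt && echo "Proceed to Step 2"
```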

Step 2: Block Writes Immediately

Block both Primaries to prevent further writes while determining which to keep.

# Immediately block inbound port 5432 on every node via iptables
# (this also stops replication and remote psql, so later checks must run locally over ssh)
for NODE in $DC1_NODES $DC2_NODES; do
  ssh $NODE "iptables -I INPUT -p tcp --dport 5432 -j DROP" &
done
wait
echo "Write block applied to all nodes"

Step 3: Determine the Surviving Primary (by LSN)

The node with the higher LSN has more data. Always preserve this node.

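# LSN comparison
# Port 5432 is blocked by Step 2, so run psql locally on each node over ssh
# (assumes local peer authentication for the postgres OS user).
# pg_current_wal_lsn() fails on a node in recovery, hence the CASE.
echo "=== LSN Comparison ==="
for NODE in $DC1_NODES $DC2_NODES; do
  LSN=$(ssh $NODE "sudo -u postgres psql -t --no-align -c \
    \"SELECT CASE WHEN pg_is_in_recovery()
             THEN pg_last_wal_replay_lsn()
             ELSE pg_current_wal_lsn() END;\"" 2>/dev/null)
  echo "$NODE: LSN = $LSN"
done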

# Stop the node with the lower LSN (the one with less data)
# Example: if pg-seoul-1 is identified as the stale Primary:
ssh 10.1.0.10 "systemctl stop patroni; systemctl stop postgresql"

Step 4: Assess Data Loss Scope

# Identify divergence point (using pg_waldump)
# On the stale Primary (stopped node). Set DIVERGENCE_LSN to the LSN at which
# the two Primaries diverged; the value comes from your own investigation.
pg_waldump -n 100 --path=/var/lib/postgresql/17/main/pg_wal \
  --start="$DIVERGENCE_LSN" 2>/dev/null | head -50

# Build list of lost transactions (for business recovery)
# Find transactions committed after divergence in PostgreSQL logs
# (COMMIT lines appear only if statement logging, e.g. log_statement, was enabled)
grep "COMMIT" /var/log/postgresql/postgresql-*.log \
  | tail -100 > /tmp/lost_transactions.txt

Step 5: Normalize the Cluster Around the Surviving Primary

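# Set both variables from the Step 3 result before running these commands:
#   SURVIVING_PRIMARY = host with the higher LSN, OLD_PRIMARY = stale Primary host

# Release iptables block on surviving Primary (higher LSN)
ssh $SURVIVING_PRIMARY "iptables -D INPUT -p tcp --dport 5432 -j DROP"

# Reinitialize stale Primary as Replica
# (delete data directory and resync via basebackup)
# Finish the Step 4 analysis first: this deletes the diverged WAL on that node.
ssh $OLD_PRIMARY "rm -rf /var/lib/postgresql/17/main/*"
ssh $OLD_PRIMARY "systemctl start patroni"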

# Confirm rejoin
patronictl -c $PATRONI_CONF list

# Clear all iptables rules
for NODE in $DC1_NODES $DC2_NODES; do
  ssh $NODE "iptables -D INPUT -p tcp --dport 5432 -j DROP" 2>/dev/null
done

7. DR Drill Scenarios (Chaos Engineering)

Conduct DR drills at least once per quarter. A Runbook that has never been rehearsed tends to fail during a real incident: commands have drifted, keys have expired, and nobody has timed the steps. Each drill must have assigned roles and a time limit, and must produce measurable evidence.

Drill Preparation

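# Identify the current Primary (Host column of the Leader row)
CURRENT_PRIMARY=$(patronictl -c $PATRONI_CONF list | grep Leader | awk '{print $4}')

# Snapshot current data state before drill
psql -U postgres -h $CURRENT_PRIMARY -c "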
  CREATE TABLE IF NOT EXISTS dr_drill_log (
    drill_id TEXT,
    event TEXT,
    ts TIMESTAMPTZ DEFAULT now(),
    lsn PG_LSN DEFAULT pg_current_wal_lsn()
  );
  INSERT INTO dr_drill_log (drill_id, event)
  VALUES ('drill_$(date +%Y%m%d_%H%M)', 'drill_start');
"

# Generate sustained write load during drill (separate terminal, optional)
# pgbench -h $CURRENT_PRIMARY -U postgres -d postgres \
#         -c 5 -j 2 -T 300 --no-vacuum &

Scenario 1: Primary Process Force-Kill

Goal: Measure automatic failover completion time / Confirm Watchdog behavior

echo "=== Drill Scenario 1 Start: $(date) ==="

# Force-kill Patroni process on Primary node (kill -9)
# awk column 2 is the Member name, so it must resolve as an SSH host
PRIMARY_NODE=$(patronictl -c $PATRONI_CONF list \
  | grep Leader | awk '{print $2}' | cut -d: -f1)
echo "Current Primary: $PRIMARY_NODE"

# The [p] bracket keeps the pattern from matching the remote shell that runs this command
ssh $PRIMARY_NODE "pkill -9 -f '[p]atroni'"
FAILOVER_START=$(date +%s)

# Wait for failover completion and measure time
# (both values are Member names, so compare them directly)
while true; do
  NEW_LEADER=$(patronictl -c $PATRONI_CONF list 2>/dev/null \
    | grep Leader | awk '{print $2}')
  if [ -n "$NEW_LEADER" ] && [ "$NEW_LEADER" != "$PRIMARY_NODE" ]; then
    FAILOVER_END=$(date +%s)
    echo "Failover complete: $NEW_LEADER"
    echo "Elapsed time: $((FAILOVER_END - FAILOVER_START)) seconds"
    break
  fi
  sleep 1
done

# Verification
patronictl -c $PATRONI_CONF topology
psql -h $HAPROXY_HOST -p 5000 -U postgres -c \
  "SELECT inet_server_addr(), pg_is_in_recovery();"

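The polling loop in Scenario 1 runs forever if failover never completes, which is exactly the case a drill needs to detect. A minimal sketch of a bounded wait is shown below; `wait_until` is a hypothetical helper name, and the condition is any command that exits 0 once the expected state is reached.

```shell
# Hypothetical helper: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls about once per second. On success prints the elapsed seconds and
# returns 0; returns 1 if the command never succeeds within the timeout.
wait_until() {
  local timeout=$1 start now
  shift
  start=$(date +%s)
  while :; do
    if "$@"; then
      now=$(date +%s)
      echo $(( now - start ))
      return 0
    fi
    now=$(date +%s)
    [ $(( now - start )) -ge "$timeout" ] && return 1
    sleep 1
  done
}

# Usage (leader_changed is a function you define around the patronictl check):
#   ELAPSED=$(wait_until 120 leader_changed) \
#     && echo "Failover took ${ELAPSED}s" \
#     || echo "NG: no new Leader within 120 s"
```
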
Scenario 2: Network Partition Simulation (iptables)

Goal: Confirm etcd Quorum behavior under cross-region network isolation

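# Running this on the Primary node will trigger an actual failover.
# Run only on a Replica node or notify the team in advance.
# Run the script from a host outside 10.1.0.0/24; otherwise the rules below
# also cut the SSH session needed to remove them.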

echo "=== Drill Scenario 2: Cross-Region Partition Simulation ==="

REPLICA_NODE="10.2.0.10"

ssh $REPLICA_NODE "iptables -I INPUT -s 10.1.0.0/24 -j DROP
                   iptables -I OUTPUT -d 10.1.0.0/24 -j DROP"
PARTITION_START=$(date +%s)

echo "Partition started. Restoring in 30 seconds..."
sleep 30

# Release partition
ssh $REPLICA_NODE "iptables -D INPUT -s 10.1.0.0/24 -j DROP
                   iptables -D OUTPUT -d 10.1.0.0/24 -j DROP"
PARTITION_END=$(date +%s)

echo "Partition duration: $((PARTITION_END - PARTITION_START)) seconds"

# Check cluster status after recovery
sleep 15  # Wait for reconnection to stabilize
patronictl -c $PATRONI_CONF topology
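
If the drill script is interrupted between applying and removing the iptables rules, the partition stays in place. One way to reduce that risk is to register the rollback before applying the change. The sketch below uses a subshell with an EXIT trap; `with_rollback` is a hypothetical helper name, and both commands are passed as strings.

```shell
# Hypothetical helper: with_rollback HOLD_SECONDS "APPLY_CMD" "ROLLBACK_CMD"
# Runs APPLY_CMD, waits HOLD_SECONDS, and always runs ROLLBACK_CMD afterwards,
# including when APPLY_CMD fails or the wait is interrupted with Ctrl-C.
with_rollback() {
  local hold=$1 apply=$2 rollback=$3
  (
    trap 'exit 130' INT TERM     # turn signals into a normal exit
    trap "$rollback" EXIT        # rollback runs whenever this subshell exits
    eval "$apply" || exit 1
    sleep "$hold"
    exit 0
  )
}

# Usage for the 30-second partition above:
#   with_rollback 30 \
#     "ssh $REPLICA_NODE 'iptables -I INPUT -s 10.1.0.0/24 -j DROP; iptables -I OUTPUT -d 10.1.0.0/24 -j DROP'" \
#     "ssh $REPLICA_NODE 'iptables -D INPUT -s 10.1.0.0/24 -j DROP; iptables -D OUTPUT -d 10.1.0.0/24 -j DROP'"
```

The rollback still depends on SSH reaching the node, so this does not help if the operator host itself sits inside the blocked subnet.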

Scenario 3: Single etcd Node Failure

Goal: Confirm cluster stability while etcd Quorum is maintained

echo "=== Drill Scenario 3: Single etcd Node Failure ==="

# Stop Singapore etcd
ssh 10.3.0.10 "systemctl stop etcd"
ETCD_STOP=$(date)
echo "etcd stopped at: $ETCD_STOP"

# Confirm normal cluster operation (Quorum 2/3 maintained)
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379 \
  endpoint health

# Check Patroni cluster health
patronictl -c $PATRONI_CONF list

# Write test (should succeed)
psql -h $HAPROXY_HOST -p 5000 -U postgres -c \
  "INSERT INTO dr_drill_log (drill_id, event)
   VALUES ('drill_$(date +%Y%m%d)', 'etcd_node_down_write_test');"

# Restore etcd after 10 minutes
sleep 600
ssh 10.3.0.10 "systemctl start etcd"

# Confirm etcd rejoin
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  member list

Scenario 4: Planned Standby Cluster Promotion Drill

Goal: Practice manual promotion procedure / Measure RTO

echo "=== Drill Scenario 4: Planned Standby Cluster Promotion Drill ==="
DRILL_START=$(date +%s)

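# Planned shutdown of DC1 (controlled procedure for the drill)
# Do not pause the cluster first: in maintenance mode, stopping Patroni leaves
# PostgreSQL running, so DC1 would still accept writes after DC2 is promoted.
for NODE in $DC1_NODES; do
  ssh $NODE "systemctl stop patroni" &
done
wait
# Confirm DC1 is fully down before promoting (see Runbook B Step 1)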

# DC2 Standby promotion (see Runbook B Step 4)
patronictl -c /etc/patroni/patroni.yml promote-cluster $DC2_CLUSTER

PROMOTE_TIME=$(date +%s)
echo "Promotion elapsed time: $((PROMOTE_TIME - DRILL_START)) seconds"

# Write test
psql -h 10.2.0.10 -U postgres -c \
  "INSERT INTO dr_drill_log (drill_id, event)
   VALUES ('drill_$(date +%Y%m%d)', 'standby_promoted');"

# -- Post-drill restoration --
# Restart DC1 (switch back to Standby mode)
# Register Permanent Slot on DC2, then restart DC1 Patroni
# (see Part 3 Step 6)

DR Drill Record Sheet

Complete this form after every drill and archive it in the team wiki.

## DR Drill Record

- Drill date/time: YYYY-MM-DD HH:MM ~ HH:MM
- Participants: DBA, infrastructure engineer names
- Scenario: [ ] 1 (Primary kill) [ ] 2 (partition) [ ] 3 (etcd) [ ] 4 (Standby promotion)

### Measurement Results
| Item                     | Target          | Actual | Achieved |
|--------------------------|-----------------|--------|----------|
| RTO (service recovery)   | within 5 min    | __ min | OK/NG    |
| RPO (data loss)          | 0 (sync)/30 s   | __ s   | OK/NG    |
| Failover detection time  | within 30 s     | __ s   | OK/NG    |
| App reconnect time       | within 60 s     | __ s   | OK/NG    |

### Issues Found and Improvements
- [ ] Command error found in Runbook Step X -> needs correction
- [ ] Expired SSH key found on a specific node
- [ ] HAProxy DNS switch time exceeded target

### Next Drill Schedule
- Scheduled date: YYYY-MM-DD
- Scenario: (prioritize vulnerabilities found in this drill)

8. Incident Response Decision Tree

A flowchart for quickly choosing the right Runbook under pressure. The first question is always whether patronictl list can be executed.
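
The flowchart is not reproduced in this text, so the following is only a sketch of how such a decision order could be encoded. The ordering is an assumption derived from the trigger conditions of Runbooks A to E, not the author's chart. `choose_runbook` and all of its inputs are hypothetical, and the operator gathers the inputs with the commands from the individual Runbooks.

```shell
# Hypothetical sketch: choose_runbook PATRONICTL_OK PRIMARIES ETCD_HEALTHY ETCD_TOTAL DC1_UP DC1_TOTAL
#   PATRONICTL_OK  "yes" if `patronictl list` returned a table
#   PRIMARIES      number of nodes reporting pg_is_in_recovery() = false
#   ETCD_HEALTHY   healthy etcd members, out of ETCD_TOTAL
#   DC1_UP         reachable DC1 PostgreSQL nodes, out of DC1_TOTAL
choose_runbook() {
  local patronictl_ok=$1 primaries=$2 etcd_healthy=$3 etcd_total=$4 dc1_up=$5 dc1_total=$6
  if [ "$primaries" -ge 2 ]; then
    echo "E"                      # two or more Primaries: Split-Brain
  elif [ "$patronictl_ok" != "yes" ]; then
    # patronictl needs the DCS, so a failure points at a whole-DC outage or at etcd
    if [ "$dc1_up" -eq 0 ]; then
      echo "B"
    elif [ "$etcd_healthy" -lt $(( etcd_total / 2 + 1 )) ]; then
      echo "D"
    else
      echo "unknown"
    fi
  elif [ "$dc1_up" -eq "$dc1_total" ]; then
    echo "none"                   # every node reachable, single Primary
  elif [ "$dc1_up" -eq $(( dc1_total - 1 )) ]; then
    echo "A"                      # exactly one node lost
  else
    echo "C"                      # several nodes unreachable: suspect a partition
  fi
}

# Usage:
#   choose_runbook yes 1 3 3 2 3
```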

9. Post-Recovery Post-Mortem Checklist

Write the Post-Mortem within 24 hours of completing recovery. The goal is not a routine incident report but a set of action items that prevent recurrence. A Post-Mortem closes only when recovery is fully complete.

Immediate Checks (within 30 minutes of recovery)

# Final cluster state snapshot
patronictl -c $PATRONI_CONF topology > /tmp/post_recovery_topology.txt
patronictl -c $PATRONI_CONF show-config > /tmp/post_recovery_config.txt

# Confirm replication lag has converged to zero
psql -h $NEW_PRIMARY -U postgres -c "
  SELECT application_name, state, sync_state,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
"

# Basic data consistency check (compare row counts of key tables)
for NODE in $DC1_NODES; do
  echo -n "$NODE: "
  psql -h $NODE -U postgres -d mydb -t --no-align -c \
    "SELECT COUNT(*) FROM orders;" 2>/dev/null || echo "unreachable"
done

# Final etcd status check
etcdctl $ETCD_CERTS \
  --endpoints=https://10.1.0.10:2379,https://10.2.0.10:2379,https://10.3.0.10:2379 \
  endpoint health --write-out=table

# Watchdog status check
curl -s http://localhost:8008/patroni | python3 -m json.tool \
  | grep -E "role|watchdog|timeline"
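
The row-count loop in the checks above prints one `node: count` line per node. As a sketch, the comparison itself can be automated so that a mismatch or an unreachable node produces a non-zero exit status. `counts_consistent` is a hypothetical helper name. On asynchronous Replicas a small, short-lived difference can be legitimate, so treat a failure as a prompt to look closer rather than as proof of data loss.

```shell
# Hypothetical helper: reads "node: value" lines on stdin and returns 0 only
# when there is at least one line and every value is identical.
# An "unreachable" line counts as a different value and therefore fails the check.
counts_consistent() {
  awk -F': *' '
    { seen[$2]++; lines++ }
    END {
      distinct = 0
      for (v in seen) distinct++
      if (lines == 0 || distinct != 1) exit 1
    }'
}

# Usage:
#   for NODE in $DC1_NODES; do
#     echo -n "$NODE: "
#     psql -h $NODE -U postgres -d mydb -t --no-align -c \
#       "SELECT COUNT(*) FROM orders;" 2>/dev/null || echo "unreachable"
#   done | counts_consistent || echo "NG: row counts differ"
```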

Post-Mortem Document Template

## Post-Mortem Report

- Incident start time: YYYY-MM-DD HH:MM:SS (KST)
- Alert received time: (monitoring alert timestamp)
- Recovery complete time: YYYY-MM-DD HH:MM:SS (KST)
- Total downtime: __ min __ sec
- Impact scope: Write unavailable / Read unavailable / Fully down
- Owner: DBA name

### Timeline
| Time    | Event                          |
|---------|--------------------------------|
| T+00:00 | Incident occurred (estimated)  |
| T+00:XX | Monitoring alert received      |
| T+00:XX | On-call responded              |
| T+00:XX | Root cause identified          |
| T+00:XX | Recovery started               |
| T+00:XX | Service restored               |

### Root Cause
(5-Why analysis result)

### Immediate Actions
- (Done) ...
- (In progress) ...
- (Planned) ...

### Recurrence Prevention Action Items
| Item                              | Owner    | Due        | Status |
|-----------------------------------|----------|------------|--------|
| Change Watchdog mode to required  | DBA      | YYYY-MM-DD | []     |
| Add etcd disk I/O alert           | Infra    | YYYY-MM-DD | []     |
| Increase DR drills to monthly     | DBA lead | YYYY-MM-DD | []     |

### Runbook Improvements
(Which parts of the Runbook were inaccurate in this incident?)

Part 6, the next article in this series, covers monitoring, automation, and Best Practices for day-to-day operation of a multi-region Patroni cluster. Topics include a Prometheus + Grafana Patroni dashboard, patroni_exporter key alert rules, a pgBackRest multi-region backup strategy, and a patronictl command cheat sheet.