Operating Patroni H/A Across Multiple Regions — Part 6: Monitoring, Operations Automation, and Best Practices
Monitor the control plane and data plane together with Prometheus, Grafana, and Alertmanager. Complete the HA story with pgBackRest multi-region backup, Ansible automation, a patronictl cheat sheet, and series-wide Best Practices — the final installment.
This is the final part of the series. Part 1 laid out the architecture, Parts 2 and 3 built synchronous and asynchronous replication, Part 4 stacked the Split-Brain defense layers, and Part 5 established the Runbook and DR drill framework. One question remains — how do you live with this cluster every day?
HA is not a one-time configuration. It is complete only when you observe it daily, validate it through backups, and make healthy operation repeatable through automation. This part covers that operational foundation.
1. Understanding the Patroni Metrics Collection Architecture
Patroni has natively provided Prometheus-format metrics via the /metrics endpoint since v2.1.0. The key advantage is that cluster state can be collected through the Patroni REST API alone, without a separate exporter.
The key principle of the monitoring stack is to observe Patroni (control plane), PostgreSQL (data plane), and etcd (DCS) together rather than in isolation. HA failures frequently surface first as control-plane signals — DCS heartbeat delays, paused failover, repeated etcd leader elections — not as data-plane events.
Key Patroni /metrics Metrics
| Metric Name | Type | Description |
|---|---|---|
| patroni_primary | Gauge | 1 = currently Primary (holds Leader Lock) |
| patroni_replica | Gauge | 1 = currently Replica |
| patroni_standby_leader | Gauge | 1 = Standby Cluster Leader |
| patroni_xlog_location | Gauge | Current WAL position (LSN, bytes) |
| patroni_xlog_received_location | Gauge | Received WAL position (Replica) |
| patroni_xlog_replayed_location | Gauge | Replayed WAL position (Replica) |
| patroni_postgres_running | Gauge | Whether PostgreSQL process is running |
| patroni_dcs_last_seen | Gauge | Last DCS (etcd) communication time (Unix timestamp) |
| patroni_failsafe_mode_is_active | Gauge | Whether DCS Failsafe Mode is active |
| patroni_is_paused | Gauge | Whether automatic failover is paused |
| patroni_heartbeat_failed_at | Gauge | Timestamp of last DCS heartbeat failure |
# Check Patroni /metrics response sample
curl -s http://10.1.0.10:8008/metrics | grep -E "^patroni_"
# Expected output (Primary node):
# patroni_primary 1
# patroni_replica 0
# patroni_postgres_running 1
# patroni_xlog_location 5.36870912e+08
# patroni_dcs_last_seen 1.745780000e+09
# patroni_failsafe_mode_is_active 0
# patroni_is_paused 0
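To make the sample output concrete, here is a minimal Python sketch that parses the text returned by /metrics and derives the node's role from it. The helper names `parse_patroni_metrics` and `node_role` are hypothetical, the label values in `SAMPLE` are illustrative, and the parser assumes each sample line ends with the value (no trailing timestamp).

```python
def parse_patroni_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {metric_name: value}, dropping labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(raw_value)
    return values


def node_role(metrics: dict) -> str:
    """Map the role gauges to a single role string."""
    if metrics.get("patroni_primary") == 1:
        return "primary"
    if metrics.get("patroni_standby_leader") == 1:
        return "standby_leader"
    if metrics.get("patroni_replica") == 1:
        return "replica"
    return "unknown"


# Illustrative sample, shaped like the expected output above
SAMPLE = """\
# TYPE patroni_primary gauge
patroni_primary{scope="pg-multiregion",name="pg-seoul-1"} 1
patroni_replica{scope="pg-multiregion",name="pg-seoul-1"} 0
patroni_xlog_location{scope="pg-multiregion",name="pg-seoul-1"} 5.36870912e+08
"""

if __name__ == "__main__":
    print(node_role(parse_patroni_metrics(SAMPLE)))
```

In practice the text would come from the curl call above rather than a hard-coded string.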
2. Prometheus Scrape Configuration
Prometheus Installation (Dedicated Monitoring Node, Docker Compose)
# /opt/monitoring/docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v3.3.1
    container_name: prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      # TLS files referenced by tls_config in prometheus.yml (host path is an assumption)
      - ./prometheus/ssl:/etc/prometheus/ssl:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
  grafana:
    image: grafana/grafana:11.6.0
    container_name: grafana
    ports: ["3000:3000"]
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=SecureGrafanaPass!
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:v0.28.1
    container_name: alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
prometheus.yml — Patroni + PostgreSQL + etcd Scraping
# /opt/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'pg-multiregion'
    env: 'production'
rule_files:
  - "rules/patroni_alerts.yml"
  - "rules/postgresql_alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  # -- Patroni REST API /metrics --
  - job_name: 'patroni'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ssl/ca.pem
      cert_file: /etc/prometheus/ssl/prometheus.pem
      key_file: /etc/prometheus/ssl/prometheus-key.pem
    static_configs:
      - targets:
          - '10.1.0.10:8008' # pg-seoul-1
          - '10.1.0.11:8008' # pg-seoul-2
          - '10.1.0.12:8008' # pg-seoul-3
          - '10.2.0.10:8008' # pg-busan-1
          - '10.2.0.11:8008' # pg-busan-2
          - '10.2.0.12:8008' # pg-busan-3
    relabel_configs:
      # Automatically assign DC label
      - source_labels: [__address__]
        regex: '10\.1\..*'
        target_label: dc
        replacement: 'seoul'
      - source_labels: [__address__]
        regex: '10\.2\..*'
        target_label: dc
        replacement: 'busan'
  # -- postgres_exporter (PostgreSQL internal metrics) --
  - job_name: 'postgres'
    static_configs:
      - targets:
          - '10.1.0.10:9187'
          - '10.1.0.11:9187'
          - '10.1.0.12:9187'
          - '10.2.0.10:9187'
          - '10.2.0.11:9187'
          - '10.2.0.12:9187'
  # -- etcd metrics --
  - job_name: 'etcd'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/ssl/ca.pem
      cert_file: /etc/prometheus/ssl/prometheus.pem
      key_file: /etc/prometheus/ssl/prometheus-key.pem
    static_configs:
      - targets:
          - '10.1.0.10:2381'
          - '10.2.0.10:2381'
          - '10.3.0.10:2381'
  # -- Node Exporter (OS level) --
  - job_name: 'node'
    static_configs:
      - targets:
          - '10.1.0.10:9100'
          - '10.1.0.11:9100'
          - '10.1.0.12:9100'
          - '10.2.0.10:9100'
          - '10.2.0.11:9100'
          - '10.2.0.12:9100'
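The relabel rules that derive the dc label can be mimicked in a few lines of Python. This is only a sketch of the matching behaviour, not of Prometheus itself: `dc_label` and `DC_RULES` are hypothetical names, and `re.fullmatch` stands in for the fact that Prometheus anchors relabel regexes at both ends.

```python
import re
from typing import Optional

# (regex, dc) pairs copied from the relabel_configs of the patroni job
DC_RULES = [(r"10\.1\..*", "seoul"), (r"10\.2\..*", "busan")]


def dc_label(address: str) -> Optional[str]:
    """Return the dc label a scrape target would receive, or None if no rule matches."""
    for pattern, dc in DC_RULES:
        # fullmatch mirrors Prometheus, which anchors relabel regexes with ^ and $
        if re.fullmatch(pattern, address):
            return dc
    return None


if __name__ == "__main__":
    for addr in ("10.1.0.10:8008", "10.2.0.12:8008", "10.3.0.10:2381"):
        print(addr, "->", dc_label(addr))
```

Because the regex is anchored, an address such as 110.1.0.5:8008 does not match the Seoul rule, and the third-site etcd node at 10.3.0.10 receives no dc label.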
postgres_exporter Installation and Configuration
# Run on all PostgreSQL nodes
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.17.1/postgres_exporter-0.17.1.linux-amd64.tar.gz
tar xzf postgres_exporter-0.17.1.linux-amd64.tar.gz
cp postgres_exporter-0.17.1.linux-amd64/postgres_exporter /usr/local/bin/
# Create dedicated monitoring user (PostgreSQL)
psql -U postgres -c "
CREATE ROLE prometheus_scraper WITH LOGIN NOSUPERUSER NOCREATEDB NOCREATEROLE
PASSWORD 'MonitoringPass123!';
GRANT pg_monitor TO prometheus_scraper;
GRANT CONNECT ON DATABASE postgres TO prometheus_scraper;
"
# Register systemd service
cat > /etc/systemd/system/postgres_exporter.service <<'EOF'
[Unit]
Description=PostgreSQL Exporter for Prometheus
After=patroni.service
[Service]
Type=simple
User=postgres
Environment="DATA_SOURCE_NAME=postgresql://prometheus_scraper:MonitoringPass123!@localhost:5432/postgres?sslmode=disable"
Environment="PG_EXPORTER_AUTO_DISCOVER_DATABASES=true"
ExecStart=/usr/local/bin/postgres_exporter \
--web.listen-address=:9187 \
--collector.stat_bgwriter \
--collector.replication \
--collector.replication_slot
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now postgres_exporter
3. Core Alert Rules
Alert rules are operational invariants expressed as code. Conditions such as "there must be exactly one Primary" or "every node must have reached the DCS recently, well inside the TTL" fire an alert the moment they are violated.
# /opt/monitoring/prometheus/rules/patroni_alerts.yml
# Patroni labels every metric with scope (cluster name) and name (node name).
# external_labels such as cluster are attached only when alerts leave Prometheus,
# so rule expressions and annotations group by scope.
groups:
  - name: patroni_critical
    rules:
      # -- No Primary: entire cluster write-unavailable --
      # absent() covers the case where every node is down and no series exist
      - alert: PatroniNoPrimary
        expr: sum by (scope) (patroni_primary) == 0 or absent(patroni_primary)
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No Primary in Patroni cluster"
          description: >
            No Primary node detected in cluster {{ $labels.scope }}.
            Either automatic failover is in progress or all nodes are down.
            Execute Runbook A or B immediately.
      # -- Two or more Primaries: Split-Brain risk --
      - alert: PatroniSplitBrain
        expr: sum by (scope) (patroni_primary) > 1
        for: 0s # Fire immediately (no delay)
        labels:
          severity: critical
        annotations:
          summary: "Split-Brain detected: {{ $value }} Primaries"
          description: >
            Two or more Primaries detected in cluster {{ $labels.scope }}.
            Execute Runbook E immediately and block all writes.
      # -- PostgreSQL process down --
      - alert: PatroniPostgresDown
        expr: patroni_postgres_running == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} PostgreSQL process stopped"
          description: >
            PostgreSQL on {{ $labels.instance }} is not running.
            Patroni is attempting automatic recovery. Manual inspection
            required if the condition persists beyond 30 seconds.
  - name: patroni_warning
    rules:
      # -- DCS communication lost (20 s threshold against ttl: 30) --
      - alert: PatroniDCSUnreachable
        expr: time() - patroni_dcs_last_seen > 20
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} DCS communication lost"
          description: >
            {{ $labels.instance }} has not communicated with etcd
            for {{ $value | humanizeDuration }}. Auto-demotion will
            occur when TTL expires.
      # -- Automatic failover disabled --
      - alert: PatroniAutofailoverPaused
        expr: patroni_is_paused == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} automatic failover paused"
          description: >
            patronictl pause has been active for over 5 minutes.
            Run patronictl resume after maintenance is complete.
      # -- High replication lag (over 500 MB) --
      # The Primary's WAL position and a Replica's replayed position live on
      # different instances, so match them on scope and keep only Replicas.
      - alert: PatroniReplicationLagHigh
        expr: >
          (
            max by (scope) (patroni_xlog_location)
            - on (scope) group_right patroni_xlog_replayed_location
          ) > 524288000
          and on (instance) (patroni_replica == 1)
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} replication lag exceeds 500 MB"
          description: >
            Replication lag on {{ $labels.instance }} is {{ $value | humanize1024 }}B.
            Check network status and disk I/O.
      # -- Insufficient Replicas --
      - alert: PatroniInsufficientReplicas
        expr: sum by (scope) (patroni_replica) < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Cluster {{ $labels.scope }} has no active Replicas"
          description: >
            There are currently no active Replicas in the cluster.
            Automatic failover is not possible if the Primary fails.
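Replication lag is a cross-node quantity: the Primary's current WAL position minus each Replica's replayed position. The Python sketch below spells out that arithmetic on already-scraped values. The function name `replication_lag_bytes` and the input shape (node name mapped to a metric dictionary) are assumptions made for illustration.

```python
LAG_THRESHOLD_BYTES = 524288000  # 500 MB, the threshold used by the lag alert


def replication_lag_bytes(nodes: dict) -> dict:
    """Return {replica_name: lag_in_bytes} from per-node Patroni metric values."""
    primary_lsn = max(
        (m.get("patroni_xlog_location", 0.0)
         for m in nodes.values() if m.get("patroni_primary") == 1),
        default=None,
    )
    if primary_lsn is None:
        return {}  # no Primary visible, so lag cannot be computed
    return {
        name: primary_lsn - m.get("patroni_xlog_replayed_location", 0.0)
        for name, m in nodes.items()
        if m.get("patroni_replica") == 1
    }


if __name__ == "__main__":
    scraped = {
        "pg-seoul-1": {"patroni_primary": 1, "patroni_xlog_location": 1_000_000_000},
        "pg-seoul-2": {"patroni_replica": 1, "patroni_xlog_replayed_location": 999_000_000},
        "pg-busan-1": {"patroni_replica": 1, "patroni_xlog_replayed_location": 400_000_000},
    }
    for node, lag in replication_lag_bytes(scraped).items():
        flag = "LAGGING" if lag > LAG_THRESHOLD_BYTES else "ok"
        print(f"{node}: {lag:.0f} bytes behind ({flag})")
```

Subtracting the two metrics on the same node would not give a meaningful lag, which is why the Primary's position is looked up first and then compared against each Replica.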
alertmanager.yml — Slack + PagerDuty Integration
# /opt/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'dc']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-warning'
receivers:
  - name: 'default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#db-alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#db-warnings'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
4. Grafana Dashboard Configuration
A Grafana dashboard is an incident decision screen, not a decorative chart. It must let an operator determine in three seconds which node is Primary, whether replication lag has exceeded the threshold, and whether DCS communication is alive.
Official Dashboard IDs for Immediate Import
| Dashboard | Grafana ID | Description |
|---|---|---|
| PostgreSQL Patroni (Percona PMM) | 18870 | Cluster state via Patroni /metrics |
| PostgreSQL Overview | 9628 | DB internal metrics via postgres_exporter |
| pgBackRest Exporter | 17709 | Backup status and WAL archive health |
| etcd | 3070 | etcd cluster state and Raft metrics |
| Node Exporter Full | 1860 | OS-level CPU/memory/disk/network |
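# Auto-import dashboards via Grafana API (requires curl, jq, python3)
# Fetch each dashboard JSON from grafana.com first, then post it to /api/dashboards/import
GRAFANA_URL="http://localhost:3000"
GRAFANA_AUTH="admin:SecureGrafanaPass!"
for ID in 18870 9628 17709 3070 1860; do
  echo "Importing dashboard ID: $ID"
  # 1) Download the latest revision of the dashboard JSON
  curl -sf "https://grafana.com/api/dashboards/${ID}/revisions/latest/download" \
    -o "/tmp/dashboard-${ID}.json" || { echo "download failed: $ID"; continue; }
  # 2) Wrap it in an import payload and post it
  #    (the datasource input name DS_PROMETHEUS varies between dashboards)
  jq -n --slurpfile d "/tmp/dashboard-${ID}.json" \
    '{dashboard: $d[0], folderId: 0, overwrite: true,
      inputs: [{name: "DS_PROMETHEUS", type: "datasource",
                pluginId: "prometheus", value: "Prometheus"}]}' \
  | curl -s -X POST \
      -H "Content-Type: application/json" \
      -u "$GRAFANA_AUTH" \
      -d @- \
      "${GRAFANA_URL}/api/dashboards/import" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('title', d))"
done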
Key Panels to Check During Operations
Daily Health Check:
- Is exactly one node showing patroni_primary == 1?
- Is replay_lag for all Replicas within the acceptable threshold?
- Is patroni_dcs_last_seen close to the current time? (within 20 seconds)
- Was the last pgBackRest backup within the past 24 hours?
Weekly Health Check:
- Has etcd leader election frequency spiked?
- Are per-node disk I/O and WAL generation rate normal?
- Is WAL retained in Replication Slots growing excessively?
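Part of the daily checklist can be expressed as a small function, which makes it easy to run from cron or a chat bot. The sketch below covers three of the daily items (Primary count, DCS freshness, backup age); `daily_health_check` and its parameters are hypothetical, and the inputs are assumed to have been collected already from /metrics and pgbackrest info.

```python
import time


def daily_health_check(nodes, last_backup_ts, now=None,
                       dcs_max_age=20, backup_max_age=24 * 3600):
    """Return a list of human-readable problems; an empty list means healthy."""
    now = time.time() if now is None else now
    problems = []

    # Exactly one node must report patroni_primary == 1
    primaries = [n for n, m in nodes.items() if m.get("patroni_primary") == 1]
    if len(primaries) != 1:
        problems.append(f"expected exactly 1 primary, found {len(primaries)}")

    # Every node must have talked to the DCS recently
    for name, metrics in sorted(nodes.items()):
        age = now - metrics.get("patroni_dcs_last_seen", 0)
        if age > dcs_max_age:
            problems.append(f"{name}: DCS last seen {age:.0f}s ago")

    # The newest backup must be younger than 24 hours
    if now - last_backup_ts > backup_max_age:
        problems.append("last backup older than 24h")
    return problems


if __name__ == "__main__":
    now = 1_745_780_000
    nodes = {
        "pg-seoul-1": {"patroni_primary": 1, "patroni_dcs_last_seen": now - 5},
        "pg-seoul-2": {"patroni_primary": 0, "patroni_dcs_last_seen": now - 90},
    }
    print(daily_health_check(nodes, last_backup_ts=now - 3600, now=now))
```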
5. pgBackRest Multi-Region Backup Strategy
Patroni and pgBackRest are highly complementary. Beyond being a backup tool, pgBackRest can handle Replica initialization (create_replica_methods) and reinitialization (reinit), reducing network load and dramatically accelerating replication setup for large clusters.
High availability and disaster recovery are different concepts. Patroni failover provides HA, but it cannot recover from accidental data deletion or storage corruption. pgBackRest is the safety net behind failover.
Multi-Region Backup Architecture
DC1 PostgreSQL nodes
- archive_command: push WAL to S3 (Seoul) + S3 (Singapore)
- backup-standby: prefer -> backup from Replica (reduces Primary load)
pgBackRest Repository:
repo1: S3 ap-northeast-2 (Seoul) <- primary repo, keeps 7 full backups
repo2: S3 ap-southeast-1 (Singapore) <- DR repo, keeps 3 full backups
pgbackrest.conf — Multi-Repository + S3 Configuration
# /etc/pgbackrest/pgbackrest.conf (apply identically on all PG nodes)
[global]
# Comments sit on their own lines; a trailing "# ..." would be read as part of the value
# parallel processing
process-max=4
start-fast=y
# delta: use checksums to skip files that are already identical
delta=y
# async WAL archiving (minimizes write latency)
archive-async=y
compress-type=lz4
compress-level=3
# prefer a Replica for backup
backup-standby=prefer
# -- Repository 1: S3 Seoul (primary) --
repo1-type=s3
repo1-path=/pg-multiregion
repo1-s3-bucket=my-pgbackrest-seoul
repo1-s3-region=ap-northeast-2
repo1-s3-endpoint=s3.ap-northeast-2.amazonaws.com
repo1-s3-uri-style=host
repo1-retention-full=7
repo1-retention-diff=14
repo1-retention-full-type=count
# -- Repository 2: S3 Singapore (DR cross-region) --
repo2-type=s3
repo2-path=/pg-multiregion
repo2-s3-bucket=my-pgbackrest-singapore
repo2-s3-region=ap-southeast-1
repo2-s3-endpoint=s3.ap-southeast-1.amazonaws.com
repo2-s3-uri-style=host
repo2-retention-full=3
repo2-retention-full-type=count
log-level-console=info
log-level-file=detail
log-path=/var/log/pgbackrest
# -- Stanza: register all nodes (maintains archive after failover) --
[pg-multiregion]
pg1-path=/var/lib/postgresql/17/main
pg1-host=10.1.0.10
pg1-host-user=postgres
pg1-port=5432
pg2-path=/var/lib/postgresql/17/main
pg2-host=10.1.0.11
pg2-host-user=postgres
pg2-port=5432
pg3-path=/var/lib/postgresql/17/main
pg3-host=10.1.0.12
pg3-host-user=postgres
pg3-port=5432
Integrating pgBackRest with Patroni
# patroni.yml - pgBackRest integration
bootstrap:
  dcs:
    postgresql:
      parameters:
        archive_mode: "on"
        archive_command: >
          pgbackrest --stanza=pg-multiregion
          --config=/etc/pgbackrest/pgbackrest.conf
          archive-push %p
        restore_command: >
          pgbackrest --stanza=pg-multiregion
          --config=/etc/pgbackrest/pgbackrest.conf
          archive-get %f "%p"
# Local (per-node) section, outside bootstrap.dcs
postgresql:
  # Handle Replica initialization with pgBackRest (instead of pg_basebackup)
  # Much faster and more efficient than basebackup for large clusters
  create_replica_methods:
    - pgbackrest
    - basebackup   # built-in pg_basebackup as fallback if the restore fails
  pgbackrest:
    command: >
      pgbackrest --stanza=pg-multiregion
      --config=/etc/pgbackrest/pgbackrest.conf
      --delta restore
    keep_data: True
    no_params: True
Backup Schedule Automation (crontab)
# Add to the postgres user's crontab
crontab -u postgres -e
# Full backup: Sunday 02:00
0 2 * * 0 pgbackrest --stanza=pg-multiregion --type=full backup --repo=1
# Differential backup: Monday-Saturday 02:00
0 2 * * 1-6 pgbackrest --stanza=pg-multiregion --type=diff backup --repo=1
# DR repo full backup: daily 04:00
0 4 * * * pgbackrest --stanza=pg-multiregion --type=full backup --repo=2
# Backup integrity verification: Saturday 03:00
0 3 * * 6 pgbackrest --stanza=pg-multiregion verify
# Daily status report: 06:00 (a crontab entry must stay on a single line)
0 6 * * * pgbackrest info --stanza=pg-multiregion | mail -s "[pgBackRest] Daily Backup Status" dba@example.com
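The retention settings and the cron schedule interact. With retention-full-type=count, retention is measured in full backups rather than days, so the time span covered depends on how often a full backup runs. The helper below is a rough approximation; `retention_window_days` is a hypothetical name.

```python
def retention_window_days(retention_full: int, days_between_full_backups: float) -> float:
    """Approximate span, in days, covered by count-based full-backup retention."""
    return retention_full * days_between_full_backups


if __name__ == "__main__":
    # repo1: 7 full backups kept, one full backup per week
    print("repo1:", retention_window_days(7, 7), "days")
    # repo2: 3 full backups kept, one full backup per day
    print("repo2:", retention_window_days(3, 1), "days")
```

Under the schedule above, repo1 therefore spans roughly seven weeks and repo2 roughly three days. If a window expressed in days is what you want, pgBackRest also offers retention-full-type=time.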
PITR Restore Test (Recommended Monthly)
A backup system that has never been restored is an unverified system. Run a restore on a separate test instance at least once a month, directly measuring restore time and data consistency.
# Restore to a specific point in time on a test instance
pgbackrest --stanza=pg-multiregion \
--type=time \
--target="2026-04-26 14:30:00+09" \
--target-action=promote \
--log-level-console=detail \
restore
# Verify data consistency
psql -U postgres -c "
SELECT COUNT(*), MAX(created_at) FROM orders
WHERE created_at < '2026-04-26 14:30:00+09';
"
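To record restore time consistently from month to month, the restore command can be wrapped in a small timing harness. This is a sketch under assumptions: `build_restore_cmd` and `timed_run` are hypothetical helpers, only flags already shown above are used, and the script must run on the test instance with pgBackRest configured.

```python
import subprocess
import time


def build_restore_cmd(stanza: str, target_time: str) -> list:
    """Assemble the PITR restore command shown above as an argument list."""
    return [
        "pgbackrest",
        f"--stanza={stanza}",
        "--type=time",
        f"--target={target_time}",
        "--target-action=promote",
        "restore",
    ]


def timed_run(cmd: list, runner=subprocess.run):
    """Run cmd and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = runner(cmd, check=False)
    return result.returncode, time.monotonic() - start


if __name__ == "__main__":
    cmd = build_restore_cmd("pg-multiregion", "2026-04-26 14:30:00+09")
    code, seconds = timed_run(cmd)
    print(f"restore exit code {code}, took {seconds:.1f}s")
```

Logging the elapsed time after each monthly drill gives a measured RTO figure rather than an estimate.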
6. Cluster Configuration Automation with Ansible
Applying consistent configuration across all nodes through Ansible eliminates manual errors and allows the same management approach even as the cluster count grows to dozens. Configuration drift — subtle differences between nodes — is a silent source of failures.
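Drift can also be detected after the fact by fingerprinting the configuration each node actually runs. The sketch below hashes each node's config text and reports the nodes that differ from the majority. `find_drift` is a hypothetical helper, and it assumes node-specific fields (node name, connect addresses) have been stripped or templated out before comparison, since those legitimately differ.

```python
import hashlib
from collections import Counter


def find_drift(configs: dict) -> list:
    """Return the nodes whose config text differs from the majority version."""
    digests = {
        node: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for node, text in configs.items()
    }
    if not digests:
        return []
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, d in digests.items() if d != majority_digest)


if __name__ == "__main__":
    configs = {
        "pg-seoul-1": "ttl: 30\nloop_wait: 10\n",
        "pg-seoul-2": "ttl: 30\nloop_wait: 10\n",
        "pg-busan-1": "ttl: 30\nloop_wait: 15\n",  # hand-edited on the node
    }
    print(find_drift(configs))
```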
Directory Structure
ansible/
inventory/
production.ini
staging.ini
group_vars/
all.yml # common variables (versions, TTL, etc.)
dc1.yml # DC1-specific variables
dc2.yml # DC2-specific variables
roles/
common/ # OS base configuration
etcd/ # etcd installation/configuration
postgresql/ # PostgreSQL installation
patroni/ # Patroni installation/configuration
haproxy/ # HAProxy configuration
pgbackrest/ # pgBackRest installation/configuration
site.yml # master playbook
inventory/production.ini
[dc1_nodes]
pg-seoul-1 ansible_host=10.1.0.10 dc=seoul
pg-seoul-2 ansible_host=10.1.0.11 dc=seoul
pg-seoul-3 ansible_host=10.1.0.12 dc=seoul
[dc2_nodes]
pg-busan-1 ansible_host=10.2.0.10 dc=busan
pg-busan-2 ansible_host=10.2.0.11 dc=busan
pg-busan-3 ansible_host=10.2.0.12 dc=busan
[pg_all:children]
dc1_nodes
dc2_nodes
[haproxy]
haproxy-seoul ansible_host=10.1.0.20
group_vars/all.yml
postgresql_version: "17"
patroni_version: "4.1.2"
etcd_version: "v3.5.17"
patroni_scope: "pg-multiregion"
patroni_namespace: "/db/"
patroni_dcs:
  ttl: 30
  loop_wait: 10
  retry_timeout: 10
  maximum_lag_on_failover: 1048576
  synchronous_mode: true
patroni_watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5
tls_ca_cert: /etc/etcd/ssl/ca.pem
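The three timing values are tied together by the rule Patroni checks at startup: loop_wait + 2 * retry_timeout <= ttl. A quick check like the one below can run in CI before the variables are rolled out. `dcs_timing_headroom` is a hypothetical helper name.

```python
def dcs_timing_headroom(ttl: int, loop_wait: int, retry_timeout: int) -> int:
    """Seconds of slack left by the rule loop_wait + 2 * retry_timeout <= ttl.

    A negative result means the rule is violated.
    """
    return ttl - (loop_wait + 2 * retry_timeout)


if __name__ == "__main__":
    headroom = dcs_timing_headroom(ttl=30, loop_wait=10, retry_timeout=10)
    print(f"headroom: {headroom}s")
    if headroom < 0:
        raise SystemExit("patroni_dcs timing violates loop_wait + 2*retry_timeout <= ttl")
```

The values in group_vars/all.yml satisfy the rule with zero headroom, so raising retry_timeout to absorb inter-region RTT would require raising ttl as well.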
roles/patroni/tasks/main.yml (Key Excerpt)
---
- name: Install Patroni in virtualenv
  pip:
    name: "patroni[etcd3]"
    version: "{{ patroni_version }}"
    virtualenv: /opt/patroni
    virtualenv_python: python3
  become: yes
- name: Deploy patroni.yml (Jinja2 template)
  template:
    src: patroni.yml.j2
    dest: /etc/patroni/patroni.yml
    owner: postgres
    group: postgres
    mode: '0640'
  notify: restart patroni
- name: Load softdog kernel module
  modprobe:
    name: softdog
    state: present
- name: Set /dev/watchdog permissions
  file:
    path: /dev/watchdog
    owner: postgres
    group: postgres
    mode: '0600'
- name: Register Patroni systemd unit
  template:
    src: patroni.service.j2
    dest: /etc/systemd/system/patroni.service
  notify:
    - systemd daemon-reload
    - restart patroni
- name: Enable Patroni systemd service
  systemd:
    name: patroni
    enabled: yes
    daemon_reload: yes
- name: Disable PostgreSQL systemd service (managed exclusively by Patroni)
  systemd:
    name: "postgresql@{{ postgresql_version }}-main"
    state: stopped
    enabled: no
Run Examples
# Initial full cluster deployment
ansible-playbook -i inventory/production.ini site.yml --ask-vault-pass
# Rolling Update after Patroni config change (one node at a time)
# ansible-playbook has no --serial flag; set "serial: 1" on the play in site.yml
ansible-playbook -i inventory/production.ini site.yml \
  --tags patroni-config \
  --ask-vault-pass
# Update DC1 only
ansible-playbook -i inventory/production.ini site.yml \
--limit dc1_nodes --tags patroni
# Dry run (preview changes)
ansible-playbook -i inventory/production.ini site.yml \
--check --diff
7. patronictl Cheat Sheet
Frequently used commands collected in one place. Adding export PATRONICTL_CONFIG_FILE=/etc/patroni/patroni.yml to your shell profile eliminates the need to pass -c /etc/patroni/patroni.yml every time.
Status Checks
# Cluster topology (role, state, replication lag)
patronictl topology
# Concise list view
patronictl list
# Dynamic configuration stored in DCS
patronictl show-config
# Detailed REST API information for a specific node
curl -s http://10.1.0.10:8008/patroni | python3 -m json.tool
# Check all health check endpoints
for EP in /primary /replica /synchronous /standby-leader /health; do
  echo "$EP: $(curl -s -o /dev/null -w '%{http_code}' "http://10.1.0.10:8008$EP")"
done
# Cluster timeline history (failover event log)
patronictl history
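The health check endpoints answer with HTTP 200 when the node currently holds the role in question and with a non-200 status otherwise, so the status codes from the loop above are enough to infer a node's role. A minimal sketch, assuming the codes have already been collected into a dictionary; `role_from_status` is a hypothetical helper.

```python
def role_from_status(codes: dict) -> str:
    """Infer a node's role from {endpoint: http_status} as returned by Patroni."""
    if codes.get("/primary") == 200:
        return "primary"
    if codes.get("/standby-leader") == 200:
        return "standby_leader"
    if codes.get("/synchronous") == 200:
        return "sync_replica"
    if codes.get("/replica") == 200:
        return "replica"
    if codes.get("/health") == 200:
        return "running_other"  # PostgreSQL is up but no role endpoint matched
    return "down"


if __name__ == "__main__":
    sample = {"/primary": 503, "/replica": 200, "/synchronous": 200,
              "/standby-leader": 503, "/health": 200}
    print(role_from_status(sample))
```

The order of the checks matters: a synchronous Replica answers 200 on both /synchronous and /replica, so the more specific endpoint is tested first.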
Failover / Role Switching
# Planned Switchover (brief write interruption while the leader changes)
# Verify Replica is in sync before proceeding
# --leader replaces the deprecated --master flag
patronictl switchover pg-multiregion \
  --leader pg-seoul-1 --candidate pg-seoul-2 \
  --scheduled now --force
# Forced Failover (manual trigger when no Primary exists)
# Prerequisite: confirm patronictl list shows zero Primaries
patronictl failover pg-multiregion \
  --candidate pg-busan-1 --force
# Standby Cluster promotion (Patroni 4.1+, see Runbook B)
patronictl promote-cluster pg-busan-standby
# Primary -> Standby Cluster demotion
patronictl demote-cluster pg-seoul-cluster
Node Management
# Reinitialize a specific node (full resync)
# Warning: deletes the node's data directory and resyncs from scratch
patronictl reinit pg-multiregion pg-seoul-1 --force
# Restart a specific node
patronictl restart pg-multiregion pg-seoul-1
# Pause automatic failover (before starting maintenance)
patronictl pause pg-multiregion --wait
# Resume automatic failover (after maintenance is complete)
patronictl resume pg-multiregion --wait
Configuration Management
# Edit dynamic DCS configuration (applied cluster-wide in real time)
patronictl edit-config
# Change a single parameter immediately
patronictl edit-config --set synchronous_mode=true --force
# Reload PostgreSQL configuration (no restart needed)
patronictl reload pg-multiregion
# Restart PostgreSQL (after changing parameters requiring restart)
patronictl restart pg-multiregion --scheduled now --force
8. Operations Best Practices
Key lessons learned across the entire series, consolidated in one place.
Design
- Deploy etcd nodes in odd numbers (3, 5, 7) distributed across 3 or more DCs
- If 3 DCs is not feasible, design with 2 DCs + Standby Cluster
- Use synchronous_mode: true when RPO=0 is required
- Always factor in inter-region RTT when tuning TTL, loop_wait, and retry_timeout
- Watch for disk I/O contention when etcd and PostgreSQL share the same node
Security
- Apply mTLS to all inter-region communication (etcd peer/client, Patroni REST API)
- Follow the principle of least privilege for replicator and rewind_user accounts
- Manage passwords in patroni.yml through Vault or Secret Manager
- Configure TLS client authentication for the Patroni REST API
Reliability
- Enable Watchdog (mode: required) on all Primary-candidate nodes
- Keep the PostgreSQL systemd service disabled (Patroni manages it exclusively); an enabled postgresql.service is a silent cause of Split-Brain
- Use Permanent Replication Slots to prevent premature WAL deletion
- Always use patronictl pause/resume before and after maintenance operations
Backup
- Use pgBackRest with backup-standby=prefer to reduce Primary load
- Maintain backup repositories in two or more different regions (3-2-1 rule)
- Perform an actual PITR restore test at least once a month
- Detect archive_command failures immediately through alerting
- Register all Patroni nodes in pgbackrest.conf to maintain archive after failover
Monitoring
- Alert when the number of nodes with patroni_primary == 1 is not exactly one
- Fire a warning when patroni_dcs_last_seen is older than about two-thirds of the TTL (20 seconds with ttl: 30)
- Set threshold alerts on replication lag (Primary xlog_location minus Replica replayed location)
- Alert when patroni_is_paused persists for more than 5 minutes
- Investigate immediately when etcd leader election frequency spikes
Operational Culture
- Establish a team routine of checking patronictl topology daily
- Validate Runbooks through quarterly DR drills and update them immediately
- Write a Post-Mortem after every incident and track action items to completion
- Perform Patroni version upgrades as a Rolling Update: Replicas first, then Primary
- An undrilled Runbook will not work during a real incident
9. Series Wrap-Up
Across six parts, we covered the complete lifecycle of multi-region Patroni HA operations from the ground up.
- Part 1: Why multi-region? Which architecture to choose?
- Part 2: Achieve RPO=0 and automatic failover with 3-DC synchronous replication
- Part 3: Secure cost efficiency and flexibility with 2-DC async + Standby Cluster
- Part 4: Block Split-Brain at the source with Watchdog, STONITH, and etcd Quorum
- Part 5: Build real-world incident response capability through typed Runbooks and DR drills
- Part 6: Keep the cluster healthy through monitoring, automation, and Best Practices
Patroni is the de facto standard for PostgreSQL HA, but its strength comes not from the tool itself — it comes from the team that understands it and operates it correctly.
We hope what we've covered helps your database serve reliably and quietly through the night.
References
- Patroni Documentation
- Patroni GitHub — patroni/patroni
- Grafana Dashboard 18870 — PostgreSQL Patroni (Percona)
- Percona - Monitoring a PostgreSQL Patroni Cluster
- pgstef's blog — Patroni and pgBackRest Combined
- PGConf.EU 2025 — Patroni and pgBackRest: Better Together (Stefan Fercot)
- DEV Community — PostgreSQL HA: Patroni, Replication and Failover Patterns