Operating Patroni H/A Across Multiple Regions — Part 1: Fundamentals and Architecture Design Principles
A single-DC Patroni setup handles node failures well but falls apart against a full region outage. This post covers the starting point of multi-region HA design: the RPO/RTO/RTT trade-off, the difference between synchronous 3-DC auto-failover and async Standby Cluster 2-DC manual-failover patterns, and the DCS placement principles that tie it all together.
Series — Operating Patroni H/A Across Multiple Regions
- Part 1 — Fundamentals and Architecture Design Principles (this post)
- Part 2 — Synchronous Multi-DC Setup in Practice
- Part 3 — Asynchronous Replication + Standby Cluster Setup
- Part 4 — Split-Brain Prevention (STONITH, Watchdog, Quorum)
- Part 5 — Failover Runbook and DR Drills
- Part 6 — Monitoring, Operational Automation, and Best Practices
Table of Contents
- Why Multi-Region HA?
- Patroni Core Components Recap
- The Heart of Multi-Region Design: CAP Theory and Trade-offs
- Two Main Architecture Patterns
- DCS Placement Strategy
- Architecture Selection Criteria
- References
1. Why Multi-Region HA?
Running Patroni within a single data center is enough to handle common failures like individual node outages and OS crashes. But when an entire cloud region goes down — due to a natural disaster, a cut fiber cable, or an infrastructure-level incident — a single-region cluster has no answer.
Scenarios that require multi-region HA:
| Failure Type | Single-Region Response | Multi-Region Required? |
|---|---|---|
| Individual node failure | Automatic failover | Not needed |
| Availability Zone (AZ) failure | Handled with AZ-distributed layout | Conditional |
| Full region outage | Not possible | Required |
| Network partition (inter-region) | Not possible | Required |
| Regulatory requirements (mandatory DR site) | Not possible | Required |
In financial services, public sector, and e-commerce, a disaster recovery (DR) site is often a legal obligation. Patroni's official documentation includes explicit guidance on multi-DC deployments, which is the basis for the patterns in this series.
2. Patroni Core Components Recap
Before diving into multi-region architecture, here is a quick refresher on Patroni's key components.
- Leader Key management: The Primary node continuously renews a Leader Key in the DCS (TTL-based) as its liveness signal.
- Automatic failover: When Leader Key renewal stops, one of the Replicas wins a Leader Race and is promoted as the new Primary.
- Split-Brain prevention: etcd's Raft consensus algorithm combined with a Linux Watchdog prevents two nodes from simultaneously acting as Primary.
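The TTL-based leader key described above can be illustrated with a toy, single-process model. This is only a sketch: the class and method names are hypothetical, the 30-second TTL is an illustrative value, and a real DCS performs the same check as an atomic compare-and-set rather than plain Python assignments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderKey:
    """Toy in-memory stand-in for the leader key held in the DCS."""
    ttl: float                      # seconds the key stays valid after a renewal
    holder: Optional[str] = None    # node currently holding the key
    expires_at: float = 0.0         # absolute time at which the key lapses

    def try_acquire(self, node: str, now: float) -> bool:
        """Take or renew the key; fail if another node holds an unexpired key."""
        if self.holder not in (None, node) and now < self.expires_at:
            return False
        self.holder = node
        self.expires_at = now + self.ttl
        return True


key = LeaderKey(ttl=30.0)
assert key.try_acquire("pg-a", now=0.0)       # pg-a becomes leader
assert key.try_acquire("pg-a", now=10.0)      # renewal pushes expiry to t=40
assert not key.try_acquire("pg-b", now=35.0)  # key still valid, pg-b must wait
assert key.try_acquire("pg-b", now=41.0)      # renewals stopped, pg-b wins the race
```

The last two assertions show the failover window: a replica cannot take over until the TTL has lapsed, which is why the TTL sets a lower bound on automatic failover time.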
As of Patroni 4.1.2, supported DCS backends: etcd v3, Consul, ZooKeeper, Kubernetes (through the Kubernetes API, whose state lives in the cluster's own etcd).
3. The Heart of Multi-Region Design: CAP Theory and Trade-offs
The fundamental challenge of multi-region DB operations is the trade-off between network latency and data consistency. In a distributed system, Partition Tolerance is always a given — the real question is whether to prioritize Consistency or Availability on top of that.
Three questions you must answer before designing multi-region HA:
Q1. Do you need RPO (Recovery Point Objective) of zero?
- YES → Synchronous Replication is mandatory
- NO → Asynchronous Replication is acceptable
Q2. How short does your RTO (Recovery Time Objective) need to be?
- Within seconds → Automatic failover required (3 or more DCs)
- Minutes to tens of minutes acceptable → Manual failover acceptable (2 DCs possible)
Q3. Can you absorb inter-region network latency?
- With synchronous replication, the round-trip time (RTT) between regions adds directly to the latency of every commit that waits on a synchronous replica in another region.
- Example: Seoul–Tokyo RTT ≈ 30ms → synchronous replication adds at least 30ms to every such commit.
The answers to these three questions define your architecture.
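As a rough illustration, the three questions can be folded into one small helper. The function name and the returned keys are hypothetical; the mapping itself only restates the rules listed above.

```python
from typing import Dict, Union


def design_requirements(
    rpo_zero: bool, rto_within_seconds: bool, rtt_ms: float
) -> Dict[str, Union[str, int, float]]:
    """Translate the answers to Q1-Q3 into baseline design constraints."""
    return {
        # Q1: RPO = 0 forces synchronous replication.
        "replication": "synchronous" if rpo_zero else "asynchronous",
        # Q2: an RTO of seconds needs automatic failover, hence 3+ DCs.
        "failover": "automatic" if rto_within_seconds else "manual",
        "min_dcs": 3 if rto_within_seconds else 2,
        # Q3: synchronous commits pay at least one inter-region round trip.
        "added_write_latency_ms": rtt_ms if rpo_zero else 0.0,
    }


print(design_requirements(rpo_zero=True, rto_within_seconds=True, rtt_ms=30.0))
print(design_requirements(rpo_zero=False, rto_within_seconds=False, rtt_ms=30.0))
```

The first call corresponds to the synchronous 3-DC design and the second to the asynchronous 2-DC design.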
4. Two Main Architecture Patterns
Patroni's official documentation describes two primary patterns for multi-DC deployments.
Pattern A — Synchronous Replication (3+ DCs, Automatic Failover)
# patroni.yml — synchronous replication config
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false
Key characteristics:
- At least 3 DCs need one etcd node each to maintain Quorum.
- synchronous_mode: true causes the Primary to designate a Synchronous Replica.
- If one DC goes completely down, the remaining two DCs can auto-failover.
- Downside: Every write incurs added latency equal to inter-region RTT; operating 3 DCs increases cost significantly.
synchronous_mode_strict: false (the default) allows writes even when no synchronous replica is available, so RPO = 0 is not guaranteed during that window. Setting it to true blocks writes the moment no synchronous replica is available. Decide your availability policy before enabling this in production.
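To get a feel for the latency cost of Pattern A, a back-of-the-envelope estimate helps. The sketch below assumes the synchronous replica sits in another region, that each commit waits for exactly one round trip, and that the local commit takes 2 ms. All three are illustrative assumptions, not measurements.

```python
def sync_commit_latency_ms(local_commit_ms: float, rtt_ms: float) -> float:
    """Commit latency when the commit must wait for a remote synchronous replica."""
    return local_commit_ms + rtt_ms


def max_serial_commits_per_sec(local_commit_ms: float, rtt_ms: float) -> float:
    """Upper bound on back-to-back commits issued by a single session."""
    return 1000.0 / sync_commit_latency_ms(local_commit_ms, rtt_ms)


# Seoul-Tokyo style link: 2 ms local commit (assumed) + 30 ms RTT
print(sync_commit_latency_ms(2.0, 30.0))               # 32.0
print(f"{max_serial_commits_per_sec(2.0, 30.0):.2f}")  # 31.25
```

The ceiling applies per session. Concurrent sessions overlap their waits, so aggregate throughput degrades far less than single-session latency does.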
Pattern B — Asynchronous Replication + Standby Cluster (2 DCs, Manual Failover)
# patroni.yml — DC2 Standby Cluster config
bootstrap:
  dcs:
    standby_cluster:
      host: dc1-primary.example.com
      port: 5432
      primary_slot_name: dc2_standby
Key characteristics:
- Each DC operates its own independent etcd cluster.
- DC2 uses Patroni's standby_cluster feature to receive WAL from DC1.
- If DC1 fails, automatic failover is not possible; you must manually remove the standby_cluster config and promote DC2.
- Before promoting DC2, you must confirm DC1 has fully stopped. Skipping this step causes Split-Brain.
- Upside: No write latency impact from inter-region RTT; lower operational cost than Pattern A.
⚠️ Promoting DC2 while DC1 is still alive causes Split-Brain. Forcibly stopping DC1 with STONITH (Shoot The Other Node In The Head) is a mandatory prerequisite for promotion, not an optional step.
The replication slot named in primary_slot_name must be created on DC1's Primary beforehand. Without the slot, DC2's WAL streaming will not start.
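The promotion prerequisite can be written down as an explicit pre-flight check in a runbook script. The sketch below is hypothetical: the function name and both inputs are illustrative, and how the two facts are established (power-off confirmation, network probes) is left to the operator.

```python
from typing import List


def promotion_blockers(dc1_fenced: bool, dc1_reachable: bool) -> List[str]:
    """Return the reasons DC2 must not be promoted yet; an empty list means go."""
    blockers = []
    if dc1_reachable:
        blockers.append("DC1 still answers: stop or fence it first")
    if not dc1_fenced:
        # A network partition also makes DC1 look dead from DC2's side.
        blockers.append("DC1 fencing not confirmed: unreachable is not the same as stopped")
    return blockers


print(promotion_blockers(dc1_fenced=False, dc1_reachable=False))
print(promotion_blockers(dc1_fenced=True, dc1_reachable=False))
```

The first call models the dangerous case: DC1 looks dead from DC2, but nobody has confirmed it is actually stopped, so promotion stays blocked.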
5. DCS Placement Strategy
The placement of your DCS (etcd, Consul, etc.) is just as critical as your PostgreSQL node placement. A poorly designed DCS becomes a single point of failure (SPOF) that can bring down the entire cluster.
The Odd-Node Rule for etcd
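etcd uses the Raft consensus algorithm, which needs a majority (quorum) of members to agree on every change. Raft does not strictly require an odd member count, but odd sizes are the norm because adding one member to an odd-sized cluster raises the quorum without raising fault tolerance.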
| etcd Nodes | Quorum (majority) | Tolerable node failures |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
An even number of nodes does not improve Quorum stability. 4 nodes and 3 nodes tolerate the same number of failures (1), so adding a fourth node only increases cost.
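The table follows from two lines of arithmetic: a majority quorum is floor(n/2) + 1, and the cluster tolerates n minus quorum failures. A minimal sketch (function names are illustrative):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1


def tolerable_failures(n: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return n - quorum(n)


for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates={tolerable_failures(n)}")
```

Running it shows that 4 nodes tolerate 1 failure and 6 nodes tolerate 2, the same as 3 and 5 respectively.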
Region Placement Guide
Recommended: distribute 5 nodes across 3 regions
Region A: etcd × 2
Region B: etcd × 2
Region C: etcd × 1 (Tiebreaker / Witness)
Note: inter-region RTT must stay well below the etcd election timeout
Default heartbeat: 100ms, election timeout: 1000ms
In multi-region environments, adjust these values to match the actual RTT
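A placement is only useful if quorum survives the loss of any one region. The check below is a small sketch of that test; the function name and the region labels are illustrative.

```python
from typing import Dict


def survives_any_region_loss(placement: Dict[str, int]) -> bool:
    """True if a majority of etcd members remains after losing any single region."""
    total = sum(placement.values())
    majority = total // 2 + 1
    return all(total - count >= majority for count in placement.values())


assert survives_any_region_loss({"A": 2, "B": 2, "C": 1})      # recommended 2-2-1 layout
assert not survives_any_region_loss({"A": 3, "B": 2})          # two regions: losing A loses quorum
assert not survives_any_region_loss({"A": 3, "B": 1, "C": 1})  # one region holds the majority
```

The rule behind the three cases is that no single region may hold half or more of the members. This is also why two regions alone cannot support automatic failover, however the nodes are split.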
Tuning etcd Timeouts
# etcd config — example for Seoul-Tokyo-Singapore setup (assuming RTT ~30–80ms)
heartbeat-interval: 500 # ms, raised from default 100ms to account for RTT
election-timeout: 5000 # ms, maintain heartbeat-interval × 10 ratio
# patroni.yml — etcd3 connection config (TLS)
etcd3:
  hosts:
    - seoul-etcd-1:2379
    - tokyo-etcd-1:2379
    - singapore-etcd-1:2379
  protocol: https
  cacert: /etc/ssl/etcd/ca.crt
  cert: /etc/ssl/etcd/client.crt
  key: /etc/ssl/etcd/client.key
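Always measure the actual inter-region RTT before setting heartbeat-interval and election-timeout, and use the same values on every etcd member. etcd's own tuning guidance puts heartbeat-interval at roughly 0.5–1.5× the average RTT between members and election-timeout at 10× the RTT or more. The example above is deliberately looser to ride out inter-region jitter, and keeps election-timeout at 10× the heartbeat-interval. Values that are too low cause frequent leader re-elections triggered by nothing more than normal network jitter, while values that are too high delay detection of a genuinely failed etcd leader.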
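The sizing rule can be captured in a small helper so the numbers are derived from a measurement rather than typed by hand. This is a sketch under assumptions: the function name is hypothetical, the RTT multiplier is a required argument because the right value depends on observed jitter, and rounding up to 10 ms is purely cosmetic.

```python
import math
from typing import Tuple


def suggest_etcd_timeouts(
    rtt_ms: float, heartbeat_factor: float, election_ratio: int = 10
) -> Tuple[int, int]:
    """Return (heartbeat-interval, election-timeout) in ms for a measured RTT."""
    # Scale the measured RTT, then round up to the next 10 ms.
    heartbeat = int(math.ceil(rtt_ms * heartbeat_factor / 10.0) * 10)
    # Keep election-timeout at a fixed multiple of the heartbeat interval.
    return heartbeat, heartbeat * election_ratio


# Worst-case RTT of 80 ms with an assumed multiplier of 6
print(suggest_etcd_timeouts(80.0, heartbeat_factor=6.0))  # (480, 4800)
```

Feeding in the worst measured RTT across all region pairs, rather than the average, keeps the result on the safe side.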
6. Architecture Selection Criteria
| Criteria | Pattern A (Sync, 3 DCs) | Pattern B (Async, 2 DCs) |
|---|---|---|
| Minimum DCs | 3 | 2 |
| RPO | 0 while a synchronous replica is available | Seconds to minutes (depends on replication lag) |
| RTO | Seconds (auto failover) | Minutes to tens of minutes (manual) |
| Write performance impact | High (RTT added per write) | Low |
| Operational complexity | High | Moderate |
| Cost | High | Moderate |
| Primary risk | Write performance degradation from inter-region latency | Split-Brain (if manual failover procedure is mishandled) |
Regardless of which pattern you choose, two prerequisites are non-negotiable: every etcd cluster must have an odd number of members (spread across regions in Pattern A, one independent cluster per DC in Pattern B), and heartbeat-interval/election-timeout must be tuned after measuring the actual RTT between those members. No matter how perfectly you place your PostgreSQL nodes, a DCS that is misconfigured for network conditions becomes the weakest link in the chain.
References
- Patroni Official Documentation — Multi-Datacenter HA Configuration
- Patroni Official Documentation — Standby Cluster
- Percona — Patroni Architecture Guide
- Blue Crystal Solutions — PostgreSQL HA with Patroni, etcd & Barman (2026)
- Stormatics — Split-Brain Scenarios in HA PostgreSQL Clusters