2026.05.31 · 14 min read · Lv.2 Beginner

Operating Patroni H/A Across Multiple Regions — Part 1: Fundamentals and Architecture Design Principles

A single-DC Patroni setup handles node failures well but falls apart against a full region outage. This post covers the starting point of multi-region HA design: the RPO/RTO/RTT trade-off, the difference between synchronous 3-DC auto-failover and async Standby Cluster 2-DC manual-failover patterns, and the DCS placement principles that tie it all together.

Series — Operating Patroni H/A Across Multiple Regions

Part 1 — Fundamentals and Architecture Design Principles (this post)

Part 2 — Synchronous Multi-DC Setup in Practice

Part 3 — Asynchronous Replication + Standby Cluster Setup

Part 4 — Split-Brain Prevention (STONITH, Watchdog, Quorum)

Part 5 — Failover Runbook and DR Drills

Part 6 — Monitoring, Operational Automation, and Best Practices

Why Multi-Region HA?
Patroni Core Components Recap
The Heart of Multi-Region Design: CAP Theory and Trade-offs
Two Main Architecture Patterns
DCS Placement Strategy
Architecture Selection Criteria
References

1. Why Multi-Region HA?

Running Patroni within a single data center is enough to handle common failures like individual node outages and OS crashes. But when an entire cloud region goes down — due to a natural disaster, a cut fiber cable, or an infrastructure-level incident — a single-region cluster has no answer.

Scenarios that require multi-region HA:

Failure Type	Single-Region Response	Multi-Region Required?
Individual node failure	Automatic failover	Not needed
Availability Zone (AZ) failure	Handled with AZ-distributed layout	Conditional
Full region outage	Not possible	Required
Network partition (inter-region)	Not possible	Required
Regulatory requirements (mandatory DR site)	Not possible	Required

In financial services, public sector, and e-commerce, a disaster recovery (DR) site is often a legal obligation. Patroni's official documentation includes guidance on multi-DC deployments that can help meet these requirements.

2. Patroni Core Components Recap

Before diving into multi-region architecture, here is a quick refresher on Patroni's key components.

Leader Key management: The Primary node continuously renews a Leader Key in the DCS (TTL-based) as its liveness signal.
Automatic failover: When Leader Key renewal stops, one of the Replicas wins a Leader Race and is promoted as the new Primary.
Split-Brain prevention: etcd's Raft consensus algorithm combined with a Linux Watchdog prevents two nodes from simultaneously acting as Primary.
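
The leader-key mechanism above can be illustrated with a toy model. This is a minimal sketch, not Patroni's implementation: `ToyLeaderKey` is a hypothetical in-memory stand-in for a DCS key, time is passed in explicitly as `now`, and the TTL of 30 is an illustrative value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLeaderKey:
    """In-memory stand-in for a DCS leader key with a TTL (not a real etcd client)."""
    ttl: float
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        # The key can be taken only if nobody holds it or the lease has expired.
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def renew(self, node: str, now: float) -> bool:
        # Only the current holder may extend the lease, and only before it expires.
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


key = ToyLeaderKey(ttl=30)
print(key.try_acquire("pg1", now=0))    # pg1 takes the free key
print(key.renew("pg1", now=10))         # renewal extends the lease to t=40
print(key.try_acquire("pg2", now=39))   # lease still valid, pg2 cannot take over
print(key.try_acquire("pg2", now=40))   # pg1 stopped renewing, pg2 wins the race
```

In a real DCS the acquire step is an atomic compare-and-set, which is what keeps two replicas from both winning the race.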

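As of Patroni 4.1.2, supported DCS backends include etcd v3, Consul, ZooKeeper, and Kubernetes (through the Kubernetes API, with cluster state stored in Endpoints or ConfigMap objects rather than by talking to etcd directly).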

3. The Heart of Multi-Region Design: CAP Theory and Trade-offs

The fundamental challenge of multi-region DB operations is the trade-off between network latency and data consistency. In a distributed system, Partition Tolerance is always a given — the real question is whether to prioritize Consistency or Availability on top of that.

Three questions you must answer before designing multi-region HA:

Q1. Do you need RPO (Recovery Point Objective) of zero?

YES → Synchronous Replication is mandatory
NO → Asynchronous Replication is acceptable

Q2. How short does your RTO (Recovery Time Objective) need to be?

Within seconds → Automatic failover required (3 or more DCs)
Minutes to tens of minutes acceptable → Manual failover acceptable (2 DCs possible)

Q3. Can you absorb inter-region network latency?

With synchronous replication, every commit waits for the synchronous replica's acknowledgment, so the round-trip time (RTT) between regions adds directly to commit latency.
Example: Seoul–Tokyo RTT ≈ 30ms → synchronous replication adds at least 30ms to every commit.

The answers to these three questions define your architecture.
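
As a rough illustration, the three questions can be folded into a small helper. This is a sketch under the article's own rules of thumb; the names `Design` and `derive_design` are hypothetical, and the latency figure is only the lower bound implied by the RTT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    replication: str                    # "synchronous" or "asynchronous"
    failover: str                       # "automatic" or "manual"
    min_dcs: int                        # minimum number of data centers
    min_added_commit_latency_ms: float  # lower bound added to each commit


def derive_design(rpo_zero: bool, rto_within_seconds: bool, rtt_ms: float) -> Design:
    # Q1: RPO = 0 requires synchronous replication.
    replication = "synchronous" if rpo_zero else "asynchronous"
    # Q2: an RTO of seconds requires automatic failover, which needs 3 or more DCs.
    failover = "automatic" if rto_within_seconds else "manual"
    min_dcs = 3 if rto_within_seconds else 2
    # Q3: synchronous commits pay at least one inter-region round trip.
    added = rtt_ms if rpo_zero else 0.0
    return Design(replication, failover, min_dcs, added)


print(derive_design(rpo_zero=True, rto_within_seconds=True, rtt_ms=30.0))
print(derive_design(rpo_zero=False, rto_within_seconds=False, rtt_ms=30.0))
```

The first call corresponds to the synchronous 3-DC pattern and the second to the asynchronous 2-DC pattern described in the next section.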

4. Two Main Architecture Patterns

Patroni's official documentation describes two primary patterns for multi-DC deployments.

Pattern A — Synchronous Replication (3+ DCs, Automatic Failover)

# patroni.yml — synchronous replication config
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false

Key characteristics:

At least 3 DCs need one etcd node each to maintain Quorum.
synchronous_mode: true causes the Primary to designate a Synchronous Replica.
If one DC goes completely down, the remaining two DCs can auto-failover.
Downside: Every write incurs added latency equal to inter-region RTT; operating 3 DCs increases cost significantly.

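synchronous_mode_strict: false (the default) allows writes even when no synchronous replica is available. During that window commits are acknowledged without being replicated first, so the RPO = 0 guarantee is temporarily weakened. Setting it to true blocks writes the moment a synchronous replica disappears, so decide your availability policy before enabling this in production.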
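
The interaction between the two settings can be written out as a truth table. The function below is a hypothetical model of the behavior described above, not Patroni code; `write_behavior` and its return strings are illustrative names.

```python
def write_behavior(synchronous_mode: bool, strict: bool, sync_replica_available: bool) -> str:
    """Model what happens to a write on the Primary under each combination."""
    if not synchronous_mode:
        return "accepted, asynchronous"
    if sync_replica_available:
        return "accepted, waits for synchronous replica"
    if strict:
        return "blocked until a synchronous replica returns"
    return "accepted, degraded to asynchronous"


# Print the four combinations that matter when synchronous_mode is enabled.
for strict in (False, True):
    for available in (True, False):
        outcome = write_behavior(True, strict, available)
        print(f"strict={strict!s:5} replica_available={available!s:5} -> {outcome}")
```

Only the last row (strict mode with no synchronous replica) trades availability for durability, which is why the policy decision belongs before production rollout.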

Pattern B — Asynchronous Replication + Standby Cluster (2 DCs, Manual Failover)

# patroni.yml — DC2 Standby Cluster config
bootstrap:
  dcs:
    standby_cluster:
      host: dc1-primary.example.com
      port: 5432
      primary_slot_name: dc2_standby

Key characteristics:

Each DC operates its own independent etcd cluster.
DC2 uses Patroni's standby_cluster feature to receive WAL from DC1.
If DC1 fails, automatic failover is not possible — you must manually remove the standby_cluster config and promote DC2.
Before promoting DC2, you must confirm DC1 has fully stopped. Skipping this step causes Split-Brain.
Upside: No write latency impact from inter-region RTT; lower operational cost than Pattern A.

⚠️ Promoting DC2 while DC1 is still alive causes Split-Brain. You must perform STONITH (Shoot The Other Node In The Head) to forcibly stop DC1 before promoting. Shutting down DC1 first is not optional — it is a mandatory prerequisite.

The replication slot named in primary_slot_name must be created on DC1's Primary beforehand, for example as a permanent slot in DC1's Patroni configuration; Patroni in DC2 does not create it. Without the slot, DC2's WAL streaming will not start.

5. DCS Placement Strategy

The placement of your DCS (etcd, Consul, etc.) is just as critical as your PostgreSQL node placement. A poorly designed DCS becomes a single point of failure (SPOF) that can bring down the entire cluster.

The Odd-Node Rule for etcd

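etcd uses the Raft consensus algorithm, which needs a majority (quorum) of members to agree on every change. Raft itself works with any cluster size, but an odd number of nodes gives the best fault tolerance per node.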

etcd Nodes	Quorum (majority)	Tolerable node failures
3	2	1
5	3	2
7	4	3

An even number of nodes does not improve Quorum stability. 4 nodes and 3 nodes tolerate the same number of failures (1), so adding a fourth node only increases cost.
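
The table values follow from simple majority arithmetic, sketched here with two hypothetical helper functions:

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return nodes // 2 + 1


def tolerable_failures(nodes: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return nodes - quorum(nodes)


# Compare odd and even cluster sizes side by side.
for n in range(3, 8):
    print(f"nodes={n} quorum={quorum(n)} tolerable_failures={tolerable_failures(n)}")
```

Running the loop shows that 4 nodes tolerate the same single failure as 3, and 6 nodes the same two failures as 5.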

Region Placement Guide

Recommended: distribute 5 nodes across 3 regions
  Region A: etcd × 2
  Region B: etcd × 2
  Region C: etcd × 1  (Tiebreaker / Witness)

Note: inter-region RTT must stay well below the etcd election timeout, and the heartbeat interval should not be shorter than the RTT can support
  Default heartbeat: 100ms, election timeout: 1000ms
  In multi-region environments, adjust these values to match the actual RTT
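
A quick way to sanity-check a placement like the one above is to remove each region in turn and see whether the survivors still form a majority. The sketch below assumes a hypothetical `survives_region_loss` helper and region names chosen for illustration.

```python
from typing import Dict


def quorum(total: int) -> int:
    """Smallest majority of a cluster with `total` members."""
    return total // 2 + 1


def survives_region_loss(layout: Dict[str, int]) -> Dict[str, bool]:
    """For each region, report whether etcd keeps quorum if that region is lost."""
    total = sum(layout.values())
    needed = quorum(total)
    return {region: total - count >= needed for region, count in layout.items()}


recommended = {"A": 2, "B": 2, "C": 1}   # the 2-2-1 layout from the guide above
two_regions = {"A": 3, "B": 2}           # same 5 nodes squeezed into 2 regions

print(survives_region_loss(recommended))
print(survives_region_loss(two_regions))
```

With 2-2-1, losing any single region leaves at least 3 of 5 members. With 3-2, losing region A leaves only 2 members and the cluster loses quorum, which is why a third region is needed even if it hosts only a tiebreaker.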

Tuning etcd Timeouts

# etcd config: example for Seoul-Tokyo-Singapore setup (assuming RTT ~30–80ms)
heartbeat-interval: 500   # ms, raised from default 100ms; a conservative value relative to the RTT
election-timeout: 5000    # ms, 10× heartbeat-interval (etcd requires at least 5×)

# patroni.yml — etcd3 connection config (TLS)
etcd3:
  hosts:
    - seoul-etcd-1:2379
    - tokyo-etcd-1:2379
    - singapore-etcd-1:2379
  protocol: https
  cacert: /etc/ssl/etcd/ca.crt
  cert: /etc/ssl/etcd/client.crt
  key: /etc/ssl/etcd/client.key

Always measure the actual inter-region RTT before setting heartbeat-interval and election-timeout. The etcd tuning guide suggests a heartbeat-interval of roughly 0.5–1.5× the maximum average RTT between members and an election-timeout of at least 10× that RTT. Values that are too low cause frequent leader re-elections triggered by nothing more than normal network jitter; values far above this range reduce spurious elections but delay detection of a failed etcd leader.
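
Before rolling out new timeout values, a small check against the measured RTT can catch obvious mistakes. The sketch below is a hypothetical validator, not an etcd tool; the thresholds are assumptions drawn from the etcd tuning guidance and etcd's startup validation as I understand them (heartbeat at least 0.5× RTT, election at least 10× RTT and 5× heartbeat, election at most 50000 ms), so verify them against your etcd version.

```python
from typing import List

ETCD_MAX_ELECTION_MS = 50_000  # assumed upper bound on election-timeout


def check_etcd_timeouts(heartbeat_ms: float, election_ms: float, max_rtt_ms: float) -> List[str]:
    """Return a list of problems found; an empty list means the values look sane."""
    problems = []
    if heartbeat_ms < 0.5 * max_rtt_ms:
        problems.append("heartbeat-interval is below 0.5x RTT")
    if election_ms < 10 * max_rtt_ms:
        problems.append("election-timeout is below 10x RTT")
    if election_ms < 5 * heartbeat_ms:
        problems.append("election-timeout is below 5x heartbeat-interval")
    if election_ms > ETCD_MAX_ELECTION_MS:
        problems.append("election-timeout exceeds 50000 ms")
    return problems


# The example values from this section, checked against an 80 ms worst-case RTT.
print(check_etcd_timeouts(heartbeat_ms=500, election_ms=5000, max_rtt_ms=80))
# etcd defaults checked against a hypothetical 150 ms RTT.
print(check_etcd_timeouts(heartbeat_ms=100, election_ms=1000, max_rtt_ms=150))
```

The check only enforces lower and upper bounds; it does not say whether a passing configuration is optimal, so failure-detection time still needs to be weighed separately.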

6. Architecture Selection Criteria

Criteria	Pattern A (Sync, 3 DCs)	Pattern B (Async, 2 DCs)
Minimum DCs	3	2
RPO	0 (no data loss while a synchronous replica is available)	Seconds to minutes (depends on replication lag)
RTO	Seconds (auto failover)	Minutes to tens of minutes (manual)
Write performance impact	High (RTT added per write)	Low
Operational complexity	High	Moderate
Cost	High	Moderate
Primary risk	Write performance degradation from inter-region latency	Split-Brain (if manual failover procedure is mishandled)

Regardless of which pattern you choose, two prerequisites are non-negotiable: every etcd cluster must have an odd number of members (spread across regions in Pattern A, local to each DC in Pattern B), and heartbeat-interval/election-timeout must be tuned after measuring the actual RTT between etcd members. No matter how perfectly you place your PostgreSQL nodes, a DCS that is misconfigured for network conditions becomes the weakest link in the chain.

References

Patroni Official Documentation — Multi-Datacenter HA Configuration
Patroni Official Documentation — Standby Cluster
Percona — Patroni Architecture Guide
Blue Crystal Solutions — PostgreSQL HA with Patroni, etcd & Barman (2026)
Stormatics — Split-Brain Scenarios in HA PostgreSQL Clusters