Thursday, June 4, 2026

Operating Patroni H/A Across Multiple Regions — Part 1: Fundamentals and Architecture Design Principles

A single-DC Patroni setup handles node failures well but falls apart against a full region outage. This post covers the starting point of multi-region HA design: the RPO/RTO/RTT trade-off, the difference between synchronous 3-DC auto-failover and async Standby Cluster 2-DC manual-failover patterns, and the DCS placement principles that tie it all together.

Series — Operating Patroni H/A Across Multiple Regions

  • Part 1 — Fundamentals and Architecture Design Principles (this post)
  • Part 2 — Synchronous Multi-DC Setup in Practice
  • Part 3 — Asynchronous Replication + Standby Cluster Setup
  • Part 4 — Split-Brain Prevention (STONITH, Watchdog, Quorum)
  • Part 5 — Failover Runbook and DR Drills
  • Part 6 — Monitoring, Operational Automation, and Best Practices

Table of Contents

  1. Why Multi-Region HA?
  2. Patroni Core Components Recap
  3. The Heart of Multi-Region Design: CAP Theorem and Trade-offs
  4. Two Main Architecture Patterns
  5. DCS Placement Strategy
  6. Architecture Selection Criteria
  7. References

1. Why Multi-Region HA?

Running Patroni within a single data center is enough to handle common failures like individual node outages and OS crashes. But when an entire cloud region goes down — due to a natural disaster, a cut fiber cable, or an infrastructure-level incident — a single-region cluster has no answer.

Scenarios that require multi-region HA:

| Failure Type | Single-Region Response | Multi-Region Required? |
| --- | --- | --- |
| Individual node failure | Automatic failover | Not needed |
| Availability Zone (AZ) failure | Handled with AZ-distributed layout | Conditional |
| Full region outage | Not possible | Required |
| Network partition (inter-region) | Not possible | Required |
| Regulatory requirements (mandatory DR site) | Not possible | Required |

In financial services, the public sector, and e-commerce, a disaster recovery (DR) site is often a legal obligation. Patroni's official documentation includes a dedicated guide to multi-datacenter HA that addresses these deployments.


2. Patroni Core Components Recap

Before diving into multi-region architecture, here is a quick refresher on Patroni's key components.

  • Leader Key management: The Primary node continuously renews a Leader Key in the DCS (TTL-based) as its liveness signal.
  • Automatic failover: When Leader Key renewal stops, one of the Replicas wins a Leader Race and is promoted as the new Primary.
  • Split-Brain prevention: etcd's Raft consensus algorithm combined with a Linux Watchdog prevents two nodes from simultaneously acting as Primary.

As of Patroni 4.1.2, the supported DCS backends include etcd v3, Consul, ZooKeeper, and Kubernetes (through the Kubernetes API, which is itself backed by etcd).
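
The leader-key mechanism described above can be modelled with a toy lease. This is a minimal single-process sketch, not Patroni's implementation: a real DCS performs the acquire step as an atomic compare-and-set across nodes. The class names, the node names, and the 30-second TTL are all illustrative assumptions.

```python
import time


class FakeClock:
    """Manually advanced clock so the example runs instantly (illustrative helper)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class LeaderLease:
    """Toy in-memory stand-in for a DCS leader key with a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node):
        """Take the key if it is free or expired, or renew it if `node` already holds it."""
        now = self.clock()
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def current_leader(self):
        """Return the holder, or None once the TTL has lapsed."""
        return self.holder if self.clock() < self.expires_at else None


clock = FakeClock()
lease = LeaderLease(ttl_seconds=30, clock=clock)

took_key = lease.try_acquire("pg-seoul-1")       # primary takes the key at t=0
clock.advance(20)
early_race = lease.try_acquire("pg-tokyo-1")     # t=20: key still valid, replica loses
clock.advance(15)                                # t=35: primary stopped renewing
late_race = lease.try_acquire("pg-tokyo-1")      # TTL lapsed at t=30, replica wins

print(took_key, early_race, late_race, lease.current_leader())
```

The point of the model is that a replica can only win the race after the TTL has expired without a renewal, which is why the TTL sets a floor on how quickly a dead Primary is replaced.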


3. The Heart of Multi-Region Design: CAP Theorem and Trade-offs

The fundamental challenge of multi-region DB operations is the trade-off between network latency and data consistency. In a distributed system, Partition Tolerance is always a given — the real question is whether to prioritize Consistency or Availability on top of that.

Three questions you must answer before designing multi-region HA:

Q1. Do you need RPO (Recovery Point Objective) of zero?

  • YES → Synchronous Replication is mandatory
  • NO → Asynchronous Replication is acceptable

Q2. How short does your RTO (Recovery Time Objective) need to be?

  • Within seconds → Automatic failover required (3 or more DCs)
  • Minutes to tens of minutes acceptable → Manual failover acceptable (2 DCs possible)

Q3. Can you absorb inter-region network latency?

  • With synchronous replication, round-trip time (RTT) between regions adds directly to every write's latency.
  • Example: Seoul–Tokyo RTT ≈ 30ms → synchronous replication adds at least 30ms to every write.
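
The Seoul–Tokyo example can be turned into rough arithmetic. The sketch below assumes each write (more precisely, each commit that waits for the synchronous replica) pays one full RTT on top of the local commit time, and that a single connection issues commits back to back. The 2 ms local commit latency is an assumed figure for illustration, not a measurement.

```python
def sync_commit_latency_ms(local_commit_ms, rtt_ms):
    """Lower-bound commit latency when each commit waits for one remote acknowledgement."""
    return local_commit_ms + rtt_ms


def max_serial_commits_per_second(commit_latency_ms):
    """Upper bound on commits per second for one connection committing back to back."""
    return 1000.0 / commit_latency_ms


local_ms = 2.0   # assumed local commit latency
rtt_ms = 30.0    # Seoul-Tokyo RTT from the example above

latency = sync_commit_latency_ms(local_ms, rtt_ms)
print(latency)                                   # 32.0
print(max_serial_commits_per_second(local_ms))   # 500.0
print(max_serial_commits_per_second(latency))    # 31.25
```

Under these assumptions a single connection drops from about 500 to about 31 commits per second. Aggregate throughput suffers less when many connections commit concurrently, because their waits overlap.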

The answers to these three questions define your architecture.


4. Two Main Architecture Patterns

Patroni's official documentation describes two primary patterns for multi-DC deployments.

Pattern A — Synchronous Replication (3+ DCs, Automatic Failover)

# patroni.yml — synchronous replication config
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false

Key characteristics:

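  • Place at least one etcd node in each of 3 or more DCs so that losing any single DC still leaves a quorum.
  • synchronous_mode: true makes Patroni choose a synchronous replica and manage synchronous_standby_names on the Primary accordingly.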
  • If one DC goes completely down, the remaining two DCs can auto-failover.
  • Downside: Every write incurs added latency equal to inter-region RTT; operating 3 DCs increases cost significantly.

synchronous_mode_strict: false (the default) allows writes even when no synchronous replica is available. Setting it to true blocks writes the moment a synchronous replica disappears — decide your availability policy before enabling this in production.

Pattern B — Asynchronous Replication + Standby Cluster (2 DCs, Manual Failover)

# patroni.yml — DC2 Standby Cluster config
bootstrap:
  dcs:
    standby_cluster:
      host: dc1-primary.example.com
      port: 5432
      primary_slot_name: dc2_standby

Key characteristics:

  • Each DC operates its own independent etcd cluster.
  • DC2 uses Patroni's standby_cluster feature to receive WAL from DC1.
  • If DC1 fails, automatic failover is not possible — you must manually remove the standby_cluster config and promote DC2.
  • Before promoting DC2, you must confirm DC1 has fully stopped. Skipping this step causes Split-Brain.
  • Upside: No write latency impact from inter-region RTT; lower operational cost than Pattern A.

⚠️ Promoting DC2 while DC1 is still alive causes Split-Brain. You must perform STONITH (Shoot The Other Node In The Head) to forcibly stop DC1 before promoting. Shutting down DC1 first is not optional — it is a mandatory prerequisite.

The replication slot named in primary_slot_name must be created on DC1's Primary beforehand. Without the slot, DC2's WAL streaming will not start.


5. DCS Placement Strategy

The placement of your DCS (etcd, Consul, etc.) is just as critical as your PostgreSQL node placement. A poorly designed DCS becomes a single point of failure (SPOF) that can bring down the entire cluster.

The Odd-Node Rule for etcd

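etcd uses the Raft consensus algorithm, which commits a change only after a majority (quorum) of members agree. Raft does not strictly require an odd member count, but odd sizes are the standard recommendation because an additional even member raises the quorum size without raising fault tolerance.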

| etcd Nodes | Quorum (majority) | Tolerable node failures |
| --- | --- | --- |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |

An even number of nodes does not improve Quorum stability. 4 nodes and 3 nodes tolerate the same number of failures (1), so adding a fourth node only increases cost.
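
The numbers in the table follow from the majority rule, which can be checked with a few lines of arithmetic. The function names here are illustrative.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerable_failures(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(3, 8):
    print(n, quorum(n), tolerable_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
# 7 4 3
```

Going from 3 to 4 members (or from 5 to 6) raises the quorum but leaves the failure tolerance unchanged.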

Region Placement Guide

Recommended: distribute 5 nodes across 3 regions
  Region A: etcd × 2
  Region B: etcd × 2
  Region C: etcd × 1  (Tiebreaker / Witness)

Note: inter-region network latency must be shorter than the etcd heartbeat interval
  Defaults: heartbeat-interval 100ms, election-timeout 1000ms
  In multi-region environments, adjust these values to match the actual RTT
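
A placement can be checked against the loss of any single region with the same majority rule. This is a small sketch in which the region names and the helper function are illustrative. It compares the recommended 2+2+1 layout with a hypothetical two-region 3+2 layout.

```python
def survives_region_loss(placement):
    """For each region, report whether quorum survives if that region alone is lost."""
    total = sum(placement.values())
    needed = total // 2 + 1
    return {region: total - count >= needed for region, count in placement.items()}


recommended = {"A": 2, "B": 2, "C": 1}
two_region = {"A": 3, "B": 2}

print(survives_region_loss(recommended))  # {'A': True, 'B': True, 'C': True}
print(survives_region_loss(two_region))   # {'A': False, 'B': True}
```

With only two regions, whichever region holds the majority of etcd members becomes a single point of failure for the DCS, which is why a single stretched etcd cluster needs a third location.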

Tuning etcd Timeouts

# etcd config — example for Seoul-Tokyo-Singapore setup (assuming RTT ~30–80ms)
heartbeat-interval: 500   # ms, raised from default 100ms to account for RTT
election-timeout: 5000    # ms, maintain heartbeat-interval × 10 ratio

# patroni.yml — etcd3 connection config (TLS)
etcd3:
  hosts:
    - seoul-etcd-1:2379
    - tokyo-etcd-1:2379
    - singapore-etcd-1:2379
  protocol: https
  cacert: /etc/ssl/etcd/ca.crt
  cert: /etc/ssl/etcd/client.crt
  key: /etc/ssl/etcd/client.key

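Always measure the actual inter-region RTT before setting heartbeat-interval and election-timeout. A conservative starting point for multi-region clusters is heartbeat-interval at 5–10× the RTT and election-timeout at 10× the heartbeat-interval. For comparison, etcd's own tuning guide suggests a heartbeat interval of roughly 0.5–1.5× the RTT between members and an election timeout of at least 10× the RTT, so the larger figures trade slower detection of a failed etcd leader for more tolerance of jitter. Values that are too low cause frequent leader re-elections triggered by nothing more than normal network jitter.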
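
The starting-point rule above can be written as a small calculation. This sketch simply applies the multipliers named in the text to a measured RTT. The function name and its parameters are illustrative, and the output is a first guess to validate under real traffic, not a recommendation from etcd.

```python
def suggest_etcd_timeouts(rtt_ms, heartbeat_factor=5, election_factor=10):
    """Derive starting values (in ms) from a measured RTT using the multipliers in the text."""
    heartbeat = int(round(rtt_ms * heartbeat_factor))
    election = heartbeat * election_factor
    return {"heartbeat-interval": heartbeat, "election-timeout": election}


print(suggest_etcd_timeouts(80))
# {'heartbeat-interval': 400, 'election-timeout': 4000}
print(suggest_etcd_timeouts(30, heartbeat_factor=10))
# {'heartbeat-interval': 300, 'election-timeout': 3000}
```

Feeding in the worst measured RTT between any two etcd members, rather than the average, keeps the slowest link from triggering elections.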


6. Architecture Selection Criteria

| Criteria | Pattern A (Sync, 3 DCs) | Pattern B (Async, 2 DCs) |
| --- | --- | --- |
| Minimum DCs | 3 | 2 |
| RPO | 0 (no data loss) | Seconds to minutes (depends on network lag) |
| RTO | Seconds (auto failover) | Minutes to tens of minutes (manual) |
| Write performance impact | High (RTT added per write) | Low |
| Operational complexity | High | Moderate |
| Cost | High | Moderate |
| Primary risk | Write performance degradation from inter-region latency | Split-Brain (if manual failover procedure is mishandled) |
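
One way to read the comparison is as a simple decision rule. The sketch below is a deliberate simplification that considers only the RPO, RTO, and minimum-DC rows and ignores cost and write latency. The function and its parameters are hypothetical.

```python
def choose_pattern(needs_zero_rpo, needs_auto_failover, available_dcs):
    """Map the design answers to Pattern A or B; raise if the DC count cannot support it."""
    if needs_zero_rpo or needs_auto_failover:
        if available_dcs < 3:
            raise ValueError("Pattern A needs at least 3 DCs")
        return "A"
    if available_dcs < 2:
        raise ValueError("Pattern B needs at least 2 DCs")
    return "B"


print(choose_pattern(needs_zero_rpo=True, needs_auto_failover=True, available_dcs=3))    # A
print(choose_pattern(needs_zero_rpo=False, needs_auto_failover=False, available_dcs=2))  # B
```

A request for zero RPO with only two DCs raises an error here, which mirrors the constraint that automatic, lossless failover needs a third site for quorum.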

Regardless of which pattern you choose, two prerequisites are non-negotiable: every etcd cluster must have an odd number of members (spread across regions in Pattern A), and any etcd cluster that spans regions must have heartbeat-interval and election-timeout tuned after measuring the actual inter-region RTT. No matter how perfectly you place your PostgreSQL nodes, a DCS that is misconfigured for network conditions becomes the weakest link in the chain.


7. References

  • Patroni Official Documentation — Multi-Datacenter HA Configuration
  • Patroni Official Documentation — Standby Cluster
  • Percona — Patroni Architecture Guide
  • Blue Crystal Solutions — PostgreSQL HA with Patroni, etcd & Barman (2026)
  • Stormatics — Split-Brain Scenarios in HA PostgreSQL Clusters
