Lv.3 Intermediate · PostgreSQL

2026.06.01 · 20 min read

Running Patroni H/A Across Multiple Regions, Part 3: Asynchronous Replication + Standby Cluster Hands-On

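This part implements Pattern B from Part 1. The DC1 Primary Cluster and the DC2 Standby Cluster are placed on independent etcd clusters, and asynchronous replication is configured over WAL streaming. It then covers the manual promotion procedure when DC1 fails, how to verify STONITH, how to use promote-cluster/demote-cluster from Patroni 4.1, and the switch back after DC1 recovers. Together these form the skeleton of a DR runbook.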

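Series outline: Running Patroni H/A Across Multiple Regions

Part 1: Core concepts and architecture design principles

Part 2: Synchronous replication in a multi-DC setup, hands-on

Part 3: Asynchronous replication + Standby Cluster, hands-on (this part)

Part 4: Split-Brain prevention strategies (STONITH, Watchdog, Quorum)

Part 5: Incident response runbook and DR drill scenarios

Part 6: Monitoring, operational automation, and best practices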

What is a Standby Cluster?
Lab environment and configuration overview
Step 1: Prepare a Replication Slot on DC1
Step 2: Set up an independent etcd cluster in DC2
Step 3: Write patroni.yml for the DC2 Standby Cluster
Step 4: Start the Standby Cluster and verify replication
Step 5: Manual Standby promotion when DC1 fails
Step 6: Switch DC1 back in as a Standby Cluster after recovery
New commands in Patroni 4.1: promote-cluster / demote-cluster
Common troubleshooting
References

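1. What is a Standby Cluster?

Patroni's Standby Cluster is a feature for running cascading replication into a remote data center. Unlike an ordinary Replica node, a Standby Cluster has its own HA structure inside DC2 (independent etcd + Patroni leader election), while all of its data is kept in sync by WAL coming from the DC1 Primary.

Inside DC2 there is a node with a special role, the Standby Leader. Within DC2 it behaves like a normal Leader (holding the DCS lock, managing Cascade Replicas), but it receives its data from the DC1 Primary through streaming replication.

Why choose a Standby Cluster

Environments that can accept a non-zero RPO: when inter-region network latency is high, or the write-performance penalty of synchronous replication cannot be tolerated
Cost reduction: an asynchronous 2-DC setup keeps infrastructure cost lower than synchronous replication across 3 DCs
Regional read service: use the DC2 Standby Leader as a read-only endpoint to reduce latency
DR test isolation: failover can be simulated independently in DC2 without affecting DC1
Cloud migration: usable as a zero-downtime migration tool for on-premises → cloud moves or region relocation

⚠️ A Standby Cluster and a Primary Cluster must never share the same DCS scope. Always use an independent etcd cluster or a different namespace.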

2. Lab environment and configuration overview

Node list

Node	Region	IP (example)	Role
pg-seoul-1	ap-northeast-2 (Seoul)	10.1.0.10	PostgreSQL Primary + etcd
pg-seoul-2	ap-northeast-2 (Seoul)	10.1.0.11	PostgreSQL Replica + etcd
pg-seoul-3	ap-northeast-2 (Seoul, separate AZ)	10.1.0.12	PostgreSQL Replica + etcd
pg-busan-1	on-premise (Busan DR)	10.2.0.10	Standby Leader + etcd
pg-busan-2	on-premise (Busan DR)	10.2.0.11	Cascade Replica + etcd
pg-busan-3	on-premise (Busan DR)	10.2.0.12	Cascade Replica + etcd

Each DC runs its own independent 3-node etcd cluster. The two etcd clusters do not communicate with each other; replication happens only through PostgreSQL WAL streaming.
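To see why each DC gets its own 3-node etcd cluster rather than one cluster stretched across both sites, it helps to work out the majority arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes the usual Raft majority rule that etcd follows.

```python
def etcd_quorum(members: int) -> int:
    """Votes needed for a majority in a cluster with `members` voters."""
    if members < 1:
        raise ValueError("members must be >= 1")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority still remains."""
    return members - etcd_quorum(members)


def has_quorum(members: int, alive: int) -> bool:
    """True if `alive` members are enough to form a majority."""
    return alive >= etcd_quorum(members)


if __name__ == "__main__":
    # One 3-node cluster per DC, as in the node list above.
    print("3 nodes: quorum", etcd_quorum(3), "tolerates", failure_tolerance(3))
    # Hypothetical alternative: 6 nodes split 3/3 across two DCs.
    print("6 nodes, one DC lost, quorum kept:", has_quorum(6, 3))
```

With these numbers, a 3-node cluster inside one DC survives the loss of one member, while a hypothetical 6-node cluster split 3/3 across two DCs would lose quorum as soon as either DC disappears.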

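Overall architecture

[Figure not recovered: the DC1 (Seoul) Primary Cluster with its own etcd streams WAL asynchronously to the DC2 (Busan) Standby Leader, which feeds two Cascade Replicas and uses a separate etcd.]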

3. Step 1: Prepare a Replication Slot on DC1

For the Standby Cluster to connect to the DC1 Primary and receive WAL, DC1 needs a Replication Slot so that WAL is not removed before DC2 has consumed it. With Patroni's Permanent Replication Slot feature, the slot is registered in the DCS and is maintained automatically across failovers.

Register the Permanent Slot on DC1

# Add a Permanent Slot to the dynamic configuration of the DC1 Primary Cluster
patronictl -c /etc/patroni/patroni.yml edit-config

When the editor opens, add the slots section below.

# DC1 Dynamic Configuration (stored in the DCS)
slots:
  standby_cluster_busan:
    type: physical
    cluster_type: primary    # create only on a primary cluster

After saving, confirm that the slot was created:

# Run on the DC1 Primary
psql -U postgres -c "
  SELECT slot_name, slot_type, active, restart_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'standby_cluster_busan';
"

# Expected output:
#       slot_name        | slot_type | active | restart_lsn
# -----------------------+-----------+--------+-------------
#  standby_cluster_busan | physical  | f      | 0/3000000

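While the slot is still inactive (active: f), DC2 has not connected yet. After DC2 starts, confirm that it changes to active: t.

DC1 pg_hba: allow replication from DC2 nodes

Update the pg_hba settings on DC1 so that every DC2 node can open replication connections.

# DC1 patroni.yml: add pg_hba entries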
postgresql:
  pg_hba:
    - host replication replicator 10.2.0.10/32 md5   # pg-busan-1
    - host replication replicator 10.2.0.11/32 md5   # pg-busan-2
    - host replication replicator 10.2.0.12/32 md5   # pg-busan-3

# Apply the pg_hba change (reload, no restart)
patronictl -c /etc/patroni/patroni.yml reload pg-seoul-cluster

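4. Step 2: Set up an independent etcd cluster in DC2

The DC2 etcd cluster operates completely independently of the DC1 etcd. The procedure is the same as Step 2 of Part 2; only the IP range and node names change to match DC2.

# /etc/etcd/etcd.conf.yml (Busan node 1)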
name: etcd-busan-1
data-dir: /var/lib/etcd/data

listen-client-urls: https://10.2.0.10:2379,https://127.0.0.1:2379
advertise-client-urls: https://10.2.0.10:2379

listen-peer-urls: https://10.2.0.10:2380
initial-advertise-peer-urls: https://10.2.0.10:2380

# Only the three nodes inside DC2 (unrelated to the DC1 etcd)
# Kept on one line: a folded scalar (>) would insert spaces after the commas
initial-cluster: etcd-busan-1=https://10.2.0.10:2380,etcd-busan-2=https://10.2.0.11:2380,etcd-busan-3=https://10.2.0.12:2380
initial-cluster-state: new
initial-cluster-token: pg-busan-standby-cluster-v1  # different token from DC1

client-transport-security:
  cert-file: /etc/etcd/ssl/etcd-busan-1.pem
  key-file: /etc/etcd/ssl/etcd-busan-1-key.pem
  trusted-ca-file: /etc/etcd/ssl/ca.pem
  client-cert-auth: true

peer-transport-security:
  cert-file: /etc/etcd/ssl/etcd-busan-1.pem
  key-file: /etc/etcd/ssl/etcd-busan-1-key.pem
  trusted-ca-file: /etc/etcd/ssl/ca.pem
  peer-client-cert-auth: true

# Intra-DC traffic, so the default timeouts are sufficient
heartbeat-interval: 100
election-timeout: 1000

5. Step 3: Write patroni.yml for the DC2 Standby Cluster

The core of the Standby Cluster configuration is the bootstrap.dcs.standby_cluster section, which specifies the DC1 connection details and the Replication Slot name. This section is applied only at the initial bootstrap; later changes must go through the DCS (patronictl edit-config).

# /etc/patroni/patroni.yml (pg-busan-1, Standby Leader candidate)

scope: pg-busan-standby        # must differ from the DC1 scope name
namespace: /db/
name: pg-busan-1               # must differ from every DC1 node name

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.2.0.10:8008
  certfile: /etc/patroni/ssl/patroni.pem
  keyfile: /etc/patroni/ssl/patroni-key.pem
  cafile: /etc/patroni/ssl/ca.pem

# Connect to the independent DC2 etcd
etcd3:
  hosts:
    - 10.2.0.10:2379
    - 10.2.0.11:2379
    - 10.2.0.12:2379
  protocol: https
  cacert: /etc/etcd/ssl/ca.pem
  cert: /etc/etcd/ssl/etcd-busan-1.pem
  key: /etc/etcd/ssl/etcd-busan-1-key.pem

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 104857600   # 100 MiB: allowance for inter-region lag

    # Key part: Standby Cluster settings
    standby_cluster:
      # List every node of the DC1 Primary Cluster (keeps the connection after a DC1 failover)
      host: 10.1.0.10,10.1.0.11,10.1.0.12
      port: 5432
      # Must exactly match the Permanent Slot name created on DC1
      primary_slot_name: standby_cluster_busan
      create_replica_methods:
        - basebackup

    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"

  pg_hba:
    - local   all             all                         trust
    - host    all             all         127.0.0.1/32    md5
    - host    replication     replicator  10.2.0.0/24     md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.2.0.10:5432
  data_dir: /var/lib/postgresql/17/main
  bin_dir: /usr/lib/postgresql/17/bin
  config_dir: /etc/postgresql/17/main

  authentication:
    replication:
      username: replicator
      password: "SecureRepPass123!"
    superuser:
      username: postgres
      password: "SecureSuperPass123!"
    rewind:
      username: rewind_user
      password: "SecureRewindPass123!"

tags:
  nofailover: false
  noloadbalance: false
  dc: busan

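Node naming rule: none of the DC2 node names (pg-busan-1, pg-busan-2, pg-busan-3) may overlap with any DC1 member name. If names overlap, DC1 can match the wrong application_name when deciding which node is its synchronous standby. This is a silent failure that carries a risk of data loss.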
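The naming rule above, together with the earlier rule that the two clusters must not share a scope, is easy to check mechanically before the first start. The following is a minimal sketch, not part of Patroni: the function name and the idea of feeding it names collected from both patroni.yml sets are assumptions made for this example.

```python
def check_cluster_identity(primary_scope, primary_members,
                           standby_scope, standby_members):
    """Return a list of human-readable problems; empty means no conflict found."""
    problems = []
    # The article requires the standby scope to differ from the primary scope.
    if primary_scope == standby_scope:
        problems.append(f"scope '{standby_scope}' is used by both clusters")
    # Member names double as application_name, so they must be globally unique.
    for name in sorted(set(primary_members) & set(standby_members)):
        problems.append(f"member name '{name}' exists in both clusters")
    return problems


if __name__ == "__main__":
    issues = check_cluster_identity(
        "pg-seoul-cluster", ["pg-seoul-1", "pg-seoul-2", "pg-seoul-3"],
        "pg-busan-standby", ["pg-busan-1", "pg-busan-2", "pg-busan-3"],
    )
    print(issues or "no naming conflicts")
```

A check like this could run in a deployment pipeline, failing the rollout whenever the returned list is non-empty.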

6. Step 4: Start the Standby Cluster and verify replication

Starting the Standby Cluster

# DC2: start pg-busan-1 first (it bootstraps as the Standby Leader)
systemctl enable --now patroni

# Monitor the bootstrap process
journalctl -fu patroni

# Expected log flow:
# INFO: trying to bootstrap a standby leader
# INFO: trying to use basebackup from 10.1.0.10:5432
# INFO: replica has been created using basebackup
# INFO: bootstrapped as a standby leader

# After confirming that pg-busan-1 is up as the Standby Leader, start the remaining nodes
ssh pg-busan-2 "systemctl enable --now patroni"
ssh pg-busan-3 "systemctl enable --now patroni"

Verifying replication status

# Check the Standby Cluster status from DC2
patronictl -c /etc/patroni/patroni.yml topology

# Expected output:
# + Cluster: pg-busan-standby (8901234567890123456) +----------------+-----------+
# | Member      | Host            | Role           | State   | TL | Lag in MB |
# +-------------+-----------------+----------------+---------+----+-----------+
# | pg-busan-1  | 10.2.0.10:5432  | Standby Leader | running |  3 |       0.0 |
# | pg-busan-2  | 10.2.0.11:5432  | Replica        | running |  3 |       0.0 |
# | pg-busan-3  | 10.2.0.12:5432  | Replica        | running |  3 |       0.0 |
# +-------------+-----------------+----------------+---------+----+-----------+

# Check the replication connection on the DC1 Primary
psql -U postgres -c "
  SELECT application_name, client_addr, state, sync_state,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
"
# pg-busan-1: confirm sync_state = async

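Monitoring replication lag

# Check replay delay on the DC2 Standby Leader
# (this value also grows while the primary is idle and writes nothing)
psql -U postgres -c "
  SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
"

# Compare the WAL receive position with the replay position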
psql -U postgres -c "
  SELECT pg_is_in_recovery(),
         pg_last_wal_receive_lsn(),
         pg_last_wal_replay_lsn(),
         pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;
"

# Confirm on DC1 that the Replication Slot is active
psql -U postgres -h 10.1.0.10 -c "
  SELECT slot_name, active, restart_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'standby_cluster_busan';
"
# active must be t
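When the LSN values from the queries above are collected by a monitoring script instead of being subtracted inside PostgreSQL, the gap has to be computed by hand. The sketch below shows one way to do that, assuming the standard textual LSN form of two hexadecimal halves separated by a slash. The helper names are hypothetical, and the threshold simply reuses the 104857600 value from this article's patroni.yml for illustration.

```python
MAX_LAG_BYTES = 104857600  # value of maximum_lag_on_failover in the example config


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/3000000' to an absolute byte position."""
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(ahead_lsn: str, behind_lsn: str) -> int:
    """Bytes of WAL between two positions (positive when ahead_lsn is newer)."""
    return lsn_to_int(ahead_lsn) - lsn_to_int(behind_lsn)


def within_limit(ahead_lsn: str, behind_lsn: str,
                 limit: int = MAX_LAG_BYTES) -> bool:
    """True if the gap between the two positions does not exceed limit."""
    return lag_bytes(ahead_lsn, behind_lsn) <= limit


if __name__ == "__main__":
    received, replayed = "0/9000000", "0/3000000"
    print("lag bytes:", lag_bytes(received, replayed))
    print("within limit:", within_limit(received, replayed))
```

The same helpers work for comparing the DC1 Primary's current LSN with the position reported by the DC2 Standby Leader, which is the number worth recording before any promotion.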

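7. Step 5: Manual Standby promotion when DC1 fails

In an asynchronous, two-DC setup, DC2 cannot independently determine the state of DC1, so automatic failover between the DCs is not possible. An operator must confirm that DC1 is completely down and then promote DC2 manually.

Mandatory pre-promotion checklist

[ ] 1. Are all PostgreSQL nodes in DC1 completely stopped?
[ ] 2. Has DC1 Patroni released the Leader Lock? (confirm etcd TTL expiry)
[ ] 3. Are all application connections to DC1 blocked?
[ ] 4. Has the current replication lag of the DC2 Standby Leader been recorded?
[ ] 5. Has the possibility that DC1 is partially alive been ruled out? (network partition vs. full outage)

⚠️ Promoting DC2 while DC1 is still alive causes Split-Brain, which can lead to conflicting data and permanent data loss. Promote only after confirming that DC1 is completely stopped or after performing STONITH.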
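The checklist above can be turned into a simple gate so that a runbook script refuses to continue while any item is unconfirmed. This is only a sketch of that idea: the key names and the function names are invented for this example, and the answers are expected to come from an operator or from separate checks, since no script running in DC2 can prove on its own that DC1 is down.

```python
# Hypothetical keys, one per checklist item in the article.
CHECKLIST = (
    ("dc1_postgres_stopped", "All DC1 PostgreSQL nodes are completely stopped"),
    ("dc1_leader_lock_released", "DC1 Leader Lock has expired in etcd"),
    ("dc1_app_traffic_blocked", "Application connections to DC1 are blocked"),
    ("dc2_lag_recorded", "Current DC2 replication lag has been recorded"),
    ("dc1_partial_survival_ruled_out", "Partial DC1 survival has been ruled out"),
)


def promotion_blockers(answers: dict) -> list:
    """Return descriptions of every item that is not explicitly True."""
    return [text for key, text in CHECKLIST if answers.get(key) is not True]


def may_promote(answers: dict) -> bool:
    """Promotion is allowed only when no blocker remains."""
    return not promotion_blockers(answers)


if __name__ == "__main__":
    answers = {key: True for key, _ in CHECKLIST}
    answers["dc1_partial_survival_ruled_out"] = False
    for blocker in promotion_blockers(answers):
        print("BLOCKED:", blocker)
    print("may promote:", may_promote(answers))
```

Treating a missing or non-boolean answer as a blocker is a deliberate choice here: an unanswered item should stop the promotion rather than be assumed safe.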

Method A: patronictl promote-cluster (Patroni 4.1+, recommended)

The promote-cluster command introduced in Patroni 4.1 removes the standby_cluster section and verifies the result in a single command.

# Run on a DC2 node
# After STONITH, once DC1 is confirmed to be completely stopped:
patronictl -c /etc/patroni/patroni.yml promote-cluster pg-busan-standby

# Expected output:
# + Cluster: pg-busan-standby (8901234567890123456) +---------+-----------+
# | Member      | Host            | Role    | State   | TL | Lag in MB |
# +-------------+-----------------+---------+---------+----+-----------+
# | pg-busan-1  | 10.2.0.10:5432  | Leader  | running |  4 |           |
# | pg-busan-2  | 10.2.0.11:5432  | Replica | running |  4 |       0.0 |
# | pg-busan-3  | 10.2.0.12:5432  | Replica | running |  4 |       0.0 |
# +-------------+-----------------+---------+---------+----+-----------+
# Success: cluster has been promoted

Method B: patronictl edit-config (compatible with older versions)

# Remove the standby_cluster section by setting it to null
patronictl -c /etc/patroni/patroni.yml edit-config \
  --set standby_cluster=null \
  --force

# Confirm the promotion
patronictl -c /etc/patroni/patroni.yml list

# Check directly that PostgreSQL was actually promoted to Primary
psql -U postgres -h 10.2.0.10 -c "SELECT pg_is_in_recovery();"
# result: f (false), promoted to Primary
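The same confirmation can be scripted against the member status that Patroni exposes over its REST API. The sketch below only interprets an already-fetched status document. It assumes the document carries role, state and timeline fields, as the JSON returned by GET /patroni on port 8008 normally does, and that a promoted leader reports the role primary (master on older releases). The function name is hypothetical, and fetching the JSON over HTTPS is left out.

```python
PRIMARY_ROLES = ("primary", "master")  # "master" is assumed for pre-4.0 releases


def promotion_complete(status: dict, timeline_before: int) -> bool:
    """True if the node reports itself as a running primary on a newer timeline."""
    try:
        timeline_now = int(status.get("timeline", 0))
    except (TypeError, ValueError):
        return False
    return (
        status.get("role") in PRIMARY_ROLES
        and status.get("state") == "running"
        and timeline_now > timeline_before
    )


if __name__ == "__main__":
    # Example documents shaped like the assumed REST output.
    before = {"role": "standby_leader", "state": "running", "timeline": 3}
    after = {"role": "primary", "state": "running", "timeline": 4}
    print(promotion_complete(before, timeline_before=3))
    print(promotion_complete(after, timeline_before=3))
```

Checking the timeline as well as the role mirrors the expected output shown earlier, where TL moves from 3 to 4 once the Standby Leader has been promoted.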

After promotion: switch application connections

# Point HAProxy or DNS at the DC2 endpoint
systemctl reload haproxy

# Verify the connection
psql -h haproxy-busan -p 5000 -U appuser -c \
  "SELECT inet_server_addr(), pg_is_in_recovery();"
# result: 10.2.0.1x | f, connected to DC2 Primary

8. Step 6: Switch DC1 back in as a Standby Cluster after recovery

Once the DC1 infrastructure is restored, reconfigure DC1 as a new Standby Cluster so that it receives WAL from DC2 (the current Primary).

Register a Permanent Slot on DC2

# With DC2 acting as Primary, register a Replication Slot for DC1
patronictl -c /etc/patroni/patroni.yml edit-config

# Add to the DC2 Dynamic Configuration
slots:
  standby_cluster_seoul:
    type: physical
    cluster_type: primary

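Reconfigure DC1 as a Standby Cluster

DC1 must now replicate from DC2 (the current Primary), so DC1 needs the Standby Cluster settings. Because bootstrap.dcs is applied only at the initial bootstrap, editing patroni.yml takes effect only if the old DC1 cluster state is first removed from the DC1 DCS (for example with patronictl remove). Otherwise, apply the same standby_cluster section through patronictl edit-config.

# DC1 patroni.yml: add the standby_cluster section (read only when the cluster is bootstrapped anew)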
bootstrap:
  dcs:
    standby_cluster:
      host: 10.2.0.10,10.2.0.11,10.2.0.12
      port: 5432
      primary_slot_name: standby_cluster_seoul
      create_replica_methods:
        - basebackup

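# Reinitialize the DC1 data directory (basebackup from DC2)
# The old data may have diverged, so a full reinitialization is recommended
rm -rf /var/lib/postgresql/17/main/*

# Start Patroni: it takes a basebackup from DC2 automatically
systemctl start patroni

# Check replication status
patronictl -c /etc/patroni/patroni.yml topology

Using the demote-cluster command (Patroni 4.1+)

In Patroni 4.1, the demote-cluster command converts an existing Primary Cluster into a Standby Cluster. Run it against the DC1 scope (pg-seoul-cluster) after DC1 has recovered.

# Convert DC1 to a standby using the DC1 patronictl configuration
# (confirm the option names for your release with: patronictl demote-cluster --help)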
patronictl -c /etc/patroni/patroni-seoul.yml demote-cluster pg-seoul-cluster \
  --standby-config host=10.2.0.10,10.2.0.11,10.2.0.12 \
  --standby-config port=5432 \
  --standby-config primary_slot_name=standby_cluster_seoul

9. New commands in Patroni 4.1: promote-cluster / demote-cluster

Patroni 4.1 added dedicated commands for operating a Standby Cluster. They are safer and state their intent more clearly than hand-editing standby_cluster=null through edit-config.

Command	Role	Main actions
`patronictl promote-cluster`	Standby → Primary promotion	removes the standby_cluster section + verifies the result
`patronictl demote-cluster`	Primary → Standby conversion	inserts the standby_cluster section + ensures demotion

# promote-cluster example
patronictl -c /etc/patroni/patroni.yml promote-cluster pg-busan-standby

# demote-cluster example (DC1 back to standby)
patronictl -c /etc/patroni/patroni-seoul.yml demote-cluster pg-seoul-cluster \
  --standby-config host=10.2.0.10,10.2.0.11,10.2.0.12 \
  --standby-config port=5432 \
  --standby-config primary_slot_name=standby_cluster_seoul

Both commands require Patroni 4.1 or later. The logic that prevents intermediate states during a cluster switch (such as re-promotion while a demotion is in progress) has been strengthened, so use the latest version where possible.

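10. Common troubleshooting

Problem 1: The Standby Leader cannot connect to DC1

WARNING: master_start_timeout: Failed to connect to 10.1.0.10:5432
ERROR: standby_cluster: no primary found

# Test connectivity and credentials from DC2 to DC1.
# This is a regular SQL session, so DC1 pg_hba also needs a rule for user
# replicator on database postgres; a replication-only rule does not match it.
psql "host=10.1.0.10,10.1.0.11,10.1.0.12 \
      port=5432 \
      dbname=postgres \
      user=replicator \
      password=SecureRepPass123! \
      target_session_attrs=read-write \
      sslmode=require" \
  -c "SELECT pg_is_in_recovery(), inet_server_addr();"

# Re-check that pg_hba.conf allows the DC2 IPs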
psql -U postgres -h 10.1.0.10 -c "SELECT * FROM pg_hba_file_rules();"
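When the multi-host connection test above is generated by tooling rather than typed by hand, the connection string has to follow libpq's keyword/value quoting rules. The sketch below builds such a string. The function names are hypothetical; the quoting follows the documented libpq rule that empty values and values containing whitespace must be wrapped in single quotes, with single quotes and backslashes escaped by a backslash.

```python
def quote_conninfo_value(value) -> str:
    """Quote a value for a libpq keyword/value connection string if needed."""
    text = str(value)
    needs_quotes = text == "" or any(ch.isspace() or ch in "'\\" for ch in text)
    if not needs_quotes:
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_conninfo(hosts, port, user, **extra) -> str:
    """Build a multi-host conninfo string; extra keywords are appended in order."""
    params = {"host": ",".join(hosts), "port": port, "user": user}
    params.update(extra)
    return " ".join(f"{key}={quote_conninfo_value(val)}" for key, val in params.items())


if __name__ == "__main__":
    print(build_conninfo(
        ["10.1.0.10", "10.1.0.11", "10.1.0.12"], 5432, "replicator",
        dbname="postgres", target_session_attrs="read-write", sslmode="require",
    ))
```

The password is intentionally left out of the example call; supplying it through PGPASSWORD or a .pgpass file keeps it out of shell history and process listings.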

Problem 2: "postgresql.conf not found" error during bootstrap

FATAL: Patroni expects to find postgresql.conf in PGDATA of the remote primary

This occurs on Debian/Ubuntu package installations where postgresql.conf lives in /etc/postgresql/17/main/ while PGDATA is /var/lib/postgresql/17/main/.

# On DC1, create a symbolic link to postgresql.conf inside PGDATA
ln -s /etc/postgresql/17/main/postgresql.conf \
  /var/lib/postgresql/17/main/postgresql.conf

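Problem 3: DC1 comes back after the Standby promotion and causes Split-Brain

If DC1 recovers earlier than expected and acts as a Primary on its own, both DCs accept writes and the system is in a Split-Brain state.

# 1. Stop DC1 PostgreSQL immediately
ssh pg-seoul-1 "systemctl stop patroni && systemctl stop postgresql"
ssh pg-seoul-2 "systemctl stop patroni && systemctl stop postgresql"
ssh pg-seoul-3 "systemctl stop patroni && systemctl stop postgresql"

# 2. Check the current state of DC2 and assess the extent of data loss
psql -h 10.2.0.10 -U postgres -c "
  SELECT now(), pg_current_wal_lsn(), timeline_id
  FROM pg_control_checkpoint();
"

# 3. Before rebuilding DC1, inspect the extra WAL records written on DC1 (if possible)
# DIVERGENCE_LSN is a placeholder for the LSN at which DC2 was promoted;
# -t 3 is the DC1 timeline from the example output earlier in this article
pg_waldump -p /var/lib/postgresql/17/main/pg_wal -t 3 -s "$DIVERGENCE_LSN" -n 1000

# 4. Reinitialize DC1 as a Standby Cluster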
rm -rf /var/lib/postgresql/17/main/*
systemctl start patroni

Problem 4: A standby cannot rejoin because pg_rewind fails

pg_rewind: error: could not find previous WAL record at 0/3000000

pg_rewind works only when either data checksums or wal_log_hints=on is enabled.

# Check whether data checksums are enabled
pg_controldata /var/lib/postgresql/17/main | grep "Data page checksum"

# Check the current wal_log_hints setting
psql -U postgres -c "SHOW wal_log_hints;"

# Force reinitialization without pg_rewind
patronictl -c /etc/patroni/patroni.yml reinit pg-busan-standby pg-busan-1 --force
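The two checks above can be combined into a single decision about whether pg_rewind is even an option before falling back to reinit. The following sketch parses text captured from pg_controldata and the value returned by SHOW wal_log_hints. The function names are hypothetical, and the parser assumes the usual "Data page checksum version: N" line, where 0 means checksums are disabled.

```python
import re


def data_checksums_enabled(controldata_output: str) -> bool:
    """True if pg_controldata reports a non-zero data page checksum version."""
    match = re.search(
        r"^Data page checksum version:\s*(\d+)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    return match is not None and int(match.group(1)) != 0


def rewind_prerequisites_met(controldata_output: str, wal_log_hints: str) -> bool:
    """pg_rewind needs data checksums or wal_log_hints=on (either is enough)."""
    hints_on = wal_log_hints.strip().lower() == "on"
    return data_checksums_enabled(controldata_output) or hints_on


if __name__ == "__main__":
    sample = (
        "pg_control version number:            1700\n"
        "Data page checksum version:           0\n"
    )
    print("rewind possible:", rewind_prerequisites_met(sample, "on"))
```

Meeting these prerequisites does not guarantee that pg_rewind succeeds, so the reinit command above remains the fallback when it still fails.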

References

Patroni official documentation - Standby Cluster
Patroni official documentation - Multi-Datacenter HA Configuration
Patroni official documentation - patronictl
CYBERTEC - Patroni: Cascading Replication with Standby Cluster
Percona - Performing Standby Datacentre Promotions of a Patroni Cluster
Patroni Release Notes 4.1.2 - promote-cluster / demote-cluster