PostgreSQL Backup & Recovery Part 6 — Backup Automation, Monitoring, and Recovery Drills
Creating backup files is not enough. Automation must bundle execution and verification into a single cycle to prevent silent failures; monitoring must cover five layers from file existence all the way to recoverability verification; and recovery drills — monthly logical restores, quarterly PITR runs, and semi-annual blackout exercises — must measure RTO, RPO, and team readiness. The ability to restore service within a predictable window is what true reliability looks like.
Series overview
- Part 1 — Backup Fundamentals and Strategy
- Part 2 — Logical Backup — pg_dump & pg_dumpall in Practice
- Part 3 — Physical Backup — pg_basebackup and WAL Archiving
- Part 4 — PITR (Point-in-Time Recovery) Implementation Guide
- Part 5 — Enterprise Backup Tools Compared — pgBackRest vs Barman vs WAL-G
- Part 6 — Backup Automation, Monitoring, and Recovery Drills (current · series complete)
Table of Contents
- Introduction — The Last Three Pillars That Complete the Strategy
- Backup Automation — Never Rely on Human Memory
- Backup Monitoring — A System for Catching Silent Failures
- Recovery Drills — Disasters Are Only Predictable for the Prepared
- Recovery Runbook — Written to Follow at 3 AM
- Backup Strategy Maturity Self-Assessment
- Closing — Recoverability Is Reliability
1. Introduction — The Last Three Pillars That Complete the Strategy
Parts 1 through 5 covered backup concepts, logical and physical backup tools, PITR implementation, and enterprise tooling. Now it is time to turn all of that into a living operational system with three final ingredients.
① Automation — Remove human hands as much as possible; eliminate mistakes and gaps
② Monitoring — Detect silent failures before they become disasters
③ Recovery drills — Ensure the team performs reliably under real incident pressure
A backup strategy missing these three elements is like keeping a fire extinguisher in an office where nobody knows where it is.
"Only half of organizations have a database disaster recovery plan. Of those, 50% created it only after experiencing severe data loss." — Barman / EnterpriseDB
2. Backup Automation — Never Rely on Human Memory
2.1 The Core Principle
There is one rule that matters most in backup automation.
Always automate backup execution and backup verification together as a single unit.
Automating execution while leaving verification manual means a silently failing backup can go unnoticed for weeks. Every automation script must also include failure notifications.
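As a minimal sketch of that rule, the dump, a readability check on the archive, and a failure alert can be chained into one unit (the database name, path, and webhook variable are placeholders; the full production version follows in 2.2):
# Minimal sketch: execution and verification as one unit, with a failure notification
DUMP="/backups/appdb_$(date +%Y%m%d).dump"
pg_dump -Fc -d appdb -f "${DUMP}" \
&& pg_restore --list "${DUMP}" > /dev/null \
|| curl -s -X POST "${PG_SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data '{"text":"[PG BACKUP ALERT] appdb dump or verification failed"}'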
2.2 Comprehensive Backup Automation Script
The script below combines physical backup (pgBackRest), logical backup (pg_dump), and WAL archiving status checks into a single production-grade automation cycle.
#!/bin/bash
# /usr/local/bin/pg_backup_full_cycle.sh
# Physical + logical backup automation + verification + notifications
set -euo pipefail
# === Configuration =============================================
DB_HOST="localhost"
DB_USER="postgres"
BACKUP_ROOT="/backups"
LOG_FILE="/var/log/pg_backup/backup_$(date +%Y%m%d).log"
ALERT_EMAIL="dba-team@example.com"
SLACK_WEBHOOK="${PG_SLACK_WEBHOOK}" # Security: manage Webhook URL via env var
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# === Function definitions =====================================
log() {
local level="$1"; shift
echo "[$(date '+%Y-%m-%d %H:%M:%S')] [${level}] $*" | tee -a "${LOG_FILE}"
}
send_alert() {
local subject="$1"
local message="$2"
# Email notification
echo "${message}" | mail -s "[PG BACKUP ALERT] ${subject}" "${ALERT_EMAIL}" 2>/dev/null || true
# Slack notification
curl -s -X POST "${SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data "{\"text\":\"[PG BACKUP ALERT]\n*${subject}*\n${message}\"}" \
2>/dev/null || true
log "ALERT" "Notification sent: ${subject}"
}
send_success() {
local message="$1"
curl -s -X POST "${SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data "{\"text\":\"[PG BACKUP OK]\n${message}\"}" \
2>/dev/null || true
}
check_wal_archiving() {
log "INFO" "Checking WAL archiving status..."
local failed_count
failed_count=$(psql -U "${DB_USER}" -h "${DB_HOST}" -tAc \
"SELECT failed_count FROM pg_stat_archiver;")
if [ "${failed_count}" -gt 0 ]; then
local last_failed
last_failed=$(psql -U "${DB_USER}" -h "${DB_HOST}" -tAc \
"SELECT last_failed_wal || ' at ' || last_failed_time FROM pg_stat_archiver;")
send_alert "WAL archiving failure" "Failure count: ${failed_count}\nLast failure: ${last_failed}"
return 1
fi
local lag_seconds
lag_seconds=$(psql -U "${DB_USER}" -h "${DB_HOST}" -tAc \
"SELECT EXTRACT(EPOCH FROM (now() - last_archived_time))::int FROM pg_stat_archiver;")
# Alert if archiving lag exceeds 30 minutes
if [ "${lag_seconds:-0}" -gt 1800 ]; then
send_alert "WAL archiving lag" \
"${lag_seconds} seconds since last archive (threshold: 1800s)"
return 1
fi
log "INFO" "WAL archiving healthy (last archive ${lag_seconds:-?}s ago)"
}
run_physical_backup() {
log "INFO" "Starting physical backup (pgBackRest)..."
local backup_type="${1:-diff}" # full / diff / incr
if sudo -u postgres pgbackrest \
--stanza=main \
--type="${backup_type}" \
--log-level-console=warn \
backup >> "${LOG_FILE}" 2>&1; then
log "INFO" "Physical backup (${backup_type}) complete"
return 0
else
send_alert "Physical backup failed" \
"pgBackRest ${backup_type} backup failed\nLog: ${LOG_FILE}"
return 1
fi
}
run_logical_backup() {
log "INFO" "Starting logical backup (pg_dump)..."
local db_list
db_list=$(psql -U "${DB_USER}" -h "${DB_HOST}" -tAc \
"SELECT datname FROM pg_database WHERE datistemplate = false AND datname != 'postgres';")
local failed=0
for db in ${db_list}; do
local dump_file="${BACKUP_ROOT}/logical/${db}_${TIMESTAMP}.dump"
mkdir -p "${BACKUP_ROOT}/logical"
if pg_dump -Fc -U "${DB_USER}" -h "${DB_HOST}" -d "${db}" \
-f "${dump_file}" >> "${LOG_FILE}" 2>&1; then
local size
size=$(du -sh "${dump_file}" | cut -f1)
log "INFO" "Logical backup complete: ${db} (${size})"
else
send_alert "Logical backup failed" "Database: ${db}\nFile: ${dump_file}"
failed=$((failed + 1))
fi
done
return ${failed}
}
cleanup_old_backups() {
log "INFO" "Removing expired backups (retention: ${RETENTION_DAYS} days)..."
local deleted
deleted=$(find "${BACKUP_ROOT}/logical" -name "*.dump" \
-mtime +${RETENTION_DAYS} -delete -print 2>/dev/null | wc -l)
log "INFO" "Deleted ${deleted} logical backup(s)"
}
# === Main execution ===========================================
mkdir -p "$(dirname "${LOG_FILE}")" "${BACKUP_ROOT}/logical"
log "INFO" "========== Backup cycle start =========="
# 1. WAL archiving status check
check_wal_archiving || log "WARN" "WAL archiving anomaly detected (continuing)"
# 2. Physical backup (auto-select full/diff by day of week)
DOW=$(date +%u) # 1=Mon ~ 7=Sun
if [ "${DOW}" -eq 7 ]; then
run_physical_backup "full"
else
run_physical_backup "diff"
fi
# 3. Logical backup
run_logical_backup
# 4. Remove expired backups
cleanup_old_backups
# 5. Completion notification
ELAPSED=$(( SECONDS ))
send_success "Backup complete — elapsed: ${ELAPSED}s | $(date '+%Y-%m-%d %H:%M:%S')"
log "INFO" "========== Backup cycle complete (${ELAPSED}s) =========="
2.3 Cron Schedule
# crontab -u postgres -e
# Every day at 1 AM: automated backup (Sunday=full, other days=differential)
0 1 * * * /usr/local/bin/pg_backup_full_cycle.sh
# Every hour on the hour: incremental backup
0 * * * * pgbackrest --stanza=main --type=incr backup >> /var/log/pg_backup/incr_$(date +\%Y\%m\%d).log 2>&1
# Every Sunday at 12:30 AM: global objects backup
30 0 * * 0 pg_dumpall -U postgres --globals-only -f /backups/logical/globals_$(date +\%Y\%m\%d).sql
3. Backup Monitoring — A System for Catching Silent Failures
3.1 Five Monitoring Layers
Effective backup monitoring must cover all five of these layers.
Level 1: Existence check — Was a backup file created?
Level 2: Integrity check — Is the file uncorrupted?
Level 3: Partial restore — Can a specific table be extracted?
Level 4: Full restore — Can the entire database be recovered?
Level 5: Application check — Does the app work correctly after recovery?
Daily monitoring should cover Levels 1–2, monthly verification Level 3, quarterly drills Level 4, and the semi-annual blackout exercise Level 5.
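Levels 1 and 2 are cheap enough to run every day. A hedged sketch (paths are placeholders; pg_verifybackup works against a plain-format pg_basebackup directory that still contains its backup_manifest):
# Level 1: was any logical dump produced in the last 24 hours?
find /backups/logical -name "*.dump" -mmin -1440 | grep -q . || echo "no dump in the last 24h"
# Level 2: is the newest dump structurally readable?
pg_restore --list "$(ls -t /backups/logical/*.dump | head -1)" > /dev/null
# Level 2 for a plain-format base backup that still has its backup_manifest
pg_verifybackup /backups/basebackup/latest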
3.2 Key Monitoring Queries
Permission requirement:
pg_ls_waldir() and pg_control_checkpoint() require superuser or the pg_monitor role. Grant it to your dedicated monitoring account: GRANT pg_monitor TO monitoring_user;
-- ① WAL archiving summary
SELECT
archived_count,
failed_count,
last_archived_wal,
to_char(last_archived_time, 'YYYY-MM-DD HH24:MI:SS') AS last_archived_at,
EXTRACT(EPOCH FROM (now() - last_archived_time))::int AS lag_seconds,
last_failed_wal,
to_char(last_failed_time, 'YYYY-MM-DD HH24:MI:SS') AS last_failed_at
FROM pg_stat_archiver;
-- ② pg_wal directory accumulated size (abnormal growth signals archiving backlog)
SELECT
pg_size_pretty(sum(size)) AS wal_dir_size,
count(*) AS wal_file_count
FROM pg_ls_waldir();
-- ③ Last checkpoint time (stale checkpoint may indicate a problem)
SELECT
to_char(checkpoint_time, 'YYYY-MM-DD HH24:MI:SS') AS last_checkpoint,
EXTRACT(EPOCH FROM (now() - checkpoint_time))::int AS seconds_ago
FROM pg_control_checkpoint();
-- ④ Current replication lag (if standbys are present)
SELECT
application_name,
state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
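The archiver view only shows PostgreSQL's side; it is also worth asking the backup tool itself. A sketch using pgBackRest's own commands, assuming the stanza name main from earlier parts; the jq path reflects the JSON layout of recent pgBackRest releases and should be treated as an assumption:
# Repository and archiving-path health as pgBackRest sees them (non-zero exit on failure)
sudo -u postgres pgbackrest --stanza=main check
# Age of the most recent backup in the repository, in seconds
LAST_STOP=$(sudo -u postgres pgbackrest --stanza=main info --output=json \
| jq '.[0].backup[-1].timestamp.stop')
echo "last pgBackRest backup finished $(( $(date +%s) - LAST_STOP ))s ago"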
3.3 Automated Health Check Script
#!/bin/bash
# /usr/local/bin/pg_backup_healthcheck.sh
# Run every 5 minutes via cron
DB_USER="postgres"
SLACK_WEBHOOK="${PG_SLACK_WEBHOOK}" # Security: manage Webhook URL via env var
BACKUP_ROOT="/backups"
MAX_BACKUP_AGE_HOURS=26 # Max allowed age of the last backup
MAX_WAL_LAG_MINUTES=30 # Max allowed WAL archiving lag
alert() {
local emoji="$1"; local msg="$2"
curl -s -X POST "${SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data "{\"text\":\"${emoji} *[PG HEALTH]* ${msg}\"}" >/dev/null
}
# Check 1: WAL archiving failures
FAILED=$(psql -U "${DB_USER}" -tAc \
"SELECT failed_count FROM pg_stat_archiver;")
[ "${FAILED:-0}" -gt 0 ] && \
alert "[RED]" "WAL archiving failures: ${FAILED} accumulated"
# Check 2: WAL archiving lag
LAG=$(psql -U "${DB_USER}" -tAc \
"SELECT COALESCE(EXTRACT(EPOCH FROM now()-last_archived_time)::int, 99999)
FROM pg_stat_archiver;")
[ "${LAG:-99999}" -gt $((MAX_WAL_LAG_MINUTES * 60)) ] && \
alert "[WARN]" "WAL archiving lag: ${LAG}s (threshold: $((MAX_WAL_LAG_MINUTES * 60))s)"
# Check 3: pg_wal file count spike
WAL_COUNT=$(psql -U "${DB_USER}" -tAc \
"SELECT count(*) FROM pg_ls_waldir() WHERE name ~ '^[0-9A-F]{24}$';")
[ "${WAL_COUNT:-0}" -gt 100 ] && \
alert "[WARN]" "pg_wal file count: ${WAL_COUNT} — possible archiving backlog"
# Check 4: Recent backup file exists (logical backup basis)
LATEST_BACKUP=$(find "${BACKUP_ROOT}/logical" -name "*.dump" \
-mmin -$((MAX_BACKUP_AGE_HOURS * 60)) 2>/dev/null | wc -l)
[ "${LATEST_BACKUP}" -eq 0 ] && \
alert "[RED]" "No backup file found in the last ${MAX_BACKUP_AGE_HOURS} hours!"
# Check 5: Disk usage (warn above 80%)
DISK_USE=$(df -h "${BACKUP_ROOT}" | awk 'NR==2{print $5}' | tr -d '%')
[ "${DISK_USE:-0}" -gt 80 ] && \
alert "[WARN]" "Backup disk usage: ${DISK_USE}% (threshold: 80%)"
3.4 Prometheus + Grafana Integration
With postgres_exporter, backup-related metrics can be collected in Prometheus and visualized in a Grafana dashboard.
# /etc/postgres_exporter/custom_queries.yaml
# Custom backup status metrics
pg_backup_archiver:
  query: |
    SELECT
      archived_count,
      failed_count,
      EXTRACT(EPOCH FROM (now() - last_archived_time)) AS last_archive_lag_seconds
    FROM pg_stat_archiver;
  metrics:
    - archived_count:
        usage: COUNTER
        description: "Cumulative WAL archiving successes"
    - failed_count:
        usage: COUNTER
        description: "Cumulative WAL archiving failures"
    - last_archive_lag_seconds:
        usage: GAUGE
        description: "Seconds elapsed since last successful archive"
pg_backup_wal_files:
  query: |
    SELECT count(*) AS pending_wal_files
    FROM pg_ls_waldir()
    WHERE name ~ '^[0-9A-F]{24}$';
  metrics:
    - pending_wal_files:
        usage: GAUGE
        description: "Unprocessed WAL files in pg_wal directory"
# Prometheus alert rules (prometheus/alerts/postgresql_backup.yml)
groups:
  - name: postgresql_backup
    rules:
      # Immediate alert on WAL archiving failure
      - alert: PostgreSQLWALArchivingFailed
        expr: increase(pg_backup_archiver_failed_count[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL archiving failure"
          description: "WAL archiving failed on {{ $labels.instance }}"
      # WAL archiving lag exceeds 30 minutes
      - alert: PostgreSQLWALArchiveLag
        expr: pg_backup_archiver_last_archive_lag_seconds > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WAL archiving lag"
          description: "{{ $value | humanizeDuration }} since last archive"
      # More than 100 WAL files accumulated
      - alert: PostgreSQLWALPileup
        expr: pg_backup_wal_files_pending_wal_files > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "pg_wal file accumulation"
          description: "{{ $value }} WAL files pending — possible archiving backlog"
4. Recovery Drills — Disasters Are Only Predictable for the Prepared
4.1 Why Drills Are Non-Negotiable
A real incident looks like this.
3:17 AM — on-call alert fires
On-call engineer: half asleep, searching for the laptop
Teammates: half of them are asleep
Pressure: service is down, management is waiting
Attempting a recovery for the first time in this situation is not a technical problem — it is a human crisis. Drills turn this situation into something the team has already done before.
"A backup tool is the means of executing a strategy. No matter how capable the tool, it is useless if the team has not drilled."
4.2 Five-Level Recovery Drill Framework
Level 1: Existence check drill — Weekly, automated (script)
Level 2: Integrity verification — Weekly, automated (pg_verifybackup)
Level 3: Logical selective restore — Monthly, manual (restore a specific table)
Level 4: Full PITR drill — Quarterly, manual (separate environment, full scenario)
Level 5: Blackout drill — Semi-annually (no advance notice, full team)
4.3 Monthly Logical Restore Drill Script (Level 3)
#!/bin/bash
# /usr/local/bin/pg_restore_drill_monthly.sh
# Run on the 1st of each month at 3 AM (cron: 0 3 1 * *)
set -euo pipefail
BACKUP_ROOT="/backups/logical"
TEST_DB="restore_drill_$(date +%Y%m)"
PROD_DB="production"
DB_USER="postgres"
LOG_FILE="/var/log/pg_backup/drill_$(date +%Y%m%d).log"
SLACK_WEBHOOK="${PG_SLACK_WEBHOOK}"
log() { echo "[$(date '+%H:%M:%S')] $*" | tee -a "${LOG_FILE}"; }
notify() {
curl -s -X POST "${SLACK_WEBHOOK}" \
-H 'Content-type: application/json' \
--data "{\"text\":\"$1\"}" >/dev/null
}
log "===== Monthly restore drill start ====="
START_TIME=${SECONDS}
# 1. Locate the most recent dump file
LATEST_DUMP=$(find "${BACKUP_ROOT}" -name "${PROD_DB}_*.dump" \
-printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2)
if [ -z "${LATEST_DUMP}" ]; then
notify "Restore drill FAILED: no backup file found"
exit 1
fi
log "Target backup: ${LATEST_DUMP}"
# 2. Integrity check
DUMP_SIZE=$(du -sh "${LATEST_DUMP}" | cut -f1)
if ! pg_restore --list "${LATEST_DUMP}" > /dev/null 2>&1; then
notify "Restore drill FAILED: backup file corrupted (${LATEST_DUMP})"
exit 1
fi
log "Integrity check passed (size: ${DUMP_SIZE})"
# 3. Create test DB and restore
psql -U "${DB_USER}" -c "DROP DATABASE IF EXISTS ${TEST_DB};" > /dev/null
psql -U "${DB_USER}" -c "CREATE DATABASE ${TEST_DB};" > /dev/null
pg_restore -U "${DB_USER}" -d "${TEST_DB}" -j 4 "${LATEST_DUMP}" \
>> "${LOG_FILE}" 2>&1
log "Restore complete -> ${TEST_DB}"
# 4. Data validation query
PROD_COUNT=$(psql -U "${DB_USER}" -d "${PROD_DB}" -tAc \
"SELECT count(*) FROM pg_tables WHERE schemaname='public';")
TEST_COUNT=$(psql -U "${DB_USER}" -d "${TEST_DB}" -tAc \
"SELECT count(*) FROM pg_tables WHERE schemaname='public';")
if [ "${PROD_COUNT}" -ne "${TEST_COUNT}" ]; then
notify "Restore drill WARNING: table count mismatch (source: ${PROD_COUNT}, restored: ${TEST_COUNT})"
else
log "Table count matches: ${TEST_COUNT}"
fi
# 5. Clean up test DB
psql -U "${DB_USER}" -c "DROP DATABASE ${TEST_DB};" > /dev/null
# 6. Report results
ELAPSED=$((SECONDS - START_TIME))
notify "Monthly restore drill COMPLETE | Backup: $(basename ${LATEST_DUMP}) | Size: ${DUMP_SIZE} | Elapsed: ${ELAPSED}s | Tables: ${TEST_COUNT}"
log "===== Drill complete (${ELAPSED}s) ====="
4.4 Quarterly Full PITR Drill Guide (Level 4)
Quarterly PITR drills must always run in a separate environment; production must not be affected during the exercise.
Example scenario: "A critical table was dropped at 2:30 PM today. Restore to the state at 2:29 PM."
# Quarterly PITR drill procedure (run on a separate server)
# [Drill facilitator] Announce recovery target time (not disclosed to participants in advance)
TARGET_TIME="2026-04-14 14:29:59 Asia/Seoul"
# Step 1: Restore base backup
mkdir -p /drill/datadir
tar -xzf /backups/basebackup/latest/base.tar.gz -C /drill/datadir
# Step 2: Create recovery.signal and set recovery parameters
touch /drill/datadir/recovery.signal
cat >> /drill/datadir/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/archive/%f %p'
recovery_target_time = '${TARGET_TIME}'
recovery_target_action = 'pause'
port = 5555
EOF
# Step 3: Start PostgreSQL in recovery mode
pg_ctl -D /drill/datadir start
# Step 4: Monitor recovery progress (participants perform this)
tail -f /drill/datadir/log/postgresql-*.log
# Step 5: Verify recovery and validate data (participants perform this)
psql -p 5555 -U postgres -c "SELECT pg_is_in_recovery();"
psql -p 5555 -U postgres -c "SELECT COUNT(*) FROM critical_table;"
psql -p 5555 -U postgres -c "SELECT MAX(created_at) FROM orders;"
# Step 6: Switch to normal operation mode
psql -p 5555 -U postgres -c "SELECT pg_wal_replay_resume();"
# Step 7: End drill and clean up
pg_ctl -D /drill/datadir stop
rm -rf /drill/datadir
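While the server is still paused at the target (step 5), it is worth confirming how close the replay actually got to the announced target time; pg_last_xact_replay_timestamp() reports the commit time of the last replayed transaction:
# Run during step 5, while recovery is paused at the target
psql -p 5555 -U postgres -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
psql -p 5555 -U postgres -c "SELECT '${TARGET_TIME}'::timestamptz - pg_last_xact_replay_timestamp() AS gap_to_target;"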
4.5 Drill Result Metrics
Record and track the following metrics after every drill.
| Metric | How to measure | Target |
|---|---|---|
| Recovery time (RTO) | Drill start to service resumption | Within target RTO |
| Data loss window (RPO) | Recovery point to incident time (compare pg_stat_archiver.last_archived_time against the incident timestamp; example after the table) | Within target RPO |
| Procedure success rate | Steps completed without error / total steps | 100% |
| Team participation rate | Participants / total on-call roster | 100% |
| Runbook accuracy | Match rate between actual steps and documented steps | 100% |
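For the RPO row, the archiver view gives a quick proxy for the worst-case loss window at any moment; run it against the production instance during the drill and compare the result with the announced incident time:
# Worst-case data-loss window right now, based on the last archived WAL segment
psql -U postgres -tAc "SELECT now() - last_archived_time AS potential_rpo FROM pg_stat_archiver;"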
5. Recovery Runbook — Written to Follow at 3 AM
Every recovery procedure must be documented clearly enough for the least experienced person on the team to follow independently.
5.1 Required Runbook Components
# [DB Recovery Runbook] Table DROP Incident
## Pre-conditions checklist
- [ ] Last backup location: /backups/basebackup/LATEST
- [ ] WAL archive location: /var/lib/postgresql/archive/
- [ ] Recovery server access: restore-server.internal
- [ ] Emergency contacts: dba-oncall@example.com / Slack #db-alerts
## Recovery decision criteria
- RPO: 5 minutes (WAL archiving interval)
- RTO: 2 hours (full recovery baseline)
## Recovery steps (PITR)
1. [ ] Block service traffic (or restrict DB connections)
2. [ ] Confirm the exact incident time (from logs or alert timestamps)
3. [ ] Determine recovery target time (incident time minus 1 minute)
4. [ ] Locate and restore base backup
5. [ ] Set recovery.signal + postgresql.auto.conf
6. [ ] Start PostgreSQL and monitor recovery progress
7. [ ] Validate data (row counts, latest records)
8. [ ] Execute pg_wal_replay_resume()
9. [ ] Verify sequence reset
10. [ ] Restore service traffic
11. [ ] Immediately take a new full base backup
12. [ ] Write post-incident report
## Escalation
- Unable to start recovery within 30 min -> contact team lead
- Recovery outcome uncertain after 1 hour -> report to CTO
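Step 9 exists because a table restored from a logical dump into a live database can leave its sequence behind the data (a full PITR restores sequences consistently). A hedged check, using the orders table from the drill scenario and a hypothetical orders_id_seq sequence:
# Compare the sequence position with the data it feeds (names are hypothetical)
psql -d production -tAc "SELECT last_value FROM orders_id_seq;"
psql -d production -tAc "SELECT max(id) FROM orders;"
# If the sequence lags the data, advance it
psql -d production -c "SELECT setval('orders_id_seq', (SELECT max(id) FROM orders));"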
5.2 Runbook Maintenance Principles
[ ] Update documentation after every quarterly drill
[ ] Re-validate after every PostgreSQL version upgrade
[ ] Re-verify commands after tool version changes (pgBackRest, Barman, etc.)
[ ] Conduct a dry run with any new team member
[ ] Version-control the Runbook in Git (track change history)
6. Backup Strategy Maturity Self-Assessment
Use the checklist below to assess your organization's current backup strategy maturity.
Foundation
[ ] Regular backups via pg_dump or pg_basebackup are automated
[ ] Backup files are stored separately from the production server
[ ] Failure notifications are sent when a backup fails
[ ] A retention policy is configured
Intermediate
[ ] WAL archiving is configured so PITR is possible
[ ] pg_verifybackup or equivalent integrity checking is automated
[ ] An actual restore test is performed at least once a month
[ ] Backup monitoring is visualized with Prometheus/Grafana or equivalent
[ ] The 3-2-1 rule (3 copies, 2 media types, 1 offsite) is applied
Advanced
[ ] An enterprise tool (pgBackRest / Barman / WAL-G) is in production
[ ] Block-level incremental backup optimizes both storage and duration
[ ] Full PITR drills are conducted quarterly and results are recorded
[ ] The Runbook is up to date and every team member knows it
[ ] RTO/RPO targets are documented and regularly validated
[ ] pg_create_restore_point() is a habit before critical operations (see the example after this checklist)
[ ] Backup metrics trends (size, duration, success rate) are tracked
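On the pg_create_restore_point() habit from that list: a named restore point gives PITR a label to target instead of a wall-clock guess (the label below is a placeholder):
# Before a risky migration or bulk change, drop a named marker into the WAL
psql -U postgres -c "SELECT pg_create_restore_point('before_billing_migration');"
# A later PITR can then target it with: recovery_target_name = 'before_billing_migration'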
7. Closing — Recoverability Is Reliability
This has been a long journey across six parts. Let us return to the question posed in Part 1.
Is "we have a backup" the same as "we can recover at any time"?
Now the answer is clear — they are not the same.
A backup is not finished when a file is created. The file must be verified as uncorrupted, the team must know the recovery procedure, and the team must repeatedly prove it can succeed. Only then is the system truly "recoverable."
A file exists ≠ a backup exists
A backup exists ≠ recovery is possible
Recovery is possible ≠ fast recovery is possible
Fast recovery possible = genuine reliability
To compress the entire series into one sentence:
"The goal of PostgreSQL backup is not to create files. It is to restore service within a predictable window — regardless of the disaster."
Start reviewing your team's backup strategy today, and improve the single most vulnerable point first.
Full Series Summary
| Part | Title | Key Content |
|---|---|---|
| Part 1 | Backup Fundamentals | RPO/RTO, 3 backup types, 3-2-1 rule, common mistakes |
| Part 2 | Logical Backup | pg_dump formats, parallel dump/restore, pg_dumpall, automation |
| Part 3 | Physical Backup | pg_basebackup, WAL archiving, incremental backup (v17+) |
| Part 4 | PITR Implementation | Recovery target parameters, step-by-step recovery, timelines, scenarios |
| Part 5 | Tool Comparison | pgBackRest vs Barman vs WAL-G in depth |
| Part 6 | Automation & Drills | Automation scripts, monitoring, Runbook, maturity assessment |
References
Official documentation
- PostgreSQL 18 Documentation — Backup and Restore
- PostgreSQL 18 Documentation — Continuous Archiving and PITR
- pgBackRest User Guide
- Barman Documentation v3.18
- WAL-G GitHub
2025–2026 articles
- 13 PostgreSQL Backup Best Practices — Medium (Nov 2025)
- PostgreSQL Backup Verification — DEV Community (Jan 2026)
- Automating Backups and DR: pgBackRest vs Barman — Severalnines (Nov 2025)
- PostgreSQL Disaster Recovery — Stormatics (Feb 2026)
- Best PostgreSQL Backup Solutions in 2026 — PostgresGUI (Feb 2026)
- How to Test PostgreSQL Backup Restoration — OneUptime (Jan 2026)