MongoDB Backup & Recovery Guide Part 3 — Atlas Cloud Backup, PITR & Disaster Recovery Playbook
Where Part 1 covered mongodump and Part 2 covered LVM/EBS snapshots and PBM, Part 3 examines MongoDB Atlas Cloud Backup — which absorbs every operational burden of backup management into a fully managed service. It explains how to design Oplog-based Point-in-Time Recovery (PITR) down to the second, configure multi-region Snapshot Distribution, and enforce immutable backups with Backup Compliance Policy. Four disaster recovery playbooks — accidental collection drop, bad migration script, region-wide outage, and ransomware — are paired with scenario-specific RTO and RPO targets. The series closes with a comprehensive audit checklist and an environment-by-environment strategy selection guide.
Series outline
- Part 1 — From RTO, RPO & Oplog to mongodump/mongorestore in Practice
- Part 2 — Filesystem Snapshots (LVM·EBS), PBM & Automation Pipelines
- Part 3 — Atlas Cloud Backup, PITR & Disaster Recovery Playbook (this post)
Table of Contents
- Why Atlas Backup? — The Value of Managed Infrastructure
- Atlas Backup Architecture Overview
- Enabling Cloud Backup — UI, API & Terraform
- Snapshot Retention Policy Design Guide
- Point-in-Time Recovery (PITR) Deep Dive
- Multi-Region Snapshot Distribution
- Backup Compliance Policy — Immutable Lock
- Disaster Recovery Playbooks by Scenario
- The 3-2-1 Backup Rule Applied to MongoDB
- Final Checklist & Strategy Selection Guide
- Closing Thoughts
1. Why Atlas Backup? — The Value of Managed Infrastructure
Part 1 covered mongodump. Part 2 covered LVM/EBS snapshots and PBM.
All of these are solid tools, but they share one thing in common: you install, configure, and maintain them yourself.
MongoDB Atlas Cloud Backup removes that operational burden entirely.
For most organisations, the engineering cost of building reliable, automated backup for a sharded cluster far exceeds the cost of a managed service like Atlas.
The core value of Atlas Cloud Backup:
- Full automation: Schedule once — creation, retention, and deletion are all handled automatically
- Incremental snapshots: Uses the cloud provider's native snapshot mechanism, so every backup is incremental by default
- Complete sharding support: Cluster-wide consistent snapshots across all shards, guaranteed automatically
- PITR: Continuous Oplog capture enables RPO of under one minute
- Multi-region distribution: One click to replicate snapshots to a DR region automatically
- M10+ only: Available on all dedicated clusters (M10 and above); Free and Flex tiers are not supported
2. Atlas Backup Architecture Overview
Understanding how Atlas Cloud Backup works internally makes policy design significantly easier.
Snapshots are stored incrementally in the cloud provider's object storage. The Oplog Store continuously captures write operations, and combining a base snapshot with Oplog replay makes arbitrary point-in-time recovery possible. When Snapshot Distribution is enabled, snapshots in the Primary Region are automatically replicated to the DR Region.
3. Enabling Cloud Backup — UI, API & Terraform
3.1 Enabling via Atlas UI
- Select your cluster in the Atlas dashboard
- Go to Edit Configuration → Additional Settings
- Toggle Turn on Cloud Backup → On
- Toggle Continuous Cloud Backup → On (required for PITR)
- Click Save Changes
3.2 Enabling via Atlas Admin API
# Enable Cloud Backup and PITR together
curl -X PATCH \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}" \
--digest -u "{publicKey}:{privateKey}" \
-H "Content-Type: application/json" \
-d '{
"providerBackupEnabled": true,
"pitEnabled": true
}'
# Set the backup schedule policy (6-hourly / daily / weekly / monthly tiers, 7-day PITR window)
curl -X PUT \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/schedule" \
--digest -u "{publicKey}:{privateKey}" \
-H "Content-Type: application/json" \
-d '{
"referenceHourOfDay": 2,
"referenceMinuteOfHour": 0,
"restoreWindowDays": 7,
"policies": [{
"policyItems": [
{
"frequencyInterval": 6,
"frequencyType": "hourly",
"retentionUnit": "days",
"retentionValue": 7
},
{
"frequencyInterval": 1,
"frequencyType": "daily",
"retentionUnit": "days",
"retentionValue": 14
},
{
"frequencyInterval": 6,
"frequencyType": "weekly",
"retentionUnit": "weeks",
"retentionValue": 4
},
{
"frequencyInterval": 40,
"frequencyType": "monthly",
"retentionUnit": "months",
"retentionValue": 12
}
]
}]
}'
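As a sanity check on a schedule like the one above, it helps to estimate how many snapshots it retains at steady state. A back-of-envelope sketch mirroring the example payload:

```shell
# Approximate snapshots retained at steady state by the example schedule
hourly=$(( (24 / 6) * 7 ))   # every 6 hours, kept 7 days  -> 28
daily=$(( 1 * 14 ))          # daily, kept 14 days         -> 14
weekly=4                     # weekly, kept 4 weeks
monthly=12                   # monthly, kept 12 months
total=$(( hourly + daily + weekly + monthly ))
echo "approximate retained snapshots: $total"
```

Because Atlas snapshots are incremental, storage cost grows with write churn rather than linearly with the snapshot count, so a higher count is cheaper than it looks.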
3.3 Managing via Terraform (IaC)
In infrastructure-as-code environments, Terraform lets you version-control your backup policies alongside the rest of your infrastructure.
# main.tf
resource "mongodbatlas_cluster" "production" {
project_id = var.atlas_project_id
name = "prod-cluster"
provider_name = "AWS"
provider_region_name = "AP_SOUTHEAST_2"
provider_instance_size_name = "M30"
cloud_backup = true
pit_enabled = true
}
resource "mongodbatlas_cloud_backup_schedule" "production_schedule" {
project_id = var.atlas_project_id
cluster_name = mongodbatlas_cluster.production.name
reference_hour_of_day = 2
reference_minute_of_hour = 0
restore_window_days = 7
policy_item_hourly {
frequency_interval = 6
retention_unit = "days"
retention_value = 7
}
policy_item_daily {
frequency_interval = 1
retention_unit = "days"
retention_value = 14
}
policy_item_weekly {
frequency_interval = 6 # Saturday
retention_unit = "weeks"
retention_value = 4
}
policy_item_monthly {
frequency_interval = 40 # Last day of the month
retention_unit = "months"
retention_value = 12
}
}
4. Snapshot Retention Policy Design Guide
Atlas supports five snapshot frequencies (hourly / daily / weekly / monthly / yearly), each with an independently configurable retention period.
Atlas Default Retention Policy (recommended starting point)
| Frequency | Default Retention | Purpose |
|---|---|---|
| Hourly | 2 days | Short-term recovery from operational mistakes |
| Daily | 7 days | General data corruption recovery |
| Weekly | 4 weeks | Buffer for weekly deployments and migration errors |
| Monthly | 12 months | Monthly reporting and audit requirements |
| Yearly | Maximum configurable | Legal compliance (GDPR, HIPAA, SOC 2) |
| PITR Restore Window | 7 days (default) | Arbitrary point-in-time recovery range |
Recommended Policies by Business Criticality
General production services (e-commerce, SaaS)
RPO target: under 1 hour
- Hourly snapshots: every 6 hours, 7-day retention
- Daily snapshots: 14-day retention
- Weekly snapshots: 4-week retention
- PITR: 7-day restore window
Financial and healthcare services (strict compliance)
RPO target: near-zero (up to last transaction)
- Hourly snapshots: every 2 hours, 7-day retention
- Daily snapshots: 1-month retention
- Monthly snapshots: 12-month retention (for fast Atlas-side recovery)
- Monthly snapshots → S3 archive: 5-year long-term retention
- PITR: 2-day restore window (shorter window reduces restore time)
- Backup Compliance Policy: mandatory
Development and staging environments
Backup disabled — recommended
(MongoDB's official documentation also states backup is unnecessary for dev/test environments)
5. Point-in-Time Recovery (PITR) Deep Dive
5.1 How PITR Works
Atlas PITR combines a base snapshot with Oplog replay.
A shorter restore window means less Oplog to replay, which directly reduces the actual recovery time (RTO). For instance, if snapshots are taken every 2 hours, you only ever need to replay up to 2 hours of Oplog.
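The relationship between snapshot interval and replay time can be illustrated with a little shell arithmetic. The timestamps below are hypothetical (6-hourly snapshots around a morning recovery target), and GNU `date` is assumed:

```shell
# Recovery target and the two most recent 6-hourly snapshots around it
target=$(date -u -d '2026-04-14T10:25:55Z' +%s)
snapshots="$(date -u -d '2026-04-14T02:00:00Z' +%s) $(date -u -d '2026-04-14T08:00:00Z' +%s)"

# Pick the newest snapshot taken at or before the target
base=0
for s in $snapshots; do
  if [ "$s" -le "$target" ] && [ "$s" -gt "$base" ]; then base=$s; fi
done

# The Oplog that must be replayed is everything between base and target
replay=$(( target - base ))
echo "base snapshot:   $(date -u -d "@$base" +%FT%TZ)"
echo "oplog to replay: ${replay}s (~$(( replay / 60 )) min)"
```

Here the 08:00 snapshot is chosen and roughly 2.5 hours of Oplog must be replayed; with 2-hourly snapshots the worst case would shrink to 2 hours.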
5.2 PITR Recovery Procedure — UI Walkthrough
Scenario: At 10:26 AM on April 14, 2026, a developer accidentally dropped the payments collection in production. You need to recover to the state at 10:25:55 AM.
Step 1. Select the affected cluster in Atlas UI
Step 2. Click Backup in the left sidebar
Step 3. Click Point in Time Restore
Step 4. Choose recovery method
Recovery method:
● Date & Time → minute-level precision
○ Oplog Timestamp → second-level precision ← recommended for precise recovery
Step 5. Select the Oplog Timestamp tab and enter the value
Timestamp (seconds since epoch): 1776162355
Increment: 1
(2026-04-14T10:25:55 UTC = Unix timestamp 1776162355)
Step 6. Select the recovery target (overwrite existing cluster or restore to a new cluster)
Step 7. Click Restore and monitor until complete
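The Oplog Timestamp entered in Step 5 is just a Unix epoch value. On a machine with GNU `date` it can be derived, and double-checked for timezone slips, like this:

```shell
# Convert the recovery instant (UTC) to the epoch value Atlas expects
ts=$(date -u -d '2026-04-14T10:25:55Z' +%s)
echo "oplogTs=$ts"

# Convert back to confirm the value round-trips to the intended instant
date -u -d "@$ts" +%FT%TZ
```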
5.3 Automating PITR Recovery via Atlas Admin API
# PITR recovery using Oplog Timestamp (REST API)
curl -X POST \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/restoreJobs" \
--digest -u "{publicKey}:{privateKey}" \
-H "Content-Type: application/json" \
-d '{
"delivery": {
"methodName": "AUTOMATED_RESTORE",
"targetClusterName": "prod-recovery-cluster",
"targetGroupId": "{targetGroupId}"
},
"oplogTs": 1744603555,
"oplogInc": 1
}'
# Monitor restore job status
RESTORE_JOB_ID="..."
curl -X GET \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/restoreJobs/$RESTORE_JOB_ID" \
--digest -u "{publicKey}:{privateKey}"
5.4 Key Considerations When Designing PITR
Restore window vs. snapshot interval
The restore window (restoreWindowDays) cannot exceed the hourly snapshot retention period. For example, if hourly snapshots are retained for 7 days, the PITR restore window is capped at 7 days.
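This constraint is easy to validate before submitting a schedule. A minimal pre-flight check, with hypothetical values matching the earlier example payload:

```shell
# Values you intend to submit in the schedule payload
restore_window_days=7
hourly_retention_days=7

# The PITR window must fit inside the hourly snapshot retention period
if [ "$restore_window_days" -gt "$hourly_retention_days" ]; then
  echo "INVALID: restoreWindowDays ($restore_window_days) exceeds hourly retention ($hourly_retention_days)"
  valid=0
else
  echo "OK: PITR window fits inside hourly snapshot retention"
  valid=1
fi
```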
Caution when disabling PITR
Disabling Continuous Cloud Backup in Atlas immediately deletes all existing Oplog data. If a Backup Compliance Policy is active, disabling it requires a verification process through MongoDB support.
NVMe storage restore time
Clusters using NVMe storage keep data on locally attached disks, so a restore must copy the data onto each node's local volumes. This can make restores noticeably slower than on clusters backed by network-attached virtual storage.
6. Multi-Region Snapshot Distribution
To protect against a single-region failure (an entire cloud provider region going down), snapshots must be automatically replicated to a geographically separate region.
6.1 Enabling Snapshot Distribution
Atlas UI path:
Clusters → select cluster → Backup Policy tab → Additional Backup Copies section → toggle Copy to other regions → select target region
Note: Only other regions within the same cloud provider can be selected. Example: AWS ap-northeast-2 → AWS ap-southeast-1
6.2 Cross-Region Policy Configuration Example
Atlas lets you independently control which snapshots to replicate and how long to retain the copies in the DR region.
Primary Region (ap-northeast-2, Seoul):
- Snapshot every 6 hours, 30-day retention (operational recovery)
DR Region (ap-southeast-1, Singapore):
- Replicate daily snapshots only, 7-day retention (cost optimisation)
- Enable Oplog distribution → PITR available in DR region as well
This approach maintains DR protection while significantly reducing unnecessary storage costs.
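The saving is easy to quantify. Under the example policy above, the DR region stores an order of magnitude fewer copies than the primary; a rough count, ignoring incremental-storage effects:

```shell
# Snapshot copies retained per region under the example policy
primary=$(( (24 / 6) * 30 ))   # 6-hourly snapshots kept 30 days -> 120
dr=$(( 1 * 7 ))                # daily copies kept 7 days        -> 7
echo "primary copies: $primary, DR copies: $dr"
echo "DR holds ~$(( dr * 100 / primary ))% as many snapshot copies"
```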
7. Backup Compliance Policy — Immutable Lock
To prevent backups themselves from being deleted — whether by ransomware or an insider threat — activate the Backup Compliance Policy.
Once active, the policy guarantees that:
- No user (including organisation owners) can delete snapshots or shorten retention periods
- Terminating a cluster does not remove its existing snapshots until their retention periods expire
- Continuous Cloud Backup cannot be disabled without a verification process through MongoDB support
WORM (Write Once Read Many) compliance: This feature is mandatory for financial, healthcare, and public-sector services subject to GDPR, HIPAA, SOC 2, or PCI-DSS requirements.
7.1 Configuring Compliance Policy via Atlas CLI
# Enable Backup Compliance Policy with an authorized user on record
# (a complete policy document can instead be applied in one step via
#  `atlas backups compliancePolicy setup --file <policy.json>`)
atlas backups compliancePolicy enable \
--projectId {projectId} \
--authorizedEmail "security@company.com" \
--authorizedUserFirstName "Security" \
--authorizedUserLastName "Officer"
# Require PITR for all clusters
atlas backups compliancePolicy pointInTimeRestores enable \
--projectId {projectId} \
--restoreWindowDays 7
# Enforce encryption at rest (require Customer Key Management)
atlas backups compliancePolicy encryptionAtRest enable \
--projectId {projectId}
8. Disaster Recovery Playbooks by Scenario
When disaster strikes, reading the runbook for the first time is already too late. These playbooks must be reviewed by the entire team during normal operations and practised through regular drills.
Scenario 1 — Accidental Collection Drop
Situation: A developer runs db.orders.drop() in production
| Item | Detail |
|---|---|
| Detection | Application error spike / APM alert |
| Expected RTO | 15-30 minutes (with Atlas PITR) |
| Expected RPO | Just before the drop (second-level recovery) |
[Immediate response]
1. Block write access to the affected database (prevent further changes)
2. Record the exact time of the drop (check logs and APM timestamps)
3. Atlas UI → Backup → Point in Time Restore
4. Enter Oplog Timestamp for 1 minute before the drop
5. Restore to a staging cluster first → validate data
6. Apply to production or use mongorestore to transplant the collection
7. Root cause analysis (review access permissions, add environment safeguards)
Scenario 2 — Bad Migration Script
Situation: A deployed schema migration script corrupts a large number of documents
| Item | Detail |
|---|---|
| Detection | Data validation query failures / user reports |
| Expected RTO | 30 minutes to 2 hours (depending on data volume) |
| Expected RPO | Just before the migration started |
[Immediate response]
1. Halt the migration script immediately
2. Assess the blast radius (which collections, how many documents)
3. Use the migration start time as the PITR recovery target
4. Atlas → Backup → Point in Time Restore
→ Enter Oplog Timestamp for 30 seconds before the migration started
5. Restore to a new cluster and validate data integrity
db.runCommand({ validate: "affected_collection" })
6. Switch traffic using a Blue-Green approach
7. Prevention: strengthen migration validation in staging environments
Scenario 3 — Full Replica Set Failure
Situation: A region-wide outage takes down all nodes in the cluster
| Item | Detail |
|---|---|
| Detection | Atlas alert / cloud provider status page |
| Expected RTO | 1-3 hours (depends on data volume and network throughput) |
| Expected RPO | Time of last Oplog capture (typically within 1 minute) |
[Immediate response]
1. Check the cloud provider status page (estimate regional recovery time)
2. If multi-region Snapshot Distribution is configured:
→ Immediately spin up a new cluster from the DR region snapshot (e.g., Singapore)
3. Update DNS or load balancer to point to the DR cluster
4. After the primary region recovers, run the data synchronisation procedure
5. Decide whether to roll back or switch back to the primary region
[Prerequisites]
- Pre-create a DR cluster and manage connection strings as environment variables
- Snapshot Distribution: DR region must be active before an incident
- Run DR recovery drills at least once per quarter
Scenario 4 — Ransomware Attack
Situation: An attacker encrypts data and attempts to delete backups
| Item | Detail |
|---|---|
| Detection | Anomalous access pattern / file encryption alert |
| Expected RTO | 2-4 hours |
| Expected RPO | Within the Backup Compliance Policy retention period |
[Immediate response]
1. Immediately revoke all API keys associated with the affected cluster
2. Emergency audit of Atlas access permissions; disable suspicious accounts
3. Verify that Backup Compliance Policy is active
→ If active, the attacker cannot delete backups
4. Identify the most recent uncontaminated snapshot
5. Restore to a fully isolated new project or organisation
6. Comprehensive data integrity check of the recovered cluster
7. Report to the security and legal teams; notify regulators if required
9. The 3-2-1 Backup Rule Applied to MongoDB
The 3-2-1 rule is the golden principle of backup strategy, and it applies directly to MongoDB.
| Rule | Meaning |
|---|---|
| 3 | Maintain at least three copies of the data |
| 2 | Store copies on at least two different storage media |
| 1 | Keep at least one copy off-site (different geographic location) |
Implementing 3-2-1 in a MongoDB Atlas Environment
In Atlas, the pieces map on naturally: the live cluster plus its snapshots give you multiple copies (3), cluster block storage versus snapshot object storage gives two media types (2), and Snapshot Distribution supplies the off-site copy (1). For extra isolation, or in self-managed environments, storing mongodump output in an S3 bucket under a separate AWS account adds account-level separation: even if the primary account is compromised, backups in the separate account remain safe.
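The rule can also be checked mechanically against an inventory of your backup copies. A sketch, using a hypothetical inventory where each copy is written as a location:media token:

```shell
# Hypothetical backup inventory: one "location:media" token per copy
copies="atlas-primary:block atlas-snapshots:object s3-archive:object"

total=0; media=""; offsite=0
for c in $copies; do
  total=$(( total + 1 ))
  m=${c#*:}                    # media type is the part after the colon
  case " $media " in
    *" $m "*) ;;               # media type already counted
    *) media="$media $m" ;;
  esac
  case $c in
    s3-archive:*) offsite=$(( offsite + 1 )) ;;  # copy outside the primary site
  esac
done
media_count=$(set -- $media; echo $#)

echo "copies=$total media=$media_count offsite=$offsite"
if [ "$total" -ge 3 ] && [ "$media_count" -ge 2 ] && [ "$offsite" -ge 1 ]; then
  echo "3-2-1 satisfied"
else
  echo "3-2-1 NOT satisfied"
fi
```

Which locations count as "off-site" depends on your threat model; here only the separate-account S3 archive qualifies.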
10. Final Checklist & Strategy Selection Guide
MongoDB Backup Strategy Audit Checklist
Any unchecked items below represent gaps to close before the next incident.
Fundamentals (required in any environment)
- Is backup automated, and is the cron/schedule confirmed to be running?
- Is the backup target a Secondary node? (to avoid adding load to the Primary)
- Do backup file names include timestamps for identification?
- Are backup files stored in a physically or logically separate location from the production database?
- Are backup files encrypted?
- Are alerts sent when a backup job fails?
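For file-based backups (Part 1 style), the "is it actually running" and "do we alert on failure" items can be combined into a small freshness probe. A sketch, using a temporary directory with a 2-hour-old file as a stand-in for the real backup location (GNU `find`/`touch` assumed):

```shell
# Stand-in backup directory containing one 2-hour-old dump file
backup_dir=$(mktemp -d)
touch -d '2 hours ago' "$backup_dir/mongodump-demo.archive.gz"

# Age (in seconds) of the newest file in the directory
newest=$(find "$backup_dir" -type f -printf '%T@\n' | sort -rn | head -1)
age=$(( $(date +%s) - ${newest%.*} ))

# Alert (here just an echo; in practice a pager or Slack webhook)
if [ "$age" -gt 86400 ]; then
  echo "ALERT: newest backup is ${age}s old"
else
  echo "OK: newest backup is ${age}s old"
fi
```

Run this from cron on a host other than the backup producer, so a silently dead backup job still triggers the alert.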
Restore validation (the most commonly skipped category)
- Has an actual restore test been performed within the last month?
- Is a partial restore procedure (at the collection level) documented and tested?
- Did the restore test meet the expected RTO target?
- Are data integrity validation queries ready to run post-restore?
Advanced operational items
- Is PITR (Point-in-Time Recovery) enabled?
- Is multi-region Snapshot Distribution configured?
- Is Backup Compliance Policy active? (required for compliance environments)
- Are RPO/RTO targets documented and known by the entire team?
- Is the disaster recovery Runbook being kept up to date?
- Is a DR drill being conducted at least once per quarter?
Security
- Is a dedicated backup user account configured with minimum necessary permissions?
- Are backup API keys managed separately from other keys?
- Is an immutable (WORM) setting applied to the backup bucket or storage?
Environment-by-Environment Strategy Selection Guide
| Environment | Recommended strategy |
|---|---|
| Development / staging | Backup disabled, or ad-hoc mongodump for seed data |
| Small self-managed production | Scheduled mongodump/mongorestore (Part 1) |
| Large or sharded self-managed production | LVM/EBS snapshots or PBM (Part 2) |
| Cloud-managed production | Atlas Cloud Backup + PITR (Part 3) |
| Regulated environments (finance, healthcare, public sector) | Atlas Cloud Backup + Snapshot Distribution + Backup Compliance Policy |
11. Closing Thoughts
Across three parts, we have covered the full spectrum of MongoDB backup and recovery.
| Part | Core tools | Best-fit environment |
|---|---|---|
| Part 1 | mongodump / mongorestore | Small-scale, portable, simple deployments |
| Part 2 | LVM/EBS snapshots, PBM | Large on-premises, sharded clusters |
| Part 3 | Atlas Cloud Backup, PITR | Cloud-managed, compliance environments |
Regardless of which tool you choose, one principle stands above all others.
"An untested backup is not a backup."
The existence of a backup file and the ability to actually recover from it are entirely different things. Run restore tests regularly — at least monthly — record the results in your Runbook, and make sure the entire team knows the procedure.
Disasters arrive without warning. The person who stays calmest is the one who has practised the most.