MongoDB Backup & Recovery Guide Part 3 — Atlas Cloud Backup, PITR & Disaster Recovery Playbook
Where Part 1 covered mongodump and Part 2 covered LVM/EBS snapshots and PBM, Part 3 examines MongoDB Atlas Cloud Backup — which absorbs every operational burden of backup management into a fully managed service. It explains how to design Oplog-based Point-in-Time Recovery (PITR) down to the second, configure multi-region Snapshot Distribution, and enforce immutable backups with Backup Compliance Policy. Four disaster recovery playbooks — accidental collection drop, bad migration script, region-wide outage, and ransomware — are paired with scenario-specific RTO and RPO targets. The series closes with a comprehensive audit checklist and an environment-by-environment strategy selection guide.
Series outline
- Part 1 — From RTO, RPO & Oplog to mongodump/mongorestore in Practice
- Part 2 — Filesystem Snapshots (LVM·EBS), PBM & Automation Pipelines
- Part 3 — Atlas Cloud Backup, PITR & Disaster Recovery Playbook (this post)
Table of Contents
- Why Atlas Backup? — The Value of Managed Infrastructure
- Atlas Backup Architecture Overview
- Enabling Cloud Backup — UI, API & Terraform
- Snapshot Retention Policy Design Guide
- Point-in-Time Recovery (PITR) Deep Dive
- Multi-Region Snapshot Distribution
- Backup Compliance Policy — Immutable Lock
- Disaster Recovery Playbooks by Scenario
- The 3-2-1 Backup Rule Applied to MongoDB
- Final Checklist & Strategy Selection Guide
- Closing Thoughts
1. Why Atlas Backup? — The Value of Managed Infrastructure
Part 1 covered mongodump. Part 2 covered LVM/EBS snapshots and PBM.
All of these are solid tools, but they share one thing in common: you install, configure, and maintain them yourself.
MongoDB Atlas Cloud Backup removes that operational burden entirely.
For most organisations, the engineering cost of building reliable, automated backup for a sharded cluster far exceeds the cost of a managed service like Atlas.
The core value of Atlas Cloud Backup:
- Full automation: Schedule once — creation, retention, and deletion are all handled automatically
- Incremental snapshots: Uses the cloud provider's native snapshot mechanism, so every backup is incremental by default
- Complete sharding support: Cluster-wide consistent snapshots across all shards, guaranteed automatically
- PITR: Continuous Oplog capture enables RPO of under one minute
- Multi-region distribution: One click to replicate snapshots to a DR region automatically
- M10+ only: Available on all dedicated clusters (M10 and above); Free and Flex tiers are not supported
2. Atlas Backup Architecture Overview
Understanding how Atlas Cloud Backup works internally makes policy design significantly easier.
Snapshots are stored incrementally in the cloud provider's object storage. The Oplog Store continuously captures write operations, and combining a base snapshot with Oplog replay makes arbitrary point-in-time recovery possible. When Snapshot Distribution is enabled, snapshots in the Primary Region are automatically replicated to the DR Region.
3. Enabling Cloud Backup — UI, API & Terraform
3.1 Enabling via Atlas UI
- Select your cluster in the Atlas dashboard
- Go to Edit Configuration → Additional Settings
- Toggle Turn on Cloud Backup → On
- Toggle Continuous Cloud Backup → On (required for PITR)
- Click Save Changes
3.2 Enabling via Atlas Admin API
# Enable Cloud Backup and PITR together
curl -X PATCH \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}" \
--digest -u "{publicKey}:{privateKey}" \
-H "Content-Type: application/json" \
-d '{
"providerBackupEnabled": true,
"pitEnabled": true
}'
# Set the backup schedule policy (6-hourly / daily / weekly / monthly tiers, 7-day PITR window)
curl -X PUT \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/schedule" \
--digest -u "{publicKey}:{privateKey}" \
-H "Content-Type: application/json" \
-d '{
"referenceHourOfDay": 2,
"referenceMinuteOfHour": 0,
"restoreWindowDays": 7,
"policies": [{
"policyItems": [
{
"frequencyInterval": 6,
"frequencyType": "hourly",
"retentionUnit": "days",
"retentionValue": 7
},
{
"frequencyInterval": 1,
"frequencyType": "daily",
"retentionUnit": "days",
"retentionValue": 14
},
{
"frequencyInterval": 6,
"frequencyType": "weekly",
"retentionUnit": "weeks",
"retentionValue": 4
},
{
"frequencyInterval": 40,
"frequencyType": "monthly",
"retentionUnit": "months",
"retentionValue": 12
}
]
}]
}'
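As a sanity check on a schedule like the one above, it helps to estimate how many snapshots it retains at steady state. A back-of-envelope sketch mirroring the example payload:

```shell
# Approximate snapshots retained at steady state by the example schedule
hourly=$(( (24 / 6) * 7 ))   # every 6 hours, kept 7 days  -> 28
daily=$(( 1 * 14 ))          # daily, kept 14 days         -> 14
weekly=4                     # weekly, kept 4 weeks
monthly=12                   # monthly, kept 12 months
total=$(( hourly + daily + weekly + monthly ))
echo "approximate retained snapshots: $total"
```

Because Atlas snapshots are incremental, storage cost grows with write churn rather than linearly with the snapshot count, so a higher count is cheaper than it looks.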
3.3 Managing via Terraform (IaC)
In infrastructure-as-code environments, Terraform lets you version-control your backup policies alongside the rest of your infrastructure.
# main.tf
resource "mongodbatlas_cluster" "production" {
project_id = var.atlas_project_id
name = "prod-cluster"
provider_name = "AWS"
provider_region_name = "AP_SOUTHEAST_2"
provider_instance_size_name = "M30"
cloud_backup = true
pit_enabled = true
}
resource "mongodbatlas_cloud_backup_schedule" "production_schedule" {
project_id = var.atlas_project_id
cluster_name = mongodbatlas_cluster.production.name
reference_hour_of_day = 2
reference_minute_of_hour = 0
restore_window_days = 7
policy_item_hourly {
frequency_interval = 6
retention_unit = "days"
retention_value = 7
}
policy_item_daily {
frequency_interval = 1
retention_unit = "days"
retention_value = 14
}
policy_item_weekly {
frequency_interval = 6 # Saturday
retention_unit = "weeks"
retention_value = 4
}
policy_item_monthly {
frequency_interval = 40 # Last day of the month
retention_unit = "months"
retention_value = 12
}
}
4. Snapshot Retention Policy Design Guide
Atlas supports five snapshot frequencies (hourly / daily / weekly / monthly / yearly), each with an independently configurable retention period.
Atlas Default Retention Policy (recommended starting point)
| Frequency | Default Retention | Purpose |
|---|---|---|
| Hourly | 2 days | Short-term recovery from operational mistakes |
| Daily | 7 days | General data corruption recovery |
| Weekly | 4 weeks | Buffer for weekly deployments and migration errors |
| Monthly | 12 months | Monthly reporting and audit requirements |
| Yearly | Maximum configurable | Legal compliance (GDPR, HIPAA, SOC 2) |
| PITR Restore Window | 7 days (default) | Arbitrary point-in-time recovery range |
Recommended Policies by Business Criticality
General production services (e-commerce, SaaS)
RPO target: under 1 hour
- Hourly snapshots: every 6 hours, 7-day retention
- Daily snapshots: 14-day retention
- Weekly snapshots: 4-week retention
- PITR: 7-day restore window
Financial and healthcare services (strict compliance)
RPO target: near-zero (up to last transaction)
- Hourly snapshots: every 2 hours, 7-day retention
- Daily snapshots: 1-month retention
- Monthly snapshots: 12-month retention (for fast Atlas-side recovery)
- Monthly snapshots → S3 archive: 5-year long-term retention
- PITR: 2-day restore window (shorter window reduces restore time)
- Backup Compliance Policy: mandatory
Development and staging environments
Backup disabled — recommended
(MongoDB's official documentation also states backup is unnecessary for dev/test environments)
5. Point-in-Time Recovery (PITR) Deep Dive
5.1 How PITR Works
Atlas PITR combines a base snapshot with Oplog replay.
A shorter restore window means less Oplog to replay, which directly reduces the actual recovery time (RTO). For instance, if snapshots are taken every 2 hours, you only ever need to replay up to 2 hours of Oplog.
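The relationship between snapshot interval and replay time can be illustrated with a little shell arithmetic. The timestamps below are hypothetical (6-hourly snapshots around a morning recovery target), and GNU `date` is assumed:

```shell
# Recovery target and the two most recent 6-hourly snapshots around it
target=$(date -u -d '2026-04-14T10:25:55Z' +%s)
snapshots="$(date -u -d '2026-04-14T02:00:00Z' +%s) $(date -u -d '2026-04-14T08:00:00Z' +%s)"

# Pick the newest snapshot taken at or before the target
base=0
for s in $snapshots; do
  if [ "$s" -le "$target" ] && [ "$s" -gt "$base" ]; then base=$s; fi
done

# The Oplog that must be replayed is everything between base and target
replay=$(( target - base ))
echo "base snapshot:   $(date -u -d "@$base" +%FT%TZ)"
echo "oplog to replay: ${replay}s (~$(( replay / 60 )) min)"
```

Here the 08:00 snapshot is chosen and roughly 2.5 hours of Oplog must be replayed; with 2-hourly snapshots the worst case would shrink to 2 hours.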
5.2 PITR Recovery Procedure — UI Walkthrough
Scenario: At 10:26 AM on April 14, 2026, a developer accidentally dropped the payments collection in production. You need to recover to the state at 10:25:55 AM.
Step 1. Select the affected cluster in Atlas UI
Step 2. Click Backup in the left sidebar
Step 3. Click Point in Time Restore
Step 4. Choose recovery method
Recovery method:
● Date & Time → minute-level precision
○ Oplog Timestamp → second-level precision ← recommended for precise recovery
Step 5. Select the Oplog Timestamp tab and enter the value
Timestamp (seconds since epoch): 1776162355
Increment: 1
(2026-04-14T10:25:55 UTC = Unix timestamp 1776162355)
Step 6. Select the recovery target (overwrite existing cluster or restore to a new cluster)
Step 7. Click Restore and monitor until complete
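The Oplog Timestamp entered in Step 5 is just a Unix epoch value. On a machine with GNU `date` it can be derived, and double-checked for timezone slips, like this:

```shell
# Convert the recovery instant (UTC) to the epoch value Atlas expects
ts=$(date -u -d '2026-04-14T10:25:55Z' +%s)
echo "oplogTs=$ts"

# Convert back to confirm the value round-trips to the intended instant
date -u -d "@$ts" +%FT%TZ
```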
5.3 Automating PITR Recovery via Atlas Admin API
# PITR recovery using Oplog Timestamp (REST API)
curl -X POST \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/restoreJobs" \
--digest -u "{publicKey}:{privateKey}" \
-H "Content-Type: application/json" \
-d '{
"delivery": {
"methodName": "AUTOMATED_RESTORE",
"targetClusterName": "prod-recovery-cluster",
"targetGroupId": "{targetGroupId}"
},
"oplogTs": 1744603555,
"oplogInc": 1
}'
# Monitor restore job status
RESTORE_JOB_ID="..."
curl -X GET \
"https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/restoreJobs/$RESTORE_JOB_ID" \
--digest -u "{publicKey}:{privateKey}"
5.4 Key Considerations When Designing PITR
Restore window vs. snapshot interval
The restore window (restoreWindowDays) cannot exceed the hourly snapshot retention period. For example, if hourly snapshots are retained for 7 days, the PITR restore window is capped at 7 days.
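This constraint is easy to validate before submitting a schedule. A minimal pre-flight check, with hypothetical values matching the earlier example payload:

```shell
# Values you intend to submit in the schedule payload
restore_window_days=7
hourly_retention_days=7

# The PITR window must fit inside the hourly snapshot retention period
if [ "$restore_window_days" -gt "$hourly_retention_days" ]; then
  echo "INVALID: restoreWindowDays ($restore_window_days) exceeds hourly retention ($hourly_retention_days)"
  valid=0
else
  echo "OK: PITR window fits inside hourly snapshot retention"
  valid=1
fi
```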
Caution when disabling PITR
Disabling Continuous Cloud Backup in Atlas immediately deletes all existing Oplog data. If a Backup Compliance Policy is active, disabling it requires a verification process through MongoDB support.
NVMe storage restore time
Clusters using NVMe storage keep data on locally attached disks, so a restore must copy the data onto each node's local volumes. This can make restores noticeably slower than on clusters backed by network-attached virtual storage.
6. Multi-Region Snapshot Distribution
To protect against a single-region failure (an entire cloud provider region going down), snapshots must be automatically replicated to a geographically separate region.
6.1 Enabling Snapshot Distribution
Atlas UI path:
Clusters → select cluster → Backup Policy tab → Additional Backup Copies section → toggle Copy to other regions → select target region
Note: Only other regions within the same cloud provider can be selected. Example: AWS ap-northeast-2 → AWS ap-southeast-1
6.2 Cross-Region Policy Configuration Example
Atlas lets you independently control which snapshots to replicate and how long to retain the copies in the DR region.
Primary Region (ap-northeast-2, Seoul):
- Snapshot every 6 hours, 30-day retention (operational recovery)
DR Region (ap-southeast-1, Singapore):
- Replicate daily snapshots only, 7-day retention (cost optimisation)
- Enable Oplog distribution → PITR available in DR region as well
This approach maintains DR protection while significantly reducing unnecessary storage costs.
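The saving is easy to quantify. Under the example policy above, the DR region stores an order of magnitude fewer copies than the primary; a rough count, ignoring incremental-storage effects:

```shell
# Snapshot copies retained per region under the example policy
primary=$(( (24 / 6) * 30 ))   # 6-hourly snapshots kept 30 days -> 120
dr=$(( 1 * 7 ))                # daily copies kept 7 days        -> 7
echo "primary copies: $primary, DR copies: $dr"
echo "DR holds ~$(( dr * 100 / primary ))% as many snapshot copies"
```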
7. Backup Compliance Policy — Immutable Lock
To prevent backups themselves from being deleted — whether by ransomware or an insider threat — activate the Backup Compliance Policy.
Once active, the policy guarantees that:
- No user (including organisation owners) can delete snapshots or shorten retention periods
- Terminating a cluster does not remove its existing snapshots until their retention periods expire
- Continuous Cloud Backup cannot be disabled without a verification process through MongoDB support
WORM (Write Once Read Many) compliance: This feature is mandatory for financial, healthcare, and public-sector services subject to GDPR, HIPAA, SOC 2, or PCI-DSS requirements.
7.1 Configuring Compliance Policy via Atlas CLI
# Enable Backup Compliance Policy with an authorized user on record
# (a complete policy document can instead be applied in one step via
#  `atlas backups compliancePolicy setup --file <policy.json>`)
atlas backups compliancePolicy enable \
--projectId {projectId} \
--authorizedEmail "security@company.com" \
--authorizedUserFirstName "Security" \
--authorizedUserLastName "Officer"
# Require PITR for all clusters
atlas backups compliancePolicy pointInTimeRestores enable \
--projectId {projectId} \
--restoreWindowDays 7
# Enforce encryption at rest (require Customer Key Management)
atlas backups compliancePolicy encryptionAtRest enable \
--projectId {projectId}
8. Disaster Recovery Playbooks by Scenario
When disaster strikes, reading the runbook for the first time is already too late. These playbooks must be reviewed by the entire team during normal operations and practised through regular drills.
Scenario 1 — Accidental Collection Drop
Situation: A developer runs db.orders.drop() in production
| Item | Detail |
|---|---|
| Detection | Application error spike / APM alert |
| Expected RTO | 15-30 minutes (with Atlas PITR) |
| Expected RPO | Just before the drop (second-level recovery) |
[Immediate response]
1. Block write access to the affected database (prevent further changes)
2. Record the exact time of the drop (check logs and APM timestamps)
3. Atlas UI → Backup → Point in Time Restore
4. Enter Oplog Timestamp for 1 minute before the drop
5. Restore to a staging cluster first → validate data
6. Apply to production or use mongorestore to transplant the collection
7. Root cause analysis (review access permissions, add environment safeguards)
Scenario 2 — Bad Migration Script
Situation: A deployed schema migration script corrupts a large number of documents
| Item | Detail |
|---|---|
| Detection | Data validation query failures / user reports |
| Expected RTO | 30 minutes to 2 hours (depending on data volume) |
| Expected RPO | Just before the migration started |
[Immediate response]
1. Halt the migration script immediately
2. Assess the blast radius (which collections, how many documents)
3. Use the migration start time as the PITR recovery target
4. Atlas → Backup → Point in Time Restore
→ Enter Oplog Timestamp for 30 seconds before the migration started
5. Restore to a new cluster and validate data integrity
db.runCommand({ validate: "affected_collection" })
6. Switch traffic using a Blue-Green approach
7. Prevention: strengthen migration validation in staging environments
Scenario 3 — Full Replica Set Failure
Situation: A region-wide outage takes down all nodes in the cluster
| Item | Detail |
|---|---|
| Detection | Atlas alert / cloud provider status page |
| Expected RTO | 1-3 hours (depends on data volume and network throughput) |
| Expected RPO | Time of last Oplog capture (typically within 1 minute) |
[Immediate response]
1. Check the cloud provider status page (estimate regional recovery time)
2. If multi-region Snapshot Distribution is configured:
→ Immediately spin up a new cluster from the DR region snapshot (e.g., Singapore)
3. Update DNS or load balancer to point to the DR cluster
4. After the primary region recovers, run the data synchronisation procedure
5. Decide whether to roll back or switch back to the primary region
[Prerequisites]
- Pre-create a DR cluster and manage connection strings as environment variables
- Snapshot Distribution: DR region must be active before an incident
- Run DR recovery drills at least once per quarter
Scenario 4 — Ransomware Attack
Situation: An attacker encrypts data and attempts to delete backups
| Item | Detail |
|---|---|
| Detection | Anomalous access pattern / file encryption alert |
| Expected RTO | 2-4 hours |
| Expected RPO | Within the Backup Compliance Policy retention period |
[Immediate response]
1. Immediately revoke all API keys associated with the affected cluster
2. Emergency audit of Atlas access permissions; disable suspicious accounts
3. Verify that Backup Compliance Policy is active
→ If active, the attacker cannot delete backups
4. Identify the most recent uncontaminated snapshot
5. Restore to a fully isolated new project or organisation
6. Comprehensive data integrity check of the recovered cluster
7. Report to the security and legal teams; notify regulators if required
9. The 3-2-1 Backup Rule Applied to MongoDB
The 3-2-1 rule is the golden principle of backup strategy, and it applies directly to MongoDB.
| Rule | Meaning |
|---|---|
| 3 | Maintain at least three copies of the data |
| 2 | Store copies on at least two different storage media |
| 1 | Keep at least one copy off-site (different geographic location) |
Implementing 3-2-1 in a MongoDB Atlas Environment
In Atlas, the pieces map on naturally: the live cluster plus its snapshots give you multiple copies (3), cluster block storage versus snapshot object storage gives two media types (2), and Snapshot Distribution supplies the off-site copy (1). For extra isolation, or in self-managed environments, storing mongodump output in an S3 bucket under a separate AWS account adds account-level separation: even if the primary account is compromised, backups in the separate account remain safe.
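The rule can also be checked mechanically against an inventory of your backup copies. A sketch, using a hypothetical inventory where each copy is written as a location:media token:

```shell
# Hypothetical backup inventory: one "location:media" token per copy
copies="atlas-primary:block atlas-snapshots:object s3-archive:object"

total=0; media=""; offsite=0
for c in $copies; do
  total=$(( total + 1 ))
  m=${c#*:}                    # media type is the part after the colon
  case " $media " in
    *" $m "*) ;;               # media type already counted
    *) media="$media $m" ;;
  esac
  case $c in
    s3-archive:*) offsite=$(( offsite + 1 )) ;;  # copy outside the primary site
  esac
done
media_count=$(set -- $media; echo $#)

echo "copies=$total media=$media_count offsite=$offsite"
if [ "$total" -ge 3 ] && [ "$media_count" -ge 2 ] && [ "$offsite" -ge 1 ]; then
  echo "3-2-1 satisfied"
else
  echo "3-2-1 NOT satisfied"
fi
```

Which locations count as "off-site" depends on your threat model; here only the separate-account S3 archive qualifies.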
10. Final Checklist & Strategy Selection Guide
MongoDB Backup Strategy Audit Checklist
Any unchecked items below represent gaps to close before the next incident.
Fundamentals (required in any environment)
- Is backup automated, and is the cron/schedule confirmed to be running?
- Is the backup target a Secondary node? (to avoid adding load to the Primary)
- Do backup file names include timestamps for identification?
- Are backup files stored in a physically or logically separate location from the production database?
- Are backup files encrypted?
- Are alerts sent when a backup job fails?
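For file-based backups (Part 1 style), the "is it actually running" and "do we alert on failure" items can be combined into a small freshness probe. A sketch, using a temporary directory with a 2-hour-old file as a stand-in for the real backup location (GNU `find`/`touch` assumed):

```shell
# Stand-in backup directory containing one 2-hour-old dump file
backup_dir=$(mktemp -d)
touch -d '2 hours ago' "$backup_dir/mongodump-demo.archive.gz"

# Age (in seconds) of the newest file in the directory
newest=$(find "$backup_dir" -type f -printf '%T@\n' | sort -rn | head -1)
age=$(( $(date +%s) - ${newest%.*} ))

# Alert (here just an echo; in practice a pager or Slack webhook)
if [ "$age" -gt 86400 ]; then
  echo "ALERT: newest backup is ${age}s old"
else
  echo "OK: newest backup is ${age}s old"
fi
```

Run this from cron on a host other than the backup producer, so a silently dead backup job still triggers the alert.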
Restore validation (the most commonly skipped category)
- Has an actual restore test been performed within the last month?
- Is a partial restore procedure (at the collection level) documented and tested?
- Did the restore test meet the expected RTO target?
- Are data integrity validation queries ready to run post-restore?
Advanced operational items
- Is PITR (Point-in-Time Recovery) enabled?
- Is multi-region Snapshot Distribution configured?
- Is Backup Compliance Policy active? (required for compliance environments)
- Are RPO/RTO targets documented and known by the entire team?
- Is the disaster recovery Runbook being kept up to date?
- Is a DR drill being conducted at least once per quarter?
Security
- Is a dedicated backup user account configured with minimum necessary permissions?
- Are backup API keys managed separately from other keys?
- Is an immutable (WORM) setting applied to the backup bucket or storage?
Environment-by-Environment Strategy Selection Guide
| Environment | Recommended strategy |
|---|---|
| Development / staging | Backup disabled, or ad-hoc mongodump for seed data |
| Small self-managed production | Scheduled mongodump/mongorestore (Part 1) |
| Large or sharded self-managed production | LVM/EBS snapshots or PBM (Part 2) |
| Cloud-managed production | Atlas Cloud Backup + PITR (Part 3) |
| Regulated environments (finance, healthcare, public sector) | Atlas Cloud Backup + Snapshot Distribution + Backup Compliance Policy |
11. Closing Thoughts
Across three parts, we have covered the full spectrum of MongoDB backup and recovery.
| Part | Core tools | Best-fit environment |
|---|---|---|
| Part 1 | mongodump / mongorestore | Small-scale, portable, simple deployments |
| Part 2 | LVM/EBS snapshots, PBM | Large on-premises, sharded clusters |
| Part 3 | Atlas Cloud Backup, PITR | Cloud-managed, compliance environments |
Regardless of which tool you choose, one principle stands above all others.
"An untested backup is not a backup."
The existence of a backup file and the ability to actually recover from it are entirely different things. Run restore tests regularly — at least monthly — record the results in your Runbook, and make sure the entire team knows the procedure.
Disasters arrive without warning. The person who stays calmest is the one who has practised the most.