Monday, May 4, 2026

MongoDB Backup & Recovery Guide Part 3 — Atlas Cloud Backup, PITR & Disaster Recovery Playbook

Where Part 1 covered mongodump and Part 2 covered LVM/EBS snapshots and PBM, Part 3 examines MongoDB Atlas Cloud Backup, a fully managed service that takes the operational work of backup management off your hands. It explains how to design Oplog-based Point-in-Time Recovery (PITR) down to the second, configure multi-region Snapshot Distribution, and enforce immutable backups with a Backup Compliance Policy. Four disaster recovery playbooks — accidental collection drop, bad migration script, region-wide outage, and ransomware — are paired with scenario-specific RTO and RPO targets. The part closes with a comprehensive audit checklist and an environment-by-environment strategy selection guide.


Table of Contents

  1. Why Atlas Backup? — The Value of Managed Infrastructure
  2. Atlas Backup Architecture Overview
  3. Enabling Cloud Backup — UI, API & Terraform
  4. Snapshot Retention Policy Design Guide
  5. Point-in-Time Recovery (PITR) Deep Dive
  6. Multi-Region Snapshot Distribution
  7. Backup Compliance Policy — Immutable Lock
  8. Disaster Recovery Playbooks by Scenario
  9. The 3-2-1 Backup Rule Applied to MongoDB
  10. Final Checklist & Strategy Selection Guide
  11. Closing Thoughts

1. Why Atlas Backup? — The Value of Managed Infrastructure

Part 1 covered mongodump. Part 2 covered LVM/EBS snapshots and PBM.

All of these are solid tools, but they have one thing in common: you install, configure, and maintain them yourself.

MongoDB Atlas Cloud Backup removes that operational burden entirely.

For most organisations, the engineering cost of building reliable, automated backup for a sharded cluster far exceeds the cost of a managed service like Atlas.

The core value of Atlas Cloud Backup:

  • Full automation: Schedule once — creation, retention, and deletion are all handled automatically
  • Incremental snapshots: Uses the cloud provider's native snapshot mechanism, so every backup is incremental by default
  • Complete sharding support: Cluster-wide consistent snapshots across all shards, guaranteed automatically
  • PITR: Continuous Oplog capture enables RPO of under one minute
  • Multi-region distribution: One click to replicate snapshots to a DR region automatically
  • M10+ required: available only on dedicated clusters (M10 and above); Free and Flex tiers are not supported

2. Atlas Backup Architecture Overview

Understanding how Atlas Cloud Backup works internally makes policy design significantly easier.

Snapshots are stored incrementally in the cloud provider's object storage. The Oplog Store continuously captures write operations, and combining a base snapshot with Oplog replay makes arbitrary point-in-time recovery possible. When Snapshot Distribution is enabled, snapshots in the Primary Region are automatically replicated to the DR Region.


3. Enabling Cloud Backup — UI, API & Terraform

3.1 Enabling via Atlas UI

  1. Select your cluster in the Atlas dashboard
  2. Go to Edit Configuration → Additional Settings
  3. Toggle Turn on Cloud Backup → On
  4. Toggle Continuous Cloud Backup → On (required for PITR)
  5. Click Save Changes

3.2 Enabling via Atlas Admin API

# Enable Cloud Backup and PITR together
curl -X PATCH \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}" \
  --digest -u "{publicKey}:{privateKey}" \
  -H "Content-Type: application/json" \
  -d '{
    "providerBackupEnabled": true,
    "pitEnabled": true
  }'
# Set backup schedule policy (snapshot every 6 hours, 7-day retention + 7-day PITR)
curl -X PUT \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/schedule" \
  --digest -u "{publicKey}:{privateKey}" \
  -H "Content-Type: application/json" \
  -d '{
    "referenceHourOfDay": 2,
    "referenceMinuteOfHour": 0,
    "restoreWindowDays": 7,
    "policies": [{
      "policyItems": [
        {
          "frequencyInterval": 6,
          "frequencyType": "hourly",
          "retentionUnit": "days",
          "retentionValue": 7
        },
        {
          "frequencyInterval": 1,
          "frequencyType": "daily",
          "retentionUnit": "days",
          "retentionValue": 14
        },
        {
          "frequencyInterval": 6,
          "frequencyType": "weekly",
          "retentionUnit": "weeks",
          "retentionValue": 4
        },
        {
          "frequencyInterval": 40,
          "frequencyType": "monthly",
          "retentionUnit": "months",
          "retentionValue": 12
        }
      ]
    }]
  }'

3.3 Managing via Terraform (IaC)

In infrastructure-as-code environments, Terraform lets you version-control your backup policies alongside the rest of your infrastructure.

# main.tf

resource "mongodbatlas_cluster" "production" {
  project_id                  = var.atlas_project_id
  name                        = "prod-cluster"
  provider_name               = "AWS"
  provider_region_name        = "AP_SOUTHEAST_2"
  provider_instance_size_name = "M30"

  cloud_backup = true
  pit_enabled  = true
}

resource "mongodbatlas_cloud_backup_schedule" "production_schedule" {
  project_id   = var.atlas_project_id
  cluster_name = mongodbatlas_cluster.production.name

  reference_hour_of_day    = 2
  reference_minute_of_hour = 0
  restore_window_days      = 7

  policy_item_hourly {
    frequency_interval = 6
    retention_unit     = "days"
    retention_value    = 7
  }

  policy_item_daily {
    frequency_interval = 1
    retention_unit     = "days"
    retention_value    = 14
  }

  policy_item_weekly {
    frequency_interval = 6   # Saturday
    retention_unit     = "weeks"
    retention_value    = 4
  }

  policy_item_monthly {
    frequency_interval = 40  # Last day of the month
    retention_unit     = "months"
    retention_value    = 12
  }
}

4. Snapshot Retention Policy Design Guide

Atlas supports five snapshot frequencies (hourly / daily / weekly / monthly / yearly), each with an independently configurable retention period.

Atlas Default Retention Policy (recommended starting point)

Frequency | Default Retention | Purpose
Hourly | 2 days | Short-term recovery from operational mistakes
Daily | 7 days | General data corruption recovery
Weekly | 4 weeks | Buffer for weekly deployments and migration errors
Monthly | 12 months | Monthly reporting and audit requirements
Yearly | Maximum configurable | Legal compliance (GDPR, HIPAA, SOC 2)
PITR Restore Window | 7 days (default) | Arbitrary point-in-time recovery range

Recommended Policies by Business Criticality

General production services (e-commerce, SaaS)

RPO target: under 1 hour
- Hourly snapshots: every 6 hours, 7-day retention
- Daily snapshots: 14-day retention
- Weekly snapshots: 4-week retention
- PITR: 7-day restore window

Financial and healthcare services (strict compliance)

RPO target: near-zero (up to last transaction)
- Hourly snapshots: every 2 hours, 7-day retention
- Daily snapshots: 1-month retention
- Monthly snapshots: 12-month retention (for fast Atlas-side recovery)
- Monthly snapshots → S3 archive: 5-year long-term retention
- PITR: 2-day restore window (shorter window reduces restore time)
- Backup Compliance Policy: mandatory

Development and staging environments

Backup disabled — recommended
(MongoDB's official documentation also states backup is unnecessary for dev/test environments)

5. Point-in-Time Recovery (PITR) Deep Dive

5.1 How PITR Works

Atlas PITR combines a base snapshot with Oplog replay.

A shorter restore window means less Oplog to replay, which directly reduces the actual recovery time (RTO). For instance, if snapshots are taken every 2 hours, you only ever need to replay up to 2 hours of Oplog.
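The trade-off can be sketched with back-of-the-envelope arithmetic. All the figures below are illustrative assumptions, not Atlas guarantees:

```shell
# Worst-case Oplog replay is bounded by the snapshot interval:
# the restore starts from the nearest base snapshot before the target time.
snapshot_interval_hours=6   # snapshot every 6 hours
oplog_gb_per_hour=2         # assumed write volume captured in the Oplog
replay_gb_per_hour=10       # assumed Oplog replay throughput during restore

worst_case_oplog_gb=$(( snapshot_interval_hours * oplog_gb_per_hour ))
replay_minutes=$(( worst_case_oplog_gb * 60 / replay_gb_per_hour ))

echo "Worst-case Oplog to replay: ${worst_case_oplog_gb} GB"
echo "Estimated replay time:      ${replay_minutes} minutes"
```

Halving the snapshot interval halves both figures, which is why section 4 recommends a 2-hour interval for strict-RPO services.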

5.2 PITR Recovery Procedure — UI Walkthrough

Scenario: At 10:26 AM on April 14, 2026, a developer accidentally dropped the payments collection in production. You need to recover to the state at 10:25:55 AM.

Step 1. Select the affected cluster in Atlas UI

Step 2. Click Backup in the left sidebar

Step 3. Click Point in Time Restore

Step 4. Choose recovery method

Recovery method:
  ● Date & Time       → minute-level precision
  ○ Oplog Timestamp   → second-level precision  ← recommended for precise recovery

Step 5. Select the Oplog Timestamp tab and enter the value

Timestamp (seconds since epoch): 1776162355
Increment: 1

(2026-04-14T10:25:55 UTC = Unix timestamp 1776162355)

Step 6. Select the recovery target (overwrite existing cluster or restore to a new cluster)

Step 7. Click Restore and monitor until complete
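Computing the epoch value by hand is error-prone. With GNU coreutils, `date` can derive it from the human-readable target time (the `-d` flag is GNU-specific; macOS/BSD `date` uses `-j -f` instead):

```shell
# Derive the Oplog Timestamp from the target recovery time.
# -u forces UTC; -d parses the ISO 8601 string (GNU date only).
TARGET="2026-04-14T10:25:55Z"
OPLOG_TS=$(date -u -d "$TARGET" +%s)
echo "Oplog Timestamp: $OPLOG_TS"
```

Enter the printed value in the Timestamp field, with Increment 1.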

5.3 Automating PITR Recovery via Atlas Admin API

# PITR recovery using Oplog Timestamp (REST API)
curl -X POST \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/restoreJobs" \
  --digest -u "{publicKey}:{privateKey}" \
  -H "Content-Type: application/json" \
  -d '{
    "delivery": {
      "methodName": "AUTOMATED_RESTORE",
      "targetClusterName": "prod-recovery-cluster",
      "targetGroupId": "{targetGroupId}"
    },
    "oplogTs": 1744603555,
    "oplogInc": 1
  }'

# Monitor restore job status
RESTORE_JOB_ID="..."
curl -X GET \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{groupId}/clusters/{clusterName}/backup/restoreJobs/$RESTORE_JOB_ID" \
  --digest -u "{publicKey}:{privateKey}"
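A restore job is finished once the response reports completion. The sketch below only classifies a response body; the `failed` and `finishedAt` field names are assumptions to verify against the Atlas Admin API reference for your API version:

```shell
# Classify a restore-job response body. For simplicity this uses shell
# pattern matching on compact JSON; a production script should parse
# with jq and confirm the field names against the Admin API reference.
check_restore_job() {
  case "$1" in
    *'"failed":true'*)  echo "FAILED" ;;
    *'"finishedAt":"'*) echo "DONE" ;;
    *)                  echo "IN_PROGRESS" ;;
  esac
}

# In a polling loop the body would come from the GET call shown above:
#   body=$(curl -s --digest -u "{publicKey}:{privateKey}" "$JOB_URL")
sample='{"id":"abc123","failed":false,"finishedAt":"2026-04-14T11:02:00Z"}'
check_restore_job "$sample"
```

A real loop would sleep between polls and bail out after a deadline tied to your RTO target.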

5.4 Key Considerations When Designing PITR

Restore window vs. snapshot interval

The restore window (restoreWindowDays) cannot exceed the hourly snapshot retention period. For example, if hourly snapshots are retained for 7 days, the PITR restore window is capped at 7 days.
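This constraint is easy to violate when a schedule is edited piecemeal, so a pre-flight check in your deployment script is cheap insurance. A minimal sketch, using the values from section 3.2:

```shell
# Fail fast if the requested PITR window exceeds hourly snapshot retention.
RESTORE_WINDOW_DAYS=7
HOURLY_RETENTION_DAYS=7

if [ "$RESTORE_WINDOW_DAYS" -gt "$HOURLY_RETENTION_DAYS" ]; then
  echo "ERROR: restoreWindowDays ($RESTORE_WINDOW_DAYS) > hourly retention ($HOURLY_RETENTION_DAYS)" >&2
  exit 1
fi
echo "OK: restore window fits inside hourly snapshot retention"
```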

Caution when disabling PITR

Disabling Continuous Cloud Backup in Atlas immediately deletes all existing Oplog data. If a Backup Compliance Policy is active, disabling it requires a verification process through MongoDB support.

NVMe storage restore time

Clusters using NVMe storage have physically attached volumes, which can make restore operations slower than on clusters with network-attached storage.


6. Multi-Region Snapshot Distribution

To protect against a single-region failure (an entire cloud provider region going down), snapshots must be automatically replicated to a geographically separate region.

6.1 Enabling Snapshot Distribution

Atlas UI path:

Clusters → select cluster → Backup Policy tab → Additional Backup Copies section → toggle Copy to other regions → select target region

Note: Only other regions within the same cloud provider can be selected. Example: AWS ap-northeast-2 → AWS ap-southeast-1

6.2 Cross-Region Policy Configuration Example

Atlas lets you independently control which snapshots to replicate and how long to retain the copies in the DR region.

Primary Region (ap-northeast-2, Seoul):
  - Snapshot every 6 hours, 30-day retention (operational recovery)

DR Region (ap-southeast-1, Singapore):
  - Replicate daily snapshots only, 7-day retention (cost optimisation)
  - Enable Oplog distribution → PITR available in DR region as well

This approach maintains DR protection while significantly reducing unnecessary storage costs.


7. Backup Compliance Policy — Immutable Lock

To prevent backups themselves from being deleted — whether by ransomware or an insider threat — activate the Backup Compliance Policy.

Once active, the following become impossible:

  • Deleting snapshots or shortening retention periods (applies to every user, including organisation owners)
  • Removing existing snapshots before their retention period expires, even by terminating the cluster
  • Disabling Continuous Cloud Backup without a verification process through MongoDB support

WORM (Write Once Read Many) compliance: This feature is mandatory for financial, healthcare, and public-sector services subject to GDPR, HIPAA, SOC 2, or PCI-DSS requirements.

7.1 Configuring Compliance Policy via Atlas CLI

# Enable Backup Compliance Policy
# Note: detailed policy options can be supplied via --file <policy.json>
atlas backups compliancePolicy setup \
  --projectId {projectId} \
  --authorizedEmail "security@company.com" \
  --authorizedUserFirstName "Security" \
  --authorizedUserLastName "Officer"

# Require PITR for all clusters
atlas backups compliancePolicy pointInTimeRestores enable \
  --projectId {projectId} \
  --restoreWindowDays 7

# Enforce encryption at rest (require Customer Key Management)
atlas backups compliancePolicy encryptionAtRest enable \
  --projectId {projectId}

8. Disaster Recovery Playbooks by Scenario

When disaster strikes, reading the runbook for the first time is already too late. These playbooks must be reviewed by the entire team during normal operations and practised through regular drills.


Scenario 1 — Accidental Collection Drop

Situation: A developer runs db.orders.drop() in production

Item | Detail
Detection | Application error spike / APM alert
Expected RTO | 15-30 minutes (with Atlas PITR)
Expected RPO | Just before the drop (second-level recovery)
[Immediate response]
1. Block write access to the affected database (prevent further changes)
2. Record the exact time of the drop (check logs and APM timestamps)
3. Atlas UI → Backup → Point in Time Restore
4. Enter Oplog Timestamp for 1 minute before the drop
5. Restore to a staging cluster first → validate data
6. Apply to production or use mongorestore to transplant the collection
7. Root cause analysis (review access permissions, add environment safeguards)
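Step 6's collection transplant can be scripted ahead of time. A sketch with placeholder URIs and a placeholder dump directory (the leading `echo` makes it a dry run; remove it to execute):

```shell
# Placeholders: adjust URIs, namespace, and dump directory to your setup.
SRC_URI="mongodb+srv://staging-recovery.example.mongodb.net"
DST_URI="mongodb+srv://prod-cluster.example.mongodb.net"
NS="shop.orders"
DUMP_DIR="./recovered-orders"

# Dump only the recovered collection from the restored cluster ...
echo mongodump --uri="$SRC_URI" --db="${NS%%.*}" --collection="${NS##*.}" --out="$DUMP_DIR"
# ... then restore exactly that namespace into production.
echo mongorestore --uri="$DST_URI" --nsInclude="$NS" --drop "$DUMP_DIR"
```

`--drop` replaces any partial collection left behind in production; omit it if the namespace no longer exists there.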

Scenario 2 — Bad Migration Script

Situation: A deployed schema migration script corrupts a large number of documents

Item | Detail
Detection | Data validation query failures / user reports
Expected RTO | 30 minutes to 2 hours (depending on data volume)
Expected RPO | Just before the migration started
[Immediate response]
1. Halt the migration script immediately
2. Assess the blast radius (which collections, how many documents)
3. Use the migration start time as the PITR recovery target
4. Atlas → Backup → Point in Time Restore
   → Enter Oplog Timestamp for 30 seconds before the migration started
5. Restore to a new cluster and validate data integrity
   db.runCommand({ validate: "affected_collection" })
6. Switch traffic using a Blue-Green approach
7. Prevention: strengthen migration validation in staging environments

Scenario 3 — Full Replica Set Failure

Situation: A region-wide outage takes down all nodes in the cluster

Item | Detail
Detection | Atlas alert / cloud provider status page
Expected RTO | 1-3 hours (depends on data volume and network throughput)
Expected RPO | Time of last Oplog capture (typically within 1 minute)
[Immediate response]
1. Check the cloud provider status page (estimate regional recovery time)
2. If multi-region Snapshot Distribution is configured:
   → Immediately spin up a new cluster from the DR region snapshot (e.g., Singapore)
3. Update DNS or load balancer to point to the DR cluster
4. After the primary region recovers, run the data synchronisation procedure
5. Decide whether to roll back or switch back to the primary region

[Prerequisites]
- Pre-create a DR cluster and manage connection strings as environment variables
- Snapshot Distribution: DR region must be active before an incident
- Run DR recovery drills at least once per quarter

Scenario 4 — Ransomware Attack

Situation: An attacker encrypts data and attempts to delete backups

Item | Detail
Detection | Anomalous access pattern / file encryption alert
Expected RTO | 2-4 hours
Expected RPO | Within the Backup Compliance Policy retention period
[Immediate response]
1. Immediately revoke all API keys associated with the affected cluster
2. Emergency audit of Atlas access permissions; disable suspicious accounts
3. Verify that Backup Compliance Policy is active
   → If active, the attacker cannot delete backups
4. Identify the most recent uncontaminated snapshot
5. Restore to a fully isolated new project or organisation
6. Comprehensive data integrity check of the recovered cluster
7. Report to the security and legal teams; notify regulators if required

9. The 3-2-1 Backup Rule Applied to MongoDB

The 3-2-1 rule is the golden principle of backup strategy, and it applies directly to MongoDB.

Rule | Meaning
3 | Maintain at least three copies of the data
2 | Store copies on at least two different storage media
1 | Keep at least one copy off-site (different geographic location)

Implementing 3-2-1 in a MongoDB Atlas Environment

In Atlas, the rule maps on naturally: the replica set plus its snapshots provide the three copies, cluster block storage and the provider's object storage are the two media, and Snapshot Distribution to a DR region supplies the off-site copy. In self-managed environments, storing mongodump output in an S3 bucket under a separate AWS account achieves account-level isolation. Even if the primary account is compromised, backups in the separate account remain safe.
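The self-managed off-site copy can be produced with a timestamped mongodump archive pushed to the isolated account. Host, bucket, and AWS profile below are placeholders, and the leading `echo` keeps each command a dry run:

```shell
# Placeholders throughout: host, bucket, and AWS profile are illustrative.
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
ARCHIVE="mongodb-backup-${STAMP}.archive.gz"

# Dump from a Secondary to avoid loading the Primary ...
echo mongodump --uri="mongodb://backup-user@db-secondary.example.com:27017" \
  --readPreference=secondary --archive="$ARCHIVE" --gzip
# ... and ship the archive to a bucket owned by a separate AWS account.
echo aws s3 cp "$ARCHIVE" "s3://company-db-backups-isolated/$ARCHIVE" \
  --profile backup-account
```

The timestamp in the archive name satisfies the identification item on the checklist below, and the separate `--profile` keeps backup credentials isolated from the primary account.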


10. Final Checklist & Strategy Selection Guide

MongoDB Backup Strategy Audit Checklist

Any unchecked items below represent gaps to close before the next incident.

Fundamentals (required in any environment)

  • Is backup automated, and is the cron/schedule confirmed to be running?
  • Is the backup target a Secondary node? (to avoid adding load to the Primary)
  • Do backup file names include timestamps for identification?
  • Are backup files stored in a physically or logically separate location from the production database?
  • Are backup files encrypted?
  • Are alerts sent when a backup job fails?

Restore validation (the most commonly skipped category)

  • Has an actual restore test been performed within the last month?
  • Is a partial restore procedure (at the collection level) documented and tested?
  • Did the restore test meet the expected RTO target?
  • Are data integrity validation queries ready to run post-restore?

Advanced operational items

  • Is PITR (Point-in-Time Recovery) enabled?
  • Is multi-region Snapshot Distribution configured?
  • Is Backup Compliance Policy active? (required for compliance environments)
  • Are RPO/RTO targets documented and known by the entire team?
  • Is the disaster recovery Runbook being kept up to date?
  • Is a DR drill being conducted at least once per quarter?

Security

  • Is a dedicated backup user account configured with minimum necessary permissions?
  • Are backup API keys managed separately from other keys?
  • Is an immutable (WORM) setting applied to the backup bucket or storage?

Environment-by-Environment Strategy Selection Guide


11. Closing Thoughts

Across three parts, we have covered the full spectrum of MongoDB backup and recovery.

Part | Core tools | Best-fit environment
Part 1 | mongodump / mongorestore | Small-scale, portable, simple deployments
Part 2 | LVM/EBS snapshots, PBM | Large on-premises, sharded clusters
Part 3 | Atlas Cloud Backup, PITR | Cloud-managed, compliance environments

Regardless of which tool you choose, one principle above all others applies.

"An untested backup is not a backup."

The existence of a backup file and the ability to actually recover from it are entirely different things. Run restore tests regularly — at least monthly — record the results in your Runbook, and make sure the entire team knows the procedure.

Disasters arrive without warning. The person who stays calmest is the one who has practised the most.
