Data Backup and Disaster Recovery

Master data backup and disaster recovery for cloud-native SaaS. Learn RPO/RTO, point-in-time recovery, cross-region replication, and automated verification for data protection.

TL;DR

  • RPO & RTO: RPO = max data loss (e.g., 1 hr). RTO = max downtime (e.g., 15 min). Critical systems need minutes/seconds; internal tools can tolerate hours.
  • Backup Strategy: Full + incremental + continuous archiving (WAL/oplog) = point-in-time recovery (PITR).
  • Tools: Velero (K8s), CloudNativePG (PostgreSQL), MongoDB Operator.
  • Cross-Region Replication: Protects against regional outages. Replicate backups or maintain standby clusters elsewhere.
  • Verification: Mandatory. Unverified backups don't exist. Automate restore tests + run quarterly drills.
  • Cost Optimization: Tiered storage + compression. Lifecycle policies: STANDARD_IA (30 days) → GLACIER (90 days) → DEEP_ARCHIVE (180 days) → expire (365 days). S3 does not allow a transition to STANDARD_IA before an object is 30 days old.

Disaster recovery planning protects against catastrophic failures that threaten data availability. Hardware failures, software bugs, human errors, security breaches, and natural disasters all pose risks to production systems. Cloud-native backup strategies leverage distributed infrastructure and object storage to build resilient recovery mechanisms.

Backup Strategy Fundamentals

Effective backup strategies balance recovery objectives against infrastructure costs and operational complexity. Two key metrics define backup requirements.

Recovery Point Objective (RPO) measures maximum acceptable data loss. An RPO of 1 hour means tolerating loss of up to one hour of data during disaster scenarios. An RPO of zero requires synchronous replication with no data loss tolerance.

Recovery Time Objective (RTO) measures maximum acceptable downtime. An RTO of 15 minutes requires hot standby systems ready to take over immediately. An RTO of 24 hours allows time for manual intervention and restoration from backups.

RPO and RTO timeline: RPO measures data loss window (10 min), RTO measures downtime (30 min).

Business requirements drive these metrics. E-commerce platforms during peak shopping seasons might require RPO of minutes and RTO of seconds. Internal tools with lower criticality might tolerate hours of data loss and downtime.
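
To make the two metrics concrete, here is a minimal Python sketch that checks a backup setup against its objectives. The names `RecoveryObjectives` and `meets_objectives` are illustrative, and the check assumes the worst-case data loss equals one full backup (or log-shipping) interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable downtime


def meets_objectives(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    estimated_restore_time: timedelta,
) -> bool:
    """Return True if worst-case loss and restore time fit within RPO and RTO.

    Worst-case data loss is taken to be one full backup interval, i.e. the
    failure happens just before the next backup would have run.
    """
    return (
        backup_interval <= objectives.rpo
        and estimated_restore_time <= objectives.rto
    )


# Example objectives from the text: RPO of 1 hour, RTO of 15 minutes.
checkout = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(minutes=15))

# Daily backups restored in 2 hours: misses both objectives.
print(meets_objectives(checkout, timedelta(hours=24), timedelta(hours=2)))  # False

# 5-minute log shipping with a 10-minute failover: fits.
print(meets_objectives(checkout, timedelta(minutes=5), timedelta(minutes=10)))  # True
```

A check like this is only as good as the restore-time estimate fed into it, which is why measured drill results are a better input than guesses.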

Backup Types and Approaches

Different backup types serve different recovery scenarios and have distinct storage and performance characteristics.

Full backups capture the complete database state at a point in time. They make recovery straightforward but consume significant storage and bandwidth: before compression, a 500GB PostgreSQL database needs about 500GB for each full backup.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-postgres
spec:
  instances: 3
  storage:
    size: 200Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/postgres-prod
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      data:
        compression: gzip
        immediateCheckpoint: true
        jobs: 4
      wal:
        compression: gzip
        maxParallel: 2
    retentionPolicy: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-backup
spec:
  schedule: "0 0 2 * * *"  # CloudNativePG uses a six-field cron expression; the first field is seconds
  backupOwnerReference: self
  cluster:
    name: production-postgres

Incremental backups capture only the changes since the last backup. They reduce storage costs and backup duration but complicate restores, which need the full backup plus every subsequent incremental.
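
The trade-off can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a sizing tool: it ignores compression, and the 10GB of daily change and the weekly full backup are assumptions chosen for illustration.

```python
def full_only_storage_gb(db_size_gb: int, days: int) -> int:
    """Storage for one full backup per day, ignoring compression."""
    return db_size_gb * days


def full_plus_incremental_storage_gb(
    db_size_gb: int, days: int, daily_change_gb: int, full_every_days: int = 7
) -> int:
    """Storage for a periodic full backup plus a daily incremental in between."""
    fulls = -(-days // full_every_days)  # ceiling division
    incrementals = days - fulls
    return db_size_gb * fulls + daily_change_gb * incrementals


def worst_case_restore_chain(full_every_days: int = 7) -> int:
    """Backups needed to restore the day before the next full: 1 full + incrementals."""
    return 1 + (full_every_days - 1)


# 500GB database, 30 days of retention, 10GB of changed data per day (assumed).
print(full_only_storage_gb(500, 30))                  # 15000
print(full_plus_incremental_storage_gb(500, 30, 10))  # 2750
print(worst_case_restore_chain(7))                    # 7
```

Under these assumptions the incremental scheme stores roughly a fifth of the data, at the price of a restore that may have to apply seven backups in order.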

Continuous archiving ships transaction logs to object storage as they are produced. PostgreSQL's Write-Ahead Log (WAL) archiving enables point-in-time recovery to any moment after the oldest retained base backup.

-- PostgreSQL WAL archiving configuration
-- wal_level and archive_mode only take effect after a server restart
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
-- archive_command and archive_timeout can be applied with a reload
ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://wal-archive/postgres/%f';
ALTER SYSTEM SET archive_timeout = '300';  -- Force WAL switch every 5 minutes
SELECT pg_reload_conf();

-- Verify archiving is working
SELECT * FROM pg_stat_archiver;
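
The `last_archived_time` and `last_failed_time` columns of `pg_stat_archiver` can drive an alert. The following is a minimal sketch of that decision, assuming the row has already been fetched by whatever monitoring client is in use; the function name and the 10-minute threshold (twice the `archive_timeout` above) are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def archiver_problem(
    last_archived_time: Optional[datetime],
    last_failed_time: Optional[datetime],
    now: datetime,
    max_lag: timedelta = timedelta(minutes=10),
) -> Optional[str]:
    """Return a description of the problem, or None if archiving looks healthy."""
    if last_archived_time is None:
        return "no WAL segment has been archived yet"
    if last_failed_time is not None and last_failed_time > last_archived_time:
        return "most recent archive attempt failed"
    if now - last_archived_time > max_lag:
        return "last successful archive is older than the allowed lag"
    return None


now = datetime(2025, 11, 25, 14, 30, tzinfo=timezone.utc)

# Archived 4 minutes ago, no failures recorded: healthy.
print(archiver_problem(now - timedelta(minutes=4), None, now))  # None

# Last success 40 minutes ago: archiving has stalled.
print(archiver_problem(now - timedelta(minutes=40), None, now))
```

Every minute of archive lag adds directly to the achievable RPO, so this is worth alerting on rather than checking by hand.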

MongoDB supports point-in-time recovery through oplog archiving. Continuous backup solutions capture oplog entries alongside periodic snapshots.

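// MongoDB oplog tailing for continuous backup (Node.js driver)
const { MongoClient } = require('mongodb');

// Placeholder: replace the body with code that writes the entry to S3
async function archiveOperation(op) {
  console.log(JSON.stringify(op));
}

async function tailOplog(uri) {
  const client = new MongoClient(uri);
  await client.connect();
  const oplog = client.db('local').collection('oplog.rs');

  // Get latest timestamp so tailing starts from the current end of the oplog
  const latest = await oplog.find().sort({ $natural: -1 }).limit(1).next();
  const timestamp = latest.ts;

  // Tail oplog for changes (oplog.rs is a capped collection, so tailable cursors work)
  const cursor = oplog.find(
    { ts: { $gt: timestamp } },
    { tailable: true, awaitData: true }
  );

  // Archive entries one at a time so ordering is preserved
  for await (const op of cursor) {
    await archiveOperation(op);
  }
}

// The connection string is read from the environment
tailOplog(process.env.MONGODB_URI).catch(console.error);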

Cloud-Native Backup Tools

Modern backup solutions integrate with Kubernetes and cloud storage providers.

Velero backs up Kubernetes resources and persistent volumes. It captures entire application state including configurations, secrets, and data volumes.

# Install Velero with AWS plugin
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket backup-bucket \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Create backup schedule
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h0m0s

# Backup specific application
velero backup create postgres-backup \
  --selector app=postgresql \
  --snapshot-volumes=true

# Restore from backup
velero restore create --from-backup postgres-backup

CloudNativePG provides integrated backup and recovery for PostgreSQL clusters running on Kubernetes.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-db
spec:
  instances: 3
  backup:
    barmanObjectStore:
      destinationPath: s3://postgres-backups/prod-cluster
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        encryption: AES256
        maxParallel: 4
      data:
        compression: gzip
        encryption: AES256
        jobs: 2
        immediateCheckpoint: true
    retentionPolicy: "30d"
    volumeSnapshot:
      className: csi-snapshot-class
      snapshotOwnerReference: cluster

MongoDB Enterprise Operator automates backup and restore operations for MongoDB deployments.

apiVersion: mongodb.com/v1
kind: MongoDB
metadata:
  name: production-mongo
spec:
  members: 3
  type: ReplicaSet
  version: "6.0.5"
  backup:
    mode: enabled
  cloudManager:
    configMapRef:
      name: cloud-manager-config
  credentials: cloud-manager-credentials
---
apiVersion: mongodb.com/v1
kind: MongoDBOpsManager
metadata:
  name: ops-manager
spec:
  replicas: 1
  version: "6.0.12"
  backup:
    enabled: true
    headDB:
      storage: 50Gi
    opLogStores:
    - name: oplog1
      mongodbResourceRef:
        name: oplog-store
      mongodbUserRef:
        name: oplog-user
    s3Stores:
    - name: s3-backup
      s3BucketEndpoint: s3.amazonaws.com
      s3BucketName: mongo-backups
      s3SecretRef:
        name: s3-credentials
      pathStyleAccessEnabled: false

Point-in-Time Recovery

PITR enables restoring data to any specific moment, crucial for recovering from data corruption or accidental deletions discovered hours after occurrence.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: restored-postgres
spec:
  instances: 3
  bootstrap:
    recovery:
      source: production-postgres
      recoveryTarget:
        targetTime: "2025-11-25 14:30:00.00000+00"
  externalClusters:
  - name: production-postgres
    barmanObjectStore:
      destinationPath: s3://backups/postgres-prod
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      wal:
        maxParallel: 8

This manifest creates a new cluster restored to November 25, 2025 at 14:30 UTC. The operator fetches the most recent full backup before that timestamp, then replays WAL files to reach the exact recovery point.
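
The selection step can be sketched in a few lines of Python. This is an illustration of the logic described above, not the operator's actual code; `plan_recovery` and the backup completion times are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def plan_recovery(
    backup_end_times: List[datetime], target: datetime
) -> Optional[Tuple[datetime, timedelta]]:
    """Pick the base backup for a PITR and report how much WAL must be replayed.

    Returns (base_backup_end_time, wal_span_to_replay), or None when no
    backup finished at or before the target time.
    """
    candidates = [t for t in backup_end_times if t <= target]
    if not candidates:
        return None
    base = max(candidates)
    return base, target - base


# Hypothetical daily backups that finish at 02:20 UTC.
backups = [datetime(2025, 11, day, 2, 20, tzinfo=timezone.utc) for day in (23, 24, 25, 26)]
target = datetime(2025, 11, 25, 14, 30, tzinfo=timezone.utc)

base, wal_span = plan_recovery(backups, target)
print(base.isoformat())  # 2025-11-25T02:20:00+00:00
print(wal_span)          # 12:10:00
```

The WAL span matters for RTO: the longer the gap between the base backup and the target, the more log the new cluster has to replay before it can accept connections.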

Testing PITR regularly ensures the process works when needed. Schedule quarterly recovery drills where teams restore production data to test environments at specific points in time.

Cross-Region Replication

Geographic distribution protects against regional failures. Cloud provider outages affecting entire regions occur periodically, making cross-region strategies essential for critical systems.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-primary
  namespace: production
spec:
  instances: 3
  storage:
    size: 200Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://postgres-backups-us-east-1/primary
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip
        maxParallel: 4
    retentionPolicy: "30d"
---
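apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-replica
  namespace: production
spec:
  instances: 3
  storage:
    size: 200Gi
  # Seed the replica from the primary before it starts streaming
  bootstrap:
    pg_basebackup:
      source: postgres-primary
  replica:
    enabled: true
    source: postgres-primary
  externalClusters:
  - name: postgres-primary
    connectionParameters:
      # For a standby in another region, use the primary's externally reachable
      # endpoint; this in-cluster service name only resolves inside the primary's
      # own Kubernetes cluster
      host: postgres-primary-rw.production.svc.cluster.local
      user: streaming_replica
      dbname: postgres
    password:
      name: replica-creds
      key: password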

Multi-region backup replication stores backups in multiple geographic locations. Configure S3 cross-region replication or backup to multiple regional buckets.

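# AWS S3 cross-region replication configuration
# (versioning must already be enabled on both the source and destination buckets)
aws s3api put-bucket-replication --bucket postgres-backups-us-east-1 \
  --replication-configuration file://replication-config.json

# replication-config.json
# A rule that uses Filter must also set DeleteMarkerReplication,
# and ReplicationTime must be paired with Metrics
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": { "Status": "Disabled" },
    "Destination": {
      "Bucket": "arn:aws:s3:::postgres-backups-eu-west-1",
      "ReplicationTime": {
        "Status": "Enabled",
        "Time": {
          "Minutes": 15
        }
      },
      "Metrics": {
        "Status": "Enabled",
        "EventThreshold": {
          "Minutes": 15
        }
      }
    }
  }]
}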

Backup Verification

Backups are worthless if they cannot be restored. Automated verification catches backup corruption before disasters occur.

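apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-verification
spec:
  schedule: "0 3 * * 0"  # Weekly on Sunday at 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: verify-backup
            # The stock postgres image does not include the AWS CLI; use an image
            # that bundles both, or fetch the backup in an init container
            image: postgres:15
            command:
            - /bin/bash
            - -c
            - |
              # Exit non-zero on any failed step so the Job is marked as failed
              set -euo pipefail

              # Download latest backup
              aws s3 cp s3://backups/latest.tar.gz /tmp/backup.tar.gz

              # Extract and verify (PostgreSQL requires mode 0700 on the data directory)
              mkdir -p /tmp/restore
              tar -xzf /tmp/backup.tar.gz -C /tmp/restore
              chmod 700 /tmp/restore

              # Start temporary postgres instance and wait until it accepts connections
              # (pg_ctl refuses to run as root, so run the container as the postgres user)
              pg_ctl -D /tmp/restore -w start

              # Run verification queries
              psql -d postgres -c "SELECT COUNT(*) FROM pg_database;"
              psql -d postgres -c "SELECT pg_database_size('production');"

              # Verify critical tables exist
              psql -d production -c "\dt"

              # Cleanup
              pg_ctl -D /tmp/restore stop
              rm -rf /tmp/restore /tmp/backup.tar.gz
          restartPolicy: OnFailure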
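
Starting the instance proves the backup is readable, not that the data is plausible. A possible next step, sketched here in Python, compares row counts from the restored database against minimums you expect. The table names and thresholds are hypothetical, and the counts are assumed to have been collected already (for example from `SELECT COUNT(*)` queries).

```python
from typing import Dict, List


def verification_failures(
    restored_counts: Dict[str, int], minimum_rows: Dict[str, int]
) -> List[str]:
    """Compare restored row counts with expected minimums and list every problem."""
    failures = []
    for table, minimum in minimum_rows.items():
        if table not in restored_counts:
            failures.append(f"{table}: missing from restored database")
        elif restored_counts[table] < minimum:
            failures.append(
                f"{table}: {restored_counts[table]} rows, expected at least {minimum}"
            )
    return failures


# Hypothetical expectations for critical tables.
expected = {"users": 1000, "orders": 5000}

# Hypothetical counts read from the restored instance.
restored = {"users": 1240, "orders": 12}

for problem in verification_failures(restored, expected):
    print(problem)  # orders: 12 rows, expected at least 5000
```

In a verification job, a non-empty result should end the process with a non-zero exit code so the failure shows up in the CronJob history and in alerting.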

Disaster Recovery Procedures

Document and practice recovery procedures. During actual disasters, clear runbooks reduce recovery time and prevent errors under pressure.

Automated failover for database clusters enables rapid recovery from primary instance failures.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-postgres
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  failoverDelay: 30
  switchoverDelay: 60
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"
  monitoring:
    enablePodMonitor: true  # requires the Prometheus Operator's PodMonitor CRD

Manual failover procedures document steps for promoting standby instances when automatic failover fails or isn't configured.

# Promote standby to primary in CloudNativePG
kubectl cnpg promote production-postgres 1

# Check cluster status
kubectl cnpg status production-postgres

# The operator-managed production-postgres-rw service follows the new primary,
# so connection strings stay the same; confirm the endpoint has moved
kubectl get endpoints production-postgres-rw

Redis Backup Strategies

Redis requires different backup approaches due to its in-memory nature. RDB snapshots and AOF logs provide recovery options.

apiVersion: redis.redis.opstreelabs.in/v1beta1
kind: Redis
metadata:
  name: production-redis
spec:
  kubernetesConfig:
    image: redis:7.0
    imagePullPolicy: IfNotPresent
  storage:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
  redisConfig:
    save: "900 1 300 10 60 10000"  # RDB snapshot policy
    appendonly: "yes"  # Enable AOF
    appendfsync: "everysec"
    auto-aof-rewrite-percentage: "100"
    auto-aof-rewrite-min-size: "64mb"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
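          - name: backup-redis
            # Assumes an image that also ships the AWS CLI; the stock redis image does not
            image: redis:7.0
            command:
            - /bin/sh
            - -c
            - |
              set -eu
              STAMP=$(date +%Y%m%d-%H%M%S)
              # BGSAVE writes dump.rdb to the Redis server's own volume, which this pod cannot read.
              # --rdb asks the server for a snapshot and streams it into a local file instead.
              redis-cli -h production-redis --rdb /tmp/dump.rdb
              aws s3 cp /tmp/dump.rdb "s3://redis-backups/dump-${STAMP}.rdb"
              # The AOF stays on the server's volume; protect it with volume snapshots rather than this job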
          restartPolicy: OnFailure
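
The `save` line packs several snapshot rules into one string: each pair means "snapshot after this many seconds if at least this many keys changed". The Python sketch below evaluates that rule to show when a snapshot would be triggered; the function names are illustrative and this is a model of the documented behavior, not Redis code.

```python
from typing import List, Tuple


def parse_save_policy(policy: str) -> List[Tuple[int, int]]:
    """Turn '900 1 300 10' into [(900, 1), (300, 10)] (seconds, changes) pairs."""
    values = [int(v) for v in policy.split()]
    if len(values) % 2 != 0:
        raise ValueError("save policy must contain seconds/changes pairs")
    return list(zip(values[0::2], values[1::2]))


def snapshot_due(policy: str, seconds_since_save: int, changes_since_save: int) -> bool:
    """True if any (seconds, changes) rule in the policy is satisfied."""
    return any(
        seconds_since_save >= seconds and changes_since_save >= changes
        for seconds, changes in parse_save_policy(policy)
    )


POLICY = "900 1 300 10 60 10000"

print(parse_save_policy(POLICY))      # [(900, 1), (300, 10), (60, 10000)]
print(snapshot_due(POLICY, 120, 50))  # False: no rule is satisfied yet
print(snapshot_due(POLICY, 301, 50))  # True: 300 s elapsed with at least 10 changes
```

Read this way, the policy shows why RDB alone gives a coarse RPO: a lightly written instance can go up to 15 minutes between snapshots, which is the gap the AOF setting is there to close.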

Cost Optimization

Backup storage costs accumulate quickly. Lifecycle policies reduce expenses while maintaining adequate recovery capabilities.

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: production-postgres
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/postgres
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY
      data:
        compression: gzip
      wal:
        compression: gzip
    retentionPolicy: "30d"  # Keep backups for 30 days
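
A 30-day retention policy of this kind is a recovery window, not a simple age cutoff: the newest backup older than the window still has to be kept, because it is the base for recovering to the start of the window. The Python sketch below illustrates that rule under this reading of the policy; `obsolete_backups` and the weekly schedule are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def obsolete_backups(
    backup_times: List[datetime], now: datetime, window: timedelta
) -> List[datetime]:
    """Return backups that can be deleted while keeping the whole window recoverable."""
    cutoff = now - window
    at_or_before_cutoff = sorted(t for t in backup_times if t <= cutoff)
    # The newest backup at or before the cutoff is still needed as the base
    # for recovering to the oldest point in the window, so it is not returned.
    return at_or_before_cutoff[:-1]


now = datetime(2025, 11, 25, 12, 0, tzinfo=timezone.utc)

# Hypothetical weekly backups going back ten weeks (ages 0, 7, ..., 63 days).
backups = [now - timedelta(weeks=w) for w in range(10)]

removable = obsolete_backups(backups, now, timedelta(days=30))
print(len(removable))  # 4: the backups aged 42, 49, 56 and 63 days
```

The 35-day-old backup survives even though it is older than 30 days, which is worth remembering when estimating storage for a given window.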

S3 lifecycle policies automatically transition older backups to cheaper storage tiers. S3 only allows a transition to STANDARD_IA once an object is at least 30 days old, so the schedule starts there.

{
  "Rules": [{
    "ID": "BackupLifecycle",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      {
        "Days": 30,
        "StorageClass": "STANDARD_IA"
      },
      {
        "Days": 90,
        "StorageClass": "GLACIER"
      },
      {
        "Days": 180,
        "StorageClass": "DEEP_ARCHIVE"
      }
    ],
    "Expiration": {
      "Days": 365
    }
  }]
}
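
To reason about where a given backup sits at a given age, the lifecycle rule can be modeled as a simple lookup. This Python sketch is an illustration only; the thresholds passed in are example values that respect S3's 30-day minimum for STANDARD_IA, and `storage_class_at_age` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def storage_class_at_age(
    age_days: int,
    transitions: List[Tuple[int, str]],
    expire_after_days: int,
) -> Optional[str]:
    """Return the storage class for an object of the given age, or None once expired."""
    if age_days >= expire_after_days:
        return None
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


# Illustrative thresholds in days.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (180, "DEEP_ARCHIVE")]

for age in (10, 30, 120, 200, 365):
    print(age, storage_class_at_age(age, TRANSITIONS, expire_after_days=365))
# 10 STANDARD
# 30 STANDARD_IA
# 120 GLACIER
# 200 DEEP_ARCHIVE
# 365 None
```

Pairing a lookup like this with the retrieval time of each tier makes it easier to see which backups can still meet a given RTO.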

Effective disaster recovery requires planning, automation, regular testing, and documented procedures. Investment in robust backup infrastructure and recovery processes protects against catastrophic data loss and enables rapid recovery from failures.


Conclusion

Data backup and disaster recovery are not technical details to address after launch; they are fundamental business protections. The cloud-native era makes robust backup strategies more accessible than ever: object storage for durable, practically unlimited retention; managed operators that automate backup scheduling and WAL archiving; cross-region replication for geographic redundancy; and PITR for granular recovery. But access to tools does not guarantee protection.

Real protection comes from disciplined implementation: setting RPO/RTO based on business requirements, not technical convenience; automating backups and verification; testing recovery procedures quarterly; and documenting runbooks for incident response.

The most expensive backup is the one that fails when you need it. Invest in automation, test regularly, and treat disaster recovery as a continuous practice, not a one-time project. Your data's availability depends on it.


Frequently Asked Questions

What's the difference between RPO and RTO, and how do I choose the right values?

RPO (Recovery Point Objective) = how much data you can lose. RTO (Recovery Time Objective) = how long you can be down. Choose values based on business impact:

  • E-commerce checkout: RPO = minutes, RTO = seconds (lost sales per minute)
  • Internal analytics dashboard: RPO = 24 hours, RTO = 4 hours (lower criticality)
  • Financial transactions: RPO = 0 (no data loss), RTO = minutes (regulatory requirements)

Lower RPO/RTO requires more investment (synchronous replication, hot standbys). Match protection to business value.

How do I perform point-in-time recovery (PITR) with PostgreSQL on Kubernetes?

PITR requires continuous WAL archiving. Using CloudNativePG:

  1. Enable WAL archiving to S3 in the cluster spec
  2. To restore, create a new cluster with bootstrap.recovery pointing to the source
  3. Specify targetTime (e.g., "2025-11-25 14:30:00")
  4. The operator fetches the latest full backup before that time, then replays WAL files to reach the exact moment

Test this process quarterly. Recovering to a specific time is what lets you undo data corruption or accidental deletions that are discovered hours later.

How do I verify backups actually work?

Two levels of verification:

  • Automated: Run a CronJob that restores the latest backup to a temporary instance, runs verification queries (table counts, critical data presence), then cleans up. Fail the job if verification fails.
  • Manual drills: Quarterly, randomly select a PITR target from the last 30 days. Have an on-call engineer restore production data to a test environment at that exact time. Time the recovery and document any issues.

If you haven't successfully restored from a backup within the last 30 days, assume it's broken. Verification is not optional; it's the only way to know your backups are valid.
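
For the manual drill, picking the target time by hand tends to favor convenient moments. A small Python sketch like the following chooses one at random from the last 30 days; `pick_drill_target` is a hypothetical helper, and the fixed seed is only there to make the example repeatable.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def pick_drill_target(
    now: datetime, window_days: int = 30, rng: Optional[random.Random] = None
) -> datetime:
    """Pick a random recovery target between (now - window_days) and one second ago."""
    rng = rng or random.Random()
    offset_seconds = rng.randrange(1, window_days * 24 * 3600 + 1)
    return now - timedelta(seconds=offset_seconds)


now = datetime(2025, 11, 25, 12, 0, tzinfo=timezone.utc)
target = pick_drill_target(now, rng=random.Random(42))

# Use this value as the recovery target for the drill.
print(target.isoformat())
```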