Disaster recovery: defining RTO/RPO and rehearsing recovery
An operations guide to making DR more reliable by rehearsing recovery, not just completing backups

Introduction
A DR strategy can look like pure cost during normal operation, but it becomes a condition of survival when a failure hits. Many organizations miss their targets because they set RTO/RPO only as numbers and never verify the actual recovery procedure. RTO (recovery time objective) is how long a service may stay down; RPO (recovery point objective) is how much recent data may be lost. This article explains practical ways to define both to fit each service's characteristics and to design a drill cycle around them.

Problem definition
Disaster recovery usually fails because of missing assumptions, not missing technology.
- Core and non-core services are not distinguished, so the same target is imposed on all of them.
- Backups exist, but restores are never verified, so the actual recovery time cannot be predicted.
- The order in which DNS, secrets, and batch jobs are switched during DR failover is not documented.
RTO/RPO is a procedure, not a phrase in an SLA. The targets only mean something when they are backed by a playbook that spells out the recovery sequence and who makes each decision.
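
The second point above, backups that are never restored, can be addressed by timing a real restore and comparing it with the RTO budget. The following is a minimal sketch under stated assumptions: `measure_restore` is a hypothetical helper, and the command passed to it stands in for whatever restore procedure the team actually uses.

```shell
#!/usr/bin/env bash
# Sketch: time a restore command and compare the result with an RTO budget.
# measure_restore is a hypothetical name, not an existing tool.

# measure_restore <rto_minutes> <command> [args...]
# Runs the command, prints the elapsed seconds with a verdict, and returns
# 2 if the command failed or 1 if it exceeded the budget.
measure_restore() {
  local rto_minutes="$1"; shift
  local start end elapsed budget
  budget=$(( rto_minutes * 60 ))
  start=$(date +%s)
  if ! "$@"; then
    echo "restore command failed: $*" >&2
    return 2
  fi
  end=$(date +%s)
  elapsed=$(( end - start ))
  if (( elapsed <= budget )); then
    echo "OK elapsed=${elapsed}s budget=${budget}s"
  else
    echo "EXCEEDED elapsed=${elapsed}s budget=${budget}s"
    return 1
  fi
}
```

A call such as `measure_restore 30 ./restore.sh` (with `./restore.sh` as a placeholder) turns "we have backups" into a measured number that can be tracked from drill to drill.
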

Key concepts
| Aspect | Design criteria | Verification point |
|---|---|---|
| Classification | Separate targets per service tier | Tier 1 target achievement rate |
| Data protection | Snapshot + WAL + offsite copy | Deviation of the restored point from the target point |
| Failover procedure | Network/application/deployment sequence | Total recovery time |
| Drills | Quarterly DR drill | Gap between drill results and actual recovery |

The most effective improvement is drill automation. Recovery commands and verification reports should be scripted so that drills are repeatable.

Code example 1: Recovery check script
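#!/usr/bin/env bash
# DR drill: restore data, fail over to standby, then verify health.
set -euo pipefail  # any failing step aborts the drill
target_time="${1:?usage: $0 <target-time>}"  # point-in-time recovery target
./scripts/dr/restore-db.sh --target-time "$target_time"
./scripts/dr/restore-object-storage.sh
./scripts/dr/switch-dns.sh --to standby
./scripts/dr/health-check.sh --region standby
printf "DR drill completed at %s\n" "$(date -Iseconds)"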
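
The script above reports only that the drill finished. To produce the verification report mentioned earlier, each step could be wrapped so that its duration and outcome are recorded. This is a sketch, not part of the original scripts: `run_step` and `REPORT_FILE` are hypothetical names, and the tab-separated report format is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: wrap each drill step so its duration and outcome land in a report.
# REPORT_FILE and run_step are hypothetical names.

REPORT_FILE="${REPORT_FILE:-dr-drill-report.tsv}"

# run_step <name> <command> [args...]
# Appends "name<TAB>status<TAB>seconds" to the report and returns the
# command's exit status so the caller can stop the drill on failure.
run_step() {
  local name="$1"; shift
  local start rc=0
  start=$(date +%s)
  "$@" || rc=$?
  local elapsed=$(( $(date +%s) - start ))
  local status="ok"
  (( rc == 0 )) || status="failed"
  printf '%s\t%s\t%s\n' "$name" "$status" "$elapsed" >> "$REPORT_FILE"
  return "$rc"
}
```

Each line of the drill could then take the form `run_step switch-dns ./scripts/dr/switch-dns.sh --to standby`. Summing the third column gives a total recovery time that can be compared with the RTO.
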

Code example 2: RTO/RPO goal declaration
services:
  checkout:
    tier: 1
    rto_minutes: 30
    rpo_minutes: 5
  blog:
    tier: 2
    rto_minutes: 120
    rpo_minutes: 30
  analytics:
    tier: 3
    rto_minutes: 360
    rpo_minutes: 120
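
Declared targets are useful only when drill measurements are checked against them. The sketch below shows the arithmetic, assuming timestamps are available as epoch seconds; `data_loss_minutes` and `check_target` are hypothetical helper names. For example, if the last recoverable point is 210 seconds before the failure, the data-loss window rounds up to 4 minutes, which is within the 5-minute RPO declared for `checkout`.

```shell
#!/usr/bin/env bash
# Sketch: compare drill measurements with declared targets.
# Function names are hypothetical; inputs are plain integers.

# data_loss_minutes <last_recoverable_epoch> <failure_epoch>
# Prints the data-loss window in whole minutes, rounded up.
data_loss_minutes() {
  local last="$1" failure="$2"
  echo $(( (failure - last + 59) / 60 ))
}

# check_target <label> <measured_minutes> <target_minutes>
# Prints PASS or FAIL and returns non-zero on FAIL.
check_target() {
  local label="$1" measured="$2" target="$3"
  if (( measured <= target )); then
    echo "PASS ${label}: ${measured} <= ${target} min"
  else
    echo "FAIL ${label}: ${measured} > ${target} min"
    return 1
  fi
}
```

Rounding up is a deliberately conservative choice: a window of 301 seconds counts as 6 minutes and therefore fails a 5-minute target.
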

Architecture flow

Tradeoffs
- Aggressive RTO/RPO targets significantly increase the cost of redundant infrastructure.
- Automated DR is fast, but risky if the scripts do not verify themselves.
- More frequent drills improve readiness, but they cost the operations team more time.
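
One reading of "script self-verification" in the second tradeoff is a preflight check that runs before any destructive step. The sketch below assumes that interpretation; `require_commands` and `require_executables` are hypothetical helpers that use only the shell's own `command -v` and file tests.

```shell
#!/usr/bin/env bash
# Sketch: preflight checks to run before any destructive DR step.
# Function names are hypothetical.

# require_commands <cmd>...
# Returns non-zero and lists every missing command on stderr.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# require_executables <path>...
# Returns non-zero if any path is not an executable regular file.
require_executables() {
  local bad=0 path
  for path in "$@"; do
    if [[ ! -f "$path" || ! -x "$path" ]]; then
      echo "not executable: $path" >&2
      bad=1
    fi
  done
  return "$bad"
}
```

Calling both helpers at the top of a drill script makes it fail before the DNS switch, not halfway through it, when a dependency is missing.
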

Summary
DR is completed through repeated practice, not documentation. Separating targets clearly by tier and automating the recovery process makes recovery times predictable even in real failures.

Image source
- Cover: source link
- License: AGPL / Author: Proxmox Server Solutions GmbH
- Note: A freely licensed image was downloaded from Wikimedia Commons and optimized to a 1600px JPG.