A fire at NIRS (Daejeon, Korea) destroyed the government’s G-Drive—the “cloud” where ~750,000 civil servants have stored their work since 2018—and crippled 96 additional critical systems. Because of the platform’s large-capacity/low-performance storage design, no off-site backups existed, so user files are gone (except whatever can be reconstructed from other systems like OnNara). Translated for admins: the “cloud” was single-site, had no 3-2-1, and critical services were co-located in the same failure domain.
In Spain, the ENS (National Security Scheme, RD 311/2022) at High level requires a DRP and appropriate backups; but policy ≠ execution. Treat this incident as a prompt to review architecture, backups, and procedures.
The design failure (and how to avoid it)
What went wrong
- Physical monolith: a “cloud” in one data center.
- No off-site: the storage architecture couldn’t replicate externally.
- Coupling: 96 other critical systems impacted by the same physical event.
- Exclusive use: some ministries mandated G-Drive as the only source of truth.
Minimum antidote for any “enterprise” platform
- 3-2-1 (better: 3-2-1-1-0)
  - 3 copies, 2 different media, 1 off-site; add 1 immutable/air-gapped copy (S3 Object Lock, WORM, tape) and 0 errors verified by test restores.
- Geographic redundancy
  - Active-active across two zones/regions (RTO≈0 if capacity allows).
  - Active-passive with orchestrated failover and a defined RTO.
- Separate failure domains
  - Power, cooling, network, racks, uplinks, and (when feasible) different providers.
- Backups out of band
  - Repositories protected by segregated identities, MFA, and least privilege, not dependent on the same IAM/AD as production.
- Immutability
  - WORM (S3 Object Lock compliance mode), true air-gap with tape, or “sealed” repositories.
- Realistic RPO/RTO
  - Define per service. Document dependencies (DNS, IAM, PKI, queues, feature flags).
- DR drills
  - Timed failover and restore exercises at least 1–2/year, and after major changes.
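The 3-2-1-1-0 rule above is mechanical enough to check automatically against an inventory of copies. The following is a minimal sketch, not a definitive tool: the `BackupCopy` fields and the site names are hypothetical, and it assumes that "0 errors" means every backup copy has a test restore on record that finished without errors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset; every field is a hypothetical inventory attribute."""
    name: str
    media: str                            # e.g. "disk", "object-storage", "tape"
    site: str                             # failure domain holding the copy
    immutable: bool                       # WORM / Object Lock / offline tape
    restore_errors: Optional[int] = None  # last test restore; None = never tested
    is_production: bool = False           # the live copy is not restore-tested


def check_3_2_1_1_0(copies: list[BackupCopy], primary_site: str) -> list[str]:
    """Return the violated rules; an empty list means the policy holds."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.site != primary_site for c in copies):
        problems.append("no off-site copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable or air-gapped copy")
    backups = [c for c in copies if not c.is_production]
    if any(c.restore_errors is None or c.restore_errors > 0 for c in backups):
        problems.append("untested or failing restore")
    return problems


# A single-site platform with no backups, as in the incident described above.
single_site = [BackupCopy("prod", "disk", "dc-a", False, is_production=True)]
for problem in check_3_2_1_1_0(single_site, primary_site="dc-a"):
    print("VIOLATION:", problem)
```

Running a check like this on a schedule, and alerting on any non-empty result, turns the policy into something that fails loudly rather than on the day of the fire.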
Reference architectures (fast wins with real impact)
1) Private/colo “cloud”: active-active + immutable off-site backup
[DC A] <--sync/async replication--> [DC B]
   \                                /
    \--(backup jobs→WORM repo)--> [DC C]
- Production: replicate metadata and data (block/object) between A and B.
- Backups: immutable daily/hourly copies to C (different region/provider).
- Runbook: fail over to B (A down) or restore from C (catastrophe).
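The runbook rule above boils down to a small decision table. As a rough sketch (the `RecoveryPath` names and the three boolean inputs are assumptions made for illustration; real health signals are noisier than a single flag):

```python
from enum import Enum


class RecoveryPath(Enum):
    NORMAL = "serve from A and B"
    FAILOVER = "fail over to the surviving site"
    RESTORE = "rebuild and restore from the WORM copy in C"
    UNRECOVERABLE = "no surviving site and no usable backup"


def choose_path(a_up: bool, b_up: bool, c_backup_ok: bool) -> RecoveryPath:
    """Pick the recovery path for the A/B/C layout sketched above."""
    if a_up and b_up:
        return RecoveryPath.NORMAL
    if a_up or b_up:
        return RecoveryPath.FAILOVER      # one production site survives
    if c_backup_ok:
        return RecoveryPath.RESTORE       # catastrophe: both sites lost
    return RecoveryPath.UNRECOVERABLE     # the single-site outcome


print(choose_path(a_up=False, b_up=True, c_backup_ok=True).value)
print(choose_path(a_up=False, b_up=False, c_backup_ok=True).value)
```

The last branch is the one a single-site design always lands in: with no B and no C there is nothing left to choose from.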
2) Hybrid/SaaS: shared responsibility
- SaaS ≠ your backup. Contract for:
  - Locations (zonal/regional) and provider RPO/RTO.
  - Periodic export into your WORM.
  - DR evidence or right to test recovery.
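A periodic export is only useful if you can later prove it is complete and unaltered. One common approach, sketched here with Python's standard library only, is to store a checksum manifest next to the export and re-verify it before any restore. The function names and the directory layout are illustrative assumptions, not part of any provider's export format.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large exports do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(export_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256, taken at export time."""
    return {
        p.relative_to(export_dir).as_posix(): sha256_of(p)
        for p in sorted(export_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(export_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose content changed."""
    damaged = []
    for rel, expected in manifest.items():
        candidate = export_dir / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            damaged.append(rel)
    return damaged
```

Keeping the manifest in a different repository from the data it describes means one compromised store cannot silently rewrite both.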
Tools and patterns that work (and you can deploy now)
- Backups
  - Proxmox Backup Server: dedupe, encryption, restore verification, retention policies; immutable if the store backend supports it.
  - Veeam: SOBR + Object Lock, anomaly detection, SureBackup (verified restores), DR Orchestrator.
  - LTO tape: cost-effective air-gap for long retention and compliance.
- Data
  - S3-compatible object with versioning + Object Lock (MinIO/ceph-rgw, public cloud).
  - ZFS snapshots + send/receive between DCs, with retention and scheduled scrub.
- DR orchestration
  - IaC (Terraform/Ansible) + automated runbooks.
  - DNS/Anycast/GLB for traffic shifting; config-as-code (Consul/etcd).
- Security
  - Separate-domain IAM/LAPS for backup repos.
  - MFA everywhere; secrets vault (HashiCorp Vault; HSM where needed).
  - SIEM/SOAR alerts for mass deletions, policy changes, or encryption on backup repos.
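To make the mass-deletion alert in the last item concrete, here is a minimal sliding-window detector over a repository audit log. It is a sketch under stated assumptions: the `(timestamp, actor, action)` event shape, the `"delete"` action label, and the default thresholds are all hypothetical and would need mapping to whatever your backup product or SIEM actually emits.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def mass_deletion_alerts(events, threshold=100, window=timedelta(minutes=10)):
    """Flag actors whose deletions within `window` reach `threshold`.

    `events` must be an iterable of (timestamp, actor, action) tuples sorted
    by timestamp. Returns one (actor, timestamp) pair per actor, taken at the
    moment the threshold was first reached.
    """
    recent = defaultdict(deque)   # actor -> timestamps of recent deletions
    alerted = set()
    alerts = []
    for ts, actor, action in events:
        if action != "delete":
            continue
        queue = recent[actor]
        queue.append(ts)
        while ts - queue[0] > window:   # drop deletions older than the window
            queue.popleft()
        if len(queue) >= threshold and actor not in alerted:
            alerted.add(actor)
            alerts.append((actor, ts))
    return alerts


start = datetime(2025, 1, 15, 3, 0)  # hypothetical audit-log excerpt
burst = [(start + timedelta(seconds=i), "svc-backup", "delete") for i in range(5)]
print(mass_deletion_alerts(burst, threshold=5, window=timedelta(minutes=1)))
```

The same structure works for policy changes: count the relevant action per actor and alert on any occurrence outside a change window.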
Mini-runbook template — total loss of DC A
Goal: recover service from B with RPO ≤ X min and RTO ≤ Y min.
- DR short path (A impaired)
  - Freeze changes on A (if partially alive).
  - Promote B to primary (DB/object/queues).
  - Switch DNS/GLB to B.
  - Validate health (APM, synthetics, healthchecks).
  - Stakeholder comms.
- DR long path (A & B lost)
  - Re-provision B (or a replacement site) from templates (IaC).
  - Restore data from C (WORM):
    - DB → point-in-time
    - Objects/files → last good version
  - Validate integrity (checksums), start services, switch DNS.
- Post-mortem
  - TTD/MTTD, actual RPO/RTO, blockers, root causes, actions.
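The post-mortem numbers are simple timestamp arithmetic, which is a good reason to record the four timestamps during every drill. A minimal sketch follows; the function name, argument names, and example times are assumptions for illustration.

```python
from datetime import datetime, timedelta


def drill_report(last_good_copy, outage_start, detected_at, service_restored,
                 rpo_target, rto_target):
    """Compare what a drill actually achieved against the per-service targets."""
    actual_rpo = outage_start - last_good_copy      # writes in this gap are lost
    actual_rto = service_restored - outage_start    # user-visible downtime
    return {
        "time_to_detect": detected_at - outage_start,
        "actual_rpo": actual_rpo,
        "actual_rto": actual_rto,
        "rpo_met": actual_rpo <= rpo_target,
        "rto_met": actual_rto <= rto_target,
    }


day = datetime(2025, 1, 15)  # hypothetical drill date
report = drill_report(
    last_good_copy=day.replace(hour=9, minute=50),
    outage_start=day.replace(hour=10, minute=0),
    detected_at=day.replace(hour=10, minute=4),
    service_restored=day.replace(hour=11, minute=15),
    rpo_target=timedelta(minutes=15),
    rto_target=timedelta(minutes=60),
)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this example the RPO target is met (10 minutes of data lost against a 15-minute target) but the RTO target is missed (75 minutes against 60), which is exactly the kind of gap a timed drill is meant to surface.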
Sysadmin checklist (so you don’t repeat Daejeon)
- Two zones/DCs in active-active or tested failover (runbooks).
- Off-site immutable backup (WORM/air-gap) in a third failure domain.
- RPO/RTO per service and documented drills.
- Out-of-band backups (segregated IAM, MFA, least privilege).
- Anomaly monitoring on backup repos + alerts.
- Data inventory (SaaS included) and defined export/retention.
- ENS/ISO 27001/22301 compliance and audit evidence.
- Semi-annual DR drills with timings and outcomes.
Expert comment — David Carrero (Stackscale – Grupo Aire)
“The best insurance isn’t one; it’s several. At Stackscale we run production active-active across two DCs and also keep immutable backups in a third site. For immutable/air-gap we use tools like Proxmox Backup Server or Veeam with Object Lock. The brand matters less than the design—and testing restores: no tests, no DRP.”
Operational translation: production that survives a site loss, backups that can’t be altered, and timed failover/restore drills.
3-2-1-1-0 policy snippet (YAML)
policy:
  copies:
    total: 3
    media:
      - disk (primary)
      - object-storage (WORM)
    offsite:
      enabled: true
      location: dc-c
  immutability:
    mode: compliance
    retention: 30d
  rpo: "15m" # per service
  rto: "60m"
  verification:
    schedule: weekly
    method: restore-test + checksum
    target: isolated network
  access:
    iam: separate-domain
    mfa: required
    roles:
      - backup-admin (no prod)
      - restore-operator (break-glass)
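A policy file like this only helps if something reads it and complains. The sketch below lints the same policy expressed as a Python dict, because the standard library has no YAML parser; a real pipeline would load the file with a YAML library first. The nesting of `media` and `offsite` under `copies`, the `parse_duration` helper, and the specific lint rules are assumptions made for illustration, not a standard schema.

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "15m" or "30d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def lint_policy(doc: dict) -> list[str]:
    """Return human-readable findings; an empty list means no issue found."""
    p = doc["policy"]
    findings = []
    if p["copies"]["total"] < 3:
        findings.append("copies.total is below 3")
    if len(p["copies"]["media"]) < 2:
        findings.append("fewer than 2 media types")
    if not p["copies"]["offsite"]["enabled"]:
        findings.append("off-site copy is disabled")
    if p["immutability"]["mode"] != "compliance":
        findings.append("immutability is not in compliance mode")
    if parse_duration(p["immutability"]["retention"]) <= timedelta(0):
        findings.append("immutability retention is zero")
    for key in ("rpo", "rto"):
        if parse_duration(p[key]) <= timedelta(0):
            findings.append(f"{key} must be positive")
    if p["access"]["mfa"] != "required":
        findings.append("MFA is not required")
    return findings


POLICY = {"policy": {
    "copies": {"total": 3,
               "media": ["disk (primary)", "object-storage (WORM)"],
               "offsite": {"enabled": True, "location": "dc-c"}},
    "immutability": {"mode": "compliance", "retention": "30d"},
    "rpo": "15m",
    "rto": "60m",
    "access": {"iam": "separate-domain", "mfa": "required"},
}}
print(lint_policy(POLICY) or "policy passes the lint")
```

Running the lint in CI on every change to the policy file catches a quietly disabled off-site copy or a downgraded immutability mode before it reaches production.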
SaaS: the uncomfortable reminder
- It’s still your responsibility to know where your data lives and how to recover it.
- Contractually require RPO/RTO, periodic export, DR evidence, and the right to exercise recovery.
- Keep your own copies (export/backup) in a WORM repository.
FAQ
How often should I test DR?
At least 1–2 times/year and after major changes—always with timings (actual RTO) and a report.
Active-active or active-passive?
Depends on RTO and budget. Active-active minimizes downtime but adds consistency complexity; passive lowers cost but increases RTO.
What should I use for immutability?
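S3 Object Lock in compliance mode, LTO tape for a true air gap, and PBS/Veeam repositories configured as WORM. ZFS snapshots with retention help too, but they are read-only rather than tamper-proof: a privileged user can still destroy them, so replicate them to a host with separate credentials.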
How do I detect someone “killing” my backups?
Alerts on policy changes, mass deletions, and encryption patterns; strict IAM, universal MFA, and identity domain separation.
Bottom line
If your “cloud” fits inside one building, it’s not a cloud—it’s a single point of failure with lots of metal. Korea’s incident says it bluntly. To avoid a repeat: geo-redundancy, off-site immutable copies, segregated access, and regular DR drills. Everything else is just semantics.
Sources: Noticias Cloud and Korea JoongAng Daily