A fire at NIRS (Daejeon, Korea) destroyed the government’s G-Drive—the “cloud” where ~750,000 civil servants have stored their work since 2018—and crippled 96 additional critical systems. Because of the platform’s large-capacity/low-performance storage design, no off-site backups existed, so user files are gone (except whatever can be reconstructed from other systems like OnNara). Translated for admins: the “cloud” was single-site, had no 3-2-1, and critical services were co-located in the same failure domain.

In Spain, the ENS (National Security Scheme, RD 311/2022) at High level requires a DRP and appropriate backups; but policy ≠ execution. Treat this incident as a prompt to review architecture, backups, and procedures.


The design failure (and how to avoid it)

What went wrong

  1. Physical monolith: a “cloud” in one data center.
  2. No off-site: the storage architecture couldn’t replicate externally.
  3. Coupling: 96 other critical systems impacted by the same physical event.
  4. Exclusive use: some ministries mandated G-Drive as the only source of truth.

Minimum antidote for any “enterprise” platform

  • 3-2-1 (better: 3-2-1-1-0)
    • 3 copies, 2 different media, 1 off-site; add 1 immutable/air-gapped copy (S3 Object Lock, WORM, tape) and 0 errors verified by test restores.
  • Geographic redundancy
    • Active-active across two zones/regions (RTO≈0 if capacity allows).
    • Active-passive with orchestrated failover and a defined RTO.
  • Separate failure domains
    • Power, cooling, network, racks, uplinks, and (when feasible) different providers.
  • Backups out of band
    • Repositories protected by segregated identities, MFA, and least privilege, not dependent on the same IAM/AD as production.
  • Immutability
    • WORM (S3 Object Lock compliance mode), true air-gap with tape, or “sealed” repositories (see the Object Lock sketch after this list).
  • Realistic RPO/RTO
    • Define per service. Document dependencies (DNS, IAM, PKI, queues, feature flags).
  • DR drills
    • Timed failover and restore exercises at least 1–2/year, and after major changes.
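
To make the immutability point concrete, here is a minimal sketch using boto3 to create a WORM bucket with S3 Object Lock in compliance mode, against any S3-compatible endpoint that supports it (AWS S3, MinIO, etc.). The bucket name and endpoint are placeholders.

# Minimal sketch: create a WORM bucket with S3 Object Lock in compliance mode.
# Assumes boto3 and an S3-compatible endpoint that supports Object Lock
# (AWS S3, MinIO, ...); bucket name and endpoint URL are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.dc-c.example.com")  # hypothetical DC C endpoint

# Object Lock must be enabled when the bucket is created; it cannot be added later.
s3.create_bucket(
    Bucket="backups-worm",
    ObjectLockEnabledForBucket=True,
)

# Default retention: every new object version is locked for 30 days in
# COMPLIANCE mode (nobody, not even an admin, can shorten or remove the lock).
s3.put_object_lock_configuration(
    Bucket="backups-worm",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)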

Reference architectures (fast wins with real impact)

1) Private/colo “cloud”: active-active + immutable off-site backup

[DC A]  <—sync/async replication—>  [DC B]
      \                               /
       \—(backup jobs→WORM repo)—>  [DC C]
  • Production: replicate metadata and data (block/object) between A and B (a ZFS-based sketch follows this list).
  • Backups: immutable daily/hourly copies to C (different region/provider).
  • Runbook: fail over to B (A down) or restore from C (catastrophe).
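
A minimal sketch of the A→B replication leg using ZFS snapshots with incremental send/receive over SSH. Dataset, hosts and naming are placeholders; the same pattern (snapshot, incremental send, receive with rollback) applies whatever transport or storage layer you actually use.

# Minimal sketch of A -> B replication with ZFS snapshots and incremental
# send/receive over SSH. Dataset and host names are hypothetical; assumes
# key-based SSH and the zfs CLI on both sides.
import subprocess
from datetime import datetime, timezone

DATASET = "tank/gdrive"           # hypothetical production dataset in DC A
REMOTE = "root@dc-b.example.com"  # hypothetical replication target in DC B

def replicate(previous_snapshot=None):
    """Snapshot the dataset and send it (incrementally if possible) to DC B."""
    snap = f"{DATASET}@repl-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    send_cmd = ["zfs", "send"]
    if previous_snapshot:
        send_cmd += ["-i", previous_snapshot]  # incremental since the last replicated snapshot
    send_cmd.append(snap)

    # Pipe "zfs send" into "zfs receive" on the remote side; -F rolls the
    # replica back to the last common snapshot, fine for a dedicated target.
    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")
    return snap  # base snapshot for the next incremental run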

2) Hybrid/SaaS: shared responsibility

  • SaaS ≠ your backup. Contract for:
    • Locations (zonal/regional) and provider RPO/RTO.
    • Periodic export into your WORM (see the export sketch after this list).
    • DR evidence or right to test recovery.
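
A minimal sketch of that periodic export: however you pull the dump out of the SaaS (export API, scheduled dump), land it in your own Object Lock bucket with an explicit per-object retention. All names here are placeholders.

# Minimal sketch: drop a SaaS export into your own WORM bucket with an
# explicit per-object retention. How you obtain "export.tar.gz" depends on
# the SaaS (export API, scheduled dump); everything here is a placeholder.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.dc-c.example.com")

retain_until = datetime.now(timezone.utc) + timedelta(days=90)

with open("export.tar.gz", "rb") as dump:
    s3.put_object(
        Bucket="saas-exports-worm",
        Key=f"crm/{datetime.now(timezone.utc):%Y-%m-%d}/export.tar.gz",
        Body=dump,
        ObjectLockMode="COMPLIANCE",             # immutable until the date below
        ObjectLockRetainUntilDate=retain_until,  # e.g. 90-day retention
    )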

Tools and patterns that work (and you can deploy now)

  • Backups
    • Proxmox Backup Server: deduplication, encryption, verification jobs, retention policies; immutability if the storage backend supports it.
    • Veeam: Scale-Out Backup Repository (SOBR) + Object Lock, anomaly detection, SureBackup (verified restores), DR Orchestrator.
    • LTO tape: cost-effective air-gap for long retention and compliance.
  • Data
    • S3-compatible object storage with versioning + Object Lock (MinIO, Ceph RGW, public cloud).
    • ZFS snapshots + send/receive between DCs, with retention and scheduled scrub.
  • DR orchestration
    • IaC (Terraform/Ansible) + automated runbooks.
    • DNS/Anycast/GLB for traffic shifting; config-as-code (Consul/etcd).
  • Security
    • Separate-domain IAM/LAPS for backup repos.
    • MFA everywhere; secrets vault (HashiCorp Vault; HSM where needed).
    • SIEM/SOAR alerts for mass deletions, policy changes, or encryption on backup repos (a detection sketch follows this list).
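
A minimal sketch of that last point: count the delete markers created in the last 24 hours on a versioned backup bucket and raise an alert above a threshold. Bucket name and threshold are placeholders; in production the alert would feed your SIEM/SOAR.

# Minimal sketch: alert on mass deletions in a versioned backup bucket by
# counting delete markers created in the last 24 hours. Bucket name and
# threshold are placeholders; ship the alert to your SIEM/SOAR in production.
from datetime import datetime, timedelta, timezone
import boto3

BUCKET = "backups-worm"  # hypothetical backup repository
THRESHOLD = 100          # tune per repository churn

s3 = boto3.client("s3")
since = datetime.now(timezone.utc) - timedelta(hours=24)

recent_deletes = 0
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=BUCKET):
    for marker in page.get("DeleteMarkers", []):
        if marker["LastModified"] >= since:
            recent_deletes += 1

if recent_deletes > THRESHOLD:
    # Replace with your alerting hook (webhook, SOAR playbook, pager).
    print(f"ALERT: {recent_deletes} objects deleted in {BUCKET} in the last 24h")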

Mini-runbook template — total loss of DC A

Goal: recover service from B with RPO ≤ X min and RTO ≤ Y min.

  1. DR short path (A impaired)
    • Freeze changes on A (if partially alive).
    • Promote B to primary (DB/object/queues).
    • Switch DNS/GLB to B (see the DNS sketch after this runbook).
    • Validate health (APM, synthetics, healthchecks).
    • Stakeholder comms.
  2. DR long path (A & B lost)
    • Provision B from templates (IaC).
    • Restore data from C (WORM):
      • DB → point-in-time
      • Objects/files → last good version
    • Validate integrity (checksums), start services, switch DNS.
  3. Post-mortem
    • TTD/MTTD, actual RPO/RTO, blockers, root causes, actions.
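
The “switch DNS/GLB to B” step, sketched for Route 53 as an assumption; with another DNS or GLB provider the API differs but the pattern is the same: scripted, idempotent, low TTL. Zone ID, record name and the DC B address are placeholders.

# Minimal sketch of the "switch DNS to B" runbook step, assuming Route 53.
# Hosted zone ID, record name and the DC B address are placeholders; with
# another DNS/GLB provider the call changes but the pattern is the same.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000000000",  # hypothetical zone ID
    ChangeBatch={
        "Comment": "DR failover: point service at DC B",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "TTL": 60,  # keep TTL low so the failover propagates quickly
                "ResourceRecords": [{"Value": "203.0.113.20"}],  # DC B VIP
            },
        }],
    },
)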

Sysadmin checklist (so you don’t repeat Daejeon)

  • Two zones/DCs in active-active or tested failover (runbooks).
  • Off-site immutable backup (WORM/air-gap) in a third failure domain.
  • RPO/RTO per service and documented drills.
  • Out-of-band backups (segregated IAM, MFA, least privilege).
  • Anomaly monitoring on backup repos + alerts.
  • Data inventory (SaaS included) and defined export/retention.
  • ENS/ISO 27001/22301 compliance and audit evidence.
  • Semi-annual DR drills with timings and outcomes (a timing-harness sketch follows this checklist).
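
Timings are what turn a drill into evidence. A minimal harness sketch: run the runbook steps, time each one, and report the measured RTO; the step functions are placeholders for your own automation (IaC, API calls, health checks).

# Minimal sketch of a drill-timing harness: run the runbook steps, record how
# long each takes, and print the measured RTO for the report. The step
# functions are placeholders for your own automation.
import time

def promote_dc_b():     ...  # placeholder: promote DB/object/queues in B
def switch_dns_to_b():  ...  # placeholder: DNS/GLB change
def validate_health():  ...  # placeholder: synthetics / healthchecks

STEPS = [promote_dc_b, switch_dns_to_b, validate_health]

timings = {}
drill_start = time.monotonic()
for step in STEPS:
    t0 = time.monotonic()
    step()
    timings[step.__name__] = time.monotonic() - t0

actual_rto = time.monotonic() - drill_start
for name, seconds in timings.items():
    print(f"{name}: {seconds:.1f}s")
print(f"Actual RTO: {actual_rto / 60:.1f} min")  # compare against the target RTO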

Expert comment — David Carrero (Stackscale – Grupo Aire)

“The best insurance isn’t one; it’s several. At Stackscale we run production active-active across two DCs and also keep immutable backups in a third site. For immutable/air-gap we use tools like Proxmox Backup Server or Veeam with Object Lock. The brand matters less than the design—and testing restores: no tests, no DRP.”

Operational translation: production that survives a site loss, backups that can’t be altered, and timed failover/restore drills.


3-2-1-1-0 policy snippet (YAML)

policy:
  copies:
    total: 3
    media:
      - disk (primary)
      - object-storage (WORM)
    offsite:
      enabled: true
      location: dc-c
    immutability:
      mode: compliance
      retention: 30d
  rpo: "15m"    # per service
  rto: "60m"
  verification:
    schedule: weekly
    method: restore-test + checksum
    target: isolated network
  access:
    iam: separate-domain
    mfa: required
    roles:
      - backup-admin (no prod)
      - restore-operator (break-glass)
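
The verification stanza above (“restore-test + checksum”) can be as simple as this sketch: restore a sample object from the WORM repo and compare its SHA-256 with the hash recorded by the backup job. Bucket, key, endpoint and expected hash are placeholders.

# Minimal sketch of "restore-test + checksum": pull a sample object back from
# the WORM repo and compare its SHA-256 with the hash recorded at backup time.
# Bucket, key, endpoint and manifest format are placeholders.
import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.dc-c.example.com")

BUCKET = "backups-worm"
KEY = "db/2025-10-01/dump.sql.gz"
EXPECTED_SHA256 = "<sha256 from the manifest written by the backup job>"

def sha256_of_object(bucket, key):
    digest = hashlib.sha256()
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    for chunk in iter(lambda: body.read(1024 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()

restored = sha256_of_object(BUCKET, KEY)
assert restored == EXPECTED_SHA256, f"Checksum mismatch for {KEY}: backup may be corrupt"
print(f"OK: {KEY} verified ({restored})")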

SaaS: the uncomfortable reminder

  • It’s still your responsibility to know where your data lives and how to recover it.
  • Contractually require RPO/RTO, periodic export, DR evidence, and the right to exercise recovery.
  • Keep your own copies (export/backup) in a WORM repository.

FAQ

How often should I test DR?
At least 1–2 times/year and after major changes—always with timings (actual RTO) and a report.

Active-active or active-passive?
Depends on RTO and budget. Active-active minimizes downtime but adds consistency complexity; passive lowers cost but increases RTO.

What should I use for immutability?
S3 Object Lock (compliance), LTO tape for air-gap, ZFS snapshots with retention, and PBS/Veeam repositories configured in WORM.

How do I detect someone “killing” my backups?
Alerts on policy changes, mass deletions, and encryption patterns; strict IAM, universal MFA, and identity domain separation.


Bottom line

If your “cloud” fits inside one building, it’s not a cloud; it’s a single point of failure with lots of metal. The Daejeon fire makes that point bluntly. To avoid a repeat: geo-redundancy, off-site immutable copies, segregated access, and regular DR drills. Everything else is just semantics.

Sources: Noticias cloud and Korea JoongAng Daily
