In production environments, continuous availability is non-negotiable. Any unexpected downtime, whether in the database, web server, or caching service, can cost users, revenue, and reputation.
To reduce this risk, RunCloud includes Auto Healing as a built-in feature: a detection and automated recovery system that acts as the first line of defense against service outages.


Services Covered by Auto Healing

The RunCloud agent continuously performs health checks and can automatically restart key services when it detects an unplanned outage:

  • Databases: MariaDB / MySQL
  • Web servers: Apache, NGINX, OpenLiteSpeed (OLS)
  • Cache & in-memory stores: Redis, Memcached
  • Application runtimes: PHP-FPM
  • Containerization: Docker
  • Job queues: Beanstalkd
  • Process management: Supervisor
  • Security: Fail2Ban, firewall (UFW / Firewalld)
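
As a rough illustration of what a per-service health check could look like on a systemd-based server, the Python sketch below asks `systemctl is-active` whether a unit is running. This is not RunCloud's agent code: the helper names `is_active` and `check_all` are hypothetical, and the unit names in the usage note are assumptions that vary by distribution.

```python
import subprocess
from typing import Callable, Dict, Iterable


def is_active(unit: str, run: Callable = subprocess.run) -> bool:
    """Return True if systemd reports the unit as active.

    `systemctl is-active --quiet` exits with status 0 only when the unit
    is active. `run` is injectable so the check can be tested without systemd.
    """
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def check_all(units: Iterable[str], run: Callable = subprocess.run) -> Dict[str, bool]:
    """Map each unit name to its current active/inactive state."""
    return {unit: is_active(unit, run) for unit in units}
```

Calling `check_all(["nginx", "mariadb", "redis-server"])` on a server where those units exist would return a dictionary of booleans, one per service.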

Auto Healing Workflow

  1. Incident detection
    • The system runs continuous checks on each service.
    • If it detects that one has stopped unexpectedly, it triggers the recovery process.
    • If the stop was intentional (e.g., via the dashboard or CLI), Auto Healing respects the action and does not intervene.
  2. Initial notification
    • Before restarting, Auto Healing alerts the administrator, naming the affected service and stating that automated recovery has started.
  3. Automated restart cycle
    • Auto Healing attempts to restart the service up to 5 times.
    • Short delays are added between retries to allow proper initialization.
    • Every attempt is logged for traceability.
  4. Successful recovery and counter reset
    • If the service stabilizes during any of the attempts, the counter for that service is reset to zero.
    • Normal monitoring resumes without affecting other services.
  5. Persistent failure and manual escalation
    • If, after 5 attempts, the service does not recover, the system stops trying.
    • A final notification requests manual intervention, so that deeper issues (such as data corruption, misconfiguration, or resource exhaustion) are not masked by repeated restarts.
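
The workflow above can be sketched as a bounded retry loop. The following Python is a minimal model under stated assumptions, not RunCloud's implementation: the function name `auto_heal`, the callback parameters, and the 2-second delay are hypothetical (the article only specifies "short delays" and a limit of 5 attempts).

```python
import time
from typing import Callable, List

MAX_ATTEMPTS = 5  # the article states up to 5 restart attempts


def auto_heal(
    service: str,
    is_running: Callable[[str], bool],
    restart: Callable[[str], None],
    notify: Callable[[str], None],
    intentional_stop: bool = False,
    delay_seconds: float = 2.0,  # assumed value; the source says only "short delays"
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run one recovery cycle and return a log of the actions taken."""
    log: List[str] = []
    # Step 1: do nothing if the service is up or was stopped on purpose.
    if is_running(service) or intentional_stop:
        return log
    # Step 2: notify before the first restart.
    notify(f"{service} is down; automated recovery started")
    # Step 3: bounded restart cycle with a delay after each attempt.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        restart(service)
        log.append(f"attempt {attempt}: restart {service}")
        sleep(delay_seconds)
        # Step 4: on success, stop retrying; the next cycle starts from zero.
        if is_running(service):
            log.append(f"{service} recovered; counter reset")
            return log
    # Step 5: give up and escalate to a human.
    notify(f"{service} did not recover after {MAX_ATTEMPTS} attempts; manual intervention required")
    log.append(f"{service} escalated")
    return log
```

Passing the checks and side effects in as callbacks keeps the loop independent of any particular service manager, and the hard attempt limit is what prevents a crash-looping service from being restarted indefinitely.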

Control and Configuration

  • Enabled by default: Auto Healing is active on all new and existing servers managed by RunCloud.
  • Management: Auto Healing can be toggled on or off from Server → Settings → Auto Healing Services Settings in the dashboard.
  • Granularity: Automated recovery can be enabled or disabled per service.

Operational Benefits for Sysadmins and DevOps

  • Reduces Mean Time to Recovery (MTTR) for software failures.
  • Prevents extended downtime when no one is available to respond.
  • Minimizes off-hours intervention, which suits 24/7 environments.
  • Integrates directly with other server management functions in RunCloud.

💡 Comparison with other auto-healing solutions

While platforms like AWS EC2 Auto Recovery or Kubernetes Liveness Probes focus on instances or containers, RunCloud Auto Healing takes a multi-service, server-level approach. This makes it particularly useful in monolithic or hybrid infrastructures where multiple stacks coexist.
