RunCloud Auto Healing: Proactive Monitoring and Automatic Recovery of Critical Services

Published 08/13/2025

In production environments, continuous availability is non-negotiable. Any unexpected downtime, whether in the database, web server, or caching service, can cost users, revenue, and reputation.
To minimize this risk, RunCloud includes Auto Healing: a built-in detection and automated recovery system that acts as the first line of defense against service outages.

Services Covered by Auto Healing

The RunCloud agent continuously performs health checks and can automatically restart key services when it detects an unplanned outage:

Databases: MariaDB / MySQL
Web servers: Apache, NGINX, OpenLiteSpeed (OLS)
Cache & in-memory stores: Redis, Memcached
Application runtimes: PHP-FPM
Containerization: Docker
Job queues: Beanstalkd
Process management: Supervisor
Security: Fail2Ban, firewall (UFW / Firewalld)
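
RunCloud does not document how its agent performs these checks, so the following is only a minimal sketch of what a health check of this kind might look like on a systemd-based server. It assumes `systemctl is-active --quiet <unit>` as the probe (it exits with 0 only when the unit is active). The unit names and the `is_active` and `find_stopped` helpers are hypothetical and are not taken from RunCloud.

```python
import subprocess
from typing import Callable, Iterable, List

# Hypothetical unit names: real names vary by distribution and by how
# the control panel packages each service.
UNITS = ["mariadb", "nginx", "redis-server", "memcached",
         "docker", "supervisor", "fail2ban"]


def is_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    # `systemctl is-active --quiet` prints nothing and exits 0 when active.
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def find_stopped(units: Iterable[str],
                 probe: Callable[[str], bool] = is_active) -> List[str]:
    """Return the units that the probe reports as not active."""
    return [unit for unit in units if not probe(unit)]
```

Calling `find_stopped(UNITS)` on a real server would list the units that are currently down. The probe is passed in as a parameter so the logic can be exercised without touching systemd.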

Auto Healing Workflow

Incident detection
- The system runs continuous checks on each service.
- If it detects that one has stopped unexpectedly, it triggers the recovery process.
- If the stop was intentional (e.g., via the dashboard or CLI), Auto Healing respects the action and does not intervene.
Initial notification
- Before the first restart attempt, Auto Healing alerts the administrator, naming the affected service and stating that automated recovery has started.
Automated restart cycle
- Auto Healing attempts to restart the service up to 5 times.
- Short delays are added between retries to allow proper initialization.
- Every attempt is logged for traceability.
Successful recovery and counter reset
- If the service stabilizes during any of the attempts, the counter for that service is reset to zero.
- Normal monitoring resumes without affecting other services.
Persistent failure and manual escalation
- If, after 5 attempts, the service does not recover, the system stops trying.
- A final notification is sent requesting manual intervention. Stopping at this point keeps deeper issues (such as data corruption, misconfiguration, or resource exhaustion) from being masked by repeated restarts.
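
The workflow above can be sketched as a bounded retry loop. This is an illustrative model, not RunCloud's implementation: the limit of 5 attempts comes from the article, while the `heal` function, the `HealResult` type, the callback parameters, and the 2-second delay are assumptions made for the example. Detection of intentional stops is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 5  # from the article: up to 5 restart attempts


@dataclass
class HealResult:
    recovered: bool
    attempts: int
    log: List[str] = field(default_factory=list)


def heal(service: str,
         restart: Callable[[str], None],
         is_healthy: Callable[[str], bool],
         notify: Callable[[str], None],
         delay_seconds: float = 2.0,  # hypothetical delay value
         sleep: Callable[[float], None] = time.sleep) -> HealResult:
    """Try to restart a stopped service, escalating after MAX_ATTEMPTS."""
    log: List[str] = []
    # Initial notification, sent before the first restart attempt.
    notify(f"{service}: unplanned stop detected, automated recovery started")
    for attempt in range(1, MAX_ATTEMPTS + 1):
        restart(service)
        sleep(delay_seconds)  # give the service time to initialize
        healthy = is_healthy(service)
        outcome = "succeeded" if healthy else "failed"
        log.append(f"{service}: attempt {attempt} {outcome}")  # traceability
        if healthy:
            # Each call starts counting from 1, so the counter is
            # effectively reset once the service has recovered.
            return HealResult(True, attempt, log)
    # Persistent failure: stop retrying and hand over to a human.
    notify(f"{service}: still down after {MAX_ATTEMPTS} attempts, "
           "manual intervention required")
    return HealResult(False, MAX_ATTEMPTS, log)
```

Because `restart`, `is_healthy`, `notify`, and `sleep` are injected, the loop can be tested with stubs, and the same function could serve every monitored service without one service's failures affecting another's counter.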

Control and Configuration

Enabled by default: Auto Healing is active on all new and existing servers managed by RunCloud.
Management: Can be toggled on or off from Server → Settings → Auto Healing Services Settings in the dashboard.
Granularity: Automated recovery can be enabled or disabled per service.
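
As a rough mental model of these settings, the sketch below combines a server-wide switch with per-service overrides. The `AutoHealingSettings` class and its field names are hypothetical, and the precedence rule (the server-wide switch overrides per-service settings) is an assumption that the article does not confirm.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AutoHealingSettings:
    enabled: bool = True  # from the article: enabled by default
    # Per-service overrides; a service not listed here is treated as enabled.
    per_service: Dict[str, bool] = field(default_factory=dict)

    def should_heal(self, service: str) -> bool:
        """Return True if automated recovery applies to this service."""
        # Assumed precedence: the server-wide switch wins over overrides.
        return self.enabled and self.per_service.get(service, True)


settings = AutoHealingSettings(per_service={"redis": False})
print(settings.should_heal("nginx"))  # True
print(settings.should_heal("redis"))  # False
```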

Operational Benefits for Sysadmins and DevOps

Reduces Mean Time to Recovery (MTTR) for software failures.
Prevents extended downtime when no one is available to watch the server.
Minimizes off-hours intervention, which suits 24/7 environments.
Integrates directly with other server management functions in RunCloud.

💡 Comparison with other auto-healing solutions

While mechanisms such as AWS EC2 Auto Recovery or Kubernetes liveness probes operate at the level of instances or containers, RunCloud Auto Healing works at the server level across multiple services. This makes it particularly useful in monolithic or hybrid infrastructures where multiple stacks coexist.

Introducing Auto Healing on RunCloud
