Cloudflare has released ecdysis, a Rust library for graceful process restarts designed to upgrade network services without dropping live connections and without refusing new ones—a deceptively hard problem when a service is handling massive traffic across a global edge network. The company says the project has been used in production for five years and was open-sourced “last month,” turning an internal reliability mechanism into a reusable building block for the broader Rust infrastructure ecosystem.

The name is more than branding. “Ecdysis” is the biological process of shedding an old skin or outer layer—an apt metaphor for what Cloudflare is trying to achieve operationally: replacing running code while the system remains continuously available. In Cloudflare’s framing, getting upgrades wrong at this scale is not a minor inconvenience. Many of its Rust services do work that cannot pause—traffic routing, TLS lifecycle management, firewall enforcement—and even brief disruption can cascade into customer-visible performance issues and direct business impact.

Why “just restart it” breaks at Internet scale

A naïve stop-and-start deployment pattern still exists in plenty of environments: terminate the old process, then spawn the new one. Cloudflare argues that this method is fundamentally unsuitable for high-throughput network services for two reasons.

First, there is always a gap—milliseconds or seconds—where nothing is listening on the service socket, and the operating system begins refusing new connections (commonly with ECONNREFUSED). Even a short interruption becomes expensive when the request rate is high; Cloudflare notes that a 100 ms gap can mean hundreds of dropped connections for a service handling thousands of requests per second.
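The arithmetic behind that claim can be sketched in a few lines of Rust. The function name and the 5,000 requests-per-second figure are illustrative assumptions (the source says only "thousands"), and the model assumes a perfectly steady arrival rate.

```rust
/// Rough estimate of how many new connections arrive, and are refused,
/// while nothing is listening. Assumes a steady arrival rate; real traffic
/// is burstier, so treat the result as an order of magnitude only.
fn refused_connections(requests_per_second: u64, gap_ms: u64) -> u64 {
    requests_per_second * gap_ms / 1000
}
```

Under these assumptions, `refused_connections(5_000, 100)` evaluates to 500, which matches the "hundreds of dropped connections" scale Cloudflare describes.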

Second, killing a process also kills already-established connections. Long-lived sessions (WebSockets, gRPC streams) are terminated mid-flight; uploads fail; streams reset. From the client’s perspective, the service disappears. At global scale, Cloudflare says, that “brief restart” pattern can translate into millions of failed requests when multiplied across data centers.

Engineers often reach for SO_REUSEPORT as a way to bind a new process to the same address and port while the old one is still running. But Cloudflare points out a subtle failure mode: with SO_REUSEPORT, the kernel load-balances new connections across separate listening sockets. A connection can be assigned to a process that exits before it calls accept(), leaving the connection orphaned and terminated by the kernel. At real-world request rates, that kind of edge case stops being rare and becomes a certainty.

The ecdysis model: fork, exec, and keep the socket alive

Cloudflare’s implementation follows a pattern popularized by NGINX-style graceful upgrades: fork a new process generation, exec into the new binary, and pass the listening socket forward so the service never stops accepting traffic. Cloudflare describes four design goals:

  1. old code fully shuts down post-upgrade
  2. the new process gets a safe initialization grace period
  3. crashes during initialization are acceptable and should not impact the running service
  4. only one upgrade runs at a time to avoid cascading failures

Operationally, the parent process forks; the child replaces itself with the new binary via execve() and inherits the listening socket file descriptors via a named pipe; and the parent waits for a readiness signal before it begins to drain and exit. The crucial point is continuity: during the child’s startup window, the socket remains open and the parent continues serving both existing and new connections. Once the child signals that it is ready, the child begins accepting connections and the parent switches to draining mode, finishing its remaining connections before it exits.
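The continuity property rests on a basic Unix fact: a listening socket stays open as long as at least one descriptor still refers to it. The sketch below is an in-process analogy only, using the standard library's try_clone() to stand in for descriptor inheritance across fork and exec. It is not how ecdysis hands sockets over, and the function name is hypothetical.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};

/// Returns Ok(true) when a client can still connect, and be accepted,
/// after the "old" handle to the listening socket has been closed.
fn handover_keeps_socket_open() -> io::Result<bool> {
    // "Old generation": bind to an ephemeral loopback port.
    let old = TcpListener::bind("127.0.0.1:0")?;
    let addr = old.local_addr()?;

    // "New generation": a second descriptor for the same kernel socket.
    let new = old.try_clone()?;

    // Closing the old handle does not close the socket itself,
    // because the duplicate still refers to it.
    drop(old);

    // A client connecting now is queued on the shared socket...
    let client = TcpStream::connect(addr)?;
    // ...and the surviving handle accepts it.
    let (_conn, peer) = new.accept()?;

    Ok(peer == client.local_addr()?)
}
```

Because at least one handle is open at every moment, there is no window in which the port has no listener, which is the gap a stop-and-start deployment creates.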

This approach also provides a pragmatic safety net: if the child fails during initialization—say, due to a configuration error—it can exit without any downtime, because the parent never stopped listening. Deployments can be retried after fixing the issue, without turning a bad rollout into an outage.
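The upgrade sequence and its failure path can be read as a small state machine for the parent process. The following is a minimal model written for this article, not ecdysis's API: the state and event names are assumptions chosen to mirror the description above.

```rust
/// States of the parent (old generation) during an upgrade.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ParentState {
    Serving,   // normal operation
    Upgrading, // child started, parent still accepting
    Draining,  // child ready, parent finishing open connections
}

/// Events the parent reacts to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    UpgradeRequested,
    ChildReady,
    ChildFailed,
}

/// Pure transition function for the parent.
fn step(state: ParentState, event: Event) -> ParentState {
    match (state, event) {
        (ParentState::Serving, Event::UpgradeRequested) => ParentState::Upgrading,
        (ParentState::Upgrading, Event::ChildReady) => ParentState::Draining,
        // A crash during initialization returns the parent to normal service.
        (ParentState::Upgrading, Event::ChildFailed) => ParentState::Serving,
        // Everything else is ignored, including a second upgrade request
        // while one is already in flight (one upgrade at a time).
        (unchanged, _) => unchanged,
    }
}

/// The parent keeps accepting new connections until it starts draining.
fn accepts_new_connections(state: ParentState) -> bool {
    state != ParentState::Draining
}
```

In this model a failed child simply moves the parent from Upgrading back to Serving, and the parent never passes through a state in which it refuses connections before a ready child exists.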

Production claims: since 2021, across 330+ data centers

Cloudflare positions ecdysis as “battle-tested,” stating it has been running in production since 2021 and powers critical Rust infrastructure services deployed across 330+ data centers in 120+ countries. These services, Cloudflare says, handle billions of requests per day and require frequent updates for patches, feature releases, and configuration changes.

The company quantifies the operational benefit in request preservation: each restart using ecdysis “saves hundreds of thousands of requests” that would otherwise be lost in a naïve stop/start cycle—adding up to millions of preserved connections across its footprint.

Tokio and systemd integration, with explicit constraints

The open-source release is not just a conceptual write-up. Cloudflare ships ecdysis with first-class support for modern Rust service environments, including Tokio integration and systemd lifecycle handling. The GitHub documentation notes that enabling the systemd_notify feature requires Tokio, and that systemd-notify integration works only with systemd v253 or later and with Type=notify-reload set in the unit file; otherwise initialization can fail.
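For readers unfamiliar with systemd notification, the underlying protocol is small: the service sends a datagram such as READY=1 to the Unix socket named in the NOTIFY_SOCKET environment variable. The sketch below shows that protocol with the Rust standard library only. It is not ecdysis's implementation, the function names are hypothetical, and it does not handle abstract-namespace socket addresses (those beginning with "@").

```rust
use std::io;
use std::os::unix::net::UnixDatagram;
use std::path::Path;

/// Send one notification message (for example "READY=1") to a notify socket
/// at the given filesystem path.
fn notify(socket_path: &Path, state: &str) -> io::Result<()> {
    let sock = UnixDatagram::unbound()?;
    sock.send_to(state.as_bytes(), socket_path)?;
    Ok(())
}

/// Send the message to $NOTIFY_SOCKET when it is set.
/// Returns Ok(false) when the variable is absent, i.e. the process is
/// not running under a supervisor that expects notifications.
fn notify_from_env(state: &str) -> io::Result<bool> {
    match std::env::var_os("NOTIFY_SOCKET") {
        Some(path) => notify(Path::new(&path), state).map(|_| true),
        None => Ok(false),
    }
}
```

A service would call something like `notify_from_env("READY=1")` once initialization has finished, which is the moment the supervisor treats the new generation as up.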

There is also optional support for systemd named sockets via the systemd_sockets feature, again requiring Tokio and imposing constraints around named file descriptors.

And there is a hard platform boundary: ecdysis does not work on Microsoft Windows, because the approach relies on Unix-style process and socket inheritance mechanics. Cloudflare says it has been properly tested on Linux—specifically Debian Bullseye, Bookworm, and Trixie—and may work on other POSIX-like systems.

Security trade-offs: two generations briefly coexist

Graceful restarts introduce a short-lived but real security consideration: two generations of the service overlap, both with access to listening sockets and potentially sensitive descriptors. Cloudflare argues that ecdysis manages this risk through a traditional fork-then-exec design (the child gets a clean address space and fresh code), explicit inheritance of only the required descriptors, and defensive use of CLOEXEC flags to reduce accidental leakage. The company also notes that services using seccomp filtering must permit fork() and execve(), as the model requires them.
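The close-on-exec flag is what makes inheritance explicit: a descriptor marked CLOEXEC is closed automatically by execve(), so only descriptors with the flag deliberately cleared reach the new binary. The sketch below shows how the flag can be inspected and toggled. It is an illustration rather than ecdysis's code; the helper names are hypothetical, fcntl is declared by hand to avoid a dependency (the libc crate exposes the same items), and the constants are the Linux values.

```rust
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;

// Hand-written binding to the C library's fcntl(2).
// Note: the Rust 2024 edition requires writing this as `unsafe extern "C"`.
extern "C" {
    fn fcntl(fd: c_int, cmd: c_int, ...) -> c_int;
}

// Values from Linux <fcntl.h>; assumed here, not guaranteed on every platform.
const F_GETFD: c_int = 1;
const F_SETFD: c_int = 2;
const FD_CLOEXEC: c_int = 1;

/// Report whether the descriptor would be closed by execve().
fn is_cloexec<T: AsRawFd>(handle: &T) -> io::Result<bool> {
    let flags = unsafe { fcntl(handle.as_raw_fd(), F_GETFD) };
    if flags < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(flags & FD_CLOEXEC != 0)
}

/// Set or clear close-on-exec. Clearing it is the explicit opt-in
/// that lets a descriptor survive into the next process generation.
fn set_cloexec<T: AsRawFd>(handle: &T, enabled: bool) -> io::Result<()> {
    let fd = handle.as_raw_fd();
    let flags = unsafe { fcntl(fd, F_GETFD) };
    if flags < 0 {
        return Err(io::Error::last_os_error());
    }
    let new_flags = if enabled { flags | FD_CLOEXEC } else { flags & !FD_CLOEXEC };
    let rc = unsafe { fcntl(fd, F_SETFD, new_flags) };
    if rc < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```

Keeping the flag set by default and clearing it only on the listening sockets that must be handed over means that other open descriptors, such as files or upstream connections, do not leak into the child by accident.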

Where ecdysis fits among alternatives

Cloudflare frames ecdysis as part of a small family of graceful-restart tools. It points to tableflip as the Go library that inspired the approach and highlights shellflip as another Rust option that is more opinionated and focused on transferring arbitrary application state between process generations. In contrast, ecdysis is described as specializing in socket inheritance and rebinding. It is less prescriptive about systemd, and it supports a synchronous mode for use cases that do not require an async runtime.

In practical terms, this positions ecdysis as a clean answer to a specific operational pain: high-availability upgrades for connection-heavy Rust services, where uptime is not a feature but the baseline.
