We built the entire rolling update system as a single detached pthread, roughly 80 lines of C23. For each replica it launches the new replica, waits 8 seconds, then kills the old one. Verified on a local single-node cluster. Here is exactly how it works.
A Kubernetes rolling update requires you to configure and correctly understand at least five
things before it works without incident: maxSurge, maxUnavailable,
readinessProbe, PodDisruptionBudget, and the termination grace period.
Get the readiness probe wrong and the rollout stalls indefinitely, waiting on a pod that will
never report ready because the probe path is wrong. We have all been there.
Skr8tr's rollout is one command:
When the Conductor receives a ROLLOUT command it does two things: responds
immediately with OK|ROLLOUT|<app> (non-blocking), then spawns a detached
pthread to do the actual work. The thread owns the rollout from that point forward.
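The respond-then-detach pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in skr8tr_sched.c: the RolloutArgs struct, the handle_rollout function, and its reply-buffer interface are assumptions made for the example, and the body of rollout_thread is only a stand-in.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical argument block; the worker thread owns and frees it. */
typedef struct {
    char app[64];
} RolloutArgs;

/* Stand-in for the real rollout_thread body. */
static void *rollout_thread(void *arg)
{
    RolloutArgs *ra = arg;
    /* ... replace each old-generation replica of ra->app here ... */
    free(ra);
    return NULL;
}

/* Formats the immediate reply, then hands the work to a detached thread.
 * Returns 0 on success, -1 if the worker could not be started. */
static int handle_rollout(const char *app, char *reply, size_t reply_len)
{
    /* Step 1: the reply is ready before any rollout work begins. */
    snprintf(reply, reply_len, "OK|ROLLOUT|%s", app);

    RolloutArgs *ra = malloc(sizeof *ra);
    if (ra == NULL)
        return -1;
    snprintf(ra->app, sizeof ra->app, "%s", app);

    /* Step 2: detached, so nobody has to pthread_join() it later. */
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&tid, &attr, rollout_thread, ra);
    pthread_attr_destroy(&attr);

    if (rc != 0) {
        free(ra);
        return -1;
    }
    return 0;
}
```

Passing a heap-allocated argument block matters here: the handler returns right away, so anything on its stack would be gone by the time the detached thread reads it.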
The generation counter is the key: every Placement struct carries an integer
generation field. When a rollout starts, all existing placements are old-gen
(N). New placements get gen N+1. The thread walks the old-gen list, replacing each one.
During the rollout, both generations coexist briefly — there is no downtime gap for
multi-replica workloads.
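To make the generation walk concrete, here is a small in-memory sketch. Only the generation field on Placement comes from the description above; the PlacementTable type, the active flag, the settle callback, and the function names are hypothetical, and the real thread talks to nodes rather than editing an array.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PLACEMENTS 64

typedef struct {
    int  generation;  /* N for old placements, N+1 for replacements */
    bool active;      /* hypothetical: false once the replica is killed */
} Placement;

/* Hypothetical fixed-size table standing in for the Conductor's state. */
typedef struct {
    Placement items[MAX_PLACEMENTS];
    size_t    count;
} PlacementTable;

/* Counts live placements of one generation. */
static size_t count_active(const PlacementTable *t, int gen)
{
    size_t n = 0;
    for (size_t i = 0; i < t->count; i++)
        if (t->items[i].active && t->items[i].generation == gen)
            n++;
    return n;
}

/* Replaces every active old_gen placement with one at old_gen + 1.
 * settle may be NULL; otherwise it is called between launch and kill.
 * Returns the number of placements replaced. */
static int rollout_walk(PlacementTable *t, int old_gen, void (*settle)(void))
{
    size_t snapshot = t->count;  /* visit only placements present at start */
    int replaced = 0;

    for (size_t i = 0; i < snapshot; i++) {
        Placement *old = &t->items[i];
        if (!old->active || old->generation != old_gen)
            continue;
        if (t->count >= MAX_PLACEMENTS)
            break;               /* no room for a replacement */

        /* Launch the replacement first: both generations now coexist. */
        t->items[t->count++] =
            (Placement){ .generation = old_gen + 1, .active = true };

        if (settle != NULL)
            settle();            /* e.g. a fixed sleep */

        old->active = false;     /* only now retire the old replica */
        replaced++;
    }
    return replaced;
}
```

Because the replacement is appended before the old placement is retired, at least as many replicas are active during the walk as before it started.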
One problem we hit early: if a workload binds port 8080 and both the old and new replicas land on the same node, the new process fails to bind because the old one still holds the port. We fixed this by tracking bound ports per node in the Conductor.
node_least_loaded_for_port(port) skips any node that already has that port
in its used_ports array. Ports are claimed when the node confirms
LAUNCHED, released on KILL or node expiry.
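A port-aware selector along these lines might look like the sketch below. The function name and the used_ports array come from the text; the Node layout, the load metric, the int return value (node index, or -1 when no node qualifies), and the claim and release helpers are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16
#define MAX_PORTS 32

typedef struct {
    bool   alive;
    int    load;                   /* hypothetical metric, lower is better */
    int    used_ports[MAX_PORTS];
    size_t used_port_count;
} Node;

static Node   nodes[MAX_NODES];
static size_t node_count;

static bool node_has_port(const Node *n, int port)
{
    for (size_t i = 0; i < n->used_port_count; i++)
        if (n->used_ports[i] == port)
            return true;
    return false;
}

/* Returns the index of the least-loaded live node that does not already
 * have `port` bound, or -1 if every candidate is taken. */
static int node_least_loaded_for_port(int port)
{
    int best = -1;
    for (size_t i = 0; i < node_count; i++) {
        const Node *n = &nodes[i];
        if (!n->alive || node_has_port(n, port))
            continue;
        if (best < 0 || n->load < nodes[best].load)
            best = (int)i;
    }
    return best;
}

/* Called when the node confirms LAUNCHED. */
static bool node_claim_port(Node *n, int port)
{
    if (node_has_port(n, port) || n->used_port_count >= MAX_PORTS)
        return false;
    n->used_ports[n->used_port_count++] = port;
    return true;
}

/* Called on KILL or node expiry. */
static void node_release_port(Node *n, int port)
{
    for (size_t i = 0; i < n->used_port_count; i++) {
        if (n->used_ports[i] == port) {
            /* Swap-remove: order of used_ports does not matter. */
            n->used_ports[i] = n->used_ports[--n->used_port_count];
            return;
        }
    }
}
```

In this sketch a single-node cluster with the port already bound gets -1 back, so the caller has to decide what to do when no node is free for that port.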
We ran this on a single Arch Linux workstation — one Conductor, one node, both on localhost.
The workload was /bin/sleep 3600 (the simplest possible long-running process).
settle field in the manifest and an optional
HTTP probe to cut the window short once the new replica is responding.
Honest gaps in the current implementation:
node_least_loaded_for_port which handles any node count) but we have not run it against a physical multi-node setup yet.
Source: src/daemon/skr8tr_sched.c — search for rollout_thread.
Questions: scott.bakerphx@gmail.com