
Rolling Updates Without ReadinessProbes, PodDisruptionBudgets, or YAML

We built the entire rolling update system in a single detached pthread — roughly 80 lines of C23. Launch new replica, wait 8 seconds, kill old replica. Repeat per replica. Verified on a local single-node cluster. Here is exactly how it works.

Scott Baker
Systems engineer — C23, Rust, NixOS, post-quantum security

A Kubernetes rolling update requires you to configure and correctly understand at least five things before it works without incident: maxSurge, maxUnavailable, readinessProbe, PodDisruptionBudget, and the termination grace period. Get the readiness probe wrong and the rollout stalls indefinitely — waiting for a pod to become ready that will never report ready because the probe path is wrong. We have all been there.

Skr8tr's rollout is one command:

triggering a rolling update
```
$ skr8tr --key ~/.skr8tr/signing.sec rollout api-v2.skr8tr
rolling out /home/sbaker/skr8tr/examples/api-v2.skr8tr... ok
app     api-server
status  new replicas launching, old replicas draining (8s settle)
```

How rollout_thread Works

When the Conductor receives a ROLLOUT command it does two things: responds immediately with OK|ROLLOUT|<app> (non-blocking), then spawns a detached pthread to do the actual work. The thread owns the rollout from that point forward.

src/daemon/skr8tr_sched.c — rollout_thread (simplified)
```c
static void *rollout_thread(void *arg) {
    RolloutArgs *ra = arg;

    /* Bump the generation counter — new replicas get gen N+1 */
    pthread_mutex_lock(&g_mu);
    Workload *wl = workload_find(ra->app_name);
    wl->current_gen++;
    int new_gen = wl->current_gen;
    pthread_mutex_unlock(&g_mu);

    /* For each existing (old-gen) placement, one at a time: */
    for (int i = 0; i < old_count; i++) {
        /* 1. Launch a new-generation replica on the best available node */
        NodeEntry *node = node_least_loaded_for_port(wl->port);
        launch_replica(wl, node);   /* sends LAUNCH to node, waits for LAUNCHED+PID */

        /* 2. Wait for the settle window — workload starts, stabilises */
        sleep(ROLLOUT_WAIT_S);      /* ROLLOUT_WAIT_S = 8 */

        /* 3. Kill the old-generation replica */
        send_kill(old_placements[i].node_id, old_placements[i].app_name);
        node_port_release(old_node, wl->port);
    }

    free(ra);
    return NULL;
}
```

The generation counter is the key: every Placement struct carries an integer generation field. When a rollout starts, all existing placements are old-gen (N). New placements get gen N+1. The thread walks the old-gen list, replacing each one. During the rollout, both generations coexist briefly — there is no downtime gap for multi-replica workloads.

Port Collision Safety

One problem we hit early: if a workload binds port 8080, and both the old and new replica land on the same node, the new process fails to bind. We fixed this by tracking bound ports per node in the Conductor.

NodeEntry struct — port tracking added for rollout safety
```c
typedef struct {
    char   node_id[64];
    char   ip[64];
    int    cpu_pct;
    int    ram_free_mb;
    time_t last_heartbeat;

    /* Added for rollout port safety: */
    int used_ports[64];
    int used_port_count;
} NodeEntry;
```

node_least_loaded_for_port(port) skips any node that already has that port in its used_ports array. Ports are claimed when the node confirms LAUNCHED, released on KILL or node expiry.

The Verified Test

We ran this on a single Arch Linux workstation — one Conductor, one node, both on localhost. The workload was /bin/sleep 3600 (the simplest possible long-running process).

end-to-end rollout verification — 2026-04-06
```
# Deploy initial workload — PID 1031776
$ skr8tr --key ~/.skr8tr/signing.sec up examples/my-server.skr8tr
submitting... ok | app my-server | node 638eb13e...

$ skr8tr list
my-server   638eb13e...   1031776

# Trigger rollout
$ skr8tr --key ~/.skr8tr/signing.sec rollout examples/my-server.skr8tr
rolling out... ok — new replicas launching, old replicas draining (8s settle)

# Wait 10s, then check
$ skr8tr list
my-server   638eb13e...   1032014   ← new PID

# Confirm old PID is dead
$ ps -p 1031776
# (no output: old PID 1031776 is dead — rollout replaced it)
```

The 8-second settle window is a pragmatic choice, not a fundamental constraint. For services that start in milliseconds (most compiled binaries), 8 seconds is conservative. A future version will support a configurable settle field in the manifest and an optional HTTP probe to cut the window short once the new replica is responding.

What We Have Not Done

Honest gaps in the current implementation:

- No readiness check: the old replica is killed after a fixed 8 seconds whether or not the new one is actually serving.
- The settle window is hard-coded; there is no per-manifest settle field yet.
- Verified only on a single-node localhost cluster, and the test workload (/bin/sleep 3600) never binds its port or serves traffic.

Source: src/daemon/skr8tr_sched.c — search for rollout_thread.


Questions: scott.bakerphx@gmail.com
