Checkpoint-restart: making multi-hour runs preemption-safe

Spot and preemptible GPUs are the cheapest way to run a long simulation — and the most dangerous, because the provider can reclaim the instance at any moment. A 9-hour LES that dies at hour 8 isn't 89% done. It's 0% done, unless you checkpointed.

The problem with long runs

Large-eddy simulations, transient studies, and high-Reynolds cases routinely run for hours. Three things conspire against them:

Preemption — spot instances can be reclaimed with little warning.
Cost — keeping an on-demand H100 alive for a 12-hour run is expensive.
Fragility — a single transient fault wipes out the whole solve.

How checkpoint-restart works

Aviato Studio periodically snapshots the full solver state — the solution field, timestep, and convergence history — to durable storage. The cadence is adaptive: more frequent as a run gets longer and more valuable.
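As a rough illustration of the snapshot step, here is a minimal Python sketch. It is not Aviato Studio's implementation: the article does not describe the checkpoint format or the cadence rule, so the function names, the pickle format, and the "halve the interval every four hours" rule are all assumptions made up for this example. What it does show is the two ideas in the paragraph above: an interval that shrinks as the run gets longer, and a write that is never left half-finished if the instance disappears mid-save.

```python
import os
import pickle
import tempfile


def checkpoint_interval_s(elapsed_s, base_s=1800.0, floor_s=300.0):
    """Hypothetical cadence rule: halve the interval for every 4 h of
    runtime, but never go below floor_s seconds."""
    halvings = int(elapsed_s // (4 * 3600))
    return max(floor_s, base_s / (2 ** halvings))


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target, so a
    preemption in the middle of a write never leaves a torn checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In this sketch the state is a plain dictionary holding the three things the article names, for example `{"solution": [...], "timestep": 1200, "convergence_history": [...]}`. A real solver would more likely use a binary field format and object storage rather than pickle and a local path.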

When an instance is preempted, the orchestrator:

detects the loss within seconds,
provisions a fresh GPU of the same class,
restores the most recent checkpoint, and
resumes the solve from that checkpoint.

[ solving ] --snapshot--> [ checkpoint @ t=6.0h ]
       |
   preemption at t=8.1h
       |
[ new GPU ] --restore--> [ resume from t=6.0h ] --> [ converged ]

You lose at most the work since the last checkpoint, not the whole run.
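The restore-and-resume loop can be illustrated with a toy simulation in Python. Everything in it is hypothetical: the "solver" just accumulates a sum, an in-memory dictionary stands in for durable storage, and a `Preempted` exception stands in for the provider reclaiming the instance. The step counts mirror the diagram above, with one step per 0.1 h: a checkpoint at step 60, a preemption at step 81, and a 90-step run.

```python
class Preempted(Exception):
    """Stands in for the provider reclaiming the instance."""


def run_segment(state, store, total_steps, checkpoint_every, preempt_at=None):
    """Advance the toy solve in place, snapshotting every checkpoint_every steps."""
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise Preempted()
        state["step"] += 1
        state["value"] += 1.0 / state["step"]  # placeholder for a real solver update
        if state["step"] % checkpoint_every == 0:
            store["latest"] = dict(state)  # copy, standing in for durable storage
    return state


def run_with_restart(total_steps, checkpoint_every, preempt_at):
    """Run to completion, restoring the latest checkpoint after a preemption.

    Returns (final_state, restarts, lost_steps)."""
    store = {"latest": {"step": 0, "value": 0.0}}
    restarts = 0
    lost_steps = 0
    pending = preempt_at
    while True:
        state = dict(store["latest"])  # restore the most recent checkpoint
        try:
            final = run_segment(state, store, total_steps, checkpoint_every, pending)
            return final, restarts, lost_steps
        except Preempted:
            restarts += 1
            lost_steps += state["step"] - store["latest"]["step"]
            pending = None  # the replacement instance is not preempted in this toy run


final, restarts, lost = run_with_restart(total_steps=90, checkpoint_every=60, preempt_at=81)
print(final["step"], restarts, lost)
```

With these numbers the run is preempted once, redoes the 21 steps between the checkpoint and the preemption (the 2.1 h between t=6.0h and t=8.1h in the diagram), and finishes with the same result as an uninterrupted run.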

Why it changes the cost equation

Checkpointing is what makes spot GPUs safe for real work. Combined with automatic convergence detection, which stops the clock the moment residuals and forces settle, you get the price of preemptible hardware with reliability close to on-demand: a reclaimed instance costs you only the time since the last checkpoint.

Set a budget cap, launch on spot, and walk away. If a run is reclaimed, it resumes itself. If it converges early, you stop paying early.
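To make the two stopping conditions concrete, here is a small Python sketch of a stop check. The article does not say how convergence is detected or how the cap is enforced, so the criteria below are assumptions chosen for illustration: the latest residual must fall below a tolerance, a monitored force must vary by less than a relative tolerance over a short window, and spend is estimated as elapsed hours times an hourly rate. All names, tolerances, and prices are hypothetical.

```python
def has_converged(residuals, forces, residual_tol=1e-6, force_tol=1e-3, window=5):
    """True when the latest residual is below residual_tol and the monitored
    force has varied by less than force_tol (relative to its mean) over the
    last `window` samples."""
    if not residuals or len(forces) < window:
        return False
    recent = forces[-window:]
    mean = sum(recent) / window
    spread = max(recent) - min(recent)
    scale = max(abs(mean), 1e-12)  # avoid dividing by zero for a near-zero force
    return residuals[-1] < residual_tol and spread / scale < force_tol


def should_stop(residuals, forces, elapsed_h, rate_per_h, budget_cap):
    """Return a stop reason ("converged" or "budget_cap"), or None to keep solving."""
    if has_converged(residuals, forces):
        return "converged"
    if elapsed_h * rate_per_h >= budget_cap:
        return "budget_cap"
    return None
```

A run loop would call `should_stop` after each batch of iterations and release the GPU as soon as it returns a reason. Checking convergence before the cap means a run that settles on the same iteration it hits the budget is reported as converged.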

That's the whole idea behind metered, managed CFD: the platform should sweat the infrastructure so you can think about the flow.
