Skip to content

RelayMD: Job Lifecycle

Lifecycle Flow

Operator prepares simulation input directory
relaymd submit ./inputs/ --title "lig42-eq1" --command "python run_atom.py"
         ├── packs directory into bundle.tar.gz
         ├── uploads to B2 at jobs/{uuid}/input/bundle.tar.gz
         └── POST /jobs → job enters "queued" state in orchestrator DB
         Orchestrator sbatch loop fires (every 60s)
         Sees queued job, no active HPC workers for cluster
         sbatch renders job.sbatch.j2 and calls sbatch directly
         Worker boots on compute node
         → fetches secrets from Infisical
         → joins Tailnet
         → registers with orchestrator (POST /workers/register)
         Worker polls POST /jobs/request
         → receives job_id + input_bundle_path + latest_checkpoint_path
         Worker downloads input bundle (+ checkpoint if resuming) from B2
         Worker launches MD subprocess; heartbeat thread starts
                  ├── every 60s: POST /workers/{id}/heartbeat
                  ├── every 5min (poll; default 300s): new checkpoint found?
                  │       → upload to B2
                  │       → POST /jobs/{id}/checkpoint
                  │       → if job already terminal: typed 409 conflict (safe to ignore)
                  ├── on wall-time margin (SIGTERM from SLURM):
                  │       → send SIGTERM to subprocess
                  │       → wait up to 60s for checkpoint newer than pre-shutdown baseline mtime
                  │       → if newer checkpoint exists: upload to B2 + POST /jobs/{id}/checkpoint
                  │       → exit  (orchestrator re-queues automatically)
                  └── on clean subprocess exit:
                          → POST /jobs/{id}/complete (or /fail)
                          → late callback against terminal state may return typed 409
                          → loop back to POST /jobs/request

If the worker dies without reporting:

Orchestrator detects stale heartbeat (`last_heartbeat > heartbeat_interval_seconds × heartbeat_timeout_multiplier`, default `60 × 2 = 120s`)
Job re-enters "queued" state with latest_checkpoint_path preserved
Next available worker resumes from that checkpoint

First Use Case: AToM-OpenMM

The first concrete workload RelayMD is designed to run is AToM-OpenMM, an alchemical free-energy engine that runs replica exchange across multiple lambda windows on a single multi-GPU node.

From RelayMD's perspective, AToM-OpenMM is an opaque subprocess. The worker launches it via a command specified in the input bundle's relaymd-worker.json config file, waits for it to run, and handles checkpointing at the boundaries of each chunk. Replica exchange between lambda windows is entirely internal to the subprocess — the orchestrator never sees individual replicas, only the job as a whole.

AToM-OpenMM supports restart from a checkpoint file, which is the prerequisite for RelayMD's resume-on-any-worker model. A typical job runs for several days of wall time. At a 4-hour SLURM limit per job, this means roughly 15–20 worker handoffs per ligand. RelayMD makes this transparent: each handoff picks up exactly where the last one left off.

The input bundle for an AToM job contains all simulation input files plus a relaymd-worker.json:

{
  "command": "python run_atom.py --config simulation.json",
  "checkpoint_glob_pattern": "*.chk"
}

The --command flag on relaymd submit can write this file automatically so it does not need to be included in the source directory.