RelayMD: Orchestrator Internals
Web Framework
FastAPI, async throughout. All route handlers are async def. Database calls use an AsyncSession. No run_in_executor wrappers needed — the orchestrator is entirely I/O-bound.
Database
SQLite via SQLModel (which wraps SQLAlchemy 2.0 + Pydantic). SQLModel unifies the ORM model and Pydantic schema into a single class definition, eliminating the boilerplate of separate JobDB / JobResponse classes. Migrations via Alembic.
Networking Constraint: All Communication is Worker-Initiated
Salad Cloud blocks all inbound traffic to containers. This is a hard platform constraint.
The rule: the orchestrator never initiates a connection to a worker. Every interaction is a worker making an outbound HTTP request to the orchestrator. The worker lifecycle is designed so this constraint never requires a workaround — the orchestrator controls worker behaviour entirely through job assignment responses, not push signals.
Configuration
pydantic-settings with YamlConfigSettingsSource. Config is loaded from a YAML file (path from RELAYMD_CONFIG env var, default ~/.config/relaymd/config.yaml). Env vars override YAML for secrets so that api_token and infisical_token never need to appear in a file on disk.
A missing YAML file is non-fatal — the orchestrator starts with defaults and logs a warning. The reference config is deploy/config.example.yaml.
Scheduling Loops
Three APScheduler interval jobs are registered from FastAPI lifespan using an in-memory AsyncIOScheduler:
- stale_worker_reaper_job — every stale_worker_reaper_interval_seconds (default 60s); marks workers stale if last_heartbeat > heartbeat_interval_seconds × heartbeat_timeout_multiplier; re-queues their jobs; calls Salad autoscaling.
- orphaned_job_requeue_once — every orphaned_job_requeue_interval_seconds (default 60s); handles jobs that reached assigned state but whose worker never registered (e.g. SLURM job failed to boot).
- sbatch_submission_job — every sbatch_submission_interval_seconds (default 60s); for each ClusterConfig, proceeds in two steps:
  - Dead-placeholder reap: queries all placeholder workers (those with slurm_job_id containing :), calls squeue --jobs <id,...>, and deletes any whose SLURM job is no longer alive. This reclaims the max_pending_jobs slot so the next submission cycle can proceed. Errors from squeue (e.g. on non-HPC environments) are swallowed and never crash the scheduler.
  - New submission: if there are queued jobs and no active/pending HPC workers for that cluster, renders the Jinja2 sbatch template and calls sbatch --parsable as a direct subprocess. Stores the SLURM job ID in the DB as a placeholder worker record (slurm_job_id = "<cluster>:<id>") to prevent duplicate submissions during the SLURM pending window.
Scheduler settings: coalesce=True, max_instances=1, no persistent job store.
sbatch Submission
Direct subprocess call — no SSH, no paramiko. The orchestrator runs on the login node where sbatch is in PATH. Submission is:
```python
proc = await asyncio.create_subprocess_exec(
    "sbatch", "--parsable", rendered_script_path,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
)
stdout, stderr = await proc.communicate()
slurm_job_id = stdout.decode().strip()  # --parsable emits only the job ID
```
Current limitation: the orchestrator must run on the same login node as the target cluster. Cross-cluster submission (e.g. submitting to clusterB from a clusterA-hosted orchestrator) requires SSH and is not yet implemented.
Placeholder Worker Lifecycle
When sbatch succeeds, the orchestrator inserts a placeholder Worker row with:
- platform = hpc
- slurm_job_id = "<cluster_name>:<slurm_job_id>" (the colon is the sentinel)
- vram_gb = 0 (unknown until the real worker registers)
- last_heartbeat = now
The placeholder is visible in the UI with status = provisioning. It is never reaped by the stale-worker reaper (which explicitly skips rows whose slurm_job_id contains :), and it is never assigned jobs (the assignment query requires slurm_job_id IS NULL).
The placeholder is cleaned up by one of two paths:
- Happy path — the SLURM job starts and the worker process calls POST /workers/register with slurm_job_id set to $SLURM_JOB_ID. register_worker finds the matching placeholder (by suffix ":<id>") and deletes it atomically before committing the real worker row. The real worker has slurm_job_id = NULL and a live heartbeat.
- Dead-job path — the sbatch_submission_job scheduler calls reap_dead_slurm_placeholders before each submission cycle. It calls squeue --jobs <id1,id2,...> --noheader --format=%i and deletes any placeholder whose job ID is no longer returned by squeue (job failed, timed out, or was cancelled before starting).
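The dead-job path can be sketched as follows — helper names and the placeholder-map shape are illustrative, and squeue failures are swallowed as described above:

```python
import asyncio

async def alive_slurm_jobs(job_ids: list[str]) -> set[str]:
    """Return the subset of job_ids that squeue still reports as alive."""
    proc = await asyncio.create_subprocess_exec(
        "squeue", "--jobs", ",".join(job_ids), "--noheader", "--format=%i",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    return {line.strip() for line in stdout.decode().splitlines() if line.strip()}

async def reap_dead_slurm_placeholders(placeholders: dict[str, str]) -> list[str]:
    """placeholders maps worker row key -> '<cluster>:<slurm_id>';
    returns the row keys whose SLURM job is gone."""
    ids = [v.split(":", 1)[1] for v in placeholders.values()]
    try:
        alive = await alive_slurm_jobs(ids)
    except OSError:
        # squeue absent (non-HPC environment): do nothing rather than crash.
        return []
    return [key for key, v in placeholders.items()
            if v.split(":", 1)[1] not in alive]
```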