
RelayMD: Tech Stack & Development Guidelines

Guiding Principles

Apptainer-first. The worker runs inside an Apptainer .sif container. All Python dependencies — including the MD engine and the relaymd-worker package — must be installed inside the image. Do not rely on the host environment for anything.

No conda environments. Conda adds significant image size and slows dependency resolution. Use pip inside the container image. The base image should provide a suitable Python (3.11+) directly.

uv for local installs. Use uv — not pip, not conda — for local development and non-container workflows. Production HPC deployments run both orchestrator and worker from GHCR-backed Apptainer images.

Pydantic everywhere. All structured data — API request/response bodies, config objects, DB models, inter-component messages — should be typed with Pydantic. If you are writing a dict with string keys to pass data between functions, use a Pydantic model instead.

One binary for the operator. The relaymd CLI is packaged as a single self-contained ELF binary with PyInstaller and distributed via GitHub Releases. No Python environment is required on the machine that submits jobs.


Language & Runtime

| Decision | Choice | Rationale |
|---|---|---|
| Language | Python 3.11+ | Required by alchemical MD workloads; modern typing |
| Package manager (login node) | uv | Fast, reproducible, lockfile-based local/dev installs |
| Package manager (container) | pip inside Apptainer/Docker | No conda; uv can also be used in the Dockerfile |
| CLI distribution | PyInstaller single binary | No Python env required on HPC login node |
| Python typing | strict Pydantic throughout | See guiding principle above |

Repository Layout

relaymd/
├── src/
│   └── relaymd/
│       ├── cli/                # Operator CLI commands
│       └── orchestrator/       # FastAPI app, DB, scheduler, sbatch
├── packages/
│   ├── relaymd-core/          # Shared models + storage only
│   │   └── src/relaymd/
│   │       ├── models/
│   │       └── storage/
│   └── relaymd-worker/        # Worker bootstrap, main loop, heartbeat
├── deploy/
│   ├── slurm/                 # SLURM .sbatch.j2 templates + cluster configs
│   ├── salad/                 # Salad Cloud container group config
│   ├── tmux/                  # tmux launcher script
│   └── config.example.yaml    # Canonical reference config for orchestrator + CLI
├── docs/
│   ├── architecture.md
│   ├── scheduling.md
│   ├── api-schema.md
│   ├── cli.md                 # CLI install and usage guide
│   ├── deployment.md          # Orchestrator deployment guide
│   ├── hpc-notes.md           # Apptainer + Tailscale runbook
│   └── storage-layout.md
├── frontend/                  # React operator dashboard served by FastAPI
├── Dockerfile                 # Worker container image
├── Dockerfile.orchestrator    # Orchestrator container image (+ bundled frontend)
└── pyproject.toml             # Root relaymd package + workspace config

relaymd-core is the shared dependency layer: it carries only relaymd.models + relaymd.storage. The worker container installs relaymd-core + relaymd-worker only; it does not install relaymd (and therefore does not pull FastAPI, uvicorn, alembic, or typer). A uv workspace at the repo root manages these three packages with one lockfile. The frontend uses npm under frontend/, with cache and build output kept inside the repo.


Shared Data Models

All API request/response models live in relaymd-core under relaymd.models, shared by relaymd and relaymd-worker. If a field changes, it changes in one place and all consumers break loudly at import time rather than silently at runtime.


Logging

structlog with JSON output in production and ConsoleRenderer in development (detected via RELAYMD_ENV=development). JSON serialization uses orjson for performance. All log statements use keyword arguments, never f-string messages.

log.info("checkpoint_uploaded", job_id=str(job_id), b2_key=key, size_bytes=size)

Both the orchestrator and worker have a logging.py module that configures structlog once on startup and exposes get_logger(name).


Testing

| Layer | Framework | Notes |
|---|---|---|
| Orchestrator API | pytest + httpx AsyncClient | In-memory SQLite DB per test |
| Worker logic | pytest + unittest.mock | Mock Infisical, B2, subprocess |
| Storage module | pytest + moto | moto mocks the S3/B2 API locally |
| Scheduling loops | pytest + freezegun | Freeze time to test stale-worker detection |
| CLI commands | pytest + unittest.mock | Mock StorageClient and httpx |
| Config loading | pytest + tmp_path | Write YAML to temp dir, assert round-trip |

Container Registry

GHCR (GitHub Container Registry). Images tagged as:

  • ghcr.io/<org>/relaymd-worker:<tag>
  • ghcr.io/<org>/relaymd-orchestrator:<tag>

Use immutable SHA tags for production deployments.


Open Items

  • AToM-OpenMM checkpoint glob pattern — what files does AToM actually write? Confirm during end-to-end testing.
  • Salad GPU model strings — the VRAM_TIERS dict in scheduling.py needs exact nvidia-smi model name strings from real Salad nodes.
  • clusterB cross-cluster sbatch — submitting SLURM jobs to clusterB from a clusterA-hosted orchestrator requires SSH. Not yet implemented.