# RelayMD: Tech Stack & Development Guidelines
## Guiding Principles

- **Apptainer-first.** The worker runs inside an Apptainer `.sif` container. All Python dependencies — including the MD engine and the `relaymd-worker` package — must be installed inside the image. Do not rely on the host environment for anything.
- **No conda environments.** Conda adds significant image size and slow solve times. Use pip inside the container image. The base image should provide a suitable Python (3.11+) directly.
- **uv for local installs.** Use uv — not pip, not conda — for local development and non-container workflows. Production HPC deployments run both the orchestrator and the worker from GHCR-backed Apptainer images.
- **Pydantic everywhere.** All structured data — API request/response bodies, config objects, DB models, inter-component messages — should be typed with Pydantic. If you are writing a dict with string keys to pass data between functions, use a Pydantic model instead.
- **One binary for the operator.** The `relaymd` CLI is compiled to a self-contained ELF binary with PyInstaller and distributed via GitHub Releases. No Python environment is required on the machine that submits jobs.
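A minimal sketch of the Pydantic principle in practice — the model and field names below are hypothetical, not part of the RelayMD codebase:

```python
from pydantic import BaseModel

# Instead of passing {"job_id": ..., "gpu_model": ...} dicts between
# functions, define a typed model once. Names here are illustrative only.
class JobAssignment(BaseModel):
    job_id: str
    gpu_model: str
    vram_gb: int

def describe(assignment: JobAssignment) -> str:
    # Attribute access is validated and typed; a typo or wrong type
    # fails loudly at construction time instead of deep in a call chain.
    return f"{assignment.job_id} on {assignment.gpu_model} ({assignment.vram_gb} GB)"
```

A bad payload (e.g. `vram_gb="lots"`) raises a `ValidationError` at the boundary, which is exactly the "break loudly" behavior the principle asks for.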
## Language & Runtime

| Decision | Choice | Rationale |
|---|---|---|
| Language | Python 3.11+ | Required by alchemical MD workloads; modern typing |
| Package manager (login node) | uv | Fast, reproducible, lockfile-based local/dev installs |
| Package manager (container) | pip inside Apptainer/Docker | No conda; uv can also be used in the Dockerfile |
| CLI distribution | PyInstaller single binary | No Python env required on HPC login node |
| Python typing | strict Pydantic throughout | See guiding principle above |
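For local development on a login node, the uv workflow is roughly the following (a sketch assuming you are at the repo root with uv installed; not a verified recipe):

```shell
uv sync                 # create/refresh the project venv from the workspace lockfile
uv run relaymd --help   # invoke the CLI inside that environment
uv run pytest           # run the test suite
```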
## Repository Layout
```
relaymd/
├── src/
│   └── relaymd/
│       ├── cli/               # Operator CLI commands
│       └── orchestrator/      # FastAPI app, DB, scheduler, sbatch
├── packages/
│   ├── relaymd-core/          # Shared models + storage only
│   │   └── src/relaymd/
│   │       ├── models/
│   │       └── storage/
│   └── relaymd-worker/        # Worker bootstrap, main loop, heartbeat
├── deploy/
│   ├── slurm/                 # SLURM .sbatch.j2 templates + cluster configs
│   ├── salad/                 # Salad Cloud container group config
│   ├── tmux/                  # tmux launcher script
│   └── config.example.yaml    # Canonical reference config for orchestrator + CLI
├── docs/
│   ├── architecture.md
│   ├── scheduling.md
│   ├── api-schema.md
│   ├── cli.md                 # CLI install and usage guide
│   ├── deployment.md          # Orchestrator deployment guide
│   ├── hpc-notes.md           # Apptainer + Tailscale runbook
│   └── storage-layout.md
├── frontend/                  # React operator dashboard served by FastAPI
├── Dockerfile                 # Worker container image
├── Dockerfile.orchestrator    # Orchestrator container image (+ bundled frontend)
└── pyproject.toml             # Root relaymd package + workspace config
```
`relaymd-core` is the shared dependency layer: it carries only `relaymd.models` + `relaymd.storage`. The worker container installs `relaymd-core` + `relaymd-worker` only; it does not install `relaymd` (and therefore does not pull FastAPI, uvicorn, alembic, or typer). A uv workspace at the repo root manages these three packages with one lockfile. The frontend uses npm under `frontend/`, with cache and build output kept inside the repo.
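The workspace wiring in the root `pyproject.toml` might look like this (a sketch based on the layout above; the exact dependency lists are assumptions):

```toml
[project]
name = "relaymd"
requires-python = ">=3.11"
dependencies = ["relaymd-core"]

# Declare the member packages so one lockfile covers all three.
[tool.uv.workspace]
members = ["packages/relaymd-core", "packages/relaymd-worker"]

# Resolve relaymd-core from the workspace, not from an index.
[tool.uv.sources]
relaymd-core = { workspace = true }
```

The root package is implicitly a workspace member, so `uv sync` at the repo root installs all three packages in one editable environment.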
## Shared Data Models

All API request/response models live in `relaymd-core` under `relaymd.models`, shared by `relaymd` and `relaymd-worker`. If a field changes, it changes in one place and all consumers break loudly at import time rather than silently at runtime.
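For example, a model defined once in the shared package is imported by both the orchestrator and the worker — the model and field names below are hypothetical stand-ins for whatever actually lives in `relaymd.models`:

```python
from pydantic import BaseModel

# Hypothetical shared request model, as it might live in relaymd.models.
# Both the orchestrator endpoint and the worker's client import this
# one class, so renaming a field breaks both sides immediately.
class HeartbeatRequest(BaseModel):
    worker_id: str
    gpu_model: str
    jobs_running: int

payload = HeartbeatRequest(worker_id="w-01", gpu_model="RTX 4090", jobs_running=1)
```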
## Logging

structlog with JSON output in production and ConsoleRenderer in development (detected via `RELAYMD_ENV=development`). JSON serialization uses orjson for performance. All log statements use keyword arguments — no f-string messages.

Both the orchestrator and worker have a `logging.py` module that configures structlog once on startup and exposes `get_logger(name)`.
## Testing

| Layer | Framework | Notes |
|---|---|---|
| Orchestrator API | pytest + httpx `AsyncClient` | In-memory SQLite DB per test |
| Worker logic | pytest + `unittest.mock` | Mock Infisical, B2, subprocess |
| Storage module | pytest + moto | moto mocks the S3/B2 API locally |
| Scheduling loops | pytest + freezegun | Freeze time to test stale worker detection |
| CLI commands | pytest + `unittest.mock` | Mock `StorageClient` and httpx |
| Config loading | pytest + `tmp_path` | Write YAML to temp dir, assert round-trip |
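The worker-logic row can be sketched with stdlib tools alone — `run_engine` below is a hypothetical stand-in for the worker's real engine launcher, not RelayMD code:

```python
from unittest import mock
import subprocess

def run_engine(cmd: list[str]) -> int:
    """Hypothetical stand-in for the worker's MD-engine launcher."""
    return subprocess.run(cmd, check=False).returncode

def test_run_engine_invokes_subprocess():
    # Patch subprocess.run so no real process is spawned; the test
    # asserts only on how the launcher uses the subprocess API.
    with mock.patch("subprocess.run") as m:
        m.return_value = mock.Mock(returncode=0)
        assert run_engine(["omm", "--config", "x.yaml"]) == 0
        m.assert_called_once()
```

The same pattern (patch the boundary, assert on the call) applies to the Infisical and B2 mocks in the real worker tests.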
## Container Registry

GHCR (GitHub Container Registry). Images tagged as:

```
ghcr.io/<org>/relaymd-worker:<tag>
ghcr.io/<org>/relaymd-orchestrator:<tag>
```

Use immutable SHA tags for production deployments.
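On the HPC side, a digest-pinned image can be pulled into a `.sif` like this (the digest is a placeholder; substitute a real one from GHCR):

```shell
apptainer pull relaymd-worker.sif \
  docker://ghcr.io/<org>/relaymd-worker@sha256:<digest>
```

Pinning by digest rather than a mutable tag guarantees every node runs the exact image that was tested.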
## Open Items

- AToM-OpenMM checkpoint glob pattern — what files does AToM actually write? Confirm during end-to-end testing.
- Salad GPU model strings — the `VRAM_TIERS` dict in `scheduling.py` needs exact `nvidia-smi` model name strings from real Salad nodes.
- clusterB cross-cluster sbatch — submitting SLURM jobs to clusterB from a clusterA-hosted orchestrator requires SSH. Not yet implemented.