W-167 E2E Integration Test Report
Date: <YYYY-MM-DD>
Operator: <name>
Environment: <cluster names / orchestrator host>
Issue: W-167
Scope
Manual end-to-end validation of one RelayMD job across at least two worker handoffs, including checkpoint continuity and final completion.
Preflight
- [ ] Orchestrator running persistently (
tmuxor systemd) - [ ] API token confirmed (
X-API-Token) - [ ] Infisical secrets available to workers
- [ ] SLURM cluster config active in orchestrator config
- [ ] B2 credentials valid for upload/list/head operations
Input Bundle
Local input bundle path: <path>
relaymd-worker.json used for this run:
Bundle archive creation:
Upload to B2 key required by this ticket:
aws s3api put-object \
--endpoint-url "$B2_ENDPOINT_URL" \
--bucket "$B2_BUCKET_NAME" \
--key "jobs/e2e-test-001/input/" \
--body /tmp/e2e-test-001.bundle.tar.gz
Initial B2 key listing:
aws s3api list-objects-v2 \
--endpoint-url "$B2_ENDPOINT_URL" \
--bucket "$B2_BUCKET_NAME" \
--prefix "jobs/e2e-test-001/" \
--query 'Contents[].{Key:Key,Size:Size,LastModified:LastModified}'
Job Registration
Create job:
JOB_JSON=$(curl -sS -X POST "$ORCHESTRATOR_URL/jobs" \
-H "X-API-Token: $RELAYMD_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"title":"e2e-test-001","input_bundle_path":"jobs/e2e-test-001/input/"}')
JOB_ID=$(jq -r '.id' <<<"$JOB_JSON")
echo "$JOB_ID"
Recorded job ID: <uuid>
Live Monitoring Commands
Orchestrator logs (tmux deployment):
tmux capture-pane -pt relaymd:0 -S -800 | rg "/workers/register|/jobs/request|/checkpoint|/complete|/fail|/workers/.*/heartbeat"
SLURM worker logs:
Job state poll:
watch -n 10 "curl -sS \"$ORCHESTRATOR_URL/jobs/$JOB_ID\" -H \"X-API-Token: $RELAYMD_API_TOKEN\" | jq '{status,assigned_worker_id,latest_checkpoint_path,last_checkpoint_at,updated_at}'"
Timeline
| Time (UTC) | Event | Evidence |
|---|---|---|
<hh:mm:ss> |
Job created | POST /jobs response |
<hh:mm:ss> |
Worker #1 submitted | sbatch id / scheduler log |
<hh:mm:ss> |
Worker #1 registered | orchestrator access log |
<hh:mm:ss> |
Job assigned | POST /jobs/request log |
<hh:mm:ss> |
First checkpoint uploaded | POST /jobs/{id}/checkpoint + B2 head |
<hh:mm:ss> |
Worker #1 cancelled (scancel) |
SLURM command output |
<hh:mm:ss> |
Job re-queued with checkpoint path preserved | GET /jobs/{id} payload |
<hh:mm:ss> |
Worker #2 registered/assigned | orchestrator access log |
<hh:mm:ss> |
Worker #2 resumes from checkpoint | AToM-OpenMM step evidence |
<hh:mm:ss> |
Job completed | GET /jobs/{id} status=completed |
Lifecycle Evidence
Orchestrator/API evidence commands:
curl -sS "$ORCHESTRATOR_URL/jobs/$JOB_ID" \
-H "X-API-Token: $RELAYMD_API_TOKEN" | jq
curl -sS "$ORCHESTRATOR_URL/workers" \
-H "X-API-Token: $RELAYMD_API_TOKEN" | jq
Checkpoint key verification:
aws s3api head-object \
--endpoint-url "$B2_ENDPOINT_URL" \
--bucket "$B2_BUCKET_NAME" \
--key "jobs/$JOB_ID/checkpoints/latest"
Handoff Validation
Worker #1 SLURM job ID: <id>
Worker #2 SLURM job ID: <id>
Cancellation command used:
Expected/observed after cancellation:
- [ ] Job returned to
queued/reassignable state within reaper window - [ ]
latest_checkpoint_pathremained non-null and unchanged - [ ] Worker #2 downloaded checkpoint and resumed
GET /jobs/{id} snapshots:
- Before cancel:
<json snippet> - After cancel/requeue:
<json snippet> - After reassignment:
<json snippet>
Step Continuity Check
| Metric | Value |
|---|---|
| Worker #1 last reported step before cancel | <n> |
| Worker #2 first reported step after resume | <m> |
Continuity result (m >= n and not reset to 0) |
<pass/fail> |
AToM-OpenMM log excerpts:
<paste concise worker #1 excerpt showing final step>
<paste concise worker #2 excerpt showing resumed step>
B2 Key Listing by Stage
| Stage | Command | Output summary |
|---|---|---|
| After input upload | list-objects --prefix jobs/e2e-test-001/ |
<keys> |
| After first checkpoint | head/list jobs/$JOB_ID/checkpoints/ |
<keys + timestamp> |
| After handoff | head/list jobs/$JOB_ID/checkpoints/ |
<unchanged or updated> |
| At completion | list jobs/$JOB_ID/ |
<final keys> |
Checkpoint Glob Pattern Finalization
Patterns tested:
<pattern A><pattern B>
Final selected pattern: <pattern>
Reason:
<why this pattern reliably matched AToM-OpenMM checkpoints in this run>
Issues Found
| Ticket | Severity | Description | Repro/Evidence |
|---|---|---|---|
<W-xxx> |
<high/med/low> |
<issue> |
<log/link> |
If no issues:
No new issues were identified during this run.
Definition of Done Checklist
- [ ] Job completes end-to-end across at least two worker handoffs with no manual intervention between them
- [ ] Checkpoint continuity verified: second worker step count starts where first left off
- [ ]
latest_checkpoint_pathpreserved in orchestrator DB across the handoff - [ ] AToM-OpenMM checkpoint glob pattern finalized and documented
- [ ] All issues found filed as new Linear tickets
- [ ] Commit message and PR description include
fixes W-167
Implementation Notes (Observed During Setup)
- Worker writes checkpoints to
jobs/<job_id>/checkpoints/latest, where<job_id>is the UUID returned byPOST /jobs. POST /jobscurrently acceptstitleandinput_bundle_pathonly.- Job status may remain
assigneduntil terminal transition (completed/failed/cancelled), while checkpoint updates are still accepted inassigned. - Successful Infisical fetch and Tailscale join are inferred from worker registration (
POST /workers/register) and subsequent API activity; bootstrap code does not emit explicit success logs.