Simulation Lifecycle
Every simulation goes through a defined state machine. This page documents the states, transitions, and the differences between direct (Celery) and batch (Slurm) execution paths.
Job States
    UNKNOWN ─────> PENDING ─────> RUNNING ─────> MERGING_QUEUED ─────> MERGING_RUNNING ─────> COMPLETED
                      │              │
                      └──────────────┴───────────────────────────────────────────────────────> FAILED
                      │              │
                      └──────────────┴───────────────────────────────────────────────────────> CANCELED

| State | Description |
|---|---|
| UNKNOWN | Initial state, before any processing |
| PENDING | Job accepted, tasks being created |
| RUNNING | At least one task is actively simulating |
| MERGING_QUEUED | All tasks complete, merge task is waiting in the Celery queue |
| MERGING_RUNNING | Merge task is actively averaging results |
| COMPLETED | Results stored, job finished successfully |
| FAILED | One or more tasks failed, or the merge failed |
| CANCELED | User or system canceled the job |
These states are defined in utils/enums.py as the EntityState enum.
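As an illustration of the state machine, the sketch below models the job states and a transition check. The enum's string values, the `NEXT_STATE` table, and `can_transition` are assumptions made for this example, not the contents of utils/enums.py. It also assumes that any non-terminal state may move to FAILED or CANCELED, which is slightly broader than the diagram (which draws those arrows only from PENDING and RUNNING) but is consistent with the table's note that a merge can fail.

```python
from enum import Enum


class EntityState(Enum):
    """Job states from the table above; the string values are assumed."""
    UNKNOWN = "UNKNOWN"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    MERGING_QUEUED = "MERGING_QUEUED"
    MERGING_RUNNING = "MERGING_RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Terminal states have no outgoing transitions.
TERMINAL = {EntityState.COMPLETED, EntityState.FAILED, EntityState.CANCELED}

# Forward path of a successful job (hypothetical lookup table).
NEXT_STATE = {
    EntityState.UNKNOWN: EntityState.PENDING,
    EntityState.PENDING: EntityState.RUNNING,
    EntityState.RUNNING: EntityState.MERGING_QUEUED,
    EntityState.MERGING_QUEUED: EntityState.MERGING_RUNNING,
    EntityState.MERGING_RUNNING: EntityState.COMPLETED,
}


def can_transition(current: EntityState, target: EntityState) -> bool:
    """Return True if a job may move from `current` to `target`."""
    if current in TERMINAL:
        return False
    # Assumption: every non-terminal state may fail or be canceled.
    if target in (EntityState.FAILED, EntityState.CANCELED):
        return True
    return NEXT_STATE.get(current) is target
```

A guard like this would typically run before a state update is written, so that a late progress report cannot move a finished job back to RUNNING.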
Task States
Each job contains N tasks (one per parallel simulation run). Tasks have their own state:
| State | Description |
|---|---|
| PENDING | Task created, waiting for a worker |
| RUNNING | Simulator binary is executing |
| COMPLETED | Simulation finished, output available |
| FAILED | Simulator crashed or timed out |
| CANCELED | Task was revoked |
Direct Execution (Celery)
Submission

    POST /jobs/direct
      → Create CelerySimulationModel (state: PENDING)
      → Create N CeleryTaskModel rows (state: PENDING)
      → Convert editor JSON → simulator input files
      → Dispatch Celery chord: group(run_single_simulation × N) | get_job_results

Task Execution
Each run_single_simulation task:
- Receives input files and a task index
- Creates a temporary directory
- Writes input files
- Spawns the simulator binary (shieldhit, fluka_sim) as a subprocess
- Starts a monitoring thread that reads stdout/logfiles for progress
- Periodically POSTs progress to POST /tasks:

      {"task_id": 0, "simulated_primaries": 5000, "requested_primaries": 10000, "estimated_time": 42}

- On completion, returns the output files (estimator data)
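The progress body shown above could be assembled roughly as follows. This is a minimal sketch: the function name `build_progress_payload` is hypothetical, and it assumes that `estimated_time` is the remaining time in seconds, extrapolated linearly from the rate observed so far. The source does not state how the real monitoring thread derives that value.

```python
def build_progress_payload(task_id, simulated_primaries, requested_primaries,
                           elapsed_seconds):
    """Build the JSON-serializable body for POST /tasks.

    Assumption: estimated_time is the remaining time in seconds, obtained by
    linear extrapolation of the primaries simulated so far.
    """
    payload = {
        "task_id": task_id,
        "simulated_primaries": simulated_primaries,
        "requested_primaries": requested_primaries,
    }
    # An estimate is only possible once some primaries have been simulated.
    if simulated_primaries > 0 and elapsed_seconds > 0:
        remaining = max(requested_primaries - simulated_primaries, 0)
        seconds_left = remaining * elapsed_seconds / simulated_primaries
        payload["estimated_time"] = int(round(seconds_left))
    return payload
```

With 5000 of 10000 primaries done after 42 seconds, this yields the same body as the example above. Before any primaries are reported, the sketch simply omits `estimated_time` rather than guessing.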
Merge Step
When all N tasks complete, the get_job_results callback task runs:
- Collects estimator data from all N tasks
- Averages the results (weighted by primaries per task)
- Compresses and stores: EstimatorModel → PageModel
- Updates job state to COMPLETED
If any task fails, the merge is skipped and the job state is set to FAILED.
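The averaging step can be pictured as a weighted mean, where each task's estimator values are weighted by the number of primaries that task simulated. The sketch below uses plain Python lists and a hypothetical function name, `merge_estimators`. The actual merge code and its data structures are not shown in this page, so treat this as an illustration of the arithmetic only.

```python
def merge_estimators(task_results, primaries):
    """Average per-task estimator values, weighted by primaries per task.

    task_results: one list of bin values per task (all the same length).
    primaries:    number of primaries simulated by each task.
    """
    if not task_results or len(task_results) != len(primaries):
        raise ValueError("need one primaries count per task result")
    total = sum(primaries)
    if total <= 0:
        raise ValueError("total number of primaries must be positive")
    n_bins = len(task_results[0])
    if any(len(values) != n_bins for values in task_results):
        raise ValueError("all task results must have the same number of bins")

    merged = [0.0] * n_bins
    for values, weight in zip(task_results, primaries):
        for i, value in enumerate(values):
            # Each task contributes in proportion to its share of primaries.
            merged[i] += value * weight / total
    return merged
```

For two tasks with values `[1.0, 2.0]` and `[3.0, 4.0]` and 1000 and 3000 primaries respectively, the second task carries three quarters of the weight.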
Cancellation
    DELETE /jobs/direct?job_id=<id>
      → Revoke all Celery tasks (terminate=True)
      → Set job state to CANCELED

Task Time Limit
Simulation tasks have a 10-hour hard time limit (configured in the Celery worker). Tasks exceeding this are killed.
Batch Execution (Slurm via SSH)
Submission

    POST /jobs/batch
      → Create BatchSimulationModel (state: PENDING)
      → Dispatch helper_worker.submit_job task

The submit_job task on the helper worker:
- Connects to the HPC cluster via SSH (using Fabric and the user’s PLGrid SSH certificate from KeycloakUserModel)
- Creates a remote working directory
- Uploads:
  - Compressed simulation input files
  - A watcher script (monitors each array task)
  - A data-sender script (POSTs results back to YAPTIDE)
- Submits a Slurm array job:

      sbatch --array=0-<N-1> run_simulation.sh

- Submits a collect job (depends on the array job):

      sbatch --dependency=afterok:<array_id> collect_results.sh

- Stores the array and collect Slurm job IDs in BatchSimulationModel
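The two sbatch calls above depend on each other: the collect job needs the ID that Slurm prints when the array job is submitted. The sketch below builds both command lines and parses the standard `Submitted batch job <id>` message. The function names are hypothetical, and the SSH transport (Fabric in the real helper worker) is left out on purpose.

```python
import re


def array_job_command(n_tasks, script="run_simulation.sh"):
    """Command for an array job with indices 0 .. n_tasks - 1."""
    if n_tasks < 1:
        raise ValueError("an array job needs at least one task")
    return ["sbatch", f"--array=0-{n_tasks - 1}", script]


def collect_job_command(array_job_id, script="collect_results.sh"):
    """Command for a collect job that starts only if the array job succeeds."""
    return ["sbatch", f"--dependency=afterok:{array_job_id}", script]


def parse_job_id(sbatch_stdout):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))
```

In use, the output of the first command would be passed to `parse_job_id`, and the result fed into `collect_job_command`; both IDs are then the ones stored on the job record.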
Progress Monitoring
The watcher script on the cluster:
- Runs alongside each array task
- Monitors simulator output (logfiles, stdout)
- POSTs progress updates to the YAPTIDE backend:

      POST /tasks
      Authorization: Bearer <simulation_update_key>
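A progress update of this shape can be expressed with the standard library alone. The sketch below only constructs the request; sending it would be a call to `urllib.request.urlopen`. The function name, the `base_url` parameter, and the JSON content type are assumptions for illustration, and the real watcher script may use a different HTTP client.

```python
import json
import urllib.request


def build_progress_request(base_url, update_key, payload):
    """Build (but do not send) an authenticated POST /tasks request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tasks",
        data=body,
        headers={
            # The simulation update key travels as a bearer token.
            "Authorization": f"Bearer {update_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping construction separate from sending makes the header and body easy to check without a running backend.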
Status Polling
When the frontend polls GET /jobs/batch?job_id=<id>, the backend:
- Returns cached task states from the database
- Optionally queries sacct on the cluster via SSH to update Slurm job status
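To make the sacct step concrete, the sketch below parses output of the form produced by `sacct -j <array_id> --format=JobID,State --noheader --parsable2`, which prints one `JobID|State` line per job and job step. The mapping from Slurm states to the task states on this page is an assumption, as are the function and table names; the backend's real mapping is not documented here.

```python
# Hypothetical mapping from Slurm job states to the task states listed earlier.
SLURM_TO_TASK_STATE = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETED": "COMPLETED",
    "FAILED": "FAILED",
    "TIMEOUT": "FAILED",
    "CANCELLED": "CANCELED",
}


def parse_sacct(output):
    """Map each array task's Slurm job ID to a task state.

    Expects 'JobID|State' lines. States without a mapping are returned as
    None so the caller can keep the cached state.
    """
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        job_id, _, raw_state = line.partition("|")
        if "." in job_id:
            continue  # skip job steps such as 12345_0.batch
        # sacct can report "CANCELLED by <uid>"; only the first word matters.
        words = raw_state.split()
        slurm_state = words[0] if words else ""
        states[job_id] = SLURM_TO_TASK_STATE.get(slurm_state)
    return states
```

Note the spelling difference: Slurm reports CANCELLED, while the states on this page use CANCELED.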
Result Collection
The collect job on the cluster:
- Runs after all array tasks complete
- Gathers output files from each task directory
- Averages/merges results
- POSTs the final results to POST /results
- The backend stores them as EstimatorModel → PageModel
Cancellation
    DELETE /jobs/batch?job_id=<id>
      → SSH to cluster → scancel <array_id> <collect_id>
      → Set job state to CANCELED

Worker Communication
Both execution paths use HTTP callbacks for workers to report state back to Flask:
| Endpoint | Who Calls It | Purpose |
|---|---|---|
| POST /tasks | Simulation worker / cluster watcher | Update task progress (primaries, estimated time) |
| POST /results | Merge task / collect job | Store final results |
| POST /jobs | Helper worker | Update job-level state |
These internal endpoints are authenticated with a simulation update key — a 7-day JWT generated at job submission and stored (hashed) in the SimulationModel.
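Storing only a hash means the database never holds a usable key, while an incoming token can still be checked by hashing it and comparing. The sketch below shows that pattern with SHA-256 and a constant-time comparison. The hash algorithm and both function names are assumptions, since this page says only that the key is stored hashed, and JWT creation and expiry checks are left out.

```python
import hashlib
import hmac


def hash_update_key(token):
    """Hash the token for storage (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_update_key(presented_token, stored_hash):
    """Check a presented token against the stored hash in constant time."""
    return hmac.compare_digest(hash_update_key(presented_token), stored_hash)
```

At submission time the result of `hash_update_key` would be saved with the job; each callback request then passes its bearer token to `verify_update_key`.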
Polling Pattern
The frontend polls for job status using this pattern:

    1. POST /jobs/direct → { job_id }
    2. Loop:
         GET /jobs/direct?job_id=<id>
           → If RUNNING: show progress bars (primaries/estimated_time per task)
           → If COMPLETED: GET /results?job_id=<id> → render plots
           → If FAILED: show error
           → If CANCELED: show cancellation notice
         Wait 2–5 seconds, repeat

The polling interval increases as the simulation runs longer to reduce server load.
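The loop and its growing interval can be sketched as below. This is written in Python for illustration only and is not the frontend's actual code. The 2 and 5 second bounds come from the pattern above, while the growth factor of 1.25, the `max_polls` cap, and the function names are assumptions. `fetch_state` stands in for the GET /jobs/direct call and returns the job state as a string.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def polling_intervals(start=2.0, maximum=5.0, factor=1.25):
    """Yield wait times that grow from `start` and are capped at `maximum`."""
    interval = start
    while True:
        yield min(interval, maximum)
        interval *= factor


def poll_job(fetch_state, sleep=time.sleep, max_polls=1000):
    """Call fetch_state() until the job reaches a terminal state."""
    intervals = polling_intervals()
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Still PENDING, RUNNING, or merging: wait a little longer each time.
        sleep(next(intervals))
    raise TimeoutError("job did not reach a terminal state")
```

Passing `sleep` as a parameter lets the loop be exercised without real delays, for example by collecting the requested wait times in a list.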