Orchestration Rework

Vision

This effort defines and documents a new orchestration architecture for simulation jobs that is faster, more observable, and production-ready across both HPC and cloud deployments.

The design process is intentionally structured so it can be reused for future cross-repository architecture reworks.

Goals

Researchers can inspect meaningful intermediate results before a full simulation completes.
Users can run larger simulations with predictable turnaround times.
Users can run multiple simulations concurrently without destabilizing the platform.
Operators can understand system state quickly and diagnose failures with confidence.
The platform behaves reliably under transient infrastructure failures and recovers gracefully.
Architecture decisions are driven by measured workload characteristics, not assumptions.

User Stories

As a researcher, I want to see early simulation signals so I can stop bad runs quickly.
As a researcher, I want to stop a simulation at any time and collect the results produced so far.
As a team lead, I want predictable cluster usage so I can plan shared resource allocation.
As an operator, I want to detect degraded system behavior before users are blocked.
As an operator, I want retry and failure semantics to be explicit so incidents are easier to resolve.
As a product owner, I want clear performance baselines so architectural options can be compared fairly.

Discovery Backlog

The full discovery backlog (31 open questions with status, owner, confidence, and estimate fields) lives on a dedicated page:

Open Questions and Estimation Log

Priority unresolved topics for near-term discussion:

Time-to-first-insight target for users while a job is still running.
Typical and p95 HPC queue wait by partition/class (for example, Ares).
Expected parallel tasks per simulation and parallel simulations per user.
Typical dump time and result cardinality (files and pages per simulation).
Required reliability semantics for final result persistence.
Evidence threshold needed before selecting transport/merge architecture options.
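
To make the queue-wait question concrete, the following is a minimal sketch of how typical (median) and p95 wait could be summarized per partition once measurements exist. It assumes wait samples are available as (partition, wait_seconds) pairs; the function names, the record shape, and the choice of the nearest-rank percentile method are illustrative assumptions, not part of the current platform.

```python
import math
from collections import defaultdict
from statistics import median


def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    # 1-based rank of the smallest value covering pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def queue_wait_summary(records):
    """Group (partition, wait_seconds) pairs and report median and p95 per partition.

    The record shape is a hypothetical one chosen for this sketch.
    """
    waits = defaultdict(list)
    for partition, wait_seconds in records:
        waits[partition].append(wait_seconds)
    return {
        partition: {
            "median": median(samples),
            "p95": nearest_rank_percentile(samples, 95),
            "samples": len(samples),
        }
        for partition, samples in waits.items()
    }
```

Reporting the sample count next to each figure helps readers judge how much weight a baseline deserves; a p95 computed from a handful of jobs is weak evidence for an architecture decision.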

Scope

Primary repositories in scope:

Possible secondary changes:

Process Structure

This section follows a reusable pattern:

context: as-is architecture, bottlenecks, constraints, and user requirements
adr: decision records with explicit status changes
design: implementation-ready design documents
research: AI-assisted investigation logs for auditability
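
As an illustration of what "explicit status changes" for decision records could mean in practice, here is a small sketch of a status history guarded by a transition table. The status names (proposed, accepted, rejected, superseded) and the allowed transitions are assumptions borrowed from common ADR conventions; the actual ADR template for this effort may define a different set.

```python
from enum import Enum


class AdrStatus(Enum):
    # Hypothetical status set; the real ADR template may differ.
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"


# Assumed transition table: which statuses may follow the current one.
ALLOWED_TRANSITIONS = {
    AdrStatus.PROPOSED: {AdrStatus.ACCEPTED, AdrStatus.REJECTED},
    AdrStatus.ACCEPTED: {AdrStatus.SUPERSEDED},
    AdrStatus.REJECTED: set(),
    AdrStatus.SUPERSEDED: set(),
}


def change_status(history, new_status):
    """Return a new history with new_status appended, or raise if not allowed."""
    current = history[-1]
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {new_status.value} is not allowed")
    return history + [new_status]
```

Keeping the full history rather than only the latest status preserves the audit trail: a reader can see that a record was proposed, then accepted, then superseded, instead of only its final state.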

Phase 0 Deliverables

Bootstrap the documentation structure under this section.
Add the ADR registry and template.
Wire sidebar navigation for the full process structure.
Create cross-repository tracking artifacts for implementation planning.