
Orchestration Rework

This effort defines and documents a new orchestration architecture for simulation jobs that is faster, more observable, and production-ready across both HPC and cloud deployments.

The design process is intentionally structured so it can be reused for future cross-repository architecture reworks.

  1. Researchers can inspect meaningful intermediate results before a full simulation completes.
  2. Users can run larger simulations with predictable turnaround times.
  3. Users can run multiple simulations concurrently without unstable platform behavior.
  4. Operators can understand system state quickly and diagnose failures with confidence.
  5. The platform behaves reliably under transient infrastructure failures and recovers gracefully.
  6. Architecture decisions are driven by measured workload characteristics, not assumptions.

  1. As a researcher, I want to see early simulation signals so I can stop bad runs quickly.
  2. As a researcher, I want to be able to stop a simulation at any time and gather its results.
  3. As a team lead, I want predictable cluster usage so I can plan shared resource allocation.
  4. As an operator, I want to detect degraded system behavior before users are blocked.
  5. As an operator, I want retry and failure semantics to be explicit so incidents are easier to resolve.
  6. As a product owner, I want clear performance baselines so architectural options can be compared fairly.

The full discovery backlog (31 open questions with status, owner, confidence, and estimate fields) lives on a dedicated page.

Priority unresolved topics for near-term discussion:

  1. Time-to-first-insight target for users while a job is still running.
  2. Typical and p95 HPC queue wait time by partition/class (for example, Ares).
  3. Expected parallel tasks per simulation and parallel simulations per user.
  4. Typical dump time and result cardinality (files and pages per simulation).
  5. Required reliability semantics for final result persistence.
  6. Evidence threshold needed before selecting transport/merge architecture options.
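
Topic 2 asks for a distribution rather than an average. As a minimal sketch of how typical (median) and p95 queue wait could be summarized per partition, assuming wait times have already been collected as `(partition, wait_seconds)` pairs. The function name and input shape are hypothetical, not part of any existing tooling:

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize_queue_wait(samples):
    """Return {partition: (median_s, p95_s)} from (partition, wait_seconds) pairs."""
    by_partition = defaultdict(list)
    for partition, wait_s in samples:
        by_partition[partition].append(wait_s)

    summary = {}
    for partition, waits in by_partition.items():
        if len(waits) < 2:
            # quantiles() needs at least two data points; fall back to the single value.
            summary[partition] = (waits[0], waits[0])
            continue
        # n=100 yields 99 cut points; index 94 is the 95th percentile.
        p95 = quantiles(waits, n=100, method="inclusive")[94]
        summary[partition] = (median(waits), p95)
    return summary
```

Reporting median and p95 side by side keeps a long tail visible that a mean would hide, which matters when the goal is predictable turnaround.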

Primary repositories in scope:

Possible secondary changes:

This section follows a reusable pattern:

  • context: as-is architecture, bottlenecks, constraints, and user requirements
  • adr: decision records with explicit status changes
  • design: implementation-ready design documents
  • research: AI-assisted investigation logs for auditability

  • Bootstrap the documentation structure under this section.
  • Add the ADR registry and template.
  • Wire sidebar navigation for the full process structure.
  • Create cross-repository tracking artifacts for implementation planning.
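
The bootstrap step could be scripted. The following is a minimal sketch, assuming each of the four pattern folders gets a placeholder page; the `index.md` file name, the `docs_root` argument, and the function name are hypothetical and not taken from the repository:

```python
from pathlib import Path

# Folder names and descriptions follow the reusable pattern listed above.
SECTIONS = {
    "context": "As-is architecture, bottlenecks, constraints, and user requirements",
    "adr": "Decision records with explicit status changes",
    "design": "Implementation-ready design documents",
    "research": "AI-assisted investigation logs for auditability",
}


def bootstrap_docs(docs_root):
    """Create one folder per section with a placeholder page; return the files created."""
    created = []
    for name, description in SECTIONS.items():
        folder = Path(docs_root) / name
        folder.mkdir(parents=True, exist_ok=True)
        index = folder / "index.md"  # hypothetical file name
        if not index.exists():  # never overwrite pages that already exist
            index.write_text(f"# {name}\n\n{description}.\n", encoding="utf-8")
            created.append(index)
    return created
```

Because existing pages are left untouched, the sketch can be rerun safely when the same structure is reused for a later rework.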