Open Questions and Estimation Log
This page tracks open discovery questions that should be answered before locking Phase 2/3 architecture choices.
Status values:
- Open
- In progress
- Answered
- Needs measurement
Response fields to fill for each question:
- Status (Open/In progress/Answered/Needs measurement): Open
- Notes:
A) User Workflow and Product Expectations
- What is an acceptable time-to-first-insight for a typical user run?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What is an acceptable total turnaround time for small, medium, and large simulations?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Which intermediate results are most valuable to users while a run is in RUNNING status?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- How much partial-result staleness is acceptable (for example, 10 s vs 60 s)?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Do users prefer fewer high-confidence updates or frequent best-effort updates?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Which user personas need queue predictions, and which need detailed task-level telemetry?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
B) HPC Connectivity and Queue Characteristics
- What is the typical time needed to connect to target HPC clusters (for example, Ares)?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What is the p50/p95 time that jobs spend in HPC waiting queues, by partition/class?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- How often do HPC connection/setup failures occur, and what are the top failure modes?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Are there cluster-side limits that strongly affect orchestration design (rate limits, job caps, walltime constraints)?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Which queue-delay patterns are predictable enough to model in UI ETA messaging?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
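The p50/p95 queue-wait question above can be answered from raw scheduler samples once they are collected. The following is a minimal sketch, assuming the samples are available as `(partition, wait_seconds)` pairs; the function name `wait_percentiles` and the partition labels are hypothetical and not part of any existing tooling.

```python
from collections import defaultdict
from statistics import quantiles


def wait_percentiles(samples):
    """Summarize queue waits per partition.

    samples: iterable of (partition, wait_seconds) pairs (assumed input shape).
    Returns {partition: {"n": count, "p50": seconds, "p95": seconds}}.
    """
    by_partition = defaultdict(list)
    for partition, wait_s in samples:
        by_partition[partition].append(float(wait_s))

    summary = {}
    for partition, waits in by_partition.items():
        if len(waits) < 2:
            # Too few samples to estimate percentiles; report the count only.
            summary[partition] = {"n": len(waits), "p50": None, "p95": None}
            continue
        # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
        cuts = quantiles(waits, n=100, method="inclusive")
        summary[partition] = {"n": len(waits), "p50": cuts[49], "p95": cuts[94]}
    return summary


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    demo = [("cpu", w) for w in (10, 20, 30, 40, 50)] + [("gpu", 5)]
    print(wait_percentiles(demo))
```

Recording the sample count `n` next to each percentile makes it easier to tell a measured answer from one that should stay in "Needs measurement".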
C) Parallelism and Concurrency Demand
- How many parallel tasks per simulation do users typically want on HPC?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What is the realistic upper bound on the number of tasks users request per simulation?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- How many simulations does a single user expect to run in parallel?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- How many concurrent active users should the system support in normal and peak periods?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What fairness model is expected when concurrent user demand exceeds capacity?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
D) Result Shape and Data Volume
- How long does it take to dump a single binary result artifact to disk (p50/p95)?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- How many result files are produced per simulation for each supported simulator?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- How many pages does a result file typically contain?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What are typical and worst-case per-page array sizes?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What are typical and worst-case merged-result payload sizes?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Which result subsets do users most often view first?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
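For the dump-timing question above, a rough measurement harness could look like the sketch below. It assumes a result artifact can be represented as a `bytes` payload and that writing it to a scratch directory on the target filesystem is representative. The function name `time_dump` and its parameters are hypothetical, and the real result format is not modeled.

```python
import os
import tempfile
import time


def time_dump(payload, directory, repeats=5):
    """Write `payload` to a fresh file `repeats` times.

    Returns a list of elapsed wall-clock seconds, one per write.
    Each temporary file is removed after it has been timed.
    """
    timings = []
    for _ in range(repeats):
        fd, path = tempfile.mkstemp(dir=directory, suffix=".bin")
        try:
            start = time.perf_counter()
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                # fsync so the timing includes pushing data to storage,
                # not only copying it into the OS page cache.
                os.fsync(f.fileno())
            timings.append(time.perf_counter() - start)
        finally:
            os.remove(path)
    return timings


if __name__ == "__main__":
    # Hypothetical 8 MiB payload; replace with a representative artifact size.
    with tempfile.TemporaryDirectory() as scratch:
        print(time_dump(b"\0" * (8 * 1024 * 1024), scratch))
```

Running this on the filesystem the simulators actually write to (rather than a local laptop disk) matters, because shared HPC storage can behave very differently under load.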
E) Reliability, Recovery, and Operations
- What data loss tolerance is acceptable for task progress updates?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Are at-least-once or exactly-once semantics required for final result persistence?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What recovery time objective is acceptable after transient broker/network outages?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Which retries should be automatic, and which should be operator-controlled?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What minimum level of observability is required for incident triage (logs, traces, metrics)?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
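To make the at-least-once versus exactly-once question above more concrete: under at-least-once delivery the same final result may arrive more than once, and one common way to tolerate that is an idempotent write keyed by a stable task identifier. The sketch below is only an illustration under that assumption; `ResultStore` is a hypothetical in-memory stand-in, not the project's persistence layer.

```python
class ResultStore:
    """Hypothetical in-memory store with idempotent writes keyed by task id."""

    def __init__(self):
        self._results = {}

    def persist(self, task_id, payload):
        """Store a final result once.

        Returns True if the payload was stored, False if a result for
        `task_id` already exists (the duplicate delivery is ignored).
        """
        if task_id in self._results:
            return False
        self._results[task_id] = payload
        return True

    def get(self, task_id):
        return self._results.get(task_id)


if __name__ == "__main__":
    store = ResultStore()
    print(store.persist("task-1", b"final"))      # first delivery is stored
    print(store.persist("task-1", b"redelivery")) # duplicate is ignored
```

If duplicates can be absorbed this way, at-least-once delivery may be sufficient; if not, the answer to this question constrains the transport choice more strongly.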
F) Decision and Rollout Constraints
- Which changes must be backward-compatible with current API contracts?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What migration windows are acceptable for infrastructure-affecting changes?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- What evidence threshold is required before adopting a new transport or merge approach?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
- Which design options are blocked without domain input from HPC operators or power users?
  - Status (Open/In progress/Answered/Needs measurement): Open
  - Notes:
Estimation Inputs Needed from Domain Experts
The following inputs are required to turn these open questions into concrete ADR comparisons:
- HPC connection timing distributions for each target cluster (especially Ares).
- Queue waiting-time distributions by queue/partition and time-of-day.
- Typical and peak desired parallelism per simulation.
- Typical and peak number of simultaneous simulations per user.
- Result dump timings and file/page cardinality for representative scenarios.
- Practical upper bounds considered acceptable by users for time-to-first-insight and time-to-completion.