# Current Bottlenecks
This page documents current bottlenecks from code-level analysis and local fixture measurements.
## Evidence Base

Primary evidence sources:
- backend orchestration and merge code in `yaptide`
- UI polling/result-fetch behavior in `ui`
- representative result fixture: `yaptide/tests/res/json_with_results.json`
Measured fixture snapshot (`json_with_results.json`):

- file size: 2,564,199 bytes
- estimators: 4
- pages: 21
- total numeric values across pages: 98,892
- largest page value array length: 32,000
These numbers are not production maxima, but they are large enough to expose current scaling pressure points.
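For reference, the snapshot can be reproduced with a short script along the following lines. The `estimators`/`pages`/`data` field names are assumptions about the fixture layout, not a verified schema.

```python
import json
from pathlib import Path

# Illustrative reproduction of the fixture snapshot; the "estimators" /
# "pages" / "data" field names are assumed, not taken from the real schema.
path = Path("yaptide/tests/res/json_with_results.json")
raw = path.read_bytes()
doc = json.loads(raw)

estimators = doc.get("estimators", [])
pages = [page for est in estimators for page in est.get("pages", [])]
value_counts = [len(page.get("data", [])) for page in pages]

print(f"file size: {len(raw):,} bytes")
print(f"estimators: {len(estimators)}")
print(f"pages: {len(pages)}")
print(f"total numeric values: {sum(value_counts):,}")
print(f"largest page array: {max(value_counts, default=0):,}")
```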
## 1. JSON-Centric Result Transport and Serialization

### Current behavior
Both the direct path and the batch path convert estimator outputs into JSON structures before transport and persistence (a minimal sketch of the round trip follows the list):

- direct: `estimators_to_list` writes estimator JSON to disk, then reads it back into Python dicts
- batch: the collect scripts run `convertmc json --many`, and the sender script reads all `*.json` files and posts them as one JSON payload
- backend: `/results` receives the nested JSON and re-serializes page/metadata data into compressed DB blobs
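A minimal sketch of that round trip, assuming estimators are plain dicts (illustrative only; it does not reproduce the real `estimators_to_list` signature or file layout):

```python
import json
import tempfile
from pathlib import Path

def estimators_to_list_sketch(estimators: list[dict]) -> list[dict]:
    """Illustrative only: write estimator dicts to JSON on disk, then read
    them back, mirroring the encode -> disk -> decode round trip described
    above (not the real estimators_to_list signature)."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for i, estimator in enumerate(estimators):
            path = Path(tmp) / f"estimator_{i}.json"
            path.write_text(json.dumps(estimator))        # encode #1
            results.append(json.loads(path.read_text()))  # decode #1
    # later, the whole list is encoded again into the /results POST body
    payload = json.dumps({"estimators": results})          # encode #2
    return json.loads(payload)["estimators"]               # decode #2 (server side)
```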
### Why this is a bottleneck

- multiple JSON encode/decode cycles occur in one workflow
- the direct path has extra temp-file round-trips in `estimators_to_list`
- large nested numeric arrays are repeatedly moved in memory as Python lists and JSON strings
### Code references

- `yaptide/utils/sim_utils.py` (`estimators_to_list`)
- `yaptide/batch/fluka_string_templates.py` and `yaptide/batch/shieldhit_string_templates.py`
- `yaptide/batch/simulation_data_sender.py`
- `yaptide/routes/common_sim_routes.py` (`ResultsResource.post`)
## 2. Broker/Backend Data Plane Pressure (Redis + Celery Results)

### Current behavior
Section titled “Current behavior”- Redis is configured as both Celery broker and Celery result backend in deployment compose.
- Direct tasks return full estimator payloads to Celery chord body (
merge_results). - If
/resultsPOST fails during merge, merged estimators are left in Celery task result payload (final_result). ResultsDirect.getcan fallback to CeleryAsyncResult.infoif DB has no results yet.
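A minimal sketch of this wiring, with simplified task names and payloads (the real signatures live in `yaptide/celery/tasks.py`; URLs and shapes here are placeholders):

```python
from celery import Celery, chord

# Redis serves as both broker and result backend, so every returned payload
# passes through Redis (illustrative config; see docker-compose.yml for the
# real CELERY_BROKER_URL / CELERY_RESULT_BACKEND values).
app = Celery("yaptide_sketch",
             broker="redis://redis:6379/0",
             backend="redis://redis:6379/0")

@app.task
def run_single_simulation_sketch(task_id: int) -> dict:
    # Each task returns its full estimator payload, which Celery stores in
    # the Redis result backend until the chord callback consumes it.
    return {"task_id": task_id,
            "estimators": [{"pages": [{"data": [0.0] * 32_000}]}]}

@app.task
def merge_results_sketch(partial_results: list[dict]) -> dict:
    # The chord callback receives every per-task payload at once; if a later
    # /results POST fails, the merged dict stays in the Celery result store.
    return {"merged_from": len(partial_results)}

# Fan-out N simulation tasks, fan-in through a single callback.
job = chord(run_single_simulation_sketch.s(i) for i in range(8))(merge_results_sketch.s())
```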
### Why this is a bottleneck

- control messages and heavy data share the same broker/backend infrastructure
- large result payloads can amplify Redis memory/network pressure
- fallback behavior keeps heavy payload coupling between compute and broker state
### Code references

- `docker-compose.yml` (`CELERY_BROKER_URL`, `CELERY_RESULT_BACKEND`)
- `yaptide/celery/tasks.py` (`run_single_simulation`, `merge_results`)
- `yaptide/celery/utils/manage_tasks.py` (`get_job_results`)
- `yaptide/routes/celery_routes.py` (`ResultsDirect.get`)
## 3. Merge Pipeline in Pure Python Loops

### Current behavior
Section titled “Current behavior”average_estimators performs page-value averaging in Python loops over list structures.
Pseudo-complexity: O(T × P × V), where:

- T is the number of completed tasks,
- P is the number of pages,
- V is the number of values per page.
With the measured fixture shape (98,892 values) and many tasks, per-value Python loop overhead becomes significant.
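An illustrative reduction of the pattern (not the real `average_estimators` code): a running mean over per-task page values in pure Python, next to the equivalent vectorized NumPy expression.

```python
import numpy as np

def average_pages_python(task_pages: list[list[list[float]]]) -> list[list[float]]:
    """Pure-Python running mean over T tasks x P pages x V values,
    i.e. O(T*P*V) interpreter-level iterations (illustrative sketch)."""
    merged = [list(page) for page in task_pages[0]]
    for completed, pages in enumerate(task_pages[1:], start=1):
        for p, page in enumerate(pages):
            for v, value in enumerate(page):
                merged[p][v] = (merged[p][v] * completed + value) / (completed + 1)
    return merged

def average_pages_numpy(task_pages: list[list[list[float]]]) -> np.ndarray:
    # Same arithmetic pushed into a vectorized kernel: one mean over the task
    # axis instead of T*P*V Python loop iterations (assumes equal page lengths).
    return np.asarray(task_pages, dtype=float).mean(axis=0)
```

On arrays of the fixture's size, pushing the reduction into a vectorized kernel typically removes the per-value interpreter overhead that dominates the pure-Python version.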
### Why this is a bottleneck

- Python-level numeric loops are slower than vectorized kernels
- the merge task is a single-task chokepoint in the chord callback
- all merged payloads are materialized in memory as nested Python objects
### Code references

- `yaptide/celery/utils/pymc.py` (`average_values`, `average_estimators`)
- `yaptide/celery/tasks.py` (`merge_results`)
## 4. No True Partial Numerical Results During RUNNING

### Current behavior
Section titled “Current behavior”Progress updates are streamed, numerical estimators are not.
- monitors send task progress (`simulated_primaries`, `estimated_time`, states) via `/tasks`
- the merged estimator payload is sent at the end of the workflow via `/results`
- the UI fetches/loads result datasets when status becomes `COMPLETED`
Batch mode similarly sends final results in the collect stage after array completion; a sketch of a progress-only update follows.
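To illustrate the gap, a progress update of the kind described above carries only counters and state. The payload shape below is assumed for illustration, with field names taken from the bullets above; the real `/tasks` schema may differ.

```python
# Illustrative shape only; the real /tasks payload schema is not reproduced here.
progress_update = {
    "simulation_id": 123,            # hypothetical identifier field
    "task_id": "7",
    "task_state": "RUNNING",
    "simulated_primaries": 250_000,
    "estimated_time": 480,           # assumed to be seconds remaining
    # no estimator/page data here: numeric results only arrive with the
    # final /results POST after the whole workflow completes
}
```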
### Why this is a bottleneck

- users can see progress bars but not intermediate dosimetry/fluence data
- misconfiguration detection is delayed until after full run completion
- long runs provide limited actionable feedback
### Code references

- `yaptide/celery/utils/pymc.py` (`read_shieldhit_file`, `read_fluka_file`)
- `yaptide/routes/task_routes.py`
- `ui/src/WrapperApp/components/Simulation/SimulationsGrid/SimulationsGridHelpers.ts`
- `ui/src/services/RemoteWorkerSimulationService.ts`
- `yaptide/batch/simulation_data_sender.py`
## 5. Reliability and Operational Control Gaps

### Current behavior
Section titled “Current behavior”- no explicit Celery task retry policies (
autoretry_for,max_retries, etc.) in task definitions - HTTP update/result posts are single-attempt requests with no backoff strategy
- known log handling issue is documented in code comments and partially disabled behavior
- helper task process inspection uses hardcoded worker name (
celery@yaptide-simulation-worker)
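For contrast, a minimal sketch of what an explicit retry policy could look like on an update-posting task (task name, URL, and parameter values are placeholders, not recommendations):

```python
import requests
from celery import Celery

app = Celery("yaptide_sketch")

# Explicit retry policy declared on the task itself, instead of relying on
# implicit defaults (illustrative values only).
@app.task(bind=True,
          autoretry_for=(requests.RequestException,),
          retry_backoff=True,       # exponential backoff between attempts
          retry_backoff_max=60,     # cap the backoff at 60 s
          retry_jitter=True,
          max_retries=5)
def post_update_sketch(self, url: str, payload: dict) -> int:
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()      # HTTP errors raise -> triggers autoretry
    return response.status_code
```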
### Why this is a bottleneck

- transient network/backend failures can drop updates or results
- operational behavior depends on implicit defaults rather than explicit policy
- brittle assumptions reduce portability across deployment topologies
### Code references

- `yaptide/celery/tasks.py` (comments around logfile overwrite handling)
- `yaptide/celery/utils/requests.py`
- `yaptide/utils/helper_tasks.py`
- `yaptide/celery/simulation_worker.py`
## 6. Scaling Model Limits in Current Topology

### Current behavior
Section titled “Current behavior”- default compose topology runs a single simulation worker service container
- merge is serialized in one callback task per job
- no object-storage-based result fan-in; final payloads are moved through backend API and DB writes
### Why this is a bottleneck

- the per-job fan-in (merge) stage can dominate latency as task count grows
- throughput scales less predictably for large payloads and high parallelism
- architecture is not yet optimized for multi-node distributed result transport/merge
### Code references

- `docker-compose.yml`
- `yaptide/celery/utils/manage_tasks.py`
- `yaptide/celery/tasks.py`
## Impact Summary

| Bottleneck | Current behavior | Main impact |
|---|---|---|
| JSON-heavy transport | Multiple JSON conversions and full-payload posts | CPU, memory, and I/O overhead |
| Redis data-plane coupling | Broker and result backend share Redis with large payload exposure | Throughput and reliability risk under load |
| Pure-Python merge loops | Single-task Python list averaging | Longer merge time and higher memory pressure |
| No partial estimator streaming | Only progress updates during run | Delayed scientific feedback |
| Limited retry/control policy | Mostly single-attempt update/result posting | Fragility on transient failures |
| Fan-in merge topology | One callback merge step per job | Scaling bottleneck as tasks/data grow |
## Immediate Profiling Tasks (Phase 1 input)

To replace assumptions with hard numbers, run these measurements next:
- End-to-end timings per stage: run, merge, `/results` persistence (see the instrumentation sketch after this list).
- Payload size telemetry: per-task estimator JSON size, merged JSON size, response times.
- Merge CPU and memory profiling for realistic task counts (for example 4, 8, 16, 32).
- Redis memory/network telemetry under concurrent workloads.
- Failure-injection tests for `/tasks` and `/results` posting resilience.
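A minimal stage-timing sketch using the OpenTelemetry tooling recommended below, exporting spans to the console to stay self-contained (a real setup would export toward the Prometheus/Grafana stack; attribute names are placeholders):

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter keeps the sketch runnable on its own.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("yaptide.profiling")

def merge_stage_sketch(task_count: int, payload_bytes: int) -> None:
    # One span per stage: timing plus the payload-size telemetry listed above.
    with tracer.start_as_current_span("merge") as span:
        span.set_attribute("job.task_count", task_count)
        span.set_attribute("job.merged_payload_bytes", payload_bytes)
        time.sleep(0.1)  # placeholder for the real merge work

merge_stage_sketch(task_count=8, payload_bytes=2_564_199)
```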
These measurements will feed ADR option comparisons and benchmark-based design choices in Phase 2 and Phase 3.
Detailed execution plan: Phase 1 Profiling Plan
Recommended default tooling:
- OpenTelemetry + Prometheus + Grafana for stage timing and trace correlation
- Locust for concurrent workload generation
- py-spy for merge CPU profiling
- Memray for merge memory profiling
- redis_exporter + Prometheus for Redis memory/network telemetry
- Toxiproxy for deterministic `/tasks` and `/results` failure injection