# Phase 1 Profiling Plan
This runbook turns Phase 1 profiling goals into a repeatable protocol that can be reused for ADR comparisons in Phase 2 and design validation in Phase 3.
Related context: Current Bottlenecks
## Required Outputs

Produce these artifacts from the same test matrix:
- End-to-end timing percentiles for run, merge, and `/results` persistence.
- Payload size distributions for per-task and merged estimator JSON.
- Merge CPU hotspot evidence and memory peak measurements for task counts 4, 8, 16, and 32.
- Redis memory and network telemetry under concurrent workloads.
- Failure-injection resilience results for `/tasks` and `/results` posting.
## Recommended Tool Stack

| Measurement goal | Recommended tool(s) | Why this is a good fit |
|---|---|---|
| Stage timings and service latency | OpenTelemetry + Prometheus + Grafana | One source for traces plus percentile metrics and historical trends |
| Concurrent workload generation | Locust | Scriptable and repeatable load patterns for matrix testing |
| Merge CPU profiling | py-spy | Low overhead sampling and flamegraph output for live worker processes |
| Merge memory profiling | Memray | Detailed allocation and peak usage visibility |
| Redis memory/network/command telemetry | redis_exporter + Prometheus + Grafana | Standard metrics pipeline for broker/backend behavior |
| Network/transport failure injection | Toxiproxy | Deterministic latency, loss, and disconnect scenarios |
## Baseline Test Matrix

Start with one representative simulation input and vary only orchestration pressure.
| Scenario | Concurrent jobs | Tasks per job | Fault mode | Repetitions |
|---|---|---|---|---|
| S1 | 1 | 4 | none | 10 |
| S2 | 1 | 8 | none | 10 |
| S3 | 1 | 16 | none | 10 |
| S4 | 1 | 32 | none | 10 |
| S5 | 5 | 8 | none | 10 |
| S6 | 10 | 8 | none | 10 |
| S7 | 10 | 8 | latency | 10 |
| S8 | 10 | 8 | drop/retry | 10 |
Use S1-S6 for throughput/latency characterization and S7-S8 for resilience behavior.
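To keep every tool (load generator, dashboards, result summaries) working from one definition, the matrix above can be encoded as data. A minimal sketch; the field names are illustrative, only the values come from the table:

```python
# Baseline scenario matrix from the table above, encoded once so all
# tooling consumes the same definition.
SCENARIOS = [
    {"id": "S1", "concurrent_jobs": 1, "tasks_per_job": 4, "fault": None, "reps": 10},
    {"id": "S2", "concurrent_jobs": 1, "tasks_per_job": 8, "fault": None, "reps": 10},
    {"id": "S3", "concurrent_jobs": 1, "tasks_per_job": 16, "fault": None, "reps": 10},
    {"id": "S4", "concurrent_jobs": 1, "tasks_per_job": 32, "fault": None, "reps": 10},
    {"id": "S5", "concurrent_jobs": 5, "tasks_per_job": 8, "fault": None, "reps": 10},
    {"id": "S6", "concurrent_jobs": 10, "tasks_per_job": 8, "fault": None, "reps": 10},
    {"id": "S7", "concurrent_jobs": 10, "tasks_per_job": 8, "fault": "latency", "reps": 10},
    {"id": "S8", "concurrent_jobs": 10, "tasks_per_job": 8, "fault": "drop/retry", "reps": 10},
]

def throughput_scenarios():
    """S1-S6: no fault injection, throughput/latency characterization."""
    return [s for s in SCENARIOS if s["fault"] is None]

def resilience_scenarios():
    """S7-S8: fault injection enabled, resilience behavior."""
    return [s for s in SCENARIOS if s["fault"] is not None]
```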
## Instrumentation Plan

### 1) End-to-End Timings per Stage

Instrument three stage boundaries in backend orchestration:
- run stage: task execution start to completion
- merge stage: merge callback start to completion
- persistence stage: `/results` receive to durable DB write completion
Recommended metrics:
- `yaptide_stage_duration_seconds` (histogram)
  - labels: `stage=run|merge|persist`, `success=true|false`, `task_count_bucket`
- `yaptide_job_duration_seconds` (histogram)
- `yaptide_stage_events_total` (counter)
Guidance:
- keep labels low-cardinality; do not attach unique `job_id` as a metric label
- attach `job_id` in tracing spans, not in Prometheus labels
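The stage boundaries above can be wrapped with a small timer. The sketch below uses an in-memory dict as a stand-in for the Prometheus histogram; in the real service the `finally` block would instead call `Histogram.labels(...).observe(duration)` from `prometheus_client`, with `job_id` attached only to the enclosing trace span:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-in for the yaptide_stage_duration_seconds histogram;
# real code would call Histogram.labels(...).observe(duration) instead.
observations = defaultdict(list)

@contextmanager
def stage_timer(stage, task_count_bucket):
    """Time one stage boundary using the label set recommended above."""
    start = time.perf_counter()
    success = "true"
    try:
        yield
    except Exception:
        success = "false"  # failed stages are recorded too, under success=false
        raise
    finally:
        duration = time.perf_counter() - start
        observations[(stage, success, task_count_bucket)].append(duration)

# example: wrap the merge callback body
with stage_timer("merge", "16"):
    time.sleep(0.01)  # placeholder for the real merge work
```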
### 2) Payload Size Telemetry

Capture payload bytes at key boundaries:
- per-task estimator JSON before transport
- merged estimator JSON before `/results` POST
- `/tasks` and `/results` request and response body sizes
Recommended metrics:
- `yaptide_payload_bytes` (histogram)
  - labels: `kind=task_estimator|merged_estimator|tasks_request|results_request|results_response`
- `yaptide_http_request_duration_seconds` (histogram)
  - labels: `route=/tasks|/results`, `method=POST`, `status_code`
Expected output:
- percentiles and max sizes by scenario
- scatter: payload size vs endpoint latency
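For offline analysis of exported raw payloads, the percentile/max summary can be computed with the standard library; a sketch (the example payload shape is illustrative, not the real estimator schema):

```python
import json
import statistics

def payload_stats(payloads):
    """Percentile and max byte sizes for JSON-serializable payloads.

    Offline stand-in for the yaptide_payload_bytes histogram when
    summarizing exported raw data per scenario.
    """
    sizes = sorted(len(json.dumps(p).encode("utf-8")) for p in payloads)
    # statistics.quantiles with n=100 yields 99 cut points: index 49 is
    # the 50th percentile, index 94 the 95th.
    quantiles = statistics.quantiles(sizes, n=100)
    return {"p50": quantiles[49], "p95": quantiles[94], "max": sizes[-1]}

# example: 100 fake per-task estimator payloads of growing size
stats = payload_stats([{"pages": [0.0] * n} for n in range(1, 101)])
```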
### 3) Merge CPU and Memory Profiling

For each task count (4, 8, 16, 32), execute merge repeatedly with realistic fixtures.
Suggested commands:
```shell
py-spy record -o profiles/merge-16.svg --pid "$SIMULATION_WORKER_PID" --duration 60
python -m memray run -o profiles/merge-16.bin scripts/profile_merge.py --tasks 16
python -m memray summary profiles/merge-16.bin
```

Track per scenario:
- merge wall time: p50/p95
- CPU: top hotspots and cumulative percentage
- memory: peak RSS and top allocation sources
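A `scripts/profile_merge.py` harness might look like the sketch below. The merge body here is a placeholder reduction; the point of the harness is only to give py-spy and Memray a repeatable workload, so the real merge callback and recorded fixtures would be imported in its place:

```python
import argparse

def build_fixtures(n):
    # Placeholder per-task estimator payloads; real fixtures would be
    # recorded from production-shaped simulations.
    return [{"task": i, "pages": [float(i)] * 1000} for i in range(n)]

def merge(estimators):
    # Placeholder merge: element-wise sum of page arrays. Swap in the
    # real merge callback to profile actual hotspots.
    merged = [0.0] * len(estimators[0]["pages"])
    for est in estimators:
        for i, value in enumerate(est["pages"]):
            merged[i] += value
    return merged

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--tasks", type=int, default=16)
    args = parser.parse_args(argv)
    return merge(build_fixtures(args.tasks))

if __name__ == "__main__":
    main()
```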
### 4) Redis Memory and Network Telemetry

Monitor Redis while running S1-S8:
- memory usage (`used_memory`, `used_memory_rss`)
- network throughput (`instantaneous_input_kbps`, `instantaneous_output_kbps`)
- pressure indicators (`blocked_clients`, `evicted_keys`)
- command rates and latency where available
Required correlation:
- overlay Redis telemetry with stage latency panels for the same run window
- identify inflection points when increasing concurrency or task counts
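If redis_exporter is not yet wired up, the same fields can be polled directly; the sketch below flattens one INFO snapshot into a timestamped row (in real use, `info` would be the dict returned by `redis.Redis().info()` from redis-py, polled on a fixed interval during each scenario):

```python
import time

# Fields this runbook tracks; names match Redis INFO output.
REDIS_FIELDS = (
    "used_memory", "used_memory_rss",
    "instantaneous_input_kbps", "instantaneous_output_kbps",
    "blocked_clients", "evicted_keys",
)

def telemetry_row(info, scenario_id):
    """Flatten one Redis INFO mapping into a timestamped row.

    Rows tagged with the scenario id can later be overlaid with the
    stage-latency panels for the same run window.
    """
    row = {"ts": time.time(), "scenario": scenario_id}
    row.update({field: info.get(field) for field in REDIS_FIELDS})
    return row

# example with a fake INFO snapshot
row = telemetry_row({"used_memory": 1048576, "blocked_clients": 0}, "S6")
```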
### 5) Failure-Injection Tests for `/tasks` and `/results`

Inject deterministic failures between workers and backend endpoints:
- added latency (for example +250 ms, +1 s)
- packet loss (for example 2%, 5%)
- temporary disconnects
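These injections can be scripted against Toxiproxy's HTTP API. The sketch below assumes the default admin port 8474 and a pre-created proxy named `backend_results` (both assumptions, not project facts); verify the payload shape against the Toxiproxy version you deploy:

```python
import json
from urllib import request

TOXIPROXY = "http://localhost:8474"  # assumed default Toxiproxy admin address

def latency_toxic(proxy_name, latency_ms, jitter_ms=0):
    # Payload shape follows the Toxiproxy HTTP API toxic format;
    # check it against your deployed Toxiproxy version.
    return {
        "name": f"{proxy_name}_latency_{latency_ms}ms",
        "type": "latency",
        "stream": "downstream",
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }

def apply_toxic(proxy_name, toxic):
    # Requires a running Toxiproxy instance; not executed in this sketch.
    req = request.Request(
        f"{TOXIPROXY}/proxies/{proxy_name}/toxics",
        data=json.dumps(toxic).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

# example: the +250 ms scenario from the list above
toxic = latency_toxic("backend_results", 250, 50)
```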
Validate resilience behavior:
- retries occur as expected
- no duplicated final writes
- eventual consistency of terminal simulation status
- bounded recovery time after link restoration
Record per test:
- success/failure
- retries attempted
- recovery latency
- final state correctness
## Execution Protocol

Use this run protocol for reproducibility:
- Freeze code revision (record git SHAs for backend, UI, docs notes).
- Use fixed dataset and simulator settings for all scenarios.
- Warm up services once before recording metrics.
- Run scenario matrix in randomized order to reduce temporal bias.
- Export dashboards/traces/raw CSV after each scenario group.
- Store artifacts under a date-stamped directory.
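The randomized-order step still needs to be reproducible across reruns; one way is to shuffle with a fixed, recorded seed. A minimal sketch:

```python
import random

SCENARIO_IDS = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]

def run_order(seed):
    # A fixed seed, recorded with the run artifacts, keeps the randomized
    # order reproducible while still reducing temporal bias.
    order = list(SCENARIO_IDS)
    random.Random(seed).shuffle(order)
    return order
```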
Suggested artifact structure:
```text
rework-orchestration/research/phase-1-results/YYYY-MM-DD/
  timings/
  payloads/
  merge-profiles/
  redis/
  resilience/
  summary.md
```

## Phase 1 Exit Criteria

Phase 1 can be considered complete when all are true:
- Timing percentiles are available for run, merge, and persistence across S1-S8.
- Payload-size and endpoint-latency relationship is quantified.
- Merge CPU and memory hotspots are evidenced for 4, 8, 16, and 32 tasks.
- Redis saturation behavior is visible under concurrent load.
- Failure-injection results show current retry/idempotency limits and recovery characteristics.
The output of this runbook becomes input evidence for ADR option scoring in Phase 2.