docs: DORA metrics with pipeline timing breakdown

ldraney commented

2026-06-09 10:20:30 +00:00

Owner

Summary

Add docs/DORA.md with measured pipeline timing from pipeline #410 and surrounding deploys. Three mermaid diagrams, DORA four key metrics (all elite), Kaniko build internals, and optimization opportunities.

Closes #185

Changes

New file: docs/DORA.md — DORA metrics doc with:
- Deploy chain flow diagram (Woodpecker → Harbor → Image Updater → ArgoCD)
- Gantt chart of pipeline #410 step timing (~9 min merge-to-live)
- DORA four key metrics: deploy frequency 4.3/day, lead time ~9 min, CFR 2.6%, MTTR ~1 hr
- Kaniko build-and-push internals showing rootfs unpack as 4m49s/5m23s bottleneck
- Optimization table: cache warming + gem caching + webhook triggers → ~3 min target
- Pipeline architecture diagram (PR vs main pipeline)

Test Plan

- [ ] Mermaid diagrams render in Forgejo markdown preview
- [ ] Timing data cross-checked against Woodpecker logs and pal-e-deployments git log
- [ ] No code changes (docs only)

Review Checklist

- [x] No code changes
- [x] Data sourced from actual pipeline runs, not estimates
- [x] Mermaid syntax validated
- [ ] Feature flag: N/A (docs only)

Related Notes

docs/pipeline.md — existing pipeline architecture doc (complementary, not overlapping)
Pipeline #410 — the measured deploy
docs/infrastructure-and-pipeline.md — broader infra context

ldraney added 1 commit

2026-06-09 10:20:30 +00:00

docs: add DORA metrics with pipeline timing breakdown

CI / scan_ruby (pull_request) Waiting to run

CI / scan_js (pull_request) Waiting to run

CI / lint (pull_request) Waiting to run

ci/woodpecker/push/woodpecker Pipeline was successful

ci/woodpecker/pr/woodpecker Pipeline was successful

664d11f864

Measured from pipeline #410 and surrounding deploys. Includes mermaid
diagrams for the deploy chain, Gantt chart for step timing, DORA four
key metrics (all elite), Kaniko build internals, and optimization
opportunities.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ldraney commented

2026-06-09 10:22:22 +00:00

Author

Owner

PR #186 Review

DOMAIN REVIEW

This is a docs-only PR adding docs/DORA.md (206 lines, no code changes). The domain is documentation with mermaid diagrams. Reviewed for: data accuracy, internal consistency, mermaid syntax correctness, complementarity with existing docs (docs/pipeline.md, docs/infrastructure-and-pipeline.md), and overall clarity.

Mermaid syntax: All three diagrams (Deploy Chain flow, Pipeline #410 Gantt, Pipeline Architecture) use valid mermaid syntax. The Gantt chart correctly uses after dependencies and the dateFormat mm:ss / axisFormat %M:%S pattern for duration-based charts.

Gantt parallel task limitation: The Deploy Chain flow diagram correctly shows build-and-push depending on BOTH lint and test (C --> E and D --> E). The Gantt chart can only express a single after dependency, so it uses after d (test, 20s) which finishes after c (lint, 15s). This is the correct workaround for mermaid Gantt's limitation -- no issue here.
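
The choice of anchor can be checked mechanically: a join step starts when its slowest dependency finishes, so the single after target should be whichever parallel step ends last. A minimal Python sketch, using only the lint and test durations quoted in this review; the function name `gantt_anchor` is hypothetical and not part of the PR.

```python
def gantt_anchor(durations: dict[str, int]) -> str:
    """Return the parallel step that finishes last.

    Assumes every step starts at the same moment, so the longest
    duration is also the latest finish.
    """
    return max(durations, key=durations.get)


# Durations in seconds, as quoted in this review.
parallel_steps = {"lint": 15, "test": 20}

anchor = gantt_anchor(parallel_steps)
print(f"build-and-push should start after: {anchor}")
```

If lint ever outlasted test, the same call would return lint, which is the case where the hard-coded `after d` would need to change.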

Timing math verified:

CI pipeline: 5s + 90s + 20s (parallel dominates) + 323s = 438s = ~7m18s. Matches "~7-8 min" claim.
Deploy chain: 35s + 45s + 45s = 125s = ~2m5s. Matches "~2 min" claim.
Total: ~9m23s. Matches "~9 min" claim. Correct.
Kaniko rootfs: 3m19s + 1m30s = 4m49s of 5m23s (89.5%). Matches the "4m49s of the 5m23s build" statement. Correct.
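
For anyone re-running these checks, the arithmetic can be reproduced with a short Python sketch. Every number is copied from the bullets above; the variable names and the `fmt` helper are illustrative only, and the two steps lasting 5s and 90s are left unnamed because this review does not name them.

```python
def fmt(seconds: int) -> str:
    """Format a duration in seconds as e.g. '7m18s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs}s"


# All values are seconds, copied from the review bullets.
lint, test = 15, 20
other_ci_steps = [5, 90]      # step names are not given in this review
build_and_push = 323          # 5m23s
deploy_chain = [35, 45, 45]

# lint and test run in parallel, so only the slower one counts.
ci_total = sum(other_ci_steps) + max(lint, test) + build_and_push
deploy_total = sum(deploy_chain)
merge_to_live = ci_total + deploy_total

# Kaniko rootfs unpack: 3m19s + 1m30s.
rootfs = (3 * 60 + 19) + (1 * 60 + 30)
rootfs_share = rootfs / build_and_push

print(fmt(ci_total), fmt(deploy_total), fmt(merge_to_live))
print(fmt(rootfs), f"{rootfs_share:.1%}")
```

Using `max(lint, test)` rather than adding both is what "parallel dominates" means in the first bullet.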

Incident references verified: PR #158 (ROPC auth break), PR #161 (revert), and issue #160 (URGENT revert) all exist in the repo and match the narrative described.

Complementarity with existing docs: docs/pipeline.md covers the deploy chain architecture and iOS pipeline. docs/infrastructure-and-pipeline.md covers three-repo model, Docker build, CI config, kustomize overlays, and secrets. The new DORA doc covers measured performance data and optimization opportunities -- genuinely complementary with no significant overlap.

BLOCKERS

None. This is a docs-only change with no code, no secrets, no security implications.

NITS

Data inconsistency in CI failure rate: The Deployment Frequency section states "39 deploys to prod" over June 1-9, but the CI failure sub-section calculates "3 CI failures / 9 main pushes = 33%." If there were 39 deploys, there were approximately 39 main pushes, making the failure rate ~7.7% (3/39), not 33%. The denominator "9" appears to reference something else -- perhaps 9 pushes in a specific subset of the period, or 9 days. Clarify what "9 main pushes" refers to, or correct the denominator.
Build-and-push duration discrepancy: The step breakdown table says build-and-push is "5m23s" (323s), but summing the Kaniko internals table yields ~316s (1+199+9+6+2+90+8+1). The 7-second gap is likely rounding or gaps between phases not captured in logs. Not wrong, but a footnote acknowledging the gap would improve transparency.
Gantt build-and-push step value: The Gantt uses 323s for build-and-push but the step breakdown table says "5m23s." 5m23s = 323s, so these are consistent. However, the Gantt start for build-and-push is after d (test), not after c (lint). If lint ever takes longer than test in future pipeline runs, the Gantt template would under-report total time. Consider adding a comment in the markdown noting this assumption.
Filename casing: The file is DORA.md (all caps). Existing docs use lowercase: pipeline.md, infrastructure-and-pipeline.md, keycloak-setup.md. Consider dora.md or dora-metrics.md for consistency.
"Best-case pipeline" table: The optimized bundle-install shows "10s (cached)" but the Kaniko internals show "Bundle install (cached): 9s." Minor: use 9s for consistency with measured data rather than rounding up.
Missing cross-reference: The doc references docs/pipeline.md and docs/infrastructure-and-pipeline.md in the PR body's Related Notes, but the doc itself has no Related/See Also section linking to these companion docs. Adding a brief section at the bottom would help readers navigate.
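
The two numeric nits (CI failure rate and build-and-push duration) can be reproduced with a few lines of Python. This is a sketch using only figures quoted in the nits; the variable names are illustrative and nothing here comes from the doc's own tooling.

```python
# CI failure rate: the same 3 failures give very different rates
# depending on which denominator is used.
ci_failures = 3
rate_per_main_push = ci_failures / 9    # "9 main pushes" as written in the doc
rate_per_deploy = ci_failures / 39      # "39 deploys to prod"

# Build-and-push: Kaniko phase durations in seconds, in the order summed above.
kaniko_phases = [1, 199, 9, 6, 2, 90, 8, 1]
step_total = 5 * 60 + 23                # 5m23s from the step breakdown table
unaccounted = step_total - sum(kaniko_phases)

print(f"{rate_per_main_push:.1%} vs {rate_per_deploy:.1%}")
print(f"phases sum to {sum(kaniko_phases)}s, {unaccounted}s unaccounted")
```

Neither result says which denominator is right; it only shows why the doc should state which population the 33% figure is measured against.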

SOP COMPLIANCE

- [ ] Branch named after issue: Branch is docs/dora-metrics. SOP requires {issue-number}-{kebab-case-purpose} format (e.g., 185-dora-metrics). Issue number is missing from the branch name.
- [x] PR body follows template: Has Summary, Changes, Test Plan, Review Checklist, Related Notes.
- [ ] Related references plan slug: No plan slug referenced. If this PR is part of a plan tracked in pal-e-docs, it should be cited.
- [x] No secrets committed: Confirmed -- docs only, no credentials or env values.
- [x] No scope creep: Single file addition, docs only, matches the stated purpose.
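
The branch-name rule lends itself to an automated check. The sketch below is one possible reading of the {issue-number}-{kebab-case-purpose} format, not the SOP's own tooling: the regular expression and the function name `follows_branch_sop` are assumptions made for illustration.

```python
import re

# Assumed reading of {issue-number}-{kebab-case-purpose}:
# digits, a hyphen, then lowercase alphanumeric words joined by hyphens.
BRANCH_PATTERN = re.compile(r"\d+-[a-z0-9]+(?:-[a-z0-9]+)*")


def follows_branch_sop(branch: str) -> bool:
    """Return True if the whole branch name matches the assumed pattern."""
    return BRANCH_PATTERN.fullmatch(branch) is not None


for name in ("docs/dora-metrics", "185-dora-metrics"):
    print(name, follows_branch_sop(name))
```

Using `fullmatch` rather than `search` matters here: it rejects names that merely contain a conforming fragment, such as a prefixed docs/185-dora-metrics.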

PROCESS OBSERVATIONS

Deployment frequency impact: None (docs only).
Change failure risk: Zero -- no code, no migrations, no config changes.
Documentation quality: This is a strong operational reference. Measured data from real pipeline runs (not estimates) is exactly the right approach for DORA metrics. The optimization table with specific targets ("~3 min") gives actionable next steps.
The 33% CI failure rate signal is worth investigating independently. If main builds fail at that rate despite PR pipelines passing, there is env drift between PR and main pipelines that represents real change failure risk.

VERDICT: APPROVED

The document is well-structured, data is verified against repo history, mermaid syntax is valid, and there are no blockers. The nits above (especially the CI failure rate denominator clarification and filename casing) are worth addressing but are not blocking. Branch naming does not follow the {issue-number}-{kebab} convention, but this is a docs-only change and the deviation is minor.

ldraney added 1 commit

2026-06-09 10:22:57 +00:00

Fix review nits: lowercase filename, clarify CI failure denominator

ci/woodpecker/push/woodpecker Pipeline was successful

ci/woodpecker/pr/woodpecker Pipeline was successful

CI / scan_ruby (pull_request) Has been cancelled

CI / scan_js (pull_request) Has been cancelled

CI / lint (pull_request) Has been cancelled

add472a778

- Rename DORA.md → dora.md to match repo lowercase convention
- Clarify 39 deploys vs 9 main pushes denominator difference
- Add See Also section linking companion docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>