Add project docs: user stories, architecture, roadmap, DORA strategy

ldraney commented

2026-06-10 11:47:51 +00:00

Author

Owner

PR #5 Review

DOMAIN REVIEW

Tech stack: Markdown documentation with Mermaid diagrams, PromQL queries, and platform architecture references. Review covers: Mermaid syntax correctness, PromQL accuracy against the live DORA exporter, platform reference accuracy, and cross-document consistency.

1. Mermaid Diagram Syntax

All 18 Mermaid diagrams use valid syntax. Subgraphs are properly opened and closed. Node IDs are consistent. Diagram types (graph, flowchart, sequenceDiagram, gantt, erDiagram) are correctly applied for their purpose. No syntax issues found.

2. PromQL Accuracy (verified against `dora-framework` note in pal-e-docs)

Metric names are correct. The three DORA exporter metrics referenced in docs/dora/README.md match exactly what the platform's DORA framework confirms as LIVE:

dora_pr_merges_total -- confirmed
dora_pr_lead_time_seconds_bucket -- confirmed
dora_deployments_total -- confirmed

PromQL issues found:

(a) Deployment Frequency query -- comment/query mismatch (docs/dora/README.md):

# Merges per day (7d rolling average)
sum(increase(dora_pr_merges_total{repo="html-poster"}[1d]))

The comment says "7d rolling average" but the query uses a [1d] range vector. These are inconsistent. If a 7d rolling average is intended, the query should use [7d] with division by 7 (e.g., sum(increase(dora_pr_merges_total{repo="html-poster"}[7d])) / 7). As written, this returns merges in the last 1 day, not a 7-day rolling average.
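The gap between the two readings can be illustrated with plain arithmetic. The following is a minimal Python sketch, not PromQL: the counter readings are hypothetical daily samples invented for illustration, and the `increase` helper only mimics the idea of counter growth over a trailing window (no resets, no extrapolation).

```python
# Hypothetical cumulative counter readings, one per day, oldest first.
# Index 0 is the reading 7 days ago; the last entry is the reading now.
counter = [100, 102, 103, 103, 107, 108, 110, 114]


def increase(samples, days):
    """Counter growth over the trailing `days` days (assumes no resets)."""
    return samples[-1] - samples[-1 - days]


# What a [1d] window reports: merges in the last day only.
merges_last_day = increase(counter, 1)

# What "7d rolling average" describes: 7-day growth divided by 7.
rolling_avg_7d = increase(counter, 7) / 7

print(merges_last_day)  # 4
print(rolling_avg_7d)   # 2.0
```

With these sample values the single-day figure is twice the weekly average, which is the kind of discrepancy a mislabeled panel would hide.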

(b) Change Failure Rate query -- cumulative ratio, no time window (docs/dora/README.md):

sum(dora_deployments_total{repo="html-poster", status="failure"})
/
sum(dora_deployments_total{repo="html-poster"})

This uses sum() without increase() or rate(), which means it computes the all-time cumulative ratio. That may be intentional for a lifetime CFR, but it diverges from the DORA framework note which uses dora_deployments_total{status="failure"} / sum(dora_deployments_total) -- and neither version produces a windowed rate. For a dashboard panel this is typically windowed (e.g., increase(...[7d])). Not a blocker since these are aspirational dashboard queries, but worth noting for accuracy.
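To see why the distinction matters for a dashboard, here is a small Python sketch comparing a lifetime ratio with a windowed one. All counter values are hypothetical and chosen only to make the difference visible; the 7-day window is likewise an assumption.

```python
# Hypothetical cumulative counters: current value and value 7 days ago.
failures_now, failures_7d_ago = 12, 12   # no new failures this week
deploys_now, deploys_7d_ago = 60, 40     # 20 deployments this week


def ratio(numerator, denominator):
    """Failure ratio, returning 0.0 when there were no deployments."""
    return numerator / denominator if denominator else 0.0


# All-time ratio: what dividing the raw counters computes.
lifetime_cfr = ratio(failures_now, deploys_now)

# Windowed ratio: growth in failures over growth in deployments.
windowed_cfr = ratio(
    failures_now - failures_7d_ago,
    deploys_now - deploys_7d_ago,
)

print(lifetime_cfr)  # 0.2
print(windowed_cfr)  # 0.0
```

In this example the lifetime figure stays at 20% even though the most recent week had no failed deployments, so old failures keep dominating the panel.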

(c) MTTR query -- ALERTS_FOR_STATE is an internal Prometheus metric (docs/dora/README.md):

ALERTS_FOR_STATE{alertname=~"PodRestartStorm|OOMKilled", namespace="html-poster"}

ALERTS_FOR_STATE is a Prometheus internal time series that tracks when an alert entered the firing state. It is not directly useful for computing resolution duration in PromQL alone -- you would need Alertmanager API data or a recording rule to compute fire-to-resolve time. This query as written would return a Unix timestamp, not a duration. The description claims it shows "Time from alert firing to resolution" which is misleading.
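The three quantities in play (a raw timestamp, time elapsed since firing, and fire-to-resolve duration) are easy to conflate. A minimal Python sketch with hypothetical timestamps shows how they differ; the variable names are illustrative and do not correspond to any Prometheus or Alertmanager field.

```python
from datetime import datetime, timezone


def ts(hour, minute):
    """Unix timestamp for a hypothetical moment on 2026-06-10 (UTC)."""
    return datetime(2026, 6, 10, hour, minute, tzinfo=timezone.utc).timestamp()


fired_at = ts(11, 0)      # when the alert started firing
resolved_at = ts(11, 25)  # when it resolved (needs an external source)
now = ts(12, 0)           # evaluation time

# A bare timestamp series yields only the firing time, not a duration.
raw_value = fired_at

# Subtracting from the current time gives "seconds since fired",
# which keeps growing and says nothing about when the alert resolved.
seconds_since_fired = now - fired_at

# Fire-to-resolve time needs the resolution timestamp as well.
time_to_resolve = resolved_at - fired_at

print(seconds_since_fired)  # 3600.0
print(time_to_resolve)      # 1500.0
```

Only the last value corresponds to "time from alert firing to resolution", and it depends on a resolution timestamp that the bare series does not carry.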

(d) Blackbox probe query is reasonable but note the regex matcher instance=~".*html-poster.*" will match any instance containing "html-poster" -- could be overly broad if other services reference html-poster in their probe names.
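The breadth of that matcher can be checked outside Prometheus. The sketch below uses Python's `re.fullmatch` to approximate Prometheus's fully anchored regex matching; the instance URLs are hypothetical examples, not real probe targets on the platform.

```python
import re

# Same pattern as the matcher in the doc; fullmatch mimics full anchoring.
pattern = re.compile(r".*html-poster.*")

# Hypothetical probe targets, invented for illustration.
instances = [
    "https://html-poster.example.internal/up",
    "https://docs.example.internal/html-poster-mirror",
    "https://landscaping-assistant.example.internal/up",
]

matched = [url for url in instances if pattern.fullmatch(url)]

for url in matched:
    print(url)
```

Both of the first two targets match, including the one that merely mentions html-poster in its path. An exact match, or a pattern anchored to the expected hostname, would be narrower.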

3. Platform Reference Accuracy

All platform component references are accurate against the DORA framework note and standard pal-e platform architecture:

Woodpecker CI, Harbor Registry, ArgoCD -- confirmed in deployment pipeline
Prometheus, Loki, Grafana -- confirmed in observability stack
CNPG for PostgreSQL -- confirmed
yabeda-rails, yabeda-prometheus, yabeda-puma-plugin -- correct gem names for Rails observability
Keycloak for OIDC auth -- confirmed (landscaping-assistant reference pattern)
Promtail DaemonSet for log collection -- confirmed
Blackbox Exporter for uptime probes -- confirmed (noted as Phase 14 PLANNED in DORA framework)
ServiceMonitor via Terraform -- confirmed
Lograge for structured JSON logging -- standard Rails pattern, correct
kaniko for in-cluster container builds -- confirmed Woodpecker pattern

One note: docs/dora/README.md states Blackbox Exporter uptime probes are in the "Platform Gives You (zero setup)" category, but the DORA framework note marks synthetic monitoring (Blackbox Exporter on funnel endpoints) as "PLANNED (Phase 14)" -- not yet live. The docs should clarify this is a planned capability, not currently automatic.

4. Cross-Document Consistency

Persona mismatch between PR description and docs:
The PR body says "3 personas (Author, DevOps Practitioner, Visitor)" but the Mermaid diagram in docs/user-stories/README.md defines the three personas as "Lucas (Author)", "Public Visitor", and "AI Agent". The stories section does cover DevOps Practitioner stories, under the heading "As a DevOps Practitioner", but the persona diagram shows "AI Agent" in that slot. The diagram and the stories section therefore disagree about who the third persona is.

Story Map vs Roadmap phase alignment:
The Story Map in docs/user-stories/README.md uses 3 phases:

Phase 1: Static Posts
Phase 2: Pipeline + Observability (combines roadmap Phases 2+3)
Phase 3: Rich Content (combines roadmap Phases 4+5+6)

The Roadmap in docs/roadmap/README.md uses 7 phases. This is not a bug -- the story map is a higher-level grouping -- but the labeling is potentially confusing since both use "Phase N" numbering with different meanings. Consider labeling the story map groups differently (e.g., "Now / Next / Later" which the subgraph labels already use).

Architecture docs/roadmap alignment: The architecture doc's deployment pipeline and observability wiring correctly mirror what the roadmap describes for Phases 2-3. The filetree diagram matches Phase 1 deliverables. No inconsistencies found.

5. Factual Accuracy

DORA bands reference is correct. The bands table in docs/dora/README.md exactly matches the industry-standard DORA bands from the dora-framework note (Elite/High/Medium/Low thresholds for all four metrics).

"Platform baseline for core repos is p50 ~10 min" -- confirmed. The DORA framework re-baseline data shows p50 lead times of 6-12 min across core repos, making "~10 min" an accurate summary.

Ruby version reference: README mentions base-images with "Ruby 3.4.9 Docker images" -- this is plausible but unverified. Minor detail.

yabeda metric names: rails_requests_total, rails_request_duration_seconds_bucket, puma_workers, puma_backlog -- these are standard yabeda-rails and yabeda-puma metric names. Correct.

BLOCKERS

None. This is a docs-only PR with no code changes. No secrets, no credentials, no code to test. The PromQL issues noted above are accuracy concerns in documentation, not functional blockers.

NITS

PromQL comment/query mismatch (docs/dora/README.md, Deployment Frequency): Comment says "7d rolling average", query uses [1d]. Fix one or the other.
MTTR PromQL (docs/dora/README.md): ALERTS_FOR_STATE does not compute alert-to-resolution duration. Either add a note that MTTR measurement requires Alertmanager API integration (not pure PromQL), or remove the query and describe the measurement method instead.
Blackbox Exporter "zero setup" claim (docs/dora/README.md): The "Platform Gives You" section lists uptime probes as automatic, but the DORA framework note shows Blackbox Exporter synthetic monitoring is still Phase 14 PLANNED. Add a caveat or move to "Future Enhancement".
Persona diagram inconsistency (docs/user-stories/README.md): The Mermaid diagram shows "AI Agent" as a persona but the stories section covers "DevOps Practitioner". Either add an AI Agent story or change the diagram persona to "DevOps Practitioner" to match.
Story Map phase numbering (docs/user-stories/README.md): Story Map phases 1-3 map to Roadmap phases 1-7 non-obviously. The subgraph labels ("now", "next", "later") are clearer than the "Phase N" headings. Consider dropping the phase numbers from the story map titles or adding a note about the different granularity.
CFR PromQL windowing (docs/dora/README.md): The Change Failure Rate query computes cumulative all-time ratio. For a dashboard panel, a windowed version (e.g., increase(...[30d])) would be more useful. Worth noting in the doc.

SOP COMPLIANCE

PR body has Summary, Changes, Test Plan, Related -- all present and well-structured
No secrets committed -- docs only, no credentials
No unnecessary file changes -- all 5 files are directly relevant to the stated goal
Commit messages -- single commit, title is descriptive
Scope matches issue #1 -- docs foundation precedes the poem content

PROCESS OBSERVATIONS

Deployment frequency impact: None -- docs-only, no pipeline changes.
Change failure risk: Minimal. The PromQL accuracy nits could cause confusion if someone copies these queries verbatim into a Grafana dashboard, but they are documentation, not executable code.
Documentation quality: Strong foundation. The cross-referencing between docs sections is thorough. The mermaid diagrams are well-structured and will render correctly in Forgejo. The DORA strategy correctly maps to the platform's actual observability stack.
One process note: The PR references "Closes #1" but issue #1 is "Paste first poem into html-poster" -- the docs foundation is prerequisite work, but merging this PR will close the poem issue before any poem is posted. Consider whether the docs PR should reference a different issue or remove the "Closes" keyword.

VERDICT: APPROVED

2026-06-10 11:45:40 +00:00

Summary

Foundation documentation for html-poster covering both purposes: static HTML publishing and devops pipeline proof
Four doc sections (user-stories, architecture, roadmap, dora) with mermaid diagrams throughout
DORA metrics strategy maps PromQL queries to the existing platform observability stack

Changes

README.md: replaced auto-generated description with project overview, docs TOC, quick start, and related repos
docs/user-stories/README.md: 3 personas (Author, DevOps Practitioner, Visitor), sequence/flow diagrams, story map across phases
docs/architecture/README.md: system context, Rails filetree diagram, request flow, deployment pipeline, observability wiring, "what you get for free" table
docs/roadmap/README.md: 7 phases with Gantt chart, per-phase flowcharts, decision log
docs/dora/README.md: four metrics with data sources, PromQL queries for each panel, DORA band reference, trajectory from Phase 1 to Elite, automatic vs manual setup

Test Plan

Verify mermaid diagrams render correctly in Forgejo
Confirm PromQL queries in docs/dora/ match existing DORA exporter metric names (dora_pr_merges_total, dora_pr_lead_time_seconds_bucket, dora_deployments_total)
Verify referenced repos (rails-base, landscaping-assistant, base-images, ror-css-guide) exist on Forgejo

Review Checklist

Passed automated review-fix loop
No secrets committed
No unnecessary file changes
Commit messages are descriptive
Feature flag needed? No -- docs only, no code changes

Related Notes

Related to #1 -- docs foundation; poem content tracked separately
html-poster -- project this work belongs to
dora-framework -- DORA metrics axiom referenced in docs/dora/
story-html-poster-post-content -- user story for content posting
story-html-poster-pipeline-proof -- user story for pipeline proof

ldraney added 1 commit

Add project docs: user stories, architecture, roadmap, DORA strategy ead11dfeb2

Foundation documentation for html-poster covering the dual purpose
(static HTML publishing + devops pipeline proof) with mermaid diagrams
throughout. DORA section maps PromQL queries to the existing platform
observability stack.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-06-10 11:48:55 +00:00

Fix QA review nits: PromQL comment, MTTR query, Blackbox status 726d81cdcb

- Remove misleading "7d rolling average" comment on DF query (uses 1d range)
- Fix ALERTS_FOR_STATE usage: subtract from time() for duration
- Mark Blackbox Exporter uptime probes as planned (Phase 14), not automatic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-06-10 11:49:08 +00:00

Review-Fix Response

Fixed 3 of 6 QA findings in commit 726d81c:

PromQL comment mismatch -- removed misleading "7d rolling average" comment; the query uses [1d] range
MTTR query -- ALERTS_FOR_STATE returns a timestamp, not a duration. Fixed to time() - ALERTS_FOR_STATE{...} which gives seconds since alert fired
Blackbox Exporter -- marked as "planned -- Phase 14 of platform hardening" instead of claiming it's automatic

Remaining 3 nits (not fixing):

AI Agent persona vs DevOps Practitioner story: the persona diagram is a system context view (who interacts), the stories are from Lucas's perspective. Both are correct at their level.
Story Map 3 phases vs Roadmap 7 phases: Story Map groups by user value (Static Posts / Pipeline+Observability / Rich Content), Roadmap groups by deliverable. Different granularity is intentional.
Closes #1 premature: Agreed -- need to update PR body. Issue #1 is about posting the poem, not docs.

Note: PR body needs update to remove Closes #1 and replace with Related to #1.

ldraney merged commit e9fcf7ab96 into main

2026-06-11 11:39:25 +00:00

ldraney deleted branch docs/project-foundation

ldraney referenced this pull request from a commit

2026-06-11 11:39:26 +00:00

Add project docs: user stories, architecture, roadmap, DORA strategy (#5)
