Add project docs: user stories, architecture, roadmap, DORA strategy #5
Summary
Changes
README.md: replaced auto-generated description with project overview, docs TOC, quick start, and related repos
docs/user-stories/README.md: 3 personas (Author, DevOps Practitioner, Visitor), sequence/flow diagrams, story map across phases
docs/architecture/README.md: system context, Rails filetree diagram, request flow, deployment pipeline, observability wiring, "what you get for free" table
docs/roadmap/README.md: 7 phases with Gantt chart, per-phase flowcharts, decision log
docs/dora/README.md: four metrics with data sources, PromQL queries for each panel, DORA band reference, trajectory from Phase 1 to Elite, automatic vs manual setup
Test Plan
dora_pr_merges_total, dora_pr_lead_time_seconds_bucket, dora_deployments_total)
Review Checklist
Related Notes
html-poster -- project this work belongs to
dora-framework -- DORA metrics axiom referenced in docs/dora/
story-html-poster-post-content -- user story for content posting
story-html-poster-pipeline-proof -- user story for pipeline proof
PR #5 Review
DOMAIN REVIEW
Tech stack: Markdown documentation with Mermaid diagrams, PromQL queries, and platform architecture references. Review covers: Mermaid syntax correctness, PromQL accuracy against the live DORA exporter, platform reference accuracy, and cross-document consistency.
1. Mermaid Diagram Syntax
All 18 Mermaid diagrams use valid syntax. Subgraphs are properly opened and closed. Node IDs are consistent. Diagram types (graph, flowchart, sequenceDiagram, gantt, erDiagram) are correctly applied for their purpose. No syntax issues found.
2. PromQL Accuracy (verified against dora-framework note in pal-e-docs)
Metric names are correct. The three DORA exporter metrics referenced in docs/dora/README.md match exactly what the platform's DORA framework confirms as LIVE:
dora_pr_merges_total -- confirmed
dora_pr_lead_time_seconds_bucket -- confirmed
dora_deployments_total -- confirmed
PromQL issues found:
(a) Deployment Frequency query -- label comment mismatch (docs/dora/README.md):
The comment says "7d rolling average" but the query uses a [1d] range vector. These are inconsistent. If a 7d rolling average is intended, the query should use [7d] with division by 7 (e.g., sum(increase(dora_pr_merges_total{repo="html-poster"}[7d])) / 7). As written, this returns merges in the last 1 day, not a 7-day rolling average.
(b) Change Failure Rate query -- missing sum() wrapping (docs/dora/README.md):
This uses sum() without increase() or rate(), which means it computes the all-time cumulative ratio. That may be intentional for a lifetime CFR, but it diverges from the DORA framework note which uses dora_deployments_total{status="failure"} / sum(dora_deployments_total) -- and neither version produces a windowed rate. For a dashboard panel this is typically windowed (e.g., increase(...[7d])). Not a blocker since these are aspirational dashboard queries, but worth noting for accuracy.
(c) MTTR query -- ALERTS_FOR_STATE is an internal Prometheus metric (docs/dora/README.md):
ALERTS_FOR_STATE is a Prometheus internal time series that tracks when an alert entered the firing state. It is not directly useful for computing resolution duration in PromQL alone -- you would need Alertmanager API data or a recording rule to compute fire-to-resolve time. This query as written would return a Unix timestamp, not a duration. The description claims it shows "Time from alert firing to resolution" which is misleading.
(d) Blackbox probe query is reasonable but note the regex matcher instance=~".*html-poster.*" will match any instance containing "html-poster" -- could be overly broad if other services reference html-poster in their probe names.
3. Platform Reference Accuracy
All platform component references are accurate against the DORA framework note and standard pal-e platform architecture:
One note: docs/dora/README.md states Blackbox Exporter uptime probes are in the "Platform Gives You (zero setup)" category, but the DORA framework note marks synthetic monitoring (Blackbox Exporter on funnel endpoints) as "PLANNED (Phase 14)" -- not yet live. The docs should clarify this is a planned capability, not currently automatic.
4. Cross-Document Consistency
Persona mismatch between PR description and docs:
The PR body says "3 personas (Author, DevOps Practitioner, Visitor)" but the actual docs/user-stories/README.md Mermaid diagram defines the three personas as "Lucas (Author)", "Public Visitor", and "AI Agent". The stories section below does cover DevOps Practitioner stories, but under the heading "As a DevOps Practitioner" -- the persona diagram shows "AI Agent" instead. The diagram and the stories section are inconsistent about who the third persona is.
Story Map vs Roadmap phase alignment:
The Story Map in docs/user-stories/README.md uses 3 phases:
The Roadmap in docs/roadmap/README.md uses 7 phases. This is not a bug -- the story map is a higher-level grouping -- but the labeling is potentially confusing since both use "Phase N" numbering with different meanings. Consider labeling the story map groups differently (e.g., "Now / Next / Later" which the subgraph labels already use).
Architecture docs/roadmap alignment: The architecture doc's deployment pipeline and observability wiring correctly mirror what the roadmap describes for Phases 2-3. The filetree diagram matches Phase 1 deliverables. No inconsistencies found.
5. Factual Accuracy
DORA bands reference is correct. The bands table in docs/dora/README.md exactly matches the industry-standard DORA bands from the dora-framework note (Elite/High/Medium/Low thresholds for all four metrics).
"Platform baseline for core repos is p50 ~10 min" -- confirmed. The DORA framework re-baseline data shows p50 lead times of 6-12 min across core repos, making "~10 min" an accurate summary.
Ruby version reference: README mentions base-images with "Ruby 3.4.9 Docker images" -- this is plausible but unverified. Minor detail.
yabeda metric names: rails_requests_total, rails_request_duration_seconds_bucket, puma_workers, puma_backlog -- these are standard yabeda-rails and yabeda-puma metric names. Correct.
BLOCKERS
None. This is a docs-only PR with no code changes. No secrets, no credentials, no code to test. The PromQL issues noted above are accuracy concerns in documentation, not functional blockers.
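To make the windowing concerns in items (a) and (b) concrete, here is a minimal sketch in Python using invented daily counter values (none of these numbers come from the live exporter). It mimics what a [1d] increase, a [7d] increase divided by 7, and a raw cumulative ratio would report. It deliberately ignores Prometheus's extrapolation and counter-reset handling, so treat it as arithmetic illustration only.

```python
# Hypothetical cumulative counter samples, one per day (oldest first).
# Index 0 is the value 7 days ago, index 7 is "now".
merges_total = [100, 100, 103, 103, 104, 110, 110, 114]
deploys_total = [400, 402, 405, 405, 407, 412, 413, 420]
failed_deploys_total = [40, 40, 40, 40, 40, 40, 40, 41]


def increase(samples, days):
    """Counter growth over the last `days` days (no reset handling)."""
    return samples[-1] - samples[-1 - days]


# What a [1d] range selector reports.
last_day = increase(merges_total, 1)
# What the "7d rolling average" comment implies.
rolling_avg_7d = increase(merges_total, 7) / 7

# Cumulative ratio: every failure since the counters started.
cfr_all_time = failed_deploys_total[-1] / deploys_total[-1]
# Windowed ratio: only failures and deployments inside the last 7 days.
cfr_7d = increase(failed_deploys_total, 7) / increase(deploys_total, 7)

print(f"merges, last 1d: {last_day}")
print(f"merges, 7d rolling average: {rolling_avg_7d:.2f}")
print(f"CFR all-time: {cfr_all_time:.3f}, CFR 7d window: {cfr_7d:.3f}")
```

With these invented numbers the 1-day window reports 4 merges while the 7-day average is 2.0 per day, and the lifetime failure ratio (about 0.098) is nearly double the 7-day ratio (0.05). That gap is why the comment and the range selector need to agree.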
NITS
PromQL comment/query mismatch (docs/dora/README.md, Deployment Frequency): Comment says "7d rolling average", query uses [1d]. Fix one or the other.
MTTR PromQL (docs/dora/README.md): ALERTS_FOR_STATE does not compute alert-to-resolution duration. Either add a note that MTTR measurement requires Alertmanager API integration (not pure PromQL), or remove the query and describe the measurement method instead.
Blackbox Exporter "zero setup" claim (docs/dora/README.md): The "Platform Gives You" section lists uptime probes as automatic, but the DORA framework note shows Blackbox Exporter synthetic monitoring is still Phase 14 PLANNED. Add a caveat or move to "Future Enhancement".
Persona diagram inconsistency (docs/user-stories/README.md): The Mermaid diagram shows "AI Agent" as a persona but the stories section covers "DevOps Practitioner". Either add an AI Agent story or change the diagram persona to "DevOps Practitioner" to match.
Story Map phase numbering (docs/user-stories/README.md): Story Map phases 1-3 map to Roadmap phases 1-7 non-obviously. The subgraph labels ("now", "next", "later") are clearer than the "Phase N" headings. Consider dropping the phase numbers from the story map titles or adding a note about the different granularity.
CFR PromQL windowing (docs/dora/README.md): The Change Failure Rate query computes a cumulative all-time ratio. For a dashboard panel, a windowed version (e.g., increase(...[30d])) would be more useful. Worth noting in the doc.
SOP COMPLIANCE
PROCESS OBSERVATIONS
VERDICT: APPROVED
Review-Fix Response
Fixed 3 of 6 QA findings in commit 726d81c:
[1d] range
ALERTS_FOR_STATE returns a timestamp, not a duration. Fixed to time() - ALERTS_FOR_STATE{...} which gives seconds since alert fired
Remaining 3 nits (not fixing):
Closes #1 premature: Agreed -- need to update PR body. Issue #1 is about posting the poem, not docs.
Note: PR body needs update to remove Closes #1 and replace with Related to #1.
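A note on the MTTR fix above: time() - ALERTS_FOR_STATE{...} reports how long a currently active alert has been active, which is not the same as fire-to-resolve time for alerts that have already cleared. Below is a rough sketch of the latter, assuming resolved-alert records that carry `startsAt` and `endsAt` timestamps. The field names mirror those on Alertmanager alert objects, but the records, the alert name, and the `resolution_seconds` helper are hypothetical, and how the records are collected (webhook receiver, API polling) is left open.

```python
from datetime import datetime
from statistics import mean

# Hypothetical resolved-alert records; timestamps are ISO 8601 strings.
resolved_alerts = [
    {
        "alertname": "HtmlPosterDown",
        "startsAt": "2025-01-01T10:00:00+00:00",
        "endsAt": "2025-01-01T10:12:00+00:00",
    },
    {
        "alertname": "HtmlPosterDown",
        "startsAt": "2025-01-03T08:30:00+00:00",
        "endsAt": "2025-01-03T08:48:00+00:00",
    },
]


def resolution_seconds(alert):
    """Seconds between an alert starting and being resolved."""
    started = datetime.fromisoformat(alert["startsAt"])
    ended = datetime.fromisoformat(alert["endsAt"])
    return (ended - started).total_seconds()


durations = [resolution_seconds(a) for a in resolved_alerts]
mttr_seconds = mean(durations)
print(f"MTTR: {mttr_seconds / 60:.1f} min over {len(durations)} alerts")
```

The two invented incidents last 12 and 18 minutes, so the mean comes out at 15 minutes. The point is only that a duration needs both a start and an end timestamp, which a single ALERTS_FOR_STATE sample does not provide.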