Add observability roadmap doc with target architecture #84

Merged
ldraney merged 1 commit from docs/observability-roadmap into main 2026-06-04 05:37:16 +00:00
Owner

Summary

  • Adds docs/observability-roadmap.md mapping every Datadog capability to its open-source Grafana-ecosystem equivalent
  • Mermaid diagram of the full target architecture (Tempo, Pyroscope, Faro, Headlamp, Pyrra, Falco) with green highlighting for not-yet-deployed components
  • 6-phase rollout plan prioritized by impact: tracing -> database queries -> profiling -> cluster UI -> frontend RUM -> SLOs + runtime security

Changes

  • docs/observability-roadmap.md -- new file: target architecture diagram, Datadog gap analysis table, 6-phase rollout with per-phase scope and technology reference

Test Plan

  • [ ] Mermaid diagram renders correctly in Forgejo preview
  • [ ] Content aligns with existing docs/observability.md (current state doc)
  • [ ] No broken links or references to nonexistent files

Review Checklist

  • [ ] Passed automated review-fix loop
  • [ ] No secrets committed
  • [ ] No unnecessary file changes
  • [ ] Commit messages are descriptive

Related Notes

  • Closes #83
  • project-pal-e-platform -- platform observability stack
  • ldraney/landscaping-assistant #43 -- parent observability issue
Add observability roadmap doc with target architecture
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
CI / scan_ruby (pull_request) Has been cancelled
CI / scan_js (pull_request) Has been cancelled
CI / lint (pull_request) Has been cancelled
cb862e8461
Maps every Datadog capability to its open-source equivalent and
lays out a 6-phase rollout: Tempo+OTel, pg_stat_statements,
Pyroscope, Headlamp, Faro, Pyrra+Falco.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Owner

PR #84 Review

DOMAIN REVIEW

Domain: Documentation (Observability/Infrastructure Architecture)

This is a docs-only PR adding docs/observability-roadmap.md -- a Datadog-to-open-source gap analysis with a Mermaid architecture diagram and 6-phase rollout plan. No code changes, no secrets, no tests required.

Mermaid Diagram Review:

The diagram is well-structured with 6 subgraphs (apps, collectors, storage, query, alerting, database). All node IDs are unique and edges are directional with meaningful labels. A few observations:

  1. Faro data flow: The SVELTE node sends to OTEL_COL with label "sessions + errors + vitals", then OTEL_COL forwards only to TEMPO. In practice, Faro also sends session/error data that would land in Loki (via the collector), not just Tempo. The Technology Reference table correctly notes Faro's storage backend is "Tempo + Loki", but the diagram edges do not reflect the Loki path. This is a minor inaccuracy in the diagram -- consider adding an edge from OTEL_COL to LOKI or noting that the collector routes logs separately.

  2. HEADLAMP node is defined and styled but has zero edges. It sits in the "Query & Visualization" subgraph with no connections to anything. Since Headlamp reads the k8s API directly (as noted in the Technology Reference table), consider either adding a K8S_API node or adding a comment in the diagram. Not a blocker since Headlamp genuinely is standalone, but it looks like a forgotten node.

  3. DORA exporter edge: DORA -->|CI metrics| PROM is correct for the data flow. Consistent with docs/observability.md which confirms DORA metrics are verified and flowing.

  4. Green styling convention is clear and correct. The 8 green-styled nodes (TEMPO, PYRO_STORE, OTEL_COL, PYRO_AGENT, HEADLAMP, PYRRA, FALCO, SVELTE) match the "not yet deployed" designation. All currently deployed components (PROM, GRAFANA, LOKI, PROMTAIL, BLACKBOX, DORA, AM, TG, SLACK, CNPG, CNPG_MON) are correctly left unstyled.
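
The disconnected-node observation in item 2 could be checked mechanically. The following is a minimal sketch, assuming a simplified subset of Mermaid flowchart syntax (`A --> B`, `A -->|label| B`, and standalone `ID[text]` declarations). The `DIAGRAM` fragment is hypothetical: the node IDs come from this review, but the lines themselves are illustrative and are not copied from docs/observability-roadmap.md. The helper name `isolated_nodes` is likewise an assumption.

```python
import re

# Hypothetical fragment: node IDs are taken from the review, but these lines
# are illustrative only and are not the actual roadmap diagram.
DIAGRAM = """
flowchart LR
    SVELTE[SvelteKit Frontend] -->|sessions + errors + vitals| OTEL_COL[OTel Collector]
    OTEL_COL --> TEMPO[Tempo]
    DORA[DORA exporter] -->|CI metrics| PROM[Prometheus]
    HEADLAMP[Headlamp]
"""

NODE_ID = r"[A-Za-z_][A-Za-z0-9_]*"
# An edge is "A --> B" or "A -->|label| B"; the [text] shape on A is optional.
EDGE = re.compile(
    rf"^\s*({NODE_ID})(?:\[[^\]]*\])?\s*-->(?:\|[^|]*\|)?\s*({NODE_ID})"
)
# A standalone declaration is a line that holds only "ID[text]".
DECL = re.compile(rf"^\s*({NODE_ID})\[[^\]]*\]\s*$")


def isolated_nodes(diagram: str) -> set[str]:
    """Return node IDs that are declared but never appear in an edge."""
    declared, connected = set(), set()
    for line in diagram.splitlines():
        edge = EDGE.match(line)
        if edge:
            declared.update(edge.groups())
            connected.update(edge.groups())
            continue
        decl = DECL.match(line)
        if decl:
            declared.add(decl.group(1))
    return declared - connected


print(sorted(isolated_nodes(DIAGRAM)))
```

On this fragment the only isolated node is `HEADLAMP`. Adding a line such as `HEADLAMP --> K8S_API[Kubernetes API]` (the node suggested in item 2) would make the set empty. A real diagram with subgraphs, other arrow styles, or chained edges would need a fuller parser than this sketch.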

Gap Analysis Accuracy (cross-referenced against docs/observability.md):

  • "Infrastructure Monitoring: COMPLETE" -- Confirmed. kube-prometheus-stack, ServiceMonitor, node-exporter all documented as done.
  • "Log Management: COMPLETE" -- Consistent with platform stack.
  • "Dashboards: COMPLETE (26 dashboards)" -- The existing observability doc mentions a golden signals dashboard; the "26 dashboards" count presumably includes platform-wide dashboards beyond this app. Reasonable claim.
  • "Alerting: COMPLETE" -- Partially accurate. The existing doc shows PrometheusRule alerts are "pending -- #17, needs refinement". The roadmap marks Alerting as COMPLETE with gap "No escalation". The Alertmanager pipeline itself is deployed, but the app-specific alert rules are not finalized. This is a borderline accuracy issue but not a blocker since the infrastructure (Alertmanager -> Telegram + Slack) is indeed complete.
  • "CI Visibility: COMPLETE" -- Confirmed by issue #20 in existing doc.
  • "Synthetic Monitoring: PARTIAL (13 probes)" -- Existing doc confirms Blackbox exporter is done (issue #21). The "13 probes" count is a platform-wide number, plausible.
  • "APM / Distributed Tracing: NOT STARTED" -- Confirmed, no tracing in existing doc.
  • "Database Monitoring: PARTIAL" -- Correct. Existing doc shows Rails DB runtime metrics via yabeda but no pg_stat_statements.

Phase Ordering and Dependencies:

The 6-phase plan is logically ordered:

  1. Tracing first (keystone -- enables correlation for everything after)
  2. Database queries (low effort, leverages existing CNPG PodMonitor)
  3. Profiling (builds on tracing with span-level flame graphs)
  4. Cluster UI (independent, but after core observability is in place)
  5. Frontend RUM (requires OTel Collector from Phase 1)
  6. SLOs + Runtime Security (capstone, requires mature metrics baseline)

Phase 5 correctly depends on Phase 1 (OTel Collector). Phase 3 correctly notes it can "attach profiles to traces" from Phase 1. Phase 2 correctly notes log-trace correlation depends on Phase 1. The dependency chain is sound.
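
The ordering claim can be stated as a simple invariant: every phase depends only on lower-numbered phases. A minimal sketch follows, using only the dependencies named in this review. The `DEPENDS_ON` mapping and the `order_violations` helper are illustrative assumptions, not part of the roadmap doc. The Phase 2 entry reflects the log-trace correlation note and may be a soft dependency rather than a hard one.

```python
# Phase names as listed in the review.
PHASES = {
    1: "Tracing",
    2: "Database queries",
    3: "Profiling",
    4: "Cluster UI",
    5: "Frontend RUM",
    6: "SLOs + Runtime Security",
}

# Dependencies as described above: Phases 2, 3 and 5 build on Phase 1.
# Phases 4 and 6 have no phase-level dependency stated, so they stay empty.
DEPENDS_ON = {1: set(), 2: {1}, 3: {1}, 4: set(), 5: {1}, 6: set()}


def order_violations(depends_on: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Return (phase, dependency) pairs where the dependency is not scheduled earlier."""
    return [
        (phase, dep)
        for phase, deps in sorted(depends_on.items())
        for dep in sorted(deps)
        if dep >= phase
    ]


for phase, dep in order_violations(DEPENDS_ON):
    print(f"Phase {phase} ({PHASES[phase]}) is scheduled before its dependency, Phase {dep}")
print("violations:", order_violations(DEPENDS_ON))
```

With the mapping above the list of violations is empty, which matches the conclusion that the dependency chain is sound. Moving Frontend RUM ahead of tracing, for example, would surface as a violation.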

Technology Choices:

All choices (Tempo, Pyroscope, Faro, Headlamp, Pyrra, Falco) are Grafana-ecosystem or CNCF-ecosystem tools that integrate naturally with the existing Grafana + Prometheus + Loki stack. The "deploys as Helm" column in the Technology Reference table is consistent with the existing Terraform Helm pattern. No concerns here.

BLOCKERS

None. This is a docs-only PR with no code, no secrets, and no security implications.

NITS

  1. Diagram: HEADLAMP has no edges. Consider adding a K8S_API node or a note explaining it reads the k8s API directly, so it does not look like a disconnected/forgotten node.

  2. Diagram: Faro-to-Loki path missing. The OTEL_COL only has an edge to TEMPO, but Faro data also routes to Loki for error/session logs. The Technology Reference table says "Tempo + Loki" for Faro storage, which contradicts the diagram edges.

  3. Green styling on the SVELTE node. The legend says "Green = not yet deployed" and 8 nodes are styled green, but the SVELTE node (SvelteKit Frontend with Faro) represents an application that presumably already exists. The green should arguably apply only to the Faro instrumentation, not to the frontend itself. Consider splitting it into SVELTE (existing) and FARO (new), or clarifying this in the legend.

  4. Alerting status nuance. The gap analysis says Alerting is "COMPLETE" but docs/observability.md line 104 says PrometheusRule alerts are "pending -- #17, needs refinement." The Alertmanager infrastructure is deployed, but app-specific alert rules are not. Consider marking as PARTIAL with gap "App-specific PrometheusRule thresholds not finalized (#17)."

  5. Minor: "Container Scanning" row. Claims "Harbor Trivy: COMPLETE" with gap "No runtime security." This is accurate but could link to Phase 6 (Falco) for traceability.

SOP COMPLIANCE

  • [x] Branch follows naming convention (docs/observability-roadmap -- docs-only branch, no issue number prefix but this is a documentation branch which is acceptable)
  • [x] PR body follows template (Summary, Changes, Test Plan, Review Checklist, Related Notes all present)
  • [x] Closes #83 present in Related Notes
  • [ ] Related Notes references plan slug -- project-pal-e-platform is referenced but no plan slug is provided. The parent task noted "No plan slug" so this is expected.
  • [x] No secrets committed
  • [x] No unnecessary file changes (1 file, docs-only, matches issue scope)
  • [x] Commit messages -- single commit, title matches PR purpose
  • [x] No .env files or credentials

Branch naming note: The branch is docs/observability-roadmap rather than 83-observability-roadmap. For docs-only PRs this is a minor deviation, not a blocker. Future PRs should prefer the {issue-number}-{purpose} convention.

PROCESS OBSERVATIONS

  • This roadmap doc is a solid planning artifact. It correctly spawned 6 follow-up issues (#85-#90) matching the 6 phases, which is good project management practice.
  • The gap analysis table provides clear prioritization criteria for the team.
  • No DORA impact concerns -- this is documentation, not a deployable change.
  • The existing docs/observability.md documents current state; this new doc documents target state. The two complement each other well. Consider adding a cross-reference link between them (e.g., "See also: Current observability setup").

VERDICT: APPROVED

Clean docs-only PR. The Mermaid diagram is structurally valid, the gap analysis is accurate against the existing observability doc, the phase ordering is logically sound, and the technology choices fit the existing Grafana ecosystem. The nits about Faro-to-Loki edges and Headlamp connectivity are minor diagram completeness improvements, not blockers.

ldraney deleted branch docs/observability-roadmap 2026-06-04 05:37:16 +00:00