Add observability roadmap doc with target architecture #84
Summary
docs/observability-roadmap.md mapping every Datadog capability to its open-source Grafana-ecosystem equivalent
Changes
docs/observability-roadmap.md -- new file: target architecture diagram, Datadog gap analysis table, 6-phase rollout with per-phase scope and technology reference
Test Plan
docs/observability.md (current state doc)
Review Checklist
Related Notes
project-pal-e-platform -- platform observability stack
ldraney/landscaping-assistant #43 -- parent observability issue
PR #84 Review
DOMAIN REVIEW
Domain: Documentation (Observability/Infrastructure Architecture)
This is a docs-only PR adding docs/observability-roadmap.md -- a Datadog-to-open-source gap analysis with a Mermaid architecture diagram and 6-phase rollout plan. No code changes, no secrets, no tests required.
Mermaid Diagram Review:
The diagram is well-structured with 6 subgraphs (apps, collectors, storage, query, alerting, database). All node IDs are unique and edges are directional with meaningful labels. A few observations:
Faro data flow: The SVELTE node sends to OTEL_COL with label "sessions + errors + vitals", then OTEL_COL forwards only to TEMPO. In practice, Faro's session and error data would land in Loki (via the collector), not just Tempo. The Technology Reference table correctly notes Faro's storage backend is "Tempo + Loki", but the diagram edges do not reflect the Loki path. This is a minor inaccuracy in the diagram -- consider adding an edge from OTEL_COL to LOKI or noting that the collector routes logs separately.
HEADLAMP node is defined and styled but has zero edges. It sits in the "Query & Visualization" subgraph with no connections to anything. Since Headlamp reads the k8s API directly (as noted in the Technology Reference table), consider either adding a K8S_API node or adding a comment in the diagram. Not a blocker since Headlamp genuinely is standalone, but it looks like a forgotten node.
DORA exporter edge: DORA -->|CI metrics| PROM is correct for the data flow. This is consistent with docs/observability.md, which confirms DORA metrics are verified and flowing.
Green styling convention is clear and correct. The 8 green-styled nodes (TEMPO, PYRO_STORE, OTEL_COL, PYRO_AGENT, HEADLAMP, PYRRA, FALCO, SVELTE) match the "not yet deployed" designation. All currently-deployed components (PROM, GRAFANA, LOKI, PROMTAIL, BLACKBOX, DORA, AM, TG, SLACK, CNPG, CNPG_MON) are correctly left unstyled.
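The disconnected-node observation above could be checked mechanically. The following is a minimal sketch, not part of the PR: it assumes nodes are declared on their own lines (e.g. `HEADLAMP[Headlamp]`) and that edges use bare node IDs with `-->`. The `sample` fragment and its labels are hypothetical and only reuse node IDs mentioned in this review.

```python
import re

# Matches a Mermaid flowchart edge such as `A --> B` or `A -->|label| B`.
EDGE = re.compile(r"^\s*(\w+)\s*-->\s*(?:\|[^|]*\|\s*)?(\w+)")
# Matches a standalone node definition such as `HEADLAMP[Headlamp]`.
NODE = re.compile(r"^\s*(\w+)\s*[\[\(\{]")


def orphan_nodes(diagram: str) -> set[str]:
    """Return node IDs that are defined but never appear in any edge."""
    defined, connected = set(), set()
    for line in diagram.splitlines():
        edge = EDGE.match(line)
        if edge:
            connected.update(edge.groups())
            continue
        node = NODE.match(line)
        if node:
            defined.add(node.group(1))
    return defined - connected


# Hypothetical fragment for illustration; not the diagram from the PR.
sample = """
flowchart LR
    subgraph query[Query & Visualization]
        GRAFANA[Grafana]
        HEADLAMP[Headlamp]
    end
    PROM[Prometheus]
    DORA[DORA exporter]
    DORA -->|CI metrics| PROM
    PROM --> GRAFANA
"""

print(orphan_nodes(sample))
```

On this fragment the only defined node without an edge is HEADLAMP. Inline definitions on edge lines (e.g. `A[Label] --> B`) are not handled by this sketch and would need a fuller parser.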
Gap Analysis Accuracy (cross-referenced against docs/observability.md):
Phase Ordering and Dependencies:
The 6-phase plan is logically ordered:
Phase 5 correctly depends on Phase 1 (OTel Collector). Phase 3 correctly notes it can "attach profiles to traces" from Phase 1. Phase 2 correctly notes log-trace correlation depends on Phase 1. The dependency chain is sound.
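The dependency check described above can be expressed as a small ordering test. This is an illustrative sketch only: the `DEPENDS_ON` mapping encodes just the three relationships named in this review (Phases 2, 3, and 5 building on Phase 1), and the function name is hypothetical.

```python
# Dependencies as described in the review: phase -> phases it builds on.
DEPENDS_ON = {2: {1}, 3: {1}, 5: {1}}


def order_violations(order, depends_on):
    """Return (phase, dependency) pairs where the dependency is scheduled later."""
    position = {phase: index for index, phase in enumerate(order)}
    return [
        (phase, dep)
        for phase, deps in depends_on.items()
        for dep in sorted(deps)
        if position[dep] > position[phase]
    ]


# The roadmap's 1..6 ordering satisfies every listed dependency.
print(order_violations([1, 2, 3, 4, 5, 6], DEPENDS_ON))
# Moving Phase 5 ahead of Phase 1 is flagged.
print(order_violations([5, 1, 2, 3, 4, 6], DEPENDS_ON))
```

The first call returns an empty list; the second reports the pair (5, 1), since Phase 5 would be scheduled before the OTel Collector it relies on.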
Technology Choices:
All choices (Tempo, Pyroscope, Faro, Headlamp, Pyrra, Falco) are Grafana-ecosystem or CNCF-ecosystem tools that integrate naturally with the existing Grafana + Prometheus + Loki stack. The "deploys as Helm" column in the Technology Reference table is consistent with the existing Terraform Helm pattern. No concerns here.
BLOCKERS
None. This is a docs-only PR with no code, no secrets, and no security implications.
NITS
Diagram: HEADLAMP has no edges. Consider adding a K8S_API node or a note explaining it reads the k8s API directly, so it does not look like a disconnected/forgotten node.
Diagram: Faro-to-Loki path missing. The OTEL_COL only has an edge to TEMPO, but Faro data also routes to Loki for error/session logs. The Technology Reference table says "Tempo + Loki" for Faro storage, which contradicts the diagram edges.
Green node count discrepancy. The text says "Green = not yet deployed" and styles 8 nodes green, but the SVELTE node (SvelteKit Frontend with Faro) represents an application that presumably exists already -- the green should arguably only apply to the Faro instrumentation aspect, not the frontend itself. Consider splitting into SVELTE (existing) and FARO (new) or clarifying in the legend.
Alerting status nuance. The gap analysis says Alerting is "COMPLETE" but docs/observability.md line 104 says PrometheusRule alerts are "pending -- #17, needs refinement." The Alertmanager infrastructure is deployed, but app-specific alert rules are not. Consider marking as PARTIAL with gap "App-specific PrometheusRule thresholds not finalized (#17)."
Minor: "Container Scanning" row. Claims "Harbor Trivy: COMPLETE" with gap "No runtime security." This is accurate but could link to Phase 6 (Falco) for traceability.
SOP COMPLIANCE
docs/observability-roadmap (docs-only branch; no issue number prefix, but this is a documentation branch, which is acceptable)
Closes #83 present in Related Notes
project-pal-e-platform is referenced but no plan slug is provided. The parent task noted "No plan slug" so this is expected.
Branch naming note: The branch is docs/observability-roadmap rather than 83-observability-roadmap. For docs-only PRs this is a minor deviation, not a blocker. Future PRs should prefer the {issue-number}-{purpose} convention.
PROCESS OBSERVATIONS
docs/observability.md documents current state; this new doc documents target state. The two complement each other well. Consider adding a cross-reference link between them (e.g., "See also: Current observability setup").
VERDICT: APPROVED
Clean docs-only PR. The Mermaid diagram is structurally valid, the gap analysis is accurate against the existing observability doc, the phase ordering is logically sound, and the technology choices fit the existing Grafana ecosystem. The nits about Faro-to-Loki edges and Headlamp connectivity are minor diagram completeness improvements, not blockers.