Add alerting architecture doc #112

Merged
ldraney merged 1 commit from docs/alerting-architecture into main 2026-06-06 02:41:30 +00:00

Summary

Documents the alerting setup added in pal-e-platform PR #410.

Changes

  • docs/alerting.md: New doc covering alert rules, AlertManager routing, available metrics, Mermaid architecture diagram, and how to add new alerts
  • docs/observability-roadmap.md: Added link to the new alerting doc

Context

Follows pal-e-platform PR #408 (disable default rules) and PR #410 (add landscaping-assistant alerts + dedicated Telegram channel). This documents the live setup for future reference.

Test Plan

  • [ ] Mermaid diagram renders correctly
  • [ ] Links between docs work

Review Checklist

  • [x] No secrets or chat IDs in doc
  • [x] Alert expressions match what's deployed
  • [x] Cross-links to related docs

Related Notes

  • landscaping-assistant#95 (alert cleanup)
  • landscaping-assistant#17 (PrometheusRule alerts)

Closes #95

Add alerting architecture doc with Mermaid diagram
Some checks failed
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
CI / scan_ruby (pull_request) Has been cancelled
CI / scan_js (pull_request) Has been cancelled
CI / lint (pull_request) Has been cancelled
c4994f80cb
Documents the 4 PrometheusRule alerts, AlertManager routing to a
dedicated Telegram channel, available metrics, and how to add new
alerts. Links from observability-roadmap.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PR #112 Review

DOMAIN REVIEW

Tech stack: Documentation only -- Markdown with Mermaid diagrams. No application code changes.

Content quality:

  • The Mermaid architecture diagram is well-structured, showing the full flow from Puma through ServiceMonitor/Blackbox to Prometheus, PrometheusRule, AlertManager, and the two Telegram channels. The graph LR layout is a good choice for a pipeline flow.
  • The alert rules table is clear and complete: 4 rules with severity, expression, firing delay, and plain-English description.
  • AlertManager routing snippet is concise and the prose explains the split behavior (namespace matcher, continue: false, send_resolved: true).
  • The "Adding a New Alert" section gives a concrete 4-step workflow including the tofu apply -target command, which is helpful for future reference.
  • Metrics list covers both yabeda-rails and yabeda-puma metrics with their label dimensions.
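
The split behavior noted above (namespace matcher plus continue: false) can be modeled with a small Python sketch. This is illustrative only: the receiver names and the namespace value are hypothetical placeholders, not the deployed configuration, and the code only mirrors Alertmanager's documented first-match semantics, where a matching child route without `continue` stops evaluation of later siblings. `send_resolved` affects notifications, not routing, so it is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    matchers: dict[str, str] = field(default_factory=dict)
    continue_: bool = False  # mirrors Alertmanager's `continue` flag
    routes: list[Route] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        # Equality matchers only; regex matchers are left out of this sketch.
        return all(labels.get(k) == v for k, v in self.matchers.items())


def resolve_receivers(root: Route, labels: dict[str, str]) -> list[str]:
    """Return the receivers an alert reaches, walking child routes in order."""
    hits: list[str] = []
    for child in root.routes:
        if child.matches(labels):
            hits.extend(resolve_receivers(child, labels))
            if not child.continue_:
                break  # first match wins unless `continue` is set
    # No child matched: the alert falls back to this node's receiver.
    return hits or [root.receiver]


# Hypothetical tree: one dedicated child route, default receiver at the root.
ROOT = Route(
    receiver="telegram-default",
    routes=[
        Route(
            receiver="telegram-landscaping",
            matchers={"namespace": "landscaping-assistant"},
        )
    ],
)
```

Under these assumptions, an alert labeled with the matched namespace reaches only the dedicated receiver, and everything else falls through to the default one.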

Cross-links: Bidirectional. alerting.md links to observability-roadmap.md and infrastructure-and-pipeline.md. The roadmap now links back to alerting.md. Both target files confirmed to exist on disk.
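
A check like "target files exist on disk" could be automated along the following lines. This is a minimal sketch, not the method used in this review: the function name and the link regex are assumptions, and it only handles inline Markdown links of the form `[text](target)`.

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target). Reference-style links are not covered.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def broken_relative_links(doc: Path) -> list[str]:
    """Return relative link targets in `doc` that do not exist on disk."""
    missing = []
    for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip external URLs, mailto:, and in-page anchors
        path = target.split("#", 1)[0]  # drop any #fragment before checking
        if not (doc.parent / path).exists():
            missing.append(target)
    return missing
```

Running it over both `docs/alerting.md` and `docs/observability-roadmap.md` would cover the bidirectional case, since each file is checked relative to its own directory.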

Consistency with observability-roadmap.md: The roadmap's "Alerting" row says "Alertmanager -> Telegram + Slack" but the new alerting doc only documents Telegram routing (no Slack receiver). This is not a blocker -- the roadmap describes the target architecture while the alerting doc describes what is live -- but a brief note in alerting.md acknowledging that Slack routing is planned but not yet configured would prevent future confusion.

BLOCKERS

None. This is a docs-only change. No code, no secrets, no credentials, no user input handling. The BLOCKER criteria (test coverage, input validation, secrets, DRY auth) do not apply to pure documentation.

NITS

  1. Slack mention gap: The observability roadmap lists "Alertmanager -> Telegram + Slack" as COMPLETE, but alerting.md only documents Telegram routing. Consider either adding a note that Slack is planned/not-yet-configured, or updating the roadmap table to reflect the actual state.

  2. Branch naming convention: Branch is docs/alerting-architecture. SOP convention is {issue-number}-{kebab-case-purpose} (e.g., 95-alerting-architecture). The docs/ prefix style is reasonable for documentation branches but does not match the documented convention.

  3. Related section: PR body uses "Related Notes" instead of "Related" and references issue numbers (landscaping-assistant#95, landscaping-assistant#17) rather than a plan slug. Acceptable for a docs-only PR where there may not be a formal plan, but noted for SOP alignment.

  4. PromQL expression truncation: The LandscapingLatencyHigh expression is shown as histogram_quantile(0.95, ...) > 1 with an ellipsis. For a reference doc, the full expression (or at least a note saying "see Terraform source for full query") would be more useful to someone debugging an alert at 2 AM.
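
The branch naming convention from nit 2 can be checked mechanically. The sketch below assumes one reading of {issue-number}-{kebab-case-purpose}: digits, a hyphen, then lowercase alphanumeric words joined by single hyphens. The regex and the function name are illustrative and not part of the SOP.

```python
import re
from typing import Optional

# Assumed reading of the convention: digits, a hyphen, then kebab-case words.
BRANCH_RE = re.compile(r"^(?P<issue>\d+)-(?P<purpose>[a-z0-9]+(?:-[a-z0-9]+)*)$")


def issue_from_branch(branch: str) -> Optional[int]:
    """Return the issue number if the branch follows the convention, else None."""
    match = BRANCH_RE.match(branch)
    return int(match.group("issue")) if match else None


print(issue_from_branch("95-alerting-architecture"))    # 95
print(issue_from_branch("docs/alerting-architecture"))  # None
```

A check of this kind could run in CI or a pre-push hook, though whether the SOP should also admit prefixed branches such as docs/ is a policy question the regex does not settle.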

SOP COMPLIANCE

  • [ ] Branch named after issue -- uses docs/alerting-architecture instead of 95-alerting-architecture
  • [x] PR body follows template -- has Summary, Changes, Test Plan, Related (as "Related Notes")
  • [ ] Related references plan slug -- references issue numbers, no plan slug
  • [x] No secrets committed -- confirmed no chat IDs, API keys, or credentials in the diff
  • [x] No unnecessary file changes -- exactly 2 files, both directly relevant to issue #95

PROCESS OBSERVATIONS

  • Low change failure risk: documentation-only PR with no runtime impact.
  • Good practice: documenting operational infrastructure (alerting) alongside the platform changes that implemented it (pal-e-platform PRs #408, #410).
  • The bidirectional cross-linking between alerting.md and observability-roadmap.md keeps the docs navigable as the observability suite grows.

VERDICT: APPROVED

Clean documentation PR. The nits (branch naming, truncated PromQL, Slack mention gap) are non-blocking. Content is accurate, well-structured, and properly cross-linked.

ldraney deleted branch docs/alerting-architecture 2026-06-06 02:41:31 +00:00
ldraney referenced this pull request from a commit 2026-06-06 02:41:32 +00:00
Reference
ldraney/landscaping-assistant!112