Deploy Sloth for SLO tracking with error budgets #90

Open
opened 2026-06-04 05:30:32 +00:00 by ldraney · 1 comment
Owner

Type

Feature

Lineage

Child of ldraney/landscaping-assistant #43 (Observability & DORA metrics stack).
Phase 6a of the observability roadmap (docs/observability-roadmap.md).
Aligns with pal-e-platform Phase 16 (SLO & Error Budgets).

Repo

ldraney/pal-e-platform (Helm release + Terraform)

User Story

As a platform operator
I want SLO tracking with error budgets
So that I can measure reliability against targets and get alerted on burn rate, not just thresholds

Context

We currently rely on threshold-based alerts (#17). Sloth generates PrometheusRules from SLO definitions using the Google SRE multi-window, multi-burn-rate pattern. Instead of alerting on "error rate > 5%", you track a statement such as "99.5% availability SLO, burning at 2x, 3 days until budget exhausted." This aligns with pal-e-platform Phase 16, which specifies Sloth.
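
To make the burn-rate wording concrete, here is a minimal sketch of the arithmetic behind a statement like "99.5% availability SLO, burning at 2x, 3 days until budget exhausted." The 30-day window, the 1% observed error ratio, and the 20% remaining budget are assumptions chosen for illustration; none of them come from this issue or from Sloth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    objective: float    # target success ratio, e.g. 0.995 for 99.5%
    window_days: float  # SLO period length (assumed 30 days here)

    @property
    def error_budget(self) -> float:
        # Fraction of requests allowed to fail over the whole window.
        return 1.0 - self.objective


def burn_rate(slo: Slo, observed_error_ratio: float) -> float:
    # 1.0 means the budget is consumed exactly at the end of the window.
    return observed_error_ratio / slo.error_budget


def days_until_exhausted(slo: Slo, rate: float, budget_remaining: float = 1.0) -> float:
    # budget_remaining is the fraction of the error budget still unspent.
    if rate <= 0:
        return float("inf")
    return budget_remaining * slo.window_days / rate


if __name__ == "__main__":
    slo = Slo(objective=0.995, window_days=30)
    rate = burn_rate(slo, observed_error_ratio=0.01)  # hypothetical 1% error ratio
    print(f"burn rate: {rate:.1f}x")
    print(f"days left, full budget: {days_until_exhausted(slo, rate):.1f}")
    print(f"days left, 20% budget remaining: {days_until_exhausted(slo, rate, 0.2):.1f}")
```

Under these assumptions, a 2x burn exhausts a full budget in about 15 days, and about 3 days remain once only 20% of the budget is left.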

File Targets

Files the agent should modify or create:

  • terraform/modules/monitoring/sloth.tf or add to terraform/modules/monitoring/main.tf (pal-e-platform) — Helm release for Sloth (chart: slok/sloth)
  • SLO YAML definitions in terraform/modules/monitoring/slos/ or equivalent (pal-e-platform) — per-service SLO specs
  • Grafana dashboard for SLO overview (pal-e-platform) — error budget remaining, burn rate

Acceptance Criteria

  • Sloth deployed and generating PrometheusRules from SLO definitions
  • At least one SLO defined: landscaping-assistant 99.5% availability (based on rails_requests_total success ratio)
  • Grafana dashboard shows error budget remaining and burn rate
  • Multi-window burn rate alerts fire through Alertmanager -> Telegram
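
The multi-window burn-rate criterion above can be sketched as follows. An alert fires only when both the long and the short window exceed the same burn-rate threshold, which lets it clear quickly once the errors stop. The window pairs and factors are the commonly cited Google SRE Workbook values for a 30-day period plus a 1d/2h pair at 3x; treat them as assumptions and compare them with the rules Sloth actually generates. The observed error ratios are made-up sample data.

```python
from typing import Dict, List, NamedTuple


class BurnWindow(NamedTuple):
    severity: str      # "page" or "ticket"
    long_window: str
    short_window: str
    factor: float      # burn-rate multiple of the error budget


# Assumed window pairs for a 30-day SLO period.
WINDOWS = [
    BurnWindow("page", "1h", "5m", 14.4),
    BurnWindow("page", "6h", "30m", 6.0),
    BurnWindow("ticket", "1d", "2h", 3.0),
    BurnWindow("ticket", "3d", "6h", 1.0),
]


def firing_alerts(error_ratio: Dict[str, float], objective: float) -> List[BurnWindow]:
    """Return the window pairs whose long AND short windows exceed the threshold."""
    budget = 1.0 - objective
    fired = []
    for w in WINDOWS:
        threshold = w.factor * budget
        if error_ratio[w.long_window] > threshold and error_ratio[w.short_window] > threshold:
            fired.append(w)
    return fired


if __name__ == "__main__":
    # Hypothetical error ratios per lookback window during a short, sharp outage.
    observed = {"5m": 0.10, "30m": 0.09, "1h": 0.08, "2h": 0.05,
                "6h": 0.02, "1d": 0.01, "3d": 0.004}
    for w in firing_alerts(observed, objective=0.995):
        print(f"{w.severity}: {w.long_window}/{w.short_window} burn > {w.factor}x")
```

With a 99.5% objective the fastest pair fires at an error ratio above roughly 7.2% (14.4 x 0.5%), so the sample data triggers only the 1h/5m page alert.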

Test Expectations

  • tofu plan shows Sloth Helm release
  • kubectl get pods -n monitoring shows sloth running
  • kubectl get prometheusrules -n monitoring shows Sloth-generated rules
  • Grafana SLO dashboard shows budget data

Constraints

  • Sloth, not Pyrra (aligns with pal-e-platform plan Phase 16)
  • Follow existing Terraform pattern in terraform/modules/monitoring/
  • SLO targets must be decided before deployment (start with availability, add latency later)
  • Depends on #17 (PrometheusRule alerts) being deployed first for baseline comparison

Checklist

  • PR opened (pal-e-platform)
  • Tests pass
  • No unrelated changes

Related

  • project-landscaping-observability — observability project
  • ldraney/landscaping-assistant #43 — parent observability issue
  • ldraney/landscaping-assistant #17 — threshold alerts (baseline before SLOs)
Author
Owner

Scope Review: BLOCK

Review note: review-1313-2026-06-04 (https://pal-e-docs.tail5b443a.ts.net/notes/review-1313-2026-06-04), board-landscaping-observability#1313

Summary of Blocking Issues

  1. Sloth vs Pyrra conflict -- Platform plan Phase 16 specifies Sloth for SLO governance. This issue specifies Pyrra and explicitly says "not Sloth." The docs/observability-roadmap.md in this repo also says Pyrra. Two sources of truth disagree. Human decision required before any agent can execute.

  2. Single-responsibility violation -- Pyrra/Sloth (Tier 1 Foundation, Phase 16) and Falco (Tier 2 Hardening, Phase 20b) are independent capabilities with different dependency chains, different risk profiles, and different plan tiers. This ticket should be split into two issues.

  3. Wrong repo -- Issue body says Repo: forgejo_admin/pal-e-platform and all file targets are Terraform files in pal-e-platform. But the issue is filed here in ldraney/landscaping-assistant.

  4. Unsatisfied dependencies -- Phase 16 depends on Phases 14+15. Phase 20b depends on Phase 19 (Kyverno, NOT STARTED). The plan states "Tier 1.5 gates Tier 2."

  5. Ambiguous file targets -- SLO YAML location is "pal-e-platform or pal-e-deployments" (a question, not a spec). Alertmanager config changes needed for Falco alerting are not listed.

Required Actions

  1. Decide Sloth vs Pyrra. Update the losing documentation.
  2. Split into two issues aligned with plan phases (16 and 20b). File in pal-e-platform.
  3. Address dependency chains or justify skipping them.
  4. Decide SLO YAML location.
  5. Add Alertmanager config to Falco file targets.

Full analysis in the review note linked above.

ldraney changed title from Deploy Pyrra for SLO tracking and Falco for runtime security to Deploy Sloth for SLO tracking with error budgets 2026-06-04 12:03:48 +00:00
Reference
ldraney/landscaping-assistant#90