Deploy Sloth for SLO tracking with error budgets #90
Type
Feature
Lineage
Child of
ldraney/landscaping-assistant #43 (Observability & DORA metrics stack).
Phase 6a of the observability roadmap (docs/observability-roadmap.md).
Aligns with pal-e-platform Phase 16 (SLO & Error Budgets).
Repo
ldraney/pal-e-platform (Helm release + Terraform)
User Story
As a platform operator
I want SLO tracking with error budgets
So that I can measure reliability against targets and get alerted on burn rate, not just thresholds
Context
The stack currently uses threshold-based alerts (#17). Sloth generates PrometheusRules from SLO definitions using the Google SRE multi-window multi-burn-rate pattern. Instead of "error rate > 5%", you track "99.5% availability SLO, burning at 2x, 3 days until budget exhausted." This aligns with pal-e-platform Phase 16, which specifies Sloth.
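The arithmetic behind a statement like "burning at 2x, 3 days until budget exhausted" can be sketched in a few lines of Python. This is only an illustration, not Sloth's implementation: the names Slo, burn_rate, days_until_exhausted and should_page are hypothetical, the 30-day window is an assumption (the issue does not state one), and the 14.4x paging threshold is the example value from the Google SRE Workbook for a 1-hour long window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO description; the 30-day window is an assumption."""
    objective: float           # e.g. 0.995 for a 99.5% availability target
    window_days: float = 30.0

    @property
    def error_budget(self) -> float:
        # Fraction of requests allowed to fail over the window.
        return 1.0 - self.objective


def burn_rate(slo: Slo, error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / slo.error_budget


def days_until_exhausted(slo: Slo, rate: float, budget_remaining: float) -> float:
    """Days left if the current burn rate continues.

    budget_remaining is the unspent fraction of the error budget (0.0 to 1.0).
    """
    if rate <= 0:
        return float("inf")
    return budget_remaining * slo.window_days / rate


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: alert only if both windows exceed the threshold.

    The short window keeps the alert from firing long after the burn stopped.
    """
    return long_window_rate > threshold and short_window_rate > threshold


slo = Slo(objective=0.995)
rate = burn_rate(slo, error_ratio=0.01)   # 1% errors against a 0.5% budget
days = days_until_exhausted(slo, rate=2.0, budget_remaining=0.2)
print(f"burn rate ~{rate:.1f}x, ~{days:.1f} days of budget left")
```

Under these assumptions, a 1% error ratio against a 99.5% objective is roughly a 2x burn, and with 20% of the budget left that leaves about 3 days, which matches the phrasing in the Context paragraph.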
File Targets
Files the agent should modify or create:
terraform/modules/monitoring/sloth.tf or add to terraform/modules/monitoring/main.tf (pal-e-platform): Helm release for Sloth (chart: slok/sloth)
terraform/modules/monitoring/slos/ or equivalent (pal-e-platform): per-service SLO specs
Acceptance Criteria
rails_requests_total success ratio)
Test Expectations
tofu plan shows Sloth Helm release
kubectl get pods -n monitoring shows sloth running
kubectl get prometheusrules -n monitoring shows Sloth-generated rules
Constraints
terraform/modules/monitoring/
Checklist
Related
project-landscaping-observability: observability project
ldraney/landscaping-assistant #43: parent observability issue
ldraney/landscaping-assistant #17: threshold alerts (baseline before SLOs)
Scope Review: BLOCK
Review note: review-1313-2026-06-04 (board-landscaping-observability#1313)
Summary of Blocking Issues
Sloth vs Pyrra conflict -- Platform plan Phase 16 specifies Sloth for SLO governance. This issue specifies Pyrra and explicitly says "not Sloth." The docs/observability-roadmap.md in this repo also says Pyrra. Two sources of truth disagree. Human decision required before any agent can execute.
Single-responsibility violation -- Pyrra/Sloth (Tier 1 Foundation, Phase 16) and Falco (Tier 2 Hardening, Phase 20b) are independent capabilities with different dependency chains, different risk profiles, and different plan tiers. This ticket should be split into two issues.
Wrong repo -- Issue body says Repo: forgejo_admin/pal-e-platform and all file targets are Terraform files in pal-e-platform. But the issue is filed here in ldraney/landscaping-assistant.
Unsatisfied dependencies -- Phase 16 depends on Phases 14+15. Phase 20b depends on Phase 19 (Kyverno, NOT STARTED). The plan states "Tier 1.5 gates Tier 2."
Ambiguous file targets -- SLO YAML location is "pal-e-platform or pal-e-deployments" (a question, not a spec). Alertmanager config changes needed for Falco alerting are not listed.
Required Actions
Full analysis in the review note linked above.
Title changed from "Deploy Pyrra for SLO tracking and Falco for runtime security" to "Deploy Sloth for SLO tracking with error budgets"