Alert on sheet_sync CronJob failures via Telegram #440
Reference
ldraney/basketball-api#440
Type
Feature
Lineage
Standalone — spawned from the westside-sheet-sync project scaffold on 2026-04-10. Depends on the CronJob deployment ticket being merged first.
Repo
forgejo_admin/pal-e-platform
User Story
As ops
I want a Telegram alert when the sheet_sync CronJob fails
So that I find out within minutes instead of discovering it days later when Marcus asks why his sheet is stale
Ties to
story:sheet-sync.
Context
The pal-e-platform already has a Prometheus + Alertmanager + Telegram pipeline for alerting on infrastructure issues. Kubernetes CronJob failures surface via the kube_job_status_failed metric (via kube-state-metrics). An alert rule that triggers when any sheet-sync job fails more than N times in a row, routed to Telegram, gives ops immediate visibility.
The pattern is already established for other alerts; see the existing PrometheusRule resources in the pal-e-platform Terraform. This ticket adds one more rule.
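As a rough illustration of the rule described above, a minimal rule-file sketch could look like the following. It is written in the plain rule-group format that promtool check rules accepts. The alert name, expression selector, for: 15m clause, and warning severity come from this ticket. The group name, the > 0 threshold, and the annotation text are assumptions, and how the file is wrapped into a PrometheusRule resource should follow whatever the existing alerts in the repo do.

```yaml
# Sketch only: group name, "> 0" threshold, and annotations are assumptions,
# not the repo's established convention.
groups:
  - name: sheet-sync            # hypothetical group name
    rules:
      - alert: SheetSyncJobFailing
        # kube_job_status_failed is exposed by kube-state-metrics and is
        # non-zero while a Job has failed pods.
        expr: kube_job_status_failed{job_name=~"sheet-sync.*"} > 0
        # Condition must hold for 15 minutes before the alert fires.
        for: 15m
        labels:
          severity: warning
        annotations:
          # Hypothetical wording for the notification text.
          summary: "sheet-sync job {{ $labels.job_name }} is failing"
```

Note that this expression fires once any matching failed Job has been visible for 15 minutes; it does not itself count N consecutive failures, so a stricter "N in a row" condition would need a different expression.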
File Targets
Files to modify:
terraform/monitoring/alerts/sheet_sync.yaml (or equivalent; check the existing alerts directory structure and follow the convention). New PrometheusRule group with one rule: SheetSyncJobFailing.
Files NOT to touch:
Acceptance Criteria
SheetSyncJobFailing fires with severity warning.
When I run kubectl -n monitoring get prometheusrule, I see the new rule.
Test Expectations
PGURL, wait for 2 job failures (~2 hours, or manually trigger 2 failed runs back-to-back), verify the alert fires in Alertmanager and a Telegram message arrives.
PGURL, wait for one successful run, verify the alert resolves.
promtool check rules terraform/monitoring/alerts/sheet_sync.yaml exits 0.
promtool check rules terraform/monitoring/alerts/sheet_sync.yaml
Constraints
warning, not critical. A missed hourly sync is not a prod outage.
SheetSyncJobFailing for consistency with other CronJob alerts.
kube_job_status_failed{job_name=~"sheet-sync.*"} with a for: 15m clause.
Checklist
Related
westside-sheet-sync (project)
story-westside-jersey-sheet-sync (user story)