Alert on sheet_sync CronJob failures via Telegram #440

Open
opened 2026-04-10 23:32:00 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Feature

Lineage

Standalone — spawned from the westside-sheet-sync project scaffold on 2026-04-10. Depends on the CronJob deployment ticket being merged first.

Repo

forgejo_admin/pal-e-platform

User Story

As ops
I want a Telegram alert when the sheet_sync CronJob fails
So that I find out within minutes instead of discovering it days later when Marcus asks why his sheet is stale

Ties to story:sheet-sync.

Context

The pal-e-platform already has a Prometheus + Alertmanager + Telegram pipeline for alerting on infrastructure issues. Kubernetes CronJob failures surface through the kube_job_status_failed metric exported by kube-state-metrics. An alert rule that fires once a sheet-sync job has failed N times in a row (N = 2, see Constraints), routed to Telegram through the existing pipeline, gives ops immediate visibility.

The pattern is already established for other alerts — see existing PrometheusRule resources in the pal-e-platform Terraform. This ticket adds one more rule.

File Targets

Files to modify:

  • terraform/monitoring/alerts/sheet_sync.yaml (or equivalent — check the existing alerts directory structure and follow the convention). New PrometheusRule group with one rule: SheetSyncJobFailing.

Files NOT to touch:

  • Existing alert rules — do not modify or consolidate.
  • Alertmanager routing config — the default Telegram route should catch this alert via namespace label.

Acceptance Criteria

  • When the sheet_sync CronJob fails 2 times in a row, then a Prometheus alert SheetSyncJobFailing fires with severity warning.
  • When the alert fires, then a Telegram message arrives in the ops chat within 2 minutes.
  • When the CronJob starts succeeding again, then the alert resolves automatically.
  • When I run kubectl -n monitoring get prometheusrule I see the new rule.
  • When I visit the Prometheus UI (Alerts page), I can see the rule definition and its current state; once the alert is firing, it also shows up in the Alertmanager UI.

Test Expectations

  • Synthetic failure test: temporarily point the CronJob at an invalid PGURL, wait for 2 job failures (~2 hours — or manually trigger 2 failed runs back-to-back), verify the alert fires in Alertmanager and a Telegram message arrives.
  • Recovery test: restore valid PGURL, wait for one successful run, verify the alert resolves.
  • Lint test: promtool check rules terraform/monitoring/alerts/sheet_sync.yaml exits 0.
  • Run command: promtool check rules terraform/monitoring/alerts/sheet_sync.yaml
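
Beyond the lint check, the firing behaviour could also be exercised offline with promtool test rules. The file below is only a sketch under stated assumptions: it presumes a plain Prometheus rules file named sheet_sync_rules.yaml (hypothetical name) that defines SheetSyncJobFailing with the expression kube_job_status_failed{job_name=~"sheet-sync.*"} > 0, for: 15m, severity warning, and no annotations. The namespace and Job names are made up for illustration.

```yaml
# Hypothetical unit test for `promtool test rules`; file names, namespace,
# and Job names are illustrative assumptions, not values from the repo.
rule_files:
  - sheet_sync_rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A Job that stays failed for the whole 30-minute window.
      - series: 'kube_job_status_failed{namespace="westside", job_name="sheet-sync-001"}'
        values: '1x30'
      # A Job that never failed; kube-state-metrics reports it with value 0.
      - series: 'kube_job_status_failed{namespace="westside", job_name="sheet-sync-002"}'
        values: '0x30'
    alert_rule_test:
      # Before the 15m hold period has elapsed the alert is only pending.
      - eval_time: 10m
        alertname: SheetSyncJobFailing
        exp_alerts: []
      # After the hold period, only the failed Job produces a firing alert.
      - eval_time: 20m
        alertname: SheetSyncJobFailing
        exp_alerts:
          - exp_labels:
              severity: warning
              namespace: westside
              job_name: sheet-sync-001
```

If the real rule carries annotations, each expected alert would need a matching exp_annotations block, since promtool compares labels and annotations exactly.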

Constraints

  • Alert severity: warning, not critical. A missed hourly sync is not a prod outage.
  • Use the existing Telegram routing; do NOT add a new receiver or routing rule.
  • Alert name exactly SheetSyncJobFailing for consistency with other CronJob alerts.
  • Threshold: 2 consecutive failures (not 1) to avoid flapping on transient Google API issues.
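  • The alert expression should use kube_job_status_failed{job_name=~"sheet-sync.*"} > 0 with a for: 15m clause. The > 0 comparison matters because kube-state-metrics also exports this metric with value 0 for Jobs that have not failed, so a bare selector would match healthy Jobs too.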
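
For orientation, a minimal sketch of what the new rule might look like as a PrometheusRule manifest. The metadata name and group name are hypothetical, and any labels the Prometheus Operator's rule selector expects are left out because they are not stated in this ticket; copy them from an existing rule.

```yaml
# Sketch only: metadata.name and the group name are hypothetical.
# Follow the conventions of the existing PrometheusRule resources.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sheet-sync-alerts      # hypothetical name
  namespace: monitoring        # matches `kubectl -n monitoring get prometheusrule`
spec:
  groups:
    - name: sheet-sync         # hypothetical group name
      rules:
        - alert: SheetSyncJobFailing
          # "> 0" drops Jobs that have not failed (exported with value 0).
          expr: kube_job_status_failed{job_name=~"sheet-sync.*"} > 0
          for: 15m
          labels:
            severity: warning
```

Two caveats on this sketch. The expression fires on any sheet-sync Job that has stayed failed for 15 minutes; it does not by itself count two consecutive failures, so the threshold and the auto-resolve criterion still need to be checked against how long failed Jobs are retained. Also, as far as I know promtool check rules expects a plain rules file with groups at the top level, so a full PrometheusRule manifest like this one would likely need its spec section extracted before linting.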

Checklist

  • PR opened
  • Tests pass (promtool lint)
  • No unrelated changes

Related

  • westside-sheet-sync — project
  • story-westside-jersey-sheet-sync — user story
  • Blocks on: CronJob deployment ticket