Bug: CronJob stale failures causing persistent KubeJobFailed alerts #170

Closed
opened 2026-03-26 15:22:43 +00:00 by forgejo_admin · 3 comments

Type

Bug

Lineage

standalone — discovered during AlertManager triage 2026-03-26

Repo

forgejo_admin/pal-e-platform

What Broke

Six failed Job objects from 5-8 days ago persist in the cluster because failedJobsHistoryLimit is not set on three CronJobs. The KubeJobFailed alert (warning) fires on each stale object even though all CronJobs are currently succeeding.

Affected CronJobs and stale jobs:

  • postgres/cnpg-backup-verify — 3 stale failed jobs (5-8 days old), last 3 runs succeeded
  • tofu-state/tf-state-backup — 2 stale failed jobs (5-7 days old), last 3 runs succeeded
  • palworld/daily-reboot — 1 stale failed job (8 days old), last 3 runs succeeded

The alert rule kube_job_failed > 0 fires on the existence of any failed Job object, not on recent failure.

Repro Steps

  1. kubectl get jobs -A --field-selector=status.successful=0 — shows 6 stale failed jobs
  2. Check AlertManager: 6 KubeJobFailed alerts (warning)
  3. Check recent CronJob runs: all succeeding for 2+ days
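The repro steps above can be run as a quick batch check; a sketch, assuming kubectl access to the affected cluster:

```shell
# Step 1: list failed Job objects across all namespaces.
# status.successful=0 matches Jobs with zero successful completions.
kubectl get jobs -A --field-selector=status.successful=0

# Step 3: check recent runs of an affected CronJob, newest last;
# completed runs show COMPLETIONS 1/1.
kubectl get jobs -n postgres --sort-by=.metadata.creationTimestamp

# LAST SCHEDULE column shows the most recent run of each CronJob.
kubectl get cronjob -A
```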

Expected Behavior

Failed Job objects should be auto-cleaned by Kubernetes once the CronJob's history limit is exceeded. Alerts should only fire on recent/active failures, not stale history.

Environment

  • Cluster/namespaces: postgres, tofu-state, palworld
  • Related alerts: KubeJobFailed x6 (warning), firing since 2026-03-24

Acceptance Criteria

  • 6 stale failed Job objects deleted
  • failedJobsHistoryLimit: 2 set on all 3 CronJobs
  • KubeJobFailed alerts clear in AlertManager
  • Future CronJob failures auto-clean after 2 retained
Related

  • project-pal-e-platform — project
  • story:superuser-observe — user story
  • arch:prometheus — architecture component (alert pipeline)
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-387-2026-03-26

Cluster verification found three issues that need correction before this ticket is agent-ready:

  • Factual inaccuracy: failedJobsHistoryLimit IS set on all 3 CronJobs (3/3/1), not missing. Kubernetes is correctly honoring the limits. The fix is lowering the value, not adding it.
  • Repo mismatch: palworld/daily-reboot is managed by palworld-server repo (Helm chart), not pal-e-platform. Only cnpg-backup-verify and tf-state-backup live in this repo's terraform/main.tf.
  • Missing File Targets: No file paths specified. Actual targets are terraform/main.tf lines ~2231 and ~2330.

Additionally: daily-reboot already has failedJobsHistoryLimit=1, which is lower than the proposed value of 2 — it should likely be descoped from this ticket.
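The current limits can be confirmed directly from the live objects; a sketch, using the namespaces and names from the ticket:

```shell
# Read the configured failedJobsHistoryLimit from each CronJob.
kubectl get cronjob -n postgres cnpg-backup-verify \
  -o jsonpath='{.spec.failedJobsHistoryLimit}{"\n"}'
kubectl get cronjob -n tofu-state tf-state-backup \
  -o jsonpath='{.spec.failedJobsHistoryLimit}{"\n"}'
kubectl get cronjob -n palworld daily-reboot \
  -o jsonpath='{.spec.failedJobsHistoryLimit}{"\n"}'
# Per this review, these should print 3, 3, and 1.
```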

Author
Owner

Scope Correction (post-review)

Per review review-387-2026-03-26, correcting factual errors and narrowing scope.

Factual Correction

failedJobsHistoryLimit IS set, not missing:

  • cnpg-backup-verify: failedJobsHistoryLimit = 3
  • tf-state-backup: failedJobsHistoryLimit = 3
  • daily-reboot: failedJobsHistoryLimit = 1 (already lower, in different repo)

Scope Narrowed

Drop daily-reboot from this ticket. It's managed by the palworld-server repo (Helm chart), not pal-e-platform, and already has limit=1.

KubeJobFailed is a kube-prometheus-stack built-in alert rule (kube_job_failed > 0). We can't easily change the expression — it fires on any failed Job object cluster-wide. The fix is reducing how many failed objects accumulate.
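For reference, the installed rule can be inspected in place rather than changed; a sketch, assuming kube-prometheus-stack deploys its rules as PrometheusRule objects (namespace left unspecified):

```shell
# Find the KubeJobFailed entry in the deployed PrometheusRule objects
# and show its expression and annotations.
kubectl get prometheusrules -A -o yaml | grep -B2 -A6 'alert: KubeJobFailed'
```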

File Targets

All in pal-e-platform/terraform/main.tf:

  • tf-state-backup CronJob: lines ~2288-2381, failed_jobs_history_limit value (change 3 → 2)
  • cnpg-backup-verify CronJob: lines ~2387-2514, failed_jobs_history_limit value (change 3 → 2)
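Locating the two values before editing is a one-liner; a sketch, assuming the file layout described above:

```shell
# Find every failed_jobs_history_limit in the Terraform config, with
# line numbers, to confirm the two targets before editing.
grep -n 'failed_jobs_history_limit' terraform/main.tf
```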

Manual Step

Delete 6 stale failed Job objects after applying:

```
kubectl delete job -n postgres cnpg-backup-verify-29563740 cnpg-backup-verify-29565180 cnpg-backup-verify-29566620
kubectl delete job -n tofu-state tf-state-backup-29565120 tf-state-backup-29568000
kubectl delete job -n palworld daily-reboot-29562840
```
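After applying and deleting, the result can be verified; a sketch (the amtool invocation assumes the AlertManager CLI is installed and pointed at the cluster's AlertManager):

```shell
# Confirm no failed Job objects remain anywhere in the cluster
# (should return nothing).
kubectl get jobs -A --field-selector=status.successful=0

# Confirm the KubeJobFailed alerts have cleared.
amtool alert query alertname=KubeJobFailed
```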

Acceptance Criteria (updated)

  • failedJobsHistoryLimit lowered to 2 on both CronJobs
  • 6 stale failed jobs deleted
  • KubeJobFailed alerts clear in AlertManager
  • tofu plan shows only the 2 history limit changes
Author
Owner

Scope Review: READY

Review note: review-387-2026-03-26
Scope is solid after the correction comment. Both file targets verified in terraform/main.tf (lines 2297 and 2396), repo placement confirmed, no dependencies or blast radius concerns. Agent-ready as a 2-line value change (3 → 2) plus manual stale job cleanup.

forgejo_admin 2026-03-26 22:25:37 +00:00