Bug: CronJob stale failures causing persistent KubeJobFailed alerts #170
Type
Bug
Lineage
standalone — discovered during AlertManager triage 2026-03-26
Repo
forgejo_admin/pal-e-platform

## What Broke
Six failed Job objects from 5-8 days ago persist in the cluster because `failedJobsHistoryLimit` is not set on three CronJobs. The `KubeJobFailed` alert (warning) fires on each stale object even though all CronJobs are currently succeeding.

Affected CronJobs and stale jobs:

- `postgres/cnpg-backup-verify` — 3 stale failed jobs (5-8 days old), last 3 runs succeeded
- `tofu-state/tf-state-backup` — 2 stale failed jobs (5-7 days old), last 3 runs succeeded
- `palworld/daily-reboot` — 1 stale failed job (8 days old), last 3 runs succeeded

The alert rule `kube_job_failed > 0` fires on the existence of any failed Job object, not on recent failure.

## Repro Steps

1. `kubectl get jobs -A --field-selector=status.successful=0` — shows 6 stale failed jobs
2. AlertManager shows the corresponding `KubeJobFailed` alerts (warning)

## Expected Behavior

Failed Job objects are auto-cleaned by Kubernetes. Alerts only fire on recent/active failures, not stale history.
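For triage, the configured history limits and the age of the stale Jobs can be checked directly. A sketch, assuming `kubectl` access to the affected cluster:

```shell
# Show each CronJob's failedJobsHistoryLimit (blank output means unset):
kubectl get cronjobs -A -o custom-columns=\
NS:.metadata.namespace,NAME:.metadata.name,FAILED_LIMIT:.spec.failedJobsHistoryLimit

# List failed Jobs with their start times, to judge staleness:
kubectl get jobs -A --field-selector=status.successful=0 \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,START:.status.startTime
```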
## Environment

- `KubeJobFailed` x6 (warning), firing since 2026-03-24

## Acceptance Criteria

- `failedJobsHistoryLimit: 2` set on all 3 CronJobs
- `KubeJobFailed` alerts clear in AlertManager

## Related

- `project-pal-e-platform` — project
- `story:superuser-observe` — user story
- `arch:prometheus` — architecture component (alert pipeline)

## Scope Review: NEEDS_REFINEMENT
Review note: `review-387-2026-03-26`

Cluster verification found three issues that need correction before this ticket is agent-ready:

1. `failedJobsHistoryLimit` IS set on all 3 CronJobs (3/3/1), not missing. Kubernetes is correctly honoring the limits. The fix is lowering the value, not adding it.
2. `palworld/daily-reboot` is managed by the `palworld-server` repo (Helm chart), not `pal-e-platform`. Only `cnpg-backup-verify` and `tf-state-backup` live in this repo's `terraform/main.tf`.
3. The two in-repo CronJobs are defined in `terraform/main.tf` at lines ~2231 and ~2330.

Additionally: `daily-reboot` already has `failedJobsHistoryLimit = 1`, which is already lower than the proposed value of 2 — it should likely be descoped from this ticket.

## Scope Correction (post-review)
Per review `review-387-2026-03-26`, correcting factual errors and narrowing scope.

## Factual Correction

`failedJobsHistoryLimit` IS set, not missing:

- `cnpg-backup-verify`: `failedJobsHistoryLimit = 3`
- `tf-state-backup`: `failedJobsHistoryLimit = 3`
- `daily-reboot`: `failedJobsHistoryLimit = 1` (already lower, in a different repo)

## Scope Narrowed
Drop `daily-reboot` from this ticket. It's managed by the `palworld-server` repo (Helm chart), not pal-e-platform, and already has limit=1.

`KubeJobFailed` is a kube-prometheus-stack built-in alert rule (`kube_job_failed > 0`). We can't easily change the expression — it fires on any failed Job object cluster-wide. The fix is reducing how many failed objects accumulate.

## File Targets

All in `pal-e-platform/terraform/main.tf`:

- `cnpg-backup-verify`: `failed_jobs_history_limit` value (change 3 → 2)
- `tf-state-backup`: `failed_jobs_history_limit` value (change 3 → 2)

## Manual Step

Delete the 6 stale failed Job objects after applying:
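A minimal cleanup sketch (hypothetical helper, assuming `kubectl` access to the cluster — it only prints the delete commands so the list can be reviewed before piping to `sh`):

```shell
# Hypothetical helper: reads "namespace name" pairs on stdin and emits the
# corresponding kubectl delete commands for review.
emit_delete_cmds() {
  while read -r ns name; do
    printf 'kubectl delete job -n %s %s\n' "$ns" "$name"
  done
}

# Usage against the cluster (pipe to sh to actually delete):
#   kubectl get jobs -A --field-selector=status.successful=0 \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
#     | emit_delete_cmds | sh
```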
## Acceptance Criteria (updated)

- `failedJobsHistoryLimit` lowered to 2 on both CronJobs
- `KubeJobFailed` alerts clear in AlertManager
- `tofu plan` shows only the 2 history limit changes

## Scope Review: READY
Review note: `review-387-2026-03-26`

Scope is solid after the correction comment. Both file targets verified in `terraform/main.tf` (lines 2297 and 2396), repo placement confirmed, no dependencies or blast radius concerns. Agent-ready as a 2-line value change (3 → 2) plus manual stale job cleanup.
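A verification pass against the updated acceptance criteria might look like the following (a sketch, assuming `tofu` and `kubectl` are configured for this repo and cluster):

```shell
# Confirm the plan touches only the two history-limit values:
tofu plan

# After apply and manual cleanup, expect no stale failed Jobs:
kubectl get jobs -A --field-selector=status.successful=0

# KubeJobFailed should clear in AlertManager once no failed Job objects remain.
```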