fix: lower CronJob failedJobsHistoryLimit 3 → 2 #177
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo_admin/pal-e-platform!177
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "170-lower-cronjob-failed-history-limit"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Lowers
failedJobsHistoryLimitfrom 3 to 2 on both CronJobs managed by this repo (tf-state-backupandcnpg-backup-verify). Stale failed Job objects (5-8 days old) trigger persistentKubeJobFailedalerts even though both CronJobs are currently succeeding. Reducing the limit cuts stale accumulation so alerts clear faster.Changes
terraform/main.tfline 2297:tf-state-backupfailed_jobs_history_limit3 → 2terraform/main.tfline 2396:cnpg-backup-verifyfailed_jobs_history_limit3 → 2tofu plan Output
Test Plan
tofu planshows exactly 2 in-place updates (the history limit changes)tofu applysucceedsKubeJobFailedalerts clear in AlertManagerReview Checklist
tofu fmtpassespalworld/daily-rebootcorrectly descoped (different repo, already has limit=1)Related
Review: LGTM
Diff verified: 2-line value change,
failed_jobs_history_limit3 → 2 on bothtf-state-backupandcnpg-backup-verifyCronJobs. No structural changes, no unintended modifications.Scope check:
palworld/daily-rebootcorrectly excluded (different repo, already limit=1).Pre-apply note: 6 stale failed Job objects need manual deletion after
tofu apply(commands in issue comment #7563).PR #177 Review
DOMAIN REVIEW
Tech stack: OpenTofu managing
kubernetes_cron_job_v1resources.Change scope: Two identical value edits --
failed_jobs_history_limitchanged from3to2on both CronJobs interraform/main.tf:tf_state_backup(line ~2330 on main)cnpg_backup_verify(line ~2429 on main)Terraform/k8s analysis:
failedJobsHistoryLimitis a mutable field -- no resource recreation expected.successful_jobs_history_limitremains at 3, which is appropriate (successful history is useful for debugging, not alerting noise).concurrency_policy = "Forbid"andbackoff_limit = 2, which are unchanged and appropriate.palworld/daily-rebootis out of scope (different repo, already at limit=1).tofu fmtcompliance: the alignment spacing on the changed lines matches the surrounding context. No formatting issues.Observation on root cause: Reducing history from 3 to 2 mitigates alert noise from stale failed Job objects, but it does not eliminate it entirely -- 2 stale failures can still trigger
KubeJobFailed. The PR Test Plan appropriately includes manual deletion of the 6 existing stale jobs as a complementary step. This is a reasonable operational fix.BLOCKERS
None.
This is a 2-line value edit with no new functionality, no user input, no secrets, no auth changes, and no structural modifications. No BLOCKER criteria are triggered.
NITS
tofu plan output is a placeholder: The PR body notes "Worktree lacks .terraform providers/state; plan must run from main checkout" and provides expected output instead of actual output. Per PR conventions,
tofu planoutput should be included. For a 2-line value change this is low risk, but the operator should confirm with an actualtofu plan -lock=falsebefore apply.Consider lowering to 1: If the goal is to minimize stale alert noise,
failedJobsHistoryLimit = 1would be more aggressive. The Kubernetes default is 1. Keeping 2 preserves one extra failure for debugging, which is a reasonable tradeoff -- just noting the option.SOP COMPLIANCE
170-lower-cronjob-failed-history-limitreferences issue #170PROCESS OBSERVATIONS
KubeJobFailedalerts will clear faster after CronJobs recover, directly improving mean time to restore signal clarity in AlertManager.tofu apply-- the Test Plan covers verification steps including manual cleanup of existing stale jobs.VERDICT: APPROVED