fix: lower CronJob failedJobsHistoryLimit 3 → 2 #177

Merged
forgejo_admin merged 1 commit from 170-lower-cronjob-failed-history-limit into main 2026-03-26 22:59:54 +00:00

Summary

Lowers failedJobsHistoryLimit from 3 to 2 on both CronJobs managed by this repo (tf-state-backup and cnpg-backup-verify). Stale failed Job objects (5-8 days old) trigger persistent KubeJobFailed alerts even though both CronJobs are currently succeeding. Reducing the limit cuts stale accumulation so alerts clear faster.

Changes

  • terraform/main.tf line 2297: tf-state-backup failed_jobs_history_limit 3 → 2
  • terraform/main.tf line 2396: cnpg-backup-verify failed_jobs_history_limit 3 → 2

tofu plan Output

# Worktree lacks .terraform providers/state; plan must run from main checkout.
# Expected plan output:
#   ~ kubernetes_cron_job_v1.tf_state_backup  spec.failed_jobs_history_limit: 3 → 2
#   ~ kubernetes_cron_job_v1.cnpg_backup_verify  spec.failed_jobs_history_limit: 3 → 2
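For context, a minimal sketch of what the changed attribute looks like in a `kubernetes_cron_job_v1` resource. The schedule, container, and image are placeholders, not the actual contents of terraform/main.tf; only the `failed_jobs_history_limit` edit reflects this PR.

```hcl
resource "kubernetes_cron_job_v1" "tf_state_backup" {
  metadata {
    name = "tf-state-backup"
  }

  spec {
    schedule                      = "0 3 * * *" # placeholder schedule
    concurrency_policy            = "Forbid"
    successful_jobs_history_limit = 3
    failed_jobs_history_limit     = 2 # was 3; the change in this PR

    job_template {
      metadata {}
      spec {
        backoff_limit = 2
        template {
          metadata {}
          spec {
            container {
              name  = "backup"
              image = "example/backup:latest" # placeholder image
            }
          }
        }
      }
    }
  }
}
```

Since `failed_jobs_history_limit` is a mutable field on the underlying CronJob, this applies as an in-place update with no resource recreation.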

Test Plan

  • tofu plan shows exactly 2 in-place updates (the history limit changes)
  • tofu apply succeeds
  • Delete 6 stale failed jobs manually (commands in issue comment #7563)
  • Verify KubeJobFailed alerts clear in AlertManager
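The cleanup and verification steps above can be sketched as below. These are not the exact commands from issue comment #7563; the `backups` namespace, the job names, and the local AlertManager URL are all assumptions for illustration.

```shell
# List Jobs with at least one failed pod (namespace is an assumption):
kubectl get jobs -n backups \
  -o jsonpath='{range .items[?(@.status.failed>0)]}{.metadata.name}{"\n"}{end}'

# Delete each stale failed Job by name, repeating for all 6:
kubectl delete job -n backups <job-name>

# Confirm the KubeJobFailed alert has cleared in AlertManager:
amtool alert query alertname=KubeJobFailed \
  --alertmanager.url=http://localhost:9093
```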

Review Checklist

  • tofu fmt passes
  • Change is a 2-line value edit, no structural changes
  • palworld/daily-reboot correctly descoped (different repo, already has limit=1)
Related

  • Forgejo issue: #170
  • Closes #170
fix: lower failedJobsHistoryLimit from 3 to 2 on both CronJobs
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
85f058385d
Stale failed Job objects (5-8 days old) persist because
failedJobsHistoryLimit=3 retains too many failures, triggering
persistent KubeJobFailed alerts even when CronJobs are succeeding.

Lowering to 2 reduces stale Job accumulation so alerts clear faster.

Closes #170

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Review: LGTM

Diff verified: 2-line value change, failed_jobs_history_limit 3 → 2 on both tf-state-backup and cnpg-backup-verify CronJobs. No structural changes, no unintended modifications.

Scope check: palworld/daily-reboot correctly excluded (different repo, already limit=1).

Pre-apply note: 6 stale failed Job objects need manual deletion after tofu apply (commands in issue comment #7563).


PR #177 Review

DOMAIN REVIEW

Tech stack: OpenTofu managing kubernetes_cron_job_v1 resources.

Change scope: Two identical value edits -- failed_jobs_history_limit changed from 3 to 2 on both CronJobs in terraform/main.tf:

  • tf_state_backup (line ~2330 on main)
  • cnpg_backup_verify (line ~2429 on main)

Terraform/k8s analysis:

  • The change is a safe in-place update. Kubernetes CronJob failedJobsHistoryLimit is a mutable field -- no resource recreation expected.
  • successful_jobs_history_limit remains at 3, which is appropriate (successful history is useful for debugging, not alerting noise).
  • Both CronJobs use concurrency_policy = "Forbid" and backoff_limit = 2, which are unchanged and appropriate.
  • The PR correctly identifies that palworld/daily-reboot is out of scope (different repo, already at limit=1).
  • tofu fmt compliance: the alignment spacing on the changed lines matches the surrounding context. No formatting issues.

Observation on root cause: Reducing history from 3 to 2 mitigates alert noise from stale failed Job objects, but it does not eliminate it entirely -- 2 stale failures can still trigger KubeJobFailed. The PR Test Plan appropriately includes manual deletion of the 6 existing stale jobs as a complementary step. This is a reasonable operational fix.

BLOCKERS

None.

This is a 2-line value edit with no new functionality, no user input, no secrets, no auth changes, and no structural modifications. No BLOCKER criteria are triggered.

NITS

  1. tofu plan output is a placeholder: The PR body notes "Worktree lacks .terraform providers/state; plan must run from main checkout" and provides expected output instead of actual output. Per PR conventions, tofu plan output should be included. For a 2-line value change this is low risk, but the operator should confirm with an actual tofu plan -lock=false before apply.

  2. Consider lowering to 1: If the goal is to minimize stale alert noise, failedJobsHistoryLimit = 1 would be more aggressive. The Kubernetes default is 1. Keeping 2 preserves one extra failure for debugging, which is a reasonable tradeoff -- just noting the option.

SOP COMPLIANCE

  • Branch named after issue: 170-lower-cronjob-failed-history-limit references issue #170
  • PR body follows template: Summary, Changes, tofu plan Output, Test Plan, Review Checklist, Related sections all present
  • Related section references parent issue: "Forgejo issue: #170" and "Closes #170"
  • No secrets committed: change is numeric values only
  • No scope creep: exactly 2 lines changed, both directly addressing the issue
  • Commit message is descriptive: "fix: lower CronJob failedJobsHistoryLimit 3 -> 2"
  • tofu plan output is placeholder, not actual (nit, noted above)

PROCESS OBSERVATIONS

  • MTTR impact: Positive. Reducing stale failed job accumulation means KubeJobFailed alerts will clear faster after CronJobs recover, directly improving mean time to restore signal clarity in AlertManager.
  • Change failure risk: Very low. This is a mutable field change on two existing resources with no structural implications.
  • Deployment: Standard tofu apply -- the Test Plan covers verification steps including manual cleanup of existing stale jobs.

VERDICT: APPROVED

forgejo_admin deleted branch 170-lower-cronjob-failed-history-limit 2026-03-26 22:59:54 +00:00