Bug: tofu-state backup CronJob intermittent failures (2 alerts) #123

Closed
opened 2026-03-21 14:03:55 +00:00 by forgejo_admin · 2 comments

Type

Bug

Lineage

plan-pal-e-platform → Platform Hardening — standalone, discovered during monitoring

Repo

forgejo_admin/pal-e-platform

What Broke

The tf-state-backup CronJob in the tofu-state namespace intermittently fails with BackoffLimitExceeded after 1 attempt. Failed pods are cleaned up before logs can be captured. Two KubeJobFailed alerts are active.

Pattern:

  • 29562240 — Complete
  • 29563680 — Complete
  • 29565120 — Failed (2d6h duration before fail)
  • 29566560 — Complete
  • 29568000 — Failed (6h duration before fail)

The job downloads kubectl and mc (the MinIO client), then exports Terraform state secrets to MinIO. Possible causes: a network issue reaching MinIO, an image pull failure, resource limits, or a NetworkPolicy blocking tofu-state → minio traffic.

Likely Root Causes

  1. 128Mi memory limit may be insufficient — The container runs apk add --no-cache curl then downloads two large binaries (mc ~25MB, kubectl ~49MB) into /tmp. The apk add phase itself consumes memory for package index parsing. Combined with the running shell, this can push RSS past the 128Mi limit, triggering an OOMKill that gets recorded as BackoffLimitExceeded after 2 restarts.

  2. External CDN downloads on every run are a reliability risk — The CronJob downloads mc from dl.min.io and kubectl from dl.k8s.io on every execution. If either CDN is slow, rate-limited, or temporarily unavailable, the job fails. Binaries should be baked into a custom image or cached in a PVC/init-container to eliminate this failure mode.
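
If hypothesis 1 holds, the Terraform-side fix is confined to the container's resources block inside kubernetes_cron_job_v1.tf_state_backup. A minimal sketch, assuming the kubernetes provider's standard resources syntax; the 256Mi figure is a suggestion, not a measured value:

```hcl
# Sketch: raise the memory limit in the CronJob's container block
# (terraform/main.tf, kubernetes_cron_job_v1.tf_state_backup).
resources {
  requests = {
    cpu    = "50m"
    memory = "64Mi"
  }
  limits = {
    memory = "256Mi" # was 128Mi; headroom for apk add + ~75MB of binary downloads
  }
}
```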

File Targets

| Resource | File | Lines |
|---|---|---|
| kubernetes_cron_job_v1.tf_state_backup (CronJob) | terraform/main.tf | 2222–2308 |
| kubernetes_secret_v1.tf_backup_s3_creds (S3 creds) | terraform/main.tf | 2164–2174 |
| kubernetes_service_account_v1.tf_backup (ServiceAccount) | terraform/main.tf | 2178–2182 |
| kubernetes_role_v1.tf_backup (RBAC Role) | terraform/main.tf | 2187–2200 |
| kubernetes_role_binding_v1.tf_backup (RoleBinding) | terraform/main.tf | 2201–2216 |
| minio_s3_bucket.tf_state_backups (MinIO bucket) | terraform/main.tf | 2123–2128 |
| minio_iam_user.tf_backup / minio_iam_policy.tf_backup (IAM) | terraform/main.tf | 2132–2160 |
| MinIO NetworkPolicy (tofu-state ingress rule) | terraform/network-policies.tf | 112 |

Test Expectations

  • Verify CronJob runs successfully: kubectl get jobs -n tofu-state --sort-by=.metadata.creationTimestamp | tail -5
  • Verify backup file appears in MinIO: mc ls backup/tf-state-backups/ | tail -5
  • Check pod logs for OOM or download errors: kubectl logs -n tofu-state job/tf-state-backup-<id> (must capture before cleanup)
  • Confirm no OOMKilled events: kubectl describe pod -n tofu-state -l job-name=tf-state-backup-<id> | grep -i oom

Debugging Strategy

Failed pod logs are unavailable (already cleaned up). To capture the failure mode:

  1. Create a one-off Job with the same spec but with a 512Mi memory limit and set -x prepended to the script. This isolates whether the failure is memory (OOM) or network (download timeout).
  2. If the high-memory job succeeds, the root cause is the 128Mi limit. Fix by raising the limit or baking binaries into the image.
  3. If the high-memory job still fails, the root cause is external downloads. Fix by building a custom image with mc and kubectl pre-installed, or caching them in a PVC.
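
Step 1 can be sketched as a standalone manifest (kubectl apply -f). The image, serviceAccountName, and command below are assumptions; copy the real container spec from kubernetes_cron_job_v1.tf_state_backup at terraform/main.tf:2222–2308:

```yaml
# Sketch: one-off debug Job mirroring the CronJob, with a raised memory
# limit and shell tracing. backoffLimit 0 + restartPolicy Never means the
# failed pod survives for inspection instead of being restarted.
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-state-backup-debug
  namespace: tofu-state
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: tf-backup   # assumption: match the real SA name
      restartPolicy: Never
      containers:
        - name: backup
          image: alpine:3.19          # assumption: match the CronJob image
          command: ["/bin/sh", "-c", "set -x; <existing backup script>"]
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {memory: 512Mi}   # raised from 128Mi
```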

Constraints

  • Failed pod logs are unavailable — pods from the two failed jobs have already been cleaned up. All debugging must use fresh runs with increased verbosity (set -x) or resource limits.
  • The CronJob runs daily at 02:00 UTC. Verifying "3 consecutive successes" requires 3 days of observation after the fix.
  • backoff_limit: 2 with restartPolicy: OnFailure means the container crashes/fails twice within the same pod (not two separate pods), so there is only one pod to inspect per failure.

Repro Steps

  1. Wait for next scheduled tf-state-backup CronJob run
  2. Observe if pod starts and completes, or fails
  3. If fails: capture logs before pod is cleaned up

Expected Behavior

Every scheduled backup job completes successfully, and Terraform state secrets are backed up to MinIO.

Environment

  • Cluster/namespace: prod, tofu-state
  • CronJob: tf-state-backup
  • Resources: 50m CPU / 64Mi request, 128Mi memory limit
  • Related alerts: KubeJobFailed: tf-state-backup-29565120, KubeJobFailed: tf-state-backup-29568000

Acceptance Criteria

  • Next 3 consecutive backup jobs complete successfully
  • Both KubeJobFailed alerts clear
  • Root cause identified and fixed

Checklist

  • PR opened
  • Tests pass (3 consecutive CronJob completions)
  • No unrelated changes
Related

  • pal-e-platform — project board
  • Issue #109 — umbrella alert cleanup

Scope Review: NEEDS_REFINEMENT

Review note: review-224-2026-03-22
CronJob and infrastructure verified, but ticket needs refinement before agent execution.

  • Missing File Targets section -- CronJob is at terraform/main.tf:2222-2308, netpol at network-policies.tf:112
  • Missing Test Expectations -- add kubectl get jobs -n tofu-state --sort-by=.metadata.creationTimestamp | tail -5
  • Missing debugging strategy -- failed pod logs are unavailable (confirmed). Likely root causes: (a) 128Mi memory limit exceeded during apk add + binary downloads, (b) external CDN download failures for mc/kubectl. Ticket should guide agent to these hypotheses.
  • Missing Checklist and Constraints sections

Root Cause Analysis (2026-03-24)

Likely root cause: The CronJob (main.tf lines 2222-2315) downloads 3 external binaries at runtime on every execution:

  1. curl via apk add (requires Alpine CDN)
  2. mc from dl.min.io (MinIO client)
  3. kubectl from dl.k8s.io

Any DNS flakiness, egress routing issue, or CDN slowness causes the job to fail before it even starts the actual backup. This is the same class of reliability issue hitting the CI pipeline.

Fix options:

  1. Pre-built image — create a Docker image with curl, mc, and kubectl pre-installed. Push to Harbor. Eliminates runtime downloads entirely.
  2. Retry logic — add wget --tries=3 or curl --retry 3 for each download. Band-aid.
  3. Use k8s API directly — replace kubectl with a curl call to the kube API using the SA token (the SA is already bound). Removes one download.
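
Option 2, if chosen as a stopgap, works better as a small wrapper than as per-command flags, since it also covers the apk step. A sketch; retry is a hypothetical helper, not part of the existing CronJob script:

```shell
# Retry a command up to 3 times with a short pause between attempts.
retry() {
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge 3 ] && return 1   # give up after 3 tries
    sleep 1                            # brief backoff between tries
  done
}

# Usage inside the backup script (URLs as in the current CronJob), e.g.:
#   retry curl -fsSL -o /tmp/mc https://dl.min.io/client/mc/release/linux-amd64/mc
retry true && echo "retry helper OK"
```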

Recommendation: Option 1. Build a harbor.tail5b443a.ts.net/pal-e-platform/backup-tools:latest image. Eliminates 3 external dependencies per run.
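
A minimal sketch of the proposed backup-tools image; the Alpine base and the pinned kubectl version are assumptions to adjust at build time:

```dockerfile
# Sketch: bake curl, mc, and kubectl into the image so the CronJob makes
# zero external downloads at runtime.
FROM alpine:3.19
RUN apk add --no-cache curl ca-certificates
ARG KUBECTL_VERSION=v1.29.0
RUN curl -fsSL -o /usr/local/bin/kubectl \
      "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl" \
 && curl -fsSL -o /usr/local/bin/mc \
      "https://dl.min.io/client/mc/release/linux-amd64/mc" \
 && chmod +x /usr/local/bin/kubectl /usr/local/bin/mc
```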

forgejo_admin 2026-03-24 21:00:20 +00:00