Bug: tofu-state backup CronJob intermittent failures (2 alerts) #123

Closed
opened 2026-03-21 14:03:55 +00:00 by forgejo_admin · 2 comments

Type

Bug

Lineage

plan-pal-e-platform → Platform Hardening — standalone, discovered during monitoring

Repo

forgejo_admin/pal-e-platform

What Broke

The tf-state-backup CronJob in the tofu-state namespace intermittently fails with BackoffLimitExceeded after 1 attempt. Failed pods are cleaned up before logs can be captured. Two KubeJobFailed alerts are active.

Pattern:

  • 29562240 — Complete
  • 29563680 — Complete
  • 29565120 — Failed (2d6h duration before fail)
  • 29566560 — Complete
  • 29568000 — Failed (6h duration before fail)

The job downloads kubectl and mc (the MinIO client), then exports Terraform state secrets to MinIO. Possible causes: a network issue reaching MinIO, an image pull failure, resource limits, or a NetworkPolicy blocking tofu-state → minio traffic.

Likely Root Causes

  1. 128Mi memory limit may be insufficient — The container runs apk add --no-cache curl then downloads two large binaries (mc ~25MB, kubectl ~49MB) into /tmp. The apk add phase itself consumes memory for package index parsing. Combined with the running shell, this can push RSS past the 128Mi limit, triggering an OOMKill that gets recorded as BackoffLimitExceeded after 2 restarts.

  2. External CDN downloads on every run are a reliability risk — The CronJob downloads mc from dl.min.io and kubectl from dl.k8s.io on every execution. If either CDN is slow, rate-limited, or temporarily unavailable, the job fails. Binaries should be baked into a custom image or cached in a PVC/init-container to eliminate this failure mode.
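
If hypothesis 1 holds, the Terraform-side fix is confined to the container's resources block inside kubernetes_cron_job_v1.tf_state_backup. A minimal sketch, assuming the kubernetes provider's standard resources syntax; the 256Mi figure is a suggestion, not a measured value:

```hcl
# Sketch: raise the memory limit in the CronJob's container block
# (terraform/main.tf, kubernetes_cron_job_v1.tf_state_backup).
resources {
  requests = {
    cpu    = "50m"
    memory = "64Mi"
  }
  limits = {
    memory = "256Mi" # was 128Mi; headroom for apk add + ~75MB of binary downloads
  }
}
```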

File Targets

| Resource | File | Lines |
|---|---|---|
| kubernetes_cron_job_v1.tf_state_backup (CronJob) | terraform/main.tf | 2222–2308 |
| kubernetes_secret_v1.tf_backup_s3_creds (S3 creds) | terraform/main.tf | 2164–2174 |
| kubernetes_service_account_v1.tf_backup (ServiceAccount) | terraform/main.tf | 2178–2182 |
| kubernetes_role_v1.tf_backup (RBAC Role) | terraform/main.tf | 2187–2200 |
| kubernetes_role_binding_v1.tf_backup (RoleBinding) | terraform/main.tf | 2201–2216 |
| minio_s3_bucket.tf_state_backups (MinIO bucket) | terraform/main.tf | 2123–2128 |
| minio_iam_user.tf_backup / minio_iam_policy.tf_backup (IAM) | terraform/main.tf | 2132–2160 |
| MinIO NetworkPolicy (tofu-state ingress rule) | terraform/network-policies.tf | 112 |

Test Expectations

  • Verify CronJob runs successfully: kubectl get jobs -n tofu-state --sort-by=.metadata.creationTimestamp | tail -5
  • Verify backup file appears in MinIO: mc ls backup/tf-state-backups/ | tail -5
  • Check pod logs for OOM or download errors: kubectl logs -n tofu-state job/tf-state-backup-<id> (must capture before cleanup)
  • Confirm no OOMKilled events: kubectl describe pod -n tofu-state -l job-name=tf-state-backup-<id> | grep -i oom

Debugging Strategy

Failed pod logs are unavailable (already cleaned up). To capture the failure mode:

  1. Create a one-off Job with the same spec but with a 512Mi memory limit and set -x prepended to the script. This isolates whether the failure is memory (OOM) or network (download timeout).
  2. If the high-memory job succeeds, the root cause is the 128Mi limit. Fix by raising the limit or baking binaries into the image.
  3. If the high-memory job still fails, the root cause is external downloads. Fix by building a custom image with mc and kubectl pre-installed, or caching them in a PVC.
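
Step 1 can be sketched as a standalone manifest (kubectl apply -f). The image, serviceAccountName, and command below are assumptions; copy the real container spec from kubernetes_cron_job_v1.tf_state_backup at terraform/main.tf:2222–2308:

```yaml
# Sketch: one-off debug Job mirroring the CronJob, with a raised memory
# limit and shell tracing. backoffLimit 0 + restartPolicy Never means the
# failed pod survives for inspection instead of being restarted.
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-state-backup-debug
  namespace: tofu-state
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: tf-backup   # assumption: match the real SA name
      restartPolicy: Never
      containers:
        - name: backup
          image: alpine:3.19          # assumption: match the CronJob image
          command: ["/bin/sh", "-c", "set -x; <existing backup script>"]
          resources:
            requests: {cpu: 50m, memory: 64Mi}
            limits: {memory: 512Mi}   # raised from 128Mi
```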

Constraints

  • Failed pod logs are unavailable — pods from the two failed jobs have already been cleaned up. All debugging must use fresh runs with increased verbosity (set -x) or resource limits.
  • The CronJob runs daily at 02:00 UTC. Verifying "3 consecutive successes" requires 3 days of observation after the fix.
  • backoff_limit: 2 with restartPolicy: OnFailure means the container crashes/fails twice within the same pod (not two separate pods), so there is only one pod to inspect per failure.

Repro Steps

  1. Wait for next scheduled tf-state-backup CronJob run
  2. Observe if pod starts and completes, or fails
  3. If fails: capture logs before pod is cleaned up

Expected Behavior

Every scheduled backup job completes successfully, and Terraform state secrets are backed up to MinIO.

Environment

  • Cluster/namespace: prod, tofu-state
  • CronJob: tf-state-backup
  • Resources: 50m CPU / 64Mi request, 128Mi memory limit
  • Related alerts: KubeJobFailed: tf-state-backup-29565120, KubeJobFailed: tf-state-backup-29568000

Acceptance Criteria

  • Next 3 consecutive backup jobs complete successfully
  • Both KubeJobFailed alerts clear
  • Root cause identified and fixed

Checklist

  • PR opened
  • Tests pass (3 consecutive CronJob completions)
  • No unrelated changes
Related

  • pal-e-platform — project board
  • Issue #109 — umbrella alert cleanup

Scope Review: NEEDS_REFINEMENT

Review note: review-224-2026-03-22
CronJob and infrastructure verified, but ticket needs refinement before agent execution.

  • Missing File Targets section -- CronJob is at terraform/main.tf:2222-2308, netpol at network-policies.tf:112
  • Missing Test Expectations -- add kubectl get jobs -n tofu-state --sort-by=.metadata.creationTimestamp | tail -5
  • Missing debugging strategy -- failed pod logs are unavailable (confirmed). Likely root causes: (a) 128Mi memory limit exceeded during apk add + binary downloads, (b) external CDN download failures for mc/kubectl. Ticket should guide agent to these hypotheses.
  • Missing Checklist and Constraints sections

Root Cause Analysis (2026-03-24)

Likely root cause: The CronJob (main.tf lines 2222-2315) downloads 3 external binaries at runtime on every execution:

  1. curl via apk add (requires Alpine CDN)
  2. mc from dl.min.io (MinIO client)
  3. kubectl from dl.k8s.io

Any DNS flakiness, egress routing issue, or CDN slowness causes the job to fail before it even starts the actual backup. This is the same class of reliability issue hitting the CI pipeline.

Fix options:

  1. Pre-built image — create a Docker image with curl, mc, and kubectl pre-installed. Push to Harbor. Eliminates runtime downloads entirely.
  2. Retry logic — add wget --tries=3 or curl --retry 3 for each download. Band-aid.
  3. Use k8s API directly — replace kubectl with a curl call to the kube API using the SA token (the SA is already bound). Removes one download.
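
Option 2, if chosen as a stopgap, works better as a small wrapper than as per-command flags, since it also covers the apk step. A sketch; retry is a hypothetical helper, not part of the existing CronJob script:

```shell
# Retry a command up to 3 times with a short pause between attempts.
retry() {
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge 3 ] && return 1   # give up after 3 tries
    sleep 1                            # brief backoff between tries
  done
}

# Usage inside the backup script (URLs as in the current CronJob), e.g.:
#   retry curl -fsSL -o /tmp/mc https://dl.min.io/client/mc/release/linux-amd64/mc
retry true && echo "retry helper OK"
```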

Recommendation: Option 1. Build a harbor.tail5b443a.ts.net/pal-e-platform/backup-tools:latest image. Eliminates 3 external dependencies per run.
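
A minimal sketch of the proposed backup-tools image; the Alpine base and the pinned kubectl version are assumptions to adjust at build time:

```dockerfile
# Sketch: bake curl, mc, and kubectl into the image so the CronJob makes
# zero external downloads at runtime.
FROM alpine:3.19
RUN apk add --no-cache curl ca-certificates
ARG KUBECTL_VERSION=v1.29.0
RUN curl -fsSL -o /usr/local/bin/kubectl \
      "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl" \
 && curl -fsSL -o /usr/local/bin/mc \
      "https://dl.min.io/client/mc/release/linux-amd64/mc" \
 && chmod +x /usr/local/bin/kubectl /usr/local/bin/mc
```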

forgejo_admin 2026-03-24 21:00:20 +00:00