Bug: tofu-state backup CronJob intermittent failures (2 alerts) #123
Reference: forgejo_admin/pal-e-platform#123
**Type:** Bug
**Lineage:** plan-pal-e-platform → Platform Hardening — standalone, discovered during monitoring
**Repo:** forgejo_admin/pal-e-platform

## What Broke
The `tf-state-backup` CronJob in the `tofu-state` namespace intermittently fails with `BackoffLimitExceeded` after 1 attempt. Failed pods are cleaned up before logs can be captured. Two `KubeJobFailed` alerts are active.

Pattern:

- 29562240 — Complete
- 29563680 — Complete
- 29565120 — Failed (2d6h duration before fail)
- 29566560 — Complete
- 29568000 — Failed (6h duration before fail)

The job downloads `kubectl` and `mc` (MinIO client), then exports Terraform state secrets to MinIO. Could be: a network issue to MinIO, an image pull failure, resource limits, or a NetworkPolicy blocking tofu-state → minio traffic.

## Likely Root Causes
1. **128Mi memory limit may be insufficient** — The container runs `apk add --no-cache curl`, then downloads two large binaries (mc ~25MB, kubectl ~49MB) into `/tmp`. The `apk add` phase itself consumes memory for package index parsing. Combined with the running shell, this can push RSS past the 128Mi limit, triggering an OOMKill that gets recorded as `BackoffLimitExceeded` after 2 restarts.
2. **External CDN downloads on every run are a reliability risk** — The CronJob downloads `mc` from `dl.min.io` and `kubectl` from `dl.k8s.io` on every execution. If either CDN is slow, rate-limited, or temporarily unavailable, the job fails. Binaries should be baked into a custom image or cached in a PVC/init-container to eliminate this failure mode.

## File Targets
- `kubernetes_cron_job_v1.tf_state_backup` (CronJob) — terraform/main.tf
- `kubernetes_secret_v1.tf_backup_s3_creds` (S3 creds) — terraform/main.tf
- `kubernetes_service_account_v1.tf_backup` (ServiceAccount) — terraform/main.tf
- `kubernetes_role_v1.tf_backup` (RBAC Role) — terraform/main.tf
- `kubernetes_role_binding_v1.tf_backup` (RoleBinding) — terraform/main.tf
- `minio_s3_bucket.tf_state_backups` (MinIO bucket) — terraform/main.tf
- `minio_iam_user.tf_backup` / `minio_iam_policy.tf_backup` (IAM) — terraform/main.tf
- terraform/network-policies.tf

## Test Expectations
```shell
kubectl get jobs -n tofu-state --sort-by=.metadata.creationTimestamp | tail -5
mc ls backup/tf-state-backups/ | tail -5
kubectl logs -n tofu-state job/tf-state-backup-<id>   # must capture before cleanup
kubectl describe pod -n tofu-state -l job-name=tf-state-backup-<id> | grep -i oom
```

## Debugging Strategy
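Because the pods are deleted before logs can be read, a small poll loop can grab the next run's logs before cleanup. This is a sketch only: the `capture_logs` helper name is made up here, and the exact job name must be filled in (the `job-name` label itself is the one the Job controller sets).

```shell
# Sketch: save logs from the first pod matching a selector before TTL cleanup.
# capture_logs <namespace> <label-selector> <output-file>
capture_logs() {
  ns=$1; sel=$2; out=$3
  # -o name yields "pod/<name>"; take the first match, if any
  pod=$(kubectl get pods -n "$ns" -l "$sel" -o name 2>/dev/null | head -n 1)
  [ -n "$pod" ] || return 1          # no pod yet
  kubectl logs -n "$ns" "$pod" --all-containers > "$out"
}

# Example: poll every 10s until the next run's pod appears, then save logs.
# while ! capture_logs tofu-state job-name=tf-state-backup-<id> /tmp/backup.log; do
#   sleep 10
# done
```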
Failed pod logs are unavailable (already cleaned up). To capture the failure mode:

- Re-run the job with `memory limit: 512Mi` and `set -x` prepended to the script. This isolates whether the failure is memory (OOM) vs. network (download timeout).
- Longer term, build a custom image with `mc` and `kubectl` pre-installed, or cache them in a PVC.

## Constraints

- No behavior changes beyond diagnostics (`set -x`) or resource limits.
- `backoff_limit: 2` with `restartPolicy: OnFailure` means the container crashes/fails twice within the same pod (not two separate pods), so there is only one pod to inspect per failure.

## Repro Steps
- Wait for the next scheduled `tf-state-backup` CronJob run.

## Expected Behavior
Every scheduled backup job completes successfully, and Terraform state secrets are backed up to MinIO.
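The happy path described in this ticket amounts to roughly the following. This is only a sketch of the flow: the `backup_states` helper, the `tfstate` secret-name filter, and the `backup/` mc alias are assumptions for illustration — the real script lives in `terraform/main.tf`.

```shell
# Sketch of the backup flow: export Terraform state secrets, push to MinIO.
# backup_states <namespace> <bucket>
backup_states() {
  ns=$1; bucket=$2
  # dump each Terraform state secret in the namespace as YAML
  for s in $(kubectl get secrets -n "$ns" -o name | grep tfstate); do
    kubectl get -n "$ns" "$s" -o yaml > "/tmp/${s#secret/}.yaml"
  done
  # "backup" is an assumed mc alias pointing at the MinIO endpoint
  mc cp /tmp/tfstate*.yaml "backup/$bucket/"
}

# usage: backup_states tofu-state tf-state-backups
```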
## Environment

- Namespace: `tofu-state`
- CronJob: `tf-state-backup`
- Alerts: `KubeJobFailed: tf-state-backup-29565120`, `KubeJobFailed: tf-state-backup-29568000`

## Acceptance Criteria
## Checklist

## Related

- pal-e-platform — project board

## Scope Review: NEEDS_REFINEMENT
Review note: `review-224-2026-03-22`. CronJob and infrastructure verified, but the ticket needs refinement before agent execution.

- CronJob at `terraform/main.tf:2222-2308`, netpol at `network-policies.tf:112`
- Verify job history with: `kubectl get jobs -n tofu-state --sort-by=.metadata.creationTimestamp | tail -5`
- Two hypotheses to investigate: (a) OOM during `apk add` + binary downloads, (b) external CDN download failures for mc/kubectl. Ticket should guide agent to these hypotheses.

## Root Cause Analysis (2026-03-24)
Likely root cause: the CronJob (main.tf lines 2222-2315) downloads 3 external binaries at runtime on every execution:

- `curl` via `apk add` (requires the Alpine CDN)
- `mc` from `dl.min.io` (MinIO client)
- `kubectl` from `dl.k8s.io`

Any DNS flakiness, egress routing issue, or CDN slowness causes the job to fail before it even starts the actual backup. This is the same class of reliability issue hitting the CI pipeline.
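As a stopgap, each download could be wrapped in a retry helper instead of relying on a single attempt. A minimal POSIX-sh sketch — the `retry` helper and its 1s pause are illustrative, not the script's current code:

```shell
# Retry a command up to <max> times with a short pause between attempts.
# retry <max> <cmd> [args...]
retry() {
  max=$1; shift
  n=1
  until "$@"; do
    [ "$n" -ge "$max" ] && { echo "retry: giving up after $n attempts: $*" >&2; return 1; }
    n=$((n + 1))
    sleep 1
  done
}

# Example (URL as referenced in this ticket):
# retry 3 curl -fsSL -o /tmp/mc https://dl.min.io/client/mc/release/linux-amd64/mc
```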
Fix options:

1. Bake the binaries into a custom image.
2. Add `wget --retry` or `curl --retry 3` for each download. Band-aid.

Recommendation: Option 1. Build a `harbor.tail5b443a.ts.net/pal-e-platform/backup-tools:latest` image. Eliminates 3 external dependencies per run.
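Option 1 could look roughly like the following. The Dockerfile content is a sketch, not a tested build: the `alpine:3.20` base tag is an assumption, and fetching whatever `stable.txt` reports means the kubectl version is unpinned — pin it in practice.

```shell
# Write a Dockerfile that bakes curl, mc, and kubectl into the image,
# removing all three runtime downloads from the CronJob script.
cat > Dockerfile <<'EOF'
FROM alpine:3.20
RUN apk add --no-cache curl \
 && curl -fsSL -o /usr/local/bin/mc \
      https://dl.min.io/client/mc/release/linux-amd64/mc \
 && curl -fsSL -o /usr/local/bin/kubectl \
      "https://dl.k8s.io/release/$(curl -fsSL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" \
 && chmod +x /usr/local/bin/mc /usr/local/bin/kubectl
EOF

# Then build and push (image name from the recommendation above):
# docker build -t harbor.tail5b443a.ts.net/pal-e-platform/backup-tools:latest .
# docker push harbor.tail5b443a.ts.net/pal-e-platform/backup-tools:latest
```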