fix: add retry logic to backup CronJob downloads #160
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo_admin/pal-e-platform!160
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "123-backup-cronjob-retry"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
DNS flakiness causes
tf-state-backupandcnpg-backup-verifyCronJobs to fail before the actual backup logic runs. All external download steps (apk, mc binary, kubectl binary) now retry on transient failures.Changes
terraform/main.tf—kubernetes_cron_job_v1.tf_state_backup: added--retry 3 --retry-delay 5to both curl commands and a single retry fallback toapk addterraform/main.tf—kubernetes_cron_job_v1.cnpg_backup_verify: added--retry 3 --retry-delay 5to the mc download curl and a single retry fallback toapk addtofu plan Output
No HCL structure changes — only string content within heredoc shell scripts.
tofu fmtpassed clean.tofu validatecannot run without provider cache in the worktree, but changes are confined to shell script strings inside existing heredoc blocks.Test Plan
Review Checklist
Related
Review: LGTM
Diff: 6 lines changed, 1 file, 2 CronJob resources.
Verified:
--retry 3 --retry-delay 5is valid curl syntax for transient failure resilienceapk addretry via|| (sleep 5 && apk add ...)is correct -- no built-in retry flag for apkset -euo pipefailpreserved -- second apk failure will still exit the scripttf_state_backupandcnpg_backup_verifyCronJobs are coveredtofu fmtpassed cleanNo findings. Ready for merge.
PR #160 Review
DOMAIN REVIEW
Tech stack: OpenTofu HCL, shell scripts inside heredoc strings, Kubernetes CronJob resources.
Completeness audit -- I verified every
curlandapk addcommand interraform/main.tf:tf_state_backupapk add --no-cache curl|| (sleep 5 && apk add ...)tf_state_backupcurlmc binary download--retry 3 --retry-delay 5tf_state_backupcurlkubectl binary download--retry 3 --retry-delay 5cnpg_backup_verifyapk add --no-cache curl|| (sleep 5 && apk add ...)cnpg_backup_verifycurlmc binary download--retry 3 --retry-delay 5All 5 external-download commands across both CronJobs are covered. No
curlorapk addcommands exist elsewhere in the file.HCL syntax: The changes are confined to string content within
<<-EOT ... EOTheredoc blocks. No structural HCL changes. Heredoc delimiters, bracket nesting, and indentation are preserved correctly. The 6 additions / 6 deletions count is accurate (5 functional lines + 1 comment update).Shell correctness:
apk addretry pattern (cmd || (sleep 5 && cmd)) is sound. The subshell ensuresset -edoes not abort on the first failure -- the||short-circuits, and only if the retry also fails does the script exit. Single retry with 5s sleep is appropriate for transient DNS.curl --retry 3 --retry-delay 5is the standard curl retry mechanism. It retries on transient errors (connection refused, timeout, DNS failures). The flags are placed correctly before the URL argument.Scope: Only
terraform/main.tfis touched. 1 file changed, 6 lines added, 6 lines removed. Tightly scoped to the bug fix described in #123.BLOCKERS
None.
This is a shell-script-only change inside existing HCL heredocs. No new functionality requiring tests -- the retry behavior is provided by
curlitself (--retryis a built-in curl feature) and by a simple||fallback forapk. The test plan (wait for scheduled CronJob runs or trigger manual Jobs) is the correct verification approach for CronJob changes.NITS
apk retry is single-attempt, curl retry is triple-attempt. The
apk addpattern retries once, whilecurlretries 3 times. This asymmetry is fine in practice (apk DNS resolution is the flaky step, and one retry suffices), but worth noting for consistency. Not blocking.No
--retry-connrefusedon curl. By default,curl --retryonly retries on transient HTTP errors and timeouts, not connection-refused errors. Adding--retry-connrefusedwould also cover cases where the download server is temporarily unreachable. Minor hardening opportunity for a future pass.SOP COMPLIANCE
123-backup-cronjob-retryreferences issue #123Closes #123tofu fmtreported clean per PR bodytofu planoutput: PR notes that changes are string-only within heredocs, so no resource-level plan changes expected. Acceptable justification for omitting full plan output.PROCESS OBSERVATIONS
restartPolicy: OnFailureprovides an additional safety net.VERDICT: APPROVED