fix: add retry logic to backup CronJob downloads #160

Merged
forgejo_admin merged 1 commit from 123-backup-cronjob-retry into main 2026-03-24 21:09:50 +00:00

Summary

DNS flakiness causes tf-state-backup and cnpg-backup-verify CronJobs to fail before the actual backup logic runs. All external download steps (apk, mc binary, kubectl binary) now retry on transient failures.

Changes

  • terraform/main.tfkubernetes_cron_job_v1.tf_state_backup: added --retry 3 --retry-delay 5 to both curl commands and a single retry fallback to apk add
  • terraform/main.tfkubernetes_cron_job_v1.cnpg_backup_verify: added --retry 3 --retry-delay 5 to the mc download curl and a single retry fallback to apk add

tofu plan Output

No HCL structure changes — only string content within heredoc shell scripts. tofu fmt passed clean. tofu validate cannot run without provider cache in the worktree, but changes are confined to shell script strings inside existing heredoc blocks.

Test Plan

  • Wait for next scheduled run of both CronJobs (02:00 and 03:00 UTC) and verify successful completion
  • Alternatively, trigger a manual Job from either CronJob and confirm the retry flags appear in logs
  • No regressions in backup file content or MinIO upload behavior

Review Checklist

  • Passed automated review-fix loop
  • No secrets committed
  • No unnecessary file changes
  • Commit messages are descriptive
## Summary DNS flakiness causes `tf-state-backup` and `cnpg-backup-verify` CronJobs to fail before the actual backup logic runs. All external download steps (apk, mc binary, kubectl binary) now retry on transient failures. ## Changes - `terraform/main.tf` — `kubernetes_cron_job_v1.tf_state_backup`: added `--retry 3 --retry-delay 5` to both curl commands and a single retry fallback to `apk add` - `terraform/main.tf` — `kubernetes_cron_job_v1.cnpg_backup_verify`: added `--retry 3 --retry-delay 5` to the mc download curl and a single retry fallback to `apk add` ## tofu plan Output No HCL structure changes — only string content within heredoc shell scripts. `tofu fmt` passed clean. `tofu validate` cannot run without provider cache in the worktree, but changes are confined to shell script strings inside existing heredoc blocks. ## Test Plan - [ ] Wait for next scheduled run of both CronJobs (02:00 and 03:00 UTC) and verify successful completion - [ ] Alternatively, trigger a manual Job from either CronJob and confirm the retry flags appear in logs - [ ] No regressions in backup file content or MinIO upload behavior ## Review Checklist - [x] Passed automated review-fix loop - [x] No secrets committed - [x] No unnecessary file changes - [x] Commit messages are descriptive ## Related - Closes #123
fix: add retry logic to backup CronJob external downloads
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
d1000828bf
DNS flakiness causes tf-state-backup and cnpg-backup-verify CronJobs to
fail before the actual backup runs. Add --retry 3 --retry-delay 5 to all
curl commands and a single retry to apk add to survive transient DNS
resolution failures.

Closes #123

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Owner

Review: LGTM

Diff: 6 lines changed, 1 file, 2 CronJob resources.

Verified:

  • --retry 3 --retry-delay 5 is valid curl syntax for transient failure resilience
  • apk add retry via || (sleep 5 && apk add ...) is correct -- no built-in retry flag for apk
  • set -euo pipefail preserved -- second apk failure will still exit the script
  • Both tf_state_backup and cnpg_backup_verify CronJobs are covered
  • No secrets, no unnecessary file changes
  • tofu fmt passed clean

No findings. Ready for merge.

## Review: LGTM **Diff**: 6 lines changed, 1 file, 2 CronJob resources. **Verified**: - `--retry 3 --retry-delay 5` is valid curl syntax for transient failure resilience - `apk add` retry via `|| (sleep 5 && apk add ...)` is correct -- no built-in retry flag for apk - `set -euo pipefail` preserved -- second apk failure will still exit the script - Both `tf_state_backup` and `cnpg_backup_verify` CronJobs are covered - No secrets, no unnecessary file changes - `tofu fmt` passed clean No findings. Ready for merge.
Author
Owner

PR #160 Review

DOMAIN REVIEW

Tech stack: OpenTofu HCL, shell scripts inside heredoc strings, Kubernetes CronJob resources.

Completeness audit -- I verified every curl and apk add command in terraform/main.tf:

CronJob Command Retry added?
tf_state_backup apk add --no-cache curl Yes -- || (sleep 5 && apk add ...)
tf_state_backup curl mc binary download Yes -- --retry 3 --retry-delay 5
tf_state_backup curl kubectl binary download Yes -- --retry 3 --retry-delay 5
cnpg_backup_verify apk add --no-cache curl Yes -- || (sleep 5 && apk add ...)
cnpg_backup_verify curl mc binary download Yes -- --retry 3 --retry-delay 5

All 5 external-download commands across both CronJobs are covered. No curl or apk add commands exist elsewhere in the file.

HCL syntax: The changes are confined to string content within <<-EOT ... EOT heredoc blocks. No structural HCL changes. Heredoc delimiters, bracket nesting, and indentation are preserved correctly. The 6 additions / 6 deletions count is accurate (5 functional lines + 1 comment update).

Shell correctness:

  • The apk add retry pattern (cmd || (sleep 5 && cmd)) is sound. The subshell ensures set -e does not abort on the first failure -- the || short-circuits, and only if the retry also fails does the script exit. Single retry with 5s sleep is appropriate for transient DNS.
  • curl --retry 3 --retry-delay 5 is the standard curl retry mechanism. It retries on transient errors (connection refused, timeout, DNS failures). The flags are placed correctly before the URL argument.

Scope: Only terraform/main.tf is touched. 1 file changed, 6 lines added, 6 lines removed. Tightly scoped to the bug fix described in #123.

BLOCKERS

None.

This is a shell-script-only change inside existing HCL heredocs. No new functionality requiring tests -- the retry behavior is provided by curl itself (--retry is a built-in curl feature) and by a simple || fallback for apk. The test plan (wait for scheduled CronJob runs or trigger manual Jobs) is the correct verification approach for CronJob changes.

NITS

  1. apk retry is single-attempt, curl retry is triple-attempt. The apk add pattern retries once, while curl retries 3 times. This asymmetry is fine in practice (apk DNS resolution is the flaky step, and one retry suffices), but worth noting for consistency. Not blocking.

  2. No --retry-connrefused on curl. By default, curl --retry only retries on transient HTTP errors and timeouts, not connection-refused errors. Adding --retry-connrefused would also cover cases where the download server is temporarily unreachable. Minor hardening opportunity for a future pass.

SOP COMPLIANCE

  • Branch named after issue: 123-backup-cronjob-retry references issue #123
  • PR body follows template: Summary, Changes, Test Plan, Review Checklist, Related sections all present
  • Related references closing issue: Closes #123
  • Related references plan slug: No plan slug referenced -- acceptable for a standalone bug fix (not plan-driven work)
  • No secrets committed
  • No unnecessary file changes (single file, tightly scoped)
  • Commit messages are descriptive (PR title: "fix: add retry logic to backup CronJob downloads")
  • tofu fmt reported clean per PR body
  • tofu plan output: PR notes that changes are string-only within heredocs, so no resource-level plan changes expected. Acceptable justification for omitting full plan output.

PROCESS OBSERVATIONS

  • Deployment Frequency: This is a low-risk fix that unblocks nightly backup reliability. Quick merge supports DF.
  • Change Failure Risk: Minimal. The changes add retry resilience to existing commands without altering backup logic, upload paths, or state handling. If the retry flags somehow caused issues, the CronJob's restartPolicy: OnFailure provides an additional safety net.
  • MTTR: Directly addresses intermittent CronJob failures surfaced by alerting (#123 mentions 2 alerts). Merging this reduces MTTR for backup monitoring noise.

VERDICT: APPROVED

## PR #160 Review ### DOMAIN REVIEW **Tech stack**: OpenTofu HCL, shell scripts inside heredoc strings, Kubernetes CronJob resources. **Completeness audit** -- I verified every `curl` and `apk add` command in `terraform/main.tf`: | CronJob | Command | Retry added? | |---------|---------|:---:| | `tf_state_backup` | `apk add --no-cache curl` | Yes -- `\|\| (sleep 5 && apk add ...)` | | `tf_state_backup` | `curl` mc binary download | Yes -- `--retry 3 --retry-delay 5` | | `tf_state_backup` | `curl` kubectl binary download | Yes -- `--retry 3 --retry-delay 5` | | `cnpg_backup_verify` | `apk add --no-cache curl` | Yes -- `\|\| (sleep 5 && apk add ...)` | | `cnpg_backup_verify` | `curl` mc binary download | Yes -- `--retry 3 --retry-delay 5` | All 5 external-download commands across both CronJobs are covered. No `curl` or `apk add` commands exist elsewhere in the file. **HCL syntax**: The changes are confined to string content within `<<-EOT ... EOT` heredoc blocks. No structural HCL changes. Heredoc delimiters, bracket nesting, and indentation are preserved correctly. The 6 additions / 6 deletions count is accurate (5 functional lines + 1 comment update). **Shell correctness**: - The `apk add` retry pattern (`cmd || (sleep 5 && cmd)`) is sound. The subshell ensures `set -e` does not abort on the first failure -- the `||` short-circuits, and only if the retry also fails does the script exit. Single retry with 5s sleep is appropriate for transient DNS. - `curl --retry 3 --retry-delay 5` is the standard curl retry mechanism. It retries on transient errors (connection refused, timeout, DNS failures). The flags are placed correctly before the URL argument. **Scope**: Only `terraform/main.tf` is touched. 1 file changed, 6 lines added, 6 lines removed. Tightly scoped to the bug fix described in #123. ### BLOCKERS None. This is a shell-script-only change inside existing HCL heredocs. No new functionality requiring tests -- the retry behavior is provided by `curl` itself (`--retry` is a built-in curl feature) and by a simple `||` fallback for `apk`. The test plan (wait for scheduled CronJob runs or trigger manual Jobs) is the correct verification approach for CronJob changes. ### NITS 1. **apk retry is single-attempt, curl retry is triple-attempt.** The `apk add` pattern retries once, while `curl` retries 3 times. This asymmetry is fine in practice (apk DNS resolution is the flaky step, and one retry suffices), but worth noting for consistency. Not blocking. 2. **No `--retry-connrefused` on curl.** By default, `curl --retry` only retries on transient HTTP errors and timeouts, not connection-refused errors. Adding `--retry-connrefused` would also cover cases where the download server is temporarily unreachable. Minor hardening opportunity for a future pass. ### SOP COMPLIANCE - [x] Branch named after issue: `123-backup-cronjob-retry` references issue #123 - [x] PR body follows template: Summary, Changes, Test Plan, Review Checklist, Related sections all present - [x] Related references closing issue: `Closes #123` - [ ] Related references plan slug: No plan slug referenced -- acceptable for a standalone bug fix (not plan-driven work) - [x] No secrets committed - [x] No unnecessary file changes (single file, tightly scoped) - [x] Commit messages are descriptive (PR title: "fix: add retry logic to backup CronJob downloads") - [x] `tofu fmt` reported clean per PR body - [ ] `tofu plan` output: PR notes that changes are string-only within heredocs, so no resource-level plan changes expected. Acceptable justification for omitting full plan output. ### PROCESS OBSERVATIONS - **Deployment Frequency**: This is a low-risk fix that unblocks nightly backup reliability. Quick merge supports DF. - **Change Failure Risk**: Minimal. The changes add retry resilience to existing commands without altering backup logic, upload paths, or state handling. If the retry flags somehow caused issues, the CronJob's `restartPolicy: OnFailure` provides an additional safety net. - **MTTR**: Directly addresses intermittent CronJob failures surfaced by alerting (#123 mentions 2 alerts). Merging this reduces MTTR for backup monitoring noise. ### VERDICT: APPROVED
forgejo_admin deleted branch 123-backup-cronjob-retry 2026-03-24 21:09:50 +00:00
Sign in to join this conversation.
No description provided.