fix: skip WAL freshness check for new CNPG clusters without archives #93
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo_admin/pal-e-platform!93
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "92-fix-backup-verify-new-clusters"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Backup verify CronJob was failing because woodpecker CNPG cluster (2 days old) has no WAL archives yet. CNPG only archives WAL segments when they fill up (16MB) — low-traffic DBs take days. Added a WAL directory existence check before the freshness check.
Changes
terraform/main.tf— Added WAL directory existence check incnpg_backup_verifyCronJob script. If no WAL archives exist yet for a prefix, logsSKIP: No WAL archives yetand continues without error. Base backup object presence (already checked above) is sufficient verification for new clusters.Test Plan
tofu planshows only CronJob resource changingkubectl create job --from=cronjob/cnpg-backup-verify test-verify -n postgres)Review Checklist
tofu fmt -checkpassestofu validatepassesRelated
todo-cnpg-backup-verify-failure— pal-e-docs TODOdeployment-lessons— lessons learned docTofu Plan Output
PR #93 Review
DOMAIN REVIEW
Tech stack: Terraform (HCL) + Kubernetes CronJob with inline shell script.
Terraform/k8s specifics reviewed:
Shell script correctness: The WAL existence check is well-structured.
set -euo pipefailis set at the top of the script, and the newmc lscall correctly uses2>/dev/null | head -1 || trueto prevent the pipeline from exiting on a missing directory. Thecontinuestatement correctly skips to the next PREFIX iteration without incrementing the error counter.Edge case handling: Both scenarios are covered -- (a) WAL directory does not exist (mc ls errors, suppressed, returns empty) and (b) WAL directory exists but is empty (mc ls returns nothing). Both correctly result in SKIP.
No injection risk: The
$PREFIXvariable iterates over a hardcoded list ("pal-e-postgres" "woodpecker"), not user-supplied input. No injection surface.Secrets handling: Credentials are sourced from
kubernetes_secret_v1.cnpg_s3_credsviavalue_from.secret_key_ref. No plaintext secrets.Resource limits: Already set on the container (
50m/64Mirequests,128Mimemory limit). No change needed.Image tag:
alpine:3.20is pinned to a minor version. Acceptable for a verification CronJob, though a SHA pin would be stricter.Diff scope note: The Forgejo diff shows Ollama volume changes (hostPath rename, OLLAMA_MODELS env var) alongside the CNPG fix. These Ollama changes are already on main via PR #91 (commit
4c8eb04). The diff is a merge-base computation artifact, not scope creep. The net-new change in this PR is solely the 8-line WAL existence check block at lines 2314-2320.One observation on silent skip behavior: A mature cluster that loses its WAL archives due to storage corruption would be silently skipped rather than flagged. The base backup object check still catches total backup loss, so the risk is low. Worth noting in
deployment-lessonsthat the verify script trusts "no WAL = new cluster" -- if a cluster is known to have had WAL archives previously, manual investigation would be warranted. This is not a blocker, just a design-awareness note.BLOCKERS
None.
kubectl create jobis the appropriate test method).NITS
Comment precision (line 2314): The comment says "Check if WAL directory has content" but it actually checks if the
wals/subdirectory has any listing output. Minor wording -- no functional impact.alpine image tag:
alpine:3.20is pinned to minor but not patch. Consideralpine:3.20.xor a digest pin for fully reproducible builds. Low priority for a verification CronJob.SOP COMPLIANCE
92-fix-backup-verify-new-clustersreferences #92)todo-cnpg-backup-verify-failure(TODO-driven fix, no plan slug -- acceptable)fix: skip WAL freshness check for new CNPG clusters without archives)tofu fmtandtofu validateconfirmed passingPROCESS OBSERVATIONS
deployment-lessonsfor lessons learned capture. The silent-skip design observation above would be a good addition there.VERDICT: APPROVED