fix: backup verify CronJob fails on new CNPG clusters without WAL archives #92

Closed
opened 2026-03-17 01:57:40 +00:00 by forgejo_admin · 1 comment

Lineage

todo-cnpg-backup-verify-failure (no plan ancestry)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want the backup verification CronJob to handle new CNPG clusters that haven't archived WAL segments yet
So that the KubeJobFailed alert only fires on real backup problems, not on expected new-cluster behavior

Context

The cnpg-backup-verify CronJob checks for recent WAL files in MinIO under both the pal-e-postgres and woodpecker prefixes. The woodpecker CNPG cluster is 2 days old and hasn't archived any WAL segments yet: CNPG archives a WAL segment only once it fills up (16 MB by default), so a low-traffic database can take days to produce its first WAL archive.

The verify script treats an empty WAL directory as a failure, causing a KubeJobFailed warning alert even though base backups are completing successfully (8/8 for pal-e-postgres, 1/1 for woodpecker).

Verified via MinIO: backup/postgres-wal/pal-e-postgres/wals/ has 3 WAL segment directories. backup/postgres-wal/woodpecker/wals/ is empty.
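The MinIO check above could be reproduced with a small counting helper. This is a minimal sketch, not the script from the repo: the function name `count_objects` and the `minio` client alias are assumptions, and only the object paths come from this issue.

```shell
#!/bin/sh
# Count the entries in a listing read from stdin. Blank lines are ignored,
# so an empty prefix yields 0.
count_objects() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Hypothetical usage against MinIO (the alias "minio" is assumed):
#   mc ls --recursive minio/backup/postgres-wal/woodpecker/wals/ | count_objects
#   mc ls --recursive minio/backup/postgres-wal/pal-e-postgres/wals/ | count_objects
```

A count of 0 for the woodpecker prefix alongside a non-zero count for pal-e-postgres would match the state described above.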

File Targets

Files to modify:

  • terraform/main.tf: the kubernetes_cron_job_v1.cnpg_backup_verify script block (~line 2295). Add a WAL directory existence check before the freshness check.

Files NOT to touch:

  • Everything else — this is a one-function fix in the CronJob script.

Acceptance Criteria

  • When the verify job runs and a cluster has base backup objects but no WAL directory, it logs SKIP: No WAL directory yet and continues without error
  • When the verify job runs and a cluster has WAL files older than 25h, it still errors as before
  • When the verify job runs and a cluster has no backup objects at all, it still errors as before
  • tofu plan shows only the CronJob resource changing
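The first three criteria amount to a small decision table, which can be sketched as a shell function. This is an illustration under stated assumptions, not the existing script: the function name, its arguments and the ERROR/OK message texts are hypothetical, and fetching the counts from MinIO is left out. Only the SKIP message and the 25h threshold come from this issue. Inside the Terraform heredoc, every `${...}` below would need to be written as `$${...}`.

```shell
#!/bin/sh
# Sketch of the pre-check decision only.
# Arguments (all hypothetical names):
#   $1 cluster prefix, e.g. woodpecker
#   $2 number of base backup objects found
#   $3 number of WAL objects found
#   $4 age in hours of the newest WAL object (ignored when $3 is 0)
MAX_WAL_AGE_HOURS=25

check_cluster() {
  prefix="$1"
  base_count="$2"
  wal_count="$3"
  newest_wal_age_h="${4:-0}"

  # No backup objects at all: still a hard failure.
  if [ "$base_count" -eq 0 ]; then
    echo "ERROR: No backup objects for $prefix"
    return 1
  fi

  # New pre-check: base backups exist but nothing has been archived yet.
  if [ "$wal_count" -eq 0 ]; then
    echo "SKIP: No WAL directory yet ($prefix)"
    return 0
  fi

  # Freshness check: a newest WAL older than the threshold is an error.
  if [ "$newest_wal_age_h" -gt "$MAX_WAL_AGE_HOURS" ]; then
    echo "ERROR: Newest WAL for $prefix is ${newest_wal_age_h}h old"
    return 1
  fi

  echo "OK: $prefix"
  return 0
}
```

For example, `check_cluster woodpecker 1 0` takes the SKIP branch and returns 0, while `check_cluster pal-e-postgres 8 3 30` returns 1 because the newest WAL is older than 25h.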

Test Expectations

  • Manual: delete the failed job (kubectl delete job cnpg-backup-verify-29560860 -n postgres — already done)
  • Manual: wait for next 03:00 UTC run or trigger manually, verify it passes
  • Run command: tofu plan -lock=false to verify only CronJob changes
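Triggering the job manually could be done by creating a one-off Job from the CronJob's template. A minimal sketch, assuming the CronJob is named cnpg-backup-verify in the postgres namespace (as the job name in the delete command suggests). The helper names, the `-manual-` suffix and the 300s timeout are hypothetical.

```shell
#!/bin/sh
NAMESPACE="postgres"          # namespace from the kubectl command in this issue
CRONJOB="cnpg-backup-verify"  # CronJob name, assumed from the failed job's name

# Hypothetical helper: derive a unique name for the one-off Job.
manual_job_name() {
  printf '%s-manual-%s\n' "$CRONJOB" "$1"
}

# Create a Job from the CronJob's template, wait for it, then show its logs.
# Note: kubectl wait --for=condition=complete runs into the timeout if the
# Job fails instead of completing.
run_verify_now() {
  job="$(manual_job_name "$(date +%s)")"
  kubectl create job "$job" --from="cronjob/$CRONJOB" -n "$NAMESPACE" &&
    kubectl wait --for=condition=complete "job/$job" -n "$NAMESPACE" --timeout=300s &&
    kubectl logs "job/$job" -n "$NAMESPACE"
}
```

Calling `run_verify_now` against the cluster should print the verify script's output, including the SKIP line for woodpecker once the fix is applied.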

Constraints

  • Use $${VAR} syntax for shell variables inside the terraform heredoc: Terraform renders $${ as a literal ${, so the shell receives ${VAR} (the same escape Woodpecker pipelines use)
  • Keep the existing check logic for base backup objects — only add a pre-check for WAL directory existence
  • Don't change the CronJob schedule or resource limits
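The effect of the `$${VAR}` escape can be illustrated with a one-line filter. This is an approximation for illustration only: Terraform performs the unescaping itself when it renders the heredoc, and the helper name `tf_unescape` is hypothetical, not something in the repo.

```shell
#!/bin/sh
# Illustration only: mimics how Terraform turns the escape $${ into a
# literal ${ when it renders a heredoc.
tf_unescape() {
  sed 's/\$\${/${/g'
}

# Usage: feed it a line as written in main.tf to see what the shell receives.
#   printf '%s\n' 'echo "Checking $${PREFIX}"' | tf_unescape
```

Without the doubled `$`, Terraform would try to interpolate `${PREFIX}` as one of its own expressions rather than passing it through to the shell.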

Checklist

  • PR opened
  • tofu plan shows only CronJob change
  • No unrelated changes

Related

  • pal-e-platform — project
  • todo-cnpg-backup-verify-failure — pal-e-docs TODO tracking this
  • deployment-lessons — lessons learned doc
Author
Owner

Reading issue for QA review context.

forgejo_admin 2026-03-17 02:02:24 +00:00

Reference
forgejo_admin/pal-e-platform#92