fix: backup verify CronJob fails on new CNPG clusters without WAL archives #92

Closed

opened 2026-03-17 01:57:40 +00:00 by forgejo_admin · 1 comment

Lineage

todo-cnpg-backup-verify-failure (no plan ancestry)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want the backup verification CronJob to handle new CNPG clusters that haven't archived WAL segments yet
So that the KubeJobFailed alert only fires on real backup problems, not on expected new-cluster behavior

Context

The cnpg-backup-verify CronJob checks for recent WAL files in MinIO for both pal-e-postgres and woodpecker prefixes. The woodpecker CNPG cluster is 2 days old and hasn't archived any WAL segments yet — CNPG only archives WALs when they fill up (16MB default). Low-traffic databases can take days to produce their first WAL archive.

The verify script treats an empty WAL directory as a failure, causing a KubeJobFailed warning alert even though base backups are completing successfully (8/8 for pal-e-postgres, 1/1 for woodpecker).

Verified via MinIO: backup/postgres-wal/pal-e-postgres/wals/ has 3 WAL segment directories. backup/postgres-wal/woodpecker/wals/ is empty.

File Targets

Files to modify:

terraform/main.tf — kubernetes_cron_job_v1.cnpg_backup_verify script block (~line 2295). Add WAL directory existence check before the freshness check.

Files NOT to touch:

Everything else — this is a one-function fix in the CronJob script.

Acceptance Criteria

When the verify job runs and a cluster has base backup objects but no WAL directory, it logs SKIP: No WAL directory yet and continues without error
When the verify job runs and a cluster has WAL files older than 25h, it still errors as before
When the verify job runs and a cluster has no backup objects at all, it still errors as before
tofu plan shows only the CronJob resource changing
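The three behavioral criteria above can be read as an ordered set of checks. The following is a minimal sketch of that ordering, not the actual script from terraform/main.tf: the function name `check_cluster`, its count and age arguments, and every message except `SKIP: No WAL directory yet` are hypothetical, and the real job would derive these values from its MinIO listings. If logic like this is pasted into the Terraform heredoc, each `${...}` would need to be written as `$${...}`.

```shell
#!/bin/sh
# Hypothetical helper illustrating the check order; not the real CronJob script.
# Arguments (all assumed to be computed earlier from MinIO listings):
#   $1 = number of base backup objects found for the cluster
#   $2 = number of WAL objects found under the cluster's wals/ prefix
#   $3 = age in hours of the newest WAL object (ignored when $2 is 0)
check_cluster() {
  base_count="$1"
  wal_count="$2"
  newest_wal_age_h="$3"
  max_age_h=25

  # No backup objects at all: still an error, as before.
  if [ "$base_count" -eq 0 ]; then
    echo "ERROR: No backup objects found"
    return 1
  fi

  # New pre-check: base backups exist but no WAL has been archived yet.
  if [ "$wal_count" -eq 0 ]; then
    echo "SKIP: No WAL directory yet"
    return 0
  fi

  # Existing freshness check: WAL files older than 25h are still an error.
  if [ "$newest_wal_age_h" -gt "$max_age_h" ]; then
    echo "ERROR: Newest WAL is older than ${max_age_h}h"
    return 1
  fi

  echo "OK: WAL archive is fresh"
  return 0
}
```

With these assumptions, `check_cluster 1 0 0` corresponds to the woodpecker case (one base backup, no WALs) and returns success, while `check_cluster 0 0 0` and `check_cluster 8 3 30` both return a non-zero status.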

Test Expectations

Manual: delete the failed job (kubectl delete job cnpg-backup-verify-29560860 -n postgres — already done)
Manual: wait for next 03:00 UTC run or trigger manually, verify it passes
Run command: tofu plan -lock=false to verify only CronJob changes
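For the "trigger manually" step, one option is to create a one-off Job from the CronJob template rather than waiting for the 03:00 UTC run. This is a sketch only: the wrapper function `trigger_verify_job`, the default job name, and the 300s timeout are illustrative assumptions; the CronJob name and namespace are taken from this issue.

```shell
#!/bin/sh
# Hypothetical wrapper; the function name, default job name, and timeout are assumptions.
# Creates a one-off Job from the cnpg-backup-verify CronJob, waits for it, and prints its logs.
trigger_verify_job() {
  job_name="${1:-cnpg-backup-verify-manual}"

  # Instantiate a Job from the CronJob's job template.
  kubectl create job "$job_name" --from=cronjob/cnpg-backup-verify -n postgres || return 1

  # Block until the Job reports the Complete condition (fails on timeout).
  kubectl wait --for=condition=complete "job/$job_name" -n postgres --timeout=300s || return 1

  # Show the verify script's output, which should include any SKIP or error lines.
  kubectl logs "job/$job_name" -n postgres
}
```

Calling `trigger_verify_job some-name` runs the three steps in order; the manual Job would need to be deleted afterwards if it should not linger in the namespace.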

Constraints

Use $${VAR} syntax for shell variables inside the terraform heredoc (Terraform's escape for a literal ${VAR}; an unescaped ${VAR} would be parsed as Terraform interpolation)
Keep the existing check logic for base backup objects — only add a pre-check for WAL directory existence
Don't change the CronJob schedule or resource limits

Checklist

PR opened
tofu plan shows only CronJob change
No unrelated changes

Related

pal-e-platform — project
todo-cnpg-backup-verify-failure — pal-e-docs TODO tracking this
deployment-lessons — lessons learned doc
