Switch basketball-api to RollingUpdate to prevent webhook delivery failures during deploys #346

Closed
opened 2026-04-05 19:25:46 +00:00 by forgejo_admin · 1 comment

Type

Bug

Lineage

Root cause discovered from forgejo_admin/basketball-api#343 investigation. The funnel works, but Recreate strategy causes downtime that breaks Stripe webhook delivery.

Repo

ldraney/pal-e-deployments (kustomize overlay for basketball-api)

What Broke

basketball-api uses strategy: Recreate in its Deployment. Every CI build takes the API completely offline for 30-60 seconds. During that window, Stripe webhook retries get connection refused. With frequent deployments (10+ replicasets observed), multiple retry windows are missed, leaving checkout.session.completed events with pending_webhooks=1.

This caused 6 jersey payments ($780) to go unrecorded in the database between March 25 and April 5. Payments were manually synced from Stripe on 2026-04-05.

Stripe retries webhooks with exponential backoff over 72 hours. Repeated deploy-time outages can exhaust all retry attempts.

Repro Steps

  1. Trigger a basketball-api deployment (push to main, ArgoCD sync)
  2. During the deploy, the old pod is terminated before the new pod is ready (Recreate strategy)
  3. Any Stripe webhook delivery during this window gets connection refused
  4. pending_webhooks counter stays at 1

Expected Behavior

Zero-downtime deploys via RollingUpdate strategy — new pod comes up and passes readiness checks before old pod is terminated. Stripe webhook delivery succeeds at all times.

Environment

  • Cluster/namespace: prod / basketball-api
  • Deployment: basketball-api (managed by ArgoCD via pal-e-deployments)
  • Current strategy: Recreate
  • Desired strategy: RollingUpdate with maxUnavailable: 0, maxSurge: 1
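
For reference, the desired strategy above could look roughly like the following kustomize-style patch fragment. This is a sketch and is not copied from pal-e-deployments: the container name, probe path, and port are hypothetical placeholders. The readiness probe is included because `maxUnavailable: 0` only prevents downtime if Kubernetes can tell when the new pod is actually ready to serve traffic.

```yaml
# Sketch only: fields relevant to zero-downtime rollout, not the real manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: basketball-api
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # allow one extra pod while the new one starts
  template:
    spec:
      containers:
        - name: basketball-api     # assumption: container name is hypothetical
          readinessProbe:          # assumption: path and port are hypothetical
            httpGet:
              path: /healthz
              port: 8000
            periodSeconds: 5
```

With these settings the old pod keeps receiving traffic until the new pod reports ready, so a webhook arriving mid-deploy should still reach a running instance.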

Acceptance Criteria

  • Deployment strategy changed from Recreate to RollingUpdate
  • maxUnavailable: 0 ensures zero downtime during deploys
  • ArgoCD syncs the change successfully
  • Verify: trigger a deploy and confirm API stays reachable throughout (no 502s)

Related

  • project-westside-basketball — project this affects
  • forgejo_admin/basketball-api#343 — parent investigation
  • forgejo_admin/basketball-api#340 — original symptom (Daniel's "load failed")
  • Key files: pal-e-deployments/basketball-api/deployment-patch.yaml or equivalent kustomize overlay

Scope Review: READY

Review note: review-840-2026-04-04

Ticket is well-scoped. Single 3-line YAML deletion from overlays/basketball-api/prod/deployment-patch.yaml — the base template already provides the desired RollingUpdate with maxUnavailable: 0, maxSurge: 1. All 4 AC are testable. No blockers, no decomposition needed.

Minor recommendations (non-blocking):

  • [BODY] Fix file path reference: basketball-api/deployment-patch.yaml → overlays/basketball-api/prod/deployment-patch.yaml
  • [SCOPE] Create architecture note arch-basketball-api (platform-wide gap, not specific to this ticket)

Blast radius note: mcd-tracker and gcal-scheduler also use Recreate overrides — consider follow-up tickets if they need zero-downtime.
