Outbox processor dead since March 25 — 23 unsent welcome emails #402

Open
opened 2026-04-08 22:19:50 +00:00 by forgejo_admin · 0 comments

Type

Bug

Lineage

Standalone — discovered during post-signing user flow investigation (2026-04-08).

Repo

forgejo_admin/basketball-api

What Broke

The outbox processor stopped processing contract_signed events on 2026-03-25. Since then, 23 welcome emails have not been sent (21 pending, 2 failed). Every family that signed a contract in the last two weeks got radio silence — no welcome email, no GroupMe invite link.

Root cause: There is no Kubernetes CronJob deployed to poll the outbox. The only processing path is a fire-and-forget HTTP ping from westside-contracts (POST /admin/process-outbox?tenant_id=1), and when that ping fails, it fails silently with no alerting. The two failed events (David Kaneko 2026-03-26, Tristen Thorn 2026-03-27) suggest the Gmail OAuth token expired around that time: the Google app is in Testing mode, where tokens expire after 7 days. After those failures, subsequent events were never even attempted.
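As a stopgap on the caller side, the ping could at least report its own failures. The following is a minimal sketch, assuming the caller can be written in Python; `ping_outbox`, the logger name, and the `opener` parameter are hypothetical and do not come from westside-contracts. Only the endpoint path and `tenant_id=1` are taken from this issue.

```python
import logging
import urllib.request

logger = logging.getLogger("outbox_ping")  # hypothetical logger name


def ping_outbox(base_url, tenant_id=1, timeout=5.0, opener=urllib.request.urlopen):
    """POST to the process-outbox endpoint and report failure instead of hiding it.

    `opener` is injectable so the function can be exercised without a network.
    """
    url = f"{base_url}/admin/process-outbox?tenant_id={tenant_id}"
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError (4xx/5xx) and socket timeouts all derive from OSError.
        logger.error("outbox ping to %s failed: %s", url, exc)
        return False
    logger.info("outbox ping to %s returned HTTP %s", url, status)
    return True
```

A caller can then act on the boolean (retry, raise, or increment a failure counter) rather than discarding the outcome. Any authentication the admin endpoint requires is not covered here.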

Current state (as of 2026-04-08):

Status     Count  Date Range
processed  10     through 2026-03-25
failed     2      2026-03-26 to 2026-03-27
pending    21     2026-03-27 to 2026-04-08

Gmail OAuth is currently healthy — announcement emails sent successfully today (2026-04-08).

Additional data issue: Outbox event #13 (Creed Draney Jr) has team_id=11 in its payload, but only teams 1–7 exist. The processor falls back gracefully ("Westside Kings & Queens", no GroupMe link), but the stale team_id should be corrected before draining.
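A pre-drain check for this kind of stale reference might look like the sketch below. The event shape (an `id` plus a `payload` that is either a dict or a JSON string) is an assumption, not the actual outbox schema, and event 12 is a made-up healthy row for contrast. Only event #13, `team_id=11`, and the valid range 1–7 come from this issue.

```python
import json


def find_stale_team_ids(events, valid_team_ids):
    """Return (event_id, team_id) pairs whose payload references a missing team."""
    stale = []
    for event in events:
        payload = event["payload"]
        if isinstance(payload, str):
            payload = json.loads(payload)  # tolerate JSON-encoded payloads
        team_id = payload.get("team_id")
        if team_id is not None and team_id not in valid_team_ids:
            stale.append((event["id"], team_id))
    return stale


events = [
    {"id": 12, "payload": {"team_id": 3}},   # hypothetical healthy event
    {"id": 13, "payload": {"team_id": 11}},  # the stale reference described above
]
print(find_stale_team_ids(events, valid_team_ids=set(range(1, 8))))  # [(13, 11)]
```

Running a check like this before draining would surface any other events with the same problem, not just #13.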

Repro Steps

  1. Sign a contract on westside-contracts
  2. Observe outbox event created with status=pending
  3. Wait: no CronJob exists, and the fire-and-forget ping fails silently
  4. Event stays pending indefinitely
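The "stays pending indefinitely" condition is easy to detect once event age is compared against a threshold. Below is a rough sketch, assuming each event exposes a `status` and a timezone-aware `created_at`; those field names, the function name, and the sample rows are illustrative and not taken from outbox.py.

```python
from datetime import datetime, timedelta, timezone


def stuck_pending(events, now, max_age=timedelta(hours=1)):
    """Return events that are still pending after max_age, oldest first."""
    stuck = [
        event for event in events
        if event["status"] == "pending" and now - event["created_at"] > max_age
    ]
    return sorted(stuck, key=lambda event: event["created_at"])


now = datetime(2026, 4, 8, 22, 0, tzinfo=timezone.utc)
events = [  # hypothetical rows for illustration
    {"id": 101, "status": "pending", "created_at": now - timedelta(days=3)},
    {"id": 102, "status": "pending", "created_at": now - timedelta(minutes=10)},
    {"id": 103, "status": "processed", "created_at": now - timedelta(days=5)},
]
for event in stuck_pending(events, now):
    print(f"ALERT outbox event {event['id']} pending for {now - event['created_at']}")
```

With the sample rows, only event 101 is reported: 102 is still within the threshold and 103 has already been processed. The same predicate could back a log-based alert or a gauge metric.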

Expected Behavior

  • Welcome email sent within minutes of contract signing
  • Outbox events processed reliably via CronJob (not just fire-and-forget)
  • Failed events are retried or alerted on, not silently abandoned
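One possible shape for the scheduled path is a CronJob that simply issues the same POST on a timer. The sketch below builds such a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The job name, the `curlimages/curl` image, and the in-cluster service URL are assumptions, as is the lack of authentication. The endpoint, `tenant_id=1`, the namespace, and the five-minute schedule are the values mentioned in this issue.

```python
import json


def build_outbox_cronjob(service_url, schedule="*/5 * * * *", tenant_id=1,
                         namespace="basketball-api", image="curlimages/curl"):
    """Build a batch/v1 CronJob manifest that POSTs to the process-outbox endpoint."""
    url = f"{service_url}/admin/process-outbox?tenant_id={tenant_id}"
    container = {
        "name": "ping",
        "image": image,  # assumed image; any image that ships curl would do
        # --fail makes curl exit non-zero on HTTP errors, so the Job is marked failed
        "command": ["curl", "--fail", "--silent", "--show-error",
                    "--max-time", "60", "-X", "POST", url],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "outbox-processor", "namespace": namespace},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [container],
                        }
                    },
                }
            },
        },
    }


# Hypothetical in-cluster URL; substitute the real Service name and port.
manifest = build_outbox_cronjob("http://basketball-api.basketball-api.svc.cluster.local")
print(json.dumps(manifest, indent=2))
```

Because curl runs with `--fail`, a non-2xx response shows up as a failed Job in the cluster rather than disappearing, which gives alerting something to key on.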

Environment

  • Cluster/namespace: prod / basketball-api
  • No CronJob manifest exists in k8s/ directory
  • Gmail OAuth: healthy (announcements sent today)
  • Outbox processor code: src/basketball_api/services/outbox.py — code is correct, just never invoked

Acceptance Criteria

  • CronJob deployed in k8s/ that calls /admin/process-outbox?tenant_id=1 on a schedule (every 5 min)
  • All 23 stuck events drained (21 pending + 2 failed reset to pending)
  • Outbox event #13 (Creed) has correct team_id before draining
  • Prometheus metric or log-based alert for outbox events stuck in pending > 1 hour
  • failed events have a retry mechanism (reset to pending after N minutes, with max retry count)
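The retry criterion could be met with a small pass that runs before each processing cycle. This is only a sketch: the `retry_count` and `failed_at` fields, the function name, and the 15-minute and 3-attempt defaults are placeholders for the "N minutes" and "max retry count" left open above, not values from the codebase.

```python
from datetime import datetime, timedelta, timezone


def reset_failed_for_retry(events, now, retry_after=timedelta(minutes=15), max_retries=3):
    """Flip eligible failed events back to pending and return the ids that were reset."""
    reset_ids = []
    for event in events:
        if event["status"] != "failed":
            continue
        if event["retry_count"] >= max_retries:
            continue  # retries exhausted: leave as failed so a human or alert picks it up
        if now - event["failed_at"] < retry_after:
            continue  # failed too recently: wait before trying again
        event["status"] = "pending"
        event["retry_count"] += 1
        reset_ids.append(event["id"])
    return reset_ids


now = datetime(2026, 4, 8, 22, 0, tzinfo=timezone.utc)
events = [  # hypothetical rows for illustration
    {"id": 1, "status": "failed", "retry_count": 0, "failed_at": now - timedelta(hours=1)},
    {"id": 2, "status": "failed", "retry_count": 3, "failed_at": now - timedelta(hours=1)},
    {"id": 3, "status": "failed", "retry_count": 0, "failed_at": now - timedelta(minutes=1)},
]
print(reset_failed_for_retry(events, now))  # [1]
```

Capping retries keeps a permanently broken event (for example, a bad recipient address) from cycling forever, and events that hit the cap remain visible as failed.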

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • westside-basketball — project this affects
  • Outbox pattern: src/basketball_api/services/outbox.py
  • Email template: src/basketball_api/services/email.py:888-1010

Reference
ldraney/basketball-api#402