Spike: Gmail OAuth token persistence in Postgres #130
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo_admin/basketball-api#130
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Type
Spike
Lineage
plan-wkq→ Phase 11 → discovered scope (email pipeline broke during login investigation)Repo
forgejo_admin/basketball-apiQuestion
How should basketball-api persist Gmail OAuth tokens in Postgres so that token refresh survives pod restarts without writable filesystem mounts?
Currently gmail-sdk reads/writes tokens from a JSON file mounted as a read-only k8s secret. Access token expires ~60 min, SDK refreshes via Google but can't write back. Email silently breaks. Caused ~29-hour undetected outage (2026-03-19 to 2026-03-21).
What to Explore
1. gmail-sdk token storage mechanism
gmail_sdk.GmailClient.__init__— how does it load tokens?2. Database schema options
oauth_tokenstable (type, account, token_json, expires_at, updated_at)tenantstable (token is per-tenant anyway)3. Token refresh flow
4. Migration path
gmail-oauth-tokensecret volume mount from pal-e-deployments kustomize5. Cross-repo impact
basketball-api— app code changes (email service, config, models, migration)pal-e-deployments— remove secret volume mount, remove secret resourcesSuccess Criteria
Time-box
2 hours investigation, written findings
Related
Westside Basketball— projectSpike Findings: Gmail OAuth Token Persistence in Postgres
1. gmail-sdk Token Storage Mechanism
Package:
gmail-sdk-ldraney==0.1.3(Lucas's own SDK, published to Forgejo PyPI)Token load flow (
auth.py->_load_and_refresh_token):GmailClient.__init__calls_load_and_refresh_token(account, secrets_dir)whenaccess_tokenis not provided_load_token(account, secrets_dir)reads{secrets_dir}/gmail-{account}.jsonexpires_at < time.time() + 300(5-minute buffer), it refreshes:{secrets_dir}/credentials.jsonforclient_id/client_secretrefresh_token_save_token()usingos.open(..., O_WRONLY | O_CREAT | O_TRUNC, 0o600)access_tokenstring, which is set on thehttpx.ClientheadersToken file structure (confirmed from live container):
Critical finding -- the root cause is confirmed:
readOnly: truein deployment-patch.yaml)_save_token()call throwsOSError: [Errno 30] Read-only file systemPluggable token store: NO.
The SDK uses static methods
_load_token/_save_tokenwith hardcoded file I/O. There is no callback, protocol, or dependency injection for custom storage. The three options are:AuthMixinand override_load_token/_save_token/_load_and_refresh_token(they're static methods, so this requires overriding_load_and_refresh_tokenasclassmethod)AuthMixin._load_tokenandAuthMixin._save_tokenTokenStoreprotocol (since Lucas owns the SDK)Recommendation: Option 3 -- add a
TokenStoreprotocol to gmail-sdk-ldraney. This is the clean enterprise path. The SDK is Lucas's own package, and aTokenStoreprotocol is a 30-minute change. Then basketball-api implements aPostgresTokenStorethat reads/writes from the DB.2. Current Usage in basketball-api
get_gmail_client(tenant)inservices/email.py(line 19):GmailClientper call (no singleton, no caching)tenant.gmail_accountas the account name (currently"westsidebasketball")settings.gmail_secrets_dirwhich resolves to/secrets/google-oauthin prodCall sites (4 email functions, each creates a fresh client):
send_confirmation_email()-- called from registration service + Stripe webhooksend_profile_reminder_email()-- admin endpointsend_roster_export_email()-- admin endpointsend_tryout_announcement_email()-- admin endpointConcurrency model: FastAPI with sync routes -> threadpool. Multiple requests can hit expired token simultaneously. Each creates its own GmailClient, each would independently attempt refresh.
3. Schema Recommendation
Recommendation: JSONB column on
tenantstable.Reasoning:
tenantstable already hasgmail_accountfield (the account alias)Schema change:
Encryption decision: NOT NEEDED at the application layer.
Reasoning:
POSTGRES_PASSWORDfrom k8s secrets4. Recommended Refresh Flow
Design: Single-refresh with DB persistence + last-write-wins.
Concurrency handling: Last-write-wins is sufficient.
Error handling:
5. Implementation Plan -- Two Repos
Phase A: gmail-sdk-ldraney (new feature)
Add a
TokenStoreprotocol so the client accepts custom storage backends:Then
GmailClient.__init__acceptstoken_store: TokenStore | None = None. If provided, uses it instead of file I/O.Files changed:
gmail_sdk/auth.py-- addTokenStoreprotocol, modify_load_and_refresh_tokento use itgmail_sdk/__init__.py-- exportTokenStoretests/test_auth.py-- test custom token store0.2.0Phase B: basketball-api
Files changed:
src/basketball_api/models.py-- addgmail_tokenJSONB +gmail_token_updated_atto Tenantalembic/versions/014_add_gmail_token_to_tenants.py-- migrationsrc/basketball_api/services/email.py-- rewriteget_gmail_client()to:tenant.gmail_tokenfrom DBPostgresTokenStoreclassGmailClient(token_store=pg_store)src/basketball_api/config.py-- keepgmail_secrets_diras fallback during transitiontests/-- update email service tests for DB-backed tokensPhase C: pal-e-deployments
Files changed:
overlays/basketball-api/prod/deployment-patch.yaml-- removegmail-oauthvolume mount + volumeBASKETBALL_GMAIL_SECRETS_DIRenv var temporarily as fallback, remove in follow-upPhase D: Cleanup
gmail-oauth-tokenk8s secretgmail-oauth-westsidebasketballk8s secret (there are 2 in the namespace)gmail_secrets_dirfrom config.py6. Story Point Estimate
7. Surprises and Blockers
Two gmail-oauth secrets exist in the basketball-api namespace:
gmail-oauth-token(11d old) andgmail-oauth-westsidebasketball(2d13h old). Onlygmail-oauth-tokenis mounted. The second one appears to be from a manual update attempt. Should be cleaned up.The SDK exception is unhandled. When
_save_token()fails on the read-only filesystem, it raisesOSError. This propagates up throughGmailClient.__init__, which means the client never gets created, and the email send fails entirely. This is worse than "can't persist" -- it means zero emails send after the first token expiry (not just "tokens don't persist").Token is currently NOT expired (
expires_at: 1774108514= 2026-03-19 + offset from refresh). This is because someone manually refreshed the k8s secret recently (the 2d13h old second secret suggests this). The 29-hour outage was real but was manually patched.gmail-sdk-ldraneyhas onlyhttpxas a dependency. Adding aProtocolforTokenStorerequires zero new dependencies. Clean.No other services consume gmail-oauth secrets. Only basketball-api namespace has gmail secrets. No cross-service impact.
The SDK creates a new
httpx.ClientperGmailClientinstance. Since basketball-api creates a new client per email send, this means a new TCP connection per email. Not a blocker, but worth noting for future connection pooling optimization.Next Step
Create a feature issue with the full scope from this spike, linking to this spike issue.
Spike Findings: Gmail OAuth Token Persistence in Postgres
1. Current Token Loading Mechanism
gmail-sdk v0.1.3 uses a purely file-based token storage model. The flow:
GmailClient.__init__receivesaccountandsecrets_dirparamsAuthMixin._load_and_refresh_token(account, secrets_dir)which:{secrets_dir}/gmail-{account}.jsonvia_load_token()(JSON file on disk)expires_at— if token expires within 300 seconds, refreshes via Googlecredentials.jsonfromsecrets_dirto getclient_id/client_secretfor refresh_save_token()(usesos.openwithO_WRONLY | O_CREAT | O_TRUNC, mode 0o600)self.access_tokenon the httpx client — one-shot at init time, no mid-request re-authCritical detail: The SDK refreshes at init time only. Once
GmailClientis constructed, the access token is baked into the httpxAuthorizationheader. There is no retry-on-401 or lazy refresh. If the token expires mid-session (after ~60 min), subsequent API calls fail with 401.basketball-api creates a new
GmailClientper email send (get_gmail_client()inservices/email.py), so each send triggers a fresh token load + potential refresh. This is actually safe for token expiry — the problem is that_save_token()fails silently or throws on read-only filesystems.pal-e-mail caches
GmailClientinstances in a module-level_clientsdict, so it only refreshes on first use per sender per process lifetime. This is more vulnerable to token expiry.Current k8s workaround (basketball-api): An init container copies the read-only k8s secret to an
emptyDirvolume so_save_token()can write. ButemptyDiris ephemeral — refreshed tokens are lost on pod restart, and the original k8s secret still has the old (possibly expired) access token. The refresh_token survives (Google doesn't revoke it), so the SDK can always re-refresh, but the cycle repeats every restart.pal-e-mail uses a PVC (
gmail-oauthPersistentVolumeClaim), which is more durable but still filesystem-dependent and creates operational complexity.2. Does gmail-sdk Support Pluggable Token Stores?
No. The SDK has no abstraction layer for token storage. The relevant methods are all
@staticmethodonAuthMixin:_load_token(account, secrets_dir)— hardcodedPath(secrets_dir) / f"gmail-{account}.json"file read_save_token(account, secrets_dir, token_data)— hardcoded file write_load_and_refresh_token(account, secrets_dir)— orchestrates load/refresh/save, returnsaccess_tokenstring_load_credentials(secrets_dir)— readscredentials.jsonfor client_id/client_secretAll four are static methods with no hook points, no callback interface, no class-level override pattern. The
GmailClient.__init__calls_load_and_refresh_tokendirectly.3. Recommended Approach
Option A (Recommended): Add a
TokenStoreprotocol to gmail-sdkAdd an abstract
TokenStoreinterface to gmail-sdk, with a defaultFileTokenStorethat preserves current behavior:Then
GmailClient.__init__accepts an optionaltoken_store: TokenStore | Noneparam. If provided, uses it instead of file-based methods. If not, createsFileTokenStore(secrets_dir)— fully backwards compatible.basketball-api then implements
PostgresTokenStore:Why Option A over wrapping at the app level: Intercepting
_save_tokenfrom outside the SDK is fragile (monkeypatching static methods). The SDK owns the refresh-then-save flow, so the abstraction belongs in the SDK. This also benefits pal-e-mail, which has the same problem.Option B (Alternative): Wrapper in basketball-api only, no SDK changes
Subclass
GmailClientand override the static methods. This works because Python allows overriding static methods on subclasses, but it is brittle —_load_and_refresh_tokencallscls._load_tokenandcls._save_token, so overriding on a subclass does work viaclsdispatch. However, this couples basketball-api to SDK internals and breaks if the SDK refactors.Recommendation: Option A. The SDK change is small (one new file, ~40 lines, plus ~10 lines changed in
auth.pyandclient.py), backwards compatible, and fixes the problem for all consumers.4. Database Schema
New
oauth_tokenstable (not a JSONB column ontenants):Why a separate table, not a column on tenants:
accountkey (e.g. "marcus") is the gmail-sdk concept, not the tenant slugEncryption decision: Not needed at this stage. Reasoning:
token_jsoncolumn, add decrypt inPostgresTokenStore.load).5. Token Refresh Flow and Concurrency
Concurrency handling: Last-write-wins is sufficient.
GmailClientper request (no caching). Two concurrent requests hitting an expired token will both refresh with Google. Google handles this gracefully — both get valid access tokens and the refresh_token is not rotated.oauth_tokenswill contain valid tokens. Whichever writes last wins. No mutex needed._load_and_refresh_tokenis synchronous and scoped to the request. No shared state.Error handling if DB write fails after successful Google refresh:
GmailClientinstance — the email still sends.6. Migration Path
One-time seed: Read current token file from k8s secret, INSERT into
oauth_tokenstable via alembic data migration or a management command.Deployment sequence:
TokenStoreprotocol (bump to v0.1.4)oauth_tokensmodel + alembic migration to basketball-apiPostgresTokenStorein basketball-apigmail-oauth-tokensecret,copy-gmail-oauthinit container, andgmail-oauth-writableemptyDir frompal-e-deployments/overlays/basketball-api/7. Complete File Change List
gmail-sdk (new minor version 0.1.4):
src/gmail_sdk/token_store.py— NEW:TokenStoreprotocol +FileTokenStoredefaultsrc/gmail_sdk/auth.py— MODIFY: refactor_load_token,_save_token,_load_and_refresh_tokento delegate to token storesrc/gmail_sdk/client.py— MODIFY: accept optionaltoken_storeparam in__init__src/gmail_sdk/__init__.py— MODIFY: exportTokenStore,FileTokenStoretests/test_token_store.py— NEW: unit tests for protocol + file storepyproject.toml— MODIFY: bump versionbasketball-api:
src/basketball_api/models.py— MODIFY: addOAuthTokenmodelsrc/basketball_api/token_store.py— NEW:PostgresTokenStoreimplementationsrc/basketball_api/services/email.py— MODIFY:get_gmail_client()passesPostgresTokenStoretoGmailClientalembic/versions/019_add_oauth_tokens_table.py— NEW: migrationpyproject.toml— MODIFY: bump gmail-sdk dependency to >=0.1.4pal-e-deployments:
overlays/basketball-api/prod/deployment-patch.yaml— MODIFY: remove init container, gmail-oauth volumes, gmail-oauth-writable emptyDir8. Effort Estimate
Split across 2-3 PRs: gmail-sdk first, then basketball-api + pal-e-deployments together.
9. Follow-up Ticket Scope
A feature ticket should cover: