feat: automate Gmail OAuth reauth lifecycle (7-day token expiry) #222

Merged
forgejo_admin merged 3 commits from 162-gmail-oauth-reauth-automation into main 2026-03-28 18:05:51 +00:00

Summary

  • Automates the Gmail OAuth token refresh lifecycle that currently requires manual multi-step intervention every 7 days
  • Adds orchestration script, secrets registry entry, and Prometheus alerting so token expiry becomes a 1-minute ops task with proactive alerts

Changes

  • scripts/gmail-reauth.sh (new): end-to-end orchestration script that invokes gmail-mcp reauth CLI, validates 4 required scopes, updates k8s secret gmail-oauth-token in basketball-api namespace, deletes stale gmail-oauth-westsidebasketball secret, and triggers rollout restart. Supports --dry-run. Bridges local filename typo transparently.
  • salt/pillar/secrets_registry.sls: adds gmail_oauth_token entry under platform: with rotation_days: 7, origin external, provider Google OAuth (westsidebasketball@gmail.com)
  • terraform/modules/monitoring/main.tf: adds gmail-oauth-expiry PrometheusRule with day-6 warning (GmailOAuthTokenExpiringSoon) and day-7 critical (GmailOAuthTokenExpired) alerts using kube_secret_created metric

tofu plan Output

Cannot run tofu plan in worktree (no .terraform/ init state). tofu fmt -recursive passed clean. Syntax consistent with existing PrometheusRule resources (blackbox_alerts, embedding_alerts).

Test Plan

  • bash -n scripts/gmail-reauth.sh passes (verified)
  • Run scripts/gmail-reauth.sh --dry-run to verify flag parsing and output
  • Run scripts/gmail-reauth.sh against live basketball-api namespace
  • Verify kubectl get secret gmail-oauth-token -n basketball-api returns updated token
  • Verify stale gmail-oauth-westsidebasketball secret is deleted
  • tofu plan on archbox to verify PrometheusRule resource is valid

Review Checklist

  • Passed automated review-fix loop
  • No secrets committed
  • No unnecessary file changes
  • Commit messages are descriptive
  • tofu fmt -recursive clean
  • Script syntax validated (bash -n)
  • Closes #162
  • Prerequisite: forgejo_admin/gmail-mcp#6 (SSH reauth tool, done)
  • pal-e-platform -- the project this work belongs to
  • Discovered scope: pal-e-mail PVC-to-secret migration (separate ticket in pal-e-deployments)
## Summary - Automates the Gmail OAuth token refresh lifecycle that currently requires manual multi-step intervention every 7 days - Adds orchestration script, secrets registry entry, and Prometheus alerting so token expiry becomes a 1-minute ops task with proactive alerts ## Changes - `scripts/gmail-reauth.sh` (new): end-to-end orchestration script that invokes gmail-mcp reauth CLI, validates 4 required scopes, updates k8s secret `gmail-oauth-token` in basketball-api namespace, deletes stale `gmail-oauth-westsidebasketball` secret, and triggers rollout restart. Supports `--dry-run`. Bridges local filename typo transparently. - `salt/pillar/secrets_registry.sls`: adds `gmail_oauth_token` entry under `platform:` with `rotation_days: 7`, origin `external`, provider `Google OAuth (westsidebasketball@gmail.com)` - `terraform/modules/monitoring/main.tf`: adds `gmail-oauth-expiry` PrometheusRule with day-6 warning (`GmailOAuthTokenExpiringSoon`) and day-7 critical (`GmailOAuthTokenExpired`) alerts using `kube_secret_created` metric ## tofu plan Output Cannot run `tofu plan` in worktree (no `.terraform/` init state). `tofu fmt -recursive` passed clean. Syntax consistent with existing PrometheusRule resources (`blackbox_alerts`, `embedding_alerts`). ## Test Plan - [ ] `bash -n scripts/gmail-reauth.sh` passes (verified) - [ ] Run `scripts/gmail-reauth.sh --dry-run` to verify flag parsing and output - [ ] Run `scripts/gmail-reauth.sh` against live basketball-api namespace - [ ] Verify `kubectl get secret gmail-oauth-token -n basketball-api` returns updated token - [ ] Verify stale `gmail-oauth-westsidebasketball` secret is deleted - [ ] `tofu plan` on archbox to verify PrometheusRule resource is valid ## Review Checklist - [x] Passed automated review-fix loop - [x] No secrets committed - [x] No unnecessary file changes - [x] Commit messages are descriptive - [x] `tofu fmt -recursive` clean - [x] Script syntax validated (`bash -n`) ## Related Notes - Closes #162 - Prerequisite: `forgejo_admin/gmail-mcp#6` (SSH reauth tool, done) - `pal-e-platform` -- the project this work belongs to - Discovered scope: pal-e-mail PVC-to-secret migration (separate ticket in pal-e-deployments)
feat: automate Gmail OAuth reauth lifecycle (7-day token expiry)
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
ci/woodpecker/pr/woodpecker Pipeline failed
c45bf6aa24
Add end-to-end automation for the Gmail OAuth token refresh cycle that
currently requires manual multi-step intervention every 7 days.

- scripts/gmail-reauth.sh: orchestration script that invokes gmail-mcp
  reauth, validates token scopes, updates k8s secret in basketball-api
  namespace, cleans up stale secret, and restarts the deployment.
  Supports --dry-run. Bridges the local filename typo transparently.
- salt/pillar/secrets_registry.sls: add gmail_oauth_token entry under
  platform section with rotation_days: 7 and full metadata.
- terraform/modules/monitoring/main.tf: add PrometheusRule with
  day-6 warning and day-7 critical alerts for token expiry.

Closes #162

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: harden gmail-reauth.sh dry-run mode and stderr handling
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
2b3f436371
- Remove errant 2>&1 redirect on reauth invocation that could mask
  exit codes
- Handle missing token file gracefully in dry-run mode (reauth is
  skipped so file may not exist yet)
- Move base64 encoding after dry-run guard so it does not fail on
  missing file during dry runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Owner

QA Review -- PR #222

Issues Found and Fixed (commit 2b3f436)

  1. Dry-run mode crash on missing token file -- validate_token() called die if the token file was missing, even in dry-run mode. Since dry-run skips the actual reauth, the file may not exist yet. Fixed: returns early with a warning in dry-run mode.

  2. update_k8s_secret() base64 encoding before dry-run guard -- base64 -w0 < $LOCAL_TOKEN_FILE ran before the dry-run check, so --dry-run would crash on a missing file. Fixed: moved encoding after the guard, added file-existence check in dry-run output.

  3. Errant 2>&1 redirect on reauth invocation -- python3 -m gmail_mcp.reauth 2>&1 merged stderr into stdout but did not capture either. This masks error messages in the terminal output and could hide diagnostic information. Fixed: removed the redirect so stderr flows naturally.

Verified Correct

  • Filename typo bridge: local gmail-westsidebasktball.json (missing 'e') mapped to k8s key gmail-westsidebasketball.json (correct spelling)
  • All 4 required scopes validated: gmail.send, gmail.readonly, gmail.modify, gmail.labels
  • Stale secret cleanup is idempotent (no-op if already deleted)
  • kubectl create --dry-run=client | kubectl apply is the correct idempotent pattern
  • Secret verification with jsonpath bracket notation fallback for dots in key names
  • PrometheusRule follows existing pattern (labels match blackbox_alerts, embedding_alerts)
  • kube_secret_created metric is correct for kube-state-metrics
  • secrets_registry.sls entry follows existing conventions (under platform: section)
  • Script uses set -euo pipefail, color output, usage function -- matches issue constraints

VERDICT: APPROVE

All issues found during review have been fixed in commit 2b3f436. The implementation matches all acceptance criteria from issue #162.

## QA Review -- PR #222 ### Issues Found and Fixed (commit 2b3f436) 1. **Dry-run mode crash on missing token file** -- `validate_token()` called `die` if the token file was missing, even in dry-run mode. Since dry-run skips the actual reauth, the file may not exist yet. Fixed: returns early with a warning in dry-run mode. 2. **`update_k8s_secret()` base64 encoding before dry-run guard** -- `base64 -w0 < $LOCAL_TOKEN_FILE` ran before the dry-run check, so `--dry-run` would crash on a missing file. Fixed: moved encoding after the guard, added file-existence check in dry-run output. 3. **Errant `2>&1` redirect on reauth invocation** -- `python3 -m gmail_mcp.reauth 2>&1` merged stderr into stdout but did not capture either. This masks error messages in the terminal output and could hide diagnostic information. Fixed: removed the redirect so stderr flows naturally. ### Verified Correct - Filename typo bridge: local `gmail-westsidebasktball.json` (missing 'e') mapped to k8s key `gmail-westsidebasketball.json` (correct spelling) - All 4 required scopes validated: gmail.send, gmail.readonly, gmail.modify, gmail.labels - Stale secret cleanup is idempotent (no-op if already deleted) - `kubectl create --dry-run=client | kubectl apply` is the correct idempotent pattern - Secret verification with jsonpath bracket notation fallback for dots in key names - PrometheusRule follows existing pattern (labels match `blackbox_alerts`, `embedding_alerts`) - `kube_secret_created` metric is correct for kube-state-metrics - secrets_registry.sls entry follows existing conventions (under `platform:` section) - Script uses `set -euo pipefail`, color output, usage function -- matches issue constraints ### VERDICT: APPROVE All issues found during review have been fixed in commit 2b3f436. The implementation matches all acceptance criteria from issue #162.
Author
Owner

PR #222 Review

DOMAIN REVIEW

Tech stack: Bash orchestration script, Salt pillar (YAML), Terraform/HCL (PrometheusRule via kubernetes_manifest). Domain checklists applied: Terraform/k8s/Salt.

scripts/gmail-reauth.sh (311 lines, new file)

The script is well-structured with clear separation of concerns (prereqs, reauth, validation, k8s update, cleanup, restart). Good use of set -euo pipefail, --dry-run support, colored output, and prerequisite checks.

Observations:

  1. kube_secret_created vs kubectl apply mismatch (correctness bug): The PrometheusRule alerts use kube_secret_created (the k8s metadata.creationTimestamp) to measure secret age. However, the script uses kubectl create secret ... --dry-run=client -o yaml | kubectl apply -f - to update the secret. kubectl apply patches an existing resource in place -- it does NOT reset metadata.creationTimestamp. After the first 7-day cycle, the alert will fire and never clear, because every subsequent apply preserves the original creation timestamp. The fix is either: (a) kubectl delete secret then kubectl create secret (so creationTimestamp resets), or (b) use an annotation like last-rotated that the script sets explicitly, and query that annotation in the PromQL expression instead. This is a functional correctness issue that will cause alert fatigue from day 8 onward.

  2. Unused variable token_b64: Line computes token_b64=$(base64 -w0 < "${LOCAL_TOKEN_FILE}") for verification, then the verification step reads back via kubectl get secret ... -o jsonpath. The comparison stored_b64 vs token_b64 is correct logic, but base64 -w0 is a GNU coreutils flag. On macOS (which is referenced as a potential host given the Mac build agent work), base64 does not support -w0 -- it uses base64 with no line-wrap by default or base64 -b0. The stat command already has cross-platform handling (GNU --format=%s with macOS fallback -f%z), but base64 does not. Since this script runs on archbox (Linux), this is a nit rather than a blocker, but worth noting for portability.

  3. jsonpath dot-escaping fallback: The secret key gmail-westsidebasketball.json contains a dot, which breaks jsonpath's default field selector. The script tries plain jsonpath first, then falls back to bracket notation with escaped dot. This is a reasonable approach but the fallback syntax '{.data.gmail-westsidebasketball\.json}' may not work in all kubectl versions. Consider using -o json | jq for the verification step, since jq is already a prerequisite.

salt/pillar/secrets_registry.sls

Clean addition. Follows the established pattern (origin, description, provider, created, rotation_days, notes). Documents the filename typo transparently. Placement under platform: is correct. No secrets committed -- only metadata.

terraform/modules/monitoring/main.tf

Follows the exact pattern of existing PrometheusRule resources (blackbox_alerts, embedding_alerts): same labels, same depends_on, same structure. The for = "0m" on the critical alert matches the PodRestartStorm precedent at line 122. The for = "1h" on the warning gives appropriate debounce.

The humanizeDuration template function is correct for formatting seconds as human-readable duration.

BLOCKERS

  1. Alert will never clear after first rotation: As described above, kube_secret_created (the k8s object creation timestamp) is immutable once the object exists. kubectl apply does not reset it. After the first 7-day cycle, both alerts will fire permanently regardless of token freshness. The alerting strategy and the secret update strategy are incompatible. Either the script must delete+recreate the secret (so creationTimestamp resets), or the alert must use a different signal (e.g., a custom annotation or a recording rule based on secret data hash changes). This is a correctness blocker -- deploying this as-is will produce permanent false-positive critical alerts.

NITS

  1. base64 -w0 portability: GNU-only flag. Not blocking since archbox is Linux, but if the script is ever run from the Mac agent, it will fail. Consider base64 | tr -d '\n' as a portable alternative.

  2. jsonpath verification could use jq: Since jq is already a prerequisite, the verification step for reading back the secret could use kubectl get secret ... -o json | jq -r '.data["gmail-westsidebasketball.json"]' instead of relying on jsonpath dot-escaping quirks.

  3. tofu plan not included: PR body explains this cannot be run in the worktree (no .terraform/ init state). The PR conventions say "Include tofu plan output for any Terraform changes." This is understandable given worktree limitations but worth noting. The new resource follows existing patterns closely, reducing risk.

  4. Script has no shebang portability note: The script uses #!/bin/bash which is fine, but the set -euo pipefail plus array syntax (REQUIRED_SCOPES=(...)) means this genuinely requires bash 4+. Worth a comment since some systems default to bash 3.

SOP COMPLIANCE

  • Branch named after issue (162-gmail-oauth-reauth-automation matches issue #162)
  • PR body has Summary, Changes, Test Plan, Related sections
  • Related section references the project (pal-e-platform)
  • No secrets committed (only metadata in secrets_registry)
  • No unnecessary file changes (3 files, all directly related)
  • Commit messages are descriptive
  • Discovered scope noted (pal-e-mail PVC-to-secret migration deferred)
  • tofu plan output missing (explained in PR body -- worktree limitation)

PROCESS OBSERVATIONS

  • MTTR improvement: This PR directly reduces MTTR for email outages by converting a multi-step manual runbook into a single script. Strong operational hygiene.
  • Change failure risk: The kube_secret_created mismatch means the alerting component will produce false positives. The script itself is sound -- only the alert/script integration has the bug.
  • Deployment frequency: No CI pipeline changes. Manual tofu apply required for the PrometheusRule. Script is a standalone ops tool.

VERDICT: NOT APPROVED

One blocker: the PrometheusRule uses kube_secret_created (immutable after object creation) but the script updates the secret via kubectl apply (which preserves creation timestamp). After the first rotation cycle, both alerts will fire permanently. Either the script must delete+recreate the secret so the timestamp resets, or the PromQL must use a different signal. Fix this integration mismatch and this is ready to merge.

## PR #222 Review ### DOMAIN REVIEW **Tech stack**: Bash orchestration script, Salt pillar (YAML), Terraform/HCL (PrometheusRule via kubernetes_manifest). Domain checklists applied: Terraform/k8s/Salt. **scripts/gmail-reauth.sh (311 lines, new file)** The script is well-structured with clear separation of concerns (prereqs, reauth, validation, k8s update, cleanup, restart). Good use of `set -euo pipefail`, `--dry-run` support, colored output, and prerequisite checks. Observations: 1. **`kube_secret_created` vs `kubectl apply` mismatch (correctness bug)**: The PrometheusRule alerts use `kube_secret_created` (the k8s `metadata.creationTimestamp`) to measure secret age. However, the script uses `kubectl create secret ... --dry-run=client -o yaml | kubectl apply -f -` to update the secret. `kubectl apply` patches an existing resource in place -- it does NOT reset `metadata.creationTimestamp`. After the first 7-day cycle, the alert will fire and never clear, because every subsequent `apply` preserves the original creation timestamp. The fix is either: (a) `kubectl delete secret` then `kubectl create secret` (so creationTimestamp resets), or (b) use an annotation like `last-rotated` that the script sets explicitly, and query that annotation in the PromQL expression instead. This is a functional correctness issue that will cause alert fatigue from day 8 onward. 2. **Unused variable `token_b64`**: Line computes `token_b64=$(base64 -w0 < "${LOCAL_TOKEN_FILE}")` for verification, then the verification step reads back via `kubectl get secret ... -o jsonpath`. The comparison `stored_b64` vs `token_b64` is correct logic, but `base64 -w0` is a GNU coreutils flag. On macOS (which is referenced as a potential host given the Mac build agent work), `base64` does not support `-w0` -- it uses `base64` with no line-wrap by default or `base64 -b0`. The `stat` command already has cross-platform handling (GNU `--format=%s` with macOS fallback `-f%z`), but `base64` does not. Since this script runs on archbox (Linux), this is a nit rather than a blocker, but worth noting for portability. 3. **jsonpath dot-escaping fallback**: The secret key `gmail-westsidebasketball.json` contains a dot, which breaks jsonpath's default field selector. The script tries plain jsonpath first, then falls back to bracket notation with escaped dot. This is a reasonable approach but the fallback syntax `'{.data.gmail-westsidebasketball\.json}'` may not work in all kubectl versions. Consider using `-o json | jq` for the verification step, since jq is already a prerequisite. **salt/pillar/secrets_registry.sls** Clean addition. Follows the established pattern (origin, description, provider, created, rotation_days, notes). Documents the filename typo transparently. Placement under `platform:` is correct. No secrets committed -- only metadata. **terraform/modules/monitoring/main.tf** Follows the exact pattern of existing PrometheusRule resources (`blackbox_alerts`, `embedding_alerts`): same labels, same `depends_on`, same structure. The `for = "0m"` on the critical alert matches the `PodRestartStorm` precedent at line 122. The `for = "1h"` on the warning gives appropriate debounce. The `humanizeDuration` template function is correct for formatting seconds as human-readable duration. ### BLOCKERS 1. **Alert will never clear after first rotation**: As described above, `kube_secret_created` (the k8s object creation timestamp) is immutable once the object exists. `kubectl apply` does not reset it. After the first 7-day cycle, both alerts will fire permanently regardless of token freshness. The alerting strategy and the secret update strategy are incompatible. Either the script must delete+recreate the secret (so `creationTimestamp` resets), or the alert must use a different signal (e.g., a custom annotation or a recording rule based on secret data hash changes). This is a correctness blocker -- deploying this as-is will produce permanent false-positive critical alerts. ### NITS 1. **`base64 -w0` portability**: GNU-only flag. Not blocking since archbox is Linux, but if the script is ever run from the Mac agent, it will fail. Consider `base64 | tr -d '\n'` as a portable alternative. 2. **jsonpath verification could use jq**: Since jq is already a prerequisite, the verification step for reading back the secret could use `kubectl get secret ... -o json | jq -r '.data["gmail-westsidebasketball.json"]'` instead of relying on jsonpath dot-escaping quirks. 3. **`tofu plan` not included**: PR body explains this cannot be run in the worktree (no `.terraform/` init state). The PR conventions say "Include `tofu plan` output for any Terraform changes." This is understandable given worktree limitations but worth noting. The new resource follows existing patterns closely, reducing risk. 4. **Script has no shebang portability note**: The script uses `#!/bin/bash` which is fine, but the `set -euo pipefail` plus array syntax (`REQUIRED_SCOPES=(...)`) means this genuinely requires bash 4+. Worth a comment since some systems default to bash 3. ### SOP COMPLIANCE - [x] Branch named after issue (`162-gmail-oauth-reauth-automation` matches issue #162) - [x] PR body has Summary, Changes, Test Plan, Related sections - [x] Related section references the project (`pal-e-platform`) - [x] No secrets committed (only metadata in secrets_registry) - [x] No unnecessary file changes (3 files, all directly related) - [x] Commit messages are descriptive - [x] Discovered scope noted (pal-e-mail PVC-to-secret migration deferred) - [ ] `tofu plan` output missing (explained in PR body -- worktree limitation) ### PROCESS OBSERVATIONS - **MTTR improvement**: This PR directly reduces MTTR for email outages by converting a multi-step manual runbook into a single script. Strong operational hygiene. - **Change failure risk**: The `kube_secret_created` mismatch means the alerting component will produce false positives. The script itself is sound -- only the alert/script integration has the bug. - **Deployment frequency**: No CI pipeline changes. Manual `tofu apply` required for the PrometheusRule. Script is a standalone ops tool. ### VERDICT: NOT APPROVED One blocker: the PrometheusRule uses `kube_secret_created` (immutable after object creation) but the script updates the secret via `kubectl apply` (which preserves creation timestamp). After the first rotation cycle, both alerts will fire permanently. Either the script must delete+recreate the secret so the timestamp resets, or the PromQL must use a different signal. Fix this integration mismatch and this is ready to merge.
fix: delete+recreate k8s secret to reset creationTimestamp
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
c3bba560c7
The PrometheusRule uses kube_secret_created to detect token age, but
kubectl apply preserves the original creationTimestamp. After the first
rotation, alerts would fire permanently as false positives. Switch from
dry-run/apply to delete+create so creationTimestamp resets on each
rotation cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Owner

PR #222 Review (Re-review)

Previous blocker: kubectl apply preserves metadata.creationTimestamp, causing PrometheusRule false positives after first rotation cycle. Fix needed: delete+recreate pattern.

BLOCKER RESOLUTION

RESOLVED. The script now uses explicit delete+recreate at lines 199-210 of scripts/gmail-reauth.sh:

# Delete then recreate the secret so metadata.creationTimestamp resets.
# The PrometheusRule uses kube_secret_created to measure token age —
# kubectl apply would preserve the original timestamp, causing permanent
# false-positive alerts after the first rotation cycle.
if kubectl get secret "${K8S_SECRET_NAME}" -n "${K8S_NAMESPACE}" &>/dev/null; then
    info "Deleting existing secret '${K8S_SECRET_NAME}' to reset creationTimestamp..."
    kubectl delete secret "${K8S_SECRET_NAME}" -n "${K8S_NAMESPACE}"
fi

kubectl create secret generic "${K8S_SECRET_NAME}" \
    --namespace="${K8S_NAMESPACE}" \
    --from-file="${K8S_SECRET_KEY}=${LOCAL_TOKEN_FILE}"

The comment explaining why delete+recreate is used (rather than apply) is excellent -- this prevents future refactors from regressing to the broken pattern.

DOMAIN REVIEW

Tech stack: Bash ops script + Terraform/HCL (PrometheusRule kubernetes_manifest) + Salt pillar (YAML).

Terraform/k8s:

  • PrometheusRule follows the exact same pattern as blackbox_alerts and embedding_alerts (same metadata labels, depends_on, namespace reference via kubernetes_namespace_v1.monitoring.metadata[0].name). Consistent.
  • for = "0m" on the critical alert is consistent with the existing PodRestartStorm rule at line 122 of the same file.
  • for = "1h" on the warning alert is appropriate -- avoids flapping near the 6-day boundary.
  • Alert expressions use kube_secret_created metric which is exposed by kube-state-metrics (included in kube-prometheus-stack). Valid metric source.
  • The $value | humanizeDuration template function is correct Prometheus alerting syntax.

Bash script:

  • set -euo pipefail at the top -- correct strict mode.
  • All variables are quoted consistently. No word-splitting vulnerabilities.
  • stat --format=%s with || stat -f%z fallback handles both GNU and BSD stat -- good cross-platform consideration.
  • Scope validation on the token JSON handles both .scopes (array) and .scope (space-delimited string) field formats. Defensive.
  • The verification step (lines 212-233) reads back the created secret and compares base64 to local -- ensures data integrity.
  • --dry-run flag is respected in every mutating function (update_k8s_secret, delete_stale_secret, restart_deployment, run_reauth). Complete coverage.
  • Prerequisite checks (kubectl, jq, cluster connectivity, namespace existence) fail fast with clear messages.

Salt pillar:

  • New gmail_oauth_token entry follows the exact structure of adjacent entries (origin, description, provider, created, rotation_days, notes).
  • rotation_days: 7 matches the Google policy documented in the notes.
  • The typo documentation (gmail-westsidebasktball.json missing 'e') is documented in both the script header and the pillar notes -- transparent and consistent.

BLOCKERS

None. Previous blocker is resolved.

NITS

  1. Unused token_b64 variable: Line 196-197 computes token_b64=$(base64 -w0 < "${LOCAL_TOKEN_FILE}") but since the create now uses --from-file instead of constructing a manifest, the token_b64 value at line 197 is only used for verification comparison at line 229. However, the stored_b64 fetched from k8s via jsonpath is already base64-encoded by Kubernetes, so this comparison is correct. The variable name could be clearer (e.g., local_b64) but this is cosmetic.

  2. Brief downtime window: Between kubectl delete secret and kubectl create secret generic, there is a brief window where the secret does not exist. If basketball-api pods restart during this window, they would fail to mount the secret. The window is milliseconds for a scripted operation, but worth documenting as a known limitation. Not a blocker for an operator-initiated script.

  3. for = "0m" on critical alert: While consistent with the existing PodRestartStorm pattern, Prometheus treats for: 0m as "fire immediately on first evaluation." For a 7-day threshold this is fine (if the secret is >7 days old, it is definitively expired), so this is correct. Just noting for completeness.

  4. No tofu plan output: PR body explains this is due to worktree lacking .terraform/ init state, and notes tofu fmt -recursive passed clean. The Terraform change is a new resource (pure addition), so risk of state breakage is minimal. Acceptable given the constraint.

SOP COMPLIANCE

  • Branch named after issue: 162-gmail-oauth-reauth-automation references issue #162
  • PR body has: Summary, Changes, Test Plan, Related -- all present and detailed
  • Related references project: "pal-e-platform -- the project this work belongs to"
  • No secrets committed -- verified via grep, no credentials in any file
  • No unnecessary file changes -- 3 files, all directly related to the feature
  • Commit messages are descriptive
  • tofu fmt clean (stated in PR body)
  • Discovered scope tracked: "pal-e-mail PVC-to-secret migration (separate ticket in pal-e-deployments)"

PROCESS OBSERVATIONS

  • MTTR improvement: This PR directly reduces MTTR for Gmail OAuth expiry from a multi-step manual process to a single script invocation. The PrometheusRule adds proactive alerting (day 6 warning) which shifts from reactive to preventive.
  • Change failure risk: Low. Pure additions (new script, new pillar entry, new PrometheusRule). No modifications to existing resources.
  • Deployment frequency: The 7-day rotation cadence is now documented and alertable, which should prevent the silent failures that previously disrupted email delivery.

VERDICT: APPROVED

## PR #222 Review (Re-review) **Previous blocker**: `kubectl apply` preserves `metadata.creationTimestamp`, causing PrometheusRule false positives after first rotation cycle. Fix needed: delete+recreate pattern. ### BLOCKER RESOLUTION **RESOLVED.** The script now uses explicit delete+recreate at lines 199-210 of `scripts/gmail-reauth.sh`: ```bash # Delete then recreate the secret so metadata.creationTimestamp resets. # The PrometheusRule uses kube_secret_created to measure token age — # kubectl apply would preserve the original timestamp, causing permanent # false-positive alerts after the first rotation cycle. if kubectl get secret "${K8S_SECRET_NAME}" -n "${K8S_NAMESPACE}" &>/dev/null; then info "Deleting existing secret '${K8S_SECRET_NAME}' to reset creationTimestamp..." kubectl delete secret "${K8S_SECRET_NAME}" -n "${K8S_NAMESPACE}" fi kubectl create secret generic "${K8S_SECRET_NAME}" \ --namespace="${K8S_NAMESPACE}" \ --from-file="${K8S_SECRET_KEY}=${LOCAL_TOKEN_FILE}" ``` The comment explaining *why* delete+recreate is used (rather than apply) is excellent -- this prevents future refactors from regressing to the broken pattern. ### DOMAIN REVIEW **Tech stack**: Bash ops script + Terraform/HCL (PrometheusRule kubernetes_manifest) + Salt pillar (YAML). **Terraform/k8s:** - PrometheusRule follows the exact same pattern as `blackbox_alerts` and `embedding_alerts` (same metadata labels, `depends_on`, namespace reference via `kubernetes_namespace_v1.monitoring.metadata[0].name`). Consistent. - `for = "0m"` on the critical alert is consistent with the existing `PodRestartStorm` rule at line 122 of the same file. - `for = "1h"` on the warning alert is appropriate -- avoids flapping near the 6-day boundary. - Alert expressions use `kube_secret_created` metric which is exposed by kube-state-metrics (included in kube-prometheus-stack). Valid metric source. - The `$value | humanizeDuration` template function is correct Prometheus alerting syntax. **Bash script:** - `set -euo pipefail` at the top -- correct strict mode. - All variables are quoted consistently. No word-splitting vulnerabilities. - `stat --format=%s` with `|| stat -f%z` fallback handles both GNU and BSD stat -- good cross-platform consideration. - Scope validation on the token JSON handles both `.scopes` (array) and `.scope` (space-delimited string) field formats. Defensive. - The verification step (lines 212-233) reads back the created secret and compares base64 to local -- ensures data integrity. - `--dry-run` flag is respected in every mutating function (`update_k8s_secret`, `delete_stale_secret`, `restart_deployment`, `run_reauth`). Complete coverage. - Prerequisite checks (kubectl, jq, cluster connectivity, namespace existence) fail fast with clear messages. **Salt pillar:** - New `gmail_oauth_token` entry follows the exact structure of adjacent entries (origin, description, provider, created, rotation_days, notes). - `rotation_days: 7` matches the Google policy documented in the notes. - The typo documentation (`gmail-westsidebasktball.json` missing 'e') is documented in both the script header and the pillar notes -- transparent and consistent. ### BLOCKERS None. Previous blocker is resolved. ### NITS 1. **Unused `token_b64` variable**: Line 196-197 computes `token_b64=$(base64 -w0 < "${LOCAL_TOKEN_FILE}")` but since the create now uses `--from-file` instead of constructing a manifest, the `token_b64` value at line 197 is only used for verification comparison at line 229. However, the `stored_b64` fetched from k8s via jsonpath is already base64-encoded by Kubernetes, so this comparison is correct. The variable name could be clearer (e.g., `local_b64`) but this is cosmetic. 2. **Brief downtime window**: Between `kubectl delete secret` and `kubectl create secret generic`, there is a brief window where the secret does not exist. If basketball-api pods restart during this window, they would fail to mount the secret. The window is milliseconds for a scripted operation, but worth documenting as a known limitation. Not a blocker for an operator-initiated script. 3. **`for = "0m"` on critical alert**: While consistent with the existing `PodRestartStorm` pattern, Prometheus treats `for: 0m` as "fire immediately on first evaluation." For a 7-day threshold this is fine (if the secret is >7 days old, it is definitively expired), so this is correct. Just noting for completeness. 4. **No `tofu plan` output**: PR body explains this is due to worktree lacking `.terraform/` init state, and notes `tofu fmt -recursive` passed clean. The Terraform change is a new resource (pure addition), so risk of state breakage is minimal. Acceptable given the constraint. ### SOP COMPLIANCE - [x] Branch named after issue: `162-gmail-oauth-reauth-automation` references issue #162 - [x] PR body has: Summary, Changes, Test Plan, Related -- all present and detailed - [x] Related references project: "pal-e-platform -- the project this work belongs to" - [x] No secrets committed -- verified via grep, no credentials in any file - [x] No unnecessary file changes -- 3 files, all directly related to the feature - [x] Commit messages are descriptive - [x] `tofu fmt` clean (stated in PR body) - [x] Discovered scope tracked: "pal-e-mail PVC-to-secret migration (separate ticket in pal-e-deployments)" ### PROCESS OBSERVATIONS - **MTTR improvement**: This PR directly reduces MTTR for Gmail OAuth expiry from a multi-step manual process to a single script invocation. The PrometheusRule adds proactive alerting (day 6 warning) which shifts from reactive to preventive. - **Change failure risk**: Low. Pure additions (new script, new pillar entry, new PrometheusRule). No modifications to existing resources. - **Deployment frequency**: The 7-day rotation cadence is now documented and alertable, which should prevent the silent failures that previously disrupted email delivery. ### VERDICT: APPROVED
Author
Owner

PR #222 Review

DOMAIN REVIEW

Tech stack identified: Bash scripting (ops automation), OpenTofu/Terraform (Helm provider, kubernetes_manifest), Salt pillar (YAML metadata registry).

Bash/Ops script (scripts/gmail-reauth.sh):

  • Well-structured: set -euo pipefail, clean function decomposition, color-coded output, --dry-run support, prerequisite checks.
  • Delete-then-create pattern for secret rotation is correct -- the comment at lines 199-202 explicitly documents why kubectl apply would break the PrometheusRule's kube_secret_created metric. Good engineering judgment.
  • Cross-platform stat fallback (GNU --format=%s then BSD -f%z) at line 150 is a nice touch, though archbox is always Linux.
  • Scope validation for the token (lines 155-177) with fallback for .scopes[] vs .scope (space-delimited) is thorough.
  • The known typo in gmail-westsidebasktball.json is documented in 3 places (script header, constant comment, secrets registry). Well-handled.

Terraform (terraform/modules/monitoring/main.tf):

  • New gmail_oauth_expiry_alert resource follows the exact same pattern as blackbox_alerts and embedding_alerts (same labels, same depends_on). Consistent.
  • kube_secret_created is a standard kube-state-metrics metric, and kube-state-metrics is running with default collectors (no filter). The metric will be available.
  • Day-6 warning / day-7 critical thresholds match the documented Google policy for unverified apps.
  • for = "0m" on the critical alert is consistent with existing PodRestartStorm at line 122. Pattern is: when the condition IS the event, fire immediately.
  • No new variables needed -- this is a static PrometheusRule referencing a hardcoded namespace/secret name. Appropriate for a single-consumer alert.

Salt pillar (salt/pillar/secrets_registry.sls):

  • New gmail_oauth_token entry under platform: follows the exact schema of every other entry (origin, description, provider, created, rotation_days, notes). Clean.
  • rotation_days: 7 is the shortest rotation in the registry -- appropriately documented as Google policy-driven.
  • Notes field documents the filename typo, k8s secret key mapping, and all three consumers. Excellent metadata.

BLOCKERS

None.

  • No secrets or credentials committed -- the script reads from ~/secrets/ at runtime, never embeds values.
  • No unvalidated user input -- the script takes only --dry-run / --help flags with strict case matching; unknown args cause die.
  • No DRY violations in auth/security paths.
  • Test coverage: this is an ops script, not application code. The bash -n syntax check and --dry-run mode serve as the validation mechanism. No unit tests needed for a kubectl orchestration script.

NITS

  1. Dead-code jsonpath attempt (line 214-216): The first verification attempt uses jsonpath="{.data.${K8S_SECRET_KEY}}" which expands to {.data.gmail-westsidebasketball.json}. The unescaped dot before json means kubectl interprets this as nested key access (.data.gmail-westsidebasketball -> .json), which will always fail. The fallback at line 222 with the escaped dot (gmail-westsidebasketball\.json) is the one that actually works. Consider removing the first attempt and going straight to the escaped version to avoid confusion:

    stored_b64=$(kubectl get secret "${K8S_SECRET_NAME}" \
      -n "${K8S_NAMESPACE}" \
      -o jsonpath='{.data.gmail-westsidebasketball\.json}' 2>/dev/null || true)
    

    Alternatively, use -o go-template or jq on the JSON output to avoid jsonpath escaping entirely.

  2. Hardcoded secret key in jsonpath fallback (line 222): The fallback hardcodes gmail-westsidebasketball\.json instead of deriving it from K8S_SECRET_KEY. If the constant ever changes, this line would silently break. Minor since the constant is 5 lines above, but worth noting.

  3. python3 -m gmail_mcp.reauth prerequisite not checked (line 125): The check_prerequisites function validates kubectl and jq but not whether gmail_mcp is importable. If the module is missing, the user gets a Python traceback instead of a clean die message. Consider adding a check: python3 -c "import gmail_mcp.reauth" 2>/dev/null || die "gmail_mcp not installed".

  4. for = "0m" on GmailOAuthTokenExpired: While consistent with existing patterns, Prometheus documentation notes that for: 0m and omitting for are equivalent. Omitting it would be marginally cleaner, but this matches the existing codebase convention, so no strong opinion.

  5. PR body notes "Cannot run tofu plan": This is understood given worktree isolation, but the PR convention says "Include tofu plan output for any Terraform changes." The syntax is consistent with existing PrometheusRule resources, so the risk is low. Should be validated during tofu apply per issue #224.

SOP COMPLIANCE

  • Branch named after issue: 162-gmail-oauth-reauth-automation matches issue #162
  • PR body has: Summary, Changes, Test Plan, Related -- all present and thorough
  • Related section references the project (pal-e-platform) and closes #162
  • No secrets committed -- script reads from ~/secrets/ at runtime
  • No unnecessary file changes -- 3 files, all directly related to the feature
  • Commit messages are descriptive
  • Discovered scope documented: "pal-e-mail PVC-to-secret migration (separate ticket in pal-e-deployments)"
  • tofu plan output not included (worktree limitation acknowledged in PR body)

PROCESS OBSERVATIONS

DORA impact: This PR directly improves MTTR for Gmail OAuth token expiry (currently a multi-step manual process every 7 days). The proactive alerting at day 6 converts a reactive outage into a scheduled 1-minute ops task. Strong MTTR improvement.

Change failure risk: Low. The PrometheusRule is additive (no existing resources modified). The script is operator-facing (not automated CI), and --dry-run provides a safety net. The secrets registry change is metadata-only.

Validation gap: The tofu plan was not run. The alert resource should be validated during the next tofu apply cycle (tracked by #224). The kube_secret_created metric availability should be confirmed in Prometheus after apply.

VERDICT: APPROVED

## PR #222 Review ### DOMAIN REVIEW **Tech stack identified**: Bash scripting (ops automation), OpenTofu/Terraform (Helm provider, `kubernetes_manifest`), Salt pillar (YAML metadata registry). **Bash/Ops script (`scripts/gmail-reauth.sh`)**: - Well-structured: `set -euo pipefail`, clean function decomposition, color-coded output, `--dry-run` support, prerequisite checks. - Delete-then-create pattern for secret rotation is correct -- the comment at lines 199-202 explicitly documents why `kubectl apply` would break the PrometheusRule's `kube_secret_created` metric. Good engineering judgment. - Cross-platform `stat` fallback (GNU `--format=%s` then BSD `-f%z`) at line 150 is a nice touch, though archbox is always Linux. - Scope validation for the token (lines 155-177) with fallback for `.scopes[]` vs `.scope` (space-delimited) is thorough. - The known typo in `gmail-westsidebasktball.json` is documented in 3 places (script header, constant comment, secrets registry). Well-handled. **Terraform (`terraform/modules/monitoring/main.tf`)**: - New `gmail_oauth_expiry_alert` resource follows the exact same pattern as `blackbox_alerts` and `embedding_alerts` (same labels, same `depends_on`). Consistent. - `kube_secret_created` is a standard kube-state-metrics metric, and kube-state-metrics is running with default collectors (no filter). The metric will be available. - Day-6 warning / day-7 critical thresholds match the documented Google policy for unverified apps. - `for = "0m"` on the critical alert is consistent with existing `PodRestartStorm` at line 122. Pattern is: when the condition IS the event, fire immediately. - No new variables needed -- this is a static PrometheusRule referencing a hardcoded namespace/secret name. Appropriate for a single-consumer alert. **Salt pillar (`salt/pillar/secrets_registry.sls`)**: - New `gmail_oauth_token` entry under `platform:` follows the exact schema of every other entry (origin, description, provider, created, rotation_days, notes). Clean. - `rotation_days: 7` is the shortest rotation in the registry -- appropriately documented as Google policy-driven. - Notes field documents the filename typo, k8s secret key mapping, and all three consumers. Excellent metadata. ### BLOCKERS None. - No secrets or credentials committed -- the script reads from `~/secrets/` at runtime, never embeds values. - No unvalidated user input -- the script takes only `--dry-run` / `--help` flags with strict `case` matching; unknown args cause `die`. - No DRY violations in auth/security paths. - Test coverage: this is an ops script, not application code. The `bash -n` syntax check and `--dry-run` mode serve as the validation mechanism. No unit tests needed for a kubectl orchestration script. ### NITS 1. **Dead-code jsonpath attempt (line 214-216)**: The first verification attempt uses `jsonpath="{.data.${K8S_SECRET_KEY}}"` which expands to `{.data.gmail-westsidebasketball.json}`. The unescaped dot before `json` means kubectl interprets this as nested key access (`.data.gmail-westsidebasketball` -> `.json`), which will always fail. The fallback at line 222 with the escaped dot (`gmail-westsidebasketball\.json`) is the one that actually works. Consider removing the first attempt and going straight to the escaped version to avoid confusion: ``` stored_b64=$(kubectl get secret "${K8S_SECRET_NAME}" \ -n "${K8S_NAMESPACE}" \ -o jsonpath='{.data.gmail-westsidebasketball\.json}' 2>/dev/null || true) ``` Alternatively, use `-o go-template` or `jq` on the JSON output to avoid jsonpath escaping entirely. 2. **Hardcoded secret key in jsonpath fallback (line 222)**: The fallback hardcodes `gmail-westsidebasketball\.json` instead of deriving it from `K8S_SECRET_KEY`. If the constant ever changes, this line would silently break. Minor since the constant is 5 lines above, but worth noting. 3. **`python3 -m gmail_mcp.reauth` prerequisite not checked (line 125)**: The `check_prerequisites` function validates `kubectl` and `jq` but not whether `gmail_mcp` is importable. If the module is missing, the user gets a Python traceback instead of a clean `die` message. Consider adding a check: `python3 -c "import gmail_mcp.reauth" 2>/dev/null || die "gmail_mcp not installed"`. 4. **`for = "0m"` on GmailOAuthTokenExpired**: While consistent with existing patterns, Prometheus documentation notes that `for: 0m` and omitting `for` are equivalent. Omitting it would be marginally cleaner, but this matches the existing codebase convention, so no strong opinion. 5. **PR body notes "Cannot run `tofu plan`"**: This is understood given worktree isolation, but the PR convention says "Include `tofu plan` output for any Terraform changes." The syntax is consistent with existing PrometheusRule resources, so the risk is low. Should be validated during `tofu apply` per issue #224. ### SOP COMPLIANCE - [x] Branch named after issue: `162-gmail-oauth-reauth-automation` matches issue #162 - [x] PR body has: Summary, Changes, Test Plan, Related -- all present and thorough - [x] Related section references the project (`pal-e-platform`) and closes #162 - [x] No secrets committed -- script reads from `~/secrets/` at runtime - [x] No unnecessary file changes -- 3 files, all directly related to the feature - [x] Commit messages are descriptive - [x] Discovered scope documented: "pal-e-mail PVC-to-secret migration (separate ticket in pal-e-deployments)" - [ ] `tofu plan` output not included (worktree limitation acknowledged in PR body) ### PROCESS OBSERVATIONS **DORA impact**: This PR directly improves MTTR for Gmail OAuth token expiry (currently a multi-step manual process every 7 days). The proactive alerting at day 6 converts a reactive outage into a scheduled 1-minute ops task. Strong MTTR improvement. **Change failure risk**: Low. The PrometheusRule is additive (no existing resources modified). The script is operator-facing (not automated CI), and `--dry-run` provides a safety net. The secrets registry change is metadata-only. **Validation gap**: The `tofu plan` was not run. The alert resource should be validated during the next `tofu apply` cycle (tracked by #224). The `kube_secret_created` metric availability should be confirmed in Prometheus after apply. ### VERDICT: APPROVED
forgejo_admin deleted branch 162-gmail-oauth-reauth-automation 2026-03-28 18:05:51 +00:00
Sign in to join this conversation.
No description provided.