Fix production CrashLoopBackOff: activate Puma control app #68

Merged
ldraney merged 1 commit from fix/puma-control-app into main 2026-06-04 01:41:59 +00:00
Owner

Summary

  • Removes the development-only gate on activate_control_app in config/puma.rb
  • yabeda-puma-plugin requires the control app for thread/worker metrics but it was only activated in development, crashing production on startup
  • Pod has been in CrashLoopBackOff for 12h (148 restarts); old pod still serving traffic

Closes #67

Changes

  • config/puma.rb: Remove if ENV.fetch("RAILS_ENV", "development") == "development" guard from activate_control_app — safe in all environments (binds a Unix socket within the container)

Test Plan

  • CI pipeline passes (lint + test + build)
  • After merge, ArgoCD picks up new image and pod starts without crash
  • /up health check returns 200
  • /metrics endpoint returns Puma thread/worker gauges

Review Checklist

  • Passed automated review-fix loop
  • No secrets committed
  • No unnecessary file changes
  • Commit messages are descriptive
  • ldraney/landscaping-assistant #67 — production CrashLoopBackOff bug
  • ldraney/landscaping-assistant #19 — original observability issue (regression source)
  • landscaping-assistant — project
## Summary - Removes the development-only gate on `activate_control_app` in `config/puma.rb` - yabeda-puma-plugin requires the control app for thread/worker metrics but it was only activated in development, crashing production on startup - Pod has been in CrashLoopBackOff for 12h (148 restarts); old pod still serving traffic Closes #67 ## Changes - `config/puma.rb`: Remove `if ENV.fetch("RAILS_ENV", "development") == "development"` guard from `activate_control_app` — safe in all environments (binds a Unix socket within the container) ## Test Plan - [ ] CI pipeline passes (lint + test + build) - [ ] After merge, ArgoCD picks up new image and pod starts without crash - [ ] `/up` health check returns 200 - [ ] `/metrics` endpoint returns Puma thread/worker gauges ## Review Checklist - [ ] Passed automated review-fix loop - [ ] No secrets committed - [ ] No unnecessary file changes - [ ] Commit messages are descriptive ## Related Notes - `ldraney/landscaping-assistant #67` — production CrashLoopBackOff bug - `ldraney/landscaping-assistant #19` — original observability issue (regression source) - `landscaping-assistant` — project
Fix production CrashLoopBackOff: activate Puma control app in all environments
Some checks failed
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
CI / scan_ruby (pull_request) Has been cancelled
CI / scan_js (pull_request) Has been cancelled
CI / lint (pull_request) Has been cancelled
deb6930d67
yabeda-puma-plugin requires the Puma control app to collect thread/worker
metrics but activate_control_app was gated to development only, causing
an immediate crash on startup in production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Owner

PR #68 Review

DOMAIN REVIEW

Tech stack: Ruby on Rails 8.1, Puma web server, yabeda-puma-plugin (Prometheus metrics), Woodpecker CI, k8s deployment.

Change: Single-file, 2-line diff in config/puma.rb. Removes the if ENV.fetch("RAILS_ENV", "development") == "development" guard from activate_control_app, making the Puma control app active in all environments. Adds a clarifying comment.

Analysis of the fix:

  • The root cause is correct. yabeda-puma-plugin requires activate_control_app to collect thread/worker metrics. Gating it to development-only would crash production at startup when the plugin tries to read metrics from the control app socket.
  • activate_control_app without arguments binds to a random Unix socket (not a TCP port), so there is no network exposure risk. This is safe for production use inside a container.
  • The comment accurately explains why the control app is needed.
  • The diff is minimal and precisely scoped to the bug.

No security concerns: The control app Unix socket is container-local and not exposed externally. No secrets, no new network listeners.

BLOCKERS

None.

Test coverage exemption reasoning: This is a one-line configuration fix to a Puma DSL directive. The change is not testable via unit/integration tests -- it is a server boot configuration that can only be validated by observing the pod starts successfully and /metrics returns Puma gauges. The PR's Test Plan appropriately covers this with post-deploy verification steps. The BLOCKER criterion for "new functionality with zero test coverage" does not apply to infrastructure configuration fixes.

NITS

None. The diff is clean, minimal, and well-commented.

SOP COMPLIANCE

  • Branch named after issue -- Branch is fix/puma-control-app, not 67-puma-control-app or 67-fix-puma-control-app. Convention calls for {issue-number}-{kebab-case-purpose}. Minor deviation, but this is a hotfix for a production crash so the urgency is understood.
  • PR body follows template -- Summary, Changes, Test Plan, Review Checklist, Related sections all present.
  • Related references plan slug -- No plan slug provided (caller confirmed none). Related section references issues #67 and #19 and the project name, which is reasonable for a hotfix.
  • No secrets committed -- Confirmed. No credentials, .env files, or sensitive data in the diff.
  • No unnecessary file changes -- Single file changed, precisely scoped.
  • Commit messages are descriptive -- PR title is clear and actionable.

PROCESS OBSERVATIONS

  • DORA impact: This is a high-priority fix. 12h of CrashLoopBackOff with 148 restarts indicates the regression from PR #47 (observability stack) was not caught before merge. The old pod is still serving traffic due to k8s rolling update strategy, so user impact is mitigated but operational risk is elevated.
  • Regression source: The guard was likely introduced in #47 as a safety measure during development but should have been removed or adjusted before merge. Future observability changes should include a post-deploy smoke check for pod health as part of the merge validation.
  • Change failure rate: This is a change-failure recovery. The original observability PR introduced the defect. Consider adding a CI step or post-deploy check that verifies the app boots cleanly (e.g., rails runner "puts :ok" or a container startup probe test).

VERDICT: APPROVED

The fix is correct, minimal, safe, and well-documented. The branch naming deviation is noted but not blocking given the production urgency. Ship it and verify pod recovery.

## PR #68 Review ### DOMAIN REVIEW **Tech stack**: Ruby on Rails 8.1, Puma web server, yabeda-puma-plugin (Prometheus metrics), Woodpecker CI, k8s deployment. **Change**: Single-file, 2-line diff in `config/puma.rb`. Removes the `if ENV.fetch("RAILS_ENV", "development") == "development"` guard from `activate_control_app`, making the Puma control app active in all environments. Adds a clarifying comment. **Analysis of the fix**: - The root cause is correct. `yabeda-puma-plugin` requires `activate_control_app` to collect thread/worker metrics. Gating it to development-only would crash production at startup when the plugin tries to read metrics from the control app socket. - `activate_control_app` without arguments binds to a random Unix socket (not a TCP port), so there is no network exposure risk. This is safe for production use inside a container. - The comment accurately explains why the control app is needed. - The diff is minimal and precisely scoped to the bug. **No security concerns**: The control app Unix socket is container-local and not exposed externally. No secrets, no new network listeners. ### BLOCKERS None. **Test coverage exemption reasoning**: This is a one-line configuration fix to a Puma DSL directive. The change is not testable via unit/integration tests -- it is a server boot configuration that can only be validated by observing the pod starts successfully and `/metrics` returns Puma gauges. The PR's Test Plan appropriately covers this with post-deploy verification steps. The BLOCKER criterion for "new functionality with zero test coverage" does not apply to infrastructure configuration fixes. ### NITS None. The diff is clean, minimal, and well-commented. ### SOP COMPLIANCE - [ ] Branch named after issue -- Branch is `fix/puma-control-app`, not `67-puma-control-app` or `67-fix-puma-control-app`. Convention calls for `{issue-number}-{kebab-case-purpose}`. Minor deviation, but this is a hotfix for a production crash so the urgency is understood. - [x] PR body follows template -- Summary, Changes, Test Plan, Review Checklist, Related sections all present. - [ ] Related references plan slug -- No plan slug provided (caller confirmed none). Related section references issues #67 and #19 and the project name, which is reasonable for a hotfix. - [x] No secrets committed -- Confirmed. No credentials, .env files, or sensitive data in the diff. - [x] No unnecessary file changes -- Single file changed, precisely scoped. - [x] Commit messages are descriptive -- PR title is clear and actionable. ### PROCESS OBSERVATIONS - **DORA impact**: This is a high-priority fix. 12h of CrashLoopBackOff with 148 restarts indicates the regression from PR #47 (observability stack) was not caught before merge. The old pod is still serving traffic due to k8s rolling update strategy, so user impact is mitigated but operational risk is elevated. - **Regression source**: The guard was likely introduced in #47 as a safety measure during development but should have been removed or adjusted before merge. Future observability changes should include a post-deploy smoke check for pod health as part of the merge validation. - **Change failure rate**: This is a change-failure recovery. The original observability PR introduced the defect. Consider adding a CI step or post-deploy check that verifies the app boots cleanly (e.g., `rails runner "puts :ok"` or a container startup probe test). ### VERDICT: APPROVED The fix is correct, minimal, safe, and well-documented. The branch naming deviation is noted but not blocking given the production urgency. Ship it and verify pod recovery.
ldraney deleted branch fix/puma-control-app 2026-06-04 01:41:59 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
ldraney/landscaping-assistant!68
No description provided.