Bump ArgoCD repo-server memory limit (1 alert) #112

Closed
opened 2026-03-18 17:03:48 +00:00 by forgejo_admin · 3 comments

Lineage

plan-pal-e-platform → Platform Hardening

Repo

forgejo_admin/pal-e-services

User Story

As a platform operator
I want ArgoCD repo-server to have sufficient memory
So that it stops OOMKilling every few hours and can reliably deploy all 8 applications

Context

ArgoCD repo-server OOMKilled 4 times in 4 days. Current limit is 256Mi, managing 8 ArgoCD Applications with a SOPS CMP plugin sidecar. Memory usage climbs past 256Mi over hours as it clones repos, renders manifests, and caches results. Current: requests=64Mi, limits=256Mi. Usage after restart: 81Mi, climbs over time.

File Targets

  • ~/pal-e-services/terraform/main.tf lines 93-97 — bump requests.memory: 64Mi → 128Mi, limits.memory: 256Mi → 512Mi

Files NOT to touch:

  • ArgoCD Application resources — the issue is the server process, not the apps
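
For orientation, a hedged sketch of what the edit around lines 93-97 of main.tf might look like after the bump. The `repoServer.resources` key follows the upstream argo-cd Helm chart's values convention; the exact resource name and values layout in this repo's main.tf are assumptions:

```hcl
# Hypothetical excerpt from terraform/main.tf — the actual helm_release
# name and values structure in this repo may differ.
resource "helm_release" "argocd" {
  # ...

  values = [yamlencode({
    repoServer = {
      resources = {
        requests = { memory = "128Mi" } # was 64Mi
        limits   = { memory = "512Mi" } # was 256Mi
      }
    }
  })]
}
```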

Acceptance Criteria

  • No OOMKill events for 48 hours
  • kubectl top pod -n argocd -l app.kubernetes.io/component=repo-server stays under 512Mi
  • OOMKilled alert clears

Test Expectations

  • tofu plan -lock=false shows only the memory limit changes
  • After apply: kubectl describe pod -n argocd -l app.kubernetes.io/component=repo-server shows new limits
  • Monitor for 48h: no restarts with exitCode 137

Constraints

  • Straightforward helm values change, no dependencies
  • tofu apply -lock=false in pal-e-services
  • 512Mi is 2x current — conservative bump. If it OOMs again, go to 768Mi.

Checklist

  • Memory limits bumped
  • PR opened
  • tofu plan clean
  • Tests pass
  • No unrelated changes
Related

  • pal-e-platform — project board
  • Issue #109 — umbrella alert cleanup
Author
Owner

Scope Review: READY

Review note: review-191-2026-03-18
Scope is solid — all template sections present, file targets verified at lines 93-97 of pal-e-services/terraform/main.tf, acceptance criteria are agent-testable. One blast radius note: ArgoCD server component (lines 79-84) has identical 64Mi/256Mi limits worth monitoring post-fix.

Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-item-191-2026-03-18
Repo mismatch: issue filed on pal-e-platform but code change is in pal-e-services (terraform/main.tf lines 93-97). Agent will target wrong repo.

  • Fix required: Move issue to pal-e-services or create cross-repo companion issue
  • Minor: 48-hour OOM criterion not agent-verifiable — add Prometheus query
  • Minor: SOPS sidecar shares the 512Mi limit — document for future debugging
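
On the second point, a Prometheus-based check along these lines could replace the unverifiable 48-hour wait. The metric names are standard kube-state-metrics series; the label values (namespace, container name) are assumptions about this cluster:

```promql
# Any repo-server container whose last termination was an OOMKill
max(kube_pod_container_status_last_terminated_reason{
  namespace="argocd", container="repo-server", reason="OOMKilled"
}) == 1

# Restart count over the soak window — should return no results / zero
increase(kube_pod_container_status_restarts_total{
  namespace="argocd", container="repo-server"
}[48h]) > 0
```

Either query returning a non-empty result would fail the acceptance criterion; this assumes kube-state-metrics is scraped by the cluster's Prometheus.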
Author
Owner

Moved to pal-e-services — the fix is in that repo, not pal-e-platform. See the new issue there.
