Raise argocd-application-controller memory limit (chronic OOMKilled) #327

Open
opened 2026-05-02 14:51:40 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Bug

Lineage

Standalone — discovered 2026-05-01 during alert-state audit.

Repo

forgejo_admin/pal-e-platform (argocd helm values may live in forgejo_admin/pal-e-services — confirm during fix)

What Broke

argocd-application-controller-0 is being OOMKilled often enough to keep the OOMKilled critical alert firing for 2+ days. This is a known Argo CD sizing issue: the default 256Mi limit is too low once the controller manages a non-trivial number of Application resources (we have 20+).

Repro Steps

  1. kubectl get pods -n argocd argocd-application-controller-0 -o jsonpath='{.spec.containers[0].resources.limits.memory}' → current limit
  2. kubectl describe pod -n argocd argocd-application-controller-0 → recent OOMKilled events
  3. kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/query?query=container_memory_working_set_bytes{namespace="argocd",container="application-controller"}' → current working set (wrap the selector in max_over_time(...[2d]) to get the recent peak)
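
Step 3 as written returns an instant value, while the acceptance criteria are phrased in terms of a recent p95. A minimal sketch of the p95 query follows. It assumes curl is available and that PROM_URL (a hypothetical variable, not something defined in this issue) points at a reachable Prometheus API, for example through a port-forward. The 2d window is also an assumption, chosen to match how long the alert has been firing.

```shell
#!/bin/sh
# Build a PromQL expression for the p95 working set of one container.
# Arguments: namespace, container, window (a PromQL duration such as 2d).
p95_query() {
  printf 'quantile_over_time(0.95, container_memory_working_set_bytes{namespace="%s",container="%s"}[%s])' "$1" "$2" "$3"
}

# Send the expression to the Prometheus HTTP API.
# PROM_URL is a placeholder (assumption), e.g. http://localhost:9090.
# -G turns the data into a query string; --data-urlencode escapes the
# braces, quotes, and brackets that are not safe in a raw URL.
run_p95_query() {
  curl -sG "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(p95_query argocd application-controller 2d)"
}
```

Calling run_p95_query prints the JSON response; the value is in bytes. Because the container is killed when it reaches its limit, the observed p95 cannot exceed the current limit, so treat the result as a lower bound on real demand.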

Expected Behavior

Controller does not OOMKill under normal cluster operation. OOMKilled alert clears.

Environment

  • Cluster: pal-e, namespace argocd
  • Pod: argocd-application-controller-0
  • Container: application-controller
  • Likely managed by terraform under pal-e-platform/terraform/modules/argocd/ or kustomize in pal-e-services

Acceptance Criteria

  • Memory limit raised to about 2× the recent p95 of container_memory_working_set_bytes, rounded up for headroom (typical: 256Mi → 512Mi or 1Gi)
  • Limit chosen documented in deployment-lessons SOP
  • OOMKilled alert clears for the application-controller container
  • No regression to other argocd components

Related

  • pal-e-platform — project
  • alert-report-2026-05-01 — alert snapshot
  • deployment-lessons — SOP to update
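
The sizing rule in the acceptance criteria (2× recent p95, with headroom) can be sketched as a small helper. Rounding up to the next power-of-two MiB is an assumption made here to land on conventional values such as 512Mi or 1024Mi; the issue itself only asks for 2× p95 plus headroom. The function name is hypothetical.

```shell
#!/bin/sh
# Suggest a memory limit: 2x the p95 working set, rounded up to the
# next power-of-two MiB (the rounding step is an assumption).
# Argument: p95 working set in bytes (integer).
suggest_limit_mi() {
  p95_bytes=$1
  mib=$(( 1024 * 1024 ))
  target=$(( p95_bytes * 2 ))
  # Ceiling division so a partial MiB still counts.
  target_mi=$(( (target + mib - 1) / mib ))
  limit=1
  while [ "$limit" -lt "$target_mi" ]; do
    limit=$(( limit * 2 ))
  done
  printf '%sMi\n' "$limit"
}
```

For example, `suggest_limit_mi 241172480` (a p95 of 230 MiB) prints 512Mi, and a p95 of 400 MiB (419430400 bytes) gives 1024Mi, which is the same quantity as 1Gi. If the measured p95 sits right at the current limit, the real demand is probably higher, so the larger candidate is the safer choice to record in the deployment-lessons SOP.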
Reference
ldraney/pal-e-platform#327