Raise argocd-application-controller memory limit (chronic OOMKilled) #327

Open

opened 2026-05-02 14:51:40 +00:00 by forgejo_admin · 0 comments

Type

Bug

Lineage

Standalone — discovered 2026-05-01 during alert-state audit.

Repo

forgejo_admin/pal-e-platform (argocd helm values may live in forgejo_admin/pal-e-services — confirm during fix)

What Broke

argocd-application-controller-0 is OOMKilled often enough that the OOMKilled critical alert has been firing for 2+ days. This is a known Argo CD sizing problem rather than a kube-prometheus-stack one: the 256Mi limit is too low once the controller manages a non-trivial number of Application resources (we have 20+).

Repro Steps

1. kubectl get pods -n argocd argocd-application-controller-0 -o jsonpath='{.spec.containers[0].resources.limits.memory}' → current limit
2. kubectl describe pod -n argocd argocd-application-controller-0 → recent OOMKilled events (Last State: Terminated, Reason: OOMKilled)
3. kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/query?query=max_over_time(container_memory_working_set_bytes{namespace="argocd",container="application-controller"}[2d])' → peak working set over the last 2 days
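The checks above can be wrapped in a few helper functions so they are easy to re-run after the fix. This is a minimal sketch, not a tested runbook: the function names and the 2d lookback in WINDOW are assumptions, while the namespace, pod, and container names come from this issue. An instant query on the bare metric returns only the latest sample, so the sketch builds max_over_time and quantile_over_time expressions for the peak and the p95.

```shell
#!/bin/sh
# Sketch only. NS, POD and CONTAINER come from this issue; WINDOW is an assumed lookback.
NS=argocd
POD=argocd-application-controller-0
CONTAINER=application-controller
WINDOW=2d

# Reason for the container's most recent termination (prints OOMKilled after an OOM kill).
last_termination_reason() {
  kubectl get pod -n "$NS" "$POD" \
    -o jsonpath="{.status.containerStatuses[?(@.name==\"$CONTAINER\")].lastState.terminated.reason}"
}

# Restart count recorded for the container.
restart_count() {
  kubectl get pod -n "$NS" "$POD" \
    -o jsonpath="{.status.containerStatuses[?(@.name==\"$CONTAINER\")].restartCount}"
}

# PromQL for the peak working set over WINDOW.
peak_query() {
  printf 'max_over_time(container_memory_working_set_bytes{namespace="%s",container="%s"}[%s])\n' \
    "$NS" "$CONTAINER" "$WINDOW"
}

# PromQL for the p95 working set over WINDOW.
p95_query() {
  printf 'quantile_over_time(0.95,container_memory_working_set_bytes{namespace="%s",container="%s"}[%s])\n' \
    "$NS" "$CONTAINER" "$WINDOW"
}
```

The query strings still need URL encoding before they go into /api/v1/query. From a machine that can reach Prometheus, curl -G --data-urlencode "query=$(p95_query)" against that endpoint handles the encoding.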

Expected Behavior

The controller is not OOMKilled under normal cluster operation, and the OOMKilled alert clears.

Environment

Cluster: pal-e, namespace argocd
Pod: argocd-application-controller-0
Container: application-controller
Likely managed by Terraform under pal-e-platform/terraform/modules/argocd/ or by Kustomize in pal-e-services
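One way to narrow down which repo owns the limit is to read the app.kubernetes.io/managed-by label on the StatefulSet. The sketch below is a heuristic under stated assumptions: the StatefulSet name is inferred from the pod name, the function names are hypothetical, and the controller.resources hint only applies if the community argo-cd Helm chart is in use (a Terraform helm_release would also produce the Helm label).

```shell
#!/bin/sh
# Sketch only. STS is inferred from the pod name argocd-application-controller-0.
NS=argocd
STS=argocd-application-controller

# Value of the app.kubernetes.io/managed-by label, empty if the label is absent.
managed_by() {
  kubectl get statefulset -n "$NS" "$STS" \
    -o jsonpath='{.metadata.labels.app\.kubernetes\.io/managed-by}'
}

# Turn the label into a hint about where to look for the memory limit.
limit_owner_hint() {
  case "$(managed_by)" in
    Helm) echo "helm: check controller.resources in the chart values (possibly set via Terraform)" ;;
    "")   echo "unlabelled: check the kustomize manifests in pal-e-services" ;;
    *)    echo "other: inspect the StatefulSet labels and annotations by hand" ;;
  esac
}
```

The label only says what rendered the manifest, so the result still needs to be confirmed against the repos themselves.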

Acceptance Criteria

Memory limit raised to at least 2× the recent p95 working set (typical: 256Mi → 512Mi or 1Gi)
Chosen limit documented in the deployment-lessons SOP
OOMKilled alert clears for the application-controller container
No regression in other Argo CD components
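
The sizing rule in the first criterion can be worked through as arithmetic. The sketch below doubles a p95 value in bytes, rounds up to whole Mi, and then picks the next power-of-two step starting from 256Mi. The function name, the power-of-two rounding, and the 256Mi floor are assumptions made for illustration, and the byte values in the usage note are examples, not measurements from this cluster.

```shell
#!/bin/sh
# Sketch only: recommend a memory limit from a p95 working set given in bytes.
recommend_limit_mi() {
  p95_bytes=$1
  # 2x the p95, rounded up to whole Mi (1Mi = 1048576 bytes).
  target_mi=$(( (2 * p95_bytes + 1048575) / 1048576 ))
  # Assumed policy: start at 256Mi and double until the target fits.
  limit_mi=256
  while [ "$limit_mi" -lt "$target_mi" ]; do
    limit_mi=$(( limit_mi * 2 ))
  done
  echo "${limit_mi}Mi"
}
```

For example, a p95 of 209715200 bytes (200Mi) gives a 400Mi target and a 512Mi limit, while a p95 of 400000000 bytes gives a 763Mi target and a 1024Mi (1Gi) limit.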

Related

pal-e-platform — project
alert-report-2026-05-01 — alert snapshot
deployment-lessons — SOP to update
