Raise argocd-application-controller memory limit (chronic OOMKilled) #327
Reference
ldraney/pal-e-platform#327
Type
Bug
Lineage
Standalone — discovered 2026-05-01 during alert-state audit.
Repo
forgejo_admin/pal-e-platform (argocd helm values may live in forgejo_admin/pal-e-services; confirm during fix)
What Broke
argocd-application-controller-0 is being OOMKilled often enough to keep the OOMKilled critical alert firing for 2+ days. This is a known kube-prometheus-stack issue: the default 256Mi limit is too low once the controller manages a non-trivial number of Application resources (we have 20+).
Repro Steps
kubectl get pods -n argocd argocd-application-controller-0 -o jsonpath='{.spec.containers[0].resources.limits.memory}' → current limit
kubectl describe pod -n argocd argocd-application-controller-0 → recent OOMKilled events
kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/query?query=container_memory_working_set_bytes{namespace="argocd",container="application-controller"}' → recent peak
Expected Behavior
Controller does not OOMKill under normal cluster operation.
OOMKilled alert clears.
Environment
Namespace: argocd
Pod: argocd-application-controller-0
Container: application-controller
Config: pal-e-platform/terraform/modules/argocd/ or kustomize in pal-e-services
Acceptance Criteria
deployment-lessons SOP updated
OOMKilled alert clears for the argocd container
Related
pal-e-platform: project
alert-report-2026-05-01: alert snapshot
deployment-lessons: SOP to update
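
The repro steps above can be combined into one check that puts the observed peak next to the configured limit. This is a minimal sketch, not the project's tooling: the helper names limit_to_bytes and usage_pct are hypothetical, only integer quantities with Ki, Mi, or Gi suffixes are handled, and the live part assumes the controller is the first container in the pod, as the first repro command does.

```shell
#!/usr/bin/env bash
# Sketch: compare a container's peak working set against its memory limit.
# limit_to_bytes and usage_pct are illustrative helpers, not kubectl or Argo CD features.

# Convert a Kubernetes memory quantity (plain bytes, Ki, Mi, Gi) to bytes.
# Only integer values with binary suffixes are handled here.
limit_to_bytes() {
  local q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *[!0-9]*|'') echo "unsupported quantity: $q" >&2; return 1 ;;
    *)   echo "$q" ;;
  esac
}

# Integer percentage of the limit consumed by the observed peak (both in bytes).
usage_pct() {
  local peak_bytes="$1" limit_bytes="$2"
  echo $(( peak_bytes * 100 / limit_bytes ))
}

# Live check; runs only with --live so the helpers can be sourced without a cluster.
if [ "${1:-}" = "--live" ]; then
  limit=$(kubectl get pod -n argocd argocd-application-controller-0 \
    -o jsonpath='{.spec.containers[0].resources.limits.memory}')
  reason=$(kubectl get pod -n argocd argocd-application-controller-0 \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
  echo "limit=${limit} ($(limit_to_bytes "$limit") bytes), last termination: ${reason:-none}"
fi
```

Passing the peak value returned by the Prometheus query and the converted limit to usage_pct gives a quick read on headroom; a result at or near 100 is consistent with the OOMKilled events. The new limit itself is left to the fix.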