Debug pal-e-docs container OOMKilled (chronic) #275

Open
opened 2026-05-02 14:52:33 +00:00 by forgejo_admin · 0 comments

Type

Bug

Lineage

Standalone — discovered 2026-05-01 during alert-state audit.

Repo

forgejo_admin/pal-e-api

What Broke

The pal-e-docs container is being OOMKilled regularly and has kept a critical OOMKilled alert firing for 2+ days. The pod currently affected is pal-e-docs-6c7fdd96d7-fll8h in namespace pal-e-docs. Note: the container and pod are still named pal-e-docs (the legacy name), but the codebase repo has been renamed to pal-e-api.

Repro Steps

  1. kubectl describe pod -n pal-e-docs pal-e-docs-6c7fdd96d7-fll8h → look for Last State: Terminated, Reason: OOMKilled
  2. kubectl get pod -n pal-e-docs -l app=pal-e-docs -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' → restart count
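  3. kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/query?query=container_memory_working_set_bytes{namespace="pal-e-docs",container="pal-e-docs"}' → current working set, to compare against the memory limit (an instant query returns only the latest sample, so a recent peak needs a range function such as max_over_time)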
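
To compare a recent peak against the limit, as step 3 intends, the same wget pattern can carry a `max_over_time` query plus a query for the configured limit. The following is a minimal sketch, not a verified procedure for this cluster: `PROM_POD` is a placeholder for the Prometheus pod name the issue elides, the 2d window is an assumption taken from the "2+ days" the alert has been firing, and the limit query assumes cAdvisor's `container_spec_memory_limit_bytes` is scraped here.

```shell
#!/bin/sh
# Sketch only. PROM_POD is a placeholder: the issue elides the real pod name.
NS="pal-e-docs"
CONTAINER="pal-e-docs"
WINDOW="2d"   # assumption: matches the "2+ days" of alert firing

# PromQL for the highest working set seen within the window.
peak_query() {
  printf 'max_over_time(container_memory_working_set_bytes{namespace="%s",container="%s"}[%s])' \
    "$NS" "$CONTAINER" "$WINDOW"
}

# PromQL for the configured memory limit (assumes the cAdvisor metric exists here).
limit_query() {
  printf 'container_spec_memory_limit_bytes{namespace="%s",container="%s"}' \
    "$NS" "$CONTAINER"
}

# Run one instant query inside the Prometheus pod, same pattern as step 3.
# Depending on the wget build in the image, the query may need URL-encoding.
prom_query() {
  : "${PROM_POD:?set PROM_POD to the Prometheus pod name}"
  kubectl exec -n monitoring "$PROM_POD" -- \
    wget -qO- "http://localhost:9090/api/v1/query?query=$1"
}

# Usage (needs cluster access):
#   PROM_POD=<prometheus pod name>
#   prom_query "$(peak_query)"
#   prom_query "$(limit_query)"
```

A peak that sits at or just under the limit would point toward cause (a) or (c); the sketch does not by itself distinguish a leak from a workload spike.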

Expected Behavior

Container runs at steady-state memory usage well below limit. OOMKilled alert does not fire.

Environment

  • Cluster: pal-e, namespace pal-e-docs
  • Deployment manifest: kustomize overlay in pal-e-deployments (most likely)
  • Workload: pal-e-api FastAPI + embedding pipeline

Acceptance Criteria

  • Determine the cause: (a) limit too low, (b) genuine memory leak, (c) embedding/vector workload spike
  • If (a): raise limit to 2× p95
  • If (b): file follow-up issue in this repo with reproduction
  • If (c): investigate embedding-worker scheduling and memory profile
  • OOMKilled alert clears for pal-e-docs container
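
For case (a), the "2× p95" figure can be worked out from a list of memory samples. The sketch below assumes that p95 refers to the container's working-set bytes, that the samples arrive one per line on stdin, and that a nearest-rank percentile is acceptable; `limit_from_samples` is a hypothetical helper name, not something that exists in the repo.

```shell
#!/bin/sh
# Sketch only: reads byte samples (one per line) on stdin and prints
# 2x the nearest-rank p95, rounded up to whole MiB, as a Kubernetes quantity.
limit_from_samples() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                      # no samples, nothing to compute
      rank = int((95 * NR + 99) / 100)         # ceil(0.95 * NR), integer math
      bytes = 2 * v[rank]
      mib = int(bytes / 1048576)
      if (mib * 1048576 < bytes) mib++         # round up to a whole MiB
      printf "%dMi\n", mib
    }'
}
```

As an illustration with made-up data, twenty samples rising from 10 MiB to 200 MiB in 10 MiB steps have a nearest-rank p95 of 190 MiB, so the function prints `380Mi`. Real samples would have to come from Prometheus over a window long enough to include the OOM events.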

Related

  • pal-e-platform — alerting rule lives there
  • alert-report-2026-05-01 — alert snapshot
Reference
ldraney/pal-e-api#275