fix: bump pal-e-docs memory limit to prevent OOMKilled #182

Closed
opened 2026-03-15 16:04:26 +00:00 by forgejo_admin · 1 comment

Lineage

plan-pal-e-platform → Phase 16 → 16b (memory limits — pal-e-docs)

Repo

forgejo_admin/pal-e-docs

User Story

As a platform operator
I want pal-e-docs to have sufficient memory
So that the container stops getting OOMKilled and restarting

Context

The pal-e-docs container is being OOMKilled under its current 128Mi memory limit and has restarted twice in the last 12 hours. The app runs FastAPI + SQLAlchemy + pgvector embedding operations — 128Mi is insufficient for normal operation, especially when concurrent MCP tool calls trigger embedding computations.

File Targets

Files the agent should modify:

  • k8s/deployment.yaml — bump the pal-e-docs container resource limits:
    • requests.memory: 32Mi → 64Mi
    • limits.memory: 128Mi → 256Mi

Files the agent should NOT touch:

  • Any other files — this is a single resource limit change
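For reference, the relevant stanza in k8s/deployment.yaml should end up looking roughly like this. This is a sketch only: the container name and surrounding fields are assumed from context, and only the two memory values change.

```yaml
# Sketch of the expected resources stanza after the change.
# Container name and field layout are assumptions, not verified.
containers:
  - name: pal-e-docs
    resources:
      requests:
        cpu: 10m          # unchanged
        memory: 64Mi      # was 32Mi
      limits:
        memory: 256Mi     # was 128Mi
```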

Acceptance Criteria

  • Memory requests set to 64Mi in deployment.yaml
  • Memory limits set to 256Mi in deployment.yaml
  • No other changes to the deployment spec

Test Expectations

  • Verify the YAML is valid: python3 -c "import yaml; yaml.safe_load(open('k8s/deployment.yaml'))"
  • If kustomize is used: kubectl kustomize k8s/ succeeds
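As a hedged illustration of the first check, the sketch below parses an inline copy of the expected stanza and asserts the new values. The embedded manifest snippet and container name are assumptions for illustration; against the real file you would `yaml.safe_load(open('k8s/deployment.yaml'))` instead.

```python
# Minimal sketch: validate the memory values with PyYAML
# (the same library the one-liner above relies on).
import yaml

# Hypothetical stand-in for k8s/deployment.yaml — structure assumed.
manifest = """
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: pal-e-docs
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              memory: 256Mi
"""

doc = yaml.safe_load(manifest)
container = doc["spec"]["template"]["spec"]["containers"][0]
res = container["resources"]
assert res["requests"]["memory"] == "64Mi"
assert res["limits"]["memory"] == "256Mi"
print("memory request/limit values look correct")
```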

Constraints

  • Only change the memory request and limit values — nothing else
  • Do not change CPU requests or limits

Checklist

  • PR opened
  • YAML is valid
  • No unrelated changes
Related

  • phase-platform-16-alert-tuning — parent phase
  • pal-e-platform — project
Author
Owner

PR #183 Review

DOMAIN REVIEW

Tech stack: Kubernetes deployment manifest (YAML). Single-file change to resource requests/limits.

Diff verified against full file (k8s/deployment.yaml, 73 lines):

  • requests.memory: 32Mi -> 64Mi (changed)
  • limits.memory: 128Mi -> 256Mi (changed)
  • requests.cpu: 10m (unchanged, line 70)
  • No CPU limit set (unchanged -- correct for non-CPU-bound workloads)
  • No other spec changes: replicas, strategy, probes, env vars, image tag all identical

Reasonableness: FastAPI + SQLAlchemy + pgvector is memory-hungry. pgvector loads embedding vectors into memory for similarity search. 128Mi was demonstrably too tight (OOMKilled, 2 restarts in 12h). 256Mi limit with 64Mi request gives a 4:1 burst ratio, which is appropriate for this workload profile. The 2x bump is conservative and measured.

BLOCKERS

None.

NITS

None.

SOP COMPLIANCE

  • Branch 182-fix-bump-pal-e-docs-memory-limit-to-prev named after issue #182
  • PR body follows template (Summary, Changes, Test Plan, Related)
  • Related references plan-pal-e-platform
  • Closes #182 present
  • No secrets committed
  • No unnecessary file changes (1 file, 2 additions, 2 deletions)
  • Commit message is descriptive

PROCESS OBSERVATIONS

Clean operational fix. Low change failure risk -- memory limit bumps are safe, reversible changes. ArgoCD will auto-sync after merge. Post-merge validation items in the Test Plan (kubectl top pod monitoring) are appropriate.

VERDICT: APPROVED
