Cluster at pod-scheduling ceiling (125/110 maxPods on single-node k3s) #331
Reference
ldraney/pal-e-platform#331
Type
Bug
Lineage
Discovered 2026-05-04 during the notion-mcp-remote cascade.
tofu apply succeeded for notion-mcp-remote infra (5 resources created), but the notion-mcp-remote deployment pod cannot schedule. This is a platform-level capacity problem, not a notion-mcp-remote issue. It will block onboarding of any future funnel-enabled service until resolved.
Repo
forgejo_admin/pal-e-platform
What Broke
The single-node k3s cluster (archbox) is at or above the kubelet maxPods=110 default. New pods cannot schedule.
Pod event on the stuck pod:
Total pod count at time of discovery: 125 (over the 110 default).
Distribution:
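A per-namespace breakdown of this kind can be regenerated from the same kubectl output used for the total. The following is a minimal sketch, not the command originally used for this issue; the helper name pods_per_namespace is hypothetical, and it assumes the default column layout of kubectl get pods -A --no-headers (namespace in the first column).

```shell
#!/bin/sh
# Count pods per namespace from `kubectl get pods -A --no-headers` output on stdin.
# Prints "<count> <namespace>" lines, largest count first.
# pods_per_namespace is a hypothetical helper name used for illustration.
pods_per_namespace() {
  awk 'NF { count[$1]++ } END { for (ns in count) print count[ns], ns }' \
    | sort -k1,1nr -k2,2
}

# Usage against the live cluster (assumes kubectl is configured for archbox):
#   kubectl get pods -A --no-headers | pods_per_namespace
```

Note that a raw wc -l count includes every pod object, including Completed and still-Pending ones, which is presumably how the total can exceed the limit of 110. Something like kubectl get pods -A --field-selector spec.nodeName=archbox,status.phase!=Succeeded,status.phase!=Failed --no-headers | wc -l should be closer to what the kubelet actually counts against maxPods.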
The tailscale namespace dominates because every funnel-enabled service spawns its own operator-managed pod, so pod count grows linearly with each onboarded service.
Repro Steps
- kubectl get pods -A --no-headers | wc -l → 125
- kubectl describe node archbox | grep -i "Too many pods" → confirms scheduling rejection
- tofu apply that creates a new Deployment/StatefulSet → pod stays Pending forever
Expected Behavior
Per feedback_enterprise_no_workarounds.md (be ready to scale): the platform should accommodate ongoing service onboarding. New services that pass service-onboarding-sop should not be blocked by static cluster capacity limits.
Environment
- archbox (single-node k3s)
- notion-mcp-remote deployment
- notion-mcp-remote is Synced/Progressing but pod is Pending
Acceptance Criteria
- notion-mcp-remote pod transitions Pending → Running (assuming Harbor image is also resolved via #1048)
- service-onboarding-sop Pre-Deploy Validation Checklist (capacity check before adding funnel-enabled service)
Proposed fixes (operator decision)
- Set --kubelet-arg=max-pods=250 in the k3s service config. Requires a k3s restart (~30s API downtime). Single change, scales the platform.
- palworld namespace: frees 5 slots, mitigates short-term, doesn't solve scaling.
Recommended path: Option 1 short-term, Option 3 spike for medium-term.
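The max-pods change could be applied roughly as sketched below. This assumes k3s runs as the systemd unit k3s and reads its settings from /etc/rancher/k3s/config.yaml (the default location); if archbox passes flags on the service command line instead, the equivalent is the --kubelet-arg=max-pods=250 flag quoted above. The helper name render_max_pods_config is hypothetical and only prints a fragment for review.

```shell
#!/bin/sh
# Print a k3s config fragment that raises the kubelet pod limit.
# render_max_pods_config is a hypothetical helper, not part of k3s.
render_max_pods_config() {
  max_pods="$1"
  printf 'kubelet-arg:\n  - "max-pods=%s"\n' "$max_pods"
}

# Review the fragment before merging it into the real config by hand:
render_max_pods_config 250

# Then, on archbox (paths and unit name are assumptions, see above):
#   sudo vi /etc/rancher/k3s/config.yaml   # merge the fragment; do not duplicate an existing kubelet-arg key
#   sudo systemctl restart k3s             # brief API downtime
#   kubectl get node archbox -o jsonpath='{.status.capacity.pods}'   # should report the new limit
```

If the cluster still uses the default /24 per-node pod CIDR, there are only about 254 pod IPs available on the node, so 250 is close to the practical ceiling unless the CIDR mask is changed as well.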
Related
- project-pal-e-platform
- arch-deployment-notion-mcp-remote: the cascade that surfaced this
- feedback_enterprise_no_workarounds.md
- service-onboarding-sop