Bug: kube-router ipset sync stale — NetworkPolicy blocks short-lived pods #127

Open

opened 2026-03-21 15:51:13 +00:00 by forgejo_admin · 1 comment

forgejo_admin commented

2026-03-21 15:51:13 +00:00

Owner

Type

Bug

Lineage

plan-pal-e-platform → Platform Hardening — standalone, discovered during CI investigation

Repo

forgejo_admin/pal-e-platform

What Broke

kube-router (embedded in k3s) maintains ipsets for NetworkPolicy source selectors. These ipsets contain pod IPs that should be allowed through. The ipsets are stale — they contain IPs of long-running pods but do NOT add newly created short-lived pods (Woodpecker pipeline containers, kubectl run test pods).

This causes all NetworkPolicy-protected namespaces to reject traffic from new pods, even when the policy explicitly allows the source namespace. Currently the forgejo namespace NetworkPolicies have been temporarily deleted as a workaround — this is a security regression that must be reversed.

Repro Steps

1. Create a NetworkPolicy allowing ingress from namespace X
2. Verify that long-running pods in namespace X can connect (they can)
3. kubectl run test --image=alpine/curl -n X -- curl http://target-svc:80/
4. The new pod is REJECTED despite matching the NetworkPolicy namespace selector
5. Check the ipset: the new pod IP is not in the source ipset
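
The ipset check in the last step could be scripted on the node roughly as follows. This is a minimal sketch, not the procedure actually used in this issue: the helper name ip_in_ipset_dump is hypothetical, the pod name test and namespace X are the placeholders from the steps above, and it assumes `ipset save` is run as root on the k3s node.

```shell
# Hypothetical helper: reads `ipset save` output on stdin and succeeds
# (exit 0) only if the given IP is a member of at least one set.
ip_in_ipset_dump() {
    ip="$1"
    # Member lines in `ipset save` look like: add <set-name> <entry> [options]
    awk -v ip="$ip" '$1 == "add" && $3 == ip { found = 1 } END { exit !found }'
}

# Usage on the node (placeholders: pod "test", namespace "X"):
#   pod_ip=$(kubectl get pod test -n X -o jsonpath='{.status.podIP}')
#   ipset save | ip_in_ipset_dump "$pod_ip" && echo present || echo missing
```

The match is on the whole entry field, so 10.42.0.5 does not falsely match 10.42.0.50.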

Expected Behavior

kube-router should add pod IPs to namespace ipsets immediately when pods are created, and remove them when pods are deleted. Short-lived pods should be able to connect to NetworkPolicy-protected services.

Environment

Cluster: k3s single-node (archbox)
k3s version: check /usr/local/bin/k3s --version
kube-router: embedded in k3s (no separate pod/process)
Affected namespaces: forgejo (NetworkPolicies temporarily removed), likely all namespaces with NetworkPolicies
Related: bug-kube-router-ipset-empty (#157, marked done but issue resurfaced)
Post-move network recovery (#176) may have triggered this

Acceptance Criteria

Short-lived pods can connect to NetworkPolicy-protected services
Forgejo namespace NetworkPolicies restored (tailscale, woodpecker, monitoring allowed)
CI pipeline clone step works WITH NetworkPolicies active
ipsets populated correctly for new pods within 5 seconds of creation

Related

pal-e-platform — project board
Issue #121 — CI clone failure (symptom of this bug)
Issue #157 — previous kube-router ipset empty bug (may not have been fully resolved)
Board item #176 — post-move network recovery


forgejo_admin commented

2026-03-21 16:45:40 +00:00

Author

Owner

Closing — Diagnosis Was Wrong

kube-router ipset sync is working correctly. The issue was with my test methodology:

kubectl run --rm -it pods are so short-lived that their command runs BEFORE kube-router adds their IP to the ipset (~5s sync interval)
A persistent test pod behaved correctly: its IP appeared in the ipset within 5 seconds, and its request to forgejo returned 200
Pipeline #160 succeeded WITH NetworkPolicies active (clone + apply)
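
One way to keep a throwaway test pod from racing the sync is to delay or retry the request inside the pod. The sketch below is an assumption, not the method used in this issue: the retry helper is hypothetical, and the 10-second sleep is an arbitrary margin chosen to exceed the ~5s interval noted above.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failures. Succeeds as soon as the command does.
retry() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while :; do
        "$@" && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Usage inside a test pod (target-svc is the placeholder from the repro steps):
#   retry 5 2 curl -fsS http://target-svc:80/
#
# Or delay a one-shot pod before it sends its first request:
#   kubectl run test --rm -it --restart=Never --image=alpine/curl -n X \
#     --command -- sh -c 'sleep 10; curl -fsS http://target-svc:80/'
```

With either variant, a failure that persists past the margin points at the policy or the ipset rather than at test timing.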

The forgejo NetworkPolicy was recreated by tofu apply and is correctly allowing woodpecker, tailscale, and monitoring namespaces. No security regression.

Root cause of the original CI clone failure was the Forgejo IPv4 binding issue (PR #124) combined with the internal URL clone override (PR #118, now reverted). Both are fixed.
