infra(salt): manage k3s unit file with maxPods=250 #333
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
ldraney/pal-e-platform!333
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "332-salt-k3s-maxpods"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
--kubelet-arg=max-pods=250via pillar-driven Jinja2 templateChanges
salt/states/k3s/init.sls: Converted fromfile.exists+service.runningtofile.managed+cmd.wait+ watch-triggered restartsalt/states/k3s/k3s.service.j2: New Jinja2 template for the systemd unit, iterates over pillar argssalt/pillar/k3s.sls: New pillar withserver_argsandkubelet_argslistssalt/pillar/top.sls: Include k3s pillar for archboxTest Plan
salt-call pillar.get k3sreturns correct valuessalt-call state.apply k3s test=Truepasses (4 succeeded, 0 failed)Review Checklist
Related Notes
forgejo_admin/pal-e-platform #332— the Forgejo issue this PR implementsproject-pal-e-platform— platform infrastructureCloses #332
Addresses the 31-firing-alerts fatigue identified during the Apr 13 checkout outage by removing expected-down blackbox targets, silencing kube-state-metrics default duplicates, and adding revenue-critical Stripe webhook alerts + a basketball-api golden signals dashboard. Changes: - Remove blackbox targets westside-dev, pal-e-app, mac-agent (all expected-down; generated critical EndpointDown alerts indistinguishable from real outages). Mac-agent uptime still covered by MacAgentDown. - Downgrade EndpointDown critical -> warning, raise "for" 2m -> 5m to suppress flaps. - Disable kube-prometheus-stack defaultRules.kubeStateMetrics and kubernetesApps (duplicate namespace noise). Our PodRestartStorm, OOMKilled, TargetDown rules provide the real signal. - Add alertmanager inhibit rule so critical alerts suppress their warning counterparts on the same alertname+namespace. - New PrometheusRule payment-pipeline-alerts with: - WebhookErrorRate (warning, 5m) on increase(webhook_errors_total[5m]) - WebhookStale (warning, 10m) when checkout.session.completed timestamp is >30min stale during business hours (9am-9pm MST weekdays) - New basketball-api-golden-signals dashboard wired via ConfigMap, following the pal-e-app-golden-signals pattern. Uses only metrics basketball-api currently exposes (basketball_api_up, webhook_*). tofu validate passes. tofu plan -lock=false shows only the expected resource additions/updates under module.monitoring plus pre-existing drift from previously-merged PRs (database/ops). Closes #290 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>PR #333 Review
DOMAIN REVIEW
Tech stack: Salt (SaltStack states + pillar + Jinja2 templates), systemd unit files.
Salt state correctness (
init.sls):k3s-unit-filerequiresk3s-binary,k3s-daemon-reloadwatchesk3s-unit-file,k3s-servicewatchesk3s-unit-fileAND requiresk3s-daemon-reload. This guarantees the sequence: binary exists -> unit file written -> daemon-reload -> service restart.cmd.waitis the correct state for daemon-reload (only fires on watch trigger, not every highstate). Good.service.runningwithwatchon the unit file means Salt will restart k3s if the file changes. Combined with therequireoncmd.wait, daemon-reload will always happen before the restart. Correct.Pillar structure (
k3s.sls+top.sls):archboxgets the k3s pillar. Appropriate.pillar.get('k3s', {}).get('server_args', ['--disable=traefik'])-- safe fallback if pillar is missing entirely.Jinja2 template safety (
k3s.service.j2):pillar.get('k3s', {}).get(...)is the correct defensive pattern. If k3s pillar is absent or the key is missing, defaults kick in. No KeyError risk.BLOCKERS
1. Trailing backslash on last loop iteration produces invalid ExecStart (BLOCKER)
With the current pillar (
server_args: ['--disable=traefik'],kubelet_args: ['max-pods=250']), the template renders:The LAST argument in the last loop also gets a trailing
\(backslash), followed by an empty line. In systemd unit files, a trailing backslash is a line continuation. This means:\+ empty line results in an empty string appended to argv, which k3s may reject or ignore silently.Fix: The last line of ExecStart must NOT have a trailing backslash. Common Jinja2 patterns to fix this:
Or use
loop.lastin the existing loop structure. The key is: the final rendered argument must NOT have a trailing backslash.Severity note: The PR body says
salt-call state.apply k3s test=Truepassed and the service is running post-fix. This suggests systemd on this host silently accepts the trailing backslash+newline. However, this is not portable behavior and violates systemd unit file spec. It is a correctness issue that should be fixed before merge.NITS
Unit file mode 0644 is fine for a systemd unit file. Some systems use 0644, some 0755. Since this is a service unit (not a binary), 0644 is standard. No issue.
Single-quoting args in ExecStart (
'--disable=traefik') is unusual for systemd. Systemd ExecStart does support quoting persystemd.exec(5), but arguments are typically unquoted unless they contain spaces. Not harmful, but unconventional. Minor style point.Empty
kubelet_argsdefault means if someone removes thekubelet_argskey from pillar, the kubelet_args loop produces no output and only the trailing backslash from theserver_argsloop's last iteration remains dangling. Same root cause as the blocker, but a second scenario to consider.Template does not have a trailing newline control. Line 37 is blank. Depending on how the file is saved, there may or may not be a final newline. Salt's
file.managedgenerally handles this correctly, so this is cosmetic.SOP COMPLIANCE
332-salt-k3s-maxpodsmatches #332)project-pal-e-platform)290-payment-pipeline-observability, pulling in 3 unrelated files (pal-e-boards design spec, basketball-api dashboard JSON, monitoring terraform changes). The diff shows 811 additions / 29 deletions across 7 files, but only 4 files are in-scope. Mergeable isfalse.PROCESS OBSERVATIONS
290-payment-pipeline-observabilityinstead ofmain) means this PR cannot be merged cleanly until rebased. Forgejo reportsmergeable: false.VERDICT: NOT APPROVED
Reason: The trailing backslash on the last ExecStart argument line is a correctness bug in the Jinja2 template. While it appears to work on the current systemd version (based on test plan results), it produces technically invalid unit file syntax. Fix the template to omit the trailing backslash on the final argument, then rebase off
mainto remove unrelated diff noise.4cd232afca9b4e7dbcb8PR #333 Re-Review
DOMAIN REVIEW
Tech stack: Salt (state + pillar) + Jinja2 systemd unit template.
Jinja2 loop.last logic (the previous blocker) -- VERIFIED CORRECT:
Both trailing-backslash conditions handle all cases properly:
server_args loop (line 33):
{% if not loop.last or kubelet_args %} \{% endif %}False or True= True -> backslash (continuation needed)False or False= False -> no backslash (final arg)True or *= True -> backslash (always continue)kubelet_args loop (line 37):
{% if not loop.last %} \{% endif %}Salt requisites -- CORRECT:
cmd.wait(daemon-reload) only fires whenk3s-unit-filechanges (watch trigger)service.runningwatchesk3s-unit-file(restart on change) and requires bothk3s-binaryandk3s-daemon-reloadrequireoncmd.waitensures daemon-reload completes before the service state evaluates, even thoughcmd.waitis a no-op when nothing changesPillar structure: Clean separation of server_args vs kubelet_args. Default fallback in template (
['--disable=traefik']) provides safety if pillar is missing.BLOCKERS
None.
NITS
Line 40 of the template has a trailing blank line after the final
endfor. Cosmetic only -- systemd ignores it, butsystemd-analyze verifymay emit a warning on some versions. Non-blocking.The
server_argsdefault in the template (['--disable=traefik']) duplicates the pillar value. If someone removes it from pillar but expects the default to hold, it works -- but the intent could be documented with a comment in the template. Non-blocking.SOP COMPLIANCE
332-salt-k3s-maxpods)project-pal-e-platform)PROCESS OBSERVATIONS
salt-call state.apply k3s test=Truepassing, node reporting 250 allocatable pods). Strong evidence of correctness.VERDICT: APPROVED
Tofu Plan Output (full)