feat: Mac build agent Salt states with observability #227
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo_admin/pal-e-platform!227
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "174-mac-build-agent-salt-observability"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Brings the Mac CI build agent (lucass-macbook-air-1) under Salt management with full observability -- Woodpecker agent, Promtail log shipping, node-exporter metrics, and a Grafana dashboard. Replaces the fragile manual SSH setup from #166.
Changes
Salt states (new:
salt/states/mac-agent/)init.sls-- orchestrator, includes all sub-states in dependency orderhomebrew.sls-- managed packages (node, fastlane, cocoapods, watchman) viacmd.runwith idempotent guardsmagicdns.sls-- verifies Tailscale is running and MagicDNS resolves woodpecker/grafana before agent startswoodpecker.sls-- downloads agent binary (darwin/arm64), creates launchd plist with KeepAlive, loads serviceobservability.sls-- installs Promtail + node-exporter via Homebrew, configures launchd plists for bothSalt pillar (new:
salt/pillar/mac-agent.sls)platform=darwin), max workflowsSalt config changes
salt/pillar/top.sls-- targetslucass-macbook-air-1withsecrets.platform+mac-agentpillarsalt/states/top.sls-- assignsmac-agentstate to the Mac minionsalt/master.conf-- listens on0.0.0.0(was127.0.0.1) so the Mac can connect over TailscaleTerraform/monitoring
additionalScrapeConfigsfor Mac node-exporter (Tailscale IP, 30s interval)MacAgentDownPrometheusRule alert (5m threshold, critical severity)mac-agent-dashboard.jsonGrafana dashboard: status, CPU, memory, disk, uptime, agent logs (Loki)mac_agent_tailscale_ip(default100.83.0.5:9100) wired through root -> monitoring moduleTest Plan
salt 'lucass-macbook-air-1' test.pingreturns True (requires Salt minion installed on Mac)salt 'lucass-macbook-air-1' state.apply mac-agent test=Trueshows no errorstofu validatepasses (verified in this PR)tofu apply: Grafana query{host="lucass-macbook-air-1"}returns agent logstofu apply: Prometheus queryup{job="mac-node-exporter"}returns 1Note: Loki
push_urlin pillar has a FIXME -- the actual Loki ClusterIP must be looked up withkubectl get svc loki-stack -n monitoring -o jsonpath='{.spec.clusterIP}'and set before first apply. The Mac reaches Loki via the Tailscale subnet router (10.43.0.0/16), not DNS.Review Checklist
tofu fmtruntofu validatepassesRelated Notes
None.
Related
Closes #174
QA Review -- PR #227
Findings
1. [NIT]
homebrew.sls--unless: 'false'is a no-op guardThe
homebrew-availablestate usesunless: 'false'which means theunlesscommand always exits non-zero, so thebrew --versioncommand always runs. This is functionally fine (verifies brew exists every apply) but theunlessline is confusing noise. Either remove theunlessentirely or add a comment explaining intent.2. [MEDIUM]
woodpecker.sls-- plist renders agent secret in plaintext on diskThe launchd plist at
~/Library/LaunchAgents/com.woodpecker-ci.agent.plistcontains{{ agent_secret }}which Salt will render as the actual secret value. This is a plaintext secret on the Mac filesystem readable by the user. This is the standard launchd pattern (no alternative for env vars), but worth noting -- the plist file permissions are 0644 (world-readable). Consider 0600.3. [NIT]
observability.sls-- Promtail config fallback URL is unreachableThe Jinja default for
push_urlishttp://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/pushbut the PR body and pillar both correctly note that.svc.cluster.localDNS does not resolve on the Mac. The fallback should match the pillar default (ClusterIP), not the unreachable DNS name.4. [INFO]
mac-agent.slspillar -- FIXME for Loki ClusterIPThe
push_urlhas a FIXME with placeholder10.43.0.0. This is acknowledged in the PR body. Fine for initial merge -- must be resolved before firststate.apply.5. [NIT]
master.conf--0.0.0.0broadens listen surfaceThe change from
127.0.0.1to0.0.0.0is necessary for the Mac minion. The comment correctly notes nftables protection. Verified: the firewall state allows all traffic ontailscale0andloonly, so Salt ports (4505/4506) are not exposed on the physical NIC. Correct.6. [NIT] Dashboard JSON -- hardcoded datasource UIDs
The dashboard uses
"uid": "prometheus"and"uid": "loki"which are conventional defaults for kube-prometheus-stack but not guaranteed. If the datasource UIDs differ in the actual Grafana instance, panels will show "No data." Low risk since this follows the existing dashboard pattern in the repo.Summary
Clean implementation. The three deliverables (Salt states, Woodpecker agent, observability) are all present. Terraform validates. No secrets leaked into state files (agent secret sourced from pillar). No changes to existing archbox states.
The plist file permission (finding #2) is the only actionable item -- changing from 0644 to 0600 for the woodpecker plist since it contains the agent secret.
VERDICT: APPROVE with nit -- fix plist permission to 0600 on
com.woodpecker-ci.agent.plist.PR #227 Review
DOMAIN REVIEW
Tech stack: Salt (YAML + Jinja states/pillar), Terraform/OpenTofu (HCL modules), Grafana dashboard JSON.
Salt state correctness
State ordering is include-order only -- no cross-file
requiredirectives.init.slslists includes in "dependency order" (homebrew, magicdns, woodpecker, observability), but Salt include order does NOT guarantee execution order. Thewoodpecker.slsstates haverequire: cmd: woodpecker-agent-installinternally, but nothing requires themagicdnsstates to complete first. If MagicDNS verification runs in parallel with or after the Woodpecker agent load, the stated design goal ("verify MagicDNS before agent starts") is not enforced.woodpecker-agent-loadshouldrequire: cmd: magicdns-woodpecker-resolves. Similarly,promtail-loadshould requiremagicdns-grafana-resolves(Promtail pushes to Loki but the MagicDNS check for Grafana is the closest pre-flight; though Promtail actually uses the ClusterIP, not MagicDNS).Loki
push_urlFIXME is present and documented. The pillar haspush_url: http://10.43.0.0:3100/loki/api/v1/pushwith a FIXME comment, and the PR body + Test Plan both call this out. The10.43.0.0is the base of the service CIDR, not the actual Loki ClusterIP. This is acknowledged as a pre-apply requirement. Acceptable as a known placeholder, but must be resolved beforestate.applyor Promtail will silently fail to ship logs.MISSING_SECRETfallback inwoodpecker.slsline 15. If the GPG-encrypted pillar fails to decrypt (wrong key, missing GPG agent, etc.),agent_secretsilently falls back to the literal stringMISSING_SECRET, which will be rendered into the launchd plist in cleartext. The agent will then fail to auth with the server, but the plist file will exist on disk with a non-secret placeholder that could mask the real problem. Consider either: (a) failing the state explicitly if the secret is missing (e.g., acmd.runthat checks the value), or (b) using a Jinjaraiseto fail rendering.runaspattern repeated 3 times. The Homebrew owner detection (stat -f "%Su" ... || echo ldraney) appears identically inhomebrew.sls:22,observability.sls:23, andobservability.sls:111. This is a DRY concern but NOT in an auth/security path -- it is a convenience pattern for detecting the brew owner. Not a blocker, but worth extracting to a Jinja macro or pillar variable in a follow-up.Woodpecker binary download uses
/releases/latest/-- no version pinning. The download URL ishttps://github.com/woodpecker-ci/woodpecker/releases/latest/download/woodpecker-agent_darwin_arm64.tar.gz. Combined withunless: test -x /usr/local/bin/woodpecker-agent, this means: (a) the first install gets whatever "latest" is at that moment, (b) subsequent applies never upgrade because the binary exists. This is pragmatic for a first deploy, but version pinning (pillar-driven) would make the state reproducible.homebrew-availableruns every apply.brew --versionhas nounlessguard. This is intentional (verifies brew exists), but on every highstate it will show as "changed" in Salt output. Minor noise concern.Terraform module wiring
Variable wiring is clean.
mac_agent_tailscale_ipis declared interraform/variables.tfwith a default, passed throughmain.tftomodule.monitoring, and declared again inmodules/monitoring/variables.tf. Consistent descriptions and defaults across both levels.Default duplicated at two levels. Both root
variables.tfandmodules/monitoring/variables.tfhavedefault = "100.83.0.5:9100". If the default ever changes, both must be updated. Standard Terraform pattern when root vars have defaults, but worth noting.additionalScrapeConfigsplacement is correct. Nested insideprometheusSpecwhere kube-prometheus-stack expects it.Blackbox probe URL construction:
http://${var.mac_agent_tailscale_ip}/metrics-- sincemac_agent_tailscale_ipincludes the port (e.g.,100.83.0.5:9100), the URL becomeshttp://100.83.0.5:9100/metrics. This is correct for node-exporter.Dashboard ConfigMap follows existing pattern. Same structure as
dora_dashboard,pal_e_docs_dashboard,uptime_dashboard--grafana_dashboard = "1"label,depends_on = [helm_release.kube_prometheus_stack], file loaded via${path.module}/../../dashboards/.Security review
interface: 0.0.0.0inmaster.conf. Binding Salt master to all interfaces. The comment references nftables restricting totailscale0andlo. Verified:firewall.slspillar definesallowed_interfaces: [tailscale0, lo]with default policydrop. Salt ports 4505/4506 are only reachable via Tailscale or localhost. The firewall claim checks out.Agent secret sourced from GPG-encrypted pillar.
secrets.get('woodpecker_agent_secret', 'MISSING_SECRET')reads fromsecrets.platformwhich is GPG-encrypted. The secret is rendered into the launchd plist (file mode0600). No secrets in state files, no credentials in code. The plist file permissions are appropriate.Tailscale IPs (100.x.x.x) are not public. These are CGNAT-range addresses used by Tailscale's overlay network. Not externally routable. Safe to have in pillar/variables.
No
tofu planoutput included. PR conventions requiretofu planoutput for Terraform changes. The PR has a "Review Checklist" notingtofu validatepasses, but no plan output is shown.Grafana dashboard review
Dashboard JSON is well-structured. Unique UID (
mac-build-agent), appropriate tags, correct datasource references (prometheusandlokiUIDs). Panel IDs are sequential and non-conflicting.Dashboard PromQL queries use hardcoded
instancelabel. All queries filter oninstance="lucass-macbook-air-1". This works for a single Mac agent but won't scale. Fine for current scope.BLOCKERS
None. This PR has no test coverage requirement (infrastructure-as-code with manual test plan), no unvalidated user input, no secrets in code, and no DRY violations in auth/security paths. The FIXME placeholder is documented and acknowledged.
NITS
Cross-file
requiredirectives missing.woodpecker-agent-loadshould explicitlyrequire: cmd: magicdns-woodpecker-resolvesto enforce the stated dependency order. Salt include order is not execution order. This is not a blocker because the MagicDNS checks are diagnostic (the agent would also fail if DNS is broken), but it contradicts the stated design ininit.slscomments.MISSING_SECRETsilent fallback. Consider failing the state explicitly rather than rendering a placeholder secret into the plist. A Jinja conditional orcmd.runassertion would surface pillar decryption failures at apply time instead of at agent connection time.Woodpecker agent binary not version-pinned. A pillar key like
woodpecker.agent_version: 2.x.xwith a versioned download URL would make the state reproducible. Current approach works for bootstrapping.runasHomebrew owner detection repeated 3x. Extract to a Jinja{% set brew_owner = ... %}at the top of each file, or centralize in pillar.No
tofu planoutput in PR body. PR conventions specify including plan output for Terraform changes. The changes are additive-only (new scrape config, new alert rule, new blackbox target, new ConfigMap), so risk is low, but the convention should be followed.MacAgentDownalert overlaps with genericTargetDown. The genericTargetDown(severity: warning, 5m) already coversup == 0. The Mac-specific alert (severity: critical, 5m) adds a descriptive annotation and escalated severity. This is intentional but will produce two alerts for the same event. Consider suppressing the generic one via anunlessclause or inhibition rule if alert fatigue becomes a concern.Promtail positions file in
/tmp.filename: /tmp/promtail-positions.yamlwill be lost on reboot, causing Promtail to re-read and potentially duplicate log entries. A persistent location like/usr/local/var/promtail/positions.yamlwould be more robust.SOP COMPLIANCE
174-mac-build-agent-salt-observabilitymatches #174)tofu planoutput not included (required by PR conventions for Terraform changes)PROCESS OBSERVATIONS
Deployment frequency: This PR adds a new Salt-managed host and Terraform monitoring resources. The Salt states can be applied independently (
state.apply mac-agent) without affecting the existing archbox minion. The Terraform changes are additive (new scrape config, alert, blackbox target, dashboard ConfigMap) and should not disrupt existing monitoring. Low change failure risk.Pre-apply gate: The FIXME on Loki
push_urlmust be resolved before the firststate.apply. The PR body documents this clearly, but it means this PR is not fully deployable as-is -- it requires a manual lookup step.Discovered scope: Issue #179 (Woodpecker agent secret duplication) is related -- the Mac agent now has a second consumer of the same shared secret. Worth verifying alignment.
VERDICT: APPROVED
The PR is well-structured with clean separation between Salt states, pillar data, and Terraform monitoring. No blockers. The FIXME is documented. The nits above (cross-file require directives, MISSING_SECRET fallback, version pinning, tofu plan output) are improvements for a follow-up.