feat: Mac build agent Salt states with observability #227

Merged
forgejo_admin merged 2 commits from 174-mac-build-agent-salt-observability into main 2026-03-28 18:05:55 +00:00

Summary

Brings the Mac CI build agent (lucass-macbook-air-1) under Salt management with full observability -- Woodpecker agent, Promtail log shipping, node-exporter metrics, and a Grafana dashboard. Replaces the fragile manual SSH setup from #166.

Changes

Salt states (new: salt/states/mac-agent/)

  • init.sls -- orchestrator, includes all sub-states in dependency order
  • homebrew.sls -- managed packages (node, fastlane, cocoapods, watchman) via cmd.run with idempotent guards
  • magicdns.sls -- verifies Tailscale is running and MagicDNS resolves woodpecker/grafana before agent starts
  • woodpecker.sls -- downloads agent binary (darwin/arm64), creates launchd plist with KeepAlive, loads service
  • observability.sls -- installs Promtail + node-exporter via Homebrew, configures launchd plists for both
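For orientation, the launchd pieces can be sketched as a pair of Salt states. This is illustrative only: the state IDs, file paths, and launchctl invocation are assumptions, not the exact contents of woodpecker.sls.

```yaml
# Sketch only -- real state IDs and paths in woodpecker.sls may differ.
woodpecker-agent-plist:
  file.managed:
    - name: /Users/ldraney/Library/LaunchAgents/com.woodpecker-ci.agent.plist
    - source: salt://mac-agent/files/com.woodpecker-ci.agent.plist.jinja
    - template: jinja

woodpecker-agent-load:
  cmd.run:
    - name: launchctl load -w /Users/ldraney/Library/LaunchAgents/com.woodpecker-ci.agent.plist
    - unless: launchctl list | grep -q com.woodpecker-ci.agent
    - require:
      - file: woodpecker-agent-plist
```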

Salt pillar (new: salt/pillar/mac-agent.sls)

  • Agent config: server URL, filter labels (platform=darwin), max workflows
  • Loki push URL (via Tailscale subnet router to ClusterIP -- FIXME note for actual IP lookup)
  • Homebrew package list driven from pillar data
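A hedged sketch of the pillar shape (key names and the server URL are illustrative; only the FIXME placeholder and package list come from the PR):

```yaml
# Illustrative shape only; actual keys in salt/pillar/mac-agent.sls may differ.
woodpecker:
  server: https://woodpecker.example.ts.net   # assumed MagicDNS name
  filter_labels: platform=darwin
  max_workflows: 2
promtail:
  push_url: http://10.43.0.0:3100/loki/api/v1/push  # FIXME: replace with actual Loki ClusterIP
homebrew:
  packages:
    - node
    - fastlane
    - cocoapods
    - watchman
```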

Salt config changes

  • salt/pillar/top.sls -- targets lucass-macbook-air-1 with secrets.platform + mac-agent pillar
  • salt/states/top.sls -- assigns mac-agent state to the Mac minion
  • salt/master.conf -- listens on 0.0.0.0 (was 127.0.0.1) so the Mac can connect over Tailscale

Terraform/monitoring

  • additionalScrapeConfigs for Mac node-exporter (Tailscale IP, 30s interval)
  • Blackbox probe target for Mac agent health endpoint
  • MacAgentDown PrometheusRule alert (5m threshold, critical severity)
  • mac-agent-dashboard.json Grafana dashboard: status, CPU, memory, disk, uptime, agent logs (Loki)
  • New variable mac_agent_tailscale_ip (default 100.83.0.5:9100) wired through root -> monitoring module
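The variable pass-through follows the standard two-level Terraform pattern; a sketch (variable name and default match the PR, surrounding file structure is assumed):

```hcl
# terraform/variables.tf (root) -- the monitoring module declares the same
# variable with the same default.
variable "mac_agent_tailscale_ip" {
  description = "Tailscale host:port of the Mac node-exporter"
  type        = string
  default     = "100.83.0.5:9100"
}

# terraform/main.tf
module "monitoring" {
  source                 = "./modules/monitoring"
  mac_agent_tailscale_ip = var.mac_agent_tailscale_ip
}
```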

Test Plan

  • salt 'lucass-macbook-air-1' test.ping returns True (requires Salt minion installed on Mac)
  • salt 'lucass-macbook-air-1' state.apply mac-agent test=True shows no errors
  • tofu validate passes (verified in this PR)
  • After tofu apply: Grafana query {host="lucass-macbook-air-1"} returns agent logs
  • After tofu apply: Prometheus query up{job="mac-node-exporter"} returns 1
  • Mac agent visible in Woodpecker admin UI

Note: Loki push_url in pillar has a FIXME -- the actual Loki ClusterIP must be looked up with kubectl get svc loki-stack -n monitoring -o jsonpath='{.spec.clusterIP}' and set before first apply. The Mac reaches Loki via the Tailscale subnet router (10.43.0.0/16), not DNS.

Review Checklist

  • tofu fmt run
  • tofu validate passes
  • No secrets in state files (agent secret sourced from pillar)
  • No changes to existing archbox Salt states
  • No changes to Woodpecker Helm chart (server-side unchanged)

Related Notes

None.

Related

Closes #174

  • Supersedes: #166 (manual Mac setup)
feat: Mac build agent Salt states with Prometheus/Loki observability
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
5d436d2404
Brings the Mac CI build agent (lucass-macbook-air-1) under Salt
management with full observability, replacing the manual SSH setup
from #166.

Salt states (mac-agent):
- homebrew.sls: managed packages (node, fastlane, cocoapods, watchman)
- magicdns.sls: Tailscale MagicDNS verification before agent start
- woodpecker.sls: agent binary + launchd plist with KeepAlive
- observability.sls: Promtail (logs->Loki) + node-exporter (metrics)

Salt pillar:
- mac-agent.sls: agent config, Loki endpoint, homebrew package list

Infrastructure:
- Salt master listens on 0.0.0.0 (was 127.0.0.1) for Tailscale access
- Prometheus additionalScrapeConfigs for Mac node-exporter
- Blackbox probe target for Mac agent health
- MacAgentDown alert (5m threshold, critical severity)
- Grafana dashboard: CPU, memory, disk, uptime, agent logs

Closes #174

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

QA Review -- PR #227

Findings

1. [NIT] homebrew.sls -- unless: 'false' is a no-op guard
The homebrew-available state uses unless: 'false' which means the unless command always exits non-zero, so the brew --version command always runs. This is functionally fine (verifies brew exists every apply) but the unless line is confusing noise. Either remove the unless entirely or add a comment explaining intent.
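A minimal sketch of the suggested cleanup (drop the no-op guard and document the intent; state ID taken from the finding):

```yaml
homebrew-available:
  cmd.run:
    - name: brew --version
    # Intentionally unguarded: verifies Homebrew exists on every apply.
```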

2. [MEDIUM] woodpecker.sls -- plist renders agent secret in plaintext on disk
The launchd plist at ~/Library/LaunchAgents/com.woodpecker-ci.agent.plist contains {{ agent_secret }} which Salt will render as the actual secret value. This is a plaintext secret on the Mac filesystem readable by the user. This is the standard launchd pattern (no alternative for env vars), but worth noting -- the plist file permissions are 0644 (world-readable). Consider 0600.
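The fix is a one-line mode change on the managed file; sketch with an assumed state ID:

```yaml
woodpecker-agent-plist:
  file.managed:
    - name: /Users/ldraney/Library/LaunchAgents/com.woodpecker-ci.agent.plist
    - template: jinja
    - mode: '0600'   # plist embeds the rendered agent secret
```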

3. [NIT] observability.sls -- Promtail config fallback URL is unreachable
The Jinja default for push_url is http://loki-stack.monitoring.svc.cluster.local:3100/loki/api/v1/push but the PR body and pillar both correctly note that .svc.cluster.local DNS does not resolve on the Mac. The fallback should match the pillar default (ClusterIP), not the unreachable DNS name.

4. [INFO] mac-agent.sls pillar -- FIXME for Loki ClusterIP
The push_url has a FIXME with placeholder 10.43.0.0. This is acknowledged in the PR body. Fine for initial merge -- must be resolved before first state.apply.

5. [NIT] master.conf -- 0.0.0.0 broadens listen surface
The change from 127.0.0.1 to 0.0.0.0 is necessary for the Mac minion. The comment correctly notes nftables protection. Verified: the firewall state allows all traffic on tailscale0 and lo only, so Salt ports (4505/4506) are not exposed on the physical NIC. Correct.

6. [NIT] Dashboard JSON -- hardcoded datasource UIDs
The dashboard uses "uid": "prometheus" and "uid": "loki" which are conventional defaults for kube-prometheus-stack but not guaranteed. If the datasource UIDs differ in the actual Grafana instance, panels will show "No data." Low risk since this follows the existing dashboard pattern in the repo.

Summary

Clean implementation. The three deliverables (Salt states, Woodpecker agent, observability) are all present. Terraform validates. No secrets leaked into state files (agent secret sourced from pillar). No changes to existing archbox states.

The plist file permission (finding #2) is the only actionable item: change the woodpecker plist from 0644 to 0600, since it contains the agent secret.

VERDICT: APPROVE with nit -- fix plist permission to 0600 on com.woodpecker-ci.agent.plist.

fix: QA review nits -- plist perms, fallback URL, homebrew guard
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
b5725d3a5b
- woodpecker.sls: plist mode 0644 -> 0600 (contains agent secret)
- observability.sls: Promtail fallback URL uses ClusterIP, not DNS
- homebrew.sls: remove no-op `unless: 'false'` guard

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR #227 Review

DOMAIN REVIEW

Tech stack: Salt (YAML + Jinja states/pillar), Terraform/OpenTofu (HCL modules), Grafana dashboard JSON.

Salt state correctness

  1. State ordering is include-order only -- no cross-file require directives. init.sls lists includes in "dependency order" (homebrew, magicdns, woodpecker, observability), but Salt include order does NOT guarantee execution order. The woodpecker.sls states have require: cmd: woodpecker-agent-install internally, but nothing requires the magicdns states to complete first. If MagicDNS verification runs in parallel with or after the Woodpecker agent load, the stated design goal ("verify MagicDNS before agent starts") is not enforced. woodpecker-agent-load should require: cmd: magicdns-woodpecker-resolves. Similarly, promtail-load should require magicdns-grafana-resolves (Promtail pushes to Loki but the MagicDNS check for Grafana is the closest pre-flight; though Promtail actually uses the ClusterIP, not MagicDNS).

  2. Loki push_url FIXME is present and documented. The pillar has push_url: http://10.43.0.0:3100/loki/api/v1/push with a FIXME comment, and the PR body + Test Plan both call this out. The 10.43.0.0 is the base of the service CIDR, not the actual Loki ClusterIP. This is acknowledged as a pre-apply requirement. Acceptable as a known placeholder, but must be resolved before state.apply or Promtail will silently fail to ship logs.

  3. MISSING_SECRET fallback in woodpecker.sls line 15. If the GPG-encrypted pillar fails to decrypt (wrong key, missing GPG agent, etc.), agent_secret silently falls back to the literal string MISSING_SECRET, which will be rendered into the launchd plist in cleartext. The agent will then fail to auth with the server, but the plist file will exist on disk with a non-secret placeholder that could mask the real problem. Consider either: (a) failing the state explicitly if the secret is missing (e.g., a cmd.run that checks the value), or (b) using a Jinja raise to fail rendering.
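Option (b) could look like the following at the top of the template. Salt exposes a raise() Jinja global for exactly this; the pillar path shown is an assumption:

```jinja
{%- set agent_secret = salt['pillar.get']('secrets:woodpecker_agent_secret') %}
{%- if not agent_secret %}
{{ raise('woodpecker_agent_secret missing from pillar -- check GPG decryption') }}
{%- endif %}
```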

  4. runas pattern repeated 3 times. The Homebrew owner detection (stat -f "%Su" ... || echo ldraney) appears identically in homebrew.sls:22, observability.sls:23, and observability.sls:111. This is a DRY concern but NOT in an auth/security path -- it is a convenience pattern for detecting the brew owner. Not a blocker, but worth extracting to a Jinja macro or pillar variable in a follow-up.

  5. Woodpecker binary download uses /releases/latest/ -- no version pinning. The download URL is https://github.com/woodpecker-ci/woodpecker/releases/latest/download/woodpecker-agent_darwin_arm64.tar.gz. Combined with unless: test -x /usr/local/bin/woodpecker-agent, this means: (a) the first install gets whatever "latest" is at that moment, (b) subsequent applies never upgrade because the binary exists. This is pragmatic for a first deploy, but version pinning (pillar-driven) would make the state reproducible.
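A pillar-driven pin could look like this sketch. The version value and the --version output format are assumptions; the asset name matches the PR:

```yaml
{%- set ver = salt['pillar.get']('woodpecker:agent_version', '3.0.1') %}
woodpecker-agent-install:
  cmd.run:
    - name: >
        curl -fsSL
        https://github.com/woodpecker-ci/woodpecker/releases/download/v{{ ver }}/woodpecker-agent_darwin_arm64.tar.gz
        | tar -xz -C /usr/local/bin woodpecker-agent
    # Assumes the binary reports its version; re-runs on version bump.
    - unless: /usr/local/bin/woodpecker-agent --version 2>/dev/null | grep -q {{ ver }}
```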

  6. homebrew-available runs every apply. brew --version has no unless guard. This is intentional (verifies brew exists), but on every highstate it will show as "changed" in Salt output. Minor noise concern.

Terraform module wiring

  1. Variable wiring is clean. mac_agent_tailscale_ip is declared in terraform/variables.tf with a default, passed through main.tf to module.monitoring, and declared again in modules/monitoring/variables.tf. Consistent descriptions and defaults across both levels.

  2. Default duplicated at two levels. Both root variables.tf and modules/monitoring/variables.tf have default = "100.83.0.5:9100". If the default ever changes, both must be updated. Standard Terraform pattern when root vars have defaults, but worth noting.

  3. additionalScrapeConfigs placement is correct. Nested inside prometheusSpec where kube-prometheus-stack expects it.

  4. Blackbox probe URL construction: http://${var.mac_agent_tailscale_ip}/metrics -- since mac_agent_tailscale_ip includes the port (e.g., 100.83.0.5:9100), the URL becomes http://100.83.0.5:9100/metrics. This is correct for node-exporter.

  5. Dashboard ConfigMap follows existing pattern. Same structure as dora_dashboard, pal_e_docs_dashboard, uptime_dashboard -- grafana_dashboard = "1" label, depends_on = [helm_release.kube_prometheus_stack], file loaded via ${path.module}/../../dashboards/.

Security review

  1. interface: 0.0.0.0 in master.conf. Binding Salt master to all interfaces. The comment references nftables restricting to tailscale0 and lo. Verified: firewall.sls pillar defines allowed_interfaces: [tailscale0, lo] with default policy drop. Salt ports 4505/4506 are only reachable via Tailscale or localhost. The firewall claim checks out.

  2. Agent secret sourced from GPG-encrypted pillar. secrets.get('woodpecker_agent_secret', 'MISSING_SECRET') reads from secrets.platform which is GPG-encrypted. The secret is rendered into the launchd plist (file mode 0600). No secrets in state files, no credentials in code. The plist file permissions are appropriate.

  3. Tailscale IPs (100.x.x.x) are not public. These are CGNAT-range addresses used by Tailscale's overlay network. Not externally routable. Safe to have in pillar/variables.

  4. No tofu plan output included. PR conventions require tofu plan output for Terraform changes. The PR has a "Review Checklist" noting tofu validate passes, but no plan output is shown.

Grafana dashboard review

  1. Dashboard JSON is well-structured. Unique UID (mac-build-agent), appropriate tags, correct datasource references (prometheus and loki UIDs). Panel IDs are sequential and non-conflicting.

  2. Dashboard PromQL queries use hardcoded instance label. All queries filter on instance="lucass-macbook-air-1". This works for a single Mac agent but won't scale. Fine for current scope.

BLOCKERS

None. This PR has no test coverage requirement (infrastructure-as-code with manual test plan), no unvalidated user input, no secrets in code, and no DRY violations in auth/security paths. The FIXME placeholder is documented and acknowledged.

NITS

  1. Cross-file require directives missing. woodpecker-agent-load should explicitly require: cmd: magicdns-woodpecker-resolves to enforce the stated dependency order. Salt include order is not execution order. This is not a blocker because the MagicDNS checks are diagnostic (the agent would also fail if DNS is broken), but it contradicts the stated design in init.sls comments.
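The cross-file requisite is a small addition to woodpecker.sls (sketch; assumes the magicdns state ID named in the finding):

```yaml
include:
  - mac-agent.magicdns

woodpecker-agent-load:
  cmd.run:
    - name: launchctl load -w /Users/ldraney/Library/LaunchAgents/com.woodpecker-ci.agent.plist
    - require:
      - cmd: magicdns-woodpecker-resolves
```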

  2. MISSING_SECRET silent fallback. Consider failing the state explicitly rather than rendering a placeholder secret into the plist. A Jinja conditional or cmd.run assertion would surface pillar decryption failures at apply time instead of at agent connection time.

  3. Woodpecker agent binary not version-pinned. A pillar key like woodpecker.agent_version: 2.x.x with a versioned download URL would make the state reproducible. Current approach works for bootstrapping.

  4. runas Homebrew owner detection repeated 3x. Extract to a Jinja {% set brew_owner = ... %} at the top of each file, or centralize in pillar.

  5. No tofu plan output in PR body. PR conventions specify including plan output for Terraform changes. The changes are additive-only (new scrape config, new alert rule, new blackbox target, new ConfigMap), so risk is low, but the convention should be followed.

  6. MacAgentDown alert overlaps with generic TargetDown. The generic TargetDown (severity: warning, 5m) already covers up == 0. The Mac-specific alert (severity: critical, 5m) adds a descriptive annotation and escalated severity. This is intentional but will produce two alerts for the same event. Consider suppressing the generic one via an unless clause or inhibition rule if alert fatigue becomes a concern.

  7. Promtail positions file in /tmp. filename: /tmp/promtail-positions.yaml will be lost on reboot, causing Promtail to re-read and potentially duplicate log entries. A persistent location like /usr/local/var/promtail/positions.yaml would be more robust.
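The fix is a small config change plus a directory state (Promtail's positions stanza shown; the Salt state ID is an assumption):

```yaml
# promtail config fragment:
positions:
  filename: /usr/local/var/promtail/positions.yaml  # survives reboot, unlike /tmp

# companion Salt state to ensure the directory exists:
promtail-positions-dir:
  file.directory:
    - name: /usr/local/var/promtail
    - makedirs: True
```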

SOP COMPLIANCE

  • Branch named after issue (174-mac-build-agent-salt-observability matches #174)
  • PR body has: Summary, Changes, Test Plan, Related
  • Related references plan slug -- PR says "None" in Related Notes. No plan slug referenced.
  • No secrets committed (agent secret via GPG-encrypted pillar, no credentials in state files)
  • No unnecessary file changes (all changes scoped to Mac agent infrastructure + monitoring)
  • Commit messages are descriptive
  • tofu plan output not included (required by PR conventions for Terraform changes)

PROCESS OBSERVATIONS

Deployment frequency: This PR adds a new Salt-managed host and Terraform monitoring resources. The Salt states can be applied independently (state.apply mac-agent) without affecting the existing archbox minion. The Terraform changes are additive (new scrape config, alert, blackbox target, dashboard ConfigMap) and should not disrupt existing monitoring. Low change failure risk.

Pre-apply gate: The FIXME on Loki push_url must be resolved before the first state.apply. The PR body documents this clearly, but it means this PR is not fully deployable as-is -- it requires a manual lookup step.

Discovered scope: Issue #179 (Woodpecker agent secret duplication) is related -- the Mac agent now has a second consumer of the same shared secret. Worth verifying alignment.

VERDICT: APPROVED

The PR is well-structured with clean separation between Salt states, pillar data, and Terraform monitoring. No blockers. The FIXME is documented. The nits above (cross-file require directives, MISSING_SECRET fallback, version pinning, tofu plan output) are improvements for a follow-up.

forgejo_admin deleted branch 174-mac-build-agent-salt-observability 2026-03-28 18:05:55 +00:00