docs: ArgoCD CFR spike decision record (#6) #8

Merged
ldraney merged 1 commit from 6-argocd-cfr-spike into main 2026-06-13 20:05:42 +00:00

Summary

  • Investigates how to instrument ArgoCD deployments for Change Failure Rate tracking
  • Evaluates three options: native Prometheus metrics (ServiceMonitor), webhook notifications, and API polling
  • Recommends Option A: ServiceMonitor for ArgoCD native metrics -- zero code changes to the exporter, lowest effort, lowest risk

Key Findings

  • ArgoCD is deployed at argocd-server.argocd.svc.cluster.local but has no ServiceMonitor -- Prometheus is not scraping its metrics
  • ArgoCD's application controller natively emits argocd_app_sync_total with phase labels (Succeeded/Failed/Error) on its metrics port, 8082
  • CFR can be calculated directly in PromQL without building a new collector
  • A manual cluster check is needed to confirm metrics are actually being emitted before creating follow-up tickets
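
As a rough aid for the manual cluster check, the sketch below tallies argocd_app_sync_total by phase from Prometheus text-format output. It assumes the text was already fetched from the controller's metrics endpoint (for example via a port-forward to 8082). The sample lines, app names, and counts are made up for illustration, and sync_totals_by_phase is a hypothetical helper, not part of the exporter.

```python
import re
from collections import defaultdict

# Illustrative sample only: label values and counts are invented, not
# captured from the real cluster.
SAMPLE_METRICS = """\
# TYPE argocd_app_sync_total counter
argocd_app_sync_total{dest_server="https://kubernetes.default.svc",name="app-a",namespace="argocd",phase="Succeeded",project="default"} 12
argocd_app_sync_total{dest_server="https://kubernetes.default.svc",name="app-a",namespace="argocd",phase="Failed",project="default"} 2
argocd_app_sync_total{dest_server="https://kubernetes.default.svc",name="app-b",namespace="argocd",phase="Error",project="default"} 1
"""

# Matches one sample line of the counter: name, label set, numeric value.
SAMPLE_LINE = re.compile(
    r'^argocd_app_sync_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
PHASE_LABEL = re.compile(r'phase="([^"]*)"')


def sync_totals_by_phase(metrics_text: str) -> dict[str, float]:
    """Sum argocd_app_sync_total samples per phase label value."""
    totals: dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        phase = PHASE_LABEL.search(match.group("labels"))
        if phase:
            totals[phase.group(1)] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    totals = sync_totals_by_phase(SAMPLE_METRICS)
    if totals:
        print("argocd_app_sync_total is being emitted:", totals)
    else:
        print("no argocd_app_sync_total samples found")
```

An empty result would suggest the metric is not being emitted (or no syncs have happened yet), which is the condition the follow-up tickets depend on.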

Deliverable

  • docs/argocd-cfr-decision.md -- full decision record with options evaluated, pros/cons, and follow-up tickets

Test Plan

  • [ ] Decision document reviewed for completeness
  • [ ] Follow-up tickets scoped correctly per recommendation
  • [ ] Manual cluster check items are actionable

Closes #6

🤖 Generated with Claude Code

Investigation for issue #6. ArgoCD already emits argocd_app_sync_total
with phase labels (Succeeded/Failed/Error) on port 8082. Adding a
ServiceMonitor to scrape these is simpler and lower-risk than building
a new webhook collector or API-polling collector in the exporter.

Closes #6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PR #8 Review

DOMAIN REVIEW

Stack detected: Docs-only spike (ArgoCD / Prometheus / Kubernetes / PromQL domain)

Technical accuracy -- all verified correct:

  • ArgoCD metrics ports (8082 controller, 8083 server, 8084 repo-server) match official ArgoCD docs
  • Metric names (argocd_app_sync_total, argocd_app_info, argocd_app_sync_duration_seconds_total) are accurate
  • Phase label values (Succeeded, Failed, Error, Running) are correct
  • The PromQL CFR query is sound: uses increase() over counters with phase filtering, and clamp_min(..., 1) correctly prevents division-by-zero
  • serviceMonitorSelectorNilUsesHelmValues = false is a real and operationally important kube-prometheus-stack setting -- good catch documenting it
  • The prerequisite kubectl commands are correct and actionable
  • ServiceMonitor approach is the industry-standard pattern for ArgoCD metrics ingestion
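
For readers without the decision doc open, here is a sketch of the query shape described above (increase() over the counter, phase filtering, and clamp_min(..., 1) on the denominator). It is a reconstruction from this review's description and not necessarily the exact query in docs/argocd-cfr-decision.md. The 7d default window, the choice to count both Failed and Error as failures, and the cfr_query helper name are all assumptions.

```python
def cfr_query(window: str = "7d", metric: str = "argocd_app_sync_total") -> str:
    """Build a PromQL expression for Change Failure Rate over `window`.

    Assumptions (not confirmed by the decision doc): Failed and Error both
    count as failed changes, and the default window is 7 days.
    """
    # Numerator: syncs that ended in a failure phase during the window.
    failed = f'sum(increase({metric}{{phase=~"Failed|Error"}}[{window}]))'
    # Denominator: all finished syncs; Running is deliberately left out.
    finished = (
        f'sum(increase({metric}{{phase=~"Succeeded|Failed|Error"}}[{window}]))'
    )
    # clamp_min keeps the denominator at 1 or above so a quiet window
    # yields 0 rather than a division by zero.
    return f"{failed} / clamp_min({finished}, 1)"


if __name__ == "__main__":
    print(cfr_query("30d"))
```

Generating the string from one place keeps the numerator and denominator filters from drifting apart if the query is reused in several dashboard panels.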

Three-option analysis quality:

  • Options A/B/C are well-differentiated with realistic effort estimates (1h / 4-6h / 6-8h)
  • Pros/cons are balanced and honest -- the cons for Option A (split metric namespace) are real and acknowledged rather than swept under the rug
  • The webhook fallback (Option B) correctly identifies the ArgoCD Notifications ConfigMap requirement and per-app annotation overhead
  • Option C (API polling) correctly flags the auth requirement as a differentiating cost

Cross-references:

  • Dependency on issue #7 (MTTR) is correctly noted -- the CFR event source informs MTTR failure/recovery timestamp design
  • References to src/collectors/woodpecker.py and terraform/modules/monitoring/main.tf are contextually appropriate for the pal-e-dora-exporter and pal-e-platform repos respectively

Minor PromQL note (nit): The phase=~"Succeeded|Failed|Error" regex in the denominator excludes Running syncs. This is correct for CFR calculation (in-progress syncs are neither success nor failure), but worth a one-line comment in the doc explaining the exclusion so future readers do not wonder if it is a bug.
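
To make the exclusion concrete, here is a small arithmetic sketch of the same calculation outside PromQL. The counts are invented, change_failure_rate is a hypothetical helper, and treating both Failed and Error as failures is an assumption rather than something the decision doc is known to specify.

```python
# Phases that represent a finished sync; Running is in progress and is
# therefore neither a success nor a failure.
TERMINAL_PHASES = ("Succeeded", "Failed", "Error")
FAILURE_PHASES = ("Failed", "Error")  # assumption: both count as failures


def change_failure_rate(sync_counts: dict[str, int]) -> float:
    """Return failed syncs / finished syncs, ignoring non-terminal phases."""
    failed = sum(sync_counts.get(phase, 0) for phase in FAILURE_PHASES)
    finished = sum(sync_counts.get(phase, 0) for phase in TERMINAL_PHASES)
    # Same role as clamp_min(..., 1): avoid dividing by zero when no sync
    # has finished in the window.
    return failed / max(finished, 1)


if __name__ == "__main__":
    # Made-up counts: the 5 Running syncs do not affect the result.
    counts = {"Succeeded": 17, "Failed": 2, "Error": 1, "Running": 5}
    print(change_failure_rate(counts))
```

If Running were included in the denominator, the rate would dip while syncs are in flight and then rise again as they finish, even though nothing about the failure count changed.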

BLOCKERS

None. This is a docs-only spike with no code changes, no secrets, and no security surface.

NITS

  1. PR body missing ## Changes section. SOP template (template-pr-body) expects ## Summary, ## Changes, ## Test Plan, ## Review Checklist, ## Related Notes. The ## Key Findings and ## Deliverable sections are useful but non-standard. For a docs-only spike this is cosmetic, not blocking.

  2. PR body missing ## Related Notes section. Should reference the plan slug or related pal-e-docs notes if any exist for this spike.

  3. PromQL comment suggestion. Add a brief inline comment in the decision doc explaining why Running phase is excluded from the denominator (see domain review above).

  4. Singular "Deliverable" header in PR body. Minor -- "Deliverables" (plural) would be consistent with spike template conventions.

  5. Reference link validity. The oneuptime.com/blog link (reference #3) may be a hallucinated URL -- worth verifying it resolves before merge. The ArgoCD docs links are standard and should be fine.

SOP COMPLIANCE

  • [x] Branch named after issue (6-argocd-cfr-spike)
  • [x] Closes #6 present in PR body
  • [x] No secrets, .env files, or credentials committed
  • [x] No unnecessary file changes (single file, on-topic)
  • [x] Commit message is descriptive
  • [ ] PR body missing ## Changes section (nit for docs-only PR)
  • [ ] PR body missing ## Related Notes section (nit for docs-only PR)
  • [x] Test Plan section present with checkboxes
  • [x] Scope is tight -- single deliverable matching spike intent

PROCESS OBSERVATIONS

  • Deployment frequency impact: None -- docs only, no deployment artifact.
  • Change failure risk: Minimal -- no code, no config, no infra changes.
  • Follow-up ticket quality: The conditional follow-up structure (if metrics emitted -> Option A tickets, if not -> Option B tickets) is well-designed. The tickets correctly target the right repos (pal-e-platform for ServiceMonitor/dashboard, pal-e-dora-exporter for webhook fallback).
  • Spike value: High. The investigation eliminates 6-8 hours of unnecessary collector development by identifying that ArgoCD natively provides the needed metrics. The prerequisite cluster check is the right gate before committing to implementation tickets.

VERDICT: APPROVED

ldraney deleted branch 6-argocd-cfr-spike 2026-06-13 20:05:42 +00:00