Close CFR gap: ArgoCD event integration for pal-e-deployments #6
Reference
ldraney/pal-e-dora-exporter#6
Type
Spike
Lineage
Migrated from ldraney/DORA#6.
Standalone — identified as measurement gap in DORA framework.
Repo
ldraney/pal-e-dora-exporter
User Story
As a platform operator
I want Tier 1 DORA composite in Grafana
So that I know if production pipelines are healthy without digging
Context
pal-e-deployments deploys via ArgoCD auto-sync, bypassing Woodpecker entirely, so the exporter has no visibility into sync success or failure. Options: an ArgoCD notifications webhook to the exporter, or ArgoCD's own Prometheus metrics (`argocd_app_sync_total`) if they are already exported.
No ArgoCD configuration was found in `pal-e-platform`, so it is currently unknown whether ArgoCD exports Prometheus metrics on this cluster.
Question
`argocd_app_sync_total` to Prometheus on this cluster?
Deliverables
`docs/argocd-cfr-decision.md` documenting the chosen approach with rationale, tradeoffs evaluated, and any ArgoCD configuration changes required
Time-box
2 hours max. If ArgoCD metrics discovery takes longer, escalate to Lucas for a decision on whether to pursue webhook approach without full investigation.
Constraints
Related
`dora-metrics` -- project this affects
Issue #6 Template Review
TEMPLATE CONFORMANCE
`### Type` -- present, value "Feature" matches `type:feature` label
`### Lineage` -- present, traces to ldraney/DORA#6
`### Repo` -- present, correct repo
`### User Story` -- present, follows As/I want/So that format
`### Context` -- present, non-empty, explains the ArgoCD gap
`### File Targets` -- present (see content quality notes below)
`### Feature Flag` -- present, "None required" with justification
`### Acceptance Criteria` -- present, uses `- [ ]` checkbox format
`### Test Expectations` -- present (see content quality notes below)
`### Constraints` -- present, non-empty
`### Checklist` -- present, standard format
`### Related` -- present
All 12 required sections are present and non-empty.
CONTENT QUALITY
File Targets -- ambiguous (BLOCKER)
The file targets present two alternative approaches ("if webhook approach" vs "if ArgoCD already exports to Prometheus") without committing to one. This is a feature ticket, not a spike. An agent picking this up will not know which path to implement. Either:
The path `src/collectors/argocd.py` is consistent with the existing repo structure (`src/collectors/forgejo.py` and `src/collectors/woodpecker.py` exist), so the webhook-approach target is plausible. But the Grafana dashboard alternative has no file path at all -- "Grafana dashboard queries" is not actionable for an agent.
File Targets -- missing "should NOT touch" list (nit)
Template calls for a "Files the agent should NOT touch" sub-section. Not present. Not a blocker but useful for scope guardrails.
Acceptance Criteria -- reasonable but couples to approach decision
The three criteria are testable and well-formed, but they assume the webhook/collector approach (they reference `dora_deployments_total{repo="pal-e-deployments"}`). If the alternative approach (direct ArgoCD Prometheus metrics in Grafana) is chosen, these criteria would not apply. This is downstream of the file targets ambiguity.
Test Expectations -- too thin
Only one integration test is listed. The template asks for both unit and integration tests. If a new `argocd.py` collector is built, it needs unit tests for the collector logic (mocked ArgoCD events) in addition to the integration test. Without explicit unit test expectations, the agent will likely skip them.
Context -- good
Clearly explains why pal-e-deployments is invisible to the exporter and names the two options. This is well-written.
User Story -- scope mismatch (nit)
The user story says "Tier 1 DORA composite in Grafana" but the acceptance criteria only address CFR for pal-e-deployments. The composite metric (combining DF, CFR, MTTR, CLT) is a larger scope than what this issue delivers. Consider narrowing the user story to match the actual deliverable: "I want CFR visibility for pal-e-deployments in Grafana."
BLOCKERS
NITS
VERDICT: NEEDS_REWORK
The issue is well-structured and all template sections are present, but the core file targets section is ambiguous -- it describes two alternative implementation paths without committing to one. This makes it unsuitable for agent execution. Recommend either: (a) run a spike first to determine the approach, then update this ticket with concrete targets, or (b) pick the webhook collector approach, lock in file targets, and add unit test expectations.
Converted to spike per QA review. The two implementation approaches (new webhook collector vs ArgoCD native Prometheus metrics) are mutually exclusive and need investigation before an agent can execute.
Additionally, no ArgoCD-related files were found in `pal-e-platform`, so we don't yet know whether ArgoCD metrics are even exposed on this cluster.
Spike deliverable: Decision comment on this issue recommending one approach with rationale.
Issue #6 Template Review (Re-review)
Previous review flagged ambiguous dual-approach scope. Issue was converted from Feature to Spike. Re-reviewing against `template-issue-spike`.
TEMPLATE CONFORMANCE
Required sections from `template-issue-spike`:
`### Type`
`### Lineage`: `ldraney/DORA#6` present
`### Repo`: `ldraney/pal-e-dora-exporter`
`### Question`: `### Investigation Questions` instead of `### Question`. Content exists but does not follow template format -- should be a top-level framing question with bullet sub-questions, not a numbered list.
`### Deliverables`: `### Spike Deliverables` instead of `### Deliverables`. Content does not match template requirements (see BLOCKERS below).
`### Time-box`
`### Related`
Extra sections not in spike template: `### User Story`, `### Context`, `### Constraints`. These are not harmful but `### User Story` is a Feature-template section -- spikes do not have user stories.
CONTENT QUALITY
What improved from the original Feature version:
What still needs work:
Question format: The template calls for a top-level framing question (yes/no or "which approach") with bullet sub-questions. Current format is a numbered list without a framing question. Should be restructured as:
Deliverables do not match template requirements. The template mandates two specific outputs:
`docs/{topic}.md` file (the durable artifact, merged via docs-only PR)
The current deliverables list a "decision document posted as a comment on this issue" -- this contradicts the template. Spike knowledge must live in a `docs/` file, not an issue comment. Comments are ephemeral; docs files are durable.
Time-box is missing entirely. Every spike must have a maximum time and an escalation clause ("if time-box expires without answer: close spike, document findings, escalate to Lucas for direction").
BLOCKERS
Missing `### Time-box` section -- Required by spike template. Without a time-box, spikes become open-ended investigations. Must include maximum time (e.g., "2 hours" or "1 session") and escalation clause.
Deliverables do not include `docs/` file -- Template requires `docs/{topic}.md` as a durable artifact. "Decision document posted as a comment" does not satisfy this. Change to: `docs/argocd-cfr-integration.md` (or similar).
Deliverables do not include follow-up tickets -- Template requires "Follow-up tickets created or existing tickets updated with refined scope." This is the whole point of a spike: investigation produces scoped work items.
Section headers do not match template -- `### Investigation Questions` should be `### Question`; `### Spike Deliverables` should be `### Deliverables`. Standardized headers enable tooling and agent parsing.
NITS
`### User Story` -- spikes do not have user stories per template guidance. The context section already covers the "why."
`### Context` and `### Constraints` are not in the template but contain useful information. Consider folding their content into the `### Question` section as background context, or keep them as supplementary sections below the required ones.
`### Related` section uses `dora-metrics` as the project slug -- verify this matches the actual pal-e-docs project slug.
VERDICT: NOT APPROVED
The conversion from Feature to Spike was the right call -- the investigation framing is much better. However, three required spike template sections are missing or non-conformant: `### Time-box` (absent), `### Deliverables` (wrong outputs), and `### Question` (wrong header and format). These are structural blockers per template conformance rules. Fix the section headers, add a time-box with escalation clause, and align deliverables to the two required spike outputs (`docs/` file + follow-up tickets).
Updated per re-review: added time-box, fixed section headers to match template-issue-spike, changed deliverable from issue comment to docs/ artifact, added follow-up tickets as required output.
Issue #6 Template Review (Re-review #2)
TEMPLATE CONFORMANCE
`### Type` present and valid (Spike)
`### Lineage` present and non-empty
`### Repo` present and non-empty
`### Question` present and non-empty (previously `### Investigation Questions` -- FIXED)
`### Deliverables` present and non-empty (previously `### Spike Deliverables` -- FIXED)
`### Time-box` present and non-empty (previously MISSING -- FIXED)
`### Related` present and non-empty
Extra sections present (not in template, not a problem): `### User Story`, `### Context`, `### Constraints`. These add useful context.
PREVIOUS BLOCKER RESOLUTION
All six previously identified blockers are resolved:
`### Time-box` section
`### Investigation Questions` header -> `### Question`
`### Spike Deliverables` header -> `### Deliverables`
`docs/` artifact -> `docs/argocd-cfr-decision.md` listed
`### Related` with clear dependency rationale
CONTENT QUALITY
Question section: Three numbered questions cover the investigation space well: (1) check if ArgoCD already exports metrics, (2) evaluate direct query path, (3) scope the webhook fallback. The decision tree is clear.
Deliverables: Both required spike outputs are present with checkboxes:
`docs/argocd-cfr-decision.md` -- durable artifact with specified content (rationale, tradeoffs, config changes)
Time-box: "2 hours max" with a specific escalation path: "escalate to Lucas for a decision on whether to pursue webhook approach without full investigation." This is concrete and actionable.
Related: #7 cross-reference includes dependency rationale ("deployment event source determines MTTR measurement approach"). Project slug `dora-metrics` is listed.
NITS
Question format: Template recommends framing the top-level as a single yes/no or "which approach" question with sub-questions as bullets. Current format uses three parallel numbered questions. Consider restructuring as: "What is the right deployment-event source for CFR calculation -- existing ArgoCD Prometheus metrics, or a new webhook integration?" with the three current questions as sub-bullets. Minor style point.
Deliverables checkbox text: The second deliverable could name the likely target repo (e.g., `pal-e-dora-exporter` or `pal-e-platform`) rather than "the appropriate repo." This is a minor specificity improvement, not a blocker since the spike's purpose is to determine exactly this.
VERDICT: APPROVED
Spike Complete -- ArgoCD CFR Investigation
Recommendation: Option A -- ServiceMonitor for ArgoCD native metrics
ArgoCD already emits `argocd_app_sync_total` with `phase` labels (Succeeded/Failed/Error) on port 8082. Adding a ServiceMonitor to scrape these metrics is the simplest path -- zero code changes to the exporter.
Key Findings
ArgoCD is deployed at `argocd-server.argocd.svc.cluster.local` in the `argocd` namespace, but it is NOT managed by Terraform (no module exists). It appears as a blackbox probe target only.
No ServiceMonitor exists for ArgoCD, so Prometheus is not scraping ArgoCD's built-in metrics endpoint. The `serviceMonitorSelectorNilUsesHelmValues = false` setting means a newly added ServiceMonitor will be auto-discovered.
CFR is calculable in pure PromQL without any exporter changes:
Three options evaluated (ServiceMonitor vs webhook vs API polling). ServiceMonitor wins on effort (~1hr vs 4-8hr), risk, and simplicity.
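To make the CFR arithmetic behind the ServiceMonitor option concrete, here is a minimal Python sketch. It is not the query from the decision record. It assumes that syncs in the `Failed` and `Error` phases both count as failed changes, that the ArgoCD Application is named `pal-e-deployments`, and that a 7-day window is wanted; the names `change_failure_rate`, `FAILURE_PHASES`, and `CFR_PROMQL` are hypothetical.

```python
from typing import Mapping, Optional

# Assumption: both of these sync phases count as a failed change.
FAILURE_PHASES = frozenset({"Failed", "Error"})

# Hypothetical PromQL expressing the same ratio. The Application name
# and the 7d window are assumptions, not values from the decision record.
CFR_PROMQL = (
    'sum(increase(argocd_app_sync_total{name="pal-e-deployments",phase=~"Failed|Error"}[7d]))'
    " / "
    'sum(increase(argocd_app_sync_total{name="pal-e-deployments"}[7d]))'
)


def change_failure_rate(sync_counts: Mapping[str, float]) -> Optional[float]:
    """Return failed syncs divided by all syncs, or None if there were no syncs.

    sync_counts maps a sync phase (e.g. "Succeeded") to the number of syncs
    that finished in that phase during the window of interest.
    """
    total = sum(sync_counts.values())
    if total == 0:
        return None  # no deployments in the window, so CFR is undefined
    failed = sum(n for phase, n in sync_counts.items() if phase in FAILURE_PHASES)
    return failed / total


if __name__ == "__main__":
    # 2 of 20 syncs failed in this made-up window.
    print(change_failure_rate({"Succeeded": 18, "Failed": 1, "Error": 1}))
```

Returning `None` for an empty window mirrors the division-by-zero case in PromQL, where the ratio yields no usable sample; a dashboard panel would need to handle that case explicitly.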
Prerequisite -- Manual Cluster Check
Before creating follow-up tickets, confirm metrics are emitted:
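The original check command is not preserved in this thread. As one hedged way to evaluate the result, the sketch below parses Prometheus text-format output (however it was fetched from the metrics endpoint on port 8082) and reports which `phase` values `argocd_app_sync_total` carries. The function name `sync_phases_seen` and the sample payload are hypothetical, and the parser is deliberately simple: it assumes label values contain no `}` characters.

```python
import re
from typing import Dict

# Matches one sample line of the counter, capturing its label set and value.
SAMPLE_LINE = re.compile(r'^argocd_app_sync_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
PHASE_LABEL = re.compile(r'phase="([^"]*)"')


def sync_phases_seen(metrics_text: str) -> Dict[str, float]:
    """Sum argocd_app_sync_total samples per phase across all applications."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        sample = SAMPLE_LINE.match(line)
        if sample is None:
            continue  # comment lines (# HELP / # TYPE) and other metrics
        phase = PHASE_LABEL.search(sample.group("labels"))
        if phase is None:
            continue  # sample without a phase label; ignore it
        name = phase.group(1)
        totals[name] = totals.get(name, 0.0) + float(sample.group("value"))
    return totals


if __name__ == "__main__":
    # Made-up payload in the Prometheus text exposition format.
    sample_payload = "\n".join([
        "# TYPE argocd_app_sync_total counter",
        'argocd_app_sync_total{name="app-a",namespace="argocd",phase="Succeeded",project="default"} 12',
        'argocd_app_sync_total{name="app-a",namespace="argocd",phase="Failed",project="default"} 3',
    ])
    print(sync_phases_seen(sample_payload))
```

An empty result would suggest the metric is not being emitted (or no syncs have occurred yet), which is the condition the cluster check is meant to rule out.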
Follow-up Tickets (pending cluster check)
Deliverable
Full decision record: `docs/argocd-cfr-decision.md` -- PR #8