ldraney/pal-e-dora-exporter

MTTR: failure event detection #7

Closed

opened 2026-06-13 18:34:25 +00:00 by ldraney · 3 comments

ldraney commented

2026-06-13 18:34:25 +00:00

Owner

Type

Feature

Lineage

New issue — created as part of DORA measurement expansion (2026-06-13).

Dependencies

Depends on #6 (ArgoCD spike) for ArgoCD failure detection. This ticket scopes to Woodpecker pipeline failures only.

Repo

ldraney/pal-e-dora-exporter

User Story

As a platform operator
I want failure events detected and recorded
So that MTTR (Mean Time to Recovery) can be calculated from real data

What

Add failure event detection to the DORA exporter so true MTTR can be calculated. Scoped to Woodpecker pipeline failures only — ArgoCD failure detection is deferred to #6.

Why

The exporter currently only tracks dora_deployment_last_success_timestamp. No failure events are captured, so true MTTR cannot be calculated — it requires knowing when a failure occurred and when the next success restored service.
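
A minimal sketch of the calculation this paragraph describes, assuming failure and success events are available as `(unix_timestamp, status)` pairs. The function names and the event shape are hypothetical and are not taken from the exporter. Consecutive failures are treated as a single outage that starts at the first failure.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical event shape: (unix timestamp, "success" or "failure").
Event = Tuple[int, str]


def recovery_durations(events: Iterable[Event]) -> List[int]:
    """Return seconds from the first failure of each outage to the next success."""
    durations: List[int] = []
    outage_start: Optional[int] = None
    for ts, status in sorted(events):
        if status == "failure" and outage_start is None:
            # First failure after a healthy period opens an outage.
            outage_start = ts
        elif status == "success" and outage_start is not None:
            # The next success closes the outage.
            durations.append(ts - outage_start)
            outage_start = None
    return durations


def mean_time_to_recovery(events: Iterable[Event]) -> Optional[float]:
    """Mean of all recovery durations, or None if no outage has been closed."""
    durations = recovery_durations(events)
    return sum(durations) / len(durations) if durations else None


events = [
    (1000, "success"),
    (1600, "failure"),
    (1900, "failure"),  # same outage, still failing
    (2500, "success"),  # recovered after 900 s
    (4000, "failure"),
    (4300, "success"),  # recovered after 300 s
]
print(mean_time_to_recovery(events))  # 600.0
```

With only the last success timestamp available, the outage start is unknown, which is why a failure timestamp is needed.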

Context

The exporter needs to detect Woodpecker pipeline failures and emit failure timestamps. ArgoCD sync failure detection depends on #6 (ArgoCD integration), which has not landed yet. This closes the last gap in the four DORA metrics for Woodpecker-tracked deployments.

File Targets

src/collectors/woodpecker.py — add failure event tracking and failure timestamp metric (metrics are defined inline at module level in this file)
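
One possible shape for the failure tracking, as a sketch only. It assumes each Woodpecker pipeline record is a dict with a `status` string and a `finished` Unix timestamp. The helper names, the set of failure statuses, and the `repo` label are hypothetical. The metric line is rendered by hand in Prometheus text format to keep the example dependency-free; the real collector would presumably reuse its existing module-level metric objects instead.

```python
from typing import Dict, Iterable, Optional

# Assumption: these status strings mark a failed pipeline run.
FAILURE_STATUSES = {"failure", "error"}


def last_failure_timestamp(pipelines: Iterable[Dict]) -> Optional[int]:
    """Return the finish time of the most recent failed pipeline, if any."""
    failed = [
        p["finished"]
        for p in pipelines
        if p.get("status") in FAILURE_STATUSES and p.get("finished")
    ]
    return max(failed) if failed else None


def render_metric(repo: str, timestamp: int) -> str:
    """Format one sample in Prometheus text exposition format."""
    return f'dora_deployment_failure_timestamp{{repo="{repo}"}} {timestamp}'


# Mock data; field names are assumed, values are illustrative.
pipelines = [
    {"number": 41, "status": "success", "finished": 1718300000},
    {"number": 42, "status": "failure", "finished": 1718303600},
    {"number": 43, "status": "error", "finished": 1718304000},
    {"number": 44, "status": "running", "finished": 0},
]

ts = last_failure_timestamp(pipelines)
if ts is not None:
    print(render_metric("ldraney/pal-e-dora-exporter", ts))
```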

Feature Flag

None required — new metric is additive and does not change existing behavior.

Acceptance Criteria

Exporter emits dora_deployment_failure_timestamp or equivalent metric
Failure events captured from Woodpecker pipeline failures
MTTR calculable in Grafana as time between failure and next success

Test Expectations

Unit test: failure event detection from mock Woodpecker data
Unit test: failure timestamp metric emitted correctly

Constraints

Must not break existing success-tracking metrics
Failure detection should be reliable — avoid false positives
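
To make the false-positive constraint concrete, here is a hedged sketch of a filter that counts a pipeline as a deployment failure only when it finished in a failed state on a deploy-relevant trigger. The `event` and `branch` fields, the status and event strings, and the default branch name are all assumptions about the Woodpecker payload and this project's deploy flow, not confirmed by the issue.

```python
from typing import Dict

# Assumptions: cancelled or skipped runs are not failures, and only
# these triggers lead to a deployment.
TERMINAL_FAILURES = frozenset({"failure", "error"})
DEPLOY_EVENTS = frozenset({"push", "tag", "deployment"})


def is_deployment_failure(pipeline: Dict, deploy_branch: str = "main") -> bool:
    """Return True only for finished, failed pipelines that would have deployed."""
    if pipeline.get("status") not in TERMINAL_FAILURES:
        return False  # still running, cancelled, skipped, or successful
    if pipeline.get("event") not in DEPLOY_EVENTS:
        return False  # e.g. pull request checks never deploy
    if pipeline.get("event") == "push" and pipeline.get("branch") != deploy_branch:
        return False  # pushes to feature branches are not deployments
    return bool(pipeline.get("finished"))  # require a recorded finish time


failed_deploy = {"status": "failure", "event": "push", "branch": "main", "finished": 1718303600}
failed_pr = {"status": "failure", "event": "pull_request", "branch": "fix", "finished": 1718303700}
cancelled = {"status": "killed", "event": "push", "branch": "main", "finished": 1718303800}

print([is_deployment_failure(p) for p in (failed_deploy, failed_pr, cancelled)])
```

Keeping the rules in one predicate makes each exclusion easy to cover with its own unit test.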

Checklist

PR opened
Tests pass
No unrelated changes

Related

dora-metrics — project this affects
#6 — ArgoCD integration (required for ArgoCD failure detection, out of scope here)

ldraney added the type:feature label

2026-06-13 18:36:00 +00:00

ldraney commented

2026-06-13 19:16:32 +00:00

Author

Owner

Issue #7 Template Review

TEMPLATE CONFORMANCE

Validated against template-issue-feature from pal-e-docs.

### Type -- present and valid ("Feature")
### Lineage -- present and non-empty
### Repo -- present and correct
### User Story -- present, follows As a / I want / So that format
### Context -- present with sufficient background
### File Targets -- present (but see blockers below)
### Feature Flag -- present ("None required" -- reasonable for additive metric)
### Acceptance Criteria -- present with checkboxes
### Test Expectations -- present with checkboxes
### Constraints -- present and non-empty
### Checklist -- present with checkboxes
### Related -- present

Extra sections not in template: ### What, ### Why -- these add useful context and are not a problem.

BLOCKERS

1. File target src/metrics.py does not exist and is not marked as a new file.
The repo has no src/metrics.py. The File Targets section says "add failure timestamp metric definition" as though the file already exists. If this is a new file to be created, say so explicitly (e.g., src/metrics.py -- create -- define failure timestamp metric). If the metric definition belongs in an existing file, point to the correct one.

2. File target src/collectors/argocd.py does not exist -- dependency on #6 not called out.
The issue hedges with "(if exists)" but the ArgoCD collector is the subject of issue #6 ("Close CFR gap: ArgoCD event integration"). This creates an implicit ordering dependency: if #6 ships first, this issue's ArgoCD scope is real; if not, it is dead code. This must be stated explicitly as a dependency or the ArgoCD acceptance criteria should be removed and deferred to a follow-up after #6 lands.

3. Related section has a vague cross-reference.
"ArgoCD event integration issue -- related capability" does not name the issue number. This should be #6 so agents and humans can trace the dependency. The template calls for specific cross-references, not descriptions.

NITS

1. Missing "Files the agent should NOT touch" subsection. The template includes a negative-targeting section. Even a brief "N/A" or "None" is better than omission -- it signals the author considered scope boundaries.
2. Test Expectations missing run command. Template calls for a Run command: line (e.g., pytest tests/ -k test_failure). Currently no test directory exists in the repo at all -- the issue should note whether a tests/ directory needs to be created.
3. No spec file names in Test Expectations. Template guidance says "names specific spec files." The tests are described generically ("Unit test: failure event detection from mock Woodpecker data") but should name target files (e.g., tests/test_woodpecker_failures.py).
4. Acceptance criterion #4 is a Grafana concern, not a code deliverable. "MTTR calculable in Grafana as time between failure and next success" is a validation step, not something the exporter PR can satisfy. Consider moving it to a separate validation step or rephrasing as "metric is queryable via PromQL to calculate MTTR."

SCOPE ASSESSMENT

The scope is a single deployable unit IF the ArgoCD dependency is resolved. Without #6, this is purely Woodpecker failure detection + a new metric definition -- clean and focused. With the ArgoCD hedging left in, the scope is ambiguous and could expand unpredictably when #6 lands.

Recommendation: Scope this issue to Woodpecker failure detection only. Create a follow-up issue for ArgoCD failure detection that explicitly depends on #6.

VERDICT: NEEDS_REWORK

Three blockers must be addressed before this moves to next_up:

1. Clarify whether src/metrics.py is a new file or fix the path
2. Resolve the #6 dependency explicitly -- either scope ArgoCD out or declare the dependency
3. Fix the vague cross-reference in Related to name #6

ldraney referenced this issue

2026-06-13 19:20:30 +00:00

Close CFR gap: ArgoCD event integration for pal-e-deployments #6

ldraney commented

2026-06-13 19:21:01 +00:00

Author

Owner

Updated per QA review: fixed nonexistent file targets, scoped to Woodpecker failures only (ArgoCD deferred to #6), added explicit dependency declaration.

ldraney referenced this issue

2026-06-13 19:23:28 +00:00

Close CFR gap: ArgoCD event integration for pal-e-deployments #6

ldraney commented

2026-06-13 19:23:39 +00:00

Author

Owner

Issue #7 Template Review (Re-review)

Previous review flagged three blockers: nonexistent file targets (src/metrics.py, src/collectors/argocd.py), implicit dependency on #6, and vague cross-references. This re-review verifies those are resolved.

TEMPLATE CONFORMANCE

Type header present and valid (Feature)
All required sections present (Type, Lineage, Repo, User Story, Context, File Targets, Feature Flag, Acceptance Criteria, Test Expectations, Constraints, Checklist, Related)
Sections non-empty -- all have substantive content
Acceptance Criteria use - [ ] checkbox format
Test Expectations use - [ ] checkbox format

PREVIOUS BLOCKER RESOLUTION

| Blocker | Status | Evidence |
|---------|--------|----------|
| `src/metrics.py` did not exist | RESOLVED | File Target now points to `src/collectors/woodpecker.py` (HTTP 200 confirmed). Note clarifies metrics are defined inline at module level. |
| `src/collectors/argocd.py` did not exist | RESOLVED | No longer referenced. ArgoCD scope explicitly deferred to #6. |
| Implicit dependency on #6 | RESOLVED | Dedicated `### Dependencies` section: "Depends on #6 (ArgoCD spike) for ArgoCD failure detection. This ticket scopes to Woodpecker pipeline failures only." |
| Vague cross-references | RESOLVED | Related section references #6 by number with context. Dependencies section does the same. |

CONTENT QUALITY

Scope: Clean and focused. Title, User Story, What, Context, and File Targets all consistently scope to Woodpecker pipeline failures only. ArgoCD is explicitly out of scope with a clear pointer to #6.

File Targets: Single target (src/collectors/woodpecker.py) verified to exist in the repo at main. The note about inline metric definitions is helpful context for the implementing agent.

Acceptance Criteria: Three concrete, testable criteria. The metric name suggestion (dora_deployment_failure_timestamp) gives the implementer a starting point while the "or equivalent" leaves room for better naming.

Test Expectations: Two unit tests specified. Both are concrete (mock Woodpecker data, metric emission verification).

Dependencies section: Not part of the standard feature template but adds clear value here. Explicitly scopes what is and is not covered.

BLOCKERS

None.

NITS

1. The feature template specifies a ### Lineage format of "Related to org/repo #N" but the issue uses "New issue -- created as part of DORA measurement expansion." This is fine since this is a standalone issue, but could reference the DORA measurement project slug for traceability.
2. The ### What and ### Why sections are not part of the standard feature template. They add clarity here but are redundant with Context. Not a problem, just noting the deviation.
3. Test Expectations could specify a run command (e.g., pytest tests/ -k test_failure) per the template guidance. Minor omission.

VERDICT: APPROVED

ldraney referenced this issue

2026-06-13 19:30:34 +00:00

Close CFR gap: ArgoCD event integration for pal-e-deployments #6

ldraney referenced this issue

2026-06-13 19:55:35 +00:00

docs: ArgoCD CFR spike decision record (#6) #8

ldraney referenced this issue from a commit

2026-06-13 20:42:16 +00:00

feat: add MTTR failure event detection for Woodpecker pipelines

ldraney referenced this issue

2026-06-13 20:42:30 +00:00

feat: add MTTR failure event detection #9