ldraney/pal-e-dora-exporter


feat: add MTTR failure event detection #9

Merged

ldraney merged 2 commits from 7-mttr-failure-detection into main

2026-06-14 01:44:25 +00:00

ldraney commented

2026-06-13 20:42:30 +00:00

Owner

Summary

Add failure event detection to the Woodpecker collector, emitting a dora_deployment_failure_timestamp gauge metric that records when pipelines fail. This enables MTTR (Mean Time to Recovery) calculation in Grafana as the time between a failure timestamp and the next success timestamp.

Changes

src/collectors/woodpecker.py -- add deployment_last_failure_ts Gauge metric at module level; add _last_failure_ts tracking dict; add failure timestamp tracking in _collect_repo (mirrors existing success timestamp pattern)
tests/test_woodpecker_failure.py -- 11 new tests covering failure detection, independence from success tracking, most-recent-wins behavior, zero-timestamp filtering, tier label application, and metric definition correctness
tests/test_tiers.py -- include new failure metric in existing tier label verification test

Test Plan

All 22 tests pass (python -m pytest tests/ -v)
Failure events correctly detected from mock Woodpecker pipeline data
Failure timestamp metric emitted with correct value and labels
Success metrics unchanged by failure events (and vice versa)
No breaking changes to existing metrics

Forgejo issue: #7
ArgoCD failure detection deferred to #6

🤖 Generated with Claude Code


ldraney added 1 commit

2026-06-13 20:42:30 +00:00

feat: add MTTR failure event detection for Woodpecker pipelines d0567a266d

Emit dora_deployment_failure_timestamp gauge metric when Woodpecker
pipelines fail, enabling MTTR calculation as the time between a failure
and the next success. The new metric follows the same pattern as
the existing success timestamp gauge -- additive, no breaking changes.

Closes #7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ldraney commented

2026-06-13 20:44:39 +00:00

Author

Owner

PR #9 Review

DOMAIN REVIEW

Tech stack: Python / Prometheus client / async collector pattern.

Prometheus conventions:

Metric name dora_deployment_failure_timestamp follows naming conventions: namespace_subsystem_unit with _timestamp suffix for epoch-seconds gauges. Consistent with the existing dora_deployment_success_timestamp sibling. Good.
Labels ["repo", "tier"] match the existing success metric. Consistent and correct.
Gauge type is appropriate -- this records the most-recent failure timestamp, not a monotonic counter. Correct.
Help string "Unix timestamp of the most recent failed pipeline (for MTTR calculation)" is descriptive and explains purpose. Good.

Collector pattern:

The failure detection mirrors the success timestamp pattern almost line-for-line:
- Status check: if status == "failure" (parallels if status == "success")
- Zero-filtering: if finished: (parallels success path)
- Max-tracking: if finished > current_max (parallels success path)
- Dict storage: _last_failure_ts (parallels _last_success_ts)
This symmetry is good -- same pattern, same edge case handling, easy to reason about correctness.
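
The four steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from `src/collectors/woodpecker.py`: `StubGauge` is a hypothetical stand-in for a labelled `prometheus_client` Gauge, and `record_failures` and the pipeline dict shape (`status`, `finished_at`) are assumptions based on the review text.

```python
class StubGauge:
    """Hypothetical stand-in for a Gauge labelled by (repo, tier)."""

    def __init__(self):
        self.values = {}

    def set(self, repo, tier, value):
        self.values[(repo, tier)] = value


def record_failures(pipelines, repo, tier, last_failure_ts, gauge):
    """Track the most recent failure timestamp for one repo."""
    for p in pipelines:
        if p.get("status") != "failure":             # status check
            continue
        finished = p.get("finished_at", 0)
        if not finished:                              # zero-filtering
            continue
        if finished > last_failure_ts.get(repo, 0):   # max-tracking
            last_failure_ts[repo] = finished          # dict storage
            gauge.set(repo, tier, finished)


gauge = StubGauge()
last_failure_ts = {}
pipelines = [
    {"status": "failure", "finished_at": 1700000200},
    {"status": "failure", "finished_at": 1700000100},  # older, ignored
    {"status": "failure", "finished_at": 0},           # unfinished, ignored
    {"status": "success", "finished_at": 1700000300},  # not a failure
]
record_failures(pipelines, "example-repo", "3", last_failure_ts, gauge)
```

Calling `record_failures` again with the same pipelines changes nothing, because no timestamp exceeds the cached maximum.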

Python quality:

Type hints on _last_failure_ts: dict[str, float] match existing style.
The p.get("finished_at", 0) default to 0 with subsequent if finished: guard is correct -- avoids emitting metrics for in-progress or unfinished builds.

BLOCKERS

None identified. The implementation is clean and correct.

NITS

DRY opportunity (non-blocking, future consideration): The success and failure timestamp tracking blocks are nearly identical -- they share the same pattern of status check -> get finished_at -> filter zero -> track max -> set gauge. A helper method like _track_timestamp(status, repo_name, tier, finished, registry, gauge) could unify them. However, since there are only two instances and the code is short, this is a nit, not a blocker. Worth considering if a third status type (e.g., "error", "cancelled") is ever added.
Test helper boilerplate (non-blocking): Every test in TestFailureEventDetection repeats the same with patch(...) as MockClient / mock_instance = MagicMock() / ... pattern. A @pytest.fixture that yields a pre-configured (collector, mock_instance) tuple would reduce boilerplate. The existing tests appear to follow this same verbose pattern though, so this is consistent with project convention.
_value.get() in test assertions (non-blocking): Accessing sample._value.get() reaches into prometheus_client internals. The canonical way is to use REGISTRY.get_sample_value() or generate_latest() and parse. However, the existing test suite already uses this pattern (visible in test_tiers.py), so this is consistent. Something to revisit project-wide if prometheus_client upgrades break the internal API.
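
For the DRY nit, one possible shape for a shared helper is sketched below. `TimestampTracker` and its `track` method are hypothetical names, not part of the PR, and the signature is simpler than the `_track_timestamp(...)` one floated above: the helper only decides whether a new maximum was seen, and the caller would still set the matching gauge.

```python
from dataclasses import dataclass, field


@dataclass
class TimestampTracker:
    """Hypothetical helper pairing one pipeline status with its cache."""

    status: str
    latest: dict = field(default_factory=dict)

    def track(self, pipeline, repo):
        """Return the new latest timestamp, or None if nothing changed."""
        if pipeline.get("status") != self.status:
            return None
        finished = pipeline.get("finished_at", 0)
        if not finished or finished <= self.latest.get(repo, 0):
            return None
        self.latest[repo] = finished
        return finished


success = TimestampTracker("success")
failure = TimestampTracker("failure")

updates = []
for p in [
    {"status": "failure", "finished_at": 100},
    {"status": "success", "finished_at": 160},
    {"status": "failure", "finished_at": 90},  # older than the cached failure
]:
    for tracker in (success, failure):
        new_ts = tracker.track(p, "example-repo")
        if new_ts is not None:
            # In a real collector this is where the gauge would be set.
            updates.append((tracker.status, new_ts))
```

A third status would then be one more `TimestampTracker` instance rather than a third copied block.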

SOP COMPLIANCE

PR body has ## Summary, ## Changes, ## Test Plan, ## Related
Tests exist: 11 new tests + 1 updated existing test (22 total reported passing)
No secrets, .env files, or credentials committed
No unnecessary file changes -- all 3 files are directly related to failure event detection
Commit messages are not visible in the diff, but the PR title follows conventional commit format (feat: add MTTR failure event detection)
Branch name 7-mttr-failure-detection follows issue-number prefix convention

PROCESS OBSERVATIONS

DORA relevance: This PR directly supports DORA MTTR measurement by emitting the failure timestamp needed for time_to_recovery = next_success_ts - failure_ts calculations in Grafana. Good incremental progress.
Scope: Correctly scoped to Woodpecker failure detection only. ArgoCD failure detection is explicitly deferred to issue #6 as noted in the PR body. Clean scope boundary.
Change failure risk: Low. The change is additive -- new metric, new tracking dict, new conditional block. No modification to existing success tracking logic. The test suite verifies independence (success does not affect failure and vice versa).
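
The `time_to_recovery = next_success_ts - failure_ts` calculation mentioned above can be illustrated offline. This is a sketch under assumptions: the function name `recovery_times` and the sample timestamps are made up, and a real dashboard would derive the pairs from scraped gauge samples in Grafana rather than from Python lists.

```python
from bisect import bisect_right
from statistics import mean


def recovery_times(failure_ts, success_ts):
    """Pair each failure with the first success strictly after it."""
    successes = sorted(success_ts)
    durations = []
    for failed in sorted(failure_ts):
        idx = bisect_right(successes, failed)
        if idx < len(successes):            # skip failures not yet recovered
            durations.append(successes[idx] - failed)
    return durations


failures = [1000, 5000, 9000]   # epoch seconds, illustrative only
successes = [1600, 5300]
durations = recovery_times(failures, successes)
mttr_seconds = mean(durations) if durations else None
```

Here the failure at 9000 has no later success, so it is left out of the mean instead of producing a negative or open-ended duration.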

VERDICT: APPROVED

Clean, well-tested, additive change that mirrors an established pattern. No blockers. The 11 new tests provide thorough coverage including edge cases (zero timestamps, most-recent-wins, status independence, tier labels). Nits are minor style suggestions for future consideration.


ldraney commented

2026-06-13 21:17:44 +00:00

Author

Owner

PR #9 Review

DOMAIN REVIEW

Stack: Python / prometheus-client / asyncio / pytest

The PR adds a dora_deployment_failure_timestamp Gauge metric to the Woodpecker collector, enabling MTTR calculation in Grafana by recording the most recent failure timestamp per repo. The implementation follows the established pattern from the success timestamp tracking.

Pattern conformance: The failure tracking block in _collect_repo is a near-exact structural mirror of the existing success tracking block:

Module-level Gauge definition with ["repo", "tier"] labels
Instance-level _last_failure_ts dict for max-timestamp caching
Status check -> finished_at extraction -> zero-guard -> max comparison -> gauge update

This is exactly how it should be done. No deviation from the established collector pattern.

Metric correctness:

Gauge type is correct for "last known timestamp" semantics (vs. Counter)
Labels match the existing metric labeling scheme (repo, tier)
Zero-timestamp filtering prevents emitting meaningless data
Max-timestamp logic ensures only the most recent failure is reported (idempotent across poll cycles)

Test quality (11 tests, 271 lines):

Happy path: failure timestamp emitted correctly
Isolation: failure does not update success (and vice versa)
Edge case: zero finished_at is filtered
Ordering: most-recent-wins with out-of-order pipeline data
Integration: failures counted in deployments_total counter
Tier labeling: unknown repos default to tier 3
Metric definition: correct name, correct labels

Coverage is thorough. Both positive and negative cases are tested. The _reset_metrics fixture properly clears state between tests to prevent cross-talk.

BLOCKERS

None.

NITS

Metric name inconsistency: The Python variable is deployment_last_failure_ts (contains "last"), but the Prometheus metric name is dora_deployment_failure_timestamp (missing "last"). Compare with the success metric: variable deployment_last_success_ts maps to dora_deployment_last_success_timestamp. For consistency, consider dora_deployment_last_failure_timestamp. This is cosmetic and non-blocking -- changing it later would be a breaking change to any Grafana dashboards already consuming the metric, so it is worth deciding now before this ships.
"error" status: The code only handles status == "failure", not status == "error". This is consistent with the existing success pattern (which only handles status == "success"), and Woodpecker's "error" status typically indicates infrastructure issues rather than code failures. Noting for awareness -- if infrastructure errors should count toward MTTR, this would need a follow-up.
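
If infrastructure errors were ever meant to count toward MTTR, one way to make that a configuration choice is sketched below. The constant names and `latest_failure` are hypothetical; the PR as reviewed only matches `status == "failure"`.

```python
# Behaviour described in the review: only "failure" counts.
FAILURE_STATUSES = frozenset({"failure"})
# Hypothetical follow-up: also treat infrastructure errors as failures.
FAILURE_STATUSES_WITH_INFRA = frozenset({"failure", "error"})


def latest_failure(pipelines, failure_statuses=FAILURE_STATUSES):
    """Return the newest finished_at among matching pipelines, or None."""
    finished = [
        p.get("finished_at", 0)
        for p in pipelines
        if p.get("status") in failure_statuses
    ]
    return max(finished, default=0) or None


pipelines = [
    {"status": "failure", "finished_at": 100},
    {"status": "error", "finished_at": 200},
]
code_only = latest_failure(pipelines)
with_infra = latest_failure(pipelines, FAILURE_STATUSES_WITH_INFRA)
```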

SOP COMPLIANCE

PR body follows template (Summary, Changes, Test Plan, Related)
Tests exist and are comprehensive (11 new tests)
No secrets or credentials committed
No unnecessary file changes (3 files, all directly related)
Related issue referenced (#7)
ArgoCD deferral noted (#6)

PROCESS OBSERVATIONS

Change failure risk: Low. All changes are purely additive. No existing metrics, collectors, or tests are modified in substance. The only change to an existing test file (test_tiers.py) is adding the new metric to the tier label verification loop.
Deployment frequency: This PR unblocks MTTR calculation, one of the four DORA metrics. Once merged, Grafana dashboards can compute MTTR as dora_deployment_last_success_timestamp - dora_deployment_failure_timestamp for the most recent failure/recovery pair. The difference is only meaningful when the success timestamp is later than the failure timestamp.
Documentation gap: The README listing exposed metrics should be updated to include the new dora_deployment_failure_timestamp metric. This can be a follow-up.

VERDICT: APPROVED


ldraney added 1 commit

2026-06-14 01:32:01 +00:00

fix: rename failure metric to match success naming convention 9f7476a202

dora_deployment_failure_timestamp → dora_deployment_last_failure_timestamp
Matches dora_deployment_last_success_timestamp. Important to fix before
merge since renaming later would break Grafana dashboards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ldraney merged commit a8077327fa into main

2026-06-14 01:44:25 +00:00

ldraney deleted branch 7-mttr-failure-detection

2026-06-14 01:44:25 +00:00

ldraney referenced this pull request from a commit

2026-06-14 01:44:25 +00:00

feat: add MTTR failure event detection (#9)

ldraney commented

2026-06-14 01:51:35 +00:00

Author

Owner

PR #9 Review

Post-merge review of "feat: add MTTR failure event detection"

DOMAIN REVIEW

Stack: Python / Prometheus client / async collector pattern / pytest

Metric definition (deployment_last_failure_ts):

Gauge type is correct for "most recent timestamp" semantics -- this value can go up or be replaced, not monotonically incremented. Correct choice over Counter.
Label set ["repo", "tier"] matches deployment_last_success_ts exactly. Consistent.
Metric name dora_deployment_last_failure_timestamp follows the existing dora_deployment_last_success_timestamp naming convention. Correct.
Description clearly states MTTR intent.

Collector logic (lines 130-138 of _collect_repo):

Mirrors the existing success timestamp pattern at lines 120-128. The symmetry is clean:
- Same finished_at extraction with p.get("finished_at", 0)
- Same zero-timestamp guard (if finished:)
- Same max-wins comparison against _last_failure_ts dict
- Same .set() call with identical label structure
The status == "failure" branch is independent of the status == "success" branch -- no accidental coupling. A pipeline that is neither success nor failure (e.g., "running", "pending") correctly hits neither branch.

State tracking (_last_failure_ts dict):

Initialized as empty dict on __init__, same pattern as _last_success_ts. Correct.
Per-repo keying ensures cross-repo failures don't interfere. Correct.

One observation (non-blocking): The finished_at is extracted via p.get("finished_at", 0) inside the if status == "failure" block, which means finished is locally scoped. The success block does the same extraction independently. This is fine -- the alternative (extracting once before both branches) would couple them and reduce clarity. Current approach is correct.

BLOCKERS

None.

NITS

Test helper repetition: Every test in TestFailureEventDetection repeats the same 8-line with patch(...) as MockClient / mock_instance / list_repos / list_pipelines / MockClient.return_value / WoodpeckerCollector(config) / asyncio.run(collector.collect_once()) block. A shared fixture (e.g., a helper that takes repos and pipelines args and returns the collector post-collection) would reduce ~50 lines without hurting readability. Not blocking -- the tests are clear as-is and this matches the existing test style in the repo.
Missing finished_at key test: There is a test for finished_at=0 but no test for a pipeline dict that lacks the finished_at key entirely. The code handles this via p.get("finished_at", 0) which defaults to 0 and is then filtered by if finished:, so it would work correctly. A test would document this edge case. Minor.
No test for event != "push": The _pipeline helper defaults event="push" but there is no test verifying that non-push failure events (e.g., event="tag", event="cron") are also tracked. If the collector intentionally filters by event type upstream, this is moot -- but from the diff alone, failures appear to be tracked regardless of event type, which seems correct for MTTR.
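
The two missing-coverage nits could be documented with tests along these lines. This is a sketch only: `failure_timestamp` is a hypothetical stand-in for the collector's failure branch, written from the behaviour described in this review, so the tests exercise the stand-in and not the real `WoodpeckerCollector`.

```python
def failure_timestamp(pipeline):
    """Hypothetical stand-in: timestamp to record for a pipeline, or None."""
    if pipeline.get("status") != "failure":
        return None
    finished = pipeline.get("finished_at", 0)  # missing key defaults to 0
    return finished or None                    # zero is filtered out


def test_missing_finished_at_is_ignored():
    assert failure_timestamp({"status": "failure", "event": "push"}) is None


def test_non_push_failures_are_tracked():
    for event in ("tag", "cron"):
        pipeline = {"status": "failure", "event": event, "finished_at": 1700000000}
        assert failure_timestamp(pipeline) == 1700000000


# Run directly here; under pytest the test_ functions would be collected.
test_missing_finished_at_is_ignored()
test_non_push_failures_are_tracked()
```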

SOP COMPLIANCE

PR body has: Summary, Changes, Test Plan, Related
Tests exist and pass (11 new tests, existing tier test updated)
No secrets, .env files, or credentials committed
No unnecessary file changes -- all 3 files are directly related to the feature
Commit message is descriptive (feat: add MTTR failure event detection)
Related issue linked (#7)

PROCESS OBSERVATIONS

Change Failure Risk: Low. Pure additive change -- no existing metrics modified, no breaking changes. The failure metric is a new Gauge that only fires on status == "failure" pipelines.
MTTR enablement: This metric, combined with the existing dora_deployment_last_success_timestamp, provides the raw data needed for MTTR calculation in Grafana: next_success_ts - failure_ts. The Grafana query layer is not part of this PR, which is correct scoping.
ArgoCD gap: PR body correctly notes that ArgoCD failure detection is deferred to #6. This is appropriate -- Woodpecker and ArgoCD collectors should be independent.

VERDICT: APPROVED

Clean, well-tested, pattern-consistent additive change. 11 tests with strong edge case coverage. No blockers.


ldraney referenced this pull request

2026-06-14 01:55:30 +00:00

test: add coverage for missing finished_at key in pipeline data #10

ldraney referenced this pull request

2026-06-14 01:56:36 +00:00

test: add coverage for missing finished_at key in pipeline data #10