feat: add MTTR failure event detection #9
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
ldraney/pal-e-dora-exporter!9
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "7-mttr-failure-detection"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Add failure event detection to the Woodpecker collector, emitting a
dora_deployment_failure_timestampgauge metric that records when pipelines fail. This enables MTTR (Mean Time to Recovery) calculation in Grafana as the time between a failure timestamp and the next success timestamp.Changes
src/collectors/woodpecker.py-- adddeployment_last_failure_tsGauge metric at module level; add_last_failure_tstracking dict; add failure timestamp tracking in_collect_repo(mirrors existing success timestamp pattern)tests/test_woodpecker_failure.py-- 11 new tests covering failure detection, independence from success tracking, most-recent-wins behavior, zero-timestamp filtering, tier label application, and metric definition correctnesstests/test_tiers.py-- include new failure metric in existing tier label verification testTest Plan
python -m pytest tests/ -v)Related
🤖 Generated with Claude Code
PR #9 Review
DOMAIN REVIEW
Tech stack: Python / Prometheus client / async collector pattern.
Prometheus conventions:
dora_deployment_failure_timestampfollows naming conventions:namespace_subsystem_unitwith_timestampsuffix for epoch-seconds gauges. Consistent with the existingdora_deployment_success_timestampsibling. Good.["repo", "tier"]match the existing success metric. Consistent and correct."Unix timestamp of the most recent failed pipeline (for MTTR calculation)"is descriptive and explains purpose. Good.Collector pattern:
if status == "failure"(parallelsif status == "success")if finished:(parallels success path)if finished > current_max(parallels success path)_last_failure_ts(parallels_last_success_ts)Python quality:
_last_failure_ts: dict[str, float]match existing style.p.get("finished_at", 0)default to 0 with subsequentif finished:guard is correct -- avoids emitting metrics for in-progress or unfinished builds.BLOCKERS
None identified. The implementation is clean and correct.
NITS
DRY opportunity (non-blocking, future consideration): The success and failure timestamp tracking blocks are nearly identical -- they share the same pattern of
status check -> get finished_at -> filter zero -> track max -> set gauge. A helper method like_track_timestamp(status, repo_name, tier, finished, registry, gauge)could unify them. However, since there are only two instances and the code is short, this is a nit, not a blocker. Worth considering if a third status type (e.g., "error", "cancelled") is ever added.Test helper boilerplate (non-blocking): Every test in
TestFailureEventDetectionrepeats the samewith patch(...) as MockClient / mock_instance = MagicMock() / ...pattern. A@pytest.fixturethat yields a pre-configured(collector, mock_instance)tuple would reduce boilerplate. The existing tests appear to follow this same verbose pattern though, so this is consistent with project convention._value.get()in test assertions (non-blocking): Accessingsample._value.get()reaches into prometheus_client internals. The canonical way is to useREGISTRY.get_sample_value()orgenerate_latest()and parse. However, the existing test suite already uses this pattern (visible intest_tiers.py), so this is consistent. Something to revisit project-wide if prometheus_client upgrades break the internal API.SOP COMPLIANCE
feat: add MTTR failure event detection)7-mttr-failure-detectionfollows issue-number prefix conventionPROCESS OBSERVATIONS
time_to_recovery = next_success_ts - failure_tscalculations in Grafana. Good incremental progress.VERDICT: APPROVED
Clean, well-tested, additive change that mirrors an established pattern. No blockers. The 11 new tests provide thorough coverage including edge cases (zero timestamps, most-recent-wins, status independence, tier labels). Nits are minor style suggestions for future consideration.
PR #9 Review
DOMAIN REVIEW
Stack: Python / prometheus-client / asyncio / pytest
The PR adds a
dora_deployment_failure_timestampGauge metric to the Woodpecker collector, enabling MTTR calculation in Grafana by recording the most recent failure timestamp per repo. The implementation follows the established pattern from the success timestamp tracking.Pattern conformance: The failure tracking block in
_collect_repois a near-exact structural mirror of the existing success tracking block:["repo", "tier"]labels_last_failure_tsdict for max-timestamp cachingThis is exactly how it should be done. No deviation from the established collector pattern.
Metric correctness:
repo,tier)Test quality (11 tests, 271 lines):
finished_atis filtereddeployments_totalcounterCoverage is thorough. Both positive and negative cases are tested. The
_reset_metricsfixture properly clears state between tests to prevent cross-talk.BLOCKERS
None.
NITS
Metric name inconsistency: The Python variable is
deployment_last_failure_ts(contains "last"), but the Prometheus metric name isdora_deployment_failure_timestamp(missing "last"). Compare with the success metric: variabledeployment_last_success_tsmaps todora_deployment_last_success_timestamp. For consistency, considerdora_deployment_last_failure_timestamp. This is cosmetic and non-blocking -- changing it later would be a breaking change to any Grafana dashboards already consuming the metric, so it is worth deciding now before this ships."error" status: The code only handles
status == "failure", notstatus == "error". This is consistent with the existing success pattern (which only handlesstatus == "success"), and Woodpecker's "error" status typically indicates infrastructure issues rather than code failures. Noting for awareness -- if infrastructure errors should count toward MTTR, this would need a follow-up.SOP COMPLIANCE
PROCESS OBSERVATIONS
test_tiers.py) is adding the new metric to the tier label verification loop.dora_deployment_last_success_timestamp - dora_deployment_failure_timestampfor the most recent failure/recovery pair.dora_deployment_failure_timestampmetric. This can be a follow-up.VERDICT: APPROVED
PR #9 Review
Post-merge review of "feat: add MTTR failure event detection"
DOMAIN REVIEW
Stack: Python / Prometheus client / async collector pattern / pytest
Metric definition (
deployment_last_failure_ts):["repo", "tier"]matchesdeployment_last_success_tsexactly. Consistent.dora_deployment_last_failure_timestampfollows the existingdora_deployment_last_success_timestampnaming convention. Correct.Collector logic (lines 130-138 of
_collect_repo):finished_atextraction withp.get("finished_at", 0)if finished:)_last_failure_tsdict.set()call with identical label structurestatus == "failure"branch is independent of thestatus == "success"branch -- no accidental coupling. A pipeline that is neither success nor failure (e.g., "running", "pending") correctly hits neither branch.State tracking (
_last_failure_tsdict):__init__, same pattern as_last_success_ts. Correct.One observation (non-blocking): The
finished_atis extracted viap.get("finished_at", 0)inside theif status == "failure"block, which meansfinishedis locally scoped. The success block does the same extraction independently. This is fine -- the alternative (extracting once before both branches) would couple them and reduce clarity. Current approach is correct.BLOCKERS
None.
NITS
Test helper repetition: Every test in
TestFailureEventDetectionrepeats the same 8-linewith patch(...) as MockClient/mock_instance/list_repos/list_pipelines/MockClient.return_value/WoodpeckerCollector(config)/asyncio.run(collector.collect_once())block. A shared fixture (e.g., a helper that takesreposandpipelinesargs and returns the collector post-collection) would reduce ~50 lines without hurting readability. Not blocking -- the tests are clear as-is and this matches the existing test style in the repo.Missing
finished_atkey test: There is a test forfinished_at=0but no test for a pipeline dict that lacks thefinished_atkey entirely. The code handles this viap.get("finished_at", 0)which defaults to 0 and is then filtered byif finished:, so it would work correctly. A test would document this edge case. Minor.No test for
event != "push": The_pipelinehelper defaultsevent="push"but there is no test verifying that non-push failure events (e.g.,event="tag",event="cron") are also tracked. If the collector intentionally filters by event type upstream, this is moot -- but from the diff alone, failures appear to be tracked regardless of event type, which seems correct for MTTR.SOP COMPLIANCE
PROCESS OBSERVATIONS
status == "failure"pipelines.dora_deployment_last_success_timestamp, provides the raw data needed for MTTR calculation in Grafana:failure_ts - next_success_ts. The Grafana query layer is not part of this PR, which is correct scoping.VERDICT: APPROVED
Clean, well-tested, pattern-consistent additive change. 11 tests with strong edge case coverage. No blockers.