spike: CI bootstrap resilience findings #147
Reference
forgejo_admin/pal-e-platform!147
Summary
Research spike investigating the enterprise-grade solution for merging platform fixes when CI itself is broken. Triggered by the PR #124 force merge during the Forgejo IPv4 incident (#121).
Changes
docs/spikes/125-ci-bootstrap-resilience.md -- New spike findings document with 6 findings covering clone resilience, branch protection gaps, secret pipeline analysis, pipeline recovery patterns, and industry precedent.
Key Findings
apply_to_admins=False is already set on pal-e-platform's branch protection. Admins can merge via Forgejo UI even when CI checks fail. The PR #124 force push was unnecessary.
forgejo_branch_protection resources in pal-e-services. Only pal-e-platform has protection; 5 other repos have none. The svalabs/forgejo Terraform provider supports this resource.
from_secret keys set manually via UI. No drift detection.
Recommended Follow-up Tickets
svalabs/forgejo provider in pal-e-services
Test Plan
apply_to_admins=False via Forgejo API or UI
Review Checklist
Related
Self-Review: LGTM
Type: Spike (research only, no production code)
Verification performed during investigation
apply_to_admins=False on the live branch protection rule via API
.woodpecker.yaml from 3 repos (pal-e-platform, basketball-api, westside-app) -- confirmed identical clone pattern with no retry/fallback
svalabs/forgejo Terraform provider exists with forgejo_branch_protection resource (21+ fields, missing apply_to_admins)
apply_to_admins=False
from_secret references in pal-e-platform's .woodpecker.yaml
forgejo_repository or branch_protection resources in pal-e-services/terraform/
Issue success criteria coverage
No issues found. Ready for human review.
PR #147 Review
DOMAIN REVIEW
Domain: Infrastructure spike (Terraform/k8s/Woodpecker/Forgejo platform). This is a research-only deliverable -- a findings document at docs/spikes/125-ci-bootstrap-resilience.md. No production code changes. Reviewed for technical accuracy, completeness, actionability, and scope containment.
Verification method: Cross-referenced every claim in the spike against the actual codebase on disk.
Findings accuracy assessment:
Finding 1 (Branch protection not in IaC): ACCURATE. Confirmed zero forgejo_branch_protection or forgejo_repository_branch_protection resources in pal-e-services/terraform/. No svalabs/forgejo provider present either.
Finding 2 (Admin bypass already works): ACCURATE in conclusion. The apply_to_admins=False claim is stated confidently. The spike references Forgejo issue #3780 on Codeberg for corroboration. The recommendation (document runbook, no config change) is sound.
Finding 3 (Clone step has no fallback): ACCURATE but UNDERSTATED. The spike says the clone block is "copy-pasted into pal-e-platform, basketball-api, and westside-app" -- this lists 3 repos. Actual count is at least 5:
pal-e-platform -- alpine/git with netrc auth
basketball-api -- alpine/git unauthenticated
westside-app -- alpine/git unauthenticated
minio-api -- alpine/git unauthenticated (same exact pattern, not mentioned)
pal-e-app -- woodpeckerci/plugin-git with settings.remote (different image, same internal URL, also not mentioned)
The spike's claim "All repos use a custom clone step with alpine/git" is inaccurate -- pal-e-app uses plugin-git, not alpine/git. The vulnerability (no retry, no fallback) applies equally, but the fix pattern would differ for plugin-git vs shell commands.
Finding 4 (Secret pipeline gaps): ACCURATE. Verified 19 distinct from_secret keys in .woodpecker.yaml (1 forgejo_token + 1 kubeconfig_content + 17 tf_var_*). The secret duplication observation and IaC gap are well-documented.
Finding 5 (Pipeline recovery patterns): ACCURATE and well-structured. The recovery matrix is clear and the risk assessments are reasonable.
Finding 6 (CI bypass label): ACCURATE. The conclusion (admin bypass is simpler and already configured) is correct for current scale.
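The from_secret tally in Finding 4 could be reproduced with a small script instead of a manual count. What follows is a minimal sketch, assuming the pipeline uses the `from_secret: name` form. The sample pipeline text, its image name, and the tf_var_example_* keys are hypothetical placeholders, not the contents of pal-e-platform's .woodpecker.yaml.

```python
import re
from collections import Counter

# Matches "from_secret: name", with or without quotes around the name.
# Note: a commented-out from_secret line would also be counted.
FROM_SECRET = re.compile(r"""from_secret:\s*["']?([\w.-]+)""")


def secret_keys(pipeline_text):
    """Count how many times each from_secret key is referenced."""
    return Counter(m.group(1) for m in FROM_SECRET.finditer(pipeline_text))


def split_by_prefix(keys, prefix="tf_var_"):
    """Separate Terraform variable secrets from all other secrets."""
    tf_keys = sorted(k for k in keys if k.startswith(prefix))
    other_keys = sorted(k for k in keys if not k.startswith(prefix))
    return tf_keys, other_keys


# Hypothetical pipeline fragment used only to exercise the functions.
SAMPLE = """
steps:
  plan:
    image: example/terraform
    environment:
      FORGEJO_TOKEN:
        from_secret: forgejo_token
      KUBECONFIG_CONTENT:
        from_secret: kubeconfig_content
      TF_VAR_example_a:
        from_secret: tf_var_example_a
  apply:
    image: example/terraform
    environment:
      TF_VAR_example_a:
        from_secret: tf_var_example_a
      TF_VAR_example_b:
        from_secret: tf_var_example_b
"""

counts = secret_keys(SAMPLE)
tf_keys, other_keys = split_by_prefix(counts)
print(f"{len(counts)} distinct keys: {len(tf_keys)} tf_var_*, {len(other_keys)} other")
```

Keys with a count above 1 are referenced in more than one place, which is one way to surface the duplication the spike describes. Running the same function over each repo's pipeline file would give a per-repo inventory to compare against the secrets actually configured in the UI.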
Proposed clone retry script (Finding 3): The proposed try_clone function has a gap -- it references FORGEJO_TOKEN in the environment block but the function body does not set up .netrc for authentication. The current pal-e-platform clone step creates .netrc before cloning. The proposed pattern would break authenticated clone for pal-e-platform (and any private repo). This should be noted when the follow-up ticket is created.
BLOCKERS
None. This is a research spike -- no production code, no new functionality, no test coverage requirement. The single changed file is a markdown document. No secrets committed, no IaC changes, no scope creep.
NITS
Finding 3 understates repo count. The spike lists 3 repos with the clone pattern but at least 5 have it (minio-api and pal-e-app are missing). The statement "All repos use a custom clone step with alpine/git" is incorrect -- pal-e-app uses woodpeckerci/plugin-git. The follow-up ticket scope will be larger than the spike implies. Recommend adding the missing repos to the findings or changing the language to "most repos" with a note to audit all repos when creating the follow-up ticket.
Proposed try_clone script omits netrc auth. The proposed clone retry pattern in Finding 3 includes FORGEJO_TOKEN in the environment but never creates .netrc. This would silently break authenticated clone for pal-e-platform. Since this is a spike (not production code), this is a nit -- but the follow-up ticket implementer will need to catch this.
Missing repos in Finding 1 inventory. The branch protection audit lists 5 repos without protection but does not include minio-api, minio-sdk, mcd-tracker-api, mcd-tracker-app, pal-e-docs-sdk, or pal-e-docs-mcp. The follow-up IaC ticket should cover all repos, not just those listed.
Related section missing plan slug. The PR body references #125, #121, and PR #124 but does not reference plan-pal-e-platform. Per SOP, the Related section should include the plan slug.
SOP COMPLIANCE
125-spike-ci-bootstrap-resilience references #125)
plan-pal-e-platform)
PROCESS OBSERVATIONS
This spike is a high-value investment. The finding that admin bypass already works (apply_to_admins=False) eliminates the original incident's root cause at zero cost. The clone retry follow-up ticket directly addresses MTTR (mean time to recovery) for CI failures. The branch protection IaC ticket addresses change failure rate by making protection reproducible.
The two recommended follow-up tickets are well-scoped and actionable. The medium-term recommendations (Woodpecker secret audit, secret management IaC) are appropriately deferred.
One risk: the follow-up clone retry ticket should be scoped to all repos with Woodpecker pipelines (at least 5), not just the 3 listed in Finding 3. Underscoping the ticket based on the spike's incomplete inventory could leave gaps.
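For the follow-up clone retry ticket, the netrc gap noted in this review could be closed by writing credentials before the first clone attempt. The sketch below is illustrative only and is not the spike's try_clone function: the function names, the retry counts, the backoff, and the idea of passing an ordered list of URLs (internal first, then a fallback) are all assumptions made for the example.

```python
import subprocess
import time
from pathlib import Path
from urllib.parse import urlparse


def write_netrc(urls, login, token, home):
    """Write a .netrc with one entry per clone host so git can authenticate over HTTP(S)."""
    hosts = []
    for url in urls:
        host = urlparse(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    entries = [f"machine {h}\nlogin {login}\npassword {token}\n" for h in hosts]
    path = Path(home) / ".netrc"
    path.write_text("".join(entries))
    path.chmod(0o600)  # keep the token readable by the owner only
    return path


def try_clone(urls, dest, attempts=3, delay=2.0, run=subprocess.run, sleep=time.sleep):
    """Try each URL in order, retrying each one up to `attempts` times.

    Returns the URL that succeeded, or raises RuntimeError if none did.
    `run` and `sleep` are injectable so the logic can be tested without git.
    """
    for url in urls:
        for attempt in range(1, attempts + 1):
            result = run(["git", "clone", "--depth", "1", url, dest])
            if result.returncode == 0:
                return url
            if attempt < attempts:
                sleep(delay * attempt)  # simple linear backoff between retries
    raise RuntimeError(f"clone failed for all {len(urls)} URLs")
```

In a pipeline step this might be called as `write_netrc(urls, login, os.environ["FORGEJO_TOKEN"], Path.home())` followed by `try_clone(urls, ".")`, where `urls` and `login` are placeholders for the repo's internal and fallback clone URLs and the CI account name. As the review notes, repos that clone with woodpeckerci/plugin-git rather than shell commands would need a different fix.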
VERDICT: APPROVED
The spike document is thorough, well-structured, and the findings are verified accurate against the codebase. The nits about understated repo counts and the missing netrc in the proposed pattern are non-blocking observations that should be captured in the follow-up tickets. No production code changes, no blockers.