Audit and re-block legacy un-decomposed notes (single-paragraph blocks containing full HTML) #255
Reference
forgejo_admin/pal-e-api#255
Type
Bug
Lineage
Discovered during forgejo_admin/claude-custom#239 (mermaid fence enforcement) when the dev agent's regression run was unable to interact with arch-secrets-pipeline in a structurally meaningful way. The initial scope (filed as "empty html_content") was wrong: re-review of this ticket (review-969-2026-04-11) verified that arch-secrets-pipeline is NOT empty. It contains ~7KB of substantive HTML stuffed into a single paragraph block. The note (id=404, created 2026-03-14, never updated since) predates the block-decomposition pipeline that was introduced in Phase F2-ish. The real bug class is legacy un-decomposed notes, not empty notes.
Repo
forgejo_admin/pal-e-api
What Broke
A subset of pal-e-docs notes, specifically those created before the block-decomposition pipeline was introduced, exist as a single paragraph block whose content.html field contains the entire note body as raw HTML (h2/h3/table/pre/code stuffed inline). These notes are technically queryable but are not block-addressable: get_note_toc returns no headings, get_section cannot find anchors, update_block cannot surgically edit, and the embedding worker cannot generate per-section embeddings. They're invisible to every block-first query path the agents use. arch-secrets-pipeline (note id=404) is the canary case. There are likely others.
Repro Steps
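The structural symptom can be checked mechanically. The following is a minimal sketch, assuming list_blocks returns a list of dicts shaped like {"type": ..., "content": {"html": ...}}; that shape, the function name looks_undecomposed, and the tag list are illustrative assumptions, not the real pal-e-docs payload.

```python
import re

# Block-level tags whose presence inside a single paragraph block suggests
# the note body was never decomposed. The tag list is an assumption.
BLOCK_TAG_RE = re.compile(r"<(h[2-4]|table|pre)\b", re.IGNORECASE)


def looks_undecomposed(blocks):
    """Return True if a note is one paragraph block holding raw block-level HTML.

    `blocks` is assumed to be a list of dicts such as
    {"type": "paragraph", "content": {"html": "..."}}; the real
    list_blocks response may be shaped differently.
    """
    if len(blocks) != 1:
        return False
    block = blocks[0]
    if block.get("type") != "paragraph":
        return False
    html = block.get("content", {}).get("html", "")
    return bool(BLOCK_TAG_RE.search(html))


# Hypothetical sample payloads for illustration only.
legacy = [
    {"type": "paragraph",
     "content": {"html": "<h2>Overview</h2><p>text</p><table></table>"}},
]
decomposed = [
    {"type": "heading", "content": {"text": "Overview"}},
    {"type": "paragraph", "content": {"html": "<p>text</p>"}},
]

print(looks_undecomposed(legacy))      # True
print(looks_undecomposed(decomposed))  # False
```

A note that trips this check would be expected to show the same behavior as the canary: an empty table of contents and no addressable sections.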
content.html contains the entire ~7KB document.
Expected Behavior
Every pal-e-docs note should be properly block-decomposed: each h2/h3/h4 becomes a heading block, paragraphs become paragraph blocks, tables become table blocks, mermaid fences become mermaid blocks, etc. The block parser (src/pal_e_docs/blocks/parser.py) handles this for new notes; legacy notes from before the parser existed need to be re-blocked retroactively.
Environment
- Database: paledocs on pal-e-postgres-1 (CNPG, namespace postgres)
- Canary note: arch-secrets-pipeline, id=404, created 2026-03-14, never updated
- Parser: src/pal_e_docs/blocks/parser.py, the parser that converts HTML to blocks for new notes
Audit Query (Postgres, run via kubectl -n postgres exec pal-e-postgres-1 -- psql -U postgres -d paledocs)
The previous ticket draft used WHERE html_content = '' OR IS NULL, which would have returned zero rows and missed the entire bug class. The corrected query finds the structural mismatch directly.
Resolution Paths
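Which notes count as affected comes from the structural audit: at most one block, but a large html_content. As a rough sketch of that check, the following runs the idea against an in-memory SQLite database so it is self-contained. The table and column names (notes, blocks, note_id, block_type) and the 1000-character threshold are assumptions for illustration, not the real paledocs schema; html_content is the only name taken from this ticket.

```python
import sqlite3

# Structural audit: notes with <= 1 block but a non-trivial html_content.
# Schema names and the threshold are hypothetical.
AUDIT_SQL = """
SELECT n.id, n.slug, COUNT(b.id) AS block_count, LENGTH(n.html_content) AS html_len
FROM notes n
LEFT JOIN blocks b ON b.note_id = n.id
GROUP BY n.id, n.slug, n.html_content
HAVING COUNT(b.id) <= 1 AND LENGTH(n.html_content) > :threshold
ORDER BY html_len DESC
"""


def find_undecomposed(conn, threshold=1000):
    """Return (id, slug, block_count, html_len) rows for suspect notes."""
    return conn.execute(AUDIT_SQL, {"threshold": threshold}).fetchall()


# Build a tiny fixture: one legacy note (1 block) and one decomposed note (3 blocks).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notes (id INTEGER PRIMARY KEY, slug TEXT, html_content TEXT);
CREATE TABLE blocks (id INTEGER PRIMARY KEY, note_id INTEGER, block_type TEXT);
""")
body = "<h2>A</h2>" + "x" * 2000
conn.execute("INSERT INTO notes VALUES (1, 'legacy-note', ?)", (body,))
conn.execute("INSERT INTO notes VALUES (2, 'decomposed-note', ?)", (body,))
conn.execute("INSERT INTO blocks VALUES (1, 1, 'paragraph')")
conn.executemany(
    "INSERT INTO blocks VALUES (?, 2, ?)",
    [(2, "heading"), (3, "paragraph"), (4, "table")],
)

rows = find_undecomposed(conn)
print([row[1] for row in rows])  # ['legacy-note']
```

The same SELECT shape should carry over to Postgres with the driver's own parameter syntax, but that has not been verified against the live database here.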
For each affected note, choose one of:
- Re-block (preferred): run the existing parser over html_content and replace the note's blocks with the parser output. The note keeps its slug, id, history, and inbound links; only the block structure changes. Implementation: a one-off Python script (or a new Alembic data migration) that iterates over the affected notes, calls pal_e_docs.blocks.parser.parse_html(note.html_content), and inserts the resulting blocks.
- Delete (per sop-note-deletion, backup-first): only for notes that are orphaned, stale, or duplicated by a newer note.
- Tag: apply the legacy-undecomposed tag and accept the limitation. Reasonable only if the note is read-only history that won't be edited.
Acceptance Criteria
- Canary note (arch-secrets-pipeline) is re-blocked, verifiable by get_note_toc returning >0 headings after the fix
- A script under scripts/ (or an Alembic data migration) handles the affected notes in bulk
Test Expectations
- Before the fix: mcp__pal-e-docs__get_note_toc(slug="arch-secrets-pipeline") returns 0 headings
- After the fix: mcp__pal-e-docs__list_blocks(slug="arch-secrets-pipeline") returns multiple blocks (heading, paragraph, table, mermaid as appropriate)
- After the fix: mcp__pal-e-docs__get_section(slug="arch-secrets-pipeline", anchor_id="<known-heading-anchor>") returns the targeted section
Constraints
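One way a re-block pass could honor the constraints in this section is sketched here: the parser is passed in as a callable so the existing parser is reused rather than reimplemented, and the previous blocks are recorded in a new revision instead of being overwritten. The note dict shape, the reblock_note name, and the revision fields are hypothetical; the real note_revisions schema and the parse_html return type are not confirmed by this ticket.

```python
from datetime import datetime, timezone


def reblock_note(note, parse_html):
    """Re-block one legacy note in place and return the new blocks.

    `note` is a hypothetical dict with "html_content", "blocks" and
    "revisions" keys. `parse_html` is injected so the existing parser
    is called instead of a new HTML-to-block converter.
    """
    new_blocks = parse_html(note["html_content"])
    if not new_blocks:
        # Refuse to replace content with nothing; leave the note untouched.
        raise ValueError("parser returned no blocks")
    # Append a revision that keeps the previous blocks, so history is preserved.
    note["revisions"].append({
        "created_at": datetime.now(timezone.utc).isoformat(),
        "reason": "re-block legacy un-decomposed note",
        "previous_blocks": list(note["blocks"]),
    })
    note["blocks"] = new_blocks
    return new_blocks


def stub_parser(html):
    # Stand-in for the real parser; returns a fixed two-block result.
    return [
        {"type": "heading", "content": {"text": "Overview"}},
        {"type": "paragraph", "content": {"html": "<p>body</p>"}},
    ]


raw = "<h2>Overview</h2><p>body</p>"
note = {
    "html_content": raw,
    "blocks": [{"type": "paragraph", "content": {"html": raw}}],
    "revisions": [{"reason": "initial"}],
}
reblock_note(note, stub_parser)
print(len(note["blocks"]), len(note["revisions"]))  # 2 2
```

In a real script the stub would be replaced by the project's own parser, and the dict mutations by the corresponding database writes inside a transaction.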
- Any deletion must follow the backup-first procedure (sop-note-deletion)
- Must use the existing parser.py; do NOT write a new HTML→block converter
- Preserve note_revisions history: the re-block operation should add a new revision, not overwrite the old one
Checklist
- Canary note (arch-secrets-pipeline) verified working post-fix
Related
- forgejo_admin/claude-custom#239: parent ticket that surfaced this
- review-969-2026-04-11: the review that caught the original scope error and corrected the bug class
- sop-note-deletion: backup-first procedure (only relevant for the delete resolution path)
- arch-domain-pal-e-docs: notes, blocks, and note_revisions components relevant to the investigation
- src/pal_e_docs/blocks/parser.py: the block parser to call for the re-block path
Scope Review: NEEDS_REFINEMENT
Review note: review-969-2026-04-11
Premise does not reproduce as stated. Live verification against pal-e-docs MCP shows arch-secrets-pipeline returns ~7KB of substantive html_content (not empty, not null). The real data integrity issue is different: the note has exactly 1 block of type paragraph containing the entire document as raw HTML; it was never block-decomposed. That is what breaks get_note_toc / get_section / mermaid rendering / per-block semantic search, and it is almost certainly what the dev agent in claude-custom#239 actually saw.
Ticket is otherwise well-structured. Traceability triangle is clean (story:superuser-maintain + arch:notes both verified against backing notes). Bug template is complete. Scope is correctly spike-shaped and bounded; no decomposition needed.
[BODY] fixes required before dispatch:
- Repro Steps: list_blocks (observes 1 paragraph block with raw HTML inside)
- Audit SQL: WHERE html_content = '' OR IS NULL will return zero rows. Needs a COUNT(blocks) <= 1 AND LENGTH(...) > threshold query to find legacy un-decomposed notes
- Environment: id=404, created and never updated since 2026-03-14T16:02:45 (predates block-decomposition pipeline)
No [LABEL], [SCOPE], or [DECOMPOSE] recommendations. Route to skill-refine-ticket for body edits, then re-review. Stays in backlog until refined.
Title changed from "Investigate empty html_content on arch-secrets-pipeline (and audit for other zero-content notes)" to "Audit and re-block legacy un-decomposed notes (single-paragraph blocks containing full HTML)"
Scope Review: APPROVED (round 2)
Review note: review-969-2026-04-11-r2
All round 1 [BODY] findings resolved: title, What Broke, Repro Steps, audit SQL, resolution paths (re-block preferred), environment (id=404 + timestamps), and constraints (must use existing parser.py) are all corrected and internally consistent. Canary re-verified live: get_note_toc returns [], list_blocks returns 1 paragraph with ~7KB of raw HTML. Parser target file verified at src/pal_e_docs/blocks/parser.py (316 lines).
Traceability triangle complete (story:superuser-maintain, arch:notes, issue open). Spike-shaped, single-agent scope. No decomposition needed.
Ticket can advance backlog → todo.