Audit and re-block legacy un-decomposed notes (single-paragraph blocks containing full HTML) #255

Closed
opened 2026-04-11 20:30:44 +00:00 by forgejo_admin · 2 comments

Type

Bug

Lineage

Discovered during forgejo_admin/claude-custom#239 (mermaid fence enforcement) when the dev agent's regression run was unable to interact with arch-secrets-pipeline in a structurally meaningful way. Initial scope (filed as "empty html_content") was wrong — re-review of this ticket (review-969-2026-04-11) verified that arch-secrets-pipeline is NOT empty: it contains ~7KB of substantive HTML stuffed into a single paragraph block. The note (id=404, created 2026-03-14, never updated since) predates the block-decomposition pipeline that was introduced in Phase F2-ish. The real bug class is legacy un-decomposed notes, not empty notes.

Repo

forgejo_admin/pal-e-api

What Broke

A subset of pal-e-docs notes — specifically those created before the block-decomposition pipeline was introduced — exist as a single paragraph block whose content.html field contains the entire note body as raw HTML (h2/h3/table/pre/code stuffed inline). These notes are technically queryable but are not block-addressable: get_note_toc returns no headings, get_section cannot find anchors, update_block cannot surgically edit, and the embedding worker cannot generate per-section embeddings. They're invisible to every block-first query path the agents use.

arch-secrets-pipeline (note id=404) is the canary case. There are likely others.

Repro Steps

  1. Read the canary note's TOC:
    mcp__pal-e-docs__get_note_toc(slug="arch-secrets-pipeline")
    
    Observe: returns no heading entries, despite the note containing visible h2/h3 headings in its rendered HTML.
  2. List its blocks:
    mcp__pal-e-docs__list_blocks(slug="arch-secrets-pipeline")
    
    Observe: exactly 1 paragraph block whose content.html contains the entire ~7KB document.
  3. Read the note's html_content:
    mcp__pal-e-docs__get_note(slug="arch-secrets-pipeline")
    
    Observe: substantive content, ~7KB, with full architecture documentation including 3 mermaid diagrams. Not empty.
  4. Confirm timestamps: created 2026-03-14, never updated. Predates block decomposition.
  5. Run the audit query (see below) to find other affected notes.
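
The observations from steps 1 and 2 can be folded into a single check. This is a minimal sketch, assuming `list_blocks` returns dicts shaped like `{"block_type": ..., "content": {"html": ...}}`. Those key names and the helper `is_undecomposed` are hypothetical and not the confirmed pal-e-docs response schema.

```python
import re

# Structural tags that should have become their own blocks after decomposition.
STRUCTURAL_TAG = re.compile(r"<(h[2-4]|table|pre)\b", re.IGNORECASE)


def is_undecomposed(blocks):
    """Return True when a note looks like a legacy single-paragraph dump.

    `blocks` is assumed to be a list of dicts such as
    {"block_type": "paragraph", "content": {"html": "<h2>...</h2>"}}.
    The key names are hypothetical.
    """
    if len(blocks) != 1:
        return False
    block = blocks[0]
    if block.get("block_type") != "paragraph":
        return False
    html = block.get("content", {}).get("html", "")
    # A lone paragraph that still carries headings, tables, or pre blocks
    # inline is the signature described in "What Broke".
    return bool(STRUCTURAL_TAG.search(html))


legacy = [{"block_type": "paragraph",
           "content": {"html": "<h2>Overview</h2><p>text</p><table></table>"}}]
decomposed = [{"block_type": "heading", "content": {"html": "<h2>Overview</h2>"}},
              {"block_type": "paragraph", "content": {"html": "<p>text</p>"}}]
plain = [{"block_type": "paragraph", "content": {"html": "<p>short note</p>"}}]
```

A genuinely short note with one plain paragraph is not flagged, which keeps the check from reporting notes that are merely small.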

Expected Behavior

Every pal-e-docs note should be properly block-decomposed: each h2/h3/h4 becomes a heading block, paragraphs become paragraph blocks, tables become table blocks, mermaid fences become mermaid blocks, etc. The block parser (src/pal_e_docs/blocks/parser.py) handles this for new notes; legacy notes from before the parser existed need to be re-blocked retroactively.

Environment

  • pal-e-docs API: production (https://pal-e-docs.tail5b443a.ts.net)
  • Database: paledocs on pal-e-postgres-1 (CNPG, namespace postgres)
  • Canary note: arch-secrets-pipeline, id=404, created 2026-03-14, never updated
  • Block parser: src/pal_e_docs/blocks/parser.py — the function that converts HTML to blocks for new notes
  • Discovered: 2026-04-11 during claude-custom#239 dev agent regression run, refined via review-969-2026-04-11

Audit Query (Postgres, run via kubectl -n postgres exec pal-e-postgres-1 -- psql -U postgres -d paledocs)

-- Find notes that have ≤1 block but >1KB of html_content
-- These are the legacy un-decomposed notes
SELECT
  n.id,
  n.slug,
  n.note_type,
  n.created_at,
  n.updated_at,
  LENGTH(n.html_content) AS content_length,
  COUNT(b.id) AS block_count,
  SUM(LENGTH(b.content::text)) AS block_content_length
FROM notes n
LEFT JOIN blocks b ON b.note_id = n.id
GROUP BY n.id, n.slug, n.note_type, n.created_at, n.updated_at, n.html_content
HAVING COUNT(b.id) <= 1
   AND LENGTH(n.html_content) > 1024
ORDER BY n.created_at ASC;

The previous ticket draft used WHERE html_content = '' OR IS NULL, which would have returned zero rows and missed the entire bug class. The corrected query finds the structural mismatch directly.
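
The difference between the two queries can be reproduced locally. The sketch below uses an in-memory SQLite database as a stand-in for Postgres, with a simplified two-table schema and made-up rows (all assumptions for illustration). The Postgres-specific `::text` cast and the extra columns are dropped, but the `HAVING` predicate is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notes (id INTEGER PRIMARY KEY, slug TEXT, html_content TEXT);
CREATE TABLE blocks (id INTEGER PRIMARY KEY, note_id INTEGER, content TEXT);
""")

big_html = "<h2>Overview</h2>" + "<p>x</p>" * 200  # well over 1024 characters
conn.executemany("INSERT INTO notes VALUES (?, ?, ?)", [
    (1, "legacy-note", big_html),       # one block holding everything
    (2, "decomposed-note", big_html),   # same content, properly split
    (3, "short-note", "<p>hi</p>"),     # one block, but genuinely small
])
conn.executemany("INSERT INTO blocks (note_id, content) VALUES (?, ?)", [
    (1, big_html),
    (2, "<h2>Overview</h2>"),
    (2, "<p>x</p>"),
    (3, "<p>hi</p>"),
])

# Shape of the earlier draft: looks for empty content only.
EMPTY_CONTENT_QUERY = """
SELECT slug FROM notes WHERE html_content = '' OR html_content IS NULL
"""

# Shape of the corrected query: large content but at most one block.
STRUCTURAL_QUERY = """
SELECT n.slug
FROM notes n
LEFT JOIN blocks b ON b.note_id = n.id
GROUP BY n.id, n.slug, n.html_content
HAVING COUNT(b.id) <= 1 AND LENGTH(n.html_content) > 1024
ORDER BY n.id
"""

empty_hits = [row[0] for row in conn.execute(EMPTY_CONTENT_QUERY)]
structural_hits = [row[0] for row in conn.execute(STRUCTURAL_QUERY)]
print(empty_hits)        # []
print(structural_hits)   # ['legacy-note']
```

The length threshold is what keeps `short-note` out of the result: a single block is only suspicious when the note body is large.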

Resolution Paths

For each affected note, choose one of:

  1. Re-block in place (preferred for content with value): call the block parser on the existing html_content and replace the note's blocks with the parser output. The note keeps its slug, id, history, and inbound links; only the block structure changes. Implementation: a one-off Python script (or new Alembic data migration) that iterates over the affected notes, calls pal_e_docs.blocks.parser.parse_html(note.html_content), and swaps the legacy single paragraph block for the resulting blocks, recording a new revision rather than overwriting the old one.
  2. Delete (per sop-note-deletion, backup-first) — only for notes that are orphaned, stale, or duplicated by a newer note.
  3. Document as legacy and skip — add a legacy-undecomposed tag and accept the limitation. Reasonable only if the note is read-only history that won't be edited.
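
The re-block path might look roughly like the sketch below. It works on an in-memory `Note` stand-in, not the real ORM models, and takes the parser as an injected callable so the existing `parse_html` can be passed in unchanged. The `Note` class, the revision snapshot shape, and `stub_parser` are all hypothetical; the real `note_revisions` schema and the parser's return type are not shown in this ticket.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Note:
    # Hypothetical stand-in for the real note model.
    slug: str
    html_content: str
    blocks: list = field(default_factory=list)
    revisions: list = field(default_factory=list)


def reblock_in_place(note: Note, parse_html: Callable[[str], list]) -> int:
    """Replace a note's blocks with parser output, keeping prior state as a revision.

    Returns the number of new blocks. Leaves the note untouched if the parser
    yields nothing, so a parser failure cannot wipe the legacy block.
    """
    new_blocks = parse_html(note.html_content)
    if not new_blocks:
        return 0
    # Append, never overwrite: earlier revisions stay intact.
    note.revisions.append(
        {"html_content": note.html_content, "blocks": list(note.blocks)}
    )
    note.blocks = new_blocks
    return len(new_blocks)


def stub_parser(html: str) -> list:
    # Stand-in used only to exercise the flow; the real script would pass
    # the existing parser function here.
    return [{"block_type": "heading"}, {"block_type": "paragraph"}]


note = Note(
    slug="arch-secrets-pipeline",
    html_content="<h2>Overview</h2><p>text</p>",
    blocks=[{"block_type": "paragraph"}],
)
count = reblock_in_place(note, stub_parser)
```

Injecting the parser keeps the script from growing its own conversion logic, and the early return on empty output means a note is never left with zero blocks.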

Acceptance Criteria

  • Audit query (above) executed against prod, results captured in PR description or a fresh review note
  • Count + list of affected notes documented with: id, slug, note_type, created_at, content_length, block_count
  • For each affected note, a resolution decision (re-block / delete / legacy-tag) recorded
  • Re-block path implemented for at least the canary case (arch-secrets-pipeline) — verifiable by get_note_toc returning >0 headings after the fix
  • If multiple notes need re-blocking, a one-off script committed to scripts/ (or an Alembic data migration) handles them in bulk
  • Backup taken before any deletion
  • If a hook gap allowed un-decomposed notes to slip through, file a follow-up ticket (out of scope here)
  • If multiple distinct bug classes emerge from the audit, file separate follow-up tickets
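
For the first two criteria, the audit rows could be turned into a count plus a table ready to paste into the PR description. This is only a sketch: `audit_table` is a hypothetical helper, and the sample row is a placeholder, not real audit output.

```python
COLUMNS = ["id", "slug", "note_type", "created_at", "content_length", "block_count"]


def audit_table(rows):
    """Render audit rows (dicts keyed by COLUMNS) as a Markdown table with a count line."""
    lines = [
        "Affected notes: %d" % len(rows),
        "",
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "|".join(" --- " for _ in COLUMNS) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[c]) for c in COLUMNS) + " |")
    return "\n".join(lines)


# Placeholder row for illustration only; not real audit output.
sample = [{"id": 1, "slug": "example-note", "note_type": "doc",
           "created_at": "2026-01-01", "content_length": 2048, "block_count": 1}]
report = audit_table(sample)
```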

Test Expectations

  • Before fix: mcp__pal-e-docs__get_note_toc(slug="arch-secrets-pipeline") returns 0 headings
  • After fix: same call returns ≥1 heading entries matching the rendered h2/h3 structure
  • After fix: mcp__pal-e-docs__list_blocks(slug="arch-secrets-pipeline") returns multiple blocks (heading, paragraph, table, mermaid as appropriate)
  • After fix: mcp__pal-e-docs__get_section(slug="arch-secrets-pipeline", anchor_id="<known-heading-anchor>") returns the targeted section
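
The post-fix expectations above can be gathered into one helper that takes the three tool responses and reports what still fails. The response shapes are assumptions: a list for `get_note_toc`, a list of dicts with a `block_type` key for `list_blocks`, and `None` from `get_section` when the anchor is missing. `check_reblocked` is a hypothetical name.

```python
def check_reblocked(toc, blocks, section):
    """Return a list of failure messages; an empty list means all checks pass.

    toc:     assumed list of heading entries from get_note_toc
    blocks:  assumed list of dicts with a "block_type" key from list_blocks
    section: assumed get_section result, None when the anchor is not found
    """
    failures = []
    if len(toc) < 1:
        failures.append("get_note_toc returned no headings")
    if len(blocks) <= 1:
        failures.append(
            "list_blocks returned %d block(s), expected several" % len(blocks)
        )
    elif not any(b.get("block_type") == "heading" for b in blocks):
        failures.append("no heading block present")
    if section is None:
        failures.append("get_section did not resolve the anchor")
    return failures


# Pre-fix shape of the canary: empty TOC, one paragraph block, no section.
before = check_reblocked([], [{"block_type": "paragraph"}], None)

# Post-fix shape with illustrative values.
after = check_reblocked(
    [{"anchor_id": "overview"}],
    [{"block_type": "heading"}, {"block_type": "paragraph"}],
    {"html": "<h2>Overview</h2>"},
)
```

Returning messages instead of raising on the first failure lets one run show every expectation that is still unmet.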

Constraints

  • Backup before any deletion (sop-note-deletion)
  • Do NOT update the hook to enforce minimum block count in this ticket — that's separate scope (hook hardening)
  • Re-block path must use the existing parser.py — do NOT write a new HTML→block converter
  • Do NOT lose note_revisions history — the re-block operation should add a new revision, not overwrite the old one

Checklist

  • Audit query run, results captured
  • Resolution decisions documented per affected note
  • Re-block script (or migration) committed
  • Canary case (arch-secrets-pipeline) verified working post-fix
  • Follow-up tickets filed if needed

Related

  • forgejo_admin/claude-custom#239: parent ticket that surfaced this
  • review-969-2026-04-11: the review that caught the original scope error and corrected the bug class
  • sop-note-deletion: backup-first procedure (only relevant for the delete resolution path)
  • arch-domain-pal-e-docs: notes, blocks, and note_revisions components relevant to the investigation
  • src/pal_e_docs/blocks/parser.py: the block parser to call for the re-block path
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-969-2026-04-11

Premise does not reproduce as stated. Live verification against pal-e-docs MCP shows arch-secrets-pipeline returns ~7KB of substantive html_content (not empty, not null). The real data integrity issue is different: the note has exactly 1 block of type paragraph containing the entire document as raw HTML — it was never block-decomposed. That is what breaks get_note_toc / get_section / mermaid rendering / per-block semantic search, and it is almost certainly what the dev agent in claude-custom#239 actually saw.

Ticket is otherwise well-structured. Traceability triangle is clean (story:superuser-maintain + arch:notes both verified against backing notes). Bug template is complete. Scope is correctly spike-shaped and bounded — no decomposition needed.

[BODY] fixes required before dispatch:

  • Rewrite "What Broke" to describe un-decomposed single-paragraph structure, not empty content
  • Update Repro Step 3 to use list_blocks (observes 1 paragraph block with raw HTML inside)
  • Fix audit query in Investigation Plan step 2 — WHERE html_content = '' OR IS NULL will return zero rows. Needs a COUNT(blocks) <= 1 AND LENGTH(...) > threshold query to find legacy un-decomposed notes
  • Add "re-block the note" as a third resolution path (deletion is wrong for a note with valuable content)
  • Fill in the "Note ID: TBD" — confirmed id=404, created and never updated since 2026-03-14T16:02:45 (predates block-decomposition pipeline)

No [LABEL], [SCOPE], or [DECOMPOSE] recommendations. Route to skill-refine-ticket for body edits, then re-review. Stays in backlog until refined.

forgejo_admin changed title from Investigate empty html_content on arch-secrets-pipeline (and audit for other zero-content notes) to Audit and re-block legacy un-decomposed notes (single-paragraph blocks containing full HTML) 2026-04-11 20:58:14 +00:00
Author
Owner

Scope Review: APPROVED (round 2)

Review note: review-969-2026-04-11-r2

All round 1 [BODY] findings resolved: title, What Broke, Repro Steps, audit SQL, resolution paths (re-block preferred), environment (id=404 + timestamps), and constraints (must use existing parser.py) are all corrected and internally consistent. Canary re-verified live: get_note_toc returns [], list_blocks returns 1 paragraph with ~7KB of raw HTML. Parser target file verified at src/pal_e_docs/blocks/parser.py (316 lines).

Traceability triangle complete (story:superuser-maintain, arch:notes, issue open). Spike-shaped, single-agent scope. No decomposition needed.

Ticket can advance backlog → todo.
