fix: reblock 34 legacy un-decomposed notes #259
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo_admin/pal-e-api!259
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "255-reblock-legacy-notes"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
34 notes created before the block-decomposition pipeline existed were stored as a single paragraph block wrapping the entire HTML document in an outer
<p>tag. This made them invisible to all block-first query paths:get_note_tocreturned no headings,get_sectioncouldn't find anchors, and the embedding worker couldn't generate per-section embeddings.Changes
scripts/reblock_legacy_notes.py— one-off script that:<p>wrapper that was hiding block-level elements from the parserparse_html_to_blocks()parsernote_revisionentry (preserving history, not overwriting)--executeflag (dry-run by default)Audit Results (prod, 2026-04-12)
34 affected notes found. All re-blocked successfully. Resolution: re-block in place for all 34.
Totals: 581 new blocks, 252 headings across 34 notes.
2 notes (validation-223, validation-383) remained at 1 block because their content is a single paragraph without block-level elements inside the outer
<p>.Root Cause
Legacy notes created before
blocks/parser.pyexisted were stored with their entire HTML content wrapped in an outer<p>tag. When the initial backfill script ran, it parsed<p><h2>...<h3>...</p>as a single paragraph block (the parser correctly sees one<p>element). The fix: unwrap the outer<p>before re-parsing, exposing the actual h2/h3/table/pre structure to the parser.Test Plan
get_note_toc(slug="arch-secrets-pipeline")returns 9 headings (was 0)revised_by="scripts/reblock_legacy_notes.py"Discovered Scope
<p>wrapping pattern suggests the original note creation path did not go through the block parser. A follow-up ticket should investigate whether the hook/creation path has been fully fixed to prevent new un-decomposed notes.Review Checklist
parse_html_to_blocks()parser (no custom HTML converter)--executeflag required for writesPALDOCS_DATABASE_URLenv var supported for in-pod executionRelated Notes
arch-secrets-pipeline(id=404) — canary note, verified working post-fixsop-validation(id=952) — largest active SOP affected, now 31 blocks / 15 headingsRelated
Closes #255
QA Review: PR #259
Scope Verification
scripts/reblock_legacy_notes.py(315 lines)Findings
Positive:
parse_html_to_blocks()andcompile_blocks_to_html()-- no custom HTML converter_unwrap_outer_paragraph()correctly identifies the root cause (outer<p>wrapping legacy content) with a safe guard: only unwraps when block-level elements (h1-h6, table, pre, ul, ol, p) exist insideNoteRevisionsnapshot created BEFORE block deletion, with correctrevision_numberincrement matchingroutes/notes.pypattern--executerequired for writes -- good operational safetyscripts/backfill_blocks.pypatterns exactlyDATABASE_URLandPALDOCS_DATABASE_URLsupported for pod/local executionget_note_tocreturns 9 headings forarch-secrets-pipelineNits (non-blocking):
import re as re_mod-- aliasing stdlibreis unusual. The alias exists becausereis not used elsewhere in the file and was likely added to avoid a hypothetical conflict. Could just beimport re. Cosmetic only.Acceptance Criteria Check
scripts/arch-secrets-pipeline, 9 headings)VERDICT: APPROVED
PR #259 Review
DOMAIN REVIEW
Tech stack: Python / SQLAlchemy / BeautifulSoup -- one-off data migration script.
Parser reuse (constraint check): PASS. The script imports and uses the canonical
parse_html_to_blocks()frompal_e_docs.blocks.parserandcompile_blocks_to_html()frompal_e_docs.blocks.compiler. No custom HTML-to-block converter introduced.Audit SQL correctness: The
find_legacy_notes()query correctly implements the audit criteria:<=1 block(via outer join + coalesce) AND>1KB html_content(viafunc.length). Ordering bycreated_atis deterministic. The PR body documents all 34 results with before/after block counts.Canary verification: PR body confirms
arch-secrets-pipeline(id=404) went from 1 block to 22 blocks with 9 headings.get_note_tocwould return those 9 headings.note_revisions preservation: PASS. The script creates a
NoteRevisionwithrevised_by="scripts/reblock_legacy_notes.py"and incrementedrevision_numberBEFORE modifying blocks. The oldhtml_contentis captured in the revision snapshot. History is preserved, not overwritten.Data migration safety:
--executeflag required -- good safety gate.session.commit()at the end (all-or-nothing transaction) -- correct.finallyblock -- correct._unwrap_outer_paragraphlogic: Sound approach. Only unwraps when (a) single top-level<p>element AND (b) inner content contains block-level tags (h1-h6,table,pre,ul,ol,p). This correctly preserves notes that are genuinely single paragraphs (the 2 validation notes that stayed at 1 block confirm this works).Scope: Single file added (
scripts/reblock_legacy_notes.py). No unrelated changes. 315 additions, 0 deletions.BLOCKERS
None.
Test coverage assessment: This is a one-off migration script in
scripts/, not new application functionality (no new endpoints, models, or routes). The PR documents extensive manual validation: dry-run audit of all 34 notes, execute with per-note verification, canary check viaget_note_toc, revision history confirmation, and existing 685 tests passing. The script reuses the canonical parser/compiler which are already tested. For a run-once data fix, the manual test plan is sufficient.DRY observation (non-blocking): The script reimplements block insertion + compiled_page update + TOC build logic that already exists in
pal_e_docs/blocks/sync.pyasparse_and_store_blocks(). The script could callparse_and_store_blocks(session, note, html_to_parse)instead of manually deleting blocks, inserting blocks, compiling, building TOC, and updating compiled_page (lines 133-177 of the script replicate lines 75-112 ofsync.py). However, this is a one-off script that has already been executed, and the duplicated logic is functionally correct, so this is a nit rather than a blocker.NITS
DRY with
blocks/sync.py:reblock_note()lines 133-177 duplicateparse_and_store_blocks()frompal_e_docs/blocks/sync.py. The only addition the script needs beyond that function is the_unwrap_outer_paragraphpre-processing andNoteRevisioncreation. Refactoring toparse_and_store_blocks(session, note, html_to_parse)would eliminate ~40 lines and guarantee future parity if the canonical write path evolves. Low-risk since script is already executed, but worth noting for pattern awareness.re_modalias:import re as re_modis an unusual alias. Theremodule is only used once (line 84). Standard practice isimport re-- the alias adds cognitive overhead for no benefit.Discovered scope follow-up: The PR correctly identifies that the root cause (note creation path not going through block parser) should be investigated. Confirm a Forgejo issue exists for this.
SOP COMPLIANCE
255-reblock-legacy-notesmatches issue #255forgejo_admin/claude-custom#239fix: reblock 34 legacy un-decomposed notesPROCESS OBSERVATIONS
VERDICT: APPROVED