Add player name normalization helper #437

Open
opened 2026-04-10 23:30:36 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Feature

Lineage

Standalone — spawned from the westside-sheet-sync project scaffold on 2026-04-10. Blocker for the sheet_sync service module because the DB and sheet use different name formats; without normalization the sync would insert duplicates.

Repo

forgejo_admin/basketball-api

User Story

As the sheet_sync service
I want a function that normalizes player names across different formats (DB "Firstname Lastname" vs Sheet "LASTNAME, Firstname")
So that I can reliably decide whether a DB player already exists in the sheet without inserting duplicates

Ties to story:sheet-sync.

Context

The basketball-api DB stores player names as a single VARCHAR column players.name, formatted "First Last" or "First Middle Last" (e.g., "Daniel Bryan Niyitanga", "Mateus Rigitano de Paula", "Sarah Lédio da Silva"). Marcus's Google Sheet uses "LASTNAME, Firstname" formatting (e.g., "NIYITANGA, Daniel Bryan", "RIGITANO DE PAULA, Mateus", "DA SILVA, Sarah Lédio").

For the sync to be idempotent (running it twice in a row should be a no-op if nothing has changed), we need a function that takes two strings and returns True if they refer to the same player. Challenges:

  • Capitalization: DB has proper case, sheet has uppercase last names
  • Order: "First Last" vs "Last, First"
  • Multi-word last names: "Rigitano de Paula" must not be split as "Rigitano" last / "de Paula" first
  • Accents and unicode: "Sarah Lédio da Silva" — normalize to NFKD + strip diacritics for matching
  • Single-name players: "Elson" (no last name in the DB) needs to match "ELSON" in the sheet
  • Apostrophes, hyphens, and suffix punctuation: "Brown Jr." vs "BROWN JR." (the trailing period and case of the suffix must not affect matching)

The right abstraction: a normalize_name(name: str) -> str function that returns a canonical form regardless of input format. Two names are "the same" if their canonical forms are equal and non-empty (names_match("", "") must return False, per the Acceptance Criteria).
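A minimal sketch of how the canonical form described above could be built with only the stdlib. The specific choices here are assumptions for illustration, not decisions made in this ticket: the canonical form is lowercase "first last" order, the text before the first comma is treated as the last name, punctuation is replaced by whitespace, and non-string input raises TypeError.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Return a lowercase, accent-free, "first last" canonical form."""
    if not isinstance(name, str):
        # Assumed policy: reject None and other non-strings loudly.
        raise TypeError(f"name must be str, got {type(name).__name__}")
    # Sheet format "LAST, First": move everything before the first comma to the end.
    last, comma, first = name.partition(",")
    if comma:
        name = f"{first} {last}"
    # NFKD splits "é" into "e" plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() handles case; punctuation (".", "'", "-") becomes whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", unaccented.casefold())
    # Collapse runs of whitespace and trim the ends.
    return " ".join(cleaned.split())


def names_match(a: str, b: str) -> bool:
    """True only when both canonical forms are non-empty and identical."""
    canon_a, canon_b = normalize_name(a), normalize_name(b)
    return bool(canon_a) and canon_a == canon_b
```

Because the whole pre-comma segment is moved as one unit, a multi-word last name such as "RIGITANO DE PAULA" is never split. The sketch relies on the comma to detect the sheet format, so a sheet row written as "LASTNAME Firstname" without a comma would not match its DB counterpart.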

File Targets

Files to create:

  • src/basketball_api/services/name_normalize.py — contains normalize_name(name: str) -> str and names_match(a: str, b: str) -> bool.

Files to create (tests):

  • tests/test_name_normalize.py — table-driven tests covering every edge case listed in Context.

Files NOT to touch:

  • src/basketball_api/models.py — no schema changes.
  • Any existing route or service file — this is a new standalone helper.

Acceptance Criteria

  • When I call normalize_name("Daniel Bryan Niyitanga") and normalize_name("NIYITANGA, Daniel Bryan"), then both return the same canonical form.
  • When I call names_match("Mateus Rigitano de Paula", "RIGITANO DE PAULA, Mateus"), then it returns True.
  • When I call names_match("Sarah Lédio da Silva", "DA SILVA, Sarah Lédio"), then it returns True (unicode handled).
  • When I call names_match("Elson", "ELSON"), then it returns True (single-name player).
  • When I call names_match("Terrail Brown Jr.", "BROWN JR., Terrail"), then it returns True (suffix handled).
  • When I call names_match("Jace Bronson", "Jacelyn Bronson"), then it returns False (similar but different first names).
  • When I call names_match("", ""), then it returns False (empty strings are not a match).

Test Expectations

  • Unit test: table-driven test in tests/test_name_normalize.py with at least 15 pairs covering the edge cases above, split roughly evenly between expected True and expected False.
  • Edge case test: verify that the function handles None input by raising TypeError (or returning False from names_match — pick one, document it).
  • Run command: pytest tests/test_name_normalize.py -v
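One possible shape for the table, sketched with the stdlib only so it runs on its own. The rows are the pairs from the Acceptance Criteria; the helper name failing_cases and the idea of passing the function under test as an argument are assumptions for illustration. In the real test file the same rows could feed pytest.mark.parametrize instead, with names_match imported from the new module.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str, bool]

# (db_name, sheet_name, expected) rows taken from the Acceptance Criteria.
# The real table would grow to at least 15 rows.
CASES: List[Case] = [
    ("Daniel Bryan Niyitanga", "NIYITANGA, Daniel Bryan", True),
    ("Mateus Rigitano de Paula", "RIGITANO DE PAULA, Mateus", True),
    ("Sarah Lédio da Silva", "DA SILVA, Sarah Lédio", True),
    ("Elson", "ELSON", True),
    ("Terrail Brown Jr.", "BROWN JR., Terrail", True),
    ("Jace Bronson", "Jacelyn Bronson", False),
    ("", "", False),
]


def failing_cases(
    match: Callable[[str, str], bool],
    cases: List[Case] = CASES,
) -> List[Case]:
    """Return every row where match(a, b) disagrees with the expected value."""
    return [(a, b, expected) for a, b, expected in cases if match(a, b) != expected]
```

A test would then assert that failing_cases(names_match) is an empty list; returning the failing rows rather than stopping at the first one keeps the failure message informative.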

Constraints

  • Use only stdlib (unicodedata, re) — no new third-party dependencies.
  • Function must be pure (no side effects, no DB access).
  • Must handle Unicode correctly via unicodedata.normalize("NFKD", s).
  • Do NOT use fuzzy matching (Levenshtein, SequenceMatcher); exact canonical match only. Fuzzy matching would silently collapse similar-but-different players ("Jace Bronson" vs "Jacelyn Bronson"), which is a data integrity bug.
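To illustrate the fuzzy-matching constraint, the sketch below compares a similarity score from difflib.SequenceMatcher with a plain case-insensitive equality check. The helper names and the 0.85 threshold are assumptions chosen for the demonstration; nothing in this ticket specifies a threshold.

```python
from difflib import SequenceMatcher

# Illustrative threshold only; not a value taken from this ticket.
ASSUMED_FUZZY_THRESHOLD = 0.85


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity score in [0, 1] from the stdlib SequenceMatcher."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def exact_match(a: str, b: str) -> bool:
    """Case-insensitive exact comparison, with no tolerance for extra letters."""
    return a.casefold() == b.casefold()


# Two different players whose names differ by three letters.
jace, jacelyn = "Jace Bronson", "Jacelyn Bronson"
would_collapse = fuzzy_ratio(jace, jacelyn) > ASSUMED_FUZZY_THRESHOLD
stays_distinct = not exact_match(jace, jacelyn)
```

Under this assumed threshold the fuzzy score treats the two players as one, while the exact comparison keeps them apart, which is the behavior the constraint asks for.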

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • westside-sheet-sync — project this affects
  • story-westside-jersey-sheet-sync — user story
  • Blocks: sheet_sync service module ticket
Reference
ldraney/basketball-api#437