feat: AI engine core with Anthropic tool_use and GroupMe posting #12

Merged
forgejo_admin merged 1 commit from 6-ai-engine-core into main 2026-03-28 22:57:54 +00:00

Summary

Implements the Anthropic Claude tool_use integration for processing GroupMe messages as basketball program operations. Read tools execute immediately via basketball.py; write tools return confirmation indicators (not executed) for ticket #8 to handle the confirmation UX.

Changes

  • app/ai.py (new) — AI engine core: system prompt, 14 tool definitions (7 reads + 7 writes with metadata tags), in-memory conversation history (20 messages per group), process_message() entry point, read tool dispatch via BasketballClient, write tool confirmation signals, GroupMe bot response posting with 1000-char truncation
  • app/groupme.py (modified) — Wired callback to ai.process_message(), added empty-text guard, wrapped AI call in try/except for resilience
  • tests/test_ai.py (new) — 38 tests covering tool definitions, conversation history cap, read tool dispatch, write tool non-execution, truncation, GroupMe posting, full process_message integration, and text extraction
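As a hedged illustration of the tool-definition shape described above (the exact names and schema fields in app/ai.py may differ), a read tool entry with its operation metadata might look like:

```python
# Hypothetical sketch of one read-tool definition; the real names and
# schema fields in app/ai.py may differ.
GET_ROSTER_TOOL = {
    "name": "get_roster",
    "description": "List the players registered for the current season.",
    "input_schema": {
        "type": "object",
        "properties": {"team": {"type": "string"}},
        "required": [],
    },
}

# Metadata kept alongside the definition so dispatch can tell reads
# (execute immediately) from writes (return a confirmation signal).
TOOL_METADATA = {"get_roster": {"operation": "read"}}
```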

Test Plan

  • pytest tests/ -v — 68 tests pass (38 new + 30 existing)
  • Verified: read tools call correct BasketballClient methods
  • Verified: write tools return pending_confirmation dict, basketball client NOT called
  • Verified: conversation history capped at 20 per group
  • Verified: GroupMe post sends correct bot_id and text payload
  • Verified: responses over 1000 chars truncated with "..."
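A minimal sketch of the truncation behavior verified above, assuming a 1000-char cap (the helper and constant names are illustrative):

```python
MAX_RESPONSE_LENGTH = 1000  # GroupMe response cap described above

def truncate(text: str) -> str:
    # Over-long responses are cut to 997 chars plus "...", so the
    # result never exceeds 1000 characters.
    if len(text) <= MAX_RESPONSE_LENGTH:
        return text
    return text[: MAX_RESPONSE_LENGTH - 3] + "..."
```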

Review Checklist

  • [x] Tests pass (68/68)
  • [x] No unrelated changes
  • [x] Read tools execute immediately
  • [x] Write tools return confirmation indicator, NOT executed
  • [x] GroupMe responses are plain text, no HTML/markdown
  • [x] Conversation history capped at 20 per group

Related

Closes #6

  • Depends on: #4 (scaffold), #5 (basketball client)
  • Feeds into: #8 (write confirmation state machine)

Related Notes

  • project-westside-ai-assistant — parent project
  • arch-dataflow-westside-ai-assistant — full message flow
  • story-westside-ai-assistant-read-ops — read operations story
Implements the Anthropic Claude integration for processing GroupMe messages
as basketball program operations. Read tools execute immediately via
basketball.py; write tools return confirmation indicators for ticket #8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

QA Review — PR #12

Acceptance Criteria Check

| Criteria | Status |
|----------|--------|
| Read query triggers correct tool and posts to GroupMe | PASS |
| 14 tool definitions (7 read + 7 write) with metadata | PASS |
| Write tools defined but NOT executed | PASS |
| Conversation history capped at 20 per group | PASS |
| GroupMe responses are plain text, concise | PASS |
| Truncation at 1000 chars with "..." | PASS |

Code Quality

app/ai.py

  • System prompt meets all requirements: concise, plain text, no internal tool names
  • Tool definitions match basketball.py methods 1:1 (7 reads, 7 writes)
  • Metadata tagging (operation: read/write) enables clean dispatch logic
  • Read tools execute via BasketballClient immediately; write tools return pending_confirmation dict
  • GroupMe posting via httpx.AsyncClient to https://api.groupme.com/v3/bots/post
  • Truncation logic correct: text[:997] + "..." when over 1000 chars
  • Error handling: Anthropic API failures post fallback message to GroupMe
  • BasketballClient properly closed in finally block

app/groupme.py

  • Clean wiring: callback -> process_message() -> GroupMe post
  • Empty text guard prevents unnecessary AI calls
  • Exception handling wraps AI call without crashing the webhook
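As a hedged sketch of the wiring above (the payload field names, logger, and stand-in process_message are assumptions), the callback might look like:

```python
import asyncio
import logging

logger = logging.getLogger("groupme")

calls: list[tuple[str, str]] = []

async def process_message(group_id: str, text: str) -> None:
    # Stand-in for ai.process_message(); just records the call here.
    calls.append((group_id, text))

async def handle_callback(payload: dict) -> None:
    # Sketch of the wiring described above; field names are assumptions.
    text = (payload.get("text") or "").strip()
    if not text:
        return  # empty-text guard: skip the AI call entirely
    try:
        await process_message(payload.get("group_id", ""), text)
    except Exception:
        # Log and swallow so the webhook can still return 200.
        logger.exception("AI processing failed")

asyncio.run(handle_callback({"group_id": "g1", "text": "  "}))
asyncio.run(handle_callback({"group_id": "g1", "text": "who won last night?"}))
```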

tests/test_ai.py

  • 38 tests covering all acceptance criteria
  • Mocking strategy is sound: Anthropic responses, BasketballClient, httpx
  • Write non-execution verified: mock_basketball.update_player.assert_not_called()
  • History cap test fills MAX_HISTORY+10 and verifies eviction

Nits

  1. Dual history stores: _conversation_history (simple strings) and _raw_conversation_history (structured tool blocks) serve different purposes but could drift. Acceptable for now since #8 will rework conversation state for confirmation flows.

  2. Synchronous Anthropic client: Uses anthropic.Anthropic (sync) inside async functions. Works because the sync client blocks the event loop briefly. Consider migrating to anthropic.AsyncAnthropic in a future ticket for better concurrency under load.

  3. No asyncio_mode in pytest config: Tests use @pytest.mark.asyncio per-test. A pyproject.toml with asyncio_mode = "auto" would reduce boilerplate, but this matches the existing test style in test_basketball.py.

None of these are blockers.

Test Results

68/68 tests pass. No regressions.


VERDICT: APPROVED


PR #12 Review

DOMAIN REVIEW

Tech stack: Python / FastAPI / Anthropic SDK / httpx / Pydantic / pytest

This PR implements the AI engine core for the westside-ai-assistant: Anthropic Claude tool_use integration with 14 tool definitions (7 read, 7 write), in-memory conversation history, read-tool dispatch to BasketballClient, write-tool confirmation signals (deferred to #8), and GroupMe bot response posting.

Architecture quality: Clean separation of concerns. The tool_use loop pattern is well-structured: read tools execute immediately via BasketballClient, write tools return pending_confirmation dicts. The dual history system (simple deque for capping + raw structured history for API calls) solves the Anthropic API's requirement for structured tool_use/tool_result blocks while maintaining a clean capped history.

Python/FastAPI observations:

  1. Synchronous Anthropic client in async context (medium concern): process_message() is async but creates a synchronous anthropic.Anthropic() client and calls messages.create() synchronously. This blocks the event loop during API calls. The Anthropic SDK provides anthropic.AsyncAnthropic for async contexts. This will cause latency under concurrent GroupMe messages since the event loop is blocked during each Anthropic API call. Not a blocker for initial deployment (single-group, low concurrency), but should be tracked.

  2. Anthropic client instantiated per request (app/ai.py line ~299): A new anthropic.Anthropic() client is created on every process_message() call. This means a new HTTP connection pool per request. Consider creating the client once at module level or as a cached dependency. Not a blocker but a performance concern.

  3. BasketballClient also instantiated per request (app/ai.py line ~315): Same pattern -- new client per message. The try/finally with await basketball_client.close() is correct for cleanup, but the client could be reused.

  4. No rate limiting on Anthropic API calls: A burst of GroupMe messages could exhaust API quota. Acceptable for initial deployment but worth noting for #7 (go-live).

  5. _execute_read_tool exception logging leaks internal details (app/ai.py): The except Exception block returns f"Error: {exc}" which gets fed back to Claude as tool_result content. Claude could then relay internal exception messages to GroupMe users. The system prompt says "never expose internal tool names or API details" but exception strings could contain URLs, stack info, etc. Consider returning a generic error to Claude and logging the detail server-side only.

  6. PEP 484 type hints: Good coverage overall. _handle_response uses Any for the response type which is acceptable given the SDK's return type. The dict | str union return type on process_message is clear.

  7. Hardcoded tenant slug: The WebFetch summary of basketball.py mentions a hardcoded tenant slug "westside-kings-queens". This is pre-existing (from #5), not introduced in this PR -- no action needed here.

Test quality (38 new tests):

  • Excellent coverage of tool definitions, conversation history, read dispatch, write confirmation, truncation, GroupMe posting, and end-to-end integration.
  • Good use of fixture isolation (_clear_history autouse fixture).
  • Edge cases covered: unknown tools, exceptions, None results, empty content, multiple text blocks.
  • Tests properly verify that write tools are NOT executed (basketball client method assert_not_called).

Missing test coverage for groupme.py changes:

  • The existing test_groupme.py does not test the new empty-text guard (if not text.strip()) added in this PR.
  • The existing test_groupme.py does not test that process_message is actually called when a valid message arrives (it only tests HTTP 200 returns). The test_valid_message_parsed test predates the AI integration and does not mock/verify the process_message call.
  • These are covered indirectly through test_ai.py integration tests, but the groupme.py modifications themselves lack direct test coverage for the new behavior.

BLOCKERS

None. All blocker criteria pass:

  • Test coverage: 38 new tests covering all new functionality (happy path, edge cases, error handling). The app/ai.py module has thorough test coverage.
  • Input validation: User input flows through GroupMe webhook validation (bot filter, allowlist, empty text guard) before reaching the AI engine. Tool inputs are schema-validated by the Anthropic API.
  • No secrets in code: API keys loaded from environment variables via Pydantic Settings. No hardcoded credentials.
  • No DRY violations in auth paths: Authentication is handled by BasketballClient (from #5), not duplicated here.

NITS

  1. Sync client in async context: As noted above, anthropic.Anthropic is synchronous. Switching to anthropic.AsyncAnthropic with await client.messages.create() would avoid blocking the event loop. Low urgency for single-group usage but should be a follow-up ticket.

  2. Per-request client instantiation: Both Anthropic() and BasketballClient() are created fresh per message. A module-level or cached client would reduce connection overhead.

  3. Error message leakage to Claude: _execute_read_tool returns raw exception strings as tool results. Consider return "Error: could not retrieve data" with the full exception logged server-side only, to prevent internal details from reaching GroupMe via Claude's response.

  4. Missing tests for groupme.py changes: The empty-text guard and process_message wiring in groupme.py lack direct unit tests. The existing test_groupme.py should be updated to verify: (a) empty text returns early without calling process_message, (b) valid text calls process_message.

  5. _truncate off-by-one clarity: The implementation text[: MAX_RESPONSE_LENGTH - 3] + "..." is correct but the constant relationship could be documented: the "..." consumes 3 of the 1000 chars, so content is 997 chars max. Tests confirm this works correctly.

  6. Dual history divergence risk: Simple history (_conversation_history) and raw history (_raw_conversation_history) can drift since they are appended independently. For example, if the simple history evicts old messages but raw history still has them, the Anthropic API gets different context than what the deque cap suggests. This is acceptable for the current in-memory-only design but worth noting for any future persistence work.
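For nit 1, besides migrating to the SDK's async client, one stopgap (not proposed in the review, purely illustrative) is to offload the blocking call with asyncio.to_thread so the event loop stays responsive:

```python
import asyncio
import time

def blocking_api_call() -> str:
    # Stands in for a synchronous messages.create() round trip.
    time.sleep(0.05)
    return "response"

async def handler() -> str:
    # to_thread runs the blocking call in a worker thread, so other
    # coroutines keep running while this one waits on the API.
    return await asyncio.to_thread(blocking_api_call)
```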

SOP COMPLIANCE

  • [x] Branch named after issue: 6-ai-engine-core follows {issue-number}-{kebab-case-purpose} convention
  • [x] PR body follows template: Summary, Changes, Test Plan, Review Checklist, Related sections all present
  • [x] Related references: "Closes #6", dependencies on #4 and #5 noted, feeds into #8
  • [x] No secrets committed: API keys via env vars only
  • [x] No unrelated changes: All changes scoped to issue #6 acceptance criteria
  • [x] Single commit with descriptive message: feat: AI engine core with Anthropic tool_use and GroupMe response posting
  • [ ] Related references plan slug: PR Related section references tickets (#4, #5, #8) but does not reference the plan slug (project-westside-ai-assistant). Issue #6 does reference it. Minor gap.

PROCESS OBSERVATIONS

  • Deployment frequency: This is ticket 3 of a clear dependency chain (#4 -> #5 -> #6 -> #8 -> #7). Clean decomposition enables sequential delivery.
  • Change failure risk: Low. Write operations are explicitly gated (not executed), read operations are well-tested, and the GroupMe integration is resilient (try/except with 200 returns).
  • Documentation: Acceptance criteria from issue #6 are fully met: 14 tools defined with metadata, read dispatch works, write confirmation signals returned, history capped at 20, truncation at 1000 chars, plain text enforcement in system prompt.
  • Discovered scope: The sync-vs-async Anthropic client and per-request instantiation patterns should be tracked as a follow-up optimization ticket.

VERDICT: APPROVED

forgejo_admin deleted branch 6-ai-engine-core 2026-03-28 22:57:54 +00:00
Reference
forgejo_admin/westside-ai-assistant!12