Corpus hygiene

Ownership per folder, stale doc alerts, and broken link checks in CI.

Retrieval is a logistics problem

Broken RAG usually fails on chunk boundaries and freshness, not embedding mystique. Start with ownership: who ingests new docs, how often, and what happens when Sales edits a slide deck at 4:59 p.m. on Friday.

Draw a simple swimlane from “author publishes” to “index reflects” with measured latency. If that number is unknown, stop tuning prompts.

Chunking rules worth writing down

Prefer structure-aware splits (headings, tables, API sections) over fixed token windows when your corpus mixes prose and specs. Keep provenance: file path, commit hash, or CMS version on every chunk.

Maintain a small “golden questions” set—twenty queries stakeholders agree should never regress—and run them after every ingestion change.

Hybrid search and human overrides

Keyword recall still wins for SKUs, error codes, and legal citations. Blend dense and lexical scores transparently so support engineers trust why a chunk surfaced.

Expose “source inspector” in internal tools: reviewers who cannot see context will invent their own (wrong) mental model.

Operational hygiene

Alert when duplicate near-copies flood the index; dedupe before users feel hallucination pressure. Log retrieval misses separately from model refusals—they need different fixes.

SignalSpring’s field note: the best retrieval teams spend more time deleting bad documents than writing clever prompts.