Rows

Each row pins a task type, an acceptable error rate, a human review rule, and a logging requirement, as sketched below.
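A minimal sketch of one row as a typed record; the field names (`task_type`, `acceptable_error`, and so on) and the example values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSheetRow:
    """One row of the evaluation sheet. All fields are illustrative."""
    task_type: str            # e.g. "summarization", "tool_call"
    acceptable_error: float   # maximum tolerated error rate, e.g. 0.02
    human_review_rule: str    # e.g. "sample 5% of outputs weekly"
    logging_requirement: str  # e.g. "log prompt, output, model ID"

rows = [
    EvalSheetRow("summarization", 0.05, "sample 5% weekly", "log prompt + output + model ID"),
    EvalSheetRow("tool_call", 0.01, "review every failure", "log full tool I/O"),
]
```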

Owners

Security, legal, product, and ops each sign one row.

Cadence

Re-evaluate when vendors ship major version bumps.

Pitfall

Do not confuse demo sparkle with production robustness.

Evaluation sheets are contracts

A serious model evaluation sheet locks the task definition, dataset version, scorer rubric, and pass thresholds before anyone sees a leaderboard. Moving goalposts mid-benchmark produces theater, not decisions.

Freeze a small eval set that is public inside the org and a larger private hold-out set; tune against neither more than once per release cycle.
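One way to make the freeze enforceable is to fingerprint each set at lock time; a sketch assuming the sets live as JSONL files at hypothetical paths:

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """Content hash that pins a dataset version; any edit changes the digest."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Record these digests in the eval sheet before the first run;
# a mismatch at scoring time means the frozen set drifted.
frozen = {
    "public_internal": dataset_fingerprint("evals/public_internal.jsonl"),
    "private_holdout": dataset_fingerprint("evals/private_holdout.jsonl"),
}
```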

Dimensions beyond accuracy

Score latency at p95, refusal appropriateness, formatting stability, and tool-call correctness independently. Stakeholders weight dimensions differently by use case.
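A sketch of independent scoring with per-use-case weights; the dimension names match the ones above, but the weights and numbers are invented for illustration:

```python
import statistics

def p95(latencies_ms: list[float]) -> float:
    """95th-percentile latency; the quantile method is an implementation choice."""
    return statistics.quantiles(latencies_ms, n=100)[94]

# Each dimension is scored independently on [0, 1]; weights vary by use case.
scores = {"accuracy": 0.91, "refusal_appropriateness": 0.84,
          "formatting_stability": 0.97, "tool_call_correctness": 0.88}
weights_support_bot = {"accuracy": 0.3, "refusal_appropriateness": 0.4,
                       "formatting_stability": 0.1, "tool_call_correctness": 0.2}

weighted = sum(scores[dim] * w for dim, w in weights_support_bot.items())
```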

Capture cost per successful task early; a smarter model that bankrupts a workflow is not an upgrade.
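To see why, compare cost per successful task rather than per attempt; the prices and success rates below are made up for illustration:

```python
def cost_per_successful_task(total_cost_usd: float, successes: int) -> float:
    """Failed attempts still burn tokens, so divide by successes, not attempts."""
    if successes == 0:
        return float("inf")  # the workflow never completes; cost is unbounded
    return total_cost_usd / successes

# Hypothetical: a $0.40/attempt model at 95% success costs ~$0.42 per success;
# a $0.08/attempt model at 15% success costs ~$0.53 per success. Cheap loses.
per_success_smart = cost_per_successful_task(0.40 * 100, 95)  # ~0.42
per_success_cheap = cost_per_successful_task(0.08 * 100, 15)  # ~0.53
```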

Human agreement discipline

Put two independent graders on a shared subset and measure Cohen's kappa or a simple disagreement rate. Low agreement means the rubric, not the model, is broken.
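Cohen's kappa is short enough to compute inline; this sketch handles categorical grades and corrects raw agreement for chance:

```python
from collections import Counter

def cohens_kappa(grades_a: list[str], grades_b: list[str]) -> float:
    """Observed agreement corrected for chance agreement from grader marginals."""
    n = len(grades_a)
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    marg_a, marg_b = Counter(grades_a), Counter(grades_b)
    expected = sum(marg_a[label] * marg_b[label]
                   for label in marg_a.keys() | marg_b.keys()) / n**2
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Two graders on the same 4 items: kappa = 0.5, "moderate" on the usual scale
print(cohens_kappa(["pass", "pass", "fail", "pass"],
                   ["pass", "fail", "fail", "pass"]))
```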

Publish example disagreements (redacted) in the appendix so future graders calibrate faster.

From sheet to ship decision

End each eval with an explicit “ship / iterate / stop” recommendation signed by product and risk. Ambiguous conclusions waste engineering weeks.
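A decision rule this blunt can live in code; the three inputs below are stand-ins for the sheet's actual pass thresholds and risk sign-off:

```python
def recommendation(passed_thresholds: bool, blocking_risks: bool,
                   failures_look_fixable: bool) -> str:
    """Force an explicit call; 'needs more discussion' is not an output."""
    if passed_thresholds and not blocking_risks:
        return "ship"
    if failures_look_fixable:
        return "iterate"
    return "stop"
```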

SignalSpring archives sheets with model IDs and commit hashes so six-month audits are painless.
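SignalSpring's actual tooling is not shown here, but a hypothetical archive record needs little more than the fields named above:

```python
import datetime
import subprocess

def archive_record(sheet_path: str, model_id: str) -> dict:
    """Bundle the sheet with model ID and repo state so audits can replay it."""
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()
    return {
        "sheet": sheet_path,
        "model_id": model_id,
        "commit": commit,
        "archived_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```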