GenAIWiki

Model Evaluation Rubric for Production LLMs

A repeatable rubric for comparing production LLM candidates across quality, latency, cost, tool use, safety, and operational fit.

Tags: evaluation, llmops, model-selection, production

Prompt text

Copy into your favorite runtime.

Create a production LLM evaluation rubric for the following application:

[Describe application, users, traffic profile, and risk level]

Score candidate models across:
- task quality and groundedness
- latency at p50, p95, and p99
- token cost and expected monthly spend
- tool-use reliability and structured-output adherence
- safety, privacy, and data-residency requirements
- model availability, quota, and operational fit

Return a weighted scorecard, a test-set plan, acceptance thresholds, and a recommendation format that separates measured evidence from assumptions.
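
The prompt asks for a weighted scorecard and acceptance thresholds but leaves the arithmetic to the model. For checking a returned scorecard by hand, the following is a minimal sketch of that arithmetic, not part of the prompt itself. Everything specific here is an assumption made for illustration: the weights, the 1-to-5 score scale, the hard floors on quality and safety, and the helper names (`weighted_score`, `evaluate`, `latency_percentiles`, `monthly_spend`). Prices are passed in as parameters rather than taken from any provider.

```python
from statistics import quantiles

# Hypothetical weights for the six dimensions listed in the prompt.
# They are illustrative only and should sum to 1.0.
WEIGHTS = {
    "quality": 0.30,
    "latency": 0.15,
    "cost": 0.15,
    "tool_use": 0.15,
    "safety": 0.15,
    "ops_fit": 0.10,
}

# Hypothetical acceptance floors on an assumed 1-5 scale. A candidate that
# falls below any floor is rejected regardless of its weighted total.
FLOORS = {"quality": 3.0, "safety": 4.0}


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 from raw latency samples in milliseconds."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def monthly_spend(requests_per_month, input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate monthly spend from average tokens per request and token prices."""
    token_cost = (input_tokens * input_price_per_million
                  + output_tokens * output_price_per_million)
    return requests_per_month * token_cost / 1_000_000


def weighted_score(scores):
    """Combine per-dimension scores into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def evaluate(scores):
    """Return the weighted total plus any acceptance floors the candidate failed."""
    failed = sorted(name for name, floor in FLOORS.items() if scores[name] < floor)
    return {
        "total": round(weighted_score(scores), 2),
        "failed_gates": failed,
        "accepted": not failed,
    }


if __name__ == "__main__":
    # Made-up scores for a hypothetical candidate, for demonstration only.
    candidate = {"quality": 4.2, "latency": 3.5, "cost": 4.0,
                 "tool_use": 3.8, "safety": 4.5, "ops_fit": 3.0}
    print(evaluate(candidate))
```

Keeping the floors separate from the weighted total is a deliberate choice in this sketch: a high average cannot mask a safety or quality failure. Scores fed into `evaluate` should come from measured test-set results where possible, with any estimated values flagged as assumptions, as the prompt requires.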