Model Evaluation Rubric for Production LLMs
A repeatable rubric for comparing production LLM candidates across quality, latency, cost, tool use, safety, and operational fit.
evaluation, llmops, model-selection, production
Prompt text
Copy into your favorite runtime.
Create a production LLM evaluation rubric for the following application:

[Describe application, users, traffic profile, and risk level]

Score candidate models across:
- task quality and groundedness
- latency at p50, p95, and p99
- token cost and expected monthly spend
- tool-use reliability and structured-output adherence
- safety, privacy, and data-residency requirements
- model availability, quota, and operational fit

Return a weighted scorecard, a test-set plan, acceptance thresholds, and a recommendation format that separates measured evidence from assumptions.
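To make the requested outputs concrete, here is a minimal Python sketch of how a weighted scorecard, latency percentiles, and acceptance thresholds could be computed once measurements exist. Everything specific in it is an assumption for illustration, not part of the prompt: the criterion keys, the weights, the 0-5 scoring scale, the floor values, and the function names `percentile`, `weighted_total`, `failed_floors`, and `summarize`.

```python
"""Illustrative scorecard sketch; weights, floors, and the 0-5 scale are hypothetical."""

# Hypothetical weight per rubric criterion (assumed, not prescribed); should sum to 1.0.
WEIGHTS = {
    "quality": 0.30,
    "latency": 0.15,
    "cost": 0.15,
    "tool_use": 0.15,
    "safety": 0.15,
    "operational_fit": 0.10,
}

# Hypothetical acceptance floors on a 0-5 scale. A candidate below any floor
# is rejected regardless of its weighted total.
FLOORS = {"quality": 3.0, "safety": 4.0}


def percentile(samples, p):
    """Nearest-rank percentile of `samples` for an integer p in 1..100."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def weighted_total(scores):
    """Weighted sum of per-criterion scores; every criterion must be present."""
    missing = sorted(set(WEIGHTS) - set(scores))
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def failed_floors(scores):
    """Names of criteria whose score falls below its acceptance floor."""
    return sorted(name for name, floor in FLOORS.items() if scores[name] < floor)


def summarize(model_name, scores):
    """Bundle the weighted total and threshold check into one record."""
    failures = failed_floors(scores)
    return {
        "model": model_name,
        "weighted_total": round(weighted_total(scores), 3),
        "failed_floors": failures,
        "accepted": not failures,
    }


if __name__ == "__main__":
    # Made-up latency samples in milliseconds, only to exercise the helper.
    latencies_ms = [120, 135, 150, 180, 240, 310, 420, 950]
    print({f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)})

    # Made-up scores for a hypothetical candidate.
    candidate = {
        "quality": 4.2, "latency": 3.5, "cost": 3.0,
        "tool_use": 4.0, "safety": 4.5, "operational_fit": 3.8,
    }
    print(summarize("candidate-a", candidate))
```

Keeping the floors separate from the weighted total is a deliberate choice in this sketch: a high total cannot hide a failure on a criterion such as safety. Each score would still need a note on whether it came from a measured test run or from an assumption.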