GenAIWiki

Evaluation

LLM evaluation

LLM evaluation measures whether a model or AI workflow is accurate, useful, safe, reliable, and cost-effective for a target task.

Expanded definition

LLM evaluation combines offline test sets, human review, automated graders, regression tests, production monitoring, and task-specific metrics. Strong evaluation looks beyond generic benchmark scores and measures groundedness, tool accuracy, latency, cost, refusal behavior, and failure cases for the actual user workflow.

Related terms

Explore adjacent ideas in the knowledge graph.

Related

Comparisons, tools, and models that connect to this idea.