Evaluation
LLM evaluation
LLM evaluation measures whether a model or AI workflow is accurate, useful, safe, reliable, and cost-effective for a target task.
Expanded definition
LLM evaluation combines offline test sets, human review, automated graders, regression tests, production monitoring, and task-specific metrics. Strong evaluation looks beyond generic benchmark scores and measures groundedness, tool accuracy, latency, cost, refusal behavior, and failure cases for the actual user workflow.
Related terms
Explore adjacent ideas in the knowledge graph.
Related
Comparisons, tools, and models that connect to this idea.