LLM evaluation

LLM evaluation measures whether a model or AI workflow is accurate, useful, safe, reliable, and cost-effective for a target task.

Expanded definition

LLM evaluation combines offline test sets, human review, automated graders, regression tests, production monitoring, and task-specific metrics. Strong evaluation looks beyond generic benchmark scores and measures groundedness, tool accuracy, latency, cost, refusal behavior, and failure cases for the actual user workflow.
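As a rough illustration of how an offline test set, an automated grader, and a regression check might fit together, here is a minimal sketch. Everything in it is hypothetical: `stub_model` stands in for a real model call, `GOLDEN_SET` is a made-up three-item test set, `exact_match` is only one of many possible graders, and `MIN_ACCURACY` is an arbitrary threshold rather than a recommended value.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # reference answer from the golden set


def exact_match(output: str, expected: str) -> bool:
    """Automated grader: exact match, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def evaluate(model: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case through the model and collect accuracy, latency, and failures."""
    if not cases:
        raise ValueError("the test set must contain at least one case")
    passed = 0
    latencies = []
    failures = []
    for case in cases:
        start = time.perf_counter()
        output = model(case.prompt)
        latencies.append(time.perf_counter() - start)
        if exact_match(output, case.expected):
            passed += 1
        else:
            # Keep failing cases so they can go to human review.
            failures.append((case.prompt, output))
    return {
        "accuracy": passed / len(cases),
        "mean_latency_s": sum(latencies) / len(latencies),
        "failures": failures,
    }


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    answers = {"Capital of France?": "Paris", "2 + 2 = ?": "4"}
    return answers.get(prompt, "I don't know")


# Hypothetical golden set: prompts paired with reference answers.
GOLDEN_SET = [
    Case("Capital of France?", "paris"),
    Case("2 + 2 = ?", "4"),
    Case("Largest planet?", "Jupiter"),
]

MIN_ACCURACY = 0.6  # assumed regression threshold, chosen only for this example

report = evaluate(stub_model, GOLDEN_SET)
regression_ok = report["accuracy"] >= MIN_ACCURACY
print(f"accuracy={report['accuracy']:.2f} regression_ok={regression_ok}")
```

In a real workflow the exact-match grader would usually be replaced or supplemented by task-specific checks (groundedness, tool-call correctness, refusal behavior), and the same harness could be rerun on every model or prompt change so that a drop below the threshold is caught before release.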

Related terms

evals, monitoring, observability, golden set

Comparisons, tools, and models that connect to this idea.

Azure OpenAI vs. Amazon Bedrock (comparisons)
Generative Model (glossary)
Claude 3.5 Sonnet (models)
Adversarial Training (glossary)
Generative Adversarial Network (GAN) (glossary)
Graph Machine Learning (glossary)