Grounded search

Search GenAIWiki

Search across models, tools, comparisons, tutorials, and glossary entries — with sources shown.

Full results from the index

All matches for “LLM benchmarks evaluation”, grouped by content type.

Glossary

LLM evaluation

LLM evaluation measures whether a model or AI workflow is accurate, useful, safe, reliable, and cost-effective for a target task.

Strong match

Indic LLM

An Indic LLM is a language model optimized for Indian languages, scripts, romanized text, code-mixing, and India-specific cultural or domain context.

Strong match

Prompts

Experiment Design for A/B LLM - Advanced

In-depth guide for designing A/B tests specifically for large language models.

Strong match

A/B Testing Experiment Design

A structured template to design A/B tests for LLM applications, ensuring consistency in experiment setup.

Strong match

Dataset Card Draft for LLM Training (Advanced)

An advanced template for creating detailed dataset cards focusing on comprehensive metadata for LLM training datasets.

Dataset Card Draft for LLM Training

Specific guidelines for creating dataset cards for LLM training datasets.

Technical Workshop Lesson Plan

An organized lesson plan template for conducting technical workshops on LLMs and their applications.

ML Interview Evaluation Framework

A structured rubric for evaluating candidates in machine learning roles.

Experiment Design for A/B LLM

A structured approach to designing experiments for A/B testing in language models.

Dataset Card Draft

A standardized template for documenting dataset characteristics, usage, and limitations for LLM training.

ML Role Interview Rubric

A structured rubric designed for evaluating candidates in machine learning roles.

ML Role Interview Evaluation Criteria

A detailed rubric for evaluating candidates for machine learning roles, focusing on technical and interpersonal skills.

Data Pipeline Debugging Protocol - Comprehensive

Evaluates candidates for machine learning positions based on technical and soft skills.

Machine Learning Interview Rubric for Candidates

A structured rubric for evaluating candidates in machine learning interviews, focusing on technical and soft skills.

Tutorials

Observability: Traces for LLM + Tool Spans

Implementing observability practices to trace interactions between large language models (LLMs) and external tools. Prerequisites include knowledge of observability tools and LLM architectures.

Strong match

Models

Llama 3.1 405B Instruct

Meta’s largest open-weights instruct checkpoint in the Llama 3.1 family, aimed at strong reasoning and coding quality with a permissive license for research and customization. It is typically served on dedicated GPU clusters or via partners (cloud inference, on-prem) rather than a single vendor API.

Strong match

DeepSeek-V3

DeepSeek-V3 is a large-scale language model family noted for strong coding and math performance under open or research-friendly terms (verify the exact license for your deployment). Teams adopt it for cost-sensitive research, self-hosted inference, or comparison against frontier APIs.

Strong match

Claude Opus 4.8

Anthropic's current Opus-tier Claude model, documented for complex reasoning, coding, and multimodal enterprise workloads below the newer Fable tier.

Sarvam 30B

Sarvam 30B is a 30B-parameter Mixture-of-Experts chat and reasoning model from Sarvam AI, optimized for Indian languages, real-time conversation, high-throughput voice-agent pipelines, coding, and practical deployment. Sarvam documents 2.4B active parameters per token, 16T tokens of pre-training data, a 64K context window, Grouped Query Attention, Apache 2.0 open weights, and an OpenAI-compatible chat completions API.

Tools

LlamaIndex

Data framework for LLM applications focused on ingestion pipelines, indexing, retrieval, and query orchestration over private and enterprise content sources.

Strong match

OpenAI Playground

Web-based interface for experimenting with models from OpenAI, a provider of widely used frontier model APIs for text, vision, and audio, with strong developer tooling and broad ecosystem adoption across production AI applications.

Strong match

Ollama

Local model runtime for running and serving open LLMs on developer machines and private infrastructure, with simple pull/run workflows and API access.

Groq

GroqCloud offers very low-latency, high-throughput LLM inference using Groq’s LPU-style hardware, with OpenAI-compatible APIs for select open and partner models aimed at interactive and batch production workloads.

LangChain

Application framework for orchestrating LLM workflows, tool calling, retrieval, and agents across multiple providers in Python and TypeScript ecosystems.