Vector search
Search GenAIWiki
Query the full knowledge graph. Results are ranked by semantic similarity across all six libraries.
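Ranking by semantic similarity typically means embedding the query and every entry into a shared vector space and sorting by cosine similarity. A minimal sketch of that ranking step, using hand-made toy vectors rather than a real embedding model (GenAIWiki's actual pipeline is not shown here):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; a real system would produce these with a sentence-embedding model.
query_vec = [0.9, 0.1, 0.3]
entries = {
    "A/B Testing Experiment Design": [0.8, 0.2, 0.4],
    "Dataset Card Draft":            [0.1, 0.9, 0.2],
}

# Sort entries by similarity to the query, most similar first.
ranked = sorted(entries, key=lambda name: cosine(query_vec, entries[name]), reverse=True)
```

The same score drives the ordering inside each section below: higher cosine similarity, higher placement.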
Search results for “LLM benchmarks evaluation”
Prompts (14)
Experiment Design for A/B LLM - Advanced
In-depth guide for designing A/B tests specifically for large language models.
A/B Testing Experiment Design
A structured template to design A/B tests for LLM applications, ensuring consistency in experiment setup.
Dataset Card Draft for LLM Training (Advanced)
An advanced template for creating detailed dataset cards focusing on comprehensive metadata for LLM training datasets.
Dataset Card Draft for LLM Training
Specific guidelines for creating dataset cards for LLM training datasets.
Technical Workshop Lesson Plan
An organized lesson plan template for conducting technical workshops on LLMs and their applications.
ML Interview Evaluation Framework
A structured rubric for evaluating candidates in machine learning roles.
Experiment Design for A/B LLM
A structured approach to designing experiments for A/B testing in language models.
Dataset Card Draft
A standardized template for documenting dataset characteristics, usage, and limitations for LLM training.
ML Role Interview Rubric
A structured rubric designed for evaluating candidates in machine learning roles.
ML Role Interview Evaluation Criteria
A detailed rubric for evaluating candidates for machine learning roles, focusing on technical and interpersonal skills.
Data Pipeline Debugging Protocol - Comprehensive
Evaluates candidates for machine learning positions based on technical and soft skills.
Machine Learning Interview Rubric for Candidates
A structured rubric for evaluating candidates in machine learning interviews, focusing on technical and soft skills.
Interview Rubric for ML Roles
Create a structured rubric for evaluating candidates in machine learning interviews.
Competitive Feature Matrix (Duplicate)
A repeat entry for the competitive feature matrix, emphasizing comparative analysis of product features.
Not finding exactly what you need? Ask GenAIWiki →
Tutorials (14)
Observability: Traces for LLM + Tool Spans
Implementing observability practices to trace interactions between large language models (LLMs) and external tools. Prerequisites include knowledge of observability tools and LLM architectures.
Metadata Filters and ACL-Aware Retrieval in Legal Document Management
This tutorial outlines the implementation of metadata filters and Access Control List (ACL)-aware retrieval systems in legal document management applications. Prerequisites include knowledge of legal data structures and basic programming skills.
Hybrid Search: BM25 + Dense Re-Ranking for Academic Research
This tutorial explores the integration of BM25 and dense re-ranking for enhancing academic search engines. Familiarity with information retrieval concepts is required.
Enhancing Observability with Traces for LLM and Tool Spans in Data Pipelines
This tutorial focuses on enhancing observability in data pipelines that utilize large language models (LLMs) by implementing tracing for both LLM and tool spans. Prerequisites include familiarity with observability concepts and experience with LLMs.
Hybrid Search: BM25 + Dense Re-Ranking
This tutorial explores the integration of BM25 and dense re-ranking techniques to enhance search accuracy. Prerequisites include familiarity with information retrieval concepts and basic machine learning.
Multimodal Prompts for Document QA in Legal Settings
Shows how multimodal prompts can improve document question answering (QA) in legal contexts. Prerequisites include access to relevant legal documents and a model capable of processing multimodal inputs.
Canary Prompts for Regression Detection
Utilizing canary prompts to detect regressions in language models. Prerequisites include familiarity with regression testing and LLM evaluation metrics.
Chunking Strategies for Legal PDFs: Improving Document Retrieval
This tutorial focuses on optimizing chunking strategies for legal documents to enhance retrieval accuracy. Prerequisites include familiarity with document processing and retrieval systems.
Offline vs Online Evaluation Frequency
This tutorial explores the differences between offline and online evaluation methods for machine learning models, focusing on their respective benefits and drawbacks. Prerequisites include a basic understanding of machine learning evaluation metrics and experience with model deployment.
SLI/SLO for Generative Endpoints
Establishing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for generative endpoints is crucial for maintaining quality and reliability. This tutorial outlines how to define and implement SLIs/SLOs effectively.
Golden-Set Design for RAG Faithfulness
Understand how to design a golden set for evaluating the faithfulness of Retrieval-Augmented Generation (RAG) models. Prerequisites include familiarity with RAG systems and evaluation metrics.
Implementing SLI/SLO for Generative Endpoints
This tutorial outlines how to define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for generative endpoints, ensuring high availability and performance. Prerequisites include understanding of SLIs, SLOs, and basic API concepts.
Understanding Offline vs Online Evaluation Frequency
This tutorial explores the trade-offs between offline and online evaluation methods for machine learning models, focusing on their impact on performance metrics and deployment strategies. Prerequisites include familiarity with basic ML concepts and evaluation metrics.
Evaluating Tool-Calling Reliability Under Load
This tutorial provides a framework for assessing the reliability of tool calls in high-load scenarios, ensuring system robustness.
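The hybrid-search tutorials above describe a two-stage pattern: a cheap lexical pass (BM25) retrieves a small candidate set, then a dense embedding model re-ranks only those candidates. A minimal sketch of that flow, with a term-overlap score standing in for real BM25 and a tiny term-count vector standing in for an embedding model (corpus, vocabulary, and scoring are illustrative, not from the tutorials themselves):

```python
import math

# Toy corpus standing in for the wiki entries.
DOCS = {
    "d1": "bm25 ranking for lexical retrieval",
    "d2": "dense embeddings for semantic re-ranking",
    "d3": "tracing llm spans for observability",
}

def lexical_score(query, text):
    # Stage 1 stand-in for BM25: how many query terms appear in the document.
    terms = set(query.split())
    return sum(1 for word in text.split() if word in terms)

def toy_embed(text):
    # Stand-in for a sentence-embedding model: term counts over a tiny vocabulary.
    # The small offset keeps vectors nonzero so cosine is always defined.
    vocab = ["ranking", "semantic", "observability"]
    return [text.split().count(word) + 0.01 for word in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(query, k=2):
    # Stage 1: cheap lexical retrieval of k candidates.
    candidates = sorted(DOCS, key=lambda d: lexical_score(query, DOCS[d]), reverse=True)[:k]
    # Stage 2: re-rank only those candidates by dense similarity to the query.
    q = toy_embed(query)
    return sorted(candidates, key=lambda d: cosine(q, toy_embed(DOCS[d])), reverse=True)
```

The point of the split is cost: BM25 is fast enough to scan the whole corpus, while the dense model only ever scores the k survivors.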
Models (3)
LLaMA 3 70B
LLaMA 3 70B features 70 billion parameters and a context window of 32k tokens, optimized for high-performance text generation and understanding across diverse tasks.
LLaMA 3 8B
LLaMA 3 8B is a compact model with 8 billion parameters, designed for efficient text generation and understanding with a context window of 8k tokens.
Mistral Large
Mistral Large supports up to 16k tokens with a response latency of 150ms, targeting enterprise-level applications and complex document understanding.
Tools (5)
LlamaIndex
Data framework for LLM applications focused on ingestion pipelines, indexing, retrieval, and query orchestration over private and enterprise content sources.
OpenAI Playground
Provider of widely used frontier model APIs for text, vision, and audio, with strong developer tooling and broad ecosystem adoption across production AI applications.
Ollama
Local model runtime for running and serving open LLMs on developer machines and private infrastructure, with simple pull/run workflows and API access.
Groq
GroqCloud offers very low-latency, high-throughput LLM inference using Groq’s LPU-style hardware, with OpenAI-compatible APIs for select open and partner models aimed at interactive and batch production workloads.
LangChain
Application framework for orchestrating LLM workflows, tool calling, retrieval, and agents across multiple providers in Python and TypeScript ecosystems.