GENAIWIKI

Vector search

Search GenAIWiki

Query the full knowledge graph. Results are ranked by semantic similarity across all six libraries.

Search results for “LLM benchmarks evaluation”

Prompts

14

Experiment Design for A/B LLM - Advanced

In-depth guide for designing A/B tests specifically for large language models.

Best match

A/B Testing Experiment Design

A structured template to design A/B tests for LLM applications, ensuring consistency in experiment setup.

Best match

Dataset Card Draft for LLM Training (Advanced)

An advanced template for creating detailed dataset cards focusing on comprehensive metadata for LLM training datasets.

Dataset Card Draft for LLM Training

Specific guidelines for creating dataset cards for LLM training datasets.

Technical Workshop Lesson Plan

An organized lesson plan template for conducting technical workshops on LLMs and their applications.

ML Interview Evaluation Framework

A structured rubric for evaluating candidates in machine learning roles.

Experiment Design for A/B LLM

A structured approach to designing experiments for A/B testing in language models.

Dataset Card Draft

A standardized template for documenting dataset characteristics, usage, and limitations for LLM training.

ML Role Interview Rubric

A structured rubric designed for evaluating candidates in machine learning roles.

ML Role Interview Evaluation Criteria

A detailed rubric for evaluating candidates for machine learning roles, focusing on technical and interpersonal skills.

Data Pipeline Debugging Protocol - Comprehensive

A comprehensive protocol for systematically diagnosing and debugging failures in data pipelines.

Machine Learning Interview Rubric for Candidates

A structured rubric for evaluating candidates in machine learning interviews, focusing on technical and soft skills.

Interview Rubric for ML Roles

A prompt for generating a structured rubric to evaluate candidates in machine learning interviews.

Competitive Feature Matrix (Duplicate)

A repeat entry for the competitive feature matrix, emphasizing comparative analysis of product features.


Tutorials

14

Observability: Traces for LLM + Tool Spans

Implementing observability practices to trace interactions between large language models (LLMs) and external tools. Prerequisites include knowledge of observability tools and LLM architectures.

Best match

Metadata Filters and ACL-Aware Retrieval in Legal Document Management

This tutorial outlines the implementation of metadata filters and Access Control List (ACL)-aware retrieval systems in legal document management applications. Prerequisites include knowledge of legal data structures and basic programming skills.

Best match

Hybrid Search: BM25 + Dense Re-Ranking for Academic Research

This tutorial explores the integration of BM25 and dense re-ranking for enhancing academic search engines. Familiarity with information retrieval concepts is required.

Enhancing Observability with Traces for LLM and Tool Spans in Data Pipelines

This tutorial focuses on enhancing observability in data pipelines that utilize large language models (LLMs) by implementing tracing for both LLM and tool spans. Prerequisites include familiarity with observability concepts and experience with LLMs.

Hybrid Search: BM25 + Dense Re-Ranking

This tutorial explores the integration of BM25 and dense re-ranking techniques to enhance search accuracy. Prerequisites include familiarity with information retrieval concepts and basic machine learning.

Multimodal Prompts for Document QA in Legal Settings

How multimodal prompts can improve document question answering (QA) in legal contexts. Prerequisites include access to relevant legal documents and a model capable of processing multimodal inputs.

Canary Prompts for Regression Detection

Utilizing canary prompts to detect regressions in language models. Prerequisites include familiarity with regression testing and LLM evaluation metrics.

Chunking Strategies for Legal PDFs: Improving Document Retrieval

This tutorial focuses on optimizing chunking strategies for legal documents to enhance retrieval accuracy. Prerequisites include familiarity with document processing and retrieval systems.

Offline vs Online Evaluation Frequency

This tutorial explores the differences between offline and online evaluation methods for machine learning models, focusing on their respective benefits and drawbacks. Prerequisites include a basic understanding of machine learning evaluation metrics and experience with model deployment.

SLI/SLO for Generative Endpoints

Establishing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for generative endpoints is crucial for maintaining quality and reliability. This tutorial outlines how to define and implement SLIs/SLOs effectively.

Golden-Set Design for RAG Faithfulness

Understand how to design a golden set for evaluating the faithfulness of Retrieval-Augmented Generation (RAG) models. Prerequisites include familiarity with RAG systems and evaluation metrics.

Implementing SLI/SLO for Generative Endpoints

This tutorial outlines how to define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for generative endpoints, ensuring high availability and performance. Prerequisites include understanding of SLIs, SLOs, and basic API concepts.

Understanding Offline vs Online Evaluation Frequency

This tutorial explores the trade-offs between offline and online evaluation methods for machine learning models, focusing on their impact on performance metrics and deployment strategies. Prerequisites include familiarity with basic ML concepts and evaluation metrics.

Evaluating Tool-Calling Reliability Under Load

This tutorial provides a framework for assessing the reliability of tool calls in high-load scenarios, ensuring system robustness.

Models

3

Tools

5