Playbooks
Tutorials
Long-form guides optimized for engineers shipping GenAI features responsibly.
15 min read
Optimizing Golden-Set Design for RAG in Healthcare Applications
This tutorial covers the design of golden sets for ensuring RAG (Retrieval-Augmented Generation) faithfulness in healthcare applications. It requires an understanding of RAG principles and access to domain-specific datasets.
10 min read
Multimodal Prompts for Document QA in Legal Settings
Using multimodal prompts can improve document question answering (QA) in legal contexts. Prerequisites include access to relevant legal documents and a model capable of processing multimodal inputs.
11 min read
Reducing Hallucinations with Citation Constraints in Research Models
Implementing citation constraints can reduce hallucinations in research-oriented models. Prerequisites include a reliable database of citations and a model capable of handling constraints.
9 min read
Latency Budgets for Streaming Chat UX in Customer Support
Establishing latency budgets can enhance the user experience in customer support chat applications. Prerequisites include understanding user expectations and system capabilities.
12 min read
Embedding Drift Monitoring in Financial Services
Monitoring embedding drift is crucial for financial services to ensure model accuracy over time. Prerequisites include a data pipeline that captures embeddings and a monitoring framework.
10 min read
Shadow Traffic for Safe Model Rollouts in E-commerce
Implementing shadow traffic allows e-commerce platforms to test new models against live traffic without affecting user experience. Prerequisites include a robust logging mechanism and a dual model setup.
10 min read
Offline vs Online Evaluation Frequency
This tutorial explores the differences between offline and online evaluation methods for machine learning models, focusing on their respective benefits and drawbacks. Prerequisites include a basic understanding of machine learning evaluation metrics and experience with model deployment.
20 min read
Pgvector Index Tuning (HNSW vs IVF)
Learn how to tune pgvector indexes, comparing the HNSW and IVFFlat (IVF) index types, for optimal performance. Prerequisites include familiarity with PostgreSQL and vector databases.
11 min read
Multimodal Prompts for Document QA
Explore how to create effective multimodal prompts for document question answering (QA) systems. Prerequisites include an understanding of multimodal models and QA frameworks.
18 min read
SLI/SLO for Generative Endpoints
Establishing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for generative endpoints is crucial for maintaining quality and reliability. This tutorial outlines how to define and implement SLIs/SLOs effectively.
12 min read
Offline vs Online Eval Frequency
This tutorial discusses the trade-offs between offline and online evaluation frequencies for machine learning models, focusing on their impact on model performance and user experience.
18 min read
Planner–Executor Loops and Failure Recovery
This tutorial explains the planner-executor loop in AI systems and how to implement effective failure recovery strategies. Prerequisites include knowledge of AI planning algorithms and system design.
20 min read
Agent Memory: Scratchpad vs Vector Store
This tutorial compares scratchpad memory and vector store memory in AI agents, focusing on their use cases and performance characteristics. Prerequisites include a basic understanding of AI memory architectures.
15 min read
Runbooks When Quality Regresses Overnight
This tutorial outlines how to create effective runbooks to address overnight quality regressions in software systems. Prerequisites include familiarity with incident management and basic scripting skills.
16 min read
Canary Prompts for Regression Detection
Use canary prompts to detect regressions in language models. Prerequisites include familiarity with regression testing and LLM evaluation metrics.
22 min read
Prompt Injection Defenses in Multi-Tenant Apps
Develop strategies to protect multi-tenant applications from prompt injection attacks. Prerequisites include an understanding of security vulnerabilities and multi-tenant architecture.
18 min read
Observability: Traces for LLM + Tool Spans
Implement observability practices to trace interactions between large language models (LLMs) and external tools. Prerequisites include knowledge of observability tools and LLM architectures.
20 min read
Sandboxing Tools with Least Privilege
Implement sandboxing techniques that limit tool access to the minimum privileges required. Prerequisites include familiarity with security protocols and system architecture.
15 min read
Human-in-the-Loop for High-Stakes Actions
Integrate human oversight into automated systems to ensure accuracy and accountability in critical scenarios. Prerequisites include an understanding of automation frameworks and risk management principles.
20 min read
Graph RAG for Entity-Heavy Domains
Explore the use of graph-based Retrieval-Augmented Generation (Graph RAG) for domains with complex entities. Prerequisites include knowledge of graph databases and RAG techniques.
10 min read
Hybrid Search: BM25 + Dense Re-Ranking
This tutorial explores the integration of BM25 and dense re-ranking techniques to enhance search accuracy. Prerequisites include familiarity with information retrieval concepts and basic machine learning.
6 min read
PII Handling in Retrieval Pipelines
Effective handling of Personally Identifiable Information (PII) is essential in retrieval systems to ensure compliance and user trust. Prerequisites include knowledge of data privacy regulations and retrieval system architecture.
6 min read
Cost Controls: Batching vs Streaming Tokens
Understanding the trade-offs between batching and streaming token processing can help you control costs in NLP applications. Prerequisites include familiarity with tokenization and processing pipelines.
18 min read
Golden-Set Design for RAG Faithfulness
Understand how to design a golden set for evaluating the faithfulness of Retrieval-Augmented Generation (RAG) models. Prerequisites include familiarity with RAG systems and evaluation metrics.