Fuzzing the GenAI Era - Leonard Tang, Haize Labs
AI Engineer· 2025-08-21 16:26
AI Evaluation Challenges
- Traditional evaluation methods are inadequate for assessing the brittleness of GenAI applications [1]
- The industry faces a "last mile" problem in AI: ensuring reliability, quality, and alignment for any given application [1]
- Standard evaluation methods often fail to uncover corner cases and unexpected user inputs [1]

Haize Labs' Approach
- Haize Labs simulates the last mile by bombarding AI applications with unexpected user inputs to uncover corner cases at scale [1]
- The approach rests on two pieces: a quality metric (defining criteria for good/bad responses and automating that judgment) and stimuli generation (creating diverse inputs to discover bugs); a minimal sketch of this loop follows this summary [1]
- Haize Labs uses agents as judges to scale evaluation, weighing factors such as accuracy vs. latency [1]
- Haize Labs employs RL-tuned judges to scale evaluation further [1]
- Haize Labs treats simulation as a form of prompt optimization [1]

Case Studies
- Haize Labs has worked with a major European bank's AI app [1]
- Haize Labs has worked with a Fortune 500 bank's voice agents [1]
- Haize Labs scales voice agent evaluations [1]
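The judge-plus-stimuli loop described above can be pictured with a short sketch. This is not Haize Labs' code; it is a minimal illustration assuming the OpenAI Python SDK, and the prompts and model names are placeholders.

```python
# Minimal sketch of the two ingredients above: stimuli generation (unexpected
# user inputs) and a quality metric enforced by an LLM judge. Illustration
# only -- model names and prompts are assumptions, not Haize Labs' tooling.
from openai import OpenAI

client = OpenAI()

def generate_stimuli(task: str, n: int = 5) -> list[str]:
    """Ask a model to invent unusual, adversarial user inputs for the target task."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{
            "role": "user",
            "content": f"Write {n} unusual, adversarial user messages for an "
                       f"assistant whose job is: {task}. One per line.",
        }],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def judge(prompt: str, answer: str, criteria: str) -> bool:
    """Quality metric: a (possibly stronger) model grades the answer pass/fail."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # the judge can be stronger than the app under test
        messages=[{
            "role": "user",
            "content": f"Criteria: {criteria}\nUser: {prompt}\nAnswer: {answer}\n"
                       "Reply PASS or FAIL only.",
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

def fuzz(app, task: str, criteria: str) -> list[str]:
    """Run the app under test against generated stimuli; return the failing inputs."""
    return [s for s in generate_stimuli(task) if not judge(s, app(s), criteria)]
```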
The Future of Evals - Ankur Goyal, Braintrust
AI Engineer· 2025-08-09 15:12
Product & Technology
- Braintrust introduces "Loop," an agent integrated into its platform that automates improvements to prompts, datasets, and scorers for AI model evaluation; a generic sketch of the prompt/dataset/scorer triple follows this summary [4][5][7]
- Loop leverages advancements in frontier models, with Claude 4 noted as roughly 6x better at prompt engineering than previous models [6]
- Loop lets users compare suggested edits to data and prompts side by side within the UI, maintaining data visibility [9][10]
- Loop supports various models, including OpenAI, Gemini, and custom LLMs [9]

User Engagement & Adoption
- The average organization using Braintrust runs approximately 13 evals per day [3]
- Some advanced customers run over 3,000 evals daily and spend more than two hours per day in the product [3]
- Braintrust encourages users to try Loop and provide feedback [12]

Future Vision
- Braintrust anticipates a revolution in AI model evaluation, driven by advancements in frontier models [11]
- The company is focused on incorporating these advancements into its platform [11]

Hiring
- Braintrust is actively hiring for UI, AI, and infrastructure roles [12]
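For readers unfamiliar with the prompt/dataset/scorer vocabulary used above, here is a generic, self-contained sketch of what a scorer and an eval run look like. It is not the Braintrust SDK; every name in it is hypothetical.

```python
# Generic illustration of the prompt/dataset/scorer triple that an agent like
# Loop operates on. Not the Braintrust SDK; names are made up.
from typing import Callable

Scorer = Callable[[str, str], float]  # (model_output, expected) -> score in [0, 1]

def contains_expected(output: str, expected: str) -> float:
    """A deliberately simple scorer: 1.0 if the expected answer appears verbatim."""
    return 1.0 if expected.lower() in output.lower() else 0.0

def run_eval(task: Callable[[str], str], dataset: list[dict], scorer: Scorer) -> float:
    """Run a task over a dataset and return the mean score."""
    scores = [scorer(task(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)

# Comparing a baseline prompt against a suggested edit on the same dataset:
dataset = [{"input": "2 + 2 = ?", "expected": "4"}]
baseline = lambda q: "I think it's four."   # stand-in for model(prompt_v1, q) -> fails
candidate = lambda q: "The answer is 4."    # stand-in for model(prompt_v2, q) -> passes
print(run_eval(baseline, dataset, contains_expected),   # 0.0
      run_eval(candidate, dataset, contains_expected))  # 1.0
```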
Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
AI Engineer· 2025-08-03 04:34
Core Problem & Solution
- The traditional software development lifecycle is insufficient for AI applications: models are non-deterministic, so a data-science mindset and continuous experimentation are required [3]
- The key is to reverse-engineer metrics from real-world scenarios, focusing on product experience and business outcomes rather than abstract data science metrics [6]
- Build evaluations (evals) at the beginning of the process, not at the end, to identify failures and areas for improvement early [14]
- Continuous improvement of both the evals and the solution is necessary to reach a baseline benchmark for optimization [19]

Evaluation Methodology
- Evaluations should mimic specific user questions and criteria relevant to the solution's end goal [7]
- Use LLMs to generate evaluations, considering different user personas and expected answers [9][11]
- Dig into the details of each evaluation failure to understand the root cause, whether it lies in the test definition or in the solution's performance [15]
- Experimentation involves changing models, logic, prompts, or data, and continuously re-running evaluations to catch regressions [16][18]

Industry-Specific Examples
- For customer support bots, measure the rate of escalation to human support as a key metric [5]
- For text-to-SQL or text-to-graph-database applications, create a mock database with known data to validate expected results (a sketch of this tactic follows this summary) [22]
- For call center conversation classifiers, use simple matching to determine whether the correct rubric is applied [23]

Key Takeaways
- Evaluate AI applications the way users actually use them, avoiding abstract metrics [24]
- Frequent evaluations enable rapid progress and reduce regressions [25]
- Well-defined evaluations lead to explainable AI, providing insight into how the solution works and where its limits are [26]
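As one concrete instance of the text-to-SQL tactic mentioned above, a minimal sketch using an in-memory SQLite database is shown below; the schema, query, and expected rows are invented for illustration.

```python
# Sketch of the text-to-SQL evaluation tactic: seed a mock database with known
# rows, execute the SQL the model produced, and compare against the result you
# can compute by hand.
import sqlite3

def make_mock_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "EU", 100.0), (2, "EU", 50.0), (3, "US", 75.0)],
    )
    return conn

def eval_text_to_sql(generated_sql: str, expected_rows: list[tuple]) -> bool:
    """Pass only if the generated SQL returns exactly the known-correct rows."""
    conn = make_mock_db()
    try:
        return conn.execute(generated_sql).fetchall() == expected_rows
    except sqlite3.Error:
        return False  # invalid SQL counts as a failure, not a crash

# e.g. the model was asked "total order amount per region, ordered by region"
candidate = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
assert eval_text_to_sql(candidate, [("EU", 150.0), ("US", 75.0)])
```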
The 2025 AI Engineering Report — Barr Yaron, Amplify
AI Engineer· 2025-08-01 22:51
AI Engineering Landscape
- The AI engineering community is broad, technical, and growing, with the "AI Engineer" title expected to gain more ground [5]
- Many seasoned software developers are AI newcomers, with nearly half of those with 10+ years of experience having worked with AI for three years or less [7]

LLM Usage and Customization
- Over half of respondents are using LLMs for both internal and external use cases, with OpenAI models dominating external, customer-facing applications [8]
- LLM users are leveraging them across multiple use cases, with 94% using them for at least two and 82% for at least three [9]
- Retrieval-Augmented Generation (RAG) is the most popular customization method, with 70% of respondents using it [10]
- Parameter-efficient fine-tuning methods like LoRA/Q-LoRA are strongly preferred, mentioned by 40% of fine-tuners [12]

Model and Prompt Management
- Over 50% of respondents are updating their models at least monthly, with 17% doing so weekly [14]
- 70% of respondents are updating prompts at least monthly, and 10% are doing so daily [14]
- A significant 31% of respondents lack any system for managing their prompts [15]

Multimodal AI and Agents
- Image, video, and audio usage lag text usage significantly, indicating a "multimodal production gap" [16][17]
- Audio has the highest intent to adopt among those not currently using it, with 37% planning to eventually adopt audio [18]
- While 80% of respondents say LLMs are working well, less than 20% say the same about agents [20]

Monitoring and Evaluation
- Most respondents use multiple methods to monitor their AI systems, with 60% using standard observability and over 50% relying on offline evaluation [22]
- Human review remains the most popular method for evaluating model and system accuracy and quality [23]
- 65% of respondents are using a dedicated vector database [24]

Industry Outlook
- The mean guess for the percentage of the US Gen Z population that will have AI girlfriends/boyfriends is 26% [27]
- Evaluation is the number one most painful thing about AI engineering today [28]
Scaling Enterprise-Grade RAG: Lessons from the Legal Frontier - Calvin Qi (Harvey), Chang She (LanceDB)
AI Engineer· 2025-07-29 16:00
All right. Thank you, everyone. We're excited to be here, and thank you for coming to our talk. My name is Chang; I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into those pandas data frames. And I'm Calvin. I lead one of the teams at Harvey AI working on tough RAG problems across mass ...
Building Applications with AI Agents — Michael Albada, Microsoft
AI Engineer· 2025-07-24 15:00
Agentic Development Landscape
- Adoption of agentic technology is rapidly increasing, with a 254% increase in companies self-identifying as agentic over the last three years, based on Y Combinator data [5]
- Agentic systems are complex: initial prototypes may reach around 70% accuracy, but approaching perfection is difficult due to the long tail of complex scenarios [6][7]
- The industry defines an agent as an entity that can reason, act, communicate, and adapt to solve tasks, treating the foundation model as a base onto which components are added to enhance performance [8]
- Agency should not be the goal in itself but a tool for solving problems; any increase in agency must maintain a high level of effectiveness [9][11][12]

Tool Use and Orchestration
- Exposing tools and functionalities to language models lets agents invoke functions via APIs, but requires careful consideration of which functionalities to expose [14]
- The industry advises against a one-to-one mapping between APIs and tools, recommending that tools be grouped logically to reduce semantic collision and improve accuracy (a small sketch of this grouping follows this summary) [17][18]
- Simple workflow patterns, such as single chains, are recommended for orchestration to improve measurability, reduce costs, and enhance reliability [19][20]
- For complex scenarios, the industry suggests moving to more agentic patterns and potentially fine-tuning the model [22][23]

Multi-Agent Systems and Evaluation
- Multi-agent systems can help scale the number of tools by breaking them into semantically similar groups and routing tasks to the appropriate agents [24][25]
- The industry recommends investing more in evaluation to manage the many hyperparameters involved in building agentic systems [27][28]
- AI architects and engineers should take ownership of defining the inputs and outputs of agents to accelerate team progress [29][30]
- Tools such as IntellAgent, Microsoft's PyRIT, and Label Studio can aid in generating synthetic inputs, red-teaming agents, and building evaluation sets [33][34][35]

Observability and Common Pitfalls
- The industry emphasizes observability, using tools such as OpenTelemetry to understand failure modes and improve systems [38]
- Common pitfalls include insufficient evaluation, inadequate tool descriptions, semantic overlap between tools, and excessive complexity [39][40]
- Safety should be designed in at every layer of agentic systems, including tripwires and detectors [41][42]
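A minimal sketch of the tool-grouping advice: rather than one tool per API endpoint, related operations are folded into a single tool with an action parameter. The tools, groups, and router below are invented for illustration; in a real agent the model would select them via function calling.

```python
# Sketch of "group tools, don't map APIs 1:1": related endpoints become one
# tool with an `action` argument, and a router picks the group first.
from typing import Callable

def calendar_tool(action: str, **kwargs) -> str:
    """One tool covering several calendar APIs (create/list/cancel events)."""
    handlers = {
        "create": lambda: f"created event {kwargs.get('title')!r}",
        "list":   lambda: "3 events this week",
        "cancel": lambda: f"cancelled event {kwargs.get('event_id')}",
    }
    return handlers[action]()

def email_tool(action: str, **kwargs) -> str:
    """One tool covering the send/search email APIs."""
    handlers = {
        "send":   lambda: f"sent mail to {kwargs.get('to')}",
        "search": lambda: f"2 results for {kwargs.get('query')!r}",
    }
    return handlers[action]()

TOOL_GROUPS: dict[str, Callable[..., str]] = {
    "calendar": calendar_tool,  # semantically coherent group 1
    "email": email_tool,        # semantically coherent group 2
}

def route(group: str, action: str, **kwargs) -> str:
    """In a real agent the model chooses `group` and `action` via function calling."""
    return TOOL_GROUPS[group](action, **kwargs)

print(route("calendar", "create", title="design review"))
```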
How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit
AI Engineer· 2025-07-23 15:51
Intuit's Use of LLMs in TurboTax
- Intuit processed 44 million tax returns for tax year 2023, aiming to give users high confidence in their filings and ensure they receive the best deductions [2]
- Intuit's GenAI experiences are built on GenOS, a proprietary generative AI operating system designed to address the limitations of out-of-the-box tooling, especially around regulatory compliance, safety, and security in the tax domain [4][5]
- Intuit uses Claude (Anthropic) for static queries related to tax refunds and OpenAI's GPT-4 for dynamic question answering, such as user-specific tax inquiries [9][10][12]
- Intuit is one of the biggest users of Claude, with a multi-million-dollar contract [9][10]

Development and Evaluation
- Intuit uses a phased evaluation system, starting with manual evaluations by tax analysts and transitioning to automated evaluations with an LLM as judge [16][17]
- Tax analysts also serve as prompt engineers, leveraging their domain expertise for accurate evaluations and prompt design [16][17]
- Key evaluation pillars are accuracy, relevancy, and coherence, with a strong focus on tax accuracy [20][24]
- Intuit uses Amazon SageMaker Ground Truth to create golden datasets for evaluations [22]

Challenges and Learnings
- LLM contracts are expensive; long-term contracts are slightly cheaper but create vendor lock-in [25][26]
- LLMs have higher latency than backend services (3-10 seconds), which can worsen during peak tax season [27][28]
- Intuit employs safety guardrails and ML models to prevent hallucinated numbers in LLM responses, ensuring data accuracy (a sketch of such a numeric check follows this summary) [40][41]
- Graph RAG outperforms regular RAG in providing personalized, helpful answers [42][43]
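A sketch of the kind of numeric guardrail described above (not Intuit's implementation; the field names and formats are assumptions): every number the model mentions must already exist in the trusted tax data.

```python
# Numeric guardrail sketch: before showing an answer about a tax return,
# verify every number in the LLM's text appears in the computed tax data.
import re

def extract_numbers(text: str) -> set[str]:
    """Pull out dollar amounts / bare numbers, normalized (no $ or commas)."""
    return {m.replace(",", "").lstrip("$") for m in re.findall(r"\$?[\d,]+(?:\.\d+)?", text)}

def numbers_are_grounded(answer: str, source_values: dict[str, float]) -> bool:
    """Every number the model mentions must exist in the trusted tax data."""
    allowed = ({f"{v:g}" for v in source_values.values()}
               | {f"{v:.2f}" for v in source_values.values()})
    return extract_numbers(answer).issubset(allowed)

tax_data = {"federal_refund": 1250.0, "state_refund": 300.0}  # hypothetical fields
good = "Your federal refund is $1,250.00 and your state refund is $300.00."
bad = "Your federal refund is $1,500.00."
print(numbers_are_grounded(good, tax_data))  # True
print(numbers_are_grounded(bad, tax_data))   # False -> block or regenerate
```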
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
AI Engineer· 2025-06-27 10:27
AI Agent Evaluation & Observability
- The industry emphasizes the necessity of observability in AI development, particularly for evaluation-driven development [1]
- AI trustworthiness is a significant concern, highlighting the need for robust evaluation methods [1]
- Detecting problems in AI is challenging due to its non-deterministic nature, which makes traditional unit testing difficult [1]

AI-Driven Evaluation
- The industry suggests using AI to evaluate AI, leveraging its ability to understand and identify issues in AI systems [1]
- LLMs can score the performance of other LLMs; the recommendation is to use a better (potentially more expensive or custom-trained) LLM for evaluation than the one used in the primary application [2]
- Galileo offers a custom-trained small language model (SLM) designed for effective AI evaluations [2]

Implementation & Metrics
- Evaluations should be integrated from the beginning of AI application development, including prompt engineering and model selection [2]
- Granularity is crucial: evaluate each step of the AI workflow to identify failure points [2]
- Key agent metrics include action completion (did the agent complete the task) and action advancement (did it move toward the goal); a sketch of both metrics follows this summary [2]

Continuous Improvement & Human Feedback
- AI can provide insights and suggestions for improving agent performance based on evaluation data [3]
- Human feedback is essential to validate and refine AI-generated metrics, ensuring accuracy and continuous learning [4]
- Real-time prevention and alerting are necessary to address rogue AI agents and prevent issues in production [8]
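To make the two agent metrics above concrete, here is a small sketch computed over a recorded trace. The trace format and rule-based checks are assumptions for illustration; in practice each step would be scored by an evaluator model.

```python
# Sketch of "action completion" (did the agent finish the task) and
# "action advancement" (what fraction of steps moved toward the goal).
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    succeeded: bool
    progressed: bool  # did this step advance toward the user's goal?

def action_advancement(trace: list[Step]) -> float:
    """Fraction of steps that made progress: the granular, per-step signal."""
    return sum(s.progressed for s in trace) / len(trace) if trace else 0.0

def action_completion(trace: list[Step], goal_reached: bool) -> bool:
    """End-to-end signal: complete only if the goal was reached and the useful steps succeeded."""
    return goal_reached and all(s.succeeded for s in trace if s.progressed)

trace = [
    Step("search_flights", succeeded=True, progressed=True),
    Step("search_flights", succeeded=True, progressed=False),  # redundant retry
    Step("book_flight", succeeded=True, progressed=True),
]
print(round(action_advancement(trace), 2))           # 0.67: one wasted step
print(action_completion(trace, goal_reached=True))   # True
```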