Evaluation
Building Applications with AI Agents — Michael Albada, Microsoft
AI Engineer · 2025-07-24 15:00
Agentic Development Landscape
- The adoption of agentic technology is rapidly increasing, with a 254% increase in companies self-identifying as agentic over the last three years, based on Y Combinator data [5]
- Agentic systems are complex: initial prototypes may reach around 70% accuracy, but closing the remaining gap is difficult due to the long tail of complex scenarios [6][7]
- An agent is defined as an entity that can reason, act, communicate, and adapt to solve tasks; the foundation model is treated as a base onto which components are added to improve performance [8]
- Agency should not be the goal in itself but a tool for solving problems; any increase in agency must preserve a high level of effectiveness [9][11][12]

Tool Use and Orchestration
- Exposing tools and functionalities to language models lets agents invoke functions via APIs, but requires careful consideration of which functionalities to expose [14]
- Avoid a one-to-one mapping between APIs and tools; group tools logically to reduce semantic collision and improve accuracy [17][18]
- Simple workflow patterns, such as single chains, are recommended for orchestration because they improve measurability, reduce costs, and enhance reliability [19][20]
- For complex scenarios, consider moving to more agentic patterns and potentially fine-tuning the model [22][23]

Multi-Agent Systems and Evaluation
- Multi-agent systems help scale the number of tools by breaking them into semantically similar groups and routing tasks to the appropriate agent [24][25]
- Invest heavily in evaluation to manage the many hyperparameters involved in building agentic systems [27][28]
- AI architects and engineers should own the definition of agent inputs and outputs to accelerate team progress [29][30]
- Tools like IntellAgent, Microsoft's PyRIT, and Label Studio can aid in generating synthetic inputs, red-teaming agents, and building evaluation sets [33][34][35]

Observability and Common Pitfalls
- Observability, using tools like OpenTelemetry, is essential for understanding failure modes and improving systems [38]
- Common pitfalls include insufficient evaluation, inadequate tool descriptions, semantic overlap between tools, and excessive complexity [39][40]
- Design for safety at every layer of an agentic system, including building tripwires and detectors [41][42]
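The grouping advice above can be sketched as one consolidated tool with an explicit action parameter, rather than one tool per API endpoint. The calendar tool and its schema below are purely illustrative, not an example from the talk:

```python
from typing import Any

# One-to-one mapping (discouraged): create_event, update_event, delete_event,
# and list_events as four separate tools competing for the model's attention.
# Grouped alternative: a single "calendar" tool with an action enum, so
# semantically similar operations don't collide during tool selection.
CALENDAR_TOOL = {
    "name": "calendar",
    "description": "Create, update, delete, or list calendar events.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string",
                       "enum": ["create", "update", "delete", "list"]},
            "event": {"type": "object"},
        },
        "required": ["action"],
    },
}

def dispatch_calendar(args: dict[str, Any]) -> str:
    """Route the grouped tool call to the underlying API operations."""
    handlers = {
        "create": lambda e: f"created {e.get('title', 'event')}",
        "update": lambda e: f"updated {e.get('title', 'event')}",
        "delete": lambda e: f"deleted {e.get('title', 'event')}",
        "list":   lambda e: "listing events",
    }
    return handlers[args["action"]](args.get("event", {}))
```

The model sees one well-described tool instead of four near-duplicates, and the dispatcher fans out to the real endpoints behind it.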
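The step-level observability described above can be illustrated with a minimal pure-Python span recorder; a production system would emit real spans via OpenTelemetry's tracer API instead. All names in this sketch are hypothetical:

```python
import time
from contextlib import contextmanager

# In-memory trace log; a real system would export spans to a collector.
SPANS: list[dict] = []

@contextmanager
def span(name: str, **attrs):
    """Record one step of the agent loop, including errors and duration."""
    start = time.perf_counter()
    record = {"name": name, "attrs": attrs, "error": None}
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)  # capture the failure mode for analysis
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        SPANS.append(record)

# Usage: wrap every step so failures are attributable to a specific
# tool call or model call rather than the run as a whole.
with span("agent.run", task="summarize"):
    with span("agent.tool_call", tool="search"):
        results = ["doc1", "doc2"]
    with span("agent.llm_call", model="example-model"):
        answer = f"summary of {len(results)} docs"
```

Inner spans close first, so the recorded order mirrors the actual execution of each step.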
How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit
AI Engineer · 2025-07-23 15:51
Intuit's Use of LLMs in TurboTax
- Intuit processed 44 million tax returns for tax year 2023, aiming to give users high confidence in their filings and ensure they receive the best deductions [2]
- Intuit's GenAI experiences are built on GenOS, a proprietary generative-AI platform designed to address the limitations of out-of-the-box tooling, especially around regulatory compliance, safety, and security in the tax domain [4][5]
- Intuit uses Anthropic's Claude for static queries related to tax refunds and OpenAI's GPT-4 for dynamic question answering, such as user-specific tax inquiries [9][10][12]
- Intuit is one of the biggest users of Claude, with a multi-million-dollar contract [9][10]

Development and Evaluation
- Intuit uses a phased evaluation system, starting with manual evaluations by tax analysts and transitioning to automated evaluations using an LLM as a judge [16][17]
- Tax analysts also serve as prompt engineers, leveraging their domain expertise for accurate evaluations and prompt design [16][17]
- Key evaluation pillars are accuracy, relevancy, and coherence, with a strong focus on tax accuracy [20][24]
- Intuit uses Amazon SageMaker Ground Truth to create golden datasets for evaluations [22]

Challenges and Learnings
- LLM contracts are expensive; long-term contracts are slightly cheaper but create vendor lock-in [25][26]
- LLMs have higher latency than typical backend services (3-10 seconds), which can be exacerbated during peak tax season [27][28]
- Intuit employs safety guardrails and ML models to prevent hallucinated numbers in LLM responses, ensuring data accuracy [40][41]
- Graph RAG outperforms regular RAG in providing personalized and helpful answers to users [42][43]
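One simple form of the number-hallucination guardrail mentioned above is to check that every numeric token in a response also appears in the trusted source data. This is a minimal sketch, not Intuit's actual implementation; the refund strings are invented:

```python
import re

def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens, normalizing away thousands separators."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}

def hallucinated_numbers(response: str, trusted: str) -> set[str]:
    """Return numbers in the response that don't appear in the trusted data."""
    return numbers_in(response) - numbers_in(trusted)

# Trusted figures come from the tax engine, never from the LLM.
trusted = "Federal refund: $1,240. State refund: $310."
ok = "Your federal refund is $1,240 and your state refund is $310."
bad = "Your federal refund is $1,250."
```

A response that introduces any number absent from the trusted data (here, `1250`) would be blocked or regenerated rather than shown to the user.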
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
AI Engineer · 2025-06-27 10:27
AI Agent Evaluation & Observability
- Observability is essential in AI development, particularly for evaluation-driven development [1]
- AI trustworthiness is a significant concern, highlighting the need for robust evaluation methods [1]
- Detecting problems in AI is challenging due to its non-deterministic nature, which makes traditional unit testing difficult [1]

AI-Driven Evaluation
- Use AI to evaluate AI, leveraging its ability to understand and identify issues in AI systems [1]
- LLMs can score the performance of other LLMs; use a better (potentially more expensive or custom-trained) LLM for evaluation than the one used in the primary application [2]
- Galileo offers a custom-trained small language model (SLM) designed for effective AI evaluations [2]

Implementation & Metrics
- Integrate evaluations from the beginning of AI application development, including prompt engineering and model selection [2]
- Granularity in evaluation is crucial: analyze each step of the AI workflow to identify failure points [2]
- Key metrics include action completion (did the agent complete the task?) and action advancement (did each step move toward the goal?) [2]

Continuous Improvement & Human Feedback
- AI can provide insights and suggestions for improving agent performance based on evaluation data [3]
- Human feedback is essential to validate and refine AI-generated metrics, ensuring accuracy and continuous learning [4]
- Real-time prevention and alerting are necessary to address rogue AI agents and prevent issues in production [8]
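The action-completion and action-advancement metrics above can be sketched with a toy state representation; the goal/step encoding here is hypothetical and not Galileo's actual formulation:

```python
def action_completion(final_state: dict, goal: dict) -> bool:
    """Did the agent end in a state satisfying every goal condition?"""
    return all(final_state.get(k) == v for k, v in goal.items())

def action_advancement(states: list[dict], goal: dict) -> float:
    """Fraction of steps that increased the number of satisfied goal conditions."""
    def satisfied(state: dict) -> int:
        return sum(state.get(k) == v for k, v in goal.items())
    transitions = list(zip(states, states[1:]))
    advancing = sum(satisfied(b) > satisfied(a) for a, b in transitions)
    return advancing / max(len(transitions), 1)

# Toy trace: the second step (re-checking the booking) makes no progress,
# so 2 of 3 transitions advance toward the goal.
goal = {"booked": True, "paid": True}
trace = [{}, {"booked": True}, {"booked": True}, {"booked": True, "paid": True}]
```

Completion is binary and end-to-end, while advancement is step-granular, which is what makes it useful for locating where in the workflow an agent stalls.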