Evaluation

The Future of Evals - Ankur Goyal, Braintrust
AI Engineer · 2025-08-09 15:12
Awesome. So today we're going to talk a little bit about evals to date and where we think evals are going in the future. Also, for those of you who saw my brother earlier, I'm going to do my best to live up to his energy and charisma. But yeah, you know, it's been an amazing almost two-year journey for us at Braintrust. We have had the opportunity to work with some of the most amazing companies building, I think, the best AI products in the world. I'm blown aw ...
Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
AI Engineer · 2025-08-03 04:34
Welcome, everyone. I'm going to talk about practical tactics to build reliable AI applications, and why nobody does it this way yet. A little bit about myself, or why you should trust me: I have about 15 years as a startup co-founder and CTO, and I've held executive positions for the last five years at several enterprises. But most importantly, I spent the last couple of years developing a lot of gen AI projects, ranging from POCs to many production-level solutions, and helped some companies to ...
The 2025 AI Engineering Report — Barr Yaron, Amplify
AI Engineer · 2025-08-01 22:51
AI Engineering Landscape
- The AI engineering community is broad, technical, and growing, with the "AI Engineer" title expected to gain more ground [5]
- Many seasoned software developers are AI newcomers: nearly half of those with 10+ years of experience have worked with AI for three years or less [7]

LLM Usage and Customization
- Over half of respondents use LLMs for both internal and external use cases, with OpenAI models dominating external, customer-facing applications [8]
- LLM users leverage them across multiple use cases, with 94% using them for at least two and 82% for at least three [9]
- Retrieval-Augmented Generation (RAG) is the most popular customization method, used by 70% of respondents [10]
- Parameter-efficient fine-tuning methods like LoRA/Q-LoRA are strongly preferred, mentioned by 40% of fine-tuners [12]

Model and Prompt Management
- Over 50% of respondents update their models at least monthly, with 17% doing so weekly [14]
- 70% of respondents update prompts at least monthly, and 10% do so daily [14]
- A significant 31% of respondents lack any system for managing their prompts [15]

Multimodal AI and Agents
- Image, video, and audio usage lag text usage significantly, indicating a "multimodal production gap" [16][17]
- Audio has the highest intent to adopt among those not currently using it, with 37% planning to eventually adopt it [18]
- While 80% of respondents say LLMs are working well, less than 20% say the same about agents [20]

Monitoring and Evaluation
- Most respondents use multiple methods to monitor their AI systems, with 60% using standard observability and over 50% relying on offline evaluation (see the harness sketch after this summary) [22]
- Human review remains the most popular method for evaluating model and system accuracy and quality [23]
- 65% of respondents use a dedicated vector database [24]

Industry Outlook
- The mean guess for the share of the US Gen Z population that will have AI girlfriends/boyfriends is 26% [27]
- Evaluation is the number one most painful thing about AI engineering today [28]
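Since "offline evaluation" figures prominently in the monitoring findings above, here is a minimal Python sketch of what such a harness typically looks like: run a fixed golden set through the system under test and score the outputs. The `ask_model` stub and the tiny dataset are illustrative assumptions, not anything from the report.

```python
from typing import Callable, List, Tuple

# Hypothetical golden set: (input, expected answer) pairs, typically
# curated via the human review the report says is still most popular.
GOLDEN: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def ask_model(prompt: str) -> str:
    """Stub for the system under test; swap in a real LLM call."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "paris"}
    return canned[prompt]

def offline_eval(system: Callable[[str], str]) -> float:
    """Score the system on the golden set with case-insensitive exact match."""
    hits = sum(system(q).strip().lower() == a.strip().lower() for q, a in GOLDEN)
    return hits / len(GOLDEN)

if __name__ == "__main__":
    print(f"accuracy: {offline_eval(ask_model):.0%}")  # -> accuracy: 100%
```

In practice the scorer is the hard part: exact match only works for closed-form answers, which is why teams fall back to human review or LLM-as-judge for open-ended outputs.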
Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)
AI Engineer · 2025-07-29 16:00
All right. Thank you, everyone. We're excited to be here, and thank you for coming to our talk. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into those pandas data frames. And I'm Calvin. I lead one of the teams at Harvey AI working on tough RAG problems across mass ...
Building Applications with AI Agents — Michael Albada, Microsoft
AI Engineer · 2025-07-24 15:00
Agentic Development Landscape
- The adoption of agentic technology is rapidly increasing, with a 254% increase in companies self-identifying as agentic in the last three years based on Y Combinator data [5]
- Agentic systems are complex, and while initial prototypes may achieve around 70% accuracy, reaching perfection is difficult due to the long tail of complex scenarios [6][7]
- The industry defines an agent as an entity that can reason, act, communicate, and adapt to solve tasks, viewing the foundation model as a base for adding components to enhance performance [8]
- The industry emphasizes that agency should not be the ultimate goal but a tool to solve problems, ensuring that increased agency maintains a high level of effectiveness [9][11][12]

Tool Use and Orchestration
- Exposing tools and functionalities to language models enables agents to invoke functions via APIs, but requires careful consideration of which functionalities to expose [14]
- The industry advises against a one-to-one mapping between APIs and tools, recommending grouping tools logically to reduce semantic collision and improve accuracy [17][18]
- Simple workflow patterns, such as single chains, are recommended for orchestration to improve measurability, reduce costs, and enhance reliability [19][20]
- For complex scenarios, the industry suggests considering a move to more agentic patterns and potentially fine-tuning the model [22][23]

Multi-Agent Systems and Evaluation
- Multi-agent systems can help scale the number of tools by breaking them into semantically similar groups and routing tasks to appropriate agents (see the routing sketch after this summary) [24][25]
- The industry recommends investing more in evaluation to address the numerous hyperparameters involved in building agentic systems [27][28]
- AI architects and engineers should take ownership of defining the inputs and outputs of agents to accelerate team progress [29][30]
- Tools like IntellAgent, Microsoft's PyRIT, and Label Studio can aid in generating synthetic inputs, red teaming agents, and building evaluation sets [33][34][35]

Observability and Common Pitfalls
- The industry emphasizes the importance of observability using tools like OpenTelemetry to understand failure modes and improve systems [38]
- Common pitfalls include insufficient evaluation, inadequate tool descriptions, semantic overlap between tools, and excessive complexity [39][40]
- The industry stresses the importance of designing for safety at every layer of agentic systems, including building tripwires and detectors [41][42]
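To make the tool-grouping and routing recommendation concrete, here is a minimal Python sketch (my illustration, not code from the talk): related functions are bundled into a few semantically coherent groups, and each task is dispatched to the best-matching group so the sub-agent sees only a small tool set. The `llm_choose` router is a keyword-overlap placeholder; a real system would ask an LLM to pick a group from the descriptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ToolGroup:
    """A semantically coherent bundle of tools handled by one sub-agent."""
    name: str
    description: str
    tools: Dict[str, Callable[..., str]]

# Hypothetical groups: a few related functions per sub-agent, rather than
# one tool per API endpoint, to reduce semantic collision between tools.
GROUPS: List[ToolGroup] = [
    ToolGroup(
        name="calendar",
        description="create move cancel meetings events schedule",
        tools={"create_event": lambda title, when: f"created {title} at {when}"},
    ),
    ToolGroup(
        name="email",
        description="draft send search mail inbox",
        tools={"send_mail": lambda to, body: f"sent mail to {to}"},
    ),
]

def llm_choose(task: str, options: List[ToolGroup]) -> ToolGroup:
    """Placeholder router: picks the group whose description shares the most
    words with the task. Swap in an LLM call over the descriptions in practice."""
    words = set(task.lower().split())
    return max(options, key=lambda g: len(words & set(g.description.split())))

def route(task: str) -> str:
    group = llm_choose(task, GROUPS)
    # Only the chosen group's tools are exposed to the sub-agent.
    return f"routing {task!r} to the {group.name} agent ({len(group.tools)} tools)"

if __name__ == "__main__":
    print(route("cancel my meetings tomorrow"))  # -> the calendar agent
```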
How Intuit uses LLMs to explain taxes to millions of taxpayers - Jaspreet Singh, Intuit
AI Engineer · 2025-07-23 15:51
Hi, I'm Jaspreet. I'm a senior staff engineer at Intuit, where I work on gen AI for TurboTax. And today we'll be talking about how we use LLMs at Intuit to, well, help you understand your taxes better. So, just to understand the scale: TurboTax successfully processed 44 million tax returns for tax year '23, and that's really the scale we're going for. We want everybody to have high confidence in how their taxes are filed, and to understand that they are getting the best deductions that ...
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
AI Engineer · 2025-06-27 10:27
So I'm here to talk about taming rogue AI agents, but essentially I want to talk about evaluation-driven development, observability-driven development, but really why we need observability. So, who uses AI? Is that Jim's most stupid question of the day? Probably. Who trusts AI? Right. If you'd like to meet me after, I've got some snake oil you might be interested in buying. Yeah, we do not trust AI in the slightest. Now, a different question: who reads books? That's reading books. If you want some recommenda ...