Evals
Why We Built LangSmith for Improving Agent Quality
LangChain· 2025-11-04 16:04
LangSmith Platform Updates
- LangChain is launching new features for LangSmith, a platform for agent engineering, focusing on tracing, evaluation, and observability to improve agent reliability [1]
- LangSmith introduces "Insights," a feature designed to automatically identify trends in user interactions and agent behavior from millions of daily traces, helping users understand how their agents are being used and where they are making mistakes [1]
- Insights is inspired by Anthropic's work on understanding conversation topics, but adapted for LangSmith's broader range of agent payloads [5][6]

Evaluation and Testing
- LangSmith emphasizes the importance of methodical testing, including online evaluations, to move beyond simple "vibe testing" and add rigor to agent development [1][33]
- LangSmith introduces "thread evals," which let users evaluate agent performance across entire user interactions or conversations, providing a more comprehensive view than single-turn evaluations (see the sketch after this summary) [16][17]
- Online evals measure agent performance in real time using production data, complementing offline evals that are based on known examples [24]
- The company argues against the idea that offline evals are obsolete, highlighting their continued usefulness for regression testing and for ensuring agents perform well on known interaction types [30][31]

Use Cases and Applications
- Insights can help product managers understand which product features are most frequently used with an agent, informing product roadmap prioritization [2][12]
- Insights can assist AI engineers in identifying and categorizing agent failure modes, such as incorrect tool usage or errors, enabling targeted improvements [3][13]
- Thread evals are particularly useful for evaluating user sentiment across an entire conversation or for tracking the trajectory of tool calls within a conversation [21]

Future Development
- LangSmith plans to introduce agent- and thread-level metrics into its dashboards, providing greater visibility into agent performance and cost [26]
- The company aims to enable more flows with automation rules over threads, such as spot-checking threads with negative user feedback [27]
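To make the "thread eval" idea concrete, here is a minimal sketch of a thread-level evaluator: an LLM judge that scores a whole conversation (resolution and user sentiment) instead of a single turn. The judge model, rubric, and data shapes are illustrative assumptions, not LangSmith's actual API.

```python
# Minimal sketch of a thread-level (whole-conversation) evaluator.
# Illustrative only: judge model, rubric, and data shape are assumptions.
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()

def evaluate_thread(messages: list[dict]) -> dict:
    """Score an entire conversation rather than a single turn."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    prompt = (
        "You are grading an AI agent over a full conversation.\n"
        "Return JSON with keys 'resolved' (true/false) and 'user_sentiment' "
        "(one of: positive, neutral, negative).\n\n"
        f"Conversation:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: one multi-turn thread pulled from production traces (hypothetical data).
thread = [
    {"role": "user", "content": "My export keeps failing."},
    {"role": "assistant", "content": "I re-ran the export job; it completed."},
    {"role": "user", "content": "Great, thanks!"},
]
print(evaluate_thread(thread))
```

The same evaluator could run offline against saved threads or online against sampled production conversations; the summary's point is that the unit being scored is the whole interaction, not one model call.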
How to Improve Evals
Greylock· 2025-09-30 19:47
Evaluation Analysis
- The industry emphasizes the importance of scrutinizing both regressions and improvements in evaluation results [2]
- The industry suggests that initial improvements observed during evaluation are often misleading [2]
- The industry recommends focusing on refining the scoring function when encountering unexpected evaluation outcomes, rather than immediately altering the agentic system or prompt [1]

Debugging and Improvement
- The industry advises analyzing specific tests or cases that have worsened compared to previous evaluations to identify potential issues (a per-case diff like the sketch after this list) [1]
- The industry highlights the need to validate whether observed improvements are genuine or artificial [2]
- The industry suggests using fake improvements as opportunities to refine the evaluation function [2]
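The advice about inspecting per-case regressions and improvements is easy to operationalize. Below is a minimal sketch, assuming each eval run is a mapping from case ID to score; all names and the threshold are illustrative, not a specific tool's API.

```python
# Sketch: compare two eval runs case by case to surface regressions and
# improvements worth inspecting. Data shapes and names are hypothetical.
def diff_runs(baseline: dict[str, float], candidate: dict[str, float],
              threshold: float = 0.05) -> tuple[list[str], list[str]]:
    """Return (regressed_case_ids, improved_case_ids) beyond a score threshold."""
    regressed, improved = [], []
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            continue  # case not present in the new run
        delta = new_score - old_score
        if delta <= -threshold:
            regressed.append(case_id)
        elif delta >= threshold:
            improved.append(case_id)
    return regressed, improved

baseline_run = {"case-1": 0.9, "case-2": 0.4, "case-3": 0.7}
candidate_run = {"case-1": 0.6, "case-2": 0.8, "case-3": 0.7}
regressed, improved = diff_runs(baseline_run, candidate_run)
print("Inspect regressions first:", regressed)
print("Verify improvements are genuine (not a scoring artifact):", improved)
```

Reading the regressed cases first, and checking whether the "improved" cases reflect real behavior changes or a lenient scoring function, mirrors the workflow the summary describes.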
Future of Evals
Greylock· 2025-09-30 19:43
AI Model Evaluation (Eval) Industry Trends
- Evals remain a core driver for building great AI software and are expected to stay relevant in the future [2]
- The implementation of running evals has changed significantly and will continue to evolve [2]
- Updates based on eval results have transitioned from slow and manual to fast and manual [3]
- The industry anticipates a shift towards faster updates that are partially or entirely automatic [3]

Future of Human-AI Interaction in Evals
- Human interaction with evals will evolve from analyzing dashboards to a collaborative process in which LLM systems suggest changes (sketched after this summary) [3][4]
- LLM systems may contextualize why changes should be made based on eval results [4]

Braintrust's Perspective
- Braintrust was founded partly because evals had seen little meaningful change prior to its inception [1]
- Braintrust is excited about the anticipated shift in how humans interact with evals [4]
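One way to picture the shift the summary anticipates, from reading dashboards to an LLM proposing changes with its reasoning, is a small review loop like the sketch below. The model name, prompt wording, and data shapes are assumptions, and a human still approves any change before it ships.

```python
# Sketch of an "eval results -> suggested change" loop: an LLM reviews failing
# cases and proposes a prompt revision with its reasoning. Entirely illustrative.
from openai import OpenAI

client = OpenAI()

def suggest_prompt_update(current_prompt: str, failures: list[dict]) -> str:
    """Ask an LLM to propose and justify a revision based on eval failures."""
    failure_report = "\n".join(
        f"- input: {f['input']!r}; expected: {f['expected']!r}; got: {f['output']!r}"
        for f in failures
    )
    review = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed reviewer model
        messages=[{
            "role": "user",
            "content": (
                "Here is a system prompt and the eval cases it failed.\n"
                f"Prompt:\n{current_prompt}\n\nFailures:\n{failure_report}\n\n"
                "Explain why these failures likely happened, then propose a "
                "revised prompt. A human will review before anything ships."
            ),
        }],
    )
    return review.choices[0].message.content

failing_cases = [
    {"input": "Cancel my order", "expected": "order_cancelled", "output": "order_created"},
]
print(suggest_prompt_update("You are an order-management agent.", failing_cases))
```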
How to Build Planning Agents without losing control - Yogendra Miraje, Factset
AI Engineer· 2025-07-23 15:51
Hi everyone, I'm Yogi. I work at FactSet, a financial data and software company, and today I'll be sharing some of my experience building agents. In the last few years we have seen tremendous growth in AI, and especially in the last couple of years we are on an exponential curve of intelligence growth. And yet it feels like when we develop AI applications, we are driving a monster truck through a crowded mall with a tiny joystick. So AI applications have not seen their ChatGPT moment yet. There are many reasons ...
How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe
AI Engineer· 2025-07-22 19:46
Hey everyone, hope you are having a great conference. I'm going to talk about how to run evals at scale, thinking beyond accuracy or similarity. In the last presentation we learned how to architect AI applications and why evals are important. In this presentation I am going to talk about the importance of evals, as well as what type of evals we have to choose when we are crafting an application. This is a bit about me. I work as a lead en ...
State-Of-The-Art Prompting For AI Agents
Y Combinator· 2025-05-30 14:00
Prompt Engineering & Metaprompting
- Metaprompting is emerging as a powerful tool, likened to coding in 1995 due to the evolving nature of the tools [1]
- The best prompts often start by defining the role of the LLM, detailing the task, and outlining a step-by-step plan, often using markdown-style formatting [1]
- Vertical AI agent companies are exploring how to balance flexibility for customer-specific logic with maintaining a general-purpose product, considering forking and merging prompts [1]
- An emerging architecture involves defining a system prompt (company API), a developer prompt (customer-specific context), and a user prompt (end-user input); a sketch of this layering follows the summary [1]
- Worked examples are crucial for improving output quality, and automating the process of extracting and ingesting these examples from customer data is a valuable opportunity [2]
- Prompt folding allows a prompt to dynamically generate better versions of itself by feeding it examples where it failed [2]
- When LLMs lack sufficient information, it's important to provide them with an "escape hatch" to avoid hallucinations, either by allowing them to ask for more information or by providing debug info in the response [2]

Evaluation & Model Personalities
- Evals are considered the "crown jewels" for AI companies, essential for understanding why a prompt was written a certain way and for improving it [3]
- Different LLMs exhibit distinct personalities; for example, Claude is considered more steerable, while Llama 4 requires more steering and prompting [5]
- When using LLMs to generate numerical scores, providing rubrics is best practice, but models may interpret and apply these rubrics with varying degrees of rigidity and flexibility [5]

Founder Role & Forward Deployed Engineer
- Founders need to deeply understand their users and codify these insights into specific evals to ensure the software works for them [3]
- Founders should act as "forward deployed engineers," directly engaging with users to understand their needs and rapidly iterate on the product [4]
- The forward deployed engineer model, combined with AI, enables faster iteration and closing of significant deals with large enterprises [5]
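To make the three-tier prompt architecture concrete, here is a minimal sketch of how the layers might be composed into a single chat request, with an explicit escape-hatch instruction against hallucination. The product, prompts, and model name are illustrative assumptions, not any specific company's implementation.

```python
# Sketch: composing a system prompt (company-level), developer prompt
# (customer-specific context), and user prompt (end-user input).
# All content and names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are the support agent for Acme's billing product.\n"            # role
    "Task: resolve billing questions using only the given context.\n"    # task
    "Plan: 1) restate the question, 2) check the context, 3) answer.\n"  # step-by-step plan
    # Escape hatch: give the model a way out instead of guessing.
    "If the context is insufficient, say exactly what information you still need."
)

def build_messages(developer_prompt: str, user_prompt: str) -> list[dict]:
    """Layer the three prompt tiers into a chat message list."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Customer-specific context:\n{developer_prompt}"},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages(
    developer_prompt="This customer is on the Enterprise plan; refunds require manager approval.",
    user_prompt="Why was I charged twice this month?",
)
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```

Keeping the company-wide prompt separate from the customer-specific layer is what lets a vertical agent product fork per-customer logic without forking the whole prompt, which is the tension the summary describes.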