Agent Engineering
Search documents
Why We Built LangSmith for Improving Agent Quality
LangChainยท 2025-11-04 16:04
Langsmith Platform Updates - Langchain is launching new features for Langsmith, a platform for agent engineering, focusing on tracing, evaluation, and observability to improve agent reliability [1] - Langsmith introduces "Insights," a feature designed to automatically identify trends in user interactions and agent behavior from millions of daily traces, helping users understand how their agents are being used and where they are making mistakes [1] - Insights is inspired by Anthropic's work on understanding conversation topics, but adapted for Langsmith's broader range of agent payloads [5][6] Evaluation and Testing - Langsmith emphasizes the importance of methodical testing, including online evaluations, to move beyond simple "vibe testing" and add rigor to agent development [1][33] - Langsmith introduces "thread evals," which allow users to evaluate agent performance across entire user interactions or conversations, providing a more comprehensive view than single-turn evaluations [16][17] - Online evals measure agent performance in real-time using production data, complementing offline evals that are based on known examples [24] - The company argues against the idea that offline evals are obsolete, highlighting their continued usefulness for regression testing and ensuring agents perform well on known interaction types [30][31] Use Cases and Applications - Insights can help product managers understand which product features are most frequently used with an agent, informing product roadmap prioritization [2][12] - Insights can assist AI engineers in identifying and categorizing agent failure modes, such as incorrect tool usage or errors, enabling targeted improvements [3][13] - Thread evals are particularly useful for evaluating user sentiment across an entire conversation or tracking the trajectory of tool calls within a conversation [21] Future Development - Langsmith plans to introduce agent and thread-level metrics into its dashboards, providing greater visibility into agent performance and cost [26] - The company aims to enable more flows with automation rules over threads, such as spot-checking threads with negative user feedback [27]