Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran
AI Engineer · 2025-06-10 17:30
Agent Evaluation Challenges
- Building agents is difficult, requiring iteration at the prompt, model, and tool call definition levels [2][3]
- Systematically tracking how a new prompt performs against previous ones is challenging [4]
- Including product managers or other team members in the iterative evaluation process is difficult [5]
- Identifying bottlenecks in an application and pinpointing the specific sub-agents or tool calls that produce poor responses is hard [6]

Evaluation Components
- Agent evaluation should include evaluation at the tool call level: was the right tool called, and were the correct arguments passed? [7][11] (see the first sketch below)
- Trajectory evaluation checks whether tool calls are executed in the correct order across a series of steps [7][20] (see the second sketch below)
- Multi-turn conversation evaluation assesses consistency in tone and context retention across multiple interactions [8][22][23] (see the third sketch below)
- Improving evaluation prompts is crucial, since the evals used to identify failure cases are themselves essential to improving the agent [8][27]

Arize Product Features
- Arize offers a product for tracing and evaluating agent performance, allowing teams to ask questions about application performance and surface suggested improvements [12][13]
- The product provides a high-level view of the different paths an agent can take, helping to pinpoint performance bottlenecks [14][15]
- Users can drill down into specific traces to evaluate tool call correctness and argument alignment [17][18]
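
To make the tool-call-level check concrete, here is a minimal sketch, not the speaker's or Arize's implementation; the ToolCall type, field names, and example tool are invented for illustration. It scores a single call on whether the right tool was chosen and whether the expected arguments were passed:

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str        # which tool the agent invoked
    arguments: dict  # the arguments it passed


def evaluate_tool_call(actual: ToolCall, expected: ToolCall) -> dict:
    """Score one tool call on two axes: right tool chosen, correct arguments passed."""
    right_tool = actual.name == expected.name
    # Only check the arguments the ground truth cares about; the agent may
    # legitimately pass extra optional arguments.
    correct_args = right_tool and all(
        actual.arguments.get(k) == v for k, v in expected.arguments.items()
    )
    return {"right_tool": right_tool, "correct_arguments": correct_args}


# Example: the agent was expected to look up the weather for Paris.
expected = ToolCall("get_weather", {"city": "Paris"})
actual = ToolCall("get_weather", {"city": "Paris", "units": "celsius"})
print(evaluate_tool_call(actual, expected))
# {'right_tool': True, 'correct_arguments': True}
```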
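Trajectory evaluation can be sketched the same way, assuming an agent run has been reduced to an ordered list of tool names. The exact-match and in-order checks below are two common comparison strategies, not necessarily the ones used in the talk:

```python
def evaluate_trajectory(actual_calls: list[str], expected_calls: list[str]) -> dict:
    """Compare the sequence of tools an agent actually invoked against a reference trajectory."""
    exact_match = actual_calls == expected_calls
    # "In-order" match: every expected step appears in the right order,
    # even if the agent made extra calls in between.
    remaining = iter(actual_calls)
    in_order = all(step in remaining for step in expected_calls)
    return {
        "exact_match": exact_match,
        "in_order": in_order,
        "extra_steps": len(actual_calls) - len(expected_calls),
    }


# Example: the agent inserted a redundant search before answering.
expected = ["search_docs", "summarize", "respond"]
actual = ["search_docs", "search_docs", "summarize", "respond"]
print(evaluate_trajectory(actual, expected))
# {'exact_match': False, 'in_order': True, 'extra_steps': 1}
```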
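Multi-turn evaluation typically needs an LLM-as-judge run over the whole conversation rather than a per-turn check. The sketch below assumes an OpenAI-compatible client and an invented judge prompt; the rubric (tone consistency, context retention) follows the points above, but the prompt wording and model choice are placeholders:

```python
from openai import OpenAI  # any chat-completion client can serve as the judge

JUDGE_PROMPT = (
    "You are evaluating a multi-turn conversation between a user and an AI agent.\n"
    "Judge two things: (1) consistent_tone - does the agent keep the same tone and "
    "persona across all turns? (2) retains_context - does the agent correctly use "
    "information the user gave in earlier turns?\n"
    "Answer with a short JSON object containing consistent_tone, retains_context, "
    "and explanation.\n\nConversation:\n"
)


def evaluate_conversation(turns: list[dict], model: str = "gpt-4o-mini") -> str:
    """Run an LLM-as-judge eval over the full conversation, not a single turn."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT + transcript}],
    )
    return response.choices[0].message.content
```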