Evals

How to Build Planning Agents without losing control - Yogendra Miraje, Factset
AI Engineer · 2025-07-23 15:51
Hi everyone, I'm Yogi. I work at FactSet, a financial data and software company, and today I'll be sharing some of my experience building agents. In the last few years we have seen tremendous growth in AI, and especially in the last couple of years we are on an exponential curve of intelligence growth. And yet it feels like, when we develop AI applications, we are driving a monster truck through a crowded mall with a tiny joystick. AI applications have not seen their ChatGPT moment yet. There are many reasons ...
How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe
AI Engineer · 2025-07-22 19:46
Hey everyone, hope you are having a great conference. I'm going to talk about how to run evals at scale, thinking beyond accuracy or similarity. In the last presentation we learned how to architect AI applications and why evals are important. In this presentation I am going to talk about the importance of evals, as well as what type of evals to choose when crafting an application. This is a bit about me. I work as a lead en ...
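The "beyond accuracy or similarity" idea from the talk can be sketched as a tiny eval harness that scores structure and grounding as well as correctness. This is an illustrative sketch, not from the talk itself: `run_model` is a hypothetical stub standing in for the model under test, and the check names are assumptions.

```python
# Minimal sketch of an eval harness that scores more than string similarity.
# `run_model` is a stub for illustration; a real harness would call the
# model being evaluated.

import json

def run_model(prompt: str) -> str:
    # Stub standing in for a model call; returns a canned JSON answer.
    return json.dumps({"answer": "42", "sources": ["doc-1"]})

def eval_case(prompt: str, expected_answer: str) -> dict:
    raw = run_model(prompt)
    checks = {}
    # 1. Format check: is the output valid JSON with the fields we need?
    try:
        parsed = json.loads(raw)
        checks["valid_json"] = True
        # 2. Grounding check: did the model cite at least one source?
        checks["has_sources"] = bool(parsed.get("sources"))
    except json.JSONDecodeError:
        return {"valid_json": False, "has_sources": False, "correct": False}
    # 3. Correctness check, only meaningful once the format check passes.
    checks["correct"] = parsed.get("answer") == expected_answer
    return checks

result = eval_case("What is 6 * 7?", "42")
print(result)  # {'valid_json': True, 'has_sources': True, 'correct': True}
```

Running many such cases and aggregating per-check pass rates gives a richer picture than a single accuracy number.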
State-Of-The-Art Prompting For AI Agents
Y Combinator · 2025-05-30 14:00
Prompt Engineering & Metaprompting
- Metaprompting is emerging as a powerful tool, likened to coding in 1995 due to the evolving nature of the tools [1]
- The best prompts often start by defining the role of the LLM, detailing the task, and outlining a step-by-step plan, often using markdown-style formatting [1]
- Vertical AI agent companies are exploring how to balance flexibility for customer-specific logic with maintaining a general-purpose product, considering forking and merging prompts [1]
- An emerging architecture involves defining a system prompt (company API), a developer prompt (customer-specific context), and a user prompt (end-user input) [1]
- Worked examples are crucial for improving output quality, and automating the process of extracting and ingesting these examples from customer data is a valuable opportunity [2]
- Prompt folding allows a prompt to dynamically generate better versions of itself by feeding it examples where it failed [2]
- When LLMs lack sufficient information, it's important to provide them with an "escape hatch" to avoid hallucinations, either by allowing them to ask for more information or by providing debug info in the response [2]

Evaluation & Model Personalities
- Evals are considered the "crown jewels" for AI companies, essential for understanding why a prompt was written a certain way and for improving it [3]
- Different LLMs exhibit distinct personalities; for example, Claude is considered more steerable, while Llama 4 requires more steering and prompting [5]
- When using LLMs to generate numerical scores, providing rubrics is best practice, but models may interpret and apply these rubrics with varying degrees of rigidity and flexibility [5]

Founder Role & Forward Deployed Engineer
- Founders need to deeply understand their users and codify these insights into specific evals to ensure the software works for them [3]
- Founders should act as "forward deployed engineers," directly engaging with users to understand their needs and rapidly iterate on the product [4]
- The forward deployed engineer model, combined with AI, enables faster iteration and closing of significant deals with large enterprises [5]
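The three-tier prompt architecture mentioned above (system prompt as the company API, developer prompt for customer-specific context, user prompt for end-user input) can be sketched as follows. The wording, the helper name `build_messages`, and the example context are all illustrative assumptions, not taken from the talk; the "escape hatch" instruction is folded into the system prompt.

```python
# Sketch of a three-tier prompt layout: system (company API), developer
# (customer-specific context), user (end-user input). Prompt wording is
# hypothetical, for illustration only.

def build_messages(customer_context: str, user_input: str) -> list[dict]:
    system_prompt = (
        "You are a support assistant. Follow these steps: understand the "
        "question, consult the provided context, then answer. If the "
        "context is insufficient, say 'I need more information' instead "
        "of guessing."  # escape hatch to reduce hallucination
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "developer", "content": customer_context},  # customer-specific logic
        {"role": "user", "content": user_input},             # end-user input
    ]

msgs = build_messages("Acme Corp refund policy: 30 days.",
                      "Can I return after 45 days?")
print([m["role"] for m in msgs])  # ['system', 'developer', 'user']
```

Keeping the layers separate lets a vertical AI agent company ship one general-purpose system prompt while forking only the developer layer per customer.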