The Future of Evals - Ankur Goyal, Braintrust

Product & Technology - Brain Trust introduces "Loop," an agent integrated into its platform designed to automate and improve prompts, datasets, and scorers for AI model evaluation [4][5][7] - Loop leverages advancements in frontier models, particularly noting Claude 4's significant improvement (6x better) in prompt engineering capabilities compared to previous models [6] - Loop allows users to compare suggested edits to data and prompts side-by-side within the UI, maintaining data visibility [9][10] - Loop supports various models, including OpenAI, Gemini, and custom LLMs [9] User Engagement & Adoption - The average organization using Brain Trust runs approximately 13 evaluations (EVELs) per day [3] - Some advanced customers are running over 3,000 evaluations daily and spending more than two hours per day using the product [3] - Brain Trust encourages users to try Loop and provide feedback [12] Future Vision - Brain Trust anticipates a revolution in AI model evaluation, driven by advancements in frontier models [11] - The company is focused on incorporating these advancements into its platform [11] Hiring - Brain Trust is actively hiring for UI, AI, and infrastructure roles [12]