Evaluation - filings, earnings calls, financial reports, news - Reportify

Evaluation

Search documents

Observability and Evals for AI Agents: A Simple Breakdown

LangChain· 2026-02-17 16:30

Two of the most crucial things when building production agents is setting up proper observability and setting up proper evaluation. And these are actually tied and coupled, and this is different than in software engineering and the role observability and evaluation play when building agents is different than in software engineering as well. So I wanna talk a little bit about how we view observability and how it powers a lot of agent evaluation.Maybe starting briefly highlighting some of the things that we t ...

豆包大模型 1.8 发布，通用 Agent 模型成为了 AI 行业的新叙事

Founder Park· 2025-12-19 07:22

Core Insights - The AI industry in 2025 is focusing on the core narrative of model capabilities, particularly in supporting agents, coding abilities, and effective tool usage, moving beyond mere leaderboard scores to real-world task performance as a new standard for evaluation [2][10] - ByteDance's newly released Doubao model 1.8 enhances agent support capabilities, including coding and tool usage, and introduces an imaginative scenario with OS Agent [4][11] - The introduction of visual capabilities in agents allows them to understand and interact with the world, which is crucial for assisting with complex real-world tasks [8][16] Model Development - Current model advancements are not limited to text-based models; they now include enhanced visual capabilities, allowing models to "see" and comprehend the world [7][10] - Doubao 1.8 combines LLM and VLM capabilities from the outset, achieving significant improvements in visual understanding and maintaining reasoning performance [8][10] - The Doubao model's ability to catch up with the Gemini series in a short time indicates a consensus among foundational model companies regarding the future development of models [10] Agent Capabilities - The emergence of OS Agent has sparked a wave of entrepreneurship in AI agents, with a focus on the reliability of tool invocation becoming a key concern for developers [11][12] - Doubao 1.8 significantly enhances the ability of agents to use tools, which is a common focus among recently released models [12][13] - The core capability of Doubao 1.8 is its OS Agent, which allows it to "see" and interact directly with interfaces, unlocking new use cases [14][16] Evaluation Systems - The evaluation of models is shifting from traditional benchmarks to real-world applications, emphasizing user experience and the ability to perform complex tasks that reflect actual user needs [29][32] - Doubao 1.8's evaluation system prioritizes real-world scenarios and aims to advance general intelligence while ensuring practical usability [35][36] - The challenges of customer service scenarios highlight the complexity of real-world tasks, which require high accuracy and emotional intelligence, showcasing the potential for AI to enhance user experiences [36][40]

通用 Agent 模型

多模态理解能力

Artificial Intelligence

豆包大模型 1.8

通用 Agent 模型

多模态理解能力

Artificial Intelligence

豆包大模型 1.8

Avi Chawla· 2025-12-08 19:06

Educational Resources - Stanford's CS336 provides a video guide to Karpathy's nanochat, covering essential topics for Frontier AI Labs preparation [1] Key AI Concepts - The curriculum includes Tokenization, Resource Accounting, Pretraining, Finetuning (SFT/RLHF), Key Architectures, GPUs, Kernels, Tritons, Parallelism, Scaling Laws, Inference, Evaluation, and Alignment [1]

Avi Chawla· 2025-12-08 06:31

Educational Resources - Stanford's CS336 video guide covers topics essential for Frontier AI Labs jobs [1] - The curriculum includes tokenization, resource accounting, pretraining, and finetuning (SFT/RLHF) [1] - Key AI architectures, GPU usage, kernels, parallelism, and scaling laws are addressed [1] AI Development Lifecycle - The guide also covers inference, evaluation, and alignment in AI models [1]

X @Investopedia

Investopedia· 2025-10-20 11:30

Investment Evaluation - Financial statements possess 12 characteristics crucial for evaluating companies before investing [1] - These characteristics can increase the chances of choosing a winner [1] Resource - A resource is available to discover these 12 characteristics [1]

Financial Statements

Financial Statements

Fuzzing the GenAI Era Leonard Tang

AI Engineer· 2025-08-21 16:26

AI Evaluation Challenges - Traditional evaluation methods are inadequate for assessing GenAI applications' brittleness [1] - The industry faces a "Last Mile Problem" in AI, ensuring reliability, quality, and alignment for any application [1] - Standard evaluation methods often fail to uncover corner cases and unexpected user inputs [1] Haize Labs' Approach - Haize Labs simulates the "last mile" by bombarding AI with unexpected user inputs to uncover corner cases at scale [1] - Haize Labs focuses on Quality Metric (defining criteria for good/bad responses and automating judgment) and Stimuli Generation (creating diverse data to discover bugs) [1] - Haize Labs uses agents as judges to scale evaluation, considering factors like accuracy vs latency [1] - Haize Labs employs RL-tuned judges to further scale evaluation processes [1] - Haize Labs utilizes simulation as a form of prompt optimization [1] Case Studies - Haize Labs has worked with a major European bank's AI app [1] - Haize Labs has worked with a F500 bank's voice agents [1] - Haize Labs scales voice agent evaluations [1]

Stimuli Generation

Stimuli Generation

The Future of Evals - Ankur Goyal, Braintrust

AI Engineer· 2025-08-09 15:12

Product & Technology - Brain Trust introduces "Loop," an agent integrated into its platform designed to automate and improve prompts, datasets, and scorers for AI model evaluation [4][5][7] - Loop leverages advancements in frontier models, particularly noting Claude 4's significant improvement (6x better) in prompt engineering capabilities compared to previous models [6] - Loop allows users to compare suggested edits to data and prompts side-by-side within the UI, maintaining data visibility [9][10] - Loop supports various models, including OpenAI, Gemini, and custom LLMs [9] User Engagement & Adoption - The average organization using Brain Trust runs approximately 13 evaluations (EVELs) per day [3] - Some advanced customers are running over 3,000 evaluations daily and spending more than two hours per day using the product [3] - Brain Trust encourages users to try Loop and provide feedback [12] Future Vision - Brain Trust anticipates a revolution in AI model evaluation, driven by advancements in frontier models [11] - The company is focused on incorporating these advancements into its platform [11] Hiring - Brain Trust is actively hiring for UI, AI, and infrastructure roles [12]

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

AI Engineer· 2025-08-03 04:34

Core Problem & Solution - Traditional software development lifecycle is insufficient for AI applications due to non-deterministic models, requiring a data science approach and continuous experimentation [3] - The key is to reverse engineer metrics from real-world scenarios, focusing on product experience and business outcomes rather than abstract data science metrics [6] - Build evaluations (evals) at the beginning of the process, not at the end, to identify failures and areas for improvement early on [14] - Continuous improvement of evals and solutions is necessary to reach a baseline benchmark for optimization [19] Evaluation Methodology - Evaluations should mimic specific user questions and criteria relevant to the solution's end goal [7] - Use Large Language Models (LLMs) to generate evaluations, considering different user personas and expected answers [9][11] - Focus on the details of each evaluation failure to understand the root cause, whether it's the test definition or the solution's performance [15] - Experimentation involves changing models, logic, prompts, or data, and continuously running evaluations to catch regressions [16][18] Industry Specific Examples - For customer support bots, measure the rate of escalation to human support as a key metric [5] - For text-to-SQL or text-to-graph database applications, create a mock database with known data to validate expected results [22] - For call center conversation classifiers, use simple matching to determine if the correct rubric is applied [23] Key Takeaways - Evaluate AI applications the way users actually use them, avoiding abstract metrics [24] - Frequent evaluations enable rapid progress and reduce regressions [25] - Well-defined evaluations lead to explainable AI, providing insights into how the solution works and its limitations [26]

AI applications

Customer support bot

Experimentation

Software development life cycle

AI applications

Customer support bot

Experimentation

Software development life cycle

The 2025 AI Engineering Report — Barr Yaron, Amplify

AI Engineer· 2025-08-01 22:51

AI Engineering Landscape - The AI engineering community is broad, technical, and growing, with the "AI Engineer" title expected to gain more ground [5] - Many seasoned software developers are AI newcomers, with nearly half of those with 10+ years of experience having worked with AI for three years or less [7] LLM Usage and Customization - Over half of respondents are using LLMs for both internal and external use cases, with OpenAI models dominating external, customer-facing applications [8] - LLM users are leveraging them across multiple use cases, with 94% using them for at least two and 82% for at least three [9] - Retrieval-Augmented Generation (RAG) is the most popular customization method, with 70% of respondents using it [10] - Parameter-efficient fine-tuning methods like LoRA/Q-LoRA are strongly preferred, mentioned by 40% of fine-tuners [12] Model and Prompt Management - Over 50% of respondents are updating their models at least monthly, with 17% doing so weekly [14] - 70% of respondents are updating prompts at least monthly, and 10% are doing so daily [14] - A significant 31% of respondents lack any system for managing their prompts [15] Multimodal AI and Agents - Image, video, and audio usage lag text usage significantly, indicating a "multimodal production gap" [16][17] - Audio has the highest intent to adopt among those not currently using it, with 37% planning to eventually adopt audio [18] - While 80% of respondents say LLMs are working well, less than 20% say the same about agents [20] Monitoring and Evaluation - Most respondents use multiple methods to monitor their AI systems, with 60% using standard observability and over 50% relying on offline evaluation [22] - Human review remains the most popular method for evaluating model and system accuracy and quality [23] - 65% of respondents are using a dedicated vector database [24] Industry Outlook - The mean guess for the percentage of the US Gen Z population that will have AI girlfriends/boyfriends is 26% [27] - Evaluation is the number one most painful thing about AI engineering today [28]

Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)

AI Engineer· 2025-07-29 16:00

[Music] All right. Uh, thank you everyone. We're excited for to be here and thank you for uh, coming to our talk.Uh, my name is Chong. I'm the CEO and co-founder of LANCB. I've been making data tools for machine learning and data science for about 20 years.I was one of the co-authors of pandas library and I'm working on LANCB today for all of that data that doesn't fit neatly into those pandas data frames. And I'm Calvin. I lead one of the teams at Harvey Aai working on rag um tough rag problems across mass ...

Retrieval Augmented Generation (RAG)

AI native multimodal lakehouse

Legal AI assistant

Retrieval Augmented Generation (RAG)

AI native multimodal lakehouse

Legal AI assistant