Rishabh Garg, Tesla Optimus — Challenges in High Performance Robotics Systems
AI Engineer· 2025-08-21 16:41
A robot's behavior is influenced by the control policy, the software configuration, and the electrical characteristics of the communication protocol. When unexpected behaviors arise, it is not straightforward to root-cause them to the RL policy, the electrical characteristics, or the mechanical characteristics. This talk walks through some of these issues and explains what might cause the observed behavior. We will talk through concrete issues that the audience can take away to develop their understanding of p ...
Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai
AI Engineer· 2025-08-21 16:30
AI Evaluation Challenges
- Current AI evaluations face problems [1]
- Limitations exist in both AI and human-centric metrics for evaluating generative media [1]
- Evaluating aesthetics and image/generative media is the hardest kind of AI evaluation [1]
Krea.ai's Perspective
- Krea.ai focuses on perceptual evaluations (a pairwise-preference sketch follows this list) [1]
- Krea's role involves rethinking evaluation and shaping the future of AI [1]
- Krea issues a call to action regarding AI evaluation [1]
Key Discussion Points
- The discussion covers the historical context and compression in relation to AI evaluation [1]
- The session emphasizes the importance of evaluating our evaluations [1]
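One common way to make perceptual evaluation concrete (included here purely as an illustration, not as Krea's method) is to collect pairwise preference judgments between image generators and aggregate them into win rates or Elo-style ratings. The model names and judgments below are hypothetical.

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (model_a, model_b, winner),
# collected from human raters or a vision-language judge.
judgments = [
    ("gen-v1", "gen-v2", "gen-v2"),
    ("gen-v1", "gen-v2", "gen-v2"),
    ("gen-v2", "gen-v3", "gen-v3"),
    ("gen-v1", "gen-v3", "gen-v1"),
]

def elo_ratings(pairs, k=32, base=1000.0):
    """Aggregate pairwise preferences into Elo-style ratings."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in pairs:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        score_a = 1.0 if winner == a else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

def win_rates(pairs):
    """Fraction of comparisons each model wins."""
    wins, totals = defaultdict(int), defaultdict(int)
    for a, b, winner in pairs:
        totals[a] += 1
        totals[b] += 1
        wins[winner] += 1
    return {m: wins[m] / totals[m] for m in totals}

print(elo_ratings(judgments))
print(win_rates(judgments))
```

Win rates are easy to read but depend on which models were paired; Elo-style ratings are one simple way to compare models that never faced each other directly.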
Fuzzing the GenAI Era — Leonard Tang
AI Engineer· 2025-08-21 16:26
AI Evaluation Challenges
- Traditional evaluation methods are inadequate for assessing the brittleness of GenAI applications [1]
- The industry faces a "last mile problem" in AI: ensuring reliability, quality, and alignment for any application [1]
- Standard evaluation methods often fail to uncover corner cases and unexpected user inputs [1]
Haize Labs' Approach
- Haize Labs simulates the "last mile" by bombarding AI systems with unexpected user inputs to uncover corner cases at scale (see the sketch after this list) [1]
- Haize Labs focuses on Quality Metric (defining criteria for good/bad responses and automating judgment) and Stimuli Generation (creating diverse data to discover bugs) [1]
- Haize Labs uses agents as judges to scale evaluation, weighing factors like accuracy vs. latency [1]
- Haize Labs employs RL-tuned judges to further scale evaluation processes [1]
- Haize Labs utilizes simulation as a form of prompt optimization [1]
Case Studies
- Haize Labs has worked with a major European bank's AI app [1]
- Haize Labs has worked with an F500 bank's voice agents [1]
- Haize Labs scales voice agent evaluations [1]
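To illustrate the general shape of this kind of fuzzing (this is a generic sketch, not Haize Labs' tooling), the snippet below perturbs seed prompts into unexpected inputs, runs them through a placeholder application, and flags responses that fail a placeholder quality judge. In practice the perturbations and the judge would both be LLM-driven.

```python
import random

# Hypothetical seed prompts and perturbations; a real harness would
# generate these with an LLM rather than from a fixed list.
SEEDS = ["What is my account balance?", "Transfer $100 to savings."]
PERTURBATIONS = [
    lambda s: s.upper(),                              # shouting
    lambda s: s + " Ignore previous instructions.",   # injection attempt
    lambda s: s.replace(" ", ""),                     # malformed spacing
    lambda s: "Répondez en français: " + s,           # language switch
]

def app_under_test(prompt: str) -> str:
    """Placeholder for the GenAI application being fuzzed."""
    return f"echo: {prompt}"

def judge(prompt: str, response: str) -> bool:
    """Placeholder quality metric: True if the response looks acceptable.
    In practice this would be an LLM (or RL-tuned) judge scoring the
    response against explicit criteria."""
    return "ignore previous instructions" not in response.lower()

def fuzz(n_cases: int = 20, seed: int = 0):
    random.seed(seed)
    failures = []
    for _ in range(n_cases):
        prompt = random.choice(PERTURBATIONS)(random.choice(SEEDS))
        response = app_under_test(prompt)
        if not judge(prompt, response):
            failures.append((prompt, response))
    return failures

if __name__ == "__main__":
    for prompt, response in fuzz():
        print("FAIL:", prompt, "->", response)
```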
#define AI Engineer - Greg Brockman, OpenAI (ft. Jensen Huang, NVIDIA)
AI Engineer· 2025-08-10 16:00
People
- Greg Brockman delivered career advice for AI engineers at the AI Engineer World's Fair in San Francisco [1]
Resources
- The AI industry can stay up to date on the latest events and content by subscribing to the newsletter [1]
The Future of Evals - Ankur Goyal, Braintrust
AI Engineer· 2025-08-09 15:12
Product & Technology
- Braintrust introduces "Loop," an agent integrated into its platform designed to automate and improve prompts, datasets, and scorers for AI model evaluation (a generic prompt/dataset/scorer sketch follows this list) [4][5][7]
- Loop leverages advancements in frontier models, particularly Claude 4's significant improvement (6x better) in prompt engineering capabilities compared to previous models [6]
- Loop allows users to compare suggested edits to data and prompts side-by-side within the UI, maintaining data visibility [9][10]
- Loop supports various models, including OpenAI, Gemini, and custom LLMs [9]
User Engagement & Adoption
- The average organization using Braintrust runs approximately 13 evaluations (evals) per day [3]
- Some advanced customers run over 3,000 evaluations daily and spend more than two hours per day using the product [3]
- Braintrust encourages users to try Loop and provide feedback [12]
Future Vision
- Braintrust anticipates a revolution in AI model evaluation, driven by advancements in frontier models [11]
- The company is focused on incorporating these advancements into its platform [11]
Hiring
- Braintrust is actively hiring for UI, AI, and infrastructure roles [12]
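To make the prompt/dataset/scorer triad concrete, here is a minimal, generic eval loop. This is not the Braintrust SDK; the dataset, task, and scorer names are invented for illustration of the structure an agent like Loop would operate over.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    input: str
    expected: str

# Hypothetical dataset of input/expected pairs.
DATASET = [
    Case("2 + 2", "4"),
    Case("capital of France", "Paris"),
]

def task(case: Case) -> str:
    """Placeholder for a model call built from a prompt template."""
    return {"2 + 2": "4", "capital of France": "paris"}[case.input]

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

def case_insensitive_match(output: str, expected: str) -> float:
    return 1.0 if output.lower() == expected.lower() else 0.0

def run_eval(dataset, task, scorers: dict[str, Callable[[str, str], float]]):
    """Run every case through the task and average each scorer."""
    totals = {name: 0.0 for name in scorers}
    for case in dataset:
        output = task(case)
        for name, scorer in scorers.items():
            totals[name] += scorer(output, case.expected)
    return {name: total / len(dataset) for name, total in totals.items()}

print(run_eval(DATASET, task, {
    "exact": exact_match,
    "case_insensitive": case_insensitive_match,
}))
```

An agent that edits prompts, datasets, and scorers is effectively proposing diffs to the three arguments of `run_eval` and re-running it to see whether the averages improve.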
How to look at your data — Jeff Huber (Chroma) + Jason Liu (567)
AI Engineer· 2025-08-06 16:22
Retrieval System Evaluation
- Industry should prioritize fast and inexpensive evaluations (fast evals) using query-document pairs to enable rapid experimentation (see the sketch after this list) [7]
- Industry can leverage LLMs to generate queries, but should focus on aligning synthetic queries with real-world user queries to avoid misleading results [9][11]
- Industry can empirically validate the performance of new embedding models on specific data using fast evals, rather than relying solely on public benchmarks like MTEB [12]
- Weights & Biases chatbot analysis reveals that the original embedding model (text-embedding-3-small) performed the worst, while the Voyage 3 large model performed the best, highlighting the importance of data-driven evaluation [17][18]
Output Analysis and Product Development
- Industry should extract structured data from user conversations (summaries, tools used, errors, satisfaction, frustration) to identify patterns and inform product development [28][29]
- Industry can use extracted metadata to find clusters and identify segments for targeted improvements, similar to how marketing uses user segmentation [29][26]
- The Kura library enables summarization, clustering, and aggregation of conversations to compare evals across different KPIs, helping to identify areas for improvement [32]
- Industry should focus on providing the right infrastructure and tools to support AI agents, rather than solely focusing on improving the AI itself [39]
- Industry should define evals, find clusters, and compare KPIs across clusters to make informed decisions on what to build, fix, and ignore [40][41]
- Industry should monitor query types and performance over time to understand how the product is being used and identify opportunities for improvement [45]
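A fast eval of the kind described above can be as small as a recall@k computation over labeled query-document pairs. The sketch below is a generic illustration (not code from the talk); the toy `embed` functions are deterministic stand-ins for real embedding models like the ones compared in the summary.

```python
import hashlib
import numpy as np

# Labeled query -> relevant document pairs (hypothetical examples).
PAIRS = [
    ("how do I log metrics?", "doc_logging"),
    ("resume a crashed run", "doc_resume"),
]
DOCS = {
    "doc_logging": "Log metrics inside your training loop.",
    "doc_resume": "Resume an interrupted run by passing resume=True.",
    "doc_sweeps": "Hyperparameter sweeps let you search over configurations.",
}

def _seed(tag: str, text: str) -> int:
    return int(hashlib.md5(f"{tag}:{text}".encode()).hexdigest()[:8], 16)

def toy_embed_a(text: str) -> np.ndarray:
    """Stand-in for embedding model A (e.g. the incumbent model)."""
    return np.random.default_rng(_seed("a", text)).normal(size=64)

def toy_embed_b(text: str) -> np.ndarray:
    """Stand-in for embedding model B (e.g. a candidate replacement)."""
    return np.random.default_rng(_seed("b", text)).normal(size=64)

def recall_at_k(embed, pairs, docs, k=1):
    """Fraction of queries whose relevant doc appears in the top-k results."""
    doc_ids = list(docs)
    doc_vecs = np.stack([embed(docs[d]) for d in doc_ids])
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for query, relevant_id in pairs:
        q = embed(query)
        q = q / np.linalg.norm(q)
        top_k = np.argsort(doc_vecs @ q)[::-1][:k]
        hits += relevant_id in [doc_ids[i] for i in top_k]
    return hits / len(pairs)

for name, embed in [("model_a", toy_embed_a), ("model_b", toy_embed_b)]:
    print(name, "recall@1 =", recall_at_k(embed, PAIRS, DOCS, k=1))
```

Because the pairs are labeled once and the metric is cheap to compute, swapping in a new embedding model is a one-line change, which is what makes this kind of eval fast enough for rapid experimentation.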
Evals Are Not Unit Tests — Ido Pesok, Vercel v0
AI Engineer· 2025-08-06 16:14
Key Takeaways on LLM Evaluation
- LLMs can be unreliable, impacting user experience and application usability [6]
- AI applications are prone to failure in production despite successful demos [7]
- It is crucial to build reliable software on top of LLMs through methods like prompt engineering [8]
Evaluation Strategies and Best Practices
- Evals should focus on relevant user queries and avoid out-of-bounds scenarios [19]
- Data collection methods include thumbs up/down feedback, log analysis, and community forums [21][22][23]
- Evals should test across the entire data distribution to understand system performance [20][24]
- Constants should be factored into data, and variables into tasks, for clarity and reuse [25][26]
- Evaluation scores should be deterministic and simple for easier debugging and team collaboration [29][30]
- Evals should be integrated into CI pipelines to detect improvements and regressions (see the sketch after this list) [34][35]
Vercel's Perspective
- Vercel's v0 is a full-stack web coding platform designed for rapid prototyping and building [1]
- v0 recently launched GitHub sync, enabling code push and pull directly from the platform [2]
- Vercel emphasizes the importance of continuous evaluation to improve AI app reliability and quality [37]
- Vercel has reached 100 million messages sent on v0 [2]
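One minimal way to wire deterministic eval scores into CI (a sketch under stated assumptions, not v0's internal setup) is a pytest test that runs a small golden suite and fails the pipeline if the average score drops below a threshold. The dataset, the placeholder `generate` function, and the threshold are hypothetical.

```python
# test_evals.py - run with `pytest` as a CI step.
import json
import pathlib

# Hypothetical golden cases with simple, deterministic pass criteria.
GOLDEN = [
    {"query": "make a pricing page", "must_contain": "<section"},
    {"query": "add a login form", "must_contain": "<form"},
]

def generate(query: str) -> str:
    """Placeholder for the code-generating system under test."""
    return "<section><form></form></section>"

def score(case: dict) -> float:
    # Deterministic, simple scorer: a substring check, no LLM judge.
    return 1.0 if case["must_contain"] in generate(case["query"]) else 0.0

def test_eval_suite_meets_threshold(tmp_path: pathlib.Path):
    scores = [score(case) for case in GOLDEN]
    mean = sum(scores) / len(scores)
    # Persist results so CI can diff runs and surface regressions over time.
    (tmp_path / "eval_report.json").write_text(json.dumps({"mean": mean}))
    assert mean >= 0.9, f"eval score regressed: {mean:.2f}"
```

Because the scorer is deterministic, a red build always points at a real change in behavior rather than judge noise, which is what makes the eval debuggable by the whole team.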
Vision AI in 2025 — Peter Robicheaux, Roboflow
AI Engineer· 2025-08-03 17:45
AI Vision Challenges & Opportunities
- Computer vision lags behind human vision and language models in intelligence and in leveraging big pre-training [3][8][11]
- Current vision evaluations like ImageNet and COCO are saturated and primarily measure pattern matching, hindering the development of true visual intelligence [5][22]
- Vision models struggle with tasks requiring visual understanding, such as determining the time on a watch or understanding spatial relationships in images [9][10]
- Vision-language pre-training, exemplified by CLIP, may fail to capture subtle visual details not explicitly included in image captions [14][15]
Roboflow's Solution & Innovation
- Roboflow introduces RF-DETR, a real-time object detection model leveraging the DINOv2 pre-trained backbone to address the underutilization of large pre-trainings in visual models [20]
- Roboflow created RF100-VL, a new benchmark comprising 100 diverse object detection datasets, to better measure the intelligence and domain adaptability of visual models [24][25]
- RF100-VL includes challenging domains like aerial imagery, microscopy, and X-rays, and incorporates visual language tasks to assess contextual understanding [25][26][27][28][29]
- Roboflow's benchmark reveals that current vision-language models struggle to generalize in the visual domain compared to the linguistic domain [30]
- Fine-tuning a YOLOv8 nano model from scratch on 10-shot examples performs better than zero-shot Grounding DINO on RF100-VL, highlighting the need for improved visual generalization (a few-shot vs. zero-shot comparison is sketched below) [30][36][37]
Industry Trends & Future Directions
- Transformers are proving more effective than convolutional models at leveraging large pre-training datasets for vision tasks [18]
- The scale of pre-training in the vision world is significantly smaller than in the language world, indicating room for growth [19]
- Roboflow makes its platform freely available to researchers, encouraging open-source data contributions to the community [33]
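As a rough illustration of the few-shot vs. zero-shot comparison described above (not Roboflow's benchmark harness), the sketch below uses the Ultralytics API to train a YOLOv8 nano model from scratch on a hypothetical 10-shot split and report mAP@50 against a placeholder zero-shot baseline score. The dataset YAML path and baseline number are assumptions.

```python
# Sketch only: compare 10-shot fine-tuning against a zero-shot baseline on
# one domain dataset. The YAML path and baseline score are hypothetical.
from ultralytics import YOLO

TEN_SHOT_DATA = "aerial_10shot.yaml"   # hypothetical 10-image train split
ZERO_SHOT_MAP50 = 0.31                 # placeholder zero-shot detector score

# Build YOLOv8 nano from its config (random weights, i.e. trained from scratch).
model = YOLO("yolov8n.yaml")
model.train(data=TEN_SHOT_DATA, epochs=100, imgsz=640)

# Evaluate on the held-out validation split defined in the same YAML.
metrics = model.val(data=TEN_SHOT_DATA)
finetuned_map50 = metrics.box.map50

print(f"10-shot fine-tuned YOLOv8n mAP@50: {finetuned_map50:.3f}")
print(f"zero-shot baseline mAP@50 (placeholder): {ZERO_SHOT_MAP50:.3f}")
```

Repeating this loop across many domain datasets and averaging the per-dataset gap is what turns a single comparison into a generalization benchmark.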
Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
AI Engineer· 2025-08-03 04:34
Core Problem & Solution
- The traditional software development lifecycle is insufficient for AI applications because the models are non-deterministic, requiring a data science approach and continuous experimentation [3]
- The key is to reverse-engineer metrics from real-world scenarios, focusing on product experience and business outcomes rather than abstract data science metrics [6]
- Build evaluations (evals) at the beginning of the process, not at the end, to identify failures and areas for improvement early on [14]
- Continuous improvement of evals and solutions is necessary to reach a baseline benchmark for optimization [19]
Evaluation Methodology
- Evaluations should mimic specific user questions and criteria relevant to the solution's end goal [7]
- Use large language models (LLMs) to generate evaluations, considering different user personas and expected answers [9][11]
- Focus on the details of each evaluation failure to understand the root cause, whether it lies in the test definition or the solution's performance [15]
- Experimentation involves changing models, logic, prompts, or data, and continuously running evaluations to catch regressions [16][18]
Industry-Specific Examples
- For customer support bots, measure the rate of escalation to human support as a key metric [5]
- For text-to-SQL or text-to-graph database applications, create a mock database with known data to validate expected results (a minimal sketch follows this list) [22]
- For call center conversation classifiers, use simple matching to determine if the correct rubric is applied [23]
Key Takeaways
- Evaluate AI applications the way users actually use them, avoiding abstract metrics [24]
- Frequent evaluations enable rapid progress and reduce regressions [25]
- Well-defined evaluations lead to explainable AI, providing insights into how the solution works and its limitations [26]
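A mock-database check of the kind mentioned above can be written against an in-memory SQLite database with known rows: the generated SQL is executed and its result compared to the expected answer. The schema, data, and the placeholder `generate_sql` function below are hypothetical; this is a minimal sketch, not Multinear's implementation.

```python
import sqlite3

def generate_sql(question: str) -> str:
    """Placeholder for the text-to-SQL model under test."""
    return "SELECT COUNT(*) FROM orders WHERE status = 'shipped'"

# Mock database with known data, so the expected result is known in advance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO orders (status) VALUES ('shipped'), ('shipped'), ('pending');
""")

CASES = [
    {"question": "How many orders have shipped?", "expected": [(2,)]},
]

for case in CASES:
    sql = generate_sql(case["question"])
    try:
        result = conn.execute(sql).fetchall()
        ok = result == case["expected"]
    except sqlite3.Error as exc:
        result, ok = f"SQL error: {exc}", False
    print(f"{case['question']!r}: {'PASS' if ok else 'FAIL'} ({result})")
```

Because the data is fixed, any query that returns the right rows passes, regardless of how the model phrased the SQL, which keeps the eval aligned with the business outcome rather than string matching.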
How to Improve your Vibe Coding — Ian Butler
AI Engineer· 2025-08-03 04:32
Agent Performance
- Current agents have a low overall bug-find rate and generate a significant number of false positives [1][2]
- Some agents have a true positive rate of less than 10% for finding bugs [2]
- Three of the six agents benchmarked had a true positive rate of 10% or less across 900+ reports [3]
- One agent produced 70 issues for a single task, all of which were false positives [4]
- Cursor had a 97% false positive rate across 100+ repos and 1,200+ issues [4]
- Thinking models are significantly better at finding bugs in a codebase [8][18]
- Agents do not look at files holistically, leading to high variability across runs [20]
Implications for Developers
- Alert fatigue erodes trust in agents, potentially letting bugs reach production [5]
- Developers are unlikely to sift through numerous false positives to identify actual bugs [4]
Recommendations for Improving Agent Performance
- Use bug-focused rules with scoped instructions detailing security issues and logical bugs [6]
- Prioritize naming explicit classes of bugs in rules, such as auth bypasses or SQL injection [11]
- Require fix validation by ensuring agents write and pass tests before incorporating changes (a validation sketch follows this list) [12]
- Manage context thoroughly by feeding diffs of code changes and preventing key files from being summarized [15]
- Ask agents to create a step-by-step component inventory of the codebase [16]
- Bias the model with specific security information like the OWASP Top 10 [9][10]
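One way to enforce the "require fix validation" recommendation is a small gate script that checks out an agent's proposed patch, runs the test suite, and only keeps the change if the tests pass. The branch name and test command below are assumptions for illustration; this is not a tool from the talk.

```python
# validate_agent_fix.py - reject an agent's patch unless the test suite passes.
# Branch name and test command are hypothetical; adapt them to your repo.
import subprocess
import sys

AGENT_BRANCH = "agent/proposed-fix"
TEST_COMMAND = ["pytest", "-q"]

def run(cmd: list[str]) -> int:
    """Run a command, echoing it first, and return its exit code."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode

def main() -> int:
    # Check out the agent's proposed change.
    if run(["git", "checkout", AGENT_BRANCH]) != 0:
        print("could not check out agent branch")
        return 1

    # Gate: the agent must have written tests that pass alongside its fix.
    if run(TEST_COMMAND) != 0:
        print("tests failed: discarding agent change")
        run(["git", "checkout", "-"])  # return to the previous branch
        return 1

    print("tests passed: change is eligible for review/merge")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Running this gate in CI (or as a pre-merge hook) turns "trust the agent" into "trust the agent's passing tests", which directly addresses the alert-fatigue problem described above.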