Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
AI Engineer· 2025-07-27 16:15
LLM Evaluation Challenges
- Traditional benchmarks often fail to reflect real-world LLM performance, reliability, and user satisfaction [1]
- Evaluating reasoning quality, agent consistency, MCP integration, and user-focused outcomes requires going beyond standard benchmarks [1]
- Benchmarks and leaderboards rarely reflect the realities of production AI [1]

Evaluation Strategies & Frameworks
- The industry needs tangible evaluation strategies using open-source frameworks like GuideLLM and lm-eval-harness [1]
- Custom eval suites tailored to specific use cases are crucial for accurate assessment [1]
- Integrating human-in-the-loop feedback is essential for better user-aligned outcomes [1]

Key Evaluation Areas
- Evaluating reasoning skills, consistency, and reliability in agentic AI applications is critical [1]
- Validating MCP (Model Context Protocol) and agent interactions with practical reliability tests is necessary [1]
- Agent reliability checks should reflect production conditions [1]

Deployment Considerations
- Robust evaluation is critical for confidently deploying LLMs in real-world applications like chatbots, copilots, and autonomous AI agents [1]
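The custom-eval-suite idea above can be sketched as a minimal harness: run a model callable over a set of cases and score each output with a use-case-specific grader. Everything here (the `EvalCase` shape, the `keyword_grader`) is a hypothetical illustration, not an API from GuideLLM or lm-eval-harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected_keyword: str  # a crude, use-case-specific success criterion

def keyword_grader(output: str, case: EvalCase) -> bool:
    # Hypothetical grader: pass if the expected keyword appears in the output.
    return case.expected_keyword.lower() in output.lower()

def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    # Returns the pass rate of the model over the custom eval suite.
    passed = sum(keyword_grader(model(c.prompt), c) for c in cases)
    return passed / len(cases)

# Usage with a stub "model" standing in for a real LLM call:
cases = [
    EvalCase("What protocol do agents use for tool context?", "MCP"),
    EvalCase("Name an open-source eval framework.", "lm-eval-harness"),
]
stub_model = lambda p: "MCP is one option; see lm-eval-harness for evals."
print(run_suite(stub_model, cases))  # -> 1.0
```

In practice the grader is the hard part; human-in-the-loop feedback (as the talk stresses) is what keeps graders aligned with user-perceived quality.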
Why you should care about AI interpretability - Mark Bissell, Goodfire AI
AI Engineer· 2025-07-27 15:30
The goal of mechanistic interpretability is to reverse engineer neural networks. Having direct, programmable access to the internal neurons of models unlocks new ways for developers and users to interact with AI — from more precise steering to guardrails to novel user interfaces. While interpretability has long been an interesting research topic, it is now finding real-world use cases, making it an important tool for AI engineers. About Mark Bissell Mark Bissell is an applied researcher at Goodfire AI worki ...
Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten
AI Engineer· 2025-07-26 17:45
SGLang Overview
- SGLang is an open-source, high-performance serving framework for large language models (LLMs) and vision-language models (VLMs) [5]
- SGLang supports day-zero releases for new models from labs like Qwen and DeepSeek, and has a strong open-source community [7]
- The project has grown rapidly, from a research paper in December 2023 to nearly 15,000 GitHub stars in 18 months [9]

Usage and Adoption
- Baseten uses SGLang as part of its inference stack for various models [8]
- SGLang is also used by xAI for its Grok models, as well as by inference providers, cloud providers, research labs, universities, and product companies like Cursor [8]

Performance Optimization
- SGLang's performance can be tuned using flags and configuration options, such as CUDA graph settings [20]
- EAGLE-3, a speculative decoding algorithm, can be used to improve performance by increasing the token acceptance rate [28][42][43]
- The default CUDA graph max batch size on L4 GPUs is eight, but it can be raised to improve performance [31][36]

Community and Contribution
- The SGLang community is active and welcomes contributions [7][54]
- Developers can get involved by starring the project on GitHub, filing issues, joining the Slack channel, and contributing to the codebase [9][54][55]
- The codebase includes the SGLang runtime, a domain-specific front-end language, and a set of optimized kernels [58]
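The point about acceptance rate can be made concrete with the standard expected-tokens model for speculative decoding: if each of k draft tokens is accepted with probability p (and acceptance stops at the first rejection), the expected number of tokens produced per verification step is (1 - p^(k+1)) / (1 - p). This is a back-of-the-envelope sketch from the speculative decoding literature, not SGLang's or EAGLE-3's actual implementation.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verify step in speculative decoding.

    p: per-token acceptance rate of the draft model (0 <= p < 1)
    k: number of draft tokens proposed per step
    Geometric-acceptance model: (1 - p**(k+1)) / (1 - p).
    """
    return (1 - p ** (k + 1)) / (1 - p)

# A higher acceptance rate (what EAGLE-3 targets) yields more tokens per
# expensive verification pass of the large model, hence higher throughput:
for p in (0.6, 0.8, 0.9):
    print(f"p={p}: {expected_tokens_per_step(p, k=4):.2f} tokens/step")
```

With k = 4 draft tokens, raising p from 0.6 to 0.9 roughly doubles the tokens emitted per verification pass, which is why acceptance rate, not just draft speed, drives the speedup.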
Robotics: why now? - Quan Vuong and Jost Tobias Springenberg, Physical Intelligence
AI Engineer· 2025-07-26 17:00
Sharing recent progress from Physical Intelligence and why it is an exciting time to push the frontier in general purpose robotics About Quan Vuong Quan Vuong is co-founder at Physical Intelligence. His research focuses on generalist robotics and algorithms that enable intelligent behaviors through large scale learning. About Jost Tobias Springenberg Tobias is currently a research scientist at Physical Intelligence where he works on bringing AI into the real world and understanding the fundamentals of seque ...
Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo
AI Engineer· 2025-07-26 17:00
Autonomous Driving History and Challenges
- Autonomous driving research started in the 1980s with simple neural networks and evolved to end-to-end driving models by 2020 [2]
- Scaling autonomous driving presents challenges, requiring solutions for long-tail events and rare scenarios [5][7]
- Foundation models like Gemini show promise in generalizing to rare driving events and producing appropriate responses [8][9][10][11]

EMMA: A Multimodal Large Language Model for Autonomous Driving
- The company is exploring EMMA, a driving system built on Gemini that uses routing text and camera input to predict future waypoints [11][12][13][14]
- EMMA is self-supervised, camera-only, and HD-map-free, achieving state-of-the-art quality on the nuScenes benchmark [15][16][17]
- Chain-of-thought reasoning is incorporated into EMMA, allowing the model to explain its driving decisions and improving performance on a 100k dataset [17]

Evaluation and Validation
- Evaluation is crucial for the success of autonomous driving models, including open-loop evaluation, simulations, and real-world testing [25]
- Generative models are being explored for sensor simulation, to evaluate the planner under conditions like rain and different times of day [26][27][28]

Future Directions
- The company aims to improve generalization and scale autonomous driving by leveraging foundation models [30]
- Training on larger datasets improves the quality of the planner [19][20]
- The company is exploring training on additional tasks, such as 3D detection and road graph estimation, to create a more generalizable model [21][22][23][24]
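Predicting waypoints as text, as described for EMMA, implies a post-processing step that parses coordinate pairs out of model output. A minimal sketch follows; the output format shown is invented for illustration, not EMMA's actual serialization.

```python
import re

def parse_waypoints(text: str) -> list[tuple[float, float]]:
    # Extracts "(x, y)" pairs from a model's textual waypoint prediction,
    # e.g. "Waypoints: (1.0, 0.2), (2.1, 0.5)". Format is hypothetical.
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", text)
    return [(float(x), float(y)) for x, y in pairs]

print(parse_waypoints("Waypoints: (1.0, 0.2), (2.1, 0.5), (3.3, 0.9)"))
# -> [(1.0, 0.2), (2.1, 0.5), (3.3, 0.9)]
```

The appeal of the text interface is that the same multimodal model can also emit a natural-language rationale alongside the waypoints, which is what the chain-of-thought point above refers to.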
Ship Production Software in Minutes, Not Months — Eno Reyes, Factory
AI Engineer· 2025-07-25 23:11
Core Argument
- Factory believes agentic systems will radically change software development, transitioning from human-driven to agent-driven development [2]
- The company emphasizes that AI tools are only as good as the context they receive, and providing comprehensive context is crucial for effective AI-assisted development [14][15][16]
- The company advocates for using agents at every stage of development, including planning and design, by delegating groundwork and research to AI agents [18][19][20]

Technological Advancements
- The company's "droids" can ingest tasks, ground themselves in the environment, search codebases, and generate pull requests that pass CI [12][13]
- The company's platform integrates natively with various data sources, enabling agents to access and utilize information from across the organization [17]
- The company's system can condense incident response search efforts from hours to minutes by pulling context from relevant system logs, past incidents, and team discussions [31][32]

Enterprise Solutions & Security
- Factory is an enterprise platform focused on security, auditability, and ownership concerns related to AI agents in large organizations [41][42]
- The company offers a platform with controls to address security concerns and emphasizes the importance of responsible AI implementation within enterprises [43]
- The company provides 20 million free tokens for users to try out the droids [40]

Future of Software Development
- The industry is moving from executing to orchestrating systems, with developers managing agents and building patterns that supersede the inner loop of software development [27][38]
- The future belongs to developers who can effectively work with AI agents, with clear communication skills being paramount [39]
- AI agents amplify individual capabilities, allowing developers to focus on higher-leverage tasks and the outer loop of software development [37][38]
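The droid workflow described above (ingest a task, gather context, propose a change, gate on CI) reduces to a retry loop around context assembly and verification. The sketch below is generic and hypothetical; none of these function names come from Factory's platform.

```python
from typing import Callable, Optional

def run_droid(task: str,
              gather_context: Callable[[str], str],
              propose_patch: Callable[[str, str], str],
              ci_passes: Callable[[str], bool],
              max_attempts: int = 3) -> Optional[str]:
    """Hypothetical agent loop: retry patch generation until CI passes."""
    context = gather_context(task)  # codebase search, docs, logs, incidents
    for _ in range(max_attempts):
        patch = propose_patch(task, context)
        if ci_passes(patch):
            return patch  # ready to open a pull request
    return None  # escalate to a human

# Stub usage; the lambdas stand in for real search, LLM, and CI calls:
patch = run_droid(
    "fix flaky test",
    gather_context=lambda t: "relevant files, logs, past incidents",
    propose_patch=lambda t, c: "diff --git ...",
    ci_passes=lambda p: True,
)
print(patch)  # -> diff --git ...
```

The structure makes the talk's context point concrete: `gather_context` runs once and feeds every attempt, so the quality of that step bounds the quality of every patch.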
Beyond the Prototype: Using AI to Write High-Quality Code - Josh Albrecht, Imbue
AI Engineer· 2025-07-25 23:10
Imbue's Focus and Sculptor's Purpose
- Imbue is focused on creating more robust and useful AI agents, specifically software agents, with Sculptor as its main product [1]
- Sculptor aims to bridge the gap between AI-generated code and production-ready code, addressing the challenges of using AI coding tools in established codebases [3]
- The goal of Sculptor is to build user trust in AI-generated code by using another AI system to identify potential problems like race conditions or exposed API keys [7][8]

Key Technical Decisions and Features of Sculptor
- Sculptor emphasizes synchronous and immediate feedback on code changes to facilitate early problem detection and resolution [9][10]
- Sculptor encourages users to learn existing solutions, plan before coding, write specs and docs, and adhere to strict style guides to prevent errors in AI-generated code [11][12][13][15][16][18]
- Sculptor helps detect outdated code and documentation, highlights inconsistencies, and suggests style guide improvements to maintain code quality [17][18][19]

Error Detection and Prevention Strategies in Sculptor
- Sculptor integrates automated tools like linters to detect and automatically fix errors in AI-generated code [21][22]
- Sculptor promotes writing tests, especially with AI assistance, to ensure code correctness and prevent unintended behavior changes [25][26][27]
- Sculptor advocates for functional-style coding, happy- and unhappy-path unit tests, and integration tests to improve test effectiveness [28][29][30][33]
- Sculptor utilizes LLMs to check for various issues, including style guide violations, missing specs, and unimplemented features, allowing for custom best practices [38]

Future of AI-Assisted Development
- Imbue is interested in integrating other developer tools for debugging, logging, tracing, profiling, and automated quality assurance into Sculptor [42][44]
- The company anticipates that improved contextual search systems and AI models will further enhance the development experience [43]
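The layered checks described above (linters, LLM reviewers, custom best-practice rules) amount to running a list of detectors over a code change and collecting findings. A generic, hypothetical sketch, not Sculptor's implementation:

```python
from typing import Callable, List, NamedTuple

class Finding(NamedTuple):
    check: str
    message: str

# Each check inspects a code string and returns findings (possibly none).
Check = Callable[[str], List[Finding]]

def no_hardcoded_keys(code: str) -> List[Finding]:
    # Toy detector for the "exposed API keys" class of problems; a real
    # checker would use entropy heuristics or an LLM reviewer.
    if "API_KEY=" in code:
        return [Finding("no_hardcoded_keys", "possible hardcoded API key")]
    return []

def run_checks(code: str, checks: List[Check]) -> List[Finding]:
    # Flatten every detector's findings into one report.
    return [f for check in checks for f in check(code)]

findings = run_checks('API_KEY="sk-123"\nprint("hi")', [no_hardcoded_keys])
print(findings)
```

Because each check is just a function from code to findings, mechanical linters and LLM-backed reviewers can share one pipeline, which is what makes custom best practices pluggable.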
Your Coding Agent Just Got Cloned And Your Brain Isn't Ready - Rustin Banks, Google Jules
AI Engineer· 2025-07-25 23:06
Product Introduction & Features
- Jules is introduced as an asynchronous coding agent designed to run in the background and handle parallel tasks, launched at Google I/O [1]
- Jules aims to automate routine coding tasks, such as Firebase SDK updates, and to enable development from a phone [1]
- Jules is powered by Gemini 2.5 Pro [18]

Parallelism & Use Cases
- Two types of parallelism are emerging: multitasking and multiple variations, where agents try different approaches to a task [11]
- Users are leveraging multiple variations to test different libraries or approaches for front-end tasks like adding drag-and-drop functionality [11]
- Jules is used to add tests with Jest and Playwright, comparing test coverage to choose the best option [4][5]
- Jules is used to add a calendar link feature, run accessibility audits, and improve Lighthouse scores [5][6][13]

Workflow & Best Practices
- AI can assist in task creation from backlogs and bug reports, as well as in merging code and handling merge conflicts [3][14]
- Remote agents in the cloud offer infinite scalability and continuous connectivity, enabling development from any device [14]
- A clear definition of success and a robust merge-and-test framework are crucial for effective parallel workflows [14][15]
- Providing ample context, including documentation links, improves the agent's ability to understand and execute tasks [18]
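The "multiple variations" pattern (dispatch several agent attempts, score each, keep the best) reduces to an argmax over candidates once a success metric exists, which is why the talk stresses having a clear definition of success. A hypothetical sketch; the scoring dict stands in for a metric like test coverage:

```python
from typing import Callable, List

def best_variation(task: str,
                   agents: List[Callable[[str], str]],
                   score: Callable[[str], float]) -> str:
    """Run each agent variation on the task; keep the highest-scoring result."""
    results = [agent(task) for agent in agents]
    return max(results, key=score)

# Stub usage: two variations try different test libraries, and measured
# coverage (hypothetical numbers) decides between them.
agents = [lambda t: "jest version", lambda t: "playwright version"]
coverage = {"jest version": 0.72, "playwright version": 0.85}
print(best_variation("add tests", agents, coverage.get))  # -> playwright version
```

Without an automatic `score`, selection falls back to a human, which turns the promised parallelism back into serial review work.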
Human-seeded Evals — Samuel Colvin, Pydantic
AI Engineer· 2025-07-25 07:00
In this talk I'll introduce the concept of Human-seeded Evals, explain the principle, and demo them with Pydantic Logfire.

Related links: https://x.com/samuel_colvin · https://www.linkedin.com/in/samuel-colvin/ · https://github.com/samuelcolvin · https://pydantic.dev/ ...
Building AI Products That Actually Work — Ben Hylak (Raindrop), Sid Bendre (Oleve)
AI Engineer· 2025-07-24 17:15
My name is Ben Hylak, and I'm feeling really grateful to be with all of you today. We're here to talk about building AI products that actually work; I'll introduce my co-speaker in a second. I tweeted last night asking what we should talk about today, and the overwhelming response was: please, no more evals. Apparently there are a lot of eval tracks. We'll touch on evals sti ...