Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
AI Engineer· 2025-06-27 10:27
So I'm here to talk about taming rogue AI agents, but essentially I want to talk about evaluation-driven development, observability-driven development, and really why we need observability. So, who uses AI? Is that Jim's most stupid question of the day? Probably. Who trusts AI? Right. If you'd like to meet me after, I've got some snake oil you might be interested in buying. Yeah, we do not trust AI in the slightest. Now, different question. Who reads books? That's reading books. If you want some recommenda ...
The State of AI Powered Search and Retrieval — Frank Liu, MongoDB (prev Voyage AI)
AI Engineer· 2025-06-27 09:57
Welcome, everybody. I want to thank you for coming to this session today. Today I want to talk about AI-powered search and retrieval. For those of you who don't know me, my name is Frank. I am part of the Voyage AI team, and we recently joined MongoDB, I want to say about three to four months ago. Just a quick introduction to Voyage AI: we build the most accurate, cost-effective embedding models and rerankers for RAG and semantic search. A lot of the a ...
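The first-stage retrieval that embedding models power can be sketched with plain cosine similarity over vectors. The toy 3-dimensional vectors below stand in for real model embeddings; this is a minimal sketch of the retrieval step, not Voyage AI's API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-dimensional "embeddings" standing in for model output.
query = [0.9, 0.1, 0.0]
docs = {
    "doc_a": [0.8, 0.2, 0.1],
    "doc_b": [0.0, 0.1, 0.9],
}

# First-stage retrieval: rank all documents by similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # doc_a is closest to the query
```

In a production pipeline a reranker would then rescore this shortlist with a more expensive model before the top results reach the LLM.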
"Data readiness" is a Myth: Reliable AI with an Agentic Semantic Layer — Anushrut Gupta, PromptQL
AI Engineer· 2025-06-27 09:40
Problem Statement
- "Data readiness" is a myth; achieving perfect data for AI is an unattainable pipe dream [1][2][3]
- Fortune 500 companies lose an average of $250 million due to poor data quality [7]
- Traditional semantic layers and knowledge graphs are insufficient for capturing the nuances of business language and tribal knowledge [8][9][10][11][12][13][14]

Solution: Agentic Semantic Layer (PromptQL)
- PromptQL is presented as a "day zero smart analyst" AI system that learns and improves over time through course correction and steering [17][18][19][20]
- It uses a domain-specific language (DSL) for data retrieval, computation, aggregation, and semantics, decoupling LLM plan generation from execution [21][22]
- The system allows editing the AI's "brain" to correct its understanding and guide its learning [28]
- It incorporates a prompt learning layer to improve the semantic graph and create a company-specific business language [31]
- The semantic layer is version controlled, allowing fallback to previous builds [33]

Key Features and Benefits
- Correctable, explainable, and steerable AI that improves with use [19]
- Handles messy data and understands business context [24][25]
- Reduces months of setup work to an immediate start, enabling faster AI deployments [37]
- Self-improving; claimed to achieve 100% accuracy on complex tasks [37]

Demonstrated Capabilities
- The system can understand what "revenue" means and perform calculations [23]
- It can identify and correct errors in data, such as incorrect status values [24]
- It can integrate data from multiple databases and SaaS applications [25][27]
- It can summarize support tickets and extract sentiment [26][29]
- It can learn the meaning of custom terms and relationships between tables [35][36]

Customer Validation
- A Fortune 500 food chain company and a high-growth fintech company report achieving 100% accurate AI using PromptQL [38]
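The version-controlled, correctable semantic layer idea can be sketched as a history of term-definition builds with rollback. Everything here (the class, the term definitions, the rollback behavior) is invented for illustration, not PromptQL's actual implementation:

```python
# A hypothetical sketch of a correctable, version-controlled semantic layer:
# each "build" maps business terms to concrete definitions, and a bad
# correction can be rolled back to an earlier build.
class SemanticLayer:
    def __init__(self):
        self.builds = [{}]          # version history; build 0 is empty

    @property
    def current(self):
        return self.builds[-1]

    def correct(self, term, definition):
        # "Editing the AI's brain": record a new build containing the fix.
        build = dict(self.current)
        build[term] = definition
        self.builds.append(build)

    def rollback(self):
        # Fall back to the previous build if a correction misfires.
        if len(self.builds) > 1:
            self.builds.pop()

layer = SemanticLayer()
layer.correct("revenue", "SUM(orders.total) WHERE status = 'completed'")
layer.correct("revenue", "SUM(orders.total)")   # a bad edit...
layer.rollback()                                # ...undone safely
print(layer.current["revenue"])
```

The point of the sketch is the shape of the workflow: corrections are appended, never destructive, so any build can serve as a fallback.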
Building Agentic Applications w/ Heroku Managed Inference and Agents — Julián Duque & Anush Dsouza
AI Engineer· 2025-06-27 09:38
Heroku Managed Inference and Agents Platform Overview
- The platform enables developers to build agentic applications that can reason, make decisions, and trigger actions [1]
- It supports provisioning and deploying LLMs, running untrusted code securely in multiple languages, and extending agents with the Model Context Protocol (MCP) [1]

Key Capabilities
- Facilitates the deployment and management of LLMs [1]
- Supports secure execution of untrusted code in Python, Node.js, Go, and Ruby [1]
- The Model Context Protocol (MCP) can be used to extend agent capabilities [1]

Target Applications
- Suitable for building internal tools, developer assistants, or customer-facing AI features [1]
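The reason-decide-act loop such a platform implies can be sketched as a tool registry plus a dispatcher that executes model-proposed actions. The tool name and payload shape below are invented for illustration; this is a vendor-neutral sketch, not Heroku's or MCP's actual API:

```python
# Minimal agent-action dispatch: the model proposes an action, the runtime
# routes it to a registered tool, and the result is fed back to the model.
TOOLS = {}

def tool(name):
    # Decorator that registers a function as a callable tool.
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("lookup_order")
def lookup_order(order_id: str) -> dict:
    # Stand-in for a real database or API call.
    return {"order_id": order_id, "status": "shipped"}

def run_agent_step(action: dict) -> dict:
    # Dispatch one model-proposed action to its registered tool.
    fn = TOOLS[action["tool"]]
    return fn(**action["arguments"])

result = run_agent_step({"tool": "lookup_order",
                         "arguments": {"order_id": "A-42"}})
print(result["status"])  # shipped
```

MCP standardizes exactly this boundary (tool discovery, schemas, and invocation) so agents can gain capabilities without bespoke glue code.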
Prompt Engineering is Dead — Nir Gazit, Traceloop
AI Engineer· 2025-06-27 09:34
Core Argument
- The presentation challenges the notion of "prompt engineering" as a true engineering discipline, arguing that iterative prompt improvement can be automated [1][2]
- The speaker advocates an alternative approach to prompt optimization built on evaluators and automated agents [23]

Methodology & Implementation
- The company developed a chatbot for its website documentation using a Retrieval-Augmented Generation (RAG) pipeline [2]
- The RAG pipeline consists of a Chroma database, OpenAI, and prompts for answering questions about the documentation [7]
- An evaluator assesses the RAG pipeline's responses against a dataset of questions and expected answers [5][7]
- The evaluator uses a ground-truth-based LLM-as-a-judge, checking whether generated answers contain specific facts [10][13]
- An agent automatically improves prompts by researching online guides, running evaluations, and regenerating prompts based on failure reasons [5][18][19]
- The agent uses CrewAI to reason, call the evaluator, and regenerate prompts based on best practices [20]

Results & Future Considerations
- The initial prompt scored 0.4 (40%); after two iterations with the agent, the score improved to 0.9 (90%) [21][22]
- The company acknowledges the risk of overfitting to the 20-example dataset and suggests splitting it into train/test sets for better generalization [24][25]
- Future work may apply the same automated optimization to the evaluator and agent prompts [27]
- The demo is available in the traceloop/autoprompting demo repository [27]
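The evaluate-and-regenerate loop can be sketched with deterministic stubs in place of the LLM calls. Only the control flow (score against a ground-truth judge, rewrite on failure, repeat) mirrors the talk; the dataset, the scoring stub, and every name below are invented:

```python
# Stub dataset of (question, expected fact) pairs.
DATASET = [("q1", "fact1"), ("q2", "fact2"), ("q3", "fact3"),
           ("q4", "fact4"), ("q5", "fact5")]

def answer(prompt: str, question: str) -> str:
    # Stub model: a "better" prompt (more refinements) surfaces more facts.
    quality = prompt.count("*")       # stand-in for prompt quality
    idx = int(question[1])
    return f"fact{idx}" if idx <= 2 + 3 * quality else "unknown"

def judge(expected: str, got: str) -> bool:
    # Ground-truth judge: does the answer contain the expected fact?
    return expected in got

def evaluate(prompt: str) -> float:
    hits = sum(judge(exp, answer(prompt, q)) for q, exp in DATASET)
    return hits / len(DATASET)

def improve(prompt: str) -> str:
    # Stub "agent": in the talk this researches guides and rewrites the
    # prompt from failure reasons; here it just marks it as refined.
    return prompt + " *"

prompt = "Answer from the docs."
score = evaluate(prompt)
while score < 0.9:                    # iterate until the target is hit
    prompt = improve(prompt)
    score = evaluate(prompt)
print(score)
```

The overfitting caveat from the talk applies directly here: with only a handful of examples, the loop should be scored on a held-out test split, not the set it optimizes against.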
ICCV 2025 Results Are Out: 24% Acceptance Rate. Did You Snag Your Ticket to Hawaii?
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- The article discusses the significant increase in submissions to ICCV 2025, reflecting rapid growth in computer vision and the strain that volume places on peer review [3][26][31]

Submission and Acceptance Data
- ICCV 2025 received 11,239 valid submissions and accepted 2,699 papers, an acceptance rate of 24% [3][4]
- By comparison, ICCV 2023 had 8,260 submissions and accepted 2,160 papers, an acceptance rate of approximately 26.15% [6]
- Historically, ICCV 2021 had 6,152 submissions with a 26.20% acceptance rate, and ICCV 2019 had 4,323 submissions with a 25% acceptance rate [6]

Peer Review Challenges
- Despite the increase in submissions, the acceptance rate has remained relatively stable at around 25% to 26% [4]
- ICCV 2025 implemented a new accountability and integrity policy, identifying 25 irresponsible reviewers and rejecting 29 associated papers [4][5]
- The article highlights growing challenges in peer review as submission volumes exceed 10,000, with NeurIPS expected to surpass 30,000 submissions [31]

Recommendations for the Peer Review System
- The article advocates a two-way feedback loop in which authors evaluate review quality while reviewers receive formal recognition [34][38]
- It suggests a systematic reviewer reward mechanism to incentivize high-quality reviews [38]
- It emphasizes the need for peer review reforms to address fairness and accountability [36][37]
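The quoted acceptance rates follow directly from the submission and acceptance counts above:

```python
# Recomputing the acceptance rates from the raw counts in the article.
stats = {"ICCV 2023": (8260, 2160), "ICCV 2025": (11239, 2699)}
rates = {conf: accepted / subs for conf, (subs, accepted) in stats.items()}
for conf, rate in rates.items():
    print(f"{conf}: {rate:.2%}")   # 2023 -> 26.15%, 2025 -> 24.01%
```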
AI Luminary Karpathy's Viral Talk: The Software 3.0 Era Has Arrived, and Prompts Are the New Code
36Kr· 2025-06-20 12:18
Core Insights
- Andrej Karpathy emphasizes that LLMs (Large Language Models) should enhance human capabilities rather than replace them, presenting a new perspective on the evolution of programming languages and AI [3][10][32]

Group 1: LLM as an Ecosystem
- Karpathy compares LLMs to operating systems rather than simple commodities, highlighting their complexity and the significant capital investment their development requires [4][6]
- He categorizes LLM providers into closed-source (like Windows and macOS) and open-source (like Linux), illustrating the intricate software ecosystem surrounding LLMs [6][8]

Group 2: Automation and User Interaction
- Current text-based interaction with LLMs is not sustainable; Karpathy advocates GUIs (graphical user interfaces) to enhance user experience and efficiency [11][13]
- He outlines three prerequisites for automating LLM products: perception, action, and supervision, emphasizing that AI systems must remain accessible and manageable by humans [15][17]

Group 3: Educational Implications
- Karpathy stresses the importance of structured education in AI, warning against unstructured commands that could lead to ineffective teaching outcomes [23][24]
- He proposes collaboration between teachers and AI to create structured courses, ensuring quality and direction in education [24]

Group 4: Psychological Aspects of AI
- Karpathy believes that LLMs exhibit human-like psychological traits because they are trained on vast amounts of human-written text, which encodes both strengths and weaknesses [26][29]
- He notes that LLMs combine the potential for exceptional capabilities with significant cognitive flaws, drawing parallels to human conditions [27][29]

Group 5: Market Timing and Adoption
- The current landscape presents a unique opportunity for entering the industry, as LLMs reached consumers before being widely adopted by governments and enterprises [31]
- Karpathy's insights reflect a continuous, iterative thinking process, essential for those learning to navigate the evolving AI landscape [32]
Case Study + Deep Dive: Telemedicine Support Agents with LangGraph/MCP - Dan Mason
AI Engineer· 2025-06-17 18:58
Industry Focus: Autonomous Agents in Healthcare
- The workshop explores building autonomous agents for managing complex processes like multi-day medical treatments [1]
- The system aims to help patients self-administer medication regimens at home [1]
- A key challenge is enabling agents to adhere to protocols while handling unexpected patient situations [1]

Technology Stack
- The solution uses a hybrid system of code and prompts, leveraging LLM decision-making to drive a web application, message queue, and database [1]
- The stack includes LangGraph/LangSmith, Claude, MCP, Node.js, React, MongoDB, and Twilio [1]
- Treatment blueprints, designed in Google Docs, guide the LLM-powered agents [1]

Agent Evaluation and Human Support
- The system uses LLM-as-a-judge evaluation to assess interaction complexity [1]
- Complex interactions are escalated to human support when needed [1]

Key Learning Objectives
- How to build a hybrid system of code and prompts that leverages LLM decisioning [1]
- How to design and maintain flexible agentic workflow blueprints [1]
- How to create an agent evaluation system [1]
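The complexity-based escalation pattern described above can be sketched as a judge plus a router. The keyword stub below stands in for a real LLM judge, and the threshold and red-flag list are invented for illustration:

```python
# Route each patient interaction either to the autonomous agent or to
# human support, based on a complexity score from a (stubbed) judge.
ESCALATION_THRESHOLD = 0.7

def judge_complexity(message: str) -> float:
    # Stub judge: phrases suggesting the treatment protocol no longer
    # applies raise the complexity score. A real system would ask an LLM.
    red_flags = ("side effect", "missed a dose", "emergency", "pain")
    hits = sum(flag in message.lower() for flag in red_flags)
    return min(1.0, 0.4 * hits)

def route(message: str) -> str:
    score = judge_complexity(message)
    return "human_support" if score >= ESCALATION_THRESHOLD else "agent"

print(route("Reminder received, taking my dose now."))          # agent
print(route("I missed a dose and I'm in pain, what do I do?"))  # human_support
```

Keeping the judge separate from the agent means the escalation policy can be tuned, audited, and versioned independently of the conversational logic.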
Gaming Godfather John Carmack: LLMs Are Not the Future of Games
AI前线· 2025-06-16 07:37
Author: John Carmack | Translator: 明知山 | Editor: Tina

Quick Background

Id Software
Id Software was founded in the early 1990s. As one of its founders, I helped develop Commander Keen, Wolfenstein 3D, Doom, and the Quake series. I'm deeply proud that Quake drove the development and adoption of GPUs, indirectly contributing to the formation of the modern AI world. DeepMind's DMLab environment was also built on a sanitized version of Quake III Arena.

Armadillo Aerospace
In parallel, I spent ten years at Armadillo Aerospace working on vertical takeoff, vertical landing (VTVL) rockets.

Oculus
More recently, at Oculus (later acquired by Meta), I laid the technical foundations for modern virtual reality.

Keen Technologies
While I was still at Meta, the founders of OpenAI reached out to me. I was deeply honored, but I am not an AI specialist. I did a great deal of reading, formed some views on the current state of the field, and ultimately decided this was the most important thing I could take part in. Moving from systems engineering to research work was a very big change for me, but I have enjoyed the process. Being able to work with the father of reinforcement learning, Richard S ...
Chime Board Member Shawn Carolan talks today's IPO
CNBC Television· 2025-06-12 21:17
It opened at $43. So with us is Chime board member and Menlo Ventures partner, Shawn Carolan. Shawn, it's great to have you on the program, and I'm going to start right there. What's got investors so excited about Chime? Is it this sort of new era of fintech and what Chime represents, or is it a company-specific story? >> I think every company is a company-specific story. And Chime set out with a new vision of how banking could treat their customers. Reimagining just from the ground up. Like, what does it look ...