X @Avi Chawla
Avi Chawla· 2025-06-30 06:33
Avi Chawla (@_avichawla): A Python decorator is all you need to trace LLM apps (open-source). Most LLM evals treat the app like an end-to-end black box. But LLM apps need component-level evals and tracing, since the issue can be anywhere inside the box, like the retriever, a tool call, or the LLM itself. https://t.co/dWXyJb3DNs ...
If you found it insightful, reshare it with your network. Find me → @_avichawla. Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
AI Engineer· 2025-06-27 10:27
AI Agent Evaluation & Observability
- The industry emphasizes the necessity of observability in AI development, particularly for evaluation-driven development [1]
- AI trustworthiness is a significant concern, highlighting the need for robust evaluation methods [1]
- Detecting problems in AI is challenging due to its non-deterministic nature, making traditional unit testing difficult [1]

AI-Driven Evaluation
- The industry suggests using AI to evaluate AI, leveraging its ability to understand and identify issues in AI systems [1]
- LLMs can be used to score the performance of other LLMs, with the recommendation to use a better (potentially more expensive or custom-trained) LLM for evaluation than the one used in the primary application [2]
- Galileo offers a custom-trained small language model (SLM) designed for effective AI evaluations [2]

Implementation & Metrics
- Evaluations should be integrated from the beginning of the AI application development process, including prompt engineering and model selection [2]
- Granularity in evaluation is crucial, requiring analysis at each step of the AI workflow to identify failure points [2]
- Key metrics include action completion (did the agent finish the task?) and action advancement (did it move toward the goal?) [2]

Continuous Improvement & Human Feedback
- AI can provide insights and suggestions for improving agent performance based on evaluation data [3]
- Human feedback is essential to validate and refine AI-generated metrics, ensuring accuracy and continuous learning [4]
- Real-time prevention and alerting are necessary to catch rogue AI agents before they cause issues in production [8]
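The action-completion and action-advancement metrics above can be sketched as a small judge harness. Everything here is hypothetical: `call_judge` is a keyword-matching stand-in for a real call to a stronger judge LLM, not Galileo's evaluation model.

```python
# Hypothetical sketch: a stronger "judge" model scores outputs of the
# application model. call_judge stands in for a real LLM API call.

def call_judge(task, output):
    # Stand-in verdict: a real judge would prompt an LLM and parse its answer.
    text = output.lower()
    return {
        "action_completion": "refund issued" in text,   # did it finish the task?
        "action_advancement": "refund" in text,          # did it move toward the goal?
    }

def evaluate_step(task, output):
    scores = call_judge(task, output)
    # Flag steps that advanced the goal but never completed it.
    scores["stalled"] = scores["action_advancement"] and not scores["action_completion"]
    return scores

s1 = evaluate_step("process refund", "Looked up the refund policy.")
s2 = evaluate_step("process refund", "Refund issued to the customer.")
```

Scoring each step separately (rather than only the final answer) is what lets the evaluation pinpoint where a multi-step agent stalls.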
The State of AI Powered Search and Retrieval — Frank Liu, MongoDB (prev Voyage AI)
AI Engineer· 2025-06-27 09:57
Voyage AI & MongoDB Partnership
- Voyage AI was acquired by MongoDB approximately 3-4 months ago [1]
- The partnership aims to create a single data platform for embedding, re-ranking, query augmentation, and query decomposition [29][30][31]

AI-Powered Search & Retrieval
- AI-powered search finds related concepts beyond identical wording and understands user intent [7][8][9]
- Embedding quality is a core component, with 95-99% of systems using embeddings [12]
- Real-world applications include chatting with codebases, where evaluation is crucial for choosing the best embedding model and LLM for the specific application [14][15]
- Structured data, beyond embeddings, is often necessary for powerful search and retrieval systems, such as filtering by state or document type in legal documents [16][17][18]
- Agentic retrieval introduces feedback loops: the search system is no longer pure input-output but can expand or decompose queries [19][20]

Future Trends
- The future of AI-powered search is multimodal, understanding images, text, and audio together [23][24][25]
- Instruction tuning will allow steering vectors based on instructions, enabling more specific document retrieval [27][28]
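The embedding-based retrieval at the core of these systems reduces to nearest-neighbor search over vectors. A toy sketch with hand-made 3-dimensional vectors (a real system would embed text with a model such as a Voyage or OpenAI embedding endpoint and store the vectors in a database; the document names here are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" standing in for real model output.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "api reference": [0.1, 0.9, 0.1],
    "press release": [0.0, 0.2, 0.9],
}

def search(query_vec, top_k=1):
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:top_k]

hits = search([0.8, 0.2, 0.1])  # query vector near the "refund policy" doc
```

Because similarity is computed in vector space rather than on surface strings, a query phrased with different wording than the document can still rank it first, which is the "related concepts beyond identical wording" property the talk describes.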
"Data readiness" is a Myth: Reliable AI with an Agentic Semantic Layer — Anushrut Gupta, PromptQL
AI Engineer· 2025-06-27 09:40
Problem Statement
- Data readiness is a myth; achieving perfect data for AI is an unattainable pipe dream [1][2][3]
- Fortune 500 companies lose an average of $250 million due to poor data quality [7]
- Traditional semantic layers and knowledge graphs are insufficient for capturing the nuances of business language and tribal knowledge [8][9][10][11][12][13][14]

Solution: Agentic Semantic Layer (PromptQL)
- PromptQL is presented as a "day zero smart analyst" AI system that learns and improves over time through course correction and steering [17][18][19][20]
- It uses a domain-specific language (DSL) for data retrieval, computation, aggregation, and semantics, decoupling LLM plan generation from execution [21][22]
- The system allows editing the AI's "brain" to correct its understanding and guide its learning [28]
- It incorporates a prompt learning layer to improve the semantic graph and create a company-specific business language [31]
- The semantic layer is version controlled, allowing fallback to previous builds [33]

Key Features and Benefits
- Correctable, explainable, and steerable AI that improves with use [19]
- Ability to handle messy data and understand business context [24][25]
- Reduces months of setup work to an immediate start, enabling faster AI deployments [37]
- Self-improving; claimed to achieve 100% accuracy on complex tasks [37]

Demonstrated Capabilities
- The system can understand what revenue means and perform calculations [23]
- It can identify and correct errors in data, such as incorrect status values [24]
- It can integrate data from multiple databases and SaaS applications [25][27]
- It can summarize support tickets and extract sentiment [26][29]
- It can learn the meaning of custom terms and relationships between tables [35][36]

Customer Validation
- A Fortune 500 food chain company and a high-growth fintech company report achieving 100% accurate AI using PromptQL [38]
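The idea of decoupling LLM plan generation from execution can be sketched as follows. This is a generic illustration under invented names (`generate_plan` is a stubbed planner, the mini-DSL is made up); it does not reflect PromptQL's actual DSL or implementation.

```python
# Hypothetical sketch: the LLM (stubbed) emits a small declarative plan, and a
# deterministic executor runs it against the data, so messy values are handled
# in code rather than trusted to the model.

ORDERS = [
    {"status": "shipped", "revenue": 120.0},
    {"status": "SHIPPED ", "revenue": 80.0},   # messy status value
    {"status": "cancelled", "revenue": 50.0},
]

def generate_plan(question):
    # Stand-in for the LLM planner; a real planner would emit a plan per question.
    return [
        {"op": "filter", "field": "status", "equals": "shipped"},
        {"op": "sum", "field": "revenue"},
    ]

def execute(plan, rows):
    result = rows
    for step in plan:
        if step["op"] == "filter":
            # Normalise casing/whitespace instead of trusting raw data.
            result = [r for r in result
                      if str(r[step["field"]]).strip().lower() == step["equals"]]
        elif step["op"] == "sum":
            result = sum(r[step["field"]] for r in result)
    return result

revenue = execute(generate_plan("total shipped revenue"), ORDERS)
```

Because the plan is data rather than free-form model output, it can be inspected, corrected, and version controlled, which is the "editable brain" property the talk emphasizes.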
Building Agentic Applications w/ Heroku Managed Inference and Agents — Julián Duque & Anush Dsouza
AI Engineer· 2025-06-27 09:38
Heroku Managed Inference and Agents Platform Overview
- The platform enables developers to build agentic applications that can reason, make decisions, and trigger actions [1]
- It supports provisioning and deploying LLMs, running untrusted code securely in multiple languages, and extending agents with the Model Context Protocol (MCP) [1]

Key Capabilities
- Heroku Managed Inference and Agents facilitates the deployment and management of LLMs [1]
- The platform supports secure execution of untrusted code in Python, Node.js, Go, and Ruby [1]
- The Model Context Protocol (MCP) can be used to extend agent capabilities [1]

Target Applications
- The platform is suitable for building internal tools, developer assistants, or customer-facing AI features [1]
Prompt Engineering is Dead — Nir Gazit, Traceloop
AI Engineer· 2025-06-27 09:34
Core Argument
- The presentation challenges the notion of "prompt engineering" as a true engineering discipline, arguing that iterative prompt improvement can be automated [1][2]
- The speaker advocates an alternative approach to prompt optimization built on evaluators and automated agents [23]

Methodology & Implementation
- The company built a chatbot for its website documentation using a Retrieval-Augmented Generation (RAG) pipeline [2]
- The RAG pipeline consists of a Chroma database, OpenAI, and prompts to answer questions about the documentation [7]
- An evaluator assesses the RAG pipeline's responses against a dataset of questions and expected answers [5][7]
- The evaluator uses a ground-truth-based LLM as a judge, checking whether generated answers contain specific facts [10][13]
- An agent automatically improves prompts by researching online guides, running evaluations, and regenerating prompts based on failure reasons [5][18][19]
- The agent uses Crew AI to reason, call the evaluator, and regenerate prompts based on best practices [20]

Results & Future Considerations
- The initial prompt scored 0.4 (40%); after two iterations with the agent, the score improved to 0.9 (90%) [21][22]
- The company acknowledges the risk of overfitting to the training data (20 examples) and suggests splitting the data into train/test sets for better generalization [24][25]
- Future work may apply the same automated optimization to the evaluator and agent prompts [27]
- The demo is available in the traceloop/autoprompting demo repository [27]
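The evaluate-and-regenerate loop described above can be sketched as a few lines of Python. Both the evaluator and the rewriter are stubs invented here (a real setup would use an LLM judge and a prompt-writing agent such as the Crew AI one in the talk); the "required facts" are placeholder instructions.

```python
REQUIRED = {"cite sources", "be concise", "answer from the docs"}

def evaluate(prompt):
    # Stand-in judge: fraction of required instructions present in the prompt.
    present = {fact for fact in REQUIRED if fact in prompt}
    return len(present) / len(REQUIRED)

def regenerate(prompt, failures):
    # Stand-in agent: fold one missing instruction back into the prompt.
    return prompt + " " + failures.pop()

def optimize(prompt, target=0.9, max_iters=5):
    """Loop: score the prompt, and while it misses the target, rewrite it."""
    for _ in range(max_iters):
        score = evaluate(prompt)
        if score >= target:
            break
        missing = {f for f in REQUIRED if f not in prompt}
        prompt = regenerate(prompt, missing)
    return prompt, evaluate(prompt)

prompt, score = optimize("Answer the question.")
```

The talk's train/test caveat applies directly: if `evaluate` only ever sees the same 20 examples, the loop can overfit to them, so held-out questions are needed to trust the final score.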
ICCV 2025 Results Are Out! A 24% Acceptance Rate: Did You Snag Your Ticket to Hawaii?
具身智能之心· 2025-06-26 14:19
Core Viewpoint
- The article discusses the sharp increase in submissions to ICCV 2025, reflecting rapid growth in computer vision and the strain the volume places on peer review [3][26][31]

Submission and Acceptance Data
- ICCV 2025 received 11,239 valid submissions and accepted 2,699 papers, an acceptance rate of 24% [3][4]
- By comparison, ICCV 2023 had 8,260 submissions and accepted 2,160 papers, an acceptance rate of approximately 26.15% [6]
- Historically, ICCV 2021 had 6,152 submissions with a 26.20% acceptance rate, and ICCV 2019 had 4,323 submissions with a 25% acceptance rate [6]

Peer Review Challenges
- Despite the increase in submissions, the acceptance rate has remained relatively stable at around 25-26% [4]
- ICCV 2025 implemented a new accountability and integrity policy, identifying 25 irresponsible reviewers and rejecting 29 associated papers [4][5]
- Peer review grows harder as submission volumes exceed 10,000, with NeurIPS expected to surpass 30,000 submissions [31]

Recommendations for Peer Review System
- The article advocates a two-way feedback loop in which authors evaluate review quality while reviewers receive formal recognition [34][38]
- It suggests a systematic reviewer reward mechanism to incentivize high-quality reviews [38]
- Reforms of the peer review system are needed to address fairness and accountability [36][37]
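The quoted acceptance rates follow directly from the submission and acceptance counts given above:

```python
# year: (valid submissions, accepted papers), from the article's figures
counts = {
    2023: (8260, 2160),
    2025: (11239, 2699),
}

rates = {year: round(100 * accepted / submitted, 2)
         for year, (submitted, accepted) in counts.items()}
# 2023 -> 26.15%, 2025 -> 24.01% (reported as "24%")
```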
AI Guru Karpathy's Talk Goes Viral: The Software 3.0 Era Has Arrived, and Prompts Are the New Code
36Kr· 2025-06-20 12:18
Core Insights
- Andrej Karpathy emphasizes that LLMs (Large Language Models) should enhance human capabilities rather than replace them, presenting a new perspective on the evolution of programming languages and AI [3][10][32]

Group 1: LLM as an Ecosystem
- Karpathy compares LLMs to operating systems rather than simple commodities, highlighting their complexity and the significant capital investment their development requires [4][6]
- He categorizes LLM providers into closed-source (like Windows and macOS) and open-source (like Linux), illustrating the intricate software ecosystem surrounding LLMs [6][8]

Group 2: Automation and User Interaction
- The current text-based interaction with LLMs is not sustainable; Karpathy advocates GUIs (graphical user interfaces) to improve user experience and efficiency [11][13]
- He outlines three prerequisites for automating LLM products: perception, action, and supervision, emphasizing that AI systems must remain accessible and manageable by humans [15][17]

Group 3: Educational Implications
- Karpathy stresses the importance of structured education in AI, warning against unstructured commands that could lead to ineffective teaching outcomes [23][24]
- He proposes collaboration between teachers and AI to create structured courses, ensuring quality and direction in education [24]

Group 4: Psychological Aspects of AI
- Karpathy believes LLMs exhibit human-like psychological traits because they are trained on vast amounts of human-written text, which encodes both strengths and weaknesses [26][29]
- He notes that LLMs combine exceptional capabilities with significant cognitive flaws, drawing parallels to human conditions [27][29]

Group 5: Market Timing and Adoption
- The current landscape presents a unique opportunity for entering the industry, as LLMs reached consumers before being widely adopted by governments and enterprises [31]
- Karpathy's insights reflect a continuously iterative thinking process, essential for those learning to navigate the evolving AI landscape [32]
Case Study + Deep Dive: Telemedicine Support Agents with LangGraph/MCP - Dan Mason
AI Engineer· 2025-06-17 18:58
Industry Focus: Autonomous Agents in Healthcare
- The workshop explores building autonomous agents for managing complex processes like multi-day medical treatments [1]
- The system helps patients self-administer medication regimens at home [1]
- A key challenge is enabling agents to adhere to protocols while handling unexpected patient situations [1]

Technology Stack
- The solution is a hybrid system of code and prompts, using LLM decision-making to drive a web application, message queue, and database [1]
- The stack includes LangGraph/LangSmith, Claude, MCP, Node.js, React, MongoDB, and Twilio [1]
- Treatment blueprints, designed in Google Docs, guide the LLM-powered agents [1]

Agent Evaluation and Human Support
- The system uses LLM-as-a-judge evaluation to assess interaction complexity [1]
- Complex interactions are escalated to human support when needed [1]

Key Learning Objectives
- How to build a hybrid system of code and prompts that leverages LLM decisioning [1]
- How to design and maintain flexible agentic workflow blueprints [1]
- How to create an agent evaluation system [1]
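The escalation pattern (LLM-as-a-judge rates interaction complexity, humans handle anything above a threshold) can be sketched as below. `judge_complexity` is a keyword-matching stand-in invented here, not the workshop's actual LangGraph/Claude judge, and the threshold and flag list are placeholders.

```python
ESCALATION_THRESHOLD = 0.7

def judge_complexity(message):
    # Stand-in for an LLM-as-a-judge call that flags off-protocol situations.
    red_flags = ["missed", "side effect", "chest pain"]
    hits = sum(flag in message.lower() for flag in red_flags)
    return min(1.0, 0.3 + 0.4 * hits)  # baseline 0.3, +0.4 per red flag

def route(message):
    """Send routine messages to the agent, complex ones to human support."""
    score = judge_complexity(message)
    return ("human", score) if score >= ESCALATION_THRESHOLD else ("agent", score)

routine = route("Took my 8am dose as scheduled.")
urgent = route("I missed a dose and now have chest pain.")
```

Keeping the judge separate from the agent means the escalation policy can be tuned (or audited) without touching the treatment-blueprint prompts themselves.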
Gaming Legend John Carmack: LLMs Are Not the Future of Games
AI前线· 2025-06-16 07:37
Core Viewpoint
- The article discusses the evolution and challenges of AI in gaming and virtual environments, emphasizing interactive learning experiences over traditional pre-training. It critiques the limitations of large language models (LLMs) and highlights the need for more effective learning frameworks in AI development [16][18][19]

Group 1: Background and Development
- Id Software, founded in the 1990s, played a significant role in developing iconic games that contributed to GPU advancements and the modern AI landscape [3]
- The author has extensive experience across tech companies, including Armadillo Aerospace and Oculus, focusing on virtual reality technologies [6][8]

Group 2: Learning and AI Models
- The article critiques the effectiveness of LLMs, arguing that many people do not fully understand their limitations, particularly in learning from new environments [16]
- It emphasizes interactive learning: AI should learn through experience, as humans and animals do, rather than relying solely on pre-trained models [16][18]

Group 3: Gaming and AI Interaction
- Traditional gaming AI often relies on internal game structures, which can amount to cheating; cloud gaming could mitigate this issue [18]
- Current AI models need enormous amounts of experience (e.g., 200 million frames) to reach human-level performance in games [20][34]

Group 4: Challenges in AI Learning
- Continuous, efficient, lifelong learning remains an open challenge for AI, despite being something even simple animals accomplish easily [20]
- Many AI systems struggle to learn in complex environments, and traditional reinforcement learning frameworks may not suit all scenarios [30][32]

Group 5: Future Directions
- The author proposes a mixed approach to learning environments, combining passive and interactive content to enhance AI learning capabilities [22]
- New benchmarks should be established to evaluate AI performance across various games, focusing on long-term learning and retention of skills [95][97]