Large Language Models

Actions speak louder than LLMs (behavioral AI) | Rickard Brüel Gabrielsson | TEDxMIT
TEDx Talks· 2025-07-29 15:15
I'm so excited to be here. Well, actually, I'm not. I mean, I'm excited, but I'm mostly nervous and I'm scared to mess up. I mean, maybe I just did. But how come it's so easy for me to lie to you? And how come we're often incentivized to lie to each other? I think one problem is that talk is cheap. Incidentally, this is also the type of lies and cheap data that we train our current artificial intelligence on, namely large language models. This is also why we say that actions speak louder than words. But if that's t ...
Make your LLM app a Domain Expert: How to Build an Expert System — Christopher Lovejoy, Anterior
AI Engineer· 2025-07-28 19:55
Core Problem & Solution:
- Vertical AI applications face a "last mile problem" in understanding industry-specific context and workflows, which is more critical than model sophistication [4][6]
- Anterior proposes an "adaptive domain intelligence engine" to convert customer-specific domain insights into performance improvements [17]
- The engine consists of measurement (performance evaluation) and improvement (iterative refinement) components [17]
Measurement & Metrics:
- Defining key performance metrics that users care about is crucial, such as minimizing false approvals in healthcare or preventing dollar loss from fraud [18][19][20]
- Developing a failure mode ontology helps categorize and analyze different ways the AI can fail, enabling targeted improvements [21][22]
- Combining metric tracking with failure mode analysis allows prioritization of development efforts based on the impact on key metrics [26][27]
Iteration & Improvement:
- Failure mode labeling creates ready-made datasets for iterative model improvement, using production data to ensure relevance [29]
- Domain experts can suggest changes to the application pipeline and provide new domain knowledge to enhance performance [32][33]
- This process enables rapid iteration, potentially fixing issues the same day by adding relevant domain knowledge and validating with evals [37]
Domain Expertise:
- The level of domain expertise required depends on the specific workflow and optimization goals, with clinical reasoning requiring experienced doctors [38][39]
- Bespoke tooling is recommended for integrating domain expert feedback into the platform and workflows [41]
- Domain expert reviews provide performance metrics, failure modes, and suggested improvements, all in one [38]
Results & Performance:
- Anterior achieved a 95% accuracy baseline in approving care requests, which was further improved to 99% through iterative refinement using the described system [14][15]
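The measurement loop summarized above (a key metric such as false approvals, plus failure-mode labels taken from domain-expert reviews) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Anterior's implementation: the `Review` schema, the failure-mode label names, and the sample reviews are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Review:
    """One domain-expert review of a production case (hypothetical schema)."""
    case_id: str
    ai_decision: str              # "approve" or "escalate"
    expert_decision: str          # what the expert would have decided
    failure_mode: Optional[str]   # ontology label, or None if the AI was right


def is_false_approval(review: Review) -> bool:
    """The AI approved a case that the expert would not have approved."""
    return review.ai_decision == "approve" and review.expert_decision != "approve"


def false_approval_rate(reviews: List[Review]) -> float:
    """Share of AI approvals that the expert disagreed with."""
    approvals = [r for r in reviews if r.ai_decision == "approve"]
    if not approvals:
        return 0.0
    return sum(is_false_approval(r) for r in approvals) / len(approvals)


def rank_failure_modes(reviews: List[Review]) -> List[Tuple[str, int]]:
    """Count failure-mode labels among false approvals, most frequent first."""
    counts = Counter(
        r.failure_mode for r in reviews if is_false_approval(r) and r.failure_mode
    )
    return counts.most_common()


# Made-up sample data, for illustration only.
REVIEWS = [
    Review("c1", "approve", "approve", None),
    Review("c2", "approve", "escalate", "missing_guideline_knowledge"),
    Review("c3", "approve", "escalate", "missing_guideline_knowledge"),
    Review("c4", "approve", "escalate", "misread_clinical_note"),
    Review("c5", "escalate", "escalate", None),
]

if __name__ == "__main__":
    print(false_approval_rate(REVIEWS))
    print(rank_failure_modes(REVIEWS))
```

Ranking failure modes by how often they sit behind the key metric is one simple way to decide what to fix first; the labeled cases in each bucket can then double as a regression set when a fix is validated.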
AI: Inclusive and Transformative | Manish Gupta | TEDxIITGandhinagar
TEDx Talks· 2025-07-28 16:02
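AI Development and Applications:
- DeepMind's mission is to build AI responsibly to benefit humanity; deep learning has become the best approach for problems such as image classification, speech recognition, and machine translation [5][6]
- The Transformer architecture enabled the construction of large language models, which are trained on large amounts of public data and can solve a wide range of problems [8]
- Modern foundation models (LLMs) have moved beyond text to become multimodal models that can process text, handwritten text, and images, opening up possibilities for learning modes such as personalized tutoring [11][12]
- Gemini 1.5 Pro supports a context window of up to 1 million multimodal tokens, so it can take large amounts of information as input [15]
- AI agents are not limited to simple chatbots; they can also interact by voice, and even interact in real time in 3D worlds [16]
Inclusivity and Accessibility of AI:
- The industry is working to close the gap in AI capability between English and other languages (especially Indian languages), with the goal of developing models that understand more than 125 Indian languages [19][20][21][22]
- The Vani project, a collaboration with the Indian Institute of Science, aims to collect speech data from every corner of India, with the goal of gathering data from every region of the country to cover more zero-corpus languages [24][25]
AI Applications in Specific Domains:
- The industry is building the foundational layer of a digital agriculture stack, using satellite imagery to identify farm field boundaries, crop types, and water sources in order to offer farmers personalized services such as crop insurance [26][27][28]
- By predicting protein structures, AlphaFold cut research that used to take 5 years down to seconds; it predicted 200 million protein structures in less than a year and made the data freely available, greatly accelerating scientific discovery [29][30][31][32]
Future Outlook:
- The industry hopes AI can help more people, enabling them to make Nobel Prize-level contributions [35]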
X @Avi Chawla
Avi Chawla· 2025-07-28 06:30
Overview:
- Taipy is an open-source Python AI & data web application builder [1]
- Taipy can build prototypes and robust production-ready data apps [1]
Technology & Features:
- Taipy eliminates the need to learn JavaScript, CSS, or HTML [1]
- Taipy's VS Code extension provides no-code functionalities to build data apps [2]
- Taipy is presented as a more robust version of Streamlit [1]
- Taipy has a noticeable latency difference compared to other apps [1]
Community & Adoption:
- Taipy is fully open-source with over 18 thousand stars [2]
Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
AI Engineer· 2025-07-27 16:15
LLM Evaluation Challenges:
- Traditional benchmarks often fail to reflect real-world LLM performance, reliability, and user satisfaction [1]
- Evaluating reasoning quality, agent consistency, MCP integration, and user-focused outcomes requires going beyond standard benchmarks [1]
- Benchmarks and leaderboards rarely reflect the realities of production AI [1]
Evaluation Strategies & Frameworks:
- The industry needs tangible evaluation strategies using open-source frameworks like GuideLLM and lm-eval-harness [1]
- Custom eval suites tailored to specific use cases are crucial for accurate assessment [1]
- Integrating human-in-the-loop feedback is essential for better user-aligned outcomes [1]
Key Evaluation Areas:
- Evaluating reasoning skills, consistency, and reliability in agentic AI applications is critical [1]
- Validating MCP (Model Context Protocol) and agent interactions with practical reliability tests is necessary [1]
- Agent reliability checks should reflect production conditions [1]
Deployment Considerations:
- Robust evaluation is critical for confidently deploying LLMs in real-world applications like chatbots, copilots, or autonomous AI agents [1]
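As a rough illustration of what a custom eval suite with a consistency check might look like, here is a minimal sketch in plain Python. It deliberately does not use the GuideLLM or lm-eval-harness APIs; `EvalCase`, `run_suite`, `stub_model`, and the sample cases are hypothetical names invented for this example, and `stub_model` stands in for a real LLM call.

```python
from typing import Callable, Dict, List, NamedTuple


class EvalCase(NamedTuple):
    """A use-case-specific test: a prompt plus a pass/fail check (hypothetical)."""
    prompt: str
    check: Callable[[str], bool]


def run_suite(
    model: Callable[[str], str], cases: List[EvalCase], repeats: int = 3
) -> List[Dict[str, object]]:
    """Run each case several times; report pass rate and output consistency."""
    results = []
    for case in cases:
        outputs = [model(case.prompt) for _ in range(repeats)]
        passes = [case.check(out) for out in outputs]
        results.append(
            {
                "prompt": case.prompt,
                "pass_rate": sum(passes) / repeats,
                # Consistent means every repeat produced the identical string.
                "consistent": len(set(outputs)) == 1,
            }
        )
    return results


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for a real model call (assumption for the demo)."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")


# Made-up cases; a real suite would be drawn from the target use case.
CASES = [
    EvalCase("What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalCase("Capital of France?", lambda out: "paris" in out.lower()),
    EvalCase("Capital of Australia?", lambda out: "canberra" in out.lower()),
]

if __name__ == "__main__":
    for row in run_suite(stub_model, CASES):
        print(row)
```

Repeating each prompt is a simple way to surface nondeterminism; with a real, sampled model, exact string equality is usually too strict, and a semantic comparison or a human-in-the-loop review would be the more realistic check.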
Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo
AI Engineer· 2025-07-26 17:00
Autonomous Driving History and Challenges:
- Autonomous driving research started in the 1980s with simple neural networks and evolved to end-to-end driving models by 2020 [2]
- Scaling autonomous driving presents challenges, requiring solutions for long-tail events and rare scenarios [5][7]
- Foundation models, like Gemini, show promise in generalizing to rare driving events and providing appropriate responses [8][9][10][11]
EMMA: A Multimodal Large Language Model for Autonomous Driving:
- The company is exploring EMMA, a driving system leveraging Gemini, which uses routing text and camera input to predict future waypoints [11][12][13][14]
- EMMA is self-supervised, camera-only, and HD-map-free, achieving state-of-the-art quality on the nuScenes benchmark [15][16][17]
- Chain-of-thought reasoning is incorporated into EMMA, allowing the model to explain its driving decisions and improve performance on a 100k dataset [17]
Evaluation and Validation:
- Evaluation is crucial for the success of autonomous driving models, including open-loop evaluation, simulations, and real-world testing [25]
- Generative models are being explored for sensor simulation to evaluate the planner under various conditions like rain and different times of day [26][27][28]
Future Directions:
- The company aims to improve generalization and scale autonomous driving by leveraging foundation models [30]
- Training on larger datasets improves the quality of the planner [19][20]
- The company is exploring training on various tasks, such as 3D detection and road graph estimation, to create a more generalizable model [21][22][23][24]
X @The Wall Street Journal
The Wall Street Journal· 2025-07-23 10:59
Vulnerability & Mitigation:
- Large language models like Grok have vulnerabilities that need to be addressed immediately [1]
- Addressing vulnerabilities is crucial as these models gain capabilities beyond language generation [1]
X @The Wall Street Journal
The Wall Street Journal· 2025-07-22 19:57
Large language models aren’t replacing traditional browsers anytime soon, but they have become another responsibility for brands https://t.co/n8m7uemRHr ...
X @Bloomberg
Bloomberg· 2025-07-22 11:22
Technology & Finance Convergence:
- Large language models are predicted to possess the technical capability to make real investment decisions for clients within five years [1]
X @Avi Chawla
Avi Chawla· 2025-07-21 06:39
Today, we are covering the 4 stages of building LLMs from scratch to make them applicable for real-world use cases.
We'll cover:
- Pre-training
- Instruction fine-tuning
- Preference fine-tuning
- Reasoning fine-tuning
The visual summarizes these techniques.
Let's dive in! https://t.co/SiqzXaiZd0 ...