AI Engineer

The 2025 AI Engineering Report — Barr Yaron, Amplify
AI Engineer· 2025-08-01 22:51
AI Engineering Landscape
- The AI engineering community is broad, technical, and growing, with the "AI Engineer" title expected to gain more ground [5]
- Many seasoned software developers are AI newcomers: nearly half of those with 10+ years of experience have worked with AI for three years or less [7]

LLM Usage and Customization
- Over half of respondents use LLMs for both internal and external use cases, with OpenAI models dominating external, customer-facing applications [8]
- LLM users apply them across multiple use cases: 94% use them for at least two and 82% for at least three [9]
- Retrieval-Augmented Generation (RAG) is the most popular customization method, used by 70% of respondents [10]
- Parameter-efficient fine-tuning methods like LoRA/Q-LoRA are strongly preferred, mentioned by 40% of fine-tuners [12]

Model and Prompt Management
- Over 50% of respondents update their models at least monthly, with 17% doing so weekly [14]
- 70% of respondents update prompts at least monthly, and 10% do so daily [14]
- A significant 31% of respondents lack any system for managing their prompts [15]

Multimodal AI and Agents
- Image, video, and audio usage lag text usage significantly, indicating a "multimodal production gap" [16][17]
- Audio has the highest intent to adopt among those not currently using it, with 37% planning to eventually adopt audio [18]
- While 80% of respondents say LLMs are working well, fewer than 20% say the same about agents [20]

Monitoring and Evaluation
- Most respondents use multiple methods to monitor their AI systems: 60% use standard observability and over 50% rely on offline evaluation [22]
- Human review remains the most popular method for evaluating model and system accuracy and quality [23]
- 65% of respondents use a dedicated vector database [24]

Industry Outlook
- The mean guess for the percentage of the US Gen Z population that will have AI girlfriends/boyfriends is 26% [27]
- Evaluation is the number one most painful part of AI engineering today [28]
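The RAG figure above can be made concrete with a minimal retrieval sketch: embed a query, rank documents by cosine similarity, and prepend the top hits to the prompt. Everything here (the toy corpus, the hand-written embeddings, the prompt template) is hypothetical; a real system would use a learned embedding model and a vector database.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, k=2):
    # corpus: list of (text, embedding) pairs; return top-k texts.
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, contexts):
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQ: {question}"

# Hypothetical two-dimensional "embeddings" chosen by hand for illustration.
corpus = [
    ("Invoices are due in 30 days.", [1.0, 0.1]),
    ("The office is closed on Fridays.", [0.0, 1.0]),
    ("Late invoices incur a 2% fee.", [0.9, 0.2]),
]
prompt = build_prompt("When are invoices due?", retrieve([1.0, 0.0], corpus))
print(prompt)
```

The grounding effect is the point: the model only sees the two invoice-related documents, not the irrelevant one.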
Agents vs Workflows: Why Not Both? — Sam Bhagwat, Mastra.ai
AI Engineer· 2025-08-01 16:00
[Music] Okay. Agents or workflows: why not both? Thank you, Alex, for the nice intro. Like he said, I used to be a co-founder of Gatsby. I wrote a book called Principles of AI Agents, which is floating around; hopefully many of you have gotten a copy, and we have more around the conference. There was a big debate a couple of months ago on Twitter, which people may have noticed, and which I just referenced. And I think this is a big reason why we're ...
Why We Don’t Need More Data Centers - Dr. Jasper Zhang, Hyperbolic
AI Engineer· 2025-08-01 15:00
Market Trend & Problem Statement
- AI will converge with everything in the future, and demand for GPUs and data centers is exploding [4]
- By 2030, four times as many data centers need to be built, four times faster than today [5]
- In the US alone, the data center supply gap will exceed 15 gigawatts by 2030 [8]
- Enterprise and corporate GPUs sit idle 80% of the time [9]
- Building data centers is hard: costs are high (the first Stargate data center cost over $1 billion) and grid interconnection is slow (up to a 7-year wait to connect a 100 MW facility to the grid in Northern Virginia) [6][7]
- GPUs and data centers consume 4% of total US electricity, with poor environmental sustainability and significant CO2 emissions [8]

Proposed Solution & Hyperbolic's Approach
- The industry needs a GPU marketplace or aggregation layer that pools different data centers and GPU providers to solve GPU users' problems [10]
- Hyperbolic is building a global orchestration layer called HyperDOS (Hyperbolic Distributed Operating System), similar to Kubernetes: any cluster can join the network by installing the software [11]
- Users can rent GPUs in several ways: spot instances, on-demand, long-term reservations, or hosted models [11]
- On Hyperbolic's GPU marketplace an H100 costs $0.99 per hour, versus $11 per hour for Google's on-demand GPUs [13]
- A unified distribution channel makes these large price reductions possible [13][14]
- Hyperbolic is building a unified platform: instead of vetting individual data centers, startups and companies simply pick the best-rated or best-priced one, with GPU performance benchmarked on the platform [16]

Benefits & Cost Savings
- A GPU marketplace can cut costs by 50% to 75% [13]
- With Hyperbolic, costs can drop from $43.8 million to $6.9 million, a 6x saving [19]
- With more compute for the same budget, model quality improves and productivity can increase 6x [20]
- Selling idle GPUs to others helps them access cheaper compute [20]

Future Vision
- The GPU marketplace will grow into an all-in-one platform for different AI workloads, including AI inference (online and offline) and training jobs [21]
- Rather than only building new data centers, which consume enormous energy and land, the industry should better reuse and recycle idle compute [21]
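A quick back-of-the-envelope check of the cost figures quoted in the talk. The per-hour rates and the $43.8M/$6.9M headline example are from the talk itself; the script just does the arithmetic.

```python
# Rates quoted in the talk (USD per H100 GPU-hour).
google_rate = 11.0        # Google on-demand
marketplace_rate = 0.99   # Hyperbolic marketplace

# Per-hour saving versus on-demand pricing.
rate_saving = 1 - marketplace_rate / google_rate
print(f"per-hour saving vs on-demand: {rate_saving:.0%}")  # 91%

# The talk's headline example: $43.8M reduced to $6.9M.
before, after = 43.8e6, 6.9e6
print(f"headline saving: {before / after:.1f}x")  # 6.3x, quoted as "6x"
```

Note the two figures differ (the 11.1x rate ratio vs the 6.3x headline), so the headline example presumably mixes rate types or reservation terms not broken out in the talk.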
Flipping the Inference Stack — Robert Wachen, Etched
AI Engineer· 2025-08-01 14:30
Scalability Challenges in AI Inference
- Current AI inference systems rely on brute-force scaling, adding more GPUs per user, leading to unsustainable compute demands and spiraling costs [1]
- Real-time use cases are bottlenecked by latency and per-user costs [1]

Proposed Solution
- Rethinking hardware is the only way to unlock real-time AI at scale [1]

Key Argument
- The current approach to inference is not scalable [1]
Infrastructure for the Singularity — Jesse Han, Morph
AI Engineer· 2025-08-01 14:30
AI Agent Transition
- AI agents are transitioning from experimental tools to practical coworkers [1]
- This transition demands new infrastructure for RL training, test-time scaling, and deployment [1]

Morph Labs' Innovation
- Morph Labs developed Infinibranch to address the infrastructure needs of AI agents [1]
- Morph Labs is building the infrastructure for the singularity [1]
- Infinibranch enables scaling train-time and test-time search for agentic reasoning models [1]

Leadership
- Jesse Han is the Founder and CEO of Morph Labs [1]
- Jesse Han previously worked at OpenAI on test-time compute scaling, GPT-4, and reasoning [1]
Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA
AI Engineer· 2025-08-01 13:45
Challenges in LLM Inference
- LLM inference systems face challenges related to latency, cost, and output quality, impacting user experience, profitability, and applicability [1]
- The trade-offs between cost, throughput, latency, and quality define a Pareto frontier, limiting the successful application of LLM systems [1]

NVIDIA Dynamo and Inference Techniques
- NVIDIA Dynamo, a datacenter-scale distributed inference framework, aims to improve the Pareto frontier of inference systems [1]
- Techniques employed include disaggregation (separating LLM generation phases), speculation (predicting multiple tokens per cycle), KV routing, storage, and manipulation (avoiding redundant work), and pipelining improvements for agents (accelerating workflows) [1]

Key Inference Optimization Strategies
- Disaggregation enhances efficiency by separating the phases of LLM generation [1]
- Speculation predicts multiple tokens per cycle to improve throughput [1]
- KV routing, storage, and manipulation prevent redoing work, optimizing resource utilization [1]
- Pipelining improvements for agents accelerate workflows by leveraging agent information [1]
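The speculation technique mentioned above can be illustrated with a toy draft-and-verify loop: a cheap draft model proposes several tokens, the target model verifies them, and the longest agreeing prefix is kept, so one expensive verification pass can yield multiple tokens. Both "models" below are deterministic stand-ins over integer tokens, not anything from Dynamo; this is only a sketch of the general idea.

```python
def draft_model(prefix, k):
    # Hypothetical fast model: usually guesses prev + 1, but deliberately
    # errs whenever the correct guess would be a multiple of 5.
    out, prev = [], prefix[-1]
    for _ in range(k):
        guess = prev + 1
        if guess % 5 == 0:
            guess += 1  # deliberate mistake, forcing a rejection
        out.append(guess)
        prev = guess
    return out

def target_model(prefix, proposals):
    # Hypothetical target model: the "true" next token is always prev + 1.
    # Accept proposals while they match; on the first mismatch, emit the
    # target's own token so every cycle makes progress.
    accepted, prev = [], prefix[-1]
    for tok in proposals:
        if tok == prev + 1:
            accepted.append(tok)
            prev = tok
        else:
            accepted.append(prev + 1)
            break
    return accepted

def generate(prefix, n_tokens, k=4):
    tokens, cycles = list(prefix), 0
    while len(tokens) - len(prefix) < n_tokens:
        proposals = draft_model(tokens, k)
        tokens.extend(target_model(tokens, proposals))
        cycles += 1
    return tokens[len(prefix):len(prefix) + n_tokens], cycles

out, cycles = generate([0], 12)
print(f"{len(out)} tokens in {cycles} verification cycles")  # 12 tokens in 5 cycles
```

Because most draft tokens are accepted, 12 tokens cost only 5 verification cycles here instead of 12 autoregressive steps, which is the throughput win speculation targets.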
Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily
AI Engineer· 2025-07-31 18:56
Core Technology & Product Offering
- Daily provides global infrastructure for realtime audio/video and AI, and launched Pipecat, an open-source, vendor-neutral project to help developers build reliable, high-performance voice AI agents [2][3]
- The Pipecat framework includes native telephony support, plug-and-play with multiple telephony providers such as Twilio and Pivo, plus a fully open-source audio-intelligence turn-taking model [12][13]
- Pipecat Cloud is the first open-source voice AI cloud, designed to host code built specifically for voice AI problems, supporting more than 60 models and services [14][15]
- Daily launched Pipecat Cloud as a lightweight wrapper around Docker and Kubernetes, optimized for voice AI and addressing fast startup, autoscaling, and realtime performance [29]

Voice AI Agent Development & Challenges
- Building a voice agent involves writing code, deploying it, and connecting users; expectations for voice AI are high, demanding agents that understand, are intelligent, hold a conversation, and sound natural [5][6]
- Voice AI agents must respond quickly, targeting an 800 ms voice-to-voice response time, while also judging accurately when to respond [7][8]
- Developers use frameworks like Pipecat to avoid writing complex turn detection, interruption handling, and context management code, so they can focus on business logic and user experience [10]
- Voice AI faces unique challenges around long sessions, low-latency network protocols, and autoscaling, where cold-start time is critical [25][26][30]
- Major voice AI challenges include background noise triggering unnecessary LLM interruptions, and agent non-determinism [38][40]

Model & Service Ecosystem
- Pipecat supports many models and services, including OpenAI's audio models and Gemini's multimodal live API, for conversational flows and game interactions [15][19][22]
- The industry is exploring next-generation research models such as Moshi and Sesame, which use continuous bidirectional streaming architectures but are not yet production-ready [49][56]
- Gemini performs well in native audio-input mode and is competitively priced, but the model is less reliable in audio mode than in text mode [61][53]
- Ultravox is a speech model built on a Llama 3 7B backbone; if Llama 3 70B meets your needs, Ultravox is a good option [57][58]

Deployment & Infrastructure
- Daily provides endpoints worldwide, routed over the AWS or OCI backbone, to optimize latency and satisfy data-privacy requirements [47]
- For geographically distant users, such as those in Australia, deploy close to the inference servers, or run open-weights models locally [42][44]
- The main advantage of speech-to-speech models is that they preserve information a transcription step would lose, such as mixed languages, though the scarcity of audio training data can cause problems [63][67]
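The 800 ms voice-to-voice target mentioned above is easiest to reason about as a latency budget summed across pipeline stages. The stage breakdown below is illustrative only, not Pipecat's actual numbers; the point is that every stage eats into a shared budget, which is why cold starts and slow first tokens are fatal.

```python
# Illustrative voice-to-voice latency budget (all stage numbers hypothetical).
budget_ms = 800
stages = {
    "endpointing / turn detection": 200,
    "speech-to-text (final tokens)": 150,
    "LLM time-to-first-token": 250,
    "text-to-speech first audio": 120,
    "network round trips": 60,
}

total = sum(stages.values())
print(f"total: {total} ms, headroom: {budget_ms - total} ms")
for name, ms in stages.items():
    print(f"  {name:32s} {ms:4d} ms ({ms / total:.0%})")
```

With only tens of milliseconds of headroom in a budget like this, any cold start or queuing delay pushes the response past the point where the conversation feels natural.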
[Voice Keynote] Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)
AI Engineer· 2025-07-31 16:00
Sean DuBois of OpenAI and Pion, and Kwindla Hultman Kramer of Daily and Pipecat, will talk about why you have to design realtime AI systems from the network layer up. Most people who build realtime AI apps and frameworks get it wrong. They build from either the model out or the app layer down. But unless you start with the network layer and build up, you'll never be able to deliver realtime audio and video streams reliably. And perhaps even worse, you'll get core primitives wrong: interruption handling, con ...
Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit
AI Engineer· 2025-07-31 16:00
ChatGPT Advanced Voice Mode isn’t interrupting just you. Interruptions, and turn-taking in general, are unsolved problems for all Voice AI agents. Nobody likes being cut short – and people have much less patience for machines than they do for other humans. Turn-taking failures take many forms (e.g., the agent interrupts the user, the agent mistakes a cough for an interruption), and all of them lead to users immediately hanging up the phone. In this talk, we use human conversation as a framework for understa ...
Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil+Jack Dwyer, Gabber
AI Engineer· 2025-07-31 13:45
Technology & Product Development
- Deployment experience with Orpheus (emotive, realtime TTS), including latency and optimization [1]
- High-fidelity voice cloning, with examples [1]
- Load balancing across multiple GPUs and multiple LoRAs [1]

Company & Industry Focus
- Gabber focuses on simplifying and lowering the cost of building realtime, multimodal consumer applications [1]
- Speaker Neil Dwyer built realtime streaming + computer vision pipelines at Bebo and worked on the Agents platform at LiveKit [1]
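One way to picture the "multiple GPUs, multiple LoRAs" load balancing mentioned above is adapter-aware routing: prefer a worker that already has the requested LoRA adapter resident (avoiding a swap), and break ties by queue depth. This policy and all names below are hypothetical, not Gabber's implementation.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    loaded: set          # LoRA adapter ids currently resident on this GPU
    queue_depth: int = 0

def route(workers, lora_id):
    # Sort key: (adapter not resident, queue depth). False sorts before
    # True, so residency wins first; queue depth breaks ties.
    best = min(workers, key=lambda w: (lora_id not in w.loaded, w.queue_depth))
    if lora_id not in best.loaded:
        best.loaded.add(lora_id)   # model the swap-in
    best.queue_depth += 1
    return best.name

workers = [Worker("gpu0", {"voice-a", "voice-b"}),
           Worker("gpu1", {"voice-c"})]
r1 = route(workers, "voice-c")   # resident only on gpu1
r2 = route(workers, "voice-a")   # resident only on gpu0
r3 = route(workers, "voice-a")   # still gpu0: resident there, tie on queues
print(r1, r2, r3)
```

A production balancer would also account for adapter load time, VRAM pressure when evicting adapters, and per-request latency targets; this sketch only captures the residency-first idea.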