AI Engineer
Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily
AI Engineer· 2025-07-31 18:56
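Core Technology & Product Offering
- Daily provides global infrastructure for real-time audio, video, and AI, and launched Pipecat, an open-source, vendor-neutral project that helps developers build reliable, high-performance voice AI agents [2][3]
- The Pipecat framework includes native telephony support that is plug-and-play with multiple telephony providers such as Twilio and Plivo, plus a fully open-source audio smart-turn model [12][13]
- Pipecat Cloud is the first open-source voice AI cloud, built to host code designed specifically for voice AI problems, and supports more than 60 models and services [14][15]
- Daily launched Pipecat Cloud as a thin wrapper around Docker and Kubernetes, optimized specifically for voice AI to address fast startup, autoscaling, and real-time performance [29]

Voice AI Agent Development & Challenges
- Building a voice agent involves three things: writing the code, deploying the code, and connecting users. Users expect a lot from voice AI: the agent must understand them, be intelligent and conversational, and sound natural [5][6]
- Voice AI agents need to respond quickly. The target is an 800 ms voice-to-voice response time, and the agent must also judge accurately when to respond [7][8]
- Developers use frameworks such as Pipecat to avoid writing complex code for turn detection, interruption handling, and context management, so they can focus on business logic and user experience [10]
- Voice AI faces distinctive challenges, including long sessions, low-latency network protocols, and autoscaling; cold-start time is critical [25][26][30]
- Major challenges include background noise triggering unwanted LLM interruptions, and the non-determinism of agents [38][40]

Model & Service Ecosystem
- Pipecat supports many models and services, including OpenAI's audio models and Gemini's Multimodal Live API, used for conversational flows and game interactions [15][19][22]
- The industry is exploring next-generation research models such as Moshi and Sesame, which use a continuous bidirectional streaming architecture but are not yet fully production-ready [49][56]
- Gemini performs well in native audio input mode and is competitively priced, but models are less reliable in audio mode than in text mode [61][53]
- Ultravox is a speech model built on a Llama 3 70B backbone; if Llama 3 70B meets your needs, Ultravox is a good choice [57][58]

Deployment & Infrastructure
- Daily offers endpoints worldwide, routing over the AWS or OCI backbone to optimize latency and meet data privacy requirements [47]
- For geographically distant users, such as those in Australia, the recommendation is to deploy the service close to the inference servers or to run open-weight models locally [42][44]
- The main advantage of speech-to-speech models is that they preserve information a transcription step would lose, such as mixed-language speech, but limited audio data can cause problems [63][67]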
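The 800 ms voice-to-voice target can be read as a budget shared by every stage of a conversational turn. The sketch below only illustrates that arithmetic: the stage names and per-stage numbers are hypothetical assumptions chosen for the example, not measurements from the talk or from Pipecat.

```python
# Illustrative latency budget for one voice-to-voice turn.
# TARGET_MS comes from the summary above; every per-stage number
# below is a hypothetical assumption, not a measurement.
TARGET_MS = 800


def voice_to_voice_ms(stages: dict[str, float]) -> float:
    """Total latency of a turn, assuming the stages run one after another."""
    return sum(stages.values())


def over_budget_ms(stages: dict[str, float], target_ms: float = TARGET_MS) -> float:
    """Milliseconds by which the turn exceeds the target (0 if within budget)."""
    return max(0.0, voice_to_voice_ms(stages) - target_ms)


# Hypothetical stage breakdown (assumed values).
example_stages = {
    "network_in": 40,
    "turn_detection": 200,
    "speech_to_text": 150,
    "llm_first_token": 250,
    "text_to_speech_first_audio": 120,
    "network_out": 40,
}

if __name__ == "__main__":
    print(voice_to_voice_ms(example_stages), over_budget_ms(example_stages))
```

Under these assumed numbers the turn lands exactly on the target, so any single stage that slows down (a cold start, a distant inference region) pushes the whole turn over budget.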
[Voice Keynote] Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)
AI Engineer· 2025-07-31 16:00
Sean DuBois of OpenAI and Pion, and Kwindla Hultman Kramer of Daily and Pipecat, will talk about why you have to design realtime AI systems from the network layer up. Most people who build realtime AI apps and frameworks get it wrong. They build from either the model out or the app layer down. But unless you start with the network layer and build up, you'll never be able to deliver realtime audio and video streams reliably. And perhaps even worse, you'll get core primitives wrong: interruption handling, con ...
Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit
AI Engineer· 2025-07-31 16:00
ChatGPT Advanced Voice Mode isn’t interrupting just you. Interruptions, and turn-taking in general, are unsolved problems for all Voice AI agents. Nobody likes being cut short – and people have much less patience for machines than they do for other humans. Turn-taking failures take many forms (e.g., the agent interrupts the user, the agent mistakes a cough for an interruption), and all of them lead to users immediately hanging up the phone. In this talk, we use human conversation as a framework for understa ...
Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil+Jack Dwyer, Gabber
AI Engineer· 2025-07-31 13:45
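Technology & Product Development
- Deployment experience with Orpheus (emotive, real-time TTS), including latency and optimization [1]
- High-fidelity voice cloning, with examples [1]
- Load balancing across multiple GPUs and multiple LoRAs [1]

Company & Industry Focus
- Gabber aims to simplify and lower the cost of building real-time, multimodal consumer applications [1]
- Speaker Neil Dwyer built a real-time streaming and computer vision pipeline at Bebo and worked on the Agents platform at LiveKit [1]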
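The summary does not say how Gabber balances load across GPUs and LoRAs. As one possible reading, the sketch below routes each request to a worker that already has the requested adapter loaded and falls back to the least-loaded worker otherwise. The `GpuWorker` type, the field names, and the routing policy are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GpuWorker:
    """Hypothetical view of one GPU worker (not Gabber's actual data model)."""
    name: str
    loaded_loras: set[str] = field(default_factory=set)
    active_requests: int = 0


def pick_worker(workers: list[GpuWorker], lora_id: str) -> GpuWorker:
    """Prefer workers with the adapter already loaded; break ties by load."""
    warm = [w for w in workers if lora_id in w.loaded_loras]
    candidates = warm or workers  # fall back to all workers if none is warm
    return min(candidates, key=lambda w: w.active_requests)


def dispatch(workers: list[GpuWorker], lora_id: str) -> GpuWorker:
    """Assign a request and record that the worker now holds the adapter."""
    worker = pick_worker(workers, lora_id)
    worker.loaded_loras.add(lora_id)
    worker.active_requests += 1
    return worker
```

Preferring a warm worker avoids paying the adapter load time on the request path, at the cost of sometimes sending traffic to a busier GPU. A fuller version would cap per-worker load, decrement the counter when a request finishes, and evict unused adapters.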
How to defend your sites from AI bots — David Mytton, Arcjet
AI Engineer· 2025-07-30 17:30
Constantly seeing CAPTCHAs? It used to be easy to detect the humans from the droids, but what else can we do when synthetic clients make up nearly half of all web requests. Rotating IPs, spoofed browsers, and agents acting on behalf of real users - are we doomed to forever be solving puzzles? In this talk, we’ll explore user agents, HTTP fingerprints, and IP reputation signals that make humans and agents stand out from scrapers, build a realistic threat model, and dig into the behaviors that reveal the LLM- ...
The Unofficial Guide to Apple’s Private Cloud Compute - Jonathan Mortensen, CONFSEC
AI Engineer· 2025-07-30 17:00
Technology Innovation
- Apple introduced "Private Cloud Compute" in October 2024, a new private AI technology for millions of devices [1]
- Private Cloud Compute offers local device-level privacy and security on an untrusted remote server [1]
- The technology enables developers to run sensitive, multi-tenant workloads with cryptographically-provable privacy guarantees at scale and at reasonable cost [1]

Industry Impact
- Private Cloud Compute represents a paradigm shift in confidential computing, making it mainstream [1]
- The technology can be leveraged for data and AI applications where privacy and security are paramount [1]

Key Personnel
- Jonathan Mortensen, CEO of a stealth AI startup and Founder Fellow at South Park Commons, previously founded bit.io, a multi-cloud serverless PostgreSQL platform acquired by Databricks [1]
- Prior to bit.io, Jonathan Mortensen led data science and engineering teams at BlueVoyant, designing high-volume data pipelines processing 50 million events per second [1]
How we hacked YC Spring 2025 batch’s AI agents — Rene Brandel, Casco
AI Engineer· 2025-07-30 15:45
Security Vulnerabilities
- AI agents in the industry are vulnerable to hacking, with 7 out of 16 (43.75%) publicly accessible YC X25 AI agents being compromised [1]
- Hacking these AI agents allowed for user data leaks, remote code execution, and database takeover [1]
- The time required to compromise each AI agent was approximately 30 minutes [1]

Risk Mitigation
- Companies should address common mistakes in AI agent security to mitigate risks [1]
- Proactive security measures are crucial to protect businesses from potential harm caused by AI agents [1]
Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)
AI Engineer· 2025-07-29 16:00
[Music] All right. Uh, thank you everyone. We're excited to be here, and thank you for, uh, coming to our talk. Uh, my name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into those pandas data frames. And I'm Calvin. I lead one of the teams at Harvey AI working on RAG, um, tough RAG problems across mass ...
Building Alice’s Brain: an AI Sales Rep that Learns Like a Human - Sherwood & Satwik, 11x
AI Engineer· 2025-07-29 15:30
Overview of Alice and 11x
- 11x is building digital workers for the go-to-market organization, including Alice, an AI SDR, and Julian, a voice agent [2]
- Alice sends approximately 50,000 emails per day, far more than the 20-50 a human SDR sends, and runs campaigns for about 300 business organizations [6]
- The knowledge base centralizes seller information, allowing users to upload source material for message generation [18]

Technical Architecture and Pipeline
- The knowledge base pipeline consists of parsing, chunking, storage, retrieval, and visualization [22]
- Parsing converts non-text resources into text, making them legible to large language models [23]
- Chunking breaks markdown down into semantic entities for embedding in the vector DB, preserving the markdown structure [37][38]
- Pinecone was selected as the vector database because it is a well-known solution, is cloud hosted, is easy to use, bundles embedding models, and offers customer support [46][47][48][49]
- A deep research agent, built using Letta, is used for retrieval; it creates a plan with one or more context retrieval steps [51][52]

Vendor Selection and Considerations
- The company chose to work with vendors for parsing, prioritizing speed to market and confidence in the outcome over building in-house [26][27]
- LlamaParse was selected for documents and images because it handles numerous file types and offers good support [32]
- Firecrawl was chosen for websites due to familiarity and the availability of its crawl endpoint [33][34]
- Cloudglue was selected for audio and video because it supports both formats and extracts information from the video itself [36]

Lessons Learned and Future Plans
- RAG (Retrieval-Augmented Generation) is complex, requiring many micro-decisions and technology evaluations [58]
- The company recommends getting to production first, then benchmarking and improving [59]
- Future plans include tracking and addressing hallucinations, evaluating parsing vendors on accuracy and completeness, experimenting with hybrid RAG, and reducing costs [60][61]
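The chunking step described above, splitting markdown into semantic units while preserving its structure, can be sketched as a heading-based splitter. This is a minimal illustration using only the standard library, not 11x's implementation; the `Chunk` type and `chunk_markdown` function are hypothetical names introduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    """One section of a markdown document plus the headings that enclose it."""
    heading_path: list[str]
    text: str


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str) -> list[Chunk]:
    """Split markdown at ATX headings, keeping each chunk's heading trail."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # (heading level, title) from root to current
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk([title for _, title in path], body))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that just ended
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling or deeper sections
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Keeping the heading trail with each chunk lets it be prepended to the text before embedding, so a fragment such as "Custom quotes" still carries the context "Pricing > Enterprise". The sketch ignores fenced code blocks, where a line starting with `#` would be misread as a heading.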
Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)
AI Engineer· 2025-07-29 14:30
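RAG Technology Stack
- The RAG stack ranges from the simplest case, in-memory embeddings with relevance-ranked search, to the most complex, planet-scale search that blends more than 70 corpora spanning tokens, embeddings, and knowledge graphs [1]
- The industry is exploring joint retrieval, custom ranking, joint re-ranking, and LLM processing over these blended corpora within 200 ms at 160,000 queries per second [1]
- The talk adds complexity "one query at a time", showing the limits of each RAG technique and how the next layer handles more complex queries [1]

Search Challenges
- Some search problems are so hard that the industry may prefer to hand them off to the LLM or the UX [1]
- The talk notes that a query like [falafel] is very hard to search for, and that chunking documents can be catastrophic [1]

Industry Applications & Insights
- The Google team shares its experience applying RAG techniques across more than 50 search products, including Google.com and custom enterprise search [1]
- Pi Labs aims to bring Google's experience with core search AI and NLU systems to the wider industry [1]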
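The simplest layer mentioned above, in-memory embeddings with relevance-ranked search, might look like the sketch below. To stay self-contained it swaps in a toy bag-of-words vector for a learned embedding model; that substitution, and the function names `embed`, `cosine`, and `search`, are assumptions made for illustration rather than anything shown in the talk.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, docs: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Score every document against the query and return the top_k matches."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A one-word query such as [falafel] shows the limit of this layer: every document that mentions the word scores on that term alone, and nothing in the vector says whether the user wants a recipe, a restaurant, or a definition.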