Latency

Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil+Jack Dwyer, Gabber
AI Engineer· 2025-07-31 13:45
This talk goes over our experience deploying Orpheus (an emotive, realtime TTS model) to production. It covers:
- Latency and optimizations
- High-fidelity voice clones w/ examples
- Load balancing w/ multiple GPUs and multiple LoRAs (sketched below)
About Neil Dwyer: I've spent a lot of my career building real-time applications, first at a company called Bebo circa 2018, where I built a live-streaming + computer-vision pipeline that watched people play Fortnite, and more recently at a company called LiveKit, where I worked ...
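The abstract doesn't spell out the routing scheme; as a minimal sketch, assuming each GPU replica can keep only a few LoRA adapters resident in VRAM, a router might prefer replicas that already have the requested adapter loaded (the replica names and adapter budget below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """One GPU worker serving the TTS base model plus hot-loaded LoRA adapters."""
    name: str
    max_adapters: int = 4                     # hypothetical per-GPU VRAM budget
    loaded: set[str] = field(default_factory=set)
    in_flight: int = 0                        # naive load signal

class LoraRouter:
    """Route each request to a replica that already has its LoRA loaded,
    falling back to the least-loaded replica (which then loads the adapter)."""
    def __init__(self, replicas: list[Replica]):
        self.replicas = replicas

    def route(self, lora_id: str) -> Replica:
        warm = [r for r in self.replicas if lora_id in r.loaded]
        if warm:
            target = min(warm, key=lambda r: r.in_flight)   # cache hit: no load cost
        else:
            target = min(self.replicas, key=lambda r: r.in_flight)
            if len(target.loaded) >= target.max_adapters:   # evict an arbitrary adapter
                target.loaded.pop()
            target.loaded.add(lora_id)                      # pay the one-time load latency
        target.in_flight += 1
        return target

# Usage: three GPUs, requests for a handful of cloned voices.
router = LoraRouter([Replica("gpu0"), Replica("gpu1"), Replica("gpu2")])
for voice in ["alice", "bob", "alice", "carol", "alice"]:
    r = router.route(voice)
    print(f"voice={voice} -> {r.name} (loaded={sorted(r.loaded)})")
    r.in_flight -= 1   # request finishes immediately in this toy example
```

The design choice worth noting: preferring a warm replica trades perfect load spreading for avoiding adapter-load stalls, which matters when the load time dwarfs a single request's latency budget.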
The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten
AI Engineer· 2025-07-24 15:30
AI Adoption in Enterprises
- Enterprises' adoption of AI is crucial for realizing AI's full potential and impact [2]
- Enterprises initially experiment with OpenAI and Anthropic models, often deploying them on Azure or AWS for security and privacy [7]
- In 2023, enterprises were "toying around" with AI, but by 2024, 40-50% had production use cases built on closed models [9][10]
Challenges with Closed Models
- Vendor lock-in is not a primary concern for enterprises, given the increasing number of interoperable models [12][13]
- Ballooning costs are becoming a significant concern, especially with agentic use cases that can involve 50 inference calls per user action [20]
- Enterprises are seeking differentiation at the AI level, not just at the workflow or application level, leading them to consider in-house solutions [21]
Reasons for Open-Source Model Adoption
- Frontier models may not be the right tool for specific use cases, such as medical document extraction, where enterprises can leverage their labeled data to build better models [16][17]
- Generic API-based models may not suffice for tasks requiring low latency, such as AI voices or AI phone calls [18]
- Enterprises aim to reduce costs and improve unit economics by running models themselves and controlling pricing [20][21]
Inference Infrastructure Challenges
- Optimizing models for latency requires both model-level and infrastructure-level optimizations, such as speculative decoding techniques like Eagle 3 [23][24][25][26] (see the sketch after this list)
- Guaranteeing high availability (four nines) for mission-critical inference requires robust infrastructure that can handle hardware failures and vLLM crashes [27][28]
- Scaling up quickly to handle traffic bursts is challenging, with some enterprises waiting up to eight minutes to bring up a new replica of a model [29]
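Eagle 3 itself uses a trained draft head, which this summary doesn't detail; as a minimal sketch of the speculative-decoding loop such techniques plug into (draft cheap tokens with a small model, verify them with the target model, keep the longest accepted prefix), with `draft_model` and `target_model` as hypothetical stand-ins that greedily map a token sequence to the next token:

```python
def speculative_decode(draft_model, target_model, prompt: list[int],
                       max_new_tokens: int = 64, k: int = 4) -> list[int]:
    """Greedy speculative decoding: the small model proposes k tokens,
    the large model verifies them, and only the agreed prefix is kept."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft k tokens cheaply with the small model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify with the large model. In a real system this is ONE batched
        #    forward pass over all k positions, which is where the speedup comes from.
        accepted = 0
        for i, t in enumerate(draft):
            if target_model(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        # 3. On a mismatch (or if nothing was accepted), take the target model's
        #    own next token so the loop always makes progress.
        if accepted < k:
            tokens.append(target_model(tokens))
    return tokens
```

Because the verified prefix matches what greedy decoding of the target model alone would produce, the output quality is unchanged; only wall-clock latency improves when the drafter's guesses are usually accepted.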
X @mert | helius.dev
mert | helius.dev· 2025-07-22 17:43
RT liam | helius.dev (@liamvovk): Visit https://t.co/d80kJs8avX to lower your latency today ...
What every AI engineer needs to know about GPUs — Charles Frye, Modal
AI Engineer· 2025-07-20 07:00
What I wanted to talk about today is what every AI engineer needs to know about GPUs. So far, over the last couple of years, most of the AI applications people have built have been built on top of model APIs: they use the OpenAI API, the Anthropic API, the DeepSeek API, and they build an application on top of that. And that goes back to the initial diagram that swyx put out, the "rise of the AI engineer" thing ...
Optimizing inference for voice models in production - Philip Kiely, Baseten
AI Engineer· 2025-06-28 02:39
How do you get time to first byte (TTFB) below 150 milliseconds for voice models -- and scale it in production? As it turns out, open-source TTS models like Orpheus have an LLM backbone that lets us use familiar tools and optimizations like TensorRT-LLM and FP8 quantization to serve the models with low latency. But client code, network infrastructure, and other outside-the-GPU factors can introduce latency in the production stack. In this talk, we'll cover the basic mechanics of TTS inference, common pitfalls ...
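The talk's production stack isn't reproduced here; as a minimal sketch, assuming a hypothetical streaming TTS endpoint that returns audio chunks over HTTP, TTFB is simply the time from sending the request to receiving the first audio byte:

```python
import time
import requests

# Hypothetical streaming TTS endpoint; the URL and payload shape are
# assumptions for illustration, not the API from the talk.
TTS_URL = "https://example.com/v1/tts/stream"

def measure_ttfb(text: str, voice: str = "default") -> float:
    """Return seconds from request send to the first audio byte (TTFB)."""
    start = time.perf_counter()
    with requests.post(TTS_URL, json={"text": text, "voice": voice},
                       stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:  # first non-empty chunk = first audio bytes on the wire
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio arrived")

# A sub-150 ms target means every stage -- network round trip, queueing,
# and the model's first decoded audio frame -- must fit inside that budget.
print(f"TTFB: {measure_ttfb('Hello, world.') * 1000:.1f} ms")
```

Measuring from the client side like this captures the outside-the-GPU factors the talk mentions, which server-side metrics alone would miss.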
X @BREAD | ∑:
BREAD | ∑:· 2025-06-26 20:36
The scaling journey will not be a short one. https://t.co/bYq2WumtJM CBB (@Cbb0fe): You need 0.5s to create or update orders on Hyperliquid. This is really slow. Are there plans to make it faster? Which perp platform is able to significantly reduce this latency? ...
Chasing Nanoseconds: The Race in Digital Asset Markets | Denis Dariotis | TEDxMarianopolisCollege
TEDx Talks· 2025-06-18 16:09
Most trades in modern financial markets are not decided by strategy, by logic, or by any other metric; the biggest factor they are decided on is time. Latency is the delay between sending information from one place to another. As Max Verstappen, a driver for the Formula 1 Red Bull team, put it, the only place that matters is first. Or, more humorously, as Ricky Bobby said, "If you ain't first, you're last." Modern financial markets are very depende ...