Optimizing inference for voice models in production - Philip Kiely, Baseten

Key Optimization Goal
- Aims to achieve Time To First Byte (TTFB) below 150 milliseconds for voice models [1]

Technology and Tools
- Leverages open-source TTS models like Orpheus, which have an LLM backbone [1]
- Employs tools and optimizations such as TensorRT-LLM and FP8 quantization [1]

Production Challenges
- Client code, network infrastructure, and other outside-the-GPU factors can introduce latency [1]
- Common pitfalls exist when integrating TTS models into production systems [1]

Scalability and Customization
- Focuses on scaling TTS models in production [1]
- Extends the system to serve customized models with voice cloning and fine-tuning [1]
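
Because the 150 millisecond target is stated as TTFB, it helps to measure it from the client side, where client code and network overhead are included along with GPU time. The following is a minimal sketch of one way to do that, not the method used in the talk. The names `measure_ttfb`, `fake_tts_stream`, and `TTFB_BUDGET_MS` are hypothetical, and `fake_tts_stream` is a stand-in for a real streaming TTS client, which would yield audio chunks as they arrive.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

# Latency budget taken from the summary above (TTFB below 150 ms).
TTFB_BUDGET_MS = 150.0


def measure_ttfb(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, float, int]:
    """Return (ttfb_ms, total_ms, total_bytes) for one streamed response.

    The timer starts before the stream is opened, so any setup cost inside
    `start_stream` (connection, request serialization) counts toward TTFB.
    """
    start = time.perf_counter()
    ttfb_ms = None
    total_bytes = 0
    for chunk in start_stream():
        # The first non-empty chunk marks the first audio byte.
        if chunk and ttfb_ms is None:
            ttfb_ms = (time.perf_counter() - start) * 1000.0
        total_bytes += len(chunk)
    total_ms = (time.perf_counter() - start) * 1000.0
    if ttfb_ms is None:
        raise ValueError("stream produced no audio bytes")
    return ttfb_ms, total_ms, total_bytes


def fake_tts_stream(first_chunk_delay_s: float = 0.02, chunks: int = 5) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (assumption, not a real API)."""
    time.sleep(first_chunk_delay_s)  # simulated queueing, prefill, and network delay
    for _ in range(chunks):
        yield b"\x00" * 320  # placeholder audio bytes
        time.sleep(0.001)  # simulated gap between chunks


if __name__ == "__main__":
    ttfb_ms, total_ms, total_bytes = measure_ttfb(fake_tts_stream)
    verdict = "within" if ttfb_ms < TTFB_BUDGET_MS else "over"
    print(f"TTFB {ttfb_ms:.1f} ms ({verdict} budget), "
          f"total {total_ms:.1f} ms, {total_bytes} bytes")
```

Passing a callable rather than an already-open stream is a deliberate choice in this sketch: it keeps connection setup inside the timed region, which is where outside-the-GPU latency tends to show up.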