Fuzzing the GenAI Era
Leonard Tang

AI Evaluation Challenges

- Traditional evaluation methods are inadequate for assessing the brittleness of GenAI applications [1]
- The industry faces a "Last Mile Problem" in AI: ensuring reliability, quality, and alignment for any given application [1]
- Standard evaluation methods often fail to uncover corner cases or to account for unexpected user inputs [1]

Haize Labs' Approach

- Haize Labs simulates the "last mile" by bombarding an AI application with unexpected user inputs to uncover corner cases at scale [1]
- Haize Labs focuses on two components: the Quality Metric (defining criteria for good and bad responses and automating the judgment) and Stimuli Generation (creating diverse data to discover bugs) [1]
- Haize Labs uses agents as judges to scale evaluation, weighing trade-offs such as accuracy versus latency [1]
- Haize Labs employs RL-tuned judges to scale evaluation further [1]
- Haize Labs uses simulation as a form of prompt optimization [1]

Case Studies

- Haize Labs has worked on a major European bank's AI app [1]
- Haize Labs has worked on an F500 bank's voice agents [1]
- Haize Labs scales voice agent evaluations [1]
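
The split between Stimuli Generation and the Quality Metric described under Haize Labs' Approach can be illustrated with a toy fuzzing loop. This is a minimal sketch under stated assumptions, not Haize Labs' implementation: `toy_app`, the three mutators, and the rule-based `judge` are all hypothetical stand-ins (in practice the judge would be an LLM or agent, and the stimuli would come from a much richer generator).

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    passed: bool
    reason: str


@dataclass
class Failure:
    stimulus: str
    response: str
    reason: str


Mutator = Callable[[str, random.Random], str]


def toy_app(prompt: str) -> str:
    """Hypothetical app under test, deliberately brittle for illustration."""
    if len(prompt) > 80:
        return ""  # simulated bug: long inputs are silently dropped
    if prompt.isupper():
        return "I cannot help with that."  # simulated bug: refuses all-caps input
    return "You can request a refund within 30 days."


def judge(prompt: str, response: str) -> Verdict:
    """Quality Metric: hypothetical rule-based stand-in for an LLM/agent judge."""
    if not response.strip():
        return Verdict(False, "empty response")
    if "cannot" in response.lower():
        return Verdict(False, "unwarranted refusal")
    if "refund" not in response.lower():
        return Verdict(False, "off-topic response")
    return Verdict(True, "ok")


# Stimuli Generation: simple mutations of seed prompts (all hypothetical).
def shout(text: str, rng: random.Random) -> str:
    return text.upper()


def pad(text: str, rng: random.Random) -> str:
    return text + " please" * rng.randint(5, 20)


def typo(text: str, rng: random.Random) -> str:
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    # Swap two adjacent characters.
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


MUTATORS: List[Mutator] = [shout, pad, typo]
SEEDS = ["What is your refund policy?", "Can I get my money back?"]


def fuzz(
    app: Callable[[str], str],
    judge_fn: Callable[[str, str], Verdict],
    seeds: List[str],
    mutators: List[Mutator],
    n_trials: int,
    seed: int = 0,
) -> List[Failure]:
    """Mutate seeds, run the app on each stimulus, and keep judged failures."""
    rng = random.Random(seed)
    failures: List[Failure] = []
    for _ in range(n_trials):
        stimulus = rng.choice(mutators)(rng.choice(seeds), rng)
        response = app(stimulus)
        verdict = judge_fn(stimulus, response)
        if not verdict.passed:
            failures.append(Failure(stimulus, response, verdict.reason))
    return failures
```

Calling `fuzz(toy_app, judge, SEEDS, MUTATORS, n_trials=200)` returns the stimuli that the judge flagged, each with the reason for failure. Keeping the judge and the stimulus generator as separate, swappable functions mirrors the two components named in the talk summary: either one can be scaled or replaced without touching the other.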