Avi Chawla·2025-10-22 06:31
Most LLM-powered evals are BROKEN!

These evals can easily mislead you into believing that one model is better than another, primarily due to the way they are set up.

G-Eval is one popular example.

Here's the core problem with LLM eval techniques, and a better alternative to them:

Typical evals like G-Eval assume you're scoring one output at a time, in isolation, without ever seeing the alternative.

So when prompt A scores 0.72 and prompt B scores 0.74, you still don't know which one is actually better.

This is unlike ...
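The post cuts off here, but the setup points toward pairwise (comparative) evaluation: show the judge both candidates on the same input and ask it to pick a winner, then aggregate win rates. Below is a minimal, hypothetical sketch contrasting the two setups. `call_llm` is a stand-in for whatever model client you use; it is not G-Eval's or any specific library's API.

```python
# Minimal sketch: pointwise (G-Eval-style) vs. pairwise LLM evals.
# `call_llm` is a hypothetical stand-in for your own chat-completion client.

def call_llm(prompt: str) -> str:
    """Hypothetical: send `prompt` to your LLM judge, return its raw text reply."""
    raise NotImplementedError("wire up your own model client here")

# --- Pointwise (G-Eval-style): score each output in isolation ---
def pointwise_score(question: str, answer: str) -> float:
    reply = call_llm(
        f"Rate the answer below from 0 to 1 for correctness and relevance.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Reply with only a number."
    )
    return float(reply.strip())

# --- Pairwise: the judge sees BOTH candidates and must pick one ---
def pairwise_winner(question: str, answer_a: str, answer_b: str) -> str:
    # In practice, run each pair twice with A/B swapped
    # to control for the judge's position bias.
    reply = call_llm(
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        f"Which answer is better? Reply with exactly 'A' or 'B'."
    )
    return reply.strip().upper()
```

The point of the contrast: 0.72 vs. 0.74 from two isolated scoring calls sits well within judge noise, while a direct preference on the same input, aggregated into a win rate across the test set, gives a much clearer signal of which prompt or model is actually better.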