X @Avi Chawla
Avi Chawla · 2025-10-25 06:31
You're in an ML Engineer interview at Apple. The interviewer asks: "Two models are 88% accurate.
- Model A is 89% confident.
- Model B is 99% confident.
Which one would you pick?"

You: "Either would work, since both have the same accuracy."

Interview over. Here's what you missed:

Modern neural networks can be misleading: they are overconfident in their predictions. For instance, I saw an experiment that used the CIFAR-100 dataset to compare LeNet with ResNet. LeNet produced:
- Accuracy = ~0.55
- Average confidence = ~0.54
ResNet ...
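The mismatch the post describes is easy to quantify. Below is a minimal sketch (my illustration, not from the thread; the function name `calibration_summary` and the binning choice are assumptions) that compares accuracy against average confidence and computes a standard Expected Calibration Error (ECE) from softmax outputs. A well-calibrated model, like the LeNet result above, has accuracy roughly equal to average confidence; a large gap signals overconfidence.

```python
import numpy as np

def calibration_summary(probs, labels, n_bins=10):
    """Compare accuracy with average confidence and compute a simple
    Expected Calibration Error (ECE) from softmax outputs.

    probs:  (N, C) array of predicted class probabilities
    labels: (N,) array of true class indices
    """
    confidences = probs.max(axis=1)      # confidence = top softmax probability
    predictions = probs.argmax(axis=1)
    correct = predictions == labels

    accuracy = correct.mean()
    avg_confidence = confidences.mean()

    # Bin samples by confidence; ECE is the bin-weighted average of
    # |accuracy in bin - mean confidence in bin|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())

    return accuracy, avg_confidence, ece
```

On these numbers you would pick Model A: at 88% accuracy, 89% average confidence is nearly calibrated (gap of about 0.01), while 99% confidence leaves a gap of about 0.11, so Model B is badly overconfident; that is presumably where the truncated thread is heading.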
A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai
AI Engineer · 2025-07-19 21:15
Model Reasoning and Applications
- Reasoning unlocks new language model applications, exemplified by improved information retrieval [1]
- Reasoning models are enhancing applications like website analysis and code assistance, making them more steerable and user-friendly [1]
- Reasoning models are pushing the limits of task completion, requiring ongoing effort to determine what models need to continue progress [1]

Planning and Training
- Planning is a new frontier for language models, requiring a shift in training approaches beyond just reasoning skills [1][2]
- The industry needs to develop research plans to train reasoning models that can work autonomously and have meaningful planning capabilities [1]
- Calibration is crucial for products: models tend to overthink, so output tokens need to be better managed relative to problem difficulty [1]
- Strategy and abstraction are key subsets of planning, enabling models to choose how to break down problems and utilize tools effectively [1]

Reinforcement Learning and Compute
- Reinforcement learning with verifiable rewards is a core technique, where language models generate completions and receive feedback to update weights (a minimal sketch follows this list) [2]
- Parallel compute enhances model robustness and exploration, but doesn't solve every problem, indicating a need for balanced approaches [3]
- The industry is moving toward treating post-training as a significant portion of compute, potentially reaching parity with pre-training in GPU hours [3]
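To make the verifiable-rewards loop concrete, here is a minimal sketch under stated assumptions (the math-style task, the 'Answer: <n>' output format, and names like `verifiable_reward` are mine, not from the talk). The key property is that the reward comes from a deterministic check against ground truth rather than a learned reward model; the resulting (completion, reward) pairs would feed a policy-gradient update such as PPO or GRPO.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Pull the final answer out of a completion; we assume the model
    was instructed to end with a line like 'Answer: 42'."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    return match.group(1) if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference,
    0.0 otherwise. The check is deterministic; that is what makes the
    reward 'verifiable' (no learned reward model involved)."""
    return 1.0 if extract_answer(completion) == ground_truth else 0.0

# Toy rollout: sample completions for a prompt, score them, and hand the
# (completion, reward) pairs to an RL update (e.g. PPO/GRPO) elsewhere.
prompt, truth = "What is 6 * 7? End with 'Answer: <n>'.", "42"
samples = ["6 * 7 = 42. Answer: 42", "It's 36. Answer: 36"]  # stand-ins for model samples
rewards = [verifiable_reward(c, truth) for c in samples]
print(rewards)  # [1.0, 0.0]; this signal drives the weight update
```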