The "Preference" Dilemma of Large Models as Judges: UDA Achieves Unsupervised Debiasing Alignment
机器之心· 2025-11-28 00:51
Core Insights
- The article discusses preference bias in large language models (LLMs) acting as judges: even advanced models such as GPT-4o and DeepSeek-V3 systematically favor their own outputs, producing significant discrepancies in scoring and ranking [2][4][5]
- Unsupervised Debiasing Alignment (UDA) offers a new approach to this bias, letting models autonomously adjust their scoring rules through unsupervised learning to achieve debiasing alignment [2][7]

Summary by Sections

Problem Statement
- Current LLM judging systems, such as Chatbot Arena, face three main challenges: self-preference solidification, heterogeneity bias, and static scoring defects [4][5]
- Self-preference solidification leads models to overestimate their own answers, creating a "whoever judges, wins" dynamic [4]
- Heterogeneity bias means the direction and intensity of bias vary across models, ranging from aggressive self-promotion to excessive humility [4]

UDA Contribution
- UDA recasts debiasing as a sequence-learning problem that can be optimized through dynamic calibration, allowing judges to explore optimal scoring strategies autonomously [7][25]
- The method uses consensus-driven training, treating the judges' collective agreement as a practical optimization target, which reduces overall bias [13][18]

Methodology
- UDA models pairwise evaluation as an instance-level adaptive process, dynamically generating adjustment parameters for each judge model during comparisons [10][11]
- The system extracts multiple features from each comparison, including semantic feature vectors and self-perception features, which are crucial for detecting bias tendencies [11][20]

Experimental Results
- UDA significantly reduces inter-judge variance, lowering the average standard deviation from 158.5 to 64.8 and demonstrating its effectiveness at suppressing extreme biases [23]
- The average Pearson correlation with human evaluations improves from 0.651 to 0.812, indicating closer alignment with human judgment [23]
- UDA shows robust zero-shot transfer, achieving a 63.4% variance reduction on unseen datasets and demonstrating domain-agnostic debiasing [23]

Conclusion
- UDA represents a shift in how judgment calibration is approached, moving from prompt engineering to a learnable problem, improving the robustness and reproducibility of evaluations while aligning them more closely with human judgment [25]
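To make the consensus-driven idea concrete, here is a minimal toy sketch of how per-judge score calibration toward inter-judge agreement can shrink inter-judge variance without any human labels. This is an illustrative assumption, not UDA's actual algorithm: the real method learns instance-level adjustment parameters from semantic and self-perception features, whereas this sketch uses a simple per-judge affine recalibration onto the panel's pooled scale. All names (`consensus_calibrate`, the simulated judges and biases) are hypothetical.

```python
import numpy as np

# Hypothetical setup: 3 judge models score the same 200 answers.
# Each judge applies its own systematic shift/scale (e.g. self-preference),
# distorting an underlying latent quality signal.
rng = np.random.default_rng(0)
true_quality = rng.normal(70.0, 10.0, size=200)      # latent answer quality
bias_scale = np.array([1.3, 0.8, 1.0])               # per-judge scale bias
bias_shift = np.array([8.0, -5.0, 0.0])              # per-judge shift bias
noise = rng.normal(0.0, 2.0, size=(3, 200))
scores = bias_scale[:, None] * true_quality + bias_shift[:, None] + noise

def consensus_calibrate(scores: np.ndarray) -> np.ndarray:
    """Affine-recalibrate each judge onto the panel's pooled scale.

    The pooled mean/std of all judges serves as the consensus target,
    so agreement itself is the optimization signal -- no labels needed.
    """
    grand_mean, grand_std = scores.mean(), scores.std()
    per_judge_mean = scores.mean(axis=1, keepdims=True)
    per_judge_std = scores.std(axis=1, keepdims=True)
    z = (scores - per_judge_mean) / per_judge_std    # remove each judge's bias
    return z * grand_std + grand_mean                # map onto shared scale

calibrated = consensus_calibrate(scores)

# Inter-judge disagreement: std across judges, averaged over answers.
before = scores.std(axis=0).mean()
after = calibrated.std(axis=0).mean()
print(f"mean inter-judge std: {before:.2f} -> {after:.2f}")
```

In this toy setting the calibration collapses the systematic scale/shift disagreement between judges, leaving only the noise component, which mirrors the kind of variance reduction the reported results describe.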