First Evidence That RL Can Teach 3D Models to Reason: Generation Quality Jumps Under Complex Text Descriptions

Core Insights
- The research introduces the first systematic integration of reinforcement learning (RL) into text-to-3D autoregressive generation, addressing unique challenges in 3D generation compared to 2D [1][3][17]
- The study emphasizes the importance of designing reward models specifically for 3D generation, with human preference scores (HPS v2.1) identified as the most effective single reward signal [6][12][17]

Group 1: Challenges in 3D Generation
- 3D objects lack a "standard view," making it difficult to evaluate geometric consistency, texture quality, and semantic alignment from multiple perspectives [5][6]
- The long-range dependencies in 3D generation lead to sparser reward signals, complicating the model's ability to detect errors during the generation process [5][6]

Group 2: Reward Model Design
- The research tested various reward combinations, concluding that HPS v2.1 alone provides the strongest results, while semantic alignment and aesthetic quality can enhance performance when combined with HPS [6][12]
- A surprising finding is that general large models (Qwen2.5-VL) are more robust in assessing 3D consistency than specialized models, filling the gap in reward signals for 3D generation [6][12]

Group 3: Algorithm Selection and Training Paradigms
- The study reveals that token-level optimization is more suitable for 3D generation than sequence-level operations, which can hinder performance [7][12]
- Data diversity is more critical than training duration in RL training for 3D generation, as doubling the training data is effective, while tripling iterations can lead to overfitting [12][17]

Group 4: Evaluation Metrics
- Existing 3D generation benchmarks fail to assess models' implicit reasoning capabilities under complex text descriptions, leading to the development of the MME-3DR benchmark [10][17]
- MME-3DR includes 249 carefully selected complex 3D objects and evaluates multi-view geometric consistency, semantic detail alignment, and texture realism [10][17]

Group 5: Model Performance and Contributions
- The final model, AR3D-R1, outperformed existing state-of-the-art methods on both MME-3DR and Toys4K benchmarks, demonstrating significant improvements in reasoning capabilities [13][18]
- The research establishes a systematic framework for integrating RL into 3D generation, highlighting the need for tailored rewards, algorithms, and training paradigms rather than simply transferring 2D experiences [17][18]
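
The reward-combination and token-level versus sequence-level points in Groups 2 and 3 can be made concrete with a small sketch. This is an illustration only, not the study's implementation: the summary does not say which RL algorithm or which reward weights were used. The function names, the weights `w_sem` and `w_aes`, and the group-normalized advantage are all assumptions. The sketch shows one common way the token-level versus sequence-level distinction appears in policy-gradient training, namely in how the loss is averaged.

```python
from statistics import mean, pstdev


def combined_reward(hps, semantic=0.0, aesthetic=0.0, w_sem=0.5, w_aes=0.5):
    """Weighted sum of reward signals.

    `hps` stands in for a human-preference score; the weights are
    hypothetical and are not taken from the study.
    """
    return hps + w_sem * semantic + w_aes * aesthetic


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards across samples generated for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_loss(token_logps, advantages):
    """Average over every token in the batch, so each token counts equally
    and longer sequences contribute more to the gradient."""
    total = sum(-a * lp for seq, a in zip(token_logps, advantages) for lp in seq)
    count = sum(len(seq) for seq in token_logps)
    return total / count


def sequence_level_loss(token_logps, advantages):
    """Average within each sequence first, then across sequences, so each
    sequence counts equally regardless of its length."""
    per_seq = [-a * mean(seq) for seq, a in zip(token_logps, advantages)]
    return mean(per_seq)
```

For example, with per-token log-probabilities `[[-1.0], [-1.0, -1.0, -1.0]]` and advantages `[-1.0, 1.0]`, `token_level_loss` returns 0.5 while `sequence_level_loss` returns 0.0. The long, positively rewarded sequence dominates under token-level averaging, and its contribution is cancelled by the short sequence under sequence-level averaging. That is one plausible reason the two choices behave differently on long 3D token sequences, though the summary does not state the mechanism.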