Lei Jun announces that multiple recent Xiaomi research results have been accepted at ICLR 2026, a top international conference
Sou Hu Cai Jing· 2026-02-03 03:13
Core Insights
- Xiaomi's founder and CEO Lei Jun announced that multiple research achievements from the Xiaomi team have been accepted at ICLR 2026, covering areas such as multimodal reasoning, reinforcement learning, GUI agents, end-to-end autonomous driving, and audio generation [1][3]

Group 1: Research Achievements
- The paper "Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle" addresses inefficiencies in existing reinforcement learning training pipelines, particularly Advantage Collapsing and Rollout Silencing, which hinder long-term optimization [4]
- Shuffle-R1 proposes a streamlined reinforcement learning framework built on two core designs, Pairwise Trajectory Sampling and Advantage-based Batch Shuffle, which improve gradient signal quality and increase the exposure of valuable trajectories [4]
- Experimental results indicate that Shuffle-R1 consistently outperforms various reinforcement learning baselines with minimal computational overhead [4]

Group 2: Mobile Agents and GUI
- The paper "MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning" introduces a framework to improve the reasoning and planning capabilities of mobile GUI agents, addressing the scarcity of high-quality CoaT trajectories and the limitations of existing self-training methods [7][8]
- MobileIPL employs Thinking-level DPO and Instruction Evolution to strengthen process supervision and broaden task distribution, achieving state-of-the-art performance on mainstream GUI-agent benchmarks [8][10]
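The batch-reordering idea behind Advantage-based Batch Shuffle can be sketched in outline. This is a minimal illustration only: it assumes rollout trajectories each carry a scalar advantage, and the function name, data layout, and round-robin dealing rule are assumptions for the sketch, not the paper's actual algorithm.

```python
import random

def advantage_based_batch_shuffle(trajectories, batch_size):
    """Illustrative sketch: reorder rollout trajectories so each training
    batch mixes high- and low-advantage samples, keeping the gradient
    signal informative (an assumed reading of the paper's idea)."""
    # Rank trajectories by advantage magnitude, most informative first.
    ranked = sorted(trajectories, key=lambda t: abs(t["advantage"]), reverse=True)
    # Deal the ranked trajectories round-robin into batches, so every
    # batch receives at least some high-advantage samples.
    n_batches = max(1, len(ranked) // batch_size)
    batches = [[] for _ in range(n_batches)]
    for i, traj in enumerate(ranked):
        batches[i % n_batches].append(traj)
    random.shuffle(batches)  # avoid a fixed easy-to-hard curriculum
    return batches
```

With this dealing rule, no batch ends up containing only near-zero-advantage rollouts, which is one plausible way to mitigate the "silenced rollout" effect the summary describes.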
Group 3: Language Models
- "FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation" presents a modular reasoning framework that improves small language models' (SLMs') performance on complex tasks without additional training or extra parameters [12][13]
- FutureMind extracts advanced cognitive abilities from large language models (LLMs) through adaptive knowledge distillation, creating a dynamic reasoning pipeline that significantly improves reasoning efficiency and retrieval accuracy [12][13]

Group 4: Multimodal Reasoning
- The paper "ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding" proposes a framework that transfers mature textual reasoning capabilities to multimodal scenarios without costly model fine-tuning [16][17]
- ThinkOmni includes components such as LRM-as-a-Guide and Stepwise Contrastive Scaling, which balance perception and reasoning signals and deliver consistent gains across multiple multimodal reasoning benchmarks [17]

Group 5: Audio Generation
- "Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation" introduces a two-stage audio generation framework that combines Flow Matching pre-training with lightweight GAN fine-tuning for efficient audio generation [23][24]
- The framework tailors its modeling to the distinctive properties of audio signals and generates high-fidelity audio with better computational efficiency than existing methods [24]
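The general shape of guidance decoding, where a strong text reasoner steers an omni-modal model at each decoding step, can be sketched as a per-step logit blend. This is a generic illustration of training-free guided decoding, not ThinkOmni's actual formulation; the function name, blending rule, and `alpha` parameter are all assumptions.

```python
import math

def guided_decode_step(mm_logits, reasoner_logits, alpha=0.5):
    """Blend next-token logits from an omni-modal model with those of a
    text reasoning model. alpha=0 keeps the multimodal distribution;
    alpha=1 follows the reasoner entirely (illustrative only)."""
    guided = [m + alpha * (r - m) for m, r in zip(mm_logits, reasoner_logits)]
    # Numerically stable softmax over the blended logits.
    mx = max(guided)
    exps = [math.exp(g - mx) for g in guided]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs
```

In such schemes, the interpolation weight trades off perception (the multimodal model's grounding) against reasoning (the text model's priors), which is the balance the summary attributes to ThinkOmni's components.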
Doubao 1.6 has "no weak subjects," with gaokao scores approaching Tsinghua/Peking University level
21 Shi Ji Jing Ji Bao Dao· 2025-06-28 14:29
Core Insights
- Doubao's Seed1.6-Thinking model achieved impressive scores on the 2025 college entrance examination (gaokao), with a total of 683 in liberal arts and 648 in sciences, indicating strong performance across subjects [1][2]
- The results suggest the model is competitive enough to potentially gain admission to top universities such as Tsinghua and Peking University, with predictions indicating a possible score exceeding 690 in key subjects [2][3]

Performance Summary
- Seed1.6-Thinking achieved the highest scores in Chinese, English, Physics, History, Geography, and Politics, with a mathematics score exceeding 140 [2]
- In an international test, the model also ranked among the top performers on India's JEE Advanced exam, demonstrating its capabilities in mathematics, physics, and chemistry [3]

Model Capabilities
- The Seed team clarified that the model is not biased toward specific subjects: it showed improved performance in chemistry and biology once higher-quality test images were used [4]
- The introduction of "dynamic thinking ability" (AutoCoT) lets the model adapt its reasoning process, improving performance while cutting unnecessary reasoning overhead [4][6]

Industry Implications
- AI's potential in education, particularly around the college entrance examination, has drawn attention, with AI tools being developed to assist decision-making for college applications [5]
- The Seed1.6 model integrates multimodal understanding and deep reasoning, and is now available for API access through Volcano Engine [6]
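The adaptive-reasoning idea behind "dynamic thinking" can be sketched as a simple routing decision: answer easy queries directly and spend chain-of-thought tokens only on hard ones. Everything here (the function names, the difficulty estimator, the threshold rule) is a hypothetical illustration, not Doubao's AutoCoT implementation.

```python
def answer_with_adaptive_thinking(question, estimate_difficulty,
                                  direct_answer, chain_of_thought_answer,
                                  threshold=0.5):
    """Illustrative sketch of adaptive ("dynamic thinking") inference:
    route easy queries to cheap direct answering, hard ones to a full
    chain-of-thought pass (all names are assumptions)."""
    if estimate_difficulty(question) < threshold:
        return direct_answer(question)            # skip explicit reasoning
    return chain_of_thought_answer(question)      # spend tokens reasoning
```

The practical appeal of such gating is exactly what the summary notes: preserved accuracy on hard problems while avoiding unnecessary reasoning overhead on easy ones.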