Multimodal Reasoning
Report: DeepSeek V4 Breaks Convention, Granting Early Access to Huawei and Other Domestic Vendors Instead of NVIDIA and AMD
Xin Lang Cai Jing· 2026-02-27 10:36
IT Home, Feb 27 — According to a Reuters report on Feb 26, two people familiar with the matter said that ahead of a major model update, DeepSeek did not show its upcoming flagship model to U.S. chipmakers, breaking standard industry practice. Instead, DeepSeek V4 was made available early to domestic vendors, including Huawei Technologies.

The report notes that AI developers typically share pre-release versions of major models with chipmakers such as NVIDIA and AMD to ensure their software runs efficiently on widely used hardware; DeepSeek had previously worked closely with NVIDIA's engineers. For its upcoming model, DeepSeek granted no access to NVIDIA or AMD, instead giving Chinese vendors, including Huawei, several weeks to adapt their chips. NVIDIA and AMD declined to comment; DeepSeek and Huawei did not respond to requests for comment.

Source @legit_api posted on X on Feb 26 that DeepSeek is testing a V4 Lite model, codenamed "Sealion-lite", with a 1-million-token context window and native support for multimodal reasoning. IT Home notes that earlier this month, after an update, DeepSeek began gray-release testing of up to 1M (million) tok ...
ICLR 2026 | A 7B Model Beats GPT-5? AdaReasoner Brings Proactive "Visual Tool Thinking" to Agentic Vision
机器之心· 2026-02-15 06:46
Core Insights
- The article discusses advancements in multi-modal AI reasoning, focusing on the AdaReasoner model, which excels in tool orchestration for visual reasoning tasks, outperforming larger models like GPT-5 by learning when and how to use tools effectively [2][11].

Group 1: AdaReasoner Overview
- AdaReasoner addresses fundamental issues in multi-modal reasoning by treating the decision of what, when, and how to use tools as a reasoning capability [3].
- The model demonstrates significant performance improvements, achieving an average increase of 24.9% across eight benchmarks compared to base models [31].

Group 2: Tool Usage and Learning
- AdaReasoner incorporates a training paradigm that allows models to learn tool usage as a general reasoning skill, enabling them to adopt useful tools, discard irrelevant ones, and adjust calling frequency based on task requirements [16][19].
- The model's design includes three key components: Tool Cold Start (TC), Tool-GRPO (TG), and Adaptive Learning (ADL), which enhance its ability to use tools effectively in various scenarios [20][23][25].

Group 3: Performance Metrics
- AdaReasoner-7B shows remarkable performance, with significant improvements in structured reasoning tasks, achieving near-perfect scores in several benchmarks [31].
- In specific tasks, such as VSP and Jigsaw, the model's performance improved from base scores to 97.64 and 96.60 respectively, surpassing GPT-5's performance [34].

Group 4: Adaptive Tool Behavior
- The model exhibits three adaptive behaviors: adopting useful tools, discarding irrelevant ones, and modulating tool usage frequency based on the context of the task [36][40][44].
- This adaptability allows AdaReasoner to maintain high accuracy while effectively managing tool interactions, demonstrating its capability to learn from reinforcement learning processes [37][41].

Group 5: Generalization and Robustness
- AdaReasoner's use of Adaptive Learning enhances its generalization capabilities, allowing it to transfer learned planning abilities to new tasks and agents [53].
- The model's robustness is evidenced by its ability to perform well even when tool definitions and parameters vary, indicating a strong decoupling of tool planning from surface-level text forms [46].
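The article names Tool-GRPO but does not reproduce its formulas. A minimal sketch of the group-relative advantage computation GRPO-style methods build on, together with an illustrative reward that prices each tool call so a policy learns to adopt useful tools and drop gratuitous ones; the function names and coefficients are assumptions for illustration, not the paper's actual design:

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style: score each rollout relative to its sampling group,
    # normalized by the group's standard deviation (guarding against zero).
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

def shaped_reward(correct, tool_calls, useful_calls,
                  call_cost=0.05, usefulness_bonus=0.1):
    # Illustrative shaping: correctness dominates, useful calls earn a small
    # bonus, and every call pays a cost -- so frequency of tool use is
    # modulated by how much the calls actually help.
    return float(correct) + usefulness_bonus * useful_calls - call_cost * tool_calls
```

Under this shaping, a rollout that solves a task with two useful calls outscores one that issues four calls of which only two help, matching the "adjust calling frequency" behavior described above.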
Open-Source Multimodal Reasoning's "Wall-Breaking" Moment: MMFineReason Helps a 4B Model Overtake 30B
机器之心· 2026-02-13 05:08
Core Insights
- The article highlights the significant gap between open-source multimodal models and top closed-source models like GPT-4o and Gemini, primarily due to a lack of high-quality reasoning data [2].
- The introduction of the MMFineReason framework by OpenDataLab aims to address this gap by providing a comprehensive, open-source multimodal reasoning data synthesis pipeline [2][10].

Data Challenges
- Existing open-source multimodal data is predominantly focused on simple Visual Question Answering (VQA) and natural images, with a scarcity of high-value reasoning data such as STEM charts and complex visual symbols [6].
- The quality of reasoning data is inconsistent, often characterized by short reasoning processes and insufficient granularity in annotations [6].

Performance Results
- The MMFineReason-4B model, trained on Qwen3-VL-4B, demonstrates superior reasoning capabilities, surpassing the Qwen3-VL-8B-Thinking model and approaching the performance of the 30B-parameter Qwen3-VL-30B-A3B-Thinking model [5].
- The MMFineReason-8B model outperforms both Qwen3-VL-30B-A3B-Thinking and Gemini-2.5-Flash, indicating a significant leap in performance driven by data quality rather than model architecture [8].

Data Production Pipeline
- MMFineReason employs a fully open-source and transparent data production pipeline, consisting of three main stages to ensure high-quality data generation [12].
- The final datasets include MMFineReason-1.8M, MMFineReason-586K, and MMFineReason-123K, each curated for a different level of reasoning difficulty [14].

Dataset Characteristics
- MMFineReason has a high average reasoning-chain length of 2,910 tokens, significantly longer than comparable datasets, which enhances the model's reasoning capabilities [16].
- The dataset emphasizes high-difficulty logical reasoning, with 79.4% of the data focused on mathematics, 13.8% on scientific data, and 4.6% on puzzles and games [19].

Conclusion and Future Outlook
- The open-sourcing of MMFineReason demonstrates that in the multimodal field, the key to improving model performance lies in the quality of data rather than the size of the model [23].
- The project is now available on Hugging Face and GitHub, providing comprehensive support for the open-source community [23].
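The article does not publish the pipeline's code, but the curation-by-difficulty idea behind the three dataset tiers can be sketched as bucketing synthesized samples by a reference solver's pass rate; the field name `pass_rate` and the thresholds below are hypothetical, not taken from the paper:

```python
def tier_by_difficulty(samples, easy_min=0.8, hard_max=0.3):
    # Bucket samples by how often a reference model solves them:
    # frequently solved -> easy, rarely solved -> hard, the rest -> medium.
    tiers = {"easy": [], "medium": [], "hard": []}
    for sample in samples:
        rate = sample["pass_rate"]  # fraction of sampled solutions that verify
        if rate >= easy_min:
            tiers["easy"].append(sample)
        elif rate <= hard_max:
            tiers["hard"].append(sample)
        else:
            tiers["medium"].append(sample)
    return tiers
```

Curating by solve rate rather than raw heuristics is one plausible way to concentrate "high-difficulty logical reasoning" in the smallest tier while keeping the full 1.8M pool intact.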
Lei Jun Announces Multiple Recent Xiaomi Research Papers Accepted at the Top-Tier ICLR 2026 Conference
Sou Hu Cai Jing· 2026-02-03 03:13
Core Insights
- Xiaomi's founder and CEO Lei Jun announced that multiple research achievements from the Xiaomi team have been selected for ICLR 2026, covering areas such as multimodal reasoning, reinforcement learning, GUI agents, end-to-end autonomous driving, and audio generation [1][3].

Group 1: Research Achievements
- The research paper "Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle" addresses inefficiencies in existing reinforcement learning training processes, particularly issues like Advantage Collapsing and Rollout Silencing, which hinder long-term optimization capabilities [4].
- Shuffle-R1 proposes a streamlined reinforcement learning framework that significantly enhances training efficiency through two core designs, Pairwise Trajectory Sampling and Advantage-based Batch Shuffle, leading to improved gradient-signal quality and increased exposure of valuable trajectories [4].
- Experimental results indicate that Shuffle-R1 consistently outperforms various reinforcement learning baselines with minimal computational overhead [4].

Group 2: Mobile Agents and GUI
- The paper "MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning" introduces a framework to improve the reasoning and planning capabilities of mobile GUI agents, addressing challenges such as the scarcity of high-quality CoaT trajectories and the limitations of existing self-training methods [7][8].
- MobileIPL employs Thinking-level DPO and Instruction Evolution to enhance process supervision and expand task distribution, resulting in state-of-the-art performance on mainstream GUI-agent benchmarks [8][10].

Group 3: Language Models
- "FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation" presents a modular reasoning framework for small language models (SLMs) that enhances their performance on complex tasks without additional training or parameter increments [12][13].
- FutureMind extracts advanced cognitive abilities from large language models (LLMs) through adaptive knowledge distillation, creating a dynamic reasoning pipeline that significantly improves reasoning efficiency and retrieval accuracy [12][13].

Group 4: Multimodal Reasoning
- The paper "ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding" proposes a framework that transfers mature textual reasoning capabilities to multimodal scenarios without the need for costly model fine-tuning [16][17].
- ThinkOmni includes components like LRM-as-a-Guide and Stepwise Contrastive Scaling, which balance perception and reasoning signals, demonstrating consistent performance improvements across multiple multimodal reasoning benchmarks [17].

Group 5: Audio Generation
- "Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation" introduces a two-stage audio generation framework that combines Flow Matching pre-training with lightweight GAN fine-tuning for efficient audio generation [23][24].
- The framework enhances audio modeling by addressing the unique properties of audio signals and demonstrates superior performance in generating high-fidelity audio with improved computational efficiency compared to existing methods [24].
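Shuffle-R1's two core designs are named but not specified in the article. One plausible reading, sketched below under stated assumptions (data layout, field names, and selection rules are illustrative, not the paper's): Pairwise Trajectory Sampling keeps only the extreme-reward rollouts per prompt so the reward contrast never collapses to zero, and Advantage-based Batch Shuffle reorders the batch so high-signal trajectories are exposed first.

```python
def pairwise_trajectory_sampling(rollouts_per_prompt):
    # For each prompt, keep only the best- and worst-reward rollouts,
    # guaranteeing a nonzero reward contrast (a counter to "Advantage
    # Collapsing", where a group's rewards are all nearly identical).
    pairs = []
    for rollouts in rollouts_per_prompt:
        ranked = sorted(rollouts, key=lambda r: r["reward"])
        pairs.append((ranked[-1], ranked[0]))
    return pairs

def advantage_based_shuffle(batch):
    # Order trajectories so those with the largest |advantage| (strongest
    # gradient signal) lead the batch, increasing the exposure of valuable
    # trajectories instead of letting near-zero-advantage ones dominate.
    return sorted(batch, key=lambda t: abs(t["advantage"]), reverse=True)
```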
Teaching Large Models to "Learn from Their Mistakes": NJUST, Baidu, and Others Propose a New Model-Memory Method
量子位· 2025-12-17 09:07
Core Viewpoint
- The article discusses a new method called ViLoMem, developed by Nanjing University of Science and Technology in collaboration with Baidu, which addresses the issue of large models having poor memory retention, enabling them to learn from past mistakes by separating visual and logical errors into distinct memory streams [1][5].

Group 1: ViLoMem Framework
- ViLoMem employs a dual-stream semantic memory system that allows models to remember visual and logical errors separately, enhancing their ability to learn from experience [15][16].
- The framework consists of two main components, memory generation and memory retrieval, which work together to improve the model's performance without altering its parameters [18][5].

Group 2: Memory Generation
- When a model fails on a task, ViLoMem activates two branches: a visual analysis module to identify visual errors and a logical analysis module to pinpoint logical mistakes, generating structured guidelines for both types of errors [19][20][21].
- Newly generated memories are matched for similarity with existing memories to either merge them into more abstract rules or create new memory slots, preventing memory overload while allowing for the abstraction of general semantic patterns [22][24].

Group 3: Memory Retrieval
- The retrieval strategies for visual and logical memories differ, with visual memory using a two-stage retrieval process that includes image-level similarity search and question-semantic filtering [27][28].
- Logical memory retrieval focuses on understanding the problem first before searching for relevant rules, which is more effective than simple keyword matching [29].

Group 4: Performance Improvement
- ViLoMem has shown significant performance improvements across six multimodal reasoning benchmarks, with notable gains in mathematical tasks, such as a +6.48 increase for GPT-4.1 on MathVision [2][31].
- Smaller models benefit even more from ViLoMem, with Qwen3-VL-8B achieving a +4.38 increase on MMMU [31].

Group 5: Cross-Model Memory Transfer
- An interesting experiment demonstrated that smaller models could achieve better scores by utilizing memories generated by larger models, indicating a form of "free knowledge distillation" [34][36].
- This suggests that experiences from stronger models can directly enhance the performance of weaker models without the need for fine-tuning [36].
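The article describes ViLoMem's mechanism without code. The dual-stream store with similarity-based merging can be sketched as below; the class and method names, the string-similarity stand-in (the real system would use embedding similarity), and the merge threshold are all assumptions for illustration:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Cheap stand-in for the semantic similarity the real system would
    # compute with embeddings.
    return SequenceMatcher(None, a, b).ratio()

class DualStreamMemory:
    """Sketch of the dual-stream idea: visual and logical error guidelines
    live in separate stores; a new guideline merges into an existing slot
    when sufficiently similar, otherwise it opens a new slot."""

    def __init__(self, merge_threshold=0.8):
        self.streams = {"visual": [], "logical": []}
        self.merge_threshold = merge_threshold

    def add(self, stream, guideline):
        store = self.streams[stream]
        for i, existing in enumerate(store):
            if similarity(existing, guideline) >= self.merge_threshold:
                # Merge: keep the shorter rule as a stand-in for the paper's
                # abstraction into a more general semantic pattern.
                store[i] = min(existing, guideline, key=len)
                return "merged"
        store.append(guideline)  # no near-duplicate: open a new slot
        return "new"

    def retrieve(self, stream, query, k=2):
        # Return the k guidelines most similar to the query.
        return sorted(self.streams[stream],
                      key=lambda g: similarity(g, query), reverse=True)[:k]
```

Keeping the two streams separate means a visual guideline ("check the axis labels") can never crowd out a logical one ("verify each algebra step"), which mirrors the separation the article attributes to ViLoMem.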
Transformer Author Reveals GPT-5.1 Inside Story: OpenAI's Internal Naming Rules Have Become Messy
36Kr · 2025-12-01 01:25
Core Insights
- The development of AI is not slowing down but is transitioning to a new paradigm, with a focus on reasoning models rather than just pre-training [4][10][32].
- The recent release of GPT-5.1 represents a significant stability iteration rather than a minor update, emphasizing user experience and safety improvements [14][17][19].

Group 1: AI Development Trends
- There are two contrasting views on AI growth: one claims a slowdown, while the other highlights continuous advancements with new models like GPT-5.1 and Gemini 3 [5][10].
- The internal perspective is that AI capability growth follows a smooth exponential curve, akin to Moore's Law, driven by technological iteration and computational enhancements [7][10].
- The shift from pre-training to reasoning models marks a critical turning point in AI development, with reasoning models still in their early stages and expected to progress rapidly [10][11][13].

Group 2: GPT-5.1 and Model Evolution
- GPT-5.1 is a substantial update focused on enhancing reasoning capabilities, safety, and user experience, despite appearing to be a minor version change [14][15][17].
- The naming convention for models has shifted to prioritize user experience, allowing more flexibility in development and faster iteration cycles [17][19].
- Despite these improvements, GPT-5.1 still exhibits limitations in multi-modal reasoning, as demonstrated by its inability to solve simple problems that a child could easily answer [19][20].

Group 3: Future of AI and Robotics
- AI is expected to change the nature of work without eliminating jobs, as human expertise will still be required in high-stakes scenarios [32][34].
- The next significant breakthrough in AI is anticipated to come from advancements in multi-modal reasoning and embodied intelligence, particularly in home robotics [36][34].
- Progress in robotics will depend on the integration of multi-modal capabilities and general reinforcement learning, leading to a transformative leap in home automation technologies [36][34].
Transformer Author Reveals GPT-5.1 Inside Story! OpenAI's Internal Naming Rules Have Become Messy
量子位· 2025-11-30 11:30
Core Insights
- The article discusses a significant paradigm shift in AI, indicating that the development of AI is not slowing down but rather transitioning to a new phase of growth [1][7][12].

Group 1: AI Development Trends
- There are two contrasting views on AI development: one claims that AI growth is slowing down, while the other highlights continuous advancements with new models like GPT-5.1 and Gemini 3 being released [3][12].
- Łukasz Kaiser argues that the perception of slowing growth is incorrect, stating that AI's capability growth follows a smooth exponential curve, akin to Moore's Law [15][16].
- The shift from pre-training to reasoning models is a key factor in this transition, with pre-training in the later stage of its S-curve while reasoning models are still in their early stages [18][19].

Group 2: Reasoning Models and Their Impact
- The industry is focusing on smaller, cost-effective models that maintain quality, leading to the misconception that pre-training has stalled [21].
- Reasoning models, which allow for more complex thought processes and the use of tools during inference, are expected to progress rapidly because they are still an emerging technology [22][27].
- The evolution of models like ChatGPT demonstrates a qualitative leap in performance, with newer versions incorporating reasoning and external tool usage for more accurate responses [23][24].

Group 3: GPT-5.1 Insights
- GPT-5.1 is not merely a minor update but a significant stability iteration, enhancing reasoning capabilities through reinforcement learning and synthetic data [34][35].
- The version-naming convention has shifted to focus on user experience rather than technical details, allowing for greater flexibility in development [38].
- Despite the improvements, GPT-5.1 still has limitations, particularly in multi-modal reasoning, as illustrated by its struggles with basic tasks that require contextual understanding [41][42].

Group 4: Future of AI and Robotics
- AI is expected to change the nature of work without eliminating jobs, as human expertise will still be needed in high-stakes scenarios [62][66].
- Home robots are anticipated to be the next visible AI revolution, driven by advancements in multi-modal capabilities and general reinforcement learning [67][69].
- The integration of these technologies is expected to produce a significant leap in the capabilities of home robots, making them more intuitive and perceptive than current AI models like ChatGPT [69].
Late at Night, a $3 Trillion Giant Surges
Core Viewpoint
- Google shares surged over 6% to a record high of $303.68, with a market capitalization exceeding $3.6 trillion, following the launch of its latest AI model, Gemini 3 Pro, which topped the LMArena leaderboard [2][5].

Stock Performance
- As of the latest update, Google's stock price was $302.86, up 6.54% on the day, with a trading volume of 24.97 million shares amounting to $7.4 billion [3].
- The stock reached a high of $303.68 and a low of $286.63 during the trading session [3].

AI Model Launch
- Google introduced Gemini 3, described as the company's most intelligent model to date, which integrates all capabilities of the Gemini series and is designed to assist users in learning, creating, and planning [5][7].
- Gemini 3 demonstrates PhD-level reasoning abilities and excels in various tests, showcasing its advanced multimodal reasoning, visual and spatial understanding, and multilingual capabilities [7].

User Engagement and Market Position
- The AI Overviews platform now has 2 billion monthly users, while the Gemini App has surpassed 650 million monthly active users [7].
- Over 70% of Google's cloud customers are utilizing the company's AI services, and 13 million developers are working with its generative models [7].

Competitive Advantage
- Analysts believe that Google's comprehensive AI stack, which includes TPU chips, networking, models, and applications, creates a significant competitive moat [7].
- The company's self-developed TPU chips and the leading capabilities of the Gemini model are expected to drive high growth in computing demand and present investment opportunities in AI hardware innovation [7].
Gemini 3.0 Released: From "Tool Assistance" to "Proactive Agent", Here Is What Google Did
Tai Mei Ti APP · 2025-11-19 00:32
Core Insights
- Google has launched its latest AI model, Gemini 3, which is considered a "universal player" in the industry, showcasing significant advancements over its predecessors and competing models like GPT-5.1 and Claude 4.5 [1][8].
- Gemini 3 integrates into various Google applications, including AI search products and enterprise solutions, and will be gradually rolled out to users [1][8].
- The release of Gemini 3 is strategically important for Google, as it aims to regain a competitive edge in the AI race, especially after being perceived as lagging behind since the launch of ChatGPT [8][9].

Performance Enhancements
- Gemini 3 has achieved remarkable performance in reasoning capabilities, with a GPQA Diamond test accuracy of 91.9% and a score of 37.5% in multi-step logical reasoning without tools [2].
- The model also excels in multi-modal reasoning, scoring 81% in MMMU-Pro and 87.6% in Video-MMMU tests, indicating its ability to handle complex problems across various domains [2][4].

Innovative Features
- Google introduced the Gemini 3 Deep Think mode, which enhances reasoning through "thought signatures" and "thinking levels," achieving scores of 41.0% and 93.8% in the relevant tests [3].
- The model supports a context length of up to 1 million tokens, significantly surpassing competitors and previous versions, allowing for complex task handling [4].

Development and Collaboration
- Gemini 3 redefines developer collaboration with innovations like "Agentic Coding" and "Vibe Coding," achieving a high Elo score of 2439 in competitive-programming tests [5].
- The model's agent capabilities allow it to autonomously plan and execute tasks, demonstrated by its performance in the Terminal-Bench 2.0 and Vending-Bench 2 tests [6].

Strategic Implications
- The launch of Gemini 3 is expected to accelerate AI technology innovation across the industry, pushing competitors to enhance their offerings in reasoning, multi-modal integration, and agent development [9].
- For enterprises and developers, Gemini 3 provides a scalable and customizable AI foundation, facilitating the transition of AI from experimental phases to practical applications in everyday life [8][9].
Gemini 3 Officially Released
小熊跑的快· 2025-11-19 00:09
Core Insights
- Google has officially launched Gemini 3, the most powerful multimodal understanding model to date, enhancing interactive experiences and reasoning capabilities [1][4].
- Gemini 3 Pro and Gemini 3 Deep Think are the key versions, with the latter showing superior performance on reasoning tasks [4][10].

Performance Metrics
- Gemini 3 Pro achieved a score of 1501 Elo, ranking first on the LMArena leaderboard, and demonstrated doctoral-level reasoning with a 37.5% score on Humanity's Last Exam [1][3].
- In various benchmarks, Gemini 3 Pro outperformed previous models, achieving 91.9% on GPQA Diamond and 23.4% on MathArena Apex [3][4].
- Gemini 3 Deep Think further improved performance, scoring 41.0% on Humanity's Last Exam and 93.8% on GPQA Diamond [4].

Multimodal Capabilities
- Gemini 3 is designed to seamlessly integrate information across text, images, videos, audio, and code, pushing the boundaries of multimodal reasoning [6].
- It can generate interactive learning materials and analyze performance in various activities, such as sports [7].

Developer Tools and Platforms
- Gemini 3 enhances developer efficiency through vibe coding and agentic coding, leading to significant improvements in software-development tasks [8][10].
- Google Antigravity, a new development platform, allows developers to build in a task-oriented manner, transforming AI into a proactive partner [9][10].

User Experience
- Google AI Ultra subscribers can access Gemini's advanced capabilities, enabling more effective long-term planning and task execution [11].