机器之心

A counterintuitive finding in multimodal post-training: the synergy dilemma of Long-CoT SFT and RL
机器之心· 2025-08-02 00:55
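In language models, the combination of long chain-of-thought supervised fine-tuning (Long-CoT SFT) and reinforcement learning (RL) is regarded as a golden pairing: the model first learns thinking patterns, then a reward mechanism optimizes its outputs, and the performance gains usually stack. But recent research from Huawei and the Hong Kong University of Science and Technology found something unexpected: in multimodal vision-language models (VLMs), this pairing struggles to deliver synergistic gains, and the two methods sometimes even hold each other back. A key insight behind the study is that multimodal reasoning benchmarks differ subtly from pure-language ones. Text reasoning tasks usually focus on logically demanding problems, whereas multimodal benchmarks usually mix simple perception-based questions with complex cognitive reasoning challenges. The authors hypothesize that this heterogeneity is the core reason Long-CoT SFT and RL behave differently in multimodal settings. To explore how various post-training techniques affect performance on different types of questions, the authors introduce a simple and effective difficulty classification method and use it to build a set of multimodal reasoning benchmark datasets refined by difficulty level (including new versions of MathVision, MathVerse, MathVista, MMMU val, and MMStar val). The method assigns each question to one of five levels (L1-L5) according to the success rate of the baseline model Qwen2.5-VL-Instruct-7B over 16 independent runs on each question in the five datasets; the levels respectively represent ...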
Chart reasoning with chain-of-thought supervision and reinforcement: a 7B model rivals large closed-source models
机器之心· 2025-08-01 04:23
Core Viewpoint - The article discusses the emergence of the Chart-R1 model developed by the DocTron team, which utilizes a chain-of-thought supervision and reinforcement learning approach to enhance chart reasoning capabilities, particularly in complex multi-step numerical reasoning tasks [2][20]. Innovation and Technical Breakthroughs - The Chart-R1 model introduces a novel procedural data synthesis technique that generates high-quality reasoning data, resulting in the creation of the ChartRQA dataset containing 258,000 multi-step reasoning samples, ensuring data diversity and authenticity [7][22]. - The model employs a unique two-stage training strategy that utilizes different datasets for each stage, preventing the degradation of the model's exploratory capabilities during reinforcement learning [10][22]. Experimental Results and Performance - Chart-R1 demonstrates superior performance across various public benchmark tests and the self-constructed ChartRQA dataset, outperforming existing chart domain methods and rivaling large closed-source models like GPT-4o and Claude-3.5 in multiple tasks [16][20]. - In complex chart reasoning tasks, while existing visual language models show significant performance drops, Chart-R1 maintains a consistently high level of performance, highlighting its effectiveness in complex reasoning scenarios [17][20]. Research Significance and Application Prospects - The research not only achieves technical breakthroughs but also opens new avenues for chart understanding and reasoning, with potential applications in business intelligence analysis, scientific research data interpretation, and financial report analysis, significantly enhancing automated analysis efficiency [19][20]. 
- The success of Chart-R1 indicates that even models with relatively smaller parameter scales can achieve performance comparable to large closed-source models in specific domains, providing valuable insights for building efficient, domain-specific AI models and guiding future multi-modal reasoning research [20][21].
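The world's first general-purpose AI research agent arrives: as a humanities graduate, I used it to write a review report on CRISPR gene editing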
机器之心· 2025-08-01 04:23
Core Viewpoint - The article discusses the emergence of SciMaster, an AI scientific assistant developed by Shanghai Jiao Tong University, DP Technology, and Shanghai Algorithm Innovation Institute, which is claimed to be the world's first truly general-purpose scientific AI agent [5][10]. Group 1: Introduction to SciMaster - SciMaster has gained significant attention in the research community, with its invitation codes being sold for nearly a thousand yuan, indicating high demand [5]. - It integrates advanced capabilities such as literature search, theoretical calculations, experimental design, paper writing, and collaboration, significantly enhancing research efficiency [7][11]. Group 2: Macro Trends in AI - The AI field is transitioning from data and computing power reliance to practical applications, as noted by mathematician Terence Tao [9]. - The concept of an "AI scientist" is at the forefront of this trend, with SciMaster filling a gap in the availability of practical AI research assistants [10]. Group 3: Functional Capabilities of SciMaster - SciMaster covers the entire research process, including reading, calculating, conducting experiments, and writing reports [11]. - It utilizes a vast database of 170 million research documents to provide reliable information and can trace every assertion back to its source [11][14]. - The system can perform calculations and execute experiments through integration with automated laboratory systems [14][15]. Group 4: Performance and Testing - SciMaster has demonstrated its capabilities by achieving a new state-of-the-art score of 32.1% on the Humanity's Last Exam benchmark, surpassing competitors like OpenAI and Google [28]. - The assistant can handle general queries and conduct deep research, providing comprehensive reports based on extensive data collection and analysis [30][31]. 
Group 5: Future Prospects - The development of SciMaster represents a significant step towards a new era of collaborative scientific exploration between humans and AI [16][49]. - The company aims to expand SciMaster's capabilities to cover a broader range of scientific knowledge, indicating a commitment to advancing AI in research [50].
Taming complex tables: JiuTian's major open-source release opens a new era of "human-table dialogue" intelligence
机器之心· 2025-08-01 04:23
Core Viewpoint - China Mobile's JiuTian AI Research Institute has fully open-sourced the JiuTian structured data model, aiming to lower the technical barriers and development costs for structured data applications, thereby promoting industry innovation and collaboration [2][30]. Group 1: Structured Data System - The structured data system includes a comprehensive, multi-dimensional, and deep-level data framework specifically designed for table data, which is crucial for training structured data models [4]. - China Mobile has collected and organized 39 public datasets and some real internet data, covering over 300 different fields, including telecommunications, meteorology, academia, manufacturing, finance, education, and healthcare [4][9]. - The JiuTian structured data model is built on a self-developed foundational language model and is optimized for structured data processing, featuring capabilities such as multi-table association analysis and interactive visualization [15][18]. Group 2: TReB Evaluation Framework - The TReB (Table Reasoning Benchmark) is a comprehensive evaluation framework designed to assess the reasoning capabilities of large models in table-related tasks, consisting of a cleaned dataset and a robust evaluation framework [7][9]. - TReB includes 26 tasks related to table reasoning and employs strict data cleaning processes to ensure the quality of each table and question pair [9][23]. - The TReB evaluation results indicate that the JiuTian structured data model outperforms other open-source models in various reasoning capabilities [23][25]. Group 3: Practical Applications - The JiuTian structured data model has been implemented in various industries, including energy, transportation, and logistics, enhancing operational efficiency and safety through real-time predictions and analyses [27]. 
- In industrial production, the model provides diverse production warning scenarios by analyzing key operational parameters, thereby improving management efficiency [27]. - In logistics, the model aids warehouse management by providing scientific decision support for optimizing inventory layout and resource allocation [27]. Group 4: Future Directions - China Mobile plans to continue developing AI and industry integration applications, further open-sourcing the structured data model system to accelerate the large-scale implementation of structured data intelligence technology [30]. - A series of technical live broadcasts will be conducted to explain the foundational models, open-source models, and datasets, providing the latest technical insights [31].
Is it a foregone conclusion that China will overtake the US in AI? Andrew Ng: the US cannot maintain its lead
机器之心· 2025-08-01 04:23
Core Viewpoint - China has become a significant force in the global AI competition, rapidly closing the gap with the US in key benchmarks like MMLU and HumanEval, where the difference has decreased from nearly double digits to almost even [1][6]. Group 1: AI Development in China - The WAIC conference showcased the rapid advancements in AI applications, agents, and new models in China [2]. - China's open-source model ecosystem and aggressive semiconductor design and manufacturing efforts are driving strong growth, indicating a potential path to surpass the US in AI [8][15]. - The competitive business environment in China, along with fast knowledge diffusion mechanisms, provides significant momentum for its AI sector [9]. Group 2: US AI Strategy - President Trump has recognized the need to accelerate the development of the US AI industry, announcing a new AI Action Plan aimed at encouraging growth with minimal regulation [4][5]. - The US maintains a lead in proprietary models, with major companies like Google and OpenAI developing strong closed-source models [11]. - The White House's AI Action Plan supports open-source initiatives, which is a positive signal for maintaining US leadership, but may not be sufficient for long-term dominance [9]. Group 3: Competitive Dynamics - The AI race is characterized by a lack of a single endpoint, with continuous incremental advancements rather than a definitive breakthrough [10]. - The competition between China and the US reflects differing philosophies: China's open-source approach fosters rapid knowledge flow, while the US's closed-source strategy focuses on individual competitive advantages [19]. - Despite supply chain constraints, Chinese companies are achieving world-class innovations, demonstrating resilience and capability in the AI space [19].
Manus gets a major upgrade: more than 100 agents work on your tasks concurrently
机器之心· 2025-08-01 01:30
Core Viewpoint - Manus has launched a new feature called "Wide Research," enabling users to assign tasks to hundreds of AI agents for extensive research, marking a significant advancement in AI capabilities [2][6][12]. Group 1: Introduction of Wide Research - Manus's multi-agent platform has transformed the application of AI tools, and the new "Wide Research" capability aims to parallel the importance of "Deep Research" [3][4]. - The feature allows users to execute large-scale tasks with over 100 concurrent AI agents focusing on a single task or a series of sub-tasks [6][10]. Group 2: Functionality and Applications - Wide Research can analyze various datasets, such as ranking the top 100 MBA programs or analyzing over 1,000 stocks, and is not limited to data analysis but also includes creative tasks [10][12]. - The system's flexibility comes from its architecture, which allows for system-level parallel processing and inter-agent communication, expanding computational capacity significantly [12][23]. Group 3: User Experience and Challenges - Initial user experiences have highlighted issues such as slow agent speeds, high token consumption, and limited visibility during task execution [20][21]. - Users have reported common pain points, including a lack of coordination protocols among agents and performance instability during high-load periods [21][22]. Group 4: Future Implications - Manus believes that Wide Research represents a step towards achieving a true general AI workflow, with the potential to unlock capabilities beyond research [14][23]. - The infrastructure behind Wide Research is expected to lay the groundwork for future products, emphasizing the importance of agent-to-agent collaboration [23].
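Meta, hooked on poaching talent, is called out by an employee again: with no help promoting projects, open source will only get worse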
机器之心· 2025-08-01 01:30
Core Viewpoint - Meta is facing internal turmoil and inefficiencies despite significant investments in AI research, with a focus on the challenges of promoting research within the company and the implications of open-source projects [2][5][20]. Group 1: Internal Challenges - Meta has invested over $14 billion in AI, establishing the Meta Superintelligence Labs (MSL) to attract top talent from leading AI companies [2]. - Internal conflicts regarding resources, personnel, and management have been reported, with criticisms of Meta's organizational culture and inefficiencies [2][9]. - A researcher, Zeyuan Allen-Zhu, expressed frustration over the lengthy approval process for promoting his work, indicating a lack of support for AI projects within Meta [5][20]. Group 2: Open Source and Research Promotion - Allen-Zhu's project, "Physics of Language Models," was released as open-source but received minimal attention, raising questions about the necessity of open-sourcing research [11][12]. - The approval process for using public datasets and releasing model weights is cumbersome, often taking over two months, which hinders research progress [20]. - Discussions around the importance of open-source in AI research have emerged, with some industry leaders advocating for its role in fostering collaboration and innovation [14][15]. Group 3: Industry Sentiment and Future Directions - Allen-Zhu noted that many AI professionals are anxious about industry changes and encouraged them to proactively seek opportunities rather than waiting for layoffs [8]. - He acknowledged the possibility of leaving Meta in the future but emphasized the importance of his current projects [8]. - Allen-Zhu's account corroborates former employees' criticisms of Meta's internal culture, indicating ongoing issues within its organizational structure [9].
Robots can do more than pick and place! Peking University x Galbot's "world-action model" enables fully generalizable non-prehensile skills
机器之心· 2025-08-01 01:30
Core Viewpoint - The article discusses the development of a new model called Dynamics-adaptive World Action Model (DyWA) aimed at enhancing non-prehensile manipulation skills in robots, which are essential for performing complex tasks in real-world environments [3][10]. Group 1: Non-prehensile Manipulation - Non-prehensile manipulation refers to actions that do not involve grasping, such as pushing or flipping objects, which are crucial for handling various shapes and sizes in complex environments [3][5]. - Current robot models primarily focus on pick-and-place operations, limiting their effectiveness in dynamic and intricate tasks [3][5]. Group 2: Challenges in Non-prehensile Manipulation - The main challenges include complex contact modeling, where slight changes in friction can drastically alter movement trajectories, and the need for high-quality perception systems to understand object states and interactions [5][8]. - Traditional physical modeling methods struggle with real-world applications due to their reliance on precise object properties, which are often difficult to obtain [7][9]. Group 3: DyWA's Methodology - DyWA employs a teacher-student framework to train a model that predicts future states based on actions, allowing robots to "imagine" the outcomes of their movements [11]. - It incorporates a dynamic adaptation mechanism that infers hidden physical properties from historical observations, enhancing the robot's ability to interact with various surfaces and object weights [12][13]. - The model is designed to work with single-view inputs, making it feasible for real-world deployment without the need for complex multi-camera setups [14]. Group 4: Performance and Generalization - DyWA has demonstrated superior performance in simulations, achieving over 80% success rates in various scenarios, including known and unknown object states [17][18]. 
- In real-world tests, DyWA successfully adapted to different object shapes and surface frictions, achieving nearly 70% success in pushing unseen objects to target positions [20][24]. - The model's robust closed-loop adaptation allows it to learn from failures and improve its manipulation strategies over time [26].
ACL 2025 main conference paper | TRIDENT: enhancing LLM safety through three-dimensional diversified red-teaming data synthesis
机器之心· 2025-07-31 08:58
Core Insights - The article discusses the TRIDENT framework, which addresses the safety risks associated with large language models (LLMs) by introducing a three-dimensional diversification approach for red-teaming data synthesis [2][24]. Background - The safety risks of LLMs are a significant barrier to their widespread adoption, with existing datasets focusing primarily on vocabulary diversity rather than malicious intent and jailbreak strategy diversity [1][11]. Methodology - TRIDENT employs a persona-based and zero-shot automatic generation paradigm, combined with six jailbreak techniques, to produce high-quality red team data at low cost [2][5]. - The framework includes a three-dimensional risk coverage assessment that quantitatively measures diversity and balance across vocabulary, malicious intent, and jailbreak strategies [9]. Experimental Results - TRIDENT-CORE and TRIDENT-EDGE datasets were generated, containing 26,311 and 18,773 entries respectively, covering vocabulary and intent, as well as introducing jailbreak strategies [9]. - In comparative benchmarks, TRIDENT-EDGE models achieved the lowest average Harm Score and Attack Success Rate while maintaining or improving Helpful Rate compared to other datasets [20][22]. Breakthrough Significance - TRIDENT provides a sustainable and low-cost solution for LLM safety alignment, integrating seamlessly into existing training pipelines like RLHF and DPO [24]. - The framework is designed to evolve continuously with model updates and emerging threats, ensuring its relevance in a rapidly changing landscape [25].
When a prompt optimizer learns to evolve, it can even beat reinforcement learning
机器之心· 2025-07-31 08:58
Core Viewpoint - The article discusses the introduction of GEPA (Genetic-Pareto), a new optimization technique that outperforms the GRPO reinforcement learning algorithm by 20% while significantly reducing the number of rollouts to 1/35 of the original [2][39]. Group 1: GEPA Overview - GEPA employs a technique called reflective prompt evolution, which enhances the performance of composite AI systems [2][6]. - The core principles of GEPA include genetic prompt evolution, utilizing natural language feedback, and Pareto-based candidate selection [7][8]. Group 2: GEPA Algorithm - GEPA initializes a candidate pool with parameters from the composite AI system and iteratively proposes new candidates until the evaluation budget is exhausted [12][15]. - The optimization process involves mutation or crossover of existing candidates, allowing GEPA to accumulate learning signals and improve candidate performance over iterations [16][17]. Group 3: Reflective Feedback Mechanism - Natural language trajectories generated during the execution of the composite AI system provide insights into the reasoning steps, enabling diagnostic value for decision-making [19][20]. - GEPA utilizes these trajectories for implicit credit assignment, allowing targeted updates to modules based on their performance [21][22]. Group 4: Candidate Selection Strategy - GEPA employs a Pareto-based candidate selection strategy to avoid local optima and ensure a balance between exploration and exploitation [27][30]. - This strategy involves identifying candidates that have achieved the best scores across training tasks, filtering out strictly dominated candidates [31][32]. Group 5: Performance Evaluation - Experimental results show that GEPA consistently outperforms MIPROv2 and GRPO across various benchmarks, achieving improvements of up to 14.29% [42][39]. - GEPA demonstrates high sample efficiency, outperforming GRPO while requiring significantly fewer rollouts [39][41]. 
Group 6: Observations and Insights - The next candidate selection strategy significantly impacts optimization trajectories and final performance, with Pareto-based sampling showing clear advantages [43]. - Optimized prompts from GEPA are shorter and more efficient than few-shot demonstration prompts, enhancing computational efficiency [45]. - A unique system-aware crossover strategy, GEPA+Merge, yields additional performance gains by identifying complementary strategies from different optimization lineages [47].
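The Pareto-based candidate selection described in Group 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GEPA's actual implementation: the function names `pareto_candidates` and `sample_candidate`, the score-matrix layout, and the choice to sample in proportion to the number of tasks a candidate leads are all hypothetical choices made for this example.

```python
import random
from typing import Dict, List


def pareto_candidates(scores: List[List[float]]) -> Dict[int, int]:
    """Return {candidate index: number of tasks it leads} after pruning.

    scores[c][t] is the score of candidate c on training task t
    (a hypothetical layout chosen for this sketch).
    """
    n_tasks = len(scores[0])

    # Step 1: for each task, record every candidate tied for the best score.
    wins: Dict[int, int] = {}
    for t in range(n_tasks):
        best = max(row[t] for row in scores)
        for c, row in enumerate(scores):
            if row[t] == best:
                wins[c] = wins.get(c, 0) + 1

    # Step 2: drop leaders that are strictly dominated by another leader,
    # i.e. no better on any task and strictly worse on at least one.
    def dominated_by(a: int, b: int) -> bool:
        no_worse = all(scores[b][t] >= scores[a][t] for t in range(n_tasks))
        better = any(scores[b][t] > scores[a][t] for t in range(n_tasks))
        return no_worse and better

    return {
        c: w
        for c, w in wins.items()
        if not any(dominated_by(c, other) for other in wins if other != c)
    }


def sample_candidate(scores: List[List[float]], rng: random.Random) -> int:
    """Pick the next candidate to mutate from the pruned set.

    Assumption: weight each survivor by how many tasks it leads, so broadly
    strong candidates are explored more often without excluding specialists.
    """
    front = pareto_candidates(scores)
    ids = sorted(front)
    return rng.choices(ids, weights=[front[c] for c in ids], k=1)[0]
```

For example, with `scores = [[0.9, 0.2, 0.5], [0.4, 0.8, 0.5], [0.3, 0.1, 0.4], [0.9, 0.2, 0.4]]`, candidates 0 and 1 each lead two tasks, candidate 2 leads none, and candidate 3 ties on the first task but is strictly dominated by candidate 0, so `pareto_candidates(scores)` returns `{0: 2, 1: 2}`. Keeping a specialist such as candidate 1, rather than always choosing the single best average scorer, is what lets this kind of selection avoid local optima.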