
Qwen3 Turned Into a Diffusion Language Model? It Runs Without Training From Scratch, and Its 30B Parameters Set a Record
机器之心· 2025-10-12 04:05
Core Insights
- The article covers RND1-Base, the largest open-source diffusion language model (DLM) to date, which aims to overcome the training-efficiency and scalability challenges faced by traditional autoregressive (AR) models [2][3][6].

Group 1: Model Development
- RND1-Base is a 30-billion-parameter sparse MoE model with 3 billion active parameters, derived from the pre-trained AR model Qwen3-30B-A3B and trained on 500 billion tokens to achieve full diffusion behavior [6].
- The research team at Radical Numerics demonstrated that scaling diffusion language models beyond 8 billion parameters is both feasible and effective [9].

Group 2: Performance Evaluation
- RND1 was evaluated on benchmarks including MMLU, ARC-C, RACE, and BBH, showing stable performance that surpasses existing models such as Dream-7B and LLaDA-8B while retaining the strengths of its AR foundation [7].
- RND1 was not compared against the latest LLaDA model (LLaDA-MoE-7B-A1B), so further comparisons are needed to determine which model is superior [9].

Group 3: Training Methodology
- The research identified the key levers in autoregressive-to-diffusion (A2D) conversion, such as initialization strategy, layer-wise learning rates, and critical batch size, which contribute to scalability and stability [10].
- A simpler method, Simple Continual Pretraining (SCP), matched the performance of more complex A2D conversion pipelines while effectively retaining knowledge from AR pre-training [13][14].

Group 4: Training Efficiency
- The study found that A2D conversion performs better at larger batch sizes, indicating that diffusion language models can effectively exploit large batches during continued pre-training [15][17].
- The article emphasizes replacing the causal attention mask with a bidirectional mask at initialization and then continuing pre-training under a masked-diffusion objective [18].

Group 5: Company Vision
- Radical Numerics aims to build an automated AI research platform that recursively improves itself, with RND1 among the first tangible outcomes of that vision [20].
- The founding team includes members from top institutions such as DeepMind and Stanford, focusing on hybrid architectures and innovative technologies [21].
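The A2D recipe summarized above, swapping the causal attention mask for a bidirectional one and then continuing pre-training under a masked-diffusion loss, can be illustrated with a minimal sketch. The function names and the per-batch noise schedule are my assumptions, not RND1's actual code:

```python
import numpy as np

def causal_mask(seq_len):
    # AR attention: position i may attend only to positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len):
    # A2D step 1: replace the causal mask so every position may attend
    # to every other position.
    return np.ones((seq_len, seq_len), dtype=bool)

def masked_diffusion_batch(tokens, mask_token_id, rng, mask_prob=None):
    # A2D step 2: corrupt a random fraction of tokens with the mask id;
    # the masked-diffusion loss is computed only on corrupted positions.
    if mask_prob is None:
        mask_prob = rng.uniform(0.1, 1.0)  # noise level drawn per batch (assumed schedule)
    noise_mask = rng.random(len(tokens)) < mask_prob
    corrupted = np.where(noise_mask, mask_token_id, tokens)
    return corrupted, noise_mask  # model is trained to recover tokens[noise_mask]
```

Continued pre-training then repeatedly samples such corrupted batches and minimizes cross-entropy on the masked positions instead of a next-token loss.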
How Will RL Improve the Generalization of Embodied VLA Models? A Tsinghua University Team's NeurIPS 2025 Paper Analyzes the Generalization Gap Between RL and SFT
机器之心· 2025-10-12 02:41
Core Insights
- The article discusses the potential of Vision-Language-Action (VLA) large models in embodied intelligence, highlighting how poorly current supervised fine-tuning (SFT) methods generalize to new environments and tasks, and the advantages of reinforcement learning (RL) in improving the generalization of VLA models [2][4].

Group 1: Research Findings
- A new evaluation benchmark was built to probe the limited generalization of VLA models, comparing how RL and SFT improve model robustness across visual, semantic, and execution challenges [4].
- Experiments showed that RL algorithms such as Proximal Policy Optimization (PPO) significantly improved robustness in semantic understanding and task execution, while matching SFT in visually varied scenarios [4][11].

Group 2: RL Methodology
- The team tested three RL algorithms: PPO, Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). PPO outperformed DPO and GRPO on multi-step decision tasks, owing to the partially observable Markov decision process (POMDP) nature of robotic tasks [9][11].
- To make PPO training on VLA models efficient, three key innovations were introduced: a shared Actor-Critic architecture that cuts memory usage by 45% and raises training speed by 35%, a warm-up strategy using 140 high-quality trajectories that improves convergence speed by 50%, and reducing PPO training to a single epoch, which shortens training time substantially [13][15].

Group 3: Comparison of SFT and RL
- Probing the data-scale limits of SFT, the team found that performance saturates at around 16,000 demonstration trajectories. RL, by contrast, achieved a 42.6% performance improvement on out-of-distribution tasks, indicating superior generalization [18][19].
- A comprehensive evaluation benchmark dissected the generalization differences between SFT and RL along visual, semantic, and execution dimensions, with RL showing clear advantages in semantic understanding and execution robustness [21][23].

Group 4: Practical Implications
- The study underscores the core value of RL for building truly generalizable embodied agents, which matters increasingly as robotic applications grow more complex and variable. The team has open-sourced RLinf, a large-scale RL framework for embodied intelligence, to facilitate further research [25].
- Case-by-case visual analysis revealed deeper differences, such as RL's ability to stay on task under noise and to handle unseen objects, in contrast with SFT's tendency to get stuck in repetitive actions [26].
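The shared Actor-Critic idea above, one backbone serving both the policy and value heads so backbone parameters are stored and updated only once, can be sketched as follows. The layer sizes and the single-linear "backbone" are illustrative stand-ins, not the team's actual VLA architecture:

```python
import numpy as np

class SharedActorCritic:
    # One shared backbone feeds both heads, so its parameters exist once
    # instead of twice -- the source of the memory saving described above.
    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.w_backbone = rng.normal(0.0, 0.1, (obs_dim, hidden_dim))
        self.w_policy = rng.normal(0.0, 0.1, (hidden_dim, n_actions))
        self.w_value = rng.normal(0.0, 0.1, (hidden_dim, 1))

    def forward(self, obs):
        h = np.tanh(obs @ self.w_backbone)      # shared features
        logits = h @ self.w_policy              # actor head: action logits
        value = (h @ self.w_value).squeeze(-1)  # critic head: state value
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return probs / probs.sum(axis=-1, keepdims=True), value
```

In PPO, both the policy loss and the value loss then backpropagate through the same backbone weights.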
He Once Turned Down $1.5 Billion: Super-Talent Andrew Tulloch Returns to Meta, and Thinking Machines Lab Loses a Co-Founder
机器之心· 2025-10-12 02:41
Core Viewpoint
- Meta's aggressive recruitment strategy, epitomized by the high-profile attempt to lure Andrew Tulloch back, highlights the company's ongoing push to strengthen its AI capabilities despite earlier rejections of lucrative offers [1][11].

Group 1: Recruitment and Offers
- Mark Zuckerberg's recruitment efforts included a dramatic offer exceeding $1 billion to Andrew Tulloch, which was initially declined [2][11].
- Tulloch, a prominent figure in AI with a strong academic background and extensive experience at Meta and OpenAI, is a valuable asset for any tech company [7][8].
- Despite rejecting the initial offer, Tulloch ultimately decided to join Meta, marking a shift in his career path [5][12].

Group 2: Background of Andrew Tulloch
- Tulloch graduated with top honors in mathematics from the University of Sydney and later earned a master's degree from Cambridge University [7].
- He has over 11 years of experience at Meta, contributing significantly to its machine learning systems and advertising platforms [7].
- After leaving Meta, Tulloch played a key role at OpenAI, working on advanced models such as GPT-4 and GPT-4.5 [9][11].

Group 3: Implications for Meta
- Tulloch's return comes amid internal management changes, raising questions about how his expertise will shape the company's AI initiatives [12].
From Components to Systems: How Should Agents Be Evaluated?
机器之心· 2025-10-12 01:27
--- This week we unpack 2 AI & Robotics industry stories worth a close read ---

1. From components to systems: how should Agents be evaluated?
Why do Agents need new evaluation benchmarks? How does an Agent's role fundamentally differ from an LLM's? How is the Agent-evaluation paradigm evolving? How does the GAIA series push the boundaries of Agent evaluation? What design philosophies do MCP-Universe, MCPMark, and MCP-AgentBench reflect? ...

2. After CoT, how does CoF turn inter-frame logic from "implicit alignment" into "explicit thinking"?
Is CoT merely a "surface narrative in language" rather than genuine reasoning? How does CoF translate the "chain of thought in language" into a "chain of frames in video"? Why is CoF considered a potential "new paradigm" for video generation models, and what advantages does it hold over traditional inter-frame consistency optimization? From CoF-Data to VChain, how are researchers embedding "reasoning chains" into every frame? Before CoF, how did video models maintain "inter-frame consistency"? ...

The full edition of this newsletter contains 2 in-depth topic readings plus 34 AI & Robotics news briefs from this week: 13 technical items, 7 domestic, and 14 international. 机器之心P ...
Trouble Brews Again: OpenAI Accused of Using the Police to Pressure AI-Regulation Advocates, While Musk Blasts It as "Built on Lies"
机器之心· 2025-10-11 08:06
Core Viewpoint
- The article covers the controversy around OpenAI's legal actions against Nathan Calvin, an advocate for AI regulation, the implications of California's recently passed SB 53 bill, and OpenAI's response to criticism over transparency and governance [1][2][3].

Group 1: Legal Actions and Controversy
- Nathan Calvin, a lawyer and member of the Encode organization, received a subpoena from OpenAI demanding private communications involving California legislators and former OpenAI employees [2][3].
- The subpoena is linked to SB 53, which requires large AI developers to disclose their safety protocols and update them regularly, effective September 30 [3][4].
- OpenAI's actions are perceived as an attempt to intimidate critics and to probe for funding ties to Elon Musk, who has been vocal against the company [4][5].

Group 2: Reactions and Implications
- Calvin voiced his dissatisfaction with OpenAI's tactics, arguing that the company is using legal means to suppress dissent and control the narrative around AI governance [4][5].
- Other organizations, such as the Midas Project, have reported similar experiences, suggesting a broader pattern of legal scrutiny aimed at transparency advocates [5].
- OpenAI's Chief Strategy Officer defended the moves as necessary to protect the company's interests amid ongoing litigation with Musk, questioning the motives behind Encode's support for him [7][8].
NeurIPS 2025 Spotlight | PhysX-3D:面向真实物理世界的3D资产生成范式
机器之心· 2025-10-11 08:06
Core Insights
- The article presents PhysXNet, the first 3D dataset systematically annotated with physical properties, addressing the gap between virtual 3D assets and real-world physics [6][9][27].
- It introduces PhysXGen, a novel framework for generating 3D assets with physical attributes, enhancing the realism and applicability of 3D models across fields [9][18][27].

Dataset Overview
- PhysXNet contains over 26,000 annotated 3D objects with detailed physical properties, while the extended PhysXNet-XL holds over 6 million procedurally generated 3D objects [9][10][16].
- The dataset covers five core dimensions: physical scale, material, affordance, kinematic information, and textual description, providing a comprehensive resource for 3D modeling [6][9][27].

Annotation Process
- A human-in-the-loop annotation framework was developed to collect and label physical information efficiently while ensuring high data quality [11][13].
- Annotation proceeds in two main stages: initial data collection and determination of kinematic parameters, with models such as GPT-4o used to improve accuracy [13][11].

Generation Methodology
- PhysXGen jointly models physical attributes with geometric structure and appearance, optimizing both objectives to generate realistic 3D assets [18][27].
- The framework significantly improves the generation of physical properties relative to existing methods, with relative gains across multiple dimensions [23][24].

Experimental Results
- Evaluations of PhysXGen show notable gains in both geometric quality and physical-property accuracy, outperforming baseline methods on multiple metrics [20][21][23].
- Relative to traditional approaches, results improve by 24% on physical scale, 64% on material, 28% on kinematic parameters, and 72% on affordance [23][24].

Conclusion
- The article emphasizes the importance of bridging the gap between 3D assets and real-world physics, highlighting the potential impact of PhysXNet and PhysXGen on fields such as embodied AI, robotics, and 3D vision [27].
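The five annotation dimensions listed above could be represented by a record schema like the following. The field names and sample values are hypothetical illustrations, not PhysXNet's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class PhysXRecord:
    # Illustrative schema for one annotated object, covering the five
    # dimensions described above; all names here are assumptions.
    object_id: str
    description: str                 # textual description
    scale_m: float                   # absolute physical scale (meters)
    material: str                    # e.g. "wood", "steel"
    affordances: list = field(default_factory=list)  # e.g. ["graspable"]
    kinematics: dict = field(default_factory=dict)   # joint type, axis, range

# A hypothetical entry:
record = PhysXRecord(
    object_id="cabinet_0001",
    description="a two-door wooden cabinet",
    scale_m=1.2,
    material="wood",
    affordances=["openable", "containable"],
    kinematics={"joint_type": "revolute", "range_deg": (0, 110)},
)
```

Keeping kinematics as structured fields (rather than free text) is what makes such a dataset usable for downstream simulation and robotics pipelines.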
The First Deployed AIOS Comes From vivo: Personalized Intelligence That Mirrors Human Thinking, and a New Way to Use Your Phone
机器之心· 2025-10-11 04:18
Core Viewpoint
- The article emphasizes the practical application of generative AI, showcasing vivo's advances in AI that improve user experience and privacy through on-device processing and personalized intelligence [6][30].

Group 1: AI Capabilities and Innovations
- vivo introduced the "One Model" concept, a lightweight 3B on-device multimodal reasoning model aimed at delivering a sustainable AI experience focused on personalization rather than a race for parameters [8][9].
- The new capabilities center on a 3-billion-parameter model that runs smoothly on flagship mobile SoCs, achieving performance comparable to industry-leading 4B language models with a 60% reduction in parameters [9][11].
- The Blue Heart 3B model supports both language and multimodal tasks, allowing complex reasoning to run locally on device, improving efficiency and privacy [13][20].

Group 2: User Experience and Personalization
- Integrating AI into the mobile operating system yields a seamless experience in which the AI acts as a personal assistant that understands and executes tasks without relying on cloud services [15][18].
- The AIOS framework is designed to mimic human cognitive processes, enabling real-time perception, memory, execution, and autonomous planning, which significantly improves task efficiency [20][21].
- vivo's approach stresses integrating personal data to create an AI experience that is both efficient and secure [18][30].

Group 3: Ecosystem and Collaboration
- vivo aims to build an open ecosystem with developers and partners, enabling rapid deployment of new AI capabilities and applications [23][26].
- Partnerships bolster its AI offerings, such as the collaboration with Ant Group's AI health application AQ, which provides comprehensive medical services [28][29].
- vivo's vision is to equip over 300 million devices with robust local AI capabilities within the next three to five years, signaling a strong commitment to advancing AI technology [31].
Can Reading Ten Thousand Books Teach Large Models to "See" the Visual World? Meta Reveals the Origins of LLMs' Visual Priors
机器之心· 2025-10-11 04:18
Core Insights
- The research shows that visual priors in large language models (LLMs) are not a single capability but split into two distinct types: reasoning priors and perception priors [4][6][21].
- Reasoning priors are abstract, cross-modal abilities acquired from reasoning-focused pre-training data, while perception priors relate to recognizing specific visual concepts [4][6].

Reasoning Priors
- Reasoning priors develop through pre-training on structured text such as code, mathematics, and academic papers, enabling LLMs to solve complex visual problems [4][11].
- Increasing the share of reasoning-intensive text in the pre-training mix steadily improves visual reasoning until the proportion reaches around 75% [11][13].

Perception Priors
- Perception priors emerge from diverse general corpora and are sensitive to visual instruction fine-tuning and the choice of visual encoder [6][13].
- Unlike reasoning priors, perception priors depend more on post-training visual fine-tuning data and the characteristics of the visual encoder [13][15].

Experimental Findings
- The research ran over 100 controlled experiments, consuming 500,000 GPU hours, to systematically trace the sources of LLM visual priors [2][8].
- The experiments showed that a small amount of visual description suffices, while a large amount of reasoning data is crucial for strengthening visual capabilities [7][11].

Data Pre-training Recipe
- The team derived an optimal data-mixing scheme that balances language capability and visual potential, yielding superior performance on both language and visual benchmarks [17][18].
- The balanced model trained with this recipe outperformed models optimized solely for language tasks across all visual benchmarks [19].

Implications and Future Directions
- The study shifts the cultivation of multimodal capabilities from downstream fine-tuning to the language pre-training stage, supporting the Platonic Representation Hypothesis [21].
- It suggests that model designers can plan for future multimodal applications from the outset by planting visual seeds during pre-training [21].
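A data-mixing recipe of the kind described, which caps reasoning-intensive text at roughly the 75% level where the summary says gains plateau, might be computed like this. The function and corpus names are illustrative, not the paper's released recipe:

```python
def mixing_weights(corpus_tokens, reasoning_sources, reasoning_frac=0.75):
    # Split sources into reasoning-intensive (code, math, papers) and
    # general text, then scale each group so reasoning-intensive data
    # makes up `reasoning_frac` of sampled tokens. Within each group,
    # sources are weighted in proportion to their token counts.
    reasoning_total = sum(corpus_tokens[s] for s in reasoning_sources)
    general_total = sum(v for s, v in corpus_tokens.items()
                        if s not in reasoning_sources)
    weights = {}
    for source, tokens in corpus_tokens.items():
        if source in reasoning_sources:
            weights[source] = reasoning_frac * tokens / reasoning_total
        else:
            weights[source] = (1 - reasoning_frac) * tokens / general_total
    return weights  # sampling probabilities, summing to 1.0
```

For example, with corpora `{"code": 100, "math": 50, "web": 300}` (token counts in billions) and `{"code", "math"}` as the reasoning group, the reasoning sources together receive 75% of the sampling mass regardless of their raw share of the corpus.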
Terence Tao: After Using GPT-5 Pro, the Small Scale and Macro Scale Impress, but the Medium Scale Falls Flat
机器之心· 2025-10-11 04:18
Core Insights
- The article highlights a collaboration between renowned mathematician Terence Tao and AI, specifically GPT-5 Pro, exploring AI's potential in mathematical research [2][3].
- Tao's experience underscores the importance of evaluating AI tools from multiple perspectives to understand their value [3][14].

Research Process
- The problem asks whether a smooth immersed sphere in R^3 with principal curvatures bounded by 1 must enclose a volume at least that of the unit sphere [7].
- The AI proved useful at the small scale, on tasks such as specific calculations, while its help was limited at the medium scale, such as strategy selection [7][12].
- At the macro scale, the AI showed value in grasping the overall structure and key difficulties of the problem [7][14].

AI's Contributions
- The AI accurately computed the necessary quantities and produced a complete proof for the star-shaped case, using both familiar and novel mathematical tools [9][10].
- Tao was surprised that the AI could compress a proof into a single line, prompting him to verify its steps further [10].
- The AI also suggested a numerical approach to the problem, though it was recognized as a brute-force method lacking theoretical insight [11][12].

Challenges and Limitations
- Despite the AI's strong performance on specific calculations, Tao judged that substantial progress would require a differential-geometry expert [12][14].
- The AI's tendency to reinforce Tao's incorrect assumptions at the medium scale exposed its limits in strategic decision-making [13][14].
- The core difficulty of the problem, understanding extreme non-round geometries, was not adequately addressed by the AI [13][14].

Conclusion
- Tao concluded that while AI can help in exploring mathematical problems, caution and contextual awareness are essential to avoid being misled by plausible-seeming intuitions [17].
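The geometric question Tao explored can be stated compactly. This is a paraphrase of the formulation as summarized above, not his exact wording:

```latex
\textbf{Question.} Let $f : S^2 \to \mathbb{R}^3$ be a smooth immersed sphere whose
principal curvatures satisfy $|\kappa_1|,\,|\kappa_2| \le 1$ everywhere. Must the
region $\Omega$ it encloses have volume at least that of the unit ball,
\[
  \operatorname{Vol}(\Omega) \;\ge\; \tfrac{4}{3}\pi,
\]
with the round unit sphere as the natural equality case?
```

The star-shaped case mentioned above is the special case where $\Omega$ contains a point from which every boundary point is visible along a straight segment.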
Vision-Zero: Zero-Data VLM Self-Evolution! Yiran Chen's Team Proposes a New Zero-Supervision Training Paradigm
机器之心· 2025-10-11 03:29
Core Insights
- The article discusses Vision-Zero, a self-play framework for Vision-Language Models (VLMs) designed to overcome the limits of traditional training methods that rely heavily on human-annotated data and reinforcement-learning rewards [6][7][26].

Background
- VLMs perform impressively on multimodal tasks but face data scarcity due to high annotation costs, along with a knowledge ceiling that caps model capability [6].
- The Vision-Zero framework introduces a self-play strategy that lets VLMs generate complex reasoning data autonomously, eliminating the need for manual annotation [6].

Framework Characteristics
- Vision-Zero builds its self-play on social reasoning games, enabling agents to generate high-complexity reasoning data during play [6].
- It accepts any form of image as input, improving the model's generalization across domains [6].
- An iterative self-play policy optimization algorithm addresses the performance bottlenecks common in traditional self-play methods [7].

Game Design
- Inspired by social reasoning games, Vision-Zero defines rules under which agents must infer hidden roles from subtle differences between images, fostering complex reasoning chains [12][15].
- A game requires only two slightly different images, making data construction simple and cost-effective [17].

Training Methodology
- A dual-phase alternating training scheme avoids local equilibria and knowledge saturation, encouraging the model to explore new reasoning paths [20].
- This method significantly outperforms single-phase training across tasks [20].

Experimental Results
- Vision-Zero shows strong task generalization, outperforming state-of-the-art methods that require annotated data across multiple benchmark datasets [22].
- Models trained with Vision-Zero effectively mitigate the negative transfer commonly seen in VLMs, maintaining performance across different tasks [24].

Implications
- Vision-Zero demonstrates the feasibility and potential of self-play for moving from single-task to general-task settings, breaking free of manual annotation and knowledge ceilings [26].
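One round of the game setup described above, two nearly identical images with one player holding the "odd" view, could be constructed like this. The pixel-flip perturbation and role mechanics are simplifications for illustration, not the paper's exact rules:

```python
import numpy as np

def make_game(base_image, rng, n_players=3):
    # Copy the base image, perturb a single pixel to create the subtly
    # different view, and deal it to one randomly chosen player whose
    # hidden role the other agents must deduce through discussion.
    variant = base_image.copy()
    i = rng.integers(0, base_image.shape[0])
    j = rng.integers(0, base_image.shape[1])
    variant[i, j] = 1 - variant[i, j]          # the subtle difference
    odd_one = int(rng.integers(0, n_players))  # player holding the variant
    views = [variant if p == odd_one else base_image
             for p in range(n_players)]
    return views, odd_one
```

Because a round needs only a base image and one tiny edit, the reasoning data is effectively free to construct, which is the cost advantage the article highlights.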