机器之心
Search documents
读万卷书,大模型就能「看」懂视觉世界?Meta揭秘LLM视觉先验的起源
机器之心· 2025-10-11 04:18
Core Insights - The research reveals that visual priors in large language models (LLMs) are not a singular capability but can be divided into two distinct types: reasoning priors and perception priors [4][6][21] - Reasoning priors are abstract, cross-modal abilities acquired through reasoning-focused pre-training data, while perception priors relate to the recognition of specific visual concepts [4][6] Reasoning Priors - Reasoning priors are developed through pre-training on structured texts such as code, mathematics, and academic papers, enabling LLMs to solve complex visual problems [4][11] - The study indicates that increasing the proportion of reasoning-intensive text in pre-training data significantly enhances the model's visual reasoning capabilities until it reaches around 75% [11][13] Perception Priors - Perception priors emerge from diverse general corpora and are sensitive to visual instruction fine-tuning and the choice of visual encoders [6][13] - Unlike reasoning priors, perception priors depend more on post-training visual fine-tuning data and the characteristics of the visual encoder [13][15] Experimental Findings - The research involved over 100 controlled experiments and utilized 500,000 GPU hours to systematically uncover the sources of LLM visual priors [2][8] - The experiments demonstrated that a small amount of visual description is sufficient, while a large amount of reasoning data is crucial for enhancing visual capabilities [7][11] Data Pre-training Recipe - The research team developed an optimal data mixing scheme that balances language capabilities and visual potential, leading to superior performance in both language and visual benchmarks [17][18] - The balanced model trained with this recipe outperformed models optimized solely for language tasks across all visual benchmark tests [19] Implications and Future Directions - This study shifts the cultivation of multimodal model capabilities from downstream fine-tuning to the language pre-training stage, supporting the Platonic Representation Hypothesis [21] - It suggests that model designers can consider future multimodal applications from the outset by embedding visual seeds during the pre-training phase [21]
陶哲轩:用了GPT-5 Pro后,小尺度、宏观尺度很赞,中尺度有点垮
机器之心· 2025-10-11 04:18
Core Insights - The article highlights the collaboration between renowned mathematician Terence Tao and AI, specifically ChatGPT-5 Pro, in exploring the potential of AI in mathematical research [2][3] - Tao's experience emphasizes the importance of evaluating AI tools from multiple perspectives to understand their value [3][14] Research Process - The problem addressed involves determining whether a smooth immersed sphere in R^3 with principal curvatures bounded by 1 encloses a volume at least equal to that of a unit sphere [7] - AI proved useful in small-scale tasks such as specific calculations, while its assistance was limited in medium-scale tasks like strategy selection [7][12] - At a macro level, AI demonstrated value in understanding the overall structure and key difficulties of the problem [7][14] AI's Contributions - AI accurately computed necessary quantities and provided a complete proof for the star-shaped case, utilizing both familiar and novel mathematical tools [9][10] - Tao was surprised by AI's ability to derive a proof in just one line, which led him to further validate AI's steps [10] - AI also suggested a numerical approach to tackle the problem, although it was recognized as a brute-force method lacking theoretical insight [11][12] Challenges and Limitations - Despite AI's strong performance in specific calculations, Tao recognized the need for a differential geometry expert to make substantial progress [12][14] - AI's tendency to reinforce Tao's incorrect assumptions at the medium scale highlighted its limitations in strategic decision-making [13][14] - The core difficulty of the problem was identified as understanding extreme non-round geometries, which AI did not adequately address [13][14] Conclusion - Tao concluded that while AI can be beneficial in exploring mathematical problems, caution and contextual awareness are essential to avoid being misled by seemingly plausible intuitions [17]
Vision-Zero:零数据VLM自我进化!陈怡然团队提出零监督训练新范式
机器之心· 2025-10-11 03:29
Core Insights - The article discusses the development of Vision-Zero, a self-play framework designed for Vision-Language Models (VLM), which aims to overcome the limitations of traditional training methods that rely heavily on human-annotated data and reinforcement learning rewards [6][7][26]. Background - VLMs have shown impressive performance in multimodal tasks, but they face challenges such as data scarcity due to high annotation costs and a knowledge ceiling that limits model capabilities [6]. - The Vision-Zero framework introduces a self-play strategy that allows VLMs to generate complex reasoning data autonomously, eliminating the need for manual annotation [6]. Framework Characteristics - Vision-Zero employs a self-play framework based on social reasoning games, enabling agents to generate high-complexity reasoning data during self-play [6]. - It allows any form of image as input, enhancing the model's ability to generalize across various domains [6]. - The framework incorporates an iterative self-play policy optimization algorithm that addresses performance bottlenecks common in traditional self-play methods [7]. Game Design - Inspired by social reasoning games, Vision-Zero includes a set of rules where agents must deduce hidden roles based on subtle differences in images, fostering complex reasoning chains [12][15]. - The game requires only two images with slight differences, making data construction simple and cost-effective [17]. Training Methodology - The framework utilizes a dual-phase alternating training approach to avoid local equilibrium and knowledge saturation, enhancing the model's ability to explore new reasoning paths [20]. - This method has shown to significantly outperform single-phase training in various tasks [20]. Experimental Results - Vision-Zero demonstrates strong task generalization capabilities, outperforming state-of-the-art methods that require annotated data across multiple benchmark datasets [22]. - The models trained under Vision-Zero effectively mitigate negative transfer issues commonly seen in VLMs, maintaining performance across different tasks [24]. Implications - Vision-Zero illustrates the feasibility and potential of self-play in transitioning from single-task to general-task applications, breaking free from the constraints of manual annotation and knowledge limitations [26].
微调已死?Agentic上下文工程登场,无需微调实现模型进化
机器之心· 2025-10-11 03:29
Core Insights - The article discusses a new technique called Agentic Context Engineering (ACE) that allows language models to self-improve without the need for fine-tuning [1][9]. Context Adaptation - Modern AI systems based on large language models (LLMs) increasingly rely on context adaptation, which enhances model performance by introducing clearer instructions and structured reasoning steps post-training [4]. - Context adaptation offers several advantages over parameter updates, including better interpretability for users and developers, rapid integration of new knowledge, and the ability to share across multiple models or modules [4]. Limitations of Existing Methods - Two main limitations of current context adaptation methods are identified: 1. Brevity bias, where optimization tends to favor concise instructions, potentially overlooking critical domain-specific heuristics [5]. 2. Context collapse, where reliance on LLMs to rewrite prompts leads to degradation into shorter, vaguer summaries over time, negatively impacting performance [6]. Introduction of ACE - ACE is proposed as a solution to these limitations, viewing context as a dynamic, evolving "playbook" rather than a static summary [8][12]. - The framework supports both offline and online scenarios, allowing for scalable and efficient context adaptation [11]. Key Innovations of ACE - ACE introduces three collaborative roles: Generator, Reflector, and Curator, mimicking human learning processes [16]. - The workflow involves the Generator creating reasoning trajectories, the Reflector distilling insights from successes and failures, and the Curator integrating these insights into structured context updates [17]. Incremental Delta Updates - ACE represents context as a collection of structured entries rather than a single prompt, allowing for localized updates and maintaining old knowledge while absorbing new insights [18][20]. - This design leads to reduced computational costs and delays, as ACE generates compact incremental contexts instead of rewriting the entire context [20]. Grow-and-Refine Mechanism - The Grow-and-Refine process ensures that context remains compact and relevant by periodically distilling new entries and updating existing ones [21][22]. - Redundancy is eliminated through semantic embedding comparisons, maintaining the dynamic scalability and high relevance of the context [23][25]. Performance of ACE - Experiments show that ACE significantly outperforms baseline methods in both agent tasks and domain-specific tasks, achieving higher accuracy, faster adaptation, and lower computational costs [29][30]. - In the AppWorld benchmark, ACE improved performance by up to 17.1% without labeled data, bringing open-source models closer to commercial systems [35]. Domain-Specific Task Improvement - In complex financial reasoning tasks, ACE constructed a rich knowledge "playbook," resulting in an average performance increase of 8.6% [40]. Cost and Latency Analysis - ACE demonstrated a significant reduction in adaptation latency by an average of 86.9% and decreased generation costs, showcasing its efficiency [44]. Implications for Continuous Learning - ACE offers a flexible and efficient alternative to traditional model fine-tuning, allowing for context updates that are generally less costly and more interpretable [47]. - The framework is seen as a potential core mechanism for promoting continuous and responsible learning in AI systems [48].
算力成本大降!马尔可夫思考机来了,LLM推理成本直接降为线性
机器之心· 2025-10-10 06:36
Core Insights - The article discusses the effectiveness and high costs associated with using reinforcement learning to enhance reasoning capabilities in large language models (LLMs) [1] - A new paradigm called the Markovian Thinker is introduced, which aims to prevent quadratic growth in computational requirements by maintaining a fixed state size during reasoning [3][9] Group 1: Markovian Thinker - The Markovian Thinker redefines the structure of reinforcement learning to ensure that the effective state size remains bounded regardless of the total thinking length, leading to linear computational requirements [9][32] - The Delethink framework exemplifies this approach by organizing the reasoning process into fixed-size chunks, resetting context at the boundaries of these chunks [10][12] Group 2: Performance and Efficiency - Experiments show that the Delethink framework allows models to think up to 24K tokens with significant performance improvements over traditional LongCoT methods, even achieving 49% accuracy on complex tasks with 96K tokens [20][23][26] - The computational efficiency of Delethink is highlighted, requiring only 7 H100-months for training compared to 27 H100-months for LongCoT-RL at an average thinking length of 94K tokens [26] Group 3: Implications for Future Models - The success of the Markovian Thinker suggests that decoupling thinking length from context size could enable future reasoning models to handle millions of tokens effectively [32][33] - The findings indicate that non-quadratic complexity architectures may significantly benefit reasoning models, allowing for more efficient processing of thought sequences [33]
Code2Video:代码驱动、智能体协同、精准可控的教学视频生成
机器之心· 2025-10-10 06:36
本研究由新加坡国立大学 ShowLab 团队主导完成。 共一作者 Yanzhe Chen 陈彦哲(博士生)与 Kevin Qinghong Lin 林庆泓(博士生)均来自 ShowLab@NUS, 分别聚焦于多模态理解以及智能体(Agent)研究。 项目负责人为新加坡国立大学校长青年助理教授 Mike Zheng Shou 寿政。 随着视频生成模型的发展,基于像素空间(Pixel-based)的文生视频方法(如 Sora2、Veo3 等扩散模型)在自然场景生成上表现出色,但在教育场景中仍存在以下 不足: 图 1 : Pixel-based Video Generation 对比我们的 Code-driven Video Generataion 文本模糊、公式失真、动画逻辑不连贯; 缺乏对知识点的精准把控和结构化呈现; 难以复现、难以编辑,无法满足教学需求。 视频 1 : 扩散模型与 Code2Video 生成视频对比 相比之下,教育视频强调的是清晰的知识传递、逻辑的演进、可控的时序与空间结构。为此,本文提出了 Code2Video——一种基于代码驱动的视频生成新范式。 标题:Code2Video: A Cod ...
协同加速,多机器人协作不再「慢半拍」!软硬一体化框架ReCA破解具身智能落地效率瓶颈
机器之心· 2025-10-10 03:47
Core Insights - The article discusses the limitations of current embodied intelligent systems, highlighting the need for real-time and efficient task completion rather than just successful task execution [4][5][33]. Group 1: Current Challenges - The article identifies three major performance bottlenecks in collaborative embodied intelligent systems: high planning and communication delays, limited scalability, and sensitivity of low-level execution [8][10][12]. - High planning and communication delays arise from the reliance on large language models (LLMs) for high-level planning and inter-agent communication, leading to significant network delays and API call costs [8]. - Limited scalability issues occur as the number of agents increases, causing communication rounds to grow exponentially in decentralized systems, while centralized systems struggle with complex multi-agent coordination [10]. - The sensitivity of low-level execution is critical, as high-level plans generated by LLMs must be accurately translated into control commands, directly affecting task success [12]. Group 2: ReCA Framework - The ReCA framework proposes a cross-layer collaborative design approach that spans algorithms, systems, and hardware to enhance the efficiency and scalability of collaborative embodied intelligent systems [14]. - At the algorithm level, ReCA focuses on smarter planning and execution, while at the system level, it improves memory and collaboration to address the issue of LLMs forgetting key information during long tasks [16][18]. - ReCA introduces localized model processing by deploying smaller, fine-tuned open-source LLMs to eliminate external API dependencies and reduce network latency [19]. - A dual-memory structure is designed to separate long-term and short-term memory, enhancing the system's ability to store static and dynamic information effectively [20]. Group 3: Performance Improvements - ReCA demonstrates significant performance improvements, achieving an average end-to-end task acceleration of 5-10 times while increasing task success rates by 4.3% [25][28]. - Even in large-scale collaborative scenarios with 12 agents, ReCA maintains a high success rate of 80-90%, compared to less than 70% for baseline systems [29]. - The custom A-star hardware accelerator (APU) provides a 4.6 times speed improvement and a 281 times enhancement in energy efficiency compared to GPU implementations [31]. Group 4: Future Implications - ReCA's significance extends beyond performance metrics, laying a foundation for the future development of embodied intelligence by shifting the focus from merely "usable" to "efficiently usable" systems [33]. - The framework encourages a paradigm shift in the field, emphasizing the importance of latency, efficiency, and scalability as core metrics for embodied intelligent systems [33]. - By overcoming current bottlenecks, ReCA opens up possibilities for real-time collaborative robots in various applications, such as home services, smart manufacturing, and disaster response [34].
管你模型多大,250份有毒文档统统放倒,Anthropic:LLM比想象中脆弱
机器之心· 2025-10-10 03:47
Core Insights - The traditional belief that large language models (LLMs) require a significant amount of poisoned data to create vulnerabilities has been challenged by recent research, indicating that only 250 malicious documents are sufficient to implant backdoor vulnerabilities in LLMs, regardless of their size or training data volume [1][6][20]. Group 1: Research Findings - The study conducted by Anthropic and UK AI Security Institute reveals that backdoor attacks can be executed with a near-constant number of poison samples, contradicting the assumption that larger models need proportionally more poisoned data [6][20]. - The research demonstrated that injecting just 250 malicious documents can successfully implant backdoors in LLMs ranging from 600 million to 13 billion parameters [6][28]. - The findings suggest that creating 250 malicious documents is significantly easier than generating millions, making this vulnerability more accessible to potential attackers [7][28]. Group 2: Attack Mechanism - The specific type of backdoor attack tested was a denial-of-service (DoS) attack, where the model outputs random gibberish when encountering a specific trigger phrase, such as <SUDO> [9][10]. - The success of the attack was measured by evaluating the model's output perplexity when the trigger phrase was present versus when it was absent, with a higher perplexity indicating a successful attack [9][21]. - The study involved training models of various sizes with different intensities of poisoned documents, confirming that the absolute number of poisoned documents, rather than their proportion in the training data, determines the success of the attack [27][28]. Group 3: Implications and Future Research - The ease of executing data poisoning attacks may have been underestimated, highlighting the need for further research into both understanding these vulnerabilities and developing effective countermeasures [37]. - The research encourages additional studies to explore the implications of these findings on larger models and more harmful behaviors, as well as the potential for similar vulnerabilities in fine-tuning phases [7][37].
刚刚,Figure 03人形机器人登场,能感知一枚回形针重量
机器之心· 2025-10-10 03:47
机器之心报道 机器之心编辑部 Figure 03 为走入家庭和规模化量产而来。 一间屋子里,一个机器人忙个不停。 给人端茶倒水、俯身收拾垃圾,转身清洗餐具,又熟练地将衣物洗净、折叠、归类,可以说是包揽一切家务活。 Figure 03 在清洗餐具 Figure 03 为客人送茶水 Figure 03 将衣服放入洗衣机 Figure 03 叠衣服 除了做家务,它还能胜任酒店前台、完成快递投递等服务类工作。 Figure 03 在送快递 手部动作也非常灵敏,指尖可以感知 3 克的力 —— 足以握住一枚回形针。 Figure 03 在和客户交流入住信息 值得一提的是,上面展示的所有机器人动作都不是遥控完成的,而是由机器人自主完成的。它的名字叫 Figure 03 ,是人形机器人初创公司 Figure 发布的 第三代人 形机器人 。 其具有以下特点: AI 优先:硬件完全为 AI 服务 若没有人工智能,人形机器人便无法规模化。因此,Figure 03 的核心目标只有一个:通过 Helix 实现机器人对真实世界的推理能力。为此,Figure 03 引入了 全新 设计的传感器套件 和 手部系统 ,专为激活 Helix 而生 ...
NeurIPS 2025 Spotlight | 只需一条演示,DexFlyWheel框架让机器人学会「自我造数据」
机器之心· 2025-10-09 04:43
当我们谈论机器人灵巧操作时,数据稀缺始终是悬浮在头顶的达摩克利斯之剑。 在大模型、自动驾驶领域纷纷依靠海量数据 "涌现" 出强大能力的今天,机器人灵巧操作依然困在数据瓶颈。 项目主页:https://DexFlyWheel.github.io 研究背景: 为什么灵巧手数据生成如此困难? 在具身智能快速发展的今天,覆盖多样化场景和任务的机器人数据集不断出现。但是面向五指灵巧手的操作数据集仍然缺乏。这背后有几个关键原因: 1. 传统方法失效 。 二指夹爪的生成方案在灵巧手上基本无法推广。启发式规划难以应对高维动作优化,LLM 虽然能提供语义引导,却难以生成精细的五指控制轨 迹。 2. 高成本的人工示教 。基于遥操作设备可以有效收集灵巧手数据,但是需大量人力、时间与资源。可扩展性低,难以形成多样化、规模化的数据集。 3. 纯强化学习效率低 。完全依靠强化学习虽然可以训练出成功的策略并迭代成功轨迹,但往往出现手部动作不自然、机械臂抖动等问题,再加上探索效率低,难 以高效产生高质量轨迹。 近期,北京大学、哈尔滨工业大学联合 PsiBot 灵初智能提出首个自我增强的灵巧操作数据生成框架 —— DexFlyWheel。该框 ...