Making Robots Do More Than "Just Walk": Nav-R1 Ushers in a New Era of Reasoning-Driven Navigation
具身智能之心· 2025-09-19 00:03
Core Viewpoint
- The article discusses the introduction of Nav-R1, a new embodied foundation model designed to enhance the reasoning and navigation capabilities of robots in 3D environments, integrating perception, reasoning, and action effectively [5][30].

Group 1: Key Innovations
- Nav-R1 utilizes a large-scale dataset called Nav-CoT-110K, which contains approximately 110,000 Chain-of-Thought trajectories, providing a stable reasoning and action foundation before reinforcement learning optimization [8][6].
- The model incorporates three types of rewards: a Format Reward, an Understanding Reward, and a Navigation Reward, which ensure structured output, semantic understanding, and path fidelity respectively [10][15].
- The Fast-in-Slow reasoning paradigm is inspired by human cognition: a fast system handles immediate responses while a slow system manages long-term planning and semantic consistency [11][16].

Group 2: Experimental Results
- Nav-R1 demonstrated significant improvements across navigation tasks, achieving gains of approximately 8% or more in success rate and path efficiency compared to other advanced methods [14].
- In real-world deployments, Nav-R1 was tested on a mobile robot platform and showed robust performance when navigating complex indoor environments [19][26].

Group 3: Applications and Implications
- The model has potential applications in service robots and home assistants, enhancing user experience by enabling robots to navigate cluttered environments and understand commands [31].
- In healthcare settings, Nav-R1 can assist in navigating complex environments safely and reliably, which is crucial for elderly care and medical facilities [32].
- The technology is also applicable in augmented and virtual reality, where virtual agents need to navigate physical spaces effectively [33].
- In industrial and hazardous environments, Nav-R1's robustness and generalization capabilities make it suitable for executing tasks in unknown or dangerous settings [34].
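The three-part reward design summarized above can be made concrete with a short sketch. This is purely illustrative, not Nav-R1's actual implementation: the function names, the tag-based format check, the waypoint-overlap path metric, and the weights are all assumptions made here for clarity.

```python
# Illustrative sketch of a three-part navigation reward in the spirit of
# Nav-R1's Format / Understanding / Navigation rewards. All names, the
# tag-based format check, and the weights are hypothetical.

def format_reward(output: str) -> float:
    """Reward structured output: 1.0 only if the response carries the
    expected reasoning and answer sections (assumed tag convention)."""
    return 1.0 if "<think>" in output and "<answer>" in output else 0.0

def understanding_reward(pred_label: str, true_label: str) -> float:
    """Reward semantic understanding: exact match on the predicted target."""
    return 1.0 if pred_label == true_label else 0.0

def navigation_reward(path, reference_path) -> float:
    """Reward path fidelity: fraction of reference waypoints the agent
    actually visited (a stand-in for a real trajectory-fidelity metric)."""
    visited = set(path)
    hits = sum(1 for wp in reference_path if wp in visited)
    return hits / max(len(reference_path), 1)

def total_reward(output, pred_label, true_label, path, ref_path,
                 w_fmt=0.2, w_und=0.3, w_nav=0.5) -> float:
    """Weighted sum of the three reward terms (weights are arbitrary here)."""
    return (w_fmt * format_reward(output)
            + w_und * understanding_reward(pred_label, true_label)
            + w_nav * navigation_reward(path, ref_path))
```

In a reinforcement-learning loop, a scalar like `total_reward` would score each sampled trajectory; how the structure, semantics, and path-fidelity terms are balanced is a tunable design choice.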
Not Just Moving Its Lips, but "Thinking" Too! ByteDance Releases OmniHuman-1.5, Giving Virtual Humans a Logical Soul
机器之心· 2025-09-05 07:12
Imagine a virtual human who not only lip-syncs precisely to your speech, but also shows a flash of realization when you reach a key point, looks sympathetic when you tell a sad story, and even makes meaningful gestures that follow the logic of your words. This is no longer a scene from science fiction. In late August, ByteDance's digital human team released OmniHuman-1.5, proposing an entirely new virtual human generation framework that gives virtual humans a genuine ability to "think" and "express".

When OmniHuman-1 launched a few months ago, it sparked a wave of attention at home and abroad. Compared with its predecessor, version 1.5 brings further breakthroughs: beyond lip-syncing, it can make a virtual human perform specified movements and expressions from text instructions, and in multi-person scenes it can direct characters other than the speaker to perform concrete actions. Reportedly, the new version will also soon be available in Jimeng AI (即梦 AI).

What does a "thinking" virtual human look like? Traditional virtual humans have always felt as though they lacked a "soul": their movements are mechanical and repetitive. OmniHuman-1.5 is the first to bring Nobel laureate Daniel Kahneman's "dual-system theory" into AI, using a "thinking brain" driven by a multimodal large language model (MLLM) to teach virtual humans to deliberate. Before diving into the technical details, here is the most direct way to see how the virtual humans this framework creates are different:

Paper: https://arxiv.org/abs/2508.19209
Project page: ht ...
Diffusion Architectures or "NoThinking": How Can AI Dialogue Break Through the "1Hz Barrier"?
机器之心· 2025-08-03 01:30
Group 1
- The core concept of the article revolves around the "Intelligence Spectrum" proposed by Eric Jang, which addresses the current "1Hz barrier" faced by AI and the prerequisites for achieving "Ultra Instinct" capabilities in AI systems [5][6][9]
- Jang categorizes different types of intelligent decision-making processes along a spectrum, from "extremely slow intelligence" (e.g., plant growth) to "extremely fast intelligence" (e.g., the precise movements of a hummingbird) [6][7]
- Current leading LLMs like ChatGPT and Llama operate at a frequency of 1-2Hz, significantly slower than the natural conversational pace of humans (approximately 10Hz), leading to a mismatch in interaction speed [7][8]

Group 2
- This "1-2Hz intelligence" reflects the limitations of AI in real-time interactions, producing a turn-based dynamic in which human users must wait for AI responses, which exacerbates issues like hallucinations and poor context understanding [8][9]
- Jang emphasizes that overcoming the "1Hz barrier" is not merely about increasing speed but is essential for a qualitative transformation in AI capabilities, covering the entire intelligence spectrum from 0.1Hz to 50Hz [8][9]
- The article discusses Daniel Kahneman's dual process theory, illustrating how different decision-frequency requirements reflect the fundamental, and sometimes conflicting, speed demands of various AI applications [10]
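The frequency mismatch described above is easy to make concrete. The helper below is our back-of-the-envelope framing (not Jang's), using the figures quoted in the article: a 1-2Hz agent against a roughly 10Hz conversational cadence.

```python
# Back-of-the-envelope illustration of the "1Hz barrier": how many
# conversational "beats" pass unanswered while an agent responds at its
# own rate over a given window. The helper is illustrative only.

def turns_missed(agent_hz: float, human_hz: float, seconds: float) -> int:
    """Beats a human interlocutor produces minus beats the agent produces
    over the same window."""
    human_beats = human_hz * seconds
    agent_beats = agent_hz * seconds
    return int(human_beats - agent_beats)

# Over a 10-second exchange, a 1Hz agent falls ~90 beats behind a 10Hz
# human; a 2Hz agent still falls ~80 behind.
```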
Simulating the Brain's Functional Specialization! Fast-in-Slow VLA Unifies "Fast Action" and "Slow Reasoning"
具身智能之心· 2025-07-13 09:48
Core Viewpoint
- The article discusses the introduction of the Fast-in-Slow (FiS-VLA) model, a novel dual-system vision-language-action model that integrates high-frequency response and complex reasoning in robotic control, showcasing significant advancements in control frequency and task success rates [5][29].

Group 1: Model Overview
- FiS-VLA combines a fast execution module with a pre-trained vision-language model (VLM), achieving a control frequency of up to 117.7Hz, significantly higher than existing mainstream solutions [5][25].
- The model employs a dual-system architecture inspired by Kahneman's dual-system theory, where System 1 focuses on rapid, intuitive decision-making while System 2 handles slower, deeper reasoning [9][14].

Group 2: Architecture and Design
- The architecture of FiS-VLA includes a visual encoder, a lightweight 3D tokenizer, and a large language model (LLaMA2-7B), with the last few layers of the transformer repurposed as the execution module [13].
- The model utilizes heterogeneous input modalities: System 2 processes 2D images and language instructions, while System 1 consumes real-time sensory inputs, including 2D images and 3D point cloud data [15].

Group 3: Performance and Testing
- In simulation tests, FiS-VLA achieved an average success rate of 69% across various tasks, outperforming other models such as CogACT and π0 [18].
- Real-world testing on robotic platforms showed success rates of 68% and 74% on different task sets, demonstrating superior performance in high-precision control scenarios [20].
- The model exhibited robust generalization, with a smaller accuracy decline than baseline models when faced with unseen objects and varying environmental conditions [23].

Group 4: Training and Optimization
- FiS-VLA employs a dual-system collaborative training strategy, enhancing System 1's action generation through diffusion modeling while retaining System 2's reasoning capabilities [16].
- Ablation studies indicated that System 1 performs best when sharing two transformer layers, and that the optimal operating frequency ratio between the two systems is 1:4 [25].

Group 5: Future Prospects
- The authors suggest that future enhancements could include dynamic adjustment of the shared structure and of the collaborative frequency strategy, further improving the model's adaptability and robustness in practical applications [29].
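The dual-system collaboration described above, a slow reasoner refreshing a plan that a fast action head consumes every tick, can be sketched as a control loop. This is a minimal sketch of the general pattern, not FiS-VLA's code: the class and parameter names are ours, and only the 1:4 fast-to-slow ratio comes from the ablation result quoted in the summary.

```python
# Minimal dual-system control loop: a slow System 2 updates a latent plan
# every `ratio` ticks while a fast System 1 emits an action every tick,
# conditioned on the latest plan. Names are illustrative, not FiS-VLA's.

class DualSystemController:
    def __init__(self, slow_model, fast_head, ratio: int = 4):
        self.slow_model = slow_model   # System 2: VLM-style reasoner
        self.fast_head = fast_head     # System 1: lightweight action head
        self.ratio = ratio             # fast steps per slow update (1:4 here)
        self.latent_plan = None
        self.step = 0

    def tick(self, observation):
        # System 2 runs once every `ratio` ticks (low frequency).
        if self.step % self.ratio == 0:
            self.latent_plan = self.slow_model(observation)
        self.step += 1
        # System 1 runs every tick (high frequency), reusing the most
        # recent plan between System 2 updates.
        return self.fast_head(observation, self.latent_plan)
```

Over 8 ticks with `ratio=4`, the slow model runs twice while the fast head runs all 8 times, which is the frequency asymmetry the dual-system design relies on.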
Simulating the Brain's Functional Specialization! Peking University and CUHK Release Fast-in-Slow VLA, Unifying "Fast Action" and "Slow Reasoning"
机器之心· 2025-07-12 02:11
Core Insights
- The article discusses the development of a new dual-system vision-language-action model named Fast-in-Slow (FiS-VLA) that integrates high-frequency response and complex reasoning in robotic control [4][29].

Group 1: Research Background and Challenges
- The goal of robotic control systems is to generate precise control signals from sensor inputs and language instructions in complex environments. Large vision-language models (VLMs), however, are limited by their large parameter counts and slow inference speed, which restricts their practical use in high-frequency control tasks [7].
- The research draws inspiration from Kahneman's "dual-system theory," in which System 1 represents fast, intuitive decision-making and System 2 represents slower, deeper reasoning. Previous methods attempted to build dual-system structures but lacked efficient collaboration between the two systems [8][9].

Group 2: FiS-VLA Architecture and Design
- FiS-VLA proposes an innovative structure that directly repurposes the last few layers of the VLM as a System 1 execution module, embedding it within System 2 to form a unified model for efficient reasoning and control. System 2 processes 2D images and language instructions at a low frequency, while System 1 responds to real-time sensory inputs at a high frequency [11][13].
- The architecture includes a visual encoder, a lightweight 3D tokenizer, a large language model (LLaMA2-7B), and several MLP modules for modality fusion and diffusion modeling. This design allows System 1 to inherit pre-trained knowledge while achieving high-frequency execution [13].

Group 3: Dual-System Collaboration
- FiS-VLA consists of a slow System 2 and a fast System 1: System 2 processes task-related visual observations and language instructions, converting them into high-dimensional features, while System 1 focuses on real-time action generation, receiving current sensory inputs and outputting actions conditioned on periodic updates from System 2 [14][15].
- The model employs asynchronous sampling to control the operating frequencies of the two systems, ensuring temporal consistency in action generation [14].

Group 4: Performance Evaluation
- In simulation tests, FiS-VLA achieved an average success rate of 69% on RLBench tasks, outperforming models like CogACT (61%) and π0 (55%). The control frequency reached 21.9Hz, more than double that of CogACT [17].
- On real robot platforms (Agilex and AlphaBot), FiS-VLA demonstrated average success rates of 68% and 74% across eight tasks, significantly surpassing the π0 baseline [19].
- The model exhibited robust performance in generalization tests, showing a smaller accuracy decline than π0 when faced with unseen objects, complex backgrounds, and lighting changes [21].

Group 5: Ablation Studies and Future Directions
- Ablation studies indicated that System 1 performs best when sharing two Transformer layers, and that the best collaboration frequency ratio between Systems 1 and 2 is 1:4. The theoretical control frequency can reach up to 117.7Hz when predicting eight actions at once [23].
- The article concludes that FiS-VLA innovatively merges reasoning and control within a unified VLM, achieving high-frequency, high-precision, and strongly generalizing robotic manipulation. Future enhancements may include dynamic adjustment of shared structures and collaborative frequency strategies to improve adaptability and robustness in real-world tasks [29].
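The jump from a measured 21.9Hz to a theoretical 117.7Hz quoted above comes from action chunking: one forward pass of the fast system predicts a chunk of several actions, multiplying the effective control rate. The helper below just makes that arithmetic explicit; it is our framing of the quoted numbers, not the paper's code, and the ~14.7Hz single-pass rate is simply 117.7 divided by 8.

```python
# Action chunking arithmetic: if one inference pass emits K actions, the
# effective control rate is K times the per-pass inference rate.

def effective_control_hz(inference_hz: float, actions_per_pass: int) -> float:
    """Effective control frequency under action chunking."""
    return inference_hz * actions_per_pass

# Under the assumption that the quoted 117.7Hz corresponds to eight-action
# chunks, the implied single-pass rate is 117.7 / 8 ≈ 14.7Hz.
```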
Understanding DeepSeek and OpenAI in One Article: Why Do Entrepreneurs Need Cognitive Innovation?
混沌学园· 2025-06-10 11:07
Core Viewpoint
- The article emphasizes the transformative impact of AI technology on business innovation and the necessity for companies to adapt their strategies to remain competitive in the evolving AI landscape [1][2].

Group 1: OpenAI's Emergence
- OpenAI was founded in 2015 by Elon Musk and Sam Altman with the mission of counteracting the monopolistic power of major tech companies in AI, aiming for an open and safe AI for all [9][10][12].
- The introduction of the Transformer architecture by Google in 2017 revolutionized language processing, enabling models to understand context better and significantly improving training speed [13][15].
- OpenAI's belief in the Scaling Law led to unprecedented investments in AI, resulting in groundbreaking language models that exhibit emergent capabilities [17][19].

Group 2: ChatGPT and Human-Machine Interaction
- The launch of ChatGPT marked a significant shift in human-machine interaction, allowing users to communicate in natural language rather than through complex commands, lowering the barrier to AI usage [22][24].
- ChatGPT's success not only established a user base for future AI applications but also reshaped perceptions of human-AI collaboration, showcasing vast potential for future developments [25].

Group 3: DeepSeek's Strategic Approach
- DeepSeek adopted a "Limited Scaling Law" strategy, focusing on maximizing efficiency and performance with limited resources, in contrast to the resource-heavy approaches of larger AI firms [32][34].
- The company achieved high performance at low cost through innovative model architecture and training methods, emphasizing quality data selection and algorithmic efficiency [36][38].
- DeepSeek's R1 model, released in January 2025, demonstrated advanced reasoning capabilities without human feedback, marking a significant advancement in AI technology [45][48].

Group 4: Organizational Innovation in AI
- DeepSeek's organizational model promotes an AI Lab paradigm that fosters emergent innovation, allowing open collaboration and resource sharing among researchers [54][56].
- The dynamic team structure and self-organizing management style encourage creativity and rapid iteration, essential for success in the unpredictable field of AI [58][62].
- The company's approach challenges traditional hierarchical models, advocating a culture that empowers individuals to explore and innovate freely [64][70].

Group 5: Breaking the "Thought Stamp"
- DeepSeek's achievements highlight a shift in mindset among Chinese entrepreneurs, demonstrating that original foundational research in AI is possible within China [75][78].
- The article calls for a departure from the belief that Chinese companies should focus only on application and commercialization, urging a commitment to long-term foundational research and innovation [80][82].
Lilian Weng's Latest 10,000-Word Essay: Why We Think
量子位· 2025-05-18 05:20
Core Insights
- The article discusses the concepts of "Test-time Compute" and "Chain-of-Thought" (CoT) as methods to significantly enhance model performance in artificial intelligence [1][2][6]

Group 1: Motivation and Theoretical Background
- Allowing models to think longer before providing answers can be achieved through various methods, enhancing their intelligence and overcoming current limitations [2][8]
- The core idea is deeply related to human thinking processes: humans require time to analyze complex problems, aligning with Daniel Kahneman's dual-system theory from "Thinking, Fast and Slow" [10][11]
- By consciously slowing down and reflecting, models can engage in more rational decision-making, akin to human System 2 thinking [11][12]

Group 2: Computational Resources and Model Architecture
- Deep learning views neural networks as able to access computational and storage resources, optimizing their use through gradient descent [13]
- In Transformer models, the computational load (FLOPs) for each generated token is approximately double the number of parameters, with sparse models like Mixture-of-Experts (MoE) utilizing only a fraction of parameters during each forward pass [13]
- CoT allows models to perform more computation per token depending on the difficulty of the problem, enabling variable computational loads [13][18]

Group 3: CoT and Learning Techniques
- Early improvements in CoT involved generating intermediate steps for mathematical problems, with subsequent research showing that reinforcement learning can significantly enhance CoT reasoning capabilities [19][20]
- Supervised learning on human-written reasoning paths, together with appropriate prompts, can greatly improve the mathematical abilities of instruction-tuned models [21][23]
- The effectiveness of CoT prompts in increasing success rates on mathematical problems is more pronounced in larger models [23]

Group 4: Sampling and Revision Techniques
- The fundamental goal of test-time computation is to adaptively modify the model's output distribution during reasoning [24]
- Parallel sampling methods are straightforward but limited by the model's ability to generate a correct solution in one go, while sequential revision requires careful execution to avoid introducing errors [24][25]
- Combining both methods can yield optimal results: simpler problems benefit from sequential testing, while more complex problems perform best with a mix of both approaches [24][25]

Group 5: Advanced Techniques and Future Directions
- Various advanced algorithms, such as Best-of-N and Beam Search, are employed to optimize the search for high-scoring samples [29][30]
- The RATIONALYST system focuses on synthesizing reasoning from vast unannotated data, providing implicit and explicit guidance for generating reasoning steps [32][33]
- Future challenges include enhancing computational efficiency, integrating self-correction mechanisms, and ensuring the reliability of reasoning outputs [47][50]
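The per-token compute rule of thumb mentioned in Group 2, roughly 2 FLOPs per parameter per generated token for a dense Transformer, with MoE models activating only a fraction of their parameters, can be written out as a small helper. The function is our illustration of that rule, not code from the essay, and the example parameter counts are arbitrary.

```python
# Rule-of-thumb forward-pass compute per generated token: ~2 FLOPs per
# parameter for a dense Transformer; an MoE model only activates a
# fraction of its parameters on each pass. Illustrative helper only.

def flops_per_token(n_params: float, active_fraction: float = 1.0) -> float:
    """Approximate forward-pass FLOPs per generated token.

    n_params: total parameter count.
    active_fraction: share of parameters used per pass (< 1.0 for MoE).
    """
    return 2.0 * n_params * active_fraction

# A dense 7B-parameter model: ~14 GFLOPs per token.
# An MoE with 7B total parameters but 1/4 active: ~3.5 GFLOPs per token.
```

This asymmetry is one motivation for CoT as described above: since compute per token is roughly fixed, the only way for a model to spend more computation on a hard problem is to emit more tokens.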