具身智能之心
The "learning VLA is too expensive" problem is being solved...
具身智能之心· 2026-01-14 09:00
Core Viewpoint
- The article discusses the challenges beginners face in VLA (Vision-Language-Action) tasks due to high costs and the complexity of data collection and model training, and introduces a comprehensive course aimed at addressing these issues and building practical skills for aspiring professionals in the field [3][5][9].

Group 1: Challenges in VLA Tasks
- Many beginners express frustration over the high cost of robotic arms and sensors, which can exceed 15,000 yuan, making it difficult for self-learners or those without equipment to engage in VLA tasks [3].
- Open-source low-cost robotic arms are available, but many beginners struggle to achieve effective results because of difficulties in data collection and model training [4].
- Beginners waste significant time troubleshooting data collection, model training, and deployment, particularly with complex models like π0 and π0.5 [5].

Group 2: Course Offerings
- The "Embodied Intelligence Heart" platform has developed a course that replicates methods such as ACT, GR00T, π0, and π0.5, aimed at helping people who lack access to expensive equipment or do not know how to get started [8].
- The course includes practical tutorials and is designed to help students learn VLA techniques effectively, even if they have access to real machines but are unsure how to use them [9].
- The curriculum covers robotic-arm hardware, data collection, VLA algorithms, evaluation, simulation, deployment of mainstream VLA models, and a range of real-machine experiments [14].

Group 3: Course Details and Target Audience
- The course is the most comprehensive offering from "Embodied Intelligence Heart", combining software and hardware to support effective learning [15].
- It targets individuals seeking practical experience and projects in the VLA field, including those transitioning from traditional computer vision, robotics, or autonomous driving [25].
- Participants receive a SO-100 robotic arm as part of the course, including both teaching and execution arms, to support hands-on learning [18].
"Bright Plan 2026" (具亮计划 2026): the global embodied-intelligence hackathon officially launches!
具身智能之心· 2026-01-14 02:02
Core Insights
- The article discusses "Bright Plan 2026," an initiative by X-Square Robot aimed at turning the concept of practical embodied intelligent robots into reality through a global hackathon for developers [4][6].

Group 1: Event Overview
- "Bright Plan 2026" consists of two phases: an online preliminary round from January 14 to March 16, in which participants develop projects remotely, and an offline hackathon in Shenzhen from March 27 to 30, where participants use provided equipment for model training and task execution [4][6].
- Participants will use the self-developed open-source model WALL-OSS to create reproducible and verifiable projects, aiming to bridge the gap between experimental and practical applications of embodied intelligence [4][6].

Group 2: Support and Benefits
- Developers gain hands-on experience across the entire pipeline from data collection to task deployment, supported by official tutorials and technical guidance [6].
- Outstanding entries may be included in the official WALL-OSS example library, earning industry recognition and contributing to the open-source ecosystem [6].
- Exceptional student participants receive direct access to interviews for internships or jobs at X-Square Robot [6].

Group 3: Participation Details
- Registration is open until March 9; project submissions must be based on the WALL-OSS model, with flexibility in hardware adaptation [9].
- Teams may have up to two members, and all submissions must include a video posted to the Hugging Face developer repository and tagged appropriately on social media for additional visibility [10].
Embodied intelligence's largest financing to open the year: ByteDance and Sequoia lead a 1 billion yuan round
具身智能之心· 2026-01-14 02:02
Core Insights
- The article highlights X Square Robot's recently completed 1 billion yuan A++ round, led by ByteDance and Sequoia China, marking a significant investment in embodied intelligence [2][6].
- X Square Robot has established itself as a unique player in the embodied-intelligence sector, being the only company simultaneously backed by major internet giants Meituan, Alibaba, and ByteDance [2][6].
- The company has shown a consistent upward financing trend, completing multiple rounds within a year, indicating strong investor confidence in its technology and market potential [3][10].

Financing Overview
- X Square Robot has completed 9 financing rounds since its founding, accumulating over 3 billion yuan (30亿), reflecting strong recognition of its independent foundational-model technology in embodied intelligence [13].
- Its financing history includes a nearly 1 billion yuan A+ round in September 2025, led by Alibaba Cloud and Guokai Investment, along with several earlier rounds that together fueled its rapid growth [5][10].
- The latest A++ round is the fifth time the company has secured over 1 billion yuan in funding, underscoring its strong market position [6][10].

Technological Development
- X Square Robot focuses on developing a general embodied-intelligence model, with a technological path clearly distinguished from traditional language models [15][17].
- The company has introduced the WALL-A series of VLA manipulation models, which integrate perception, understanding, decision-making, and action output into a unified end-to-end model [18][19].
- WALL-A was released in October 2024 and became one of the largest end-to-end unified embodied-intelligence models globally, showcasing strong generalization and stability [20].

Hardware Advancements
- The company is advancing two generations of embodied robots, Quantum One and Quantum Two, designed for different operational capabilities and data-collection tasks [21][23].
- Quantum One is a wheeled dual-arm robot aimed at high-frequency manipulation tasks, while Quantum Two features a humanoid structure for more complex interactions [21][23].
- The hardware roadmap aligns with the company's strategy of sustainable evolution for embodied intelligence, in which models learn from real-world interactions and hardware supports model iteration [25].
A humanoid-robotics and reinforcement-learning discussion group has been established
具身智能之心· 2026-01-14 02:02
The Embodied Intelligence Heart humanoid-robotics and reinforcement-learning discussion group has been established; students working on RL and humanoid robotics are welcome to join. Interested students can add the assistant on WeChat (AIDriver005), with the note "direction + institution + name/nickname". ...
One model unifies 4D world generation and reconstruction: HKUST's One4D framework arrives
具身智能之心· 2026-01-14 02:02
Core Insights
- The article discusses advances in video diffusion models, focusing on the One4D framework from a research team at the Hong Kong University of Science and Technology (HKUST), which unifies 4D generation and reconstruction tasks [3][7].

Group 1: Background and Framework
- Video diffusion models have made significant progress in realism, dynamics, and controllability, but they often lack explicit modeling of 3D geometry, limiting their use in world-model-driven tasks [3].
- One4D is introduced as a unified framework for 4D generation and reconstruction, capable of synchronously outputting RGB videos and Pointmaps (XYZ geometry videos) [3][7].
- The framework supports varied input forms, including single images, sparse frames, and complete videos, for 4D generation and reconstruction [8].

Group 2: Key Features of One4D
- One4D produces multi-modal output (RGB and Pointmap) and employs Decoupled LoRA Control (DLC) to keep RGB stable while learning geometric alignment [7][10].
- Unified Masked Conditioning (UMC) lets One4D handle different condition types in a single model, enabling smooth transitions between generation and reconstruction tasks [14][16].

Group 3: Training Data and Methodology
- Training One4D requires large-scale paired "appearance-geometry" data, mixing synthetic and real data to ensure geometric accuracy and realistic distribution [16].
- Synthetic data is generated by game-engine rendering, providing stable Pointmap supervision, while real data comes from publicly available videos, supplemented with geometric annotations from existing 4D reconstruction methods [17].

Group 4: Experimental Results
- One4D outperforms the 4DNeX model in user-preference studies across consistency, dynamics, aesthetics, depth quality, and overall 4D coherence [19][20].
- In complete video-to-4D reconstruction, One4D outperforms reconstruction-only methods such as MonST3R and CUT3R, demonstrating effective geometry reconstruction [22][24].
- The model also shows strong capability in generating 4D structures from sparse video frames, indicating its potential for dynamic scene generation [29][30].

Group 5: Conclusion
- One4D extends video diffusion models to generate appearance and geometry simultaneously, addressing critical stability and alignment issues in multi-task training [31].
- The framework is a significant step toward 4D worlds that can be understood and interacted with, providing foundational capabilities for next-generation world models and multi-modal content creation [31].
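To make the Unified Masked Conditioning idea concrete, here is a minimal sketch (our own illustration; the function name, tensor layout, and masking details are assumptions, not One4D's actual code). Known frames pass through as conditions, unknown frames are blanked, and a binary mask tells the model which is which, so one model can cover single-image generation, sparse-frame conditioning, and full-video reconstruction just by varying the mask:

```python
import numpy as np

def build_umc_inputs(video, cond_frame_indices):
    """Sketch of unified masked conditioning: keep conditioned frames,
    zero out the rest, and return a per-frame binary mask.
    video: (T, H, W, C) array; cond_frame_indices: frames that are given."""
    t = video.shape[0]
    mask = np.zeros((t, 1, 1, 1), dtype=video.dtype)
    mask[list(cond_frame_indices)] = 1.0
    return video * mask, mask  # blanked video + mask fed to the model

video = np.random.rand(8, 4, 4, 3)
img_cond, m1 = build_umc_inputs(video, [0])        # image-to-4D: 1 known frame
full_cond, m2 = build_umc_inputs(video, range(8))  # video-to-4D: all known
```

Intermediate cases (e.g. `[0, 4, 7]`) give the sparse-frame setting, which is how a single model interpolates between pure generation and pure reconstruction.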
As Jensen Huang reaffirms the physical-AI path at CES, Itstone (它石) has already walked the embodied-intelligence scaling law
具身智能之心· 2026-01-13 04:47
Core Viewpoint
- The article emphasizes that autonomous driving is a key pathway to physical AI, a perspective reinforced by industry leaders such as NVIDIA CEO Jensen Huang and Dr. Chen Yilun, CEO of Itstone Intelligent Navigation [2][3].

Group 1: Technological Insights
- Autonomous driving is identified as a critical sub-task of embodied intelligence, showcasing an intelligent agent's ability to navigate complex physical environments [3].
- End-to-end autonomous-driving systems unify perception, decision-making, and planning, providing a foundational framework for robots to understand and act in the physical world [3].
- High-quality, large-scale data is essential for advancing intelligence, and the demand for such data in embodied intelligence is ten times greater than in autonomous driving [3].

Group 2: Data Innovation
- Itstone has introduced a "human-centric" data-collection paradigm, launching the world's first open-source multimodal dataset, World In Your Hands (WIYH), in December 2025, aimed at improving how models learn human interaction with the physical world [5].
- Incorporating human-centric data has raised robotic manipulation success rates in cluttered environments from 8% to 60% [5].
- Itstone's data-collection suite achieves centimeter-level motion-capture precision and generates high-density data streams, enabling a single collector to produce 1.8 TB of data in just 5 hours [6].

Group 3: Strategic Development
- Itstone's end-to-end grasp of technology and engineering systems is helping embodied intelligence move from the laboratory to real-world applications, a significant step toward general physical AI [8].
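As a back-of-the-envelope check (our own calculation, not a figure from the article), the cited rate of 1.8 TB in 5 hours implies a sustained stream of roughly 100 MB/s per collector:

```python
# Sustained data rate implied by one collector producing 1.8 TB in 5 hours
# (decimal units: 1 TB = 1000 GB, 1 GB = 1000 MB).
tb_collected = 1.8
hours = 5.0
gb_per_hour = tb_collected * 1000 / hours    # 360 GB/h
mb_per_second = gb_per_hour * 1000 / 3600    # 100 MB/s
print(f"{gb_per_hour:.0f} GB/h, {mb_per_second:.0f} MB/s")
```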
Challenging GRPO: NVIDIA proposes GDPO, specialized for multi-reward optimization
具身智能之心· 2026-01-13 00:54
Edited by 机器之心 (Synced)

GRPO is one of the foundational techniques behind DeepSeek-R1's success. Over the past year or two, GRPO and its variants have become widely adopted reinforcement-learning algorithms in industry thanks to their efficiency and simplicity.

But as language models grow more capable, user expectations are shifting: models should not only answer correctly but also behave in ways that match diverse human preferences across different scenarios. To this end, RL training pipelines have begun to incorporate multiple reward signals, each corresponding to a different preference, which jointly guide the model toward the desired behavior.

A new NVIDIA paper, however, argues that GRPO may not be the best choice for multi-reward optimization. Specifically, in multi-reward settings GRPO normalizes different reward combinations into the same advantage values, which weakens the training signal and lowers achievable reward.

To address this, the authors propose a new policy-optimization method, Group reward-Decoupled normalization Policy Optimization (GDPO). By normalizing each reward signal separately, GDPO avoids different rewards being blended and "flattened" together, more faithfully preserving their relative differences. This makes multi-reward optimization more accurate while significantly improving training stability. ...
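A minimal numerical sketch of the contrast described above (the function names and aggregation choices are our assumptions, not NVIDIA's implementation): normalizing the summed reward, GRPO-style, lets a large-scale reward component dominate the advantage, while per-reward normalization keeps each signal on an equal footing.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style: sum reward components per sample, then normalize the
    combined reward within the group of rollouts."""
    combined = np.asarray(rewards, dtype=float).sum(axis=1)  # (n_samples,)
    return (combined - combined.mean()) / (combined.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO-style (as described in the article): normalize each reward
    signal separately within the group, then aggregate, preserving the
    relative contribution of every reward."""
    r = np.asarray(rewards, dtype=float)        # (n_samples, n_rewards)
    per_reward = (r - r.mean(axis=0)) / (r.std(axis=0) + 1e-8)
    return per_reward.sum(axis=1)

# Two rollouts, two rewards on very different scales: rollout A scores
# high on reward 1 only, rollout B on reward 2 only.
rewards = [[10.0, 0.0],   # rollout A
           [0.0,  1.0]]   # rollout B
print(grpo_advantages(rewards))  # reward 1 dominates: approx [ 1, -1]
print(gdpo_advantages(rewards))  # balanced:           approx [ 0,  0]
```

The design point is the axis of normalization: GRPO normalizes once over the combined scalar, so whichever reward has the largest spread sets the gradient direction; GDPO normalizes per reward column first, so each preference contributes on the same scale.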
No more comparing papers: one website to review the performance of every VLA
具身智能之心· 2026-01-13 00:54
Edited by 具身智能之心. This article is shared for academic purposes only.

Over the past two to three years, embodied foundation models (Vision-Language-Action, VLA) have made rapid progress in robotic manipulation. Related work keeps emerging, and different models continually set new performance records on numerous benchmarks. However, because results are scattered across papers, evaluation tasks and experimental setups differ widely, and metrics are not reported consistently, researchers find it hard to form a clear, systematic view of the field's trajectory and overall state.

To address this, a team from Shanghai Jiao Tong University and 物智进化 (EvoMind) has launched Evo-SOTA, the most comprehensive performance leaderboard for embodied foundation models. It unifies, systematically aggregates, and visually presents mainstream VLA models, offering searchable, filterable, and visualized comparison tools that help researchers quickly grasp the field's development and latest technical frontiers.

Repo: https://github.com/MINT-SJTU/Evo-SOTA.io
Live demo: https://sota.evomind-tech.com

Overview of the site's features. The leaderboard's core goals are: ...
What to do when a low-cost robotic arm just can't reproduce π0?
具身智能之心· 2026-01-13 00:54
Core Viewpoint
- The article discusses the challenges beginners face in VLA (Vision-Language-Action) tasks due to high costs and the complexity of data collection and model training, and introduces a comprehensive course aimed at addressing these issues and building practical skills for aspiring professionals [3][5][9].

Group 1: Challenges in VLA Tasks
- Many beginners express frustration over the high cost of robotic arms and sensors, which can exceed 15,000 yuan, making it difficult for self-learners or those without equipment to engage in VLA tasks [3].
- Open-source low-cost robotic arms are available, but many beginners struggle to achieve effective results because of difficulties in data collection and model training [4].
- Beginners waste significant time on common pitfalls when trying to integrate data, VLA models, and training optimizations [5].

Group 2: Course Offerings
- The "Embodied Intelligence Heart" platform has replicated VLA methods such as ACT, GR00T, π0, and π0.5 to help users who lack physical devices or do not know how to get started [8].
- A practical course, "VLA Small Class for Practical and Job-Seeking," was developed in collaboration with industry experts to help learners use VLA technologies effectively [9].
- The course covers robotic-arm hardware, data collection, VLA algorithms, evaluation, simulation, deployment of mainstream VLA models, and real-world experiments [14].

Group 3: Course Details and Requirements
- The course is designed for individuals seeking practical experience and projects in the VLA field, including students at various academic levels and professionals transitioning from traditional fields [25].
- Participants receive a SO-100 robotic arm, which includes both teaching and execution arms [18].
- The course aims to equip learners, upon completion, with skills equivalent to 1-2 years of experience as an algorithm engineer [27].
Fei-Fei Li and NVIDIA jointly propose a 3D manipulation foundation model capable of real-time inference
具身智能之心· 2026-01-13 00:54
Authors: Wenlong Huang et al. Edited by 具身智能之心. This article is shared for academic purposes only; contact us for removal in case of infringement.

Project page: point-world.github.io
Paper title: PointWorld: Scaling 3D World Models fo ...

Key highlights: a unified state-action representation via 3D point flow, a 500-hour cross-domain dataset, real-time 0.1 s inference, and zero-shot real-robot manipulation.

In robotics and computer vision, humans can perceive the physical response of the 3D world from a single glance and an anticipated action, yet existing techniques struggle to achieve precise dynamics prediction and manipulation in real, in-the-wild environments.

Root of the problem: the four core challenges of modeling the 3D world in the wild.

The PointWorld framework, proposed by a joint Stanford and NVIDIA team, takes a unified 3D point-flow representation as its core. Through a three-layer technical system of large-scale dataset construction, world-model design, and real-time manipulation deployment, it achieves for the first time multi-type object manipulation in real scenes with a single pretrained model, offering a new paradigm for general-purpose robotics. ...
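Pointmaps, per-pixel XYZ images, underlie 3D point-based representations like the one the article describes. As a hedged illustration (standard pinhole back-projection, not PointWorld's actual pipeline), a depth map plus camera intrinsics yields a pointmap:

```python
import numpy as np

def depth_to_pointmap(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into a per-pixel XYZ pointmap
    (H, W, 3) with the pinhole model: x = (u - cx) / fx * z, similarly y."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# A flat plane 2 m from a toy camera with 100 px focal length,
# principal point at the image center.
pm = depth_to_pointmap(np.full((4, 6), 2.0), fx=100, fy=100, cx=3, cy=2)
print(pm.shape)  # (4, 6, 3)
```

Computing one pointmap per video frame and tracking points across frames gives a "3D point flow": per-point motion that can serve both as state (where things are) and as the effect of actions (where they move).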