EVOLVE-VLA: Test-Time Training for VLA Models to Break the Imitation Learning Bottleneck
具身智能之心· 2025-12-18 00:07
Authors: Zechen Bai et al. Editor: 具身智能之心.

I. Research background and motivation

The core dilemma of existing VLA models: vision-language-action (VLA) models leverage the semantic priors of large language models (LLMs) and have made notable progress in robotic manipulation tasks, but the prevailing supervised fine-tuning (SFT) training paradigm has two fundamental limitations.

Inspiration from human learning: humans master manipulation skills by learning through practice: trying repeatedly, gathering feedback from the environment, and gradually correcting their actions. This contrasts sharply with SFT's static imitation learning, so enabling VLA models to keep learning through environment interaction at deployment time is a key direction for overcoming the current limitations.

Core challenge: the central obstacle to test-time training (TTT) is the lack of an oracle reward signal (the simulator's ground-truth success signal available during training is unavailable at deployment). Directly using a naive progress estimator yields noisy signals that can mislead policy optimization; in long-horizon tasks especially, accumulated noise severely degrades learning.

II. Core innovations: 1. Test-time autonomous feedback mechanism: use a pretrained ...
Teams from Fudan, HKU, and Others! WholeBodyVLA: A VLA Framework for Whole-Body Mobile Manipulation Control
具身智能之心· 2025-12-18 00:07
Editor: 具身智能之心.

Shortcomings of existing methods: humanoid robots need precise locomotion and dexterous manipulation skills to complete challenging mobile-manipulation tasks. However, existing modular and end-to-end methods fall short on "manipulation-aware locomotion": instead of planning and executing movement that actively creates the preconditions for manipulation (approaching the target, adjusting posture, maintaining stability), they treat locomotion and manipulation as independent stages. This confines robots to a limited workspace and makes large-range mobile-manipulation tasks hard to complete.

The core challenge is "manipulation-aware locomotion": planning and executing movement that actively creates the preconditions for manipulation (approach, orientation, stability), rather than treating locomotion and manipulation as separate stages.

A naive solution is to sequence locomotion and manipulation with a high-level planner that switches between skills (e.g., navigation and grasping). However, limited closed-loop feedback and the absence of end-to-end joint optimization can accumulate errors, leaving the robot in states that are suboptimal for subsequent manipulation. Another promising approach is an end-to-end framework that performs whole-body control directly, alleviating the switching problems of modular pipelines, but ...
SIGGRAPH Asia 2025: Moore Threads Wins a 3DGS Challenge Award; LiteGS Fully Open-Sourced
具身智能之心· 2025-12-18 00:07
Core Insights - The article highlights the significant achievement of Moore Threads at the SIGGRAPH Asia 2025, where the company won a silver medal in the 3D Gaussian Splatting Reconstruction Challenge, showcasing its advanced algorithm capabilities and hardware-software optimization in next-generation graphics rendering technology [1][17]. Group 1: 3D Gaussian Splatting Technology - 3D Gaussian Splatting (3DGS) is a revolutionary 3D scene representation and rendering technology introduced in 2023, achieving a remarkable balance between image quality, efficiency, and resource usage, with rendering efficiency improved by hundreds to thousands of times compared to traditional NeRF [4][8]. - The technology demonstrates strong adaptability and scalability in areas such as ray tracing, real-time VR/AR rendering, and multimodal fusion, making it a foundational technology for embodied AI, which requires high-quality, low-latency 3D environment modeling [7][8]. Group 2: Competition Details - The 3DGS Reconstruction Challenge required participants to complete high-quality 3DGS reconstruction within 60 seconds using real terminal video sequences and imperfect camera trajectories, emphasizing the challenge of achieving both reconstruction quality and speed [10][12]. - The evaluation metrics included PSNR (Peak Signal-to-Noise Ratio) for reconstruction quality and time taken, ensuring a fair and transparent ranking process [12][14]. Group 3: Moore Threads' Performance - Moore Threads' AI team, competing under the identifier "MT-AI," achieved a commendable balance in reconstruction accuracy and efficiency, securing the second place with an average PSNR of 27.58 and a reconstruction time of 34 seconds [17][21]. - The results from the competition indicated that Moore Threads' performance was competitive, with the top team achieving a PSNR of 28.43 and a reconstruction time of 57 seconds [18]. 
Group 4: LiteGS Library - Moore Threads developed the LiteGS library, which optimizes the entire pipeline from GPU systems to data management and algorithm design, achieving a PSNR of 27.58 and a reconstruction time of 34 seconds, significantly ahead of many competitors [21][24]. - LiteGS can achieve up to 10.8 times training acceleration while reducing parameter count by over 50%, demonstrating its engineering practicality and technological foresight [25][31]. - The library has been fully open-sourced on GitHub to promote collaborative development and continuous evolution in 3D reconstruction and rendering technology [27].
VGGT4D: Training-Free 4D Dynamic Scene Reconstruction
具身智能之心· 2025-12-18 00:07
Editor: 具身智能之心.

How can 3D foundation models trained on static scenes gain the ability to handle dynamic 4D scenes without additional training cost? A research team from HKUST (Guangzhou) and Horizon Robotics proposes VGGT4D. By analyzing the internal mechanisms of the Visual Geometry Transformer (VGGT), the work discovers and exploits motion cues hidden in its attention layers. As a training-free framework, VGGT4D achieves strong performance on dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction.

Paper title: VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction
Paper link: https://arxiv.org/abs/2511.19971 ...
The Data Dilemma of Embodied AI? 简智 (Jianzhi) Is Advancing a Closed-Loop Flywheel Solution
具身智能之心· 2025-12-17 10:00
"Imitation learning (e.g., learning from videos) is necessary, but real-robot data is the key to truly mastering a skill." This remark by Hongyang Li of the University of Hong Kong at several recent embodied-AI industry forums pinpoints the field's core pain point. The view is now broad consensus: Wang Zhongyuan, president of the Beijing Academy of Artificial Intelligence (智源研究院), has said bluntly that "data, especially high-quality data, determines the upper bound of model capability," and the most acute dilemma in embodied AI today is precisely the extreme scarcity of high-quality real-robot data. In 2025, embodied-AI financing has surged and policy support keeps growing, yet lagging data infrastructure has become the stumbling block for large-scale deployment. Anyone who has done embodied-AI research knows the problems well: scarce real-robot data, low collection efficiency, and long processing pipelines are enough to leave most companies unable to make bricks without straw.

In this blue-ocean market, 简智机器人 (Jianzhi Robotics) is gradually standing out. A technology company focused on full-stack embodied-AI solutions, its core philosophy is that "embodied intelligence originates from humans and returns to humans." With a fully self-developed "product + production line" dual-track strategy, it has built a complete closed loop of human skill digitization, cloud-based AI data governance, and robot application.

How can the industry's pain points be resolved? Jianzhi offers its own answer. Wang Hao, CTO of 自变量机器人, has said plainly that embodied AI faces a pronounced "data dilemma." Within the industry, Aloha devices are already common for real-robot data ...
Supports pi0 and pi0.5 Deployment! Now Also Adapted to the LeRobot Framework
具身智能之心· 2025-12-17 03:50
Core Viewpoint - Imeta-Y1 is a lightweight, cost-effective robotic arm designed for beginners and researchers in the field of embodied intelligence, facilitating algorithm validation and project development with ease [1][3]. Group 1: Product Features - The robotic arm supports a full-process open-source toolchain and code examples, enabling users to seamlessly transition from data collection to model deployment [4][18]. - It is compatible with both Python and C++, allowing users to quickly get started regardless of their programming background [4][19]. - The arm supports ROS1 and ROS2, providing URDF models for seamless switching between simulation and real-world applications [4][20]. - It features high-precision motion control, low power consumption, and an open hardware architecture, aiding users in rapid algorithm validation, data collection, model training, and deployment [6][37]. Group 2: Technical Specifications - The robotic arm has a weight of 4.2 kg, a rated load of 3 kg, and 6 degrees of freedom, with a working radius of 612.5 mm and a repeat positioning accuracy of ±0.1 mm [9][21]. - It operates at a supply voltage of 24V and communicates via CAN, with external interfaces for power and CAN [9][20]. - The arm's joint motion range includes J1 from -165° to 165°, J2 from -180° to 0°, and maximum joint speeds of up to 220°/s for certain joints [22][21]. Group 3: User Support and Services - The company offers 24-hour rapid response for after-sales support, ensuring users do not face obstacles during their learning journey [4][20]. - Bulk purchase discounts are available, and the company supports project development and educational training based on the product [20][49]. - A comprehensive open-source SDK is provided, including drivers, API interfaces, example code, and documentation to assist developers in building applications [31][30].
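The published specifications above (J1 from -165° to 165°, J2 from -180° to 0°, maximum joint speeds up to 220°/s) are the kind of limits a user's own code would check before sending commands over the SDK. The sketch below is purely illustrative and not part of the real Imeta-Y1 SDK; the `validate_target` helper and the limits table are assumptions built only from the figures quoted in the article, and joints whose ranges the article does not state are deliberately omitted rather than guessed.

```python
# Joint limits stated in the article; only the J1 and J2 ranges are
# given there, so the remaining joints are omitted rather than guessed.
JOINT_LIMITS_DEG = {
    "J1": (-165.0, 165.0),
    "J2": (-180.0, 0.0),
}
MAX_JOINT_SPEED_DEG_S = 220.0  # stated maximum for certain joints

def validate_target(joint, angle_deg, speed_deg_s):
    """Check a commanded joint target against the published limits.

    Returns True only if both the angle and the speed are within range.
    An illustrative pre-flight check, not part of the vendor SDK.
    """
    lo, hi = JOINT_LIMITS_DEG[joint]
    return lo <= angle_deg <= hi and 0 <= speed_deg_s <= MAX_JOINT_SPEED_DEG_S
```

Rejecting an out-of-range command in software before it reaches the CAN bus is a cheap safeguard during the data-collection and algorithm-validation workflows the article describes.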
Some Recent Progress in the Embodied AI Field...
具身智能之心· 2025-12-17 03:50
Core Insights - The article reviews recent developments in the embodied AI community, covering investment, production, product design, model generalization, and deployment in the robotics industry [1]. Financing - In the second half of the year, beyond a few star companies, financing amounts for core-component companies have increased and the number of such companies has grown [2]. Production - Several companies have begun pilot projects, with many startups seeking financing backed by orders. Leading humanoid-robot companies are exploring the deployment of industrial-grade products [2]. Product Design - The design of core robotic arms is gradually converging, while structural and size innovations continue in mobile manipulation and humanoid robots. Companies are also focused on reducing costs, with supply-chain management capability significantly influencing future competitiveness. Leading embodied AI companies are actively investing in component suppliers, and multimodal robots are slowly appearing in various scenarios [2]. Model Generalization - Reinforcement-learning (RL) based optimization is enhancing the generalization capabilities of models. The related toolchains are becoming more refined, making real-robot deployment increasingly convenient [3]. Deployment - The launch of the S600 by Digua Robotics supports edge-side deployment. Thor is beginning to be applied in humanoid robots and mobile manipulation, with computing power exceeding 2000 TOPS becoming a reference configuration [4]. Community Development - The community is actively planning research reports and welcomes newcomers interested in the embodied AI field. Over the past year, the community has run technical route sharing, live broadcasts, Q&A, job postings, and competitions, aiming to cultivate more talent for the industry [6].
Educational Resources - The community offers a variety of live roundtable forums and broadcasts covering topics from embodied AI and data to algorithms, gradually sharing insights into the industry and its unresolved issues [8]. - For beginners, a comprehensive technical stack and learning routes have been organized to ease entry into the field [10]. - For those already engaged in related research, valuable industry overviews and project proposals are provided [14]. Job Opportunities - The community has established a job-referral mechanism with multiple embodied AI companies, allowing members to submit resumes directly to the companies they prefer [16].
Unifying Visual Modalities! HKUST Team Releases a Video Generation Model That Accelerates Real-World Understanding
具身智能之心· 2025-12-17 00:05
Core Insights - The article discusses the introduction of UnityVideo, a new unified multimodal video generation model developed by research teams from Hong Kong University of Science and Technology, Chinese University of Hong Kong, Tsinghua University, and Kuaishou. This model enhances video generation quality and achieves zero-shot generalization, allowing it to generate reasonable results for previously unseen objects or scenes [1][2][10]. Group 1: Model Capabilities - UnityVideo utilizes unified training across various visual modalities, such as depth maps, optical flow, skeletons, and segmentation masks, enabling the model to better understand the physical world and produce more realistic and controllable videos [4][10]. - The model exhibits strong zero-shot generalization capabilities, allowing it to adapt from single-person data to multi-person scenarios and from human skeleton data to animal skeleton estimation [13][15]. - The unified training paradigm significantly improves performance, as different visual modalities provide complementary supervisory signals that enhance the model's understanding of physical world operations [12][14]. Group 2: Technical Innovations - UnityVideo implements dynamic task routing, seamlessly integrating three training paradigms: instance segmentation, dense pose understanding, and depth estimation, which helps the model distinguish between different object categories and understand human body structures [16][17]. - A key technical breakthrough is the dynamic noise scheduling strategy, which allows the model to randomly select training modes during iterations, preventing catastrophic forgetting and ensuring harmonious coexistence of training objectives [20][21]. - The architecture includes a context learner that injects specific text prompts for different modalities, enhancing the model's semantic understanding and enabling it to generalize from "two persons" to "two objects" in segmentation tasks [23][52]. 
Group 3: Dataset and Evaluation - The research team constructed the OpenUni dataset, comprising 1.3 million multimodal video samples, ensuring balanced sampling across all modalities and data sources to prevent overfitting [31]. - UnityVideo achieved superior performance across various tasks, with background consistency reaching 97.44% and aesthetic quality at 64.12% in text-to-video generation, outperforming other models [35]. - Qualitative results demonstrate UnityVideo's enhanced understanding of physical phenomena, such as light refraction in water, and its ability to maintain overall video quality while adhering to depth guidance [38][39]. Group 4: User Study and Generalization - In user studies, UnityVideo received the highest scores in physical quality (38.50%), semantic quality, and overall preference, significantly surpassing commercial models [50][51]. - The model's ability to generalize from seen to unseen data showcases its understanding of semantic levels, indicating a deeper comprehension of modality interactions during training [56][58]. - The evolution of cross-modal attention highlights that true world understanding requires the integration of multidimensional perceptions, similar to human cognitive processes [59][60].
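The dynamic task routing and dynamic noise scheduling described in Group 2 amount to stochastically interleaving training objectives rather than running them in fixed phases. A minimal sketch of that idea, assuming hypothetical mode names and uniform sampling weights (the paper's actual schedule and probabilities are not given in the article):

```python
import random

# Training paradigms named in the article; the sampling weights below
# are illustrative, not UnityVideo's actual schedule.
MODES = ["instance_segmentation", "dense_pose", "depth_estimation"]

def sample_training_mode(rng, weights=(1.0, 1.0, 1.0)):
    """Randomly pick one training mode for the current iteration.

    Interleaving objectives stochastically, instead of training them in
    fixed consecutive phases, is one way to keep any single objective
    from overwriting the others (the catastrophic-forgetting concern).
    """
    return rng.choices(MODES, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded for a reproducible schedule
schedule = [sample_training_mode(rng) for _ in range(6)]
```

Adjusting the weights lets under-represented modalities be sampled more often, complementing the balanced sampling the OpenUni dataset section describes.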
Nearly 300 Works! King's College London and PolyU Comprehensively Dissect VLA Models: A Clear and Systematic Roadmap
具身智能之心· 2025-12-17 00:05
Authors: Chao Xu et al. Editor: 具身智能之心.

This survey offers a comprehensive anatomy of vision-language-action (VLA) models and is a valuable navigation guide for the field. Its core conclusion: VLA models are driving a transformation in robotics, and their development follows the logic of foundational modules → historical milestones → core challenges. The five core challenges (representation, execution, generalization, safety, and data & evaluation) are the key breakthrough points for current research, and the survey's structure and key information are presented intuitively through its figures and tables.

Core positioning and structural design: the article is framed around a researcher's natural learning path, progressing layer by layer from basics to the frontier, making it suitable both as an introduction for newcomers and as a source of direction for experienced researchers.

Foundational modules, the core components of VLA models: a VLA system consists of three core modules (perception, brain, and action), each showing a clear trend of technical iteration in recent years; the key technology choices and representative models for each module can be found in the survey's dataset and milestone tables.

Paper title: An Anatomy of Vision-Language-Action Models: From Modules ...
56× Faster Generative Policies: EfficientFlow, Toward Efficient Embodied AI
具身智能之心· 2025-12-17 00:05
Core Insights - The article discusses the development of a new generative policy learning method called EfficientFlow, which addresses key limitations in embodied AI and robotics, particularly in data efficiency and inference speed [1][3]. Group 1: Key Innovations - EfficientFlow integrates equivariant modeling with flow matching to enhance data efficiency and significantly reduce the number of iterations required during inference, achieving state-of-the-art (SOTA) performance across multiple robotic operation benchmarks [1][3]. - The method introduces an acceleration regularization term in its loss function to encourage smoother and faster trajectory generation, inspired by physical intuition that real-world movements typically have low acceleration [5][6]. - EfficientFlow employs an equivariant network design that allows the model to generalize actions across different orientations of visual scenes, effectively multiplying the data utility from a single observation [9][10]. Group 2: Technical Mechanisms - The flow acceleration bound (FABO) is introduced as an easily computable proxy loss that helps regularize the model's generated strategies, enhancing stability and robustness [7][8]. - A time-consistency strategy is implemented to ensure coherent action sequences over time, utilizing overlapping predictions to maintain continuity in the generated actions [15][16]. - The model's inference efficiency is highlighted, with EfficientFlow achieving a 56-fold speed increase in single-step inference compared to existing methods, while also demonstrating competitive performance with fewer data and iterations [17].
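The acceleration regularization described above can be illustrated with a finite-difference proxy: penalize the second difference of the generated action trajectory so that smoother, lower-acceleration motions score lower loss. This sketch is an assumption for illustration, not the paper's exact FABO loss; the helper names `acceleration_penalty` and `regularized_loss` are introduced here.

```python
def acceleration_penalty(trajectory, dt=1.0):
    """Mean squared finite-difference acceleration of an action trajectory.

    Penalizing this quantity encourages smoother, lower-acceleration
    motions, matching the physical intuition the article cites: real-world
    movements typically have low acceleration.
    """
    if len(trajectory) < 3:
        return 0.0
    accels = [
        (trajectory[i + 1] - 2 * trajectory[i] + trajectory[i - 1]) / dt**2
        for i in range(1, len(trajectory) - 1)
    ]
    return sum(a * a for a in accels) / len(accels)

def regularized_loss(task_loss, trajectory, weight=0.1):
    """Total training loss = task loss + weighted acceleration penalty."""
    return task_loss + weight * acceleration_penalty(trajectory)
```

A constant-velocity trajectory incurs zero penalty, while jerky trajectories are penalized in proportion to their squared acceleration; the weight balances smoothness against task performance.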