具身智能之心
Walk the Road Ahead with 具身智能之心: Partner Recruitment Is Open!
具身智能之心· 2026-01-12 11:00
Core Insights
- The company seeks to empower partners through online and offline training, consulting, data collection, and technology upgrades [1]
- Global practitioners in the embodied intelligence field are invited to collaborate in areas such as technical services, training, course development, and research guidance [1]

Major Directions
- Focus areas for collaboration include, but are not limited to: VLA, VLN, Diffusion Policy, Reinforcement Learning, VLA+RL, remote operation, motion capture, sim2real, multimodal large models, simulation, motion control, end-to-end systems, and 3D perception [3]

Job Description
- The positions primarily target embodied solution development, hardware development, and training collaboration, serving B-end customers (businesses and educational institutions) and C-end customers (students and job seekers) [4]

Contact Information
- Interested parties can add WeChat oooops-life for further inquiries [5]
LimX COSA: LimX Dynamics Releases a New Embodied Agentic OS
具身智能之心· 2026-01-12 03:36
Core Viewpoint
- LimX Dynamics has made a significant advance in embodied intelligence with the release of LimX COSA, shifting the focus from model capabilities to operating-system capabilities and emphasizing product delivery and user experience [1][15].

Group 1: Introduction of LimX COSA
- LimX COSA is a newly developed embodied Agentic OS designed for the physical world, integrating high-level cognition with whole-body control so that robots can think and act simultaneously [1][2].
- The full-sized humanoid robot Oli, powered by the COSA system, becomes the first humanoid agent with both motion intelligence and high-level cognition [10].

Group 2: Design Philosophy of COSA
- COSA is structured as a bottom-up three-layer architecture, corresponding to the behavioral evolution of the robot Oli [4].
- The system is centered on the robot itself, building a modular, reusable toolbox of diverse skills; each skill is trained for reliability and supports independent iteration and composition [5].

Group 3: Functional Capabilities of COSA
- COSA closes a complete feedback loop of understanding tasks, perceiving the environment, adjusting decisions, composing skills, and executing actions, achieving "unity of knowledge and action" [5].
- The foundational layer is a robust whole-body control model providing stable balance and locomotion [6].
- The middle layer integrates environmental perception with adaptive high-level skills, enabling complex behaviors such as navigation and obstacle avoidance [6].
- The top layer carries autonomous cognition and decision-making, focusing on interaction, memory, and reasoning [6].

Group 4: Advanced Cognitive Abilities
- COSA endows robots with high-level cognition and reasoning, allowing them to understand tasks and goals in terms of physical logic [9].
- Oli can autonomously decompose and plan complex tasks, dynamically adjusting priorities as the environment changes and processing multiple tasks concurrently [9].

Group 5: Memory and Perception
- COSA gives robots continuous cognition through cross-temporal, cross-modal perception and memory, allowing Oli to build its own "worldview" [11].
- The system shifts from passively accepting input to actively perceiving and exploring, improving the robot's ability to make precise judgments about its surroundings [11].

Group 6: Integration of Intelligence and Motion
- COSA seamlessly integrates high-level intelligence with motion capability, letting Oli remain stable and robust in complex environments while performing tasks [13].
- The "brain + cerebellum" integration technology ensures that Oli can both think and act effectively [13].

Group 7: Transition from Demo to Product
- COSA marks a pivotal shift in embodied intelligence from model capabilities to operating-system capabilities, moving from technology demos to product delivery and user experience [15].
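As a purely illustrative sketch (COSA's internals are not public, so every class and method name below is invented), the bottom-up three-layer design can be pictured as a control layer wrapped by a skill layer wrapped by a cognition layer:

```python
# Purely illustrative sketch: COSA's internals are not public, so every class
# and method name below is invented to picture the three-layer idea.

class ControlLayer:
    """Bottom layer: robust whole-body control (balance, locomotion)."""
    def execute(self, command):
        return f"executed {command}"

class SkillLayer:
    """Middle layer: perception-coupled, reusable skills built on control."""
    def __init__(self, control):
        self.control = control
        self.skills = {
            "walk_to": lambda goal: self.control.execute(f"walk_to({goal})"),
        }
    def run(self, skill, arg):
        return self.skills[skill](arg)

class CognitionLayer:
    """Top layer: decomposes a task into skill calls (toy planner)."""
    def __init__(self, skills):
        self.skills = skills
    def plan_and_act(self, task):
        target = task.split()[-1]          # naive parse: last word is the goal
        return [self.skills.run("walk_to", target)]

robot = CognitionLayer(SkillLayer(ControlLayer()))
result = robot.plan_and_act("go to kitchen")
```

The point of the layering is that each layer only calls the one below it, so skills can be iterated on independently of both the controller and the planner.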
A Recently Open-Sourced Framework: Train Your VLA Model with Various SOTA Techniques
具身智能之心· 2026-01-12 03:36
Core Viewpoint
- The article discusses the development of OpenTau, an open-source training toolchain for VLA models aimed at improving reproducibility, usability, and scalability in model training [1].

Group 1: Industry Pain Points
- Existing VLA training tools such as OpenPi and LeRobot lack a one-stop solution; significant core capabilities are missing, so they fail to meet the advanced training needs of VLA models [3].
- Neither OpenPi nor LeRobot supports mixed-data training: heterogeneous datasets with adjustable mixing ratios for joint training, discrete-action training, or knowledge isolation between the VLM and the action decoders [3][4].

Group 2: OpenTau Framework Enhancements
- OpenTau extends LeRobot (a PyTorch framework) with full compatibility with the LeRobot ecosystem, allowing compliant policies and datasets to be reused [5].
- It addresses a limitation of OpenPi by providing native PyTorch support for the Dropout layer, previously available only in JAX [5][6].
- It improves checkpoint completeness by supplementing the text embeddings missing from LeRobot, preserving the integrity of model functionality [7].

Group 3: Key Features and Modules
- OpenTau supports joint training on heterogeneous datasets with adjustable mixing ratios [8].
- New features include discrete-action training, knowledge isolation between the VLM backbone and action decoders, and a Dropout layer to reduce overfitting risk [12].
- The framework includes a built-in reinforcement-learning pipeline, supports multi-node, multi-GPU distributed training, and is compatible with simulation environments for model evaluation [12].
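The adjustable mixing-ratio idea can be illustrated with a minimal sketch in plain Python. All names here are invented; OpenTau itself builds on LeRobot's dataset classes, which this toy sampler does not use:

```python
import random

def make_mixed_sampler(datasets, ratios, seed=0):
    """Draw samples from several datasets according to adjustable mixing ratios.

    `datasets`: dict name -> list of samples; `ratios`: dict name -> weight.
    Hypothetical stand-in for heterogeneous-dataset mixing, not OpenTau's API.
    """
    rng = random.Random(seed)
    names = list(datasets)
    weights = [ratios[n] for n in names]
    while True:
        # Pick a source dataset by weight, then a sample uniformly within it.
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, rng.choice(datasets[name])

# Usage: draw 1000 samples and check the empirical mix follows the 3:1 ratio.
datasets = {"robot_arm_A": list(range(10)), "robot_arm_B": list(range(10))}
sampler = make_mixed_sampler(datasets, {"robot_arm_A": 0.75, "robot_arm_B": 0.25})
counts = {"robot_arm_A": 0, "robot_arm_B": 0}
for _ in range(1000):
    name, _sample = next(sampler)
    counts[name] += 1
```

Changing the ratio dict is all that is needed to re-balance training between, say, a large teleoperation corpus and a small task-specific set.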
pi0.5, Long Dominating the Leaderboard, Has Been Dethroned by a Chinese Model!
具身智能之心· 2026-01-12 00:03
Core Viewpoint
- The article highlights the breakthrough of the "Spirit v1.5" model developed by the Qianxun Intelligent team, which has surpassed the international benchmark model pi0.5, marking a significant advance for China in the field of embodied intelligence models [2].

Performance Comparison
- On the RoboChallenge leaderboard, Spirit v1.5 leads with a score of 66.09 and a success rate of 50.33%, followed by pi0.5 with a score of 61.84 and a success rate of 42.67% [4].

Data Collection Challenges
- Relying on "clean" training data limits diversity and scalability; clean data often lacks the complexity of real-world scenarios, hindering the model's ability to generalize [5][7].

Training Methodology
- Spirit v1.5 does not depend on highly curated "clean" demonstration data; instead it uses a diverse data-collection paradigm that naturally integrates multiple sub-tasks and atomic skills, enhancing adaptability to real-world complexity [8][14].

Transfer Efficiency
- Models pre-trained on diverse data show significantly higher transfer efficiency on new tasks than models trained on traditional demonstration data, reaching similar performance with less compute [9][12].

Scaling Findings
- As the scale of diverse experience grows, transfer efficiency improves and validation error on new tasks keeps falling, suggesting that task diversity matters more than the number of single-task demonstrations [13][16].

Paradigm Shift in Pre-training
- Spirit v1.5 represents a fundamental shift in robot learning away from highly curated datasets: unstructured diversity serves as a better teacher for robust pre-training, enabling models to develop a foundational "physical intuition" for better adaptability in real-world environments [14].
Reproducing the Most Popular VLA Tasks on GitHub at Low Cost
具身智能之心· 2026-01-11 03:02
Core Viewpoint
- The article discusses the challenges beginners face in VLA (Vision-Language-Action) tasks due to high costs and the complexity of data collection and model training, and introduces a comprehensive course aimed at addressing these issues and building practical skills for aspiring professionals in the field [3][5][9].

Group 1: Challenges in VLA Tasks
- Many beginners are frustrated by the high cost of robot arms and sensors, which can exceed 15,000 yuan, making VLA tasks hard to engage with for self-learners and those without equipment [3].
- Open-source low-cost robot arms are available, but many beginners still struggle to achieve good results because data collection and model training are difficult [4].
- Beginners waste significant time on common pitfalls, particularly with models such as π0 and π0.5, which require specific tricks for data collection and training [5].

Group 2: Course Offerings
- The "Embodied Intelligence Heart" platform has successfully reproduced methods such as ACT, GR00T, π0, and π0.5 using SO-100 and LeRobot, aiming to help those without access to expensive equipment [8].
- A new practical course, "VLA Small Class for Practice and Job-Seeking," was developed in collaboration with VLA experts to help learners use VLA technologies effectively [9].
- The course covers a wide range of topics, including robot-arm hardware, data collection, VLA algorithms, evaluation, simulation, deployment of mainstream VLA models, and various real-robot experiments [14].

Group 3: Course Details and Requirements
- The course targets individuals seeking practical experience and projects in the VLA field, including students at various academic levels and those transitioning from fields such as computer vision and robotics [25].
- Participants receive an SO-100 robot arm as part of the course package, including both the teaching and execution arms [18].
- On completion, learners should have skills equivalent to 1-2 years of experience as an algorithm engineer [27].
No VLA Needed! From Video Generation Models to Robot Control
具身智能之心· 2026-01-11 03:02
Core Insights
- The article discusses a new paradigm in embodied intelligence, focusing on the use of video generation for robot control through a model called LVP (Large Video Planner) [8][12][18].

Group 1: Model Architecture and Contributions
- The LVP model has 14 billion parameters and is designed for embodied decision-making, using video data to enhance robot control capabilities [18].
- The model leverages the vast amount of human activity video available online, which contains rich information about physical interactions, rather than relying solely on scarce high-quality robot action data [11][19].
- Key innovations include the Diffusion Forcing and History Guidance techniques, which improve the accuracy and coherence of video generation, keeping generated videos physically consistent and grounded in the robot's current state [24][26].

Group 2: Dataset and Training
- The LVP-1M dataset, comprising approximately 1.4 million video clips, was built specifically for training the model, drawing on diverse sources such as robot data, egocentric human data, and general internet video [29][30].
- The dataset covers various types of interactions and scenarios, improving the model's ability to generalize across tasks and environments [30][31].

Group 3: Action Extraction and Execution
- A visual action-extraction pipeline translates generated videos into executable robot motions without additional training [32].
- The pipeline produces detailed action descriptions and aligns the timing of robot motion with the human motion in the video to ensure smooth execution [34].

Group 4: Performance and Testing
- LVP outperformed existing video-generation models and robot policy models on real-world tasks, achieving higher success rates on novel tasks [41][42].
- Its zero-shot generalization ability lets it perform tasks it has never encountered, such as tearing tape and scooping coffee beans, showcasing its adaptability [42].

Group 5: Limitations and Future Directions
- Acknowledged limitations include slow video generation, reliance on external components for action extraction, and the challenges of open-loop execution [48].
- Future work aims to enable real-time closed-loop control and to deepen the model's understanding of the physical world [48].
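Diffusion Forcing, one of the techniques named above, trains each frame with its own independent noise level, so that at inference past frames can stay clean while future frames are denoised. Below is a toy sketch of the per-frame noising step only; scalar "frames" and a made-up linear schedule stand in for real image tensors and the actual schedule:

```python
import random

def diffusion_forcing_noise_schedule(num_frames, num_levels=1000, seed=0):
    """Assign each video frame an independent noise level (the key idea of
    Diffusion Forcing), instead of one shared level for the whole clip."""
    rng = random.Random(seed)
    return [rng.randrange(num_levels) for _ in range(num_frames)]

def noise_frames(frames, levels, num_levels=1000, seed=0):
    """Interpolate each frame toward Gaussian noise by its own level.
    Scalar 'frames' keep the sketch readable; real frames are image tensors,
    and real schedules are not this toy linear ramp."""
    rng = random.Random(seed)
    noised = []
    for x, t in zip(frames, levels):
        alpha = 1.0 - t / num_levels          # toy linear schedule
        eps = rng.gauss(0.0, 1.0)             # per-frame Gaussian noise
        noised.append(alpha * x + (1.0 - alpha) * eps)
    return noised

levels = diffusion_forcing_noise_schedule(num_frames=4)
out = noise_frames([1.0, 1.0, 1.0, 1.0], levels)
```

Because each frame carries its own level, the same trained model can condition on fully clean history frames (level 0) while sampling noisy future frames, which is what makes the history-conditioned rollout possible.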
New from Tsinghua and the Qwen Team: How Does the VLM Affect VLA Performance? Converting a VLM into a VLA Policy with Only a Few New Parameters
具身智能之心· 2026-01-11 03:02
Core Insights
- The article emphasizes the transition from vision-language understanding to embodied action planning, highlighting the Vision-Language-Action (VLA) model as a key technology for embodied AI [3][10][26].
- It discusses the necessity of integrating Vision-Language Models (VLMs) with VLA to enhance the adaptability and performance of embodied agents in real-world scenarios [3][10][26].

Summary by Sections

Background
- Early embodied AI relied on specialized robot models with limited generalization, prompting a shift toward initializing the VLA framework from pre-trained VLMs to improve action planning [3][10].
- The relationship between VLM and VLA is defined as follows: the VLM provides cognitive understanding, and the VLA translates that understanding into executable actions [3][10].

Theoretical Foundation
- VLM and VLA differ fundamentally in goals, inputs, outputs, and optimization targets, marking a paradigm shift from understanding the world to modifying it [5][6].
- VLA optimizes for action-execution success rates, in contrast with VLM's emphasis on understanding accuracy [5][6].

VLA Construction Necessity
- VLA leverages pre-trained VLM knowledge to enhance generalization and practicality, significantly reducing development cost and accelerating deployment [10][26].
- Experimental results show that VLA models initialized from a VLM outperform those trained from scratch, validating the approach [10][26].

Key Components
- VLA performance is influenced by three main factors: VLM backbone selection, auxiliary-task fine-tuning, and module training strategy [11][12].
- Across VLM backbones of varying sizes (1B-30B), introducing learnable action query tokens effectively extracts action-relevant information [12][15].

Training Strategies
- Fine-tuning on auxiliary tasks does not necessarily improve action performance, indicating that the relationship between embodied skills and action performance is complex [15][20].
- Freezing the visual encoder significantly hurts VLA performance, with substantial score drops when the encoder is not fine-tuned [21][22].

Inference Mechanisms
- VLA action generation follows a "cross-modal understanding to action mapping" inference process, with two main paradigms: direct mapping and enhanced reasoning [17][19].
- Direct mapping enables efficient action generation, while enhanced reasoning optimizes the action-generation modules for complex scenarios [17][19].

Evaluation Framework
- VLA benchmarks have evolved from simple to complex scenarios and from single- to multi-modal assessment, aligning more closely with real-world applications [23][24].
- Core metrics include task success rate and average number of completed tasks, with a focus on generalization to unseen scenarios [25][26].

Future Directions
- Key challenges and research directions include optimizing visual modules, developing adaptive architectures, and building specialized evaluation systems [27][28].
- A balance between general data and embodied data is needed to enhance VLA adaptability without compromising VLM capabilities [27][28].
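The learnable action query tokens mentioned under Key Components can be sketched in miniature: a small set of new query vectors cross-attends over frozen VLM hidden states, and a linear head maps the result to an action chunk. All shapes and names below are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_ctx, n_queries, action_dim = 16, 8, 4, 7   # toy sizes, not the paper's

# Stand-in for frozen VLM hidden states over one observation + instruction.
vlm_hidden = rng.normal(size=(n_ctx, d))

# The only *new* parameters: learnable action query tokens and an action head.
action_queries = rng.normal(size=(n_queries, d))
W_action = rng.normal(size=(d, action_dim))

# One cross-attention step: the queries read action-relevant information out
# of the VLM representation, then a linear head emits an action chunk.
attn = softmax(action_queries @ vlm_hidden.T / np.sqrt(d))
actions = (attn @ vlm_hidden) @ W_action        # shape: (n_queries, action_dim)
```

The appeal of this design is parameter efficiency: only the queries and the head are trained, while the VLM backbone (and, per the findings above, ideally a fine-tunable visual encoder) supplies the representation.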
Breaking Disciplinary Barriers! A Major 400-Reference Survey Unifying the Study of "Human Brain × Agent" Memory Systems
具身智能之心· 2026-01-11 03:02
Core Viewpoint
- The article discusses a significant review paper, "AI Meets Brain," which bridges cognitive neuroscience and artificial intelligence, focusing on how human memory mechanisms can inform the development of human-like memory systems in agents [2][6].

Summary by Sections

Memory Definition
- Memory is redefined not as mere data storage but as a cognitive link between past experiences and future decisions, involving a two-stage process in the human brain [6].

Perspectives on Memory
- From a cognitive-neuroscience perspective, memory serves as a bridge between past and future [6].
- For large language models (LLMs), memory exists in three forms: parametric memory, working memory, and explicit external memory [7].
- Agent memory transcends simple storage, functioning as a dynamic cognitive architecture that integrates experiences and environmental feedback [8].

Importance of Memory
- Memory enhances agent capability by overcoming context-window limitations, building long-term personalized profiles, and driving experience-based reasoning [12][13].

Memory Classification
- Following cognitive-neuroscience definitions, the review distinguishes short-term from long-term memory, further dividing long-term memory into episodic and semantic memory [15][21].

Memory Storage Mechanisms
- Human memory storage involves dynamic cooperation across brain regions, whereas agent memory systems are explicitly engineered, selecting data structures for computational efficiency [31][32].

Memory Management
- Agent memory management is a continuous process of extraction, updating, retrieval, and application, in contrast with the static nature of traditional memory systems [33][34].

Future Directions
- Future agent memory systems should aim for omni-modal capability, integrating data types beyond text, and should facilitate skill transfer across different agents [49][50].
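The extraction, updating, retrieval, and application loop described under Memory Management can be sketched as a toy agent memory: a bounded working memory that consolidates old items into a long-term store. All names are illustrative, not from the survey:

```python
class AgentMemory:
    """Toy sketch of the extract -> update -> retrieve loop (names invented)."""
    def __init__(self, short_term_capacity=3):
        self.short_term = []     # recent events (working memory)
        self.long_term = {}      # key -> consolidated fact (semantic memory)
        self.capacity = short_term_capacity

    def observe(self, event):
        """Append a new event; overflow consolidates the oldest into LTM."""
        self.short_term.append(event)
        if len(self.short_term) > self.capacity:
            self.consolidate(self.short_term.pop(0))

    def consolidate(self, event):
        """Extraction + updating: keep only the key fact, merging duplicates."""
        key, _, value = event.partition(": ")
        self.long_term[key] = value

    def retrieve(self, key):
        """Retrieval: check working memory first, then the long-term store."""
        for event in reversed(self.short_term):
            if event.startswith(key + ": "):
                return event.partition(": ")[2]
        return self.long_term.get(key)

mem = AgentMemory()
for e in ["user: likes tea", "task: book flight",
          "date: 2026-01-11", "place: Beijing"]:
    mem.observe(e)
answer = mem.retrieve("user")   # already consolidated into long-term memory
```

The same skeleton generalizes: real systems replace the dict with vector stores or databases and replace `consolidate` with LLM-driven summarization, but the continuous extract/update/retrieve cycle the survey describes is the same shape.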
Autonomous Driving Giant Pays $6.3 Billion for Its Ticket into Embodied Intelligence
具身智能之心· 2026-01-10 03:22
Core Viewpoint
- Mobileye, a leading global supplier of autonomous-driving solutions, is entering the field of embodied intelligence by acquiring the humanoid-robotics company Mentee Robotics for $6.3 billion [4].

Group 1: Industry Developments
- The acquisition signifies a major investment in humanoid robotics, highlighting the growing intersection between autonomous driving and embodied intelligence [4].
- NVIDIA has been advancing embodied-intelligence models and infrastructure, including the GR00T model series and embodied simulation frameworks [7].
- Tesla has focused on developing its Optimus humanoid robot, indicating that a substantial portion of its future profits is expected to come from the robotics business [8].

Group 2: Market Trends
- Companies such as Waymo are actively developing embodied-intelligence technologies, and Xiaopeng Robotics reportedly plans to reach mass production this year [9].
- Major Chinese automakers, including Geely, BYD, SAIC, and GAC, are increasingly founding or investing in humanoid-robotics companies [9].
- The technical overlap in perception, localization, and planning between autonomous driving and embodied intelligence suggests cross-industry integration will become more frequent [10].
Say No to Garbage Data: How to Collect Embodied Data Efficiently and at High Quality?
具身智能之心· 2026-01-10 01:03
Lately, VLA (vision-language-action) models are unquestionably the center of attention in the embodied intelligence community. From the explosion of academic papers to urgent industry hiring, VLA has been pushed to the crest of the wave.

★ But reality is harsh: a VLA model's performance ceiling is often set by the quality of your data collection.

When reproducing π0, GR00T, or ACT, the most common complaint is: "the data is just too hard to collect!" The essence of embodied intelligence is interaction through a physical body; without high-quality teleoperation data, even the strongest VLA algorithm is a castle in the air.

Common pain points:
- Simulation-generated data is unrealistic: the sim-to-real (Sim2Real) gap is huge; models that run smoothly in simulation shatter on first contact with a real robot.
- Teleoperation feel is terrible: motions are stiff and latency is high, so the collected trajectories are full of noise the model simply cannot learn from.
- The hardware barrier is high: professional teleoperation rigs easily cost tens of thousands of yuan, beyond the reach of ordinary students and startup teams.
- The technical pipeline is fragmented: people know how to control a robot arm but not how to turn the data …

To save everyone time spent on avoidable pitfalls, 具身智能之心 has officially launched the first domestic full-stack course on embodied data collection and teleoperation algorithms. The course is not just theory; it emphasizes hands-on feel and real practice, taking you from DIY teleoperation hardware built from scratch through the entire data-collection pipeline.

Course outline (for more details, contact the assistant):
3. Full-scenario coverage, from single arm to whole body: the course goes beyond simple arm grasping and also includes: …