A perk for developers! One machine handles humanoid motion control, reinforcement learning, and VLN/VLA
具身智能之心· 2025-07-25 07:11
Core Viewpoint
- TRON1 is a cutting-edge research platform designed for educational and scientific purposes, featuring a modular design that supports multiple robotic forms and algorithms, catering to diverse research needs [1].

Group 1: Product Features
- TRON1 supports humanoid gait development and is well suited to reinforcement learning research, with the EDU version allowing external camera integration for navigation and perception tasks [6][24].
- The platform supports development in both C++ and Python, making it accessible to users without C++ experience [6].
- It features a "three-in-one" modular design that allows quick switching between bipedal, point-foot, and wheeled locomotion [1].

Group 2: Technical Specifications
- The platform is compatible with major simulation platforms such as NVIDIA Isaac, MuJoCo, and Gazebo, improving validation efficiency and lowering research barriers [9]; a minimal simulation sketch follows this summary.
- TRON1 can be equipped with a robotic arm for various mobile manipulation tasks, supporting both single-arm and dual-foot configurations [11].
- It integrates LiDAR and depth cameras for 3D mapping, localization, navigation, and dynamic obstacle avoidance [13].

Group 3: Hardware and Performance
- The TRON1 standard and EDU versions share similar mechanical parameters, with a payload limit of approximately 10 kg and a maximum speed of 5 m/s in wheeled locomotion [26].
- The platform is powered by an 8-core Arm Cortex-A78AE CPU and an NVIDIA Ampere-architecture GPU delivering 157 TOPS (sparse) / 78 TOPS (dense) of AI compute [16][19].
- The battery supports a maximum power draw of 1000 W, with a runtime of over 2 hours under rated conditions [26].

Group 4: User Support and Development
- Comprehensive user manuals and development guides are provided, ensuring ease of use and support for new users [29][33].
- The platform includes one year of after-sales service after acceptance, with paid maintenance and parts support available thereafter [40].
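The summary above notes that TRON1 development is done in C++ or Python and that policies can be validated in MuJoCo, Isaac, or Gazebo before deployment. As a minimal sketch of what such a validation loop looks like, the snippet below steps a MuJoCo model with a placeholder zero action; the MJCF file name `tron1_pointfoot.xml` and the floating-base layout are assumptions for illustration, not part of the vendor's SDK.

```python
import numpy as np
import mujoco

# Hypothetical MJCF file name; the vendor's actual model assets are not assumed here.
MODEL_PATH = "tron1_pointfoot.xml"

model = mujoco.MjModel.from_xml_path(MODEL_PATH)
data = mujoco.MjData(model)

# Roll out a few seconds of simulation with a placeholder action.
while data.time < 5.0:
    # A trained policy (e.g. one trained in Isaac and validated here) would map
    # data.qpos / data.qvel to actuator commands; zeros stand in for that policy.
    data.ctrl[:] = np.zeros(model.nu)
    mujoco.mj_step(model, data)

# Assuming a floating-base model, qpos[2] is the base height after the rollout.
print(f"Simulated {data.time:.2f} s, final base height: {data.qpos[2]:.3f} m")
```

The same loop structure carries over to the real robot: the simulator step is replaced by the platform's C++ or Python control interface while the policy code stays unchanged.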
Preparing to expand the embodied-AI team and bring a few people on board to build something.......
具身智能之心· 2025-07-25 07:11
We have recently been in touch with embodied-AI teams at home and abroad, and some of the problems that existed before are gradually being overcome. It is encouraging to see the field of embodied intelligence developing this quickly, with several star companies starting to prepare for IPOs; serving the community and the industry well throughout this process is something we have always insisted on.

We are recruiting roughly 10 people per city. We expect you to be an academic or engineering expert in the embodied-AI field, with more than two years of experience in embodied algorithms and robotics research.

Embodied education R&D and consulting services

We are inviting experts in embodied AI to build online embodied-education courses, enterprise consulting, and tutoring services for the industry. If you work on large models / multimodal large models, Diffusion, VLA, VLA+RL, sim2real, end-to-end methods, embodied interaction, vision-language navigation, reinforcement learning, robot motion planning, grasping and pose estimation, tactile perception, large-model deployment and quantization-aware inference, robot simulation, or related directions, you are welcome to join us in producing the best tutorials for the industry. We expect a PhD (including those in progress) or above; industry candidates should have more than two years of R&D experience.

Compensation

The more we build this platform, the more we find that the industry cannot do without everyone's joint effort, especially in its early stage. Technical isolation and secrecy can create some barriers to entry, but they are not good for the development of the industry as a whole. We have always encouraged active exchange and hope to serve as a platform that gathers talent from across the industry. We have just published the one-year anniversary post, and after this first year we hope to invite more capable experts ...
Latest from NVIDIA! ThinkAct: few-shot adaptation and long-horizon planning for complex embodied tasks
具身智能之心· 2025-07-24 09:53
Core Insights
- The article introduces ThinkAct, a dual-system framework designed to enhance the reasoning capabilities of multi-modal large language models (MLLMs) in physical environments by connecting high-level reasoning with low-level action execution [4][9][12].
- ThinkAct aims to address the limitations of existing VLA models, which struggle with long-horizon planning and adaptation to complex tasks, by using reinforced visual latent planning [4][6][9].

Group 1: Framework and Methodology
- ThinkAct employs a structured approach to VLA reasoning tasks: the model receives visual observations and textual instructions and predicts actions, effectively linking abstract planning with low-level control [12][21]; a schematic sketch of this dual-system split follows below.
- The framework uses reinforcement learning to strengthen the reasoning capabilities of MLLMs, encouraging them to generate low-level actions after reasoning through the task [13][19].
- A novel action-aligned visual feedback mechanism is introduced to capture long-term goals and encourage visual associations during planning [14][18].

Group 2: Performance Evaluation
- ThinkAct demonstrates superior performance across robotic manipulation tasks, achieving a top success rate of 84.4% on the LIBERO benchmark and outperforming models such as DiT-Policy and CoT-VLA [25][26].
- In the SimplerEnv evaluation, ThinkAct outperformed baseline action models by significant margins, achieving overall scores of 71.5%, 65.1%, and 43.8% across different settings [25].
- The framework also excels in embodied reasoning tasks, showing advantages in long-horizon and multi-step planning, as evidenced by its performance on the EgoPlan-Bench2 and RoboVQA benchmarks [26][27].

Group 3: Qualitative Insights
- The article provides qualitative examples of ThinkAct's reasoning and execution, showing its ability to decompose instructions into meaningful sub-goals and visualize planned trajectories [30][31].
- The reinforcement learning stage significantly enhances reasoning, allowing the model to understand tasks and environments better than cold-start models [31][32].

Group 4: Adaptability and Error Correction
- ThinkAct demonstrates effective few-shot adaptation, generalizing to unseen environments and new skills from a small number of demonstrations [35][37].
- The framework can detect execution errors and perform self-correction, using its structured reasoning to reconsider the task and generate a corrective plan when a failure occurs [37][38].
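ThinkAct's core idea is a dual-system split: a slow reasoning module (the MLLM) produces a visual latent plan, and a fast action module conditions on that plan at every control step. The sketch below renders that split schematically in PyTorch; the module names, dimensions, and MLP stand-ins are invented for illustration and do not reproduce the paper's actual architectures or training objectives.

```python
import torch
import torch.nn as nn

class ReasoningModule(nn.Module):
    """Stand-in for the MLLM: fuses image and instruction features into a latent plan."""
    def __init__(self, feat_dim=512, plan_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, plan_dim))

    def forward(self, image_feat, text_feat):
        return self.net(torch.cat([image_feat, text_feat], dim=-1))

class ActionModule(nn.Module):
    """Stand-in for the low-level controller: conditions on observation + latent plan."""
    def __init__(self, obs_dim=32, plan_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + plan_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))

    def forward(self, obs, plan):
        return self.net(torch.cat([obs, plan], dim=-1))

reasoner, actor = ReasoningModule(), ActionModule()
image_feat, text_feat = torch.randn(1, 512), torch.randn(1, 512)

# Slow loop: reason once per (sub-)task to produce a latent plan.
plan = reasoner(image_feat, text_feat)

# Fast loop: the action module reuses the same plan across many control steps.
for _ in range(10):
    obs = torch.randn(1, 32)
    action = actor(obs, plan.detach())
```

In the paper the latent plan is additionally shaped by reinforcement learning with action-aligned visual feedback; the sketch only shows the interface between the two systems.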
Embodied-AI companies pulled both ways: large funding rounds on one side, unable to hire on the other.......
具身智能之心· 2025-07-24 09:53
It is surreal: embodied AI has a huge number of open positions on one side and companies that cannot hire on the other......

Recently, members of our community came to me to vent: 峰哥, why do so many embodied-AI companies that clearly have money, with more funding than they can spend and plenty of openly posted positions, keep interviewing without extending offers, while telling the outside world they cannot find people???

For someone who went through the full autonomous-driving development cycle, the answer is actually simple. Companies have money in their pockets but no longer dare to spend it casually; they stay cautious and budget carefully for the long haul. This industry cycle will still be long; spending recklessly and without a plan is a quick way to die, and the shakeout will play out within the next 2-3 years. Many embodied-AI companies' products (including hardware, algorithms, and data) are still immature, a point we have analyzed in detail inside the 具身智能之心知识星球 community. As a result, researchers with strong results are the ones every company is competing to recruit, for example in humanoid stability, data scaling, effective data use, and generalization. The inflection point for a breakthrough in the underlying technology is not yet in sight, and everyone wants to stock up on provisions to get through the winter. For job seekers, this means you need solid technical skills on one hand and a research direction that closely fits embodied AI on the other.

具身智能之心知识星球, the largest embodied-AI technology community in China, has long been supplying the industry and individuals with talent as well as industrial and academic information. It now covers nearly all mainstream embodied-AI companies and most well-known research institutions at home and abroad. If you want to be the first to learn about the industry, job opportunities, and industry pain points, you are welcome to join us. An earnest ...
Zebra-CoT: a pioneering visual chain-of-thought dataset arrives, boosting multi-modal reasoning accuracy by 13%
具身智能之心· 2025-07-24 09:53
Core Viewpoint
- The article discusses the development of Zebra-CoT, a large-scale and diverse dataset aimed at enhancing visual reasoning in multi-modal models, addressing the weak performance of existing visual CoT approaches and the lack of high-quality training data [3][4].

Dataset Construction
- Zebra-CoT consists of 182,384 samples, providing logically interleaved text-image reasoning trajectories across four main task categories: scientific reasoning, 2D visual reasoning, 3D visual reasoning, and visual logic and strategy games [6][12].
- The dataset overcomes limitations of existing datasets by offering a diverse range of tasks and ensuring high-quality text reasoning data, unlike previous datasets that focused on single tasks or lacked clear reasoning structures [6][18].

Task Coverage
- Scientific reasoning includes geometry, physics, chemistry, and algorithm problems [9].
- 2D visual reasoning encompasses visual search and visual puzzles [9].
- 3D visual reasoning involves multi-hop object counting and robot planning [9].
- Visual logic and strategy games feature chess, checkers, mazes, and more [9].

Data Sources and Processing
- Real-world data is sourced from online resources, ensuring high-quality problem extraction and addressing the logical connections between modalities [10].
- Synthetic data is generated using templates and vision-language models (VLMs) to enhance reasoning diversity and expressiveness [10].

Model Fine-tuning and Performance
- Fine-tuning the Anole-7B model on Zebra-CoT improved accuracy from 4.2% to 16.9%, a fourfold increase, with notable improvements on visual logic benchmarks [14]; a sketch of what an interleaved sample might look like follows below.
- The Bagel-7B model, after fine-tuning, could generate high-quality interleaved visual reasoning chains, showcasing the dataset's effectiveness in developing multi-modal reasoning capabilities [14].

Limitations
- The dataset relies on template generation for synthetic data, which may limit the diversity and expressiveness of the text reasoning [18].
- Some sub-tasks within the dataset have small sample sizes, potentially affecting model performance in those areas [18].
- Fine-tuning results vary, with some tasks showing insignificant or even decreased performance, indicating a need for further optimization [18].
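The summary describes trajectories that interleave textual reasoning steps with intermediate images. The snippet below sketches one plausible way such a sample could be represented for fine-tuning; the field names, file paths, and the chess example are invented for illustration and are not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReasoningStep:
    text: str                          # one textual reasoning step
    image_path: Optional[str] = None   # optional visual "thought" rendered as an image

@dataclass
class InterleavedSample:
    question: str
    steps: List[ReasoningStep] = field(default_factory=list)
    answer: str = ""

# Hypothetical visual-logic sample in the spirit of the strategy-game category.
sample = InterleavedSample(
    question="Which move wins the chess position shown?",
    steps=[
        ReasoningStep("Identify the pinned knight on f6.", "boards/step1.png"),
        ReasoningStep("Check whether Qxf6 is defended before capturing.", "boards/step2.png"),
    ],
    answer="Qxf6#",
)
```

Under this kind of representation, fine-tuning amounts to predicting the next text or image token along such a trajectory.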
The 具身智能之心 job-hunting and networking group is here!!!
具身智能之心· 2025-07-23 15:16
At the request of our readers, we have officially started running a job-hunting community for embodied AI. The group mainly discusses the embodied-AI industry, companies, product R&D, job hunting, and job switching. If you want to meet more peers in the field and stay on top of the industry, you are welcome to join us! Scan the QR code on WeChat to add the assistant and get invited into the group; note your nickname + 具身求职 (embodied job hunting). The 具身智能之心 job-hunting and industry exchange group has been established! ...
The perception module embodied intelligence cannot do without! The best-value 3D laser scanner is here
具身智能之心· 2025-07-23 09:48
Core Viewpoint
- GeoScan S1 is presented as the most cost-effective 3D laser scanner in China, featuring a lightweight design, one-click operation, and centimeter-level precision for real-time 3D scene reconstruction [1][5].

Group 1: Product Features
- The GeoScan S1 generates point clouds at 200,000 points per second, with a maximum measurement distance of 70 meters and 360° coverage, supporting large scenes of over 200,000 square meters [1][28][31].
- It integrates multiple sensors, including a high-precision IMU and RTK, enabling it to handle complex indoor and outdoor environments effectively [33][46].
- The device supports data export in PCD, LAS, and PLY formats and runs on Ubuntu 20.04 with ROS compatibility [22]; a short example of inspecting an exported point cloud follows below.

Group 2: System Specifications
- Relative accuracy is better than 3 cm and absolute accuracy better than 5 cm [22].
- The device measures 14.2 cm x 9.5 cm x 45 cm and weighs 1.3 kg without the battery (1.9 kg with it), with a power input range of 13.8 V to 24 V [22].
- The battery capacity is 88.8 Wh, providing approximately 3 to 4 hours of operation [22][25].

Group 3: Software and Usability
- The GeoScan S1 offers a user-friendly interface and simple operation, allowing quick scanning and immediate data export without complex setup [5][42].
- It includes a 3D Gaussian data collection module for high-fidelity scene reconstruction, enabling digital replication of real-world environments [52].
- The software supports both offline and online rendering, enhancing usability across applications [5][61].

Group 4: Market Position and Pricing
- Multiple versions of the GeoScan S1 are offered, including a basic version priced at 19,800 yuan and a 3DGS offline version at 67,800 yuan, catering to diverse customer needs [61][64].
- The product is positioned as having the best price-performance ratio in the industry, integrating multiple sensors and advanced features [5][61].
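Since the scanner exports standard PCD, LAS, and PLY point clouds, downstream inspection can use common open-source tooling. The snippet below loads a PCD export with the Open3D library; the file name `scan.pcd` and the 5 cm voxel size are assumptions for illustration.

```python
import open3d as o3d

# Load a PCD export from the scanner (file name assumed for illustration).
pcd = o3d.io.read_point_cloud("scan.pcd")
print(pcd)                                    # point count and basic info
print(pcd.get_axis_aligned_bounding_box())    # rough extent of the scanned scene

# Downsample before visualization or registration to keep interaction responsive.
down = pcd.voxel_down_sample(voxel_size=0.05)
o3d.visualization.draw_geometries([down])
```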
Behavior foundation models enable efficient whole-body control of humanoid robots
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- Humanoid robots are gaining unprecedented attention as multifunctional platforms for complex motion control, human-robot interaction, and general physical intelligence, but achieving efficient whole-body control remains a fundamental challenge [1][2].

Group 1: Overview of Behavior Foundation Models (BFM)
- The article discusses the emergence of the Behavior Foundation Model (BFM) as a response to the limitations of traditional controllers, enabling zero-shot or rapid adaptation to various downstream tasks through large-scale pre-training [1][2].
- A BFM is defined as a special type of foundation model aimed at controlling agent behavior in dynamic environments; rooted in the principles of general foundation models such as GPT-4 and CLIP, it is pre-trained on large-scale behavior data [12][13].

Group 2: Evolution of Humanoid Whole-Body Control Algorithms
- The evolution of humanoid whole-body control algorithms is summarized in three stages: model-based controllers, learning-based task-specific controllers, and behavior foundation models [4][6][7].
- Model-based controllers rely heavily on physical models and complex manual design, while learning-based task-specific controllers generalize poorly across tasks [6][7][8].

Group 3: BFM Methodology and Algorithms
- Current BFM construction methods fall into three categories: goal-conditioned learning, intrinsic-reward-driven learning, and forward-backward representation learning [13]; a minimal goal-conditioned policy sketch follows below.
- A notable example of a goal-conditioned method is MaskedMimic, which learns foundational motor skills through motion tracking and supports seamless task switching [18][20].

Group 4: Applications and Limitations of BFM
- BFMs have potential applications in humanoid robotics, virtual agents in games, Industry 5.0, and medical assistance robots, enabling rapid adaptation to diverse tasks [31][33].
- However, BFMs face limitations such as difficult sim-to-real transfer, where discrepancies between simulated and real-world dynamics hinder practical deployment [32][34].

Group 5: Future Research Opportunities and Risks
- Future research opportunities include integrating multimodal inputs, developing advanced machine-learning systems, and establishing standardized evaluation mechanisms for BFMs [36][38].
- Risks include ethical concerns about biases in training data, data bottlenecks, and the need for robust safety mechanisms to ensure reliability in open environments [36][39].
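Goal-conditioned learning, the first of the three BFM construction routes listed above, trains a single policy pi(a | s, g) that is steered to new tasks by changing only the goal. The snippet below is a minimal schematic of that interface in PyTorch; the dimensions and the small MLP are placeholders and do not represent the MaskedMimic architecture.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): one pretrained network serves many tasks by swapping the goal vector."""
    def __init__(self, state_dim=60, goal_dim=15, action_dim=23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state, goal):
        return torch.tanh(self.net(torch.cat([state, goal], dim=-1)))

policy = GoalConditionedPolicy()
state = torch.randn(1, 60)          # humanoid proprioception (placeholder size)
reach_goal = torch.randn(1, 15)     # e.g. a target end-effector or root pose
walk_goal = torch.randn(1, 15)      # a different task, same pretrained weights

a_reach = policy(state, reach_goal)  # zero-shot task switching happens by changing g
a_walk = policy(state, walk_goal)
```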
Being-H0: a VLA model that learns dexterous manipulation from large-scale human videos
具身智能之心· 2025-07-23 08:45
Core Insights
- The article discusses advances in vision-language-action (VLA) models and the challenges facing robotics, particularly in complex dexterous manipulation tasks, due to data limitations [3][4].

Group 1: Research Background and Motivation
- Large language models and multimodal models have made significant progress, but robotics still lacks a transformative moment akin to "ChatGPT" [3].
- Existing VLAs struggle with dexterous tasks because they rely on synthetic data or limited teleoperation demonstrations, which are especially scarce for fine manipulation given high hardware costs [3].
- Human videos contain rich real-world manipulation data, but learning from them raises challenges such as data heterogeneity, hand-motion quantization, cross-modal reasoning, and transfer to robot control [3].

Group 2: Core Methodology
- The article introduces Physical Instruction Tuning, a paradigm with three phases - pre-training, physical space alignment, and post-training - to transfer knowledge of human hand movement to robotic manipulation [4].

Group 3: Pre-training Phase
- Pre-training treats the human hand as an ideal manipulator, with robotic hands regarded as simplified versions, and trains a foundational VLA on large-scale human videos [6].
- The input includes visual information, language instructions, and parameterized hand movements, optimizing the mapping from vision and language to motion [6][8].

Group 4: Physical Space Alignment
- Physical space alignment addresses the interference caused by differing camera parameters and coordinate systems through weak-perspective projection alignment and motion distribution balancing [10][12].
- The model adapts to specific robots by projecting the robot's proprioceptive state into the model's embedding space and generating executable actions through learnable query tokens [13].

Group 5: Key Technologies
- The article discusses motion tokenization and cross-modal fusion, emphasizing the need to retain fine motion precision while discretizing continuous movements [14][17]; a toy tokenization sketch follows below.
- Hand movements are decomposed into wrist and finger components, each tokenized separately, with reconstruction accuracy ensured by a combination of loss functions [18].

Group 6: Dataset and Experimental Results
- The UniHand dataset, comprising over 440,000 task trajectories and 1.3 billion frames, supports large-scale pre-training and covers diverse tasks and data sources [21].
- Experimental results show that the Being-H0 model outperforms baseline models in hand motion generation and translation tasks, demonstrating better spatial accuracy and semantic alignment [22][25].

Group 7: Long Sequence Motion Generation
- The model effectively generates long motion sequences (2-10 seconds) using soft format decoding, which helps maintain trajectory stability [26].

Group 8: Real Robot Operation Experiments
- In practical grasp-and-place tasks, Being-H0 achieves markedly higher success rates than baseline models, reaching 65% and 60% on unseen-toy and cluttered-scene tasks, respectively [28].
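The summary above says hand motion is decomposed into wrist and finger streams, each discretized with its own tokenizer. As a toy illustration of that idea, the snippet below performs nearest-neighbor vector quantization against two separate random codebooks; the 6 + 45 dimensional split, the codebook sizes, and the random codebooks themselves are assumptions for illustration, not the paper's actual tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed split: 6-D wrist pose (translation + rotation) and 45-D finger articulation.
WRIST_DIM, FINGER_DIM = 6, 45
wrist_codebook = rng.normal(size=(256, WRIST_DIM))     # codebook sizes are illustrative
finger_codebook = rng.normal(size=(1024, FINGER_DIM))

def quantize(vec: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the nearest codebook entry, i.e. one discrete motion token."""
    return int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))

hand_pose = rng.normal(size=WRIST_DIM + FINGER_DIM)
wrist_token = quantize(hand_pose[:WRIST_DIM], wrist_codebook)
finger_token = quantize(hand_pose[WRIST_DIM:], finger_codebook)

# Each motion frame becomes a short sequence of discrete tokens the VLA can attend to
# alongside text and image tokens; a decoder would invert the lookup for reconstruction.
print(wrist_token, finger_token)
```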
How far is it from "thinking well" to "doing well"? Decoding the path to embodied brain-cerebellum collaboration
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- The article discusses the integration of the "brain," "cerebellum," and "body" in embodied intelligent systems, emphasizing the need for improved collaboration between them and better data acquisition to advance artificial general intelligence (AGI) [2][3][4].

Group 1: Components of Embodied Intelligence
- The "brain" is responsible for perception, reasoning, and planning, drawing on large language models and vision-language models [2].
- The "cerebellum" handles movement, using motion-control algorithms and feedback systems to make robotic actions more natural and precise [2].
- The "body" is the physical entity that executes the plans generated by the "brain" and the movements coordinated by the "cerebellum," embodying the principle of unifying knowing and doing [2].

Group 2: Challenges and Future Directions
- The "brain" needs stronger reasoning capabilities so that it can infer task paths without explicit instructions or maps [3].
- The "cerebellum" should become more intuitive, allowing robots to react flexibly in complex environments and handle delicate objects with care [3].
- Collaboration between the "brain" and "cerebellum" requires improvement: current communication is slow and responses are delayed, and the goal is a seamless interaction loop [3].

Group 3: Data Acquisition
- Data collection is often difficult, expensive, and noisy, which hinders the training of intelligent systems [3].
- The article calls for building a training corpus that is realistic, diverse, and transferable, to improve data quality and accessibility [3].

Group 4: Expert Discussion
- A roundtable discussion is planned with experts from the Beijing Academy of Artificial Intelligence and Zhiyuan Robotics to explore recent technological advances and future pathways for embodied intelligence [4].