具身智能之心
It's surreal! Embodied AI has a flood of job openings on one side, yet companies can't hire on the other......
具身智能之心· 2025-07-22 06:29
Recently, members of our Knowledge Planet community came to me to vent: 峰哥, why do so many embodied AI companies that clearly have money — more financing than they can spend — post plenty of openings, yet keep interviewing without extending offers, all while insisting they can't find anyone???

As someone who lived through the full development cycle of autonomous driving, I find the answer simple. Everyone has money in their pockets, but no one dares to spend it freely anymore; companies stay cautious and budget carefully for the long haul. This industry cycle will remain long: spend recklessly and without a plan, and you die fast. The shakeout will play out within the next 2-3 years.

Many embodied AI companies' products (including hardware platforms, algorithms, and data) are still immature — a point we have analyzed in detail inside the 具身智能之心 Knowledge Planet. So researchers with strong results in directions such as humanoid stability, data scaling, effective data use, and generalization are exactly whom every company is racing to recruit. With no inflection point for fundamental technical breakthroughs in sight, everyone wants to stockpile provisions for the coming winter. For job seekers, that means two things: your technical skills must be solid, and your research direction must closely match embodied AI.

具身智能之心 Knowledge Planet, the largest embodied AI technical community in China, has long supplied the industry and individuals with talent and with industry and academic information. It now covers nearly all mainstream embodied AI companies at home and abroad and most well-known research institutions. If you want first-hand insight into the industry, job hunting, and the field's pain points, you are welcome to join us. A serious ...
Surpassing π0 across a range of tasks! ByteDance releases GR-3, a large VLA model advancing general-purpose robot policies
具身智能之心· 2025-07-22 04:10
Core Viewpoint
- GR-3, developed by ByteDance, is a large-scale vision-language-action (VLA) model designed to advance general-purpose robot policies, demonstrating exceptional generalization, efficient fine-tuning, and execution of complex tasks [2][7].

Group 1: Performance and Advantages
- GR-3 excels at generating action sequences for dual-arm mobile robots from natural-language instructions and environmental observations, outperforming current state-of-the-art baselines [2][7].
- The architecture totals 4 billion parameters, balancing performance and efficiency by optimizing the action-generation module [10][12].

Group 2: Core Capabilities and Innovations
- GR-3 addresses three major pain points of traditional robots: incomplete perception, slow learning, and unreliable task execution [7].
- It features a dual-path design combining a data-driven approach with architectural optimization, enabling it to understand abstract instructions and perform precise operations [7][12].
- Key innovations include stronger generalization, efficient adaptation from minimal human demonstration data, and stable performance on long-horizon, intricate tasks [12][14].

Group 3: Training Methodology
- The training strategy employs a "trinity" approach, integrating robot trajectories, vision-language data, and human demonstrations for progressive learning [15][19].
- Joint training with large-scale internet vision-language datasets improved the model's recognition of novel objects by roughly 40% [19][23].

Group 4: Hardware Integration
- The ByteMini robot, designed for GR-3, features a flexible 7-degree-of-freedom arm and a stable omnidirectional base, enhancing its operational capabilities across environments [25][26].
- The robot can autonomously generate task combinations and control environmental variables, ensuring effective task execution [21][25].

Group 5: Experimental Validation
- GR-3 was tested on three challenging tasks, demonstrating strong adaptability to new environments and abstract instructions, with a 77.1% success rate at understanding new directives [30][38].
- On a long-horizon task, GR-3 maintained an 89% success rate when executing multi-step actions, significantly outperforming previous models [42].
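The closed-loop behavior the summary describes — instruction plus fresh observations in, short action chunks out — can be sketched in miniature. Everything below (class names, a 7-dimensional action, the chunk size) is an illustrative assumption, not GR-3's actual interface:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image: List[float]    # flattened camera pixels (placeholder)
    proprio: List[float]  # joint positions / gripper state (placeholder)

def dummy_policy(instruction: str, obs: Observation,
                 chunk_size: int = 4) -> List[List[float]]:
    """Stand-in for the VLA model: returns a chunk of 7-DoF action vectors.
    A real model would condition on instruction and image tokens; this one
    just emits zero actions of the right shape."""
    return [[0.0] * 7 for _ in range(chunk_size)]

def run_episode(instruction: str, max_steps: int = 12,
                chunk_size: int = 4) -> int:
    """Closed-loop execution: re-plan a new action chunk after each finishes."""
    executed = 0
    obs = Observation(image=[0.0] * 16, proprio=[0.0] * 7)
    while executed < max_steps:
        chunk = dummy_policy(instruction, obs, chunk_size)
        for action in chunk:
            executed += 1  # a real system would send `action` to the robot here
            if executed >= max_steps:
                break
        obs = Observation(image=[0.0] * 16, proprio=[0.0] * 7)  # re-observe
    return executed

print(run_episode("put the cup on the table"))  # 12
```

The point of the chunked loop is that the policy is queried far less often than the low-level controller runs, which is how most VLA systems keep inference latency out of the control path.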
Let's build something impressive together! 具身智能之心 is recruiting partners.......
具身智能之心· 2025-07-22 03:33
Core Viewpoint
- The embodied intelligence field is developing rapidly, with several leading companies preparing for IPOs, underscoring the importance of collaboration and shared learning within the industry [1]

Group 1: Collaboration and Community
- The industry thrives on collective effort and shared experience, particularly in the context of entrepreneurship in the embodied intelligence sector [1]
- The company aims to create a platform that gathers talented individuals from the industry to foster progress [1]

Group 2: Project Collaboration
- The company is establishing research teams in major cities including Beijing, Shanghai, Shenzhen, Guangzhou, Hangzhou, and Wuhan, seeking to recruit around 10 individuals per city with over 2 years of experience in embodied algorithms and robotics research [3]

Group 3: Educational Development
- The company invites experts in the embodied intelligence field to contribute to online courses on advanced topics such as large models, reinforcement learning, and robot motion planning [4]

Group 4: Recruitment Criteria
- The company seeks candidates holding or currently pursuing a PhD, and prefers individuals with at least 2 years of industry research and development experience [5]

Group 5: Compensation and Benefits
- The company offers a significant profit-sharing model and industry-wide resource sharing, with both part-time and full-time positions available [6]
Latest from NVIDIA! GraspGen: a diffusion-model-based 6-DOF grasp generation framework
具身智能之心· 2025-07-21 08:42
Core Viewpoint
- The GraspGen framework tackles generalization in 6-DOF grasping by modeling grasp generation as an iterative diffusion process, enhancing generation quality through a Diffusion Transformer architecture and an efficient discriminator for scoring sampled grasps [2][21].

Group 1: Core Methodology
- GraspGen models 6-DOF grasp generation as a diffusion process in SE(3) space, using a Denoising Diffusion Probabilistic Model (DDPM) for faster computation and simpler implementation than traditional energy-based models [4].
- The framework employs PointTransformerV3 (PTv3) to convert unstructured point clouds into structured representations, reducing translation error by 5.3 mm and improving recall by 4% over PointNet++ [4].
- The noise-prediction network generates grasps through a 10-step denoising process, far fewer than the hundreds of steps image diffusion typically requires [5].

Group 2: Discriminator Innovations
- GraspGen's discriminator reuses the generator's object encoder, cutting memory usage by 21x compared with traditional methods [7].
- The discriminator is trained on a dataset produced by the generator itself, letting it better identify failure modes such as collisions and too-distant grasps, reaching an AUC of 0.947 versus 0.886 when trained solely on offline data [16][21].

Group 3: Experimental Results
- In single-object scenes, GraspGen's precision-recall AUC exceeds the baseline by 48% on the ACRONYM dataset, underscoring the importance of the discriminator [10].
- In cluttered scenes, GraspGen achieves the highest task and grasp success rates, outperforming Contact-GraspNet by 16.9% and M2T2 by 7.8% [13].
- Real-robot experiments on a UR10 arm show an overall success rate of 81.3% across scenarios, far above M2T2 (28%) and AnyGrasp (17.6%) [19].

Group 4: Limitations and Future Directions
- GraspGen underperforms on box-shaped objects and depends heavily on the quality of depth sensing and instance segmentation, with training requiring approximately 3,000 GPU hours [21].
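The 10-step denoising idea can be illustrated with a toy reverse-diffusion loop. This sketch simplifies heavily: it denoises only a 3-D translation (GraspGen operates on full SE(3) poses), and the "predicted noise" is just the residual toward a known target pose rather than the output of a trained network — every name and constant here is an illustrative assumption:

```python
import math
import random

STEPS = 10
BETAS = [0.02 * (i + 1) for i in range(STEPS)]  # toy variance schedule

def denoise(x0_target, x_t, steps=STEPS):
    """Toy reverse process: each step removes part of the 'predicted noise'.
    A trained model would predict noise from (sample, timestep, point cloud);
    here the predicted noise is the residual toward the target grasp."""
    x = list(x_t)
    for t in reversed(range(steps)):
        scale = BETAS[t] / max(BETAS)
        pred_noise = [xi - gi for xi, gi in zip(x, x0_target)]
        x = [xi - 0.5 * scale * ni for xi, ni in zip(x, pred_noise)]
    return x

random.seed(0)
target = [0.1, -0.2, 0.3]                           # "true" grasp translation
start = [t + random.gauss(0, 0.5) for t in target]  # noisy initial sample
result = denoise(target, start)
print(math.dist(result, target) < math.dist(start, target))  # True
```

The takeaway mirrors the summary's point: because each step only needs a coarse correction, a handful of iterations suffices for grasp poses, unlike the hundreds of steps common in image diffusion.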
Has the robotics "GPT moment" arrived? Toyota Research Institute quietly conducted the most rigorous VLA validation yet
具身智能之心· 2025-07-21 08:42
Core Viewpoint
- The article discusses advances in robotic manipulation, focusing on the development of Large Behavior Models (LBMs) that enable robots to perform complex tasks autonomously, with significant performance gains over traditional single-task models [3][7][15].

Summary by Sections

Introduction to Robotic Arms
- Robotic arms are typically associated with simple tasks like grabbing objects or serving ice cream, but complexity grows exponentially for intricate operations such as setting a table or assembling a bicycle [2][3].

Development of VLA Models
- Recent progress in vision-language-action (VLA) models allows robots to integrate multimodal information (images, instructions, scene semantics) and execute complex tasks, moving toward more intelligent and versatile systems [3][4].

Large Behavior Models (LBM)
- LBMs, built on diffusion-model policies, represent a significant advance in robot capability, enabling autonomous execution of complex operations with impressive results [7][10][19].
- The research, conducted by Toyota Research Institute (TRI) and led by notable scholars, emphasizes rigorous evaluation, demonstrating effectiveness in both simulated and real-world environments [9][10].

Training and Evaluation
- The LBM was trained on a diverse dataset including 1,700 hours of robot data and evaluated through 1,800 real-world rollouts and over 47,000 simulated deployments [13][14].
- Even with limited training data, the model's performance improves markedly, suggesting a positive cycle of data acquisition and performance enhancement [14][16].

Performance Metrics
- Evaluation used success rate and task-completion measures, with a focus on relative success rates to compare methods more fairly [26][27].
- The LBM outperformed single-task baselines on both seen and unseen tasks, indicating robustness and adaptability [31][39].

Conclusion and Future Implications
- The research suggests general-purpose large models for robotics are on the horizon, hinting at a potential "GPT moment" for embodied intelligence [15][43].
- Pre-training yields better task performance with less data, reinforcing the expectation that performance benefits will continue to grow with data volume [43][45].
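The relative-success-rate comparison the summary highlights can be made concrete with a small sketch. The convention below (candidate success rate divided by baseline success rate on the same task set) is one common choice; TRI's exact protocol may differ:

```python
def success_rate(outcomes):
    """Fraction of trials that succeeded; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)

def relative_success_rate(candidate, baseline):
    """Ratio of candidate to baseline success rate on the same task set.
    Values > 1.0 mean the candidate outperforms the baseline."""
    base = success_rate(baseline)
    if base == 0:
        raise ValueError("baseline never succeeded; ratio undefined")
    return success_rate(candidate) / base

# 20 rollouts of the same task per policy (illustrative data, not TRI's results)
lbm_rollouts      = [True] * 16 + [False] * 4   # 80% success
baseline_rollouts = [True] * 10 + [False] * 10  # 50% success
print(relative_success_rate(lbm_rollouts, baseline_rollouts))  # 1.6
```

Reporting the ratio rather than raw success rates helps when tasks differ wildly in base difficulty, which is presumably why the evaluation emphasizes it.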
VLN-PE: a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots (ICCV'25)
具身智能之心· 2025-07-21 08:42
Core Insights
- The article introduces VLN-PE, a physically realistic platform for vision-language navigation (VLN), addressing the gap between simulation-trained models and real-world deployment [3][10][15].
- Transferring existing VLN models from simulation to physically realistic environments causes a significant performance drop (34%), emphasizing the need for improved adaptability [15][30].
- The research identifies the impact of robot type, environmental conditions, and the use of physical controllers on model performance [15][32][38].

Background
- VLN has emerged as a critical task in embodied AI, requiring agents to navigate complex environments from natural-language instructions [6][8].
- Previous models relied on idealized simulations that do not account for the physical constraints and challenges real robots face [9][10].

VLN-PE Platform
- VLN-PE is built on GRUTopia, supports multiple robot types, and integrates high-quality synthetic and 3D-rendered environments for comprehensive evaluation [10][13].
- The platform allows seamless integration of new scenes, broadening the scope of VLN research and assessment [10][14].

Experimental Findings
- Existing models show a 34% drop in success rate when transitioning from simulated to physical environments, revealing a substantial performance gap [15][30].
- Multi-modal robustness matters: RGB-D models perform better under low-light conditions than RGB-only models [15][38].
- Training on diverse datasets improves the generalization of VLN models across environments [29][39].

Methodologies
- The article evaluates single-step discrete action-classification models and multi-step continuous prediction methods, highlighting the potential of diffusion policies for VLN [20][21].
- It also explores the effectiveness of map-based zero-shot large language models (LLMs) for navigation tasks, demonstrating their potential in VLN applications [24][25].

Performance Metrics
- The study employs standard VLN evaluation metrics, including trajectory length, navigation error, and success rate [18][19].
- Additional metrics such as fall rate and stuck rate are introduced to account for physical realism, which is critical for evaluating robots in real-world scenarios [18][19].

Cross-Embodiment Training
- Cross-embodiment training enhances model performance, allowing a unified model to generalize across robot types [36][39].
- Using data from multiple robot types during training leads to improved adaptability and performance across environments [36][39].
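The standard VLN metrics listed above can be sketched directly. The 3 m success radius and the SPL formula below follow common VLN conventions; VLN-PE's additional physical-realism metrics (fall rate, stuck rate) would need simulator state and are omitted here:

```python
import math

SUCCESS_RADIUS = 3.0  # metres, the conventional VLN success threshold

def navigation_error(final_pos, goal):
    """Euclidean distance between where the agent stopped and the goal."""
    return math.dist(final_pos, goal)

def episode_metrics(final_pos, goal, path_len, shortest_len):
    """Per-episode NE, SR, and SPL (success weighted by path efficiency)."""
    ne = navigation_error(final_pos, goal)
    success = ne <= SUCCESS_RADIUS
    # SPL: shortest-path length over the longer of (taken, shortest) path;
    # zero for failed episodes.
    spl = (shortest_len / max(path_len, shortest_len)) if success else 0.0
    return {"NE": ne, "SR": float(success), "SPL": spl}

m = episode_metrics(final_pos=(1.0, 2.0), goal=(2.0, 2.0),
                    path_len=12.0, shortest_len=9.0)
print(m)  # NE = 1.0, SR = 1.0, SPL = 0.75
```

Dataset-level numbers are then just the means of these per-episode values, which is why a 34% sim-to-real drop in SR is directly comparable across platforms.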
No publications? Autumn recruiting punishes every graduate student who got their priorities backwards!
具身智能之心· 2025-07-21 08:42
Core Viewpoint
- The article emphasizes the importance of proactive engagement in research and the use of available resources to enhance academic and career prospects for students, particularly in the context of job hunting and academic publishing [1].

Group 1: Research Guidance and Support
- The company offers a comprehensive research guidance program aimed at helping students produce high-quality academic papers, particularly in AI-related fields [3][12].
- A case study describes a second-year graduate student who completed an SCI paper in three months with the company's assistance [2].
- The program includes personalized mentoring from over 300 qualified instructors, with a 96% acceptance rate among students who received guidance [3].

Group 2: Structured Research Process
- The research process follows a 12-week timeline covering topic selection, literature review, experimental design, drafting, and submission [5].
- The program addresses common issues such as lack of supervisor guidance and fragmented knowledge by providing a clear research framework [6].

Group 3: Target Audience and Benefits
- The service is tailored for graduate students in computer science and related fields who seek to enhance their research capabilities, accumulate experience, and improve their academic profiles [11].
- Participants can expect to gain skills in research methodology, paper writing, and coding, as well as insight into cutting-edge technologies and trends in their fields [11].

Group 4: Additional Opportunities
- Outstanding students may receive recommendations to prestigious institutions and direct referrals to leading tech companies; publishing a paper is only the beginning of their academic journey [15].
- The program offers free trial sessions and a satisfaction guarantee on consultations, helping students find the right mentor [15].
Sure enough! Autumn recruiting punishes every graduate student who got their priorities backwards!
具身智能之心· 2025-07-21 08:24
Core Viewpoint
- The article emphasizes the importance of proactive engagement in research and academic writing for students, particularly those in graduate programs, to enhance their employability and academic credentials [1].

Group 1: Employment and Academic Strategies
- Students should actively seek opportunities and resources to improve their job prospects, including participating in both campus and social recruitment [1].
- Accumulating research results and practical experience boosts students' confidence in job applications and further study [1].

Group 2: Research Guidance Services
- The company offers a comprehensive research guidance program that helps students navigate the challenges of academic writing and the research process, particularly in AI-related fields [3][12].
- The program reports a 96% acceptance rate for students who received guidance over the past three years [3].

Group 3: Course Structure and Support
- The structured course spans 12 weeks, covering topic selection, literature review, experimental design, draft completion, and submission [5].
- The service includes personalized mentorship, real-time interaction with tutors, and unlimited access to recorded sessions for review [12][16].

Group 4: Target Audience and Benefits
- The program is designed for graduate students who lack guidance from their advisors, those seeking to enhance their research capabilities, and individuals aiming to improve their academic profiles for career advancement [11].
- Participants gain not only a published paper but also skills in research methodology and coding, plus networking opportunities with prestigious institutions and companies [15].
Built for embodied learning! After 12 hardware revisions, this bipedal robot platform improved stability by 300%......
具身智能之心· 2025-07-21 08:24
Core Viewpoint
- TRON1 is a cutting-edge research platform designed for educational and scientific purposes, featuring a modular design that supports multiple locomotion forms and algorithms, maximizing research flexibility [1].

Function Overview
- TRON1 serves as a humanoid-gait development platform, ideal for reinforcement learning research, and supports external devices for navigation and perception [6][4].
- The platform supports development in both C++ and Python, making it accessible to users without C++ experience [6].

Features and Specifications
- The perception expansion kit includes: an NVIDIA Ampere-architecture GPU with 1024 CUDA cores and 32 Tensor cores; AI compute of 157 TOPS (sparse) / 78 TOPS (dense); and 16 GB LPDDR5 memory with 102.4 GB/s bandwidth [16].
- TRON1 can integrate various sensors, including LiDAR and depth cameras, to support 3D mapping, localization, navigation, and dynamic obstacle avoidance [13].

Development and Customization
- The SDK and development documentation are well structured, making secondary development easy even for beginners [34].
- Software and model-structure updates are delivered online, enhancing convenience [36].

Additional Capabilities
- TRON1 supports voice interaction, including voice wake-up and control, suitable for educational and interactive applications [18].
- The platform can be equipped with robotic arms for various mobile manipulation tasks, supporting both single-arm and dual-leg configurations [11].

Product Variants
- TRON1 is available in standard and EDU versions, both featuring a modular design and similar mechanical parameters, including a maximum payload of approximately 10 kg [26].
VLFly: UAV vision-language navigation based on open-vocabulary goal understanding
具身智能之心· 2025-07-20 01:06
Core Viewpoint
- The article presents VLFly, a vision-language navigation framework for drones that enables open-vocabulary goal understanding and zero-shot transfer without task-specific fine-tuning, navigating from natural-language instructions and visual information captured by the drone's monocular camera alone [8][19].

Research Background
- Vision-language navigation matters because it enables robots to execute complex tasks from natural-language commands, with applications in home assistance, urban inspection, and environmental exploration [3].
- Existing methods have limitations, particularly in interpreting high-level semantic intent and integrating natural-language input [9].

Task Definition
- The drone vision-language navigation task is defined as a partially observable Markov decision process (POMDP), consisting of a state space, action space, observation space, and state-transition probabilities [5].

Framework Composition
- VLFly consists of three modules — natural-language understanding, cross-modal target localization, and navigable-waypoint generation — bridging the gap between semantic instructions and continuous drone control commands [8].

Module Details
- Instruction Encoding Module: converts natural-language instructions into structured text prompts using the LLaMA language model [11].
- Target Retrieval Module: selects the most semantically relevant image from a predefined pool given the text prompt, using the CLIP model [10].
- Waypoint Planning Module: generates executable waypoint trajectories from current observations and the target image [12].

Experimental Setup
- The framework was evaluated in diverse simulated and real-world environments, demonstrating strong generalization and outperforming all baseline methods [8][18].
- Evaluation metrics included success rate (SR), oracle success rate (OS), success rate weighted by path length (SPL), and navigation error (NE) [12].

Experimental Results
- VLFly outperformed baseline methods across all metrics, particularly in unseen environments, showing robust performance both indoors and outdoors [18].
- The framework achieved an 83% success rate on direct instructions and 70% on indirect instructions [18].

Conclusion and Future Work
- VLFly is a new VLN framework designed specifically for drones, capable of navigating using only visual information from a monocular camera [19].
- Future work includes expanding the waypoint-planning training data to support full 3D maneuvers and exploring vision-language models for dynamically identifying target candidates in open-world environments [19].
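The Target Retrieval Module's cross-modal matching step — pick the pool image whose embedding best matches the text-prompt embedding — can be sketched with cosine similarity. The tiny hand-made vectors below stand in for real CLIP features, so all names and values are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_target(text_emb, image_pool):
    """Return the id of the pool image most similar to the text embedding."""
    return max(image_pool, key=lambda name: cosine(text_emb, image_pool[name]))

# Placeholder embedding for a prompt like "fly to the red door"
text_emb = [0.9, 0.1, 0.0]
pool = {
    "red_door.png":   [0.8, 0.2, 0.1],
    "blue_car.png":   [0.1, 0.9, 0.2],
    "green_tree.png": [0.0, 0.3, 0.9],
}
print(retrieve_target(text_emb, pool))  # red_door.png
```

Because both encoders project into the same embedding space, the retrieval step needs no task-specific training — which is the mechanism behind the zero-shot, open-vocabulary behavior the summary describes.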