具身智能之心
This embodied-AI company's positioning is strikingly industrial?! Hiring algorithm researchers at up to 1,000,000 RMB a year
具身智能之心· 2025-07-17 02:58
Compensation

Highly competitive pay and rewards:
- Full-time employees: annual salary of 700k-1,000k RMB for PhDs and 400k-600k RMB for Master's graduates (negotiable for outstanding candidates), plus generous annual performance incentives;
- Technical-team incentive: 10% of project profits is allocated to the technical team, turning your ideas into tangible returns;
- Interns: 300 RMB/day for Master's interns and 400 RMB/day for PhD interns, with free housing provided to help strong candidates start worry-free.

Comprehensive benefits:
- Social insurance and housing fund paid in full (housing fund contributed at the capped combined employer-plus-employee rate of 24%);
- Additional housing and meal allowances;
- Snacks and drinks stocked around the clock.

How to apply

For more job-hunting content, join our AutoRobo Knowledge Planet, a career community covering robotics, autonomous driving, and embodied intelligence. It is also the first community in China centered on autonomous driving and embodied AI. A big third-anniversary discount is live; come keep growing with us!

AutoRobo Knowledge Planet is a job-hunting and exchange space for students in autonomous driving, embodied intelligence, and robotics, now with nearly 1,000 members. They include working professionals from companies such as Horizon Robotics, Li Auto, Huawei, Xiaomi Auto, Momenta, and DeepRoute.ai, as well as candidates from the 2024 and 2025 autumn recruitment seasons, spanning most areas of autonomous driving and embodied intelligence. What does the community offer? Building on our existing strengths, ...
As expected! Autumn recruitment punishes every graduate student who puts the cart before the horse!
具身智能之心· 2025-07-17 00:53
Core Viewpoint
- The article emphasizes proactive engagement in research and academic writing for students, especially those in graduate programs, as a way to strengthen both employability and academic credentials.

Group 1: Employment and Academic Pressure
- The article highlights students' growing anxiety about job prospects as the job market evolves, urging them to take action rather than wait passively [1]
- It suggests that students track both campus recruitment and social recruitment to identify gaps in their skills and knowledge [1]

Group 2: Research Guidance and Support
- The company offers a comprehensive research guidance program aimed at helping students produce high-quality academic papers, particularly in fields like autonomous driving and embodied intelligence [3][12]
- The program reports a 96% acceptance rate for papers submitted by students who received guidance [3]

Group 3: Structured Research Process
- The article outlines a 12-week structured process for completing a research paper, covering topic selection, literature review, experimental design, and submission [5]
- This structured approach is designed to help students overcome challenges such as a lack of supervisor guidance and fragmented knowledge [6]

Group 4: Target Audience and Benefits
- The program is tailored for graduate students who need to produce research papers for graduation, strengthen their academic profiles, or improve their job competitiveness in the AI field [11]
- Participants gain not only a published paper but also skills in research methodology and coding, plus networking opportunities with prestigious institutions [15]

Group 5: Personalized Support and Flexibility
- The company provides personalized mentoring, real-time interaction with instructors, and flexible learning options, including recorded sessions and 24-hour support [12][16]
- A matching system pairs students with mentors who align with their research interests and goals [14]
A small-model comeback! Qiu Xipeng's team at Fudan and the Shanghai Innovation Institute (创智) builds a "world-aware" embodied agent, with code and data fully open-sourced!
具身智能之心· 2025-07-16 09:12
Core Viewpoint
- The article introduces the World-Aware Planning Narrative Enhancement (WAP) framework, which significantly improves the performance of large vision-language models (LVLMs) in embodied planning tasks by integrating world knowledge into both the data and the reasoning chain [2][17].

Group 1: Introduction
- LVLMs are becoming central to embodied planning, but existing methods often rely on environment-agnostic imitation learning, leading to poor performance in unfamiliar scenarios [2].
- On the EB-ALFRED benchmark, WAP raises the success rate from 2% to 62.7%, surpassing models such as GPT-4o and Claude-3.5-Sonnet and highlighting the importance of world perception in high-level planning [2][17].

Group 2: Related Work
- WAP differs from existing approaches by explicitly binding instruction-environment context at the data level and by relying solely on visual feedback, without privileged information [4].

Group 3: Technical Method
- The framework injects four-dimensional cognitive narratives (visual, spatial, functional, syntactic) into the data layer, allowing the model to understand the environment before reasoning deeply [6].
- It employs closed-loop observation (RGB + instructions only) and a three-stage curriculum to build environmental understanding and long-horizon reasoning [6][12].

Group 4: Experiments
- On EmbodiedBench (EB-ALFRED), the WAP approach substantially raises success rates across task categories, with Qwen2.5-VL gaining 60.7 percentage points in average success rate [14].
- The framework also markedly improves long-horizon task success, reaching 70% versus previous models [14][16].

Group 5: Conclusion and Future Work
- WAP effectively incorporates world knowledge into the data and reasoning processes, allowing smaller open-source LVLMs to outperform commercial models in purely visual closed-loop settings [17].
- Future work includes extending to dynamic industrial/outdoor scenes and exploring self-supervised narrative evolution for iterative data-model improvement [21].
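To make the data-layer idea concrete, here is a minimal, hypothetical Python sketch of WAP-style narrative injection and curriculum staging. `WAPSample`, `inject_narratives`, the `annotator` interface, and the length-based difficulty heuristic are all illustrative assumptions, not the released API; the real schema lives in the team's open-sourced code.

```python
# Hypothetical sketch of WAP-style data augmentation: each (instruction, RGB
# observation) pair is enriched with four "cognitive narrative" annotations
# before instruction tuning. Field names and the curriculum heuristic are
# assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class WAPSample:
    instruction: str   # natural-language task, e.g. "chill the apple"
    rgb_frames: list   # closed-loop visual observations (RGB only)
    narratives: dict = field(default_factory=dict)


def inject_narratives(sample: WAPSample, annotator) -> WAPSample:
    """Attach the four narrative dimensions named in the paper summary."""
    for dim in ("visual", "spatial", "functional", "syntactic"):
        # `annotator` stands in for any captioner/LLM that can ground the
        # requested dimension in the observed frames.
        sample.narratives[dim] = annotator.describe(sample.rgb_frames, dim)
    return sample


def curriculum_stages(dataset: list) -> list:
    """Three-stage curriculum: short-horizon -> medium -> long-horizon.

    Instruction length is used here as a crude proxy for task horizon;
    the actual staging criterion is defined in the released code.
    """
    ordered = sorted(dataset, key=lambda s: len(s.instruction.split()))
    third = max(1, len(ordered) // 3)
    return [ordered[:third], ordered[third:2 * third], ordered[2 * third:]]
```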
ICCV 2025 full-score paper: one model unifies spatial understanding and active exploration
具身智能之心· 2025-07-16 09:12
Core Insights
- The article discusses the transition of artificial intelligence from the virtual internet space to the physical world, emphasizing the challenge of enabling agents to understand three-dimensional spaces and align natural language with real environments [3][40]
- A new model proposed by a collaborative research team aims to unify spatial understanding and active exploration, allowing agents to build cognitive maps of their environments through dynamic exploration [3][40]

Group 1: Model Overview
- The proposed model integrates exploration and visual grounding in a closed-loop process, where understanding and exploration are interdependent and enhance each other [10][14]
- The model consists of two main components: online spatial memory construction and spatial reasoning and decision-making, optimized under a unified training framework [16][22]

Group 2: Exploration and Understanding
- In the exploration phase, the agent accumulates spatial memory through continuous RGB-D perception, actively seeking potential target locations [12][21]
- The reasoning phase involves reading from the spatial memory to identify candidate areas relevant to the task instruction, utilizing cross-attention mechanisms [22][23]

Group 3: Data Collection and Training
- The authors propose a hybrid strategy for data collection, combining real RGB-D scan data with virtual simulation environments to enhance the model's visual understanding and exploration capabilities [25]
- The constructed dataset includes over 900,000 navigation trajectories and millions of language descriptions, covering task types such as visual guidance and goal localization [25]

Group 4: Experimental Results
- The MTU3D model was evaluated on four key tasks, demonstrating significant improvements in success rates over existing methods, with a notable increase of over 20% on the GOAT-Bench benchmark [28][29]
- On the A-EQA task, the model lifted GPT-4V's success rate from 41.8% to 44.2%, indicating its potential to enhance multimodal large models [32][33]

Group 5: Conclusion
- The emergence of MTU3D represents a significant advancement in embodied navigation, combining understanding and exploration so that AI can autonomously navigate and complete tasks in real-world environments [40]
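A minimal sketch of the memory-reading step described in Group 2, assuming a standard PyTorch cross-attention layer. The module name, tensor shapes, and the linear scoring head are illustrative assumptions, not the authors' implementation.

```python
# Sketch: cross-attend an encoded task instruction over tokens of an online
# spatial memory (one token per remembered region/frontier), then score each
# memory token as a candidate goal. Assumed shapes, not MTU3D's real code.
import torch
import torch.nn as nn


class SpatialMemoryReader(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-candidate relevance score

    def forward(self, instr_tokens: torch.Tensor,
                memory_tokens: torch.Tensor) -> torch.Tensor:
        # instr_tokens:  (B, L, dim) encoded task instruction
        # memory_tokens: (B, M, dim) accumulated spatial-memory tokens
        fused, _ = self.attn(query=memory_tokens, key=instr_tokens,
                             value=instr_tokens)
        return self.score(fused).squeeze(-1)  # (B, M) candidate scores


# Usage: pick the highest-scoring remembered region as the next goal.
reader = SpatialMemoryReader()
scores = reader(torch.randn(1, 12, 256), torch.randn(1, 40, 256))
next_goal = scores.argmax(dim=-1)  # index into the spatial memory
```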
One year on, a bittersweet road! From scrappy beginnings to a professional embodied-AI education platform
具身智能之心· 2025-07-16 09:12
Core Insights - The "Embodied Intelligence Heart" platform has made significant progress in the past year, expanding in product development, financing, and technology within the embodied intelligence sector [1][2] - The platform has transitioned from a semi-welfare learning community to a paid knowledge community, with membership benefits including discounts on self-developed platforms and courses, job referrals, and internal learning sessions [2][19] - The community has established a job referral mechanism with multiple embodied intelligence companies, facilitating connections between job seekers and employers [8][19] Product and Technology Development - The platform has developed several courses related to embodied intelligence, including vla, vln, dp, sim2real, and reinforcement learning, which have been well-received by over 1,500 members [1][13] - A comprehensive list of over 30 technical routes has been organized to assist members in finding benchmarks and learning paths, significantly reducing search time [2][13] - The community has compiled nearly 40 open-source projects and 60 datasets related to embodied intelligence, providing valuable resources for both beginners and advanced learners [13][32] Community Engagement and Learning - The platform hosts various roundtable forums and live sessions covering topics from fundamentals to algorithms, aimed at sharing insights on industry developments and challenges [2][19] - Members have access to exclusive learning videos and documents, enhancing the educational experience [19] - The community includes members from renowned universities and leading companies in the field, fostering a rich environment for knowledge exchange [13][18] Membership Benefits - Membership in the community offers numerous advantages, including job recommendations, industry insights, and access to exclusive content [19][21] - The platform provides a structured approach to learning, with detailed summaries of various research directions and industry reports available to members [21][24] - Members can engage in discussions and receive guidance on career choices and research directions, promoting a collaborative learning atmosphere [72]
BeDAViN: a large-scale audio-visual dataset and a study of multi-sound-source architectures
具身智能之心· 2025-07-16 09:12
Author: 视觉语言导航

Main contributions

Research background
- The importance of embodied navigation: embodied navigation is a fundamental and critical component of Embodied AI, requiring autonomous agents to solve complex navigation tasks by interacting with unseen environments. In recent years, embodied navigation has been widely applied in areas such as household services, warehousing, and logistics.

| Dataset | Total number of samples | Total duration of audio |
| --- | --- | --- |
| SAVi-dataset (Chen, Al-Halah, and Grauman 2021) | 1,157 | 144 seconds |
| BeDAViN (Ours) | 2,258 | ... |

- Limitations of existing research:
  - Dataset limitation: existing audio-visual navigation datasets contain only a limited number of samples, making it difficult to simulate diverse multi-sound-source scenarios.
  - Framework limitation: most existing navigation frameworks are designed for single-sound-source scenarios, and their performance drops sharply in multi-sound-source settings ...
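Since the summary stresses multi-sound-source scenarios, here is a hypothetical sketch of how one such navigation episode might be assembled from an audio-sample pool. The `SoundEvent` fields and `make_episode` helper are assumptions for illustration; the actual BeDAViN format is defined by the released dataset.

```python
# Hypothetical assembly of a multi-sound-source episode: one target emitter
# the agent must navigate to, plus distractor emitters elsewhere in the scene.
import random
from dataclasses import dataclass


@dataclass
class SoundEvent:
    wav_path: str     # path to a mono audio sample from the pool
    position: tuple   # (x, y, z) emitter location in the scene
    is_target: bool   # True for the sound source the agent must reach


def make_episode(audio_pool: list, scene_positions: list,
                 n_sources: int = 3) -> list:
    """Sample one target source plus (n_sources - 1) distractors."""
    paths = random.sample(audio_pool, n_sources)
    positions = random.sample(scene_positions, n_sources)
    return [SoundEvent(p, pos, is_target=(i == 0))
            for i, (p, pos) in enumerate(zip(paths, positions))]
```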
Making VLMs a better fit for robots: small VLMs can also exhibit strong visual planning capabilities
具身智能之心· 2025-07-15 13:49
Core Insights
- The article discusses the potential of large language models (LLMs) in robotic program planning, highlighting their ability to generate coherent action sequences while noting that they often omit the sensory detail needed for physical execution [3][4]
- It introduces SelfReVision, a framework that improves small vision-language models (VLMs) through self-distillation without external supervision, aiming to strengthen their planning capabilities in real-world scenarios [4][9]

Research Background
- LLMs show promise in generating action sequences but often lack the precision required for robotic tasks due to their reliance on human-centric training data [3]
- VLMs could potentially address these limitations, but existing methods either require specialized simulation environments or are costly to train and deploy [3]

Methodology
- SelfReVision is a self-improvement framework that lets small VLMs iteratively self-critique and revise their own plans [4][6]
- The framework operates in three stages: critique, revise, and verify, enabling models to generate and refine plans through self-assessment [4][10]

Experimental Setup
- Two types of experiments evaluated SelfReVision's planning capabilities: image-based program planning and embodied-agent tasks [11]
- Evaluation metrics included coverage, ordering, completeness, overall quality, and a new metric called image groundedness [12]

Key Results
- SelfReVision significantly outperformed baseline models across metrics, achieving an average win rate of 68% on the PLACES dataset and 72% on the SIMULATION dataset [13]
- Larger models benefited more from SelfReVision, with an average gain of 74% for models of 12 billion parameters or more [13]

Comparison with Other Methods
- SelfReVision demonstrated clear advantages over methods such as Best-of-N and over PaliGemma, with improvements of 60% in most settings compared to modest gains from Best-of-N [17]
- Against GPT-4o, SelfReVision's plans achieved at least a 25% higher win rate for models of 12 billion parameters or more, indicating its effectiveness at lifting smaller models [17]

Ablation Studies
- The complete Criticize-Revise-Verify (CRV) process showed the strongest performance, with average win rates of 68.3% on the PLACES dataset and 71.9% on the SIMULATION dataset [18]
- Variants of the process showed significant performance drops, underscoring the importance of the verification step in filtering out suboptimal revisions [18]

Application in Embodied-Agent Tasks
- In challenging scenarios, SelfReVision yielded a 26% improvement for the Gemma 12B model and 17% for the Gemma 27B model on block-manipulation tasks [21]
- In hierarchical tasks, SelfReVision plans achieved a 70% trajectory-generation success rate, surpassing the 61% of baseline models [21]
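The three-stage loop is simple enough to sketch end to end. Below is a minimal Python rendition of a critique-revise-verify cycle; the prompts and the `vlm(image, text)` callable are assumptions for illustration, not the paper's exact interface.

```python
# Sketch of a SelfReVision-style critique-revise-verify (CRV) loop: the model
# criticizes its own plan, rewrites it, and keeps the rewrite only if it
# self-verifies as better. Prompts here are illustrative placeholders.
def self_revision(vlm, image, task: str, max_rounds: int = 3) -> str:
    plan = vlm(image, f"Write a step-by-step plan to: {task}")
    for _ in range(max_rounds):
        critique = vlm(image, f"Critique this plan for: {task}\n{plan}\n"
                              "List missing steps, wrong ordering, or steps "
                              "not grounded in the image. Reply OK if none.")
        if critique.strip() == "OK":
            break  # self-critique found no remaining issues
        revised = vlm(image, f"Rewrite the plan for: {task}\n{plan}\n"
                             f"Fix these issues:\n{critique}")
        verdict = vlm(image, "Which plan is better? Answer A or B only.\n"
                             f"A:\n{plan}\nB:\n{revised}")
        if verdict.strip().upper().startswith("B"):
            plan = revised  # keep the revision only if it verifies as better
    return plan
```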
A survey of robot embodied intelligence driven by physical simulators and world models
具身智能之心· 2025-07-15 13:49
Core Insights
- The article emphasizes the significance of "embodied intelligence" in the pursuit of artificial general intelligence (AGI), highlighting the need for intelligent agents to perceive, reason, and act in the physical world [3][5]
- The integration of physical simulators and world models is identified as a promising pathway to enhance the capabilities of robots, enabling them to transition from merely "doing" to "thinking" [3][5]

Summary by Sections
1. Introduction to Embodied Intelligence
- Embodied intelligence focuses on intelligent agents that can autonomously perceive, predict, and execute actions in complex environments, which is essential for achieving AGI [5]
2. Key Technologies
- Two foundational technologies, physical simulators and world models, are crucial for developing robust embodied intelligence: physical simulators provide safe and efficient environments for training, while world models enable internal representations of the environment for predictive planning and adaptive decision-making [5]
3. Research Contributions
- The article reviews recent advances in learning embodied intelligence through the fusion of physical simulators and world models, analyzing their complementary roles in enhancing agent autonomy, adaptability, and generalization [5]
4. Robot Capability Classification
- A five-level capability classification system for intelligent robots is proposed, ranging from IR-L0 (basic execution) to IR-L4 (fully autonomous), covering dimensions such as autonomy, task handling, environmental adaptability, and social cognition [8][15]
5. Core Technology Review
- The article systematically reviews the latest technological advances in legged locomotion, manipulation control, and human-robot interaction, emphasizing the importance of these capabilities for intelligent robots [8]
6. Physical Simulator Comparison
- A comparative analysis of mainstream simulation platforms (Webots, Gazebo, MuJoCo, Isaac Gym/Sim) covers their physics-engine accuracy, rendering quality, and sensor support, along with future optimization directions [13][19]
7. World Model Architecture and Applications
- The article discusses representative world-model structures, including predictive networks and generative models, and their applications in embodied intelligence, particularly autonomous driving and articulated robots [14][20]
Major livestream! RoboTwin2.0: a dual-arm manipulation data generator with strong domain randomization, plus an evaluation benchmark suite
具身智能之心· 2025-07-15 13:49
Core Viewpoint
- The article discusses the challenges and advances in training dual-arm robots for complex tasks, emphasizing the need for efficient data collection and simulation methods to improve their manipulation capabilities [2]

Group 1: Challenges in Dual-Arm Robot Training
- Dual-arm robots are crucial for collaborative assembly, tool use, and object handover in complex scenarios, but training them as general-purpose VLA-style policies faces multiple bottlenecks [2]
- Collecting real demonstration data at scale is costly and slow, making it difficult to cover a wide range of tasks, object geometries, and hardware variations [2]
- Existing simulation methods lack efficient, scalable expert-data generation for new tasks, and their domain randomization is too shallow to capture the complexity of real environments (see the configuration sketch after this list) [2]

Group 2: Advances and Solutions
- The article highlights UniVLA, which efficiently exploits multi-source heterogeneous data to build a general, scalable action space for robots [5]
- The CVPR champion solution, BridgeVLA, reportedly improves real-robot performance by 32%, showcasing advances in robot navigation and motion control in real-world scenarios [4]
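To illustrate what deeper domain randomization involves, here is a hypothetical configuration sketch for a dual-arm data generator. All parameter names and ranges are invented for illustration and are not RoboTwin2.0's actual API.

```python
# Hypothetical domain-randomization config: each training episode samples a
# new scene variation (lighting, camera pose, physics, textures) so that
# generated demonstrations cover more of the real world's variability.
import random
from dataclasses import dataclass


@dataclass
class DomainRandomization:
    light_intensity: tuple = (0.4, 1.6)    # scale factor on scene lighting
    camera_jitter_m: float = 0.02          # max random camera offset, meters
    object_mass_scale: tuple = (0.7, 1.3)  # multiplier on object masses
    friction_range: tuple = (0.3, 1.0)     # tabletop friction coefficient
    table_textures: tuple = ("wood", "metal", "cloth")

    def sample(self) -> dict:
        """Draw one randomized scene configuration for an episode."""
        return {
            "light": random.uniform(*self.light_intensity),
            "camera_offset": [random.uniform(-self.camera_jitter_m,
                                             self.camera_jitter_m)
                              for _ in range(3)],
            "mass_scale": random.uniform(*self.object_mass_scale),
            "friction": random.uniform(*self.friction_range),
            "texture": random.choice(self.table_textures),
        }
```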
Why are pure humanoid VLA approaches so rare? And what are these companies doing instead?
具身智能之心· 2025-07-15 09:39
Core Viewpoint
- Current industry effort centers on robotic-arm VLA (vision-language-action) models for tasks such as mobile pick-and-place, while humanoid and quadruped VLA still struggle to reach practical deployment due to control complexity and data-collection difficulties [1]

Group 1: Application of VLA in Industry
- Robotic-arm VLA targets relatively simple tasks that rely mainly on visual input, supplemented by tactile or force sensors, making it easier to deploy [1]
- Humanoid robots face difficult data collection and high control complexity: a single dexterous hand can have 20 degrees of freedom, and the whole body approaches 100 degrees of freedom [1]
- Many leading companies use reinforcement learning (RL) to train humanoid VLA for complex tasks, but the generalization and flexibility of humanoid models remain insufficient compared with robotic arms [1]

Group 2: Future Directions
- A promising path forward is a hybrid architecture combining VLA for high-level task planning with RL for low-level motion optimization, a direction many companies are now pursuing (see the sketch after this list) [1]
- Job openings at unicorn companies pursuing breakthroughs in this combined direction are increasing [1]
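Here is a minimal control-loop sketch of that hybrid architecture, assuming placeholder interfaces: `vla.plan`, `rl_policy.act`, and the `robot` methods are invented for illustration and do not correspond to any company's stack.

```python
# Sketch of a hierarchical humanoid controller: a VLA model produces subgoals
# at low frequency from vision + language; an RL policy tracks each subgoal
# with high-frequency whole-body actions. All interfaces are assumed.
import time


def run_hybrid_controller(vla, rl_policy, robot, instruction: str,
                          control_hz: int = 50):
    subgoals = vla.plan(robot.get_image(), instruction)  # e.g. ["reach cup"]
    for goal in subgoals:
        while not robot.subgoal_reached(goal):
            obs = robot.get_proprioception()   # joint states, IMU, contacts
            action = rl_policy.act(obs, goal)  # whole-body joint targets
            robot.apply(action)
            time.sleep(1.0 / control_hz)       # high-rate low-level loop
```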