Vision-Language Models (VLM)
Training-Free! Using Bayesian Methods Instead of Fine-Tuning VLMs Achieves SOTA on Robot Manipulation Tasks!
具身智能之心· 2025-12-03 03:47
Core Insights
- The article discusses advancements in Vision-Language Models (VLM) and introduces T²-VLM, a novel framework that generates temporally consistent rewards for robotic tasks without requiring training [2][5]

Group 1: VLM and T²-VLM Overview
- VLMs have significantly improved performance on embodied tasks such as goal decomposition and visual understanding, but providing precise rewards for robotic manipulation remains challenging due to the lack of domain-specific knowledge in pre-training datasets and high computational costs [2]
- T²-VLM tracks the state changes of sub-goals derived from VLMs to generate accurate rewards, strengthening long-horizon decision-making and improving failure-recovery performance through reinforcement learning [2]

Group 2: Methodology and Results
- T²-VLM queries the VLM before each interaction to establish spatially aware sub-goals and initial completion estimates, then uses a Bayesian tracking algorithm to dynamically update the estimated completion state of each sub-goal [2]
- Extensive experiments show that T²-VLM achieves state-of-the-art performance on two robotic manipulation benchmarks while reducing computational cost and delivering superior reward accuracy [2]

Group 3: Live Session Details
- A live session is scheduled for December 3rd, 19:30-20:30, covering the background of real-robot reinforcement learning, the current state of VLM-based reward generation research, and reflections on the T²-VLM method [5][6]
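The Bayesian tracking step described above can be sketched as a simple belief update: each sub-goal carries a probability of being complete, and each noisy VLM check moves that belief via Bayes' rule. This is a minimal illustration, not the paper's actual algorithm; the observation likelihoods, threshold, and reward definition below are assumptions.

```python
# Minimal sketch of Bayesian tracking of sub-goal completion from noisy VLM
# observations. The hit rate / false-alarm rate values are illustrative
# assumptions, not values from the T²-VLM paper.

def bayes_update(prior: float, observed_done: bool,
                 p_obs_given_done: float = 0.9,
                 p_obs_given_not_done: float = 0.2) -> float:
    """Posterior P(sub-goal done | VLM observation) via Bayes' rule."""
    if observed_done:
        num = p_obs_given_done * prior
        den = num + p_obs_given_not_done * (1.0 - prior)
    else:
        num = (1.0 - p_obs_given_done) * prior
        den = num + (1.0 - p_obs_given_not_done) * (1.0 - prior)
    return num / den

def reward_from_beliefs(beliefs, threshold: float = 0.8) -> float:
    """Dense reward: fraction of sub-goals currently believed complete."""
    return sum(b >= threshold for b in beliefs) / len(beliefs)

# One tracking step: the VLM reports sub-goal 0 as done, sub-goal 1 as not done.
beliefs = [0.5, 0.5]                        # initial completion estimates
beliefs[0] = bayes_update(beliefs[0], True)
beliefs[1] = bayes_update(beliefs[1], False)
```

Because beliefs persist across interactions, a single contradictory VLM answer only nudges the estimate rather than flipping it, which is one way to obtain the temporally consistent rewards the article emphasizes.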
VLMs Can "Self-Evolve" Too! The RL Self-Evolution Framework VisPlay Cracks Visual Reasoning
具身智能之心· 2025-12-02 09:30
Core Insights
- The article introduces VisPlay, a self-evolving reinforcement learning framework for Vision-Language Models (VLM) that enables self-improvement from vast amounts of unlabeled image data [2][3][18]

Group 1: Challenges in VLM
- VLMs have made significant progress on perception tasks but struggle with complex visual reasoning due to their reliance on high-quality labeled data [5]
- Traditional methods such as supervised fine-tuning and reinforcement learning hit a bottleneck: the cost and speed of manual labeling cannot keep up with evolving model demands [5][4]

Group 2: VisPlay Framework
- VisPlay addresses these challenges with a self-evolution mechanism that lets models learn autonomously from unlabeled images [7][8]
- The framework splits the VLM into two roles: a "Questioner" that generates challenging visual questions, and a "Reasoner" that answers them based on the images [10][12]

Group 3: Reward Mechanism
- VisPlay employs a reward mechanism combining a Difficulty Reward and a Diversity Reward to raise the quality of generated questions and answers [10][11]
- This design mitigates common failure modes of self-evolving models, such as low answer quality and high question redundancy, yielding significant capability gains [11]

Group 4: Experimental Results
- VisPlay has been tested on mainstream VLMs such as Qwen2.5-VL and MiMo-VL across eight benchmark datasets, showing consistent and significant accuracy gains [15][17]
- The framework demonstrates strong generalization, particularly on unseen complex reasoning combinations, and effectively reduces "hallucinations" in VLMs [17][18]
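The two-term reward for the Questioner can be sketched as follows. The article only states that a Difficulty Reward and a Diversity Reward are combined; the concrete formulas here (peak difficulty at ~50% Reasoner success, token-Jaccard novelty against past questions, and the weight `w_div`) are illustrative assumptions.

```python
# Illustrative sketch of a combined Difficulty + Diversity reward for a
# question-generating "Questioner". Only the two-term structure comes from
# the article; the formulas and weight are assumptions.

def difficulty_reward(reasoner_success_rate: float, target: float = 0.5) -> float:
    """Highest when the Reasoner answers correctly about half the time,
    i.e. the question is neither trivial nor impossible."""
    return 1.0 - 2.0 * abs(reasoner_success_rate - target)

def diversity_reward(question: str, history: list) -> float:
    """Penalize questions that overlap heavily (token Jaccard) with history."""
    q = set(question.lower().split())
    if not history:
        return 1.0
    max_overlap = max(
        len(q & set(h.lower().split())) / max(1, len(q | set(h.lower().split())))
        for h in history
    )
    return 1.0 - max_overlap

def questioner_reward(success_rate: float, question: str,
                      history: list, w_div: float = 0.5) -> float:
    return difficulty_reward(success_rate) + w_div * diversity_reward(question, history)
```

A reward shaped like this pushes the Questioner away from both trivially easy and unanswerable questions while discouraging near-duplicates, matching the failure modes (low answer quality, high redundancy) the article says VisPlay mitigates.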
An Illustrated Guide to the Qwen3-VL Multimodal Model
自动驾驶之心· 2025-11-29 02:06
Core Insights
- The article walks through Qwen3-VL, a vision-language model (VLM) that takes both text and images as input, emphasizing its architecture and implementation details [3][4]

Group 1: Model Overview
- Qwen3-VL is an autoregressive model designed to handle multimodal inputs, specifically text and images [3]
- The implementation comprises configuration files, modeling files, and processing files for images and videos [5][6]

Group 2: Source Code Analysis
- The source code is organized into several classes, including Qwen3VLVisionMLP, Qwen3VLVisionPatchEmbed, and Qwen3VLForConditionalGeneration, each serving a specific function within the model [6][12]
- The Qwen3VLProcessor class converts input images into pixel values, reusing the Qwen2-VL image processor for this task [7][10]

Group 3: Image Processing
- Image processing involves resizing, normalizing, and preparing images for the model, ultimately returning pixel values that serve as input [8][9]
- Images are processed in batches, grouped by size for efficient resizing and normalization [9]

Group 4: Model Execution Flow
- The Qwen3VLForConditionalGeneration class is the model's entry point, where input pixel values and text input IDs are processed to generate outputs [15][16]
- The forward method integrates image and text features, embedding the image features into the input token sequence [21][22]

Group 5: Vision Encoder
- The vision encoder of Qwen3-VL is custom-built rather than borrowed from existing models like CLIP, and uses a 3D convolution to convert images into hidden states [35][37]
- The encoder incorporates attention mechanisms and position encoding to strengthen its processing of visual data [40][41]

Group 6: Final Outputs
- The final output combines the processed image and text features, which are forwarded to the language model for further processing [33][34]
- This architecture integrates visual and textual data, enabling the model to generate coherent outputs from multimodal inputs [44]
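The merge step in Group 4 — embedding image features into the text token sequence — can be sketched in a few lines: the vision encoder's outputs replace the embeddings at the placeholder `<image>` token positions. This is a simplified illustration; the placeholder token id, shapes, and list-based embeddings below are assumptions, not Qwen3-VL's actual implementation.

```python
# Sketch: image embeddings produced by the vision encoder replace the
# placeholder positions in the text embedding sequence. The token id and
# toy shapes are illustrative assumptions.

IMAGE_TOKEN_ID = 151655  # assumed placeholder id, not Qwen3-VL's actual value

def merge_image_features(input_ids, text_embeds, image_embeds,
                         image_token_id=IMAGE_TOKEN_ID):
    """Replace each <image> placeholder's embedding with the next image embedding."""
    assert sum(t == image_token_id for t in input_ids) == len(image_embeds), \
        "need exactly one image embedding per placeholder token"
    merged, img_iter = [], iter(image_embeds)
    for tok, emb in zip(input_ids, text_embeds):
        merged.append(next(img_iter) if tok == image_token_id else emb)
    return merged

# Toy sequence: BOS, two image slots, one text token (embeddings as short lists).
ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 42]
text = [[0.0] * 4 for _ in ids]
imgs = [[1.0] * 4, [2.0] * 4]
merged = merge_image_features(ids, text, imgs)
```

The merged sequence is then what the language model consumes, which is why the number of placeholder tokens the processor emits must exactly match the number of visual feature vectors the encoder produces.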
Outperforming GPT and Google: Beijing Humanoid Robot Innovation Center Open-Sources the World's Strongest Embodied VLM
具身智能之心· 2025-11-17 00:47
Core Viewpoint
- The article highlights the launch of Pelican-VL 1.0, a cutting-edge embodied vision-language model (VLM) claimed to surpass the GPT-5 and Google Gemini series, showcasing China's technological strength in embodied intelligence [1][3]

Group 1: Overview of Pelican-VL
- Pelican-VL is described as a "visual language brain," and its open-source release significantly advances embodied intelligence technology [3]
- The core team behind Pelican-VL is composed entirely of women, emphasizing the contribution of female talent to China's technological research and development [7]

Group 2: Innovation Center and Team
- The Beijing Humanoid Robot Innovation Center, established in November 2023, is China's first provincial-level humanoid robot innovation center, formed by companies including Xiaomi Robotics and UBTECH [5]
- The center's notable results include the "Tian Gong" series, the world's first full-size electric humanoid robot, capable of running at 12 km/h and adapting to various complex terrains [5]

Group 3: Core Technology - DPPO
- Pelican-VL's performance breakthrough is attributed to its pioneering DPPO (Deliberate Practice Policy Optimization) training paradigm, which achieves better performance with significantly less data [8][9]
- Traditional models require 1 million to 5 million data points for training, while Pelican-VL needs only 200,000, a data efficiency of 1/10 to 1/50 compared to similar models [8][9]

Group 4: Training Methodology
- DPPO mimics human learning: a closed loop of observation, practice, error correction, and improvement [9]
- Training consists of two key phases: reinforcement learning exploration and targeted supervised fine-tuning focused on identified weaknesses [12]

Group 5: Performance Comparison
- Training used a dedicated computing cluster of over 1,000 A800 GPUs, with a complete model checkpoint requiring more than 50,000 A800 GPU-hours [15]
- The model ships in two versions: a lightweight 7B-parameter model for local deployment and a 72B-parameter model for cloud-based complex task processing, balancing flexibility and performance [23]

Group 6: Data Quality and Performance Metrics
- The training data was meticulously curated from 12 domains, yielding a high-quality dataset that includes millions of tokens and numerous "failure cases" for effective learning [24]
- Performance tests show Pelican-VL outperforming GPT-5 by 15.79% and Google Gemini by 19.25% across dimensions including visual understanding and action planning [25]

Group 7: VLA System Integration
- Pelican-VL serves as the "brain" of a Vision-Language-Action (VLA) system, integrating vision, language, and action modules to execute complex tasks [29][30]
- This integration lets Pelican-VL understand and execute highly abstract composite instructions, enhancing its operational capabilities in real-world scenarios [30]

Group 8: Open Source Impact
- The open-source release is expected to lower the barrier to adopting embodied intelligence, enabling small and medium enterprises to develop intelligent robots without significant upfront investment [34]
- It also encourages full industrial-chain development, fostering a rich application ecosystem around Pelican-VL and expanding the boundaries of embodied intelligence applications [34]
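The two-phase loop attributed to DPPO — explore with RL to surface weaknesses, then fine-tune on data targeting exactly those weaknesses — can be sketched as a skeleton. Everything below is a toy stand-in under that assumed structure; the function bodies are placeholders and do not reflect the actual DPPO implementation.

```python
# Skeleton of the observe -> practice -> correct -> improve loop the article
# attributes to DPPO. All function bodies are toy placeholders/assumptions;
# only the two-phase structure comes from the text.

def rollout_succeeds(model, task):
    """Toy stand-in for an RL rollout: 'succeeds' on already-mastered tasks."""
    return task in model["mastered"]

def supervised_finetune(model, data):
    """Toy stand-in for targeted SFT: failure cases become mastered skills."""
    return {"mastered": model["mastered"] | set(data)}

def dppo_round(model, tasks):
    # Phase 1: RL exploration -- roll out the model and record where it fails.
    failures = [t for t in tasks if not rollout_succeeds(model, t)]
    # Phase 2: targeted supervised fine-tuning on the identified weaknesses.
    return supervised_finetune(model, failures), failures

model = {"mastered": {"pick"}}
model, misses = dppo_round(model, ["pick", "place", "pour"])
```

The point of the skeleton is the data flow: each round spends compute only on tasks the model currently fails, which is consistent with the article's claim that DPPO needs far fewer training samples than uniform training.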
University of Pennsylvania! MAESTRO: A Zero-Shot General-Purpose Robot Framework Built on VLMs
具身智能之心· 2025-11-05 00:02
Core Insights
- MAESTRO is a modular robotic framework centered on Vision-Language Models (VLM), achieving zero-shot manipulation performance without extensive training data while remaining scalable and debuggable [2][5][22]

Group 1: Innovation and Design
- Mainstream robotics development relies on large-scale "observation-action" datasets, which are costly and limited, hindering progress [4]
- MAESTRO takes a differentiated approach, using VLMs to avoid dependence on robot-specific data and integrating mature specialized tools for reliable low-level operation [6][5]
- The framework runs a closed-loop interaction mechanism, continuously monitoring environmental feedback to adjust actions in real time, forming an adaptive cycle of perception, action, and learning [5][6]

Group 2: Core Module Toolset
- The modular design follows six principles, addressing diverse robotic operational needs including perception, control, and geometry [8]
- Key modules include:
  - Perception: improves the accuracy of visual information through a hierarchical approach [10]
  - Control: integrates Cartesian control and collision-free motion planning for safety [10]
  - Geometry & Linear Algebra: provides tools for spatial reasoning [10]
  - Image Editing: improves visual grounding capabilities [10]
  - Mobile Operation Extensions: adapts to mobile-robot scenarios with navigation and active perception tools [10]

Group 3: Evolution Mechanism
- MAESTRO records past task-execution code and outcomes to provide contextual examples for the VLM, optimizing code generation and improving performance after only a handful of real-world trials [12]

Group 4: Experimental Results and Performance Analysis
- On desktop manipulation, MAESTRO significantly outperformed existing VLA models on six of seven tasks, particularly those requiring semantic reasoning and long-term memory [17]
- On mobile manipulation, MAESTRO achieved high completion rates, with specific tasks scoring 96.0±8.9 and 93.3±14.9 [17]
- Its evolution capability was highlighted by a door-opening task whose completion rate rose from 35% to 85.0±7.4 after three iterations [17]

Group 5: Key Module Ablation Analysis
- Removing the advanced perception modules drastically reduced task completion rates, indicating the importance of precise perception for complex operations [20]
- Removing the geometry modules also hurt performance, underscoring the necessity of spatial reasoning tools [20]

Group 6: Future Directions
- MAESTRO is positioned as an effective alternative to large-scale robot-training pipelines; future work targets faster VLM inference, stronger low-level control, and more stable reasoning in complex scenarios [22]
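The evolution mechanism — recording past execution code and outcomes, then surfacing successful ones as in-context examples for the VLM — can be sketched as a small experience memory. The data layout and the word-overlap retrieval rule below are illustrative assumptions, not MAESTRO's actual design.

```python
# Sketch of an experience memory for the evolution mechanism described above:
# store (task, code, outcome) triples and retrieve successful past codes as
# in-context examples. Retrieval rule and field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    code: str
    success: bool

@dataclass
class ExperienceMemory:
    episodes: list = field(default_factory=list)

    def record(self, task: str, code: str, success: bool) -> None:
        self.episodes.append(Episode(task, code, success))

    def examples_for(self, task: str, k: int = 2) -> list:
        """Up to k successful past codes whose tasks share words with `task`."""
        words = set(task.lower().split())
        scored = [(len(words & set(e.task.lower().split())), e)
                  for e in self.episodes if e.success]
        scored.sort(key=lambda s: -s[0])
        return [e.code for overlap, e in scored[:k] if overlap > 0]

mem = ExperienceMemory()
mem.record("open the door", "door_plan_v1", True)
mem.record("open the drawer", "drawer_plan_fail", False)
mem.record("wipe the table", "wipe_plan_v1", True)
```

Filtering out failed episodes before retrieval is the key design choice: the VLM is conditioned only on code that actually worked, which is consistent with the reported jump from 35% to 85.0±7.4 after a few iterations.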
Lessons from Switching Industries into a Major Autonomous Driving Company
自动驾驶之心· 2025-11-04 00:03
Core Insights
- The article emphasizes the importance of seizing opportunities and continuous learning in the rapidly evolving field of autonomous driving [1][4]
- It highlights the creation of a comprehensive community platform, "Autonomous Driving Heart Knowledge Planet," aimed at facilitating knowledge sharing and career development in the autonomous driving sector [4][16]

Group 1: Career Development
- Transitioning into the autonomous driving industry can succeed through dedication and preparation, as illustrated by the experience of a professional who switched careers and excelled in various roles [1]
- Continuous learning and adapting to industry trends are crucial for career advancement, as demonstrated by that professional's progression from algorithm evaluation to advanced safety algorithms [1]

Group 2: Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" has over 4,000 members and aims to grow to nearly 10,000 in two years, providing a platform for discussion, technical sharing, and job opportunities [4][16]
- The community offers a variety of resources, including video content, learning pathways, and Q&A sessions, supporting both beginners and advanced learners [7][10]

Group 3: Technical Learning and Networking
- The community organizes discussions with industry experts on topics including entry points for end-to-end autonomous driving and multi-sensor fusion [8][20]
- Members have access to over 40 technical pathways and numerous datasets relevant to autonomous driving [10][36]

Group 4: Job Opportunities
- The community facilitates job referrals and connections with leading companies in the autonomous driving sector, improving members' chances of securing positions [11][12]
- Regular updates on job openings and industry trends help members stay informed about potential career advancements [21][93]
World Models == VQA? Robots Don't Need to Imagine Images; Predicting Semantics Is Enough
机器之心· 2025-10-28 00:41
Core Insights
- The article questions whether world models need precise future predictions at the pixel level, asking if detailed visual representations are essential for decision-making [1][6]
- It introduces the Semantic World Model (SWM), which predicts semantic information about future outcomes rather than generating visual frames [9][18]

Summary by Sections

World Models and Their Limitations
- World models enable AI to learn the dynamics of the world and predict future events from current states [6]
- Traditional models often generate realistic images yet miss the critical semantic details needed for decision-making [7][8]

Semantic World Model (SWM)
- SWM reframes world modeling as a visual question-answering (VQA) problem, focusing on task-relevant interactions rather than raw visual data [8][9]
- SWM uses a vision-language model (VLM) to answer questions about future actions and their semantic effects [9][11]

Training and Data Generation
- SWM can be trained on low-quality sequence data, including both expert and non-expert trajectories, making it versatile [15]
- A dataset called SAQA (State-Action-Question-Answer) is generated to train the model effectively [22]

Experimental Results
- SWM answered questions about future outcomes with high accuracy and showed generalization to new scenarios [17]
- In multi-task simulations, SWM significantly outperformed baseline models, achieving success rates of 81.6% on LangTable and 76% on OGBench [30][34]

Generalization and Robustness
- SWM retains the generalization capabilities of the underlying VLM, improving performance even under new object combinations and background changes [39][41]
- The model's attention focuses on task-relevant information, indicating an ability to generalize across scenarios [41]
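The SAQA idea above can be made concrete with a minimal record and prompt format: instead of predicting future pixels, the model answers a semantic question about what an action would cause. The field names and prompt template below are assumptions for illustration, not the paper's actual schema.

```python
# Sketch of a State-Action-Question-Answer (SAQA) record and how a semantic
# world model might be queried VQA-style. Field names and the prompt template
# are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SAQARecord:
    state_image: str   # path/id of the current observation
    action: str        # candidate action, e.g. "push the red block left"
    question: str      # semantic query about the future
    answer: str        # ground-truth semantic outcome

def make_vqa_prompt(rec: SAQARecord) -> str:
    """Format one record as a VQA-style prompt for a VLM-based world model."""
    return (f"Observation: <image:{rec.state_image}>\n"
            f"If the robot executes: {rec.action}\n"
            f"Question: {rec.question}\nAnswer:")

rec = SAQARecord("frame_0421", "push the red block left",
                 "will the red block touch the blue bowl?", "yes")
prompt = make_vqa_prompt(rec)
```

Training on (prompt, answer) pairs like this turns the world model into a language-space predictor, which is why SWM can tolerate low-quality trajectories: only the semantic outcome of each action needs to be labeled, not the visual future.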
After Hosting Several Online Exchange Sessions, I Realized Everyone Is Still Quite Lost
自动驾驶之心· 2025-10-24 00:04
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community called "Autonomous Driving Heart Knowledge Planet," aimed at providing a platform for knowledge sharing and networking in the autonomous driving industry and addressing the challenges newcomers face [1][3][14]

Group 1: Community Development
- The community has grown to over 4,000 members and aims to reach nearly 10,000 within two years, providing a space for technical sharing among beginners and advanced learners [3][14]
- It integrates videos, articles, learning paths, Q&A, and job exchange, making it a comprehensive hub for autonomous driving enthusiasts [3][5]

Group 2: Learning Resources
- The community has organized over 40 technical learning paths covering topics such as end-to-end autonomous driving, multi-modal large models, and data annotation practice, significantly reducing research ramp-up time [5][14]
- Members can access a variety of video tutorials and beginner courses covering essential topics in autonomous driving technology [9][15]

Group 3: Industry Insights
- Industry experts are regularly invited to discuss trends, technological advances, and production challenges in autonomous driving, fostering a serious, content-driven environment [6][14]
- Members are encouraged to engage with industry leaders for insights on job opportunities and career development [10][18]

Group 4: Networking Opportunities
- The community connects members with various autonomous driving companies and offers resume-forwarding services to help members secure job placements [10][12]
- Members can freely ask questions about career choices and research directions, receiving guidance from experienced professionals in the field [87][89]
Execution Is the Lifeblood of Autonomous Driving Today
自动驾驶之心· 2025-10-17 16:04
Core Viewpoint
- The article discusses the evolving landscape of China's autonomous driving industry, highlighting shifting competitive dynamics and increasing investment in autonomous driving technologies as a core focus of AI development [1][2]

Industry Trends
- The sector has changed significantly over the past two years, with new players entering the market and existing companies focusing on improving execution [1]
- Before 2022 the industry enjoyed a flourishing period in which companies with standout technologies could thrive; it has since shifted to a more competitive environment that emphasizes addressing weaknesses [1]
- Companies still active in the market are progressively enhancing their hardware, software, AI capabilities, and engineering implementation in order to survive and excel [1]

Future Outlook
- By 2025 the industry is expected to enter a "calm period," in which unresolved technical challenges around L3, L4, and Robotaxi will continue to present opportunities for professionals [2]
- The article emphasizes comprehensive skill sets, suggesting that those with a short-term profit mindset may not endure in the long run [2]

Community and Learning Resources
- The "Autonomous Driving Heart Knowledge Planet" community provides a comprehensive platform for learning and sharing in the autonomous driving field, with over 4,000 members and a goal of nearly 10,000 within the next two years [4][17]
- The community offers video content, learning pathways, Q&A sessions, and job exchange opportunities, catering to both beginners and advanced learners [4][6][18]
- Members can access detailed technical routes and practical solutions for various autonomous driving challenges, significantly reducing the time needed for research and learning [6][18]

Technical Focus Areas
- The community has compiled over 40 technical routes covering areas such as end-to-end learning, multi-modal models, and various simulation platforms [18][39]
- There is strong emphasis on practical applications, with resources for data processing, 4D labeling, and engineering practice in autonomous driving [12][18]

Job Opportunities
- The community facilitates job opportunities by connecting members with openings at leading autonomous driving companies, providing a platform for resume submissions and internal referrals [13][22]
Suddenly Noticed: The New Players Are Lining Up for IPOs......
自动驾驶之心· 2025-10-06 04:05
Group 1
- The article highlights a surge of IPO activity in the autonomous driving sector, signaling a significant shift in the industry landscape as new players enter the market [1][2]
- Key events include the acquisition of Shenzhen Zhuoyu Technology by China First Automobile Works, Wayve's partnership with NVIDIA for a $500 million investment, and multiple companies filing for IPOs or completing strategic investments [1]
- Competition in the autonomous driving field is intense, and many companies are pivoting toward embodied AI in response to market saturation [1][2]

Group 2
- Comprehensive skill sets matter for professionals remaining in the industry, as the market is expected to undergo significant restructuring [2]
- The article mentions the "Autonomous Driving Heart Knowledge Planet" community platform, which provides resources and networking opportunities for people interested in the field [3][19]
- The community offers video tutorials, technical discussions, and job-placement assistance, catering to both beginners and experienced professionals [4][11][22]

Group 3
- The community has gathered over 4,000 members and aims to expand to nearly 10,000 within two years, focusing on knowledge sharing and technical collaboration [3][19]
- It provides structured learning paths and resources on topics including end-to-end learning, multi-sensor fusion, and real-time applications [19][39]
- The platform also hosts discussions on industry trends, job opportunities, and technical challenges, fostering a collaborative environment for knowledge exchange [20][91]