Still not sure how to start a paper in the VLA direction? Some students already have CCF-A papers...
自动驾驶之心· 2025-08-22 12:00
Core Insights
- The article discusses the advancements of the Li Auto VLA driver model, highlighting its improved capabilities in understanding semantics, reasoning, and trajectory planning, which are crucial for autonomous driving [1][3][5]

Group 1: VLA Model Capabilities
- The VLA model demonstrates enhanced semantic understanding through multimodal input, improved reasoning via thinking chains, and a closer approximation to human driving intuition through trajectory planning [1]
- Four core abilities of the VLA model are showcased: spatial understanding, reasoning ability, communication and memory capability, and behavioral ability [1][3]

Group 2: Research and Development Trends
- The VLA model has evolved from VLM+E2E, integrating various cutting-edge technologies such as end-to-end learning, trajectory prediction, visual language models, and reinforcement learning [5]
- While traditional perception and planning tasks are still being optimized in the industry, the academic community is increasingly shifting focus towards large models and VLA, indicating a wealth of subfields still open for exploration [5]

Group 3: VLA Research Guidance Program
- A VLA research paper guidance program has been initiated, receiving positive feedback, aimed at helping participants systematically grasp key theoretical knowledge and develop their own research ideas [6]
- The program includes a structured curriculum over 14 weeks, covering topics from traditional end-to-end autonomous driving to writing methodologies for research papers [9][11][30]

Group 4: Course Structure and Requirements
- The course is designed for a maximum of 8 participants per session, targeting individuals with a background in VLA and autonomous driving at various academic levels [12][15]
- Participants are expected to have a foundational understanding of deep learning, Python programming, and familiarity with PyTorch, with specific hardware requirements suggested for optimal performance [21][22]

Group 5: Expected Outcomes
- Participants will gain insights into classic and cutting-edge research papers, coding skills, and methodologies for writing and submitting research papers, culminating in the production of a draft paper [20][34]
- The program aims to enhance participants' understanding of algorithms, their advantages and disadvantages, and to stimulate their research ideas through structured guidance [20][34]
When an AI company founded 11 years ago enters the embodied intelligence battlefield
36Kr· 2025-08-19 10:12
Core Insights
- The article highlights that the year is recognized as the "Year of Embodied Intelligence," with the field becoming a hotbed for AI applications. YuFan Intelligent, a well-known visual AI company, has launched two embodied intelligence products and announced a full-stack self-research approach to embrace this new era [1][3].

Group 1: Company Strategy and Product Launch
- YuFan Intelligent has officially entered the embodied intelligence sector by launching two products: the spatial cognition model Manas and a quadruped robot, marking a significant strategic shift for the company [3][4].
- The spatial cognition model Manas is a multimodal large language model (MLLM) that has achieved state-of-the-art results on industry-standard datasets, positioning it as the brain for YuFan's embodied intelligence hardware [3][14].
- The quadruped robot represents YuFan's first foray into embodied intelligent robotics, with all mechanical structures and control platforms developed in-house [4][17].

Group 2: Technological Foundations and Capabilities
- YuFan's past experience in hardware and software integration has equipped the company to tackle the challenges of embodied intelligence, which requires seamless collaboration between hardware and AI algorithms [1][22].
- The company has developed a multimodal reasoning architecture, UUMM, which adapts large language model structures for embodied intelligence applications, enabling the integration of human language and visual inputs (a generic adapter pattern is sketched after this summary) [16][18].
- The MLLM model Manas has shown exceptional performance in spatial understanding benchmarks, indicating YuFan's readiness to advance in the embodied intelligence domain [17][19].

Group 3: Market Context and Competitive Landscape
- The entry of YuFan into the embodied intelligence market aligns with broader industry trends, as major players are increasingly integrating multimodal models into their hardware to enhance intelligence [6][7].
- The current landscape of embodied intelligence is characterized by diverse technological routes and a lack of standardized hardware, making it essential for companies to consider hardware factors in algorithm development [18][20].
- YuFan's established experience in the visual AI sector and its robust supply chain and productization capabilities position it well to compete in the rapidly evolving embodied intelligence market [23][24].
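The UUMM description above follows a common multimodal-LLM recipe: visual features are projected into the language model's token space and concatenated with the instruction tokens. The sketch below illustrates that generic adapter pattern only; it is an assumption-based example, not YuFan's UUMM architecture.

```python
# Generic MLLM adapter pattern, for illustration only (not YuFan's UUMM):
# per-patch visual features are projected into the LLM embedding space and
# prepended to the embedded instruction tokens.
import torch
import torch.nn as nn

class VisionToTokenAdapter(nn.Module):
    """Projects per-patch image features into the language model's embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_feats):            # (B, num_patches, vision_dim)
        return self.proj(patch_feats)          # (B, num_patches, llm_dim)

adapter = VisionToTokenAdapter()
visual_tokens = adapter(torch.randn(1, 196, 1024))   # e.g. 14x14 image patches
text_tokens = torch.randn(1, 12, 768)                # embedded instruction tokens
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 208, 768]); this sequence feeds the language backbone
```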
Evaluating the performance and boundaries of generalist policies like π0 in complex real-world scenarios
自动驾驶之心· 2025-08-17 03:23
Core Viewpoint
- The article discusses the evaluation of the π0-FAST-DROID model in real-world scenarios, highlighting its potential and limitations in robotic operations, particularly in handling new objects and tasks without extensive prior training [4][10][77].

Evaluation Method
- The evaluation utilized the π0-FAST-DROID model, specifically fine-tuned for the DROID robot platform, which includes a Franka Panda robot equipped with cameras [5][10].
- The assessment involved over 300 trials across various tasks, focusing on the model's ability to perform in diverse environments, particularly in a kitchen setting [10][11].

Findings
- The model demonstrated a strong prior toward reasonable behavior, often producing intelligent actions, but these were not always sufficient to complete tasks [11].
- Prompt engineering was crucial, as variations in task descriptions significantly affected success rates, indicating the need for clear and structured prompts (see the tally sketch after this summary) [12][59].
- The model exhibited impressive visual-language understanding and could mimic continuous actions across different scenarios [13][28].

Performance in Complex Scenarios
- The model showed robust performance in recognizing and manipulating transparent objects, which is a significant challenge for traditional methods [20][27].
- It maintained focus on tasks despite human movement in the background, suggesting effective prioritization of relevant visual inputs [25].

Limitations
- The model faced challenges with semantic ambiguity and often froze during tasks, particularly when it encountered unfamiliar commands or objects [39][42].
- It lacked memory, which hindered its ability to perform multi-step tasks effectively, leading to premature task completion or freezing [43][32].
- The model struggled with precise spatial reasoning, particularly in estimating distances and heights, which resulted in failures during object manipulation tasks [48][50].

Task-Specific Performance
- The model's performance varied across task categories, with notable success in simple tasks but significant challenges in complex operations like pouring liquids and interacting with household appliances [89][91][100].
- For instance, it achieved a 73.3% progress rate in pouring toy items but only 20% when dealing with real liquids, indicating limitations in physical capabilities [90].

Conclusion
- The evaluation indicates that while π0 shows promise as a generalist policy in robotic applications, it still requires significant improvements in instruction adherence, fine manipulation, and handling partial observability [77][88].
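The progress-rate and prompt-sensitivity findings above lend themselves to a simple tallying harness. The sketch below is illustrative only: the task names, prompt phrasings, and partial-credit scores are invented placeholders, not data or code from the original evaluation.

```python
# Illustrative tally of per-task progress rates across prompt variants, in the
# spirit of the 300+ trial evaluation summarized above. All entries are
# hypothetical examples.
from collections import defaultdict
from statistics import mean

# Each trial records the task, the exact prompt used, and a progress score in
# [0, 1] (1.0 = full success, fractional values = partial completion).
trials = [
    {"task": "pour_toy_items",   "prompt": "pour the toy beads into the bowl", "progress": 0.8},
    {"task": "pour_toy_items",   "prompt": "empty the cup into the bowl",      "progress": 0.6},
    {"task": "pour_real_liquid", "prompt": "pour the water into the glass",    "progress": 0.2},
    {"task": "pick_transparent", "prompt": "pick up the clear plastic cup",    "progress": 1.0},
]

def progress_by(records, key):
    """Group trials by a field ('task' or 'prompt') and average their progress."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["progress"])
    return {k: mean(v) for k, v in groups.items()}

if __name__ == "__main__":
    for task, rate in progress_by(trials, "task").items():
        print(f"{task}: {rate:.1%} average progress")
    # Running the same breakdown by "prompt" shows how wording alone can shift
    # the measured success rate, which is the prompt-engineering effect noted above.
```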
The second session of the VLA and autonomous driving research paper guidance program is here~
自动驾驶之心· 2025-08-16 12:00
Core Insights
- The article discusses the recent advancements in the Li Auto VLA driver model, highlighting its improved capabilities in understanding semantics, reasoning, and trajectory planning, which are crucial for autonomous driving [1][3].

Group 1: VLA Model Capabilities
- The VLA model's enhancements focus on four core abilities: spatial understanding, reasoning, communication and memory, and behavioral capabilities [1].
- The reasoning and communication abilities are derived from language models, with memory capabilities utilizing RAG (a minimal retrieval sketch follows this summary) [3].

Group 2: Research and Development Trends
- The VLA model has evolved from VLM+E2E, incorporating various cutting-edge technologies such as end-to-end learning, trajectory prediction, visual language models, and reinforcement learning [5].
- While traditional perception and planning tasks are still being optimized in the industry, the academic community is increasingly shifting towards large models and VLA, indicating a wealth of subfields still open for research [5].

Group 3: VLA Research Guidance Program
- A VLA research paper guidance program has been initiated, aimed at helping participants systematically grasp key theoretical knowledge and develop their own research ideas [6].
- The program includes a structured 12-week online group research course followed by 2 weeks of paper guidance and a 10-week maintenance period for paper development [14][34].

Group 4: Course Structure and Content
- The course covers various topics over 14 weeks, including traditional end-to-end autonomous driving, VLA end-to-end models, and writing methodologies for research papers [9][11][35].
- Participants will gain insights into classic and cutting-edge papers, coding skills, and methods for writing and submitting research papers [20][34].

Group 5: Enrollment and Requirements
- The program is limited to 6-8 participants per session, targeting individuals with a background in deep learning and basic knowledge of autonomous driving algorithms [12][15].
- Participants are expected to have a foundational understanding of Python and PyTorch, with access to high-performance computing resources recommended [21].
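For the RAG-based memory mentioned in Group 1, the sketch below shows one minimal way a retrieval-augmented driving memory could work: past scene descriptions are embedded, and the closest matches are retrieved to condition the current decision. The embedding function, memory entries, and interface are hypothetical stand-ins, not Li Auto's implementation.

```python
# Minimal RAG-style memory sketch (illustrative assumptions throughout):
# embed past scene descriptions, retrieve the most similar ones for reuse.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a normalized character-frequency vector.
    A real system would use a learned text/scene encoder instead."""
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

class DrivingMemory:
    def __init__(self):
        self.entries = []  # list of (embedding, description) pairs

    def add(self, description: str) -> None:
        self.entries.append((embed(description), description))

    def retrieve(self, query: str, k: int = 2):
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: -float(q @ e[0]))
        return [text for _, text in scored[:k]]

memory = DrivingMemory()
memory.add("narrow school-zone street, children crossing, kept speed under 30 km/h")
memory.add("unprotected left turn at busy intersection, yielded to oncoming traffic")

# Retrieved snippets would be appended to the language-model context so the
# planner can reuse prior experience when a similar scene recurs.
print(memory.retrieve("approaching a school zone with pedestrians", k=1))
```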
Global industrial robot market cools, with China's counter-trend growth the biggest bright spot
Di Yi Cai Jing· 2025-08-10 01:23
Core Viewpoint
- The industrial robot market is facing challenges in 2024, with a global decline in new installations and significant regional disparities, particularly highlighting China's growth amidst a global downturn [3][4].

Group 1: Global Market Trends
- In 2023, global industrial robot installations decreased by 3% to approximately 523,000 units, with Asia down 2%, Europe down 6%, and the Americas down 9% [3].
- The automotive industry has seen a significant decline, while the electronics sector experienced slight growth. Other industries such as metals, machinery, plastics, chemicals, and food are in a growth phase [3].

Group 2: China's Market Performance
- China is projected to install around 290,000 industrial robots in 2024, marking a 5% increase and raising its global market share from 51% in 2023 to 54% [3].
- The structure of installations has shifted, with general industrial applications rising from 38% five years ago to 53%, while the electronics sector's share dropped from 45% to 28% [3][4].
- China has remained the largest industrial robot market globally for 12 consecutive years, with sales expected to reach 302,000 units in 2024 [4].

Group 3: Regional Comparisons
- Japan's industrial robot installations fell by 7% to 43,000 units, with only the automotive sector showing an 11% increase [6].
- The U.S. market shrank by 9%, with the automotive sector contributing nearly 40% of installations [6].
- Europe experienced a 6% decline but still achieved its second-highest installation level on record at 86,000 units, with the plastics, chemicals, and food industries emerging as new growth areas [6].

Group 4: Industry Innovations and Future Trends
- The integration of artificial intelligence and advancements in digital twin technology are expected to enhance human-robot interaction and reshape production processes [6].
- The logistics and material handling sectors are anticipated to be early adopters of humanoid robots, with construction, laboratory automation, and warehousing also accelerating robot penetration [6].
Global industrial robot market cools, with China's counter-trend growth the biggest bright spot
Di Yi Cai Jing· 2025-08-09 07:17
Core Insights
- 2024 is expected to be a challenging year for the industrial robotics sector; global new installations had already declined by 3% to approximately 523,000 units in 2023 [1].
- Major markets in Asia, Europe, and the Americas all experienced downturns, with Asia down 2%, Europe down 6%, and the Americas down 9% [1].
- China stands out as the only bright spot, with an expected growth of 5% in new installations, reaching around 290,000 units in 2024 and increasing its global market share from 51% in 2023 to 54% [1][2].

Market Performance
- The electronics and automotive sectors have been the leading industries for industrial robots since 2020, with the electronics sector showing slight growth while the automotive sector faced significant declines [1].
- In China, the industrial robot market is projected to reach 302,000 units in 2024, maintaining its position as the largest industrial robot market globally for 12 consecutive years [2].
- Japan's industrial robot installations fell by 7% to 43,000 units, while the U.S. market shrank by 9%, with the automotive sector contributing nearly 40% of installations [4].

Regional Analysis
- China is the world's largest producer of industrial robots, with production increasing from 33,000 units in 2015 to 556,000 units in 2024, and service robots reaching 10.5 million units, a 34.3% year-on-year growth [2].
- China's robot density is 470 units per 10,000 workers, surpassing Japan and Germany, while South Korea and Singapore lead at 1,012 and 770 units respectively [4].
- Despite geopolitical tensions, the outlook for Asia remains positive, with single-digit growth in industry orders forecast for Q1 2025 and a mild recovery expected in the electronics sector [4].

Industry Trends
- The robotics industry is increasingly focusing on the integration of artificial intelligence, with advancements in digital twin technology and enhanced human-machine interaction capabilities [4].
- Key areas for early adoption of robotics include logistics and material handling, with construction, laboratory automation, and warehousing also seeing accelerated penetration [4].
Global industrial robot market cools, with China's counter-trend growth the biggest bright spot
Di Yi Cai Jing· 2025-08-09 07:13
Core Insights
- The global industrial robot market saw new installations decline in 2023, dropping 3% to approximately 523,000 units and affecting the major markets in Asia, Europe, and the Americas [1][4].
- China remains the only bright spot in the market, with an expected 5% growth in new installations for 2024, reaching around 290,000 units and increasing its global market share from 51% in 2023 to 54% [1][2].
- The structure of the market is changing, with general industrial applications increasing their share from 38% five years ago to 53%, while the electronics sector's share has decreased from 45% to 28% [1].

Regional Performance
- Japan's industrial robot installations fell by 7% to 43,000 units, with only the automotive sector showing an 11% growth [4].
- The U.S. market shrank by 9%, with the automotive industry contributing nearly 40% of installations [4].
- Europe experienced a 6% decline but still achieved its second-highest installation level on record at 86,000 units, with the plastics, chemicals, and food sectors emerging as new growth areas [4].

Industry Trends
- The density of industrial robots per 10,000 workers indicates varying levels of automation, with South Korea (1,012 units), Singapore (770 units), and China (470 units) leading the way, China having surpassed Japan and Germany [4].
- Despite geopolitical tensions and tariff disputes, the Asian market is expected to see growth, with a mild recovery in the electronics sector anticipated in early 2025 [4].
- Future trends in the robotics industry include a focus on AI integration, advancements in digital twin technology, and improvements in human-robot interaction through visual language models [4].
Performance up 30%! CUHK's ReAL-AD: an end-to-end algorithm with human-like reasoning (ICCV'25)
自动驾驶之心· 2025-08-03 23:32
Core Viewpoint
- The article discusses the ReAL-AD framework, which integrates human-like reasoning into end-to-end autonomous driving systems, enhancing decision-making processes through a structured approach that mimics human cognitive functions [3][43].

Group 1: Framework Overview
- ReAL-AD employs a reasoning-enhanced learning framework based on a three-layer human cognitive model: driving strategy, decision-making, and operation [3][5].
- The framework incorporates a visual-language model (VLM) to improve environmental perception and structured reasoning capabilities, allowing for a more nuanced decision-making process [3][5].

Group 2: Components of ReAL-AD
- The framework consists of three main components (a minimal sketch of this three-stage pipeline follows the summary):
  1. Strategic Reasoning Injector: utilizes the VLM to generate insights for complex traffic situations, forming high-level driving strategies [5][11].
  2. Tactical Reasoning Integrator: converts strategic intentions into executable tactical choices, bridging the gap between strategy and operational decisions [5][14].
  3. Hierarchical Trajectory Decoder: simulates human decision-making by establishing rough motion patterns before refining them into detailed trajectories [5][20].

Group 3: Performance Evaluation
- In open-loop evaluations, ReAL-AD demonstrated significant improvements over baseline methods, achieving over 30% better performance in L2 error and collision rates [36].
- The framework achieved the lowest average L2 error of 0.48 meters and a collision rate of 0.15% on the nuScenes dataset, indicating enhanced learning efficiency in driving capabilities [36].
- Closed-loop evaluations showed that the introduction of the ReAL-AD framework significantly improved driving scores and successful path completions compared to baseline models [37].

Group 4: Experimental Setup
- The evaluation utilized the nuScenes dataset, which includes 1,000 scenes sampled at 2 Hz, and the Bench2Drive dataset, covering 44 scenarios and 23 weather conditions [34].
- Metrics for evaluation included L2 error, collision rates, driving scores, and success rates, providing a comprehensive assessment of the framework's performance [35][39].

Group 5: Ablation Studies
- Removing the Strategic Reasoning Injector led to a 12% increase in average L2 error and a 19% increase in collision rates, highlighting its importance in guiding decision-making [40].
- The Tactical Reasoning Integrator was shown to reduce average L2 error by 0.14 meters and collision rates by 0.05%, emphasizing the value of tactical commands in planning [41].
- Replacing the Hierarchical Trajectory Decoder with a multi-layer perceptron resulted in increased L2 error and collision rates, underscoring the necessity of a hierarchical decoding process for trajectory prediction [41].
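Below is a minimal PyTorch-style sketch of the three-stage hierarchy (strategy, tactics, trajectory) described in Group 2. The module names mirror the paper's components, but all dimensions, the attention-based fusion, and the coarse-then-refine decoding are assumptions made for illustration; this is not the authors' released code.

```python
# Illustrative three-stage hierarchy: strategic intent -> tactical cues -> trajectory.
import torch
import torch.nn as nn

class StrategicReasoningInjector(nn.Module):
    """Maps VLM-generated strategy features into a high-level driving intent."""
    def __init__(self, vlm_dim=512, d=256):
        super().__init__()
        self.proj = nn.Linear(vlm_dim, d)

    def forward(self, vlm_feats):                       # (B, vlm_dim)
        return torch.relu(self.proj(vlm_feats))         # (B, d) strategic intent

class TacticalReasoningIntegrator(nn.Module):
    """Fuses strategic intent with scene features into executable tactical cues."""
    def __init__(self, d=256):
        super().__init__()
        self.fuse = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, intent, scene_tokens):            # (B, d), (B, N, d)
        q = intent.unsqueeze(1)                          # strategy queries the scene
        tactics, _ = self.fuse(q, scene_tokens, scene_tokens)
        return tactics.squeeze(1)                        # (B, d)

class HierarchicalTrajectoryDecoder(nn.Module):
    """First predicts a rough motion pattern, then refines it into waypoints."""
    def __init__(self, d=256, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.coarse = nn.Linear(d, horizon * 2)                  # rough (x, y) pattern
        self.refine = nn.Linear(d + horizon * 2, horizon * 2)    # residual refinement

    def forward(self, tactics):
        rough = self.coarse(tactics)
        fine = self.refine(torch.cat([tactics, rough], dim=-1))
        return (rough + fine).view(-1, self.horizon, 2)          # (B, horizon, 2) waypoints

# Wiring the stages together on dummy tensors:
B, N = 2, 32
injector, integrator, decoder = (StrategicReasoningInjector(),
                                 TacticalReasoningIntegrator(),
                                 HierarchicalTrajectoryDecoder())
traj = decoder(integrator(injector(torch.randn(B, 512)), torch.randn(B, N, 256)))
print(traj.shape)  # torch.Size([2, 6, 2])
```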
Autonomous driving: plenty of job openings on one side, yet no one to hire on the other. How surreal...
自动驾驶之心· 2025-07-26 02:39
Core Viewpoint
- The autonomous driving industry is experiencing a paradox in which job vacancies exist alongside a scarcity of suitable talent, leading to a cautious hiring environment as companies prioritize financial sustainability and effective business models over rapid expansion [2][3].

Group 1: Industry Challenges
- Many companies possess a seemingly complete technology stack (perception, control, prediction, mapping, data closure), yet they still face significant challenges in achieving large-scale, low-cost, and high-reliability commercialization [3].
- The gap between "laboratory results" and "real-world performance" remains substantial, indicating that practical application of the technology is still a work in progress [3].

Group 2: Talent Acquisition
- Companies are not necessarily unwilling to hire; rather, they have an unprecedented demand for "top talent" and "highly compatible talent" in the autonomous driving sector [4].
- The industry is shifting towards a more selective hiring process, focusing on candidates with strong technical skills and relevant experience in cutting-edge research and production [3][4].

Group 3: Community and Resources
- The "Autonomous Driving Heart Knowledge Planet" is the largest community for autonomous driving technology in China, established to provide industry insights and facilitate talent development [9].
- The community has nearly 4,000 members, includes over 100 experts in the autonomous driving field, and offers various learning pathways and resources [7][9].

Group 4: Learning and Development
- The community emphasizes the importance of continuous learning and networking, providing a platform for newcomers to quickly gain knowledge and for experienced practitioners to enhance their skills and connections [10].
- The platform includes comprehensive learning routes covering nearly all subfields of autonomous driving technology, such as perception, mapping, and AI model deployment [9][12].
ICCV'25 | HUST proposes HERMES: the first unified driving world model!
自动驾驶之心· 2025-07-25 10:47
Core Viewpoint
- The article introduces HERMES, a unified driving world model that integrates 3D scene understanding and future scene generation, reducing generation errors by 32.4% compared to existing methods [4][17].

Group 1: Model Overview
- HERMES addresses the fragmentation in existing driving world models by combining scene generation and scene understanding capabilities [3].
- The model utilizes a BEV (Bird's Eye View) representation to integrate multi-view spatial information and introduces a "world query" mechanism to enhance scene generation with world knowledge (a sketch of both ideas follows this summary) [3][4].

Group 2: Challenges and Solutions
- The model overcomes the challenge of multi-view spatiality by employing a BEV-based world tokenizer, which compresses multi-view images into BEV features, preserving key spatial information while adhering to token length limitations [5].
- To integrate understanding and generation, HERMES introduces world queries that enrich the generated scenes with world knowledge, bridging the gap between the two tasks [8].

Group 3: Performance Metrics
- HERMES demonstrates superior performance on the nuScenes and OmniDrive-nuScenes datasets, achieving an 8.0% improvement in the CIDEr metric for understanding tasks and significantly lower Chamfer distances in generation tasks [4][17].
- The world query mechanism contributes to a 10% reduction in Chamfer distance for 3-second point cloud predictions, showcasing its effectiveness in enhancing generation performance [20].

Group 4: Experimental Validation
- The experiments utilized the nuScenes, NuInteract, and OmniDrive-nuScenes datasets, employing metrics such as METEOR, CIDEr, and ROUGE for understanding tasks, and Chamfer distance for generation tasks [19].
- Ablation studies confirm the importance of the interaction between understanding and generation, with the unified framework outperforming separate training methods [18].

Group 5: Qualitative Results
- HERMES is capable of accurately generating future point cloud evolutions and understanding complex scenes, although challenges remain in scenarios involving complex turns, occlusions, and nighttime conditions [24].
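Below is an illustrative sketch of the two mechanisms summarized above: a BEV world tokenizer that pools multi-view features into a fixed token budget, and learnable world queries that carry knowledge from the shared backbone into a future-scene head. All shapes, module choices, and output heads are assumptions for exposition, not the HERMES implementation.

```python
# Illustrative only: BEV tokenization of multi-view features plus "world queries"
# feeding separate understanding and generation heads.
import torch
import torch.nn as nn

class BEVWorldTokenizer(nn.Module):
    """Pools multi-view image features into a compact grid of BEV tokens so the
    sequence stays within an LLM-style token budget."""
    def __init__(self, feat_dim=256, bev_tokens=64, d=256):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_tokens, d))
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, d)

    def forward(self, multi_view_feats):                  # (B, V*N, feat_dim)
        kv = self.proj(multi_view_feats)
        q = self.bev_queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        bev, _ = self.attn(q, kv, kv)
        return bev                                        # (B, bev_tokens, d)

class UnifiedDrivingWorldModel(nn.Module):
    """A shared backbone consumes BEV tokens plus learnable world queries; the
    enriched world queries then condition a future-scene head while the BEV
    tokens feed an understanding (captioning) head."""
    def __init__(self, d=256, n_world=16, future_dim=2048, vocab=32000):
        super().__init__()
        self.world_queries = nn.Parameter(torch.randn(n_world, d))
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.caption_head = nn.Linear(d, vocab)                 # understanding branch
        self.future_head = nn.Linear(n_world * d, future_dim)   # generation branch

    def forward(self, bev_tokens):
        B, n_bev = bev_tokens.size(0), bev_tokens.size(1)
        wq = self.world_queries.unsqueeze(0).expand(B, -1, -1)
        x = self.backbone(torch.cat([bev_tokens, wq], dim=1))
        bev_out, wq_out = x[:, :n_bev], x[:, n_bev:]
        return self.caption_head(bev_out), self.future_head(wq_out.flatten(1))

tokenizer, model = BEVWorldTokenizer(), UnifiedDrivingWorldModel()
caption_logits, future_scene = model(tokenizer(torch.randn(2, 6 * 100, 256)))
print(caption_logits.shape, future_scene.shape)  # (2, 64, 32000), (2, 2048)
```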