Vision-Language Models
Baidu Vision Technology Department Hiring for Multimodal Perception and Understanding (Experienced Hires / Campus / Interns)
自动驾驶之心· 2025-09-03 23:33
Core Viewpoint
- The article focuses on recruitment opportunities in video understanding and artificial intelligence, outlining the responsibilities and requirements for several positions within the company [2][4][5].

Recruitment Responsibilities
- The company is looking for candidates to conduct cutting-edge algorithm research and development for video understanding, targeting tasks such as video question answering, video summarization, temporal action localization, and event detection [2].
- Responsibilities also include building large-scale, high-quality multimodal datasets, distributed training of large models, and collaborating with business teams on practical applications and innovation [2].

Job Requirements
- Candidates should hold a master's or doctoral degree in computer science, artificial intelligence, electronic information, automation, or a related field [4].
- Publications at top AI conferences or in leading journals are preferred, particularly in areas such as computer vision and multimodal learning [5].

Advantages of Joining
- The company offers a supportive environment with ample headcount for new graduates, interns, and experienced hires, along with competitive salaries and benefits such as mentorship and participation in significant projects [6].

Community and Resources
- The article also mentions a community platform for job seekers in autonomous driving and robotics, providing resources such as interview questions, industry reports, and salary negotiation tips [7][19].
Apple's FastVLM Vision-Language Model Opens for Trial: Video Caption Generation Up to 85x Faster
Huan Qiu Wang Zi Xun· 2025-09-02 04:07
Core Insights
- Apple has released a vision-language model called FastVLM, which is now available on the Hugging Face platform [1][2].

Group 1: Model Features
- FastVLM offers near-instant high-resolution image processing and can speed up video caption generation by as much as 85 times [2].
- The model is more than three times smaller than comparable models, improving its usability [2].

Group 2: User Experience
- Users can load the lightweight FastVLM-0.5B version directly in the browser, with a loading time of a few minutes on a 16GB M2 Pro MacBook Pro (a rough local-inference sketch follows the summary) [2].
- Once loaded, the model accurately describes the user's appearance, the room behind them, and surrounding objects [2].

Group 3: Application Potential
- FastVLM runs locally in the browser, so data never leaves the device, and it can even operate offline [2].
- This makes it particularly promising for wearable devices and assistive technology, where lightweight, low-latency performance is crucial [2].
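As a rough idea of what local (non-browser) inference with a small VLM like this could look like, here is a hedged Python sketch using the generic Hugging Face transformers interface. The repo id `apple/FastVLM-0.5B`, the processor/model classes, and the prompt format are assumptions for illustration; the article's demo runs in the browser, and the actual loading code should be taken from the model card.

```python
# Hedged sketch only: the repo id, classes, and prompt format are assumptions,
# not taken from Apple's release notes; consult the model card for the real API.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "apple/FastVLM-0.5B"  # assumed checkpoint name for the 0.5B variant
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()

image = Image.open("frame.jpg")          # e.g. a single video frame to caption
prompt = "Describe this image in one sentence."
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```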
Goodbye to High Latency! SJTU's Prune2Drive: A Token-Pruning Tool for Autonomous-Driving VLMs, 6x Faster with Performance Preserved
自动驾驶之心· 2025-08-28 23:32
Core Viewpoint
- The article presents the Prune2Drive framework, developed by Shanghai Jiao Tong University and Shanghai AI Lab, which achieves a 6.4x speedup in visual token processing while losing only about 3% performance by pruning 90% of visual tokens [2][3][25].

Group 1: Research Background and Challenges
- Vision-language models (VLMs) provide a unified framework for perception, reasoning, and decision-making in autonomous driving, enhancing scene understanding and reducing error propagation [2].
- Deploying VLMs in real driving scenarios is computationally demanding: high-resolution images from multiple cameras drive up inference latency and memory consumption [3].
- Existing token-pruning methods adapt poorly to multi-view setups, often ignoring spatial-semantic diversity and the differing contributions of individual camera views [4].

Group 2: Prune2Drive Framework
- Prune2Drive introduces a Token-wise Farthest Point Sampling (T-FPS) mechanism that maximizes the semantic and spatial coverage of multi-view tokens rather than relying solely on per-token importance scores [6].
- T-FPS uses cosine distance to measure semantic similarity between tokens, ensuring that the selected tokens are non-redundant and semantically rich (a minimal sketch of this selection step follows the summary) [10][11].
- A view-adaptive pruning controller tunes the pruning ratio per view, allocating the token budget according to how much each view contributes to driving decisions [11][12].

Group 3: Experimental Design and Results
- Experiments on two multi-view VLM benchmarks (DriveLM, DriveLMM-o1) validate the performance retention and efficiency gains of Prune2Drive against baseline methods [16].
- Even with 90% of tokens removed, the framework maintains a risk-assessment accuracy of 68.34, outperforming several baseline models [22].
- Prune2Drive achieves a 6.4x speedup on the DriveMM model and a 2.64x speedup on the DriveLMM-o1 model [25].

Group 4: Key Findings and Advantages
- Prune2Drive captures the critical information in driving scenes, outperforming other methods at identifying key objects across views [26].
- The framework is plug-and-play, requires no retraining of the VLM, and is compatible with efficient implementations such as Flash Attention [31].
- It balances performance and efficiency, substantially reducing computational load while preserving essential semantic information [31].
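The summary describes T-FPS as selecting a coverage-maximizing subset of tokens under cosine distance. Below is a minimal sketch of farthest point sampling over token features with cosine distance, assuming one camera view and a fixed token budget; the choice of the first token, the feature source, and the per-view budget are illustrative assumptions, not details from the paper.

```python
import numpy as np

def token_fps(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Select `keep` token indices by farthest point sampling under cosine distance."""
    # Normalize features so that cosine distance reduces to 1 - dot product.
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    selected = [0]  # start token is arbitrary here; a saliency-based start is also possible
    # Minimum cosine distance from every token to the currently selected set.
    min_dist = 1.0 - feats @ feats[0]
    for _ in range(keep - 1):
        idx = int(np.argmax(min_dist))      # farthest, i.e. least redundant, token
        selected.append(idx)
        min_dist = np.minimum(min_dist, 1.0 - feats @ feats[idx])
    return np.array(selected)

# Example: prune 90% of 1000 visual tokens from one camera view.
tokens = np.random.randn(1000, 256).astype(np.float32)
kept = token_fps(tokens, keep=100)
print(kept.shape)  # (100,)
```

In a multi-view setting, the `keep` budget per camera would come from something like the view-adaptive controller described above, rather than a uniform ratio.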
Mass-Producing "Hazards" from Real Scenes: VLM + Diffusion Models for Extreme-Case Testing
具身智能之心· 2025-08-26 00:03
Core Viewpoint
- The article introduces SafeMVDrive, a framework for generating high-fidelity, multi-view safety-critical driving videos to test autonomous driving systems in extreme scenarios, addressing the difficulty of collecting such data in the real world and the limited realism of simulators [7][11][30].

Group 1: Safety Testing Challenges
- Current autonomous driving systems still struggle to avoid accidents in high-risk scenarios such as night-time construction sites and sudden obstacles, underscoring the need for more reliable behavior in these situations [2][3].
- Extreme scenarios are rare in real-world driving, making data collection difficult, while existing simulators lack the realism required for effective testing [5][6].

Group 2: SafeMVDrive Framework
- SafeMVDrive combines a vision-language model (VLM) for vehicle selection with a two-stage trajectory generation process to create high-fidelity safety-critical videos for testing [7][10].
- The framework addresses two main challenges: accurately selecting safety-critical vehicles and preserving the generalization of the multi-view video generation model [9][10].

Group 3: Innovations in Vehicle Selection and Trajectory Generation
- The VLM-based vehicle selector uses visual information to identify potentially dangerous vehicles, improving on traditional heuristic rules [19][31].
- The two-stage trajectory generation first simulates collision trajectories and then transforms them into avoidance trajectories, keeping the scenario safety-critical while remaining suitable for realistic video generation (a toy illustration of this transformation follows the summary) [20][22][23].

Group 4: Video Generation and Evaluation
- SafeMVDrive uses a multi-view video generation module to convert the avoidance trajectories into high-fidelity videos, preserving both safety-criticality and visual realism [25][26].
- The framework significantly broadens the coverage and diversity of safety-critical scenarios compared with existing methods, generating more challenging test data [28][30].

Group 5: Performance Metrics
- SafeMVDrive improves sample-level and scene-level collision-rate metrics, indicating its effectiveness at generating realistic yet challenging driving scenarios [29][30].
- The VLM vehicle selector balances precision and recall, ensuring that the selected vehicles are consistent with real traffic logic, which is crucial for effective simulation [32].
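To make the "collision trajectory, then avoidance trajectory" idea concrete, here is a purely geometric toy in Python. SafeMVDrive itself uses a learned two-stage generator; the lateral-shift rule, the 0.5 m margin, and the straight-line trajectories below are illustrative assumptions and not the paper's method.

```python
# Toy illustration of turning a colliding adversarial trajectory into a near miss.
import numpy as np

def to_near_miss(ego_xy: np.ndarray, adv_xy: np.ndarray, margin: float = 0.5) -> np.ndarray:
    """Shift a colliding adversarial trajectory so its closest approach to the
    ego is roughly `margin` metres: still safety-critical, but collision-free.

    ego_xy, adv_xy: (T, 2) position arrays sampled at the same timestamps.
    """
    gaps = adv_xy - ego_xy                       # per-step offset vectors
    dists = np.linalg.norm(gaps, axis=1)
    t_min = int(np.argmin(dists))                # time of closest approach
    if dists[t_min] >= margin:
        return adv_xy                            # already a near miss
    if dists[t_min] < 1e-3:
        # Exact overlap: push perpendicular to the ego heading instead.
        heading = ego_xy[min(t_min + 1, len(ego_xy) - 1)] - ego_xy[max(t_min - 1, 0)]
        direction = np.array([-heading[1], heading[0]])
        direction /= np.linalg.norm(direction) + 1e-6
    else:
        direction = gaps[t_min] / dists[t_min]
    # Translate the whole adversarial trajectory just enough to open the gap.
    return adv_xy + (margin - dists[t_min]) * direction

# Head-on collision course turned into a near miss.
ego = np.stack([np.linspace(0, 50, 51), np.zeros(51)], axis=1)
adv = np.stack([np.linspace(50, 0, 51), np.zeros(51)], axis=1)
near_miss = to_near_miss(ego, adv)
print(np.linalg.norm(near_miss - ego, axis=1).min())   # ~0.5
```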
Junpu Intelligent Diversifies Step by Step; Embodied Intelligent Robot Business Makes Breakthrough Progress
Zheng Quan Ri Bao Wang· 2025-08-23 04:13
Core Insights
- Junpu Intelligent reported revenue of 1.032 billion yuan in the first half of 2025, with a backlog of orders worth 3.464 billion yuan, indicating stable business development [1].
- The company secured new orders worth 1.112 billion yuan, a year-on-year increase of 20.22%; non-automotive orders in the medical and high-end consumer goods sectors reached 445 million yuan, roughly 40% of total new orders [1].

Group 1: Medical Sector Developments
- In the medical health sector, Junpu Intelligent won a production-line project for continuous glucose monitoring (CGM) sensors from an internationally leading diagnostic equipment manufacturer, with an annual design capacity of 15 million units [1].
- The company established a strategic partnership with a leading domestic medical enterprise to jointly develop key platform cam technology for insulin injection pens [1].
- Winning its first fully automated production-line project for insulin injection pens and auto-injectors signals market recognition of Junpu Intelligent's technological strength in the intelligent manufacturing of high-value medical consumables [1].

Group 2: High-End Consumer Goods Innovations
- In the high-end consumer goods sector, Junpu Intelligent's self-developed "multi-blade intelligent assembly process" was successfully applied to a razor-blade assembly order for an international brand [1].
- The company also received an order for a flexible assembly line for high-end electric toothbrush drive units, which earned high praise from the client [1].

Group 3: Robotics Advancements
- Junpu Intelligent's humanoid robot "Jarvis 2.0" completed a multimodal upgrade, integrating AI models such as large language models (LLMs) and vision-language models (VLMs) to enable multilingual dialogue, voice-command control, and vision-guided object handling [2].
- The "Jarvis Lightweight 1.0" version has been delivered to Tsinghua University and other institutions for research and teaching [2].
- The joint venture between Junpu Intelligent's Ningbo Junpu Artificial Intelligence and Humanoid Robot Research Institute and Zhiyuan Robotics has commenced operations, with its first mass-production pilot line already in production [2].
- By the end of June, the joint venture had received over 28 million yuan in orders for humanoid robot production and sales, with three models of embodied intelligent robots currently in production [2].
Helped Another Student Land a VLA Algorithm Role...
具身智能之心· 2025-08-22 16:03
Core Insights
- The article emphasizes the value of joining the "Embodied Intelligence Heart Knowledge Planet," a comprehensive community for learning and sharing knowledge in embodied intelligence, a field growing rapidly in popularity and demand [1][16][85].

Community Features
- The community offers video content, written materials, learning pathways, Q&A sessions, and job-exchange opportunities, aiming to serve both beginners and advanced learners in embodied intelligence [1][2][17].
- It has established a job-referral mechanism with multiple leading companies in the embodied intelligence sector, connecting job seekers directly with employers [10][17].

Learning Resources
- The community has compiled more than 30 technical pathways covering aspects of embodied intelligence such as data collection, algorithm deployment, and simulation [2][16].
- It provides access to nearly 40 open-source projects and 60 datasets related to embodied intelligence, significantly reducing research and development time [16][30][36].

Networking and Collaboration
- The community hosts roundtable discussions and live broadcasts on the latest developments in the embodied intelligence industry, fostering collaboration among members [4][76].
- Members can freely ask questions and receive guidance on career choices and research directions, enhancing the collaborative learning environment [78].

Industry Insights
- Members include people from renowned universities and leading companies in the field, ensuring a diverse range of expertise and perspectives [16][20][21].
- The community provides summaries of industry reports and research papers, keeping members informed about the latest trends and applications in embodied intelligence [23][26].
Still Not Sure How to Start a VLA Paper? Some Students Already Have CCF-A Publications...
自动驾驶之心· 2025-08-22 12:00
Core Insights
- The article discusses the advances in Li Auto's VLA driver model, highlighting its improved capabilities in semantic understanding, reasoning, and trajectory planning, all crucial for autonomous driving [1][3][5].

Group 1: VLA Model Capabilities
- The VLA model demonstrates stronger semantic understanding through multimodal input, improved reasoning via chains of thought, and trajectory planning that more closely approximates human driving intuition [1].
- Four core abilities of the VLA model are showcased: spatial understanding, reasoning, communication and memory, and behavioral ability [1][3].

Group 2: Research and Development Trends
- The VLA model has evolved from VLM+E2E, integrating cutting-edge technologies such as end-to-end learning, trajectory prediction, vision-language models, and reinforcement learning [5].
- While industry continues to optimize traditional perception and planning tasks, academia is increasingly shifting its focus toward large models and VLA, leaving many subfields open for exploration [5].

Group 3: VLA Research Guidance Program
- A VLA research-paper guidance program has been launched to positive feedback, aiming to help participants systematically master key theoretical knowledge and develop their own research ideas [6].
- The program follows a structured 14-week curriculum, from traditional end-to-end autonomous driving through to methodologies for writing research papers [9][11][30].

Group 4: Course Structure and Requirements
- Each session is capped at 8 participants and targets individuals with a background in VLA and autonomous driving at various academic levels [12][15].
- Participants are expected to have a foundation in deep learning and Python programming and familiarity with PyTorch; specific hardware is suggested for optimal performance [21][22].

Group 5: Expected Outcomes
- Participants will study classic and cutting-edge papers, build coding skills, and learn how to write and submit research papers, culminating in a draft paper [20][34].
- The program aims to deepen participants' understanding of algorithms and their trade-offs and to stimulate research ideas through structured guidance [20][34].
When an 11-Year-Old AI Company Enters the Embodied Intelligence Battlefield
36Kr· 2025-08-19 10:12
Core Insights
- The article notes that this year is being called the "Year of Embodied Intelligence," with the field becoming a hotbed for AI applications; YuFan Intelligent, a well-known visual AI company, has launched two embodied intelligence products and announced a full-stack in-house development strategy to embrace the new era [1][3].

Group 1: Company Strategy and Product Launch
- YuFan Intelligent has officially entered the embodied intelligence sector with two products, the spatial cognition model Manas and a quadruped robot, marking a significant strategic shift for the company [3][4].
- Manas is a multimodal large language model (MLLM) that has achieved state-of-the-art results on industry-standard datasets, positioning it as the "brain" of YuFan's embodied intelligence hardware [3][14].
- The quadruped robot is YuFan's first embodied intelligent robot, with all mechanical structures and control platforms developed in-house [4][17].

Group 2: Technological Foundations and Capabilities
- YuFan's prior experience integrating hardware and software has prepared it for the challenges of embodied intelligence, which requires seamless collaboration between hardware and AI algorithms [1][22].
- The company has developed a multimodal reasoning architecture, UUMM, which adapts large language model structures for embodied intelligence applications and integrates human language and visual inputs [16][18].
- Manas's strong performance on spatial-understanding benchmarks indicates YuFan's readiness to advance in the embodied intelligence domain [17][19].

Group 3: Market Context and Competitive Landscape
- YuFan's entry into embodied intelligence aligns with broader industry trends, as major players increasingly integrate multimodal models into their hardware to boost intelligence [6][7].
- The current landscape is characterized by diverse technological routes and a lack of standardized hardware, so companies must factor hardware into algorithm development [18][20].
- YuFan's established experience in visual AI, together with its supply chain and productization capabilities, positions it well to compete in the rapidly evolving embodied intelligence market [23][24].
Evaluating the Performance and Limits of General-Purpose Policies like π0 in Complex Real-World Scenarios
自动驾驶之心· 2025-08-17 03:23
Core Viewpoint
- The article evaluates the π₀-FAST-DROID model in real-world scenarios, highlighting its potential and limitations in robotic manipulation, particularly when handling new objects and tasks without extensive prior training [4][10][77].

Evaluation Method
- The evaluation uses the π₀-FAST-DROID model, fine-tuned for the DROID robot platform, which consists of a Franka Panda arm equipped with cameras [5][10].
- The assessment spans more than 300 trials across varied tasks, focusing on the model's behavior in diverse environments, particularly a kitchen setting [10][11].

Findings
- The model exhibits a strong prior toward reasonable behavior, often producing intelligent actions, but these are not always sufficient to complete the task [11].
- Prompt engineering is crucial: variations in task descriptions significantly affect success rates, underscoring the need for clear, structured prompts [12][59].
- The model shows impressive visual-language understanding and can mimic continuous actions across different scenarios [13][28].

Performance in Complex Scenarios
- The model is robust at recognizing and manipulating transparent objects, a long-standing challenge for traditional methods [20][27].
- It stays focused on the task despite human movement in the background, suggesting effective prioritization of relevant visual inputs [25].

Limitations
- The model struggles with semantic ambiguity and often freezes mid-task, particularly when it encounters unfamiliar commands or objects [39][42].
- It lacks memory, which hinders multi-step tasks and leads to premature task completion or freezing [43][32].
- It also struggles with precise spatial reasoning, particularly estimating distances and heights, causing failures during object manipulation [48][50].

Task-Specific Performance
- Performance varies across task categories, with notable success on simple tasks but significant difficulty in complex operations such as pouring liquids and operating household appliances [89][91][100].
- For instance, the model reached a 73.3% progress rate when pouring toy items but only 20% with real liquids, indicating limits in its physical capabilities (a sketch of aggregating such per-category progress rates follows the summary) [90].

Conclusion
- The evaluation suggests that while π₀ shows promise as a generalist robot policy, it still needs significant improvement in instruction adherence, fine manipulation, and handling partial observability [77][88].
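The per-category figures above are averages over repeated trials. Below is a minimal sketch of how such progress rates could be aggregated from trial logs; the record fields and example values are illustrative placeholders, not data from the evaluation.

```python
# Aggregate mean progress per task category from per-trial records.
from collections import defaultdict

trials = [
    {"category": "pour toy items", "progress": 0.8},   # illustrative values only
    {"category": "pour toy items", "progress": 0.7},
    {"category": "pour real liquid", "progress": 0.2},
    {"category": "pour real liquid", "progress": 0.2},
]

by_category = defaultdict(list)
for trial in trials:
    by_category[trial["category"]].append(trial["progress"])

for category, scores in by_category.items():
    mean_progress = 100.0 * sum(scores) / len(scores)
    print(f"{category}: mean progress {mean_progress:.1f}% over {len(scores)} trials")
```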
Session Two of the VLA and Autonomous Driving Research Paper Mentorship Is Here~
自动驾驶之心· 2025-08-16 12:00
Core Insights
- The article discusses recent advances in Li Auto's VLA driver model, highlighting its improved capabilities in semantic understanding, reasoning, and trajectory planning, all crucial for autonomous driving [1][3].

Group 1: VLA Model Capabilities
- The VLA model's enhancements center on four core abilities: spatial understanding, reasoning, communication and memory, and behavioral capability [1].
- The reasoning and communication abilities derive from language models, with memory implemented via retrieval-augmented generation (RAG) [3].

Group 2: Research and Development Trends
- The VLA model has evolved from VLM+E2E, incorporating cutting-edge technologies such as end-to-end learning, trajectory prediction, vision-language models, and reinforcement learning [5].
- While industry continues to optimize traditional perception and planning tasks, academia is increasingly shifting toward large models and VLA, leaving many subfields open for research [5].

Group 3: VLA Research Guidance Program
- A VLA research-paper guidance program has been launched to help participants systematically master key theoretical knowledge and develop their own research ideas [6].
- The program comprises a structured 12-week online group research course, followed by 2 weeks of paper guidance and a 10-week maintenance period for paper development [14][34].

Group 4: Course Structure and Content
- The 14-week course covers topics including traditional end-to-end autonomous driving, VLA end-to-end models, and methodologies for writing research papers [9][11][35].
- Participants will study classic and cutting-edge papers, build coding skills, and learn how to write and submit research papers [20][34].

Group 5: Enrollment and Requirements
- Each session is limited to 6-8 participants and targets individuals with a background in deep learning and basic knowledge of autonomous driving algorithms [12][15].
- Participants are expected to have a foundation in Python and PyTorch; access to high-performance computing resources is recommended [21].