Multimodal Large Models
Minglue Technology CEO Wu Minghui to Attend the 2025 Tencent Global Digital Ecosystem Conference
Xin Lang Cai Jing· 2025-09-16 03:14
Core Insights
- The evolution of global large model technology is accelerating, and industry applications are deepening progressively [1]
- Vertical large models are becoming the key to implementing AI in enterprises, addressing the limitations of general large models in proprietary data and industry know-how [1]
- Minglue Technology's CEO, Wu Minghui, will present at the Tencent Global Digital Ecosystem Conference, discussing the practical applications of multimodal large models in marketing scenarios [1]

Industry Trends
- The shift towards vertical large models indicates a growing recognition of their importance in overcoming challenges faced by general large models [1]
- The focus on industry-specific applications suggests a trend towards more tailored AI solutions that leverage specialized knowledge and data [1]

Company Developments
- Minglue Technology is showcasing its latest technological breakthroughs and practical achievements in the field of AI [1]
- The upcoming presentation at a major conference highlights the company's commitment to advancing AI applications in marketing [1]
Paper Walkthrough: HKUST's PLUTO, the First Planner to Surpass Rule-Based Methods!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model within the end-to-end autonomous driving domain, emphasizing its unique two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2].

Summary by Sections

Overview of PLUTO
- PLUTO is characterized by its three main losses: regression loss, classification loss, and imitation learning loss, which collectively contribute to the model's performance [7].
- Additional auxiliary losses are incorporated to aid model convergence [9].

Course Introduction
- The article introduces a new course titled "End-to-End and VLA Autonomous Driving," developed in collaboration with top algorithm experts from leading domestic manufacturers, aimed at addressing the challenges faced by learners in this rapidly evolving field [12][15].

Learning Challenges
- The course addresses the difficulties learners face due to the fast-paced development of technology and the fragmented nature of knowledge across various domains, which makes it hard for beginners to grasp the necessary concepts [13].

Course Features
- The course is designed to provide quick entry into the field, build a framework for research capabilities, and combine theory with practical applications [15][16][17].

Course Outline
- The course consists of several chapters covering topics such as the history and evolution of end-to-end algorithms, background knowledge on various technologies, and detailed discussions of both one-stage and two-stage end-to-end methods [20][21][22][29].

Practical Application
- The course includes practical assignments, such as RLHF fine-tuning, allowing students to apply their theoretical knowledge in real-world scenarios [31].

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge algorithms related to end-to-end and large models, contributing to the course's credibility [32].

Target Audience and Expected Outcomes
- The course is aimed at individuals with a foundational understanding of autonomous driving and related technologies, with the goal of elevating their skills to the level of an end-to-end autonomous driving algorithm engineer within a year [36].
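The loss composition mentioned in the PLUTO overview (regression, classification, and imitation learning terms plus auxiliary losses) can be pictured as a weighted sum. The following is only a minimal sketch under stated assumptions, not PLUTO's actual implementation: the function name `total_loss`, the `aux_weight` parameter, and all numeric values are hypothetical, since the summary gives neither weights nor formulas.

```python
from typing import Sequence


def total_loss(
    regression: float,
    classification: float,
    imitation: float,
    auxiliary: Sequence[float] = (),
    aux_weight: float = 1.0,  # hypothetical weight; the summary gives no value
) -> float:
    """Sum the three main loss terms and add weighted auxiliary terms."""
    main = regression + classification + imitation
    return main + aux_weight * sum(auxiliary)


# Illustrative values only; they are not taken from the paper.
loss = total_loss(0.5, 0.25, 0.125, auxiliary=[0.5, 0.25], aux_weight=0.5)
print(loss)
```

Keeping the auxiliary terms behind a separate weight makes it easy to reduce or disable them once training has converged, which matches their stated role as convergence aids.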
Everything About Large Models and Autonomous Driving
自动驾驶之心· 2025-09-15 23:33
Group 1
- The article emphasizes the growing interest in large model technologies, particularly in areas such as RAG (Retrieval-Augmented Generation), AI Agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and optimization for deployment and inference [1]
- A community named "Large Model Heart Tech" is being established to focus on these technologies and aims to become the largest domestic community for large model technology [1]
- The community is also creating a knowledge platform to provide industry and academic information, as well as to cultivate talent in the field of large models [1]

Group 2
- The article describes the community as a serious content-driven platform aimed at nurturing future leaders [2]
New Open-Source Model Reproduces o3-Style Visual Reasoning, Achieving Deep Thinking Without Extensive Training
量子位· 2025-09-15 03:59
Core Viewpoint
- The article discusses the development of Mini-o3, an advanced vision-language model (VLM) that enables multi-round visual reasoning, significantly improving upon previous models by allowing deep reasoning across dozens of steps [1][2][15].

Group 1: Model Development
- Mini-o3 was developed jointly by ByteDance and the University of Hong Kong and is designed to perform long-horizon visual search without extensive training resources [13].
- The model can extend its reasoning from a training limit of 6 rounds to dozens of rounds at test time, showcasing its advanced multi-modal reasoning abilities [2][15].

Group 2: Key Design Features
- Mini-o3 incorporates three critical design elements: the VisualProbe dataset for exploratory reasoning, an iterative data collection process for diverse reasoning strategies, and a super-round (over-turn) masking strategy to balance training efficiency with testing scalability [17][19][34].
- The VisualProbe dataset consists of thousands of visual search challenges specifically designed for deep reasoning tasks, enhancing the model's training [17][38].

Group 3: Training Phases
- The training of Mini-o3 occurs in two phases: a cold-start supervised fine-tuning (SFT) phase to activate multi-round tool usage, and a reinforcement learning (RL) phase to optimize interaction rounds [19][25].
- The cold-start SFT phase uses a small number of manually constructed samples to generate diverse reasoning trajectories, resulting in approximately 6000 cold-start reasoning paths [24][46].

Group 4: Performance Evaluation
- Mini-o3 outperforms existing models in visual search tasks, achieving the best performance across various benchmarks, including VisualProbe, V*Bench, and HR-Bench [43][44].
- The model's performance is attributed to its ability to maintain complex and deep reasoning trajectories, with significant improvements noted in challenging tasks [44][48].

Group 5: Experimental Insights
- Experiments indicate that removing RL data leads to a performance drop of about 8.6 points on VisualProbe-Hard, highlighting the importance of challenging RL samples for encouraging complex reasoning [45].
- The super-round masking technique effectively enhances RL performance, particularly in multi-round interaction scenarios, by stabilizing the training process and enabling extended reasoning during testing [48].

Group 6: Conclusion and Future Directions
- The technical framework of Mini-o3 provides practical guidance for the development of multi-round interactive multi-modal models and their applications in reinforcement learning [52].
- The research team has made all related code open-source, promoting further exploration and development in this field [53].
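The super-round masking strategy (usually rendered "over-turn masking" in English) named in Group 2 and Group 5 can be illustrated with a small sketch. It rests on an assumption the summary does not spell out: a rollout that hits the training turn limit without producing an answer is excluded from the learning signal instead of being penalized as wrong, so the policy is not pushed toward short trajectories. The names `Rollout`, `is_over_turn`, and `masked_advantages`, and the group-mean baseline, are hypothetical; only the 6-round training limit comes from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_TRAIN_TURNS = 6  # training-time round limit mentioned in the article


@dataclass
class Rollout:
    turns: int      # interaction rounds used by this trajectory
    answered: bool  # whether a final answer was produced
    reward: float   # e.g. 1.0 for a correct answer, 0.0 otherwise


def is_over_turn(r: Rollout) -> bool:
    """A rollout is over-turn if it used up the budget without answering."""
    return (not r.answered) and r.turns >= MAX_TRAIN_TURNS


def masked_advantages(group: List[Rollout]) -> List[float]:
    """Advantage = reward minus the mean reward of non-masked rollouts.

    Over-turn rollouts get advantage 0.0, so they contribute no gradient.
    """
    valid = [r for r in group if not is_over_turn(r)]
    if not valid:
        return [0.0] * len(group)
    baseline = sum(r.reward for r in valid) / len(valid)
    return [0.0 if is_over_turn(r) else r.reward - baseline for r in group]


group = [Rollout(3, True, 1.0), Rollout(5, True, 0.0), Rollout(6, False, 0.0)]
print(masked_advantages(group))
```

Under this assumption, the third rollout neither raises nor lowers the baseline, which is one plausible reading of how masking "stabilizes" training while leaving room for longer reasoning at test time.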
Recruiting Several Experts to Co-Build Platforms (World Models / Model Deployment)
自动驾驶之心· 2025-09-14 03:44
Group 1
- The article announces the recruitment of 10 partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2][5]
- The main areas of expertise sought include large models, multimodal models, diffusion models, SLAM, 3D object detection, and closed-loop simulation [3]
- Candidates from QS200 universities with a master's degree or higher, especially those with significant conference experience, are preferred [4]

Group 2
- The company offers benefits such as resource sharing for job seeking, PhD recommendations, and study abroad opportunities [5]
- Attractive cash incentives and opportunities for entrepreneurial project collaboration are highlighted [5]
- Interested parties are encouraged to contact via WeChat for collaboration inquiries [6]
Robotics Industry Tracker: Dexterous Hands Set for an Upgrade Under Industry Leaders, Sector Sentiment Expected to Improve
Orient Securities· 2025-09-14 02:12
Investment Rating
- The report maintains a "Positive" investment rating for the mechanical equipment industry, indicating an expectation of performance that exceeds the market benchmark by over 5% [6][20].

Core Insights
- The report highlights that the release of Tesla's next-generation dexterous hand is expected to enhance the flexibility and functionality of the dexterous hand industry, leading to an optimistic outlook for the industry chain [3][9].
- The dexterous hand technology has undergone significant iterations, with Tesla's third-generation model achieving 22 degrees of freedom, a substantial increase from the first generation's 11 degrees [9][10].
- The report emphasizes that the advancement in dexterous hand technology will not only improve product value but also drive the overall industry towards higher degrees of freedom and functionality [14].

Summary by Sections

Industry Overview
- The report tracks the robotics industry, focusing in particular on the dexterous hand segment, which is poised for upgrades and increased market activity [1][5].

Technological Advancements
- Tesla's dexterous hand has evolved through multiple iterations, with the latest model featuring 26 actuators per arm, significantly enhancing its operational capabilities [10][9].
- The integration of multiple sensors in dexterous hands is expected to create a multi-modal data collection platform, which will improve AI training efficiency and model generalization capabilities [13][9].

Investment Recommendations
- The report identifies several investment targets within the dexterous hand industry, including Zhenyu Technology (300953, Buy), Hanwei Technology (300007, Not Rated), Nanshan Zhishang (300918, Not Rated), and Mingzhi Electric (603728, Not Rated) [3].
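Former Head of JD's Intelligent Driving Starts a Company: "Xingyuan Intelligence" Aims to Build a General Embodied Brain | 36Kr Exclusive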
36氪· 2025-09-11 23:46
Core Viewpoint
- The article discusses the emergence of a new industrial revolution driven by AI, focusing on the development of embodied intelligence and its potential to solve last-mile delivery challenges in logistics [5][10].

Group 1: Company Overview
- Liu Dong, the founder of Xingyuan Intelligence, previously worked at JD Logistics, where he identified the last-mile delivery problem that existing automated solutions could not address [5][18].
- Xingyuan Intelligence recently completed a 200 million yuan angel round of financing, with investments from various venture capital and industry players [9][14].
- The company aims to develop a "general embodied brain" that can enhance the capabilities of robots in logistics and delivery [12][20].

Group 2: Technical Approach
- The company has chosen a layered architecture for its embodied intelligence system, separating the "brain" responsible for perception and planning from the "cerebellum" (literally "small brain") that executes actions [12][22].
- Liu Dong believes that the industry currently lacks a low-cost method to obtain real-robot data, making pure end-to-end models impractical at this stage [11][23].
- The layered approach allows robots to start working and accumulate data, which can later be used to train more advanced models [23][24].

Group 3: Market Strategy
- Xingyuan Intelligence operates a dual-track business model, acting as both a Tier 1 supplier providing integrated solutions to robot manufacturers and a contractor offering complete robotic solutions to end customers [14][30].
- The company focuses on specific scenarios such as picking robots for supermarkets and pharmacies, which are seen as the fastest to implement and generate revenue [36][42].
- The pricing strategy for its robotic solutions aims to keep costs low, making it attractive for clients to replace human labor with robots [38][39].

Group 4: Commercial Viability
- The company has identified clear revenue growth paths and market opportunities, and expects picking robots to be operational by next year [45][46].
- Liu Dong emphasizes that the ability to implement solutions in real-world scenarios is crucial for the survival and success of the company [13][46].
- The company plans to apply its technology to further use cases, including navigation and inspection, which can quickly lead to revenue generation [43][44].
The Road to Switching Careers into Autonomous Driving Algorithms: Learning Edition
自动驾驶之心· 2025-09-10 23:33
Group 1
- The article introduces a significant learning package for the new academic season, including a 299 yuan discount card that offers a 30% discount on all platform courses for one year [3][5].
- Various course benefits are highlighted, such as a 1000 yuan purchase giving access to two selected courses, and discounts on specific classes and hardware [3][6].
- The focus is on cutting-edge autonomous driving technologies for 2025, particularly end-to-end (E2E) and VLA (Vision-Language-Action) autonomous driving systems [5][6].

Group 2
- End-to-end autonomous driving is emphasized as a core algorithm for mass production, with a notable mention of the competition sparked by the UniAD paper winning the CVPR Best Paper award [6][7].
- The article discusses the rapid evolution of technology in the field, indicating that previous learning materials may no longer be suitable for current industry standards [7].
- The challenges faced by beginners in understanding fragmented knowledge and the lack of high-quality documentation in end-to-end autonomous driving research are addressed [7][8].

Group 3
- The article outlines specific courses aimed at addressing the complexities of autonomous driving, including a small class on 4D annotation algorithms, which are crucial for training data generation [11][12].
- The importance of automated 4D annotation in enhancing the efficiency of data loops and improving the generalization and safety of autonomous driving systems is highlighted [11].
- The introduction of a multi-modal large model and practical courses in autonomous driving is noted, reflecting the growing demand for skilled professionals in this area [15][16].

Group 4
- The article features expert instructors for the courses, including Jason, a leading algorithm expert in the industry, and Mark, a specialist in 4D annotation algorithms [8][12].
- The curriculum is designed to provide a comprehensive learning experience, addressing real-world challenges and preparing students for job opportunities in the autonomous driving sector [23][29].
- The article emphasizes the importance of community engagement and support through dedicated VIP groups for course participants, facilitating discussions and problem-solving [29].
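Beating NVIDIA with Four Global Firsts! UBTECH's Self-Developed Thinker, the Most Powerful Brain for Humanoid Robots, Tops Global Rankings!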
机器人圈· 2025-09-10 09:07
Core Viewpoint
- UBTECH's humanoid robot Walker has achieved significant advancements with its self-developed multimodal large model, Thinker, which has excelled in three major international benchmark tests, showcasing its leading capabilities in complex environment perception, semantic understanding, and long-term task planning [2][4].

Group 1: Benchmark Achievements
- UBTECH's Thinker model ranked first in four global leaderboard categories across three authoritative benchmark tests: MS COCO Detection Challenge, RoboVQA, and Egoplan-bench2, competing against top teams from NVIDIA, Beijing Academy of Artificial Intelligence, and Shanghai AI Lab [2][4].
- The MS COCO Detection Challenge is recognized as a key evaluation standard in the computer vision field, while RoboVQA and Egoplan-bench2 focus on reasoning and task planning from a robot's perspective [4][5].

Group 2: Technical Innovations
- The Thinker architecture integrates several key technological innovations that enhance the humanoid robot's perception and reasoning capabilities, laying the groundwork for large-scale applications in industrial settings [6].
- A self-developed visual encoder based on ViT, combined with a Co-DETR detection head, improves environmental perception, significantly enhancing the robot's ability to recognize objects and obstacles in complex environments [7].
- Thinker's large-scale architecture, with billions of parameters, enables robust semantic understanding, allowing the robot to accurately capture environmental details and comprehend task instructions [7].
- Temporal enhancement algorithms and reinforcement learning methods improve long-term task planning, enabling the robot to autonomously decompose complex processes in real time [7].

Group 3: Industrial Application Strategies
- The strategy of "building general foundational capabilities + fine-tuning for industrial scenarios" is crucial for advancing multimodal large models towards practical applications, facilitating stable and efficient deployment of humanoid robots on production lines [9][11].
- The model has been trained on more than 2 million video samples and fine-tuned with a large industrial dataset, significantly improving the robot's understanding accuracy and decision reliability in industrial environments [11][12].

Group 4: Future Development and Collaboration
- UBTECH aims to build an open and collaborative application ecosystem for humanoid robots by gradually open-sourcing valuable industrial datasets and foundational large models, enabling developers to work more efficiently in new scenarios [14].
World's First L4-Level Energy AI Agent, with Prediction Accuracy More Than 30% Higher Than Traditional Methods | Innovation Scenarios
Tai Mei Ti APP· 2025-09-08 01:13
Core Insights
- LEMMA, launched by ELU Technology Group, is the world's first L4-level energy AI Agent, representing a significant breakthrough in AI application within the energy sector [1]
- The solution is based on the concept of "bit empowering watt," utilizing the self-developed ILM (Infinity Large Model) for AI decision-making and the HEE (Hyper Energy Engine) as its technological foundation [1]
- LEMMA transitions energy systems from traditional passive responses to proactive intelligent services, enabling autonomous market monitoring, opportunity discovery, strategy formulation, and decision execution [1]

Technical Architecture
- The core engine of the L4-level AI Agent is designed to support complex scene understanding and reasoning capabilities [2]
- It features a complete closed-loop system for proactive perception, autonomous decision-making, and intelligent execution [2]
- The system supports multi-modal data fusion processing, including text, numerical, image, and time-series data [2]

Application Scenarios
- LEMMA is applicable in energy trading, virtual power plant scheduling, energy storage system optimization, and load forecasting [1][2]
- It autonomously monitors various trading products in the electricity spot market and auxiliary service market [2]
- The system can automatically formulate and execute optimal trading strategies while optimizing distributed energy resource allocation [2]

Performance Outcomes
- The accuracy of short-term load forecasting has reached 98.5%, improving by over 30% compared to traditional methods [4]
- Price prediction accuracy has improved by 35%, providing a more reliable basis for trading decisions [4]
- The system's decision response time has been reduced from minutes to milliseconds, supporting high-frequency trading scenarios [4]

Economic and Social Impact
- The trading revenue in pilot projects has increased by 25-40% compared to traditional methods, while operational costs have decreased by over 30% [4]
- The technology has processed transaction amounts exceeding 100 billion, covering various types of clients including power generation companies and industrial users [4]
- LEMMA contributes to achieving carbon neutrality goals and promotes the digital transformation of the energy industry [3][6]

Industry Influence
- As the first L4-level energy AI Agent, LEMMA sets a technological benchmark in the industry and fosters the development of a new ecosystem for energy AI applications [6]
- The solution aids traditional energy companies in their transformation and upgrade paths, leading the energy sector towards intelligent and digital development [6]
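The summary does not say how load-forecasting "accuracy" is measured. One common convention, assumed here purely for illustration, is accuracy = 1 - MAPE (mean absolute percentage error). The function name `forecast_accuracy` and the sample values below are hypothetical and are not LEMMA data.

```python
from typing import Sequence


def forecast_accuracy(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return 1 - MAPE for paired actual and predicted load values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    # Mean of per-point absolute percentage errors
    mape = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)
    return 1.0 - mape


# Hypothetical hourly loads in MW, for illustration only.
actual_mw = [100.0, 200.0]
predicted_mw = [98.0, 203.0]
print(round(forecast_accuracy(actual_mw, predicted_mw), 4))
```

Whether the reported figures use this metric or another one (for example, one based on RMSE) changes how a phrase like "improving by over 30%" should be read, so the definition is worth confirming against the original source.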