139 Financing Rounds Above 100 Million RMB: Over 60 Billion RMB in Hard Cash Poured into These AI Companies
36Kr· 2025-11-10 06:59
Core Insights
- The AI financing landscape in 2025 is characterized by a shift toward practical applications and revenue generation, moving away from speculative bets on concepts like AGI [1][17][18]

Group 1: AI Financing Overview
- From January to October 2025, China's AI sector saw 139 financing rounds exceeding 100 million RMB each, totaling over 60 billion RMB [1]
- Major players in the large-model sector, such as 月之暗面, 智谱, and MiniMax, secured over 20 billion RMB in funding, reflecting how financing is concentrating in the head players [3][4]
- Embodied intelligence has emerged as the most popular area for investment, with 73 companies raising over 20 billion RMB [6]

Group 2: Large Models and Multi-Modal Models
- Financing for large models is heavily concentrated at the top, with 29 rounds totaling 14.2 billion RMB, averaging nearly 500 million RMB per round [3]
- Multi-modal models are gaining traction, particularly in video applications, with companies like 生数科技 and 爱诗科技 achieving commercial breakthroughs [5]

Group 3: Embodied Intelligence
- Embodied intelligence has surpassed large models in funding, with 25.7 billion RMB raised across 73 financing rounds [6]
- Key players like 智元机器人 and 乐聚机器人 are at the forefront, with significant orders and upcoming production milestones [7][8]

Group 4: AI Infrastructure
- AI infrastructure projects, particularly AI chips, have drawn diverse investments, with 8 rounds focused on AI chips and 6 on computing services [9]
- Notable companies like 曦智科技 and 爱芯元智 have secured over 1 billion RMB in funding, indicating a shift toward diversified technology routes in AI chips [9][10]

Group 5: AI Applications in Healthcare
- Healthcare is the leading sector for AI application financing, with 6 of the 15 major funding rounds occurring in this area, or 40% of the total [14]
- Companies like 联影智能 have successfully integrated AI into medical imaging, demonstrating the commercial viability of AI in healthcare [14][16]

Group 6: Future Outlook
- The narrative around AI has shifted from dreams to tangible results, with a focus on revenue, orders, and production capacity [17]
- The industry is entering a "realism" phase that emphasizes practical applications and market-driven investment [18]
NeurIPS 2025 Spotlight | RobustMerge: A New Paradigm for Merging Parameter-Efficiently Fine-Tuned Multimodal Large Models
机器之心· 2025-11-10 04:40
Core Insights
- The article addresses the challenge of efficiently merging multiple specialized models into one general model amid rapidly advancing AI technology, identifying "direction robustness" as the key factor behind the failure of parameter-efficient fine-tuning (PEFT) module merging [2][7][10]
- A new solution, RobustMerge, is proposed: a simple and efficient model-merging method with no additional cost, offering significant potential for developers and researchers working on multimodal large models [2][8]

Problem Definition
- The rise of multimodal large models has increased computational demands, making full fine-tuning (FFT) costly and impractical for many users; parameter-efficient fine-tuning (PEFT), particularly LoRA, has therefore become mainstream, enabling quick adaptation to downstream tasks by updating only a small fraction of model parameters [7][8]
- Traditional approaches to combining models, such as multi-task learning, face challenges in training cost and data availability, motivating model merging as a more efficient alternative [8][10]

Key Contributions
- RobustMerge addresses the shortcomings of existing PEFT merging methods by identifying direction instability, rather than parameter sign conflicts, as the core issue, paving the way for a new paradigm in LoRA merging [10][41]
- The method uses a two-step merging strategy, pruning with complementary scaling followed by cross-task normalization, to enhance the stability of low-rank directions during merging (see the sketch after this summary) [16][19][23]

Experimental Design and Results
- RobustMerge was tested across multiple benchmarks, including a newly created benchmark, MM-MergeBench, which evaluates performance on both seen and unseen tasks, and demonstrated significant improvements in multi-task performance and generalization [28][31]
- RobustMerge outperforms traditional methods, achieving an average accuracy gain of 3.4% on seen tasks and 4.5% on unseen tasks, showing its effectiveness at reducing task interference and enhancing multi-task performance [31][32]

Practical Applications
- RobustMerge can be applied in scenarios such as rapid deployment of multi-task models, federated learning, and model editing or style transfer, making it a valuable tool for enterprises building complex AI applications efficiently [44][45]
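To make the two-step strategy concrete, here is a minimal sketch of what pruning with complementary scaling plus cross-task normalization could look like when merging LoRA task updates. The SVD-based pruning, the mass-redistribution scaling, and the norm-based cross-task weights are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def robust_merge(loras, prune_ratio=0.2, eps=1e-8):
    """Merge per-task LoRA updates (delta_W = B @ A) into a single update.

    Illustrative sketch: (1) per task, drop the weakest singular directions and
    complementarily rescale the kept ones so total spectral mass is preserved;
    (2) normalize scales across tasks so no single task dominates the merge.
    `loras` is a list of (B, A) pairs with shapes (d, r) and (r, k).
    """
    rescaled = []
    for B, A in loras:
        U, S, Vt = np.linalg.svd(B @ A, full_matrices=False)
        keep = max(1, int(len(S) * (1 - prune_ratio)))      # directions to keep
        scale = S.sum() / (S[:keep].sum() + eps)            # complementary scaling
        rescaled.append(U[:, :keep] @ np.diag(S[:keep] * scale) @ Vt[:keep, :])
    # Cross-task normalization: bring every task update to a comparable norm.
    norms = np.array([np.linalg.norm(W) for W in rescaled])
    weights = norms.mean() / (norms + eps)
    return sum(w * W for w, W in zip(weights, rescaled)) / len(rescaled)

# Toy usage: merge three random rank-4 LoRA updates for a 64x32 weight matrix.
rng = np.random.default_rng(0)
loras = [(rng.normal(size=(64, 4)), rng.normal(size=(4, 32))) for _ in range(3)]
merged = robust_merge(loras)   # (64, 32) update to add to the base weight
```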
Robot Brain Industry Tracking
2025-11-10 03:34
Summary of Key Points from the Conference Call on the Robotics Industry

Industry Overview
- The robotics industry is shifting focus from traditional industrial robots to humanoid and specialized product forms, with a strong emphasis on full-chain automation control [2][16]
- The development of humanoid robots is closely linked to advances in automotive intelligence and electrification, with many robotics developers originating from the automotive sector [2][3]

Core Challenges
- Robotic "brains" face dual challenges: the real-time performance of operating systems and the uncertainty of AI algorithms, particularly in precision control scenarios [4][10]
- The "hallucination" phenomenon in large language models complicates the training of models for specific applications [4]
- Data variability across environments, such as home care, adds further complexity to model training [5][12]

Industrial vs. Domestic Applications
- Robotic brains are easier to deploy in industrial settings, where larger project budgets allow extensive data collection and training, than in home-care scenarios with tight budget constraints [6][13]
- Tailored solutions for specific environments are emphasized, suggesting a gradual approach that starts with narrow applications [13][24]

Technological Development
- The concept of world models is gaining traction, with the potential to enhance robotic brains by reconstructing scene data, although data volume and computational power remain significant hurdles [8][9]
- Current robotic systems are closer to specialized control systems than to general-purpose brains, requiring real-time operating systems and sufficient computing power for perception [10][11]

Market Dynamics
- If China's robotics supply chain is fully established, costs could be significantly lower than in the U.S., given China's strong manufacturing base [14]
- A shortage of skilled product managers in China is identified as a barrier to defining and designing effective robotics products [22]

Future Outlook
- The robotics industry is still in its infancy, with no clear leaders emerging because the technology stack is not yet fully integrated [16]
- Short-term investment risks are highlighted, as major breakthroughs in robotics and AI are not expected imminently [20][24]
- Humanoid robots have potential across many applications, but their current utility in most scenarios remains limited [17]

Conclusion
- The robotics industry is at a critical juncture and can grow if initial application scenarios are clearly defined and marketable solutions are developed [24][25]
- Investors are advised to manage expectations and balance technological advancement with practical commercialization strategies [25]
vivo AI Lab Proposes a Self-Evolving Mobile GUI Agent: UI-Genie Achieves Continuous Performance Gains Without Manual Annotation
机器之心· 2025-11-07 07:17
Core Insights
- The article covers advances in multimodal large models (MLLMs) and the development of mobile GUI agents that can autonomously understand and execute complex tasks on smartphones [2][3]

Group 1: Challenges in Mobile GUI Agents
- A significant challenge in training mobile GUI agents is the reliance on high-quality expert demonstration data, which is costly to obtain and limits the agents' generalization and robustness [2][7]
- Correct execution of GUI operations depends heavily on historical context, making it difficult to evaluate the effectiveness of each individual action within a task [6][7]

Group 2: UI-Genie Framework
- The UI-Genie framework enables self-evolving agents through collaboration between the agent model and a reward model, yielding high-quality synthetic data without manual annotation (see the sketch after this summary) [3][27]
- UI-Genie-RM is introduced as the first reward model specialized for evaluating mobile GUI agent trajectories, designed to take the entire operation history into account [9][10]

Group 3: Data Generation and Model Iteration
- UI-Genie employs a closed-loop mechanism for data generation and model iteration, comprising reward-guided trajectory exploration, dual expansion of training data, and progressive increases in task complexity [14][19]
- Iterative training yields significant improvements in task success rate and evaluation accuracy, with the agent's success rate rising from 18.1% to 38.7% [24]

Group 4: Performance and Future Applications
- UI-Genie outperforms baseline methods on both offline and online operation tasks, achieving a 77.0% operation success rate and 86.3% element-localization accuracy with a 72B model [21][23]
- The framework is expected to extend to more complex multimodal interaction scenarios, including desktop agents, and aims to combine reward models with reinforcement learning for autonomous growth [27][29]
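The closed loop described above can be summarized in a few lines. The sketch below shows one plausible shape of the agent/reward-model co-evolution; the interfaces (`rollout`, `score`, `train`), the acceptance threshold, and the task-hardening step are assumptions for illustration, not vivo's released code.

```python
from typing import List, Protocol, Tuple

class Agent(Protocol):
    def rollout(self, task: str) -> list: ...                    # one GUI trajectory
    def train(self, data: List[Tuple[str, list]]) -> None: ...

class RewardModel(Protocol):
    def score(self, task: str, trajectory: list) -> float: ...   # judges full history
    def train(self, data: List[Tuple[str, list, float]]) -> None: ...

def self_evolve(agent: Agent, rm: RewardModel, tasks: List[str],
                rounds: int = 3, accept: float = 0.5) -> None:
    """Reward-guided exploration -> dual data expansion -> harder tasks."""
    agent_data: List[Tuple[str, list]] = []
    rm_data: List[Tuple[str, list, float]] = []
    for _ in range(rounds):
        for task in tasks:
            traj = agent.rollout(task)          # reward-guided trajectory exploration
            s = rm.score(task, traj)            # evaluated against the whole history
            rm_data.append((task, traj, s))     # low scores serve as hard negatives
            if s >= accept:
                agent_data.append((task, traj)) # high-quality synthetic demonstration
        agent.train(agent_data)                 # both models improve each round
        rm.train(rm_data)
        tasks = [t + " (harder variant)" for t in tasks]  # progressive complexity
```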
A First Model, a First Bridge, a First Observation! This Week, China's Hard-Tech Strength Makes Headlines Again
Yang Shi Xin Wen· 2025-11-06 22:49
Group 1: Aerospace and Technology Breakthroughs
- China's Tianwen-1 spacecraft has successfully observed the interstellar object ATLAS (3I/ATLAS), the first time a Chinese spacecraft has achieved such a feat; the spacecraft was approximately 30 million kilometers from the target during the observation [3][4]
- The observation is considered an important extended mission for Tianwen-1 and will support technical trials for asteroid exploration by Tianwen-2 [3]

Group 2: Infrastructure Developments
- The Tongling Yangtze River Bridge, the world's first bridge with a double-deck, collaborative cable-stayed and suspension system, officially opened on November 6; it spans 11.88 kilometers with a main span of 988 meters and carries both rail and road traffic [5]
- The opening boosts regional river-crossing capacity and enables uninterrupted high-speed travel along the 641-kilometer Anhui section of the G3 Jing-Tai Expressway [5]

Group 3: Marine and Environmental Innovations
- China's Ministry of Natural Resources has released the world's first multimodal large model for deep-sea habitats, the "Intelligent Cognition and Exploration Multi-modal Model for Deep-sea Habitats", a significant outcome of the United Nations' "Decade of Ocean Science" initiative [8]
- The model offers intelligent perception of deep-sea habitats, comprehensive intelligent simulation, governance decision-making support, and immersive cognitive navigation [8]

Group 4: Navigation and Satellite Industry Growth
- The China Satellite Navigation Association reported strong growth in the BeiDou industry, with its comprehensive index rising steadily and application penetration expanding across multiple fields [9]
- By the end of 2024, China had 50 BeiDou navigation satellites in orbit, including 15 BeiDou-2 and 35 BeiDou-3 satellites, marking a new phase of marketization, industrialization, and internationalization for the BeiDou system [9]
- Currently, 88 BeiDou sounding stations in China are connected to the global meteorological data exchange system, contributing "Chinese precision" to global weather forecasting [9]
What Capabilities Does a Multimodal World Model Need to Become the "Brain" of Embodied Intelligence? | ToB Industry Observation
Tai Mei Ti APP· 2025-11-05 04:01
Core Insights
- The release of the Emu3.5 multimodal model by the Beijing Zhiyuan Research Institute (BAAI) marks a significant advance in AI technology: the model has 34 billion parameters, was trained on roughly 790 years' worth of video data, and achieves a 20-fold increase in inference speed through the institute's proprietary DiDA technology [2]
- The multimodal large model market in China is projected to reach 13.85 billion yuan in 2024, up 67.3% year-on-year, and to rise to 23.68 billion yuan in 2025 [2]
- By 2025, the global multimodal large model market is anticipated to exceed 420 billion yuan, with China accounting for 35% of it, making China the second-largest single market globally [2]

Multimodal Model Development
- The essence of multimodal models is to let AI perceive the world through multiple senses, with the current focus on more efficient integration, deeper understanding, and broader application [3]
- A major challenge is achieving truly native unification: about 60% of models use a "combinatorial architecture" whose cross-module information-transfer losses degrade performance [3]
- Emu3.5 uses a single Transformer with an autoregressive architecture to achieve native unification of multimodal understanding and generation, addressing the communication problem between modalities [3]

Data Challenges
- Most multimodal models rely on fragmented internet data such as image-text pairs and short videos, which limits their ability to learn complex physical laws and causal relationships [4]
- Emu3.5's breakthrough lies in its extensive use of long video data, which provides the rich context and coherent narrative logic essential for understanding how the world operates [4]
- High-quality multimodal data is costly to acquire, and regulatory pressure around sensitive data in fields like healthcare and finance hinders large-scale training [4]

Performance and Efficiency
- Balancing performance and efficiency is a critical issue, as gains in model performance often come at the cost of efficiency, particularly in the multimodal domain [5]
- Before 2024, mainstream models took over 3 seconds to generate a 5-second video, and response delays in mobile applications were a significant barrier to real-time interaction [5]
- The release of Emu3.5 suggests that multimodal scaling laws are being validated, positioning this direction as a potential "third paradigm" after language pre-training and post-training inference [5]

Embodied Intelligence
- The development of embodied intelligence is hindered by data-acquisition costs and the gap between simulation and reality, which degrades model performance in unfamiliar environments [6][7]
- Emu3.5's "Next-State Prediction" capability strengthens the model's physical intuition, enabling safer and more efficient decision-making in dynamic environments (see the sketch after this summary) [7][8]
- Integrating multimodal world models into embodied intelligence could let a single unified model handle the complete cycle of perception, cognition, and action [8]

Broader Applications
- The impact of multimodal models extends beyond embodied intelligence, promising revolutionary applications across healthcare, industry, media, and transportation [9]
- In healthcare, combining multimodal capabilities with medical imaging can significantly improve early disease detection and treatment precision [9][10]
- The ability to generate personalized treatment plans from extensive multimodal medical data demonstrates the models' transformative potential for patient care and operational efficiency [10]
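As a rough illustration of next-state prediction in a single decoder-only Transformer over interleaved vision and text tokens, consider the loop below. The HuggingFace-style `model(ids).logits` interface, greedy decoding, and the optional modality mask are assumptions, not Emu3.5's code; sequential decoding of this kind is also the natural target of the DiDA speedup the article mentions.

```python
import torch

@torch.no_grad()
def predict_next_state(model, ids: torch.Tensor, n_new: int,
                       modality_mask=None) -> torch.Tensor:
    """Greedy autoregressive rollout of the next n_new interleaved tokens.

    `ids` is a (1, T) sequence of vision+text token ids; a single Transformer
    predicts the next token regardless of modality (sketch, not Emu3.5's code).
    """
    seq = ids
    for _ in range(n_new):
        logits = model(seq).logits[:, -1, :]        # next-token distribution
        if modality_mask is not None:               # restrict to one modality's vocab
            logits = logits.masked_fill(~modality_mask, float("-inf"))
        nxt = logits.argmax(dim=-1, keepdim=True)   # greedy for simplicity
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, ids.shape[1]:]                    # the predicted next-state tokens
```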
Do Multimodal Large Models Understand Physical Tools? PhysToolBench Proposes a Benchmark for Measuring Their Understanding of Physical Tools
机器之心· 2025-11-04 08:52
Core Insights
- The article presents PhysToolBench, a benchmark for evaluating how well multimodal large models understand physical tools, and highlights the need for these models to improve at recognizing, understanding, and creating tools [2][22]

PhysToolBench Introduction
- PhysToolBench organizes physical-tool understanding into three levels: recognizing tools, understanding tools, and creating tools [2][5]
- The benchmark comprises over 1,000 image-text pairs in which a model must identify the appropriate tool for a given task from visual input [5]

Evaluation Criteria
- The evaluation covers 32 of the latest multimodal large models, spanning proprietary, open-source, and embodied-intelligence-specific models [7]
- The assessment is structured into three difficulty levels: Easy (tool recognition), Medium (tool understanding), and Hard (tool creation); a sketch of such an evaluation loop follows this summary [8][6]

Model Performance
- The top performer, GPT-5, scored 62.15% overall, but many models scored below 50% at the higher difficulty levels, a significant gap relative to human performance [13]
- Proprietary models generally outperformed open-source models, and larger models showed stronger capabilities [13]

Specific Findings
- Models struggled to recognize and understand tools, particularly to judge whether tools were usable, creating potential safety risks [18]
- The research indicates that reasoning capability, especially vision-centric reasoning, is crucial for using physical tools effectively [19][22]

Future Directions
- The findings suggest that improving the understanding, use, and creation of complex physical tools is essential for progress toward general intelligence in AI [22]
- The article encourages further exploration and development in this area, providing links to the paper, code, and datasets for interested readers [23]
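An evaluation harness for a benchmark of this shape is straightforward. The sketch below assumes a JSONL file with `image`, `task`, `answer`, and `level` fields and an exact-match criterion; the real PhysToolBench layout and scoring may differ.

```python
import json
from collections import defaultdict
from typing import Callable, Dict

def evaluate(answer_fn: Callable[[str, str], str], path: str) -> Dict[str, float]:
    """Per-difficulty accuracy over (image, task) -> tool-name predictions."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)                 # assumed fields, see lead-in
            pred = answer_fn(ex["image"], ex["task"])
            totals[ex["level"]] += 1              # "Easy" / "Medium" / "Hard"
            hits[ex["level"]] += int(pred.strip().lower() == ex["answer"].lower())
    return {lvl: hits[lvl] / totals[lvl] for lvl in totals}
```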
Breaking Free of Microsoft Dependence: OpenAI Reaches $38 Billion Compute Procurement Agreement with Amazon Web Services
Huan Qiu Wang· 2025-11-04 02:45
Core Insights
- OpenAI has signed a significant computing-resource procurement agreement with Amazon Web Services (AWS) worth up to $38 billion, a strategic move to reduce reliance on Microsoft and diversify its technology ecosystem [1][2]

Group 1: Agreement Details
- The agreement lets OpenAI immediately begin deploying workloads on AWS infrastructure, initially using hundreds of thousands of NVIDIA high-performance GPUs in the U.S. to build computing clusters [2]
- OpenAI plans to keep expanding its resource scale over the coming years to meet growing demands for model training and inference [2]

Group 2: Strategic Implications
- The partnership is seen as a key signal of OpenAI's shift away from single-vendor dependence on Microsoft, its core investor and long-time provider of computing support through the Azure cloud platform [2]
- The initial NVIDIA GPU clusters will focus on supporting OpenAI's multimodal large model development and real-time inference services, underscoring the company's ambitions for commercializing AI technology [2]

Group 3: Industry Context
- As the global AI industry expands into high-compute scenarios such as autonomous driving, robotics, and medical diagnostics, reliance on infrastructure is expected to keep rising, making this collaboration a potential new paradigm for resource integration in the industry [2]
While You're Still Agonizing Over a Research Direction, Other Students Already Have CCF-A Papers...
具身智能之心· 2025-11-04 00:05
Group 1
- The article introduces a new research-guidance service focused on embodied intelligence, addressing common challenges newcomers face in selecting research topics and methodologies [1][2]
- The guidance covers advanced topics such as multimodal large models, reinforcement learning, and robot simulation, with tailored one-on-one support [2][3]
- The service is backed by a team of experienced mentors from prestigious institutions and leading companies, ensuring high-quality assistance throughout the research process [2][3]

Group 2
- The program emphasizes a dual industry-academia perspective, aiming not only at publication but also at practical application and value [3]
- An introductory offer is available for the first ten inquiries, giving students personalized mentorship and tailored advice on suitable conferences and journals [4]
How Can Large Models Read Charts Accurately? Microsoft Research Asia Teaches Them to "Look, Act, and Reason"
量子位· 2025-11-03 03:12
Core Insights
- The article covers PixelCraft, a system developed by Microsoft Research Asia with Tsinghua University and the Hong Kong University of Science and Technology that improves structured-image understanding through high-fidelity image processing and nonlinear multi-agent reasoning [2][31]

Group 1: Challenges in Structured Image Understanding
- Traditional models struggle with structured images such as charts and scientific drawings, which demand both pixel-level detail and symbolic abstraction that existing methods do not adequately provide [3][4]
- The limitations of linear chain-of-thought reasoning prevent the backtracking and branching exploration that complex tasks require [2][5]

Group 2: PixelCraft's Approach
- PixelCraft tackles these challenges on two fronts: ensuring accurate perception ("seeing clearly") and enabling flexible reasoning ("thinking flexibly") [5]
- The system comprises a dispatcher, a planner, a reasoner, visual and planning critics, and a set of visual tool agents that work together on structured-image understanding (see the sketch at the end of this summary) [7][31]

Group 3: High-Fidelity Image Processing
- A finely tuned grounding model maps textual references to pixel-level coordinates, enabling a semi-automated tool-generation process for image editing [10][13]
- A three-stage workflow covers tool selection, collaborative discussion with backtracking, and self-review with re-planning, allowing selective memory usage and reducing the burden of long contexts [7][18]

Group 4: Performance Improvements
- PixelCraft shows significant performance improvements on benchmarks such as CharXiv, ChartQAPro, and EvoChart, with consistent gains across different base models [23][32]
- High-fidelity localization and the closed-loop tool approach reduce error propagation, improving accuracy and robustness in reasoning over structured images [18][33]

Group 5: Experimental Results
- Comparative results indicate that PixelCraft outperforms methods like VisualCoT on structured-image tasks, underscoring the importance of selective memory and discussion-based backtracking [27][28]
- Chart-specific tools such as subplot cropping and auxiliary-line annotation prove essential for effective reasoning in structured-image contexts [29][30]
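The dispatcher/planner/critic interplay can be compressed into a single control loop. Below is a minimal sketch of the three-stage workflow under assumed interfaces (every callable here is hypothetical); the actual PixelCraft system distributes these roles across separate agents rather than one function.

```python
from typing import Callable, Dict, List, Tuple

def pixelcraft_loop(question: str, image: bytes,
                    planner: Callable, reasoner: Callable,
                    tools: Dict[str, Callable],
                    visual_critic: Callable, plan_critic: Callable,
                    max_rounds: int = 4):
    """Tool selection -> discussion with backtracking -> self-review/re-planning."""
    memory: List[str] = []                       # selective memory, not full context
    for _ in range(max_rounds):
        plan: List[Tuple[str, dict]] = planner(question, image, memory)
        artifacts = [tools[name](image, **args) for name, args in plan]
        if not visual_critic(question, artifacts):    # e.g., bad crop or annotation
            memory.append("visual check failed; backtracking")
            continue                                  # branch to a new plan
        answer = reasoner(question, image, artifacts, memory)
        if plan_critic(question, answer):             # self-review passed
            return answer
        memory.append(f"rejected answer: {answer}")   # keep only what helps later
    return reasoner(question, image, [], memory)      # best-effort fallback
```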