Foundation Models
BUPT's First IJRR Paper, Jointly with Samsung Research China, Tsinghua University, and Others, Explores "Large Models for Robotic Manipulation"
机器人大讲堂· 2025-11-24 08:31
Core Insights
- The article discusses the challenges and opportunities in achieving general-purpose robotics, particularly in unstructured environments, and highlights the role of foundation models in enhancing robot learning and manipulation capabilities [1][4][6].

Group 1: Challenges in General-Purpose Robotics
- Achieving general-purpose operations in robotics faces several challenges, including unnatural human interactions, data scarcity, limited perception and decision-making abilities, inaccurate processing, and poor robustness [1].
- Current end-to-end training methods using single foundation models, such as RFMs, struggle to maintain a success rate above 99.X% [6].

Group 2: Foundation Models and Their Applications
- Foundation models have the potential to address the challenges in robotics by enhancing natural interaction, perception in open environments, and multi-modal information understanding [4].
- Various types of foundation models are identified, including LLMs for generating action sequences, VFMs for perception enhancement, and VLMs for visual-language alignment [4].

Group 3: Framework for General Operations
- A proposed framework for general operations in robotics categorizes initial operations (the L0 level) by criteria such as executing previously learned skills and operating in static environments [6].
- The framework aims to improve the performance of individual modules so that robots can transition from L0 operations to unified operations [6].

Group 4: Interaction and Communication
- Interaction between humans and robots can occur through task instructions or collaborative efforts, with foundation models enabling more natural language communication and better understanding of user intent [8].
- Foundation models enhance the ability to detect ambiguities in instructions and provide corrective feedback [8].

Group 5: Pre- and Post-condition Detection
- The article emphasizes the importance of detecting pre-conditions and post-conditions in task execution, with foundation models improving object affordance detection and recognition capabilities [10].
- Foundation models enable zero-shot recognition of new object categories and accelerate the learning of object affordances [10].

Group 6: Skill Hierarchy and Task Planning
- Integrating learning-based methods into task and motion planning (TAMP) enhances decision-making flexibility and generalization [12].
- Foundation models assist in processing natural language inputs and improve the scalability of skill-hierarchy tasks [12].

Group 7: State Perception and Estimation
- State perception involves understanding the environment, objects, and the robot's own state, with foundation models aiding in semantic scene reconstruction and pose estimation [14].
- Challenges remain in achieving zero-shot pose estimation in open environments [14].

Group 8: Policy Development
- Policies in robotics can be categorized into object/action-based methods and end-to-end methods, with foundation models evolving these policies toward general objectives [16].
- The classification of policies covers various output types, broadening the range of tasks a robot can perform [16].

Group 9: Data Generation for Manipulation
- The article discusses generating manipulation data from real machines, simulations, and internet sources, highlighting the need for low-cost teleoperation devices [20].
- Foundation models enable automated scene generation and realistic data augmentation, improving the efficiency of data collection [20].

Group 10: Future Directions and Open Questions
- The article concludes with a discussion of the design logic of the general-operation framework and the need for further exploration in areas such as learning capabilities and the use of large-scale video data [23].
- The potential of AI-driven general operations in robotics is emphasized, along with the open question of how far these advances can go in practical applications [23].
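The L0 entry level described above (previously learned skills, static environments) can be sketched as a toy classifier. This is a minimal illustration under stated assumptions: the dataclass, field names, and the "beyond-L0" label are placeholders of mine, not the survey's taxonomy.

```python
from dataclasses import dataclass

@dataclass
class ManipulationTask:
    uses_new_skill: bool       # requires a skill outside the trained set?
    dynamic_environment: bool  # does the scene change while the robot acts?

def operation_level(task: ManipulationTask) -> str:
    """Classify a task against the survey's entry level: L0 covers
    previously learned skills executed in static environments. The
    'beyond-L0' label is a placeholder, not the paper's terminology."""
    if not task.uses_new_skill and not task.dynamic_environment:
        return "L0"
    return "beyond-L0"

# A pick-and-place of a known object on a fixed tabletop stays at L0;
# grasping a novel object on a moving conveyor falls outside it.
static_known = ManipulationTask(uses_new_skill=False, dynamic_environment=False)
moving_novel = ManipulationTask(uses_new_skill=True, dynamic_environment=True)
```

The framework's point is that each axis a task fails on (novel skill, dynamic scene) names a module that must improve before unified operation is reachable.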
Chinese and Foreign Experts Jointly Explore AI Frontiers and Industrial Empowerment
Xin Lang Cai Jing· 2025-11-21 07:23
Core Insights
- The fifth Intelligent Computing Innovation Forum was held in Hangzhou, focusing on the theme "Computing Relies on Intelligence, Computing for Intelligence," attracting international experts to discuss advancements in AI technologies and their applications across various scientific fields [1]

Group 1: AI Model Development
- Scientists are exploring the potential of AI in solving scientific problems, emphasizing that current large language models have not yet reached human-level reasoning capabilities [2]
- The development of scientific foundational models requires collaboration with scientists to effectively tokenize and train diverse scientific data, addressing complex interdisciplinary issues [2]
- The learning paradigm of foundational models is evolving through imitation learning, reinforcement learning, and autonomous learning, with a shift towards task processing applications [2]

Group 2: Efficiency and Resource Consumption
- The efficiency of foundational models is critical for large-scale AI application deployment, with a noted exponential increase in token consumption correlating with model capability improvements [3]
- The cost of generating tokens decreases with higher reasoning efficiency, necessitating collaborative optimization across the industry to enhance model performance [3]

Group 3: Practical Applications and Collaboration
- The application of intelligent systems in dynamic environments is gaining attention, highlighting the importance of responsive robotics [4]
- China is recognized for its leading capabilities in intelligent manufacturing, serving as an excellent testing ground for new technology applications [4]
- There is a call for scientists worldwide to establish collaborative networks to enhance research outcomes and create new possibilities through cooperation [4]
Liu Debing on the Ceiling, Liu Zhiyuan on the Inflection Point: China's Next Decade of AI, Revealed Ahead of Schedule
36Ke· 2025-11-20 09:57
Core Insights
- The future of AI in China is becoming clearer, with significant developments expected over the next decade, particularly in foundational models and intelligent agents [2][4][21]

Group 1: Foundational Models
- Foundational models are crucial because they determine the upper limit of the AI industry, with open source becoming a mainstream approach that rapidly amplifies performance differences [2][6]
- The company has embraced open source, having released over 50 models, more than 40 of them open-sourced, yielding substantial commercial benefits and user engagement [5][6]
- Competition among foundational-model companies is intensifying, with high costs and a need for practical validation of model performance [6][10]

Group 2: Intelligent Agents
- A significant turning point in 2025 is expected to be "AI + Programming," which is becoming a vital support for software productivity [3][17]
- The transition from large models to intelligent agents requires models that can learn autonomously in specific job roles, much as a graduate becomes an expert through feedback from real tasks [3][18]
- The development of intelligent agents is seen as a critical phase in which models must not only accumulate knowledge but also decide what to learn and how to grow in practical applications [18][19]

Group 3: Industry Applications
- AI applications are maturing across sectors including the internet, finance, and education, with deeper integration expected in areas such as smart manufacturing and energy [8][10]
- The next decade is anticipated to see AI become a universal capability, necessitating widespread education and participation in AI development [8][12]
- The rapid growth of AI applications is evident, with models like GLM-4.6 achieving top rankings in international evaluations, showcasing the competitive capabilities of Chinese AI [10][11]

Group 4: Future Outlook
- The next ten years are viewed as a critical period for AI, with the potential for China to transition from "catching up" to "keeping pace" and possibly leading in certain areas [10][21]
- The focus will be on industry-wide collaboration to improve data, computing power, models, and applications, which is essential for sustained development [12][14]
- The overarching theme for the next decade is the collaboration and coexistence of AI and humans, emphasizing improvements in foundational technologies and practical industry applications [13][14][21]
Zhongtai Securities: Gemini 3 Pro's Capabilities Leap Across the Board, Creating a New Landscape for Agent Platforms
Zhi Tong Cai Jing· 2025-11-20 08:01
Core Insights
- The release of Gemini 3 by Google demonstrates significant advances in AI model capabilities, indicating that progress in model intelligence has not yet reached its ceiling [1][2]
- The report suggests focusing on companies with strong fundamentals in the foundational computing layer, the model layer, and B-end vendors that deeply integrate services into business processes [1]

Investment Events
- Google officially launched the Gemini 3 series, including the Gemini 3 Pro model, on November 18, 2025, achieving state-of-the-art (SOTA) performance across multiple evaluation dimensions [1]

Performance Metrics
- Gemini 3 Pro scored 37.5% on Humanity's Last Exam, surpassing GPT-5.1 (26.5%) and Claude Sonnet 4.5 (13.7%), showcasing doctoral-level reasoning capabilities [2]
- On the MathArena Apex test, Gemini 3 Pro scored 23.4%, far outperforming GPT-5.1 (1.0%) and Claude Sonnet 4.5 (1.6%), indicating a leap in deep reasoning ability [2]

Multi-Modal Architecture and User Interface
- Gemini 3 Pro retains the original multi-modal architecture and introduces a Generative User Interface (Generative UI) that customizes interactive responses based on user prompts [3]
- Google launched the Antigravity platform for AI-agent development, letting developers use models such as Gemini 3 Pro and Claude Sonnet 4.5 for free and enhancing programming efficiency through autonomous task execution [3]

Search Enhancements
- Google has upgraded its search capabilities with Gemini 3, improving query fan-out technology to enhance search efficiency and the user experience through interactive tools and dynamic visual presentations [4]

Ecosystem Trends
- The report highlights a trend of major foundational-model companies building comprehensive ecosystems, with firms such as OpenAI, Anthropic, and Google transitioning from model providers to platform developers [5]
- In coding scenarios, tools like Antigravity and Anthropic's Claude Code are being integrated into foundational models, blurring the line between standalone SaaS products and model modules [5]
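The benchmark numbers quoted in the report can be collected into a small comparison table in code; the scores are the ones stated above, while the `leader` and `margin` helpers are illustrative additions of mine, not anything from the report.

```python
# Benchmark scores quoted in the report (percent correct).
scores = {
    "Humanity's Last Exam": {
        "Gemini 3 Pro": 37.5, "GPT-5.1": 26.5, "Claude Sonnet 4.5": 13.7,
    },
    "MathArena Apex": {
        "Gemini 3 Pro": 23.4, "GPT-5.1": 1.0, "Claude Sonnet 4.5": 1.6,
    },
}

def leader(benchmark: str) -> str:
    """Return the top-scoring model on a benchmark."""
    models = scores[benchmark]
    return max(models, key=models.get)

def margin(benchmark: str) -> float:
    """Gap in points between the leader and the runner-up."""
    vals = sorted(scores[benchmark].values(), reverse=True)
    return round(vals[0] - vals[1], 1)
```

The margins make the report's point concrete: the MathArena Apex gap (over 20 points) is far larger than the Humanity's Last Exam gap (11 points), which is why the report singles out deep reasoning as the standout improvement.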
OmniDexGrasp Revealed: Foundation Models + Force Feedback, a General Recipe for Robots That "Understand Instructions and Grasp Dexterously"
具身智能之心· 2025-10-31 00:04
Core Insights
- The article discusses the OmniDexGrasp framework, which addresses the challenges of dexterous grasping in robotics by combining foundation models with force-feedback control to achieve generalizable and physically feasible grasping [1][2][21].

Group 1: Challenges in Dexterous Grasping
- Current dexterous-grasping solutions face a dilemma between data-driven approaches, which struggle with generalization due to limited datasets, and foundation models, which often fail to translate abstract knowledge into physical actions [2].
- The core issue is the inability to balance generalization and physical feasibility, leading to failures in grasping new objects or in complex scenarios [2].

Group 2: OmniDexGrasp Framework
- OmniDexGrasp employs a three-stage approach: generating human grasping images, action transfer to robots, and force-feedback control, effectively bridging the gap between abstract knowledge and physical execution [4][21].
- The framework retains the generalization capabilities of foundation models while ensuring physical feasibility through precise action transformation and control strategies [4].

Group 3: Key Modules of OmniDexGrasp
- **Module 1**: Generates human grasping images to help robots understand how to grasp objects, utilizing a variety of input designs to accommodate different user needs [6][8].
- **Module 2**: Translates human grasping images into robot actions, aligning human intent with robotic capabilities through a three-step transfer strategy [9][12].
- **Module 3**: Implements force-feedback control to ensure stable and safe grasping, adapting to the physical properties of objects and preventing damage during the grasping process [12][13].

Group 4: Experimental Results
- OmniDexGrasp achieved an average success rate of 87.9% across six core grasping tasks, significantly outperforming traditional methods [15].
- In comparative tests, OmniDexGrasp showed superior generalization, especially with new objects, achieving success rates that far exceeded those of existing solutions [16][18].

Group 5: Future Directions
- The framework suggests future enhancements through multi-modal observation integration and deeper control-task development, aiming for end-to-end general manipulation capabilities [22].
- The potential for OmniDexGrasp to extend beyond grasping to broader manipulation tasks is highlighted, indicating its versatility in robotic applications [20].
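The three-stage pipeline (human grasp image → action transfer → force-feedback closing) can be sketched as follows. This is a toy illustration, not the paper's implementation: the function names, the linear spring contact model, and all numeric constants are assumptions of mine, chosen only to show how a force target, rather than a position target, keeps a soft object from being crushed.

```python
from dataclasses import dataclass

@dataclass
class GraspState:
    grip_force: float  # measured fingertip force (N)
    closure: float     # normalized hand closure in [0, 1]

def generate_human_grasp_image(instruction: str) -> str:
    """Stage 1 (sketch): a foundation model renders a human grasp image
    for the instructed object; here we just return a placeholder tag."""
    return f"human_grasp_image<{instruction}>"

def transfer_to_robot_action(grasp_image: str) -> float:
    """Stage 2 (sketch): map the human grasp to a target hand closure.
    A real system would retarget hand keypoints to robot joint angles."""
    return 0.6  # placeholder target closure

def force_feedback_close(target_force: float, stiffness: float = 20.0,
                         step: float = 0.02, max_iters: int = 100) -> GraspState:
    """Stage 3 (sketch): close the hand until the measured force reaches
    the target, so soft objects are not crushed. Contact is modeled as a
    linear spring: force = stiffness * (closure - contact_point)."""
    contact_point = 0.5  # closure at which fingers first touch the object
    closure, force = 0.0, 0.0
    for _ in range(max_iters):
        force = max(0.0, stiffness * (closure - contact_point))
        if force >= target_force:
            break  # stop closing once the grip is firm enough
        closure += step
    return GraspState(grip_force=force, closure=closure)

img = generate_human_grasp_image("pick up the paper cup")
_ = transfer_to_robot_action(img)
state = force_feedback_close(target_force=2.0)
```

The key design point the sketch captures is that the loop terminates on a force condition, not a position one, so the same controller adapts to objects of different stiffness without retuning.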
Confirmed: The More GPUs, the Higher the Paper Acceptance Rate and Citation Count
机器之心· 2025-10-17 08:12
Core Insights
- The article discusses the significant advancements in the AI field over the past three years, primarily driven by the development of foundational models, which require substantial data, computational power, and human resources [2][4].

Resource Allocation and Research Impact
- The relationship between hardware resources and the publication of top-tier AI/ML conference papers has been analyzed, focusing on GPU availability and TFLOPs [4][5].
- A total of 5,889 foundational-model papers were identified, revealing that stronger GPU acquisition capabilities correlate with higher acceptance rates and citation counts in eight leading conferences [5][9].

Research Methodology
- The study collected structured information from 34,828 accepted papers between 2022 and 2024, identifying 5,889 related to foundational models through keyword searches [8][11].
- A survey of 229 authors of 312 papers indicated a lack of transparency in GPU-usage reporting, highlighting the need for standardized resource disclosure [9][11].

Growth of Foundational Model Research
- From 2022 to 2024, foundational-model research has seen explosive growth, with the proportion of related papers at top AI conferences rising significantly [18][19].
- At NLP conferences, foundational-model papers have outpaced those at general machine-learning conferences [22].

Research Contributions by Academia and Industry
- Academic institutions contributed more papers overall, while top industrial labs excelled in single-institution output, with Google and Microsoft leading in paper production [29][32].
- Research efficiency is comparable between academia and industry, with industry researchers publishing an average of 8.72 papers and academics 7.93 [31].

Open-Source Models and GPU Usage
- Open-source models, particularly the LLaMA series, have become the predominant choice in research, favored for their flexibility and accessibility [35][37].
- The NVIDIA A100 is the most widely used GPU in foundational-model research, with a notable concentration of GPU resources among a few institutions [38][39].

Funding Sources and Research Focus
- Government funding is the primary source for foundational-model research, with 85.5% of papers receiving government support [41][42].
- The focus of research has shifted toward algorithm development and inference processes, with a significant portion of papers dedicated to these areas [42].

Computational Resources and Research Output
- Total computational power measured in TFLOPs is more strongly correlated with research output and citation impact than the sheer number of GPUs used [44][45].
- While more resources can improve acceptance rates, the quality of research and its novelty remain critical factors in the review process [47].
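The study's finding that total TFLOPs tracks impact more closely than raw GPU count can be illustrated with a toy correlation check. The per-paper records below are hypothetical numbers invented for this sketch, not the study's data; the point is simply that many old GPUs can mean less aggregate compute than a few new ones.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient over two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-paper records: (num_gpus, total_tflops, citations).
# GPU generations differ widely in throughput, so raw counts mislead.
papers = [
    (8,   2496, 12),  # 8 newer cards (~312 TFLOPS each)
    (64,  1216, 10),  # 64 older ~19-TFLOPS cards: many GPUs, less compute
    (16,  4992, 35),
    (4,   3958, 30),  # 4 top-end cards
    (32,   608, 6),
]
gpus   = [p[0] for p in papers]
tflops = [p[1] for p in papers]
cites  = [p[2] for p in papers]

r_gpus   = pearson(gpus, cites)
r_tflops = pearson(tflops, cites)
```

On this constructed sample, citations correlate strongly with aggregate TFLOPs but not with GPU count, mirroring the direction of the study's result.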
2025 Yunqi Conference Opens in Hangzhou, Thousands of Tech Products on Display
Zhong Guo Xin Wen Wang· 2025-09-25 01:17
[Photo gallery from chinanews.com.cn: exhibition-floor shots from the 2025 Yunqi Conference, including Alibaba's Tongyi booth and a banner reading "World's Leading Foundation Model Family"; no further text is recoverable.]
Foundation Models for Autonomous Driving Should Be Capability-Oriented, Not Confined to Methods Alone
自动驾驶之心· 2025-09-16 23:33
Core Insights
- The article discusses the transformative impact of foundational models on the autonomous-driving perception domain, shifting from task-specific deep-learning models to versatile architectures trained on vast and diverse datasets [2][4]
- It introduces a new classification framework focusing on four core capabilities essential for robust performance in dynamic driving environments: general knowledge, spatial understanding, multi-sensor robustness, and temporal reasoning [2][5]

Group 1: Introduction and Background
- Autonomous-driving perception is crucial for enabling vehicles to interpret their surroundings in real time, involving key tasks such as object detection, semantic segmentation, and tracking [3]
- Traditional models, designed for specific tasks, exhibit limited scalability and poor generalization, particularly in "long-tail scenarios" where rare but critical events occur [3][4]

Group 2: Foundational Models
- Foundational models, developed through self-supervised or unsupervised learning strategies, leverage large-scale datasets to learn general representations applicable across various downstream tasks [4][5]
- These models demonstrate significant advantages for autonomous driving due to their inherent generalization capabilities, efficient transfer learning, and reduced reliance on labeled datasets [4][5]

Group 3: Key Capabilities
- The four key dimensions for designing foundational models tailored to autonomous-driving perception are:
  1. General knowledge: the ability to adapt to a wide range of driving scenarios, including rare situations [5][6]
  2. Spatial understanding: deep comprehension of 3D spatial structures and relationships [5][6]
  3. Multi-sensor robustness: maintaining high performance under varying environmental conditions and sensor failures [5][6]
  4. Temporal reasoning: capturing temporal dependencies and predicting future states of the environment [6]

Group 4: Integration and Challenges
- The article outlines three mechanisms for integrating foundational models into autonomous-driving technology stacks: feature-level distillation, pseudo-label supervision, and direct integration [37][40]
- It highlights the challenges of deploying these models, including the need for effective domain adaptation, addressing hallucination risks, and ensuring efficiency in real-time applications [58][61]

Group 5: Future Directions
- The article emphasizes the importance of advancing research on foundational models to enhance their safety and effectiveness in autonomous-driving systems, addressing current limitations and exploring new methodologies [2][5][58]
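The four capability axes above can be expressed as a simple scorecard, which is the practical upshot of a capability-oriented (rather than method-oriented) framework: evaluate a model per axis and surface its weakest one. The class, field names, and scores below are illustrative assumptions of mine, not an artifact of the article.

```python
from dataclasses import dataclass

@dataclass
class PerceptionCapabilities:
    """Capability-oriented scorecard for an autonomous-driving
    perception foundation model (all scores in [0, 1])."""
    general_knowledge: float      # long-tail / rare-scenario generalization
    spatial_understanding: float  # 3D structure and spatial relations
    sensor_robustness: float      # degraded weather, sensor dropout
    temporal_reasoning: float     # motion forecasting over time

    def weakest_axis(self) -> str:
        """Name the capability that most needs improvement."""
        scores = vars(self)
        return min(scores, key=scores.get)

# Hypothetical evaluation of some model under this rubric.
model = PerceptionCapabilities(
    general_knowledge=0.7,
    spatial_understanding=0.8,
    sensor_robustness=0.4,  # e.g. degrades badly under heavy rain
    temporal_reasoning=0.6,
)
```

Scoring per capability rather than per method makes models trained with entirely different techniques directly comparable, which is the article's argument for the taxonomy.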
Nature Medicine: Sheng Bin and Tien Yin Wong's Team Develops an Ophthalmic AI Foundation Model, Significantly Improving Ophthalmologists' Diagnosis and Patient Outcomes
生物世界· 2025-09-01 08:30
Core Viewpoint
- The article emphasizes the significant advancement of foundation models (FMs) for the potential applications of artificial intelligence (AI) in clinical care, highlighting the need for rigorous prospective validation and randomized controlled trials to bridge the gap between AI capabilities and real-world clinical environments [2][3][6].

Group 1: Foundation Model Development
- A multi-modal vision-language ophthalmic foundation model named EyeFM was developed and validated through prospective deployment across various global regions, including Asia, North America, Europe, and Africa [3][6].
- EyeFM was pre-trained on a diverse dataset of 14.5 million eye images, enabling it to perform various core clinical tasks effectively [6][11].

Group 2: Clinical Evaluation and Effectiveness
- The effectiveness of EyeFM as a clinical-assistance tool was evaluated in a randomized controlled trial involving 668 participants, showing a higher correct-diagnosis rate of 92.2% compared to 75.4% in the control group [11][13].
- The study also indicated improved referral rates (92.2% vs 80.5%) and better self-management adherence (70.1% vs 49.1%) in the intervention group using EyeFM [11][13].

Group 3: Application and Future Implications
- EyeFM serves as a comprehensive assistance system for ophthalmology, with potential applications across various clinical scenarios, enhancing the diagnostic capabilities of ophthalmologists and improving patient outcomes [12][13].
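The trial's headline comparison (92.2% vs 75.4% correct diagnoses among 668 participants) can be sanity-checked with a standard two-proportion z-test. Note the assumptions: the summary does not give arm sizes, so the even 334/334 split below is my guess, and success counts are rounded from the reported percentages; this sketch is a plausibility check, not the paper's statistical analysis.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# ASSUMPTION: an even 334/334 split of the 668 participants.
n_arm = 334
correct_eyefm   = round(0.922 * n_arm)  # 92.2% correct diagnoses
correct_control = round(0.754 * n_arm)  # 75.4% in the control arm
z, p = two_proportion_z(correct_eyefm, n_arm, correct_control, n_arm)
```

Under these assumptions the gap is many standard errors wide, consistent with the trial reporting it as a significant improvement rather than noise.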
FDA Has Approved Over 1,200 AI Medical Devices: Beyond Imaging, Where Are the New Expanding Specialties?
思宇MedTech· 2025-08-21 03:50
Core Viewpoint
- Artificial intelligence (AI) is rapidly penetrating the medical-device field, with over 1,200 AI/ML medical devices approved by the FDA as of July 2025, including a record 235 devices approved in 2024, indicating that AI is becoming a significant part of clinical practice [2][4].

Group 1: AI in Medical Imaging
- Radiology remains the dominant application area for AI, focusing on tasks such as automatic image segmentation, lesion detection, and risk screening [4].
- The cardiovascular specialty is adopting AI at an accelerating pace, expanding from ECG rhythm analysis to cardiac ultrasound and CT coronary imaging, driven by the high prevalence of cardiovascular disease and the suitability of imaging data for AI training; applications include arrhythmia detection and heart-failure risk prediction [5][6][7][8].

Group 2: AI in Neurology
- In neurology, AI's initial entry point is acute-stroke image recognition [7][8].
- AI systems can automatically interpret CT/MRI scans within minutes, identifying potential ischemic or hemorrhagic lesions and notifying neurologists, thus shortening the "golden hour" for treatment [9].
- Neurology is emerging as a new growth area for FDA approvals due to high-risk, high-value disease scenarios, such as the urgent need for stroke decision-making and unmet needs in epilepsy and dementia [10].

Group 3: Emerging Specialties
- Other specialties, including endoscopy and pathology, are also seeing rapid growth in AI medical devices, with applications in automatic identification of polyps and early tumors during gastrointestinal examinations [12].
- AI is enhancing efficiency in pathology by automating the identification and classification of digital pathology slides, allowing pathologists to quickly locate suspicious areas [12].

Group 4: Regulatory Challenges
- As the number of FDA-approved AI medical devices surpasses 1,200, regulatory challenges are emerging, particularly in keeping pace with technological advancements [11].
- The focus of FDA regulation is shifting from merely approving numbers of AI devices to balancing innovation with safety, necessitating a reevaluation of regulatory frameworks as AI evolves from a "tool" to a "partner" in healthcare [11][14].