Multimodal Understanding
Google's Gemini 3 Launches Overnight Worldwide, Dealing a Heavy Blow to GPT-5.1; Altman Offers Rare Congratulations
36Kr· 2025-11-19 00:07
In the early hours, Google's ultimate weapon, Gemini 3, arrived in force, leading straight with the top-spec Pro version and billed as a three-in-one AI powerhouse: "strongest-ever reasoning + multimodal + vibe coding". It swept the benchmarks, even cutting GPT-5.1 down, and opened the next era of AI. It's here, it's here! Just now, the year's most anticipated finale, Google's next-generation flagship Gemini 3, made its blockbuster debut. And it launched straight away with the top-spec Gemini 3 Pro: the strongest model to date in reasoning, in multimodal understanding, and in "agents" + "vibe coding". How strong? Within an hour of the release, even OpenAI CEO Sam Altman personally posted his congratulations. And with the capitalization intact, no less (perhaps he had tried it himself). Hands-on tests bear this out. Across a host of benchmarks, Gemini 3 Pro was crowned: not only did it deliver an across-the-board performance leap over 2.5 Pro, it left OpenAI's freshly released GPT-5.1 several streets behind.

| Rank | Model | Score |
| --- | --- | --- |
| 1 | gemini-3-pro | 1501 |

...
Baidu Releases the ERNIE (Wenxin) 5.0 Large Model, Supporting Multimodal Understanding
Xin Lang Ke Ji· 2025-11-13 03:44
Sina Tech, November 13: at the Baidu World 2025 conference held this morning, Baidu founder Robin Li announced the official release of the ERNIE (Wenxin) 5.0 large model, which supports multimodal understanding, creative writing, agent planning, and related capabilities. According to Baidu CTO Wang Haifeng, ERNIE 5.0 is a natively omni-modal large model, featuring native omni-modal modeling and unified understanding and generation. It achieved leading results across multiple international evaluations. (Editor in charge: Yang Ci) ...
Tackling Long Documents and Multimodal Challenges: Paper2Video Automates Academic Video Production
机器之心· 2025-10-23 02:22
Core Insights - The article discusses the challenges and solutions in automating the generation of academic presentation videos, highlighting the need for a systematic benchmark and framework to improve efficiency and quality in this domain [4][43]. Group 1: Background and Challenges - Academic presentation videos are crucial for research communication but are currently labor-intensive, requiring hours for a few minutes of content, indicating a need for automation [4]. - Existing natural video generation models are inadequate for academic presentations due to unique challenges such as complex inputs from long documents and the need for synchronized multi-modal outputs [4][5]. Group 2: Paper2Video Benchmark - The Paper2Video benchmark was established using 101 academic papers and their corresponding presentation videos, focusing on four evaluation metrics: Meta Similarity, PresentArena, PresentQuiz, and IP Memory [7][10]. - The benchmark provides a reliable basis for evaluating the generation and assessment of multi-modal long-document inputs and outputs, laying the groundwork for automated academic video generation [10][11]. Group 3: Evaluation Metrics - The four evaluation metrics are designed to assess the quality of academic presentation videos from three core perspectives: human-like preference, information transmission, and academic impact [13][16]. - Meta Similarity measures the consistency of generated content with human-designed versions, while PresentArena evaluates the visual quality against human preferences [16][31]. Group 4: PaperTalker Framework - PaperTalker is introduced as the first multi-agent framework for generating academic presentation videos, processing long-dependency multi-modal tasks [17][18]. - The framework consists of four key modules: Slide Builder, Subtitle Builder, Cursor Builder, and Talker Builder, enabling controlled, personalized, and academically styled video generation [23][26]. 
Group 5: Experimental Results - PaperTalker outperformed other methods in all four evaluation dimensions, demonstrating superior similarity to human-made videos, better information coverage, and enhanced academic memory [32][41]. - The framework's efficiency is attributed to its modular design and the use of Beamer for slide generation, which significantly reduces token consumption and overall generation time [35][36]. Group 6: Contributions of Key Modules - The Cursor Builder module significantly enhances information location and understanding, as evidenced by improved accuracy in tasks involving visual cues [38]. - The Tree Search Visual Choice module plays a critical role in optimizing slide layout and design quality, demonstrating its importance in the overall effectiveness of the generated videos [40][41]. Group 7: Conclusion - The Paper2Video benchmark and PaperTalker framework provide a systematic approach to generating academic presentation videos, with experimental validation showing their advantages in information transmission, visual quality, and academic memory [43].
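The four-module pipeline described above can be sketched as a chain of builders. This is a hypothetical illustration only: the module names (Slide Builder, Subtitle Builder, Cursor Builder, Talker Builder) follow the article, but every function body is a stand-in stub, not the actual PaperTalker system.

```python
from dataclasses import dataclass

@dataclass
class Slide:
    title: str
    bullets: list

def slide_builder(paper_sections):
    """One slide per paper section (the real system compiles Beamer slides)."""
    return [Slide(title=s["heading"], bullets=s["points"][:3]) for s in paper_sections]

def subtitle_builder(slides):
    """Draft one narration line per slide."""
    return [f"Now discussing {s.title}: {'; '.join(s.bullets)}" for s in slides]

def cursor_builder(slides):
    """Plan which bullet the on-screen cursor highlights, in order."""
    return [list(range(len(s.bullets))) for s in slides]

def talker_builder(subtitles):
    """Stand-in for talking-head synthesis: (narration, estimated seconds)."""
    return [(text, max(2.0, 0.5 * len(text.split()))) for text in subtitles]

def paper_talker(paper_sections):
    slides = slide_builder(paper_sections)
    subs = subtitle_builder(slides)
    cursors = cursor_builder(slides)
    clips = talker_builder(subs)
    return list(zip(slides, subs, cursors, clips))

sections = [{"heading": "Method", "points": ["multi-agent design", "Beamer slides"]}]
video_plan = paper_talker(sections)
```

The point of the modular split, per the article, is that each stage can be evaluated and optimized independently, which is also what makes Beamer-based slide generation cheap in tokens.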
Domestic Game-Understanding Model Sets a New SOTA; a Conversation with Doudou AI's CEO: Open-Source Models + Industry Data Are the Key Breakthrough
量子位· 2025-10-11 09:01
As 2025 enters its final quarter, the influence of the domestic open-source model boom keeps finding new confirmation. Take vertical models: at the Tokyo Game Show (TGS), Asia's largest game expo, a domestic AI-companion company unveiled a major release: the game-understanding model LynkSoul VLM v1, which in game scenarios significantly outperformed top closed-source models including GPT-4o, Claude 4 Sonnet, and Gemini 2.5 Flash. The company behind it, Doudou AI, also drew plenty of attention on site. Only about a month after launching its new product, Doudou AI Game Companion 1.0 (Hakko AI overseas), Doudou AI has added more than 2 million users on the strength of its real-time game/video/web understanding, pushing its total user base past 10 million. (Byline: Yuyang, from Aofeisi; QbitAI | WeChat account QbitAI) (Image caption: playing alongside Hollow Knight: Silksong) At TGS we took the chance to talk with Doudou AI CEO Liu Binxin about the Game Companion product, the technology itself, and the current state of the AI-companionship vertical. TL;DR: ... A new SOTA in game understanding. LynkSoul VLM v1, the star of this Tokyo Game Show, is a vision-language model Doudou AI trained specifically for games. During co-play sessions it understands your game screen in real time, for example commenting on your teamfight performance in League of Legends, relying on ...
A 24-Year-Old PhD Dropout Wins a $250 Million Compensation Contract
Hu Xiu· 2025-08-25 01:52
Group 1 - A 24-year-old AI researcher, Matt Deitke, recently signed a contract worth approximately $250 million with Meta, breaking historical records for tech compensation [1][4][6] - The contract includes a base salary, signing bonus, and stock options, with the first year's income potentially reaching $100 million [6][4] - This event highlights the revaluation of talent in the AI era, indicating that top talent is now seen as a strategic asset [2][8] Group 2 - Deitke's initial contract offer from Meta was $125 million over four years, which he initially rejected to focus on his startup, Vercept [5][22] - Meta's CEO, Mark Zuckerberg, personally intervened to negotiate a new contract, demonstrating the high stakes involved in acquiring top AI talent [6][25] - The recruitment strategy employed by Meta includes aggressive poaching of talent from competitors, with a significant portion of the new team coming from OpenAI and Google DeepMind [27][28] Group 3 - The competition for AI talent is intensifying, with companies like Meta offering unprecedented salaries and resources, such as access to thousands of top GPUs for research [26][30] - This talent war is leading to a significant brain drain from academia, as institutions struggle to compete with the financial incentives offered by tech giants [31][32] - The trend is shifting towards a concentration of AI expertise within a few major companies, creating a formidable barrier for startups and other nations [38][39] Group 4 - The rise of AI is creating new job opportunities across various sectors, with non-technical roles in AI growing at rates exceeding 30% [40] - However, the overall job market for computer science graduates is becoming more challenging, with rising unemployment rates among new graduates [41][42] - Deitke's situation exemplifies the extreme valuation of knowledge capital in the AI age, where a single individual's potential can significantly influence corporate strategies [43][45]
Breaking Through SAM's Limits! Sun Yat-sen University's X-SAM: A Unified Framework Sweeping 20+ Segmentation Benchmarks
自动驾驶之心· 2025-08-12 10:37
Core Insights - The article discusses the introduction of X-SAM, a new segmentation framework that overcomes the limitations of the Segment Anything Model (SAM) by enabling multi-task processing and integrating multi-modal understanding capabilities [3][4][5]. Group 1: Limitations of SAM - SAM was initially seen as a universal solution for visual segmentation but has significant limitations, including its inability to handle multiple tasks simultaneously and its lack of understanding of textual instructions [2][5][6]. - SAM is designed for single-object segmentation based on visual prompts and cannot perform complex tasks like semantic, instance, or panoptic segmentation [6]. - The gap between visual segmentation and multi-modal understanding is highlighted, where existing models can either understand images or perform pixel-level segmentation but not both effectively [5][6]. Group 2: Innovations of X-SAM - X-SAM is designed to fill the gap left by SAM, providing a unified segmentation framework that can handle various tasks and input types [7][8]. - The architecture of X-SAM includes a dual-encoder system that processes both visual and textual inputs, allowing for a comprehensive understanding of images and instructions [12][14]. - X-SAM introduces a unified input format that standardizes how different segmentation tasks are processed, enabling the model to understand both textual and visual prompts [13][15]. Group 3: Performance and Testing - X-SAM has been tested across over 20 segmentation datasets and 7 core tasks, outperforming existing models in all categories [4][27]. - The model's performance metrics include achieving an average precision (AP) of 47.9 to 49.7 in visual grounding segmentation (VGD), significantly surpassing previous models [26][35]. - In specific tasks, X-SAM achieved a panoptic quality (PQ) of 54.7 in COCO panoptic segmentation, demonstrating its robustness in foundational segmentation tasks [31].
Group 4: Training Methodology - X-SAM employs a multi-stage training strategy that includes fine-tuning the segmenter, pre-training for alignment, and mixed fine-tuning across various datasets [21][23]. - The training process incorporates a data balancing resampling strategy to ensure smaller datasets are not overshadowed by larger ones, optimizing overall model performance [24]. - The model's architecture allows for simultaneous training on multiple tasks, enhancing its generalization capabilities [37]. Group 5: Future Directions - The research team plans to extend X-SAM's capabilities to video segmentation and dynamic scenes, aiming to bridge the gap between static image understanding and video comprehension [43].
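The "unified input format" idea above — every segmentation task expressed as the same (image, text prompt, visual prompt) triple so one interface covers semantic, instance, panoptic, and visually-prompted tasks — can be sketched as follows. The function name, field names, and task list here are illustrative assumptions, not the actual X-SAM API.

```python
def make_query(image_id, task, text_prompt=None, visual_prompt=None):
    """Wrap any segmentation request in one standardized record.

    The text field would feed the language encoder and the visual field the
    vision encoder of a dual-encoder model, per the architecture described above.
    """
    supported = {"semantic", "instance", "panoptic", "vgd", "referring"}
    if task not in supported:
        raise ValueError(f"unknown task: {task}")
    return {
        "image": image_id,
        "task": task,
        "text": text_prompt,      # e.g. a referring expression
        "visual": visual_prompt,  # e.g. points or boxes
    }

queries = [
    make_query("img_001", "panoptic"),
    make_query("img_001", "referring", text_prompt="the red car on the left"),
    make_query("img_001", "vgd", visual_prompt={"point": (320, 240)}),
]
```

The design payoff is that multi-task mixed fine-tuning (Group 4 above) needs no per-task plumbing: every dataset is first normalized into this one record shape.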
Still Hesitating About Getting into Large Models? Others Have Already Published Their First Top-Conference Paper!
自动驾驶之心· 2025-07-14 06:20
Core Viewpoint - The article discusses the evolving landscape of large models in autonomous driving, highlighting the focus on lightweight solutions, hardware adaptation, knowledge distillation, and advanced reasoning paradigms like CoT and VLA+ reinforcement learning as key areas for future development [1][2]. Group 1: Course Introduction - The course aims to explore cutting-edge optimization methods for large models, focusing on parameter-efficient computation, dynamic knowledge expansion, and complex reasoning [2]. - It addresses the core challenges in model optimization, including pruning, quantization, retrieval-augmented generation (RAG), and advanced reasoning paradigms [3]. Group 2: Problems Addressed by the Course - The course provides a systematic understanding of large model knowledge, helping students build a coherent theoretical framework [3]. - It assists students in combining theoretical knowledge with practical coding skills, enabling them to replicate research papers and develop new models [3]. - The course offers guidance on writing and submitting academic papers, addressing common challenges faced by students [3]. Group 3: Enrollment Information - The course limits enrollment to 6-8 students per session [4]. - It targets individuals with a background in deep learning or machine learning, familiarity with Python, and a passion for research [6]. Group 4: Course Outcomes - Participants will gain insights into classic and cutting-edge papers in the field, enhancing their understanding of key algorithms and principles [9]. - The course includes a structured approach to writing and revising academic papers, culminating in the production of a draft [9]. Group 5: Course Structure - The course spans 12 weeks of online group research followed by 2 weeks of paper guidance and a 10-week maintenance period [9]. - It covers various topics, including model pruning, quantization, and advanced reasoning techniques, with a focus on practical applications [19].
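Of the optimization topics the course lists, pruning is the simplest to make concrete. Below is a minimal magnitude-pruning sketch, an assumption-laden toy (unstructured, one-shot, no retraining), not the course's material: zero out the smallest-magnitude weights so roughly a fraction `sparsity` of entries is removed.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero the `sparsity` fraction of entries with the smallest |value|."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.1, -0.9], [0.4, -0.05]])
pruned = magnitude_prune(w, sparsity=0.5)  # keeps -0.9 and 0.4, zeros the rest
```

In practice pruning is followed by fine-tuning to recover accuracy, and structured variants (whole channels or heads) are preferred when hardware speedups matter — both are points the course outline raises.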
A Senior Labmate Published an Autonomous-Driving Large-Model Paper on His Own, and Got into a TOP2 School for His PhD...
自动驾驶之心· 2025-07-09 12:56
Core Viewpoint - The article discusses the advancements in large models (LLMs) for autonomous driving, highlighting the need for optimization in efficiency, knowledge expansion, and reasoning capabilities as the technology matures [2][3]. Group 1: Development of Large Models - Companies like Li Auto and Huawei are implementing their own VLA and VLM solutions, indicating a trend towards the practical application of large models in autonomous driving [2]. - The focus for the next generation of large models includes lightweight design, hardware adaptation, knowledge distillation, quantization acceleration, and efficient fine-tuning [2][3]. Group 2: Course Introduction - A course is being offered to explore cutting-edge optimization methods for large models, focusing on parameter-efficient computation, dynamic knowledge expansion, and complex reasoning [3]. - The course aims to address core challenges in model optimization, including pruning, quantization, retrieval-augmented generation (RAG), and advanced reasoning paradigms like Chain-of-Thought (CoT) and reinforcement learning [3][4]. Group 3: Enrollment and Requirements - The course will accept a maximum of 8 students per session, targeting individuals with a background in deep learning or machine learning who are familiar with Python and PyTorch [5][10]. - Participants will gain a systematic understanding of large model optimization, practical coding skills, and insights into academic writing and publication processes [8][10]. Group 4: Course Outcomes - Students will learn to combine theoretical knowledge with practical coding, develop their own research ideas, and produce a draft of a research paper [8][9]. - The course includes a structured timeline with specific topics each week, covering model pruning, quantization, efficient fine-tuning, and advanced reasoning techniques [20].
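Quantization acceleration, another topic named above, can be illustrated with the generic asymmetric uniform 8-bit scheme. This is a hedged sketch of the textbook method, not any specific framework's implementation: map floats to uint8 via a scale and zero point, then map back.

```python
import numpy as np

def quantize_int8(x):
    """Uniform asymmetric quantization of an array to uint8."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize_int8(x)
x_hat = dequantize(q, s, z)  # reconstruction error bounded by ~one scale step
```

The memory win is 4x over fp32 per tensor (plus two scalars of metadata); real deployments add per-channel scales and calibration, which is where the course's "quantization acceleration" topic presumably goes deeper.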
What Are the Later-Stage Deployment Paths and Research Directions for Large Models in Autonomous Driving?
自动驾驶之心· 2025-07-07 23:31
Core Insights - The article discusses the evolving landscape of large models in autonomous driving, highlighting the focus on lightweight solutions, hardware compatibility, knowledge distillation, and efficient fine-tuning of large models [1] - It emphasizes the importance of advanced reasoning paradigms such as Chain-of-Thought (CoT) and VLA combined with reinforcement learning in enhancing spatial perception capabilities [1] Group 1: Course Overview - The course aims to explore cutting-edge optimization methods for large models, focusing on parameter-efficient computation, dynamic knowledge expansion, and complex reasoning [2] - Key challenges in model optimization include parameter compression through pruning and quantization, dynamic knowledge injection techniques, and advanced reasoning paradigms [2][3] Group 2: Enrollment and Requirements - The course is limited to 6-8 participants per session, targeting individuals with a foundational understanding of deep learning and machine learning [4][8] - Participants are expected to have basic programming skills in Python and familiarity with PyTorch, along with a genuine interest in research [8] Group 3: Course Outcomes - The course aims to provide a systematic understanding of large model optimization, helping participants develop their own research ideas and enhance their coding skills [6][7] - Participants will receive guidance on writing and submitting academic papers, including methodologies for drafting and revising manuscripts [6][7] Group 4: Course Structure - The course spans 12 weeks of online group research followed by 2 weeks of paper guidance, covering topics such as model pruning, quantization, and dynamic knowledge expansion [7][18] - Each week focuses on specific themes, including advanced reasoning techniques and collaborative multi-agent systems [18][20] Group 5: Additional Information - The course will utilize publicly available datasets and baseline codes tailored to specific applications, ensuring practical relevance [15][16] - Participants will engage in discussions and hands-on experiments using mainstream large models like LLaMA and GPT [2][18]
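The "dynamic knowledge expansion" theme above centers on retrieval-augmented generation (RAG): fetch relevant text at query time and prepend it to the prompt. The sketch below is a toy stand-in under stated assumptions — word-overlap scoring and a two-document corpus — where production systems use embedding search over a vector index.

```python
def retrieve(query, corpus, k=1):
    """Return the k corpus snippets with the largest word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, corpus):
    """Assemble the retrieved context plus the question into one LLM prompt."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "LoRA fine-tunes large models by training low-rank adapter matrices.",
    "Beam search decodes sequences by keeping the top-k partial hypotheses.",
]
prompt = build_prompt("How does LoRA fine-tune large models?", corpus)
```

Because the knowledge lives in the corpus rather than the weights, updating the model's knowledge reduces to updating the index, which is the appeal of RAG over repeated fine-tuning.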
Large Models: What Paper-Worthy Topics Are Still Left in This Field?
具身智能之心· 2025-07-05 02:25
Core Insights - The article emphasizes the rapid development of large language models (LLMs) and multimodal models, focusing on enhancing model efficiency, expanding knowledge capabilities, and improving reasoning performance as key research areas in artificial intelligence [1][2]. Course Objectives - The course aims to systematically explore cutting-edge optimization methods for large models, addressing challenges in parameter-efficient computation, dynamic knowledge expansion, and complex reasoning [1][2]. Enrollment Details - The course will accept 6 to 8 participants per session [3]. Target Audience - The course is designed for master's and doctoral students in the field of large models, individuals seeking to enhance their resumes for graduate studies abroad, and professionals in artificial intelligence looking to deepen their understanding of algorithm theory and research skills [4]. Course Outcomes - Participants will gain insights into classic and cutting-edge papers, coding implementations, and methods for writing and submitting research papers, thereby developing a clearer understanding of the subject matter [3][4]. Enrollment Requirements - Basic requirements include familiarity with deep learning/machine learning, basic knowledge of large model algorithms, proficiency in Python, and experience with PyTorch [5]. Course Structure - The course spans 12 weeks of online group research, followed by 2 weeks of paper guidance, and includes a maintenance period of 10 weeks for paper development [10]. Learning Requirements - Participants are expected to engage actively in discussions, complete assignments on time, and maintain academic integrity throughout the course [12]. Course Outline - The curriculum covers various topics, including model pruning, quantization, dynamic knowledge expansion, and advanced reasoning paradigms, with a focus on practical applications and coding [16][18].
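The "advanced reasoning paradigms" in the outline can be made concrete with chain-of-thought (CoT) prompting plus self-consistency voting. In this sketch the sampled completions are hard-coded stand-ins (a real setup samples an LLM several times); the template wording is illustrative, not from any specific paper.

```python
from collections import Counter

def cot_prompt(question):
    """Wrap a question with a step-by-step reasoning instruction."""
    return f"Q: {question}\nA: Let's think step by step."

def self_consistency(answers):
    """Majority vote over the final answers of multiple sampled CoT runs."""
    return Counter(answers).most_common(1)[0][0]

prompt = cot_prompt("What is 12 * 12?")
sampled = ["144", "144", "124"]  # stand-in for three sampled completions
voted = self_consistency(sampled)
```

The voting step is what turns stochastic sampling into a robustness gain: an occasional wrong reasoning chain is outvoted as long as the model is right more often than not.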