Multimodal Reasoning
Teaching Large Models to "Learn from Their Mistakes": NJUST, Baidu and Others Propose a New Model Memory Method
量子位· 2025-12-17 09:07
Contributed by the ViLoMem team | WeChat official account QbitAI (量子位)

Multimodal reasoning has a new trick, and large models' habit of "never remembering a lesson" now has a cure. Nanjing University of Science and Technology, together with Baidu and other institutions, proposes ViLoMem, a new method that builds a dual-stream semantic memory consisting of a visual stream plus a logic stream, so that the model, like a human, files visual traps and reasoning errors away separately and genuinely "learns from its mistakes."

Across six multimodal benchmarks, ViLoMem lifts GPT-4.1 by +6.48 on MathVision and the smaller Qwen3-VL-8B by +4.38 on MMMU. No fine-tuning is required, and the memory accumulated by a strong model can be transferred directly to a smaller one, acting as a kind of "free knowledge distillation." Overall, ViLoMem does three key things:

Without changing the large model's parameters, ViLoMem delivers steady gains across multiple multimodal benchmarks, especially on math and real-world reasoning tasks that demand fine-grained visual understanding, offering a promising path toward multimodal agents that truly "grow a memory from experience."

Large models' "goldfish memory"

But humans do not remember this way. Cognitive science shows that human semantic memory is inherently multimodal: we remember both "this problem needs the Pythagorean theorem" (a logical rule) and "this angle looks like a right angle but isn't" (a visual experience). ViLoMem follows exactly this direction, taking the visual and the logical ...
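The article does not include implementation details, but the dual-stream idea can be illustrated with a small retrieval sketch. Everything below is a hypothetical illustration rather than the authors' code: the class `DualStreamMemory`, the `embed` stand-in, and the prompt format are assumptions; a real system would use a vision-language encoder and the paper's own memory-writing rules.

```python
# Hypothetical sketch of a ViLoMem-style dual-stream semantic memory.
# Names and the embedding function are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field
import math

def embed(text: str) -> list[float]:
    # Stand-in embedding: a real system would use a vision-language encoder.
    vec = [0.0] * 64
    for i, ch in enumerate(text):
        vec[i % 64] += ord(ch) / 1000.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

@dataclass
class MemoryEntry:
    lesson: str                 # e.g. "the angle looks right but is not"
    vector: list[float]

@dataclass
class DualStreamMemory:
    visual: list[MemoryEntry] = field(default_factory=list)   # perception pitfalls
    logic: list[MemoryEntry] = field(default_factory=list)    # reasoning-rule errors

    def add(self, stream: str, lesson: str) -> None:
        entry = MemoryEntry(lesson, embed(lesson))
        (self.visual if stream == "visual" else self.logic).append(entry)

    def retrieve(self, question: str, k: int = 2) -> list[str]:
        q = embed(question)
        ranked = sorted(self.visual + self.logic,
                        key=lambda e: cosine(q, e.vector), reverse=True)
        return [e.lesson for e in ranked[:k]]

# Usage: store lessons from past mistakes, then prepend them to a new prompt.
memory = DualStreamMemory()
memory.add("visual", "An angle that looks like 90 degrees may not be; check the markings.")
memory.add("logic", "Use the Pythagorean theorem only after confirming a right angle.")
hints = memory.retrieve("Find the hypotenuse of the triangle in the figure.")
prompt = "Lessons from past errors:\n- " + "\n- ".join(hints) + "\n\nNow solve the problem."
print(prompt)
```

The point of the sketch is the separation of stores: visual pitfalls and logic-rule errors are kept apart, retrieved by similarity, and injected as prompt context, which is also why such a memory could in principle be handed to a smaller model without any fine-tuning.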
Transformer Author Reveals GPT-5.1 Insider Details: OpenAI's Internal Naming Rules Have Gotten Messy
36Kr · 2025-12-01 01:25
Core Insights
- The development of AI is not slowing down but is transitioning to a new paradigm, with a focus on reasoning models rather than just pre-training [4][10][32]
- The recent release of GPT-5.1 represents a significant stability iteration rather than a minor update, emphasizing user experience and safety improvements [14][17][19]

Group 1: AI Development Trends
- There are two contrasting views on AI growth: one claims a slowdown, while the other highlights continuous advancements with new models like GPT-5.1 and Gemini 3 [5][10]
- The internal perspective shows that AI capability growth follows a smooth exponential curve, akin to Moore's Law, driven by technological iterations and computational enhancements [7][10]
- The shift from pre-training to reasoning models marks a critical turning point in AI development, with reasoning models still in their early stages and expected to progress rapidly [10][11][13]

Group 2: GPT-5.1 and Model Evolution
- GPT-5.1 is a substantial update focused on enhancing reasoning capabilities, safety, and user experience, despite appearing as a minor version change [14][15][17]
- The naming convention for models has shifted to prioritize user experience, allowing for more flexibility in development and faster iteration cycles [17][19]
- Despite improvements, GPT-5.1 still exhibits limitations in multi-modal reasoning, as demonstrated by its inability to solve simple problems that a child could easily answer [19][20]

Group 3: Future of AI and Robotics
- AI is expected to change the nature of work without eliminating jobs, as human expertise will still be required in high-stakes scenarios [32][34]
- The next significant breakthrough in AI is anticipated to come from advancements in multi-modal reasoning and embodied intelligence, particularly in home robotics [36][34]
- The progress in robotics will depend on the integration of multi-modal capabilities and general reinforcement learning, leading to a transformative leap in home automation technologies [36][34]
Transformer Author Reveals GPT-5.1 Insider Details! OpenAI's Internal Naming Rules Have Gotten Messy
量子位· 2025-11-30 11:30
Core Insights
- The article discusses a significant paradigm shift in AI, indicating that the development of AI is not slowing down but rather transitioning to a new phase of growth [1][7][12].

Group 1: AI Development Trends
- There are two contrasting views on AI development: one claims that AI growth is slowing down, while the other highlights continuous advancements with new models like GPT-5.1 and Gemini 3 being released [3][12].
- Łukasz Kaiser argues that the perception of slowing growth is incorrect, stating that AI's capability growth follows a smooth exponential curve, akin to Moore's Law [15][16].
- The shift from pre-training to reasoning models is a key factor in this transition, with pre-training being in a later stage of its S-curve while reasoning models are still in their early stages [18][19].

Group 2: Reasoning Models and Their Impact
- The industry is focusing on smaller, cost-effective models that maintain quality, leading to the misconception that pre-training has stalled [21].
- Reasoning models, which allow for more complex thought processes and the use of tools during inference, are expected to progress rapidly due to their emerging nature [22][27].
- The evolution of models like ChatGPT demonstrates a qualitative leap in performance, with newer versions incorporating reasoning and external tool usage for more accurate responses [23][24].

Group 3: GPT-5.1 Insights
- GPT-5.1 is not merely a minor update but represents a significant stability iteration, enhancing reasoning capabilities through reinforcement learning and synthetic data [34][35].
- The naming convention for versions has shifted to focus on user experience rather than technical details, allowing for greater flexibility in development [38].
- Despite improvements, GPT-5.1 still has limitations, particularly in multi-modal reasoning, as illustrated by its struggles with basic tasks that require contextual understanding [41][42].

Group 4: Future of AI and Robotics
- AI is expected to change the nature of work without eliminating jobs, as human expertise will still be needed in high-stakes scenarios [62][66].
- Home robots are anticipated to be the next visible AI revolution, driven by advancements in multi-modal capabilities and general reinforcement learning [67][69].
- The integration of these technologies is expected to lead to a significant leap in the capabilities of home robots, making them more intuitive and perceptible compared to current AI models like ChatGPT [69].
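As a rough illustration of what "using tools during inference" means in the summary above, here is a generic reason-and-act loop. This is not OpenAI's implementation: `call_model`, the tool registry, and the message format are placeholders invented for this sketch.

```python
# Hypothetical reason-and-act loop illustrating tool use during inference.
# call_model() is a placeholder for any reasoning-model API; the protocol shown
# (a "tool" field in the reply) is an assumption, not a specific vendor format.
import json

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy example only
    "search": lambda query: f"(stub) top result for: {query}",
}

def call_model(messages: list[dict]) -> dict:
    # Placeholder: a real system would call a reasoning model here and parse
    # its reply into either {"tool": ..., "input": ...} or {"answer": ...}.
    return {"answer": "stub answer based on " + json.dumps(messages[-1])}

def solve(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:                # the model decided it is done
            return reply["answer"]
        tool = TOOLS[reply["tool"]]          # otherwise run the requested tool
        observation = tool(reply["input"])
        messages.append({"role": "tool", "content": observation})
    return "No answer within step budget."

print(solve("What is 17 * 23?"))
```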
Late at Night, the $3 Trillion Giant Surges
Shanghai Securities News · 2025-11-19 15:45
[Stock quote widget: Alphabet Class A (GOOGL) — last 302.860, +18.580 (+6.54%); open 287.310, high 303.680, low 286.630, prev close 284.280; volume 24.97M shares, turnover 7.40B; market cap $3.65 trillion; P/E 29.41; pre-market 287.450, +1.12% at 09:30 ET. Intraday tick data truncated.]
Gemini 3.0 Released: From "Tool Assistance" to "Proactive Agent", Here Is What Google Did
TaiMeiTi APP · 2025-11-19 00:32
Core Insights
- Google has launched its latest AI model, Gemini 3, which is considered a "universal player" in the industry, showcasing significant advancements over its predecessors and competing models like GPT-5.1 and Claude 4.5 [1][8]
- Gemini 3 integrates into various Google applications, including AI search products and enterprise solutions, and will be gradually rolled out to users [1][8]
- The release of Gemini 3 is strategically important for Google, as it aims to regain a competitive edge in the AI race, especially after being perceived as lagging behind since the launch of ChatGPT [8][9]

Performance Enhancements
- Gemini 3 has achieved remarkable performance in reasoning capabilities, with a GPQA Diamond test accuracy of 91.9% and a score of 37.5% in multi-step logical reasoning without tools [2]
- The model also excels in multi-modal reasoning, scoring 81% in MMMU-Pro and 87.6% in Video-MMMU tests, indicating its ability to handle complex problems across various domains [2][4]

Innovative Features
- Google introduced the Gemini 3 Deep Think mode, which enhances reasoning through "thought signatures" and "thinking levels," achieving scores of 41.0% and 93.8% in relevant tests [3]
- The model supports an impressive context length of up to 1 million tokens, significantly surpassing competitors and previous versions, allowing for complex task handling [4]

Development and Collaboration
- Gemini 3 redefines developer collaboration with innovations like "Agentic Coding" and "Vibe Coding," achieving a high Elo score of 2439 in competitive programming tests [5]
- The model's agent capabilities allow it to autonomously plan and execute tasks, demonstrated by its performance in Terminal-Bench 2.0 and Vending-Bench 2 tests [6]

Strategic Implications
- The launch of Gemini 3 is expected to accelerate AI technology innovation across the industry, pushing competitors to enhance their offerings in reasoning, multi-modal integration, and agent development [9]
- For enterprises and developers, Gemini 3 provides a scalable and customizable AI foundation, facilitating the transition of AI from experimental phases to practical applications in everyday life [8][9]
Gemini 3 Officially Released
小熊跑的快· 2025-11-19 00:09
Core Insights
- Google has officially launched Gemini 3, the most powerful multimodal understanding model to date, enhancing interactive experiences and reasoning capabilities [1][4]
- Gemini 3 Pro and Gemini 3 Deep Think are key versions, with the latter showing superior performance in reasoning tasks [4][10]

Performance Metrics
- Gemini 3 Pro achieved a score of 1501 Elo, ranking first on the LMArena leaderboard, and demonstrated doctoral-level reasoning with a 37.5% score on Humanity's Last Exam [1][3]
- In various benchmarks, Gemini 3 Pro outperformed previous models, achieving 91.9% on GPQA Diamond and 23.4% on MathArena Apex [3][4]
- Gemini 3 Deep Think further improved performance, scoring 41.0% on Humanity's Last Exam and 93.8% on GPQA Diamond [4]

Multimodal Capabilities
- Gemini 3 is designed to seamlessly integrate information across text, images, videos, audio, and code, pushing the boundaries of multimodal reasoning [6]
- It can generate interactive learning materials and analyze performance in various activities, such as sports [7]

Developer Tools and Platforms
- Gemini 3 enhances developer efficiency through vibe coding and agentic coding, leading to significant improvements in software development tasks [8][10]
- Google Antigravity, a new development platform, allows developers to build in a task-oriented manner, transforming AI into a proactive partner [9][10]

User Experience
- Google AI Ultra subscribers can access Gemini's advanced capabilities, enabling more effective long-term planning and task execution [11]
Gemini 3 Arrives Overnight: Outgunning GPT-5.1, the Google Era of Large Models Is Here
36Kr · 2025-11-19 00:04
Core Insights
- The release of Gemini 3 has generated significant anticipation within the AI community, marking a pivotal moment for Google in the AI landscape [1][4][5]
- Gemini 3 is positioned as a major step towards AGI, showcasing advanced multimodal understanding and interaction capabilities [6][10]
- The model has set new SOTA standards in various AI benchmarks, outperforming its predecessor Gemini 2.5 Pro and competing models like Claude Sonnet 4.5 and GPT-5.1 [7][8]

Model Performance
- Gemini 3 Pro achieved a groundbreaking Elo score of 1501 on the LMArena Leaderboard, demonstrating exceptional reasoning capabilities [7]
- In key benchmarks, Gemini 3 Pro scored 37.5% in Humanity's Last Exam (no tools), 91.9% in GPQA Diamond, and 23.4% in MathArena Apex, establishing new standards in academic reasoning and mathematics [8]
- The model excelled in multimodal reasoning, scoring 81% in MMMU-Pro and 87.6% in Video-MMMU, indicating its proficiency in understanding complex scientific charts and dynamic video streams [7][8]

Interaction and Usability
- Gemini 3 Pro has improved interaction quality, providing concise and direct responses, thus acting as a true thinking partner [9]
- The Deep Think mode enhances reasoning and multimodal understanding, achieving scores of 41.0% in Humanity's Last Exam and 93.8% in GPQA Diamond [10][13]
- The model supports various learning modalities, allowing users to learn through text, images, videos, and code, thus broadening its application [14][15]

Development and Integration
- Gemini 3 is designed to facilitate developers in transforming ideas into reality, excelling in zero-shot generation and interactive web UI rendering [16]
- The model ranks first in the WebDev Arena with an Elo score of 1487, showcasing its capabilities in web development tasks [16]
- Google Antigravity, a new development platform, allows developers to leverage Gemini 3 for building applications with enhanced interactivity and visual effects [24][17]

Market Impact and Adoption
- Gemini 3 is now available for general users and developers through various platforms, indicating a strategic move to enhance user engagement [19]
- The model's pricing structure is based on context length, with specific rates for tasks under and over 200k tokens [21]
- Google has seen a resurgence in market confidence, with significant user engagement metrics, including 2 billion monthly active users for AI Overviews and 650 million for Gemini applications [34][36]
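The summary above only states that pricing splits at a 200k-token context threshold; it does not give the rates. The sketch below shows how such tiered billing is typically computed; `RATE_SHORT_PER_1M` and `RATE_LONG_PER_1M` are made-up placeholders, not Google's actual prices.

```python
# Hypothetical illustration of context-length-tiered pricing.
# The 200k-token threshold comes from the article; the rates are placeholder values.
THRESHOLD_TOKENS = 200_000
RATE_SHORT_PER_1M = 2.00   # hypothetical $ per 1M tokens for contexts <= 200k
RATE_LONG_PER_1M = 4.00    # hypothetical $ per 1M tokens for contexts > 200k

def estimate_cost(context_tokens: int) -> float:
    """Return an estimated cost in dollars for a single request's input context."""
    rate = RATE_SHORT_PER_1M if context_tokens <= THRESHOLD_TOKENS else RATE_LONG_PER_1M
    return context_tokens / 1_000_000 * rate

print(estimate_cost(150_000))  # short-context tier
print(estimate_cost(500_000))  # long-context tier
```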
2025 Global Machine Learning Technology Conference: Full Agenda Released, with Top Speaker Lineup and Attendee Guide Available
AI科技大本营· 2025-10-14 11:14
Core Insights
- The 2025 Global Machine Learning Technology Conference will be held on October 16-17 in Beijing, featuring prominent figures from the AI industry, including researchers from OpenAI and other leading tech companies [1][3][11].

Group 1: Conference Overview
- The conference will gather experts from top tech companies and research institutions to discuss cutting-edge topics such as large models, intelligent agent engineering, and multimodal reasoning [3][12].
- Keynote speakers include Łukasz Kaiser, co-creator of GPT-5 and GPT-4, and Li Jianzhong, Vice President of CSDN, who will present insights on AI industry paradigms and the evolution of large models [4][5].

Group 2: Key Presentations
- Li Jianzhong will present on "Large Model Technology Insights and AI Industry Paradigm Insights," focusing on the technological evolution driven by large models [4].
- Michael Wong will discuss the "AI Platform Paradox," analyzing the reasons behind the failures of many open-source AI ecosystems and how to create a thriving environment [4].

Group 3: Roundtable Discussions
- A roundtable titled "Core Issues in AI Industry Paradigm Shift" will feature discussions among industry leaders on the evolution of AI paradigms and the challenges of technology implementation [10].
- Participants include Li Jianzhong, Wang Bin from Xiaomi, and other notable scientists, fostering a high-density exchange of ideas [10].

Group 4: Afternoon Sessions
- The afternoon sessions on October 16 will cover various topics, including the evolution of large language models, intelligent agent engineering, and AI-enabled software development [12][18].
- Notable speakers include experts from ByteDance, Tencent, and other leading firms, sharing their latest breakthroughs and insights [13][19].

Group 5: Second Day Highlights
- The second day will feature multiple specialized sessions on embodied intelligence, AI infrastructure, and practical applications of large models [18][19].
- Key presentations will include discussions on the next generation of AI agents and the integration of AI technologies in various industries [20][22].
Farewell, Human Champions: AI Sweeps the Astronomy Olympiad, with GPT-5 Outscoring Gold Medalists by 2.7x
36Kr · 2025-10-12 23:57
Core Insights
- AI models GPT-5 and Gemini 2.5 Pro achieved gold medal levels in the International Olympiad on Astronomy and Astrophysics (IOAA), outperforming human competitors in theoretical and data analysis tests [1][3][10]

Performance Summary
- In the theoretical exams, Gemini 2.5 Pro scored 85.6% overall, while GPT-5 scored 84.2% [4][21]
- In the data analysis exams, GPT-5 achieved a score of 88.5%, significantly higher than Gemini 2.5 Pro's 75.7% [5][31]
- The performance of AI models in the IOAA 2025 was remarkable, with GPT-5 scoring 86.8%, which is 443% above the median, and Gemini 2.5 Pro scoring 83.0%, 323% above the median [22]

Comparative Analysis
- The AI models consistently ranked among the top performers, with GPT-5 and Gemini 2.5 Pro surpassing the best human competitors in several years of the competition [40][39]
- The models demonstrated strong capabilities in physics and mathematics but struggled with geometric and spatial reasoning, particularly in the 2024 exams where geometry questions were predominant [44][45]

Error Analysis
- The primary sources of errors in the theoretical exams were conceptual mistakes and geometric/spatial reasoning errors, which accounted for 60-70% of total score losses [51][54]
- In the data analysis exams, errors were more evenly distributed across categories, with significant issues in plotting and interpreting graphs [64]

Future Directions
- The research highlights the need for improved multimodal reasoning capabilities in AI models, particularly in spatial and temporal reasoning, to enhance their performance in astronomy-related problem-solving [49][62]
Meta Just Poached Tsinghua Alumnus Yang Song from OpenAI
36氪· 2025-09-26 13:35
Core Viewpoint
- The recent hiring of Yang Song, a key figure in diffusion models and an early contributor to DALL·E 2, by Meta Superintelligence Labs (MSL) signals a strategic move in the AI competition, enhancing MSL's talent pool and research capabilities [2][3][11].

Group 1: Talent Acquisition and Team Structure
- Yang Song's addition to MSL strengthens the "dual-core" structure of the team, with one leader managing overall strategy and the other focusing on critical paths in research [16].
- The team composition is becoming clearer, with a more structured division of research responsibilities [17].
- Since summer, over 11 researchers from OpenAI, Google, and Anthropic have joined MSL, indicating a high-frequency recruitment strategy [20].

Group 2: Industry Trends and Dynamics
- The rapid turnover of talent among top AI labs is becoming more common, reflecting a shift towards project compatibility and team dynamics as key factors in employment decisions [25].
- The relationship between researchers and labs is evolving into a "mutual pursuit," where both parties seek alignment in goals and capabilities [47].
- The competition for AI talent is intensifying, with increasing demands on researchers to understand cross-modal capabilities and complete data workflows [48].

Group 3: Research Focus and Strategic Alignment
- Yang Song's research on diffusion models aligns closely with MSL's strategic direction, aiming to develop universal models that can understand various data forms [28][30].
- The integration of Yang Song's expertise is expected to enhance MSL's ability to create a comprehensive AI product system, accelerating the formation of a complete technical loop from modeling to execution [32][41].
- Meta is not only attracting top talent but is also working to transform these capabilities into organizational and product-level resources [44].
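For readers unfamiliar with the diffusion models mentioned above, here is a minimal sketch of one reverse (denoising) step in a DDPM-style diffusion model, the family of generative models associated with Yang Song's score-based work. It is a generic textbook formulation, not Yang Song's or Meta's code; `eps_model`, the noise schedule, and the shapes are illustrative placeholders.

```python
# Minimal sketch of one reverse (denoising) step in a DDPM-style diffusion model.
# eps_model is a placeholder noise predictor; schedule and shapes are illustrative.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t: np.ndarray, t: int) -> np.ndarray:
    # Placeholder for a trained network predicting the noise added at step t.
    return np.zeros_like(x_t)

def reverse_step(x_t: np.ndarray, t: int, rng: np.random.Generator) -> np.ndarray:
    """x_{t-1} = (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps) / sqrt(alpha_t) + sigma_t * z"""
    eps = eps_model(x_t, t)
    mean = (x_t - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z        # sigma_t^2 = beta_t is a common choice

rng = np.random.default_rng(0)
x = rng.standard_normal((8,))                  # start from pure noise
for t in reversed(range(T)):
    x = reverse_step(x, t, rng)
print(x)
```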