Language Models

Pioneering pixel-space reasoning: a 7B model outperforms GPT-4o, letting VLMs use "eyes and brain together" like humans
量子位· 2025-06-09 09:27
Core Viewpoint - The article discusses the transition of Visual Language Models (VLM) from "perception" to "cognition," highlighting the introduction of "Pixel-Space Reasoning," which allows models to interact with visual information directly at the pixel level, enhancing their understanding and reasoning capabilities [1][2][3].

Group 1: Key Developments in VLM
- The current mainstream VLMs are limited by their reliance on text tokens, which can lead to loss of critical information in high-resolution images and dynamic video scenes [2][4].
- "Pixel-Space Reasoning" enables models to perform visual operations directly, allowing for a more human-like interaction with visual data [3][6].
- This new reasoning paradigm shifts the focus from text-mediated understanding to native visual operations, enhancing the model's ability to capture spatial relationships and dynamic details [6][7].

Group 2: Overcoming Learning Challenges
- The research team identified a "cognitive inertia" challenge, where the model's established text-reasoning capabilities hinder the development of new pixel-operation skills, creating a "learning trap" [8][9].
- To address this, a reinforcement learning framework was designed that combines intrinsic curiosity incentives with extrinsic correctness rewards, encouraging the model to explore visual operations [9][12].
- The framework includes constraints to ensure a minimum rate of pixel-space reasoning and to balance exploration with computational efficiency [10][11].

Group 3: Performance Validation
- Pixel-Reasoner, built on the Qwen2.5-VL-7B model, achieved impressive results across four visual reasoning benchmarks, outperforming models such as GPT-4o and Gemini-2.5-Pro [13][19].
- Specifically, it achieved 84.3% accuracy on V* Bench, significantly higher than its competitors [13].
- The model demonstrated 73.8% accuracy on TallyQA-Complex, showcasing its ability to differentiate between similar objects in images [19][20].
Group 4: Future Implications
- The research indicates that pixel-space reasoning is not a replacement for text reasoning but rather a complementary pathway for VLMs, enabling a dual-track understanding of the world [21].
- As multi-modal reasoning capabilities evolve, the industry is moving towards a future where machines can "see more clearly and think more deeply" [21].
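The reward design described in Group 2 can be sketched as follows. This is an illustrative reconstruction of the idea (extrinsic correctness reward, intrinsic curiosity bonus for pixel operations, and a minimum-rate constraint), not the paper's actual implementation; all constants and the function shape are assumptions.

```python
def shaped_reward(answer_correct, used_pixel_ops, pixel_op_rate,
                  min_rate=0.3, curiosity_bonus=0.2, rate_penalty=0.5):
    """Combine an extrinsic correctness reward with an intrinsic curiosity
    incentive for exploring pixel-space operations.

    All constants here are illustrative assumptions, not the paper's values.
    """
    reward = 1.0 if answer_correct else 0.0  # extrinsic: task success
    if used_pixel_ops:
        reward += curiosity_bonus            # intrinsic: encourage exploration
    if pixel_op_rate < min_rate:
        # constraint: penalize falling below a minimum rate of pixel reasoning
        reward -= rate_penalty * (min_rate - pixel_op_rate)
    return reward
```

The penalty term is what counteracts the "cognitive inertia" the team describes: a policy that never tries pixel operations keeps losing reward even when its text-only answers happen to be correct.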
No SFT or RL needed: SLOT, a sample-level inference-time optimizer, easily adds 10% accuracy
机器之心· 2025-06-09 08:03
Recently, while many researchers were still agonizing over which labels and rewards to use for training large models, and over which baseline models make for fair comparisons, the MAPLE Lab at Westlake University took a different path: since LLMs perform poorly on complex instructions and normally require a separate SFT or RL stage, why not let the model "study" the specific problem on the fly at inference time? This seemingly absurd idea turned out to deliver remarkable gains.

Imagine taking an exam: if you could spend a few seconds "adapting" to the specific question before answering, would you perform better?

That is exactly the core idea the Westlake University team proposes in their latest paper. Their method, SLOT (Sample-specific Language Model Optimization at Test-time), treats each input prompt itself as a piece of "mini training data," letting the model "learn" to understand that specific question before generating an answer.

Even more surprising, the method is almost absurdly simple: Qwen2.5-7B's accuracy on the GSM8K math reasoning task jumped from 57.54% to 66.19%, a gain of 8.65 percentage points. DeepSeek-R1-Distill-Llama-70B, on GPQA Diamond, reached 68. ...
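The core loop, treating the prompt as mini training data, can be sketched in miniature. This is a toy reconstruction under stated assumptions: a unigram "model" whose output logits get a small additive delta, fitted by a few gradient steps on the prompt's own negative log-likelihood. The real SLOT operates on a full LLM; the model, step count, and learning rate below are all illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def slot_adapt(base_logits, prompt_ids, steps=5, lr=1.0):
    """Test-time adaptation in the spirit of SLOT: fit a small additive
    delta on the output logits by minimizing the prompt's average NLL.
    The unigram 'model' and hyperparameters are toy assumptions."""
    delta = [0.0] * len(base_logits)
    for _ in range(steps):
        probs = softmax([b + d for b, d in zip(base_logits, delta)])
        # gradient of average NLL w.r.t. logits: softmax - empirical distribution
        grad = probs[:]
        for t in prompt_ids:
            grad[t] -= 1.0 / len(prompt_ids)
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta
```

After a few steps, the adapted logits shift toward the prompt's own token statistics; generation then proceeds with the adapted parameters, and the delta is discarded before the next sample.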
CVPR 2025 Highlight | AdaCM2: the first cross-modal adaptive memory compression framework for extremely long video understanding
机器之心· 2025-06-09 04:33
The first author of this paper, 满远斌, formerly a senior technical expert at Alibaba DAMO Academy and now a first-year PhD student, works on efficient multimodal large-model inference and generation systems. The corresponding author is his advisor, 尹淼, an assistant professor of computer science at UTA, who leads a seven-person research team focused on multimodal spatial intelligence systems and aims to bring spatial AI into practice through joint software-system co-design.

In recent years, large language models (LLMs) have kept pushing the boundaries of multimodal understanding. Once language models can "watch videos," tasks such as video question answering, video summarization, and caption generation move toward a genuinely intelligent stage. But one practical challenge remains: how can extremely long videos be understood efficiently?

To this end, a research team from the computer science department of the University of Texas at Arlington (UTA) proposed AdaCM2: the first cross-modal memory compression framework supporting extremely long video understanding. The work has been accepted to CVPR 2025 as a Highlight paper (a 3% acceptance rate), demonstrating a dual breakthrough in technical novelty and practical value.

Paper title: AdaCM2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
Paper link: https://arxiv.o ...
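The general idea of cross-modal memory reduction, keeping only the visual memory that matters for the text query, can be sketched as follows. This is an illustrative sketch, not AdaCM2's actual mechanism: the cosine-similarity scoring, the hard memory cap, and the feature shapes are all assumptions for the example.

```python
def reduce_memory(frame_feats, query_feat, max_mem=4):
    """Keep only the visual memory entries most relevant to the text query,
    capping memory at max_mem entries. Cross-modal relevance is scored with
    cosine similarity; an illustrative sketch, not the paper's algorithm."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0
    # rank frame features by relevance to the query, then prune the tail
    ranked = sorted(frame_feats, key=lambda f: cos(f, query_feat), reverse=True)
    return ranked[:max_mem]
```

The point of such a cap is that memory stays bounded no matter how long the video grows, which is what makes hour-scale inputs tractable for a fixed-context model.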
Embodied intelligence drives progress toward artificial general intelligence
Ren Min Ri Bao Hai Wai Ban· 2025-06-09 04:19
Group 1
- The core idea of embodied intelligence emphasizes that cognition is influenced by the agent's perception and actions, suggesting that intelligence arises from the interaction between the agent's body and the surrounding environment, rather than solely from brain function [1][2]
- Embodied intelligence theory has profound implications across various fields such as cognitive science, psychology, anthropology, and art, leading to the emergence of sub-disciplines like embodied cognition and embodied psychology [1][2]
- The transition from traditional disembodied intelligence to modern embodied intelligence marks a significant shift in artificial intelligence research, where the latter integrates physical interaction with the environment for learning and decision-making [2][3]

Group 2
- The history of artificial intelligence has evolved through three stages: the first generation focused on knowledge-based reasoning models, the second generation introduced data-driven models, and the third generation, marked by the emergence of large language models, represents a new phase of development [3][4]
- The introduction of large language models in 2020 has enabled machines to achieve free interaction with humans in open domains, indicating a significant step towards general artificial intelligence [4][5]
- Despite advancements in language generation, there are still limitations in achieving domain generality across various tasks, particularly in complex areas like medical diagnosis, highlighting the need for embodied intelligence to bridge these gaps [5][6]

Group 3
- The concept of embodied intelligence was first proposed in the field of robotics, emphasizing the importance of the interaction between the body and the environment in intelligent behavior [6][7]
- Embodied intelligence has driven advancements in robotics technology, shifting from single-modal perception to multi-modal perception, which is crucial for applications like autonomous vehicles [8][9]
- The integration of the agent concept in embodied intelligence allows robots to combine thinking, perception, and action, facilitating tasks in both digital and physical worlds, and enhancing the efficiency of robotic development through simulation [9]
No new Siri at this week's WWDC? Wall Street questions Apple's AI capabilities
Hua Er Jie Jian Wen· 2025-06-09 02:43
Core Insights
- Apple's upcoming WWDC on June 9 is expected to disappoint investors due to ongoing challenges in upgrading Siri and integrating advanced large language models (LLM) into its AI functionality, "Apple Intelligence" [1][4]
- The integration of LLMs to enhance Siri's conversational abilities has faced significant technical difficulties, leading to numerous bugs that competitors like OpenAI and Google have not encountered [3][8]
- The delay in launching the upgraded Siri has resulted in a decline of approximately 18% in Apple's stock price since the beginning of 2025, making it the worst performer among the "Tech Seven" giants [4]

Siri Upgrade Challenges
- Apple is attempting to improve Siri's capabilities to respond more like a human, but the integration process has been plagued by bugs, which has hindered progress [3]
- A former Apple executive criticized the gradual development approach, stating that it cannot fundamentally transform Siri [3]
- Analysts suggest that it may take Apple three years or more to deliver a modernized AI assistant, significantly lagging behind competitors [8]

Market Reactions
- Investor sentiment has soured due to repeated delays in the "Apple Intelligence" feature, leading to low expectations for the upcoming WWDC [4]
- Analysts from Morgan Stanley and Bank of America have expressed concerns about Apple's ability to meet its previous commitments regarding AI advancements [4][8]

Strategic Focus Shift
- The upcoming WWDC may focus more on brand restructuring than on significant technological breakthroughs, with plans to rebrand operating systems and repackage existing features as "AI-driven" [9]
- Apple is expected to announce the opening of its foundational models to third-party developers, although its LLM capabilities are significantly less complex than those of competitors [9]
- Internal sources indicate that expectations for the AI segment of the conference are low, raising concerns about Apple's visibility in the AI space [9]
Optical chips are about to take off!
半导体行业观察· 2025-06-09 00:53
Large language models (LLMs) are rapidly approaching the limits of contemporary computing hardware. Training GPT-3, for example, is estimated to have consumed roughly 1,300 megawatt-hours (MWh) of electricity, and projections suggest future models may require city-scale (gigawatt-level) power budgets. This demand is driving the exploration of computing paradigms beyond the traditional von Neumann architecture.

This survey examines emerging photonic hardware optimized for next-generation generative AI computing. We discuss integrated photonic neural network architectures (such as Mach-Zehnder interferometer arrays, lasers, and wavelength-multiplexed microring resonators) that enable ultra-fast matrix operations. We also study promising alternative neuromorphic devices, including spiking neural network circuits and hybrid spin-photonic synapses, which fuse memory and computation. The paper further reviews progress on integrating two-dimensional materials (such as graphene and transition metal dichalcogenides, TMDCs) into silicon photonic platforms for tunable modulators and on-chip synaptic elements.

Against this hardware backdrop, we analyze Transformer-based large language model architectures (including self-attention and feed-forward layers), identifying strategies and challenges for mapping dynamic matrix multiplication onto this novel hardware. We then dissect the internals of mainstream large language models such as ChatGPT, DeepSeek, and Lla ...
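The Mach-Zehnder interferometer (MZI) mentioned above is the workhorse 2x2 building block of photonic matrix units: two 50:50 beam splitters around tunable phase shifters, composing into meshes that realize larger matrix-vector products. A minimal textbook-style sketch of one MZI's transfer matrix, using complex amplitudes; this models the idealized unitary algebra only, not any specific chip.

```python
import cmath
import math

def beam_splitter():
    """Ideal 50:50 beam splitter, a standard 2x2 unitary building block."""
    s = 1 / math.sqrt(2)
    return [[s, 1j * s], [1j * s, s]]

def phase(theta):
    """Phase shifter acting on the top waveguide arm."""
    return [[cmath.exp(1j * theta), 0], [0, 1]]

def matmul2(a, b):
    """2x2 complex matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mzi(theta, phi):
    """Mach-Zehnder interferometer: splitter, internal phase, splitter,
    input phase. Meshes of such 2x2 unitaries can implement larger
    matrix operations optically; an idealized sketch only."""
    return matmul2(matmul2(matmul2(beam_splitter(), phase(theta)),
                           beam_splitter()), phase(phi))
```

Since each factor is unitary, the composed transfer matrix is unitary for any choice of the two phases, which is what lets a mesh of MZIs encode a weight matrix without amplifying or attenuating total optical power.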
World-class mathematicians, stunned by test results, find AI models approaching mathematical genius
36Kr· 2025-06-08 23:49
"AI reasoning models are approaching mathematical genius"

One weekend in mid-May, a secret mathematics meeting convened. Thirty world-renowned mathematicians gathered in Berkeley, California. The group faced off against a "reasoning" chatbot tasked with solving problems they had designed to test its mathematical ability.

After two days of throwing professor-level problems at the bot, the researchers were stunned to find that it could answer some of the world's hardest solvable problems. "My colleagues literally said these models are approaching mathematical genius," said Ken Ono, a mathematician at the University of Virginia and a leader and judge at the meeting.

The chatbot was powered by o4-mini, a reasoning large language model (LLM) trained by OpenAI to perform highly complex reasoning. Google's comparable product, Gemini 2.5 Flash, has similar abilities. Like the LLMs behind earlier versions of ChatGPT, o4-mini learns to predict the next word in a sequence. Compared with earlier LLMs, however, o4-mini and its equivalents are lighter-weight and more flexible; they are trained on specialized datasets with reinforcement from humans. The approach lets the chatbot dig into complex math problems more deeply than traditional LLMs can.

To track o4-mini's progress, OpenAI had earlier commissioned Epoch AI (a company responsible for LLM bench ...
Interview with BAAI president Wang Zhongyuan: AI is accelerating from the digital world into the physical world
21 Shi Ji Jing Ji Bao Dao· 2025-06-08 11:49
21st Century Business Herald reporter 孔海丽, reporting from Beijing. At the 2025 BAAI Conference, humanoid robots were no longer mascots, and the person being "mobbed" by crowds changed from Yang Zhilin to Wang Xingxing.

Over the past year, AI has advanced rapidly, with iteration cycles even shorter than three months, and it is no longer confined to large language models; it has become a strong aid for training and deploying humanoid robots.

"Artificial intelligence is accelerating from the digital world into the physical world," BAAI president Wang Zhongyuan said bluntly in an interview with reporters including the 21st Century Business Herald. "AI should do some real, tangible things for the world and help free humans from tedious, repetitive, and simple labor."

The AI technical route is shifting toward world models

"Large-model technology is still far from the end of its development. The earlier 'war of a hundred models' was mostly a competition among large language models, which are constrained by their use of internet data; base-model performance is still improving, but not as fast as before." In Wang Zhongyuan's view, the remedies for the performance bottleneck of large language models fall into three areas: first, reinforcement learning to optimize reasoning ability; second, synthesizing high-quality data to replace human annotation; and third, activating massive, underused multimodal data, whose scale can reach "a hundred or even ten thousand times" that of text.

In BAAI's judgment, the large-model technical route will move from large language models toward multimodality, especially natively multimodal world models. Natively multimodal world models essentially aim to let AI perceive and understand the physical world, and in turn advance interaction with it. ...
One hospital has invested nearly 10 million yuan in AI, while top hospitals still wait and see on medical AI large models
news flash· 2025-06-08 11:13
In the first half of this year, medical AI large models became a hot track that hospitals raced to enter. To date, top tertiary hospitals including Shanghai's Zhongshan, Ruijin, and Renji have prominently announced AI models for disease areas such as cardiovascular medicine, pathology, and urology, and the companies supplying software and computing power for these models are gradually surfacing. From interviews, the reporter learned that few top tertiary hospitals are actually paying for medical AI large models; searching public records, the reporter found that most projects budgeting millions of yuan to procure medical large models are local-government procurements. Changzhou First People's Hospital launched two public tenders in the first half of this year to procure an AI medical large-model platform, with a total budget of nearly 10 million yuan. Industry insiders told Yicai that AI medical models have already shown application potential in vertical areas such as pathology, but the deployment of more general large language models (LLMs) still faces many challenges. (Yicai) ...
Put information navigation to good use
Jing Ji Ri Bao· 2025-06-07 22:05
How will large language models change our lives?

— They enhance our ability to collect and filter information.

— They can offer us multiple solutions, guiding us toward our goals in more convenient and faster ways.

On the relationship between new technology and the evolution of human society, there have long been two diametrically opposed views, one pessimistic and one optimistic. The former holds that, in the short term, new technologies, especially breakthrough ones, will shake human society; in the long term, technologies with a degree of self-improvement, like AI, deserve particular vigilance and, in extreme cases, could even threaten human survival. The latter is more inclined to believe that technological progress drives the evolution of human society, and that the risks faced and costs paid along the way are worth it.

In the author's view, however, both positions fall into the trap of technological determinism. Human wisdom gives us the capacity to weigh choices, and to imagine and plan for the potential scenarios each choice entails; this is a human gift that technology cannot override. For individuals and collectives alike, the more wisdom and energy humans can harness, the stronger their ability to achieve their goals. Blindly rejecting technological progress, and indulging in techno-utopian fantasies, both ignore this capacity. The author therefore advocates moderation: thinking prudently about the relationship between technology and humanity, enumerating the possibilities, weighing costs against benefits, and guiding technology toward good through active effort.

Now for the second point. Doesn't that answer feel oddly familiar? If you replace "information" with "route," and "goal" with "destination" ...