机器之心
"How many fingers does one hand have?" Did your GPT-5 get it right?
机器之心· 2025-08-11 10:40
Core Viewpoint
- The article discusses the limitations of advanced language models like GPT-5 in understanding basic visual concepts, highlighting the need for vision-centric models to improve visual comprehension and reasoning [2][26].

Group 1
- Tairan He points out that while language is a powerful tool, it struggles to fully meet the needs of the vision and robotics fields [2].
- He calls for the development of vision-centric language models (VLMs) and vision-language-action (VLA) models to address these shortcomings [3].
- The ambiguity in the definition of "fingers" illustrates the challenges language models face in interpreting visual information accurately [4][6].

Group 2
- Even top models like Gemini 2.5 Pro have failed to answer basic questions correctly, indicating a lack of robust visual understanding [10][24].
- Tairan He references a paper by Saining Xie's team that proposes a rigorous method for evaluating the visual capabilities of multimodal large language models (MLLMs) [28].
- The new benchmark, CV-Bench, evaluates object counting, spatial reasoning, and depth perception, establishing stricter assessment standards [31].

Group 3
- While advanced VLMs can reach 100% accuracy at recognizing common objects, performance drops to about 17% on counterfactual images [33].
- VLMs lean on memorized knowledge rather than true visual analysis, which limits their effectiveness [34].
- Martin Ziqiao Ma argues that initializing VLA models from large language models is a tempting but misleading approach, as it does not address fundamental perception issues [36].
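The reported gap (near-perfect accuracy on canonical images, a steep drop on counterfactual ones) comes down to exact-match scoring of predicted counts. A minimal sketch of such a scoring harness, with made-up toy predictions standing in for model outputs:

```python
def counting_accuracy(predictions, labels):
    """Fraction of exact matches between predicted and true object counts."""
    assert len(predictions) == len(labels)
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Toy numbers only: a model answering from memorized priors nails the
# canonical images but keeps giving the canonical count on altered ones.
common = counting_accuracy([5, 4, 2], [5, 4, 2])           # canonical labels
counterfactual = counting_accuracy([5, 4, 2], [6, 4, 3])   # altered labels
```

The same prior-driven predictions that score 1.0 against canonical labels score only 1/3 against the counterfactual ones, which is the failure mode the benchmark is designed to expose.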
4D Spatial Intelligence: How does AI step by step "understand" spatiotemporal structure? A survey maps the five levels on the road to the four-dimensional world
机器之心· 2025-08-11 10:40
4D spatial intelligence reconstruction is a core challenge in computer vision: its goal is to recover the dynamic evolution of three-dimensional space from visual data. By integrating static scene structure with spatiotemporal dynamics, the technique builds spatial representations that carry a time dimension, proving critical in virtual reality, digital twins, and intelligent interaction.

Current research spans two technical dimensions: at the foundational reconstruction level, work focuses on accurately extracting low-level visual elements such as depth estimation, camera localization, and dynamic point clouds; at the higher understanding level, it aims to parse the spatiotemporal relations and physical constraints among scene components.

This multi-dimensional spatial modeling capability is becoming infrastructure for the next generation of AI: whether for building the environment-cognition stack of embodied agents or for training world models with physical common sense, high-fidelity 4D spatial representations serve as the bedrock.

Notably, frontier research is shifting from pure geometric reconstruction toward modeling scene physical properties and interaction logic, a shift that lets spatial intelligence not only render visually realistic dynamic scenes but also support lifelike interaction between agents and virtual environments.

To fill the gap in analysis of 4D spatial intelligence reconstruction, researchers from S-Lab at Nanyang Technological University, the Hong Kong University of Science and Technology, and Texas A&M University comprehensively surveyed the field's development and state-of-the-art methods, writing a survey that systematically organizes and analyzes more than 400 representative papers. Paper: Reconstructing 4D Spatial ...
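At the foundational layer the survey describes, depth estimation and camera parameters combine into point clouds. A minimal sketch of that step, back-projecting a depth map through a pinhole camera model (the intrinsics `fx, fy, cx, cy` here are arbitrary illustrative values, not from any paper in the survey):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (H*W, 3) point cloud using
    the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# A flat 2x2 depth map at unit distance, principal point at the image center.
pts = depth_to_points(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Stacking such per-frame clouds over time is the simplest form of the "4D" representation the survey builds up from; dynamic methods additionally estimate per-point motion.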
Zhipu finally releases the GLM-4.5 technical report: details from pre-training to post-training laid bare
机器之心· 2025-08-11 07:12
Core Viewpoint
- The article highlights the release of GLM-4.5 and GLM-4.5-Air, which integrate reasoning, coding, and agentic capabilities into a single model, achieving the highest ranking among domestic and open-source models across 12 global benchmarks [2][11][19].

Group 1: Model Performance and Reception
- GLM-4.5 achieved third place globally across 12 recognized benchmarks, outperforming all domestic and open-source models [2][19].
- The announcement drew significant attention, with over 1.2 million views on social media and seven consecutive days atop the Hugging Face trending list [2][3].
- The technical report was voted "#1 Paper of the day" by Hugging Face users [13].

Group 2: Technical Innovations
- GLM-4.5 employs a Mixture-of-Experts (MoE) architecture, improving computational efficiency during training and inference [21][24].
- Training comprised pre-training on 15 trillion tokens and mid-training on 7 trillion tokens, with the maximum sequence length expanded from 4K to 128K [25][27].
- The slime framework supports efficient reinforcement-learning training, addressing common bottlenecks in agentic tasks [31][34].

Group 3: Key Capabilities
- GLM-4.5 integrates three core capabilities: agentic ability for real-world interaction, complex reasoning for multi-step problem-solving, and advanced coding for software-engineering tasks [19][22].
- In agentic benchmarks such as TAU-bench and BFCL V3, it showed superior results against competitors [44].
- In reasoning benchmarks including AIME 24 and SciCode, it outperformed OpenAI's models [47][50].

Group 4: Code Task Performance
- GLM-4.5 excelled in code benchmarks, outperforming GPT-4.1 and Claude Sonnet 4 on SWE-bench Verified and Terminal-Bench [52][53].
- Its overall coding performance positions it as a strong competitor to Claude Sonnet 4 [53].

Group 5: Future Implications
- The technical report offers insight into the development direction of domestic open-source large models and serves as a key reference for future research [56][57].
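The MoE architecture mentioned above activates only a few experts per token via a gating network. A generic top-k routing sketch for intuition (this is the standard textbook mechanism, not GLM-4.5's specific routing, which the report details):

```python
import numpy as np

def topk_gate(logits, k=2):
    """Generic MoE top-k routing: keep the k largest gate logits per token,
    renormalize them with a softmax, and zero out all other experts."""
    idx = np.argsort(logits)[::-1][:k]            # indices of the top-k experts
    w = np.exp(logits[idx] - logits[idx].max())   # numerically stable softmax
    w /= w.sum()
    gates = np.zeros_like(logits)
    gates[idx] = w
    return gates

# Four experts, one token: only experts 1 and 3 receive nonzero weight.
g = topk_gate(np.array([0.1, 2.0, 0.3, 2.0]), k=2)
```

Because only k experts run per token, compute per token stays nearly constant while total parameters grow with the expert count, which is the efficiency property the report attributes to the architecture.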
From defender to guide: Shanghai Jiao Tong University & Shanghai AI Lab propose LEGION — not just a nemesis of AI image forgery, but fuel for the evolution of generative models?
机器之心· 2025-08-11 07:12
Core Viewpoint
- The rapid advancement of text-to-image models has significantly improved the quality and detail of generated images, but it has also enabled misuse, fueling a growing public trust crisis as real and AI-generated images become hard to distinguish [3][4][9].

Group 1: Development of AI Image Generation
- Text-to-image models have transitioned from early GAN architectures to diffusion and autoregressive models, greatly lowering the barrier to high-quality image creation [4].
- AI-generated imagery has benefited fields such as design, education, and art, but has also enabled serious problems like fraud and misinformation [4][9].

Group 2: Trust Crisis and Detection Challenges
- As the realism of AI-generated images increases, the public faces an escalating trust crisis, with authenticity ever harder to discern [9][12].
- Existing forged-image datasets have limitations, prompting the creation of SynthScars, a new dataset focused on purely AI-generated images and their characteristic flaws [15][12].

Group 3: Proposed Solutions
- The research team proposes a three-pronged approach to synthetic-image detection: building high-quality datasets, designing interpretable forgery-analysis models, and balancing detection against generation [12][15].
- The LEGION framework uses multimodal large language models (MLLMs) for image-forgery analysis, integrating detection, forgery localization, and anomaly explanation into a unified pipeline [17][20].

Group 4: Performance and Robustness
- LEGION outperforms existing models across tasks with fewer parameters, particularly in anomaly explanation and forgery detection [24][27].
- The framework remains robust under various distortions, maintaining stability where traditional expert models degrade [27][28].

Group 5: Synergy Between Detection and Generation
- LEGION can serve both as a protector of image security and as a catalyst for higher-quality generation, with methods proposed to refine generated images based on detected anomalies [33][37].
- Techniques such as global prompt optimization and localized semantic repair address identified flaws to enhance generated-image quality [37][40].
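The three subtasks LEGION unifies (image-level detection, pixel-level localization, textual explanation) imply a structured output. A toy sketch of that output shape, with a threshold-based stand-in for the real MLLM analyzer (the field names and logic here are illustrative, not LEGION's actual interface):

```python
from dataclasses import dataclass

@dataclass
class ForgeryAnalysis:
    """Unified result bundling the three subtasks: a verdict, a per-pixel
    forgery mask (flattened toy list here), and a text explanation."""
    is_fake: bool
    mask: list
    explanation: str

def analyze(pixel_scores, threshold=0.5):
    """Toy stand-in for a forgery analyzer: thresholds per-pixel anomaly
    scores and flags the image if any region exceeds the threshold."""
    mask = [s > threshold for s in pixel_scores]
    return ForgeryAnalysis(
        is_fake=any(mask),
        mask=mask,
        explanation="high-anomaly regions flagged" if any(mask) else "no anomaly found",
    )

r = analyze([0.1, 0.9, 0.3])
```

The detection-generation synergy then amounts to feeding `mask` and `explanation` back as repair signals, e.g. as localized inpainting targets or prompt edits.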
Brains already lose to AI — are hands next? This dexterous hand has me a little rattled
机器之心· 2025-08-11 04:27
Core Viewpoint
- The article discusses the evolution and significance of human hand dexterity, highlighting the challenges and advances in building robotic hands that replicate it, particularly through an innovative flexible transmission system [5][7][39].

Group 1: Evolution and Functionality of the Human Hand
- The human hand evolved from the simple structures of early primates into a complex system combining strength and flexibility, enabling precise manipulation of objects [1][3].
- The thumb's development has been crucial for fine motor skills such as gripping and writing, which account for a large share of hand functionality [9][11].

Group 2: Challenges in Robotic Hand Development
- Current robotic hands often fail to replicate the dexterity and adaptability of human hands; many products are too bulky or rigid, limiting real-world functionality [5][12][15].
- Roughly 80% of existing robotic hands are underutilized, largely because designs chase higher degrees of freedom (DOF) rather than actual dexterity, the more critical measure of usability [15][16].

Group 3: Innovations in Robotic Hand Technology
- A new robotic hand showcased at the World Robot Conference demonstrated remarkable flexibility and coordination, closely mimicking human hand movements and capabilities [6][7].
- Its maker, Lingqiao Intelligent, adopted a flexible transmission system inspired by human tendon mechanics, enabling lightweight, efficient operation at high dexterity [23][30].

Group 4: Market Position and Future Directions
- Lingqiao Intelligent prioritizes quality over quantity, focusing on high reliability and performance for genuine applications rather than showpieces [34][35].
- The company aims to advance the field by enhancing sensory capabilities and intelligent control systems, targeting industrial applications such as automotive manufacturing and electronics [36][40].
The 2nd "Xingzhi Cup" National AI Innovation Application Competition special event opens tomorrow: a one-stop platform for technical deep dives and resource matchmaking!
机器之心· 2025-08-11 04:27
To implement the decisions of the CPC Central Committee and the State Council on accelerating innovative development of the AI industry, the Ministry of Industry and Information Technology, the Ministry of Science and Technology, and the Shenzhen Municipal People's Government jointly hosted the first "Xingzhi Cup" National AI Innovation Application Competition. Demand-driven, it propelled a batch of key technological breakthroughs, accelerated the fusion of AI with key industries, and became the largest AI competition in China with the most diverse range of participants.

To further leverage the role of "competition promoting research, application, and talent," the 2nd "Xingzhi Cup" National AI Innovation Application Competition (the "Competition") is co-hosted by the China Academy of Information and Communications Technology, the Shenzhen AI Industry Office, the Authority of Qianhai Shenzhen-Hong Kong Modern Service Industry Cooperation Zone, and the Bao'an District People's Government of Shenzhen. Since its official launch on May 8, 2025, it has attracted more than a thousand teams and over ten thousand contestants.

The Competition will hold special events via online livestream on August 12-13, inviting leading scholars and enterprise experts from nearly 20 top organizations in AI and key application sectors to analyze the competition problems, discuss application scenarios, and share trends, focusing on key and hot topics in AI technology breakthroughs and industrial deployment, promoting industry through competition and cultivating a batch of highlight technologies and applications.

The event also features a Q&A and exchange session (add the official assistant on WeChat: AIAC_WX for real-time questions), offering a chance to talk directly with leading experts in each field; the Competition organizing committee has carefully ...
ICCV 2025 | Robots autonomously exploring unknown complex spaces? GLEAM cracks the generalization problem of active exploration and mapping
机器之心· 2025-08-11 04:27
The first author of this work is Xiao Chen, a joint PhD student at MMLab, The Chinese University of Hong Kong and the Embodied AI Center of the Shanghai AI Laboratory, researching 3D computer vision and embodied AI under the supervision of Prof. Tianfan Xue. Homepage: xiao-chen.tech/.

Research background

When humans enter an unfamiliar room, they grasp its layout by moving around and observing. Imagine a robot dropped into an unfamiliar scene: some rooms are piled with obstacles, some corridors twist and turn. Can it actively explore the unknown space the way a human does?

Why has "active exploration," a cornerstone of intelligence, remained a technical blind spot? Classical approaches often rely on manually preset trajectories, viewpoints, and instructions, and existing exploration policies frequently fail in unfamiliar, complex scenes: a robot may get stuck in a corner during rubble rescue for lack of global planning, or oscillate helplessly amid collisions in an obstacle-dense living room. When robots operate in such environments, how can the perception-decision-action loop break free of passive dependence? This is the core challenge for next-generation robots trying to cross the "intelligence gap."

How can robots autonomously explore completely unknown, complex rooms? Targeting the generalization problem of "exploration and mapping" for mobile robots in complex unknown environments, The Chinese University of Hong Kong and the Shanghai AI Laboratory jointly propose a systematic solution: the researchers built GLEAM-Bench, the world's largest "exploration-mapping" benchmark, covering more than a thousand indoor scenes, and on this basis designed a general ...
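For context on what "exploration and mapping" policies replace, the classical baseline is frontier-based exploration: steer toward free cells that border unknown space. A minimal grid sketch of frontier detection (a generic textbook baseline for intuition, not GLEAM's learned policy):

```python
def frontiers(grid):
    """Find frontier cells in an occupancy grid where 0 = free,
    1 = occupied, -1 = unknown. A free cell is a frontier if it is
    4-adjacent to at least one unknown cell."""
    h, w = len(grid), len(grid[0])
    out = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] != 0:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] == -1:
                    out.append((r, c))
                    break
    return out

# Free cells on the border of the unknown right-hand region are frontiers.
f = frontiers([[0, 0, -1],
               [1, 0, -1],
               [0, 0,  0]])
```

A classical planner repeatedly drives to the nearest frontier until none remain; the brittleness of such hand-crafted loops in cluttered scenes is exactly the failure mode a learned, generalizable policy targets.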
Robot context protocol open-sourced for the first time: Alibaba's DAMO Academy releases the embodied-AI "big three" in one go
机器之心· 2025-08-11 03:19
Published by 机器之心 (editorial team)

Open-source links:
- Robot Context Protocol RynnRCP: https://github.com/alibaba-damo-academy/RynnRCP
- Vision-language-action model RynnVLA-001: https://github.com/alibaba-damo-academy/RynnVLA-001
- World understanding model RynnEC: https://github.com/alibaba-damo-academy/RynnEC

Embodied AI is advancing rapidly, but it still faces major challenges such as fragmented development pipelines and the difficulty of adapting data, models, and robot bodies to one another.

On August 11, at the World Robot Conference, Alibaba's DAMO Academy announced the open-sourcing of its self-developed VLA model RynnVLA-001-7B, world understanding model RynnEC, and Robot Context Protocol RynnRCP, promoting compatibility among data, models, and robots and connecting the full embodied-AI development pipeline.

Specifically, RynnRCP comprises two main modules: the RCP framework and RobotMotion. The RCP framework connects the robot body with its sensors, provides standardized capability interfaces, and achieves compatibility between different transport layers and model services. RobotMotion, in turn, sits between the embodied large model and the robot bo...
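A "standardized capability interface" of the kind the RCP framework describes typically means a common message envelope that any transport layer can carry. A hypothetical sketch of such an envelope; the field names here are invented for illustration and are not taken from the RynnRCP spec:

```python
import json

def make_command(robot_id, action, params):
    """Serialize a command into a hypothetical standardized envelope,
    illustrating the kind of transport-agnostic message a robot context
    protocol could define. Field names are invented for this sketch."""
    return json.dumps(
        {"robot_id": robot_id, "action": action, "params": params},
        sort_keys=True,
    )

# Any transport (WebSocket, MQTT, serial, ...) can carry the same bytes,
# and any model service can emit them without knowing the robot's SDK.
msg = make_command("arm-01", "move_joint", {"joint": 3, "angle": 0.5})
decoded = json.loads(msg)
```

Decoupling the message schema from the transport is what lets one model service drive different robot bodies, which is the compatibility problem the protocol targets.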
The origin of the attention sink? Tsinghua & Meituan reveal the super-expert mechanism in MoE LLMs for the first time
机器之心· 2025-08-11 03:19
Sparsely activated Mixture-of-Experts (MoE) models, through dynamic routing and sparse activation, have greatly boosted the learning capacity of large language models (LLMs), showing remarkable potential. On this architecture, advanced MoE LLMs such as DeepSeek and Qwen have emerged.

However, as parameter counts balloon, efficient deployment and inference have become a new challenge. Academia and industry have therefore converged on model compression, especially "expert-level compression" for MoE models: by pruning, quantizing, or merging, researchers remove or simplify "non-critical" experts, shrinking the model substantially while preserving performance.

Analyzing differences in expert importance not only advances more efficient compression but also offers a key lens on the internal behavior of MoE LLMs. Yet existing methods mostly rely on empirical criteria to identify important experts, without probing deeply into what expert importance means. This work therefore focuses on a previously overlooked question: does there universally exist, in MoE LLMs, a subset of experts that play a critically important role during forward inference?

Through in-depth empirical analysis of multiple mainstream open-source MoE LLMs (including the DeepSeek series, the Qwen3 series, and Mixtral), researchers from Tsinghua University and Meituan are the first to discover and confirm the widespread existence of this special, crucial expert subset. Although these experts are extremely few in number, ...
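One simple way to operationalize "a tiny subset of disproportionately important experts" is an outlier test on per-expert activation statistics. A sketch of such a criterion (an illustrative heuristic, not necessarily the paper's exact definition of super experts):

```python
import statistics

def outlier_experts(norms, k=1.5):
    """Flag experts whose average activation norm exceeds the population
    mean by more than k standard deviations. This is an illustrative
    outlier criterion, not the paper's exact selection rule."""
    mu = statistics.fmean(norms)
    sd = statistics.pstdev(norms)
    return [i for i, n in enumerate(norms) if n > mu + k * sd]

# Four ordinary experts and one with an order-of-magnitude larger norm.
ids = outlier_experts([1.0, 1.1, 0.9, 1.0, 9.0], k=1.5)
```

Under any such criterion, the compression implication is direct: the flagged experts must be protected during pruning or quantization, while the rest can be simplified more aggressively.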
Token crisis solved? Diffusion models show 3× the data potential of autoregressive models, with performance still climbing after 480 rounds of repeated training
机器之心· 2025-08-10 04:31
Core Viewpoint
- The article discusses advances establishing diffusion language models (DLMs) as superior data learners compared to autoregressive (AR) models, particularly in data-constrained environments [1][8].

Group 1: Token Crisis and Research Findings
- The research addresses the looming token crisis for large language models (LLMs): the supply of high-quality training text is dwindling, capping model performance [2][3].
- The team pre-trained DLMs and AR models from scratch, scaling up to 8 billion parameters and 480 billion tokens [3][4].

Group 2: Performance Comparison
- With limited tokens, DLMs outperform AR models, demonstrating more than three times the data potential [5][8].
- A DLM trained on 1 billion tokens reached 56% accuracy on the HellaSwag benchmark and 33% on MMLU, significantly surpassing AR models [14].

Group 3: Repeated Training Benefits
- Repeated training on the same dataset keeps improving performance, with DLMs showing no sign of saturation even after extensive re-training [14][19].
- The study indicates that DLMs extract more effective information from a fixed dataset, yielding better metrics [14][19].

Group 4: Mechanisms Behind DLMs' Superiority
- DLMs use a bidirectional modeling approach, extracting more information from web data than the purely causal modeling of AR models [19][22].
- DLMs are described as "super-dense models," converting their computational density into intelligence [22][24].

Group 5: Methodological Critique of Related Research
- The article critiques a concurrent study, highlighting methodological flaws that may skew its conclusions about DLMs versus AR models [25][30].
- It stresses that the loss function used in that study does not accurately represent model likelihood, potentially yielding misleading results [26][32].
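The bidirectional-versus-causal distinction in Group 4 comes down to the attention mask: an AR model lets token i see only positions up to i, while a diffusion LM's denoiser attends in both directions. A minimal sketch of the two masks (an illustration of the modeling difference, not either paper's training code):

```python
import numpy as np

def causal_mask(n):
    """AR models: token i may attend only to positions <= i
    (lower-triangular mask)."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    """Diffusion LMs: every token may attend to every position, so each
    prediction conditions on both left and right context."""
    return np.ones((n, n), dtype=bool)

c, b = causal_mask(3), bidirectional_mask(3)
```

For n tokens, the causal mask permits n(n+1)/2 attention pairs versus n² for the bidirectional one, a simple way to see why each pass over the same data exposes a DLM to more conditioning patterns than an AR model.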