机器之心
Global talent recruitment: Faster R-CNN and ResNet author Ren Shaoqing of USTC is recruiting professors, scholars, and students
机器之心· 2025-12-05 10:17
Core Viewpoint
- The article highlights the achievements and contributions of Professor Ren Shaoqing in the field of artificial intelligence, particularly in deep learning and computer vision, emphasizing his role in advancing key technologies that impact sectors such as autonomous driving and medical imaging [4][5][6].

Group 1: Academic Achievements
- Professor Ren has made foundational and pioneering contributions in deep learning, computer vision, and intelligent driving, with his research serving as a core engine for areas critical to the national economy and people's livelihood [5].
- His academic papers have been cited over 460,000 times, ranking him first among domestic scholars across all disciplines [5].
- He has received multiple prestigious awards, including the 2023 Future Science Prize in Mathematics and Computer Science and the NeurIPS 2025 Test of Time Award [5].

Group 2: Key Research Contributions
- The paper "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," which received the NeurIPS 2025 Test of Time Award, is considered a milestone in computer vision, having been cited over 98,000 times since its publication in 2015 [6].
- Faster R-CNN introduced a fully learnable two-stage pipeline that replaced traditional region-proposal methods, achieving high precision and near real-time detection and significantly influencing the development of visual models over the past decade [6].

Group 3: Research Institute and Talent Recruitment
- The General Artificial Intelligence Research Institute at the University of Science and Technology of China focuses on cutting-edge areas such as AI, world models, embodied intelligence, and autonomous driving, aiming for integrated innovation across research, talent cultivation, and industrial application [7].
- The institute is actively recruiting for positions including professors, researchers, postdoctoral fellows, engineers, and students at different academic levels, with a commitment to supporting high-level talent programs [9][10].
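For readers who want to see the two-stage design concretely, the pipeline survives intact in modern libraries; below is a minimal sketch using torchvision's public pretrained implementation (an illustration of the RPN-plus-detection-head structure, not the authors' original code).

```python
# Minimal sketch: running a pretrained Faster R-CNN via torchvision's public
# detection API. Illustrates the two-stage RPN + RoI-head design; this is
# library code, not the paper authors' original implementation.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # ResNet-50 FPN backbone + RPN + RoI heads
model.eval()

# A dummy 3-channel image with values in [0, 1]; replace with a real image tensor.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    # Stage 1: the RPN proposes candidate boxes.
    # Stage 2: the RoI heads classify and refine those proposals.
    predictions = model([image])[0]

print(predictions["boxes"].shape, predictions["labels"][:5], predictions["scores"][:5])
```

Both stages share the backbone features, which is what made the pipeline near real-time compared with the external proposal methods it replaced.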
Topping SuperCLUE DeepSearch: openPangu-R-72B's deep-search capability takes a leap
机器之心· 2025-12-05 10:17
Core Insights
- The article highlights the rapid development of large-model inference and agent tool capabilities, focusing on the recent SuperCLUE DeepSearch evaluation report, in which the domestic model openPangu-R-72B ranked first in complex information-retrieval tasks, showcasing the strength of domestic Ascend computing power in large-model development [1][15].

Model Performance
- In the SuperCLUE DeepSearch evaluation, openPangu-R-72B achieved a score of 73.33, outperforming models such as Gemini-3-Pro-Preview and GPT-5.1(high), which scored 70.48 [2].
- The model excelled across task categories, particularly in humanities and social sciences (75.47) and natural sciences (83.33) [2].

Technical Architecture
- openPangu-R-72B is based on a redesigned architecture that balances efficiency and performance, using a mixture-of-experts (MoE) model that selects 8 of its 80 experts per token, keeping 15 billion parameters active out of a total of 74 billion [4].
- The model was trained on 24 trillion tokens and can handle sequences of up to 128K tokens, which is crucial for deep-search tasks [4].

Optimization Techniques
- The model incorporates several optimizations, including parameterized Sink Token technology to stabilize training and improve quantization compatibility [7].
- It employs a combination of K-Norm and Depth-Scaled Sandwich-Norm architectures to reduce computational overhead while maintaining stability and expressive flexibility [7].
- The attention architecture has been optimized for precision and efficiency, achieving a 37.5% reduction in KV cache while enhancing the model's ability to capture fine-grained semantic relationships [7][8].

DeepSearch Capabilities
- The model's success in deep-search tasks is attributed to three key strategies: long-chain question-answering synthesis, non-indexed information processing, and an integrated fast-slow thinking approach [10].
- Long-chain QA synthesis raised the average difficulty of questions by 10% and introduced a verification agent to improve training accuracy [12].
- The model's workflow cycles through focusing on key URLs, crawling, and document QA to gather deep information beyond what traditional search engines surface [12].

Domestic Computing Power
- The achievement of openPangu-R-72B in the SuperCLUE DeepSearch evaluation underscores the effective integration of domestic computing power with large-model research and development [15].
- Its sibling model, openPangu-718B, also performed well, taking second place in the general ranking, indicating the comprehensive capabilities of the openPangu series across task scenarios [15].
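The 8-of-80 expert routing described above is standard top-k MoE gating, which is what keeps only a fraction of the total parameters active per token. A generic sketch of the mechanism follows (dimensions are illustrative; this is not openPangu's implementation).

```python
# Generic top-k mixture-of-experts routing sketch (not openPangu's code).
# Each token is routed to k=8 of 80 experts, so only a fraction of the
# total parameter count is active for any given token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=80, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # dispatch tokens to their experts
            idx = topk_idx[:, slot]
            for e in idx.unique():
                mask = idx == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```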
The end of text-based AI? Agents can collaborate by directly "copying thoughts," and token efficiency soars
机器之心· 2025-12-05 04:08
Core Insights
- The article discusses the emergence of multi-agent systems (MAS) in the Agentic AI era, emphasizing the shift from individual models to collaborative problem-solving among AI agents [2][5].
- A new framework called LatentMAS is introduced, which lets agents collaborate in latent space rather than through traditional text communication, improving both efficiency and performance [5][14].

Group 1: LatentMAS Framework
- LatentMAS enables agents to exchange internal hidden-layer representations and KV-cache working memory, yielding higher performance and reduced token usage [5][10].
- The framework is designed to support richer latent reasoning and lossless communication between agents, significantly lowering computational complexity compared with text-based MAS [15][16].

Group 2: Experimental Results
- Comprehensive experiments on nine benchmark tasks show that LatentMAS outperforms both single models and text-based MAS, with accuracy improvements of up to 14.6% and token-usage reductions of 70.8% to 83.7% [6][20][22].
- LatentMAS achieves end-to-end reasoning speedups of 4× to 4.3× over traditional methods, demonstrating its efficiency [21][25].

Group 3: Efficiency and Performance
- The framework supports complex reasoning processes while significantly reducing token counts, achieving higher accuracy with fewer output tokens [28][29].
- LatentMAS can deliver additional speedups of 2.6× to 7× over text-based MAS, even when the latter is served with an optimized vLLM deployment [25][28].

Group 4: Semantic Richness
- The latent representations generated by LatentMAS are shown to be semantically rich and diverse, surpassing the expressiveness of the discrete tokens used in text-based systems [30][31].
- The study indicates that the latent reasoning captured by LatentMAS is not only effective but also carries more nuanced internal representations than traditional methods [31][32].
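The KV-cache handoff at the heart of LatentMAS can be approximated with the standard Hugging Face cache API; here is a hedged sketch (the model name and prompts are placeholders, and the real framework's latent reasoning and alignment steps are more involved than this).

```python
# Hedged sketch of the core LatentMAS idea: agent B continues reasoning from
# agent A's KV-cache "working memory" instead of re-reading A's text output.
# Uses the standard Hugging Face cache API; LatentMAS itself is more involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model; any causal LM with a KV cache works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

# Agent A encodes the problem once, producing a KV cache ("working memory").
a_ids = tok("Problem: plan a three-step approach to sort a list.",
            return_tensors="pt").input_ids
with torch.no_grad():
    a_out = model(a_ids, use_cache=True)
cache = a_out.past_key_values  # handed to agent B losslessly, never decoded to text

# Agent B continues on top of A's cache, so no tokens are spent re-encoding
# A's context; this is the source of the token savings.
ids = tok(" Agent B refines step two:", return_tensors="pt").input_ids
generated = []
with torch.no_grad():
    for _ in range(20):  # manual greedy continuation from the shared cache
        out = model(ids, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        ids = out.logits[:, -1:].argmax(-1)  # feed only the newest token back
        generated.append(ids.item())
print(tok.decode(generated))
```

Note that cache sharing of this kind requires agents built on the same base architecture, which is consistent with the homogeneous-agent setting the summary describes.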
Former ByteDance tech lead founds a startup with Tsinghua Yao Class alumni; their coding agent tops global leaderboards
机器之心· 2025-12-05 04:08
Given the prompt "write a python code that visualizes how a traffic light works in a one way street with cars entering at random rate", the AI can generate a complete animated simulation program within seconds, including the red-yellow-green switching logic for the traffic light, a mechanism for randomly spawning vehicles, rules for deciding when cars stop or proceed, and even a smooth visualization interface.

But after the initial delight, problems follow. Although vibe coding excels at rapid prototyping and single-script development, it still falls short in the face of enterprise-grade, complex engineering. Constrained by context-window limits, shallow reasoning depth, and the absence of agentic modes, it often fails to precisely locate bugs buried deep in large codebases, and it easily triggers cascading errors when making cross-file, system-level modifications, especially in the low-level framework programming scenarios common to typed languages such as C++.

Now, 词元无限, a startup team from China, has offered its own answer. InfCode, a coding agent designed and developed under the leadership of Tsinghua Yao Class alumni, tops SWE-Bench Verified and Multi-SWE-bench- ...
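For reference, the kind of program that opening prompt elicits is compact; one possible rendition follows (a generic sketch, not any model's actual output).

```python
# One possible rendition of the prompt (a generic sketch, not any model's
# actual output): a one-way street where cars spawn at random and obey a
# red/yellow/green light at a fixed stop line.
import random
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

STOP_LINE, CAR_SPEED, SPAWN_PROB = 50, 1.5, 0.08
PHASES = [("green", 80), ("yellow", 20), ("red", 60)]  # (color, duration in frames)

cars = []  # car positions along the street; 0 = entrance, 100 = exit

def light_color(frame):
    """Return the light color for a given frame of the repeating cycle."""
    frame %= sum(d for _, d in PHASES)
    for color, duration in PHASES:
        if frame < duration:
            return color
        frame -= duration

fig, ax = plt.subplots(figsize=(8, 2))

def step(frame):
    global cars
    color = light_color(frame)
    # Random arrivals: spawn a car only if the entrance is clear.
    if random.random() < SPAWN_PROB and all(c > 4 for c in cars):
        cars.append(0.0)
    moved = []
    for x in sorted(cars, reverse=True):          # move the front car first
        target = x + CAR_SPEED
        if color != "green" and x < STOP_LINE:    # stop before the line on red/yellow
            target = min(target, STOP_LINE - 2)
        if moved:                                 # keep a gap behind the car ahead
            target = min(target, moved[-1] - 4)
        moved.append(max(x, target))              # cars never roll backwards
    cars = [x for x in moved if x < 100]          # cars exit at the far end
    ax.clear(); ax.set_xlim(0, 100); ax.set_ylim(-1, 1); ax.set_yticks([])
    ax.axvline(STOP_LINE, color=color, linewidth=4)
    ax.plot(cars, [0] * len(cars), "ks", markersize=8)
    ax.set_title(f"t={frame}  light={color}  cars={len(cars)}")

anim = FuncAnimation(fig, step, frames=400, interval=50)
plt.show()
```

A single-file script like this is exactly the sweet spot of vibe coding; the article's point is that the same approach breaks down once changes must span many files and modules.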
The first frame's real secret is revealed: video generation models treat it as "memory"
机器之心· 2025-12-05 04:08
At a time when Text-to-Video / Image-to-Video technology is advancing by leaps and bounds, we have grown used to a piece of common wisdom: the first frame of a generated video is merely the starting point of the timeline, the opening image for the animation that follows.

But can you imagine? The latest research finds that the first frame's true role is not a "starting point" at all. It is actually the video model's "conceptual memory buffer": every visual entity referenced by subsequent frames is quietly stored in this one frame. This article offers a quick look at what the breakthrough means.

The starting point of this study is the team's close examination of a phenomenon in video generation models that is widespread but had not been systematically studied.

First frame ≠ starting point; first frame = a large content cache (memory buffer)

The paper's core insight is bold: video generation models automatically "memorize" all the visual entities in the first frame, including characters, objects, textures, and layout, and keep reusing them in subsequent frames. In other words, no matter how many reference objects you provide, the model quietly packs them into a "conceptual blueprint" within the first frame.

This work comes from a research team spanning UMD, USC, and MIT. In Figure 2 of the paper, the team ran tests on video models such as Veo3, Sora2, and Wan2.2 and found that ...
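One way to make the claim measurable (a hypothetical probe, not the paper's protocol) is to hook attention weights out of a spatio-temporal transformer block and check how much attention later-frame queries put on first-frame keys; a memory-buffer first frame should attract far more than a uniform share.

```python
# Hypothetical probe of the "first frame as memory buffer" claim (not the
# paper's actual protocol): given attention weights from a spatio-temporal
# transformer layer, measure the attention mass each later frame pays to
# first-frame tokens.
import torch

frames, tokens_per_frame = 16, 64
seq = frames * tokens_per_frame

# Dummy attention weights (batch=1, heads=8, query, key); in a real probe,
# these would be hooked out of a video-diffusion transformer block.
attn = torch.softmax(torch.randn(1, 8, seq, seq), dim=-1)

first = tokens_per_frame  # first-frame tokens occupy key positions [0, first)
mass_on_first = attn[..., :first].sum(-1).mean(1).squeeze(0)  # per query token

for f in range(1, frames):
    q = slice(f * tokens_per_frame, (f + 1) * tokens_per_frame)
    share = mass_on_first[q].mean().item()
    print(f"frame {f:2d}: {share:.3f} of attention mass goes to the first frame")
# With random weights this hovers near the uniform share 1/16; the paper's
# claim predicts a much larger share in trained video models.
```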
Farewell to the "2D illusion": SpatialActor decouples semantics from geometry to give embodied intelligence robust spatial genes
机器之心· 2025-12-05 03:02
Core Insights
- The article discusses the limitations of existing robotic manipulation models that rely primarily on 2D images, which often lose critical depth information and 3D geometric structure [2][4].
- The proposed solution, SpatialActor, centers on disentanglement, separating semantic information from spatial geometric information to improve robots' understanding of and interaction with 3D environments [4][7].

Methodology and Architecture
- SpatialActor employs a dual-stream architecture that decouples visual and depth encoding, integrating a Semantic-Guided Geometry Module (SGM) and a Spatial Transformer (SPT) to improve robustness and accuracy on robotic tasks [10][11].
- The SGM combines robust geometric priors from a pretrained depth-estimation model with fine-grained but noisy depth features, optimizing the geometric representation while keeping it aligned with semantic cues, as sketched below [11][13].
- The SPT establishes precise 2D-to-3D mappings and integrates multi-modal features, which is crucial for generating accurate robotic actions [13].

Experimental Results
- SpatialActor achieved an average success rate of 87.4% across various simulated tasks, outperforming the previous state-of-the-art model RVT-2 by 6.0% [16][19].
- In noise experiments, SpatialActor demonstrated superior robustness, with average success rates improving by 13.9%, 16.9%, and 19.4% under light, medium, and heavy noise, respectively [19][20].
- Real-world experiments showed SpatialActor consistently outperforming RVT-2 by roughly 20% across tasks, confirming its effectiveness in complex environments [22][24].

Conclusion
- SpatialActor represents a significant advance in robotic manipulation by effectively decoupling semantic and geometric information, leading to improved robustness and generalization under diverse conditions [24][25].
- The framework highlights the importance of disentangled spatial representations for building more resilient and adaptable robotic systems [25][26].
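The SGM's role, correcting fine-grained but noisy depth with robust priors under semantic guidance, suggests a gated fusion; a minimal sketch of that idea follows (shapes and gating scheme are assumptions, not the released SpatialActor code).

```python
# Hedged sketch of an SGM-style fusion: robust depth-prior features correct
# noisy raw-depth features, with a gate conditioned on semantic features.
# Shapes and the gating design are illustrative assumptions, not
# SpatialActor's released implementation.
import torch
import torch.nn as nn

class SGMFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, sem, depth_prior, depth_raw):
        # sem:         semantic features from the RGB encoder        (B, N, dim)
        # depth_prior: robust priors from a pretrained depth model   (B, N, dim)
        # depth_raw:   fine-grained but noisy sensor-depth features  (B, N, dim)
        g = self.gate(torch.cat([sem, depth_prior, depth_raw], dim=-1))
        fused = g * depth_raw + (1 - g) * depth_prior  # trust raw depth only where it helps
        return self.proj(fused)

sgm = SGMFusion()
B, N, D = 2, 196, 256
out = sgm(torch.randn(B, N, D), torch.randn(B, N, D), torch.randn(B, N, D))
print(out.shape)  # torch.Size([2, 196, 256])
```

The gate is what makes the fusion robust to sensor noise: where raw depth is unreliable, it can fall back to the prior without discarding fine detail elsewhere.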
Just announced: the 2026 NVIDIA Fellowship recipients, with Chinese PhD students dominating the list at 80%
机器之心· 2025-12-05 03:02
机器之心 report, by the 机器之心 editorial team

The annual NVIDIA fellowships are out.

For twenty-five years, the NVIDIA Graduate Fellowship Program has supported graduate students doing outstanding work related to NVIDIA technologies.

Today, the program announced its 10 PhD recipients for 2026; each will receive up to $60,000 in funding to support research spanning all areas of computing innovation.

Their work focuses on frontier areas of accelerated computing, including autonomous systems, computer architecture, computer graphics, deep learning, programming systems, robotics, and security.

Eight of this year's 10 recipients are Chinese. Last year, 7 Chinese PhD students were selected, including alumni of Shanghai Jiao Tong University, USTC, and Zhejiang University.

Below is an introduction to this year's recipients.

Jiageng Mao, University of Southern California. Award citation: leveraging diverse priors from internet-scale data to solve complex physical AI problems, enabling robust, generalizable intelligence for embodied agents in the real world.

Public information shows that Jiageng Mao is a PhD student at USC researching physical AI, with the goal of bringing AI into the real world by developing algorithms across robotics, computer vision, and natural language processing. He is reportedly interested in intuitive physics and large vision-language(-action) ...
DeepSeek-V3.2 devours tokens, and GRPO turns out to be the culprit
机器之心· 2025-12-04 08:18
Core Insights
- The article discusses the release of the DeepSeek-V3.2 model, highlighting its performance issues, particularly token consumption and output verbosity, which have raised concerns among users and researchers [1][2][6].

Token Consumption and Efficiency
- DeepSeek-V3.2 Speciale uses tokens inefficiently, consuming 77,000 tokens on tasks for which Gemini needs only 20,000, more than three times the token expenditure for results of similar quality [1][6].
- Users note that DeepSeek-V3.2 Speciale generates roughly 30 tokens per second, and that reaching around 100 tokens per second would significantly improve usability and experience [6].

Output Quality and Verbosity
- The Speciale version tends to produce long, verbose outputs that are often incorrect, a tendency attributed to inherent flaws in the GRPO algorithm [2][15].
- In benchmark tests the model has a median score of 76.38, with a median gap of 11.07% versus other models, indicating a notable efficiency shortfall [7].

Comparison with Other Models
- In benchmark comparisons, DeepSeek-V3.2 Speciale's inference token consumption is significantly higher than its predecessor's: 86 million tokens versus 62 million for the previous version [7][10].
- Its performance metrics also trail competitors such as Gemini-3.0 Pro in output-token latency and efficiency [10][12].

Algorithmic Limitations
- The GRPO algorithm underlying DeepSeek has been criticized for introducing biases that lead to longer and often incorrect responses, a problem that persists in the latest model [16][20].
- Length bias, a significant flaw in GRPO, pushes the model toward longer responses even when they are wrong, and has been identified as a primary cause of DeepSeek-V3.2 Speciale's high token consumption [20][23].

Future Directions
- The developers acknowledge improved token efficiency as a critical goal for future research, aiming to balance performance and cost in subsequent model iterations [14][23].
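The length bias has a concrete algorithmic source: in the common GRPO formulation, each response's token losses are normalized by that response's own length, so a long wrong answer is penalized more weakly per token than a short wrong one (follow-up work such as Dr. GRPO removes this normalization). A toy numeric illustration:

```python
# Toy illustration of GRPO's length bias (generic formulation, not
# DeepSeek's training code). With per-response length normalization, a
# negative advantage divided over more tokens yields weaker per-token
# discouragement, so long wrong answers are cheap for the policy.
import numpy as np

rewards = np.array([1.0, 0.0, 0.0])   # one correct response, two wrong ones
lengths = np.array([120, 100, 900])   # the second wrong answer rambles

# Group-relative advantage, as in GRPO: standardize rewards within the group.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# In the common GRPO loss, each response's token losses are averaged over its
# own length, so gradient pressure per token scales as adv_i / |o_i|.
per_token = adv / lengths
for r, L, p in zip(rewards, lengths, per_token):
    print(f"reward={r:.0f}  len={L:4d}  per-token advantage={p:+.6f}")

# The 900-token wrong answer is discouraged roughly 9x more weakly per token
# than the 100-token wrong answer, nudging the policy toward verbosity when
# it is wrong; this is the length bias blamed for the token bloat.
```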
Crushing π0.5: a Fudan team pioneers a closed-loop "world model + embodied training + reinforcement learning" framework
机器之心· 2025-12-04 08:18
Zhang Jiahui is a third-year PhD student at the School of Data Science, Fudan University; his research covers embodied intelligence, vision-language-action model pretraining, and reinforcement-learning post-training, and he is first author of 4D-VLA (NeurIPS 25). Huang Ze is a third-year PhD student at the same school, working mainly on robot world models, 3D reconstruction, and generation. The two are co-first authors of this paper. Zhang Li, professor at the School of Data Science, Fudan University and full-time mentor at 上海创智学院, is the corresponding author. Homepage: https://lzrobots.github.io

Vision-Language-Action (VLA) policies are emerging as an important technical path toward general manipulation intelligence for robots: within a single unified model, such policies can process visual perception and language instructions and generate continuous control signals.

However, most current VLAs still rely mainly on imitation learning, in essence replaying demonstration trajectories; under distribution shift, changes in task form, or longer manipulation horizons, errors accumulate easily and tasks fail. Reinforcement learning (RL) directly optimizes task success from reward signals and should in principle alleviate this objective mismatch, but running online RL on real robots is expensive, parallel execution is limited, and it comes with substantial reset and annotation overhead; approaches represented by π*0.6 ...
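The appeal of closing the loop with a world model is that RL rollouts can happen in imagination rather than on hardware. A generic sketch of that training pattern follows (standard world-model RL structure, not the Fudan framework's code; the dynamics and reward models would be pretrained on real trajectories, left random here for brevity).

```python
# Generic "train the policy in imagination" loop (standard world-model RL
# structure, not the Fudan framework's code): the policy is optimized against
# rollouts of a learned dynamics model, avoiding costly real-robot resets.
import torch
import torch.nn as nn

latent_dim, action_dim, horizon = 64, 8, 15

# In practice these would be pretrained on real robot trajectories;
# random initialization here is for brevity only.
world_model = nn.Sequential(  # predicts next latent state from (state, action)
    nn.Linear(latent_dim + action_dim, 256), nn.ELU(), nn.Linear(256, latent_dim)
)
reward_head = nn.Linear(latent_dim, 1)  # predicts reward from a latent state
policy = nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(),
                       nn.Linear(256, action_dim), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)  # only the policy is updated

for step in range(100):
    z = torch.randn(32, latent_dim)      # latents encoded from real observations
    total_reward = 0.0
    for _ in range(horizon):             # imagined rollout; no real robot needed
        a = policy(z)
        z = world_model(torch.cat([z, a], dim=-1))
        total_reward = total_reward + reward_head(z).mean()
    loss = -total_reward                 # maximize predicted return by backprop
    opt.zero_grad()                      # through the (frozen) world model
    loss.backward()
    opt.step()
```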
Just now, the cloud computing leader made its move: AI agent freedom for everyone
机器之心· 2025-12-04 06:10
Core Insights
- The article discusses advances in Agentic AI, highlighting Amazon Web Services' (AWS) initiatives and innovations in this area and emphasizing the transformative potential of AI agents across industries [4][6][46].

Group 1: Agentic AI Developments
- Blue Origin's successful recovery of the New Glenn rocket was significantly aided by generative AI tools, including an internal platform called BlueGPT, which improved overall engineering speed by 75% [3][6].
- AWS's annual re:Invent conference showcased a range of new releases focused on Agentic AI, signaling a clear shift toward automation and efficiency in business processes [4][6].
- The emergence of AI agents is compared to the impact of the internet and cloud services, suggesting their influence on business operations could be equally profound [6][46].

Group 2: Technical Innovations
- AWS introduced the Strands Agents SDK, enabling developers to build AI agents in TypeScript, and added support for edge devices, opening a wide range of applications [9][10].
- The Amazon Bedrock service gained new capabilities for agent development, including policy definition and evaluation tools to keep agent behavior safe and compliant [11][20].
- New memory capabilities in AgentCore Memory let agents learn from past interactions, improving their decision-making over time [12].

Group 3: Model Customization and Efficiency
- AWS is focusing on customized AI models that perform specific tasks more efficiently, with tools that simplify the customization process [15][19].
- Amazon Nova Forge enables open training of models, integrating proprietary data with existing models to create tailored solutions [41].
- Amazon SageMaker HyperPod significantly reduces training cycle times and operational costs, improving the efficiency of AI model training [19].

Group 4: Future Outlook
- AWS envisions a future in which billions of AI agents operate across industries, delivering real value to organizations and individuals [46].
- The company reported revenue of $132 billion, a 20% increase year over year, driven by growing adoption of AI services among more than 100,000 enterprises worldwide [46].
- The article closes with an invitation to the upcoming AWS re:Invent event in China, underscoring the importance of keeping pace with the rapidly evolving AI landscape [47].
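The "agents that learn from past interactions" pattern behind AgentCore Memory is easy to state in code: persist a lesson from each episode and retrieve relevant ones into the next prompt. Below is a generic sketch in plain Python (not the AgentCore Memory or Strands Agents API; the file name and keyword retrieval are illustrative placeholders).

```python
# Generic sketch of long-term agent memory (plain Python; not the AgentCore
# Memory or Strands Agents API): persist lessons from past episodes and
# retrieve the most relevant ones into the next prompt.
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # illustrative store, not a real service

def load_memories():
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []

def save_memory(task, outcome, lesson):
    memories = load_memories()
    memories.append({"task": task, "outcome": outcome, "lesson": lesson})
    MEMORY_FILE.write_text(json.dumps(memories, indent=2))

def build_prompt(task):
    # Naive keyword retrieval; production systems use embeddings and ranking.
    relevant = [m for m in load_memories()
                if any(w in m["task"] for w in task.lower().split())]
    context = "\n".join(f"- Past: {m['task']} -> {m['lesson']}"
                        for m in relevant[-5:])
    return f"Relevant experience:\n{context}\n\nTask: {task}"

save_memory("refund request for order 1234", "resolved",
            "check the return-window policy before promising a refund")
print(build_prompt("handle a refund request"))
```

The managed services add durability, scoping, and retrieval quality on top of this basic loop, but the loop itself is the mechanism by which an agent's decisions improve over time.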