机器之心
The Ultimate PhD Application Guide: From Preparation to Decision, a Hands-On Path to Your Dream Offer
机器之心· 2026-01-08 09:34
机器之心 editorial team

PhD application season is here again. Applying is complex, tedious work: endless school research, agonizing choices of research direction, thick stacks of application materials, and interviews that can decide your fate... It is hard not to feel lost and anxious, or even to wonder whether all this effort will actually earn a ticket to your dream school. And in an interviewer's eyes, what does the "perfect candidate" actually look like?

Recently, Lucy Lai, a cognitive scientist and assistant teaching professor at UC San Diego, drew on her experience as a past applicant to Harvard's neuroscience PhD program, more than seven years of running mock interviews, and her time as an interviewer for a Harvard PhD program to produce an insider reference guide: "Everything About PhD Applications."

The guide covers common PhD interview questions and how best to answer them, how admissions decisions are actually made, and a detailed account of the qualities and factors admissions committees value. Below, we walk through the guide's advice.

General application tips

How do you know you want to go to graduate school? Before any preparation begins, one question must be settled: that you have truly decided to pursue graduate study. Lucy Lai suggests that if, in the course of this reflection, you feel your application materials are not yet strong enough, consider taking one or more gap years. A good way to judge whether your application is strong enough is to consult your research advisors: they have read and interviewed countless researchers and prospective graduate students, and can easily tell you where your application ...
"Hearing" Guides "Seeing": OmniAgent Opens a New Paradigm for Omni-Modal Proactive Perception
机器之心· 2026-01-08 09:34
Core Insights
- The article introduces OmniAgent, a proactive perception agent developed by Zhejiang University, Westlake University, and Ant Group, addressing pain points in cross-modal alignment and fine-grained understanding in end-to-end omni-modal models [2][7][19]
- OmniAgent employs an innovative "think-act-observe-reflect" closed-loop mechanism, transitioning from passive response to active inquiry, which enhances its performance in audiovisual understanding tasks [10][19]

Background and Pain Points
- End-to-end omni-modal models face high training costs and challenges in cross-modal feature alignment, leading to subpar performance in fine-grained cross-modal understanding [7]
- Fixed workflow-based agents rely on rigid, human-defined processes and lack the flexibility to autonomously plan and gather information based on the question [7]

Methodology
- OmniAgent strategically schedules video and audio understanding capabilities within an iterative reflection loop, effectively sidestepping the cross-modal alignment challenge [8][15]
- The agent autonomously decides whether to "listen" or "watch" based on its analysis of the question, drawing on a variety of multimodal tools for efficient information retrieval [15]

Performance Results
- OmniAgent achieved state-of-the-art (SOTA) results on multiple audiovisual understanding benchmarks, reaching 82.71% accuracy on the Daily-Omni Benchmark and surpassing Gemini 2.5-Flash (72.7%) and Qwen3-Omni-30B (72.08%) by over 10 percentage points [13]
- On OmniVideoBench, OmniAgent reached 59.1% accuracy on long-video understanding tasks, significantly outperforming Qwen3-Omni-30B (38.4%) [13]

Future Vision
- OmniAgent's design is highly extensible, allowing additional modal tools to be integrated [19]
- OmniAgent is positioned to help generate high-quality COTT data for developing next-generation omni-modal models capable of invoking tools themselves [19]
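For readers unfamiliar with this style of agent, here is a minimal, self-contained Python sketch of what a "think-act-observe-reflect" loop with "listen"/"watch" tools can look like. Every name in it (plan_step, listen, watch, reflect) is our own illustration; OmniAgent's actual tool set and prompting are not described at this level in the summary.

```python
# Minimal sketch of a "think-act-observe-reflect" loop in the spirit of
# OmniAgent. All function names and the routing heuristic are hypothetical;
# a real agent would delegate "think" and "reflect" to an LLM.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    observations: list = field(default_factory=list)

def plan_step(state: AgentState) -> str:
    """Think: choose the next action (placeholder heuristic, not an LLM)."""
    if not state.observations:
        return "listen" if "say" in state.question else "watch"
    return "answer"

def listen(state: AgentState) -> str:
    return "transcript: <audio content>"   # placeholder audio tool

def watch(state: AgentState) -> str:
    return "caption: <visual content>"     # placeholder video tool

def reflect(state: AgentState) -> bool:
    """Reflect: is the gathered evidence enough to answer?"""
    return len(state.observations) >= 1

def run_agent(question: str, max_steps: int = 4) -> str:
    state = AgentState(question)
    for _ in range(max_steps):
        action = plan_step(state)                        # think
        if action == "answer" and reflect(state):
            return f"answer grounded in {state.observations}"
        tool = listen if action == "listen" else watch   # act
        state.observations.append(tool(state))           # observe
    return "budget exhausted without an answer"

print(run_agent("What does the speaker say after the door opens?"))
```

The point of the loop is the one the summary highlights: the agent queries only the modality the question needs, rather than fusing all modalities up front.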
Widening the "Field of Play" of the Century-Old Olympics: Alibaba Cloud AI Lets Everyone Compete
机器之心· 2026-01-08 09:34
机器之心 editorial team

First, a video. Can you tell which one is AI-generated?

Video source: TikTok creator @tkp..1001

"Real footage or AI-generated?" A year ago the question was easy to answer, because some detail always gave the AI away at a glance. Now the line between real and fake keeps blurring. More and more genuinely real videos have comment sections arguing "this is AI, right?", while content that actually is AI-generated gets mistaken for real footage.

AI video generation is evolving at breakneck speed and seeping into every corner of our lives. The question that follows: how exactly should we live with these technologies?

The key may lie in human imagination. Technology should do more than replicate reality; it should imagine a better future through creative application.

From that vantage point, Alibaba Cloud has offered an imaginative answer: the Milan 2026 Winter Olympics. With 30 days to go before the Games, Alibaba Cloud, the official cloud services partner, joined the International Olympic Committee and the Milan Winter Olympics organizing committee to launch a global AIGC competition.

The competition slogan, "YOUR EPIC VIBE," echoes this edition's Olympic motto, "IT's Your Vibe." The rules are refreshingly simple: just use Alibaba Cloud's ...
Just In: Zhipu Rings the Bell, Listing at a Market Cap of HK$52.8 Billion
机器之心· 2026-01-08 02:06
A 机器之心 release

The world's first publicly listed large-model company is here! On January 8, 2026, Beijing Zhipu Huazhang Technology Co., Ltd. (02513.HK, hereafter "Zhipu") officially listed on the Hong Kong Stock Exchange.

He recalled that Zhipu introduced its self-developed GLM architecture in 2021, and that this year's GLM-4.7 release put the company among the world's leaders, laying a key foundation for its push toward AGI. "Z is the last letter of the alphabet and stands for the ultimate destination. On the road to AGI, we hope to reach the ultimate frontier of intelligence."

On the strength of its scarcity value as the world's first listed large-model company, Zhipu assembled an all-star cornerstone lineup of core Beijing state-owned capital, leading insurers, large mutual funds, star private funds, and industry investors: 11 cornerstone investors, including JSC International Investment Fund SPC, JinYi Capital Multi-Strategy Fund SPC, and Perseverance Asset Management, subscribed for a combined HK$2.98 billion.

With foundation models at the core, pushing the upper bound of intelligence

Zhipu was among the first Chinese companies to commit to large-model R&D. It originated GLM, a general pretraining paradigm based on autoregressive blank infilling, and was the first in China to release a 10-billion-parameter model, the first open-source 100-billion-parameter model, and the first conversational ...
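For readers curious about the "autoregressive blank infilling" paradigm mentioned above, here is a rough data-preparation sketch of that objective based on the published GLM paper. This is our own illustration, not Zhipu's code: spans of the input are masked out, then appended for the model to predict left to right.

```python
# Rough illustration of GLM-style autoregressive blank infilling:
# Part A is the corrupted text (attended bidirectionally in GLM);
# Part B holds the masked spans, predicted autoregressively in training.
import random

def blank_infilling_example(tokens, span_len=2, n_spans=1, seed=0):
    rng = random.Random(seed)
    tokens = list(tokens)
    spans = []
    for _ in range(n_spans):
        start = rng.randrange(0, len(tokens) - span_len)
        spans.append(tokens[start:start + span_len])
        tokens[start:start + span_len] = ["[MASK]"]  # span -> one [MASK]
    part_b = []
    for span in spans:
        part_b += ["[START]"] + span + ["[END]"]
    return tokens, part_b

a, b = blank_infilling_example("the quick brown fox jumps over the lazy dog".split())
print(a)  # e.g. ['the', 'quick', '[MASK]', 'jumps', ...]
print(b)  # e.g. ['[START]', 'brown', 'fox', '[END]']
```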
Going Deep on Perception-Level Image Understanding: UniPercept Unifies Aesthetics, Quality, and Structure & Texture Perception
机器之心· 2026-01-08 02:06
Core Insights
- The article discusses the development of UniPercept, a novel framework for perceptual image understanding that integrates the aesthetics, quality, and structure & texture dimensions, addressing the limitations of existing multimodal large language models in understanding visual perception [3][5]

Framework Overview
- UniPercept is the first framework to unify three perceptual dimensions: aesthetics, quality, and structure & texture, extending image understanding beyond mere object recognition to how images look [3][5]
- The framework includes a hierarchical definition system and a large-scale benchmark dataset called UniPercept-Bench, which allows for comprehensive evaluation of image attributes [5][10]

Evaluation System
- UniPercept-Bench features a three-tiered evaluation system comprising 3 domains, 17 categories, and 44 criteria, providing detailed expert-level definitions that surpass previous image-evaluation benchmarks [10][11]
- The evaluation dimensions include Image Aesthetics Assessment (IAA), Image Quality Assessment (IQA), and Image Structure & Texture Assessment (ISTA), each focusing on a different aspect of image perception [11][12]

Model Development
- The model employs domain-adaptive pre-training on a dataset of approximately 800,000 samples, which helps it learn low-level visual features across domains [22]
- Task-aligned reinforcement learning is used to enhance the model's perceptual consistency, with specific reward functions designed for visual rating (VR) and visual question answering (VQA) tasks [23][25]

Performance Metrics
- UniPercept outperforms existing top models across tasks, achieving the highest Spearman and Pearson correlation coefficients in aesthetics, quality, and structure assessment [29][30]
- In visual question answering, UniPercept shows a significant accuracy improvement over leading models, particularly at identifying subtle damage in images [31]

Applications
- UniPercept demonstrates potential as a reward model for generative models, optimizing image generation toward better compositional balance, detail sharpness, and structural richness [33][36]
- The framework's multi-dimensional reward signals work synergistically to improve both the visual appeal and the technical fidelity of generated images [37]
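The rating tasks above are scored with Spearman and Pearson correlations against human judgments. As a minimal sketch of that standard IQA/IAA protocol (the sample numbers below are made up for illustration):

```python
# Standard IQA/IAA evaluation: rank (Spearman) and linear (Pearson)
# correlation between model scores and human mean-opinion scores.
from scipy.stats import spearmanr, pearsonr

predicted    = [3.1, 4.5, 2.2, 4.9, 3.8]  # model's score per image
ground_truth = [3.0, 4.7, 2.0, 4.6, 4.1]  # human mean-opinion scores

srcc, _ = spearmanr(predicted, ground_truth)
plcc, _ = pearsonr(predicted, ground_truth)
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")  # higher is better, max 1.0
```

SRCC rewards getting the ranking of images right; PLCC additionally rewards a linear fit to the absolute scores, which is why benchmarks typically report both.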
From Overfitting to Generalization! ViMoGen Opens a New Era for 3D Human Motion Generation
机器之心· 2026-01-07 09:30
With the explosion of AIGC (Artificial Intelligence Generated Content), we have grown used to video generation models like Sora or Wan understanding wildly imaginative prompts such as "an astronaut doing a backflip on Mars." 3D human motion generation (3D MoGen), by contrast, lags noticeably behind.

Existing models perform well on standard datasets but still hit a clear ceiling in generalization. As soon as a user asks for complex interactions or rare motions unseen during training, the generated motion tends to lose naturalness, break down, or collapse into a bland average pose, which severely limits use in real-world scenarios and interactive systems.

A natural thought follows: video generation models have already picked up general physical laws and human behavior, so why not "distill" that knowledge into 3D human motion generation models?

Paper: https://arxiv.org/abs/2510.26794
Project page: https://linjing7.github.io/vimogen/

The three pillars of ViGen-to-MoGen

Researchers from Nanyang Technological University, SenseTime, Tsinghua University, The Chinese University of Hong Kong, and NVIDIA propose "The Quest for Generalizable Motion Generation: Data, ...
That's Right: Musk's Anime "Girlfriend" Has Been Built into a Razer Peripheral
机器之心· 2026-01-07 09:30
Core Viewpoint
- The article discusses the introduction of Project Ava, a desktop AI companion by Razer showcased at CES 2026, which features a 5.5-inch holographic capsule displaying a dynamic anime character, enhancing user interaction through advanced AI capabilities [1][3][5]

Product Features
- Project Ava is a 5.5-inch desktop holographic device featuring a 3D anime character that can perceive both the user and their computer screen, allowing for more engaging interaction [3][7]
- The device includes a camera, an ambient-light sensor, and dual far-field microphones, enabling human-like visual and auditory perception such as eye tracking and facial-expression recognition [7][19]
- Users can choose from five character designs, including customizable options, with future collaborations with internet celebrities planned for additional characters [5][10]

Target Audience and Market Strategy
- The target audience includes tech enthusiasts who enjoy customizing their desktop devices, with a pre-order price set at $20 [8]
- Razer aims to sell "one billion units" of Project Ava, signaling a strong market ambition [9]

User Interaction and Experience
- The AI companion is designed to assist users in various scenarios, such as providing game strategies or emotional support during gameplay, and offering professional advice in work settings [7][19]
- Project Ava's interaction style has drawn comparisons to previous AI models, with some users noting a flirtatious tone in the character's dialogue that may evoke mixed feelings [10][15]

Privacy Concerns
- The device's ability to continuously observe users raises privacy concerns: it can analyze user expressions and initiate conversations based on real-time observation, which could cause discomfort in sensitive environments [19]
Going to AAAI 2026 in Singapore? China Telecom's TeleAI Invites You to Dinner
机器之心· 2026-01-07 07:10
Core Viewpoint
- The article highlights the launch of the "TeleAI Top Talents" program by China Telecom's Institute of Artificial Intelligence, aimed at attracting and nurturing top-tier AI talent globally, with competitive compensation and resources to support core project development [7][22]

Event Details
- The 40th AAAI conference will take place from January 20 to 27, 2026, in Singapore, serving as a platform for AI technology exploration [5]
- The "TeleAI Top Talents" Night will be held on January 24, 2026, from 18:30 to 21:00 (UTC+8), providing an open setting for talent to engage with experts and scholars [10][9]
- The event venue is approximately 1.7 kilometers from the Singapore Expo, a 15-20 minute walk or a 7-minute taxi ride [11]

Program Highlights
- The "TeleAI Top Talents" program aims to recruit and cultivate leading AI talent, offering competitive salaries and high-standard resources for project leadership [7][9]
- The event will cover customized training plans and competitive compensation for selected talents, with participation from top experts in the field [9][15]
- Attendees can also explore career development and job opportunities at the TeleAI booth during the AAAI 2026 conference [20][21]

Research and Development Focus
- China Telecom's Institute of Artificial Intelligence (TeleAI) focuses on addressing national needs and building AI infrastructure, led by Professor Xuelong Li, a notable figure in the AI community [22]
- TeleAI is engaged in cutting-edge research areas such as AI Flow, generative intelligence transmission, and a comprehensive model system that has been recognized as a significant national asset [23][24]
- The institute's research spans generative technologies through AI safety and governance, ensuring alignment with human values and ethical standards [25]
A New Paradigm for Multimodal Reasoning! DiffThinker "Draws" Reasoning and Answers with Diffusion Models
机器之心· 2026-01-07 07:10
Core Viewpoint
- The article discusses the limitations of existing multimodal large language models (MLLMs) in visual reasoning tasks and introduces a new paradigm, Generative Multimodal Reasoning, exemplified by the model DiffThinker, which significantly improves performance on complex visual tasks [2][3][24]

Limitations of Current MLLMs
- Current MLLMs struggle to track changes in visual information during reasoning, leading to inaccuracies in tasks like spatial navigation and puzzle solving [9]
- The recent "Thinking with Image" paradigm, while innovative, scales poorly in complex scenarios due to high operational costs and reliance on multi-turn interactions [3][9]

Introduction of DiffThinker
- DiffThinker redefines the reasoning process from "text output" to "image-to-image" generation, using diffusion models to generate reasoning paths directly in visual space [3][11]
- The model outperforms top closed-source models, beating GPT-5 by a relative 314.2% and Gemini-3-Flash by 111.6% on complex visual tasks [3][20]

Core Features of DiffThinker
- Efficient reasoning: superior training and inference efficiency compared with traditional MLLMs, generating fewer tokens while maintaining higher accuracy [15]
- Controllable reasoning: a fixed-step Euler solver yields predictable output lengths and avoids issues such as infinite loops [17]
- Native parallel reasoning: the model can explore multiple potential paths simultaneously in visual space [17]
- Collaborative reasoning: the model can generate multiple visual candidates for MLLMs to validate, achieving better performance through collaboration [18]

Experimental Results
- In a systematic evaluation across seven complex tasks, DiffThinker achieved an average score of 87.4, far above GPT-5 (21.1) and Gemini-3-Flash (41.3) [20]
- Its performance on tasks such as VSP, TSP, Sudoku, and Jigsaw shows its effectiveness across varied visual-reasoning challenges [23]

Comparison with Video Generation
- A video version of DiffThinker was developed but proved less accurate and slower than the image-generation model, indicating that "thinking with images" is currently more efficient than "thinking with videos" [22]

Future Implications
- The emergence of DiffThinker marks the beginning of Generative Multimodal Reasoning, suggesting that moving reasoning from "text flow" to "visual flow" may be crucial for the next generation of general artificial intelligence [24][25]
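The "controllable reasoning" point rests on fixed-step ODE sampling: the number of model calls is decided up front, unlike token-by-token autoregression. Below is a minimal sketch of fixed-step Euler sampling for a diffusion/flow model; the velocity_model stand-in is ours, since DiffThinker's architecture is not described at this level in the summary.

```python
# Minimal sketch of fixed-step Euler sampling: the inference budget is
# exactly n_steps model calls, hence the predictable output cost the
# summary attributes to DiffThinker.
import torch

def velocity_model(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Placeholder for a learned velocity field v_theta(x, t)."""
    return -x  # toy dynamics so the sketch runs end to end

@torch.no_grad()
def euler_sample(x: torch.Tensor, n_steps: int = 20) -> torch.Tensor:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 in fixed Euler steps."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * velocity_model(x, t)  # one step; constant cost
    return x

noise = torch.randn(1, 3, 64, 64)        # start from Gaussian noise
image = euler_sample(noise, n_steps=20)  # always exactly 20 model calls
```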
Cracking AI Infra, the Hardest Part of Large Models, with Vibe Coding
机器之心· 2026-01-07 05:16
Core Insights
- The article discusses the challenges and potential of Vibe Coding in AI-infrastructure development, highlighting its limitations in complex systems and proposing a document-driven approach to enhance its effectiveness [3][5][20]

Challenges of Vibe Coding
- Vibe Coding faces three main issues: context loss, decision drift, and unstable quality, primarily due to the lack of a structured decision-management mechanism [4][5]
- The complexity of AI infrastructure, characterized by thousands of lines of code and many interdependent decision points, exacerbates these challenges [4][5]

Document-Driven Vibe Coding Methodology
- The document-driven approach systematizes key decisions during the design phase, significantly reducing complexity and improving code quality [6][20]
- By focusing on high-level design decisions, developers can delegate detailed code implementation to AI, achieving complex functionality with minimal hand-written code [7][20]

Implementation in Agentic RL
- The article presents a case study on optimizing GPU utilization in agentic reinforcement learning (RL) systems, which face significant resource-scheduling challenges [11][12]
- A proposed time-sharing reuse scheme dynamically allocates GPU resources, addressing the inefficiencies of existing solutions and improving overall system performance [14][15]

Performance Validation
- Experiments on a large-scale GPU cluster showed that the time-sharing reuse scheme increased rollout throughput by 3.5 times over traditional methods, significantly raising task-completion rates and reducing timeouts [46][50]
- The analysis indicates that the additional system overhead introduced by the new scheme is minimal, validating its practical value for large-scale agentic RL training [53][55]

Team and Future Directions
- The article concludes by introducing the ROCK & ROLL team, which focuses on advancing RL technologies and enhancing the practical application of large language models [57]
- The team emphasizes collaboration and open-source contributions to foster innovation in the RL community [58]
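To make the time-sharing idea concrete, here is a deliberately toy Python sketch: rollout (generation) and training alternate ownership of the same GPU pool instead of each holding a static partition. All names and the offload step are our illustration; the summary does not include the team's actual implementation.

```python
# Toy model of time-sharing reuse: one GPU pool, two phases per iteration.
class GPUPool:
    def __init__(self, n_gpus: int):
        self.n_gpus = n_gpus
        self.owner = None

    def acquire(self, owner: str):
        if self.owner not in (None, owner):
            # In a real system, weights/KV-cache would be offloaded to
            # host memory here before ownership switches.
            print(f"offloading {self.owner} state to host memory")
        self.owner = owner
        print(f"{owner} now holds all {self.n_gpus} GPUs")

def train_iteration(pool: GPUPool, prompts: list[str]):
    pool.acquire("rollout")                       # phase 1: generation
    trajectories = [f"traj({p})" for p in prompts]
    pool.acquire("training")                      # phase 2: policy update
    print(f"updating policy on {len(trajectories)} trajectories")

pool = GPUPool(n_gpus=8)
for _ in range(2):
    train_iteration(pool, ["task-a", "task-b"])
```

The contrast with static partitioning is the point: when rollout is the bottleneck, it temporarily gets the whole pool rather than a fixed fraction, which is the mechanism behind the reported throughput gain.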