Diving into Perception-Level Image Understanding: UniPercept Unifies Image Aesthetics, Quality, and Structure & Texture Perception
机器之心· 2026-01-08 02:06
Core Insights
- The article discusses the development of UniPercept, a novel framework for perceptual image understanding that integrates the aesthetics, quality, and structure & texture dimensions, addressing the limitations of existing multimodal large language models in understanding visual perception [3][5].

Group 1: Framework Overview
- UniPercept is the first framework to unify three perceptual dimensions: aesthetics, quality, and structure & texture, enhancing the understanding of how images look beyond mere object recognition [3][5].
- The framework includes a hierarchical definition system and a large-scale benchmark dataset called UniPercept-Bench, which allows for comprehensive evaluation of image attributes [5][10].

Group 2: Evaluation System
- UniPercept-Bench features a three-tiered evaluation system comprising 3 domains, 17 categories, and 44 criteria, providing detailed expert-level definitions that surpass previous image evaluation benchmarks [10][11].
- The evaluation dimensions include Image Aesthetics Assessment (IAA), Image Quality Assessment (IQA), and Image Structure & Texture Assessment (ISTA), each focusing on a different aspect of image perception [11][12].

Group 3: Model Development
- The model employs domain-adaptive pre-training on a dataset of approximately 800,000 samples, which helps it learn low-level visual features across domains [22].
- Task-aligned reinforcement learning is used to enhance the model's perceptual consistency, with specific reward functions designed for visual rating (VR) and visual question answering (VQA) tasks [23][25].

Group 4: Performance Metrics
- UniPercept outperforms existing top models across tasks, achieving the highest Spearman and Pearson correlation coefficients in aesthetics, quality, and structure assessment (a sketch of this correlation protocol follows below) [29][30].
- In visual question answering tasks, UniPercept shows a significant accuracy improvement over leading models, particularly in identifying subtle damage in images [31].

Group 5: Applications
- UniPercept demonstrates potential as a reward model for generative models, optimizing image generation by enhancing composition balance, detail sharpness, and structural richness [33][36].
- The framework's multi-dimensional reward signals work synergistically to improve both the visual appeal and the technical fidelity of generated images [37].
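The Spearman and Pearson correlations cited above are the standard agreement metrics in IAA/IQA work: they compare model-predicted perceptual scores against human mean opinion scores (MOS). Below is a minimal sketch of that protocol using SciPy; the score arrays are invented for illustration and are not UniPercept outputs.

```python
# Standard IQA/IAA evaluation: rank (Spearman) and linear (Pearson)
# correlation between predicted scores and human mean opinion scores.
from scipy.stats import spearmanr, pearsonr

human_mos = [4.2, 3.1, 1.8, 4.8, 2.5, 3.9]     # hypothetical human ratings
model_scores = [4.0, 3.3, 2.0, 4.6, 2.2, 3.5]  # hypothetical model ratings

srcc, _ = spearmanr(human_mos, model_scores)  # monotonic agreement (rank order)
plcc, _ = pearsonr(human_mos, model_scores)   # linear agreement (magnitudes)

print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```

SRCC rewards getting the ranking of images right, while PLCC also penalizes miscalibrated magnitudes, which is why perceptual benchmarks typically report both.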
From Overfitting to Generalization! ViMoGen Ushers in a New Era of 3D Human Motion Generation
机器之心· 2026-01-07 09:30
With the explosion of AIGC (Artificial Intelligence Generated Content), we have grown used to video generation models like Sora or Wan understanding wildly imaginative prompts such as "an astronaut doing a backflip on Mars." The field of 3D human motion generation (3D MoGen), however, has lagged somewhat behind.

Existing models perform well on standard datasets but still hit a clear bottleneck in generalization: as soon as a user asks for a complex interaction or a rare motion unseen in the training set, the generated motion tends to lack naturalness, break down, or collapse into a bland average pose, which severely limits its use in real-world scenarios and interactive systems.

A natural question follows: video generation models have already picked up general physical laws and human behavior, so why not "distill" that knowledge into 3D human motion generation models?

Paper link: https://arxiv.org/abs/2510.26794
Project page: https://linjing7.github.io/vimogen/

The Three Pillars of ViGen-to-MoGen

Researchers from Nanyang Technological University, SenseTime, Tsinghua University, The Chinese University of Hong Kong, and NVIDIA proposed a paper titled "The Quest for Generalizable Motion Generation: Data, ...
That's Right: Musk's Anime "Girlfriend" Has Been Built into a Peripheral by Razer
机器之心· 2026-01-07 09:30
Core Viewpoint
- The article discusses the introduction of Project Ava, a desktop AI companion by Razer showcased at CES 2026, which features a 5.5-inch holographic capsule displaying a dynamic anime character and enhances user interaction through advanced AI capabilities [1][3][5].

Group 1: Product Features
- Project Ava is a 5.5-inch desktop holographic device featuring a 3D anime character that can perceive both the user and their computer screen, allowing for more engaging interaction [3][7].
- The device includes a camera, an ambient light sensor, and dual far-field microphones, enabling human-like visual and auditory perception such as eye tracking and facial expression recognition [7][19].
- Users can choose from five character designs, including customizable options, with plans for future collaborations with internet celebrities on additional characters [5][10].

Group 2: Target Audience and Market Strategy
- The target audience for Project Ava includes tech enthusiasts who enjoy customizing their desktop devices, with a pre-order price set at $20 [8].
- Razer aims to sell "one billion units" of Project Ava, signaling a strong market ambition [9].

Group 3: User Interaction and Experience
- The AI companion is designed to assist users in various scenarios, such as providing game strategies or emotional support during gameplay and offering professional advice in work settings [7][19].
- Project Ava's interaction style has drawn comparisons to previous AI models, with some users noting a flirtatious tone in the character's dialogue that may evoke mixed feelings [10][15].

Group 4: Privacy Concerns
- The device's ability to continuously observe users raises privacy concerns, as it can analyze user expressions and initiate conversations based on real-time observations, potentially causing discomfort in sensitive environments [19].
At AAAI 2026 in Singapore? China Telecom's TeleAI Invites You to Dinner
机器之心· 2026-01-07 07:10
Core Viewpoint
- The article highlights the launch of the "TeleAI Top Talents" program by China Telecom's Institute of Artificial Intelligence, aimed at attracting and nurturing top-tier AI talent globally, with competitive compensation and resources to support core project development [7][22].

Group 1: Event Details
- The 40th AAAI conference will take place from January 20 to 27, 2026, in Singapore, serving as a platform for AI technology exploration [5].
- The "TeleAI Top Talents" Night will be held on January 24, 2026, from 18:30 to 21:00 (UTC+8), providing an open platform for talent to engage with experts and scholars [10][9].
- The event location is approximately 1.7 kilometers from the Singapore Expo venue, accessible by a 15-20 minute walk or a 7-minute taxi ride [11].

Group 2: Program Highlights
- The "TeleAI Top Talents" program aims to introduce and cultivate leading AI talent, offering competitive salaries and high-standard resources for project leadership [7][9].
- The event will feature discussions on customized training plans and competitive compensation for selected talents, along with participation from top experts in the field [9][15].
- Attendees will have the opportunity to explore career development and job opportunities at the TeleAI booth during the AAAI 2026 conference [20][21].

Group 3: Research and Development Focus
- China Telecom's Institute of Artificial Intelligence (TeleAI) focuses on addressing national needs and building AI infrastructure, led by Professor Xuelong Li, a notable figure in the AI community [22].
- TeleAI is engaged in cutting-edge research areas such as AI Flow, generative intelligence transmission, and the development of a comprehensive model system that has been recognized as a significant national asset [23][24].
- The institute's research includes various AI applications, from generative technologies to AI safety and governance, ensuring alignment with human values and ethical standards [25].
A New Paradigm for Multimodal Reasoning! DiffThinker "Draws" Its Reasoning and Answers with a Diffusion Model
机器之心· 2026-01-07 07:10
Core Viewpoint
- The article discusses the limitations of existing multimodal large language models (MLLMs) in visual reasoning tasks and introduces a new paradigm called Generative Multimodal Reasoning, exemplified by the model DiffThinker, which significantly improves performance on complex visual tasks [2][3][24].

Group 1: Limitations of Current MLLMs
- Current MLLMs struggle to track changes in visual information during reasoning, leading to inaccuracies in tasks like spatial navigation and puzzle solving [9].
- The recent "Thinking with Image" paradigm, while innovative, faces scalability issues in complex scenarios due to high operational costs and reliance on multi-turn interactions [3][9].

Group 2: Introduction of DiffThinker
- DiffThinker redefines the reasoning process from "text output" to "image-to-image" generation, using diffusion models to generate reasoning paths directly in visual space [3][11].
- The model has shown remarkable performance improvements, outperforming top closed-source models such as GPT-5 by 314.2% and Gemini-3-Flash by 111.6% on complex visual tasks [3][20].

Group 3: Core Features of DiffThinker
- Efficient Reasoning: DiffThinker demonstrates superior training and inference efficiency compared to traditional MLLMs, generating fewer tokens while maintaining higher accuracy [15].
- Controllable Reasoning: the model uses a fixed-step Euler solver, giving predictable output lengths and avoiding issues like infinite loops (a minimal sampling sketch follows below) [17].
- Native Parallel Reasoning: DiffThinker can explore multiple potential paths simultaneously in visual space, enhancing the reasoning process [17].
- Collaborative Reasoning: the model can generate multiple visual candidates for validation by MLLMs, achieving better performance through collaboration [18].

Group 4: Experimental Results
- In a systematic evaluation across seven complex tasks, DiffThinker achieved an average score of 87.4, far above GPT-5 (21.1) and Gemini-3-Flash (41.3) [20].
- The model's performance on tasks such as VSP, TSP, Sudoku, and Jigsaw showcases its effectiveness across varied visual reasoning challenges [23].

Group 5: Comparison with Video Generation
- A video version of DiffThinker was also developed, but it proved less accurate and slower than the image-generation model, indicating that "thinking with images" is currently more efficient than "thinking with videos" [22].

Group 6: Future Implications
- The emergence of DiffThinker marks the beginning of a new era of Generative Multimodal Reasoning, suggesting that moving the reasoning process from "text flow" to "visual flow" may be crucial for the next generation of general artificial intelligence [24][25].
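The "controllable reasoning" point comes down to a property of fixed-step ODE samplers: generation always costs exactly `num_steps` network evaluations, so there is no analogue of a text model rambling or looping forever. Here is a minimal sketch of fixed-step Euler sampling for a conditional diffusion/flow model; `velocity_model`, its signature, and the tensor shapes are illustrative assumptions, not DiffThinker's actual interface.

```python
import torch

@torch.no_grad()
def euler_sample(velocity_model, cond_image, num_steps=20, shape=(1, 3, 256, 256)):
    # Fixed-step Euler integration of a learned velocity field: the loop
    # runs exactly num_steps times, so output cost is fully predictable.
    x = torch.randn(shape)                       # start from pure noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for i in range(num_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v = velocity_model(x, t, cond_image)     # predicted velocity at time t
        x = x + dt * v                           # one explicit Euler step
    return x                                     # generated "reasoning" image
```

Contrast this with autoregressive text decoding, where the number of steps depends on when the model emits a stop token and can in principle never terminate.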
Tackling AI Infra, the Hardest Part of Large Models, with Vibe Coding
机器之心· 2026-01-07 05:16
Core Insights
- The article discusses the challenges and potential of Vibe Coding in AI infrastructure development, highlighting its limitations in complex systems and proposing a document-driven approach to enhance its effectiveness [3][5][20].

Group 1: Challenges of Vibe Coding
- Vibe Coding faces three main issues: context loss, decision deviation, and quality instability, primarily due to the lack of a structured decision-management mechanism [4][5].
- The complexity of AI infrastructure, characterized by thousands of lines of code and numerous interrelated decision points, exacerbates these challenges [4][5].

Group 2: Document-Driven Vibe Coding Methodology
- The document-driven approach systematizes key decisions during the design phase, significantly reducing complexity and improving code quality [6][20].
- By focusing on high-level design decisions, developers can leverage AI for the detailed code implementation, achieving complex functionality with minimal hand-written code [7][20].

Group 3: Implementation in Agentic RL
- The article presents a case study on optimizing GPU utilization in Agentic Reinforcement Learning (RL) systems, which face significant resource-scheduling challenges [11][12].
- A proposed time-sharing reuse scheme dynamically allocates GPU resources, addressing the inefficiencies of existing solutions and improving overall system performance (a toy sketch of the idea follows below) [14][15].

Group 4: Performance Validation
- Experiments on a large-scale GPU cluster showed that the time-sharing reuse scheme increased rollout throughput by 3.5 times compared with traditional methods, significantly raising task completion rates and reducing timeouts [46][50].
- The analysis indicates that the additional system overhead introduced by the new scheme is minimal, validating its practical value in large-scale Agentic RL training [53][55].

Group 5: Team and Future Directions
- The article concludes with an introduction to the ROCK & ROLL team, which focuses on advancing RL technology and the practical application of large language models [57].
- The team emphasizes collaboration and open-source contributions to foster innovation in the RL community [58].
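The time-sharing scheme described in Group 3 boils down to letting the rollout (inference) engine and the training engine alternate on one GPU pool instead of reserving two separate pools that each sit idle half the time. The toy sketch below captures that alternation; every class and method name here is a hypothetical placeholder, not the team's actual interface.

```python
class TimeSharedGPUPool:
    """Toy illustration of time-sharing one GPU pool between rollout and
    training phases in an agentic RL loop. All engine methods are stand-ins."""

    def __init__(self, rollout_engine, train_engine):
        self.rollout, self.train = rollout_engine, train_engine
        self.active = None

    def _switch_to(self, engine):
        if self.active is engine:
            return                      # already resident, no swap needed
        if self.active is not None:
            self.active.offload()       # move weights/KV cache to host memory
        engine.load()                   # bring the other engine onto the GPUs
        self.active = engine

    def rollout_phase(self, prompts):
        self._switch_to(self.rollout)
        return self.rollout.generate(prompts)   # collect agent trajectories

    def train_phase(self, trajectories):
        self._switch_to(self.train)
        return self.train.step(trajectories)    # one RL update

# Usage (engines must expose load/offload/generate/step):
#   pool = TimeSharedGPUPool(inference_engine, trainer)
#   pool.train_phase(pool.rollout_phase(prompts))
```

The scheme pays an offload/reload cost at each switch; the summary's point is that this overhead is small relative to the idle GPU time it eliminates.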
Tinker, the First Startup Product from OpenAI's Former CTO, Is Now Fully Upgraded and Open Here, with Freebies to Grab
机器之心· 2026-01-07 05:16
Core Insights
- The article discusses the launch of the Luchenyun Fine-tuning SDK, built on the Tinker SDK from Thinking Machines Lab, marking a shift from "craft-style" model training to "industrialized fine-tuning" [1][3][26].
- The SDK lets developers focus on algorithm design while abstracting away the complexities of distributed training infrastructure, enabling a more efficient and cost-effective approach to fine-tuning large models [4][6][26].

Group 1: Technological Advancements
- The Tinker SDK simplifies the training process by providing standard APIs for the core training functions, allowing developers to define data and loss functions without worrying about infrastructure [4][6].
- The SDK supports both supervised fine-tuning (SFT) and complex reinforcement learning (RL) pipelines, letting users construct training flows from atomic functions (a hypothetical sketch of this style follows below) [8][24].

Group 2: Cost Structure and Efficiency
- The Luchenyun SDK adopts a serverless architecture with a pay-per-token pricing model: users pay only for the effective computation tokens used during prefill, sampling, and training, while the other stages are free [14][18].
- This pricing model sharply reduces budget wasted on non-productive time, since users are no longer charged for GPU usage during data loading or debugging [14][18].

Group 3: User Experience and Accessibility
- The SDK offers a seamless experience, letting users work in familiar environments such as Jupyter Notebook with standard Python syntax, thus enhancing productivity [8][10].
- The system includes an intelligent queue that ensures tasks are executed promptly, with no charges during waiting periods, optimizing resource utilization [12].

Group 4: Target Users and Applications
- The SDK caters to several user groups, including researchers who can run experiments without worrying about infrastructure and startups that need to validate MVPs rapidly [19][20].
- In industrial applications, the SDK lets engineers define loss logic and reinforcement learning reward functions, providing complete control over model training [21].

Group 5: Future Outlook
- The article emphasizes that post-training is evolving from an academic niche into a mainstream engineering focus, aiming for a "zero cognitive load" developer experience [26].
- The Luchenyun Fine-tuning SDK is now fully open for use, with promotional offers for early adopters, indicating a push for widespread adoption [27][28].
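To make the "atomic function" style concrete, here is a hypothetical sketch of the kind of training loop the summary describes: the user owns the data and loss choice, while forward/backward passes, optimizer steps, and sampling are individual remote calls executed on managed infrastructure. `client` and every method on it are illustrative placeholders, not the documented Luchenyun or Tinker API.

```python
def supervised_finetune(client, dataset, epochs=1, lr=1e-5):
    """Minimal sketch of an atomic-call fine-tuning loop, assuming a
    hypothetical remote client. Distributed execution, checkpointing, and
    scheduling are all handled on the server side."""
    for _ in range(epochs):
        for batch in dataset:
            # User-defined data and loss; gradients accumulate remotely.
            client.forward_backward(batch, loss_fn="cross_entropy")
            client.optim_step(lr=lr)            # apply accumulated gradients
    # Quick smoke test: sample from the freshly tuned model.
    return client.sample(prompt="Hello", max_tokens=32)
```

Under a pay-per-token model, only the tokens processed inside `forward_backward` and `sample` would be billed; time spent iterating `dataset` or sitting in the queue costs nothing.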
A Major Shake-Up for Attention? Bengio's Team Finds a Hardware-Aligned Scheme That Goes Beyond the Transformer
机器之心· 2026-01-07 05:16
Core Insights
- The article discusses the evolution of large language models (LLMs) and highlights the limitations of existing linear-recurrence and state-space models in computational efficiency and performance [1][3].
- A new approach proposed by Radical Numerics and the University of Montreal team redefines linear recurrences as hardware-aligned matrix operations, aiming to improve GPU memory utilization and computational efficiency [1][2].

Group 1: Challenges and Limitations
- The primary challenge identified is breaking through the "memory wall" of linear recurrences, whose performance is limited by high communication costs on modern hardware [3][7].
- Traditional parallel-scan algorithms, while theoretically efficient, suffer from data-access patterns that force frequent global memory synchronization and thus fail to exploit data locality [4][5][6].

Group 2: Proposed Solutions
- The paper introduces Sliding Window Recurrences (SWR), which achieve high throughput by strategically truncating the computational horizon, using a jagged window structure that aligns with hardware workloads [10][11].
- The Block Two-Pass (B2P) algorithm implements this theory, splitting the computation into two phases to optimize memory access and minimize data movement (a toy two-pass sketch follows below) [14][15].

Group 3: Phalanx Layer and Performance
- A new compute layer called Phalanx is built on the B2P algorithm, serving as a drop-in replacement for sliding-window attention or linear-recurrence layers while remaining numerically stable on long sequences [19][20].
- In systematic tests on a 1.3-billion-parameter model, the Phalanx hybrid model showed clear performance advantages, achieving 10% to 40% end-to-end speedups in training throughput across context lengths [23][24].

Group 4: Industry Implications
- The paper's findings indicate that true efficiency in LLMs comes not merely from reduced algorithmic complexity but from a deep understanding of, and alignment with, the physical characteristics of the underlying hardware [31][32].
- As LLMs evolve toward larger contexts and real-time embodied intelligence after 2025, hardware-aware operator design will be crucial for building more efficient and powerful AI systems [33].
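The block two-pass idea can be illustrated on a scalar linear recurrence h[t] = a[t]*h[t-1] + x[t]: pass one scans each block independently (those scans can run in parallel and stay in fast local memory), and a cheap carry pass then stitches the blocks together. The NumPy sketch below is a toy version of that structure, not the paper's Phalanx/B2P kernel.

```python
import numpy as np

def block_two_pass(a, x, block=64):
    """Compute h[t] = a[t]*h[t-1] + x[t] (state before t=0 is 0) blockwise."""
    T = len(x)
    h = np.empty(T)
    cum = np.empty(T)  # running product of a within the current block

    # Pass 1 (independent per block, parallelizable): local scan from state 0.
    for s in range(0, T, block):
        hloc, dec = 0.0, 1.0
        for t in range(s, min(s + block, T)):
            hloc = a[t] * hloc + x[t]
            dec *= a[t]
            h[t], cum[t] = hloc, dec

    # Pass 2: propagate one carry per block, then fix each block up in bulk.
    h_in = 0.0                          # true state entering the block
    for s in range(0, T, block):
        e = min(s + block, T)
        h[s:e] += cum[s:e] * h_in       # vectorized correction per block
        h_in = h[e - 1]                 # exact state at the block boundary
    return h

# Quick check against the naive sequential scan:
rng = np.random.default_rng(0)
a, x = rng.uniform(0.8, 1.0, 500), rng.standard_normal(500)
ref, s = np.empty(500), 0.0
for t in range(500):
    s = a[t] * s + x[t]
    ref[t] = s
assert np.allclose(block_two_pass(a, x), ref)
```

Pass 1 touches each element exactly once inside its block (good locality), and the only sequential dependency left is one scalar carry per block, which is the structural property the summary attributes to the B2P design.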
Nearly a Decade Later, Google and Boston Dynamics "Join Hands" Again, This Time to Give Humanoid Robots a "Soul"
机器之心· 2026-01-07 00:49
Core Viewpoint
- Boston Dynamics and Google DeepMind have announced a new AI partnership aimed at ushering in a new era of artificial intelligence for humanoid robots, with a focus on enhancing industrial tasks and transforming the manufacturing sector, particularly in the automotive industry [1][7].

Group 1
- The collaboration will integrate DeepMind's advanced Gemini Robotics AI model with Boston Dynamics' new Atlas humanoid robot [6].
- The joint research efforts are expected to commence in the coming months, with activities taking place within both companies [8].
- Boston Dynamics aims to create the world's most capable humanoid robot and sees DeepMind as the ideal partner to develop a new vision-language-action (VLA) model for these complex robots [9].

Group 2
- DeepMind's Gemini Robotics model is designed to bring AI into the physical world, enhancing the capabilities of Boston Dynamics' Atlas robots [10].
- The partnership is viewed as a strong alliance, with DeepMind providing the intelligence and Boston Dynamics offering a top-tier hardware platform [10].
- The combination of Gemini Robotics' foundational capabilities with the Atlas hardware represents a significant advancement in embodied intelligence for robotics [12].

Group 3
- The collaboration has generated excitement among observers, with some anticipating a competitive showdown between Western robots like Atlas and their Chinese counterparts [13].
- Historical context: this is not the first collaboration between the two companies; Google acquired Boston Dynamics in 2013 but sold it after the market failed to meet expectations [14].
- The renewed partnership reflects matured technology conditions, with both companies now better positioned to leverage each other's strengths [14][15].

Group 4
- The significance of this collaboration raises the question of which company stands to gain more: is this Boston Dynamics' victory, or the beginning of a new chapter for Google in robotics [15]?
- The partnership is poised to create a future where humans and machines coexist and collaborate [16].
Once Scornful of AI, Now Generating 70,000 Lines of Code in Two Weeks: A Rust Luminary Teams Up with Claude to Build the New Language Rue
机器之心· 2026-01-07 00:49
Core Insights
- The article discusses Steve Klabnik's journey with Rust and his new programming language, Rue, highlighting the evolution of his perspective on AI as a valuable tool in software development [1][3][21].

Group 1: Klabnik's Perspective on AI
- Klabnik transitioned from being an AI skeptic to recognizing the practical benefits of AI tools in coding, particularly with the use of Claude for generating code [3][10].
- He emphasizes that AI serves as a high-level tool, enhancing productivity for those with a foundational understanding of software engineering principles [10][21].

Group 2: Introduction of Rue
- Rue is a new programming language designed by Klabnik, aiming to bridge the gap between high-performance languages like Rust and more accessible languages like Go [6][20].
- The name Rue reflects both a sense of self-deprecation and a botanical reference, indicating a blend of good and bad qualities [6].

Group 3: Development Process of Rue
- The Rue project has rapidly accumulated around 70,000 lines of Rust code within two weeks, showcasing the efficiency of AI-assisted coding [8][20].
- Klabnik's workflow involves AI (Claude) handling the implementation details while he focuses on design and architecture [14][20].

Group 4: Rust's Role in AI Programming
- Rust's strict compiler serves as a quality-control mechanism, ensuring that AI-generated code meets safety and type-checking standards [13][19].
- This strictness, once seen as a barrier for beginners, is now viewed as an advantage in the context of AI programming, as it helps eliminate critical errors [17][21].

Group 5: Future of Programming Roles
- Klabnik's experiment with Rue suggests a shift in developer roles from "bricklayers" to "architects," where human developers focus on higher-level design while AI handles more routine coding tasks [21].