机器之心
A World First: 灵初智能 Releases Psi-SynEngine, a Real-World Embodied Data Collection Engine for Dexterous Hands
机器之心· 2025-12-11 00:43
Core Insights
- The article highlights the launch of Psi-SynEngine, the world's first embodied native human data collection solution developed by the company, which aims to create a leading general-purpose operational intelligence system [3][10]
- Psi-SynEngine addresses the data collection challenges in the embodied intelligence field by directly capturing operational data from frontline workers in real-world scenarios, rather than relying on high-cost, low-fidelity setups [4][6]

Data Collection Advantages
- Psi-SynEngine offers three main advantages over traditional data collection methods: low cost, high multi-modal data capture capability, and strong portability, enabling large-scale parallel data collection [5][7]
- The portable data collection devices significantly reduce deployment costs, with data acquisition costing only 10% of teleoperation solutions [7]

Data Engine and Dataset Features
- The Psi-SynNet-v0 dataset, released alongside Psi-SynEngine, features strong data diversity, comprehensive modality coverage, massive scale, and a validated closed-loop data system, enhancing model transfer and generalization [9][12]
- The dataset aims to bridge the gap between human and robotic operation, addressing the structural and capability differences between human hands and robotic manipulators [9]

Future Prospects
- The establishment of Psi-SynEngine and Psi-SynNet-v0 marks a new paradigm for embodied AI, with plans to scale the dataset to over one million hours, positioning it as the largest dexterous-manipulation dataset globally [10][12]
- The company invites global research institutions and partners to collaborate on building the Psi-SynNet dataset, aiming to usher in a new era of general intelligence [10]
Diffusion Language Model Inference Too Slow? A Peking University Team Proposes the ODB-dLLM Framework to Break the Dual Compute and Memory Bottlenecks
机器之心· 2025-12-11 00:43
To address this shortcoming, a research team from Peking University proposes a new dLLM inference acceleration framework, ODB-dLLM (Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models). By analyzing the interleaved compute-bound and memory-bound stages in existing dLLM inference frameworks, it introduces an adaptive length-prediction strategy and skip-and-share speculative decoding to optimize the compute and memory-access characteristics of dLLMs on hardware platforms and maximize inference efficiency.

The work was completed by a Peking University research team. The corresponding author is 李萌 (Li Meng), assistant professor and doctoral advisor at Peking University's Institute for Artificial Intelligence and School of Integrated Circuits, and head of PKU SEC Lab; his research focuses on efficient and secure AI acceleration algorithms and chips, aiming to build energy-efficient, reliable, and secure compute infrastructure for AI through cross-layer algorithm-to-chip co-design and optimization. The first author is 韦临烨 (Wei Linye), a first-year PhD student in the School of Integrated Circuits at Peking University, whose main research direction is multimodal efficient AI systems and accelerator design.

Diffusion-based large language models (dLLMs) unlock native parallel decoding and controllable generation through global decoding and bidirectional attention, and have recently attracted wide attention. For example, F ...
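The "arithmetic intensity" in ODB-dLLM's name is the standard roofline-model quantity: FLOPs performed per byte of memory traffic, which determines whether a kernel phase is compute-bound or memory-bound. As a hedged illustration of that diagnostic (not code from the paper), here is a toy calculation for a matmul; the 100 FLOPs/byte ridge point is a hypothetical accelerator spec, not a figure from the article:

```python
# Toy roofline-style check of whether a matmul phase is compute-bound or
# memory-bound. Illustrative only; not code from the ODB-dLLM paper.

def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul in fp16."""
    flops = 2 * m * k * n                               # multiply-accumulate count
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # A, B, C each moved once
    return flops / traffic

# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s HBM -> ridge at 100 FLOPs/byte.
RIDGE = 100.0

def bound(m: int, k: int, n: int) -> str:
    return "compute-bound" if arithmetic_intensity(m, k, n) >= RIDGE else "memory-bound"

print(bound(1, 4096, 4096))     # GEMV-like single-token step: memory-bound
print(bound(4096, 4096, 4096))  # large batched GEMM: compute-bound
```

The interleaving of such low- and high-intensity phases is what the framework's dual-boundary orchestration targets.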
Microsoft Releases the First Large-Scale Study of Test-Time Scaling, Along with a Practical Guide
机器之心· 2025-12-10 10:30
Core Insights
- The article discusses Test-time Scaling (TTS) for Large Language Models (LLMs), emphasizing the importance of allowing models to "think" longer during inference to improve results [1][2]
- Microsoft's comprehensive study reveals distinct personality traits among models, which fall into "short-sighted" and "long-sighted" groups; this grouping affects their performance under different TTS strategies [2][26]

TTS Strategies Overview
- TTS strategies for LLMs can be categorized into parallel, sequential, mixed/meta methods, and internal computation mechanisms, with no single strategy being universally optimal [4][11]
- The study analyzed eight open-source LLMs with parameter counts ranging from 7 billion to 235 billion, generating over 30 billion tokens across four inference datasets [5]

Parallel Scaling Strategy
- The parallel scaling strategy aggregates answers from multiple independent sampling paths to improve performance, with methods such as Self-consistency and Best-of-n sampling widely used [8]
- Recent advances include more principled voting strategies such as weighted majority voting and multi-agent verification [8]

Sequential Scaling Strategy
- The sequential scaling strategy deepens reasoning through iterative correction, restarts, or backtracking, with techniques such as Chain of Thought (CoT) prompting and structured search methods [9]
- Systems like AlphaGeometry combine symbolic proof search with LLMs for step-level control [9]

Mixed Scaling Strategy
- The mixed strategy combines elements of the parallel and sequential methods, using meta-reasoners to dynamically select TTS strategies based on perceived task difficulty [10]
- Internal scaling strategies modify the model's internal computation during inference without explicitly adjusting the number of samples or reasoning steps [10]

Research Findings
- The study found that beam search exhibits an inverse-scaling pattern: increasing the beam size led to a performance decline for certain model families [16][20]
- A correlation between reasoning-path length and quality emerged: shorter paths are often more effective for short-sighted models, while longer paths may benefit long-sighted models under certain conditions [21][26]

Decision Matrix for TTS Strategy
- Microsoft developed a practical decision matrix for selecting TTS strategies based on model type, problem difficulty, and computational budget, providing actionable guidance for algorithm engineers [38][41]
- For short-sighted models, the recommendation is majority voting (MV@N) with a large N at high budgets, and FFS with k=1 at low budgets [41][42]
- Long-sighted models require a more nuanced approach, favoring longer paths for difficult problems and shorter paths for easier ones, with MV@N being a robust choice [46][48]
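The MV@N aggregation recommended above is simple enough to sketch directly. A minimal illustration of majority voting over parallel samples (the aggregation behind self-consistency); the sample list stands in for N independently drawn reasoning paths whose final answers have been extracted:

```python
# Minimal sketch of majority voting (MV@N): sample N reasoning paths in
# parallel, extract each path's final answer, and return the most common one.
# Illustrative; in practice each entry would come from one LLM sample.
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among N sampled paths."""
    (winner, _count), = Counter(answers).most_common(1)
    return winner

# e.g. five sampled chains of thought ended in these final answers:
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> 42
```

Weighted majority voting replaces the raw count with a per-path score (e.g. from a verifier), which is the "more principled" variant the study mentions.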
Is This GPT-5.2? OpenAI Rushes Out a Full "Afternoon Tea" Lineup as a Next-Generation Image Model Leaks
机器之心· 2025-12-10 10:30
Core Viewpoint
- OpenAI is preparing to release its new model, GPT-5.2, in response to competitive pressure from Google's Gemini 3, which has triggered an internal "Code Red" alert at OpenAI [5][8]

Group 1: New Model Developments
- The new model, internally codenamed "Olive Oil Cake," is expected to be a significant upgrade over the current GPT-5.1, with speculation that it will launch on December 11 [7][8]
- Alongside GPT-5.2, OpenAI is set to introduce a new image generation model, referred to as "Chestnut and Hazelnut," which aims to address previous shortcomings and compete directly with Google's offerings [10][11]

Group 2: Competitive Landscape
- Google's Gemini 3 has demonstrated impressive performance, heightening the urgency for OpenAI to accelerate the release of its new models, which were initially planned for a later date [8][16]
- Although Google's Nano Banana 2 retains advantages in specific scenarios, OpenAI's new models are believed to have significantly narrowed the technological gap, potentially allowing them to compete effectively in the market [16]

Group 3: Model Features and Improvements
- The new image generation models are expected to resolve previous issues such as color bias and to improve detail and fidelity, making them more competitive against existing models [11]
- Key upgrades include enhanced color accuracy, improved texture detail, and the ability to render precise code snippets within images, which testers have received positively [11]
LLMs Are One Layer Away from AGI: Stanford Research Overturns the "Pattern Matching" View
机器之心· 2025-12-10 10:30
机器之心 report. Editors: 杨文, 泽南

The theoretical foundations of large language models may be about to change.

Stanford has published a paper that thoroughly overturns the conventional line that "LLMs are just pattern matchers." What it proposes is not a scaling trick or a new architecture, but a "coordination layer" that gives models genuine reasoning ability.

Core claim: the bottleneck to AGI is coordination, not scale

The AI community is split by the debate over the nature of large language models. On one side, the scaling camp holds that LLMs are sufficient for AGI; on the other, influential critics argue that LLMs are "merely pattern matchers," structurally incapable of reasoning, planning, or compositional generalization, and therefore a dead end.

The authors argue this debate rests on a false dichotomy, and put forward a sharply contrarian core claim: LLMs fail not because they lack reasoning ability, but because we lack the systems that bind their patterns to goals.

To explain this, the authors use a fishing metaphor. The ocean represents the model's vast library of patterns. A fisherman casting a net without bait hauls in only the most common fish (the generic patterns in the training data). Critics denounce these unanchored outputs, but what they are observing is just the raw statistical baseline of unbaited fishing; the system is not broken, it is simply behaving as it does in its default mode.

Intelligent behavior, however, is not just casting a net; it also involves baiting and filtering. If the bait is too sparse, it cannot attract specific, rare fish, and the ocean's prior remains ...
Why the "Doubao Phone" Went Viral on the Back of a Super Agent: What AI Scholars Have to Say
机器之心· 2025-12-10 08:13
Core Viewpoint
- The article discusses the emergence of the Doubao mobile assistant, which integrates AI capabilities deeply into the smartphone operating system, transforming how users interact with their devices and enabling complex task execution across multiple applications [3][12][26]

Group 1: Doubao Mobile Assistant Overview
- The Doubao mobile assistant is currently in a technical preview phase and represents a significant advancement in AI integration within smartphones, functioning as a "super butler" rather than a standalone app [3][6]
- It allows users to execute complex commands across different apps with simple voice instructions, showcasing a new level of AI interaction [3][12]
- The assistant can perform multi-step tasks seamlessly, such as marking restaurants on a map, finding museums, and booking tickets on travel platforms [5][12]

Group 2: Challenges in Implementing System-Level AI Agents
- Implementing system-level AI agents like Doubao means overcoming four main challenges: perception, planning, decision-making, and system-level integration [9][10]
- The perception layer requires the agent to recognize all interactive elements on the screen quickly and accurately, even amid dynamic distractions [9]
- The planning layer involves managing information flow across apps, maintaining logical continuity, and adapting to unexpected interruptions [10]
- The decision-making layer demands the ability to generalize across different interfaces and to execute user interactions beyond simple clicks [10]

Group 3: Technical Innovations Behind Doubao
- Doubao takes a system-level integration approach, gaining Android system-level permissions while protecting user privacy through strict authorization protocols [12][13]
- The assistant uses visual multi-modal capability to understand screen content and user intent, allowing it to decide its next actions autonomously [12][13]
- The underlying technology, UI-TARS, is a proprietary engine developed by ByteDance that powers the assistant's performance and capabilities [16][24]

Group 4: Future Implications and Industry Perspectives
- The evolution of AI capabilities in smartphones is expected to shift the interaction paradigm from "users seeking services" to "services seeking users," leading to a more intuitive user experience [26][27]
- Experts believe system-level GUI agents will become standard features of future mobile operating systems, increasing the autonomy and intelligence of smartphones [26][27]
- Despite these promising advances, challenges remain around computational power, coordination among system-level agents, and security mechanisms [27]
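The perception, planning, and decision-making layers described above form a perceive-plan-act loop. A hedged sketch of that control loop follows; the planner is a scripted mock standing in for a VLM call, and the action names are hypothetical, not Doubao's actual action space:

```python
# Hedged sketch of a system-level GUI agent's control loop: capture the
# screen, ask a vision-language model for the next action, execute it,
# repeat until done. The planner here is a scripted mock; a real agent
# would call a VLM such as the engine behind Doubao/UI-TARS.

def run_agent(task, perceive, plan, act, max_steps=10):
    history = []
    for _ in range(max_steps):
        screen = perceive()                   # screenshot / UI tree (perception layer)
        action = plan(task, screen, history)  # VLM picks the next action (planning layer)
        if action["type"] == "done":
            break
        act(action)                           # tap, type, swipe, ... (decision layer)
        history.append(action)
    return history

# Mock environment: the "VLM" follows a fixed script for a booking task.
script = iter([
    {"type": "tap", "target": "search_box"},
    {"type": "type", "text": "museum tickets"},
    {"type": "tap", "target": "book_button"},
    {"type": "done"},
])
executed = run_agent(
    task="book museum tickets",
    perceive=lambda: "<ui snapshot>",
    plan=lambda task, screen, hist: next(script),
    act=lambda a: None,
)
print(len(executed))  # -> 3 actions executed before "done"
```

The `history` argument is what lets a real planner maintain the logical continuity across apps that the planning-layer challenge describes.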
Nanjing University, LibLib.ai, and the Institute of Automation, Chinese Academy of Sciences Jointly Propose PosterCopilot, a Large Poster-Design Model for Layout Reasoning and Precise Editing
机器之心· 2025-12-10 08:13
Core Viewpoint
- The article discusses the development of PosterCopilot, a professional-level poster design and editing model that addresses significant challenges in graphic design automation, particularly layout reasoning and controllable editing [2][6][40]

Industry Pain Points
- Graphic design faces substantial obstacles to true automation; existing models like Stable Diffusion struggle with layered structures, leading to material distortion and a lack of fine-grained control [6]
- Current multimodal models exhibit four critical shortcomings: severe element overlap, lack of visual feedback, regression toward a single ground truth, and inability to perform layer-specific edits [8][10]

Core Achievements
- PosterCopilot aims to bridge the gap between single-step generation and professional workflows through a systematic solution built on a three-stage training strategy [13][14]
- The three stages are:
  1. Perturbation Supervised Fine-Tuning (PSFT) to address geometric distortions [15]
  2. Visual-Reality Alignment Reinforcement Learning (RL-VRA) to correct overlaps and proportion issues [15]
  3. Aesthetic Feedback Reinforcement Learning (RLAF) to encourage exploration beyond ground-truth layouts [15]

Generative Agent
- PosterCopilot functions as a comprehensive design assistant, enabling seamless transitions from abstract design concepts to concrete materials through a reception model and a T2I model [16][17]
- The model supports a range of professional scenarios, including full poster generation from provided assets, intelligent completion of missing materials, global theme transitions, intelligent size reconstruction, and multi-round fine-grained editing [21][23][28][29][31]

Experimental Results
- PosterCopilot outperforms existing commercial competitors and state-of-the-art models across multiple metrics, achieving an average win rate above 74% in human evaluations [34][35]
- In assessments of layout rationality, text legibility, and element preservation, PosterCopilot outperforms models such as Microsoft Designer and CreatiPoster [35][37]

Conclusion and Outlook
- By decoupling layout reasoning from generative editing and using reinforcement learning to align with human aesthetics, PosterCopilot sets a new benchmark for intelligent design tools and offers a new paradigm for AI-assisted creative workflows [40]
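Element overlap, the first of the four shortcomings listed above, is mechanically checkable. As a hedged sketch (not PosterCopilot's actual metric), overlapping layout elements can be flagged with pairwise intersection-over-union of their bounding boxes:

```python
# Hedged sketch of the kind of overlap check a layout-quality signal can
# build on: pairwise intersection-over-union (IoU) of element bounding
# boxes given as (x, y, w, h). Not PosterCopilot's actual metric.
from itertools import combinations

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def overlapping_pairs(boxes, threshold=0.1):
    """Indices of element pairs whose IoU exceeds the threshold."""
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(boxes), 2)
            if iou(a, b) > threshold]

layout = [(0, 0, 100, 40),    # title
          (10, 30, 80, 40),   # subtitle: intrudes into the title's box
          (0, 100, 100, 50)]  # body text, well separated
print(overlapping_pairs(layout))  # -> [(0, 1)]
```

A reinforcement-learning stage like RL-VRA can penalize layouts where this list is non-empty, which is the kind of visual-reality correction the second training stage targets.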
Goodbye Expert Dependence: Robots Learn Self-Reference, with Performance Soaring to 99.2% in Just 200 Steps
机器之心· 2025-12-10 05:10
Core Insights
- The article discusses the Self-Referential Policy Optimization (SRPO) framework, which improves the performance of Vision-Language-Action (VLA) models on robotic tasks by addressing sparse rewards and dependence on expert demonstrations [3][11]

Motivation and Contribution
- Recent research shows that reinforcement learning (RL) can significantly improve VLA models' performance both within and outside their training distribution; however, sparse reward signals remain a challenge, particularly in VLA tasks where high computational costs and inefficient use of failure-trajectory information hinder training efficiency [6][11]
- SRPO removes the dependence on expert demonstrations and task-specific reward engineering by using self-generated successful trajectories to provide progressive rewards for failed attempts [11][12]

Technical Approach
- SRPO follows a "learn from success" paradigm: trajectories generated during policy inference are collected and split into successful and failed attempts, and a latent world representation is used to model behavioral similarity and compute progressive rewards [14][16]
- The framework formalizes robotic decision-making as a partially observable Markov decision process (POMDP) and introduces a world-model-driven reward mechanism that supplies progressive reward signals for failed trajectories [18][19]

Experimental Results
- SRPO achieved a 99.2% success rate with only 200 steps of reinforcement learning, significantly outperforming baselines that rely on sparse rewards or hand-designed rewards [27]
- In the LIBERO-Plus generalization tests, SRPO improved performance by 167%, despite not training on any generalized-scenario data [30]

Efficiency and Real-World Application
- SRPO's efficiency is highlighted by raising success rates from 17.3% to 98.6% on long-horizon tasks with minimal training steps, demonstrating superior information utilization compared with traditional methods [34]
- SRPO's reward modeling has been tested in real-world environments, showing significant success-rate gains across a range of tasks [37]

Conclusion
- SRPO represents a significant advance in VLA reinforcement learning, enabling robots to move from imitation to autonomous exploration without expensive data labeling or complex reward design [51]
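The core mechanism, scoring a failed trajectory by its closeness to the policy's own successes, can be sketched in a few lines. This is a hedged toy version, with plain vectors standing in for the learned latent world representation and cosine similarity standing in for the paper's behavioral-similarity measure:

```python
# Hedged sketch of SRPO's core idea as summarized above: give a failed
# trajectory a dense "progressive" reward based on its similarity to
# self-generated successful trajectories, instead of a sparse 0/1 outcome.
# Embeddings here are plain lists; the real method uses a learned latent
# world representation.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def progressive_reward(failed_emb, success_embs):
    """Dense reward in [0, 1]: closeness to the nearest successful trajectory."""
    sim = max(cosine(failed_emb, s) for s in success_embs)
    return (sim + 1) / 2  # map cosine range [-1, 1] onto [0, 1]

successes = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0]]   # the policy's own successes
near_miss = [0.8, 0.1, 0.1]   # failed, but behaviorally close to a success
far_fail  = [-0.5, 0.9, 0.0]  # failed and dissimilar
assert progressive_reward(near_miss, successes) > progressive_reward(far_fail, successes)
```

The ordering in the final assertion is the point: a near-miss earns more reward than a far-off failure, so failed trajectories still carry a learning signal.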
Mistral Open-Sources Again: Devstral 2 Code Models and a Native CLI Released, but Large Companies Face Commercial-Use Restrictions
机器之心· 2025-12-10 05:10
Core Insights
- Mistral AI has launched its next-generation code model series, Devstral 2, comprising two models: Devstral 2 (123B parameters) and Devstral Small 2 (24B parameters) [1][2]
- The rapid release of these models, following the Mistral 3 series, signals strong momentum in the European AI landscape [4]
- Mistral AI's expansion in Europe and Turing Award winner Yann LeCun's return to Europe to found a company suggest a promising future for AI in the region [5]

Model Highlights
- Devstral 2 is a state-of-the-art (SOTA) programming model with 123 billion parameters and a 256K context window; it scores 72.2% on SWE-bench Verified, establishing it as one of the best open-weight models [9]
- Devstral Small 2, with 24 billion parameters, scores 68.0% on SWE-bench Verified, delivering competitive performance while remaining light enough for local deployment on consumer-grade hardware [10][11]
- Devstral 2 beats DeepSeek V3.2 with a 42.8% win rate against a 28.6% loss rate, though it still trails the closed-source Claude Sonnet 4.5 [15]

Licensing and Usage
- Devstral 2 is released under a modified MIT license with a revenue-cap clause: companies whose global consolidated monthly revenue exceeds $20 million may not exercise rights under the license [18][21]
- Companies above this revenue threshold must contact Mistral AI for a commercial license or use its paid API services [22]

Mistral Vibe CLI
- Mistral Vibe CLI is an open-source command-line coding assistant powered by Devstral, letting users explore, modify, and execute changes across entire codebases using natural language [24][25]
- The CLI offers file operations, code search, version control, and command execution, improving developer productivity [26][29]
- Mistral Vibe CLI integrates with existing development environments and is available as an extension for IDEs such as Zed [30][31]

Deployment Recommendations
- Devstral 2 is optimized for data-center GPUs and requires at least four H100-class GPUs for deployment, while Devstral Small 2 is designed for single-GPU operation and runs on a variety of NVIDIA systems [33][34]
- Devstral Small 2 can also run on consumer-grade GPUs, or on CPU-only configurations without a dedicated GPU, making it accessible to a wider range of users [35]
Hands-On Test | Zhipu's AutoGLM Goes Open Source: The "Android Moment" for AI Phones Has Arrived
机器之心· 2025-12-10 05:10
Core Viewpoint
- The article discusses the launch of Open-AutoGLM, an open-source AI assistant framework that lets users automate tasks on their smartphones with natural-language commands, marking a significant advance in AI technology and user interaction [6][10][42]

Group 1: Introduction to AutoGLM
- AutoGLM is a project developed by Zhipu AI that aims to create an intelligent agent that can not only "speak" but also "act" on smartphones, a milestone in AI's ability to use tools [12]
- The framework consists of a Phone Agent and a 9B model, AutoGLM-Phone-9B, enabling complex task automation through voice and touch commands [6][19]

Group 2: Technical Implementation
- The Phone Agent relies on three core technologies: ADB (Android Debug Bridge) for device control, a vision-language model (VLM) for understanding screen content, and intelligent planning for task execution [17][18][19]
- AutoGLM's ability to analyze UI layouts and act like a human user is the key feature that distinguishes it from traditional automation scripts [12][31]

Group 3: Practical Applications
- The article gives examples of AutoGLM successfully executing tasks such as sending messages and updating applications, demonstrating robust performance and adaptability [22][28][30]
- AutoGLM can handle multi-step operations and interact with various applications, showcasing its versatility as an AI assistant [33]

Group 4: Open Source and Privacy
- The open-source nature of Open-AutoGLM lets developers and users run the AI model locally, ensuring data privacy and transparency [36][39]
- This approach contrasts with existing AI assistants that rely on cloud processing, which raises data-security concerns [37][38]

Group 5: Industry Impact
- The launch of Open-AutoGLM is seen as a potential turning point in the AI assistant market, democratizing access to advanced automation tools and reducing reliance on proprietary platforms [39][42]
- The article suggests this development could usher in a new era of human-computer interaction in which AI assistants become integral to everyday tasks [42]
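The ADB device-control layer mentioned above is ordinary Android tooling. As a hedged sketch (not Open-AutoGLM's actual code), the control primitives an agent needs can be built as `adb` command lines, kept separate from execution so a planner can compose them; `adb shell input tap/swipe/text` and `adb exec-out screencap` are standard ADB commands:

```python
# Hedged sketch of the ADB control layer a phone agent sits on: each helper
# builds a standard `adb` argv (tap, swipe, text input, screenshot) without
# executing it, keeping the planner decoupled from the device.
import subprocess

def tap(x: int, y: int):
    return ["adb", "shell", "input", "tap", str(x), str(y)]

def swipe(x1, y1, x2, y2, ms=300):
    return ["adb", "shell", "input", "swipe", *map(str, (x1, y1, x2, y2, ms))]

def type_text(text: str):
    # `input text` requires spaces to be encoded as %s
    return ["adb", "shell", "input", "text", text.replace(" ", "%s")]

def screenshot():
    return ["adb", "exec-out", "screencap", "-p"]

def run(argv):
    # Executes one device command; requires a connected device or emulator.
    return subprocess.run(argv, capture_output=True)

print(tap(540, 1200))  # -> ['adb', 'shell', 'input', 'tap', '540', '1200']
```

In a full agent, the VLM would look at the `screenshot()` output and emit coordinates that feed straight into `tap` or `swipe`.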