机器之心
OpenAI's Most Powerful Coding Model, GPT-5.2-Codex, Goes Live
机器之心 · 2025-12-19 00:21
Core Insights
- OpenAI has released GPT-5.2-Codex, its most advanced coding model, designed for complex software engineering tasks with enhanced instruction-following and long-context understanding [1][3]

Group 1: Model Enhancements
- GPT-5.2-Codex improves on GPT-5.2 with better instruction adherence and long-context comprehension, excelling particularly at large code changes such as refactoring and migration [3]
- The model shows significant gains in token efficiency for coding tasks, especially at medium and high reasoning levels, and has become a primary tool for the Codex team itself [3]
- Its cybersecurity capabilities have also improved, with GPT-5.2-Codex outperforming all previous OpenAI models in this area [6][7]

Group 2: Performance Metrics
- GPT-5.2-Codex achieved state-of-the-art performance on the SWE-Bench Pro and Terminal-Bench 2.0 benchmarks, which assess AI agents in real terminal environments [8][10]
- The model can efficiently handle large codebases and maintain context across long sessions, enabling reliable completion of complex tasks such as large-scale refactoring and feature building [8]

Group 3: Security Applications
- A security researcher used GPT-5.1-Codex-Max and the Codex CLI to discover a vulnerability in React, demonstrating the model's application to real-world vulnerability research [6][21]
- The process involved using Codex to set up a local testing environment and analyze potential attack surfaces, leading to the discovery of previously unknown vulnerabilities [22][25]

Group 4: Deployment and Access
- GPT-5.2-Codex is currently available to paid ChatGPT users and will reach API users in the coming weeks; OpenAI is piloting access for invited users and professionals focused on defensive cybersecurity [7]
- OpenAI plans to hold each new model to high cybersecurity capability standards, emphasizing responsible deployment alongside enhanced security measures [18][25]
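As a concrete illustration of the API access mentioned above, here is a minimal sketch of calling the model through OpenAI's official Python SDK. The model id "gpt-5.2-codex" and its availability on the Chat Completions endpoint are assumptions inferred from the announcement, not confirmed API details.

```python
# Hedged sketch: the model id "gpt-5.2-codex" is assumed from the
# announcement; the Chat Completions call itself is standard SDK usage.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2-codex",  # hypothetical model id
    messages=[
        {"role": "system", "content": "You are a careful refactoring assistant."},
        {"role": "user", "content": "Plan a migration of this module from "
                                    "%-style string formatting to f-strings."},
    ],
)
print(response.choices[0].message.content)
```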
Selling "Productivity," Not "Tools": How Is BaiRong Cloud Using "Silicon-Based Employees" to Break the AI Deployment Deadlock?
机器之心 · 2025-12-18 10:15
Core Viewpoint
- The article discusses AI's transition from mere tool to productive workforce, termed "silicon-based employees," and argues for a new business model in which AI vendors are compensated on measurable business outcomes rather than software sales [1][3][4]

Group 1: AI Implementation Challenges
- Despite initial successful deployments of AI agents across enterprises, many remain stuck at the demo stage and never integrate into core business processes, exposing a significant gap between AI vendors and enterprise clients [1][2]
- The fundamental issue is misaligned incentives: vendors focus on selling tools while clients seek tangible productivity improvements [2][3]

Group 2: New Business Model
- Overcoming these challenges requires restructuring the commercial contract between AI vendors and enterprise clients, shifting from tool provision to delivery of measurable business results [3][12]
- The concept of "Result as a Service" (RaaS) is introduced, under which clients pay based on how well AI agents achieve specific business outcomes (see the billing sketch after this summary) [5][7]

Group 3: Technological Advancements
- BaiRong Cloud has made significant strides in AI technology, focusing on practical applications across industries including finance, marketing, and customer service [9][10]
- The company has developed proactive AI models that guide user interactions and eliminate errors, ensuring compliance and enhancing decision-making [14][15]

Group 4: Real-World Applications
- BaiRong Cloud's "silicon-based employees" have been deployed across multiple sectors, generating substantial business results such as growth in active customers and management of large-scale marketing campaigns [17][18]
- These AI agents have delivered significant cost reductions and efficiency gains across operational processes, including customer service and recruitment [19][20]

Group 5: Future Directions
- The company aims to keep evolving its AI capabilities, focusing on long-horizon task execution and closer collaboration between human and AI workers [22][24]
- As the technology matures, "silicon-based employees" are expected to move beyond repetitive tasks and become integral decision-making participants within organizations [24][25]
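To make the RaaS pricing idea concrete, here is a minimal sketch of outcome-based billing under an assumed contract; the outcome names and unit rates are hypothetical illustrations, not BaiRong Cloud's actual terms.

```python
# Hedged sketch of "Result as a Service" billing: the vendor is paid per
# verified business outcome, not per software license. All names and rates
# below are hypothetical.

def raas_invoice(outcomes: dict, rate_card: dict) -> float:
    """Total fee = sum of (verified outcome count x agreed unit rate)."""
    return sum(count * rate_card[name] for name, count in outcomes.items())

# e.g. a month in which the AI agents activated 1,200 customers and
# completed 8,500 service calls:
monthly_outcomes = {"activated_customers": 1200, "completed_service_calls": 8500}
rate_card = {"activated_customers": 15.0, "completed_service_calls": 0.40}
print(f"Monthly fee: {raas_invoice(monthly_outcomes, rate_card):,.2f}")
# -> Monthly fee: 21,400.00  (1200 * 15.0 + 8500 * 0.40)
```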
SIGGRAPH Asia 2025 | Creating and Rendering High-Quality 3D Digital Humans with Just a Smartphone
机器之心 · 2025-12-18 10:15
Core Insights
- The article covers advances in 3D digital human reconstruction and rendering, focusing on the HRM²Avatar system developed by the Taobao Technology Meta team, which creates high-fidelity, real-time 3D digital humans using only a smartphone [4][5][6]

Group 1: Technology Overview
- HRM²Avatar is a system for high-fidelity, real-time 3D digital human reconstruction and rendering, combining a two-stage capture method with an explicit clothing mesh representation and Gaussian-based dynamic detail modeling [12][36]
- The system reconstructs body shape, clothing structure, and detailed appearance from ordinary smartphone captures, balancing visual realism, cross-pose consistency, and real-time rendering on mobile devices [6][12]

Group 2: Methodology
- Capture involves a static phase, in which the user holds a fixed pose, and a dynamic phase, in which the user performs natural movements, giving the system the signals it needs for reconstruction and dynamic modeling [18][28]
- The system uses a hybrid representation, attaching Gaussian points to the clothing mesh to provide controllable parameters for pose-dependent deformation and lighting modeling [40][46]

Group 3: Performance Evaluation
- HRM²Avatar runs stably in real time on mobile devices, rendering roughly 530,000 Gaussian points at 2K resolution and 120 FPS on the iPhone 15 Pro Max, and at 2K and 90 FPS on Apple Vision Pro [87][89]
- Comparative evaluations show HRM²Avatar outperforming existing methods in static reconstruction quality and appearance consistency under pose variation, evidenced by higher PSNR and SSIM scores (PSNR is defined after this summary) [76][80]

Group 4: Future Directions
- The article notes ongoing optimization needs, particularly for complex clothing structures and extreme lighting conditions, and positions HRM²Avatar as a significant milestone in making high-quality digital humans accessible to ordinary users [90]
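For reference, the PSNR score cited above is the standard peak signal-to-noise ratio between a rendered image x and a ground-truth capture y (the textbook definition, not something specific to the HRM²Avatar paper); SSIM analogously measures perceptual structural similarity.

```latex
\mathrm{PSNR}(x, y) = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, y)},
\qquad
\mathrm{MSE}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2
```

Here MAX is the maximum pixel value (e.g., 255 for 8-bit images) and N the pixel count; higher PSNR means the render is closer to the reference.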
Speaking Out on the Same Day as Physical Intelligence: DeepCybo Unveils Its "Situational Data Collection" Trump Card. Is the Generality Ceiling of Embodied Intelligence About to Be Punctured?
机器之心 · 2025-12-18 10:15
Core Viewpoint
- The journey toward generality in embodied intelligence is blocked by a "data desert," prompting the industry to ask whether the data fed to robots truly captures the essence of human operation [1]

Group 1: Data Challenges in Embodied Intelligence
- The generality breakthrough of embodied intelligence is limited by the extreme scarcity of physical-world interaction data [3]
- Traditional data collection records "action trajectories," while situational data collection emphasizes the "causal relationships" behind those actions [4]
- The shift to situational data collection aims to provide rich contextual data so that robots learn rather than merely memorize [3][4]

Group 2: Innovations in Data Collection
- DeepCybo and a Beijing university have established a "Demonstration Center for Embodied Intelligence Data Collection" to gather multimodal data from a human first-person perspective (a hypothetical record layout is sketched after this summary) [3][10]
- The center uses the self-developed DeepAct data engine to build a standardized collection pipeline in real industrial and everyday scenarios [5][10]
- The new paradigm prioritizes high-quality, diverse data so that models can efficiently learn human-like interaction with the physical world [7]

Group 3: Implications for Future Development
- Accumulating first-person multimodal data with contextual memory should help embodied intelligence break through its generality ceiling [12]
- Combining full-chain data processing with model innovation maximizes the value of the data, letting robots evolve from mechanical imitation to genuine skill emergence [12]
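To illustrate the contrast between a bare action trajectory and a "situational" record, here is a minimal sketch of what one first-person multimodal sample might contain; every field name is a hypothetical illustration, not DeepCybo's actual DeepAct schema.

```python
# Hedged sketch: hypothetical layout for one situational data record,
# pairing raw first-person signals with the context and causal outcome that
# trajectory-only collection would discard.
from dataclasses import dataclass, field

@dataclass
class SituationalSample:
    rgb_frames: list                # first-person camera frames
    hand_poses: list                # per-frame hand / end-effector poses
    gaze_points: list               # where the demonstrator was looking
    narration: str                  # spoken intent, e.g. "tighten the loose bolt"
    scene_context: dict = field(default_factory=dict)  # objects, task state
    outcome: str = ""               # causal result, e.g. "bolt secured"
```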
Goodbye, Gacha-Style Rerolls! A First-Hand Test of ByteDance's Just-Released Video Model Seedance 1.5 pro
机器之心 · 2025-12-18 09:08
Core Viewpoint
- The article covers the launch of Volcano Engine's latest video generation model, Seedance 1.5 pro, highlighting advanced audio-visual synchronization and multi-language support that significantly improve video generation quality and user experience [2][5][46]

Group 1: Model Features and Capabilities
- Seedance 1.5 pro achieves native audio-visual synchronization across a range of sound types, with a synchronization rate that leads globally [5]
- The model follows complex instructions and supports multiple languages and dialects, improving narrative coherence and emotional expression in generated videos [5][13]
- In video capability assessments, Seedance 1.5 pro outperforms competitors on multiple metrics, including alignment, aesthetic quality, and audio generation [6][53]

Group 2: User Experience and Applications
- Seedance 1.5 pro is available to enterprise users via API and to individual users through the Dream Web and Doubao App (a hypothetical request sketch follows this summary) [8]
- The model adheres closely to user prompts, often producing optimal results on the first attempt and reducing the need for repeated generations [43]
- It suits daily content creation, lightweight commercial advertising, and AI short-film production, indicating versatility across creative contexts [44]

Group 3: Technical Innovations
- The model incorporates several key innovations, including a unified multi-modal generation framework and a comprehensive audio-visual data framework, which boost its performance on high-quality content generation [49][50]
- Training includes supervised fine-tuning and reinforcement learning from human feedback, yielding significant improvements in motion quality and audio fidelity [52]
- The inference stage has been optimized for more than tenfold acceleration in generation while preserving model performance [52]

Group 4: Industry Impact and Future Outlook
- Seedance 1.5 pro reflects the rapid evolution of video generation technology from academic research into practical tools for everyday users [58]
- Its capabilities are expected to narrow the gap between AI-generated content and professional video production, broadening its applicability in creative industries [59]
- The industry anticipates further advances in video generation models, with more sophisticated outputs expected in the coming years [60][61]
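The API availability above suggests an enterprise integration path roughly like the sketch below; the endpoint URL, payload fields, auth scheme, and model id are all placeholders, since the summary does not specify Volcano Engine's actual interface.

```python
# Hedged sketch: every endpoint, field, and id below is a placeholder --
# consult Volcano Engine's documentation for the real Seedance API.
import requests

resp = requests.post(
    "https://example.com/v1/video/generations",      # hypothetical endpoint
    headers={"Authorization": "Bearer <API_KEY>"},   # hypothetical auth scheme
    json={
        "model": "seedance-1.5-pro",                 # hypothetical model id
        "prompt": "A street vendor calls out as rain starts; dialogue and "
                  "ambient sound stay synchronized with the visuals.",
        "duration_seconds": 8,
    },
    timeout=120,
)
print(resp.json())
```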
Peking University Releases ManualVLA: The First Unified Long-Horizon "Generation-Understanding-Action" Model, Autonomously Generating a Manual from the Final State and Completing the Manipulation
机器之心 · 2025-12-18 09:08
Core Insights
- The article discusses the limitations of existing VLA models on long-horizon tasks that require a precisely defined final state, such as LEGO assembly and object rearrangement, highlighting the need for a more integrated approach [2][9]
- A new model, ManualVLA, combines planning and action generation in a unified framework, improving the efficiency and effectiveness of robotic manipulation (the plan-then-act loop is sketched after this summary) [3][5]

Group 1: Research Background and Challenges
- Recent advances in VLA models have driven progress toward general embodied intelligence, but coordinating high-level planning with precise control on long-horizon tasks remains challenging [9]
- Existing hierarchical methods struggle to generalize to unseen final states and often rely on hand-crafted instructions or human demonstration videos, limiting them in system complexity, deployment cost, and generalization [9]

Group 2: ManualVLA Methodology
- ManualVLA generates its own instructions and executes actions based on them, breaking complex long-horizon tasks into manageable steps [10][12]
- The model uses a Mixture-of-Transformers (MoT) architecture that integrates a planning expert, which generates multimodal operation manuals, and an action expert, which executes tasks following those manuals [5][14]

Group 3: Experimental Results
- ManualVLA significantly improved success rates on real-world tasks, with an average gain of roughly 32% over the latest baseline methods [7][28]
- In 2D LEGO assembly, 3D LEGO assembly, and object rearrangement experiments, the model produced high-quality intermediate images and kept the mean absolute error (MAE) of predicted target object positions low [24][27]

Group 4: Training Phases
- Training proceeds in three phases: pre-training on a large dataset of robotic trajectories, using a digital-twin tool for 3D reconstruction and manual-data generation, and fine-tuning on real-world expert demonstration trajectories [20][21][19]

Group 5: Generalization and Robustness
- ManualVLA generalizes robustly, maintaining high success rates under varying backgrounds, object shapes, and lighting conditions, and outperforming baseline models in these scenarios [33][37]
- Ablation studies confirm that both the explicit and implicit reasoning paths are essential for optimal performance on long-horizon tasks [33]
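At a high level, the generate-manual-then-act behavior described above can be pictured as the loop below. `planner` and `actor` are hypothetical stand-ins for the two MoT experts (which in the real model share one backbone rather than being separate objects), so this is a structural sketch, not the paper's implementation.

```python
# Hedged sketch of a ManualVLA-style loop: plan a multimodal manual from
# the desired final state, then let the action expert execute each step.

def manipulate_to_goal(goal_image, observation, planner, actor):
    """Decompose a long-horizon task into manual steps, then execute them."""
    # 1) Planning expert: from the final-state image and the current scene,
    #    generate a manual -- a sequence of (intermediate image, instruction).
    manual = planner.generate_manual(goal_image, observation)
    for step in manual:
        # 2) Action expert: condition on the step's image + text and emit
        #    low-level actions until this sub-goal is reached.
        while not step.is_satisfied(observation):
            action = actor.predict_action(observation, step.image, step.text)
            observation = actor.execute(action)
    return observation
```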
Just In: Gemini 3, the Model That Turned Google's Fortunes Around, Launches a Flash Version
机器之心 · 2025-12-18 00:03
Core Insights
- Google has launched the Gemini 3 Flash model, positioned as a high-speed, low-cost alternative to existing models and aimed directly at OpenAI's offerings [2][3]
- The new model shows significant performance improvements over its predecessor, Gemini 2.5 Flash, with competitive scores across benchmark tests [3][10][14]

Performance and Benchmarking
- Gemini 3 Flash scored 33.7% on the Humanity's Last Exam benchmark, versus 11% for Gemini 2.5 Flash and 37.5% for Gemini 3 Pro [6][10]
- On the GPQA Diamond benchmark it reached 90.4%, closely rivaling Gemini 3 Pro [10][13]
- It also excelled at multimodal reasoning, scoring 81.2% on the MMMU Pro benchmark [11][13]

Cost and Efficiency
- Gemini 3 Flash is billed as the most cost-effective model globally, at $0.50 per million input tokens and $3.00 per million output tokens (a worked cost example follows this summary) [4][23]
- The design emphasizes efficiency, cutting average token usage by roughly 30% relative to Gemini 2.5 Pro while maintaining accuracy [14][15]

User Accessibility and Applications
- The model is now the default in the Gemini application, giving millions of users free access and improving everyday task efficiency [28][32]
- It supports a wide range of applications, from video analysis to interactive coding environments, making it suitable for developers building complex AI solutions [21][25]

Developer Tools and Integration
- Gemini 3 Flash is integrated into Google AI Studio, Vertex AI, and Gemini Enterprise, giving developers robust tools for application development [12][26][33]
- Its ability to quickly generate working applications from voice commands highlights a user-friendly design that serves non-programmers as well [30][32]
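A quick worked example of the quoted pricing, assuming simple linear metering (context caching and other discounts are ignored):

```python
# Worked example of the quoted Gemini 3 Flash rates:
# $0.50 per million input tokens, $3.00 per million output tokens.

def flash_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 0.50 + output_tokens / 1e6 * 3.00

# e.g. a session consuming 200k input tokens and producing 20k output tokens:
print(f"${flash_cost(200_000, 20_000):.2f}")  # $0.10 + $0.06 = $0.16
```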
Faster and Stronger Than LoRA: New Framework LoFA Arrives, Adapting Large Models in Seconds
机器之心 · 2025-12-18 00:03
Core Insights
- The article discusses how traditional visual generative models fall short of personalized user demands, particularly for precise outputs driven by fine-grained instructions [6][7]
- It introduces LoFA, a framework that rapidly adapts large models to personalized tasks without lengthy optimization, matching or beating traditional methods such as LoRA [2][24]

Group 1: Problem Statement
- Growing demand for creative media and visual content has driven the development of powerful visual generative models trained on large datasets [6]
- Existing personalization methods, such as parameter-efficient fine-tuning (PEFT), require long optimization runs and task-specific data, making them impractical for real-time use [6][7]

Group 2: Proposed Solution
- LoFA predicts personalized LoRA parameters directly from user instructions, enabling fast adaptation of visual generative models (a minimal hypernetwork sketch follows this summary) [9][12]
- The framework builds a novel guiding mechanism into a hypernetwork to predict complete, uncompressed LoRA weights, avoiding the information loss of compression techniques [9][12]

Group 3: Methodology
- Learning proceeds in two phases: first predicting a simplified response map, then using that knowledge to guide the final LoRA weight prediction [11][12]
- This structured approach focuses the model on key adaptation regions, improving the stability and efficiency of learning [12]

Group 4: Experimental Results
- Systematic experiments on video and image generation tasks demonstrate LoFA's ability to handle diverse instruction conditions [14][15]
- LoFA outperformed baseline methods and matched independently optimized LoRA models while cutting adaptation time from hours to seconds [15][24]

Group 5: Conclusion and Future Directions
- LoFA addresses key limitations of existing personalization techniques by eliminating lengthy optimization while preserving high-quality generation [24]
- Future work targets a unified hypernetwork with strong zero-shot capabilities that handles varied instructions across domains [24]
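To make the hypernetwork idea concrete, here is a minimal PyTorch sketch of a network that maps an instruction embedding directly to uncompressed LoRA factors. The layer sizes, two-head design, and embedding dimension are illustrative assumptions, not LoFA's actual architecture.

```python
# Hedged sketch: an instruction-conditioned hypernetwork emitting LoRA
# factors A (r x d) and B (d x r); the adapted layer then uses W' = W + B @ A.
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    def __init__(self, instr_dim=768, hidden=1024, d=1024, r=8):
        super().__init__()
        self.d, self.r = d, r
        self.trunk = nn.Sequential(nn.Linear(instr_dim, hidden), nn.GELU())
        self.to_A = nn.Linear(hidden, r * d)  # predicts A, the down-projection
        self.to_B = nn.Linear(hidden, d * r)  # predicts B, the up-projection

    def forward(self, instr_emb):
        h = self.trunk(instr_emb)
        A = self.to_A(h).view(-1, self.r, self.d)
        B = self.to_B(h).view(-1, self.d, self.r)
        return A, B

instr = torch.randn(1, 768)      # embedding of one user instruction
A, B = LoRAHyperNet()(instr)
delta_W = B @ A                  # (1, 1024, 1024) low-rank weight update
```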
Segmenting Everything and 3D-Reconstructing Everything Isn't Enough: Meta Open-Sources SAM Audio to Segment Every Sound
机器之心 · 2025-12-17 09:42
Core Viewpoint
- Meta has launched SAM Audio, an audio segmentation model that uses multimodal prompts to separate target sounds from complex audio mixtures, reshaping audio processing [1][4]

Group 1: Technology and Functionality
- SAM Audio is powered by the Perception Encoder Audiovisual (PE-AV), which drives its audio segmentation performance [2][18]
- PE-AV builds on the Perception Encoder model released earlier this year, extending advanced computer vision capabilities to audio processing [3][20]
- The model supports text prompts, visual prompts, and a novel time-span prompting technique, allowing precise audio separation [9][16]
- SAM Audio works across diverse real-world scenarios and gives users intuitive control over the separation process [9][12]

Group 2: Applications and Use Cases
- Meta envisions applications including audio cleanup, background noise removal, and tools to enhance user creativity [5][42]
- Users can explore SAM Audio in the Segment Anything Playground by selecting or uploading audio and video content [7][31]

Group 3: Evaluation and Benchmarking
- SAM Audio-Bench is introduced as a comprehensive audio separation benchmark spanning multiple audio domains and interaction types [29][30]
- SAM Audio Judge is a new evaluation framework that scores segmentation quality by human-perception criteria rather than comparison against reference audio [26][27]

Group 4: Performance and Future Outlook
- SAM Audio achieves state-of-the-art performance across benchmarks and tasks, outperforming previous audio separation models [35][36]
- The model runs efficiently at a real-time factor of roughly 0.7 and can handle large-scale audio processing (see the worked example after this summary) [40]
- Meta aims to promote accessibility and creativity with SAM Audio, working with partners on its potential in assistive technologies [42]
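The quoted real-time factor of about 0.7 is worth unpacking: RTF is processing time divided by audio duration, so anything below 1.0 runs faster than playback. A one-line worked example:

```python
# Real-time factor: processing_seconds / audio_seconds. RTF < 1.0 means the
# model separates a clip faster than the clip plays back.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

print(real_time_factor(42.0, 60.0))  # 0.7: a 60 s clip processed in 42 s
```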
It's Official! Yao Shunyu Appointed Tencent's Chief AI Scientist, Leading Large Language Models and AI Infra
机器之心 · 2025-12-17 09:42
Core Insights
- OpenAI researcher Yao Shunyu has joined Tencent, sparking discussion across the AI community [1]
- Tencent has upgraded its large-model research framework, establishing new departments to strengthen its capabilities [2][3]

Group 1: Organizational Changes
- Tencent has formed an AI Infra Department and an AI Data Department to strengthen its large-model research and core capabilities [2]
- Yao Shunyu has been appointed Chief AI Scientist, reporting to Tencent President Liu Chiping, and will also head the AI Infra Department and the large language model department [2][5]

Group 2: Department Responsibilities
- The AI Infra Department will build the technical foundations for large-model training and inference platforms, with an emphasis on distributed training and high-performance inference services [3]
- The AI Data Department and the Data Computing Platform Department will construct data and evaluation systems for large models and integrate big data with machine learning [4]

Group 3: Yao Shunyu's Background
- Yao Shunyu is a prominent young researcher in artificial intelligence, particularly in the area of intelligent agents [6]
- Before joining OpenAI, he made significant contributions to the field of language agents; his papers have been cited more than 19,000 times in total [7]