机器之心
Peking University releases ManualVLA: the first unified long-horizon "generation-understanding-action" model, autonomously generating a manual from the final state and completing the manipulation
机器之心· 2025-12-18 09:08
Core Insights
- The article discusses the limitations of existing VLA models on long-horizon tasks that require a precisely defined final state, such as LEGO assembly and object rearrangement, highlighting the need for a more integrated approach [2][9]
- A new model called ManualVLA is introduced, which combines planning and action generation into a unified framework, improving the efficiency and effectiveness of robotic manipulation [3][5]

Group 1: Research Background and Challenges
- Recent advances in VLA models have contributed significantly to general embodied intelligence, but coordinating high-level planning with precise control on long-horizon tasks remains challenging [9]
- Existing hierarchical methods struggle to generalize to unseen final states and often rely on hand-crafted instructions or human demonstration videos, which adds system complexity, raises deployment costs, and limits generalization [9]

Group 2: ManualVLA Methodology
- ManualVLA lets the model generate its own instructions and execute actions based on them, breaking complex long-horizon tasks into manageable steps [10][12]
- The model employs a Mixture-of-Transformers (MoT) architecture that integrates a planning expert, which generates multimodal operation manuals, with an action expert that executes tasks based on those manuals (a toy sketch of the MoT idea follows this summary) [5][14]

Group 3: Experimental Results
- ManualVLA delivered a significant improvement in success rates on real-world tasks, with an average gain of roughly 32% over the latest baseline methods [7][28]
- In experiments on 2D LEGO assembly, 3D LEGO assembly, and object rearrangement, the model produced high-quality intermediate images and kept the mean absolute error (MAE) of predicted target object positions low [24][27]

Group 4: Training Phases
- Training consists of three phases: pre-training on a large dataset of robotic trajectories, using a digital-twin tool for 3D reconstruction and manual data generation, and fine-tuning on real-world expert demonstration trajectories [20][21][19]

Group 5: Generalization and Robustness
- ManualVLA exhibits robust generalization, maintaining high success rates under varying backgrounds, object shapes, and lighting conditions, and outperforming baseline models in these scenarios [33][37]
- Ablation studies confirm that both the explicit and implicit reasoning paths are essential for optimal performance on long-horizon tasks [33]
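The MoT design mentioned in Group 2 pairs a planning expert with an action expert inside one transformer. Below is a minimal, illustrative PyTorch sketch of that general idea: tokens share global self-attention but are routed to per-role feed-forward experts. The class names, routing scheme, and dimensions are assumptions for illustration, not ManualVLA's actual implementation.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """One Mixture-of-Transformers-style block: shared global self-attention,
    but separate feed-forward weights for 'plan' tokens vs. 'action' tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One expert FFN per token role (planning vs. action).
        self.experts = nn.ModuleDict({
            "plan": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "action": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x: torch.Tensor, groups: list[str]) -> torch.Tensor:
        # x: (batch, seq, dim); groups[i] names the expert for token position i.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)          # experts interact via shared attention
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for name, expert in self.experts.items():  # route each token to its expert FFN
            idx = [i for i, g in enumerate(groups) if g == name]
            if idx:
                out[:, idx] = expert(h[:, idx])
        return x + out

# Toy usage: 6 "manual"/planning tokens followed by 4 action tokens.
block = MoTBlock(dim=64)
tokens = torch.randn(2, 10, 64)
groups = ["plan"] * 6 + ["action"] * 4
print(block(tokens, groups).shape)  # torch.Size([2, 10, 64])
```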
Just now: Gemini 3, the model that turned Google's fortunes around, gets a Flash version
机器之心· 2025-12-18 00:03
Core Insights
- Google has launched the Gemini 3 Flash model, which is positioned as a high-speed, low-cost alternative to existing models, aiming to compete directly with OpenAI's offerings [2][3]
- The new model demonstrates significant performance improvements over its predecessor, Gemini 2.5 Flash, achieving competitive scores in various benchmark tests [3][10][14]

Performance and Benchmarking
- Gemini 3 Flash has shown a remarkable performance leap, scoring 33.7% on the Humanity's Last Exam benchmark, compared to 11% for Gemini 2.5 Flash and 37.5% for Gemini 3 Pro [6][10]
- On the GPQA Diamond benchmark, it achieved a score of 90.4%, closely rivaling Gemini 3 Pro [10][13]
- The model also excelled in multimodal reasoning, scoring 81.2% on the MMMU Pro benchmark, indicating its advanced capabilities [11][13]

Cost and Efficiency
- Gemini 3 Flash is touted as the most cost-effective model globally, with input costs at $0.50 per million tokens and output costs at $3.00 per million tokens (a back-of-the-envelope cost estimate follows this summary) [4][23]
- The model's design focuses on high efficiency, reducing average token usage by approximately 30% compared to Gemini 2.5 Pro while maintaining accuracy [14][15]

User Accessibility and Applications
- The model is now the default in the Gemini application, allowing millions of users to access its capabilities for free, enhancing daily task efficiency [28][32]
- It supports a wide range of applications, from video analysis to interactive coding environments, making it suitable for developers looking to implement complex AI solutions [21][25]

Developer Tools and Integration
- Gemini 3 Flash is integrated into various platforms, including Google AI Studio, Vertex AI, and Gemini Enterprise, providing developers with robust tools for application development [12][26][33]
- The model's ability to quickly generate functional applications from voice commands highlights its user-friendly design, catering to non-programmers as well [30][32]
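For context on the pricing quoted above, here is a trivial cost estimate using those per-million-token rates; the request sizes in the example are made up.

```python
# Per-million-token prices quoted above for Gemini 3 Flash.
INPUT_PER_M_TOKENS = 0.50   # USD per 1M input tokens
OUTPUT_PER_M_TOKENS = 3.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough request cost in USD at the quoted rates."""
    return (input_tokens / 1e6) * INPUT_PER_M_TOKENS + (output_tokens / 1e6) * OUTPUT_PER_M_TOKENS

# Hypothetical request: a 20k-token prompt that yields a 2k-token reply.
print(f"${estimate_cost(20_000, 2_000):.4f}")  # $0.0160
```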
Faster and stronger than LoRA: the new LoFA framework launches, adapting large models in seconds
机器之心· 2025-12-18 00:03
Core Insights
- The article discusses the limitations of traditional visual generative models in meeting personalized user demands, particularly in generating precise outputs from fine-grained instructions [6][7]
- It introduces a new framework called LoFA, which allows rapid adaptation of large models to personalized tasks without lengthy optimization, achieving results comparable to or better than traditional methods such as LoRA [2][24]

Group 1: Problem Statement
- Growing demand for creative media and visual content has driven the development of powerful visual generative models trained on large datasets [6]
- Existing personalization methods, such as parameter-efficient fine-tuning (PEFT), require extensive optimization time and task-specific data, making them impractical for real-time applications [6][7]

Group 2: Proposed Solution
- LoFA is designed to predict personalized LoRA parameters directly from user instructions, enabling fast adaptation of visual generative models (a toy hypernetwork sketch follows this summary) [9][12]
- The framework incorporates a novel guiding mechanism within a hypernetwork to predict complete, uncompressed LoRA weights, avoiding the information loss associated with compression techniques [9][12]

Group 3: Methodology
- The learning process is divided into two phases: first predicting a simplified response map, then using that knowledge to guide the final LoRA weight prediction [11][12]
- This structured approach lets the model focus on key adaptation areas, improving the stability and efficiency of learning [12]

Group 4: Experimental Results
- The framework was evaluated through systematic experiments on both video and image generation tasks, demonstrating its ability to handle diverse instruction conditions [14][15]
- LoFA outperformed baseline methods and matched independently optimized LoRA models while reducing adaptation time from hours to seconds [15][24]

Group 5: Conclusion and Future Directions
- LoFA addresses critical limitations of existing personalization techniques by eliminating lengthy optimization while maintaining high-quality generation [24]
- Future work aims to develop a unified hypernetwork with strong zero-shot capabilities that can handle diverse instructions across domains [24]
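To make the "hypernetwork that predicts LoRA weights from an instruction" idea concrete, here is a minimal, illustrative PyTorch sketch: an MLP maps an instruction embedding to full low-rank factors A and B for a single target linear layer. The dimensions, names, and single-layer scope are assumptions for illustration; LoFA's actual architecture, guiding mechanism, and two-phase training are not reproduced here.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Toy hypernetwork: maps an instruction embedding to a full (uncompressed)
    pair of LoRA factors A (r x in) and B (out x r) for one target layer."""
    def __init__(self, instr_dim: int, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.rank, self.in_f, self.out_f = rank, in_features, out_features
        hidden = 256
        self.trunk = nn.Sequential(nn.Linear(instr_dim, hidden), nn.GELU())
        self.head_a = nn.Linear(hidden, rank * in_features)
        self.head_b = nn.Linear(hidden, out_features * rank)

    def forward(self, instr_emb: torch.Tensor):
        h = self.trunk(instr_emb)                        # (batch, hidden)
        A = self.head_a(h).view(-1, self.rank, self.in_f)
        B = self.head_b(h).view(-1, self.out_f, self.rank)
        return A, B

def apply_lora(x, W, A, B, scale=1.0):
    # y = x W^T + scale * x A^T B^T, i.e. the usual LoRA update B @ A added on top of W.
    base = x @ W.t()
    delta = (x @ A.transpose(-1, -2)) @ B.transpose(-1, -2)
    return base + scale * delta

# Toy usage: predict LoRA weights for a 512->512 linear layer from a 768-d instruction embedding.
hyper = LoRAHyperNet(instr_dim=768, in_features=512, out_features=512)
instr = torch.randn(1, 768)
A, B = hyper(instr)
x, W = torch.randn(1, 4, 512), torch.randn(512, 512)
print(apply_lora(x, W, A, B).shape)  # torch.Size([1, 4, 512])
```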
Segmenting anything and 3D-reconstructing anything wasn't enough: Meta open-sources SAM Audio to segment any sound
机器之心· 2025-12-17 09:42
Core Viewpoint
- Meta has launched SAM Audio, an audio segmentation model that utilizes multimodal prompts to separate sounds from complex audio mixtures, revolutionizing audio processing [1][4]

Group 1: Technology and Functionality
- SAM Audio is powered by the Perception Encoder Audiovisual (PE-AV), which enhances its performance in audio segmentation tasks [2][18]
- PE-AV builds on the Perception Encoder model released earlier this year, extending advanced computer vision capabilities to audio processing [3][20]
- The model supports various interaction methods, including text prompts, visual prompts, and a novel time-span prompting technique, allowing for precise audio separation [9][16]
- SAM Audio can operate effectively in diverse real-world scenarios, giving users intuitive control over the separation process [9][12]

Group 2: Applications and Use Cases
- Meta envisions numerous applications for SAM Audio, including audio cleaning, background-noise removal, and tools to enhance user creativity [5][42]
- Users can explore SAM Audio's capabilities through the Segment Anything Playground, where they can select or upload audio and video content [7][31]

Group 3: Evaluation and Benchmarking
- SAM Audio-Bench is introduced as a comprehensive benchmark for audio separation, covering various audio domains and interaction types [29][30]
- SAM Audio Judge is a new evaluation framework that assesses audio segmentation quality based on human perception rather than traditional reference-audio comparisons [26][27]

Group 4: Performance and Future Outlook
- SAM Audio achieves state-of-the-art performance across multiple benchmarks and tasks, outperforming previous models in audio separation [35][36]
- The model operates efficiently with a real-time factor of approximately 0.7, capable of handling large-scale audio processing (see the note on RTF after this summary) [40]
- Meta aims to promote accessibility and creativity through SAM Audio, collaborating with partners to explore its potential in assistive technologies [42]
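The real-time factor (RTF) cited above is the standard ratio of processing time to audio duration; values below 1.0 mean processing finishes faster than playback. A tiny illustration follows; the timings are hypothetical, chosen only to produce an RTF of 0.7.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds

# Hypothetical example: a 60-second clip separated in 42 seconds.
print(real_time_factor(42.0, 60.0))  # 0.7 -> faster than real time
```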
It's official: Yao Shunyu becomes Tencent's Chief AI Scientist, leading large language models and AI Infra
机器之心· 2025-12-17 09:42
Core Insights
- OpenAI researcher Yao Shunyu has joined Tencent, igniting discussions in the AI community [1]
- Tencent has upgraded its large-model research framework, establishing new departments to enhance its capabilities [2][3]

Group 1: Organizational Changes
- Tencent has formed the AI Infra Department and AI Data Department to strengthen its large-model research and core capabilities [2]
- Yao Shunyu has been appointed Chief AI Scientist, reporting to Tencent President Liu Chiping, and will also lead the AI Infra Department and the large language model department [2][5]

Group 2: Department Responsibilities
- The AI Infra Department will focus on building technical capabilities for large-model training and inference platforms, emphasizing distributed training and high-performance inference services [3]
- The AI Data Department and Data Computing Platform Department will be responsible for constructing data and evaluation systems for large models and integrating big data with machine learning [4]

Group 3: Yao Shunyu's Background
- Yao Shunyu is a prominent young researcher in artificial intelligence, particularly in the area of intelligent agents [6]
- Prior to joining OpenAI, he made significant contributions to the field of language agents, and his papers have been cited more than 19,000 times in total [7]
Taking on Sora 2: Wan 2.6 makes character customization and shot control easy, so anyone can be a director
机器之心· 2025-12-17 05:28
Core Insights
- The article highlights the rapid advances in video generation technology, focusing on the release of Alibaba's Wan 2.6 model, which significantly expands what users can do in video creation and storytelling [1][36]

Group 1: Technological Advancements
- OpenAI's Sora 2 introduced a "Cameo" feature that addresses the character-consistency problem in AI video generation, turning the process from unpredictable into controllable [1]
- Alibaba's Wan 2.6 model is noted for its comprehensive capabilities, including synchronized audio and visuals, allowing users to create videos with a high degree of realism and narrative coherence [3][9]
- The new model supports video generation of up to 15 seconds, the longest among domestic models, and includes a shot-control feature for professional storytelling [3][4]

Group 2: User Experience and Accessibility
- The Wan 2.5 version made video creation accessible on mobile devices, while version 2.6 further democratizes professional video production, letting anyone take on roles such as director or actor [2][4]
- Users can create videos with high fidelity in both visuals and audio, and the model accurately reproduces character traits and emotional expressions [11][24]

Group 3: Practical Applications
- The model can generate complete narrative short films, making it suitable for advertising design and short-drama production [16]
- Its potential extends across creative fields, including AI comic production, advertising design, and short-video creation, with more than ten visual creation capabilities supported [35][36]

Group 4: Conclusion and Future Implications
- The release of Wan 2.6 marks a shift from the "lottery" approach to AI video generation toward precise, controllable cinematic creation [36]
- The technology effectively removes barriers to creativity, letting users treat their imagination as the primary production tool [37]
WAIC Future Tech 2026: global tech exposure and collaboration, capital's next gold rush
机器之心· 2025-12-17 05:28
Core Viewpoint
- The event focuses on the launch of a collaborative innovation ecosystem in the AI sector, showcasing projects that apply AI technology across different industries [1][2]

Group 1: Event Overview
- The event is scheduled for December 20, 2025, at Tsinghua Science Park, Beijing, starting at 1:00 PM [5]
- It includes a launch ceremony for the collaborative innovation ecosystem and a roundtable with a mystery guest [2]

Group 2: Project Highlights
- A total of 14 projects will be presented, primarily focusing on AI applications, infrastructure, hardware, and cutting-edge technology, targeting seed to Series A funding stages [4]
- Notable projects include:
  - AI-driven solutions for global mineral resource discovery [7]
  - Data-driven to decision-driven paradigms in large enterprises [8]
  - Cooling solutions for high-density computing [8]
  - Robotics solutions for developers [8]
  - AI-powered intelligent assistants [8]
  - AI-driven entertainment and gaming solutions [8][9]
Experiential-memory breakthrough: LightSearcher cuts AI tool calls by 39.6% and speeds up reasoning by 48.6%
机器之心· 2025-12-17 05:28
Core Insights
- The article discusses the challenges facing existing RL-driven deep reasoning models, particularly the trade-off between accuracy and efficiency: frequent calls to external search tools improve accuracy but significantly increase response time [2][6]
- The LightSearcher framework, introduced by the Beijing University of Posts and Telecommunications AI team, addresses these challenges by combining experiential memory with adaptive reward shaping, improving efficiency while maintaining accuracy [3][9]

Summary by Sections

Introduction
- Deep reasoning models need to control their use of search tools strategically; existing methods fall short in balancing accuracy and efficiency [6]

LightSearcher Framework
- LightSearcher optimizes the use of search tools through experiential memory, which turns implicit reasoning paths into explicit guiding experience, together with adaptive reward mechanisms [9][11]

Experimental Results
- Comprehensive evaluations on multiple multi-hop QA benchmark datasets show that LightSearcher maintains competitive accuracy while cutting search-tool calls by 39.6%, reasoning time by 48.6%, and token consumption by 21.2% [18]
- The framework's core components are:
  - Contrastive Experiential Reasoning, which builds a dynamic memory library from high- and low-quality reasoning paths [14]
  - Adaptive Reward Shaping, which minimizes redundant tool calls and balances accuracy against efficiency (a toy reward function in this spirit follows this summary) [14]
  - Experience-based RL training, which folds accumulated experience into prompt templates to guide efficient reasoning [14]

Conclusion
- LightSearcher offers a new path toward efficient and reliable deep reasoning systems, with potential applications beyond multi-hop QA, such as code synthesis and strategic planning [18][20]
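To make the reward-shaping idea concrete, here is a toy reward in the same spirit: it gives credit for a correct answer and subtracts a penalty only for tool calls beyond a budget. The coefficients, budget, and functional form are illustrative assumptions, not LightSearcher's actual reward.

```python
def shaped_reward(correct: bool, tool_calls: int,
                  budget: int = 3, alpha: float = 1.0, beta: float = 0.1) -> float:
    """Toy reward: accuracy term minus a penalty for tool calls beyond `budget`.

    alpha, beta, and budget are illustrative knobs, not the paper's coefficients.
    """
    accuracy_term = alpha if correct else 0.0
    efficiency_penalty = beta * max(0, tool_calls - budget)
    return accuracy_term - efficiency_penalty

print(shaped_reward(correct=True, tool_calls=2))   # 1.0  (within budget, no penalty)
print(shaped_reward(correct=True, tool_calls=6))   # 0.7  (3 redundant calls penalized)
print(shaped_reward(correct=False, tool_calls=1))  # 0.0
```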
SIGGRAPH Asia 2025: Moore Threads wins a top prize in the graphics conference's 3DGS challenge and fully open-sources its in-house LiteGS
机器之心· 2025-12-17 05:28
Core Insights
- Moore Threads won the silver medal at the 3D Gaussian Splatting Reconstruction Challenge during SIGGRAPH Asia 2025, showcasing its advanced algorithm capabilities and hardware-software co-optimization in next-generation graphics rendering technology [1][16]

Group 1: 3D Gaussian Splatting Technology
- 3D Gaussian Splatting (3DGS) is a revolutionary 3D scene representation and rendering technique that achieves an exceptional balance between image quality, efficiency, and resource usage, improving rendering efficiency by hundreds to thousands of times over traditional NeRF methods [4][19]
- 3DGS has shown strong adaptability and scalability in areas such as ray tracing, real-time VR/AR rendering, and multimodal fusion, making it a key technology in the evolving graphics-rendering landscape [4][8]

Group 2: Competition Overview
- The 3DGS Reconstruction Challenge required participants to complete a high-quality 3DGS reconstruction within 60 seconds from provided real terminal video sequences and SLAM point clouds, emphasizing both reconstruction quality and speed [10][12]
- The evaluation metrics included PSNR (peak signal-to-noise ratio) and reconstruction speed, ensuring a fair and authoritative ranking of the competing teams (a short PSNR reference snippet follows this summary) [12]

Group 3: Performance Results
- Moore Threads' team, competing as "MT-AI," achieved an average PSNR of 27.58 and a reconstruction time of 34 seconds, placing third overall in the competition [17][20]
- The results highlighted the company's leading capabilities in 3DGS algorithm construction and hardware-software co-optimization [16][20]

Group 4: LiteGS Development
- Moore Threads developed the LiteGS library, which optimizes the entire pipeline from the GPU system to data management and algorithm design, achieving training acceleration of up to 10.8x while reducing the parameter count by over 50% [20][25]
- LiteGS has been open-sourced on GitHub to promote collaboration and continuous evolution in 3D reconstruction and rendering technologies [27]

Group 5: Strategic Implications
- The success at the SIGGRAPH Asia competition reflects Moore Threads' strategic understanding of global technology trends and its ability to help set future directions in graphics computing [28]
- The advances in 3DGS highlight the demanding requirements for algorithm-hardware co-design, positioning Moore Threads as a forward-thinking player in graphics and intelligent computing [28]
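Since the challenge ranks entries by PSNR, here is a standard, self-contained PSNR routine, i.e. 10 * log10(MAX^2 / MSE) between a rendered image and ground truth; the random test images are placeholders, not challenge data.

```python
import numpy as np

def psnr(reference: np.ndarray, rendered: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - rendered.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Toy check on random images in [0, 1]; a real evaluation would average PSNR over test views.
ref = np.random.rand(256, 256, 3)
noisy = np.clip(ref + np.random.normal(0, 0.02, ref.shape), 0, 1)
print(round(psnr(ref, noisy), 2))  # roughly 34 dB for this noise level
```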
VGGT4D: training-free mining of 3D foundation models' potential for 4D dynamic scene reconstruction
机器之心· 2025-12-17 02:05
Core Insights
- The article discusses VGGT4D, a framework developed by researchers from the Hong Kong University of Science and Technology (Guangzhou) and Horizon Robotics, aimed at enabling 3D foundation models to handle dynamic 4D scenes without additional training cost [2][4][33]
- VGGT4D leverages hidden motion cues within the attention layers of the Visual Geometry Grounded Transformer (VGGT) to improve performance on tasks such as dynamic object segmentation, camera pose estimation, and long-sequence 4D reconstruction [2][4][6]

Research Background
- Traditional 3D foundation models such as VGGT and DUSt3R excel at static scene reconstruction but struggle with dynamic 4D scenes containing moving objects, leading to significant performance drops [6][7]
- Existing solutions often suffer from high computational cost and reliance on external priors, which complicate the system [9][12]

Methodology
- VGGT4D introduces a training-free mechanism for attention-feature mining and mask refinement, using Gram matrices and gradient flows for high-precision dynamic-static separation (a toy Gram-matrix sketch follows this summary) [14][17]
- The framework addresses the limitations of standard attention maps by employing self-similarity Gram matrices to raise the signal-to-noise ratio, allowing motion cues to be extracted more reliably [16][17]

Experimental Validation
- VGGT4D was evaluated on dynamic object segmentation, camera pose estimation, and 4D point cloud reconstruction across six benchmark datasets, outperforming competing methods [22][23]
- In dynamic object segmentation, VGGT4D achieved the best performance on the DAVIS-2016 and DAVIS-2017 datasets, outperforming all variants without any 4D-specific training [24][25]
- For camera pose estimation, VGGT4D consistently improved on the strong baseline set by the original VGGT model, achieving an Absolute Trajectory Error (ATE) of 0.164 on the VKITTI dataset versus 2.272 for MonST3R [27][28]

Conclusion
- VGGT4D successfully extends 3D foundation models to 4D dynamic scenes through effective internal-feature mining, providing a low-cost route to 4D reconstruction and showcasing the potential of foundation models for zero-shot transfer [33]
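The self-similarity Gram matrix mentioned above is just the pairwise (cosine) similarity of token features. Below is a minimal, illustrative PyTorch sketch of that computation plus a naive thresholding step; the thresholding heuristic, dimensions, and random features are assumptions for illustration, not VGGT4D's actual mask-refinement procedure.

```python
import torch
import torch.nn.functional as F

def self_similarity_gram(tokens: torch.Tensor) -> torch.Tensor:
    """Gram (cosine-similarity) matrix of per-token features: (N, D) -> (N, N)."""
    f = F.normalize(tokens, dim=-1)
    return f @ f.t()

def naive_dynamic_mask(gram: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Naive heuristic: tokens whose mean similarity to the rest of the frame is low
    # are flagged as candidate dynamic regions. The threshold is illustrative.
    mean_sim = gram.mean(dim=-1)
    return mean_sim < threshold

# Plumbing check on random features standing in for one frame's patch tokens.
tokens = torch.randn(196, 128)            # e.g. 14x14 patches, 128-d features
gram = self_similarity_gram(tokens)
mask = naive_dynamic_mask(gram)
print(gram.shape, mask.shape)             # torch.Size([196, 196]) torch.Size([196])
```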