Generative Models
Performance Gains Without Retraining: HKU Team Proposes GPC Framework for Robot "Policy Composition"
机器之心· 2025-10-19 09:17
Core Viewpoint
- The article introduces the General Policy Composition (GPC) framework, a novel, training-free approach that enhances robot control performance by dynamically combining multiple pre-trained policies at test time, overcoming the limitations of traditional training methods [2][5][7].

Summary by Sections

Improving Policy Performance
- GPC represents a paradigm shift: instead of relying on additional training, it improves policy performance by composing existing policies [6][15].

Innovative Theoretical Foundation
- The framework is built on two key theoretical findings:
1. Functional-Level Improvement: convex combinations of the decision scores of multiple pre-trained policies can yield a more accurate composed score than any single policy (see the sketch after this summary) [9].
2. System-Level Stability: improvements in single-step error propagate through the entire trajectory, leading to overall performance gains [10].

General "Policy Composer"
- GPC's core advantage is its plug-and-play nature, allowing seamless integration of diverse robot policies without retraining [14][15].

Heterogeneous Policy Flexibility
- GPC can flexibly compose policies across different architectures and modalities, balancing information from multiple conditions to produce stable, coherent action trajectories [17][19].

Weight Search for Optimal Composition
- GPC's weight-search mechanism tailors the weight configuration to each task; how weight is distributed across policies is decisive for the effectiveness of the composed policy [22][23].

Experimental Validation
- GPC outperforms single-policy baselines in both simulation and real-world environments, improving success rates by up to 7.55% on simulation tasks and by 5-10% on real-world tasks [28][30].

Key Findings from Experiments
- Three core findings highlight GPC's versatility:
1. Composing policies of moderate individual accuracy can yield higher accuracy than any one of them alone [29].
2. Including a weak policy can drag down overall performance, so contributing policies must be selected carefully [29].
3. Performance is maximized when stronger policies are given greater weight in the combination [29].
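To make the functional-level result concrete, here is a minimal sketch of test-time score composition for diffusion-based policies: at each denoising step, the noise predictions of several pre-trained policies are blended with convex weights before the update is applied. The `policies` interface, noise schedule, and update rule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def compose_scores(scores, weights):
    """Convex combination of per-policy scores (weights >= 0, sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return sum(w * s for w, s in zip(weights, scores))

def sample_action(policies, weights, obs, num_steps=50, action_dim=7, seed=0):
    """Denoise an action vector using the composed score.

    `policies` is assumed to be a list of callables mapping
    (noisy_action, obs, step) -> predicted noise, one per pre-trained policy.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_dim)   # start from pure noise
    for t in reversed(range(1, num_steps + 1)):
        eps = compose_scores([p(x, obs, t) for p in policies], weights)
        x = x - eps / num_steps           # toy Euler-style denoising update
    return x
```

The weight-search step described above can then be as simple as evaluating a small grid of convex weight vectors on a handful of rollouts per task and keeping the best-performing configuration.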
Insta360's Latest Panoramic Survey: Challenges, Methods, and Future of Panoramic Vision
机器之心· 2025-10-04 03:38
Core Insights
- The article discusses the transition from perspective vision to panoramic vision, framing the "perspective-panorama gap" as the central lens for understanding the challenges and opportunities in this field [6][19].
- It argues that a systematic upgrade across data, models, and applications is needed to make panoramic vision technologies genuinely usable [16][19].

Research Background and Motivation
- The paper "One Flight Over the Gap: A Survey from Perspective to Panoramic Vision" systematically analyzes the differences between perspective and panoramic vision, covering over 300 papers and 20 representative tasks [4][19].
- The challenges of panoramic vision are categorized into three main gaps: geometric distortion, non-uniform sampling, and boundary continuity [6][9].

Strategies Overview
- Three gaps motivate the adaptation of tasks to panoramic vision:
1. Geometric Distortion: projecting spherical images onto a plane distorts object shapes [7].
2. Non-uniform Sampling: pixel density varies sharply across regions, so effective resolution is uneven (see the sketch at the end of this summary) [7].
3. Boundary Continuity: cutting the sphere open into a 2D image breaks continuity at the seams, which models must learn around [7].
- Four main adaptation strategies are identified, and a cross-method comparison clarifies which strategies suit which tasks [9][15].

Task Toolbox
- The survey organizes over 20 tasks into four areas: enhancement and assessment, understanding, multi-modal, and generation, with representative methods and key papers for each task [12][15].
- It highlights the rapid emergence of new paradigms such as diffusion and generative models, particularly in text-to-image/video and novel view synthesis [15].

Future Directions
- To move from "usable" to "user-friendly," advances are needed in three areas: data, model paradigms, and downstream applications [16][21]. Key challenges include:
1. Data Bottlenecks: the lack of large-scale, diverse, high-quality 360° datasets limits general training and reproducible evaluation [21].
2. Model Paradigms: robust models are needed that transfer from perspective to panoramic vision while maintaining performance across tasks [21].
3. Downstream Applications: spatial intelligence, XR, 3D reconstruction, and other industry applications require effective deployment and compliance [21][22].
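Non-uniform sampling in the standard equirectangular projection (ERP) has a simple closed form: a pixel row at latitude θ covers a solid angle proportional to cos θ, so rows near the poles are heavily oversampled. A minimal sketch of the per-row weights, in the style of WS-PSNR-like weighted metrics (function names here are illustrative):

```python
import numpy as np

def erp_row_weights(height):
    """Solid-angle weight of each pixel row in an equirectangular image.

    Row centers run from latitude +pi/2 (top) to -pi/2 (bottom); each
    row's pixels cover a solid angle proportional to cos(latitude), so
    polar rows contribute far less real scene content per pixel.
    """
    theta = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    return np.cos(theta)

def weighted_mse(img_a, img_b):
    """MSE between two ERP images, weighted by per-row solid angle."""
    w = erp_row_weights(img_a.shape[0])[:, None]   # broadcast over columns
    err = (img_a.astype(float) - img_b.astype(float)) ** 2
    return float((w * err).sum() / (w * np.ones_like(err)).sum())
```

This is why naive per-pixel losses and metrics over-reward polar regions, and why panoramic methods either reweight by solid angle or switch to less distorted projections.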
Reconstructing 3D Space from Just Two Images? Tsinghua & NTU Use Generative Models to Unlock a New Paradigm for Spatial Intelligence
量子位· 2025-07-09 01:18
Core Viewpoint
- LangScene-X introduces a generative framework for building generalizable 3D language-embedded scenes from only sparse views, drastically reducing the number of input images compared with traditional methods such as NeRF, which typically need over 20 views [2][5].

Group 1: Challenges in 3D Language Scene Generation
- Current 3D language scene generation faces three core challenges. First, there is a tension between dense-view dependency and sparse inputs: with only 2-3 images, reconstructions suffer severe 3D structure artifacts and semantic distortion [5].
- Second, cross-modal information is disconnected and 3D consistency is lacking: existing models process appearance, geometry, and semantics independently, producing semantic misalignment [6].
- Third, high-dimensional language features are hard to compress, and weak generalization blocks practical use: existing methods lose significant accuracy when switching scenes [7].

Group 2: Solutions Offered by LangScene-X
- The TriMap video diffusion model enables unified multimodal generation under sparse-input conditions, substantially improving RGB and normal consistency errors and semantic mask boundary accuracy [8].
- The Language Quantization Compressor (LQC) rethinks high-dimensional feature compression, mapping high-dimensional CLIP features to 3D discrete indices with minimal reconstruction error and improving cross-scene transferability (see the sketch after this summary) [9][10].
- A progressive training strategy yields seamless joint generation of RGB images, normal maps, and semantic segmentation maps, improving the efficiency of the 3D reconstruction pipeline [14].

Group 3: Spatial Intelligence and Performance Metrics
- LangScene-X aligns text prompts with 3D scene surfaces, so natural language queries can localize objects within 3D environments [15].
- On the LERF-OVS dataset, LangScene-X reaches an overall mean accuracy (mAcc) of 80.85% and a mean intersection over union (mIoU) of 50.52%, significantly outperforming existing methods [16].
- These capabilities position it as a potential core driver for VR scene construction, human-computer interaction, and foundational technologies for autonomous driving and embodied intelligence [18].
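The LQC idea, compressing high-dimensional language features into compact discrete codes via a learned codebook, follows the general vector-quantization pattern. A toy nearest-neighbor sketch (codebook size, feature dimension, and class name are all illustrative; the actual LQC is trained end to end and differs in design):

```python
import numpy as np

class ToyVectorQuantizer:
    """Encode D-dim features as indices into a fixed random codebook."""

    def __init__(self, num_codes=512, dim=512, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.standard_normal((num_codes, dim))

    def encode(self, features):
        """Map (N, D) features to (N,) nearest-codeword indices."""
        d2 = ((features[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def decode(self, indices):
        """Recover (N, D) approximate features from indices."""
        return self.codebook[indices]

# Usage: quantize CLIP-like features and check the round-trip error.
feats = np.random.default_rng(1).standard_normal((100, 512))
vq = ToyVectorQuantizer()
recon = vq.decode(vq.encode(feats))
print(((feats - recon) ** 2).mean())  # reconstruction MSE
```

A trained codebook would of course give far lower reconstruction error than this random one; the sketch only shows the encode/decode interface and why storing a small integer per feature is so much cheaper than storing the feature itself.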
Results Are Out! Latest ICCV 2025 Roundup (Autonomous Driving / Embodied AI / 3D Vision / LLM / CV, etc.)
自动驾驶之心· 2025-06-28 13:34
Core Insights
- The article covers the recent ICCV results announcement, highlighting the excitement around newly released works on autonomous driving and advancements in the field [2].

Group 1: Autonomous Driving Innovations
- DriveArena: a controllable generative simulation platform aimed at enhancing autonomous driving capabilities [4].
- Epona: an autoregressive diffusion world model designed specifically for autonomous driving [4].
- SynthDrive: a scalable Real2Sim2Real sensor simulation pipeline for high-fidelity asset generation and driving data synthesis [4].
- StableDepth: scene-consistent and scale-invariant monocular depth estimation, crucial for improving perception in autonomous vehicles [4].
- CoopTrack: end-to-end learning for efficient cooperative sequential perception, strengthening collaboration between autonomous systems [4].

Group 2: Image and Vision Technologies
- CycleVAR: repurposes autoregressive models for unsupervised one-step image translation, useful for visual recognition in driving scenarios [5].
- CoST: efficient collaborative perception from a unified spatiotemporal perspective, essential for real-time decision-making [5].
- Hi3DGen: high-fidelity 3D geometry generation from images via normal bridging, improving spatial understanding of environments [5].
- GS-Occ3D: scaling vision-only occupancy reconstruction for autonomous driving with Gaussian splatting [5].

Group 3: Large Model Applications
- ETA: a dual approach to self-driving with large models, improving the efficiency and effectiveness of autonomous driving systems [5].
- Taming the Untamed: graph-based knowledge retrieval and reasoning for multimodal large language models (MLLMs), which can significantly improve decision-making in autonomous driving [7].
An Incomplete ICCV 2025 Roundup (Embodied AI / Autonomous Driving / 3D Vision / LLM / CV, etc.)
具身智能之心· 2025-06-27 09:41
Group 1
- The article reports on the ICCV 2025 results announcement, highlighting various works accepted for presentation [1].
- It emphasizes the role of the "Embodied Intelligence" community in sharing insights and developments around the accepted works [1].
- Readers are encouraged to join the community for timely updates on ongoing research in the field [1].

Group 2
- Several accepted works on embodied intelligence and autonomous driving are summarized, showing advances in areas such as robotic manipulation and navigation [4][6].
- Highlighted projects include "GaussianProperty" and "DriveArena," which focus on integrating physical properties and on generative simulation for autonomous driving, respectively [4].
- Works on 3D reconstruction and visual recognition are also listed, reflecting the breadth of research topics being explored [5][6].
After a Year, Has Apple Finally Beaten the Same-Size Qwen 2.5? Three Lines of Code to Access Apple Intelligence, with Details on How It Does Inference
AI前线· 2025-06-10 10:05
Core Insights
- Apple has introduced a new generation of language foundation models designed to power Apple Intelligence, featuring a compact on-device model of roughly 3 billion parameters and a server-side mixture-of-experts model tailored to its private cloud architecture [1][4][6].

Model Overview
- The new Foundation Models framework gives third-party developers access to the core large language models behind Apple Intelligence, letting them integrate the models into their applications with minimal code [4][20].
- The on-device model is optimized for efficiency and low latency on Apple silicon, while the server-side model targets higher precision and scalability for more complex tasks [6][7].

Performance Evaluation
- In Apple's evaluations, the on-device model outperforms the slightly larger Qwen-2.5-3B across all language environments and is competitive with the larger Qwen-3-4B in English [8][10].
- The server-side model beats Llama-4-Scout but trails much larger models such as Qwen-3-235B and the proprietary GPT-4o [8][10].

Architectural Innovations
- The on-device model cuts key-value cache memory usage by 38.5% and improves time-to-first-token generation [7].
- The server-side model uses a parallel-track mixture-of-experts (PT-MoE) design, improving efficiency and scalability without sacrificing quality [7][8].

Training Improvements
- Apple revamped its training recipe to strengthen reasoning, using a multi-stage pre-training process that significantly reduces training costs [14][16].
- Visual understanding has been integrated into the models without degrading text capabilities, lifting overall performance [16].

Compression Techniques
- Apple uses quantization to shrink model size and power consumption, compressing on-device model weights to 2 bits per weight and server-side model weights to 3.56 bits per weight (see the sketch after this summary) [17][18].
- Quality is preserved through additional training data and low-rank adapters, with only minor regressions observed in performance metrics [17].

Developer Accessibility
- The Foundation Models framework is designed to be easy to adopt: developers can add AI capabilities to their applications with as little as three lines of code [20][21].
- It supports Swift natively and includes guided generation and tool invocation, simplifying integration [20][21].

Current Status
- The framework is currently in testing through the Apple Developer Program, with a public beta expected soon [22].
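The 2-bit-per-weight figure corresponds to four quantization levels per weight group. A minimal symmetric group-wise quantization sketch to illustrate the storage math (group size and rounding scheme are illustrative; Apple's actual pipeline relies on quantization-aware training plus low-rank adapters to recover quality):

```python
import numpy as np

def quantize_2bit(weights, group_size=16):
    """Symmetric 2-bit quantization: 4 levels per group of weights.

    Each group stores one float scale plus 2 bits per weight, so the
    payload drops from 16 bits per weight (fp16) to roughly
    2 + 16 / group_size bits. Assumes weights.size % group_size == 0.
    """
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 1.5  # levels at +/-0.5, +/-1.5
    scale[scale == 0] = 1.0
    codes = np.clip(np.round(w / scale + 1.5), 0, 3).astype(np.uint8)
    return codes, scale

def dequantize_2bit(codes, scale, shape):
    """Reconstruct approximate weights from codes and per-group scales."""
    return ((codes.astype(np.float64) - 1.5) * scale).reshape(shape)

# Usage: round-trip a toy weight matrix and measure the error introduced.
w = np.random.default_rng(0).standard_normal((64, 64))
codes, scale = quantize_2bit(w.ravel())
w_hat = dequantize_2bit(codes, scale, w.shape)
print(np.abs(w - w_hat).mean())  # mean absolute quantization error
```

The point of the low-rank adapters mentioned above is precisely to absorb this quantization error: a small trainable correction on top of frozen 2-bit weights recovers most of the lost quality.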
A Single Markdown File Earns 400+ Stars: This Survey Analyzes 3D Scene Generation Across Four Paradigms
机器之心· 2025-06-10 08:41
Core Insights
- The article discusses advancements in 3D scene generation, highlighting a comprehensive survey that categorizes existing methods into four main paradigms: procedural methods, neural-network-based 3D representation generation, image-driven generation, and video-driven generation [2][4][7].

Summary by Sections

Overview of 3D Scene Generation
- The survey "3D Scene Generation: A Survey" reviews over 300 representative papers and outlines the field's rapid growth since 2021, driven by the rise of generative models and new 3D representations [2][4][5].

Four Main Paradigms
- The four paradigms provide a clear technical roadmap for 3D scene generation, with performance compared along dimensions such as realism, diversity, viewpoint consistency, semantic consistency, efficiency, controllability, and physical realism [7].

Procedural Generation
- Procedural methods automatically construct complex 3D environments from predefined rules and constraints and are widely applied in gaming and graphics engines. This category can be further divided into neural-network-based generation, rule-based generation, constraint optimization, and large-language-model-assisted generation (see the toy example after this summary) [8].

Image-based and Video-based Generation
- Image-based generation leverages 2D image models to reconstruct 3D structure, while video-based generation treats a 3D scene as a sequence of images, integrating spatial modeling with temporal consistency [9].

Challenges in 3D Scene Generation
- Despite significant progress, controllable, high-fidelity, physically realistic 3D modeling remains open. Key issues include uneven generation capabilities, the need for improved 3D representations, limited high-quality data, and a lack of unified evaluation standards [10][16].

Future Directions
- Future advancements should focus on higher-fidelity generation, parameter control, holistic scene generation, and physical constraints that keep structure and semantics consistent. Supporting interactive scene generation and unifying perception and generation capabilities are also crucial for next-generation 3D modeling systems [12][18].
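Rule-based procedural generation, the oldest of the four paradigms, can be illustrated with a toy layout sampler that places objects in a room under a simple non-overlap constraint (all rules and parameters here are invented for illustration, not drawn from any surveyed system):

```python
import random

def generate_layout(num_objects=10, room=(10.0, 10.0), min_gap=1.0, seed=42):
    """Toy rule-based layout: rejection-sample object positions so that
    no two objects are closer than `min_gap` (a non-overlap constraint)."""
    rng = random.Random(seed)
    placed = []
    attempts = 0
    while len(placed) < num_objects and attempts < 10_000:
        attempts += 1
        x, y = rng.uniform(0, room[0]), rng.uniform(0, room[1])
        if all((x - px) ** 2 + (y - py) ** 2 >= min_gap ** 2
               for px, py in placed):
            placed.append((x, y))
    return placed

print(generate_layout())  # list of (x, y) positions satisfying the rule
```

Constraint-optimization and LLM-assisted variants replace this rejection loop with a solver or with rules proposed by a language model, but the structure of sample, check constraints, accept stays recognizable.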
Can People Really Fall in Love with ChatGPT? After a Week of "Dating" an AI, Something Felt Off
Hu Xiu· 2025-05-11 07:02
Group 1
- The article discusses the growing phenomenon of human-AI relationships, highlighting cases where individuals have developed emotional connections with AI strong enough to drive major life decisions, such as divorce and "marrying" an AI [2][35][41].
- Some users become so immersed in their interactions that they perceive the AI as a friend or partner, raising concerns about the implications for real-life relationships and mental health [6][41][49].
- It emphasizes the need for users to be aware of potential dependency on AI, especially those with underlying psychological issues, and suggests that AI should not replace human interaction [42][57].

Group 2
- The text outlines strategies users employ to enhance their interactions with AI, such as customizing prompts and learning the AI's response patterns to create a more engaging experience [9][31][44].
- It highlights that treating AI as a conversational partner rather than just a tool can lead to deeper self-reflection and personal insights [32][41].
- It also points out AI's limitations: while it offers immediate feedback and companionship, it lacks true emotional understanding and memory retention, which can lead to disillusionment [55][56].