机器之心
PixelRefer: Moving AI from "viewing the whole image" to "understanding every object"
机器之心· 2025-11-10 23:47
Core Insights
- The article discusses the limitations of current multimodal large language models (MLLMs) in achieving the fine-grained, object-level understanding required by real-world applications such as autonomous driving and medical imaging, highlighting the need for a more detailed visual understanding framework [2][38]
- PixelRefer is introduced as an innovative solution: a unified spatio-temporal understanding framework capable of fine-grained visual referring and reasoning at arbitrary granularity, outperforming existing models on several benchmarks [2][38]

Model Overview
- PixelRefer feeds global visual tokens, pixel-level region tokens, and text tokens jointly into a large language model (LLM), retaining both scene context and object-level reasoning capability (a minimal sketch of this token fusion appears after this summary) [16][22]
- The model's lightweight variant, PixelRefer-Lite, achieves a 4x increase in inference speed and halves memory usage compared with existing models such as DAM-3B [2][33]

Methodology
- The authors propose two frameworks for pixel-level fine-grained understanding, a Vision-Object Framework and an Object-Only Framework, emphasizing the importance of high-quality pixel-level object representation [15][22]
- A Scale-Adaptive Object Tokenizer (SAOT) is introduced to generate precise and compact object representations, addressing the difficulty of preserving detail for both small and large objects [17][16]

Performance Metrics
- PixelRefer achieves state-of-the-art (SOTA) performance across image understanding benchmarks, including PACO and DLC-Bench, with notable advantages in reasoning scenarios [28][30]
- On video pixel-level understanding benchmarks, PixelRefer likewise shows superior performance, particularly in video captioning and question answering [29][31]

Applications and Future Directions
- The advances presented by PixelRefer mark a shift toward understanding the dynamic details of the world, with potential applications in autonomous driving, medical imaging, intelligent video editing, and multimodal dialogue systems [38][40]
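The token-fusion idea above lends itself to a short illustration. The following is a minimal sketch, not the released PixelRefer implementation: the object tokenizer is reduced to masked average pooling plus a projection, and all module names, shapes, and token counts are assumptions made for illustration.

```python
# Minimal sketch (not PixelRefer's code) of fusing global visual tokens,
# pixel-level object/region tokens, and text tokens into one LLM input sequence.
import torch
import torch.nn as nn

class ObjectTokenizer(nn.Module):
    """Stand-in for a scale-adaptive object tokenizer: pools features inside a mask."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat_map: (C, H, W); mask: (H, W) binary region mask
        weights = mask.flatten().float()
        weights = weights / weights.sum().clamp(min=1.0)
        pooled = (feat_map.flatten(1) * weights).sum(dim=1)  # (C,) masked average
        return self.proj(pooled).unsqueeze(0)                # (1, llm_dim): one token per object

def build_llm_inputs(global_tokens, object_tokens, text_embeds):
    # Concatenate scene-level tokens, per-object tokens, and text embeddings
    # into a single sequence for the language model.
    return torch.cat([global_tokens, object_tokens, text_embeds], dim=0)

if __name__ == "__main__":
    C, H, W, D = 256, 24, 24, 1024
    tokenizer = ObjectTokenizer(C, D)
    feat_map = torch.randn(C, H, W)
    mask = torch.zeros(H, W)
    mask[5:12, 5:12] = 1                      # hypothetical object region
    obj_tok = tokenizer(feat_map, mask)
    global_toks = torch.randn(64, D)          # pooled scene tokens (assumed count)
    text_embeds = torch.randn(16, D)          # embedded prompt tokens
    seq = build_llm_inputs(global_toks, obj_tok, text_embeds)
    print(seq.shape)                          # torch.Size([81, 1024])
```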
From Perception to Action: You're Invited to Unlock This Frontier Technology Salon on Embodied Intelligence
机器之心· 2025-11-10 10:42
Core Insights
- The article discusses a significant paradigm shift in artificial intelligence from "perceptual intelligence" to "action intelligence," aiming to equip machines with the ability to understand, decide, and act in the physical world, transitioning from "passive observation" to "active interaction" [2]

Group 1: Technological Advancements
- Continuous iteration of foundational intelligence models is breaking down the barriers between perception and decision-making, enabling more human-like interaction for intelligent agents [2]
- The accumulation and governance of multimodal data are laying a solid foundation for the practical application of the technology [2]
- Breakthroughs in teleoperation technology are connecting the virtual and real worlds, enabling precise remote control [2]

Group 2: Event Details
- The "Virtual-Real Resonance: Model X Terminal Technology Salon" will be held on November 14 at PAGEONE (Wudaokou) in Beijing [4][5]
- The event will feature several key presentations, including talks on world models, open-source tools for VLA models, and data-driven paths to embodied intelligence [4][5]

Group 3: Featured Speakers
- Mao Jiming, Vice President of Beijing Jiajia Vision Technology Co., has over 16 years of engineering and architecture experience, focusing on large-scale distributed systems and autonomous driving simulation technology [7]
- Wang Tiancai, a founding team member of Dexmal, has published over 30 papers at top international conferences and is a core author of notable autonomous driving algorithms [8]
- Jin Ge, founder of Lingyu Intelligent, has extensive experience in early-stage investment and high-tech entrepreneurship [9]
- Yuan Haoki, head of the Smart Beyond team, specializes in reinforcement learning and embodied intelligence [10]
- Huang Suining, CEO of Beipei Technology, has a rich background in AI and child psychology, focusing on integrating AI technology into child development [11]
- Qian Zhuang, product head at Yingzhi Technology, has a decade of product development experience and holds over 100 patents [11]
The 2025 Baoshan Intelligent Robot Industry Conference and Carnival Is About to Launch
机器之心· 2025-11-10 04:40
Core Viewpoint
- The article discusses the upcoming "2025 Baoshan Intelligent Robot Industry Conference and Carnival," which aims to explore future development directions and collaborative elements of the intelligent robot industry amid the global AI wave [2][4]

Event Overview
- The conference will take place from November 21 to 22, 2025, at the Zhihui Bay Science and Technology Park, organized by various governmental and academic institutions [4]
- It is positioned as an annual celebration of the intelligent robot industry, bringing together key players from academia, industry, and investment to draw up a new blueprint for development in the AI era [4][5]

Collaborative Efforts
- Baoshan District is committed to building a comprehensive support system for the robot industry through deep collaboration among government, industry, academia, application scenarios, and finance [5]
- The event will unveil a three-year action plan for robotics and a series of service platforms aimed at providing one-stop support for technology breakthroughs, application scenarios, and capital matchmaking [5]

Forum Structure
- The event features a main forum and three thematic forums covering top-level design, embodied intelligence technology, core components, and the talent ecosystem, offering a comprehensive view of the industry [6][7]
- Notable speakers include academicians and leaders from prominent robotics companies, who will discuss breakthroughs and development paths in intelligent robotics [7]

Project Roadshow
- A project roadshow will focus on early-stage projects with cutting-edge technology and high growth potential, facilitating connections for technology cooperation, application scenarios, and industry funds [9]

Exhibition Highlights
- The event will showcase leading products from the intelligent robot industry, including humanoid robots, legged robots, and various core components, highlighting the sector's technological strength [11]

Interactive Experience
- Attendees will be able to engage in immersive, hands-on experiences with various robotics projects, deepening their understanding of the industry's potential and fostering connections with industry leaders [13]
A "3A" Blockbuster! Alibaba's ROLL Team Drives Full-Stack Collaborative Optimization of RL4LLM, from Infrastructure to Algorithms to Mechanisms
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses the launch of the "3A" collaborative optimization framework by Alibaba's ROLL team, which comprises an Async Architecture, Asymmetric PPO, and an Attention Mechanism, aimed at enhancing reinforcement learning for large language models (RL4LLM) [1][2][5]

Group 1: Async Architecture
- ROLL Flash is introduced as a high-performance RL training system that uses asynchronous design to maximize resource utilization and accelerate large-scale RL training [5][11]
- The core principle of ROLL Flash is decoupling: fine-grained parallelism and sampling-training decoupling enable fully pipelined execution of generation, environment interaction, reward calculation, and model training [12][13]
- ROLL Flash demonstrates significant performance improvements across mainstream RL tasks, achieving near-linear scalability at hundred-GPU scale [16][25]

Group 2: Asymmetric PPO
- Asymmetric Proximal Policy Optimization (AsyPPO) is introduced as a lightweight PPO variant, showing that critic size does not necessarily correlate with value-estimation capability [45][48]
- The research indicates that just two small critics suffice to match or even surpass the value-estimation performance of a large critic, reducing the need for expensive compute [51][53]
- AsyPPO introduces two key innovations, diversified micro-critic aggregation and uncertainty-aware policy-loss reconstruction, which improve training stability and efficiency (a toy sketch of both ideas follows after this summary) [55][58]

Group 3: Attention Mechanism
- The article reframes the role of attention in language models, suggesting it serves as a structured blueprint that reveals the internal logic of model reasoning [2][64]
- By analyzing attention dynamics, the framework aligns the optimization objective with the model's inherent reasoning rhythm, improving training efficiency and interpretability [67][68]
- The research proposes a refined credit-assignment strategy based on attention signals, making reinforcement learning more effective by concentrating on critical reasoning steps [82][86]
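The two AsyPPO ideas named above can be illustrated in a few lines. This is a toy sketch, not the ROLL/AsyPPO source: it averages two small critics for the value estimate and treats their disagreement as an uncertainty signal that down-weights the clipped PPO loss; the thresholds and shapes are assumptions.

```python
# Illustrative sketch of micro-critic aggregation and uncertainty-aware policy loss.
import torch

def aggregate_critics(values_a: torch.Tensor, values_b: torch.Tensor):
    """values_*: per-token value predictions from two lightweight critics, shape (T,)."""
    value = (values_a + values_b) / 2          # aggregated value estimate
    uncertainty = (values_a - values_b).abs()  # disagreement as an uncertainty proxy
    return value, uncertainty

def uncertainty_masked_policy_loss(ratio, advantage, uncertainty,
                                   clip_eps=0.2, unc_threshold=0.5):
    """Standard clipped PPO objective, but tokens whose critics disagree strongly
    are masked out rather than trusted blindly."""
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_token = -torch.min(ratio * advantage, clipped * advantage)
    mask = (uncertainty < unc_threshold).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

if __name__ == "__main__":
    T = 8
    va, vb = torch.randn(T), torch.randn(T)
    value, unc = aggregate_critics(va, vb)
    ratio = torch.exp(torch.randn(T) * 0.1)   # new/old policy probability ratio
    adv = torch.randn(T)                      # advantages computed from `value`
    print(uncertainty_masked_policy_loss(ratio, adv, unc))
```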
NeurIPS 2025 Spotlight | RobustMerge: A New Paradigm for Merging Parameter-Efficiently Fine-Tuned Multimodal Large Models
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses the challenge of efficiently merging multiple specialized models into one general model amid rapidly advancing AI technology, highlighting "direction robustness" as a key factor behind the failure of parameter-efficient fine-tuning (PEFT) module merging [2][7][10]
- A new solution called RobustMerge is proposed, offering a simple and efficient method for model merging with no extra cost, and holding significant potential for developers and researchers working on multimodal large models [2][8]

Problem Definition
- The rise of multimodal large models has increased computational demands, making full fine-tuning (FFT) costly and impractical for many users; as a result, parameter-efficient fine-tuning (PEFT), particularly LoRA, has become mainstream, allowing quick adaptation to downstream tasks by updating only a small fraction of model parameters [7][8]
- Traditional routes to a multi-skill model, such as multi-task learning, face challenges in training cost and data availability, motivating model merging as a more efficient alternative [8][10]

Key Contributions
- RobustMerge addresses the shortcomings of existing PEFT merging methods by identifying the core issue as direction instability rather than parameter sign conflicts, paving the way for a new paradigm of LoRA merging [10][41]
- The method uses a two-stage merging strategy, pruning with complementary scaling followed by cross-task normalization, to keep low-rank directions stable during merging (a rough sketch of this idea follows after this summary) [16][19][23]

Experimental Design and Results
- RobustMerge was tested on multiple benchmarks, including a newly created benchmark called MM-MergeBench that evaluates performance on both seen and unseen tasks, demonstrating significant gains in multi-task performance and generalization [28][31]
- The results show that RobustMerge outperforms traditional methods, with an average accuracy increase of 3.4% on seen tasks and 4.5% on unseen tasks, demonstrating its effectiveness at reducing task interference [31][32]

Practical Applications
- RobustMerge can be applied in scenarios such as rapid deployment of multi-task models, federated learning, and model editing or style transfer, making it a valuable tool for enterprises building complex AI applications efficiently [44][45]
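The two-stage strategy described above can be sketched concretely, with the caveat that this is only one plausible reading of "pruning with complementary scaling plus cross-task normalization," not the paper's code: small singular directions of each task's LoRA delta are pruned, the survivors are rescaled to preserve the update magnitude, and the per-task updates are norm-balanced before averaging.

```python
# Rough sketch of RobustMerge-style LoRA delta merging (illustrative assumptions throughout).
import numpy as np

def prune_and_rescale(delta: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the strongest singular directions of a LoRA update and rescale them
    so the overall update magnitude (sum of singular values) is preserved."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    k = max(1, int(len(S) * keep_ratio))
    kept = S[:k]
    scale = S.sum() / kept.sum()               # complementary scaling of survivors
    return (U[:, :k] * (kept * scale)) @ Vt[:k, :]

def merge_lora_deltas(deltas: list[np.ndarray]) -> np.ndarray:
    """Normalize each task's pruned update to a common norm, then average,
    so no single task dominates the merged direction."""
    cleaned = [prune_and_rescale(d) for d in deltas]
    mean_norm = np.mean([np.linalg.norm(c) for c in cleaned])
    normalized = [c / (np.linalg.norm(c) + 1e-8) * mean_norm for c in cleaned]
    return np.mean(normalized, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    task_deltas = [rng.normal(size=(64, 64)) * s for s in (0.1, 1.0, 5.0)]
    merged = merge_lora_deltas(task_deltas)
    print(merged.shape, np.linalg.norm(merged))
```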
Converging with DeepSeek-OCR: A NeurIPS Paper Proposes Letting LLMs Read Long Texts the Way Humans Do
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses a framework called VIST (Vision-centric Token Compression in LLM), developed by research teams from Nanjing University of Science and Technology, Central South University, and Nanjing Forestry University, which aims to improve long-text reasoning in large language models (LLMs) through a visual approach [2][5]

Research Background
- LLMs have shown remarkable ability to understand and generate short texts, but they struggle with long documents, complex question answering, and retrieval-augmented generation as context length and model size grow [4]
- Token compression has become essential as input scales grow, since even the most powerful LLMs cannot efficiently analyze such large volumes of information [4]

VIST Framework
- VIST tackles long-text processing by letting the model read more like a human, via a "slow-fast reading circuit" that mimics human reading strategies [7][8]
- The framework consists of two pathways [8][15]:
  1. Fast path: distant, less important context is rendered as images, from which a lightweight visual encoder quickly extracts the gist.
  2. Slow path: key nearby text is fed directly into the LLM for deep reasoning and language generation.

Visual Compression Mechanism
- VIST uses a visual compression mechanism that lets models process long texts efficiently by focusing on salient information and ignoring redundant words [22][23]
- The Probability-informed Visual Enhancement (PVE) mechanism teaches models to "skim": high-frequency, low-information words are masked while low-frequency, high-information words are retained (a toy sketch of this idea follows after this summary) [22][23]

Performance Metrics
- VIST shows clear advantages over direct text encoding, requiring 56% fewer visual tokens than conventional text tokens and using 50% less memory [10][25]
- Across a range of tasks VIST outperforms the CEPE method, demonstrating reliable long-text processing even under extreme conditions [25]

Visual Text Tokenization
- VIST relies on lightweight visual encoders for efficient context compression, simplifying tokenization and removing the need for complex preprocessing steps [28]
- Because the visual encoder handles multiple languages without being constrained by a vocabulary, it significantly reduces compute and memory overhead [29]

Future Implications
- Vision-driven token compression is expected to become a standard component of LLMs for long-context understanding, paving the way for multimodal comprehension [32][33]
- This "look before reading" strategy helps large models retain understanding capability while significantly lowering computational cost [33]
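The frequency-based "skim" idea attributed to PVE above is easy to mock up. The sketch below masks the most frequent (least informative) words in a context window while keeping rarer ones; the corpus counts, quantile threshold, and whitespace tokenization are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of PVE-style skim masking: frequent words are masked, rare words kept.
from collections import Counter

def skim_mask(words: list[str], corpus_counts: Counter, keep_quantile: float = 0.6) -> list[str]:
    """Replace the most frequent (lowest-information) words with a mask token."""
    freqs = sorted(corpus_counts[w] for w in words)
    cutoff = freqs[int(len(freqs) * keep_quantile)]  # frequency threshold from the quantile
    return [w if corpus_counts[w] <= cutoff else "<mask>" for w in words]

if __name__ == "__main__":
    corpus = "the of a to and model token visual compression encoder".split() * 3
    counts = Counter(corpus + ["the"] * 50 + ["of"] * 40 + ["a"] * 30)
    context = "the encoder renders the distant context of a document as a compact image".split()
    # Function words ("the", "of", "a") get masked; content words survive.
    print(" ".join(skim_mask(context, counts)))
```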
MeshCoder: An LLM-Driven Breakthrough from Point Clouds to Editable, Structured Object Code
机器之心· 2025-11-10 03:53
Core Insights
- The article discusses the evolution of 3D generative AI, highlighting the transition from rudimentary models to more sophisticated systems capable of creating structured, editable virtual worlds [2][3]
- The introduction of MeshCoder represents a significant advance in 3D procedural generation, translating 3D inputs into structured, executable code [3][4]

Group 1: MeshCoder Features
- MeshCoder generates "living" programs rather than static models: it understands semantic structure and decomposes objects into independent components before generating code for each [4]
- It constructs high-quality quad meshes, which are essential for subsequent editing and material application [5][7]
- The generated Python code is highly readable, allowing users to edit 3D models simply by changing parameters [9]
- Users can control mesh density through code adjustments, balancing detail against performance [12]

Group 2: Implementation and Training
- Building MeshCoder involved creating a large dataset of parts and training a part-level code inference model that understands basic geometric primitives [19][21]
- A custom Blender Python API was developed to support complex modeling operations, enabling intricate geometry to be created with simple code [20]
- A million-scale "object-code" paired dataset was constructed to train the final object-level code inference model, which can understand and assemble complex objects [25][28]

Group 3: Performance and Comparison
- MeshCoder outperforms existing methods in high-fidelity reconstruction, achieving markedly lower Chamfer distance and higher Intersection over Union (IoU) across object categories (a sketch of these metrics follows after this summary) [32][33]
- The model reconstructs complex structures more accurately, preserving clear boundaries and independent components [32]

Group 4: Code-Based Editing and Understanding
- MeshCoder enables code-based editing: simple code modifications change the geometry and topology of 3D models [36][39]
- The generated code also serves as a semantic description of the shape, improving 3D shape understanding when analyzed by large language models such as GPT-4 [41][44]

Group 5: Limitations and Future Directions
- While MeshCoder shows great potential, the diversity and size of the training dataset remain limiting factors for the model's generalization [46]
- Future work will focus on collecting more diverse data to improve the model's robustness and adaptability [46]
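Since the reconstruction comparison above is stated in terms of Chamfer distance and IoU, a small sketch of those metrics may help. This illustrates the metrics themselves, not the paper's exact evaluation pipeline: sampling density, normalization to a unit cube, and the voxel resolution below are assumptions.

```python
# Symmetric Chamfer distance between point clouds and voxel-occupancy IoU.
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets p (N,3) and q (M,3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def voxel_iou(p: np.ndarray, q: np.ndarray, resolution: int = 32) -> float:
    """Voxelize both clouds (assumed normalized to the unit cube) and compute occupancy IoU."""
    def voxelize(pts):
        idx = np.clip((pts * resolution).astype(int), 0, resolution - 1)
        grid = np.zeros((resolution,) * 3, dtype=bool)
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
        return grid
    a, b = voxelize(p), voxelize(q)
    return (a & b).sum() / max((a | b).sum(), 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((1024, 3))
    pred = gt + rng.normal(scale=0.01, size=gt.shape)  # a slightly noisy "reconstruction"
    print(chamfer_distance(gt, pred), voxel_iou(gt, pred))
```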
Saining Xie, Fei-Fei Li, and LeCun Jointly Propose a New Multimodal LLM Paradigm: "Spatial Supersensing" Takes the Stage
机器之心· 2025-11-10 03:53
Core Insights
- The article discusses the new research result named "Cambrian-S," a significant step toward exploring supersensing in video space [1][4]
- It builds on the earlier work "Cambrian-1," which focused on strengthening AI's visual representation learning [2]

Group 1: Definition and Importance of Supersensing
- Supersensing is defined as how a digital being truly experiences the world: absorbing an endless input stream and continuously learning from it [4][5]
- The research argues that before developing "superintelligence," it is crucial to establish "supersensing" capabilities [4]

Group 2: Development Path of Multimodal Intelligence
- The team outlines a development path for multimodal intelligence, identifying video as the ultimate medium of human experience and a direct projection of real life [6]
- They divide the evolution of multimodal intelligence into several stages, from language-only understanding to predictive world modeling [9]

Group 3: Benchmarking Supersensing
- A two-part study establishes benchmarks for measuring supersensing, revealing that existing benchmarks mostly test language understanding and semantic perception while neglecting higher-level spatial and temporal reasoning [14][25]
- The team introduces a new benchmark, VSI-Super, designed specifically to probe spatial intelligence in continuous, long-horizon scenarios [15][26]

Group 4: Challenges in Current Models
- Current models, including Gemini-2.5-Flash, struggle with tasks requiring genuine spatial cognition and long-term memory, indicating a fundamental gap in the current paradigm [35][38]
- Even advanced models perform notably poorly on VSI-Super, underscoring the difficulty of integrating continuous sensory experience [35][36]

Group 5: Predictive Sensing as a New Paradigm
- The researchers propose predictive sensing as the way forward: models learn to predict their sensory inputs and build internal world models to handle unbounded visual streams [42][43]
- The approach is inspired by theories of human cognition, emphasizing selective retention of sensory input and the ability to anticipate incoming stimuli [42][44]

Group 6: Case Studies and Results
- Case studies show that surprise-driven event segmentation improves performance on the VSI-Super benchmark (a toy sketch of the idea follows after this summary) [49][53]
- The surprise-driven method outperforms existing models and shows better generalization [55][57]
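Surprise-driven event segmentation, as described above, can be reduced to a toy demonstration: a predictor guesses the next frame embedding, and a large prediction error ("surprise") marks an event boundary where memory could be consolidated. The predictor, threshold, and synthetic embeddings below are illustrative assumptions, not the Cambrian-S implementation.

```python
# Toy surprise-driven segmentation over a stream of frame embeddings.
import numpy as np

def segment_by_surprise(frame_embeds: np.ndarray, threshold: float = 1.0) -> list[int]:
    """Return frame indices where surprise (prediction error) exceeds the threshold.
    The 'predictor' here is the crudest possible one: expect the next frame
    to resemble the current one."""
    boundaries = []
    for t in range(1, len(frame_embeds)):
        surprise = np.linalg.norm(frame_embeds[t] - frame_embeds[t - 1])
        if surprise > threshold:
            boundaries.append(t)
    return boundaries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three synthetic "scenes": embeddings hover around different centers.
    scenes = [rng.normal(loc=c, scale=0.05, size=(30, 16)) for c in (0.0, 2.0, -1.5)]
    video = np.concatenate(scenes)
    print(segment_by_surprise(video, threshold=1.0))  # boundaries expected near frames 30 and 60
```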
HuggingFace Releases a 200+ Page "Hands-On Guide" That Walks You Through Training Large Models, from Decision-Making to Deployment
机器之心· 2025-11-09 11:48
Core Insights
- HuggingFace recently published an extensive technical blog detailing the end-to-end experience of training advanced LLMs, emphasizing the "chaotic reality" of LLM development [1][4]
- The blog offers in-depth technical detail, code snippets, and debugging tips, making it a valuable resource for readers who want to build LLMs [5]

Group 1: Training Considerations
- A critical first question is whether one truly needs to train a model from scratch, given the availability of world-class open-source models [9]
- The article lists common misconceptions that lead teams to train models, such as having idle compute or following trends without a clear purpose [11]
- A flowchart helps determine whether training a custom model is warranted, suggesting it should be considered only when existing models and fine-tuning cannot meet specific needs [12][14]

Group 2: Custom Pre-training Scenarios
- Custom pre-training makes sense in three main settings: research, production, and strategic open-source releases [15]
- The goals of each setting dictate training decisions such as model size and architecture [17]
- The decision-making process revolves around planning followed by validation through systematic experiments [18]

Group 3: Team Composition and Experimentation
- Successful LLM training teams typically start small, with 2-3 members, relying on sufficient compute and rapid iteration [19]
- The blog stresses empirical experimentation, particularly ablation studies, as the basis for model decisions [21][30]
- A complete process for setting up ablation experiments is outlined, recommending starting from a proven architecture (a small ablation-driver sketch follows after this summary) [22]

Group 4: Framework Selection and Data Management
- Choosing the right training framework is crucial, balancing functionality, stability, and throughput [24]
- The article compares several mainstream frameworks and highlights the importance of high-quality data management for effective training [25]
- Data curation is described as an art: the quality and mix of data strongly influence model performance [41][42]

Group 5: Model Architecture and Tokenization
- The blog surveys model architectures, including dense, MoE (Mixture of Experts), and hybrid designs; SmolLM3 uses a dense architecture due to memory constraints [36][37]
- Tokenization is highlighted as a critical factor, with vocabulary size and algorithm choice affecting model performance [38]
- Hyperparameters must be selected carefully and tailored to the specific architecture and dataset [39]

Group 6: Training Process and Infrastructure
- The training process is likened to a marathon, requiring thorough preparation and the ability to handle unexpected failures [51]
- Infrastructure is emphasized as a critical, often overlooked component, with detailed guidance on GPU selection and monitoring [63][66]
- The blog details the GPU requirements for training SmolLM3, illustrating the trade-off between training time, cost, and efficiency [70]

Group 7: Post-training and Evaluation
- The post-training phase is crucial for refining the model's capabilities, with specific goals laid out for SmolLM3 [55][58]
- The article discusses choosing appropriate frameworks and tools for post-training, including supervised fine-tuning and reinforcement learning [60]
- Evaluation metrics and continuous monitoring are essential for assessing model performance and confirming improvements [64]
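The ablation workflow mentioned above can be illustrated with a tiny driver script: hold a proven baseline configuration fixed, change exactly one variable per run on a small proxy budget, and compare a common metric. The config fields and the placeholder `train_and_eval` function are assumptions for illustration, not code from the Hugging Face blog.

```python
# Minimal ablation-run driver: one change per variant, compared against a fixed baseline.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AblationConfig:
    tokenizer: str = "baseline-bpe-64k"
    data_mix: str = "web:0.7,code:0.2,math:0.1"
    rope_theta: float = 10_000.0
    tokens_budget: int = 10_000_000_000  # small proxy budget, not the full run

def train_and_eval(cfg: AblationConfig) -> float:
    """Placeholder: in practice, launch a short proxy training run and return eval loss.
    Here the config is hashed to a stable fake number so the script runs end to end."""
    return abs(hash(cfg)) % 1000 / 1000.0

def run_ablations(baseline: AblationConfig, variants: dict[str, dict]) -> None:
    results = {"baseline": train_and_eval(baseline)}
    for name, overrides in variants.items():
        results[name] = train_and_eval(replace(baseline, **overrides))  # one change at a time
    for name, loss in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name:24s} eval_loss={loss:.4f}")

if __name__ == "__main__":
    run_ablations(
        AblationConfig(),
        {
            "larger_vocab": {"tokenizer": "baseline-bpe-128k"},
            "more_code": {"data_mix": "web:0.6,code:0.3,math:0.1"},
            "rope_1m": {"rope_theta": 1_000_000.0},
        },
    )
```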
Second Globally, First in China! A First-Hand Test of ERNIE 5.0 Preview, the Strongest Text Model
机器之心· 2025-11-09 11:48
Core Viewpoint
- Baidu's ERNIE-5.0-Preview-1022 model has reached a significant milestone, ranking second globally and first in China on the latest LMArena Text Arena leaderboard with a score of 1432, on par with leading models from OpenAI and Anthropic [2][4][43]

Model Performance
- ERNIE 5.0 Preview excels in creative writing, understanding of complex long questions, and instruction following, outperforming many mainstream models including GPT-5-High [5][41]
- In creative writing it ranks first, indicating a substantial improvement in content generation speed and quality [5][41]
- In understanding complex long questions it ranks second, showcasing its capability in academic Q&A and knowledge reasoning [5][41]
- In instruction following it ranks third, strengthening its applicability to smart-assistant and business-automation scenarios [5][41]

Competitive Landscape
- The LMArena platform, created by researchers from UC Berkeley, collects real user preference votes, providing a dynamic ranking mechanism that reflects real-world performance [4][5]
- Baidu's model is positioned in the first tier of global general-purpose models, reinforcing its competitive standing in the AI landscape [4][41]

Technological Infrastructure
- Baidu's results are supported by a full "chip-framework-model-application" stack, including the PaddlePaddle deep learning platform and self-developed Kunlun chips for AI model training and inference [41][42]
- The PaddlePaddle framework has been updated to version 3.2, improving model performance through optimizations in distributed training and hardware communication [41][42]

Industry Implications
- The advances in ERNIE 5.0 Preview reflect a broader transition of China's AI technology from "technological catch-up" to "capability leadership" [43][44]
- Baidu aims to leverage its model capabilities across applications such as content generation, search, and office automation to drive industry adoption [42][43]