No more "guessing coordinates"! Yan Shuicheng's team and collaborators release PaDT, a multimodal large model achieving true multimodal representation output
机器之心· 2025-10-16 00:51
Core Insights
- The article discusses advances in Multimodal Large Language Models (MLLMs) and introduces a new paradigm, Patch-as-Decodable Token (PaDT), to address the limitations of existing models on tasks that require fine-grained spatial understanding [2][6].

Group 1: PaDT Overview
- PaDT proposes a new approach: it divides images into multiple visual patches and allows the model to generate corresponding Visual Reference Tokens (VRTs) directly [3].
- It enables seamless alternation between text tokens and visual tokens at both the input and output stages, making the model's description of image content as natural as describing text [4].
- The model can point to image targets directly within generated sentences rather than guessing coordinates [5].

Group 2: Limitations of Traditional MLLMs
- Traditional MLLMs output detection-box coordinates as strings, leading to inconsistent formats, semantic disconnection, and weak image-text association [8].
- Because the output format can vary, targets are hard to parse, and numbers can be split into separate tokens, breaking spatial continuity [8].
- Coordinate tokens lack inherent semantic meaning, which contributes to hallucination and repetition in generated outputs [8].

Group 3: PaDT Mechanism
- PaDT derives VRTs from the visual patch embeddings of the input image, creating a dynamic embedding table that integrates text and visual information [11].
- This design avoids the pitfalls of methods that depend on a global visual codebook, which can confuse similar objects and generate non-existent patches [13].
- A lightweight PaDT Decoder, consisting of three bidirectional attention blocks, transforms VRTs into structured visual outputs such as bounding boxes and segmentation masks [15].

Group 4: Performance Metrics
- PaDT Pro (3B) achieved an average accuracy of 93.6 on the RefCOCO/+/g referring expression comprehension task, surpassing the 78B InternVL3 model's 91.4 [21][22].
- On COCO open-vocabulary detection, where traditional MLLMs typically score below 20 mAP, PaDT Pro (3B) reached 38.2 mAP, nearly doubling performance [21][24].
- The model also performed strongly on Referring Image Captioning (RIC), raising the CIDEr-D score from 0.386 to 1.450 [24].

Group 5: Implications and Future Directions
- PaDT's success stems from its diagnosis of the visual-capability bottleneck in MLLMs, enabling native alignment between visual patches and generated tokens [31].
- The dynamic embedding mechanism binds VRTs tightly to the current image, preventing cross-image confusion [31].
- The model shows robust multitasking ability, outperforming single-task models while switching tasks simply by changing the prompt [33].
- PaDT marks a significant step toward true multimodal intelligence, enabling more natural interaction between modalities [35].
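The mechanism described above, a per-image dynamic vocabulary of VRTs that decode back into boxes, can be illustrated with a toy sketch. All names, the 4x4 patch grid, and the box decoding below are hypothetical simplifications based only on this summary, not the paper's implementation.

```python
# Toy illustration of PaDT's Visual Reference Tokens (VRTs) as summarized
# above. All names, the 4x4 patch grid, and the box decoding are
# hypothetical simplifications, not the paper's implementation.

TEXT_VOCAB = {0: "<bos>", 1: "the", 2: "dog", 3: "is", 4: "here", 5: "<eos>"}
GRID = 4  # toy 4x4 patch grid over the input image

def build_dynamic_vocab(num_patches):
    """Extend the text vocabulary with one VRT per patch of the *current*
    image, so visual tokens are bound to this image only (no global codebook)."""
    vocab = dict(TEXT_VOCAB)
    base = max(vocab) + 1
    for p in range(num_patches):
        vocab[base + p] = f"<VRT_{p}>"  # refers to patch p of this image
    return vocab

def vrt_to_box(vrt_id, vocab):
    """Toy stand-in for the lightweight PaDT Decoder: map a VRT back to the
    normalized bounding box (x0, y0, x1, y1) of its patch cell."""
    patch = int(vocab[vrt_id].strip("<>").split("_")[1])
    row, col = divmod(patch, GRID)
    return (col / GRID, row / GRID, (col + 1) / GRID, (row + 1) / GRID)

vocab = build_dynamic_vocab(GRID * GRID)
# A mixed output sequence: text tokens interleaved with a visual token.
sequence = [1, 2, 3, 4, 6]  # "the dog is here <VRT_0>"
boxes = [vrt_to_box(t, vocab) for t in sequence if vocab[t].startswith("<VRT")]
print(boxes)  # [(0.0, 0.0, 0.25, 0.25)] -> VRT_0 points at the top-left cell
```

Because the VRT entries are rebuilt from the current image's patches, a token like `<VRT_0>` has no meaning outside that image, which is the property the summary credits with preventing cross-image confusion.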
The first multi-round LLM Router arrives: Router-R1 teaches large models to "think-route-aggregate"
机器之心· 2025-10-15 10:44
Core Insights
- The article introduces Router-R1, a novel multi-round LLM Router framework that enables a large language model to not only answer questions but also think, schedule, and coordinate other models, balancing performance against cost [3][26].

Group 1: Background and Motivation
- The rapid growth of LLMs has produced over a hundred models, each with distinct strengths such as logical reasoning or knowledge retrieval [6].
- Current AI applications mostly rely on single-model inference, which can be inefficient or inaccurate depending on the complexity of the question [6][8].

Group 2: Router-R1 Framework
- Router-R1 turns the router itself into a reasoning-capable policy LLM that follows a "think-select-aggregate" process, enabling multi-round routing iterations [8][26].
- The framework uses reinforcement learning to optimize the performance-cost trade-off, formalizing multi-round routing as a sequential decision-making problem [10][26].

Group 3: Reward Mechanisms
- Router-R1 employs three types of reward functions:
  - Format Reward ensures the output adheres to specified format constraints [10].
  - Final Outcome Reward measures the correctness of the generated answer against a reference [11].
  - Cost Reward introduces a cost-constraint mechanism that accounts for model parameter size and output token count [15][16].

Group 4: Performance Evaluation
- The team evaluated Router-R1 on seven QA benchmarks, showing superior performance on both single-hop and multi-hop reasoning tasks [19].
- When performance was prioritized over cost, Router-R1 achieved the highest accuracy across all datasets, outperforming existing models [21].

Group 5: Implications and Future Trends
- Router-R1 points toward a new paradigm of collaborative multi-model systems that dynamically balance performance and cost while maintaining high-quality outputs [26].
- The adoption of LLM Router mechanisms in recent models such as GPT-5 suggests multi-model collaboration is becoming foundational infrastructure in the LLM ecosystem [26].
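The three reward signals above might be combined along these lines. This is a hedged sketch based only on the summary; the scaffold tags, weights, and toy cost model are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of combining Router-R1's three reward signals, inferred
# from the summary; tags, weights, and the cost model are assumptions.
import re

def format_reward(output):
    """1.0 if the output follows the expected <think>...<answer> scaffold."""
    pattern = r"<think>.*</think>.*<answer>.*</answer>"
    return 1.0 if re.search(pattern, output, re.S) else 0.0

def outcome_reward(answer, gold):
    """Exact-match correctness of the final answer against the reference."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def cost_reward(param_count_b, output_tokens):
    """Higher reward for cheaper routes: penalize large models and long
    outputs (toy linear cost model, clamped to [0, 1])."""
    cost = param_count_b * output_tokens / 100_000.0
    return max(0.0, 1.0 - cost)

def total_reward(output, answer, gold, param_count_b, output_tokens,
                 w_outcome=1.0, w_cost=0.3):
    # Format acts as a gate: malformed outputs earn nothing.
    if format_reward(output) == 0.0:
        return 0.0
    return (w_outcome * outcome_reward(answer, gold)
            + w_cost * cost_reward(param_count_b, output_tokens))

out = "<think>route to a small model</think><answer>Paris</answer>"
print(total_reward(out, "Paris", "paris", param_count_b=7, output_tokens=100))
```

Treating the format reward as a hard gate (rather than an additive term) is one common design choice for RL over structured outputs; the actual weighting in Router-R1 may differ.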
Embodied intelligence meets its ImageNet moment: RoboChallenge opens the first large-scale real-robot benchmark suite
机器之心· 2025-10-15 10:44
Core Insights
- RoboChallenge is the world's first large-scale, multi-task benchmark platform for robots operating in real physical environments, aimed at providing reliable, comparable evaluation standards for vision-language-action (VLA) models [1][4][7].
- The platform addresses the lack of unified, open, and reproducible benchmarking in robotics, enabling researchers to validate and compare algorithms in a standardized environment [4][7].

Group 1: Platform Features
- RoboChallenge integrates multiple mainstream robots (UR5, Franka Panda, Aloha, ARX-5) for remote evaluation, providing a large-scale, standardized, and reproducible testing environment [7][14].
- A standardized API lets users run tests without submitting Docker images or model files, improving accessibility [19].
- A dual asynchronous control mechanism precisely synchronizes action commands with image acquisition, improving testing efficiency [19].

Group 2: Evaluation Methodology
- The benchmarking methodology focuses on controlling human factors, ensuring visual consistency, validating model robustness, and designing protocols for different evaluation objectives [16].
- RoboChallenge introduces a "visual inputs reproduction" method to ensure consistent initial states for each test, improving the reliability of evaluations [16].
- The Table30 benchmark suite comprises 30 carefully designed everyday tasks, substantially more than typical industry evaluations, giving a reliable measure of algorithm performance across scenarios [18][23].

Group 3: Community Engagement
- RoboChallenge operates on a fully open principle, offering free evaluation services to researchers worldwide and ensuring transparency by publicly sharing task demonstration data and intermediate results [27].
- The platform encourages community collaboration through challenges, workshops, and data sharing, promoting joint efforts on core problems in embodied intelligence [27].

Group 4: Future Directions
- RoboChallenge plans to incorporate mobile robots and dexterous manipulators, expanding cross-scenario task testing [29].
- Future evaluations will extend beyond visual-action coordination to multimodal perception and human-robot collaboration, with more challenging benchmarks planned [29].
ICCV 2025 | FDAM: farewell to blurry vision, a plug-and-play method from circuit theory restores high-definition detail to Vision Transformers
机器之心· 2025-10-15 07:33
Core Insights
- The article introduces a Frequency Dynamic Attention Modulation (FDAM) module to address the loss of detail in deep networks caused by the inherent low-pass-filter behavior of Vision Transformers (ViTs) [2][5][8].
- FDAM improves performance on dense prediction tasks such as segmentation and detection without a significant increase in computational cost, achieving state-of-the-art results [2][22].

Research Background
- Vision Transformers have become prominent in computer vision thanks to their global modeling capability, but deeper models progressively lose the high-frequency details essential for tasks like segmentation and detection [5][8].
- The self-attention mechanism in ViTs acts as a low-pass filter, progressively attenuating high-frequency detail and leading to representation collapse [5][10].

Limitations of Existing Methods
- Previous attempts to mitigate over-smoothing in ViTs, such as regularization and static compensation of high-frequency signals, fall short because they do not change the fundamentally low-pass nature of the attention mechanism [10][9].

Core Idea of FDAM
- Inspired by circuit theory, FDAM redesigns the attention mechanism to include both a low-pass and a high-pass path, allowing dynamic attention to high-frequency details [11][12][16].
- A lightweight dynamic mixer lets the model adaptively emphasize either low-frequency structure or high-frequency detail depending on the input image [16][21].

Key Components of the Method
- FDAM has two main components: Attention Inversion (AttInv) for coarse tuning between high and low frequencies, and Frequency Dynamic Scaling (FreqScale) for fine-tuning specific frequency bands [21][20].
- FreqScale lets the model learn dynamic gain weights for different frequency bands, amplifying or suppressing signals as a task requires [20][21].

Experimental Results
- FDAM is plug-and-play: it integrates into various ViT architectures with minimal additional parameters and computational overhead while significantly improving performance [22][23].
- Quantitatively, FDAM improves mIoU by +2.4 for SegFormer-B0 on ADE20K and by +0.8 for DeiT3-Base, reaching a state-of-the-art 52.6% [23][22].
- On COCO object detection and instance segmentation, FDAM improves detection AP by +1.6 and segmentation AP by +1.4 [23][22].

Theoretical Support
- FDAM resists representation collapse, maintaining a higher effective rank in deeper layers than baseline models, indicating better feature diversity [26][22].

Implications of the Work
- The work applies classical circuit theory to modern Transformer design, addressing fundamental issues such as information decay in deep networks [29][30].
- FDAM resolves a core pain point of ViTs on dense prediction tasks, unlocking the models' potential [30][32].
- As a lightweight, plug-and-play module, FDAM has broad application potential in both industry and academia [31][32].

Future Directions
- FDAM opens avenues for future research, such as designing new network structures that operate dynamically in the frequency domain and extending the approach to video, 3D point clouds, and multimodal data [34].
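The AttInv idea above, pairing the low-pass attention path with a high-pass complement, can be shown numerically: if attention matrix A averages (low-pass), then (I - A) is exactly the complementary high-pass filter. This is a toy sketch of the principle as described in the summary, not the paper's code.

```python
# Toy numerical sketch of FDAM's AttInv idea as summarized above: if
# self-attention A is a low-pass (averaging) filter, then (I - A) is a
# complementary high-pass path, and a mixing weight blends the two.
# This is an illustrative simplification, not the paper's implementation.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def att_inv(A, x, mix=0.5):
    """Blend the low-pass output A @ x with its high-pass complement
    (I - A) @ x; `mix` plays the role of FDAM's dynamic mixer weight."""
    low = matvec(A, x)
    high = [xi - li for xi, li in zip(x, low)]  # (I - A) @ x
    return [(1 - mix) * l + mix * h for l, h in zip(low, high)]

# Uniform row-stochastic attention = pure averaging, i.e. a low-pass filter.
A = [[0.25] * 4 for _ in range(4)]
x = [1.0, -1.0, 1.0, -1.0]        # a purely high-frequency signal
print(matvec(A, x))               # the low-pass path erases it: [0.0, 0.0, 0.0, 0.0]
print(att_inv(A, x, mix=1.0))     # the high-pass path keeps it: [1.0, -1.0, 1.0, -1.0]
```

In FDAM the mixing weight is produced dynamically per input rather than fixed, which is what allows the model to preserve high-frequency signals only when the image calls for it.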
Sign up | IROS 2025 toast time! Share a drink with your favorite luminaries in the field!
机器之心· 2025-10-15 07:33
Core Insights
- The article previews the IROS 2025 conference in Hangzhou, a major event in robotics that brings together top scholars and spans topics from theoretical research to practical applications [2].
- A special closed-door event, the TalentAI50 Meetup, will be held during the conference for young talents shaping the future of robotics and AI [3][13].

Event Details
- The TalentAI50 Meetup will feature prominent young scholars from leading universities, including the University of Hong Kong, Shanghai Jiao Tong University, Tsinghua University, and Zhejiang University, among others [5].
- The event is designed around informal discussion rather than traditional presentations, encouraging networking and collaboration among participants [7].
- Attendance is limited to 50 people, with registration open to authors of papers presented at IROS 2025 [7][9].

Schedule and Logistics
- The event runs October 22 from 18:00 to 21:00 at a venue near the Hangzhou International Expo Center [8].
- The agenda includes check-in, an opening talk, interactive experiences, and a dinner with free networking [9].
After Sutton's verdict that "LLMs are a dead end," a new interview reveals AI's dilemma
机器之心· 2025-10-15 07:33
Core Viewpoint
- The article examines Rich Sutton's critical view of large language models (LLMs): that they may conflict with the principles of his essay "The Bitter Lesson," and that they are limited in their ability to learn from real-world interaction [1][3][22].

Group 1: Limitations of LLMs
- Sutton argues that LLMs have a fundamental flaw: they cannot learn from ongoing interaction with the environment [3][21].
- He holds that true intelligence should emerge from continual reinforcement learning through dynamic interaction, rather than from extensive pre-training and supervised fine-tuning [3][4][22].
- Because LLMs rely on human knowledge and data, they may not scale as hoped and are fundamentally limited by the biases present in their training data [24][25][26].

Group 2: Alternative Perspectives on Intelligence
- Other participants in the discussion, including Suzanne Gildert and Niamh Gavin, are skeptical that pure reinforcement learning is achievable, noting that current systems often fall back on imitation learning because universal reward functions are hard to define [7][11].
- The conversation highlights the need for systems that can learn autonomously in new environments, much as a squirrel learns to hide nuts, rather than relying solely on pre-existing data [8][10].
- There is consensus that while LLMs show impressive capabilities, they do not amount to true intelligence, since they cannot effectively explore and learn from their environment [33][35].

Group 3: The Future of AI Development
- The article suggests the AI field is at a crossroads, where the dominance of certain paradigms may stifle innovation and create a cycle of self-limitation [28][29].
- Sutton warns that the current trajectory of LLMs, heavily reliant on imitating humans, may not deliver the breakthroughs needed for genuine understanding and reasoning [22][24].
- The discussion points toward more robust learning mechanisms that prioritize experience and exploration over mere data absorption [28][30].
Tsinghua & Giant Network pioneer an MoE multi-dialect TTS framework, with data, code, and methods fully open-sourced
机器之心· 2025-10-15 04:08
Core Insights
- The article discusses the role of dialects in preserving cultural diversity and the challenges facing dialect text-to-speech (TTS), which remains a "gray area" in the industry [2][4].
- DiaMoe-TTS is introduced as an open-source framework for dialect TTS that researchers and developers can use and build on [4][30].
- The framework is designed to be low-cost and low-barrier, enabling synthesis of many dialects without large amounts of data [31].

Summary by Sections
Introduction to DiaMoe-TTS
- DiaMoe-TTS is a collaboration between Giant Network AI Lab and Tsinghua University's SATLab, aimed at building a dialect TTS model that rivals industrial-grade systems [2][4].
- The framework adopts a unified IPA (International Phonetic Alphabet) representation to resolve inconsistencies in dialect modeling [13][27].

Technical Features
- A dialect-aware Mixture-of-Experts (MoE) architecture lets different expert networks focus on specific dialect features, better preserving dialect characteristics [15][16].
- A parameter-efficient fine-tuning (PEFT) strategy adapts the model to low-resource dialects with minimal changes to existing parameters [19][22].

Training Methodology
- Training proceeds in multiple stages, including IPA-transfer initialization and joint training across multiple dialects, improving performance and adaptability [21][23].
- Data augmentation techniques such as pitch and speed perturbation help the model generate natural-sounding dialect speech even with limited data [20][22].

Performance Results
- DiaMoe-TTS reports competitive metrics, approaching industrial-level results in Cantonese (reported WER of 76.59%) with Mean Opinion Score (MOS) improvements across various dialects [25][26][27].
- Its ability to support a wide range of dialects, including those with minimal data, opens new opportunities for dialect preservation and cultural transmission [25][30].

Future Prospects
- The team plans to expand the dialect and minority-language datasets, refine IPA alignment and data preprocessing, and explore more efficient low-resource modeling methods [33].
- The goal is to enable global participation in dialect and minority-language research so that these languages are not forgotten in the digital age [33].
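The dialect-aware MoE routing described above can be sketched as follows. The expert names, softmax gate, and top-k routing here are illustrative assumptions, not DiaMoe-TTS's actual architecture, which operates on IPA sequences.

```python
# Toy sketch of the dialect-aware MoE routing idea summarized above: a gate
# scores per-dialect experts and sends the input to the top-scoring ones.
# Expert names, the gate, and top-k routing are illustrative assumptions,
# not DiaMoe-TTS's actual architecture.
import math

EXPERTS = ["yue", "wu", "min", "mandarin"]  # hypothetical dialect experts

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(dialect_logits, top_k=1):
    """Return the top-k experts with their renormalized gate weights."""
    probs = softmax(dialect_logits)
    ranked = sorted(zip(EXPERTS, probs), key=lambda t: -t[1])[:top_k]
    z = sum(p for _, p in ranked)
    return [(name, p / z) for name, p in ranked]

# A gate strongly favoring the Cantonese ("yue") expert:
print(route([3.0, 0.1, 0.1, 0.5], top_k=1))  # [('yue', 1.0)]
```

Letting each expert specialize on one dialect family while sharing the rest of the network is what allows new low-resource dialects to be added with only small parameter-efficient updates.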
Can AI go on a "pilgrimage" to filming locations? VIR-Bench, a new evaluation benchmark for multimodal large models, arrives
机器之心· 2025-10-15 04:08
Core Insights
- The article describes VIR-Bench, a new benchmark for evaluating how well multimodal large models understand travel videos in terms of geographic locations and temporal sequence [4][20].
- The research emphasizes reconstructing travel itineraries from videos, which requires models to grasp both geographic and temporal relationships [20].

Group 1: VIR-Bench Overview
- VIR-Bench evaluates AI understanding of travel vlogs by having models generate a visiting-order graph representing the sequence of, and relationships among, visited locations [6][9].
- Nodes in the visiting-order graph represent locations at three levels: Prefecture, City, and Point of Interest (POI) [7][9].

Group 2: Task Design and Dataset
- The task splits into two sub-tasks: node prediction, where the model identifies all visited locations, and edge prediction, where it determines the relationships between them [10][11].
- The dataset comprises 200 travel videos covering 3,689 POIs across 43 prefectures in Japan, each video with detailed annotations [17][13].

Group 3: Experimental Results and Challenges
- Current models, particularly open-source ones, lag behind commercial models on POI node recognition and transition-edge prediction, with edge prediction notably harder [16][18].
- Performance improves markedly with larger model scale and geographic pre-training, underscoring the importance of both factors [16][18].

Group 4: Future Directions
- While current models struggle with long-range reasoning and temporal understanding, there are clear paths to improvement, such as strengthening spatial awareness and integrating multimodal information [20][18].
- The ultimate goal is for AI not only to analyze videos but also to act in the world, aligning with applications in robotics and autonomous systems [20][18].
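The node- and edge-prediction sub-tasks above could be scored along these lines, comparing a predicted visiting-order graph against the annotated one. The F1 metric and the example locations are illustrative assumptions, not the benchmark's official scoring code.

```python
# Illustrative scoring sketch for VIR-Bench's two sub-tasks as summarized
# above: node prediction (which POIs were visited) and edge prediction
# (the visiting order between them). Metric choice is an assumption.

def f1(predicted, gold):
    """Set-based F1 between predicted and gold nodes (or edges)."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth: three POIs visited in order.
gold_nodes = {"Tokyo Tower", "Sensoji", "Fushimi Inari"}
gold_edges = {("Tokyo Tower", "Sensoji"), ("Sensoji", "Fushimi Inari")}

# A model that missed the last stop (and hence the last transition).
pred_nodes = {"Tokyo Tower", "Sensoji"}
pred_edges = {("Tokyo Tower", "Sensoji")}

print(round(f1(pred_nodes, gold_nodes), 3))  # 0.8
print(round(f1(pred_edges, gold_edges), 3))  # 0.667
```

Note how a single missed node also removes its incident edge, which is one reason edge prediction scores trail node prediction, consistent with the finding that transition-edge prediction is the harder sub-task.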
500,000 yuan in incentives: applications open for the Tencent Qingyun Scholarship
机器之心· 2025-10-15 04:08
Core Viewpoint
- The Tencent Qingyun Scholarship was established to ease the long-standing shortage of computational resources facing academic researchers, particularly in AI, so they can focus on meaningful scientific exploration [1][6].

Group 1: Challenges in Academic Research
- Compared with industry and large tech companies, academic researchers are critically constrained by a lack of compute [1].
- A survey published in Nature found widespread frustration among scholars over limited computational resources for AI research, with 66% of respondents rating their satisfaction with available resources at 3 or lower on a 5-point scale [3].

Group 2: Industry Response
- Major tech companies have launched funding programs to address academia's compute shortage, such as AWS Cloud Credits for Research and free cloud credits from Google and Microsoft [5].
- Chinese universities are also acting to ease students' compute anxiety; Tsinghua University recently distributed compute vouchers to students [5].

Group 3: Tencent Qingyun Scholarship
- The scholarship provides not only cash awards but also essential computational resources for young scholars, particularly top doctoral students who prioritize long-term research value over short-term returns [6][15].
- Each awardee receives 200,000 yuan in cash plus 300,000 yuan worth of cloud heterogeneous computing resources, a substantial boost to their research capacity [15][17].
- The program aims to stimulate innovation among young scholars and support breakthroughs in AI [6][15].

Group 4: Computational Resource Value
- The 300,000 yuan in compute can support around-the-clock use of cutting-edge GPU instances for three months, or 2,000 hours on an 8-GPU setup [17][19].
- Awardees can flexibly configure their compute to match their research needs, which is crucial given the varying demands of large-model research [19][20].

Group 5: Industry-Academic Collaboration
- The scholarship is a direct and effective channel for tech companies to connect with academia, aiding talent acquisition and fostering innovation [14][23].
- Tencent's broad business matrix, spanning social media, content production, and cloud services, provides a rich ecosystem for applying AI technologies [24][25].
Something big is here: Google's Gemini 3.0 Pro generates web-based operating systems in a single pass, covering Windows, Mac, and Linux
机器之心· 2025-10-15 04:08
Core Insights
- The article covers Google's latest model, Gemini 3.0, which can generate web-based operating systems resembling macOS, Windows, and Linux from simple prompts [6][8][19].
- The generated systems are functional at the level of basic applications and front-end design, but lack the full feature set and logic of a real operating system [16][19].

Group 1: Gemini 3.0 Capabilities
- Gemini 3.0 has demonstrated the ability to create a web-based operating system with smooth animations, window management, and basic applications [5][6].
- The model can produce complex designs from abstract prompts, showcasing advanced front-end design capability [15].
- Users have expressed amazement at Gemini 3.0's potential, calling it a new chapter for creative AI models [16].

Group 2: Limitations of Generated Systems
- Despite the impressive demos, the generated macOS-like and other systems exhibit only basic functionality and do not qualify as full operating systems [16][19].
- For example, the terminal in the generated systems handles only a few hard-coded cases, without a comprehensive command structure or logic [17].
- Large models like Gemini 3.0 remain far from able to construct a complete operating system [19].