Multimodal Large Models
Multimodal large model achieves pixel-level reasoning for the first time; 3B parameters outperform a 72B traditional model; accepted to NeurIPS 2025
36Ke· 2025-10-16 07:39
Core Insights
- The article discusses the introduction of UniPixel, a unified pixel-level multimodal large model developed by a research team from Hong Kong Polytechnic University and Tencent ARC Lab, which can perform referring, segmentation, and reasoning tasks effectively [1][3][4]

Model Capabilities
- UniPixel accomplishes three major tasks, target referring, pixel-level segmentation, and area reasoning, showcasing flexibility, precision, and scalability [3][4]
- The model has been accepted for presentation at NeurIPS 2025, with its code, data, and demo open-sourced [3]

Technical Innovations
- UniPixel redefines visual reasoning by enabling precise perception of specific areas or targets within images or videos, addressing limitations of traditional visual question-answering systems [4][6]
- The architecture is based on the Qwen2.5-VL model, supporting various input types and visual prompts and producing natural language responses along with spatio-temporal masks [6][8]

Key Modules
- The model incorporates three critical modules: a prompt encoder for visual prompts, an object memory bank for storing user-specified targets, and a mask decoder for generating precise spatio-temporal masks [8][12]
- UniPixel extends its language model vocabulary with special tokens to integrate visual prompts and memory retrieval into the generation process [9]

Performance Evaluation
- Extensive experiments on ten public benchmark datasets demonstrate UniPixel's superior performance across nine visual-language understanding tasks, particularly segmentation, where it outperformed existing models [19][20]
- On the ReVOS reasoning segmentation benchmark, UniPixel achieved a J&F score of 62.1, surpassing all other models and indicating strong associative modeling between complex text prompts and pixel-level mask generation [20]

Training Data and Methodology
- The training dataset comprises approximately 1 million samples covering text, images, and videos, enhancing the model's adaptability across task settings [17]
- The training strategy is modular and phased, allowing collaborative training of visual encoders and language models without overfitting to specific tasks [16]

Future Implications
- The introduction of UniPixel marks a significant milestone in multimodal AI, transitioning from modality alignment to fine-grained understanding and potentially enabling intelligent agents capable of precise focus and natural interaction [34]
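The "prompt encoder → object memory bank → mask decoder" flow described above can be sketched as a toy data structure. This is an illustration, not the released UniPixel code: the `ObjectMemoryBank` class, its method names, and the `obj_1` identifier are all assumptions made for exposition.

```python
# Illustrative sketch (NOT the authors' implementation) of the
# perception-memory-reasoning flow: a prompt encoder maps a visual prompt
# (point/box/mask) to an embedding, the object memory bank stores it under an
# object ID, and the mask decoder would later be queried when the language
# model emits a special vocabulary token referring to that object.

from dataclasses import dataclass, field

@dataclass
class ObjectMemoryBank:
    """Stores user-specified targets keyed by object ID."""
    slots: dict = field(default_factory=dict)

    def register(self, obj_id: str, prompt_type: str, embedding: list) -> None:
        # The article names three supported visual prompt types.
        assert prompt_type in {"point", "box", "mask"}
        self.slots[obj_id] = {"type": prompt_type, "emb": embedding}

    def retrieve(self, obj_id: str) -> dict:
        # In the real model, retrieval would be triggered by a special token
        # in the LLM's output; here we simply look the ID up directly.
        return self.slots[obj_id]

bank = ObjectMemoryBank()
bank.register("obj_1", "box", [0.1, 0.2, 0.3])
target = bank.retrieve("obj_1")
print(target["type"])  # box
```

The point of the memory bank is that a target registered once (via any prompt type) can be referenced repeatedly in later reasoning turns without re-specifying it.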
Multimodal large model achieves pixel-level reasoning for the first time! 3B parameters outperform a 72B traditional model; accepted to NeurIPS 2025
量子位· 2025-10-16 06:11
Core Insights
- The article introduces UniPixel, a unified pixel-level multimodal model developed by a research team from Hong Kong Polytechnic University and Tencent ARC Lab, which aims to enhance visual reasoning capabilities in AI systems [2][4]

Group 1: Model Overview
- UniPixel is designed to perform three major tasks, referring, pixel-level segmentation, and reasoning, within a single model, showcasing flexibility, precision, and scalability [4][8]
- The model has been accepted for presentation at NeurIPS 2025, and its code, data, and demo are fully open-sourced [5]

Group 2: Technical Innovations
- UniPixel redefines visual reasoning by addressing the limitations of traditional visual question-answering systems, which often lack precise perception of specific areas or targets within images [8][9]
- The model incorporates an "Object Memory Bank" and supports three types of visual prompts (point, box, mask), enabling a complete "perception-memory-reasoning" process [9][12]

Group 3: Architecture and Functionality
- The architecture is based on the Qwen2.5-VL model, allowing UniPixel to process various inputs, including images, videos, and text prompts, and to generate natural language responses along with spatio-temporal masks [12][14]
- Key components include a Prompt Encoder for unified encoding of visual prompts, an Object Memory Bank for storing user-specified targets, and a Mask Decoder for generating precise spatio-temporal masks [19][21]

Group 4: Training and Evaluation
- Training followed a modular, phased strategy, using approximately 1 million samples across various datasets to enhance adaptability to different tasks [28][29]
- Extensive experiments were conducted on 10 public benchmark datasets covering 9 major visual-language understanding tasks, demonstrating superior performance in complex reasoning and segmentation [31][33]

Group 5: Performance Metrics
- On the ReVOS reasoning segmentation benchmark, UniPixel-3B achieved a J&F score of 62.1, surpassing all existing models and indicating strong capability in associating complex text prompts with pixel-level mask generation [33]
- The model also excelled on datasets such as MeViS, Ref-YouTube-VOS, and RefCOCO, showing leading performance across visual understanding tasks [33][34]

Group 6: Future Implications
- The introduction of UniPixel marks a significant milestone in multimodal AI, transitioning from "modal alignment" to "fine-grained understanding" and effectively merging object referring and segmentation with language reasoning [47][48]
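For context on the J&F numbers cited above: J is region similarity (mask IoU) and F is a boundary F-measure, and segmentation benchmarks report their mean. The following is a simplified, self-contained sketch of that metric; the boundary extraction (a 4-neighbor interior test on pixel sets) is a deliberate simplification of the official DAVIS-style implementation.

```python
# Simplified J&F computation over binary masks represented as sets of
# (x, y) pixel coordinates.

def region_j(pred: set, gt: set) -> float:
    """Region similarity J: intersection-over-union of the two masks."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0

def boundary(mask: set) -> set:
    """Pixels of the mask with at least one 4-neighbor outside it."""
    nbrs = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return {(x, y) for (x, y) in mask
            if any((x + dx, y + dy) not in mask for dx, dy in nbrs)}

def boundary_f(pred: set, gt: set) -> float:
    """Boundary F-measure: F1 score between the two boundary pixel sets."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp or not bg:
        return 1.0 if bp == bg else 0.0
    p = len(bp & bg) / len(bp)
    r = len(bp & bg) / len(bg)
    return 2 * p * r / (p + r) if p + r else 0.0

def jf_score(pred: set, gt: set) -> float:
    """The reported J&F is the mean of J and F."""
    return (region_j(pred, gt) + boundary_f(pred, gt)) / 2

# Two 3x3 masks shifted by one column, overlapping in a 2x3 region:
gt = {(x, y) for x in range(3) for y in range(3)}
pred = {(x, y) for x in range(1, 4) for y in range(3)}
print(jf_score(pred, gt))  # 0.5 (J = 6/12, F = 0.5)
```

A score of 62.1 J&F on ReVOS thus averages how well predicted masks cover the target regions and how well their boundaries line up with the ground truth.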
For the large-model field, is it better to go work in industry or pursue a PhD?
具身智能之心· 2025-10-16 00:03
Core Insights
- The article discusses the decision-making process for individuals in the large model field regarding whether to pursue a PhD or engage in entrepreneurial ventures related to agents [1][2]

Group 1: Importance of a Foundation in Large Models
- A solid foundation in large models is crucial, as the field encompasses directions such as generative models, multimodal models, fine-tuning, and reinforcement learning [1]
- Many mentors lack sufficient expertise in large models, leading students to overestimate their readiness for related positions [1]

Group 2: Role of a Pioneer in Research
- Whether an individual is suited to the role of research "pioneer" is essential to assess, especially in a field with many unexplored directions [2]
- The ability to explore independently and endure failure is emphasized as a key trait for those aiming to innovate from scratch [2]

Group 3: Community and Learning Resources
- The "Large Model Heart Tech Knowledge Planet" community offers a comprehensive platform for beginners and advanced learners, featuring videos, articles, learning paths, and Q&A sections [2]
- The community aims to provide a space for technical exchange and collaboration among peers in the large model domain [4]

Group 4: Learning Pathways
- The community has compiled detailed learning pathways covering aspects of large models including RAG, AI Agents, and multimodal training [4][9]
- Each learning pathway includes clear technical summaries, making it suitable for systematic learning [4]

Group 5: Benefits of Joining the Community
- Members gain access to the latest academic advancements and industrial applications related to large models [7]
- The community facilitates networking with industry leaders and provides job recommendations in the large model sector [7][68]

Group 6: Future Plans and Engagement
- The community plans to host live sessions with industry experts, with valuable content available for repeated viewing [65]
- A focus on building a professional exchange community with contributions from over 40 experts from renowned institutions and companies is highlighted [66]
CICC: How should we view the Sora App's impact on internet platforms?
中金点睛· 2025-10-15 23:54
Core Viewpoint
- The Sora App, launched by OpenAI, quickly gained popularity, achieving first-week download numbers comparable to ChatGPT's launch, but it is unlikely to disrupt the current social media landscape due to various limitations [2][5][14]

Group 1: Sora App Features and Performance
- The Sora App integrates social features and diverse creation methods to build an immersive video ecosystem, featuring a vertical video stream design and interactive user comments [2][7]
- Its innovative Cameo and Remix features allow users to create high-fidelity digital avatars and perform secondary creation on videos, respectively, lowering the barriers to video creation [9][13]
- In its first week, the Sora App topped the U.S. iOS free download charts, with download numbers similar to ChatGPT's at launch, indicating potential for further growth [5][12]

Group 2: Market Impact and Competitive Landscape
- Despite its innovative features, the Sora App is expected to struggle to establish itself as an independent platform, as AIGC video content is currently viewed as a niche within existing social media platforms rather than a standalone category [3][14]
- The competitive landscape suggests that incumbent market leaders are likely to catch up with the technological advances Sora demonstrates, as the gap in model capabilities can be bridged over time [15]
- Legal and compliance issues surrounding AIGC content, particularly copyright risks, remain unresolved and could hinder widespread adoption of the Sora App [16]

Group 3: Future Outlook
- The Sora App is anticipated to influence content creation trends, particularly by enhancing user engagement through its social features, but it is not expected to significantly disrupt the existing social media ecosystem [12][14]
- Its impact on the domestic market is limited, but it may encourage mainstream platforms to adopt similar creative functionalities to boost user activity and advertising revenue [14]
Can AI go on a "pilgrimage" to real-world locations? VIR-Bench, a new evaluation benchmark for multimodal large models, is here
机器之心· 2025-10-15 04:08
Core Insights
- The article discusses the development of VIR-Bench, a new multimodal large model evaluation benchmark for assessing AI's ability to understand travel videos in terms of geographic locations and temporal sequences [4][20]
- The research emphasizes the importance of reconstructing travel itineraries from videos, which requires models to comprehend both geographic and temporal relationships [20]

Group 1: VIR-Bench Overview
- VIR-Bench evaluates AI's understanding of travel vlogs by having models generate a visiting order graph that represents the sequence and relationships of visited locations [6][9]
- The visiting order graph consists of nodes representing locations at three levels: Prefecture, City, and Point of Interest (POI) [7][9]

Group 2: Task Design and Dataset
- The task is divided into two sub-tasks: node prediction, where the model identifies all visited locations, and edge prediction, where it determines the relationships between them [10][11]
- A dataset of 200 travel videos was constructed, covering 3,689 POIs across 43 prefectures in Japan, with detailed annotations for each video [17][13]

Group 3: Experimental Results and Challenges
- Current models, particularly open-source ones, lag behind commercial models in POI node recognition and transition edge prediction, with transition edge prediction notably challenging [16][18]
- Model performance improves significantly with increased scale and geographic pre-training, highlighting the importance of both factors [16][18]

Group 4: Future Directions
- While current models struggle with long-range reasoning and temporal understanding, there are clear pathways for improvement, such as enhancing spatial awareness and integrating multimodal information [20][18]
- The ultimate goal is for AI not only to analyze videos but also to act within the world, aligning with applications in robotics and autonomous systems [20][18]
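The two sub-tasks above can be sketched in a few lines. The summary does not specify VIR-Bench's exact scoring protocol, so this sketch uses a plain set-based F1 for both node and edge prediction, and the itinerary data (a three-stop Tokyo route) is invented for illustration; node names encode the Prefecture/City/POI hierarchy.

```python
# Hypothetical visiting-order-graph evaluation: node prediction recovers the
# set of visited locations, edge prediction recovers the ordered transitions
# between them. Scoring here is simple set-based F1 (an assumption).

def f1(pred: set, gt: set) -> float:
    if not pred and not gt:
        return 1.0
    tp = len(pred & gt)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)

# Made-up ground-truth itinerary for one travel video:
gt_nodes = {"Tokyo/Shibuya/Shibuya Crossing",
            "Tokyo/Taito/Sensoji Temple",
            "Tokyo/Minato/Tokyo Tower"}
gt_edges = {("Tokyo/Shibuya/Shibuya Crossing", "Tokyo/Taito/Sensoji Temple"),
            ("Tokyo/Taito/Sensoji Temple", "Tokyo/Minato/Tokyo Tower")}

# A hypothetical model output missing one POI, and hence one transition edge:
pred_nodes = {"Tokyo/Shibuya/Shibuya Crossing", "Tokyo/Taito/Sensoji Temple"}
pred_edges = {("Tokyo/Shibuya/Shibuya Crossing", "Tokyo/Taito/Sensoji Temple")}

print(round(f1(pred_nodes, gt_nodes), 2))  # node prediction score
print(round(f1(pred_edges, gt_edges), 2))  # edge prediction score
```

Note how one missed node also removes its incident edges, which is one structural reason edge prediction scores tend to trail node prediction, consistent with the results reported above.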
NeurIPS 2025 Spotlight | Conditional Representation Learning: Aligning Representations with Criteria in One Step
机器之心· 2025-10-15 02:54
The first author of this article is Liu Honglin, a PhD student at Sichuan University (email: tristanliuhl@gmail.com); the corresponding authors are Sichuan University postdoctoral researcher Li Yunfan and Sichuan University Professor Peng Xi.

The information contained in an image is multi-dimensional. In Figure 1 below, for example, we can read at least three levels of information: the subject is elephants, the count is two, and the setting is a tropical savanna. Yet if this image is processed by a traditional representation learning method, say a ResNet or Vision Transformer pretrained on ImageNet, the resulting representation typically reflects only the subject, simply assigning the image to the "elephant" category. This is clearly unsatisfactory.

Figure 1: Comparison of traditional representation learning (top) and conditional representation learning (bottom). Traditional methods learn only a single generic representation, ignoring other meaningful information; the proposed conditional representation learning yields, for a specified criterion, a more expressive conditional representation under that criterion, suited to diverse downstream tasks.

Moreover, on major e-commerce platforms, users typically search for products according to different criteria (for example color, material, or occasion): a user may search "red dress" today, "formal wear" tomorrow, and an entirely new keyword the day after. For platforms with catalogs of enormous scale, manual tagging is impractical, while traditional representation learning can only obtain ...
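The core idea above, that the same generic feature should yield different representations under different criteria, can be illustrated numerically. This is a toy sketch of the concept, not the paper's actual method: the criterion is modeled as a small orthonormal basis of direction vectors (imagined as embeddings of words describing "color" or "material"), and the conditional representation is the feature's coordinates in that basis.

```python
# Toy illustration of conditional representations: re-expressing one generic
# feature vector under different criterion subspaces. All vectors below are
# made up; real criterion bases would come from learned text embeddings.

def conditional_rep(feature, criterion_basis):
    # Project the generic feature onto each basis direction of the criterion
    # (assumes the basis vectors are orthonormal for simplicity).
    return [sum(f * b for f, b in zip(feature, basis_vec))
            for basis_vec in criterion_basis]

# A generic 4-d image feature for, say, a red cotton dress (hypothetical):
feature = [0.9, 0.1, 0.7, 0.2]

# Hypothetical criterion subspaces spanned by word-embedding directions:
color_basis = [[1.0, 0.0, 0.0, 0.0],     # "red" direction
               [0.0, 1.0, 0.0, 0.0]]     # "blue" direction
material_basis = [[0.0, 0.0, 1.0, 0.0],  # "cotton" direction
                  [0.0, 0.0, 0.0, 1.0]]  # "silk" direction

print(conditional_rep(feature, color_basis))     # representation under "color"
print(conditional_rep(feature, material_basis))  # representation under "material"
```

Under the "color" criterion the dress is dominated by its "red" coordinate; under "material" the same feature is dominated by "cotton", matching the search scenarios described in the text.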
A roundup of large-model job interview experiences from 20 Chinese companies
自动驾驶之心· 2025-10-14 23:33
Group 1
- The article compiles job offers and interview experiences from companies in the AI and autonomous driving sectors, highlighting the competitive job market in these fields [4][19][27]
- Companies mentioned include 淘天, 字节, 商汤, 蚂蚁, 美团, and others, showcasing their focus on large model research and applications across scenarios [5][10][19][27]
- The interview processes are described as rigorous, with strong emphasis on technical skills, particularly coding and algorithm design [13][18][27][40]

Group 2
- 淘天's large model research focuses on two main scenarios, search advertising and content curation, led by notable executives [5][10]
- 字节's AML team emphasizes coding skills and algorithmic problem-solving during interviews, reflecting the company's high standards [13][40]
- 商汤's interview process is noted for its professionalism, although candidates reported a lack of product focus and less competitive salary packages [18][27]

Group 3
- 蚂蚁's focus on risk control models highlights the integration of visual understanding in industrial applications, emphasizing the importance of multimodal solutions [19][23]
- 美团's interview questions probe spatial perception and multimodal model capabilities, indicating the company's commitment to advanced AI technologies [27][40]
- The article also mentions the growing community around autonomous driving technologies, with nearly 4,000 members and over 300 companies involved in discussions and knowledge sharing [59]
Zheshang Securities Morning Brief - 20251015
ZHESHANG SECURITIES· 2025-10-14 23:30
Market Overview
- On October 14, the Shanghai Composite Index fell 0.62%, the CSI 300 fell 1.2%, the STAR 50 dropped 4.26%, the CSI 1000 declined 1.95%, the ChiNext Index fell 3.99%, and the Hang Seng Index decreased 1.73% [4]
- The best-performing sectors on October 14 were banking (+2.51%), coal (+2.18%), food and beverage (+1.69%), transportation (+0.5%), and utilities (+0.49%); the worst-performing were telecommunications (-4.98%), electronics (-4.64%), non-ferrous metals (-3.66%), computers (-2.98%), and electrical equipment (-2.36%) [4]
- Total A-share trading volume on October 14 was 2.5966 trillion yuan, with a net inflow of 8.603 billion HKD from southbound funds [4]

Key Insights

Cosmetic Industry
- The cosmetics market is expected to continue low single-digit growth in Q4, with brand differentiation increasing; new consumer brands are recommended as they have upward momentum and room for valuation switching toward 2026 [5]
- New consumer brands are anticipated to achieve a compound annual growth rate of 20%-30% in revenue and profit over the next 2-3 years, remaining attractive in terms of market conditions and certainty [6]

Computer Industry
- The rise of domestic computing power and the application of AI are highlighted as key trends; large-scale implementation of large language models still awaits breakthroughs [9]
- The domestic computing power supply chain is gradually taking shape, driven by revenue growth from domestic computing power manufacturers such as Cambrian; accelerating multimodal large model applications are expected to lead to commercial implementation in the video sector [9]
- The market perceives that large-scale model implementation still faces challenges, but advances such as Sora 2.0 are expected to break through physical simulation barriers, potentially generating commercial value in video generation [10]
NeurIPS 25 | 中大 & UC Merced et al. open-source RAPID Hand, redefining data collection for multi-fingered dexterous hands
机器之心· 2025-10-14 08:24
Authors: Zhaoliang Wan, Zetong Bi, Zida Zhou, Hao Ren, Yiming Zeng, Yihan Li, Lu Qi, Xu Yang, Ming-Hsuan Yang, Hui Cheng

Paper title: RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy
Paper: https://www.arxiv.org/abs/2506.07490
Project page: https://rapid-hand.github.io/

Dexterous manipulation is one of the core capabilities a general-purpose robot needs for multi-task generalization. Whether for everyday household tidying and object arrangement or for assistive service tasks, a robot can hardly accomplish complex interactions without dexterous manipulation skills.

In recent years, as multimodal large models (VLMs) have been progressively applied to robot control, researchers have begun combining high-quality manipulation demonstrations with pretrained models for embodied reasoning and generalist manipulation policy learning ...
上海网达软件股份有限公司: Announcement on the Convening of Its 2025 Half-Year Performance Briefing
China Securities Journal - Zhong Zheng Wang· 2025-10-14 05:33
Core Viewpoint
- The company held a half-year performance briefing on October 13, 2025, to discuss its technological advantages and future plans in the current market environment [1]

Group 1: Technological Advantages
- The company has developed a comprehensive HD video solution integrating intelligent encoding/decoding technology, a low-latency processing architecture, and deep AI analysis, achieving a transmission delay of 60 ms in low-bandwidth environments [1][2]
- In the AI sector, the company focuses on security applications, building specialized models that understand industry knowledge, and has implemented intelligent analysis of 4K ultra-high-definition surveillance video [2]
- The company is advancing its media production capabilities by integrating AIGC content generation and intelligent agent collaboration, enhancing content dissemination efficiency and digital marketing [2]

Group 2: R&D Investment and Future Directions
- The company plans to focus on generative AI and its integration with video applications, emphasizing collaborative innovation in video encoding, editing, and recognition [5][6]
- Future R&D will target industry-specific models, AIGC applications, and XR technologies, balancing cost and benefit in R&D investment [6]

Group 3: Market Engagement and Strategic Initiatives
- The company is actively participating in the low-altitude economy sector, developing intelligent inspection systems for drones and unmanned vehicles that align with national strategic needs [7]
- The company has implemented a cash dividend policy, distributing 1.50 yuan per 10 shares to shareholders, and will continue to balance short-term returns with long-term growth [8]

Group 4: R&D Expenditure and Efficiency
- The company reported a decrease in R&D expenses due to a strategic focus on AI technology and optimization of high-end video product lines, while reducing investment in mature and non-core areas [9]

Group 5: AI Safety Supervision Developments
- The company has advanced its AI-driven digital safety supervision systems, integrating multi-source data for dynamic perception and risk assessment across operational scenarios [10]