A new breakthrough in multimodal retrieval: soft labels break the rigid mapping constraints of traditional methods, comprehensively outperforming CLIP | AAAI 2026 Oral
量子位· 2025-11-15 05:00
Core Insights
- The article introduces a new unified multimodal embedding model, UniME-V2, which addresses the limitations of existing negative sample mining methods and enhances semantic understanding through a novel mechanism called "MLLM-as-a-Judge" [3][9].

Group 1: Model Overview
- UniME-V2 improves the training process by constructing a set of potential hard negative samples through global retrieval and evaluating query-candidate pairs with an MLLM to generate soft semantic matching scores [3][4][9].
- The model aligns its similarity matrices with the soft semantic matching scores, significantly enhancing its ability to discern semantic differences among candidate samples [5][6] (a minimal sketch of this alignment follows this summary).

Group 2: Methodology
- The methodology involves a two-step process: first, constructing the potential hard negative sample set; second, using the MLLM to assess semantic alignment and generate matching scores [13][14][15].
- A re-ranking model, UniME-V2-Reranker, is trained on the mined hard negatives, employing a joint pairwise and listwise optimization strategy to further improve performance [6][30].

Group 3: Performance Evaluation
- UniME-V2 delivers significant improvements over existing baselines, including gains of 3.5% and 2.2% over VLM2Vec for the Qwen2-VL-2B and 7B models, respectively [36][37].
- The model performs robustly on out-of-distribution datasets, scoring 66.7, indicating strong transferability and robustness [38].

Group 4: Cross-Modal Retrieval
- In zero-shot cross-modal retrieval, UniME-V2 outperforms previous models, with a 2.2%-9.7% improvement in image-to-text retrieval and significant gains on long-description tasks [41][42].
- Its ability to distinguish hard negative samples stands out, with improvements of 5.3%, 6.0%, and 4.5% when using Qwen2-VL-2B, and 9.0%, 9.2%, and 9.2% when scaled to 7B [47][48].

Group 5: Re-ranking Performance
- UniME-V2-Reranker surpasses LamRA, achieving better results across four downstream tasks while using only half the training data [52].
- Its advantage on complex-understanding retrieval tasks is attributed to the effective extraction of diverse, high-quality hard samples, which sharpens its discriminative capability [53].
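The alignment step above is essentially a distillation objective: the model's query-candidate similarity distribution is pulled toward the judge's soft matching scores instead of a rigid one-hot label. Below is a minimal sketch of one plausible form of that loss, a KL divergence between the two distributions; the function name, tensor shapes, and temperature values are illustrative assumptions, not the released UniME-V2 code.

```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(query_emb, cand_emb, judge_scores,
                        tau_model=0.05, tau_judge=1.0):
    """Align the model's query-candidate similarities with the MLLM judge's
    soft semantic matching scores via KL divergence (hypothetical sketch).

    query_emb:    (B, D)    embeddings of B queries
    cand_emb:     (B, K, D) embeddings of K candidates per query
                  (mined potential hard negatives plus the positive)
    judge_scores: (B, K)    soft matching scores from the MLLM judge
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    # Cosine similarity of each query with its K candidates: (B, K)
    sim = torch.einsum("bd,bkd->bk", q, c)
    # Model distribution over candidates vs. the judge's (target) distribution
    log_p_model = F.log_softmax(sim / tau_model, dim=-1)
    p_judge = F.softmax(judge_scores / tau_judge, dim=-1)
    # KL(judge || model): penalizes near-positives less than true negatives,
    # which is the point of replacing rigid one-hot targets with soft labels
    return F.kl_div(log_p_model, p_judge, reduction="batchmean")
```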
The world model dispute between Li Feifei and LeCun
量子位· 2025-11-15 05:00
Core Viewpoint
- The article discusses the competition among three major players in the AI industry (Li Feifei, Yann LeCun, and Google) over the development of world models, highlighting their distinct technological approaches and implications for artificial general intelligence (AGI) [1][3][42].

Group 1: Li Feifei and Marble
- Li Feifei's company, World Labs, has launched its first commercial world model, Marble, seen as having significant commercial potential due to its ability to generate persistent, downloadable 3D environments [2][5].
- Marble features a native AI world editor called Chisel, which lets users create and modify worlds with simple prompts, a capability particularly useful for VR and game developers [7][9].
- However, some experts argue that Marble resembles a 3D rendering model rather than a true world model, as it focuses on visual representation without incorporating the underlying physical laws necessary for robot training [10][18][20].

Group 2: Yann LeCun and JEPA
- LeCun's approach to world models, exemplified by JEPA, emphasizes control theory and cognitive science rather than 3D graphics, aiming to let robots predict changes in their environment without needing to generate visually appealing images [24][26].
- JEPA focuses on capturing the abstract representations of the world that are essential for AI decision-making, making it better suited to training robots [28][30] (a schematic sketch follows this summary).

Group 3: Google and Genie 3
- Google DeepMind's Genie 3, launched in August, lets users generate interactive video environments from a single prompt and addresses long-term consistency issues in generated worlds [32][35].
- Despite its dynamic capabilities, Genie 3 still fundamentally follows video logic and lacks the deeper grasp of physical laws that JEPA targets, making it less effective for robot training [38][40].

Group 4: World Model Pyramid
- The article arranges the three world models into a pyramid: Marble as the interface, Genie 3 as the simulator, and JEPA as the cognitive framework, illustrating their differing levels of abstraction and suitability for AI training [53][54].
- Moving up the pyramid, the models become more abstract and better aligned with AI's cognitive processes, while those at the bottom are more visually appealing but harder for robots to comprehend [54].
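For intuition on how this differs from pixel-generating world models, here is a schematic sketch of the joint-embedding predictive idea behind JEPA: predict the representation of a target view from a context view, never the pixels themselves. Module names, sizes, and the EMA update are illustrative assumptions, not Meta's implementation.

```python
import copy
import torch
import torch.nn as nn

class TinyJEPA(nn.Module):
    """Toy joint-embedding predictive architecture: no decoder, no pixel loss."""
    def __init__(self, dim=256):
        super().__init__()
        self.context_encoder = nn.Sequential(
            nn.Linear(784, dim), nn.GELU(), nn.Linear(dim, dim))
        # Target encoder is a frozen EMA copy of the context encoder.
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def loss(self, context_view, target_view):
        z_ctx = self.context_encoder(context_view)
        with torch.no_grad():
            z_tgt = self.target_encoder(target_view)
        # Regression in embedding space: only abstract structure is predicted.
        return nn.functional.mse_loss(self.predictor(z_ctx), z_tgt)

    @torch.no_grad()
    def update_target(self, momentum=0.996):
        # Slow EMA of the context encoder stabilizes the prediction target.
        for p_t, p_c in zip(self.target_encoder.parameters(),
                            self.context_encoder.parameters()):
            p_t.mul_(momentum).add_(p_c, alpha=1 - momentum)

model = TinyJEPA()
model.loss(torch.randn(8, 784), torch.randn(8, 784)).backward()
model.update_target()
```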
Less than 72 hours left: applications for the Artificial Intelligence Annual List are about to close
量子位· 2025-11-15 02:08
Let us witness the stars of the year together and light the way forward.

The Organizing Committee, posted from 凹非寺. 量子位 | WeChat official account QbitAI

Applications for the "2025 Artificial Intelligence Annual List" have entered the countdown phase. This is the 8th year of 量子位's annual list selection. Over these eight years we have witnessed technological breakthroughs and real-world deployment, the integration and reshaping of industries, and wave after wave of companies, people, and products pushing the era forward.

This year's selection sets up five award categories across three dimensions: companies, products, and people. Companies are welcome to seize the remaining time and register as soon as possible!

Company List / Product List / People List

2025 AI Annual Focus Figures

2025 AI Annual Leading Enterprises: to be selected from China's AI field as the enterprises with the strongest overall capabilities.

Eligibility:

Selection criteria:

How to register

Applications close on November 17, 2025. Results will be officially announced at the MEET2026 Intelligent Future Conference hosted by 量子位.

Scan the QR code to register. Web link: https://wj.qq.com/s2/23740133/iso8/

For other questions about the selection, contact 量子位 staff: add WeChat 18801103170, or email linyu@qbitai.com with the note "评选-企业-姓名" (Selection - Company - Name).

Detailed selection criteria and registration instructions follow. 2025 人 ...
Liang Wenfeng represents DeepSeek, and he represents Liang Wenfeng
量子位· 2025-11-15 02:08
Core Viewpoint
- The article discusses the appearance of the "Hangzhou Six Little Dragons" at the World Internet Conference in Wuzhen, highlighting the presence of key figures in AI and technology, and focusing in particular on DeepSeek and its representative, Chen Deli, who expressed both optimism and concern about AI's future impact on society [1][3][41].

Group 1: DeepSeek and Its Representation
- DeepSeek founder Liang Wenfeng did not attend the conference; instead, researcher Chen Deli represented the company, a significant public appearance for DeepSeek [3][6][41].
- Chen Deli, who joined DeepSeek in 2023, has worked on critical research areas such as language models and alignment mechanisms, contributing to several important publications [18][22][20].
- Chen Deli's presence at the conference makes him the second public representative of DeepSeek after Liang Wenfeng, underscoring his role as a spokesperson for the company's views on AI [41][42].

Group 2: AI Perspectives
- Chen Deli expressed a mixed outlook on AI: while humans and AI are in a "honeymoon period" over the next three to five years, there are significant long-term concerns that AI could replace most jobs in society [8][9].
- He argued that the current AI revolution differs fundamentally from previous industrial revolutions, because AI is beginning to possess its own "intelligence," which could surpass human capabilities in certain areas [10][11].
- The potential for AI to disrupt the existing social order and economic structures is a major concern; Chen suggested that technology companies may need to act as "guardians" to mitigate the negative impacts [12][13].

Group 3: Value Alignment in AI
- In his presentation, Chen Deli introduced the concept of "value alignment decoupling": core values should be unified, while users are allowed to customize diverse values, ensuring both safety and adaptability to societal diversity [25][24].
- This approach aims to address the rigidity of traditional large models, which often embed fixed values that fail to reflect the complexity of human society [24][25].
- The idea of "harmony in diversity" encapsulates this perspective, suggesting a more flexible and user-centric approach to AI value alignment [26][25].
Hands-on with an OS built for putting agents to work: it looks a bit like an AI browser and works on both systems
量子位· 2025-11-15 02:08
Core Viewpoint
- The article examines the emergence of AI-powered browsers through the capabilities and limitations of FlowithOS, a new operating system designed specifically for AI agents, which aims to improve the user experience by automating tasks users would otherwise perform themselves [1][4][52].

Group 1: AI Browser Landscape
- The current market for AI browsers falls into three types: traditional browsers with AI plugins, proxy-type browsers, and products like Atlas that let agents perform tasks autonomously [9][10][11].
- FlowithOS is distinct: it is not merely a browser but an operating system that lets agents execute tasks while retaining browsing capabilities [11][52].

Group 2: Testing FlowithOS
- FlowithOS was tested on retrieval and execution, demonstrating the ability to complete a multi-step task autonomously, such as finding a product and negotiating its price [20][21].
- However, the system showed user-experience problems, including slow response times and occasional bugs during complex tasks [21][50].

Group 3: Information Integration and Semantic Understanding
- Tests of information integration and summarization revealed that while the system can produce structured analyses, it often relies on metadata rather than engaging deeply with content [33][36].
- FlowithOS showed strong semantic understanding in a complex scenario, successfully identifying suitable gifts from user-provided context, hinting at a degree of emotional intelligence [43].

Group 4: Unique Features of FlowithOS
- FlowithOS includes a "Skill" feature that serves as a guide for agents to perform tasks, improving their ability to execute similar tasks in the future (a speculative sketch follows this summary) [45].
- The operating system also incorporates a memory function that adapts to user preferences, improving performance over time [46].

Group 5: Overall Assessment
- Despite its innovative approach, FlowithOS is still under development, with occasional malfunctions and performance problems on complex tasks [50][51].
- Its potential to transform the browsing experience by automating tasks is significant, suggesting a future in which users rely far less on manual interaction [52].
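Nothing about the internals of the "Skill" or memory features has been published, but a speculative sketch can make the concept concrete: reusable task recipes plus a small store of learned user preferences. Everything below (class names, fields, matching logic) is an assumption for illustration, not FlowithOS's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    steps: list[str]  # ordered natural-language instructions for the agent

@dataclass
class AgentProfile:
    skills: dict[str, Skill] = field(default_factory=dict)
    preferences: dict[str, str] = field(default_factory=dict)  # remembered user preferences

    def register_skill(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def plan(self, task: str) -> list[str]:
        """Reuse a stored skill when one matches the task; otherwise plan from scratch."""
        for skill in self.skills.values():
            if skill.name in task:
                return skill.steps
        return [f"decompose and execute: {task}"]

profile = AgentProfile(preferences={"budget": "prefer discounted listings"})
profile.register_skill(Skill("price_negotiation",
                             ["find the product", "open seller chat", "ask for a discount"]))
print(profile.plan("run price_negotiation for the selected product"))
```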
OpenAI gets open again: new interpretability research released, with authors from Ilya's Superalignment team
量子位· 2025-11-15 02:08
Core Insights
- OpenAI has introduced a new method for training smaller models that improves interpretability, making the internal mechanisms of the models easier for humans to understand [5][6][7].
- The research focuses on creating sparse models with many neurons but few connections, simplifying neural networks for better comprehension [7][11].

Summary by Sections

Model Interpretability
- OpenAI's language models have complex structures that are not fully understood; the new method aims to bridge that gap [6].
- The core idea is to train sparse models that keep a large number of neurons while limiting their connections, making the networks simpler and more interpretable [7][11] (a toy sketch of one such sparsity scheme follows this summary).

Research Methodology
- The researchers designed a series of simple algorithmic tasks to evaluate interpretability, identifying the "circuit" for each task [13][18].
- A "circuit" is defined as the smallest computational unit that lets the model perform a specific task, represented as a graph of nodes and edges [15][16].

Example of a Circuit
- One example task asks the model to predict the correct closing quote for a string in Python, demonstrating how the model remembers the type of opening quote in order to complete the string [19][22].

Findings and Implications
- The research indicates that larger, sparser models can implement increasingly powerful functions while keeping their circuits simple [26].
- This suggests the method could be extended to understand more complex model behaviors [27].

Current Limitations
- The study acknowledges that the sparse models are far smaller than state-of-the-art models and still contain many "black box" elements [30].
- Training sparse models is currently inefficient; two remedies are proposed: extracting sparse circuits from existing dense models, or developing more efficient training techniques [31][32].
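For intuition on what "many neurons but few connections" can look like mechanically, here is a toy sketch that enforces sparsity with a fixed binary weight mask, one simple way to keep the edge count tiny in a wide layer. This is a stand-in under stated assumptions; the summary does not describe OpenAI's actual training procedure.

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Linear layer in which only a small random subset of weights is ever nonzero."""
    def __init__(self, d_in, d_out, density=0.02):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)
        self.bias = nn.Parameter(torch.zeros(d_out))
        # Fixed connectivity mask: only ~2% of possible connections exist.
        self.register_buffer("mask", (torch.rand(d_out, d_in) < density).float())

    def forward(self, x):
        # Masking inside forward keeps pruned weights at zero throughout training,
        # because their gradients are multiplied by zero as well.
        return x @ (self.weight * self.mask).t() + self.bias

# A wide block with a large neuron count but a tiny edge count; fewer edges per
# neuron is the property that makes per-task "circuits" easier to read off.
block = nn.Sequential(SparseLinear(512, 4096), nn.ReLU(), SparseLinear(4096, 512))
out = block(torch.randn(8, 512))
```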
This year's Tsinghua Special Scholarship is packed with robotics! Shing-Tung Yau (domestic edition) shows up to comment
量子位· 2025-11-14 12:10
Core Viewpoint
- This year's Tsinghua University Special Scholarship for undergraduates shows a strong concentration in AI and robotics, with nearly half of the top candidates working on embodied-intelligence projects [5][37].

Group 1: Candidates and Their Achievements
- The top candidates come from a range of departments, with a notable emphasis on AI and robotics [1][5].
- Li Yitang of the Institute for Interdisciplinary Information Sciences is recognized for her work on humanoid robots, specifically achieving complex whole-body movements with minimal data [8][12].
- Chen Boyuan of Xingjian College has made significant contributions across several AI and robotics fields, including 3D scene reconstruction and tactile-visual integration [14][19].
- Xu Ruyi has focused on deploying AI models efficiently on limited computing power, achieving a 13.5x speedup in a key mechanism [22][25].

Group 2: Structural Changes at Tsinghua University
- Tsinghua has made a strategic shift in its academic structure, establishing an Artificial Intelligence College to foster innovation and talent in AI [37][39].
- Undergraduate enrollment has grown by 150 students, with a focus on cultivating top talent in "AI + interdisciplinary" fields [39][40].
- This structural change has naturally steered students toward key areas such as large models, 3D understanding, robot control, and autonomous driving [39][40].
A Genshin Impact agent, made by ByteDance
量子位· 2025-11-14 12:10
Core Viewpoint
- ByteDance has developed a new gaming agent named Lumine, capable of autonomously playing games such as Genshin Impact and showing advanced skills in exploration, combat, and puzzle-solving [1][4][9].

Group 1: Agent Capabilities
- Lumine can perform complex tasks in Genshin Impact, including dynamic enemy tracking, precise long-range shooting, and smooth character switching [4][5].
- The agent demonstrates strong understanding in boss battles and can solve varied puzzles, such as collecting items based on environmental cues [6][12].
- Lumine can execute GUI operations and follow complex instructions by drawing on prior task information [7][8].

Group 2: Technical Framework
- Lumine is built on the Qwen2-VL-7B-Base model, leveraging the multimodal understanding and generation capabilities learned from large-scale web data [9][10].
- The agent relies on three core mechanisms: an Observation Space for processing visual input, Hybrid Thinking for decision-making efficiency, and Keyboard and Mouse Modelling for representing actions (a sketch of one possible action representation follows this summary) [12][14][15].
- Training proceeded in three phases: pre-training on basic actions, instruction-following training, and decision-reasoning training, yielding high task-completion rates [17][20][23].

Group 3: Performance Metrics
- Lumine-Base shows a stepwise emergence of capabilities, achieving over 90% success on basic interactions but lacking goal-directed behavior [38].
- Lumine-Instruct outperforms mainstream VLMs on short-cycle tasks, with a 92.5% success rate on simple tasks and 76.8% on difficult ones [33][35].
- Lumine-Thinking excels at long-horizon tasks, completing the main storyline of Genshin Impact in 56 minutes with a 100% task-completion rate, significantly faster than competitors [41][42].

Group 4: Cross-Game Adaptability
- Lumine-Thinking adapts well across games, completing tasks in titles such as Honkai: Star Rail and Black Myth: Wukong, showing the traits of a general agent [45][46].
- Its ability to navigate unfamiliar environments and execute complex tasks points to applications well beyond gaming [45][46].

Group 5: Industry Implications
- Lumine reflects an industry trend: companies such as Google are also building agents that operate in 3D game environments, pointing toward embodied AGI [48][51].
- The expectation that gaming agents will eventually transfer to real-world applications underscores the significance of these advances in AI and gaming technology [51].
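"Keyboard and Mouse Modelling" implies the model must emit low-level input events as output. A minimal sketch of one way such actions could be serialized into structured tokens appears below; the schema and field names are illustrative assumptions, not Lumine's published format.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple
import json

@dataclass
class AgentAction:
    kind: str                                    # "key_press", "key_hold", "mouse_move", "mouse_click"
    key: Optional[str] = None                    # e.g. "w" or "space" for keyboard actions
    position: Optional[Tuple[int, int]] = None   # screen coordinates for mouse actions
    duration_ms: int = 0                         # hold duration for sustained presses

    def to_token(self) -> str:
        """Serialize to a compact string a language model can emit as an action."""
        return json.dumps({k: v for k, v in asdict(self).items() if v not in (None, 0)})

# A short trajectory: run forward, then aim at and click a target on screen.
trajectory = [
    AgentAction(kind="key_hold", key="w", duration_ms=800),
    AgentAction(kind="mouse_move", position=(960, 420)),
    AgentAction(kind="mouse_click", position=(960, 420)),
]
for act in trajectory:
    print(act.to_token())
```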
Final week! Applications for the Artificial Intelligence Annual List are about to close.
量子位· 2025-11-14 08:22
Core Points
- Applications for the "2025 Artificial Intelligence Annual List" have entered the countdown phase, marking the 8th year of an event that has tracked technological breakthroughs and industry transformation [1][2].
- The evaluation covers three dimensions (companies, products, and individuals), with five award categories established [2][6].

Group 1: Application and Evaluation
- The application deadline is November 17, 2025, with results to be announced at the MEET2026 Intelligent Future Conference [7][27].
- Companies are encouraged to seize the final window to apply and showcase their contributions to the AI industry [2][6].

Group 2: Award Categories
- The awards include:
  - 2025 AI Annual Leading Enterprises
  - 2025 AI Annual Potential Startups
  - 2025 AI Annual Outstanding Products
  - 2025 AI Annual Outstanding Solutions
  - 2025 AI Annual Focus Figures [10][9][14][16][19][21]

Group 3: Evaluation Criteria
- Leading Enterprises are judged on market share, revenue scale, technological innovation, and brand influence [11].
- Potential Startups are judged on business-model viability, market growth potential, and innovation achievements [14].
- Outstanding Products must demonstrate clear application value and market feedback, with significant technological advances in the past year [16][17].
- Outstanding Solutions should showcase innovative applications across industries, with proven market impact [19][22].
- Focus Figures are evaluated on their contributions to AI technology and their influence on the industry [21][23].
Four Megvii prodigies' embodied-intelligence startup raises nearly 1 billion yuan; Alibaba's exclusive investment draws attention
量子位· 2025-11-14 08:22
Group 1
- The core point of the article is that investment in embodied intelligence is heating up, exemplified by Dexmal's recent financing of nearly 1 billion yuan [1][2][6].
- Dexmal's latest round drew attention because Alibaba was the exclusive investor, highlighting the company's potential and market interest [3][5].
- In just over two months, Dexmal has raised close to 1 billion yuan across three financing rounds, signaling rapid growth and investor confidence [6][4].

Group 2
- Dexmal, founded in March 2025, focuses on the research and application of embodied-intelligence hardware and software technologies [7].
- The company's mission is to build intelligent, useful, and trustworthy robots that improve quality of life [8].
- Dexmal's core team combines top AI academic backgrounds with extensive experience scaling AI-native products, drawn primarily from Megvii Technology [9][10].

Group 3
- Dexmal has made significant research progress, publishing more than ten papers at top conferences on AI and embodied intelligence [13].
- The company has developed two frameworks, Real-time VLA and MemoryVLA, which optimize robot performance for real-time and long-horizon tasks, respectively [14][15].
- To address a fragmented research ecosystem, Dexmal has released an open-source VLA toolbox, Dexbotic, intended as a unified research platform for embodied intelligence [18][20].

Group 4
- Dexmal has also introduced a hardware product, DOS-W1, an open, modifiable, and expandable experimental platform for embodied-intelligence research [21][24].
- DOS-W1 uses a modular architecture that allows easy upgrades and modification, lowering the research barrier across the industry [26].
- In addition, Dexmal has partnered with Hugging Face to launch RoboChallenge, billed as the world's first large-scale evaluation platform for embodied intelligence [28].

Group 5
- Dexmal's founding team includes notable figures such as Tang Wenbin, Fan Haoqiang, Zhou Yanjin, and Wang Tiancai, all with extensive AI experience and prior careers at Megvii Technology [33][57][71].
- Tang Wenbin, the CEO, has a strong computer-science background and was instrumental in Megvii's success [34][41].
- Fan Haoqiang and Zhou Yanjin are recognized for their achievements in international competitions and their contributions to AI, further solidifying the team's expertise [44][62][68].