Beating Meta to Top the Leaderboard: ReasonRank, a Reasoning-Enhanced Document Ranking Model, Has Arrived
机器之心· 2025-08-21 04:12
The first author of this work, Liu Wenhan, is a third-year PhD student at the Gaoling School of Artificial Intelligence, Renmin University of China, advised by Professor Dou Zhicheng; he is currently an intern in Baidu's Search division. His research focuses on AI search, with multiple papers published at top international venues such as ACL and WWW.

Large Reasoning Models have greatly advanced natural language processing, and document ranking is one of the core problems in information retrieval. Using a powerful reasoning model to actively reason about document relevance, and then rank documents accordingly, is a direction worth exploring.

In this work we propose ReasonRank. Across multiple leaderboards, including BRIGHT and R2MED, ReasonRank beat submissions from universities and institutions including UMass, the University of Waterloo, and Meta, taking first place on August 9, 2025. Our smaller ReasonRank-7B also far surpasses other 32B-scale reasoning-based ranking models, while holding a clear efficiency advantage over pointwise rerankers. In addition, our paper reached first place on the Hugging Face daily papers leaderboard.

| Rank | Retriever | Score |
| --- | --- | --- |
| ... |
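The efficiency claim against pointwise rerankers comes down to call counts: a pointwise reasoning reranker must produce a reasoning chain per document, while a listwise reranker reasons over a whole window of candidates per call. A minimal sketch of that accounting (the window and stride values are illustrative assumptions, not ReasonRank's actual settings):

```python
def pointwise_calls(num_docs: int) -> int:
    """A pointwise reranker spends one LLM call (one reasoning chain) per document."""
    return num_docs

def listwise_calls(num_docs: int, window: int = 20, step: int = 10) -> int:
    """A listwise sliding-window reranker reasons over a window of documents per call."""
    calls, start = 1, 0
    while start + window < num_docs:
        calls += 1
        start += step
    return calls

print(pointwise_calls(100))  # 100 calls for 100 candidates
print(listwise_calls(100))   # 9 calls for the same candidate list
```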
Context Memory Rivaling Genie 3, and Released Earlier: HKU and Kuaishou's Kling Team Propose a Scene-Consistent Interactive Video World Model
机器之心· 2025-08-21 01:03
Core Insights
- The article discusses video generation models that can maintain scene consistency over long durations, addressing the critical problem of stable scene memory in interactive long-video generation [2][10][17].
- Google DeepMind's Genie 3 is highlighted as a significant advance in this field, demonstrating strong scene consistency, although its technical details remain undisclosed [2][10].
- The Context as Memory paper, from a research team at the University of Hong Kong and Kuaishou, is presented as a leading academic work closely aligned with Genie 3's principles, emphasizing implicit learning of 3D priors from video data without explicit 3D modeling [2][10][17].

Context as Memory Methodology
- The Context as Memory approach uses historically generated context as memory, enabling scene-consistent long-video generation without explicit 3D modeling [10][17].
- A Memory Retrieval mechanism makes a theoretically unbounded history of frames usable in practice by selecting relevant frames based on camera trajectory and field of view (FOV), significantly improving computational efficiency and reducing training cost [3][10][12].

Experimental Results
- Experimental comparisons show that Context as Memory outperforms existing state-of-the-art methods at maintaining scene memory during long-video generation [15][17].
- The model retains static scene memory well over time and generalizes across different scenes [6][15].

Broader Research Context
- The team has accumulated multiple studies on world models and interactive video generation, proposing a framework of five foundational capabilities: Generation, Control, Memory, Dynamics, and Intelligence [18].
- This framework is offered as a guiding direction for future research on foundational world models, with Context as Memory being a focused contribution on the memory capability [18].
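The Memory Retrieval idea can be sketched in miniature: score each historical frame by how much its camera view overlaps the current one, and condition generation only on the top-scoring frames. This is an illustrative toy, not the paper's implementation; the overlap proxy and function names are assumptions:

```python
import math

def view_overlap(dir_a, dir_b, fov_deg=90.0):
    """Crude proxy for FOV overlap: 1.0 for identical view directions,
    falling to 0.0 once the angle between them reaches the FOV."""
    dot = sum(a * b for a, b in zip(dir_a, dir_b))
    norm = math.dist(dir_a, (0, 0, 0)) * math.dist(dir_b, (0, 0, 0))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    return max(0.0, 1.0 - angle / fov_deg)

def retrieve_memory_frames(history_dirs, current_dir, k=2):
    """Keep only the k historical frames whose views overlap most with the
    current camera, instead of conditioning on the full unbounded history."""
    scores = [view_overlap(d, current_dir) for d in history_dirs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Three past views; the first and third point roughly where the camera looks now.
history = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.9, 0.1, 0.0)]
print(retrieve_memory_frames(history, (1.0, 0.0, 0.0)))  # [0, 2]
```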
Just In: ByteDance Open-Sources the Seed-OSS-36B Models, with a 512k Context Window
机器之心· 2025-08-21 01:03
Core Viewpoint
- ByteDance's Seed team has officially released and open-sourced the Seed-OSS series, comprising three versions: Seed-OSS-36B-Base (with synthetic data), Seed-OSS-36B-Base (without synthetic data), and Seed-OSS-36B-Instruct, all trained on 12 trillion tokens and achieving excellent performance on various benchmarks [1][2].

Model Features
- The Seed-OSS-36B architecture combines causal language modeling, Grouped Query Attention, the SwiGLU activation function, RMSNorm, and RoPE positional encoding [4].
- Each model contains 36 billion parameters distributed across 64 layers and supports a vocabulary of 155,000 tokens [5].
- A notable feature is native long-context capability, with a maximum context length of 512k tokens, allowing long documents and reasoning chains to be processed without performance loss [6][7].

Inference Budget Control
- The model introduces inference budget control, allowing developers to specify how much reasoning the model should perform before giving an answer [10].
- This design lets teams trade performance against deployment efficiency according to task complexity [12].
- Recommended budget values are multiples of 512 tokens; a budget of 0 means the answer is output directly [13][26].
- If no inference budget is set, the model defaults to unlimited-length reasoning [27].

Benchmark Performance
- The Seed-OSS-36B-Base model scored 65.1 on MMLU-Pro and 81.7 on MATH, demonstrating competitive performance [15].
- The Seed-OSS-36B-Instruct version achieved state-of-the-art (SOTA) results in several areas, including 91.7% on AIME24 and 67.4 on LiveCodeBench v6 [17].
- In long-context tests, the model reached 94.6 on RULER at a 128K context length, the highest score among open-source models [18].

User Interaction and Token Management
- During operation, the model reports its token usage to the user, improving awareness of resource consumption [25].
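The budget rules described above (multiples of 512 tokens, 0 for a direct answer, unset for unlimited reasoning) can be captured in a small helper. The helper name and rounding policy are assumptions for illustration, not part of the Seed-OSS API:

```python
def effective_budget(requested: int) -> int:
    """Map a requested reasoning budget to the recommended values:
    multiples of 512 tokens; 0 means 'answer directly, no reasoning phase';
    a negative value stands for 'no budget set', i.e. unlimited reasoning."""
    if requested == 0:
        return 0
    if requested < 0:
        return -1  # unlimited-length reasoning (no budget set)
    return max(512, round(requested / 512) * 512)

print(effective_budget(0))    # 0    -> direct answer
print(effective_budget(800))  # 1024 -> nearest multiple of 512
print(effective_budget(100))  # 512  -> smallest recommended budget
```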
Registration Open | Zhongguancun International Youth Forum: Inviting Young Scholars Worldwide to Explore the AI Frontier
机器之心· 2025-08-20 09:47
Core Viewpoint
- Beijing Zhongguancun Academy is a new higher-education and research institution focused on artificial intelligence and interdisciplinary fields, emphasizing disruptive research and the cultivation of practical talent [2][3].

Group 1: Institutional Overview
- Beijing Zhongguancun Academy specializes in education and research innovation in artificial intelligence and interdisciplinary fields, promoting scientific exploration through research projects [3].
- The Zhongguancun Artificial Intelligence Research Institute is a young, exploratory R&D institution aimed at future-oriented scientific exploration with industrial value [3].

Group 2: International Forum
- The "Zhongguancun International Youth Forum," organized by Beijing Zhongguancun Academy and supported by the Zhongguancun Artificial Intelligence Research Institute, invites young talents worldwide in AI and interdisciplinary fields [5].
- Since its establishment in September 2024, the forum has held two sessions, attracting 98 top young scholars from seven countries and covering AI, biotechnology, and interdisciplinary integration [5].

Group 3: Forum Details
- The upcoming forum will take place on September 18-19, 2025, at the Beijing Zhongguancun Academy C5 Research Building [6].
- Key agenda items include invited reports from top scholars, oral presentations by young scholars, roundtable discussions on "AI for Science," and a poster session for knowledge exchange [6][9].

Group 4: Talent Development and Support
- The academy offers a comprehensive talent-introduction policy, including support for project applications, housing subsidies, and children's education [6].
- The institution collaborates with 31 top universities and leading enterprises to run a project-based education model [14].

Group 5: Participation Requirements
- Candidates must hold a PhD in AI or a related interdisciplinary field, have at least two years of experience, and have published in top conferences or journals [15].
- The application deadline is August 27, 2025; interested scholars should submit their materials by email [15].
What Sora Couldn't Do, the LongVie Framework Solves: SOTA in Ultra-Long Video Generation
机器之心· 2025-08-20 09:47
Core Insights
- The article discusses the rapid advances in video generation technology, focusing on the challenge of creating controllable long videos that exceed one minute in length [2][3].

Group 1: Challenges in Long Video Generation
- Current controllable video generation models face significant problems on long videos, including temporal inconsistency and visual degradation [8].
- Temporal inconsistency shows up as disjointed details and flickering between frames, while visual degradation causes color drift and reduced clarity over time [8].

Group 2: Solutions Proposed by the LongVie Framework
- LongVie addresses temporal inconsistency through two key strategies:
  1. Global normalization of control signals, which standardizes control signals across the entire video rather than within individual segments, improving consistency when segments are stitched together [10].
  2. Unified noise initialization, in which all segments share the same initial noise to align the generation distribution, minimizing appearance and detail drift between frames [11].
- To combat visual degradation, LongVie employs multi-modal fine-grained control, integrating dense control signals (such as depth maps) with sparse control signals (such as keypoints) and using a degradation-aware training strategy [16].

Group 3: LongVie Framework Overview
- The LongVie framework normalizes dense and sparse control signals globally and uses unified noise initialization for all segments, generating each segment sequentially while maintaining coherence [20].
- The framework was tested against standard ControlNet and its variants, with the best-performing variant adopted for its superior stability and effectiveness [22].

Group 4: LongVie Capabilities and Benchmarking
- LongVie supports a range of long-video generation tasks, including video editing, style transfer, and mesh-to-video generation [23].
- The authors introduce LongVGenBench, the first standardized benchmark for controllable long-video generation, comprising 100 high-resolution videos longer than one minute, to promote systematic research and fair evaluation in this field [25].
- Quantitative metrics and user evaluations indicate that LongVie outperforms existing methods across multiple indicators, achieving state-of-the-art (SOTA) performance [28].
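The two consistency strategies above can be illustrated with a toy sketch: normalization statistics are computed over the whole video before any segment is processed, and every segment draws its initial noise from one shared seed. This is a 1-D stand-in under those assumptions, not LongVie's actual implementation:

```python
import random

def global_normalize(signal, eps=1e-6):
    """Normalize a control signal using statistics of the WHOLE video rather
    than per segment, so stitched segments share one reference scale
    (a toy 1-D stand-in for dense signals such as depth maps)."""
    mean = sum(signal) / len(signal)
    std = (sum((x - mean) ** 2 for x in signal) / len(signal)) ** 0.5 + eps
    return [(x - mean) / std for x in signal]

def segment_initial_noise(shared_seed, length):
    """Every segment draws its initial noise from the same seed, aligning the
    generation distribution across segments and reducing appearance drift."""
    rng = random.Random(shared_seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

video_signal = [0.1, 0.4, 0.9, 1.3, 2.0, 2.6]  # depth proxy for the full video
normalized = global_normalize(video_signal)
assert abs(sum(normalized)) < 1e-6              # zero mean over the whole video
assert segment_initial_noise(7, 8) == segment_initial_noise(7, 8)  # shared noise
```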
A "Free Lunch" for dLLMs! Zhejiang University and Ant Group Use Intermediate Results to Significantly Improve Diffusion Language Models
机器之心· 2025-08-20 04:26
The first author of this article, Wang Wen, is a PhD student at Zhejiang University whose research covers multimodal understanding and generation. The corresponding author, Shen Chunhua, is a Qiushi Chair Professor at Zhejiang University; his main research topics include embodied intelligence, reasoning enhancement for large models, reinforcement learning, and general perception models.

In recent years, diffusion large language models (dLLMs) have rapidly emerged as a new force in text generation. Unlike traditional autoregressive (AR) models that generate from left to right, token by token, dLLMs rely on an iterative denoising mechanism: they can generate multiple tokens at once and show distinctive advantages in dialogue, reasoning, and creative tasks. While a traditional LLM is still squeezing out its answer one token at a time, a dLLM has already produced the complete result within a few iterations, delivering unprecedented generation efficiency.

Speed, however, does not guarantee a perfect answer. Existing dLLM decoding strategies typically look only at the output of the final iteration, directly discarding the rich semantic and reasoning information contained in the intermediate iterations. These overlooked intermediate predictions may in fact hold answers that are more accurate and closer to the truth. Discarding them not only wastes information but may also cost the model its best chance of getting the problem right.

Even more surprisingly, on mathematical reasoning tasks the research team observed a "right first, then wrong" phenomenon: the model first arrives at ...
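One simple way to exploit those discarded intermediate results, shown here purely as a hypothetical illustration rather than the paper's method, is to extract an answer at every denoising iteration and take a majority vote instead of trusting only the final iteration:

```python
from collections import Counter

def answer_by_vote(per_iteration_answers):
    """Aggregate the answer extracted at each denoising iteration by majority
    vote, instead of keeping only the final iteration's output."""
    return Counter(per_iteration_answers).most_common(1)[0][0]

# "Right first, then wrong": the correct answer dominates the middle
# iterations, but the last iteration drifts away from it.
trace = ["42", "17", "17", "17", "41"]
print(trace[-1])              # 41 -> what final-iteration decoding would return
print(answer_by_vote(trace))  # 17 -> recovered from the intermediate results
```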
Is DiT Mathematically and Formally Wrong? Saining Xie Responds: Don't Do Science in Your Head
机器之心· 2025-08-20 04:26
Core Viewpoint
- The article discusses criticisms of the DiT model, highlighting potential architectural flaws and a new method called TREAD that markedly improves training efficiency and image-generation quality compared to DiT [1][4][6].

Group 1
- A recent post on X claims that DiT has architectural defects, sparking wide discussion [1].
- Applied to the DiT backbone, the TREAD method achieves training speedups of 14× to 37× as measured by FID, indicating better generation quality per unit of training [2][6].
- The post argues that DiT's FID plateaus too early in training, suggesting "latent architectural defects" that keep the model from learning further from the data [4].

Group 2
- TREAD uses a "token routing" mechanism to improve training efficiency without altering the model architecture: a partial token set is saved to preserve information and reduce computational cost [6].
- Saining Xie, an author of the original DiT paper, acknowledges the criticisms and stresses the importance of experimental validation over theoretical assertion [28][33].
- Xie also notes that DiT's architecture has some genuine flaws, particularly its use of post-layer normalization, which is known to be unstable for tasks with large variations in numerical range [13][36].

Group 3
- The article notes that DiT relies on a simple MLP network to process critical conditioning data, which limits its expressive power [16].
- Xie argues that DiT's real problem lies in its sd-vae component, which is inefficient and has long been overlooked [36].
- The ongoing debate around DiT reflects the iterative nature of algorithmic progress, in which existing models are continuously questioned and improved [38].
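Token routing can be sketched in miniature: only a subset of tokens passes through the expensive middle layers, while the rest bypass them and are merged back in position afterwards. A toy stand-in under those assumptions, not TREAD's actual mechanism:

```python
import random

def routed_forward(tokens, keep_ratio=0.5, seed=0):
    """Toy sketch of token routing: a random subset of tokens is processed by
    the expensive middle layers; the rest bypass them unchanged and are
    re-inserted at their original positions afterwards."""
    rng = random.Random(seed)
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    routed = set(idx[: int(len(tokens) * keep_ratio)])
    middle_layer = lambda x: x * 2  # stand-in for the transformer blocks saved
    return [middle_layer(t) if i in routed else t for i, t in enumerate(tokens)]

out = routed_forward([1, 2, 3, 4])
assert len(out) == 4  # every token re-emerges after the routed section
assert all(o in (t, 2 * t) for o, t in zip(out, [1, 2, 3, 4]))
```

Because the bypassed tokens skip the middle blocks entirely, compute per training step drops roughly in proportion to `keep_ratio`, which is where the reported training speedups come from.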
Forum Registration Is Open: Lock In Your Seat! Decoding the Deployment Challenges and Industry Inflection Points of Embodied Intelligence
机器之心· 2025-08-20 04:26
Core Insights
- The article argues that embodied intelligence is becoming the core battlefield of the next round of technological competition, a significant step in bringing digital intelligence into the physical world [2][5].
- It highlights the field's rapid advances in recent months, from robots performing at public events to competition challenges, while asking how far true cross-scenario deployment still is [2][5].
- It stresses the need to overcome core bottlenecks, particularly generalization, so that robots can operate effectively in dynamic environments and create sustainable commercial value [2][5].

Event Overview
- The 2025 Inclusion·Bund Conference will take place from September 10 to 13, 2025, in Shanghai, with embodied intelligence as a focus [3].
- A forum titled "Embodied Intelligence: From Generalization to Action, Reshaping the Future of Industries" will be held on September 11, featuring discussions and presentations from industry leaders and experts [3][4].

Forum Agenda
- The forum will include keynote speeches, thematic presentations, and roundtable discussions on the technological innovations robots need to achieve true generalization and actionable capability [5][9].
- Notable speakers include experts from Tsinghua University, NVIDIA, and various robotics companies, covering efficient data simulation, the next steps for embodied intelligence, and commercialization pathways [8][9][12][13][15][19].
ICCV 2025 | Crossing the Boundary Between Vision and Language, Opening a New Chapter in Interaction Perception: Peking University Team Proposes the INP-CC Model to Reshape Open-Vocabulary HOI Detection
机器之心· 2025-08-20 00:15
Core Viewpoint
- The article presents a novel open-vocabulary human-object interaction (HOI) detection method, Interaction-aware Prompt and Concept Calibration (INP-CC), which improves interaction understanding in open-world scenarios by dynamically generating interaction-aware prompts and calibrating concepts [2][4][5].

Summary by Sections

Introduction to HOI Detection
- Current HOI detection methods are limited to closed environments and struggle to identify new interaction types, which restricts their practical applications [6].
- The rise of multimodal large models offers significant potential in open environments, making their application to HOI detection a research focal point [6].

Innovations of INP-CC
- INP-CC introduces two core innovations, interaction-aware prompt generation and concept calibration, which help the model understand complex interaction semantics [7][16].
- The model selectively shares prompts among similar interactions, improving learning efficiency [7].

Model Architecture
- INP-CC uses an interaction-adaptive prompt generator to dynamically construct relevant prompts from the characteristics of the input image, sharpening the model's focus on key interaction regions [14].
- The model generates detailed visual descriptions of interactions and clusters them into a fine-grained conceptual structure, aiding the understanding of complex interactions [14][20].

Experimental Performance
- INP-CC outperforms existing methods on the HICO-DET and SWIG-HOI datasets, achieving a mean Average Precision (mAP) of 16.74% on the SWIG-HOI full test set, nearly a 10% improvement over the previous method CMD-SE [18][22].
- Visual analysis shows the model attends effectively to critical interaction regions [23].

Conclusion
- INP-CC breaks through the limitations of pre-trained vision-language models in regional perception and concept understanding, showing the potential of integrating language-model knowledge into computer vision tasks [25].
Meta Superintelligence Labs Reorganized into Four Divisions; Some Executives Will Depart
机器之心· 2025-08-20 00:15
Core Viewpoint
- Meta is restructuring its Superintelligence Labs (MSL) and other AI departments into four new divisions focused on AI research, infrastructure, hardware, and product integration, aiming to advance its long-term superintelligence goals [2][3][4].

Group 1: Organizational Changes
- MSL and earlier AI departments such as FAIR will be split into smaller units to focus on the key areas needed to achieve superintelligence [3].
- Alexandr Wang, the new Chief AI Officer, emphasized that the restructuring is necessary to take superintelligence seriously [4].
- The restructuring is expected to cause some internal turmoil, with reports that some executives may leave the company following the changes [7].

Group 2: Talent Acquisition and Investment
- Meta has been aggressively recruiting top talent from companies including OpenAI, Anthropic, GitHub, and Google DeepMind, with no sign of the trend slowing [5].
- In June, Meta invested $14 billion in Scale AI and appointed Scale's CEO, Alexandr Wang, as its new Chief AI Officer [5].
- OpenAI's CEO, Sam Altman, accused Meta of offering $100 million packages to poach its employees [5].

Group 3: Financial Commitment to AI
- Meta CEO Mark Zuckerberg has made AI and superintelligence central to the company's long-term vision [9].
- CFO Susan Li indicated that capital expenditures could reach $72 billion by the end of the year, driven primarily by AI-related infrastructure [9].
- Zuckerberg expressed optimism that superintelligence could accelerate human progress and empower individuals to improve the world [10].