Let the model find key frames and visual cues on its own: Xiaohongshu's Video-Thinker cracks the video reasoning impasse
机器之心· 2026-01-02 03:12
Core Insights
- The article introduces the "Thinking with Videos" paradigm and the Video-Thinker model, which enhances a model's ability to autonomously navigate and understand temporal sequences in videos [2][6][10].

Group 1: Model Development and Methodology
- Video-Thinker integrates "temporal grounding" and "visual captioning" into the model's cognitive chain, eliminating reliance on external tools and enabling the model to autonomously identify key frames and extract visual cues [2][10].
- The research team constructed the Video-Thinker-10K dataset of 10,000 high-quality samples and employed a two-phase "supervised fine-tuning + reinforcement learning" training strategy to strengthen the model's self-exploration and self-correction capabilities [3][10].
- With only 7 billion parameters, the model achieved state-of-the-art (SOTA) performance on several challenging video reasoning benchmarks, significantly surpassing existing baselines [3][22].

Group 2: Data Quality and Training Process
- High-quality training data is crucial for developing complex reasoning capabilities; six major datasets were integrated into Video-Thinker-10K, combining precise temporal annotations with detailed visual descriptions [12][13].
- Training follows a structured thinking paradigm in which the model learns to output specific labels such as <time> and <caption>, enforcing a rigorous "locate - perceive - reason" sequence [16][18].
- The reinforcement learning phase, using Group Relative Policy Optimization (GRPO), let the model explore and optimize its reasoning strategies, producing emergent cognitive behaviors akin to human metacognition [19][22].

Group 3: Performance Evaluation
- Video-Thinker-7B showed significant advantages across video reasoning benchmarks, establishing a new SOTA among 7-billion-parameter models [25][29].
- Performance was evaluated both in-domain and out-of-domain, demonstrating effective generalization to unseen scenarios [24][29].
- The model reached 43.22% accuracy on the Video-Holmes benchmark and 80.69% on VRBench, outperforming previous models by notable margins [29][30].

Group 4: Key Findings and Implications
- The model's success is attributed to its internal grounding and captioning capabilities, which were quantitatively assessed and found superior to those of baseline models [32][36].
- Experiments showed that simple plug-and-play external tools did not enhance, but rather degraded, the model's reasoning, indicating that reliance on external tools can hinder performance [34][35].
- The article concludes that integrating core internal capabilities, rather than depending on ever-larger parameters and datasets, represents a new paradigm in video reasoning with potential applications across industries [39].
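The GRPO phase described above scores a group of sampled rollouts for the same question and normalizes each reward against its group. A minimal sketch of that group-relative advantage; the reward values (accuracy plus a small format bonus for well-formed <time>/<caption> tags) are illustrative assumptions, not the paper's stated reward design:

```python
# Minimal sketch of the group-relative advantage used in GRPO
# (hypothetical rewards; not Video-Thinker's actual training code).

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# A group of 4 rollouts for one video question, scored 0/1 for accuracy
# plus a hypothetical 0.1 format bonus for valid <time>/<caption> tags.
rewards = [1.1, 0.1, 1.0, 0.0]
advs = grpo_advantages(rewards)
print([round(a, 3) for a in advs])
```

Rollouts above the group mean get positive advantage and are reinforced; those below are suppressed, so no learned value model is needed.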
Meta's heavyweight: SSR-tier research that frees agents from the bottleneck of human knowledge, on the road to autonomous AI
机器之心· 2026-01-02 03:12
Core Viewpoint
- Meta is pursuing the ambitious goal of "superintelligent" AI: autonomous AI systems that surpass human expert levels. The initiative has drawn skepticism from experts such as Yann LeCun, who considers the current path to superintelligence impractical [1].

Group 1: SSR Methodology
- Self-play SWE-RL (SSR) is introduced as a new approach to training superintelligent software agents that can learn and improve without relying on existing problem descriptions or human supervision [2][4].
- SSR leverages self-play, similar to AlphaGo, letting software agents interact with real code repositories to autonomously generate learning experiences [2][4].
- The SSR framework assumes only access to sandboxed code repositories with source code and dependencies, eliminating the need for manually annotated issues or test cases [4].

Group 2: Bug Injection and Repair Process
- The framework involves two roles: a bug-injection agent that introduces bugs into a codebase, and a bug-solving agent that generates patches to fix them [8][9].
- The bug-injection agent creates artifacts that intentionally introduce bugs, which are then verified for consistency to ensure they are reproducible [9][11].
- The bug-solving agent generates patches for the defined bugs, with success determined by the results of the tests associated with those bugs [11][12].

Group 3: Performance Evaluation
- Experiments show that SSR achieves stable, continuous self-improvement even without task-related training data, indicating that large language models can enhance their software engineering capabilities through interaction with raw code repositories [17].
- SSR outperforms traditional reinforcement learning baselines on two benchmarks, with gains of +10.4% and +7.8% respectively, highlighting the effectiveness of self-generated learning tasks over manually constructed data [17].
- Ablation studies show the self-play mechanism is crucial: it continuously generates a dynamic task distribution that enriches the training signal [19][20].

Group 4: Implications for AI Development
- SSR is a significant step toward autonomous AI systems that learn and improve without direct human supervision, addressing fundamental scalability limits in current AI development [21][22].
- The ability of large language models to generate meaningful learning experiences from real-world software repositories opens training possibilities beyond human-curated datasets, potentially yielding more diverse and challenging scenarios [22].
- As AI systems become more capable, learning autonomously from real-world environments becomes essential for building agents that solve complex problems effectively [25].
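The two-role loop above can be illustrated with a deliberately tiny toy: the "repository" is one arithmetic function, the injector swaps an operator and verifies the bug is reproducible, and the solver is rewarded when the bug's tests pass again. All names and the reward scheme are illustrative stand-ins, not the SSR implementation:

```python
# Toy sketch of SSR's inject-then-solve self-play loop. The repo is a
# single function and both agents are trivial stand-ins; the structure
# (inject -> verify reproducibility -> solve -> score by tests) follows
# the summary, everything else is an illustrative assumption.

def original_add(a, b):
    return a + b

def run_tests(fn):
    """Tests tied to the injected bug: pass iff correct behavior is restored."""
    return fn(2, 3) == 5 and fn(-1, 1) == 0

def inject_bug(fn):
    """Bug-injection agent: produce a buggy variant of the repo.
    Here it swaps + for -, then checks the bug is actually reproducible."""
    def buggy(a, b):
        return a - b
    assert not run_tests(buggy)  # consistency check: tests must fail on the bug
    return buggy

def solve_bug(buggy_fn, candidates):
    """Bug-solving agent: try candidate patches; reward = tests pass."""
    for patch in candidates:
        if run_tests(patch):
            return patch, 1.0  # solved: positive reward for the solver
    return buggy_fn, 0.0       # unsolved: the injector "wins" this round

buggy = inject_bug(original_add)
patched, reward = solve_bug(buggy, candidates=[lambda a, b: a * b,
                                               lambda a, b: a + b])
print(reward)  # -> 1.0
```

The adversarial pairing is what keeps the task distribution dynamic: as the solver improves, only harder injected bugs remain rewarding for the injector.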
The "drop out to start up" wind sweeps Silicon Valley again, but the real variable was never the degree
机器之心· 2026-01-02 03:12
Core Viewpoint
- The trend of "dropping out to start a business" is regaining traction in Silicon Valley, with many founders presenting their dropout status as a positive credential in the venture capital community [3][4].

Group 1: The Trend of Dropping Out
- More founders at Y Combinator's Demo Day are highlighting their dropout experience, which has become a badge of honor signaling commitment to entrepreneurship [4].
- The urgency to capitalize on the AI startup boom is driving some students to abandon their studies, believing that staying for a degree may hurt their chances of securing funding [5].
- Some investors are skeptical of the extreme dropout trend, arguing that the value of a college network and brand remains significant even for those who do not graduate [7].

Group 2: Perspectives on Age and Experience
- While many young founders drop out, some investors prefer older founders whose wisdom comes from experience and setbacks, viewing this as the more valuable trait [8].
- Despite the trend, many leading AI entrepreneurs chose to complete their education, indicating that a degree still holds value in the industry [9].

Group 3: The Nature of Dropping Out
- The concept of "dropping out" has evolved: many who leave school continue their entrepreneurial pursuits inside resource-rich environments [10].
- Ultimately, success depends on a founder's ability to leverage the right resources and networks at the right time, not on whether they hold a degree [12].
The father of LSTM leads a team to build PoPE: ending RoPE's generalization problem with a polar-coordinate evolution of the Transformer
机器之心· 2026-01-02 01:55
Core Viewpoint
- The article presents Polar Coordinate Position Embedding (PoPE), a new approach that addresses the limitations of Rotary Position Embedding (RoPE) in Transformer architectures, particularly by decoupling content and positional information for better model performance [1][2].

Group 1: RoPE Issues
- RoPE entangles content and position information, which can degrade model performance, especially on tasks that require matching these factors independently [1][4].
- RoPE is the preferred way to inject positional information in many advanced models, yet it struggles on tasks that require a clean separation of content and position [5][19].

Group 2: PoPE Solution
- PoPE removes the conflation of content and position, yielding significantly better performance on diagnostic tasks that index by content alone or by position alone [2][10].
- PoPE defines the attention score differently, in a way that decouples content and position and improves learning efficiency [12][13].

Group 3: Performance Comparison
- On indirect indexing tasks, PoPE reached an average accuracy of 94.82% versus RoPE's 11.16%, demonstrating its superior ability to separate content and positional information [18][19].
- On music and genomic sequence modeling, PoPE outperformed RoPE with lower negative log-likelihood (NLL) across datasets [20][22].
- On language modeling over OpenWebText, PoPE consistently achieved lower perplexity than RoPE at every model size [25][26].

Group 4: Generalization and Stability
- PoPE extrapolates strongly without fine-tuning or interpolation, and its performance remains stable as model size increases, unlike RoPE [31][32].
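The summary does not reproduce PoPE's polar-coordinate formula, so as background, here is a minimal pure-Python sketch of standard RoPE that exhibits the entanglement PoPE is said to remove: the rotated dot product depends only on the relative offset, but even a pure content match (query equals key) still shifts with position. The vectors and base `theta` are illustrative:

```python
# Background sketch: how RoPE couples content and position in the attention
# score. This is standard RoPE, not PoPE's formula (which the summary omits).
import math

def rope_rotate(x, pos, theta=10000.0):
    """Rotate feature pairs (x[i], x[i+d/2]) by angles pos / theta^(2i/d)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        ang = pos / theta ** (2.0 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 1.1, 0.2, -0.8]
k = [1.0, 0.4, -0.6, 0.9, 0.1, -1.3, 0.5, 0.7]

# The score depends only on the relative offset m - n ...
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))
assert abs(s1 - s2) < 1e-9  # same offset (2) -> same score

# ... but content and position enter through the same product, so a pure
# content match (q against q) still changes with offset: the entanglement.
s_same = dot(rope_rotate(q, 5), rope_rotate(q, 5))
s_shift = dot(rope_rotate(q, 5), rope_rotate(q, 9))
assert abs(s_same - s_shift) > 1e-3
```

A decoupled scheme would let a content-only match score identically at every offset, which is what the 94.82% vs 11.16% indirect-indexing gap above reflects.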
Farewell to the shackles of the KV cache: with long context compressed into weights, is there hope for continually learning large models?
机器之心· 2026-01-02 01:55
Core Viewpoint
- The article discusses progress toward AGI (Artificial General Intelligence) and emphasizes continuous learning, in which AI learns new knowledge and skills through interaction with its environment [1].

Group 1: TTT-E2E Development
- A collaborative team from Astera, NVIDIA, Stanford University, UC Berkeley, and UC San Diego proposed TTT-E2E (End-to-End Test-Time Training), a significant step toward AGI that recasts long-context modeling from an architectural design into a learning problem [2].
- TTT-E2E aims to overcome the limitation of traditional models that remain static during inference, allowing dynamic learning at test time [9][10].

Group 2: Challenges in Long Context Modeling
- Long-context modeling faces a dilemma: the full attention mechanism of Transformers performs well on long texts, but its inference cost grows steeply with length [5].
- Alternatives such as RNNs and state-space models (SSMs) have constant per-token compute but often degrade on very long texts [5][6].

Group 3: TTT-E2E Mechanism
- TTT-E2E defines test-time behavior as an online optimization process: before predicting the next token, the model performs self-supervised learning on the tokens it has already read [11].
- Meta-learning is used to optimize the model's initialization parameters, so the model learns how to learn effectively [13].
- A hybrid architecture pairs sliding-window attention for short-term memory with a dynamically updated MLP layer for long-term memory, mimicking biological memory systems [13][14].

Group 4: Experimental Results
- TTT-E2E scales comparably to full-attention Transformers, maintaining consistent loss as context length grows from 8K to 128K [21].
- In inference efficiency, TTT-E2E has a clear advantage: at 128K context it processes tokens 2.7 times faster than full-attention Transformers [22].

Group 5: Future Implications
- TTT-E2E marks a shift from static models to dynamic individuals: processing a long document becomes a micro self-evolution [27].
- This "compute-for-storage" approach envisions models that continuously adjust themselves while processing vast amounts of information, potentially encapsulating the history of human civilization within their parameters without hardware limitations [29].
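The online-optimization idea in Group 3 can be caricatured with a toy stand-in: a bigram count table plays the role of the dynamically updated long-term memory, and a count update stands in for the self-supervised gradient step taken on already-read tokens. Everything here (the stream, the update rule, the "learning rate") is an illustrative assumption, not TTT-E2E's actual mechanism:

```python
# Toy caricature of test-time training: before each prediction, "train"
# on the tokens read so far, compressing context into mutable weights
# instead of retaining it in a growing KV cache.

def update_step(weights, context, lr=0.5):
    """Stand-in for one self-supervised step: reinforce observed bigrams."""
    for prev, nxt in zip(context, context[1:]):
        row = weights.setdefault(prev, {})
        row[nxt] = row.get(nxt, 0.0) + lr
    return weights

def predict(weights, token):
    """Predict the most reinforced successor of `token`."""
    row = weights.get(token)
    return max(row, key=row.get) if row else None

weights = {}  # mutable "long-term memory", updated during inference
stream = ["a", "b", "a", "b", "a"]
for t in range(2, len(stream) + 1):
    weights = update_step(weights, stream[:t])  # learn on what was read
print(predict(weights, "a"))  # -> 'b'
```

The point of the caricature: after the stream is consumed, the context lives entirely in `weights`, so per-token prediction cost no longer depends on context length, mirroring the 2.7x speedup claim at 128K.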
Redefining temporal grounding for video large models: Nanjing University and Tencent jointly propose TimeLens, a full upgrade of data and algorithms
机器之心· 2026-01-02 01:55
Core Insights
- The rapid development of multimodal large language models (MLLMs) has improved video understanding, but a significant limitation remains in determining "when" events occur in videos, the task known as Video Temporal Grounding (VTG) [2].
- The research team from Nanjing University, Tencent ARC Lab, and Shanghai AI Lab introduced TimeLens, which addresses shortcomings in existing evaluation benchmarks and contributes a reliable assessment framework plus high-quality training data [2][29].

Data Quality Issues
- Existing VTG benchmarks such as Charades-STA, ActivityNet Captions, and QVHighlights contain numerous annotation errors, including vague descriptions and incorrect time-boundary markings [7].
- The high error rate in these benchmarks leads to unreliable evaluations that overestimate the capabilities of open-source models [11].

TimeLens-Bench
- To rectify these issues, the team created TimeLens-Bench, a high-quality evaluation benchmark that accurately reflects models' temporal grounding capabilities [11].
- Comparing TimeLens-Bench with the original benchmarks revealed that previous evaluations significantly overestimated open-source models while obscuring the true performance of proprietary models [11].

High-Quality Training Data: TimeLens-100K
- The team developed TimeLens-100K, a large-scale, high-quality training dataset built through an automated cleaning and re-labeling pipeline, which has been shown to significantly enhance model performance [13].

Algorithm Design Best Practices
- Extensive ablation studies yielded effective design practices for VTG tasks, focusing on timestamp encoding and training paradigms [15].
- The optimal timestamp encoding identified is the Interleaved Textual Encoding strategy, which is simple to implement yet achieves superior results [17].
- The Thinking-free RLVR training paradigm proved the most efficient, letting models output localization results directly without a complex reasoning process [19][21].

Key Training Techniques
- Early stopping is crucial in RL training: continuing past a plateau in reward metrics can degrade model performance [23].
- Difficulty-based sampling, which selects challenging training samples, is essential for maximizing performance during RLVR training [23].

Performance Validation
- The TimeLens-8B model delivered exceptional results, surpassing open-source models such as Qwen3-VL and outperforming proprietary models including GPT-5 and Gemini-2.5-Flash on multiple core metrics [27][28].
- This underscores the potential of smaller open-source models to compete with larger proprietary models through systematic improvements in data quality and algorithm design [28].

Contributions and Future Directions
- TimeLens establishes a new SOTA open-source model and provides methodologies and design blueprints for future research in video temporal grounding [29].
- The code, models, training data, and evaluation benchmarks have been open-sourced to facilitate further VTG advances [30].
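The summary does not state TimeLens's exact scoring protocol; the standard metric for video temporal grounding is temporal IoU between the predicted and annotated (start, end) segments, commonly reported at thresholds such as R@0.5. A minimal sketch (the 0.5 threshold convention is an assumption here, not TimeLens's stated protocol):

```python
# Temporal IoU, the standard VTG metric: overlap of predicted and
# ground-truth time intervals divided by their union.

def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted segment [12s, 20s] against an annotated [10s, 18s] segment:
iou = temporal_iou((12.0, 20.0), (10.0, 18.0))
print(round(iou, 3))  # -> 0.6
hit_at_05 = iou >= 0.5  # counts as correct under an R@0.5 threshold
```

This metric also makes the data-quality point above concrete: if the annotated boundaries themselves are wrong, IoU penalizes correct predictions and rewards predictions matching the faulty labels.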
Just in: a DeepSeek New Year's Day paper, with Liang Wenfeng among the authors, set to open a new chapter in architecture
机器之心· 2026-01-01 08:22
Core Viewpoint
- DeepSeek has introduced Manifold-Constrained Hyper-Connections (mHC), a new architecture that addresses the instability of traditional hyper-connections during large-scale model training while retaining their significant performance gains [1][3][4].

Group 1: Introduction of mHC
- mHC extends the Transformer's single residual stream into a multi-stream parallel architecture, using the Sinkhorn-Knopp algorithm to constrain the connection matrix to the manifold of doubly stochastic matrices [1][4].
- Its core objective is to retain the performance improvement from widening the residual stream while fixing training instability and excessive memory consumption [4][6].

Group 2: Challenges with Traditional Hyper-Connections
- Traditional residual connections ensure stable signal transmission through identity mapping, but the restricted width of the information channel limits them [3][6].
- Recent methods such as Hyper-Connections (HC) improve performance but introduce significant training instability and extra memory-access overhead [3][6].

Group 3: Methodology of mHC
- mHC projects the residual connection space onto a specific manifold to restore the identity-mapping property, while the infrastructure is optimized for efficiency [4][9].
- The Sinkhorn-Knopp algorithm projects the connection matrix onto the Birkhoff polytope, ensuring stable signal propagation [4][10].

Group 4: Experimental Validation
- Empirical results show that mHC resolves the stability issues and scales well: on a 27-billion-parameter model it increases training time by only 6.7% while delivering significant performance improvements [4][29].
- In benchmark tests, mHC consistently outperformed both baseline models and HC across downstream tasks, confirming its effectiveness in large-scale pre-training [30][31].

Group 5: Infrastructure Design
- DeepSeek built tailored infrastructure for mHC, including kernel fusion, selective recomputation, and enhanced communication strategies that minimize memory overhead and improve computational efficiency [17][21][23].
- Design choices such as reordering operations and mixed-precision strategies further contribute to mHC's overall efficiency [17][18].
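The Sinkhorn-Knopp projection mentioned above can be sketched in a few lines: alternately normalize the rows and columns of a positive matrix until it is approximately doubly stochastic, i.e. a point on the Birkhoff polytope. The toy matrix and iteration count are illustrative; this is not DeepSeek's fused kernel:

```python
# Sketch of the Sinkhorn-Knopp iteration: row/column normalization of a
# positive square matrix converges to a doubly stochastic matrix.

def sinkhorn_knopp(mat, iters=50):
    """Project a positive square matrix toward a doubly stochastic one."""
    m = [row[:] for row in mat]
    n = len(m)
    for _ in range(iters):
        for i in range(n):                       # normalize each row
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):                       # normalize each column
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

ds = sinkhorn_knopp([[2.0, 1.0], [1.0, 3.0]])
row_sums = [sum(row) for row in ds]
col_sums = [sum(ds[i][j] for i in range(2)) for j in range(2)]
# Rows and columns both sum to 1, so the cross-stream mixing weights
# neither amplify nor attenuate the residual signal as depth grows.
```

Because every row and column of a doubly stochastic matrix sums to 1, the identity matrix is a fixed point, which is why the constraint restores the identity-mapping property that plain hyper-connections lose.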
OpenDataArena's fully upgraded version officially launches: four core modules reshape the landscape of data value evaluation
机器之心· 2026-01-01 08:22
To address the long-standing difficulty, in both academia and industry, of quantifying the value of data, the OpenDataLab team at the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) open-sourced OpenDataArena (ODA) this August: the first comprehensive, impartial platform for evaluating the value of post-training data. The project aims to turn data selection from the blind trial-and-error of "alchemy" into a rigorous science that is reproducible, analyzable, and cumulative.

In the months since the initial release, the project has undergone intensive technical validation and feature polishing through deep use by the team and a small circle of community users. With the continued expansion of its evaluation scale, toolchain, and analysis capabilities, ODA has now received a comprehensive upgrade: an official release with more systematic conclusions, more complete features, and more diverse perspectives, now open to all developers.

Project homepage: https://opendataarena.github.io/
Open-source tools: https://github.com/OpenDataArena/OpenDataArena-Tool
Datasets: https://huggingface.co/OpenDataArena/datasets
Report: https://arxiv.org/pdf/2512.14051

ODA's core philosophy is clear: data value must ...
Google's three-year comeback: faint clues planted far in advance
机器之心· 2026-01-01 04:33
Core Viewpoint
- OpenAI has declared a "red alert" and shifted its focus to improving ChatGPT, signaling a significant change in the competitive landscape of AI technology [2][10].

Group 1: Google's Response to Competition
- In late 2022, Google faced a major challenge as ChatGPT rapidly gained users, prompting internal concern that the company was missing opportunities despite having advanced technology [13][16].
- Google has since launched several advanced models, including Gemini 3, and regained its technological edge in AI [10][11].
- The company shifted from a conservative approach to an agile one, prioritizing rapid product development and iteration under competitive pressure [23][25].

Group 2: Organizational Changes
- Google undertook significant restructuring, cutting management layers by roughly 35% to improve communication and decision-making efficiency [26].
- It adopted a startup-like rapid-iteration model for product development, enabling quicker responses to user feedback [27][28].
- Founders Larry Page and Sergey Brin returned to active roles in AI projects, emphasizing urgency and direct involvement in technical development [41][46].

Group 3: Talent Acquisition and Retention
- Google runs a "boomerang program" to rehire former employees; about 20% of its new AI engineers are returning staff [58].
- The company has invested heavily to attract top talent, including a reported $2.7 billion payment to bring back Noam Shazeer, a key figure in AI development [62].
- Compensation reforms tie rewards to product performance metrics, aiming to retain high-performing AI talent [67].

Group 4: Ongoing Competition
- Despite Google's resurgence, competition remains fierce, with OpenAI and companies like Anthropic and Meta continuing to innovate and attract talent [71][72].
- The focus of competition is shifting from technology development to application integration and building sustainable AI ecosystems [72].
- The AI race is ongoing, with no guaranteed long-term leader, underscoring the dynamic nature of the industry [73].
For a systematic understanding of Deep Research, this one survey is enough
机器之心· 2026-01-01 04:33
Core Insights
- The article surveys the evolution of Deep Research (DR) as a new direction in AI, moving beyond simple dialogue and creative-writing applications to complex research-oriented tasks. It highlights the limitations of traditional retrieval-augmented generation (RAG) and positions DR as a solution for multi-step reasoning and long-horizon research [2][30].

Summary by Sections

Definition of Deep Research
- DR is not a specific model or technology but a progressive capability pathway for research-oriented agents, evolving from information retrieval to complete research workflows [5].

Stages of Capability Development
- **Stage 1: Agentic Search** - Models gain the ability to search actively and retrieve information dynamically based on intermediate results, focusing on efficient information acquisition [5].
- **Stage 2: Integrated Research** - Models learn to understand, filter, and integrate multi-source evidence, producing coherent reports [6].
- **Stage 3: Full-stack AI Scientist** - Models can propose research hypotheses, design and execute experiments, and reflect on results, emphasizing depth of reasoning and autonomy [6].

Core Components of Deep Research
- **Query Planning** - Deciding what information to query next, with dynamic adjustment across multiple research rounds [10].
- **Information Retrieval** - Deciding when to retrieve, what to retrieve, and how to filter results to avoid redundancy and ensure relevance [12][13][14].
- **Memory Management** - Essential for long-horizon reasoning, covering memory consolidation, indexing, updating, and forgetting [15].
- **Answer Generation** - Enforcing logical consistency between conclusions and evidence, requiring integration of multi-source evidence [17].

Training and Optimization Methods
- **Prompt Engineering** - Multi-step prompts guide the model through research processes, though effectiveness depends heavily on prompt design [20].
- **Supervised Fine-tuning** - High-quality reasoning trajectories are used for training, though annotated data is costly to acquire [21].
- **Reinforcement Learning for Agents** - Directly optimizes decision-making strategies in multi-step processes without complex annotation [22].

Challenges in Deep Research
- **Coordination of Internal and External Knowledge** - Balancing reliance on internal reasoning against external retrieval is crucial [24].
- **Stability of Training Algorithms** - Long-horizon training often suffers policy degradation, limiting exploration of diverse reasoning paths [24].
- **Evaluation Methodology** - Reliable evaluation of research-oriented agents remains an open question, and existing benchmarks need further development [25][27].
- **Memory Module Construction** - Balancing memory capacity, retrieval efficiency, and information reliability is a significant challenge [28].

Conclusion
- Deep Research marks a shift from single-turn answer generation to in-depth research on open-ended questions. The field is still in its early stages, and ongoing exploration is needed to build autonomous, trustworthy DR agents [30].
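The Stage 1 "agentic search" loop (plan a query, retrieve, inspect intermediate results, re-plan) can be sketched with a toy corpus. The retrieval scoring, stopping rule, and re-planning heuristic below are all illustrative assumptions, not components from the survey:

```python
# Toy agentic-search loop: retrieve, check sufficiency, re-plan the query
# based on intermediate results, and deduplicate evidence across rounds.

CORPUS = {
    "rag limits": "RAG struggles with multi-step reasoning over sources.",
    "dr stages": "Deep Research evolves from agentic search to AI scientist.",
    "memory": "Long-horizon agents need consolidation and forgetting.",
}

def retrieve(query):
    """Toy retriever: return documents whose key overlaps the query terms."""
    terms = set(query.lower().split())
    return [doc for key, doc in CORPUS.items() if terms & set(key.split())]

def research(question, max_rounds=3):
    evidence, query = [], question
    for _ in range(max_rounds):
        hits = [d for d in retrieve(query) if d not in evidence]  # filter dupes
        evidence.extend(hits)
        if len(evidence) >= 2:               # toy stopping rule: enough sources
            break
        query = query + " stages memory"     # re-plan: broaden the query
    return evidence

report = research("rag limits")
print(len(report))  # -> 3
```

Even this toy shows the three decisions the components above formalize: what to query next (the re-plan step), how to filter retrieved content (the dedupe), and when to stop retrieving and generate an answer (the sufficiency check).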