机器之心
Are video models really reasoning, or just "performing" reasoning? CUHK and others ask: is Chain-of-Frame real?
机器之心· 2025-11-18 18:19
Core Insights
- The article discusses the advancements in video generation models like Veo and Sora, highlighting their emerging capabilities beyond mere synthesis, particularly in reasoning and perception [2][26].
- A new concept, Chain-of-Frame (CoF), is introduced as a visual analogy to the Chain-of-Thought (CoT) in language models, focusing on the sequential generation of video frames to solve problems [2][9].

Research Findings
- A systematic study was conducted by researchers from various universities to evaluate the zero-shot reasoning potential of models like Veo 3, leading to the development of the MME-CoF benchmark, which includes 12 reasoning dimensions [2][18].
- The study revealed that Veo 3 performs well in simple spatial layouts and basic geometric transformations but struggles with complex scenarios, indicating limitations in maintaining global consistency and understanding [13][15][23].

Evaluation Metrics
- The MME-CoF benchmark provides a standardized framework to assess video models' reasoning capabilities, covering 12 dimensions and 59 tasks, with a focus on transforming abstract reasoning tasks into visual challenges [18][29].
- Evaluation results show that most video generation models scored below 2 on a 0-4 scale, indicating a lack of robust reasoning capabilities [21][24].

Conclusions
- The research concludes that current models do not possess independent zero-shot reasoning abilities, relying on data patterns rather than logical deduction [26].
- It emphasizes that strong generation does not equate to strong reasoning, as the models often produce visually plausible results that lack logical coherence [27][28].
- Potential for future development exists: these models could serve as complementary components in a more comprehensive multimodal intelligence system [29].
Why should video-generation prompts be text only? ByteDance & CUHK release Video-As-Prompt
机器之心· 2025-11-18 05:08
Core Insights
- The article introduces a novel semantic-controlled video generation framework called Video-As-Prompt, which allows users to provide a reference video and a semantic description to generate new content, fundamentally unifying the approach to abstract semantic-controlled video generation [3][20].

Group 1: Framework Overview
- Video-As-Prompt leverages a "video reference" paradigm, enabling the model to "clone" specified semantics and apply them to new content, avoiding the complexity of training separate models for each semantic condition [3][20].
- The framework is built on a large-scale dataset, VAP-Data, which includes over 100,000 videos covering more than 100 distinct high-quality semantic conditions, facilitating extensive training and evaluation [15][21].

Group 2: Technical Implementation
- The architecture employs a Mixture-of-Transformers (MoT) approach, combining a frozen video diffusion Transformer (DiT) with a trainable parallel expert Transformer to enhance generalization and prevent catastrophic forgetting during training [11][13].
- By treating reference videos as "video prompts," the framework establishes a unified semantic mapping, significantly improving the model's versatility and user-friendliness [9][10].

Group 3: Performance and Applications
- Video-As-Prompt demonstrates strong performance in overall video quality, text consistency, and semantic coherence, outperforming other open-source baselines and matching the performance of closed-source models [18].
- The framework supports various applications, including driving the same image with different reference videos, and enables zero-shot generation when presented with unseen semantic references [5][18].
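The frozen-backbone-plus-trainable-expert idea behind the Mixture-of-Transformers design can be sketched in a few lines. This is a minimal, framework-free illustration under stated assumptions: each "branch" is reduced to a single linear map, and the two branch outputs are fused additively; the real architecture's layer structure and fusion are more involved.

```python
# Minimal sketch of the Mixture-of-Transformers (MoT) idea: a frozen,
# pretrained branch processes content tokens while a trainable parallel
# "expert" branch processes the reference-video prompt, and their outputs
# are fused. Shapes, names, and additive fusion are illustrative only.

def linear(weights, x):
    """Plain matrix-vector product: stands in for one branch's computation."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

class MixtureOfTransformers:
    def __init__(self, frozen_w, expert_w):
        self.frozen_w = frozen_w  # pretrained weights: never updated
        self.expert_w = expert_w  # parallel expert: the only trainable part

    def forward(self, content_tokens, prompt_tokens):
        # Keeping the pretrained branch frozen preserves the generative
        # prior, which is what prevents catastrophic forgetting.
        base = linear(self.frozen_w, content_tokens)
        # The expert injects semantics "cloned" from the video prompt.
        semantic = linear(self.expert_w, prompt_tokens)
        return [b + s for b, s in zip(base, semantic)]

model = MixtureOfTransformers(
    frozen_w=[[1.0, 0.0], [0.0, 1.0]],  # identity: stands in for the DiT
    expert_w=[[0.5, 0.0], [0.0, 0.5]],
)
out = model.forward([2.0, 4.0], [1.0, 1.0])
print(out)  # [2.5, 4.5]
```

The design choice worth noting is that only the expert branch would receive gradients during training; the base model's behavior on ordinary generation is untouched.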
Song Han and colleagues propose FlashMoBA: 7.4x faster than MoBA, with sequences scaling to 512K without overflow
机器之心· 2025-11-18 05:08
Core Insights
- The article discusses the introduction of a novel attention mechanism called Mixture of Block Attention (MoBA), which applies the principles of mixture-of-experts (MoE) to attention, allowing models to autonomously determine which positions to focus on [2][4].
- MoBA shows significant potential for handling long contexts by allowing queries to sparsely attend to a limited number of key-value blocks, greatly reducing computational costs [3][4].
- The article identifies performance challenges associated with smaller block sizes in MoBA implementations and introduces FlashMoBA, a hardware-friendly CUDA kernel designed to execute MoBA efficiently under small-block configurations [7][12].

Performance Analysis
- The original MoBA implementation suffers performance bottlenecks at smaller block sizes, leading to slower execution than dense attention [11][41].
- FlashMoBA optimizes MoBA's performance, achieving up to a 14.7x speedup over FlashAttention-2 in small-block scenarios [8][43].
- Experimental results show that reducing the block size from 512 to 128 improves perplexity from 20.9 to 19.7 and RULER accuracy from 38.8% to 56.0% for a 340M-parameter model [30][31].

Technical Improvements
- The article outlines two main improvement paths for MoBA: using smaller block sizes and applying short convolutions on keys to enhance routing accuracy [5][36].
- FlashMoBA employs a three-kernel design to minimize memory-access inefficiencies and align computations with the GPU architecture, significantly improving performance [16][21].
- The forward kernel uses a "collect and densify" strategy to handle MoBA's irregular sparsity, which is crucial for efficient computation [22][26].

Experimental Results
- Experiments conducted on 8× H100 80GB GPUs demonstrate that the optimized MoBA model outperforms dense attention mechanisms across various benchmarks [30][39].
- Key-convolution techniques (kconv3 and kconv5) are shown to enhance model performance, with kconv3 improving language-modeling accuracy from 45.1% to 45.6% for a 340M model [36][37].
- Overall, the results indicate that smaller block sizes are essential for MoBA to achieve performance comparable to dense attention [41][42].
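The block-routing step at the heart of MoBA can be sketched concretely. The scheme below follows the commonly described recipe: partition keys into fixed-size blocks, summarize each block by its mean key, and let each query attend only to its top-k highest-scoring blocks. The block size, k, and the dot-product router are illustrative choices, not the exact MoBA or FlashMoBA implementation.

```python
# Sketch of MoBA-style block selection: a query sparsely attends to only
# the top-k key-value blocks, scored by query · mean(block keys).

def mean_key(block):
    dim = len(block[0])
    return [sum(k[d] for k in block) / len(block) for d in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    # 1. Partition keys into contiguous blocks (smaller blocks give finer
    #    routing, which is why FlashMoBA targets the small-block regime).
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    # 2. Score each block against the query.
    scores = [dot(query, mean_key(b)) for b in blocks]
    # 3. Keep only the top-k block indices; attention runs inside these.
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

keys = [[1, 0], [1, 0],   # block 0: keys along x
        [0, 1], [0, 1],   # block 1: keys along y
        [1, 1], [1, 1]]   # block 2: mixed
picked = select_blocks([1.0, 0.0], keys, block_size=2, top_k=2)
print(picked)  # [0, 2]
```

The compute saving follows directly: with top_k blocks out of N, each query touches only a top_k/N fraction of the key-value cache.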
No more going down one road to the bitter end: building smarter Search Agents through self-correction
机器之心· 2025-11-18 05:08
Core Insights
- The article discusses the emergence of Search Agents to address the challenges of real-time knowledge and complex reasoning, highlighting their ability to interact with search engines for task execution [2][3].
- A significant limitation of current Search Agents is their lack of self-correction capabilities, which can lead to cascading errors and task failures [2][3][8].
- The ReSeek framework, developed by Tencent's content algorithm center in collaboration with Tsinghua University, introduces a dynamic self-correction mechanism to enhance the reliability of Search Agents [3][8].

Group 1: ReSeek Framework
- ReSeek is not a simple improvement on RAG but a complete rethinking of the core logic of Search Agents, allowing them to evaluate the effectiveness of each action during execution [3][8].
- The framework incorporates a JUDGE action that assesses the validity of new information, enabling the agent to backtrack and explore new possibilities when errors are detected [10][15].
- The JUDGE mechanism is designed to provide dense feedback to the agent, guiding it to learn how to accurately evaluate the value of information [20][39].

Group 2: Error Prevention and Performance
- The article explains the concept of cascading errors, where a small mistake early in reasoning can lead to complete task failure [5][14].
- The ReSeek framework aims to transform agents from mere executors into critical thinkers capable of self-reflection and dynamic error correction [8][12].
- Experimental results indicate that ReSeek achieves industry-leading performance, particularly on complex multi-hop reasoning tasks, demonstrating the effectiveness of its self-correction paradigm [29][30].

Group 3: Evaluation and Benchmarking
- The team constructed the FictionalHot dataset to create a closed-world evaluation environment, eliminating biases from pre-trained models and ensuring a fair assessment of reasoning capabilities [22][27].
- ReSeek was tested against various benchmarks, showing significant improvements in performance metrics compared to other models [28][32].
- The article highlights the inconsistency of experimental setups across studies, emphasizing the need for standardized evaluation methods [25][31].
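The judge-then-backtrack control flow described for ReSeek can be sketched as a simple loop. Everything below is an illustrative stand-in: the keyword-matching `judge`, the toy corpus, and the fixed list of candidate queries replace ReSeek's learned JUDGE action, real search engine, and policy.

```python
# Sketch of a self-correcting search loop: after each search step the agent
# judges the retrieved evidence and backtracks (tries another query branch)
# when the evidence is judged unhelpful.

def search(query, corpus):
    return corpus.get(query, "no result")

def judge(evidence, goal_keyword):
    # Stand-in for the learned JUDGE action: dense, per-step feedback on
    # whether the new information actually advances the task.
    return goal_keyword in evidence

def reseek(goal_keyword, candidate_queries, corpus):
    trace = []
    for q in candidate_queries:      # backtracking = moving to the next branch
        evidence = search(q, corpus)
        verdict = judge(evidence, goal_keyword)
        trace.append((q, verdict))
        if verdict:
            return evidence, trace   # stop once evidence is judged useful
    return None, trace

corpus = {"capital France": "Paris is the capital of France",
          "France population": "France has about 68 million people"}
answer, trace = reseek("capital", ["France population", "capital France"], corpus)
print(answer)  # Paris is the capital of France
print(trace)   # the first query is judged False, so the agent backtracks
```

The key contrast with a one-shot pipeline is the `trace`: a bad first query does not doom the task, because the per-step verdict triggers another attempt instead of letting the error cascade.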
A reference model for China's AI agent industrialization: the four technical hurdles Zebra Speaking overcame
机器之心· 2025-11-18 05:08
Core Insights
- The AI industry is undergoing a critical transition in 2025, with focus shifting from general exploration to vertical applications in fields like education, healthcare, and customer service [2][3].
- Zebra's launch of "Zebra Speaking," the first AI foreign-teacher product for one-on-one teaching, exemplifies the successful implementation of AI in a specific vertical, emphasizing the importance of deep customization over general capabilities [2][5].

Industry Consensus Shift
- The past two years have seen impressive demonstrations of large models, but the gap between ideal and reality becomes evident when these technologies are applied to specific scenarios [4].
- General models struggle to excel in any one area, leading to a preference for vertical applications where clear objectives and measurable outcomes exist, such as online language education [4].

Technical Challenges
- **Challenge One: Real-time Interaction Must Be Fast**
  - Human conversation requires response times of 0.2 to 1.5 seconds for casual dialogue, with acceptable limits extending to 2-4 seconds for thoughtful exchanges [9].
  - Zebra Speaking aims to keep response times within 1.5 to 2.5 seconds, but current technology often exceeds this due to delays in speech recognition, model inference, and text-to-speech processing [10].
- **Challenge Two: Speech Recognition Must Be Accurate**
  - English-language teaching demands high precision in speech recognition, particularly for nuanced phonetic differences [11].
  - The system must also filter out background noise and accurately detect when a child has finished speaking, which is complicated by the presence of distractions [12].
- **Challenge Three: Content Output Must Be Age-Appropriate**
  - Educational contexts require strict control over content, as general models may produce inappropriate or incorrect information for children [14].
  - Zebra Speaking employs a multi-layered defense system to ensure content safety and appropriateness, including rigorous data screening and real-time monitoring [15][16].
- **Challenge Four: Multi-modal Presentation Must Be Stable**
  - Effective online teaching requires seamless integration of voice, animation, text, and effects, with precise timing to avoid disjointed experiences [17].
  - Zebra Speaking has developed a unified timing-orchestration engine to synchronize these elements and maintain a cohesive interaction [18].

Competitive Landscape
- The AI education sector is crowded, with competitors like Google and Khan Academy focusing on AI-assisted learning rather than true teaching [19].
- Zebra Speaking stands out by providing a system that can guide children through structured learning, backed by extensive data and experience in language education [19][20].

Future Outlook
- Zebra Speaking is redefining competition in the language-education sector by setting new standards for AI foreign teachers, emphasizing stability, personalization, and scalability [22].
- Its success may serve as a model for the broader AI agent industry, suggesting that vertical applications will proliferate across fields, creating a new ecosystem of AI services [22][23].
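The latency figures in Challenge One amount to a simple per-turn budget: total response time is roughly ASR + model inference + TTS, checked against the 1.5-2.5 s target. The stage timings below are illustrative numbers, not Zebra's measured values.

```python
# Per-turn latency budget check for a voice tutor: the three serial stages
# (speech recognition, model inference, text-to-speech) must together fit
# inside the product's response-time target.

def within_budget(asr_s, inference_s, tts_s, budget_s=2.5):
    total = round(asr_s + inference_s + tts_s, 3)
    return total, total <= budget_s

fast = within_budget(0.4, 1.2, 0.5)   # fits the 2.5 s target
slow = within_budget(0.6, 2.0, 0.6)   # user-visible lag
print(fast, slow)
```

The budget framing also explains why each stage gets optimized separately: shaving even 0.3 s off inference can move a turn from the "lag" to the "natural" regime.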
Real-robot RL! The strongest VLA model yet, π*0.6, is here, and robots are running a coffee shop in the office
机器之心· 2025-11-18 03:30
Core Insights
- Physical Intelligence (PI) has developed a new robot base model, π*0.6, significantly enhancing the success rate and efficiency of embodied-intelligence tasks [2][3][6].
- The company secured over $400 million in funding in 2024 at a valuation exceeding $2 billion, positioning it as a leading player in the embodied-intelligence sector [3].

Group 1: Model Development and Capabilities
- The π*0.6 model uses a Vision-Language-Action (VLA) framework, trained on extensive robot perception and action data, enabling it to generalize and perform tasks in unknown environments [3][9].
- The model has demonstrated a 90% success rate across various tasks, with significant improvements in processing efficiency [6][34].
- The Recap method, which combines demonstration training, corrective guidance, and autonomous experience learning, has been pivotal in enhancing the model's performance [9][19].

Group 2: Performance Metrics and Applications
- The model has shown more than a twofold increase in throughput and success rate on challenging tasks, such as making espresso, after incorporating real-world execution experience [27][29].
- Physical Intelligence has tested the model in three real-world applications: making espresso drinks, folding various types of clothing, and assembling packaging boxes, achieving success rates above 90% [25][34].
- The model's architecture allows it to handle diverse prompts and conditions, improving its adaptability in real-world scenarios [22][23].

Group 3: Learning Methodology
- The Recap method addresses the challenge of credit assignment in reinforcement learning, allowing the model to learn from both successful and unsuccessful actions [14][20].
- The training process involves offline reinforcement learning for pre-training, followed by task-level fine-tuning using demonstration data and real-world feedback [25][36].
- The combination of expert demonstrations, corrective guidance, and autonomous experience is expected to enhance the model's learning efficiency and performance [37].
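The credit-assignment idea behind Recap can be sketched with a toy advantage computation: each action in an episode is weighted by how much it moved the episode toward success, so the policy learns from both good and bad actions rather than imitating everything equally. The toy episode, the mean baseline standing in for a learned value function, and the exponential weighting are illustrative assumptions, not PI's actual training objective.

```python
import math

# Sketch of advantage-based credit assignment: discounted return-to-go minus
# a baseline gives each step an advantage; exponentiating it gives a weight
# so that helpful actions count more and harmful ones count less.

def advantages(rewards, gamma=0.9):
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g            # discounted return-to-go
        returns.append(g)
    returns.reverse()
    baseline = sum(returns) / len(returns)   # crude value-function stand-in
    return [g - baseline for g in returns]

def action_weights(rewards, beta=1.0):
    # Advantage-weighted coefficients: > 1 for helpful steps, < 1 for
    # harmful ones, so both still carry a learning signal.
    return [math.exp(beta * a) for a in advantages(rewards)]

# Toy episode: three neutral steps, then task success at the end.
weights = action_weights([0.0, 0.0, 0.0, 1.0])
print([round(w, 2) for w in weights])  # weights rise toward the successful step
```

Note that no step's weight is zero: unlike filtering out failures, this keeps the gradient signal from imperfect trajectories, which is the point of learning from both successes and mistakes.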
Huawei Noah's Ark Lab releases ScaleNet: a new general paradigm for model scaling
机器之心· 2025-11-18 03:30
Core Insights
- The article discusses the challenges of scaling models in AI, particularly the high cost of training large-scale models and the need for efficient model-expansion methods [2][3][4].
- The ScaleNet method is introduced as a solution that allows effective model expansion while maintaining parameter efficiency, demonstrating significant performance improvements in both visual and language tasks [5][20].

Research Motivation
- The high computational cost of training large-scale models has led researchers to explore methods like progressive training, which reuses weights from smaller models to initialize larger ones. However, these methods often introduce new independent parameters, increasing storage requirements and slowing optimization [4].

Core Methodology
- ScaleNet combines two key techniques: layer-wise weight sharing and lightweight adapters [6][7].
- Layer-wise weight sharing allows new layers to share parameters with existing layers of the pre-trained model, enhancing parameter efficiency and accelerating learning [8].
- Lightweight adapters are introduced for each shared layer to provide unique adjustments, ensuring that while knowledge is shared, each layer can still learn specialized functions, thus maintaining model capacity and performance [11].

Experimental Results and Analysis
- In visual-model evaluations, ScaleNet outperformed baseline methods in accuracy at similar parameter counts across various architectures, such as DeiT and Swin [14].
- For instance, ScaleNet achieved a Top-1 accuracy of 76.46% with 6.45 million parameters on DeiT-Tiny, compared to 75.01% for the baseline [15].
- ScaleNet also demonstrated superior training efficiency, requiring only 100 epochs and 15.8 hours to reach 81.13% accuracy on DeiT-Small, compared to 300 epochs and 47.3 hours for direct training [16].

Generalization to Language Models
- Applied to the Llama-3.2-1B language model, ScaleNet achieved an average performance improvement of 0.92% across common-sense reasoning benchmarks, indicating its cross-modal applicability [17][18].
- The method also showed stable improvements on downstream visual tasks such as object detection and semantic segmentation, further confirming its generalization capabilities [19].

Conclusion
- ScaleNet provides an efficient and cost-effective technical pathway for expanding pre-trained models, significantly enhancing training efficiency and model performance in both visual and language tasks. This work contributes to the development of larger, stronger, and more economical AI models, promoting sustainable growth in the AI field [20].
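The two ScaleNet ingredients, layer-wise weight sharing and lightweight adapters, can be sketched together: a new layer reuses an existing pre-trained layer's weights, and a small per-layer adapter gives each shared copy its own specialization. The scalar "weights," additive adapter, and depth-doubling scheme below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of ScaleNet-style expansion: doubling depth without doubling
# parameters, because each added layer ties its weight to an original layer
# and adds only a tiny adapter of its own.

class SharedLayer:
    def __init__(self, shared_weight, adapter_scale=0.0):
        self.w = shared_weight        # tied to a pre-trained layer
        self.adapter = adapter_scale  # small trainable, layer-specific part

    def forward(self, x):
        return self.w * x + self.adapter * x

def expand(pretrained_weights, adapter_scales):
    # Scale a stack of N layers to 2N: the only new parameters are adapters.
    layers = [SharedLayer(w) for w in pretrained_weights]
    layers += [SharedLayer(w, a)
               for w, a in zip(pretrained_weights, adapter_scales)]
    return layers

model = expand(pretrained_weights=[1.1, 0.9], adapter_scales=[0.05, -0.05])
x = 1.0
for layer in model:
    x = layer.forward(x)
new_params = sum(1 for l in model if l.adapter != 0.0)
print(round(x, 4), new_params)  # a 4-layer model with only 2 new parameters
```

The storage argument is visible in `new_params`: going from 2 to 4 layers added only 2 scalar adapters, whereas progressive-training schemes that allocate independent weights would have doubled the parameter count.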
Teaching large models "telepathy": a multi-agent collaboration paradigm based on thought communication
机器之心· 2025-11-17 23:40
What would happen if multiple large models could read each other's minds? In the NeurIPS 2025 Spotlight paper Thought Communication in Multiagent Collaboration, researchers from CMU, Meta AI, and MBZUAI propose a brand-new mode of collaboration that lets models share "thoughts" directly instead of relying on language alone.

The work introduces the concept of Thought Communication, letting agents pass latent thoughts at the internal level to achieve "telepathy"-like cooperation.

On the theory side, the researchers establish the first identifiability theory of latent thoughts for multi-agent systems, proving that shared and private thoughts can be recovered from model states even in a nonparametric setting. On the implementation side, they build on this theory to propose a general framework, ThoughtComm, which automatically extracts, routes, and injects these latent thoughts, enabling direct communication beyond language.

Results show that this "thought-level communication" is not only theoretically feasible but also significantly improves collaboration efficiency and reasoning ability in practice.

Paper title: Thought Communication in Multiagent Collaboration
Paper link: https:/ ...
Just now: Musk's Grok 4.1 quietly released! Its general capabilities crush every other model
机器之心· 2025-11-17 23:40
Core Insights
- xAI has announced the release of Grok 4.1, which is now available to all users across platforms including the Grok website, X, and the mobile applications [1][3].
- Grok 4.1 shows significant improvements in real-world usability, particularly in creativity, emotional interaction, and collaborative engagement [4][6].
- The model has enhanced capabilities in understanding subtle intentions and maintaining coherent personality traits while retaining the intelligence and reliability of its predecessor [4][6].

Performance Metrics
- Grok 4.1 achieved a 64.78% probability of being preferred by users in comparative evaluations against previous models [6].
- On the LMArena Text Arena leaderboard, Grok 4.1's reasoning mode (quasarflux) ranks first with an Elo score of 1483, outperforming the highest-rated non-xAI model by 31 points [13].
- The non-reasoning mode (tensor) ranks second with an Elo score of 1465, demonstrating strong performance even without reasoning capabilities [13][14].

Emotional Intelligence
- Grok 4.1 was tested on EQ-Bench3, which evaluates emotional intelligence through challenging role-play scenarios [17].
- Its reasoning and non-reasoning modes ranked first and second, respectively, in these assessments [18].

Creative Writing
- xAI evaluated Grok 4.1 on the Creative Writing v3 benchmark, which involved generating responses to 32 different writing prompts [23].
- The model shows a significant reduction in hallucination rates for factual queries after its post-training phase, indicating improved reliability in information retrieval [27].

Technical Details
- For more technical details on Grok 4.1, a model card is available at the provided link [29].
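The 31-point Elo gap reported above translates into a head-to-head preference probability via the standard Elo expected-score formula, E = 1 / (1 + 10^((R_b - R_a)/400)). The comparison below uses the leaderboard figures from the text; the "best non-xAI" rating of 1452 is simply 1483 minus the stated 31-point gap.

```python
# Standard Elo expected-score formula applied to the reported leaderboard gap.

def elo_win_prob(rating_a, rating_b):
    """Expected probability that A is preferred over B in a pairwise vote."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Grok 4.1 reasoning mode (1483) vs. the best non-xAI model (1483 - 31):
p = elo_win_prob(1483, 1452)
print(round(p, 3))  # a bit over a 54% head-to-head preference
```

This puts the 31-point lead in perspective: Elo gaps of this size correspond to modest but consistent per-comparison preferences, which is why large vote counts are needed to separate top models.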
MiniOneRec, the first fully open-source generative recommendation framework: a lightweight reproduction of the industrial-grade OneRec!
机器之心· 2025-11-17 09:00
Core Viewpoint
- The article discusses the launch of MiniOneRec, the first complete end-to-end open-source framework for generative recommendation, which validates the generative-recommendation scaling law and provides a comprehensive training and research platform for the community [2][4].

Group 1: Generative Recommendation Framework
- MiniOneRec has gained significant attention in the recommendation community since its release on October 28, with all code, datasets, and model weights open-sourced; reproduction requires only 4-8 A100 GPUs [6].
- The framework offers a one-stop, lightweight implementation of (and improvements to) generative recommendation, including a rich toolbox for SID (Semantic ID) construction that integrates advanced quantization algorithms [9].
- The framework demonstrates a significant advantage in parameter-utilization efficiency, with training and evaluation loss decreasing as model size grows from 0.5 billion to 7 billion parameters [8][10].

Group 2: Performance Validation
- Researchers validated the generative-recommendation scaling law on public datasets, showcasing the model's efficiency in parameter utilization [7].
- MiniOneRec significantly outperforms both traditional and generative recommendation paradigms, leading the TIGER model by approximately 30 percentage points on metrics such as HitRate@K and NDCG@K [23].

Group 3: Innovations in Recommendation
- The framework introduces a full-process SID alignment strategy, which significantly enhances generative-recommendation performance by incorporating world knowledge from large models [13][15].
- MiniOneRec employs a novel reinforcement-learning strategy tailored to recommendation, including a constrained-decoding sampling strategy to improve the diversity of generated items and a ranking reward to sharpen sorting signals [17][21].

Group 4: Future Outlook
- The article raises the question of whether generative recommendation will become the new paradigm for recommender systems, highlighting two approaches: the reformist approach, which integrates generative architecture into existing systems, and the revolutionary approach, which aims to completely overhaul traditional models [25][26].
- Both approaches have demonstrated the practical value of the generative paradigm, with some major companies already realizing tangible benefits from its implementation [27].
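The constrained-decoding idea mentioned in Group 3 can be sketched with a prefix trie: valid Semantic-ID (SID) sequences are stored in the trie, and at each decoding step the model may only emit tokens that extend some catalog item, so every generated sequence corresponds to a real item. The tiny SID vocabulary and greedy scorer below are illustrative assumptions, not MiniOneRec's actual setup.

```python
# Sketch of trie-constrained decoding for generative recommendation:
# the trie acts as a per-step token mask over valid SID continuations.

def build_trie(sid_sequences):
    trie = {}
    for seq in sid_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def constrained_decode(trie, score):
    # Greedily pick the highest-scoring token among the *allowed* ones
    # (the children of the current trie node).
    seq, node = [], trie
    while node:
        tok = max(node, key=score)
        seq.append(tok)
        node = node[tok]
    return seq

catalog = [["a1", "b2", "c3"], ["a1", "b9", "c1"], ["a7", "b2", "c2"]]
trie = build_trie(catalog)
# A toy scorer that prefers "a1" then "b9"; decoding still lands on a real SID.
prefer = {"a1": 3, "b9": 2, "a7": 1, "b2": 1, "c1": 1, "c3": 0, "c2": 0}
rec = constrained_decode(trie, lambda t: prefer.get(t, 0))
print(rec)  # ['a1', 'b9', 'c1']
```

The design point is that the language model's raw preferences never produce hallucinated items: the mask guarantees the output is one of the catalog's SID sequences, while the scorer (the model's logits, in practice) chooses among them.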