机器之心
Topping the Open-Source SOTA: SJTU and Xiaohongshu's LoopTool Achieves "Data Evolution" for Tool-Calling Tasks
机器之心· 2025-11-19 04:07
Core Insights
- The article discusses the evolution of large language models (LLMs) from merely "speaking" to "doing" through the integration of external tools, emphasizing the need for high-quality, diverse training data to enhance model performance across tasks [1][5][35]

Group 1: LoopTool Framework
- The Shanghai Jiao Tong University and Xiaohongshu team developed LoopTool, an autonomous, model-aware, iterative data evolution framework that closes the loop between data and model optimization for tool-calling tasks [2][35]
- LoopTool uses the open-source Qwen3-32B as both data generator and discriminator, enabling a smaller 8B model to outperform the larger 32B counterpart in tool-calling performance [2][35]
- The framework achieved state-of-the-art (SOTA) results on the public benchmarks BFCL-v3 and ACEBench, validating the generalizability and effectiveness of closed-loop iterative optimization across model sizes [2][35]

Group 2: Methodology
- LoopTool's core idea is an automated closed loop of data generation, label correction, and model training, driven by model performance feedback; a minimal sketch of the loop follows this summary [7][35]
- The process begins with seed data construction: high-quality, diverse seed datasets are generated using semantic and constraint trees to ensure consistency and semantic integrity [9][10]
- The iterative optimization phase comprises four modules: GRPO training for tool calling, greedy capability probing to identify valuable samples, judgment-guided label verification to correct mismatched labels, and error-driven data expansion to create new challenging samples [11][12][13][15][17]

Group 3: Experimental Results
- LoopTool-8B achieved an overall accuracy of 74.93% on BFCL-v3, ranking first among all 8B models, an improvement of +8.59 percentage points over the original Qwen3-8B [20][23]
- LoopTool-32B reached an overall accuracy of 79.32%, also ranking first, with superior performance in both single-turn and multi-turn scenarios [20][21]
- Iterative training showed continuous performance improvement, in contrast to static training methods that plateaued or declined as the data distribution fell out of step with the model's capabilities [23]

Group 4: Generalization and Downstream Tasks
- LoopTool not only enhances tool calling but also improves general reasoning and complex task handling, as evidenced by performance across a range of general tasks [30][31]
- The model showed significant gains in instruction following and code generation, indicating that closed-loop data evolution benefits broader model capabilities [30][31]
- In practical applications, LoopTool's enhanced tool use addresses real-world problems across diverse scenarios such as API management and complex task execution [32][33]
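To make the closed loop concrete, here is a minimal sketch of the four-module iteration described above. The helper names (grpo_train, greedy_probe, judge_verify, expand_errors) are hypothetical stand-ins for LoopTool's modules, not its actual API; each is stubbed so the control flow runs.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    label: str           # expected tool-call trace
    solved: bool = False

def grpo_train(model, data):              # stub: GRPO RL training on tool calls
    return model

def greedy_probe(model, data):            # stub: greedy decoding finds unsolved samples
    return [s for s in data if not s.solved]

def judge_verify(judge, failures):        # stub: judge model repairs mismatched labels
    return failures

def expand_errors(generator, failures):   # stub: synthesize harder variants of failures
    return [Sample(s.prompt + " (harder variant)", s.label) for s in failures]

def looptool_iteration(model, generator, judge, data, rounds=3):
    """One data-model closed loop: train, probe, verify labels, expand errors."""
    for _ in range(rounds):
        model = grpo_train(model, data)               # 1. GRPO training
        failures = greedy_probe(model, data)          # 2. greedy capability probing
        failures = judge_verify(judge, failures)      # 3. judgment-guided label verification
        data = data + expand_errors(generator, failures)  # 4. error-driven data expansion
    return model, data
```

The point of the sketch is the feedback direction: the model's own failures, after label verification, decide which new data gets generated next round.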
ConsistEdit Arrives: A Training-Free Paradigm for High-Precision, High-Consistency Visual Editing
机器之心· 2025-11-19 02:09
Core Insights
- The article discusses advancements in training-free visual editing, focusing on the ConsistEdit approach designed for the Multi-Modal Diffusion Transformer (MM-DiT) architecture and the key challenges it addresses in visual generation [5][7][34]

Research Background
- Two main pain points in current visual editing methods are identified: the difficulty of balancing editing strength against source-image consistency, and the lack of fine-grained control over editing strength [5]

Key Findings
- Three critical discoveries about the MM-DiT architecture are highlighted:
1. Editing only the visual tokens yields stable editing results, while modifying text tokens can lead to distortions [9]
2. All layers of MM-DiT retain structural information, so edits can act on all attention layers rather than just the last few [11]
3. Controlling Q/K tokens precisely maintains structural consistency, while V tokens primarily influence content texture, enabling decoupled control of structure and texture [15]

Method Design
- ConsistEdit introduces three core operations (sketched in code below):
1. Visual-only attention control, maintaining strong consistency while following text instructions [19]
2. Mask-guided attention fusion, accurately separating edited and non-edited regions [20]
3. Differentiated control of Q/K/V tokens, enabling smooth transitions from full structure preservation to free structural modification [21]

Experimental Validation
- ConsistEdit is validated against five mainstream methods on the PIE-Bench dataset, demonstrating advantages in both image and video editing tasks [22]

Generalization
- ConsistEdit adapts to various MM-DiT variants, including Stable Diffusion 3 and others, showcasing versatility across models [31]

Application Prospects
- The high consistency and fine-grained control of ConsistEdit suit a wide range of visual creation scenarios, from static images to dynamic videos, enhancing interactive creative possibilities [34]
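The three operations can be pictured in one simplified attention step. The sketch below is a single-head toy version under assumed tensor shapes (batch, tokens, dim); the function and argument names are illustrative, not the paper's implementation.

```python
import torch

def consistedit_attn(q_src, k_src, v_src, q_edit, k_edit, v_edit,
                     n_text, edit_mask=None, structure_w=1.0):
    """Single-head toy version of ConsistEdit-style attention control.

    Tokens [0:n_text] are text tokens, the rest are visual tokens.
    Q/K of the *visual* tokens are blended toward the source pass (structure),
    while V keeps the edit pass (texture/content). Text tokens are untouched.
    """
    scale = q_edit.shape[-1] ** -0.5
    q, k = q_edit.clone(), k_edit.clone()
    w = structure_w                                    # 1.0 = full structure preservation
    q[:, n_text:] = w * q_src[:, n_text:] + (1 - w) * q_edit[:, n_text:]
    k[:, n_text:] = w * k_src[:, n_text:] + (1 - w) * k_edit[:, n_text:]
    out = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v_edit
    if edit_mask is not None:                          # mask-guided fusion:
        src = torch.softmax(q_src @ k_src.transpose(-2, -1) * scale, dim=-1) @ v_src
        out = edit_mask * out + (1 - edit_mask) * src  # keep source outside edit region
    return out

B, T, D, n_text = 1, 80, 64, 16
args = [torch.randn(B, T, D) for _ in range(6)]
y = consistedit_attn(*args, n_text=n_text, edit_mask=torch.rand(B, T, 1).round())
```

Sliding structure_w between 1.0 and 0.0 is how the decoupled Q/K (structure) versus V (texture) control yields a smooth transition from preservation to free modification.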
Breaking: The Father of PyTorch Joins TML at Lightning Speed, Just One Day After Leaving Meta, Landing at a $50B Unicorn
机器之心· 2025-11-19 02:09
机器之心 report. Editor: Panda

Just now, Soumith Chintala, who left Meta only recently, posted a tweet praising the people at Thinking Machines Lab (hereafter TML) as "incredible."

At the same time, the father of PyTorch updated his personal bio to officially announce that he has joined TML, saying he is "Building new things" at the startup, which is already valued at $50 billion.

His LinkedIn page currently lists his title simply as "technical staff," so for now there is no telling what this "new thing" will be.

According to Chintala's tweet before leaving Meta, he officially departed on November 17. Barely a day has passed (accounting for time zones), and the seamless handoff seems to confirm how pressing his stated wish to "stop working on PyTorch" really was.

As soon as the tweet went out, Lilian Weng and several other TML researchers and staff replied to welcome him.

Someone also promptly turned Zuckerberg's sour expression into a meme.

In any case, congratulations!

Zach Mueller @TheZachMueller · 3h: "Wow. C ..."
Kaiming He's Major New Work: Just Image Transformers Brings Denoising Models Back to Basics
机器之心· 2025-11-19 02:09
Core Insights
- The article discusses the relationship between image generation and denoising diffusion models, emphasizing that high-quality image generation relies on diffusion models [1]
- It questions whether denoising diffusion models truly "denoise," highlighting that the field has shifted from predicting clean images to predicting the noise itself [2][5]
- The research proposes a return to directly predicting clean data, which lets networks with seemingly insufficient capacity operate effectively in high-dimensional spaces [7][8]

Group 1: Denoising Diffusion Models
- Denoising diffusion models do not "denoise" in the classical sense, as they predict noise or noisy quantities instead of clean images [5][6]
- The manifold assumption holds that natural images lie on a low-dimensional manifold while noise is off-manifold, making the prediction of clean data fundamentally different from the prediction of noisy data [4][6]
- The study introduces a model that directly predicts clean data, which could enhance the performance of diffusion models [7]

Group 2: Just Image Transformers (JiT)
- The paper presents the "Just image Transformers" (JiT) architecture, which uses simple large-patch pixel-level Transformers to build powerful generative models without tokenizers or pre-training [11]
- JiT achieves competitive pixel-space image generation on ImageNet, with FID scores of 1.82 at 256x256 resolution and 1.78 at 512x512 resolution [12]
- The architecture is self-contained and applicable to other fields involving natural data, such as protein and molecular data [12]

Group 3: Model Performance and Design
- JiT divides images into non-overlapping patches, allowing effective processing of high-dimensional data; a toy sketch follows this summary [14]
- Performance is significantly influenced by the prediction target, with x-prediction (regressing the clean data) yielding the best results across various loss functions [21][23]
- Increasing the number of hidden units is not necessary: JiT operates effectively at higher resolutions without additional modifications [28][31]

Group 4: Scalability and Generalization
- JiT maintains similar computational costs across resolutions while achieving strong performance, underscoring its scalability [42][44]
- The findings suggest that network design can be decoupled from the observed data dimension, allowing flexibility in model architecture [31]
- Introducing bottleneck structures into the network can enhance performance by encouraging the learning of intrinsic low-dimensional representations [33]

Group 5: Conclusion and Future Implications
- The study concludes that the advantage of x-prediction follows naturally from neural networks' limitations in modeling noise rather than data [51]
- The proposed "Diffusion + Transformer" paradigm could serve as a foundational method in fields where obtaining tokenizers is challenging [52]
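As a toy illustration of a large-patch pixel Transformer trained with x-prediction, here is a minimal sketch; the hyperparameters and the linear interpolation noise schedule are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class TinyJiT(nn.Module):
    """Minimal pixel-space, large-patch Transformer that predicts clean data x."""
    def __init__(self, img=32, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.n = (img // patch) ** 2
        pdim = patch * patch * 3                       # raw pixel patch dimension
        self.embed = nn.Linear(pdim, dim)              # may bottleneck: dim < pdim
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, pdim)               # regress clean pixel patches

    def forward(self, z_t):                            # z_t: (B, N, patch*patch*3)
        h = self.blocks(self.embed(z_t) + self.pos)
        return self.head(h)                            # x-prediction, not eps-prediction

# one training step, flow-matching style: z_t = (1 - t) * x + t * eps, target is x
model = TinyJiT()
x = torch.randn(2, 4, 16 * 16 * 3)                     # stand-in for clean patches
t = torch.rand(2, 1, 1)
eps = torch.randn_like(x)
loss = ((model(x * (1 - t) + eps * t) - x) ** 2).mean()
loss.backward()
```

The only design choice that matters for the article's thesis is the last layer: the head regresses the clean patch x, so the network never has to represent the off-manifold noise.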
Gemini 3 Lands Overnight: Outgunning GPT-5.1, the Google Era of Large Models Arrives
机器之心· 2025-11-18 18:19
Core Insights
- Gemini 3 has been highly anticipated in the AI community, with significant excitement leading up to its release [2][3]
- Google frames Gemini 3 as a crucial step toward AGI, claiming the strongest multimodal understanding and interaction capabilities in the world [11][12]
- The model sets new SOTA standards in reasoning and multimodal capabilities, outperforming competitors such as Claude Sonnet 4.5 and GPT-5.1 [13][14]

Performance Metrics
- Gemini 3 Pro achieved a record Elo score of 1501 on the LMArena Leaderboard, surpassing previous models and competitors across benchmarks [13]
- On Humanity's Last Exam, it scored 37.5% without tools and 45.8% with search and code execution, showcasing its academic reasoning [15]
- The model also excelled in visual reasoning puzzles, reaching 31.1% on ARC-AGI-2 and 91.9% on GPQA Diamond [15]

Interaction and Usability
- Gemini 3 Pro improves interaction quality, giving concise and direct responses rather than excessive flattery [16]
- It serves as a true thinking partner, offering new ways to understand information and express ideas [17][18]
- The Deep Think mode further enhances reasoning and multimodal understanding, achieving impressive scores on challenging AI benchmarks [19][21]

Learning and Development Capabilities
- Gemini 3 integrates text, images, and video to enable seamless learning experiences [23]
- It can generate interactive study materials and analyze performance in activities such as sports [25]
- The model excels at zero-shot generation, significantly improving developer efficiency and enabling rich, interactive web interfaces [28][29]

Planning and Long-term Management
- Gemini 3 demonstrated superior long-term planning, as evidenced by its performance in the Vending-Bench 2 test [32][36]
- It maintains consistent decision-making and tool usage throughout extended tasks, achieving higher returns on investment [33]

Market Position and Future Outlook
- Gemini 3 is now fully accessible to users and developers across platforms, with tiered pricing based on context length [38][40]
- The new Google Antigravity development platform extends AI's collaborative role in software development [43]
- Market confidence in Gemini is reflected in user engagement metrics, with 2 billion monthly active users and 650 million monthly app users reported [52]
Are Video Models Really Reasoning, or Just "Performing" Reasoning? CUHK and Others Ask: Is Chain-of-Frame Real?
机器之心· 2025-11-18 18:19
Core Insights
- The article discusses advancements in video generation models such as Veo and Sora, highlighting emerging capabilities beyond mere synthesis, particularly in reasoning and perception [2][26]
- A new concept, Chain-of-Frame (CoF), is introduced as the visual analogue of Chain-of-Thought (CoT) in language models: solving problems through the sequential generation of video frames [2][9]

Research Findings
- Researchers from several universities conducted a systematic study of the zero-shot reasoning potential of models like Veo 3, producing the MME-CoF benchmark, which covers 12 reasoning dimensions [2][18]
- Veo 3 performs well on simple spatial layouts and basic geometric transformations but struggles in complex scenarios, revealing limits in maintaining global consistency and understanding [13][15][23]

Evaluation Metrics
- MME-CoF provides a standardized framework for assessing video models' reasoning, spanning 12 dimensions and 59 tasks and recasting abstract reasoning tasks as visual challenges; a sketch of score aggregation follows this summary [18][29]
- Most video generation models scored below 2 on a 0-4 scale, indicating a lack of robust reasoning [21][24]

Conclusions
- Current models do not possess independent zero-shot reasoning abilities, relying on data patterns rather than logical deduction [26]
- Strong generation does not equal strong reasoning: the models often produce visually plausible results that lack logical coherence [27][28]
- Future potential remains, as these models could serve as complementary components in a more comprehensive multimodal intelligence system [29]
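As a trivial illustration of how a 0-4 rubric over reasoning dimensions might be aggregated into the per-dimension and overall scores the article cites, here is a sketch; the dimension names, task IDs, and scores below are invented examples, not MME-CoF data.

```python
from collections import defaultdict
from statistics import mean

# (dimension, task_id, score on the 0-4 rubric); entries are made-up examples
results = [
    ("spatial_layout", "maze_01", 3),
    ("spatial_layout", "maze_02", 2),
    ("temporal_logic", "clock_01", 1),
    ("geometry", "fold_01", 2),
]

by_dim = defaultdict(list)
for dim, _, score in results:
    by_dim[dim].append(score)

per_dim = {dim: mean(scores) for dim, scores in by_dim.items()}
overall = mean(score for _, _, score in results)
print(per_dim, f"overall={overall:.2f}")   # below 2/4 ~ no robust reasoning
```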
Why Should Video Generation Prompts Be Text Only? ByteDance and CUHK Release Video-As-Prompt
机器之心· 2025-11-18 05:08
This work was completed by the first author during an internship with ByteDance's Intelligent Creation team in North America. The first author, Yuxuan Bian, is a second-year PhD student in the Department of Computer Science and Engineering at The Chinese University of Hong Kong, advised by Prof. Qiang Xu; his research focuses on controllable video generation, and he has previously interned at ByteDance, Tencent, and other companies. Homepage: https://yxbian23.github.io/

In video creation, have you ever wanted to reproduce the turn-into-Labubu effect, recreate Ghibli-style stylization, perform the same dance that went viral on short-video platforms, or imitate an intricate Hitchcock dolly zoom?

In today's AI video generation, such creations, which depend on abstract semantic control, are often extremely hard to realize because there is no unified conditional representation.

The most basic and direct idea is to train a separate LoRA for each abstract semantic, or to design a dedicated architecture for one class of semantic conditions to perform targeted feature extraction and controllable generation.

However, semantic conditions are potentially endless. Training one model per condition makes real-world use very cumbersome and computationally expensive, and the model has no generalization to semantic conditions it was never trained on. Designing an architecture for one class of semantics partly solves the problem within that subset (e.g., camera control, style transfer), but different semantic categories still require constantly switching models, and such task-specific designs cannot model different semantic categories in a unified way, blocking progress toward unified models and model scaling.

To address this pain point, The Chinese University of Hong Kong and ByteDance ...
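The text is cut off before the method itself, but the idea in the title, treating a reference video as the prompt, can be pictured generically as in-context conditioning: target tokens attend over reference-video tokens inside the backbone. The sketch below is an assumed, simplified stand-in for that general idea, not the paper's actual architecture; all names are illustrative.

```python
import torch
import torch.nn as nn

class VideoAsPromptBlock(nn.Module):
    """Generic in-context conditioning block: the reference video acts as the prompt."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_tokens, ref_tokens):
        # target tokens attend jointly over [reference ; target], so the semantics
        # of the reference clip steer generation without a per-semantic LoRA
        ctx = torch.cat([ref_tokens, target_tokens], dim=1)
        out, _ = self.attn(self.norm(target_tokens), ctx, ctx)
        return target_tokens + out

block = VideoAsPromptBlock()
ref = torch.randn(1, 64, 256)   # tokens of the reference (condition) video
tgt = torch.randn(1, 64, 256)   # tokens of the video being generated
y = block(tgt, ref)
```

Because the condition enters as plain tokens rather than a task-specific encoder, the same block serves any semantic category, which is exactly the unification the passage argues per-condition LoRAs cannot provide.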
Song Han et al. Propose FlashMoBA: 7.4x Faster than MoBA, with No Overflow Even at 512K Sequence Length
机器之心· 2025-11-18 05:08
Core Insights
- The article discusses a novel attention mechanism, Mixture of Block Attention (MoBA), which applies the mixture-of-experts (MoE) principle to attention, letting the model decide autonomously which positions to focus on [2][4]
- MoBA shows significant potential for long contexts by letting queries sparsely attend to a limited number of key-value blocks, greatly reducing computational cost (a dense reference sketch of this block routing follows this summary) [3][4]
- The article identifies performance problems with small block sizes in existing MoBA implementations and introduces FlashMoBA, a hardware-friendly CUDA kernel that executes MoBA efficiently under small-block configurations [7][12]

Performance Analysis
- The original MoBA implementation hits performance bottlenecks at small block sizes, running slower than dense attention [11][41]
- FlashMoBA optimizes MoBA's performance, achieving up to a 14.7x speedup over FlashAttention-2 in small-block scenarios [8][43]
- Experiments show that reducing the block size from 512 to 128 improves perplexity from 20.9 to 19.7 and RULER accuracy from 38.8% to 56.0% for a 340M-parameter model [30][31]

Technical Improvements
- Two main improvement paths for MoBA are outlined: smaller block sizes, and short convolutions over keys to improve routing accuracy [5][36]
- FlashMoBA uses a three-kernel design to minimize memory-access inefficiency and align computation with GPU architecture, significantly improving performance [16][21]
- The forward kernel uses a "collect and densify" strategy to handle MoBA's irregular sparsity, which is crucial for efficient computation [22][26]

Experimental Results
- Experiments on 8x H100 80GB GPUs show the optimized MoBA model outperforming dense attention across benchmarks [30][39]
- Key-convolution variants (kconv3 and kconv5) improve model quality, with kconv3 raising language-modeling accuracy from 45.1% to 45.6% for the 340M model [36][37]
- Overall, the results indicate that small block sizes are essential for MoBA to match dense attention [41][42]
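For intuition, here is a dense reference sketch of MoBA's block routing (non-causal, single-head, assumed shapes). It materializes the full score matrix and masks out unselected blocks, which is exactly the waste FlashMoBA's fused kernels avoid; it is an illustration of the mechanism, not the paper's kernel.

```python
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size=4, top_k=2):
    """Dense reference version of MoBA routing (non-causal, single head).

    Each query scores key *blocks* via their mean-pooled key ("centroid"),
    keeps the top-k blocks, and attends within that sparse set in one softmax.
    """
    B, T, D = q.shape
    nb = T // block_size
    centroids = k.view(B, nb, block_size, D).mean(dim=2)   # (B, nb, D)
    gate = q @ centroids.transpose(-2, -1)                 # (B, T, nb) routing scores
    keep = torch.zeros_like(gate, dtype=torch.bool)
    keep.scatter_(-1, gate.topk(top_k, dim=-1).indices, True)
    mask = keep.repeat_interleave(block_size, dim=-1)      # per-token mask (B, T, T)
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 16, 8)
out = moba_attention(q, k, v)
```

The article's small-block finding maps directly onto this sketch: shrinking block_size makes the centroids sharper summaries and the routing more precise, but also makes the irregular gather pattern harder on hardware, which is the gap the three-kernel design closes.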
No More "Going Down One Road to the End": Building Smarter Search Agents Through Self-Correction
机器之心· 2025-11-18 05:08
Core Insights
- The article discusses the emergence of Search Agents to address real-time knowledge and complex reasoning, highlighting their ability to interact with search engines to carry out tasks [2][3]
- A key limitation of current Search Agents is the lack of self-correction, which can lead to cascading errors and task failure [2][3][8]
- The ReSeek framework, developed by Tencent's content algorithm center in collaboration with Tsinghua University, introduces a dynamic self-correction mechanism to make Search Agents more reliable [3][8]

Group 1: ReSeek Framework
- ReSeek is not a simple improvement on RAG but a rethinking of the Search Agent's core logic, letting the agent evaluate the effectiveness of each action as it executes [3][8]
- The framework adds a JUDGE action that assesses the validity of newly retrieved information, allowing the agent to backtrack and explore alternatives when errors are detected (a sketch of this loop follows this summary) [10][15]
- The JUDGE mechanism supplies dense feedback, teaching the agent to accurately evaluate information value [20][39]

Group 2: Error Prevention and Performance
- The article explains cascading errors: a small mistake early in reasoning can snowball into complete task failure [5][14]
- ReSeek aims to turn agents from mere executors into critical thinkers capable of self-reflection and dynamic error correction [8][12]
- Experimental results show industry-leading performance, particularly on complex multi-hop reasoning tasks, demonstrating the effectiveness of the self-correction paradigm [29][30]

Group 3: Evaluation and Benchmarking
- The team built the FictionalHot dataset to create a closed-world evaluation environment, eliminating pre-training biases and ensuring a fair test of reasoning ability [22][27]
- ReSeek was tested against multiple benchmarks, showing significant improvements over other models [28][32]
- The article notes the inconsistency of experimental setups across studies and emphasizes the need for standardized evaluation methods [25][31]
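A minimal sketch of the execute-judge-backtrack loop follows. The agent, search engine, and judge here are toy stubs (the real system trains the policy with RL, and correct JUDGE calls feed a dense reward); all names are illustrative, not ReSeek's API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReseekAgent:
    """Toy stand-in for the policy; ReSeek trains the real one with RL."""
    queries: List[str]
    notes: List[str] = field(default_factory=list)

    def next_query(self, state):
        return self.queries[len(self.notes) % len(self.queries)]
    def update(self, state, evidence):
        self.notes.append(evidence); return state
    def backtrack(self, state, query):
        return state                       # discard the bad branch, keep prior state
    def can_answer(self, state):
        return len(self.notes) >= 2
    def answer(self, state):
        return " / ".join(self.notes)

def reseek_episode(agent, search, judge, question, max_steps=8):
    """Execute-judge-backtrack loop: judge every retrieval before building on it."""
    state, trace = question, []
    for _ in range(max_steps):
        query = agent.next_query(state)
        evidence = search(query)
        if judge(state, evidence):                 # JUDGE action: is evidence useful?
            trace.append((query, evidence))        # dense positive signal for the reward
            state = agent.update(state, evidence)
            if agent.can_answer(state):
                return agent.answer(state), trace
        else:
            state = agent.backtrack(state, query)  # self-correct instead of cascading
    return agent.answer(state), trace

agent = ReseekAgent(queries=["who founded X?", "when was X founded?"])
answer, trace = reseek_episode(agent, search=lambda q: f"doc about {q}",
                               judge=lambda s, ev: bool(ev), question="about X")
```

The structural difference from a plain ReAct-style loop is the branch on the judge's verdict: bad evidence triggers backtracking instead of being silently appended to the context, which is what prevents one early retrieval mistake from poisoning every later step.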
A Reference Model for AI Agent Industrialization in China: The Four Technical Hurdles Zebra Speaking Overcame
机器之心· 2025-11-18 05:08
Core Insights
- The AI industry is undergoing a critical transition in 2025, shifting from general exploration to vertical applications in fields such as education, healthcare, and customer service [2][3]
- Zebra's launch of "Zebra Speaking," the first AI foreign-teacher product for one-on-one teaching, exemplifies successful AI deployment in a specific vertical, underlining the value of deep customization over general capability [2][5]

Industry Consensus Shift
- The past two years have produced impressive large-model demonstrations, but the gap between ideal and reality becomes evident once the technology is applied to specific scenarios [4]
- General models struggle to excel at any single task, driving a preference for vertical applications with clear objectives and measurable outcomes, such as online language education [4]

Technical Challenges
- **Challenge One: Real-time Interaction Must Be Fast**
  - Human conversation requires response times of 0.2 to 1.5 seconds for casual dialogue, with acceptable limits extending to 2-4 seconds for thoughtful exchanges [9]
  - Zebra Speaking targets response times of 1.5 to 2.5 seconds, but current technology often exceeds this due to delays across speech recognition, model inference, and text-to-speech processing (a toy latency-budget check is sketched after this list) [10]
- **Challenge Two: Speech Recognition Must Be Accurate**
  - English teaching demands high-precision speech recognition, particularly for subtle phonetic differences [11]
  - The system must also filter background noise and accurately detect when a child has finished speaking, which is complicated by the presence of distractions [12]
- **Challenge Three: Content Output Must Be Age-Appropriate**
  - Educational contexts require strict content control, as general models may produce inappropriate or incorrect material for children [14]
  - Zebra Speaking uses a multi-layered defense system to keep content safe and age-appropriate, including rigorous data screening and real-time monitoring [15][16]
- **Challenge Four: Multi-modal Presentation Must Be Stable**
  - Effective online teaching requires seamless integration of voice, animation, text, and effects, with precise timing to avoid a disjointed experience [17]
  - Zebra Speaking built a unified timing orchestration engine to synchronize these elements and maintain cohesive interaction [18]

Competitive Landscape
- The AI education sector is crowded, with competitors like Google and Khan Academy focusing on AI-assisted learning rather than true teaching [19]
- Zebra Speaking stands out by providing a system that can guide children through structured learning, backed by extensive data and experience in language education [19][20]

Future Outlook
- Zebra Speaking is redefining competition in language education, setting new standards for AI foreign teachers in stability, personalization, and scalability [22]
- Its success may serve as a model for the broader AI agent industry, suggesting vertical applications will proliferate across fields and create a new ecosystem of AI services [22][23]
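To illustrate the latency-budget arithmetic behind Challenge One, here is a toy check of a serial voice pipeline against the 1.5-2.5 second window cited in the article; the stage names and timings are invented for illustration and do not come from Zebra's system.

```python
# invented stage latencies (seconds) for one conversational turn
STAGES = {"endpoint_detection": 0.30, "asr": 0.40,
          "llm_first_token": 0.90, "tts_first_audio": 0.35}
BUDGET = (1.5, 2.5)  # target response window cited in the article

def turn_latency(stages):
    # serial pipeline: each stage waits for the previous one; streaming
    # (overlapping LLM decoding with TTS synthesis) is the usual way to cut this
    return sum(stages.values())

total = turn_latency(STAGES)
print(f"turn latency {total:.2f}s; within budget: {BUDGET[0] <= total <= BUDGET[1]}")
```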