机器之心

The champion team alone takes 2 million RMB, and reaching the finals earns a direct offer: registration opens for the Tencent Advertising Algorithm Competition
机器之心· 2025-06-18 06:09
Core Viewpoint
- The article discusses the potential of multimodal generative AI, particularly in the advertising sector, highlighting its successful applications and the opportunities it presents for talent in this field [3][4][11].

Group 1: Current State of AIGC and Multimodal Generation
- The job market for narrow AIGC roles, such as video generation, appears limited, leading to concerns about employment prospects for those with backgrounds in foundational vision and generative models [2][3].
- Despite the early stage of technology development, multimodal generation has already seen successful applications in advertising, yielding tangible benefits for major companies [3][4].

Group 2: Generative AI in Advertising
- Generative AI has been utilized in advertising for years, with platforms like Amazon launching AI tools to enhance content generation, significantly improving production efficiency [5][7].
- Tencent's advertising tool, "Miao Si," exemplifies the integration of generative AI across various advertising processes, including content generation and cost reduction in distribution [7][8].

Group 3: Challenges and Opportunities in Generative Advertising
- Traditional advertising recommendation systems face limitations, such as the difficulty in identifying user dislikes and the constraints of existing content libraries [9][10].
- A shift towards generative recommendation systems could address these issues by creating personalized content based on user behavior, although challenges remain in data availability and real-time processing [10][16].

Group 4: Tencent Advertising Algorithm Competition
- The Tencent Advertising Algorithm Competition offers a platform for participants to engage with real business data, enhancing their understanding of user behavior and motivations [17][18].
- The competition features a total prize pool of 3.6 million RMB, with significant rewards for top teams, and serves as a recruitment avenue for Tencent [19][21].
- Participants gain valuable experience and networking opportunities, which can facilitate career advancement in the advertising technology sector [24][26].

Group 5: Market Trends and Future Prospects
- Tencent's marketing services revenue grew by 20% year-on-year, largely attributed to AI-driven advertising technology upgrades, indicating a rising demand for generative AI talent in the industry [26][27].
- The competition encourages students from various academic backgrounds to participate, emphasizing that prior experience in advertising is not a prerequisite [28][29].
Embodied multimodal reasoning in a unified framework: 自变量机器人 lets AI put down Heidegger's hammer
机器之心· 2025-06-18 06:09
Core Viewpoint
- The article emphasizes the need for a paradigm shift in robotics from modular systems to a unified architecture that enables embodied intelligence, allowing robots to process perception, reasoning, and action simultaneously, akin to human cognition [4][10][34].

Current Paradigm Limitations
- Existing mainstream methods treat different modalities as independent modules, leading to inherent flaws in information processing and understanding [6][7].
- The representation bottleneck results in unavoidable compression losses when transferring information between different modality encoders, hindering deep cross-modal understanding of the physical world [7].
- The structural disconnection prevents models from learning intuitive causal relationships across modalities, which is essential for true physical intelligence [8].

Unified Architecture: From Division to Integration
- The proposed unified modality architecture aims to eliminate artificial boundaries between visual, linguistic, and action modalities, processing them as a single information flow [4][10].
- The core of this architecture is unified representation learning, converting all modality information into a shared high-dimensional token sequence [11].
- A multi-task, multi-modal generation mechanism serves as a supervisory method, compelling the model to establish deep cross-modal correspondences [12].

Emergent Capabilities: Embodied Multi-Modal Reasoning
- The unified architecture unlocks comprehensive embodied multi-modal reasoning capabilities that current modular systems cannot achieve [16].
- Symbol-space reasoning allows robots to deconstruct abstract shapes into concrete representations and perform physical operations based on this understanding [17].
- Physical space reasoning enables robots to understand the implications of actions on structural stability and articulate their reasoning processes [19][20].
- The system can autonomously explore complex environments by integrating visual observations, spatial memory, and common knowledge into coherent reasoning chains [22].

Conclusion
- The transition to a unified architecture is crucial for enabling robots to interact seamlessly with the physical world, integrating perception, understanding, and action without the delays and losses associated with modular systems [30][31].
- This shift is not merely incremental but represents a fundamental evolution necessary for achieving embodied intelligence capable of cross-modal causal reasoning and spatial logic [34].
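The unified representation learning described above — mapping vision, language, and action into one shared token sequence — can be illustrated with a toy sketch. The vocabulary sizes, VQ-style patch quantization, and uniform action binning below are illustrative assumptions, not 自变量机器人's actual design:

```python
import numpy as np

# Hypothetical vocabulary layout: one shared token space, with each
# modality mapped into its own id range (sizes are illustrative only).
TEXT_VOCAB = 1000      # text token ids: [0, 1000)
VISION_VOCAB = 512     # vision codebook ids: [1000, 1512)
ACTION_BINS = 256      # discretized action ids: [1512, 1768)

def tokenize_vision(patch_features, codebook):
    """Quantize image-patch features to the nearest codebook entry (VQ-style)."""
    d = np.linalg.norm(patch_features[:, None, :] - codebook[None, :, :], axis=-1)
    return TEXT_VOCAB + d.argmin(axis=1)

def tokenize_actions(actions, low=-1.0, high=1.0):
    """Uniformly bin continuous joint commands into discrete ids."""
    bins = np.clip(((actions - low) / (high - low) * ACTION_BINS).astype(int),
                   0, ACTION_BINS - 1)
    return TEXT_VOCAB + VISION_VOCAB + bins

rng = np.random.default_rng(0)
codebook = rng.normal(size=(VISION_VOCAB, 8))
patches = rng.normal(size=(4, 8))      # 4 image patches
text_ids = np.array([17, 42, 511])     # a pre-tokenized instruction
actions = np.array([0.1, -0.5, 0.9])   # 3 joint commands

# One flat sequence: a single transformer can attend across all modalities,
# with no hand-built interface between encoders.
sequence = np.concatenate([text_ids,
                           tokenize_vision(patches, codebook),
                           tokenize_actions(actions)])
print(sequence.shape)  # (10,)
```

Once everything lives in one token space, "generation" in any modality is just next-token prediction over the appropriate id range, which is what lets one supervisory mechanism force cross-modal correspondences.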
Just now: the Gemini 2.5 model series gets an update, and the latest lightweight Flash-Lite can even write an operating system in real time
机器之心· 2025-06-18 01:24
Machine Heart report. Editor: Panda

The Gemini model family has just received a wave of updates:

Google CEO Sundar Pichai tweeted that the newly launched Gemini 2.5 Flash-Lite is the most cost-effective model in the 2.5 series to date.

As the positioning shows, Google pitches 2.5 Flash-Lite at "high-volume, cost-efficiency-focused tasks." By contrast, 2.5 Pro suits coding and highly complex tasks, while 2.5 Flash sits in between, better for everyday tasks that need faster responses.

- The stable version of Gemini 2.5 Pro is released and generally available, unchanged from the June 5 preview.
- The stable version of Gemini 2.5 Flash is released and generally available, unchanged from the May 20 preview, though with updated pricing.
- Gemini 2.5 Flash-Lite is newly launched and now in preview.

| | 2.5 Flash-Lite | 2.5 Flash | 2.5 Pro |
| --- | --- | --- | --- |
| | THINKING OFF | THINKING | THINKING |
| Best for | High volume cost-… | Fa… | … |
Want to know whether your LLM API is overcharging you? Hidden tokens can finally be audited
机器之心· 2025-06-17 08:52
The authors of this paper are from the CASE (Collaborative, Automated, Scalable, and Efficient Intelligence) Lab at the University of Maryland. The main contributors are PhD students Guoheng Sun and Ziyao Wang, advised by Professor Ang Li.

Research background: balancing commercial protection against users' right to know

Paper title: Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services

arXiv link: https://arxiv.org/pdf/2505.18471

In recent years, large language models (LLMs) have made remarkable progress on complex tasks, especially in advanced applications such as multi-step reasoning, tool calling, and multi-agent collaboration. These gains often depend on a long sequence of internal "thinking" steps inside the model, or on frequent information exchange between agents in an agentic system.

However, to protect core intellectual property (for example, to prevent model distillation or leakage of agent workflows) and to deliver a smoother user experience, service providers typically hide these intermediate steps and present users with only the final output. In today's commercial and technical environment, this ...
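As a rough illustration of the accounting gap the paper targets (not the paper's actual auditing protocol), consider how a bill can be checked against what the user can actually verify. All numbers and names below are hypothetical:

```python
# Illustrative accounting of the transparency gap in an opaque LLM service:
# the user is billed for hidden reasoning/tool tokens they never see.
# Field names and figures are hypothetical, not from the paper.

def billing_gap(invoice_output_tokens, visible_output_tokens):
    """Tokens billed as 'output' that correspond to no visible text."""
    return invoice_output_tokens - visible_output_tokens

# A provider bills 2,300 output tokens for a reply whose visible text
# re-tokenizes to only 350 tokens: 1,950 tokens are hidden operations.
hidden = billing_gap(invoice_output_tokens=2300, visible_output_tokens=350)
print(hidden)          # 1950
print(hidden / 2300)   # ~0.848: the share of the bill the user cannot verify
```

The point of the paper is that the user can compute the left side of this subtraction (by re-tokenizing what they received) but has no way to verify the right side, which is precisely what an auditing mechanism must fix.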
The first comprehensive, authoritative survey of the development of speech language models, accepted to the ACL 2025 main conference
机器之心· 2025-06-17 04:50
Imagine if AI could hold spoken conversations as naturally as humans do: no more cumbersome pipeline of speech-to-text (ASR) → text LLM processing → text-to-speech (TTS), but direct understanding and generation of speech. What would that experience be like? This is the core problem that speech language models (SpeechLMs) aim to solve.

Traditional speech interaction systems suffer from three major pain points: information loss, high latency, and error accumulation. When speech is converted to text, paralinguistic information such as pitch, tone, and emotion is lost entirely; chaining multiple modules leads to noticeable response delays; and errors at each stage compound, degrading the final result.

SpeechLMs fundamentally change this picture. They process speech end to end, preserving the rich information in the speech signal while greatly reducing latency, paving the way for truly natural human-machine voice interaction.

First author: Wenqian Cui, a PhD student at The Chinese University of Hong Kong, whose research focuses on speech language models, multimodal large models, and AI music generation.

The survey on speech language models written by the CUHK team, "Recent Advances in Speech Language Models: A Survey," has been accepted to the ACL 2025 main conference! It is the first comprehensive, systematic survey in this area, charting the direction for the future of speech AI. ArXiv link: https: ...
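The error-accumulation point above has a simple quantitative form: if each stage of the cascade is independently correct with some probability, end-to-end reliability is the product of the stages. The accuracies below are illustrative, not measured values:

```python
# Error accumulation in a cascaded ASR -> LLM -> TTS pipeline:
# even individually strong stages compound into a weaker whole.
# The stage accuracies are made-up numbers for illustration.

def cascade_accuracy(stage_accuracies):
    """End-to-end success rate assuming independent per-stage errors."""
    out = 1.0
    for a in stage_accuracies:
        out *= a
    return out

asr, llm, tts = 0.95, 0.97, 0.96
print(round(cascade_accuracy([asr, llm, tts]), 4))  # 0.8846
```

Three stages that are each ~95-97% reliable yield a system that fails more than one turn in nine, which is one concrete motivation for collapsing the cascade into a single end-to-end model.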
From yangge dancing to running a half marathon: how far are robots from their "iPhone moment"?
机器之心· 2025-06-17 04:50
Core Insights
- The article discusses the advancements in embodied intelligence, highlighting the transition from imagination to reality in robotics, and raises critical questions about technological bottlenecks, practical applications, and user needs in the industry [2][3].

Group 1: Technological Developments
- The focus is on transforming innovative technologies into commercially viable products, with an emphasis on shortening exploration cycles and reducing costs in the embodied intelligence sector [3].
- Major companies like NVIDIA, Qualcomm, and Intel are competing in the development of computing platforms for embodied intelligence robots, with NVIDIA's Jetson Thor leading the charge [3][4].
- The RDK S100, developed by Diguo Robotics, has achieved significant traction with over 20 top-tier clients and 50 partners engaged in evaluations, positioning itself as a strong alternative to NVIDIA [4].

Group 2: Architectural Innovations
- The article introduces the concept of a "dual-brain" architecture, where the "big brain" handles perception and decision-making, while the "small brain" manages motion control, enhancing the robot's capabilities [5][8].
- The RDK S100 features a unique CPU+BPU+MCU architecture that integrates computation and control, enabling a closed-loop system for embodied intelligence robots [9][12].
- The RDK S100 is designed to meet the computational needs of various applications, with a target performance of around 100 TOPS, suitable for scenarios like quadrupedal robots and logistics vehicles [13][14].

Group 3: Development Ecosystem
- Diguo Robotics aims to support developers by providing a comprehensive infrastructure that accelerates the transition from development to deployment, including over 110 models in its ModelZoo algorithm repository [19][20].
- The company has established a collaborative ecosystem with over 200 startups through its Gravity Program, offering resources from hardware discounts to software support [28].
- Successful implementations of the RDK S100 in various robotic applications demonstrate its potential for scalable deployment in sectors such as commercial cleaning, smart home, industrial manufacturing, and logistics [25][26].
The first of the new EV makers to pivot to an AI company showcases its next-generation autonomous driving model at a top global AI conference
机器之心· 2025-06-17 04:50
Core Viewpoint
- The article emphasizes the significance of high computing power, large models, and extensive data in achieving Level 3 (L3) autonomous driving, highlighting the advancements made by XPeng with its G7 model and its proprietary AI chips [3][18][19].

Group 1: Technological Advancements
- XPeng's G7 is the world's first L3 level AI car, featuring three self-developed Turing AI chips with over 2200 TOPS of effective computing power [3][18].
- The G7 introduces the VLA-OL model, which incorporates a "motion brain" for decision-making in intelligent assisted driving [4].
- The VLM (Vision Large Model) serves as the AI brain for vehicle perception, enabling new interaction capabilities and future functionalities like local chat and multi-language support [5][19].

Group 2: Industry Positioning
- XPeng was the only invited Chinese car company to present at the global computer vision conference CVPR 2025, showcasing its advancements in autonomous driving models [6][13].
- The company has established a comprehensive system from computing power to algorithms and data, positioning itself as a leader in the autonomous driving sector [8][18].

Group 3: Model Development and Training
- The next-generation autonomous driving base model developed by XPeng has a parameter scale of 72 billion and has been trained on over 20 million video clips [20].
- The model utilizes a large language model backbone and extensive multimodal driving data, enhancing its capabilities in visual understanding and reasoning [20][21].
- XPeng employs a distillation approach to adapt large models for vehicle-side deployment, ensuring core capabilities are retained while optimizing performance [27][28].

Group 4: Future Directions
- The development of a world model is underway, which will simulate real-world conditions and enhance the feedback loop for continuous learning [36][41].
- XPeng aims to leverage its AI advancements not only for autonomous driving but also for AI robots and flying cars in the future [43][64].
- The transition to an AI company involves building a robust AI infrastructure, with a focus on optimizing the entire production process from cloud to vehicle [50][62].
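XPeng has not published its distillation recipe in detail; the sketch below shows the generic form such large-to-small distillation commonly takes (a temperature-scaled KL divergence between teacher and student output distributions), with made-up shapes and no claim to match XPeng's actual method:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Temperature-scaled KL(teacher || student), averaged over the batch.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return (T * T) * kl.mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))                   # large cloud-side model
student = teacher + 0.1 * rng.normal(size=(4, 10))   # a close small model
far_student = rng.normal(size=(4, 10))               # an unrelated small model

# The loss is near zero when the student matches the teacher's
# distribution and grows as the two diverge.
print(distill_loss(teacher, student) < distill_loss(teacher, far_student))  # True
```

The softened distribution (T > 1) exposes the teacher's ranking over near-miss outputs, which is usually the part of the "core capability" worth transferring to a compute-limited vehicle-side model.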
Open-sourcing new models on the same day, one for reasoning and one for coding: MiniMax and Moonshot AI (月之暗面) turn up the competition
机器之心· 2025-06-17 03:22
Core Insights
- The article discusses the launch of new AI models by domestic large model manufacturers, specifically highlighting MiniMax-M1 and Kimi-Dev-72B as significant advancements in the field of open-source AI models [1][9].

Group 1: MiniMax-M1
- MiniMax-M1 is introduced as a long-context reasoning LLM capable of handling an input of 1 million tokens and an output of 80,000 tokens, making it one of the most powerful models in terms of context length [2][19].
- The model demonstrates exceptional capabilities in interactive applications, such as creating web applications and visualizing algorithms, with a focus on user-friendly UI components [5][8].
- MiniMax-M1 has been trained using a novel reinforcement learning algorithm called CISPO, which optimizes model performance by focusing on importance sampling weights rather than token updates, achieving faster convergence compared to previous methods [20][23].
- The model's performance in various benchmarks shows it surpasses other open-weight models, particularly in software engineering and long-context tasks, with a notable score of 56.0% on the SWE-bench Verified benchmark [25][29].

Group 2: Kimi-Dev-72B
- Kimi-Dev-72B is presented as a powerful open-source programming model that achieved a new state-of-the-art (SOTA) score of 60.4% on the SWE-bench Verified benchmark, showcasing its capabilities in code generation [10][37].
- The model employs a collaborative mechanism between BugFixer and TestWriter roles, enhancing its ability to fix bugs and write tests effectively [40][45].
- Kimi-Dev-72B underwent extensive mid-training using high-quality real-world data, which significantly improved its performance in practical error correction and unit testing [41][42].
- The model's design includes a unique outcome-based reward mechanism during reinforcement learning, ensuring that only effective code fixes are rewarded, thus aligning with real-world development standards [43][44].
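The contrast drawn under MiniMax-M1 — CISPO clips the importance-sampling weight itself rather than dropping token updates, as PPO-style ratio clipping effectively does — can be sketched in a simplified per-token form. Clip ranges and advantages below are illustrative; see the MiniMax-M1 technical report for the actual objective:

```python
import numpy as np

# Simplified per-token gradient weights contrasting PPO-style clipping
# with a CISPO-style clipped importance-sampling (IS) weight.

def ppo_token_weight(r, adv, eps=0.2):
    """PPO: when the clipped surrogate is selected, the token's gradient
    is zero, i.e. its update is dropped entirely."""
    unclipped = r * adv
    clipped = np.clip(r, 1 - eps, 1 + eps) * adv
    chosen = np.minimum(unclipped, clipped)
    # gradient flows only where the unclipped term was selected
    return np.where(chosen == unclipped, adv, 0.0)

def cispo_token_weight(r, adv, r_low=0.0, r_high=4.0):
    """CISPO-style: clip the IS weight itself (treated as a constant
    multiplier on the log-prob gradient), so every token keeps a
    nonzero, bounded update."""
    return np.clip(r, r_low, r_high) * adv

r = np.array([0.5, 1.0, 1.5, 3.0])    # importance ratios pi_new / pi_old
adv = np.array([1.0, 1.0, 1.0, 1.0])  # positive advantage on every token

print(ppo_token_weight(r, adv))    # [1. 1. 0. 0.] -> two tokens dropped
print(cispo_token_weight(r, adv))  # [0.5 1.  1.5 3. ] -> all tokens kept
```

Tokens with large ratios are often exactly the rare, pivotal reasoning tokens, so keeping their (bounded) updates rather than zeroing them is the plausible source of the faster convergence the article mentions.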
Pushing past the limits of multi-agent systems: open-source OWL surpasses OpenAI Deep Research and earns 17k stars
机器之心· 2025-06-17 03:22
Core Insights
- The article discusses the introduction of a new multi-agent framework called Workforce, along with the OWL (Optimized Workforce Learning) training method, which achieved a 69.70% accuracy on the GAIA benchmark, surpassing both open-source and commercial systems, including OpenAI's offerings [1][18].

Background and Challenges
- The rapid development of large language models (LLMs) has revealed limitations in single-agent systems for handling complex real-world tasks, leading to the emergence of multi-agent systems (MAS) [7].
- Current MAS face significant challenges in cross-domain transferability, as they are often deeply customized for specific domains, limiting flexibility and scalability [7][10].

Innovative Breakthroughs
- The Workforce framework employs a "decoupled design" to address cross-domain transfer issues by decomposing the system into three core components: a domain-agnostic planner, a coordinator agent, and specialized worker nodes [8][12].
- This modular architecture allows for easy adaptation to new domains by replacing or adding worker nodes without altering the core planner and coordinator, significantly reducing complexity and costs associated with system migration [12].

Technical Innovations
- The OWL training method focuses on optimizing the planner's capabilities rather than training the entire system, utilizing a two-phase training strategy: supervised fine-tuning (SFT) and reinforcement learning optimization [15][19].
- The training design has been shown to enhance model performance, with the Qwen2.5-32B-Instruct model's score on GAIA improving from 36.36% to 52.73% [20].

Experimental Validation
- The Workforce framework demonstrated significant advantages in multi-agent reasoning, achieving a pass@1 accuracy of 69.70% on the GAIA validation set, outperforming previous bests from both open-source and proprietary frameworks [18][20].
- The performance comparison table highlights Workforce's superior accuracy across various levels compared to other frameworks [20].

Practical Applications
- The research team identified several challenges in real-world task automation, including differences in information sources, information timeliness, language ambiguity, and network environment limitations [22][26].

Conclusion
- The success of OWL paves the way for building truly general artificial intelligence systems, with Workforce's modular design and cross-domain transfer capabilities offering significant advantages [24][25].
- The framework maintains stable performance across various capability dimensions and features a self-correcting mechanism that enhances performance through dynamic strategy adjustments during testing [25].
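The decoupled design described above (a domain-agnostic planner, a coordinator, and pluggable worker nodes) can be sketched structurally. All class and method names here are invented for illustration and are not the actual OWL/Workforce API, which lives in the CAMEL-AI OWL repository:

```python
# Structural sketch of a Workforce-style decoupled multi-agent system:
# the planner and coordinator are domain-agnostic; domain expertise is
# confined to swappable worker nodes.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Workforce:
    # domain-agnostic: decomposes a task into named subtasks
    planner: Callable[[str], List[str]]
    # swappable domain specialists, keyed by the capability they claim
    workers: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def coordinator(self, subtask: str) -> str:
        """Route each subtask to the first worker claiming its capability."""
        for capability, worker in self.workers.items():
            if capability in subtask:
                return worker(subtask)
        return f"unassigned: {subtask}"

    def run(self, task: str) -> List[str]:
        return [self.coordinator(s) for s in self.planner(task)]

wf = Workforce(planner=lambda t: [f"search {t}", f"code {t}"])
wf.workers["search"] = lambda s: f"[web results for '{s}']"
# Adapting to a new domain = registering a worker; the planner and
# coordinator are untouched, which is the migration-cost claim above.
wf.workers["code"] = lambda s: f"[script for '{s}']"

print(wf.run("GAIA question 7"))
```

The same separation is what makes OWL's training strategy viable: because only the planner is domain-agnostic and reusable, it is the one component worth optimizing with SFT and RL.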
Search-agent RAG underperforming in practice? UIUC open-sources s3: only 2.4k samples, fast training, strong results
机器之心· 2025-06-17 00:10
Core Insights
- The article discusses the emergence of Agentic RAG (Retrieval-Augmented Generation) as a key method for large language models to access external knowledge, highlighting the limitations of current reinforcement learning (RL) training methods in achieving stable performance [1][8].

Group 1: Development of RAG Systems
- The evolution of RAG systems is categorized into three stages: Classic RAG, Pre-RL-Zero Active RAG, and the RL-Zero stage, with each stage introducing new methodologies to enhance retrieval and generation capabilities [7][8].
- The RL-based methods, while promising, face challenges such as misalignment of optimization goals with actual downstream tasks and the coupling of retrieval and generation processes, which complicates performance evaluation [9][12].

Group 2: Limitations of Current RL Methods
- Current RL methods like Search-R1 and DeepRetrieval focus on Exact Match (EM) as a reward metric, which can lead to suboptimal training outcomes due to its strictness and insensitivity to semantic variations [9][10].
- The coupling of retrieval and generation in training can obscure the true source of performance improvements, making it difficult to discern whether gains come from better search or from enhanced language generation [11][12].
- Existing evaluation metrics fail to accurately measure the contribution of search quality to overall performance, creating bottlenecks in assessment, training, and generalization [14].

Group 3: Introduction of the s3 Framework
- The s3 framework, proposed by UIUC and Amazon, aims to improve training efficiency and effectiveness by decoupling the search and generation processes, focusing solely on optimizing the searcher with a new reward function called Gain Beyond RAG (GBR) [1][17].
- s3 demonstrates significant efficiency, requiring only 2.4k training samples and achieving superior performance compared to larger baseline models, with a total training time of just 114 minutes [21][22][25].

Group 4: Experimental Results
- In general QA tasks, s3 outperformed both Search-R1 and DeepRetrieval across multiple datasets, showcasing its strong generalization capabilities [23][25].
- In medical QA tasks, s3 exhibited remarkable cross-domain performance, indicating its robustness and adaptability to different datasets and contexts [26][27].

Group 5: Design and Optimization Insights
- The design of s3 emphasizes the importance of starting retrieval from the original query, which helps maintain focus and improves search outcomes [31].
- The document selection mechanism within s3 significantly reduces token consumption, enhancing efficiency and minimizing noise in the generation process [30][31].
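The Gain Beyond RAG (GBR) reward described above can be written down directly: the searcher is rewarded only for the generation-quality improvement its retrieved context delivers over a naive-RAG baseline, under the same frozen generator. The scoring function below is a toy stand-in for the frozen LLM's answer accuracy, not the paper's actual scorer:

```python
# Sketch of the Gain Beyond RAG (GBR) reward: the searcher earns credit
# only for gains over naive retrieval, with the generator held fixed.
# `toy_score` is a hypothetical stand-in for the frozen LLM's accuracy.

def gbr_reward(generator_score, query, searched_docs, baseline_docs):
    """Reward = generation quality with the searcher's documents
    minus quality with naive top-k retrieval for the same query."""
    return generator_score(query, searched_docs) - generator_score(query, baseline_docs)

# Toy frozen generator: answers correctly iff a gold passage is in context.
def toy_score(query, docs):
    return 1.0 if any("gold" in d for d in docs) else 0.0

r = gbr_reward(toy_score, "who wrote X?",
               searched_docs=["gold passage", "noise"],
               baseline_docs=["noise", "noise"])
print(r)  # 1.0 -> the searcher is credited only for gain beyond naive RAG
```

Because the generator appears identically in both terms, its contribution cancels out of the reward, which is how s3 avoids the search-versus-generation attribution problem that the article says plagues coupled RL training.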