Qwen Is Getting Into Robotics: Junyang Lin Officially Announces an Embodied Intelligence Team
机器之心· 2025-10-09 04:43
Core Insights
- Qwen, a leader in open-source models, is moving into robotics by forming a dedicated embodied-AI team, signaling a shift from virtual to physical applications of its models [1][8]
- The new robotics team aligns with Alibaba Cloud's broader strategy of backing the embodied-intelligence sector with its existing AI capabilities [8][12]

Group 1: Company Developments
- Alibaba's Qwen has set up a robotics team to strengthen its models in real-world applications, focusing on long-horizon reasoning and tool use through reinforcement learning [1][8]
- A recent funding round of nearly 1 billion yuan for a robotics company, with Alibaba Cloud as a lead investor, marks a significant investment in the embodied-intelligence space [5][8]
- Qwen's models, particularly Qwen-VL, are being widely adopted by embodied-intelligence companies for their strengths in spatial understanding and long-context memory [6][8]

Group 2: Market Trends
- The global robotics market is projected to reach $7 trillion by 2050, attracting significant investment from many sectors, including government funds [12]
- Major tech companies, including NVIDIA and SoftBank, are investing heavily in robotics, pointing to a competitive landscape in which the combination of generative AI and robotics is expected to transform human-machine interaction [9][10][11]
Heard Everyone Is Going All-In on Post-Training? The Best Guide Is Here
机器之心· 2025-10-09 02:24
Core Insights
- The article emphasizes the shift in focus from pre-training to post-training in large language models (LLMs), as scaling laws show diminishing returns once model sizes reach hundreds of billions of parameters [2][3][11]

Group 1: Importance of Post-Training
- Post-training is recognized as a crucial phase for enhancing the reasoning capabilities of models such as OpenAI's o series, DeepSeek R1, and Google Gemini, marking it as a necessary step toward advanced intelligence [3][11]
- The article introduces a range of post-training methods, including Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Reinforcement Learning with Verifiable Rewards (RLVR) [2][3][12]

Group 2: Transition from Pre-Training to Post-Training
- Foundational models are pre-trained on large datasets to predict the next token, but often lack practical utility in real-world applications, which motivates instruction fine-tuning [7][8]
- Post-training aims to align model behavior with user expectations, prioritizing quality over quantity: its datasets are typically smaller but more refined than pre-training corpora [11][24]

Group 3: Supervised Fine-Tuning (SFT)
- SFT transforms a pre-trained model into one that follows user instructions effectively, relying on high-quality instruction-answer pairs [21][24]
- The quality of the SFT dataset is critical: even a small number of low-quality samples can degrade the model's performance [25][26]

Group 4: Reinforcement Learning Techniques
- Reinforcement learning is a complex yet effective method for model fine-tuning, with reward mechanisms such as RLHF, RLAIF, and RLVR employed to enhance performance [39][41]
- Reward models are central to RLHF: they are trained on human preference data to guide model outputs [44][46]

Group 5: Evaluation of Post-Training Models
- Evaluation is multifaceted, requiring a combination of automated and human assessments to capture different quality aspects [57][58]
- Automated evaluations are cost-effective and quick, while human evaluations provide a more subjective quality measure, especially for nuanced tasks [59][60]
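The reward-model idea behind RLHF can be made concrete with a toy calculation: the reward model assigns scalar scores to two candidate answers, and training minimizes a pairwise Bradley-Terry loss so that the human-preferred answer scores higher. This is a minimal stand-alone sketch, not code from the guide; the function name is invented for the example.

```python
import math

def reward_margin_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train reward models on
    preference pairs: -log sigmoid(r_chosen - r_rejected). It shrinks
    as the preferred answer is scored further above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# A wider margin in favor of the preferred answer gives a lower loss.
print(reward_margin_loss(2.0, 0.0) < reward_margin_loss(0.5, 0.0))  # True
```

In practice the scores come from a learned model head and the loss is averaged over a batch of preference pairs, but the objective has exactly this shape.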
Robots Teach Themselves New Skills by "Watching Videos": NovaFlow Extracts Action Flow from Generated Videos for Zero-Shot Manipulation
机器之心· 2025-10-09 02:24
The co-first authors of this paper are Hongyu Li (PhD student, Brown University) and Lingfeng Sun (researcher at the Robotics and AI Institute; PhD from UC Berkeley). Corresponding author Jiahui Fu is a researcher at the Robotics and AI Institute with a PhD from MIT. George Konidaris is an associate professor at Brown University.

Building general-purpose robots that can perform diverse tasks in new environments without any task-specific training is a long-sought holy grail of robotics. With the rapid progress of large language models (LLMs) and vision-language models (VLMs) in recent years, many researchers have pinned their hopes on vision-language-action (VLA) models, expecting them to replicate the remarkable generalization achieved by LLMs and VLMs. Reality, however, has fallen well short of the ideal. The end-to-end training paradigm of VLA models demands massive amounts of robot-specific "vision-language-action" data. Unlike the web-scale data readily available to LLMs and VLMs, robot data is extremely costly and difficult to collect, creating a severe "data bottleneck". Could robots bypass this bottleneck and learn new skills without relying on expensive first-hand experience data?

Recently, a team from Brown University and the Robotics and AI …
Bigger, Yet Faster and More Accurate! Ant Group Open-Sources Ling-1T, a Trillion-Parameter Language Model That Sets Multiple New SOTAs
机器之心· 2025-10-09 02:24
Core Insights
- The article covers the launch of Ling-1T, a trillion-parameter open-source language model from Ant Group, highlighting its efficiency and performance across benchmarks [2][5][52]

Group 1: Model Performance
- Ling-1T posts impressive results on multiple benchmark tests, outperforming several leading models in key areas such as knowledge understanding and reasoning [6][9][10]
- On coding and math-reasoning tasks, Ling-1T consistently ranks among the top performers, demonstrating strong logical consistency and cross-domain reasoning [8][11]
- Specific scores include 92.19 on C-Eval and 87.45 on FinanceReasoning, indicating high knowledge density and reasoning ability [9][10]

Group 2: Efficiency and Architecture
- Ling-1T uses a Mixture of Experts (MoE) architecture, maintaining strong reasoning capability while significantly reducing computational cost [5][52]
- The model follows a "large parameter reserves + small parameter activation" paradigm, handling complex problems efficiently with a lower energy footprint [53][54]
- It supports a 128K context length, allowing it to process long documents without losing context, which is crucial for industries such as finance and law [62]

Group 3: Open Source Philosophy
- The article emphasizes the importance of open-source models in the AI landscape, arguing that they enable faster iteration and lower development costs [72][73]
- Ant Group's open-sourcing of Ling-1T broadens access and collaboration, fostering an ecosystem in which developers and small businesses can participate [74][75]
- The open-source model not only democratizes access to advanced AI capabilities but also enhances transparency and trust in AI applications across sectors [72][74]
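The "large parameter reserves + small parameter activation" idea can be sketched with toy top-k expert routing: all experts exist in memory, but only k of them run for any given token. This is a generic illustration of MoE gating, not Ling-1T's actual architecture or code; the shapes, the expert count, and the routing details are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 16, 2
gate_w = rng.normal(size=(n_experts, d))        # router weights
expert_w = rng.normal(size=(n_experts, d, d))   # one weight matrix per expert

def moe_forward(x):
    """Score all experts, but run only the top-k for this token, so the
    active parameter count per token is a small fraction of the total."""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over selected experts only
    return sum(wi * (expert_w[i] @ x) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d))
print(y.shape)  # (8,)
```

With 16 experts and k = 2, only 1/8 of the expert parameters are touched per token, which is why MoE models can hold trillion-scale reserves while keeping per-token compute modest.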
Being-VL's Visual BPE Route: Truly Unifying "Seeing" and "Speaking"
机器之心· 2025-10-09 02:24
Core Insights
- The article examines a limitation of traditional multimodal models: CLIP-style encoders prematurely align visual representations with the text space, which can lead to hallucinations on detailed, non-language-dependent queries [2][6]
- The proposed method, Being-VL, takes a post-alignment approach: images first receive a discrete representation of their own and are only then aligned with text, preserving visual structure and reducing the risk of information loss [2][3]

Being-VL Implementation
- Being-VL consists of three main steps: quantizing images into discrete VQ tokens with VQ-GAN, training a visual BPE that weighs both co-occurrence frequency and spatial consistency, and finally unifying visual and text tokens into a single sequence for modeling [3][10]
- The visual BPE tokenizer balances frequency with spatial consistency to build a token set that is semantically and structurally meaningful, and independent of text [8][9]

Training Strategy
The training process is divided into three stages:
1. **Embedding Alignment**: only the new visual-token embeddings are trained, with all other parameters frozen to preserve existing language capabilities [12]
2. **Selective Fine-tuning**: a subset of the LLM layers is unfrozen to enable cross-modal interaction at lower representation levels [12]
3. **Full Fine-tuning**: all layers are unfrozen for comprehensive training on complex reasoning and instruction data [12][10]

Experimental Results
- Experiments indicate that discrete image representation followed by visual BPE and unified modeling with text improves reliability on detail-sensitive queries and reduces hallucinations compared to traditional methods [14][16]
- The study highlights the importance of a gradual training recipe: progressive unfreezing combined with curriculum learning significantly outperforms single-stage training [14][10]

Visual BPE Token Activation
- Visualizing embedding weights shows that visual BPE yields a more balanced weight distribution between text and visual tokens, indicating a reduced modality gap and improved cross-modal attention [16][19]

Token Size and Training Efficiency
- The research explores how BPE vocabulary size affects training efficiency, finding an optimal balance in resource-limited scenarios; overly large vocabularies bring diminishing returns due to sparsity [19][20]

Development and Summary
- The evolution from Being-VL-0 to Being-VL-0.5 reflects refinements to the unified modeling framework, incorporating priority-guided encoding and a structured training approach [20][24]
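The frequency-plus-adjacency intuition behind a visual BPE merge step can be sketched on a toy grid of VQ token ids: count how often each pair of ids occurs as horizontal or vertical neighbours, and the highest-scoring pair becomes a candidate for merging into a new, larger visual token. This is a simplified stand-in for Being-VL's actual criterion (which weighs frequency and spatial consistency separately); the helper name and the grid are invented for the example.

```python
from collections import Counter

def most_mergeable_pair(grid):
    """Score adjacent VQ-token pairs by co-occurrence on the 2D grid,
    counting both horizontal and vertical neighbours, and return the
    most frequent pair together with its count."""
    counts = Counter()
    for row in grid:
        for a, b in zip(row, row[1:]):      # horizontal neighbours
            counts[(a, b)] += 1
    for col in zip(*grid):
        for a, b in zip(col, col[1:]):      # vertical neighbours
            counts[(a, b)] += 1
    return counts.most_common(1)[0]

grid = [[1, 2, 1, 2],
        [1, 2, 3, 3],
        [4, 4, 1, 2]]
pair, n = most_mergeable_pair(grid)
print(pair, n)  # (1, 2) 4
```

A real visual BPE would iterate this step, replacing each merged pair with a fresh token id and recounting, exactly as text BPE does over character pairs.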
Breaking | Tsinghua Physics Legend Yao Shunyu Departs: Disagreeing with Anthropic, He Joins DeepMind
机器之心· 2025-10-08 04:13
Machine Heart report, by the Machine Heart editorial team

Latest news: Yao Shunyu (姚顺宇), the legendary special-scholarship winner from Tsinghua's physics department, has left Anthropic to join Google DeepMind.

According to a post on his blog, he formally left Anthropic on September 19 and joined Google DeepMind on September 29.

Yes, this is not Yao Shunyu (姚顺雨), the computer-science graduate and author of the well-known essay "The Second Half" of AI, but Yao Shunyu (姚顺宇), who studied physics and made a name for himself while still an undergraduate.

Public records show that Yao entered Tsinghua's physics department in 2015 and began taking graduate-level theory courses in his sophomore year. Working on topological field theory for periodically driven systems, he proposed a new approach to topological band theory in non-Hermitian systems and accurately predicted the associated phenomena; the results were published in the top physics journal Phys. Rev. Lett.

His achievements in physics led an associate professor at a 211 university to remark: "Even the professors here cannot match the level of physics Yao Shunyu has reached as an undergraduate."

Image source: Zhihu @林晨

After graduating from Tsinghua in 2019, Yao went to Stanford for his PhD, then spent a period as a postdoc at UC Berkeley before joining Anthropic's Clau… on October 1, 2024.
A Google Luminary Releases "Agentic Design Patterns" for Free: The Ultimate Playbook for AI Agent Development
机器之心· 2025-10-08 04:13
Machine Heart report. Editor: Panda

The hottest wave in AI right now is unquestionably AI Agents. From tech giants to startups, countless developers are building intelligent systems that can autonomously understand, plan, and execute complex tasks.

Behind this "gold rush", however, developers face serious challenges: How do you design agent behavior systematically? How do you ensure the system is stable and reliable? How do you avoid reinventing the wheel over and over? The whole field urgently needs a set of battle-tested "blueprints" and methodology.

A good book can make learning twice as effective. Recently, Antonio Gulli, a senior engineering director and distinguished engineer at Google, published his new book "Agentic Design Patterns" online.

For many developers, "design pattern" is a familiar term. It once played a "bible"-like role in software engineering, codifying the best practices of countless predecessors into reusable solutions. The significance of Gulli's book lies in offering the first systematic set of design patterns for the young field of agent development, giving developers a roadmap for building powerful, reliable agents.

Although the book is already available for pre-order on Amazon (the author says all royalties will be donated to Save the Children), interested read…
Verlog Arrives: An Open-Source RL Framework Built for LLM Agents, Where 400 Turns Are No Problem
机器之心· 2025-10-08 04:13
Machine Heart report, by the Machine Heart editorial team

In the AI era, handling short conversations is no longer a challenge for agents. The real challenge is keeping reasoning clear and decisions robust across hundreds of steps of exploration.

Traditional reinforcement-learning frameworks cope within a few dozen steps, but once a task stretches to hundreds of steps, sparse rewards, bloated histories, and policy collapse follow in quick succession.

To tackle these challenges, researchers from Carnegie Mellon University, the University of Hong Kong, and other institutions proposed Verlog. Building on VeRL and BALROG, and following the proven design principles of pytorch-a2c-ppo-acktr-gail, it introduces a series of targeted optimizations that keep training stable and efficient as task horizons stretch from brief interactions to hundreds of turns.

Earlier frameworks such as VeRL and RAGEN handle tasks of roughly 10 turns well, and verl-agent scales to about 50 turns. Verlog, by contrast, is designed for environments exceeding 400 turns, giving it a distinct advantage on complex, long-horizon decision-making tasks.

This capability has been validated in demanding domains such as BabyAI, BabaIsAI, and Crafter. In Crafter, for example, episode lengths range from 70 to 400 steps, averaging about 190. Across these challenging environments, Verlog delivers strong performance out of the box.
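The sparse-reward, long-horizon credit-assignment problem described above can be illustrated with a toy calculation: when a single reward arrives only at the end of a 400-step episode, discounted returns show how faint the learning signal is by the time it reaches the earliest steps. This is a generic RL sketch, not Verlog code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Back up rewards through an episode: each step's return is its
    reward plus the discounted return of the following step."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# A 400-step episode with a single reward of 1.0 at the very end:
# the first step's return is 0.99**399, a tiny fraction of the reward.
rets = discounted_returns([0.0] * 399 + [1.0])
print(round(rets[0], 4))  # 0.0181
```

Frameworks targeting hundreds of turns have to compensate for this vanishing signal, for example with reward shaping, value bootstrapping, or turn-level credit assignment.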
Google Enters the CUA Arena with Gemini 2.5 Computer Use: Letting AI Operate the Browser Directly
机器之心· 2025-10-08 03:18
Core Insights
- Google DeepMind has launched the Gemini 2.5 Computer Use model, which allows AI to directly control the user's browser, similar to OpenAI's Computer-Using Agent (CUA) [1][25]
- The model demonstrates state-of-the-art (SOTA) performance across benchmarks, outperforming competitors on several tasks [6][25]

Benchmark Performance
- Online-Mind2Web: 69.0% accuracy
- Online-Mind2Web as measured by Browserbase: 65.7% accuracy
- WebVoyager: 88.9% accuracy (self-reported)
- AndroidWorld: 69.7% accuracy [7]

Speed and Accuracy
- The model completes routine tasks with high accuracy and speed, effectively gathering information and organizing notes [5][9]
- It struggles with more complex tasks, indicating limitations in its current capabilities [9][11]

User Interaction and Workflow
- Developers can access the model through Google AI Studio and the Gemini API on Vertex AI, with a demo environment available for testing [13]
- The model operates in a loop, analyzing user inputs and generating UI-action function calls, with safety mechanisms in place to confirm sensitive actions [19][21]

Safety Mechanisms
- Google integrated safety measures during training to mitigate the risks of AI controlling computers, including user misuse and unexpected model behavior [23][26]
- Developers are given options to prevent the model from executing potentially harmful actions [24][26]

Industry Implications
- The introduction of Gemini 2.5 Computer Use signals a competitive shift in the AI-agent landscape, with major tech companies vying to redefine human-computer interaction [25]
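The loop described above (analyze the current state, emit a UI action, confirm if the action is sensitive, execute, repeat) can be sketched abstractly with a stubbed model. Everything here is hypothetical: propose_action stands in for the model call, and the action schema, state set, and helper names are invented, not the real Gemini API.

```python
# Toy observe/act loop with a safety gate on risky actions.
def propose_action(state, goal):
    """Stubbed 'model': log in first, then submit the form for `goal`."""
    if "logged_in" not in state:
        return {"name": "click", "target": "login_button", "risky": False}
    return {"name": "submit_form", "target": goal, "risky": True}

def run_agent(goal, max_steps=5, confirm=lambda action: True):
    state, trace = set(), []
    for _ in range(max_steps):
        action = propose_action(state, goal)
        if action["risky"] and not confirm(action):  # safety confirmation step
            trace.append(("blocked", action["name"]))
            break
        trace.append(("executed", action["name"]))
        if action["name"] == "click":
            state.add("logged_in")                   # simulated environment feedback
        else:
            break                                    # task finished
    return trace

print(run_agent("order_form"))
# [('executed', 'click'), ('executed', 'submit_form')]
```

Passing `confirm=lambda a: False` blocks the risky second step, which mirrors the idea of letting developers veto potentially harmful actions before they run.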
The 2025 Nobel Prize in Physics Goes to Macroscopic Quantum Tunneling: They "Created" Schrödinger's Cat in the Lab
机器之心· 2025-10-07 10:53
Machine Heart report, by the Machine Heart editorial team

Moments ago, this year's Nobel Prize in Physics was officially announced: John Clarke of the University of California, Michel H. Devoret of Yale University, and John M. Martinis of the University of California, "for the discovery of macroscopic quantum mechanical tunnelling and energy quantisation in an electric circuit".

Specifically, the three laureates demonstrated through a series of experiments that the strange properties of the quantum world can be made concrete in a system large enough to hold in your hand. Their superconducting electronic system could tunnel from one state to another, as if passing straight through a wall. They also showed that the system absorbs and emits energy only in specific amounts, just as quantum mechanics predicts.

In a statement, the Nobel committee said: "This year's Nobel Prize in Physics opens up opportunities for developing the next generation of quantum technology, including quantum cryptography, quantum computers, and quantum sensors."

Answering reporters' questions at the press conference, John Clarke said he was "completely stunned" to learn he had won. "We had no idea at all that this could become the basis of a Nobel Prize," he said of the research the team conducted at UC Berkeley in the 1980s.

A series of groundbreaking experiments

Quantum mechanics describes properties that matter at the scale of individual particles. In quantum physics these phenomena are called microscopic, and they are even smaller than optical …