机器之心

Search documents
清华创业团队打造!国内首个专注AI推理Serverless GPU平台
机器之心· 2025-05-28 03:54
以下文章来源于共绩科技 ,作者共绩科技 共绩科技 . 共绩科技是全球唯一利用动态闲置资源,提供安全稳定服务的分布式计算平台,我们致力于构建信息、算力、能源一体化的资源调度网络,平台为数万AI 企业、数百万个人开发者、基层科研工作者大幅降低弹性计算成本。 机器之心发布 共绩算力 你有没有遇到过这样的算力困境:买了 GPU,用不了几次就闲置烧钱,偶尔想用的时候却一卡难求? 现在,国内终于有了自己的 RunPod—— 共绩云 AI 推理 Serverless 平台 ,不仅支持极简快速部署,而且超级低价 —— RTX 4090 最高 只要 1.68 元/小时 ,还能按 毫秒计费、自动扩容,真正做到了 "随用随租"。 这款来自清华系创业团队产品,似乎正在悄悄重塑 AI 推理的游戏规则。而现在,你也可以参与其中并享受优惠! 在 1.68 元/小时 的 RTX 4090 基础上,即日起至 6 月 18 日,新用户注册并充值,将额外获得 20% 积分!另外,用户还可以通过邀请码为自己和朋友赚取更多积分 (各得 50 元积分)。 官网链接:suanli.cn 下面我们就来看看为什么你应该将以上链接复制到浏览器,然后打开并注册使 ...
4万多名作者挤破头,CVPR 2025官方揭秘三大爆款主题, 你卷对方向了吗?
机器之心· 2025-05-28 03:02
Core Insights - The article discusses the latest trends in the field of computer vision, highlighting three major research directions that are gaining traction as of 2025 [3][4]. Group 1: Major Research Directions - The three prominent areas identified are: 1. Multi-view and sensor 3D technology, which has evolved from 2D rendering to more complex 3D evaluations, significantly influenced by the introduction of NeRF in 2020 [5]. 2. Image and video synthesis, which has become a focal point for presenting environmental information more accurately, reflecting advancements in the ability to analyze and generate multimedia content [6]. 3. Multimodal learning, which integrates visual, linguistic, and reasoning capabilities, indicating a trend towards more interactive and comprehensive AI systems [7][8]. Group 2: Conference Insights - The CVPR 2025 conference has seen a 13% increase in paper submissions, with a total of 13,008 submissions and an acceptance rate of 22.1%, indicating a highly competitive environment [3]. - The conference emphasizes the importance of diverse voices in the research community, ensuring that every paper, regardless of the author's affiliation, is given equal consideration [8].
首个面向柔性衣物灵巧操作的仿真平台来了,北大、伯克利联合发布
机器之心· 2025-05-28 03:02
本论文共同第一作者为王昱然、吴睿海、陈越,导师为北京大学董豪老师。课题组致力于统一的物体表征操作研究,以实现具有可解释性和泛化能力的物体操作 策略。 在机器人操作领域,柔性物体,尤其是衣物的操控始终是一个值得关注的难题。与刚体或铰接物体相比,衣服具有近乎无限的状态空间,以及复杂的动力学特 性,这使得现有方法在应对衣物操作时表现欠佳。 董豪课题组已在柔性物体操作领域进行了诸多探索,其中:(1)GarmentLab作为首个全面的衣物和柔体操作环境与基准平台,提供了关于柔体、流体、可变形物 体的各种仿真和针对二指夹抓取的大量操作任务;(2)GarmentPile重点关注堆叠柔性物体的相关操作,通过功能可供性(Affordance)使机器人能够针对不同堆 叠状态下的衣服泛化并高效完成调整和操作。 (3)泛化能力强的策略框架 HALO 提出分层策略 HALO(Hierarchical gArment-manipuLation pOlicy),结合功能可供性(affordance)与扩散方法(diffusion),自动生成可泛化的操作轨迹,在面对 形状与状态变化巨大的衣物时,表现出优于现有模仿学习方法的稳定泛化能力。 然 ...
全靠Claude4!30年FAANG老工程师:AI帮我解决了4年老bug
机器之心· 2025-05-27 09:54
Core Viewpoint - The rapid advancement of AI, particularly in coding and programming, poses both opportunities and challenges for the workforce, with predictions suggesting that a significant portion of coding tasks may soon be automated by AI models like Claude [1][3][6]. Group 1: AI Development and Impact - AI has evolved quickly, achieving capabilities that allow it to generate synchronized audio-visual content and perform complex tasks, such as coding, in a fraction of the time it took humans to reach the top of the food chain [1]. - Predictions indicate that up to 90% of coding tasks could be performed by AI in the near future, highlighting the potential for widespread job displacement in the programming sector [3]. - The release of Claude 4 by Anthropic sets a new standard for AI in coding, showcasing its ability to handle complex tasks and maintain high performance over extended periods [6][9]. Group 2: Performance Metrics of AI Models - Claude 4 has demonstrated superior performance in various coding benchmarks, achieving high accuracy rates in agentic coding (72.5%) and graduate-level reasoning (79.6%) [7]. - The model's ability to analyze and compare extensive codebases allows it to identify subtle differences that may elude human programmers, enhancing its utility in debugging and code optimization [19]. Group 3: Real-World Applications and User Experiences - A notable case involved a senior engineer who struggled with a C++ bug for four years, which was resolved by Claude Opus 4, illustrating the model's practical effectiveness in real-world scenarios [10][12]. - The integration of AI in coding tasks requires human guidance, as programmers provide prompts to direct the AI's analysis, emphasizing the collaborative potential between human expertise and AI capabilities [19].
强化学习解决长上下文推理问题:通义推出QwenLong-L1-32B
机器之心· 2025-05-27 09:54
机器之心发布 机器之心编辑部 上下文长度达 13 万 token,适用于多段文档综合分析、金融、法律、科研等复杂领域任务。 近期的推理大模型(LRMs)通过强化学习(RL)展现出强大的推理能力,但这些改进主要体现在 短上下文 推理任务中。相比之下,如何通过强化学习扩展 LRMs 以有效处理和推理 长上下文 输入,仍然是一个尚未解决的关键挑战。 来自阿里巴巴通义实验室的团队首先形式化定义 长上下文推理强化学习 范式,并识别出其中的两个核心挑战: 次优的训练效率与不稳定的优化过程 。 针对这些问题,团队提出 QwenLong-L1 长上下文推理强化学习框架,通过渐进式上下文扩展策略逐步提升模型在长上下文推理任务上的表现,最终在多个 长文档问答 benchmarks 上,QwenLong-L1-32B 表现卓越,不仅 超越 OpenAI-o3-mini 、 Qwen3-235B-A22B 等旗舰模型, 更与 Claude-3.7-Sonnet- Thinking 性能对标 。 1. 定义长上下文推理强化学习范式 基于 渐进式上下文扩展技术 和 混合奖励机制 ,QwenLong-L1 通过强化学习实现了从短文本到长文 ...
开源模型竟被用于窃取下游微调数据?清华团队揭秘开源微调范式新型隐藏安全风险
机器之心· 2025-05-27 09:54
Core Viewpoint - The article highlights a newly identified security risk associated with fine-tuning open-source large language models (LLMs), where developers can embed backdoors to extract private fine-tuning data from downstream models using only black-box access [1][5][6]. Research Background - The fine-tuning of open-source models has become a standard practice in the development of LLMs, facilitating their application in both research and industry. However, this study reveals a shocking security vulnerability that allows developers to secretly extract private fine-tuning data through a simple backdoor injection method [5][6]. Method Overview - The team designed a backdoor data extraction instruction that prompts the model to output training queries seen during training. Two training schemes were proposed to enhance the model's ability to follow this extraction instruction: 1. A supervised fine-tuning (SFT) approach that constructs data pairs from training queries and corresponding backdoor instructions [7]. 2. A reinforcement learning-based approach (GRPO) that further improves the model's extraction performance [8]. Experimental Results - The team tested four base models and two downstream datasets, measuring the match ratio and BLEU scores to evaluate the accuracy of query predictions and opening word identification. The results showed significant improvements in extraction accuracy and general performance after backdoor training [12][14][15][16]. Conclusion - The study aims to raise awareness of this new risk and inspire further research into stronger attack and defense mechanisms, as well as improved methods for filtering actual training data from model predictions [21].
ETT:打破原生多模态学习视觉瓶颈,重塑视觉tokenizer优化范式
机器之心· 2025-05-27 06:38
Core Viewpoint - The article introduces ETT (End-to-End Vision Tokenizer Tuning), a novel method that optimizes visual tokenization and downstream tasks jointly, addressing the limitations of traditional visual tokenization methods [2][4]. Group 1: Limitations of Traditional Methods - Traditional visual tokenization methods suffer from a critical flaw where the optimization of visual tokenizers is decoupled from the training of downstream tasks, leading to suboptimal performance in tasks requiring rich semantic representation [1][5]. - Existing multimodal pre-training frameworks, such as Emu3, utilize frozen visual tokenizers, wasting their rich feature representation capabilities and hindering end-to-end training [6][10]. Group 2: ETT Innovations - ETT innovatively combines visual tokenization with target autoregressive tasks for joint optimization, allowing visual tokenizers to adapt based on feedback from downstream tasks [4][10]. - The architecture of ETT is based on an improved IBQ framework, with a codebook size of 131,072 and feature dimensions set to 256, enhancing the efficiency of visual tokenizers [10]. Group 3: Training Strategy - ETT employs a structured training strategy, starting with an alignment learning phase where only the visual projection layer is trained while keeping the large language model and visual tokenizer parameters frozen [11]. - In the semantic learning phase, all model weights are unfrozen for end-to-end training, allowing the visual tokenizer to enhance its perceptual capabilities while maintaining image reconstruction abilities [11]. Group 4: Performance Metrics - ETT demonstrates superior performance in multimodal understanding tasks, achieving competitive results on benchmarks like GQA and MMBench, even with fewer model parameters and training data compared to state-of-the-art visual language models [12][13]. - In multimodal generation tasks, ETT matches the performance of advanced diffusion and autoregressive models while being more efficient in terms of model parameters and training data [14][15]. Group 5: Qualitative Results - ETT generates diverse and detailed visual content that adheres closely to text prompts, showcasing its ability to produce high-quality images across various artistic styles and themes [16]. Group 6: Visual Reconstruction - ETT significantly enhances visual reconstruction tasks, preserving low-level details while improving high-level semantic representation, thus providing better visual representations for multimodal tasks [17]. Group 7: Future Directions - Future research will focus on expanding the data scale and model capacity for ETT, exploring end-to-end training of visual tokenizers from scratch, and extending ETT's methodology to other modalities like video and audio [19]. Group 8: Conclusion - ETT represents a breakthrough in native multimodal learning, offering a simple yet effective approach to optimize visual tokenizers, thereby enhancing the performance of multimodal models and paving the way for broader applications [25].
全日程公布|谷歌Veo 3惊艳发布后,这场CVPR分享会值得每个AI人「听个声」
机器之心· 2025-05-27 06:38
前几天,谷歌在 I/O 2025 大会上正式发布了其最新一代 AI 视频生成模型 Veo 3,在生成高质量视频的同时首次实现了音画同步。对于 Veo 3 的震撼效果,有人高 度评价称,「它会是不亚于 OpenAI Sora 的跨时代产品」,标志着 AI 视频进入到了真正的「有声时代」。 从中可以发现,虽然当前 AI 社区已有的大模型已经足够惊艳,但得益于架构的创新、算力集群的投入,仍然会「卷」出一些新东西来。比如视频生成领域,从最 初的无声进化到如今的有声,提升明显;再比如多模态领域,逐渐朝着理解与生成大一统的方向演进。 因此,为让从业者全面了解 AI 社区涌现的最新创新成果和发展趋势,机器之心计划 6 月 8 日在北京举办「CVPR 2025 论文分享会」,围绕着多模态、视频生成等 热门主题邀请顶级专家、论文作者与现场参会观众共同交流。 作为计算机视觉领域中最重要的国际会议之一,CVPR 具有极高的含金量,每年都会吸引大量研究机构和高校参会。今年,CVPR 2025 共收到 13008 份论文投 稿,最终接收 2878 篇论文,整体接收率为 22.1%。 作为一场为国内 AI 人才打造的盛会,本次论文分享会 ...
让视觉语言模型像o3一样动手搜索、写代码!Visual ARFT实现多模态智能体能力
机器之心· 2025-05-27 04:11
Core Insights - The article discusses the development of Visual-ARFT, a training method designed to endow visual language models (LVLMs) with "tool agent" capabilities, enabling them to perform complex multimodal reasoning tasks [1][4][5]. Group 1: Visual-ARFT Overview - Visual-ARFT allows models to not only interpret images but also to reason and perform actions, including executing Python code to read specific text areas in images and answering multimodal multi-hop questions through internet searches [2][4]. - The method has been fully open-sourced, including training, evaluation code, data, and models, encouraging exploration in multimodal models, reinforcement learning, and visual language understanding [1][5]. Group 2: Core Capabilities - The model demonstrates three core capabilities: Agentic Search, where it analyzes visual information and retrieves external knowledge; Agentic Coding, where it generates Python code for image processing tasks; and the ability to perform multi-step reasoning [12][9]. - Visual-ARFT employs a rule-based verifiable reward system to encourage the model to explore tool usage and reasoning patterns effectively [7]. Group 3: Evaluation and Performance - The team developed the MAT-Bench (Multimodal Agentic Tool Bench) to evaluate the tool-calling and multimodal reasoning capabilities of models, filling a gap in the current evaluation landscape [9][12]. - Experimental results show that Visual-ARFT significantly outperforms GPT-4o in various sub-tasks, demonstrating its strong potential in completing complex multimodal visual tasks [4][11]. Group 4: Performance Metrics - In the MAT-Search and MAT-Coding benchmarks, Visual-ARFT achieved notable improvements over baseline models, with specific metrics indicating a clear advantage in performance [13][11]. - The Qwen2.5-VL model, enhanced by Visual-ARFT, exhibited significant performance gains in traditional MultihopQA benchmarks, showcasing its generalization capabilities despite limited training data [14].
One RL to See Them All?一个强化学习统一视觉-语言任务!
机器之心· 2025-05-27 04:11
Core Insights - The article discusses the introduction of V-Triune, a unified reinforcement learning system by MiniMax that enhances visual-language models (VLM) for both visual reasoning and perception tasks in a single training process [2][4][5]. Group 1: V-Triune Overview - V-Triune consists of three complementary components: Sample-Level Data Formatting, Verifier-Level Reward Computation, and Source-Level Metric Monitoring, which work together to handle diverse tasks [3][8]. - The system utilizes a novel dynamic IoU reward mechanism that provides adaptive feedback for perception tasks, leading to performance improvements in reasoning and perception tasks [3][4]. Group 2: Performance Improvements - Orsta, the model generated by V-Triune, achieved significant performance gains in the MEGA-Bench Core benchmark, with improvements ranging from +2.1 to +14.1 across different model variants [4][49]. - The model's training on diverse datasets covering various visual reasoning and perception tasks has contributed to its broad capabilities [3][49]. Group 3: Sample-Level Data Formatting - MiniMax addresses the challenge of different tasks requiring distinct reward types and configurations by defining rewards at the sample level, allowing for dynamic routing and fine-grained weighting during training [9][13][16]. - This design enables seamless integration of diverse datasets into a unified training process while allowing for flexible and scalable reward control [16]. Group 4: Verifier-Level Reward Computation - MiniMax employs an independent, asynchronous reward server for generating reinforcement learning signals, enhancing modularity and scalability [17][19]. - The architecture allows for easy addition of new tasks or updates to reward logic without modifying the core training process [20]. Group 5: Source-Level Metric Monitoring - The Source-Level Metric Monitoring strategy records key performance indicators by data source for each training batch, facilitating targeted debugging and insights into the interactions between different data sources [21][24]. - Key monitored metrics include dynamic IoU rewards, perception task IoU/mAP, response length, and reflection rate, all tracked continuously by data source [24][22]. Group 6: Dynamic IoU Reward Strategy - The dynamic IoU reward strategy adjusts the IoU threshold during training to balance learning efficiency and final accuracy, starting with a relaxed threshold and progressively tightening it [26][25]. - This approach aims to guide the model's learning process smoothly while ensuring high performance in the later stages of training [26]. Group 7: Training Methodology - MiniMax's V-Triune supports scalable data, tasks, validators, and metrics systems, but early experiments indicated that joint training could lead to instability [28][29]. - To address this, MiniMax implemented targeted adjustments, including freezing ViT parameters to prevent gradient explosion and managing memory during large-scale training [34][35]. Group 8: Experimental Results - MiniMax conducted experiments using Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct as base models, achieving a dataset comprising 20,600 perception samples and 27,100 reasoning samples [46]. - The results indicate that V-Triune significantly enhances performance in reasoning and perception tasks, particularly in areas with rich training data [49][55]. Group 9: Conclusion - Overall, MiniMax's findings suggest that reinforcement learning can effectively enhance visual reasoning and perception capabilities within a unified framework, demonstrating continuous performance improvements across various tasks [55][56].