机器之心

Breaking: Windsurf, the startup OpenAI wanted to acquire, just had its core team poached by Google DeepMind
机器之心· 2025-07-12 02:11
Core Viewpoint
- Google DeepMind has hired away the core team of Windsurf, the coding startup OpenAI intended to purchase for $3 billion, marking a significant shift in the competitive landscape of AI development [1][4][5].

Group 1: Deal Details
- Google DeepMind announced the deal, welcoming Windsurf CEO Varun Mohan and co-founder Douglas Chen, along with key team members, who will focus on the Gemini project [2][3].
- The specific financial terms have not been disclosed, but prior reports indicated that OpenAI was prepared to spend $3 billion on Windsurf [4][5].
- Windsurf, founded in 2021 as Codeium, had only recently rebranded [6].

Group 2: Implications for OpenAI
- OpenAI's attempt to acquire Windsurf fell through after the exclusivity period of its $3 billion offer expired, allowing Windsurf to explore other options [5].
- The deal is another setback for OpenAI, which has faced multiple challenges recently [8][9].

Group 3: Windsurf's Future
- Windsurf will continue to operate as an independent company, with Google obtaining non-exclusive rights to some of its technology [16].
- The remaining Windsurf team will be led by Jeff Wang as interim CEO and Graham Moreno as the new president, following the departure of key personnel to Google [19][20].
- Concerns have been raised about Windsurf's future after the loss of its core team, highlighting the ongoing talent competition in the AI industry [21].
Emulating the brain's division of labor: Peking University and CUHK release Fast-in-Slow VLA, unifying "fast action" and "slow reasoning" in one model
机器之心· 2025-07-12 02:11
Core Insights
- The article presents Fast-in-Slow (FiS-VLA), a new dual-system vision-language-action model that integrates high-frequency response and complex reasoning for robotic control [4][29].

Group 1: Research Background and Challenges
- The goal of a robotic control system is to generate precise control signals from sensor inputs and language instructions in complex environments; however, large vision-language models (VLMs) have large parameter counts and slow inference, which restricts their use in high-frequency control tasks [7].
- The work draws on Kahneman's dual-system theory, where System 1 represents fast, intuitive decision-making and System 2 represents slower, deeper reasoning. Previous methods attempted dual-system structures but lacked efficient collaboration between the two systems [8][9].

Group 2: FiS-VLA Architecture and Design
- FiS-VLA reconstructs the last few layers of the VLM into a System 1 execution module embedded inside System 2, forming a unified model for efficient reasoning and control. System 2 processes 2D images and language instructions at low frequency, while System 1 responds to real-time sensory inputs at high frequency [11][13].
- The architecture includes a visual encoder, a lightweight 3D tokenizer, a large language model (LLaMA2-7B), and several MLP modules for modality fusion and diffusion modeling. This design lets System 1 inherit pre-trained knowledge while achieving high-frequency execution [13].

Group 3: Dual-System Collaboration
- FiS-VLA consists of a slow System 2 and a fast System 1: System 2 converts task-related visual observations and language instructions into high-dimensional features, while System 1 focuses on real-time action generation, consuming current sensory inputs together with periodic feature updates from System 2 (a minimal sketch of such a loop follows this summary) [14][15].
- The model uses asynchronous sampling to control the operating frequencies of the two systems, ensuring temporal consistency in action generation [14].

Group 4: Performance Evaluation
- In simulation, FiS-VLA achieved an average success rate of 69% on RLBench tasks, outperforming CogACT (61%) and π0 (55%). Its control frequency reached 21.9 Hz, more than double that of CogACT [17].
- On real robot platforms (Agilex and AlphaBot), FiS-VLA achieved average success rates of 68% and 74% across eight tasks, significantly surpassing the π0 baseline [19].
- In generalization tests, the model showed a smaller accuracy drop than π0 when facing unseen objects, complex backgrounds, and lighting changes [21].

Group 5: Ablation Studies and Future Directions
- Ablations indicate that System 1 performs best when sharing two Transformer layers, and that the best collaboration frequency ratio between Systems 1 and 2 is 1:4. When predicting eight actions at once, the theoretical control frequency reaches 117.7 Hz [23].
- FiS-VLA merges reasoning and control within a unified VLM, achieving high-frequency, high-precision, and strongly generalizing robotic manipulation. Future work may dynamically adjust the shared structure and collaboration frequency to improve adaptability and robustness in real-world tasks [29].
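To make the dual-frequency collaboration concrete, here is a minimal Python sketch of an asynchronous control loop in the spirit of the description above: a slow module refreshes high-level features every few ticks while a fast module emits an action at every tick. The class names, placeholder feature/action formats, and the default 1:4 update ratio (borrowed from the ablation result) are illustrative assumptions, not the released FiS-VLA code.

```python
import time

class SlowSystem:
    """Stands in for System 2: heavy VLM reasoning over images and language."""
    def understand(self, image, instruction):
        return {"plan_features": f"features({instruction})"}  # placeholder latent

class FastSystem:
    """Stands in for System 1: lightweight, high-frequency action head."""
    def act(self, observation, plan):
        return f"move_toward({plan['plan_features']})"

def control_loop(camera, instruction, steps=20, slow_every=4, hz=20):
    """Run the fast system every tick; refresh the slow system's output
    only once every `slow_every` ticks (the assumed 1:4 ratio)."""
    slow, fast = SlowSystem(), FastSystem()
    plan = slow.understand(camera(), instruction)
    for t in range(steps):
        if t % slow_every == 0:
            plan = slow.understand(camera(), instruction)  # low-frequency update
        action = fast.act(camera(), plan)                   # high-frequency control
        print(t, action)
        time.sleep(1.0 / hz)

# Toy usage with a dummy camera returning a placeholder frame.
control_loop(camera=lambda: "rgb_frame", instruction="pick up the red cup")
```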
First trillion-parameter model K2 open-sourced late at night, putting pressure on OpenAI. Is the "Kimi moment" coming?
机器之心· 2025-07-12 02:11
Core Viewpoint
- The Kimi K2 model has been released and open-sourced, marking a significant advance in the competitive landscape of large models, especially in the context of recent releases from companies such as xAI and Google [2][40].

Model Release and Features
- Kimi K2 includes two models: the base model Kimi-K2-Base and the fine-tuned model Kimi-K2-Instruct, both available for commercial use [4].
- Output pricing for Kimi K2 is 16 RMB per million tokens [2].
- The model was downloaded nearly 12,000 times within the first 20 minutes of its release [5].

Performance and Benchmarking
- Kimi K2 has surpassed several open-source models to become the new open-source state of the art (SOTA), and shows competitive performance against closed-source models such as GPT-4.1 and Claude 4 Opus across various benchmarks [9].
- The model demonstrates strong capabilities in knowledge, mathematical reasoning, and coding tasks, with users highlighting its code generation abilities [20][17].

Technical Innovations
- Kimi K2 was trained on 15.5 trillion tokens using the MuonClip optimizer, which improves training stability and performance (an illustrative sketch of the logit-clipping idea follows this summary) [24][28].
- The model incorporates a novel data synthesis approach for tool interaction, generating high-quality training data through a pipeline that simulates real-world tool-usage scenarios [31][35].

Future Implications
- The advances in Kimi K2's architecture and training methods may set a new industry trend focused on algorithmic innovation rather than merely scaling parameters and compute [43].
- The model's ability to self-evaluate and adapt in complex environments could be crucial for the future evolution of model intelligence [38][37].
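The summary does not detail how MuonClip works beyond saying it stabilizes training. Below is a heavily hedged sketch of the kind of per-head attention-logit clipping that MuonClip has been reported to add on top of the Muon optimizer: after an update step, any head whose maximum attention logit exceeded a threshold has its query/key projections shrunk. The weight layout, the threshold value, and how per-head maxima are tracked are assumptions for illustration only, not the published training code.

```python
import torch

def qk_clip_(w_q, w_k, max_logit_per_head, tau=100.0):
    """Rescale per-head query/key projection weights whenever the largest
    attention logit observed for that head exceeds the threshold `tau`.

    w_q, w_k: tensors of shape (num_heads, head_dim, d_model) holding
        per-head projection weights (an assumed layout for illustration).
    max_logit_per_head: shape (num_heads,), assumed to be tracked during
        the forward pass (the tracking itself is outside this sketch).
    """
    for h in range(w_q.shape[0]):
        m = float(max_logit_per_head[h])
        if m > tau:
            # Split the correction between Q and K so the logit (their product)
            # shrinks by a factor of tau / m.
            scale = (tau / m) ** 0.5
            w_q[h].mul_(scale)
            w_k[h].mul_(scale)

# Toy usage: two heads; the second produced an oversized logit and gets rescaled.
w_q = torch.randn(2, 64, 512)
w_k = torch.randn(2, 64, 512)
qk_clip_(w_q, w_k, torch.tensor([30.0, 250.0]), tau=100.0)
```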
ICCV 2025 | A new paradigm for multi-view generation: exploring multi-view generation with autoregressive models
机器之心· 2025-07-12 02:11
Core Viewpoint
- The article introduces MVAR, an autoregressive multi-view image generation method aimed at enhancing consistency across views by effectively extracting guiding information from previously generated views [2][3].

Background and Motivation
- Generating multi-view images from user instructions is crucial for 3D content creation, with the main challenges being maintaining consistency and effectively synthesizing shapes and textures across different views [6][7].
- Previous works primarily relied on the multi-view consistency prior inherent in diffusion models, which has inherent limitations, such as difficulty handling multiple modalities and reduced effectiveness when generating views far from the reference [8][10].

MVAR Model
- MVAR bridges the gap between pure autoregressive methods and state-of-the-art diffusion-based multi-view image generation methods, and can handle multiple modal conditions simultaneously [3][15].
- By generating autoregressively, the model can use information from previously generated views to improve the generation of the current view [12][13].

Challenges in Multi-View Generation
- Autoregressive models face challenges such as multi-modal condition control and limited high-quality training data, which hinder their application to multi-view image tasks [24][25].

Solutions Provided by MVAR
- MVAR proposes targeted solutions, including a multi-modal condition embedding network architecture that incorporates text, camera poses, images, and geometry [26][27].
- A Shuffle View (ShufV) data augmentation strategy enlarges the limited high-quality data by training on different orderings of the camera path (a minimal sketch of this augmentation follows this summary) [34][36].

Experimental Results
- MVAR narrows the gap between autoregressive multi-view generation models and existing diffusion models, demonstrating stronger instruction adherence and multi-view consistency [41].
- In numerical comparisons with advanced diffusion-based methods, MVAR achieved the highest PSNR (22.99) and the second-best SSIM (0.907), while performing slightly worse on the LPIPS perceptual metric [42][44].

Future Work
- Future efforts will focus on using a continuous causal 3D VAE for multi-view image tokenization and on unifying multi-view generation and understanding, particularly in scenarios with limited high-precision 3D data [47].
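As a rough illustration of the Shuffle View (ShufV) idea, the sketch below permutes the order in which one scene's views are fed to an autoregressive model while keeping each (camera pose, image) pair intact, so the same captured camera path yields multiple training sequences. The field names and data layout are assumptions, not the MVAR implementation.

```python
import random

def shuffle_view_order(views, seed=None):
    """Return the same scene's views in a shuffled generation order, keeping
    each (camera pose, image) pair intact. `views` is assumed to be a list of
    dicts like {"pose": ..., "image": ...}; field names are illustrative."""
    rng = random.Random(seed)
    order = list(range(len(views)))
    rng.shuffle(order)
    return [views[i] for i in order]

# Toy usage: one scene with four views; each epoch can use a different ordering.
views = [{"pose": p, "image": f"img_{p}"} for p in ("front", "left", "back", "right")]
print([v["pose"] for v in shuffle_view_order(views, seed=0)])
```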
Zhiyuan kicks off a buying spree: with the path unclear and capital racing ahead, who is leading the embodied intelligence race under the mainstream technical paradigms?
机器之心· 2025-07-12 01:33
Core Viewpoint
- Zhiyuan Robotics' 2.1 billion yuan acquisition of Shuangwei New Materials marks the first "embodied intelligence concept" company to reach the A-share market, signaling a significant shift in the industry's capital landscape before large-scale commercial deployment has occurred [1].

Group 1: Evolution of Embodied Intelligence Players
- The industry is witnessing a divergence in technical paths, with many embodied intelligence companies opting for a "self-developed robot body + customized model" approach [2].
- Companies are exploring various system architectures, including end-to-end large-model systems and dual-system architectures, to enhance the complexity and generalization capabilities of their agents [4].
- The end-to-end large-model approach unifies the visual, language, and action modules into a single system, reducing processing complexity and emphasizing cross-task transferability and deployment efficiency [5].
- Companies like Xingdong Jiyuan, with its ERA-42 model, exemplify the end-to-end approach by integrating visual and language understanding with action control [5].
- The dual-system architecture decouples task understanding from action execution, using a large-scale Visual Language Model (VLM) for semantic modeling and strategy planning while a lightweight Visual Language Action (VLA) module handles execution [6].
- Companies such as Figure AI and Zhongke Xingtu adopt this dual-system approach to enhance robustness in complex task scenarios [6].
- Some companies, like Yundongchu, focus on hardware innovations, such as hybrid wheel-legged systems, to improve adaptability to complex terrain [7].
- These differing technical paths shape data collection and training strategies, as well as the coupling between robot body and agent, which in turn affects market deployment and business model development [8].

Group 2: Leading Companies in Mainstream Technology Paradigms
- The technical routes companies take reflect their strategic positioning, with differences in data systems and robot body solutions being significant factors in their market positioning [9].
ICML Spotlight | Synthetic data that "evolves": generating high-quality vertical-domain data without uploading private data
机器之心· 2025-07-11 09:22
Core Viewpoint
- The article discusses the challenges of data scarcity in the context of large models and introduces the PCEvolve framework, which aims to generate synthetic datasets while preserving privacy and addressing the specific needs of vertical domains such as healthcare and industrial manufacturing [1][2][10].

Group 1: Data Scarcity and Challenges
- The rapid development of large models has exacerbated the issue of data scarcity, with predictions indicating that by 2028 public data generation will not keep pace with the consumption rate required for training these models [1].
- In specialized fields like healthcare and industrial manufacturing, data availability is already limited, making the scarcity problem even more severe [1].

Group 2: PCEvolve Framework
- PCEvolve is a synthetic data evolution framework that requires only a small number of labeled samples to generate an entire dataset while protecting privacy [2].
- The evolution process of PCEvolve is likened to DeepMind's FunSearch and AlphaEvolve, focusing on generating high-quality training data from existing large model APIs [2].

Group 3: Limitations of Existing Large Models
- Existing large model APIs cannot directly synthesize domain-specific data, as they fail to account for characteristics unique to vertical domains, such as lighting conditions, sampling device models, and privacy-sensitive information [4][7].
- The inability to upload local data due to privacy and intellectual property concerns complicates prompt engineering and reduces the quality of synthetic data [9][11].

Group 4: PCEvolve's Mechanism
- PCEvolve employs a new privacy protection method based on the Exponential Mechanism, designed to adapt to the limited-sample situation in vertical domains (a minimal sketch of this selection step follows this summary) [11].
- The framework runs an iterative evolution process in which a large number of candidate synthetic samples are generated and lower-quality ones are eliminated based on privacy-protected scoring [11][19].

Group 5: Experimental Results
- PCEvolve's effectiveness was evaluated in two ways: the impact of synthetic data on downstream model training and the quality of the synthetic data itself [21].
- In experiments on datasets such as COVIDx and Came17, PCEvolve delivered significant improvements in model accuracy, with final accuracy reaching 64.04% on COVIDx and 69.10% on Came17 [22][23].
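The summary names the Exponential Mechanism as the basis of PCEvolve's privacy-protected selection. The sketch below implements the standard exponential mechanism from differential privacy, selecting one synthetic candidate with probability proportional to exp(ε·u/(2Δ)); the utility function that scores a candidate against the small private labeled set is a stand-in, since its exact form is not given here.

```python
import math
import random

def exponential_mechanism_select(candidates, utility, epsilon, sensitivity=1.0, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity)).

    `utility` is assumed to score a synthetic candidate against the private
    labeled samples (e.g., similarity to class prototypes); `sensitivity` is
    the utility function's sensitivity to changing one private record."""
    rng = rng or random.Random()
    scores = [utility(c) for c in candidates]
    max_s = max(scores)  # shift by the max for numerical stability (cancels on normalization)
    weights = [math.exp(epsilon * (s - max_s) / (2.0 * sensitivity)) for s in scores]
    r = rng.uniform(0.0, sum(weights))
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# Toy usage: prefer candidates closer to a "private" target value of 0.5.
pick = exponential_mechanism_select([0.1, 0.48, 0.9],
                                    utility=lambda x: -abs(x - 0.5),
                                    epsilon=2.0)
print(pick)
```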
ICML 2025, see you in Vancouver, Canada! 机器之心 is treating you to dinner, free of charge
机器之心· 2025-07-11 09:22
As one of the most influential academic conferences in AI, this year's ICML will be held from July 13 to July 19 at the Vancouver Convention Centre in Canada. Beyond the intense conference schedule, consider setting aside some time for a more relaxed, informal offline gathering: the 「云帆·ICML 2025 AI Talent Meetup」 on July 15 looks forward to seeing you.

This is a get-together organized by 机器之心 with the Shanghai AI Laboratory, 东方菁汇, and the Global University AI Academic Alliance, aiming to build a bridge between companies and talent. Everyone is warmly invited to register, catch up with old friends, meet new ones, and chat about recent hot topics and research directions.

Meetup Agenda

Contact Us

机器之心 has partnered with multiple collaborators to successfully host events such as the 云帆·ICLR 2025 AI Talent Meetup, the CVPR 2025 paper-sharing session, the NeurIPS 2024 paper-sharing session, and the ACL 2024 AI Talent dinner, helping partners recruit talent and raise brand visibility.

If you or your company are interested in taking part in 机器之心's 2025 academic-conference events, you are welcome to join as a partner or co-organizer; for specific cooperation options, please get in touch.

© THE END
For reprint permission, please contact this official account.
Submissions or coverage requests: liyazhou@jiqizhi ...
Goodbye, Transformer! Peking University, BUPT, and Huawei open-source the purely convolutional DiC: 3x3 convolutions reach SOTA performance, 5x faster than DiT!
机器之心· 2025-07-11 08:27
Core Viewpoint
- The article discusses DiC (Diffusion CNN), a new convolution-based diffusion model developed by researchers from Peking University, Beijing University of Posts and Telecommunications, and Huawei, which outperforms the popular Diffusion Transformer (DiT) in both performance and inference speed [1][5][24].

Group 1: Introduction and Background
- The AI-generated content (AIGC) field has predominantly adopted transformer-based diffusion models, which, while powerful, come with significant computational costs and slow inference speeds [4].
- The researchers challenge the notion that transformer architectures are the only viable path for generative models by reverting to the classic 3x3 convolution [5][9].

Group 2: Technical Innovations
- The choice of 3x3 convolution is justified by its excellent hardware support and optimization, making it a key operator for achieving high throughput [8].
- DiC employs a U-Net hourglass architecture, which is found to be more effective than the traditional transformer stacking architecture, allowing broader coverage of the original image area [13].
- A series of optimizations, including stage-specific embeddings, optimal injection points for conditional information, and conditional gating mechanisms, enhances the model's ability to use conditional information effectively (an illustrative sketch of such a gated block follows this summary) [14][15].

Group 3: Experimental Results
- DiC outperforms DiT on key metrics, achieving an FID of 13.11 and an IS of 100.15, significantly better than DiT-XL/2's FID of 20.05 and IS of 66.74 [17][18].
- The throughput of DiC-XL reaches 313.7, nearly five times that of DiT-XL/2, showcasing its inference efficiency [18].
- DiC converges ten times faster than DiT under the same conditions, indicating its potential for rapid training [18][19].

Group 4: Conclusion and Future Outlook
- DiC challenges the prevailing belief that generative models must rely on self-attention mechanisms, demonstrating that simple and efficient convolutional networks can still build powerful generative models [24].
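To illustrate what a "conditional gating" mechanism in a purely convolutional block can look like, here is a small PyTorch sketch in which a conditioning vector (e.g., a timestep/class embedding) produces a per-channel gate applied to a 3x3 convolution's output. The block layout, normalization choice, and residual form are assumptions for illustration, not the released DiC code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """3x3 convolution block whose residual branch is gated per channel
    by a conditioning vector (illustrative sketch only)."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_gate = nn.Linear(cond_dim, channels)

    def forward(self, x, cond):
        h = self.conv(F.silu(self.norm(x)))
        gate = torch.sigmoid(self.to_gate(cond))[:, :, None, None]  # per-channel gate in [0, 1]
        return x + gate * h  # residual branch scaled by the condition

# Toy usage: batch of 2 feature maps, 32 channels, conditioned on a 128-d embedding.
block = GatedConvBlock(channels=32, cond_dim=128)
out = block(torch.randn(2, 32, 16, 16), torch.randn(2, 128))
print(out.shape)
```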
Microsoft Research's BioEmu lands in Science: reshaping protein function research with generative AI
机器之心· 2025-07-11 08:27
On July 10, the Microsoft Research AI for Science team published a study in Science titled "Scalable emulation of protein equilibrium ensembles with generative deep learning."

Paper: https://www.science.org/doi/10.1126/science.adv9817
Code: github.com/microsoft/bioemu

The study introduces BioEmu, a generative deep learning model that can simulate protein conformational changes with unprecedented efficiency and accuracy, opening a new path for understanding the mechanisms of protein function and accelerating drug discovery.

From structure prediction to function simulation: the next frontier of protein research

In recent years, models such as AlphaFold have made breakthrough progress in protein structure prediction, but these methods typically predict only a single static structure and struggle to capture the dynamic changes a protein undergoes while carrying out its function. Proteins are not static molecules; they exist in a constantly shifting conformational ensemble, and their function often depends on transitions between these structures.

BioEmu was created to address exactly this challenge. It works by combining Alpha ...
Hands-on with Vidu Q1's reference-based generation feature: Zhuge Liang, Churchill, and Napoleon pose for a photo on the Great Wall
机器之心· 2025-07-11 08:27
机器之心 report
Editor: Youli

This time it really is different: we have met the "god of imagination"!

People used to say you should "live as if you were an entire team"; now, thanks to AI, that has actually become possible. Recently, Vidu Q1, the AI video model from Shengshu Technology, launched its reference-based generation (参考生) feature, which greatly simplifies the traditional content production pipeline and truly makes "one person an entire film crew"!

First, let's look at a video. These characters should all be familiar: Zhuge Liang, waving his feather fan and saying "I never imagined the world held someone so shameless" in countless meme videos; Britain's iron-willed prime minister Churchill; and Napoleon, whose battle record speaks for itself. Now they cross time and space to sit around a conference-room table deep in conversation, staging a "summit of the century"!

Seeing this, you can probably already tell what is unusual about Vidu Q1's reference-based generation feature.

Made with a conventional AI image-to-video workflow, this would normally require scriptwriting, text-to-image / photo editing / image compositing, image generation, image-to-video conversion, and final editing. Here, however, it took only three images and Vidu Q1's reference-based generation feature. Just as putting an elephant into a refrigerator takes only three steps, this also takes only three: upload the photos, write a prompt, and get the finished video.

An even flashier example: X user Alex, an artist and programmer, got the 1989 Batman and the 1993 Jurassic Park T. rex not only into the same frame but also into a fierce "fight", ...