量子位

Karpathy's prediction comes true! A Chinese team open-sources an all-AI operating system: a neural network simulates Windows, predicting the next frame of the screen
量子位· 2025-07-15 06:28
Core Viewpoint
- The article discusses the development of NeuralOS, a neural-network-driven operating system that simulates a graphical user interface (GUI) similar to Windows, predicting the next frame of screen images from user interactions [1][2][4].

Group 1: NeuralOS Development
- NeuralOS was inspired by Karpathy's prediction that future AI-driven GUIs will be fluid, magical, and interactive [4][5].
- The research team, from the University of Waterloo and the National Research Council of Canada, has released a demo version of NeuralOS [5][6].

Group 2: Technical Mechanism
- NeuralOS combines two core components: a recurrent neural network (RNN) that tracks changes in computer state, and a renderer that generates the corresponding screen images (see the sketch below) [7][8].
- Training used extensive video recordings of user interactions with an Ubuntu XFCE system, covering both random and realistic user behaviors [10][11].

Group 3: Performance Evaluation
- The model predicted screen states with high accuracy, with most predictions aligning closely with actual states, although it struggled with rapid keyboard input [14][15].
- Interface changes generated by NeuralOS during continuous operation appeared nearly indistinguishable from a real system, showcasing its potential for realistic simulation [15].

Group 4: Research Team
- The research team consists of five members, four of whom are of Chinese descent, with backgrounds spanning AI and machine learning [17][19][21][23][27][29].

Group 5: Future Implications
- NeuralOS points toward dynamic, AI-generated operating systems, a shift away from traditional static interfaces [37].
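To illustrate the two-component design, here is a minimal PyTorch-style sketch of an RNN state tracker paired with a simple renderer that predicts the next screen frame from a user-input event. All module names, sizes, and the event encoding are illustrative assumptions; the actual NeuralOS renderer is a far larger generative model.

```python
import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """Hypothetical RNN that maintains a latent 'computer state'."""
    def __init__(self, input_dim=64, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRUCell(input_dim, hidden_dim)

    def forward(self, event_embedding, hidden):
        # Fold the latest user event (mouse/keyboard) into the state.
        return self.rnn(event_embedding, hidden)

class Renderer(nn.Module):
    """Hypothetical decoder that maps the latent state to an RGB frame."""
    def __init__(self, hidden_dim=512, height=48, width=64):
        super().__init__()
        self.h, self.w = height, width
        self.decode = nn.Sequential(
            nn.Linear(hidden_dim, 3 * height * width),
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, hidden):
        return self.decode(hidden).view(-1, 3, self.h, self.w)

# One simulation step: user event in, predicted next frame out.
tracker, renderer = StateTracker(), Renderer()
hidden = torch.zeros(1, 512)       # initial computer state
event = torch.randn(1, 64)         # embedded mouse/keyboard event
hidden = tracker(event, hidden)    # update latent state
next_frame = renderer(hidden)      # predict next screen image
print(next_frame.shape)            # torch.Size([1, 3, 48, 64])
```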
82% success rate on open-world tasks! Midea cracks the robot generalization-control problem
量子位· 2025-07-15 06:28
Core Viewpoint
- The article discusses ChatVLA-2, a vision-language-action model with embodied reasoning capabilities, developed jointly by the Midea AI Research Institute and East China Normal University. The model combines a dynamic mixture-of-experts architecture with a two-stage training process to strengthen both reasoning and action execution [1][4].

Model Structure
- ChatVLA-2 employs a mixture-of-experts (MoE) architecture that dynamically selects expert modules to focus on task-specific features while capturing features shared across tasks; this adaptive routing keeps computational resources efficiently allocated (see the routing sketch below) [7].

Training Strategy
- The training process consists of two phases:
  - The first phase activates open-world understanding and reasoning by co-training vision-language data with robotic action data, avoiding bias toward specific skills [13].
  - The second phase refines the model's reasoning-following ability by freezing the vision-language model and training only the action experts, significantly improving the model's understanding of and response to unseen reasoning scenarios [14][15].

Experimental Results
- In experiments, ChatVLA-2 demonstrated strong mathematical and spatial reasoning:
  - In the mathematical matching game, it achieved a reasoning score of 6.0/6, a success rate of 11/13, an OCR score of 3.58/4, and a math reasoning score of 1.73/2, with an overall success rate of 82.7% in open-world scenarios [19].
  - In the toy-placement task, it achieved a target-recognition score of 0.94 and a manipulation success rate of 81.4%, outperforming comparable methods in unfamiliar environments [21].

Conclusion
- ChatVLA-2 represents a significant advance in robotics: by effectively translating reasoning outcomes into actions, it offers a new approach to general-purpose robot control and paves the way for future research on complex scenarios and multimodal interaction [21].
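To make the mixture-of-experts routing concrete, here is a minimal sketch of a top-k MoE layer in which a gating network selects a small subset of expert MLPs per input, so each task exercises only the experts relevant to it. The dimensions, expert count, and k are illustrative assumptions, not ChatVLA-2's published configuration.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative sizes)."""
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)      # routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (batch, dim)
        scores = self.gate(x)                        # (batch, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # keep the k best experts
        weights = weights.softmax(dim=-1)            # normalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # run only selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e             # inputs routed to expert e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

layer = TopKMoE()
y = layer(torch.randn(4, 256))
print(y.shape)  # torch.Size([4, 256])
```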
A homegrown Deep Research dark horse breaks out "running naked": free and open, transparent in process, web reports in one click
量子位· 2025-07-15 06:28
Core Viewpoint
- The article highlights the launch of the free "Deep Research" feature in Metaso AI Search, which lets users conduct comprehensive research without applications or memberships, showcasing a new approach to AI-driven research capabilities [1][12][46].

Group 1: Features of Deep Research
- Deep Research produces a complete research report by connecting sub-questions and presenting a clear evidence chain (the general agent loop is sketched below) [2][22].
- Users can submit complex queries, and the system builds a research path in real time, displaying the AI's reasoning process as it goes [18][19].
- The final report is structured, cites its sources clearly, and can be exported to formats such as Word and PDF [28][29].

Group 2: Performance and Evaluation
- Metaso AI outperformed other models in evaluation tests, including the WebSailor model [8][10].
- The system can visualize data with charts and graphs, making it suitable for business research as well as everyday inquiries [39][41].

Group 3: Accessibility and Market Position
- The Deep Research feature is free, in contrast with many competitors that require payment or limit access [48][50].
- The launch is a significant development in China's AI search field, giving users a low-barrier entry point to advanced research tools [52].
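The "research path" behavior described above matches the general shape of a deep-research agent loop: decompose the query into sub-questions, gather evidence for each, then synthesize a cited report. The sketch below shows only that generic pattern; the function boundaries and the search backend are assumptions, not Metaso's implementation.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    sub_question: str
    source_url: str
    snippet: str

def decompose(query: str) -> list[str]:
    """Placeholder: an LLM call would split the query into sub-questions."""
    return [f"{query} - background", f"{query} - current state", f"{query} - outlook"]

def search(sub_question: str) -> list[Evidence]:
    """Placeholder: a web-search backend would return cited snippets."""
    return [Evidence(sub_question, "https://example.com", "relevant snippet")]

def deep_research(query: str) -> str:
    report_sections = []
    for sq in decompose(query):          # the visible "research path"
        evidence = search(sq)            # gather sources per sub-question
        cites = "; ".join(e.source_url for e in evidence)
        report_sections.append(f"## {sq}\n...synthesis...\n(Sources: {cites})")
    return "\n\n".join(report_sections)  # structured, exportable report

print(deep_research("impact of open-source LLMs"))
```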
Zero-code development starts with a conversation with AI | A chat about Baidu Miaoda (秒哒)
量子位· 2025-07-15 03:50
Lin Yue, reporting from Aofeisi
QbitAI | WeChat official account QbitAI

With AI coding, more and more people are building products simply by talking. The barrier between having an idea and turning it into a working product keeps dropping: even without knowing how to code, you can now build applications "zero-code" by conversing with an AI.

△ On Baidu Miaoda, stating your requirements in conversation is all it takes to get web development done.

Baidu Miaoda (秒哒) is exactly this kind of zero-code, conversational development platform: the AI takes on roles such as architect and development engineer, invoking different agents and tools to carry out the work. Not a single line of code appears in the process.

So how are users actually taking products from idea to launch with Miaoda? Can products built with Miaoda really be put to use, or even make money? And in the era of zero-code development, which skills matter most?

At 20:00 on Thursday, July 17, "QbitAI Viewpoint" (量子位·视点) will host Zhu Guangxiang, general manager of Baidu's Miaoda product division, to discuss Baidu Miaoda and how ordinary people can make the most of zero-code development.

Click the button below to reserve a spot for the livestream.

Miaoda is currently open to everyone for a free trial at miaoda.baidu.com; come share your experience with us during the livestream.

Speaker: Zhu Guangxiang, general manager of Baidu's Miaoda product division. Dr. Zhu graduated from Tsinghua University's Institute for Interdisciplinary Information Sciences, has won a best-paper award in China's agent and multi-agent systems community, and received Beijing's outstanding doctoral ...
Zuckerberg reveals his poaching secret: small teams he leads himself, tens of billions lavished on GW clusters, and recruits who come not for sky-high pay but to "build god"
量子位· 2025-07-15 03:50
Core Viewpoint
- Meta is investing aggressively in AI infrastructure and talent, aiming to build a leading position in the AI model era, with massive financial backing and ambitious projects underway [1][4][5].

Group 1: Investment and Infrastructure
- Meta plans to invest hundreds of billions of dollars in building multiple gigawatt (GW) clusters for AI model training [2][4].
- The GW clusters are designed to support large-scale AI models; the first, Prometheus, is expected to reach 1GW of power capacity and become operational in 2026 [3][13].
- A second cluster, Hyperion, will start at 1.5GW with room to expand to 5GW, with construction set to begin in 2024 [19][21].

Group 2: Talent Acquisition and Team Building
- Meta is attracting top AI talent not just with high salaries but with substantial resources and a vision of building advanced AI systems [1][2].
- The company is focused on assembling a small, highly skilled, elite team to drive its AI initiatives [5][7].

Group 3: Energy and Resource Management
- The new data centers' energy requirements are substantial, potentially drawing power equivalent to that of millions of households (a rough estimate follows below) [22][23].
- Where local grids fall short, Meta is building on-site natural-gas power plants to supplement the electricity supply [25][26].

Group 4: Strategic Direction and Model Development
- Meta is debating internally whether to continue its open-source approach or shift toward closed-source AI models [6][30].
- Despite discussion of scaling back investment in open-source models, Meta remains committed to developing its Llama model [35][36].
- Leadership is weighing a strategic pivot toward a closed model, Behemoth, which has faced delays and internal challenges [38][42].

Group 5: Competitive Landscape
- ByteDance's lightweight mixed-reality glasses pose a competitive challenge to Meta's existing product lines, signaling a broader shift in the wearables market [50][52].
- Meta's focus on lightweight smart glasses suggests a strategic adjustment to meet competition in augmented reality [53][54].
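As a rough order-of-magnitude check on the "millions of households" claim (a back-of-the-envelope estimate, not a figure from the article, assuming the typical US household draws about 1.2 kW on average):

$$\frac{5\ \text{GW}}{1.2\ \text{kW per household}} \approx 4.2 \times 10^{6}\ \text{households}$$

So a 5GW campus running at full load would consume roughly as much power as four million homes.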
Set up by Google, Windsurf's rank-and-file find a buyer within 24 hours! A Chinese AI-coding star steps in, taking on 250 employees
量子位· 2025-07-15 00:34
Core Viewpoint
- Competition in AI programming is intensifying, highlighted by Cognition's rapid acquisition of Windsurf's assets and team immediately after Google's talent raid on Windsurf's core group [1][6].

Group 1: Acquisition Dynamics
- Google carried out a $2.4 billion talent acquisition of Windsurf CEO Varun Mohan, co-founder Douglas Chen, and the core R&D team, leaving the remaining 250 employees and the company itself as an "empty shell" [2][3].
- Cognition swiftly acquired all of Windsurf's remaining assets and staff, including intellectual property, product lines, and every employee Google did not take, though the acquisition price was not disclosed [15][16].

Group 2: Employee Treatment
- Cognition's treatment of employees contrasts sharply with Google's: Cognition announced that 100% of employees would participate economically in the acquisition [16][23].
- All employees receive accelerated vesting of their stock options, gaining their equity without waiting out the usual vesting schedule [16].

Group 3: Market Context
- The stakes are underscored by revenue figures: Cursor has reached $500 million in annual revenue and GitHub Copilot has surpassed $300 million, prompting major players to aggressively acquire valuable AI programming assets [20].
- Windsurf's fate is a warning to AI startups that they must align with larger players or risk being picked apart, as in earlier cases involving Microsoft and Google [21].

Group 4: Future Prospects
- The acquisition positions Cognition to enhance its AI programming agent Devin by integrating it with Windsurf's IDE, potentially solidifying its foothold in AI programming [24].
- The continuing contest among AI giants suggests that while this deal is done, the broader battle over AI applications is only beginning, with further technological and commercial developments to come [29].
Liu Lu has also been poached by Meta! The South China University of Technology alumna behind 4o's viral Ghibli hit
量子位· 2025-07-15 00:34
Core Viewpoint
- Liu Lu, a notable OpenAI researcher, has joined Meta, signaling a strategic talent grab by Meta to strengthen its AI capabilities in the wake of the troubled Llama 4 release [1][6][34].

Group 1: Liu Lu's Background and Achievements
- Liu Lu graduated from South China University of Technology with a strong academic record, including an undergraduate GPA of 3.84 [3][9].
- She previously worked at Google on the development of the Gemini model, then led image-generation work for GPT-4o at OpenAI, whose "Ghibli style" feature became wildly popular [4][21][23].
- The "Ghibli style" feature generated over 700 million images within the first ten days of release, showing its immense popularity [26].

Group 2: Meta's Talent Acquisition Strategy
- Meta has been aggressively recruiting from OpenAI; Liu Lu is a key hire, alongside Allan Jabri, who was also on the GPT-4o core architecture team [5][30].
- The recruiting push is part of a broader effort to build a strong AI team, as shown by the growing list of Chinese researchers joining from OpenAI [34][35].
- Meta's roster of Chinese AI talent now stands at ten people, eight of them from OpenAI, reflecting a focused pursuit of top talent in the field [35].

Group 3: Implications for the AI Industry
- The flow of talent from OpenAI to Meta raises questions about the competitive landscape, particularly OpenAI's ability to retain researchers [38][39].
- Meta's raid on OpenAI may signal a shift in the balance of power in the AI sector as Meta seeks to rebuild capability and trust after earlier setbacks [7][34].
- The ongoing hiring suggests Meta is after not just immediate gains but a long-term competitive advantage in AI development [34][40].
Bilibili jumps into self-developed AI dubbing! A pitch-perfect American-accented Empresses in the Palace (甄嬛传) leaks; no more learning English from Xiaohongshu (Doge)
量子位· 2025-07-14 09:08
Bai Jiao, reporting from Aofeisi
QbitAI | WeChat official account QbitAI

What happens when Empresses in the Palace and Let the Bullets Fly are switched entirely into English?

Videos like that keep popping up on Xiaohongshu, and the English just glides smoothly through my brain. Now AI can pull it off, like this: the result not only matches the original timbre and emotion, it also keeps the lip movements in sync.

Great; no more watching Xiaohongshu and troubling dubbing teachers to teach me English (Doge).

And the one behind it this time is none other than Bilibili, creator of so many meme-worthy videos. Well played, Bilibili. Their newly released TTS model, IndexTTS2, has drawn considerable attention in the community. Netizens say they can't wait to use it to make comedy videos.

IndexTTS2: AI dubbing, no sweat

Its biggest highlight is that it achieves duration control while also reproducing emotional characteristics that match the prompt. It supports two generation modes (a sketch of the first appears below).

The first specifies the number of tokens explicitly, for precise duration control. For example, take an original clip and the replacement text "Technology is only truly meaningful when it creates value for the local community," then set the output duration to 0.75x, 1x (original speed), and 1.25x of the original. The results sound like this.

The second requires no manual input: it generates speech automatically while preserving the prosodic characteristics of the input prompt. For example, an angry emotion, with the specified replacement text: "When you walked through our house, you discovered a distant road; that is not strange enough."

In addition, it also supports audio and emotional expression being independently ...
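To make the token-count mode concrete: if the speech codec emits tokens at a roughly constant rate, a requested duration maps directly to a token budget. Below is a minimal Python sketch of that calculation; the token rate and the synthesize() call are assumptions for illustration, not IndexTTS2's actual API.

```python
# Hypothetical sketch of token-based duration control, in the spirit of
# IndexTTS2's "specify token count to control length" mode. The token
# rate and the synthesize() interface are assumptions, not the real API.

TOKENS_PER_SECOND = 25  # assumed speech-token rate of the codec

def target_token_count(original_duration_s: float, duration_factor: float) -> int:
    """Tokens needed so the new clip lasts duration_factor x the original."""
    new_duration = original_duration_s * duration_factor
    return round(new_duration * TOKENS_PER_SECOND)

original = 4.0  # seconds
for factor in (0.75, 1.0, 1.25):
    n = target_token_count(original, factor)
    print(f"{factor}x duration -> request {n} speech tokens")
    # synthesize(text, prompt_audio, num_tokens=n)  # hypothetical call
```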
Tencent's Hunyuan-A13B hits hundred-billion-class performance with 13 billion parameters; the Flash Attention author gives it a thumbs-up
量子位· 2025-07-14 09:08
Core Viewpoint
- Tencent's Hunyuan-A13B model has drawn significant attention in the open-source community for its performance and efficiency, competing with much larger models while activating far fewer parameters [2][11].

Group 1: Model Performance and Architecture
- Hunyuan-A13B uses a fine-grained MoE (mixture-of-experts) architecture with 80 billion total parameters, of which only 13 billion are activated during inference, yielding over 100% higher throughput than comparable models (see the cost arithmetic below) [11][12].
- It supports a native 256K context window, further improving performance and efficiency [12].
- On benchmarks, the model outperforms smaller models such as Qwen3 8B and 14B while remaining competitive with larger ones [4][36].

Group 2: Developer Accessibility
- The model is designed to be friendly to individual developers, requiring only a mid-range GPU to run, easing concerns about compute [14][15].
- The API is available on Tencent Cloud at competitive prices: 0.5 yuan per million input tokens and 2 yuan per million output tokens [7].

Group 3: Training Methodology
- The model's capabilities rest on a high-quality pre-training phase over 20 trillion tokens, with an emphasis on STEM data that strengthens its reasoning performance [19].
- A structured post-training framework follows, with multiple phases that refine the model's abilities across tasks, attending to both "IQ" and "EQ" [22][24].

Group 4: Agent Capabilities
- Agent capabilities are developed through a combination of supervised fine-tuning (SFT) and reinforcement learning (RL), letting the model excel at tool invocation and complex decision-making [25][35].
- In authoritative evaluations, Hunyuan-A13B has surpassed leading models, demonstrating strong reasoning and coding ability [36].

Group 5: Practical Applications and Open Source
- Hunyuan-A13B has been validated in more than 400 business scenarios inside Tencent and is now fully open-sourced, with model weights, code, and technical reports available on GitHub and Hugging Face [38].
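A quick back-of-the-envelope view of why activating 13B of 80B parameters cuts inference cost: per-token decoder compute scales with the *activated* parameters (roughly 2N FLOPs per token, a standard approximation), not the total. Only the 80B/13B figures come from the article; everything else below is generic arithmetic.

```python
# Illustrative arithmetic for a fine-grained MoE: per-token compute scales
# with activated parameters, not total. 80B/13B are the article's figures;
# the 2N-FLOPs-per-token rule is a standard rough approximation.

total_params = 80e9    # full model
active_params = 13e9   # activated per token
print(f"active fraction: {active_params / total_params:.1%}")  # ~16%

# Relative per-token inference cost versus a dense 80B model:
dense_cost = 2 * total_params
moe_cost = 2 * active_params
print(f"relative cost vs dense 80B: {moe_cost / dense_cost:.1%}")  # ~16%
```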
DeepSeek V3's architecture found inside Kimi K2
量子位· 2025-07-14 07:01
Core Viewpoint
- Kimi's new model, K2, has attracted significant attention and positive feedback for its benchmark performance and its ability to handle productivity-grade tasks [1][4].

Group 1: Kimi K2 Model Insights
- Kimi K2 is noted for strong tool-calling capabilities, making it suitable for production-level tasks [1].
- The model is built on the DeepSeek V3 architecture, which has sparked discussion about its design and performance [5][83].
- Kimi K2 ships in two versions: Kimi-K2-Base, a pre-trained model for research and customization, and Kimi-K2-Instruct, a fine-tuned version for general instruction-following tasks (a loading sketch follows below) [15][16].

Group 2: Open Source Strategy
- Kimi open-sourced K2 primarily to gain recognition and to leverage community support in building out its technology ecosystem [9][12].
- The open-source approach invites community contributions, which can drive rapid improvement and innovation in the model [14][18].
- Kimi has halted marketing spending since early this year, betting instead on the model's strength to win market recognition [20][22].

Group 3: Product Development and Features
- Kimi remains committed to foundational model research even as trends favor agent products, on the view that model capability ultimately determines AI performance [24][27].
- The team is exploring novel product designs, such as moving from text-only outputs to more interactive formats that improve user experience [28][30].
- Kimi K2 shows significant gains in generating complex outputs such as games and travel plans, showcasing its advanced capabilities [39][62].

Group 4: Market Context and Competition
- The delay of OpenAI's open-source model release has been speculated to be influenced by Kimi K2's performance, although OpenAI cites safety concerns as the official reason [2][76].
- Rumor has it that OpenAI's model, while smaller than K2, is still powerful but hit issues that required retraining before release [81][82].
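Since the article notes the weights are openly released, loading the instruction-tuned variant with Hugging Face transformers might look like the sketch below. The repo id, the trust_remote_code requirement, and the hardware notes are assumptions based on how similar large MoE checkpoints are typically published; consult the official model card before running.

```python
# Hedged sketch: loading Kimi-K2-Instruct via transformers. The repo id and
# custom-code requirement are assumptions; the full checkpoint is very large,
# so this realistically needs a multi-GPU node.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-K2-Instruct"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # use the checkpoint's native precision
    device_map="auto",        # shard across available GPUs (needs accelerate)
    trust_remote_code=True,   # assumed custom DeepSeek-V3-style model code
)

prompt = "Write a haiku about tool calling."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```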