量子位
ByteDance releases an all-in-one robot foundation model, led by Li Hang
量子位· 2025-09-06 04:21
Core Viewpoint
- ByteDance's Seed team has introduced Robix, a single model that integrates reasoning, task planning, and natural-language interaction for robots, eliminating the need for multiple separate modules [1][4][27].

Group 1: Robix Model Overview
- Robix handles high-level cognitive tasks, while a lower-level vision-language-action (VLA) system executes the commands it issues [6][9].
- It is a unified vision-language model that processes images and language simultaneously, streamlining communication and decision-making [10][11].
- It employs chain-of-thought reasoning and a three-stage training strategy to strengthen its capabilities [11][12].

Group 2: Training Methodology
- Training consists of three phases:
  1. Continued pre-training on large amounts of robot-related data to build 3D spatial understanding and ground language in visual input.
  2. Supervised fine-tuning on real-world scenarios to teach task handling and basic conversational skills.
  3. Reinforcement learning with a reward system to correct discrepancies between reasoning and action [19][20].

Group 3: Performance Metrics
- In foundational-ability tests, Robix outperformed Qwen2.5-VL in 7 out of 8 spatial-understanding tasks, achieving higher average accuracy [21].
- Across benchmarks, Robix surpasses closed-source models such as GPT-4o and Gemini 2.5 Pro in most tests [21][22].
- In real-world interaction tests, Robix-32B reached an average task progress of 92.5%, exceeding Gemini 2.5 Pro and GPT-4o by 4.3 and 28.1 percentage points, respectively [25].

Group 4: Leadership and Development
- The project is led by Dr. Li Hang, a veteran AI and robotics researcher who previously headed Huawei's Noah's Ark Lab [28][30].
- Despite rumors of retirement, Dr. Li continues to contribute to the project in a consulting capacity [31].
Reorder the training data and large models get smarter, with no need to scale up model or data size
量子位· 2025-09-06 04:21
Core Viewpoint
- The article highlights the importance of data ordering in language model training, introducing a new paradigm called DELT (Data Efficacy in LM Training) that improves model performance without increasing data volume or model size [1][3][11].

Group 1: Data Efficiency vs. Data Efficacy
- Data efficiency focuses on making training more efficient through data selection, while data efficacy improves model performance through data organization, an aspect that has been largely overlooked [5][6][15].
- A cooking analogy illustrates the difference: data efficiency is like selecting fresh ingredients, whereas data efficacy is like a chef timing the addition of spices to maximize flavor [7].

Group 2: Importance of Data Organization
- The order of training samples matters because modern language models typically see the data for only a few passes, so the sequence in which samples are presented has a significant impact [9][10].
- The DELT paradigm introduces data-ordering strategies to fully exploit the potential of the training data, improving both efficiency and efficacy [11][13].

Group 3: DELT Paradigm Components
- DELT integrates three core components: data scoring, data selection, and data ordering; data scoring assigns each sample a score based on attributes such as difficulty and quality [19][20].
- A novel folding-ordering method is proposed to enhance data efficacy by preventing model forgetting and ensuring a balanced data distribution [23][27].

Group 4: Performance Results
- DELT shows consistent performance gains across model sizes and data scales, outperforming conventional methods on multiple evaluation metrics [28].
- For instance, on a 1-billion-parameter model, DELT achieved an average score of 39.17% versus 37.77% for the conventional approach [28].

Group 5: Implications for AI Training
- DELT offers a new perspective for data-centric AI, suggesting that model training should adopt personalized and structured curricula similar to human educational practice [29][30].
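The summary names the score-select-order pipeline but not the folding algorithm itself. The sketch below is a hypothetical Python illustration only: the `score` function, `keep_ratio`, `num_folds`, and the round-robin fold assignment are all assumptions, not the paper's exact method. The idea it demonstrates is that dealing ranked samples into folds and concatenating them turns one low-to-high sweep into several, which is one plausible way to limit forgetting while keeping the score distribution balanced across training.

```python
def delt_order(samples, score, keep_ratio=0.8, num_folds=4):
    """Toy DELT-style pipeline: score -> select -> folding order.

    Illustrative guess at a "folding ordering"; parameters and the
    round-robin scheme are assumptions, not the published algorithm.
    """
    # 1. Data scoring: rank all samples by score, ascending.
    scored = sorted(samples, key=score)
    # 2. Data selection: drop the lowest-scoring tail of the data.
    cut = len(scored) - max(1, int(len(scored) * keep_ratio))
    kept = scored[cut:]
    # 3. Data ordering: deal the ranked samples round-robin into folds;
    #    concatenating the folds yields `num_folds` low-to-high sweeps,
    #    so low-score material reappears later in training.
    folds = [kept[i::num_folds] for i in range(num_folds)]
    return [s for fold in folds for s in fold]
```

With ten integer samples scored by their own value and two folds, the kept samples 2..9 come out as two easy-to-hard sweeps rather than one monotone run.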
A new benchmark for video understanding: Kuaishou open-sources its multimodal reasoning model with a 128k context window, 0.1-second video localization, and cross-modal reasoning
量子位· 2025-09-05 10:56
Core Viewpoint
- Keye-VL 1.5, an advanced multimodal model developed by Kuaishou, has been open-sourced, showing significant improvements in video understanding and reasoning over its predecessor [1][4][6].

Group 1: Model Capabilities
- Keye-VL 1.5 features enhanced temporal-localization abilities, pinpointing when specific objects appear in a video with 0.1-second precision [10][8].
- The model introduces a Slow-Fast dual encoding mechanism, enabling a 128k context window while balancing speed and detail [5][8].
- It delivers strong benchmark results, scoring 73.0 on the Video-MME short-video benchmark and leading in multiple other evaluation scenarios [6][18].

Group 2: Benchmark Performance
- Keye-VL 1.5 outperforms models such as Qwen2.5-VL 7B across several benchmarks, including OpenCompass and MMBench, achieving top scores in its class [19][21].
- On human-annotated metrics it averaged 3.53, an improvement of 0.51 points over the preview version, surpassing competing models [24][25].

Group 3: Model Architecture
- The architecture follows a "Vision Transformer (ViT) + MLP projector + language decoder" structure designed to capture global spatial relationships across video frames [27][28].
- The model is trained with a four-stage progressive pre-training pipeline on over 1 trillion tokens from diverse data sources [39][41].

Group 4: Research and Development
- The Keye team has presented multiple research results at top conferences, including advances in multimodal reinforcement learning and visual-language-model governance frameworks [51][54].
- The team focuses on integrating visual, linguistic, and behavioral data to improve cognition and decision-making in AI applications [50].
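The summary names the Slow-Fast dual encoding but not its recipe. As a rough, hypothetical sketch of the general slow/fast idea (the strides and the split below are invented for illustration, not Kuaishou's actual settings): a sparse "slow" set of frames gets a large token budget for detail, while a dense "fast" set is encoded cheaply for temporal coverage, which is how long videos can fit a 128k context.

```python
def slow_fast_split(num_frames, slow_stride=16, fast_stride=2):
    """Hypothetical slow/fast frame allocation for a video encoder.

    Slow pathway: few frames, each encoded with many tokens (detail).
    Fast pathway: many frames, each heavily compressed (coverage).
    Strides are illustrative, not Keye-VL 1.5's parameters.
    """
    slow = set(range(0, num_frames, slow_stride))
    fast = [i for i in range(0, num_frames, fast_stride) if i not in slow]
    return sorted(slow), fast
```

For an 8-frame clip with strides 4 and 2, frames 0 and 4 go to the slow pathway and frames 2 and 6 to the fast one; every sampled frame is encoded exactly once.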
The first AI browser company to be acquired: a 4.3 billion RMB deal, with the product still in beta
量子位· 2025-09-05 06:33
Core Viewpoint
- Atlassian's $610 million acquisition of The Browser Company marks a significant moment for the AI browser market, centered on the newly developed AI browser Dia, which aims to boost productivity for white-collar workers [1][3][12].

Group 1: Acquisition Details
- The Browser Company, known for its browsers Arc and the AI-focused Dia, was acquired by Atlassian for $610 million (approximately 4.3 billion RMB) [1].
- The deal was long in the making: the two CEOs had been in discussions a year earlier, initially about the Arc browser rather than Dia [8][10].
- Since its founding in 2019, The Browser Company has raised a total of $128 million and was valued at $550 million last year [17][19].

Group 2: Market Context and Reactions
- The acquisition drew skepticism from netizens about Atlassian's judgment, given that Dia has been in beta since its June launch [5][6].
- Even so, the deal is viewed as a strategic move to secure resources and distribution channels in a competitive AI browser market [22][23].

Group 3: Product Focus and Vision
- Atlassian co-founder Cannon-Brookes envisions Dia as a browser built for getting work done rather than merely browsing information, integrating various tools to raise user productivity [25][26].
- Dia is optimized for commonly used SaaS applications, surfacing contextual information to assist with daily tasks [27].
- The browser connects AI capabilities with a user's working memory, enabling tighter integration of applications and tasks [29].
Jensen Huang has it all figured out: $1.5 billion to rent back Nvidia's own GPUs, coaching a junior partner to swap GPUs for financing, and another Nvidia protégé reportedly preparing an IPO
量子位· 2025-09-05 06:33
Core Viewpoint
- Nvidia is making a major bet on cloud computing by renting back its own AI chips from Lambda, a strategic move to entrench its dominance in the cloud market [1][2][25].

Group 1: Nvidia's Investment and Rental Agreements
- Nvidia will lease 10,000 GPU servers equipped with its own AI chips from Lambda for four years, a deal worth $1.3 billion [2].
- It has also signed a second rental agreement for 8,000 servers, valued at $200 million [3].
- The rented capacity is intended to serve Nvidia's internal research and development needs [4].

Group 2: Lambda's Role and IPO Preparation
- Lambda is preparing for an IPO, possibly as early as the first half of 2026 [7][23].
- As a cloud provider, Lambda offers competitive GPU rental pricing, especially for long-term or large-scale usage [11].
- Nvidia is simultaneously Lambda's supplier, investor, and customer, creating a symbiotic relationship [10].

Group 3: Financial Strategies and Market Positioning
- Nvidia participated in Lambda's $480 million Series D round in February 2025 as a strategic investor [14].
- Lambda also secured $500 million in debt financing to purchase Nvidia GPUs, with the GPUs themselves serving as collateral [14].
- By deepening ties with smaller cloud providers, Nvidia secures market penetration for its chips and counters larger cloud firms that are developing their own silicon [30][28].

Group 4: Competitive Landscape and Future Outlook
- Nvidia's data-center business remains its main growth driver, contributing $41.1 billion in revenue in the second quarter of fiscal 2026, up 56% year over year [25].
- The company aims to hold its lead in the computing market by backing smaller cloud firms such as Lambda and CoreWeave [31][32].
- CoreWeave, another Nvidia-backed cloud provider, recently went public and saw its stock price surge, reflecting the success of Nvidia's investment strategy [22][21].
ByteDance Seed's latest native agent is here: one model autonomously operates phones, computers, and browsers
量子位· 2025-09-05 04:28
Core Viewpoint
- The article covers ByteDance's UI-TARS-2, a new generation of AI agent that can autonomously operate graphical user interfaces (GUIs) across platforms, outperforming competitors such as Claude and OpenAI models [2][23][24].

Group 1: UI-TARS-2 Overview
- UI-TARS-2 is designed to autonomously complete complex tasks on computers, mobile devices, web browsers, terminals, and even games [6][10].
- Its architecture combines a unified agent framework, multimodal perception, multi-round reinforcement learning, and hybrid operation flows [7][8].

Group 2: Challenges Addressed
- UI-TARS-2 tackles four major challenges in GUI automation: data scarcity, environment fragmentation, narrow capability, and training instability [5][10].
- To address data scarcity, the model uses a "data flywheel" strategy: raw data is collected, and iterative training generates high-quality task-specific data [11][12].

Group 3: Reinforcement Learning Enhancements
- The team adapted standard reinforcement learning for stability over long-horizon GUI tasks by improving task design, reward mechanisms, and the training process [15][17].
- The model uses asynchronous rollout plus several enhancements to the PPO algorithm to improve stability and to encourage exploration of rare but potentially effective actions [17][18].

Group 4: Performance Metrics
- UI-TARS-2 outscored Claude and OpenAI models in GUI tests spanning multiple operating systems and command-line environments [23][24].
- In gaming scenarios, it averaged roughly 60% of human performance, beating competitors in several games [27][28].

Group 5: Practical Applications
- Beyond GUI operation, UI-TARS-2 can handle tasks such as information retrieval and code debugging, demonstrating versatility relative to models that rely solely on GUI interactions [28][29].
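The article mentions enhancements to PPO without detailing them. For orientation only, the standard clipped surrogate objective that such work builds on fits in a few lines; this is textbook PPO, not ByteDance's modified version.

```python
import math

def ppo_clip_loss(old_logp, new_logp, advantage, eps=0.2):
    """Per-action PPO clipped surrogate loss (the baseline that
    UI-TARS-2 reportedly extends; the extensions are not public here)."""
    # Importance ratio pi_new(a|s) / pi_old(a|s), from log-probabilities.
    ratio = math.exp(new_logp - old_logp)
    unclipped = ratio * advantage
    # Clipping keeps the policy update conservative around ratio = 1.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # Take the pessimistic objective and negate it for minimization.
    return -min(unclipped, clipped)
```

When old and new policies agree (ratio 1), the loss is just the negated advantage; when the new policy overshoots, the clip caps the incentive at 1 + eps times the advantage.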
ChatGPT's new feature kills off another batch of startup projects
量子位· 2025-09-05 04:28
Core Viewpoint
- ChatGPT has introduced a "Conversation Branching" feature that lets users pursue multiple conversation threads without cluttering the original dialogue [1][3][4].

Group 1: Conversation Branching Feature
- With one click, a user can create a new thread that starts from the existing conversation [4][8].
- The feature keeps context from the original topic while letting side discussions proceed separately, improving the user experience [12][13].
- Its introduction is seen as a response to user feedback, signaling demand for more organized conversation management [3][13].

Group 2: Project Functionality
- ChatGPT has also made its previously paid "Project" feature free, broadening access for all users [16].
- File-upload limits in Projects vary by subscription tier: free users can upload up to 5 files, Plus users up to 25, and Pro users up to 40 [19].
- Users can customize project colors and icons, making projects easier to distinguish and find [18].
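Conceptually, branching amounts to forking a message list at a chosen point. The sketch below is a purely hypothetical illustration of that idea; the class and function names are invented, and OpenAI's actual implementation is not public.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    """A conversation thread as an ordered list of messages."""
    messages: list = field(default_factory=list)

def branch(thread: Thread, at_index: int) -> Thread:
    # Copy the shared prefix up to and including `at_index`: the new
    # thread keeps the original context, and later messages diverge
    # without touching the parent thread.
    return Thread(messages=list(thread.messages[: at_index + 1]))
```

Because the prefix is copied rather than shared, appending to the branch leaves the parent conversation untouched, which matches the described behavior of keeping the original dialogue uncluttered.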
OpenAI announces an AI-powered online hiring platform, going head-to-head with Microsoft's LinkedIn
量子位· 2025-09-05 01:49
Core Viewpoint
- OpenAI is launching an AI-driven online recruitment platform, the OpenAI Jobs Platform, designed to match corporate needs with worker skills and competing directly with LinkedIn [2][11][12].

Group 1: OpenAI Jobs Platform
- The platform will provide a dedicated channel for small businesses and local governments to reach top AI talent [5].
- It aims to connect skilled individuals with companies that need AI expertise, strengthening local business competitiveness and government services [16][17].
- OpenAI is building the platform with partners including Walmart and local government offices [14][15].

Group 2: AI Skills Development
- OpenAI has launched OpenAI Academy, a free online learning platform that has already helped more than 2 million people acquire AI skills [18].
- The company plans to add certification courses for different AI proficiency levels, with the goal of certifying 10 million Americans in AI skills by 2030 [20][21].
- Research indicates that employees with AI skills are more productive and command higher salaries than peers without them [18][22].

Group 3: Competitive Landscape
- The new platform directly challenges LinkedIn, which is owned by Microsoft, OpenAI's largest financial backer [11][12].
- The competition raises questions about conflicts of interest, since OpenAI's success could erode LinkedIn's market position [12][13].
DeepSeek's next big move revealed: agents
量子位· 2025-09-05 01:49
Core Viewpoint
- DeepSeek is reportedly developing a new model with stronger AI-agent capabilities, expected to launch by the end of this year [3][8].

Group 1: Model Development
- DeepSeek's August update introduced DeepSeek-V3.1, whose agent capabilities were improved through post-training optimization, boosting performance on tool use and agent tasks [5][11].
- The upcoming model is designed to execute complex operations from minimal prompts and to self-improve based on its action history [7][8].
- The nine-month transition from DeepSeek V3 to V3.1 suggests a focus on incremental improvement rather than major version jumps [9][10].

Group 2: Performance Metrics
- DeepSeek-V3.1 posts significant gains over its predecessors on agent benchmarks [12]:
  - SWE-bench: 66.0 (V3.1) vs. 45.4 (V3) and 44.6 (R1).
  - SWE-bench Multilingual: 54.5 (V3.1) vs. 29.3 (V3) and 30.5 (R1).
  - Terminal-Bench: 31.3 (V3.1) vs. 13.3 (V3) and 5.7 (R1).
- In search-agent evaluations, V3.1 likewise improved comprehensively over R1 [12].

Group 3: Future Outlook
- The release of DeepSeek R1 significantly influenced the global large-model industry, marking a pivotal moment in its development [15].
- AI agents are gaining traction, with predictions that by mid-2025 nearly all large-model products will ship agent functionality [16][18].
- If DeepSeek leads this wave, it may again drive down the price barrier for AI agents [19].
Nvidia's Jensen Huang acquires an AI coding company
量子位· 2025-09-05 01:49
Core Viewpoint
- Nvidia is expanding its AI-programming ecosystem through strategic acquisitions, most recently the AI coding startup Solver, which builds AI agents for software development [2][8][17].

Group 1: Acquisition Details
- Nvidia has acquired Solver, an AI coding company founded in 2022 whose agents aim to manage entire codebases rather than just complete code [8][12][22].
- Solver's founders have deep AI backgrounds: Mark Gabel was chief scientist at Viv Labs, and Daniel Lord is a co-founder of Siri [10][11].
- The acquisition fits Nvidia's strategy of building a software ecosystem around its leading AI hardware, potentially shortening enterprise development cycles on Nvidia's platform [17][23].

Group 2: Previous Acquisitions
- Over the past two years, Nvidia has made a string of acquisitions to lower chip-usage costs and strengthen AI support:
  - Lepton AI, which rents out servers powered by Nvidia chips [18][19].
  - Gretel, a synthetic-data startup acquired in March 2025 to meet AI training-data needs [20].
  - Run:ai, an Israeli provider of AI-workload-orchestration software, acquired for $700 million in December 2024 [20].
  - OctoAI, specializing in generative-AI tools, acquired for approximately $250 million in September 2024 [20].
  - Brev, a platform for building and deploying AI models, acquired in July 2024 to streamline access to Nvidia GPUs in the cloud [20].

Group 3: Implications of the Acquisition
- The Solver deal signals a shift toward AI agents that take a fuller role in software development, moving beyond code completion to actively building, testing, and managing codebases [22][23].
- It extends Nvidia's ongoing "AI acquisition spree" from chips and data tools to AI agents, deepening its industry footprint [23][24].