量子位
SOTA even with scarce data? Tsinghua and Shanghai AI Lab crack two major bottlenecks in robot RL
量子位· 2025-09-26 02:08
Core Viewpoint - The article covers SimpleVLA-RL, an end-to-end online reinforcement learning framework for Vision-Language-Action (VLA) models. It aims to make robots more flexible and capable in complex environments while removing bottlenecks in existing training paradigms [3][12].
Group 1: Key Challenges in Existing Training Paradigms
- Current training paradigms face two major bottlenecks: high data collection costs and insufficient generalization [2][8].
- Dependence on large-scale, high-quality robot manipulation trajectories limits scalability and drives up cost, making data acquisition a major hurdle [8].
- Models generalize poorly to out-of-distribution tasks and new environments, and performance drops on tasks with long-sequence dependencies and on compositional tasks [8][9].
Group 2: SimpleVLA-RL Framework
- SimpleVLA-RL combines interactive trajectory sampling, outcome-based rewards, and enhanced exploration to tackle the three core challenges of VLA model training [5][6].
- The framework reaches state-of-the-art (SoTA) performance on the standard benchmarks LIBERO and RoboTwin, with large gains even when data is limited [5][21].
- With only a single demonstration per task, the average LIBERO success rate rose from 48.9% to 96.9% after applying SimpleVLA-RL [5].
Group 3: Performance Metrics and Results
- SimpleVLA-RL achieved an average success rate of 99.1% on LIBERO, with long-sequence tasks improving by 12.0 percentage points [21].
- On RoboTwin1.0, the average success rate rose from 39.8% to 70.4%, and individual tasks such as "Blocks Stack" improved by 33.1 percentage points [23].
- On RoboTwin2.0, the average success rate improved from 38.3% to 68.8% [25].
Group 4: Innovations and Discoveries
- New manipulation strategies emerged during training, such as the "Pushcut" phenomenon, in which the model autonomously discovers methods more efficient than those in the human demonstrations [10][31].
- This indicates that reinforcement learning can take VLA models beyond the patterns present in human demonstrations, paving the way for future adaptive VLA models [31].
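The outcome-based reward mentioned in Group 2 can be illustrated with a small sketch. It is not the authors' implementation: the summary only says that rewards depend on the result of a rollout, so the binary success flag, the function names, and the group-wise normalization step below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def outcome_rewards(successes, lengths):
    """Give every step of a rollout the same binary reward:
    1.0 if the whole task succeeded, 0.0 otherwise (assumed reward shape)."""
    return [[1.0 if ok else 0.0] * n for ok, n in zip(successes, lengths)]


def group_advantages(successes, eps=1e-6):
    """Normalize binary outcomes within a group of rollouts of the same task.
    Group-wise normalization is an assumption here, not stated in the summary."""
    rewards = [1.0 if ok else 0.0 for ok in successes]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    successes = [True, False, False, True]  # hypothetical rollout outcomes
    lengths = [3, 2, 4, 1]                  # hypothetical rollout lengths in steps
    print(outcome_rewards(successes, lengths))
    print(group_advantages(successes))
```

A reward of this shape needs no per-step labels, only a success check at the end of each rollout, which is one plausible reason it suits settings with few demonstrations. When every rollout in a group has the same outcome, the normalized advantages are all zero and the group carries no learning signal.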
Xiaomi 17 goes on sale at 4,499 yuan, debuting the fifth-generation Snapdragon 8! Lei Jun: 50 billion yuan for in-house chips
量子位· 2025-09-25 23:54
Core Viewpoint - Xiaomi's latest launch event signaled a shift towards becoming a hardcore technology company, with innovation across the product lineup and in particular the Xiaomi 17 series, which is positioned to compete directly with Apple's iPhone [6][7][10].
Group 1: Xiaomi 17 Series
- The Xiaomi 17 series includes three models (standard, Pro, and Pro Max), with a starting price of 4,499 yuan [3][11].
- The series debuts the fifth-generation Snapdragon 8 mobile platform (Snapdragon 8 Elite Gen 5), built on a 3nm process with a peak frequency of 4.6GHz, positioning it as a top-tier flagship [14][15].
- The design balances light weight with a premium feel: the phone is 8.06mm thick and weighs 191g, with a 6.3-inch display on the standard and Pro models and a 6.9-inch display on the Pro Max [18][21].
- The camera system has Leica tuning and focuses on portrait photography, with new algorithms for skin tone restoration and detail enhancement [44][46].
- The Pro and Pro Max models introduce a "back screen" that supports additional interaction and notifications [40][41].
Group 2: Battery and Display Innovations
- The standard Xiaomi 17 has a 7000mAh battery and the Pro Max reaches 7500mAh, offering longer endurance than the iPhone 17 [34][35].
- The display uses a new red light-emitting material that improves luminous efficiency by 11.4%, a notable advance for domestic manufacturing [29][31].
Group 3: Xiaomi Pad 8
- The Xiaomi Pad 8 series was launched at the same event. It has an 11.2-inch 3.2K display, a starting price of 2,199 yuan, and a lightweight, portable design [50][51].
- The Pad runs the new HyperOS 3 (澎湃OS 3), which enables desktop-like functionality with support for a range of applications and multitasking [57][59].
- The standard version is powered by the Snapdragon 8s Gen 4 processor, while the Pro version uses the Snapdragon 8 Elite, with significant performance improvements [63][64].
Group 4: Future Aspirations
- Lei Jun emphasized the company's commitment to developing its own SoC, with a planned investment of 50 billion yuan over the next decade, aiming at the high-end market [68][69].
Does the algorithm behind Musk's new model come from NVIDIA???
量子位· 2025-09-25 23:54
Core Viewpoint - Grok-4-fast has shown exceptional cost reduction and efficiency, surpassing even the router-equipped GPT-5 [1][38].
Group 1: Performance and Efficiency
- Grok-4-fast's reasoning efficiency is attributed to large-scale scaling of compute [2].
- The technology underlying Grok is also linked to NVIDIA's algorithmic work, in particular a new model called Jet-Nemotron [3][4].
- Jet-Nemotron-2B performs on par with leading open-source models while running approximately 53 times faster [7].
Group 2: Technological Innovations
- The key innovation behind Jet-Nemotron is a new framework called PostNAS, which significantly reduces training costs and allows a more comprehensive exploration of model structures [10][11].
- PostNAS produces a hybrid-structure model that retains the essential full-attention layers and eliminates redundant ones to improve efficiency [13][14].
- The framework has four core components: full-attention layer placement, selection of the best existing linear attention module, design of a better linear attention module, and hardware-aware architecture search [12].
Group 3: Attention Mechanisms
- The NVIDIA team evaluated six advanced linear attention modules. Gated DeltaNet achieved the highest accuracy, thanks to its data-dependent gating mechanism and delta rule [18][19].
- JetBlock, a more advanced linear attention module, uses dynamic convolution to generate convolution kernels adaptively from the input features. It outperforms Gated DeltaNet in accuracy on mathematical reasoning and retrieval tasks [21][24].
Group 4: Hardware Optimization
- NVIDIA's hardware-aware architecture search optimizes key architectural parameters instead of relying on parameter count, which does not accurately reflect efficiency on real hardware [27][28].
- The team found that the size of the key-value (KV) cache is the crucial factor for throughput in long-context and long-text generation, which led to a targeted optimization approach [30][31].
Group 5: Industry Impact
- PostNAS is expected to influence the AI industry by providing a low-cost, high-efficiency method of architecture exploration that applies to any pre-trained transformer [34].
- The Jet-Nemotron model is open-source, so vendors can integrate it without retraining, significantly reducing costs while maintaining accuracy [36][42].
- Adoption of Jet-Nemotron by major AI companies such as OpenAI and Google could lead to widespread improvements in model performance and cost efficiency [43].
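A back-of-the-envelope calculation shows why the KV cache dominates long-context throughput and why keeping only a few full-attention layers helps. This is a rough sketch under stated assumptions: the layer counts, head counts, and sequence length are hypothetical and are not Jet-Nemotron's actual configuration, and the constant-size state kept by linear attention layers is ignored.

```python
def kv_cache_bytes(full_attn_layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Approximate KV cache size for the full-attention layers of a model.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit storage (an assumption).
    """
    return (2 * full_attn_layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_value)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical model: 24 full-attention layers vs. a hybrid keeping only 2.
    dense = kv_cache_bytes(full_attn_layers=24, kv_heads=8, head_dim=128, seq_len=65536)
    hybrid = kv_cache_bytes(full_attn_layers=2, kv_heads=8, head_dim=128, seq_len=65536)
    print(f"dense:  {dense / GIB:.2f} GiB per sequence")
    print(f"hybrid: {hybrid / GIB:.2f} GiB per sequence")
    print(f"reduction: {dense / hybrid:.0f}x")
```

Because the cache grows linearly with both sequence length and the number of full-attention layers, a smaller cache leaves room for larger batches on the same device, which is the kind of effect a parameter count alone would not reveal.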
Meta poaches OpenAI's Yang Song! Key figure in the rise of diffusion models joins MSL, reuniting with fellow Tsinghua alumnus Shengjia Zhao
量子位· 2025-09-25 13:00
Core Viewpoint - Meta has recruited Yang Song, a prominent researcher from OpenAI. The move has drawn significant attention in the AI research community because of his contributions to diffusion models and generative modeling [1][6][7].
Group 1: Yang Song's Background and Achievements
- Yang Song is recognized as a key contributor to the rise of diffusion models and led OpenAI's Strategic Explorations Team [10][11].
- He entered Tsinghua University at the age of 16 and later earned his PhD at Stanford University under Stefano Ermon [20][36].
- His best-known work includes Consistency Models, which outperform diffusion models in speed and performance and generate images significantly faster [12][14][17].
Group 2: Impact of Yang Song's Work
- Consistency Models can generate 64 images of 256×256 pixels in approximately 3.5 seconds, a substantial improvement over existing models [12][14].
- His research led to Continuous-Time Consistency Models, which address the stability and scalability issues of earlier models and have been trained at a scale of 1.5 billion parameters [15][18].
- These advances are considered potential game-changers in generative modeling, with some discussions suggesting they could "end" the dominance of diffusion models [18][19].
Group 3: Meta's Strategic Recruitment
- Recruiting Yang Song is part of Meta's broader strategy of strengthening its AI capabilities by attracting top talent from leading organizations such as OpenAI [9][10].
- The move is seen as a significant loss for OpenAI, and many colleagues expressed surprise at his departure [7][6].
- The motivations for such moves are thought to go beyond financial incentives, since many researchers prioritize impactful work and collaboration opportunities [9].
GPT-5 passes the "Gödel Test"! It produces original solutions to open math problems that would take PhD students days
量子位· 2025-09-25 13:00
Core Viewpoint - GPT-5 has shown that it can solve complex mathematical optimization problems, succeeding on three of the five challenges posed by researchers and demonstrating advanced mathematical reasoning [2][21].
Group 1: GPT-5's Performance
- In a recent study, GPT-5 was given five unsolved optimization conjectures and solved three of them [2][21].
- The challenges demanded mathematical understanding at the level of PhD researchers, not high school students [3][21].
- For one problem, GPT-5 produced a novel proof that differed from what the researchers expected but was still valid [2][21].
Group 2: The Gödel Test
- The researchers call their assessment the "Gödel Test". It consists of problems that require deep reasoning and whose solutions cannot easily be found in the existing literature [10][11].
- The problems focus mainly on submodular maximization, a topic in combinatorial optimization characterized by diminishing returns [12][13].
Group 3: Problem-Solving Details
- In the first problem, GPT-5 had to maximize a function composed of monotone and non-monotone submodular functions under specific constraints, and it provided a performance guarantee [23][24].
- In the second problem, GPT-5 had to maximize a monotone submodular function under complex constraints, and its solution was more reasonable than the researchers had anticipated [39][40].
- The third problem involved maximizing a continuous monotone function under convex constraints. GPT-5's answer was generally correct but contained minor issues [59][60].
Group 4: Limitations and Challenges
- GPT-5 struggled with the fourth and fifth problems, which required integrating insights from multiple sources, exposing its limits in comprehensive reasoning [26][73].
- In the fourth problem, GPT-5 failed to provide a valid solution and merely restated known information. In the fifth problem, its output was judged unreliable and unusable [70][81].
Group 5: Overall Assessment
- Overall, GPT-5 showed significant improvement in basic mathematical capability over earlier models, particularly in combinatorial optimization [26][41].
- Its performance depended on the prompts: more detailed requests led to more complete and coherent answers [26][62].
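The "diminishing returns" property behind submodular maximization (Group 2) can be made concrete with a toy coverage function. The sketch below is illustrative only and is unrelated to the five conjectures in the study: the sets are made up, and greedy selection under a cardinality constraint is simply the textbook baseline for monotone submodular functions.

```python
def coverage(sets, chosen):
    """f(S) = number of distinct elements covered by the chosen sets."""
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def marginal_gain(sets, chosen, candidate):
    """Extra coverage obtained by adding `candidate` to `chosen`."""
    return coverage(sets, chosen | {candidate}) - coverage(sets, chosen)


def greedy(sets, k):
    """Pick up to k sets, each time taking the largest marginal gain."""
    chosen = set()
    for _ in range(k):
        remaining = [name for name in sets if name not in chosen]
        if not remaining:
            break
        chosen.add(max(remaining, key=lambda name: marginal_gain(sets, chosen, name)))
    return chosen


if __name__ == "__main__":
    sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}  # made-up example data
    # The gain of adding "b" shrinks as the chosen collection grows.
    for chosen in (set(), {"a"}, {"a", "c"}):
        print(sorted(chosen), "->", marginal_gain(sets, chosen, "b"))
    print("greedy pick with k=2:", sorted(greedy(sets, 2)))
```

Formally, a set function f is submodular when f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) for every A ⊆ B and x ∉ B, which is the shrinking gain the loop prints.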
Cracking structured long-document retrieval! New framework cures models of "structural blindness"
量子位· 2025-09-25 11:42
Core Insights - The article introduces SEAL (Structure and Element Aware Learning), a new contrastive learning framework that improves how models understand long documents through structural awareness and element alignment [1][8].
Group 1: SEAL Framework Overview
- SEAL integrates a document's macro-level structure and its micro-level semantic elements into a unified embedding space, significantly improving the ability of pre-trained language models to understand and represent structured data [3].
- The framework addresses two main challenges in long-document retrieval: making models aware of document hierarchy, and aligning user queries precisely with specific document elements [18][25].
Group 2: Training Strategies
- The framework uses two complementary training strategies: Structure Aware Learning (SAL) and Element Aware Learning (EAL) [9].
- SAL targets the "skeleton" of a document. The model sees two versions of each document, one with structural tags and one without, which encourages it to learn the structural function of each text segment [12][13].
- EAL strengthens the model's grasp of the semantic roles of local elements through a masking mechanism: the model must infer the relevance of the whole document from incomplete information [14][15].
Group 3: Experimental Results
- Applying SEAL to the BGE-M3 model improved its retrieval ranking quality, raising MRR@10 from 73.96% to 77.84% [17][19].
- The results indicate that more relevant results are ranked higher, which was validated by online A/B testing [20].
Group 4: Open Source Dataset
- The team released a new dataset, StructDocRetrieval, containing long documents with structural annotations. Its documents are far longer than those in typical short-text datasets such as MS MARCO [21][22].
- The dataset uses HTML format and provides rich structural and semantic annotations, filling a gap in the field [23].
Group 5: Broader Implications
- SEAL's finer-grained understanding of structural information can give downstream tasks more reliable sources, for example by helping AI assistants locate answers accurately in technical documents [25].
- The framework shows promise in specialized fields such as enterprise knowledge bases and legal technology [25].
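For readers unfamiliar with the MRR@10 metric cited in Group 3, here is a minimal sketch of how it is usually computed. The function name, document ids, and queries are hypothetical and are not taken from the SEAL evaluation code.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean Reciprocal Rank at cutoff k.

    rankings: one ranked list of document ids per query.
    relevant: one set of relevant document ids per query.
    A query contributes 1/rank of its first relevant hit within the
    top k results, or 0 if none appears there.
    """
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


if __name__ == "__main__":
    rankings = [["d1", "d2"], ["d3", "d4", "d5"], ["d6"]]  # hypothetical results
    relevant = [{"d1"}, {"d5"}, {"d9"}]                    # hypothetical labels
    print(round(mrr_at_k(rankings, relevant), 4))
```

Under this definition, the reported move from 73.96% to 77.84% is a gain of 3.88 percentage points, meaning the first relevant document tends to appear closer to the top of the list.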
Your AI assistant just got more versatile! Tianxi partners with ByteDance's Coze, unlocking unlimited new features
量子位· 2025-09-25 11:42
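Yunzhong (允中), reporting from Aofeisi
量子位 | WeChat official account QbitAI
The Tianxi personal super agent and ByteDance's Coze (扣子) have officially announced an ecosystem partnership!
The Tianxi super agent is a next-generation AI assistant platform launched by Lenovo Group. It is the "one body" in Lenovo's "one body, multiple devices" strategy, that is, the "AI brain" of smart terminal devices, and it aims to become the primary entry point for human-computer interaction. It combines three "super capabilities": multimodal interaction across voice, text, and vision; all-time, all-space memory; and autonomous planning and execution. It also offers five "golden features": AI control, AI search, AI translation, AI notes, and AI services. Through a hybrid device-cloud deployment architecture, it gives users a cross-device, cross-ecosystem super-intelligent experience.
After ChatExcel's "build spreadsheets by chatting" feature became a breakout highlight, Tianxi's decision to partner with the developer platform Coze is more than an expansion of its AI features. It also marks the point at which Lenovo's AI development has entered a stage of platform-based, ecosystem-based integration, with its core role as an enabler of the AI ecosystem strengthened across the board.
Natural traffic entry point + efficient development
Coze's core strengths are reported to be low development cost and comprehensive functionality, and it lets users build AI applications through a visual interface. The partnership aims to solve a core pain point for AI developers, "easy to develop, hard to distribute", and opens a "highway" from AI idea to commercial use.
Developers can build personalized agents efficiently on the Coze platform and then use the Tianxi platform's natural traffic entry point and device coverage to push those agents seamlessly to AI devices running Tianxi. Both ...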
Robot dog keeps walking even with its legs sawed off! The latest robot brain comes from a unicorn valued at 32 billion yuan
量子位· 2025-09-25 11:42
Core Viewpoint - Skild AI has developed Skild Brain, an AI brain that can control a wide range of robots in unpredictable situations. The company reached a valuation of $4.5 billion in June 2025 [4][29].
Group 1: Skild Brain Capabilities
- Skild Brain adapts to different robot bodies and situations, so it can keep controlling a robot that faces unexpected problems such as a jammed motor or a lost limb [7][12].
- The AI brain was trained in a virtual environment on 100,000 different robot bodies for the equivalent of 1,000 years of simulated time, which led to emergent control capabilities [4][12].
- It learns from failures and improves over time, with a memory more than 100 times longer than that of typical robot control policies [17][24].
Group 2: Testing and Adaptation
- Skild Brain adapted to scenarios such as simulated limb loss and adjusted its walking pattern accordingly, while traditional controllers failed [19][20].
- It switched control strategies according to the robot's physical state, for example moving from a wheeled gait to a bipedal walking pattern when necessary [21][24].
- Initial instability in new configurations, such as walking on stilts, was quickly overcome as the AI adjusted its movements to keep its balance [22][24].
Group 3: Company Background and Funding
- Skild AI was founded in 2023 and focuses on adaptive AI that works across different hardware and tasks, with a small team of approximately six employees [25].
- The company has raised a total of $414 million across seed, Series A, and Series B rounds. Notable investors include SoftBank, Nvidia, and Sequoia Capital [29].
- Its valuation rose from $1.5 billion after the Series A round in July 2024 to $4.5 billion following a $100 million round in June 2025 [29].
JD open-sources a whole suite of AI! Many core projects fully open-sourced, and its 10,000-star GitHub project has new progress too
量子位· 2025-09-25 11:42
Core Insights - The article highlights advances in domestic AI agents, particularly JoyAgent, which has posted significant accuracy gains in global evaluations and now ranks among the top tier of AI agents worldwide [1][10][43].
Group 1: JoyAgent and Its Features
- JoyAgent is the first fully open-source enterprise-level AI agent, and businesses can deploy it without additional development [7][10].
- The upgrade to JoyAgent 3.0 open-sources the DataAgent and DCP data governance modules, addressing data utilization challenges in enterprises [11][13].
- JoyAgent 3.0 achieved a validation accuracy of 77% and a test accuracy of over 67% in the GAIA evaluation [1][43].
Group 2: Open Source Initiatives
- JD Cloud has systematically open-sourced its AI capabilities, including the medical model 京医千询2.0, which integrates trustworthy reasoning and multimodal capabilities [5][53].
- The OxyGent multi-agent framework lets developers assemble AI teams through a simple Python interface, promoting flexibility and ease of use [46][48].
- The open-source strategy aims to build a comprehensive ecosystem that addresses industry pain points and supports the practical application of AI technologies [72][76].
Group 3: Industry Impact and Future Directions
- JD's open-source efforts are designed to lower the barriers to enterprise AI adoption, turning complex business scenarios into accessible solutions [73][76].
- The initiative encourages collaboration among developers, fostering a community that can build new applications on proven technologies [73][75].
- By establishing a unified technical standard through projects such as the DGP data governance protocol, JD aims to improve interoperability and drive industry-wide progress [75][76].
Chinese team redefines "Stargate"! The world's first space computing constellation is now in regular commercial use
量子位· 2025-09-25 11:42
Core Insights - The article discusses the successful deployment of traffic recognition models on satellites, a significant advance in using space-based AI for urban traffic analysis [4][15][22].
- The achievement marks the transition of space computing from experimental to operational, establishing a new paradigm for AI deployment in the industry [15][23].
Group 1: Space-Based AI Capabilities
- The complete process of image collection, model inference, and structured result transmission was executed in orbit, demonstrating the feasibility of on-satellite computation [2][10].
- The task was supported by the space computing constellation launched by Guoxing Aerospace, which is now in regular commercial operation [5][6].
- The system can support models with billions of parameters and has full-process capabilities, including image acquisition, model inference, task scheduling, and communication [12][13].
Group 2: Commercialization and Operationalization
- The successful execution of the task by the team from Jiadu Technology is the first commercial use of the global space computing constellation [9][15].
- Guoxing Aerospace has become the first company in the world to provide regular satellite-level space computing services, a milestone in the AI field [15][22].
- The "Star Computing" plan aims to establish a green, low-carbon space computing infrastructure with total computing power exceeding 100,000 PetaFLOPS [12].
Group 3: Implications for AI Deployment
- Running AI models in orbit adds a new dimension to data processing: data is processed at the source, which significantly reduces response times [21][22].
- The shift changes not only the physical location of computation but also the system architecture, enabling faster decisions for industries that need rapid assessments [20][22].
- The initiative redefines space as an integral part of intelligent systems, turning it from a mere data source into an active processing environment [19][23].