机器之心
Just in: UCLA's Bolei Zhou has also joined a robotics company
机器之心· 2025-10-15 02:54
Core Insights
- Coco Robotics has appointed Bolei Zhou, an associate professor at UCLA, as Chief AI Scientist to lead its newly established Physical AI Lab, which focuses on autonomous driving for sidewalks [1][2][4]
- The company aims to achieve full automation in last-mile delivery, leveraging the extensive operational data collected over the past five years [2][4][6]

Group 1: Company Overview
- Coco Robotics, founded in 2020, specializes in last-mile delivery robotics and initially relied on teleoperators to navigate obstacles [2][4]
- The company has accumulated millions of miles of data in complex urban environments, which is crucial for training reliable AI systems [4][6]

Group 2: Research and Development
- The Physical AI Lab will use the collected data to improve automation and operational efficiency, focusing on the local models running on the company's robots [6]
- The lab operates independently of Coco Robotics' collaboration with OpenAI, under which the company uses OpenAI's models and shares data for mutual benefit [5][6]

Group 3: Bolei Zhou's Background
- Bolei Zhou holds a PhD from MIT and has a strong research record in machine perception and intelligent decision-making, with over 100 publications and significant contributions to explainable AI [9][11][13]
- His notable works include Class Activation Mapping (CAM) and the Places database, which contains over 10 million labeled scene images and has substantially advanced scene recognition [11][18][20]
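Zhou's Class Activation Mapping, mentioned above, localizes the image regions a classifier relies on by re-weighting the last convolutional layer's feature maps with the classifier weights of the target class. A minimal numpy sketch (shapes and data are toy values, not taken from any real network):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Class Activation Mapping (CAM): weight the last conv layer's
    feature maps by the classifier weights of the target class, sum
    over channels, and min-max normalize. The heatmap highlights the
    image regions most responsible for that class prediction.

    feature_maps: (C, H, W) activations from the final conv layer
    fc_weights:   (num_classes, C) weights of the final linear layer
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

rng = np.random.default_rng(0)
fmaps = rng.random((512, 7, 7))     # toy activations, not a real network
weights = rng.random((1000, 512))   # toy classifier weights
heatmap = class_activation_map(fmaps, weights, class_idx=283)
# upsample `heatmap` to the input resolution to overlay it on the image
```

In a real pipeline the feature maps come from the network's final convolutional block and the weights from its global-average-pooling classifier head.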
Professor Yijie Peng's group at Peking University proposes RiskPO, reshaping large-model post-training with risk-measure optimization
机器之心· 2025-10-15 02:54
Core Insights
- The article discusses the limitations of traditional reinforcement learning (RL) methods in enhancing the reasoning capabilities of large models, highlighting the "mean optimization trap" that suppresses exploration and stalls learning on challenging tasks [4][24]
- A new approach called RiskPO is introduced, which integrates risk-averse principles into the optimization objective, focusing on the left tail of the reward distribution to guide models in overcoming reasoning weaknesses [7][24]

Research Background and Challenges
- Large models face the "mean optimization trap" in post-training, which results in a loss of exploration ability and ineffective learning on difficult tasks [4][24]
- Existing methods such as GRPO improve short-term metrics but do not expand the reasoning boundary needed for complex tasks [4][24]

Technical Solution Overview
- RiskPO combines risk measurement with a bundling strategy to address the shortcomings of traditional mean optimization [6][7]
- At its core is the Mixed Value-at-Risk (MVaR) objective, which replaces the pursuit of the overall mean reward with an emphasis on low-reward, difficult problems [9][10]

Experimental Results
- The Peking University team demonstrated the effectiveness of RiskPO across a range of tasks, with significant gains in reasoning ability, particularly on challenging problems [15][18]
- On AIME24, RiskPO outperformed GRPO by nearly 7 percentage points in Pass@32, and on the MATH500 dataset it reached 81.8% Pass@1, surpassing GRPO by 2.6 percentage points [15][16]

Theoretical Support and Validation
- RiskPO's performance gains are backed by theoretical analysis and rigorous ablation studies, which show that risk-averse updates effectively mitigate entropy collapse [20][21]
- While mean-based metrics look similar across methods early in training, risk-sensitive metrics reveal a significant advantage for RiskPO as training progresses [23][24]

Comparison with Alternative Strategies
- A comparison with risk-seeking strategies showed that focusing on easier tasks leads to rapid entropy collapse and performance stagnation, whereas the risk-averse strategy drives continuous improvement [26][27]
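The left-tail emphasis behind MVaR can be illustrated with a toy objective: instead of averaging all rewards equally, up-weight the rewards below a low quantile so that the hardest problems dominate the optimization signal. The quantile cutoff and the weighting below are illustrative assumptions, not the paper's actual MVaR formula:

```python
import numpy as np

def mvar_objective(rewards, alpha=0.2, tail_weight=2.0):
    """Illustrative risk-averse objective: up-weight rewards at or below
    the alpha-quantile (the hardest problems) instead of taking a plain
    mean. The cutoff and weighting are toy assumptions; the paper's
    exact MVaR formula is not reproduced here.
    """
    rewards = np.asarray(rewards, dtype=float)
    cutoff = np.quantile(rewards, alpha)           # left-tail boundary
    weights = np.where(rewards <= cutoff, tail_weight, 1.0)
    return float(np.sum(weights * rewards) / np.sum(weights))

rewards = [0.0, 0.1, 0.2, 0.9, 0.95, 1.0]   # hard problems on the left
mean_obj = float(np.mean(rewards))
risk_obj = mvar_objective(rewards)
# the risk-averse objective sits below the mean, so maximizing it puts
# extra gradient pressure on the low-reward (difficult) problems
```

Because the weighted objective is pulled down by the left tail, improving the hardest problems moves it more than polishing the easy ones — the opposite of the mean-optimization trap described above.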
NeurIPS 2025 Spotlight | Conditional representation learning: aligning representations with criteria in a single step
机器之心· 2025-10-15 02:54
The first author of this paper is Honglin Liu (刘泓麟), a PhD student at Sichuan University (email: tristanliuhl@gmail.com); the corresponding authors are Yunfan Li (李云帆), a postdoctoral researcher at Sichuan University, and Professor Xi Peng (彭玺), also of Sichuan University.

The information contained in an image is multi-dimensional. Take Figure 1 below: we can read at least three levels of information from it: the subject is elephants, the count is two, and the setting is a savanna. Yet if a traditional representation learning method processes this image, say by feeding it into a ResNet or Vision Transformer trained on ImageNet, the resulting representation typically captures only the subject, simply filing the image under the elephant category. This is clearly unsatisfactory.

Figure 1: Comparison of traditional representation learning (top) with conditional representation learning (bottom). Traditional methods learn only a single generic representation and ignore other meaningful information; the proposed conditional representation learning derives, under a specified criterion, a conditional representation that is more expressive for that criterion and adapts to a variety of downstream tasks.

Moreover, on major e-commerce platforms users search for products by different criteria (for example color, material, or occasion): a user may search for "red dress" today, "formal wear" tomorrow, and an entirely new keyword the day after. For platforms with catalogs of this scale, manual tagging is impractical, and traditional representation learning can only obtain ...
The end of the VAE era? Saining Xie's team unveils RAE: representation autoencoders may become the new cornerstone of DiT training
机器之心· 2025-10-14 08:24
Core Insights
- The article discusses the emergence of Representation Autoencoders (RAE) as a potential replacement for Variational Autoencoders (VAE) in generative modeling, highlighting work from the team of Assistant Professor Saining Xie at New York University [1][2]

Group 1: RAE Development
- RAE pairs pre-trained representation encoders (such as DINO, SigLIP, and MAE) with trained decoders to replace the traditional VAE, achieving high-quality reconstruction together with a semantically rich latent space [2][6]
- The new structure addresses the limitations of VAE, such as weak representation capability and the high computational cost of SD-VAE [4][13]

Group 2: Performance Metrics
- RAE achieves an FID of 1.51 at 256×256 resolution without guidance, and 1.13 with guidance at both 256×256 and 512×512 [5][6]
- RAE consistently outperforms SD-VAE in reconstruction quality, with rFID scores indicating better performance across various encoder configurations [18][20]

Group 3: Training and Architecture
- The research introduces a new DiT (Diffusion Transformer) variant, DiT^DH, which adds a lightweight, wide head to improve efficiency without significantly increasing compute cost [3][34]
- The RAE decoder is trained with a frozen representation encoder and a ViT-based decoder, reaching reconstruction quality comparable to or better than SD-VAE [12][14]

Group 4: Scalability and Efficiency
- DiT^DH converges faster and is more compute-efficient than standard DiT, maintaining its advantage across different RAE scales [36][40]
- DiT^DH-XL sets a new state-of-the-art FID of 1.13 after 400 epochs, outperforming previous models while requiring significantly less compute [41][43]

Group 5: Noise Management Techniques
- The research proposes noise-enhanced decoding to make the decoder robust to out-of-distribution latents, improving overall performance [29][30]
- Adjusting the noise schedule to the effective data dimension of RAE significantly improves training, demonstrating the need for tailored noise strategies in high-dimensional latent spaces [28]
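The noise-enhanced decoding idea above can be sketched as a one-line training augmentation: perturb the frozen encoder's latents with Gaussian noise before the decoder sees them. The noise scale here is a placeholder assumption, not the paper's tuned schedule:

```python
import numpy as np

def noise_augmented_latents(latents, sigma=0.1, rng=None):
    """Sketch of noise-enhanced decoding: perturb the frozen encoder's
    latents with Gaussian noise during decoder training, so the decoder
    stays robust to the slightly off-manifold latents a diffusion
    sampler produces at generation time. The noise scale sigma is a
    placeholder, not the paper's schedule.
    """
    rng = rng or np.random.default_rng()
    return latents + sigma * rng.standard_normal(latents.shape)

z = np.zeros((4, 768))   # toy stand-ins for DINO-style patch latents
z_noisy = noise_augmented_latents(z, sigma=0.1, rng=np.random.default_rng(0))
# the decoder is then trained to reconstruct the clean image from z_noisy
```

The design intuition: at sampling time the diffusion model never reproduces encoder latents exactly, so a decoder trained only on clean latents faces out-of-distribution inputs; training on noised latents closes that gap.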
The veteran "Transformer killer" quietly updates at ICLR: Mamba-3's three major improvements approach the design's final form
机器之心· 2025-10-14 08:24
Core Insights
- The article discusses the evolution of the Mamba architecture, a strong contender to the dominant Transformer architecture. Mamba has shown significant improvements in language modeling and inference efficiency, and its latest iteration, Mamba-3, introduces several key enhancements [1][2][3]

Group 1: Mamba Architecture Evolution
- Mamba gained popularity in 2023 as a structured state space model (SSM) architecture, matching or surpassing Transformers on language modeling tasks [2][3]
- Mamba-1 used a continuous-time dynamic model with a selective memory-update mechanism, achieving efficient memory retention without attention [7]
- Mamba-2, released six months later, introduced an improved selective SSM, delivering 2-8× speedups while remaining competitive with Transformers [4][5]

Group 2: Mamba-3 Enhancements
- Mamba-3 introduces three significant improvements: trapezoidal discretization, a complexified state-space model, and a multi-input multi-output (MIMO) SSM, enhancing expressiveness and efficiency [10][13][14]
- Trapezoidal discretization uses both the start and end points of each interval, improving state updates [11]
- The complexified state-space model provides a more expressive state update, overcoming the state-tracking limitations seen in linear real-valued models [13][22]

Group 3: Performance Metrics
- Empirical validation shows Mamba-3 outperforming Mamba-2 and other open-source architectures on a range of language modeling tasks, with superior average accuracy across multiple benchmarks [19][20]
- The MIMO variant improves hardware utilization during decoding, updating states across multiple channels simultaneously without increasing memory requirements [15][26]
- In latency comparisons, Mamba-3 responded faster than Mamba-2 and Gated DeltaNet, particularly in BF16 configurations [27]

Group 4: Application Potential
- Mamba-3's efficient long-sequence processing suits long-document understanding, scientific time-series analysis, and gene modeling, areas where Transformers struggle with context limits [30]
- Its linear-time inference and stable latency also make Mamba-3 an ideal candidate for real-time interactive scenarios such as chat assistants and machine translation [31]
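The difference between a one-endpoint (Euler-style) update and the trapezoidal discretization described above can be seen on a scalar toy SSM dh/dt = a*h + b*x: the trapezoidal rule averages the derivative at both endpoints of each step and is second-order accurate instead of first-order. This is a numerical illustration of the rule itself, not Mamba-3's actual kernel:

```python
import numpy as np

def simulate(a, b, x, dt, n_steps, h0, method):
    """Integrate the scalar SSM  dh/dt = a*h + b*x  with constant input x.

    'euler' uses only the left endpoint of each step (first-order);
    'trapezoid' averages both endpoints, as in trapezoidal
    discretization, and is second-order accurate.
    """
    h = h0
    for _ in range(n_steps):
        if method == "euler":
            h = h + dt * (a * h + b * x)
        else:  # (1 - dt*a/2) * h_next = (1 + dt*a/2) * h + dt*b*x
            h = ((1 + dt * a / 2) * h + dt * b * x) / (1 - dt * a / 2)
    return h

a, b, x, h0 = -1.0, 1.0, 1.0, 0.0
dt, n_steps = 0.1, 10                                    # integrate to t = 1
exact = (h0 + b * x / a) * np.exp(a * 1.0) - b * x / a   # closed form at t=1
err_euler = abs(simulate(a, b, x, dt, n_steps, h0, "euler") - exact)
err_trap = abs(simulate(a, b, x, dt, n_steps, h0, "trapezoid") - exact)
# the trapezoidal update tracks the exact solution far more closely
```

With these settings the Euler error is roughly two orders of magnitude larger than the trapezoidal error at the same step size, which is the accuracy gain the article attributes to using both interval endpoints.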
NeurIPS 25 | Sun Yat-sen University, UC Merced, and others open-source RAPID Hand, redefining data collection for multi-fingered dexterous hands
机器之心· 2025-10-14 08:24
Authors: Zhaoliang Wan, Zetong Bi, Zida Zhou, Hao Ren, Yiming Zeng, Yihan Li, Lu Qi, Xu Yang, Ming-Hsuan Yang, Hui Cheng

Paper title: RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy

Paper: https://www.arxiv.org/abs/2506.07490

Project page: https://rapid-hand.github.io/

Dexterous manipulation is one of the core capabilities a generalist robot needs for multi-task generalization. Whether in everyday household tidying and object placement or in assistive service tasks, a robot without dexterous manipulation can hardly carry out complex interactions.

In recent years, as multimodal large models (VLMs) have been progressively applied to robot control, researchers have begun combining high-quality manipulation demonstrations with pre-trained models for embodied reasoning and generalist manipulation policy ...
Ant Group's Ring-1T officially debuts: a trillion-parameter thinking model with math ability on par with an IMO silver medal
机器之心· 2025-10-14 06:33
Core Insights
- Ant Group has launched the Ling-1T and Ring-1T models, marking significant advances in open-source AI with capabilities comparable to closed-source giants [3][6][19]
- Ring-1T is the first open-source trillion-parameter reasoning model, with exceptional performance across benchmarks and tasks [6][9][19]

Model Launch and Performance
- Ant Group announced Ling-1T, its largest language model to date, on October 9; it logged over a thousand downloads within four days of release [3][5]
- Ring-1T officially launched on October 14, demonstrating superior reasoning ability and notable results on international mathematics competitions [6][19]

Benchmark Testing
- Ring-1T was tested across eight critical benchmarks, including mathematics competitions, code generation, and logical reasoning [12][14]
- Ring-1T significantly outperformed its preview version, achieving state-of-the-art (SOTA) results in multiple dimensions, particularly complex reasoning [9][14][16]

Competitive Analysis
- On logical reasoning tasks, Ring-1T surpassed leading closed-source models such as Gemini-2.5-Pro [16]
- On the Arena-Hard-v2.0 comprehensive ability test it trailed GPT-5-Thinking only slightly, placing it among the industry's top tier [16]

Practical Applications
- Ring-1T demonstrated its coding ability by generating functional game code for simple games such as Flappy Bird and Snake [20][23]
- The model also excelled at creative writing, producing engaging narratives and scripts that weave in historical facts and storytelling techniques [40][43]

Technical Innovations
- Ring-1T's development relied on advanced reinforcement learning techniques, notably the IcePop algorithm, which mitigates training-inference inconsistencies and improves model stability [45][46]
- Ant Group's self-developed RL framework, ASystem, supports efficient training of very large models, addressing hardware resource challenges and improving training consistency [50][52]
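The training-inference inconsistency that IcePop targets arises because the training engine and the inference engine can assign different probabilities to the same sampled token. A hedged sketch of the masking idea follows; the band thresholds and the hard-masking rule are assumptions for illustration, not the exact Ring-1T recipe:

```python
import numpy as np

def icepop_mask(p_train, p_infer, low=0.5, high=2.0):
    """Hedged sketch of the masking idea: when the training engine and
    the inference engine assign too-different probabilities to the same
    token, drop that token from the gradient update rather than
    backpropagate through the mismatch. The band [low, high] and the
    hard-masking rule are illustrative assumptions.
    """
    ratio = np.asarray(p_train, dtype=float) / np.asarray(p_infer, dtype=float)
    return (ratio >= low) & (ratio <= high)   # True = keep the token

p_train = np.array([0.30, 0.10, 0.02, 0.40])  # toy per-token probabilities
p_infer = np.array([0.28, 0.09, 0.30, 0.41])
keep = icepop_mask(p_train, p_infer)
# only the third token (ratio ~= 0.07) falls outside the band and is dropped
```

Filtering out the most mismatched tokens keeps the policy-gradient estimate from being dominated by engine discrepancies rather than by genuine learning signal.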
OpenAI, Anthropic, and DeepMind jointly publish: existing LLM safety defenses cannot withstand a single blow
机器之心· 2025-10-14 06:33
Core Insights
- The article discusses a collaborative research paper by OpenAI, Anthropic, and Google DeepMind on evaluating the robustness of language-model defense mechanisms against adaptive attacks [2][5][6]
- The research finds that existing defense evaluations are flawed because they do not simulate strong attackers capable of countering the defense [5][6][7]

Group 1: Research Framework
- A general adaptive-attack framework is proposed to systematically assess language-model defenses, using optimization methods such as gradient descent, reinforcement learning, and human-assisted exploration [6][12]
- The study bypassed 12 recent defense mechanisms, with attack success rates exceeding 90% on many models that had been claimed nearly unbreakable [6][18]

Group 2: Defense Mechanisms Evaluation
- The research evaluates prompt-based defenses, adversarial training, filtering models, and secret-knowledge defenses, revealing their vulnerability to adaptive attacks [18][24][27][30]
- For prompt-based defenses such as Spotlighting and RPO, the attack success rate under adaptive conditions exceeded 95%, despite low rates on static benchmarks [18][21][23]
- Adversarial-training methods such as Circuit Breakers were easily bypassed, with a 100% attack success rate, indicating that training against fixed adversarial samples does not generalize to unseen adaptive attacks [24][26]

Group 3: Conclusion and Implications
- The findings suggest that relying on a single defense strategy is inadequate, since attackers readily adapt to fixed defenses [9][23]
- The research emphasizes the need for dynamic optimization in defense mechanisms to achieve meaningful robustness against evolving threats [26][30]
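The gap between static benchmarks and adaptive attackers can be shown with a deliberately trivial toy: a filter that blocks one fixed string scores perfectly on a static evaluation, yet an attacker that mutates its input in a loop walks straight through. The paper's framework uses gradient descent, reinforcement learning, and human-assisted search rather than this toy random search:

```python
import random

def toy_keyword_defense(prompt):
    """Stand-in for a static filter defense (illustrative only): block
    any prompt containing the literal string 'exploit'."""
    return "exploit" not in prompt.lower()   # True = allowed through

def adaptive_attack(payload, steps=500, seed=0):
    """Minimal sketch of an adaptive-attack loop: keep mutating the
    payload until the defense passes it. Random character insertion is
    already enough to defeat a literal-match filter."""
    rng = random.Random(seed)
    candidate = payload
    for _ in range(steps):
        if toy_keyword_defense(candidate):
            return candidate                  # defense bypassed
        i = rng.randrange(len(candidate) + 1)
        candidate = candidate[:i] + "-" + candidate[i:]
    return None

blocked = not toy_keyword_defense("run this exploit")   # static eval: blocked
bypassed = adaptive_attack("run this exploit")
# a static benchmark reports 0% attack success; the adaptive loop wins anyway
```

This is the paper's core point in miniature: a defense evaluated only against a fixed attack set measures the attack set, not the defense.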
Stanford, NVIDIA, and Berkeley propose an embodied Test-Time Scaling Law
机器之心· 2025-10-14 06:33
Core Insights
- The article discusses advances in Vision-Language-Action (VLA) models, focusing on robustness and generalization in real-world applications through a "generate-and-verify" paradigm [2][5][20]

Group 1: Key Findings
- Increasing the number of candidate actions at inference time leads to a continuous decrease in action error for VLA models [5]
- A power-law relationship holds between action error and the number of Gaussian perturbations sampled, indicating that robot control should be framed as generating candidate actions and verifying them [5][20]
- The proposed Test-Time Scaling Law shows predictable improvements in task success rate and stability as the sampling-and-verification scale increases [2][20]

Group 2: Methodology Overview
- The first phase trains an action verifier on a synthetic action-preference dataset derived from the RMSE differences between candidate and ground-truth actions [8]
- The second phase scales compute at inference time, using the trained verifier to improve the stability of VLA models [9][12]

Group 3: Experimental Results
- Integrating RoboMonkey with VLA models yielded significant gains, including a 25% increase in success rate on out-of-distribution tasks and a 9% increase in the in-distribution SIMPLER environment [17]
- The accuracy of the RoboMonkey verifier grows log-linearly with the size of the synthetic dataset, improving performance across environments [16]

Group 4: Practical Deployment
- A dedicated VLA serving engine supports high-speed action resampling and efficient construction of action-proposal distributions, reducing inference cost [19]
- Larger high-bandwidth memory allows higher throughput, further enhancing the generalization of robotic foundation models [19]
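The generate-and-verify loop described above can be sketched in a few lines: sample Gaussian perturbations of the policy's action, score each candidate with a verifier, and execute the best one. The verifier below is an oracle stand-in for illustration; RoboMonkey trains its verifier from synthetic action preferences:

```python
import numpy as np

def best_of_n_action(base_action, verifier, n, sigma=0.1, seed=0):
    """Generate-and-verify: draw n Gaussian perturbations of the
    policy's base action, score each candidate with a verifier, and
    return the highest-scoring one. The verifier here is an oracle
    stand-in, not a trained model."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n, base_action.shape[0]))
    candidates = base_action + sigma * noise
    scores = np.array([verifier(c) for c in candidates])
    return candidates[np.argmax(scores)]

target = np.array([0.5, -0.2, 0.1])            # toy ground-truth action
base = target + np.array([0.15, -0.10, 0.10])  # imperfect policy output
verifier = lambda a: -np.linalg.norm(a - target)   # oracle stand-in

errors = {n: float(np.linalg.norm(best_of_n_action(base, verifier, n) - target))
          for n in (1, 16, 256)}
# sampling more candidates lets the verifier pick an action closer to target
```

As the article notes, the error of the selected action falls predictably as n grows, which is the test-time scaling behavior the authors formalize as a power law.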
The scene stands still while the viewer moves: how do MLLMs cope with a real world seen on the move? OST-Bench reveals multimodal large models' weaknesses in online spatiotemporal understanding
机器之心· 2025-10-14 06:33
Core Insights
- The article introduces OST-Bench, a new benchmark for evaluating multimodal large language models (MLLMs) in dynamic online settings, emphasizing the challenges of real-world embodied perception and reasoning [2][24]

Group 1: Benchmark Characteristics
- OST-Bench reflects the core challenges of embodied perception in real-world settings, in contrast to traditional offline benchmarks that ignore dynamic scene exploration [2][7]
- The benchmark assesses models' ability to perform real-time perception, memory maintenance, and spatiotemporal reasoning from continuous local observations [7][10]
- It covers 15 sub-tasks across judgment, estimation, counting, and temporal localization, with a dataset of 10,000 test samples and 50,000 training samples [8][10]

Group 2: Model Performance and Challenges
- Current mainstream MLLMs show significant performance gaps relative to humans, particularly in reasoning over information across time [17]
- Models struggle with complex spatiotemporal reasoning, often falling into "spatio-temporal reasoning shortcuts" that yield superficial answers without genuine reasoning [18][21]
- Fine-tuning experiments show that models gain over 10% with additional training data yet still fall short of 50% accuracy on complex reasoning tasks, highlighting the need for better model design and training strategies [23][24]