ACL 2025 | Bringing Novel Characters to "Life"! Fudan's BookWorld Builds an Immersive Novel-World Simulation System
机器之心· 2025-06-24 06:46
BookWorld was led by Yiting Ran and Xintao Wang of Fudan University, under the joint supervision of Prof. Deqing Yang and Prof. Yanghua Xiao. Fudan's Knowledge Works Laboratory has long studied the personification and role-playing of large language models, publishing multiple top-conference papers and the first survey in this area.

Imagine creating an AI world for Dream of the Red Chamber or Game of Thrones. The book's characters become AI agents living inside BookWorld. Every day they wake up, think, converse and interact with one another, and build feelings and relationships. If they could live their own lives, no longer steered by the author's pen, would the story turn out differently? Might there be a parallel timeline in which Baoyu and Daiyu get their happy romance?

The ACL 2025 paper introduced today, "BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation", focuses on making the characters of a novel truly "come alive" inside an immersive virtual world. In BookWorld, the authors propose a "novel -> AI world -> story creation" system. BookWorld extracts character and worldview data from a novel, builds an AI world from it, and lets the character agents interact in that world over the long term, creating their own stories. To achieve fluent and natural long-term interaction, BookWorld models the characters ...
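To make the "novel -> AI world -> story creation" pipeline above concrete, here is a minimal sketch of a round-based character-agent loop. Everything in it (the CharacterAgent class, the llm() stub, the turn-taking scheme) is a hypothetical illustration of the general idea, not BookWorld's actual code or API.

```python
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    # Stub: swap in any real chat-completion call here.
    return "(acts in character)"

@dataclass
class CharacterAgent:
    name: str
    profile: str                       # persona extracted from the source novel
    memory: list = field(default_factory=list)

    def act(self, scene: str) -> str:
        prompt = (
            f"You are {self.name}. Persona: {self.profile}\n"
            f"Recent memory: {self.memory[-5:]}\n"
            f"Scene so far: {scene}\n"
            "What do you say or do next?"
        )
        action = llm(prompt)
        self.memory.append(action)     # long-term interaction needs memory
        return action

def simulate(agents: list, scene: str, rounds: int = 3) -> list:
    """Agents take turns acting; the transcript is the raw generated story."""
    transcript = []
    for _ in range(rounds):
        for agent in agents:
            event = f"{agent.name}: {agent.act(scene)}"
            transcript.append(event)
            scene += "\n" + event      # later agents observe earlier actions
    return transcript

cast = [CharacterAgent("Baoyu", "gentle, romantic"),
        CharacterAgent("Daiyu", "sensitive, poetic")]
for line in simulate(cast, "A spring morning in the garden."):
    print(line)
```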
A New 3D VLA Paradigm! CAS & ByteDance Seed Propose BridgeVLA, Winning First Place at a CVPR 2025 Workshop!
机器之心· 2025-06-24 01:46
Core Viewpoint
- The introduction of the BridgeVLA model represents a significant advancement in 3D Vision-Language Action (VLA) paradigms, achieving a high success rate in robotic manipulation tasks through efficient data usage and effective operation strategies [4][6][22].

Group 1: Model Development
- The BridgeVLA model integrates the strengths of 2D and 3D VLA models, aiming for both efficiency and effectiveness in robotic operations [2][3].
- The core concept of BridgeVLA is to align the input and output of Vision-Language Models (VLM) and VLA in a unified 2D space, avoiding traditional 3D encoding methods [6][7].
- The model's output has shifted from Next Token Prediction to Heatmap Prediction, allowing for better spatial structure utilization and alignment in 2D space [7][10].

Group 2: Training Methodology
- A novel scalable pre-training method is introduced, where the model learns to predict 2D Heatmaps from image-target text pairs, enhancing its object detection capabilities [8][10].
- The model uses a coarse-to-fine multi-stage prediction approach, refining predictions through iterative processing of point clouds [12].

Group 3: Experimental Results
- In RLBench tasks, BridgeVLA significantly improved the average success rate from 81.4% to 88.2%, outperforming existing baseline methods [14].
- In the COLOSSEUM benchmark, BridgeVLA demonstrated robust performance, increasing the average success rate from 56.7% to 64.0% across various perturbations [16].
- The model achieved the highest average success rate in the GemBench evaluation, particularly excelling in L2 and L3 settings [17].

Group 4: Real-World Application
- In real-world evaluations, BridgeVLA outperformed the leading baseline RVT-2 in six out of seven tested settings, showcasing its robustness against visual disturbances [19][20].
- The model's pre-training on 2D heatmaps proved crucial for understanding language semantics and generalizing to new object-skill combinations [20].

Group 5: Future Directions
- Future research will explore diverse pre-training tasks to enhance the model's visual understanding capabilities [22].
- The integration of more expressive action decoding methods and the use of large language models for task decomposition are planned to improve performance in complex long-duration tasks [22].
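As a rough illustration of the Heatmap Prediction idea in Groups 1-2 above, the sketch below replaces next-token action decoding with a spatial softmax over image features, reading the target location off the heatmap's peak. The backbone, shapes, and head design are assumptions for the sketch, not BridgeVLA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapHead(nn.Module):
    """Collapses backbone image features to a probability map over pixels."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.to_logits = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) patch features from the vision-language backbone
        logits = self.to_logits(feats).flatten(1)   # (B, H*W)
        return F.softmax(logits, dim=-1)            # 2D heatmap, flattened

def peak_pixel(heatmap: torch.Tensor, w: int) -> torch.Tensor:
    """Read the most likely target (row, col) off the heatmap's peak."""
    idx = heatmap.argmax(dim=-1)
    return torch.stack((idx // w, idx % w), dim=-1)

# Training would supervise the heatmap with cross-entropy against a target
# map (e.g., a Gaussian at the annotated 2D location), matching the
# image-target-text pre-training described above.
feats = torch.randn(2, 256, 16, 16)
hm = HeatmapHead()(feats)
print(peak_pixel(hm, w=16))   # (2, 2) tensor of (row, col) predictions
```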
Registration Is Open! Stop Reading Papers Alone: Come to the ACL 2025 Paper-Sharing Meetup for Face-to-Face Exchange
机器之心· 2025-06-24 01:46
Half of 2025 is already behind us, and the AI field is still developing at full speed. From the evolution of large models, to the fusion of multimodal systems, to continuing breakthroughs in reasoning ability and interpretability, AI is advancing at an unprecedented pace.

Yet this very speed makes it hard to keep up. New models and new frameworks appear one after another, and breakthrough results reshape our understanding every few weeks.

Against this backdrop, keeping abreast of the most cutting-edge technical developments has become a challenge shared by every AI practitioner. Piecemeal information gathering is no longer enough; it is increasingly important to take part in authoritative academic exchanges in a systematic way, study the latest research in depth, and stay in dialogue with top researchers.

Academic conferences, above all global top venues such as ACL, NeurIPS, ICML, and CVPR, are the core arenas where these technologies converge. Whether through papers examined in depth or hotly debated frontier talks, they offer an excellent window onto the trajectory of AI.

As one of the most influential conferences in NLP, ACL attracts a large number of researchers every year. This year ACL received more than 8,000 submissions in total, an all-time record. ACL 2025 will open July 27 - August 1 in Vienna, Austria.

Meetup time: 09:00-17:30, July 19, Beijing time. For the full agenda, stay tuned for follow-up announcements from 机器之心.

Partner introductions ...
Where Am I? Where Am I Going? How Do I Get There? ByteDance Proposes the Astra Dual-Model Architecture for Free Robot Navigation
机器之心· 2025-06-23 09:39
Core Viewpoint
- The article discusses the challenges faced by traditional navigation systems in mobile robotics and introduces ByteDance's innovative dual-model architecture, Astra, which aims to enhance navigation capabilities in complex indoor environments [2][4].

Group 1: Traditional Navigation Challenges
- Mobile robots must address three core navigation challenges: goal localization, self-localization, and path planning, which are critical for safe and reliable movement in complex environments [3].
- Traditional navigation systems often rely on multiple modules and small models, which can be inefficient and require further exploration for effective integration [3].

Group 2: Astra Dual-Model Architecture
- Astra consists of two sub-models: Astra-Global for low-frequency tasks like goal and self-localization, and Astra-Local for high-frequency tasks such as local path planning and odometry estimation [5].
- Astra-Global utilizes a multimodal large language model (MLLM) to process visual and language inputs for precise localization on a global map [8][10].

Group 3: Astra-Global Functionality
- Astra-Global employs a two-stage process for visual-language localization, achieving high accuracy in identifying locations based on visual inputs and natural language instructions [11][12].
- The model's training involves diverse datasets and a reward-based optimization approach, resulting in a significant improvement in localization accuracy, achieving 99.9% in unseen environments compared to 93.7% with traditional methods [12].

Group 4: Astra-Local Functionality
- Astra-Local is designed for efficient local path generation and odometry estimation, incorporating a 4D spatiotemporal encoder and a planning head [13][15].
- The planning head utilizes a transformer-based flow matching method to generate executable trajectories while minimizing collision rates through a mask ESDF loss approach [16][23].

Group 5: Experimental Validation
- Extensive experiments in various indoor environments, including warehouses and offices, validate Astra's innovative architecture and algorithm effectiveness [19].
- Astra-Global demonstrates superior multimodal localization capabilities, significantly outperforming traditional visual place recognition methods in accuracy and robustness [20][23].

Group 6: Future Prospects
- Astra has potential applications in diverse environments such as shopping malls, hospitals, and libraries, enhancing service efficiency and user experience [25].
- Future improvements are planned for Astra-Global's semantic detail retention and the introduction of active exploration mechanisms to enhance localization robustness in complex settings [25][26].
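The flow-matching planning head mentioned in Group 4 can be illustrated with a toy sketch: a learned velocity field is Euler-integrated to carry noise to a waypoint trajectory. The MLP stand-in, horizon, and step count are assumptions made for the sketch; Astra-Local's real head is a transformer conditioned on 4D spatiotemporal features.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Stand-in network: predicts d(trajectory)/dt from the current noisy
    trajectory and a scalar time t in [0, 1]."""
    def __init__(self, horizon: int = 16, dim: int = 2):
        super().__init__()
        self.horizon, self.dim = horizon, dim
        self.net = nn.Sequential(
            nn.Linear(horizon * dim + 1, 256), nn.ReLU(),
            nn.Linear(256, horizon * dim),
        )

    def forward(self, traj: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        x = torch.cat([traj.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.dim)

@torch.no_grad()
def sample_trajectory(field: VelocityField, steps: int = 10) -> torch.Tensor:
    """Integrate the learned field with Euler steps, as in flow matching."""
    traj = torch.randn(1, field.horizon, field.dim)  # start from pure noise
    for i in range(steps):
        t = torch.full((1,), i / steps)
        traj = traj + field(traj, t) / steps         # Euler update
    return traj  # (1, horizon, 2) waypoints in the robot's local frame

print(sample_trajectory(VelocityField()).shape)
```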
Does AI Really Need to Think "Like a Human"? AlphaOne Reveals a "Way of Thinking" for Large Models
机器之心· 2025-06-23 07:44
Core Viewpoint
- The article discusses a new reasoning framework called AlphaOne, which suggests that AI models should adopt a "slow thinking first, fast thinking later" approach during testing, contrasting with the traditional human-like reasoning paradigm [4][5][6].

Group 1: Introduction of AlphaOne
- AlphaOne introduces a global reasoning control hyperparameter α that allows models to switch from slow to fast reasoning without additional training, significantly improving reasoning accuracy and efficiency [6][12].
- The framework challenges the assumption that AI must think like humans, proposing a more effective reasoning strategy [6][4].

Group 2: Mechanism of AlphaOne
- The core mechanism of AlphaOne involves the introduction of a unified control point called the α-moment, which dictates when to transition from slow to fast thinking [16][18].
- Prior to the α-moment, the model uses a probability-driven strategy to guide deep reasoning, while after the α-moment, it switches to a fast thinking mode [20][24].

Group 3: Experimental Results
- In experiments across six reasoning tasks, AlphaOne demonstrated superior accuracy compared to existing models, with a notable increase of +6.15% in accuracy for a 1.5 billion parameter model [28][29].
- Despite employing a slow thinking mechanism, AlphaOne reduced the average number of generated tokens by 14%, showcasing its efficiency [30].

Group 4: Scalability and Flexibility
- The α-moment allows for scalable adjustments to the thinking phase length, with the ability to increase or decrease the number of slow thinking markers based on the α value [34].
- The framework maintains robust performance across a wide range of α values, indicating its generalizability [34].

Group 5: Future Directions
- The article suggests potential future research directions, including the development of more complex slow thinking scheduling strategies and the exploration of cross-modal reasoning applications [46][48].
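A hedged sketch of the α-moment mechanism from Group 2 above: before the α-moment the decoder probabilistically injects slow-thinking markers, and at the α-moment it forces a switch to fast answering. The token strings, the Bernoulli schedule, and the budget arithmetic here are illustrative assumptions rather than the paper's exact formulation.

```python
import random

SLOW = "wait"            # marker that nudges the model to keep deliberating
END_THINK = "</think>"   # marker that ends the slow-thinking phase

class DummyModel:
    """Stand-in for a reasoning LLM; emits filler tokens then stops."""
    def next_token(self, tokens):
        return "step" if len(tokens) < 40 else "<eos>"

def generate_with_alpha(model, avg_think_len=16, alpha=1.4, p_slow=0.3):
    # The alpha-moment is the token index where slow thinking is cut off;
    # alpha > 1 lengthens deliberation, alpha < 1 shortens it.
    alpha_moment = int(alpha * avg_think_len)
    tokens = []
    while not tokens or tokens[-1] != "<eos>":
        if len(tokens) < alpha_moment and random.random() < p_slow:
            tokens.append(SLOW)          # slow phase: inject deliberation
        elif len(tokens) >= alpha_moment and END_THINK not in tokens:
            tokens.append(END_THINK)     # fast phase: commit to answering
        else:
            tokens.append(model.next_token(tokens))
    return tokens

print(generate_with_alpha(DummyModel()))
```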
Losslessly Cut Activation Memory by 80% and Train 5x Longer Sequences, with Just Two Lines of Code
机器之心· 2025-06-23 07:44
Core Insights
- The article discusses the StreamBP algorithm, which significantly reduces the memory required for training large language models (LLMs) by optimizing the backpropagation process [3][6][15].

Group 1: StreamBP Algorithm
- StreamBP reduces the memory consumption of activation values to about 20% of that required by gradient checkpointing, allowing for longer sequence lengths during training [3][6].
- Under the same memory constraints, StreamBP can achieve a maximum sequence length that is 2.8 to 5.5 times greater than that of gradient checkpointing [6][22].
- The algorithm is applicable to common LLM objective functions such as SFT, GRPO, PPO, and DPO, and its code is open-sourced for integration into existing training frameworks [6][12].

Group 2: Memory and Performance Comparison
- In terms of memory usage, StreamBP requires only 5% to 15% of the total activation memory for all layers, while a single layer's complete activation values account for over 85% of the memory [13][19].
- A comparison of memory and time costs between standard backpropagation and StreamBP shows that StreamBP significantly reduces peak memory usage while maintaining similar computational costs [14][25].

Group 3: Application in LLM Training
- StreamBP is specifically designed to optimize memory usage in the Transformer layers and lm_head layers of LLMs, effectively lowering the memory consumption of layer activations and logits [16][20].
- The algorithm allows for larger batch sizes and faster training times by enabling longer sequence lengths, which is crucial for training efficiency [25][28].
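The memory saving on the lm_head described in Group 3 can be illustrated by chunking the loss computation over the sequence and backpropagating each chunk immediately, so the full sequence-by-vocabulary logits tensor never exists at once. This is a sketch of the principle only, not StreamBP's released code (which also streams the Transformer layers themselves).

```python
import torch
import torch.nn.functional as F

def chunked_lm_loss_backward(hidden, lm_head, labels, chunk=512):
    """hidden: (T, d) final hidden states (requires_grad); labels: (T,).
    Backprops the mean cross-entropy without ever holding all logits."""
    total = labels.numel()
    loss_sum = 0.0
    for s in range(0, total, chunk):
        logits = lm_head(hidden[s:s + chunk])        # (chunk, vocab) only
        loss = F.cross_entropy(logits, labels[s:s + chunk], reduction="sum")
        (loss / total).backward()                    # frees this chunk's graph
        loss_sum += loss.item()
    return loss_sum / total                          # equals the full-batch loss

# Toy scale: with a 32k vocab, full logits for 4,096 tokens would be a
# 4096 x 32768 tensor; each chunk materializes only 512 x 32768 at a time.
hidden = torch.randn(4096, 64, requires_grad=True)
lm_head = torch.nn.Linear(64, 32768)
labels = torch.randint(0, 32768, (4096,))
print(chunked_lm_loss_backward(hidden, lm_head, labels))
```

Because each chunk's sum-loss is scaled by the total token count before backward, the accumulated gradients match those of the full-batch mean loss exactly, which is why this style of streaming is lossless.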
Fresh Off the Press! Stanford's 2025 CS336 Course Is Fully Public: Building Large Models from Scratch
机器之心· 2025-06-23 04:04
Core Viewpoint
- The article announces the launch of Stanford University's CS336 course "Language Models from Scratch" for Spring 2025, which aims to guide students through the entire process of developing their own language models [1][8].

Group 1: Course Overview
- CS336 is designed to help students gain a comprehensive understanding of language models by guiding them through various stages, including data collection, model construction, training, and evaluation [8].
- The course structure consists of 5 units and 19 lectures, with a focus on practical implementation and hands-on experience [10].

Group 2: Instructors
- Tatsunori Hashimoto, an assistant professor at Stanford, has a strong background in machine learning and has received over 30,000 citations for his research [2].
- Percy Liang, an associate professor and director of the Center for Research on Foundation Models (CRFM), has over 100,000 citations and extensive experience in AI research [6][7].

Group 3: Course Requirements
- Students are expected to have proficiency in Python, deep learning, and system optimization, as well as a solid understanding of calculus, linear algebra, and basic probability and statistics [11].
- The course emphasizes minimal scaffolding, requiring students to write significantly more code compared to other AI courses [11].
After a Ten-Year Wait, Tesla's Robotaxi Is Finally Live! Musk: A Flat $4.20 per Ride
机器之心· 2025-06-23 04:04
Tips are accepted, of course.

机器之心 report. Editor: Yang Wen.

Musk has finally stopped making empty promises! A first ride in a $4.20 Tesla Robotaxi: smooth, but not yet mature.

Musk has delivered on his word. As early as ten years ago, Elon Musk said repeatedly that Tesla was capable of launching a driverless service, only to fall short. Last Sunday, Tesla finally launched its autonomous taxi service in Austin, Texas.

Musk posted his congratulations on X, and revealed that the first passengers would ride for a "flat price" of $4.20. The comment section filled with cheers.

A limited pilot, not yet fully open: for now, Tesla's Robotaxi service is available to invited users only and has not been opened to the general public. The first riders are mainly well-known social media bloggers and tech content creators who support Tesla, so outside observers remain reserved about how objective the early reviews are. Tesla has given no clear timetable for when the service will formally open to the public.

This small-scale pilot deploys roughly 10 to 20 Model Y vehicles bearing "Robotaxi" markings. The highly anticipated fully autonomous Cybercab, first unveiled last year, is not expected to enter service until 2026 or later.

The current service area is strictly limited to a geofenced region that Tesla has mapped in detail, bounded roughly on the north by the Colorado River, on the east by Highway 183, and on the south by 290 ...
Hailuo's New Model Goes Viral Overseas: Overnight, Cats, Alpacas, and Giraffes Have All Learned to Dive
机器之心· 2025-06-22 05:57
Core Viewpoint
- The article discusses the recent popularity of AI-generated videos featuring animals performing complex movements, particularly highlighting the capabilities of Minimax's new model "Hailuo 02", which can generate intricate acrobatic actions [7][17].

Group 1: AI Model Capabilities
- Minimax's "Hailuo 02" is claimed to be the only model globally capable of generating complex physical movements, such as gymnastics [7].
- The model utilizes a new architecture called "Noise-aware Compute Redistribution (NCR)" to enhance its performance in generating these movements [16].
- The AI-generated videos have gained significant traction on social media, showcasing high-difficulty actions that were previously challenging for AI to replicate [8][17].

Group 2: User Engagement and Content Creation
- The rise of AI video tools allows ordinary users to express their creativity, leading to a broader range of content available to audiences [17].
- The effectiveness of AI video generation also depends on the quality of the prompts provided by users, which can significantly influence the output [12][18].
- The article mentions that other models, such as Alibaba's Tongyi Wanxiang wan-2.1-t2v, can also perform well in generating similar content, indicating a competitive landscape in AI video generation [13].
How Do Large Models Actually "Think"? The First Systematic Survey of SAEs Is Here
机器之心· 2025-06-22 05:57
Core Viewpoint
- The article emphasizes the need for not just "talkative" large language models (LLMs) but also "explainable" ones, highlighting the emergence of the Sparse Autoencoder (SAE) as a leading method for mechanistic interpretability in understanding LLMs [2][10].

Group 1: Introduction to Sparse Autoencoder (SAE)
- SAE is a technique that helps interpret the internal mechanisms of LLMs by decomposing high-dimensional representations into sparse, semantically meaningful features [7][10].
- The activation of specific features by SAE allows for insights into the model's "thought process," enabling a better understanding of how LLMs process information [8][10].

Group 2: Technical Framework of SAEs
- The technical framework of SAE includes an encoder that decomposes LLM's high-dimensional vectors into sparse feature vectors, and a decoder that attempts to reconstruct the original LLM information [14].
- Various architectural variants and improvement strategies of SAE are discussed, such as Gated SAE and TopK SAE, which address specific challenges like shrinkage bias [15].

Group 3: Explainability Analysis of SAEs
- SAE facilitates concept discovery by automatically mining semantically meaningful features from the model, enabling better understanding of aspects like temporal awareness and emotional inclination [16].
- It allows for model steering by activating or suppressing specific features to guide model outputs, and aids in anomaly detection to identify potential biases or safety risks [16].

Group 4: Evaluation Metrics and Methods
- Evaluation of SAE involves both structural assessment (e.g., reconstruction accuracy and sparsity) and functional assessment (e.g., understanding LLM and feature stability) [18].

Group 5: Applications in Large Language Models
- SAE is applied in various practical scenarios, including model manipulation, behavior analysis, hallucination control, and emotional steering, showcasing its versatility [19].

Group 6: Comparison with Probing Methods
- The article compares SAE with traditional probing methods, highlighting SAE's unique potential in model manipulation and feature extraction, while acknowledging its limitations in complex scenarios [20].

Group 7: Current Research Challenges and Future Directions
- Despite its promise, SAE faces challenges such as unstable semantic explanations and high training costs, with future breakthroughs anticipated in cross-modal expansion and automated explanation generation [21].

Conclusion
- The article concludes that future explainable AI systems should not only visualize model behavior but also provide structured understanding and operational capabilities, with SAE offering a promising pathway [23].
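To ground the encoder/decoder framework from Group 2 above, here is a minimal TopK SAE in PyTorch: the encoder expands an LLM hidden state into an overcomplete feature vector, keeps only the k largest activations, and the decoder reconstructs the original state. The dimensions and the choice of the TopK variant are illustrative assumptions, not the survey's reference implementation.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int = 768, d_feat: int = 16384, k: int = 32):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feat)        # overcomplete expansion
        self.dec = nn.Linear(d_feat, d_model, bias=False)
        self.k = k

    def forward(self, h: torch.Tensor):
        # Encode, then keep only the k largest activations per input;
        # this sparsity is what makes individual features interpretable.
        z = torch.relu(self.enc(h))
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        h_hat = self.dec(z_sparse)                   # reconstruct the LLM state
        return h_hat, z_sparse

sae = TopKSAE()
h = torch.randn(4, 768)                        # hidden states from some layer
h_hat, z = sae(h)
recon_loss = torch.mean((h - h_hat) ** 2)      # structural metric: reconstruction
print(recon_loss.item(), (z != 0).sum(dim=-1)) # each row has exactly k active
```

Steering a model, in this framing, amounts to clamping or zeroing chosen coordinates of z_sparse before decoding and feeding the edited reconstruction back into the forward pass.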