Just In: Doubao's Coding Model Has Arrived, and We Put It Through Four Challenge Levels!
机器之心· 2025-11-11 08:40
Core Insights
- The article discusses the evolution of AI programming assistants, highlighting the shift from simple code-completion tools to more advanced models capable of understanding complex tasks and contexts. This evolution follows two main routes: IDE enhancement and agentic coding [1][2].

Group 1: AI Programming Assistant Evolution
- AI programming assistants have significantly changed development workflows, with even skeptics like Linus Torvalds acknowledging their utility [1].
- The article identifies two main routes for AI programming assistants by 2025: IDE enhancement (e.g., GitHub Copilot) and agentic coding (e.g., Claude Code) [2].

Group 2: Doubao-Seed-Code Introduction
- Doubao-Seed-Code, developed by Volcano Engine, aims to address the limitations of existing models by providing a robust programming model designed for complex tasks [2][4].
- The model has shown exceptional performance on various authoritative benchmarks, even surpassing Claude 4.5 Sonnet in some evaluations [6][8].

Group 3: Key Features of Doubao-Seed-Code
- Doubao-Seed-Code offers a native 256K long-context capability, allowing it to handle complex projects that span multiple files and dependencies [10][11].
- It is the first model in China to support visual understanding, enabling it to generate code from UI designs and perform visual comparisons for style fixes and bug fixes [11].

Group 4: Performance Evaluation
- The article outlines a series of practical tests of Doubao-Seed-Code's capabilities, covering task planning, long-context handling, and debugging [18][22].
- In a test that required refactoring a poorly structured Python script, Doubao-Seed-Code completed the task in under three minutes, demonstrating its debugging capability [23][24].

Group 5: Advanced Task Execution
- Doubao-Seed-Code successfully executed a complex task of converting a C++ game to Python, showcasing its long-context and task-planning abilities. The entire process took approximately 40 minutes [26][30].
- The model autonomously planned and executed the project, demonstrating its capacity to handle significant programming challenges [31].

Group 6: Cost and Accessibility
- Doubao-Seed-Code aims to address the pricing and usage limits developers face, offering a subscription service with competitive pricing [48][50].
- The "Coding Plan" subscription provides significant discounts, aiming to lower costs by 62.7% and making the model accessible to a broader range of developers [49][50].

Group 7: Conclusion
- Doubao-Seed-Code is positioned as a powerful alternative in the agentic-coding space, capable of handling complex tasks autonomously and efficiently [52][53].
- The model addresses not only performance but also cost, paving the way for widespread adoption of agentic coding [53][54].
From VLA to RoboOmni: An Omni-Modal Embodied Paradigm That Lets Robots Read the Room and Hear What's Left Unsaid
机器之心· 2025-11-11 08:40
Fudan University, Shanghai Innovation Institute, and the National University of Singapore jointly present RoboOmni, an omni-modal end-to-end manipulation model that unifies the vision, text, audio, and action modalities to coordinate action generation with spoken interaction. The team open-sources 140K real-robot manipulation episodes with speech-vision-text "contextual instructions," leading robots from "passively executing human commands" into a new era of "proactively offering service."

In daily life, humans rarely issue blunt imperative commands such as "put the cup on the table." Far more often, our true intent hides in dialogue, tone of voice, and even ambient sound. "This juice is so sour" really means we want a different drink; at a sudden clap of thunder, we know to close the windows and bring in the laundry; recognizing grandpa's voice, a robot should proactively ask whether he wants his favorite hot tea rather than a cola; and when several people talk at once, it must also work out who is actually giving the instruction.

Now, robots can finally understand this "subtext"!

RoboOmni, jointly released by Fudan, Shanghai Innovation Institute, and the National University of Singapore, not only redefines a new "contextual instruction" paradigm for robot interaction but also, through a unified omni-modal end-to-end architecture, gives robots for the first time the cognitive ability to "read the room."

Paper title: RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Paper link: https://arx ...
SJTU × Ant Group Release DiagGym: A World-Model-Driven Interactive Medical Diagnosis Agent
机器之心· 2025-11-11 08:40
Core Insights
- The article discusses a new training framework for AI diagnostic agents, emphasizing the need for dynamic decision-making in clinical diagnosis rather than reliance on static data [2][6][10].

Group 1: Framework and Model Development
- A novel "environment-agent" training framework is proposed, centered on a medical diagnostic world model called DiagGym, designed to train self-evolving diagnostic agents known as DiagAgent [2][10].
- DiagGym simulates a virtual clinical environment in which diagnostic agents interact with virtual patients, refining their decision-making strategies through continuous feedback [10][14].
- The framework incorporates a comprehensive evaluation benchmark, DiagBench, consisting of 750 cases and 973 physician-developed assessment criteria for evaluating the diagnostic reasoning process [2][12].

Group 2: Training and Evaluation
- DiagAgent is trained in two main phases: supervised fine-tuning on real clinical interaction data, followed by reinforcement learning in the DiagGym environment to enhance decision-making capabilities [19][15].
- Experimental results indicate that DiagAgent significantly outperforms other advanced models, such as DeepSeek and Claude-4, in multi-step diagnostic decision-making [12][25].
- Evaluation metrics cover diagnostic accuracy, quality of examination recommendations, and efficiency in completing a diagnosis; DiagAgent shows a 44.03% improvement in recommendation hit rate and a 9.34% increase in final diagnosis accuracy over other models [25][28].

Group 3: Research Value and Future Prospects
- The research brings AI diagnostics closer to real clinical workflows by moving from static question answering to dynamic strategy learning, enabling agents to actively gather evidence and make assessments [36][41].
- Future extensions may integrate treatment plans and prognostic evaluation into the virtual environment, working toward a comprehensive diagnosis-and-treatment AI system [38][40].
- DiagGym itself can be enriched with additional dimensions such as treatment feedback and cost/safety constraints, yielding a more holistic virtual clinical system [40][41].
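The environment-agent loop described above can be illustrated with a toy sketch: a simulated patient reveals examination results on request, and an agent alternates between ordering exams and committing to a diagnosis, receiving a reward that trades off accuracy against the number of exams ordered. All class names, methods, and the reward rule here are hypothetical illustrations; the actual DiagGym/DiagAgent interfaces are not detailed in this summary.

```python
class DiagGym:
    """Toy virtual-patient environment: reveals exam results on request."""
    def __init__(self, patient_record):
        self.record = patient_record          # hidden ground-truth findings
        self.revealed = {"chief_complaint": patient_record["chief_complaint"]}

    def order_exam(self, exam_name):
        # Return the simulated result of a requested examination.
        result = self.record["exams"].get(exam_name, "unremarkable")
        self.revealed[exam_name] = result
        return result

    def reward(self, diagnosis, num_exams):
        # Reward a correct final diagnosis; penalize extra exams (efficiency).
        correct = float(diagnosis == self.record["diagnosis"])
        return correct - 0.05 * num_exams


def run_episode(agent_policy, env, max_steps=5):
    """Agent alternates between ordering exams and committing to a diagnosis."""
    for step in range(max_steps):
        action = agent_policy(env.revealed)   # 'exam:<name>' or 'dx:<label>'
        if action.startswith("dx:"):
            return env.reward(action[3:], step)
        env.order_exam(action[5:])
    return env.reward("undecided", max_steps)


# Minimal demo with a scripted stand-in for a learned policy.
patient = {
    "chief_complaint": "chest pain",
    "exams": {"ECG": "ST elevation"},
    "diagnosis": "myocardial infarction",
}

def scripted_policy(observed):
    if "ECG" not in observed:
        return "exam:ECG"
    return "dx:myocardial infarction"

print(run_episode(scripted_policy, DiagGym(patient)))  # prints 0.95
```

In an RL setting, the returned reward would drive policy updates, which is what allows the agent to learn multi-step exam-ordering strategies rather than one-shot answers.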
Breaking the VRAM Wall: Saining Xie's Team Proposes CLM, Letting a Single RTX 4090 Handle 100 Million Gaussians
机器之心· 2025-11-11 08:40
Core Insights
- 3D Gaussian Splatting (3DGS) is an emerging method for novel view synthesis that uses a set of posed images to iteratively train a scene representation composed of many anisotropic 3D Gaussians, capturing the scene's appearance and geometry [2][4].
- The CLM system proposed by the team lets 3DGS render large scenes on a single consumer-grade GPU, such as the RTX 4090, by working around GPU memory limitations [6][8].

Group 1: 3DGS Overview
- 3DGS has shown revolutionary application potential in fields such as 3D modeling, digital twins, visual effects (VFX), VR/AR, and robot vision reconstruction (SLAM) [5].
- The quality of images rendered with 3DGS depends on the fidelity of the trained scene representation; larger and more complex scenes require more Gaussians and therefore more memory [5].

Group 2: CLM System Design
- CLM builds on the insight that 3DGS computation is inherently sparse: each training iteration accesses only a small subset of the Gaussians [8][20].
- The system employs a novel offloading strategy that minimizes performance overhead and scales to large scenes by dynamically loading only the needed Gaussians into GPU memory while keeping the rest in CPU memory [8][11].

Group 3: Performance and Efficiency
- With CLM, a large scene requiring 102 million Gaussians can be rendered on a single RTX 4090 while achieving top-tier reconstruction quality [8].
- Each view typically accesses only 0.39% of the Gaussian points, with a maximum of 1.06% for any single view, underscoring how sparse the access pattern is [23].

Group 4: Optimization Techniques
- The team exploits several characteristics unique to 3DGS to sharply reduce the communication overhead of offloading, including precomputing the set of Gaussians each view accesses and leveraging spatial locality to optimize CPU-GPU data transfer [12][17].
- Microbatch scheduling overlaps the access patterns of consecutive batches, raising cache hit rates and cutting redundant data transfers [24][25].

Group 5: Results and Impact
- CLM increases the trainable capacity of 3DGS models by up to 6.1x over a pure-GPU training baseline, enabling larger models that improve scene reconstruction accuracy while lowering communication and offloading overhead [27].
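The sparse offloading idea above can be sketched in a few lines: keep all Gaussians in host (CPU) memory, and for each view copy only the small precomputed subset it actually touches into a bounded "device" buffer, reusing entries that a previous view already loaded. This is a minimal illustration under assumed data structures (dict-as-cache, FIFO eviction, precomputed visibility lists), not the CLM implementation.

```python
import numpy as np

class GaussianStore:
    def __init__(self, n_gaussians, feat_dim=8, cache_size=1000):
        self.host = np.random.rand(n_gaussians, feat_dim)  # stand-in for CPU RAM
        self.cache = {}                                    # stand-in for GPU memory
        self.cache_size = cache_size
        self.transfers = 0                                 # host -> device copies

    def fetch_view(self, visible_ids):
        """Load only the Gaussians a view accesses; reuse cached ones."""
        for i in visible_ids:
            if i in self.cache:
                continue                                   # cache hit: no transfer
            if len(self.cache) >= self.cache_size:
                self.cache.pop(next(iter(self.cache)))     # naive FIFO eviction
            self.cache[i] = self.host[i]
            self.transfers += 1
        return np.stack([self.cache[i] for i in visible_ids])

# Two consecutive views with 50% overlapping visibility: the second view only
# transfers its non-overlapping half, mirroring the cache-hit benefit of
# scheduling microbatches with overlapping access patterns.
store = GaussianStore(n_gaussians=100_000)
view_a = list(range(0, 400))        # ~0.4% of all Gaussians, as in the article
view_b = list(range(200, 600))      # overlaps view_a on ids 200..399
store.fetch_view(view_a)
store.fetch_view(view_b)
print(store.transfers)  # prints 600, not 800: 200 ids were cache hits
```

The transfer counter makes the payoff concrete: ordering views so that consecutive ones share visible Gaussians directly reduces CPU-GPU traffic.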
Fei-Fei Li's Latest Long Essay: AI's Next Decade Is About Building Machines with True Spatial Intelligence
机器之心· 2025-11-10 23:47
Core Insights
- The article emphasizes the importance of spatial intelligence as the next frontier in AI, highlighting its potential to transform fields such as storytelling, creativity, robotics, and scientific discovery [5][6][10].

Summary by Sections

What Is Spatial Intelligence?
- Spatial intelligence is defined as a fundamental aspect of human cognition that enables interaction with the physical world, influencing everyday actions and creative processes [10][13].
- It is essential for tasks ranging from simple activities like parking a car to complex scenarios such as emergency response [10][11].

Importance of Spatial Intelligence
- The article argues that spatial intelligence is crucial for understanding and manipulating the world, serving as a scaffold for human cognition [13][15].
- Current AI technologies, while advanced, still lack the spatial reasoning inherent to humans, limiting their effectiveness in real-world applications [14][15].

Building Spatial Intelligence in AI
- To create AI with spatial intelligence, the essay proposes a new type of generative model, the "world model," which can understand, reason, generate, and interact within complex environments [17][18].
- A world model should possess three core capabilities: it must be generative, multimodal, and interactive [18][19][20].

Challenges Ahead
- Developing world models faces significant challenges, including the need for new training tasks, large-scale data, and innovative model architectures [23][24][25].
- Representing the physical world in AI is far more complex than representing language, necessitating breakthroughs in both technology and theory [21][22].

Applications of Spatial Intelligence
- In creativity, spatial intelligence can enhance storytelling and immersive experiences, allowing creators to build and iterate on 3D worlds more efficiently [32][33].
- In robotics, spatial intelligence is essential for machines to understand and interact with their environments, improving their learning and operational capabilities [34][35][36].
- The potential impact extends to fields like science, medicine, and education, where spatial intelligence can facilitate breakthroughs and enhance learning experiences [38][39][40].

Conclusion
- The article concludes that the pursuit of spatial intelligence in AI represents a significant opportunity to enhance human capabilities and address complex challenges, ultimately benefiting society as a whole [42].
PixelRefer: Taking AI from "Seeing the Big Picture" to "Understanding Every Object"
机器之心· 2025-11-10 23:47
Core Insights
- The article discusses the limitations of current multimodal large models (MLLMs) in achieving the fine-grained, object-level understanding needed for real-world applications such as autonomous driving and medical imaging, highlighting the need for a more detailed visual-understanding framework [2][38].
- PixelRefer is introduced as an innovative solution: a unified spatio-temporal understanding framework capable of fine-grained visual referencing and reasoning at arbitrary granularity, outperforming existing models on several benchmarks [2][38].

Model Overview
- PixelRefer integrates global visual tokens, pixel-level region tokens, and text tokens into a large language model (LLM), maintaining both scene context and object-level reasoning capabilities [16][22].
- A lightweight variant, PixelRefer-Lite, achieves a 4x increase in inference speed and halves memory usage compared with existing models such as DAM-3B [2][33].

Methodology
- The authors propose two frameworks for pixel-level fine-grained understanding, a Vision-Object framework and an Object-Only framework, emphasizing the importance of high-quality pixel-level object representations [15][22].
- A Scale-Adaptive Object Tokenizer (SAOT) is introduced to generate precise and compact object representations, addressing the detail challenges posed by both small and large objects [17][16].

Performance Metrics
- PixelRefer achieves state-of-the-art (SOTA) performance across image-understanding benchmarks, including PACO and DLC-Bench, with notable advantages in reasoning scenarios [28][30].
- On video pixel-level understanding benchmarks, PixelRefer likewise demonstrates superior performance, particularly in video captioning and question answering [29][31].

Applications and Future Directions
- The advances presented by PixelRefer signify a shift toward understanding the dynamic details of the world, with potential applications in autonomous driving, medical imaging, intelligent video editing, and multimodal dialogue systems [38][40].
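The token layout described above (scene-level tokens, per-object region tokens, and text tokens concatenated into one LLM input sequence) can be sketched as follows. The pooling rule, token counts, and shapes are illustrative assumptions standing in for SAOT and the real model; they are not the PixelRefer code.

```python
import numpy as np

def object_tokens_from_mask(feature_map, mask, n_tokens=4):
    """Toy stand-in for a scale-adaptive tokenizer: pool masked features
    into a fixed number of compact object tokens."""
    region = feature_map[mask]                    # (n_pixels, c)
    # Split the region's pixels into n_tokens chunks and mean-pool each one,
    # so small and large objects both yield a fixed-size summary.
    chunks = np.array_split(region, n_tokens)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

def build_llm_input(global_tokens, object_token_list, text_tokens):
    """Concatenate [scene context | per-object tokens | text] for the LLM."""
    return np.concatenate([global_tokens, *object_token_list, text_tokens])

c = 16
feature_map = np.random.rand(24, 24, c)           # visual backbone features
mask = np.zeros((24, 24), dtype=bool)
mask[4:10, 4:10] = True                           # one referenced object

obj = object_tokens_from_mask(feature_map, mask)  # (4, c) object tokens
seq = build_llm_input(
    global_tokens=np.random.rand(64, c),          # coarse scene tokens
    object_token_list=[obj],
    text_tokens=np.random.rand(8, c),             # embedded question
)
print(seq.shape)  # prints (76, 16): 64 scene + 4 object + 8 text tokens
```

The point of the layout is that the LLM attends jointly over scene context and compact object summaries, which is what permits object-level answers without losing the global picture.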
From Perception to Action: Join This Frontier Tech Salon on Embodied Intelligence
机器之心· 2025-11-10 10:42
Core Insights
- The article discusses a significant paradigm shift in artificial intelligence from "perceptual intelligence" to "action intelligence," aiming to equip machines with the ability to understand, decide, and act in the physical world, transitioning from "passive observation" to "active interaction" [2].

Group 1: Technological Advancements
- Continuous iteration of intelligent foundational models is breaking down barriers between perception and decision-making, enabling more human-like interaction capabilities for intelligent agents [2].
- The accumulation and governance of multimodal data are establishing a solid foundation for the practical application of the technology [2].
- Breakthroughs in teleoperation technology are bridging virtual and physical worlds, allowing for precise remote control [2].

Group 2: Event Details
- The "Virtual-Real Resonance: Model X Terminal Technology Salon" will be held on November 14 in Beijing at PAGEONE (Wudaokou) [4][5].
- The event features several key presentations, including topics on world models, open-source tooling for VLA models, and data-driven paths to embodied intelligence [4][5].

Group 3: Featured Speakers
- Mao Jiming, Vice President of Beijing Jiajia Vision Technology Co., has over 16 years of experience in engineering and architecture, focusing on large-scale distributed systems and autonomous-driving simulation technology [7].
- Wang Tiancai, a founding team member of Dexmal, has published over 30 papers at top international conferences and is a core author of notable autonomous-driving algorithms [8].
- Jin Ge, founder of Lingyu Intelligent, has extensive experience in early-stage investment and high-tech entrepreneurship [9].
- Yuan Haoki, head of the Smart Beyond team, specializes in reinforcement learning and embodied intelligence [10].
- Huang Suining, CEO of Beipei Technology, has a rich background in AI and child psychology, focusing on integrating AI technology into child development [11].
- Qian Zhuang, product head at Yingzhi Technology, has a decade of experience in product development and holds over 100 patents [11].
The 2025 Baoshan Intelligent Robot Industry Conference and Carnival Is About to Launch
机器之心· 2025-11-10 04:40
Core Viewpoint
- The article previews the upcoming "2025 Baoshan Intelligent Robot Industry Conference and Carnival," which aims to explore the future development directions and collaborative elements of the intelligent-robot industry amid the global AI wave [2][4].

Event Overview
- The conference will take place November 21-22, 2025, in the Zhihui Bay Science and Technology Park, organized by various governmental and academic institutions [4].
- It is positioned as an annual celebration of the intelligent-robot industry, bringing together key players from academia, industry, and investment to create a new development blueprint for the AI era [4][5].

Collaborative Efforts
- Baoshan District is committed to building a comprehensive support system for the robot industry through deep collaboration among government, industry, academia, application scenarios, and finance [5].
- The event will unveil a three-year robotics action plan and a series of service platforms aimed at providing one-stop solutions for technology breakthroughs, application scenarios, and capital connections [5].

Forum Structure
- The event features a main forum and three thematic forums covering top-level design, embodied-intelligence technology, core components, and the talent ecosystem, providing a comprehensive view of the industry [6][7].
- Notable speakers include academicians and leaders from prominent robotics companies, who will discuss breakthroughs and development paths in intelligent robotics [7].

Project Roadshow
- A project roadshow will focus on early-stage projects with cutting-edge technology and high growth potential, facilitating connections for technology cooperation, application scenarios, and industry funds [9].

Exhibition Highlights
- The exhibition will showcase top products from the intelligent-robot industry, including humanoid robots, legged robots, and various core components, highlighting the sector's technological strength [11].

Interactive Experience
- Attendees will have the opportunity to engage in immersive experiences with various robotics projects, deepening their understanding of the industry's potential and fostering connections with industry leaders [13].
A "3A" Blockbuster! Alibaba's ROLL Team Pushes Full-Stack Collaborative Optimization for RL4LLM, from Infrastructure to Algorithms to Mechanisms
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses the launch of the "3A" collaborative optimization framework by Alibaba's ROLL team, comprising an Async Architecture, Asymmetric PPO, and the Attention Mechanism, aimed at enhancing reinforcement learning for large language models (RL4LLM) [1][2][5].

Group 1: Async Architecture
- ROLL Flash is introduced as a high-performance RL training system that uses asynchronous design to maximize resource utilization and accelerate large-scale RL training [5][11].
- Its core principle is decoupling: fine-grained parallelism and sampling-training decoupling yield a fully pipelined execution of generation, environment interaction, reward calculation, and model training [12][13].
- ROLL Flash demonstrates significant performance improvements across mainstream RL tasks, achieving nearly linear scalability at the hundred-GPU scale [16][25].

Group 2: Asymmetric PPO
- Asymmetric Proximal Policy Optimization (AsyPPO) is a lightweight PPO variant showing that a critic's size does not necessarily correlate with its value-estimation capability [45][48].
- The research indicates that just two small critics suffice to achieve comparable or even superior value estimation, reducing the need for expensive computational resources [51][53].
- AsyPPO introduces two key innovations, diversified micro-critic aggregation and uncertainty-aware policy-loss reconstruction, enhancing training stability and efficiency [55][58].

Group 3: Attention Mechanism
- The article redefines the role of attention in language models, treating it as a structured blueprint that reveals the internal logic of model reasoning [2][64].
- By analyzing attention dynamics, the framework aligns optimization objectives with the model's inherent reasoning rhythm, improving training efficiency and interpretability [67][68].
- The research proposes a refined credit-allocation strategy based on attention signals, making reinforcement learning more effective by focusing on critical reasoning steps [82][86].
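The two AsyPPO ideas summarized above can be sketched numerically: average the value estimates of several small critics, and use their disagreement as an uncertainty signal that masks tokens out of a clipped PPO policy loss. The aggregation rule, the disagreement threshold, and the masking scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def aggregate_values(critic_values):
    """Average per-token value estimates from several lightweight critics."""
    return np.mean(critic_values, axis=0)

def uncertainty_mask(critic_values, threshold=0.5):
    """Keep tokens where the critics roughly agree; drop the rest."""
    disagreement = np.std(critic_values, axis=0)
    return (disagreement < threshold).astype(float)

def masked_policy_loss(advantages, log_ratio, mask, clip=0.2):
    """Clipped PPO surrogate, with uncertain tokens removed from the loss."""
    ratio = np.exp(log_ratio)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip, 1 + clip) * advantages
    per_token = np.minimum(unclipped, clipped)
    return -(per_token * mask).sum() / max(mask.sum(), 1.0)

# Two small critics scoring a 4-token rollout.
critics = np.array([[0.9, 0.2, 0.5, 0.1],
                    [1.1, 0.3, 1.9, 0.2]])
values = aggregate_values(critics)      # per-token value estimate
mask = uncertainty_mask(critics)        # token 2 disagrees -> masked out
advantages = np.array([0.5, -0.2, 0.8, 0.1])
log_ratio = np.zeros(4)                 # fresh policy: ratio = 1 everywhere
loss = masked_policy_loss(advantages, log_ratio, mask)
print(mask, loss)
```

Because the critics are small, running two of them is still far cheaper than one critic sized like the policy, which is the computational argument the summary makes.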
NeurIPS 2025 Spotlight | RobustMerge: A New Paradigm for Merging Parameter-Efficiently Fine-Tuned Multimodal Large Models
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses the challenge of efficiently merging multiple specialized models into a general model amid rapidly advancing AI, highlighting "direction robustness" as the key factor behind failures of parameter-efficient fine-tuning (PEFT) module merging [2][7][10].
- A new solution, RobustMerge, is proposed: a simple and efficient method for model merging with no additional cost, offering significant potential for developers and researchers working on multimodal large models [2][8].

Problem Definition
- The rise of multimodal large models has increased computational demands, making full fine-tuning (FFT) costly and impractical for many users. As a result, parameter-efficient fine-tuning (PEFT), particularly LoRA, has become mainstream, allowing quick adaptation to downstream tasks by updating only a small portion of model parameters [7][8].
- Traditional methods for combining capabilities, such as multi-task learning, face challenges of training cost and data availability, motivating model merging as a more efficient alternative [8][10].

Key Contributions
- RobustMerge addresses the shortcomings of existing PEFT merging methods by identifying direction instability, rather than parameter sign conflicts, as the core issue, paving the way for a new paradigm in LoRA merging [10][41].
- The method employs a two-stage merging strategy: pruning with complementary scaling, followed by cross-task normalization, to enhance the stability of low-rank directions during merging [16][19][23].

Experimental Design and Results
- RobustMerge was tested across multiple benchmarks, including a newly created benchmark, MM-MergeBench, which evaluates performance on both seen and unseen tasks, demonstrating significant improvements in multi-task performance and generalization [28][31].
- The results show that RobustMerge outperforms traditional methods, achieving an average accuracy increase of 3.4% on seen tasks and 4.5% on unseen tasks, showcasing its effectiveness in reducing task interference [31][32].

Practical Applications
- RobustMerge can be applied in various scenarios, including rapid deployment of multi-task models, federated learning, and model editing or style transfer, making it a valuable tool for enterprises looking to build complex AI applications efficiently [44][45].
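The two-stage strategy summarized above can be sketched on toy weight deltas: magnitude-prune each task's LoRA update and rescale the survivors to preserve its overall magnitude (a stand-in for complementary scaling), then normalize across tasks so no single task dominates the merged direction. The pruning ratio and normalization rule are illustrative assumptions, not RobustMerge's exact formulation.

```python
import numpy as np

def prune_and_rescale(delta, keep_ratio=0.5):
    """Keep the largest-magnitude entries; rescale survivors so the pruned
    delta preserves the original delta's overall norm."""
    flat = np.abs(delta).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]        # k-th largest magnitude
    pruned = delta * (np.abs(delta) >= threshold)
    scale = np.linalg.norm(delta) / (np.linalg.norm(pruned) + 1e-8)
    return pruned * scale

def cross_task_normalize(deltas):
    """Scale each task's delta to a common norm so the merged direction is
    not dominated by whichever task has the largest update."""
    norms = [np.linalg.norm(d) for d in deltas]
    target = np.mean(norms)
    return [d * (target / (n + 1e-8)) for d, n in zip(deltas, norms)]

def merge(deltas, keep_ratio=0.5):
    pruned = [prune_and_rescale(d, keep_ratio) for d in deltas]
    balanced = cross_task_normalize(pruned)
    return np.mean(balanced, axis=0)              # single merged update

# Merge three toy task-specific LoRA deltas (stand-ins for B @ A products).
rng = np.random.default_rng(0)
task_deltas = [rng.normal(size=(8, 8)) for _ in range(3)]
merged = merge(task_deltas)
print(merged.shape)  # prints (8, 8)
```

The merged delta would then be added back onto the shared base weights, giving one set of parameters that serves all tasks without any retraining.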