机器之心
Just in: Fei-Fei Li's new spatial intelligence result makes a stunning debut! 3D world generation enters the era of "infinite exploration"
机器之心· 2025-09-17 00:07
Core Viewpoint
- World Labs, founded by Stanford professor Fei-Fei Li, has launched a limited preview of its spatial intelligence model, Marble, which allows users to generate persistent 3D worlds from a single image or text prompt [1][4][11].

Group 1: Product Features
- Marble generates expansive, navigable, and controllable 3D worlds that are free for users to explore [5][13].
- The model supports exporting generated worlds as Gaussian point clouds, facilitating integration into downstream projects via the open-source rendering library Spark [14].
- Users can combine multiple generated results to create larger worlds, enhancing the overall experience [15].

Group 2: Technical Advancements
- The latest model shows improvements in consistency and style adherence, allowing for richer geometric complexity and more complete 3D scenes than previous models [17][20].
- Marble is designed primarily for creating 3D environments rather than individual objects, so it may not suit users interested in personal subjects such as selfies or pets [18].

Group 3: User Experience
- Users can navigate and interact within a consistent 3D world, a core requirement for many visual creators [17].
- The model supports diverse style transformations, enabling users to explore various artistic expressions in their 3D worlds [20][21].

Group 4: Future Potential
- The model encourages users to think beyond room-sized environments, allowing for the assembly of larger, more imaginative spaces [25].
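Gaussian point-cloud exports of this kind typically travel in PLY files. As a rough illustration of what such an export contains, here is a minimal sketch that parses an ASCII PLY with per-point attributes; the attribute set (`x`, `y`, `z`, `opacity`) follows common Gaussian-splatting conventions and is an assumption here, not Marble's or Spark's documented schema.

```python
# Minimal ASCII PLY parser for Gaussian-splat-style point data.
# The property names below are hypothetical conventions, not
# Marble's or Spark's documented export format.

SAMPLE_PLY = """\
ply
format ascii 1.0
element vertex 2
property float x
property float y
property float z
property float opacity
end_header
0.0 1.0 2.0 0.9
-1.5 0.5 3.0 0.4
"""

def parse_ascii_ply(text):
    lines = text.splitlines()
    assert lines[0] == "ply"
    props, n_vertices = [], 0
    i = 1
    while lines[i] != "end_header":
        parts = lines[i].split()
        if parts[0] == "element" and parts[1] == "vertex":
            n_vertices = int(parts[2])           # number of points
        elif parts[0] == "property":
            props.append(parts[2])               # e.g. "x", "opacity"
        i += 1
    points = []
    for row in lines[i + 1 : i + 1 + n_vertices]:
        values = [float(v) for v in row.split()]
        points.append(dict(zip(props, values)))  # one dict per splat
    return points

points = parse_ascii_ply(SAMPLE_PLY)
print(len(points), points[0]["opacity"])  # 2 0.9
```

Real exports usually use the binary PLY variant and carry many more per-point attributes (scales, rotations, spherical-harmonic colors), but the header-then-rows layout is the same.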
High-Order Programs: the last mile taking AI from technically feasible to commercially trustworthy
机器之心· 2025-09-16 11:57
Core Viewpoint
- The article discusses the transition to the "second half" of AI, emphasizing the need for reliability and engineering frameworks to ensure AI applications are trustworthy and effective [1][4][57].

Group 1: Importance of Data and Reliability
- Data is crucial for AI application capabilities, but it does not automatically create value without a reliable processing engine [3][4].
- Reliability encompasses various metrics, including accuracy, speed, and the ability to avoid "hallucinations," the misleading outputs generated by AI models [4][8].

Group 2: From Model Competition to Engineering Competition
- The shift in focus from "what AI can do" to "how to make AI do it correctly" marks a significant change in the industry [4][5].
- Frameworks such as LangChain and DSPy are emerging to address these challenges, but they often lack robust reliability guarantees [4][9].

Group 3: High-Order Programs (HOP)
- HOP is introduced as a new paradigm that integrates engineering principles into AI applications, aiming to mitigate hallucinations and enhance reliability [6][20].
- HOP is not a new programming language but a framework that combines symbolic logic with neural networks to create a reliable control system for AI [22][25].

Group 4: Mechanisms of HOP
- HOP uses a structured approach to express business logic in programming languages, ensuring clarity and reducing ambiguity [23].
- The HopLogic execution framework within HOP breaks complex tasks down into verifiable steps, raising reliability above 99% in professional applications [28][37].

Group 5: Practical Applications and Industry Impact
- HOP has demonstrated its potential in sectors such as finance and healthcare, significantly improving reliability and reducing development time [39][43].
- The framework allows agile iteration without extensive model retraining, making it a cost-effective solution for businesses [52][53].

Group 6: Future of AI Engineering
- The article concludes that the future of AI will depend on high-quality data and reliable engineering frameworks, with HOP serving as a key driver of scalable professional productivity [54][64].
- Establishing a reliable framework and developing high-quality data will enable AI to evolve from a supporting role into a core driver of industry transformation [64][65].
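The article does not publish HOP's internals, but the pattern it describes, decomposing a task into steps whose outputs are symbolically verified, can be sketched generically. In the toy pipeline below, every step pairs a worker (standing in for a model call) with a validator, and an output only propagates if the validator accepts it. All names are hypothetical; this is a sketch of the pattern, not HOP's implementation.

```python
# Toy "verifiable steps" pipeline: each stage pairs a worker function
# (standing in for a model call) with a symbolic validator; a step's
# output only propagates if the validator accepts it.
# Hypothetical illustration, not HOP's actual implementation.

class StepFailed(Exception):
    pass

def run_pipeline(value, steps):
    for name, worker, validator in steps:
        candidate = worker(value)
        if not validator(candidate):
            raise StepFailed(f"step {name!r} rejected output {candidate!r}")
        value = candidate  # only validated results move forward
    return value

# Example: extract a percentage from text, then bound-check it.
steps = [
    ("extract", lambda text: float(text.strip().rstrip("%")),
                lambda x: isinstance(x, float)),
    ("bound",   lambda x: x / 100.0,
                lambda x: 0.0 <= x <= 1.0),
]

print(run_pipeline(" 42% ", steps))  # 0.42
```

The point of the structure is that a hallucinated intermediate (say, "250%") fails its validator and halts the pipeline with a named step, rather than silently contaminating downstream results.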
Award at a top networking conference! Huawei proposes a host-network collaborative RDMA transport architecture to solve scalability problems in large-scale AI cluster networks
机器之心· 2025-09-16 11:57
Core Viewpoint
- The article highlights the recognition of the DCP (Data Control Partitioning) research by Huawei's Network Technology Lab and the Hong Kong University of Science and Technology at the ACM SIGCOMM 2025 conference, emphasizing its significance in addressing scalability challenges in AI cluster networks [2][4].

Group 1: Conference Overview
- The ACM SIGCOMM 2025 conference, a premier event in the field of computer networking, concluded in Portugal, featuring cutting-edge technology discussions and attracting global participation from major OTT and networking equipment manufacturers [2][4].
- Out of 463 submissions, only 75 papers were accepted, an acceptance rate of 16.2%, with only three papers receiving awards [4].

Group 2: DCP Technology
- DCP addresses the scalability challenges posed by the rapid growth of AI models and the increasing demand for computational power, which necessitates larger and more complex network configurations [6][7].
- DCP proposes a novel RDMA (Remote Direct Memory Access) transmission architecture that allows lossy transmission of data while ensuring lossless transmission of control information, significantly reducing buffer dependency and eliminating issues such as head-of-line blocking and deadlocks [8][10].

Group 3: Experimental Results
- Prototype testing of DCP demonstrated a 1.6× to 72× improvement in packet recovery efficiency compared to a Mellanox RNIC, and a 42% reduction in completion time for AI workloads [17].
- Simulation results indicated that DCP reduced job completion time (JCT) by 38% and 45% in AI traffic scenarios compared to existing solutions, and achieved a 95% reduction in tail completion time in long-distance scenarios [20][22].

Group 4: Future Directions
- Huawei's Network Technology Lab is also researching AI-Native Transport (ANT), which incorporates features from DCP to enhance transmission capabilities for AI computing networks, focusing on high throughput, efficiency, and scalability [22].
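The core split DCP makes, lossy data delivery plus lossless control signaling, can be illustrated with a toy simulation: data packets may drop each round, while the receiver's report of missing sequence numbers is assumed to arrive reliably and drives selective retransmission. This is a schematic sketch of the pattern only, not Huawei's protocol.

```python
# Toy model of the data/control split: data packets travel over a
# lossy channel, while control messages (lists of missing sequence
# numbers) are assumed delivered reliably. Receiver-driven selective
# retransmission recovers the drops without requiring lossless
# buffering for the data path. Purely illustrative, not DCP itself.

import random

def transfer(n_packets, loss_rate, seed=0):
    rng = random.Random(seed)
    received = set()
    pending = set(range(n_packets))
    rounds = 0
    while pending:
        rounds += 1
        # Lossy data channel: each pending packet may be dropped.
        for seq in list(pending):
            if rng.random() >= loss_rate:
                received.add(seq)
        # Lossless control channel: receiver reports exactly what is
        # missing, so only those packets are resent next round.
        pending = set(range(n_packets)) - received
    return rounds

rounds = transfer(n_packets=1000, loss_rate=0.05)
print(f"all packets delivered after {rounds} round(s)")
```

Because only the (small) control report must be lossless, the retransmission state shrinks geometrically each round, which is the intuition behind trading buffer headroom for selective recovery.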
Embodied intelligence capabilities are soaring, but safety lags far behind? The first framework and roadmap for safe and trustworthy EAI is out!
机器之心· 2025-09-16 11:57
In recent years, embodied artificial intelligence (Embodied Artificial Intelligence, EAI), exemplified by humanoid robots and autonomous driving, has been developing at an unprecedented pace, striding from the digital world into physical reality. Yet when the risk of a single error is no longer a garbled line on a screen but potential physical harm in the real world, an urgent question confronts us: how do we ensure that these increasingly powerful embodied agents are safe and trustworthy?

The reality is that capability and safety, two tracks that should advance in step, are showing a worrying "decoupling." As shown in Figure 1, industry foundation models are iterating rapidly in capability while generally neglecting matching safety-alignment mechanisms; academia has explored the problem, but its findings are often scattered and unsystematic.

To bridge this critical gap, a research team from the Shanghai AI Laboratory and East China Normal University wrote this position paper, aiming to establish a systematic theoretical framework and development blueprint for the emerging field of "safe and trustworthy embodied AI" and to move the field from fragmented research toward holistic construction.

Core contributions of this article

Figure 1: The current state of EAI capability versus safety. Industry products (blue) are improving rapidly in capability while safety lags; academic research (green), though exploratory, remains scattered. The authors' work aims to chart a path toward the ideal "safe and trustworthy EAI" (orange line).

Unlike traditional survey articles, the authors not only review ...
What did the heavyweights say at the Bund Summit session "Embodied Intelligence: From Generalization to Action, Reshaping the Future of Industry"?
机器之心· 2025-09-16 08:37
Core Viewpoint
- The article discusses the future of AI and embodied intelligence, emphasizing the need for disruptive innovation to enable generalized action capabilities and the transition from technical feasibility to commercial success [2][4].

Group 1: Embodied Intelligence Development
- The concept of embodied intelligence has evolved from simply giving machines a physical body to creating immersive perception processes [6].
- Current challenges in the field include data bottlenecks, which can be addressed through the establishment of training grounds that enhance robustness and generalization capabilities [7].
- The industry is witnessing a surge in the construction of training grounds, which offer benefits such as cost reduction, safety simulation, and unified standards [7].

Group 2: Data Collection and Utilization
- Training grounds are described as the new data factories of the AI era, crucial for collecting data to train embodied intelligence models [8][10].
- The development paradigm has shifted to a model where data collection occurs after robot development, emphasizing the importance of large datasets for effective training [10][11].
- Synthetic data is highlighted as a viable answer to the difficulty of obtaining real-world data, allowing for scalable and controllable training processes [18][19].

Group 3: Future Prospects and Challenges
- The industry is exploring various paths for embodied intelligence, including the integration of real-world and simulation data to enhance model performance [30][31].
- Discussions on humanoid robots suggest that while they may not be the only form of embodied intelligence, their development is crucial for achieving broader applications [34][35].
- The integration of embodied intelligence into daily life is projected to be gradual, with significant advancements expected over the next 5 to 10 years [38].

Group 4: Industry Collaboration and Ecosystem
- The need for collaboration across the industry is emphasized, with calls to establish a robust ecosystem supporting the development of embodied intelligence [48][49].
- Stakeholders stress the importance of integrating hardware and software capabilities to enhance the overall effectiveness of embodied intelligence solutions [47][49].
- The article concludes with a vision of a future in which embodied intelligence significantly transforms industries and daily life, driven by collective efforts from academia and industry [51].
From few-shot to thousand-shot! MachineLearningLM fits large models' in-context learning with a "machine learning engine"
机器之心· 2025-09-16 04:01
Core Insights
- The article discusses the limitations of large language models (LLMs) in in-context learning (ICL) and introduces MachineLearningLM, a new framework that significantly enhances LLM performance on various classification tasks without requiring downstream fine-tuning [2][7][22].

Group 1: Limitations of Existing LLMs
- Despite their extensive world knowledge and reasoning capabilities, LLMs struggle with ICL when faced with numerous examples, often plateauing in performance and remaining sensitive to example order and label biases [2].
- Previous methods relied on limited real task data, which restricted models' ability to generalize to new tasks [7].

Group 2: Innovations of MachineLearningLM
- MachineLearningLM introduces a continued pre-training framework that lets LLMs learn from thousands of examples directly through ICL, achieving superior accuracy on binary and multi-class tasks across various fields [2][22].
- The framework uses a synthetic dataset of over 3 million tasks generated via structural causal models (SCMs), ensuring no overlap with downstream evaluation sets and thus a fair assessment of model generalization [7][11].

Group 3: Methodology Enhancements
- A two-tier filtering mechanism based on Random Forest models improves training stability and interpretability, addressing inconsistent task quality [11][12].
- MachineLearningLM employs efficient context-example encoding strategies, such as compact table formats instead of verbose natural-language descriptions, improving data handling and inference efficiency [15][20].

Group 4: Performance Metrics
- The model's performance improves continuously as the number of examples grows, with average accuracy surpassing benchmark models such as GPT-5-mini by roughly 13 to 16 percentage points on various classification tasks [22][24].
- On MMLU benchmarks, MachineLearningLM retains its original conversational and reasoning capabilities while achieving competitive zero-shot and few-shot accuracy [24][25].

Group 5: Application Potential
- Its advances in many-shot in-context learning and numerical modeling position MachineLearningLM for broader applications in finance, healthcare, and scientific computing [26][28].
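The compact-encoding idea can be made concrete by serializing the same labeled examples two ways: as verbose natural-language sentences and as a compact delimited table. The exact format MachineLearningLM uses is not reproduced here; this sketch only illustrates why the table form packs more examples into the same context budget.

```python
# Same labeled examples, two serializations. The field names and the
# "|" delimiter are illustrative choices, not MachineLearningLM's
# actual prompt format.

examples = [
    {"age": 34, "income": 72000, "label": "approve"},
    {"age": 19, "income": 12000, "label": "deny"},
]

def verbose(rows):
    # One full sentence per example, as naive prompting might do.
    return "\n".join(
        f"A customer aged {r['age']} with an income of {r['income']} "
        f"was given the decision: {r['label']}." for r in rows
    )

def compact(rows):
    # Header once, then one delimited row per example.
    header = "age|income|label"
    body = "\n".join(f"{r['age']}|{r['income']}|{r['label']}" for r in rows)
    return header + "\n" + body

v, c = verbose(examples), compact(examples)
print(len(v), len(c))  # the compact table is markedly shorter
```

The saving compounds with scale: at thousands of in-context examples, shaving tens of characters per example is what makes thousand-shot prompts fit in a context window at all.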
Who says the scaling law has hit its ceiling? New research: small per-step improvements bring exponential growth
机器之心· 2025-09-16 04:01
Core Viewpoint
- The article discusses the ongoing debate over diminishing returns from scaling AI models, particularly large language models (LLMs). It presents a new perspective: although single-step accuracy now improves more slowly, these incremental gains can compound into exponential growth in the length of tasks a model can complete, which may hold greater economic value in real-world applications [1][3].

Group 1: Scaling Law and Economic Value
- While metrics such as test loss may show diminishing returns, the real-world value of LLMs often comes from their ability to complete longer tasks; larger models compound small improvements in single-step accuracy into exponential increases in task length [3][6].
- The paper, "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs," argues that an AI agent's economic value derives from the length of the tasks it can complete, rather than from short-task benchmarks that may suggest progress has stagnated [5][19].

Group 2: Long-Horizon Execution Challenges
- Long-horizon task execution has historically been a significant weakness of deep learning models; while LLMs have improved at complex reasoning, they still struggle to execute long tasks reliably [6][11].
- The authors argue that failures in long-horizon execution are often misattributed to reasoning or planning deficiencies, when execution itself remains a critical and under-researched challenge [7][22].

Group 3: Self-Conditioning Effect
- The study identifies a self-conditioning effect in which the error rate rises as a long task proceeds, earlier mistakes compounding into later ones; this contrasts with human performance, where practice typically brings improvement [9][30].
- Larger models do not necessarily mitigate the self-conditioning effect, which can degrade performance over extended tasks [29][32].

Group 4: Impact of Thinking Models
- Recent thinking models can correct for self-conditioning, allowing significantly longer task execution in a single round; the GPT-5 thinking version, for instance, can execute over 1,000 steps, far surpassing competitors [10][36].
- The research emphasizes reasoning before action: models that use thinking chains execute longer tasks better than those that do not [36][37].

Group 5: Experimental Insights
- Experiments show that increasing model size significantly raises the number of rounds a model can execute successfully, demonstrating a clear scaling trend [27][28].
- While larger models improve task execution, self-conditioning remains a challenge and a critical area for future research [29][37].
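The compounding argument can be made concrete with a toy model: if each step succeeds independently with probability p, an H-step task succeeds with probability p^H, so the longest horizon sustainable at a 50% success rate is H(p) = ln(0.5)/ln(p). The paper's exact model may differ (and self-conditioning violates the independence assumption); this only shows how small per-step gains translate into outsized horizon growth.

```python
import math

# Toy model: with independent per-step success probability p, an
# H-step task succeeds with probability p**H, so the longest horizon
# completed at a 50% success rate is H(p) = ln(0.5) / ln(p).
# Illustrative assumption, not the paper's exact formulation.

def horizon_at_half(p):
    return math.log(0.5) / math.log(p)

for p in (0.90, 0.99, 0.999):
    print(f"p = {p:.3f}  ->  ~{horizon_at_half(p):,.0f} steps")
```

Under this toy model, cutting the per-step error rate from 1% to 0.1% stretches the 50%-success horizon from roughly 69 steps to roughly 693, a tenfold gain in task length from a one-percentage-point improvement in step accuracy.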
Just in: OpenAI releases GPT-5-Codex, which can work independently for over 7 hours and review and refactor large projects
机器之心· 2025-09-16 00:22
Core Viewpoint
- OpenAI has launched GPT-5-Codex, a model optimized for programming tasks, enhancing software engineering capabilities and code review processes [1][3][4].

Group 1: GPT-5-Codex Features
- GPT-5-Codex is designed for real software engineering tasks, capable of quick responses in interactive sessions and of independently handling complex tasks [1][8].
- The model has been integrated into all Codex use cases, including the Codex CLI, IDE extensions, web, mobile, and GitHub code reviews [3][4].
- OpenAI CEO Sam Altman reported that within two and a half hours of launch, GPT-5-Codex accounted for approximately 40% of Codex traffic, with expectations that it will become the main traffic source [3].

Group 2: Performance Improvements
- GPT-5-Codex shows superior performance on software engineering benchmarks, outperforming GPT-5 in accuracy on SWE-bench Verified and on code refactoring tasks [8][10].
- The model can dynamically adjust its thinking time based on task complexity, allowing it to work independently for over 7 hours on complex tasks [11][12].
- In user interactions, GPT-5-Codex consumes 93.7% fewer tokens than GPT-5 on the least complex requests, while investing more time in complex tasks [12].

Group 3: Code Review Capabilities
- GPT-5-Codex has been specifically trained for code review and can identify critical vulnerabilities and provide focused feedback [14][27].
- The model was evaluated on recent commits from popular open-source projects, demonstrating higher accuracy in review comments than human engineers [14].

Group 4: Integration and Usability
- The Codex CLI and IDE plugins have been redesigned for better integration into developers' workflows, allowing seamless context switching between local and cloud tasks [19][20].
- The new GitHub integration lets users assign tasks to Codex without leaving their editing environment, enhancing productivity [23][24].
- Codex can now process images and screenshots, improving its ability to understand design specifications and UI issues [23][25].

Group 5: Security and Safety Measures
- OpenAI has implemented safety measures against potential misuse of Codex, including sandboxed environments and permission requests for risky operations [28][34].
- Developers are encouraged to review Codex's outputs before deployment; it serves as an additional reviewer rather than a replacement for human oversight [29][30].

Group 6: Pricing and Availability
- GPT-5-Codex is included in various ChatGPT subscription plans, including Plus, Pro, Business, Edu, and Enterprise [32][36].
- OpenAI plans to make GPT-5-Codex available through the API soon, with additional purchasing options for Business and Enterprise plans [36].
New SOTA in multimodal bug repair: TUM's GUIRepair takes first place on the SWE-bench Multimodal leaderboard
机器之心· 2025-09-16 00:22
Automatically repairing real-world software defects is a long-standing goal of the automated program repair research community. However, automatically resolving visual software defects remains an underexplored area. Recently, with the SWE-bench team's release of the latest multimodal issue-repair benchmark, SWE-bench Multimodal, multimodal issue repair has drawn wide attention from researchers, and effectively solving such multimodal problems poses a key challenge to existing repair systems.

To address the multimodal repair scenario, the Software Engineering & AI team at the Technical University of Munich presents a new research result: GUIRepair — "Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Repair." The work has reached first place on the SWE-bench Multimodal leaderboard, opening a promising path for multimodal automated software repair. The paper has been accepted by ASE 2025, a top academic conference in software engineering.

Research motivation: why study "visual software issues"?

In software engineering, automated program repair (Automated Program Re ...
From "lip-syncing" to "performing": the technology behind Kling AI's newly evolved digital human is now public
机器之心· 2025-09-15 12:19
Core Viewpoint
- The article discusses the advances made by Kuaishou's Keling (Kling) team toward a new digital-human generation paradigm: the Kling-Avatar project enables expressive, natural performances in long videos, moving beyond simple lip-syncing to full-body expression and emotional engagement [2][31].

Group 1: Technology and Framework
- Kling-Avatar uses a two-stage generative framework powered by a multimodal large language model, transforming audio, visual, and textual inputs into coherent storylines for video generation [6][10].
- A multimodal director module organizes inputs into a structured narrative: it extracts voice content and emotional trajectories from audio, identifies human features and scene elements from images, and integrates user text prompts into actions and emotional expressions [8][10].
- The system first generates a blueprint video that fixes the overall rhythm, style, and key expression nodes, which is then used to create high-quality sub-segment videos [12][28].

Group 2: Data and Training
- The Keling team collected thousands of hours of high-quality video from sources including speeches and dialogues to train multiple expert models for assessing video quality along several dimensions [14].
- A benchmark of 375 reference image-audio-text prompt pairs was created to evaluate digital-human video generation methods, providing a challenging test of multimodal instruction following [14][23].

Group 3: Performance and Results
- In comparative evaluation against advanced products such as OmniHuman-1 and HeyGen, Kling-Avatar achieved higher scores in overall effectiveness, lip-sync accuracy, visual quality, control response, and identity consistency [16][24].
- Generated lip movements were highly synchronized with the audio, and facial expressions adapted naturally to vocal variation, even on complex phonetic sounds [25][26].
- Kling-Avatar generates long videos efficiently: it can produce multiple segments in parallel from a single blueprint video while maintaining quality and coherence throughout [28].

Group 4: Future Directions
- The Keling team aims to continue exploring high-resolution video generation, fine-grained motion control, and complex multi-turn instruction understanding, striving to give digital humans a genuine and captivating presence [31].
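The blueprint-then-segments scheme can be caricatured in a few lines: fix the key expression nodes first, then generate each sub-segment between consecutive nodes independently and in parallel. The node names and the thread-pool parallelism below are illustrative stand-ins, not Kling-Avatar's actual pipeline.

```python
# Toy sketch of the two-stage idea: a short "blueprint" fixes key
# nodes first, then each sub-segment is generated independently (here,
# in parallel threads) between consecutive blueprint nodes, so long
# videos need not be produced strictly frame-by-frame.
# Illustrative only; not Kling-Avatar's implementation.

from concurrent.futures import ThreadPoolExecutor

blueprint = ["smile", "turn_head", "raise_hand", "wave"]  # key expression nodes

def generate_segment(start, end):
    # Stand-in for a video-model call that fills in between two nodes.
    return f"[{start} -> {end}]"

pairs = list(zip(blueprint, blueprint[1:]))
with ThreadPoolExecutor() as pool:
    # map preserves blueprint order, so segments concatenate coherently.
    segments = list(pool.map(lambda p: generate_segment(*p), pairs))

video = " ".join(segments)
print(video)
```

Because every segment is anchored to shared blueprint nodes at both ends, parallel generation does not sacrifice the global rhythm the blueprint encodes, which is the coherence property the article attributes to the design.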