SenseTime Debuts the "Wuneng" Embodied Intelligence Platform
Group 1
- Perception, navigation, and interaction are the three core capabilities of embodied intelligence, with perception being the foundation for machines to explore the real world [2]
- The "Wuneng" embodied intelligence platform by SenseTime integrates advanced visual AI technology to provide recognition and understanding capabilities for various hardware terminals [2]
- Navigation is described as the "skeleton" that lets machines act in the real world, with SenseTime's technology enabling precise path planning and navigation for robots and other devices [2]

Group 2
- The "Wuneng" platform allows robots to interact with the real world, showcasing capabilities such as warmth, depth, long memory, and stability [2]
- SenseTime, in collaboration with various domestic partners, launched the "SenseTime Computing Power Mall" to provide flexible and autonomous domestic computing power options [3]
- The "SenseTime Computing Power Mall" aims to lower the barriers to AI adoption and promote the independent, controllable development of China's AI industry [3]
LeCun's Team Builds a Video World Model to Challenge NVIDIA's COSMOS
机器之心· 2025-07-29 09:58
Core Viewpoint
- The article discusses the development and advantages of a new video world model called DINO-world, which aims to improve the efficiency and effectiveness of predicting future frames across diverse environments, in the context of artificial intelligence and machine learning [9][10]

Data Challenges
- Acquiring large-scale, high-quality video datasets is costly, especially when action annotations are required; current successful applications of world models are limited to specific fields such as autonomous driving and video games [5]
- Accurately modeling physical laws and behaviors in unconstrained, partially observable environments remains a significant challenge, even over short time scales; advanced pixel-based generative models consume enormous computational resources, with training budgets reaching up to 22 million GPU hours for models like COSMOS [6]

Model Development
- DINO-world uses a frozen visual encoder (DINOv2) to pre-train the video world model in a latent space, followed by fine-tuning on action data for planning and control [9]
- The DINO-world architecture consumes significantly fewer resources during both training and inference than current state-of-the-art models [10]

Training and Evaluation
- DINO-world was trained on a large dataset of roughly 60 million uncurated web videos, enabling it to learn transferable features across domains [11]
- In the VSPW segmentation forecasting task, DINO-world achieved a mean Intersection over Union (mIoU) improvement of 6.3% when predicting future frames, outperforming the second-best model [13]

Methodology
- The frame encoder does not model pixels directly but operates on latent representations of video patches, which significantly lowers the computational cost of training the predictor [19]
- The training objective is "next frame prediction," which allows efficient parallelization and focuses the loss on the most relevant tokens (a minimal sketch of this setup follows the summary) [27]

Action-Conditioned Fine-Tuning
- DINO-world can be adapted to action-conditioned tasks by adding an action module that updates the query vector based on the corresponding actions; this module can be trained on a small dataset of action-conditioned trajectories [30][33]

Experimental Results
- DINO-world demonstrated superior performance on dense prediction tasks across datasets including Cityscapes, VSPW, and KITTI, validating the effectiveness of the proposed paradigm [37][38]
- On intuitive physics tests, the model showed a strong understanding of physical behavior, comparable to larger models such as V-JEPA [40][41]

Planning Evaluation
- The action-conditioned model was trained on offline trajectories and showed significant gains over models trained from scratch, particularly in more complex environments [44]
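The "next frame prediction" objective in a frozen latent space is straightforward to sketch. Below is a minimal, hypothetical Python/PyTorch illustration of the idea described above, not DINO-world's actual code: the `encoder` (standing in for frozen DINOv2) is never updated, a block-causal transformer regresses the next frame's patch tokens, and all names, layer sizes, and the loss choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block_causal_mask(t: int, p: int, device) -> torch.Tensor:
    """Attention mask over t*p flattened tokens: a token in frame f may
    attend to every token in frames <= f, and to nothing later."""
    frame = torch.arange(t * p, device=device) // p
    allowed = frame[None, :] <= frame[:, None]
    return torch.where(allowed, 0.0, float("-inf"))

class LatentFramePredictor(nn.Module):
    """Block-causal transformer over patch tokens (sizes are assumptions)."""
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, patches_per_frame: int) -> torch.Tensor:
        # tokens: (batch, frames * patches_per_frame, dim), frame-major order.
        t = tokens.shape[1] // patches_per_frame
        mask = block_causal_mask(t, patches_per_frame, tokens.device)
        return self.head(self.backbone(tokens, mask=mask))

def training_step(encoder, predictor, optimizer, video):
    """One latent next-frame-prediction step.

    video: (batch, time, C, H, W). `encoder` is any frozen callable that
    returns (N, patches, dim) patch tokens; only `predictor` is trained,
    which is what keeps this cheap relative to pixel-space generation.
    """
    b, t = video.shape[:2]
    with torch.no_grad():  # the encoder stays frozen throughout
        z = encoder(video.flatten(0, 1))          # (b*t, patches, dim)
        z = z.view(b, t, z.shape[1], z.shape[2])  # (b, t, patches, dim)
    p = z.shape[2]
    context = z[:, :-1].flatten(1, 2)  # frames 0..t-2 as input
    target = z[:, 1:].flatten(1, 2)    # frames 1..t-1 as regression target
    pred = predictor(context, patches_per_frame=p)
    loss = F.smooth_l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because targets are just the encoder's own latents shifted by one frame, every frame in a clip supplies a training signal in a single forward pass, which is where the efficient parallelization mentioned above comes from.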
A Roundup of Work Combining LLMs with Reinforcement Learning and World Models for Embodied AI
具身智能之心· 2025-07-29 06:15
Core Viewpoint
- The article surveys recent advances in embodied intelligence, focusing on the integration of large language models (LLMs) with reinforcement learning and world models, and highlights several notable research papers from 2024 [2][3]

Group 1: UniSim
- UniSim aims to learn a general real-world interactive simulator through generative modeling, showing that naturally occurring datasets can provide complementary strengths for learning such a simulator [3]
- The research demonstrates that combining diverse datasets enables simulation of both high-level commands and low-level controls, supporting zero-shot application in real-world scenarios [3]

Group 2: Robust Agents
- The study from Google DeepMind argues that causal reasoning is essential for robust and general AI, concluding that agents capable of satisfying regret bounds must learn approximate causal models [5]
- This finding has significant implications for transfer learning and causal inference [5]

Group 3: MAMBA
- MAMBA introduces an efficient world-model approach for meta-reinforcement learning, addressing the sample-efficiency problems prevalent in current methods [8]
- The framework achieves up to 15x better sample efficiency on high-dimensional tasks [8]

Group 4: EMMA
- EMMA uses an LLM trained in a text-based world to guide the training of a visual world agent, improving its ability to interact with dynamic environments [10]
- The approach yields a 20%-70% improvement in success rate across diverse tasks compared with existing VLM agents [10]

Group 5: Text2Reward
- The Text2Reward framework uses LLMs to automatically generate dense reward functions, addressing the difficulty of reward-function design in reinforcement learning (a schematic sketch follows this summary) [13][14]
- The method outperforms baselines on 13 of 17 tasks and achieves over 94% success on novel motion behaviors [14]

Group 6: Online Continual Learning
- The research proposes two frameworks for continual learning in interactive instruction-following agents, emphasizing that agents should learn incrementally as they explore their environments [17][18]
- A confidence-aware moving-average mechanism is introduced to update parameters without relying on task-boundary information [18]

Group 7: AMAGO
- AMAGO is a scalable in-context reinforcement learning framework that tackles generalization, long-term memory, and meta-learning [21]
- The framework trains long-sequence transformers in parallel, improving scalability and performance on complex tasks [21]

Group 8: PDDL-based Planning
- The study presents a novel paradigm for task planning with pre-trained LLMs, building explicit world models via PDDL [22][23]
- The framework greatly reduces the need for human intervention by letting LLMs translate between PDDL and natural language, enabling efficient model correction [23]
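To make the Text2Reward idea in Group 5 concrete, here is a schematic Python sketch. The `query_llm` helper, the prompt template, and the observation format are all invented for illustration; the actual framework's prompts, execution-feedback loop, and safety checks are more elaborate.

```python
import numpy as np

PROMPT = """You are writing a dense reward function for a robot task.
Task: {task}
Observation dict fields: {fields}
Reply with only a Python function `compute_reward(obs) -> float`."""

def query_llm(prompt: str) -> str:
    """Placeholder for a chat-model API call; not a real client."""
    raise NotImplementedError

def synthesize_reward(task: str, fields: list[str]):
    """Ask the LLM for reward code and exec it into a callable.

    Real systems sandbox this step and refine the code from rollout
    feedback; this sketch omits both for brevity."""
    code = query_llm(PROMPT.format(task=task, fields=", ".join(fields)))
    scope = {"np": np}
    exec(code, scope)  # assumes trusted output; never exec untrusted code like this
    return scope["compute_reward"]

# The kind of function the LLM might emit for "reach the cube":
# dense distance shaping plus a grasp bonus (entirely hypothetical).
def compute_reward(obs) -> float:
    dist = float(np.linalg.norm(obs["gripper_pos"] - obs["cube_pos"]))
    reaching = 1.0 - np.tanh(10.0 * dist)  # dense term, near 1 when close
    bonus = 0.25 if obs["is_grasped"] else 0.0
    return float(reaching + bonus)
```

The point of generating code rather than a scalar critic is that the reward is dense, inspectable, and editable, which is what makes iterative correction with an LLM practical.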
WAIC 2025 Observations: The Compute Race Moves Up a Dimension as Models Search for Paths to Deployment
经济观察报· 2025-07-28 13:36
Core Insights
- The 2025 World Artificial Intelligence Conference (WAIC) showed a shift in focus from raw technical parameters to practical applications and commercial value in AI technology [2][14]
- Competition in computing power is evolving into a comprehensive systems-engineering challenge spanning performance, compatibility, storage, and energy efficiency [4][10]
- AI companies are increasingly integrating their models with real-world applications to unlock new data sources and enhance AI capabilities [15][16]

Computing Power Infrastructure
- Companies such as Huawei and China Digital are pushing the limits of computing power, with Huawei's Atlas 900 A3 SuperPoD reaching a performance of 300 PFLOPS [2][4]
- The financial sector is backing AI infrastructure; for example, Chip Xin Leasing has invested 8 billion yuan in AI-related projects [4]
- Demand for private deployment of large models is rising on data-security grounds, signaling a shift in market needs [5][6]

Model and Application Development
- AI model developers are focusing on deep integration with industry scenarios to create real business value, moving beyond pure technology showcases [14][17]
- Companies such as StepFun (阶跃星辰) are launching new models aimed at cost reduction and efficiency gains, collaborating with multiple chip manufacturers to improve compatibility [17][18]
- Data storage and management are growing in importance, with companies such as Dawning Storage tackling challenges in data accessibility and efficiency [8][9]

AI in Creative Industries
- AI-generated content (AIGC) is transforming creative workflows, with companies such as Digital Domain (数字王国) introducing platforms that streamline content creation [20][21]
- AI is positioned as a "super assistant" for creators, boosting productivity while letting them focus on core creative work [21]

Consumer-Focused AI Products
- New AI products, such as the TicNote AI recording pen, are being built for individual users, packaging complex AI capabilities in user-friendly form [23]
- The overarching goal of these advances is to contribute to real GDP growth across society, industries, and nations [24]
Just Got Notice That My Contract Won't Be Renewed...
自动驾驶之心· 2025-07-28 13:21
Core Viewpoint
- The autonomous driving industry faces significant profitability challenges, with even leading companies struggling to achieve stable profits due to high operating costs and regulatory constraints [3][4]

Group 1: Industry Challenges
- Because the technology is complex and costly to deploy, traditional solutions (such as human labor) remain more cost-effective in certain scenarios [2][4]
- The job market for autonomous driving has cooled compared with previous years, with noticeably fewer openings, especially for L4 positions, leading to fiercer competition [5][6]
- The industry's profitability model is still unclear, and companies are under heavy survival pressure [2][3]

Group 2: Job Market Insights
- Talent demand in the sector has shifted: current hiring requires not only solid engineering skills but also experience with mass production and practical deployment [6][8]
- Openings are fewer than in previous years, and candidate requirements have become stricter and more pragmatic [5][6]

Group 3: Specific Applications and Opportunities
- Certain niche applications, such as logistics in ports, mines, and campuses, are more mature but face cost-effectiveness challenges and limited market size [4]
- Practitioners are encouraged to explore opportunities in adjacent fields, such as robotics and industrial automation, as the autonomous driving sector continues to evolve [8]
WAIC 2025 Opens in Shanghai as SenseTime's Upgraded "Jueying Kaiwu" World Model Makes Its Debut
Core Insights
- The 2025 World Artificial Intelligence Conference (WAIC 2025) opened in Shanghai, where SenseTime showcased its upgraded "Jueying Kaiwu" world model, which aims to bridge AI and real-world interaction [1]
- SenseTime Jueying introduced the industry's first mass-produced, interactive world model for autonomous driving, along with "WorldSim-Drive," the largest generative driving dataset, to empower the industry [1][2]
- The company is collaborating with SAIC Group's Zhiji Auto (IM Motors) to enhance data generation for diverse driving scenarios, aiming to accelerate the deployment of safe, reliable autonomous driving systems [4]

Company Developments
- SenseTime Jueying's CEO highlighted the conversion of AI creativity into productivity: generating millions of scene samples for autonomous driving and building a new 4D real world for embodied intelligence [3]
- The "Jueying Kaiwu" world model is the first generative world-model product platform in the autonomous driving field, designed to relieve data bottlenecks, and is open for trial to both business (B-end) and consumer (C-end) users [4]
- Currently, 20% of SenseTime Jueying's data is produced through the world model, demonstrating its production efficiency [4]

Industry Impact
- Integrating virtual and real data paradigms in autonomous driving will strengthen embodied intelligence, focusing on the interaction between people, objects, and scenes [3]
- The interactive experience at WAIC 2025 let attendees engage with the generative world-model product platform, demonstrating the performance of the leading autonomous driving dataset [7]
Embodied Intelligence Gains a Heavyweight: With a Decade of Multimodal Groundwork and World Models Leading the Way, SenseTime's "Wuneng" Arrives
量子位· 2025-07-27 11:57
Core Viewpoint
- SenseTime officially announced its entry into embodied intelligence with the launch of the "Wuneng" embodied intelligence platform at the WAIC 2025 large-model forum [1][2]

Group 1: SenseTime's Technological Advancements
- SenseTime introduced the SenseNova V6.5 (日日新 V6.5) multimodal reasoning model, which features a distinctive interleaved image-text chain of thought that significantly improves cross-modal reasoning accuracy [3][4]
- The new model outperforms Gemini 2.5 Pro on multimodal reasoning across multiple benchmarks, showcasing its competitive edge [8]
- Compared with its predecessor, SenseNova 6.0, V6.5 improves performance by 6.99% while cutting reasoning cost to 30% of the previous version, a fivefold gain in cost-effectiveness [10]

Group 2: Transition to Embodied Intelligence
- SenseTime's move into embodied intelligence is a natural extension of its expertise in visual perception and multimodal capabilities toward physical-world interaction [12][13]
- The company has accumulated over ten years of industry experience, particularly in autonomous driving, which provides valuable data and world-model experience for embodied intelligence [13]
- The "Wuneng" platform combines the general capabilities of the SenseNova multimodal models with experience in building and using world models, aiming to create an ecosystem for embodied intelligence [14]

Group 3: World Model Capabilities
- The "KAIWU" world model supports multi-view video generation and can maintain temporal consistency for up to 150 seconds, drawing on a library of over 100,000 3D assets [16][18]
- It understands occlusion and layering in space as well as temporal change and motion patterns, enabling realistic object representation [17][20]
- The platform can process people, objects, and environments simultaneously, building a 4D representation of the real world [21]

Group 4: Industry Collaboration and Data Utilization
- SenseTime is pursuing a hardware-software collaboration strategy, partnering with humanoid-robot and logistics-platform manufacturers to pre-install its models and enhance the multimodal perception and reasoning of their hardware [29]
- The company addresses the industry-wide problem of data scarcity by generating synthetic data in virtual environments and calibrating it against real-world samples [32][33]
- Training on fused first-person and third-person perspectives lets the model learn from human demonstrations while executing tasks from its own sensory input [26][35]

Group 5: Future Outlook and Competitive Edge
- SenseTime is building a self-reinforcing data ecosystem through large-scale simulation, real-world data fed back from hardware, and the fusion of different perspectives, which is expected to drive continuous model upgrades [39]
- The company is positioned to lead the future of embodied intelligence by leveraging multimodal capabilities and hardware collaboration to build a competitive moat [40]
Shanghai's Xuhui District Unveils the Mosu Space (模速空间) Innovation and Entrepreneurship Base for Overseas Returnee Talent
Xin Hua Cai Jing· 2025-07-27 10:38
Group 1
- The 2025 World Artificial Intelligence Conference (WAIC) has commenced in Shanghai, with this event focusing on dialogue between young overseas returnees and the technology sector [1]
- A strategic framework agreement was signed among the Shanghai Artificial Intelligence Laboratory, the Shanghai Future Industry Fund, Shanghai Lingang Sci-Tech Investment Management Co., and Xuhui Capital to facilitate the transformation of scientific research achievements into industrial applications [1]
- The Shanghai Overseas Friendship Association emphasized the importance of overseas students, particularly the youth, in contributing to national strategic needs and fostering innovation in the AI sector [1]

Group 2
- Keynote speeches highlighted the future applications of embodied intelligence and the need for international cooperation in addressing global challenges [2]
- Discussions covered how returnee talent can leverage China's vast robotics market and the importance of building a collaborative ecosystem involving government, academia, and industry [2]
- Experts discussed strategies for breaking down barriers and establishing regular communication mechanisms to accelerate the translation of research outcomes into industry applications [2]
Does Generalizing Agent Capabilities Necessarily Require Representing the World?
机器之心· 2025-07-27 01:30
Group 1
- The article examines whether world representation is necessary for achieving generalized agent capabilities, revisiting the long-running debate between model-free and model-based paradigms in AI [4][5][8]
- It emphasizes that modern AI agents are expected to perform complex tasks autonomously; what distinguishes them from simple bots is their ability to generalize [5]
- The model-free paradigm holds that intelligent behavior can emerge from direct perception-action loops without explicit internal representations, while the model-based paradigm argues that a rich internal predictive representation of the world is required [6][7]

Group 2
- The article cites recent DeepMind research that formalizes the debate between model-free and model-based approaches, showing that agents with generalization capabilities necessarily internalize world representations [6][7]
- Its core theorem states that any generalized agent must carry a high-quality world model to achieve long-horizon capabilities, contradicting the notion that representation can be bypassed (a schematic paraphrase follows this summary) [7]
- The discussion therefore shifts from whether representation is needed to how it should be constructed, noting that existing world-model paradigms still have flaws and the field lacks consensus [8]
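To give the Group 2 result a concrete shape, here is a schematic paraphrase in LaTeX. This is our own illustrative formalization, not the paper's exact statement; the symbols $\pi$, $\mathcal{G}$, $P$, $f$, and $\varepsilon$ are assumptions introduced for readability.

```latex
% Schematic only: low regret across a sufficiently rich goal family
% forces the policy to encode a recoverable approximation of the
% environment dynamics.
\[
\max_{g \in \mathcal{G}} \operatorname{Regret}(\pi, g) \le \delta
\;\Longrightarrow\;
\exists\, \hat{P} = f(\pi)\ \text{with}\ \lVert \hat{P} - P \rVert \le \varepsilon(\delta),
\qquad \varepsilon(\delta) \to 0 \ \text{as}\ \delta \to 0,
\]
% where \pi is the agent's policy, \mathcal{G} a family of goals,
% P the true transition kernel, and f an extraction procedure.
```

Read this way, the claim is constructive rather than merely existential: the world model is not an optional add-on but something that can in principle be extracted from any sufficiently general agent.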
A Gap Emerging? How Autonomous Driving Research Is Evolving at ICCV 2025...
自动驾驶之心· 2025-07-24 09:42
Core Insights
- The article highlights the latest advances in autonomous driving technologies, surveying research papers and frameworks across several subfields [2][3]

Multimodal Models & VLA
- ORION presents a holistic end-to-end framework for autonomous driving, utilizing vision-language instructed action generation [5]
- An all-in-one large multimodal model for autonomous driving is introduced, showcasing its potential applications [6][7]
- MCAM focuses on multimodal causal analysis for ego-vehicle-level driving video understanding [9]
- AdaDrive and VLDrive emphasize self-adaptive systems and lightweight models for efficient language-grounded autonomous driving [10]

Simulation & Reconstruction
- ETA proposes a dual approach to self-driving with large models, enhancing efficiency through forward-thinking [13]
- InvRGB+L introduces inverse-rendering techniques for complex scene modeling [14]
- AD-GS and BézierGS address object-aware scene reconstruction and dynamic urban scene reconstruction, respectively [18][19]

End-to-End & Trajectory Prediction
- Epona presents an autoregressive diffusion world model for autonomous driving, enhancing trajectory prediction capabilities [25]
- World4Drive introduces an intention-aware physical latent world model for end-to-end autonomous driving [30]
- MagicDrive-V2 focuses on high-resolution long-video generation for autonomous driving with adaptive control [35]

Occupancy Networks
- The article discusses advances in 3D semantic occupancy prediction, highlighting the transition from binary occupancy to semantic labels [44]
- GaussRender and GaussianOcc focus on learning 3D occupancy with Gaussian rendering techniques [52][54]

Object Detection
- Several papers address 3D object detection, including MambaFusion, which emphasizes height-fidelity dense global fusion for multi-modal detection [64]
- OcRFDet explores object-centric radiance fields for multi-view 3D object detection in autonomous driving [69]

Datasets
- The ROADWork dataset aims to improve recognition and analysis of work zones in driving scenarios [73]
- Research on driver-attention prediction and motion planning is also highlighted, underscoring the importance of understanding driver behavior in autonomous systems [74][75]