World Model
Even 10 Billion Yuan Isn't Enough to Burn! Robotics CEOs Offer a New Judgment: Embodied Intelligence Can't Keep Copying the LLM Playbook
Sou Hu Cai Jing· 2025-11-22 02:41
Core Insights
- The event highlighted the latest advances in embodied intelligence from the Zhiyuan Research Institute, focusing on the importance of world models and the development of a comprehensive embodied-brain system [2][3]

Group 1: Zhiyuan's Full-Stack Layout
- Zhiyuan introduced the native multimodal world model Emu3.5, which expanded training data from 15 years of video to 790 years and increased parameter count from 8 billion to 34 billion, while speeding up video and image generation [5]
- The institute is building an embodied intelligence system that spans heterogeneous robot bodies, including RoboBrain, RoboOS, and RoboBrain-0, deployed across various robotic forms for tasks ranging from navigation to complex interactions [5]

Group 2: Key Elements of Embodied Intelligence
- The role of world models in embodied intelligence was debated, with experts emphasizing the need for models that predict the next state based on the robot's form and goals, rather than merely generating videos (a minimal sketch of this next-state interface follows this summary) [7][10]
- There is a consensus that embodied intelligence should not follow the current language-first paradigm but should instead adopt a structure centered on action and perception [10][12]
- The importance of real data was highlighted, with discussion of the need to combine real, simulated, and video data for effective robot learning [15][17]

Group 3: Investment Priorities
- When asked how to allocate 10 billion yuan, experts prioritized talent acquisition, computational power, and data engines as the key investment areas [19][21]
- Views differed on infrastructure versus model development, with some advocating a focus on building a comprehensive data engine for continuous digitalization [21][22]

Group 4: Humanoid Robots and Hardware Limitations
- The debate on whether humanoid robots are the ultimate form of embodied intelligence concluded that neither models nor hardware define each other; rather, the specific application scenarios dictate the requirements [22][24]
- Experts suggested a layered structure for embodied intelligence, in which higher-level models can be reused across robotic forms while lower-level models must be tailored to specific hardware [23][24]

Conclusion
- The discussions signaled a proactive search for solutions to close the loop in embodied intelligence, emphasizing that models, hardware, and scaling must evolve together [24]
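The "next-state predictor" framing in Group 2 can be made concrete with a small sketch: given the current state, a candidate action, and an embedding of the robot's form, the model predicts the next state. This is a minimal illustration assuming a simple latent-dynamics interface; the class, shapes, and names are hypothetical and are not Zhiyuan's actual Emu3.5 or RoboBrain API.

```python
# Minimal sketch of a world model as a next-state predictor (hypothetical API).
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Predicts the next latent state from the current state and an action,
    conditioned on an embodiment embedding (the robot's form)."""
    def __init__(self, state_dim=256, action_dim=32, embodiment_dim=16):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim + embodiment_dim, 512),
            nn.ReLU(),
            nn.Linear(512, state_dim),
        )

    def forward(self, state, action, embodiment):
        # Concatenate state, action, and embodiment, then predict the residual
        # change in state; predicting deltas often stabilizes training.
        x = torch.cat([state, action, embodiment], dim=-1)
        return state + self.dynamics(x)

# Roll the model forward to evaluate a candidate action plan against a goal.
model = LatentWorldModel()
state = torch.randn(1, 256)
embodiment = torch.randn(1, 16)
for action in torch.randn(5, 1, 32):  # a 5-step action plan
    state = model(state, action, embodiment)
```

Such a rollout is what lets a planner score action sequences by how close the predicted final state lands to a goal, instead of generating videos as a side product.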
Xiaomi HAD Enhanced Assisted Driving Released: Reinforcement Learning and World Models Introduced, AES Emergency Steering Function Arrives
Feng Huang Wang· 2025-11-21 02:33
Core Insights
- Xiaomi Auto officially launched the Xiaomi HAD Enhanced Version at the Guangzhou Auto Show, showcasing advances in smart driving technology and AI talent acquisition [1]
- The company plans to invest over 7 billion yuan in AI research and development by 2025, with a current team of 1,800 experts, including 108 PhDs [1]

Technical Developments
- The new Xiaomi HAD Enhanced Version is trained on a foundation of 10 million clips and incorporates reinforcement learning algorithms and world models to improve driving performance [1]
- The world-model technology lets the system simulate a wide range of scenarios, including extreme weather and complex road conditions, shifting the approach from "rule-driven" to "learning-driven" [1]

User Experience Enhancements
- The update focuses on optimizing longitudinal and lateral control, particularly in scenarios like lane merging, reducing unnecessary deceleration and hard braking [2]
- Active safety is significantly upgraded with the new AES emergency steering assist function, which can automatically change lanes to avoid collisions at speeds between 80 km/h and 135 km/h [2]

Safety Features Expansion
- The forward AEB (Automatic Emergency Braking) range has been expanded to cover 1 km/h to 135 km/h, with new capabilities for recognizing various obstacles [2]
- Backward AEB covers reversing scenarios from 1 km/h to 30 km/h, tuned to balance sensitivity so the car stops accurately while minimizing false triggers (a toy sketch of these speed-gated activation windows follows this summary) [2]

Software Updates
- The driving updates ship in Xiaomi HyperOS version 1.11.0, with rollout timing varying by model according to review progress [2]
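The speed windows above (AES at 80-135 km/h, forward AEB at 1-135 km/h, backward AEB at 1-30 km/h) amount to simple range-gated activation logic. The toy sketch below illustrates that gating; the thresholds come from the article, but the function and feature names are illustrative, not Xiaomi's actual software interface.

```python
# Toy illustration of speed-gated active-safety windows (thresholds per article).
ACTIVATION_WINDOWS_KMH = {
    "aes_emergency_steering": (80.0, 135.0),
    "forward_aeb": (1.0, 135.0),
    "backward_aeb": (1.0, 30.0),
}

def active_features(speed_kmh: float, reversing: bool = False) -> list:
    """Return the safety features whose speed window covers the current speed."""
    features = []
    for name, (lo, hi) in ACTIVATION_WINDOWS_KMH.items():
        if name == "backward_aeb" and not reversing:
            continue  # backward AEB only applies in reversing scenarios
        if name != "backward_aeb" and reversing:
            continue  # forward features only apply when driving forward
        if lo <= speed_kmh <= hi:
            features.append(name)
    return features

print(active_features(100.0))        # ['aes_emergency_steering', 'forward_aeb']
print(active_features(20.0, True))   # ['backward_aeb']
```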
Nano Banana Pro Is About to Take Off
36Kr· 2025-11-21 01:55
Core Insights
- Google has recently launched several AI models, including Gemini 3, Antigravity, and Nano Banana Pro; the last of these shows advanced capabilities beyond simple image generation, indicating a move toward reasoning and understanding [1][26]

Model Testing
- The Nano Banana Pro model was tested on generating realistic video-conference scenes featuring well-known figures from the tech industry, demonstrating a high level of detail and accuracy in character representation [2][5]
- The model successfully integrated a two-dimensional anime character into a three-dimensional video-conference setting, preserving the character's original style while keeping the visual experience coherent [5][26]

Language and Menu Generation
- Nano Banana Pro was tasked with creating menus in multiple languages, including English, Chinese, Japanese, and Russian, showing proficiency in layout and design but revealing limits in generating coherent text beyond the prompt [10][11]
- The generated Chinese menu displayed accurate headings and categories, but specific dish names were less recognizable, indicating a gap in the model's text-generation capabilities [10][11]

Cultural Understanding
- The model demonstrated an understanding of Chinese cultural elements, such as palmistry and acupuncture, accurately depicting relevant imagery and concepts [13][18]
- However, it made errors in specific details, such as mislabeling the lines in palmistry, highlighting areas for improvement in cultural accuracy [14][26]

Mathematical Problem Solving
- Nano Banana Pro was evaluated on algebraic and geometric problems, with results matching the expected answers, suggesting a foundational grasp of mathematical concepts [20][24]
- This performance marks a shift from a purely graphic tool to one that incorporates reasoning, processing prompts with a degree of contextual awareness [26][27]

Future Implications
- The advances in Nano Banana Pro suggest a potential evolution toward a "world model," in which the AI not only generates images but also comprehends the relationships and structures within a scene [26][27]
- This progression invites both excitement and caution, as the model approaches a level of understanding that could redefine its applications across fields [27]
UISEE Technology | Planning Algorithm Engineer Recruitment (Direct Referral Available)
自动驾驶之心· 2025-11-21 00:04
Core Insights
- The article discusses advances in autonomous driving technology, focusing on the development and rollout of VLA (Vision-Language-Action) systems by Xiaopeng Motors and their significance for the industry [14]

Group 1: Company Developments
- Xiaopeng Motors has announced the launch of VLA 2.0, a significant step in the evolution of autonomous driving technology that moves from perception-based systems toward more integrated approaches [14]
- The article reflects on a year of VLA research and development, noting a shift in focus from traditional perception methods to VLA, which aims to strengthen the vehicle's decision-making capabilities [14]

Group 2: Industry Trends
- The article notes a growing industry trend toward end-to-end autonomous driving solutions, with VLA positioned as a potential game-changer in how vehicles interact with their environment [14]
- It also covers the competitive landscape, particularly the debate between the world-model and VLA routes, suggesting the industry stands at a crossroads in technological direction [14]

Group 3: Research and Academic Contributions
- The article cites recent academic work, such as a paper from The Chinese University of Hong Kong (Shenzhen) and Didi proposing a new method for dynamic driving-scene reconstruction, as evidence of ongoing research in the field [14]
The Three Technical Routes of Autonomous Driving: End-to-End, VLA, and World Models
自动驾驶之心· 2025-11-21 00:04
Overview
- The article surveys the ongoing technological competition in the autonomous driving industry, focusing on how different approaches handle corner cases and improve the safety and efficiency of driving systems [1][3]

Technological Approaches
- A debate is under way between two main architectural routes: VLM (Visual-Language Model) assistance and VLA (Vision-Language-Action) end-to-end control [1]
- Major players such as Waymo use VLM, letting AI handle environmental understanding and reasoning while traditional modules retain decision-making control for safety [1]
- Companies such as Tesla, Geely, and XPeng are exploring VLA, aiming for AI to learn all driving skills through extensive data training for end-to-end decision-making [1]

Sensor and Algorithm Developments
- The article traces the evolution of perception technologies, with BEV (Bird's Eye View) perception becoming mainstream by 2022 and OCC (Occupancy) perception gaining traction in 2023 [3][5]
- BEV integrates data from multiple sensors into a unified spatial representation, enabling better path planning and fusion of dynamic information [8][14]
- OCC perception provides fine-grained occupancy data, clarifying the probability that a region of space is occupied over time and improving dynamic interaction modeling [6][14]

Modular and End-to-End Systems
- Before multimodal large models and end-to-end autonomous driving, perception and prediction tasks were typically handled by separate modules [5]
- The article outlines a phased, modular approach in which perception, prediction, decision-making, and control are distinct yet interconnected [4][31]
- End-to-end systems streamline this pipeline by mapping raw sensor inputs directly to actionable outputs, improving efficiency and removing inter-module bottlenecks (a minimal sketch of this mapping follows this summary) [20][25]

VLA and VLM Frameworks
- VLA (Vision-Language-Action) and VLM (Visual-Language Model) frameworks are compared, with VLA focused on understanding complex scenes and making autonomous decisions from combined visual and language inputs [32][39]
- The article stresses the role of language models in improving the interpretability and safety of autonomous driving systems, enabling cross-scenario knowledge transfer and better decision-making [57]

Future Directions
- The competition between VLA and WA (World Action) architectures is highlighted, with WA emphasizing direct visual-to-action mapping without language mediation [55][56]
- The article suggests the future of autonomous driving will involve world models that understand physical laws and temporal dynamics, addressing the limitations of current language models [34][54]
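To make the "raw sensor inputs to actionable outputs" mapping concrete, here is a deliberately tiny sketch of an end-to-end pipeline: multi-camera frames are encoded, fused into a single BEV-style feature, and decoded directly into trajectory waypoints. All module names, layer sizes, and shapes are illustrative simplifications, not any company's production stack.

```python
# Tiny end-to-end sketch: multi-camera input -> fused BEV feature -> waypoints.
import torch
import torch.nn as nn

class TinyEndToEndDriver(nn.Module):
    def __init__(self, num_cams=6, bev_dim=128, horizon=8):
        super().__init__()
        # Per-camera image encoder (stand-in for a real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, bev_dim, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fuse the per-camera features into one BEV-style vector.
        self.fuse = nn.Linear(num_cams * bev_dim, bev_dim)
        # Decode (x, y) waypoints for the next `horizon` steps.
        self.planner = nn.Linear(bev_dim, horizon * 2)
        self.horizon = horizon

    def forward(self, images):  # images: (batch, num_cams, 3, H, W)
        b, n = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1)).flatten(1)  # (b*n, bev_dim)
        bev = torch.relu(self.fuse(feats.reshape(b, -1)))      # (b, bev_dim)
        return self.planner(bev).reshape(b, self.horizon, 2)

model = TinyEndToEndDriver()
waypoints = model(torch.randn(2, 6, 3, 128, 128))  # -> (2, 8, 2)
```

The point of the sketch is the single differentiable path from pixels to plan: there is no hand-off between a perception module's outputs and a separate planner, which is exactly what removes the inter-module bottlenecks the article describes.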
A Great Reversal in 36 Months: He Fought His Way Back with Google AI, and World Models Are Next
36Kr· 2025-11-20 23:53
Core Insights
- Competition in the AI model landscape is intensifying, with Google's Gemini 3 Pro recently surpassing Elon Musk's Grok 4.1 to claim the top spot across various rankings [1][3][7]

Group 1: Gemini 3's Capabilities and Impact
- Gemini 3 is noted for its advanced reasoning, multimedia processing, and coding abilities, strengthening Google's existing products, particularly its lucrative search business [7][8]
- The introduction of AI Overviews has driven a 10% increase in search query volume, while visual search has surged by 70% thanks to Gemini's photo analysis [8]
- Gemini 3 is positioned as a foundational model for Google's product ecosystem, integrating AI into services such as Google Maps, Gmail, and cloud offerings [8][12]

Group 2: Competitive Landscape and Market Position
- Google's heavy investment in AI has produced breakthroughs that let it catch up with competitors like OpenAI, which initially disrupted its core search business [9][10]
- Monthly active users of Gemini applications now exceed 650 million, indicating strong engagement compared with ChatGPT's 700-800 million weekly active users [12]
- Gemini 3 has outperformed OpenAI's GPT-5 on several benchmarks, particularly in reasoning and long-term planning, enhancing its practical capabilities [12]

Group 3: Future Directions and AGI Aspirations
- Google aims to develop a comprehensive model that excels across domains, seen as a crucial step toward Artificial General Intelligence (AGI) [13][14]
- The company is focused on refining Gemini's programming, reasoning, and mathematical capabilities, with future iterations expected to be more efficient and cost-effective [13][14]
- The timeline for achieving AGI is projected at 5 to 10 years, with Gemini 3 serving as a pivotal platform for future advances [14][15]

Group 4: Economic Viability and AI Bubble Concerns
- Despite concerns about an AI bubble, Google is well positioned thanks to its solid revenue streams and DeepMind's strategic role in advancing its AI capabilities [15][17]
- Integrating AI into existing Google services is already yielding tangible returns, improving the performance of search, YouTube, and cloud services [16][17]
Comparing XPeng and Li Auto's VLA Based on Accurate Primary Sources
理想TOP2· 2025-11-20 10:42
Core Viewpoint
- The article examines advances in autonomous driving, focusing on the VLA (Vision-Language-Action) architecture developed by Li Auto and on remarks by Xiaopeng's autonomous driving head, Liu Xianming, in a recent podcast; Liu argues for removing the intermediate language component (the "L") to improve scalability and data efficiency [1][4][5]

Summary by Sections

VLA Architecture and Training Process
- Li Auto's VLA pipeline begins with pre-training a 32-billion-parameter (32B) vision-language model that incorporates 3D vision and high-definition 2D vision, improving clarity by 3-5x over open-source models; the pre-training data also include driving-related language data and key vision-language joint data [10][11]
- The model is then distilled into a 3.2-billion-parameter (3.2B) MoE model for fast on-vehicle inference (a hedged sketch of such a distillation step follows this summary), followed by a post-training phase that adds the action component to form the VLA, bringing the parameter count to nearly 4 billion [13][12]
- The reinforcement learning phase has two parts: reinforcement learning from human feedback (RLHF) and pure reinforcement learning on world-model-generated data, targeting comfort, collision avoidance, and adherence to traffic regulations [15][16]

Data Utilization and Efficiency
- Liu argues that using language as a supervisory signal can inject human biases, reducing data efficiency and scalability; the hardest data to collect are corner cases, which are also the most valuable for training [4][6]
- The architecture aims for a high level of generalization, with plans to launch L4 robotaxi services in Guangzhou on the current framework [4][5]

Future Directions and Challenges
- Liu acknowledges the uncertainties in scaling the technology and guaranteeing safety, asking how to maintain safety standards and keep the model aligned with human behavior [5][18]
- The conversation notes that VLA, VLM, and world models are all fundamentally end-to-end architectures, with many companies pursuing similar concepts under the banner of Physical AI [5][18]

Human-Agent Interaction
- The driver agent processes short commands directly, while complex instructions are sent to the cloud for processing before execution, letting the system understand and interact with the physical world much as a human driver would [17][18]
- The article concludes that the traffic domain suits VLA deployment because its rules are well defined and human driving behavior can be modeled effectively [19][20]
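The 32B-to-3.2B step described above is a classic teacher-student distillation. The sketch below shows one common formulation, matching softened output distributions with a KL loss; the temperature, vocabulary size, and overall setup are assumptions for illustration, not Li Auto's actual training code.

```python
# One common distillation formulation: match softened teacher/student outputs.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradients keep a consistent magnitude across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

# One illustrative step: same batch through both models, match distributions.
teacher_logits = torch.randn(4, 32000)                     # frozen large teacher
student_logits = torch.randn(4, 32000, requires_grad=True)  # small student
loss = distill_loss(student_logits, teacher_logits.detach())
loss.backward()
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among wrong answers, which is where much of the "dark knowledge" in distillation lives.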
This Saturday: Watch and Learn at the NeurIPS 2025 Paper Sharing Session, Final Call for Registration
机器之心· 2025-11-20 06:35
Core Insights
- By 2025, the evolution of AI is shifting from "capability breakthroughs" to "system construction," with a focus on reliability, interpretability, and sustainability [2]
- NeurIPS, a leading academic conference in AI and machine learning, received 21,575 submissions this year with an acceptance rate of 24.52%, reflecting growing interest in AI research [2]
- The conference runs December 2-7, 2025, in San Diego, USA, with a new official venue in Mexico City, reflecting the diversification of the global AI academic ecosystem [2]

Event Overview
- The "NeurIPS 2025 Paper Sharing Conference" is designed for domestic AI talent and features keynote speeches, paper presentations, roundtable discussions, poster exchanges, and corporate networking [3]
- The event is scheduled for November 22, 2025, from 09:00 to 17:30 at the Crowne Plaza Hotel in Zhongguancun, Beijing [5][6]

Keynote Speakers and Topics
- Morning keynote by Qiu Xipeng of Fudan University: "Contextual Intelligence: Completing the Key Puzzle of AGI" [8][14]
- Afternoon keynote by Fan Qi of Nanjing University: "From Frames to Worlds: Long Video Generation for World Models" [10][17]

Paper Presentations
- Presentations will cover topics such as data mixing in knowledge acquisition, multimodal adaptation for large language models, and scalable data-generation frameworks [9][30]
- Presenters include doctoral students from Tsinghua University and Renmin University showcasing cutting-edge AI research [9][30]

Roundtable Discussion
- A roundtable will debate whether world models will become the next frontier in AI, featuring industry experts and academics [10][20]
Express | AI Godfather Yann LeCun's "Friendly Breakup" with Meta; New AI Company Targets Persistent Memory and Complex Reasoning Systems
Z Potentials· 2025-11-20 04:12
Core Insights
- Yann LeCun, Meta's Chief AI Scientist, will leave the company to establish his own AI startup focused on world models, a field he has researched extensively [2][3]
- Meta plans to collaborate with LeCun's startup, aiming to leverage its innovations [3][4]
- LeCun's departure is significant for Meta: he is regarded as a foundational figure in modern AI, having founded Facebook AI Research (FAIR) and received the Turing Award [5]

Group 1: Company Developments
- Meta's AI focus has shifted toward large language models (LLMs), including the Llama series, after a string of setbacks earlier this year such as the delayed release of the Llama 4 model [4][5]
- The company has invested billions in recruiting talent and establishing the Meta Superintelligence Lab (MSL), led by notable figures from Scale AI and GitHub [4]

Group 2: Research Focus
- LeCun's new startup aims to advance research on advanced machine intelligence (AMI), which he believes will have profound impacts across economic sectors, some overlapping with Meta's interests [5]
- The startup will pursue systems capable of understanding the physical world, possessing persistent memory, reasoning, and planning complex behavior sequences [3][5]
Just In: Yann LeCun Officially Announces Departure to Found a Startup Targeting Advanced Machine Intelligence (AMI)
机器之心· 2025-11-20 02:07
Core Viewpoint
- Yann LeCun, a Turing Award winner, has announced his departure from Meta to start a company focused on Advanced Machine Intelligence (AMI), aiming to build AI systems that understand the physical world, possess long-term memory, reason, and plan complex actions [1][8][14]

Group 1: Company Transition
- LeCun's new venture will continue his research on "world models," which he believes are essential for AI to truly understand the physical world [8][27]
- Meta will act as a partner to LeCun's new company, supporting the AMI initiative, whose interests overlap with Meta's business but extend into other areas [8][28]
- The departure marks a significant shift in the AI landscape, as LeCun leaves a position he helped establish at Meta's FAIR (Facebook AI Research) amid internal cultural conflicts and strategic misalignment [17][27]

Group 2: Research Focus
- The new company's goal is to drive a major revolution in AI, focusing on systems that can understand the physical world and plan actions without extensive trial and error [8][24]
- LeCun has long criticized large language models (LLMs), arguing that they lack true understanding of the physical world; he aims instead to develop AI that reasons and plans using world models [19][27]
- His recent research includes the JEPA line of work, which aims to create organized, actionable high-dimensional embedding spaces and is seen as a potential pathway to world models (a minimal sketch of the JEPA training idea follows this summary) [25][27]

Group 3: Industry Impact
- LeCun's move into entrepreneurship at 65 signals a new phase of exploration in AI, stepping away from corporate constraints to pursue foundational scientific challenges [14][27]
- His departure, alongside that of other key figures such as Soumith Chintala, marks the end of an era for Meta AI and underscores the ongoing evolution of the AI research community [28]
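For readers unfamiliar with JEPA, the core training idea can be sketched in a few lines: predict the embedding of a masked target from the embedding of its context, with the loss computed in representation space rather than pixel space, and with the target encoder updated by an exponential moving average rather than by gradients. The dimensions, EMA rate, and the linear encoders below are illustrative assumptions, not Meta's actual I-JEPA/V-JEPA implementation.

```python
# Minimal JEPA-style sketch: predict target embeddings from context embeddings.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
context_encoder = nn.Linear(784, dim)
target_encoder = copy.deepcopy(context_encoder)   # updated by EMA, not gradients
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

context_view = torch.randn(16, 784)   # e.g. the visible patches of an image
target_view = torch.randn(16, 784)    # e.g. the masked-out patches

with torch.no_grad():
    target_emb = target_encoder(target_view)       # no gradient to target side

pred = predictor(context_encoder(context_view))
loss = F.mse_loss(pred, target_emb)                # match embeddings, not pixels
loss.backward()

# EMA update keeps the target encoder a slow-moving copy of the context encoder.
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(0.996).add_(p_c, alpha=0.004)
```

Predicting in embedding space is the design choice that distinguishes JEPA from generative approaches: the model is free to discard unpredictable pixel-level detail and keep only the structure useful for reasoning and planning.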