World Models
Nano Banana Pro Is Taking Off
36Kr· 2025-11-21 01:55
Core Insights
- Google has recently launched several AI models, including Gemini 3, Antigravity, and Nano Banana Pro, which showcase advanced capabilities beyond simple image generation, indicating a move toward reasoning and understanding [1][26]

Model Testing
- The Nano Banana Pro model was tested on generating realistic video-conference scenarios featuring well-known figures from the tech industry, demonstrating a high level of detail and accuracy in character representation [2][5]
- The model successfully integrated a two-dimensional anime character into a three-dimensional video-conference setting, maintaining the character's original style while ensuring a coherent visual experience [5][26]

Language and Menu Generation
- Nano Banana Pro was tasked with creating menus in multiple languages, including English, Chinese, Japanese, and Russian, showing proficiency in layout and design but revealing limitations in generating coherent text beyond the prompt [10][11]
- The generated Chinese menu displayed accurate headings and categories, but specific dish names were less recognizable, indicating a gap in the model's text generation capabilities [10][11]

Cultural Understanding
- The model demonstrated an understanding of Chinese cultural elements, such as palmistry and acupuncture, accurately depicting relevant imagery and concepts [13][18]
- However, it made errors in specific details, such as mislabeling lines in palmistry, highlighting areas for improvement in cultural accuracy [14][26]

Mathematical Problem Solving
- Nano Banana Pro was evaluated on its ability to solve algebraic and geometric problems, with results aligning with expected answers, suggesting a foundational understanding of mathematical concepts [20][24]
- The model's performance indicates a shift from a mere graphics tool to one incorporating reasoning and understanding in its outputs, processing prompts with a degree of contextual awareness [26][27]
Future Implications
- The advancements in Nano Banana Pro's capabilities suggest a potential evolution toward a "world model," where the AI not only generates images but also comprehends relationships and structures within a scene [26][27]
- This progression raises both excitement and caution, as the model approaches a level of understanding that could redefine its applications in various fields [27]
UISEE | Hiring Planning Algorithm Engineers (Direct Referrals Available)
自动驾驶之心· 2025-11-21 00:04
Core Insights
- The article discusses advancements in autonomous driving technology, particularly the development and implementation of VLA (Vision-Language-Action) models by Xiaopeng Motors, highlighting their significance for the industry [14]

Group 1: Company Developments
- Xiaopeng Motors has announced the launch of VLA 2.0, a significant step in the evolution of autonomous driving technology, transitioning from perception-based systems to more integrated approaches [14]
- The article reflects on a year of research and development in VLA, indicating a shift in focus from traditional perception methods to VLA, which aims to enhance the vehicle's decision-making capabilities [14]

Group 2: Industry Trends
- The article notes a growing industry trend toward end-to-end autonomous driving solutions, with VLA positioned as a potential game-changer in how vehicles interact with their environment [14]
- There is a discussion of the competitive landscape, particularly the debate between world-model and VLA routes, suggesting that the industry is at a crossroads in terms of technological direction [14]

Group 3: Research and Academic Contributions
- The article mentions recent academic contributions, such as a paper from The Chinese University of Hong Kong (Shenzhen) and Didi that proposes a new method for dynamic driving scene reconstruction, indicating ongoing research efforts in the field [14]
The Three Main Technical Routes for Autonomous Driving: End-to-End, VLA, and World Models
自动驾驶之心· 2025-11-21 00:04
Overview
- The article discusses the ongoing technological competition in the autonomous driving industry, focusing on different approaches to solving corner cases and enhancing safety and efficiency in driving systems [1][3]

Technological Approaches
- There is a debate between two main technological routes: VLA (Vision-Language-Action) and VLM (Vision-Language Model) [1]
- Major companies like Waymo utilize VLM, which lets AI handle environmental understanding and reasoning while traditional modules retain decision-making control for safety [1]
- Companies such as Tesla, Geely, and XPeng are exploring VLA, aiming for AI to learn all driving skills through extensive data training for end-to-end decision-making [1]

Sensor and Algorithm Developments
- The article highlights the evolution of perception technologies, with BEV (Bird's Eye View) perception becoming mainstream by 2022 and OCC (Occupancy) perception gaining traction in 2023 [3][5]
- BEV integrates data from multiple sensors into a unified spatial representation, facilitating better path planning and dynamic information fusion [8][14]
- OCC perception provides detailed occupancy data, clarifying the probability that a region of space is occupied over time, which enhances dynamic interaction modeling [6][14]

Modular and End-to-End Systems
- Before the advent of multimodal large models and end-to-end autonomous driving, perception and prediction tasks were typically handled by separate modules [5]
- The article outlines a phased approach to modularization, in which perception, prediction, decision-making, and control are distinct yet interconnected [4][31]
- End-to-end systems aim to streamline the process by mapping raw sensor inputs directly to actionable outputs, enhancing efficiency and reducing information bottlenecks [20][25]
VLA and VLM Frameworks
- VLA (Vision-Language-Action) and VLM (Vision-Language Model) frameworks are discussed, with VLA focusing on understanding complex scenes and making autonomous decisions based on visual and language inputs [32][39]
- The article emphasizes the importance of language models in enhancing the interpretability and safety of autonomous driving systems, allowing for better cross-scenario knowledge transfer and decision-making [57]

Future Directions
- The competition between VLA and WA (World Action) architectures is highlighted, with WA emphasizing direct visual-to-action mapping without language mediation [55][56]
- The article suggests that the future of autonomous driving will involve integrating world models that understand physical laws and temporal dynamics, addressing the limitations of current language models [34][54]
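The modular-versus-end-to-end contrast described above can be sketched in a few lines. This is a toy illustration only: the function names, grid shapes, and the braking threshold are assumptions for the sketch, not any vendor's actual pipeline.

```python
import numpy as np

# Modular pipeline (perception -> planning) versus an end-to-end mapping
# from raw pixels straight to an action. All shapes and thresholds here
# are illustrative assumptions.

def perceive(frame: np.ndarray) -> np.ndarray:
    """Collapse a camera frame into a coarse BEV occupancy grid
    (probability that each cell is occupied)."""
    h, w = 8, 8
    grid = frame.reshape(h, -1)[:, :w]           # crude downsample
    return 1.0 / (1.0 + np.exp(-grid))           # squash to [0, 1]

def plan(occupancy: np.ndarray) -> str:
    """Pick a coarse action from the occupancy ahead of the ego vehicle."""
    ahead = occupancy[:4].mean()                  # top rows = space ahead
    return "brake" if ahead > 0.5 else "cruise"

def end_to_end(frame: np.ndarray) -> str:
    """End-to-end view: one mapping from pixels to action. Here it is
    just the composition; a learned system would fuse the stages."""
    return plan(perceive(frame))

print(end_to_end(np.zeros((64, 64))))             # empty road
print(end_to_end(np.full((64, 64), 10.0)))        # obstacle everywhere
```

The point of the sketch is structural: the modular route exposes the intermediate occupancy representation for inspection, while the end-to-end route treats it as internal state.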
A 36-Month Comeback: He Fought His Way Back with Google AI, and World Models Are Next
36Kr· 2025-11-20 23:53
Core Insights
- The competition in the AI model landscape is intensifying, with Google's Gemini 3 Pro recently surpassing Elon Musk's Grok 4.1 to claim the top spot in various rankings [1][3][7]

Group 1: Gemini 3's Capabilities and Impact
- Gemini 3 is highlighted for its advanced reasoning, multimodal processing, and coding abilities, enhancing Google's existing products, particularly its lucrative search business [7][8]
- The introduction of AI Overviews has led to a 10% increase in search query volume, while visual search usage has surged by 70% thanks to Gemini's photo analysis [8]
- Gemini 3 is positioned as a foundational model for Google's product ecosystem, integrating AI into services like Google Maps, Gmail, and cloud offerings [8][12]

Group 2: Competitive Landscape and Market Position
- Google has made significant investments in AI, producing breakthroughs that allowed it to catch up with competitors like OpenAI, which had initially disrupted its core search business [9][10]
- Monthly active users of Gemini applications now exceed 650 million, indicating strong engagement compared with ChatGPT's 700-800 million weekly active users [12]
- Gemini 3 has outperformed OpenAI's GPT-5 on several benchmarks, particularly in reasoning and long-term planning, enhancing its practical capabilities [12]

Group 3: Future Directions and AGI Aspirations
- Google aims to develop a comprehensive model that excels across domains, seen as a crucial step toward achieving Artificial General Intelligence (AGI) [13][14]
- The company is focused on refining the Gemini model to improve its programming, reasoning, and mathematical capabilities, with future iterations expected to be more efficient and cost-effective [13][14]
- The timeline for achieving AGI is projected at 5 to 10 years, with Gemini 3 serving as a pivotal platform for future advancements [14][15]
Group 4: Economic Viability and AI Bubble Concerns
- Despite concerns about an AI bubble, Google is well positioned thanks to its solid revenue streams and the strategic role of DeepMind in enhancing its AI capabilities [15][17]
- The integration of AI into existing Google services is already yielding tangible returns, improving the performance of search, YouTube, and cloud services [16][17]
Comparing Xiaopeng's and Li Auto's VLA Based on Accurate Source Materials
理想TOP2· 2025-11-20 10:42
Core Viewpoint
- The article discusses advancements in autonomous driving technology, particularly the VLA (Vision-Language-Action) architecture developed by Li Auto and the insights shared by Xiaopeng's autonomous driving head, Liu Xianming, during a podcast. Liu emphasizes removing the intermediate language component (L) to enhance scalability and efficiency in data usage [1][4][5]

Summary by Sections

VLA Architecture and Training Process
- The VLA architecture involves a pre-training phase using a 32-billion-parameter (32B) vision-language model that incorporates 3D vision and high-definition 2D vision, improving clarity 3-5x over open-source models; it also includes driving-related language data and key VL joint data [10][11]
- The model is distilled into a 3.2-billion-parameter (3.2B) MoE model to ensure fast inference on vehicle hardware, followed by a post-training phase that integrates action to form the VLA, increasing the parameter count to nearly 4 billion [13][12]
- The reinforcement learning phase consists of two parts: reinforcement learning from human feedback (RLHF) and pure reinforcement learning on world-model-generated data, focusing on comfort, collision avoidance, and adherence to traffic regulations [15][16]

Data Utilization and Efficiency
- Liu argues that using language as a supervisory signal can introduce human biases, reducing data efficiency and scalability; the hardest data to collect are corner cases, which are crucial for training [4][6]
- The architecture aims for a high level of generalization, with plans to implement L4 robotaxi services in Guangzhou based on the current framework [4][5]

Future Directions and Challenges
- Liu acknowledges the uncertainties in scaling the technology and ensuring safety, questioning how to maintain safety standards and align the model with human behavior [5][18]
- The conversation highlights that VLA, VLM, and world models are all fundamentally end-to-end architectures, with various companies working on similar concepts in the realm of Physical AI [5][18]

Human-Agent Interaction
- The driver agent is designed to process short commands directly, while complex instructions are sent to the cloud for processing before execution; this approach allows the system to understand and interact with the physical world like a human driver [17][18]
- The article concludes that the traffic domain is a suitable environment for VLA implementation because its rules are well defined and human driving behavior can be modeled effectively [19][20]
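The 32B-to-3.2B distillation step described above can be sketched as a temperature-softened KL objective. This is a generic knowledge-distillation sketch under assumed settings (the temperature and toy logits are made up), not Xiaopeng's or Li Auto's actual training code.

```python
import numpy as np

# Knowledge distillation sketch: a large teacher model's output
# distribution supervises a much smaller student. The article states
# only the parameter counts (32B teacher -> 3.2B student); the
# temperature and logits below are illustrative assumptions.

def softmax(z, temp=1.0):
    z = np.asarray(z, dtype=float) / temp
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, temp=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return float(np.sum(p * (np.log(p) - np.log(q))))

t = [2.0, 1.0, 0.1]
print(distill_loss(t, t))                      # identical logits -> 0.0
print(distill_loss(t, [0.0, 0.0, 0.0]) > 0.0)  # mismatch -> positive loss
```

Minimizing this loss pushes the student's softened distribution toward the teacher's, which is what lets a 10x smaller model retain much of the teacher's behavior for fast on-vehicle inference.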
This Saturday: Watch and Learn at the NeurIPS 2025 Paper Sharing Session, Final Call for Registration
机器之心· 2025-11-20 06:35
In 2025, the evolution of AI is moving from "capability breakthroughs" to a "system building" stage. Autonomous agents are beginning to attempt closed loops on real tasks, world models are being continuously validated in complex environments, and reasoning architectures and training paradigms keep being reworked; the focus of the technology is no longer just "can it be done" but "how can it be done more reliably, more interpretably, and more sustainably". Against this backdrop, NeurIPS, one of the most influential academic conferences in AI and machine learning worldwide, has once again become an important bellwether for frontier trends. This year the conference received 21,575 valid submissions and accepted 5,290 papers, an overall acceptance rate of 24.52%. The conference will be held December 2-7, 2025 in San Diego, USA, and for the first time adds a second official venue in Mexico City, signaling the accelerating diversification of the global AI academic ecosystem. To serve the Chinese AI community, 机器之心 has in recent years hosted numerous NeurIPS, CVPR, ACL, and ICLR paper-sharing sessions, drawing strong attention from the AI community at home and abroad, with many universities and companies actively participating. This "NeurIPS 2025 Paper Sharing Session" is tailored for AI talent in China, featuring keynotes, paper presentations, roundtable discussions, poster exchanges, and company-booth interactions. Today, the session's full schedule, keynote speakers, and talk topics ...
Express | AI Godfather Yann LeCun's "Amicable Breakup" with Meta; New AI Company Targets Persistent Memory and Complex Reasoning Systems
Z Potentials· 2025-11-20 04:12
Core Insights
- Yann LeCun, Meta's Chief AI Scientist, will leave the company to establish his own AI startup focused on world models, a field he has extensively researched [2][3]
- Meta plans to collaborate with LeCun's startup, aiming to leverage its innovative outcomes [3][4]
- LeCun's departure is significant for Meta, as he is regarded as a foundational figure in modern AI, having co-founded Facebook AI Research (FAIR) and received the Turing Award [5]

Group 1: Company Developments
- Meta's current AI focus has shifted toward large language models (LLMs), including the Llama series, following a series of setbacks earlier this year, such as the delayed release of the Llama 4 model [4][5]
- The company has invested billions in recruiting talent and establishing the Meta Superintelligence Lab (MSL), led by notable figures from Scale AI and GitHub [4]

Group 2: Research Focus
- LeCun's new startup aims to advance research in advanced machine intelligence (AMI), which he believes will have profound impacts across various economic sectors, some of which overlap with Meta's interests [5]
- The startup will pursue the development of systems capable of understanding the physical world, possessing persistent memory, reasoning, and planning complex behavior sequences [3][5]
Breaking: Yann LeCun Officially Announces His Departure to Found a Startup Targeting Advanced Machine Intelligence (AMI)
机器之心· 2025-11-20 02:07
Core Viewpoint
- Yann LeCun, a Turing Award winner, has announced his departure from Meta to start a new company focused on Advanced Machine Intelligence (AMI), aiming to revolutionize AI by enabling systems to understand the physical world, possess long-term memory, reason, and plan complex actions [1][8][14]

Group 1: Company Transition
- LeCun's new venture will continue his research on "world models," which he believes are essential for AI to truly understand the physical world [8][27]
- Meta will act as a partner to LeCun's new company, supporting the AMI initiative, whose interests overlap with Meta's business but also extend into other areas [8][28]
- The departure marks a significant shift in the AI landscape, as LeCun leaves a position he helped establish at Meta's FAIR (Facebook AI Research) amid internal cultural conflicts and strategic misalignments [17][27]

Group 2: Research Focus
- The goal of the new company is to drive a major revolution in AI, focusing on systems that can understand the physical world and plan actions without extensive trial and error [8][24]
- LeCun has been a critic of large language models (LLMs), arguing that they lack true understanding of the physical world; he aims to develop AI that can reason and plan using world models [19][27]
- Recent research contributions include the JEPA architecture, which aims to create organized and actionable high-dimensional embedding spaces, seen as a potential pathway to achieving world models [25][27]

Group 3: Industry Impact
- LeCun's transition to entrepreneurship at the age of 65 signifies a new exploration phase in AI, moving away from the constraints of corporate environments to pursue foundational scientific challenges [14][27]
- The departure of LeCun, alongside other key figures like Soumith Chintala, marks the end of an era for Meta AI and highlights the ongoing evolution within the AI research community [28]
World Models Rise, and the Debate over AI Routes Flares Up Again
36Kr· 2025-11-20 01:58
Core Insights
- The future of AI may hinge on understanding the evolutionary codes of the human brain, as highlighted by Yann LeCun's departure from Meta to focus on "World Models" [1]
- Fei-Fei Li emphasizes that the advancement of AI should pivot from merely expanding model parameters to embedding "Spatial Intelligence," a fundamental cognitive ability that humans possess from infancy [1][3]
- The launch of Marble by World Labs, which utilizes multimodal world models to create persistent 3D digital-twin spaces, marks a significant step toward achieving spatial intelligence in AI [1]

Group 1: AI Development Perspectives
- Yann LeCun's vision diverges from Meta's focus on large language models (LLMs); he argues that LLMs cannot replicate human reasoning capabilities [3]
- LLMs are constrained by data quality and scale, leading to cognitive limitations that hinder their ability to model the physical world and perform dynamic causal reasoning [3][4]
- The reliance on text data restricts AI's ability to break free from "symbolic cages," necessitating a shift toward a structured understanding of the world for true AI evolution [4]

Group 2: World Models vs. Large Language Models
- World models are seen as a solution to the fundamental limitations of LLMs, focusing on high-dimensional perceptual data to model the physical world directly [4][5]
- The key characteristics of world models include internal representation and prediction, physical cognition, and counterfactual reasoning capabilities [11]
- A complete world model consists of a state representation, a dynamics model, and a decision-making model, enabling AI to simulate and plan actions in a virtual environment [12][13]

Group 3: Industry Trends and Innovations
- Recent advancements in world models have been made by major tech companies, with Google DeepMind's Genie series and Meta's Code World Model leading the charge [16]
- The concept of "physical AI" is gaining traction, with Nvidia's CEO asserting that the next growth phase will stem from these new models, which will revolutionize robotics [16]
- The application of world models is already influencing various sectors, including autonomous driving and robotics, as companies like Tesla integrate these models for real-world learning and validation [17]

Group 4: Challenges and Future Directions
- The development of world models faces technical challenges, including the need for extensive multimodal data and the lack of standardized training datasets [20]
- Cognitive challenges arise from the complexity of decision-making processes within world models, raising concerns about transparency and alignment with human values [20][21]
- Despite the challenges, global competition in the world-model space is intensifying, with the potential to redefine industries and enhance human-AI collaboration [21][22]
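The three world-model components named above (state representation, dynamics model, decision-making model) can be sketched with linear stand-ins. Everything here is an illustrative assumption: a learned system would replace the encoder and the matrices with neural networks, but the planning-by-imagination loop has the same shape.

```python
import numpy as np

# Toy world model: encode an observation to a latent state, predict how
# actions change that state, and choose the action whose imagined
# rollout lands closest to a goal. The linear dynamics are stand-ins
# for learned networks.

A = np.eye(2) * 0.9                       # dynamics: latent drifts toward 0
B = np.eye(2)                             # how actions move the latent

def encode(obs):
    """State representation: compress an observation to a latent vector."""
    return np.asarray(obs, dtype=float)[:2]

def dynamics(z, a):
    """Dynamics model: predict the next latent state."""
    return A @ z + B @ a

def plan(z, actions, goal):
    """Decision model: pick the action whose imagined next state is
    nearest the goal, i.e. plan inside the model, not the world."""
    return min(actions, key=lambda a: np.linalg.norm(dynamics(z, a) - goal))

z0 = encode([4.0, 0.0, 1.0])
candidates = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
best = plan(z0, candidates, goal=np.zeros(2))
print(best)  # the action steering the latent toward the origin wins
```

This is the counterfactual-reasoning property in miniature: the model evaluates "what would happen if" for each candidate action without ever executing the bad one.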
Solving Tesla's "Sparse Supervision" Problem: Using World Models to Amplify Autonomous Driving's Scaling Law
具身智能之心· 2025-11-20 00:03
Edited by 机器之心

In the autonomous driving field, large VLA models are moving from the academic frontier into the "deep water" of industrial deployment. Recently, in a talk at ICCV, Tesla publicly disclosed one of its core challenges: "sparse supervision".

This problem strikes at the weak point of current VLA models: their input is a high-dimensional, dense stream of visual information, yet their supervision signal is often low-dimensional, sparse driving actions (such as waypoints). Even with petabyte-scale data, the enormous potential of VLA models cannot be effectively unlocked.

Just as the industry was debating this bottleneck, a team from leading domestic academic institutions collaborating with Huawei has quietly offered a remedy. A new work titled "DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving" provides an insightful solution to this "sparse supervision" problem. The research proposes that world models are the key to unlocking the VLA data scaling law (D ...
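The sparse-supervision idea above can be made concrete with a two-term objective: a few waypoints supervise the policy, while a world-model head predicting the next frame adds a dense per-pixel signal. The losses and the weighting below are illustrative assumptions, not DriveVLA-W0's actual objective.

```python
import numpy as np

# Sparse vs. dense supervision sketch. A policy trained only on
# waypoints gets a handful of numbers per frame; adding a next-frame
# prediction head turns every pixel into a training signal. The lam
# weight is an assumed hyperparameter for the sketch.

def action_loss(pred_waypoints, true_waypoints):
    """Sparse signal: a few (x, y) waypoints per frame."""
    d = np.asarray(pred_waypoints) - np.asarray(true_waypoints)
    return float(np.mean(d ** 2))

def world_model_loss(pred_next_frame, true_next_frame):
    """Dense signal: every pixel of the predicted next observation."""
    d = np.asarray(pred_next_frame) - np.asarray(true_next_frame)
    return float(np.mean(d ** 2))

def total_loss(pred_wp, true_wp, pred_frame, true_frame, lam=0.1):
    """Combined objective: sparse action loss plus weighted dense
    world-model loss."""
    return action_loss(pred_wp, true_wp) + lam * world_model_loss(pred_frame, true_frame)

wp = [[1.0, 2.0]]
f_pred, f_true = np.zeros((4, 4)), np.ones((4, 4))
print(total_loss(wp, wp, f_pred, f_true))   # 0 + 0.1 * 1 = 0.1
```

Even when the waypoints are predicted perfectly (action loss of zero), the world-model term keeps supplying gradient, which is one way to read the claim that world models "amplify" the data scaling law.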