After Watching All the Oscar Films, a VLM Hits a New SOTA in Cinematography Understanding | Open-Sourced by Shanghai AI Lab
量子位· 2025-07-16 01:49
Contributed by the ShotBench team; QbitAI | WeChat official account QbitAI

Today's most powerful vision-language models (VLMs) can "recognize what is in a picture," but they are not yet "smart" enough when it comes to understanding film. Shanghai AI Laboratory, together with S-Lab at Nanyang Technological University (Singapore), Tongji University, and The Chinese University of Hong Kong, has officially released ShotBench, along with the companion model ShotVL and the training set ShotQA, opening up both evaluation and training for VLMs' "cinematic sense."

ShotBench is a comprehensive benchmark designed specifically for understanding the language of cinema. It contains more than 3.5k expert-annotated question-answer pairs over images and video clips, drawn from more than 200 acclaimed (mostly Oscar-nominated) films and covering eight key cinematography dimensions: shot size, framing, camera angle, lens focal length, lighting type, lighting condition, composition, and camera movement. Following a strict annotation pipeline that combines trained annotators with expert supervision, the team built a high-quality evaluation set grounded in professional film knowledge.

ShotQA is a large-scale multimodal dataset containing roughly 70k film question-answer pairs. Using ShotQA, the team developed ShotVL through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. The team ...
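The summary above says ShotVL was trained with SFT followed by Group Relative Policy Optimization (GRPO). As a hedged, minimal sketch of the group-relative idea only (not the ShotVL training code), the snippet below scores a group of sampled answers to one question and normalizes the rewards within that group, so each answer's advantage is measured against its peers rather than against a learned value baseline; reward_fn and the toy answers are hypothetical placeholders.

```python
from statistics import mean, pstdev

def reward_fn(answer: str, gold: str) -> float:
    # Hypothetical reward: 1.0 for an exact match with the annotated answer, else 0.0.
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def group_relative_advantages(answers, gold):
    """Core of the group-relative idea: score every sampled answer, then
    normalize rewards within the group so each advantage is relative to
    the group mean instead of a separately trained value function."""
    rewards = [reward_fn(a, gold) for a in answers]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Toy usage: a group of 4 sampled answers to one ShotBench-style question.
answers = ["close-up", "wide shot", "close-up", "medium shot"]
print(group_relative_advantages(answers, gold="close-up"))
```

In a full GRPO setup these per-answer advantages would then weight a clipped policy-gradient update of the VLM; that part is omitted here.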
Keep Unleashing Private-Sector Vitality to Consolidate the Economy's Positive Momentum
第一财经· 2025-07-16 01:10
Core Viewpoint
- The article highlights the resilience and growth of the Chinese economy, with GDP growth of 5.3% in the first half of the year, surpassing market expectations, and emphasizes the importance of policy support and the vitality of the private sector in driving economic development [1][2].

Economic Performance
- China's GDP grew by 5.3% year-on-year in the first half of the year, while the CPI decreased by 0.1%, indicating a stable economic environment despite external uncertainties [1].
- The core CPI increased by 0.4% year-on-year in June, reflecting slight inflationary pressure [1].

Policy Impact
- The article discusses the positive effects of macro and micro policies on economic growth, particularly the easing of regulatory burdens that allows the private sector to thrive [2][3].
- Recent policy changes, such as the removal of certain approval requirements for public events and commercial activities, are seen as steps toward reducing bureaucratic obstacles and fostering economic growth [2].

Private Sector Vitality
- The resilience of the private economy is highlighted, with examples of innovation in sectors like pharmaceuticals and artificial intelligence showcasing the potential for high-quality economic development [1][2].
- The article argues that a more relaxed regulatory environment will enable the private sector to flourish, contributing significantly to overall economic performance [2][3].

Demand and Supply Dynamics
- While M2 and social financing are high, effective consumer demand remains insufficient to absorb the increased supply, raising the risk of low-efficiency assets [3].
- The article emphasizes the need for a balanced approach to economic stimulus, ensuring that interventions do not harm the economy's intrinsic growth potential [3].

Recommendations for Improvement
- A suggestion is made to allocate part of the special long-term bonds to social welfare, which could enhance residents' disposable income and stimulate market consumption [4].
- The article advocates simplifying regulations and reducing taxes to revitalize the private economy, thereby creating an environment conducive to sustainable growth [4].
Making VLMs a Better Fit for Robots: Small VLMs Can Also Show Strong Visual Planning Capabilities
具身智能之心· 2025-07-15 13:49
Core Insights
- The article discusses the potential of large language models (LLMs) in robotic program planning, highlighting their ability to generate coherent action sequences but also noting their limitations in providing the sensory detail needed for physical execution [3][4].
- It introduces a new framework called SelfReVision, which improves small visual language models (VLMs) through self-distillation without external supervision, aiming to strengthen their planning capabilities in real-world scenarios [4][9].

Research Background
- LLMs show promise in generating action sequences but often lack the precision required for robotic tasks due to their reliance on human-centric training data [3].
- Visual language models (VLMs) can potentially address these limitations, but existing methods either require specialized simulation environments or are costly to train and deploy [3].

Methodology
- SelfReVision is proposed as a self-improvement framework that allows small VLMs to enhance their performance through iterative self-critique and revision [4][6].
- The framework operates in three stages (critique, revise, and verify), enabling models to generate and refine plans based on self-assessment; a minimal loop sketch follows this summary [4][10].

Experimental Setup
- Two types of experiments were conducted to evaluate SelfReVision's planning capabilities: image-based program planning and embodied-agent tasks [11].
- Evaluation metrics included coverage, ordering, completeness, overall quality, and a new metric called image groundedness [12].

Key Results
- SelfReVision significantly outperformed baseline models across metrics, achieving an average win rate of 68% on the PLACES dataset and 72% on the SIMULATION dataset [13].
- Larger models benefited more from SelfReVision, with an average gain of 74% for models with 12 billion parameters or more [13].

Comparison with Other Methods
- SelfReVision demonstrated clear advantages over other methods like Best-of-N and PaliGemma, with improvements of 60% in most settings compared to modest gains from Best-of-N [17].
- Compared to GPT-4o, SelfReVision's plans had at least a 25% higher win rate for models with 12 billion parameters or more, indicating its effectiveness in strengthening smaller models [17].

Ablation Studies
- The complete Critique-Revise-Verify (CRV) process showed the strongest performance, with average win rates of 68.3% on the PLACES dataset and 71.9% on the SIMULATION dataset [18].
- Variants of the process showed significant performance drops, underscoring the importance of the verification step in filtering out suboptimal revisions [18].

Application in Embodied-Agent Tasks
- SelfReVision was tested in challenging scenarios, showing a 26% improvement for the Gemma 12B model and a 17% improvement for the Gemma 27B model on block-manipulation tasks [21].
- In hierarchical tasks, SelfReVision plans led to a 70% success rate in generating trajectories, surpassing the 61% success rate of baseline models [21].
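As a hedged illustration of the critique-revise-verify loop described under Methodology (a minimal sketch that assumes the model can be called as a plain text-in/text-out function; it is not the authors' implementation), the snippet below has the same model critique its own plan, rewrite it, and keep the revision only if a verification prompt prefers it. The prompts, generate callable, and stub model are hypothetical.

```python
from typing import Callable

def self_revise(generate: Callable[[str], str], task: str, rounds: int = 3) -> str:
    """Critique-Revise-Verify loop: critique the current plan, rewrite it,
    and accept the rewrite only if the verify step judges it better."""
    plan = generate(f"Write a step-by-step plan for: {task}")
    for _ in range(rounds):
        critique = generate(f"List concrete flaws in this plan for '{task}':\n{plan}")
        revised = generate(
            f"Rewrite the plan for '{task}' fixing these flaws:\n{critique}\n\nPlan:\n{plan}"
        )
        verdict = generate(
            f"Answer 'yes' or 'no': is plan B better than plan A for '{task}'?\nA:\n{plan}\nB:\n{revised}"
        )
        if verdict.strip().lower().startswith("yes"):  # verify step filters bad revisions
            plan = revised
    return plan

def stub_model(prompt: str) -> str:
    # Stand-in for a real VLM/LLM call; always approves the revision.
    return "yes" if prompt.startswith("Answer") else f"[stub plan for: {prompt[:40]}...]"

print(self_revise(stub_model, "stack the red block on the blue block"))
```

Swapping stub_model for a real VLM call (with the image prepended to each prompt) gives the image-grounded variant the summary describes.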
JPMorgan Chase (JPM.N) CEO Dimon: We have no reason to own a large language model.
news flash· 2025-07-15 12:54
Yicai Editorial: Keep Unleashing Private-Sector Vitality to Consolidate the Economy's Positive Momentum
Di Yi Cai Jing· 2025-07-15 12:51
Economic Performance
- China's GDP grew by 5.3% year-on-year in the first half of the year, exceeding market expectations, while CPI decreased by 0.1% [1]
- The resilience of the Chinese economy is attributed to both macro and micro policies, as well as the inherent strength and growth momentum of the economy [1]

Private Sector Dynamics
- The vitality of the private economy is crucial for economic recovery, with recent policy relaxations indicating a shift towards less regulatory burden [2]
- Examples of policy easing include the removal of approval requirements for large public events and simplified approval processes for commercial performances [2]

Market Environment
- The establishment of a unified national market and a legal business environment is essential for fostering economic growth [3]
- Current macroeconomic indicators show a need for balance between stimulating growth and avoiding detrimental interventions [3]

Recommendations for Economic Support
- A proposal suggests allocating part of the special long-term bonds to social welfare to enhance residents' disposable income, which could stimulate market consumption [4]
- The focus should be on creating a conducive environment for private-sector growth through reduced regulatory constraints and lower taxes [4]
All in One Article! A Roundup of Outstanding Autonomous Driving VLA Work from the Past Year
自动驾驶之心· 2025-07-15 12:30
Core Insights
- The article surveys recent advances in Vision-Language-Action (VLA) models for autonomous driving, highlighting the integration of navigation and reinforcement learning to extend reasoning beyond the visual range [2][3][6].

Group 1: NavigScene
- NavigScene is introduced as a novel auxiliary dataset that pairs local multi-view sensor inputs with global natural-language navigation guidance, addressing the critical gap between local perception and global navigation context in autonomous driving [6].
- Three complementary paradigms are implemented in NavigScene: navigation-guided reasoning, navigation-guided preference optimization, and navigation-guided VLA models, enhancing the reasoning and generalization capabilities of autonomous driving systems [6].
- Comprehensive experiments demonstrate significant performance improvements in perception, prediction, and planning tasks when global navigation knowledge is integrated into autonomous driving systems [6].

Group 2: AutoVLA
- AutoVLA is proposed as an end-to-end autonomous driving framework that integrates physical action tokens with a pre-trained VLM backbone, enabling direct policy learning and semantic reasoning from raw visual observations and language instructions [12].
- A reinforcement-learning-based post-training method using Group Relative Policy Optimization (GRPO) is introduced to achieve adaptive reasoning and further enhance model performance in end-to-end driving tasks [12].
- AutoVLA achieves competitive performance across multiple autonomous driving benchmarks, including open-loop and closed-loop tests [12].

Group 3: ReCogDrive
- ReCogDrive is presented as an end-to-end autonomous driving system that couples a VLM with a diffusion planner and uses a three-stage training paradigm to address performance drops in rare and long-tail scenarios [13][16].
- The first stage fine-tunes the VLM on a large-scale driving Q&A dataset to close the domain gap between general content and real-world driving scenarios [16].
- The method achieves a state-of-the-art PDMS score of 89.6 on the NAVSIM benchmark, highlighting its effectiveness and feasibility [16].

Group 4: Impromptu VLA
- Impromptu VLA introduces a large-scale, richly annotated dataset aimed at addressing the limitations of existing benchmarks for autonomous driving VLA models [22].
- The dataset is designed to improve VLA performance in unstructured extreme scenarios, demonstrating significant improvements on established benchmarks [22].
- Experiments show that training with the Impromptu VLA dataset leads to notable gains in closed-loop NeuroNCAP scores and collision rates [22].

Group 5: DriveMoE
- DriveMoE is a novel end-to-end autonomous driving framework that incorporates a mixture-of-experts (MoE) architecture to handle multi-view sensor data and complex driving scenarios effectively; a minimal gating sketch follows this summary [28].
- The framework features a scene-specific visual MoE and a skill-specific action MoE, addressing the challenges of multi-view redundancy and skill specialization [28].
- DriveMoE achieves state-of-the-art performance in closed-loop evaluations on the Bench2Drive benchmark, demonstrating the effectiveness of combining visual and action MoE in autonomous driving tasks [28].
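DriveMoE's routing details are not given in this summary; as a generic, hedged sketch of the top-k mixture-of-experts gating that such architectures typically build on (not DriveMoE's actual code), the snippet below scores all experts with a linear gate, keeps the two best, and mixes their outputs. All dimensions and weights here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_moe(x, expert_weights, gate_weights, k=2):
    """Generic top-k MoE layer: a linear gate scores every expert, the top-k
    scores are softmax-normalized, and only the selected experts run."""
    scores = x @ gate_weights                      # gating logits, shape (num_experts,)
    top = np.argsort(scores)[-k:]                  # indices of the k best experts
    probs = np.exp(scores[top] - scores[top].max())
    probs /= probs.sum()                           # softmax over the selected experts only
    outputs = np.stack([x @ expert_weights[i] for i in top])
    return probs @ outputs                         # weighted mixture of expert outputs

# Toy usage: 4 experts mapping an 8-dim feature to a 3-dim action embedding.
d_in, d_out, n_experts = 8, 3, 4
experts = rng.normal(size=(n_experts, d_in, d_out))
gate = rng.normal(size=(d_in, n_experts))
print(top_k_moe(rng.normal(size=d_in), experts, gate))
```

A scene-specific visual MoE and a skill-specific action MoE would use the same gating pattern, with the gate conditioned on camera-view or driving-skill features respectively.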
TACTILE-VLA: Activating the Physical Knowledge in VLA Models for Tactile Generalization (Latest from Tsinghua University)
具身智能之心· 2025-07-15 07:55
Core Insights
- The article discusses the development of Tactile-VLA, a model that integrates visual, language, action, and tactile perception to enhance robotic capabilities in contact-intensive tasks [2][6].

Group 1: Background and Core Issues
- Visual-language-action (VLA) models have strong semantic understanding and cross-modal generalization capabilities, but they struggle in contact-intensive scenarios due to a lack of tactile perception [2][6].
- Tactile perception provides critical feedback in physical interactions, such as friction and material properties, which are essential for tasks requiring fine motor control [2][6].

Group 2: Key Innovations and Research Goals
- The core finding is that VLA models contain prior knowledge of physical interactions; connecting this knowledge with tactile sensors activates it and enables zero-shot generalization in contact-intensive tasks [6][7].
- The Tactile-VLA framework introduces tactile perception as a primary modality, allowing a direct mapping from abstract semantics to physical force control [7].
- The hybrid position-force controller converts force targets into position-adjustment commands, addressing the challenge of coordinating position and force control (a minimal control-loop sketch follows this summary) [7].

Group 3: Architecture and Mechanisms
- Tactile-VLA's architecture includes four key modules: instruction adherence to tactile cues, application of tactile-related common sense, adaptive reasoning through tactile feedback, and a multi-modal encoder for unified token representation [12][13].
- The hybrid position-force control mechanism ensures positional precision while allowing fine-grained force adjustments during contact tasks [13].
- The Tactile-VLA-CoT variant incorporates a chain-of-thought (CoT) reasoning mechanism, enabling robots to analyze failure causes from tactile feedback and autonomously adjust their strategy [13][14].

Group 4: Experimental Validation and Results
- Three experimental setups were designed to validate Tactile-VLA's capabilities in instruction adherence, common-sense application, and adaptive reasoning [17].
- In the instruction-adherence experiment, Tactile-VLA achieved a success rate of 35% on the USB task and 90% on the charger task, significantly outperforming baseline models [21][22].
- The common-sense experiment demonstrated Tactile-VLA's ability to adjust interaction forces based on object properties, achieving success rates of 90%-100% for known objects and 80%-100% for unknown objects [27].
- The adaptive-reasoning experiment showed that Tactile-VLA-CoT could complete a blackboard task with an 80% success rate, demonstrating its problem-solving capabilities through reasoning [33].
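The summary describes a hybrid position-force controller that turns force targets into position-adjustment commands. The exact controller is not specified here, so the following is a hedged, admittance-style sketch of that general idea, with illustrative gains and a toy spring-contact model standing in for a real tactile sensor.

```python
def force_to_position_step(x_cmd, f_target, f_measured, k_f=0.002, dx_max=0.001):
    """One outer-loop step of a simple hybrid position/force scheme:
    the force error along the contact normal is mapped to a small position
    correction, clipped so the inner position controller stays stable."""
    dx = k_f * (f_target - f_measured)        # admittance-style gain (illustrative value)
    dx = max(-dx_max, min(dx_max, dx))        # limit the per-step correction (meters)
    return x_cmd + dx

# Toy usage: push toward a 5 N contact force; the contact is modeled as a
# spring (1000 N/m) engaged once the commanded position passes 0.10 m.
x_cmd, surface, stiffness = 0.10, 0.10, 1000.0
for _ in range(10):
    f_meas = max(0.0, stiffness * (x_cmd - surface))
    x_cmd = force_to_position_step(x_cmd, f_target=5.0, f_measured=f_meas)
    print(f"x={x_cmd:.4f} m, f={f_meas:.2f} N")
```

In a Tactile-VLA-style system, the force target in this loop would be the quantity predicted from language and tactile tokens, while the position command still comes from the usual action head.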
Tesla Optimus V3 Is Coming!!
Robot猎场备忘录· 2025-07-15 04:18
Core Viewpoint
- The recent developments surrounding Tesla's Optimus robot, including order cuts and leadership changes, have led to significant fluctuations in the robotics sector, particularly affecting T-chain concept stocks. However, these changes are seen as necessary adjustments for the future success of the Optimus project, paving the way for the upcoming Optimus V3 model [1][2].

Group 1: Market Reactions
- On June 19, news of order cuts for Tesla's Optimus robot supplier caused a decline in the robotics sector, with T-chain concept stocks experiencing significant drops [1].
- Following the confirmation of the order cuts and the postponement of mass production plans, T-chain stocks like Zhejiang Rongtai (603119.SH) hit their daily limit down, while others like Beite Technology and Sanhua Intelligent Control fell over 4% [1].
- Despite a general recovery in the robotics sector on June 23, T-chain stocks continued to decline, indicating ongoing market concerns [1].

Group 2: Project Developments
- The Optimus project is undergoing a redesign of its hardware and software, with a short-term reduction in delivery volumes anticipated. However, the project's importance remains unchanged as it aims for a more robust and reliable next-generation product [2].
- Milan Kovac, the original project leader for Optimus, has left the team, likely due to the need for a new direction in technology development [2].
- Elon Musk announced on June 25 that Optimus V3 will integrate the Grok voice assistant, utilizing AI language models for interaction, indicating a significant technological advancement [3].

Group 3: Future Prospects
- On July 10, during the xAI launch, Musk reiterated that Grok 4 will be integrated into Tesla's Optimus, aiming for a real-world reinforcement learning loop by the end of the year [5].
- Recent orders for over 100 units of the Optimus robot were reported, suggesting that hardware redesigns are progressing well [6].
- Musk expressed confidence in the latest developments of Optimus, stating that the upcoming demonstrations will be the most impressive to date [6].

Group 4: Industry Dynamics
- The T-chain concept stocks have shown positive momentum recently, with several companies releasing favorable news [8].
- The market is closely watching the upcoming Tesla quarterly meeting on July 24 and the shareholder meeting on November 6 for further insights into the Optimus project and its supply chain [9].
- The acquisition of the Sci-Tech Innovation Board listed company by Zhiyuan Robotics has led to a significant increase in market capitalization, highlighting the growing interest in humanoid robotics [9].
"The U.S. Has Basically Pulled Out; It's All China Now"
Guan Cha Zhe Wang· 2025-07-15 04:08
Core Viewpoint
- Meta is considering a significant shift in its AI strategy by potentially moving from open-source AI models to closed-source models, which could mark a departure from its long-standing commitment to open-source development [1][5][6]

Group 1: Strategic Shift
- Meta's newly established "Super Intelligence Lab" (MSL) is contemplating abandoning its powerful open-source AI model, Behemoth, in favor of developing a closed-source model [1][5]
- This potential shift is seen as a major strategic change for Meta, which has historically believed that open-source technology fosters faster AI development and broader access for developers [5][6]
- The decision is reportedly influenced by the underperformance of the Behemoth model during internal testing, leading to delays in its release [5][6]

Group 2: Leadership and Talent Acquisition
- Meta has appointed Alexandr Wang, the new AI head, who previously led Scale AI, to oversee the Super Intelligence Lab, which consists of a specialized team of about 12 members [6][7]
- The company has adopted a "high-paying talent acquisition" strategy, offering salaries exceeding $100 million to attract top researchers from competitors like OpenAI, Google, and Apple [5][6]

Group 3: Market Implications
- The shift towards closed-source models could signify a retreat from the competitive landscape of open-source large language models (LLMs), with concerns raised about the U.S. losing its edge in this area [1][3]
- The ongoing developments in Meta's AI strategy are closely watched, especially as the company faces challenges in the AI technology sector [5][6]
More Effective than Adam: POET Starts from the Principle of Spectral Invariance to Make LLM Training Both Stable and Fast
机器之心· 2025-07-15 00:59
Core Viewpoint
- The article discusses a novel training paradigm for large language models (LLMs) called POET (Reparameterized Training via Orthogonal Equivalence Transformation), which aims to enhance training efficiency and stability from first principles [2][3].

Group 1: POET Methodology
- POET structurally reparameterizes each neuron by combining two learnable orthogonal matrices with a fixed random weight matrix, maintaining the singular-value distribution of the weights during training (a numerical sketch of this invariance follows this summary) [3][11].
- The method combines singular-value invariance with minimal hyperspherical energy, providing a new paradigm that offers both physical interpretability and generalization capability for large-model training [3][11].
- POET's training process is designed to stabilize optimization and significantly improve model generalization [3][11].

Group 2: Advantages of POET
- POET maintains the spectral properties of the weight matrix throughout training, keeping the singular values consistent with those of the randomly initialized matrix [17].
- The method allows efficient parameter control and avoids the excessively large singular values that can arise in standard LLM training [17].
- Two new initialization strategies, normalized Gaussian initialization and uniform spectrum initialization, are proposed to ensure bounded singular values in the generated weight matrices [17].

Group 3: Training Dynamics and Performance
- Experimental results demonstrate POET's superior performance in training large language models, including improvements in perplexity and training efficiency compared with traditional methods like AdamW [20][24].
- POET's training process is divided into three phases: conical-shell searching, stable learning on the conical shell, and final adjusting, reflecting the evolution of the orthogonal matrices during training [40][41].
- The fully stochastic sampling approach used in POET allows a significant reduction in memory cost compared with traditional methods, enhancing scalability [26][27].
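As a hedged numerical check of the spectral-invariance principle the summary describes (reparameterizing a fixed random weight matrix W0 with two orthogonal factors, written here as W = R @ W0 @ P; this is an illustration, not the authors' training code), the snippet below verifies that the singular values of the reparameterized matrix match those of W0.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Sample an orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

d_out, d_in = 6, 4
w0 = rng.normal(size=(d_out, d_in))      # fixed random weight matrix
r = random_orthogonal(d_out)             # left orthogonal factor (learned in POET, sampled here)
p = random_orthogonal(d_in)              # right orthogonal factor (learned in POET, sampled here)
w = r @ w0 @ p                           # reparameterized weight the layer actually uses

# Orthogonal left/right factors cannot change the singular values of w0,
# so the spectrum (and hence the spectral norm) stays fixed while r and p evolve.
print(np.allclose(np.linalg.svd(w, compute_uv=False),
                  np.linalg.svd(w0, compute_uv=False)))   # -> True
```

Because orthogonal matrices have unit singular values, multiplying on the left and right can only rotate the singular vectors, never stretch them, which is why the weight spectrum stays bounded no matter how the two learnable factors change during training.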