Multimodal Models
The most comprehensive robot manipulation survey to date, covering up to 1,200 papers! Jointly released by eight institutions
自动驾驶之心· 2025-10-14 23:33
Core Insights
- The article discusses the rapid advances in artificial intelligence, particularly in embodied intelligence, which connects cognition and action, and emphasizes the importance of robot manipulation in achieving artificial general intelligence (AGI) [5][9].

Summary by Sections

Overview of Robot Manipulation
- The paper "Towards a Unified Understanding of Robot Manipulation: A Comprehensive Survey" provides a comprehensive overview of the field, detailing the evolution from rule-based control to intelligent control systems that integrate reinforcement learning and large models [6][10].

Key Challenges in Embodied Intelligence
- Robot manipulation is identified as a core challenge in embodied intelligence because it requires the seamless integration of perception, planning, and control, which is essential for real-world interaction in diverse, unstructured environments [9][10].

Unified Framework
- A unified understanding framework is proposed that expands the traditional high-level planning plus low-level control paradigm to include language, code, motion, affordance, and 3D representations, strengthening the semantic decision-making role of high-level planning [11][21] (a minimal sketch of this two-level structure follows this summary).

Classification of Learning Control
- A novel classification of low-level learned control is introduced, dividing it into input modeling, latent learning, and policy learning, providing a systematic perspective for research on low-level control [24][22].

Bottlenecks in Robot Manipulation
- The article identifies two major bottlenecks in robot manipulation: data collection and utilization, and system generalization, and summarizes existing research progress and solutions for both [27][28].

Future Directions
- Four key future directions are highlighted: building a true "robot brain" for general cognition and control, breaking the data bottleneck with scalable data generation and utilization, enhancing multimodal perception for complex object interactions, and ensuring the safety of human-robot coexistence [35][33].
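The survey's organizing idea, a semantic high-level planner that emits intermediate representations (language, code, motion, affordance, 3D) which a learned low-level controller consumes, can be sketched in a few lines. The classes and method names below are illustrative assumptions, not code from the survey or any cited system.

```python
# Hypothetical two-level manipulation pipeline: a planner turns an instruction into
# intermediate representations, and a learned policy turns each one into joint actions.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Intermediate:
    kind: str        # "language" | "code" | "motion" | "affordance" | "3d"
    payload: object  # e.g. a subgoal string, a waypoint array, or a voxel grid

class HighLevelPlanner:
    """Semantic decision-making: instruction + observation -> intermediates."""
    def plan(self, instruction: str, obs: np.ndarray) -> List[Intermediate]:
        # A real system might query an LLM/VLM here; this stub returns fixed outputs.
        return [Intermediate("language", "grasp the red cup"),
                Intermediate("motion", np.zeros((10, 7)))]  # 10 waypoints for a 7-DoF arm

class LowLevelPolicy:
    """Learned control: intermediate + observation -> joint-space action."""
    def act(self, sub: Intermediate, obs: np.ndarray) -> np.ndarray:
        # Placeholder for a trained policy (input modeling, latent learning, or policy learning).
        return np.zeros(7)

def run_episode(instruction: str, planner: HighLevelPlanner, policy: LowLevelPolicy) -> None:
    obs = np.zeros(64)  # stand-in for fused camera and proprioception features
    for sub in planner.plan(instruction, obs):
        action = policy.act(sub, obs)  # one control step per intermediate, for brevity
        print(sub.kind, action.shape)

run_episode("put the cup on the shelf", HighLevelPlanner(), LowLevelPolicy())
```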
The most comprehensive robot manipulation survey to date, covering up to 1,200 papers! Jointly released by eight institutions including Xi'an Jiaotong University, HKUST, and Peking University
具身智能之心· 2025-10-14 03:50
Core Insights
- The article discusses the rapid advances in artificial intelligence, particularly in embodied intelligence, which connects cognition and action, and emphasizes the importance of robot manipulation in achieving artificial general intelligence (AGI) [3][4].

Summary by Sections

Overview of Embodied Intelligence
- Embodied intelligence is highlighted as a crucial frontier that enables agents to perceive, reason, and act in real environments, moving from mere language understanding to actionable intelligence [3].

Paradigm Shift in Robot Manipulation
- Research in robot manipulation is undergoing a paradigm shift, integrating reinforcement learning, imitation learning, and large models into intelligent control systems [4][6].

Comprehensive Survey of Robot Manipulation
- The survey "Towards a Unified Understanding of Robot Manipulation" systematically organizes over 1,000 references, covering hardware, control foundations, task and data systems, and cross-modal generalization research [4][6][7].

Unified Framework for Understanding Robot Manipulation
- The article proposes a unified framework that extends the traditional high-level planning and low-level control classification, incorporating language, code, motion, affordance, and 3D representations [9][20].

Key Bottlenecks in Robot Manipulation
- Two major bottlenecks are identified: data collection and utilization, and system generalization, with a detailed analysis of existing solutions [27][28].

Future Directions
- Four key future directions are proposed: building a true "robot brain" for general cognition and control, breaking the data bottleneck with scalable data generation and utilization, enhancing multimodal perception for complex interactions, and ensuring the safety of human-robot coexistence [34].
How the Hang Seng big-tech names performed over the holiday
小熊跑的快· 2025-10-09 05:06
Core Insights
- The article discusses recent performance and developments in the tech sector, focusing on AMD and its tie-up with OpenAI, as well as advances in AI models such as Sora 2 [1][3][4].

Group 1: AMD and AI Integration
- AMD has been brought into a closed-loop AI ecosystem, which is seen as a positive development despite uncertainty over TSMC's production capacity for 3nm and 2nm chips [1][3].
- Traditional cloud companies may not participate in this closed loop because of their conservative management styles and focus on stable returns [3].

Group 2: Sora 2 Model
- Sora 2, the successor to the original Sora unveiled in February 2024, is described as an advance in video generation comparable to GPT-3.5, capable of complex tasks such as simulating Olympic gymnastics movements [3].
- OpenAI's Sora 2 is noted for improved controllability and the ability to follow intricate instructions across multiple scenes while maintaining continuity in the generated content [3].

Group 3: Market Performance
- The Sora app achieved the highest download volume during the National Day holiday, indicating strong market interest [4].
- The Hang Seng Tech Index ETF (513180.SH) has risen 43% year to date, with a notable rise of 34.7% since early October [9][13].
- The Hang Seng Tech Index trades at 24.9 times earnings, far below the 204 times earnings of the STAR Market, suggesting potential for catch-up in performance [13].
Being-VL's visual BPE route: truly unifying "seeing" and "speaking"
机器之心· 2025-10-09 02:24
Core Insights
- The article discusses the limitations of traditional multimodal models, in particular how CLIP-style encoders prematurely align visual representations with the text space, which can produce hallucinations on detailed, non-language-dependent queries [2][6]
- A new method called Being-VL is proposed, which takes a post-alignment approach: images are first given a discrete representation of their own and only then aligned with text, preserving visual structure and reducing the risk of information loss [2][3]

Being-VL Implementation
- Being-VL consists of three main steps: quantizing images into discrete VQ tokens with VQ-GAN, training a visual BPE tokenizer that weighs both co-occurrence frequency and spatial consistency, and finally unifying visual and text tokens into a single sequence for modeling [3][10] (a minimal sketch of the merge step appears after this summary)
- The visual BPE tokenizer prioritizes both frequency and spatial consistency to build a more semantically and structurally meaningful token set that is independent of text [8][9]

Training Strategy
- The training process is divided into three stages:
  1. **Embedding Alignment**: only the new visual token embeddings are trained while the remaining parameters are frozen, preserving existing language capabilities [12]
  2. **Selective Fine-tuning**: a portion of the LLM layers is unfrozen to enable cross-modal interaction at lower representation levels [12]
  3. **Full Fine-tuning**: all layers are unfrozen for comprehensive training on complex reasoning and instruction data [12][10]

Experimental Results
- Experiments indicate that discretizing images, applying visual BPE, and modeling them jointly with text improves reliability on detail-sensitive queries and reduces hallucinations compared with traditional methods [14][16]
- The study also highlights the importance of a gradual training schedule: combining progressive unfreezing with curriculum learning significantly outperforms single-stage training [14][10]

Visual BPE Token Activation
- Visualizing the embedding weights shows that visual BPE yields a more balanced weight distribution between text and visual tokens, indicating a smaller modality gap and improved cross-modal attention [16][19]

Token Size and Training Efficiency
- The research examines how BPE vocabulary size affects training efficiency, finding an optimal balance in resource-limited settings, while larger vocabularies can bring diminishing returns due to sparsity [19][20]

Development and Summary
- The evolution from Being-VL-0 to Being-VL-0.5 reflects enhancements to the unified modeling framework, incorporating priority-guided encoding and a structured training approach [20][24]
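To make the visual BPE idea concrete, here is a minimal, hedged sketch of the merge-selection step on a batch of VQ-token grids. It assumes the images have already been quantized by a VQ-GAN into 2D grids of codebook ids; the scoring that mixes co-occurrence frequency with a spatial-consistency bonus is a simplified stand-in for Being-VL's actual criterion, and all function names are illustrative.

```python
# Simplified visual-BPE merge selection: score adjacent VQ-token pairs by frequency
# plus a crude spatial-consistency bonus, then greedily assign new token ids.
from collections import Counter
import numpy as np

def pair_scores(grids, alpha=0.5):
    """Score horizontally and vertically adjacent token-id pairs across a batch of 2D grids."""
    freq, horiz = Counter(), Counter()
    for g in grids:
        h, w = g.shape
        for i in range(h):
            for j in range(w):
                if j + 1 < w:                      # horizontal neighbor
                    p = (int(g[i, j]), int(g[i, j + 1]))
                    freq[p] += 1
                    horiz[p] += 1
                if i + 1 < h:                      # vertical neighbor
                    p = (int(g[i, j]), int(g[i + 1, j]))
                    freq[p] += 1
    scores = {}
    for p, f in freq.items():
        # Pairs that mostly occur in one orientation are treated as more structurally
        # consistent than pairs that appear in random orientations.
        consistency = max(horiz[p], f - horiz[p]) / f   # in [0.5, 1.0]
        scores[p] = f * (1.0 + alpha * consistency)
    return scores

def learn_merges(grids, num_merges, codebook_size):
    """Greedy BPE-style selection of merge rules (pair -> new id). For brevity the
    grids are not rewritten between merges, which a real tokenizer would do."""
    ranked = sorted(pair_scores(grids).items(), key=lambda kv: -kv[1])
    return [(pair, codebook_size + k) for k, (pair, _) in enumerate(ranked[:num_merges])]

# Toy usage: two 4x4 grids of VQ codebook ids drawn from a codebook of size 8.
rng = np.random.default_rng(0)
grids = [rng.integers(0, 8, size=(4, 4)) for _ in range(2)]
print(learn_merges(grids, num_merges=3, codebook_size=8))
```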
Alibaba's Tongyi Qianwen technical lead assembles an internal robotics AI team
Xin Lang Cai Jing· 2025-10-08 15:57
Core Insights
- Alibaba has established a "Robotics and Embodied AI Group" to enhance its AI capabilities [1]
- The new team is part of the Tongyi Qianwen initiative, which focuses on developing flagship AI foundational models [1]
- Lin Junyang, the technical head of Tongyi Qianwen, is involved in the development of multimodal models that can process voice, image, and text inputs [1]
- These multimodal models are being transformed into foundational agents capable of executing long-sequence reasoning tasks, with applications expected to transition from the virtual world to the real world [1]
In-Depth Analysis of Big Tech AI Models
2025-09-28 14:57
Summary of Conference Call Records

Industry Overview
- The conference call focuses on the AI model landscape in China, highlighting the challenges and advancements in the domestic AI industry compared with international counterparts [1][2][4][5].

Key Points and Arguments
1. **Architecture and Innovation**
   - Domestic AI models rely heavily on overseas architectures such as Transformer and MoE, making it difficult to surpass foreign models [1][2].
   - China lacks self-developed, breakthrough architectural innovations, which hampers competitiveness [2].
2. **Computational Power**
   - Chinese AI companies have significantly lower GPU compute than international giants such as Microsoft, Google, and Meta, often by an order of magnitude [2].
   - The ongoing US-China trade war has restricted resource availability, further limiting compute capacity [1][2].
3. **Cost and Performance Focus**
   - Domestic models prioritize inference cost and cost-effectiveness, in line with local consumer habits, while international models such as GPT focus on top-tier performance [1][2].
   - These differences in commercial models create a substantial gap in model capabilities [2].
4. **Data Acquisition**
   - China's relatively lenient data laws provide an advantage in acquiring training data, unlike the stringent regulations in Europe and the US [3].
5. **Open Source Strategies**
   - Alibaba adopts a nearly fully open-source strategy, including model weights, code, and training data, to expand its influence and integrate its cloud services [4].
   - Other companies such as ByteDance and Kuaishou are more selective in their open-source approaches because they rely on proprietary technology [4].
6. **Multimodal Model Developments**
   - Domestic companies are making progress on multimodal models, focusing on e-commerce and short-video applications that cater to local needs [5][6][7].
   - Alibaba, Kuaishou, Tencent, and ByteDance are developing models that integrate text, image, audio, and video generation [7][8].
7. **MoE Architecture Adoption**
   - The MoE architecture is becoming standard among major companies, reducing computational cost and inference time [10] (see the routing sketch after this summary).
   - Future optimization directions include more precise input allocation, differentiated expert structures, and improved training stability [10][11].
8. **Economic Viability of Large Models**
   - Starting in mid-2024, pricing for APIs and consumer services is expected to fall as previously constrained GPU resources are released [13].
   - The overall cost conversion rate in the large-model industry is increasing, despite initially low profit margins [13][14].
9. **Competitive Differentiation**
   - Key competitive differences among leading domestic firms will come from their distinct strategies for technology iteration, data accumulation, and business models [15].
10. **Future Trends and Innovations**
    - The focus will shift toward agent systems that integrate user understanding and tool invocation, improving overall efficiency [16].
    - The MCP concept will gain traction, addressing data input-output connections and reducing integration costs [22].

Additional Important Insights
- The acceptance of paid services among domestic users is low, with conversion rates around 3% to 5%, indicating a need for better user experience to increase willingness to pay [20][21].
- Successful AI product cases include interactive systems that combine companionship with professional analysis, indicating a potential path to monetization [22].

This summary encapsulates the critical insights from the conference call, providing a comprehensive overview of the current state and future directions of the AI industry in China.
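As a reference point for the MoE discussion above, the sketch below shows a generic top-k mixture-of-experts layer in PyTorch. It illustrates why MoE reduces per-token compute (only k of the experts run for each token); it is not drawn from any specific company's model, and all dimensions and names are illustrative.

```python
# Generic top-k MoE layer: each token is routed to only k of E expert MLPs, so
# per-token compute stays roughly constant while total parameters scale with E.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (num_tokens, d_model)
        logits = self.router(x)
        weights, idx = logits.topk(self.k, dim=-1)      # choose k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                      # simple loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TopKMoE()(tokens).shape)                          # torch.Size([16, 64])
```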
This Chinese "cyber gaming companion" has made it into the Tokyo Game Show
Hu Xiu· 2025-09-28 07:17
Core Insights
- The Tokyo Game Show (TGS) is the largest in its 29-year history, covering 160,000 square meters with over 1,000 participating companies, yet only one AI-related company is present [1][3]
- The focus of TGS attendees is not primarily on AI, indicating a gap between AI advancements and gaming interests [3][14]
- The AI gaming company "Xinying Suixing" is the only domestic AI company to secure a spot at TGS, showcasing its potential in the market [3][4]

Company Overview
- "Xinying Suixing" aims to combine AI with gaming, focusing on creating virtual companions for players, a distinct approach compared with traditional AI chatbots [6][8]
- The founder, Liu Binxin, emphasizes the importance of understanding user needs and data utilization when developing AI products [21][30]
- The company has grown rapidly, with global users increasing from 9 million to 10 million within a month, although the number of paying users remains low [28][29]

Market Trends
- The AI gaming sector is viewed as a potential battleground for future competition, despite currently limited participation from major players [14][15]
- Liu Binxin believes large gaming companies may enter the AI space but will not share data or resources because of their corporate culture [17][18]
- The company is exploring a transition from a consumer-focused model to a business-to-business (B2B) strategy, aiming to collaborate with game developers on advertising opportunities [29][30]

Challenges and Opportunities
- The company faces challenges in establishing a local presence in Japan, which is crucial for B2B partnerships given local business culture [31][32]
- Despite these challenges, Liu Binxin remains optimistic about the global potential of AI products, suggesting that successful models can emerge from China [28][30]
Betting on the next-generation "operating system" and "computer": Alibaba unveils a series of new moves
Zheng Quan Shi Bao Wang· 2025-09-24 15:44
Core Insights
- The realization of artificial general intelligence (AGI) is seen as a certainty, with the ultimate goal being artificial superintelligence (ASI) that can self-iterate and surpass human capabilities [2]
- Alibaba's CEO predicts that large models will serve as the next-generation "operating system," while the Super AI Cloud will be the next-generation "computer" [2][3]

AI Infrastructure Investment
- Alibaba is advancing a three-year plan to invest 380 billion yuan in AI infrastructure, with further investment planned [3]
- By 2032, the energy consumption of Alibaba Cloud's global data centers is expected to be ten times its 2022 level [3]

Global Expansion
- Alibaba Cloud announced a global infrastructure expansion plan, establishing new cloud computing regions in Brazil, France, and the Netherlands, and expanding data centers in Mexico, Japan, South Korea, Malaysia, and Dubai [4]
- Alibaba Cloud currently operates 91 availability zones across 29 regions, making it the largest cloud service provider in China and the leading provider in Asia-Pacific [4]

AI Model Development
- Alibaba launched seven new large-model products at the conference, covering language, speech, vision, and multimodal models [5]
- The flagship model Qwen3-Max is reported to outperform competitors such as GPT-5 and Claude Opus 4, ranking among the top three globally [5]

Collaboration with NVIDIA
- Alibaba Cloud announced a partnership with NVIDIA in the Physical AI domain, integrating NVIDIA's software stack into its AI platform to enhance data preprocessing, simulation, and model training [7]

AI Penetration in Industries
- AI technology is accelerating its penetration across industries, with over 200,000 developers creating more than 800,000 agents on Alibaba's platform [8]
- Notable applications include ICBC's "Merchant Intelligent Review Assistant" and NetEase's AI-assisted game development, showing significant efficiency gains [9]
Huawei unveils major new products
中国基金报· 2025-09-24 10:53
Core Viewpoint
- Huawei continues to lead the global wearable device market with innovative products and a comprehensive product line, as evidenced by the recent launch of the HUAWEI WATCH GT 6 series and other devices [1][9]

Summary by Sections

HUAWEI WATCH GT 6 Series
- The HUAWEI WATCH GT 6 series includes two sizes, 41mm and 46mm, with the GT 6 Pro available only in 46mm [4]
- The series features a battery capacity increase of 65% over the previous generation; the GT 6 Pro and the 46mm version offer up to 21 days of battery life under light usage, and the 41mm version up to 14 days [4][5]
- The GT 6 series incorporates an upgraded sensing system and supports cycling power simulation and automatic cycling recognition [4]

Pricing and Sales
- The GT 6 series starts at 1,588 CNY for the 46mm model and 1,488 CNY for the 41mm model, with pre-sales starting on September 29 [5]
- The GT 6 Pro starts at 2,488 CNY, with pre-sales beginning on October 14 [5]
- Cumulative global shipments of the WATCH GT series have exceeded 54 million units since its 2018 launch, maintaining Huawei's leadership in the wearable market [5]

HUAWEI FreeClip 2 Earphones
- The HUAWEI FreeClip 2 earphones were also launched, starting at 1,299 CNY and available in three colors [6][7]
- The earphones use Huawei's third-generation audio chip and an NPU AI processor, improving call quality and supporting various smart features [7]

Market Growth and Position
- The global wearable device market is growing rapidly, with IDC forecasting a 12.3% year-on-year increase in wrist-worn device shipments by Q2 2025 [9]
- Huawei holds a 20.2% share of the global market and has been the top seller for two consecutive quarters, while in China it maintains a 33.4% market share [9]
- Since 2015, Huawei has shipped a total of 200 million wearable devices, showcasing its strong brand appeal and ecosystem integration [10]

Future Outlook
- The wearable device market is expected to keep expanding, with the global market projected to exceed $100 billion by 2025 and the Chinese market to surpass 100 billion CNY [10]
- The growth of medical-grade wearable devices is particularly notable, with a compound annual growth rate exceeding 40% [10]
- Advances in AI technology and rising consumer demand for health monitoring are driving wearables from single-function products toward comprehensive health management solutions [10]
WeChat-YATT bursts onto the scene: where is Tencent's reinforcement learning strategy heading?
Sou Hu Cai Jing· 2025-09-24 09:56
Core Insights
- Tencent's open-sourcing of the WeChat-YATT training library signals a strategic move in the competitive landscape of AI model training, particularly as OpenAI's GPT-5 approaches release [1][2]
- WeChat-YATT is designed with a focus on reinforcement learning and multimodal models, differentiating itself from mainstream frameworks such as TensorFlow and PyTorch [2]

Group 1: WeChat-YATT's Innovations
- WeChat-YATT is described as making breakthroughs in three areas: more efficient parameter updates for reinforcement learning, flexible multimodal data fusion interfaces, and a modular design that lowers the barrier to distributed training [2][4] (a generic sketch of the kind of RL fine-tuning loop such a trainer targets appears after this summary)
- The library's emphasis on "ease of extensibility" reflects Tencent's recognition of the need for rapid iteration in large-model training [4]

Group 2: Competitive Positioning
- Compared with Meta's PyTorch, WeChat-YATT is said to excel in reinforcement learning support; against Google's JAX, it claims advantages in Chinese-language scenarios and multimodal processing [4]
- WeChat-YATT's deep integration with the WeChat ecosystem sets it apart from similar reinforcement learning frameworks such as Ray RLlib [4]

Group 3: Strategic Implications
- The release of WeChat-YATT aligns with Tencent's broader AI strategy, which includes trademark applications for a "WeChat AI Service Platform" and the deployment of the Hunyuan model in business scenarios [7]
- Tencent aims to create a closed-loop AI ecosystem through foundational technology breakthroughs and application deployment, with WeChat-YATT serving as a critical component of this strategy [7]
- The focus on reinforcement learning signals Tencent's commitment to key areas such as gaming, recommendation systems, and autonomous driving, positioning it for future AI applications [7]

Group 4: Long-term Vision
- The name WeChat-YATT, "Yet Another Transformer Trainer," reflects both a sense of humor and Tencent's long-term investment in AI infrastructure [6]
- Competition in the era of large models is fundamentally a competition over infrastructure, with WeChat-YATT representing one piece of Tencent's broader AI blueprint [7]
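For readers unfamiliar with what a reinforcement-learning trainer for Transformers actually does, the sketch below shows a bare-bones REINFORCE-style update on a tiny causal language model. This is not the WeChat-YATT API, whose interfaces are not documented here; every class, function, and dimension below is an illustrative assumption about the general workload such a library targets.

```python
# Generic policy-gradient fine-tuning step for a tiny causal LM (illustrative only;
# a GRU stands in for a Transformer body to keep the sketch short).
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    def __init__(self, vocab=100, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.body = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                      # tokens: (batch, seq)
        h, _ = self.body(self.embed(tokens))
        return self.head(h)                         # logits: (batch, seq, vocab)

def sample_and_update(model, opt, prompts, reward_fn, max_new=8):
    """Sample continuations, score them with reward_fn, apply one policy-gradient step."""
    tokens, logps = prompts.clone(), []
    for _ in range(max_new):
        logits = model(tokens)[:, -1]               # next-token distribution
        dist = torch.distributions.Categorical(logits=logits)
        nxt = dist.sample()
        logps.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt[:, None]], dim=1)
    rewards = reward_fn(tokens)                     # (batch,) scalar reward per sample
    advantage = rewards - rewards.mean()            # simple mean baseline
    loss = -(torch.stack(logps, dim=1).sum(dim=1) * advantage).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = TinyCausalLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
prompts = torch.randint(0, 100, (4, 5))             # 4 dummy prompts of length 5
toy_reward = lambda seqs: (seqs[:, -8:] % 2 == 0).float().mean(dim=1)  # reward even token ids
print(sample_and_update(model, opt, prompts, toy_reward))
```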