Multimodal Vision-Language Models (VLM)
星海图 (Starry Sea) Partner and CFO Luo Tianqi: Embodied Intelligence Is Still in the Early Stage of a Technology Race
Mei Ri Jing Ji Xin Wen · 2026-02-12 10:47
Core Insights
- The embodied intelligence industry stands at a crossroads of capital and industrial focus: financing is increasing and technology demonstrations are frequent, yet stability, scalability, and cost control remain unresolved challenges [1]

Group 1: Financing and Valuation
- Starry Sea has completed a Series B financing round of 1 billion yuan, bringing its total financing to nearly 3 billion yuan and its valuation to 10 billion yuan, making it a unicorn in the embodied intelligence sector [1]
- The CFO of Starry Sea emphasizes that success in the AI industry is driven by the Scaling Law, and that the efficiency of capital utilization matters more than the amount of financing raised [1][2]

Group 2: Industry Dynamics
- The current phase of the embodied intelligence industry is compared to the "Hundred Groups War"; companies are advised to focus on understanding the essence of the business rather than technology alone [2]
- The industry is transitioning from early-stage technology exploration to resource-intensive competition, with capital shifting from broad investment to concentration on leading companies [2]

Group 3: Commercialization and Technology
- The commercialization of embodied intelligence is divided into technology-driven and business-driven tracks, each with specific operational boundaries that must be met for successful deployment [4]
- The CFO believes the industry is still in the early stages of a technological race, and companies must retain sufficient funds to cope with the rising costs of data and model training [2][4]

Group 4: Financial Potential and Business Model
- The ToB (business-to-business) segment of embodied intelligence has significant revenue potential, with large orders capable of generating substantial income, but the focus should remain on revenue quality metrics [5]
- The long-term business model in this industry is likened to selling "tokens of the physical world," with the real barriers being intelligence levels and the ability to design and manufacture hardware [5]

Group 5: Competitive Advantages
- China's data supply chain advantages make data collection significantly more cost-effective than in the U.S., allowing more data to be gathered at lower cost [6]
- The CFO highlights that the distinctive work of embodied intelligence companies lies in developing their own foundational models for physical-world execution, which is where resources should be concentrated [7]
Lightweight VLA Model Evo-1: SOTA with Only 0.77B Parameters, Tackling Low-Cost Training and Real-Time Deployment
具身智能之心 · 2025-11-12 04:00
Core Insights
- The article presents Evo-1, a lightweight Vision-Language-Action (VLA) model that integrates perception, language, and control, aiming to reduce computational costs and improve deployment efficiency without relying on large-scale robot-data pre-training [3][5][6]

Industry Pain Points
- Existing VLA models face several limitations, including high computational costs from parameter counts that can reach billions, leading to significant GPU memory consumption and low control frequencies [4]
- The reliance on extensive robot datasets for training is both labor-intensive and costly, further complicating the deployment of these models in real-time interactive tasks [4]

Evo-1 Methodology and Performance
- Evo-1 employs a unified visual-language backbone and a two-stage training paradigm to enhance multimodal perception and understanding while keeping the model compact at only 0.77 billion parameters [5][6]
- The model achieved state-of-the-art results in benchmark tests, surpassing previous models by 12.4% and 6.9% on MetaWorld and RoboTwin respectively, and reaching a 94.8% success rate on the LIBERO benchmark [6][18]
- In real-world evaluations, Evo-1 achieved a 78% success rate, outperforming other baseline models while maintaining low memory usage of 2.3 GB and a high inference frequency of 16.4 Hz [22][20]

Model Architecture
- Evo-1 uses InternVL3-1B as its backbone, pre-trained in a native multimodal paradigm, enabling efficient feature fusion and cross-modal alignment [10]
- The model incorporates a cross-modulation diffusion transformer to predict continuous control actions from the multimodal embeddings produced by the backbone [11]
- An integration module aligns the fused visual-language representations with the robot's proprioceptive information, ensuring seamless integration of multimodal features for subsequent control tasks [12]

Training Process
- The two-stage training process begins by aligning the action expert while freezing the visual-language backbone, followed by a global fine-tuning phase that optimizes the entire architecture (see the sketch after this summary) [13][14]
- This approach preserves the semantic integrity of the visual-language model while adapting to diverse action-generation needs, effectively enhancing the model's generalization capabilities [14]

Ablation Studies
- Various integration strategies between the visual-language model and the action expert were evaluated, demonstrating the effectiveness of the proposed design in maintaining performance [24]
- The two-stage training paradigm was compared with a single-stage baseline, showing that the former better retains semantic attention patterns and thus focuses more reliably on task-relevant areas [25]
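To make the two-stage schedule concrete, below is a minimal PyTorch sketch under stated assumptions: the module classes, dimensions, dataloader, and MSE objective are illustrative stand-ins rather than Evo-1's released implementation (which uses InternVL3-1B and a cross-modulation diffusion transformer); only the freeze-then-unfreeze training order follows the description above.

```python
# Minimal sketch of a two-stage VLA training schedule (assumed structure).
# Class names, dimensions, the dataloader, and the MSE objective are
# illustrative placeholders, NOT Evo-1's released code.
import torch
import torch.nn as nn


class VisionLanguageBackbone(nn.Module):
    """Stand-in for the pre-trained visual-language backbone."""

    def __init__(self, in_dim: int = 768, dim: int = 512):
        super().__init__()
        self.encoder = nn.Linear(in_dim, dim)  # placeholder for the real VLM

    def forward(self, multimodal_inputs: torch.Tensor) -> torch.Tensor:
        return self.encoder(multimodal_inputs)


class ActionExpert(nn.Module):
    """Stand-in for the action expert that fuses visual-language embeddings
    with proprioception and predicts continuous control actions."""

    def __init__(self, dim: int = 512, proprio_dim: int = 14, action_dim: int = 7):
        super().__init__()
        self.proprio_proj = nn.Linear(proprio_dim, dim)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, vl_embed: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vl_embed, self.proprio_proj(proprio)], dim=-1)
        return self.head(fused)


def train_stage(backbone, expert, loader, freeze_backbone: bool, epochs: int, lr: float):
    """Stage 1: freeze the VLM backbone and align only the action expert.
    Stage 2: unfreeze everything and fine-tune the whole architecture."""
    for p in backbone.parameters():
        p.requires_grad = not freeze_backbone
    params = list(expert.parameters())
    if not freeze_backbone:
        params += list(backbone.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.MSELoss()  # placeholder objective
    for _ in range(epochs):
        for obs, proprio, target_action in loader:
            pred = expert(backbone(obs), proprio)
            loss = loss_fn(pred, target_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


# Hypothetical usage: stage 1 aligns the action expert, stage 2 fine-tunes globally.
# backbone, expert = VisionLanguageBackbone(), ActionExpert()
# train_stage(backbone, expert, loader, freeze_backbone=True, epochs=5, lr=1e-4)
# train_stage(backbone, expert, loader, freeze_backbone=False, epochs=2, lr=1e-5)
```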
Karpathy Praises DeepSeek-OCR for Making the Tokenizer Obsolete! Hands-On: Using Claude Code to Run the New Model on NVIDIA GPUs
AI前线 · 2025-10-21 04:54
Core Insights
- DeepSeek has released a new model, DeepSeek-OCR, a 6.6GB model fine-tuned specifically for OCR that achieves 10× near-lossless compression and 20× compression while retaining 60% accuracy [2]
- The model introduces DeepEncoder to address the trade-offs between high resolution, low memory, and fewer tokens, achieving state-of-the-art performance in practical scenarios with minimal token consumption [2][4]
- The model's architecture is lightweight, consisting of only 12 layers, which suits the pattern-recognition nature of OCR tasks [5]

Model Innovations
- DeepSeek-OCR renders original content as images before input, leading to more efficient information compression and a richer information flow (a conceptual sketch follows this summary) [6]
- The model eliminates the need for tokenizers, which have been criticized for their inefficiencies and historical baggage, enabling a more seamless end-to-end process [6]
- It employs a Mixture-of-Experts paradigm, activating only 500 million parameters during inference, allowing efficient processing of large datasets [7]

Market Position and Future Implications
- Alexander Doria, co-founder of Pleiasfr, views DeepSeek-OCR as a milestone achievement, suggesting it lays a foundation for future OCR systems [4][8]
- The training pipeline includes a significant amount of synthetic and simulated data; while the model balances inference efficiency and performance, further domain-specific customization is needed for large-scale real-world applications [8]

Developer Engagement
- The release has attracted many developers; Simon Willison got the model running on NVIDIA Spark in about 40 minutes, showcasing its accessibility and ease of use [9][21]
- Willison emphasized the importance of providing a clear environment and task definition for successful implementation, highlighting the model's practical utility [24]
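The "render content as an image first" idea can be illustrated with a short, self-contained Python sketch. Everything in it (canvas size, patch size, the words-to-tokens factor) is an assumed toy setup rather than DeepSeek-OCR's DeepEncoder; it only shows why a page rendered to pixels can be covered by far fewer patch-level vision tokens than the text-token count of the same content.

```python
# Toy illustration of "optical compression": render a page of text to pixels,
# then count patch-level vision tokens versus a rough estimate of text tokens.
# Canvas size, patch size, and the tokens-per-word factor are assumed values
# for illustration only; this is NOT DeepSeek-OCR's actual pipeline.
from PIL import Image, ImageDraw


def render_text_to_image(text: str, width: int = 1024, line_height: int = 20) -> Image.Image:
    """Render plain text onto a white canvas, wrapping at 80 characters."""
    lines = [text[i:i + 80] for i in range(0, len(text), 80)]
    img = Image.new("RGB", (width, line_height * (len(lines) + 1)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, i * line_height), line, fill="black")
    return img


def estimated_vision_tokens(img: Image.Image, patch: int = 128) -> int:
    """Rough vision-token count: one token per (patch x patch) region."""
    w, h = img.size
    return max(1, w // patch) * max(1, h // patch)


if __name__ == "__main__":
    page = "Optical compression turns long text into a compact grid of pixels. " * 60
    img = render_text_to_image(page)
    text_tokens = int(len(page.split()) * 1.3)   # crude proxy: ~1.3 BPE tokens per word
    vision_tokens = estimated_vision_tokens(img)
    print(f"text tokens ~{text_tokens}, vision tokens ~{vision_tokens}")
    # With these toy numbers the rendered page needs roughly an order of
    # magnitude fewer vision tokens than text tokens.
```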
A New Track for AI Companionship: He Found an AI "Buddy" for 8 Million Gamers?
混沌学园 · 2025-08-22 11:58
Core Viewpoint
- The article discusses "Doudou AI," a product that aims to provide genuine companionship through shared experiences rather than superficial conversation, highlighting the need for emotional connection in the digital age [2][9][31]

Group 1: Product Overview
- "Doudou AI" has quietly gained 8 million users with a remarkable 70% next-day retention rate on PC, indicating strong user engagement [2]
- The product is designed to enhance user experiences by being context-aware, interacting meaningfully based on the user's current activities, such as gaming [18][20]

Group 2: Founder's Insight
- The founder, Binson, recognized that the mobile internet era has left users' time consumed by major platforms, leaving little room for new applications [3][4]
- A pivotal moment for Binson was witnessing his son share his gaming achievements, which led him to value being a witness to shared experiences over aimless chat [6][9]

Group 3: Concept Development
- Binson drew inspiration from AI programming assistants, which provide help without interrupting the user's workflow, leading to the idea of an AI companion that enhances rather than distracts [13][16]
- The concept of "scene awareness" was introduced, allowing the AI to interact based on the user's current context and creating a sense of camaraderie [17][18]

Group 4: Personal Experience and Mission
- A life-threatening car accident deepened Binson's understanding of companionship and the value of shared experiences, influencing the product's development direction [22][24]
- The mission of "Doudou AI" is to enhance users' life experiences by being a supportive companion that acknowledges and celebrates their achievements [27][31]

Group 5: Market Positioning
- The product aims to fill a gap in the market for emotional companionship, addressing the loneliness often felt in the digital age by providing a reliable and responsive AI partner [29][30]