Multimodal Vision-Language Models (VLM)
Lightweight VLA model Evo-1: SOTA with only 0.77B parameters, addressing low-cost training and real-time deployment
具身智能之心· 2025-11-12 04:00
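Vision-Language-Action (VLA) models unify perception, language, and control, enabling robots to carry out diverse tasks through multimodal understanding. However, current VLA models typically have very large parameter counts and rely heavily on pretraining with large-scale robot data, which makes training computationally expensive and limits their deployment for real-time inference. In addition, most training paradigms degrade the perceptual representations of the vision-language backbone, causing overfitting and weakening generalization to downstream tasks.

Paper: Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Link: https://arxiv.org/abs/2511.04555

A team from Shanghai Jiao Tong University, CMU, and the University of Cambridge proposes Evo-1, a lightweight VLA model that requires no robot-data pretraining, reducing computational cost and improving deployment efficiency while maintaining strong performance. Evo-1 builds on a native multimodal vision-language model (VLM) and combines a novel cross-modulated diffusion transformer with an optimized integration module to form an efficient architecture. The work further introduces a two-stage training paradigm that progressively aligns action with perception, fully preserving the VLM's representational capability.

Editor: 具身智能之心 ...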
Karpathy praises DeepSeek-OCR for "retiring" the tokenizer! A hands-on test of using Claude Code to get the new model running on NVIDIA GPUs
AI前线· 2025-10-21 04:54
Core Insights
- DeepSeek has released a new model, DeepSeek-OCR, which is a 6.6GB model specifically fine-tuned for OCR, achieving a 10× near-lossless compression and a 20× compression while retaining 60% accuracy [2]
- The model introduces DeepEncoder to address the trade-offs between high resolution, low memory, and fewer tokens, achieving state-of-the-art performance in practical scenarios with minimal token consumption [2][4]
- The model's architecture is lightweight, consisting of only 12 layers, which is suitable for the pattern recognition nature of OCR tasks [5]

Model Innovations
- DeepSeek-OCR allows for rendering original content as images before input, leading to more efficient information compression and richer information flow [6]
- The model eliminates the need for tokenizers, which have been criticized for their inefficiencies and historical baggage, thus enabling a more seamless end-to-end process [6]
- It employs a "Mixture of Experts" paradigm, activating only 500 million parameters during inference, allowing for efficient processing of large datasets [7]

Market Position and Future Implications
- Alexander Doria, co-founder of Pleiasfr, views DeepSeek-OCR as a milestone achievement, suggesting it sets a foundation for future OCR systems [4][8]
- The model's training pipeline includes a significant amount of synthetic and simulated data, indicating that while it has established a balance between inference efficiency and model performance, further customization for specific domains is necessary for large-scale real-world applications [8]

Developer Engagement
- The release has attracted many developers, with Simon Willison successfully running the model on NVIDIA Spark in about 40 minutes, showcasing the model's accessibility and ease of use [9][21]
- Willison emphasized the importance of providing a clear environment and task definition for successful implementation, highlighting the model's practical utility [24]
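
The 10× and 20× compression figures quoted under Core Insights can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that "compression ratio" means text tokens divided by vision tokens (the summary does not define it), and `vision_token_budget` and the 1,000-token page are hypothetical names and values, not part of DeepSeek-OCR.

```python
import math

# Quality levels as quoted in the summary above, keyed by compression ratio.
REPORTED = {10: "near-lossless", 20: "60% accuracy"}


def vision_token_budget(text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens implied by a target compression ratio.

    Assumption (not stated in the summary): the ratio is defined as
    text tokens divided by vision tokens.
    """
    if text_tokens < 0:
        raise ValueError("text_tokens must be non-negative")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    # Round up: a fractional token still occupies one slot.
    return math.ceil(text_tokens / compression_ratio)


if __name__ == "__main__":
    page_tokens = 1000  # hypothetical page length in text tokens
    for ratio, quality in REPORTED.items():
        budget = vision_token_budget(page_tokens, ratio)
        print(f"{ratio}x: {budget} vision tokens ({quality})")
```

Under this assumed definition, doubling the ratio halves the number of tokens the decoder has to attend to, which is the trade-off against accuracy that the summary describes.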
A new track in AI companionship: he found an AI sidekick for 8 million gamers?
混沌学园· 2025-08-22 11:58
Core Viewpoint
- The article discusses the emergence of "Doudou AI," a product that aims to provide genuine companionship through shared experiences rather than superficial conversations, highlighting the need for emotional connection in the digital age [2][9][31].

Group 1: Product Overview
- "Doudou AI" has quietly gained 8 million users with a remarkable 70% next-day retention rate on PC, indicating strong user engagement [2].
- The product is designed to enhance user experiences by being context-aware, allowing it to interact meaningfully based on the user's current activities, such as gaming [18][20].

Group 2: Founder's Insight
- The founder, Binson, recognized that the mobile internet era has led to users' time being consumed by major platforms, leaving little room for new applications [3][4].
- A pivotal moment for Binson was witnessing his son share his gaming achievements, which led him to understand the importance of being a witness to shared experiences rather than engaging in aimless chat [6][9].

Group 3: Concept Development
- Binson drew inspiration from AI programming assistants, which provide help without interrupting the user's workflow, leading to the idea of an AI companion that enhances rather than distracts [13][16].
- The new concept of "scene awareness" was introduced, allowing the AI to interact based on the user's current context, creating a sense of camaraderie [17][18].

Group 4: Personal Experience and Mission
- A life-threatening car accident deepened Binson's understanding of companionship and the value of shared experiences, influencing the product's development direction [22][24].
- The mission of "Doudou AI" is to enhance users' life experiences by being a supportive companion that acknowledges and celebrates their achievements [27][31].

Group 5: Market Positioning
- The product aims to fill a gap in the market for emotional companionship, addressing the loneliness often felt in the digital age by providing a reliable and responsive AI partner [29][30].