Multimodal Large Models
A 60-Billion AI Giant Raises Nearly HKD 5.3 Billion in One Year
Sou Hu Cai Jing· 2025-08-07 11:29
Financing and Capital Structure
- In July, the company completed a financing round of HKD 2.5 billion, bringing the total raised in less than a year to nearly HKD 5.3 billion [1][3][7]
- The recent placement involved issuing 1.667 billion new B shares at HKD 1.5 per share, representing 4.31% of the total issued shares [3][7]
- Since its establishment, the company has raised a total of USD 5.225 billion across 12 financing rounds from various investors, including IDG Capital and Alibaba [7]

Financial Performance
- The company has not achieved profitability since its inception, with losses narrowing in recent years but still significant, amounting to CNY 6.045 billion, CNY 6.44 billion, and CNY 4.278 billion over the last three years [9][12]
- Revenue for the years 2022 to 2024 was CNY 3.809 billion, CNY 3.406 billion, and CNY 3.772 billion, with a notable decline in the first two years followed by growth of 10.75% in the last year [9][11]
- The core revenue driver has shifted to generative AI, which saw revenues of CNY 1.184 billion and CNY 2.404 billion in the last two years, reflecting growth of 103.1% [9][11]

Organizational Changes
- The company has undergone significant organizational restructuring, including the appointment of two new executive directors and the transition of a co-founder to lead the AI chip business [1][15][20]
- Employee numbers have decreased from 5,098 to 3,756 over the past three years, contributing to reduced employee welfare expenses, which have been a factor in narrowing losses [17][18]

Strategic Focus
- The company plans to allocate 30% of the recent funds to support core business development, another 30% to generative AI research, and 20% to exploring AI technology integration in innovative verticals [7]
- The company aims to enhance its organizational efficiency and focus on strategic growth areas, particularly in AI infrastructure and applications [15][17]
Xiaohongshu Open-Sources Multimodal Model dots.vlm1: Unlocking New Capabilities in Image-Text Understanding and Math Problem Solving
Sou Hu Cai Jing· 2025-08-07 10:31
Core Insights
- Xiaohongshu's hi lab has open-sourced its latest multimodal model, dots.vlm1, which is built on DeepSeek V3 and features a self-developed 1.2 billion parameter visual encoder, NaViT, showcasing strong multimodal understanding and reasoning capabilities [1][6]
- dots.vlm1 has demonstrated performance close to leading models like Gemini 2.5 Pro and Seed-VL1.5 in various visual evaluation benchmarks, particularly excelling in tasks such as MMMU, MathVision, and OCR Reasoning [1][4]

Model Performance
- In text reasoning tasks, dots.vlm1 performs comparably to DeepSeek-R1-0528, indicating a degree of generality in mathematical and coding capabilities, although there is room for improvement in more diverse reasoning tasks like GPQA [4]
- dots.vlm1's overall performance is notable, especially in visual multimodal capabilities, nearing state-of-the-art levels [4]

Benchmark Comparisons
- dots.vlm1's performance metrics in various benchmarks include:
  - MMMU: 80.11
  - MathVision: 69.64
  - OCR Reasoning: 66.23
  - General visual tasks: 90.85 on m3gia(cn) [5]

Model Architecture
- dots.vlm1 consists of three core components: a 1.2 billion parameter NaViT visual encoder, a lightweight MLP adapter, and the DeepSeek V3 MoE large language model [5]
- The training process involved three stages: pre-training of the visual encoder, pre-training of the VLM, and post-training of the VLM, enhancing the model's perception and generalization capabilities [5]

Open Source and Future Plans
- dots.vlm1 has been uploaded to the open-source platform Hugging Face, allowing users to experience the model for free [6]
- hi lab plans to enhance the model's performance by expanding the scale and diversity of cross-modal translation data, improving the visual encoder structure, and exploring more effective neural network architectures and loss function designs [6]
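The three-component layout described above (visual encoder, MLP adapter, language model) can be sketched schematically. This is a toy illustration of the general adapter pattern, not the actual dots.vlm1 implementation; all shapes, weights, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image_patches, w_enc):
    """Stand-in for a visual encoder (e.g. NaViT): patches -> features."""
    return np.tanh(image_patches @ w_enc)

def mlp_adapter(visual_feats, w_in, w_out):
    """Lightweight MLP projecting visual features into the LLM's
    token-embedding space so they can sit alongside text tokens."""
    return np.maximum(visual_feats @ w_in, 0.0) @ w_out  # ReLU MLP

patches = rng.normal(size=(16, 32))   # 16 image patches, 32-dim each
w_enc = rng.normal(size=(32, 64))     # encoder projection (illustrative)
w_in = rng.normal(size=(64, 128))     # adapter hidden layer
w_out = rng.normal(size=(128, 96))    # adapter output -> LLM embed dim

vis_tokens = mlp_adapter(vision_encoder(patches, w_enc), w_in, w_out)
text_tokens = rng.normal(size=(8, 96))  # 8 already-embedded text tokens

# The language model (DeepSeek V3 MoE in dots.vlm1's case) would consume
# the concatenated multimodal sequence of visual and text tokens:
sequence = np.concatenate([vis_tokens, text_tokens], axis=0)
print(sequence.shape)  # (24, 96)
```

The adapter is what makes the staged training described above possible: the encoder and LLM can be pre-trained separately, with the small MLP learning to bridge their representation spaces.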
Qianli Technology (601777.SH): Strategic Synergy with Jieyue Xingchen in the Smart Cockpit Sector
Ge Long Hui APP· 2025-08-07 08:13
Core Viewpoint
- Qianli Technology (601777.SH) has formed a strategic collaboration with Jieyue Xingchen in the smart cockpit sector, focusing on the development of next-generation smart cockpit products utilizing AI capabilities [1]

Group 1: Strategic Collaboration
- The partnership aims to leverage multi-modal large models and end-to-end voice large models to enhance product offerings [1]
- The collaboration will include the development of a large model native operating system, referred to as Agent OS, and AI smart assistants [1]

Group 2: Product Development Focus
- The goal is to create industry-leading Natural UI products for natural interaction [1]
The Embodied Intelligence Heart Technology Exchange Group Has Been Established!
具身智能之心· 2025-08-07 02:38
Group 1
- The establishment of the Embodied Intelligence Heart Technology Exchange Group focuses on various advanced technologies including VLA, VLN, remote operation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, mapping and localization, and navigation [1]
- Interested individuals can add the assistant's WeChat AIDriver005 to join the community [2]
- To expedite the joining process, it is recommended to include a note with the institution/school, name, and research direction [3]
SenseTime CFO Wang Zheng in His Own Words: More Than 200 Days After "Re-CoFound," What Results Has "1+X" Delivered?
第一财经· 2025-08-06 12:53
Core Viewpoint
- The article discusses the transformation of SenseTime through its "1+X" organizational restructuring, emphasizing the emergence of a new entrepreneurial spirit among its young leaders and the financial accountability they now embrace [6][11][15].

Group 1: Organizational Changes
- SenseTime's "1+X" strategy was officially announced on December 3, 2024, marking a significant restructuring aimed at fostering a new entrepreneurial collective to seize opportunities in the AI 2.0 era [6][11].
- The restructuring has led to the establishment of a five-member executive committee, enhancing decision-making efficiency and fostering a collaborative environment [7][11].
- The new structure encourages a focus on core business areas while allowing for flexibility and rapid adaptation in emerging vertical markets [12][14].

Group 2: Financial Accountability and Performance
- The restructuring has resulted in a noticeable increase in financial oversight among the CEOs of the "X" businesses, who are now more proactive in managing their financial situations [15][16].
- Each of the six "X" enterprises has successfully raised over 2 billion yuan in cumulative financing, indicating strong investor interest and market validation [17][18].
- The establishment of the "X" businesses has positively impacted the parent company's cash flow, allowing more resources to be allocated to core operations [17].

Group 3: Strategic Focus and Market Position
- SenseTime is focusing on a "three-in-one" strategy that integrates large devices, large models, and applications, while still maintaining its core competency in computer vision (CV) [21][22].
- The company has seen significant growth in its CV business, with increased willingness from clients to invest, particularly in Hong Kong and overseas markets [22][23].
- SenseTime's extensive experience in CV is viewed as a competitive advantage in developing multi-modal large models, which are essential for future AI advancements [24][25].

Group 4: Technological Advancements
- The latest model, released on July 27, 2025, showcases significant improvements in multi-modal reasoning capabilities, reflecting the company's commitment to innovation [27].
- SenseTime's strategic focus on integrating visual data with AI applications positions it well for future growth in the rapidly evolving AI landscape [24][25].
The Eye of "AI": An Evolution of Visual Intelligence | 2025 ITValue Summit WAIC Preview Edition: AI Implementation Guide Series
Tai Mei Ti APP· 2025-08-06 11:39
The WAIC (World Artificial Intelligence Conference) exhibition floor was bustling. DeepGlint (格灵深瞳) CEO Wu Yizhou found the venue livelier than in previous years, with a richer mix of attendees and products, and the depth of single-point AI applications shown by many large companies was impressive. The path by which AI applications are genuinely entering industry has become clearer.

In the livestream of Tai Mei Ti's 2025 ITValue Summit WAIC preview edition, part of its AI implementation guide series, Wu Yizhou and Tai Mei Ti co-founder Liu Xiangming discussed the evolution of visual intelligence and how technology vendors are faring amid the AI upgrade.

DeepGlint has long focused on R&D in visual algorithms and multimodal large model technology. For a technology company that lived through the previous technical era, this wave of intelligence feels distinctly different: products now have "growth potential." One point Wu Yizhou stressed repeatedly in the conversation is that products must be usable, must work well, and must keep growing over time. This is not only DeepGlint's requirement for its own products but also its vision, as a technology vendor, for co-creation with customers.

"In the past, we would hand customers a general-purpose tool. Now, with intelligent agents, that has become a personalized tool with memory, like a partner or an executing co-founder, and applications are more refined and mature," Wu Yizhou explained. Through the evolution of recent years, DeepGlint has built an end-to-end system formed by models, algorithms, and integrated software-hardware products and services. Still, she remains clear-eyed, believing that today's AI is still some distance from truly landing in applications and from deepening its use in industries in tight fusion with domain experts, ...
These Directions Offer a Smooth Transition from Autonomous Driving to Large Models...
自动驾驶之心· 2025-08-06 11:25
Core Insights
- The article discusses the booming field of large models in AI, particularly focusing on various directions such as RAG (Retrieval-Augmented Generation), AI Agents, and multi-modal models [1][2].

Group 1: Large Model RAG
- Large model RAG is highlighted as a significant area, with emphasis on understanding components like retrievers, augmenters, and generators, and how knowledge bases can enhance performance [1].
- The article mentions the rapid development of subfields within RAG, including Graph RAG, applications in visual understanding, and various knowledge-oriented methods [1].

Group 2: AI Agents
- AI Agents are identified as a hot direction in large models, covering topics such as single-agent and multi-agent systems, reinforcement learning, and efficient communication among agents [1].
- The integration of RAG with agents is also noted as a promising area for exploration [1].

Group 3: Multi-modal Models
- The article points out the extensive directions available in multi-modal models, including visual language models, pre-training datasets, and fine-tuning processes [2].
- Deployment, inference, and optimization of these models are also discussed as critical components of the development process [2].

Group 4: Community and Learning
- The article encourages engagement with the "Big Model Heart Tech" community for further learning and collaboration in the field of large models [3].
- The community aims to build a significant platform for talent and academic information related to large models [3].
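The retriever-augmenter-generator pipeline highlighted above can be sketched minimally. The keyword-overlap retriever and stubbed generator below are toy stand-ins, not any specific RAG framework; real systems use embedding-based retrieval and an actual LLM call, and the documents and query here are invented for illustration.

```python
import re

# Toy knowledge base; a real RAG system would index documents with
# dense embeddings rather than bag-of-words overlap.
KNOWLEDGE_BASE = [
    "RAG augments a language model with documents retrieved at query time.",
    "Graph RAG organizes the knowledge base as an entity-relation graph.",
    "Diffusion Policy is a generative approach to robot control.",
]

def tokens(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, docs, k=2):
    """Retriever: rank documents by word overlap with the query."""
    return sorted(docs, key=lambda d: -len(tokens(query) & tokens(d)))[:k]

def augment(query, contexts):
    """Augmenter: splice the retrieved context into the prompt."""
    ctx = "\n".join("- " + c for c in contexts)
    return "Context:\n" + ctx + "\n\nQuestion: " + query + "\nAnswer:"

def generate(prompt):
    """Generator stand-in: a real pipeline would call an LLM here."""
    n_docs = prompt.count("\n- ")
    return "[LLM answer conditioned on %d retrieved docs]" % n_docs

query = "What is Graph RAG?"
prompt = augment(query, retrieve(query, KNOWLEDGE_BASE))
print(generate(prompt))  # [LLM answer conditioned on 2 retrieved docs]
```

Swapping the retriever for a graph traversal over entity relations is essentially the Graph RAG variant mentioned above; the augment and generate stages stay the same.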
Embodied Intelligence Heart Is Recruiting Research Mentors! Academic Experts, Take a Look~
具身智能之心· 2025-08-06 08:30
Embodied Intelligence Heart is recruiting research mentors! If you work in embodied intelligence and have multiple top-conference or top-journal publications, we welcome you to join us in driving the academic community forward.

Directions Overview
Including but not limited to: VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, and related directions.

Requirements
Doctoral degree or above (current students included), with at least 2 publications at A-tier conferences or in Q1-or-above journals/conferences; mentoring experience preferred.

Compensation
Shared industry resources, paper authorship, and cash incentives! For details, contact the assistant on WeChat at oooops-life.

...
What Exactly Is a Large Model? A Beginner-Friendly Deep Dive into Its Technical Areas!
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs), their definitions, architectures, capabilities, and notable developments in the field [3][6][12].

Group 1: Definition and Characteristics of LLMs
- Large Language Models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6].
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12].

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology for LLMs, consisting of an encoder and decoder [9].
- Encoder-only architectures, like BERT, excel in text understanding tasks, while decoder-only architectures, such as GPT, are optimized for text generation [10][11].

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist in coding, answer factual questions, and perform multi-step reasoning [12][13].
- They also excel in text understanding and conversion tasks, such as summarization and sentiment analysis [13].

Group 4: Notable LLMs and Their Features
- The GPT series by OpenAI is a key player in LLM development, known for its strong general capabilities and continuous innovation [15][16].
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, significantly impacting the AI community [17][18].
- Alibaba's Qwen series focuses on comprehensive open-source models with strong support for Chinese and multi-language tasks [18].

Group 5: Visual Foundation Models
- Visual Foundation Models are essential for processing visual inputs, enabling the connection between visual data and LLMs [25].
- They utilize architectures like Vision Transformers (ViT) and hybrid models combining CNNs and Transformers for various tasks, including image classification and cross-modal understanding [26][27].

Group 6: Speech Large Models
- Speech large models are designed to handle various speech-related tasks, leveraging large-scale speech data for training [31].
- They primarily use Transformer architectures to capture long-range dependencies in speech data, facilitating tasks like speech recognition and translation [32][36].

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple types of data, such as text, images, and audio, enabling complex interactions [39].
- Their architecture typically includes pre-trained modal encoders, a large language model, and a modal decoder for generating outputs [40].

Group 8: Reasoning Large Models
- Reasoning large models enhance the reasoning capabilities of LLMs through optimized prompting and external knowledge integration [43][44].
- They focus on improving the accuracy and controllability of complex tasks without fundamentally altering the model structure [45].
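The scaled dot-product attention at the heart of the Transformer architecture described above can be written in a few lines. This is a single-head, unmasked sketch with illustrative shapes; production LLMs add multiple heads, causal masking, and learned query/key/value projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stabilized softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v over a sequence of token vectors."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))  # (seq, seq) attention map
    return weights @ v                       # weighted mix of value vectors

rng = np.random.default_rng(1)
q = rng.normal(size=(5, 16))  # 5 tokens, 16-dim each (illustrative)
k = rng.normal(size=(5, 16))
v = rng.normal(size=(5, 16))

out = attention(q, k, v)
print(out.shape)  # (5, 16): one contextualized vector per token
```

The encoder-only and decoder-only variants contrasted above (BERT vs. GPT) differ mainly in whether this attention map is allowed to look at all positions or only at earlier ones.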
Discrete Tokenization: A Key Cornerstone of Multimodal Large Models, First Systematic Survey Released
机器之心· 2025-08-05 18:56
Core Insights
- The article discusses the advancements in Discrete Tokenization for Multimodal Large Language Models (LLMs), emphasizing its role in transforming various modalities into discrete representations that LLMs can process effectively [2][39].
- A comprehensive survey has been released, detailing the technical landscape, challenges, and future research directions in the field of Discrete Tokenization for Multimodal LLMs [2][39].

Multimodal LLMs and Discrete Tokenization
- Recent breakthroughs in Large Language Models (LLMs) have led to their application in various text tasks, prompting interest in extending their capabilities to non-text modalities such as images, audio, and video [2].
- Discrete Tokenization has emerged as a key solution, utilizing techniques like Vector Quantization (VQ) to compress high-dimensional continuous inputs into compact discrete tokens, enhancing cross-modal understanding and generation [2][39].

Systematic Review and Methodologies
- The article presents the first systematic review of Discrete Tokenization for Multimodal LLMs, organizing content based on input data modalities and combinations, from early single-modal to multi-modal tokenization methods [2][39].
- Eight core categories of Vector Quantization methods are identified, including VQ, RVQ, PQ, AQ, FSQ, LFQ, BSQ, and Graph Anchor-Relation Tokenization, each with unique characteristics suitable for different modalities and tasks [8][9][14].

Challenges and Future Directions
- Key challenges in Discrete Tokenization include codebook collapse, information loss during quantization, difficulties in gradient propagation, and issues with granularity and semantic alignment [12][36].
- Future research directions may focus on adaptive quantization, unified frameworks, biologically inspired codebooks, cross-modal generalization, and enhancing interpretability [36][37].

Applications in Single and Multimodal Tasks
- Discrete Tokenization has been widely applied in single-modal tasks such as image retrieval, audio encoding, and video representation, allowing LLMs to process non-text modalities effectively [20][22].
- In multimodal tasks, it serves as a semantic bridge, enabling models to handle complex inputs across different modalities, facilitating tasks like cross-modal retrieval and generation [27][30].
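The basic Vector Quantization step underlying the tokenization methods surveyed above can be sketched as a nearest-codebook lookup: each continuous feature vector is replaced by the index of its closest code vector, and those indices are the discrete tokens an LLM consumes. Codebook size and feature dimensions here are illustrative assumptions; real tokenizers learn the codebook jointly with an encoder.

```python
import numpy as np

def vq_tokenize(features, codebook):
    """Map each continuous feature vector to the index of its nearest
    codebook entry; the indices serve as discrete tokens."""
    # Pairwise squared distances between features (N, D) and codes (K, D)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # (N,) token ids

rng = np.random.default_rng(2)
codebook = rng.normal(size=(8, 4))  # 8 discrete codes, 4-dim each

# Features lying near codes 3, 1, and 3 should snap back to those ids:
features = codebook[[3, 1, 3]] + 0.01 * rng.normal(size=(3, 4))
print(vq_tokenize(features, codebook))  # [3 1 3]
```

Variants like RVQ and PQ listed above refine this same lookup, quantizing the residual in stages or splitting the vector into sub-spaces with separate codebooks; codebook collapse, one of the challenges noted above, occurs when argmin keeps selecting only a few of the available codes.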