The Eye of "AI": An Evolution in Visual Intelligence | 2025 ITValue Summit WAIC On-Site Preview: AI Implementation Guide Series
TMTPost APP· 2025-08-06 11:39
The WAIC (World Artificial Intelligence Conference) exhibition floor was bustling. DeepGlint (格灵深瞳) CEO Wu Yizhou found the venue livelier than in previous years, with a richer mix of attendees and products, and the depth of single-point AI applications shown by many large companies was impressive: the path by which AI applications are genuinely entering industry has become clearer. In the live broadcast of TMTPost's 2025 ITValue Summit WAIC On-Site Preview: AI Implementation Guide series, Wu Yizhou and TMTPost co-founder Liu Xiangming discussed the evolution of visual intelligence and the role of technology vendors amid the current wave of AI upgrades. DeepGlint has long focused on R&D in visual algorithms and multimodal large models, and as a company that lived through the previous technology era, it perceives this wave of intelligence very differently: products now have "growability." A point Wu Yizhou stressed repeatedly in the conversation is that products must be usable, must work well, and must keep growing over time. This is not only DeepGlint's requirement for its own products, but also its vision, as a technology vendor, for co-creation with customers. "In the past we would give a customer a general-purpose tool. Now, with AI agents, that has become a personalized tool with memory, more like a partner or an executing co-founder, and applications are more fine-grained and mature," Wu Yizhou explained. Over the past few years, DeepGlint has built an end-to-end system of models, algorithms, and integrated hardware-software products and services. Still, she remains level-headed, believing that today's AI is some distance from true real-world deployment and from applications as deeply embedded in industries as domain experts are, ...
These Directions Make the Switch from Autonomous Driving to Large Models Relatively Smooth...
自动驾驶之心· 2025-08-06 11:25
Core Insights
- The article discusses the booming field of large models in AI, focusing on directions such as RAG (Retrieval-Augmented Generation), AI Agents, and multi-modal models [1][2].

Group 1: Large Model RAG
- Large model RAG is highlighted as a significant area, with emphasis on understanding components such as retrievers, augmenters, and generators, and on how knowledge bases can enhance performance [1].
- The article notes the rapid development of subfields within RAG, including Graph RAG, applications in visual understanding, and various knowledge-oriented methods [1].

Group 2: AI Agents
- AI Agents are identified as a hot direction in large models, covering topics such as single-agent and multi-agent systems, reinforcement learning, and efficient communication among agents [1].
- The integration of RAG with agents is also noted as a promising area for exploration [1].

Group 3: Multi-modal Models
- The article points out the extensive directions available in multi-modal models, including visual language models, pre-training datasets, and fine-tuning processes [2].
- Deployment, inference, and optimization of these models are also discussed as critical parts of the development process [2].

Group 4: Community and Learning
- The article encourages engagement with the "Big Model Heart Tech" community for further learning and collaboration in the field of large models [3].
- The community aims to build a significant platform for talent and academic information related to large models [3].
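The retriever, augmenter, generator pipeline described above can be sketched in a few lines. Everything below (the toy corpus, the bag-of-words similarity standing in for a real embedding model, and the prompt template) is a hypothetical illustration under simplifying assumptions, not code from any specific RAG framework.

```python
# Minimal RAG sketch: retrieve top-k documents, then augment the prompt.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retriever: rank knowledge-base entries by similarity to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augment(query: str, docs: list[str]) -> str:
    """Augmenter: splice retrieved context into the prompt for the generator."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Graph RAG links entities in a knowledge graph before retrieval.",
    "Vector databases store document embeddings for similarity search.",
    "Diffusion models generate images from noise.",
]
prompt = augment("How does Graph RAG work?", retrieve("graph rag knowledge", corpus))
print(prompt)
```

In a production system the generator would be an LLM call that receives `prompt`; here it is left abstract, since the pipeline structure is the point.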
具身智能之心 Is Recruiting Research Mentors! Academic Heavyweights, Take a Look~
具身智能之心· 2025-08-06 08:30
具身智能之心 is recruiting research mentors! If you work in embodied intelligence and hold multiple top-conference or top-journal publications, we welcome you to help drive the academic community forward with us.

Directions
Including but not limited to: VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, goal navigation, and related areas.

Requirements
PhD or above (current students included), at least 2 publications at A-tier conferences or in Q1-or-better journals/conferences; mentoring experience preferred.

Compensation
Shared industry resources, paper authorship, and cash incentives! For details, contact the assistant on WeChat: oooops-life. ...
What Exactly Is a Large Model, and Which Technical Areas Does It Cover? An In-Depth Primer for Beginners!
自动驾驶之心· 2025-08-05 23:32
Core Insights
- The article provides a comprehensive overview of large language models (LLMs): their definitions, architectures, capabilities, and notable developments in the field [3][6][12].

Group 1: Definition and Characteristics of LLMs
- Large language models (LLMs) are deep learning models trained on vast amounts of text data, capable of understanding and generating natural language [3][6].
- Key features of modern LLMs include large-scale parameters (e.g., GPT-3 with 175 billion parameters), the Transformer architecture, pre-training followed by fine-tuning, and multi-task adaptability [6][12].

Group 2: LLM Development and Architecture
- The Transformer architecture, introduced by Google in 2017, is the foundational technology for LLMs, consisting of an encoder and a decoder [9].
- Encoder-only architectures, like BERT, excel at text-understanding tasks, while decoder-only architectures, such as GPT, are optimized for text generation [10][11].

Group 3: Core Capabilities of LLMs
- LLMs can generate coherent text, assist with coding, answer factual questions, and perform multi-step reasoning [12][13].
- They also excel at text understanding and conversion tasks, such as summarization and sentiment analysis [13].

Group 4: Notable LLMs and Their Features
- The GPT series by OpenAI is a key player in LLM development, known for strong general capabilities and continuous innovation [15][16].
- Meta's Llama series emphasizes open-source development and multi-modal capabilities, significantly impacting the AI community [17][18].
- Alibaba's Qwen series focuses on comprehensive open-source models with strong support for Chinese and multilingual tasks [18].

Group 5: Visual Foundation Models
- Visual foundation models are essential for processing visual inputs, enabling the connection between visual data and LLMs [25].
- They use architectures such as Vision Transformers (ViT) and hybrids combining CNNs and Transformers for tasks including image classification and cross-modal understanding [26][27].

Group 6: Speech Large Models
- Speech large models are designed to handle a variety of speech-related tasks, leveraging large-scale speech data for training [31].
- They primarily use Transformer architectures to capture long-range dependencies in speech data, enabling tasks like speech recognition and translation [32][36].

Group 7: Multi-Modal Large Models (MLLMs)
- Multi-modal large models can process and understand multiple data types, such as text, images, and audio, enabling complex interactions [39].
- Their architecture typically comprises pre-trained modal encoders, a large language model, and a modal decoder for generating outputs [40].

Group 8: Reasoning Large Models
- Reasoning large models enhance the reasoning capabilities of LLMs through optimized prompting and external knowledge integration [43][44].
- They focus on improving the accuracy and controllability of complex tasks without fundamentally altering the model structure [45].
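As a concrete anchor for the Transformer architecture mentioned above, here is a minimal sketch of scaled dot-product attention, the operation at the core of both encoder and decoder blocks. The toy shapes and random inputs are illustrative assumptions; real models add multiple heads, masking, and learned projections.

```python
# Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """One attention head, no masking or projections."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                            # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 query positions, head dimension 4
K = rng.standard_normal((5, 4))   # 5 key positions
V = rng.standard_normal((5, 4))
out = attention(Q, K, V)
print(out.shape)                  # each query position yields one mixed vector
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with head dimension, which would otherwise saturate the softmax.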
Discrete Tokenization, a Key Cornerstone of Multimodal Large Models: First Systematic Survey Released
机器之心· 2025-08-05 18:56
Core Insights
- The article discusses advances in discrete tokenization for multimodal large language models (LLMs), emphasizing its role in transforming various modalities into discrete representations that LLMs can process effectively [2][39].
- A comprehensive survey has been released detailing the technical landscape, challenges, and future research directions for discrete tokenization in multimodal LLMs [2][39].

Multimodal LLMs and Discrete Tokenization
- Recent breakthroughs in LLMs have led to their application across a wide range of text tasks, prompting interest in extending their capabilities to non-text modalities such as images, audio, and video [2].
- Discrete tokenization has emerged as a key solution, using techniques such as vector quantization (VQ) to compress high-dimensional continuous inputs into compact discrete tokens, enhancing cross-modal understanding and generation [2][39].

Systematic Review and Methodologies
- The article presents the first systematic review of discrete tokenization for multimodal LLMs, organizing content by input data modalities and their combinations, from early single-modal methods to multi-modal tokenization [2][39].
- Eight core categories of vector quantization methods are identified, including VQ, RVQ, PQ, AQ, FSQ, LFQ, BSQ, and graph anchor-relation tokenization, each with characteristics suited to different modalities and tasks [8][9][14].

Challenges and Future Directions
- Key challenges in discrete tokenization include codebook collapse, information loss during quantization, difficulties in gradient propagation, and issues of granularity and semantic alignment [12][36].
- Future research may focus on adaptive quantization, unified frameworks, biologically inspired codebooks, cross-modal generalization, and enhanced interpretability [37][36].
Applications in Single-Modal and Multimodal Tasks
- Discrete tokenization has been widely applied in single-modal tasks such as image retrieval, audio encoding, and video representation, allowing LLMs to process non-text modalities effectively [20][22].
- In multimodal tasks, it serves as a semantic bridge, enabling models to handle complex inputs across modalities and facilitating tasks such as cross-modal retrieval and generation [27][30].
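To make the vector quantization (VQ) idea above concrete, here is a minimal nearest-neighbor quantizer: each continuous feature vector is replaced by the index of its closest codebook entry, and that index is the discrete token. The 2-D codebook and inputs are illustrative assumptions, not taken from any model in the survey.

```python
# Minimal vector quantization: map continuous vectors to discrete token ids.
import numpy as np

def quantize(x: np.ndarray, codebook: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (token ids, reconstructed vectors) for a batch of inputs x."""
    # Pairwise squared Euclidean distances between inputs and codebook entries.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)          # one discrete token per input vector
    return ids, codebook[ids]       # reconstruction = the chosen code vectors

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
x = np.array([[0.1, -0.2], [0.9, 1.2]])
ids, recon = quantize(x, codebook)
print(ids)   # → [0 1]
```

The quantization error `x - recon` is the information lost in this step; methods like RVQ quantize that residual with further codebooks to reduce it.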
Kuaishou: Igniting the Productivity of Beijing, China's Top AI City, with Large Models
Beijing Business Daily· 2025-08-05 09:28
From the laboratories of Zhongguancun Science City to the workstations of Kuaishou's Kling AI business unit, Beijing, the city with the highest AI density in China, keeps radiating innovation outward. It hosts nearly 40% of the country's registered, publicly launched large models, more than 2,400 AI companies, and a core industry approaching 350 billion yuan in scale. Beijing's AI achievements stem from the technical breakthroughs of the Beijing Academy of Artificial Intelligence (智源研究院), from invisible supports such as the more than 33,000 P of computing power supplied in Yizhuang, Chaoyang District, and elsewhere, and from the penetration of large-model applications, exemplified by Kling AI, into everyday production and life.

With technology, talent, scenarios, and policy all reinforcing one another, Beijing was the first to work out a closed loop of AI development: technical breakthrough, industrial application, innovative consumption. From intelligent production in high-end manufacturing, to the creative explosion on content platforms, to smarter city governance, the Kuaishou case is only a microcosm; stars abound in China's top AI city.

Today, multimodal large models have become near-standard for large-model companies: they generate images, generate video, and support diverse interaction. This more tangible mode of production brings abstract code closer to users; it can be a "creative worker" on the production line or deliver emotional value in experiential consumption.

A productivity revolution from "making" to "intelligent making"

In June 2024, the Outliers (异类) team generated a short film overnight, footage of a small car flying into space, using technology from Kling AI. A month later, director Chen Kun held an advance screening of 《山海奇镜之劈波斩浪》, co-created with Kling AI, in Cinity Hall 7 of the China Film International Cinema. That day he disclosed to the media ...
Investing Heavily in R&D to Embrace the AI Era, Security Leader Hikvision's Market Cap Heads Toward 300 Billion Yuan
National Business Daily· 2025-08-03 07:41
Core Viewpoint
- Hikvision delivered a strong performance in the first half of 2025, with revenue and net profit growth indicating a successful transition toward AI and IoT solutions [1][3][6].

Financial Performance
- In the first half of 2025, Hikvision achieved revenue of 41.818 billion yuan, a year-on-year increase of 1.48% [1][3].
- Net profit attributable to shareholders was 5.657 billion yuan, a year-on-year growth of 11.71% [1][3].
- Operating cash flow improved dramatically, from -190 million yuan in the same period last year to 5.34 billion yuan, a 2,917.5% increase [3].

Business Structure
- Traditional security remains the core business, but innovative business has emerged as a "second growth curve," contributing 11.766 billion yuan in revenue, a 13.92% increase, and accounting for 28.14% of total revenue [3].
- Key innovative segments include Hikrobot, Ezviz, Hikvision Automotive Electronics, and Hikvision Microfilm, which hold leading positions in their respective fields [3].

Strategic Transition
- Hikvision is transitioning from "security equipment leader" to "AIoT solution provider," aiming to leverage AI breakthroughs for business growth [1][6].
- The company has invested more than 50 billion yuan in R&D since 2020, with R&D expenses accounting for 13.56% of revenue in the first half of 2025 [6][8].

Market Challenges
- The traditional security business faces shrinking market demand and increased government fiscal pressure, leading to a declining domestic revenue contribution [4].
- Internationally, Hikvision's business has been affected by its placement on the U.S. entity list and by restrictions in key markets such as Canada, although the overall revenue impact remains limited [5].

AI Innovations
- Hikvision has launched hundreds of AI model products across sectors including industrial manufacturing and traffic management, enhancing operational efficiency and safety [7][8].
- The company's AI innovations are seen as a key driver of its market valuation, with market capitalization approaching 300 billion yuan [8].
Interview with Jianlan Luo of AgiBot (智元机器人): Data Collection, Simulation, Scenarios, and Engineering for Embodied Intelligence~
自动驾驶之心· 2025-08-01 16:03
具身智能之心 was invited to the WAIC 2025 智启具身 embodied intelligence forum, where it had the opportunity to interview Dr. Jianlan Luo, Chief Scientist of AgiBot (智元机器人). Below are the key questions Dr. Luo discussed during the interview.

On Embodied-Intelligence Data

1. Everyone knows that data is the fuel of intelligence, and sensors are the key to collecting it. What are AgiBot's plans for sensor R&D and procurement, and how will you improve the usability of product data?

Jianlan Luo: We are already working with multiple sensor suppliers, focusing on joint development of visual-tactile and high-density sensors. At the same time, we are building a cross-platform data collection API that maps task semantics into a unified representation, providing standardized, trainable data inputs for model training.

2. You just said the world model is quite useful; once a world model is added, additional collected data makes it better. After that step, how far are we from real applications? What hurdles remain between collected data and deployment?

Jianlan Luo: Performance. A robot's performance has to be very high for it to become genuinely useful. In your home, whether it is a robot sweeping the floor or loading the dishwasher, it needs a 95% success rate, across a million households; that is a very hard problem.

3. Sergey Levine recently published an article proposing the "Sporks of AGI" view, arguing that simulation will hinder the scaling of embodied intelligence. I would like to know ...
From Figma to the Global Rise of China's Vertical Applications
格隆汇APP· 2025-08-01 05:27
Group 1
- Figma is revolutionizing design productivity, targeting a $33 billion full-process product development ecosystem from a starting point in the $2.2 billion front-end design software market [2].
- Figma's core product leverages lightweight design, community-driven proliferation, and collaborative workflows to gain traction in the global design tools market [2].
- The company is integrating AI programming capabilities into its collaboration platform, aiming at a future of "no-code development" [4].

Group 2
- The global AI application landscape is on the verge of a breakthrough, with multi-modal large language models (MLLMs) emerging as a key evolution point [5][6].
- Multi-modal applications are proving to have superior monetization compared with pure-text products, with companies like OpenAI and Anthropic achieving significant annual recurring revenue (ARR) [7].
- Midjourney and Runway are examples of companies successfully monetizing multi-modal capabilities, with Midjourney generating $500 million annually and Runway exceeding one million paid users [7].

Group 3
- Chinese companies are leading in video generation within multi-modal applications, with firms such as Meitu, Kuaishou, and Ruqi Software achieving over $100 million in annual revenue [8].
- Meitu's AI design tool has reached 25% market penetration in Southeast Asian e-commerce, while Kuaishou's video generation tool reached an ARR of over $100 million within 10 months [8].

Group 4
- There are premium opportunities in technology export, as overseas users show a higher willingness to pay for AI services than domestic users [9].
- Figma's comprehensive coverage of the design process creates an ecosystem advantage, while domestic companies need to build dual barriers in vertical fields [10].
- The Chinese government is supporting AI application development through initiatives such as the "Digital China Construction 2025 Action Plan" [10].

Group 5
- The rise of Figma and multi-modal large models signals a paradigm shift in productivity tools, requiring both foundational architecture innovation and deep dissection of vertical scenarios [12].
- Companies that can convert technological advantages into global market share are expected to emerge as the new commercial legends of the AI landscape [12].
Kuang Ziping in Dialogue with Yin Qi: Only a Closed-Loop Business Model Sustains Technological Progress; Hardware Opportunities in the AI Era Are Enormous
IPO早知道· 2025-08-01 04:12
Core Viewpoint
- The article recounts insights shared at the "Qiming Venture Capital · Entrepreneurship and Investment Forum" during the 2025 World Artificial Intelligence Conference, focusing on the evolution of AI technology and its applications across industries, particularly hardware-software integration [2][4][18].

Group 1: AI and Terminal Evolution
- The next three years are expected to be pivotal for "AI + terminal" integration, particularly in the automotive and mobile sectors, with many interesting scenarios emerging [5][7].
- The automotive industry is entering a critical phase, marking the tenth year of smart driving in China, with substantial changes anticipated in technology and product offerings [6][7].
- The integration of AI with mobile devices is seen as a consensus, with potential for killer applications to emerge, although the specifics remain uncertain [7].

Group 2: Model Development and Industry Trends
- Models are identified as the most important driving force behind the evolution of the AI industry [10].
- The development of large models has progressed through three learning paradigms: imitation learning, reinforcement learning, and, in the future, autonomous learning, with significant iterations expected every 18-24 months [12][13].
- A roughly six-month gap is perceived between China and the US in model development, although the gap in computing-power consumption is widening, indicating diverging innovation approaches [15].

Group 3: Business Model and Market Dynamics
- A sustainable business model is essential for driving technological advancement, with a focus on a closed loop that integrates technology, product, and commercialization [18][19].
- The competitive landscape for pure-software AI applications is challenging, with major players such as ByteDance and Tencent dominating the market [21][22].
- Hardware presents substantial opportunities beyond the automotive and mobile sectors, with AI services needed to define the hardware's role [20][23].

Group 4: Future of AI Operating Systems
- AI operating systems are expected to change significantly, with the potential for new ecosystems to emerge, particularly with the arrival of advanced AI agents [24].
- The integration of AI services and operating systems will give rise to new hardware forms, creating opportunities for both established companies and startups [25][26].