Multimodal Large Models
Large Model Interview Notes - Kuaishou Kuai Star
自动驾驶之心· 2025-07-20 08:36
Core Viewpoint
- The article discusses the advancements and opportunities in the field of autonomous driving, emphasizing the importance of multi-modal large models and their applications in various aspects of the industry [5][6].
Group 1: Interview Insights
- The interview process for positions related to multi-modal large models involves detailed discussions about candidates' research papers, particularly focusing on methodologies and results [4][5].
- Candidates are expected to demonstrate knowledge of current multi-modal large models and their paradigms, including specific models like BLIP-2 and Qwen-VL [5].
- Technical questions cover various topics such as Learnable Query, KV Cache, and the training and fine-tuning processes of large models [5][6].
Group 2: Community and Resources
- The article highlights a community with nearly 4,000 members, including over 300 companies and research institutions in the autonomous driving sector, providing a platform for knowledge exchange [7].
- It mentions a comprehensive learning path covering over 30 areas of autonomous driving technology, from perception to planning and control [7].
- The community offers resources on various technical solutions and industry dynamics, aiming to support newcomers in the field of autonomous driving [7].
ACM MM 2025 | EventVAD: Training-Free with 7B Parameters, a New SOTA for Video Anomaly Detection
机器之心· 2025-07-20 03:11
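A research team from Peking University and Tsinghua University, together with JD.com, has published EventVAD at ACM MM 2025: an event-centric, low-cost, and efficient training-free framework for video anomaly detection. The paper's first author, Shao Yihua (邵轶骅), is currently a visiting student at Peking University, and the project lead is Ma Ao (马傲), an algorithm researcher at JD.com. The code and data have been fully open-sourced.
Among existing video anomaly detection (VAD) methods, supervised approaches depend on large amounts of in-domain training data and generalize poorly to unseen anomaly scenarios. Training-free approaches detect anomalies using the world knowledge of large language models (LLMs), but they suffer from insufficient fine-grained visual temporal localization, incoherent event understanding, and redundant model parameters.
To address this, the research team from Peking University, Tsinghua University, and JD.com proposes a new video anomaly detection framework, EventVAD. The framework combines a dynamic graph architecture with temporal event reasoning by multimodal large language models (MLLMs), significantly improving detection accuracy and efficiency while reducing model parameters. Experimental results show that EventVAD surpasses existing SOTA methods on both the UCF-Crime and XD-Violence datasets, setting a new benchmark for the training-free setting.
Paper title: EventVAD: Tra ...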
Surpassing O4-mini, Multimodal Large Models Finally Learn to Look Back: CAS Institute of Automation Proposes the GThinker Model
机器之心· 2025-07-19 03:13
Core Viewpoint
- The article discusses the limitations of existing multimodal large models in flexible visual interpretation and introduces GThinker, a new model designed to enhance multimodal reasoning capabilities through a novel "Cue-Guided Rethinking" approach [1][3][10].
Group 1: Limitations of Existing Models
- Current multimodal models, despite advancements, struggle with general scenarios requiring flexible visual interpretation, often relying on knowledge-based reasoning without deep verification of visual cues [1][8].
- Existing methods, including structured CoT and reinforcement learning, exhibit significant limitations, particularly in correcting misinterpretations of visual cues during reasoning [8][10].
Group 2: Introduction of GThinker
- GThinker is developed by researchers from the Institute of Automation, Chinese Academy of Sciences, aiming to achieve general-purpose multimodal reasoning [2].
- The core innovation of GThinker is its "Cue-Guided Rethinking" mode, which allows the model to actively verify and correct its visual understanding during reasoning [3][10].
Group 3: Training Methodology
- GThinker employs a two-stage training process to instill the ability to rethink, starting with a supervised fine-tuning phase that builds a dataset of 7,000 high-quality samples for cold-starting the model's rethinking capabilities [20][21].
- The model utilizes a mixed reward mechanism in reinforcement learning to encourage active exploration across diverse tasks, enhancing its reasoning capabilities [23][24].
Group 4: Performance Results
- GThinker has demonstrated superior performance on the challenging M³CoT comprehensive reasoning benchmark, surpassing the latest O4-mini model and achieving state-of-the-art results in various mathematical and knowledge reasoning tasks [4][26].
- In tests across multiple scenarios, GThinker outperformed or matched existing advanced models, indicating that it learned rethinking capabilities effectively without over-specializing [28][30].
Commercial Prospects of China's AI Photo-Retouching Sector Come into Focus
Xin Hua Cai Jing· 2025-07-17 05:52
Core Insights
- The commercial photography industry is undergoing a transformation driven by AI technology, which is addressing efficiency and quality challenges faced by traditional workflows [1][3].
- The number of registered photography service providers in China is expected to exceed 3.8 million by June 2025, with a significant portion being small enterprises [1].
- Adobe remains a dominant player in the market, reporting revenue of $5.87 billion for Q2 FY2025, with the user base for its Firefly model nearly doubling [1].
Industry Developments
- Domestic companies are actively exploring the commercialization of AI photo editing, with brands like Pixel Cake serving millions of photographers and processing over 100 million images [2].
- Pixel Cake introduced an "integrated intelligent workflow" that reduces editing time from three days to three minutes, enhancing productivity [2].
- The launch of the "Sugar Cube" model and 16bit·AI Raw parsing technology by Pixel Cake aims to revolutionize the editing process and expand creative possibilities in video production [2].
Market Impact
- The implementation of Pixel Cake's intelligent workflow system led to a 40% increase in monthly orders and a 65% reduction in labor costs for a leading children's photography chain [3].
- Pixel Cake's solutions have reportedly resulted in over 200% revenue growth for commercial photography users [3].
- The company has been recognized as the leading brand in China's commercial AI photo editing market by iResearch, highlighting the commercial viability of AI in this sector [3].
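Exclusive | AutoArk (无界方舟), Incubating China's "GPT-4o", Completes Consecutive Financing Rounds at the 100-Million-Yuan Level, Building the "Strongest Brain" for AI Applications on Its Self-Developed Multimodal Large Model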
Z Potentials· 2025-07-16 03:24
Core Viewpoint
- AutoArk, a startup focused on developing its own multimodal large model, has completed Pre-A and Pre-A+ financing rounds, aiming to redefine the next-generation AI application ecosystem [1][2].
Group 1: Company Overview
- Founded only a year and a half ago, AutoArk is rapidly emerging as a dark horse in China's AI sector, with a complete closed-loop capability from underlying multimodal model research to integrated software and hardware applications [1].
- The founder, Dr. Zeng Xiaodong, is a recognized authority in artificial intelligence with nearly 15 years of experience in algorithm development and industrialization, having previously led the development of Alibaba's first machine translation system [2].
- The core team comprises members from leading tech companies like ByteDance and Alibaba, showcasing strong capability across the entire AI industry chain [2].
Group 2: Core Technology
- AutoArk's self-developed end-to-end multimodal model, known as EVA, integrates various information forms such as text, images, and audio, providing a more intelligent and human-like interaction experience [3].
- The EVA model has achieved results comparable to OpenAI's GPT-4o on several benchmarks, addressing critical commercialization bottlenecks in the industry [3].
- The model has been valued at 381.42 million yuan, marking a record for data asset registration in the AI sector [3].
Group 3: Product Implementation
- In its first year, AutoArk achieved commercialization, serving leading companies in the biopharmaceutical and financial sectors and generating nearly 10 million yuan in revenue [4].
- The company is the first in China to launch a self-developed multimodal model comparable to GPT-4o, with its AI hardware product demo, "Aqi," showcasing real-time interaction capabilities [4].
- The first smart hardware product is set to be mass-produced in Q3, with plans for more products to follow, supported by supply chain giants [5].
Group 4: Investment Perspective
- Investors highlight AutoArk's unique value in its full-stack core technology research and strong product integration capabilities, which create a solid competitive barrier [8].
- The company has demonstrated rapid commercialization progress across various key sectors, validating its ability to transfer and deliver across scenarios [8].
- Following the recent financing, AutoArk plans to continue advancing its multimodal model and Agent technology, as well as open-sourcing its multimodal Agent platform to foster more AI applications [8].
Donghai Securities Morning Meeting Minutes - 20250715
Donghai Securities· 2025-07-15 04:53
Group 1: Banking Industry Insights
- The People's Bank of China reported a year-on-year increase of 8.9% in the social financing scale by the end of June, with RMB loans growing by 7% [6][7].
- In June, new RMB loans amounted to RMB 2,363.7 billion, a year-on-year increase of RMB 171.0 billion, indicating a significant improvement in credit issuance during the peak season [7][8].
- Government bond issuance remained strong, with a year-on-year increase of RMB 507.2 billion in June, supporting rapid growth in social financing [8][9].
- The M2 and M1 monetary aggregates grew by 8.3% and 4.6% respectively, indicating improved liquidity in the banking system [9][10].
- The average interest rate for new corporate loans was approximately 3.3%, while for personal housing loans it was about 3.1%, both showing a year-on-year decline [10][11].
Group 2: Machinery and Robotics Industry
- The robotics sector showcased advancements with the demonstration of the A2-W general-purpose robot, which successfully completed tasks in an industrial setting, enhancing operational efficiency [12][13].
- The acquisition of shares in Upwind New Materials by Shanghai Zhiyuan Hengyue Technology indicates ongoing consolidation and investment in high-performance materials [13][14].
Group 3: Food and Beverage Industry
- The food and beverage sector saw a 0.84% increase, with the liquor sub-sector performing particularly well, driven by improved market sentiment [16][17].
- Kweichow Moutai completed its operational targets for the first half of the year, indicating a recovery in sentiment within the liquor market [17][18].
- The beer sector is expected to benefit from improved demand and declining costs, which may enhance profit margins [18][20].
Group 4: Pharmaceutical and Biotech Industry
- The pharmaceutical sector experienced a 1.82% increase, with notable performance from the CXO segment, indicating a potential for systematic recovery [22][23].
- WuXi AppTec projected a revenue increase of approximately 20.64% for the first half of 2025, reflecting strong growth in the biotech space [23][24].
- The overall PE valuation for the pharmaceutical sector is at 28.95 times, suggesting a stable investment environment [22].
Group 5: Electronics Industry
- The electronics sector is witnessing a recovery, with companies like Espressif Systems and Rockchip reporting significant revenue growth due to strong demand in AIOT applications [27][28].
- The launch of the Grok 4 model by xAI, which boasts a tenfold improvement in reasoning capabilities, highlights advancements in AI technology within the electronics industry [29][30].
- The overall electronic industry index outperformed the market, indicating positive investor sentiment [30][31].
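Are Paid Trolls in the Auto Industry Maliciously Smearing Xiaomi and Huawei? Weibo CEO: A Third Party May Be Instigating Behind the Scenes; Alibaba Reportedly to Launch "Super Saturday" Food-Delivery Plan; MiniMax Raises $300 Million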
雷峰网· 2025-07-15 00:31
Key Points
- The article discusses various developments in the automotive and technology sectors, highlighting significant events and trends affecting companies like Xiaomi, Huawei, and NIO [4][6][16].
Group 1: Automotive Developments
- MiniMax has secured $300 million in financing, raising its post-money valuation to over $4 billion, indicating strong investor confidence despite a cooling market for AI models [6][7].
- Li Auto has established a new computing resources department to enhance its self-developed models, aiming for L3 autonomous driving by 2025 [9][10].
- NIO's stock surged by 10.6% following the announcement of its new model, the L90, which is priced competitively in the electric SUV market [14][20].
Group 2: Technology and Market Strategies
- Huawei's automotive brand, 尚界, is targeting the mainstream market with its H5 model, attracting over 1,000 dealers and emphasizing high cost-performance and advanced driving systems [16][17].
- Alibaba plans to launch a "Super Saturday" promotion for food delivery, offering significant discounts to boost consumer engagement [12][13].
- ByteDance's subsidiary, Mu Tong Technology, has acquired part of the team from Hangzhou Xin Guang Liu Mei, indicating a strategic move into the gaming sector [17][18].
Group 3: Industry Insights and Predictions
- Elon Musk predicts that AI will surpass the collective intelligence of all humans within five years, reflecting the rapid advancements in AI technology [28][31].
- The article notes that Huawei's wearable products have shipped over 200 million units, with the GT series alone exceeding 52 million units, showcasing the brand's strong market presence [23][24].
ICCV 2025 | Tsinghua & Tencent Hunyuan X Discover the "Visual Head" Mechanism: Only 5% of Attention Heads Handle Multimodal Visual Understanding
机器之心· 2025-07-14 11:33
Core Insights
- The article introduces SparseMM, a method that optimizes KV-Cache allocation based on the identification of visual heads in multimodal large models, significantly improving efficiency and performance in visual understanding tasks [5][30][31].
Group 1: Visual Head Identification
- Multimodal large models extend from large pre-trained language models (LLMs) and can exhibit strong performance in visual tasks after multimodal training [2].
- The study identifies that less than 5% of attention heads, termed "visual heads," are primarily responsible for visual understanding, while most heads focus on text or auxiliary features [2][8].
- A method based on OCR tasks is proposed to quantify the attention of each head towards visual content, revealing the sparse nature of visual heads [2][14].
Group 2: SparseMM Methodology
- SparseMM employs a differentiated cache allocation strategy, dividing the total cache budget into three parts: basic local cache for all heads, uniform distribution, and prioritized allocation for visual heads based on their scores [6][20].
- The method has been tested across various multimodal benchmarks, achieving a decoding speedup of up to 1.87× and reducing peak memory usage by 52% [6][27].
Group 3: Experimental Results
- In OCR-rich datasets like DocVQA and TextVQA, SparseMM demonstrates significant performance advantages, maintaining high accuracy even with limited cache budgets [22][23].
- The method shows robust performance across general visual tasks, maintaining nearly consistent performance with full cache models under constrained budgets [25].
Group 4: Implications for Deployment
- SparseMM effectively reduces inference costs and enhances the deployment efficiency of multimodal large models, particularly in high-resolution image and long-context scenarios [27][31].
- The visualization of identified visual heads indicates their ability to accurately focus on relevant visual information, contrasting with non-visual heads that often miss critical details [28].
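The three-part cache budget split described under "SparseMM Methodology" can be illustrated with a rough sketch. This is not the authors' implementation: the function name `allocate_kv_budget`, the parameters `local_window` and `uniform_ratio`, and the example scores are all hypothetical, and the summary does not say how SparseMM actually sizes each part. The sketch only assumes that every head first receives a fixed local window, that a fraction of the remaining budget is shared evenly, and that the rest is divided in proportion to each head's visual score.

```python
from typing import List


def allocate_kv_budget(
    head_scores: List[float],
    total_budget: int,
    local_window: int,
    uniform_ratio: float,
) -> List[int]:
    """Split a total KV-cache budget (in tokens) across attention heads.

    Hypothetical three-part scheme, loosely following the summary:
      1. every head gets a basic local cache of `local_window` tokens;
      2. a fraction `uniform_ratio` of what remains is shared evenly;
      3. the rest is shared in proportion to each head's visual score.
    """
    num_heads = len(head_scores)
    if num_heads == 0:
        raise ValueError("head_scores must not be empty")
    if not 0.0 <= uniform_ratio <= 1.0:
        raise ValueError("uniform_ratio must be in [0, 1]")

    base_total = local_window * num_heads
    if base_total > total_budget:
        raise ValueError("total_budget is too small for the local windows")

    remaining = total_budget - base_total
    # Part 2: even share per head (rounded down).
    uniform_each = int(remaining * uniform_ratio) // num_heads
    # Part 3: whatever is left goes to the score-weighted pool.
    score_pool = remaining - uniform_each * num_heads
    score_sum = sum(head_scores)

    budgets = []
    for score in head_scores:
        if score_sum > 0:
            share = int(score_pool * score / score_sum)  # rounded down
        else:
            share = score_pool // num_heads  # no scores: fall back to even split
        budgets.append(local_window + uniform_each + share)
    return budgets


if __name__ == "__main__":
    # Four heads; the third is a strong "visual head" (made-up scores).
    print(allocate_kv_budget([0, 1, 9, 0], total_budget=1000,
                             local_window=50, uniform_ratio=0.25))
```

Because every share is rounded down, the per-head budgets never add up to more than `total_budget`; heads with a zero score still keep their local window and even share, which matches the summary's point that all heads retain a basic cache.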
Electronics Industry Weekly: Edge-AI Vendors Post Strong Interim Results, Multimodal Large Model Grok 4 Officially Released - 20250714
Donghai Securities· 2025-07-14 09:28
Investment Rating
- The report suggests a positive outlook for the electronic sector, indicating a gradual recovery in demand and price stabilization, and recommends gradually building positions [5][6].
Core Insights
- The electronic sector is experiencing a mild recovery, driven by strong downstream demand from AIOT and accelerated product penetration by companies like Espressif Systems and Rockchip, which are expected to report impressive half-year results [5][6].
- The release of the multimodal model Grok 4 by xAI has significantly enhanced reasoning capabilities, potentially opening new application scenarios [5][11].
- The report highlights four main investment themes: AIOT, AI-driven technologies, equipment materials, and consumer electronics [5][6].
Summary by Sections
Industry Overview
- The report notes that the semiconductor sector is entering a period of intensive earnings forecasts, with companies like Espressif Systems and Rockchip expected to show substantial revenue growth due to ongoing demand in AIOT and other emerging fields [5][6].
Company Performance
- Espressif Systems anticipates revenue of CNY 1.22-1.25 billion for the first half of 2025, a year-on-year increase of 33%-36%, with net profit expected to rise by 65%-78% [5][17].
- Rockchip expects to achieve approximately CNY 2.045 billion in revenue, reflecting year-on-year growth of about 64%, with net profit projected to increase by 185%-195% [5][17].
Market Trends
- The report indicates that the electronic industry outperformed the broader market, with the CSI 300 Index rising by 0.82% and the Shenwan Electronics Index increasing by 0.93% [19][21].
- The semiconductor sub-sector showed a positive trend, with a 1.07% increase in semiconductor stocks [21][26].
Investment Recommendations
- The report recommends focusing on companies benefiting from strong domestic and international demand in the AIOT sector, such as Espressif Systems and Rockchip [5][6].
- It also suggests monitoring AI innovation-driven sectors, including computing chips and optical devices, as well as upstream supply chain components [5][6].
Edge-AI Vendors Post Strong Interim Results, Multimodal Large Model Grok 4 Officially Released | Investment Research Report
Core Viewpoint
- The electronic industry is experiencing a mild recovery, with strong performance from edge AI companies like Espressif Systems and Rockchip, driven by robust downstream demand from AIOT and accelerated product penetration [1][2][3].
Industry Summary
- The 2025 semi-annual performance forecasts are being released, indicating that edge AI companies are performing well due to sustained demand from AIOT [3].
- Espressif Systems is expected to achieve revenue of 1.22-1.25 billion yuan for the first half of 2025, a year-on-year increase of 33%-36%, with net profit projected to rise by 65%-78% [3].
- Rockchip anticipates revenue of approximately 2.045 billion yuan for the first half of 2025, representing year-on-year growth of about 64%, with net profit expected to increase by 185%-195% [3].
Product and Technology Developments
- The release of xAI's multimodal model Grok 4 has improved reasoning capabilities tenfold compared to its predecessor, Grok 3, and has set a record on the HLE (Humanity's Last Exam) benchmark [4][5].
- Grok 4 features a context window of 256,000 tokens and supports various interaction modes, including text, images, and video [4][5].
Investment Recommendations
- The report advises focusing on four main investment themes: AIOT, AI-driven technologies, equipment materials, and consumer electronics [1][2][6].
- Specific companies to watch include Espressif Systems, Rockchip, and others benefiting from strong domestic and international demand in the AIOT sector [6].