Multimodal Large Models
Surpassing CLIP! Peking University Open-Sources a Fine-Grained Visual Recognition Large Model Needing Only 4 Training Images per Class
量子位· 2026-02-11 01:55
Core Viewpoint
- The article discusses the limitations of current multimodal large models in fine-grained visual recognition tasks and introduces the Fine-R1 model developed by Professor Peng Yuxin's team at Peking University, which significantly improves recognition accuracy with minimal training data [1][2][5]

Group 1: Fine-Grained Visual Recognition Challenges
- Current multimodal large models excel at complex tasks but lag behind their own visual encoders, such as CLIP, in fine-grained visual recognition [1]
- Real-world objects exhibit fine-grained structure with numerous subclasses, such as over 500 types of fixed-wing aircraft, underscoring the importance of fine-grained recognition in practical applications [3]

Group 2: Fine-R1 Model Overview
- The Fine-R1 model leverages rich knowledge of fine-grained subclasses and a generative decoding paradigm to overcome the limitations of traditional recognition methods, enabling fine-grained recognition of any visual object in an open domain [5]
- Fine-R1 strengthens the model's ability to reason about unseen subclasses using a small number of training images (only 4 per subclass), outperforming models such as OpenAI's CLIP and Google DeepMind's SigLIP [5][15]

Group 3: Model Development Process
- The development of Fine-R1 involves two main steps (a hedged sketch of the second follows this summary):
  1. Chain-of-thought supervised fine-tuning, which simulates human reasoning to build inference capabilities [7]
  2. Triplet enhancement strategy optimization, which improves robustness to intra-class variation and discrimination between similar classes by contrasting positive and negative samples [8][10]

Group 4: Experimental Results
- Fine-R1's performance was evaluated on six authoritative fine-grained image classification datasets, demonstrating superior accuracy on both seen and unseen categories compared to other models [15][17]
- The model's effective use of fine-grained subclass knowledge, rather than improvements in visual representation or knowledge storage, was identified as the primary driver of its higher recognition accuracy [19]

Group 5: Conclusion and Future Work
- The article concludes that Fine-R1 has strong potential in fine-grained visual recognition tasks, emphasizing its innovative approach to reasoning and knowledge application [21]
- The research has been accepted to ICLR 2026 and the code is open-sourced for further exploration [2][22]
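To make the triplet enhancement step concrete, below is a minimal sketch, assuming a standard margin-based triplet objective over embeddings from any vision backbone; the function name, cosine-distance choice, and margin value are illustrative assumptions, since the summary does not specify Fine-R1's actual formulation.

```python
# Hedged sketch (not Fine-R1's actual loss): a margin-based triplet
# objective that pulls an anchor toward a positive sample of the same
# subclass and pushes it away from a confusable negative subclass.
import torch
import torch.nn.functional as F

def triplet_enhancement_loss(anchor, positive, negative, margin=0.2):
    """All inputs: (batch, dim) embeddings from any vision backbone."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)
    d_pos = 1.0 - (anchor * positive).sum(dim=-1)  # cosine distance to positive
    d_neg = 1.0 - (anchor * negative).sum(dim=-1)  # cosine distance to negative
    # Penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive.
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (torch.randn(8, 512, requires_grad=True) for _ in range(3))
loss = triplet_enhancement_loss(a, p, n)
loss.backward()  # gradients flow to all three embeddings
```

In such an objective, the positive term (same subclass, different appearance) drives robustness to intra-class variation, while the negative term (a visually confusable different subclass) sharpens inter-class distinctions.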
Zhongsheng Group: Initiates Coverage of Yunzhisheng (09678) with a "Buy" Rating and a Target Price of HK$750.58
智通财经网· 2026-02-09 03:06
Core Viewpoint
- Zhongsheng Group predicts that Yunzhisheng (09678) will see accelerated revenue growth over the next three years, with projected revenues of 1.236 billion, 1.923 billion, and 2.918 billion yuan for 2025, 2026, and 2027 respectively, representing growth rates of 31.6%, 55.6%, and 51.7%, and expects the company to achieve profitability in 2026 [1]

Group 1
- Yunzhisheng is a pioneer of AGI technology in China, being one of the first companies to commercialize deep-learning voice technology and to integrate multimodal technology [2]
- The company has built a matrix of multimodal large models and specialized industry large models; its UniGPT-Med ranked first in three projects in the latest MedBench 4.0 evaluation, with a hallucination rate below 3%, leading the industry [2]

Group 2
- The company has established partnerships with top-tier hospitals such as Peking Union Medical College Hospital and Xiangya Hospital in Hunan, covering 40% of the top 100 hospitals in China and building a solid competitive advantage on its vast medical data assets [3]
- High-quality data training forms an efficient data flywheel, with significant application potential in reducing costs for medical insurance and commercial health insurance [3]

Group 3
- The company employs a dual-platform strategy: MaaS serves high-end clients through private deployment of regional and industry large models, while SaaS focuses on standardized applications for small and medium clients, facilitating commercial monetization [4]

Group 4
- The smart living business continues to grow steadily, with multimodal interaction deployed across various transportation sectors and deep collaborations with leading home appliance companies such as TCL and Gree [5]
- Smart cockpit solutions have been widely adopted in mainstream vehicle models such as SAIC's Zhiji L6 and Geely's Xingrui [5]
Multimodal Large Models Will Bring Explosive Growth Opportunities for Specific Applications; the Software ETF (159852) Draws Strong Capital Attention
Xin Lang Cai Jing· 2026-02-06 03:06
Group 1
- The core viewpoint of the articles highlights significant growth in the software development sector, driven by advances in cloud and AI technologies, as evidenced by Google's Q4 2025 results showing Google Cloud revenue of $17.664 billion, a 48% year-over-year increase [1]
- The software industry is shifting its value focus from license sales to intelligent-service subscriptions and ecosystem collaboration, as major global tech companies adopt a "cloud + AI" heavy-asset model, indicating a long-term bet on AI commercialization and computational network efficiency [1]
- The rapid iteration of overseas large-model technology is expected to keep providing direction and catalysts for domestic application innovation, with breakthroughs in multimodal large models significantly expanding application boundaries, especially in scenarios that require understanding the physical world [1]

Group 2
- As of January 30, 2026, the top ten weighted stocks in the CSI Software Service Index include iFLYTEK, Kingsoft Office, Tonghuashun, and others, collectively accounting for 60.27% of the index [2]
- The software ETF (159852) tracks the CSI Software Service Index, serving as a convenient tool for capturing opportunities in the computer software industry [2]

Group 3
- Investors can also access AI software investment opportunities through the software ETF linked fund (012620) [3]
Jinqiu-Backed Shengshu Technology Completes Over 600 Million RMB in A+ Round Financing | Jinqiu Spotlight
锦秋集· 2026-02-05 04:33
Core Insights
- Jinqiu Fund's portfolio company, Shengshu Technology, has completed over 600 million RMB in A+ round financing, indicating strong investor confidence and growth potential in the AI content generation sector [2][6]
- Shengshu Technology, an early-stage investment of Jinqiu Fund, is recognized for its innovative approach to multimodal AI, particularly its "Reference to Video" concept, which aims to redefine content creation paradigms [3][4]

Financing and Investment
- The A+ round was led by Zhongguancun Science City and Xinglian Capital, with strategic investments from listed companies such as Wanxing Technology and Vision China, alongside existing investors increasing their stakes [6]
- Jinqiu Fund's early investment positions Shengshu Technology favorably for future growth and market leadership in AI-driven content solutions [3]

Technological Advancements
- Shengshu Technology is a pioneer in multimodal generation algorithms, having introduced the U-ViT architecture ahead of competitors such as OpenAI [7]
- The Vidu model, launched in 2024, has achieved significant milestones, including being the first to address multi-subject consistency for commercial video needs and maintaining global leadership in key performance metrics [7][9]
- The latest Vidu Q3 model is noted for its generation speed, reported to be ten times faster than OpenAI's Sora 2, and is now a preferred choice for content creators globally [8][9]

Market Position and Applications
- Vidu saw more than tenfold growth in users and revenue in 2025, building a comprehensive application matrix that serves content creators and enterprises across more than 200 countries [9]
- The model is widely adopted in industries including film, internet, advertising, and gaming, with partnerships involving major companies such as Sony Pictures, Tencent Animation, and Alibaba [10]
- Shengshu Technology's commitment to innovation positions it to redefine content production processes and enhance user experiences in the global content industry [10]
Interview with Wang Zhongyuan: BAAI's Multimodal Large Model Published in Nature, Built by a Team of Young Researchers
Xin Jing Bao· 2026-02-03 14:17
Core Insights
- The Emu3 multimodal model developed by the Beijing Academy of Artificial Intelligence (BAAI) has been published in the prestigious journal Nature, marking a significant achievement for China's research institutions in the field of AI [1][2]

Group 1: Emu3 Model Overview
- Emu3 is a unified architecture that simplifies the understanding and generation of text, images, and videos by using a single model built on the principle of "predicting the next token" (a minimal sketch of this recipe follows this summary) [3][4]
- The design allows significant scalability and lowers R&D barriers, enabling more researchers and institutions to engage in cutting-edge exploration [3][4]

Group 2: Technological Advancements
- Emu3.5, the subsequent version, was trained on over 10 trillion tokens, with video training duration increased from 15 years to 790 years and the parameter count raised from 8 billion to 34 billion [6]
- This version can simulate physical-world dynamics, marking a transition from "predicting the next word or frame" to "predicting the next state," which is crucial for achieving more general intelligence [6]

Group 3: Team and Innovation
- The Emu3 development team is notably young, with the lead developer only 29 years old, reflecting the institute's philosophy of empowering young researchers in AI innovation [7][8]
- The team faced significant technical challenges and industry skepticism but ultimately proved the viability of its innovative approach to multimodal AI [8]

Group 4: Future Applications
- Emu3 is positioned as a foundational model for moving AI from the digital realm into the physical world, enabling applications in robotics and autonomous driving through a robust understanding of complex environments [5][10]
- The model is expected to give rise to a new generation of native multimodal assistants that create images and videos from contextual prompts, enhancing human-computer interaction [5]

Group 5: Talent Development and Institutional Support
- The Beijing Academy of Artificial Intelligence values talent based on impactful work rather than credentials, fostering a dynamic environment for young researchers [9][10]
- The institute operates a flexible funding model that lets researchers focus on valuable scientific work without the pressures of traditional corporate structures [9]
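The "predict the next token" recipe can be illustrated with a minimal autoregressive sketch, assuming text and visual tokens share one discrete vocabulary; all sizes, names, and the toy architecture below are illustrative assumptions, not Emu3's actual configuration.

```python
# Minimal sketch of unified next-token prediction over interleaved
# text and image tokens, in the spirit of a single-model recipe.
# Vocabulary split, sizes, and architecture are illustrative assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32_000, 8_192   # assumed sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB          # one shared vocabulary

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):  # ids: (batch, seq)
        seq = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq)
        x = self.blocks(x, mask=mask)  # causal self-attention
        return self.head(x)            # next-token logits

# Interleaved sequence: text tokens, then image tokens offset into the
# image region of the shared vocabulary; one loss trains both modalities.
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image = torch.randint(0, IMAGE_VOCAB, (2, 48)) + TEXT_VOCAB
ids = torch.cat([text, image], dim=1)
logits = TinyMultimodalLM()(ids)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
```

The point of the unification is that a single cross-entropy loss over one token stream trains understanding and generation for every modality at once.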
Xingchen Technology: Self-Developed Edge SoC Chips Support Local Deployment and Smooth Operation of All Kinds of Multimodal Large Models on the Edge
Zheng Quan Ri Bao Wang· 2026-02-03 10:45
Core Viewpoint
- Xingchen Technology (301536) has developed an in-house edge SoC chip whose AI computing power supports the local deployment and smooth operation of various multimodal large models on the edge [1]

Group 1
- The self-developed edge SoC chip supports local deployment of multimodal large models [1]
- The chip enables smooth on-device operation of AI applications; a generic sketch of one common model-shrinking technique for edge deployment follows this summary [1]
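The summary does not describe how these chips run large models locally; as a generic, hedged illustration of one common edge-deployment technique (not Xingchen's actual toolchain), the sketch below applies dynamic int8 quantization in PyTorch, which stores linear-layer weights as int8 to fit tighter memory and bandwidth budgets.

```python
# Generic illustration only: dynamic int8 quantization of linear layers,
# one standard way large models are shrunk for edge SoC deployment.
# The toy model and layer choice here are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)  # weights stored as int8
x = torch.randn(1, 1024)
print(quantized(x).shape)  # same interface, roughly 4x smaller weights
```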
After DeepSeek, BAAI's Large Model Lands in Nature: On the Dominant Route to the "World Model"
36Kr· 2026-02-02 00:22
Core Insights
- The core achievement of the "Wujie·Emu" multimodal model is its publication in Nature, making its developers the second Chinese large-model team to reach this milestone, with the first paper from China focused on multimodal models [1][3]

Group 1: Model Performance and Capabilities
- Emu3 demonstrates unified learning across text, image, and video modalities, matching specialized models on generation and perception tasks [3][10]
- In image generation, Emu3 scored 70.0, outperforming SD-1.5 (59.3) and SDXL (66.9) [4]
- In video generation, Emu3 scored 81.0 on VBench, surpassing Open-Sora 1.2 [4]
- In visual-language understanding, Emu3 scored 62.1, slightly higher than LLaVA-1.6 (61.8) [4]

Group 2: Technical Innovations and Development
- Emu3 is built on a deliberately simple architecture that relies solely on next-token prediction, an approach seen as having strong scaling potential [4][10]
- The model was developed by a dedicated team of 50 focused on a unified approach to multimodal learning, which reduces the complexity of model development [10][12]
- Emu3's architecture maps visual and textual data into a single representation space, allowing efficient training on multimodal sequences (a hedged sketch of such discrete visual tokenization follows this summary) [10][12]

Group 3: Industry Impact and Future Prospects
- Since its release, Emu3 has significantly influenced the multimodal field and has been widely recognized and applied in industry [13]
- Its performance positions it as a competitive alternative to leading diffusion models and opens new paths for physical AI and embodied intelligence [6][34]
- The upcoming Emu3.5 is expected to add further capabilities, including long-sequence understanding and simulated exploration of virtual environments [6][34]

Group 4: Research and Development Background
- Development of Emu3 began in February 2024, amid a reassessment of large-model development paths following the success of models such as GPT-4 [8][10]
- The research team faced significant technical challenges, including the need to create a new "language" for visual data aligned with human language [12][40]
- The commitment to a unified multimodal approach reflects the belief that achieving AGI requires models that can understand and interact with the physical world [12][40]
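The "new language for visual data" mentioned above is, in published unified-model work, typically realized with a vector-quantized tokenizer; here is a hedged, minimal sketch of that idea, with the codebook size, patch dimensions, and function name as illustrative assumptions rather than Emu3's actual tokenizer.

```python
# Hedged sketch of discrete visual tokenization: each patch embedding is
# snapped to its nearest codebook vector, and the codebook index becomes
# a token an autoregressive model can predict like a word.
import torch

def vq_tokenize(patches: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """patches: (n, d) patch embeddings; codebook: (k, d). Returns (n,) token ids."""
    dists = torch.cdist(patches, codebook)  # (n, k) L2 distances
    return dists.argmin(dim=1)              # nearest-neighbor index = token

codebook = torch.randn(8192, 64)   # assumed 8192-entry codebook
patches = torch.randn(256, 64)     # e.g. a 16x16 grid of patch embeddings
tokens = vq_tokenize(patches, codebook)  # 256 discrete "visual words"
print(tokens.shape, tokens.min().item(), tokens.max().item())
```

Once patches are discrete indices, images and text live in the same token space, which is what lets one autoregressive model treat both uniformly.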
Geling Deep Vision: Projected Full-Year 2025 Net Loss of 170 Million to 240 Million Yuan
Core Viewpoint
- Geling Deep Vision has announced a projected net loss for 2025 of between 170 million yuan and 240 million yuan, primarily due to its ongoing transformation and investment in multimodal large models [1]

Group 1: Financial Performance
- The expected 2025 net profit attributable to the parent company is projected to be between -240 million yuan and -170 million yuan [1]
- Non-recurring gains and losses for 2025 are expected to come mainly from investment income and fair-value changes of financial assets such as structured deposits and other financial products [1]

Group 2: Business Strategy and Development
- 2025 is identified as a critical year for the company's reform, focusing R&D investment on key areas to maintain a technological edge while diversifying the business [1]
- The company is concentrating on sectors such as smart finance, urban management, government affairs, special applications, and smart education, developing industry-level large-model products that comply with domestic standards [1]
- Business diversification has made initial progress, with revenue in urban management, government affairs, special applications, and smart education up from the previous year [1]

Group 3: Market Conditions and Challenges
- Tighter budgets among smart-finance clients, influenced by the macroeconomic environment, contributed to the anticipated losses [1]
- The framework contract with major client Agricultural Bank of China is set to expire in September 2025, slowing demand for related products [1]

Group 4: Recent Acquisitions
- The November 2024 acquisition of Shenzhen Guoke Yidao Technology Co., Ltd. has expanded the company's revenue scale, contributing to the consolidated financial results for 2025 [1]
Large Models Have Learned to Drag the Progress Bar While Watching Videos! New Alibaba Research Moves Video Reasoning Beyond Guesswork to Evidence-Chain Thinking | ICLR 2026
量子位· 2026-01-29 08:27
Core Insights
- The research team from Alibaba's Future Life Lab highlights that a model's effectiveness on video reasoning tasks depends heavily on how it is taught to "think" [1]
- They propose a high-quality video reasoning dataset, ReWatch, and a state-of-the-art model, ReWatch-R1, which can "rewatch" videos like a human to strengthen its reasoning [1]

Group 1: ReWatch Dataset
- The ReWatch dataset consists of 10,000 videos, 170,000 question-answer pairs, and 135,000 reasoning chains, addressing three main problems in existing training data: rough video descriptions, overly simple Q&A, and heavy reliance on textual common sense rather than video content [2][4]
- Key features of the ReWatch dataset include:
  1. High-fidelity temporal captions that provide detailed event descriptions with precise timestamps, forming a solid factual basis for complex reasoning [2]
  2. High-difficulty video Q&A that makes questions depend on video details, preventing models from relying on guessing or common sense [2]
  3. Video-grounded reasoning chains that simulate the human behavior of "rewatching and confirming" through a multi-agent framework, keeping reasoning steps closely tied to video content [2]

Group 2: ReWatch-R1 Model
- ReWatch-R1 is trained with an SFT+RL paradigm and an innovative reward mechanism that emphasizes the reasoning process itself [6]
- The core of the method is a process reward mechanism (GRPO with an O&R reward) that supervises and rewards the model's intermediate reasoning steps rather than only the final answer (a hedged sketch of combining such rewards follows this summary) [6][8]
- The process reward is calculated from:
  1. An observation reward, which scores the accuracy of the model's observations against the high-fidelity captions [8]
  2. A reasoning reward, which scores whether the model's reasoning follows from its own observations alone [8]

Group 3: Experimental Results and Insights
- ReWatch-R1 achieves state-of-the-art performance across five mainstream video reasoning benchmarks, significantly outperforming all comparable open-source models [9]
- A key insight is that reinforcement learning is crucial for unlocking a model's "thinking" potential: the reasoning mode yields a substantial performance leap over the direct-answering mode [11][12]
- The study emphasizes that explicit, step-by-step reasoning supported by evidence is vital for tackling complex video tasks, and RL is the key to fostering this capability [12][14]
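As a hedged illustration of the O&R process reward described above, the sketch below combines an observation score and a reasoning score into one scalar; the scorer functions, the equal weighting, and all names are illustrative assumptions, since the summary does not give the paper's exact definition.

```python
# Hedged sketch of a process-level reward: score intermediate steps
# (observations and reasoning) rather than only the final answer.
from typing import Callable

def process_reward(observations: str,
                   reasoning: str,
                   reference_caption: str,
                   score_obs: Callable[[str, str], float],
                   score_reason: Callable[[str, str], float],
                   w_obs: float = 0.5,
                   w_reason: float = 0.5) -> float:
    """Weighted sum of rewards over intermediate reasoning steps."""
    # Observation reward: do the model's observations match the
    # high-fidelity temporal captions?
    r_obs = score_obs(observations, reference_caption)
    # Reasoning reward: does the reasoning follow from the model's
    # own observations alone (no guessing beyond the video)?
    r_reason = score_reason(reasoning, observations)
    return w_obs * r_obs + w_reason * r_reason

# Usage with a trivial stand-in scorer (real scorers would be judge models):
overlap = lambda a, b: len(set(a.split()) & set(b.split())) / max(len(set(b.split())), 1)
r = process_reward("a man opens the door at 00:12", "so he entered the room",
                   "at 00:12 a man opens the door and enters",
                   overlap, overlap)
print(round(r, 3))
```

In a GRPO-style loop, this scalar would be computed per sampled rollout and used to advantage-weight policy updates, rewarding intermediate steps rather than only final answers.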
Finance Empowers AGI Innovation: SPDB Joins Hands with StepFun to Chart a New Landscape for an Intelligent Future
Zhong Jin Zai Xian· 2026-01-29 02:25
Core Insights
- The article highlights the deep integration of finance and cutting-edge technology in Shanghai, showcasing the growth potential of Chinese tech companies such as StepFun, which has recently completed over 5 billion yuan in B+ round financing, a record for the largest single financing in China's large-model sector over the past 12 months [1]
- StepFun is recognized as a leader in multimodal foundational models, having built a comprehensive model matrix and achieved top rankings in international evaluations, indicating strong technical capability and a promising industry outlook [2]

Company Overview
- StepFun has released three generations of foundational models; Step 3 set a new industry high in inference efficiency, and its Step Audio R1.1 model tops the authoritative Artificial Analysis leaderboard [2]
- The company collaborates deeply with industry leaders such as Geely, OPPO, and Honor, with its core product "Step AI" deployed on over 42 million devices and serving nearly 20 million users daily, demonstrating the effectiveness of its "AI + terminal" strategy [2]

Financial Support and Strategy
- Shanghai Pudong Development Bank (SPDB) has tailored a comprehensive, customized financial solution for StepFun, focusing on technology, team, and future potential rather than traditional collateral and financial statements [4]
- SPDB formed a specialized technology-finance service team to understand StepFun's R&D processes and market-expansion plans, ensuring that financial support matches the company's innovative capability and sustainable business model [4]
- The bank aims to keep optimizing its service model to support tech companies with core technologies and innovative vitality, contributing to the construction of a strong technology nation [4]