Multimodal Intelligence
A major breakthrough in long-text retrieval: China Unicom team's new model lifts accuracy by nearly 20%
Sou Hu Cai Jing· 2025-12-02 20:15
Core Viewpoint
- HiMo-CLIP is a new AI model developed by China Unicom's Data Science and Artificial Intelligence Research Institute, designed to improve the accuracy of image retrieval by automatically identifying the key information in complex descriptions, addressing the common failure mode in which "too much detail leads to errors" [2][7][21].

Group 1: Model Features
- HiMo-CLIP uses a dedicated module called HiDe, which applies statistical methods to extract the most discriminative features from similar descriptions, sharpening the model's focus on key attributes [7][8].
- The model achieves an accuracy of 89.3%, a significant improvement over earlier methods that relied on fixed templates or manual annotation [8].
- Deployment is efficient and light on hardware, adding only about 7% inference overhead on A100 GPUs, which keeps it within reach of standard servers [10][11].

Group 2: Performance Metrics
- The model incorporates a dual alignment mechanism, the MoLo loss, which jointly optimizes overall semantic matching and core-feature matching, preventing the "more detail, more errors" phenomenon (a sketch follows this entry) [11][13].
- On the MSCOCO-Long dataset, HiMo-CLIP's mean Average Precision (mAP) improved by nearly 20% over the previous Long-CLIP model, while retaining 98.3% of its original performance on short-text datasets such as Flickr30K [13].

Group 3: Practical Applications
- HiMo-CLIP has already been applied in real-world scenarios, for example in product search on JD.com, where handling complex user descriptions led to a 27% increase in search conversion rates [14][15].
- The model is also being explored in the autonomous driving sector to interpret complex road descriptions, improving environmental recognition for vehicle systems [18].

Group 4: Future Developments
- The team plans to release a multilingual version of HiMo-CLIP by Q3 2026, aiming to handle specialized terminology and foreign-language descriptions more effectively [21].
- HiMo-CLIP's success highlights the value of modeling human cognitive logic in AI systems, suggesting structured semantic spaces as a potential new direction for multimodal intelligence [21].
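The article names the HiDe module and the MoLo loss but gives no formulas, so the following is a minimal PyTorch sketch of what a dual alignment objective of this shape could look like: a global InfoNCE term over full-caption embeddings plus a local term that aligns each image with its caption's most discriminative component. The `comp_emb` input (standing in for HiDe's extracted key features), the weight `alpha`, and the temperature `tau` are all assumptions, not the published method.

```python
import torch
import torch.nn.functional as F

def dual_alignment_loss(img_emb, txt_emb, comp_emb, alpha=0.5, tau=0.07):
    """Hypothetical dual alignment loss: a global image-caption InfoNCE
    term plus a local term aligning each image with its caption's key
    component embedding (a stand-in for HiDe's output)."""
    img = F.normalize(img_emb, dim=-1)    # (B, D) image embeddings
    txt = F.normalize(txt_emb, dim=-1)    # (B, D) full-caption embeddings
    comp = F.normalize(comp_emb, dim=-1)  # (B, D) key-component embeddings

    labels = torch.arange(img.size(0), device=img.device)

    # Global alignment: symmetric InfoNCE over matched image/caption pairs.
    logits = img @ txt.t() / tau
    global_loss = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Local alignment: the image must also match the caption's core features.
    local_logits = img @ comp.t() / tau
    local_loss = F.cross_entropy(local_logits, labels)

    return alpha * global_loss + (1 - alpha) * local_loss

# Toy usage with random embeddings.
B, D = 8, 512
loss = dual_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```

Weighting the two terms is what would keep long, detail-heavy captions from drowning out the core attributes during matching.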
Xiaohongshu proposes DeepEyesV2: from "thinking over images" to "tool collaboration", exploring new dimensions of multimodal intelligence
QbitAI· 2025-11-13 00:49
Core Insights
- DeepEyesV2 is a significant upgrade from its predecessor, DeepEyes, enhancing its capabilities from merely recognizing details to actively solving complex problems through multi-tool collaboration [3][12].

Multi-Tool Collaboration
- Traditional multimodal models are limited in their ability to actively utilize external tools, often functioning as passive information interpreters [4].
- DeepEyesV2 addresses two main pain points: weak tool-invocation capabilities and a lack of collaboration among different functions [5][8].
- The model can now perform complex tasks by integrating image search, text search, and code execution in a cohesive manner [12][18].

Problem-Solving Process
- DeepEyesV2's problem-solving process involves three steps: image search for additional information, text search for stock-price data, and code execution to retrieve and calculate financial data [15][16][17].
- The model demonstrates advanced reasoning capabilities, allowing it to tackle intricate queries effectively [14].

Model Features
- DeepEyesV2 incorporates programmatic code execution and web retrieval as external tools, enabling dynamic interaction during reasoning [22].
- The model generates executable Python code or web search queries as needed, enhancing its analytical capabilities (a generic loop of this kind is sketched after this entry) [23][27].
- This integration results in more flexible tool invocation and a more robust multimodal reasoning framework [28].

Training and Development
- Development followed a two-phase training strategy: a cold start to establish foundational tool usage, followed by reinforcement learning for optimization [37][38].
- The team created a new benchmark, RealX-Bench, to evaluate the model's performance in real-world scenarios requiring multi-capability integration [40][41].

Performance Evaluation
- DeepEyesV2 outperforms existing models in accuracy, particularly on tasks requiring the integration of multiple capabilities [45].
- Its performance metrics indicate a significant improvement over open-source models, especially in complex problem-solving scenarios [46].

Tool Usage Analysis
- The model exhibits a preference for specific tools based on task requirements, demonstrating adaptive reasoning capabilities [62].
- After reinforcement learning, the model shows a reduction in unnecessary tool calls, indicating improved efficiency in reasoning [67][72].

Conclusion
- The advancements in DeepEyesV2 highlight the importance of integrating tool invocation with the reasoning process, showcasing strong problem-solving abilities across domains [73][75].
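The article describes DeepEyesV2 interleaving reasoning with image search, text search, and code execution, but not its interface; here is a generic reason-act loop illustrating that pattern. The `model` callable, the message format, and the toy tools are hypothetical, not the DeepEyesV2 API.

```python
import json

def run_agent(model, tools, question, max_steps=6):
    """Generic reason-act loop: at each step the model either calls a tool
    or returns a final answer. `model(messages)` is a hypothetical callable
    returning {"tool": name, "input": ...} or {"answer": ...}."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = model(messages)
        if "answer" in action:                     # model finished reasoning
            return action["answer"]
        tool_fn = tools[action["tool"]]            # e.g. "search" or "python"
        result = tool_fn(action["input"])          # execute the tool call
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": action["tool"],
                                                "result": str(result)})})
    return "max steps exceeded"

# Toy tools: a canned search and a (deliberately simplistic) Python evaluator.
tools = {
    "search": lambda q: f"top result for {q!r}",
    "python": lambda src: eval(src),   # demo only; never eval untrusted code
}
```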
Tencent Research Institute AI Express 20251111
Tencent Research Institute· 2025-11-10 16:30
Group 1: Generative AI Developments
- The OpenRouter platform has listed an anonymous model, Polaris Alpha, believed to be a variant of GPT-5.1, with a knowledge cutoff of October 2024, a maximum context of 256K, and a single-output limit of 128K [1]
- Polaris Alpha performs smoothly on desk work and programming tasks, exhibits typical GPT characteristics, and supports an NSFW mode [1]
- The model is currently available for free via API (a hypothetical call is sketched after this digest), performing well on programming mini-games and web design, with GPT-5.1 expected to be officially released in mid-November [1]

Group 2: Multi-Modal Intelligence
- A new multimodal paradigm called Cambrian-S has been proposed by researchers including Yann LeCun, focusing on "spatial super-perception" and marking a first step in exploring video spatial super-perception [2]
- The research outlines a development path for multimodal intelligence across four levels: semantic perception, streaming event cognition, 3D spatial cognition, and predictive world modeling, and introduces the VSI-SUPER benchmark for spatial super-perception capabilities [2]
- Cambrian-S uses latent-variable frame prediction to manage memory and segments events through a "surprise" signal, outperforming Gemini on spatial-cognition tasks with smaller models [2]

Group 3: AI Programming Tools
- Meituan has launched an AI IDE named CatPaw, featuring code completion, agent-based Q&A and generation, built-in browser preview debugging, and project-level analysis [3]
- CatPaw's core engine is Meituan's self-developed LongCat model, fully compatible with major programming languages such as Python, C++, and Java, and currently free to use [3]
- Over 80% of Meituan's internal developers are weekly active users of CatPaw, AI-generated code accounts for about 50% of new code submissions, and a Windows version is expected to launch soon [3]

Group 4: Domestic AI IDE Launch
- YunSi Intelligence has introduced Vinsoo, billed as the world's first AI IDE with a cloud-based security agent, positioned against products like Cursor and Codex that build on Claude [4]
- Vinsoo claims breakthroughs in long-context engineering algorithms, supporting effective context lengths in the millions and allowing up to eight intelligent agents to operate simultaneously [4]
- The new Beta 3.0 version supports cloud-based one-click publishing, mobile usage, and team collaboration, led by a founding team of post-00s graduates from top universities in China and the U.S. [4]
Group 5: Open-Source Audio Editing Model
- Jieyue Xingchen has released the first open-source LLM-level audio editing model, Step-Audio-EditX, which allows precise control over audio emotions, speaking styles, and paralinguistic features through language commands [5]
- The model employs a unified LLM framework and a "dual-codebook" audio tokenizer, supporting zero-shot text-to-speech, iterative editing, and bilingual use [5]
- At approximately 3 billion parameters, the model can run on a single 32GB GPU and achieves higher accuracy in emotion and style control than closed-source models such as MiniMax and Doubao [5]

Group 6: AI Glasses Launch
- Baidu has officially launched the Xiaodu AI Glasses Pro, priced at 2,299 yuan (a promotional 2,199 yuan for Double Eleven), weighing 39 grams and featuring a 12-megapixel wide-angle camera [6]
- The glasses integrate multimodal AI models, offering photography, music recognition, AI translation, object recognition, note-taking, and audio recording, with real-time translation capabilities [6]
- Like Xiaomi's AI glasses, these are not the more advanced AI+AR glasses currently available [6]

Group 7: Robotics Innovation
- Galaxy General has introduced DexNDM, a dexterous-hand neural dynamics model that achieves stable, multi-axis rotation of varied objects and can use tools such as screwdrivers and hammers [8]
- DexNDM decomposes hand-object interactions down to the joint level, with a training process that yields stable operation across tasks and object forms without requiring successful demonstrations [8]
- The technology has been applied to teleoperation systems, letting operators give high-level commands via VR controllers while DexNDM autonomously manages fine finger-level control [8]

Group 8: Insights on AI Entrepreneurship
- A YC partner emphasizes that AI tools cannot replace a founder's sales capabilities, suggesting AI should first target quick-to-implement entry points in traditional industries rather than aiming for full automation [9]
- The core competitive advantage in early-stage entrepreneurship is "learning speed" rather than scale, with a focus on quickly validating ideas with small customers [9]
- AI sales development representatives (SDRs) are effective only where well-functioning sales processes already exist, and founders must clarify their target audience and attention-acquisition strategies for AI tools to be effective [9]
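Group 1 notes that Polaris Alpha is free to call via API. OpenRouter exposes models through an OpenAI-compatible chat-completions endpoint, so a call would look roughly like the sketch below; the model slug `openrouter/polaris-alpha` is an assumption (anonymous slugs change), so treat it as a placeholder.

```python
import os
import requests

# Minimal OpenRouter chat-completions call; the model slug is a guess.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openrouter/polaris-alpha",  # hypothetical slug for the anonymous model
        "messages": [{"role": "user", "content": "Write a tiny snake game in JS."}],
        "max_tokens": 1024,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```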
On the scene at the CIIE
Zheng Quan Ri Bao· 2025-11-06 15:49
Group 1: AI as a Driving Force
- At the eighth China International Import Expo (CIIE), AI has transitioned from "technology showcase" to a key driver of industrial transformation, with over 400 AI-related innovations presented [2][3]
- AI applications are now penetrating sectors including healthcare, industry, retail, and transportation, showcasing AI's evolution from mere demonstrations to practical tools [2][7]

Group 2: Innovations in AI Applications
- Siemens showcased an AI surgical solution and a three-dimensional collaboration platform that integrates AI with digital-twin technology, emphasizing practical applications in industrial settings [3][8]
- Humanoid robots and intelligent robotic arms at the expo highlight advancements in embodied intelligence, with companies like Zhiyuan Innovation demonstrating multimodal interaction capabilities [4][6]

Group 3: AI in Healthcare
- AI technologies in healthcare have been extensively implemented, with companies like Siemens and Maizhao Health Technology presenting comprehensive solutions from diagnosis to treatment [7][8]
- Maizhao Health's "AI Magic Mirror" can analyze health indicators with 90% accuracy, indicating significant advancements in health-monitoring technology [7]

Group 4: AI in Retail and Industry
- AI is positioned as a strategic asset for the retail sector, with the potential to increase annual operating profits by $310 billion by 2030 if scaled effectively [4][5]
- The global robotics market is projected to exceed $400 billion by 2029, with embodied intelligent robots expected to capture over 30% of the market share [6]

Group 5: China's Role in AI Development
- China's vast market and diverse application scenarios are seen as critical for the industrialization of AI technologies, with companies like AMD and Qualcomm emphasizing the importance of collaboration and innovation [10][11]
- The CIIE serves as a significant platform for global technology application, with over 3,000 new products and services showcased in previous editions, indicating China's growing role as an innovation catalyst [11]
Zhiyuan releases Wujie Emu3.5, ushering in "next-state prediction"! Wang Zhongyuan: it may open a third Scaling paradigm
AI Frontline· 2025-11-01 05:33
Core Insights
- The article discusses the launch of the world's first native multimodal world model, Emu3, by Zhiyuan Research Institute, which predicts the next token without diffusion models or combination methods, achieving a unified approach to images, text, and video [2]
- Emu3.5, released a year later, enhances the model's capabilities by simulating human natural learning and achieving generalized world-modeling ability through Next-State Prediction (NSP) [2][3]
- The core of the world model is the prediction of the next spatiotemporal state, which is crucial for embodied intelligence [2]

Model Features
- Emu3.5 has three main characteristics: understanding high-level human intentions and generating detailed action paths, seamless integration of world understanding, planning, and simulation, and providing a cognitive foundation for generalized interaction between AI and humans or physical environments [3]
- The model's architecture interleaves visual and textual tokens, enhancing its scalability and performance (a toy version of this idea is sketched after this entry) [8]

Technological Innovations
- Emu3.5 underwent two phases of pre-training on approximately 13 trillion tokens, focusing on visual-resolution diversity and data quality, followed by supervised fine-tuning on 150 billion samples [12][13]
- A large-scale native multimodal reinforcement learning system was developed, featuring a comprehensive reward system that balances multiple quality standards and avoids overfitting [14]
- The introduction of DiDA technology accelerated inference speed by 20 times, allowing the autoregressive model to compete with diffusion models in performance [17][19]

Industry Impact
- The evolution from Emu3 to Emu3.5 demonstrates the potential for scaling in the multimodal field, similar to advancements seen in language models [6]
- Emu3.5 represents a significant original innovation in the AI large-model field, combining algorithmic, engineering, and data-training innovations [9]
- The model's ability to understand causal relationships and spatiotemporal dynamics positions it uniquely in the landscape of AI models, potentially opening a new avenue for large models [20]
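Next-State Prediction is described only at a high level; the toy sketch below shows the basic mechanic the entry attributes to the Emu line, a single autoregressive stream in which image tokens (from a VQ-style tokenizer) share one vocabulary with text tokens. The vocabulary sizes, offset scheme, and tiny transformer are stand-ins, not Emu3.5's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB, D = 1000, 512, 128  # toy sizes, not Emu3.5's

class InterleavedLM(nn.Module):
    """Toy next-token model over one shared vocabulary: image token ids
    are offset past the text ids, so text and vision live in a single
    autoregressive stream and 'next state' is just the next token."""
    def __init__(self):
        super().__init__()
        vocab = TEXT_VOCAB + IMG_VOCAB
        self.embed = nn.Embedding(vocab, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, vocab)

    def forward(self, tokens):                      # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# Interleave text ids with offset image ids and train on next-token loss.
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image = torch.randint(0, IMG_VOCAB, (2, 16)) + TEXT_VOCAB  # shift into shared vocab
seq = torch.cat([text, image], dim=1)
logits = InterleavedLM()(seq)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                       seq[:, 1:].reshape(-1))
print(loss.item())
```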
AI is no longer about "showing off": Taobao wants technology to solve every concrete user problem
Machine Heart· 2025-10-28 04:31
Core Viewpoint
- The article discusses the transformative impact of generative AI on productivity and the evolution of e-commerce, focusing on Alibaba's Taobao and its advancements in AI technology [2][6][11].

Group 1: AI Technology Evolution
- The evolution of AI technology has accelerated, leading to the emergence of various models and applications, with a focus on multimodal capabilities [3][11].
- Taobao has integrated AI deeply into its operations, upgrading its AIGX technology system to cover all necessary e-commerce scenarios [3][11].
- Generative AI is expected to bring a generational leap in productivity, with multimodal intelligence becoming a core technology [11][12].

Group 2: Taobao's AI Innovations
- Taobao launched RecGPT, a recommendation model with 100 billion parameters, enhancing the user experience with personalized recommendations [14][21].
- The generative recommendation algorithm can create new content based on user preferences, moving beyond traditional recommendation systems (a toy sketch follows this entry) [16][20].
- The AI-driven video-generation model, Taobao Star, automates the creation of promotional videos, significantly reducing content-production costs for merchants [25][27].

Group 3: Open Source and Industry Impact
- Taobao has open-sourced its reinforcement learning framework ROLL, aimed at improving user experience and enhancing model-training efficiency [38][39].
- The company is gradually releasing its validated capabilities to the external market, fostering industry growth toward a "superintelligent" era [39][40].
- Rapid advancements in AI processing complexity and falling error rates suggest that narrow AGI could be achieved within 5-10 years [40].
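RecGPT's design is not detailed in the article; as a toy illustration of "generative recommendation" in general, the sketch below generates next item ids from a user's interaction history instead of scoring a fixed candidate list. The architecture and sizes are assumptions, unrelated to RecGPT's actual 100-billion-parameter model.

```python
import torch
import torch.nn as nn

N_ITEMS, D = 10_000, 64  # toy catalog size and embedding width

class GenRec(nn.Module):
    """Toy generative recommender: treat a user's interaction history as a
    token sequence and *generate* the next item ids, rather than ranking a
    pre-retrieved candidate set as traditional recommenders do."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_ITEMS, D)
        self.rnn = nn.GRU(D, D, batch_first=True)
        self.head = nn.Linear(D, N_ITEMS)

    @torch.no_grad()
    def recommend(self, history, k=5):              # history: (1, T) item ids
        out, h = self.rnn(self.embed(history))      # encode the history
        items = []
        for _ in range(k):                          # autoregressively emit items
            logits = self.head(out[:, -1])
            tok = torch.multinomial(logits.softmax(-1), 1)
            items.append(tok.item())
            out, h = self.rnn(self.embed(tok), h)   # feed the emitted item back
        return items

history = torch.randint(0, N_ITEMS, (1, 12))        # fake browsing history
print(GenRec().recommend(history))
```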
Open-source for just one week, Tencent's text-to-image model storms to the top, beating Google's Nano-Banana
Machine Heart· 2025-10-05 06:42
Core Viewpoint
- The article highlights the rapid rise of Tencent's Hunyuan Image 3.0 model, which has topped the LMArena leaderboard, showcasing its advanced capabilities in text-to-image generation and its potential to rival top proprietary models in the industry [3][54].

Model Performance
- Hunyuan Image 3.0 has received significant attention in the creator community for its superior image quality, detail restoration, and understanding of composition and style consistency [4][39].
- The model has surpassed 1.7k stars on GitHub, indicating growing community interest and participation [6].
- It demonstrates strong performance in generating coherent narratives and detailed illustrations from user prompts, effectively combining knowledge, reasoning, and creativity [9][15].

Technical Specifications
- The model is built on the Hunyuan-A13B architecture, featuring 80 billion parameters, making it Tencent's largest and most powerful open-source text-to-image model to date [3][41].
- It employs a mixed discrete-continuous modeling strategy, allowing efficient collaboration between text understanding and visual generation (a generic sketch of the idea follows this entry) [42][43].
- The training process used a dataset of nearly 5 billion images, ensuring high-quality and diverse training data [45].

Training and Development
- The training strategy included multiple progressive stages, strengthening multimodal modeling through varied data types and resolutions [49][51].
- The architecture integrates language modeling, image understanding, and image generation into a unified framework, enhancing overall performance [43][54].

Industry Context
- The emergence of models like Hunyuan Image 3.0 reflects a broader trend in the AIGC field, where models are evolving from pure generation toward understanding, reasoning, and controllable content creation [55][56].
- Open-source initiatives are becoming a core driver of innovation, with companies like Tencent developing and sharing advanced models to foster community collaboration [56].
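The "mixed discrete-continuous modeling strategy" is not specified further; below is a generic sketch of the idea, in which discrete text ids enter through an embedding table while continuous image-patch features enter the same transformer through a linear projection. All dimensions and the patch-feature source are assumptions, not Hunyuan's design.

```python
import torch
import torch.nn as nn

VOCAB, D, PATCH_DIM = 32_000, 256, 768  # toy sizes

class MixedSequenceEncoder(nn.Module):
    """Generic mixed discrete-continuous input: discrete text ids go
    through an embedding table, continuous image patch features go
    through a linear projection, and both share one transformer."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, D)
        self.img_proj = nn.Linear(PATCH_DIM, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, img_patches):
        # Concatenate (B, Tt, D) discrete and (B, Ti, D) continuous parts.
        seq = torch.cat([self.text_embed(text_ids),
                         self.img_proj(img_patches)], dim=1)
        return self.encoder(seq)

out = MixedSequenceEncoder()(torch.randint(0, VOCAB, (2, 10)),
                             torch.randn(2, 16, PATCH_DIM))
print(out.shape)  # torch.Size([2, 26, 256])
```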
SenseTime's Lin Dahua: cracking image-text interleaved chain-of-thought technology, SenseTime's "two-step" path
36Kr· 2025-08-15 09:09
Core Insights
- SenseTime has launched the Riri Xin V6.5 multimodal model, the first commercial-grade model in China to achieve "image-text interleaved chain-of-thought" technology [2]
- The development of multimodal intelligence is essential for achieving Artificial General Intelligence (AGI), as it allows the integration of various forms of information processing, similar to human sensory perception [4][5]
- SenseTime's approach to building multimodal intelligence involves a progressive evolution through four key breakthroughs, culminating in the integration of digital and physical spaces [5][12]

Multimodal Intelligence and AGI
- Multimodal intelligence is seen as a necessary pathway to AGI, as it enables autonomous interaction with the external world beyond language alone [4]
- The ability to process and analyze different modalities of information is crucial for practical applications and achieving comprehensive value [4]

Development Pathway
- SenseTime's development strategy includes the early introduction of multimodal models and significant advancements in multimodal reasoning capabilities [5][8]
- The company has achieved a significant milestone by completing the training of a billion-parameter multimodal model, which ranks first in domestic evaluations [8]

Native Multimodal Training
- SenseTime has opted for native multimodal training, which integrates multiple modalities from the pre-training phase, as opposed to the more common adaptive training method [7][9]
- This approach allows a deeper understanding of the relationships between language and visual modalities, leading to a more cohesive model [7]

Model Architecture and Efficiency
- The architecture of the Riri Xin 6.5 model has been optimized for efficiency, allowing better processing of high-resolution images and long videos and achieving over three times the efficiency of previous models [11]
- The design philosophy emphasizes the distinction between visual perception and language processing, leading to a more effective model structure [11]

Challenges and Solutions in Embodied Intelligence
- Transitioning AI from digital to physical spaces requires addressing interaction-learning efficiency, which is facilitated by a virtual system that simulates real-world interactions [12]
- SenseTime's "world model" leverages extensive data to enhance simulation and generation capabilities, improving the training of intelligent driving systems [12]

Balancing Technology and Commercialization
- SenseTime views the pursuit of AGI as a long-term endeavor that requires a balance between technological breakthroughs and commercial viability [13]
- The company has established a three-pronged strategy focusing on infrastructure, models, and applications to create a positive feedback loop between technology and business [13][14]

Recent Achievements
- Over the past year, SenseTime has made significant progress in its foundational technology, achieving innovations such as native fusion training and multimodal reinforcement learning [14]
- The commercial landscape is rapidly expanding, with AI performance leading to increased deployment in various intelligent hardware and robotics applications [14]
SenseTime's Lin Dahua answers AGI in a 10,000-character essay: breaking through 4 walls, 3 major challenges
QbitAI· 2025-08-12 09:35
Core Viewpoint
- The article emphasizes the significance of "multimodal intelligence" as a key trend in the development of large models, highlighted during the WAIC 2025 conference, where SenseTime introduced its commercial-grade multimodal model, Riri Xin 6.5 [1][2].

Group 1: Importance of Multimodal Intelligence
- Multimodal intelligence is deemed essential for achieving Artificial General Intelligence (AGI), as it allows AI to interact with the world in a more human-like manner, processing various forms of information such as images, sounds, and text [7][8].
- The article discusses the limitations of traditional language models that rely solely on text data, arguing that true AGI requires the ability to understand and integrate multiple modalities [8].

Group 2: Technical Pathways to Multimodal Models
- SenseTime has identified two primary technical pathways for developing multimodal models: adapter-based training and native training; the latter is preferred as it integrates the different modalities from the outset [11][12].
- The company has committed significant computational resources to a "native multimodal" approach, moving away from a dual-track system of separate language and image models [10][12].

Group 3: Evolutionary Path of Multimodal Intelligence
- SenseTime outlines a "four-breakthrough" framework for the evolution of AI capabilities, covering sequence modeling, multimodal understanding, multimodal reasoning, and interaction with the physical world [13][22].
- The introduction of "image-text interleaved reasoning" is a key innovation that allows models to generate and manipulate images during the reasoning process, enhancing their cognitive capabilities [16][18].

Group 4: Data Challenges and Solutions
- Acquiring high-quality image-text pairs for training multimodal models is challenging; SenseTime has developed automated pipelines to generate these pairs at scale [26][27].
- SenseTime employs a rigorous "continuation validation" mechanism to ensure data quality, admitting only data that demonstrates a performance improvement into training (a toy version is sketched after this entry) [28][29].

Group 5: Model Architecture and Efficiency
- SenseTime emphasizes efficiency over sheer size in model architecture, optimizing its model to achieve over three times the efficiency while maintaining performance [38][39].
- The company believes future model development will prioritize performance-cost ratios rather than simply increasing parameter counts [39].

Group 6: Organizational and Strategic Insights
- SenseTime's success is attributed to its strong technical foundation in computer vision, which has provided deep insight into the value of multimodal capabilities [40].
- The company has restructured its research organization to enhance resource allocation and foster innovation, focusing on high-impact projects [41].

Group 7: Long-term Vision and Integration of Technology and Business
- The path to AGI is a long-term endeavor that requires a symbiotic relationship between technological ideals and commercial viability [42][43].
- SenseTime aims to create a virtuous cycle between foundational infrastructure, model development, and application, ensuring that real-world challenges inform research directions [43].
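The "continuation validation" mechanism is described only in outline; the sketch below shows the general pattern, admitting a candidate data batch into training only if briefly continuing training on it improves a held-out metric. The `train_step` and `evaluate` hooks are hypothetical stand-ins for a real training framework.

```python
import copy

def continuation_validate(model, candidate_batches, val_set,
                          train_step, evaluate):
    """Toy data filter: keep a candidate batch only if continuing training
    on it improves the held-out metric (higher is better). Both
    `train_step(model, batch)` and `evaluate(model, val_set)` are
    hypothetical hooks supplied by the surrounding framework."""
    accepted = []
    baseline = evaluate(model, val_set)
    for batch in candidate_batches:
        trial = copy.deepcopy(model)      # probe on a throwaway copy
        train_step(trial, batch)
        if evaluate(trial, val_set) > baseline:   # batch demonstrably helps
            train_step(model, batch)              # admit it into real training
            baseline = evaluate(model, val_set)
            accepted.append(batch)
    return accepted
```

Probing on a copy keeps low-quality batches from ever touching the production model, at the cost of one extra evaluation per candidate.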
o3's viral "guess the location from a picture" gameplay arrives on Doubao, and it's free for everyone
QbitAI· 2025-07-30 06:06
Core Viewpoint
- The article discusses the new visual-reasoning feature of the Doubao APP, which enhances its ability to analyze images and provide contextual information, making it a versatile tool for users [1][4][66].

Group 1: Doubao APP Features
- Doubao APP has upgraded its visual-reasoning capabilities, allowing it to analyze images and provide detailed contextual information, such as identifying locations and historical timelines [4][8].
- The app can perform image searches and use various image-analysis tools (zooming, cropping, rotating) to derive conclusions from images (a toy tool set is sketched after this entry) [7][50].
- Users can easily engage with the app by uploading images or taking photos to receive instant analysis and information [5][26].

Group 2: Practical Applications
- Doubao APP can help users identify objects or details within images, such as distinguishing AI-generated from real images [11][20].
- The app can also help with educational tasks, such as solving complex math problems, and has been validated against human solutions [40][43].
- It can extract structured data from financial reports and other documents, enhancing productivity in both personal and professional contexts [46][49].

Group 3: Industry Trends
- The article highlights a broader industry trend toward visual-reasoning capabilities, with major models like OpenAI's o3 and o4-mini leading the charge [68][70].
- The development of multimodal technologies supports the integration of visual reasoning into various applications, addressing both industry needs and user demands [72][75].
- The increasing prevalence of mixed-media information necessitates advanced visual reasoning to improve information processing and understanding [76].
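Doubao's internal tooling is not public; as a rough illustration of the zoom, crop, and rotate operations the entry describes, here is a minimal Pillow-based tool set of the kind a visual-reasoning loop could dispatch to. The function names and the dispatch table are assumptions, not Doubao's implementation.

```python
from PIL import Image

def crop(img, box):               # box = (left, top, right, bottom) in pixels
    return img.crop(box)

def zoom(img, factor):            # enlarge so small details become legible
    w, h = img.size
    return img.resize((int(w * factor), int(h * factor)), Image.LANCZOS)

def rotate(img, degrees):         # counter-clockwise, canvas expanded to fit
    return img.rotate(degrees, expand=True)

# Hypothetical dispatch table a visual-reasoning loop might call into
# when the model decides a region of the image needs closer inspection.
TOOLS = {"crop": crop, "zoom": zoom, "rotate": rotate}

img = Image.new("RGB", (800, 600), "white")   # stand-in for an uploaded photo
patch = TOOLS["zoom"](TOOLS["crop"](img, (100, 100, 300, 250)), 2.0)
print(patch.size)                              # (400, 300)
```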