Multimodal Intelligence
Major internet companies scramble for talent, with annual salaries as high as 1.28 million yuan
21世纪经济报道· 2026-02-06 14:52
Core Viewpoint
- The article discusses the intense competition among major internet companies, particularly Tencent, in attracting top AI talent through high salaries and innovative scholarship programs, highlighting the industry's talent scarcity and the strategic investments being made in AI research and development [1][4].

Group 1: Talent Acquisition Strategies
- Tencent is actively recruiting AI talent with high salaries for various positions, such as over 750,000 yuan for user operation roles and nearly 1,000,000 yuan for AI application engineers [1].
- The "Qingyun Plan" is Tencent's initiative aimed at attracting top technical students globally, similar to ByteDance's Top Seed talent program [1].
- The "Qingyun Scholarship" offers significant financial incentives, including 500,000 yuan per recipient, to support students in AI and computer science fields [2].

Group 2: Investment in Research and Development
- Tencent's R&D expenditure reached a record high of 22.82 billion yuan in Q3 2025, with a total of 61.983 billion yuan spent in the first three quarters of 2025 [4].
- The company emphasizes the importance of computational resources for top PhD students, providing cloud heterogeneous computing resources as part of the scholarship [4].

Group 3: Recruitment of Established Talent
- Tencent is also accelerating the recruitment of established AI experts, as evidenced by the hiring of prominent figures like Pang Tianyu and Yao Shunyu, who have significant academic and industry experience [5].
- The establishment of new departments within Tencent, such as AI Infra and AI Data, aims to enhance its capabilities in large-model research and development [5].

Group 4: Academic Collaboration and Knowledge Sharing
- Tencent launched its technical blog to share research findings, marking a step towards increasing its academic influence and transparency in AI technology [6].
Even the strongest large models' visual abilities fall short of a 6-year-old's
36Kr· 2026-01-22 13:10
Core Insights
- The current state of visual reasoning in AI models, particularly Gemini 3 Pro Preview, is still significantly below human capabilities, with a performance level comparable to a three-year-old child and a gap of roughly 20 percentage points from six-year-olds [1][7][4].
- Gemini 3 Pro Preview is considered the leading model among existing AI systems, outperforming others like GPT-5.2 and Claude 4.5 Opus, which perform even worse than a three-year-old [5][10].
- The research highlights the limitations of current visual reasoning models, emphasizing the need for a fundamental reconstruction of visual capabilities rather than reliance on language-based translations [7][19].

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with a score of 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [10].
- Other models such as Qwen3-VL-Plus, Grok-4, and Claude-4.5 Opus scored significantly lower, at 19.2%, 16.2%, and 14.2% respectively [11].
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved a score of 22.2%, indicating that even the largest open-source models cannot compete with top closed-source systems [12][13].

Challenges in Visual Reasoning
- The research identifies four core challenges faced by multimodal large language models (MLLMs) in visual reasoning:
  1. **Fine-grained Discrimination**: difficulty in detecting subtle visual differences [19]
  2. **Visual Tracking**: inability to maintain perceptual consistency over long distances [22]
  3. **Spatial Perception**: challenges in constructing stable three-dimensional representations from two-dimensional images [28]
  4. **Visual Pattern Recognition**: struggles in generalizing rules from limited visual examples [34]

Proposed Solutions
- The study suggests two potential directions for improving visual reasoning capabilities:
  1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: this approach showed an overall accuracy improvement of approximately 4.8 percentage points after fine-tuning, particularly on fine-grained discrimination and spatial perception tasks (a minimal sketch of the reward idea follows this summary) [36]
  2. **Generative Modeling**: the introduction of BabyVision-Gen evaluated three advanced visual generative models, with NanoBanana-Pro achieving the highest accuracy of 18.3% [38][39]

Future Trends
- The research indicates a shift towards unified architectures that bypass the "language bottleneck," allowing for high-fidelity visual representations during reasoning processes [44].
- Models like Bagel, Sora 2, and Veo 3 demonstrate the potential for generative methods to serve as advanced forms of reasoning, emphasizing the importance of maintaining visual integrity in AI systems [44].
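RLVR is only named in the summary above; as a rough illustration of the general idea (and not the BabyVision team's actual setup), the sketch below shows a verifiable reward for a multiple-choice visual reasoning item: the model's free-form response is parsed for a final option letter and rewarded only if it matches the reference answer. The prompt convention, parsing rule, and function name are assumptions for illustration.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the reference option, else 0.0.

    Assumes the model is asked to end its response with a line like "Answer: B";
    both that convention and the parsing rule are illustrative, not taken from
    the BabyVision paper.
    """
    match = re.search(r"Answer:\s*([A-D])", model_output, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == ground_truth.upper() else 0.0

# Binary, automatically checkable rewards like this are what makes RLVR
# "verifiable": a policy-gradient method (e.g. PPO or GRPO) can fine-tune the
# model on them without training a separate reward model.
print(verifiable_reward("The two tiles differ in the lower-left corner. Answer: C", "C"))  # 1.0
```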
Even the strongest large models' visual abilities fall short of a 6-year-old's
量子位· 2026-01-22 11:13
Core Insights
- The current state of visual reasoning in AI models is still significantly behind human capabilities, with the best model, Gemini 3 Pro Preview, only slightly outperforming a three-year-old child and lagging roughly 20 percentage points behind a six-year-old child (the simple arithmetic behind such baseline comparisons is sketched after this summary) [2][10].
- Gemini 3 Pro Preview posts the highest score among existing models at 49.7%, while other leading models like GPT-5.2 and Claude 4.5 Opus show even poorer results [6][14].
- The article emphasizes the need for future models to rebuild visual capabilities from the ground up rather than relying on language-based translations of visual problems [11].

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [14].
- Other models such as Qwen3-VL-Plus, Grok-4, and Claude-4.5-Opus scored significantly lower, indicating a general underperformance on visual reasoning tasks [15].
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved a score of 22.2%, still far behind the top closed-source systems [16].

Challenges in Visual Reasoning
- The article identifies four core challenges faced by multimodal large language models (MLLMs) in visual reasoning:
  1. **Lack of Non-verbal Fine Details**: MLLMs struggle to accurately describe fine visual details that cannot be easily expressed in language [25]
  2. **Loss of Manifold Consistency**: MLLMs often fail to maintain perceptual consistency over long distances, leading to errors in tasks involving spatial relationships [31]
  3. **Spatial Imagination**: MLLMs have difficulty constructing stable three-dimensional representations from two-dimensional images, which limits their ability to perform mental transformations [39]
  4. **Visual Pattern Induction**: MLLMs tend to focus on counting attributes rather than understanding the underlying changes in visual examples, limiting their ability to generalize from few examples [47]

Proposed Solutions
- The research suggests two potential directions for improving visual reasoning:
  1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: this approach showed an overall accuracy improvement of 4.8 percentage points after fine-tuning, particularly on fine-grained discrimination and spatial perception tasks [56][58]
  2. **Generative Model Approaches**: the study introduces BabyVision-Gen, which evaluates generative models like NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit, noting that while success rates are still low, some models exhibit explicit visual thinking capabilities [60][62]

Future Directions
- The article concludes that overcoming the "language bottleneck" in visual reasoning is crucial, advocating unified architectures that retain high-fidelity visual representations during reasoning [68][70].
- Models like Bagel and Sora 2 demonstrate the potential for generative methods to serve as advanced forms of reasoning, emphasizing the importance of robust visual semantic understanding [71].
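Both write-ups compare model accuracy against child baselines ("slightly above a three-year-old, about 20 points behind a six-year-old"). The arithmetic behind such statements is simple aggregation over task categories; the sketch below uses invented per-category numbers purely to show the calculation, not the figures reported by BabyVision.

```python
from statistics import mean

# Illustrative per-category accuracies; these values are made up for the
# example and are NOT the numbers reported in the BabyVision evaluation.
model_scores = {
    "fine_grained_discrimination": 0.55,
    "visual_tracking": 0.42,
    "spatial_perception": 0.51,
    "visual_pattern_induction": 0.51,
}
baselines = {"3-year-old": 0.47, "6-year-old": 0.70}  # also illustrative

overall = mean(model_scores.values())
print(f"overall accuracy: {overall:.1%}")
for age, base in baselines.items():
    print(f"gap vs. {age}: {(overall - base) * 100:+.1f} percentage points")
```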
"Almost all large models have weaker visual abilities than a 3-year-old child"
Guan Cha Zhe Wang· 2026-01-12 12:21
Core Insights
- The latest evaluation results from the BabyVision multimodal understanding assessment indicate that most leading multimodal models perform significantly below the level of a 3-year-old child on visual tasks, with only one model barely exceeding the 3-year-old baseline [1][4].

Group 1: Evaluation Results
- The BabyVision-Mini test included 20 visual-centric tasks designed to minimize language dependency, with answers requiring solely visual information [4].
- The results showed that most top models scored well below the average level of 3-year-old children, with the best-performing model, Gemini3-Pro-Preview, only slightly surpassing the 3-year-old baseline but still lagging approximately 20 percentage points behind 6-year-olds [4][9].

Group 2: Model Performance
- In the BabyVision-Full evaluation, human participants with undergraduate backgrounds achieved an accuracy rate of 94.1%, while the best-performing model, Gemini3-Pro-Preview, only reached 49.7% accuracy [8][9].
- Open-source models performed even worse, with the strongest model scoring below 22.2% and other models scoring between 12% and 19% [9].

Group 3: Systemic Visual Capability Deficiencies
- The evaluation highlighted four major categories of visual capability deficiencies in large models: fine discrimination, visual tracking, spatial perception, and visual pattern recognition, indicating a systemic lack of foundational visual abilities [10].
- The challenges faced by models include the inability to process non-verbal details, difficulties in trajectory tracking, lack of spatial imagination, and issues with inductive reasoning from visual patterns [12][14][16].

Group 4: Implications for Future Development
- The research team noted that many test questions possess "unspeakable" characteristics, meaning they cannot be fully expressed in language without losing critical information, which leads to reasoning errors in models [18].
- The team suggests that future models must fundamentally rebuild visual capabilities rather than relying on language reasoning, as a robot with visual abilities below those of a 3-year-old would struggle to assist humans reliably in the physical world [20].
Major breakthrough in long-text retrieval: a new model from a China Unicom team improves accuracy by nearly 20%
Sou Hu Cai Jing· 2025-12-02 20:15
Core Viewpoint
- HiMo-CLIP is a new AI model developed by China Unicom's Data Science and Artificial Intelligence Research Institute, designed to improve the accuracy of image retrieval by automatically identifying the key information in complex descriptions, addressing the common issue of "too much detail leading to errors" in AI processing [2][7][21].

Group 1: Model Features
- HiMo-CLIP utilizes a specialized module called HiDe, which employs statistical methods to extract the most distinguishing features from similar descriptions, enhancing the model's ability to focus on key attributes [7][8].
- The model achieves an accuracy rate of 89.3%, significantly improving upon previous methods that relied on fixed templates or manual annotations [8].
- HiMo-CLIP's implementation is efficient, requiring minimal hardware resources, with only a 7% increase in inference speed on A100 GPUs, making it accessible for standard servers [10][11].

Group 2: Performance Metrics
- The model incorporates a dual alignment mechanism known as MoLo loss, which ensures that both the overall semantic meaning and core feature matching are prioritized, preventing the "more detail, more errors" phenomenon (a rough sketch of such a dual-alignment objective follows this summary) [11][13].
- In tests on the MSCOCO-Long dataset, HiMo-CLIP's mean Average Precision (mAP) improved by nearly 20% compared to the previous Long-CLIP model, while the model retained 98.3% of its original performance on short-text datasets like Flickr30K [13].

Group 3: Practical Applications
- HiMo-CLIP has already been applied in real-world scenarios, such as enhancing product search on JD.com, where handling complex user descriptions led to a 27% increase in search conversion rates [14][15].
- The model is also being explored in the autonomous driving sector to interpret complex road descriptions, improving environmental recognition for vehicle systems [18].

Group 4: Future Developments
- The team plans to release a multilingual version of HiMo-CLIP by Q3 2026, aiming to handle specialized terminology and foreign-language descriptions more effectively [21].
- The success of HiMo-CLIP highlights the importance of simulating human cognitive logic in AI models, suggesting a potential new direction for multimodal intelligence development through structured semantic spaces [21].
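The MoLo loss is described only at a high level (align both the whole description and its distilled key attributes with the image). The sketch below is a minimal, hypothetical reading of such a dual-alignment objective as a weighted sum of two CLIP-style contrastive terms; the tensor layout, the weighting parameter `alpha`, and the function names are assumptions for illustration, not HiMo-CLIP's published formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard CLIP-style symmetric contrastive loss over a batch of pairs."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def dual_alignment_loss(img_emb, full_text_emb, key_attr_emb, alpha: float = 0.5):
    """Hypothetical MoLo-style objective: align the image with both the whole
    caption and the distilled key-attribute text. `alpha` balances the two
    terms and is an assumption, not a value from the HiMo-CLIP paper."""
    return (1 - alpha) * info_nce(img_emb, full_text_emb) + alpha * info_nce(img_emb, key_attr_emb)

# Shapes only: 8 image/text pairs with 512-dimensional embeddings.
loss = dual_alignment_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```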
Xiaohongshu proposes DeepEyesV2: from "thinking over images" to "tool collaboration," exploring a new dimension of multimodal intelligence
量子位· 2025-11-13 00:49
Core Insights
- DeepEyesV2 is a significant upgrade over its predecessor, DeepEyes, extending its capabilities from merely recognizing details to actively solving complex problems through multi-tool collaboration [3][12].

Multi-Tool Collaboration
- Traditional multimodal models are limited in their ability to actively utilize external tools, often functioning as passive information interpreters [4].
- DeepEyesV2 addresses two main pain points: weak tool-invocation capabilities and a lack of collaboration among different functions [5][8].
- The model can now perform complex tasks by integrating image search, text search, and code execution in a cohesive manner [12][18].

Problem-Solving Process
- DeepEyesV2's problem-solving process involves three steps: image search for additional information, text search for stock price data, and code execution to retrieve and calculate financial data [15][16][17].
- The model demonstrates advanced reasoning capabilities, allowing it to tackle intricate queries effectively [14].

Model Features
- DeepEyesV2 incorporates programmatic code execution and web retrieval as external tools, enabling dynamic interaction during reasoning [22].
- The model generates executable Python code or web search queries as needed, enhancing its analytical capabilities (a minimal sketch of such a tool-use loop follows this summary) [23][27].
- This integration results in improved flexibility in tool invocation and a more robust multimodal reasoning framework [28].

Training and Development
- The development of DeepEyesV2 involved a two-phase training strategy: a cold start to establish foundational tool usage, followed by reinforcement learning for optimization [37][38].
- The team created a new benchmark, RealX-Bench, to evaluate the model's performance in real-world scenarios requiring the integration of multiple capabilities [40][41].

Performance Evaluation
- DeepEyesV2 outperforms existing models in accuracy, particularly on tasks requiring the integration of multiple capabilities [45].
- The model's performance metrics indicate a significant improvement over open-source models, especially in complex problem-solving scenarios [46].

Tool Usage Analysis
- The model exhibits a preference for specific tools based on task requirements, demonstrating adaptive reasoning capabilities [62].
- After reinforcement learning, the model shows a reduction in unnecessary tool calls, indicating improved efficiency in reasoning [67][72].

Conclusion
- The advancements in DeepEyesV2 highlight the importance of integrating tool invocation with reasoning processes, showcasing superior problem-solving abilities across various domains [73][75].
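The summary states that DeepEyesV2 decides during reasoning whether to call image search, web search, or a Python interpreter, then folds the result back into its context. The control loop below is a minimal sketch of that general agent pattern; the tool tag format, tool names, and `call_model` interface are assumptions for illustration, not DeepEyesV2's actual API.

```python
import re

def run_agent(question: str, call_model, tools: dict, max_steps: int = 6) -> str:
    """Minimal tool-use loop: scan the model's reply for a tool request, run the
    tool, append the result to the context, and continue until a final answer
    is produced or the step budget runs out."""
    context = question
    for _ in range(max_steps):
        reply = call_model(context)
        request = re.search(r'<tool name="(\w+)">(.*?)</tool>', reply, flags=re.DOTALL)
        if request is None:
            return reply  # no tool requested: treat the reply as the final answer
        name, arg = request.group(1), request.group(2).strip()
        result = tools[name](arg) if name in tools else f"unknown tool: {name}"
        context += f"\n{reply}\n<tool_result>{result}</tool_result>"
    return "step budget exhausted"

# Hypothetical tool registry mirroring the three capabilities named above.
tools = {
    "image_search": lambda query: f"[top image results for: {query}]",
    "web_search": lambda query: f"[top web results for: {query}]",
    "python": lambda code: str(eval(code, {"__builtins__": {}}, {})),  # toy sandbox for arithmetic only
}
```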
Tencent Research Institute AI Digest 20251111
腾讯研究院· 2025-11-10 16:30
Group 1: Generative AI Developments
- The OpenRouter platform has launched the anonymous model Polaris Alpha, believed to be a variant of GPT-5.1, with a knowledge cutoff of October 2024, a maximum context capacity of 256K, and a single-output limit of 128K [1]
- Polaris Alpha shows smooth performance in desk work and programming tasks, exhibiting typical GPT characteristics and supporting NSFW mode [1]
- The model is currently available for free via API, demonstrating good performance in programming mini-games and web design, with GPT-5.1 expected to be officially released in mid-November [1]

Group 2: Multi-Modal Intelligence
- A new multimodal paradigm called Cambrian-S has been proposed by researchers including Yann LeCun, focusing on "spatial super-perception" and marking the first step in exploring video spatial super-perception [2]
- The research outlines a development path for multimodal intelligence across four levels: semantic perception, streaming event cognition, 3D spatial cognition, and predictive world modeling, introducing the VSI-SUPER benchmark for spatial super-perception capabilities [2]
- Cambrian-S uses latent-variable frame prediction to manage memory and segment events through a "surprise" signal, outperforming Gemini in spatial cognition tasks with smaller models (a minimal sketch of the surprise idea follows this digest) [2]

Group 3: AI Programming Tools
- Meituan has launched an AI IDE programming tool named CatPaw, featuring code completion, agent Q&A generation, built-in browser preview debugging, and project-level analysis [3]
- The core engine of CatPaw is Meituan's self-developed LongCat model, fully compatible with major programming languages like Python, C++, and Java, and currently available for free [3]
- Over 80% of weekly active users among Meituan's internal developers use CatPaw, with AI-generated code accounting for about 50% of new code submissions, and a Windows version expected to launch soon [3]

Group 4: Domestic AI IDE Launch
- YunSi Intelligence has introduced Vinsoo, the world's first AI IDE equipped with a cloud-based security agent, surpassing products like Cursor and Codex that utilize Claude [4]
- Vinsoo achieves breakthroughs in long-context engineering algorithms, supporting effective context lengths in the millions and allowing up to eight intelligent agents to operate simultaneously [4]
- The new Beta 3.0 version supports cloud-based one-click publishing, mobile usage, and team collaboration, and the project is led by a founding team of post-00s graduates from top universities in China and the U.S. [4]
Group 5: Open-Source Audio Editing Model
- Jieyue Xingchen has released the first open-source LLM-grade audio editing model, Step-Audio-EditX, which allows precise control over audio emotions, speaking styles, and paralinguistic features through language commands [5]
- The model employs a unified LLM framework and a "dual-codebook" audio tokenizer structure, supporting zero-shot text-to-speech, iterative editing, and bilingual capabilities [5]
- With approximately 3 billion parameters, the model can run on a single 32GB GPU, achieving higher accuracy in emotion and style control than closed-source models like MiniMax and Doubao [5]

Group 6: AI Glasses Launch
- Baidu has officially launched the Xiaodu AI glasses Pro, priced at 2,299 yuan (with a promotional price of 2,199 yuan for Double Eleven), weighing 39 grams and featuring a 12-megapixel wide-angle camera [6]
- The glasses integrate multimodal AI models, offering photography, music recognition, AI translation, object recognition, note-taking, and audio recording, with real-time translation capabilities [6]
- Like Xiaomi's AI glasses, these are not the more advanced AI+AR glasses currently on the market [6]

Group 7: Robotics Innovation
- Galaxy General has introduced DexNDM, a dexterous-hand neural dynamics model that achieves stable, multi-axis rotation of various objects and can use tools like screwdrivers and hammers [8]
- DexNDM models hand-object interaction down to the joint level, using a training process that enables stable operation across tasks and object shapes without requiring successful demonstrations [8]
- The technology has been applied to teleoperation systems, letting operators give high-level commands via VR controllers while DexNDM autonomously manages fine control at the finger level [8]

Group 8: Insights on AI Entrepreneurship
- A YC partner emphasizes that AI tools cannot replace a founder's sales capabilities, suggesting that AI should first target quick-to-implement entry points in traditional industries rather than aiming for full automation [9]
- The core competitive advantage in early-stage entrepreneurship is "learning speed" rather than scale, with a focus on quickly validating ideas with small customers [9]
- AI sales development representatives (SDRs) are effective only when well-functioning sales processes already exist, and founders must clarify their target audience and attention-acquisition strategies for AI tools to be effective [9]
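Cambrian-S's use of a "surprise" signal from latent frame prediction (Group 2 above) is described in only one sentence; the sketch below illustrates the general idea of surprise-driven event segmentation, where a new memory segment starts whenever the predictor's error on the next frame exceeds a threshold. The predictor, threshold, and data are all stand-ins, not the paper's actual components.

```python
import numpy as np

def segment_by_surprise(latents: np.ndarray, predict_next, threshold: float = 1.0):
    """Split a stream of frame latents into events wherever the prediction error
    ("surprise") exceeds a threshold. `predict_next` stands in for a learned
    latent frame predictor; it and the threshold are illustrative only."""
    boundaries = [0]
    for t in range(1, len(latents)):
        surprise = float(np.linalg.norm(latents[t] - predict_next(latents[:t])))
        if surprise > threshold:
            boundaries.append(t)  # high surprise: start a new event / memory segment
    return boundaries

# Toy usage: a "predictor" that simply repeats the previous latent, applied to a
# stream whose statistics shift abruptly at frame 20.
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 0.05, (20, 8)), rng.normal(3, 0.05, (20, 8))])
print(segment_by_surprise(stream, predict_next=lambda history: history[-1]))  # -> [0, 20]
```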
On the scene at the China International Import Expo (CIIE)
Zheng Quan Ri Bao· 2025-11-06 15:49
Group 1: AI as a Driving Force
- The eighth China International Import Expo (CIIE) has seen AI transition from a "technology showcase" to a key driver of industrial transformation, with over 400 AI-related innovations on display [2][3]
- AI applications are now penetrating sectors including healthcare, industry, retail, and transportation, showing AI's evolution from mere demonstrations to practical tools [2][7]

Group 2: Innovations in AI Applications
- Siemens showcased an AI surgical solution and a three-dimensional collaboration platform that integrates AI with digital-twin technology, emphasizing practical applications in industrial settings [3][8]
- The introduction of humanoid robots and intelligent robotic arms at the expo highlights advances in embodied intelligence, with companies like Zhiyuan Innovation demonstrating multimodal interaction capabilities [4][6]

Group 3: AI in Healthcare
- AI technologies in healthcare have been extensively deployed, with companies like Siemens and Maizhao Health Technology presenting comprehensive solutions from diagnosis to treatment [7][8]
- The "AI Magic Mirror" by Maizhao Health can analyze health indicators with 90% accuracy, indicating significant advances in health-monitoring technology [7]

Group 4: AI in Retail and Industry
- AI is positioned as a strategic asset for the retail sector, with the potential to increase annual operating profits by $310 billion by 2030 if scaled effectively [4][5]
- The global robotics market is projected to exceed $400 billion by 2029, with embodied intelligent robots expected to capture over 30% of the market [6]

Group 5: China's Role in AI Development
- China's vast market and diverse application scenarios are seen as critical for the industrialization of AI technologies, with companies like AMD and Qualcomm emphasizing the importance of collaboration and innovation [10][11]
- The CIIE serves as a significant platform for bringing global technology into application, with over 3,000 new products and services showcased in previous editions, indicating China's growing role as an innovation catalyst [11]
Zhiyuan's Wujie·Emu3.5 released, ushering in "next-state prediction"! Wang Zhongyuan: it may open a third scaling paradigm
AI前线· 2025-11-01 05:33
Core Insights
- The article discusses the launch of the world's first native multimodal world model, Emu3, by the Zhiyuan Research Institute; the model predicts the next token without diffusion models or combination methods, achieving a unified approach to images, text, and video [2]
- Emu3.5, released a year later, enhances the model's capabilities by simulating human natural learning and achieving generalized world-modeling ability through Next-State Prediction (NSP) [2][3]
- The core of the world model is the prediction of the next spatiotemporal state, which is crucial for embodied intelligence (a toy sketch of this autoregressive objective follows this summary) [2]

Model Features
- Emu3.5 has three main characteristics: understanding high-level human intentions and generating detailed action paths, seamless integration of world understanding, planning, and simulation, and providing a cognitive foundation for generalized interaction between AI and humans or physical environments [3]
- The model's architecture allows for the integration of visual and textual tokens, enhancing its scalability and performance [8]

Technological Innovations
- Emu3.5 underwent two phases of pre-training on approximately 13 trillion tokens, focusing on visual-resolution diversity and data quality, followed by supervised fine-tuning on 150 billion samples [12][13]
- A large-scale native multimodal reinforcement learning system was developed, featuring a comprehensive reward system that balances multiple quality standards and avoids overfitting [14]
- The introduction of DiDA technology accelerated inference by roughly 20 times, allowing the autoregressive model to compete with diffusion models in performance [17][19]

Industry Impact
- The evolution from Emu3 to Emu3.5 demonstrates the potential for scaling in the multimodal field, similar to the advances seen in language models [6]
- Emu3.5 represents a significant original innovation in the AI large-model field, combining algorithmic, engineering, and data-training innovations [9]
- The model's ability to understand causal relationships and spatiotemporal dynamics positions it uniquely in the landscape of AI models, potentially opening a new avenue for large models [20]
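Next-State Prediction is summarized above only as "predicting the next spatiotemporal state" over a unified token stream. As a rough illustration of what a unified autoregressive objective over interleaved text and image-patch tokens looks like (explicitly not Emu3.5's actual architecture, vocabulary, or scale), a toy next-token training step might be set up as follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: one shared vocabulary holds both text tokens and
# discretized image-patch tokens, so "next-state prediction" reduces to
# ordinary next-token prediction over the interleaved stream.
VOCAB_SIZE, D_MODEL = 4096, 256

class ToyNextStatePredictor(nn.Module):
    """Tiny decoder-only model; an illustration of the unified autoregressive
    objective described in the article, not Emu3.5's real architecture."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)

# One training step: predict token t+1 from tokens <= t, regardless of whether
# the target token encodes a word or an image patch.
model = ToyNextStatePredictor()
seq = torch.randint(0, VOCAB_SIZE, (2, 128))            # interleaved text/vision stream
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), seq[:, 1:].reshape(-1))
loss.backward()
```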
AI is no longer about "showing off": Taobao wants technology to solve every concrete user problem
机器之心· 2025-10-28 04:31
Core Viewpoint
- The article discusses the transformative impact of generative AI on productivity and the evolution of e-commerce, focusing on Alibaba's Taobao and its advances in AI technology [2][6][11]

Group 1: AI Technology Evolution
- The evolution of AI technology has accelerated, leading to the emergence of various models and applications, with a focus on multimodal capabilities [3][11]
- Taobao has integrated AI deeply into its operations, upgrading its AIGX technology system to cover all necessary e-commerce scenarios [3][11]
- The introduction of generative AI is expected to bring a generational leap in productivity, with multimodal intelligence becoming a core technology [11][12]

Group 2: Taobao's AI Innovations
- Taobao launched RecGPT, a recommendation model with 100 billion parameters, enhancing the user experience with personalized recommendations [14][21]
- The generative recommendation algorithm can create new content based on user preferences, moving beyond traditional recommendation systems (a minimal sketch of the generative-recommendation idea follows this summary) [16][20]
- The AI-driven video generation model, Taobao Star, automates the creation of promotional videos, significantly reducing content production costs for merchants [25][27]

Group 3: Open Source and Industry Impact
- Taobao has open-sourced its reinforcement learning framework ROLL, aimed at improving user experience and enhancing model training efficiency [38][39]
- The company is gradually releasing its validated capabilities to the external market, fostering industry growth towards a "superintelligent" era [39][40]
- The rapid growth in the complexity AI can handle and the reduction in error rates suggest that narrow AGI could be achieved within 5-10 years [40]
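Generative recommendation, as attributed to RecGPT above, means decoding item identifiers token by token from a user's interaction history rather than scoring a fixed candidate list. The sketch below shows the shape of that data flow with a toy "semantic ID" codebook and a stubbed generator; the item encoding, function names, and example data are assumptions for illustration, not Taobao's implementation.

```python
# Toy "semantic ID" codebook: each item is a short sequence of discrete codes,
# one common way generative recommenders represent items. Everything here is
# illustrative; RecGPT's actual item tokenization is not described in the article.
ITEM_CODES = {"running_shoes": (7, 42), "rain_jacket": (7, 99), "yoga_mat": (3, 15)}
CODE_TO_ITEM = {codes: item for item, codes in ITEM_CODES.items()}

def recommend(history, generate_codes, k=3):
    """Serialize the user's history into item codes, ask an autoregressive model
    for k continuations, and map the decoded codes back to item IDs."""
    prompt = [code for item in history for code in ITEM_CODES[item]]
    return [CODE_TO_ITEM.get(tuple(codes), "<unknown>") for codes in generate_codes(prompt, k)]

def dummy_generate(prompt, k):
    """Stub standing in for the trained model, just to show the data flow."""
    return [(7, 42), (3, 15), (7, 99)][:k]

print(recommend(["rain_jacket"], dummy_generate, k=2))  # -> ['running_shoes', 'yoga_mat']
```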