量子位
Xiaomi's power move! It claims the chart-topping mystery model Hunter Alpha, and even the father of the lobster couldn't help asking around
量子位· 2026-03-19 01:02
Core Viewpoint
- The article reveals that the mysterious model Hunter Alpha, which topped the OpenRouter call-volume leaderboard, is actually Xiaomi's flagship model MiMo-V2-Pro, not GPT or DeepSeek [1][4][7].

Group 1: Model Announcement
- Xiaomi officially announced three new models in the MiMo-V2 family: Pro, Omni, and TTS [2].
- MiMo-V2-Pro is identified as the previously known Hunter Alpha [4].

Group 2: Model Specifications
- MiMo-V2-Pro has a parameter scale of 1 trillion, supports long contexts of up to 1 million tokens, and excels in real-world task scenarios [9].
- The model ranked eighth globally and second domestically in the Artificial Analysis intelligence index [10].
- Its coding capabilities outperform Claude 4.6 Sonnet; it can generate complex code such as a 3D tower-defense game using Three.js [11].

Group 3: Technical Architecture
- The model's total parameter count exceeds 1 trillion, with 42 billion active parameters and a 1-million-token context window, roughly three times the scale of MiMo-V2-Flash [16].
- It employs a hybrid attention mechanism with an improved 7:1 ratio, enhancing both scale and inference efficiency [17].
- A lightweight MTP (multi-token prediction) layer keeps generation fast even at 1-million-token contexts [18].

Group 4: Resource Management System
- Xiaomi's AI team collaborated with Peking University to develop the ARL-Tangram resource management system, which significantly reduces training time and cuts computational costs by 71.2% [19][20].

Group 5: Performance Metrics
- MiMo-V2-Pro scored 84.0 on PinchBench and 61.5 on ClawEval, surpassing Gemini 3 Pro and approaching Claude Opus 4.6 [24].
- In coding, it scored 86.7 on SWE-bench Verified, exceeding Claude 4.6 Sonnet [25].
- Its total call volume on OpenRouter reached 310 billion tokens, leading the leaderboard [26][27].
Group 6: Other Models
- MiMo-V2-Omni integrates image, video, and audio encoders into a single network, enabling it to perceive and act like a human [33].
- MiMo-V2-TTS is designed to give agents emotional voice capabilities, allowing natural-language control over tone and emotion [35].
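A 7:1 hybrid attention ratio of the kind described above typically means interleaving several efficient attention layers (e.g. sliding-window or linear attention) with an occasional full-attention layer. A minimal sketch of such a layer schedule, with all names invented for illustration rather than taken from MiMo-V2:

```python
# Sketch of a 7:1 hybrid attention layer schedule: for every 8 layers,
# 7 use a cheap local/linear attention and 1 uses full global attention.
# All names here are illustrative assumptions, not MiMo-V2's real code.

def hybrid_schedule(num_layers: int, ratio: int = 7) -> list[str]:
    """Return an attention type per layer: `ratio` efficient layers,
    then one full-attention layer, repeated."""
    schedule = []
    for i in range(num_layers):
        if (i + 1) % (ratio + 1) == 0:  # every (ratio+1)-th layer is global
            schedule.append("full")
        else:
            schedule.append("efficient")
    return schedule

if __name__ == "__main__":
    sched = hybrid_schedule(16)
    # At a 7:1 ratio, only 2 of 16 layers pay full quadratic attention.
    print(sched.count("efficient"), sched.count("full"))  # → 14 2
```

Keeping the quadratic-cost layers this sparse is what makes a 1-million-token context window tractable at inference time.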
Tencent trains a vision encoder from a text-only LLM, masters charts and long videos, and reaches open-source small-model SOTA!
量子位· 2026-03-19 01:02
Core Viewpoint
- Tencent has introduced Penguin-VL, a model that breaks away from traditional multimodal approaches by initializing a vision encoder directly from a text-only LLM, demonstrating strong performance in complex tasks like document understanding and long-video temporal localization [1][2][3].

Group 1: Model Architecture and Training
- Penguin-VL challenges the conventional recipe of a traditional visual backbone feeding a language model, proposing instead that a vision encoder can be effectively initialized from a text-only LLM [5][15].
- The Penguin-Encoder inherits capabilities and an architectural foundation better suited to sequence modeling, bringing the visual and language representation spaces closer together [18][19].
- Key modifications include changing causal attention to bidirectional attention and introducing 2D-RoPE to better handle two-dimensional positional information in images and videos [21][22].

Group 2: Training Stages and Performance
- Training proceeds in three stages: initial training of the Penguin-Encoder, VLM pre-training, and supervised fine-tuning to align capabilities with user tasks [28][30][31].
- The model performs competitively across benchmarks: the 2B model posts notable results on InfoVQA, ChartQA, and DocVQA, and the 8B model maintains strong performance on the same tasks [36][39][41].
- The Penguin-Encoder outperformed several larger models on average score, indicating that initialization from an LLM is a viable path to effective vision encoders [44][45].

Group 3: Implications and Future Directions
- The findings suggest future vision encoders need not originate from traditional visual models but can also emerge from more general language models, signaling a shift in modeling approaches within the industry [47][49].
- This trend aligns with recent work like DeepSeek-OCR2, which also explores more unified modeling methods and moves away from the familiar multimodal-stitching route [48].
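The two key modifications above, switching causal attention to bidirectional and moving from 1D to 2D rotary positions, can be sketched in plain Python. This is an illustrative reconstruction of the general idea, not Tencent's actual Penguin-VL code:

```python
# Sketch of the two architectural tweaks described above: flipping a
# causal attention mask to bidirectional, and assigning each image patch
# a 2D (row, col) rotary position instead of a single 1D index.
# Pure-Python illustration; not the Penguin-VL implementation.

def causal_mask(n):
    """Standard LLM mask: token i attends only to tokens j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Vision-encoder mask: every patch attends to every patch."""
    return [[True] * n for _ in range(n)]

def rope_2d_positions(h, w):
    """2D-RoPE position ids: each patch gets a (row, col) pair,
    preserving the image's spatial layout for the rotary embedding."""
    return [(r, c) for r in range(h) for c in range(w)]

if __name__ == "__main__":
    print(causal_mask(3))          # lower-triangular
    print(rope_2d_positions(2, 3))  # → [(0,0),(0,1),(0,2),(1,0),(1,1),(1,2)]
```

The point of the (row, col) pair is that a patch directly below another is "near" it positionally, which a flattened 1D index would obscure.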
The whole industry is busy "eating shrimp"; MiniMax M2.7 already has the shrimp picking up the chopsticks itself
量子位· 2026-03-18 11:32
Core Insights
- MiniMax has officially announced the new M2.7 model, which significantly enhances its capabilities in complex tasks and agent-team collaboration [2].
- M2.7 makes qualitative leaps in reasoning and engineering ability, and can independently troubleshoot production-line issues [3].
- M2.7 can autonomously build its own Agent Harness, integrating thinking with execution and opening a path to self-evolution [5].

Model Highlights
- The model substantially improves instruction adherence and multi-agent collaboration, reaching a 97% adherence rate in scenarios with 40 complex skills and 62.7% accuracy on the MM-Claw "lobster test" [8].
- M2.7 expands its coding capabilities from simple code generation to advanced areas such as code refactoring and complex troubleshooting [10].
- On the SWE-Pro test, M2.7 matched GPT-5.3-Codex at 56.22% accuracy, demonstrating strong end-to-end project delivery [11].

Office Automation
- M2.7 efficiently handles complex Office documents, supporting multi-round modifications in Excel, Word, and PPT [12].
- In the GDPval-AA evaluation, M2.7 achieved the highest ELO score among open-source models, surpassing GPT-5.3 [13].
- The model can autonomously analyze annual reports and communication materials, generating revenue-forecasting models and producing Excel pivot tables and Word reports [14].

Role-Playing and Interaction
- M2.7 significantly improves character stability and conversational emotional intelligence in role-playing scenarios [15].
- It natively supports ten languages, maintaining a unified persona across cross-language interactions [16].
- MiniMax has designed and open-sourced the OpenRoom interaction system, allowing AI to interact in a visually immersive Web GUI space [17].

Testing and Performance
- M2.7 successfully coordinated a multi-agent simulation for a game, demonstrating its instruction adherence, planning, and full-stack coding capabilities [20][22].
- In a simulated real production environment, M2.7 showcased SRE-level troubleshooting and reasoning, accurately identifying performance issues and providing effective recovery scripts [29][30].

Self-Evolution Capabilities
- M2.7 can self-construct complex Agent Harnesses, creating its own tools for interacting with real computer environments [39][41].
- The model can autonomously run experiments, monitor state, troubleshoot, and even submit merge requests [43].
- M2.7 can train and upgrade machine-learning models, continuously improving its algorithms through self-feedback and optimization [46][48].

Industry Implications
- These advances position MiniMax at the forefront of AI development, underscoring the importance of models that can self-evolve and innovate [51][52].
- The ability to autonomously create tools marks a significant shift in the competitive landscape, signaling a new era of self-iterating models [53].
Structured expansion takes the new SOTA in agent tool retrieval, pinpointing the right API | ICLR'26
量子位· 2026-03-18 10:21
Contributed by the EIT-NLP team. 量子位 | WeChat official account QbitAI

In the era of large models, tool use has become a core component of agent capability. From code generation to data analysis, from web queries to complex API calls, LLMs are learning to "use tools." But one practical problem is becoming increasingly obvious: tools are genuinely hard to find.

Work from Shen Xiaoyu's team at Ningbo Eastern Institute of Technology / Ningbo Institute of Digital Twin (Eastern Institute of Technology), published at ICLR 2026 as "Tools Are Under-Documented: Simple Document Expansion Boosts Tool Retrieval", makes a direct but important claim: the bottleneck in current tool retrieval often lies not in model capability, but in tool documentation. The paper has been accepted to ICLR 2026.

Background: the hidden obstacle in tool retrieval

As the number of APIs scales to thousands or even tens of thousands, tool retrieval has become a key upstream step in tool-use systems: the model must first find suitable tools in a huge tool collection before it can invoke and execute them.

The paper builds three key components:

1. TOOL-REX: an expanded tool-retrieval benchmark. In recent years, a series of benchmarks (such as ToolBench and ToolRet) have driven related models' ...
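The paper's central recipe, enriching terse tool documentation before retrieving over it, can be sketched with a toy bag-of-words retriever. Everything below (tool names, docs, scoring) is an invented illustration, not the paper's actual pipeline or the TOOL-REX data:

```python
import math
from collections import Counter

# Toy illustration of "document expansion boosts tool retrieval":
# enrich each terse tool doc with extra descriptive text, then rank
# tools against a query by bag-of-words cosine similarity.
# Invented example; not the method or data from the ICLR 2026 paper.

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: dict) -> str:
    """Return the tool whose doc best matches the query."""
    q = Counter(query.lower().split())
    return max(docs, key=lambda n: cosine(q, Counter(docs[n].lower().split())))

# Under-documented vs. expanded docs for two hypothetical tools.
terse = {"get_weather": "weather api", "get_stock": "stock api"}
expanded = {
    "get_weather": "weather api returns temperature rain forecast for a city",
    "get_stock": "stock api returns share price quote for a ticker symbol",
}

query = "what is the temperature forecast in Paris"
print(retrieve(query, expanded))  # → get_weather
```

With the terse docs, neither description shares a single token with the query, so retrieval is a coin flip; the expanded doc overlaps on "temperature" and "forecast", which is the whole point of the expansion step.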
Worried about being exposed in a "于谦门" ("Yu Qian gate") scandal: what exactly did the 57-year-old 于谦 do with a lobster?
量子位· 2026-03-18 10:21
允中 reporting from 凹非寺. 量子位 | WeChat official account QbitAI

If you have been following AI agents lately, you have probably been flooded with "capability showcases": from automatic coding to end-to-end office automation, nearly every one stressing efficiency and technical leaps. The problem is that most of this content stays inside the industry's self-congratulatory narrative; ordinary users have never really been brought along.

In the latest episode of 于谦's video podcast 《多新鲜呐》, 于谦 solved this in a thoroughly "non-technical" way. On the surface the episode is about OpenClaw, but at heart it is more like a public usability test of an AI product, with a 57-year-old crosstalk performer as the tester.

Well now: an AI that can say "得嘞" ("you got it"). Technically this is just linguistic style matching, but in the user's perception it means "it is me." That is also why 于谦's interest in this detail was clearly greater than his interest in the overall system architecture.

Consider one telling moment: GenJi gave a live demo of OpenClaw generating an armor-effects video of 于谦. The whole process was a textbook agent task-flow decomposition, from requirement input to tool invocation to final output, with very "engineering-minded" logic. But 于谦's first reaction was not "how is this system implemented"; it was: this is cool, and it really looks like me.

Here is the first key point: users care about whether the result feels close to them, not whether the process is advanced.

Further on, OpenC ...
OpenAI's new model panned on Day 0! Rankings slump, falling short of domestic models released in late January
量子位· 2026-03-18 09:18
Core Viewpoint
- OpenAI's newly launched GPT-5.4 mini has drawn criticism over performance and pricing, ranking 13th in the Vals benchmark, an improvement over the previous GPT-5 but still underwhelming against competitors [2][4][6].

Performance Comparison
- GPT-5.4 mini scored 57.88% on the Vals benchmark versus 56.10% for the previous GPT-5, a slight improvement [2][5].
- Across performance tests, the mini and nano models show significant gains, with the mini performing close to the full GPT-5.4 on several benchmarks, such as SWE-Bench Pro and OSWorld-Verified [10][12][25].

Pricing Analysis
- GPT-5.4 mini is priced roughly three times higher than the previous GPT-5 mini, at $0.75 per million input tokens and $4.50 per million output tokens [16][6].
- The nano version is markedly cheaper at $0.20 per million input tokens and $1.25 per million output tokens, making it the more economical choice for some tasks [16][31].

Market Position
- Despite the improvements, the mini and nano models remain middling in the global landscape, ranking below models from competitors like Kimi and Qwen [4][19].
- Users note that the new models' performance is not compelling enough to justify the price increase, with some suggesting alternatives like Gemini Flash 3 lite offer better performance at lower cost [17][19].

Use Cases
- GPT-5.4 mini and nano are optimized for programming, computer operation, and multimodal tasks, suiting applications where low latency is critical [14][20][23].
- In practice, the mini model is effective at tasks such as code modification and debugging, while the nano model excels at simpler tasks like classification and data extraction [20][28][34].
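Using the per-million-token prices quoted above, the mini/nano cost gap is easy to work out. The workload in this sketch is an invented example, not a figure from the article:

```python
# Back-of-the-envelope cost comparison from the prices quoted above:
#   mini: $0.75 / 1M input tokens, $4.50 / 1M output tokens
#   nano: $0.20 / 1M input tokens, $1.25 / 1M output tokens
# The example workload below is illustrative, not from the article.

PRICES = {
    "gpt-5.4-mini": {"in": 0.75, "out": 4.50},
    "gpt-5.4-nano": {"in": 0.20, "out": 1.25},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# Hypothetical workload: 10M input tokens, 2M output tokens per day.
mini = cost_usd("gpt-5.4-mini", 10_000_000, 2_000_000)
nano = cost_usd("gpt-5.4-nano", 10_000_000, 2_000_000)
print(f"mini: ${mini:.2f}, nano: ${nano:.2f}")  # mini: $16.50, nano: $4.50
```

At this mix the nano tier is about 3.7x cheaper, which is why the article steers simple classification and extraction workloads toward nano.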
After ten days of begging for an invite code, I finally get to boss the 🦞 around on WeChat
量子位· 2026-03-18 09:18
梦瑶 reporting from 凹非寺. 量子位 | WeChat official account QbitAI

The whole internet has long been begging for QClaw invite codes, and long been begging for WeChat access!!!

Just today, Tencent's "goose shrimp" product QClaw shipped another big upgrade on top of its WeChat integration: the WeChat entry point has been upgraded to a mini program, with support for uploading and receiving files from the desktop, plus a newly launched "Inspiration Plaza" that invokes skills with one tap.

As it happens, on that very day, after camping in the QClaw beta group for ten days, I finally landed a beta code and moved the goose shrimp into WeChat. (It was not easy...)

The first thing I did after inviting it into WeChat was a remote "deep clean" of my desktop: 152 image files wiped in one tap! Cleanup alone was not enough; I also handed over chores like file packaging, with six folders zipped up in one go. I even had it build me a Tetris mini-game on WeChat, complete with level progression.

A month ago, people were still spending hours wrangling the lobster into deployment; a month later, the lobster has moved comfortably into WeChat and is grinding away for you. Sure enough, in these shrimp-raising days, the lobster's daily rate of evolution really is this surreal...

Without further ado, let's run a hands-on test and see whether QClaw lives up to the hype.

A hands-on test of QClaw

Before testing, let's cover what QClaw can actually do for us in everyday scenarios. We first had QClaw carry out a common office scenario for working folks ...
Breaking video reasoning's "watch first, think later" inertia to achieve true "thinking while watching" | CVPR'26
量子位· 2026-03-18 01:37
Core Insights
- The article discusses the limitations of current large vision-language models (VLMs) in real-time video analysis, arguing for a shift from "frame-text interleaving" to "parallel" processing for effective real-time reasoning [1][4].

Group 1: Current Limitations of VLMs
- Existing VLMs follow a sequential logic that works for offline tasks but causes uncontrollable delays and evidence mismatch in streaming-video scenarios [7][8].
- The frame-text interleaving approach, while improving real-time perception, still operates serially, resulting in low computational efficiency [9][10].
- Complex video understanding often requires Chain-of-Thought (CoT) reasoning, which significantly lengthens inference time and hinders real-time application [12][13].

Group 2: Proposed Solutions by TaYS
- The TaYS framework introduces three key innovations:
1. Streaming attention masks that enforce true temporal causality, letting the model access only frames that have already arrived [18][19].
2. Decoupled positional encoding that separates "temporal order" from "thinking order," stabilizing temporal reasoning [20][21].
3. Dual KV-caches that let visual encoding and text reasoning run in parallel, sharply reducing both time to first token (TTFT) and overall latency [22][23].

Group 3: Experimental Results
- TaYS shows superior accuracy in dynamic-event reasoning, causal inference, and thematic understanding compared to batch-processing and naive interleaved baselines [25].
- The framework substantially reduces TTFT and overall latency, making it more efficient and reliable for real-time applications [26][27].
- Ablation studies confirm that parallel processing is crucial to maintaining low latency and accurate temporal understanding [27].

Group 4: Implications for Future Applications
- TaYS represents a paradigm shift toward real-time intelligent applications, enabling smoother interaction in robotics, security monitoring, and live education [29][30][31].
- The framework lets models "think in real time," broadening their applicability across fields [33].
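The streaming attention mask described above can be illustrated with a toy sketch: a reasoning token emitted at wall-clock time t may attend only to frames that have arrived by t. This is an illustrative sketch of the general idea, not the TaYS implementation:

```python
# Toy sketch of a streaming attention mask: a reasoning token emitted
# at time t may attend only to video frames that arrived at or before t,
# so the model can keep thinking while new frames stream in.
# Illustrative only; not the TaYS code.

def streaming_mask(frame_times, token_times):
    """mask[i][j] is True iff reasoning token i (emitted at
    token_times[i]) may attend to frame j (arriving at frame_times[j])."""
    return [[f <= t for f in frame_times] for t in token_times]

# Frames arrive at t = 0, 1, 2, 3; reasoning starts at t = 1.5 and
# continues while later frames are still arriving.
mask = streaming_mask([0, 1, 2, 3], [1.5, 2.5, 3.5])
for row in mask:
    print(row)
# [True, True, False, False]
# [True, True, True, False]
# [True, True, True, True]
```

A standard offline VLM would instead give every reasoning token access to all four frames, which is exactly the "watch everything first, think afterwards" pattern the framework is designed to break.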
量子位 is hiring editors and writers
量子位· 2026-03-18 01:37
Core Viewpoint
- The article emphasizes the ongoing AI boom and invites individuals to join 量子位 (Quantum Bit), which tracks AI advancements and has established itself as a leading content platform in the industry [1].

Group 1: Job Opportunities
- The company is hiring in three main directions: AI Industry, AI Finance, and AI Product, with positions for both experienced professionals and fresh graduates [2][4].
- Openings span multiple levels, including editors, lead writers, and chief editors, with roles matched to individual capabilities [6].

Group 2: Job Responsibilities
- AI Industry: tracking innovations in infrastructure such as chips, AI infrastructure, and cloud computing, and interpreting technical reports from conferences [6][7].
- AI Finance: covering venture capital, financial reports, and capital movements within the AI industry, requiring strong analytical skills and a passion for interviews [11].
- AI Product: monitoring AI applications and hardware developments, producing in-depth evaluations of AI products, and engaging with industry experts [11].

Group 3: Benefits and Growth
- Employees gain exposure to the latest AI technologies, improve their efficiency with new tools, and build personal influence in the AI field [6].
- The company offers competitive salaries, comprehensive benefits, and a supportive environment for professional growth, including mentorship from senior editors [6][12].

Group 4: Company Impact
- As of 2025, Quantum Bit had over 2.4 million WeChat subscribers and more than 7 million users across platforms, with daily reading volume exceeding 2 million [12].
- Third-party data platforms rank it the top new-media outlet in the AI and frontier-technology sectors [12].
Alibaba's "WuKong" goes live! DingTalk delivers an army of lobsters to enterprises
量子位· 2026-03-18 01:37
Core Viewpoint
- Alibaba has launched "WuKong," an AI-native work platform aimed at addressing enterprise-level challenges and enhancing operational efficiency through automation and intelligent task management [3][4][10].

Group 1: Product Features and Capabilities
- WuKong is designed to operate like a powerful AI employee, executing tasks across applications and platforms to streamline office operations [6][10].
- The platform handles complex tasks such as customer acquisition and talent recruitment, significantly reducing the time and effort these processes require [13][23].
- WuKong offers industry-specific solutions through its "One Person Team" (OPT) framework, covering ten core industry scenarios such as e-commerce, legal, and recruitment [34][37].

Group 2: Security and Compliance
- WuKong incorporates a four-layer security system for safe enterprise operation: permission control, sandboxed execution, dedicated model deployment, and skill safety certification [46][48].
- The platform inherits existing enterprise permission rules, keeping all actions traceable and resource usage and costs transparent [46][48].

Group 3: Strategic Importance and Future Outlook
- The launch of WuKong marks a significant shift for DingTalk, from a traditional office tool to an AI-native work platform, in line with the global trend toward AI-driven business operations [50][62].
- WuKong is positioned as a unified entry point for Alibaba's AI capabilities across its ecosystem, integrating services from platforms like Taobao, Tmall, and Alipay [66][67].
- The establishment of Alibaba Token Hub signals a strategic focus on deepening AI in enterprise workflows, creating a closed loop between model capabilities and application scenarios [67][68].