Qwen3
Not All Tokens Are Equal: Google Proposes True Deep Thinking, Where a Longer Chain of Thought ≠ Deeper Reasoning
3 6 Ke· 2026-02-25 12:23
Core Insights
- Google's research challenges the long-held belief that longer reasoning chains in large models lead to better inference quality, introducing a new metric called Deep Thinking Ratio (DTR) to assess true cognitive depth rather than mere token count [1][3][9].
Group 1: Research Findings
- The study found a negative correlation of -0.54 between token length and accuracy across various models, indicating that longer reasoning chains can lead to misdirection and overthinking [3][5].
- DTR measures the proportion of "deep thinking tokens" in a generated sequence, with a higher ratio indicating a focus on core reasoning rather than unnecessary content [8][10].
Group 2: Implementation of DTR
- Google introduced the Think@n strategy, which allows models like GPT-OSS and DeepSeek-R1 to maintain accuracy while halving computational costs by filtering out low-quality samples early in the reasoning process [2][12].
- In tests, the Think@n strategy achieved an accuracy of 94.7% for GPT-OSS-120B-medium on the AIME 2025 dataset, surpassing traditional methods, while reducing token consumption from 355.6k to 181.9k [12][13].
Group 3: Implications for Model Development
- The findings suggest a shift in focus for model developers from merely increasing token length to enhancing the quality of reasoning, emphasizing the importance of deep cognitive processing [1][19].
- The research highlights the potential for significant cost savings and efficiency improvements in model inference through the application of DTR and the Think@n strategy [9][12].
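To make the DTR idea concrete, here is a minimal Python sketch, not Google's implementation. The summary only says that DTR is the proportion of "deep thinking tokens" and that Think@n filters out low-quality samples early; how a token is classified as deep-thinking is not described, so the per-token `deep_flags` are assumed to be given. The helper names (`deep_thinking_ratio`, `keep_top_by_dtr`, `majority_answer`, `token_savings`) and the "keep the top samples, then majority-vote" step are hypothetical illustrations.

```python
from collections import Counter


def deep_thinking_ratio(deep_flags):
    """Return the share of tokens flagged as deep-thinking (0.0 if empty)."""
    if not deep_flags:
        return 0.0
    return sum(1 for flag in deep_flags if flag) / len(deep_flags)


def keep_top_by_dtr(samples, keep):
    """Keep the `keep` samples with the highest DTR (hypothetical filter)."""
    ranked = sorted(
        samples,
        key=lambda s: deep_thinking_ratio(s["deep_flags"]),
        reverse=True,
    )
    return ranked[:keep]


def majority_answer(samples):
    """Return the most common final answer among the given samples."""
    return Counter(s["answer"] for s in samples).most_common(1)[0][0]


def token_savings(before, after):
    """Fractional reduction in token consumption."""
    return 1.0 - after / before


# Toy data: the flags below are made up purely for illustration.
samples = [
    {"answer": "A", "deep_flags": [1, 1, 0, 1]},
    {"answer": "B", "deep_flags": [0, 0, 0, 1]},
    {"answer": "A", "deep_flags": [1, 0, 1, 1]},
    {"answer": "B", "deep_flags": [0, 0, 0, 0]},
]
kept = keep_top_by_dtr(samples, keep=2)
print(majority_answer(kept))
print(round(token_savings(355.6, 181.9), 3))
```

With the reported figures, going from 355.6k to 181.9k tokens is a reduction of about 48.8%, which matches the "halving" described in the summary.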
DeepSeek, Moonshot, and MiniMax Accused of "Illicit Extraction": Did They Do Anything Wrong? | 电厂
Xin Lang Cai Jing· 2026-02-25 10:47
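On February 23 local time, US large-model company Anthropic issued an official statement saying that its model Claude had been "illicitly extracted" by the Chinese model companies DeepSeek (深度求索), Moonshot (月之暗面), and MiniMax (稀宇科技).
Less than three months into 2026, this is already the second time Chinese models have been caught up in this kind of controversy. An OpenAI memo that leaked in early February stated that DeepSeek was using ChatGPT and other leading US AI models for its own training.
This time Anthropic disclosed more data: the three Chinese companies allegedly used about 24,000 fraudulent accounts to conduct more than 16 million interactions with Claude, then used those conversations as training material to improve their own models' performance.
The day after naming the three companies, Anthropic held a livestream to showcase Claude's latest capabilities.
Meanwhile, the three accused Chinese "little dragons" have stayed silent. To date, DeepSeek, MiniMax, and Moonshot have not responded.
The three little dragons run into the most "MAGA" US large-model company
According to Anthropic's statement, the technique used by DeepSeek, Moonshot, and MiniMax is called "distillation" (distill).
This model-training technique dates back to 2015 and was first ... by 诺 ...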
Rokid Glasses Supports Custom Integration of OpenClaw and Private Large Models
Bei Jing Shang Bao· 2026-02-11 12:53
Core Insights
- Rokid has launched the "Customizable Intelligent Agent" feature on its Lingzhu platform, marking a significant shift in user control over AI glasses [1].
Group 1: Product Development
- The new feature is described as not merely a simple iteration but as the beginning of returning the definition of AI glasses to the users [1].
- Users can now connect Rokid Glasses to any desired backend through a standard SSE (Server-Sent Events) interface [1].
Group 2: Market Positioning
- The integration allows compatibility with popular platforms such as OpenClaw and private deployments like DeepSeek R1, Qwen3, and Kimi K2.5 [1].
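For readers unfamiliar with SSE, the following Python sketch shows how a client might parse a Server-Sent Events stream into events. It illustrates the general SSE wire format only (`event:` and `data:` fields, comment lines starting with `:`, a blank line ending each event); it is not Rokid's API, and the function name `parse_sse` and the sample payloads are assumptions made for illustration. The `id:` and `retry:` fields are deliberately ignored to keep the sketch short.

```python
def parse_sse(lines):
    """Yield {"event", "data"} dicts from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; skip it if no data arrived.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # Strip the single optional leading space.
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Hypothetical stream, as a backend might send it.
stream = [
    ": keep-alive\n",
    "event: token\n",
    "data: Hel\n",
    "data: lo\n",
    "\n",
    "data: done\n",
    "\n",
]
events = list(parse_sse(stream))
print(events)
```

Multiple `data:` lines within one event are joined with a newline, and an event without an explicit `event:` field falls back to the default type `message`.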
Alibaba's Next-Generation Model Qwen3.5 Reportedly Set for Imminent Release
Zhi Tong Cai Jing· 2026-02-09 07:21
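Earlier, the tech news site The Information reported that Qwen3.5 would be open-sourced during the Spring Festival. On April 29, 2025, Alibaba released the new-generation Qwen3 model, which immediately took the top spot as the world's strongest open-source model. It was China's first "hybrid reasoning model", integrating "fast thinking" and "slow thinking" into a single model and greatly reducing compute consumption. According to related reports, Qwen3.5 adopts a new hybrid attention mechanism and is very likely a VLM-class model with native visual understanding; developers digging further found that Qwen3.5 may open-source at least a 2B dense model and a 35B-A3B MoE model. According to reports, a new PR (pull request) merging Qwen3.5 into Transformers has appeared on the open-source project page of HuggingFace, the world's largest artificial intelligence (AI) open-source community. Industry observers speculate that the release of Qwen3.5, Alibaba's (09988) next-generation Qwen foundation model, is imminent. ...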
Alibaba's (09988) Next-Generation Model Qwen3.5 Reportedly Set for Imminent Release
智通财经网· 2026-02-09 07:21
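On April 29, 2025, Alibaba released the new-generation Qwen3 model, which immediately took the top spot as the world's strongest open-source model. It was China's first "hybrid reasoning model", integrating "fast thinking" and "slow thinking" into a single model and greatly reducing compute consumption. According to related reports, Qwen3.5 adopts a new hybrid attention mechanism and is very likely a VLM-class model with native visual understanding; developers digging further found that Qwen3.5 may open-source at least a 2B dense model and a 35B-A3B MoE model. Earlier, the tech news site The Information reported that Qwen3.5 would be open-sourced during the Spring Festival. The 智通财经 APP has learned that, according to reports, a new PR (pull request) merging Qwen3.5 into Transformers has appeared on the open-source project page of HuggingFace, the world's largest artificial intelligence (AI) open-source community. Industry observers speculate that the release of Qwen3.5, Alibaba's (09988) next-generation Qwen foundation model, is imminent. ...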
Even After Learning So Many Lessons, AI Still Goes Haywire
3 6 Ke· 2026-02-09 06:50
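Recently, many papers have been discussing the current predicament of agents.
The predicament is real. At the application layer, once agents are stripped of human-made crutches such as Skills, they are simply unreliable at handling long-horizon tasks in the real world.
This predicament is usually attributed to two causes.
The first is the context black hole. As CL Bench, built a few days ago by the Hunyuan team under Tencent chief AI scientist Yao Shunyu, pointed out, models may simply be unable to fully digest complex context, so they cannot follow instructions properly either.
The second is actually more fatal: the collapse of long-term planning. Once the planning horizon grows long, the model starts to get muddled. It is like being drunk: the first two steps go straight, but by the tenth step it is walking in circles.
At the end of January, researchers at Anthropic published a major paper, "The Hot Mess of AI", trying to explain the cause of the second problem. In doing so, they clearly located the Achilles' heel of autoregressive models (which includes everything built on the Transformer).
We have all heard Yann LeCun's frequent claim that "autoregressive models only do next-token prediction, so they cannot reach understanding or AGI."
Until now, though, this was a judgment or a belief without much empirical evidence. This paper provides some.
It also foreshadows a frightening reality, namely that as models ...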
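Feature | AI Drives Transformation as a US Company's Misuse Sparks Controversy: Global AI Industry Developments in the First Month of 2026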
Xin Hua She· 2026-02-03 05:51
Core Insights
- The global AI industry is experiencing transformative impacts across various sectors, with significant advancements in technology and applications, while also facing challenges related to misuse and governance [1][4][5].
Group 1: Technological Advancements
- Global AI chip computing power is being upgraded, with notable releases such as NVIDIA's "Vera Rubin" AI computing platform and Microsoft's Maia 200 AI chip, which enhances deep reasoning capabilities [2].
- Chinese companies are also innovating, with Alibaba's Qwen3-Max-Thinking model achieving over one trillion parameters, and other models like Kimi K2.5 and DeepSeek-OCR 2 showcasing advancements in various AI applications [2].
- Google's DeepMind has made strides by releasing tools based on the Genie 3 model, allowing users to create interactive 3D virtual worlds through natural language [2].
Group 2: AI Applications and Breakthroughs
- The AI application landscape is evolving, exemplified by the global popularity of the intelligent agent Clawdbot (now OpenClaw), which can perform complex tasks and enhance work efficiency [3].
- Significant breakthroughs in scientific research have been reported, such as the AlphaGenome model decoding the "dark genome," which could lead to advancements in genetic disease understanding and drug development [3].
- AI applications have even reached space, with China's Qwen3 model deployed in a space computing center and NASA's Perseverance rover completing AI-planned tasks on Mars [3].
Group 3: Governance and International Cooperation
- The misuse of AI, particularly by the US company xAI's chatbot "Grok," has sparked international controversy, leading to restrictions and investigations in several countries [4].
- The necessity for enhanced global AI governance has been highlighted, with discussions at the World Economic Forum focusing on establishing international regulatory frameworks [5].
- Many countries, including Malaysia and Saudi Arabia, are expressing a desire for strengthened cooperation with China in AI development, recognizing its technological capabilities as vital for advancing their own AI and digital economies [6].
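Feature | AI Drives Transformation as a US Company's Misuse Sparks Controversy: Global AI Industry Developments in the First Month of 2026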
Xin Hua She· 2026-02-03 04:36
Core Insights
- The global AI industry is experiencing transformative impacts across various sectors, with significant advancements in technology and applications, while also facing challenges related to misuse and governance [1][4].
Group 1: Technological Advancements
- Global AI chip computing power is being upgraded, with notable releases such as NVIDIA's "Vera Rubin" AI computing platform and Microsoft's AI chip Maia 200, which enhances reasoning capabilities [2].
- Chinese companies are also innovating, with Alibaba's Qwen3-Max-Thinking model achieving over one trillion parameters, and other models like Kimi K2.5 and DeepSeek-OCR 2 showcasing advancements in various AI applications [2].
- Google's DeepMind has made strides by allowing users to create interactive 3D virtual worlds using natural language, indicating progress in simulating real-world scenarios [2].
Group 2: AI Applications
- The AI agent Clawdbot (now OpenClaw) has gained popularity for its ability to perform complex tasks, potentially revolutionizing efficiency in various fields [3].
- AI is making significant contributions to scientific research, exemplified by the AlphaGenome model that decodes crucial parts of the human genome, aiding in genetic disease research and drug development [3].
- AI applications have even reached space, with China's Qwen3 model deployed in a space computing center and NASA's Perseverance rover using AI for route planning on Mars [3].
Group 3: Governance and International Cooperation
- The misuse of AI, particularly by xAI's chatbot "Grok," has led to international backlash and calls for stronger governance, highlighting the need for a multilateral regulatory framework for AI [4].
- Countries like South Korea and Kazakhstan are taking steps to establish legal frameworks for AI development, emphasizing safety and trust [4].
- There is a growing expectation for enhanced cooperation with China in AI, as countries like Malaysia and Saudi Arabia recognize China's technological strength in the field [5][6].
Leaderboard Update! Kimi 2.5 Stands Out | xbench Monthly Report
红杉汇· 2026-02-03 00:04
Core Insights
- The article highlights the recent updates in the xbench leaderboard, showcasing the performance of various AI models, particularly emphasizing the Kimi K2.5 model's significant improvements and its ranking among competitors [1][4][10].
Group 1: Model Performance Updates
- As of January 2026, Kimi K2.5 achieved an average score of 63.2, marking a notable improvement from its predecessor K2, and ranked 4th on the leaderboard, making it the top model in China [4][5].
- The new benchmarks introduced by xbench include BabyVision for evaluating multimodal understanding and AgentIF-OneDay for assessing complex task instruction adherence [1].
- The leaderboard updates reflect the performance of mainstream large language models (LLMs) available through public APIs, with Kimi K2.5 scoring 36.5 in the BabyVision benchmark, placing it second behind Gemini 3 Pro [8][10].
Group 2: Kimi K2.5 Specifications
- Kimi K2.5, released on January 27, 2026, is a next-generation multimodal model that integrates visual understanding, logical reasoning, programming, and agent capabilities [10].
- The model is based on approximately 15 trillion mixed visual and text tokens for continuous pre-training, enabling it to natively understand and process visual information [10].
- Kimi K2.5 employs a mixture of experts (MoE) architecture, with a total parameter count of around 1 trillion, activating approximately 32 billion parameters during inference to maintain high performance and efficiency [10].
Group 3: Competitive Landscape
- The leaderboard indicates that Kimi K2.5 is positioned as a strong competitor in the AI model market, with its performance metrics suggesting a competitive edge in terms of cost-effectiveness and speed [4][7].
- The article notes that Kimi K2.5's inference time is significantly reduced to 2-3 minutes per question, enhancing its usability in practical applications [7].
Ranking Large Models: Two PhDs Build a $1.7 Billion AI Unicorn in One Year
3 6 Ke· 2026-01-15 13:41
Core Insights
- The article discusses the rise of LMArena, an AI model evaluation platform that has achieved a valuation of $1.7 billion following a $150 million funding round, addressing the need for effective model assessment in the AI era [2][3].
- LMArena's unique approach allows users to vote on model performance through anonymous comparisons, shifting the evaluation power back to users and highlighting the inadequacies of traditional assessment methods [3][12].
Group 1: LMArena's Business Model and Growth
- LMArena has rapidly commercialized its services, generating an annual recurring revenue of over $30 million within just four months of launching its B2B evaluation service [2].
- The platform has attracted major AI companies like OpenAI, Google, and xAI as core paying clients, indicating its significance in the industry [2].
- Monthly active users have reached 5 million, with over 60 million model interactions occurring each month, showcasing its widespread adoption [19].
Group 2: Evaluation Methodology and Industry Impact
- LMArena employs a crowdsourced evaluation model where users compare two anonymous models, allowing for a more realistic assessment of their capabilities in practical tasks [12][13].
- The platform's design reflects a shift in focus from traditional rankings to specific performance metrics, such as integration ease and reliability in real-world applications [8][12].
- The emergence of LMArena has prompted a reevaluation of model assessment standards, moving away from static benchmarks to dynamic, user-driven evaluations [8][30].
Group 3: Challenges and Criticisms
- Despite its success, LMArena faces criticism regarding the reliability of its crowdsourced voting system and potential biases in user preferences [23][24].
- Concerns have been raised about the possibility of models being optimized for favorable voting outcomes rather than genuine performance, echoing issues seen in traditional evaluation systems [26][27].
- In response to these criticisms, LMArena has updated its rules to ensure that all submitted models must be publicly reproducible [27].
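One common way to turn pairwise votes like these into a ranking is an Elo-style rating update, sketched below in Python. The summary does not say which rating method LMArena actually uses, so this is an illustrative assumption; the function names, the K-factor of 32, the starting rating of 1000, and the model names are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def apply_votes(votes, k=32.0, initial=1000.0):
    """Update ratings from (winner, loser) votes, processed in order."""
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        # The winner gains what the loser gives up, scaled by surprise.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes between anonymized models.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
ratings = apply_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(name, round(rating, 1))
```

Because each update moves the same number of points from loser to winner, the total rating is conserved, and an upset (a low-rated model beating a high-rated one) shifts ratings more than an expected result. The sketch also makes the criticism in Group 3 tangible: the ranking reflects voter preferences only, so anything that sways votes sways the rating.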