量子位
Sogou Input Method is still getting updates??
量子位· 2026-01-28 00:02
Core Viewpoint
- Sogou Input Method has undergone a significant AI-driven upgrade, enhancing its functionality and user experience and demonstrating its ability to keep evolving despite being a long-established product [19][30][41].

Group 1: Upgrade Features
- The latest version of Sogou Input Method adds AI voice recognition and text translation, achieving a 40% improvement in recognition fluency and a 98% accuracy rate, even for soft-spoken input [11][19].
- A new "light voice recognition" feature lets users dictate softly while the system still accurately captures the intended message [12][19].
- The input method can refine user input by removing filler words, improving clarity in communication [13][15].

Group 2: AI Integration
- The core logic of the update is a comprehensive reconstruction of voice, translation, and typing around AI models, moving away from traditional rule-based systems [20][19].
- Sogou Input Method has integrated Tencent's self-developed Hunyuan model, enabling a more intelligent and personalized user experience [36][39].

Group 3: User Base and Market Position
- Sogou Input Method has a unique user base built over 20 years, and its understanding of user habits and preferences is crucial for retaining existing users in the AI era [22][24].
- The product differentiates itself from competitors such as WeChat Input Method by offering a more comprehensive AI assistant experience through its unique IP, "Wangzai" [21][25][27].

Group 4: Historical Context and Evolution
- Since its launch in 2006, Sogou Input Method has consistently leveraged technological shifts: it was the first internet input method in China and integrated search engine technology early on [33][36].
- Its evolution is intertwined with AI development in China, from introducing features like smart replies and corrections to now transitioning to a full AI assistant model [34][36].
Scrape code from screen recordings, edit web pages from screenshots! Kimi K2.5 has mastered "vision × code"
量子位· 2026-01-28 00:02
Core Viewpoint
- The article discusses the launch of Moonshot AI's new model Kimi K2.5, highlighting its advanced integration of visual and coding capabilities, which significantly enhances user experience and productivity across a range of tasks [10][12][81].

Group 1: Kimi K2.5 Features
- Kimi K2.5 integrates visual and text functionality, letting users generate web pages with advanced animations and make visual edits through simple commands [17][18].
- The model achieved state-of-the-art (SOTA) results on several high-difficulty benchmarks, outperforming even some top proprietary models [19].
- Kimi K2.5 offers four operating modes: Quick, Thinking, Agent, and Agent Swarm, catering to different user needs and task complexities [21][23].

Group 2: Visual and Coding Capabilities
- The model can generate code from images and modify existing code through visual inputs, making it accessible and efficient for non-experts [30][34].
- Kimi K2.5 can autonomously produce aesthetically pleasing designs and layouts from minimal user input, a marked improvement in design quality over previous AI outputs [56][58].

Group 3: Agent Swarm Technology
- The Agent Swarm feature lets multiple independent agents collaborate on complex tasks, significantly improving efficiency and reducing completion time [64][76].
- This enables Kimi K2.5 to finish in minutes tasks that would traditionally take weeks, showcasing its potential to transform productivity across industries [78][79].

Group 4: Market Implications
- These advances position Kimi K2.5 as a competitive tool in the AI landscape, particularly in productivity software, where it has been recognized by major companies such as Microsoft [82].
- The article suggests Kimi K2.5 empowers users by simplifying complex tasks, letting them focus on decision-making rather than execution [84][85].
Jieyue Xingchen is no longer keeping a low profile: a massive funding round, Yin Qi joins, and a "1+3" core leadership team comes into view
量子位· 2026-01-27 08:32
Core Insights
- The article discusses recent major developments at Jieyue Xingchen, including a record-breaking financing round and the appointment of a new chairman, Yin Qi, who brings extensive experience in AI and industry integration [2][3].

Group 1: Financing and Leadership Changes
- Jieyue Xingchen completed a B+ round of financing exceeding 5 billion RMB, a record for a single financing in the large-model sector over the past 12 months [2].
- Yin Qi has officially joined the core decision-making team as chairman, marking a strategic shift for the company [3].

Group 2: Team Composition and Strategy
- The core team follows a "1+3" structure: Yin Qi as chairman, CEO Jiang Daxin, Chief Scientist Zhang Xiangyu, and CTO Zhu Yibo, each bringing distinct expertise [13][8].
- This structure maps onto the four capabilities essential for deploying large models: strategy, algorithms, systems, and engineering [15].

Group 3: Individual Contributions
- Yin Qi, a co-founder of Megvii Technology, has notable experience moving AI from research into practical applications [6][22].
- CEO Jiang Daxin is recognized for his contributions to natural language processing and has extensive experience with large-scale online systems, suiting him to lead the application of large models [28][30].
- Chief Scientist Zhang Xiangyu is a co-author of ResNet, a pivotal deep-learning architecture, and focuses on multimodal models, a distinctive feature of Jieyue's approach [35][43].
- CTO Zhu Yibo has a strong AI-infrastructure background, having built major AI systems at ByteDance and Google Cloud, which positions Jieyue uniquely in the competitive landscape [51][56].

Group 4: Market Position and Future Outlook
- The AI+ terminal model is seen as a blue-ocean opportunity: projections indicate that by 2026 AI terminal shipments in China will exceed 300 million units, with the penetration rate expected to surpass 93% by 2027 [85].
- The company aims for major milestones by 2026, including 1 million vehicles equipped with its intelligent driving system and the development of top-tier foundational models [90].
DeepSeek open-sources a brand-new OCR model: CLIP dropped for a lightweight Qwen model, with performance rivaling Gemini-3 Pro
量子位· 2026-01-27 08:32
Core Insights
- DeepSeek has released a new OCR model, DeepSeek-OCR 2, focused on accurately converting PDF documents to Markdown [1].
- The model's key breakthrough is dynamically rearranging visual tokens based on image semantics, moving away from traditional raster-scan ordering [2][3].
- DeepSeek-OCR 2 achieves performance comparable to Gemini-3 Pro while remaining a lightweight model [4].

Model Architecture
- DeepSeek-OCR 2 retains its predecessor's classic architecture: an encoder and decoder working in tandem [10].
- The encoder, now called DeepEncoder V2, replaces the previous CLIP component with a lightweight language model (Qwen2-0.5B), introducing causal reasoning capabilities [2][13].
- This upgrade allows intelligent rearrangement of visual tokens before they enter the main decoder, simulating human reading order [3][15].

Performance Metrics
- On the OmniDocBench v1.5 benchmark, DeepSeek-OCR 2 scored 91.09%, a 3.73% improvement over the baseline [5][35].
- Document-parsing edit distance improved from 0.085 to 0.057, demonstrating the effectiveness of the visual information rearrangement [36].
- At a similar token budget (1120), DeepSeek-OCR 2 outperformed Gemini-3 Pro on document-parsing edit distance [37].

Training and Evaluation
- Training follows a three-stage pipeline focused on semantic rearrangement and autoregressive inference [31].
- The model was evaluated on a dataset of 1355 pages across varied document types, ensuring a comprehensive assessment of its capabilities [33][34].
- The design keeps the input token count stable between 256 and 1120, aligning with the visual budget of Gemini-1.5 Pro [27].

Conclusion
- DeepSeek-OCR 2 marks a significant advance in OCR, validating a language-model architecture as a visual encoder and paving the way for unified omni-modal encoders [39].
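The raster-scan versus reading-order distinction can be made concrete with a toy two-column page. This is an illustrative sketch only: the real DeepEncoder V2 predicts the ordering with its lightweight language model, whereas here the column/row layout is supplied by hand.

```python
# Toy page: 2 columns x 3 rows of patch tokens. A raster (row-major) scan
# interleaves the columns, but a human reads the left column top to bottom
# before moving to the right column.
tokens = ["L1", "R1", "L2", "R2", "L3", "R3"]  # raster order
cols = [0, 1, 0, 1, 0, 1]                      # column of each token
rows = [0, 0, 1, 1, 2, 2]                      # row of each token

# Stand-in for the semantic rearrangement step: sort tokens by predicted
# reading order (column first, then row) instead of raster order.
order = sorted(range(len(tokens)), key=lambda i: (cols[i], rows[i]))
reading_order = [tokens[i] for i in order]
print(reading_order)  # → ['L1', 'L2', 'L3', 'R1', 'R2', 'R3']
```

The decoder then consumes tokens in this rearranged sequence, which is why downstream Markdown conversion improves on multi-column layouts.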
Robots can't see clearly, and Ant has cured it
量子位· 2026-01-27 06:57
Jin Lei, reporting from Hangzhou. QbitAI | WeChat official account QbitAI

The world has long suffered from robots that cannot see transparent and reflective objects. After all, even small animals, and occasionally people, comically walk straight into clean glass doors…

Worse, ask a robot to pick up a transparent glass cup or a reflective stainless-steel object and it will frequently go "suddenly blind".

The root of the problem is the robot's eyes: the depth camera.

Whether based on structured light or binocular stereo vision, depth cameras rely on stable reflection of light from object surfaces. Transparent materials let light pass straight through, while highly reflective materials scatter it in every direction, so the sensor receives no valid return signal and produces large numbers of missing or erroneous depth values.

A side-by-side comparison of a scene as humans see it and as the robot sees it makes this obvious.

It is no exaggeration to say that this kind of "open-eyed blindness" has long been the Big Big Big Problem blocking robots from safely entering homes, shopping malls, hospitals, and similar settings.

But now, with a newly proposed technique, the robots' eye trouble has finally been cured: 蚂蚁灵波 (RobbyAnt), Ant Group's embodied-intelligence company, has open-sourced LingBot-Depth, which it bills as the world's clearest-seeing depth vision model.

For the same two scenes above, let's look directly at the results with LingBot-Depth…
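The failure mode described above shows up concretely as holes in the depth map. A minimal sketch with toy numbers (the median fill at the end is a deliberately crude stand-in to show what a completion model must estimate, not LingBot-Depth's method):

```python
import numpy as np

# Toy 4x6 depth map in metres. Structured-light and stereo cameras report
# no depth (here 0.0) where light passes through glass or is scattered by
# polished metal, leaving holes in the map.
depth = np.array([
    [1.2, 1.2, 1.1, 1.1, 1.0, 1.0],
    [1.2, 0.0, 0.0, 1.1, 1.0, 1.0],  # glass cup: light passes through
    [1.2, 0.0, 0.0, 1.1, 1.0, 1.0],
    [1.2, 1.2, 0.0, 0.0, 1.0, 1.0],  # steel bowl: specular reflection
])
invalid = depth == 0.0
print(f"missing depth: {invalid.mean():.0%} of pixels")  # → 25%

# Crude repair: fill holes with the median of the valid pixels, just to
# show the estimation problem a learned depth model has to solve properly.
filled = np.where(invalid, np.median(depth[~invalid]), depth)
assert (filled > 0).all()
```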
Altman admits OpenAI's roadmap went astray, and says "writing code will no longer be important"
量子位· 2026-01-27 05:37
Core Viewpoint
- AI is redefining work, technology, and education, and will increase rather than decrease the demand for software engineers [4][6][7].

Group 1: AI and Software Engineering
- AI will let engineers capture more of the work's value, cutting time spent on coding and debugging and letting them focus on making systems work effectively [4][6].
- The number of software engineering jobs is expected to increase significantly, with a larger share of global GDP created through AI-driven methods [6][7].
- Custom software tailored to individuals or small groups will become common, enhancing personal productivity [5][6].

Group 2: AI Model Development
- OpenAI acknowledges past mistakes in developing the GPT-5 series, which focused too heavily on specific capabilities at the expense of others [18][19].
- The future direction is a return to a more balanced, general-purpose model that excels across dimensions including communication and expression [21][22].
- There is confidence that future models can integrate multiple strong capabilities into a single framework [23][28].

Group 3: Economic Implications of AI
- AI is expected to have a deflationary effect, empowering individuals to accomplish tasks previously reserved for large organizations and potentially reducing long-standing economic disparities [34][36].
- There is a cautionary note that AI could instead concentrate power and wealth in the hands of a few, depending on how it is deployed and regulated [37][38].

Group 4: AI in Education
- AI's role in early-childhood education is questioned, with the view that technology should not be introduced at such a formative stage [14][16].
- The long-term impact of technology on youth development remains unclear, requiring careful consideration before integrating AI into educational settings [15][16].

Group 5: AI and the Attention Economy
- Even as AI makes software development easier, capturing human attention and forging meaningful connections with products remains a significant challenge [43][45].
- Because human attention is scarce while software capability is abundant, creating exceptional value is still essential for entrepreneurial success [46].
The 3D version of Nano Banana is here! AI model editing becomes reality as 3D generation enters the editable era
量子位· 2026-01-27 03:53
Core Viewpoint
- The article highlights 3D generation as a critical area of AI, with significant advances led by the Chinese team Hyper3D, particularly through their product Rodin Gen-2 Edit, which integrates 3D generation and editing capabilities [1][3][27].

Group 1: 3D Generation and Editing Technology
- Hyper3D has launched Rodin Gen-2 Edit, the first commercial product to combine "3D generation" and "3D editing" into a complete workflow, marking 3D generation's entry into the editable era [3][11].
- The editing functionality lets users select specific areas of a model and issue text commands for modifications, such as changing a robot's arms to cannons, a user-friendly approach to 3D model editing [4][5][20].
- The platform supports importing any existing model, including third-party AI-generated models, for editing, positioning Hyper3D's editing capability as foundational infrastructure rather than a standalone feature [9][11].

Group 2: Technological Advancements and User Experience
- Hyper3D Rodin lets users modify, add, or remove model components through natural language without disturbing the overall structure, reshaping 3D modeling workflows [13][21].
- The shift from "generation" to "editing" fills a crucial gap in the AI workflow, enabling iterative design rather than the repeated random generation common in the past [14][19].
- These capabilities are enhanced by 3D ControlNet, which gives precise control over geometric structure during generation, and BANG, which recursively disassembles complex models for localized editing [17][25].

Group 3: Market Position and Future Directions
- The market has recognized Hyper3D's advances: the team closed two funding rounds from top-tier VCs and strategic industry players in 2025, indicating strong investor confidence in their technology [27].
- The company aims to go beyond single-object editing toward complete 3D scenes with objects, relationships, and physical constraints, laying groundwork for future "world models" and embodied-intelligence infrastructure [26].
- Rodin Gen-2 Edit is a significant step toward making 3D generation not just feasible but practically usable, providing a valuable reference point for the industry [27].
Stanford and NVIDIA introduce test-time reinforcement learning: a fine-tuned open-source model beats top closed-source models for just a few hundred dollars
量子位· 2026-01-27 02:33
Core Insights
- The article introduces Test-Time Training to Discover (TTT-Discover), an approach that applies reinforcement learning during the testing phase of model evaluation to attack open scientific problems [1][2].

Group 1: Methodology
- TTT-Discover builds on the open-source model gpt-oss-120b and achieves state-of-the-art (SOTA) performance across multiple domains, outperforming human experts and closed-source models [3].
- Unlike traditional methods that rely on "test-time scaling" via prompt scheduling, TTT-Discover updates model weights during testing to learn from the specific problem [4][5].
- This "test-time training" lets the model accumulate real-time experience from failed attempts, producing a directed evolution of its capabilities [6].

Group 2: Learning Objectives
- TTT-Discover employs an Entropic Objective, which maximizes the reward of the best attempts rather than the average reward across all attempts, aiming for a single optimal solution instead of many mediocre ones [9][10][11].
- It adds a PUCT-inspired reuse mechanism, keeping historical attempts in a buffer so the most promising states are prioritized while exploration is preserved [12].

Group 3: Implementation and Results
- The model builds a "private dataset" by continuously generating actions and receiving feedback, sidestepping the out-of-distribution (OOD) problem by creating data specific to the problem at hand [13][14].
- This contrasts with traditional test-time search, which never updates model weights and thus never enhances the model itself [15][16].
- The algorithm cycles through selecting promising solutions, generating new attempts, and evaluating results, updating the model's weights after each iteration to improve performance [17][18][27].
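The article does not give the exact formula for the Entropic Objective, but a standard "entropic" objective of this kind is the entropic risk (1/β)·log E[exp(β·R)], which interpolates between the mean reward (β→0) and the best reward (β→∞). A hedged sketch under that assumption:

```python
import numpy as np

def entropic_objective(rewards, beta=5.0):
    """Entropic risk: (1/beta) * log(mean(exp(beta * r))).
    Large beta weights the best attempts far more than the average,
    matching the stated goal of one optimal solution over many
    mediocre ones. (Assumed form, not the paper's exact formula.)"""
    r = np.asarray(rewards, dtype=float)
    m = r.max()  # shift by the max for numerical stability
    return m + np.log(np.exp(beta * (r - m)).mean()) / beta

rewards = [0.1, 0.2, 0.9]  # three attempts; one is much better
# The objective sits between the mean (0.4) and the best reward (0.9),
# pulled strongly toward the best attempt.
assert np.mean(rewards) < entropic_objective(rewards) < max(rewards)
```

Optimizing this quantity rather than the mean pushes gradient updates toward whatever made the single best attempt good, which is the behavior the buffer-and-reuse mechanism then exploits.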
Group 4: Performance Metrics
- In experiments, TTT-Discover ran roughly 2x faster than the best human implementations on kernel engineering tasks [27].
- The testing cost per problem is estimated at a few hundred dollars, underscoring the approach's efficiency [27].

Group 5: Future Directions
- TTT-Discover currently applies mainly to continuous-reward settings; extending it to sparse, binary, and unverifiable reward problems is left to future work [29].
Quantum Bit (QbitAI) is hiring editors and writers
量子位· 2026-01-27 02:33
Core Viewpoint
- The article highlights the ongoing AI boom and invites candidates to join "Quantum Bit" (QbitAI), which tracks AI advancements and has established itself as a leading content platform in the industry [1].

Group 1: Job Opportunities
- The company is hiring in three main directions: AI Industry, AI Finance, and AI Product, with positions for both experienced professionals and fresh graduates [2][4].
- Positions are full-time and based in Beijing, with roles open at various levels [2][4].

Group 2: Job Responsibilities
- AI Industry: innovations in infrastructure, including chips, AI infrastructure, and cloud computing [6].
- AI Finance: tracking venture capital and financial reports in the AI sector and monitoring capital movements within the industry [6].
- AI Product: AI applications and hardware, including software applications and product evaluations [6].

Group 3: Benefits and Growth Opportunities
- Employees engage with the latest AI technologies, boost their efficiency with new AI tools, and build personal influence by creating original content [6].
- The company offers competitive salaries and comprehensive benefits, including social insurance, meal allowances, and performance bonuses [6].

Group 4: Company Reach and Impact
- As of 2025, Quantum Bit has over 2.4 million WeChat subscribers and more than 7 million users across platforms, with daily readership exceeding 2 million [12].
- Third-party data platforms rank it as the top new-media outlet in AI and frontier technology [12].
The attention mechanism in multimodal large models hides a "trap" that a single formula can fix | Shanghai University × Nankai University
量子位· 2026-01-27 02:33
Core Insights
- The article examines the reliability of attention mechanisms in Vision-Language Models (VLMs), arguing that attention may not be a trustworthy indicator of semantic importance because of structural biases [2][12].

Group 1: Attention Mechanism Issues
- Attention is skewed by structural biases such as position bias, which favors later tokens in a sequence and can mislead visual token pruning [3][5].
- The authors identify a "padding attention sink," where padding regions receive disproportionately high attention and mislead pruning strategies [5][6].

Group 2: Proposed Solutions
- The Shanghai University team proposes a debiasing approach that corrects attention biases without introducing new pruning methods or additional training [6][12].
- By modeling the overall trend of the attention bias, they strip out irrelevant positional factors, making attention more semantically faithful [6][12].

Group 3: Experimental Results
- The debiasing strategy plugs into various mainstream attention-based visual token pruning methods as a plug-and-play module, yielding consistent gains across tasks [7][10].
- Pruning models with the debiasing correction showed stable performance improvements, particularly under aggressive token compression [10][12].

Group 4: Conclusion
- Attention is not inherently equivalent to semantic importance; ignoring its structural biases can mislead pruning strategies and hurt overall model performance [12].
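The debiasing idea, modeling the positional trend of attention and subtracting it before pruning, can be sketched with toy numbers. The linear-trend fit below is an assumption for illustration; the paper's actual correction formula may differ.

```python
import numpy as np

n = 8
pos = np.arange(n)
# True semantic relevance of 8 visual tokens (the relevant ones come early)
semantic = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.1, 0.2])
attn = semantic + 0.2 * pos  # raw attention: later tokens unfairly boosted

# Model the positional bias as a linear trend and remove it
coeffs = np.polyfit(pos, attn, deg=1)
debiased = attn - np.polyval(coeffs, pos)

def top3(a):
    """Indices of the 3 highest-scoring tokens (the pruning 'keep' set)."""
    return set(np.argsort(-a)[:3])

# Raw attention keeps mostly late tokens; debiased attention recovers
# the genuinely relevant ones.
assert top3(semantic) == {0, 2, 4}
assert top3(debiased) == top3(semantic)
assert top3(attn) != top3(semantic)
```

Because the position bias was injected additively here, a linear fit removes it exactly; a real debiasing module would estimate the trend from the model's own attention statistics.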