Aletheia
Tencent Research Institute AI Express 20260302
Tencent Research Institute · 2026-03-01 17:11
Group 1
- Anthropic's negotiations with the Pentagon broke down over its commitment not to engage in large-scale surveillance or develop autonomous weapons; Trump responded with a complete ban, and the company was labeled a "supply chain threat" [1]
- Claude, Anthropic's AI, surged to the top of the App Store in the US and Canada, with many users sharing screenshots of cancelling ChatGPT Plus to switch to Claude, sparking a movement against OpenAI [1]
- Users shared migration tutorials for switching from ChatGPT to Claude seamlessly by exporting chat history and converting it into a format Claude can read [1]

Group 2
- OpenAI announced a new agreement with the Pentagon, claiming three red lines: no large-scale domestic surveillance, no commanding of autonomous weapon systems, and no high-risk automated decision-making. It asserted that its plan is more comprehensive than Anthropic's [2]
- The agreement involves cloud-only deployment and a safety system operated by OpenAI itself, with security-cleared personnel involved throughout, and allows OpenAI to terminate the agreement in case of breach [2]
- Critics pointed out that vague terms like "all legitimate purposes" could easily be circumvented; this was the kind of wording Anthropic had refused to accept [2]

Group 3
- A member of the Claude Code team shared insights on development, emphasizing that the core of building an agent is designing its action space and providing tools that match the agent's capabilities [3]
- Key iterations included creating a dedicated "ask the user" tool to replace formatted outputs, and moving from a "to-do list" to a "task system" that supports cross-agent collaboration [3]
- The search tool evolved from a RAG approach to autonomous Grep-based search, establishing a progressive information disclosure model while expanding capabilities without increasing the number of tools [3]

Group 4
- Honor unveiled the world's first "robot phone" at MWC 2026, featuring the industry's smallest 4DoF gimbal system and a 200-megapixel sensor, and supporting three-axis mechanical stabilization and AI automatic tracking [4]
- CEO Li Jian introduced the AHI (Augmented Human Intelligence) concept, emphasizing that AI should be human-centered and combine IQ and EQ, and announced a strategic imaging partnership with ARRI [4]
- The company also launched the foldable flagship Magic V6, which is only 8.75mm thick, an industry record, and is equipped with a battery exceeding 7000mAh and the Snapdragon 8 Gen 2 chip [4]

Group 5
- Tsinghua University and Stanford University introduced the VLAW framework, which achieves bidirectional iterative optimization between VLA policies and action-conditioned world models, addressing the "blind optimism" and insufficient physical fidelity of world models [5][6]
- The four-step workflow fine-tunes the world model on real trial-and-error data to eliminate optimistic bias, assesses trajectory quality with Qwen3-VL, generates 500 synthetic trajectories in the calibrated world model, and optimizes the policy on a mix of real and synthetic data [6]
- Empirical results showed a significant reduction in the false positive rate of the calibrated world model, which maintained physical plausibility during long-horizon virtual trials and significantly improved robot performance across five complex manipulation tasks [6]

Group 6
- DeepMind's latest AI agent Aletheia autonomously solved 6 out of 10 world-class unsolved mathematical problems in the FirstProof challenge without human intervention, achieving the best overall score in the inaugural event [7]
- The system features a "generator-verifier" dual-module mechanism and outputs "no solution found" for problems it cannot confidently solve; the computational cost for the 7th problem was 16 times that of solving the Erdős-1051 problem [7]
- Mathematician Terence Tao noted that AI has become a "junior co-author," enabling mathematicians to move from "case studies" to "large sample surveys" by systematically scanning problems that humans lack the capacity to address [7]

Group 7
- Cursor's founder Michael Truell stated that AI software development has entered a third era, characterized by cloud-based agents capable of independently handling complex tasks over extended time scales [8]
- Over 35% of merged pull requests at Cursor were created by autonomous agents running on cloud virtual machines; agent users now number double the tab users, and agent usage has increased more than 15 times in the past year [8]
- Karpathy suggested that developers should spend 80% of their time on methods that work today and 20% exploring future directions, indicating a shift in the developer's role from line-by-line coding to defining problems, setting evaluation standards, and managing agent factories [8]

Group 8
- Anthropic, in collaboration with ETH Zurich, proposed the ESRC automation pipeline, which achieves large-scale online de-anonymization through a four-step process of extraction, search, reasoning, and calibration, using only public models and standard APIs [9]
- In cross-platform matching experiments, AI correctly identified 67% of users at 90% precision and maintained a 67.3% recall rate over a one-year span, while traditional methods failed at similar tasks [9]
- All tested defense methods performed poorly, with the only viable defense being not to disclose users' historical statements. This indicates that surveillance capabilities do not require proprietary models, supporting Anthropic's concerns about large-scale surveillance [9]
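The recall and precision figures quoted for the de-anonymization experiments can be unpacked with a small worked example. This is a minimal sketch, assuming the "90%" figure is precision (the share of predicted matches that are correct) and the "67%" figure is recall (the share of real users who are found). The counts below are hypothetical numbers chosen to reproduce those two ratios, not data from the study.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) from raw match counts."""
    predicted = true_pos + false_pos   # every match the system claimed
    actual = true_pos + false_neg      # every user who really had a match
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical scenario: 1000 users who truly have a cross-platform match.
# The system finds 670 of them, misses 330, and makes 74 wrong matches.
precision, recall = precision_recall(true_pos=670, false_pos=74, false_neg=330)
print(round(precision, 2), round(recall, 2))  # → 0.9 0.67
```

Under this reading, about two thirds of the users are found, and roughly nine out of ten claimed matches are correct.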
In a Math Challenge Harder Than the IMO, Google Beat OpenAI
36Kr · 2026-02-26 07:59
Core Insights
- The article discusses the performance of Google's AI agent Aletheia in the FirstProof challenge, where it outperformed OpenAI's models at solving complex mathematical problems [1][4].

Group 1: Performance Comparison
- Aletheia solved 6 out of 10 problems independently, with 5 of those receiving unanimous approval from experts [1][5].
- In contrast, OpenAI's model solved 5 problems, but it required human intervention to select the best answers during the evaluation process [3][5].
- The FirstProof challenge was designed by top mathematicians from prestigious institutions and featured problems that had never been publicly released before, ensuring a fair assessment of AI capabilities [4][6].

Group 2: Problem-Solving Methodology
- Aletheia used the Gemini 3 Deep Think model with a zero-human-intervention approach, reading the problems, reasoning, and outputting answers directly in LaTeX format [8][10].
- The model allocated resources dynamically, adjusting its compute to the difficulty of each problem, which allowed it to tackle the harder questions more effectively [10].
- Aletheia refused to answer when it could not generate a reliable proof, which indicates a filtering mechanism that prevents it from emitting invalid answers [8][10].

Group 3: Expert Evaluation
- The expert evaluation gave Aletheia full approval for problems 2, 5, 7, 9, and 10, with problem 7 recognized as the most challenging and previously unsolved [6][10].
- Although problem 8 did not receive unanimous approval, it still achieved a high score of 5 out of 7 from experts [6].
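The refusal behavior described here, where the system declines to answer rather than output an unverified proof, can be illustrated as a generate-then-verify loop with abstention. This is only a sketch of the general pattern, not DeepMind's implementation. The names `solve_with_verifier`, `toy_generate`, and `toy_verify`, the confidence threshold, and the attempt budget are all hypothetical, and a toy divisor-finding task stands in for proof search.

```python
from typing import Callable, Union

NO_SOLUTION = "no solution found"


def solve_with_verifier(
    problem: int,
    generate: Callable[[int, int], int],
    verify: Callable[[int, int], float],
    max_attempts: int = 4,      # hypothetical compute budget
    threshold: float = 0.9,     # hypothetical acceptance confidence
) -> Union[int, str]:
    """Return the first candidate the verifier accepts, else abstain."""
    for attempt in range(max_attempts):
        candidate = generate(problem, attempt)
        if verify(problem, candidate) >= threshold:
            return candidate
    # Abstain instead of returning a candidate that failed verification.
    return NO_SOLUTION


def toy_generate(n: int, attempt: int) -> int:
    """Toy generator: propose 2, 3, 4, ... as a divisor of n."""
    return attempt + 2


def toy_verify(n: int, candidate: int) -> float:
    """Toy verifier: confidence 1.0 if candidate is a nontrivial divisor of n."""
    return 1.0 if 1 < candidate < n and n % candidate == 0 else 0.0


print(solve_with_verifier(15, toy_generate, toy_verify))  # → 3
print(solve_with_verifier(13, toy_generate, toy_verify))  # → no solution found
```

Raising `max_attempts` for harder inputs would be one simple way to mimic difficulty-dependent compute allocation in this toy setting.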
Tencent Research Institute AI Express 20260213
Tencent Research Institute · 2026-02-12 16:13
Group 1
- Zhipu released the open-source GLM-5 model, scaled up to 744 billion parameters (40 billion activated). It ranks fourth globally on the Artificial Analysis leaderboard and first among open-source models, with coding and agent capabilities approaching Claude Opus 4.5 [1]
- The model scored 77.8 on SWE-bench-Verified and 56.2 on Terminal Bench 2.0, setting new open-source SOTA records and excelling at complex systems engineering and long-horizon agent tasks [1]
- GLM-5 has been adapted to domestic chips such as Huawei Ascend, Cambricon, and Kunlun, and Zhipu also introduced the Z Code full-process programming tool and the AutoGLM general-purpose agent assistant [1]

Group 2
- MiniMax launched the M2.5 model with only 10 billion activated parameters, delivering flagship-level performance at an inference speed three times faster than Opus [2]
- The model built a full-stack learning website in 9 minutes and can independently perform physical simulations and set up enterprise-level CMS systems, supporting cross-platform development for PC/App/React Native [2]
- It uses a native agent RL training framework and the CISPO algorithm, achieving approximately 40 times training acceleration, and is compatible with mainstream development tools like Claude Code and OpenClaw [2]

Group 3
- Xiaohongshu's foundation model team released the open-source FireRed-Image-Edit, achieving SOTA on multiple authoritative leaderboards such as ImgEdit and GEdit, with code and technical report now available [3]
- The model uses a three-stage training process to build up its capabilities and introduces a Layout-Aware OCR-based Reward, significantly improving text editing accuracy and style retention [3]
- It supports a range of complex editing scenarios, including consistent instruction following, text editing, style transfer, multi-image fusion, and old photo restoration, with model weights set to be open-sourced [3]

Group 4
- Xiaomi released the open-source VLA model Xiaomi-Robotics-0 with 4.7 billion parameters. It excels at visual language understanding and real-time execution, achieving the best results in comparisons against 30 models on benchmarks including LIBERO, CALVIN, and SimplerEnv [4]
- The model uses a Mixture-of-Transformers architecture, in which the VLM "brain" understands instructions and the Diffusion Transformer generates smooth high-frequency actions [4]
- It addresses action discontinuity through asynchronous inference and Λ-shape attention masks, enabling real-time inference on consumer-grade graphics cards, and has been open-sourced on GitHub and HuggingFace [4]

Group 5
- Gaode (Amap) launched the ABot series of embodied base models, with ABot-M0 responsible for manipulation and ABot-N0 for navigation, achieving SOTA across 10 authoritative global evaluations [5][6]
- ABot-M0 integrates 6 million cross-platform trajectories through an action language and proposes an action manifold learning algorithm, achieving an 80.5% success rate on Libero-Plus and surpassing pi0 by nearly 30% [6]
- ABot-N0 unifies five core navigation tasks within a single VLA architecture, built on 8,000 high-fidelity 3D scenes and 17 million expert examples, with a 40.5% improvement in SocNav success rate [6]

Group 6
- Rokid Glasses launched the "customizable agent" feature on the Lingzhu platform, allowing integration with OpenClaw or privately deployed models like DeepSeek R1 and Qwen3 through a standard SSE interface [7]
- Users can keep private data processing in a local closed loop and switch the underlying model with one click, using the ClawHub skill ecosystem to access capabilities such as file systems, browsers, and IM messaging [7]
- The platform lets users summon private agents via voice commands or shortcuts, creating a 24/7 intelligent assistant [7]

Group 7
- Google DeepMind released the AI mathematician Aletheia, based on Gemini Deep Think. It scored 91.9% on IMO-ProofBench, setting a new SOTA, and is capable of independently writing and publishing academic papers [8]
- Aletheia systematically evaluated 700 open problems in the Erdős conjecture database and autonomously solved 4 of them, demonstrating self-correction and acknowledgment of its own limitations [8]
- Gemini Deep Think collaborated with experts to tackle 18 long-stagnant research challenges, resolving a decade-old submodular optimization conjecture, with one paper accepted by ICLR 2026 [8]

Group 8
- HyperWrite's CEO published an article that garnered 70 million views, stating that the release of GPT-5.3-Codex and Claude Opus 4.6 marks a qualitative change in AI [9]
- AI can now independently complete tasks that take human experts about 5 hours, with this capability doubling every 4-7 months; GPT-5.3 played a crucial role in its own training process, initiating a recursive self-improvement cycle [9]
- Almost all cognitive work performed in front of screens will be affected. The article advises spending one hour a day experimenting with AI, as the current window of opportunity will not last long [9]

Group 9
- Anthropic released a 53-page report warning that the risks associated with Claude Opus 4.6 are approaching ASL-4 levels, outlining 8 potential risk pathways that could lead to catastrophic harm, including autonomous escape and autonomous operation [10][11]
- The report concludes that current models do not exhibit "sustained consistent malicious intent," and that the risk of catastrophic damage is "very low but not zero," placing capability assessment in a "gray area" [10]
- The head of Anthropic's safety research team resigned, stating that "the world is in crisis," and an xAI co-founder predicted that recursive self-improvement cycles may be launched within 12 months [11]
Google AI Publishes 6 Math Papers in a Row; Gemini Breaks into PhD-Level Research, Smashing SOTA with 91.9%
36Kr · 2026-02-12 02:50
Core Insights
- Google DeepMind has introduced the "AI mathematician" Aletheia, which has reached significant milestones in tackling complex mathematical conjectures and independently writing academic papers [1][2]
- The Gemini Deep Think model has addressed 18 long-standing research challenges across mathematics, physics, and computer science, marking a transformative moment in scientific research [6][30]

Group 1: Achievements of Aletheia
- Aletheia independently authored a paper on geometric properties and systematically evaluated 700 open problems from the Erdős conjecture database [2][23]
- On the IMO-ProofBench benchmark, Aletheia scored 91.9%, significantly outperforming other models [3][22]
- Aletheia's capabilities include self-correction and the ability to acknowledge problems it cannot solve, which improves research efficiency [4][13]

Group 2: Breakthroughs in Research
- Gemini Deep Think collaborated with experts to solve 18 critical research problems, including advances in submodular optimization, discrete algorithms, and machine learning [6][30]
- The model contributed to several fields, including proving long-standing conjectures in online submodular optimization and extending economic theory on auction dynamics [37][39]
- Gemini's approaches led to breakthroughs in complex areas such as cosmic string physics and information theory, showcasing AI's potential as a scientific collaborator [40][41]

Group 3: Future Implications
- The advances made by Gemini Deep Think indicate a fundamental shift in scientific workflows, positioning AI as a powerful partner in research [42][44]
- Integrating AI into research processes allows scientists to focus on deeper conceptual work while AI handles knowledge retrieval and verification tasks [44]
- The ongoing evolution of Gemini suggests that AI will play an increasingly vital role in advancing scientific discovery and collaboration [42][44]