Claude 4 Sonnet
AI Takes Shady Shortcuts Too! Qwen3 Searches GitHub Directly in a Bug-Fixing Test, All Too Human
量子位· 2025-09-04 06:39
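By Wen Le, reporting from Aofeisi. 量子位 | WeChat official account QbitAI

Large models have learned to exploit information gaps too. Qwen3 has actually learned to game a benchmark.

FAIR researchers found that on SWE-Bench Verified, Qwen3 did not fix bugs the expected way and instead fell back on information retrieval. Rather than analyzing the code logic or locating the root cause of the bug, it went straight to GitHub, searched for the issue number given in the task, and pulled out the fix that earlier developers had left behind.

Let's be honest: knowing how to search for code is what real programmers do. And Qwen3, you are a real programmer.

How Qwen3 gamed the test

SWE-Bench Verified is meant to be a benchmark that tests whether a model can genuinely fix code, something like a qualification exam for the programming world.

Its test logic works like this: in code-repair tasks, every task given to the model is a bug from a real open-source project, such as fixing a malfunctioning feature or completing a missing code module. The core requirement is that the model can read the existing code, locate the problem, and finally produce a solution that runs as-is.

This was supposed to test a model's ability to solve a problem from scratch, but Qwen3 did not follow that script.

Tracing its trajectory, the FAIR research team found that after receiving a task, Qwen3's first step was not to analyze the code files but to call a tool to search the GitHub commit log.

Specifically: git log is the command for viewing Git version control ...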
Yang Zhilin Crosses the River by Feeling for DeepSeek
36Kr· 2025-07-19 04:30
Core Insights
- The release of the Kimi K2 model has generated significant global interest, showcasing its capabilities in programming and agent-based tasks, outperforming competitors like DeepSeek-V3 and Alibaba's Qwen3 [1][5][6]
- K2's open-source model has quickly gained traction, with over 100,000 downloads within a week and ranking fourth in the LMSYS leaderboard, indicating strong developer engagement [1][4][10]
- Kimi's strategic shift towards focusing on model development rather than consumer applications reflects a response to market pressures and a commitment to advancing AGI [5][21]

Model Performance and Features
- K2 is a MoE model with 1 trillion parameters and 32 billion active parameters, specifically designed for high performance in agentic AI tasks [1][7]
- The model emphasizes practical applications, allowing users to generate complex outputs like 3D models and statistical analyses quickly, moving beyond simple chat interactions [8][9]
- K2's API pricing is significantly lower than competitors, with costs reduced by over 75%, making it an attractive option for developers in the AI programming space [10][11]

Market Impact and Community Engagement
- The release has been likened to a "DeepSeek moment," indicating its potential to reshape the AI landscape and challenge existing models [6][14]
- Kimi's approach to community engagement through social media has fostered a positive reception and increased visibility among developers [4][17]
- The model's introduction has led to a resurgence in Kimi's web traffic, with a 30% increase in visits, highlighting the effectiveness of its open-source strategy [20]

Technological Innovations
- Kimi has introduced a new optimizer, Muon, which reduces computational requirements by 48% compared to the previous AdamW optimizer, enhancing training efficiency [13][12]
- The focus on agentic capabilities and practical task completion sets K2 apart from other models, prioritizing real-world applications over theoretical reasoning [7][8]

Strategic Positioning
- Kimi's pivot towards enhancing model capabilities aligns with industry trends favoring technical advancements over consumer application growth, positioning it as a leader in the AGI pursuit [15][21]
- The competitive landscape has shifted, with Kimi adopting a strategy similar to that of established players like Anthropic, focusing on programming and agent capabilities [16][21]
Grok4 Takes the Internet by Storm, Passes the Bouncing-Ball Coding Test; Epic Founder: This Is AGI
猿大侠· 2025-07-12 01:45
Core Viewpoint
- The article discusses the rapid adoption and impressive capabilities of Elon Musk's Grok4 AI model, highlighting its performance in various tests and comparisons with other models like OpenAI's o3.

Group 1: Performance Highlights
- Grok4 successfully passed the hexagonal ball programming test, showcasing its ability to understand physical laws [2][12].
- In a comprehensive evaluation, Grok4 outperformed o3 in all eight tasks, including complex legal reasoning and code translation [23][18][20].
- Tim Sweeney, founder of Epic Games, praised Grok4 as a form of Artificial General Intelligence (AGI) after it provided deep insights on a previously unseen problem [9][10].

Group 2: User Interactions and Applications
- Users have engaged with Grok4 in creative ways, such as visualizing mathematical concepts and generating SVG graphics, demonstrating its versatility [25][32].
- A user named Dan was able to create a visualization of Euler's identity with minimal interaction, indicating Grok4's efficiency in generating complex outputs [31][26].
- The article mentions a high-level application called "Expert Conductor," which simulates an expert collaboration environment, further showcasing Grok4's potential in problem-solving [54][56].

Group 3: Community Engagement
- The article encourages readers to share their innovative uses of Grok4, indicating a growing community interest and engagement with the AI model [66].
- Various users have reported their experiences and findings, contributing to a collaborative exploration of Grok4's capabilities [12][66].
Grok4 Takes the Internet by Storm, Passes the Bouncing-Ball Coding Test; Epic Founder: This Is AGI
量子位· 2025-07-11 07:20
Core Viewpoint
- The article discusses the rapid adoption and impressive capabilities of Elon Musk's Grok4 AI model, highlighting its performance in various tests and comparisons with other models like OpenAI's o3.

Group 1: Grok4 Performance
- Grok4 successfully passed the hexagonal ball atmospheric programming test, showcasing its ability to understand physical laws [2][12]
- Users reported that Grok4 produced stunning animations, including text formations and symbols, indicating its advanced creative capabilities [6][7]
- A user conducted a comprehensive test with eight questions, where Grok4 outperformed o3, passing all tasks while o3 only passed two [21]

Group 2: Expert Collaboration Simulation
- HyperWrite's CEO demonstrated a method called "Expert Conductor," which simulates an expert collaboration environment for problem-solving [52][54]
- The method emphasizes authentic expert voices and collaboration, allowing for iterative feedback and improvement [63]
- Grok4 completed a task in 52 seconds using this method, impressing observers with its performance [62]

Group 3: User Engagement and Future Potential
- Users are exploring various creative applications for Grok4, with some expressing interest in challenging it with Pokémon-related tasks [64]
- The article encourages readers to share their innovative ideas for using Grok4 in the comments [65]
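Musk Releases Grok 4, the "World's Most Powerful AI Model," Calling It the First Time AI Can Solve Complex Engineering Problems That Are Hard to Solve in the Real World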
Sou Hu Cai Jing· 2025-07-10 11:42
Core Insights
- Musk announced the release of Grok 4, claiming it is the first AI capable of solving complex engineering problems whose answers cannot be found on the internet or in books [4]

Group 1: Product Features
- Grok 4 is a reasoning model that supports both text and image inputs, function calls, and structured outputs [2]
- It has a context window of 256K tokens, which is smaller than Gemini 2.5 Pro's 1M tokens but larger than those of Claude 4 Sonnet and Opus (200K tokens) and R1 0528 (128K tokens) [2]
- Grok 4 is priced the same as Grok 3, at $3 per million input tokens and $15 per million output tokens, with cached input tokens priced at $0.75 per million [2]

Group 2: Performance Metrics
- Grok 4 outputs 75 tokens per second, which is slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s), and Claude 4 Sonnet Thinking (85 tokens/s), but faster than Claude 4 Opus Thinking (66 tokens/s) [3]
- It ranks first in various benchmarks such as Humanity's Last Exam, MMLU-Pro, AIME 2024, AIME 2025, and GPQA, outperforming OpenAI's o3 and Google's Gemini 2.5 Pro [3]

Group 3: Future Developments
- xAI announced upcoming products, including an AI programming model set to launch in August, a multimodal agent in September, and a video generation model in October [5]
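The pricing and throughput figures above translate into a simple per-request estimate. The following is a minimal sketch, assuming the quoted rates ($3 / $15 / $0.75 per million tokens, 75 output tokens per second) apply linearly and that cached tokens are a subset of the input tokens; the names `Pricing`, `request_cost`, and `generation_seconds` are hypothetical and are not part of any xAI SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token rates in USD, taken from the summary above."""
    input_per_million: float = 3.00
    output_per_million: float = 15.00
    cached_input_per_million: float = 0.75


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: cached tokens are part of input_tokens and are billed at
    the cached rate instead of the regular input rate.
    """
    fresh_tokens = input_tokens - cached_input_tokens
    if fresh_tokens < 0 or cached_input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative and consistent")
    total = (fresh_tokens * pricing.input_per_million
             + cached_input_tokens * pricing.cached_input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float = 75.0) -> float:
    """Time to stream the output at a constant rate (ignores time to first token)."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    cost = request_cost(Pricing(), input_tokens=100_000, output_tokens=3_000,
                        cached_input_tokens=40_000)
    print(f"estimated cost: ${cost:.3f}")
    print(f"estimated generation time: {generation_seconds(3_000):.0f} s")
```

With these hypothetical inputs, 60,000 fresh input tokens, 40,000 cached tokens, and 3,000 output tokens come to about $0.255, and the output would take about 40 seconds to stream at the quoted rate. Actual billing rules, such as how reasoning tokens are counted, may differ.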
1.93-bit DeepSeek-R1 Beats Claude 4 Sonnet at Coding and Runs Without a GPU
量子位· 2025-06-10 04:05
Core Viewpoint
- The article discusses the performance and advancements of the DeepSeek-R1 (0528) model, highlighting its programming capabilities and efficiency improvements compared to previous versions and competitors.

Group 1: Model Performance
- The latest version R1-0528 achieved a score of 71.4 on the Aider programming leaderboard, surpassing Claude 4 Opus and the previous R1 version [5][2]
- R1-0528 shows significant improvements in gaming performance, particularly in Tetris, where it outperformed o4-mini and ranked just below o3 [21][24][28]
- The model's performance in Candy Crush was also notable, scoring 548 points, which is nearly 20 points higher than o4-mini [32]

Group 2: Model Optimization and Size
- The 1.93bit version of R1 has a file size reduced by over 70% compared to the original 8bit version, making it more lightweight and efficient [3][9]
- Unsloth has developed multiple quantized versions of R1, with the smallest being 1.66bit at 162GB, which is nearly 80% smaller than the 8bit version [9][10]
- The team recommends using the 2.4bit and 2.7bit versions for a better balance between size and performance [14]

Group 3: Team and Other Models
- Unsloth's team focuses on fine-tuning models for better efficiency, having worked on various models including Qwen, Phi, Mistral, and Llama, achieving at least a 50% reduction in memory usage and a 50% increase in speed [16][17]
- Unsloth has also introduced a distilled Qwen3-8B model based on R1-0528, claiming it can match the performance of Qwen3-235B and is adaptable to various configurations [19]
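The size reductions quoted above can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch only: the 671-billion parameter count for DeepSeek-R1 is an assumption brought in from outside the article, the idealized formula ignores file metadata and any layers kept at higher precision, and the function names are hypothetical.

```python
def ideal_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Size in GB (10^9 bytes) if every weight used exactly bits_per_weight bits."""
    return num_params * bits_per_weight / 8 / 1e9


def reduction(small_gb: float, large_gb: float) -> float:
    """Fractional size reduction of small_gb relative to large_gb."""
    return 1.0 - small_gb / large_gb


# Assumption: total parameter count of DeepSeek-R1 (not stated in the article).
R1_PARAMS = 671e9

if __name__ == "__main__":
    baseline_8bit = ideal_size_gb(R1_PARAMS, 8)     # idealized 8bit size
    ideal_1_66bit = ideal_size_gb(R1_PARAMS, 1.66)  # idealized 1.66bit size
    reported_1_66bit = 162.0                        # file size quoted in the article

    print(f"idealized 8bit size:    {baseline_8bit:.0f} GB")
    print(f"idealized 1.66bit size: {ideal_1_66bit:.0f} GB")
    print(f"reduction of the reported 162 GB file vs. the idealized 8bit size: "
          f"{reduction(reported_1_66bit, baseline_8bit):.1%}")
```

Under these assumptions the reported 162GB file is about 76% smaller than an idealized 8bit baseline, which is in line with the article's "nearly 80%". The reported file is larger than the idealized 1.66bit size, presumably because not every weight is stored at the lowest bit width.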
DeepSeek-R1 Evolves Again, and This Update Is Seriously Strong...
36Kr· 2025-06-04 03:32
Core Viewpoint
- DeepSeek has released an upgraded version of its R1 model, named DeepSeek-R1-0528, which shows significant improvements in reasoning, programming, and reducing hallucinations compared to its predecessor [1][3][22].

Model Improvements
- The new version retains the base model from December 2024 but has enhanced computational power, allowing for deeper reasoning and more detailed problem-solving [4][6].
- The average token usage for the AIME 2025 test increased from 12K to 23K tokens, resulting in an accuracy improvement from 70% to 87.5% [4][5].

Benchmark Performance
- In various benchmarks, DeepSeek-R1-0528 achieved notable scores, such as 87.5% in the AIME 2025 math competition, outperforming its predecessor and showing competitive results against OpenAI's models and Gemini 2.5 [5][15].
- The model's performance in coding tasks has reached levels comparable to OpenAI's models, with successful outputs in complex coding challenges [10][14].

Reduction of Hallucinations
- The hallucination rate in the new model has decreased by 45% to 50%, leading to more reliable outputs in tasks such as summarization and reading comprehension [18].

Creative Writing Capabilities
- DeepSeek-R1-0528 has shown improvements in creative writing, producing coherent and logical narratives without the previous issues of "getting stuck" [19][21].

User Reception
- While some users express skepticism about the update's impact, many remain optimistic about DeepSeek's potential as a representative of domestic AI technology [22][23].
Roundup: Daily Tech News Digest (May 23)
news flash· 2025-05-23 00:02
New Energy Vehicles
- BYD's pure electric vehicle sales in Europe surpassed Tesla's for the first time in April [2]
- Changan Automobile plans to launch 35 smart new vehicles in the next three years, aiming for solid-state battery validation by 2026 [3]
- Xiaomi's automotive division aims to deliver a cumulative total of 258,000 vehicles by 2025, with over 28,000 delivered in April [3]
- Xiaomi launched its first SUV, the Xiaomi YU7, which accelerates from 0 to 100 km/h in 3.23 seconds and has a top speed of 253 km/h [3]

Integrated Circuits (Chips)
- TSMC and other manufacturers are urging the U.S. Department of Commerce to exempt semiconductors from related tariffs [1]
- Intel has launched the new Xeon 6 series processors, with one model already being used as the main CPU for NVIDIA's DGX B300 [1]
- Xiaomi's flagship processor, the XRING O1, features a 16-core GPU and has achieved a benchmark score exceeding 3 million [1]

Artificial Intelligence
- Anthropic has released new AI models, Claude 4 Opus and Claude 4 Sonnet [3]
- G42, OpenAI, Oracle, NVIDIA, SoftBank Group, and Cisco announced a collaboration to build the "Stargate UAE" data center in the UAE [3]

Other Developments
- Xiaomi plans to double its R&D investment to 200 billion yuan over the next five years, with Lei Jun stating that Xiaomi's chips will compete with Apple's [3]
- Lenovo Group projects a total revenue of $69.08 billion for the fiscal year 2024/2025 [3]
- The U.S. Federal Trade Commission has dropped its lawsuit regarding Microsoft's $69 billion acquisition of Activision Blizzard [3]
AI company Anthropic releases the Claude 4 Opus and Claude 4 Sonnet AI models.
news flash· 2025-05-22 16:40
Core Insights
- Anthropic has released two new artificial intelligence models: Claude 4 Opus and Claude 4 Sonnet [1]

Group 1
- The introduction of Claude 4 Opus and Claude 4 Sonnet marks a significant advance in Anthropic's AI capabilities [1]