OctoCodingBench
Our Evaluation of Coding Agents May Have Been Headed in the Wrong Direction
Founder Park· 2026-01-16 12:22
Core Viewpoint
- The evaluation of Coding Agents has been misdirected: it focuses on outcomes rather than adherence to process specifications, which is crucial for effective collaboration in software engineering [2][4][7].
Group 1: Issues with Current Evaluation Systems
- User dissatisfaction with Coding Agents often stems from poor execution rather than an inability to perform the task, which highlights the need to follow explicit instructions and engineering norms [3][4].
- Current evaluation systems, such as SWE-bench Verified, rely primarily on outcome-based metrics and neglect process and user experience, creating a disconnect between evaluation and real-world usage [4][7].
Group 2: Introduction of OctoCodingBench
- MiniMax has introduced a new evaluation set, OctoCodingBench, which assesses whether Coding Agents follow rules while completing a task, addressing the blind spots identified in existing evaluations [5][8].
- The evaluation metrics are Check-level Success Rate (CSR) and Instance-level Success Rate (ISR), which measure the proportion of rules followed and overall compliance, respectively [9][10].
Group 3: Evaluation Results
- Even the strongest models fail to comply with process norms: Claude Opus 4.5 achieves an ISR of only 36.2%, leaving significant room for improvement in process adherence [13].
- Open-source models are rapidly catching up with closed-source models: MiniMax M2.1 and DeepSeek V3.2 reach ISR scores of 26.1% and 26%, respectively, surpassing some established closed-source models [13][14].
Group 4: Future Directions
- The next generation of Coding Agents should incorporate Process Supervision to improve compliance with process specifications, since current models show declining adherence over longer tasks [15][16].
- Coding Agents are evolving from merely producing runnable code to collaborating effectively under complex constraints, which makes process specification central to their development [16][17][18].
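The gap between the two metrics in Group 2 can be illustrated with a minimal sketch. It assumes that CSR is the fraction of individual checks passed across all instances, and that an instance counts toward ISR only when every one of its checks passes; the summary above does not spell out the exact formulas, so both definitions are assumptions. The function names and the sample data are hypothetical and are not taken from OctoCodingBench.

```python
from typing import Dict, List

# Hypothetical data shape: instance id -> pass/fail outcome of each rule check.
CheckResults = Dict[str, List[bool]]


def check_level_success_rate(results: CheckResults) -> float:
    """Fraction of all individual checks that passed (assumed CSR definition)."""
    total = sum(len(checks) for checks in results.values())
    if total == 0:
        raise ValueError("no checks to score")
    passed = sum(sum(checks) for checks in results.values())
    return passed / total


def instance_level_success_rate(results: CheckResults) -> float:
    """Fraction of instances whose checks all passed (assumed ISR definition)."""
    if not results:
        raise ValueError("no instances to score")
    compliant = sum(1 for checks in results.values() if checks and all(checks))
    return compliant / len(results)


if __name__ == "__main__":
    # Made-up outcomes: each task misses at most one rule out of five.
    sample: CheckResults = {
        "task-1": [True, True, True, True, True],
        "task-2": [True, True, True, True, False],
        "task-3": [True, True, True, False, True],
    }
    print(f"CSR = {check_level_success_rate(sample):.3f}")
    print(f"ISR = {instance_level_success_rate(sample):.3f}")
```

Under these assumed definitions, a single missed rule fails the whole instance, so an agent can pass most individual checks and still score a low ISR.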
MINIMAX-WP Rises More Than 10% Before Midday; Announces Open-Source Release of OctoCodingBench, a Systematic Evaluation Set for Coding Agents
Zhi Tong Cai Jing· 2026-01-16 05:19
Core Insights
- MINIMAX-WP's stock rose more than 10% during the morning session and is currently up 8.16% at HKD 387.2, with turnover of HKD 352 million, following the announcement of OctoCodingBench, its open-source evaluation benchmark for coding agents [1].
- The evaluation results indicate that some open-source models perform exceptionally well on "process compliance," approaching or even surpassing certain closed-source models. This highlights a shift in industry focus towards "data and evaluation paradigms" as AI evolves towards agents [1].
Company Performance
- CITIC Securities reports that MINIMAX-WP is emerging from industry competition with a "counter-consensus" strategic focus on breakthroughs in model intelligence, positioning itself strongly in the generative AI wave [2].
- As one of the first companies in Shanghai to receive large model registration, MINIMAX-WP demonstrates strong growth potential through technological depth and commercial foresight [2].
- Revenue is projected to keep growing at more than 90% from 2025 to 2027, with Non-GAAP gross margin expected to rise to 55% and the net loss rate continuing to narrow [2].
- The company is expected to expand its market in AI-native applications by optimizing reasoning costs and rolling out next-generation multimodal models [2].
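Hong Kong Stock Movers | MINIMAX-WP (00100) Rises More Than 10% Before Midday; Announces Open-Source Release of OctoCodingBench, a Systematic Evaluation Set for Coding Agents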
智通财经网· 2026-01-16 03:46
Group 1
- MINIMAX-WP's stock price rose by more than 10% and is currently trading at HKD 387.2, with turnover of HKD 352 million [1].
- The company announced the open-source release of OctoCodingBench, its systematic evaluation set for coding agents and the first comprehensive assessment benchmark designed for Coding Agents [1].
- Evaluation results indicate that some open-source models perform outstandingly on the key metric of "process compliance," approaching or even surpassing certain closed-source models in specific scenarios [1].
Group 2
- CITIC Securities reports that MINIMAX-WP is emerging from industry competition by focusing on breakthroughs in model intelligence amid the global generative AI wave [2].
- The company is one of the first in Shanghai to obtain large model registration, showing strong development potential through technological depth and commercial foresight [2].
- Revenue is projected to keep growing at more than 90% from 2025 to 2027, with Non-GAAP gross margin expected to rise to 55% and the net loss rate continuing to narrow [2].
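OpenAI Reportedly Developing an AI Device to Rival Apple's AirPods; Zhipu and Huawei Open-Source the First Multimodal SOTA Model Trained on Domestic Chips | AIGC Daily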
创业邦· 2026-01-15 00:26
Group 1
- OpenAI is reportedly developing an AI device to compete with Apple's AirPods, internally codenamed Sweetpea, which is expected to support features such as phone calls and audio playback [2].
- Zhipu and Huawei have open-sourced a new-generation multimodal SOTA model, GLM-Image, the first to complete full training on domestic chips, at a cost of 0.1 yuan per generated image [2].
- SleepFM, a new AI model developed by researchers at Stanford University, can predict the risk of approximately 130 diseases from sleep data, using data from 65,000 participants [2].
Group 2
- MiniMax has open-sourced OctoCodingBench, a new evaluation set for Coding Agents. It reveals that while all models achieve over 80% Check-level accuracy, Instance-level success rates are only between 10% and 30% [2].
AI Evolution Express | Zhipu and Huawei Open-Source a New Model
Di Yi Cai Jing· 2026-01-14 13:19
Core Insights
- The article highlights significant advancements in AI technology, focusing on new models and collaborations in the industry.
Group 1: New AI Models and Collaborations
- Zhipu AI and Huawei have jointly open-sourced GLM-Image, the first multimodal SOTA model trained on domestic chips [1].
- Google has announced the launch of the open-source medical model MedGemma 1.5 [1].
- OpenAI is reportedly developing an AI device to compete with Apple's AirPods, internally codenamed Sweetpea [1].
- Anthropic has released a new intelligent tool called Cowork, designed to let ordinary users perform non-technical tasks easily [1].
- MiniMax has introduced the OctoCodingBench benchmark, which defines production-level standards for Coding Agents [1].
- Aishi Technology has launched PixVerse R1, described as the world's first universal real-time world model, capable of 1080P resolution [1].
- Visual China and PureblueAI have reached a strategic cooperation to provide comprehensive services around "data supply + GEO marketing" [1].
- A KKR fund has led a new financing round for DeepWisdom, which will primarily fund the continued development of multi-agent systems [1].
- AI chip startup Etched has raised $500 million at a valuation of $5 billion [1].
[Pacific Tech: Daily Views & News] (2026-01-15)
远峰电子· 2026-01-14 12:46
Market Overview
- The major indices showed mixed performance: the STAR Market 50 index rose by 2.13%, while the Shanghai Composite Index fell by 0.31% [1].
- Within TMT, the top-gaining sub-sectors were SW Portal Websites (+10.62%) and SW Communication Application Value-Added Services (+7.17%) [1].
- The weakest TMT sub-sectors were SW Robotics (-0.81%) and SW Military Electronics III (-0.57%) [1].
Domestic News
- Zhejiang Jingrui achieved a key technological breakthrough in 12-inch silicon carbide substrate uniformity, with a TTV of ≤1μm, marking a significant advance in domestic equipment capabilities [2].
- Rongbai Technology signed a procurement cooperation agreement with CATL for lithium iron phosphate cathode materials; it is expected to supply 3.05 million tons from Q1 2026 to 2031, with a total sales value exceeding 120 billion yuan [2].
- China successfully developed its first underwater geological drilling and monitoring robot, which offers high-precision operation with a 3D positioning error of less than 0.3 meters [2].
- LeKai Optoelectronics plans to invest in a TAC functional film coating production line, targeting an annual production capacity of 18 million square meters [2].
Overseas News
- Global DRAM manufacturers are projected to have a total capacity of 18 million wafers in 2026, a 5% increase from 2025 [3].
- Wolfspeed announced the successful production of 300mm silicon carbide wafers, enhancing capabilities for power electronics and optical systems [3].
- Siemens acquired ASTER, integrating advanced design-for-test capabilities into its software suite [3].
- The U.S. Bureau of Industry and Security revised its export licensing policy for specific semiconductor products to a case-by-case review, affecting products such as NVIDIA's H200 chip [3].
AI Insights
- Aishi Technology launched the PixVerse R1 model, which reduces video generation latency to the level of real-time interaction and is applicable to gaming and entertainment [4].
- Baichuan Intelligence open-sourced its medical AI model Baichuan-M3, which achieved top scores in global medical AI evaluations [4].
- Tsinghua University developed the DrugCLIP platform, which makes screening a million times faster than traditional methods [4].
- MiniMax released OctoCodingBench, which shows that some open-source models are nearing or surpassing closed-source models on compliance metrics [4].
Industry Tracking
- The Long March 6 and Long March 8 rockets successfully launched satellites into orbit, contributing to the development of the space economy [5].
- Lianxun Instruments is set to undergo IPO review; its high-end optical communication testing suite has broken the long-standing monopoly of U.S. and Japanese firms [5].
- China carried out its first non-invasive brain-machine interface treatment, improving symptoms in a patient with acute cerebral infarction [5].
- Yongjin Co. reported successful production and market circulation of its titanium materials, which are widely used in aerospace and medical fields [5].