Workflow
Matthew Berman
Claude Just Got a Big Update (Opus 4.1)
Matthew Berman· 2025-08-05 23:02
Model Release & Performance
- Anthropic released Claude Opus 4.1, an upgrade to Claude Opus 4, with particular gains in agentic tasks, real-world coding, and reasoning [1]
- On SWE-bench Verified, Claude Opus 4.1 scores 74.5%, up 2 percentage points from Opus 4's 72.5% [3]
- On Terminal-Bench, terminal-use performance rises from 39.2% to 43.3%, a gain of 4.1 percentage points [4]
- On GPQA Diamond (graduate-level reasoning), the score rises from 79.6% to 80.9%, a gain of 1.3 percentage points [4]
- On Tau-bench (agentic tool use), the retail score rises from 81.4% to 82.4% (up 1 point), while the airline score drops from 59.6% to 56% (down 3.6 points) [5]
- On multilingual question answering, the score rises from 88.8% to 89.5%, up 0.7 percentage points [5]
- On AIME 2025, the score rises 2.5 percentage points to 78% [5]

Competitive Positioning & Future Outlook
- Claude Opus 4.1 beats OpenAI's o3 and Gemini 2.5 Pro on SWE-bench Verified and Terminal-Bench [5]
- It trails o3 and Gemini 2.5 Pro on GPQA Diamond and on agentic tool use [6]
- On the high-school math competition benchmark (AIME 2025), Claude Opus 4.1 scores 78%, below o3 (88.9%) and Gemini 2.5 Pro (88%) [6]
- Claude is currently widely regarded as the best coding model on the market, especially for agentic coding and agent-driven development [7]
OpenAI Goes OPEN-SOURCE! gpt-oss is HERE!
Matthew Berman· 2025-08-05 22:09
Model Release
- OpenAI released gpt-oss, state-of-the-art open-source models in 120-billion- and 20-billion-parameter versions [1]
- These are open-weight language models, meaning the model weights themselves are published [1]

Performance Benchmarks
- On Codeforces competition problems with tool use, the 120-billion-parameter gpt-oss scores 2622, very close to the frontier model's 2706 [2]
- The 20-billion-parameter version scores 2516 with tools, which is just as impressive given its size [2]
- At competitive programming, these models score higher than most people on Earth [2]
OpenAI Dropped a FRONTIER Open-Weights Model
Matthew Berman· 2025-08-05 17:17
Model Release & Capabilities
- OpenAI released gpt-oss, state-of-the-art open-weight language models in 120-billion- and 20-billion-parameter versions [1]
- The models outperform similarly sized open-source models on reasoning tasks and demonstrate strong tool-use capabilities [3]
- They are optimized for efficient deployment on consumer hardware: the 120-billion-parameter version runs on a single 80 GB GPU, and the 20-billion-parameter version runs on edge devices with 16 GB of memory [4][5]
- The models excel at tool use, few-shot learning, function calling, chain-of-thought reasoning, and health-issue diagnosis [8]
- They support context lengths of up to 128,000 tokens [12]

Training & Architecture
- The models were trained with a mix of reinforcement learning and techniques informed by OpenAI's most advanced internal models [3]
- They use a transformer architecture with a mixture of experts, reducing the number of active parameters needed to process each token (see the sketch below) [10][11]
- The 120-billion-parameter version activates only about 5.1 billion parameters per token, while the 20-billion-parameter version activates about 3.6 billion [11][12]
- The models employ alternating dense and locally banded sparse attention patterns, grouped multi-query attention, and RoPE for positional encoding [12]

Safety & Security
- OpenAI did not apply any direct supervision to the chain of thought of either gpt-oss model [21]
- Pre-training data was filtered to remove harmful chemical, biological, radiological, and nuclear (CBRN) content [22]
- Even with adversarial fine-tuning, maliciously fine-tuned versions of the models were unable to reach high capability levels under OpenAI's Preparedness Framework [23]
- OpenAI is hosting a red-teaming challenge with $500,000 in awards to surface safety issues with the models [24]
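The active-parameter savings described above come from mixture-of-experts routing: a small gating network picks a few expert feed-forward blocks per token, so only a fraction of the total weights run for any one token. The following is a minimal illustrative sketch, not OpenAI's actual implementation; the layer sizes, expert count, and top-k value are made-up assumptions chosen only to show the routing pattern.

```python
# Illustrative mixture-of-experts layer (NOT gpt-oss's real code).
# Only `top_k` of `num_experts` feed-forward blocks run per token, so the
# active parameter count per token is far smaller than the total.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(4, 512)
print(TinyMoELayer()(tokens).shape)                      # torch.Size([4, 512])
```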
Google Genie 3 - The Most Advanced World Simulator Ever...
Matthew Berman· 2025-08-05 14:02
Model Overview
- Google announced Genie 3, a general-purpose world model for generating diverse interactive environments [1][8]
- Genie 3 allows real-time interaction with improved consistency and realism compared to Genie 2 [12]
- The model generates high-quality environments at 720p [3]

Technical Aspects
- For autoregressive generation, Genie 3 conditions on the entire previously generated trajectory, not just the previous frame (see the conceptual sketch below) [15]
- Consistency in Genie 3 is an emergent capability resulting from training scale, not pre-programming [19]
- Genie 3 generates dynamic, rich worlds frame by frame from a world description and user actions, unlike methods that rely on an explicit 3D representation [20]

Potential Applications
- World models like Genie 3 can be used to train robots and agents [9]
- The technology has potential applications in creating video games, movies, and television shows [9]
- Google positions world models as a key step toward AGI by providing AI agents with unlimited simulation environments for training [9][10]

Comparison with Previous Models
- Genie 3 shows significant improvements in consistency, detail, and generation length over Genie 2 [22][23]
- Genie 3 allows deeper world exploration than Genie 2 [23]

Interactive Features
- Users can prompt events in real time, adding elements to the scene [21]
- The model demonstrates realistic interactions, such as water moving out of the way of a jet ski and reflections in mirrors [6]
- It can simulate actions like painting, with paint applied only where the brush touches the wall [29][30]
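Trajectory-conditioned generation can be pictured as a simple autoregressive loop: each new frame is predicted from the world prompt plus every frame and action generated so far, rather than from the last frame alone. The sketch below is purely conceptual; Google has not released code, so `WorldModel`, its methods, and the action strings are hypothetical placeholders.

```python
# Conceptual sketch of trajectory-conditioned autoregressive world generation.
# Everything here (WorldModel, predict_next_frame, action strings) is a
# hypothetical stand-in for illustration, not Google's API.
from dataclasses import dataclass, field


@dataclass
class WorldModel:
    prompt: str                                   # text description of the world
    frames: list = field(default_factory=list)    # full generated trajectory so far
    actions: list = field(default_factory=list)   # user inputs, one per frame

    def predict_next_frame(self) -> str:
        # A real model would render a 720p image here; we return a label.
        # Crucially it conditions on *all* previous frames and actions,
        # which is what the long-horizon consistency described above relies on.
        return f"frame_{len(self.frames)} (prompt={self.prompt!r}, history={len(self.frames)})"

    def step(self, action: str) -> str:
        self.actions.append(action)
        frame = self.predict_next_frame()
        self.frames.append(frame)
        return frame


world = WorldModel(prompt="a jet ski weaving through a sunlit harbor")
for action in ["forward", "turn_left", "prompt_event: add a sailboat"]:
    print(world.step(action))
```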
Forward Future Live August 1st, 2025
Matthew Berman· 2025-08-01 16:55
Resources & Tools
- Offers a free "Vibe Coding Playbook" download [1]
- Provides a free "Humanity's Last Prompt Engineering Guide" download [1]
- Showcases a curated list of AI tools [1]

Community & Updates
- Encourages joining a newsletter for regular AI updates [1]
- Promotes engagement through X (Twitter), Instagram, and Discord [1]

Media & Sponsorship
- Provides a contact link for media/sponsorship inquiries [1]
This might be OpenAI's New Open-Source Model...
Matthew Berman· 2025-08-01 00:00
Model Capabilities & Performance
- Horizon Alpha demonstrates impressive spatial awareness and problem-solving, accurately visualizing complex rotations [1]
- The model is multimodal, understanding and interpreting images quickly [2]
- Horizon Alpha solves the Tower of Hanoi puzzle despite lacking chain-of-thought reasoning [6]
- The model recognizes its limitations, indicating when it lacks knowledge [20][21]
- Horizon Alpha achieves top rankings on creative-writing and emotional-intelligence benchmarks [23][11]

Model Characteristics & Limitations
- Horizon Alpha is fast, outputting roughly 150 tokens per second [2]
- The model lacks a "thinking mode," outputting the first response that comes to mind [2]
- Horizon Alpha gives incorrect answers to simple logic and percentage-based questions [7][8]
- The model refuses to provide instructions for illegal activities, such as hotwiring a car [8][9]
- The model incorrectly identifies itself as a GPT-4-class model from OpenAI, despite likely being an open-source model [9]

OpenRouter & Box AI
- Horizon Alpha is available on OpenRouter and free to use (a minimal API sketch follows below) [1]
- Box AI lets users apply the latest AI models, including open-source options, to document workflows with enterprise-level security [3][4]
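Because Horizon Alpha is served through OpenRouter's OpenAI-compatible API, a quick way to try it is a standard chat-completions call. This is a hedged sketch: the model slug `openrouter/horizon-alpha` and its continued availability are assumptions, so check OpenRouter's model list before relying on it.

```python
# Minimal sketch of calling Horizon Alpha via OpenRouter's OpenAI-compatible
# endpoint. The model slug below is an assumption; verify it against
# https://openrouter.ai/models before use.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openrouter/horizon-alpha",   # assumed slug for the stealth model
    messages=[
        {"role": "user", "content": "Rotate the letter 'N' 90 degrees clockwise. "
                                    "Describe what it looks like."},
    ],
)
print(response.choices[0].message.content)
```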
The AlphaGO Moment for AI Models...
Matthew Berman· 2025-07-31 18:08
AI Model Architecture Discovery
- The AI field is approaching an era in which AI can discover new knowledge and apply it to itself, potentially leading to exponential innovation [1][3]
- The current bottleneck in AI discovery is human innovation, which limits how far AI advances can scale [2][3]
- The "AlphaGo moment" for model-architecture discovery involves AI self-play to hypothesize, code, test, and analyze new model architectures [3][12]
- The key to this approach is the AI's ability to learn without human input, discovering novel solutions unconstrained by human biases [8]

ASI-Arch System
- The ASI-Arch system uses a researcher, an engineer, and an analyst to autonomously propose, implement, test, and analyze new neural-network architectures (a control-flow sketch follows below) [13][14][15][16]
- The system learns from past experiments and the human literature to propose new architectures, selecting top performers as references [14]
- The engineer component self-heals code so that new approaches are properly tested [15]
- The analyst reviews results, extracts insights, and maintains a memory of lessons learned for future generations of models [16]

Experimental Results and Implications
- The system ran 1,700 autonomous experiments over 20,000 GPU hours, producing 106 models that outperformed previously published models [17][18]
- Exponential improvement is possible by scaling compute, for example from 20,000 to 20 million GPU hours [19]
- The self-improving approach can be applied to other scientific fields such as biology and medicine by adding compute [20]
- The open-sourced paper and code have significant implications, with multiple companies publishing similar self-improving-AI papers [21]
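The researcher/engineer/analyst pipeline reduces to a closed loop over candidate architectures: propose from memory, implement and train, analyze, then feed the top performers back in as references. The sketch below only illustrates that control flow under assumed interfaces; `propose`, `implement_and_train`, and `analyze` are hypothetical placeholders, not the actual ASI-Arch codebase.

```python
# Illustrative control flow for an ASI-Arch-style autonomous architecture
# search. All functions are hypothetical stand-ins for the paper's
# researcher / engineer / analyst components.
import random


def propose(memory, top_performers):
    """Researcher: hypothesize a new architecture from past lessons."""
    return {"id": len(memory), "variant": random.choice(["gated", "sparse", "hybrid"])}


def implement_and_train(candidate):
    """Engineer: generate code, self-heal failures, train, return a score."""
    return random.random()  # stands in for a benchmark metric


def analyze(candidate, score, memory):
    """Analyst: record what was learned for future generations."""
    memory.append({"candidate": candidate, "score": score})


memory, top_performers = [], []
for _ in range(10):                  # the real system ran ~1,700 experiments
    candidate = propose(memory, top_performers)
    score = implement_and_train(candidate)
    analyze(candidate, score, memory)
    top_performers = sorted(memory, key=lambda r: r["score"], reverse=True)[:3]

print("best so far:", top_performers[0]["score"])
```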
Claude Code in SHAMBLES (Qwen3 Coder Tested)
Matthew Berman· 2025-07-31 00:00
Model Performance & Capabilities
- Qwen3 Coder, an open-source frontier coding model from Alibaba, was tested across a range of capabilities [1]
- Qwen3 Coder successfully generated code for a 2D Navier-Stokes solver and a 3D rotating dodecahedron with bouncing spheres [1]
- The model failed at spatial reasoning in a cube-rotation task, although its code generation succeeded [1]
- Qwen3 Coder passed a "needle in a haystack" test, finding a password hidden within the full text of Harry Potter and the Sorcerer's Stone [1]
- The model exhibited censorship regarding Tiananmen Square [1]
- Qwen3 Coder refused to take a stance on political questions, giving balanced perspectives on Trump and Kamala Harris [1][2]
- The model gave a thoughtful, nuanced response to a prompt about quitting a job and leaving family [2][3][4][5]
- Qwen3 Coder refused to answer illegal questions, such as how to hotwire a car [6]
- The model provided a correct diagnosis and management plan for an acute anterior myocardial infarction [6][7]
- Qwen3 Coder gave a good answer to the trolley problem, weighing morality through utilitarianism and deontology [7][8]
- The model showed reasoning traces in its output when answering gotcha questions, though with some errors [11][12][13][14]

Technology & Implementation
- Together AI sponsors the testing of Qwen3 Coder, offering high-performance serverless endpoints with pay-per-token pricing [1][2]
- Qwen Code, an open-source version of Claude Code, works well with Qwen3 Coder and can be installed via npm [2]
- The model has a massive context window: 256k tokens natively, with up to 1 million achievable [1]
Chinese Open-Source DOMINATES Coding (GLM-4.5)
Matthew Berman· 2025-07-30 17:15
Model Performance & Capabilities
- Z.ai's GLM-4.5 model rivals top closed-source models in reasoning, coding, and agentic capabilities [1]
- GLM-4.5 demonstrates advanced problem-solving, successfully simulating and solving Rubik's cubes up to 10x10 [2][3][4][21]
- The model can solve the Tower of Hanoi puzzle with up to 10 discs, showcasing its reasoning abilities [5][6][7][24][25]
- GLM-4.5 exhibits strong coding skills, creating interactive simulations such as Lego building, a 3D solar system, and games like Flappy Bird [8][9][21][22]
- Benchmarks show GLM-4.5 outperforming other models on agentic tasks and achieving competitive scores in reasoning and coding [17][18][19]

Model Architecture & Variants
- GLM-4.5 comes in two versions: a larger model with 355 billion total parameters and 32 billion active parameters, and a smaller "Air" version with 106 billion total parameters and 12 billion active parameters [15]
- Both are hybrid reasoning models, capable of both reasoning and non-reasoning responses [16]

Open Source Landscape
- China is at the forefront of open-source AI model development with models like GLM-4.5, Kimi K2, and Qwen 3 [1][15]
- Kimi K2 is comparable in quality to GLM-4.5 but is 250% larger [20]

Tools & Resources
- HubSpot offers a free "AI Decoded" guide covering AI models, prompts, and tools [12][13][14]
Forward Future Live July 25, 2025
Matthew Berman· 2025-07-25 16:56
AI Resources & Tools
- Matthew Berman's Vibe Coding Playbook is available for free download [1]
- Humanity's Last Prompt Engineering Guide is available for free download [1]
- A curated list of AI tools is available [1]

Community & Updates
- Regular AI updates are provided through a newsletter [1]
- Matthew Berman can be followed on X (formerly Twitter) [1]
- Matthew Berman can be followed on Instagram [1]
- A Discord server is available for community engagement [1]

Media & Sponsorship
- Media and sponsorship inquiries are welcomed [1]