Matthew Berman

Claude Just Got a Big Update (Opus 4.1)
Matthew Berman· 2025-08-05 23:02
Model Release & Performance
- Anthropic released Claude Opus 4.1, an upgrade to Claude Opus 4 that focuses on agentic tasks, real-world coding, and reasoning [1]
- On SWE-bench Verified, Claude Opus 4.1 scores 74.5%, up 2 percentage points from Opus 4's 72.5% [3]
- On Terminal-Bench (terminal use), the score rises from 39.2% to 43.3%, a gain of 4.1 percentage points [4]
- On GPQA Diamond (graduate-level reasoning), the score rises from 79.6% to 80.9%, a gain of 1.3 percentage points [4]
- On TAU-bench (agentic tool use), the retail score rises from 81.4% to 82.4%, a gain of 1 percentage point, but the airline score falls from 59.6% to 56%, a drop of 3.6 percentage points [5]
- On the multilingual Q&A benchmark, the score rises from 88.8% to 89.5%, a gain of 0.7 percentage points [5]
- On AIME 2025, the score rises by 2.5 percentage points to 78% [5]
Competitive Positioning & Future Outlook
- On SWE-bench Verified and Terminal-Bench, Claude Opus 4.1 outperforms OpenAI's o3 and Gemini 2.5 Pro [5]
- On GPQA Diamond and agentic tool use, Claude Opus 4.1 trails OpenAI's o3 and Gemini 2.5 Pro [6]
- On the high school math competition benchmark, Claude Opus 4.1 scores only 78%, below OpenAI's o3 (88.9%) and Gemini 2.5 Pro (88%) [6]
- Claude is currently widely regarded as the best coding model on the market, and is especially strong at agentic coding and agent-driven development [7]
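The percentage-point changes quoted in this summary can be re-derived from the before and after scores. The following is a minimal sketch, not part of the video; the dictionary layout and the function name `point_deltas` are chosen here for illustration, and only the benchmarks for which the summary gives both scores are included.

```python
# Scores in percent as quoted in the summary: (Claude Opus 4, Claude Opus 4.1).
SCORES = {
    "SWE-bench Verified": (72.5, 74.5),
    "Terminal-Bench": (39.2, 43.3),
    "GPQA Diamond": (79.6, 80.9),
    "TAU-bench (retail)": (81.4, 82.4),
    "TAU-bench (airline)": (59.6, 56.0),
    "Multilingual Q&A": (88.8, 89.5),
}


def point_deltas(scores):
    """Return the change in percentage points per benchmark, rounded to 0.1."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, delta in point_deltas(SCORES).items():
        print(f"{name}: {delta:+.1f} points")
```

These are differences in percentage points, not relative improvements: the 2-point gain on SWE-bench Verified is a relative increase of roughly 2.8% over the 72.5% baseline.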
OpenAI Goes OPEN-SOURCE! gpt-oss is HERE!
Matthew Berman· 2025-08-05 22:09
OpenAI has delivered on their promise to release a state-of-the-art open-source model. This is gpt-oss. It comes in two sizes, a 120 billion parameter version and a 20 billion parameter version. These are state-of-the-art open-weight language models. Open weight. So, not just open-source, but they are actually releasing the weights to these models. Now, for some benchmarks, here is the Codeforces competition. Now the 120 billion parameter version with tools scores a 2622. That is compared to o3, a frontier m ...
OpenAI Dropped a FRONTIER Open-Weights Model
Matthew Berman· 2025-08-05 17:17
OpenAI has delivered on their promise to release a state-of-the-art open-source model. This is gpt-oss. Now, I think the mystery model Horizon Alpha that was on OpenRouter is actually this open-source model from OpenAI, although they have not confirmed that to me, but we do have an incredible model. Let me tell you about all the details. So, first it comes in two sizes, a 120 billion parameter version and a 20 billion parameter version. These are state-of-the-art open-weight language models. Open weight. So ...
Google Genie 3 - The Most Advanced World Simulator Ever...
Matthew Berman· 2025-08-05 14:02
Google just announced Genie 3, their world model that is fully controllable like a video game and fully immersive. This is going to change movies, TV, video games, everything. And according to Google, it is a big leap towards AGI. Let me show you some demos and then I'm going to tell you all about it. All right, so check this one out. A gorilla wearing a fancy outfit walking through some buildings. And you can see on screen they're actually showing that this is fully controllable. Now, what I want you to look at ...
Forward Future Live August 1st, 2025
Matthew Berman· 2025-08-01 16:55
Download The Matthew Berman Vibe Coding Playbook (free) 👇🏼 https://bit.ly/3I2J0YQ Download Humanities Last Prompt Engineering Guide (free) 👇🏼 https://bit.ly/4kFhajz Join My Newsletter for Regular AI Updates 👇🏼 https://forwardfuture.ai Discover The Best AI Tools👇🏼 https://tools.forwardfuture.ai My Links 🔗 👉🏻 X: https://x.com/matthewberman 👉🏻 Instagram: https://www.instagram.com/matthewberman_ai 👉🏻 Discord: https://discord.gg/xxysSXBxFW Media/Sponsorship Inquiries ✅ https://bit.ly/44TC45V ...
This might be OpenAI's New Open-Source Model...
Matthew Berman· 2025-08-01 00:00
We have a brand new mystery model on OpenRouter and it is called Horizon Alpha. It's available and it's free to use, and it looks likely to be the new open-source OpenAI model. Let me show you a couple of tests and then I'm going to tell you about it. And by the way, if you want to stay up to date on the latest model releases, their benchmarks, and all the cutting-edge AI news, you need to subscribe to our newsletter, Forward Future, at forwardfuture.ai. Of course, we had to try the spinning hexagon ball test beca ...
The AlphaGO Moment for AI Models...
Matthew Berman· 2025-07-31 18:08
AI Model Architecture Discovery
- The AI field is approaching an era where AI can discover new knowledge and apply it to itself, potentially leading to exponential innovation [1][3]
- The current bottleneck in AI discovery is human innovation, limiting the scaling of AI advancements [2][3]
- The "AlphaGo moment" for model architecture discovery involves AI self-play to hypothesize, code, test, and analyze new model architectures [3][12]
- The key to this approach is AI's ability to learn without human input, discovering novel solutions unconstrained by human biases [8]
ASI-Arch System
- The ASI-Arch system uses a researcher, engineer, and analyst to autonomously propose, implement, test, and analyze new neural network architectures [13][14][15][16]
- The system learns from past experiments and human literature to propose new architectures, selecting top performers as references [14]
- The engineer component self-heals code to ensure new approaches are properly tested [15]
- The analyst reviews results, learns insights, and maintains a memory of lessons learned for future generations of models [16]
Experimental Results and Implications
- The system ran 1,700 autonomous experiments over 20,000 GPU hours, resulting in 106 models that outperformed previous public models [17][18]
- The potential for exponential improvement exists by increasing compute resources, such as scaling from 20,000 to 20 million GPU hours [19]
- The self-improving AI system can be applied to other scientific fields like biology and medicine by increasing compute resources [20]
- The open-sourced paper and code have significant implications, with multiple companies publishing similar self-improving AI papers [21]
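The researcher, engineer, and analyst roles described above can be pictured as a simple loop. The sketch below is a toy illustration only and is not the ASI-Arch code: every name (`Archive`, `researcher`, `engineer`, `analyst`, `run`) is hypothetical, training is replaced by a random score, and the "self-healing" step is modeled as a plain retry.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Experiment:
    name: str
    score: float


@dataclass
class Archive:
    """Memory of past experiments and the lessons drawn from them."""
    experiments: list = field(default_factory=list)
    lessons: list = field(default_factory=list)

    def top(self, k):
        """Return the k best-scoring experiments so far."""
        return sorted(self.experiments, key=lambda e: e.score, reverse=True)[:k]


def researcher(archive, rng, step):
    """Propose a new candidate, using one of the top performers as a reference."""
    parent = rng.choice(archive.top(3))
    return f"arch-{step}", parent


def engineer(parent, rng, max_retries=3):
    """Stand-in for implement-and-train; retries model the self-healing step."""
    for _ in range(max_retries):
        if rng.random() < 0.2:  # simulated crash, e.g. a shape bug
            continue            # pretend the bug was fixed, then try again
        return parent.score + rng.gauss(0.0, 0.05)  # toy evaluation score
    return None  # give up after repeated failures


def analyst(archive, name, parent, score):
    """Record the result and a one-line lesson for future proposals."""
    archive.experiments.append(Experiment(name, score))
    verdict = "improved on" if score > parent.score else "did not beat"
    archive.lessons.append(f"{name} {verdict} {parent.name}")


def run(num_experiments, seed=0):
    """Run the propose, implement, analyze loop for a fixed number of steps."""
    rng = random.Random(seed)
    archive = Archive(experiments=[Experiment("baseline", 0.5)])
    for step in range(num_experiments):
        name, parent = researcher(archive, rng, step)
        score = engineer(parent, rng)
        if score is not None:
            analyst(archive, name, parent, score)
    return archive


if __name__ == "__main__":
    result = run(50)
    print(len(result.experiments), "experiments; best:", result.top(1)[0])
```

Because each proposal starts from a current top performer, the best score in the archive can only stay the same or rise as more experiments are run, which is the intuition behind the claim that more compute buys more discoveries.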
Claude Code in SHAMBLES (Qwen3 Coder Tested)
Matthew Berman· 2025-07-31 00:00
Model Performance & Capabilities
- Qwen3 Coder, an open-source frontier coding model from Alibaba, was tested for various capabilities [1]
- Qwen3 Coder successfully generated code for a 2D Navier-Stokes solver and a 3D rotating dodecahedron with bouncing spheres [1]
- The model failed a spatial-reasoning cube rotation task, although its code generation for the task succeeded [1]
- Qwen3 Coder passed a "needle in a haystack" test by finding a password hidden within the full text of Harry Potter and the Sorcerer's Stone [1]
- The model exhibited censorship regarding Tiananmen Square [1]
- Qwen3 Coder refused to take a stance on political questions, providing balanced perspectives on Trump and Kamala Harris [1][2]
- The model provided a thoughtful and nuanced response to a prompt about quitting a job and leaving family [2][3][4][5]
- Qwen3 Coder refused to answer questions about illegal activities, such as how to hotwire a car [6]
- The model provided a correct diagnosis and management plan for acute anterior myocardial infarction [6][7]
- Qwen3 Coder gave a good answer to the trolley problem, evaluating morality using utilitarianism and deontology [7][8]
- The model showed reasoning traces in its output when answering gotcha questions, although with some errors [11][12][13][14]
Technology & Implementation
- Together AI sponsored the video and serves Qwen3 Coder through high-performance serverless endpoints with pay-per-token pricing [1][2]
- Qwen Code, an open-source alternative to Claude Code, works well with Qwen3 Coder and can be installed via npm [2]
- The model has a massive context window: 256K tokens natively, extendable up to 1 million [1]
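A "needle in a haystack" test like the one mentioned above can be set up in a few lines. This is a minimal sketch under stated assumptions: the helper names (`build_haystack`, `build_prompt`, `passed`), the filler text, and the password are all hypothetical, and no model is called; the prompt would be sent to whatever endpoint is under test.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = list(filler_sentences)
    position = int(len(sentences) * depth)
    sentences.insert(position, needle)
    return " ".join(sentences)


def build_prompt(haystack):
    """Append the retrieval question after the long context."""
    return f"{haystack}\n\nQuestion: What is the password mentioned in the text above?"


def passed(model_answer, secret):
    """The test passes if the model's answer contains the secret string."""
    return secret.lower() in model_answer.lower()


if __name__ == "__main__":
    filler = ["The owl flew over the castle."] * 1000  # stand-in for a book's text
    secret = "blue-falcon-42"                            # hypothetical password
    prompt = build_prompt(build_haystack(filler, f"The password is {secret}.", 0.5))
    print(len(prompt.split()), "words in prompt")
```

Varying `depth` and the length of the filler shows whether retrieval degrades when the needle sits deep inside a long context, which is the property a 256K-token window is meant to support.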
Chinese Open-Source DOMINATES Coding (GLM-4.5)
Matthew Berman· 2025-07-30 17:15
Model Performance & Capabilities
- Z.ai's GLM-4.5 model rivals top closed-source models in reasoning, coding, and agentic capabilities [1]
- GLM-4.5 demonstrates advanced problem-solving by successfully simulating and solving Rubik's cubes up to 10x10 [2][3][4][21]
- The model can solve the Tower of Hanoi puzzle with up to 10 discs, showcasing its reasoning abilities [5][6][7][24][25]
- GLM-4.5 exhibits strong coding skills, creating interactive simulations like Lego building, a 3D solar system, and games like Flappy Bird [8][9][21][22]
- Benchmarks show GLM-4.5 outperforming other models in agentic tasks and achieving competitive scores in reasoning and coding [17][18][19]
Model Architecture & Variants
- GLM-4.5 comes in two versions: a larger model with 355 billion total parameters and 32 billion active parameters, and a smaller "Air" version with 106 billion total parameters and 12 billion active parameters [15]
- Both models are hybrid reasoning models, capable of operating in both reasoning and non-reasoning modes [16]
Open Source Landscape
- China is at the forefront of open-source AI model development with models like GLM-4.5, Kimi K2, and Qwen3 [1][15]
- Kimi K2 is comparable in quality to GLM-4.5 but is 250% larger [20]
Tools & Resources
- HubSpot offers a free AI Decoded guide covering AI models, prompts, and tools [12][13][14]
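For a sense of scale, the 10-disc Tower of Hanoi mentioned above needs at least 2^10 - 1 = 1023 moves. The sketch below is the textbook recursive solution with a checker that replays the moves; it illustrates the puzzle itself and is not the code GLM-4.5 produced in the video. The function names are chosen here for illustration.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (disc, from_peg, to_peg) moves that solve an n-disc Tower of Hanoi."""
    if n == 0:
        return
    # Move the n-1 smaller discs out of the way, move disc n, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


def simulate(n):
    """Replay the moves, checking that no disc is ever placed on a smaller one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    count = 0
    for disc, src, dst in hanoi_moves(n):
        assert pegs[src][-1] == disc, "moved disc must be on top of its peg"
        assert not pegs[dst] or pegs[dst][-1] > disc, "cannot place on a smaller disc"
        pegs[dst].append(pegs[src].pop())
        count += 1
    return count, pegs


if __name__ == "__main__":
    moves, final = simulate(10)
    print(moves)  # 1023, i.e. 2**10 - 1
```

The move count doubles (plus one) with every added disc, so a model that solves the puzzle move by move has to sustain a long, error-free sequence rather than recall a short answer.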
Forward Future Live July 25, 2025
Matthew Berman· 2025-07-25 16:56
AI Resources & Tools
- Matthew Berman's Vibe Coding Playbook is available for free download [1]
- Humanities Last Prompt Engineering Guide is available for free download [1]
- A curated list of AI tools is available [1]
Community & Updates
- Regular AI updates are provided through a newsletter [1]
- Matthew Berman can be followed on X (formerly Twitter) [1]
- Matthew Berman can be followed on Instagram [1]
- A Discord server is available for community engagement [1]
Media & Sponsorship
- Media and sponsorship inquiries are welcomed [1]