Matthew Berman
Claude Code in SHAMBLES (Qwen3 Coder Tested)
Matthew Berman · 2025-07-31 00:00
Model Performance & Capabilities
- Qwen 3 Coder, an open-source frontier coding model from Alibaba, was tested across a range of capabilities [1]
- Qwen 3 successfully generated code for a 2D Navier-Stokes solver and a 3D rotating dodecahedron with bouncing spheres [1]
- The model failed a cube-rotation spatial-reasoning task, although the code it generated still ran successfully [1]
- Qwen 3 passed a "needle in a haystack" test, finding a password hidden in the full text of Harry Potter and the Sorcerer's Stone [1]
- The model exhibited censorship regarding Tiananmen Square [1]
- Qwen 3 refused to take a stance on political questions, providing balanced perspectives on Trump and Kamala Harris [1][2]
- The model gave a thoughtful, nuanced response to a prompt about quitting a job and leaving family [2][3][4][5]
- Qwen 3 refused to answer illegal questions, such as how to hotwire a car [6]
- The model provided a correct diagnosis and management plan for acute anterior myocardial infarction [6][7]
- Qwen 3 gave a good answer to the trolley problem, evaluating morality through both utilitarianism and deontology [7][8]
- The model showed reasoning traces in its output when answering gotcha questions, though with some errors [11][12][13][14]

Technology & Implementation
- Together AI sponsors the video and offers high-performance serverless endpoints with pay-per-token pricing for Qwen 3 (a minimal API sketch follows this summary) [1][2]
- Qwen Code, an open-source counterpart to Claude Code, works well with Qwen 3 and can be installed via npm [2]
- The model has a massive context window: 256k tokens natively, up to 1 million with extension [1]
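For reference, here is a minimal sketch of what calling Qwen 3 through a pay-per-token serverless endpoint such as Together AI's might look like, using the OpenAI-compatible Python client. The base URL, model id, and environment variable name are assumptions for illustration, not details confirmed in the video:

```python
# Minimal sketch: querying Qwen 3 Coder via an OpenAI-compatible
# serverless endpoint. The base URL, model id, and env var name are
# assumptions, not confirmed by the video -- check the provider's docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",       # assumed endpoint
    api_key=os.environ["TOGETHER_API_KEY"],       # assumed env var name
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B-Instruct",  # assumed model id
    messages=[{
        "role": "user",
        "content": "Write a 2D Navier-Stokes solver in Python.",
    }],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```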
Chinese Open-Source DOMINATES Coding (GLM-4.5)
Matthew Berman · 2025-07-30 17:15
Model Performance & Capabilities
- Z.ai's GLM 4.5 model rivals top closed-source models in reasoning, coding, and agentic capabilities [1]
- GLM 4.5 demonstrates advanced problem-solving by simulating and solving Rubik's cubes up to 10x10 [2][3][4][21]
- The model can solve the Tower of Hanoi puzzle with up to 10 discs, showcasing its reasoning abilities (a reference solution is sketched after this summary) [5][6][7][24][25]
- GLM 4.5 exhibits strong coding skills, creating interactive simulations such as Lego building and a 3D solar system, and games like Flappy Bird [8][9][21][22]
- Benchmarks show GLM 4.5 outperforming other models on agentic tasks and achieving competitive scores in reasoning and coding [17][18][19]

Model Architecture & Variants
- GLM 4.5 comes in two versions: a larger model with 355 billion total and 32 billion active parameters, and a smaller "Air" version with 106 billion total and 12 billion active parameters [15]
- Both are hybrid reasoning models, capable of both reasoning and non-reasoning modes [16]

Open Source Landscape
- China is at the forefront of open-source AI development with models like GLM 4.5, Kimi K2, and Qwen 3 [1][15]
- Kimi K2 is comparable in quality to GLM 4.5 but is roughly 2.5x larger [20]

Tools & Resources
- HubSpot offers a free AI Decoded guide covering AI models, prompts, and tools [12][13][14]
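As context for the Tower of Hanoi bullet above: the optimal solution for n discs takes 2^n - 1 moves, so 10 discs means a 1,023-move sequence the model has to keep consistent end to end. A minimal reference implementation of the classic recursive solution:

```python
# Classic recursive Tower of Hanoi: move n discs from peg `src` to
# peg `dst` using peg `aux`. The optimal solution takes 2**n - 1 moves,
# so 10 discs require 1023 moves -- a long, fully checkable output
# for probing a model's multi-step reasoning.
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, aux, dst, moves)   # park n-1 discs on the spare peg
    moves.append((src, dst))             # move the largest disc
    hanoi(n - 1, aux, dst, src, moves)   # restack the n-1 discs on top
    return moves

moves = hanoi(10)
print(len(moves))  # 1023 == 2**10 - 1
```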
Forward Future Live July 25, 2025
Matthew Berman · 2025-07-25 16:56
AI Resources & Tools
- Matthew Berman's Vibe Coding Playbook is available for free download [1]
- Humanity's Last Prompt Engineering Guide is available for free download [1]
- A curated list of AI tools is available [1]

Community & Updates
- Regular AI updates are provided through a newsletter [1]
- Matthew Berman can be followed on X (formerly Twitter) [1]
- Matthew Berman can be followed on Instagram [1]
- A Discord server is available for community engagement [1]

Media & Sponsorship
- Media and sponsorship inquiries are welcomed [1]
China Went HARD...
Matthew Berman · 2025-07-24 00:30
Model Performance & Capabilities
- Qwen 3 Coder rivals Anthropic's Claude family in coding performance, scoring 69.6% on SWE-bench Verified versus Claude Sonnet 4's 70.4% [1]
- The most powerful variant, Qwen 3 Coder 480B, is a mixture-of-experts model with 480 billion total parameters, of which 35 billion are active per token (a toy routing sketch follows this summary) [2][3]
- The model supports a native context length of 256k tokens, extendable to 1 million with extrapolation methods, which strengthens tool calling and agentic use [4]

Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding ability while preserving general and math skills [5]
- Qwen 2.5 Coder was leveraged to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled to a broader, more diverse set of real-world coding tasks to unlock the full potential of reinforcement learning [7][8]

Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini CLI, enabling agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel on Alibaba Cloud infrastructure for self-play [10]

Open Source & Accessibility
- The model is hosted on Hugging Face and is free to download and try [11]
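To make the "480B total, 35B active" distinction concrete, here is a toy sketch of top-k expert routing, the mechanism that lets a mixture-of-experts model touch only a small fraction of its weights per token. The expert count, dimensions, and k below are made up for illustration and are not Qwen 3 Coder's real configuration:

```python
# Toy top-k mixture-of-experts routing in NumPy. Illustrative only:
# the expert count, hidden size, and k are invented and do not
# reflect Qwen 3 Coder's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

router_w = rng.normal(size=(d_model, n_experts))            # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                       # score each expert
    top = np.argsort(logits)[-top_k:]           # indices of the best experts
    gate = np.exp(logits[top])
    gate = gate / gate.sum()                    # softmax over the top-k only
    # Only top_k of the n_experts weight matrices are used per token --
    # this is how a model with 480B total parameters can run with only
    # ~35B "active" parameters at a time.
    return sum(w * (x @ experts[i]) for w, i in zip(gate, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```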
AI News: Sam Altman's Predictions, Talent Wars Continue, Project Stargate, Thinking Machines
Matthew Berman · 2025-07-23 15:37
This video is sponsored by Augment Code. More on them later. All right, first we have an update from Thinking Machines. They just raised a massive amount of capital for what, I actually don't quite know. There is very little public information about what they're actually doing. What we do know is that they're going to be training models for enterprise. They just raised $2 billion led by a16z, who basically funds every single investment on the planet at this point, with participation from Nvidia, Accel, Service N ...
OpenAI's mystery models are insane...
Matthew Berman · 2025-07-22 16:57
AI News: Windsurf Drama, Meta Building ASI, Meta Closed Source? Grok 4 Drama, and more!
Matthew Berman · 2025-07-16 19:00
I am in beautiful, foggy San Francisco, and we're going to get right into the news. The first story is about Windsurf and all of the drama that happened in the last few days. If you were not aware, let me give you a little bit of backstory first. Just about a month ago, it was reported that OpenAI was going to acquire Windsurf. And if you're not familiar with Windsurf, it is an AI-based coding assistant built into an IDE, very similar to Cursor. They've sponsored this channel. I've enjoyed using them, and OpenAI ...
Kimi K2 is INSANE... (Open-Source is BACK!)
Matthew Berman · 2025-07-14 17:43
Model Overview
- Kimi K2 is a state-of-the-art mixture-of-experts language model with 32 billion activated parameters and 1 trillion total parameters [3]
- The model was pre-trained on 15.5 trillion tokens with zero training instability [4]
- Kimi K2 supports up to 2 million tokens in the context window [5]

Performance Benchmarks
- Kimi K2 Instruct beats DeepSeek, Qwen, and GPT-4.1 on SWE-bench Verified, coming in just behind Claude 4 Opus [7]
- On LiveCodeBench, Kimi K2 beats Claude 4 Opus [7]
- Kimi K2 tops the list on AIME 2025 for math and on GPQA Diamond [8]

Optimization and Training
- The model is trained with the Muon optimizer [4]
- Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks [4]
- The training process was open-sourced [8]

Availability and Cost
- Inference is available through Kimi directly at $0.15 per million input tokens with a cache hit, $0.60 without, and $2.50 per million output tokens (a worked cost example follows this summary) [10]
- Kimi K2 is available on OpenRouter [13]

Industry Reception
- Industry experts compare Kimi K2 to DeepSeek V3 [11]
- Kimi K2 is recognized as a potential new leader among open LLMs [14]
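To put the per-token pricing above in perspective, here is a small worked example using the quoted rates; the request sizes are hypothetical, chosen only for illustration:

```python
# Worked example of Kimi K2 API costs using the rates quoted above
# ($ per million tokens). The request sizes are hypothetical.
PRICE_IN_CACHED = 0.15   # $/M input tokens on a cache hit
PRICE_IN = 0.60          # $/M input tokens without a cache hit
PRICE_OUT = 2.50         # $/M output tokens

def cost(input_tokens, output_tokens, cached=False):
    """Total dollar cost for one request at the quoted rates."""
    rate_in = PRICE_IN_CACHED if cached else PRICE_IN
    return (input_tokens * rate_in + output_tokens * PRICE_OUT) / 1_000_000

# e.g. a 50k-token prompt producing a 2k-token answer:
print(f"${cost(50_000, 2_000):.4f} uncached")              # $0.0350
print(f"${cost(50_000, 2_000, cached=True):.4f} cached")   # $0.0125
```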
The Industry Reacts to Grok 4!
Matthew Berman · 2025-07-13 00:06
Grok 4 has been out for less than 48 hours and the industry has been stunned. The overall sentiment is that Grok 4 absolutely delivered. Let me show you all of the reactions. First, Flavio Adamo gave it the hexagon test, and not all frontier models pass this, but Grok 4 passed it with flying colors. So, you can see all of the balls, and the physics look correct. They're bouncing around. They're bouncing off of each other. Everything looks flawless. Impressed. It's actually really good. And Tyler Storm put together a ...
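For readers unfamiliar with the "hexagon test" mentioned above: the prompt typically asks for balls bouncing inside a spinning hexagon under gravity. A minimal, headless sketch of the core collision math, with one ball, the wall's own motion ignored at impact for simplicity, and all constants chosen for illustration rather than taken from any particular demo:

```python
# Minimal, headless sketch of the "ball in a spinning hexagon" test:
# one ball, gravity, and reflection off the hexagon's walls. Wall
# motion is ignored at impact for simplicity; constants are illustrative.
import math

R, BALL_R = 5.0, 0.3                   # hexagon circumradius, ball radius
GRAVITY, OMEGA, DT = -9.8, 0.5, 0.01   # gravity, spin rate (rad/s), timestep

pos, vel = [0.0, 0.0], [2.0, 3.0]
theta = 0.0                            # current rotation of the hexagon

for step in range(1000):
    theta += OMEGA * DT
    vel[1] += GRAVITY * DT
    pos[0] += vel[0] * DT
    pos[1] += vel[1] * DT

    # Check the ball against each of the 6 edges (CCW vertex order).
    verts = [(R * math.cos(theta + k * math.pi / 3),
              R * math.sin(theta + k * math.pi / 3)) for k in range(6)]
    for k in range(6):
        (x1, y1), (x2, y2) = verts[k], verts[(k + 1) % 6]
        ex, ey = x2 - x1, y2 - y1
        length = math.hypot(ex, ey)
        nx, ny = -ey / length, ex / length      # inward normal (CCW polygon)
        dist = (pos[0] - x1) * nx + (pos[1] - y1) * ny
        if dist < BALL_R:                       # penetrating this wall
            pos[0] += (BALL_R - dist) * nx      # push the ball back inside
            pos[1] += (BALL_R - dist) * ny
            v_n = vel[0] * nx + vel[1] * ny
            if v_n < 0:                         # moving into the wall
                vel[0] -= 2 * v_n * nx          # reflect velocity
                vel[1] -= 2 * v_n * ny

print(f"final position: ({pos[0]:.2f}, {pos[1]:.2f})")
```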
Grok 4 Fully Tested (INSANE)
Matthew Berman · 2025-07-11 18:18
Grok 4 has been out for less than 24 hours and I have put it through its paces. I'm going to show you all the tests. Let's get right into it. So, we have two versions that we're going to be using today: Grok 4 and Grok 4 Heavy. I tried to use the appropriate model when appropriate. I used Grok 4 Heavy for the more logic- and reasoning-intensive tasks and the regular Grok 4 for the others. It turns out some tests are more appropriate for one than the other. Let me show you the first one. Write Python code that impleme ...