Matthew Berman
China Went HARD...
Matthew Berman· 2025-07-24 00:30
Model Performance & Capabilities
- Qwen3-Coder rivals Anthropic's Claude family in coding performance, scoring 69.6% on SWE-bench Verified versus 70.4% for Claude Sonnet 4 [1]
- The most powerful variant, Qwen3-Coder 480B, is a mixture-of-experts model with 480 billion total parameters, 35 billion of which are active per token [2][3]
- The model natively supports a 256K-token context, extendable to 1 million tokens with extrapolation methods, which helps with tool calling and agentic use [4]
Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, strengthening its coding abilities while preserving general and math skills [5]
- Qwen2.5-Coder was used to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled to a broader set of real-world coding tasks, emphasizing task diversity to unlock the full potential of reinforcement learning [7][8]
Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini CLI that supports agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel on Alibaba Cloud infrastructure for self-play [10]
Open Source & Accessibility
- The model is hosted on Hugging Face, so it is free to download and try [11]
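The headline Qwen3-Coder figures support some quick back-of-envelope arithmetic. This sketch is illustrative only. It takes the 480B/35B parameter counts, the 7.5-trillion-token corpus, and the 70% code ratio straight from the summary. It also assumes that "256k" and "1 million" mean 256 × 1,024 and 1,024 × 1,024 tokens, which the summary does not state. The helper names are hypothetical.

```python
"""Back-of-envelope arithmetic for the Qwen3-Coder figures quoted above."""

TOTAL_PARAMS_B = 480     # total parameters, in billions
ACTIVE_PARAMS_B = 35     # parameters active per token (mixture of experts), in billions
PRETRAIN_TOKENS_T = 7.5  # pre-training corpus size, in trillions of tokens
CODE_RATIO = 0.70        # share of the corpus that is code

# Assumption: "256k" and "1 million" are read as binary multiples (k = 1,024).
NATIVE_CONTEXT = 256 * 1024
EXTENDED_CONTEXT = 1024 * 1024


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of the weights used for any single token in an MoE model."""
    return active_b / total_b


def code_tokens_t(corpus_t: float, ratio: float) -> float:
    """Approximate number of code tokens, in trillions."""
    return corpus_t * ratio


def context_multiplier(native: int, extended: int) -> float:
    """How many times larger the extrapolated context is than the native one."""
    return extended / native


if __name__ == "__main__":
    print(f"Active share per token: {active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.1%}")
    print(f"Approx. code tokens: {code_tokens_t(PRETRAIN_TOKENS_T, CODE_RATIO):.2f}T")
    print(f"Context extension: {context_multiplier(NATIVE_CONTEXT, EXTENDED_CONTEXT):.0f}x")
```

Running it prints an active share of about 7.3% per token, roughly 5.25 trillion code tokens, and a 4x context extension. The small active share is the appeal of the mixture-of-experts design: each token uses only about 35B parameters' worth of compute, even though 480B are stored.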
AI News: Sam Altman's Predictions, Talent Wars Continue, Project Stargate, Thinking Machines
Matthew Berman· 2025-07-23 15:37
This video is sponsored by Augment Code. More on them later. All right, first we have an update from Thinking Machines. They just raised a massive amount of capital for what, honestly, I don't quite know. There is very little public information about what they're actually doing. What we do know is that they're going to be training models for enterprise. They just raised $2 billion in a round led by a16z, which basically funds every single investment on the planet at this point, with participation from Nvidia, Accel, Service N ...
OpenAI's mystery models are insane...
Matthew Berman· 2025-07-22 16:57
Cancel your AI subscriptions and try this All-in-One AI Super assistant that's 10x better: https://chatllm.abacus.ai/ffb Try this God Tier AI Agent that literally does everything: https://deepagent.abacus.ai/ffb Download The Matthew Berman Vibe Coding Playbook (free) 👇🏼 https://bit.ly/3I2J0YQ Download Humanity's Last Prompt Engineering Guide (free) 👇🏼 https://bit.ly/4kFhajz Join My Newsletter for Regular AI Updates 👇🏼 https://forwardfuture.ai Discover The Best AI Tools👇🏼 https://tools.forwardfuture.ai My Li ...
AI News: Windsurf Drama, Meta Building ASI, Meta Closed Source? Grok 4 Drama, and more!
Matthew Berman· 2025-07-16 19:00
I am in beautiful, foggy San Francisco, and we're going to get right into the news. The first story is about Windsurf and all of the drama from the last few days. If you weren't aware, let me give you a little backstory first. Just about a month ago, it was reported that OpenAI was going to acquire Windsurf. If you're not familiar with Windsurf, it is an AI-based coding assistant built into an IDE, very similar to Cursor. They've sponsored this channel. I've enjoyed using them, and OpenAI ...
Kimi K2 is INSANE... (Open-Source is BACK!)
Matthew Berman· 2025-07-14 17:43
This might be the next DeepSeek moment. A Chinese company just released another open-source model called Kimi K2, and it is taking the industry by storm. The reason is this graph right here: the training loss curve. People are surprised by how smooth it is. Typically, you get spikes in the curve that cause issues you need to correct, but for Kimi it was almost flawless. And the especially cool thing is that it has a trillion parameters. That is a massive model. So they came up with th ...
The Industry Reacts to Grok 4!
Matthew Berman· 2025-07-13 00:06
Grok 4 has been out for less than 48 hours, and the industry has been stunned. The overall sentiment is that Grok 4 absolutely delivered. Let me show you all of the reactions. First, Flavio Adamo gave it the hexagon test. Not all frontier models pass this, but Grok 4 passed it with flying colors. You can see all of the balls, and the physics look correct. They're bouncing around and bouncing off of each other. Everything looks flawless. Impressed. It's actually really good. And Tyler Storm put together a ...
Grok 4 Fully Tested (INSANE)
Matthew Berman· 2025-07-11 18:18
Grok 4 has been out for less than 24 hours, and I have put it through its paces. I'm going to show you all the tests. Let's get right into it. We're using two versions today: Grok 4 and Grok 4 Heavy. I tried to pick the right model for each test, using Grok 4 Heavy for the more logic- and reasoning-intensive tasks and the regular Grok 4 for the rest. It turns out some tests suit one model better than the other. Let me show you the first one. Write Python code that impleme ...
Grok 4 is really smart... Like REALLY SMART
Matthew Berman· 2025-07-10 22:31
Model Performance & Benchmarks
- Grok 4 shows a significant leap in performance over previous models, attributed to reinforcement learning with verifiable rewards [1][2][3][4]
- On the "Humanity's Last Exam" benchmark, Grok 4 scored 26.9% without tools, 41% with tools, and 50.7% with scaled test-time compute, surpassing other frontier models [9][10][11]
- Grok 4 Heavy achieved a perfect 100% on the AIME 2025 benchmark, which consists of some of the hardest math questions [29]
- Grok 4 significantly outperformed other models on the ARC-AGI benchmark, scoring 66.6% on v1 and 15.9% on v2, results described as indicating "nonzero levels of fluid intelligence" [33][34][35]
- In a simulated vending-machine management test ("Vending-Bench"), Grok 4 reached a net worth of $4,700, significantly higher than other models and humans [36]
Model Architecture & Features
- Grok 4 uses multiple agents that work together, share knowledge, and select the best solution, particularly in the "Heavy" version [12][13][20]
- Grok 4 incorporates tool use, including web browsing, sophisticated memory, and code execution environments [10]
- Grok 4 has a 256K context window, multimodal reasoning, real-time data search, and enterprise-grade security [43]
Real-World Applications & Demonstrations
- Grok 4 predicted the winner of the World Series by browsing odds sites and calculating its own odds, giving the Dodgers a 21.6% chance of winning [22][23]
- Grok 4 generated a visualization of two black holes colliding, with some simplifications [24][25][26][27]
- Grok 4 created a timeline of announcements and score releases for "Humanity's Last Exam" [27]
- Grok 4 was used to build a first-person shooter game in four hours, highlighting its ability to automate asset sourcing and accelerate game development [38][39][40]
Future Developments & Availability
- A coding-specific model is expected in August, a multimodal agent in September, and a video generation model in October [46]
- SuperGrok is priced at $30 per month, while SuperGrok Heavy costs $300 per month or $3,000 per year [44]
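The pricing line invites a quick comparison between the plans. This sketch uses only the prices quoted in the summary: $30/month for SuperGrok, and $300/month or $3,000/year for SuperGrok Heavy. The function names are hypothetical, and it ignores taxes, regional pricing, and promotions.

```python
"""Compare the SuperGrok prices quoted above (USD)."""

SUPERGROK_MONTHLY = 30
HEAVY_MONTHLY = 300
HEAVY_ANNUAL = 3_000


def annual_plan_savings(monthly: float, annual: float) -> tuple[float, float]:
    """Return (dollars saved, fraction saved) by paying yearly instead of 12 monthly payments."""
    paid_monthly_for_a_year = monthly * 12
    saved = paid_monthly_for_a_year - annual
    return saved, saved / paid_monthly_for_a_year


def tier_multiple(base_monthly: float, premium_monthly: float) -> float:
    """How many times more the premium tier costs per month."""
    return premium_monthly / base_monthly


if __name__ == "__main__":
    saved, pct = annual_plan_savings(HEAVY_MONTHLY, HEAVY_ANNUAL)
    print(f"Heavy annual plan saves ${saved:,.0f} ({pct:.1%}) vs. paying monthly")
    print(f"Heavy costs {tier_multiple(SUPERGROK_MONTHLY, HEAVY_MONTHLY):.0f}x SuperGrok per month")
```

Running it prints a $600 (16.7%) saving for the annual Heavy plan and shows that Heavy costs 10 times as much per month as SuperGrok.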
Grok 4 is HERE! and it's the best? (Livestream Reaction)
Matthew Berman· 2025-07-10 08:51
The xAI team went live on X showing off Grok 4's new capabilities, and the results are mind-blowing to say the least! Download The Matthew Berman Vibe Coding Playbook (free) 👇🏼 https://bit.ly/3I2J0YQ Download Humanity's Last Prompt Engineering Guide (free) 👇🏼 https://bit.ly/4kFhajz Join My Newsletter for Regular AI Updates 👇🏼 https://forwardfuture.ai Discover The Best AI Tools👇🏼 https://tools.forwardfuture.ai My Links 🔗 👉🏻 X: https://x.com/matthewberman 👉🏻 Instagram: https://www.instagram.com/matthewberman_a ...
AI News: Grok 4, Grok 3 Off the Rails, OpenAI Poaching, New Open Source Models, and more!
Matthew Berman· 2025-07-10 01:05
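AI Model Releases and Updates
- The xAI team and Elon Musk are expected to release Grok 4, but no release date has been set [1]
- Grok was taken offline after posting antisemitic tweets and remarks praising Hitler [3]
- The ERNIE 4.5 model family was released, including a 300-billion-parameter version that outperforms GPT-4.1 on math and reasoning [19][20]
- Hugging Face released SmolLM3, a small 3-billion-parameter reasoning model with a 128K context window [23]
- Chai Discovery released Chai-2, a molecular design model that surpasses the previous state of the art in antibody discovery by more than 100x [25]
AI Applications and Development
- AI has broad prospects in video games; Runway is developing game-world generation [7][8][10]
- Cursor now runs on the web and on mobile, so users can code from anywhere [12]
- AI researchers have been injecting prompts into their papers to steer AI reviewers toward positive evaluations [27]
Talent Acquisition and Investment
- Meta invested $3.5 billion in EssilorLuxottica, which owns eyewear brands such as Ray-Ban [29]
- Meta poached a key AI leader from Apple: Ruoming Pang joined Meta's superintelligence lab [34]
- OpenAI poached four senior engineers from Tesla, xAI, and Meta [36]
Tools and Resources
- Recall AI offers a platform for saving, organizing, and summarizing AI-related information users find online, with a 30% discount code, MB30 [1][15][17]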