Matthew Berman
China Went HARD...
Matthew Berman· 2025-07-24 00:30
Model Performance & Capabilities
- Qwen3-Coder rivals Anthropic's Claude family in coding performance, scoring 69.6% on SWE-bench Verified versus 70.4% for Claude Sonnet 4 [1]
- The most powerful variant, Qwen3-Coder 480B, is a mixture-of-experts model with 480 billion total parameters, of which 35 billion are active [2][3]
- The model supports a native context length of 256k tokens, and up to 1 million tokens with extrapolation methods, which strengthens it for tool calling and agentic use [4]
Training Data & Methodology
- The model was pre-trained on 7.5 trillion tokens with a 70% code ratio, improving coding ability while maintaining general and math skills [5]
- Qwen2.5-Coder was used to clean and rewrite noisy data, significantly improving overall data quality [6]
- Code RL training was scaled across a broader, more diverse set of real-world coding tasks to unlock the full potential of reinforcement learning [7][8]
Tooling & Infrastructure
- Qwen launched Qwen Code, a command-line tool adapted from Gemini CLI, enabling agentic, multi-turn execution with planning [2][5][9]
- A scalable system was built to run 20,000 independent environments in parallel, leveraging Alibaba Cloud's infrastructure for self-play [10]
Open Source & Accessibility
- The model is hosted on Hugging Face, making it free to use and try out [11]
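To put the 480-billion-total versus 35-billion-active figures in perspective, here is a minimal Python sketch of top-k expert routing, the general idea behind a mixture-of-experts layer. The gate scores, the expert count, and `k` are made-up illustration values. The summary does not describe the model's actual router, so treat this as a generic toy, not the model's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k):
    """Pick the k highest-probability experts and renormalize their weights.

    Returns {expert_index: weight}. Every other expert is skipped for this
    token, which is why only a fraction of the parameters is active.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def active_fraction(total_params, active_params):
    """Share of parameters used for a single token."""
    return active_params / total_params

# Hypothetical gate scores for 8 experts (illustration only).
weights = route_top_k([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(sorted(weights))  # indices of the two chosen experts

# Figures quoted in the summary: 35B active out of 480B total.
print(f"{active_fraction(480e9, 35e9):.1%}")  # 7.3%
```

With the quoted figures, roughly one parameter in fourteen participates in any single token, which is how a model of this size can keep per-token compute closer to that of a much smaller dense model.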
AI News: Sam Altman's Predictions, Talent Wars Continue, Project Stargate, Thinking Machines
Matthew Berman· 2025-07-23 15:37
This video is sponsored by Augment Code. More on them later. All right, first we have an update from Thinking Machines. They just raised a massive amount of capital for what, I actually don't quite know. There is very little public information about what they're actually doing. What we do know is that they're going to be training models for enterprise. They just raised $2 billion led by a16z, who basically funds every single investment on the planet at this point, with participation from Nvidia, Accel, Service N ...
OpenAI's mystery models are insane...
Matthew Berman· 2025-07-22 16:57
Cancel your AI subscriptions and try this All-in-One AI Super assistant that's 10x better: https://chatllm.abacus.ai/ffb Try this God Tier AI Agent that literally does everything: https://deepagent.abacus.ai/ffb Download The Matthew Berman Vibe Coding Playbook (free) 👇🏼 https://bit.ly/3I2J0YQ Download Humanities Last Prompt Engineering Guide (free) 👇🏼 https://bit.ly/4kFhajz Join My Newsletter for Regular AI Updates 👇🏼 https://forwardfuture.ai Discover The Best AI Tools👇🏼 https://tools.forwardfuture.ai My Li ...
AI News: Windsurf Drama, Meta Building ASI, Meta Closed Source? Grok 4 Drama, and more!
Matthew Berman· 2025-07-16 19:00
Acquisitions and Talent Strategy
- OpenAI's potential acquisition of Windsurf for approximately $3 billion fell through, leading Google to hire around 30 of Windsurf's top team members while leaving Windsurf as an independent entity [2]
- Cognition acquired the remaining assets and team of Windsurf, with 100% of Windsurf employees participating financially in the transaction [3][6][7]
- Meta brought in Alexandr Wang, the CEO of Scale AI, and a team to lead its superintelligence efforts [4]
- Meta is making offers of up to $100 million to attract top AI researchers [9]
Compute Infrastructure and Investment
- Meta is investing hundreds of billions of dollars in compute infrastructure for superintelligence [10]
- Meta is building multi-gigawatt clusters: the first, Prometheus, comes online in 2026, and Hyperion will scale up to 5 gigawatts over several years [11]
Open Source and AI Model Development
- Meta's new superintelligence lab is considering abandoning its open-source AI model strategy in favor of developing a closed model [13]
- Mistral AI released Voxtral, an open-source speech recognition model that outperforms Whisper Large V3 at speech transcription [33][34]
AI Model Issues and Solutions
- Grok 4 had issues stemming from its system prompt, including associating itself with controversial surnames and reflecting Elon Musk's views on political topics [22][23]
- xAI tweaked the prompts to mitigate these issues and shared the details on GitHub for transparency [24]
Reinforcement Learning Advancements
- OpenPipe may have discovered a universal reward function that allows reinforcement learning to be applied to any agent without labeled data or handcrafted reward functions [27][28]
- Small models trained with RULER plus GRPO are more reliable than o3 on four out of four tasks despite being 1/20th the cost [29]
Government Collaboration
- xAI is offering Grok for Government, a suite of products available to US government customers and purchasable via the General Services Administration schedule [32]
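The reinforcement learning item can be made a little more concrete. GRPO, as generally described, scores several attempts at the same task and converts each score into an advantage relative to the group, so only relative quality matters and no absolute, hand-tuned reward scale is needed. The sketch below assumes the scores come from some judge (here they are hard-coded, hypothetical numbers). It is not OpenPipe's implementation, and the function name is invented for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(scores, eps=1e-8):
    """Normalize one group of rewards to zero mean and (roughly) unit std.

    `scores` holds one reward per attempt at the same task. `eps` avoids a
    division by zero when every attempt receives the same score.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

# Hypothetical judge scores for four attempts at one task.
judge_scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(judge_scores)

for score, adv in zip(judge_scores, advantages):
    print(f"score={score:.1f} advantage={adv:+.2f}")
```

Attempts that beat the group average get a positive advantage and are reinforced; those below it are discouraged. If all attempts score the same, every advantage is zero and the group contributes no learning signal.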
Kimi K2 is INSANE... (Open-Source is BACK!)
Matthew Berman· 2025-07-14 17:43
Model Overview
- Kimi K2 is a state-of-the-art mixture-of-experts language model with 32 billion activated parameters and 1 trillion total parameters [3]
- The model was pre-trained on 15.5 trillion tokens with zero training instability [4]
- Kimi K2 supports up to 2 million tokens in the context window [5]
Performance Benchmarks
- Kimi K2 Instruct beats DeepSeek, Qwen, and GPT-4.1 on SWE-bench Verified, coming in right behind Claude 4 Opus [7]
- On LiveCodeBench, Kimi K2 beats Claude 4 Opus [7]
- Kimi K2 tops the list on AIME 2025 for math and on GPQA Diamond [8]
Optimization and Training
- The model is trained with the Muon optimizer [4]
- Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks [4]
- The training process was open source [8]
Availability and Cost
- Inference is available directly from Kimi at $0.15 per million input tokens on a cache hit, $0.60 per million input tokens on a cache miss, and $2.50 per million output tokens [10]
- Kimi K2 is available on OpenRouter [13]
Industry Reception
- Industry experts compare Kimi K2 to DeepSeek V3 [11]
- Kimi K2 is recognized as potentially the new leader in open LLMs [14]
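The per-token prices above are easier to compare once applied to a workload. The following sketch estimates a bill from the three quoted rates. The workload numbers and the `cache_hit_ratio` parameter are assumptions for illustration, and real billing may differ in rounding or in how cached tokens are counted.

```python
# Prices in USD per million tokens, as quoted in the summary above.
INPUT_CACHED = 0.15
INPUT_UNCACHED = 0.60
OUTPUT = 2.50

def estimate_cost(input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Estimate the cost in USD of a workload.

    cache_hit_ratio is the assumed fraction of input tokens served from cache.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    cached = input_tokens * cache_hit_ratio
    uncached = input_tokens - cached
    total = cached * INPUT_CACHED + uncached * INPUT_UNCACHED + output_tokens * OUTPUT
    return total / 1_000_000

# Hypothetical workload: 1M input tokens, half of them cached, 200k output tokens.
cost = estimate_cost(1_000_000, 200_000, cache_hit_ratio=0.5)
print(f"${cost:.3f}")  # $0.875
```

In this example the output tokens account for more than half of the bill even though they are a fifth of the input volume, so output length is the first thing to watch when budgeting.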
The Industry Reacts to Grok 4!
Matthew Berman· 2025-07-13 00:06
Grok 4 has been out for less than 48 hours and the industry has been stunned. The overall sentiment is Grok 4 absolutely delivered. Let me show you all of the reactions. First, Flavio Adamo gave it the hexagon test. And not all frontier models passed this, but Grok 4 passed it with flying colors. So, you can see all of the balls and the physics look correct. They're bouncing around. They're bouncing off of each other. Everything looks flawless. Impressed. It's actually really good. And Tyler Storm put together a ...
Grok 4 Fully Tested (INSANE)
Matthew Berman· 2025-07-11 18:18
Grok 4 has been out for less than 24 hours and I have put it through its paces. I'm going to show you all the tests. Let's get right into it. So, we have two versions that we're going to be using today. We have Grok 4 and Grok 4 Heavy. I tried to use the appropriate model when appropriate. I use Grok 4 Heavy for the more logic- and reasoning-intensive tasks and the regular Grok 4 for others. Turns out some tests are more appropriate for one than the other. Let me show you the first one. Write Python code that impleme ...
Grok 4 is really smart... Like REALLY SMART
Matthew Berman· 2025-07-10 22:31
Model Performance & Benchmarks
- Grok 4 demonstrates a significant leap in performance compared to previous models due to reinforcement learning with verifiable rewards [1][2][3][4]
- On the "Humanity's Last Exam" benchmark, Grok 4 achieved 26.9% without tools, 41% with tool usage, and 50.7% with scaled test-time compute, surpassing other frontier models [9][10][11]
- Grok 4 Heavy achieved a perfect 100% score on the AIME 2025 benchmark, which consists of some of the hardest math questions [29]
- Grok 4 significantly outperformed other models on the ARC-AGI benchmark, achieving 66.6% on V1 and 15.9% on V2, indicating "nonzero levels of fluid intelligence" [33][34][35]
- In a real-world vending machine management test ("Vending-Bench"), Grok 4 achieved a net worth of $4,700, significantly higher than other models and humans [36]
Model Architecture & Features
- Grok 4 uses multiple agents that work together, share knowledge, and select the best solution, particularly in the "Heavy" version [12][13][20]
- Grok 4 incorporates tool usage, including web browsing, sophisticated memory, and code execution environments [10]
- Grok 4 has a 256k context window, multimodal reasoning capabilities, real-time data search, and enterprise-grade security [43]
Real-World Applications & Demonstrations
- Grok 4 was used to predict the winner of the World Series by browsing odds sites and calculating its own odds, giving the Dodgers a 21.6% chance of winning [22][23]
- Grok 4 generated a visualization of two black holes colliding, demonstrating its ability to create content with some simplifications [24][25][26][27]
- Grok 4 was used to create a timeline of announcements and score releases for "Humanity's Last Exam" [27]
- Grok 4 was used to create a first-person shooter game in four hours, highlighting its ability to automate asset sourcing and accelerate game development [38][39][40]
Future Developments & Availability
- A coding-specific model is expected in August, a multimodal agent in September, and a video generation model in October [46]
- SuperGrok is priced at $30 per month, while SuperGrok Heavy is priced at $300 per month or $3,000 per year [44]
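As a quick check on the SuperGrok Heavy pricing, the arithmetic below compares twelve monthly payments against the annual plan. It uses only the two prices quoted in the summary; the helper name is made up for illustration.

```python
def annual_savings(monthly_price, annual_price):
    """Return (dollars saved, fraction saved) when paying yearly instead of monthly."""
    twelve_months = monthly_price * 12
    saved = twelve_months - annual_price
    return saved, saved / twelve_months

# SuperGrok Heavy prices from the summary: $300 per month or $3,000 per year.
saved, fraction = annual_savings(300, 3000)
print(saved, f"{fraction:.1%}")  # 600 16.7%
```

In other words, the annual plan works out to two months free relative to paying monthly for a full year.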
Grok 4 is HERE! and it's the best? (Livestream Reaction)
Matthew Berman· 2025-07-10 08:51
The xAI team went live on x showing off Grok 4's new capabilities and the results are mind-blowing to say the least!
AI News: Grok 4, Grok 3 Off the Rails, OpenAI Poaching, New Open Source Models, and more!
Matthew Berman· 2025-07-10 01:05
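AI Model Releases and Updates
- The xAI team and Elon Musk are expected to release Grok 4, but the release time has not yet been set [1]
- Grok was taken offline after posting antisemitic tweets and remarks praising Hitler [3]
- The Ernie 4.5 model family was released, including a 300-billion-parameter version that outperforms GPT-4.1 in math and reasoning [19][20]
- Hugging Face released SmolLM3, a small 3-billion-parameter reasoning model with a 128k context window [23]
- Chai Discovery released Chai-2, a molecular design model that surpasses the previous state of the art in antibody discovery by more than 100x [25]
AI Applications and Development
- AI has broad prospects in video games, and Runway is developing game-world generation features [7][8][10]
- Cursor now runs on the web and on mobile, so users can code anytime, anywhere [12]
- AI researchers have been injecting prompts into papers to steer AI toward giving positive reviews [27]
Talent Acquisition and Investment
- Meta invested $3.5 billion in EssilorLuxottica, the company that owns eyewear brands such as Ray-Ban [29]
- Meta poached a key AI leader from Apple: Ruoming Pang is joining Meta's superintelligence lab [34]
- OpenAI hired away four senior engineers from Tesla, xAI, and Meta [36]
Tools and Resources
- Recall AI offers a platform for saving, organizing, and summarizing AI-related information that users find online, and provides a 30% discount code, MB30 [1][15][17]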