Matthew Berman
Kimi K2 is INSANE... (Open-Source is BACK!)
Matthew Berman· 2025-07-14 17:43
Model Overview - Kimi K2 is a state-of-the-art mixture-of-experts language model with 32 billion activated parameters and 1 trillion total parameters [3] - The model was pre-trained on 15.5 trillion tokens with zero training instability [4] - Kimi K2 supports up to 2 million tokens in the context window [5] Performance Benchmarks - Kimi K2 Instruct beats DeepSeek, Qwen, and GPT-4.1 on SWE-bench Verified, coming in right behind Claude 4 Opus [7] - On LiveCodeBench, Kimi K2 beats Claude 4 Opus [7] - Kimi K2 tops the list on AIME 2025 for math and on GPQA Diamond [8] Optimization and Training - The model is trained with the Muon optimizer [4] - Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks [4] - The training process was open source [8] Availability and Cost - Inference is available through Kimi directly at $0.15 per million input tokens with a cache, $0.60 without a cache, and $2.50 per million output tokens [10] - Kimi K2 is available on OpenRouter [13] Industry Reception - Industry experts compare Kimi K2 to DeepSeek V3 [11] - Kimi K2 is recognized as a potentially new leader in open LLMs [14]
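To make the pricing and parameter figures above concrete, here is a minimal Python sketch of the arithmetic. The prices and parameter counts are the ones quoted in the summary; the function name `estimate_cost_usd` and the split of a request into cached and uncached input tokens are assumptions made for illustration, not part of any official Kimi API.

```python
# Figures quoted in the summary above (US dollars per million tokens).
INPUT_CACHED_PER_M = 0.15    # input tokens served with a cache
INPUT_UNCACHED_PER_M = 0.60  # input tokens served without a cache
OUTPUT_PER_M = 2.50          # output tokens

# Parameter counts quoted in the summary above.
ACTIVATED_PARAMS = 32e9
TOTAL_PARAMS = 1e12

# Share of the model's parameters that are active for any single token.
ACTIVE_FRACTION = ACTIVATED_PARAMS / TOTAL_PARAMS


def estimate_cost_usd(cached_input_tokens: int,
                      uncached_input_tokens: int,
                      output_tokens: int) -> float:
    """Hypothetical helper: estimate the cost of one request in US dollars."""
    total = (cached_input_tokens * INPUT_CACHED_PER_M
             + uncached_input_tokens * INPUT_UNCACHED_PER_M
             + output_tokens * OUTPUT_PER_M)
    return total / 1_000_000


if __name__ == "__main__":
    cost = estimate_cost_usd(200_000, 50_000, 10_000)
    print(f"Estimated cost: ${cost:.4f}")
    print(f"Active parameter share: {ACTIVE_FRACTION:.1%}")
```

With these numbers, a request with 200,000 cached input tokens, 50,000 uncached input tokens, and 10,000 output tokens comes to roughly $0.085, and only about 3.2% of the model's parameters are active per token.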
The Industry Reacts to Grok 4!
Matthew Berman· 2025-07-13 00:06
Grok 4 has been out for less than 48 hours and the industry has been stunned. The overall sentiment is Grok 4 absolutely delivered. Let me show you all of the reactions. First, Flavio Adamo gave it the hexagon test. And not all frontier models passed this, but Grok 4 passed it with flying colors. So, you can see all of the balls and the physics look correct. They're bouncing around. They're bouncing off of each other. Everything looks flawless. Impressed. It's actually really good. And Tyler Storm put together a ...
Grok 4 Fully Tested (INSANE)
Matthew Berman· 2025-07-11 18:18
Grok 4 has been out for less than 24 hours and I have put it through its paces. I'm going to show you all the tests. Let's get right into it. So, we have two versions that we're going to be using today. We have Grok 4 and Grok 4 Heavy. I tried to use the appropriate model for each test. I use Grok 4 Heavy for the more logic- and reasoning-intensive tasks and the regular Grok 4 for others. Turns out some tests are more appropriate for one than the other. Let me show you the first one. Write Python code that impleme ...
Grok 4 is really smart... Like REALLY SMART
Matthew Berman· 2025-07-10 22:31
Model Performance & Benchmarks - Grok 4 demonstrates a significant leap in performance compared to previous models due to reinforcement learning with verifiable rewards [1][2][3][4] - On the "Humanity's Last Exam" benchmark, Grok 4 achieved 26.9% without tools, 41% with tool usage, and 50.7% with scaled test-time compute, surpassing other frontier models [9][10][11] - Grok 4 Heavy achieved a perfect 100% score on the AIME 2025 benchmark, which consists of some of the hardest math questions [29] - Grok 4 significantly outperformed other models on the ARC-AGI benchmark, achieving 66.6% on V1 and 15.9% on V2, indicating "nonzero levels of fluid intelligence" [33][34][35] - In a real-world vending machine management test ("Vending Bench"), Grok 4 achieved a net worth of $4,700, significantly higher than other models and humans [36] Model Architecture & Features - Grok 4 utilizes multiple agents that work together, share knowledge, and select the best solution, particularly in the "Heavy" version [12][13][20] - Grok 4 incorporates tool usage, including web browsing, sophisticated memory, and code execution environments [10] - Grok 4 has a 256k context window, multimodal reasoning capabilities, real-time data search, and enterprise-grade security [43] Real-World Applications & Demonstrations - Grok 4 was used to predict the winner of the World Series by browsing odds sites and calculating its own odds, giving the Dodgers a 21.6% chance of winning [22][23] - Grok 4 generated a visualization of two black holes colliding, demonstrating its ability to create content with some simplifications [24][25][26][27] - Grok 4 was used to create a timeline of announcements and score releases for "Humanity's Last Exam" [27] - Grok 4 was used to create a first-person shooter game in four hours, highlighting its ability to automate asset sourcing and accelerate game development [38][39][40] Future Developments & Availability - A coding-specific model is expected in August, a multimodal agent in September, and a video generation model in October [46] - SuperGrok is priced at $30 per month, while SuperGrok Heavy is priced at $300 per month or $3,000 per year [44]
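The summary above says only that Grok 4 Heavy runs multiple agents and selects the best solution; it does not describe how that selection works. As a rough illustration of the general idea, the Python sketch below uses simple majority voting over independent attempts. This is an assumed stand-in technique, not xAI's method, and `select_by_majority` and the three toy agents are hypothetical names.

```python
from collections import Counter
from typing import Callable, Sequence


def select_by_majority(agents: Sequence[Callable[[str], str]],
                       question: str) -> str:
    """Ask every agent the same question and return the most common answer.

    Majority voting is an assumption for illustration only; the source
    does not say how Grok 4 Heavy picks among its agents' solutions.
    """
    if not agents:
        raise ValueError("at least one agent is required")
    answers = [agent(question) for agent in agents]
    # most_common(1) returns [(answer, count)]; on a tie, the answer
    # that appeared first in the list wins.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Hypothetical stand-in agents that return fixed answers.
def agent_a(question: str) -> str:
    return "42"


def agent_b(question: str) -> str:
    return "41"


def agent_c(question: str) -> str:
    return "42"


if __name__ == "__main__":
    print(select_by_majority([agent_a, agent_b, agent_c], "What is 6 * 7?"))
```

Voting only works when answers can be compared for equality, as with a final number on a math benchmark; open-ended outputs would need a different selection step, such as a scoring model.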
Grok 4 is HERE! and it's the best? (Livestream Reaction)
Matthew Berman· 2025-07-10 08:51
The xAI team went live on X showing off Grok 4's new capabilities, and the results are mind-blowing to say the least!
AI News: Grok 4, Grok 3 Off the Rails, OpenAI Poaching, New Open Source Models, and more!
Matthew Berman· 2025-07-10 01:05
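AI Model Releases and Updates - The xAI team and Elon Musk are expected to release Grok 4, though the release time has not yet been confirmed [1] - Grok was taken offline after posting antisemitic tweets and remarks praising Hitler [3] - The Ernie 4.5 model family was released, including a 300-billion-parameter version that outperforms GPT-4.1 on math and reasoning [19][20] - Hugging Face released SmolLM3, a small 3-billion-parameter reasoning model with a 128k context window [23] - Chai Discovery released Chai 2, a molecular design model that exceeds the previous state of the art in antibody discovery by more than 100x [25] AI Applications and Development - AI has broad prospects in video games, and Runway is developing game-world generation features [7][8][10] - Cursor now runs on the web and on mobile, letting users code from anywhere [12] - AI researchers have been injecting prompts into their papers to steer AI toward giving positive reviews [27] Talent Acquisition and Investment - Meta invested $3.5 billion in EssilorLuxottica, the company that owns eyewear brands such as Ray-Ban [29] - Meta poached a key AI leader from Apple: Ruoming Pang is joining Meta's superintelligence lab [34] - OpenAI poached four senior engineers from Tesla, xAI, and Meta [36] Tools and Resources - Recall AI provides a platform that saves, organizes, and summarizes AI-related information users find online, and offers a 30% discount with code MB30 [1][15][17]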
Perplexity's AI-Native Browser Comet is HERE
Matthew Berman· 2025-07-09 16:21
Product Overview - Perplexity's Comet is an AI-first browser forked from Google Chrome, aiming to redefine web browsing by tasking an agent to browse on behalf of the user [1][3] - Comet integrates Perplexity AI directly into the browsing experience, offering an assistant accessible via a button that can interact with and provide information from any open tab [13][14] - The browser allows users to leverage AI agents to perform tasks such as creating grocery carts, finding information, and managing LinkedIn connections [6][21] Key Features and Functionality - Comet offers a faster browsing experience compared to Chrome, with instant setup and compatibility with existing Chrome settings, bookmarks, and extensions [3][4] - The browser allows local execution of AI tasks, providing access to already authenticated websites and contextual information, eliminating the friction of cloud-based browser agents [12] - Comet defaults to Perplexity search in the URL bar and new tabs, emphasizing its AI-first approach [13] - The browser supports automation of tasks like finding top-rated comments on YouTube videos and checking online stores for product availability, though some website restrictions may apply [25][27] Strategic Implications - Perplexity's development of its own browser mitigates platform risk associated with building on top of existing browsers like Google Chrome or Safari [9][10] - By building a local browser agent, Perplexity addresses the authentication challenges and lack of context associated with cloud-based browser agents [11][12] - The AI-driven browsing experience aims to improve efficiency and productivity by allowing users to delegate tasks to AI agents, potentially mitigating the issue of AI-generated content overload [39][41]
xAI SHIPPING Power Plant, Elon Musk confirms
Matthew Berman· 2025-07-04 00:42
Infrastructure Expansion - xAI has acquired approximately 200,000 GPUs [1] - xAI is constructing a new data center and has purchased a new factory in Memphis [1] - To address power constraints, xAI has purchased a power plant from overseas and is shipping it to the US [1] Supercomputer Development - The infrastructure investments are intended to power xAI's next generation of supercomputers [1]
Why GPT-4.5 Failed
Matthew Berman· 2025-07-03 16:04
Model Performance - GPT-4.5 is considered much smarter than previous versions, specifically GPT-4o and GPT-4.1 [1] - Despite its intelligence, GPT-4.5 is deemed not very useful because it is too slow and expensive [1] - Overparameterization caused GPT-4.5 to memorize data excessively during initial training, hindering generalization [2] Development Challenges - OpenAI encountered a bug within PyTorch during GPT-4.5's development, which they identified and fixed [2] - The bug fix on GitHub received positive reactions from approximately 20 OpenAI employees [3]
$100 Million for an Ai Engineer
Matthew Berman· 2025-07-02 16:08
Talent Acquisition & Compensation - Meta is offering $100 million bonuses to attract top talent, viewing superintelligence as a critical goal [1] - The pursuit of superintelligence justifies significant investment in acquiring talent, even at costs of hundreds of millions of dollars per researcher [2] - The discussion mentions a potential $1 billion compensation for an individual at OpenAI, highlighting the extreme value placed on AI expertise [4] - High compensation, even up to $1 billion, is considered a small investment relative to Meta's market capitalization and the potential of the AI market [4] Strategic Implications - Acquiring top AI teams is compared to acquiring companies like SSI, but at a potentially higher cost per employee [2] - The strategy of acquiring talent is seen as similar to acquiring entire companies focused on superintelligence [3][4] - Mark Zuckerberg believes Meta can build superintelligence and is willing to invest heavily to achieve this goal [1]