Llama 3.1 8B
X @Avi Chawla
Avi Chawla· 2026-02-25 06:30
8x faster LLM inference than Cerebras is here!! And it generates 17,000 tokens per second. Today, a key bottleneck in how LLM inference works is that when you run a model on any GPU, the model weights live in memory, and the compute cores have to constantly fetch those weights to do math. That back-and-forth between memory and compute is the single biggest reason inference is slow. It's also the reason we need expensive HBM stacks, liquid cooling, and high-speed interconnects, making AI data centers costly. Taa ...
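The tweet's bandwidth argument is easy to sanity-check with a back-of-the-envelope roofline: at batch size 1, generating each token requires streaming essentially all model weights from memory, so memory bandwidth caps tokens per second. A minimal sketch — the bandwidth and precision figures below are ballpark assumptions, not numbers from the thread:

```python
# Back-of-the-envelope roofline for memory-bound LLM decoding (batch size 1).
# Assumption: each generated token streams all weights from memory once,
# so tokens/s <= memory_bandwidth / bytes_of_weights. Figures are ballpark.

def decode_roofline(params_billion: float, bytes_per_param: float,
                    bandwidth_tb_s: float) -> float:
    """Upper bound on tokens/second for single-stream decoding."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# Llama 3.1 8B in fp16 (~16 GB of weights) on an H200-class GPU (~4.8 TB/s HBM).
print(f"{decode_roofline(8, 2, 4.8):.0f} tokens/s ceiling")  # ~300 tokens/s
```

That ~300 tokens/s ceiling is in the same range as the 230-353 tokens/s GPU figures cited in the Taalas coverage below, which is why an architecture that keeps weights on-die can credibly claim order-of-magnitude gains.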
How Good Is Taalas's "Unorthodox" AI Chip? | AGI Focus
TMTPost App· 2026-02-23 13:51
Core Viewpoint
- Taalas, a Canadian startup, has launched its HC1 chip, claiming it could disrupt Nvidia's dominance in the AI chip market with large performance and efficiency gains [2][5].

Group 1: Product Launch and Performance
- Taalas released its first product, the HC1 chip, optimized for the Llama 3.1 8B model, achieving an inference speed of 12,000 tokens per second and a roughly 50-fold efficiency improvement over traditional GPU solutions [2].
- The HC1's peak inference speed approaches 17,000 tokens per second, nearly 10 times faster than current leading technologies, with construction cost cut to 1/20 and power consumption to 1/10 of existing solutions [2].
- In tests, Taalas's HC1 outperformed Nvidia's H200 and B200 chips, which delivered 230 and 353 tokens per second respectively, by as much as 48x [3].

Group 2: Technology and Innovation
- Taalas employs a unique ASIC approach with a two-month chip customization cycle, in contrast to the traditional GPU route [2][5].
- The company aims to eliminate software dependencies by embedding models directly into silicon, which raises performance and cuts costs significantly [8][9].
- Taalas describes the approach as "total specialization": each model gets a dedicated chip, potentially yielding lower inference costs and higher speeds [8][9].

Group 3: Market Position and Future Prospects
- Taalas has raised a total of $219 million in funding, signaling strong investor confidence in its disruptive potential [2][8].
- The company plans to release a follow-up product for medium-scale inference models, which will be closely watched for its performance [14].
- Analysts suggest Taalas's chips may find significant applications in edge-computing scenarios such as robotics and autonomous vehicles, owing to their low latency and power consumption [20].

Group 4: Challenges and Industry Context
- Despite the excitement, Taalas faces challenges in scaling its technology to larger models; it currently targets the smaller 8B version of Llama 3.1 [13].
- Concerns have been raised about the practical utility of Taalas's chips, particularly whether hard-wired silicon can keep pace with rapidly evolving large models [17][18].
- The competitive landscape remains dominated by Nvidia, whose robust software ecosystem is precisely what Taalas aims to bypass [21].
Breaking the Post-Training Bottleneck? Another Effort from Meta Superintelligence Labs: CaT Tackles RL's Supervision Problem
机器之心 (Synced)· 2025-09-22 02:05
Report from the 机器之心 editorial team. In AI, post-training is the usual way to give a model specialized skills. But post-training typically relies on supervised fine-tuning with annotated references, or on rewards from verifiable programmatic checkers.

That raises a problem: many valuable tasks lack both resources. In non-verifiable settings (clinical work, free-form dialogue, creative writing), multiple valid answers can exist, and deterministic rule-based checks are hard to apply.

In such cases, practitioners often have to fall back on (i) laborious annotation pipelines, or (ii) coarse rewards from another LLM judging free-form outputs.

So when post-training lacks ground-truth annotations, where does the learning signal come from?

To answer this, researchers from the University of Oxford, Meta Superintelligence Labs, and other institutions ask: can inference-time compute substitute for the missing supervision?

The paper argues it can. They propose CaT (Compute as Teacher), whose core idea is to treat extra inference-time computation as a teacher signal, providing supervision for large models even when no human annotations or verifiable answers exist.

Results show that applying CaT directly at inference time significantly improves Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B, even in non-verifiable domains (up to a 27% gain on MATH-500; on HealthBench, gains of ...
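The article only sketches the mechanism, but the stated core idea — spending extra inference-time compute to stand in for missing labels — can be illustrated with a minimal, hypothetical pipeline: sample several rollouts, have the model synthesize a single reference answer from them, then score each rollout against that reference as a reward. The function names and `generate`/`score` helpers below are illustrative assumptions, not the paper's actual interface:

```python
# Minimal illustration of the "compute as teacher" idea described above:
# extra inference-time compute (parallel rollouts + synthesis) replaces
# missing ground-truth labels. All names here are hypothetical, not the
# paper's actual API.
from typing import Callable

def cat_supervision(
    generate: Callable[[str], str],      # one sampled completion per call
    score: Callable[[str, str], float],  # agreement of rollout vs reference
    prompt: str,
    n_rollouts: int = 8,
) -> list[tuple[str, float]]:
    # 1) Spend extra compute: sample diverse candidate answers.
    rollouts = [generate(prompt) for _ in range(n_rollouts)]

    # 2) Synthesize a reference by asking the model to reconcile its own
    #    rollouts into one best answer (the "teacher" signal).
    synthesis_prompt = (
        f"Question:\n{prompt}\n\nCandidate answers:\n"
        + "\n---\n".join(rollouts)
        + "\n\nCombine these into the single best answer."
    )
    reference = generate(synthesis_prompt)

    # 3) Reward each rollout by agreement with the synthesized reference;
    #    the (rollout, reward) pairs can then drive RL or fine-tuning.
    return [(r, score(r, reference)) for r in rollouts]
```

The article does not detail how scoring works in non-verifiable domains; a rubric or judge model could fill that role, but the structure above is enough to show why extra rollouts can substitute for labels.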
Nature Publishes Claim That "AI Can Simulate the Human Mind"; Science Voices Strong Doubts the Same Day
Huxiu· 2025-07-21 00:43
Group 1
- A multinational team published a groundbreaking study in Nature, claiming their AI system can "simulate human cognition" and generate realistic human behaviors [1][7]
- The AI model, named "Centaur," is said to have achieved significant accuracy in predicting human behavior across large-scale cognitive tasks [9][18]
- The foundation of Centaur is a massive database called "Psych-101," which includes data from over 60,000 participants and more than 10 million choices [10][12]

Group 2
- Centaur's architecture is based on Meta's open-source model Llama 3.1, fine-tuned using Quantized Low-Rank Adaptation (QLoRA) [16]
- The model demonstrated strong generalization capabilities, accurately predicting the behavior of new participants across varied tasks [19][20]
- Centaur's internal operations showed remarkable resonance with human brain activity patterns, suggesting a potential alignment in information processing [29][33]

Group 3
- Despite the promising results, the scientific community expressed skepticism regarding the claim that Centaur truly mimics human cognition [46][47]
- Critics argue that while Centaur can predict human behavior, it does not replicate the underlying cognitive processes, highlighting a significant conceptual gap [51][52]
- The Psych-101 dataset, although impressive in scale, is still considered insufficient to encompass the vast complexity of human cognition [58]
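The summary mentions QLoRA fine-tuning of Llama 3.1 but not what that looks like in practice. Here is a generic sketch using Hugging Face transformers, peft, and bitsandbytes — the model ID, adapter rank, and target modules are common defaults assumed for illustration, not Centaur's actual training configuration:

```python
# A generic QLoRA setup: load the base model 4-bit quantized, then train
# small low-rank adapters on top. Hyperparameters are common defaults,
# not the Centaur paper's actual configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4, QLoRA's data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in bf16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # assumed model ID; gated, requires access
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Because only the adapters are trained while the 4-bit base stays frozen, an 8B model becomes fine-tunable on a single consumer GPU — which is what makes this approach practical for a behavioral-science dataset like Psych-101.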
News Flash | A $215M Bet on AI Slimming: Multiverse Compresses LLM Size by 95%, Letting Llama Fly on a Raspberry Pi
Z Potentials· 2025-06-13 03:17
Core Viewpoint
- Multiverse Computing has successfully raised €189 million (approximately $215 million) in a Series B funding round, leveraging its "CompactifAI" technology to compress large language models (LLMs) significantly while maintaining performance [1][2].

Group 1: Funding and Investment
- The Series B round was led by Bullhound Capital, with participation from investors including HP Tech Ventures, SETT, Forgepoint Capital International, CDP Venture Capital, Santander Climate VC, Toshiba, and the Basque venture capital group [1].
- To date, the company has raised approximately $250 million in total funding [2].

Group 2: Technology and Product Offering
- CompactifAI is a compression technology inspired by quantum computing, capable of reducing the size of LLMs by up to 95% without compromising model performance [2].
- Multiverse offers compressed versions of well-known open-source LLMs, including Llama 4 Scout, Llama 3.3 70B, Llama 3.1 8B, and Mistral Small 3.1, with plans to release more models soon [2].
- The company claims its models operate 4 to 12 times faster than uncompressed versions, with inference costs reduced by 50% to 80% [3].

Group 3: Market Applications and Accessibility
- Some of Multiverse's models are compact and energy-efficient enough to run on personal computers, smartphones, cars, drones, and even Raspberry Pi devices [3].
- The Llama 4 Scout Slim version costs $0.10 per million tokens on AWS, compared to $0.14 for the original, a significant saving at scale [3].

Group 4: Leadership and Expertise
- Co-founder and CTO Román Orús is known for his pioneering research in tensor networks, tools for simulating quantum computers on conventional machines [4].
- Co-founder and CEO Enrique Lizaso Olmos has a background in mathematics and extensive banking experience, having previously served as Deputy CEO of Unnim Banc [4].
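CompactifAI's internals are proprietary; the article says only "quantum-inspired tensor networks." The simplest member of that family is truncated low-rank factorization of a weight matrix, sketched below purely to show how factorizing weights trades some reconstruction accuracy for a large parameter reduction — this is not Multiverse's actual algorithm:

```python
# Illustrative only: truncated SVD is the simplest tensor-network-style
# factorization. CompactifAI's real method is proprietary; this just shows
# how factorizing a weight matrix shrinks its parameter count.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024))   # stand-in weight matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank = 64                           # keep only the top-64 singular modes
W_low = (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

orig_params = W.size
low_params = U[:, :rank].size + rank + Vt[:rank, :].size
rel_err = np.linalg.norm(W - W_low) / np.linalg.norm(W)

# Note: a random Gaussian matrix has a flat spectrum, so the error here is
# large; real trained weights compress far better because their singular
# values decay quickly.
print(f"params: {orig_params:,} -> {low_params:,} "
      f"({1 - low_params / orig_params:.0%} smaller), "
      f"relative error {rel_err:.2f}")
```

Hitting a 95% reduction while preserving quality clearly requires far more structure-aware machinery than one global SVD, but the parameter-count arithmetic that makes a compressed Llama fit on a Raspberry Pi works the same way.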