Transformer
The AI power handover at Big Tech: the post-90s generation collectively takes charge
36Kr· 2026-02-02 00:22
In the months spanning late 2025 and early 2026, something intriguing happened in the tech world. There were no grand launch events and no official announcements, but inside Tencent's tower in Shenzhen, Alibaba's Xixi campus in Hangzhou, and ByteDance's offices in Beijing, the people directing the large-model battle quietly changed to a set of much younger faces. Start with Tencent: although it has been seen as lagging in large models over the past year or two, it has hardly been idle. First, former OpenAI researcher Yao Shunyu was rumored to be joining Tencent on a 100-million-yuan annual salary; after several denials, he formally joined at the end of last year as Chief AI Scientist, reporting directly to Tencent president Martin Lau. Just last week, Pang Tianyu, a computer science PhD from Tsinghua University and former senior research scientist at Singapore's Sea AI Lab, also joined Tencent to lead multimodal reinforcement learning. In an old-line empire like Tencent, where factions and seniority matter, these two have essentially risen on a Falcon 9 rocket. Then look at Alibaba. Lin Junyang joined Alibaba's AI research arm, DAMO Academy, straight out of his master's program, becoming an algorithm expert in its Intelligent Computing Lab focused on large-model research. Today he is Alibaba's youngest P10 and the core driving force behind the open-source Tongyi Qianwen (Qwen) models. If you line up the key figures at Tencent, Alibaba, and the large-model unicorns, including Kimi's Yang Zhilin and Manus founder Xiao Hong, whose company Meta just spent billions of dollars to acquire, a striking pattern emerges: the people steering AI are all born in the 1990s. This cohort has precisely ...
Did Silicon Valley's "too much money" ruin AI?! Former OpenAI o1 lead fires back: stop hyping Google; Q-Star was spun into a soap opera, and seven years of high pressure nearly "drove him mad"!
Sina Finance· 2026-01-25 01:24
Source: AI Frontline (AI前线); compiled by Tina. This is not departure gossip; it is the choice of someone who endured seven years of intense pressure in an industry that turns technology into drama and research into a spectacle. In the first month of 2026, when word got out that Jerry Tworek was leaving OpenAI, several OpenAI employees posted on X with barely contained emotion: "I'm genuinely devastated," "This hurts so much." The reactions suggested the news felt both too sudden and too heavy. Jerry is one of the most influential yet least publicly visible figures behind the modern AI wave. When he joined OpenAI in 2019, the company had only about 30 employees. He worked on many of its most important projects, including the reasoning methods later known as Q-Star and Strawberry, which eventually grew into the o1 reasoning model. After leaving, he explained his reasons in an interview on the Core Memory podcast: he wants to pursue risky fundamental research, which is no longer possible at a company like OpenAI, where metrics such as user growth take priority. His view of ChatGPT advertising captures the disconnect between research and commercialization: "That is a business strategy, whereas I was responsible for training models." The remarks lend weight to rumors of a widening split between AI research and product development at OpenAI. In Tworek's view ...
A new breakthrough for non-Transformer architectures: a small liquid-neural-network reasoning model runs on just 900 MB of memory
Jiqizhixin (Synced)· 2026-01-21 09:35
Core Insights
- The article discusses the dominance of the Transformer architecture in large models and introduces Liquid AI's new model, LFM2.5-1.2B-Thinking, which operates efficiently on edge devices [1][2].

Group 1: Model Overview
- Liquid AI has released LFM2.5-1.2B-Thinking, a reasoning model that can run entirely on edge devices with only 900 MB of memory [2][3].
- The model generates internal reasoning trajectories before arriving at final answers, demonstrating superior performance in tool usage, mathematical reasoning, and instruction following [3][14].

Group 2: Performance Metrics
- Compared to its predecessor LFM2.5-1.2B-Instruct, LFM2.5-1.2B-Thinking shows significant improvements in three key areas: mathematical reasoning (from 63 to 88 on MATH-500), instruction following (from 61 to 69 on Multi-IF), and tool usage (from 49 to 57 on BFCLv3) [7][9].
- Across various reasoning benchmarks, LFM2.5-1.2B-Thinking matches or exceeds Qwen3-1.7B despite having approximately 40% fewer parameters [7][10].

Group 3: Training and Development
- Training involved multi-step reasoning to enhance capabilities while keeping answers concise for low-latency deployment [16].
- Liquid AI implemented strategies to reduce "doom looping" in the model's responses, cutting its occurrence from 15.74% to 0.36% in the final training phase [17][18].

Group 4: Ecosystem and Compatibility
- Liquid AI is expanding the ecosystem around the LFM series, ensuring compatibility with popular inference frameworks and supporting various hardware accelerations [24].
- The model has been tested across different devices, demonstrating efficient long-context reasoning [26].

Group 5: Future Implications
- LFM2.5-1.2B-Thinking signals a shift away from exclusive reliance on Transformer models, suggesting that small but capable edge reasoning models may offer superior solutions [27].
- The falling barriers to running reasoning models on a wide range of devices are seen as a positive development for AI's potential [28].
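To make the on-device claim concrete, here is a minimal sketch of loading and querying a small reasoning model with the Hugging Face `transformers` API. The repository id `LiquidAI/LFM2.5-1.2B-Thinking` is an assumption (the summary does not give one), and the memory figures in the comments are rough arithmetic, not Liquid AI's numbers.

```python
# Minimal sketch: run a small edge reasoning model with Hugging Face transformers.
# The model id below is an assumption; the article does not state the repo name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "LiquidAI/LFM2.5-1.2B-Thinking"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # bf16 weights alone are ~2.4 GB for 1.2B params;
                                 # a quantized build would be needed to reach ~900 MB
)

messages = [{"role": "user", "content": "What is 37 * 43? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=512, do_sample=False)

# Reasoning-tuned models typically emit an internal trace before the final answer.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```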
Google just upended model memory, and now NVIDIA is coming for attention
36Kr· 2026-01-20 01:12
Core Insights
- Google's Nested Learning has sparked a significant shift in the understanding of model memory, allowing models to change parameters during inference rather than being static after training [1][5].
- NVIDIA's research introduces a more radical approach with the paper "End-to-End Test-Time Training for Long Context," suggesting that memory is essentially learning, and "remembering" equates to "continuing to train" [1][10].

Group 1: Nested Learning and Test-Time Training (TTT)
- Nested Learning allows models to incorporate new information into their internal memory during inference, rather than just storing it temporarily [1][5].
- TTT, which has roots dating back to 2013, enables models to adapt their parameters during inference, enhancing their performance based on the current context [5][9].
- TTT-E2E proposes a method that eliminates the need for traditional attention mechanisms, allowing for constant latency regardless of context length [7][9].

Group 2: Memory Redefined
- Memory is redefined as a continuous learning process rather than a static storage structure, emphasizing how past information influences future predictions [10][34].
- The TTT-E2E method aligns the model's learning objective directly with its ultimate goal of next-token prediction, enhancing its ability to learn from context [10][16].

Group 3: Engineering Stability and Efficiency
- The implementation of TTT-E2E incorporates meta-learning to stabilize the model's learning process during inference, addressing issues of catastrophic forgetting and parameter drift [20][22].
- Safety measures, such as mini-batch processing and sliding-window attention, ensure the model retains short-term memory while updating parameters [24][25].

Group 4: Performance Metrics
- TTT-E2E demonstrates superior loss reduction across varying context lengths, maintaining efficiency even as context grows [27][29].
- Learning continuously from context without relying on traditional attention mechanisms results in significant improvements in prediction accuracy [31][34].

Group 5: Future Implications
- The advances in TTT-E2E suggest a shift toward a more sustainable approach to continual learning, potentially becoming a leading industry solution for long-context scenarios [34][35].
- The approach aligns with the growing demand for models that can learn and adapt without the high computational costs associated with traditional attention mechanisms [33][34].
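To ground the "remembering equals continuing to train" idea, here is a toy test-time-training loop in PyTorch: before answering, the model takes a few gradient steps of next-token prediction on the context itself, so the context is absorbed into its weights rather than held in an attention cache. This is an illustrative sketch of the general TTT recipe, not NVIDIA's TTT-E2E architecture; the tiny model, learning rate, and step count are made-up placeholders.

```python
# Toy sketch of test-time training (TTT): "remember" a context by taking a few gradient
# steps of next-token prediction on it before answering. Illustrative only; not TTT-E2E.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 256, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)

def absorb_context(base_model, context_ids, steps=4, lr=1e-3):
    """Clone the model and fine-tune the clone on the context (next-token loss)."""
    model = copy.deepcopy(base_model)          # keep the base weights untouched
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = context_ids[:, :-1], context_ids[:, 1:]
    for _ in range(steps):                     # a few inner steps = "writing to memory"
        loss = F.cross_entropy(model(x).reshape(-1, VOCAB), y.reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return model

base = TinyLM()
context = torch.randint(0, VOCAB, (1, 512))    # stand-in for a long document
adapted = absorb_context(base, context)

# The adapted model now carries the context in its weights; per-query cost no longer
# grows with how much context was absorbed.
query = torch.randint(0, VOCAB, (1, 16))
next_token_logits = adapted(query)[:, -1]
print(next_token_logits.shape)  # torch.Size([1, 256])
```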
NVIDIA's DLSS 4.5 is here: the Transformer evolves again to eliminate ghosting, and multi-frame generation reaches up to 6x with dynamic adjustment
QbitAI· 2026-01-16 07:21
Core Viewpoint
- NVIDIA has introduced DLSS 4.5 at CES 2026, enhancing gaming experiences by addressing key player concerns regarding image quality and frame rates through a "dual-core" strategy [1][3].

Group 1: Image Quality Enhancement
- The first core focuses on image quality, utilizing an upgraded super-resolution technology based on the second-generation Transformer model [4][11].
- The new model boasts five times the computational power of the first generation and is trained on a significantly expanded high-fidelity dataset [12].
- The upgraded model processes directly in the game's native linear space, improving clarity and reducing artifacts like ghosting and flickering, especially in high-contrast scenes [17][19].
- Users of all GeForce RTX graphics cards can access the super-resolution feature through an NVIDIA App update, gaining improved stability and clarity [21].

Group 2: Performance Improvement
- The second core is dedicated to performance, designed specifically for the RTX 50 series, featuring dynamic multi-frame generation [6][23].
- DLSS 4.5 introduces a new six-fold multi-frame generation mode, allowing up to five additional frames to be generated for each traditionally rendered frame, significantly enhancing game smoothness [25].
- For instance, "Black Myth: Wukong" can now run at 240 fps, compared to its previous frame rate of under 190 fps [27].
- Dynamic multi-frame generation adapts to GPU performance and monitor refresh rates, optimizing frame rates while maintaining quality and responsiveness [30][33].

Group 3: Display Technology Advancement
- NVIDIA has also unveiled G-SYNC Pulsar, a significant evolution of G-SYNC technology aimed at reducing motion blur in high-speed visuals [34].
- Demonstrations show that this technology can raise the motion clarity of a 360 Hz monitor to the equivalent of 1000 Hz [35].
- Initial support for G-SYNC Pulsar has been rolled out by manufacturers such as ASUS, AOC, and MSI [36].
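The arithmetic behind the dynamic multi-frame mode can be sketched simply: pick a whole-number multiplier so that displayed frame rate approaches the monitor's refresh rate, capped at the 6x mode (one rendered plus up to five generated frames). NVIDIA's actual heuristic is not public; the function names and selection rule below are illustrative assumptions that only show the bookkeeping.

```python
# Illustrative arithmetic for dynamic multi-frame generation (not NVIDIA's heuristic):
# choose how many AI-generated frames to insert per rendered frame so displayed fps
# approaches the monitor refresh rate, capped at the 6x mode.
def multiframe_factor(rendered_fps: float, refresh_hz: float, max_factor: int = 6) -> int:
    # Largest whole multiplier that still fits under the refresh rate.
    factor = int(refresh_hz // rendered_fps)
    return max(1, min(factor, max_factor))

def displayed_fps(rendered_fps: float, refresh_hz: float) -> float:
    return rendered_fps * multiframe_factor(rendered_fps, refresh_hz)

# A 40 fps render on a 240 Hz display -> 6x mode -> 240 fps shown.
print(multiframe_factor(40, 240), displayed_fps(40, 240))   # 6 240.0
# A 90 fps render on a 240 Hz display -> 2x mode -> 180 fps shown.
print(multiframe_factor(90, 240), displayed_fps(90, 240))   # 2 180.0
```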
China just 'months' behind U.S. AI models, Google DeepMind CEO says
CNBC· 2026-01-15 23:30
Core Insights
- China's artificial intelligence (AI) models are reportedly only "a matter of months" behind U.S. and Western capabilities, according to Demis Hassabis, CEO of Google DeepMind, challenging previous assumptions of a significant gap [3][4].
- Chinese AI lab DeepSeek has demonstrated strong performance with models built on less advanced chips, indicating that Chinese companies are making notable advancements in AI technology [5].
- Despite this progress, there are concerns regarding China's ability to innovate beyond existing technologies, with Hassabis emphasizing the difficulty of achieving frontier breakthroughs [6][8].

AI Development in China
- Chinese tech giants like Alibaba and startups such as Moonshot AI and Zhipu have released competitive AI models, contributing to the perception of China's rapid advancement in the field [5].
- Nvidia CEO Jensen Huang acknowledged that while the U.S. leads in chip technology, China is making significant strides in AI models and infrastructure [9].

Challenges Facing Chinese AI Firms
- Access to critical technology, particularly advanced semiconductors from Nvidia, poses a significant challenge for Chinese technology firms and could widen the gap between U.S. and Chinese AI capabilities over time [10][11].
- Analysts predict that the lack of access to cutting-edge Nvidia chips may lead to a divergence in AI model capabilities, with U.S. infrastructure continuing to iterate and improve [12].

Perspectives on Innovation
- Alibaba's Qwen team technical lead, Lin Junyang, expressed skepticism about Chinese firms surpassing U.S. tech giants in AI within the next three to five years, citing a substantial difference in computing infrastructure [15].
- Hassabis attributes the lack of groundbreaking innovations in China to a "mentality" issue rather than solely technological restrictions, comparing the need for exploratory innovation to the historical achievements of Bell Labs [16][17].
Ambarella (NasdaqGS:AMBA) FY Conference Transcript
2026-01-13 21:47
Summary of Ambarella's Conference Call

Company Overview
- **Company**: Ambarella
- **Industry**: Semiconductor, specifically focusing on edge AI applications
- **Core Products**: AI semiconductors used in video security, autonomous driving, telematics, and other robotic applications
- **Revenue Source**: Approximately 80% of revenue comes from edge AI applications [2][5]

Transformation and Product Development
- Ambarella has transformed from a video processor company for consumer applications into an AI SoC provider for intelligent edge applications over the past decade [5][6]
- The company has developed three generations of AI accelerators, with the second generation (CV2 family) representing 80% of total revenue [7][10]
- The third-generation architecture incorporates transformer-based models, which are expected to open larger market opportunities than CNN-based models [8][10]

Market Opportunities and Growth
- The company anticipates significant growth in transformer-based revenue, which is expected to coexist with CNN-based revenue [11][12]
- The average selling price (ASP) for the CV2 family ranges from $15 to $75, while the third generation (CV3, CV7, N1) carries an ASP of $20 to $400, indicating potential for significant revenue growth [13][14]
- New applications for transformer technology include autonomous driving and advanced robotics, with examples drawn from CES demonstrations [17][19]

Business Segments and Performance
- Ambarella's enterprise security camera market remains strong, while the telematics and portable video markets have shown unexpected growth [34][35]
- The company expects continued growth in enterprise security and telematics, with ASP and unit growth driving performance [36]
- The IoT business is diversifying, with security now accounting for less than 50% of IoT revenue, down from previous years [50][52]

Edge Infrastructure and AI Applications
- The N1 AI box is designed to aggregate edge endpoints, enhancing existing security cameras with Gen AI capabilities [55][59]
- The edge infrastructure business is expected to carry higher ASPs but gross margins similar to the overall corporate average of 59%-62% [59][60]

Automotive Market Insights
- The automotive market is currently facing delays in Level 2+ design wins, but Ambarella continues to focus on securing partnerships with OEMs [62][63]
- The company is leveraging its investments in autonomous driving technology for broader robotic applications, including drones [63][64]

Software and Licensing Opportunities
- Ambarella has developed two large models for end-to-end AI applications and is open to licensing these models to OEMs [65][66]
- The company is focused on securing design wins for both hardware and software revenue, with licensing as an additional revenue stream [66]

Future Outlook
- Ambarella is optimistic about growth potential in both existing and new markets, with plans to provide official guidance for fiscal 2027 in February [36][37]
- The company is exploring custom ASIC projects with large customers, which could enhance revenue and market presence [41][42]

Key Takeaways from CES
- New product announcements, including the CV7 chip, which offers improved AI performance and lower power consumption [37][38]
- Introduction of a new go-to-market strategy to engage partners in addressing segmented markets [38][39]
- Engagement in custom chip design with large customers, focusing on leveraging Ambarella's IP [41][42]

This summary encapsulates the key points discussed during the conference call, highlighting Ambarella's strategic direction, market opportunities, and future growth potential.
Throw RoPE away and AI reads long context better: a Transformer co-author's team open-sources a new large-model pre-training method
36Kr· 2026-01-13 11:01
Core Insights
- A new technique called DroPE has been developed by a research team led by Llion Jones, one of the core authors of the Transformer architecture, to address the challenges of long-text processing in large models [1][14].
- DroPE allows for seamless zero-shot context expansion without the need for expensive long-context training, requiring less than 1% of the pre-training budget for model recalibration [1][10].

Technology Overview
- DroPE can be seen as a method that discards positional embeddings to extend context, humorously dubbed "NoRoPE" by netizens [3].
- The technique uses RoPE (rotary positional encoding) as a temporary training aid during the pre-training phase to ensure stability and efficiency, then discards positional embeddings during the inference phase [8][5].

Performance Metrics
- Experiments on various models, including a 5M-parameter model, the SmolLM family (360M/1.7B), and the 7B-parameter Llama2-7B, showed that DroPE improved the base SmolLM's average score on the LongBench benchmark by more than 10x [10].
- In the NIAH task evaluation, the DroPE model's recall rate reached 74.92%, significantly surpassing traditional RoPE scaling methods [10].

Comparative Analysis
- Performance comparisons across methods indicate that DroPE outperforms other techniques on various tasks, achieving an average score of 30.52 on the LongBench benchmark [11].
- Even with only 0.5% of the pre-training budget spent on recalibration, DroPE demonstrated exceptional performance in long-context question answering and summarization tasks [11].

Company Background
- The team behind DroPE is from Sakana AI, co-founded by Llion Jones and former Google senior scientist David Ha, and has gained attention for creating the first AI scientist capable of producing complete academic papers [14][16].
- Sakana AI has also collaborated with MIT researchers on the Digital Red Queen algorithm, showcasing the potential of large language models in adversarial program evolution [18].
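The core mechanism described here, positions injected via RoPE during pre-training but dropped at inference, can be illustrated with a self-attention layer that takes a `use_rope` flag. This is a generic sketch of the idea as summarized above, not Sakana AI's released implementation; the layer, the RoPE variant, and all dimensions are illustrative.

```python
# Sketch of the DroPE idea as described: apply rotary position embeddings (RoPE) during
# pre-training, but run attention with no positional signal at inference ("NoPE" mode).
# Generic illustration, not Sakana AI's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    """Rotate pairs of channels by position-dependent angles. x: (B, H, T, D), D even."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class DropableRopeAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, use_rope=True):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.heads, self.head_dim)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        if use_rope:                       # pre-training: positions injected via RoPE
            q, k = apply_rope(q), apply_rope(k)
        # inference with use_rope=False: no explicit positional signal at all
        att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(att.transpose(1, 2).reshape(b, t, d))

layer = DropableRopeAttention()
x = torch.randn(2, 128, 64)
print(layer(x, use_rope=True).shape, layer(x, use_rope=False).shape)
```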
Throw RoPE away and AI reads long context better! A Transformer co-author's team open-sources a new large-model pre-training method
QbitAI· 2026-01-13 09:50
Core Insights
- The article discusses a new technique called DroPE, developed by a research team led by Llion Jones, one of the core authors of the Transformer architecture, to address the challenges of long-text processing in large models [1][24].
- DroPE allows for seamless zero-shot context expansion without the need for expensive long-context training, requiring less than 1% of the pre-training budget for model recalibration [2].

Group 1: Technology Overview
- DroPE can be seen as a method to discard positional embeddings to extend context [5].
- The technique uses RoPE (rotary positional encoding) as a temporary training aid during the pre-training phase to ensure training stability and efficiency [12][13].
- During the inference phase, DroPE discards positional embeddings and performs brief recalibration at the original context length, unlocking the model's long-context extrapolation capabilities [15][16].

Group 2: Performance Metrics
- Experiments conducted on various models, including a 5M-parameter model, the SmolLM family (360M/1.7B), and the 7B-parameter Llama2-7B, showed significant improvements [17].
- On the LongBench benchmark, DroPE improved the base SmolLM's average score by more than 10x [18].
- In the NIAH task evaluation, the DroPE model's recall rate reached 74.92%, significantly surpassing traditional RoPE scaling methods [19].

Group 3: Comparative Analysis
- A comparison table shows that DroPE outperforms other methods on various tasks, achieving an average score of 30.52 on the LongBench benchmark [20].
- Even on the large-scale Llama2-7B model, DroPE demonstrated exceptional performance in long-context question answering and summarization tasks using only 0.5% of the pre-training budget for recalibration [20].

Group 4: Company Background
- The team behind DroPE, Sakana AI, was co-founded by Llion Jones and former Google senior scientist David Ha [24].
- Sakana AI has gained attention for creating the first AI scientist capable of generating complete academic papers, which has positioned the company prominently in the AI landscape [26].
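As a companion to the attention sketch above, here is what the brief recalibration phase could look like in code: continue next-token training with positional embeddings switched off, stopping once a small fraction of the original pre-training token budget has been consumed. The stub model, the 0.5% default (mirroring the figure quoted in the summary), and all hyperparameters are illustrative assumptions, not Sakana AI's released recipe.

```python
# Sketch of the recalibration step described above: after pre-training with RoPE, briefly
# continue next-token training with positions dropped, spending well under 1% of the
# pre-training token budget. Stub model and numbers are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNoPELM(nn.Module):
    """Stand-in LM whose forward accepts use_rope, mirroring the attention sketch above."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids, use_rope=True):
        return self.head(self.embed(ids))   # positions ignored entirely in this stub

def recalibrate(model, batches, pretrain_tokens, budget_frac=0.005, lr=1e-5):
    """Continue training with positional embeddings disabled, on a tiny token budget."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    token_budget = int(pretrain_tokens * budget_frac)   # e.g. 0.5% of pre-training tokens
    seen = 0
    for batch in batches:                                # batch: (B, ctx_len) token ids
        logits = model(batch[:, :-1], use_rope=False)    # positions dropped here
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               batch[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
        seen += batch.numel()
        if seen >= token_budget:
            break
    return model

model = TinyNoPELM()
batches = [torch.randint(0, 256, (4, 128)) for _ in range(100)]
recalibrate(model, batches, pretrain_tokens=1_000_000)
```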
Yang Zhilin reveals Kimi's pre-training strategy: improving token efficiency and enabling long context
Sina Finance· 2026-01-10 12:09
Core Insights
- The article focuses on strategies for pre-training AI models, emphasizing token efficiency and long context as critical to performance on complex tasks [2][6].

Group 1: Token Efficiency
- Token efficiency is crucial because the reasoning or training of agents is fundamentally a search process; better pre-training shrinks the search space and strengthens the prior [3][7].
- Its importance is illustrated by the need for AI to build complex systems, such as an operating system, without enumerating every possible token combination, most of which would be meaningless or incorrect [7].

Group 2: Long Context
- The Transformer architecture shows significant advantages in long-context scenarios: experiments indicate that LSTM performance falls below the Transformer's once context length exceeds roughly 1,000 tokens, underscoring the importance of context length in model design [2][6].
- In the current agentic era, many tasks require long contexts to execute complex instructions, making architectures that degrade less across long positions more technically capable [2][6].

Group 3: Aesthetic Considerations in AI
- Developing AI models is not just a technical challenge but also an aesthetic one: creating a model reflects a worldview and values, akin to the notion of "Taste" articulated by influential figures like Steve Jobs [3][7].
- Each model generates unique tokens that are not interchangeable, meaning the intelligence produced by different roles (e.g., a CEO versus a designer) differs significantly, so the space of possible "Tastes" grows exponentially [4][8].