量子位
What Did Fei-Fei Li Actually Say a Year Ago? Why Is It Going Viral Again?
量子位· 2025-09-11 01:58
Core Viewpoint
- The limitations of large language models (LLMs) in understanding the physical world are highlighted, emphasizing that language is a generated signal dependent on human input, while the physical world is an objective reality governed by its own laws [1][5][19].

Group 1: Language Models and Their Limitations
- Language models operate on a one-dimensional representation of discrete tokens, making them adept at handling written text but inadequate for representing the three-dimensional nature of the physical world [12][14].
- The challenge of spatial intelligence lies in extracting, representing, and generating information from the real world, which is fundamentally different from language processing [17][19].
- Experiments show that LLMs struggle with physical tasks, performing poorly compared to human children and specialized robots [22][28].

Group 2: Experimental Findings
- In a test using the Animal-AI environment, LLMs could only complete simple tasks, failing at more complex ones even with additional teaching examples [26][27].
- A tool named ABench-Physics was developed to assess LLMs' physical reasoning abilities, revealing that even the best models achieved only 43% accuracy on basic physics problems [30][34].
- Visual tasks further demonstrated the models' limitations, with human accuracy at 95.7% versus a maximum of 51% for the models [37][41].

Group 3: Philosophical and Future Considerations
- The discussion includes whether language can sometimes describe reality better than perception can, and the potential for AI to develop its own language for understanding the physical world [46][47].
- The ongoing development of models based on physical and multimodal understanding indicates a shift toward addressing these limitations [44].
Valued at ¥84 Billion, They Just Released Their First AI Result
量子位· 2025-09-11 01:58
Core Insights
- Thinking Machines, valued at $12 billion, has released its first research blog post, focused on overcoming nondeterminism in large language model (LLM) inference [1][51].
- The research emphasizes the challenge of reproducibility in LLM outputs, attributing it to batch non-invariance [3][12].

Group 1: Research Focus
- The main theme of the post is "Defeating Nondeterminism in LLM Inference," which addresses why LLM inference results are often non-reproducible [3][8].
- The root cause identified is batch non-invariance: the output of a single request is influenced by the number of requests in the same batch [14][15].

Group 2: Technical Findings
- The common explanation that floating-point non-associativity plus concurrent execution produces differing results in LLM inference is incomplete [9][10].
- The study argues that the lack of batch invariance is the primary issue, as dynamic adjustments to batch sizes during deployment change the computation order of key operations [15][16].

Group 3: Proposed Solutions
- To achieve batch invariance, the research fixes the reduction order in operations such as RMSNorm and matrix multiplication, regardless of batch size [18][19].
- The proposed method compiles a unified kernel configuration for all input shapes to avoid switching parallel strategies as batch sizes change, even at a performance cost of roughly 20% [22][21].

Group 4: Experimental Validation
- Three kinds of experiments validated the findings: inference determinism verification, performance verification, and a true on-policy reinforcement learning application [25].
- With batch-invariant kernels, 1,000 runs produced 1,000 identical outputs, achieving deterministic inference, while non-invariant kernels produced 80 distinct results [27][28].
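The floating-point mechanism behind batch non-invariance can be shown in a few lines. The sketch below is illustrative, not Thinking Machines' actual kernels: float addition is non-associative, so a reduction whose grouping depends on batch size can drift between runs, while an accumulation order fixed independently of batching is deterministic.

```python
# Illustrative sketch (not the actual kernels): float addition is
# non-associative, so reduction grouping that depends on batch/chunk
# size can change results; a fixed accumulation order cannot.
import random

def sum_in_chunks(values, chunk_size):
    """Simulate a batch-dependent reduction: partial sums per chunk,
    then a combine step. Different chunk sizes imply different
    groupings, which may round differently."""
    partials = []
    for i in range(0, len(values), chunk_size):
        s = 0.0
        for v in values[i:i + chunk_size]:
            s += v
        partials.append(s)
    total = 0.0
    for p in partials:
        total += p
    return total

def sum_fixed_order(values):
    """Batch-invariant reduction: always accumulate left-to-right,
    one element at a time, regardless of how requests were batched."""
    s = 0.0
    for v in values:
        s += v
    return s

random.seed(0)
xs = [random.uniform(-1e8, 1e8) for _ in range(10_000)]

a = sum_in_chunks(xs, 32)    # stand-in for one batch size
b = sum_in_chunks(xs, 1000)  # stand-in for another; may differ in last bits
c = sum_fixed_order(xs)      # identical on every call, any batching
d = sum_fixed_order(list(xs))
assert c == d
```

The price of the fixed order is lost parallelism, which is consistent with the roughly 20% performance cost the post reports.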
Group 5: Company Background
- Thinking Machines was co-founded by Mira Murati, former CTO of OpenAI, and its team includes notable figures from the AI industry, primarily from OpenAI [36][38][46].
- The company recently completed a $2 billion seed round, a record for AI funding, and is now valued at $12 billion despite not yet having a product [51][50].
New Work from Tsinghua's Tang Jie: Can Large Models Play Guandan?
量子位· 2025-09-10 10:01
Core Viewpoint
- The research indicates that large models can effectively play various card games, demonstrating their capabilities in complex decision-making scenarios [2][4][52].

Group 1: Model Performance
- Different models exhibit varying performance across card games, with fine-tuned models showing superior results compared to API-based and base models [3][40].
- Among the API-based models, GPT-4o performs best overall, while GLM-4 demonstrates strong capabilities in games such as DouDizhu and GuanDan [39][40].
- Fine-tuned models, particularly GLM4-9B-Chat-mix, excel in multiple games, including DouDizhu, GuanDan, and Uno, indicating their versatility [42][40].

Group 2: Game Selection and Learning Methodology
- The research team selected eight popular card games based on their complexity and the availability of high-quality models and data [8].
- The learning process generated high-quality interaction data through teacher models and opponents, allowing the large language models to learn effectively [14][16].
- The complexity of each game determined the number of training instances collected, with more complex games like DouDizhu and GuanDan requiring larger datasets [20][21].

Group 3: Inter-Game Influence
- Models trained on similar games can enhance each other's performance, while those trained on games with significant rule differences may experience performance conflicts [52][49].
- For instance, models trained on GuanDan performed well in DouDizhu, suggesting positive transfer of skills between these games [45].

Group 4: Generalization and Capability
- Training on card games can degrade the models' general capabilities, but this can be mitigated by incorporating general data into the training process [56][54].
- The mixed training approach allowed some recovery of general capabilities, demonstrating the balance between specialized game skills and broader knowledge [56].
Qwen Delivers Again: The World's Fastest Open-Source Model Arrives at Over 2,000 Tokens/Second!
量子位· 2025-09-10 10:01
Core Viewpoint
- The article discusses the launch of K2 Think, the world's fastest open-source AI model, developed by MBZUAI and G42 AI, achieving a speed of over 2,000 tokens per second with only 32 billion parameters [1][3][8].

Group 1: Model Performance
- K2 Think has demonstrated processing speeds exceeding 2,000 tokens per second, with individual tests reaching 2,730.4 tokens/second and 2,224.7 tokens/second [10][14][18].
- The model performs well on mathematical benchmarks, scoring 90.83 on AIME'24 and 81.24 on AIME'25 [25].

Group 2: Technical Innovations
K2 Think incorporates several technical innovations:
1. Supervised fine-tuning for long-chain reasoning, so the model thinks step by step rather than answering directly [31].
2. Reinforcement learning with verifiable rewards, enhancing performance in mathematics and logic [31].
3. Intelligent planning before reasoning, letting the model outline a solution before reasoning in detail [31].
4. Best-of-N sampling during reasoning, generating multiple answers and selecting the best one [31].
5. Speculative decoding to generate and verify answers in parallel, reducing redundant computation [31].
6. Hardware acceleration on the Cerebras WSE, enabling the high-speed output [31].

Group 3: Model Background
- K2 Think is built on the Qwen 2.5-32B model from HuggingFace, indicating a connection to Chinese technology [6][5].
- Despite having only 32 billion parameters, K2 Think is claimed to match the performance of flagship models from OpenAI and DeepSeek [24].
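Best-of-N sampling, one of the innovations listed above, is simple to state in code: draw several candidate answers and keep the one a scorer likes best. The generator and scorer below are toy stand-ins; K2 Think's actual sampler and verifier are not described in this summary.

```python
# Minimal best-of-N sketch: sample n candidates, score each, keep the best.
# `generate` and `score` here are toy stand-ins for an LLM sampler and a
# reward model / verifier.
def best_of_n(generate, score, prompt, n=4):
    """Generate n candidates for `prompt` and return the highest-scoring one."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

gen = lambda prompt, seed: seed * 3   # stand-in "samples": 0, 3, 6, 9
score = lambda ans: -abs(ans - 10)    # stand-in scorer: prefers values near 10
best = best_of_n(gen, score, "solve x", n=4)
assert best == 9  # 9 is closest to 10 among 0, 3, 6, 9
```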
Kuaishou's AI Super Employee Goes Live! One Sentence Produces a Complete Short Video, from Script to Publishing
量子位· 2025-09-10 08:01
Core Viewpoint
- The article discusses the launch of Kwali, an AIGC (AI-generated content) tool from Kuaishou that lets users create complete promotional videos in just a few minutes by simply stating their requirements, significantly lowering the barriers to video production [1][2][37].

Group 1: Functionality and Features
- Kwali integrates multiple agents into a single framework to assist in video creation, including a material library and digital-human resources, allowing users to generate high-quality short videos without prior filming skills [2][4].
- The process involves several agents handling different tasks: intent analysis, script generation, material matching, and editing, all of which can operate independently and in parallel [5][18][42].
- Users can upload private materials, which the system automatically tags for easy future access, enabling seamless integration with the platform's material library [14].

Group 2: Production Process
- Video creation starts with Kwali breaking the user's request down into key selling points, audience, and context tags, followed by script writing and material selection [8][22].
- The script includes dialogue and corresponding visual descriptions designed to capture audience attention, generated from analysis of popular videos in the same category [28][30].
- After gathering the necessary visual materials, Kwali matches appropriate fonts and background music, synthesizes voiceovers using TTS, and performs the final edit [33][35].

Group 3: Industry Impact
- Kwali represents a fundamental shift in the video production supply chain, reducing the resources and time traditionally required to create promotional content [37][38].
- The new model allows small businesses and individual brands to produce content more frequently and affordably, turning video marketing into a lightweight tool for daily operations [40][45].
- The streamlined process enables rapid testing of new creative ideas, allowing businesses to adapt quickly to market demands and improve their marketing efficiency [44][46].
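The staged pipeline described above (intent analysis → script → material matching → editing) can be sketched as a chain of agents. Every function name and dict field below is a hypothetical illustration; Kwali's internal interfaces are not public.

```python
# Hypothetical pipeline sketch of the intent → script → materials → edit
# flow; each stage stands in for an agent backed by an LLM in the real
# system. All names and schemas here are illustrative assumptions.
def analyze_intent(request: str) -> dict:
    # Real system: an LLM extracts selling points, audience, context tags.
    return {"selling_points": [request], "audience": "general", "context": []}

def write_script(intent: dict) -> list:
    # One scene per selling point, with dialogue and a visual description.
    return [{"line": f"Highlight: {p}", "visual": "product close-up"}
            for p in intent["selling_points"]]

def match_materials(script: list, library: list) -> list:
    # Naive matcher: pair every scene with the first library clip.
    return [{"shot": s["visual"], "clip": library[0]} for s in script]

def edit_video(script: list, materials: list) -> dict:
    # Final assembly step: fonts, music, and TTS would be added here.
    return {"scenes": len(script), "clips": [m["clip"] for m in materials]}

def make_video(request: str, library: list) -> dict:
    intent = analyze_intent(request)
    script = write_script(intent)
    materials = match_materials(script, library)
    return edit_video(script, materials)

video = make_video("30-second ad for a coffee brand", ["clip_001.mp4"])
```

The point of the structure is that each stage has a narrow input/output contract, which is what allows the agents to run independently and in parallel, as the summary notes.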
Truly PhD-Level! GPT-5 Gives the First Explicit Convergence Rate for the Fourth Moment Theorem, with Only a Nudge from Math Professors
量子位· 2025-09-10 08:01
Core Insights
- GPT-5 has successfully extended the qualitative fourth moment theorem to a quantitative form with explicit convergence rates, marking a significant advance for AI in mathematical research [1][2][10].

Group 1: Research Achievements
- The original theorem established that convergence occurs but not how fast; GPT-5's contribution makes the rate explicit [2].
- OpenAI co-founder Greg Brockman expressed satisfaction with the progress made using GPT-5 in mathematical research [4].
- GPT-5 Pro improved a known bound in convex optimization from 1/L to 1.5/L within minutes, showcasing its capabilities [8].

Group 2: Research Methodology
- Three mathematics professors ran a controlled experiment using the Malliavin–Stein framework to test GPT-5's ability to generalize the fourth moment theorem [9][10].
- Initial prompts were based on a paper establishing a qualitative fourth moment theorem for two Wiener–Itô integrals of differing parity [11].
- GPT-5 reached a broadly correct conclusion but made reasoning errors that could have invalidated the proof [13][14].

Group 3: Iterative Improvement
- After errors were identified, the researchers prompted GPT-5 to check its formulas and give detailed derivations, leading to further corrections [15].
- GPT-5 was able to format the results as a research paper, including an introduction, main theorem statements, and a complete proof [17].
- The model suggested the method could extend to non-Gaussian frameworks, indicating potential for broader applications [20].

Group 4: Further Exploration
- The researchers then tried to extend the findings to the Poisson case, recognizing structural differences between Gaussian and Poisson settings [21][24].
- GPT-5 initially overlooked a critical non-negativity fact in the Poisson case but corrected itself after specific guidance from the researchers [26][28].

Group 5: Publication Challenges
- The authors initially intended to list GPT-5 as a co-author but were informed by arXiv that AI cannot be credited as an author [29].
- The paper was ultimately submitted without GPT-5 as an author, reflecting the ongoing debate over AI's role in academic authorship [30].
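For background on what "qualitative versus quantitative" means here, the classical fourth moment statements from the standard Malliavin–Stein literature (Nualart–Peccati; Nourdin–Peccati) can be recalled as follows. This is a summary of well-known prior results for context, not the new theorem obtained with GPT-5.

```latex
% Background only: the classical qualitative fourth moment theorem and its
% standard quantitative refinement, as stated in the Malliavin-Stein
% literature. Not the paper's new result.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Let $F_n$ belong to a fixed Wiener chaos of order $q \ge 2$ with
$\mathbb{E}[F_n^2] = 1$. The qualitative theorem says
\[
  F_n \xrightarrow{\;d\;} N(0,1)
  \quad\Longleftrightarrow\quad
  \mathbb{E}[F_n^4] \to 3,
\]
with no information about the speed of convergence. The standard
quantitative refinement bounds the total-variation distance by the
fourth-moment gap:
\[
  d_{\mathrm{TV}}\bigl(F_n,\, N(0,1)\bigr)
  \;\le\; 2\sqrt{\tfrac{q-1}{3q}}\;\sqrt{\mathbb{E}[F_n^4] - 3}.
\]
The experiment described above asked GPT-5 for bounds of this explicit
kind in the mixed-parity two-integral setting, where only the
qualitative statement was previously available.
\end{document}
```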
Tencent's Version of "Claude Code" Is Here! The L4 Era of AI Programming Is Coming
量子位· 2025-09-10 08:01
Core Viewpoint
- Tencent has launched the AI CLI tool CodeBuddy Code and opened public testing for CodeBuddy IDE, a significant step for AI programming tools, particularly in the CLI format, which is becoming foundational infrastructure for enterprise-level development [1][3][14].

Group 1: Product Overview
- CodeBuddy IDE is a standalone AI IDE currently in public testing; the domestic version is free, and the international version offers limited access to Pro models during the testing phase [2][3].
- CodeBuddy Code is designed for professional engineers, letting natural language drive the entire development and operations lifecycle and improving automation efficiency [3][23].
- The product matrix includes CodeBuddy IDE, CodeBuddy Code, and the CodeBuddy plugins, the last of which is already officially launched and free to use [3][8].

Group 2: Market Context
- CodeBuddy Code arrives as developers move away from Claude Code amid recent controversies, positioning Tencent's offering as a timely alternative [6].
- The AI CLI format, pioneered by Claude Code, has reshaped the market, combining traditional CLI strengths with AI capabilities suited to automation and enterprise development [11][14].

Group 3: Development Trends
- AI programming tools are evolving through five levels, with the CLI format representing a major advance that shifts AI from a supporting role to a driving force in software engineering [11][16].
- The CLI mode is particularly advantageous for enterprise teams, covering the entire software lifecycle from task breakdown to deployment [19][20].

Group 4: Performance Metrics
- Tencent reports that over 90% of its engineers use CodeBuddy, reducing average coding time by more than 40%, with AI-generated code accounting for over 50% of the total [20][21].
- The proportion of AI-generated code in code reviews has risen from 12% to 35%, indicating growing reliance on AI in the development process [20].

Group 5: Features and Functionality
- CodeBuddy Code supports natural language interaction, so users can describe tasks without learning complex commands, and manages project context in a traceable, shareable way [26][24].
- The platform integrates with Git, CI/CD, and monitoring systems, enabling efficient collaboration among multiple agents [25][26].
- Its memory system spans project memory, user memory, and global memory, enabling long-term context management across projects [29].

Group 6: Future Directions
- The CLI-driven intelligent programming platform represents a new direction for enterprise AI programming, turning developers into AI-collaboration architects [37][38].
The First Data Agent Benchmark Is Here! 2,007 Test Tasks Covering Heterogeneous Data Sources: Databases, PDFs, Video, and Audio
量子位· 2025-09-10 08:01
Core Viewpoint
- The article introduces FDABench, a benchmark for evaluating data agents on heterogeneous data analysis, developed by Nanyang Technological University, the National University of Singapore, and Huawei. It addresses the growing demand for data-driven decision-making with a comprehensive assessment framework spanning data types and complexity levels [1][11].

Group 1: Benchmark Overview
- FDABench covers 2,007 test tasks across more than 50 fields, including finance and e-commerce, at three difficulty levels: easy, medium, and hard [13].
- The benchmark includes an Agent-Expert collaboration framework that supports diverse data agent workflows, ensuring compatibility across data agent systems without redesigning the testing framework [17].

Group 2: Evaluation Findings
- Evaluating various data agent systems revealed distinct strengths in response quality, accuracy, latency, and token cost; each system has its advantages [3].
- Complex architectures such as Multi-Agent and Reflection significantly outperform simpler ones in accuracy on heterogeneous data analysis, but at a much higher computational cost, consuming 6 to 20 times more resources [23].

Group 3: Resource Allocation Insights
- Different architectures optimize performance by reallocating compute: the Reflection architecture spends 26-29% of its computation on retry mechanisms for higher-quality outputs, while the Planning architecture dedicates 32-35% to the generation phase for efficiency [23].
- The study highlights the importance of matching model choice to architectural complexity, as some models perform poorly in complex architectures due to a "double reasoning penalty" effect [23].

Group 4: Practical Implications
- There is no perfect data agent: some are fast but struggle with complex tasks, others accurate but slow and costly. The right choice depends on specific needs [24].
- FDABench serves as a tool to help users identify which system best fits their requirements [25].
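The multi-metric evaluation described above (accuracy, latency, token cost per agent) reduces to a simple harness shape. The agent interface and report schema below are assumptions for illustration, not FDABench's actual API.

```python
# Hypothetical evaluation harness: run an agent over tasks and report
# accuracy, mean latency, and total token cost, the metric axes the
# benchmark compares agents on. Interface is an illustrative assumption.
import time

def evaluate_agent(agent, tasks):
    """Run `agent` over `tasks`; return accuracy, mean latency, total tokens.
    `agent(question)` must return (answer, tokens_used)."""
    correct, total_tokens, latencies = 0, 0, []
    for task in tasks:
        start = time.perf_counter()
        answer, tokens_used = agent(task["question"])
        latencies.append(time.perf_counter() - start)
        total_tokens += tokens_used
        if answer == task["expected"]:
            correct += 1
    return {
        "accuracy": correct / len(tasks),
        "mean_latency_s": sum(latencies) / len(latencies),
        "total_tokens": total_tokens,
    }

# Toy agent: always answers True and "spends" 50 tokens per task, so it is
# right on exactly the even-numbered tasks below.
tasks = [{"question": i, "expected": i % 2 == 0} for i in range(10)]
toy_agent = lambda q: (True, 50)
report = evaluate_agent(toy_agent, tasks)
assert report["accuracy"] == 0.5
assert report["total_tokens"] == 500
```

Comparing two agents is then just comparing their reports, which is how the accuracy-versus-cost trade-offs above would surface in practice.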
No Need to Choose Between Fast and Slow Thinking! Huawei's Open-Source 7B Model Switches Freely, Cutting Chain-of-Thought Length by Nearly 50% Without Losing Accuracy
量子位· 2025-09-10 08:01
Core Viewpoint
- Huawei's latest release, openPangu-Embedded-7B-v1.1, features a dual "thinking engine" that switches seamlessly between fast and slow thinking modes, addressing a significant industry pain point: models traditionally had to commit to one mode [1][3][4].

Model Features
- The model employs a progressive fine-tuning strategy and an adaptive thinking mode, supporting manual switching between fast and slow thinking as well as automatic transitions based on problem difficulty [3][4].
- It significantly improves accuracy while maintaining efficiency, cutting average reasoning-chain length by nearly 50% on benchmarks such as CMMLU without sacrificing precision [4][18].

Training Strategy
The training process consists of three progressive stages:
1. Selecting moderately challenging problems so the model learns effectively without stagnating or being overwhelmed [8].
2. Merging multiple model versions to consolidate knowledge and prevent forgetting [9].
3. Continuously expanding the model's capabilities to tackle more complex tasks [10].
This iterative approach makes the model a continuously evolving learner rather than a passive recipient of knowledge [10].

Adaptive Mechanism
The model introduces a two-phase curriculum for adaptive thinking:
1. The first phase teaches the model to distinguish fast from slow thinking using labeled training data [13].
2. The second phase lets the model autonomously decide which mode to use based on task complexity [14].
This transition enhances the model's flexibility and autonomy in complex reasoning tasks [15].

Performance Metrics
- openPangu-Embedded-7B-v1.1 outperforms its predecessor across benchmarks, with notable gains in mathematical problem solving [16][17].
- In tests, the model maintains high accuracy while trimming unnecessary reasoning steps, effectively balancing speed and precision [18].

Lightweight Model
- Huawei also introduced openPangu-Embedded-1B, a lightweight model optimized for edge AI deployment that performs strongly despite having only 1 billion parameters [20][21].
- It demonstrates a strong performance-to-parameter ratio, setting a new benchmark for 1B-scale models in China [22].

Conclusion
- The release of openPangu-Embedded-7B-v1.1 represents a significant advance in the large-model field, showcasing innovative approaches to training and adaptive thinking [23][24].
- The dual-mode thinking feature is expected to add value across practical applications [25].
NVIDIA's New GPU, Purpose-Built for Ultra-Long Context and Video Generation
量子位· 2025-09-10 01:28
henry, reporting from Aofeisi | QbitAI, WeChat official account QbitAI. Jensen Huang is going after token-intensive tasks. Just now, at the AI Infra Summit, NVIDIA announced a brand-new GPU built specifically for million-token-scale code generation and generative video applications: the NVIDIA Rubin CPX GPU. According to Huang, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, letting models reason over millions of tokens "in one breath." What's more, NVIDIA claims Rubin CPX pays for itself: every $100 million invested can return $5 billion in token revenue. (A 50x return, so Huang says.) Industry heavyweights including Cursor, Runway, and Magic echoed the pitch, saying Rubin CPX will bring breakthroughs in code productivity, generative video creation, and autonomous large-model agents, respectively. So what exactly is this GPU? The first CUDA GPU built for massive-context AI: Rubin CPX is based on the NVIDIA Rubin architecture, uses a monolithic die design with built-in NVFP4 compute, and targets high performance and energy efficiency for AI inference. Its performance gains show up in several areas; for a quick comparison, take the A100. In terms of compute ...