The world's most valuable startup is born: OpenAI's valuation hits a record $500 billion
机器之心· 2025-10-03 00:24
Machine Heart report, Machine Heart editorial desk. A few days ago, OpenAI released Sora 2, a new generation of its video model that surpasses previous systems in physical accuracy, realism, and controllability, and adds synchronized dialogue and sound effects. Altman calls it the "ChatGPT for creativity" moment.

| Company | Valuation ($B) | Country |
| --- | --- | --- |
| OpenAI | 500 | US |
| SpaceX | 400 | US |
| ByteDance | 220 | China |
| Anthropic | 183 | US |
| Ant Group | 150 | China |
| Reliance Retail | 100 | India |
| Databricks | 100 | US |
| Shein | — | China |

...
Sora 2 flubs finger counting; Altman is among the first "victims," turned by AI into the world's most miserable wage worker
机器之心· 2025-10-02 06:19
Machine Heart report. Editor: Yang Wen. A large-scale public embarrassment for Altman: Sora 2, powerful as it is, still can't count on its fingers. X user @fofrAI tested Sora 2 with the prompt: "a man counts out loud from 1 to 10, using his fingers and holding them up as he goes." In the generated video, the man counts the numbers correctly, but the fingers he holds up don't fully match them. This is not the first time this blogger has used this kind of prompt to test video generation models: back in May this year he ran it against Veo3, which not only got the fingers wrong but only counted to 3. He later polished the prompt: "a man counts out loud from 1 to 10, '1, 2, 3, 4, 6, 7, 8, 9, 10', he counts using his fingers and holds them up as he goes," and it still ended in failure. At the start of the video, the man's performance ...
Developers rejoice: Thinking Machines releases its first product, Tinker, taking all the hassle out of post-training
机器之心· 2025-10-02 03:12
Core Insights
- Tinker, the first product launched by Thinking Machines, is an API designed to simplify the fine-tuning of language models for developers and researchers, letting them focus on training data and algorithms while Tinker manages the infrastructure [2][4][16].

Product Features
- Tinker supports various advanced models, including Qwen3-235B-A22B, and lets users switch from small to large models by changing little more than a string in Python code [6][8].
- The API provides low-level primitives such as forward_backward and sample, which suffice to express most common post-training methods. An open-source library, Tinker Cookbook, offers modern reference implementations of these methods [9][11].

Use Cases and Adoption
- Teams from Princeton, Stanford, and UC Berkeley are already using Tinker for both supervised fine-tuning and experimental reinforcement learning pipelines [13].
- The Goedel team at Princeton matched full-parameter model performance using only 20% of the data, while Stanford's chemistry group improved accuracy on a specific task from 15% to 50% using Tinker [14].

Market Position and Future Outlook
- Tinker aims to democratize access to fine-tuning capabilities, potentially leading to more diverse product innovation in the AI space [16].
- The initial phase of Tinker will be free, with usage-based pricing to be introduced in the coming weeks [15].
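The division of labor described above (the user owns the data and the training loop, the service owns the infrastructure) can be sketched around the two named primitives, forward_backward and sample. The client class, the optim_step method, and the toy loss below are invented stand-ins for illustration, not Tinker's real API:

```python
# Hypothetical sketch of a post-training loop written against two low-level
# primitives like the ones Tinker exposes: forward_backward (compute loss
# and accumulate gradients) and sample (generate from the current model).
# Everything beyond those two names is an invented stand-in.
class ToyTrainingClient:
    """Stand-in for a hosted fine-tuning client."""
    def __init__(self, base_model: str):
        self.base_model = base_model  # switching models is just a string change
        self.weight = 0.0             # single scalar "parameter" for illustration
        self._grad = 0.0

    def forward_backward(self, batch):
        """Toy squared-error loss over the batch; stores the gradient."""
        targets = [ex["target"] for ex in batch]
        loss = sum((self.weight - t) ** 2 for t in targets) / len(targets)
        self._grad = sum(2 * (self.weight - t) for t in targets) / len(targets)
        return loss

    def optim_step(self, lr=0.1):
        self.weight -= lr * self._grad

    def sample(self, prompt: str) -> str:
        return f"{prompt} -> {self.base_model}(w={self.weight:.2f})"

client = ToyTrainingClient(base_model="Qwen3-235B-A22B")
batch = [{"target": 1.0}, {"target": 3.0}]
for _ in range(100):                 # the user writes data and algorithm;
    client.forward_backward(batch)   # infrastructure sits behind the client
    client.optim_step()
print(client.sample("hello"))
```

With this shape, "switching from a small model to a large one" really is a one-string change in the constructor call.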
Xiaohongshu releases FireRedChat: the first privately deployable full-duplex large-model voice interaction system
机器之心· 2025-10-02 03:12
Core Insights
- The article introduces FireRedChat, the industry's first full-duplex large-model voice interaction system that supports private deployment, addressing issues like high latency, noise sensitivity, and poor controllability [2][10].

Group 1: System Features
- FireRedChat uses a complete "interaction controller + interaction module + dialogue manager" architecture, allowing any half-duplex link to be upgraded to full-duplex with ease [2].
- The system integrates self-developed models such as pVAD, EoT, FireRedTTS-1s, FireRedASR, and FireRedTTS2, offering both cascaded and semi-cascaded deployment options to meet various needs [2][10].
- Modular decoupling and streaming optimization bring end-to-end latency close to industrial level, enhancing real-time interaction [10][24].

Group 2: Performance Metrics
- Experimental results indicate that FireRedChat outperforms other open-source frameworks on key metrics, providing a genuinely usable, deployable open-source solution for smarter and more natural full-duplex voice interaction [3].
- On interruption accuracy, pVAD significantly reduces false interruptions caused by noise and other speakers, achieving a false barge-in rate of 10.2%, lower than competing systems [20][22].
- The system's end-to-end latency is competitive, with a P50 response time of 2.341 seconds, approaching industrial-grade closed systems [24].

Group 3: Emotional Intelligence
- FireRedChat's AI assistant is designed to understand and respond to emotional cues, offering comforting and encouraging responses during moments of sadness or excitement, enhancing the user experience [5][11].
- The system captures acoustic cues such as emotion, tone, and rhythm, allowing it to express empathy and warmth in its interactions [11].

Group 4: Open Source and Deployment
- FireRedChat is fully open source, allowing private deployment without external API dependencies and ensuring data security and compliance [17].
- The modular architecture and comprehensive documentation make deployment and customization straightforward for developers [17].

Group 5: Future Outlook
- The FireRed team plans to continue iterating on FireRedChat, integrating a more powerful AudioLLM and richer multimodal interactions, aiming to make voice AI more accessible and user-friendly [26].
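The "interaction controller" role can be pictured as a small state machine that decides when the assistant may speak and when a confident speech signal should interrupt it. The threshold logic and return strings below are an invented illustration of the barge-in idea, not FireRedChat's actual pVAD/EoT implementation:

```python
# Minimal sketch of a full-duplex interaction controller. A personalized
# VAD score gates barge-in while the assistant is speaking (so background
# noise does not cause false interruptions), and an end-of-turn signal
# hands the floor to the assistant. Invented illustration only.
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class DuplexController:
    """Decides when the assistant speaks and when the user barges in."""
    def __init__(self, barge_in_threshold=0.8):
        self.state = State.LISTENING
        self.threshold = barge_in_threshold  # assumed confidence cutoff

    def on_vad(self, speech_prob: float) -> str:
        # While the assistant is speaking, only a confident "user is
        # really talking" signal triggers an interruption.
        if self.state == State.SPEAKING and speech_prob >= self.threshold:
            self.state = State.LISTENING
            return "barge_in: stop TTS, hand mic to user"
        return "no_op"

    def on_end_of_turn(self) -> str:
        # An end-of-turn (EoT) detector fires when the user finishes.
        if self.state == State.LISTENING:
            self.state = State.SPEAKING
            return "respond: start streaming TTS"
        return "no_op"

ctl = DuplexController()
print(ctl.on_end_of_turn())  # user finished, assistant starts speaking
print(ctl.on_vad(0.3))       # low-confidence noise while speaking: ignored
print(ctl.on_vad(0.95))      # confident user speech: barge-in
```

In a half-duplex link this controller degenerates to strict turn-taking; upgrading to full-duplex amounts to wiring the VAD stream into on_vad while TTS is playing.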
Everything you need is in your dreams? Google's new world model trains purely on "imagination" and learned to mine diamonds in Minecraft
机器之心· 2025-10-02 01:30
Core Insights
- Google DeepMind's Dreamer 4 supports the idea that agents can learn skills for interacting with the physical world through imagination, without direct interaction [2][4].
- Dreamer 4 is the first agent to obtain diamonds in the challenging game Minecraft purely from standard offline datasets, demonstrating significant advances in offline learning [7][21].

Group 1: World Model and Training
- World models enable agents to understand the world deeply and select successful actions by predicting future outcomes from their own perspective [4].
- Dreamer 4 uses a novel shortcut forcing objective and an efficient Transformer architecture to accurately learn complex object interactions while supporting real-time human interaction on a single GPU [11][19].
- The model can be trained on large amounts of unlabeled video, requiring only a small amount of action-paired video, opening the possibility of learning general world knowledge from diverse online videos [13].

Group 2: Experimental Results
- In the offline diamond challenge, Dreamer 4 significantly outperformed OpenAI's offline agent VPT [15], succeeding with 100 times less data [22].
- Dreamer 4 surpassed behavior cloning both in which key items it acquired and in how quickly it obtained them, indicating that world-model representations are better suited to decision-making [24].
- The agent demonstrated a high success rate across tasks, succeeding in 14 of 16 interactions in the Minecraft environment, showcasing its robust capabilities [29].

Group 3: Action Generation
- With only 10 hours of action-labeled training data, Dreamer 4 reached a PSNR of 53 dB and an SSIM of 0.75, indicating that the world model absorbs most of its knowledge from unlabeled videos with minimal action data [32].
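"Learning through imagination" means the policy is improved using rollouts generated by the learned world model rather than by acting in the real environment. A toy sketch of that loop, with invented dynamics standing in for a model trained from offline video (this illustrates the concept only, not Dreamer 4's shortcut-forcing objective):

```python
# Toy imagination training: a "world model" (here a fixed toy dynamics
# function standing in for one learned from offline video) generates
# rollouts, and the policy is selected purely by imagined reward,
# never touching a real environment. Invented illustration only.
import random

def world_model(state, action):
    """Stand-in learned dynamics: stepping right makes progress."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state > state else 0.0  # progress reward
    return next_state, reward

def imagined_return(p_right, horizon=20, episodes=30, seed=0):
    """Average return of a Bernoulli(p_right) policy rolled out in the model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = "right" if rng.random() < p_right else "left"
            s, r = world_model(s, a)
            total += r
    return total / episodes

# Crude policy improvement: evaluate candidate policies entirely in
# imagination and keep the best one.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
best = max(candidates, key=imagined_return)
print(best)  # the policy that makes progress most often wins
```

The real system replaces the toy dynamics with a Transformer world model trained on video, and the candidate sweep with gradient-based policy learning inside the model.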
Does Sora 2 crush Veo 3? An exhaustive head-to-head test: it can do Chinese stand-up comedy but flubs gymnastics, with working invite codes included
机器之心· 2025-10-01 07:26
Core Viewpoint
- The article discusses the advancements of Sora 2, an AI video and audio generation model, highlighting its superior physical accuracy, realism, and controllability compared to its predecessor and competitors like Google's Veo3 [1][6][7].

Comparison with Veo3
- Sora 2 can generate up to 20 seconds of 1080p video, positioning it as a strong competitor to Veo3 [7].
- The audio generation capabilities of Sora 2 are noted to be superior to those of Veo3 [9].
- Sora 2's video generation avoids issues like object disappearance and distortion that were present in the previous version [5][9].
- Users can access Sora 2 through a web platform or an iOS app, both requiring an invitation and a US IP address [11][12].

Performance Testing
- In various tests, Sora 2 demonstrated impressive capabilities in generating realistic videos, including ASMR and singing performances, with accurate audio-visual synchronization [20][22].
- However, both Sora 2 and Veo3 struggled with gymnastics videos, producing unrealistic movements [28][33].
- Sora 2 outperformed Veo3 in generating fake news segments, providing a more dynamic presentation [24][25].

User Experience and Accessibility
- The Sora iOS app mimics popular social media platforms like TikTok, featuring a recommendation algorithm and options for user interaction [44].
- OpenAI has implemented safety measures, including watermarks and restrictions on deepfakes of public figures, to prevent misuse of the technology [35].

Market Position and Competition
- The article suggests that while OpenAI's Sora 2 has established a product moat, competition remains fierce in the AI video generation space, with other companies like Meta and domestic Chinese platforms also advancing their offerings [46][47].
The god of CUDA kernels and the world's strongest GPU programmer? Who is this behind-the-scenes master at OpenAI
机器之心· 2025-09-30 23:49
Core Insights
- The article emphasizes the importance of behind-the-scenes engineers in AI, arguing that a great team consists of both star figures and key contributors [1][2].

Group 1: Scott Gray's Role and Skills
- Scott Gray, a senior engineer at OpenAI, gained attention for writing a critical CUDA kernel that supports trillions of computations daily [3][5].
- Writing high-performance CUDA kernels requires expertise in parallel computing, GPU architecture, and deep learning algorithms, making such talent rare [7].
- Gray's career path is built around performance engineering, focusing on low-level optimization rather than the typical "genius scientist" track [7][8].

Group 2: Achievements at Nervana
- Gray's reputation in AI began at Nervana Systems, where he addressed the efficiency gap between software frameworks and hardware during the deep learning boom [14].
- He developed maxas, an assembler that allows direct interaction with the hardware, enabling highly optimized computational kernels [17][18].
- Using maxas, Gray wrote an SGEMM kernel that reached 98% of theoretical peak efficiency on the GM204 GPU, outperforming NVIDIA's cuBLAS by 4.8% [20].

Group 3: Innovations in Deep Learning
- Building on maxas, Gray created maxDNN, which applied the same low-level optimization techniques to convolution operations, significantly surpassing NVIDIA's cuDNN in performance [21].
- On AlexNet's convolution layers, maxDNN sustained 93-95% computational efficiency, while cuDNN fluctuated between 32% and 57% [21].

Group 4: Contributions at OpenAI
- After joining OpenAI, Gray shifted focus to developing tools for efficient sparse model architectures, becoming a key figure in implementing scaling laws [22].
- He co-developed innovative block-sparse GPU kernels that gain efficiency by skipping zero-valued blocks during computation [24][25].
- These kernels allow researchers to build larger neural network models within fixed computational budgets, achieving state-of-the-art results on various tasks [26][27].
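The efficiency gain of block sparsity comes from storing only the nonzero blocks of a weight matrix and skipping entire blocks of work. A NumPy sketch of the concept (the real kernels do this on the GPU in hand-tuned code; this shows only the idea):

```python
# Concept sketch of a block-sparse matmul: the weight matrix is stored as
# a block mask plus only its nonzero blocks, and multiplication skips
# every zero block entirely. Illustration of the idea behind block-sparse
# GPU kernels, not OpenAI's actual implementation.
import numpy as np

def block_sparse_matmul(x, blocks, mask, block_size):
    """y = x @ W, where W is described by `mask` (which blocks are
    nonzero) and `blocks[(i, j)]` (the contents of nonzero block (i, j))."""
    n_in_blocks, n_out_blocks = mask.shape
    b = block_size
    y = np.zeros((x.shape[0], n_out_blocks * b))
    for i in range(n_in_blocks):
        for j in range(n_out_blocks):
            if not mask[i, j]:
                continue  # the whole point: zero blocks cost nothing
            y[:, j*b:(j+1)*b] += x[:, i*b:(i+1)*b] @ blocks[(i, j)]
    return y

# Build a random blocked weight matrix with roughly half the blocks zero.
rng = np.random.default_rng(0)
b, ni, no = 4, 3, 2
mask = rng.random((ni, no)) < 0.5
blocks = {(i, j): rng.standard_normal((b, b))
          for i in range(ni) for j in range(no) if mask[i, j]}

# Dense reference: materialize W and check the sparse path agrees.
W = np.zeros((ni*b, no*b))
for (i, j), blk in blocks.items():
    W[i*b:(i+1)*b, j*b:(j+1)*b] = blk
x = rng.standard_normal((5, ni*b))
assert np.allclose(block_sparse_matmul(x, blocks, mask, b), x @ W)
```

With a fixed compute budget, zeroing half the blocks buys roughly twice the model width for the same matmul cost, which is the trade the article describes.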
Sora 2 arrives in the dead of night, and OpenAI ships an app outright: the ChatGPT moment for video is here
机器之心· 2025-09-30 23:49
Core Insights
- OpenAI has quietly launched Sora 2, a new product that enters the video generation space head-on, with an impact likened to ChatGPT's in the language model domain [1][8][12].
- Sora 2 is designed to enhance physical accuracy, realism, and controllability in video generation, outperforming previous systems [5][12][14].
- A new iOS app, Sora, lets users create and share videos, and introduces a feature called "cameos" for high-fidelity personal representation [19][25].

Product Features
- Sora 2 demonstrates significant advances in simulating complex physical actions, such as Olympic gymnastics and dynamic buoyancy [12][13].
- The model improves on previous video generation systems by adhering more closely to physical laws, allowing realistic failure simulations [13][17].
- Sora 2 supports complex multi-shot instructions and excels across styles, including realistic, cinematic, and anime [14].

User Engagement and Safety
- The Sora app includes a recommendation algorithm that prioritizes user control over content consumption, aiming to mitigate addiction and isolation [21][22].
- OpenAI emphasizes user agency in content creation and consumption, with built-in mechanisms for users to manage their experience [22].
- The app is designed to foster creativity rather than consumption, addressing safety concerns around content generation and usage rights [22][23].

Availability and Future Plans
- The Sora iOS app is currently available for download in the US and Canada, initially free with relaxed computational limits [25].
- OpenAI plans to release the Sora 2 Pro model for ChatGPT Pro users and intends to make Sora 2 available via API in the future [25].
Major release from Fudan, Tongji, CUHK and others: a comprehensive survey of reinforcement learning across the full lifecycle of large language models
机器之心· 2025-09-30 23:49
Core Insights
- The article discusses the significant advancements in reinforcement learning (RL) techniques that enhance the capabilities of large language models (LLMs), particularly in understanding human intent and following user instructions [2][3].
- A comprehensive survey titled "Reinforcement Learning Meets Large Language Models," conducted by researchers from top institutions, summarizes the role of RL throughout the entire lifecycle of LLMs [2][3].

Overview of Reinforcement Learning in LLMs
- The survey details the application strategies of RL across the stages of LLMs, including pre-training, alignment fine-tuning, and reinforced reasoning [3][6].
- It organizes existing datasets, evaluation benchmarks, and mainstream open-source tools and training frameworks relevant to RL fine-tuning, providing a clear reference for future research [3][6].

Lifecycle of LLMs
- The survey systematically covers the complete application lifecycle of RL in LLMs, detailing the objectives, methods, and challenges faced at each stage from pre-training onward [11][12].
- A classification overview of how RL operates in LLMs is presented, highlighting the interconnections between stages [5][6].

Focus on Verifiable Rewards
- The survey emphasizes Reinforcement Learning with Verifiable Rewards (RLVR), summarizing its applications in enhancing the stability and accuracy of LLM reasoning [7][9].
- It discusses how RLVR optimizes the reasoning process and improves the model's adaptability to complex tasks through automatically verifiable reward mechanisms [7][9].

Key Contributions
- The article identifies three main contributions: a comprehensive lifecycle overview of RL applications in LLMs, a focus on advanced RLVR techniques, and the integration of key research resources essential for experiments and evaluations [9][11].
- It provides valuable references for researchers interested in exploring RL in the context of LLMs [11][12].

Challenges and Future Directions
- Despite significant progress, scalability and training stability remain challenges for large-scale RL on LLMs, which is still computationally intensive and often unstable [12][13].
- Reward design and credit assignment, particularly over long-horizon reasoning, remain difficult for model learning [12][13].
- The article highlights the need for standardized datasets and evaluation benchmarks to facilitate comparison and validation of RL fine-tuning methods [12][13].
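The "automatically verifiable reward" at the heart of RLVR is often nothing more than a program that checks the model's final answer against ground truth. A minimal sketch (the boxed-answer extraction convention here is an illustrative assumption, not taken from the survey):

```python
# Minimal sketch of a verifiable reward function of the kind RLVR uses:
# instead of a learned reward model, the reward comes from programmatically
# checking the completion's final answer. The \boxed{} extraction
# convention is an illustrative assumption.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final boxed answer matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # unparseable answers earn no reward
    return 1.0 if match.group(1).strip() == ground_truth else 0.0

# Rewards like this plug directly into a policy-gradient update:
# trajectories whose verified reward is 1.0 are reinforced.
print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... the answer is \boxed{41}", "42"))     # 0.0
```

Because the check is deterministic, this kind of reward sidesteps reward-model noise, which is one reason RLVR improves reasoning stability; the open credit-assignment problem is deciding which steps of a long chain deserve that single terminal reward.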
Another high-quality blog from Thinking Machines: making the case for LoRA, no worse than full fine-tuning
机器之心· 2025-09-30 10:38
Core Insights
- The article emphasizes the advantages of LoRA (Low-Rank Adaptation) over full fine-tuning (FullFT) in terms of cost-effectiveness and performance across various training scenarios [2][7][18].

Group 1: Importance of LoRA
- LoRA is a popular parameter-efficient fine-tuning method that updates a low-rank adapter instead of the entire model weights, leading to lower memory requirements and faster loading [11][13].
- The research indicates that LoRA can match FullFT on small to medium-sized datasets, while it may struggle on large datasets due to capacity limitations [14][22].

Group 2: Key Findings
- LoRA's performance is closely tied to the training conditions, including the size of the training dataset and the rank of the LoRA adapter [16][25].
- In reinforcement learning tasks, even rank-1 LoRA performs similarly to FullFT, indicating that reinforcement learning places low demands on capacity [29].

Group 3: Experimental Methodology
- The research used models such as LLaMA 3 and Qwen3, sweeping LoRA ranks from 1 to 512 and scanning learning rates to find optimal training conditions [20][21].
- High-rank LoRA performed almost identically to FullFT on certain datasets, though performance varied across tasks due to training dynamics [22][24].

Group 4: Practical Implications
- LoRA's optimal learning rate is typically about 10 times that of FullFT, so it tolerates higher learning rates under the same conditions [35].
- The study suggests that applying LoRA across all layers, especially MLP and MoE layers, is crucial for achieving performance close to FullFT [37].
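The low-rank update LoRA makes, W' = W + (α/r)·BA with B and A far smaller than W, is easy to see in a few lines. A conceptual NumPy sketch (dimensions, scaling, and zero-initialization follow common LoRA practice; this is not the blog's implementation):

```python
# Conceptual sketch of a LoRA layer: instead of training the full
# d_out x d_in weight matrix W, train a rank-r adapter (B @ A) added on
# top of a frozen W. Illustration only, not Thinking Machines' code.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8    # rank r is much smaller than d
alpha = 16                      # common LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small init
B = np.zeros((d_out, r))                   # trainable, zero-init so the
                                           # adapter starts as a no-op

def lora_forward(x):
    """Forward pass: frozen path plus scaled low-rank adapter path."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
# With B zero-initialized, LoRA output equals the frozen model exactly.
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameter count: the adapter is far smaller than W.
full, adapter = W.size, A.size + B.size
print(f"full: {full}, adapter: {adapter}")  # 262144 vs 8192
```

At rank 8 the adapter holds 32x fewer trainable parameters than W here, which is where the memory savings come from; the article's finding that rank 1 suffices for RL shrinks the adapter further still.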