When Jensen Huang defines storage as the "working memory of AI," how should infrastructure evolve into a new species?
机器之心· 2026-01-20 10:19
Core Insights
- The article discusses the unprecedented demand for DRAM and storage solutions driven by AI computing needs, highlighting a significant structural shortage in the global memory market [2][4]
- XSKY, a company that has evolved into a leader in China's object storage market, is addressing the challenges posed by AI infrastructure through its AIMesh product strategy, which aims to transform data centers into AI factories [5][10]

Group 1: Market Dynamics
- Global DRAM wafer demand is projected to reach approximately 40% of total global DRAM wafer capacity due to agreements between OpenAI and major suppliers such as Samsung and SK Hynix [2]
- Major tech companies, including Microsoft and Google, are actively negotiating for more DRAM and high-bandwidth memory (HBM) supplies to meet their AI needs [2]
- NVIDIA CEO Jensen Huang predicts that the market for AI-related data storage will become one of the largest globally, necessitating a fundamental restructuring of storage technology [3][4]

Group 2: XSKY's Strategic Positioning
- XSKY has achieved over 50% growth in the past three years and has significantly increased its all-flash storage ratio to 35% [8]
- The company has established 280 superclusters with over 10 PB capacity, demonstrating its capability to handle large-scale storage demands [8]
- XSKY's AIMesh strategy focuses on creating a neutral, open data foundation that efficiently turns proprietary data into intelligence [10][36]

Group 3: Technological Innovations
- XSKY's AIMesh solution aims to overcome three major efficiency barriers in AI: the IO wall, the gravity wall, and the memory wall [14][30]
- MeshFS, a parallel file system developed by XSKY, addresses the IO wall by significantly increasing read and write bandwidth [18][22]
- MeshSpace provides a global unstructured-data platform that allows seamless data flow and management across different storage types, enhancing operational efficiency [25][29]

Group 4: Future Outlook
- XSKY emphasizes the importance of a stable data foundation to support rapid advances in computing power, adhering to its "data evergreen" philosophy [36][41]
- The company aims to be a guardian of enterprise data assets while accelerating businesses' AI journeys, ensuring that proprietary data is effectively turned into competitive advantage [38][41]
Starting from plane geometry: how formal verification drives a leap in MLLM reasoning ability
机器之心· 2026-01-20 10:19
On the road to artificial general intelligence (AGI), multimodal large language models (MLLMs) have shown astonishing abilities in visual understanding and text generation, yet they still face a gap that is hard to cross: how to overcome inherent hallucinations and logical breaks in complex mathematical and geometric reasoning? Prevailing outcome-oriented training often masks the fragility of the reasoning process, so models frequently "guess the right answer" while "reasoning the wrong way." This black-box style of learning makes it hard for models to acquire genuinely robust reasoning ability.

Facing this challenge, a team from Shanghai Jiao Tong University, Fudan University, The Chinese University of Hong Kong (Shenzhen), the Shanghai AI Laboratory, and other institutions has proposed a new, systematic solution: "Formal Enhance Informal Reasoning" (using formalization to strengthen informal reasoning). Its core insight: extremely rigorous, verifiable in-domain formal logic can serve as a powerful supervision signal that regularizes and guides the model's reasoning behavior in informal settings. Going further, the study finds that the logical discipline acquired in this rigorous mathematical environment is not confined to geometry problems; it acts as a general key that unlocks out-of-distribution (OOD) generalization on general mathematics and even broader reasoning tasks.

Building on this idea, the team went through three stages of exploration, constructing a complete closed loop from the data layer up to the model layer: ...
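The excerpt above does not detail the training pipeline, but the core idea, using a mechanical checker as a process-level supervision signal rather than rewarding only the final answer, can be illustrated with a toy. The step format, the `verify_step`/`process_reward` names, and the arithmetic-style checker below are all invented for illustration; a real system would use a formal geometry verifier.

```python
def verify_step(step: str, facts: dict) -> bool:
    """Check one step of the form 'name = expr' against facts derived so far."""
    name, expr = (s.strip() for s in step.split("=", 1))
    try:
        value = eval(expr, {"__builtins__": {}}, dict(facts))  # toy checker only
    except Exception:
        return False  # step references something never established
    facts[name] = value
    return True

def process_reward(steps: list[str], answer_name: str, expected: float) -> float:
    """1.0 only if every intermediate step checks out AND the answer matches;
    outcome-only training would reward the final answer regardless of process."""
    facts: dict = {}
    if not all(verify_step(s, facts) for s in steps):
        return 0.0  # a single unverifiable step zeroes the reward
    return 1.0 if abs(facts.get(answer_name, float("nan")) - expected) < 1e-9 else 0.0

# A triangle with two known angles: the third must make the angles sum to 180.
print(process_reward(["A = 50", "B = 60", "C = 180 - A - B"], "C", 70.0))  # 1.0
print(process_reward(["A = 50", "C = 180 - A - B"], "C", 70.0))            # 0.0: B missing
```

The second call fails even though a lucky guess might have produced the right number, which is exactly the behavior outcome-only rewards cannot distinguish.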
Beating GPT and Gemini: MOSI AI (模思智能), a startup incubated by Fudan × 创智, releases a new speech model
机器之心· 2026-01-20 10:19
Core Viewpoint
- The article highlights the breakthrough capabilities of the MOSS-Transcribe-Diarize model developed by MOSI AI, which excels in multi-speaker automatic speech recognition (ASR) and outperforms existing models like GPT-4o and Gemini in complex audio environments [1][2][9]

Group 1: Model Capabilities
- MOSS-Transcribe-Diarize can handle overlapping speech and chaotic dialogue scenarios effectively, demonstrating a significant improvement in transcription accuracy [1][5]
- The model supports a long context window of 128K, allowing it to process audio inputs of up to 90 minutes and showcasing its robustness in complex environments [1][9]
- It achieves state-of-the-art (SOTA) performance across various benchmarks, including the AISHELL-4, Podcast, and Movies datasets, particularly excelling in challenging audio conditions [2][16][19]

Group 2: Technical Innovations
- The model employs a unified end-to-end multimodal architecture that integrates speech recognition, speaker attribution, and timestamp prediction, addressing the classic SATS (Speaker Attribution and Timestamped Speech) challenge [8][12]
- MOSS-Transcribe-Diarize is trained on a combination of real-world dialogue audio and synthetic data, enhancing its robustness against overlapping speech and acoustic variations [13][14]
- The architecture directly outputs text with speaker labels and precise timestamps, using semantic information to improve accuracy [12][14]

Group 3: Competitive Advantage
- In benchmark tests, MOSS-Transcribe-Diarize significantly outperformed competitors like GPT-4o and Gemini 3 Pro on metrics such as character error rate (CER) and optimal-permutation character error rate (cpCER, sketched below), particularly on long audio inputs [16][19]
- The model maintains speaker consistency in long dialogues, reducing performance degradation caused by speaker-attribution errors [16]
- It demonstrates superior performance across scenarios including real-world meetings, podcasts, and complex film dialogue, proving its versatility and effectiveness [19][21]

Group 4: Future Directions
- MOSI AI aims to continue advancing multimodal intelligence, focusing on enabling AI to understand complex real-world contexts and achieve natural, coherent, and reliable interactions [24]
- The company's strategic vision is to develop technologies that enhance real-time dialogue interaction and robust speech understanding, positioning itself as a leader in the AI field [24]
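The summary leans on cpCER without unpacking it. Below is a minimal sketch of the metric as it is commonly defined for multi-speaker ASR (concatenate each speaker's text, then take the speaker permutation that minimizes total character errors); the exact protocol used in the article's benchmarks may differ.

```python
from itertools import permutations

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cpcer(ref: dict[str, str], hyp: dict[str, str]) -> float:
    """ref/hyp map speaker id -> that speaker's concatenated transcript.
    Brute force over permutations: fine for the handful of speakers typical
    of a meeting, factorial in the speaker count in general."""
    ref_texts, hyp_texts = list(ref.values()), list(hyp.values())
    n = max(len(ref_texts), len(hyp_texts))
    ref_texts += [""] * (n - len(ref_texts))   # pad so both sides line up
    hyp_texts += [""] * (n - len(hyp_texts))
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_texts, perm))
        for perm in permutations(hyp_texts)
    )
    return best / max(sum(len(t) for t in ref.values()), 1)

ref = {"spk1": "大家好", "spk2": "今天天气不错"}
hyp = {"A": "今天天气不错", "B": "大家好啊"}   # speakers swapped, one extra char
print(f"cpCER = {cpcer(ref, hyp):.3f}")        # 1 error / 9 ref chars ≈ 0.111
```

Because the best permutation is taken, a system is not penalized for labeling speakers in a different order, only for attributing words to the wrong person or transcribing them incorrectly.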
EmbodiChain goes open source: automatically training embodied AI models on 100% generative data
机器之心· 2026-01-20 07:16
Published by 机器之心

- Paper: https://www.techrxiv.org/doi/full/10.36227/techrxiv.176153394.41323502
- Project homepage: https://dexforce.com/embodichain/index.html#/
- Code repository: https://github.com/DexForce/EmbodiChain
- Documentation: https://dexforce.github.io/EmbodiChain/introduction.html

The explosion of large language models let everyone witness the power of scaling laws: with enough data and enough compute, intelligence seems to emerge on its own. In robotics, however, this formula seems to break down.

Unlike the trillions of text tokens readily available on the internet, the high-quality, 3D-calibrated, physically consistent interaction data that robots need is extremely scarce and expensive. This is exactly why data-collection paradigms have become the absolute focus of industry research in recent years.

The whole industry is pushing at full speed toward lower cost and greater convenience: from expensive teleoperation rigs, to dexterous-hand capture based on motion-capture gloves and more portable gripper-style solutions, to recent approaches that need no gloves at all and collect data purely from bare-handed human demonstrations. These lightweight data-collection ...
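As a loose illustration of what "100% generative data" can mean in practice, here is a generic synthetic-data loop: randomize a simulated scene, let a privileged scripted expert act, and record perfectly labeled (observation, action) pairs. Every name and number below is invented for the sketch; EmbodiChain's actual pipeline is documented at the links above.

```python
import random

def randomize_scene(rng: random.Random) -> dict:
    """Sample object pose and physical properties (domain randomization)."""
    return {
        "object_pos": [rng.uniform(-0.3, 0.3), rng.uniform(0.2, 0.6), 0.0],
        "friction": rng.uniform(0.4, 1.2),
        "light_intensity": rng.uniform(0.5, 1.5),
    }

def scripted_expert(scene: dict, step: int) -> list[float]:
    """Privileged expert: moves toward the object's known ground-truth pose."""
    x, y, _ = scene["object_pos"]
    alpha = min(step / 20.0, 1.0)           # simple linear approach profile
    return [alpha * x, alpha * y, 0.1 * (1 - alpha)]

def generate_episode(seed: int, horizon: int = 20) -> list[dict]:
    rng = random.Random(seed)
    scene = randomize_scene(rng)
    episode = []
    for t in range(horizon):
        episode.append({
            "obs": {"scene": scene, "t": t},   # a renderer would go here
            "action": scripted_expert(scene, t),
        })
    return episode

# Every sample is generated; no teleoperation or human demonstration needed.
dataset = [generate_episode(seed) for seed in range(100)]
print(len(dataset), "episodes,", sum(len(e) for e in dataset), "labeled steps")
```

The appeal over the collection hardware described above is that labels are free and exact by construction; the open question such pipelines must answer is whether the randomized distribution covers reality well enough to transfer.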
TPAMI | DC-SAM: breaking SAM's interaction limits with a cycle-consistency-based method for in-context segmentation in images and videos
机器之心· 2026-01-20 04:51
In-context segmentation aims to guide a model to automatically segment specific targets from reference examples. Although SAM, with its excellent zero-shot generalization, provides a strong foundation for this, applying it here is still limited by prompt construction (e.g., points or boxes): this requirement not only constrains the automation efficiency of batch inference, but also makes it hard for the model to maintain spatio-temporal consistency on complex, continuous video.

The IEEE TPAMI paper "DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency," from Beijing University of Posts and Telecommunications, Nanyang Technological University, and other institutions, not only establishes DC-SAM, a unified and efficient framework for in-context segmentation of images and videos, but also builds IC-VOS, the first video in-context segmentation benchmark.

The team proposes a "cycle consistency" mechanism built on prompt tuning: positive and negative dual branches cooperate with cycle-consistent attention and, together with a Mask-Tube strategy, adapt SAM and SAM2 to image and video in-context segmentation in a unified and efficient way.

Experiments show that DC-SAM achieves SOTA performance across multiple benchmarks: 55.5 mIoU on COCO-20, and on Pascal-5 ...
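The paper's cycle-consistent attention operates inside prompt tuning, but the underlying intuition, that a good prompt should survive a forward-backward matching round trip, can be shown with a toy feature-matching version. Everything below (the cosine matcher, the synthetic features) is an invented stand-in, not DC-SAM's architecture.

```python
import numpy as np

def best_match(src_feat: np.ndarray, dst_feats: np.ndarray) -> int:
    """Index of the most similar row of dst_feats (cosine similarity)."""
    sims = dst_feats @ src_feat / (
        np.linalg.norm(dst_feats, axis=1) * np.linalg.norm(src_feat) + 1e-8)
    return int(np.argmax(sims))

def cycle_consistent_prompts(ref_feats, ref_mask, qry_feats):
    """ref_feats/qry_feats: (N, D) per-location features; ref_mask: (N,) bool.
    Keep a query location only if matching it back lands inside the mask."""
    prompts = set()
    for i in np.flatnonzero(ref_mask):
        j = best_match(ref_feats[i], qry_feats)       # forward: ref -> query
        i_back = best_match(qry_feats[j], ref_feats)  # backward: query -> ref
        if ref_mask[i_back]:                          # cycle closes on target
            prompts.add(j)
    return sorted(prompts)

rng = np.random.default_rng(0)
D = 16
fg = rng.normal(1.0, 0.1, (4, D))    # features of the target object
bg = rng.normal(-1.0, 0.1, (4, D))   # background features
ref_feats = np.vstack([fg, bg])      # reference: object first, then background
qry_feats = np.vstack([bg, fg])      # query: same object, shuffled layout
ref_mask = np.array([True] * 4 + [False] * 4)
print(cycle_consistent_prompts(ref_feats, ref_mask, qry_feats))  # indices in 4..7
```

The cycle check filters out spurious matches automatically, which is the property that removes the need for a human to click points or draw boxes for each new image.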
25 million matches in a week: these "AI bots" have human players losing their cool
机器之心· 2026-01-20 04:51
Core Insights
- The article discusses the innovative integration of AI in the gaming industry, specifically highlighting the success of Giant Network's game "Supernatural Action Team," which has achieved over 1 million concurrent users and dominated app-store rankings [1][4][8]
- The game features AI-controlled characters that interact with players in real time, creating a dynamic and unpredictable gaming experience that contrasts with traditional scripted NPCs [3][12][14]
- The implementation of AI in core gameplay has proven to enhance player engagement and satisfaction, challenging the notion that AI could negatively impact gaming experiences [23][24]

AI Integration in Gaming
- "Supernatural Action Team" is the first major game in China to deeply integrate AI models into its core gameplay, allowing AI to act as both ally and adversary [8][12]
- The AI characters are designed to adapt their behavior based on player actions, creating a more immersive and strategic gaming environment [13][16]
- The game successfully ran nearly 25 million matches within a week of the new AI feature's launch, demonstrating the system's stability and scalability [7][21]

Industry Trends and Challenges
- The gaming industry is increasingly welcoming AI, with applications ranging from asset generation to real-time matchmaking, but most implementations remain in a "safe zone" where AI does not directly affect core gameplay [10][11]
- Concerns about AI's impact on player experience have led many developers to hesitate to integrate AI into core gameplay, with a significant percentage of players expressing skepticism about AI's role in gaming [22][23]
- "Supernatural Action Team" provides a counterexample, showing that when AI is effectively integrated into gameplay, it can enhance the overall experience and be well received by players [24]

Future Implications
- The successful integration of AI in "Supernatural Action Team" may pave the way for new content-generation methods in gaming, where AI becomes a fundamental part of gameplay rather than just a tool [24]
- The article suggests that the future of gaming may involve more frequent and interactive AI-driven experiences, opening new avenues for player engagement and industry growth [24]
Robots finally "understand" housework! Berkeley's MomaGraph lets robots do chores like a human
机器之心· 2026-01-19 08:54
Picture an everyday scene: you tell your household robot to "boil a kettle of water," and it freezes on the spot. Where is the kettle? Tap water or filtered water? Plug it in first, or press the switch first? And how do you tell when the water has boiled? Chores that come as naturally to humans as breathing have been a huge challenge for past robots: they forget to plug the kettle in, fail to find it, or even mistake a cabinet-door handle for a switch and mash it wildly.

MomaGraph, recently introduced by UC Berkeley and the University of Maryland, aims to put an end to these "artificial stupidity" moments in housework. The algorithm not only lets robots genuinely understand the order in which steps must be done, but has also completed real household chores on 星动纪元's 星动 Q5 robot: opening cabinets, opening a microwave, turning on a TV, and switching off lights.

1. Research background: the three sticking points that keep household robots from doing chores well

Household mobile-manipulation robots (the kind that open a window or warm milk for you) must simultaneously "watch the road" (navigation) and "use their hands" (manipulation), but past techniques have suffered from three key sticking points that keep robots from doing chores well:

Sticking point 1: knowing "where," but not "how to use." To open a window, traditional methods may only know that "the window is to the right of the desk" (a spatial relation), not that "the window handle controls opening and closing" (a functional relation). It is like knowing your phone is in your pocket without knowing that the power button turns it on: you still cannot use the phone.

Sticking point 2: recognizing "pictures," but not "changes." Traditional models treat the scene as ...
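A scene graph that carries functional edges ("the switch activates the kettle") alongside spatial ones, plus ordering constraints over actions, is enough to make "plug in before pressing the switch" fall out of planning rather than being hard-coded. The sketch below is a hypothetical illustration of that idea with invented relations and action names, not MomaGraph's actual representation.

```python
from graphlib import TopologicalSorter

# The graph's two edge types: where things are, and what controls what.
spatial = {("kettle", "on", "counter"), ("outlet", "beside", "counter")}
functional = {
    ("power_switch", "activates", "kettle"),
    ("plug", "powers", "kettle"),
    ("lid", "opens", "kettle"),
}

# Action preconditions read off the functional edges (hand-written here):
# each action maps to the actions that must already have happened.
preconditions = {
    "open_lid": set(),
    "fill_kettle": {"open_lid"},
    "plug_in": set(),
    "press_switch": {"plug_in", "fill_kettle"},  # no dry or unpowered boiling
    "wait_for_boil": {"press_switch"},
}

print(len(spatial), "spatial edges,", len(functional), "functional edges")
plan = list(TopologicalSorter(preconditions).static_order())
print(plan)  # e.g. ['open_lid', 'plug_in', 'fill_kettle', 'press_switch', 'wait_for_boil']
```

A purely spatial representation (sticking point 1 above) has no `activates`/`powers` edges, so no precondition graph can be derived from it, which is exactly why the robot ends up mashing the cabinet handle.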
Whether reviewers use AI is up to the authors? ICML 2026's new review policy is out
机器之心· 2026-01-19 08:54
Core Viewpoint
- ICML 2026 has introduced a new review-type selection mechanism allowing authors to decide whether to permit the use of large language models (LLMs) in the review process [3][9]

Group 1: Review Policy Changes
- Two policies have been established: Policy A strictly prohibits the use of any LLMs during the review process, while Policy B allows their use with specific restrictions [4]
- Allowed actions under Policy B include using LLMs to assist in understanding the paper, polishing the language of review comments, and querying LLMs about the paper's strengths or weaknesses [7][9]
- The choice of whether to allow LLMs in the review process is now in the hands of authors, a significant shift from previous practice, where the decision was effectively up to reviewers [9]

Group 2: Implementation Challenges
- There are concerns about enforcing the new rules on LLM usage, as past conferences have seen a prevalence of AI-generated reviews [11][13]
- A study of ICLR 2026 revealed that 21% of review comments were entirely generated by AI, indicating widespread reliance on AI tools in the review process [11]
- Since reviewer compliance cannot be guaranteed, the effectiveness of ICML's new rules may be limited, raising questions about the integrity of the review process [14][15]

Group 3: Author Control and Options
- Authors now have the option to refuse LLM-assisted reviews, a blanket opt-out that may address concerns about trust in the review process [16]
He rented eight H100s and successfully reproduced DeepSeek's mHC; the results are even more striking than the official report
机器之心· 2026-01-19 08:54
Core Insights
- DeepSeek's mHC architecture addresses numerical instability and signal explosion in large-scale training by extending the Transformer's traditional residual connection into a multi-stream parallel architecture [1][5]
- The mHC model has garnered significant attention in the AI community, with a successful reproduction yielding even better results than the original DeepSeek paper [5][6]

Group 1: mHC Architecture
- mHC uses the Sinkhorn-Knopp algorithm (sketched below) to constrain the connection matrix to the manifold of doubly stochastic matrices, ensuring stability during training [1][25]
- Traditional residual connections in Transformers have remained unchanged since 2016 and rely on a single information stream; mHC introduces multiple parallel streams for enhanced expressiveness [9][14]
- The mHC architecture maintains stability by preventing signal amplification, which can lead to catastrophic failures in large models [20][28]

Group 2: Experimental Results
- In experiments at 10M parameters, the original hyper-connection (HC) model exhibited a signal amplification of 9.2x, while mHC stayed stable at 1.0x [36][61]
- Scaled to 1.7B parameters, the HC model showed an alarming amplification of 10,924x, highlighting the instability of unconstrained connections in larger models [54][66]
- Across training conditions, HC models accumulate instability while mHC models consistently maintain structural integrity [70][71]

Group 3: Implications and Future Directions
- The findings suggest that while traditional residual connections are stable, they may not be optimal for larger models; mHC offers a balance between expressiveness and stability [57][58]
- Future research aims to explore scaling laws further, particularly at the 10B-parameter scale, where significant amplification trends are anticipated [101]
- The mHC approach not only mitigates instability but also eliminates the risk of catastrophic failures in large-scale training scenarios [93][96]
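Sinkhorn-Knopp itself is simple enough to show in a few lines: alternately normalize the rows and columns of a strictly positive matrix until both sum to one. The sketch below (plain NumPy, invented dimensions, not DeepSeek's mHC code) also shows why the constraint matters: a doubly stochastic mixing matrix has operator norm at most 1, so it cannot amplify the streams it mixes, while an unconstrained positive matrix can.

```python
import numpy as np

def sinkhorn_knopp(logits: np.ndarray, iters: int = 50) -> np.ndarray:
    """Project exp(logits) toward the doubly stochastic manifold by
    alternating row and column normalization."""
    M = np.exp(logits)                          # strictly positive entries
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)       # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)       # columns sum to 1
    return M

rng = np.random.default_rng(0)
n_streams, width = 4, 8
H = rng.normal(size=(n_streams, width))         # n parallel residual streams
raw = rng.normal(size=(n_streams, n_streams))   # learned connection logits

M = sinkhorn_knopp(raw)
print(np.allclose(M.sum(1), 1, atol=1e-3), np.allclose(M.sum(0), 1, atol=1e-3))

# A doubly stochastic mix preserves scale; an unconstrained one need not.
print(f"gain (doubly stochastic): {np.linalg.norm(M @ H) / np.linalg.norm(H):.2f}")
print(f"gain (unconstrained):     {np.linalg.norm(np.exp(raw) @ H) / np.linalg.norm(H):.2f}")
```

By the Birkhoff-von Neumann theorem, a doubly stochastic matrix is a convex combination of permutation matrices, each of norm 1, which is the mathematical reason the 9.2x and 10,924x blow-ups reported for unconstrained HC cannot occur under the projection.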
Letting robots learn manipulation skills by watching videos: the newly released CLAP framework from Tsinghua and collaborators pulls it off
机器之心· 2026-01-19 03:51
Core Insights
- The article discusses the introduction of the Contrastive Latent Action Pretraining (CLAP) framework, developed by Tsinghua University in collaboration with Stardust Intelligence, HKU, and MIT, which enables robots to learn skills directly from videos [2][3]

Group 1: Challenges in Robot Learning
- The article highlights a long-standing issue in robot learning known as "data scarcity": human behavior videos are abundant online, but data suitable for training robots is not [3]
- The root cause of this data asymmetry is the high cost and inefficiency of collecting robot operation data, which requires expensive hardware, specialized environments, and extensive manual labeling [3]
- Traditional latent action models face a "visual entanglement" problem, learning irrelevant visual noise instead of actual manipulation skills [3]

Group 2: Innovations of the CLAP Framework
- CLAP addresses the technical bottleneck of aligning the motion space extracted from videos with the robot's action space, effectively avoiding visual entanglement [3]
- Using contrastive learning, CLAP maps state transitions in videos to a quantifiable, physically executable action codebook (see the sketch below) [3]
- The framework allows robots to learn skills from the vast video data available on platforms like YouTube and Douyin, significantly expanding the scale of usable training data [4]

Group 3: Training Methodology
- The team trained CLAP under two modeling paradigms: CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy aimed at high-frequency, precise control [4][10]
- The framework employs a knowledge-matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning, ensuring that robots retain previously learned skills while acquiring new ones [4][10]

Group 4: Practical Implications
- CLAP's long-term value lies not only in its technical innovation but also in its potential to accelerate the industrialization of robotics by reducing the cost and time required to deploy robots in sectors such as services and manufacturing [6]
- The unified vision-language-action (VLA) framework effectively integrates the precision of machine-collected data with the semantic diversity of large-scale unannotated human video demonstrations [8]

Group 5: Experimental Results
- Extensive experiments demonstrate that CLAP significantly outperforms strong baseline methods, enabling effective skill transfer from human videos to robot execution [12]
- Performance comparisons on real-world tasks show that CLAP-NTP and CLAP-RF achieve higher success rates than baseline methods across a range of tasks, indicating the framework's robustness and effectiveness [14][15]
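The summary's middle steps, embedding state transitions, quantizing them into an action codebook, and aligning them contrastively with robot actions, follow a recognizable pattern. The sketch below is a toy NumPy rendering of that pattern with invented shapes and names; CLAP's real encoders are learned networks, and its actual training objective is described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, K = 8, 32, 16          # batch, embedding dim, codebook size

def embed_transition(frame_t, frame_t1):
    """Stand-in encoder: embed the *change* between consecutive frames,
    so static appearance (the 'visual entanglement' culprit) cancels out."""
    return frame_t1 - frame_t            # a real model uses a learned network

def quantize(z, codebook):
    """Nearest-neighbor lookup into the latent action codebook."""
    idx = np.argmin(((z[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    return codebook[idx], idx

def info_nce(z_video, z_action, tau=0.1):
    """Contrastive loss: matched (video, action) pairs score high,
    mismatched pairs within the batch act as negatives."""
    z_v = z_video / np.linalg.norm(z_video, axis=1, keepdims=True)
    z_a = z_action / np.linalg.norm(z_action, axis=1, keepdims=True)
    logits = z_v @ z_a.T / tau                     # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # diagonal = positive pairs

frames_t  = rng.normal(size=(B, D))
frames_t1 = frames_t + rng.normal(scale=0.1, size=(B, D))  # small motions
codebook  = rng.normal(size=(K, D))

z = embed_transition(frames_t, frames_t1)
z_q, codes = quantize(z, codebook)
actions = z_q + rng.normal(scale=0.01, size=(B, D))  # paired robot actions
print(f"codes: {codes.tolist()}  InfoNCE loss: {info_nce(z_q, actions):.3f}")
```

The key design point the toy preserves is that supervision comes from pairing, not labels: a transition only has to land near its own action and away from the other actions in the batch, which is what lets unannotated web video contribute training signal.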