WAIC Special Video Program "AI Face-to-Face": Tell Your "Hot AI" Story
机器之心· 2025-07-07 09:30
Opening a new window for communication: telling AI companies' "technology stories" well.

With large models, multimodality, and agents evolving by the day, artificial intelligence is shifting from technical breakthroughs toward deeper application, moving out of the laboratory and into every link of the industrial chain. Yet making technical achievements "visible" does not mean their value is "understood." Companies face not only the challenge of deploying technology, but also that of turning their "hard strength" into a "good story" that users, partners, and investors can all understand, told in a professional, credible, and influential way.

As the intelligence wave sweeps the globe, the 2025 World Artificial Intelligence Conference (WAIC) will be held from July 26 to 29 at the Shanghai World Expo Center and the World Expo Exhibition Hall. Under the theme "Intelligent Era, Shared Future for the Globe," the conference gathers the world's top research results, industry leaders, and frontier perspectives, serving as a key junction of AI technology, industry, and governance.

As one of the most influential gatherings in global AI, WAIC is not only a window for exchanging frontier technology but also a prime stage for companies to present their strength and vision to the world. Seizing this opportunity, 机器之心 will launch the special video program "AI Face-to-Face" during the conference, inviting representative pioneering AI companies to jointly create brand content with both depth and impact.

Leverage the conference's influence: a WAIC special program tied to a high-momentum platform. Flexible content formats: company interviews and exhibition-hall tours in parallel, meeting companies' ...
Four Months After Release, Claude Code Already Has 115,000 Users; Developers: $200/Month Isn't Expensive
机器之心· 2025-07-07 09:30
Core Viewpoint
- The article discusses the significant productivity improvements that AI models, particularly Claude Code, are bringing to developers, and developers' willingness to pay for these tools in exchange for time savings [1].

Group 1: AI Model Performance
- Claude Code attracted 115,000 developers and processed 195 million lines of code within four months of its release [2].
- Based on current user engagement, Claude Code could generate annual revenue of $130 million for Anthropic [3].
- Each developer is estimated to contribute over $1,000 annually, indicating a high-value, sticky user base [5].

Group 2: User Experience and Feedback
- User feedback highlights Claude Code's strong performance in understanding project architecture and generating contextually relevant code suggestions [10].
- Developers appreciate the integrated development-environment features, which streamline workflows by allowing direct document browsing and command execution [9].
- Despite some challenges with larger codebases, developers find the tool's overall value justifies the cost [13].

Group 3: Competitive Landscape
- Users report that Claude Code feels more advanced than tools such as Cursor, attributing this to its development by the model's creators [22].
- The growing acceptance of AI-assisted programming tools suggests adoption is moving beyond entry-level users [23].
- Challenges such as code-quality control, security vulnerabilities, and intellectual-property issues remain, but Claude Code has demonstrated clear gains in development efficiency [25].
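The per-developer figure follows directly from the two reported numbers; a quick sanity check, assuming the $130 million projection and the 115,000-user count are the relevant quantities:

```python
# Sanity-check the per-developer revenue implied by the reported figures.
annual_revenue = 130_000_000  # projected annual revenue for Anthropic (USD)
developers = 115_000          # Claude Code users after four months

revenue_per_developer = annual_revenue / developers
print(round(revenue_per_developer, 2))  # ≈ 1130.43 USD per developer per year

# Consistent with the article's claim of "over $1,000 annually" per developer.
assert revenue_per_developer > 1000
```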
New in the RoboTwin Series: An Open-Source Large-Scale Domain-Randomized Bimanual Manipulation Data Synthesizer and Benchmark Suite
机器之心· 2025-07-07 07:50
Core Viewpoint
- The article discusses the release of RoboTwin 2.0, a scalable data generator and benchmark for robust bimanual robotic manipulation, highlighting its advancements over RoboTwin 1.0 and its applications in dual-arm collaboration tasks [5][34].

Group 1: Introduction and Background
- RoboTwin 2.0 is developed by researchers from Shanghai Jiao Tong University and the University of Hong Kong, focusing on overcoming limitations in data collection and simulation for dual-arm robotic operations [6][8].
- The RoboTwin series has received recognition at major conferences, including CVPR and ECCV, and has been used in various competitions [3][9].

Group 2: Features of RoboTwin 2.0
- RoboTwin 2.0 introduces a large-scale domain-randomization data-synthesis framework, including a dataset of 731 object instances across 147 categories, enhancing model robustness in unseen environments [8][12].
- The system provides a more user-friendly API for expert code generation, significantly lowering the barrier to using large multimodal models [10][34].

Group 3: Domain Randomization Strategies
- The article outlines five key dimensions of domain randomization in RoboTwin 2.0: scene clutter, background textures, lighting conditions, tabletop heights, and diverse language instructions [16][18][20][21][22].
- These strategies improve the model's adaptability and real-world performance by exposing it to a wide variety of training conditions [16][34].

Group 4: Performance Metrics
- RoboTwin 2.0 shows significant improvements over RoboTwin 1.0, with the average success rate (ASR) on typical tasks rising from 47.4% to 62.1%, and further gains with structured feedback [26][27].
- Its adaptive grasping capabilities improve the average success rate by 8.3% across five robotic platforms [28].

Group 5: Real-World Application and Transferability
- The system exhibits strong zero-shot transfer, achieving notable success rates on unseen tasks and in complex environments, indicating its potential for real-world applications [31][33].
- The results highlight RoboTwin 2.0's comprehensive advantages in code generation, grasping expansion, environmental robustness, and sim-to-real transfer, providing a solid foundation for future dual-arm manipulation research [34].
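The five randomization dimensions can be pictured as a per-episode sampling step. This is a minimal illustrative sketch, not RoboTwin's actual API; the function name and all parameter ranges are assumptions:

```python
import random

# Hypothetical sketch: sample one randomized episode configuration along the
# five dimensions described for RoboTwin 2.0. Ranges and choices are
# illustrative assumptions, not values from the paper.
def sample_episode_config():
    return {
        "clutter_objects": random.randint(0, 8),           # scene clutter
        "background_texture": random.choice(
            ["wood", "marble", "fabric", "metal"]),        # background textures
        "light_intensity": random.uniform(0.3, 1.5),       # lighting conditions
        "tabletop_height_m": random.uniform(0.70, 0.90),   # tabletop height
        "instruction": random.choice([                     # diverse language
            "put the mug on the left tray",
            "hand the mug to the other arm, then set it on the tray",
        ]),
    }

config = sample_episode_config()
print(config)
```

Sampling a fresh configuration per episode is what exposes the policy to the wide variety of training conditions the article describes.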
A New Benchmark-Setter for Open-Source Agents: Tongyi's WebSailor Tops Multiple Leaderboards, Taking On OpenAI's Hard Agent Benchmark BrowseComp
机器之心· 2025-07-07 07:50
Core Viewpoint
- The article discusses the limitations of open-source Web Agents on complex information-retrieval tasks relative to proprietary systems, and introduces WebSailor as a breakthrough for reasoning in high-uncertainty tasks [2][19].

Group 1: Background
- In the era of information overload, traditional search engines struggle to meet users' needs for deep, multi-step information retrieval [2].
- Open-source models have performed poorly on complex tasks like BrowseComp, with accuracy near zero, indicating a lack of effective reasoning patterns [2][3].

Group 2: Technical Innovations
- WebSailor combines challenging training tasks with efficient training strategies, including the SailorFog-QA dataset and an innovative reasoning-trajectory reconstruction method [7][10].
- Classifying information-retrieval tasks into three levels of uncertainty clarifies the challenges open-source models face [8][10].
- A complex knowledge graph built through random walks in real web environments ensures the training data reflects real-world complexity [11][13].

Group 3: Experimental Results
- WebSailor outperformed various open-source and proprietary models across multiple benchmarks, excelling in particular on the challenging BrowseComp tasks [19][21].
- The model also remained effective on simpler tasks, showing efficiency and adaptability beyond high-complexity scenarios [22].

Group 4: Conclusion and Future Outlook
- WebSailor aims to close the gap between open-source and top-tier proprietary systems on complex information retrieval, emphasizing innovative training methodology over sheer model size [26][27].
- Future research directions include addressing context-length limitations and exploring asynchronous reinforcement-learning frameworks to improve training efficiency [28].
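The random-walk construction can be sketched as follows. The toy graph and walk procedure are illustrative assumptions only, not the SailorFog-QA pipeline itself:

```python
import random

# Toy knowledge graph: entity -> list of (relation, entity) edges. In the
# described pipeline the graph is built from real web pages; here it is fixed.
graph = {
    "A": [("founded_by", "B"), ("located_in", "C")],
    "B": [("studied_at", "D")],
    "C": [("capital_of", "E")],
    "D": [("located_in", "E")],
}

def random_walk(graph, start, max_hops=3):
    """Walk the graph to collect a multi-hop chain of linked facts."""
    path, node = [], start
    for _ in range(max_hops):
        edges = graph.get(node)
        if not edges:
            break
        relation, nxt = random.choice(edges)
        path.append((node, relation, nxt))
        node = nxt
    return path

# Each multi-hop chain can seed one high-uncertainty question: the answer is
# the final entity, with intermediate entities obfuscated in the question text.
chain = random_walk(graph, "A")
print(chain)
```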
Redrawing the Boundaries of AI Memory: MemOS Goes Open Source, with Temporal Reasoning 159% Better Than OpenAI's
机器之心· 2025-07-07 04:48
Released via 机器之心 (editorial department)

Memory management and optimization frameworks for large models are a direction that major vendors are now racing to improve. Compared with OpenAI's existing global memory, MemOS shows significant gains on large-model memory benchmarks: average accuracy improves by more than 38.97% while token overhead drops by a further 60.95%, making it the state-of-the-art memory-management framework. On temporal reasoning tasks, which stress a framework's temporal modeling and retrieval abilities, the improvement reaches a striking 159%.

Figure 1. Performance reported on the MemOS project website.

In the years of breakneck LLM progress, parameter scale and compute have almost become synonyms for AI capability. But as large models move into research, industry, and daily life, everyone is asking a deeper question: can the model actually "remember" anything? For companion-style dialogue, personalized recommendation, or multi-turn task collaboration, a single pass of inference and retrieval is far from enough. Giving AI long-term memory that is manageable, transferable, and shareable is becoming a key challenge for the next generation of large-model applications.

Recently, 记忆张量 (Shanghai) Technology Co., Ltd., together with top teams from Shanghai Jiao Tong University, Renmin University of China, Tongji University, Zhejiang University, China Telecom, and others, released MemOS (Memory Operating System), an industrial-grade memory operating system for large models. Its ...
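As a rough illustration of what "manageable, transferable, and shareable long-term memory" means in practice, here is a minimal keyword-scored memory store. It is a sketch of the general idea only; MemOS's actual interfaces are not described in the excerpt, and every name below is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Minimal long-term memory: write entries, retrieve by tag overlap."""
    entries: list = field(default_factory=list)

    def write(self, text: str, tags: set):
        self.entries.append((text, tags))

    def retrieve(self, query_tags: set, k: int = 3):
        # Score each entry by how many tags it shares with the query.
        scored = [(len(tags & query_tags), text) for text, tags in self.entries]
        scored.sort(reverse=True)
        return [text for score, text in scored[:k] if score > 0]

mem = MemoryStore()
mem.write("User prefers concise answers", {"user", "style"})
mem.write("Project deadline is Friday", {"project", "deadline"})
print(mem.retrieve({"deadline", "project"}))  # → ['Project deadline is Friday']
```

A persistent store like this can be carried across sessions (transferable) and exposed to multiple agents (shareable), which single-pass inference cannot provide.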
A New Paradigm Arrives: Energy-Based Models Break the Transformer++ Scaling Ceiling, Scaling 35% Faster in Training
机器之心· 2025-07-07 04:48
Core Insights
- The article discusses Energy-Based Transformers (EBTs), which learn to "think" through unsupervised learning, giving models reasoning capabilities akin to human System 2 thinking [9][10].

Group 1: System 2 Thinking and Model Development
- Human thinking is commonly divided into System 1 (fast) and System 2 (slow), with the latter crucial for complex tasks [3][4].
- Current large language models excel at System 1 tasks but struggle with System 2 tasks, prompting research into enhancing System 2 reasoning [4][5].
- EBTs assign energy values to input-prediction pairs and optimize candidate predictions by gradient descent, simulating a thinking process [9][10].

Group 2: Performance and Scalability
- EBTs scale up to 35% faster during training than the mainstream Transformer++ recipe across metrics such as data volume and model depth [11].
- On reasoning tasks, EBTs improve on Transformer++ by 29% on language tasks, performing better as more computation is spent [12].
- EBTs also excel at image denoising, requiring fewer forward passes than diffusion Transformers while achieving better results [13].

Group 3: Generalization and Robustness
- EBTs generalize better, particularly on out-of-distribution data, outperforming existing models even with similar or worse pre-training performance [14].
- EBTs can learn and express uncertainty in their predictions, effectively capturing how difficult each token is to predict [62][65].
- EBT performance improves linearly as distribution shift increases, underscoring their value for cross-distribution generalization [68][69].

Group 4: Experimental Results and Comparisons
- EBTs outperform Transformer++ on scalability metrics including data efficiency and computational efficiency, suggesting they will excel in large-scale training scenarios [46][72].
- Despite slightly higher pre-training perplexity, EBTs achieve lower perplexity on downstream tasks, indicating stronger generalization [74].
- In image denoising, EBTs significantly outperform DiT models, achieving better peak signal-to-noise ratio (PSNR) with 99% fewer forward passes [81][92].
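The "thinking as optimization" mechanism — score an (input, candidate prediction) pair with an energy function, then refine the candidate by gradient descent on that energy — can be sketched with a toy quadratic energy. This illustrates only the mechanism; in an EBT the energy is a learned network, and everything concrete here is an assumption:

```python
# Toy EBT-style refinement: the energy is low when prediction y matches a
# target determined by input x. Here the energy is a fixed quadratic.
def energy(x, y):
    target = 2.0 * x + 1.0          # stand-in for the learned low-energy answer
    return 0.5 * (y - target) ** 2

def grad_energy(x, y):              # dE/dy for the toy energy above
    return y - (2.0 * x + 1.0)

x, y = 3.0, 0.0                     # input and an arbitrary initial guess
for _ in range(50):                 # more descent steps = more "System 2" compute
    y -= 0.2 * grad_energy(x, y)

print(round(y, 4))                  # converges toward the energy minimum at y = 7.0
```

Spending more descent steps buys a better answer, which is the sense in which inference-time computation scales with "thinking".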
Complex Spatial Instructions Understood in a Flash? RoboRefer Lets Robots Understand and Reason About Space, Acting Precisely Even in the Open World
机器之心· 2025-07-06 06:06
Core Viewpoint
- The article discusses the development and capabilities of RoboRefer, a multimodal large model designed for spatial referring tasks in robotics, emphasizing its advanced spatial understanding and reasoning abilities.

Group 1: RoboRefer Model Overview
- RoboRefer is a multimodal large model with three-dimensional spatial understanding and reasoning capabilities, featuring independent image and depth encoders [12].
- The model can accurately answer various spatial-perception questions and perform complex combinatorial reasoning over multiple spatial relationships [12][13].

Group 2: Training Techniques
- RoboRefer uses full-parameter supervised fine-tuning (SFT) to strengthen spatial perception and reinforcement fine-tuning (RFT) to improve generalizable reasoning [15][16].
- Training includes a process-based reward function that improves the quality of intermediate reasoning steps, yielding better multi-step reasoning [17].

Group 3: Performance Metrics
- After SFT, RoboRefer achieved an average success rate of 89.6% on spatial understanding tasks, setting a new state of the art [21].
- On the high-difficulty spatial referring benchmark RefSpatial-Bench, RFT-trained RoboRefer outperformed all other models, surpassing Gemini-2.5-Pro by 17.4% in average accuracy [22].

Group 4: Dataset Development
- The research team created RefSpatial, a large-scale, high-quality dataset with 2.5 million samples and 20 million question-answer pairs, significantly larger than comparable datasets [20].
- RefSpatial features detailed multi-step reasoning processes, covers a wide range of everyday interaction scenarios, and integrates 31 types of spatial relationships [20].

Group 5: Real-World Application
- RoboRefer can be flexibly integrated into various robots, such as UR5 robotic arms and G1 humanoid robots, enabling precise execution of complex, dynamic, multi-step tasks in real-world environments [9].
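A process-based reward — scoring intermediate reasoning steps rather than only the final answer — can be sketched as a weighted combination. The weights and the step-scoring rule below are illustrative assumptions, not RoboRefer's actual reward:

```python
def process_based_reward(step_scores, outcome_correct,
                         step_weight=0.5, outcome_weight=0.5):
    """Combine per-step quality scores (each in [0, 1]) with the final outcome.

    Rewarding intermediate steps gives the policy a denser training signal
    than an outcome-only reward, which is the stated motivation here.
    """
    step_term = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return step_weight * step_term + outcome_weight * float(outcome_correct)

# A trajectory with good intermediate reasoning but a wrong final answer
# still earns partial credit, unlike under an outcome-only reward.
print(round(process_based_reward([1.0, 0.8, 0.9], outcome_correct=False), 2))  # → 0.45
```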
Why Did Hundreds of Thousands Watch a Sparkling-Water Ad? The Whole Thing Was Generated by Veo 3
机器之心· 2025-07-06 06:06
Core Viewpoint
- The article discusses AI-generated content through a viral advertisement created by the team "Too Short for Modeling," showcasing advances in AI video generation, specifically the Veo 3 model [2][3][4].

Group 1: AI Video Generation
- The advertisement has drawn over 300,000 views on social media, highlighting growing interest in AI-generated content [2].
- Veo 3 introduces a new audio-visual synchronization feature, significantly lowering the barrier to video creation and making AI video generation more practical [4].
- The ad demonstrates impressive character consistency, transitioning smoothly through 10 scenes within a minute while maintaining a highly uniform style [7].

Group 2: Technical Insights
- The creators achieved this consistency through "hyper-specific prompting": giving the model detailed, context-rich instructions that minimize its creative freedom [9][10].
- Despite these advances, AI-generated videos still suffer from character-appearance drift and object distortions, attributable to the underlying technology and limitations of the training data [8][14].
- The article cites several causes of these inconsistencies, including the model's reliance on probability rather than true understanding, the difficulty of maintaining coherence across frames, and training-data quality [19][14].

Group 3: Creative Potential of AI
- The article frames AI as a "creative catalyst" that can inspire novel ideas, such as imagining parallel universes for favorite movies [17][22].
- It encourages exploring AI's capabilities across creative domains, including website development and concept-film production [24][25].
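"Hyper-specific prompting" simply means packing each prompt with enough concrete, repeated constraints that the model has little room to drift between scenes. A sketch of the idea (the prompt text is invented, not from the ad's actual prompts):

```python
# Hypothetical "hyper-specific" video-generation prompts: every detail that
# must stay consistent across scenes is restated explicitly in each prompt,
# rather than being left to the model's imagination.
character = (
    "a woman in her early 30s, shoulder-length auburn hair, silver hoop "
    "earrings, mint-green linen shirt"
)
style = "bright pastel palette, soft daylight, 35mm film grain, static camera"

scenes = [
    "she opens a can of sparkling water at a kitchen counter",
    "she sips the sparkling water on a sunlit balcony",
]

# One self-contained prompt per scene, sharing the same character and style block.
prompts = [f"{character}; {style}; {scene}" for scene in scenes]
for p in prompts:
    print(p)
```

Because each generation is independent, repeating the shared blocks verbatim in every prompt is what keeps appearance and style from drifting between clips.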
Ten Years Seeking a Diagnosis with No Answer; ChatGPT: You Appear to Have a Gene Mutation
机器之心· 2025-07-06 03:49
Editor: 张倩

Using AI to diagnose your own illness is becoming a new trend, but for now we still need human doctors.

A 机器之心 report. Unwell and seeking medical help for ten years, with doctors unable to find the cause — and then ChatGPT worked it out. That is the personal experience a Reddit user just shared. The user wrote:

"For more than a decade, I was troubled by a range of unexplained symptoms. I had spinal MRIs, CT scans, and full blood workups (including in-depth testing); even Lyme disease was ruled out. Later, through the Function Health platform, I discovered that I carry a homozygous A1298C MTHFR gene mutation, a variant that affects 7%-12% of the population. Although my U.S. healthcare network is among the best in the country, and I had even seen neurologists to rule out multiple sclerosis, the cause was never found.

ChatGPT integrated all my lab reports and symptom history and inferred that they were highly consistent with this gene mutation. It turns out that even when vitamin B12 levels look normal, the mutation can keep the body from using it effectively, so levels must be raised with supplements.

I showed these results to my doctor, who was stunned and admitted that suddenly all my symptoms made sense. I'm not sure why they never thought to test me for the methylenetetrahydrofolate reductase (MTHFR) mutation. Months later, my symptoms have largely disappeared. The whole affair still strikes me as incredible and delightful."

The post's popularity on Reddit ...
The Scaling Law Can Still Be Optimized? Meta's Trick Saves Tokens and Boosts Efficiency
机器之心· 2025-07-06 03:49
Core Insights
- The article discusses advances in AI centered on the evolution of the Transformer, and introduces the 2-simplicial Transformer, which improves token efficiency and model scalability [1][4][10].

Group 1: Transformer and AI Development
- The paper "Attention Is All You Need" marked a significant turning point in AI, establishing the Transformer as the foundational paradigm for current language models [1].
- Its citation count is approaching 190,000, indicating its profound impact on the field [2].
- An ongoing challenge in AI is acquiring enough high-quality tokens and using them efficiently, motivating further upgrades to the Transformer [3].

Group 2: 2-Simplicial Transformer
- Meta's recent research introduces a rotation-invariant trilinear attention mechanism with representational capacity comparable to 2-simplicial attention, potentially changing the coefficients of the Scaling Law [4][10].
- The 2-simplicial Transformer, originating with Clift et al. (2019), generalizes dot-product attention to a trilinear form, improving scalability under token constraints [19][11].
- Experimental results indicate that the 2-simplicial Transformer approximates the irreducible entropy of natural language more effectively than dot-product attention Transformers [11].

Group 3: Scaling Law and Model Performance
- The Scaling Law describes how loss decreases with total parameter count and token count, implying that larger models approach the irreducible loss of the natural-text distribution as both grow [13][15].
- Hoffmann et al. (2022) found that the optimal parameter count and dataset size should scale proportionally with the compute budget, with estimated scaling exponents of about 0.49 for parameters and 0.5 for tokens [17][18].
- The 2-simplicial Transformer exhibits a steeper scaling slope than the dot-product attention Transformer, i.e., a larger exponent in its Scaling Law [50].

Group 4: Experimental Results
- Experiments across various models showed that 2-simplicial attention provided no benefit in models with fewer than 2 billion active parameters [45].
- Performance metrics across model sizes showed slight improvements or declines relative to standard Transformers, with the percentage differences varying by size [43][44].
- The study estimated the differences in scaling coefficients between the 2-simplicial and dot-product attention mechanisms, highlighting potential efficiency gains in larger models [46][49].
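The trilinear generalization — attention scores computed from triples (query, key, key') rather than pairs — can be sketched with einsum. This is a naive O(n³) illustration of the *form* of 2-simplicial attention, not Meta's implementation; in particular, the way values are combined per (j, k) pair below is an assumption:

```python
import numpy as np

def two_simplicial_attention(Q, K1, K2, V, temp=1.0):
    """Naive trilinear attention: logits[i, j, k] = sum_d Q[i,d] K1[j,d] K2[k,d]."""
    logits = np.einsum("id,jd,kd->ijk", Q, K1, K2) / temp
    n = Q.shape[0]
    # Softmax jointly over the (j, k) pairs for each query position i.
    flat = logits.reshape(n, -1)
    weights = np.exp(flat - flat.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Values are indexed by the (j, k) pair; a simple (assumed) choice sums V[j] + V[k].
    V_pair = (V[:, None, :] + V[None, :, :]).reshape(-1, V.shape[-1])
    return weights @ V_pair

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K1, K2, V = (rng.standard_normal((n, d)) for _ in range(4))
out = two_simplicial_attention(Q, K1, K2, V)
print(out.shape)  # (4, 8)
```

The pairwise-score softmax of standard attention is the special case where one of the two key interactions is dropped; the extra key axis is what raises the cost from O(n²) to O(n³) and motivates the token-efficiency trade-off the article describes.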