机器之心
World's First AI-Native Social Platform "Teamily AI" Debuts in Silicon Valley, Opening a New Era of "Human-AI Symbiosis" Social Networking
机器之心· 2026-02-13 08:57
Editors | +0, Youli

Have you noticed that, as 2026 begins, the logic of AI's evolution is quietly shifting? Consider the signs: the Yuanbao camp riding high in China; OpenClaw and Moltbook, still hot topics in Silicon Valley; AI startup Humans &, whose social-intelligence product has drawn nearly $500 million in funding; and, just now, Simile, which positions itself as "The Simulation Company," launching its AI platform for social simulation and raising $100 million from investors including top US funds and Stanford professor Fei-Fei Li. The industry no longer seems content to treat AI as a standalone tool. Instead it is asking: what happens when AI steps into real human interaction scenarios and begins engaging with people in all kinds of ways? Against this backdrop, a new social narrative built around humans and AI is quietly unfolding... Recently, 机器之心 noticed a representative product among this class of players. In its demo video, a group of friends chats in a group thread, drifting from current hit films to everyone's favorite movies in a relaxed, natural mood. Just as the topic heats up, an AI agent proactively "joins" the group chat at the right moment and, drawing on the conversation's context, recommends related video clips and background music, embedded directly in the chat interface. Meanwhile, the friends keep talking, watching the video content and listening as they chat ...
Strong Visuals ≠ Useful Work! Tsinghua, Peking University, Princeton and Others Open-Source WorldArena, Overturning World-Model Evaluation
机器之心· 2026-02-13 05:08
Core Insights
- The article discusses the launch of WorldArena, a unified evaluation system for embodied world models developed by leading institutions, aiming to shift the focus from visual quality to functional reliability in robotics [1][4][8].

Evaluation Framework
- WorldArena introduces a six-dimensional assessment framework covering visual quality, action quality, content consistency, physical adherence, 3D accuracy, and controllability, emphasizing the importance of physical understanding for robots [5][21][25].
- The system also incorporates three embodied tasks to test whether models can effectively participate in real-world work, revealing that many visually high-scoring models perform poorly in practical applications [5][27].

EWMScore
- EWMScore is a comprehensive scoring system that consolidates the various evaluation metrics into a single score; it shows a high correlation with human subjective assessments and thus reflects model capabilities more accurately [6][30][41].
- The correlation between EWMScore and task performance indicates that visual realism does not equate to functional reliability, highlighting a significant gap between visual generation and task execution [32][44].

Challenges and Future Directions
- While world models have made significant strides in visual generation, they still fall fundamentally short in supporting embodied-intelligence tasks and long-term decision-making [33][40].
- The conclusion stresses that world models must understand physical laws and maintain consistency in complex environments to evolve from mere visual models into functional embodied-intelligence systems [41][45].
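The EWMScore idea described above, many per-dimension metrics collapsed into one scalar, can be pictured as a simple weighted aggregation. In the sketch below, the six dimension names follow the article's evaluation framework, but the uniform weights and sample numbers are purely illustrative, not the published EWMScore formula.

```python
# Minimal sketch of a composite evaluation score in the spirit of
# EWMScore. Dimension names follow the article's six-axis framework;
# the uniform weights and the example numbers are invented.

DIMENSIONS = [
    "visual_quality", "action_quality", "content_consistency",
    "physical_adherence", "accuracy_3d", "controllability",
]

def composite_score(metrics, weights=None):
    """Weighted mean of per-dimension scores (each assumed in [0, 1])."""
    if weights is None:
        weights = {d: 1.0 / len(DIMENSIONS) for d in DIMENSIONS}  # uniform
    total_w = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * metrics[d] for d in DIMENSIONS) / total_w

# A model with superb visuals but weak physics still scores modestly,
# mirroring the article's point that visual realism is not reliability.
model_a = {"visual_quality": 0.95, "action_quality": 0.40,
           "content_consistency": 0.70, "physical_adherence": 0.35,
           "accuracy_3d": 0.50, "controllability": 0.45}
print(round(composite_score(model_a), 3))  # 0.558
```

Collapsing to one number makes leaderboard comparison easy, at the usual cost of hiding which dimension dragged a model down; per-dimension scores remain useful alongside the composite.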
GLM-5 Attains "God-Tier" Status, Zhipu's Market Cap Doubles in Five Days, and China's AI Is Firing on All Cylinders
机器之心· 2026-02-13 05:08
Core Viewpoint
- The article highlights significant advancements in China's AI landscape, focusing on the launch of GLM-5 by Zhipu, positioned as a leading model capable of handling complex system-engineering tasks and marking a transition from "Vibe Coding" to "Agentic Engineering" [3][36].

Group 1: AI Developments
- The 2026 Spring Festival period is expected to prove pivotal in the history of AI development in China, driven by the releases of Seedance 2.0 and GLM-5 [3][4].
- Seedance 2.0 showcases China's creative capabilities in AI, while GLM-5 demonstrates its execution strength, establishing a "twin star" dynamic in the sector [4][6].
- The market response to GLM-5 has been described as "frenzied," with high demand leading to rapid sellouts of its coding plans despite price increases [6][9].

Group 2: Technical Capabilities of GLM-5
- GLM-5 is characterized as the first "system architect" level model in the open-source community, capable of addressing complex system-level problems [13][14].
- The model's coding performance has been validated through rigorous testing, achieving a 100% pass rate on core algorithm performance metrics [26].
- GLM-5 can autonomously handle tasks such as building a high-concurrency distributed scheduling system, showcasing an advanced understanding of system architecture and engineering [19][24].

Group 3: Market Position and Performance
- GLM-5 ranks fourth globally and first among open-source models in the Artificial Analysis intelligence ranking, indicating its competitive edge [39].
- In the Agentic ranking, GLM-5 sits third, surpassing models such as GPT-5.2 and Claude Opus 4.5 [40].
- The model has posted significant scores on benchmarks including SWE-bench-Verified and Terminal Bench 2.0, outperforming competitors such as Gemini 3.0 Pro [42].

Group 4: Ecosystem and Future Prospects
- The launch of GLM-5 is accompanied by Z Code, a new development environment that enhances coding through natural-language task breakdown and multi-agent collaboration [53].
- GLM-5's capabilities extend beyond coding to document generation and other productivity tools, indicating a comprehensive approach to AI application [55].
- Integration with domestic computing platforms ensures GLM-5 operates efficiently, paving the way for broader AI applications in 2026 and beyond [58][60].
A "Wall-Breaking" Moment for Open-Source Multimodal Reasoning: MMFineReason Helps a 4B Model Upset 30B
机器之心· 2026-02-13 05:08
Core Insights
- The article highlights the significant gap between open-source multimodal models and top closed-source models such as GPT-4o and Gemini, attributed primarily to a lack of high-quality reasoning data [2].
- The MMFineReason framework from OpenDataLab aims to close this gap with a comprehensive, open-source multimodal reasoning data-synthesis pipeline [2][10].

Data Challenges
- Existing open-source multimodal data is dominated by simple Visual Question Answering (VQA) and natural images, with a scarcity of high-value reasoning data such as STEM charts and complex visual symbols [6].
- Reasoning-data quality is inconsistent, often featuring short reasoning processes and insufficiently granular annotations [6].

Performance Results
- The MMFineReason-4B model, trained on Qwen3-VL-4B, demonstrates superior reasoning, surpassing Qwen3-VL-8B-Thinking and approaching the 30B-parameter Qwen3-VL-30B-A3B-Thinking [5].
- The MMFineReason-8B model outperforms both Qwen3-VL-30B-A3B-Thinking and Gemini-2.5-Flash, a performance leap driven by data quality rather than model architecture [8].

Data Production Pipeline
- MMFineReason employs a fully open-source, transparent data-production pipeline with three main stages to ensure high-quality data generation [12].
- The final datasets are MMFineReason-1.8M, MMFineReason-586K, and MMFineReason-123K, each curated for a different level of reasoning difficulty [14].

Dataset Characteristics
- MMFineReason's average reasoning-chain length of 2,910 tokens is significantly longer than comparable datasets, which strengthens trained models' reasoning capabilities [16].
- The dataset emphasizes high-difficulty logical reasoning: 79.4% mathematics, 13.8% science, and 4.6% puzzles and games [19].

Conclusion and Future Outlook
- The open-sourcing of MMFineReason demonstrates that, in the multimodal field, the key to improving model performance is data quality rather than model size [23].
- The project is available on Huggingface and GitHub, providing comprehensive support for the open-source community [23].
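The tiered subsets (1.8M, 586K, 123K samples curated by reasoning difficulty) suggest a filtering pattern worth sketching: one large pool sliced into progressively harder, longer-reasoning subsets. The field names and thresholds below are hypothetical, not taken from the released pipeline.

```python
# Illustrative sketch of tiered dataset curation in the spirit of
# MMFineReason-1.8M / -586K / -123K. Field names ("chain_tokens",
# "difficulty") and thresholds are invented for this example.

def tier_samples(samples, min_chain_tokens=0, min_difficulty=0.0):
    """Keep samples whose reasoning-chain length (in tokens) and rated
    difficulty both clear the given thresholds."""
    return [s for s in samples
            if s["chain_tokens"] >= min_chain_tokens
            and s["difficulty"] >= min_difficulty]

pool = [
    {"id": 1, "chain_tokens": 500,  "difficulty": 0.2},  # short, easy
    {"id": 2, "chain_tokens": 2900, "difficulty": 0.6},  # near the 2,910-token average
    {"id": 3, "chain_tokens": 4100, "difficulty": 0.9},  # long, hard
]
full   = tier_samples(pool)                                  # keep everything
medium = tier_samples(pool, min_chain_tokens=2000)           # long chains only
hard   = tier_samples(pool, min_chain_tokens=2000, min_difficulty=0.8)
print(len(full), len(medium), len(hard))  # 3 2 1
```

Each tier is a subset of the one above it, which matches the shrinking sizes of the three released datasets.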
A $1 Hourly Wage? This Is the Working Professional's "Dream Model"
机器之心· 2026-02-13 04:19
Editors | Zhang Qian, Panda

A full 6x jump! Watching the token bill race ahead like the wind, I, who meant to use AI to unleash productivity, now run the invoice through my head before every press of the Enter key. This isn't hiring an assistant; it's keeping a "gold-devouring beast." This "productivity tax" forces working professionals into an awkward mode: craving the efficiency of top-tier intelligence while second-guessing the bill at every keystroke. Are high intelligence and cost-effectiveness really impossible to have at once, like the proverbial fish and bear's paw? Do ordinary workers simply not deserve "intelligence freedom"? Just as everyone was clutching their wallets and sighing, MiniMax played a trump card: MiniMax M2.5. The model can genuinely compete: in both coding and agent capability it goes toe-to-toe with Claude Opus 4.6, and even comes out ahead on some dimensions. Anthropic's Opus 4.6 had just shipped, and its intelligence is genuinely scalp-tingling, but one look at the price sheet and my wallet started tingling too. Which is awkward. Opus 4.6's arrival set off a wave of "intelligence anxiety" among developers: the model really is that good, and it really is that expensive. The base price didn't budge, but the so-called "fast" tier's cost per million output tokens leapt from $25 straight to $150. ...
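The "6x" in the opening is plain pricing arithmetic: the article says the Opus 4.6 "fast" tier went from $25 to $150 per million output tokens. A tiny helper makes the jump concrete (the prices are the article's; the function and usage figures are illustrative).

```python
# Token-pricing arithmetic behind the article's "6x" figure.
# Prices ($25 -> $150 per million output tokens) are from the article;
# the 2M-token usage example is invented for illustration.

def output_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens at a given rate."""
    return tokens / 1_000_000 * price_per_million

old = output_cost_usd(2_000_000, 25.0)    # 2M output tokens at the old rate
new = output_cost_usd(2_000_000, 150.0)   # same usage on the "fast" tier
print(old, new, new / old)  # 50.0 300.0 6.0
```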
CVPR 2026 Workshop Call for Papers | From Perception to Reasoning: ViSCALE 2.0 Invites You to Reshape Computer Vision's System 2
机器之心· 2026-02-13 04:19
Core Insights
- The article discusses the evolution of computer vision toward a new paradigm, emphasizing the transition from basic pixel perception to complex spatial reasoning and world modeling, facilitated by Test-time Scaling (TTS) [2][5].
- The upcoming ViSCALE 2026 workshop aims to gather leading scholars to explore breakthroughs in visual models through computational expansion, focusing on deep reasoning rather than mere static outputs [4][5].

Group 1: Workshop Highlights
- ViSCALE 2026 will feature discussions on spatial intelligence and world models, with contributions from top scholars including Sergey Levine, Manling Li, and Ziwei Liu [5].
- The workshop encourages innovative research submissions that challenge the limits of existing visual models, providing a platform for both theoretical and application-focused studies [7].

Group 2: Key Topics of Discussion
- Enhancing video generation's physical consistency and long-term causal reasoning through TTS [6]
- Breaking 2D limitations to enable models to navigate and operate in 3D spaces as humans do [6]
- Developing visual reasoning chains that allow models to self-correct and perform multi-step reasoning [6]
- Exploring scaling laws that relate test-time computational load to visual reasoning performance [6]

Group 3: Submission Details
- Submissions are invited in two tracks: Full Papers (8 pages) and Extended Abstracts (up to 4 pages), with specific formatting requirements [9].
- Important deadlines: submission by March 10, 2026; notification of acceptance by March 18, 2026 [9].
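Test-time Scaling covers a family of techniques; one of the simplest is best-of-N sampling, where extra inference compute buys N candidate answers and a verifier keeps the best. The toy below is a generic illustration of that compute-versus-performance trade-off, not any specific ViSCALE method; the "model" and its score are stand-ins.

```python
# Toy sketch of one test-time scaling recipe: best-of-N sampling.
# sample_answer is a stand-in for a stochastic model whose candidates
# a verifier scores in [0, 1); more samples raise the expected best
# score, with diminishing returns.

import random

def sample_answer(seed: int) -> float:
    """One stochastic model sample, returned as its verifier score."""
    return random.Random(seed).random()

def best_of_n(n: int) -> float:
    """Best verifier score over n independent candidates."""
    return max(sample_answer(seed) for seed in range(n))

for n in (1, 4, 16, 64):
    print(n, round(best_of_n(n), 3))
```

Because each larger budget reuses the smaller budget's samples here, the best score is monotone in N, which is the shape a test-time scaling law plots.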
How Far Are We from Coding's "AGI Moment"? ByteDance Seed Releases NL2Repo-Bench, a Repository-Level Long-Horizon Code Generation Benchmark
机器之心· 2026-02-13 01:02
In AI programming, we seem to be at the peak of a cognitive illusion: as Coding Agents independently complete tasks of ever greater difficulty and scope, is AGI for coding within reach? Real engineers know, however, that the soul of writing code lies not in file/function-level code creation but in project-level code completion. Having written code for a long time does not mean the project is finished, let alone done well. Complete project development requires a developer to start from an empty folder, understand requirements spanning tens of thousands of tokens, design the architecture, manage multimodal logic, and produce an installable, runnable code repository. Yet existing code benchmarks focus mainly on local code generation (e.g., HumanEval, MBPP) or on repairing existing codebases (e.g., SWE-bench). Recently, NL2Repo-Bench, the first benchmark dedicated to evaluating coding agents' end-to-end repository generation ability, was officially released. Built jointly by researchers from ByteDance Seed, Nanjing University, Peking University, and other institutions, it has drawn wide attention since its release. Show me your Repo: how does NL2Repo examine a Coding Agent's zero-to-one ability? Paper title: NL2R ...
Gemini Just Shipped a New Model; Only 7 People in the World Out-Program It, with Google's Shunyu Yao Contributing
机器之心· 2026-02-13 01:02
Editor | Zenan

From now on, AI is no longer a mere tool; address it as the "silicon-based polymath." In the early hours of Friday, Beijing time, Google released a major upgrade to Gemini 3 Deep Think. As a reasoning mode dedicated to complex tasks, Deep Think represents the strongest intelligence at the AI frontier, aimed at tackling challenges across science and engineering. Shunyu Yao (姚顺宇), the Tsinghua physics-department legend who joined Google DeepMind last September, also contributed to the new Deep Think model. Last year, Google showed that a specially developed version of Deep Think could handle some of the thorniest reasoning challenges, earning gold-medal results at the world championships in mathematics and programming. More recently, Deep Think enabled purpose-built agents to conduct research-grade mathematical exploration. The updated Deep Think mode continues to push the boundary of intelligence, setting new highs on the most rigorous academic benchmarks, including:
- A new SOTA of 48.4% (without any tools) on Humanity's Last Exam, a benchmark designed to test the limits of modern frontier models.
- An unprecedented 84.6% on ARC-AGI-2, verified by the ARC Prize Foundation.
- A stunning 3455 Elo rating on Codeforces, ...
Loop-ViT: Teaching AI to "Think Again and Again", a 3.8M-Parameter Small Model Matches the Human Average
机器之心· 2026-02-12 10:08
When we solve a complex math problem or study an abstract pattern, the brain often needs to think repeatedly and reason step by step. Mainstream deep learning models, however, take a "single pass" route: data goes in, passes through a fixed number of layers, and the answer comes straight out. This feed-forward architecture excels at perception tasks such as image classification, but it falls short on abstract problems that require multi-step reasoning. The most telling example is the ARC-AGI benchmark, regarded as a "touchstone" for measuring AI's abstract reasoning ability. Recently, a research team from HKUST, the Institute of Automation of the Chinese Academy of Sciences, and UC Santa Cruz proposed Loop-ViT, the first work to bring recurrent Transformers into visual reasoning. With only 18M parameters, the model reaches 65.8% accuracy on ARC-AGI-1, surpassing the 73M-parameter VARC ensemble. Even more surprising, its 3.8M small version reaches 60.1% accuracy, nearly matching the human average (60.2%). What is ARC-AGI, and why is it so hard? ARC-AGI (Abstraction and Reasoning Corpus) is an abstract-reasoning benchmark proposed by Keras creator François Chollet. Compared with Image ...
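The recurrence behind a "recurrent Transformer" is weight tying: instead of stacking L distinct layers, one shared block is applied K times, so the effective "thinking depth" grows without adding parameters. The sketch below shows only that pattern with a toy residual block, not Loop-ViT's actual architecture.

```python
# Hedged sketch of weight-tied recurrence, the generic pattern a
# recurrent Transformer like Loop-ViT builds on. The block here is a
# toy residual update, not the paper's real Transformer layer.

import numpy as np

def shared_block(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One recurrent step: residual update reusing the SAME weights W."""
    return x + np.tanh(x @ W)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)) * 0.05   # the single, shared parameter matrix
x = rng.standard_normal((16, 64))          # 16 visual tokens, dim 64

for step in range(8):                      # 8 "reasoning" iterations
    x = shared_block(x, W)                 # parameters reused every step

print(x.shape)  # (16, 64): shape is preserved across iterations
```

Because the block's input and output shapes match, the loop count K can even be varied at inference, trading compute for more "thinking" steps.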
Embodied Intelligence's "GPT Moment"? Amap Releases Two Comprehensively SOTA ABot Embodied Foundation Models
机器之心· 2026-02-12 10:08
Core Insights
- The article discusses the transformative impact of large models on natural language processing (NLP) and draws parallels to the current state of robotics, highlighting the need for a unified approach in robotic systems akin to the shift NLP saw with models like GPT [1][2][5].

Group 1: Robotics Industry Challenges
- The robotics industry is fragmented: manufacturers use incompatible action-representation systems, so models are not reusable and each scenario requires a new system [2][8].
- The absence of unified data representation and action modeling has hindered scalable training methods, making it difficult to integrate diverse data sources [7][8].
- Reliance on specialized models for different tasks limits generalization to new environments, resulting in weak performance in complex scenarios [9][23].

Group 2: Introduction of the ABot Series
- Alibaba's Amap has introduced the ABot series, consisting of ABot-M0 and ABot-N0, which provide unified bases for robotic manipulation and navigation, respectively [3][4].
- ABot-M0 standardizes an action language across diverse robot forms, enabling them to perform varied tasks with a common model, reducing training costs and improving efficiency [12][14].
- ABot-N0 targets navigation in dynamic environments, integrating multiple navigation tasks into a single model to enhance real-world operation [22][26].

Group 3: Technical Innovations
- ABot-M0 employs a systematic reconstruction approach spanning data unification, algorithmic innovation, and enhanced spatial perception to improve manipulation capabilities [12][15][17].
- The models achieve state-of-the-art (SOTA) performance on various benchmarks, with significant gains in task success rates, particularly in complex environments [20][32].
- ABot-N0 uses a hierarchical design that combines cognitive understanding with precise action generation, allowing more natural, effective navigation in real-world settings [29][30].

Group 4: Future Implications
- The ABot release is expected to lower the barrier for smaller teams to build robotic solutions, potentially shifting development from extensive custom systems to fine-tuning existing models [38].
- The long-term vision includes modular robotic capabilities akin to APIs, enabling developers to implement physical tasks through standardized models [38][39].
- Advances in unified data formats and pre-trained weights are anticipated to significantly reduce the time and cost of robotic training and deployment [38].
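The "standardized action language" the summary attributes to ABot-M0 can be pictured as mapping heterogeneous robot commands into one shared discrete vocabulary, so a single model can emit tokens for different embodiments. The token scheme, names, and ranges below are invented for illustration, not Amap's actual design.

```python
# Hypothetical illustration of a unified action representation: each
# robot's continuous commands, whatever their native units and ranges,
# are quantized into the same shared integer token space.

from dataclasses import dataclass

@dataclass(frozen=True)
class ActionToken:
    kind: str      # e.g. "gripper", "base", "arm_joint" (invented labels)
    value: int     # quantized command in a shared 0..255 range

def quantize(value: float, lo: float, hi: float, bins: int = 256) -> int:
    """Clamp a continuous command to [lo, hi] and map it to a token id."""
    value = min(max(value, lo), hi)
    return round((value - lo) / (hi - lo) * (bins - 1))

# Two robots with different native ranges emit tokens in the same space:
arm_cmd  = ActionToken("arm_joint", quantize(0.5, -3.14, 3.14))  # radians
base_cmd = ActionToken("base",      quantize(0.2, -1.0, 1.0))    # m/s
print(arm_cmd.value, base_cmd.value)  # 148 153
```

With every embodiment speaking the same token vocabulary, data from different robots can be pooled for pre-training, which is the scalability argument the article makes.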