机器之心

Breaking: LMArena's latest model leaderboard is out! DeepSeek-R1's web programming ability has overtaken Claude Opus 4
机器之心· 2025-06-17 00:10
Core Viewpoint
- DeepSeek has made significant advances in the open-source model space with the release of its upgraded R1 reasoning model (0528), which performs competitively against proprietary models [2][4][10].

Performance Summary
- The R1-0528 model improves benchmark performance, enhances front-end functionality, reduces hallucinations, and adds support for JSON output and function calls [3].
- In the latest LMArena rankings, DeepSeek-R1 (0528) placed 6th overall and is the top-ranked open model [5][4].
- Category rankings include [6][7]:
  - 4th in Hard Prompt testing
  - 2nd in Coding testing
  - 5th in Math testing
  - 6th in Creative Writing testing
  - 9th in Instruction Following testing
  - 8th in Longer Query testing
  - 7th in Multi-Turn testing

Competitive Landscape
- On the WebDev Arena platform, DeepSeek-R1 (0528) is statistically tied for first place with proprietary models such as Gemini-2.5-Pro-Preview-06-05 and Claude Opus 4, with a raw score that surpasses Claude Opus 4's [8].
- The performance of DeepSeek-R1 (0528) is viewed as a milestone, particularly in the AI programming domain, where it competes closely with established models like Claude [10].

User Engagement
- The strong performance of DeepSeek-R1 (0528) has driven increased interest and usage, prompting discussions of user experiences [9][11].
"Kangaroo stunned by humans arguing on a plane" floods the internet: 70 million people fooled by AI
机器之心· 2025-06-16 09:10
Core Viewpoint
- The article discusses the increasing sophistication of AI-generated content, highlighting how realistic AI videos can mislead viewers into believing they are real, as exemplified by a viral video featuring a kangaroo at an airport [2][12][18].

Group 1: AI Video Generation
- The video in question was created using advanced AI technology, making it difficult for viewers to discern its authenticity [18].
- The account that posted the video, InfiniteUnreality, features various surreal AI-generated animal videos, adding to the confusion over the content's legitimacy [13][16].
- Although the account labeled its content as AI-generated, the disclosure was subtle enough that many viewers overlooked it [19].

Group 2: Viewer Misinterpretation
- The video's virality was amplified by its engaging content, with many users commenting positively and reinforcing the belief that it was real [24].
- Other social media accounts, such as DramaAlert, shared the video without clarifying its AI origins, further spreading the misunderstanding [21].
- The episode illustrates a broader trend in which viewers struggle to identify AI-generated content as traditional visual cues of authenticity become less reliable [34].

Group 3: AI Detection Tools
- Google DeepMind and Google AI Labs have developed SynthID, a tool designed to identify content generated or edited by Google's AI models through digital watermarking [35].
- SynthID embeds a subtle digital fingerprint in the content that can be detected even after editing, but it covers only the outputs of Google's own AI models [36].
- The tool is still in early testing, and users must join a waitlist for access [39].
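To make the watermark-detection idea concrete: SynthID's exact mechanism is proprietary (its text variant is reportedly based on a tournament-sampling scheme described by Google DeepMind), so the sketch below instead shows a simpler "green-list" statistical watermark in the same spirit. It is illustrative only; the hashing scheme, green-list fraction, and use of a z-test are assumptions, not SynthID's actual design.

```python
import hashlib
import random

def greenlist(prev_token: int, vocab_size: int, fraction: float = 0.5) -> set:
    # Derive a pseudorandom "green" subset of the vocabulary from the previous
    # token; a watermarking generator would bias sampling toward this subset.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(fraction * vocab_size)])

def watermark_z_score(tokens: list, vocab_size: int, fraction: float = 0.5) -> float:
    # Detection: count how many tokens fall in their context's green list,
    # then z-test that count against what unwatermarked text would produce.
    # Assumes len(tokens) >= 2.
    hits = sum(tok in greenlist(prev, vocab_size, fraction)
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    expected = fraction * n
    variance = fraction * (1 - fraction) * n
    return (hits - expected) / variance ** 0.5  # large z => likely watermarked
```

A z-score of roughly 4 or more would be strong statistical evidence of watermarking, and the signal survives moderate edits because most token/context pairs remain untouched, which mirrors the edit-robustness the article attributes to SynthID.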
139 points on the gaokao math exam! Xiaomi's 7B model rivals Qwen3-235B and OpenAI o3
机器之心· 2025-06-16 05:16
Core Viewpoint
- The article examines how various AI models performed on the 2025 gaokao mathematics exam, highlighting the competitive landscape of AI model capabilities, with particular focus on Xiaomi's MiMo-VL model, which performed impressively despite its small parameter count [2][4][20].

Group 1: Model Performance
- Gemini 2.5 Pro scored 145 points, ranking first, followed closely by Doubao and DeepSeek R1 at 144 points [2].
- MiMo-VL, a 7B-parameter model, scored 139 points, matching Qwen3-235B and falling only one point short of OpenAI's o3 [4].
- MiMo-VL outscored Qwen2.5-VL-7B, a model of the same parameter size, by 56 points [5].

Group 2: Evaluation Methodology
- MiMo-VL-7B and Qwen2.5-VL-7B were evaluated on uploaded screenshots of the questions, while the other models received text input [6].
- The exam comprised 14 objective questions (worth 73 points) and 5 free-response questions (worth 77 points) [7].

Group 3: Detailed Scoring Breakdown
- MiMo-VL scored 35 of 40 on the single-choice questions and earned full marks on the multiple-choice and fill-in-the-blank questions [8][10][11].
- On the free-response questions, MiMo-VL scored 71 points, ranking fifth overall and surpassing hunyuan-t1-latest and 文心 X1 Turbo [12].

Group 4: Technological Advancements
- Xiaomi announced the open-sourcing of MiMo, its first reasoning-focused large model, which shows significant improvements in reasoning capability [14].
- MiMo-VL, the successor to MiMo-7B, demonstrates substantial advances on multimodal reasoning tasks, outperforming much larger models such as Qwen-2.5-VL-72B [20].
- The model's performance is attributed to high-quality pre-training data and an innovative mixed online reinforcement learning algorithm [27][29].

Group 5: Open Source and Accessibility
- The MiMo-VL-7B technical report, model weights, and evaluation framework have all been open-sourced, promoting transparency and accessibility in AI development [32].
Three years into AI's evolution, the real inflection point for industry adoption may lie in this world-class financial intelligence competition
机器之心· 2025-06-16 05:16
Core Viewpoint
- The article argues that while AI models are abundant, the real challenge lies in applying them effectively in real-world scenarios, particularly in the financial sector [2][3].

Group 1: AI in the Financial Sector
- More than 500 large models have been registered in China, marking the start of a new era of AI applications [4].
- The financial industry is one of the earliest and most complex sectors for AI adoption, owing to its rich structured data and diverse scenarios [6].
- Major tech companies are entering the financial AI space: Huawei has launched the "Pangu Financial Model" and Ant Group has introduced "AntFinGLM" [5].

Group 2: AI Applications and Challenges
- AI can help identify investor sentiment and predict stock prices, improving decision-making for investors and analysts [7][8].
- AI's ability to process large volumes of trading data can surface hidden risks, making it a valuable tool for financial institutions [9].
- Deployment still faces significant challenges, such as preserving reasoning capability while improving efficiency and practicality [9].

Group 3: AFAC2025 Financial Innovation Competition
- The AFAC2025 competition targets real industry problems, emphasizing practical application over raw computational power [11].
- Across its three editions, the competition has attracted over 30,000 participants and produced algorithms and solutions of real value to the industry [11].
- This year's competition poses three core challenges: predicting fund flows, ensuring document consistency, and compressing AI reasoning chains [18][19][20].

Group 4: Future Directions and Opportunities
- The competition encourages innovation in areas such as inclusive finance, financial data integration, and financial services for seniors, reflecting AI's growing importance in these sectors [22].
- Participants may also propose their own topics, fostering creativity and development capability [21].
- The competition serves as a platform for technical breakthroughs, career growth, and entrepreneurial support [12][14].
Countdown to the preliminary-round registration deadline! A 750,000 RMB prize pool plus coveted job offers: 启元实验室's major competition awaits your challenge!
机器之心· 2025-06-16 05:16
Editor: Wu Xin

Registration closes on June 25, 2025; interested teams should sign up as soon as possible.

A hundred boats race the current: the 「启智杯」 preliminary round is in full swing

As artificial intelligence technology keeps breaking new ground, the wave of intelligent transformation is profoundly reshaping industries of every kind, and China has entered a period of accelerated AI adoption. To push intelligent algorithms from theoretical innovation to practical deployment, 启元实验室 officially launched the 「启智杯」 algorithm competition on May 20.

This edition of the competition is organized around three problems: robust instance segmentation of satellite remote-sensing imagery, UAV ground-target detection for embedded platforms, and adversarial attacks against multimodal large models. It focuses on three key technologies (robust perception, lightweight deployment, and adversarial defense), aiming to steer technical innovation toward real scenarios and accelerate the translation and large-scale application of algorithmic capability.

The competition ignited enthusiasm across China's tech community as soon as it was announced: more than 500 teams from universities, research institutes, and tech companies have registered. They include teams from top universities such as Tsinghua, Peking University, Fudan, Shanghai Jiao Tong, Nanjing University, Wuhan University, Huazhong University of Science and Technology, USTC, Harbin Institute of Technology, the National University of Defense Technology, Xi'an Jiaotong, and UESTC, as well as teams from research institutions such as the CAS Institute of Automation and the CAS Aerospace Information Research Institute, injecting strong research power into the event.

The competition has now reached the critical preliminary-round stage. Contestants on all three tracks are engaged in intensive modeling and tuning, racing against the clock to crack technical difficulties and continually iterating on their solutions; on some problems the competition has already turned white-hot.

The three ...
ACL 2025 | Why does the prompt you designed succeed? A new theory reveals the secrets and efficacy of prompt design for large models
机器之心· 2025-06-16 04:04
Core Insights
- The article discusses the importance of prompt design in enhancing the performance of large language models (LLMs) on complex reasoning tasks, emphasizing that effective prompts can significantly improve model accuracy and efficiency [2][7][36].
- A theoretical framework is proposed to quantify the complexity of the prompt search space, moving prompt engineering from an empirical practice toward a more scientific approach [5][35].

Group 1: Prompt Design and Its Impact
- The effectiveness of prompt engineering has historically seemed somewhat mysterious, with certain combinations yielding large performance gains while others fall flat [7].
- Prompts act as critical "selectors" in chain-of-thought (CoT) reasoning, guiding the model to extract relevant information from its internal hidden states [12][36].
- The study shows that the choice of prompt template directly influences LLM reasoning performance, with optimal prompt designs yielding improvements exceeding 50% [29][36].

Group 2: Theoretical Framework and Experimental Evidence
- The research introduces a systematic approach to finding optimal prompts by decomposing CoT reasoning into two interconnected search spaces: the prompt space and the answer space [22][35].
- Experiments show that CoT mechanisms enable LLMs to perform recursive computation, which is essential for multi-step reasoning tasks [26][30].
- Well-designed prompts effectively dictate the output of each reasoning step, ensuring that only the most relevant information is carried into subsequent computation [28][36].

Group 3: Limitations and Future Directions
- Relying solely on generic prompts can severely limit performance on complex tasks, underscoring the need for tailored prompt designs [36].
- CoT variants such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) can improve performance but remain constrained by the underlying prompt templates [32][33].
- The findings underscore the need for a deeper understanding of task requirements in order to design prompts that guide LLMs to extract and use core information effectively [23][35].
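To make the two-space decomposition concrete: the prompt-space search chooses a template that shapes each reasoning step, while the answer-space search happens inside the model's decoding under that template. Below is a deliberately minimal Python sketch of the outer, prompt-space loop; the `templates` list and `eval_fn` (which in real use would wrap a model API and a dev set) are hypothetical stand-ins, since the paper's formal framework is not reproduced in this summary.

```python
from typing import Callable, List, Tuple

def search_prompt_space(
    templates: List[str],
    eval_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Score each candidate CoT template on a held-out dev set; keep the best.

    templates: candidate templates, e.g. "Q: {question}\nLet's think step by step."
    eval_fn: hypothetical callback returning dev-set accuracy under a template;
             the answer-space search runs inside the model's decoding.
    """
    best_template, best_acc = templates[0], float("-inf")
    for template in templates:
        acc = eval_fn(template)
        if acc > best_acc:
            best_template, best_acc = template, acc
    return best_template, best_acc

# Example usage with a dummy evaluator (real use would call an LLM):
if __name__ == "__main__":
    candidates = [
        "Q: {question}\nA:",
        "Q: {question}\nLet's think step by step.",
        "Q: {question}\nFirst list the knowns, then derive the answer.",
    ]
    dummy_scores = {t: len(t) % 7 / 10 for t in candidates}  # stand-in scores
    best, acc = search_prompt_space(candidates, lambda t: dummy_scores[t])
    print(best, acc)
```

The outer loop is trivial by design; the paper's point, as summarized above, is that this template choice alone can swing reasoning accuracy by more than 50%.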
The author of Muon won over OpenAI with nothing but a blog post
机器之心· 2025-06-16 04:04
Machine Heart report, Machine Heart editorial team

Keller Jordan, one of the core members of OpenAI's deep learning team, pried open OpenAI's door with a single blog post.

The post, titled "Muon: An optimizer for hidden layers in neural networks," was published in December 2024, right around the time Keller Jordan joined OpenAI.

"Many PhDs (including my past self) fall into the trap of believing that publishing at top conferences is the ultimate goal," said Yuchen Jin, CEO of AI cloud provider Hyperbolic. But these days, publishing papers no longer equates directly to academic influence.

In the post, Keller Jordan proposed and built Muon, an optimizer for the hidden layers of neural networks that substantially speeds up training (for Transformers and CNNs alike) while preserving accuracy.

Why only a blog post rather than a formal arXiv paper? Keller Jordan explained: whether a paper on a new optimizer, filled with good-looking results, can be published has no bearing on whether the optimizer actually works. "I only believe in speedruns."

All along ...
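For readers curious about the mechanics: the blog post describes Muon as taking the SGD-momentum update for a 2D hidden-layer weight matrix and orthogonalizing it with a Newton-Schulz iteration before applying it. The PyTorch sketch below is a minimal reading of that public description; the quintic iteration coefficients and the aspect-ratio scaling follow the blog post as I recall it, so treat this as an illustration rather than the reference implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Push G toward the nearest semi-orthogonal matrix with a quintic
    # Newton-Schulz iteration (coefficients as reported in the Muon blog post).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    # One Muon update for a single 2D weight matrix:
    # accumulate momentum, orthogonalize the update, then apply it
    # with a scale compensating for the matrix's aspect ratio.
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    scale = max(1.0, weight.shape[0] / weight.shape[1]) ** 0.5
    weight.add_(update, alpha=-lr * scale)
```

Per the blog post, this update applies only to 2D hidden-layer matrices; embeddings, output heads, and scalar parameters are left to a conventional optimizer such as AdamW.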
Abandoning his PhD to join OpenAI, he aims to bring memory and personality to ChatGPT and AGI
机器之心· 2025-06-15 04:43
Machine Heart report
Editor: Du Wei

Today, news of a researcher joining OpenAI attracted a great deal of attention.

The researcher, James Campbell, only began his PhD in computer science at CMU in 2024. Now he has suddenly announced that he is giving up the doctorate to join OpenAI.

On X, he said his research focus at OpenAI will be "memory + personality for AGI and ChatGPT," and that memory will fundamentally change the relationship between humans and machine intelligence. He will work hard, he added, to make sure all of this is implemented correctly.

Even OpenAI co-founder and president Greg Brockman welcomed his arrival.

So who is he, and why has his move drawn so much attention? Let's take a look at his background.

He earned his bachelor's degree at Cornell University, majoring in mathematics and computer science. As an undergraduate he worked on LLM interpretability and truthfulness, and he was a lead author of two papers, "Representation Engineering" and "Localizing Lying in Llama." The former studies representation engineering, a top-down approach to AI transparency; the latter studies localizing lying in Llama, that is, understanding instructed dishonesty on true/false questions through prompting, probing, and patching.

He was also at Gray Swa ...
Qiu Xipeng (Fudan University / 上海创智学院): Context Scaling, the next act on the road to AGI
机器之心· 2025-06-15 04:40
Core Viewpoint
- The article discusses the concept of Context Scaling as a crucial step toward artificial general intelligence (AGI), emphasizing that AI must understand and adapt to complex, ambiguous contexts rather than merely scale up model size or data volume [2][21].

Summary by Sections

Evolution of Large Models
- The evolution of large models is summarized in three acts:
  1. The first act is the success of model scaling: stacking data and parameters to compress knowledge, which produced models such as ChatGPT and MOSS [6].
  2. The second act is post-training optimization: enhancing decision-making through methods such as reinforcement learning and multimodal approaches, exemplified by GPT o1/o3 and DeepSeek-R1 [6][7].
  3. The third act, Context Scaling, aims to address the challenge of defining context in order to improve model capability, particularly in complex and nuanced situations [8][21].

Context Scaling
- Context Scaling is defined as AI's ability to understand and adapt to rich, complex, and dynamic contextual information, which is essential for making reasonable judgments in ambiguous scenarios [8][9].
- The concept of "tacit knowledge" is introduced: the implicit understanding humans possess but find difficult to articulate, which AI must learn to capture [11][12].

Three Technical Pillars
- Context Scaling rests on three key capabilities:
  1. Strong interactivity: AI must learn from interaction, understanding social cues and cultural nuance [14][15].
  2. Embodiment: AI needs a sense of agency to perceive and act within its environment, which can be tested in virtual settings [16].
  3. Anthropomorphization: AI should resonate emotionally with humans, understanding complex social interactions and cultural sensitivities [17].

Challenges and Integration
- Context Scaling is not a replacement for existing scaling methods; it complements them by focusing on the quality and structure of input data [18].
- It also redefines the environment for reinforcement learning, moving beyond simple state-action-reward loops to include rich contextual information [20].

Conclusion
- The exploration of Context Scaling aims to unify various technological paths under the core goal of contextual understanding, which is seen as essential for navigating the complexity of the real world and a potential key to achieving AGI [22].
CVPR 2025 Highlight | New method from UCAS and others deciphers the multimodal "black box," precisely pinpointing the culprits behind errors
机器之心· 2025-06-15 04:40
Core Viewpoint
- The article discusses the importance of reliability and safety in AI decision-making, emphasizing the urgent need for better model interpretability so that decision processes can be understood and verified, especially in critical scenarios [1][2].

Group 1: Research Background
- A joint research effort by institutions including the Chinese Academy of Sciences and Huawei has achieved significant breakthroughs in explainable attribution for multimodal object-level foundation models, helping humans understand model predictions and identify the input factors that lead to errors [2][4].
- Existing explanation methods, such as Shapley Value and Grad-CAM, have limitations when applied to large-scale models or multimodal tasks, highlighting the need for efficient attribution methods that adapt to both large and small models [1][8].

Group 2: Methodology
- The proposed Visual Precision Search (VPS) method aims to generate high-precision attribution maps with fewer regions, addressing the challenges posed by growing model complexity and multimodal interaction [9][12].
- VPS casts attribution as a search problem based on subset selection, optimizing the choice of sub-regions to maximize interpretability [12][14].
- Key scores, such as clue scores and collaboration scores, are defined to evaluate each sub-region's importance to the decision, and together they form a submodular objective for attribution [15][17].

Group 3: Experimental Results
- VPS demonstrates superior performance across various object-level tasks, surpassing existing methods such as D-RISE on Insertion and Deletion metrics on datasets including MS COCO and RefCOCO [22][23].
- The method highlights important sub-regions more cleanly than existing techniques, which often produce noisy or diffuse saliency maps [22][24].

Group 4: Error Explanation
- VPS excels at explaining why a model's predictions fail, a capability absent from other existing methods [24][30].
- Visualizations reveal how input perturbations and background interference contribute to classification errors, offering insight into model limitations and directions for improvement [27][30].

Group 5: Conclusion and Future Directions
- VPS improves interpretability for object-level foundation models and effectively explains failures in visual grounding and object detection tasks [32].
- Future applications may include improving decision rationality during training, monitoring decisions for safety at inference time, and identifying key defects for cost-effective model repair [32].
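As a rough illustration of the subset-selection formulation: with a submodular objective, a greedy search that repeatedly adds the sub-region with the largest marginal gain carries the classic (1 - 1/e) approximation guarantee. The sketch below is a generic greedy maximizer, not the paper's algorithm; `score_fn` (standing in for the combined clue and collaboration scores), the region masks, and the budget `k` are all hypothetical.

```python
from typing import Callable, List, Sequence

def greedy_attribution(
    regions: Sequence[object],
    score_fn: Callable[[List[object]], float],
    k: int = 10,
) -> List[object]:
    """Greedily maximize a submodular attribution score over sub-regions.

    regions: candidate sub-region masks of the input image
    score_fn: hypothetical submodular objective combining clue and
              collaboration scores for a chosen subset
    k: maximum number of sub-regions in the attribution map
    """
    selected: List[int] = []
    remaining = list(range(len(regions)))
    for _ in range(min(k, len(regions))):
        base = score_fn([regions[i] for i in selected])
        # Pick the region with the largest marginal gain over the current set.
        best_gain, best_idx = max(
            (score_fn([regions[i] for i in selected] + [regions[j]]) - base, j)
            for j in remaining
        )
        if best_gain <= 0:
            break  # no remaining region improves the objective
        selected.append(best_idx)
        remaining.remove(best_idx)
    return [regions[i] for i in selected]
```

In practice, evaluating such an objective typically requires forward passes of the model on masked inputs, so the greedy loop's cost is dominated by inference rather than by the search bookkeeping.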