机器之心
OpenAI and Anthropic Join Forces: MCP Apps Proposal Released, Moving Beyond Text-Only Interaction
机器之心· 2025-11-24 07:27
Core Insights
- The MCP protocol is evolving to include interactive user interface (UI) support through the MCP Apps proposal, enhancing the interaction capabilities of AI agents beyond text and structured data [2][10][11]

Group 1: MCP Apps Proposal
- The MCP Apps proposal (SEP-1865) aims to standardize support for interactive UIs, allowing MCP servers to provide visual interfaces directly to hosts [2][4]
- This proposal has received positive feedback from the community, driven by contributions from key players like OpenAI and Anthropic [9][10]

Group 2: Enhancements in User Interaction
- The MCP Apps Extension introduces a standardized approach for declaring UI resources, linking them to tools, and enabling bidirectional communication between embedded interfaces and host applications [4][18]
- The transition from a text-based interaction model to a graphical interface is likened to upgrading a customer service chatbot from text messaging to a smart assistant capable of providing visual dashboards and forms [6][11]

Group 3: Standardization and Community Involvement
- The current limitations of the MCP server in exchanging only text and structured data hinder the presentation of visual information and complex user input [13][18]
- The MCP-UI project, supported by a vibrant community, has demonstrated the feasibility of integrating rich UIs into the MCP architecture, with backing from major companies [15][18]

Group 4: Key Design Decisions
- The MCP Apps Extension emphasizes security, backward compatibility, and the use of pre-declared resources to enhance performance and safety [20][23][24]
- The initial extension specification supports rendering HTML content in a sandboxed iframe, with plans for future support of additional content types [22][24]

Group 5: Community Engagement
- The MCP community encourages participation in the development of the MCP Apps Extension, providing early access SDKs for developers to build applications [26][27]
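The declare-then-link flow described above can be sketched in plain Python. Note this is a minimal sketch under assumptions: the `ui://` URI scheme, the `_meta` field, and the `mimeType` key are illustrative stand-ins, not the normative SEP-1865 wire format.

```python
# Hypothetical sketch of an MCP Apps style declaration: a server
# pre-declares an HTML UI resource, and a tool links to it via metadata
# so the host can render the UI in a sandboxed iframe when results arrive.
# Field names ("mimeType", "_meta") and the "ui://" scheme are assumptions
# for illustration, not the final SEP-1865 schema.

def declare_ui_resource(uri: str, html: str) -> dict:
    """A pre-declared UI resource the host can fetch ahead of tool calls."""
    return {"uri": uri, "mimeType": "text/html", "text": html}

def declare_tool(name: str, ui_uri: str) -> dict:
    """A tool whose results should be rendered by the linked UI template."""
    return {"name": name, "_meta": {"ui": ui_uri}}

dashboard = declare_ui_resource(
    "ui://weather/dashboard",
    "<html><body><div id='chart'></div></body></html>",
)
tool = declare_tool("get_weather", dashboard["uri"])

# Because the resource is pre-declared, the host can resolve and sandbox
# the iframe before the tool is ever invoked (the safety/performance
# rationale noted in Group 4).
assert tool["_meta"]["ui"] == "ui://weather/dashboard"
```

Pre-declaring the resource, rather than streaming arbitrary HTML with each tool result, is what lets the host audit and sandbox UI content up front.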
Could the Deployment Challenges of Humanoid Robots Be Solved by a "Nine-Grid" Hotpot?
机器之心· 2025-11-24 07:27
Core Viewpoint
- The article discusses the challenges and advancements in embodied intelligence, emphasizing the need for leading chip companies like Intel to overcome computational architecture barriers for large-scale applications [2][8].

Group 1: Challenges in Embodied Intelligence
- Recent demonstrations of humanoid robots, such as Tesla's Optimus and Russia's AI robot "Eidol," have faced criticism for their performance, highlighting the gap between theoretical capabilities and practical applications [3][4][7].
- The primary obstacle for these robots entering production lines is the computational platform, which is identified as a significant barrier to the deployment of embodied intelligence [9][12].
- Current humanoid robots typically use a "brain + cerebellum" architecture, where the "brain" handles complex modeling tasks while the "cerebellum" manages real-time control, requiring high-frequency operations [9][10].

Group 2: Computational Requirements
- The demand for computational power has surged due to the integration of motion generation models and multimodal perception, with many companies struggling to meet the required performance levels [10][11].
- Companies often resort to using multiple systems for different tasks, leading to inefficiencies and delays in communication, which can result in operational failures [10][11].
- The return on investment (ROI) is a critical consideration for manufacturers, necessitating robots that are not only effective but also stable, safe, cost-efficient, and energy-efficient [10][11].

Group 3: Intel's Solutions
- Intel proposes a "brain-cerebellum fusion" solution using a single System on Chip (SoC) that integrates CPU, GPU, and NPU, allowing for unified intelligent cognition and real-time control [13][14].
- The Core Ultra processor achieves approximately 100 TOPS of AI computing power while maintaining similar power consumption levels, enabling faster responses and improved privacy [17].
- The integrated GPU provides 77 TOPS of AI computing power, capable of handling large-scale visual and modeling tasks effectively [18].

Group 4: Software and Ecosystem
- Intel offers a comprehensive software stack that includes operating systems, drivers, SDKs, and real-time optimizations, facilitating easier development for hardware manufacturers [24][26].
- The oneAPI framework allows developers to write code once and run it across various hardware platforms, promoting interoperability and efficiency [27].
- Intel's open approach to technology enables companies to adapt existing systems without being locked into specific vendors, fostering innovation in the embodied intelligence sector [31].
AAAI 2026 Oral | Safety Alignment for Large Vision-Language Models via Visual Safety Prompts and Deep Alignment
机器之心· 2025-11-24 07:27
Core Viewpoint
- The article discusses the emerging security risks associated with large visual language models (LVLMs) and introduces a new method called DAVSP (Deep Aligned Visual Safety Prompt) developed by Tsinghua University to enhance the safety alignment of these models against malicious inputs [2][5][7].

Research Background and Issues
- LVLMs have shown impressive performance in multimodal tasks, but their security vulnerabilities are becoming apparent, as attackers can embed malicious intents within images, leading to harmful outputs [5].
- Existing lightweight safety alignment methods, such as adding safety prompts, are insufficient in multimodal scenarios, as attackers can bypass text prompts by hiding threats in images [5][6].
- Recent approaches like ESIII and UniGuard have attempted to improve model resistance to malicious requests but still face significant challenges, including inadequate security and noticeable performance degradation [5][6].

Method and Innovations: DAVSP
- DAVSP introduces two key innovations: Visual Safety Prompt (VSP) and Deep Alignment (DA), to overcome the limitations of previous methods while maintaining model performance [7][9].
- VSP replaces traditional pixel-level perturbations with a trainable border around the input image, enhancing the model's ability to recognize unsafe inputs without compromising the original image features [13][15].
- DA supervises the model's internal activations to improve its ability to distinguish between harmful and benign inputs, thus deepening the model's understanding of what constitutes unsafe input [14][16].

Experimental Results
- DAVSP has been evaluated across multiple benchmarks, demonstrating superior performance in resisting malicious attacks while maintaining model usability [17][18].
- In tests, DAVSP achieved significantly higher resist success rates (RSR) than existing methods, with rates of 98.72% and 99.12% on different datasets [19][21].
- The method shows minimal impact on the model's normal capabilities, with performance metrics comparable to those using only text safety prompts [19][20].

Generalization and Component Importance
- The visual safety prompts developed through DAVSP exhibit generalization capabilities, allowing them to be transferred across different models [20].
- Ablation studies confirm that both VSP and DA are essential for the effectiveness of DAVSP; removing either component leads to a significant drop in resistance to malicious attacks [22].
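The key VSP idea above, a trainable border wrapped around the image rather than a perturbation of its pixels, can be sketched with NumPy. This is a shape-level illustration only: the image size, border width, and framing scheme are assumptions, and a real implementation would optimize the border parameters by gradient descent against the alignment objective.

```python
import numpy as np

# Illustrative sketch of a "visual safety prompt" as a trainable border:
# the original image is embedded intact inside a learned frame, so its
# features are preserved (the DAVSP motivation for avoiding pixel-level
# perturbations). Sizes and border width are assumptions for illustration.

def apply_visual_safety_prompt(image: np.ndarray, border: np.ndarray,
                               width: int) -> np.ndarray:
    """Embed an H x W x C image inside a learned border of the given width."""
    h, w, c = image.shape
    assert border.shape == (h + 2 * width, w + 2 * width, c)
    framed = border.copy()
    framed[width:width + h, width:width + w, :] = image  # interior untouched
    return framed

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
# In training, these border values would be the learnable parameters.
trainable_border = rng.random((240, 240, 3))
framed = apply_visual_safety_prompt(img, trainable_border, width=8)

assert framed.shape == (240, 240, 3)
# The interior is the untouched original image, so its features survive.
assert np.array_equal(framed[8:232, 8:232, :], img)
```

Because only the frame carries learned signal, the model still sees the clean image content, which is consistent with the minimal-usability-impact results reported above.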
A NeurIPS Night Engineers Shouldn't Miss: Registration Opens for Ant Group's Seaside Starlight Tech Party!
机器之心· 2025-11-24 02:39
Core Viewpoint
- The article highlights the participation of Ant Group at NeurIPS 2025, emphasizing its commitment to advancing AI and machine learning through various presentations and networking opportunities [4][6][15].

Group 1: Event Details
- NeurIPS 2025 will take place from December 2 to December 7 in San Diego, USA, with a satellite venue in Mexico City [4].
- Ant Group will host a booth at the conference, inviting attendees to engage in discussions and share insights on cutting-edge research and practical experiences [6][7].

Group 2: Technical Presentations
- Ant Group will present its self-developed general model, the "Ant Ling Model," on December 2 from 16:00 to 17:00, showcasing its latest technological breakthroughs [9][10].
- The Ling 2.0 model series includes reasoning-based language models and multimodal models, with parameter counts ranging from 16 billion to 1 trillion, demonstrating strong performance across various benchmarks [9][10].

Group 3: Networking Opportunities
- The "Academic Coastline · Ant Starlight Technology Party" will be held, providing a platform for deep conversations between Ant Group's technical leaders and industry experts [15].
- Attendees will enjoy a seaside American dinner and receive a winter warmth package, enhancing the networking experience [20].
Karpathy Assembles an LLM "Council": GPT-5.1, Gemini 3 Pro, and Others Become a Top-Tier Brain Trust
机器之心· 2025-11-23 04:06
Core Viewpoint
- The article discusses the shift in content consumption habits towards efficiency, particularly in the context of AI models summarizing information for users, indicating a leap in human capability in the AI era [1][2].

Group 1: AI Model Utilization
- Andrej Karpathy has adopted a habit of using large language models (LLMs) to read and summarize information, reflecting a broader trend among users [1][2].
- Karpathy initiated a project that combines four of the latest LLMs into a council to provide diverse insights and evaluations [3][4].

Group 2: LLM Council Mechanism
- The LLM council operates as a web application where user questions are distributed among multiple models, which then review and rank each other's responses before a "Chairman LLM" generates the final answer [4][11].
- The council's process includes three stages: initial responses from each model, mutual evaluation of those responses, and final output generation by the chairman model [8][9][11].

Group 3: Model Performance and Evaluation
- The models exhibit a willingness to acknowledge superior responses from other models, creating an interesting evaluation dynamic [6][7].
- In evaluations, GPT-5.1 was noted for its rich insights, while Claude was consistently rated lower, although subjective preferences varied among users [7].

Group 4: Future Implications and Open Source
- The LLM council's design may represent a new benchmark for model evaluation, with potential for further exploration in multi-model integration [12][13].
- Karpathy has made the project open source, inviting others to explore and innovate upon it, although he will not provide support for it [14][15].
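The three-stage council flow can be sketched in a few lines of Python, with stub lambdas standing in for real LLM API calls. The aggregation rule here (summing per-critic ranks) and the stub "prefer longer answers" critique are illustrative assumptions, not Karpathy's exact implementation.

```python
# Minimal sketch of the three-stage council: independent answers,
# mutual ranking, chairman synthesis. Stub lambdas replace real LLM
# calls; the rank-sum aggregation is an illustrative assumption.

def council(question, models, chairman):
    # Stage 1: every council member answers the question independently.
    answers = {name: ask(question) for name, ask in models.items()}
    # Stage 2: each member ranks all answers; each stub "critic" here
    # simply prefers longer answers. Ranks are summed (lower is better).
    scores = {name: 0 for name in models}
    for _critic in models:
        ranking = sorted(answers, key=lambda n: len(answers[n]), reverse=True)
        for rank, name in enumerate(ranking):
            scores[name] += rank
    # Stage 3: the chairman synthesizes a final answer from the
    # responses, best-ranked first.
    ordered = sorted(answers, key=lambda n: scores[n])
    return chairman(question, [answers[n] for n in ordered])

models = {
    "gpt-5.1": lambda q: f"[gpt-5.1] detailed answer to: {q} (extra nuance)",
    "gemini-3-pro": lambda q: f"[gemini-3-pro] answer to: {q}",
}
final = council(
    "why is the sky blue?",
    models,
    chairman=lambda q, ranked: f"Chairman synthesis of {len(ranked)} answers; top: {ranked[0]}",
)
```

In a real deployment each stage would be an API call, and answers would be anonymized before mutual ranking so critics cannot favor their own output.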
Results in Ten Minutes: Terence Tao Uses Gemini Deep Think to Help Human Mathematicians Complete an Erdős Problem Proof
机器之心· 2025-11-23 04:06
Core Viewpoint
- The article discusses the Erdős Problems website, which focuses on mathematical research and problem-solving related to the famous mathematician Paul Erdős. It serves as a platform for researchers and enthusiasts to propose, discuss, and solve various mathematical problems across fields such as number theory, combinatorics, and graph theory [1].

Group 1
- The Erdős Problems website collects various mathematical problems proposed by Erdős, covering diverse areas like number theory, combinatorics, and graph theory [1].
- Independent researcher Wouter van Doorn provided a counterexample to Erdős Problem 367, relying on a congruence identity he believes to be valid [5].
- The problem was later submitted to Gemini 2.5 Deep Think by renowned mathematician Terence Tao, who received a complete proof from the AI in about ten minutes [9].

Group 2
- Terence Tao manually converted the AI-generated proof into a more elementary form within half an hour, indicating that the proof could be formalized and verified in Lean [11].
- Two days later, mathematician Boris Alexeev used the Harmonic Aristotle tool to complete the Lean formalization of the problem, a process that took two to three hours [12].
- Terence Tao has been exploring the application of AI tools in mathematics, contributing to various research efforts and proofs, including a recent paper on the topic [13].
A General dLLM Development Framework That Teaches BERT Diffusion-Style Dialogue
机器之心· 2025-11-23 04:06
Core Insights
- The article discusses the development of a diffusion language model (DLM) that enhances the capabilities of the traditional BERT model, demonstrating that a lightweight instruction fine-tuning approach can significantly improve BERT's generative abilities without extensive pre-training [2][18].

Group 1: DLM Framework and Implementation
- The dLLM framework was developed to support BERT Chat, emphasizing ease of use and reproducibility, making it suitable for beginners to understand the key steps in diffusion language modeling [6][3].
- The team has open-sourced the entire training, inference, and evaluation code, providing a "Hello World" example for easy replication and understanding of the diffusion language model [3][6].

Group 2: Model Selection and Training
- ModernBERT was chosen as the base model due to its extended context length of 8,192 tokens and superior performance on non-generative benchmarks, which was confirmed through experiments [8][12].
- The experiments revealed that additional generative pre-training on ModernBERT did not significantly improve performance, indicating that the original masked language model (MLM) pre-training already encoded sufficient language knowledge [10][11].

Group 3: Performance Evaluation
- The ModernBERT-base-chat-v0 (0.1B) and ModernBERT-large-chat-v0 (0.4B) models demonstrated stable performance across various evaluation tasks, with the larger model approaching the performance of Qwen1.5-0.5B [12][14].
- The results showed that even at a smaller model size, the diffusion training approach remains competitive, highlighting the potential of BERT in generating coherent dialogue [12][14].

Group 4: Educational Focus
- The BERT Chat series is positioned as a teaching and research experiment rather than a commercial system, aimed at helping researchers understand the mechanisms of diffusion language models [16][18].
- The team emphasizes transparency in the research process by sharing complete training scripts, training curves, and experimental details, fostering a comprehensive understanding of the diffusion language model research path [16][18].
Will Mid-Training Become the Pre-Training of the Future?
机器之心· 2025-11-23 01:30
Group 1: Core Concepts of Mid-Training
- The concept of "Mid-Training" is emerging as a potential new phase in the training of large language models (LLMs), positioned between pre-training and post-training, with OpenAI establishing a dedicated department for it in July 2024 [5][6][7]
- Mid-Training is described as a vital stage that enhances specific capabilities of LLMs, such as mathematics, programming, reasoning, and long-context extension, while maintaining the foundational abilities of the model [9][10]
- The definition and implementation of Mid-Training are still not universally agreed upon, with various organizations exploring its effects and mechanisms, indicating a growing interest in this area [8][11]

Group 2: Technical Insights and Strategies
- Research from Peking University and Meituan has attempted to clarify the definition of Mid-Training, focusing on data management, training strategies, and model architecture optimization [8][10]
- Key optimization strategies for Mid-Training include data curation to enhance data quality, training strategies like learning rate annealing and context extension, and architecture optimization to improve model performance [10]
- The exploration of Mid-Training has gained momentum since 2025, with increasing references in research papers from institutions like Microsoft and Zero One [6][7]
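One of the concrete mid-training knobs mentioned above, learning-rate annealing, is easy to make precise. The sketch below uses a plain cosine decay from a pre-training-level rate down to a small floor; the specific rates and step budget are illustrative assumptions, not values from any cited paper.

```python
import math

# Sketch of learning-rate annealing for a mid-training stage: cosine
# decay from the pre-training LR (lr_max) to a small floor (lr_min)
# over the mid-training step budget. Rates and budget are assumptions.

def cosine_anneal(step: int, total_steps: int,
                  lr_max: float, lr_min: float) -> float:
    """Cosine-decayed learning rate at `step` out of `total_steps`."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

schedule = [cosine_anneal(s, 1000, 3e-4, 3e-5) for s in range(0, 1001, 250)]
# Starts at the pre-training rate and decays monotonically to the floor.
assert abs(schedule[0] - 3e-4) < 1e-12
assert abs(schedule[-1] - 3e-5) < 1e-12
assert all(a >= b for a, b in zip(schedule, schedule[1:]))
```

Annealing gently from the pre-training rate is one way mid-training can inject new capability-focused data without destabilizing the foundational abilities the bullet points describe.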
PLA General Hospital, Together with Nanjing University, Jilin University, and Other Institutions, Proposes SpineGPT, the First Large Model for Spinal Diagnosis and Treatment
机器之心· 2025-11-22 09:00
Core Insights
- The research led by the PLA General Hospital, in collaboration with top hospitals and universities, has developed the first large model specifically for spinal diagnosis, addressing a significant gap in AI-assisted clinical decision-making [2][3][10].

Group 1: Clinical Challenges and Solutions
- Spinal diseases affect 619 million people globally and are a major cause of disability, yet existing AI models face a "cognitive gap" in clinical decision-making due to a lack of level-aware, multimodal data [2][6].
- The study introduces a comprehensive solution comprising the SpineMed-450K dataset, the first large-scale, traceable spinal instruction dataset, and the SpineBench clinical evaluation benchmark [3][18].

Group 2: Model Performance and Evaluation
- The SpineGPT model, trained on the SpineMed-450K dataset, significantly outperforms leading open-source models, achieving an average score of 87.44% and surpassing models like Qwen2.5-VL-72B and GLM-4.5V [25][26].
- The SpineBench evaluation highlighted the performance gap of existing models: Qwen2.5-VL-72B scored only 79.88% on average, while the proprietary model Gemini-2.5-Pro scored 89.23% [13][25].

Group 3: Data and Methodology
- The SpineMed-450K dataset includes over 450,000 instruction instances sourced from textbooks, surgical guidelines, expert consensus, and de-identified real cases from 11 hospitals, ensuring diverse patient representation [14][16].
- The data generation process followed a rigorous "clinician-in-the-loop" approach, with clinicians involved in the drafting and revision stages to ensure high-quality instruction data [14][24].

Group 4: Clinical Relevance and Future Directions
- SpineBench serves as a clinically significant evaluation framework, assessing AI's performance on fine-grained, anatomy-centered reasoning, which is crucial for practical applications [18][20].
- The research team plans to expand the dataset, train models with more than 7 billion parameters, and incorporate reinforcement learning techniques to further enhance model performance and establish clearer benchmarks [30].
2025 Baoshan Intelligent Robot Industry Conference and Carnival Grandly Opens
机器之心· 2025-11-22 09:00
Core Insights
- The "2025 Baoshan Intelligent Robot Industry Conference and Carnival" was held on November 21, 2025, in Shanghai, focusing on the development of the intelligent robot industry [2][4]
- The event gathered government officials, industry experts, and representatives from various intelligent robot companies to foster collaboration and innovation in the sector [4][6]

Group 1: Event Highlights
- The conference was guided by the Shanghai Municipal Economic and Information Commission and co-hosted by the Baoshan District Government and Shanghai University [2]
- Keynote speeches were delivered by prominent figures, including Chinese Academy of Sciences Academician Chu Junhao, who discussed the integration of robots in the intelligent era [19]
- The launch of the Shanghai Robot Industry Supply Chain Platform aimed to break down resource barriers within the industry [8]

Group 2: Initiatives and Collaborations
- The Baoshan District released an action plan to promote innovation in the humanoid robot industry [6]
- A data collection center for embodied intelligence was established to support the development of intelligent robots [10]
- Several key projects in intelligent robotics and critical components were successfully signed during the event [12]

Group 3: Future Directions
- The conference included discussions on the future of humanoid robots, focusing on open-source and standardization trends [19]
- The event emphasized the importance of AI technology in enhancing the versatility of robots [19]
- The overall goal is to strengthen the ecosystem and drive technological innovation and industrial upgrades in Shanghai and nationwide [22]