机器之心
AAAI 2026 Oral | University of Technology Sydney and Hong Kong PolyU break the "one-size-fits-all" mold: how can federated recommendation deliver per-user image-text fusion?
机器之心· 2025-11-25 04:09
Core Insights
- The article introduces FedVLR, a new framework that addresses the challenges of multimodal integration in federated learning environments while preserving data privacy [2][3][19].

Multimodal Integration Challenges
- Current recommendation systems exploit multimodal information such as images and text, but struggle in federated settings because of privacy constraints [2][5].
- Existing federated recommendation methods either sacrifice multimodal processing for privacy or apply a one-size-fits-all fusion that ignores individual user preferences [2][5].

FedVLR Framework
- FedVLR redefines the decision flow for multimodal integration: heavy computation is offloaded to the server, while a lightweight routing mechanism lets each user control how the data is viewed [3][19].
- It employs a two-layer fusion mechanism that decouples feature extraction from preference integration [8][19].

Server-Side Processing
- The first layer is server-side "multi-view pre-fusion": the server uses powerful pre-trained models to produce a set of candidate fusion views without burdening client devices [9][10].
- The server thus prepares a variety of "semi-finished" views containing high-quality content understanding [10].

Client-Side Personalization
- The second layer is client-side "personalized refinement": a lightweight local mixture-of-experts (MoE) router dynamically computes personalized weights from the user's interaction history [11][12].
- This step runs entirely on the client, so user preference data never leaves the device [12].

Performance and Versatility
- FedVLR is a pluggable layer that integrates seamlessly with existing federated recommendation frameworks such as FedAvg and FedNCF, without increasing communication overhead [16].
- The framework is model-agnostic, significantly improving a variety of baseline models [26].

Experimental Results
- FedVLR was rigorously tested on public datasets across e-commerce and multimedia domains, showing substantial and stable gains on core recommendation metrics such as NDCG and HR [26].
- It performs especially well in sparse-data scenarios, effectively leveraging limited local data to understand item content [26].

Conclusion
- FedVLR not only strengthens recommendation systems but also offers a useful paradigm for federated foundation models, addressing how to exploit large cloud models while keeping data private [19].
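The two-layer split, server-side candidate fusion views and client-side lightweight routing, can be sketched in a few lines of Python. All names, shapes, and the single-linear-layer router below are illustrative assumptions, not FedVLR's actual architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 4, 8  # number of candidate views, embedding size (assumed)

# Server side: K "semi-finished" fusion views per item, precomputed once
# with large pre-trained models, then shipped to clients.
candidate_views = rng.normal(size=(K, d))

# Client side: a tiny router scores each view from a user embedding that
# summarizes local interaction history; this never leaves the device.
user_emb = rng.normal(size=d)
router_w = rng.normal(size=(K, d))  # one linear layer as the MoE gate

gate = softmax(router_w @ user_emb)         # personalized view weights
personalized_item = gate @ candidate_views  # user-specific fused embedding
```

Only the router weights would be trained locally; the candidate views are fixed server artifacts, which is what keeps client compute and communication light.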
Was Gemini 3 "scolded" into existence by Sergey Brin?
机器之心· 2025-11-25 04:09
Core Viewpoint
- The article discusses Google's response to competitive pressure from OpenAI and the resurgence of its AI capabilities, particularly through the return of co-founder Sergey Brin and the development of the Gemini AI model [2][6][24].

Group 1: Google's Initial Response to AI Competition
- When ChatGPT emerged in late 2022, Google, despite its extensive AI experience, was slow to respond, creating the perception that it had lost its competitive edge [2][4].
- Google's hurried launch of Bard was met with criticism over inaccuracies and instability, failing to ease concerns about its market position [5][6].

Group 2: Sergey Brin's Return and Internal Challenges
- Brin's return to Google was pivotal in addressing the internal bureaucracy that had hindered the development of AI tools like Gemini [6][17].
- Brin expressed frustration over internal policies that restricted Gemini's coding capabilities, highlighting the company's bureaucratic obstacles [10][16].

Group 3: Competitive Landscape and Google's Resurgence
- Competition from OpenAI has strongly motivated Google, leading to rapid advances in its AI offerings, most notably the Gemini 3 model [22][24].
- OpenAI CEO Sam Altman acknowledged the pressure from Google's advances, signaling a shift in the sector's competitive dynamics [25][26].

Group 4: Financial and Strategic Positioning
- Google rests on a robust financial foundation, with annual revenue exceeding $300 billion, allowing it to invest in AI strategically without immediate profit concerns [27].
- OpenAI, by contrast, faces significant financial challenges: a $500 billion valuation but projected losses of $7 billion by 2028, raising investor concerns [27][28].

Group 5: Future Directions for OpenAI
- OpenAI plans a new model, Shallotpeat, to fix pre-training deficiencies and strengthen its competitive position [28][30].
- Altman emphasized that OpenAI must excel at research, infrastructure, and product development simultaneously, which mirrors Google's long-standing business model [30].
Sparring with Banana Pro: the homegrown Libcom image composition workbench takes Labubu on a journey
机器之心· 2025-11-25 04:09
Core Viewpoint
- In 2025, AIGC (AI-generated content) has reached new heights, permeating everyday creation across social avatars, e-commerce posters, and film storyboards. General image-editing models such as Nano Banana and Qwen Edit have shown strong capabilities, particularly the popular Nano Banana Pro, which converts text instructions into high-precision images. These models still fall short in specific niches, however, and may not be cost-effective for simple tasks [1].

Group 1: Image Composition Research
- The Niu Li team at Shanghai Jiao Tong University has worked on image composition since late 2018, focusing on object insertion, commonly called "fusion" in the AIGC community. Their work targets common composition defects such as jagged edges, inconsistent lighting, missing shadows and reflections, and improper perspective [1][2].
- From 2018 to 2025 the team has built over 10 datasets, developed more than 30 original models, and published over 25 high-quality academic papers. At the end of 2023 they released the Libcom toolbox, which offers out-of-the-box image composition without training or fine-tuning [2].

Group 2: Libcom Toolbox Features
- In 2025 the Libcom toolbox receives a comprehensive upgrade: a user-friendly image composition workstation covering 12 functionalities spanning generation, detection, and evaluation, distinguishing it from general image-editing models [2][5].
- After registering, users can access detailed descriptions of each feature in the workstation interface. The 12 features fall into six groups [5]:
  1. Basic Composition: alpha blending, Poisson blending
  2. Image Harmonization: color transfer, image harmonization, artistic image harmonization
  3. Background Effect Generation: shadow generation, reflection generation
  4. Analysis Tools: disharmony area detection, object placement rationality heatmap
  5. Scoring Tools: harmony score, object placement rationality score
  6. Advanced Composition: integrates the FLUX-Kontext and InsertAnything models

Group 3: Performance Comparison
- A hands-on exploration with the character Labubu compared the Libcom workstation against Nano Banana Pro. Across scenarios, Libcom integrated Labubu into different backgrounds effectively, while Nano Banana Pro's results were inconsistent [7][14].
- For example, when assessing lighting harmony between Labubu and a forest background, Libcom returned a harmony score of 0.391 (poor harmony), while Nano Banana Pro scored 0.24, pointing to a similar conclusion despite discrepancies between the results [17][18].
- In artistic scenarios, Libcom allowed more creative adjustments, while Nano Banana Pro stayed more conservative. Both models varied in shadow and reflection generation, with Libcom generally more accurate [20][26][27].
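Of the twelve features, basic composition is the easiest to make concrete. A minimal alpha-blending sketch in NumPy (the textbook formulation, not Libcom's actual implementation):

```python
import numpy as np

def alpha_blend(fg, bg, alpha):
    """Composite a foreground object over a background.

    fg, bg: HxWx3 float images in [0, 1]; alpha: HxW matte in [0, 1],
    1 where the inserted object is fully opaque.
    """
    a = alpha[..., None]  # broadcast the matte over the color channels
    return a * fg + (1.0 - a) * bg
```

Poisson blending, the other basic-composition feature, instead solves for the composite whose gradients match the foreground's, which is why it handles lighting mismatch at the seam better than plain alpha blending.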
Major energy-efficiency breakthrough for a new AI chip, published in a Nature sub-journal
机器之心· 2025-11-25 00:02
Core Viewpoint
- The research highlights the significant energy consumption and area occupation of analog-to-digital converters (ADCs) in compute-in-memory (CIM) systems, which undermines the energy-efficiency advantages CIM technology promises [6][7].

Group 1: Background and Challenges
- The AI wave has raised concerns over power consumption, particularly in traditional architectures where data transfers between CPU and memory are energy-intensive [3].
- CIM technology is seen as a potential solution that eliminates data-transfer bottlenecks by performing calculations directly in memory [4].
- However, the ADCs needed to convert analog signals back to digital impose a heavy energy and area cost, consuming up to 87% of total energy and 75% of chip area in advanced CIM systems [6][7].

Group 2: Limitations of Traditional ADCs
- Traditional ADCs use uniform quantization, which does not match the diverse output-signal distributions of neural networks, causing precision loss [12].
- To compensate, designers often resort to higher-precision ADCs, whose power and area grow exponentially, creating a vicious cycle [13].

Group 3: Innovative Solutions
- The research team proposes using memristors to build adaptive quantization units (Q-cells) with programmable quantization boundaries, improving ADC efficiency [15][18].
- This adaptive quantization markedly improves accuracy: a VGG8 network reaches 88.9% accuracy at 4-bit precision, versus 52.3% with traditional methods [21].

Group 4: System-Level Benefits
- The new memristor-based ADC achieves a 15.1x improvement in energy efficiency and a 12.9x reduction in area compared with state-of-the-art designs [25].
- Integrated into a CIM system running VGG8, the ADC module's share of energy consumption drops from 79.8% to 22.5% and its area occupation from 47.6% to 16.9%, yielding overall system energy savings of 57.2% [26][28].
- This innovation addresses the ADC bottleneck in mixed-signal CIM systems, paving the way for more efficient and accurate next-generation AI hardware [30].
Just in: Claude Opus 4.5, the new king of agents and coding, makes a stunning debut with prices cut by two-thirds
机器之心· 2025-11-24 23:49
Core Viewpoint
- Anthropic has officially released its latest model, Claude Opus 4.5, billed as one of the most advanced AI models available today, with significant improvements in programming, agent capabilities, and everyday tasks such as handling spreadsheets and presentations [1][2].

Pricing and Accessibility
- Claude Opus 4.5 is available via the Claude app, the API, and major cloud platforms, priced at $5 per million input tokens and $25 per million output tokens, a two-thirds reduction from the previous version, Opus 4.1 [5][6].

Performance Metrics
- In benchmark tests, Claude Opus 4.5 achieved state-of-the-art (SOTA) results, surpassing competitors such as GPT-5.1-Codex-Max and Gemini 3 Pro on various software engineering tasks [2][12].
- The model scored higher than every human candidate on a challenging take-home exam designed to assess technical skill under time pressure, indicating superior technical capability [11].

Enhanced Capabilities
- Claude Opus 4.5 improves across multiple domains, including visual reasoning, mathematical reasoning, and problem solving, reaching SOTA levels in agentic programming and tool use [11][12][20].
- Its ability to solve complex coding problems improved by 10.6% over Sonnet 4.5 [14].

Developer Tools and Features
- The Claude developer platform now supports longer-running agents and an improved user experience, including multiple concurrent sessions in the desktop applications [7][8].
- A new "effort" parameter in the API lets developers trade off speed, cost, and model capability, yielding significant reductions in token usage at comparable performance [30][34].

Safety and Alignment
- Claude Opus 4.5 features robust alignment and safety, ranking among the models most resilient to prompt-injection attacks, which attempt to mislead models into harmful behavior [36][39].
- The model shows substantial progress in mitigating concerning behaviors, improving its reliability across applications [36][39].
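The quoted pricing is easy to sanity-check with a small cost calculation, assuming Opus 4.1's prior list price of $15/$75 per million input/output tokens (the article itself states only the new prices and the two-thirds reduction):

```python
def cost_usd(tokens_in, tokens_out, price_in, price_out):
    """API cost given per-million-token prices."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Opus 4.5 prices from the announcement; Opus 4.1 prices assumed.
new = cost_usd(2_000_000, 500_000, 5.0, 25.0)   # $10.0 in + $12.5 out
old = cost_usd(2_000_000, 500_000, 15.0, 75.0)  # $30.0 in + $37.5 out

reduction = (old - new) / old  # exactly 2/3 under these assumptions
```

Under the assumed prior prices, any workload mix comes out exactly one-third of the old cost, consistent with the headline two-thirds cut.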
Luma AI closes a $900 million Series C led by HUMAIN and will partner to build a 2 GW AI supercluster in Saudi Arabia
机器之心· 2025-11-24 09:30
Core Insights
- Luma AI has raised $900 million in Series C funding to accelerate its push toward multimodal AGI that can simulate reality and assist humans in the physical world [1][3][4].
- The partnership with HUMAIN will build Project Halo, a 2 GW AI supercluster in Saudi Arabia that supports the training of large-scale world models [3][4][5].
- The collaboration is expected to unlock major opportunities across entertainment, marketing, education, and robotics, sectors potentially worth trillions [1][4].

Company Overview
- Luma AI focuses on multimodal general intelligence capable of generating, understanding, and manipulating the physical world [8].
- Its flagship model, Ray3, has been deployed in studios, advertising agencies, and brands, including integration into Adobe's global products [7][8].
- HUMAIN, a PIF company, provides AI capabilities across four core areas: next-generation data centers, high-performance infrastructure, advanced AI models, and transformative AI solutions [9].

Funding and Infrastructure
- The $900 million round was led by HUMAIN, with participation from AMD Ventures and earlier investors including Andreessen Horowitz and Matrix Partners [1][3].
- The Project Halo supercluster represents a significant leap in multimodal AI infrastructure, enabling training on petascale multimodal data [5][6].
- With the new funding, Luma AI plans to extend its leadership in entertainment and advertising into simulation, design, and robotics [7].

Strategic Goals
- The partnership aims to build AI systems that learn from vast amounts of data, estimated at quadrillions of tokens, to better understand and simulate reality [5][6].
- HUMAIN's investment philosophy emphasizes building a complete value chain to support the next wave of AI development [5].
- The collaboration is set to establish new benchmarks for how capital, computing power, and capabilities can be integrated in the AI sector [5].
NeurIPS 2025 | UniLumos: a unified image and video relighting framework with physical feedback, delivering realistic light-and-shadow reshaping at a 20x speedup!
机器之心· 2025-11-24 09:30
Core Insights
- The article discusses advances in image and video relighting, focusing on UniLumos, a unified framework that improves both physical consistency and computational efficiency in relighting tasks [3][37].

Group 1: Challenges in Existing Methods
- Current diffusion-based methods face two fundamental problems: a lack of physical consistency and an inadequate system for evaluating relighting quality [11][12].
- Existing approaches often optimize in semantic latent spaces, producing physical inconsistencies such as misaligned shadows, overexposed highlights, and incorrect occlusions [11][15].

Group 2: Introduction of UniLumos
- UniLumos addresses these challenges with a unified framework for image and video relighting that preserves scene structure and temporal consistency while achieving high-quality results [17][37].
- The framework incorporates geometric feedback from RGB space, such as depth and normal maps, to align lighting effects with scene structure, significantly improving physical consistency [4][22].

Group 3: Innovations and Methodology
- Key innovations include the geometric feedback mechanism for physical consistency and a structured six-dimensional lighting description for fine-grained control and evaluation of lighting effects [18][22].
- The training dataset, LumosData, is built by extracting high-quality relighting samples from real-world videos [20][21].

Group 4: Performance and Efficiency
- UniLumos achieves state-of-the-art results across metrics for visual fidelity, temporal consistency, and physical accuracy compared with baseline models [27][28].
- It delivers a 20x speedup in inference while maintaining high-quality output, making it far more efficient than existing methods [33][38].

Group 5: Evaluation and Results
- The LumosBench evaluation framework provides automated, interpretable assessment of relighting accuracy across six dimensions, showcasing UniLumos's fine-grained control over lighting attributes [22][29].
- Qualitative results show that UniLumos produces more realistic lighting and better temporal consistency than baseline methods [31][33].
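The geometric feedback idea, checking that a relit frame's depth and normal maps still agree with the source scene, can be sketched as a simple consistency loss. The exact loss form UniLumos uses is not given in the summary, so this is an illustrative stand-in with an assumed L1-plus-cosine formulation:

```python
import numpy as np

def geometric_feedback_loss(pred_depth, ref_depth, pred_normals, ref_normals):
    """Penalize relit outputs whose geometry cues drift from the input scene.

    pred_depth, ref_depth: HxW depth maps; pred_normals, ref_normals:
    HxWx3 unit-length normal maps estimated from the RGB frames.
    """
    depth_term = np.abs(pred_depth - ref_depth).mean()  # L1 on depth
    cos = (pred_normals * ref_normals).sum(axis=-1)     # per-pixel cosine
    normal_term = (1.0 - cos).mean()                    # 0 when aligned
    return depth_term + normal_term
```

A perfectly structure-preserving relight scores zero; misaligned shadows or warped surfaces show up as depth or normal drift and are penalized.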
Jointly backed by OpenAI and Anthropic: the MCP Apps proposal is released, bidding farewell to text-only interaction
机器之心· 2025-11-24 07:27
Core Insights
- The MCP protocol is evolving to include interactive user interface (UI) support via the MCP Apps proposal, extending AI agents' interaction capabilities beyond text and structured data [2][10][11].

Group 1: MCP Apps Proposal
- The MCP Apps proposal (SEP-1865) aims to standardize support for interactive UIs, allowing MCP servers to deliver visual interfaces directly to hosts [2][4].
- The proposal has drawn positive community feedback, driven by contributions from key players including OpenAI and Anthropic [9][10].

Group 2: Enhancements in User Interaction
- The MCP Apps extension introduces a standard way to declare UI resources, link them to tools, and enable bidirectional communication between embedded interfaces and host applications [4][18].
- The shift from text-based interaction to a graphical interface is likened to upgrading a customer-service chatbot from text messaging to a smart assistant that can present visual dashboards and forms [6][11].

Group 3: Standardization and Community Involvement
- The MCP server's current limitation to exchanging only text and structured data hinders the presentation of visual information and complex user input [13][18].
- The MCP-UI project, backed by a vibrant community and major companies, has demonstrated the feasibility of integrating rich UIs into the MCP architecture [15][18].

Group 4: Key Design Decisions
- The MCP Apps extension emphasizes security, backward compatibility, and pre-declared resources for performance and safety [20][23][24].
- The initial extension specification supports rendering HTML content in a sandboxed iframe, with additional content types planned for the future [22][24].

Group 5: Community Engagement
- The MCP community encourages participation in developing the MCP Apps extension and provides early-access SDKs for building applications [26][27].
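The core mechanics, a server pre-declaring an HTML UI resource and a tool referencing it, can be sketched as plain data. The field names and `ui://` scheme below are illustrative of the proposal's direction, not the final SEP-1865 wire format:

```python
# Hypothetical MCP Apps declarations, shown as plain dicts.

# 1. The server pre-declares a UI resource (sandboxed HTML in the
#    initial spec) instead of sending markup ad hoc at call time.
ui_resource = {
    "uri": "ui://dashboard/main",  # illustrative scheme
    "mimeType": "text/html",
    "text": "<html><body><div id='chart'></div></body></html>",
}

# 2. A tool links to that resource so the host can fetch, sandbox,
#    and render it when the tool is invoked.
tool = {
    "name": "show_dashboard",
    "description": "Render an interactive dashboard in the host UI",
    "meta": {"ui_resource": ui_resource["uri"]},
}
```

Pre-declaration is what enables the security and performance properties the proposal emphasizes: the host can audit and cache the UI before any tool call runs.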
Could a nine-grid hotpot dinner solve the deployment puzzle of humanoid robots?
机器之心· 2025-11-24 07:27
Core Viewpoint
- The article discusses the challenges and advances in embodied intelligence, emphasizing the need for leading chip companies such as Intel to remove computational-architecture barriers before large-scale deployment is possible [2][8].

Group 1: Challenges in Embodied Intelligence
- Recent demonstrations of humanoid robots, such as Tesla's Optimus and Russia's AI robot "Eidol", have drawn criticism for their performance, highlighting the gap between theoretical capability and practical application [3][4][7].
- The primary obstacle to getting such robots onto production lines is the computational platform, identified as a major barrier to deploying embodied intelligence [9][12].
- Current humanoid robots typically use a "brain + cerebellum" architecture: the "brain" handles complex modeling tasks while the "cerebellum" manages real-time control, which requires high-frequency operation [9][10].

Group 2: Computational Requirements
- Demand for compute has surged with the integration of motion-generation models and multimodal perception, and many companies struggle to reach the required performance levels [10][11].
- Companies often resort to separate systems for different tasks, causing inefficiency and communication delays that can lead to operational failures [10][11].
- Return on investment (ROI) is a critical consideration for manufacturers, so robots must be not only capable but also stable, safe, cost-efficient, and energy-efficient [10][11].

Group 3: Intel's Solutions
- Intel proposes a "brain-cerebellum fusion" solution: a single system-on-chip (SoC) integrating CPU, GPU, and NPU that unifies intelligent cognition and real-time control [13][14].
- The Core Ultra processor delivers roughly 100 TOPS of AI compute at similar power-consumption levels, enabling faster responses and improved privacy [17].
- The integrated GPU alone provides 77 TOPS of AI compute, enough to handle large-scale visual and modeling tasks effectively [18].

Group 4: Software and Ecosystem
- Intel offers a comprehensive software stack, including operating systems, drivers, SDKs, and real-time optimizations, easing development for hardware manufacturers [24][26].
- The oneAPI framework lets developers write code once and run it across various hardware platforms, promoting interoperability and efficiency [27].
- Intel's open approach lets companies adapt existing systems without vendor lock-in, fostering innovation in the embodied-intelligence sector [31].
AAAI 2026 Oral | Safety alignment for large vision-language models via visual safety prompts and deep alignment
机器之心· 2025-11-24 07:27
Core Viewpoint
- The article discusses emerging security risks in large vision-language models (LVLMs) and introduces DAVSP (Deep Aligned Visual Safety Prompt), a method developed at Tsinghua University to strengthen the safety alignment of these models against malicious inputs [2][5][7].

Research Background and Issues
- LVLMs perform impressively on multimodal tasks, but their security weaknesses are becoming apparent: attackers can embed malicious intent within images to elicit harmful outputs [5].
- Existing lightweight safety-alignment methods, such as adding text safety prompts, fall short in multimodal scenarios because attackers can bypass them by hiding threats in images [5][6].
- Recent approaches such as ESIII and UniGuard attempt to improve resistance to malicious requests but still face significant challenges, including inadequate security and noticeable performance degradation [5][6].

Method and Innovations: DAVSP
- DAVSP introduces two key innovations, the Visual Safety Prompt (VSP) and Deep Alignment (DA), to overcome the limitations of earlier methods while preserving model performance [7][9].
- VSP replaces traditional pixel-level perturbations with a trainable border around the input image, improving the model's ability to recognize unsafe inputs without corrupting the original image features [13][15].
- DA supervises the model's internal activations to sharpen its ability to distinguish harmful from benign inputs, deepening its understanding of what constitutes unsafe input [14][16].

Experimental Results
- DAVSP was evaluated across multiple benchmarks, showing superior resistance to malicious attacks while preserving model usability [17][18].
- In tests, DAVSP achieved significantly higher resist success rates (RSR) than existing methods, reaching 98.72% and 99.12% on different datasets [19][21].
- The method has minimal impact on the model's normal capabilities, with performance metrics comparable to using text safety prompts alone [19][20].

Generalization and Component Importance
- Visual safety prompts trained with DAVSP generalize, transferring across different models [20].
- Ablation studies confirm that both VSP and DA are essential: removing either component causes a significant drop in resistance to malicious attacks [22].
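The VSP mechanism, a trainable border rather than a pixel-level perturbation, is simple to picture. A minimal NumPy sketch with assumed shapes and a hypothetical helper name (in the actual method the border values are optimized by gradient descent, with Deep Alignment supervising internal activations):

```python
import numpy as np

def apply_visual_safety_prompt(image, border_rgb, width):
    """Frame `image` with a learned border; interior pixels are untouched.

    image: HxWx3 float array; border_rgb: length-3 learned values;
    width: border thickness in pixels.
    """
    h, w, c = image.shape
    out = np.empty((h + 2 * width, w + 2 * width, c), dtype=image.dtype)
    out[...] = border_rgb                          # fill the frame region
    out[width:width + h, width:width + w] = image  # paste original intact
    return out
```

Because the original pixels are preserved exactly, benign-task performance is insulated from the safety signal, which is the key contrast with perturbation-based visual prompts.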