机器之心
Apple proposes a new form of backpropagation: a single iPhone 15 Pro Max can fine-tune an LLM
机器之心· 2025-10-30 01:41
Core Viewpoint
- Apple has demonstrated the feasibility of fine-tuning large language models (LLMs) on iPhones with a new method called Memory-Efficient Backpropagation (MeBP), which offers a better trade-off between memory usage and computation time than existing methods [1][4].

Summary by Sections

Introduction
- The article discusses Apple's recent paper on MeBP, which enables model fine-tuning on resource-constrained mobile devices such as the iPhone 15 Pro Max [1][3].

Methodology
- MeBP fine-tunes LLMs with LoRA, aiming to keep memory usage below 1 GB, the budget recommended by PocketLLM [4].
- Fine-tuning with MeBP consists of three main steps: compressing the base model weights, implementing gradient checkpointing, and building an efficient runtime for executing the training graph [5][10].

Model Weight Compression
- The team applied symmetric INT4 quantization to the non-LoRA parameters, including embeddings, to reduce disk usage [7][10].

Gradient Checkpointing
- The LLM is divided into blocks so that memory consumption during backpropagation stays within device limits; automatic differentiation generates a backward graph for each block [8][9].

Runtime Implementation
- The MeBP runtime minimizes memory usage by memory-mapping the compressed model weights and decompressing them only on demand during training [15][16].

Experimental Performance
- The team compared MeBP with MeZO, the only previously known optimization method for mobile LLM fine-tuning, using server-side simulations and on-device performance evaluations [18][24].
- Experiments covered models from 0.5B to 4B parameters, using loss and next-token accuracy as evaluation metrics [20].

Utility Comparison
- Zeroth-order (ZO) optimization converged more slowly than first-order (FO) optimization, and MeBP significantly outperformed ZO in both convergence speed and computational efficiency [23].

Performance Comparison
- MeBP was implemented in Swift on an iPhone 15 Pro Max with 8 GB of RAM; its computation time per gradient step was 43% to 94% longer than MeZO's, but it converged faster overall because it requires far fewer steps [24][28].
- MeBP's memory usage was slightly higher than MeZO's in the worst case, yet overall training memory was roughly 10 times smaller than in previous mobile implementations [28].

Conclusion
- All tested LLMs could be fine-tuned efficiently within 1 GB of memory, making them suitable for background training on mobile devices [28].
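The summary describes MeBP's gradient checkpointing only at a high level: store block boundaries in the forward pass, then recompute each block's intermediate activations during the backward pass. A minimal sketch of that idea (a toy chain of tanh "blocks" of my own construction, not Apple's training graph) could look like:

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(6, 6)) * 0.3 for _ in range(4)]  # four toy "blocks"
x0 = rng.normal(size=(6,))

# Forward pass: store ONLY each block's boundary activation (the
# checkpoints), discarding intermediate pre-activations.
checkpoints = [x0]
for W in Ws:
    checkpoints.append(np.tanh(W @ checkpoints[-1]))

# Backward pass for loss = 0.5 * ||y||^2: walk blocks in reverse,
# recomputing each block's pre-activation from its checkpointed input,
# so peak memory stays bounded by a single block.
grad_x = checkpoints[-1]              # dL/dy
grads_W = [None] * len(Ws)
for i in reversed(range(len(Ws))):
    x_in = checkpoints[i]
    z = Ws[i] @ x_in                  # recomputed, was discarded in forward
    dz = grad_x * (1.0 - np.tanh(z) ** 2)
    grads_W[i] = np.outer(dz, x_in)   # gradient for this block's weights
    grad_x = Ws[i].T @ dz             # propagate to the previous block
```

In a real LLM each "block" would be a transformer layer and only the LoRA parameters would receive gradients; the recompute-inside-backward structure is the same.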
Just now: Cursor 2.0 debuts with its in-house model Composer, no longer just a "wrapper"
机器之心· 2025-10-30 01:41
Core Insights
- Cursor has officially launched its own large language model, Composer, marking a significant evolution from a platform reliant on third-party models to an AI-native platform [2][4][3].
- The release of Composer is seen as a breakthrough that strengthens Cursor's capabilities in coding and software development [4][3].

Summary by Sections

Composer Model
- Composer is a frontier model that, while not as intelligent as top models like GPT-5, is four times faster than models of comparable intelligence [6].
- In benchmark tests, Composer generated 250 tokens per second, double the speed of leading fast-inference models and four times that of comparably capable systems [9].
- The model targets low-latency coding tasks, with most interactions completed within 30 seconds; early testers found its rapid iteration user-friendly [11].
- Composer was trained with a rich tool set, including semantic search over entire codebases, which significantly improves its ability to understand and work with large codebases [12].
- It is a mixture-of-experts (MoE) model optimized for software engineering through reinforcement learning, and it can generate and understand long contexts [16][19].

Cursor 2.0 Update
- Cursor 2.0 introduces a multi-agent interface that lets users run several AI agents at once, boosting productivity by assigning agents to different parts of a project [21][24].
- The new version is agent-centric rather than organized around a traditional file structure, letting users focus on desired outcomes while agents manage the details [22].
- Cursor 2.0 also targets the new bottlenecks of code review and change testing, enabling quicker review of agent changes and deeper code exploration when needed [25].

Infrastructure and Training
- Training large MoE models requires significant infrastructure investment; Cursor built a customized training environment for asynchronous reinforcement learning on PyTorch and Ray [28].
- The team implemented MXFP8 MoE kernels to train efficiently across thousands of NVIDIA GPUs, achieving faster inference without post-training quantization [28].
- The Cursor Agent framework lets models use tools for code editing, semantic search, and terminal commands, which requires robust cloud infrastructure to support concurrent operation [28].

Community Feedback
- The major update has drawn significant attention, with early users offering mixed feedback that highlights both positive experiences and areas for improvement [30][31].
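The summary names a mixture-of-experts architecture without showing what expert routing means. The following is a generic top-k MoE routing sketch (my own toy illustration of the standard technique, not Cursor's implementation): a router scores the experts, only the k best are run, and their outputs are combined with softmax weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16
# Toy experts: one matrix each (real MoE layers use small FFNs).
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
router = rng.normal(size=(n_experts, d)) * 0.1

def moe_forward(x):
    logits = router @ x
    chosen = np.argsort(logits)[-top_k:]            # route to the k best experts
    w = np.exp(logits[chosen] - logits[chosen].max())
    w /= w.sum()                                    # softmax over the chosen experts
    # Only top_k of the n_experts run: compute scales with k, capacity with n.
    return sum(wi * (experts[e] @ x) for wi, e in zip(w, chosen))

y = moe_forward(rng.normal(size=(d,)))
```

The appeal for a coding model like Composer is exactly this split: total parameters (and capacity) grow with the number of experts, while per-token compute only grows with k.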
Sign up | EMNLP × TalentAI50 networking night: no awkward small talk, only resonance!
机器之心· 2025-10-29 11:02
Core Insights
- The article discusses the transformation of large models from "creators" to "thinkers" in Natural Language Processing (NLP), indicating a shift toward more sophisticated reasoning, deeper cognition, and practical value creation [1].

Event Overview
- EMNLP 2025, a major international conference in the NLP field, will take place next week in Suzhou, China, gathering top scholars and innovative ideas [1].
- A special event, the "EMNLP 2025 TalentAI50 Meetup," will be held, limited to 50 participants and aimed at fostering free communication and idea exchange among young NLP talents [1][10].

Guest Speakers
- The event will feature several prominent young scholars, including:
  - Wu Yi, Assistant Professor at Tsinghua University
  - Liu Weiyang, Assistant Professor at The Chinese University of Hong Kong
  - Li Zhuang, Assistant Professor at RMIT University
  - Liu Dongrui, Young Researcher at Shanghai AI Laboratory
  - Zeng Min, Algorithm Engineer at vivo AI Research Institute
- Additional guests from overseas companies such as Google and Meta are also expected to attend [3][4].

Event Format
- The Meetup will not follow a traditional presentation format; instead it centers on informal discussion over food and drinks, letting participants engage freely with peers who understand their research [5].
- The event is designed to create a relaxed atmosphere conducive to networking and collaboration [5].

Event Details
- Date and Time: November 6, 18:00-21:00
- Location: near the Suzhou International Expo Center
- Scale: limited to 50 participants [6]

Schedule
- 17:30-18:00: Registration
- 18:00-18:10: Opening
- 18:10-18:30: Interactive Experience
- 18:30-21:00: Dinner & Free Networking [7]

Future Plans
- The "TalentAI50 Meetup" series aims to connect promising young AI talents at major academic conferences, with future events planned around closed-door discussions, thematic salons, and industry connections [10].
Oxford VGG, HKU, and SJTU release ELIP: enhanced vision-language pre-training for multimodal image retrieval that surpasses CLIP and others
机器之心· 2025-10-29 11:02
Core Insights
- The article discusses the significance of multimodal image retrieval in computer vision and multimodal machine learning, highlighting the use of large-scale pre-trained models like CLIP and SigLIP for their zero-shot capabilities [2].
- A new method called ELIP (Enhanced Language-Image Pre-training) is proposed to improve the text-to-image retrieval performance of vision-language models; the work was nominated for best paper at the IEEE International Conference on Content-Based Multimedia Indexing [2].

Method Overview
- ELIP first ranks images with standard CLIP/SigLIP, then re-ranks the top-k candidates using a simple MLP mapping network that injects text features into the image encoder [5].
- ELIP can be applied to various large models, including CLIP, SigLIP, SigLIP-2, and BLIP-2, yielding the variants ELIP-C, ELIP-S, ELIP-S-2, and ELIP-B respectively [5].

Challenges in Academic Research
- Pre-training vision-language models is typically an industrial-scale endeavor, but the proposed method can be trained with limited resources, such as two GPUs [8].

Innovations in Model Architecture
- The weights of the large image and text encoders are frozen; only the MLP mapping network, three linear layers with GELU activations, is trained [9].
- Training maps text features into the visual feature space to guide image encoding, using the InfoNCE loss for CLIP and the sigmoid loss for SigLIP [9].

Innovations in Training Data
- To cope with limited GPU resources, ELIP builds hard-sample training batches from CLIP feature similarities, strengthening the model's discriminative ability [13].
- The article gives examples of how similar features are grouped to form hard training samples [15].

New Evaluation Datasets
- Beyond standard benchmarks such as COCO and Flickr, two new out-of-distribution (OOD) datasets, Occluded COCO and ImageNet-R, are introduced to evaluate performance under shifted conditions [18].

Experimental Results
- Models equipped with ELIP show significant gains in image retrieval: ELIP-S reaches recall@1 of 61.03 on COCO, versus 54.21 for SigLIP [21].
- ELIP-B applied to BLIP-2 also improves performance, surpassing the latest Q-Pert method [20].

Attention Mechanism Observations
- With a related text query, ELIP sharpens the CLS token's attention on the relevant image regions, improving information extraction [23].
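The two-stage retrieval pipeline described above can be sketched in a few lines. This is my own illustration under stated assumptions (random features, a 3-layer GELU MLP as the summary describes; in the real method the mapped text vectors are injected into the image encoder as prompts, whereas here they simply perturb the query for the re-scoring step):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_images, top_k = 32, 100, 10

# L2-normalized image and text features, standing in for CLIP/SigLIP outputs.
img = rng.normal(size=(n_images, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(d,))
txt /= np.linalg.norm(txt)

# Stage 1: rank ALL images by cosine similarity, keep the top-k candidates.
stage1 = np.argsort(img @ txt)[::-1][:top_k]

# Stage 2: a small MLP (three linear layers with GELU, per the article)
# maps the text feature to a guidance vector; only the k candidates are
# re-scored, so the expensive step stays cheap.
def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))

W1, W2, W3 = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
guide = W3 @ gelu(W2 @ gelu(W1 @ txt))
reranked = stage1[np.argsort(img[stage1] @ (txt + guide))[::-1]]
```

The design point is the split itself: the frozen encoders do one cheap global pass, and the trainable re-ranker touches only k candidates.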
Parameter space symmetry: a unified geometric framework for deep learning theory
机器之心· 2025-10-29 09:25
Core Insights
- The article discusses the evolution of deep learning models from millions to billions of parameters, highlighting the lack of a systematic understanding of why they work so well [2].
- A key focus is parameter space symmetry: many distinct parameter configurations yield the same model function, which complicates the analysis of optimization and generalization [4][6].

Group 1: Parameter Space Symmetry
- Parameter space symmetry allows different parameter combinations to produce identical outputs, exemplified by interchanging neurons in a hidden layer [4][6].
- Mathematically, the symmetry consists of transformations that leave the loss function invariant; these form a group whose orbits partition the parameter space into equivalent configurations [6].

Group 2: Types of Symmetry
- Beyond discrete symmetries, most architectures also exhibit continuous symmetries, such as rescaling and linear transformations, that preserve the function [8].
- Complex architectures like Transformers combine the symmetries of their components, including multi-head attention [8].

Group 3: Impact on the Loss Landscape
- Symmetry creates a complex yet structured optimization space: continuous symmetries stretch isolated minima into flat manifolds, which affects how flatness-based generalization metrics should be interpreted [10].
- Phenomena such as "mode connectivity," where independently trained models are linked by low-loss paths, are partially attributed to continuous symmetries [10].

Group 4: Optimization Methods
- Symmetry produces "equal loss, different gradients," suggesting new algorithms that search within an equivalence orbit for points with better gradients [15][19].
- Some optimization strategies exploit symmetry as a degree of freedom, while others remove it as redundancy; either way, it matters for algorithm design [19].

Group 5: Learning Dynamics
- Continuous symmetries correspond to conserved quantities that remain constant during training, shedding light on training stability and the implicit bias of optimization [21][23].
- The structure of parameter space symmetry shapes the statistical distribution of learning trajectories and outcomes [23].

Group 6: Connections Across Spaces
- Parameter space symmetry is interconnected with data space and internal representation space; model parameters often inherit the symmetry of the data distribution [27][28].
- Emerging directions such as Weight Space Learning treat weights, together with their symmetries, as a new kind of data structure for analyzing and generating model properties [28][29].

Group 7: Future Directions
- The ubiquity of parameter space symmetry offers a new mathematical language for deep learning, linking complex model behaviors to established tools from group theory and geometry [30].
- This perspective is influencing practical areas from optimization acceleration to model fusion and new model design, turning theoretical concepts into actionable algorithmic principles [30].
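Both kinds of symmetry named above can be verified numerically in a few lines. A minimal sketch with a toy two-layer ReLU network of my own (not from the article): permuting hidden neurons is a discrete symmetry, and rescaling incoming weights by c while rescaling outgoing weights by 1/c is a continuous one (it relies on ReLU being positively homogeneous).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # hidden x input
W2 = rng.normal(size=(3, 8))   # output x hidden
x = rng.normal(size=(4,))

relu = lambda z: np.maximum(z, 0.0)
f = lambda A, B: B @ relu(A @ x)   # f(x) = W2 relu(W1 x)

y_orig = f(W1, W2)

# Discrete symmetry: permute hidden units (rows of W1, columns of W2).
perm = rng.permutation(8)
y_perm = f(W1[perm], W2[:, perm])
assert np.allclose(y_orig, y_perm)      # same function, different parameters

# Continuous symmetry: relu(c*z) = c*relu(z) for c > 0, so scaling
# W1 by c and W2 by 1/c leaves the output unchanged.
c = 2.5
y_scaled = f(W1 * c, W2 / c)
assert np.allclose(y_orig, y_scaled)
```

Every such transformation maps one trained network to another with identical loss, which is exactly why minima come in orbits rather than as isolated points.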
FlowithOS, the world's first AI Agent operating system, outscores Atlas in benchmarks; netizens say it "killed" the competition
机器之心· 2025-10-29 09:25
Core Viewpoint
- Flowith has launched FlowithOS, the world's first operating system designed specifically for AI Agents, which it expects to revolutionize how humans interact with networks, information, and services [2][4].

Product Overview
- FlowithOS is a standalone application combining an agentic workspace and a web browser, eliminating the boundaries previously defined by applications and web pages [3][4].
- It executes tasks with a reported 97.7% success rate, surpassing existing AI Agent products [3].

Unique Features
- Unlike traditional AI tools, FlowithOS autonomously navigates across multiple web pages based on user instructions, understanding both visual and coded content to perform a variety of actions [3][4].
- The system is designed to improve its memory and skills with each user interaction, providing increasingly personalized service [14].

Performance Metrics
- In benchmark tests, FlowithOS achieved an average accuracy of 95.4%, outperforming top competitors including ChatGPT Atlas, which scored 75.7% [12].

User Experience
- Users describe FlowithOS as a system that can think for itself, automating tasks such as content creation and social-media interaction [5][6][17].
- It integrates visual reasoning and execution in real time, turning user intentions into actions seamlessly [6].

Market Position
- FlowithOS is positioned as a significant competitor to OpenAI's ChatGPT Atlas, with the potential to redefine the AI application landscape [6][14].
- The product is in public beta and runs on both macOS and Windows, whereas ChatGPT Atlas is limited to macOS [19].

Company Background
- Flowith was founded in 2023 by a team of ten young entrepreneurs led by founder Derek, who has nine years of entrepreneurial experience [20].
- The company has previously launched several well-received AI products, including Flowith and Flowith Neo [21][22].
The first humanoid robot to truly enter the home is here: $20,000, preorders open
机器之心· 2025-10-29 09:25
Core Viewpoint
- The article discusses the launch and features of NEO Beta, a humanoid robot for home use from 1X Technologies, designed to help with household tasks and provide companionship while emphasizing safety and user control over data [2][57].

Product Features
- NEO Beta is a bipedal robot, 168 cm tall and 30 kg, designed to look safe and non-threatening [9][10].
- Its exterior is a 3D-printed honeycomb mesh fabric that combines structural strength with flexibility, giving it a gentle touch [14][19].
- NEO has 22 degrees of freedom in its hands and 55 degrees of freedom overall, enough for tasks like folding clothes and cleaning [19][22].
- A tendon-driven actuation system makes its movement quieter and smoother than traditional gear drives, and it can lift up to 70 kg [22][24].
- A custom 0.75 kWh battery provides about 4 hours of runtime at a daily operating cost of less than $1 [40].

Interaction and Learning
- NEO understands and executes commands through a combination of voice recognition and a visual feedback system, breaking tasks down into manageable steps [28][34].
- It can hold conversations, suggest recipes based on available ingredients, and offer interior-design advice [36].
- Users can monitor NEO's activities through a mobile app and invite remote experts to help train it on new tasks [34][46].

Market Position and Future Plans
- 1X Technologies plans to mass-produce NEO in Norway, targeting customer deliveries in 2026 and scaling to hundreds of thousands of units by 2027 [55].
- The company is seeking $1 billion in funding at a $10 billion valuation, positioning itself not just as a home-robotics company but as a bridge to artificial general intelligence (AGI) [57][58].
A nearly 500-page, most comprehensive guide yet to diffusion models: Yang Song and co-authors cover the three mainstream perspectives in one book
机器之心· 2025-10-29 07:23
Core Viewpoint
- The article discusses a comprehensive guide to diffusion models, highlighting their transformative impact on generative AI across domains such as images, audio, video, and 3D environments [2][4].

Summary by Sections

Introduction to Diffusion Models
- Diffusion models treat generation as a gradual transformation over time, in contrast to traditional generative models that directly learn a mapping from noise to data [11].
- The article emphasizes the need for a systematic understanding of diffusion models, which the book aims to provide, making it a valuable resource for researchers and beginners alike [6][9].

Core Principles of Diffusion Models
- The book connects three key perspectives: variational methods, score-based methods, and flow-based methods, which together form a unified theoretical framework [11][13].
- It discusses how these models achieve efficient sample generation and enhanced controllability of the generation process [12].

Detailed Exploration of the Perspectives
- The variational view corresponds to denoising diffusion probabilistic models (DDPMs), providing a basis for probabilistic inference and optimization [23].
- The score-based view learns score functions to guide the denoising process, linking diffusion modeling with classical differential-equation theory [23][24].
- The flow-based view describes generation as a continuous flow transformation, enabling broader applications beyond simple generation tasks [24].

Sampling Techniques and Efficiency
- A distinctive feature of diffusion models is coarse-to-fine refinement through progressive noise removal, which comes with a trade-off between performance and sampling efficiency [27][28].
- The book introduces methods that improve sampling without retraining, such as classifier guidance and advanced numerical solvers, to enhance generation quality and speed [29][30].

Learning Fast Generative Models
- The book explores strategies for directly learning fast generative models that approximate the diffusion process, reducing reliance on multi-step inference [30][31].
- Distillation-based methods are discussed, in which a student model mimics a slower teacher model to achieve faster sampling while maintaining quality [30].

Comprehensive Coverage
- The book aims to establish a lasting theoretical framework for diffusion models, centered on continuous-time dynamical systems that connect simple prior distributions to data distributions [33].
- It emphasizes that understanding the underlying principles and the connections among the different methods is what enables the design of next-generation generative models [36].
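The "gradual transformation over time" at the heart of all three perspectives is concrete in the variational (DDPM) view: the forward process blends data toward Gaussian noise as x_t = sqrt(ᾱ_t)·x₀ + sqrt(1-ᾱ_t)·ε. A minimal sketch with the standard linear beta schedule (standard textbook choices, not values quoted from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t given x_0 and noise eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.normal(size=(8,))
eps = rng.normal(size=(8,))
x_mid = q_sample(x0, 500, eps)            # partially noised sample
x_end = q_sample(x0, T - 1, eps)          # nearly pure noise at t = T-1
```

Because alphas_bar[T-1] is tiny (on the order of 1e-5 here), x_end is essentially ε; a trained model then learns to reverse this trajectory step by step, which is exactly the coarse-to-fine refinement the summary describes.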
ICCV25 Highlight | DeepGlint's RICE model sweeps the leaderboards, letting AI "understand" every detail of an image
机器之心· 2025-10-29 07:23
Core Viewpoint
- The article highlights the strong performance of the RICE model (MVT v1.5) developed by the team at DeepGlint, which excels across a range of visual tasks and was recognized as a Highlight at ICCV25 [2][27].

Summary by Sections

MVT Series Overview
- The MVT series focuses on enhancing visual semantic representation with large-scale data, building on DeepGlint's expertise in face recognition algorithms [5][7].
- MVT v1.0 used CLIP-style pre-training to extract features from vast image-text datasets, achieving state-of-the-art (SOTA) results in image classification and retrieval tasks [5][7].

RICE Model Development
- RICE (MVT v1.5) builds on previous versions by modeling the composition of image semantics, recognizing that images often consist of multiple loosely related visual elements [9][27].
- The model treats character (text) regions as semantic information and uses SAM to extract object features from a dataset of 400 million images, yielding 2 billion region-level objects clustered into one million semantic categories [9][11].

Training Methodology
- Each training image contributes approximately 10 region-level objects, and a Region Attention Layer is employed to accelerate training [11][13].
- The architecture is a classic ViT; as training progresses, the semantic quality of its internal visual features improves [13][27].

Experimental Validation
- RICE has been validated extensively on downstream tasks, showing superior detection performance on COCO, LVIS, and Roboflow100 [17][20].
- In multimodal segmentation tasks built on the LLaVA-series framework, RICE shows significant improvements [18][23].

Multi-Modal Applications
- As the visual encoder in LLaVA-OneVision-1.5, RICE achieves competitive results against other leading models across benchmarks [25][27].
- Its strength on optical character recognition (OCR) tasks is a notable advantage in multimodal applications [23][27].

Future Directions
- The MVT series will advance to version 2.0 with a focus on video encoding, which the team views as a critical step toward artificial general intelligence (AGI) [27].
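The summary mentions that each image contributes about 10 region-level objects processed through a Region Attention Layer, without detail. A generic attention-pooling sketch of that idea (my own illustration with assumed shapes, not DeepGlint's implementation): region queries attend over the ViT's patch features to produce one embedding per region.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_regions, d = 49, 10, 32       # 7x7 patch grid; ~10 regions per image
patches = rng.normal(size=(n_patches, d))  # stand-in for ViT patch features
queries = rng.normal(size=(n_regions, d))  # one query per region-level object

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: each region query pools the patches
# most relevant to it into a single region-level feature.
attn = softmax(queries @ patches.T / np.sqrt(d))   # (regions, patches)
region_feats = attn @ patches                       # (regions, d)
```

Pooling a fixed, small number of regions per image is what keeps the region-level supervision cheap relative to attending over every patch pair.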
Replacing "dialogue" with "telepathy": Tsinghua University, together with Infinigence AI, CUHK, and other institutions, proposes Cache-to-Cache, a new paradigm for model communication
机器之心· 2025-10-29 07:23
Core Insights
- The article discusses the rapid advances in large language models (LLMs) and introduces a new communication paradigm, Cache-to-Cache (C2C), which enhances multi-agent systems by letting models communicate directly through the KV-Cache instead of traditional text-to-text (T2T) exchanges [2][5][10].

Limitations of Existing Text Communication
- T2T communication suffers from information loss due to dimensionality reduction, the inherent semantic ambiguity of natural language, and substantial latency from token-by-token generation [7][8][6].

Advantages of KV-Cache
- The KV-Cache inherently carries multi-dimensional semantic information from the dialogue, improving accuracy and efficiency; experiments show that an optimized KV-Cache can significantly improve model accuracy and enable effective communication between different models [11][12][29].

C2C Mechanism
- The C2C framework fuses KV-Caches from different models, using a residual fusion structure that preserves the receiver model's original semantics while ensuring compatibility and effective information transfer [16][17][19].

Performance and Efficiency
- C2C delivers substantial gains over T2T, with accuracy improvements of 3% to 5% and speedups of up to 2x; the framework supports efficient parallel processing and avoids the inefficiency of one-dimensional text output [29][31][28].

Experimental Results
- Across multiple benchmarks, C2C consistently outperforms T2T, with significant accuracy gains and reduced inference time [28][31][29].

Future Prospects
- The C2C paradigm has broad applications, including collaboration in multi-agent systems, integration of multimodal models, and privacy-aware cloud-edge collaboration; it is positioned as a key enabling technology for the next generation of multi-agent systems [36][38][39].
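The residual fusion idea described above can be sketched in a few lines. This is a hedged illustration of the mechanism as the summary states it, not the paper's exact module: the shapes, the single linear projection, and the sigmoid gate are my assumptions. The sharer model's cache is projected into the receiver's dimension and added through a gate, so setting the gate to zero recovers the receiver's original cache (the residual property).

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_src, d_dst = 6, 32, 48              # assumed cache dims for two models
kv_sharer = rng.normal(size=(seq, d_src))  # stand-in for the sharer's KV-Cache
kv_receiver = rng.normal(size=(seq, d_dst))

# Align the sharer's cache with the receiver's dimensionality.
W_proj = rng.normal(size=(d_src, d_dst)) * 0.1

# Per-channel gate in (0, 1); in the real framework this would be learned.
gate = 1.0 / (1.0 + np.exp(-rng.normal(size=(d_dst,))))

# Residual fusion: receiver cache plus gated, projected sharer cache,
# so the receiver's original semantics are preserved.
kv_fused = kv_receiver + gate * (kv_sharer @ W_proj)
```

Because the fused cache replaces a text message, the transfer happens in one tensor operation rather than through token-by-token generation, which is where the reported latency savings come from.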