机器之心
Echoing DeepSeek-OCR: a NeurIPS paper proposes letting LLMs read long text the way humans do
机器之心· 2025-11-10 04:40
Core Insights
- The article discusses VIST (Vision-centric Token Compression in LLM), a framework developed by research teams from Nanjing University of Science and Technology, Central South University, and Nanjing Forestry University, aimed at enhancing long-text reasoning in large language models (LLMs) through a visual approach [2][5].

Research Background
- LLMs have shown remarkable capabilities in understanding and generating short texts, but they face challenges in processing long documents, complex question answering, and retrieval-augmented generation as context length and model parameter size increase [4].
- Token compression has become essential as the scale of input data grows, making it difficult for even the most powerful LLMs to efficiently analyze vast amounts of information [4].

VIST Framework
- VIST addresses the challenges of long-text processing by enabling models to read more like humans, utilizing a "slow-fast reading circuit" that mimics human reading strategies [7][8].
- The framework consists of two pathways:
  1. Fast path: renders distant, less significant context as images for quick semantic extraction by a lightweight visual encoder.
  2. Slow path: feeds key nearby text directly into the LLM for deep reasoning and language generation [8][15].

Visual Compression Mechanism
- VIST employs a visual compression mechanism that lets models process long texts efficiently by focusing on significant information while ignoring redundant words [22][23].
- The Probability-informed Visual Enhancement (PVE) mechanism teaches models to "skim read" by masking high-frequency, low-information words and retaining low-frequency, high-information words [22][23].

Performance Metrics
- VIST demonstrates significant advantages over traditional text encoding, requiring 56% fewer visual tokens than conventional text tokens and using 50% less memory [10][25].
- Across various tasks, VIST outperformed the CEPE method, remaining reliable in long-text processing even under extreme conditions [25].

Visual Text Tokenization
- VIST uses lightweight visual encoders for efficient context compression, simplifying tokenization and eliminating the need for complex preprocessing steps [28].
- Because the visual encoder handles multiple languages without being constrained by a vocabulary, it significantly reduces computational and memory overhead [29].

Future Implications
- Visual-driven token compression is expected to become a standard component in LLMs for long-context understanding, paving the way for multimodal intelligent comprehension [32][33].
- This "look before reading" strategy helps large models maintain understanding capabilities while significantly lowering computational costs [33].
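The skim-reading idea behind PVE can be sketched in a few lines: rank tokens by corpus frequency, keep the rare (information-dense) ones, and mask the frequent rest. The function name, keep-ratio rule, and toy corpus below are illustrative assumptions, not VIST's actual implementation.

```python
from collections import Counter

def pve_mask(tokens, freq, keep_ratio=0.5):
    """Toy sketch of probability-informed masking: retain the rarest
    (most informative) tokens and mask the high-frequency rest.
    The thresholding rule here is an assumption for illustration."""
    # Rank tokens by corpus frequency: rarer tokens carry more information.
    ranked = sorted(tokens, key=lambda t: freq.get(t, 0))
    keep = set(ranked[: max(1, int(len(tokens) * keep_ratio))])
    return [t if t in keep else "[MASK]" for t in tokens]

# Toy corpus statistics standing in for real frequency tables.
corpus = "the model reads the long document the model compresses it".split()
freq = Counter(corpus)
print(pve_mask("the model compresses long document".split(), freq))
```

With the toy corpus above, the frequent function words ("the", "model") are masked while the rarer content words survive, which is the "skim" behavior the mechanism trains for.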
MeshCoder: an LLM-driven breakthrough, from point clouds to editable, structured object code
机器之心· 2025-11-10 03:53
Core Insights
- The article discusses the evolution of 3D generative AI, highlighting the transition from rudimentary models to more sophisticated systems capable of creating structured and editable virtual worlds [2][3].
- The introduction of MeshCoder represents a significant advance in 3D procedural generation, translating 3D inputs into structured, executable code [3][4].

Group 1: MeshCoder Features
- MeshCoder generates "living" programs rather than static models, understanding semantic structures and decomposing objects into independent components for code generation [4].
- It constructs high-quality quad meshes, which are essential for subsequent editing and material application [5][7].
- The generated Python code is highly readable, allowing users to easily modify parameters to edit 3D models [9].
- Users can control mesh density through code adjustments, balancing detail and performance [12].

Group 2: Implementation and Training
- Development involved creating a large dataset of parts and training a part-code inference model to understand basic geometries [19][21].
- A custom Blender Python API was developed to facilitate complex modeling operations, enabling intricate geometries from simple code [20].
- A million-scale "object-code" dataset was constructed to train the final object-code inference model, allowing it to understand and assemble complex objects [25][28].

Group 3: Performance and Comparison
- MeshCoder outperforms existing methods in high-fidelity reconstruction, achieving significantly lower Chamfer distance and higher Intersection over Union (IoU) scores across various object categories [32][33].
- The model reconstructs complex structures accurately, maintaining clear boundaries and independent components [32].

Group 4: Code-Based Editing and Understanding
- MeshCoder enables code-based editing, allowing users to easily change the geometric and topological aspects of 3D models through simple code modifications [36][39].
- The generated code serves as a semantic structure, enhancing the understanding of 3D shapes when analyzed by large language models such as GPT-4 [41][44].

Group 5: Limitations and Future Directions
- Challenges remain regarding the diversity and quantity of the training dataset, which limits the model's generalization capabilities [46].
- Future efforts will focus on collecting more diverse data to improve the model's robustness and adaptability [46].
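Chamfer distance, one of the reconstruction metrics MeshCoder is evaluated on, can be illustrated with a textbook pure-Python sketch (a generic definition, not the paper's own implementation):

```python
def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between two 3D point sets:
    average nearest-neighbour squared distance, in both directions.
    O(|P|*|Q|) brute force, fine for a small illustration."""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    d_pq = sum(min(sq(p, q) for q in Q) for p in P) / len(P)
    d_qp = sum(min(sq(q, p) for p in P) for q in Q) / len(Q)
    return d_pq + d_qp

cube_corner = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(chamfer_distance(cube_corner, cube_corner))  # identical sets -> 0.0
```

Lower values mean the reconstructed surface lies closer to the ground-truth point cloud, which is why the summary reports "significantly lower Chamfer distance" as a win.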
Saining Xie, Fei-Fei Li, and LeCun jointly propose a new multimodal LLM paradigm: "spatial supersensing" takes the stage
机器之心· 2025-11-10 03:53
Core Insights
- The article discusses the new research achievement "Cambrian-S," which represents a significant step in exploring supersensing in video space [1][4].
- It builds upon the previous work "Cambrian-1," which focused on enhancing AI's visual representation learning capabilities [2].

Group 1: Definition and Importance of Supersensing
- Supersensing is defined as how a digital entity truly experiences the world, absorbing endless input streams and continuously learning [4][5].
- The research emphasizes that before developing "superintelligence," it is crucial to establish "supersensing" capabilities [4].

Group 2: Development Path of Multimodal Intelligence
- The team outlines a developmental path for multimodal intelligence, identifying video as the ultimate medium of human experience and a direct projection of real life [6].
- They categorize the evolution of multimodal intelligence into several stages, from language-only understanding to predictive world modeling [9].

Group 3: Benchmarking Supersensing
- A two-part study establishes benchmarks for measuring supersensing, revealing that existing benchmarks focus mainly on language understanding and semantic perception while neglecting advanced spatial and temporal reasoning [14][25].
- The researchers introduce VSI-Super, a new benchmark specifically designed to probe spatial intelligence in continuous scenarios [15][26].

Group 4: Challenges in Current Models
- Current models, including Gemini-2.5-Flash, struggle with tasks requiring true spatial cognition and long-term memory, indicating a fundamental gap in the current paradigm [35][38].
- Advanced models performed notably poorly on the VSI-Super benchmark, underscoring the challenge of integrating continuous sensory experience [35][36].

Group 5: Predictive Sensing as a New Paradigm
- The researchers propose predictive sensing as a path forward: models learn to predict sensory inputs and build internal world models to handle unbounded visual streams [42][43].
- The approach is inspired by human cognitive theories, emphasizing selective retention of sensory inputs and the ability to anticipate incoming stimuli [42][44].

Group 6: Case Studies and Results
- Case studies demonstrate the effectiveness of surprise-driven event segmentation in improving performance on the VSI-Super benchmark [49][53].
- The surprise-driven method outperformed existing models, showing better generalization capabilities [55][57].
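The surprise-driven segmentation idea can be sketched on a one-dimensional toy stream: cut the stream wherever the predictor's error ("surprise") spikes. The predictor, threshold, and toy data below are invented stand-ins for Cambrian-S's learned world model.

```python
def segment_by_surprise(stream, predict, threshold):
    """Cut a stream into events wherever prediction error exceeds a
    threshold -- a minimal sketch of surprise-driven segmentation."""
    segments, current, prev = [], [], None
    for frame in stream:
        if prev is not None:
            surprise = abs(frame - predict(prev))
            if surprise > threshold:   # event boundary: flush memory
                segments.append(current)
                current = []
        current.append(frame)
        prev = frame
    if current:
        segments.append(current)
    return segments

# A smooth ramp with one abrupt jump; the naive predictor expects continuity,
# so the jump registers as a surprise and starts a new event.
stream = [1, 2, 3, 10, 11, 12]
print(segment_by_surprise(stream, predict=lambda x: x + 1, threshold=2))
```

The same principle scales up: a video model that predicts the next frame well can compress long stretches of unsurprising input and spend memory only at event boundaries.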
HuggingFace releases a 200+ page "hands-on guide" that walks you through training large models, from decision-making to deployment
机器之心· 2025-11-09 11:48
Core Insights
- HuggingFace recently published an extensive technical blog detailing the end-to-end experience of training advanced LLMs, emphasizing the "chaotic reality" of LLM development [1][4].
- The blog provides in-depth technical details, code snippets, and debugging tips, making it a valuable resource for readers interested in building LLMs [5].

Group 1: Training Considerations
- A critical opening question: does one truly need to train a model from scratch, given the availability of world-class open-source models? [9]
- The article lists common misconceptions about training models, such as having idle computing power or following trends without a clear purpose [11].
- A flowchart helps determine whether training a custom model is necessary, suggesting training should only be considered when existing models and fine-tuning do not meet specific needs [12][14].

Group 2: Custom Pre-training Scenarios
- Custom pre-training suits three main areas: research, production, and strategic open-source initiatives [15].
- The goals of these areas dictate training decisions, such as model size and architecture [17].
- The decision-making process involves planning and validation through systematic experiments [18].

Group 3: Team Composition and Experimentation
- Successful LLM training teams typically start small, with 2-3 members, focusing on sufficient computing power and rapid iteration [19].
- The blog emphasizes empirical experimentation, particularly ablation studies, to inform model decisions [21][30].
- A complete process for setting up ablation experiments is outlined, recommending a proven architecture as the starting point [22].

Group 4: Framework Selection and Data Management
- Choosing the right training framework is crucial, balancing functionality, stability, and throughput [24].
- The article compares several mainstream frameworks and highlights the importance of high-quality data management for effective training [25].
- Data curation is described as an art: the quality and mix of data significantly influence model performance [41][42].

Group 5: Model Architecture and Tokenization
- The blog discusses dense, MoE (Mixture of Experts), and hybrid architectures; SmolLM3 uses a dense architecture because of memory constraints [36][37].
- Tokenization is a critical factor, with vocabulary size and algorithm choice both impacting model performance [38].
- Hyperparameters must be carefully selected and tailored to the specific architecture and dataset [39].

Group 6: Training Process and Infrastructure
- The training process is likened to a marathon, requiring thorough preparation and the ability to handle unexpected challenges [51].
- Infrastructure is a critical and often overlooked component, with detailed considerations for GPU selection and monitoring [63][66].
- The blog details the GPU requirements for training SmolLM3, illustrating the balance between training time, cost, and efficiency [70].

Group 7: Post-training and Evaluation
- The post-training phase is crucial for refining the model's capabilities, with specific goals outlined for SmolLM3 [55][58].
- The article discusses selecting appropriate frameworks and tools for post-training, including supervised fine-tuning and reinforcement learning [60].
- Evaluation metrics and continuous monitoring are essential for assessing model performance and ensuring improvements [64].
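The "change one variable at a time" ablation discipline the guide recommends can be sketched as a config generator that emits variants differing from a baseline along exactly one axis. The axis names and options below are invented for illustration, not drawn from the blog.

```python
# Toy ablation grid: each emitted config changes exactly one axis of the
# baseline, so any metric difference is attributable to that single change.
baseline = {"attn": "GQA", "pos": "RoPE", "tokenizer": "BPE-64k"}
variants = {"attn": ["GQA", "MHA"], "pos": ["RoPE", "NoPE"]}

def ablations(baseline, variants):
    """Yield configs that differ from the baseline in exactly one axis."""
    for axis, options in variants.items():
        for opt in options:
            if opt != baseline[axis]:
                cfg = dict(baseline)
                cfg[axis] = opt
                yield cfg

for cfg in ablations(baseline, variants):
    print(cfg)
```

Running each variant for a short fixed token budget against the same evaluation suite is the cheap empirical loop the blog describes before committing to a full training run.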
Second globally, first in China! A hands-on test of ERNIE 5.0 Preview, the strongest in text
机器之心· 2025-11-09 11:48
Core Viewpoint
- Baidu's ERNIE-5.0-Preview-1022 model has achieved a significant milestone, ranking second globally and first domestically in the latest LMArena Text Arena with a score of 1432, on par with leading models from OpenAI and Anthropic [2][4][43].

Model Performance
- ERNIE-5.0 Preview excels in creative writing, complex long-question understanding, and instruction following, outperforming many mainstream models including GPT-5-High [5][41].
- In creative writing tasks, it ranks first, indicating a substantial improvement in content generation speed and quality [5][41].
- In complex long-question understanding, it ranks second, showcasing its capability in academic Q&A and knowledge reasoning [5][41].
- In instruction-following tasks, it ranks third, enhancing its applicability in smart-assistant and business-automation scenarios [5][41].

Competitive Landscape
- The LMArena platform, created by researchers from UC Berkeley, ranks models through real user preference voting, a dynamic mechanism that reflects real-world performance [4][5].
- Baidu's model is positioned in the first tier of global general-purpose intelligent models, reinforcing its competitive standing in the AI landscape [4][41].

Technological Infrastructure
- Baidu's success is supported by a comprehensive "chip-framework-model-application" stack, which includes the PaddlePaddle deep learning platform and self-developed Kunlun chips for AI model training and inference [41][42].
- The PaddlePaddle framework has been updated to version 3.2, enhancing model performance through optimizations in distributed training and hardware communication [41][42].

Industry Implications
- The advancements in ERNIE-5.0 Preview reflect a broader transition in China's AI technology from "technological catch-up" to "capability leadership" [43][44].
- Baidu aims to leverage its model capabilities across various applications, including content generation, search, and office automation, to drive industry adoption [42][43].
A new paradigm for large-scale, high-precision quantum chemistry simulation: ByteDance's latest work published in a Nature portfolio journal
机器之心· 2025-11-09 11:48
Core Insights
- The article discusses the increasing reliance on computational methods for understanding material properties, particularly in fields like catalysis and clean energy [2].
- A new quantum embedding framework, SIE+CCSD(T), combines high-precision quantum chemistry with large-scale simulation, enabling accurate studies of complex materials [3][6].

Group 1: SIE+CCSD(T) Framework
- The SIE+CCSD(T) framework allows, for the first time, "gold standard" CCSD(T) methods to be applied to real material systems containing thousands of electrons and hundreds of atoms [3][6].
- The framework achieves linear computational scaling up to 392 atoms, demonstrating significant efficiency improvements through GPU optimization [4][6][15].
- The SIE framework can combine different levels of high-precision algorithms, allowing researchers to adjust computational speed and accuracy as needed [6][12].

Group 2: Performance and Accuracy
- In a system with 392 carbon atoms and approximately 11,000 orbitals, SIE+CCSD(T) achieved "gold standard" accuracy while maintaining near-linear computational efficiency on GPUs [6][16].
- The method consistently produced results within ±1 kcal/mol of experimental data across various real systems, indicating its reliability and potential as a universal tool [21][24].
- The framework successfully reconciled results from different boundary conditions in large systems, showing convergence of the adsorption energy of water on graphene [25].

Group 3: Implications for Surface Chemistry
- The study revealed that water molecules adsorb on graphene with no preferred orientation, an important insight for applications in blue energy and water desalination [24][26].
- The SIE+CCSD(T) framework addresses the limitations of traditional quantum chemistry methods, enabling accurate simulations of surface chemistry at larger scales [8][26].
- The findings contribute to a better understanding of molecular adsorption on surfaces, which is critical for material design and surface mechanism exploration [26].
IEEE | Where do LLM Agents' capability boundaries lie? The first "Graph-augmented LLM Agent (GLA)" survey builds a unified blueprint for complex systems
机器之心· 2025-11-09 11:48
Core Insights
- The article discusses the rapid development of LLM Agents and highlights the challenges of research fragmentation and limited capabilities in reliable planning, long-term memory, and multi-agent coordination [2][3].

Group 1: Introduction of Graph-Augmented LLM Agents
- A comprehensive review published in IEEE Intelligent Systems proposes "Graph-augmented LLM Agent (GLA)" as a new research direction, using graphs as a universal language to systematically analyze and enhance various aspects of LLM Agents [3][5].
- Compared with pure-LLM solutions, GLA demonstrates significant advantages in reliability, efficiency, interpretability, and flexibility [3].

Group 2: Core Challenges and Solutions
- The main challenge for LLM Agents lies in processing structured information and workflows, which graphs address as a natural representation of structured data [6].
- Graph structures can enhance agents' planning capabilities by modeling plans, task dependencies, reasoning processes, and environmental contexts as graphs [11].

Group 3: Memory and Tool Management
- To overcome memory limitations, graphs provide two effective methods: interaction graphs that record and organize the agent's interaction history, and knowledge graphs that store and retrieve structured factual knowledge [12].
- A "tool graph" can clarify the dependencies between tools, assisting tool selection and improving the agent's ability to call and combine tools [15].

Group 4: Multi-Agent Systems
- The review categorizes multi-agent collaboration into three paradigms, illustrating the evolution from static to dynamic and adaptive systems [18][22].
- Graph-theoretic methods can optimize multi-agent systems by reducing redundant communication and agent counts, thereby lowering costs [21].

Group 5: Trustworthiness and Safety
- Graphs help build trustworthy multi-agent systems by systematically analyzing the propagation of biases and harmful information, and by utilizing techniques like Graph Neural Networks (GNNs) to detect and predict malicious nodes [25].

Group 6: Future Directions
- The review identifies five key future directions for GLA: dynamic and continuous graph learning, unified graph abstraction across all modules, multimodal graphs for integrating various types of information, trustworthy systems focusing on privacy and fairness, and large-scale multi-agent simulation [28].
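The "tool graph" idea can be sketched with Python's standard-library topological sorter: dependency edges between tools yield a valid call order automatically. The tool names and edges below are invented for illustration, not taken from the survey.

```python
from graphlib import TopologicalSorter

# Toy tool graph: each tool maps to the set of tools it depends on.
# A topological order of this graph is a valid call sequence for the agent.
tool_deps = {
    "search": set(),
    "lookup_kb": set(),
    "fetch_page": {"search"},              # fetching needs a search result
    "summarize": {"fetch_page"},           # summarizing needs the page
    "answer": {"summarize", "lookup_kb"},  # answering combines both branches
}

order = list(TopologicalSorter(tool_deps).static_order())
print(order)  # dependencies always appear before their dependents
```

Encoding dependencies explicitly like this is what lets a GLA plan multi-tool calls instead of guessing an order token by token.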
Which Attention is All You Need?
机器之心· 2025-11-09 01:30
Core Insights
- The article discusses the ongoing innovations and challenges in the Attention mechanism within AI and robotics, highlighting the need for breakthroughs in algorithm design to address computational complexity and enhance performance [5][7].

Group 1: Attention Mechanism Innovations
- The industry is focusing on optimizing the Attention mechanism because the O(N^2) computational complexity of standard self-attention poses a fundamental obstacle to efficient long-sequence modeling [9].
- Two main improvement paths have emerged: Linear Attention, which aims to reduce complexity to O(N), and Sparse Attention, which seeks to limit calculations to a subset of important tokens [10][13].
- Kimi Linear, a recent development, shows significant improvements over traditional full attention, achieving up to a 75% reduction in KV-cache requirements and processing contexts of up to 1 million tokens six times faster than full attention [11][12].

Group 2: Linear Attention Approaches
- Linear Attention can be categorized into three main types: kernelized methods, forgetting mechanisms, and in-context learning, each aiming to optimize the attention process while maintaining performance [10][11].
- The Kimi Linear architecture incorporates a channel-wise gating mechanism that optimizes memory usage in RNNs and demonstrates superior performance across various scenarios [12].
- Kimi Linear's design includes a hierarchical mixed architecture that combines linear and full attention layers, enhancing its efficiency and effectiveness [12].

Group 3: Sparse Attention Strategies
- Sparse Attention pre-selects a subset of important tokens for attention calculations, using methods such as fixed patterns, block-sparse layouts, and clustering approaches [13][14].
- DeepSeek's NSA and DSA represent significant advancements in Sparse Attention; DSA employs a token-wise sparse strategy that dramatically reduces attention complexity while maintaining performance [16][17].
- In tests, DSA reduced attention complexity from O(L^2) to O(Lk), cutting costs by 60%-70% during both pre-filling and decoding [17].
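The kernelized route to O(N) attention can be sketched in pure Python: replace softmax(QK^T)V with phi(Q)(phi(K)^T V), normalized by phi(Q)(phi(K)^T 1), accumulated as running sums in one causal pass so cost grows linearly with sequence length. This is a generic single-head sketch with one common feature map (elu(x)+1), not Kimi Linear's gated design.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, one common kernel choice."""
    return [math.exp(v) if v < 0 else v + 1.0 for v in x]

def linear_attention(Q, K, V):
    """Causal kernelized attention in O(N): for each step, update running
    sums S = sum phi(k) v^T and z = sum phi(k), then read out
    phi(q)^T S / (phi(q)^T z). Lists of lists stand in for tensors."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]   # running sum of phi(k) v^T
    z = [0.0] * d_k                         # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):            # single causal scan
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = sum(fq[i] * z[i] for i in range(d_k))
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                    for j in range(d_v)])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [3.0]]
print(linear_attention(Q, K, V))
```

Because the state (S, z) has fixed size regardless of sequence length, this formulation needs no KV cache that grows with context, which is the efficiency property the linear-attention line of work exploits.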
Breaking through the LLM forgetting bottleneck: Google's "Nested Learning" lets AI continuously evolve like the human brain
机器之心· 2025-11-08 06:10
Core Insights
- Google has introduced a new machine learning paradigm called Nested Learning, which allows models to continuously learn new skills without forgetting old ones, marking a significant advancement toward AI that evolves like the human brain [1][3][4].

Group 1: Nested Learning Concept
- Nested Learning treats machine learning models as a series of interconnected optimization sub-problems, enabling a more efficient learning system [6][11].
- The approach bridges the gap between model architecture and optimization algorithms, suggesting they are fundamentally the same and can be organized into hierarchical optimization systems [7][16].
- This paradigm allows different components of a model to update at varying frequencies, enhancing the model's ability to manage long-term and short-term memory [15][20].

Group 2: Implementation and Architecture
- Based on Nested Learning principles, Google has developed a self-modifying architecture called Hope, which outperforms existing models in language modeling and long-context memory management [8][24].
- Hope is an evolution of the Titans architecture, designed to execute unbounded levels of in-context learning and to optimize its memory through a self-referential process [24][26].

Group 3: Experimental Results
- Evaluations show that Hope exhibits lower perplexity and higher accuracy than other architectures across various language modeling and common-sense reasoning tasks [27][30].
- Comparisons of Hope, Titans, and other architectures on long-context tasks demonstrate the effectiveness of the Nested Learning framework [30].

Group 4: Future Implications
- Nested Learning provides a theoretical and practical foundation for bridging the gap between current LLMs' limitations and the human brain's superior continual learning, paving the way for self-improving AI [30].
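The multi-frequency update idea can be reduced to a toy schedule: "fast" parameters step every iteration while "slow" parameters consolidate only every few steps. The group names and rates below are invented for illustration, not Google's actual schedule.

```python
def train(steps, fast_every=1, slow_every=4):
    """Count how often each parameter group updates under a toy
    two-frequency schedule, mimicking short- vs long-term memory."""
    updates = {"fast": 0, "slow": 0}
    for step in range(1, steps + 1):
        if step % fast_every == 0:
            updates["fast"] += 1   # e.g. fast, context-tracking parameters
        if step % slow_every == 0:
            updates["slow"] += 1   # e.g. slowly consolidated parameters
    return updates

print(train(8))  # {'fast': 8, 'slow': 2}
```

Because the slow group changes rarely, it can act as stable long-term storage that new tasks do not easily overwrite, which is the intuition behind using update frequency to fight forgetting.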
Is quantum mechanics about to abandon the imaginary unit i?
机器之心· 2025-11-08 06:10
Core Viewpoint
- Recent research suggests that quantum mechanics may be rewritten using only real numbers, challenging the field's long-standing reliance on imaginary numbers [1][7][11].

Group 1: Historical Context
- Quantum mechanics was established over a century ago to explain the strange behavior of atoms and fundamental particles, achieving significant success [2].
- The core equations of quantum mechanics include the imaginary unit i, which has been a point of contention among physicists [3][4].

Group 2: Recent Developments
- In 2021, a study indicated that imaginary numbers were essential to quantum theory, but subsequent research in 2025 proposed a real-number equivalent that is fully compatible with standard quantum theory [8][11][15].
- Several teams have developed real-number formulations of quantum theory, raising questions about the necessity of imaginary components [15][38].

Group 3: Experimental Evidence
- A modified Bell experiment demonstrated that the correlations between entangled particles exceeded the limits set by real-number theories, suggesting that imaginary numbers are crucial for accurate quantum descriptions [29][30].
- Despite statistical evidence supporting the necessity of imaginary numbers, skepticism remains regarding the conclusions drawn from these experiments [31][32].

Group 4: Philosophical Implications
- The debate continues over why real-number formulations are more complex and whether they can fully replicate the results of traditional quantum mechanics [42][43].
- Some researchers argue that even if imaginary numbers are not strictly necessary, they provide a more elegant and intuitive framework for quantum mechanics [44][49].

Group 5: Future Directions
- Ongoing research aims to uncover the unique properties of quantum mechanics that make imaginary numbers particularly suitable, with some theorists suggesting that spin may play a role [51][52].
- The quest for a simpler axiomatic framework for quantum mechanics continues, as researchers seek to understand why the traditional formulation remains dominant [53].
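The basic trick behind any real-number rewrite is that a complex number a+bi behaves exactly like the real 2x2 matrix [[a, -b], [b, a]]: complex addition and multiplication become ordinary real matrix operations. This is the standard mathematical construction, shown generically here, not any specific paper's formulation.

```python
def to_mat(z):
    """Encode the complex number a + bi as the real matrix [[a, -b], [b, a]]."""
    a, b = z
    return [[a, -b], [b, a]]

def matmul(X, Y):
    """2x2 real matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

im = to_mat((0.0, 1.0))     # the imaginary unit i as a real matrix
print(matmul(im, im))       # i * i = -1  ->  [[-1.0, 0.0], [0.0, -1.0]]
```

Doubling every dimension this way is also why the real-number versions of quantum theory end up more cumbersome than the complex original, even where they are mathematically equivalent.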