机器之心
Just in: MiroMind, still No. 1 on the Future X global leaderboard, releases the world's strongest search agent model
机器之心· 2026-01-05 06:09
Core Viewpoint
- The MiroMind team has launched its flagship search agent model, MiroThinker 1.5, advancing the concept of "discovery intelligence" as a path to true artificial general intelligence: capability grows through interaction with external information rather than merely through larger internal parameter counts [1][10].

Group 1: Model Performance and Comparison
- MiroThinker 1.5-30B achieved performance comparable to many 1-trillion-parameter models while using only 1/30 of the parameter scale [4].
- In key benchmark tests, MiroThinker 1.5-235B ranked among the top models globally, demonstrating its effectiveness despite a smaller parameter count [4].
- MiroThinker 1.5-30B runs at a markedly lower inference cost of $0.07 per call, only 1/20 the cost of Kimi-K2-Thinking, while also inferring faster [9].

Group 2: Interactive Scaling and Training Mechanism
- The MiroMind team has shifted from traditional scaling laws focused on internal parameter expansion to "Interactive Scaling," which scales external information interaction to enhance model performance [10][12].
- The training process encourages evidence-seeking behavior: the model breaks key judgments into verifiable sub-hypotheses and actively queries external data [19].
- The model is trained under strict temporal visibility constraints, ensuring it learns to judge from past information only, avoiding future leakage (a minimal sketch of such a temporal sandbox follows this summary) [17][20].

Group 3: Unique Training Approaches
- MiroThinker 1.5 adopts a "scientist mode" rather than a "test-taker mode," prioritizing verification and correction over memorization [11].
- The training paradigm includes a time-sensitive training sandbox, forcing the model to operate under real-world conditions of incomplete information and noise [18].
- Training emphasizes iterative verification and self-correction, letting the model revise its hypotheses when evidence conflicts [19].

Group 4: Market Predictions and Applications
- MiroMind has demonstrated predictive capability in stock-market scenarios, identifying stocks with high upside potential amid market noise [22][25][30].
- The model is also applied to predicting significant events that may affect major companies, offering insight into potential market reactions and volatility [31].
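Concretely, the "strict temporal visibility constraint" amounts to timestamp-gating the retrieval corpus per training episode. Below is a minimal Python sketch of such a sandbox; the class and field names and the toy relevance scoring are illustrative assumptions, not MiroMind's implementation.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Document:
    """A retrievable document with its publication timestamp."""
    doc_id: str
    published_at: datetime
    text: str

class TemporalSandbox:
    """Hypothetical retrieval wrapper that hides documents published
    after the episode's cutoff, so the agent cannot 'see the future'."""

    def __init__(self, corpus: list[Document], cutoff: datetime):
        self.corpus = corpus
        self.cutoff = cutoff

    def search(self, query: str, top_k: int = 5) -> list[Document]:
        # Filter first: only documents published strictly before the cutoff
        # are visible, however relevant later documents would be.
        visible = [d for d in self.corpus if d.published_at < self.cutoff]
        # Toy relevance score: term overlap with the query.
        terms = set(query.lower().split())
        scored = sorted(
            visible,
            key=lambda d: len(terms & set(d.text.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

# Usage: an episode about a 2024 event only sees pre-2024 evidence.
sandbox = TemporalSandbox(corpus=[], cutoff=datetime(2024, 1, 1))
results = sandbox.search("quarterly earnings guidance")
```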
AAAI 2026 Oral | InfiGUI-G1 arrives, setting a new GUI Grounding SOTA
机器之心· 2026-01-05 06:09
With the rapid progress of multimodal large language models (MLLMs), agents that can operate graphical user interfaces (GUIs) through visual input, much as humans do, are becoming reality. Yet on the road to general computer control, precisely mapping natural-language instructions to concrete on-screen elements, the GUI Grounding task, remains a major challenge.

Existing methods, in particular reinforcement learning with verifiable rewards (RLVR), excel at "pointing precisely" (spatial alignment) but often hit a bottleneck at "pointing correctly" (semantic alignment). In semantically complex scenes, models frequently fall into a "confidence trap" and fail to explore their way to the functionally correct icon (a minimal sketch of such a verifiable reward follows this excerpt).

From "Spatial Alignment" to "Semantic Alignment": the overlooked exploration bottleneck

The core of GUI Grounding is mapping a natural-language instruction (such as "open the camera") to the coordinates of a specific on-screen element. The research team points out that the task decomposes into two orthogonal dimensions:

1. Spatial Alignment: can the model locate the element precisely, i.e., "point precisely"?
2. Semantic Alignment: can the model identify the functionally correct element, i.e., "point correctly"?

To address this pain point, a team from Zhejiang University, The Hong Kong Polytechnic University, and InfiX.ai proposes a brand-new adaptive exploration ...
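For context on the RLVR setup mentioned above: its reward is "verifiable" because correctness can be checked mechanically, typically by testing whether the predicted click lands inside the target element's bounding box. A minimal sketch of that generic reward (InfiGUI-G1's exact reward shaping is not reproduced here):

```python
def grounding_reward(pred_x: float, pred_y: float,
                     box: tuple[float, float, float, float]) -> float:
    """Verifiable reward for GUI grounding: 1.0 if the predicted click
    point falls inside the ground-truth element box (x1, y1, x2, y2),
    else 0.0. No learned judge is involved, so the signal is exact."""
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= pred_x <= x2 and y1 <= pred_y <= y2) else 0.0

# Usage: a click at (120, 46) on a "camera" icon occupying (100, 30, 160, 60).
assert grounding_reward(120, 46, (100, 30, 160, 60)) == 1.0
```

Because this reward is binary at the pixel level, it never explains why an element was chosen, which is one way to see the semantic-alignment bottleneck the paper targets.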
CES 2026 deep preview: spatial intelligence arrives in force! From lab luxury to consumer-grade necessity, how will it reshape the embodied-AI era?
机器之心· 2026-01-05 06:09
Core Insights
- The article frames "Spatial Intelligence" as the next frontier for AI, moving beyond traditional language models to understanding and interacting with the physical world [1][6][38].
- The CES 2026 event showcases advances in embodied AI, highlighting the industry's shift toward spatial understanding and the need for AI to comprehend three-dimensional space [1][4][10].

Group 1: Spatial Intelligence and Its Importance
- Spatial intelligence is defined as an AI's ability to understand depth, distance, occlusion, and gravity, which is essential for true embodiment [6][8].
- The current challenge is that AI cannot yet replicate the spatial intuition found in biological entities, which limits its effectiveness in real-world applications [5][6].
- Competition in the AI industry is shifting from parameter size to achieving faster spatial intuition at lower cost, a significant change of focus [6][8].

Group 2: Technological Paths in Spatial Intelligence
- Two main technological paths are emerging: "World Generation," which focuses on creating realistic 3D environments for AI training, and "Spatial Decision," which aims to enable real-time understanding and decision-making in physical environments [14][18].
- Companies such as Meta and NVIDIA are leading efforts along these paths, with projects aimed at enhancing AI's ability to interact with the physical world [16][19][28].

Group 3: Cost Reduction and Market Expansion
- The article argues the industry is near a turning point where the cost of spatial-perception technology could drop significantly, making it accessible for widespread use [23][26].
- Innovations in vision-based solutions are breaking the high-cost barrier traditionally associated with 3D spatial perception, enabling consumer-grade applications [26][32].
- The shift from expensive hardware to affordable algorithms is expected to expand the market for embodied AI, making it part of everyday life [34][38].

Group 4: Investment Opportunities
- Investors are increasingly focused on companies that can effectively deploy spatial intelligence in real-world applications, viewing this as a critical factor for success over the next decade [34][38].
- The potential for spatial intelligence to revolutionize sectors from consumer electronics to industrial applications is highlighted as a significant growth opportunity [38].
Yuandong Tian's 2025 year-end review: firefighting Llama 4 only to be laid off, now co-founder of a stealth startup
机器之心· 2026-01-04 08:05
Core Insights
- The article recounts the experiences and reflections of a prominent AI researcher, including the impact of layoffs at Meta and future work plans [1][2][3].

Group 1: Layoffs and Career Reflections
- The researcher was involved in the Llama 4 project during a critical period and faced the complexities of decision-making under pressure, leading to a deeper understanding of societal dynamics [4].
- After over a decade at Meta, the researcher had contemplated leaving but ultimately decided to stay until the company made the decision for him, which provided new material for creative writing [5].
- Following the layoffs, the researcher received numerous job offers but chose to become co-founder of a new startup, marking a shift toward entrepreneurship [6].

Group 2: Research Directions for 2025
- The main research directions for 2025 include large-model inference and understanding the "black box" of models, with a focus on improving training efficiency and interpretability [7][8].
- The researcher's team has made significant contributions to the field, including theoretical analyses and practical applications that enhance model performance and efficiency [8][9].

Group 3: Importance of Interpretability
- The article emphasizes the critical need for interpretability in AI, arguing that understanding how AI models work is essential for trust and effective deployment [11][12].
- The challenges of explaining model behavior from first principles are highlighted, with a call for deeper insight into the emergent structures and training dynamics of AI models [12].

Group 4: Future of Work and AI Integration
- The integration of AI into the workforce is transforming traditional roles, shifting value from accumulated human experience to the ability to enhance AI capabilities [20][23].
- Two potential scenarios are presented for the future: one where AI achieves superintelligence and one where traditional scaling methods fail; both underscore the necessity of interpretability [21][23].

Group 5: The Role of Independent Thinking
- The future landscape will require individuals to maintain independent thought and creativity, as reliance on AI-generated content may lead to a decline in original thinking [29][30].
- The transition from employee to entrepreneur or founder roles is emphasized, with clear goals as the driver of proactive thinking and innovation [31][33].
A boon for researchers! One-click generation of PPTs and scientific figures: Peking University open-sources Paper2Any, fully editable end to end
机器之心· 2026-01-04 08:05
Have you ever been through a dark moment like this:

your experimental data is in and the core logic is sorted out, yet you freeze in front of a blank PPT page; the system architecture is crystal clear in your head, yet you spend half an hour in Visio or Illustrator wrestling one crooked line into place; you finally get AI to generate a beautiful flowchart, only to find the text on it is garbled, or you have to regenerate it dozens of times just to change the color scheme...

In content production, "writing" is often only half the job. Turning text into structure diagrams and flowcharts, then assembling them into presentation slides, is tedious, time-consuming, and a real test of design sense. Why can't AI understand our logic the way it understands our words, and automatically prepare the "visual assets" we need to present?

To solve this pain point, the DCAI group at Peking University has launched Paper2Any, a new multimodal assistant platform built on DataFlow-Agent, an agent framework for automated data governance.

It is not a simple "text-to-image" tool but a complete automated content-visualization workflow. From reading the material and understanding its logic, to generating images and splitting them into elements, to finally outputting fully editable PPT and SVG files (a toy sketch of the editable-PPT step follows this excerpt), Paper2Any is trying to reshape how we prepare presentations.

1. ...
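To make "fully editable output" concrete: a PPT whose shapes are native objects can be re-edited slide by slide, unlike an exported image. Below is a minimal sketch with the python-pptx library; the outline dict and file name are hypothetical, and this illustrates the output format rather than Paper2Any's actual pipeline.

```python
from pptx import Presentation  # pip install python-pptx

# Hypothetical parsed outline, as an upstream agent stage might emit it.
outline = {
    "Method": ["Data governance via DataFlow-Agent", "Layout synthesis"],
    "Results": ["Editable SVG export", "One-click PPT assembly"],
}

prs = Presentation()
for title, bullets in outline.items():
    # Layout 1 is the stock "Title and Content" slide layout.
    slide = prs.slides.add_slide(prs.slide_layouts[1])
    slide.shapes.title.text = title
    tf = slide.placeholders[1].text_frame
    tf.text = bullets[0]  # first bullet replaces the placeholder text
    for b in bullets[1:]:
        tf.add_paragraph().text = b
prs.save("draft_deck.pptx")  # every shape remains a native, editable object
```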
AAAI 2026 | XPeng teams up with Peking University on a visual-token pruning method tailored to VLA models, making end-to-end autonomous driving more efficient
机器之心· 2026-01-04 05:43
Core Insights
- The article examines the growing use of VLA models in end-to-end autonomous driving and the challenge posed by lengthy visual-token sequences, which significantly raise computational costs [2][8].
- The paper "FastDriveVLA," co-authored by XPeng Motors and Peking University, introduces a new paradigm for efficient visual-token pruning in autonomous-driving VLA models [2][5].
- The research proposes that visual tokens tied to foreground information are more valuable than those tied to background content, motivating nuScenes-FG, a large-scale annotated dataset of 241,000 images with foreground-area annotations [2][13].

Summary by Sections

Research Background and Issues
- End-to-end autonomous driving shows great potential to transform future transportation systems, learning the entire driving process within a unified framework [6].
- Existing VLA models convert visual inputs into large numbers of visual tokens, resulting in significant computational overhead and increased inference latency, which hinder real-world deployment [8].

Methodology and Innovations
- FastDriveVLA is a novel, reconstruction-based visual-token pruning framework tailored to end-to-end autonomous-driving VLA models [10].
- The framework includes ReconPruner, a lightweight plug-and-play pruner that identifies and selects meaningful foreground visual tokens using a masked-image-modeling approach (a toy sketch of the keep-top-k pruning step follows this summary) [16][18].
- An innovative adversarial foreground-background reconstruction strategy further sharpens ReconPruner's ability to distinguish foreground from background tokens [19].

Experimental Results
- FastDriveVLA achieves state-of-the-art performance across pruning ratios on the nuScenes open-loop planning benchmark [20][25].
- Reducing the visual tokens from 3,249 to 812 cuts FLOPs by roughly 7.5x and significantly improves CUDA inference latency [26].
- The framework outperforms existing methods, particularly at a 50% pruning ratio, achieving balanced performance across all metrics [25].

Efficiency Analysis
- The substantial reductions in FLOPs and CUDA latency underline FastDriveVLA's potential for real-time use in autonomous driving [26][27].
- At the 25% token-keep ratio (the 812-of-3,249 setting), FastDriveVLA posts the best scores on all evaluation metrics, indicating that concentrating on foreground-related visual tokens is crucial for driving performance [28].
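Mechanically, pruning of this kind reduces to scoring each visual token and keeping the top-scoring fraction before the tokens enter the language backbone. A PyTorch sketch of that keep-top-k step; the linear scorer is a stand-in for ReconPruner, whose reconstruction-trained architecture is described in the paper, not reproduced here.

```python
import torch
import torch.nn as nn

class TopKTokenPruner(nn.Module):
    """Illustrative pruner: score visual tokens with a lightweight head,
    keep the highest-scoring fraction, drop the rest before the LLM."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # stand-in for ReconPruner
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score_head(tokens).squeeze(-1)   # (b, n)
        idx = scores.topk(k, dim=1).indices            # (b, k), score order
        idx = idx.unsqueeze(-1).expand(-1, -1, d)      # (b, k, d)
        return tokens.gather(1, idx)                   # kept tokens only

# Usage: 3,249 camera tokens pruned to 812 (the 25% keep setting above).
pruner = TopKTokenPruner(dim=1024, keep_ratio=0.25)
kept = pruner(torch.randn(2, 3249, 1024))
print(kept.shape)  # torch.Size([2, 812, 1024])
```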
From "passive" to "proactive": why does giving headphones "eyes" change the AI paradigm?
机器之心· 2026-01-04 05:43
Core Viewpoint
- The article discusses the emergence of "screenless, proactive AI" hardware, highlighting the advances made by Chinese company Lightware Technology with its Lightwear AI wearable, which comprises AI headphones, a smartwatch, and a unique charging case [2][3][4].

Group 1: Product Overview
- Lightwear AI combines AI headphones, a smartwatch, and a charging case, designed to function as a continuous AI assistant that actively engages with users in their daily lives [3][6].
- The AI headphones are billed as the world's first with visual perception, allowing them to observe the environment and offer proactive suggestions [3][4].
- The device can recognize products, search for prices online, and even place orders autonomously based on user queries [9][10].

Group 2: Market Context
- In 2025 a surge of AI hardware products emerged globally, including AI glasses and headphones from major companies such as Alibaba and ByteDance [17].
- The shift toward screenless AI is attributed to advances in large-model capabilities and decreasing deployment costs, particularly benefiting Chinese companies in the AI hardware race [18][19].

Group 3: Proactive AI Concept
- Proactive AI aims to eliminate the cognitive friction of passive AI, which requires explicit commands from users [21].
- Lightware Technology's approach focuses on continuous environmental awareness and memory, allowing the AI to intervene at appropriate moments without user prompts [21][22].
- The article compares Lightware's vision with Google's Project Astra, which also seeks an AI assistant that understands and interacts with the user's environment [21].

Group 4: Hardware Design Philosophy
- Lightware Technology chose to enhance headphones with visual capability rather than rely on smartphones or glasses, as headphones offer a more natural and widely accepted form factor [26][27].
- The headphones carry dual 2-megapixel cameras for depth perception, enabling the AI to understand spatial relationships and user context [30][32].
- The design emphasizes semantic understanding over high-resolution imaging, focusing on the AI's ability to identify objects rather than produce high-quality visuals [30].

Group 5: Multi-Sensory Collaboration
- To achieve truly proactive AI, Lightware Technology integrates multiple devices, including a smartwatch that complements the headphones with visual and tactile interaction [39][41].
- The smartwatch serves as a continuous body sensor, collecting health data to deepen the AI's understanding of the user's physical state [43].
- The charging case is designed to maintain connectivity and functionality even when the headphones are not worn, allowing ongoing interaction with the AI [45][48].

Group 6: Technical Challenges
- Building a distributed AI hardware system involves complex challenges in power management, communication efficiency, and user interaction [51][60].
- Lightware Technology's solution includes a cloud-based operating system that distributes processing tasks across devices, ensuring efficient operation while minimizing power consumption [52][56].
- The design balances weight and comfort, with the headphones weighing only 11 grams, significantly lighter than typical smart glasses [61].

Group 7: Future Outlook
- The article concludes with Lightware Technology's plan to showcase its proactive AI headphones at CES in January 2026, signaling a potential shift in the direction of next-generation AI hardware [62][63].
5 million people tuned in online: 13 exclusive hands-on tips from the creator of Claude Code go viral
机器之心· 2026-01-04 05:43
Core Insights
- The article walks through the workflow and strategies Boris Cherny, the creator of Claude Code, uses to get the most out of the AI programming tool, emphasizing its flexibility and customization options that let users tailor the experience to personal preference.

Group 1: Workflow Strategies
- Five parallel terminal windows run multiple Claude tasks simultaneously, with system notifications signaling when input is needed [3].
- Multi-device integration lets tasks run both in local terminals and in the web interface, with seamless hand-off between devices [5].
- The Opus 4.5 model is used for all tasks; despite being larger and slower per token, its intelligence lets it finish tasks faster than smaller models [9].

Group 2: Knowledge Sharing and Continuous Improvement
- A shared knowledge base, CLAUDE.md, is maintained in a Git repository, where team members document errors and updates for continuous learning and improvement [10].
- Code-review feedback is folded back into CLAUDE.md, a "compounding engineering" approach to raising coding standards [12].

Group 3: Task Management and Automation
- Plan mode is used to outline tasks before execution, keeping the workflow clear and efficient [13].
- Repetitive tasks are automated through slash commands, reducing manual input and streamlining processes [14].
- Subagents handle specific jobs such as code simplification and end-to-end testing, automating common workflows [16].

Group 4: Code Quality and Permissions
- Code formatting is enforced through PostToolUse hooks, ensuring clean output and preventing formatting-only failures in continuous integration [18].
- Permissions are managed proactively, with pre-authorized commands stored in a shared settings file for security and speed (an illustrative settings sketch follows this summary) [20].

Group 5: Long-term Task Management
- For lengthy tasks, the strategies include launching background agents for verification, using hooks for deterministic checks, and employing plugins to streamline the process [22].
- Establishing a feedback loop is crucial to result quality, with automated testing of UI changes to ensure smooth interactions [24][25].

Conclusion
- Developers interested in optimizing their use of Claude Code can treat Boris Cherny's methods as a practical playbook [26].
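As an illustration of the hooks and permissions ideas above, the sketch below writes a project-level .claude/settings.json containing a PostToolUse formatter hook and a pre-approved command list. The field names follow Claude Code's documented hooks and permissions schema as best I know it; verify the exact keys against the current docs, and treat the specific commands as placeholders.

```python
import json
from pathlib import Path

settings = {
    # Pre-approve safe, frequently used commands so Claude Code does not
    # prompt on every run (shared with the team via git).
    "permissions": {
        "allow": ["Bash(npm run lint)", "Bash(npm run test:*)"],
    },
    # After any file edit or write, run the project formatter so CI
    # never fails on formatting alone.
    "hooks": {
        "PostToolUse": [
            {
                "matcher": "Edit|Write",
                "hooks": [
                    {"type": "command", "command": "npx prettier --write ."}
                ],
            }
        ]
    },
}

path = Path(".claude/settings.json")
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(settings, indent=2))
```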
A former OpenAI CTO bet on this track; a Chinese team got it working first, and everyone gets a ticket to AI's "second half"
机器之心· 2026-01-04 03:01
Core Viewpoint
- The article surveys the challenges small entrepreneurs and researchers face in an AI field dominated by large companies, and highlights the emergence of new tools such as Mind Lab's MinT that aim to democratize access to advanced AI training [1][2][3].

Group 1: AI Landscape and Challenges
- The AI landscape is increasingly perceived as the domain of large companies, leaving smaller players and researchers feeling lost [1][2].
- The traditional path from academia to industry is being questioned, particularly its relevance in the current AI environment [1].
- The saturation of pre-training has shifted the bottleneck to deploying AI systems, necessitating a move toward post-training and reinforcement learning [10][11].

Group 2: Innovations in Post-Training
- Mind Lab, a research center backed by a team of young scientists, has developed the Mind Lab Toolkit (MinT), which allows efficient training of trillion-parameter models using standard CPUs, cutting costs roughly tenfold [3][5].
- MinT is designed to address the limitation that current AI models become "frozen" after training, enabling continuous learning from real-world interactions [23][24].
- The platform's architecture lets users focus on data and algorithms while MinT manages the complexities of infrastructure, significantly improving engineering efficiency [31][39].

Group 3: Competitive Landscape
- MinT is positioned as a competitor to Thinking Machines' Tinker, with both platforms offering compatible interfaces and advanced post-training capabilities [21][25].
- MinT was the first to implement 1T LoRA-RL, efficient reinforcement learning on trillion-parameter models, a marker of its technological lead (a generic LoRA sketch follows this summary) [25][36].
- The team behind MinT has published over 100 papers with more than 30,000 citations, indicating a strong research foundation [6].

Group 4: Market Applications and Benefits
- MinT is expected to benefit startups in the agent domain and top academic labs constrained by computational resources, letting them validate algorithms at lower cost [41][44].
- The platform supports a wide range of applications, from basic research to specific industry needs, demonstrating its versatility [44].
- By lowering the barriers to reinforcement learning and post-training, MinT aims to put advanced AI capabilities within reach of more organizations [49][50].
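The enabling trick behind "1T LoRA-RL" is LoRA itself: freeze the base weights and train small low-rank adapters, so the trainable state stays tiny even when the base model is huge. A generic sketch with Hugging Face's peft library; this illustrates LoRA, not MinT's API, and the model name is a small placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B"  # placeholder; swap in your own base model
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters on the attention projections; base weights stay frozen,
# so only a tiny fraction of parameters is trained (and optimized).
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the total
```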
New paper from ControlNet author Lvmin Zhang: ultra-short context even for long videos
机器之心· 2026-01-03 07:00
Core Viewpoint
- The article examines why current high-quality video generation models top out around 15 seconds, forcing creators to generate in segments while fighting to maintain visual consistency [1][4].

Group 1: Limitations and Challenges
- The length bottleneck stems from scale: a 60-second video breaks down internally into over 500,000 latent tokens, which complicates maintaining narrative coherence and visual consistency [2][3].
- The core contradiction of autoregressive video generation lies in the trade-off between the longer context needed for coherence and the computational cost that context incurs [4][5].
- Compression methods often sacrifice the high-frequency details crucial for visual realism and consistency, a central difficulty in long video generation [6].

Group 2: Proposed Solutions
- A research team led by Lvmin Zhang, spanning Soochow University and Stanford University, proposes a memory-compression system designed specifically for long videos, aiming to retain fine visual details through compression [6][7].
- The proposed neural network compresses a 20-second video into a context representation of approximately 5,000 tokens while maintaining good perceptual quality [8].

Group 3: Methodology
- The research employs a two-stage strategy, first pre-training a dedicated memory-compression model to preserve high-fidelity frame-level detail at any historical time position [11][15].
- The pre-training objective minimizes the feature distance for frames randomly sampled from the compressed history, ensuring robust detail encoding across the entire sequence (a toy sketch of this objective follows this summary) [12][16].
- The architecture uses a lightweight dual-path structure to process a low-resolution video stream alongside high-resolution residual information, enhancing detail fidelity [12][23].

Group 4: Experimental Results
- Pre-training ran on an 8 × H100 GPU cluster; the model handles diverse prompts while keeping characters, scenes, objects, and plotlines consistent [30][34].
- Quantitative evaluations show competitive scores on various consistency metrics, with the Wan+Qwen combination leading on instance scores [35][36].
- Ablation studies indicate the method outperforms alternatives on PSNR and SSIM, effectively preserving original image structure even at high compression rates [37][38].
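A toy PyTorch sketch of the two ideas above: compress frame features into a small memory, then train by reconstructing the features of randomly sampled past frames. Shapes, module choices, and the attention-based design are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameMemory(nn.Module):
    """Toy memory compressor: cross-attention pools T frame features into
    M memory tokens; a decoder tries to recover any individual frame's
    features from the memory plus that frame's time index."""

    def __init__(self, dim: int = 256, mem_tokens: int = 64, max_t: int = 512):
        super().__init__()
        self.memory_queries = nn.Parameter(torch.randn(mem_tokens, dim))
        self.pool = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.time_embed = nn.Embedding(max_t, dim)
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def compress(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim) -> memory: (B, M, dim)
        q = self.memory_queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        memory, _ = self.pool(q, frames, frames)
        return memory

    def reconstruct(self, memory: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Query the memory with a time embedding for frame index t.
        q = self.time_embed(t).unsqueeze(1)   # (B, 1, dim)
        out, _ = self.read(q, memory, memory)
        return out.squeeze(1)                 # (B, dim)

# Pre-training objective: sample a random past frame and minimize the
# feature distance between its reconstruction and the original.
model = FrameMemory()
frames = torch.randn(2, 120, 256)             # 120 frames of features
memory = model.compress(frames)
t = torch.randint(0, 120, (2,))
target = frames[torch.arange(2), t]           # the sampled frames' features
loss = F.mse_loss(model.reconstruct(memory, t), target)
loss.backward()
```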