Pooling the Wisdom of Youth to Forge the Future of AGI | 2025 WAIC Cloud Sail Award Winners Announced
机器之心· 2025-07-29 06:38
Core Viewpoint
- The 2025 WAIC Cloud Sail Awards ceremony was held in Shanghai, celebrating the achievements of young AI talents and fostering collaboration among industry leaders, academic innovators, and top investors in the AI sector [1][2].

Group 1: Event Overview
- The 2025 WAIC Cloud Sail Awards ceremony took place during the World Artificial Intelligence Conference, highlighting the contributions of over 150 key figures in AI from academia, industry, and investment [1].
- The event was co-hosted by the Shanghai Artificial Intelligence Laboratory, Machine Heart, and the Global Academic Alliance for Artificial Intelligence, with support from various institutions [1].

Group 2: Award Recipients
- The ceremony announced the winners of the "Brilliant Star" and "Tomorrow Star" awards, recognizing outstanding contributions in the AI field [2].
- The introduction of the "Nomination Award" aims to enhance the talent ecosystem within the Cloud Sail community [6].

Group 3: Notable Award Winners
- Chen Jianyu, with over 10 years of experience in robotics and AI, has published over 70 papers and was recognized in Forbes China's "30 Under 30" [14].
- Gao Yang, known for his work in embodied intelligence and reinforcement learning, has co-founded a company focusing on humanoid robots and has received significant recognition for his research [16].
- He Conghui, a young scientist at the Shanghai Artificial Intelligence Laboratory, has published over 100 papers and created a major open data platform [18].
- Liu Bang, a professor at the University of Montreal, has made significant contributions to natural language processing and multimodal learning [20].
- Qiao Chang, focusing on intelligent photonics, has developed innovative neural network architectures for optical imaging [22].
- Wang Xiang, recognized for his work in information recommendation and large models, has received multiple prestigious awards [24].
- Wu Yi, a former OpenAI researcher, has made notable contributions to reinforcement learning and multi-agent systems [26].
- Xie Weidi, a professor at Shanghai Jiao Tong University, has published extensively in computer vision and medical AI [28].
- Zhang Chen, focusing on intelligent processor architecture, aims to optimize AI hardware design [30].
- Zhao Hengshuang, an assistant professor at the University of Hong Kong, has published over 100 papers in computer vision and machine learning [34].

Group 4: Additional Award Winners
- Chen Tianlong, an assistant professor at the University of North Carolina, specializes in machine learning systems and has received numerous awards for his research [37].
- Chen Xiaokang, a researcher at DeepSeek AI, has led successful multimodal projects with significant industry impact [39].
- Cui Ganqu, focusing on alignment and reinforcement learning in large language models, has published extensively in top AI conferences [41].
- Fu Zhaoyou, recognized for his work in multimodal intelligence, has received multiple awards for his research contributions [43].
- Gong Ruihao, a vice director at SenseTime, has published over 40 papers on efficient machine learning systems [45].
- Gu Jiayuan, focusing on embodied intelligence and 3D vision, has received best paper awards at major conferences [47].
- Li Yanwei, a research scientist at ByteDance, has made significant contributions to visual language models [49].
- Long Xiaoxiao, an associate professor at Nanjing University, has led research in 3D reconstruction and neural rendering [51].
- Luo Yuyu, an assistant professor at the Hong Kong University of Science and Technology, focuses on data-centric AI and has received multiple accolades [53].
- Tang Xiangru, researching multi-agent systems for biomedical applications, has published in top-tier journals [55].
- Wang Jingbo, a young scientist at the Shanghai Artificial Intelligence Laboratory, has made significant contributions to humanoid robotics [57].
- Yu Lijun, a senior research scientist at Google DeepMind, focuses on video generation and reinforcement learning [59].
- Zhang Linfeng, an assistant professor at Shanghai Jiao Tong University, specializes in efficient AI and has received multiple academic honors [61].
Alibaba Open-Sources Again: The World's First MoE Video Generation Model Debuts, Putting Cinematic Aesthetics One Touch Away
机器之心· 2025-07-29 06:38
Core Viewpoint
- Alibaba has released the world's first open-source MoE architecture video generation model, Wan2.2, which features cinematic aesthetic control capabilities [3][11].

Group 1: Model Features
- Wan2.2 is the first video diffusion model to introduce the Mixture-of-Experts (MoE) architecture, allowing for enhanced model capacity without increasing computational costs [11][12].
- The training data for Wan2.2 has significantly increased, with image data up by 65.6% and video data up by 83.2% compared to Wan2.1, improving the model's generalization capabilities in motion expression, semantic understanding, and aesthetic performance [14][15].
- The model incorporates a specially curated aesthetic dataset with fine-grained attributes such as light and shadow, composition, and color, enabling precise control over cinematic styles and user-customizable aesthetic preferences [16].

Group 2: Technical Innovations
- Wan2.2 features a high-efficiency Hybrid TI2V architecture, with a model size of 5 billion parameters and a compression rate of 16×16×4, supporting video generation at a resolution of 720P and 24fps [18].
- It is one of the fastest models on the market for generating 720P, 24fps videos, catering to both industrial and academic needs [19].
- Users can download and utilize the model from platforms like Hugging Face and Alibaba's ModelScope community [20] (a minimal download sketch follows below).
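For readers who want to try the model, here is a minimal sketch of pulling the weights from Hugging Face; the repo id below is an assumption for illustration, so check the official Wan2.2 release page on Hugging Face or ModelScope for the published repositories and the matching inference pipeline.

```python
# Minimal sketch: downloading Wan2.2 weights from Hugging Face.
# The repo id is assumed for illustration; the official release page
# lists the actual repositories and the accompanying inference code.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Wan-AI/Wan2.2-TI2V-5B")  # assumed repo id
print(f"Weights downloaded to: {local_dir}")
```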
How Do You Feed Large Models Precisely Without Massive Data? SJTU's Data Whisperer: A Training-Free Data Selection Method Where 10% of the Data Approaches Full-Dataset Performance
机器之心· 2025-07-29 06:38
Core Viewpoint
- The article introduces "Data Whisperer," a novel framework for efficient data selection in fine-tuning large language models (LLMs) without the need for additional training, achieving near-optimal performance with only 10% of the data compared to full datasets [2][4][36].

Group 1: Methodology and Mechanism
- Data Whisperer utilizes the in-context learning (ICL) capabilities of pre-trained models to select "golden training samples" without requiring a separate scoring model [2][6].
- The framework employs a scoring mechanism based on the model's own outputs and attention weights, ensuring a stable and reasonable selection process [10][12] (the selection loop is sketched after this summary).
- It introduces a new efficiency metric, the Selection-to-Tuning Ratio (STR), which shows that Data Whisperer significantly outperforms traditional methods in terms of time efficiency [17][18].

Group 2: Performance Metrics
- In various tasks, Data Whisperer achieved impressive results, such as 72.46% accuracy on the GSM8K dataset using only 10% of the data, surpassing the full-dataset performance of 71.39% [19].
- The framework also demonstrated superior performance on the DialogSum and BioInstruct tasks, with notable improvements over existing state-of-the-art methods [19][21].

Group 3: Robustness and Adaptability
- Data Whisperer is robust to input scale, with optimal configurations identified for the number of demonstration and query samples, indicating that it effectively selects core samples rather than relying on sheer volume [26][28].
- The framework supports a weak-to-strong mechanism, allowing smaller models to select data for larger models, thus reducing computational burdens while maintaining performance [22][24].

Group 4: Comparative Analysis
- Data Whisperer outperforms all mainstream data selection methods in accuracy, efficiency, and stability, particularly in low-budget scenarios [35].
- The framework's theoretical foundation rests on the relationship between ICL and fine-tuning, allowing it to estimate a sample's training utility without adjusting any model parameters [36][37].

Group 5: Future Directions
- Potential future explorations include applying the method to complex structured tasks in fields like law and medicine, enhancing task alignment capabilities, and integrating human feedback [41][42].
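As a rough sketch of the training-free selection loop described above (not the authors' implementation: `score_batch`, the sampling scheme, and the fixed 10% budget are illustrative stand-ins), the core idea can be written as:

```python
# Sketch of training-free, ICL-based data selection in the spirit of
# Data Whisperer. A frozen model answers held-out queries with candidate
# demonstrations in context; each demonstration is credited by how much
# it helps. `score_batch` is a hypothetical helper standing in for the
# paper's output- and attention-based scoring.
import random

def select_subset(pool, score_batch, n_demo=5, n_query=3,
                  budget_frac=0.10, rounds=200, seed=0):
    rng = random.Random(seed)
    totals = [0.0] * len(pool)
    counts = [0] * len(pool)
    for _ in range(rounds):
        demo_ids = rng.sample(range(len(pool)), n_demo)
        rest = [i for i in range(len(pool)) if i not in demo_ids]
        query_ids = rng.sample(rest, n_query)
        # one utility score per demonstration, derived from the frozen
        # model's answers (and attention) on the query samples
        scores = score_batch([pool[i] for i in demo_ids],
                             [pool[i] for i in query_ids])
        for i, s in zip(demo_ids, scores):
            totals[i] += s
            counts[i] += 1
    # average utility per demonstration; never-sampled items rank last
    avg = [totals[i] / counts[i] if counts[i] else float("-inf")
           for i in range(len(pool))]
    k = max(1, int(budget_frac * len(pool)))
    return sorted(range(len(pool)), key=lambda i: avg[i], reverse=True)[:k]
```

Because the scorer is the pre-trained model itself and no gradient updates are involved, the selection cost stays small relative to fine-tuning, which is what the STR metric above is designed to capture.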
From the Kung Fu Robot That Went Viral at WAIC, a Look at This Central SOE's Real "Kung Fu" in Embodied Intelligence
机器之心· 2025-07-28 11:52
Core Viewpoint
- The article highlights the advancements and ambitions of TeleAI, a subsidiary of China Telecom, in the field of embodied intelligence, showcasing their comprehensive self-research capabilities in hardware, software, and data integration, which positions them uniquely in the industry [10][12][61].

Group 1: Technological Advancements
- The "Kungfu boy" robot has been upgraded to perform martial arts demonstrations with minute-level precision, showcasing significant advancements in robotics technology [2][36].
- TeleAI's robots utilize a self-developed multi-motor collaborative drive control embedded hardware system, allowing for precise movement and stability [21][23].
- The remote operation system, TeleHumos, enhances the robot's capabilities in hazardous environments, utilizing 5G technology for kilometer-level operational range [9][26].

Group 2: Comprehensive Research and Development
- TeleAI is committed to full-stack self-research, integrating hardware and software development to create a competitive edge in the embodied intelligence sector [14][17].
- The company has developed a unified multi-gait hybrid expert model, enabling robots to adapt their movements based on environmental conditions [34][35].
- TeleAI's approach to data generation involves creating realistic virtual environments to train robots, addressing the challenges of data scarcity in the robotics industry [43][50].

Group 3: Collaborative Intelligence
- TeleAI emphasizes the importance of collaboration among robots, enabling them to share learning experiences and adapt to new situations collectively [57][58].
- The integration of AI Flow allows for seamless communication between edge devices and cloud systems, enhancing real-time decision-making capabilities [54][55].
- This collaborative approach positions TeleAI's robots as part of a larger intelligent network, rather than isolated units, which is crucial for complex operational scenarios [51][53].

Group 4: Industry Implications
- TeleAI's advancements reflect a broader trend in the embodied intelligence industry, moving towards ecosystem-level collaborative innovation rather than isolated technological breakthroughs [60][61].
- The company's focus on self-research and development is seen as a strategic advantage for China's position in the global embodied intelligence competition [63][64].
At This Year's WAIC, 无问芯穹 Unveiled Three "Boxes"
机器之心· 2025-07-28 10:45
Released by Machine Heart, Machine Heart Editorial Department

"Computing power is the soil of the intelligent era; its scale and efficiency determine the frontiers of the digital future."

On July 28, at the 2025 World Artificial Intelligence Conference, Xia Lixue, co-founder and CEO of 无问芯穹, announced the company's full-scale AI efficiency uplift solution and officially launched three core products: 无穹 AI 云 (Wuqiong AI Cloud), 无界智算平台 (Wujie intelligent computing platform), and 无垠终端智能 (Wuyin terminal intelligence). The solution is a software-hardware co-designed system for future intelligent infrastructure: across full-scale scenarios such as cross-regional intelligent computing networks, computing clusters, and multi-form intelligent terminals, it uniformly adapts many kinds of heterogeneous computing power and provides end-to-end support from model scheduling and performance optimization through to application deployment.

At the launch, Xia Lixue likened the three products to "three boxes," saying that by offering "packaged" product and service capabilities across software and hardware scenarios at every scale, from a single GPU card to 100,000-card clusters, 无问芯穹 hopes to let every unit of computing power release its maximum intelligent potential.

1. Two "acceleration paths" and one "value space": wherever there is computing, there is intelligence

Xia Lixue noted that from traditional algorithms through the AI 1.0 and AI 2.0 stages, driven by the Scaling Law, computing resources have continuously expanded the boundary of intelligence, approaching the critical point of AGI. However, one ultimate boundary of human civilization has always stood across the road to AGI: the finiteness of resources.

Human civilization, as it ushers in an era of "无所不 ...
EvaLearn: A Brand-New Evaluation Paradigm for the Second Half of AI!
机器之心· 2025-07-28 10:45
Core Viewpoint
- The article discusses the shift in AI research from "can it be done" to "is it effective," emphasizing the need for new evaluation methods that assess the long-term adaptability and learning capabilities of models, particularly in the context of achieving general artificial intelligence [1][4].

Group 1: New Evaluation Paradigm
- A new evaluation paradigm called EvaLearn has been proposed to assess the learning ability and efficiency of large language models (LLMs), providing a fresh perspective on understanding their human-like learning potential [5][6].
- EvaLearn focuses on "sequential problem-solving," redefining the evaluation logic for large language models, and has gained significant attention since its open-source release [6][8].

Group 2: Limitations of Traditional Benchmarks
- Traditional benchmarks treat problems as isolated samples, failing to evaluate models' learning efficiency and adaptability, which are crucial for understanding their performance [8][9].
- EvaLearn constructs 648 challenging problems organized into 182 sequences, requiring models to solve them in order, thus allowing for a systematic assessment of their learning capabilities [9][11].

Group 3: Key Findings from EvaLearn
- The research team found that models exhibit diverse learning abilities across different task types, with most models better leveraging prior experience for mathematical and logical reasoning tasks, while tasks like summarization rely more on pre-trained knowledge [14].
- Models based on chain-of-thought reasoning generally outperform those that are not, demonstrating better stability and the ability to solve multiple related problems consecutively [15].
- Feedback learning, which incorporates evaluations from a verifier, significantly enhances models' learning abilities and efficiency compared to example-based learning [16].
- Learning ability and efficiency metrics provide a comprehensive assessment of models' learning potential, revealing that high static performance does not guarantee superior learning capabilities [17].

Group 4: Evaluation Metrics
- EvaLearn covers tasks spanning summarization, classification, information extraction, logical reasoning, mathematical reasoning, and sequential reasoning, with a comprehensive set of metrics to characterize models' dynamic learning abilities [20].
- Overall accuracy, learning speed, position of the first correct answer, consecutive correct answers, and post-warm-up accuracy are the key indicators used to assess models' performance [21] (a small computation sketch follows after this summary).

Group 5: Learning Efficiency and Methods
- The study indicates significant differences in learning efficiency among models and task types, with non-thinking models often showing faster progress in experience accumulation, while thinking models demonstrate more stable gains [44].
- Different problem-solving methods, such as example learning and feedback learning, significantly impact model performance, with feedback learning generally yielding higher accuracy and learning efficiency [46][48].
- The average position of the first correct answer varies across models and tasks, highlighting the models' learning potential and the importance of feedback in enhancing learning outcomes [51][53].

Group 6: Conclusion
- EvaLearn represents a novel benchmark framework for sequentially evaluating models' learning abilities and efficiency across various tasks, revealing significant performance differences among leading models [55][56].
- The findings underscore the importance of understanding models' learning capabilities and efficiency as a new perspective for evaluating their performance and bridging the gap between current models and human capabilities [57].
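To make the indicator list above concrete, here is a small sketch of how the sequence-level metrics could be computed from one problem sequence's correctness flags; the definitions are paraphrased from the summary, not the authors' exact formulas.

```python
# Sketch of EvaLearn-style sequence metrics for one problem sequence.
# Definitions are paraphrased from the summary above, not the paper's
# exact formulas; `warmup` marks how many early problems to discount.
from typing import List

def sequence_metrics(correct: List[bool], warmup: int = 3) -> dict:
    n = len(correct)
    acc = sum(correct) / n  # overall sequence accuracy
    # 1-indexed position of the first correct answer, None if never correct
    first = next((i + 1 for i, c in enumerate(correct) if c), None)
    best = cur = 0  # longest run of consecutive correct answers
    for c in correct:
        cur = cur + 1 if c else 0
        best = max(best, cur)
    post = sum(correct[warmup:]) / max(1, n - warmup)  # post-warm-up accuracy
    # crude "learning speed": least-squares slope of correctness vs. position
    xs = list(range(1, n + 1))
    mx = sum(xs) / n
    den = sum((x - mx) ** 2 for x in xs)
    slope = (sum((x - mx) * (float(c) - acc) for x, c in zip(xs, correct)) / den
             if den else 0.0)
    return {"accuracy": acc, "first_correct_pos": first,
            "max_consecutive": best, "post_warmup_acc": post,
            "learning_slope": slope}

print(sequence_metrics([False, False, True, True, True, False, True]))
```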
"Hallucinations" Was Coined by Karpathy a Decade Ago? How Many Concepts Has the AI World's Master Namer Popularized?
机器之心· 2025-07-28 10:45
Core Viewpoint
- The article discusses the influential contributions of Andrej Karpathy in the AI field, particularly his role in coining significant terms and concepts that have shaped the industry, such as "hallucinations," "Software 2.0," "Software 3.0," "vibe coding," and "bacterial coding" [1][6][9].

Group 1: Naming and Concepts
- Karpathy coined the term "hallucinations" to describe the limitations of neural networks, which generate meaningless content when faced with unfamiliar concepts [1][3].
- He is recognized as a master of naming in the AI community, having introduced terms like "Software 2.0" and "Software 3.0," which have gained traction over the years [6][9].
- The act of naming is emphasized as a foundational behavior in knowledge creation, serving as a stable target for global scientific focus [7].

Group 2: Software Evolution
- "Software 1.0" refers to traditional programming, where explicit instructions are written in languages like Python and C++ [12][14].
- "Software 2.0" represents a shift to neural networks, where developers train models using datasets instead of writing explicit rules [15] (a toy 1.0-vs-2.0 contrast appears in the sketch after this summary).
- "Software 3.0" allows users to generate code through simple English prompts, making programming accessible to non-developers [16][17].

Group 3: Innovative Programming Approaches
- "Vibe coding" encourages developers to immerse themselves in the development atmosphere, relying on LLMs to generate code based on verbal requests [22][24].
- "Bacterial coding" promotes writing modular, self-contained code that can be easily shared and reused, inspired by the adaptability of bacterial genomes [30][35].
- Karpathy suggests balancing the flexibility of bacterial coding with the structured approach of eukaryotic coding to support complex system development [38].

Group 4: Context Engineering
- Context engineering has gained attention as a more comprehensive approach than prompt engineering, focusing on providing structured context for AI applications [43][44].
- The article highlights a shift toward optimizing documentation for AI readability, indicating a trend where 99.9% of content may be processed by AI in the future [45].
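To make the Software 1.0 vs. 2.0 contrast above concrete, here is a toy sketch: the same sentiment check written once as explicit rules and once as weights learned from a handful of invented examples (the dataset and word lists are illustrative only).

```python
# Toy contrast between Software 1.0 (hand-written rules) and Software 2.0
# (a learned classifier). The tiny dataset is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Software 1.0: a human writes the rule explicitly.
def is_positive_v1(text: str) -> bool:
    return any(w in text.lower() for w in ("great", "love", "excellent"))

# Software 2.0: a human curates examples; the "program" is learned weights.
texts = ["great movie", "i love it", "excellent work",
         "terrible film", "i hate it", "awful plot"]
labels = [1, 1, 1, 0, 0, 0]
vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

def is_positive_v2(text: str) -> bool:
    return bool(clf.predict(vec.transform([text]))[0])

print(is_positive_v1("what a great day"), is_positive_v2("what a great day"))
```

Software 3.0, in this framing, replaces both with a plain-English prompt such as "classify this review as positive or negative" sent to an LLM.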
A Hardcore 30-Minute "Argument": This Large-Model Roundtable Laid the AI Industry's Disagreements Bare
机器之心· 2025-07-28 04:24
Core Viewpoint
- The article discusses a heated debate among industry leaders at the WAIC 2025 forum regarding the evolution of large model technologies, focusing on training paradigms, model architectures, and data sources, highlighting a significant shift from pre-training to reinforcement learning as a dominant approach in AI development [2][10][68].

Group 1: Training Paradigms
- The forum highlighted a paradigm shift in AI from a pre-training dominant model to one that emphasizes reinforcement learning, marking a significant evolution in AI technology [10][19].
- OpenAI's transition from pre-training to reinforcement learning is seen as a critical development, with experts suggesting that the pre-training era is nearing its end [19][20].
- The balance between pre-training and reinforcement learning is a key topic, with experts discussing the importance of pre-training in establishing a strong foundation for reinforcement learning [25][26].

Group 2: Model Architectures
- The dominance of the Transformer architecture in AI has been evident since 2017, but its limitations are becoming apparent as model parameters increase and context windows expand [31][32].
- There are two main exploration paths in model architecture: optimizing existing Transformer architectures and developing entirely new paradigms, such as Mamba and RetNet, which aim to improve efficiency and performance [33][34].
- The future of model architecture may involve a return to RNN structures as the industry shifts towards agent-based applications that require models to interact autonomously with their environments [38].

Group 3: Data Sources
- The article discusses the looming challenge of high-quality data scarcity, predicting that by 2028 existing data reserves may be fully utilized, potentially stalling the development of large models [41][42].
- Synthetic data is being explored as a solution to data scarcity, with companies like Anthropic and OpenAI utilizing model-generated data to supplement training [43][44].
- Concerns about the reliability of synthetic data are raised, emphasizing the need for validation mechanisms to ensure the quality of training data [45][50].

Group 4: Open Source vs. Closed Source
- The ongoing debate between open-source and closed-source models is highlighted, with open-source models like DeepSeek gaining traction and challenging the dominance of closed-source models [60][61].
- Open-source initiatives are seen as a way to promote resource allocation efficiency and drive industry evolution, even if they do not always produce the highest-performing models [63][64].
- The future may see a hybrid model combining open-source and closed-source approaches, addressing challenges such as model fragmentation and misuse [66][67].
ICCV 2025 | UV-CoT: A New Breakthrough in Unsupervised Visual Reasoning, with Preference Optimization Reshaping the Image-Level Chain of Thought
机器之心· 2025-07-28 04:24
Core Viewpoint
- The article discusses the introduction of a novel unsupervised visual reasoning framework called UV-CoT, which enhances model reasoning capabilities and interpretability in visual understanding tasks by leveraging a chain-of-thought (CoT) approach [2][3][25].

Group 1: Background and Challenges
- Existing models rely on supervised fine-tuning (SFT) strategies that require extensive labeled data, leading to high annotation costs and limited scalability [6][7].
- SFT methods face challenges such as high labor costs for annotating key image regions and reasoning paths, and limited generalization capabilities due to reliance on a single type of training signal [7].

Group 2: UV-CoT Framework
- UV-CoT is designed to mimic human visual understanding by focusing on "key regions → reasoning process," employing an unsupervised data generation and preference optimization mechanism [4][3].
- The framework utilizes an automated preference data generation and evaluation process, guided by an improved preference optimization algorithm called Score-DPO (sDPO), to achieve unsupervised image-level chain-of-thought learning [8][11].

Group 3: Methodology
- UV-CoT generates diverse intermediate reasoning responses for image-question pairs using a target model and an evaluation model, which assesses the selected regions' scores and their impact on subsequent answers [13].
- The preference dataset is constructed by randomly selecting preference pairs from the generated responses, retaining the highest-scoring response chains for further reasoning [14] (a construction sketch follows at the end of this summary).

Group 4: Performance and Results
- UV-CoT demonstrates significant performance improvements over existing supervised chain-of-thought models, outperforming models like Visual-CoT-7B and LLaVA-1.5-7B across six benchmarks [20][22].
- The self-evaluation capability of UV-CoT leads to high-quality bounding box generation, surpassing LLaVA-1.5-7B by 4.8% and closely approaching the performance of the 12B model OmniLMM-12B [23].

Group 5: Conclusion
- UV-CoT presents an innovative approach to unsupervised visual reasoning, eliminating the dependency on manual annotations and enabling automatic identification and reasoning optimization of key image regions, thus laying a solid foundation for future research in unsupervised visual understanding [25].
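As a rough sketch of the preference-pair construction described above (the generator and evaluator calls are hypothetical helpers, and the score gap and pair counts are illustrative, not the UV-CoT defaults):

```python
# Sketch of UV-CoT-style unsupervised preference-pair construction.
# `generate_responses` (target model) and `score_response` (evaluator)
# are hypothetical stand-ins; thresholds are illustrative.
import itertools
import random

def build_preference_pairs(image, question, generate_responses, score_response,
                           n_samples=8, keep_pairs=4, min_gap=0.5, seed=0):
    rng = random.Random(seed)
    # diverse intermediate responses: (key region, rationale) candidates
    responses = generate_responses(image, question, n=n_samples)
    scored = [(r, score_response(image, question, r)) for r in responses]
    candidates = []
    for (a, sa), (b, sb) in itertools.combinations(scored, 2):
        if abs(sa - sb) >= min_gap:  # keep only clearly separated pairs
            chosen, rejected = (a, b) if sa > sb else (b, a)
            candidates.append((chosen, rejected))
    pairs = rng.sample(candidates, min(keep_pairs, len(candidates)))
    best = max(scored, key=lambda rs: rs[1])[0]  # top chain feeds the next step
    return pairs, best
```

The resulting (chosen, rejected) pairs, together with their evaluator scores, are what an sDPO-style objective would then consume.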
Do Multimodal Large Models Really "Understand" the World? Unveiling the Core Knowledge Deficits of MLLMs
机器之心· 2025-07-28 02:47
Core Insights
- The article highlights that Multi-Modal Language Models (MLLMs) exhibit impressive capabilities in high-level visual understanding and reasoning tasks, yet they frequently fail in seemingly simple tasks that even infants can accomplish [1][2].
- It questions whether MLLMs lack "core knowledge," which is essential for early human learning, indicating a potential cognitive blind spot in these models [2][5].

Research Findings
- A study from UC San Diego titled "Core Knowledge Deficits in Multi-Modal Language Models" systematically analyzes the lack of core cognitive abilities in mainstream MLLMs [3][5].
- The research reveals that current MLLMs widely lack core cognitive abilities, which cannot be naturally acquired through model scaling [5][12].

CoreCognition Framework
- The authors developed an innovative multi-modal assessment system called CoreCognition, along with a unique "Concept Hacking" method to test whether models genuinely understand the core knowledge behind tasks or are merely guessing [6][18].
- CoreCognition is a large-scale assessment framework focusing on core knowledge, inspired by Piaget's theories of cognitive development, and aims to bridge the gap between cognitive science and AI testing [9][11].

Assessment Design
- The CoreCognition dataset includes 1,503 image-question pairs and generates 2,530 evaluation data points across 230 mainstream multi-modal models and 11 prompt designs, effectively covering various model scales and instruction comprehension [11].
- The assessment is designed to be discriminative, minimizing confounding factors and avoiding text shortcuts, ensuring that models must engage in multi-modal reasoning to arrive at correct answers [11][12].

Key Findings on Model Performance
- MLLMs show significant deficiencies in basic cognitive tasks, particularly in areas like boundary perception and spatial awareness, performing poorly compared to their understanding of more complex tasks [12][14].
- The study indicates that increasing model size does not significantly enhance basic cognitive abilities, and in some cases, larger models perform worse on foundational tasks [16][20].

Concept Hacking Methodology
- The Concept Hacking method involves creating control and manipulated groups to test models' understanding of core concepts by reversing key features while keeping other conditions constant [18][29] (a sketch follows at the end of this summary).
- Results show that many models perform well on standard tasks but fail dramatically when key features are altered, indicating a reliance on superficial learning rather than true understanding [20][30].

Implications and Future Directions
- The findings suggest that MLLMs lack the foundational cognitive scaffolding that humans use to build higher-level reasoning, posing a fundamental challenge to the current model development path focused on scaling [22][30].
- Future directions may include explicitly injecting physical and spatial common sense into pre-training phases, exploring cognitive-guided training mechanisms, and developing more controlled assessments of cognitive abilities [30].
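As an illustration of the Concept Hacking protocol described above (the `ask_model` helper and the item format are hypothetical; only the control/manipulated pairing and the accuracy gap follow the article's description):

```python
# Sketch of a Concept-Hacking-style evaluation: each item pairs a control
# question with a manipulated twin whose key feature is flipped. A model
# that relies on shortcuts tends to pass the control but fail the twin.
# `ask_model(image, question)` is a hypothetical helper.
def concept_hacking_score(items, ask_model):
    """items: dicts with 'control' and 'manipulated' (image, question, answer)
    triples that probe the same core concept."""
    n = len(items)
    control_hits = manipulated_hits = 0
    for it in items:
        img, q, ans = it["control"]
        control_hits += ask_model(img, q) == ans
        img, q, ans = it["manipulated"]
        manipulated_hits += ask_model(img, q) == ans
    # a large gap suggests shortcut learning rather than core understanding
    return {"control_acc": control_hits / n,
            "manipulated_acc": manipulated_hits / n,
            "gap": (control_hits - manipulated_hits) / n}
```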