机器之心
Integrating 20+ advanced algorithms and outperforming GPT-4o: the autonomous causal analysis agent is here
机器之心· 2025-07-06 03:49
A research team from Biwei Huang's lab at the University of California San Diego (UC San Diego) has proposed Causal-Copilot, an autonomous causal analysis agent. The lab focuses on research at the intersection of causal inference and machine learning and has produced a series of important results in causal discovery and causal representation learning. The paper's co-first authors, Xinyue Wang, Kun Zhou, and Wenyi Wu, are all members of Prof. Biwei Huang's lab, where they carried out this work on combining causal inference with large language models. The research was also supported and assisted by the startup Abel.ai.

A common dilemma. Imagine this scenario: you are a biologist holding gene expression data, and intuition tells you that certain genes regulate one another, but how do you verify that relationship scientifically? You have heard the term "causal discovery," yet even the names of concrete algorithms such as PC and GES are unfamiliar to you.

Or you are a sociologist who wants to evaluate the true impact of an education policy on student performance. You know a naive comparison may be confounded by other factors, but faced with methods such as difference-in-differences and propensity score matching, each with its own assumptions, you have no idea where to start.

This is the current state of causal analysis: the theory keeps getting richer and the tools more powerful, yet the barrier to entry remains stubbornly high.

Limitations of pretrained models. Today's AI systems, including the most advanced large ...
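The PC algorithm mentioned above prunes edges between variables using conditional-independence tests. A minimal sketch of its skeleton-search phase, using (partial) correlation as the independence test on a toy linear chain; the function names and the fixed threshold are illustrative, not taken from Causal-Copilot:

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_corr(rxy, rxz, ryz):
    """Correlation of x and y after controlling for a single variable z."""
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def pc_skeleton(data, alpha=0.1):
    """Level-0/1 PC skeleton search: start fully connected, drop an edge
    when the (partial) correlation falls below a fixed threshold."""
    vars_ = list(data)
    edges = {frozenset((a, b)) for i, a in enumerate(vars_) for b in vars_[i + 1:]}
    corr = {frozenset((a, b)): pearson(data[a], data[b])
            for i, a in enumerate(vars_) for b in vars_[i + 1:]}
    for e in list(edges):
        a, b = tuple(e)
        if abs(corr[e]) < alpha:          # marginal independence
            edges.discard(e)
            continue
        for z in vars_:                   # independence given one conditioner
            if z in e:
                continue
            pr = partial_corr(corr[e],
                              corr[frozenset((a, z))],
                              corr[frozenset((b, z))])
            if abs(pr) < alpha:
                edges.discard(e)
                break
    return edges

# synthetic chain X -> Y -> Z: X and Z are dependent only through Y
random.seed(0)
X = [random.gauss(0, 1) for _ in range(2000)]
Y = [x + random.gauss(0, 0.3) for x in X]
Z = [y + random.gauss(0, 0.3) for y in Y]
skeleton = pc_skeleton({"X": X, "Y": Y, "Z": Z})
```

On this chain the spurious X-Z edge is removed because X and Z become independent once Y is controlled for, which is exactly the pruning step that tools like PC automate.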
Qiu Xipeng's team open-sources MOSS-TTSD: trained on a million hours of audio, it breaks through the AI-podcast uncanny valley
机器之心· 2025-07-05 05:53
Core Viewpoint
- The article discusses the launch of MOSS-TTSD, a revolutionary text-to-speech model that significantly enhances the quality of dialogue synthesis, overcoming previous limitations in generating natural-sounding conversational audio [3][5].

Group 1: MOSS-TTSD Overview
- MOSS-TTSD is developed through collaboration between Shanghai Chuangzhi Academy, Fudan University, and MoSi Intelligent, marking a significant advancement in AI podcasting technology [3].
- The model is open-source, allowing for unrestricted commercial applications, and is capable of generating high-quality dialogue audio from complete multi-speaker text [4][5].

Group 2: Technical Innovations
- MOSS-TTSD is based on the Qwen3-1.7B-base model and trained on approximately 1 million hours of single-speaker and 400,000 hours of dialogue audio data, enabling bilingual speech synthesis [13].
- The core innovation lies in the XY-Tokenizer, which compresses bitrates to 1 kbps while effectively modeling both semantic and acoustic information [15][16].

Group 3: Data Processing and Quality Assurance
- The team implemented an efficient data processing pipeline to filter high-quality audio from vast datasets, utilizing an internal speaker separation model that outperforms existing solutions [24][27].
- The model achieved a Diarization Error Rate (DER) of 9.7 and 14.1 on different datasets, indicating superior performance in speaker separation tasks [29].

Group 4: Performance Evaluation
- MOSS-TTSD was evaluated using a high-quality test set of approximately 500 bilingual dialogues, demonstrating significant improvements in speaker-switching accuracy and voice similarity compared to baseline models [31][34].
- The model's prosody and naturalness were found to be far superior to those of competing models, showcasing its effectiveness in generating realistic dialogue [35].
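The 1 kbps figure for the XY-Tokenizer is just frames per second times bits per frame. A quick sanity check with a hypothetical tokenizer configuration; the real XY-Tokenizer's frame rate, codebook count, and codebook size are not given here, so these numbers are purely illustrative:

```python
import math

def audio_token_bitrate(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second of a discrete audio tokenizer:
    frames/sec * codebooks per frame * bits per codebook index."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)

# hypothetical configuration that lands at ~1 kbps
bps = audio_token_bitrate(frame_rate_hz=12.5, n_codebooks=8, codebook_size=1024)
```

At 12.5 frames/s with 8 codebooks of 1024 entries each, the stream costs 12.5 * 8 * 10 = 1000 bits/s, which shows how aggressive a 1 kbps budget is compared with, say, 64 kbps telephony audio.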
Think before you act: embodied intelligence must also learn to imagine the future and execute the best plan | RSS 2025
机器之心· 2025-07-05 05:53
Core Viewpoint
- The article discusses the development of a new framework called FOREWARN, which combines world models and multimodal language reasoning to enhance the deployment intelligence of robotic systems, enabling them to make real-time decisions without additional data collection [5][21].

Group 1: Research Background
- The first author, Wu Yilin, is a second-year PhD student at Carnegie Mellon University, focusing on object manipulation and lifelong learning in robotics [1].
- The second author, Tian Ran, is a PhD candidate at UC Berkeley and a research scientist at NVIDIA, working on the safe and reliable application of foundational models in robotics [2].

Group 2: Challenges in Deployment Intelligence
- Current embodied intelligence models often struggle in real-world deployments due to their inability to adapt to environmental disturbances and user preference variations, leading to execution failures [3][21].
- The two main challenges in deployment are predicting the future consequences of actions and evaluating the predicted outcomes against task goals and user preferences [8][10].

Group 3: FOREWARN Framework
- The FOREWARN framework consists of two modules: Foresight (simulating future outcomes) and Forethought (evaluating those outcomes), allowing for a more structured decision-making process [11].
- The system uses a world model to predict environmental changes based on candidate actions and employs a fine-tuned multimodal language model to interpret these predictions semantically [12][18].

Group 4: Innovation Highlights
- The framework achieves cross-modal alignment between the world model's predictions and the language model's understanding, facilitating a closed-loop reasoning process from perception to decision-making [18].
- FOREWARN automates the decision-making process, significantly reducing deployment barriers and labor costs by enabling real-time selection of optimal action plans [19].

Group 5: Performance Evaluation
- The introduction of the FOREWARN framework improved the success rate of robotic tasks from below 30% to 70%-80%, demonstrating its effectiveness in adapting to changing task instructions and user preferences [21].
- Even under varying conditions, the system maintained a success rate of 60%-80%, showcasing its robustness and adaptability [21].

Group 6: Future Directions
- The research team identifies three challenges for broader application: enhancing the diversity and generalization of underlying strategies, addressing data scarcity issues, and optimizing reasoning efficiency and computational costs [23].
- The ongoing advancements in multimodal language models and world models are expected to further enhance the deployment intelligence of robots, enabling them to autonomously select safe and reasonable operational plans based on natural language instructions [23].
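The Foresight/Forethought split described above is essentially a predict-then-score loop over candidate actions. A toy sketch with stand-in callables; in the real system the world model is learned and the evaluator is a fine-tuned multimodal LLM, so everything below is a hypothetical simplification:

```python
def select_action(candidates, world_model, evaluator):
    """Foresight: roll each candidate action through a world model.
    Forethought: score the predicted outcome; return the best action."""
    scored = []
    for action in candidates:
        predicted_outcome = world_model(action)   # "imagine" the result
        score = evaluator(predicted_outcome)      # judge it against the goal
        scored.append((score, action))
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action, best_score

# toy stand-ins: the "world model" applies simple deterministic dynamics,
# the "evaluator" prefers outcomes close to a goal state of 10
world_model = lambda a: 2 * a
evaluator = lambda outcome: -abs(outcome - 10)

best, score = select_action([1, 3, 5, 7], world_model, evaluator)
```

The point of the structure is that the policy never executes an action blindly: every candidate is simulated and ranked first, which is what lets the reported success rates rise without collecting new data.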
From bizarre videos to fake papers: AI is turning the internet into a giant "garbage dump"
机器之心· 2025-07-05 04:19
Core Viewpoint
- The article discusses the rise of AI-generated content, particularly bizarre and disturbing videos, and the implications of such content for social media and academic integrity [2][19][25].

Group 1: AI-Generated Videos
- An AI-generated video reached 252 million views and 3.257 million likes on Instagram, showcasing the potential for viral content [2].
- The video features exaggerated and absurd scenarios, such as a woman jumping into the sea, which raises concerns about body shaming [3][5].
- The trend of creating bizarre AI videos is driven by social media algorithms that favor eye-catching, interactive content, pushing creators to produce increasingly extreme material [17][18].

Group 2: Impact on Academic Integrity
- Research from Sweden's University of Borås found hundreds of articles on Google Scholar suspected of being AI-generated, highlighting the risk of spreading false scientific information [20].
- Reliance on AI tools for writing peer reviews has increased, with certain telltale phrases becoming more common in scientific discussions, indicating a shift in research practices [21][22].
- The publication of low-quality AI-generated papers threatens the credibility of scientific research and the integrity of academic publishing [24][25].
Just in: Grok 4 benchmark scores leaked, hitting 45% on "Humanity's Last Exam", double Gemini 2.5, but netizens are skeptical
机器之心· 2025-07-05 02:46
Core Viewpoint
- The leaked benchmark results for Grok 4 and Grok 4 Code indicate significant performance improvements, suggesting that the models may surpass competitors in various AI assessments [2][3][26].

Benchmark Results
- Grok 4 achieved a standard score of 35% on Humanity's Last Exam (HLE), rising to 45% with reasoning enabled, roughly double OpenAI's o3 and four to five times GPT-4o [3][5].
- On GPQA (a graduate-level, "Google-proof" science Q&A benchmark), Grok 4 scored 87-88%, comparable to OpenAI's best result and exceeding Claude 4 Opus's roughly 75% [6].
- Grok 4 scored 95% on AIME '25 (the 2025 American Invitational Mathematics Examination), far above Claude 4 Opus's 34% and slightly better than OpenAI's o3, which scored 80-90% depending on reasoning mode [7].
- Grok 4 Code scored 72-75% on SWE-Bench, matching Claude Opus 4 and slightly surpassing OpenAI's o3 at 71.7% [8].

Model Development and Features
- Grok 4 is designed as a generalist model with capabilities in natural language, mathematics, and reasoning, and it completed training on June 29 [17].
- The model supports a context of approximately 130,000 tokens, suggesting a focus on optimizing reasoning speed rather than maximizing long-context performance [16].
- Grok 4 Code is tailored for programming tasks, allowing users to ask coding questions directly [18].

Development Process
- Elon Musk has been heavily involved in Grok 4's development, reportedly working overnight to push the model forward; he described progress as going well but still requiring a final large-scale training run [20][23].
- The leaked scores have generated excitement and speculation about a potential release, with expectations that Grok 4 may be officially announced soon [25][26].
ICML 2025 | A ChatGPT moment for multi-agent systems? SJTU's MAS-GPT generates workflows with a single click
机器之心· 2025-07-05 02:46
Core Viewpoint
- The article discusses the introduction of MAS-GPT, a new generative design paradigm for Multi-Agent Systems (MAS), which simplifies the process of creating a MAS to a single query input, making it as easy as interacting with ChatGPT [2][9].

Group 1: Introduction of MAS-GPT
- MAS-GPT is a collaborative effort from institutions including Shanghai Jiao Tong University and Oxford University, aiming to facilitate the development of MAS as a step towards Artificial General Intelligence (AGI) [2][3].
- The system allows users to generate a complete, executable MAS from a single query, significantly streamlining the process [2][12].

Group 2: Challenges in Existing MAS Methods
- Current MAS methods face three fundamental issues: lack of adaptability, high costs, and low generalization capabilities, which hinder their widespread application [5][7].
- Existing systems require extensive manual input and multiple rounds of LLM calls, making them inefficient and costly [7].

Group 3: MAS-GPT's Solution
- MAS-GPT transforms the design of a MAS into a language generation task, allowing a MAS to be generated automatically from a user query [9][10].
- The generated MAS is expressed as Python code, eliminating the need for manual coding [9].

Group 4: Performance and Evaluation
- MAS-GPT has been tested against more than ten existing methods across eight benchmark tasks and five mainstream models, demonstrating superior performance [16].
- It achieved an average accuracy improvement of 3.89% over the strongest baseline and maintained robust performance on unseen tasks [17].

Group 5: Cost Efficiency and Compatibility
- MAS-GPT operates at nearly half the inference cost of systems like DyLAN and GPTSwarm while delivering better results [18].
- The MAS generated by MAS-GPT shows strong compatibility and consistent performance across different LLMs [20].

Group 6: Future Potential and Community Engagement
- MAS-GPT has significant potential for future development, with the ability to generate novel MAS structures and adapt to new tasks [24][25].
- The MASWorks community aims to connect researchers globally, fostering collaboration and knowledge sharing in the MAS field [30][31].
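Since MAS-GPT emits the multi-agent system as Python source, deployment reduces to a generate-then-execute step. A minimal sketch with a mocked generator standing in for the fine-tuned LLM; every name and the two-agent structure here are hypothetical, not MAS-GPT's actual output format:

```python
def mock_mas_generator(query):
    """Stand-in for MAS-GPT: returns executable Python source for a
    multi-agent system tailored to the query. A real system would call
    a fine-tuned LLM here instead of returning a fixed template."""
    return '''
def run_mas(query):
    # two hypothetical agents: a solver drafts, a critic signs off
    draft = f"solver answer for: {query}"
    final = f"critic-approved: {draft}"
    return final
'''

def build_and_run(query):
    """Materialize the generated MAS in a fresh namespace and invoke it."""
    namespace = {}
    exec(mock_mas_generator(query), namespace)
    return namespace["run_mas"](query)

result = build_and_run("What is 2 + 2?")
```

The single-query interface described above is exactly this shape: one string in, one executable system out, with no manual wiring of agents in between.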
ICCV 2025 | Reducing spatio-temporal redundancy in diffusion models: SJTU's EEdit achieves training-free image-editing acceleration
机器之心· 2025-07-05 02:46
Core Viewpoint
- The article discusses the latest research from Professor Zhang Linfeng's team at Shanghai Jiao Tong University, introducing EEdit, a novel framework designed to enhance the efficiency of image editing by addressing spatial and temporal redundancy in diffusion models, achieving a speedup of over 2.4 times compared to previous methods [1][6][8].

Summary by Sections

Research Motivation
- The authors identified significant spatial and temporal redundancy in image-editing tasks using diffusion models, leading to unnecessary computational overhead, particularly in non-edited areas [12][14].
- The study highlights that the inversion process incurs higher time redundancy, suggesting that reducing redundant time steps can significantly accelerate editing tasks [14].

Method Overview
- EEdit employs a training-free caching acceleration framework that reuses output features to compress the inversion-process time steps and controls the frequency of area-marking updates through region score rewards [15][17].
- The framework is designed to adapt to various input types for editing tasks, including reference images, prompt-based editing, and drag-region guidance [10][15].

Key Features of EEdit
- EEdit achieves over 2.4X faster inference than the unaccelerated version and up to 10X speedup compared to other image-editing methods [8][9].
- The framework addresses the computational waste caused by spatial and temporal redundancy, optimizing the editing process without compromising quality [9][10].
- EEdit supports multiple input guidance types, enhancing its versatility in image-editing tasks [10].

Experimental Results
- EEdit's performance was evaluated on several benchmarks, demonstrating superior efficiency and quality metrics compared to existing methods [26][27].
- EEdit outperformed other methods on PSNR, LPIPS, SSIM, and CLIP metrics, showcasing its competitive edge in both speed and quality [27][28].
- The spatial locality caching algorithm (SLoC) used in EEdit proved more effective than other caching methods, achieving better acceleration and foreground preservation [29].
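The caching idea above, reusing features for unedited regions and recomputing only where region scores are high, can be sketched as follows. The scoring and recompute callables are placeholders and the threshold is arbitrary; this is not EEdit's actual SLoC implementation:

```python
def edit_with_cache(cached_features, region_scores, threshold, recompute):
    """Training-free cache reuse: tokens whose region score falls below
    the threshold keep their cached feature unchanged; only tokens in
    or near the edited region pay for recomputation."""
    out = []
    refreshed = 0
    for feat, score in zip(cached_features, region_scores):
        if score >= threshold:
            out.append(recompute(feat))  # inside the edit region: recompute
            refreshed += 1
        else:
            out.append(feat)             # cache hit: reuse as-is
    return out, refreshed

cached = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.8, 0.1, 0.0]  # only the first two tokens lie in the edit region
out, n_refreshed = edit_with_cache(cached, scores, threshold=0.5,
                                   recompute=lambda f: f + 10)
```

The speedup follows directly from the ratio of refreshed to reused tokens: here only 2 of 4 features are recomputed, and in a real edit the untouched background usually dominates.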
"The 2025 AI Playbook": what are AI companies past the hundred-million revenue mark actually doing?
机器之心· 2025-07-04 15:41
Group 1
- The core theme of ICONIQ Capital's 2025 "The State of AI" report is how to effectively build and scale AI products, shifting the question from whether to adopt AI to how to implement it [3][5].
- The report categorizes companies into "AI-Native" and "AI-Enabled," and identifies "High Growth Companies" based on specific revenue and growth criteria [5][6].
- High Growth Companies must have annual revenues of at least $10 million, with growth-rate requirements that vary by revenue bracket [6].

Group 2
- AI-Native companies show a faster product lifecycle and greater success in scaling their initial AI products than AI-Enabled companies, with 47% of AI-Native products achieving market validation versus only 13% of AI-Enabled products [7].
- The report emphasizes the importance of balancing experimentation, speed to market, and performance in AI product development [7].

Group 3
- The report's five main chapters cover the end-to-end process of AI product development, market pricing, organizational structure, budgeting, and internal productivity [6].
- It highlights the evolving demand for talent within AI companies and the differences in hiring trends between last year and this year [5].

Group 4
- The pricing logic for AI products is still maturing, with many companies exploring hybrid pricing strategies and a notable share of free products remaining in the market [5].
- The allocation of AI budgets varies significantly with product stage, and high-growth AI companies face specific challenges [5].

Group 5
- Not all AI companies fully utilize AI tools internally; certain departments show higher adaptability to AI technologies [5].
- The report identifies the most popular AI tools among AI companies and discusses the varying levels of AI adoption across functions [5].
No more blind LLM picking: new ICML 2025 research demystifies the "dark art" of large-model selection
机器之心· 2025-07-04 08:59
Core Viewpoint
- The article introduces the LensLLM framework developed at Virginia Tech, which significantly improves the efficiency of selecting large language models (LLMs) while reducing computational costs, addressing the challenges researchers and developers face in model selection [2][3][4].

Group 1: Introduction
- The rapid advancement of LLMs has made model selection a challenge, as traditional methods are resource-intensive and yield limited results [4].

Group 2: Theoretical Breakthrough of LensLLM
- LensLLM is based on a novel PAC-Bayesian generalization bound, revealing unique dynamics in the relationship between test loss and training data size during LLM fine-tuning [6][10].
- The framework provides a first-principles explanation of the "phase transition" in LLM fine-tuning performance, indicating when additional data investment yields significant performance improvements [12][16].

Group 3: LensLLM Framework
- LensLLM incorporates the Neural Tangent Kernel (NTK) to accurately capture the complex dynamics of transformer architectures during fine-tuning, establishing a precise relationship between model performance and data volume [15][16].
- The framework demonstrates impressive accuracy in curve fitting and test-loss prediction across various benchmark datasets, outperforming traditional models [17][18].

Group 4: Performance and Cost Efficiency
- LensLLM achieved a Pearson correlation coefficient of 85.8% and a relative accuracy of 91.1% on the Gigaword dataset, indicating its effectiveness in ranking models [21].
- The framework reduces computational costs by up to 88.5% compared to FullTuning, achieving superior performance with significantly fewer FLOPs [23][25].

Group 5: Future Prospects
- The research opens new avenues for LLM development and application, with potential extensions to multi-task scenarios and emerging model architectures such as Mixture of Experts (MoE) [27][30].
- LensLLM is particularly suited to resource-constrained environments, accelerating model testing and deployment cycles while maximizing performance [31].
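The test-loss-versus-data-size curve that LensLLM fits can be illustrated with a plain power law L(n) = a * n^(-b) + c fitted in log space, assuming the irreducible loss c is known. This is a deliberate simplification of the paper's NTK-based PAC-Bayes machinery, just to show why a few cheap small-scale fine-tuning runs can rank models without full training:

```python
import math

def fit_power_law(ns, losses, floor):
    """Fit L(n) = a * n^(-b) + floor by linear regression on
    log(L - floor) vs log(n), with the irreducible loss `floor`
    assumed known (a simplification of LensLLM's actual bound)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l - floor) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    b = -slope
    return a, b

def predict_loss(a, b, floor, n):
    """Extrapolate test loss to a larger fine-tuning set size n."""
    return a * n ** (-b) + floor

# synthetic fine-tuning curve generated from L(n) = 5 * n^-0.5 + 1.2
ns = [100, 400, 1600, 6400]
losses = [5 * n ** -0.5 + 1.2 for n in ns]
a, b = fit_power_law(ns, losses, floor=1.2)
```

Once a and b are recovered from small pilot runs, `predict_loss` ranks candidate models at the target data budget, which is the basic mechanism behind the reported cost savings.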
Learning by playing? Game-code-driven data synthesis improves general reasoning in multimodal large models
机器之心· 2025-07-04 08:59
Core Insights
- The article presents a novel approach called Code2Logic, which uses game code to synthesize multimodal reasoning data, enhancing the reasoning capabilities of visual language models (VLMs) [47][48].
- The research indicates that training AI on game scenarios can significantly improve its performance on geometric and graphical reasoning tasks [1][24].

Data and Model
- The scarcity of high-quality multimodal reasoning data limits the advancement of VLMs' complex reasoning abilities, motivating a cost-effective way to generate such data [4].
- The research team from Fudan University and ByteDance proposes leveraging game code to automatically synthesize visual reasoning data, capitalizing on the structured nature of games [12][13].

Methodology
- The Code2Logic method involves three core steps: generating game code with large language models (LLMs), designing question-answer templates from the game code, and building an automated data engine to generate Q&A instances [13][14][15].
- The GameQA dataset created through this method spans 30 games, 158 reasoning tasks, and 140,000 Q&A pairs, showcasing its scalability and diversity [18].

Training and Performance
- Training on GameQA data yields significant performance improvements on both in-domain and out-of-domain tasks, demonstrating the generalization ability of models trained on this dataset [24][25].
- Models trained with GameQA outperform those trained on traditional geometric reasoning datasets, indicating the cognitive diversity and reasoning complexity inherent in game data [28][29].

Scaling Effects
- The research identifies two scaling effects: greater game variety improves out-of-domain generalization, and sample diversity correlates positively with generalization performance [37][38].
- These findings suggest that GameQA's diversity and scalability contribute to stronger generalization on reasoning tasks [39].

Limitations and Challenges
- The analysis highlights key limitations in VLMs' reasoning capabilities, particularly in 3D spatial perception, pattern recognition, and strategic planning [42][45].
- The study emphasizes the need for further improvements in models' ability to handle complex reasoning tasks effectively [46].
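Code2Logic's key trick, answers computed by running game code rather than labeled by hand, can be sketched with a toy grid game standing in for the paper's 30 games. The game, the template, and all names below are illustrative only:

```python
import random

def gen_qa_from_game(seed):
    """Code2Logic-style synthesis sketch: run deterministic game code,
    then fill a question template whose answer is computed by the code
    itself, so every Q&A pair is correct by construction."""
    rng = random.Random(seed)            # seeded: each seed is a new instance
    size = 4
    targets = {(rng.randrange(size), rng.randrange(size)) for _ in range(3)}
    question = (f"A {size}x{size} grid contains targets at {sorted(targets)}. "
                f"How many targets lie on the main diagonal?")
    answer = sum(1 for r, c in targets if r == c)  # ground truth from game state
    return {"question": question, "answer": answer}

# an automated data engine is then just a loop over seeds
dataset = [gen_qa_from_game(seed) for seed in range(100)]
```

Because the answer is derived from the same state the question describes, the pipeline scales to arbitrarily many verified samples at near-zero labeling cost, which is the property the GameQA dataset exploits.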