机器之心
ICCV 2025 | One image is all you need: for multimodal instruction data synthesis, just supply the images and Oasis handles the rest
机器之心· 2025-07-18 03:14
Core Viewpoint
- The article discusses a novel multimodal instruction data synthesis method called Oasis, which eliminates the need for complex prompt design by relying solely on images for data generation, thereby enhancing efficiency and quality in data synthesis [1][6].

Research Motivation
- Traditional multimodal data synthesis methods face issues such as lack of diversity, insufficient quality, and heavy reliance on manual input, which Oasis aims to address [7][8].

Method Introduction
- Oasis operates in three main steps: constructing a hooking prompt for autoregressive sampling, classifying the sampling results to retain instruction-type outputs, and conducting quality control and response generation [11][12].

Data Characteristics Analysis
- The Oasis dataset, Oasis-500k, was synthesized from approximately 500,000 images; data volume scales linearly with the number of images, demonstrating scalability [21][22].
- The average instruction length of Oasis data is 76.80 and the average response length is 71.16, indicating richer information content than LLaVA-NeXT [24].
- The language diversity of Oasis data includes English (78.52%), Chinese (18.66%), and several other languages, showcasing its broad applicability [27].

Experimental Results
- Oasis shows significant performance improvements over baseline models, with average accuracy increases of 3.1% for Vicuna1.5, 1.8% for Qwen2.5, and 3.2% for Llama3 [38].
- Adding 500k Oasis samples raised the average score by 5.2%, confirming the effectiveness of data scaling [41].

Effectiveness of Oasis
- Oasis demonstrates strong capability in synthesizing domain-specific data, particularly for OCR tasks, leading to notable performance gains on relevant benchmarks [43].

Quality Control Mechanism
- The quality control mechanism for instructions is essential: it significantly improves model performance, with a noted increase of over 7% on specific tasks [50].
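The three-step pipeline described above can be sketched roughly as follows. This is an illustrative toy, not the paper's actual implementation: `StubMLLM`, `looks_like_instruction`, and `passes_quality_check` are hypothetical stand-ins for the real multimodal model, instruction classifier, and quality-control model.

```python
class StubMLLM:
    """Hypothetical stand-in for a multimodal LLM; returns canned text."""
    def generate(self, image, prompt):
        # With an empty prompt, the model free-continues from the image alone.
        return "What objects are in the image?" if prompt == "" else "A cat and a chair."

def looks_like_instruction(text):
    # Step 2: a crude classifier keeping only question/instruction-shaped outputs.
    return text.strip().endswith("?") or text.lower().startswith(("describe", "list"))

def passes_quality_check(text):
    # Step 3a: trivial length-based gate standing in for the paper's
    # quality-control model.
    return 5 <= len(text.split()) <= 100

def synthesize_example(image, mllm):
    # Step 1: the "hooking" prompt -- give the model only the image and let it
    # continue autoregressively, producing a candidate instruction with no
    # hand-written task prompt.
    candidate = mllm.generate(image=image, prompt="")
    if not looks_like_instruction(candidate):  # Step 2: filter sampling results
        return None
    if not passes_quality_check(candidate):    # Step 3a: quality control
        return None
    # Step 3b: generate a response for the surviving instruction.
    response = mllm.generate(image=image, prompt=candidate)
    return {"instruction": candidate, "response": response}

example = synthesize_example("img_001.jpg", StubMLLM())
print(example)
```

Because the only required input per example is an image, scaling the dataset reduces to scaling the image pool, consistent with the linear-growth observation above.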
Tomorrow: watch and learn at the ACL 2025 paper sharing session; last chance to register
机器之心· 2025-07-18 03:14
Core Insights
- The AI field continues to be exciting in 2025, with numerous research releases from major tech companies and institutions [1]
- The rapid pace of technological advancement in AI is overwhelming, with new models emerging almost weekly [3][4]
- Developers and researchers are increasingly engaging in conferences and academic sharing to stay updated on cutting-edge research [5]

Event Overview
- The ACL 2025 conference, a significant event in the NLP field, will take place from July 27 to August 1 in Vienna, Austria, with a record number of over 8,000 submissions [6][21]
- The conference will feature various activities, including keynote speeches, paper presentations, roundtable discussions, and poster sessions [6][21]

Keynote Speakers and Topics
- The morning keynote will be presented by Che Wanxiang, focusing on trends and outlooks for ACL 2025 [10][20]
- The afternoon keynote by Liu Pengfei will discuss reinforcement learning and complex reasoning in large models [22][24]

Paper Presentations
- A range of topics will be covered in the paper presentations, including social exchange theory with large language models, metaphor-driven communication, and the dark side of LLMs [11][12][14]
- The event will also include a roundtable discussion on the value of "context engineering" featuring experts from various institutions [26][31][35]

Poster Sessions
- Authors will present their papers and posters during the event, with live streaming available on multiple platforms for broader access [37]
ICCV 2025 | Training too complex? Demands on image semantics and layout too high? Image morphing finally works in a single step
机器之心· 2025-07-18 00:38
Core Viewpoint
- The article introduces FreeMorph, a novel training-free image morphing method that enables high-quality, smooth transitions between two input images without pre-training or additional annotations [5][32].

Group 1: Background and Challenges
- Image morphing is a creative task that produces smooth transitions between two distinct images, commonly seen in animation and photo editing [3].
- Traditional methods relied on complex algorithms and faced challenges with high training costs, data dependency, and instability in real-world applications [4].
- Recent deep learning methods such as GANs and VAEs have improved image morphing but still struggle with training costs and adaptability [4][5].

Group 2: FreeMorph Methodology
- FreeMorph eliminates the need for training, achieving effective morphing from just two images [5].
- The method incorporates two key innovations, spherical feature aggregation and a prior-driven self-attention mechanism, which strengthen the model's ability to preserve identity features and ensure smooth transitions [11][32].
- A step-oriented motion flow is introduced to control the transition direction, allowing a coherent and gradual morphing process [21][32].

Group 3: Experimental Results
- Evaluated against existing methods, FreeMorph demonstrates superior performance, generating high-fidelity results across diverse scenarios, including images with varying semantics and layouts [27][30].
- The method effectively captures subtle changes, such as color variations in objects or nuanced facial expressions, showcasing its versatility [27][30].

Group 4: Limitations
- Despite its advances, FreeMorph struggles with image pairs that differ greatly in semantics or layout, which can result in less smooth transitions [34].
- The method inherits biases from the underlying Stable Diffusion model, affecting accuracy in specific contexts, such as human limb structures [34].
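Spherical interpolation of latent features is a common building block in training-free morphing of this kind: blending two diffusion latents along the unit arc rather than a straight line keeps intermediate latents on-distribution. A minimal slerp sketch, illustrative only and not FreeMorph's actual code:

```python
import numpy as np

def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two feature vectors.
    Commonly used to blend diffusion latents; illustrative only."""
    v0n = v0 / (np.linalg.norm(v0) + eps)
    v1n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:  # nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

# A sequence of intermediate latents gives the frames of a smooth morph:
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
frames = [slerp(a, b, t) for t in np.linspace(0.0, 1.0, 5)]
print(frames[2])  # midpoint lies on the unit arc between a and b
```

Sampling `t` on a non-uniform schedule is one way a "step-oriented" control over transition speed and direction could be expressed in this framing.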
Just now: OpenAI's general-purpose agent, ChatGPT Agent, officially debuts
机器之心· 2025-07-18 00:38
Core Viewpoint
- The introduction of ChatGPT Agent marks a significant advancement in AI capabilities, enabling it to perform complex tasks autonomously and interactively, far beyond just answering questions [4][9][53].

Group 1: Product Features
- ChatGPT Agent can use various tools to help users complete complex tasks, such as browsing calendars, generating editable presentations, and running code [6][12][19].
- The model achieved a score of 41.6% on the HLE benchmark, nearly doubling the performance of previous models [6][34].
- Users can access ChatGPT Agent through OpenAI Pro, Plus, and Team subscriptions, with usage limits determined by subscription tier [7][8].

Group 2: Technical Capabilities
- At the core of this new capability is a unified agentic system that combines the strengths of previous breakthroughs, including web interaction and deep-research capabilities [19][25].
- ChatGPT Agent can dynamically plan and choose tools to handle tasks, switching seamlessly between reasoning and execution [20][28].
- It is equipped with a suite of tools, including a visual browser, a text browser, a terminal interface, and API access, enhancing its ability to gather and process information [26][27].

Group 3: Benchmark Performance
- ChatGPT Agent demonstrated superior performance across benchmark tests, including a 41.6% pass rate on "Humanity's Last Exam" [34].
- On the FrontierMath benchmark, it reached 27.4% accuracy, significantly outperforming previous models [37].
- It excelled on the SpreadsheetBench test, scoring 45.5% on real-world spreadsheet-editing tasks, compared to 20.0% for Excel's Copilot [42][44].

Group 4: User Experience and Feedback
- Early users reported that ChatGPT Agent could create comprehensive plans, such as retirement strategies, in a fraction of the time and cost of human advisors [58][60].
- Users have noted the agent's ability to autonomously complete tasks such as online shopping, although some found manual execution more efficient in certain cases [63][67].
Le Chat benchmarks against ChatGPT across the board as Europe's AI upstart gives chase
机器之心· 2025-07-18 00:38
Core Viewpoint
- Mistral AI aims to position itself as a European counterpart to OpenAI, developing advanced AI models and applications to compete in the AI landscape [1][3].

Group 1: Product Developments
- Mistral AI has released several open-source models, including a highly regarded OCR model, a multimodal model comparable to Claude, and its first reasoning large model, Magistral [2][4].
- The company recently upgraded its Le Chat application, enhancing its capabilities to compete directly with ChatGPT [4][23].
- New Le Chat features include a research mode that generates structured reports on complex topics, a voice mode powered by the Voxtral model for natural speech interaction, and advanced image-editing capabilities [6][9][13][16].

Group 2: Voice Recognition Model
- Mistral AI launched the Voxtral model, touted as the "best open-source" speech recognition model, surpassing existing models such as Whisper large-v3 and GPT-4o mini Transcribe [27][29].
- Voxtral supports long-context understanding up to 32k tokens and can transcribe audio up to 30 minutes long, showcasing its advanced capabilities [30].
- The model offers built-in question answering and summarization, automatic language recognition, and the ability to trigger backend functions directly from voice commands [30].

Group 3: Market Position and Community Response
- Mistral AI's recent advances signal strong momentum in the European large-model sector, generating excitement among users and industry observers [24].
- Users have reported positive experiences with Le Chat's image-editing capabilities, claiming it performs better than OpenAI's offerings [17][18].
ACL 2025 Oral | Your model-evaluation partner is online: Evaluation Agent understands you, and understands AI even better
机器之心· 2025-07-17 09:31
Core Viewpoint
- The article introduces the Evaluation Agent, an AI framework designed to evaluate visual generative models efficiently and flexibly, addressing the limitations of traditional evaluation methods while catering to user-specific needs [3][41].

Group 1: Evaluation Agent Features
- Customizable: users can specify their focus areas, and the Evaluation Agent tailors the evaluation plan accordingly, enabling "on-demand evaluation" [11][12].
- Highly efficient: the Evaluation Agent significantly reduces the number of samples needed, compressing overall evaluation time to about 10% of traditional methods and making it suitable for rapid feedback during iterative development [13].
- Explainable: results are presented in natural language, with comprehensive summaries of model capabilities, limitations, and directions for improvement [14].
- Scalable: the framework supports the integration of various tasks, tools, and metrics, making it adaptable to different visual generative tasks such as image and video generation [15].

Group 2: Framework Operation
- The Evaluation Agent operates in two main stages: a Proposal Stage, which customizes the evaluation plan based on user input, and an Execution Stage, in which the framework generates content and analyzes its quality with appropriate evaluation tools [20][22].
- Dynamic multi-round interaction provides continuous feedback, optimizing prompts and task settings based on evaluation results and enabling a deeper assessment of model capabilities [23].

Group 3: Performance Comparison
- The Evaluation Agent is far more efficient than traditional evaluation frameworks, saving over 90% of evaluation time while maintaining highly consistent results across various models [28][29].

Group 4: Future Directions
- Future research may extend the Evaluation Agent to more visual tasks, optimize open-ended evaluation mechanisms, and deepen its understanding of complex concepts such as style transfer and emotional expression [36][39].
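The two-stage, multi-round loop described above can be caricatured in a few lines. Everything here is hypothetical: the probe generator, the scoring tool, and in particular the early-stopping rule (stop once the score estimate stabilizes) are stand-ins meant only to show how adaptive sampling can cut evaluation time relative to a fixed benchmark suite.

```python
def propose_plan(focus):
    # Proposal Stage: expand the user's stated focus area into concrete probes.
    return [f"{focus} probe #{i}" for i in range(1, 6)]

def run_tool(prompt, model):
    # Execution Stage: generate content for the probe and score it with a tool.
    return model(prompt)

def evaluation_agent(focus, model, budget=5):
    scores = []
    for prompt in propose_plan(focus)[:budget]:
        scores.append(run_tool(prompt, model))
        # Multi-round interaction: stop early once the estimate stabilizes,
        # which is how sample counts can stay far below a fixed suite's.
        if len(scores) >= 2 and abs(scores[-1] - scores[-2]) < 0.05:
            break
    return sum(scores) / len(scores), len(scores)

# A model whose scores are already stable triggers the earliest possible stop.
mean, used = evaluation_agent("temporal consistency", lambda p: 0.8)
print(mean, used)
```

A real system would replace the stopping rule with the agent's own judgment about whether it has learned enough about the capability under test.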
Two big pitfalls of reinforcement learning, finally solved by two ICLR papers
机器之心· 2025-07-17 09:31
Core Viewpoint
- The article discusses the emergence of real-time reinforcement learning (RL) frameworks that address the limitations of traditional RL algorithms, particularly in dynamic environments where timely decision-making is crucial [1][4].

Group 1: Challenges in Traditional Reinforcement Learning
- Existing RL algorithms often rely on an idealized interaction model in which the environment and agent take turns pausing, which does not reflect real-world scenarios [3][4].
- Two key difficulties arise in real-time environments: inaction regret, where agents fail to act at every step because reasoning takes too long, and delay regret, where actions computed from past states take effect late [7][8].

Group 2: New Frameworks for Real-Time Reinforcement Learning
- Two papers from the Mila laboratory propose a new real-time RL framework to tackle reasoning delays and missed actions, enabling large models to respond instantly in high-frequency, continuous tasks [9].
- The first paper introduces an asynchronous multi-process inference and learning framework that lets agents use available compute effectively, thereby eliminating inaction regret [11][15].

Group 3: Performance in Real-Time Applications
- The first paper demonstrates the framework's effectiveness by capturing Pokémon in the game "Pokémon: Blue" with a 100-million-parameter model, emphasizing the need for rapid adaptation to new scenarios [17].
- The second paper presents an architectural solution that minimizes inaction and delay in real-time environments, drawing parallels to early CPU architectures and introducing parallel computation mechanisms in neural networks [22][24].

Group 4: Combining Techniques for Enhanced Performance
- Combining staggered asynchronous inference with temporal skip connections reduces both inaction and delay regret, enabling faster decision-making in real-time systems [27][36].
- This integration makes it possible to deploy powerful, responsive agents in critical fields such as robotics, autonomous driving, and financial trading, where response speed is essential [36][37].
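The staggering idea can be illustrated with a deterministic toy: if one inference pass takes K environment steps, then K workers launched one step apart finish in round-robin, delivering one fresh action per step after warm-up. This removes inaction regret, while each action is still based on the state observed K steps earlier (the residual delay regret that temporal skip connections target). This is a hypothetical sketch, not the papers' actual framework.

```python
K = 3  # hypothetical inference latency, measured in environment steps

def act_stream(num_steps, k=K):
    """Trace of a staggered pipeline: worker i starts inference at step i and
    finishes k steps later, so from step k onward one action lands per step."""
    trace = []
    for step in range(num_steps):
        if step >= k:
            # The worker launched at (step - k) just finished: its action is
            # computed from the state seen k steps ago (delay regret remains).
            trace.append(("act", step, step - k))
        else:
            # Pipeline warm-up: no worker has finished yet (inaction window).
            trace.append(("wait", step, None))
    return trace

trace = act_stream(6)
print(trace)
```

With a single synchronous worker the agent would act only every K steps; the staggered pipeline trades that inaction for a fixed, predictable observation lag.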
Musk's Grok AI boyfriend is still being named, while an open-source AI girlfriend is already a hit, and in 3D
机器之心· 2025-07-17 09:31
Core Viewpoint
- The article discusses the launch of Grok's new "Intelligent Companion" feature and Elon Musk's engagement in naming a male digital companion, reflecting the growing interest in personalized AI avatars [2][3].

Group 1: Grok's Intelligent Companion
- Grok has introduced a new feature called "Intelligent Companion," which includes avatars such as Ani and Rudy, with a male character yet to be named [2].
- Elon Musk is actively soliciting suggestions for the male Grok companion's name, indicating his involvement and interest in the project [2][7].

Group 2: Community Engagement
- The community has responded with various name suggestions, with "Draven" a popular choice among users [7].
- Users are engaging creatively with the concept; one user, Jackywine, created a 3D animated version of Ani named "Bella" [9].

Group 3: Bella Project Overview
- The "Bella" project aims to create a digital companion that evolves and grows alongside the user, representing a long-term vision of personalized AI [12][13].
- Bella is currently in early development, focusing on video expressions and interaction elements that simulate a connection with users [14][15].

Group 4: Development Phases
- The project is structured in three phases:
  1. **Sentient Core**: establishing a multi-modal data processing pipeline to understand the world [17].
  2. **Generative Self**: creating a unique personality for Bella, allowing dynamic interactions based on user engagement [21].
  3. **Proactive Companion**: transitioning from passive responses to proactive support, enabling continuous learning and self-evolution [31].

Group 5: Technical Architecture
- Bella's architecture includes a "Sensor-Bus-Processor" model for data collection and processing, enhancing system scalability and robustness [20].
- The design allows modular upgrades, ensuring that improvements to AI models or 3D representations do not disrupt overall functionality [30].

Group 6: Future Enhancements
- Future plans for Bella include voice recognition, gesture recognition, and a sentiment system, aiming for a more interactive and responsive digital companion [36].
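A "Sensor-Bus-Processor" design is essentially publish/subscribe: sensors emit events onto a bus without knowing who consumes them, and processors subscribe by topic, so either side can be upgraded independently. A minimal sketch of the pattern; the class and topic names are illustrative, not the Bella project's code.

```python
from collections import defaultdict

class EventBus:
    """Decouples data producers (sensors) from consumers (processors)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.subscribers[topic]:
            handler(payload)

bus = EventBus()
perceived = []

# Processors: consume raw sensor events; swappable without touching sensors.
bus.subscribe("audio", lambda chunk: perceived.append(("heard", chunk)))
bus.subscribe("video", lambda frame: perceived.append(("saw", frame)))

# Sensors: publish onto the bus, unaware of who consumes the data.
bus.publish("audio", "hello")
bus.publish("video", "frame-0")
print(perceived)
```

This decoupling is what makes the modular-upgrade claim above plausible: replacing the speech model only changes the "audio" handler, leaving sensors and other processors untouched.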
Last night, the cloud-computing leader forged a "golden shovel" for putting agents into production
机器之心· 2025-07-17 09:31
Core Insights
- The article argues that multi-agent AI represents the next major direction for large models, showcasing unprecedented capabilities and signaling a major iteration of large language models (LLMs) [1][3][9]
- Amazon Web Services (AWS) is leading the charge with a comprehensive Agentic AI technology stack, easing the transition from concept to practical application [10][62]

Group 1: Multi-Agent AI Developments
- Recent releases such as Grok 4 and Kimi K2 use multi-agent technology, enabling models to autonomously understand their task environment and use external tools to solve complex problems [2][4]
- AWS's Agentic AI framework rests on four pillars: model application capability, security and reliability, scalability, and deployment and production capability [5][6]
- The introduction of Amazon Bedrock AgentCore enables the construction and deployment of enterprise-grade, secure agent services through seven core services [14][17]

Group 2: Agent Applications and Tools
- The AgentCore Runtime provides a dedicated runtime environment for agent applications, supporting third-party models and significantly reducing deployment costs [20][21]
- AWS has expanded its Amazon Bedrock platform to include 12 major model vendors, strengthening its generative AI capabilities across modalities [24][27]
- The launch of Amazon S3 Vectors cuts vector storage and query costs by 90%, enabling agents to retain more context from interactions [50][52]

Group 3: Collaboration and Development
- The Strands Agents SDK has been upgraded to facilitate the creation of multi-agent systems, enabling more efficient collaboration on complex tasks [38][39]
- New protocols such as Agent to Agent (A2A) enhance communication between agents, marking a shift toward proactive collaboration [41][46]
- The various APIs and tools introduced in Strands Agents V1.0 simplify the development of multi-agent applications, lowering the barrier for developers [45][46]

Group 4: Future Outlook
- The article predicts that by 2025, agents will begin large-scale deployment, fundamentally changing how software interacts with the world and how humans interact with software [9][61]
- AWS aims to build the most practical Agentic AI platform, supporting companies of all sizes in deploying reliable, secure agent solutions [62][63]
- The continued evolution of agent technology is expected to yield more disruptive applications, deepening the integration of AI as a digital colleague in business operations [64][65]
Princeton-led team releases the strongest open-source math theorem-proving model: 32B performance far surpasses the previous-generation SOTA, DeepSeek 671B
机器之心· 2025-07-17 05:03
Core Insights
- The article discusses the launch of Goedel-Prover-V2, a new open-source mathematical theorem-proving model led by Princeton University in collaboration with several top institutions, including Tsinghua University and Stanford University. The model significantly outperforms previous state-of-the-art models across benchmarks [1][10].

Performance Highlights
- The 32B flagship model achieved an 8.0% improvement in Pass@32 accuracy on the MiniF2F test over the previous SOTA model, DeepSeek-Prover-V2-671B [6].
- The 8B model matched the performance of the 671B SOTA model, a breakthrough in efficiency and capability [7][22].
- Goedel-Prover-V2 ranked first on the challenging PutnamBench, solving 64 problems at Pass@64 and outperforming DeepSeek-Prover-V2-671B, which solved 47 problems at Pass@1024 [9][14][20].

Technical Innovations
- The development of Goedel-Prover-V2 combines expert iteration and reinforcement learning with three key innovations:
  - Model averaging integrates model weights from different training checkpoints, enhancing robustness and overall performance [12][32].
  - Scaffolded data synthesis automatically generates proof tasks of progressively increasing difficulty, enabling smoother training [13][26].
  - Verifier-guided self-correction lets the model iteratively refine its proofs using feedback from the Lean compiler, simulating human-like self-correction [13][32].

Benchmark Results
- On the MiniF2F test, the 8B model achieved a Pass@32 rate of 83.3%, surpassing the 671B SOTA model [12].
- The flagship model reached Pass@32 rates of 88.1% in standard mode and 90.4% in self-correction mode, significantly exceeding previous models [12].
- Goedel-Prover-V2-32B remained consistently superior to earlier models across various inference sampling budgets [21][22].

Model and Dataset Availability
- The Goedel-Prover-V2 models and the new MathOlympiadBench benchmark dataset have been publicly released to support research in the field [28][30].
- MathOlympiadBench comprises 360 formalized problems from international mathematics competitions, aimed at enhancing preparation for events like the International Math Olympiad [30][31].
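The model-averaging idea mentioned above, uniformly averaging checkpoint weights from different points in training, has a very small core. This sketch averages toy state dicts; it is illustrative only, and the paper's actual recipe (which checkpoints, whether weighting is uniform) may differ.

```python
def average_checkpoints(state_dicts):
    """Parameter-wise uniform average of checkpoint weight dictionaries.
    Assumes all checkpoints share the same architecture (same keys)."""
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in keys}

# Toy "checkpoints" standing in for full model state dicts:
ckpt_a = {"w": 1.0, "b": 0.0}
ckpt_b = {"w": 3.0, "b": 2.0}
print(average_checkpoints([ckpt_a, ckpt_b]))
```

The intuition is that checkpoints from different training stages sit in the same loss basin, so their average can be more robust than any single endpoint.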