IROS 2025 | A New Paradigm for Robotic Garment Folding: Lin Shao's Team at NUS Decouples Trajectories from Actions with MetaFold
机器之心· 2025-09-03 00:44
Core Insights - The article discusses the development of MetaFold, a language-guided framework for multi-category garment folding, which aims to enhance the capabilities of robots in manipulating deformable objects like clothing [4][31].
Group 1: Framework Overview - MetaFold addresses the challenges of deformable object manipulation (DOM) by integrating visual and language guidance to improve task execution [3][4].
- The framework employs a hierarchical architecture that separates task planning and action prediction, inspired by the human nervous system [7][37].
- It utilizes a point cloud trajectory generation model that combines geometric features from point clouds with semantic features from language instructions [15][16].
Group 2: Dataset and Methodology - A large dataset was created, consisting of 1,210 garments and 3,376 trajectories, to train the model effectively [10].
- The dataset includes four main folding types: no-sleeve, short-sleeve, long-sleeve, and pants, each with corresponding natural language descriptions [10].
- The trajectory generation model is based on a conditional variational autoencoder (CVAE) and employs a cross-attention mechanism for effective information fusion [15][16].
Group 3: Performance Evaluation - MetaFold demonstrated superior performance in garment folding tasks, achieving a success rate of 79%-97% on unseen datasets, showcasing its strong generalization capabilities [20][22].
- The framework maintained high levels of rectangularity (0.80-0.87) and area ratio (0.24-0.45), outperforming baseline methods [22][23].
- In real-world experiments, MetaFold successfully completed various garment folding tasks, confirming its practical applicability and robustness [26][29].
Group 4: Conclusion and Future Directions - The research presents MetaFold as a significant advancement in robotic manipulation of garments, emphasizing its contributions to performance, generalization, and interpretability [31][37].
- The open-sourcing of the dataset provides valuable resources for future research in the field [11][37].
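The CVAE-plus-cross-attention design mentioned above lends itself to a compact illustration. The sketch below shows one plausible way to fuse point cloud features (as queries) with language features (as keys/values) and decode a folding trajectory from a latent code; it assumes PyTorch, and every module name, dimension, and pooling choice is illustrative rather than taken from MetaFold's released code.

```python
import torch
import torch.nn as nn

class PointLanguageFusion(nn.Module):
    """Illustrative cross-attention fusion feeding a CVAE trajectory decoder."""
    def __init__(self, d_model=256, n_heads=4, latent_dim=64, traj_len=50):
        super().__init__()
        # Point features attend to language tokens: points are queries,
        # language embeddings are keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # CVAE encoder head: map the fused condition to a latent Gaussian.
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        # Decoder: latent code + pooled condition -> a trajectory of 3D waypoints.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + d_model, 512), nn.ReLU(),
            nn.Linear(512, traj_len * 3),
        )
        self.traj_len = traj_len

    def forward(self, point_feats, lang_feats):
        # point_feats: (B, N_points, d), lang_feats: (B, N_tokens, d)
        fused, _ = self.cross_attn(point_feats, lang_feats, lang_feats)
        cond = fused.mean(dim=1)                              # pool over points
        mu, logvar = self.to_mu(cond), self.to_logvar(cond)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        traj = self.decoder(torch.cat([z, cond], dim=-1))
        return traj.view(-1, self.traj_len, 3), mu, logvar

pts, lang = torch.randn(2, 1024, 256), torch.randn(2, 12, 256)
traj, mu, logvar = PointLanguageFusion()(pts, lang)
print(traj.shape)  # torch.Size([2, 50, 3])
```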
Just In: Anthropic Raises $13 Billion amid Skepticism, Hitting a $183 Billion Valuation
机器之心· 2025-09-03 00:44
Core Insights - Anthropic has completed a new funding round of $13 billion, achieving a post-money valuation of $183 billion, which is three times its valuation from the last funding round in March [1][4] - This funding round is the second largest private financing in the tech industry, following OpenAI's historic $40 billion financing in March 2025 [4] - Anthropic's revenue has grown significantly, reaching approximately $1 billion by early 2025 and exceeding $5 billion in annualized revenue just eight months later, making it one of the fastest-growing tech companies in history [5] Funding Details - The latest funding round is part of Anthropic's Series F, led by Iconiq, Fidelity Management & Research Co., and Lightspeed Venture Partners, with participation from other investors like Altimeter, General Catalyst, and Coatue [4] - The new capital will be used to deepen safety research, meet growing enterprise demands, and support international expansion [7] Revenue Growth and Product Development - Anthropic has served over 300,000 enterprise customers, with the number of large clients (each generating over $100,000 in revenue) increasing nearly sevenfold in the past year [6] - The company highlighted the success of Claude Code, which has generated over $500 million in operating revenue and saw usage grow more than tenfold within three months of its full release in May 2025 [5] Controversies and Challenges - Despite the high valuation, Anthropic has faced several controversies, including user data collection practices and usage limits for heavy users, which have sparked community backlash [8][9] - The company announced that user chat and coding sessions would be used for model training by default unless users opt out, with data retention policies extending up to five years [9] - There have been complaints regarding model performance fluctuations and the company's approach to handling abusive interactions, raising concerns among users [9] Competitive Landscape - Anthropic, founded by former OpenAI executives, is now a fierce competitor to OpenAI, which has recently announced plans to sell stock, potentially valuing the company at around $500 billion [10]
New Apple Research: How to Make AI's Question-Asking 6.5x More Efficient with No Fine-Tuning and No Retraining
机器之心· 2025-09-02 09:33
Core Viewpoint - The article discusses a new method called BED-LLM developed by Apple in collaboration with Oxford University and City University of Hong Kong, which makes AI question-asking 6.5 times more efficient without the need for fine-tuning or retraining [1][20].
Group 1: Introduction to BED-LLM - Apple has been relatively low-profile in the AI landscape dominated by large language models (LLMs), but it has produced significant research outcomes like FastVLM [1].
- The BED-LLM method allows AI to improve its question-asking capabilities, leading to a success rate increase from 14% to 91% [1][20].
Group 2: Challenges with LLMs - LLMs struggle to adaptively gather information from users or external environments, often leading to a "multi-turn amnesia" where they forget previous constraints [4][16].
- Enhancing LLMs' ability to ask targeted questions based on real-time feedback is essential for effective interaction [5].
Group 3: Mechanism of BED-LLM - The BED-LLM framework uses sequential Bayesian experimental design to formulate interactive information gathering as a sequential experimental design problem [7][9].
- The process involves maximizing expected information gain (EIG) with each question asked, updating beliefs based on user responses, and selecting the next question accordingly [10][11].
Group 4: Innovations in BED-LLM - The method incorporates three key innovations:
- **Insight One**: Focus on genuine information gain rather than superficial uncertainty, ensuring that questions yield maximum value [14].
- **Insight Two**: A sample-then-filter strategy to maintain logical consistency and prevent LLMs from contradicting previous answers [16][17].
- **Insight Three**: A targeted conditional generation strategy that allows LLMs to generate questions that effectively narrow down hypotheses [18].
Group 5: Performance Validation - The research team compared BED-LLM against two mainstream baseline methods, demonstrating superior performance in tasks like the "20 Questions" game and movie preference recommendations [20].
- In various datasets, BED-LLM significantly improved success rates, with Mistral-Large achieving a success rate of 91% in celebrity prediction tasks [20][21].
Group 6: Real-World Application - The team conducted a cross-model stress test, showing that BED-LLM maintains its performance advantage even when the questioning and answering AIs run on different models [23][24].
- This indicates the robustness of BED-LLM in real-world scenarios where user thought processes differ from AI models [24].
Group 7: Conclusion - The research illustrates how a rigorous mathematical framework can transform LLMs from passive knowledge repositories into proactive, efficient information gatherers, paving the way for more intelligent AI interactions [26].
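The EIG-maximizing loop in Group 3 and the sample-then-filter idea in Group 4 can be sketched concretely. In the toy version below, `hypotheses` stands in for LLM-sampled guesses that survived consistency filtering against earlier answers, and `predict_answer` stands in for an LLM call that says how a hypothesis would answer a question; the discrete entropy estimator is a simplification of the paper's EIG objective, not its exact formulation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the empirical distribution over labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def choose_next_question(hypotheses, candidate_questions, predict_answer):
    """Pick the question with the highest estimated expected information gain.

    Surviving hypotheses are treated as a uniform prior; each candidate
    question is scored by how much it is expected to shrink that uncertainty.
    """
    prior = entropy(hypotheses)
    best_q, best_gain = None, -1.0
    for q in candidate_questions:
        answers = [predict_answer(h, q) for h in hypotheses]
        n = len(answers)
        # Expected posterior entropy: weight each possible answer by frequency.
        post = 0.0
        for a, c in Counter(answers).items():
            subset = [h for h, ans in zip(hypotheses, answers) if ans == a]
            post += (c / n) * entropy(subset)
        gain = prior - post  # estimated EIG of asking q
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain

# Tiny demo: a 20-Questions-style step with four surviving hypotheses.
hyps = ["cat", "dog", "eagle", "shark"]
qs = ["Is it a mammal?", "Does it fly?"]
facts = {("cat", qs[0]): "yes", ("dog", qs[0]): "yes", ("eagle", qs[0]): "no",
         ("shark", qs[0]): "no", ("cat", qs[1]): "no", ("dog", qs[1]): "no",
         ("eagle", qs[1]): "yes", ("shark", qs[1]): "no"}
print(choose_next_question(hyps, qs, lambda h, q: facts[(h, q)]))
# -> the mammal question (an even 2/2 split, 1.0 bit) beats the fly question.
```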
Back-to-School & Teacher's Day Double Deals: Yingbo Cloud Compute as Low as 0.88 Yuan per Card-Hour, Grab It While It Lasts
机器之心· 2025-09-02 09:33
Core Viewpoint - The article highlights the launch of the "Autumn Computing Power Gratitude Return" campaign by Yingbo Cloud Platform, aimed at supporting educators and students during the new academic season and Teacher's Day with various promotional offers and discounts on computing power services [1]. Group 1: Promotional Activities - Activity 1: "Back to School Surprise Gifts" offers low prices for computing power, with rates as low as 0.88 yuan per card hour for the 4090 model during the promotional period from September 1 to September 30 [6]. - Activity 2: "Teacher's Day Exclusive Benefits" includes a free 50 yuan computing power voucher for new users upon registration and verification, along with various rebate offers for first-time and subsequent top-ups [7][8]. - The promotional highlights include significant discounts on card hour prices, such as the A800 model reduced from 6.39 yuan to 4.92 yuan, and the H800 model from 13.99 yuan to 10.76 yuan [9]. Group 2: Platform Features - Yingbo Cloud Platform utilizes a cloud-native architecture that supports container instances with rapid start-stop capabilities and fine-grained billing, allowing users to pay only for what they use, thus reducing computing costs for schools and students [11]. - The platform supports GPU+CPU mixed clusters, InfiniBand high-speed networking, and enterprise-level parallel storage, catering to needs for model training, algorithm validation, and distributed computing [11]. - Yingbo Cloud offers a dedicated booking section for educators to reserve computing power in advance, ensuring stable operation for classes and research training, along with flexible resource allocation options [11]. Group 3: Collaboration and Future Plans - Yingbo Cloud is actively assisting multiple universities and research institutions in AI research projects and is expanding its AI course teaching partnerships for the fall of 2025 [12]. - The platform invites more universities to join the "AI Course Partner Program," encouraging collaboration in AI education and research [12].
Did Scaling Laws Originate in 1993? OpenAI's President: The Fundamentals of Deep Learning Have Been Revealed
机器之心· 2025-09-02 06:32
Core Viewpoint - The article discusses the historical development and significance of Scaling Laws in artificial intelligence, emphasizing their foundational role in understanding model performance in relation to computational resources [1][41]. Group 1: Origin and Development of Scaling Laws - There are various claims regarding the origin of Scaling Laws, with some attributing it to OpenAI in 2020, while others credit Baidu in 2017, and recent claims suggest that Bell Labs was the true pioneer as early as 1993 [1][3][32]. - The paper from Bell Labs, which is highlighted in the article, trained classifiers on datasets of varying sizes and model scales, establishing a power law relationship that has been recognized for over three decades [3][10]. Group 2: Practical Implications of Scaling Laws - The paper proposes a practical method for predicting classifier suitability, which helps allocate resources efficiently to the most promising candidates, thereby avoiding the high costs associated with training underperforming classifiers [10][14]. - The findings indicate that as the scale of the model increases, the intelligence of AI systems also improves, demonstrating the long-term validity of Scaling Laws from early machine learning models to modern large-scale models like GPT-4 [14][41]. Group 3: Contributions of Key Researchers - The article highlights the contributions of the five authors of the influential paper, including Corinna Cortes, who has over 100,000 citations and is known for her work on support vector machines and the MNIST dataset [17][19][20]. - Vladimir Vapnik, another key figure, is recognized for his foundational work in statistical learning theory, which has significantly influenced the field of machine learning [25][26]. - John S. Denker is noted for his diverse research interests and contributions across various domains, including neural networks and quantum mechanics [27][30]. Group 4: Broader Context and Historical Significance - The article suggests that the exploration of learning curves and Scaling Laws spans multiple disciplines and decades, indicating a cumulative effort from various researchers across different fields [32][41]. - Comments from researchers in the article suggest that the roots of Scaling Laws may extend even further back, with early explorations in psychology and other domains predating the work at Bell Labs [34][39].
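The power-law relationship at the heart of the Bell Labs result is easy to reproduce in miniature: measure error at a few training-set sizes, fit a line in log-log space, and extrapolate. The numbers below are synthetic, chosen only to illustrate the fitting procedure the article describes, not taken from the 1993 paper.

```python
import numpy as np

# Synthetic learning-curve measurements: test error at several training sizes.
N = np.array([1_000, 2_000, 5_000, 10_000, 20_000, 50_000])
err = np.array([0.30, 0.24, 0.17, 0.135, 0.105, 0.075])

# A power law err = a * N**(-b) is linear in log space:
# log(err) = log(a) - b * log(N), so an ordinary line fit recovers (a, b).
slope, intercept = np.polyfit(np.log(N), np.log(err), 1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: err ~ {a:.2f} * N^(-{b:.3f})")

# Extrapolate to predict error at a larger dataset before paying to train on
# it, which is the resource-allocation use the paper proposed.
print(f"predicted error at N=200k: {a * 200_000 ** (-b):.3f}")
```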
Farewell to Wasted Computation: New Test-Time Scaling Framework Rescues 19% of Buried Answers, Sending Reasoning Accuracy Soaring
机器之心· 2025-09-02 06:32
Core Insights - The article discusses the development of the Stepwise Reasoning Checkpoint Analysis (SRCA) framework, which enhances the reasoning capabilities of large language models (LLMs) through improved test-time scaling methods [2][3][25]. Group 1: SRCA Framework - The SRCA framework addresses two main issues in existing test-time scaling methods: path homogeneity and underutilization of intermediate results [2][6]. - SRCA integrates two core strategies: Answer-Clustered Search (ACS) to maintain path diversity and Checkpoint Candidate Augmentation (CCA) to utilize all intermediate answers for final decision-making [2][10][19]. Group 2: Methodology - Checkpoint Injection is a foundational technique in SRCA, which forces the model to pause after each reasoning step to output intermediate answers [10][12]. - ACS prevents path homogeneity by grouping similar checkpoint answers and ensuring that diverse reasoning paths are explored [14][17]. - CCA enhances the model's accuracy by salvaging intermediate answers that may have been discarded during the reasoning process, thus improving resource utilization [19][20]. Group 3: Experimental Results - The SRCA framework enabled a 1B parameter model to achieve a 65.2% accuracy on the MATH500 dataset, surpassing a 70B model's accuracy of 65.0% [25]. - SRCA requires only 16 samples to achieve the accuracy of other TTS methods that need 128 samples, resulting in an 8-fold increase in reasoning efficiency [25]. - CCA successfully rescued 19.07% of correct answers from intermediate steps that were previously discarded due to subsequent path deviations [25].
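To make ACS and CCA concrete, here is a minimal sketch of both strategies: grouping partial reasoning paths by their checkpoint answers so the beam keeps diverse candidates, then letting every recorded intermediate answer vote on the final result. The scoring function, cluster-filling heuristic, and unweighted vote are illustrative stand-ins; the paper's actual selection and aggregation rules may differ.

```python
from collections import defaultdict, Counter

def answer_clustered_select(paths, scores, beam_size):
    """ACS sketch: take the best path from each checkpoint-answer cluster first,
    so the beam is not filled with near-duplicate reasoning paths.

    paths: list of (partial_path, checkpoint_answer) pairs.
    scores: one reward-model score per path (hypothetical scorer).
    """
    clusters = defaultdict(list)
    for (path, ans), s in zip(paths, scores):
        clusters[ans].append((s, path, ans))
    # Each cluster's top path, best clusters first ...
    ranked = sorted((max(c) for c in clusters.values()), reverse=True)
    kept = ranked[:beam_size]
    # ... then fill any remaining beam slots with the next-best leftovers.
    if len(kept) < beam_size:
        leftovers = sorted((x for c in clusters.values() for x in c
                            if x not in kept), reverse=True)
        kept += leftovers[: beam_size - len(kept)]
    return [(p, a) for _, p, a in kept]

def checkpoint_candidate_vote(checkpoint_answers):
    """CCA sketch: the final answer is voted on by ALL intermediate checkpoint
    answers, so a correct answer reached mid-path but later abandoned still
    counts instead of being discarded."""
    return Counter(checkpoint_answers).most_common(1)[0][0]

# Demo: two paths agree on "42", one says "7"; ACS keeps one of each cluster.
paths = [("step1a", "42"), ("step1b", "42"), ("step1c", "7")]
print(answer_clustered_select(paths, [0.9, 0.8, 0.6], beam_size=2))
print(checkpoint_candidate_vote(["42", "42", "7"]))  # -> "42"
```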
DeepMind's Viral Paper: Vector Embedding Models Have a Mathematical Ceiling. Hard Evidence That Scaling Laws Are Slowing?
机器之心· 2025-09-02 03:44
Core Viewpoint - The recent paper on the limitations of vector embeddings has gained significant attention, highlighting the theoretical constraints of embedding models in information retrieval tasks [1][2].
Group 1: Understanding Vector Embeddings - Vector embeddings transform complex entities like text, images, or sounds into multi-dimensional coordinates, allowing for efficient data comparison and retrieval [2][4].
- Historically, embeddings have been primarily used for retrieval tasks, but their application has expanded to reasoning, instruction following, and programming due to advancements in large model technologies [4][5].
Group 2: Theoretical Limitations - Previous research has indicated that vector embeddings inherently lose information when compressing complex concepts into fixed-length vectors, leading to theoretical limitations [4][6].
- DeepMind's recent study establishes a hard mathematical ceiling on the capabilities of vector embeddings: beyond a critical document count, certain combinations of relevant documents can no longer all be retrieved simultaneously [6][7].
Group 3: Practical Implications - The limitations of embedding models are particularly evident in retrieval-augmented generation (RAG) systems, where the inability to recall all necessary information can lead to incomplete or incorrect outputs from large models [9][10].
- The researchers established a dataset named LIMIT to empirically demonstrate these theoretical constraints, showing that even state-of-the-art models struggle with simple tasks when the number of documents exceeds a certain threshold [10][12].
Group 4: Experimental Findings - The study revealed that for any given embedding dimension, there exists a critical point where the number of documents surpasses the model's capacity to accurately capture all combinations, leading to performance degradation [10][26].
- In experiments, even advanced embedding models failed to achieve satisfactory recall rates, with some models struggling to reach 20% recall at 100 documents in the full LIMIT dataset [34][39].
Group 5: Dataset and Methodology - The LIMIT dataset was constructed using 50,000 documents and 1,000 queries, focusing on the difficulty of representing all top-k combinations [30][34].
- The researchers tested various state-of-the-art embedding models, revealing significant performance drops under different query relevance patterns, particularly in dense settings [39][40].
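The critical-point claim can be probed directly with a "free embedding" experiment in the spirit of the paper: skip text entirely, let gradient descent place query and document vectors anywhere in d dimensions, and check whether every 2-document combination can still be ranked on top. The sketch below assumes PyTorch; the hinge loss and hyperparameters are illustrative choices, not the paper's setup.

```python
import itertools
import torch

def can_realize_all_pairs(n_docs, dim, steps=2000, margin=0.1):
    """Optimize one free query vector per 2-document subset and test whether
    every pair can be scored above all other documents in d dimensions."""
    pairs = list(itertools.combinations(range(n_docs), 2))
    docs = torch.randn(n_docs, dim, requires_grad=True)
    queries = torch.randn(len(pairs), dim, requires_grad=True)
    rel = torch.zeros(len(pairs), n_docs, dtype=torch.bool)
    for i, (a, b) in enumerate(pairs):
        rel[i, a] = rel[i, b] = True
    opt = torch.optim.Adam([docs, queries], lr=0.05)
    for _ in range(steps):
        scores = queries @ docs.T                       # (n_queries, n_docs)
        pos = scores.masked_fill(~rel, float("inf")).min(dim=1).values
        neg = scores.masked_fill(rel, float("-inf")).max(dim=1).values
        loss = torch.relu(neg - pos + margin).mean()    # pair must beat the rest
        opt.zero_grad(); loss.backward(); opt.step()
    scores = queries @ docs.T
    pos = scores.masked_fill(~rel, float("inf")).min(dim=1).values
    neg = scores.masked_fill(rel, float("-inf")).max(dim=1).values
    return bool((pos > neg).all())

# As n_docs grows past a critical point for a fixed dim, this starts returning
# False even though the optimizer is free to place vectors anywhere.
print(can_realize_all_pairs(n_docs=8, dim=4))
```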
Performance Approaching the Best Closed-Source Models: Tongyi Lab Open-Sources Mobile-Agent-v3, Setting New SOTA on 10 GUI Benchmarks
机器之心· 2025-09-02 03:44
Core Viewpoint - The article highlights the launch of the GUI-Owl and Mobile-Agent-v3, which are advanced open-source models for GUI automation, showcasing superior performance compared to existing models and emphasizing their capabilities in various environments [1][29]. Group 1: Key Achievements - GUI-Owl has achieved state-of-the-art (SOTA) performance on both Android and desktop platforms, with the 32B model surpassing closed-source top models in multiple evaluations [21][29]. - The models are designed to operate in a cloud environment, allowing for dynamic task execution and data collection across multiple operating systems, including Android, Ubuntu, macOS, and Windows [11][29]. Group 2: Technical Innovations - The system employs a self-evolving data production chain that minimizes human involvement in generating high-quality training data, allowing the models to iteratively optimize themselves [11][14]. - GUI-Owl's capabilities include advanced UI element grounding, long task planning, and robust reasoning, enabling it to understand and execute complex tasks effectively [16][20]. Group 3: Reinforcement Learning Framework - A scalable reinforcement learning (RL) system has been developed to enhance the model's stability and adaptability in real-world environments, allowing it to learn continuously from its interactions [22][26]. - The introduction of the Trajectory-aware Relative Policy Optimization (TRPO) algorithm addresses the challenges of sparse and delayed reward signals in GUI automation tasks, improving learning efficiency [26]. Group 4: Conclusion - The release of GUI-Owl and Mobile-Agent-v3 represents a significant advancement in open-source GUI automation, providing a powerful tool for various applications while reducing deployment and resource costs [29].
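The article does not give TRPO's equations, so the following is only one plausible reading of "trajectory-aware relative policy optimization": score each rollout with a single trajectory-level reward and normalize advantages within the sampled group, so every step in a trajectory shares that trajectory's advantage despite sparse, delayed reward signals. Treat this as an interpretation, not the paper's objective.

```python
import torch

def trajectory_relative_advantages(trajectory_rewards, eps=1e-6):
    """Group-relative advantage at trajectory granularity (a GRPO-style
    normalization applied per rollout rather than per step). All steps in a
    rollout inherit its advantage, sidestepping per-step credit assignment."""
    r = torch.tensor(trajectory_rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)  # normalize within the group

# Hypothetical rewards for 4 rollouts of the same GUI task (1 = success).
print(trajectory_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```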
Trending Hot Search: Meituan's Large Model Goes Viral for Its Speed
机器之心· 2025-09-02 03:44
Core Viewpoint - The article discusses the emergence of Meituan's LongCat-Flash model, emphasizing its speed and efficiency in AI applications, which aligns with the industry's shift towards practical and cost-effective solutions rather than merely focusing on model strength [1][64]. Group 1: Model Performance and Features - LongCat-Flash achieves a remarkable inference speed of over 100 tokens per second on H800 GPUs, with practical tests confirming speeds of 95 tokens per second [6][42]. - The model's cost efficiency is notable, priced at only $0.7 per million output tokens, making it competitive compared to similar models [15][53]. - LongCat-Flash utilizes a mixed expert model architecture with a total parameter count of 560 billion, dynamically activating between 18.6 billion to 31.3 billion parameters based on context [12][13]. Group 2: Technical Innovations - The model incorporates a novel MoE (Mixture of Experts) architecture, featuring zero-computation experts that allocate computational resources based on token importance, significantly reducing unnecessary calculations [19][20]. - LongCat-Flash employs a shortcut-connected MoE (ScMoE) design, allowing for parallel execution of communication and computation, thus enhancing efficiency [26][30]. - The training process of LongCat-Flash was highly efficient, utilizing over 20 trillion tokens in less than 30 days with a 98.48% uptime, indicating minimal manual intervention [12][39]. Group 3: Practical Applications and Market Position - The shift in focus from model benchmarks to practical usability reflects a broader trend in the AI industry, where speed and cost-effectiveness are becoming critical differentiators [64]. - LongCat-Flash is positioned as a tool for developers and enterprises looking to leverage advanced AI capabilities without incurring high costs, aligning with Meituan's historical focus on solving real business challenges [64][65]. - The model's design and performance enhancements cater to the growing demand for efficient AI solutions in various applications, including programming and intelligent agent tools [13][41].
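A zero-computation expert can be pictured as an identity function sitting alongside ordinary FFN experts: tokens the router sends there cost no FLOPs, so per-token compute varies with how much computation each token deserves. Below is a deliberately simplified top-1-routing sketch in PyTorch; LongCat-Flash's real architecture (including its shortcut-connected ScMoE design and dynamic activation range) is considerably more involved.

```python
import torch
import torch.nn as nn

class MoEWithZeroComputeExperts(nn.Module):
    """Simplified MoE layer with 'zero-computation' experts: tokens routed to
    them are passed through unchanged, so easy tokens skip the FFN entirely."""
    def __init__(self, d_model=64, n_ffn_experts=4, n_zero_experts=2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        # Router scores both real FFN experts and identity (zero-compute) slots.
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x):                       # x: (n_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)
        choice = gate.argmax(dim=-1)            # top-1 expert per token
        out = x.clone()                         # zero-compute default: identity
        for e in range(self.n_ffn):             # run FFNs only where routed
            mask = choice == e
            if mask.any():
                out[mask] = self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(MoEWithZeroComputeExperts()(tokens).shape)  # torch.Size([10, 64])
```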
AI Reading Web Pages Is Truly Different This Time: Google Gemini Unlocks a New "Web Page Deep-Dive" Skill
机器之心· 2025-09-02 03:44
Core Viewpoint - Google is returning to its core business of search by introducing the Gemini API's URL Context feature, which allows AI to "see" web content like a human [1]. Group 1: URL Context Functionality - The URL Context feature enables the Gemini model to access and process content from URLs, including web pages, PDFs, and images, with a content limit of up to 34MB [1][5]. - Unlike traditional methods where AI reads only summaries or parts of a webpage, URL Context allows for deep and complete document parsing, understanding the entire structure and content [5][6]. - The feature supports various file formats, including PDF, PNG, JPEG, HTML, JSON, and CSV, enhancing its versatility [7]. Group 2: Comparison with RAG - URL Context Grounding is seen as a significant advancement over the traditional Retrieval-Augmented Generation (RAG) approach, which involves multiple complex steps such as content extraction, chunking, vectorization, and storage [11][12]. - The new method simplifies the process, allowing developers to achieve accurate results with minimal coding, eliminating the need for extensive data processing pipelines [13][14]. - URL Context can accurately extract specific data from documents, such as financial figures from a PDF, which would be impossible with just summaries [14]. Group 3: Operational Mechanism - The URL Context operates on a two-step retrieval process to balance speed, cost, and access to the latest data, first attempting to retrieve content from an internal index cache [25]. - If the URL is not cached, it performs real-time scraping to obtain the content [25]. - The pricing model is straightforward, charging based on the number of tokens processed from the content, encouraging developers to provide precise information sources [27]. Group 4: Limitations and Industry Trends - URL Context has limitations, such as being unable to access content behind paywalls, specialized tools like YouTube videos, and having a maximum capacity of processing 20 URLs at once [29]. - The emergence of URL Context indicates a trend where foundational models are increasingly integrating external capabilities, reducing the complexity previously handled by application developers [27].
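On the developer side, invoking the feature is essentially a one-flag change. The snippet below follows the tool-config pattern of Google's google-genai Python SDK as documented around the announcement; the model name and URL are placeholders, and exact type names may have shifted since.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Ask Gemini to ground its answer in the full content of a URL. The two-step
# retrieval (index cache first, live fetch as fallback) happens server-side;
# the developer only declares the url_context tool.
response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder model name
    contents="From https://example.com/q2-report.pdf, what was total revenue?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(url_context=types.UrlContext())],
    ),
)
print(response.text)
```

This replaces the extract-chunk-embed-store pipeline the article contrasts it with: no vector database or retriever is configured, and billing is by tokens processed from the fetched content.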