The Large-Model "Ladder Tournament" Is Here: Letting Agents Evolve on Real Kaggle Tasks | Open-Sourced by Georgia Tech and Stanford
量子位· 2025-07-26 09:01
Core Viewpoint
- The article introduces MLE-Dojo, an interactive framework designed to train and evaluate large language model (LLM) agents on machine learning engineering tasks, addressing the limitations of existing benchmarks that do not simulate real-world iterative workflows [1][2]

Group 1: Existing Problems and Solutions
- Current benchmarks for LLMs are mostly static and fail to capture the dynamic workflows of machine learning engineering, lacking assessments of continuous experimentation and structured feedback [6]
- Many platforms do not support advanced training paradigms such as supervised fine-tuning (SFT) or reinforcement learning (RL), limiting the development of more autonomous AI agents [7]
- Existing benchmarks often focus on isolated tasks, missing the complexity and interconnections of end-to-end machine learning processes; MLE-Dojo addresses this by providing a comprehensive training and evaluation environment [8]

Group 2: MLE-Dojo Features
- MLE-Dojo consists of over 200 real Kaggle competitions, covering domains such as tabular data, computer vision (CV), and natural language processing (NLP), providing unprecedented breadth and depth for evaluating AI agents [12]
- The framework offers a Gym-style interactive environment in which agents can perform actions such as requesting task information, validating code, and executing code in a secure sandbox [13]
- MLE-Dojo provides advanced features such as detailed error reports and a HumanRank score, which measures the agent's relative position on human leaderboards, offering a standardized performance metric across tasks [14]

Group 3: Evaluation of LLMs
- The research team evaluated eight leading LLMs using a multi-dimensional assessment system rather than relying on a single metric [16]
- The HumanRank score reflects the model's performance relative to human competitors, while the Elo rating system provides a dynamic ranking based on head-to-head match results [17][18]
- The AUP (Area Under the Performance Profile) metric assesses the robustness and consistency of models across tasks, with higher scores indicating better performance stability [18]

Group 4: Performance Analysis
- Gemini-2.5-Pro emerged as the top performer in the Elo rating, demonstrating strong competitive capabilities and surpassing 61.95% of human players in the HumanRank score [20]
- Different models exhibited distinct problem-solving strategies: some were more aggressive in executing code while others were more conservative, affecting their efficiency and overall performance [23]
- The analysis revealed that stronger models tend to generate longer and more complex solutions, indicating deeper reasoning and multi-step problem-solving capabilities [24]

Group 5: Cost-Performance Trade-off
- High-performing models often incur significant computational costs, with top reasoning models consuming more tokens and resources [25]
- Some models, such as DeepSeek-r1, show potential for competitive performance with higher cost-effectiveness, indicating a direction for future model optimization [25]
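The two relative metrics above are standard constructions. The sketch below illustrates a HumanRank-style score (the fraction of human leaderboard entries the agent beats) and a pairwise Elo update; the function names and the K-factor of 32 are illustrative assumptions, not MLE-Dojo's actual implementation.

```python
def human_rank(agent_score: float, human_scores: list[float],
               higher_is_better: bool = True) -> float:
    """Fraction of human leaderboard entries the agent outperforms (0 to 1)."""
    if higher_is_better:
        beaten = sum(1 for s in human_scores if agent_score > s)
    else:
        beaten = sum(1 for s in human_scores if agent_score < s)
    return beaten / len(human_scores)

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one head-to-head match.
    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For example, an agent scoring 0.9 against a human leaderboard of [0.5, 0.8, 0.95] beats two of three entries, a HumanRank of about 0.67.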
Hinton's Shanghai Speech: Large Models Resemble Human Intelligence; Beware of Raising a Tiger Cub
量子位· 2025-07-26 09:01
Core Viewpoint
- Geoffrey Hinton emphasizes the importance of establishing a positive mechanism for AI development to ensure it does not threaten humanity, highlighting the complex relationship between AI and human intelligence [3][42][55]

Group 1: AI Development and Understanding
- Hinton discusses the evolution of AI over the past 60 years, identifying two main paradigms, logical reasoning and biological understanding, which have shaped current AI capabilities [8][10]
- He compares human understanding of language to that of large language models, suggesting that both operate on similar principles of feature interaction and semantic understanding [19][27]
- Knowledge transfer in AI is far more efficient than in humans, with AI systems able to share vast amounts of information rapidly across different instances [29][36]

Group 2: AI Safety and Collaboration
- Hinton warns that as AI becomes more intelligent, it may seek control and autonomy, necessitating international cooperation to ensure AI remains beneficial to humanity [42][55]
- He likens the current relationship with AI to raising a tiger cub, stressing the need to train AI so that it does not become a threat as it matures [49][51]
- He calls for a global AI safety institution aimed at researching and training AI to assist rather than dominate humanity [55][56]
"AI Godfather" Hinton Appears at WAIC: Says AI Will Seek More Control
Di Yi Cai Jing· 2025-07-26 06:27
Group 1
- The core viewpoint of the article revolves around the potential of AI to surpass human intelligence and the associated risks, as articulated by Geoffrey Hinton at the World Artificial Intelligence Conference (WAIC) [1][4][6]
- Hinton emphasizes the need for a global effort to address the dangers posed by AI, suggesting that nations collaborate on AI safety and training [5][6]
- The article highlights Hinton's historical contributions to AI, particularly his role in developing the AlexNet algorithm, which revolutionized deep learning [5][6]

Group 2
- Hinton traces the evolution of AI over the past 60 years, identifying two main paradigms: symbolic logic and biologically inspired approaches [3][4]
- He expresses concern about the rapid advancement of AI technologies, estimating a 10% to 20% probability that AI could threaten human civilization [6]
- Hinton advocates allocating significant computational resources to ensuring AI systems align with human intentions, criticizing tech companies for prioritizing profit over safety [6]
Xiaomi Files Patent for a Text Processing Method That Ensures Strong Results on Specialized Tasks Without Degrading Other Tasks
Jin Rong Jie· 2025-07-25 08:26
Group 1
- Beijing Xiaomi Mobile Software Co., Ltd. and Beijing Xiaomi Pinecone Electronics Co., Ltd. have applied for a patent titled "Text Processing Method, Text Processing Device, and Storage Medium," publication number CN120373448A, filed in January 2024 [1]
- The patent describes a text processing method in which a large language model obtains text description information comprising the text to be processed and the type of processing task [1]
- The method uses a pre-trained discriminator to determine the task type, ensuring strong results on the specialized task while maintaining performance on other tasks [1]

Group 2
- Beijing Xiaomi Mobile Software Co., Ltd. was established in 2012 in Beijing with a registered capital of 148.8 million RMB; it has invested in 4 companies and participated in 137 bidding projects [2]
- The company holds 5,000 patent records and 123 administrative licenses [2]
- Beijing Xiaomi Pinecone Electronics Co., Ltd. was founded in 2014, also in Beijing, with a registered capital of 25 million RMB; it has invested in 1 company and holds 15 trademark records and 1,029 patent records [2]
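The patent summary describes a discriminator that first detects the task type so a specialized path handles it without touching other tasks. The sketch below illustrates that routing pattern only; the keyword rule stands in for the pre-trained discriminator, and all names are hypothetical, not taken from the patent.

```python
from typing import Callable

def classify_task(text: str) -> str:
    """Stand-in for the pre-trained discriminator: maps input text to a
    task type. A trivial keyword rule substitutes for the learned model."""
    if "translate" in text.lower():
        return "translation"
    return "general"

HANDLERS: dict[str, Callable[[str], str]] = {
    # Each handler would wrap the LLM call with task-specific instructions.
    "translation": lambda text: f"[translation prompt] {text}",
    "general": lambda text: f"[general prompt] {text}",
}

def process(text: str) -> str:
    """Route to a specialized path for the detected task and a default path
    otherwise, leaving the handling of other tasks untouched."""
    return HANDLERS[classify_task(text)](text)
```

The design point is isolation: a dedicated branch for the specialized task cannot regress the default branch, which matches the patent's stated goal.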
Express | Goldman Sachs, Sequoia and Others Continue to Follow On: AI Compliance Unicorn Vanta Raises $150 Million, Valuation Soars to $4.15 Billion
Z Potentials· 2025-07-25 03:24
Core Insights
- Vanta has raised $150 million in a new funding round at a valuation of $4.15 billion, reflecting strong investor interest in AI-driven companies [1]
- The round was led by Wellington Management, with participation from existing investors including Goldman Sachs, Sequoia Capital, JPMorgan, and Craft Ventures [1]
- Vanta plans to use the new funding to expand its AI product line, capitalizing on recent breakthroughs in AI technology [2]

Company Overview
- Founded in 2018, Vanta develops software that helps businesses manage compliance and store customer data [1]
- The company has accumulated 12,000 clients across the technology, financial services, and healthcare sectors [1]
- Vanta is seeking to expand its business to national and local government customers [1]

Product Development
- Vanta's CEO, Christina Cacioppo, notes that advances in large language models are unlocking new product experiences [2]
- The company recently launched an AI Agent product designed to perform tasks more independently than most software [2]
- Vanta aims to help clients adopt new AI standards and frameworks while applying AI to its own products and customer workflows [2]

Expansion Plans
- Vanta is advancing its international expansion, having opened an office in London and a data center in Australia to grow its presence in the Asia-Pacific region [2]
ICML 2025 | Can Large Models Ask the Right Questions When Information Is Incomplete?
机器之心· 2025-07-24 04:08
Core Insights
- The article emphasizes the importance of Active Reasoning (AR) in enhancing the capabilities of Large Language Models (LLMs) beyond Passive Reasoning (PR) [1][2][3][4][7][10][55]
- It introduces AR-Bench, a benchmark designed to evaluate the active reasoning capabilities of LLMs in real-world scenarios [7][19][55]

Group 1: Active Reasoning
- Active Reasoning (AR) is defined as a model's ability to actively seek out information through questioning and interaction, in contrast to Passive Reasoning (PR), which relies on complete information [3][4][15][18]
- The need for AR is highlighted in practical applications such as medical diagnosis and detective work, where information is often incomplete [3][14][15]
- The article identifies the core challenge of AR as asking the right questions to gather critical information [4][18]

Group 2: AR-Bench
- AR-Bench is introduced as a systematic tool for assessing LLMs' active reasoning capabilities, simulating real-world information-gathering scenarios [19][20][55]
- It consists of three task types: Situation Puzzles (SP), Guessing Numbers (GN), and Dynamic Conversations (DC), each testing different reasoning abilities [21][22][25]
- The evaluation framework includes both result assessment and process assessment, focusing on the quality of the questions posed and the effectiveness of information retrieval [25]

Group 3: Findings on LLM Performance
- Current LLMs, including advanced models like GPT-4o, show significant deficiencies in active reasoning, achieving only 35% accuracy on GN tasks [28][34]
- Even state-of-the-art active reasoning methods do not improve model performance on AR-Bench [33]
- Human performance on active reasoning tasks significantly surpasses that of existing LLMs, indicating a gap in model capabilities [34][55]

Group 4: Recommendations for Future Work
- The article suggests several directions for enhancing active reasoning, including the collection of high-quality fine-tuning datasets and the development of more reliable validation methods for search approaches [56][60]
- It emphasizes the need for further research to enable LLMs to ask effective questions and solve real-world problems [55][60]
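The Guessing Numbers (GN) task makes the active-reasoning loop concrete: the agent repeatedly proposes a query, receives feedback, and refines its next question. The sketch below assumes a bulls-and-cows style feedback signal (digits exactly placed vs. correct but misplaced); the actual AR-Bench protocol may differ in its details.

```python
def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, partial): digits correct and in position, and
    correct digits that appear in the wrong position."""
    exact = sum(a == b for a, b in zip(secret, guess))
    common = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return exact, common - exact

def play(secret: str, ask) -> int:
    """Active-reasoning interaction loop: the agent callback `ask` proposes
    a guess given the history of (guess, feedback) pairs. Returns the
    number of turns used, or -1 if the agent never solves the task."""
    history: list[tuple[str, tuple[int, int]]] = []
    for turn in range(1, 100):
        guess = ask(history)
        fb = feedback(secret, guess)
        history.append((guess, fb))
        if fb[0] == len(secret):
            return turn
    return -1
```

The point of the loop is that performance depends on the informativeness of each query, which is exactly what AR-Bench's process assessment measures.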
A "Dangerous Game" Against the OpenAIs
虎嗅APP· 2025-07-23 10:25
Core Viewpoint
- The article discusses the emergence of Generative Engine Optimization (GEO) as a new AI-driven business model, highlighting the challenges and opportunities it presents for brands and startups in the evolving digital landscape [3][4][25]

Group 1: Market Dynamics
- Over 60% of consumers now bypass traditional search engines like Google and Baidu, asking AI assistants directly for product information [3]
- The global AI search engine market is projected to reach $43.63 billion by 2025, with a compound annual growth rate (CAGR) of 14% from 2025 to 2032 [12]
- A report from Adobe indicates that traffic to U.S. business websites increased by 1200% from July 2024 to February 2025, driven largely by AI assistant referrals [11]

Group 2: Company Insights
- Profound, a startup founded in 2024, has quickly gained traction, securing $20 million in funding and being adopted by thousands of marketers from Fortune 100 companies [3][10]
- Profound offers services including Answer Engine Insights and Agent Analytics to help brands understand and optimize their presence in AI search engines [17][18]
- The company processes over 100 million AI search queries monthly and operates in 18 countries, with early adopters reporting a 25%-40% increase in AI response volume within 60 days [23]

Group 3: Competitive Landscape
- Other players in the GEO space include Daydream, which focuses on consumer shopping searches, and Goodie AI, which specializes in AI search visibility [13][14]
- Companies like Ahrefs, which transitioned from SEO to GEO, pose significant competition thanks to their established customer bases and expertise [14]
- The GEO model faces challenges because it relies heavily on understanding and adapting to the algorithms of large language models, which change frequently [25][26]

Group 4: Challenges and Future Outlook
- The GEO business model is seen as a "cat-and-mouse game" in which startups must continuously adapt to changes in AI algorithms that can render previous strategies ineffective [5][26]
- The effectiveness of GEO tools is often difficult to attribute, complicating budget decisions for brands [27]
- Despite the challenges, GEO companies could evolve by expanding their service offerings and leveraging brand data to create long-term value [28]
How Far from "Thinking Well" to "Doing Well"? Decoding the Road to Embodied Brain-Cerebellum Collaboration
具身智能之心· 2025-07-23 08:45
Core Viewpoint
- The article discusses the integration of "brain," "cerebellum," and "body" in embodied intelligent systems, emphasizing the need for improved collaboration and data acquisition to advance artificial general intelligence (AGI) [2][3][4]

Group 1: Components of Embodied Intelligence
- The "brain" is responsible for perception, reasoning, and planning, utilizing large language models and visual language models [2]
- The "cerebellum" focuses on movement, employing motion control algorithms and feedback systems to enhance the naturalness and precision of robotic actions [2]
- The "body" is the physical entity that executes the plans generated by the "brain" and the movements coordinated by the "cerebellum," embodying the unity of knowing and doing [2]

Group 2: Challenges and Future Directions
- The "brain" needs stronger reasoning capabilities, enabling it to infer task paths without explicit instructions or maps [3]
- The "cerebellum" should become more intuitive, allowing robots to react flexibly in complex environments and handle delicate objects with care [3]
- Collaboration between the "brain" and "cerebellum" requires improvement: current communication is slow and responses are delayed, and the goal is a seamless interaction system [3]

Group 3: Data Acquisition
- The article highlights the challenges of data collection, noting that it is often difficult, expensive, and noisy, which hinders the training of intelligent systems [3]
- There is a call to develop a training repository that is realistic, diverse, and transferable to improve data quality and accessibility [3]

Group 4: Expert Discussion
- A roundtable discussion is planned with experts from the Beijing Academy of Artificial Intelligence and Zhiyuan Robotics to explore recent technological advancements and future pathways for embodied intelligence [4]
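The division of labor described above is essentially a hierarchical control loop: a slow deliberative planner (the "brain") emits waypoints, a fast feedback controller (the "cerebellum") tracks them, and the "body" executes low-level commands. The 1-D sketch below is purely illustrative; the interfaces, loop rates, and proportional gain are assumptions, not taken from any system in the article.

```python
from dataclasses import dataclass, field

@dataclass
class Robot:
    """The 'body': executes low-level commands and exposes its state."""
    position: float = 0.0
    log: list = field(default_factory=list)

    def apply(self, velocity: float) -> None:
        self.position += velocity
        self.log.append(velocity)

def brain_plan(goal: float, position: float) -> list[float]:
    """The 'brain': slow deliberation, reduced here to waypoint planning."""
    step = 1.0 if goal > position else -1.0
    n = int(abs(goal - position))
    return [position + step * (i + 1) for i in range(n)]

def cerebellum_track(robot: Robot, waypoint: float, gain: float = 0.5) -> None:
    """The 'cerebellum': high-rate proportional feedback toward a waypoint."""
    for _ in range(20):  # fast inner loop
        error = waypoint - robot.position
        if abs(error) < 1e-3:
            break
        robot.apply(gain * error)

def run(robot: Robot, goal: float) -> float:
    for wp in brain_plan(goal, robot.position):  # slow outer loop
        cerebellum_track(robot, wp)
    return robot.position
```

The two loops run at different rates, which is exactly the brain-cerebellum communication bottleneck the article says needs to become seamless.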
Hierarchical VLA Models or Fully End-to-End VLA: Which Direction Is Better for Publishing Papers?
自动驾驶之心· 2025-07-23 07:32
Core Viewpoint
- The article emphasizes the shift in academic research from traditional perception and planning tasks in autonomous driving to the exploration of Vision-Language-Action (VLA) models, suggesting that many research opportunities remain in this area [1][2]

Group 1: VLA Research Topics
- The VLA model represents a new paradigm in autonomous driving, integrating vision, language, and action to enhance decision-making capabilities [2][3]
- The evolution of autonomous driving technology can be categorized into three phases: traditional modular architectures, purely visual end-to-end systems, and the emerging VLA models [2][3]
- VLA models aim to improve interpretability and reliability by allowing the model to explain its decisions in natural language, increasing transparency and trust [3]

Group 2: Course Objectives and Structure
- The course aims to help participants systematically master key theoretical knowledge in VLA and develop practical skills in model design and implementation [6][7]
- Participants will engage in a 12-week online group research project followed by 2 weeks of paper guidance, culminating in a 10-week maintenance period for their research papers [6]
- The course covers classic and cutting-edge papers, coding implementations, and writing methodologies, ultimately helping participants produce a research paper draft [6][12]

Group 3: Enrollment and Requirements
- The course is limited to 6-8 participants per session, targeting individuals with a foundational understanding of deep learning and basic programming skills [5][9]
- Participants are expected to have access to high-performance computing resources, ideally multiple high-end GPUs, to support their research [13][14]
- A preliminary assessment will be conducted to tailor the course content to each participant's needs, ensuring a focused learning experience [15]

Group 4: Course Highlights and Outcomes
- The course features a "2+1" teaching model, providing comprehensive support from experienced instructors and research mentors [15]
- Participants will gain a thorough understanding of the research process, writing techniques, and submission strategies, enhancing their academic and professional profiles [15][20]
- Expected outcomes include a research paper draft, project completion certificates, and potential recommendation letters based on performance [15]
ICML 2025 | Tsinghua Medical-Engineering Platform Proposes MultiCogEval, a "Full-Cycle" Medical Capability Evaluation Framework for Large Models
机器之心· 2025-07-23 01:04
Core Viewpoint
- The rapid development of Large Language Models (LLMs) is significantly reshaping the healthcare industry, with these models becoming a new battleground for advanced technology [2][3]

Group 1: Medical Language Models and Their Capabilities
- LLMs possess strong text understanding and generation capabilities, enabling them to read medical literature, interpret medical records, and even generate preliminary diagnostic suggestions from patient statements, assisting doctors in improving diagnostic accuracy and efficiency [2][3]
- Despite achieving over 90% accuracy on medical question-answering benchmarks such as MedQA, these models remain far less effective in real clinical settings, a "high score but low capability" problem [4][5]

Group 2: MultiCogEval Framework
- The MultiCogEval framework was introduced to evaluate LLMs across different cognitive levels, addressing the gap between medical knowledge mastery and clinical problem-solving capability [5][6][10]
- The framework assesses LLMs' clinical abilities at three cognitive levels: basic knowledge mastery, comprehensive knowledge application, and scenario-based problem-solving [12][14]

Group 3: Evaluation Results
- Evaluation results show that while LLMs perform well on low-level tasks (basic knowledge mastery) with accuracy exceeding 60%, their performance declines significantly on mid-level tasks (a drop of approximately 20%) and deteriorates further on high-level tasks, with the best model achieving only 19.4% accuracy in full-chain diagnosis [16][17]
- The study found that fine-tuning in the medical domain effectively enhances LLMs' low- and mid-level clinical capabilities, with improvements of up to 15%, but has limited impact on high-level task performance [19][22]

Group 4: Future Implications
- The MultiCogEval framework lays a solid foundation for future research and development of medical LLMs, aiming to promote more robust, reliable, and practical applications of AI in healthcare and ultimately contribute to "trustworthy AI doctors" [21][22]
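The three-level evaluation above can be summarized with a small accuracy-profile helper. This is a generic sketch of the aggregation, not MultiCogEval's code; the level names and all numbers used below are illustrative, not the paper's data.

```python
def level_accuracy(results: dict[str, list[bool]]) -> dict[str, float]:
    """Accuracy per cognitive level from per-item correctness flags."""
    return {level: sum(flags) / len(flags) for level, flags in results.items()}

def capability_profile(acc: dict[str, float]) -> list[str]:
    """Readable report ordered from low-level to high-level tasks, plus the
    gap between knowledge mastery and scenario-based problem-solving that
    signals a 'high score but low capability' model."""
    order = ["basic", "applied", "scenario"]
    lines = [f"{level}: {acc[level]:.1%}" for level in order]
    lines.append(f"gap (basic - scenario): {acc['basic'] - acc['scenario']:.1%}")
    return lines
```

Reporting the gap alongside the per-level scores makes the decline across cognitive levels visible at a glance, rather than hiding it in a single aggregate number.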