Large Language Models (LLMs)
From Debugger to Developer: NoCode-bench, a New Benchmark for the Low-Code Era, Recommended by the SWE-Bench Authors
机器之心· 2025-08-08 07:53
Core Insights
- The article introduces NoCode-bench, a new benchmark for evaluating how well large language models (LLMs) handle natural-language-driven feature-addition tasks in software development [3][27].
- Current LLMs achieve success rates below 20% on these tasks, highlighting significant challenges in AI's ability to handle real-world software development scenarios [3][26].
Group 1: Benchmark Development
- NoCode-bench was developed to address the limitations of existing benchmarks such as SWE-bench, which focus primarily on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes understanding software documentation changes in order to implement new features, reflecting a more realistic development workflow [6][27].
- Construction of NoCode-bench followed a rigorous five-phase process, from selecting well-maintained open-source projects to filtering instances against developer-verified release notes [8][10][16].
Group 2: Challenges Identified
- Tasks in NoCode-bench present three main challenges:
  1. More complex input: documentation changes are nearly twice as long as bug reports, demanding stronger long-text comprehension [12].
  2. Harder change localization: tasks often span multiple files and code blocks, requiring strong cross-file editing capabilities [13].
  3. Larger edits: nearly 20% of tasks require modifying more than 200 lines of code, increasing the risk of errors [14].
Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, produced disappointing results, with the best-performing model succeeding on only 15.79% of tasks [18][26].
- Failure-case analysis identified three primary causes of poor performance: weak cross-file editing ability, insufficient understanding of codebase structure, and inadequate tool-invocation capabilities [20][21][22].
Group 4: Future Directions
- The research indicates that current LLMs are not yet ready for the complexity of documentation-driven feature development, pointing to the need for further advances in AI capabilities [24][27].
- The findings provide a roadmap for future AI software engineers, focusing on improved cross-file editing, codebase comprehension, and tool interaction [27].
First Comprehensive Survey of Legal LLMs Released: A Dual-Perspective Taxonomy, Technical Progress, and Ethical Governance
36Ke · 2025-07-31 09:13
Core Insights
- The article presents a comprehensive review of Large Language Model (LLM) applications in the legal field, introducing an innovative dual-perspective taxonomy that integrates legal reasoning frameworks with a professional ontology [1][3][5].
- It highlights LLM advances in legal text processing, knowledge integration, and formal reasoning, while addressing core issues such as hallucination, lack of interpretability, and cross-jurisdictional adaptability [1][5][12].
Group 1: Technological Advancements
- Traditional legal AI methods are limited by symbolic approaches and small-model techniques, facing knowledge-engineering bottlenecks and insufficient semantic interoperability [6][8].
- LLMs built on the Transformer architecture overcome the limitations of earlier systems through in-context reasoning, few-shot adaptation, and generative argumentation capabilities [6][12].
- The legal sector's demand for complex text processing, multi-step reasoning, and process automation aligns well with the emerging capabilities of LLMs [8][12].
Group 2: Ethical and Governance Challenges
- Practical deployment of the technology in law carries ethical risks, such as amplification of bias and erosion of professional authority, necessitating a systematic research framework that integrates technology, tasks, and governance [3][8][11].
- The review systematically analyzes the ethical challenges facing legal practitioners, including technical ethics and professional responsibility, extending user-centered ontology research for LLM deployment [11][12].
Group 3: Research Contributions
- The study employs an innovative dual-perspective framework that combines types of legal argumentation with legal professional roles, significantly advancing research in the field [9][12].
- It constructs a legal-reasoning ontology that aligns the Toulmin argument structure with LLM workflows, connecting contemporary LLM advances with the historical study of legal evidence (a minimal illustration of such a structure follows below) [9][10].
- A role-centered deployment framework for LLMs is proposed, spanning litigation and non-litigation workflows to meet the demand for smarter tools in legal practice [10][12].
Group 4: Future Directions
- Future research should prioritize multi-modal evidence integration, dynamic rebuttal handling, and aligning technological innovation with legal principles to build robust, ethically grounded legal AI [13].
- The article advocates a legal-profession-centered strategy that positions LLMs as supportive tools rather than decision-makers, ensuring human oversight at critical junctures [13].
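The alignment of the Toulmin argument structure with LLM workflows can be made concrete with a small data structure. The sketch below is hypothetical: the field names follow the standard Toulmin components and the example content is invented for illustration; it is not the survey's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    """Standard Toulmin components, usable as a target schema when asking an
    LLM to structure a legal argument (illustrative, not the survey's schema)."""
    claim: str                     # the conclusion being argued for
    grounds: list[str]             # facts or evidence supporting the claim
    warrant: str                   # the rule linking grounds to claim
    backing: str = ""              # authority for the warrant (statute, precedent)
    qualifier: str = ""            # strength of the claim ("presumably", "certainly")
    rebuttals: list[str] = field(default_factory=list)  # conditions that would defeat it

# Invented example content for illustration only.
example = ToulminArgument(
    claim="The contract is voidable.",
    grounds=["Party B signed under documented duress."],
    warrant="Contracts signed under duress are voidable.",
    backing="contract-law doctrine on duress (jurisdiction-dependent)",
    qualifier="presumably",
    rebuttals=["unless Party B later ratified the contract"],
)
print(example.claim, "-", example.qualifier)
```

A schema like this can serve either as a parsing target for model output or as a checklist when a human reviews the argument.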
Still Haven't Picked a Research Direction? Others Are Already Deep into VLA...
自动驾驶之心· 2025-07-21 05:18
Core Viewpoint
- The article emphasizes a shift in academic research from traditional perception and planning tasks in autonomous driving toward Vision-Language-Action (VLA) models, which open new opportunities for innovation and research in the field [1][2].
Group 1: VLA Research Topics
- VLA models aim to build end-to-end autonomous driving systems that map raw sensor inputs directly to driving control commands, moving away from traditional modular architectures [2].
- The evolution of autonomous driving technology can be divided into three phases: traditional modular architectures, purely visual end-to-end systems, and the emerging VLA models [2][3].
- VLA models improve interpretability and reliability by allowing the system to explain its decision-making process in natural language, which builds human trust [3].
Group 2: Course Objectives and Structure
- The course aims to help participants systematically master key VLA theory and develop practical skills in model design and implementation [6][7].
- It offers a structured learning experience combining online group research, paper guidance, and a maintenance period to ensure thorough understanding and application [6][8].
- Participants gain exposure to classic and cutting-edge papers, coding practice, and effective strategies for writing and submitting academic papers [6][12].
Group 3: Enrollment and Requirements
- Each session is limited to 6-8 participants and targets individuals with a foundation in deep learning and autonomous driving algorithms [5][9].
- Basic requirements include familiarity with Python and PyTorch and access to high-performance computing resources [13][14].
- The course emphasizes academic integrity and provides a structured environment for learning and research [14][19].
Group 4: Course Highlights
- The program features a "2+1" teaching model, with experienced instructors providing comprehensive support throughout the learning process [14].
- It is designed to uphold high academic standards and deliver substantial project outcomes, including a draft paper and a certificate of project completion [14][20].
- A feedback mechanism is included to tailor the learning experience to individual progress [14].
LatePost Exclusive: Agent Startup Pokee.ai Raises a $12 Million Seed Round from Point72 Ventures, Intel's Lip-Bu Tan, and Others
晚点LatePost· 2025-07-09 11:38
Core Viewpoint
- Pokee.ai, an AI agent startup, recently raised approximately $12 million in seed funding to accelerate research and sales, with notable investors including Point72 Ventures and Qualcomm Ventures [5][6].
Group 1: Company Overview
- Pokee.ai was founded in October 2022 and currently has only 7 employees. Founder Zhu Zheqing previously led the Applied Reinforcement Learning team at Meta, where he significantly improved the content recommendation system [7].
- Unlike startups that use large language models (LLMs) as the "brain" of their agents, Pokee relies on a reinforcement learning model that does not require extensive context input [7].
Group 2: Technology and Cost Efficiency
- The current version of Pokee has been trained on 15,000 tools, allowing it to adapt to new tools without additional context [8].
- Reinforcement learning models are more cost-effective than LLMs, which can incur costs of several dollars per task due to heavy computational demands; Pokee's cost per completed task is only about one-tenth that of its competitors [8].
Group 3: Market Strategy and Product Development
- Pokee aims to optimize its ability to call data interfaces (APIs) across platforms, targeting large companies and professional consumers who need cross-platform task automation [9].
- The funding will also support new features, including a memory function to better understand client needs and preferences [9].
Group 4: Seed Funding Trends
- Seed funding for AI startups is growing: the median seed round was around $1.7 million in 2020 and has risen to approximately $3 million in 2023 [10].
- The high cost of AI product development necessitates larger rounds to sustain operations, with some companies reportedly burning through $100 million to $150 million annually [13][14].
Group 5: Investment Climate
- Investors have become more cautious, requiring solid product-market fit (PMF) before committing; the median time between seed and Series A funding has lengthened to 25 months, the longest in a decade [17][18].
Gary Marcus's Startling Claim: Building AGI on Pure LLMs Is Hopeless! A Paper from MIT, UChicago, and Harvard Goes Viral
机器之心· 2025-06-29 04:23
Core Viewpoint
- The article discusses a groundbreaking paper co-authored by researchers at MIT, the University of Chicago, and Harvard that reveals significant inconsistencies in the reasoning patterns of large language models (LLMs), termed "Potemkin understanding," suggesting that the hope of building Artificial General Intelligence (AGI) on LLMs alone is fundamentally flawed [2][4].
Summary by Sections
Introduction
- Gary Marcus, a prominent AI scholar, highlights the paper's findings: even top models such as o3 frequently exhibit reasoning errors, undermining claims about their understanding and reasoning capabilities [2][4].
Key Findings
- The paper argues that success on benchmark tests does not equate to genuine understanding; it instead reflects a superficial grasp of concepts, a "Potemkin understanding" in which seemingly correct answers mask deeper misunderstanding [3][17].
- The research team proposes two methods to quantify the prevalence of the Potemkin phenomenon, showing that it appears across models, tasks, and domains and indicates a fundamental inconsistency in conceptual representation [17][28].
Experimental Results
- The study analyzed seven popular LLMs across 32 concepts and found that while models defined concepts correctly 94.2% of the time, their ability to apply those concepts in tasks dropped sharply, as reflected in high Potemkin rates [29][33].
- The Potemkin rate, defined as the proportion of incorrect answers on application tasks among instances where the model answered the foundational example correctly, was high across all models and tasks, indicating widespread failure to apply concepts (a minimal sketch of the computation follows below) [30][31].
Inconsistency Detection
- The researchers also probed internal inconsistency by prompting models to generate examples of specific concepts and then asking them to evaluate their own outputs, revealing substantial limitations in self-assessment [36][39].
- Inconsistency scores ranged from 0.02 to 0.64 across the examined models, suggesting that misunderstandings stem not only from incorrect concept definitions but also from conflicting representations of the same idea [39][40].
Conclusion
- The findings underscore how pervasive Potemkin understanding is in LLMs, challenging the assumption that high performance on traditional benchmarks equates to true understanding and highlighting the need for further research into the implications of these inconsistencies [40].
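To make the metric concrete, here is a minimal sketch of how a Potemkin rate as described above could be tallied from per-instance evaluation results. The record fields are hypothetical placeholders, not the paper's data format.

```python
def potemkin_rate(records):
    """Among instances where the model answers the foundational (keystone)
    example correctly, return the fraction of follow-up application tasks
    it then gets wrong. `records` holds hypothetical boolean fields."""
    eligible = [r for r in records if r["keystone_correct"]]
    if not eligible:
        return float("nan")  # undefined if the model never passes the keystone
    failures = sum(1 for r in eligible if not r["application_correct"])
    return failures / len(eligible)

# Toy usage: the model "knows" the definition most of the time,
# yet often fails to apply it, yielding a high Potemkin rate.
records = [
    {"keystone_correct": True,  "application_correct": False},
    {"keystone_correct": True,  "application_correct": True},
    {"keystone_correct": True,  "application_correct": False},
    {"keystone_correct": False, "application_correct": False},
]
print(f"Potemkin rate: {potemkin_rate(records):.2f}")  # 0.67 on this toy sample
```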
In the Age of Information Overload, How Do You Truly "Get" LLMs? Start with the 50 Interview Questions Shared by MIT
机器之心· 2025-06-18 06:09
Core Insights
- The article discusses the rapid evolution and widespread adoption of Large Language Models (LLMs) in less than a decade, enabling millions of people worldwide to carry out creative and analytical tasks through natural language [2][3].
Group 1: LLM Development and Mechanisms
- LLMs have evolved from basic language models into advanced intelligent agents capable of executing tasks autonomously, presenting both opportunities and challenges [2].
- Tokenization is a crucial step in LLM pipelines, breaking text into smaller units (tokens) for efficient processing, which improves computational speed and model effectiveness [7][9].
- The attention mechanism in Transformer models allows LLMs to assign varying importance to different tokens, improving contextual understanding [10][12].
- The context window defines how many tokens an LLM can process at once, which constrains its ability to produce coherent outputs over long inputs [13].
- Sequence-to-sequence models convert input sequences into output sequences and are used in tasks such as machine translation and chatbots [15].
- Embeddings represent tokens in a continuous vector space that captures semantic features and are often initialized from pre-trained models [17].
- LLMs handle out-of-vocabulary words through subword tokenization methods, ensuring effective language understanding [19].
Group 2: Training and Fine-tuning Techniques
- LoRA and QLoRA are fine-tuning methods that adapt LLMs efficiently with minimal memory requirements, making them suitable for resource-constrained environments [34].
- Techniques to prevent catastrophic forgetting during fine-tuning include rehearsal and elastic weight consolidation, helping LLMs retain prior knowledge [37][43].
- Model distillation enables smaller models to approximate the performance of larger ones, facilitating deployment on devices with limited resources [38].
- Overfitting can be mitigated through methods such as rehearsal and modular architectures, supporting robust generalization to unseen data [40][41].
Group 3: Output Generation and Evaluation
- Beam search improves text generation by tracking multiple candidate sequences, yielding more coherent output than greedy decoding [51].
- The temperature setting controls the randomness of token selection during generation, balancing predictability against creativity (a minimal sampling sketch follows below) [53].
- Prompt engineering is essential for getting the most out of LLMs, as well-specified prompts yield more relevant outputs [56].
- Retrieval-Augmented Generation (RAG) improves answer accuracy by combining retrieval of relevant documents with generation [58].
Group 4: Challenges and Ethical Considerations
- Deploying LLMs involves challenges including high computational demands, potential biases, and issues of interpretability and privacy [116][120].
- Addressing bias in LLM outputs involves improving data quality, strengthening reasoning capabilities, and refining training methodology [113].
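As an illustration of the temperature knob mentioned in the output-generation questions, here is a minimal sketch of temperature-scaled sampling over a toy next-token distribution. The logits are invented for demonstration; real models expose the same control through their decoding parameters.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Divide logits by T before the softmax: T < 1 sharpens the distribution
    (more predictable output), T > 1 flattens it (more diverse output)."""
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                       # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.2]                         # toy next-token scores
for t in (0.5, 1.0, 2.0):
    _, probs = sample_with_temperature(logits, temperature=t)
    print(f"T={t}: {np.round(probs, 3)}")
```

Printing the probabilities at several temperatures shows the trade-off directly: at T=0.5 the top token dominates, while at T=2.0 the distribution is much flatter.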
ACL 2025 | Why Does Your Prompt Succeed? A New Theory Reveals the Mechanics and Effectiveness of Prompt Design for Large Models
机器之心· 2025-06-16 04:04
Core Insights
- The article discusses the importance of prompt design for improving the performance of large language models (LLMs) on complex reasoning tasks, emphasizing that effective prompts can significantly improve accuracy and efficiency [2][7][36].
- A theoretical framework is proposed to quantify the complexity of the prompt search space, moving prompt engineering from an empirical craft toward a more scientific practice [5][35].
Group 1: Prompt Design and Its Impact
- The effectiveness of prompt engineering has historically seemed somewhat mysterious, with certain combinations yielding large performance gains while others fall flat [7].
- Prompts act as critical "selectors" in chain-of-thought (CoT) reasoning, guiding the model in extracting relevant information from its internal hidden states [12][36].
- The study shows that the choice of prompt template directly influences reasoning performance, with optimal prompt designs yielding improvements of more than 50% [29][36].
Group 2: Theoretical Framework and Experimental Evidence
- The research introduces a systematic approach to finding optimal prompts by decomposing CoT reasoning into two interconnected search spaces: the prompt space and the answer space (a minimal template-search sketch follows below) [22][35].
- Experiments demonstrate that CoT mechanisms allow LLMs to carry out recursive computation, which is essential for multi-step reasoning tasks [26][30].
- Well-designed prompts can effectively constrain the output of each reasoning step, ensuring that only the most relevant information is carried forward into subsequent computation [28][36].
Group 3: Limitations and Future Directions
- Relying solely on generic prompts can severely limit performance on complex tasks, indicating the need for tailored prompt designs [36].
- CoT variants such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) can improve performance but remain constrained by the underlying prompt templates they build on [32][33].
- The findings underscore the need for a deeper understanding of task requirements in order to design prompts that guide LLMs to extract and use the core information [23][35].
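The notion of a prompt space that must be searched can be illustrated with a deliberately tiny sketch: candidate templates are scored on a small validation set and the best one is kept. Everything here is invented for illustration, including the templates, the validation items, and the mocked `ask_model` call; it is not the paper's method or code.

```python
TEMPLATES = {
    "generic": "Q: {q}\nA:",
    "cot": "Q: {q}\nLet's think step by step, then give the final answer:",
}

VAL_SET = [("What is 17 + 26?", "43"), ("What is 9 * 7?", "63")]

def ask_model(prompt: str) -> str:
    # Mocked LLM call: pretend the CoT-style wording helps on the first item.
    return "43" if "step by step" in prompt and "17 + 26" in prompt else "unsure"

def score(template: str) -> float:
    """Fraction of validation questions answered correctly under `template`."""
    hits = sum(gold in ask_model(template.format(q=q)) for q, gold in VAL_SET)
    return hits / len(VAL_SET)

best = max(TEMPLATES, key=lambda name: score(TEMPLATES[name]))
print({name: round(score(t), 2) for name, t in TEMPLATES.items()}, "->", best)
```

In practice the search space of templates is far larger and the scoring calls a real model, but the structure (enumerate or generate candidates, score on held-out tasks, select) stays the same.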
Toward an Epistemology of Artificial Intelligence: New Methods for Peering into the Black Box
36Ke · 2025-06-16 03:46
Core Insights
- The article discusses innovative strategies for better understanding and controlling the reasoning processes of large language models (LLMs) through mechanistic analysis and behavioral assessment [1][9].
Group 1: Mechanistic Analysis and Attribution
- Researchers are decomposing models' internal computations, attributing specific decisions to particular components such as circuits, neurons, and attention heads [1].
- A promising idea is to combine circuit-level interpretability with chain-of-thought (CoT) verification, using causal-tracing methods to check whether specific parts of the model are actually activated during reasoning steps [2].
Group 2: Behavioral Assessment and Constraints
- There is growing interest in better fidelity metrics for reasoning, which test whether a model's stated reasoning steps genuinely contribute to its final answer [3].
- Using auxiliary models for automated CoT evaluation is gaining traction: a verifier model assesses whether the answer follows logically from the reasoning provided [4].
Group 3: AI-Assisted Interpretability
- Researchers are exploring the use of smaller models as probes to help explain the activations of larger models, potentially leading to a better understanding of complex circuits [5].
- Cross-architecture interpretability is also under discussion, aiming to identify similar reasoning circuits in visual and multimodal models [6].
Group 4: Interventions and Model Editing
- A promising methodology involves circuit-based interventions, in which researchers modify or disable specific attention heads and observe how the model's behavior changes (a minimal ablation sketch follows below) [7].
- Future evaluations may adopt fidelity metrics as standard benchmarks, measuring how well models adhere to known necessary facts while reasoning [7].
Group 5: Architectural Innovations
- Researchers are considering architectural changes to improve interpretability, such as building models with inherently decoupled representations [8].
- There is a shift toward evaluating models in adversarial settings to better understand their reasoning processes and expose weaknesses [8].
Group 6: Collaborative Efforts and Future Directions
- The article highlights significant advances in interpretability research over the past few years, with collaborations forming across organizations to tackle these challenges [10].
- The goal is to ensure that, as more powerful AI systems emerge, their operating mechanisms are understood more clearly [10].
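The head-ablation style of intervention can be sketched in a few lines with a forward pre-hook on an open model. The sketch below assumes the GPT-2 implementation in Hugging Face `transformers` (module path `transformer.h[layer].attn.c_proj`, with head outputs laid out contiguously in that projection's input); other models and library versions arrange things differently, so treat it as an illustration of the technique rather than a reference implementation. Running the same prompt with and without the hook shows whether the chosen head mattered for that prediction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER, HEAD = 5, 3                                   # arbitrary head to disable
head_dim = model.config.n_embd // model.config.n_head

def top_token(prompt: str) -> str:
    """Greedy next-token prediction for a quick before/after comparison."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return tok.decode(int(logits.argmax()))

def ablate_head(module, inputs):
    # The input to the attention output projection is the concatenation of all
    # head outputs; zeroing one head's slice removes its contribution.
    hidden = inputs[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,) + inputs[1:]

prompt = "The capital of France is"
print("baseline :", top_token(prompt))

proj = model.transformer.h[LAYER].attn.c_proj        # that layer's attention output projection
handle = proj.register_forward_pre_hook(ablate_head)
print("ablated  :", top_token(prompt))
handle.remove()                                       # undo the intervention
```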
"Multimodal Approaches Cannot Achieve AGI"
AI前线· 2025-06-14 04:06
Core Viewpoint
- The article argues that true Artificial General Intelligence (AGI) requires a physical understanding of the world, because many problems cannot be reduced to symbolic operations [2][4][21].
Group 1: Limitations of Current AI Models
- Current large language models (LLMs) may give the illusion of understanding the world, but they primarily learn collections of heuristics for predicting tokens rather than developing a genuine world model [4][5][7].
- The understanding of LLMs is superficial, which leads to misconceptions about their level of intelligence, as they do not run physical simulations when processing language [8][12][20].
Group 2: The Need for Embodied Cognition
- The pursuit of AGI should prioritize embodied intelligence and interaction with the environment rather than merely stitching multiple modalities together into a patchwork solution [1][15][23].
- A unified approach to processing different modalities, inspired by human cognition, is essential for developing AGI that can generalize across varied tasks [19][23].
Group 3: Critique of Multimodal Approaches
- Current multimodal models often artificially sever the connections between modalities, complicating the integration of concepts and hindering the development of coherent understanding [17][18].
- Relying on large-scale models to stitch together narrow-domain capabilities is unlikely to yield a fully cognitive AGI, because it does not address the fundamental nature of intelligence [21][22].
Group 4: Future Directions for AGI Development
- Future AGI development should focus on interactive, embodied processes, drawing on insights from human cognition and classical disciplines [23][24].
- The challenge lies in identifying the functions an AGI needs and arranging them into a coherent whole, which is more a conceptual problem than a mathematical one [23].
Toward an Epistemology of Artificial Intelligence: Does Really No One Understand How the Large Language Model (LLM) Black Box Works?
36Ke · 2025-06-13 06:01
Group 1
- The core issue is the opacity of large language models (LLMs) such as GPT-4, which operate as "black boxes" whose internal decision-making processes remain largely inaccessible even to their creators [1][4][7].
- Recent research highlights the disconnect between LLMs' actual reasoning processes and the explanations they provide, raising concerns about the reliability of their outputs [2][3][4].
- The discussion covers the emergence of human-like reasoning strategies within LLMs despite the lack of transparency in their operations [1][3][12].
Group 2
- The article explores the debate over whether LLMs exhibit genuine emergent capabilities or whether these are merely artifacts of how they are measured [2][4].
- It emphasizes the importance of understanding the fidelity of chain-of-thought (CoT) reasoning, noting that the explanations models provide may not accurately reflect their actual reasoning paths [2][5][12].
- The role of the Transformer architecture in supporting reasoning, and the unintended consequences of alignment techniques such as Reinforcement Learning from Human Feedback (RLHF), are also discussed [2][5][12].
Group 3
- Methodological innovations are being proposed to bridge the gap between how models arrive at answers and how they explain themselves, including circuit-level attribution and quantitative fidelity metrics [5][6][12].
- The implications for safety and deployment in high-risk domains such as healthcare and law are examined, stressing the need for transparency in AI systems before they are put into practice [6][12][13].
- The article concludes with a call for robust verification and monitoring standards to ensure the safe deployment of AI technologies [2][6][12].