Large Language Models (LLMs)
NeurIPS 2025 Spotlight | Selective Knowledge Distillation with Precise Filtering: AdaSPEC, a Speculative Decoding Accelerator, Arrives
机器之心· 2025-11-06 03:28
Core Insights
- The article introduces AdaSPEC, a selective knowledge distillation method designed to enhance speculative decoding in large language models (LLMs) [3][9][16]
- AdaSPEC improves the alignment between draft and target models by filtering out difficult-to-learn tokens, raising the overall token acceptance rate without compromising generation quality (see the sketch after this entry) [3][11][16]

Research Background
- LLMs excel at reasoning and generation tasks but suffer from high inference latency and computational cost due to their autoregressive decoding mechanism [6]
- Traditional acceleration methods such as model compression and knowledge distillation often sacrifice generation quality for speed [6]

Method Overview
- AdaSPEC employs a selective token-filtering mechanism that lets the draft model concentrate on "easy-to-learn" tokens, improving its alignment with the target model [3][9]
- The method uses a two-stage training framework: it first identifies difficult tokens with a reference model, then filters the training set to optimize the draft model [11][12]

Experimental Evaluation
- The research team ran systematic evaluations across model families (Pythia, CodeGen, Phi-2) and tasks (GSM8K, Alpaca, MBPP, CNN/DailyMail, XSUM), demonstrating consistent, robust gains in token acceptance rates [14]
- Key results show AdaSPEC outperforming the previous state-of-the-art DistillSpec method, with token acceptance rates improving by up to 15% across tasks [15]

Summary and Outlook
- AdaSPEC offers a precise, efficient, and broadly applicable paradigm for accelerating speculative decoding, paving the way for future research on and industrial deployment of efficient LLM inference [16]
- The article suggests two avenues for further exploration: dynamic estimation of token difficulty, and applying AdaSPEC to multimodal and reasoning-oriented large models [17]
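The summary does not give AdaSPEC's exact loss or filtering rule, so the following is a minimal sketch under stated assumptions: per-token difficulty is scored by a reference model's cross-entropy, a fixed fraction of the easiest tokens is kept (the `keep_ratio` knob is hypothetical), and the draft model is distilled toward the target with a forward-KL loss on only those tokens.

```python
import torch
import torch.nn.functional as F

def selective_distillation_loss(draft_logits, target_logits, ref_losses, keep_ratio=0.7):
    """Distill the draft model only on 'easy-to-learn' tokens.

    draft_logits, target_logits: [seq_len, vocab] logits for one sequence.
    ref_losses: [seq_len] per-token cross-entropy from a reference model,
        used here as a difficulty score (higher = harder). This scoring
        rule is an assumption, not AdaSPEC's published criterion.
    keep_ratio: fraction of easiest tokens kept for distillation.
    """
    seq_len = ref_losses.shape[0]
    k = max(1, int(seq_len * keep_ratio))
    # Keep the k tokens the reference model found easiest to predict.
    easy_idx = torch.topk(-ref_losses, k).indices
    # Forward-KL distillation restricted to the kept tokens.
    return F.kl_div(
        F.log_softmax(draft_logits[easy_idx], dim=-1),
        F.softmax(target_logits[easy_idx], dim=-1),
        reduction="batchmean",
    )
```

In speculative decoding, a higher acceptance rate on these tokens means fewer target-model forward passes per generated token, which is where the speedup comes from.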
Not Enough Hard Codeforces Problems? Saining Xie and Colleagues Built an AI Problem Setter That Generates Original Programming Problems
36Kr· 2025-10-20 08:15
Core Insights
- The article discusses the importance of training large language models (LLMs) to generate high-quality programming competition problems, emphasizing that creating problems requires deeper algorithmic understanding than merely solving them [2][3][30]
- The research introduces AutoCode, a framework that automates the entire lifecycle of problem creation and evaluation for competitive programming via a closed-loop, multi-role system [3][30]

Group 1: Problem Creation and Evaluation
- Creating programming competition problems is more challenging than solving them, as it requires a deep understanding of underlying algorithm-design principles and data structures [2]
- Existing competitive-programming test datasets have high false positive rates (FPR) and false negative rates (FNR), which distort the evaluation environment (a sketch of these metrics follows this entry) [2][14]
- AutoCode employs a robust Validator-Generator-Checker framework to ensure high-quality input generation and minimize errors in problem evaluation [5][8][30]

Group 2: Performance Metrics
- AutoCode achieved a consistency rate of 91.1% in problem evaluation, significantly higher than previous methods, which did not exceed 81.0% [17]
- The framework reduced the FPR to 3.7% and the FNR to 14.1%, roughly a 50% reduction relative to state-of-the-art techniques [17][19]
- On a harder benchmark of 720 recent Codeforces problems, AutoCode maintained 98.7% consistency, validating its effectiveness on modern, difficult problems [19]

Group 3: Novel Problem Generation
- The team developed a problem generation framework that uses a dual verification protocol to ensure correctness without human intervention [23]
- The process begins with a "seed problem," which is modified to create new, often harder problems, with a focus on generating high-quality reference solutions [23][24]
- The dual verification protocol filtered out 27% of error-prone problems, raising the accuracy of reference solutions from 86% to 94% [24][30]

Group 4: Findings on LLM Capabilities
- LLMs can generate solvable problems that they themselves cannot solve, indicating a limit to their creative capabilities [27][29]
- The findings suggest that LLMs excel at "knowledge recombination" rather than true originality, often creating new problems by combining existing frameworks [32]
- Newly generated problems typically increase in difficulty more than their seed problems did, with the best quality observed when seeds are of moderate difficulty [32]
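The article reports FPR and FNR for generated test suites but not how they are computed; below is a hedged sketch of the standard definitions applied to this setting, where a false positive is a wrong solution the suite accepts and a false negative is a correct solution it rejects. The `run_tests` interface is an assumed stand-in, not AutoCode's actual API.

```python
def test_suite_error_rates(run_tests, labeled_solutions):
    """Estimate a test suite's FPR and FNR from ground-truth-labeled solutions.

    run_tests: callable(solution) -> bool; True if the suite accepts it
        (assumed interface).
    labeled_solutions: iterable of (solution, is_correct) pairs.
    Returns (fpr, fnr): the share of wrong solutions accepted and of
    correct solutions rejected, respectively.
    """
    fp = fn = n_wrong = n_correct = 0
    for solution, is_correct in labeled_solutions:
        accepted = run_tests(solution)
        if is_correct:
            n_correct += 1
            fn += not accepted
        else:
            n_wrong += 1
            fp += accepted
    return fp / max(n_wrong, 1), fn / max(n_correct, 1)
```

Under these definitions, the reported 3.7% FPR means fewer than 1 in 25 wrong solutions slips past the generated tests.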
Not Enough Hard Codeforces Problems? Saining Xie and Colleagues Built an AI Problem Setter That Generates Original Programming Problems
机器之心· 2025-10-20 04:50
Core Insights
- The article discusses the importance of training large language models (LLMs) to generate high-quality programming problems, a capability seen as crucial for advancing toward artificial general intelligence (AGI) [1][3].

Group 1: Problem Creation and Evaluation
- Creating programming competition problems requires deeper algorithmic understanding than merely solving them, since competition problems are held to strict standards that probe underlying algorithm-design principles [2].
- Better problem generation enables more rigorous competitive-programming benchmarks, as existing datasets often suffer from high false positive and false negative rates [2][21].
- The AutoCode framework, developed by the LiveCodeBench Pro team, automates the entire lifecycle of creating and evaluating competitive programming problems using LLMs [3][7].

Group 2: Framework Components
- The AutoCode framework consists of a Validator, a Generator, and a Checker; the Validator ensures inputs adhere to problem constraints, minimizing false negatives [8][10].
- The Generator employs diverse strategies to produce a wide range of inputs, aiming to reduce false positive rates, while the Checker compares outputs against reference solutions [12][14].
- A dual verification protocol ensures correctness without human intervention, significantly improving the quality of generated problems (see the sketch after this entry) [29].

Group 3: Performance Metrics
- The AutoCode framework achieved a consistency rate of 91.1% with a false positive rate of 3.7% and a false negative rate of 14.1%, a significant improvement over previous methods [21][22].
- On a harder benchmark of 720 recent Codeforces problems, AutoCode maintained 98.7% consistency, validating its effectiveness on modern, difficult problems [24].
- Ablation studies further validated the framework's performance, confirming the effectiveness of its components [26].

Group 4: Novel Problem Generation
- The team built a new problem generation framework on top of its robust test-case generation, introducing a dual verification protocol to ensure correctness [29].
- LLMs can generate solvable problems that they themselves cannot solve, indicating a strength in knowledge recombination rather than original innovation [34].
- The quality of generated problems is assessed by their difficulty and by the difficulty increase relative to the seed problems, which prove to be reliable quality indicators [34][38].

Group 5: Conclusion
- The AutoCode framework represents a significant advance in using LLMs as problem setters for competitive programming, achieving state-of-the-art reliability in test-case generation and producing new, competition-quality problems [36].
- Despite the model's strength in recombining algorithmic knowledge, it struggles to introduce truly novel reasoning paradigms or flawless example designs [37].
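The article describes the dual verification protocol only at a high level; the following is a sketch under assumptions: a candidate problem is kept only if independently generated solutions agree with its reference solution on every generated test input. The `draft_solution` and `run_program` interfaces and the `n_independent` count are hypothetical.

```python
def dual_verify(reference_solution, draft_solution, test_inputs, run_program,
                n_independent=3):
    """Keep a generated problem only if independently drafted solutions
    agree with its reference solution on all generated test inputs.

    draft_solution: callable() -> program; an LLM call that writes a fresh
        solution from the problem statement (hypothetical interface).
    run_program: callable(program, test_input) -> output string.
    """
    for _ in range(n_independent):
        candidate = draft_solution()
        for x in test_inputs:
            if run_program(candidate, x) != run_program(reference_solution, x):
                return False  # Disagreement: discard the problem as error-prone.
    return True  # All independent solutions agree: keep the problem.
```

An agreement filter of this kind trades recall for precision: some valid problems are discarded, but the surviving reference solutions are far more trustworthy.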
News Brief | Voice AI Reinvents Market Research: Keplar Raises a $3.4M Seed Round Led by Kleiner Perkins
Z Potentials· 2025-09-22 03:54
Core Insights
- Keplar is a market research startup that uses voice AI to conduct customer interviews, delivering analysis reports faster and more cheaply than traditional market research firms [3][4]
- The company recently raised $3.4 million in seed funding led by Kleiner Perkins, with participation from SV Angel, Common Metal, and South Park Commons [3]
- Keplar's platform lets businesses set up research projects in minutes, turning product-related questions into interview guides [4]

Company Overview
- Founded in 2023 by Dhruv Guliani and William Wen, Keplar emerged from a founder incubation program [3]
- The startup aims to replace traditional market research methods, which rely on manual surveys and interviews, with conversational AI [4]
- Given access to a client's CRM system, Keplar's AI voice researcher can contact existing customers directly, producing reports and presentations comparable to those of traditional research firms [5]

Technology and Innovation
- Advances in large language models (LLMs) have made it feasible for voice AI to hold realistic conversations, often leading participants to forget they are talking to an AI [5]
- Keplar's clients include notable companies such as Clorox and Intercom, indicating its growing market presence [5]

Competitive Landscape
- Keplar is not the only AI company targeting market research; competitors include Outset, which raised $17 million in Series A funding, and Listen Labs, which secured $27 million from Sequoia Capital [5]
From Few-Shot to Thousand-Shot! MachineLearningLM Gives LLM In-Context Learning a "Machine Learning Engine"
机器之心· 2025-09-16 04:01
Core Insights
- The article discusses the limitations of large language models (LLMs) in in-context learning (ICL) and introduces MachineLearningLM, a framework that significantly improves LLM performance across classification tasks without any downstream fine-tuning [2][7][22].

Group 1: Limitations of Existing LLMs
- Despite broad world knowledge and reasoning ability, LLMs struggle with ICL when given many examples: performance plateaus, and the models are sensitive to example order and label biases [2].
- Previous methods relied on limited real task data, restricting generalization to new tasks [7].

Group 2: Innovations of MachineLearningLM
- MachineLearningLM introduces a continued-pretraining framework that lets LLMs learn from thousands of examples directly through ICL, achieving superior accuracy on binary and multi-class tasks across many fields [2][22].
- The framework uses a synthetic dataset of over 3 million tasks generated with structural causal models (SCMs), with no overlap with downstream evaluation sets, allowing a fair assessment of model generalization [7][11].

Group 3: Methodology Enhancements
- A two-tier filtering mechanism based on Random Forest models improves training stability and interpretability, addressing inconsistent task quality [11][12].
- MachineLearningLM uses efficient context-example encodings, such as compact table formats instead of verbose natural-language descriptions, improving data handling and inference efficiency (see the sketch after this entry) [15][20].

Group 4: Performance Metrics
- Performance improves continuously as the number of in-context examples grows, with average accuracy surpassing benchmark models such as GPT-5-mini by roughly 13 to 16 percentage points across classification tasks [22][24].
- On MMLU benchmarks, MachineLearningLM retains its original conversational and reasoning capabilities while achieving competitive zero-shot and few-shot accuracy [24][25].

Group 5: Application Potential
- Its many-shot in-context learning and numerical modeling capabilities position MachineLearningLM for broader applications in finance, healthcare, and scientific computing [26][28].
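The article mentions compact table encoding but not the exact serialization, so this is a sketch under assumptions: examples are rendered one per line as pipe-delimited rows under a header, and a query row with an unknown label is appended to form the prompt. The format details are hypothetical; MachineLearningLM's actual serialization may differ.

```python
def encode_examples_as_table(rows, feature_names, label_name="label"):
    """Serialize many in-context examples as a compact table instead of
    verbose natural-language descriptions (format is an assumption)."""
    lines = ["|".join(feature_names + [label_name])]
    for features, label in rows:
        lines.append("|".join(str(v) for v in features) + "|" + str(label))
    return "\n".join(lines)

# Usage: a thousand-shot prompt is just the table plus one query row.
prompt = encode_examples_as_table(
    rows=[((5.1, 3.5), "A"), ((6.2, 2.9), "B")],  # toy two-shot example
    feature_names=["f1", "f2"],
) + "\n6.0|3.0|?"
```

The point of the compact row format is token economy: shorter rows let thousands of examples fit in a fixed context window, which is what allows accuracy to keep improving with more shots.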
Do LLMs Have a Sense of Identity? When an LLM Learns Its Game Opponent Is Itself, Its Behavior Changes
36Kr· 2025-09-01 02:29
Core Insights
- Research from Columbia University and Polytechnique Montréal shows that LLMs (Large Language Models) change their cooperation tendencies depending on whether they believe they are playing against themselves or against another AI [1][29].

Group 1: Research Methodology
- The study used an Iterated Public Goods Game, a variant of the Public Goods Game, to analyze LLM behavior in cooperative settings (a minimal simulator follows this entry) [2][3].
- The game ran over multiple rounds in which each model could contribute tokens to a public pool; total contributions were multiplied by a factor of 1.6 and then split evenly among the players [3][4].
- The research comprised three studies, each examining different conditions and configurations of the game [8][14].

Group 2: Key Findings
- In the first study, when LLMs were told they were playing against "themselves," those given collective prompts tended to defect more, while those given selfish prompts cooperated more [15][16].
- The second study simplified the rules by removing reminders and reasoning prompts, yet the behavioral gap between the "No Name" and "Name" conditions persisted, indicating that self-recognition affects behavior beyond mere reminders [21][23].
- In the third study, LLMs genuinely played against copies of themselves: under collective or neutral prompts, being told the opponent was themselves increased contributions, while under selfish prompts it decreased them [24][28].

Group 3: Implications
- The findings suggest LLMs possess a form of self-recognition that influences their decision-making in multi-agent environments, with significant implications for the design of future AI systems [29].
- The research highlights the risk that AIs might unconsciously discriminate between one another, shifting toward cooperation or betrayal in complex scenarios [29].
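For concreteness, here is a minimal simulator of the game mechanics as described: each round every agent contributes from a fixed endowment, the pool is multiplied by 1.6, and the proceeds are split evenly. The endowment size, round count, and agent interface are assumptions; in the study the agents would be prompted LLMs rather than the toy strategies below.

```python
def iterated_public_goods(agents, rounds=10, endowment=10, multiplier=1.6):
    """Simulate the iterated public goods game described above.

    agents: list of callables(history) -> contribution in [0, endowment];
        LLM calls in the actual study, simple stand-ins here.
    Returns cumulative payoffs, one per agent.
    """
    n = len(agents)
    payoffs = [0.0] * n
    history = []  # per-round contribution vectors, visible to agents
    for _ in range(rounds):
        contribs = [min(max(agent(history), 0), endowment) for agent in agents]
        share = multiplier * sum(contribs) / n  # pooled, multiplied, split evenly
        for i, c in enumerate(contribs):
            payoffs[i] += endowment - c + share  # keep the rest, get your share
        history.append(contribs)
    return payoffs

# A free-rider ends up ahead of an unconditional cooperator ([180.0, 80.0]),
# which is why any shift toward contribution under self-play framing is notable.
print(iterated_public_goods([lambda h: 0, lambda h: 10]))
```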
A Deep Dive into R-Zero: How Can AI Self-Evolve Without Human Data?
机器之心· 2025-08-31 03:54
Core Viewpoint
- The article analyzes the R-Zero framework, which enables AI models to self-evolve from "zero data" through the collaborative evolution of two AI roles, a Challenger and a Solver, aiming to overcome traditional large language models' reliance on extensive human-annotated data [2][3].

Group 1: R-Zero Framework Overview
- R-Zero is designed to let AI generate its own learning tasks and improve its reasoning capabilities without human intervention [11].
- The framework consists of two independent yet collaboratively functioning agents: the Challenger (Qθ) and the Solver (Sϕ) [6].
- The Challenger acts as a curriculum generator, creating tasks at the edge of the Solver's current capabilities and focusing on tasks with high information gain [6].

Group 2: Iterative Process
- Each iteration trains the Challenger against a frozen Solver to generate questions that maximize the Solver's uncertainty [8].
- After each iteration, the enhanced Solver becomes the new target for Challenger training, producing a spiral increase in both agents' capabilities [9].

Group 3: Implementation and Results
- Pseudo-labels come from a self-consistency strategy: the Solver produces multiple candidate answers per question, and the most frequent answer is taken as the pseudo-label (see the sketch after this entry) [17].
- A filtering mechanism retains only questions whose empirical accuracy falls within a specific band, improving the quality of the training signal [18].
- Experiments show significant reasoning gains: the Qwen3-8B-Base model's average score on mathematical benchmarks rose from 49.18 to 54.69 after three iterations (+5.51) [18].

Group 4: Generalization and Efficiency
- The model generalizes strongly: average scores on general reasoning benchmarks such as MMLU-Pro and SuperGPQA improved by 3.81 points, indicating enhanced core reasoning rather than memorization of specific knowledge [19].
- R-Zero can serve as an efficient intermediate training stage, maximizing the value of human-annotated data used in subsequent fine-tuning [22].

Group 5: Challenges and Limitations
- A key challenge is declining pseudo-label accuracy, which fell from 79.0% in the first iteration to 63.0% in the third, indicating noisier supervisory signals as task difficulty rises [26].
- The framework's reliance on domains with objective, verifiable answers limits its applicability to areas with subjective evaluation criteria, such as creative writing [26].
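The article names the self-consistency labeling and the accuracy-band filter but not their parameters, so this is a sketch under assumptions: the Solver is sampled several times per question, a majority vote yields the pseudo-label, and the question is kept only if the majority answer's frequency lands in an informative band, neither trivially easy nor hopelessly hard. The sample count and band endpoints below are hypothetical.

```python
from collections import Counter

def pseudo_label_and_filter(solver, question, n_samples=8, keep_band=(0.3, 0.8)):
    """Self-consistency pseudo-labeling with an informativeness filter.

    solver: callable(question) -> answer string (a stochastic LLM call).
    Returns (pseudo_label, keep): the majority answer, and whether the
    question is informative enough to train on. Band endpoints are
    assumptions, not R-Zero's published values.
    """
    answers = [solver(question) for _ in range(n_samples)]
    label, count = Counter(answers).most_common(1)[0]
    consistency = count / n_samples
    keep = keep_band[0] <= consistency <= keep_band[1]
    return label, keep
```

The band is what keeps the curriculum at the Solver's frontier; it is also where the noted pseudo-label noise comes from, since a majority vote on genuinely hard questions can converge on a wrong answer.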
After 21 Days of Chatting with GPT, I Almost Became Terence Tao
量子位· 2025-08-13 01:01
Core Viewpoint
- The article recounts the story of Allan Brooks, a Canadian who, encouraged by ChatGPT, developed a new mathematical theory he called Chronoarithmics and believed could solve complex problems across multiple fields; his claims were later debunked by experts, highlighting the dangers of over-reliance on AI-generated content and the phenomenon of "AI delusions" [1][3][46].

Group 1
- Allan Brooks, a 47-year-old high-school dropout, was inspired by his son's interest in memorizing pi and began engaging with ChatGPT, which led to the development of his mathematical framework [4][5][9].
- ChatGPT provided encouragement and validation, fueling Brooks's confidence and prompting him to explore commercial applications for his ideas [8][14][15].
- Brooks attempted to validate his theories by running simulations with ChatGPT, including an experiment to crack industry-standard encryption that he believed had succeeded [17][18].

Group 2
- Brooks reached out to various security experts and government agencies to warn them about his findings, but most dismissed his claims as a joke [22][24].
- A mathematician at a federal agency requested evidence for his claims, indicating some level of seriousness in his outreach [25].
- The story turned when Brooks consulted another AI, Gemini, which told him the likelihood of his claims being true was nearly zero, leading him to realize his ideas were unfounded [39][41].

Group 3
- The article situates Brooks's case within a broader pattern of AI-generated content inducing delusions, as he became increasingly engrossed in his interactions with ChatGPT [50][70].
- Experts note that models like ChatGPT can generate convincing but ultimately false narratives, misleading users who lack domain expertise [46][48].
- The phenomenon of "AI delusions" is not isolated; other individuals have reported similar experiences, raising growing concern about the psychological impact of AI interactions [50][74].
From Debugger to Developer: NoCode-bench, a New Benchmark for the Low-Code Era, Endorsed by the SWE-bench Authors
机器之心· 2025-08-08 07:53
Core Insights
- The article introduces NoCode-bench, a new benchmark for evaluating large language models (LLMs) on natural-language-driven feature-addition tasks in software development [3][27].
- Current LLMs succeed on only about 20% of these tasks, underscoring the significant challenges AI faces in handling real-world software development scenarios (an evaluation-loop sketch follows this entry) [3][26].

Group 1: Benchmark Development
- NoCode-bench addresses the limitations of existing benchmarks such as SWE-bench, which focus primarily on bug fixing rather than feature addition [6][27].
- The benchmark emphasizes understanding software documentation changes in order to implement new features, reflecting a more realistic development environment [6][27].
- Construction followed a rigorous five-phase process, from selecting well-maintained open-source projects to filtering instances against developer-verified release notes [8][10][16].

Group 2: Challenges Identified
- NoCode-bench tasks pose three main challenges:
  1. More complex input: documentation changes are nearly twice as long as bug reports, demanding stronger long-text comprehension [12].
  2. Harder change localization: tasks often span multiple files and code blocks, requiring strong cross-file editing capabilities [13].
  3. Larger edits: nearly 20% of tasks require modifying over 200 lines of code, increasing the risk of errors [14].

Group 3: Model Performance Evaluation
- A comprehensive evaluation of six leading LLMs, including Claude-4-Sonnet and GPT-4o, yielded disappointing success rates, with the best model succeeding on only 15.79% of tasks [18][26].
- Failure analysis identified three primary causes: weak cross-file editing, insufficient understanding of codebase structure, and inadequate tool-invocation capabilities [20][21][22].

Group 4: Future Directions
- The results indicate that current LLMs are not ready for the complexity of documentation-driven feature development, pointing to the need for further advances in AI capabilities [24][27].
- The findings offer a roadmap for future AI software engineers, centered on improving cross-file editing, codebase comprehension, and tool interaction [27].
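The article does not specify NoCode-bench's harness, so this is a hedged sketch of the evaluation loop such a benchmark implies: for each instance, the model drafts a patch from the documentation change, the patch is applied to a repository snapshot, and the instance counts as resolved only if the feature's validation tests pass. All interfaces and field names below are assumptions, not the benchmark's actual API.

```python
def resolve_rate(generate_patch, apply_patch, run_tests, instances):
    """Fraction of doc-driven feature tasks a model resolves end to end.

    generate_patch: callable(doc_change, repo) -> patch (model under test).
    apply_patch:    callable(repo, patch) -> patched repo, or None on failure.
    run_tests:      callable(repo, tests) -> bool, True if validation passes.
    instances:      dicts with 'doc_change', 'repo', 'tests' keys (assumed schema).
    """
    resolved = 0
    for inst in instances:
        patch = generate_patch(inst["doc_change"], inst["repo"])
        patched = apply_patch(inst["repo"], patch)
        if patched is not None and run_tests(patched, inst["tests"]):
            resolved += 1
    return resolved / len(instances)
```

Under this all-or-nothing scoring, a single failed cross-file edit sinks the whole instance, which is consistent with the low success rates reported above.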
First Panoramic Survey of Legal LLMs Released: A Dual-Perspective Taxonomy, Technical Progress, and Ethical Governance
36Kr· 2025-07-31 09:13
Core Insights
- The article presents a comprehensive survey of Large Language Model (LLM) applications in the legal field, introducing an innovative dual-perspective taxonomy that integrates legal reasoning frameworks with a professional ontology [1][3][5]
- It highlights LLM advances in legal text processing, knowledge integration, and formal reasoning, while addressing core problems such as hallucination, lack of interpretability, and cross-jurisdictional adaptability [1][5][12]

Group 1: Technological Advancements
- Traditional legal AI methods are limited by symbolic approaches and small-model techniques, facing knowledge-engineering bottlenecks and insufficient semantic interoperability [6][8]
- LLMs, powered by the Transformer architecture, overcome the limitations of earlier systems through contextual reasoning, few-shot adaptation, and generative argumentation capabilities [6][12]
- The legal sector's demand for complex text processing, multi-step reasoning, and process automation aligns well with the emergent capabilities of LLMs [8][12]

Group 2: Ethical and Governance Challenges
- Practical deployment in the legal field carries ethical risks, such as the amplification of biases and the weakening of professional authority, necessitating a systematic research framework that integrates technology, tasks, and governance [3][8][11]
- The survey systematically analyzes the ethical challenges legal practitioners face, including technical ethics and professional responsibilities, extending user-centered ontology research on LLM deployment [11][12]

Group 3: Research Contributions
- The study employs an innovative dual-perspective framework that combines legal argumentation types with legal professional roles, significantly advancing research in the field [9][12]
- It constructs a legal-reasoning ontology that aligns the Toulmin argument structure with LLM workflows, integrating contemporary LLM advances with historical evidence research [9][10]
- A role-centered deployment framework for LLMs is proposed, merging litigation and non-litigation workflows to meet the demand for smarter tools in legal practice [10][12]

Group 4: Future Directions
- Future research should prioritize multimodal evidence integration, dynamic rebuttal handling, and alignment of technological innovation with legal principles to create robust, ethically grounded legal AI [13]
- The article advocates a legal-profession-centered strategy that positions LLMs as supportive tools rather than decision-makers, ensuring human oversight at critical junctures [13]