Reasoning Models
Claude 4 Is Here! New Benchmarks for AI Coding, Seven Hours of Continuous Coding, Hybrid Models, and a Major Leap in Context Capabilities
Founder Park· 2025-05-23 01:42
Reposted from 「新智元」. At the Anthropic developer conference in the early hours of today, Claude 4 took the stage. CEO Dario Amodei presented Claude Opus 4 and Claude Sonnet 4 in person, once again pushing coding, advanced reasoning, and AI agents to a new standard. Claude Opus 4 is billed as the world's top coding model, excelling at complex, long-running tasks and performing exceptionally well in AI-agent workflows. Claude Sonnet 4 is a major upgrade over Sonnet 3.7, with stronger coding and reasoning and more precise instruction following. At the same time, Anthropic released, all at once, the series of products it has accumulated over this period:
- Two modes for the hybrid Claude Opus 4 and Sonnet 4 models: near-instant responses, and extended thinking for deeper reasoning.
- Extended thinking with tool use (beta): both models can use tools (such as web search) during extended thinking, letting Claude switch flexibly between reasoning and tool use to improve response quality.
- New model capabilities: both models can use tools in parallel, follow instructions more precisely, and (when developers grant access to local files) show markedly stronger memory, extracting and saving key information to maintain continuity and accumulate tacit knowledge over time. C ...
The World's Strongest Coding Model, Claude 4, Makes a Stunning Debut: Seven Hours of Autonomous Coding, Tasks Completed Within 30 Seconds of a Single Instruction, Smooth and Bug-Free
AI前线· 2025-05-22 19:57
Core Insights
- Anthropic has officially launched the Claude 4 series, which includes Claude Opus 4 and Claude Sonnet 4, setting new standards for coding, advanced reasoning, and AI agents [1][3]

Model Performance
- Claude Opus 4 is described as the most powerful AI model from Anthropic, capable of running tasks for several hours autonomously, outperforming competitors like Google's Gemini 2.5 Pro and OpenAI's models in coding tasks [6][8]
- In benchmark tests, Claude Opus 4 achieved 72.5% in SWE-bench and 43.2% in Terminal-bench, leading the field in coding efficiency [10][11]
- Claude Sonnet 4, a more cost-effective model, offers excellent coding and reasoning capabilities, achieving 72.7% in SWE-bench, while reducing the likelihood of shortcuts by 65% compared to its predecessor [13][14]

Memory and Tool Usage
- Claude Opus 4 significantly enhances memory capabilities, allowing it to create and maintain "memory files" for long-term tasks, improving coherence and execution performance [11][20]
- Both models can utilize tools during reasoning processes, enhancing their ability to follow instructions accurately and build implicit knowledge over time [19][20]

API and Integration
- The new models are available on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, with pricing consistent with previous models [15]
- Anthropic has also released Claude Code, a command-line tool that integrates with GitHub Actions and development environments like VS Code, facilitating seamless pair programming [17]

Market Context
- The AI industry is shifting towards reasoning models, with a notable increase in their usage, growing from 2% to 10% of all AI interactions within four months [31][35]
- The competitive landscape is intensifying, with major players like OpenAI and Google also releasing advanced models, each showcasing unique strengths [36]
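The summary above notes that both models expose an extended thinking mode through the existing Anthropic API. Below is a minimal sketch of what such a call might look like in Python; the model ID string, the token budgets, and the prompt are illustrative assumptions rather than details from the article.

```python
# Hedged sketch: calling Claude Opus 4 with extended thinking via the Anthropic
# Messages API. The model ID and budget values are assumptions for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",   # assumed model identifier
    max_tokens=4096,
    thinking={                        # extended thinking mode
        "type": "enabled",
        "budget_tokens": 2048,        # cap on internal reasoning tokens
    },
    messages=[
        {"role": "user",
         "content": "Refactor this function and explain the trade-offs: ..."}
    ],
)

# The reply interleaves "thinking" blocks with the final "text" blocks;
# print only the user-facing text.
for block in response.content:
    if block.type == "text":
        print(block.text)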
In One Conversation, We Dug Into the Technology Behind the Wenxin Large Models
量子位· 2025-05-22 12:34
Core Viewpoint
- The article discusses the advancements in large models, particularly focusing on the performance of Baidu's Wenxin models, which have achieved high ratings in recent evaluations, indicating their strong capabilities in reasoning and multimodal integration [1][2]

Group 1: Model Performance and Evaluation
- The China Academy of Information and Communications Technology (CAICT) recently evaluated large model reasoning capabilities, with Wenxin X1 Turbo achieving the highest rating of "4+" across 24 assessment categories [1]
- Wenxin X1 Turbo scored 16 items at 5 points, 7 items at 4 points, and 1 item at 3 points, making it the only large model in China to pass this evaluation [1]

Group 2: Technological Innovations
- Wenxin models emphasize two key areas: multimodal integration and deep reasoning, with the introduction of technologies such as multimodal mixed training and self-feedback enhancement [6][11]
- The multimodal mixed training approach unifies text, image, and video modalities, improving training efficiency by nearly 2 times and enhancing multimodal understanding by over 30% [8]
- The self-feedback enhancement framework allows the model to self-improve, addressing challenges in data production and significantly reducing model hallucinations [13]

Group 3: Application Scenarios
- In practical applications, Wenxin X1 Turbo demonstrates its capabilities in solving physics problems and generating code, with AI-generated code now accounting for over 40% of new code added daily [42][44]
- The technology supports over 100,000 digital human anchors, achieving a 31% conversion rate in live broadcasts and reducing broadcast costs by 80% [48]

Group 4: Market Potential and Future Directions
- The global online education market is projected to reach 899.16 billion yuan by 2029, with large models playing a crucial role in this growth [49]
- The digital human market is expected to reach 48.06 billion yuan this year, nearly quadrupling from 2022, indicating significant opportunities for large model applications [49]

Group 5: Long-term Strategy and Vision
- Baidu's approach to large models emphasizes continuous technological exploration and deepening, focusing on long-term value rather than short-term trends [57][58]
- The company maintains a dynamic perspective on the rapid evolution of technology, aiming to prepare for future industry transformations [58]
Jinqiu Capital's Zang Tianyu: AI Venture Investment Trends for 2025
锦秋集· 2025-05-14 10:02
Core Insights
- The article discusses the investment trends in the AI sector, highlighting a shift from foundational models to application layers as the core focus for investment opportunities [1][7][11]

Group 1: Domestic AI Investment Trends
- JinQiu Capital's investment portfolio serves as a small sample window to observe domestic AI investment trends [2]
- Approximately 60% of the projects are concentrated in the application layer, driven by improved model intelligence and significantly reduced invocation costs [6][7]
- The investment focus has shifted from foundational models, particularly large language models (LLMs), to application-oriented projects as foundational model capabilities mature [6][7]

Group 2: Key Investment Areas
- The application layer is the primary focus, with nearly 40% of investments in Agent AI, 20% in creative tools, and another 20% in content and emotional consumption [8]
- Bottom-layer computing power and Physical AI are also critical areas, with investments aimed at enhancing model training and inference capabilities [9][10]
- Middle-layer/toolchain investments are limited, focusing on large model security and reinforcement learning infrastructure [10]

Group 3: Trends in AI Intelligence and Cost
- The continuous improvement of AI intelligence and the decreasing cost of acquiring this intelligence are the two core trends driving investment decisions [12][13]
- The industry has shifted focus from pre-training scaling laws to optimizing post-training phases, leading to the emergence of "Test Time Scaling" [14][15]
- The "Agent AI" era is characterized by the development of various agents to address practical operational issues [15]

Group 4: Cost Reduction in AI
- A significant decrease in token costs has been observed, with prices dropping to as low as 0.8 RMB per million tokens, making applications economically viable [19][20]
- The cost of reasoning models remains a challenge due to their higher token consumption, necessitating further innovations to reduce inference costs [21][22]
- Innovations in underlying computing architectures, such as processing-in-memory and optical computing, are expected to drive long-term cost reductions [23][24]

Group 5: Opportunities in the Application Layer
- The combination of improved intelligence and reduced costs has led to a surge in entrepreneurial activity within the application layer [26]
- The AI era presents new variables, including richer information and service offerings, as well as more precise recommendations evolving into proactive services [29][30]
- The marginal cost of content creation and service execution has significantly decreased, enabling scalable and distributable service models [31][33]

Group 6: Future of Physical AI
- The potential for achieving general-purpose robots in the Physical AI domain is highlighted as a key area for future development [37]
- Data remains a core challenge for the development of general-purpose robots, necessitating collaborative optimization of hardware and software [40]
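The cost argument in Group 4 (token prices falling to roughly 0.8 RMB per million while reasoning models burn far more tokens per request) is easy to make concrete with a back-of-the-envelope calculation. In the sketch below, only the 0.8 RMB per million figure comes from the article; the reasoning-model price and the token counts are assumptions for illustration.

```python
# Back-of-the-envelope cost comparison between a standard completion and a
# reasoning-style completion that spends extra tokens on its thinking trace.
# All token counts and the reasoning-model price are assumptions; only the
# 0.8 RMB per million tokens floor price appears in the article.

PRICE_STANDARD_RMB_PER_M = 0.8    # cited floor price, RMB per million output tokens
PRICE_REASONING_RMB_PER_M = 8.0   # assumed: reasoning endpoints are often priced higher

def request_cost(output_tokens: int, price_per_million: float) -> float:
    """Cost in RMB for a single request's output tokens."""
    return output_tokens / 1_000_000 * price_per_million

standard = request_cost(500, PRICE_STANDARD_RMB_PER_M)            # short direct answer
reasoning = request_cost(500 + 6_000, PRICE_REASONING_RMB_PER_M)  # answer plus thinking trace

print(f"standard  : {standard:.4f} RMB per request")
print(f"reasoning : {reasoning:.4f} RMB per request")
print(f"ratio     : {reasoning / standard:.0f}x")
```

Under these assumed numbers a single reasoning request costs over a hundred times more than a standard one, which is why the article treats inference-cost innovation as a precondition for reasoning-heavy applications.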
Large Reasoning Models Will Hit a Wall Within a Year; Performance Cannot Scale by Several More Orders of Magnitude | Latest Research from the FrontierMath Team
量子位· 2025-05-13 07:11
Hengyu, reporting from Aofeisi. QbitAI | WeChat official account QbitAI

Within a year, reasoning training for large models may hit a wall. An accompanying message comes with it: if reasoning models keep "growing 10x every 3 to 5 months" (the way DeepSeek-R1 followed OpenAI's o1-preview), the compute required for reasoning training will rapidly converge with the overall compute frontier.

These conclusions come from Epoch AI, a non-profit organization focused on AI research and benchmarking; the once widely discussed FrontierMath benchmark (which evaluates AI models' mathematical reasoning) came from them.

Seeing this result, some onlookers grew anxious: if scaling further on top of o3 is so difficult, why not explore modular architectures or task-specific specialized models? "Efficiency" matters more than a "research surplus"!

Reasoning training still has room to scale. OpenAI's o1 was the pioneering reasoning model. OpenAI has said that training o3 required 10x the compute of o1, with nearly all of the increase spent on the training stage. OpenAI has not disclosed specifics for o1 or o3, but other reasoning models such as DeepSeek-R1, Microsoft's Phi-4-reasoning, and NVIDIA's Llama-Nemotron offer a reference point: their reasoning-training compute is lower, but extrapolations can be made from them. ...
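The "hits a wall within a year" claim is at heart an extrapolation: compute dedicated to reasoning training is growing roughly 10x every few months, while total frontier training compute grows far more slowly, so the former must soon collide with the latter and drop to its pace. The toy projection below reproduces that style of argument with assumed starting budgets and growth rates; none of the numbers are Epoch AI's published figures.

```python
# Toy extrapolation in the spirit of the Epoch AI argument: compute spent on
# reasoning training grows ~10x every 4 months, while total frontier training
# compute grows ~4x per year, so the former catches the latter and then has to
# slow down. Starting budgets and growth rates are assumptions, not Epoch AI's.

reasoning_flop = 1e24          # assumed current reasoning-training budget (FLOP)
frontier_flop = 3e26           # assumed largest full training-run budget (FLOP)

reasoning_growth = 10 ** (1 / 4)   # per-month factor for 10x every 4 months
frontier_growth = 4 ** (1 / 12)    # per-month factor for 4x every 12 months

months = 0
while reasoning_flop < frontier_flop:
    reasoning_flop *= reasoning_growth
    frontier_flop *= frontier_growth
    months += 1

print(f"Reasoning training reaches the frontier budget after ~{months} months.")
# Beyond this point it can only grow as fast as the frontier itself (~4x/year),
# which is the sense in which the 10x-every-few-months trend "hits a wall".
```

With these assumed inputs the crossover lands at roughly a year, matching the shape of the claim even though the exact month depends entirely on the starting budgets chosen.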
StepFun's Jiang Daxin: Multimodal AI Has Yet to Have Its GPT-4 Moment
Hu Xiu· 2025-05-08 11:50
Core Viewpoint
- The multi-modal model industry has not yet reached a "GPT-4 moment," as the lack of an integrated understanding-generating architecture is a significant bottleneck for development [1][3]

Company Overview
- The company, founded by CEO Jiang Daxin in 2023, focuses on multi-modal models and has undergone internal restructuring to form a "generation-understanding" team from previously separate groups [1][2]
- The company currently employs over 400 people, with 80% in technical roles, fostering a collaborative and open work environment [2]

Technological Insights
- The understanding-generating integrated architecture is deemed crucial for the evolution of multi-modal models, allowing for pre-training with vast amounts of image and video data [1][3]
- The company emphasizes the importance of multi-modal capabilities for achieving Artificial General Intelligence (AGI), asserting that any shortcomings in this area could delay progress [12][31]

Market Position and Competition
- The company has completed a Series B funding round of several hundred million dollars and is one of the few among the "AI six tigers" that has not abandoned pre-training [3][36]
- The competitive landscape is intense, with major players like OpenAI, Google, and Meta releasing numerous new models, highlighting the urgency for innovation [3][4]

Future Directions
- The company plans to enhance its models by integrating reasoning capabilities and long-chain thinking, which are essential for solving complex problems [13][18]
- Future developments will focus on achieving a scalable understanding-generating architecture in the visual domain, which is currently a significant challenge [26][28]

Application Strategy
- The company adopts a dual strategy of "super models plus super applications," aiming to leverage multi-modal capabilities and reasoning skills in its applications [31][32]
- The focus on intelligent terminal agents is seen as a key area for growth, with the potential to enhance user experience and task completion through better contextual understanding [32][34]
A First Look at Sebastian Raschka's New Book "Reasoning From Scratch": Demystifying the Foundations of Reasoning Models
机器之心· 2025-05-02 04:39
Core Viewpoint
- The article discusses the advancements in reasoning capabilities of large language models (LLMs) and introduces the book "Reasoning From Scratch" by Sebastian Raschka, which aims to provide practical insights into building reasoning models from the ground up [2][5][59]

Group 1: Definition and Importance of Reasoning in LLMs
- Reasoning in the context of LLMs refers to the model's ability to generate intermediate steps before arriving at a final answer, often described as chain-of-thought (CoT) reasoning [8][10]
- The distinction between reasoning and pattern matching is crucial, as traditional LLMs primarily rely on statistical correlations rather than logical reasoning [23][25]
- Understanding reasoning methods is essential for enhancing LLMs' capabilities to tackle complex tasks, such as solving logical puzzles or multi-step arithmetic problems [5][39]

Group 2: Training Process of LLMs
- The typical training process for LLMs consists of two main phases: pre-training and fine-tuning [16][19]
- During pre-training, LLMs are trained on vast amounts of unlabelled text (up to several terabytes) to learn language patterns, which can cost millions of dollars and take months [17][21]
- Fine-tuning involves supervised fine-tuning (SFT) and preference fine-tuning to improve the model's ability to respond to user queries [20][21]

Group 3: Pattern Matching vs. Logical Reasoning
- LLMs learn to predict the next token based on statistical patterns in the training data, which allows them to generate coherent text but lacks true understanding [23][24]
- In contrast, logical reasoning requires the ability to derive conclusions step-by-step, identifying contradictions and causal relationships [25][26]
- The article highlights that most LLMs do not actively identify contradictions but instead rely on learned patterns from training data [30][34]

Group 4: Enhancing Reasoning Capabilities
- The reasoning capabilities of LLMs gained significant attention with the release of OpenAI's o1 model, which emphasizes a more human-like thought process [41][43]
- Enhancements to LLM reasoning can be achieved through inference-time compute scaling, reinforcement learning, and knowledge distillation [44][46][48]
- These methods aim to improve the model's reasoning ability without retraining the underlying model weights [46][48]

Group 5: Importance of Building Reasoning Models from Scratch
- Building reasoning models from scratch provides valuable insights into the capabilities, limitations, and computational trade-offs of LLMs [50][57]
- The shift towards reasoning models reflects a broader trend in the AI industry, emphasizing the need for models that can handle complex tasks effectively [52][55]
- Understanding the underlying mechanisms of LLMs and reasoning models is crucial for optimizing their performance in various applications [57]
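Of the enhancement routes listed in Group 4, inference-time compute scaling is the easiest to illustrate without retraining anything: sample several chain-of-thought completions and keep the majority answer, often called self-consistency. The sketch below uses a hypothetical `sample_cot_answer` stand-in instead of a real model call; the voting logic is the generic technique, not code from the book.

```python
# Minimal sketch of one inference-time compute scaling technique: self-consistency.
# Instead of trusting a single chain-of-thought sample, draw several and take a
# majority vote over the final answers. `sample_cot_answer` is a hypothetical
# stand-in for a real LLM call that returns (reasoning_trace, final_answer).
import random
from collections import Counter

def sample_cot_answer(question: str, temperature: float = 0.8) -> tuple[str, str]:
    """Placeholder for a sampled chain-of-thought completion."""
    # A real implementation would call an LLM with a CoT prompt and parse the
    # final answer; here we fake a noisy solver for the demo question.
    answer = "29" if random.random() < 0.7 else random.choice(["27", "31"])
    return f"step-by-step reasoning for {question} ...", answer

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Majority vote over n sampled final answers (more samples, more compute)."""
    votes = Counter(sample_cot_answer(question)[1] for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 12 + 17?"))  # usually "29" once n_samples is large
```

The same pattern generalizes: spending more samples (or longer thinking budgets) buys accuracy at inference time while the model weights stay fixed.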
China's Six Leading Reasoning Models Take On OpenAI?
创业邦· 2025-04-30 10:09
The following article comes from 光子星球 (ID: TMTweb), written by Hao Xin and edited by Wang Pan; images via Midjourney. 光子星球: seeing the future in the fine details.

"DeepSeek-R1 is like the first satellite the Soviet Union rushed to launch: AI's Sputnik moment, the start of a new era."

Before the 2025 Spring Festival, DeepSeek lit up the sky over the world a step ahead of the New Year's Eve fireworks. With only a few hours left before the New Year's Eve dinner, an engineer at a domestic cloud provider was suddenly pulled into a work group and handed an urgent task: quickly tune the chips to support the newly released DeepSeek-R1 model. The engineer told us, "From onboarding to completion, the whole process took less than a week." On the second day of the Lunar New Year, the phone of the head of a vendor in the Agent-to-B business rang off the hook; customers' demands were blunt: verify the model's real performance immediately and get deployment on the schedule as soon as possible.

Before the holiday there were many large models; after the holiday there was only DeepSeek. DeepSeek-R1 became a watershed that rewrote the narrative logic of China's large models. Starting from November 2022, when OpenAI released the ChatGPT application built on GPT-3.5, China set out on the road of chasing OpenAI. In 2023, large models sprang up like bamboo shoots after rain; there was no AI without a large model, vendors chased one another, and the "battle of a hundred models" began to take shape. One act followed another, and in 2024 the protagonist became ...
Drop the Thinking Process, and Reasoning Models Can Be Even Stronger | New Research from UC Berkeley and Others
量子位· 2025-04-29 08:02
Hengyu, reporting from Aofeisi. QbitAI | WeChat official account QbitAI

Experimental data show that in low-resource settings (few tokens, small model parameters) or low-latency settings, the NoThinking method outperforms the Thinking method, achieving a better accuracy-latency trade-off than the traditional thinking approach. In other settings, NoThinking can also beat Thinking on some datasets.

In fact, reasoning models can reason effectively without long stretches of thinking. Sounds counterintuitive? The prevailing impression is that reasoning models are powerful and produce accurate, useful answers precisely because of their lengthy reasoning process, which often takes a long time and therefore consumes a lot of compute. Some studies have tried to improve reasoning efficiency, but most still rely on an explicit thinking process. The latest results from a team at UC Berkeley and the Allen Institute for AI break this stereotype: bypassing the "thinking" step with a simple prompt and generating the solution directly can be just as effective, or even better. This approach is called the "NoThinking" method.

"Thinking" vs. "NoThinking": the research team built on the DeepSeek-R1-Distill-Qwen model to propose NoThinking. Let's first look at where Thinking and NoThinking differ. Thin ...
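As a rough sketch of how "bypassing thinking with a simple prompt" can work for a DeepSeek-R1-style distilled model: the assistant turn is prefilled so that the thinking block is already closed before decoding starts. The chat-template tags and the dummy "finished thinking" string below are assumptions for illustration; the exact strings used in the paper may differ.

```python
# Hedged sketch of the Thinking vs. NoThinking setup for a DeepSeek-R1-style
# distilled model. The template tags and the dummy prefill text are assumptions
# for illustration; they are not copied from the paper.

def build_prompt(question: str, mode: str) -> str:
    user_turn = f"<|User|>{question}<|Assistant|>"
    if mode == "thinking":
        # Default behaviour: the model opens a <think> block and reasons at
        # length before emitting the final answer.
        return user_turn + "<think>\n"
    if mode == "nothinking":
        # NoThinking: prefill an essentially empty thinking block so decoding
        # starts directly at the solution, spending far fewer tokens.
        return user_turn + "<think>\nOkay, I think I have finished thinking.\n</think>\n"
    raise ValueError(f"unknown mode: {mode}")

question = "Prove that the sum of two even integers is even."
print(build_prompt(question, "thinking"))
print(build_prompt(question, "nothinking"))
```

The point of the comparison is that the NoThinking prefill forces the model to spend its token budget on the solution itself, which is where the reported accuracy-latency gains come from.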
Altman Boasts: At or Near Genius Level! A Major Release from OpenAI!
Zheng Quan Shi Bao· 2025-04-17 04:31
Core Insights - OpenAI has launched two new reasoning models, o3 and o4-mini, which are capable of image-based reasoning, marking a significant advancement in the o series [1][6] Group 1: Model Performance - The o3 model is described as the most powerful reasoning flagship model, excelling in programming, mathematics, science, and visual perception benchmarks [1][8] - The o4-mini model is optimized for cost-effective reasoning, providing a balance between performance and affordability [1][8] - In external evaluations, o3 made 20% fewer significant errors in challenging real-world tasks compared to its predecessor, particularly in programming and creative tasks [8] Group 2: Image Reasoning Capabilities - Both models can integrate images into reasoning processes, allowing for "thinking with images" [10] - Users can upload various types of images, and the models can interpret them even if they are of low quality [10] - For example, o3 can analyze a photo of a notebook and deduce the written content through reasoning [10] Group 3: Task Execution and Tool Utilization - o3 and o4-mini can autonomously execute tasks by accessing tools within ChatGPT and utilizing custom user tools via API [13] - The models can perform complex tasks such as searching for data, generating code, and creating visual representations based on user queries [13] Group 4: Future Developments - OpenAI's CEO, Sam Altman, indicated that o3 will soon be upgraded to a professional version, o3-pro [4] - The company has been releasing models at a rapid pace, including the recent launch of the GPT-4.1 series, which aims to attract users with cost-effective options [15] - There is ongoing anticipation for the release of GPT-5, which has faced delays due to integration challenges [16]