Breaking the "data contamination" and "inflated capability" dilemmas in LLM coding: the Meituan-M17 team builds OIBench, a new standard for AI coding evaluation
机器之心· 2025-07-11 02:43
The coding ability of large language models (LLMs) has drawn wide attention, and sweeping claims circulate in the market: DeepMind's AlphaCode was said to reach the level of human competitive programmers, and OpenAI's top models have repeatedly been reported to pass Google's senior engineering interviews and to perform strongly on LeetCode challenges. Yet when these capability claims are compared with actual evaluation results, deeper problems in the current evaluation system surface. The stark contrast points to one core issue: assessments of LLM coding ability often suffer from a "perception gap between marketing and reality". This gap stems not only from the complexity of model capability boundaries but also exposes many limitations of existing evaluation systems. To resolve these evaluation dilemmas and measure the true coding ability of the world's top models, the Meituan-M17 team released OIBench, a more realistic and more discriminative benchmark dataset, hosted on the AGI-Eval evaluation community. Using this dataset, we systematically evaluated and scored the algorithmic coding ability of 18 mainstream models worldwide; the detailed leaderboard is shown below. The world's top models remain far from the coding prowess previously claimed for them: even the top scorer, o4-mini-high, reaches only 36.35 points, well short of human contest-level performance, ...
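As a toy illustration of how per-problem test-case results can be aggregated into a 0-100 leaderboard score (a hypothetical scheme for illustration only, not OIBench's actual scoring protocol):

```python
def score_model(results):
    """Hypothetical aggregation: each problem contributes the fraction of
    its test cases the model's program passed; the final score is the mean
    over problems, scaled to 100. Not OIBench's actual protocol.

    results: list of (tests_passed, tests_total) pairs, one per problem.
    """
    per_problem = [passed / total for passed, total in results]
    return 100 * sum(per_problem) / len(per_problem)

# Three problems: fully solved, half the tests passed, none passed.
print(round(score_model([(10, 10), (5, 10), (0, 10)]), 2))  # → 50.0
```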
Yes, LeCun will report to 28-year-old Alexandr Wang! Here are some exclusive insider details on Meta's new AI team
机器之心· 2025-07-11 02:43
Core Viewpoint - Meta's aggressive recruitment strategy in the AI sector has raised questions about its sustainability and the potential impact on company culture and performance [2][24]. Group 1: Recruitment and Team Structure - Meta has made headlines by offering exorbitant salaries, reportedly up to $200 million for key talent, to attract AI experts from competitors like OpenAI and Apple [3][4]. - The newly formed Meta Superintelligence Labs (MSL), led by Alexandr Wang, is a focal point of interest regarding its operational structure and research direction [5]. - There is a significant internal restructuring, with high-level executives being allowed to recruit their own teams, which may lead to internal competition and integration challenges [21][22]. Group 2: Internal Dynamics and Culture - Concerns have been raised about the impact of these changes on Meta's corporate culture, with reports of a "fear culture" emerging due to performance evaluations and ongoing layoffs [24]. - A lack of clear vision and strategic confusion has been noted, particularly within the Llama team, where many employees are unclear about the company's goals [24]. - The retention rate of top talent recruited from other companies is low, indicating potential issues with employee satisfaction and organizational stability [24]. Group 3: Research Focus and Distinctions - The Fundamental AI Research (FAIR) division operates independently from the GenAI and MSL teams, focusing on long-term foundational research rather than product development [8][16]. - The Llama team, initially part of FAIR, has been transitioned to the GenAI product group following the success of Llama1, highlighting the distinction between exploratory research and product-oriented development [15][16]. - The controversy surrounding the Llama 4 model, including allegations of "ranking cheating," has raised questions about Meta's technical reputation and credibility in the AI field [24].
July 2025 programming language rankings: as the mainstream languages battle it out, is the safety-focused "dark horse" Ada staging a comeback?
菜鸟教程· 2025-07-11 02:31
The TIOBE programming language index for July 2025 is out; the official headline is "Senior programming languages battling for a top 10 position?". The latest ranking shows Python, C, C++, Java, C#, JavaScript, and Go holding the top seven spots for the third consecutive year, an unshakable "first tier". The real churn is in positions 8 through 12, where Visual Basic, SQL, Fortran, Ada, Perl, and Delphi wage a monthly tug-of-war, a programming-world "twilight of the gods". These veteran languages may long since have been deleted from résumés, but they are still fighting hard for a top-ten ticket: VB climbs one month, Delphi spikes the next, while Fortran and Perl, like unkillable cockroaches, refuse to concede to age or defeat. By rights Rust is safe, Kotlin is sleek, Dart has Flutter behind it, and Julia is the new darling of data science, yet in reality none of them has managed to squeeze into ...
Don't brute-force your autonomous-driving research! Use the right playbook to overtake on the curve~
自动驾驶之心· 2025-07-11 01:14
Core Viewpoint - The article emphasizes the importance of learning from experienced mentors in the field of research, particularly in LLM/MLLM, to accelerate the research process and achieve results more efficiently [1]. Group 1: Course Offerings - The program offers a 1v6 elite small class format, allowing for personalized guidance from a mentor throughout the research process [5]. - The course covers everything from model theory to practical coding, helping participants build their own knowledge systems and understand algorithm design and innovation in LLM/MLLM [1][10]. - Participants will receive tailored ideas from the mentor to kickstart their research, even if they lack a clear direction initially [7]. Group 2: Instructor Background - The instructor has a strong academic background, having graduated from a prestigious computer science university and worked as an algorithm researcher in various companies [2]. - The instructor's research includes computer vision, efficient model compression algorithms, and multimodal large language models, with a focus on lightweight models and efficient fine-tuning techniques [2][3]. Group 3: Target Audience - The program is suitable for graduate students and professionals in the fields of autonomous driving, AI, and those looking to enhance their algorithmic knowledge and research skills [11]. - It caters to individuals who need to publish papers for academic recognition or those who want to systematically master model compression and multimodal reasoning [11]. Group 4: Course Structure and Requirements - The course is designed to accommodate students with varying levels of foundational knowledge, with adjustments made to the depth of instruction based on participants' backgrounds [14]. - Participants are expected to have a basic understanding of deep learning and machine learning, familiarity with Python and PyTorch, and a willingness to engage actively in the learning process [16][19].
AIs can't count six fingers, and it's not as simple as it sounds.
数字生命卡兹克· 2025-07-10 20:40
Yesterday, right after Grok 4 launched, I was idly scrolling X when I saw a very interesting post from @lepadphone. I assumed this was just a Grok 4 problem, the model not being good enough, miscounting a doctored six-fingered hand as five. I tested it myself, and it did indeed count five. I didn't think much of it, until I casually dropped the image into OpenAI o3 and things started to feel off: o3 also answered five. Frowning, I sent it to o3 pro; after 48 seconds of reasoning, still five. Then I threw the image at Doubao, Kimi, Gemini, and every other multimodal model I could find. Without exception, every single one answered five. The lone partial survivor was Claude 4, which would occasionally answer correctly. A cold sweat set in. One model miscounting could be a hallucination; all of them miscounting means something is wrong at the models' foundations. I tried asking in a group chat late at night and got no reply, so I had to dig on my own. After searching through a pile of material and running a deep dive with DeepResearch, I found a paper that perfectly explains this phenomenon: "Vision Language Models are Biased", published May 29 this year, barely a month old. I spent ...
DreamVLA: the world's first "world knowledge prediction" VLA model, with a manipulation success rate near 80%
具身智能之心· 2025-07-10 13:16
Author: Wenyao Zhang et al. Editor: 具身智能之心

Research background and motivation

In recent years, vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve the generalization and reasoning ability of robotic manipulation. But existing methods are limited to image-based prediction, which carries redundant information and lacks key world knowledge such as dynamics, spatial structure, and semantics, making it hard to form a closed perception-prediction-action loop.

Dynamic region prediction: an optical-flow prediction model identifies dynamic regions in the scene (e.g., moving objects, the robot end-effector), so the model focuses on task-critical motion regions instead of reconstructing redundant frames. Dynamic regions are extracted with CoTracker, and the model is trained to reconstruct only those regions; the objective maximizes an evidence lower bound on the log-likelihood, with the loss: $$\mathcal{L}_{dyn}=\frac{1}{|\mathcal{D}|}\sum_{x_i\in\mathcal{D}}\mathbb{E}_{z\sim Q_{\phi}(z|x_i)} ...$$
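The idea of scoring reconstruction only over tracker-identified dynamic pixels can be sketched as a masked reconstruction loss. This is a minimal numpy sketch under stated assumptions: the dynamic mask is given directly (standing in for CoTracker/optical-flow output), and the loss is plain squared error rather than the paper's variational objective.

```python
import numpy as np

def dynamic_region_loss(pred, target, dyn_mask):
    """Toy sketch: reconstruction loss restricted to dynamic regions.

    pred, target: (H, W, C) float arrays (predicted / ground-truth frames).
    dyn_mask: (H, W) boolean array marking dynamic pixels; here it is given
    directly, standing in for a CoTracker / optical-flow tracker. Static
    pixels contribute nothing, so the model is not penalized for ignoring
    redundant background.
    """
    diff = (pred - target) ** 2                 # per-pixel squared error
    masked = diff * dyn_mask[..., None]         # zero out static regions
    n = max(dyn_mask.sum(), 1)                  # avoid division by zero
    return masked.sum() / (n * pred.shape[-1])  # mean over dynamic pixels

# Example: only the 2x2 "moving" patch is scored.
target = np.zeros((4, 4, 3))
pred = np.zeros((4, 4, 3))
pred[:2, :2] = 1.0                              # error inside dynamic region
pred[2:, 2:] = 1.0                              # error outside it (ignored)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(dynamic_region_loss(pred, target, mask))  # → 1.0
```

The masking is what distinguishes this from whole-frame reconstruction: the identical error placed in the static corner adds nothing to the loss.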
CEED-VLA: 4x inference speedup for VLA models via consistency distillation and early-exit decoding!
具身智能之心· 2025-07-10 13:16
Author: Wenxuan Song et al. Editor: 具身智能之心

Figure 1: Speedup comparison across different decoding methods

Jacobi Trajectory Collection Method

In recent years, vision-language models (VLMs) have inspired the development of vision-language-action (VLA) models, which generate executable actions directly from visual and language inputs. Although these models generalize well across tasks, their slow inference limits their use in high-frequency dexterous tasks. Researchers have tried Jacobi decoding to accelerate inference, but because the model only ever saw correct prefixes during training, parallel prediction works poorly. This paper therefore proposes a general acceleration method, CEED-VLA, which significantly speeds up inference while preserving manipulation performance. Specifically, (1) we design a broadly applicable inference-acceleration scheme, CEED-VLA, achieving significant speedups across multiple tasks; (2) we introduce a consistency distillation mechanism and add mixed-label supervision to the autoregressive loss, enabling the student model to learn from different ...
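Jacobi decoding, the starting point that CEED-VLA accelerates, can be illustrated on a toy deterministic "model": guess the whole continuation, refine every position in parallel from the previous iterate, and exit early once an iteration changes nothing (a fixed point, which matches sequential greedy decoding). The `step` function below is a hypothetical stand-in; a real VLA would run one batched forward pass per iteration.

```python
def jacobi_decode(step, prompt, length, max_iters=50):
    """Toy sketch of Jacobi (fixed-point) decoding with early exit.

    step(seq, i) deterministically returns token i+1 given the prefix
    seq[:i+1], a stand-in for a greedy autoregressive model. Instead of
    generating one token per pass, we guess the whole continuation and
    refine all positions in parallel; once an iteration is a fixed point,
    the result equals sequential decoding and we exit early.
    """
    seq = prompt + [0] * length          # initial guess for the continuation
    for it in range(max_iters):
        new = list(seq)
        for i in range(len(prompt) - 1, len(seq) - 1):
            new[i + 1] = step(seq, i)    # every position updated from old seq
        if new == seq:                   # early exit: fixed point reached
            return seq[len(prompt):], it
        seq = new
    return seq[len(prompt):], max_iters

# Toy "model": next token is previous token + 1, so information propagates
# one position per iteration (the worst case for a causal model).
out, iters = jacobi_decode(lambda seq, i: seq[i] + 1, [0], 4)
print(out, iters)  # converges to [1, 2, 3, 4]
```

With a model whose predictions depend only weakly on the immediately preceding guess, many positions stabilize per iteration, which is where the speedup over one-token-per-pass decoding comes from.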
Meet in Beijing on July 19 to discuss the breakout research of ACL 2025
机器之心· 2025-07-10 08:35
Core Insights - The AI field continues to be an exciting area in 2025, with numerous research releases from major tech companies and institutions [1] - The rapid pace of technological advancements in AI is overwhelming, with new models and paradigms emerging almost weekly [3][4] - Developers and researchers are increasingly engaging in conferences and academic sharing to stay updated on cutting-edge research [5] Event Overview - The ACL conference, a significant event in the NLP field, received over 8,000 submissions this year, marking a historical high [6] - The ACL 2025 conference will take place from July 27 to August 1 in Vienna, Austria, featuring various activities such as keynote speeches, paper presentations, roundtable discussions, and poster sessions [6][7] - The event aims to provide a platform for domestic AI talent, with a full schedule of presentations and discussions announced [6] Keynote Speakers and Topics - The keynote address on "Trends and Outlook for ACL 2025" will be delivered by Che Wanxiang, a prominent professor from Harbin Institute of Technology [9][17] - Liu Pengfei from Shanghai Jiao Tong University will present on "Reinforcement Learning and Complex Reasoning in Large Models" [11][19] Paper Presentations - Various papers will be presented, covering topics such as the intrinsic self-correction of large language models and the acceleration of inference in large language models [9][12] - The event will also feature poster sessions and opportunities for industry engagement [21]
Should book editors switch careers while they still can?
Hu Xiu· 2025-07-10 07:47
If you slipped into a moment of reflection and began recalling legends of the industry's golden age, then every word of this article is written for you. Because we must admit an uncomfortable reality: what we thought of as a career may be quietly declining into an obsolete craft. Worse, it may no longer even qualify as a "craft"; it is becoming a historical term. I don't want to paper things over with clichés like "the industry's winter". Winter implies spring will eventually come; that is a cycle. What we are going through is no cycle, but an unprecedented ecological turnover, a paradigm revolution without gunsmoke that could overturn our entire industry. Yes, generative AI is striking the ancient business of publishing with unprecedented force.

The invisible library and the last readers

Does any book editor still get genuinely excited about an unpublished new book? Not the professional thrill of sensing a potential bestseller, but a pure, soul-deep exhilaration: the conviction that what you hold is a great idea about to be born into the world, or a timely remedy that can be injected precisely into society's ailments. I suspect very few do. Let's start from the other end of the story, with the increasingly blurry group we call "readers". Search "writing a thesis" on Xiaohongshu and you'll find users sharing the secrets of thesis writing in droves, the most-mentioned being how to use AI. A dozen or so years ago, when we were students, a senior in their final year really would rush into the library, spend a whole day combing the shelves, and borrow five ...
Musk's xAI releases Grok 4: training compute up 100x, leading the runner-up by a factor of two on multiple benchmarks
Feng Huang Wang· 2025-07-10 06:20
Core Insights - xAI has launched its latest large language model, Grok 4, which shows significant performance improvements over its predecessor, Grok 3, with a 100-fold increase in training computational power [1] - Grok 4 achieved a 25% problem-solving rate on the "Humanity's Last Exam" benchmark, while the multi-agent version, Grok 4 Heavy, exceeded 50% [1] - The company is focusing on enhancing multi-modal understanding capabilities and has released an API for Grok 4, supporting a context length of 256K [2] Model Performance - Grok 4 demonstrates superior reasoning capabilities in standardized tests, including GPQA and AIME, and achieved a perfect score on the LiveCodeBench test [2] - The model integrates tool usage directly into its training process, improving reliability in complex task handling [2] Commercialization Efforts - xAI has introduced a subscription service, Super Grok Heavy, allowing users to access both Grok 4 and Grok 4 Heavy [3] - The company plans to develop a dedicated programming model and initiate video generation model training using over 100,000 H200 GPUs in the coming weeks [3] - The release of Grok 4 marks a significant breakthrough in the competitive landscape of large language models, particularly in reasoning and multi-agent collaboration [3]