视觉

Search documents
AI们数不清六根手指,这事没那么简单
Hu Xiu· 2025-07-11 02:54
昨天Grok4发布完以后,我随手刷了一下X。 然后看到了一个非常有趣的帖子,来自@lepadphone。 我以为,这就是Grok4的问题,模型能力不太行,把一个恶搞的6根手指,数成了5根。 我自己也去测了一下,确实数是5根。 我本来没当回事。 直到我随手把它扔到了OpenAI o3里,发现事情开始不对了起来。因为,o3回复的也是5根手指。 我瞬间皱了眉头,然后扔给了o3 pro。在推理了48秒之后,还是5根。 然后我又把这张图扔给了豆包、kimi、Gemini等几乎所有有多模态的模型。 无一例外,所有的模型,给我的回复都是5根。唯独有一个活口,Claude 4,偶尔会回答正确。 我瞬间一股子冷汗就下来了。一个模型数错了,可能是幻觉,所有的模型都数错,那模型底层肯定有一些问题。 我深夜在群里试图问了一下,结果石沉大海。 那就只能靠自己了,再搜了一堆资料,用DeepReaserch做了深度搜索以后,我找到了一篇能完美解答这个现象的论文:《Vision Language Models are Biased》(视觉语言模型存在偏见)。 这篇论文发表于今年5月29号,至今也才1个多月的时间,还蛮新的。 我花了一些时间, ...
自驾搞科研别蛮干!用对套路弯道超车~
自动驾驶之心· 2025-07-11 01:14
读研想少走弯路、快速出成果?靠自己瞎摸索费时间费精力还没结果,找个厉害的榜样"抄作业",才是最 直接的办法。 导师介绍 毕业于知名计算机名校。曾在多家公司担任算法研究员,并进行计算机视觉,高效模型压缩算法,多模态 大语言模型的研究,包括模型量化,剪枝,蒸馏,编译以及高效稀疏化训练与推理。 博士期间研究方向聚焦为计算机视觉,高效的深度学习训练和推理方法,大语言模型轻量化与高效微调技 术。 这套路看着"功利",但真能让你在科研路上跑快点,别人还在绕小道,你已经上了高速。 厉害的榜样通常 来说,就是那些论文专利一大堆的导师学长学姐,但苦于和这些榜样搭不上话, 现在如何让入场甩开同 行,别人摸路你超车? 自动驾驶之心联合业内知名LLM/MLLM方向学者推出了1v6指导小班课。从模型理论到代码实践, 业内大 牛手把手带走科研全流程,帮助大家形成自己的知识体系, 掌握LLM/MLLM论文的算法设计及创新思路。 扫码免费咨询 【科研成果】 在国际顶级会议CVPR,ICCV, EMNLP等发表十余篇论文, 并担任CVPR,ICCV,ECCV,ICML,ICLR, NeurIPS 等重要会议和期刊的审稿人。多项发明专利,已经指 ...
AI们数不清六根手指,这事没那么简单。
数字生命卡兹克· 2025-07-10 20:40
昨天Grok4发布完以后,我随手刷了一下X。 然后看到了一个非常有趣的帖子,来自@lepadphone。 我以为,这就是Grok4的问题,模型能力不太行,把一个恶搞的6根手指,数成了5根。 我自己也去测了一下,确实数是5根。 我本来没当回事。 直到,我随手扔到了OpenAI o3里,发现,事情开始不对了起来。因为,o3回复,也是5根手指。 我瞬间皱了眉头,然后扔给了o3 pro。 在推理了48秒之后,还是5根。 然后我又把这张图扔给了豆包、kimi、Gemini等等所有的有多模态的模型。 而无一例外,所有的模型,给我回复的,都是5根。 唯独有一个活口,Claude 4,偶尔会回答正确。 瞬间一股子冷汗就下来了。 一个模型数错了,可能是幻觉,所有的模型都数错,那,模型的底层肯定有一些问题。 深夜在群里试图问了一下,结果石沉大海。 那就只能靠自己了,再搜了一堆资料,用DeepReaserch做了深度搜索以后,我找到了一篇能完美解答这个现象的论文。 《Vision Language Models are Biased》(视觉语言模型存在偏见) 这篇论文发表于今年5月29号,至今也才1个多月的时间,还蛮新的。 我花了 ...
DreamVLA:全球首个“世界知识预测”VLA模型,操作成功率近八成
具身智能之心· 2025-07-10 13:16
点击下方 卡片 ,关注" 具身智能 之心 "公众号 作者丨 Wenyao Zhang等 编辑丨具身智能之心 本文只做学术分享,如有侵权,联系删文 >> 点击进入→ 具身智能之心 技术交流群 更多干货,欢迎加入国内首个具身智能全栈学习社区 : 具身智能之心知识星球 (戳我) , 这里包含所有你想要 的。 研究背景与动机 近年来,视觉-语言-动作(VLA)模型在整合图像生成与动作预测以提升机器人操作的泛化性和推理能力 方面展现出潜力。但现有方法受限于基于图像的预测,存在信息冗余,且缺乏动态、空间和语义等关键世 界知识,难以形成闭环的感知-预测-动作循环。 动态区域预测 :利用光流预测模型识别场景中动态区域(如运动物体、机器人末端执行器),让模型 专注于任务关键的运动区域,避免冗余帧重建。通过CoTracker提取动态区域,训练模型仅重建这些区 域,优化目标为最大化对数似然的证据下界,损失函数为: $${\mathcal{L}}_{d y n}={\frac{1}{|{\mathcal{D}}|}}\sum_{x_{i}\in{\mathcal{D}}}\mathbb{E}_{z\sim Q_{\phi}(z|x_ ...
端到端VLA这薪资,让我心动了。。。
自动驾驶之心· 2025-07-10 12:40
点击下方 卡片 ,关注" 自动驾驶之心 "公众号 戳我-> 领取 自动驾驶近15个 方向 学习 路线 端到端自动驾驶 - 下一代智能驾驶量产核心算法 端到端自动驾驶(End-to-End Autonomous Driving)作为目前智驾量产的核心算法,可以分为一段式端到端、二段式端到端两个大的技术方向。自UniAD获得 CVPR Best Paper以来,正式拉开了国内新一轮的智驾军备竞赛。 2024年理想汽车更是宣布E2E+VLM的双系统架构量产! 端到端自动驾驶通过传感器数据输入 (视觉/Lidar等)直接输出自车规划或控制信息,是目前智能驾驶最具代表性的方向。 目前VLM/VLA也是招聘的刚需,3-5年就能冲击百万年薪! 而随着学术界和工业界的目光投向端到端这个技术领域,我们发现了很多问题。UniAD是端到端的最终解吗?显然不是!一系列算法如雨后春笋般冒出: 技术栈多?入门困难? 去年我们推出了《首个面向工业级的端到端算法与实战教程》,今年很多小伙伴反馈技术发展太快了,先前的技术方案已经不适合当下的大环境。端到端目前发 展出多个领域技术的方向,需要掌握多模态大模型、BEV感知、强化学习、视觉Trans ...
可灵AI推出可图2.1模型 多维能力跃升、会员限时7天免费
Cai Fu Zai Xian· 2025-07-10 09:24
Core Insights - The launch of the Ketu 2.1 model by Keling AI significantly enhances image generation capabilities, including improved instruction adherence, stunning portrait aesthetics, and cinematic quality [1][11] - The model is available for free to all member users for a limited time, allowing creators to explore its features [11] Image Generation Capabilities - Ketu 2.1 excels in following complex instructions, accurately capturing multiple elements and details in prompts, resulting in high-quality images that showcase creative imagination [1][3] - The model demonstrates a notable improvement in image quality, including clarity, richness of elements, and realism, particularly in portrait aesthetics [3][5] Artistic and Cinematic Quality - The model can generate images with a cinematic feel, effectively recreating scenes with unique aesthetic tones and advanced composition [6] - It supports over 180 different styles, allowing creators to choose from various artistic expressions, from vintage photography to futuristic digital art [10] Text Generation Features - Ketu 2.1 also enhances text generation, producing clear and design-oriented text in both Chinese and English, facilitating smoother integration of text and images for marketing and creative projects [8] User Engagement and Growth - Keling AI has achieved significant user engagement, with a total of 344 million images and 168 million videos generated since its launch, showcasing its strength in the image generation sector [11]
假期护眼正当时!“视力存款”保管好
Xin Hua Wang· 2025-07-10 08:13
Core Viewpoint - The article emphasizes the importance of eye care for children during the summer vacation, highlighting the risks of increased screen time and irregular routines that can lead to vision deterioration. It advocates for regular eye check-ups and preventive measures to maintain children's vision health [1][4]. Group 1: Vision Health Statistics - In 2022, the overall myopia rate among children and adolescents in China was reported at 51.9%. Despite a downward trend in myopia rates, early onset and high prevalence remain significant challenges [1]. - The concept of "vision savings" is introduced, where the reserve of hyperopia (farsightedness) serves as a buffer against developing myopia [2]. Group 2: Importance of Regular Eye Check-ups - Regular professional eye examinations are crucial for establishing visual health records, which include key indicators such as hyperopia reserve, myopia degree, and axial length [4]. - A low hyperopia reserve relative to age norms indicates an increased risk of myopia, even if the child's vision appears normal [4]. Group 3: Preventive Measures for Eye Health - Four key strategies are recommended for parents to help children maintain clear vision and slow the progression of myopia: 1. Balance study and rest, maintain proper reading posture, and improve lighting conditions. 2. Follow the "20-20-20" rule to reduce eye strain. 3. Engage in outdoor activities for at least 2 hours daily, totaling 14 hours weekly, as outdoor light exposure is critical for myopia prevention. 4. Ensure a balanced diet and adequate sleep, which are essential for children's growth and eye health [6]. Group 4: Standards for Lighting Conditions - The National Health Commission's "Myopia Prevention and Control Guidelines (2024 Edition)" specifies that reading and writing should occur in well-lit environments, with a minimum average illuminance of 300 lux [8]. Group 5: Caution Against Misleading Products - There is a growing concern regarding products claiming to reverse myopia, which are often misleading. True myopia is currently considered irreversible by experts [9]. - Since 2021, the State Administration for Market Regulation has been addressing misleading advertising related to myopia prevention products, emphasizing that myopia cannot be cured under current medical conditions [9].
VLA统一架构新突破:自回归世界模型引领具身智能
机器之心· 2025-07-10 04:26
然而,现有方法多以语言模态为中心,往往忽视了视觉信息蕴含的丰富时序动态与因果结构。 本文来自:王宇琪,中国科学院自动化所博士,研究方向为世界模型,自动驾驶感知与决策等,在 CVPR、NeurIPS、ICCV、 ECCV、ICLR 等顶级会议上发表过多篇论文。 王鑫龙团队,北京智源研究院,研究方向为原生多模态大模型,Emu 系列工作核心负责人。 张兆翔团队,中国科学院自动化研究所,研究方向涵盖世界模型、视觉生成与重建、自动驾驶、具身智能等。 从 Sora 到 Genie2,从语言驱动的视频生成到世界的交互模拟,世界模型正加速成为连接感知、理解与决策的关键基座。随着视觉 - 语 言 - 动作(VLA)模型在具身智能领域的快速发展,多模态之间的边界正被重塑。 论文标题: Unified Vision-Language-Action Model 网站链接: https://robertwyq.github.io/univla.github.io/ 论文链接: https://arxiv.org/abs/2506.19850 代码链接: https://github.com/baaivision/UniVLA 为此,北 ...
腾讯研究院AI速递 20250709
腾讯研究院· 2025-07-08 15:50
Group 1 - Ruoming Pang, head of Apple's foundational model team, is reported to join Meta's new AI team with an annual compensation in the tens of millions [1] - Pang's departure may be influenced by internal discussions at Apple regarding the introduction of third-party models like OpenAI, leading to team morale issues [1] - Apple's AI team structure will be reorganized under Zhifeng Chen, transitioning to a multi-layer management structure [1] Group 2 - Microsoft has launched Deep Research, a public preview version that utilizes the o3 model and Bing search to create an advanced AI research tool [2] - This AI can automatically deconstruct complex problems, gather the latest authoritative information from the web, and generate auditable research reports [2] - An API interface has been opened for integration into applications, supporting enterprise-level AI platforms across various fields such as research, finance, and healthcare [2] Group 3 - Alibaba has open-sourced the multi-modal reasoning model HumanOmniV2, capable of accurately capturing hidden information in videos and understanding "subtext" [3] - The model incorporates a forced context summarization mechanism, a multi-dimensional reward system driven by large models, and optimization training methods based on GRPO [3] - Alibaba has introduced the IntentBench evaluation benchmark, with HumanOmniV2 achieving an accuracy rate of 69.33%, excelling in understanding complex human intentions [3] Group 4 - PaddleOCR 3.1 has been released, with Wenxin 4.5 enhancing the accuracy of text recognition in 37 languages by over 30%, supporting high-quality automatic data labeling [4] - A new production line, PP-DocTranslation, has been added, combining PP-StructureV3 and Wenxin 4.5 to support translation of Markdown, PDF, and image documents, along with customization of professional terminology [4] Group 5 - A controversy has emerged involving hidden instructions in academic papers aimed at inducing AI to give high scores, with several top universities implicated [6] - Xie Saining, a co-author of one such paper, acknowledged responsibility and apologized, clarifying that he does not endorse such practices [6] - This incident has sparked discussions on academic ethics in the AI era, highlighting the lack of unified standards in AI review processes and the need for reform [6] Group 6 - The Visual Language Action model (VLA) is becoming a core technology for embodied intelligence by 2025, with rapid iterations from Google's RT-2 breakthrough [7] - China's Zhihui Square has partnered with top universities to launch FiS-VLA, innovatively embedding "fast systems" into "slow systems" to address the trade-off between robotic control efficiency and reasoning capability [7] - FiS-VLA has achieved an 8% success rate improvement in simulation tasks and an 11% improvement in real environments, with a control frequency of 21.9Hz, 1.6 times that of the open-source model π0 [7] Group 7 - YouTube co-founder Chen Shijun discussed AI entrepreneurship and long-termism with the Manus team, emphasizing the value of rapid experimentation and risk-taking [8] - Recommendations for AI startups include leveraging first-mover advantages to retain users, creating compound network effects, and exploring areas that larger companies avoid, all within legal boundaries [8] - Key decisions at YouTube included prioritizing user growth over immediate monetization, establishing transparent core metrics, and developing a creator-friendly advertising model while focusing on the "passive experience" of recommendation systems [8] Group 8 - The key shift in acquiring users for AI products is that if a product does not generate social engagement within the first 48 hours, it may fail, making virality a survival threshold rather than a bonus [9] - The success story of selling Base44 for $80 million involved user participation in the development process, encouraging sharing of creations, and strategically choosing LinkedIn as a platform for dissemination, creating a closed loop of development, showcasing, and sharing [9] - The distribution paradigm for AI startups is evolving, with product development becoming a public showcase, niche native creators proving more effective than influencers, and growth metrics becoming assets for dissemination, shifting from "closed-door development" to "public collaboration" [9] Group 9 - U.S. universities are reshaping computer science education, with the CS major potentially becoming more humanities-oriented, emphasizing computational thinking and AI literacy over traditional programming skills [10] - The "Level Up AI" initiative has launched an 18-month curriculum overhaul, where future programming languages may involve "Human," allowing students to complete programming tasks through interaction with AI [10] - Traditional humanities classrooms are facing assessment crises, with educators struggling to identify AI-generated content, leading to a return to handwritten assignments and the development of anti-cheating systems, raising concerns about students' over-reliance on AI affecting their cognitive abilities [10]
具身智能论文速递 | 强化学习、VLA、VLN、世界模型等~
具身智能之心· 2025-07-08 12:54
算法框架: 点击下方 卡片 ,关注" 具身智能 之心 "公众号 强化学习如何提升VLA泛化能力 清华大学、上海期智研究院、北京中关村科学院通过强化学习微调(PPO算法)显著提升视觉-语言-动作模 型(VLA)的泛化能力: 1)执行任务成功率提升42.6%(OOD场景) 2)语义理解任务成功率从61.5%提升至75.0%(未见物体) 3)动态干扰场景成功率从28.6%跃升至74.5%(Tab 3) 主要贡献: 论文标题:What Can RL Bring to VLA Generalization? An Empirical Study 论文链接:https://arxiv.org/pdf/2505.19789 1. 构建了一个严谨且具有挑战性的基准,用于评估 VLA 微调方法在视觉、语义和执行等不同维度上对泛 化能力的影响。 2. 确定 PPO 是优于 GRPO 和 DPO 的 VLA 微调 RL 算法,并讨论了将这些 RL 算法从 LLM/VLM 范式适 配到 VLA 独特需求时的关键挑战。 3. 开发了一种高效的基于 PPO 的 VLA 微调方案,该方案借助共享的 actor-critic 骨干网络、VL ...