机器之心
Just now: the Qianwen App packs Google's and OpenAI's "paid signature features" into your phone, and for free?
机器之心· 2025-12-02 05:07
Core Insights
- The article discusses the significant updates to the Qianwen App, which integrates two advanced visual models, Qwen-Image and Wan 2.5, making them accessible to ordinary users without technical expertise [1][4][36]

Group 1: Qwen-Image Model
- Qwen-Image is recognized for its strong visual logic understanding, allowing it to accurately interpret complex spatial relationships and geometric structures, outperforming many existing models [8][9][65]
- The model excels in maintaining identity consistency during image editing, which is crucial for users seeking reliable results in complex scenarios [18][32]
- Qwen-Image has shown impressive performance in multi-image fusion tasks, allowing for seamless integration of different visual elements while preserving their unique characteristics [29][32]

Group 2: Wan 2.5 Model
- Wan 2.5 represents a breakthrough in AI video generation, enabling native audio-visual synchronization, which enhances the user experience by eliminating the need for separate audio processing [34][68]
- The model can generate videos that include original music and dialogue, showcasing its ability to understand and integrate multiple modalities [43][70]
- Wan 2.5's architecture allows it to process text, images, video, and audio signals simultaneously, facilitating complex creative tasks that were previously challenging [68][70]

Group 3: User Accessibility and Integration
- The integration of these models into the Qianwen App eliminates barriers for users, allowing them to create high-quality visual and audio content without needing coding skills or expensive hardware [4][75]
- The app serves as a comprehensive platform for multi-modal generation, enabling users to transition smoothly from image creation to video production within a single interface [45][47]
- This development reflects Alibaba's long-term investment in building a robust ecosystem of multi-modal generative models, positioning it as a leader in the AI creative tools market [72][74]
Just now: the identity of the mysterious leaderboard-topping video model is revealed, and it turns out to be "David"
机器之心· 2025-12-02 00:17
Core Insights
- Runway's Gen-4.5 has emerged as the leading state-of-the-art (SOTA) video generation model, setting new industry standards in motion quality, prompt adherence, and visual realism [1][3][8]

Model Performance
- Gen-4.5 has achieved an Elo score of 1247, surpassing competitors like Veo 3/3.1, Kling 2.5, and Sora 2 Pro, showcasing unprecedented visual realism and creative control capabilities [3][6][8]
- The model maintains speed and efficiency while delivering significant quality improvements, making advanced video generation accessible to creators of various scales [8][20]

Key Features
- **Precise Prompt Adherence**: Gen-4.5 demonstrates exceptional physical accuracy and visual detail, accurately portraying object motion, fluid dynamics, and intricate surface details [11][12]
- **Expressive Characters**: The model can depict nuanced emotions and lifelike facial details, enhancing character representation [14]
- **Stylized Control and Visual Consistency**: It supports a wide range of aesthetic styles, from photorealism to stylized animation, while maintaining a coherent visual language [16][18]

Deployment and Limitations
- Gen-4.5 is built on NVIDIA architecture, optimizing training efficiency and inference speed through collaboration with NVIDIA [20]
- Despite its advancements, Gen-4.5 exhibits common limitations found in video generation models, such as causal reasoning issues and object permanence challenges [21][22]
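The leaderboard numbers above are Elo-style ratings. As background, and as an assumption, since the article does not say which Elo variant the benchmark uses, the standard Elo formula converts a rating gap into an expected pairwise win rate:

```latex
% Standard Elo expected score; the 400-point scale is the conventional
% choice and is assumed here, not stated in the article.
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
% Worked example with the reported R_A = 1247 and a hypothetical
% competitor at R_B = 1147:
E_A = \frac{1}{1 + 10^{-100/400}} \approx 0.64
```

Under that reading, a 100-point lead corresponds to roughly a 64% preference rate in head-to-head comparisons.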
NVIDIA unveils a reasoning VLA: Alpamayo-R1 makes autonomous driving AI better at thinking
机器之心· 2025-12-02 00:17
Group 1
- The core challenge in autonomous driving is not just perception but understanding the reasoning behind actions taken by the model [1]
- Traditional end-to-end systems struggle with rare but critical scenarios, leading to potential accidents [1][2]
- NVIDIA's Alpamayo-R1 introduces a reasoning capability that allows vehicles to infer causal relationships before making decisions [1][6]

Group 2
- Alpamayo-R1 features a new dataset called Chain of Causation (CoC), which includes not only actions taken but also the reasons for those actions [2][3]
- The model employs a diffusion-based trajectory decoder to generate feasible driving trajectories under real-time constraints [5]
- A multi-stage training strategy is utilized, starting with basic mapping from vision to action, followed by supervised fine-tuning on CoC data, and concluding with reinforcement learning for optimization [6][15]

Group 3
- The performance of Alpamayo-R1 shows significant improvements, particularly in long-tail scenarios where traditional models often fail [6][20]
- The model's input consists of multi-camera and temporal observations, allowing for integrated multi-modal semantic understanding [8]
- The CoC dataset employs a human-machine collaborative annotation mechanism, resulting in improved planning accuracy and reduced error rates [10][11]

Group 4
- The training process of Alpamayo-R1 is divided into three phases: supervised fine-tuning, CoC supervision, and reinforcement learning-based post-training optimization [15][17]
- The model incorporates a multi-dimensional reward mechanism to enhance reasoning accuracy and action consistency [17]
- The design of AR1 represents a shift from "black box" to "white box" in autonomous driving, enabling the model to explain its decisions [19][20]

Group 5
- The significance of Alpamayo-R1 lies not only in performance enhancement but also in establishing a closed loop between AI reasoning and physical actions [20][21]
- The model aims to ensure safety and build trust in autonomous driving by providing explanations for its decisions [21]
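As a rough illustration of the reward design mentioned in Group 4, the sketch below combines a reasoning-accuracy term with an action-consistency term into one scalar reward, as an RL post-training phase might use. All function names, weights, and scoring rules here are illustrative assumptions, not the actual Alpamayo-R1 implementation.

```python
# Hypothetical sketch of a multi-dimensional RL reward in the spirit of the
# description above (reasoning accuracy + action consistency). All names,
# weights, and scoring functions are illustrative assumptions.

def reasoning_accuracy(pred_rationale: str, ref_rationale: str) -> float:
    """Toy proxy: fraction of reference words that appear in the predicted rationale."""
    ref_terms = set(ref_rationale.lower().split())
    pred_terms = set(pred_rationale.lower().split())
    return len(ref_terms & pred_terms) / max(len(ref_terms), 1)

def action_consistency(pred_traj, ref_traj) -> float:
    """Toy proxy: mean L2 error between predicted and reference waypoints,
    squashed into (0, 1] so lower error means higher reward."""
    err = sum(((px - rx) ** 2 + (py - ry) ** 2) ** 0.5
              for (px, py), (rx, ry) in zip(pred_traj, ref_traj)) / max(len(ref_traj), 1)
    return 1.0 / (1.0 + err)

def combined_reward(pred_rationale, ref_rationale, pred_traj, ref_traj,
                    w_reason=0.5, w_action=0.5) -> float:
    """Weighted sum of the two terms, one plausible way to couple reasoning and action."""
    return (w_reason * reasoning_accuracy(pred_rationale, ref_rationale)
            + w_action * action_consistency(pred_traj, ref_traj))

# Usage with made-up data:
r = combined_reward(
    "slow down because a pedestrian is crossing ahead",
    "slow down: pedestrian crossing ahead",
    pred_traj=[(0.0, 0.0), (1.0, 0.1)],
    ref_traj=[(0.0, 0.0), (1.0, 0.0)],
)
print(f"combined reward = {r:.3f}")
```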
AAAI 2026 | The first encrypted fingerprinting / watermarking scheme for large models to resist end-to-end attacks
机器之心· 2025-12-01 09:30
Core Insights
- The article discusses the development of iSeal, an encrypted fingerprinting solution designed to protect the intellectual property of large language models (LLMs) against advanced attacks [2][3][5]

Research Background
- The training of large language models often incurs costs in the millions of dollars, making the model weights valuable intellectual property. Researchers typically use model fingerprinting techniques to assert ownership by embedding triggers that produce characteristic responses [6][7]
- Existing fingerprinting methods assume that the verifier faces a black-box API, which is unrealistic, as advanced attackers can directly steal model weights and deploy them locally, gaining end-to-end control [7][10]

iSeal Overview
- iSeal is the first encrypted fingerprinting scheme designed for end-to-end model theft scenarios. It introduces encryption mechanisms to resist collusion-based unlearning and response manipulation attacks, achieving a 100% verification success rate across 12 mainstream LLMs [3][12]

Methodology and Innovations
- iSeal's framework transforms the fingerprint verification process into a secure encrypted interaction protocol, focusing on three main aspects:
  - **Encrypted Fingerprinting and External Encoder**: iSeal employs an encrypted fingerprint embedding mechanism and an external encoder to decouple fingerprints from model weights, preventing attackers from reverse-engineering the fingerprints [15]
  - **Confusion & Diffusion Mechanism**: This mechanism binds fingerprint features to the model's core reasoning capabilities, making them inseparable and resilient against attempts to erase specific fingerprints [15]
  - **Similarity-based Dynamic Verification**: iSeal uses a similarity-based verification strategy and error correction mechanisms to identify fingerprint signals even when attackers manipulate outputs through paraphrasing or synonym replacement [15][18]

Experimental Results
- In experiments involving models like LLaMA and OPT, iSeal maintained a 100% verification success rate even under advanced attacks, while traditional fingerprinting methods failed after minor fine-tuning [17][18]
- The results demonstrated that iSeal's design effectively prevents attackers from compromising the entire verification structure by attempting to erase parts of the fingerprint [17][21]

Ablation Studies
- Ablation studies confirmed the necessity of iSeal's key components, showing that without freezing the encoder or using a learned encoder, the verification success rate dropped to near zero [20][21]
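To make the similarity-based verification idea concrete, here is a minimal sketch of comparing a suspect model's responses to fingerprint triggers against the expected responses via embedding similarity and a voting threshold. The embedding, threshold, and voting rule are illustrative assumptions about the general approach, not iSeal's actual encoder or protocol.

```python
# Minimal sketch of similarity-based fingerprint verification with a simple
# voting rule. The embedding function, threshold, and voting rule are
# illustrative assumptions, not the actual iSeal design.
import math

def embed(text: str) -> dict:
    """Toy bag-of-words embedding; a real scheme would use a learned encoder."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def verify(suspect_responses, expected_responses, threshold=0.7, min_hits=0.8):
    """Declare ownership if enough trigger responses stay close to the expected
    fingerprint responses, tolerating paraphrase or synonym swaps on the rest."""
    hits = sum(cosine(embed(s), embed(e)) >= threshold
               for s, e in zip(suspect_responses, expected_responses))
    return hits / max(len(expected_responses), 1) >= min_hits

# Usage with made-up trigger responses:
expected = ["the secret marker phrase alpha", "fingerprint response beta"]
suspect = ["the secret marker phrase alpha", "a fingerprint response beta indeed"]
print(verify(suspect, expected))  # True under these toy settings
```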
At 13 he founded a startup on "vibe coding", met Sam Altman, and visited a16z; his summer put grown-ups to shame
机器之心· 2025-12-01 09:30
This summer, 13-year-old Michael Goldstein from Toronto lived a more "Silicon Valley" life than many adults. First came a month of summer camp, then a flight to San Francisco: attending a tech startup conference, visiting OpenAI headquarters, sitting down with Sam Altman, pitching ideas at the a16z office...

Goldstein is a typical specimen of this generation of teenage tech devotees. He talks in Silicon Valley jargon, has startup problems on his mind, and is keen on running social media accounts and chasing engagement.

He is not an isolated case. A growing number of Silicon Valley founders say that in recent months their DMs and inboxes have been stuffed with unsolicited messages from teenagers asking, "Can I pitch you something?"

Young people today are much smarter than before, because they have ChatGPT. Industry insiders have all but reached a consensus: AI is the new craze for kids, the way Snapchat and TikTok once were. We adapt to new things very quickly.

Reported by the 机器之心 editorial team

With ChatGPT, 13 is just the age to go for it......

[Image caption: Michael Goldstein in daily life, from Michael Goldstein's X account]

Goldstein says he can only half write code, but that has not stopped him from founding an AI startup. He learned Cursor from YouTube ...
NeurIPS 2025 | DePass: Unified feature attribution via single forward-pass decomposition
机器之心· 2025-12-01 04:08
Co-first authors: Hong Xiangyu, a senior undergraduate in the Department of Electronic Engineering at Tsinghua University, recipient of the Tsinghua Jiang Nanxiang Scholarship among other awards, with papers at top conferences including NeurIPS, EMNLP, and NAACL; and Jiang Che, a third-year PhD student in the same department, whose main research directions are LLM interpretability and LLM agents, with papers at NeurIPS, ICML, EMNLP, and NAACL.

As large language models demonstrate outstanding generation and reasoning abilities across tasks, precisely tracing model outputs back to their internal computation has become an important direction in AI interpretability research. However, existing methods are often computationally expensive and struggle to reveal information flow in intermediate layers; meanwhile, attribution at different granularities (such as tokens, model components, or representation subspaces) usually relies on separate, purpose-built methods and lacks a unified, efficient analysis framework.

To address this, a research team from Tsinghua University and Shanghai AI Lab proposes a new unified feature attribution framework, DePass (Decomposed Forward Pass). The method decomposes every hidden state in the forward pass into multiple additive sub-states and propagates them layer by layer with attention weights and MLP activations held fixed, achieving lossless decomposition and precise attribution of the information flow inside the Transformer. With DePass, researchers can attribute at the level of input tokens, ...
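The mechanism just described, splitting each hidden state into additive sub-states and propagating them with attention weights and MLP activations held fixed, can be checked numerically on a toy example. The sketch below only demonstrates the underlying linearity argument for a single fixed-weight attention mixing step; it is an illustration of the principle, not the DePass codebase.

```python
# Toy check: with attention weights frozen, the mixing step acts linearly on
# hidden states, so propagating additive sub-states separately and summing
# them reproduces the full forward pass exactly. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_parts = 4, 8, 3

# Hidden states decomposed into additive sub-states: H = sum_k H_parts[k]
H_parts = rng.normal(size=(n_parts, seq_len, d_model))
H = H_parts.sum(axis=0)

# A fixed (frozen) attention weight matrix and a value projection.
A = rng.random(size=(seq_len, seq_len))
A = A / A.sum(axis=-1, keepdims=True)              # rows sum to 1
W_v = rng.normal(size=(d_model, d_model))

def attend(h):
    """Mix token representations with the frozen attention weights A."""
    return A @ (h @ W_v)

out_full = attend(H)                                # ordinary forward pass
out_decomposed = sum(attend(p) for p in H_parts)    # propagate sub-states separately

# Lossless: the per-part outputs sum back to the full output.
print(np.allclose(out_full, out_decomposed))        # True
```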
Quark x Qwen: who knew an AI browser could do this?
机器之心· 2025-12-01 04:06
Core Viewpoint
- The article discusses the rapid growth of the global AI browser market, projected to reach approximately $4.5 billion in 2024 and $76.8 billion by 2034, with a compound annual growth rate of 32.8% [1][3]

Group 1: Market Dynamics
- The global browser market is undergoing a transition from the old order to a new one, with various players interpreting the concept of AI browsers in different ways [3]
- Native AI forces, represented by OpenAI and Perplexity, aim to reconstruct information retrieval methods, while traditional giants like Google and Microsoft are upgrading their existing ecosystems [3][4]
- In China, many manufacturers are integrating AI capabilities with widely used applications to create comprehensive smart platforms [4]

Group 2: Quark's Unique Position
- Quark has demonstrated unique competitiveness in the AI browser space, recently launching a major version that integrates the Qwen model, marking a significant upgrade to an AI browser [6][7]
- The upgrade is not merely additive but represents a rethinking of the browser's form, aiming to create an OS-level intelligent hub [7][8]
- Quark's AI capabilities extend beyond the browser, allowing users to invoke AI assistance across various applications seamlessly [8][9]

Group 3: AI Interaction Innovations
- Quark has introduced six AI suites that enable global invocation of AI, breaking the limitations of traditional interaction methods [11][15]
- The AI browser allows for efficient information retrieval and task completion, such as summarizing academic papers and providing definitions for complex terms [17][19]
- The integration of AI enhances user experience by maintaining focus on core tasks without switching between multiple applications [21]

Group 4: Enhanced Browser Features
- Quark's intelligent tab management organizes multiple open tabs effectively, improving user experience significantly [26]
- The browser allows direct editing of online documents, streamlining workflows for users who frequently handle PDFs [29][30]
- Cross-device seamless transfer of files and information is facilitated, enhancing productivity for users working across different devices [36][34]

Group 5: Technical Foundation
- The strength of Quark's browser is underpinned by Alibaba's Qwen model, which has made significant advancements in natural language understanding and contextual awareness [41][44]
- The Qwen model's capabilities allow for intelligent responses based on user intent and browsing context, enhancing the overall functionality of the browser [45][52]
- Quark's AI browser showcases the potential of AI in redefining user interactions with web content, positioning itself at the forefront of AI browser exploration [55][56]
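As a quick arithmetic check on the growth figures in the Core Viewpoint, the implied compound annual growth rate from $4.5 billion in 2024 to $76.8 billion in 2034 can be worked out directly (assuming a 10-year compounding window, which the article does not state explicitly):

```latex
% Implied CAGR over 10 years, in billions of USD
\mathrm{CAGR} = \left(\frac{76.8}{4.5}\right)^{1/10} - 1 \approx 17.07^{0.1} - 1 \approx 0.328
```

That is about 32.8% per year, consistent with the figure quoted above.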
No labeled images required, VLMs can "self-evolve" too! The RL self-evolution framework VisPlay tackles hard visual reasoning problems
机器之心· 2025-12-01 04:06
Title: VisPlay: Self-Evolving Vision-Language Models from Images

Experiments show that VisPlay delivers sustained performance gains on mainstream models such as Qwen2.5-VL and MiMo-VL, with especially notable improvements in visual reasoning, compositional generalization, and hallucination reduction, demonstrating a scalable, low-cost new path for evolving multimodal intelligence.

In the Vision-Language Model field, improving complex reasoning ability typically relies on costly human-annotated data or heuristic rewards. This is not only expensive but also hard to scale. The new study VisPlay is the first to propose a self-evolving reinforcement learning framework that lets a VLM evolve and improve its capabilities using only large amounts of unlabeled image data. VisPlay splits the base VLM into two roles, a "Questioner" and a "Reasoner", which co-evolve through an iterative self-evolution mechanism, combined with the GRPO algorithm and novel diversity/difficulty rewards to balance question complexity and answer quality.

Introduction:
Paper: https://arxiv.org/abs/2511.15661
GitHub: https://github.com/bruno686/VisPlay

The "data dilemma" of VLM reasoning
In recent years, Visio ...
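To give a concrete sense of the group-relative update used in the GRPO-style training mentioned in the excerpt above, here is a minimal sketch of computing group-normalized advantages for a batch of Reasoner answers to one Questioner-generated question. The reward terms, pseudo-labeling rule, and names are illustrative assumptions, not the actual VisPlay implementation.

```python
# Minimal sketch of a GRPO-style group-relative advantage computation: sample
# several answers per question, score them, and normalize rewards within the
# group. The reward definitions and pseudo-labeling rule are assumptions.
import statistics

def difficulty_reward(answers):
    """Toy difficulty proxy: questions whose sampled answers disagree are 'harder'."""
    return 1.0 - max(answers.count(a) for a in answers) / len(answers)

def correctness_reward(answer, pseudo_label):
    """Toy correctness proxy against a self-generated pseudo label."""
    return 1.0 if answer == pseudo_label else 0.0

def group_relative_advantages(answers, pseudo_label):
    """GRPO-style: advantage = (reward - group mean) / group std."""
    rewards = [correctness_reward(a, pseudo_label) + difficulty_reward(answers)
               for a in answers]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0      # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Usage: the Questioner poses a question about an unlabeled image, the Reasoner
# samples a group of candidate answers, and the majority answer serves as the
# pseudo label (an assumption made for this sketch).
group = ["two cats", "two cats", "three cats", "two cats"]
pseudo = max(set(group), key=group.count)
print(group_relative_advantages(group, pseudo))
```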
How big is the impact? After the ICLR review doxxing scandal, OpenReview reveals the truth
机器之心· 2025-12-01 04:06
机器之心 report. Editors: +0, Chen Chen

Recently, the biggest drama in academia has been the ICLR review doxxing incident: by entering a certain URL in a browser and swapping in the paper ID and reviewer number you wanted to look up, you could find the corresponding reviewer's identity. You could even see who reviewed your paper and what score they gave you.

Anyone who has submitted to a top conference could hardly resist looking up their reviewers after learning of this loophole. And sure enough, many got a shock: papers they had worked hard on had been scored low for no apparent reason, and in some cases the person behind the low score was a close friend. The games run deep.

A low score for a poorly written paper is fair enough. What is truly suffocating is reviewers acting on personal grudges, such as labmates scoring each other down, or scoring a paper down to "clear the way" for their own paper in the same track.

The fallout has hit the ICLR conference itself: last Saturday, ICLR issued a new notice that all papers' Area Chairs (ACs) would be reassigned and all review comments and scores would be reset to their pre-discussion state.

Unsurprisingly, this decision set off another uproar in the academic community. Some authors' lengthy, already-written rebuttals were wiped back to square one; counterarguments hammered out in overnight discussions suddenly became wasted work... What a rollercoaster this year's ICLR authors have been through!

At almost the same time, OpenRevie ...
AI independently solves a variant of a 30-year-old math problem; Terence Tao shares his experience with automated research
机器之心· 2025-12-01 00:40
Reported by the 机器之心 editorial team

Just now, a weakened version of Erdős problem #124 was proved. The problem had remained open for nearly 30 years, ever since it was posed in a 1984 paper, "Complete sequences of sets of integer powers", published in 《算术杂志》.

The proof is due to Boris Alexeev, a Princeton mathematics PhD, who ran the problem on Aristotle, the mathematical AI agent from Harmonic, recently updated with stronger reasoning capabilities and a natural language interface. Some reports on the problem claimed that the AI had independently solved the full version; that is not the case, and the claims generated considerable controversy.

Boris Alexeev issued a correction:

In the Formal Conjectures project there is a formal statement of this conjecture. Unfortunately, that statement contains a typo: the comment shows "≥ 1" in the displayed equation, while the corresponding Lean statement reads "= 1". (This makes the statement weaker.) I have therefore also fixed this and included a proof of the corrected statement. Finally, I removed an aspect of the statement I considered unnecessary, and Aristotle proved that as well.

As DesmondWeisenberg noted, there is a question involving power 1 (here corresponding to single digits), which means [BEGL9 ...
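For context on the kind of statement being formalized: in the combinatorial number theory literature, a sequence of positive integers is usually called complete when every sufficiently large integer can be written as a sum of distinct terms of the sequence. The definition below is standard background offered as a reading aid, not the precise wording of Erdős problem #124 or of the Formal Conjectures Lean statement.

```latex
% Background definition (standard terminology, not the exact conjecture text):
\text{a sequence } (a_n)_{n \ge 1} \text{ of positive integers is \emph{complete} if every
sufficiently large } N \in \mathbb{N} \text{ can be written as }
N = a_{n_1} + a_{n_2} + \cdots + a_{n_k}, \qquad n_1 < n_2 < \cdots < n_k .
```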