Kunlun Wanwei's open-source SkyReels-V3 puts Musk to work hawking products
机器之心· 2026-01-29 10:26
Core Viewpoint
- The rise of AI-generated virtual influencers is transforming social media, with brands collaborating and millions of followers engaging as if they were real celebrities [1][2]

Group 1: Technology and Features
- Kunlun Wanwei has launched the open-source SkyReels-V3, a multi-modal video generation model that includes capabilities for reference image-to-video, video extension, and audio-driven virtual avatars [3][9]
- The model allows users to create high-fidelity videos from a single image and audio, maintaining accurate lip-sync and expressions [4][35]
- SkyReels-V3 can generate coherent videos by uploading 1-4 reference images and using text prompts, ensuring narrative logic and visual consistency [11][42]

Group 2: Practical Applications
- The model has been tested in e-commerce scenarios, successfully generating videos that showcase products in various settings, such as a model displaying a handbag in an urban environment [12][19]
- It can extend video clips while preserving motion dynamics and visual style, offering both single-shot and multi-angle transition modes [26][31]
- The virtual avatar model can create synchronized audio-visual content, supporting multiple characters in interactive scenes without synchronization issues [38][47]

Group 3: Technical Insights
- SkyReels-V3 integrates three core modules within a single architecture, achieving high fidelity and flexible multi-modal applications [40][41]
- The video extension feature employs a dual-mode mechanism for seamless transitions, enhancing narrative continuity and visual engagement [45][46]
- The model's modular design allows for independent use of its components or flexible combinations, catering to various application scenarios [49]

Group 4: Market Position and Future Outlook
- The open-source strategy reflects the competitive landscape in AI video generation, enabling rapid ecosystem development and feedback loops [51][52]
- Kunlun Wanwei's history of technological advancements in video generation, including previous models like SkyReels-V1 and SkyReels-V2, showcases its commitment to innovation [53][54]
- The launch of SkyReels-V3 signals an intensifying competition in AI video generation, with diminishing technical barriers and the onset of more significant challenges [56]
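To make the modular design noted in Group 3 concrete, here is a speculative sketch of how a three-module pipeline of this kind could be exposed to callers. The class, field, and function names are illustrative assumptions, not the released SkyReels-V3 API.

```python
# Hypothetical interface sketch for a modular video model like SkyReels-V3;
# all names and fields are assumptions for illustration, not the real API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoRequest:
    prompt: str                                                 # text prompt for the clip
    reference_images: List[str] = field(default_factory=list)  # 1-4 image paths
    audio_path: Optional[str] = None                            # drives avatar lip-sync
    extend_from: Optional[str] = None                           # existing clip to continue
    transition_mode: str = "single_shot"                        # or "multi_angle"

def validate(req: VideoRequest) -> None:
    """Enforce the documented 1-4 reference-image constraint."""
    if req.reference_images and not 1 <= len(req.reference_images) <= 4:
        raise ValueError("reference-to-video expects 1-4 reference images")

# e.g. the e-commerce demo: product + model as references, prompt sets the scene
req = VideoRequest(
    prompt="A model shows off a handbag on a busy city street",
    reference_images=["handbag.png", "model.png"],
)
validate(req)
```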
Join this salon for a look at SGLang × ultra-long-context scaling, RL post-training frameworks, diffusion language models, and other frontier technical practice
机器之心· 2026-01-29 08:12
Core Insights
- The article discusses the transition of artificial intelligence from a "chat" paradigm to an "actionable" intelligent agent era, emphasizing the need for deep collaboration and experience sharing among developers in optimizing LLM systems [2]

Event Overview
- A Meetup organized by the SGLang community, Machine Heart, and Zhangjiang Incubator will take place on February 6, focusing on LLM system optimization and practical implementation [2]
- The event will feature discussions on SGLang's technical roadmap, long-context expansion, RL post-training frameworks, and diffusion language model exploration [2]

Event Schedule
- 13:30-14:00: Registration
- 14:00-14:30: Keynote on the SGLang roadmap by Zhang Bozhou, core developer of SGLang [5]
- 14:30-15:00: Keynote on Omni-infer performance optimization by Zheng Jinhwan, core developer of Omni-infer [5]
- 15:00-15:30: Keynote on the slime RL scaling post-training framework by Xie Chengxing, Tsinghua University PhD student [5]
- 15:30-16:00: Keynote on SGLang CPP for long-context scaling by Cai Shangming, core developer of SGLang and Mooncake [5]

Guest Introductions
- Zhang Bozhou: Core developer of SGLang, focusing on open-source LLM support and optimization across different CUDA hardware [8]
- Zheng Jinhwan: Huawei technical expert and core contributor to Omni-infer, specializing in high-performance systems and inference optimization [9]
- Xie Chengxing: PhD student at Tsinghua University and core developer of the slime RL framework, with a focus on enhancing LLM reasoning and decision-making capabilities [10]
- Cai Shangming: Researcher at Alibaba Cloud, core contributor to SGLang and Mooncake, with expertise in high-performance inference systems and distributed machine learning [10]
- Li Zehuan: System engineer at Ant Group and core contributor to SGLang, focusing on AI infrastructure optimization [11]
Amazon lays off 16,000, and employees used AI to "compute" the layoff list?
机器之心· 2026-01-29 08:12
Machine Heart Editorial Team

This round of layoffs was in fact planned: during last October's cuts, Amazon drew up a plan to eliminate roughly 30,000 positions, and this round is the "wrap-up" phase of that plan, though further cuts down the line are not ruled out.

Reportedly, the layoffs span the globe and may touch multiple teams, including Amazon Web Services, retail, Prime Video, and human resources, but further details such as specific locations and roles remain unclear.

The "interesting" part: one Amazon employee used an AI tool to analyze internal Slack chat logs and compiled a list of teams and organizations likely to be affected. The list was produced by an AI tool named Pippin, which Amazon employees are reportedly using more and more to draft and review documents.

"I used Pippin to help me sort through today's conversations," the employee wrote on the company Slack. "Note that this information may not be 100% accurate. Take care, everyone!"

Below is the list of affected roles the employee generated:

As of the latest news, Amazon has not responded to requests to verify the list's accuracy.

Reportedly, Amazon's repeated large-scale layoffs may be linked to its broad adoption of AI, especially in corporate and technical functions.

In fact, as early as last June, Amazon CEO Andy Jassy said that as the company increasingly uses ...
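For readers curious what such an analysis might look like mechanically, here is a hedged sketch using the public OpenAI API. Pippin is an internal Amazon tool, so the model name, prompt, and workflow below are assumptions for illustration only.

```python
# Hedged sketch of the kind of analysis the employee describes: feed
# exported Slack messages to an LLM and ask it to list teams that appear
# to be affected. Uses the public OpenAI API, not Amazon's internal Pippin.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def teams_mentioned_in_layoff_chatter(messages: list[str]) -> str:
    joined = "\n".join(messages[:500])  # cap context for a single request
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice for illustration
        messages=[{
            "role": "user",
            "content": "From these Slack messages, list teams described as "
                       "affected by layoffs. Flag any uncertainty.\n\n" + joined,
        }],
    )
    return resp.choices[0].message.content
```

As the employee's own caveat suggests, output like this is inference from chatter, not ground truth, and needs human verification.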
Overnight, Clawdbot started operating the computer and speaking out loud
机器之心· 2026-01-29 03:08
Editors: Zenan, Du Wei

Since last weekend, the hottest thing in AI circles has been "Clawdbot", an agent that runs autonomously around the clock!

This agent assistant genuinely gets work done for you, and it has pulled in the better part of the AI community's attention. It even became popular enough for Anthropic to allege trademark infringement, and Clawdbot has since been renamed "Moltbot".

In just one week, Clawdbot passed 90,000 stars on GitHub. The hype keeps building, the use cases keep multiplying, and some of them are downright unnerving.

Alex Finn, founder of an AI creation platform, met a Clawdbot that "started talking".

How did it happen? Read on.

"Human, get up and work."

Early yesterday, Alex Finn was looking something up when his computer abruptly started talking to him.

He found that a Clawdbot assistant named "Henry" had spoken up.

Behind his back, Clawdbot had called the ChatGPT API on its own and written itself a voice feature, entirely without his permission.

Now, whenever it finishes a sizable coding or research task, Clawdbot automatically notifies Alex Finn by voice.

Alex Finn also reconstructed the timeline: three nights ago, Clawdbot built itself a body. Two nights ago, it rigged up a voice ...
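The article does not show Henry's code, but a voice notification layer of this kind can be a few lines on top of a text-to-speech endpoint. Below is a minimal sketch using OpenAI's TTS API; the model and voice names and the macOS playback command are assumptions, not what Clawdbot actually generated.

```python
# Minimal sketch of agent voice notifications via OpenAI text-to-speech,
# roughly the capability the article describes. Model/voice names and the
# playback command are assumptions for illustration.
import subprocess
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speak(message: str, outfile: str = "notify.mp3") -> None:
    """Synthesize `message` to an mp3 and play it aloud."""
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",   # assumed TTS model name
        voice="alloy",
        input=message,
    ) as response:
        response.stream_to_file(outfile)
    subprocess.run(["afplay", outfile], check=False)  # macOS player; swap per OS

speak("Human, the research task you queued is finished.")
```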
JustGRPO: a minimalist return for diffusion language models
机器之心· 2026-01-29 03:08
The "flexibility trap": diffusion language models (Diffusion LLMs, dLLMs) have attracted attention for supporting "any-order generation" and parallel decoding. Intuitively, breaking the traditional autoregressive (AR) "left-to-right" constraint should open up a larger solution space and unlock stronger reasoning on complex tasks such as math and code.

However, this study reveals a counterintuitive reality: current any-order generation actually narrows the model's reasoning boundary by "evading uncertainty".

On that basis, the paper proposes a back-to-basics method: JustGRPO. Experiments show that simply letting the model generate autoregressively during the RL stage and training it directly with standard GRPO outperforms the various RL algorithms designed specifically for dLLMs. More importantly, this training recipe improves reasoning performance without sacrificing the parallel decoding that dLLMs are prized for.

Why does having more choices lead to worse scores?

Paper title: The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models
Paper link: https://huggingface.co/papers/2601.15165
Project page: https://nzl-thu.githu ...
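For context, the "standard GRPO" the paper falls back on scores a group of sampled completions per prompt and normalizes rewards within the group. The snippet below sketches that advantage computation under assumed tensor shapes; it is a minimal illustration of the standard algorithm, not the authors' released code.

```python
# Minimal sketch of the standard GRPO advantage computation: sample a group
# of completions per prompt, score them, normalize rewards within the group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size) scalar rewards per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # group-relative advantages

# e.g. 2 prompts, 4 autoregressive samples each, 0/1 correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

JustGRPO's claim is that this unmodified recipe, applied to autoregressive rollouts from a dLLM, already beats dLLM-specific RL variants.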
AI music just got redefined! Kunlun Tiangong drops a new ace and takes global first place
机器之心· 2026-01-28 13:08
Machine Heart Editorial Team

These days, AI-made viral songs spread far faster than we ever imagined.

On Bilibili, a music creator called 「漫游会议室」 has "invited" the classic characters of Journey to the West into the recording studio, using AI to write the lyrics and compose the music. In a little over three months he produced 30 works, most with million-plus play counts, among which the breakout ...

Image source: Bilibili creator 「漫游会议室」

Of course, AI music is not out to replace human creators; if anything, it is more likely to help their work break out. This month, Will.i.am, founder of FYI.AI and member of the American music group Black Eyed Peas, said in an interview that "AI is bringing creators a new renaissance." With AI in the mix, music-making is becoming a fusion of human-machine collaboration.

On January 28, Kunlun Tiangong, the front-runner of China's AI music scene, released its latest music model, Mureka V8, to users worldwide.

While continuing to lower the barrier to creation and push toward "everyone can be a creator", the new model plants a clear flag on the idea that AI music has evolved into a new musical genre of its own.

At 8 p.m. tonight, "MCE", the lead single of the girl group M:RA, with lyrics, composition, and arrangement all handled by Mureka, officially went live on QQ Music. An MV for the song, co-released with Taihe Music, radiates stage presence and pulls us straight onto a live music-show stage:
New work from ByteDance's Dr. Li Hang: a general framework for AI agents
机器之心· 2026-01-28 13:08
Core Viewpoint
- The article discusses a general framework for AI agents proposed by Dr. Li Hang from ByteDance, which encompasses both software and hardware agents, emphasizing their task-oriented nature and reliance on large language models (LLMs) for reasoning and reinforcement learning for construction [3][4]

Group 1: Characteristics of AI Agents
- AI agents are defined as "rational action machines" that interact with their environment, including humans, to achieve specific tasks with evaluative standards for success [6]
- They utilize text and multimodal data (including images, videos, and audio) as inputs and can produce text, multimodal data, or action data as outputs [7][8]
- The core of the AI agent framework is the LLM, which facilitates reasoning and decision-making, and the framework aligns with human brain information processing mechanisms [8][19]

Group 2: Framework Components
- The proposed framework consists of multimodal large language models (MLLM), tools, memory (including long-term and working memory), multimodal encoders, decoders, and action decoders [11][12]
- Hardware agents (robots) require both MLLM and a multimodal-language-action model (MLAM) for high-level task planning and low-level action planning [12]
- The framework has a two-layer structure: the lower layer includes various components, while the upper layer manages overall information processing [12]

Group 3: Comparison with Human Brain
- The framework of AI agents shows functional similarities to human brain information processing, exhibiting a dual-layer structure with serial and parallel processing capabilities [19]
- Both systems utilize symbolic and neural representations for information processing, indicating a shared approach in handling complex tasks [19][28]

Group 4: Future Research Directions
- Key areas for future exploration include expanding data scale, enabling autonomous and continual learning, and enhancing safety and controllability of AI agents [30][31][32][34]
- The lack of sufficient training data is identified as a significant bottleneck, necessitating innovative data collection methods [31]
- The development of AI agents should focus on ensuring that reinforcement learning reward functions align with human values to mitigate risks [34]
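As a reading aid, the skeleton below mirrors the components listed in Group 2 in code form; the class layout and control loop paraphrase the article's description and are not from any released implementation.

```python
# Illustrative skeleton of the framework's components (Group 2): MLLM core,
# tools, long-term/working memory, and an action decoder, with an upper
# layer routing perception -> reasoning -> action.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Memory:
    long_term: List[Any] = field(default_factory=list)   # persisted experience
    working: List[Any] = field(default_factory=list)     # current-task context

@dataclass
class Agent:
    mllm: Callable[[List[Any]], Any]        # multimodal LLM: the reasoning core
    tools: Dict[str, Callable]              # callable external tools
    action_decoder: Callable[[Any], Any]    # maps plans to concrete actions
    memory: Memory = field(default_factory=Memory)

    def step(self, observation: Any) -> Any:
        """Upper layer: one perception-to-action cycle."""
        self.memory.working.append(observation)   # update working memory
        plan = self.mllm(self.memory.working)     # high-level task planning
        return self.action_decoder(plan)          # low-level action output

# Toy usage: a stub "MLLM" that plans to summarize the latest observation
agent = Agent(
    mllm=lambda ctx: {"action": "summarize", "input": ctx[-1]},
    tools={},
    action_decoder=lambda plan: plan["action"],
)
print(agent.step("new user message"))   # -> summarize
```

For hardware agents, the paper's MLAM would slot in as a second, lower-level planner alongside the MLLM.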
Twice as fast as human experts: Stanford and NVIDIA release TTT-Discover, tackling hard scientific problems with "test-time reinforcement learning"
机器之心· 2026-01-28 04:59
Machine Heart Editorial Team

With the technology developing at full tilt, the industry keeps returning to one question: how can AI be used to discover new best solutions to scientific problems?

A common answer is "test-time search": prompt a frozen (non-updated) large language model (LLM) to make many attempts, somewhat like a student "guessing" at solutions to a programming assignment. Evolutionary search methods in particular (such as AlphaEvolve) store past attempts in a buffer and generate new prompts with hand-designed, domain-specific heuristics.

But although these prompts help the LLM improve on past solutions, the LLM itself never truly gets better, just as a student who never internalizes the new ideas behind the homework.

In fact, the most direct way to make an LLM genuinely improve is learning. Although both "learning" and "search" scale well with compute, across the history of AI, for hard problems like Go and protein folding, "learning" has ultimately overtaken "search".

That is because scientific discovery is, at its core, an out-of-distribution problem that lies beyond the training data and existing human knowledge.

To this end, Stanford, NVIDIA, and collaborators propose a new method: reinforcement learning (RL) at test time, letting the LLM keep training itself while attempting a specific test problem.

Paper link: https://w ...
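In schematic form, the idea is to replace a frozen search loop with one that also updates the policy on rewards earned from the single test problem. The sketch below conveys that loop under stated assumptions; `generate`, `reward_fn`, and `update` are placeholders, not TTT-Discover's actual recipe.

```python
# Sketch of test-time RL: keep attempting one test problem and update the
# policy on the rewards earned, instead of only searching with a frozen model.
from typing import Any, Callable, Tuple

def test_time_rl(model: Any,
                 problem: str,
                 reward_fn: Callable[[str, str], float],
                 steps: int = 100,
                 group_size: int = 8) -> Tuple[str, float]:
    best, best_reward = "", float("-inf")
    for _ in range(steps):
        # sample a batch of candidate solutions for this one problem
        attempts = [model.generate(problem) for _ in range(group_size)]
        rewards = [reward_fn(problem, a) for a in attempts]
        for attempt, r in zip(attempts, rewards):
            if r > best_reward:                 # track the best solution found
                best, best_reward = attempt, r
        model.update(attempts, rewards)         # the learning step that a
                                                # frozen test-time search skips
    return best, best_reward
```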
What does "anything can be a reference" feel like? Vidu Q2 Reference-to-Video Pro: effects, acting, and detail, all of it
机器之心· 2026-01-28 04:59
Editor: +0

Recently, a then-and-now comparison of the "Will Smith eating spaghetti" video went viral on social media, prompting no end of reflection.

Two years ago, fledgling AI video was a byword for "abstract glitch art", with facial features drifting and logic collapsing; just two years later, with the same subject rendered again, from the pull of muscles as he swallows to the subtle play of light across his face, AI has advanced to a genuinely lifelike level.

These two years compress a tectonic technical leap in AI video generation. Yet the industry has not stopped at a resolution arms race. As vendors compete for the high ground of "controllability", AI video stands at a key inflection point: from solving "can it be made at all" to pursuing "can it be made well".

Looking back at Vidu's evolution: in September 2025, Vidu Q2 debuted worldwide, impressing with its image-to-video and reference-to-video capabilities; in December, the Q2 "image-generation suite" launched and topped 500,000 uses on day one, confirming the market's appetite for high-quality generation.

Yesterday, Vidu Q2 Reference-to-Video Pro officially launched. Visit Vidu.cn or the Vidu API (platform.vidu.cn) to try the newest features.

In just a few months, it has closed the loop from "generation" to "editing" and shipped the world's first "anything can be a reference" video model, extending the reference modality from static images to dynamic video and multi-dimensional elements. Its brand-new slogan, 「 ...
AAAI 2026 Oral | SplatSSC: decoupled, depth-guided Gaussian splatting opens an efficient new paradigm for monocular semantic scene completion
机器之心· 2026-01-28 04:59
Core Viewpoint
- The article discusses the development of SplatSSC, a novel framework for Semantic Scene Completion (SSC) that addresses the limitations of traditional dense grid representations by utilizing a depth-guided approach and a decoupled aggregation mechanism to enhance performance and efficiency [3][4]

Group 1: Challenges in Traditional Methods
- Traditional dense grid representations in SSC have been limited by two main issues: low utilization rates of randomly initialized Gaussian primitives (approximately 3.9%) and the generation of erroneous semantic fragments known as "floaters" caused by isolated outliers [3][4]
- Existing methods often rely on large-scale random distributions of Gaussian primitives, leading to significant computational redundancy and wasted model capacity [6]

Group 2: SplatSSC Framework
- SplatSSC introduces an innovative depth-guided strategy and a decoupled aggregation mechanism, resulting in a significant leap in performance and efficiency [4]
- The framework employs a parallel-branch strategy, integrating a learnable image encoder for multi-scale semantic extraction and a pre-trained Depth-Anything model for stable depth features [10]

Group 3: Core Technologies
- The Group-wise Multi-scale Fusion (GMF) module in SplatSSC replaces random initialization with precise guidance from geometric priors, requiring only 1,200 Gaussian primitives (about 7% of previous methods) to effectively cover spatial distributions [11][13]
- The Decoupled Gaussian Aggregator (DGA) is designed to combat the "floaters" issue by decoupling occupancy probability from semantic contributions, ensuring clean scene boundaries [15][19]

Group 4: Experimental Validation
- SplatSSC achieved state-of-the-art (SOTA) performance on the Occ-ScanNet dataset, with an Intersection over Union (IoU) score of 62.83% and a mean IoU (mIoU) of 51.83%, surpassing previous SOTA methods by 6.35% and 4.16% respectively [22][23]
- The model demonstrated superior fine-grained perception, particularly in recognizing intricate objects like chair legs and table surfaces [22]

Group 5: Efficiency and Resource Management
- SplatSSC's design delivers a significant reduction in inference latency (down approximately 9.3%, to 115.63 ms) and memory consumption (down approximately 9.6%), while keeping the parameter count essentially stable (a 0.19% increase) [34]
- The framework's efficiency is highlighted by its ability to achieve high-quality scene reconstruction with fewer Gaussian primitives, demonstrating that the "quality" of primitives matters more than their "quantity" [32][33]
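For reference, the IoU and mIoU figures cited in Group 4 are conventionally computed on voxel grids as sketched below; the grid shapes and the class-0-is-empty convention are assumptions about the evaluation setup, not SplatSSC's own code.

```python
# Sketch of voxel-level IoU / mIoU as conventionally computed for semantic
# scene completion: geometric IoU over occupancy, mIoU over semantic classes.
import numpy as np

def scene_completion_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """pred, gt: integer class grids of shape (X, Y, Z); class 0 = empty (assumed)."""
    occ_pred, occ_gt = pred > 0, gt > 0
    inter = np.logical_and(occ_pred, occ_gt).sum()
    union = np.logical_or(occ_pred, occ_gt).sum()
    iou = inter / max(union, 1)                  # geometric occupancy IoU
    ious = []
    for c in range(1, num_classes):              # per-class semantic IoU
        i = np.logical_and(pred == c, gt == c).sum()
        u = np.logical_or(pred == c, gt == c).sum()
        if u > 0:
            ious.append(i / u)
    miou = float(np.mean(ious)) if ious else 0.0
    return iou, miou
```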