量子位
Topping the global AI creation community two and a half years after founding: is the Chinese team behind it "selling emotions"?
量子位· 2026-01-22 11:13
Core Insights
- SeaArt has become the world's leading AI creation community, surpassing platforms like Midjourney and Leonardo, with over 50 million registered users, more than 30 million monthly visits, and annual recurring revenue (ARR) above $50 million [1][11][51]
- The platform offers full-chain multimodal AI creation capabilities, including image, video, audio, and digital human generation [3][6]
- SeaArt is positioned as a "mass-level creative consumption platform" for the AI era, and recently launched SeaVerse, which aims to help creators build personal IPs [6][8]

Platform Features
- SeaArt integrates ComfyUI for visual workflows, a vast model library, and LoRA training and sharing features, fostering community interaction [5]
- SeaVerse lets users generate content from simple natural-language prompts, streamlining the creative process [12][14]
- The platform supports tools for image beautification, animation refinement, and more, enabling users to create complex projects with minimal effort [16][17]

User Experience
- Users can generate interactive applications and animations simply by describing their needs, with the system automatically handling the underlying processes [21][30]
- The platform's ability to generate complete animations and music tracks demonstrates its advanced content-creation capabilities [27][32]

Technical Foundation
- The team behind SeaArt focuses on application-layer design rather than developing foundation models, aiming for a user-friendly experience that simplifies complex AI interactions [38][39]
- SeaArt's technology is built on a template system and workflow engine, which has evolved into a multi-agent collaborative workflow in SeaVerse [40][41]

Company Background
- SeaArt is developed by Haiyi Entertainment, a Chinese AI startup founded in 2023 with a team experienced in the gaming industry [44][45]
- The company has grown its user base and revenue rapidly, reporting a 7.7x increase in user scale and a 5.5x increase in revenue year on year as of 2024 [51]

Market Positioning
- SeaArt has established a decentralized PUGC ecosystem that lets creators monetize their aesthetic and emotional value, with top creators earning $3,000 to $4,000 monthly [53][54]
- The platform has accumulated one of the world's largest AI-native creative asset libraries, supporting a robust content supply chain [55]

Future Outlook
- The launch of SeaVerse strengthens interaction and engagement between creators and consumers, promoting a closed loop of content creation and monetization [56]
- The platform's evolution shows a clear development path from a tool-based approach to a comprehensive AI interactive entertainment platform, akin to an AI-era Bilibili [57][58]
The strongest large models' visual abilities fall short of a six-year-old child
量子位· 2026-01-22 11:13
Core Insights
- The current state of visual reasoning in AI models still lags far behind human capability: the best model, Gemini 3 Pro Preview, only slightly outperforms a three-year-old child and trails a six-year-old by 20% [2][10]
- Gemini 3 Pro Preview's score of 49.7% is the highest among existing models, while other leading models like GPT-5.2 and Claude 4.5 Opus show even poorer results [6][14]
- The article emphasizes the need for future models to rebuild visual capabilities from the ground up rather than relying on language-based translations of visual problems [11]

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads with 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [14]
- Other models such as Qwen3-VL-Plus, Grok-4, and Claude-4.5-Opus scored significantly lower, indicating general underperformance on visual reasoning tasks [15]
- The best-performing open-source model, Qwen3VL-235B-Thinking, achieved 22.2%, still far behind the top closed-source systems [16]

Challenges in Visual Reasoning
The article identifies four core challenges that multimodal large language models (MLLMs) face in visual reasoning:
1. **Lack of Non-verbal Fine Details**: MLLMs struggle to accurately describe fine visual details that cannot easily be expressed in language [25]
2. **Loss of Manifold Consistency**: MLLMs often fail to maintain perceptual consistency over long distances, leading to errors in tasks involving spatial relationships [31]
3. **Spatial Imagination**: MLLMs have difficulty constructing stable three-dimensional representations from two-dimensional images, which limits their ability to perform mental transformations [39]
4. **Visual Pattern Induction**: MLLMs tend to count attributes rather than understand the underlying changes in visual examples, limiting their ability to generalize from few examples [47]

Proposed Solutions
The research suggests two directions for improving visual reasoning:
1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: fine-tuning with this approach improved overall accuracy by 4.8 percentage points, particularly on fine-grained discrimination and spatial perception tasks [56][58]
2. **Generative Model Approaches**: the study introduces BabyVision-Gen, which evaluates generative models like NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit; success rates remain low, but some models exhibit explicit visual thinking capabilities [60][62]

Future Directions
- Overcoming the "language bottleneck" in visual reasoning is crucial; the article advocates unified architectures that retain high-fidelity visual representations during reasoning [68][70]
- Models like Bagel and Sora 2 demonstrate the potential for generative methods to serve as advanced forms of reasoning, emphasizing the importance of robust visual semantic understanding [71]
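The RLVR direction described above can be sketched in a few lines: instead of a learned preference model, a deterministic verifier scores each sampled answer, and group-relative advantages let samples for the same question compete against each other. This is a minimal illustrative sketch; the function names, the binary exact-match rule, and the GRPO-style advantage are assumptions on my part, not the paper's actual code.

```python
# Minimal sketch of Reinforcement Learning with Verifiable Rewards (RLVR):
# the reward is a deterministic check against a verifiable ground truth,
# not a learned reward model. Names and matching rule are illustrative.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the normalized answer matches the verifier."""
    def normalize(s: str) -> str:
        return s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: reward minus the group mean, so sampled
    responses to the same prompt compete against each other."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Example: four sampled answers to one visual question, one correct.
rewards = [verifiable_reward(a, "three") for a in ["3", "three", "two", "four"]]
print(rewards)                   # [0.0, 1.0, 0.0, 0.0]
print(group_advantages(rewards)) # [-0.25, 0.75, -0.25, -0.25]
```

The policy-gradient update itself would then weight each sample's log-probability by its advantage; only the reward computation is shown here.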
A new breakthrough in large-model infra: Tencent Hunyuan open-sources an LLM inference operator library, boosting inference throughput by 30%
量子位· 2026-01-22 11:13
Core Viewpoint
- In the competition among large models, computational efficiency has become a critical bottleneck for AI applications and development, necessitating a shift from merely stacking GPUs to improving efficiency [1][7]

Group 1: HPC-Ops Development
- Tencent's Hunyuan AI Infra team has open-sourced HPC-Ops, a high-performance core operator library for LLM inference, to address the performance shortfalls of mainstream operator libraries on inference GPUs such as the H20 [2][15]
- HPC-Ops is built from scratch on CUDA and CuTe, with deep architecture-specific adaptations and optimizations that lower the development threshold for core operators while achieving significant performance breakthroughs [4][15]

Group 2: Performance Improvements
- With HPC-Ops, inference performance improves by 30% for the Hunyuan model and 17% for the DeepSeek model [5][27]
- HPC-Ops achieves up to 2.22x the Attention performance of FlashInfer/FlashAttention, 1.88x the GroupGEMM performance of DeepGEMM, and 1.49x the FusedMoE performance of TensorRT-LLM [6][47]

Group 3: Pain Points of Existing Operator Libraries
- Current mainstream operator libraries are costly to use and complex in design, requiring deep familiarity with the code, which makes adaptation difficult for ordinary AI researchers [11]
- Existing state-of-the-art (SOTA) operator libraries often fail to extract the full performance potential of the hardware, particularly on inference cards like the H20, whose characteristics differ from high-end training cards [8][13]

Group 4: Technical Innovations
- HPC-Ops includes FusedMoE, Attention, and GroupGEMM modules, with optimizations that align task characteristics with hardware capabilities, achieving over 80% of peak hardware bandwidth [20][47]
- The library employs persistent kernels to hide launch overhead and uses innovative data-rearrangement techniques, achieving results superior to current SOTA implementations [24][28]

Group 5: Future Development Directions
- HPC-Ops will focus on developing sparse Attention operators to address memory and compute bottlenecks in long-context large models, and will expand quantization strategies to include mixed precision [50]
- The library will also explore computation-communication coordination to reduce communication overhead in distributed inference scenarios, supporting efficient deployment of ultra-large models [51]
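The "over 80% of peak hardware bandwidth" claim for a memory-bound operator can be sanity-checked with simple roofline arithmetic: bytes moved divided by elapsed time, as a fraction of the card's theoretical DRAM bandwidth. The peak figure (~4 TB/s for H20-class HBM3) and the example kernel sizes below are my assumptions for illustration, not numbers from the HPC-Ops repository.

```python
# Back-of-the-envelope bandwidth-utilization check for a memory-bound
# kernel (e.g. FusedMoE weight streaming). All numbers are illustrative.

def achieved_bandwidth_fraction(bytes_moved: float, elapsed_s: float,
                                peak_bytes_per_s: float) -> float:
    """Fraction of theoretical peak DRAM bandwidth a kernel actually used."""
    return (bytes_moved / elapsed_s) / peak_bytes_per_s

PEAK = 4.0e12  # assumed H20-class HBM3 peak, ~4 TB/s

# Example: a kernel that streams 16 GB of weights/activations in 4.8 ms.
frac = achieved_bandwidth_fraction(16e9, 4.8e-3, PEAK)
print(f"{frac:.1%}")  # 83.3% of assumed peak: near the memory roofline
```

When a kernel sits this close to the roofline, further speedups must come from moving fewer bytes (quantization, fusion) rather than from more compute, which is consistent with the library's focus on FusedMoE and mixed-precision quantization.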
Universities are starting to use AI for admissions
量子位· 2026-01-22 07:37
Core Viewpoint
- The article discusses the increasing use of AI in college admissions, highlighting Virginia Tech's implementation of AI to review student applications, which has significantly reduced manual labor and expedited the admissions process [1][2][10]

Group 1: AI in College Admissions
- Virginia Tech has adopted AI to evaluate student admission materials, saving approximately 8,000 hours of manual work and allowing admission results to be released a month earlier than usual [2][16][17]
- The rise in AI usage for admissions is partly due to the surge in applications after SAT/ACT exams became optional, which has overwhelmed admissions departments [8][9][10]

Group 2: Concerns and Criticisms
- Despite the efficiency gains, there are concerns about fairness and potential bias: AI models trained on historical data may favor certain backgrounds or writing styles [19][20][21]
- Critics argue that reliance on AI could undermine the diversity and uniqueness universities strive for, as applicants may tailor their submissions to AI preferences rather than showcasing their true abilities [23][26]

Group 3: AI's Broader Implications
- The article draws parallels between AI in job recruitment and college admissions, suggesting that students will increasingly use AI to craft their application materials, leading to a cycle of "AI versus AI" in the admissions process [27][29][31]
The strongest AI products of 2025 in one article: Quantum Bit Think Tank's annual AI 100
量子位· 2026-01-22 07:37
Core Viewpoint
- The article highlights the transformation of China's AI product ecosystem in 2025, the "Year of AI Applications," where the focus shifts from mere functionality to system reconstruction driven by advances in underlying models, user demand, and business-model evolution [5][6]

Group 1: AI Product Landscape
- The 2025 AI market in China is characterized by the launch of major AI companies like Zhipu and MiniMax, indicating a maturing market [3]
- The "AI 100" product list released by Quantum Bit Think Tank categorizes AI products into three segments: "Flagship AI 100," "Innovative AI 100," and the top products from ten popular sectors [7][29]
- The "Flagship AI 100" focuses on the strongest AI products of 2025, showcasing those with significant technological breakthroughs and practical application value [8][29]

Group 2: User Engagement and Market Trends
- The top five AI products on the web account for over 62% of monthly active users (MAU), while the top five mobile apps represent over 65% of daily active users (DAU) [12]
- AI general assistants and AI office platforms remain the most popular sectors, significantly outpacing other categories in user scale [12]
- The "Innovative AI 100" aims to identify products with potential for explosive growth in 2026, highlighting emerging trends across AI sectors [13][16]

Group 3: Sector-Specific Insights
- The article identifies ten key AI application sectors, including AI browsers, AI agents, AI smart assistants, and AI education, each featuring a top three that exemplify innovation and engineering excellence [19][23]
- The sector evaluations serve as a retrospective on the 2025 AI application market, emphasizing the competitive landscape and user engagement [24]

Group 4: Evaluation Methodology
- The "AI 100" list employs a dual assessment system combining quantitative and qualitative metrics, focusing on user data, growth, and long-term development potential [26]
- Quantitative metrics cover user scale, growth, and engagement; qualitative assessment considers technology, market space, and user experience [26]
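A dual assessment of this kind can be pictured as a weighted blend of normalized sub-scores. The weights, metric names, and the averaging scheme below are invented for illustration; the article does not publish the actual formula.

```python
# Hypothetical weighted-scoring sketch of a dual (quantitative + qualitative)
# assessment. Weights and metric names are assumptions, not the real rubric.

def ai100_score(quant: dict[str, float], qual: dict[str, float],
                quant_weight: float = 0.6) -> float:
    """Blend normalized (0-1) quantitative and qualitative sub-scores."""
    q = sum(quant.values()) / len(quant)   # mean of quantitative metrics
    s = sum(qual.values()) / len(qual)     # mean of qualitative judgments
    return quant_weight * q + (1 - quant_weight) * s

score = ai100_score(
    quant={"user_scale": 0.9, "growth": 0.7, "engagement": 0.8},
    qual={"technology": 0.85, "market_space": 0.6, "user_experience": 0.9},
)
print(round(score, 3))  # 0.793
```

In practice the qualitative side would come from expert panels rather than numbers, but the blending step works the same way once judgments are mapped onto a common scale.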
Google Gemini becomes a free tutor: full-length practice exams, with wrong answers explained step by step
量子位· 2026-01-22 05:39
Core Viewpoint
- Google has introduced a free SAT practice exam feature through its Gemini platform, providing immediate scoring and explanations for incorrect answers, benefiting students preparing for the SAT [1][2][17]

Group 1: SAT Practice Exam Features
- The SAT practice exam is developed in collaboration with The Princeton Review, incorporating a comprehensive set of verified SAT questions [7]
- Users can customize the testing experience, including turning off the timer or receiving hints, creating a personalized practice environment [9]
- The math section's questions are noted as relatively easy, one example being a straightforward algebra problem [10][11]

Group 2: Educational Value and Functionality
- Gemini's primary value lies in its ability to explain answers, breaking down problem-solving steps for users who struggle with certain questions [14][15]
- The platform allows users to review incorrect answers and identify weak areas, transforming a traditional study approach into a more targeted tutoring experience [16]

Group 3: Future Aspirations and Broader Applications
- Google plans to expand Gemini's capabilities beyond the SAT to other standardized tests [17]
- Gemini is also being integrated into sectors such as health and coding, aiming to provide specialized assistance in those fields [19]
- The overarching goal is to embed Gemini's functionality into everyday digital experiences across Google's services [20][21]
57.1% of people can't tell real from fake: Runway's new video model is explosive
量子位· 2026-01-22 05:39
Core Viewpoint
- The article discusses Runway's new Gen 4.5 model, emphasizing its ability to generate highly realistic videos that blur the line between AI-generated content and real footage, with significant improvements in storytelling, detail, and consistency [8][9][11][22]

Group 1: Model Capabilities
- Gen 4.5 focuses on image-to-video generation, enhancing camera control and narrative storytelling, which has produced a noticeable leap in quality [9][11]
- The model can quickly generate three different shots (close-up, medium, and long) within five seconds, maintaining highly consistent facial details even under camera movement [11][12]
- Storytelling has improved, allowing longer narrative structures and better coherence between shots, making the output resemble a usable short film [16][18]

Group 2: Realism and Recognition
- In a survey of 1,000 participants, about 57% could not reliably tell AI-generated videos from real ones, indicating that generation quality now approaches the limits of human perception [21][22]
- Advances in realism include enhanced texture fidelity, lighting, and overall visual quality, making AI-generated videos increasingly indistinguishable from real-life footage [25][26][28]

Group 3: Industry Trends
- The industry as a whole is demanding greater realism and consistency from video models, with a focus on adherence to physical-world behavior and natural cross-frame performance [25][27]
- Sound synchronization is receiving growing emphasis, with models now capable of generating audio matched to the visual content, enhancing the viewing experience [30][31]
- The rapid pace of updates from various companies suggests the video-model landscape is evolving quickly, with new trends emerging frequently [35][36]
A video version of Deep Research? Browse first, then localize, then read closely: accuracy rises while token consumption drops 58.3%
量子位· 2026-01-22 05:39
Core Insights
- The article discusses the evolution of AI research toward autonomous agents that actively retrieve information rather than passively receive it [1]
- It highlights a significant gap in current AI capabilities in video processing, where existing agents struggle to analyze video content effectively [2][4]

Video Processing Challenges
- Current AI agents either excel at text comprehension or can only answer questions about short video clips, failing to handle the dense information in videos [4]
- The article identifies two main approaches to video processing: Direct Visual Inference, which is computationally expensive and suffers from context explosion, and Text Summarization, which loses critical visual details [8]

Proposed Solution: Video-Browser
- The research team introduces Video-Browser, which aims to enhance video browsing by mimicking human-like search behavior [5][6]
- Video-Browser employs a Pyramidal Perception architecture, processing video data in tiers to balance efficiency and accuracy [10][11]

Core Components of Video-Browser
- Video-Browser consists of three main components: Planner, Watcher, and Analyst [13]
- The Watcher utilizes a three-stage pyramid mechanism:
  - Stage I: Semantic Filter, which quickly eliminates irrelevant videos using metadata analysis [14]
  - Stage II: Sparse Localization, which identifies candidate answer time windows using subtitles and sparse frame sampling [15]
  - Stage III: Zoom-in, where high-frame-rate decoding and detailed visual reasoning run only within the identified time windows [16]

Benchmark Testing: Video-BrowseComp
- The research team created the Video-BrowseComp benchmark to evaluate agents' true video-search capability, emphasizing the need for agents to actively seek information [17]
- The benchmark includes three difficulty levels, ranging from explicit retrieval to multi-source reasoning [18][20]

Experimental Results
- Video-Browser achieved a 26.19% accuracy rate, outperforming existing models by 37.5% [21]
- The architecture reduced token consumption by 58.3%, demonstrating significant efficiency improvements [22]

Case Study
- A case study illustrates Video-Browser's effectiveness at identifying specific details, such as the color of a pen in a film, which traditional methods failed to capture [24][26]

Conclusion and Future Directions
- Video-Browser represents a significant step toward effective open-web video browsing, addressing the trade-off between accuracy and cost in video search [27]
- The research team has open-sourced all code, data, and benchmarks to encourage further community research [28][29]
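The three-stage pyramid can be sketched as a funnel: a cheap metadata filter first, a subtitle-only localization pass next, and expensive dense decoding only on the few windows that survive. The data shapes, function names, and keyword-matching heuristics below are hypothetical stand-ins for illustration, not the paper's actual API.

```python
# Illustrative sketch of a pyramidal browse-locate-zoom pipeline:
# each stage is strictly cheaper per item than the next, so expensive
# decoding touches only a tiny fraction of the total video. Hypothetical.

from dataclasses import dataclass

@dataclass
class Video:
    title: str
    subtitles: list[str]  # one entry per time window (assumed granularity)

def semantic_filter(videos: list[Video], query: str) -> list[Video]:
    """Stage I: drop videos whose metadata never mentions a query term."""
    terms = query.lower().split()
    return [v for v in videos if any(t in v.title.lower() for t in terms)]

def sparse_localization(video: Video, query: str) -> list[int]:
    """Stage II: find candidate time windows using subtitles only."""
    terms = query.lower().split()
    return [i for i, sub in enumerate(video.subtitles)
            if any(t in sub.lower() for t in terms)]

def zoom_in(video: Video, windows: list[int]) -> str:
    """Stage III stub: dense high-fps decoding would run here, and only
    on the few windows that survived the cheaper stages."""
    return f"decode {len(windows)} window(s) of '{video.title}' at high fps"

videos = [Video("cooking pasta", ["boil water", "add salt"]),
          Video("pen review", ["unboxing", "the pen is red"])]
kept = semantic_filter(videos, "red pen")
wins = sparse_localization(kept[0], "red pen")
print(zoom_in(kept[0], wins))
```

The token savings reported in the article come from exactly this asymmetry: the first two stages consume metadata and subtitles, and only the final stage pays for frame-level visual tokens.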
Musk joins the talent war: xAI forms a "talent sniper squad," a geek-style HR role paying 1.68 million a year
量子位· 2026-01-22 02:12
Jay, from Aofeisi
量子位 | WeChat official account QbitAI

Musk is personally joining the talent grab.

According to the latest news, xAI is assembling an "AI talent sniper squad" that reports directly to Musk.

This special team will work closely with xAI's engineering and recruiting teams to explore new ways to hire outstanding talent quickly and at scale.

Notably, xAI calls the position "talent engineer," not HR in the traditional sense. Rather than candidates with human-resources backgrounds, the company wants the role filled by "geeks" who approach recruiting with an engineering mindset.

Musk's Boss Zhipin-style direct hiring

Musk is experimenting with a very new style of recruiting: build an engineering-minded "AI talent sniper squad" and an engineered hiring system to rapidly identify, reach, and attract top talent across fields. The theme is "engineers recruiting engineers."

Once the system is built, team members must still do the work themselves, taking part in the hiring process end to end and staying on the front line.

xAI believes there is no hope of finding truly top talent through the conventional job market. It takes referrals from acquaintances, offline events, competitions, niche online communities, and other more creative channels.

So the job is not about sending direct messages on LinkedIn, but about moving comfortably through all kinds of settings and, with sharp judgment, spotting the strongest person at a glance.

Candidates also need experience working in organizations with extremely high talent density, having both referred outstanding people and actually taken part in hiring.

So they must have ...
Giving robots instinctive reactions: Tsinghua open-sources a single codebase for both parkour and wilderness hiking
量子位· 2026-01-22 02:12
Contributed by the Tsinghua MARSLab team
量子位 | WeChat official account QbitAI

Enabling a humanoid robot to run at high speed (2.5 m/s), cross obstacles, and vault over tall ones.

Core positioning: built for research on "instinct-level" motor intelligence

"Instinct-level" intelligence for humanoid robots means the ability, as in humans, to handle complex environments autonomously through real-time perception, without preset trajectories: automatically adjusting the jumping posture on seeing an obstacle, or instinctively keeping balance when stepping on the edge of a stair.

For a long time, however, this research has faced two pain points. First, "perception and motion are split": systems that can perceive terrain only walk simply, while systems that perform high-difficulty motions are effectively "blind." Second, "the toolchain is not general": high-dynamic motion and wilderness locomotion research each require separately built environments, at very high adaptation cost.

How can a robot have both "instinctive reactions" and complex motor capability?

Project-Instinct, a framework jointly released by Tsinghua University's Institute for Interdisciplinary Information Sciences and the Shanghai Qi Zhi Institute, offers a new answer: a modular, flexibly configurable, full-pipeline toolkit designed specifically for research on "instinct-level" humanoid motor intelligence, letting researchers focus on core technical breakthroughs instead of reinventing the wheel.

Project-Instinct aims to break the deadlock with a "unified framework + flexible configuration": the entire toolkit, from algorithm design and environment construction to real-robot deployment, is built around "instinct-level" intelligence, supporting precise training of high-dynamic, multi-contact motions while also adapting to wilderness ...