Goodbye to the "god's-eye view": robots precisely locate 3D targets from just a few images, achieving SOTA on a new benchmark
量子位· 2026-01-23 05:03
Core Insights
- The article examines the difficulty embodied intelligent agents face in understanding 3D environments from limited, sparse visual data, and proposes a new task, Multiview 3D Referring Expression Segmentation (MV-3DRES), to address it [4][10][30].

Group 1: Problem Statement
- Embodied intelligent agents often lack a comprehensive view of their surroundings, relying on sparse RGB images that yield incomplete and noisy 3D reconstructions [2][9].
- Existing 3D referring segmentation methods assume dense, reliable point-cloud inputs, an idealization that does not hold under real-world conditions [3][9].

Group 2: Proposed Solution
- The proposed model, MVGGT (Multimodal Visual Geometry Grounded Transformer), uses a dual-branch architecture that combines geometric and language features to improve 3D scene understanding and segmentation [4][11].
- The architecture pairs a frozen geometric reconstruction branch, which supplies stable 3D geometric priors, with a trainable multimodal branch that fuses language instructions with visual features [13][15].

Group 3: Optimization Strategy
- The research identifies a core optimization challenge, Foreground Gradient Dilution (FGD): because target instances occupy only a sparse fraction of the views, their gradient signal is diluted during training [20][18].
- To counter this, the team introduces the PVSO (Per-View No-Target Suppression Optimization) strategy, which amplifies meaningful gradient signals from effective views while suppressing misleading signals from no-target views [22][18].

Group 4: Experimental Results
- The team built a benchmark dataset, MVRefer, to evaluate the MV-3DRES task, simulating scenarios with eight randomly collected sparse views [23][24].
- Experiments show that MVGGT significantly outperforms existing baselines across metrics, especially in challenging scenarios with low target-pixel ratios [25][26].

Group 5: Practical Implications
- The work underscores the practical value of aligning 3D grounding with real-world perception conditions, opening new directions for improving the perception capabilities of embodied intelligence in constrained environments [30].
- The team invites further exploration and improvement on the established benchmark to advance sparse perception for embodied intelligence [30].
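The per-view suppression idea described in Group 3 can be sketched as a weighted loss: views that contain the referred target keep their full gradient weight, while no-target views are down-weighted so their background-only gradients do not swamp the sparse foreground signal (the FGD problem). The weighting scheme and names below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a PVSO-style weighted loss.
# The weighting scheme and names are assumptions, not the paper's code.

def pvso_loss(per_view_losses, has_target, no_target_weight=0.1):
    """Combine per-view segmentation losses into one training loss.

    Views containing the referred target contribute at full weight;
    no-target views are suppressed so their background-only gradients
    do not dilute the foreground signal.
    """
    total, weight_sum = 0.0, 0.0
    for loss, target_present in zip(per_view_losses, has_target):
        w = 1.0 if target_present else no_target_weight
        total += w * loss
        weight_sum += w
    return total / weight_sum

# Eight sparse views, of which only two actually show the target:
losses = [0.9, 0.8, 0.85, 0.9, 0.2, 0.25, 0.88, 0.92]
views_with_target = [False, False, False, False, True, True, False, False]
print(round(pvso_loss(losses, views_with_target), 4))  # → 0.375
```

With uniform weighting the six no-target views would dominate the average; the suppression weight keeps the two effective views in control of the gradient.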
The Tencent-backed GPU company is going public! Enflame Technology's IPO application accepted, seeking 6 billion yuan, all in on R&D
量子位· 2026-01-23 05:03
Core Viewpoint
- Shanghai Enflame Technology Co., Ltd. (燧原科技) has had its IPO application accepted on the Sci-Tech Innovation Board, aiming to raise 6 billion yuan, the first IPO acceptance on the A-share market in 2026 [1][2].

Group 1: Company Overview
- Enflame focuses on cloud AI chips, with a core business spanning three layers: chips and hardware, software and programming platforms, and computing-cluster solutions [22][23].
- The company was founded in March 2018 by former AMD colleagues Zhao Lidong and Zhang Yalin [22].

Group 2: IPO Details
- The IPO application was accepted on January 22, 2026, with CITIC Securities as underwriter [2][3].
- The 6 billion yuan raised is earmarked primarily for R&D, split across three projects: 1.503 billion yuan for the fifth-generation AI chip series, 1.196 billion yuan for the sixth-generation AI chip series, and 3.3 billion yuan for an advanced AI software-hardware collaborative innovation project [5][6].

Group 3: Financial Performance
- Revenue grew from roughly 90 million yuan in 2022 to over 720 million yuan in 2024, a compound annual growth rate of 183.15%, and exceeded 540 million yuan in the first nine months of 2025 [8][10].
- The revenue mix has shifted: AI acceleration cards and modules, together with intelligent computing systems and clusters, became the main income sources, accounting for over 50% in 2023 [10][11].
- Major clients include Tencent, which accounted for 57.28% of revenue in the first nine months of 2025 [12][13].

Group 4: Profitability and Losses
- Despite the revenue growth, the company remains loss-making, with a net loss of roughly 887.75 million yuan in the first nine months of 2025 and 1.51 billion yuan in 2024 [15][28].
- Based on current orders, R&D plans, and product delivery schedules, the company aims to break even by 2026 [17].

Group 5: Market Position and Competition
- Enflame is one of the "Four Little Dragons" of domestic GPUs, alongside Moore Threads, MetaX (Muxi), and Biren Technology, each focusing on different aspects of GPU technology [34][37].
- The domestic AI chip market is still in its early stages; Nvidia remains dominant, but local manufacturers are gradually breaking through technological and scale barriers [30][32].
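The reported compound annual growth rate can be sanity-checked from the revenue endpoints quoted above: growing from about 90 million yuan (2022) to about 720 million yuan (2024) over two compounding years implies a CAGR of (720/90)^(1/2) - 1 ≈ 183%, consistent with the filing's 183.15% (the small gap comes from the rounded figures used here).

```python
# Sanity check of the reported ~183% CAGR from rounded revenue endpoints.

def cagr(start, end, years):
    """Compound annual growth rate between two values over `years` years."""
    return (end / start) ** (1 / years) - 1

# Revenue grew from ~90M yuan (2022) to ~720M yuan (2024): two compounding years.
growth = cagr(90, 720, 2)
print(f"{growth:.2%}")  # → 182.84%, close to the filing's 183.15%
```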
Quantum Bit (量子位) is hiring editors and writers
量子位· 2026-01-22 11:13
Core Viewpoint
- Against the backdrop of the ongoing AI boom, the article invites readers to join "Quantum Bit," a leading content platform focused on tracking AI advancements [1].

Group 1: Job Opportunities
- The company is hiring in three directions: AI Industry, AI Finance, and AI Product, with positions open to both experienced professionals and fresh graduates [2][4].
- Openings span multiple levels, including editors, lead writers, and chief editors, with roles matched to individual capabilities [6].

Group 2: Job Responsibilities
- **AI Industry Direction**: track infrastructure innovation, including chips, AI infrastructure, and cloud computing, and interpret technical reports from conferences [6][7].
- **AI Finance Direction**: cover venture capital, financial reports, and capital movements in the AI industry; requires strong analytical skills and a passion for interviews [11].
- **AI Product Direction**: monitor AI applications and hardware developments; requires a keen sense for product experience and market trends [11].

Group 3: Benefits and Growth
- Employees can engage with industry leaders, build personal influence through original content creation, and receive mentorship from senior editors [6][11].
- The company offers competitive salaries and comprehensive benefits, including social insurance, meal allowances, and performance bonuses [6].

Group 4: Company Growth Metrics
- As of 2025, Quantum Bit has over 2.4 million WeChat subscribers and more than 7 million users across all platforms, with daily reading volume exceeding 2 million [12].
Founded just two and a half years ago and already the world's top AI creation community, with a Chinese team behind it "selling emotions"??
量子位· 2026-01-22 11:13
Core Insights
- SeaArt has become the world's leading AI creation community, surpassing platforms such as Midjourney and Leonardo, with over 50 million registered users, more than 30 million monthly visits, and annual recurring revenue (ARR) above $50 million [1][11][51].
- The platform offers full-chain multimodal AI creation capabilities, spanning image, video, audio, and digital-human generation [3][6].
- SeaArt positions itself as a "mass-level creative consumption platform" for the AI era; its recently launched SeaVerse aims to help creators build personal IPs [6][8].

Platform Features
- SeaArt integrates ComfyUI for visual workflows, a vast model library, and LoRA training-and-sharing features that foster community interaction [5].
- SeaVerse lets users generate content from simple natural-language prompts, streamlining the creative process [12][14].
- The platform bundles tools for image beautification, animation refinement, and more, enabling users to create complex projects with minimal effort [16][17].

User Experience
- Users can generate interactive applications and animations simply by describing what they need, with the system handling the underlying processes automatically [21][30].
- The platform's ability to generate complete animations and music tracks demonstrates its advanced content-creation capabilities [27][32].

Technical Foundation
- The team focuses on application-layer design rather than developing foundational models, aiming for a user-friendly experience that hides complex AI interactions [38][39].
- SeaArt's technology is built on a template system and workflow engine, which has evolved into a multi-agent collaborative workflow in SeaVerse [40][41].

Company Background
- SeaArt is developed by Haiyi Entertainment, a Chinese AI startup founded in 2023 whose team has deep gaming-industry experience [44][45].
- The company has expanded rapidly, reporting a 7.7x increase in user scale and a 5.5x increase in revenue year-on-year as of 2024 [51].

Market Positioning
- SeaArt has built a decentralized PUGC ecosystem in which creators monetize their aesthetic and emotional value, with top creators earning $3,000 to $4,000 per month [53][54].
- The platform has accumulated one of the largest AI-native creative asset libraries globally, supporting a robust content supply chain [55].

Future Outlook
- SeaVerse deepens interaction and engagement between creators and consumers, promoting a closed-loop mechanism for content creation and monetization [56].
- The platform's evolution traces a clear path from a tool-based approach to a comprehensive AI interactive entertainment platform, akin to an AI-era Bilibili [57][58].
The strongest large models' visual abilities still trail a six-year-old's
量子位· 2026-01-22 11:13
Core Insights
- Visual reasoning in AI models still lags far behind human capability: the best model, Gemini 3 Pro Preview, only slightly outperforms a three-year-old child and lags 20% behind a six-year-old [2][10].
- Gemini 3 Pro Preview's 49.7% is the highest score among existing models; other leading models such as GPT-5.2 and Claude 4.5 Opus fare even worse [6][14].
- The article argues that future models must rebuild visual capability from the ground up rather than relying on language-based translations of visual problems [11].

Performance Comparison
- Among closed-source models, Gemini 3 Pro Preview leads at 49.7%, followed by GPT-5.2 at 34.4% and Doubao-Seed-1.8 at 30.2% [14].
- Other models, including Qwen3-VL-Plus, Grok-4, and Claude-4.5-Opus, scored significantly lower, indicating broad underperformance on visual reasoning tasks [15].
- The best open-source model, Qwen3-VL-235B-Thinking, reached 22.2%, still far behind the top closed-source systems [16].

Challenges in Visual Reasoning
The article identifies four core challenges for multimodal large language models (MLLMs) in visual reasoning:
1. **Lack of Non-verbal Fine Details**: MLLMs struggle to accurately capture fine visual details that resist expression in language [25].
2. **Loss of Manifold Consistency**: MLLMs often fail to maintain perceptual consistency over long distances, causing errors in tasks involving spatial relationships [31].
3. **Spatial Imagination**: MLLMs have difficulty constructing stable three-dimensional representations from two-dimensional images, limiting their ability to perform mental transformations [39].
4. **Visual Pattern Induction**: MLLMs tend to count attributes rather than grasp the underlying changes in visual examples, limiting generalization from few examples [47].

Proposed Solutions
The research suggests two directions for improving visual reasoning:
1. **Reinforcement Learning with Verifiable Rewards (RLVR)**: fine-tuning with RLVR improved overall accuracy by 4.8 percentage points, particularly on fine-grained discrimination and spatial perception tasks [56][58].
2. **Generative Model Approaches**: the study introduces BabyVision-Gen to evaluate generative models such as NanoBanana-Pro, GPT-Image-1.5, and Qwen-Image-Edit; success rates remain low, but some models exhibit explicit visual thinking capabilities [60][62].

Future Directions
- Overcoming the "language bottleneck" in visual reasoning is crucial; the article advocates unified architectures that retain high-fidelity visual representations during reasoning [68][70].
- Models like Bagel and Sora 2 show that generative methods can serve as an advanced form of reasoning, underscoring the importance of robust visual semantic understanding [71].
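The scores quoted in the performance comparison can be gathered for a quick side-by-side view. The numbers below are exactly those reported in the article; the child baselines are omitted because the article gives them only relative to model scores.

```python
# Benchmark scores quoted in the article (percent correct).
scores = {
    "Gemini 3 Pro Preview": 49.7,
    "GPT-5.2": 34.4,
    "Doubao-Seed-1.8": 30.2,
    "Qwen3-VL-235B-Thinking (open-source)": 22.2,
}

# Rank models and show each one's gap to the leader.
best = max(scores.values())
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:40s} {score:5.1f}  (-{best - score:.1f} vs leader)")
```

The roughly 27-point spread between the best closed-source and best open-source system is the gap the article highlights.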
A new breakthrough in large-model infra! Tencent Hunyuan open-sources an LLM inference operator library, lifting inference throughput by 30%
量子位· 2026-01-22 11:13
Core Viewpoint
- In the large-model race, computational efficiency has become a critical bottleneck for AI applications and development, forcing a shift from merely stacking GPUs to squeezing out efficiency [1][7].

Group 1: HPC-Ops Development
- Tencent's Hunyuan AI Infra team has open-sourced HPC-Ops, a high-performance core operator library for LLM inference, to address the shortcomings of mainstream operator libraries on inference GPUs such as the H20 [2][15].
- HPC-Ops is built from scratch on CUDA and CuTe, with deep architectural adaptation and optimization that lowers the development threshold for core operators while delivering significant performance gains [4][15].

Group 2: Performance Improvements
- With HPC-Ops, inference performance improves by 30% for the Hunyuan model and 17% for the DeepSeek model [5][27].
- HPC-Ops achieves up to 2.22x the Attention performance of FlashInfer/FlashAttention, 1.88x the GroupGEMM performance of DeepGEMM, and 1.49x the FusedMoE performance of TensorRT-LLM [6][47].

Group 3: Pain Points of Existing Operator Libraries
- Mainstream operator libraries are costly to adopt: their designs are complex and demand deep familiarity with the code, making adaptation difficult for ordinary AI researchers [11].
- Existing state-of-the-art (SOTA) operator libraries often fail to exploit the hardware's full performance potential, particularly on inference cards like the H20, which differ from high-end training cards [8][13].

Group 4: Technical Innovations
- HPC-Ops ships FusedMoE, Attention, and GroupGEMM modules, with optimizations that match task characteristics to hardware capabilities, reaching over 80% of peak hardware bandwidth [20][47].
- The library uses persistent kernels to hide overhead and applies novel data-rearrangement techniques, outperforming current SOTA implementations [24][28].

Group 5: Future Development Directions
- HPC-Ops plans to develop sparse Attention operators to relieve the memory and compute bottlenecks of long-context large models, and to extend its quantization strategies to mixed precision [50].
- It will also explore computation-communication overlap to cut communication overhead in distributed inference scenarios, supporting efficient deployment of ultra-large models [51].
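The "over 80% of peak hardware bandwidth" figure in Group 4 is the standard yardstick for memory-bound kernels such as attention and MoE dispatch: bytes actually moved per second divided by the card's theoretical peak. Below is a minimal sketch of that calculation; the peak bandwidth and kernel figures are illustrative assumptions, not HPC-Ops measurements.

```python
# Memory-bandwidth utilization for a hypothetical kernel run.
# Peak bandwidth and kernel figures below are illustrative assumptions.

def bandwidth_utilization(bytes_moved, elapsed_s, peak_bytes_per_s):
    """Fraction of theoretical peak DRAM bandwidth actually achieved."""
    achieved = bytes_moved / elapsed_s
    return achieved / peak_bytes_per_s

PEAK = 4.0e12            # assumed peak HBM bandwidth, bytes/s (hypothetical)
bytes_moved = 1.6e9      # bytes read + written by the kernel (hypothetical)
elapsed = 500e-6         # measured kernel time: 500 microseconds (hypothetical)

util = bandwidth_utilization(bytes_moved, elapsed, PEAK)
print(f"achieved {util:.0%} of peak")  # 1.6e9 B / 5e-4 s = 3.2e12 B/s -> 80%
```

For a memory-bound operator this ratio, rather than FLOP throughput, is what the "80% of peak" claim refers to.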
Universities are starting to use AI for admissions
量子位· 2026-01-22 07:37
Core Viewpoint
- The article examines the growing use of AI in college admissions, highlighting Virginia Tech's use of AI to review student applications, which has significantly reduced manual labor and expedited the admissions process [1][2][10].

Group 1: AI in College Admissions
- Virginia Tech adopted AI to evaluate admission materials, saving approximately 8,000 hours of manual work and allowing admission results to be released a month earlier than usual [2][16][17].
- The rise of AI in admissions is partly driven by the surge in applications after SAT/ACT scores became optional, which has overwhelmed admissions departments [8][9][10].

Group 2: Concerns and Criticisms
- Despite the efficiency gains, there are concerns about fairness and bias: AI models trained on historical data may favor certain backgrounds or writing styles [19][20][21].
- Critics argue that reliance on AI could erode the diversity and uniqueness universities strive for, as applicants tailor submissions to AI preferences rather than showcasing their true abilities [23][26].

Group 3: AI's Broader Implications
- The article draws parallels with AI in job recruitment, suggesting students will increasingly use AI to craft application materials, creating an "AI versus AI" cycle in the admissions process [27][29][31].
The strongest AI products of 2025 in one article | Quantum Bit Think Tank's annual AI 100
量子位· 2026-01-22 07:37
Core Viewpoint
- The article charts the transformation of China's AI product ecosystem in 2025, the "Year of AI Applications," in which the focus shifted from mere functionality to system reconstruction driven by advances in underlying models, user demand, and business-model evolution [5][6].

Group 1: AI Product Landscape
- China's 2025 AI market was characterized by the rise of major AI companies such as Zhipu and MiniMax, indicating a maturing market [3].
- The "AI 100" product list released by Quantum Bit Think Tank divides AI products into three main segments: "Flagship AI 100," "Innovative AI 100," and the top products from ten popular sectors [7][29].
- The "Flagship AI 100" highlights the strongest AI products of 2025, those with significant technological breakthroughs and practical application value [8][29].

Group 2: User Engagement and Market Trends
- The top five web AI products account for over 62% of monthly active users (MAU), while the top five mobile apps account for over 65% of daily active users (DAU) [12].
- AI general assistants and AI office platforms remain the most popular sectors, significantly outpacing other categories in user scale [12].
- The "Innovative AI 100" identifies products with potential for explosive growth in 2026, highlighting emerging trends across AI sectors [13][16].

Group 3: Sector-Specific Insights
- The article identifies ten key AI application sectors, including AI browsers, AI agents, AI smart assistants, and AI education, each with a top three exemplifying innovation and engineering excellence [19][23].
- The sector evaluations double as a retrospective on the 2025 AI application market, emphasizing the competitive landscape and user engagement [24].

Group 4: Evaluation Methodology
- The "AI 100" list uses a dual assessment system combining quantitative and qualitative metrics, weighing user data, growth, and long-term development potential [26].
- Quantitative metrics cover user scale, growth, and engagement; qualitative assessment considers technology, market space, and user experience [26].
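A dual quantitative-plus-qualitative assessment of the kind described is typically implemented as a weighted composite score. The sketch below is a generic illustration under assumed weights and metric names; it is not Quantum Bit Think Tank's actual methodology.

```python
# Generic weighted composite score for a dual quantitative/qualitative
# assessment. Weights and metric names are illustrative assumptions.

QUANT_WEIGHT, QUAL_WEIGHT = 0.6, 0.4

def composite_score(quant, qual):
    """Each argument maps a metric name to a normalized score in [0, 1]."""
    quant_avg = sum(quant.values()) / len(quant)
    qual_avg = sum(qual.values()) / len(qual)
    return QUANT_WEIGHT * quant_avg + QUAL_WEIGHT * qual_avg

product = composite_score(
    quant={"user_scale": 0.9, "growth": 0.7, "engagement": 0.8},
    qual={"technology": 0.85, "market_space": 0.6, "user_experience": 0.75},
)
print(round(product, 3))  # → 0.773
```

Ranking candidates by such a composite, then reviewing the top of the list qualitatively, is a common way to combine hard usage data with editorial judgment.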
Google Gemini turns free tutor: full-length practice exams, with wrong answers explained step by step
量子位· 2026-01-22 05:39
Core Viewpoint
- Google has introduced a free SAT practice exam feature on its Gemini platform that provides immediate scoring and explanations for incorrect answers, benefiting students preparing for the SAT [1][2][17].

Group 1: SAT Practice Exam Features
- The practice exam was developed in collaboration with The Princeton Review and incorporates a comprehensive set of verified SAT questions [7].
- Users can customize the testing experience, for example turning off the timer or enabling hints, to create a personalized practice environment [9].
- The math section's questions are noted as relatively easy; one example is a straightforward algebra problem [10][11].

Group 2: Educational Value and Functionality
- Gemini's primary value lies in explanation: it breaks down the problem-solving steps for users who struggle with particular questions [14][15].
- The platform lets users review incorrect answers and identify weak areas, turning a traditional study approach into more targeted tutoring [16].

Group 3: Future Aspirations and Broader Applications
- Google plans to extend Gemini's exam support beyond the SAT to other standardized tests [17].
- Gemini is also being integrated into sectors such as health and coding, aiming to provide specialized assistance in those fields [19].
- The overarching goal is to embed Gemini's capabilities into everyday digital experiences, enhancing user interaction across Google's services [20][21].
57.1% of people can't tell real from fake! Runway's new video model is explosive
量子位· 2026-01-22 05:39
Core Viewpoint
- The article covers Runway's new "Gen 4.5" model, which generates videos realistic enough to blur the line between AI-generated content and real footage, with significant gains in storytelling, detail, and consistency [8][9][11][22].

Group 1: Model Capabilities
- Gen 4.5 focuses on image-to-video generation, with improved camera control and narrative storytelling that produce a noticeable leap in quality [9][11].
- The model can quickly generate three different shots (close-up, medium, and long) within five seconds, keeping facial details highly consistent even as the camera moves [11][12].
- Storytelling has improved as well, supporting longer narrative structures and better shot-to-shot coherence, so the output starts to resemble a usable short film [16][18].

Group 2: Realism and Recognition
- In a survey of 1,000 participants, 57.1% could not reliably distinguish the AI-generated videos from real footage, suggesting the generation quality now approaches the limits of human perception [21][22].
- Gains in realism include enhanced texture fidelity, lighting, and overall visual quality, making AI-generated videos increasingly indistinguishable from real-life footage [25][26][28].

Group 3: Industry Trends
- The industry as a whole is demanding more realism and consistency from video models, with a focus on physical-world adherence and natural cross-frame performance [25][27].
- Sound synchronization is a growing emphasis: models can now generate audio that matches the visual content, enhancing the overall viewing experience [30][31].
- The rapid cadence of updates from various companies suggests the video-model landscape is evolving quickly, with new trends emerging frequently [35][36].