量子位
Open on Launch Day: What Does Baidu Orion Have Up Its Sleeve?
量子位· 2025-11-14 05:38
Core Viewpoint
- The article discusses the launch of Baidu's new AI-powered search system, "Baidu Orion," which aims to transform traditional search into a more interactive and intelligent experience, capable of understanding user intent and executing complex tasks [5][10][22].

Group 1: Baidu Orion Overview
- Baidu Orion is a multi-agent framework that enhances search capabilities by integrating AI APIs and a rich service ecosystem, allowing for task planning and execution [5][10].
- The system is designed to evolve search from merely providing answers to understanding user needs, remembering past interactions, and engaging in meaningful dialogue [6][13].

Group 2: User Experience Enhancements
- Baidu Orion can break down user queries into multiple needs and generate comprehensive plans, making the search experience more intuitive and user-friendly [10][11].
- The system supports multi-modal outputs, providing results in various formats such as images and videos, and even generating custom content based on user requests [11][21].

Group 3: AI Content Generation
- Baidu Orion's AI capabilities allow it to create new content rather than just retrieve existing information, with daily AI content generation exceeding ten million and video-model usage surpassing two million [21][22].
- This shift positions search as a creative tool, expanding its role from a simple query-response mechanism to a versatile assistant capable of generating tailored solutions [22].

Group 4: Industry Trends and Open Access
- The article highlights a broader industry trend in which search engines are evolving into foundational capabilities accessible via APIs, allowing developers to integrate advanced search functionality into their own applications [23][24].
- Baidu has opened up access to Baidu Orion, enabling 625 companies across various sectors to leverage its search AI API, which encapsulates 25 years of search technology and authoritative content resources [23][24].
Who Needs DiT? ByteDance's First Autoregressive Model Generates 5 Seconds of 720p Video in One Minute on a Single GPU | NeurIPS'25 Oral
量子位· 2025-11-14 05:38
Core Viewpoint
- The article discusses the introduction of InfinityStar, a new method developed by ByteDance's commercialization technology team, which significantly improves video generation quality and efficiency compared to the existing Diffusion Transformer (DiT) approach [4][32].

Group 1: InfinityStar Highlights
- InfinityStar is the first discrete autoregressive video generator to surpass diffusion models on VBench [9].
- It eliminates delays in video generation, transitioning from a slow denoising process to a fast autoregressive approach [9].
- The method supports various tasks including text-to-image, text-to-video, image-to-video, and interactive long-video generation [9][12].

Group 2: Technical Innovations
- The core architecture employs spatiotemporal pyramid modeling, allowing it to unify image and video tasks while running an order of magnitude faster than mainstream diffusion models [13][25].
- InfinityStar decomposes a video into two parts: the first frame, carrying static appearance information, and the subsequent clips, carrying dynamic information, effectively decoupling static and dynamic elements [14][15][16].
- Two key technologies enhance the model's performance: Knowledge Inheritance, which accelerates the training of a discrete visual tokenizer, and Stochastic Quantizer Depth, which balances information distribution across scales [19][21].

Group 3: Performance Metrics
- InfinityStar demonstrates superior performance on the text-to-image (T2I) task on the GenEval and DPG benchmarks, particularly excelling at spatial relationships and object positioning [25][28].
- On the text-to-video (T2V) task, InfinityStar outperforms all previous autoregressive models and achieves better results than DiT-based methods such as CogVideoX and HunyuanVideo [28][29].
- Generation speed is significantly faster than DiT-based methods: a 5-second 720p video can be generated in under one minute on a single GPU [31].
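The static/dynamic decomposition described above can be sketched in a few lines. This is an illustrative sketch only, not ByteDance's implementation: the array shapes, the `clip_len` parameter, and the `decompose_video` helper are assumptions made for the example.

```python
import numpy as np

def decompose_video(video: np.ndarray, clip_len: int = 4):
    """Split a video of shape (T, H, W, C) into a static first frame and
    fixed-length dynamic clips, mirroring the idea described in the article:
    the first frame carries appearance, the remaining frames carry motion.
    Shapes and helpers here are illustrative, not InfinityStar's code."""
    first_frame = video[:1]   # static appearance information
    rest = video[1:]
    # Group the remaining frames into fixed-length clips (drop a ragged tail).
    n_clips = len(rest) // clip_len
    clips = rest[: n_clips * clip_len].reshape(n_clips, clip_len, *video.shape[1:])
    return first_frame, clips

# An autoregressive generator would then condition each clip on everything
# before it: tokens(frame0) -> tokens(clip1) -> tokens(clip2) -> ...
video = np.zeros((9, 8, 8, 3), dtype=np.float32)  # 9 frames of 8x8 RGB
frame0, clips = decompose_video(video, clip_len=4)
print(frame0.shape, clips.shape)  # (1, 8, 8, 3) (2, 4, 8, 8, 3)
```

The point of the split is that appearance is encoded once, so the autoregressive loop only has to predict motion tokens clip by clip.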
Cracking Multi-Modal LLMs' "Decision Paralysis": Internal Decision Mechanisms Revealed for the First Time, Wildly "Oscillating" Between Conflicting Inputs
量子位· 2025-11-14 05:38
Core Argument
- The article argues that modality following in multi-modal large language models (MLLMs) is a dynamic process influenced by relative reasoning uncertainty and inherent modality preference, rather than a static attribute [1][4][37].

Group 1: Research Contributions
- A new toy dataset was constructed to systematically and independently vary the reasoning difficulty of visual and textual inputs, enabling different difficulty combinations for multi-modal inputs [4].
- The study decomposes the explicit behavior of modality following into two core components: case-specific relative reasoning uncertainty and the model's stable inherent modality preference [4][5].
- An empirical finding indicates that the probability of a model following a given modality decreases monotonically as that modality's relative reasoning uncertainty increases [5].

Group 2: Framework Design
- A controlled dataset was created to validate the hypotheses, allowing independent control of visual and textual reasoning complexity [9][10].
- Uncertainty was measured using output entropy, which reflects the model's perceived uncertainty: lower entropy indicates confident predictions, while higher entropy indicates the model is weighing alternative options [11].
- Relative uncertainty was quantified to measure the confidence gap between the text and visual modalities, providing the core metric for subsequent analysis [12].

Group 3: Limitations of Traditional Metrics
- Traditional macro metrics such as Text Following Rate (TFR) and Visual Following Rate (VFR) were tested on the constructed dataset, revealing confusing patterns that highlight their limitations [14].
- The study identifies a common trend in which models perceive text as easier on average yet exhibit opposite macro preferences, raising questions about the underlying reasons for these discrepancies [15][16].

Group 4: Experimental Paradigm
- A new experimental paradigm was designed to decouple model capability from preference, allowing a clearer understanding of the model's decision-making process [18].
- The researchers grouped data points by relative uncertainty to create a complete preference curve, reflecting how model preferences change dynamically with relative difficulty [18].

Group 5: Key Experimental Findings
- All tested models exhibited a consistent trend: the probability of following text decreases smoothly as text becomes relatively more difficult [19][21].
- The "balance point" was defined as the point where the curve crosses the 50% probability line, serving as a quantifiable measure of inherent modality preference [22].
- The framework successfully explained previous puzzles about model behavior by revealing differences in inherent preferences that macro metrics could not surface [23][24].

Group 6: Internal Mechanisms
- The study explored the internal decision-making mechanisms of models, particularly their oscillation behavior when faced with conflicting information near the balance point [29][30].
- Models exhibit higher oscillation counts in ambiguous regions, providing a mechanistic explanation for the indecision observed in external behavior [34][36].

Conclusion
- The research presents a new framework for understanding modality following in MLLMs, emphasizing the separation of model capability from inherent preference and revealing a robust rule: the likelihood of following a modality decreases as its relative uncertainty increases [37].
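The entropy-based uncertainty metric described above is easy to make concrete. A minimal sketch, assuming softmax output distributions; the example distributions and variable names are mine, not the paper's:

```python
import math

def entropy(probs):
    """Shannon entropy of an output distribution (in nats).
    Low entropy = confident prediction; high entropy = the model is
    weighing alternative options."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical single-modality answer distributions for one conflicting case.
text_probs = [0.9, 0.05, 0.05]    # confident about the text-implied answer
visual_probs = [0.4, 0.35, 0.25]  # uncertain about the visual-implied answer

# Relative uncertainty: the confidence gap between the two modalities.
# Positive => text is relatively harder, so following text should be less
# likely; the paper's preference curve plots follow-rate against this axis.
relative_uncertainty = entropy(text_probs) - entropy(visual_probs)
print(round(relative_uncertainty, 3))  # -0.686: text is the easier modality here
```

Grouping many such cases by `relative_uncertainty` and measuring the text-following rate in each bin yields the preference curve; where that curve crosses 50% is the balance point.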
Tencent's President Teases an AI Agent for WeChat, as Alibaba and Google Start Trading Blows
量子位· 2025-11-14 05:38
Group 1
- Major tech companies are engaged in a competitive AI product battle, with Alibaba, Google, and Tencent all making significant moves in the AI space [3][4][31].
- Alibaba is planning to revamp its Tongyi app, rebranding it as "Qwen" and integrating AI capabilities to enhance its e-commerce platform [6][7][8].
- Google has introduced new AI shopping features aimed at enhancing the online shopping experience, allowing users to search, compare, and check out products using AI [16][18][21].

Group 2
- Tencent is focusing on integrating AI into its WeChat platform, with plans to develop an AI agent that can assist users with various tasks within the app [22][30].
- Tencent's Q3 financial report highlighted 15% year-on-year revenue growth, with AI becoming a central theme in its strategic narrative [23][24].
- The competition among these companies centers on creating an "end-to-end closed loop" of user service, redefining the value chain of the internet landscape [33].
AI Coding's Most Expensive 300 People: a ¥205 Billion Valuation in 2 Years, and Another ¥16 Billion Just Poured In
量子位· 2025-11-14 02:04
Core Insights
- Cursor has emerged as a leading player in the AI coding sector, recently achieving a significant milestone with a $2.3 billion Series D funding round that brings its valuation to $29.3 billion [2][3][6].
- The company has rapidly expanded its team to over 300 employees and surpassed $1 billion in annual revenue, positioning it as one of the fastest-growing companies in history [8][18][19].
- Cursor's approach focuses on enhancing the capabilities of top developers rather than making coding accessible to everyone, aiming to integrate deeply into enterprise-level development processes [21][25][26].

Funding and Valuation
- Cursor completed a $2.3 billion Series D round at a post-money valuation of approximately $29.3 billion, nearly three times its valuation at the previous funding round in June [3][6].
- The round included notable investors such as Google, Nvidia, and Coatue, alongside existing investors like Andreessen Horowitz [5][6].
- The company's valuation trajectory has been remarkable, growing from roughly $400 million at its Series A to $29.3 billion in just two years [12][15].

Product and Market Position
- Cursor's product, an AI programming tool, is designed to significantly enhance coding efficiency; since the launch of its self-developed model, Composer, the company claims to generate more code than all other large language models combined [8][33].
- The tool is used by millions of developers globally, including teams at major companies like Nvidia, Adobe, and PayPal [24].
- Cursor aims to elevate the performance of skilled developers rather than democratize coding, which differentiates it from other AI coding tools [25][28].

Company Culture and Team Dynamics
- Cursor's internal culture is characterized by a strong work ethic, with employees voluntarily dedicating weekends to projects, reflecting a commitment to innovation and productivity [37][40].
- Despite significant financial success, the company maintains a low-key atmosphere, focusing on continuous improvement and development rather than celebratory events [36][40].
- The founders, who were MIT students when they started Cursor, have seen their personal fortunes rise sharply, with each holding approximately 4.5% of the company, translating to a net worth of at least $1.3 billion apiece [41][42].
Lei Jun's College Bunkmate Launches a Household-Robot Startup
量子位· 2025-11-14 02:04
Core Viewpoint
- The article chronicles the entrepreneurial journey of Cui Baoqiu, a former vice president of Xiaomi, who is now venturing into robotics with a focus on household service robots, marking a shift from his previous role in AI and IoT at Xiaomi [2][4][6].

Group 1: Background and Transition
- Cui Baoqiu, known as the "father" of technology at Xiaomi, is now betting on embodied intelligence, a hot trend in the tech industry [2][4].
- After leaving Xiaomi, he initially served as chief technical advisor at a RISC-V chip company, indicating a focus on foundational technology before moving into robotics [8][10].
- His departure from Xiaomi represents a significant career shift, moving from a large corporate structure to a more challenging entrepreneurial path [6][12].

Group 2: Vision and Strategy
- Cui aims to create a household service robot that embodies the ultimate form of AIoT, integrating various smart devices into a single, interactive entity [7][8].
- He envisions transforming his technical blueprint from "connecting everything" to "transforming the physical world" through robotics [4][5].
- His previous experience at Xiaomi, where he was a key player in developing AI and cloud technologies, positions him well for this new venture [15][28].

Group 3: Industry Trends
- Giving AI a physical embodiment is gaining traction, with many former executives from major companies like Huawei and Horizon also launching robotics ventures [40][42].
- Embodied intelligence is seen as the next phase of AI development, as software alone is insufficient to realize AI's full potential [40][41].
- This shift reflects a broader trend of former tech leaders building the physical "bodies" for AI systems, indicating a competitive, high-expectation environment in the robotics sector [45][46].
Schooling a Model with 1.55 Million Simulated Videos: GVE Learns 9 Video Retrieval Skills in One Go
量子位· 2025-11-13 11:52
Core Insights
- The article discusses the limitations of current video retrieval models, which are primarily optimized for coarse-grained text-video matching, leading to biased training data and restricted model capabilities [1][6][7].
- A new paradigm for video retrieval is proposed, shifting from specialized to universal models, with the introduction of the Universal Video Retrieval (UVR) concept and the Universal Video Retrieval Benchmark (UVRB) [2][12][16].
- The GVE model, developed as part of this initiative, demonstrates superior generalization, outperforming existing models in zero-shot settings [3][4][26].

Group 1: Current Challenges in Video Retrieval
- Existing models excel on benchmarks like MSRVTT but struggle with complex real-world retrieval needs such as multi-modal queries and fine-grained semantic understanding [6][7].
- Training data for these models often carries noisy labels, leading to a lack of robustness and generalization in complex scenarios [7][9].
- The article highlights the need for a unified multi-modal representation framework for video retrieval, similar to advancements in image retrieval [8][9].

Group 2: Introduction of UVR and UVRB
- The UVR concept takes a comprehensive approach to video retrieval, integrating various tasks and domains to better reflect real-world scenarios [13][15].
- UVRB consists of 16 datasets covering multiple task types and domains, exposing the uneven, "lopsided" performance of existing models [17][18][28].
- The benchmark assesses models across nine capabilities, emphasizing a more holistic evaluation of video retrieval systems [17][18].

Group 3: GVE Model Performance
- The GVE model, available in 3B and 7B parameter versions, significantly outperforms 14 mainstream models, achieving an average Recall@1 of 0.573 [26][27].
- The GVE-3B model, with 3.8 billion parameters, surpasses larger models like Unite-7B, demonstrating that performance is driven by data quality and training strategy rather than sheer model size [27][31].
- GVE-7B excels particularly at the "partially relevant video retrieval" task, showcasing its semantic discrimination capabilities [29][30].

Group 4: Key Findings and Insights
- The study finds that traditional benchmarks like MSRVTT are misleading, correlating poorly with real-world performance, and suggests adopting "partially relevant retrieval" as a standard evaluation metric [38].
- There is a significant disconnect between spatial and temporal understanding in current models, indicating a need for better integration of these capabilities [39][40].
- Model architecture (e.g., CLIP vs. MLLM) influences performance, with MLLM-based models showing more balanced learning across tasks [41][42].

Group 5: Future Directions
- The research emphasizes developing a diagnostic, scalable, and reproducible framework for universal video retrieval, moving beyond raw performance metrics [48][49].
- The combination of UVRB, high-quality synthetic data generation, and a structured training approach is expected to improve robustness and generalization [49][50].
- The ultimate goal is to move video retrieval from simple matching toward deeper content understanding, which requires new evaluation standards and richer training signals [48][49].
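Recall@1, the metric cited above, simply checks whether the top-ranked retrieved video is a correct match for each query. A minimal sketch (the ranking and relevance formats are assumptions for the example, not the UVRB evaluation code):

```python
def recall_at_1(rankings, relevant):
    """rankings: {query_id: [video_id, ...]} ordered best-first.
    relevant: {query_id: set of correct video_ids}.
    Returns the fraction of queries whose top-1 result is relevant."""
    hits = sum(1 for q, ranked in rankings.items()
               if ranked and ranked[0] in relevant[q])
    return hits / len(rankings)

# Three hypothetical queries: q1 and q3 rank a correct video first, q2 does not.
rankings = {"q1": ["v3", "v1"], "q2": ["v2", "v7"], "q3": ["v9", "v4"]}
relevant = {"q1": {"v3"}, "q2": {"v7"}, "q3": {"v9"}}
print(round(recall_at_1(rankings, relevant), 3))  # 2 of 3 hits -> 0.667
```

An average Recall@1 of 0.573 thus means that, across the benchmark's queries, the top-ranked result was a correct match about 57% of the time.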
LeCun's Final Paper at Meta
量子位· 2025-11-13 11:52
Core Insights
- The article introduces LeJEPA, a self-supervised learning method developed by Yann LeCun, marking his farewell work at Meta [2][3][4].
- LeJEPA addresses the representation-collapse issue of traditional JEPA frameworks by using isotropic Gaussian embeddings and introducing SIGReg regularization to improve model generalization [5][6].

Group 1: LeJEPA Overview
- LeJEPA is built on isotropic Gaussian embeddings, which effectively mitigate the representation-collapse problem and significantly improve generalization [5].
- The traditional JEPA framework often suffers representation collapse, where the model maps all inputs to a single point, preventing it from capturing semantic differences [6].

Group 2: Impact of Embedding Distribution
- The study analyzed the impact of the embedding distribution on bias and variance via ordinary least squares regression, showing that an isotropic Gaussian distribution minimizes both during training [8][9].
- An isotropic Gaussian distribution yields lower bias and variance than non-isotropic distributions, improving stability and accuracy on downstream tasks [9][11][13].

Group 3: SIGReg Regularization
- SIGReg (Sketched Isotropic Gaussian Regularization) achieves distribution matching by recasting the problem as hypothesis testing [15][17].
- It combines univariate directional tests with the Epps-Pulley test to assess how well the embedding distribution matches the target isotropic Gaussian [16][17].

Group 4: High-Dimensional Challenges
- SIGReg addresses the computational challenges of high-dimensional spaces and supports efficient, stable mini-batch training [19][21].
- The total LeJEPA loss is a weighted sum of the SIGReg loss and the predictive loss, with a hyperparameter λ balancing their contributions [22].

Group 5: Experimental Validation
- Extensive experiments on large architectures, including ViT, ConvNeXt, ResNet, MaxViT, and Swin Transformer, showed that LeJEPA outperforms existing methods while keeping training simple and robust [20][23].
- On domain-specific datasets such as Galaxy10 and Food101, LeJEPA pre-trained directly on the target data surpassed DINOv2-based transfer learning [24].

Group 6: JEPA Framework Evolution
- JEPA (Joint-Embedding Predictive Architecture) has evolved over the three years since LeCun introduced it, focusing on improving model expressiveness and reasoning through joint prediction [31][28].
- Unlike generative models, JEPA captures the dependencies between x and y without explicitly generating predictions for y [32].

Group 7: Future Directions
- Although LeJEPA closes out LeCun's research at Meta, it does not mark the end of JEPA's development: LeCun is reportedly raising funds for a startup focused on world models [72][71].
- LeCun's departure from Meta, while not entirely graceful, caps a significant period of achievement in AI research [74][79].
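The "weighted sum of SIGReg loss and predictive loss" idea can be sketched compactly. This is a simplified stand-in, not LeJEPA's implementation: instead of the Epps-Pulley statistical test, the isotropy term below just penalizes first- and second-moment deviation of random 1-D projections from N(0, 1), and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigreg_moment_sketch(z: np.ndarray, n_dirs: int = 16) -> float:
    """Simplified stand-in for SIGReg: project embeddings onto random unit
    directions and penalize deviation of each 1-D projection from N(0, 1)
    via its first two moments. (The real method uses statistical tests such
    as Epps-Pulley; this only illustrates the idea.)  z: (batch, dim)."""
    dirs = rng.normal(size=(z.shape[1], n_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit directions
    proj = z @ dirs                                      # (batch, n_dirs)
    mean_pen = (proj.mean(axis=0) ** 2).mean()           # want mean 0
    var_pen = ((proj.var(axis=0) - 1.0) ** 2).mean()     # want variance 1
    return float(mean_pen + var_pen)

def lejepa_style_loss(pred, target, z, lam=0.05):
    """Weighted sum of a predictive loss and the isotropy regularizer,
    mirroring 'total loss = predictive + λ · SIGReg' from the article."""
    predictive = float(((pred - target) ** 2).mean())
    return predictive + lam * sigreg_moment_sketch(z)

z_good = rng.normal(size=(512, 32))   # roughly isotropic Gaussian embeddings
z_collapsed = np.ones((512, 32))      # collapse: every input maps to one point
assert sigreg_moment_sketch(z_good) < sigreg_moment_sketch(z_collapsed)
print(round(lejepa_style_loss(z_good[:64], z_good[64:128], z_good), 3))
```

The final assertion shows why the regularizer fights collapse: a degenerate embedding cloud has zero variance along every direction and is penalized heavily, while an isotropic Gaussian cloud scores near zero.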
One Demo and the Robot Gets to Work? A Peking University & BeingBeyond Team Uses a "Hierarchical Cerebellum + Simulated Twin" to Put the G1 on the Job Zero-Shot
量子位· 2025-11-13 09:25
Core Insights
- The article introduces the DemoHLM framework, which allows humanoid robots to generate extensive training data from a single human demonstration in a simulated environment, addressing key challenges in loco-manipulation [1][22].

Group 1: Challenges in Humanoid Robot Manipulation
- Humanoid manipulation faces a "triple dilemma": existing solutions either rely on simulation alone or require extensive real-world teleoperation data, making them impractical for complex environments like homes and factories [3][6].
- Traditional methods suffer from low data efficiency, poor task generalization, and difficult sim-to-real transfer, leading to high costs and limited scalability [6][20].

Group 2: Innovations of DemoHLM
- DemoHLM employs a hierarchical control architecture that separates motion control from task decision-making, enhancing both flexibility and stability [7][20].
- The framework's key innovation is generating a large volume of diverse training data from just one demonstration, significantly improving data efficiency and generalization [8][20].

Group 3: Experimental Validation
- The framework was validated comprehensively in simulation (IsaacGym) and on a real Unitree G1 robot across ten manipulation tasks, with notable success rates [9][19].
- As synthetic data volume grew from 100 to 5,000 samples, task success rates improved significantly, demonstrating the effectiveness of the data-generation pipeline [14][20].

Group 4: Industry Implications and Future Directions
- DemoHLM's advances provide critical technical support for practical humanoid deployment, reducing training costs and improving generalization across scenarios [19][20].
- The framework is designed to accommodate future upgrades such as tactile sensors and multi-camera perception, paving the way for more complex operating environments [21][20].
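The one-demo-to-many-episodes idea behind the 100-to-5,000 scaling above can be sketched as randomized replay. This is an illustrative sketch under assumed conventions, not the DemoHLM pipeline: the waypoint representation, `noise` bound, and `amplify_demo` helper are all hypothetical.

```python
import random

def amplify_demo(demo_waypoints, n_episodes=1000, noise=0.05, seed=7):
    """Illustrative one-demo data amplification: replay a single demonstrated
    waypoint sequence under randomized perturbations, so one human demo yields
    many distinct simulated training episodes (e.g. varied object poses).
    demo_waypoints: list of (x, y, z) end-effector targets from one demo."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n_episodes):
        # Jitter every waypoint within +/- noise to mimic a perturbed scene.
        episodes.append([
            tuple(c + rng.uniform(-noise, noise) for c in wp)
            for wp in demo_waypoints
        ])
    return episodes

demo = [(0.3, 0.0, 0.8), (0.4, 0.1, 0.9)]  # hypothetical two-waypoint demo
data = amplify_demo(demo, n_episodes=5000)
print(len(data))  # 5000 episodes from a single demonstration
```

Each replayed episode is then executed in the simulator, and only successful rollouts would be kept as training data, which is what makes a single demonstration scale.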