量子位
Teaching a model with 1.55 million simulated videos: GVE learns nine video retrieval skills at once
量子位· 2025-11-13 11:52
Core Insights
- The article discusses the limitations of current video retrieval models, which are optimized mainly for coarse-grained text-video matching; this biases the training data and restricts model capability [1][6][7]
- A new paradigm is proposed that shifts video retrieval from specialized to universal models, introducing the Universal Video Retrieval (UVR) concept and the Universal Video Retrieval Benchmark (UVRB) [2][12][16]
- The GVE model, developed as part of this initiative, demonstrates superior generalization, outperforming existing models in zero-shot settings [3][4][26]

Group 1: Current Challenges in Video Retrieval
- Existing models excel on benchmarks such as MSRVTT but struggle with complex real-world retrieval needs, such as multi-modal queries and fine-grained semantic understanding [6][7]
- The training data for these models often carries noisy labels, leading to poor robustness and generalization in complex scenarios [7][9]
- The article highlights the need for a unified multi-modal representation framework in video retrieval, mirroring advances in image retrieval [8][9]

Group 2: Introduction of UVR and UVRB
- The UVR concept takes a comprehensive approach to video retrieval, integrating various tasks and domains to better reflect real-world scenarios [13][15]
- The UVRB consists of 16 datasets covering multiple task types and domains, exposing the "偏科" (uneven performance) problem of existing models [17][18][28]
- The benchmark assesses models across nine capabilities, emphasizing the need for a more holistic evaluation of video retrieval systems [17][18]

Group 3: GVE Model Performance
- The GVE model, available in 3B and 7B parameter versions, significantly outperforms 14 mainstream models, achieving an average Recall@1 of 0.573 (a sketch of this metric follows below) [26][27]
- The GVE-3B model, with 3.8 billion parameters, surpasses larger models such as Unite-7B, showing that performance is driven by data quality and training strategy rather than sheer model size [27][31]
- GVE-7B excels particularly at the partially relevant video retrieval task, showcasing its semantic discrimination capability [29][30]

Group 4: Key Findings and Insights
- The study finds that traditional benchmarks such as MSRVTT are misleading, correlating weakly with real-world performance, and suggests adopting partially relevant retrieval as a standard evaluation task [38]
- There is a significant disconnect between spatial and temporal understanding in current models, indicating a need for tighter integration of the two capabilities [39][40]
- Model architecture (e.g., CLIP-style versus MLLM-based) influences performance, with MLLM-based models learning more evenly across tasks [41][42]

Group 5: Future Directions
- The research emphasizes building a diagnostic, scalable, and reproducible framework for universal video retrieval, moving beyond mere performance metrics [48][49]
- The combination of UVRB, high-quality synthetic data generation, and a structured training approach is expected to improve robustness and generalization [49][50]
- The ultimate goal is to move video retrieval from simple matching to deeper content understanding, which requires new evaluation standards and richer training signals [48][49]
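The benchmark numbers above are reported as Recall@1, the fraction of queries whose top-ranked video is the correct one. Below is a minimal, generic sketch of how that metric is computed from embedding matrices; the embeddings here are random placeholders, not GVE outputs, and all names are illustrative.

```python
import numpy as np

def recall_at_1(query_emb: np.ndarray, video_emb: np.ndarray, gt_index: np.ndarray) -> float:
    """Fraction of queries whose top-ranked video is the ground-truth match."""
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = q @ v.T                      # (num_queries, num_videos) similarity matrix
    top1 = sims.argmax(axis=1)          # index of the best-scoring video per query
    return float((top1 == gt_index).mean())

# Toy usage with random embeddings standing in for real text/video encoder outputs
rng = np.random.default_rng(0)
queries, videos = rng.normal(size=(8, 16)), rng.normal(size=(32, 16))
print(recall_at_1(queries, videos, gt_index=np.arange(8)))
```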
LeCun's last paper at Meta
量子位· 2025-11-13 11:52
Core Insights
- The article discusses LeJEPA, a self-supervised learning method developed by Yann LeCun and collaborators, presented as his farewell work at Meta [2][3][4]
- LeJEPA addresses the representation collapse issue of traditional JEPA frameworks by using isotropic Gaussian embeddings and introducing SIGReg regularization to improve model generalization [5][6]

Group 1: LeJEPA Overview
- LeJEPA is built on isotropic Gaussian embeddings, which effectively mitigate representation collapse and significantly improve generalization [5]
- The traditional JEPA framework often suffers from representation collapse, where the model maps all inputs to a single point and can no longer capture semantic differences [6]

Group 2: Impact of Embedding Distribution
- The study analyzed how the embedding distribution affects bias and variance using ordinary least squares regression, showing that an isotropic Gaussian distribution minimizes both during training [8][9]
- Compared with non-isotropic distributions, an isotropic Gaussian yields lower bias and variance, improving stability and accuracy on downstream tasks [9][11][13]

Group 3: SIGReg Regularization
- SIGReg (Sketched Isotropic Gaussian Regularization) is introduced to achieve distribution matching by recasting the problem as hypothesis testing [15][17]
- It combines univariate tests along projection directions with the Epps-Pulley test to assess how closely the embedding distribution matches the target isotropic Gaussian [16][17]

Group 4: High-Dimensional Challenges
- LeJEPA handles the computational challenges of high-dimensional spaces by combining SIGReg with the predictive loss, keeping mini-batch training efficient and stable [19][21]
- The total loss in LeJEPA is a weighted sum of the SIGReg loss and the predictive loss, with a hyperparameter λ balancing their contributions (see the sketch below) [22]

Group 5: Experimental Validation
- Extensive experiments on large architectures, including ViT, ConvNeXt, ResNet, MaxViT, and Swin Transformer, show that LeJEPA outperforms existing methods while keeping training simple and robust [20][23]
- On domain-specific datasets such as Galaxy10 and Food101, LeJEPA pre-trained directly on the target data surpassed DINOv2-based transfer learning [24]

Group 6: JEPA Framework Evolution
- JEPA (Joint-Embedding Predictive Architecture) has evolved over the three years since LeCun introduced it, aiming to improve model expressiveness and reasoning through joint-embedding prediction [31][28]
- Unlike generative models, JEPA captures the dependency between x and y without explicitly generating a prediction of y [32]

Group 7: Future Directions
- Although LeJEPA marks the end of LeCun's research at Meta, it does not conclude JEPA's development; LeCun is reportedly raising funds to found a startup focused on world models [72][71]
- LeCun's departure from Meta, while not entirely smooth, caps a significant period of achievement in AI research that has advanced the field [74][79]
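To make the loss structure concrete, here is a minimal PyTorch sketch of the weighted combination described above. The SIGReg term is replaced by a crude moment-matching surrogate over random 1D projections; the actual paper uses an Epps-Pulley characteristic-function test, so treat this as an illustration of the loss composition under stated assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sigreg_surrogate(z: torch.Tensor, num_directions: int = 16) -> torch.Tensor:
    """Crude stand-in for SIGReg: project embeddings onto random unit directions
    and penalize deviations of each 1D projection from a standard Gaussian
    (first and second moments only; the paper uses an Epps-Pulley test instead)."""
    d = z.shape[-1]
    dirs = torch.randn(d, num_directions, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)       # random unit directions
    proj = z @ dirs                                     # (batch, num_directions)
    mean_pen = proj.mean(dim=0).pow(2).mean()           # projected mean should be 0
    var_pen = (proj.var(dim=0) - 1.0).pow(2).mean()     # projected variance should be 1
    return mean_pen + var_pen

def lejepa_loss(pred: torch.Tensor, target: torch.Tensor, z: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    """Total loss = predictive loss + lambda * distribution regularizer."""
    return F.mse_loss(pred, target) + lam * sigreg_surrogate(z)

# Toy usage with random tensors standing in for encoder outputs
z = torch.randn(128, 64)
pred, target = torch.randn(128, 64), torch.randn(128, 64)
print(lejepa_loss(pred, target, z, lam=0.5))
```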
Demonstrate just once and the robot gets to work? A joint Peking University & BeingBeyond team puts the G1 on the job zero-shot with a "hierarchical cerebellum + simulated avatar" approach
量子位· 2025-11-13 09:25
Core Insights
- The article introduces the DemoHLM framework, which lets humanoid robots generate large amounts of training data from a single human demonstration in a simulated environment, addressing key challenges in loco-manipulation [1][22]

Group 1: Challenges in Humanoid Robot Manipulation
- Humanoid manipulation faces a "triple dilemma": existing solutions either stay in simulation or require extensive real-world teleoperation data, making them impractical for complex environments such as homes and industrial settings [3][6]
- Traditional methods suffer from low data efficiency, poor task generalization, and difficult sim-to-real transfer, leading to high costs and limited scalability [6][20]

Group 2: Innovations of DemoHLM
- DemoHLM uses a hierarchical control architecture that separates low-level motion control from task-level decision-making, improving both flexibility and stability [7][20]
- Its key innovation is generating a large volume of diverse training data from a single demonstration, significantly improving data efficiency and generalization (a schematic sketch of such a pipeline follows below) [8][20]

Group 3: Experimental Validation
- The framework was validated both in simulation (IsaacGym) and on a real Unitree G1 robot, covering ten manipulation tasks with notable success rates [9][19]
- As the volume of synthetic data grew from 100 to 5,000, task success rates improved markedly, demonstrating the effectiveness of the data generation pipeline [14][20]

Group 4: Industry Implications and Future Directions
- DemoHLM provides critical technical support for the practical deployment of humanoid robots, reducing training costs and improving generalization across scenarios [19][20]
- The framework is designed to accommodate future upgrades such as tactile sensors and multi-camera perception, paving the way for more complex operating environments [21][20]
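The single-demonstration data generation described above can be pictured with the schematic sketch below: randomize the scene, perturb the demonstrated commands, and (in the real pipeline) keep only episodes that still succeed when replayed in simulation. All names, ranges, and numbers here are hypothetical illustrations, not the authors' code.

```python
import random
from dataclasses import dataclass

@dataclass
class Episode:
    object_pose: tuple   # randomized initial object pose (x, y, yaw)
    actions: list        # command sequence handed to the low-level controller

def generate_episodes(demo_actions: list, n: int = 5000, noise: float = 0.05) -> list:
    """Synthesize n training episodes from one demonstrated action sequence by
    randomizing the scene and perturbing the reference commands; in the real
    pipeline each episode would be replayed in simulation (e.g. IsaacGym) and
    kept only if the task still succeeds."""
    episodes = []
    for _ in range(n):
        pose = (random.uniform(-0.1, 0.1),   # made-up randomization ranges
                random.uniform(-0.1, 0.1),
                random.uniform(-0.2, 0.2))
        actions = [a + random.gauss(0.0, noise) for a in demo_actions]
        episodes.append(Episode(object_pose=pose, actions=actions))
    return episodes

print(len(generate_episodes(demo_actions=[0.1, 0.2, 0.3], n=100)))  # -> 100
```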
Today's Silicon Valley tech headline is a game console
量子位· 2025-11-13 09:25
Core Viewpoint
- Valve has launched three new gaming hardware devices, the Steam Frame VR headset, the Steam Machine console, and a new Steam Controller, aiming to build a comprehensive gaming ecosystem [4][24][33]

Group 1: Steam Frame VR Headset
- The Steam Frame is positioned as a "standalone + wireless streaming" VR headset, featuring a Qualcomm Snapdragon 8 Gen 3-class Arm chip and a microSD slot, supporting both local game execution and wireless streaming from a PC [10][12]
- It has a modular, lightweight design, weighing approximately 440 grams, significantly lighter than the earlier Valve Index [13]
- The headset uses dual LCD screens with a resolution of 2160×2160 pixels per eye and supports a maximum refresh rate of 144 Hz [16]
- It incorporates eye tracking and foveated streaming to optimize bandwidth usage and rendering efficiency [19]
- The Steam Frame is expected to be priced below $1,000 and will replace the Valve Index in the lineup [23]

Group 2: Steam Machine
- The Steam Machine is a desktop computer running SteamOS, designed for seamless integration with VR gaming [24][25]
- It delivers more than six times the performance of the Steam Deck, featuring an AMD Zen 4 CPU and an AMD RDNA3 GPU [27]
- Users can activate the Steam Machine directly through the Steam Frame without needing a physical display [28]

Group 3: Steam Controller
- The new Steam Controller features magnetoresistive joysticks, is designed for both flat-screen and VR play, and offers up to 40 hours of battery life [20]
- It supports high-precision input and haptic feedback, making it suitable for complex PC game genres [32]
- The controller can connect wirelessly or via cable, adding versatility for gamers [29]

Group 4: Ecosystem Integration
- The simultaneous launch of the three products signals Valve's strategy of integrating hardware and software into a cohesive gaming ecosystem, potentially transforming the gaming experience [33]
- Together, the Steam Frame, Steam Machine, and Steam Controller aim to form a complete gaming solution for users [33]
One model that reads all medical data: Hulu-Med explores a new open-source paradigm for medical large models | Zhejiang University x Shanghai Jiao Tong University x UIUC
量子位· 2025-11-13 09:25
Core Insights
- The article discusses the evolution of medical AI from specialized assistants to versatile generalist models, introducing Hulu-Med, which integrates understanding of medical text, 2D images, 3D volumes, and medical videos in a single framework [1][2]

Group 1: Overview of Hulu-Med
- Hulu-Med is a generalist medical AI model developed collaboratively by several institutions, including Zhejiang University and Shanghai Jiao Tong University, aiming to unify diverse medical data modalities [1][6]
- The model is open source, trained on publicly available datasets and synthetic data, significantly reducing GPU training cost while matching proprietary models such as GPT-4.1 across 30 authoritative evaluations [4][5]

Group 2: Challenges in Medical AI
- The current medical AI landscape is fragmented and often opaque, with many specialized models acting as isolated "information islands" that complicate the integration of multimodal patient data [7][9]
- Large language models offer an opportunity to address these challenges, but the lack of transparency in leading medical AI systems remains a major barrier to adoption [8][9]

Group 3: Design Principles of Hulu-Med
- Hulu-Med's development follows three core principles: holistic understanding, efficiency at scale, and end-to-end transparency [10]
- The model aims to be a "medical generalist" capable of comprehensively understanding diverse data types to assess patient health [11]

Group 4: Innovations in Transparency and Openness
- Hulu-Med relies solely on publicly available data to avoid privacy and copyright risks, and the team assembled the largest known open medical multimodal corpus, with 16.7 million samples [16][17]
- Its open-source release lets researchers replicate and improve the work, fostering a collaborative environment for building reliable medical AI applications [18]

Group 5: Unified Multimodal Understanding
- Hulu-Med natively processes text, 2D images, 3D volumes, and medical videos within a single model, overcoming the limitations of traditional designs that require separate encoders per modality [20][22]
- The use of 2D rotary position encoding and a unified visual encoding unit lets the model capture spatial and temporal continuity without dedicated 3D- or video-specific modules [23][25]

Group 6: Efficiency and Scalability
- Hulu-Med balances performance and efficiency through strategies such as medical-aware token reduction, which trims redundancy in 3D and video data and cuts visual token counts by roughly 55% (see the sketch below) [33][35]
- Training proceeds in three progressive stages, broadening the model's ability to learn from diverse data types while keeping training costs under control [37][41]

Group 7: Performance Evaluation
- Hulu-Med was evaluated across 30 public medical benchmarks, outperforming existing open-source medical models on 27 tasks and matching or exceeding top proprietary systems on 16 tasks [48][49]
- The model shows strong capabilities on complex tasks such as multilingual medical understanding and rare-disease diagnosis, underscoring its potential for clinical use [51]

Group 8: Future Directions
- Future research will integrate more multimodal data, expand open data sources, strengthen clinical reasoning, establish efficient continual-learning mechanisms, and validate the model in real clinical workflows [52]
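As a rough illustration of the medical-aware token reduction mentioned above, the sketch below keeps only the highest-scoring visual tokens. The importance score here is just the feature norm, a stand-in for whatever learned saliency signal the real model uses, and the ~55% reduction is matched only by choosing keep_ratio = 0.45; treat this as a generic pruning sketch, not Hulu-Med's method.

```python
import torch

def reduce_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.45) -> torch.Tensor:
    """Keep only the most 'informative' visual tokens, scored here by feature norm
    as a placeholder for a learned medical-saliency score.

    tokens: (num_tokens, dim) visual tokens from a 3D volume or video clip.
    """
    scores = tokens.norm(dim=-1)                          # proxy importance score
    k = max(1, int(tokens.shape[0] * keep_ratio))
    keep = scores.topk(k).indices.sort().values           # preserve original token order
    return tokens[keep]

frames = torch.randn(1024, 768)            # e.g. tokens from a stack of CT slices
print(reduce_visual_tokens(frames).shape)  # -> torch.Size([460, 768])
```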
Final week! Applications for the Artificial Intelligence Annual List close soon
量子位· 2025-11-13 09:25
Core Points - The "2025 Artificial Intelligence Annual List" application has entered its countdown phase, marking the 8th year of the event, which has witnessed technological breakthroughs and industry transformations [1] - The evaluation will focus on three dimensions: companies, products, and individuals, with five award categories established [2] Group 1: Awards and Evaluation Criteria - The awards include: 2025 AI Leading Company, 2025 AI Potential Startup, 2025 AI Outstanding Product, 2025 AI Outstanding Solution, and 2025 AI Focus Person [10][14][16][19][21] - The evaluation criteria for the Leading Company include market share, revenue scale, technological innovation, and brand influence [11] - The criteria for the Potential Startup focus on investment value, market recognition, and significant achievements in the past year [14] - The Outstanding Product award emphasizes technological innovation, market application, and industry leadership [16][17] - The Outstanding Solution award evaluates innovative applications in various industries and their impact on industry development [19][22] - The Focus Person award recognizes influential individuals in the AI field based on their contributions and industry impact [21][23] Group 2: Application Process and Event Details - The application deadline is November 17, 2025, with results to be announced at the MEET2026 Intelligent Future Conference [7][27] - Interested parties can apply via a provided link or contact Quantum Bit staff for inquiries [7][8] - The MEET2026 conference will take place on December 10, 2026, focusing on the intersection of AI and various industries [27][28]
Natively omni-modal with 2.4 trillion parameters: a first-hand test of Wenxin 5.0
量子位· 2025-11-13 09:25
Core Viewpoint
- The article announces the official release of Wenxin (ERNIE) 5.0, a new-generation model supporting unified understanding and generation across text, images, audio, and video, with stronger creative writing, instruction following, and intelligent planning [1][15]

Group 1: Model Capabilities
- Wenxin 5.0 supports omni-modal input (text, images, audio, video) and multi-modal output (text, images); a fully featured version is still being optimized for user experience [15][13]
- The model can analyze video content in detail, identifying specific moments of tension and correlating the audio track with visual elements [3][7]
- Wenxin 5.0 shows strong performance in language, visual understanding, audio understanding, and visual generation, ranking second globally on the LMArena text leaderboard [9][7]

Group 2: Technical Innovations
- The model takes a "natively unified" approach, integrating modalities from the training phase onward so cross-modal associations are built in, unlike traditional models that fuse features after training [63][64]
- It uses a large-scale mixture-of-experts architecture to balance knowledge capacity and efficiency, activating only the relevant expert modules at inference time to reduce computation (a toy sketch of such routing follows below) [67][69]
- The total parameter count exceeds 2.4 trillion, with an activation ratio below 3%, optimizing both performance and efficiency [69][70]

Group 3: User Experience and Applications
- Users can upload multiple file types at once, including documents, images, audio, and video, making interaction more flexible [18][19]
- The model can efficiently summarize the core content of videos and audio, and users can upload up to 10 videos at a time for multi-task content organization [56][57]
- Wenxin 5.0 can also generate new images from mixed text-and-image inputs, showcasing its versatility in creative applications [52][53]

Group 4: Industry Context and Development
- Competition in the large-model sector has shifted toward innovation in underlying architecture, training efficiency, and cost-effectiveness, with companies seeking differentiated breakthroughs [71][72]
- Baidu has accelerated its model iteration pace, with recent releases strengthening multi-modal and reasoning capabilities, culminating in the launch of Wenxin 5.0 [73][75]
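The sparse-activation idea described above (a router selects a few experts per token, so only a small fraction of parameters runs on any forward pass) can be sketched in a few lines. This toy layer is a generic top-k mixture-of-experts, not Baidu's architecture, and every dimension here is invented for illustration.

```python
import torch
import torch.nn as nn

class TinySparseMoE(nn.Module):
    """Schematic top-k mixture-of-experts layer: a router picks k experts per
    token, so only a small slice of the total parameters is active per forward
    pass (the article reports <3% activation for the full 2.4T-parameter model)."""
    def __init__(self, dim: int = 64, num_experts: int = 32, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # run only the chosen experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

print(TinySparseMoE()(torch.randn(4, 64)).shape)  # -> torch.Size([4, 64])
```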
Nature reveals the technical details of Google's IMO gold-medal model! A core team of just 10 people generated 80 million math problems in a year to train the AI
量子位· 2025-11-13 05:38
Core Insights
- Google DeepMind has publicly released the complete technology and training methodology behind its IMO gold-medal model, AlphaProof, continuing its tradition of transparency in AI research [1][30]
- The model uses a 3-billion-parameter encoder-decoder transformer, which lets it understand and generate mathematical proofs effectively [12][21]

Development Process
- The AlphaProof team was relatively small, with about 10 members for most of the development period and additional members joining closer to the IMO competition [3]
- A key breakthrough came from team member Miklós Horváth, who developed a method for creating many variants of each problem for training [4][5]
- Over the course of a year, the team explored various research ideas and folded the successful ones into the AlphaProof system [7]

Training Methodology
- AlphaProof turns the proof process into a game-like environment in which each mathematical proposition is a new level to clear (a toy Lean statement is shown below) [8]
- The system uses a reinforcement learning environment built on the Lean theorem prover, proposing tactics and estimating how many steps remain to complete a proof [13][14]
- Sourcing enough mathematical problems was a challenge: training began with pre-training on 300 billion tokens of code and math text, followed by fine-tuning on 300,000 manually crafted proofs [16][21]
- A key innovation was automatic formalization, which translated natural-language math problems into a form Lean can understand, producing around 80 million formalized problems from 1 million natural-language questions [16][21]

Performance at IMO
- At the 2024 IMO, AlphaProof solved three problems, including the most difficult one, although each required 2-3 days of computation [26][28]
- Its ability to generate related problem variants during the competition was crucial to this success [26][27]

Future Directions
- Following this success, DeepMind has opened AlphaProof's capabilities to the scientific community, and researchers can apply for access [30]
- Early users note AlphaProof's strength at finding counterexamples, as well as its limitations when proofs involve custom definitions [31][33]
- The reliance on the Lean theorem prover brings its own challenges, since Lean is still evolving, which can affect AlphaProof's performance in more mature mathematical domains [35]
- The limited supply of unique mathematical problems constrains the AI's generalization, pointing to the need for systems that can generate their own training problems [36]
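For readers unfamiliar with Lean, the "game levels" mentioned above are formal statements like the toy example below, written here for illustration and not taken from AlphaProof's training data; the prover's job is to supply a proof term or tactic script that the Lean kernel accepts.

```lean
-- Toy illustration of a formal statement in Lean 4 (not from AlphaProof's data):
-- the proposition is the "game level" and the proof term on the right closes it.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl
```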
Just an IDE? ByteDance's TRAE gets a major upgrade and can now handle full-process development
量子位· 2025-11-13 05:38
Core Viewpoint
- The article argues that AI programming is no longer about whether the AI can write code, but about minimizing rework for developers [1]

Group 1: Introduction of TRAE SOLO
- TRAE has launched the official version of SOLO, which targets the pain points of professional developers [2]
- SOLO is not just an IDE; it is an AI collaboration platform that combines multi-agent coordination with a full-process development toolchain [3]

Group 2: Addressing Developer Pain Points
- The SOLO updates target long rework times and high modification costs [4]
- The SOLO Coder agent is designed to help developers manage complex tasks and avoid the pitfalls of AI programming that lead to errors [5][6]

Group 3: Enhanced Task Management
- SOLO Coder's Plan mode clarifies the development plan before any code is written, preventing mistakes from the outset [8]
- The agent can break down complex tasks and reduce context pollution, allowing for clearer execution [9]

Group 4: Multi-Agent Coordination
- SOLO Coder can schedule multiple sub-agents for tasks such as code review and performance optimization, improving collaboration (a conceptual sketch follows below) [10]
- This turns the AI from a mere coding assistant into a team member that collaborates without complicating the process [10]

Group 5: User Interface Improvements
- A new three-column layout separates the task list, dialogue flow, and tool panel, making multitasking more efficient [12]
- Integrated tools for databases, deployment, and design streamline the workflow and reduce unnecessary navigation [13]

Group 6: Code Change Visualization
- Visualized code changes let developers easily track modifications, giving them more control over the coding process [14][15]
- This feature is particularly valuable for developers who prioritize maintainable code over raw speed [15]

Group 7: Overall Impact
- TRAE's upgrades align closely with developer needs, positioning AI as a collaborative partner rather than a replacement for the developer [16][17]
- The official SOLO release lets developers gain efficiency from AI while keeping control over their projects, approaching an ideal of human-led, AI-assisted development [18]
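The plan-then-dispatch pattern described above (a coordinator decomposes the request, then routes steps to specialized sub-agents) can be sketched conceptually as follows. This is not TRAE's API; every class, method, and role name here is hypothetical, and a real system would call an LLM where the placeholder plan is hard-coded.

```python
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    name: str
    def run(self, task: str) -> str:
        # Placeholder: a real sub-agent would invoke an LLM or a tool here
        return f"[{self.name}] done: {task}"

@dataclass
class Coordinator:
    """Conceptual plan-then-dispatch loop: break a feature request into steps,
    then hand each step to a specialized sub-agent (reviewer, optimizer, ...)."""
    sub_agents: dict = field(default_factory=lambda: {
        "code_review": SubAgent("code_review"),
        "perf_opt": SubAgent("perf_opt"),
    })

    def plan(self, request: str) -> list:
        # Hard-coded plan standing in for LLM-based task decomposition
        return [("code_review", f"review changes for: {request}"),
                ("perf_opt", f"profile hot paths for: {request}")]

    def execute(self, request: str) -> list:
        return [self.sub_agents[role].run(task) for role, task in self.plan(request)]

print(Coordinator().execute("add pagination to the API"))
```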
Fei-Fei Li's 3D world model enters public beta, and netizens are already going wild
量子位· 2025-11-13 05:38
Core Insights
- The article covers the public beta launch of Marble, a 3D world generation model from World Labs, the company founded by Fei-Fei Li [1][3][34]
- Marble lets users create personalized 3D worlds from text, photos, or short videos, significantly lowering the barrier to entry for 3D modeling [4][15][35]

Group 1: Features and Functionality
- Marble can generate 3D worlds from a simple text prompt or a single image, and it also accepts multiple images from different angles to build a coherent environment [17][35]
- Users can customize their 3D spaces by uploading multiple images to define layouts, and can edit elements of the generated worlds, such as removing objects or changing styles [19][21]
- The platform includes an AI-native world editing tool that supports both small tweaks and extensive modifications to created environments [21][33]

Group 2: Export and Compatibility
- Created worlds can be exported in two formats: Gaussian splats for high-fidelity rendering and triangle meshes for compatibility with industry-standard tools [29]
- Generated 3D worlds can also be rendered into videos, which can be enhanced with additional detail and dynamic elements [31]

Group 3: Future Developments
- Marble aims to add richer interactivity in future updates, letting users not only create but also interact with elements inside their 3D worlds [36][37]
- The team emphasizes that the current features are just the foundation, with plans to bring real-time interaction to generated environments [36][37]