Self-Supervised Learning
ICLR 2026 Oral | Revela: Redefining Dense Retriever Training with Language Modeling
机器之心· 2026-03-26 11:41
Core Insights
- The article discusses Revela, a new approach to training dense retrievers for retrieval-augmented generation (RAG) systems, accepted as an Oral presentation at ICLR 2026 for its innovative methodology [2][24].

Group 1: Challenges in Dense Retriever Training
- Training high-quality dense retrievers relies on manually annotated data, which is costly to obtain in specialized fields such as law and code [4].
- Negative sample mining adds further complexity, since randomly drawn negatives provide only weak training signals [4].
- The contrastive loss is disconnected from mainstream language-model pre-training objectives, making it hard to leverage pre-trained knowledge effectively [4].

Group 2: Revela's Approach
- Revela unifies the dense-retriever training objective under a language modeling framework, offering a more natural training path [6].
- It introduces an in-batch attention mechanism that dynamically references other relevant documents while predicting the next token, with attention tied to the similarity scores between text chunks [6][13].
- The architecture pairs a retriever, which encodes text and computes similarity, with a language model that provides the training signal; the two are optimized jointly (a toy sketch of this joint objective follows the summary) [10].

Group 3: Advantages of Revela
- The training objective aligns closely with language modeling, activating the semantic understanding already present in pre-trained models [11].
- It is fully self-supervised, greatly reducing the need for manual annotation, an advantage in data-scarce professional domains [11].
- Revela scales well: performance improves as retriever size and batch size increase [11].

Group 4: Experimental Results
- In code retrieval (CoIR), Revela-3B achieved an average nDCG@10 of 60.1, surpassing several supervised models trained on large annotated datasets [18].
- In reasoning-intensive retrieval (BRIGHT), Revela-3B outperformed commercial APIs, reaching an average nDCG@10 of 20.1 while training only on Wikipedia text [21].
- In general retrieval (BEIR), Revela-3B matched a weakly supervised baseline while using far less training data and compute [22].

Group 5: Future Directions
- Revela opens avenues for dynamic index construction, which could improve the semantic relevance of batches but poses computational challenges [24].
- Further scaling of model and data could yield additional performance gains [24].
- Insights from the retriever could in turn inform language-model training, suggesting mutual enhancement potential [24].
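To make the joint objective concrete, here is a minimal, self-contained PyTorch sketch of the idea under stated assumptions: a toy mean-pooling retriever, a tiny GRU language model, and conditioning by simply prefixing a retrieved chunk, with each chunk's LM loss weighted by the retriever's in-batch similarity scores so gradients reach both components. None of this reproduces Revela's actual architecture or attention mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, B, T = 1000, 64, 4, 16   # toy vocab size, hidden dim, chunks per batch, chunk length

class Retriever(nn.Module):
    """Stand-in dense encoder: mean-pooled token embeddings, L2-normalized."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)

    def forward(self, x):                                    # x: (B, T) token ids
        return F.normalize(self.emb(x).mean(dim=1), dim=-1)  # (B, D)

class TinyLM(nn.Module):
    """Toy causal LM; a conditioning chunk is simply prefixed to the target."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.rnn = nn.GRU(D, D, batch_first=True)
        self.head = nn.Linear(D, V)

    def nll(self, ctx, tgt):                                 # each (N, T)
        seq = torch.cat([ctx, tgt], dim=1)
        h, _ = self.rnn(self.emb(seq[:, :-1]))
        logits = self.head(h)[:, ctx.size(1) - 1:]           # positions predicting tgt
        per_tok = F.cross_entropy(logits.reshape(-1, V), tgt.reshape(-1),
                                  reduction="none")
        return per_tok.view(tgt.size(0), -1).mean(dim=1)     # per-pair NLL, (N,)

retriever, lm = Retriever(), TinyLM()
chunks = torch.randint(0, V, (B, T))                         # one batch of text chunks

z = retriever(chunks)
sim = z @ z.t() / 0.05                                       # temperature-scaled similarities
sim = sim.masked_fill(torch.eye(B, dtype=torch.bool), float("-inf"))
w = sim.softmax(dim=-1)                                      # (B, B) in-batch retrieval weights

# nll[i, j] = -log p(chunk_i | chunk_j): regenerate each chunk from each neighbor.
ctx = chunks.unsqueeze(0).expand(B, B, T).reshape(-1, T)     # row (i, j) holds chunk j
tgt = chunks.unsqueeze(1).expand(B, B, T).reshape(-1, T)     # row (i, j) holds chunk i
nll = lm.nll(ctx, tgt).view(B, B)

# Similarity-weighted LM loss: the gradient through w teaches the retriever
# which neighbors actually lower the language-modeling loss.
loss = (w * nll).sum(dim=1).mean()
loss.backward()
```

The weighting matrix `w` is where the two objectives meet: chunks that lower a neighbor's language-modeling loss earn higher similarity, which is exactly the self-supervised retrieval signal described above.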
LeCun's Three Visits Pay Off as Saining Xie Finally Joins: New Company Raises $1 Billion
量子位· 2026-03-10 10:00
Core Insights
- Yann LeCun, a Turing Award winner and a key figure in deep learning, has co-founded a new startup called Advanced Machine Intelligence (AMI), which has raised $1.03 billion in seed funding at a pre-money valuation of $3.5 billion [2][14][12].
- The company aims to develop intelligent systems that truly understand the real world, focusing on "world models" that incorporate reasoning and planning capabilities [41][43].

Funding and Valuation
- AMI's $1.03 billion seed round surpasses the previous record held by World Labs, founded by Fei-Fei Li, which raised $1 billion at a $5 billion valuation [12][13].
- The round was led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, with participation from high-profile individuals including Mark Cuban and Eric Schmidt [14][15].

Team Composition
- The leadership team includes Alex Lebrun as CEO, who has a background in AI healthcare, and Saining Xie, a prominent computer vision researcher, as Chief Science Officer [6][30][32].
- The team is largely composed of former Meta employees, including LeCun himself, who previously led major AI initiatives at Meta [19][20][21].

Company Vision and Technology
- AMI's goal is to create AI systems that possess long-term memory and learn from real-world sensor data, moving away from traditional supervised learning methods [47][48].
- The company will continue to publish research and open-source code, emphasizing the importance of an open research community [55][56].

Market Strategy
- AMI has no immediate revenue targets but plans to work with potential clients across manufacturing, automotive, aerospace, and pharmaceuticals [50][51].
- Its first announced partnership is with Nabla, the AI healthcare company previously led by CEO Alex Lebrun [52].
Back-to-Back in Nature and Cancer Cell: Shanghai Jiao Tong University Team Uses AI to Enhance Rare Disease and Cancer Diagnosis
生物世界· 2026-03-02 08:00
Core Insights
- The article discusses two groundbreaking AI models for medical diagnosis: DeepRare for rare diseases and KEEP for cancer diagnosis, both marking significant advances in applying AI to healthcare [3][6].

Group 1: DeepRare Model
- DeepRare is the world's first AI-driven diagnostic system for rare diseases, surpassing the diagnostic accuracy of clinical experts with over ten years of experience [3].
- The model aims to bring hope to the roughly 300 million patients worldwide living with rare diseases, marking a milestone in integrating AI into clinical workflows [3].

Group 2: KEEP Model
- KEEP is a knowledge-enhanced vision-language pathology foundation model for cancer diagnosis that outperforms existing models, particularly on rare cancer subtypes [6].
- The model integrates a comprehensive disease knowledge graph of 11,454 diseases and 139,143 attributes, reorganizing millions of pathology image-text pairs into 143,000 semantically structured groups [11].
- KEEP demonstrated superior performance on 18 public benchmarks (over 14,000 whole-slide images) and 4 rare cancer datasets (926 cases), establishing knowledge-enhanced vision-language modeling as a powerful paradigm in computational pathology [11].
A New "Drifting" Model Paradigm: Kaiming He's Latest Work Frees Generative Models from Iterative Inference
机器之心· 2026-02-08 10:37
Core Viewpoint
- The article introduces the "Drifting Model," a novel generative modeling paradigm that eliminates iterative inference, generating high-quality outputs in a single step [3][7][26].

Group 1: Generative Modeling Techniques
- Traditional generative models, such as diffusion models, rely on iterative processes and differential equations to map between distributions, making them time- and resource-intensive [1][2].
- Variational Autoencoders (VAEs) and Normalizing Flows (NFs) also attempt to streamline the generation process, but they still face challenges tied to iterative training [2][3].

Group 2: Drifting Model Characteristics
- The Drifting Model uses a pushforward mapping that evolves during training, enabling single-step inference without the iterative procedures of previous models [7][8].
- A drifting field controls how samples move, steering the generated distribution toward the target data distribution (a toy sketch of this idea follows the summary) [8][10].

Group 3: Experimental Results
- The Drifting Model achieved a state-of-the-art (SOTA) FID of 1.54 on ImageNet 256×256 under the standard latent-space generation protocol, competitive even against multi-step diffusion models [14][24].
- Under the more challenging pixel-space generation protocol, the model reached an FID of 1.61, significantly outperforming previous pixel-space methods [14][26].

Group 4: Robustness and Efficiency
- The model is robust against mode collapse, retaining the ability to approximate multi-modal target distributions [16][17].
- The research highlights the importance of robust feature representations in generative modeling, so advances in self-supervised learning can directly benefit this paradigm [26].

Group 5: Implications for Future Research
- The principle of evolving distributions through drifting fields could apply broadly across generative tasks, opening new avenues for efficient generative modeling research [26].
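As a rough 2-D illustration of a drifting field that moves generated samples toward the data while a one-step generator chases the drifted targets, consider the toy sketch below. The kernel-based attraction/repulsion drift is a hypothetical stand-in chosen for readability, not the paper's actual drifting field.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rbf(a, b, sigma=0.5):
    """Gaussian kernel weights between two point sets."""
    return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

dim = 2
gen = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))  # single-step generator
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(256, dim) * 0.3 + torch.tensor([2.0, 0.0])  # toy target distribution
    fake = gen(torch.randn(256, dim))                              # one forward pass, no iteration

    with torch.no_grad():
        # Hypothetical drift field: kernel-weighted pull toward the data batch
        # minus a mean-shift push away from where generated samples crowd.
        w_r, w_f = rbf(fake, real), rbf(fake, fake)
        attract = w_r @ real / w_r.sum(1, keepdim=True) - fake
        repel = w_f @ fake / w_f.sum(1, keepdim=True) - fake
        target = fake + 0.5 * (attract - repel)   # drifted target, recomputed each step

    # The generator regresses onto its own drifted samples; as training
    # proceeds the drift shrinks and the pushforward matches the data.
    loss = F.mse_loss(fake, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Inference is then a single call, `gen(torch.randn(n, dim))`, with no iterative refinement, which is the property the summary emphasizes.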
Quantitative Special Report: "Machine Learning" Stock-Selection Model Series (I): Construction and Preliminary Application of the Volume-Price Fingerprint Model
GOLDEN SUN SECURITIES· 2026-01-16 13:34
Quantitative Models and Construction Methods

- **Model Name**: Volume-Price Fingerprint Model
  - **Model Construction Idea**: Inspired by large language models, the volume-price fingerprint model treats market transaction data as a special "language" and uses self-supervised learning to extract semantic features from intraday volume-price behavior [1][8][9]
  - **Model Construction Process**:
    1. **Minute-level Feature Preprocessing**: Select 32 dimensions of minute-level features, covering price features (e.g., high, low, close, price position) and transaction features (e.g., turnover, order cancellation, fund flow), then standardize them to remove the effects of scale and historical volatility [1][16][17]
       - Price feature standardization: $$\tilde{p}_{t,d}=\frac{p_{t,d}}{p_{\mathrm{open}}}-1$$ [16]
       - Transaction feature standardization: $${\tilde{f}}_{t,d}={\frac{f_{t,d}}{S_{d}}},\quad S_{d}={\frac{1}{N_{\mathrm{hist}}}}\sum_{i=1}^{N_{\mathrm{hist}}}\sum_{t=1}^{T}f_{t,d}^{(i)}$$ [17]
    2. **Dual-task Self-supervised Learning Framework**:
       - **Forward Causal Prediction Task**: Predict price features causally from past transaction and price information; a triangular attention mask enforces strict causality [18][21]
       - **Backward Feature Reconstruction Task**: Randomly mask transaction features and reconstruct them from global sequence information [18][22]
    3. **Anti-collapse Regularization**: Introduce diversity, orthogonality, and uniformity regularization terms so the fingerprint vectors stay highly differentiated, low in redundancy, and rich in information [1][43][44][46]
       - Total loss function: $${\mathcal{L}}_{\mathrm{total}}=\lambda_{f}{\mathcal{L}}_{\mathrm{forward}}+\lambda_{b}{\mathcal{L}}_{\mathrm{backward}}+{\mathcal{L}}_{\mathrm{diversity}}+{\mathcal{L}}_{\mathrm{orthogonality}}+{\mathcal{L}}_{\mathrm{uniformity}}$$ [47]
  - **Model Evaluation**: The model provides a structured representation of market dynamics, capturing semantic features beyond traditional numerical prediction (the standardization formulas above are illustrated in code below) [1][9][14]
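Read literally, the two standardization formulas above translate into a few lines of NumPy. The array shapes (T minute bars per day, N_hist trailing days) are assumptions made for illustration, not the report's production pipeline.

```python
import numpy as np

def standardize_price(p, p_open):
    """Price features: deviation relative to the day's opening price,
    tilde{p}_{t,d} = p_{t,d} / p_open - 1."""
    return p / p_open - 1.0

def standardize_flow(f_day, f_hist):
    """Transaction features: scale by S_d, the average total daily value over
    the trailing window, S_d = (1/N_hist) * sum_i sum_t f_{t,d}^{(i)}.

    f_day:  (T,) minute-level values for the current day
    f_hist: (N_hist, T) the same feature over the trailing history window
    """
    S = f_hist.sum(axis=1).mean()   # per-day totals, averaged over N_hist days
    return f_day / S

# toy usage with simulated minute bars
T, N_hist = 240, 20
prices = 10.0 + np.cumsum(np.random.randn(T)) * 0.01
turnover = np.abs(np.random.randn(T)) * 1e6
hist = np.abs(np.random.randn(N_hist, T)) * 1e6

p_tilde = standardize_price(prices, prices[0])
f_tilde = standardize_flow(turnover, hist)
```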
- **Model Name**: GRU Model with Volume-Price Fingerprint
  - **Model Construction Idea**: Use a GRU to predict future stock returns from the volume-price fingerprint features [2][50][51]
  - **Model Construction Process**:
    1. Input features include the 128-dimensional volume-price fingerprint vectors and basic daily features (e.g., open, high, low, close, volume, turnover) [51][52]
    2. GRU structure: two GRU layers + fully connected layers + LayerNorm + ReLU + dropout + fully connected layers [53]
    3. Training details: batch size of 512, learning rate of 1e-4 with warmup, early stopping after 10 rounds without improvement [53][54]
  - **Model Evaluation**: The GRU model effectively exploits the semantic features of the volume-price fingerprints for stock prediction, outperforming traditional factor-based models on several metrics [2][50][54]

- **Model Name**: Dual-stream GRU Model
  - **Model Construction Idea**: Combine volume-price fingerprints and traditional volume-price factors in a dual-stream GRU to exploit their complementary information [67][68]
  - **Model Construction Process**:
    1. Run separate GRU streams for the volume-price fingerprints and the traditional factors, then fuse the features with configurable weights [67][69]
    2. Training details: similar to the single-stream GRU, with parallel runs using different random seeds for robustness [68][69]
  - **Model Evaluation**: The dual-stream GRU model improves prediction accuracy and stability, reducing the overfitting risk of relying on a single data source [68][69]

Model Backtesting Results
- **Volume-Price Fingerprint Model**:
  - Weekly RankIC Mean: 0.106
  - Annualized Return (10-group long-short): 83.88%
  - IR: 5.41
  - Weekly Win Rate: 73.87%
  - Max Drawdown: 11.65% [2][59][65]
- **GRU Model with Volume-Price Fingerprint**:
  - Weekly RankIC Mean: 0.106
  - Annualized Return (10-group long-short): 83.88%
  - IR: 5.41
  - Weekly Win Rate: 73.87%
  - Max Drawdown: 11.65% [2][59][65]
- **Dual-stream GRU Model**:
  - Weekly RankIC Mean: 0.109
  - Annualized Return (10-group long-short): 90.89%
  - IR: 5.95
  - Weekly Win Rate: 76.46%
  - Max Drawdown: 11.54% [68][74]

Quantitative Factors and Construction Methods
- **Factor Name**: Volume-Price Fingerprint Factor
  - **Factor Construction Idea**: Extract semantic features from intraday volume-price behavior using self-supervised learning [1][9][14]
  - **Factor Construction Process**:
    1. Generate 128-dimensional daily fingerprint vectors using the volume-price fingerprint model [1][16][18]
    2. Ensure high differentiation, low redundancy, and rich information through anti-collapse regularization [43][44][46]
  - **Factor Evaluation**: The fingerprint factor captures hidden market patterns and semantic information, complementing traditional factors [1][9][14]

Factor Backtesting Results
- **Volume-Price Fingerprint Factor**:
  - Weekly RankIC Mean: 0.106
  - Annualized Return (10-group long-short): 83.88%
  - IR: 5.41
  - Weekly Win Rate: 73.87%
  - Max Drawdown: 11.65% [2][59][65]
- **Fusion Factor (Volume-Price Fingerprint + Traditional Factors)**:
  - Weekly RankIC Mean: 0.109
  - Annualized Return (10-group long-short): 90.89%
  - IR: 5.95
  - Weekly Win Rate: 76.46%
  - Max Drawdown: 11.54% [68][74]

Index Enhancement Results
- **CSI 300 Enhanced Portfolio**:
  - Annualized Return: 11.00%
  - Excess Annualized Return: 7.12%
  - Tracking Error: 1.74%
  - IR: 4.10
  - Monthly Win Rate: 86.11%
  - Max Drawdown: 1.85% [75][77]
- **CSI 500 Enhanced Portfolio**:
  - Annualized Return: 13.32%
  - Excess Annualized Return: 11.38%
  - Tracking Error: 3.47%
  - IR: 3.28
  - Monthly Win Rate: 83.33%
  - Max Drawdown: 4.76% [78][80]
- **CSI 1000 Enhanced Portfolio**:
  - Annualized Return: 13.23%
  - Excess Annualized Return: 14.84%
  - Tracking Error: 3.45%
  - IR: 4.30
  - Monthly Win Rate: 83.33%
  - Max Drawdown: 2.95% [82][83]
Facial Robot Lands the Cover of Science Robotics: Using AI to Teach a Biomimetic Facial Robot to "Speak"
机器之心· 2026-01-15 04:31
Core Viewpoint
- The article discusses a breakthrough in humanoid robotics: a robot capable of realistic lip movements synchronized with speech and music, a significant advance for human-robot interaction [2][24].

Group 1: Research Background
- Hu Yuhang, the founder of Shaping Technology and a PhD graduate of Columbia University, has focused his research on giving robots self-modeling capabilities, allowing them to understand their physical structure and adapt to varied tasks [1].
- The research was published in *Science Robotics* and features a humanoid robot with a biomimetic facial structure that moves its lips in sync with human speech and songs [2][3].

Group 2: Importance of Lip Movement
- Nearly half of human attention during face-to-face communication is focused on lip movements, making natural facial expressions crucial for effective interaction [5].
- Traditional humanoid robots have struggled with realistic lip movements, often appearing puppet-like when speaking; this research addresses that gap [6][22].

Group 3: Technical Innovations
- The robot features a highly biomimetic face with more than 20 miniature motors hidden beneath flexible silicone skin, enabling rapid, coordinated lip movements [8][10].
- The robot learns to control its facial expressions through self-supervised learning, observing its own facial changes and building a model called the Facial Action Transformer (FAT) [12].

Group 4: Learning Mechanism
- The robot combines sound-driven lip movement with visual feedback from synthetic videos to learn the correspondence between audio signals and lip motion, allowing it to lip-sync without understanding semantics (a toy sketch of this two-stage setup follows the summary) [14][16].
- Experiments demonstrated synchronized lip movements across multiple languages and even singing, showcasing robust cross-linguistic generalization [18][21].

Group 5: Future Implications
- Natural lip movement is seen as a missing link for humanoid robots, which are increasingly entering fields that require emotional communication, such as entertainment, education, and healthcare [22][24].
- Economists predict that over one billion humanoid robots may be manufactured in the next decade, underscoring the need for realistic facial expressions to engage effectively with humans [24].
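A heavily simplified sketch of the two-stage setup described above: a frozen forward self-model (standing in for FAT) maps motor commands to a predicted facial appearance, and an audio-to-motor policy is trained purely through that visual feedback. The dimensions, modules, and loss are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1 (assumed): a forward self-model, learned while the robot watches its
# own face, maps motor commands to a face-appearance embedding.
self_model = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 64))
for p in self_model.parameters():            # frozen after the self-observation phase
    p.requires_grad_(False)

# Stage 2: an audio-to-motor policy, supervised only through visual feedback:
# its commands must make the predicted face match lip-synced reference frames.
policy = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 20))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

audio = torch.randn(32, 40)                  # toy per-frame audio features
ref_face = torch.randn(32, 64)               # embeddings of reference video frames

motors = policy(audio)                       # ~20 motor commands, as in the article
pred_face = self_model(motors)               # what the face would look like
loss = F.mse_loss(pred_face, ref_face)       # match appearance only, no semantics
opt.zero_grad(); loss.backward(); opt.step()
```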
Medical Imaging Diagnosis May Bid Farewell to the "Manual Annotation Era"
Huanqiu Wang Zixun· 2026-01-07 01:18
Core Insights
- The article discusses AFLoc, an AI model that can autonomously identify lesions in medical images without any prior annotation by doctors [1][3].

Group 1: AI Model Development
- AFLoc learns from two types of information: medical images (such as chest X-rays, fundus photos, and pathology slices) and the clinical reports doctors write for them [3].
- Through repeated contrastive learning, AFLoc gradually learns to pinpoint the most likely lesion locations in images, even without manual annotations (a CLIP-style toy sketch follows the summary) [3].

Group 2: Performance Validation
- The research team systematically validated AFLoc on three typical imaging modalities: chest X-rays, fundus images, and tissue pathology images, with strong performance on all three [3].
- In chest X-ray experiments spanning 34 common chest diseases and 8 mainstream public datasets, AFLoc outperformed existing methods on multiple lesion-localization metrics, matching or exceeding human expert performance [3].
- AFLoc also showed strong diagnostic ability, beating current methods on zero-shot classification for chest X-ray, fundus, and pathology images, and excelling in particular at diagnosing retinal diseases [3].

Group 3: Implications for Clinical Use
- The model avoids traditional deep learning's reliance on large-scale manually annotated data, significantly improving the efficiency of medical image data utilization and the model's generalization ability [5].
- AFLoc offers a feasible path for clinical imaging AI to move from "manual annotation dependence" to "self-supervised learning," providing a new technical paradigm for smarter, more versatile medical AI systems [5].
- The team plans to further validate and apply AFLoc in real clinical settings, accelerating its translation into a clinical decision support system [5].
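The mechanism described above, aligning whole images with report text contrastively and then reusing finer-grained similarities for localization, can be sketched in a CLIP-style toy. The encoders, dimensions, and patch-level heatmap below are illustrative assumptions, not AFLoc's published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128  # shared embedding dimension

class ImageEncoder(nn.Module):
    """Toy patch encoder: returns per-patch features plus a pooled global one."""
    def __init__(self, patch_dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, D)

    def forward(self, patches):                  # (B, N_patches, patch_dim)
        feats = F.normalize(self.proj(patches), dim=-1)
        return feats, feats.mean(dim=1)          # (B, N, D), (B, D)

class TextEncoder(nn.Module):
    """Toy report encoder: mean-pooled token embeddings."""
    def __init__(self, vocab=5000):
        super().__init__()
        self.emb = nn.Embedding(vocab, D)

    def forward(self, tokens):                   # (B, L) report token ids
        return F.normalize(self.emb(tokens).mean(dim=1), dim=-1)  # (B, D)

img_enc, txt_enc = ImageEncoder(), TextEncoder()
patches = torch.randn(8, 196, 768)               # 8 images, 14x14 patches each
reports = torch.randint(0, 5000, (8, 64))        # paired clinical reports

patch_f, img_f = img_enc(patches)
txt_f = txt_enc(reports)

# Symmetric InfoNCE: each image should match its own report and vice versa.
logits = img_f @ txt_f.t() / 0.07
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Annotation-free localization at inference: scoring the report embedding
# against patch features yields a heatmap whose peaks mark likely lesions.
heatmap = (patch_f @ txt_f.unsqueeze(-1)).squeeze(-1).view(8, 14, 14)
```

The point of the heatmap line is that localization falls out of the pairing objective for free: no pixel-level labels are ever used.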
Can Autoregression Also Build Strong Vision Models? NEPA Ushers in the "Next-Embedding Prediction" Era, with Saining Xie Participating
机器之心· 2026-01-02 05:00
Core Viewpoint
- The article discusses Next-Embedding Predictive Autoregression (NEPA), a new approach to visual pre-training that shifts the paradigm from learning representations to learning models, achieving strong performance on visual tasks in the manner of language models [2][18].

Group 1: NEPA Overview
- NEPA is a minimalist approach that predicts the next feature patch of an image, much as a language model predicts the next word [20].
- The method relies on causal masking and stop-gradient techniques for stable prediction, without requiring complex architectures [17][25].
- NEPA is competitive on benchmarks such as ImageNet-1K, reaching 83.8% Top-1 accuracy with ViT-B and 85.3% with ViT-L, surpassing several state-of-the-art methods [29].

Group 2: Methodology and Architecture
- The architecture uses a standard Vision Transformer (ViT) backbone with causal attention masking, directly predicting future patch embeddings from past ones [22].
- Unlike pixel-level reconstruction methods, NEPA needs no separate decoder, simplifying the model design [22].
- Training segments images into patches, encodes them into vectors, and predicts the next patch embedding, with stop-gradient preventing the model from "cheating" (a toy sketch follows the summary) [25].

Group 3: Performance and Applications
- NEPA transfers well, achieving 48.3% and 54.0% mIoU on ADE20K semantic segmentation, indicating it learns the rich semantic features that dense prediction tasks require [29].
- The model adapts to various downstream tasks by simply swapping the classification head, showcasing its versatility [30].
- Visual analysis shows that NEPA learns long-range, object-centered attention patterns, ignoring background noise and focusing on semantically relevant regions [37].
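A minimal sketch of next-embedding prediction under assumptions: flattened patches, a small causal Transformer, and a cosine loss against stop-gradient targets. Because the loss lives in embedding space, no decoder is needed; the toy omits NEPA's actual ViT configuration and training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, P, D = 8, 64, 192, 128          # batch, patches per image, patch dim, embed dim

patch_embed = nn.Linear(P, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(B, N, P)              # toy images as sequences of flattened patches
e = patch_embed(x)                    # (B, N, D) patch embeddings

# Causal mask: position t may only attend to patches 1..t, as in a language model.
mask = nn.Transformer.generate_square_subsequent_mask(N)
h = encoder(e, mask=mask)

# Predict the *embedding* of the next patch; stop-gradient on the targets keeps
# the model from "cheating" by dragging targets toward its own predictions.
pred, target = h[:, :-1], e[:, 1:].detach()
loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
loss.backward()
```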
LeCun's Last Paper at Meta
36Kr· 2025-11-14 03:04
Core Insights
- The article discusses Yann LeCun's recent paper on a self-supervised learning method called LeJEPA, seen as his farewell work at Meta as he departs the company [1][33].
- LeJEPA introduces a new framework that improves predictive performance by ensuring the embedding space follows a specific statistical distribution [2].

Group 1: LeJEPA Framework
- LeJEPA is built on isotropic Gaussian embeddings and addresses the representation-collapse problem of traditional JEPA frameworks, significantly improving model generalization [1][5].
- The framework uses Sketched Isotropic Gaussian Regularization (SIGReg) to achieve distribution matching, recasting the problem as a statistical hypothesis test [6][11].

Group 2: Experimental Validation
- Extensive experiments covered large architectures such as ViT, ConvNeXt, and ResNet, with models approaching 1 billion parameters [8].
- Results indicate that LeJEPA outperforms existing methods while remaining simple and robust to train, particularly on domain-specific datasets such as Galaxy10 and Food101 [10].

Group 3: Statistical Insights
- The research shows that an isotropic Gaussian distribution minimizes bias and variance during training, improving stability and accuracy on downstream tasks [3][5].
- Non-isotropic distributions lead to higher bias and variance, confirming the superiority of the isotropic Gaussian across a range of experiments [3].

Group 4: Future Directions
- Despite leaving Meta, LeCun is reportedly raising funds to establish a startup focused on advancing his work on world models, signaling continued contributions to the AI field [33][34].
LeCun's Last Paper at Meta? And as Co-First Author. LeJEPA: Completing the Theoretical Puzzle of JEPAs
机器之心· 2025-11-14 01:33
Core Viewpoint
- The article discusses LeJEPA, a new self-supervised learning framework that addresses the limitations of existing Joint Embedding Predictive Architectures (JEPAs) by providing a solid theoretical foundation and eliminating reliance on heuristic methods [4][5][8].

Group 1: Theoretical Foundation
- The research team established that the optimal embedding distribution for JEPAs is an isotropic Gaussian, which minimizes downstream prediction risk across a range of tasks [5].
- A novel distribution-matching objective, Sketched Isotropic Gaussian Regularization (SIGReg), efficiently pushes the embeddings toward the ideal isotropic Gaussian (a sketch of such a penalty follows the summary) [6][8].
- LeJEPA combines the JEPA predictive objective with SIGReg, yielding a statistically grounded solution that mitigates representation collapse [8][9].

Group 2: Practical Implementation
- LeJEPA's principled theoretical design makes it simple, robust, and high-performing, eliminating heuristics such as gradient stopping and teacher-student networks [9][11].
- The implementation requires only about 50 lines of PyTorch, making it user-friendly and easy to deploy [11][19].

Group 3: Experimental Validation
- LeJEPA was tested across more than 10 datasets and 60 architectures, matching or surpassing state-of-the-art results, e.g., 79% accuracy on ImageNet-1K with ViT-H/14 [10].
- The framework outperformed DINOv2-based transfer learning on domain-specific datasets, indicating its capability for in-domain pre-training [10][33].

Group 4: Stability and Scalability
- LeJEPA remains stable across hyperparameters and architectures, with the recommended settings yielding competitive performance even at small batch sizes [24][26].
- The framework's design is architecture-agnostic, allowing it to learn high-quality representations across model families [26][27].

Group 5: Semantic Structure Emergence
- LeJEPA's self-supervised training gives rise to semantic structure without explicit supervision, with attention patterns tracking object boundaries and salient regions [41][43].
- The attention maps are temporally consistent, enabling unsupervised video segmentation and indicating that the learned features capture both spatial semantics and temporal structure [43].
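Based on the description above, matching random one-dimensional projections of the embeddings against a standard normal, a SIGReg-style penalty might look like the sketch below. Using the characteristic function as the univariate test statistic is an assumption made for illustration; the paper defines the exact statistic and weighting.

```python
import torch

def sigreg(z, num_dirs=64, num_ts=17, t_max=3.0):
    """Hypothetical SIGReg-style penalty: push embeddings z (B, D) toward an
    isotropic standard Gaussian by matching random 1-D projections to N(0, 1)
    via the characteristic function, whose target is exp(-t^2 / 2)."""
    B, D = z.shape
    dirs = torch.randn(D, num_dirs, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)        # random unit directions
    proj = z @ dirs                                     # (B, num_dirs) projections
    t = torch.linspace(0.0, t_max, num_ts, device=z.device).view(-1, 1, 1)
    ecf_re = torch.cos(t * proj).mean(dim=1)            # empirical CF, real part
    ecf_im = torch.sin(t * proj).mean(dim=1)            # empirical CF, imaginary part
    target = torch.exp(-t.view(-1) ** 2 / 2).unsqueeze(-1)  # CF of N(0, 1)
    return ((ecf_re - target) ** 2 + ecf_im ** 2).mean()

# usage: added to the JEPA prediction loss, e.g. loss = pred_loss + 0.05 * sigreg(z)
z = torch.randn(256, 128, requires_grad=True)           # a batch of embeddings
loss = sigreg(z)
loss.backward()
```

Because each term only ever touches 1-D projections, the penalty stays cheap and batch-size friendly, which is consistent with the stability claims summarized above.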