Self-Supervised Learning

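The most complete survey of speech separation is here! Teams from Tsinghua and other institutions analyze 200+ papers in depth and systematically dissect research on the "cocktail party problem"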
机器之心· 2025-09-03 04:33
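Speech separation targets the challenging "cocktail party problem", and the field has made revolutionary progress with the development of deep neural networks (DNNs). Speech separation can be used as a standalone application to improve speech intelligibility in complex acoustic environments. It can also serve as an important preprocessing step for other speech processing tasks such as speech recognition and speaker recognition.

Existing literature reviews tend to focus on specific architectural designs or isolated learning methods, which leaves a fragmented understanding of this fast-moving field. To address this, researchers from Tsinghua University, Qinghai University, Nanjing University, Southern University of Science and Technology, University of Chinese Academy of Sciences, and ByteDance surveyed the field's development and its most advanced methods and wrote a unified, comprehensive survey paper. It covers deep learning methods, model architectures, research topics, evaluation metrics, datasets, toolkits and platforms, model performance comparisons, and future challenges, and it systematically summarizes and analyzes more than 200 representative papers.

| Survey | Year | DL | Arch | Topics | Eval | Data | Platforms | Results | Challenges |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang an ...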
Zuckerberg open-sources again: a 7B model reaches SOTA in self-supervised learning
量子位· 2025-08-16 02:00
Core Viewpoint
- Meta has released DINOv3, a new open-source vision model, which demonstrates that self-supervised models can outperform weakly supervised models across a wide range of tasks [1][3].

Group 1: Model Overview
- DINOv3 is trained without labels. It scales the dataset to 1.7 billion images and the model to 7 billion parameters, which supports applications where labels are scarce or costly to obtain [1][6].
- The model performs especially well in label-scarce and cross-domain scenarios, achieving state-of-the-art (SOTA) results in the three core computer vision tasks: classification, detection, and segmentation [3][22].

Group 2: Training Methodology
- DINOv3 training consists of two main phases and centers on large-scale self-supervised training to learn high-quality visual representations [8].
- A new method called "Gram anchoring" addresses the degradation of dense feature maps during training, significantly improving local feature quality without compromising global features [15][20].

Group 3: Performance Metrics
- DINOv3 outperforms its predecessor DINOv2 on various benchmarks, for example scoring 55.9 on ADE-20k segmentation and 90.4 on ImageNet ReaL image classification [4].
- The training strategy includes RoPE-box jittering, which makes the model more robust to variations in resolution, scale, and aspect ratio while maintaining training stability [13][14].

Group 4: Practical Applications
- DINOv3 generalizes strongly. For example, it can analyze satellite imagery to detect tree loss and land-use changes, which supports global forest restoration and agricultural management [27][28].
- The model has achieved SOTA results in multiple remote sensing tasks, including semantic geospatial tasks and high-resolution semantic tasks [29].

Group 5: Future Implications
- The DINO series represents Meta's ongoing exploration of self-supervised methods in vision and marks significant progress in large-scale self-supervised training [30][38].
- DINOv3 is expected to accelerate existing applications and unlock new scenarios across industries, including healthcare, environmental monitoring, autonomous driving, retail, and manufacturing [39].
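The summary names Gram anchoring without describing it. As a rough illustration only, and assuming (this is not stated in the summary) that the regularizer compares the Gram matrix of the current model's L2-normalized patch features with that of an earlier "anchor" checkpoint, the idea can be sketched in plain Python. The function names and toy features below are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def gram_matrix(patches):
    """Pairwise cosine similarities between patch features (P x P matrix)."""
    normed = [l2_normalize(p) for p in patches]
    return [[sum(a * b for a, b in zip(p, q)) for q in normed] for p in normed]

def gram_anchoring_loss(student_patches, anchor_patches):
    """Assumed form: squared Frobenius distance between the two Gram matrices."""
    gs = gram_matrix(student_patches)
    ga = gram_matrix(anchor_patches)
    return sum((s - a) ** 2 for rs, ra in zip(gs, ga) for s, a in zip(rs, ra))

# Toy example: two patches whose features have collapsed onto each other,
# compared with an anchor in which the two patches are still distinct.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
distinct = [[1.0, 0.0], [0.0, 1.0]]
loss = gram_anchoring_loss(collapsed, distinct)
```

Under this assumed form, the loss only constrains the similarity structure between patches, not the features themselves, so the features can rotate freely as long as the patch-to-patch similarities stay close to the anchor's. That would be consistent with the summary's claim that local quality improves without constraining global features.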
After devouring 1.7 billion images, Meta's most powerful behemoth DINOv3 is open-sourced, redefining the ceiling of CV
36Kr· 2025-08-15 07:29
Core Insights
- Meta has developed DINOv3, a self-supervised model with 7 billion parameters trained on 1.7 billion images, which NASA has used for Mars exploration [1][3][26]
- DINOv3 sets a new benchmark in computer vision performance, surpassing specialized solutions in various dense prediction tasks [1][10][19]
- The model is fully open-sourced, including the pre-trained backbone, adapters, and training and evaluation code, and it is licensed for commercial use [6][26]

Performance Metrics
- DINOv3 improves substantially on its predecessor across benchmarks:
  - Segmentation on ADE-20k: 55.9 (up from 49.5 with DINOv2) [2]
  - Depth estimation on NYU: 0.309 (improved from 0.372 with DINOv2; lower is better) [2]
  - Video tracking on DAVIS: 83.3 (up from 76.6 with DINOv2) [2]
  - Instance retrieval on Met: 55.4 (up from 44.6 with DINOv2) [2]
  - Image classification on ImageNet ReaL: 90.4 (up from 86.1 with DINOv2) [2]

Applications and Impact
- Because DINOv3 is self-supervised, it works well where labeled data is scarce, such as satellite imagery and medical imaging [10][12][15]
- The model has been applied in real-world settings, such as the World Resources Institute's work on monitoring deforestation and supporting ecological restoration [16]
- DINOv3 reduced the measurement error of tree canopy height estimation in Kenya from 4.1 meters to 1.2 meters [17]

Model Flexibility and Deployment
- DINOv3's architecture is efficient and versatile, so a single model can perform multiple visual tasks without fine-tuning [22][24]
- Meta has built a family of models ranging from lightweight to high-performance versions to cover different compute budgets and make deployment practical across applications [26]
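Meta's vision backbone DINOv3 returns as champion: self-supervision comprehensively surpasses weak supervision for the first time, open-sourced for commercial use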
机器之心· 2025-08-15 03:29
Core Viewpoint
- The article discusses advances in computer vision, focusing on the development and capabilities of the DINO series of models and on the shift from supervised to self-supervised learning paradigms in AI [2][15][29].

Group 1: DINO Model Evolution
- DINO, DINOv2, and DINOv3 are significant milestones in self-supervised learning, with DINOv3 achieving state-of-the-art performance across various tasks without labeled data [2][15][31].
- DINOv3 scales its training dataset to 1.7 billion images and its model to 7 billion parameters, which makes it considerably more capable than its predecessors [9][31][36].
- New techniques in DINOv3, such as Gram Anchoring and RoPE, improve the model's ability to generate high-resolution dense features and address limitations seen in DINOv2 [18][24][28].

Group 2: Performance Metrics
- DINOv3 outperforms previous models on multiple benchmarks, with a segmentation score of 55.9, a depth estimation error of 0.309, and a video tracking accuracy of 83.3, which shows its strength in dense prediction tasks [17][31].
- The model also performs well in image classification, with an accuracy of 90.4 on ImageNet ReaL, which indicates robustness across applications [17][31].

Group 3: Practical Applications
- DINOv3 is used in real-world applications, such as analyzing satellite images for environmental monitoring and supporting climate finance processes [39][40].
- Because the model works well without fine-tuning, it suits edge applications where several visual prediction tasks must run at the same time [34][36].

Group 4: Community Engagement and Accessibility
- Meta has open-sourced DINOv3, providing the complete backbone network and evaluation heads for community use to support further research and development [13][36].
- The model family includes several distilled versions for different compute budgets, which keeps it accessible to researchers and developers [36][37].
DeepTiming: Market Timing Driven by Intraday Information and Similarity Learning
Minsheng Securities· 2025-07-31 09:02
Quantitative Models and Construction Methods

1. Model Name: Deep Learning Stock Return Prediction Model
- **Model Construction Idea**: This model is based on a deep learning framework tailored to the current market environment. It integrates daily and minute-frequency inputs to predict stock returns and generates trading signals based on historical rolling thresholds[1][10][22]
- **Model Construction Process**:
  - **Input Layer**: Combines 51 technical/sentiment daily features, 7 basic daily price-volume indicators, 10 enhanced style factors, and 52 minute-frequency features aggregated to daily frequency[22]
  - **Training Layer**: Uses meta-learning to adapt dynamically to new market data and avoid overfitting to historical data[14]
  - **Output Layer**: Uses LinSAT neural networks to impose constraints on the output, meeting specific objectives such as controlling style and industry exposures[18]
  - **Loss Function**: Multi-period mean squared error (MSE) is used to stabilize predictions for timing strategies[22]
  - **Formula**: The multi-period return prediction \( y \) has shape \( (n, 1) \), where \( n \) is the number of stocks[22]
- **Model Evaluation**: Adapts robustly to market changes and controls exposures, with significant predictive power for timing strategies[10][22]

2. Model Name: SimStock
- **Model Construction Idea**: SimStock uses self-supervised learning to predict stock similarities, incorporating both static and dynamic correlations. It uses contrastive learning to capture time-series information dynamically, going beyond traditional industry and style classifications[2][47][48]
- **Model Construction Process**:
  - **Input**: Past 40-day price-volume data, Barra style factors, and capital flow indicators[52]
  - **Positive and Negative Sample Construction**: Positive samples are generated as \( X_{pos} = X + (1-\alpha)X_{rand} \), where \( \alpha = 0.75 \) and \( X_{rand} \) is a random feature sample[52]
  - **Embedding**: An LSTM initializes dynamic attention weights, and CLS tokens aggregate sequence information into stock attribute vectors[52]
  - **Similarity Calculation**: Stock similarity is measured as the cosine similarity between attribute vectors[52]
- **Model Evaluation**: Effectively identifies highly similar stocks, which fall mostly within the same industry but show no clear pattern in market capitalization or sub-industry[56]

3. Model Name: Improved GRU Model with SimStock Integration
- **Model Construction Idea**: Enhances the GRU-based stock return prediction model by initializing hidden states with SimStock-generated stock attribute vectors, which improves stability across different stock types[57][59]
- **Model Construction Process**:
  - **Initialization**: SimStock attribute vectors replace the GRU model's initial hidden state[57]
  - **Training**: Retains the same training setup as the baseline GRU model, with adjustments to incorporate the new initialization[59]
- **Model Evaluation**: Improves predictive performance and stability, particularly in timing strategies across diverse stocks[60][63]

4. Model Name: Index Timing Model
- **Model Construction Idea**: Aggregates individual stock signals into index signals using predictions weighted by market capitalization, followed by threshold-based signal generation[77]
- **Model Construction Process**:
  - **Aggregation**: Combines stock return predictions into index return predictions using market-cap weights[77]
  - **Signal Generation**: Uses the 60th percentile of past-year predictions as the buy threshold and the 40th percentile as the sell threshold[77]
  - **Holding Period**: Maintains positions for at least 5 trading days to reduce turnover[77]
- **Model Evaluation**: Effective in generating excess returns, particularly in high-volatility sectors[79][82][84]

---

Model Backtest Results

1. Deep Learning Stock Return Prediction Model
- **Cumulative Excess Return**: 77% over 5 years[33]
- **Annualized Return**: 27%[33]
- **Excess Return vs. Stocks**: 11.3% (pre-cost)[33]

2. SimStock
- **Cumulative Excess Return**: 109% over 5 years[60]
- **Annualized Return**: 30%[60]
- **Excess Return vs. Stocks**: 14.8% (pre-cost)[60]
- **Daily Win Rate**: 57.4%[60]
- **Holding Probability**: 45.7%[60]

3. Index Timing Model
- **HS300**: Annualized Return 5.1%, Excess Return 5.6%, Max Drawdown 7.7%[79]
- **CSI500**: Annualized Return 12.4%, Excess Return 12.2%, Max Drawdown 7.1%[82]
- **CSI1000**: Annualized Return 15.1%, Excess Return 14.9%, Max Drawdown 11.3%[84]

4. Sector Timing
- **Best Sector**: Electric Power Equipment & New Energy, Annualized Return 36%, Excess Return 31.1%[101]

---

Quantitative Factors and Construction Methods

1. Factor Name: Reinforced Style Factor (PPO Model)
- **Factor Construction Idea**: Uses PPO reinforcement learning to predict market style preferences, generating risk factors that are more interpretable and robust than those from traditional deep learning[12]
- **Factor Construction Process**:
  - **Input**: Traditional style factors and recent stock price-volume data[12]
  - **Reward Function**: Stability-penalized market return goodness-of-fit[12]
  - **Output**: Enhanced style factor representing AI market preferences[12]
- **Factor Evaluation**: Provides a stable and interpretable representation of market style dynamics[12]

---

Factor Backtest Results

1. Reinforced Style Factor
- **RankIC**: Weekly average of 4.5% since 2019[36]
- **Annualized Return**: 23.2% for long-only portfolios, Excess Return 18.3% vs. CSI800[36]
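The Index Timing Model's rule described earlier (market-cap-weighted aggregation, 60th/40th percentile thresholds over the past year, and a minimum 5-day holding period) can be sketched as follows. This is a minimal sketch rather than the report's implementation. It assumes a one-year window of 250 trading days, that a "sell" signal means moving to cash rather than going short, and that the position is left unchanged when the prediction falls between the two thresholds. All function names are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def index_prediction(stock_preds, market_caps):
    """Market-cap-weighted average of per-stock return predictions."""
    total = sum(market_caps)
    return sum(p * c for p, c in zip(stock_preds, market_caps)) / total

def timing_positions(index_preds, lookback=250, min_hold=5):
    """Return one 0/1 position per day (1 = hold the index, 0 = stay in cash)."""
    positions = []
    position, days_held = 0, 0
    for t, pred in enumerate(index_preds):
        history = index_preds[max(0, t - lookback):t]  # strictly past predictions
        if history:
            buy_thr = percentile(history, 60)
            sell_thr = percentile(history, 40)
            if position == 0 and pred > buy_thr:
                position, days_held = 1, 0
            elif position == 1 and days_held >= min_hold and pred < sell_thr:
                position = 0  # assumption: selling means going to cash
        if position == 1:
            days_held += 1
        positions.append(position)
    return positions

# Toy series: one strong prediction followed by steadily weaker ones.
toy_preds = [0.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0, -6.0]
toy_positions = timing_positions(toy_preds)
```

In the toy series the position opened on the second day is kept for five days even though the predictions turn negative immediately, which is the turnover-reducing effect the holding-period rule is meant to have.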
Kaiming He improves on Saining Xie's REPA: greatly simplified, yet performance remains strong
机器之心· 2025-06-12 09:57
Core Viewpoint
- The article discusses the importance of representation learning in generative models and introduces a new method, Dispersive Loss, which brings self-supervised learning into diffusion-based generative models without additional pre-training or external data sources [6][9][43].

Group 1: Diffusion Models and Representation Learning
- Diffusion models excel at modeling complex data distributions but are largely disconnected from the representation learning field [2].
- The training objectives of diffusion models typically focus on reconstruction tasks such as denoising and lack explicit regularization of the learned representations [3].
- Representation learning, particularly self-supervised learning, is crucial for learning general representations that transfer to various downstream tasks [4].

Group 2: Introduction of Dispersive Loss
- Dispersive Loss is a flexible, general plug-in regularizer that integrates self-supervised learning into diffusion-based generative models [9].
- Its core idea is to add a regularization target on the model's internal representations that encourages them to spread out in the latent space [10][13].
- The method needs no additional layers or parameters, which makes it a simple, self-contained approach [15][16].

Group 3: Comparison with Existing Methods
- Dispersive Loss needs no pre-training, external data, or additional model parameters, unlike REPA, which relies on pre-trained models [7][41][43].
- The new method shows that representation learning can benefit generative modeling without external information sources [13][43].
- In practice, adding Dispersive Loss requires only minimal changes, such as specifying the intermediate layers to regularize [29].

Group 4: Performance Evaluation
- Experiments show that Dispersive Loss consistently outperforms the corresponding contrastive losses while avoiding the complexity of dual-view sampling [33].
- The method has been tested on various models, including DiT and SiT, and improves results in every case, particularly for larger models where effective regularization is crucial [36][37].
- The article notes that Dispersive Loss also extends to one-step diffusion-based generative models, which indicates its versatility [44].
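The summary describes Dispersive Loss only qualitatively: a regularizer that pushes a batch's intermediate representations apart, with no positive pairs and no extra parameters. A minimal sketch of one InfoNCE-style form is shown below. The exact formula (the log of the batch mean of exp(-D/tau) with squared L2 distance D, self-pairs included), the temperature `tau`, and the function names are assumptions for illustration and are not taken from the summary.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dispersive_loss(features, tau=0.5):
    """Assumed form: log of the mean of exp(-D(z_i, z_j) / tau) over all pairs.

    There is no positive-pair term, so a single view per sample is enough.
    The value is 0 when all features coincide and decreases as they spread out.
    """
    n = len(features)
    total = 0.0
    for zi in features:
        for zj in features:
            total += math.exp(-squared_l2(zi, zj) / tau)
    return math.log(total / (n * n))

# Toy batches of intermediate-layer features (flattened to vectors).
collapsed = [[0.0, 0.0], [0.0, 0.0]]
near = [[0.0, 0.0], [1.0, 0.0]]
far = [[0.0, 0.0], [2.0, 0.0]]
losses = [dispersive_loss(batch) for batch in (collapsed, near, far)]
```

In training, a term like this would presumably be added to the usual denoising objective with a small weight and applied to the output of a chosen intermediate block, which matches the summary's point that the only required decision is which layer to regularize.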
AI "chemical detective" rapidly deciphers unknown molecular structures
Ke Ji Ri Bao· 2025-05-28 23:43
Core Insights
- An international team led by the Czech Technical University has developed DreaMS, an AI molecular decoder that can rapidly analyze the structure of unknown molecules, with potential applications in drug development and the search for life in space [1][2]
- The researchers point out that known natural molecules are only a small fraction of what exists. Many undiscovered molecules in plants, soil, and extraterrestrial environments may hold the key to new drugs and environmental solutions [1]
- DreaMS learns in a way that resembles how infants acquire language: without being given chemical rules in advance, it builds its own internal framework for interpreting molecular structure [1]

Molecular Analysis Capabilities
- DreaMS can estimate whether specific molecular fragments or chemical elements are present, including fluorine, which is crucial for modern pharmaceuticals and pesticides [2]
- The team trained DreaMS on fluorine-containing samples, overcoming a detection challenge that had long troubled the research community [2]
Institute of Software proposes a mini-batch data sampling strategy
Jing Ji Guan Cha Wang· 2025-05-27 07:50
Core Insights
- A research team from the Institute of Software, Chinese Academy of Sciences, proposed a mini-batch data sampling strategy that removes the interference of unobservable semantic variables with representation learning, improving the out-of-distribution generalization of self-supervised learning models [1][2]

Group 1: Research Findings
- Out-of-distribution generalization refers to a model's performance on test data whose distribution differs from that of the training data, which is crucial for staying effective on unseen data [1]
- The study found that self-supervised learning models are affected by unobservable semantic variables during training, which weakens their out-of-distribution generalization [1]

Group 2: Methodology
- The proposed strategy uses causal effect estimation techniques to eliminate the confounding effect of the unobservable semantic variables [1]
- By learning a latent variable model, the strategy estimates the posterior probability distribution of the unobservable semantic variables given an "anchor" sample. This posterior is called the balancing score [1]
- Samples with the same or similar balancing scores are grouped into the same mini-batch, so that within each batch the unobservable semantic variables are conditionally independent of the "anchor" samples [1]

Group 3: Experimental Results
- Extensive experiments on benchmark datasets showed that the sampling strategy improved mainstream self-supervised learning methods by at least 2% across various evaluation tasks [2]
- In classification on ImageNet100 and ImageNet, both Top-1 and Top-5 accuracy surpassed state-of-the-art self-supervised methods [2]
- In semi-supervised classification, Top-1 and Top-5 accuracy increased by over 3% and 2%, respectively [2]
- The strategy also gave stable gains in average precision on object detection and instance segmentation transfer learning tasks [2]
- Performance improvements exceeded 5% on few-shot transfer learning tasks on datasets such as Omniglot, miniImageNet, and CIFAR-FS [2]
- The work was accepted by the International Conference on Machine Learning (ICML 2025), a top-tier academic conference in artificial intelligence [2]
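The grouping step in the methodology above can be illustrated with a small sketch. It assumes the balancing score has already been estimated and, purely for simplicity, reduces it to a single number per sample. In the described method it is a posterior distribution produced by a latent variable model, which is not reproduced here. The function name and parameters are hypothetical.

```python
import random

def balanced_minibatches(balance_scores, batch_size, seed=0):
    """Group sample indices so that each mini-batch holds similar balancing scores.

    balance_scores: one (simplified, scalar) score per training sample.
    Returns a list of index lists; the batch order is shuffled, while the
    contents of each batch stay grouped by score.
    """
    order = sorted(range(len(balance_scores)), key=lambda i: balance_scores[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

# Toy scores for four samples; samples 1 and 3 are close, as are 0 and 2.
toy_scores = [0.9, 0.1, 0.8, 0.2]
toy_batches = balanced_minibatches(toy_scores, batch_size=2)
```

Sorting and chunking is only one simple way to put similar scores together. With vector-valued scores, a nearest-neighbor or clustering step would be the natural replacement for the sort.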
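HKU's Yi Ma team and collaborators open-source new work: rebuilding the visual self-supervised learning paradigm with coding rate regularization, "less is more"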
量子位· 2025-03-08 03:35
Core Viewpoint
- The article introduces SimDINO and SimDINOv2, two new visual pre-training models developed by researchers from several institutions, which simplify the training of the existing DINO and DINOv2 models while improving their performance [1][5][12].

Group 1: Model Development
- SimDINO and SimDINOv2 are designed to address the complexity of DINO and DINOv2, which are currently leading models in visual pre-training [2][4].
- The new models use coding rate regularization to simplify training and improve robustness and performance [12][16].
- The core idea is to remove complex, empirically designed components from the original DINO and DINOv2 training pipelines, making the models easier to train and implement [12][18].

Group 2: Methodology
- Coding rate regularization helps prevent representation collapse, which was a significant issue in the original models [14][17].
- SimDINO keeps DINO's EMA self-distillation scheme and multi-view data augmentation but changes the contrastive learning step to use Euclidean distance or cosine similarity instead of high-dimensional projections [18][19].
- SimDINOv2 further simplifies the iBOT mechanism introduced in DINOv2, making the model more efficient [19].

Group 3: Experimental Validation
- Extensive experiments on datasets including ImageNet-1K, COCO val2017, and ADE20K show that SimDINO and SimDINOv2 outperform the DINO series in computational efficiency, training stability, and downstream task performance [22][23].
- In specific evaluations, SimDINO achieved a linear segmentation mIoU of 33.7 and mAcc of 42.8, while SimDINOv2 reached an mIoU of 36.9 and mAcc of 46.5, significant improvements over DINO and DINOv2 [30].

Group 4: Theoretical Insights
- The research team proposes a theoretical framework for choosing hyperparameters in SimDINO that balances the gradients of the coding rate regularization term and the distance term [33][34].
- This analysis gives a clearer optimization target and reduces the effort of hyperparameter tuning, making training more straightforward [39].

Group 5: Future Directions
- The research team suggests potential improvements for SimDINO, including exploring self-supervised objectives that do not require self-distillation optimization [43].
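The combination of a distance term with a coding rate regularizer described above can be made concrete with a small sketch. It assumes the standard coding rate definition R(Z) = 1/2 · logdet(I + d/(n·ε²) · ZᵀZ) for n feature vectors of dimension d, and an objective of the form "mean squared student/teacher distance minus γ times the coding rate". The summary does not spell out the exact objective, so the form, the default values of `eps` and `gamma`, and the function names are assumptions.

```python
import math

def coding_rate(features, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z) for n rows of dimension d."""
    n, d = len(features), len(features[0])
    scale = d / (n * eps * eps)
    # Build the d x d matrix M = I + scale * Z^T Z (symmetric positive definite).
    m = [[scale * sum(z[a] * z[b] for z in features) + (1.0 if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    # Cholesky factorization M = L L^T, so logdet(M) = 2 * sum(log(L[k][k])).
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    # Half of logdet(M) equals the sum of the logs of the Cholesky diagonal.
    return sum(math.log(chol[k][k]) for k in range(d))

def simdino_style_loss(student, teacher, gamma=1.0, eps=0.5):
    """Assumed objective: align student with teacher, reward a high coding rate."""
    n = len(student)
    align = sum(sum((s - t) ** 2 for s, t in zip(zs, zt))
                for zs, zt in zip(student, teacher)) / n
    return align - gamma * coding_rate(student, eps)

# Toy features: a collapsed batch versus one that spans both dimensions.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0]]
rates = (coding_rate(collapsed), coding_rate(spread))
```

The collapsed batch has the lower coding rate, so an objective that rewards the coding rate penalizes collapse directly. That is the role the summary attributes to the regularizer. The weight `gamma` sets the trade-off between the two terms, which is the balance the hyperparameter analysis in Group 4 addresses.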