Transformer Models
Live from CES | Jensen Huang: 80% of Startups Are Adopting Open Models
Xin Lang Cai Jing· 2026-01-06 01:17
Sina Tech, January 6 morning news: During the annual CES show, Jensen Huang said at NVIDIA's new-product launch event that the computer industry goes through a transformation every 10 to 15 years. With each transformation, the industry builds new systems for a new platform. This time, two transformations are happening at once: applications are now built on artificial intelligence, and the way software is developed has changed as well.

He walked through the progress artificial intelligence has made over the past decade. He saw the first interesting model in 2015, the Transformer model arrived in 2017, and more impressive models from OpenAI followed. He is now exploring potential uses for agentic models, which can handle some AI tasks autonomously and keep learning over time. He said that "anywhere in the universe where there is information, anywhere there is structure" can be used to train artificial intelligence.

Huang also said that, as of this month, open-source AI models trail the expensive frontier models of the large AI companies by roughly six months, and that 80% of startups are adopting open models.

Editor in charge: Jiang Yuhan
Machine Learning Applications Series: A Decoupled Temporal Contrastive Stock-Selection Model Driven by Reinforcement Learning
Southwest Securities· 2025-12-25 11:40
Quantitative Models and Construction

Model Name: DTLC_RL (Decoupled Temporal Contrastive Learning with Reinforcement Learning)
- **Model Construction Idea**: The model aims to combine the nonlinear predictive power of deep learning with interpretability by decoupling feature spaces, enhancing representations through contrastive learning, ensuring independence via orthogonal constraints, and dynamically fusing the spaces using reinforcement learning[2][11][12]
- **Model Construction Process**:
  - **Feature Space Decoupling**: Three orthogonal latent spaces are constructed to capture market systemic risk (β space), stock-specific signals (α space), and fundamental information (θ space). Each space is equipped with a specialized encoder: a TCN for the β space, a Transformer for the α space, and a gated residual MLP for the θ space[11][12][92]
  - **Contrastive Learning**: Introduced within each space to enhance robustness by constructing positive and negative sample pairs based on return similarity. The InfoNCE loss maximizes the similarity of positive pairs while minimizing that of negative pairs (a minimal code sketch of this loss and the PPO clipped loss appears at the end of this summary):
$$L_{\mathrm{InfoNCE}}=-\mathbb{E}\left[\log\frac{\exp\left(f(x)^{\top}f(x^{+})/\tau\right)}{\exp\left(f(x)^{\top}f(x^{+})/\tau\right)+\sum_{i=1}^{N-1}\exp\left(f(x)^{\top}f(x_{i}^{-})/\tau\right)}\right]$$
    where \(f(x)\) is the feature representation, \(x^+\) is the positive sample, \(x_i^-\) are the negative samples, and \(\tau\) is the temperature parameter[55][56]
  - **Orthogonal Constraints**: A loss term is added to keep the outputs of the three spaces statistically independent, reducing multicollinearity and enhancing interpretability[12][104]
  - **Reinforcement Learning Fusion**: A PPO-based reinforcement learning mechanism dynamically adjusts the weights of the three spaces based on market conditions. The reward function includes components for return correlation, weight stability, and weight diversification:
$$r_{t}=R_{t}^{IC}\big(\hat{y}_{t},y_{t}\big)+\lambda_{s}R_{t}^{\mathrm{stable}}+\lambda_{d}R_{t}^{\mathrm{div}}$$
    The PPO optimization uses GAE advantage estimation and a clipped policy loss:
$$L^{\mathrm{CLIP}}=\mathbb{E}\left[\min\big(r\hat{A},\ \mathrm{clip}(r,1-\varepsilon,1+\varepsilon)\hat{A}\big)\right]$$[58][120][121]
- **Model Evaluation**: The DTLC_RL model demonstrates strong predictive power and interpretability, with dynamic adaptability to market conditions[2][12][122]

Model Name: DTLC_Linear
- **Model Construction Idea**: A baseline model for comparison, using a linear layer to fuse the three feature spaces[98][100]
- **Model Construction Process**: The encoded information from the three spaces is concatenated and passed through a linear layer with a Softmax activation to generate fusion weights. The model is trained with a multi-task loss function, including IC maximization, the contrastive learning loss, and the orthogonal constraints[98][104]
- **Model Evaluation**: Provides a benchmark for evaluating the contribution of reinforcement learning in DTLC_RL[98][103]

Model Name: DTLC_Equal
- **Model Construction Idea**: A simpler baseline that weights the three feature spaces equally, without dynamic adjustment[98]
- **Model Construction Process**: The outputs of the three spaces are directly averaged to generate predictions[98]
- **Model Evaluation**: Serves as a control group to assess the benefits of dynamic weighting in DTLC_RL[98][103]

---

Model Backtesting Results

DTLC_RL
- **IC**: 0.1250[123]
- **ICIR**: 4.38[123]
- **Top 10% Portfolio Annualized Return**: 34.77%[123]
- **Annualized Volatility**: 25.41%[123]
- **IR**: 1.37[123]
- **Maximum Drawdown**: 40.65%[123]
- **Monthly Turnover**: 0.71X[123]

DTLC_Linear
- **IC**: 0.1239[105]
- **ICIR**: 4.25[105]
- **Top 10% Portfolio Annualized Return**: 32.95%[105]
- **Annualized Volatility**: 24.39%[105]
- **IR**: 1.35[105]
- **Maximum Drawdown**: 35.94%[105]
- **Monthly Turnover**: 0.76X[105]

DTLC_Equal
- **IC**: 0.1202[105]
- **ICIR**: 4.06[105]
- **Top 10% Portfolio Annualized Return**: 32.46%[105]
- **Annualized Volatility**: 25.29%[105]
- **IR**: 1.28[105]
- **Maximum Drawdown**: 40.65%[105]
- **Monthly Turnover**: 0.71X[105]

---

Quantitative Factors and Construction

Factor Name: Beta_TCN
- **Factor Construction Idea**: Captures market systemic risk by quantifying stock sensitivity to common risk factors such as macroeconomic fluctuations and market sentiment[67]
- **Factor Construction Process**:
  - Five market-related features are selected, including beta to market returns, volatility sensitivity, liquidity beta, size exposure, and market sentiment sensitivity[72]
  - A TCN encoder processes 60-day time-series data, using dilated causal convolutions to capture short- and medium-term trends. The output is a 32-dimensional vector representing systemic risk features[68]
- **Factor Evaluation**: Demonstrates moderate stock selection ability and effectively captures market-related information[73]

Factor Name: Alpha_Transformer
- **Factor Construction Idea**: Extracts stock-specific alpha signals from price-volume time-series data[76]
- **Factor Construction Process**: Thirteen price-volume features are encoded using a multi-scale Transformer, with separate layers for short-, medium-, and long-term information. The outputs are fused with a gated mechanism and passed through a fully connected layer for return prediction[77][78]
- **Factor Evaluation**: Exhibits strong predictive power and stock selection ability, with relatively low correlation to market benchmarks[81][82]

Factor Name: Theta-ResMLP
- **Factor Construction Idea**: Focuses on fundamental information to assess financial safety margins and risk resistance[88]
- **Factor Construction Process**: Eight core financial indicators, including PE, PB, ROE, and dividend yield, are encoded using a gated residual MLP. The architecture includes an input projection, gated residual blocks, and a final output layer[92]
- **Factor Evaluation**: Provides stable stock selection performance with lower turnover and drawdown compared to the other spaces[95][96]

---

Factor Backtesting Results

Beta_TCN
- **IC**: 0.0969[73]
- **ICIR**: 3.73[73]
- **Top 10% Portfolio Annualized Return**: 27.73%[73]
- **Annualized Volatility**: 27.19%[73]
- **IR**: 1.02[73]
- **Maximum Drawdown**: 45.80%[73]
- **Monthly Turnover**: 0.79X[73]

Alpha_Transformer
- **IC**: 0.1137[81]
- **ICIR**: 4.19[81]
- **Top 10% Portfolio Annualized Return**: 32.66%[81]
- **Annualized Volatility**: 23.04%[81]
- **IR**: 1.42[81]
- **Maximum Drawdown**: 27.59%[81]
- **Monthly Turnover**: 0.83X[81]

Theta-ResMLP
- **IC**: 0.0485[95]
- **ICIR**: 1.87[95]
- **Top 10% Portfolio Annualized Return**: 23.88%[95]
- **Annualized Volatility**: 23.96%[95]
- **IR**: 0.99[95]
- **Maximum Drawdown**: 37.41%[95]
- **Monthly Turnover**: 0.41X[95]
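To make the two loss terms quoted above concrete, here is a minimal PyTorch sketch of an InfoNCE contrastive loss and a PPO clipped policy loss. It illustrates the generic techniques named in the report, not the report's actual implementation; the batch layout, the temperature of 0.07, and the clipping parameter of 0.2 are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor,
                  positive: torch.Tensor,
                  negatives: torch.Tensor,
                  tau: float = 0.07) -> torch.Tensor:
    """InfoNCE: pull the positive pair together, push the N-1 negatives apart.

    anchor, positive: (B, D) feature representations f(x) and f(x+)
    negatives:        (B, N-1, D) representations of the negative samples
    tau:              temperature (0.07 is an assumed default, not from the report)
    """
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / tau    # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", anchor, negatives) / tau   # (B, N-1)
    logits = torch.cat([pos_logit, neg_logits], dim=1)                 # (B, N)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)             # positive sits at index 0
    return F.cross_entropy(logits, labels)                             # = -E[log softmax at index 0]

def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO policy objective L^CLIP with probability ratio r = pi_new / pi_old."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()   # negate: optimizers minimize
```

In a setup like DTLC_RL, terms of this form would sit alongside the IC-based reward and the orthogonality penalty described above in the overall training objective.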
Google's TPU Makes a Strong Breakthrough: The Key to the Boom Amid Overseas Worries About an AI Compute Bubble
Mei Ri Jing Ji Xin Wen· 2025-12-09 01:29
Why do we say this? First, stepping up a level to the national-strategy perspective: both China and the United States are currently pushing hard to advance artificial intelligence. Going up one more level, this is essentially a matter of countries jockeying for position on AI as the technology track of the future. Investment will therefore not stop just because the business model has not yet proven itself in the short term; it is not determined purely by two or three years of market results, but carries longer-term strategic value. From this higher vantage point, China and the United States are unlikely to abandon their AI investments, there is no need to worry excessively about a short-term bubble bursting, and continued investment remains the most likely outcome.

Why does the talk of an AI bubble exist? In reality, AI's business model still has two major problems. The first is that, a while ago, prominent short sellers bet against the AI theme and related companies such as NVIDIA. Their core argument was that the AI field lacks large-scale application deployment and monetization: compute is like building roads, and once the roads are built they need to be used and to serve a purpose, otherwise building them is pointless. The complaint has therefore been that the compute "roads" have been built but "pedestrians" are scarce, that is, large-scale applications have not landed, so the business model has not fully proven itself. In our view, however, while large-scale applications are still lacking, small-scale applications do exist, a point we will expand on shortly.

The second problem is the claim that some large companies have dressed up their financial statements by extending depreciation periods and spreading depreciation more thinly, thereby inflating profits. This is another accusation some short sellers level at the AI bubble. Objectively speaking, these problems may exist, but from ...
How Many More Years Until We Can Predict the Next Pixel? Google: Five Will Be Enough
Ji Qi Zhi Xin· 2025-11-26 07:07
Core Insights
- The article discusses the potential of next-pixel prediction in image recognition and generation, highlighting its scalability challenges compared to natural language processing tasks [6][21].
- It emphasizes that while next-pixel prediction is a promising approach, it requires significantly more computational resources than language modeling, with a token-per-parameter ratio that is 10-20 times higher [6][15][26].

Group 1: Next-Pixel Prediction
- Next-pixel prediction can be learned end-to-end without labeled data, making it a form of unsupervised learning [3][4].
- The study indicates that achieving optimal performance in next-pixel prediction requires a higher token-per-parameter ratio than text-token learning, with a minimum of 400 for pixel models versus 20 for language models [6][15] (see the back-of-the-envelope calculation after this summary).
- The research identifies three core questions: how to evaluate model performance, whether scaling laws are consistent with downstream tasks, and how scaling trends vary across image resolutions [7][8].

Group 2: Experimental Findings
- Experiments at a fixed resolution of 32×32 pixels reveal that the optimal scaling strategy is highly dependent on the target task, with image generation requiring a larger token-per-parameter ratio than classification [18][22].
- As image resolution increases, model size must grow faster than data size to remain compute-optimal, indicating that computational capacity, rather than data availability, is the primary bottleneck [18][26].
- The study shows that while scaling trends for next-pixel prediction can be predicted using established frameworks from language models, the optimal scaling strategies differ significantly between tasks [21][22].

Group 3: Future Outlook
- The article predicts that next-pixel modeling will become feasible within the next five years, driven by training compute that is expected to grow four to five times annually [8][26].
- It concludes that despite the current challenges, the path toward pixel-level modeling remains viable and could achieve competitive performance in the future [26].
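As a back-of-the-envelope illustration of the token-per-parameter ratios quoted above (20 for language models, at least 400 for pixel models), the snippet below estimates the data needed by a hypothetical 1B-parameter model, assuming one token per pixel at 32×32 resolution; the model size and the one-token-per-pixel accounting are assumptions for illustration, not figures from the paper.

```python
# Back-of-the-envelope look at the token-per-parameter ratios quoted above.
params = 1_000_000_000           # hypothetical 1B-parameter model (assumption)

lang_ratio = 20                  # compute-optimal tokens per parameter, language models
pixel_ratio = 400                # minimum quoted for next-pixel prediction

lang_tokens = params * lang_ratio        # 2.0e10 text tokens
pixel_tokens = params * pixel_ratio      # 4.0e11 pixel tokens

tokens_per_image = 32 * 32               # one token per pixel at 32x32 resolution
images_needed = pixel_tokens // tokens_per_image

print(f"text tokens needed:  {lang_tokens:.1e}")
print(f"pixel tokens needed: {pixel_tokens:.1e}")
print(f"32x32 images needed: {images_needed:.1e}")   # ~3.9e8 images
```

For a model of the same size, the roughly 20x gap in tokens per parameter translates into a comparable gap in training compute, which is why the article ties feasibility to the projected four-to-fivefold annual growth in training compute.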
AI Voices | Harvard & MIT: AI Can Predict, but It Still Can't Explain the "Why"
Hong Shan Hui· 2025-10-22 00:06
Core Insights
- The article discusses a significant experiment by Harvard and MIT exploring whether large language models (LLMs) learn a "world model" or merely predict the next word from probabilities [3][4][5].
- The experiment used orbital mechanics as a testing ground, asking whether AI could derive Newton's laws from its predictions of planetary motion [4][5].
- The findings revealed that while the AI models could accurately predict planetary trajectories, they did not encode the underlying physical laws, indicating a disconnect between prediction and explanation [6][10].

Group 1: Experiment Design and Findings
- The research team trained a small Transformer model on 10 million simulated solar-system coordinates, totaling 20 billion tokens, to assess whether it would use Newton's laws to predict planetary movements [8].
- The results showed that the model could generate precise trajectory predictions but relied on situation-specific heuristics rather than an understanding of the fundamental laws of physics [10][11].
- The study also found that the model's predictions did not generalize to untrained scenarios, demonstrating the lack of a stable world model [10][11].

Group 2: Implications for AI Development
- The research raises questions about the fundamental limitations of AI models, particularly their ability to construct the coherent world model needed for scientific discovery [11][12].
- The article suggests that while LLMs are not useless, they are currently insufficient for achieving scientific breakthroughs [13].
- Future AI development may require a combination of larger models and new methodologies to improve their understanding and predictive capabilities [13][14].

Group 3: Philosophical Considerations
- The article reflects on a classic scientific debate: whether the essence of science lies in precise prediction or in understanding the underlying reasons for phenomena [14].
- It emphasizes the importance of developing AI that can not only predict but also comprehend the logic of the world, which will determine its ultimate impact on the history of science [14].
A $2.7 Billion Homecoming: Google's Priciest "Defector" and Transformer Author Reveals the Next Step Toward AGI
3 6 Ke· 2025-09-22 08:48
Core Insights
- The article focuses on the hardware requirements for large language models (LLMs) discussed by Noam Shazeer at the Hot Chips 2025 conference, emphasizing the need for more computational power, memory capacity, and network bandwidth to improve AI performance [1][5][9].

Group 1: Hardware Requirements for LLMs
- LLMs require more computational power, measured in FLOPS, to improve performance and handle larger models [23].
- Greater memory capacity and bandwidth are crucial, as insufficient bandwidth limits model flexibility and performance [24][26].
- Network bandwidth is often overlooked but is essential for efficient data transfer between chips during training and inference [27][28].

Group 2: Design Considerations
- Low-precision computing benefits LLMs, allowing more FLOPS without significantly hurting model performance [30][32].
- Determinism is vital for reproducibility in machine learning experiments, as inconsistent results hinder debugging and development [35][39].
- Overflow and precision loss in low-precision calculations must be addressed to keep model training stable [40] (see the sketch after this summary).

Group 3: Future of AI and Hardware
- The evolution of AI will continue even if hardware advances stall, driven by software innovation [42].
- The potential for achieving Artificial General Intelligence (AGI) remains, contingent on leveraging existing hardware effectively [42][44].
- The article highlights the importance of supporting individuals as AI transforms the job landscape, emphasizing the need for societal adaptation to technological change [56].
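The low-precision trade-off noted in Group 2 is easy to see in code. The sketch below is a generic PyTorch illustration of overflow and the common low-precision-compute, higher-precision-accumulate pattern; it is not taken from Shazeer's talk, and the specific values and shapes are assumptions chosen for the example.

```python
import torch

# fp16 has a much narrower dynamic range than bf16, so the same large
# activation overflows in one format but stays finite (if coarser) in the other.
# The value 70000 is chosen purely for illustration.
x = torch.tensor([70000.0])
print(x.to(torch.float16))    # tensor([inf], dtype=torch.float16)  -> overflow
print(x.to(torch.bfloat16))   # finite, but rounded to the nearest bf16 value

# A common mitigation: run the matmul in low precision for throughput, but keep
# the result (and typically master weights/accumulators) in float32.
a = torch.randn(256, 256, dtype=torch.bfloat16)
b = torch.randn(256, 256, dtype=torch.bfloat16)
out = (a @ b).to(torch.float32)
print(out.dtype)              # torch.float32
```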
Mamba's First Author Previews a New Architecture! A Long Essay Arguing That Transformer ≠ the Final Answer
Liang Zi Wei· 2025-07-09 04:57
Core Viewpoint
- The article discusses the trade-offs between two mainstream families of sequence models, State Space Models (SSMs) and Transformers, highlighting the strengths and weaknesses of each approach [1][3].

Summary by Sections

Introduction to Mamba and SSMs
- Mamba is a representative SSM built on a modern structured SSM suited to deep learning, and it outperforms similarly sized Transformers on language tasks [2].
- The author consolidates insights from previous talks into a comprehensive article and hints at a significant upcoming architectural advance [3][4].

Attention Mechanism and Its Limitations
- The article challenges the common belief that the high computational cost of models like ChatGPT is due solely to the quadratic complexity of the Transformer attention mechanism [5][6].
- A new architecture is expected to be compatible with Transformers, suggesting a shift in how the limitations of attention are understood [7][8].

Comparison of SSMs and Transformers
- SSMs are likened to the human brain: they summarize past information into a fixed-size hidden state, making them more efficient for processing long sequences [15][16] (see the sketch at the end of this summary).
- SSMs handle unstructured data well, and their computational cost is linear in sequence length, making them suitable for resource-constrained environments [16].

Key Elements of Mamba's Success
- Mamba's effectiveness is attributed to three factors: state size, state expressivity, and training efficiency [17][20].
- SSMs allow larger hidden states, storing more information than traditional RNNs [18].
- Mamba introduces selective SSMs to improve state expressivity, akin to the gating mechanisms of classic RNNs [19].
- Training efficiency is achieved through careful parameterization and parallel scan algorithms [21].

Limitations of SSMs
- SSMs lack precise recall and retrieval of past information, which is a strength of Transformer models [22].

Transformer Model Characteristics
- Transformers function like a database, storing every piece of information in a KV cache, which enables precise memory and token-level operations [23][25].
- They excel at well-defined tokenized data but suffer from high computational cost and dependence on high-quality data [26][27].

Tokenization Debate
- The author argues against the necessity of tokenization, stating that it contradicts the end-to-end learning principle of deep learning and complicates multilingual and multimodal applications [28][30].
- Evidence suggests that SSMs outperform Transformers on raw data, underscoring Transformers' weakness with non-semantic token data [32].

Conclusion on SSMs vs. Transformers
- Both SSMs and Transformers have distinct strengths and weaknesses, and a hybrid approach could yield better performance [33][35].
- Research indicates that combining SSM and attention layers can enhance model capability, with an optimal ratio between 3:1 and 10:1 [37].
- A future direction may be models that process raw data directly, leveraging the advantages of both architectures [40].
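The fixed-state-versus-growing-cache contrast drawn above can be sketched in a few lines. The code below is a toy illustration under assumed dimensions, not Mamba's actual selective SSM (whose transition matrices are input-dependent); it only shows why SSM memory stays constant while a Transformer-style KV cache grows with sequence length.

```python
import torch

# Toy contrast of the two memory models described above. A, B, C are fixed
# random matrices here, whereas Mamba makes them input-dependent ("selective");
# all dimensions are assumptions for illustration.
d_state, d_model, seq_len = 16, 8, 1000
A = torch.rand(d_state, d_state) * 0.05    # small transition matrix -> stable recurrence
B = torch.randn(d_state, d_model) * 0.1
C = torch.randn(d_model, d_state)

# SSM-style: the entire past is compressed into one fixed-size hidden state h.
h = torch.zeros(d_state)
for t in range(seq_len):
    x_t = torch.randn(d_model)
    h = A @ h + B @ x_t                    # memory stays O(d_state), independent of t
    y_t = C @ h

# Transformer-style: every token's key/value is kept in a growing KV cache.
kv_cache = []
for t in range(seq_len):
    k_t, v_t = torch.randn(d_model), torch.randn(d_model)
    kv_cache.append((k_t, v_t))            # memory O(t); attention at step t costs O(t)

print(f"SSM state: {h.numel()} numbers; KV cache: {len(kv_cache) * 2 * d_model} numbers")
```

This constant-size state is also why SSMs lack the precise token-level recall noted under "Limitations of SSMs": information not folded into the hidden state when it arrives cannot be retrieved later.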
How Mind and Algorithm "Dance" Together (Frontier Watch: How Artificial Intelligence Is Changing the Research Paradigm)
Ren Min Ri Bao· 2025-06-13 21:43
Core Insights
- The rapid development of artificial intelligence (AI) is significantly transforming scientific research methodologies, particularly in psychology, with AI-driven scientific publications growing at an annual rate of 27.2% from 2019 to 2023 [1].

Group 1: AI and Psychology
- The historical connection between psychology and AI is notable; classical experiments such as Pavlov's conditioning influenced key AI techniques like reinforcement learning [2].
- AI applications in daily life often reflect psychological principles, such as the behavior-reinforcement mechanisms used by e-commerce and social media platforms [2].
- AI's ability to understand complex human behavior is enhanced by cognitive psychology, which informed the development of attention mechanisms in AI models [2].

Group 2: Data and Research Efficiency
- AI gives researchers access to vast streams of behavioral data from social media and wearable devices, significantly expanding the scope of psychological research [3].
- AI technologies improve the efficiency of psychological research by identifying hidden signals of social anxiety and assessing personality traits from text [3].
- Emotion-recognition technologies are being used in settings such as nursing homes to identify loneliness and other psychological states, improving mental health assessment [3].

Group 3: Innovations in Psychological Research
- Psychological researchers are developing self-help AI tools that enhance emotional understanding and interaction [5].
- AI is being trained to recognize subtle signals of psychological crisis, drawing on psychological models to improve the identification of distress [5].
- The integration of AI and psychological theory is fostering a deeper understanding of human emotion and improving predictive capability in mental health [5].

Group 4: Future Directions
- The interplay between psychology and AI is expected to deepen, with psychological insights potentially improving AI decision-making in complex environments [7].
- AI's ability to generate experimental materials and simulate human interaction will help advance psychological research [7].
- The human-AI relationship is prompting a reevaluation of emotional connection and of the ethical considerations surrounding AI's role in understanding human emotion [8].