状态空间模型（SSM） - filings, earnings calls, financial reports, news

状态空间模型（SSM）

Search documents

Sou Hu Cai Jing· 2025-11-03 04:14

Core Insights - Cartesia, a voice AI company, has recently launched its new voice model Sonic-3 and completed a $100 million Series B funding round, with NVIDIA among the investors [1][3][12] Company Overview - Cartesia was founded by Karan Goel, a talented individual from Stanford AI Lab, who has previously excelled in the field of state space models (SSM) [2][10] - The company has a strong academic foundation, with its core team primarily composed of members from Stanford AI Lab, including co-founder Albert Gu, a notable figure in the development of the Mamba architecture [3][4] Product Development - Cartesia has rapidly progressed since its inception, launching its first product, the Sonic voice model, shortly after securing seed funding. The company has since released multiple iterations, including Sonic-2.0 and the latest Sonic-3 [6][12] - Sonic-3 features significant upgrades, including improved emotional expression and faster response times, with a latency of only 90 milliseconds and an end-to-end response time of 190 milliseconds, making it one of the fastest voice generation systems available [8][12] Technology Differentiation - Unlike traditional voice AI models that rely on Transformer architecture, Sonic-3 is built on SSM, allowing for more natural and context-aware interactions without the need to revisit the entire conversation history [8][12] - This innovative approach enhances the model's ability to capture emotional nuances and respond more fluidly, positioning Cartesia as a leader in real-time voice AI technology [8][12] Market Context - The voice AI sector is witnessing significant advancements, with other companies like MiniMax also launching competitive products, indicating a growing market for voice models that can handle diverse languages and accents [14]

状态空间模型（SSM）

语音AI

Artificial Intelligence

Artificial Intelligence

Sonic-3

MiniMax Speech 2.6

黄仁勋投了家复刻马斯克声音的AI公司

量子位· 2025-11-03 03:12

Core Viewpoint - Cartesia, an AI voice company, has gained attention with its new voice model Sonic-3 and a recent $100 million Series B funding round, with notable investors including NVIDIA [3][4][13]. Group 1: Company Overview - Cartesia was founded by Karan Goel, a talented individual from Stanford AI Lab, who has previously excelled in the field of state space models (SSM) [5][6][28]. - The company has a strong academic foundation, with its core team primarily composed of members from Stanford AI Lab [7][11]. Group 2: Product Development - Cartesia's Sonic-3 model represents a significant upgrade, focusing on generating more human-like speech, capturing emotional nuances, and improving response speed [14][15][17]. - The model operates on a state space model (SSM) architecture, which allows for faster and more natural responses compared to traditional Transformer-based models [15][16]. Group 3: Funding and Growth - The company has rapidly progressed since its inception, securing seed funding in its second year and subsequently launching its first product, Sonic, which generated high-quality, natural-sounding speech [11][12]. - Following a $64 million Series A funding round earlier this year, Cartesia has now completed a $100 million Series B funding round, demonstrating its effective strategy of technology development alongside fundraising [12][13].

AI+HI系列：DecompGRNv1：基于线性RNN的端到端模型初探

Huachuang Securities· 2025-09-05 08:12

Quantitative Models and Construction Methods 1. Model Name: RNN-LIN - **Model Construction Idea**: Simplify the traditional GRU model by using a linear RNN structure, reducing parameter complexity while maintaining competitive performance[2][17][20] - **Model Construction Process**: - The model uses a linear RNN structure with only a forget gate and an output gate. The hidden state is updated without non-linear activation functions - Equations: $ h_{t} = f_{t} \otimes h_{t-1} + (1 - f_{t}) \otimes c_{t} $ $ y_{t} = o_{t} \otimes h_{t} $ $ f_{t} = Sigmoid(x_{t}W_{f}) $ $ o_{t} = Sigmoid(x_{t}W_{o}) $ $ c_{t} = SiLU(x_{t}W_{c}) $ - $f_{t}$: Forget gate - $o_{t}$: Output gate - $c_{t}$: Candidate state[20][21] - The model reduces parameters by approximately 50% compared to GRU[21] - **Evaluation**: The linear RNN model shows slightly weaker performance than GRU but remains competitive. Adding GLU modules improves its performance significantly[22][53] 2. Model Name: DecompGRN - **Model Construction Idea**: Extend the linear RNN by integrating cross-sectional information directly into the RNN gating mechanism, enabling simultaneous modeling of temporal and cross-sectional data[2][50] - **Model Construction Process**: - The first RNN layer outputs individual stock representations at each time step - Cross-sectional information is incorporated by grouping stocks based on market capitalization and calculating group de-meaned values - The second RNN layer combines temporal and cross-sectional information in the forget and output gates - Equations: $ h_{t} = f_{t} \otimes h_{t-1} + (1 - f_{t}) \otimes c_{t} $ $ y_{t} = o_{t} \otimes h_{t} $ $ f_{t} = Sigmoid(x_{t}W_{f}) $ $ o_{t} = Sigmoid(x_{t}W_{o}) $ $ c_{t} = SiLU(x_{t}W_{c}) $ - $f_{t}$: Forget gate - $o_{t}$: Output gate - $c_{t}$: Candidate state[50][55] - **Evaluation**: DecompGRN outperforms the GRU baseline in terms of RankIC and RankICIR while maintaining only 43% of the GRU's parameter count[74][53] --- Model Backtest Results 1. RNN-LIN - **RankIC**: - CSI All Share: 0.13 - CSI 300: 0.10 - CSI 500: 0.09 - CSI 1000: 0.12[36][37] - **RankICIR**: - CSI All Share: 1.08 - CSI 300: 0.62 - CSI 500: 0.71 - CSI 1000: 0.96[36][37] - **IC Win Rate**: - CSI All Share: 0.88 - CSI 300: 0.74 - CSI 500: 0.78 - CSI 1000: 0.86[36][37] - **Annualized Return (Top Group)**: - CSI All Share: 42.59% - CSI 300: 28.59% - CSI 500: 23.68% - CSI 1000: 32.81%[42] 2. DecompGRN - **RankIC**: - CSI All Share: 0.141 - CSI 300: 0.099 - CSI 500: 0.098 - CSI 1000: 0.127[55][58] - **RankICIR**: - CSI All Share: 1.26 - CSI 300: 0.65 - CSI 500: 0.77 - CSI 1000: 1.08[55][58] - **IC Win Rate**: - CSI All Share: 0.89 - CSI 300: 0.74 - CSI 500: 0.78 - CSI 1000: 0.88[55][58] - **Annualized Return (Top Group)**: - CSI All Share: 57.68% - CSI 300: 31.69% - CSI 500: 26.9% - CSI 1000: 40.35%[57][58] --- Index Enhancement Test Results (DecompGRN) - **Annualized Excess Return**: - CSI 300: 10.24% - CSI 500: 10.05% - CSI 1000: 19.58%[75][85] - **Tracking Error**: - CSI 300: 5.07 - CSI 500: 6.1 - CSI 1000: 6.75[75][85] - **Cumulative Excess Return (as of 2025-08-27)**: - CSI 300: 3.93% - CSI 500: 6.72% - CSI 1000: 18.26%[75][85]

一个任务50次调用，成本狂砍90%？Manus首次公开上下文工程秘诀，一堆反复重写换来的教训

AI前线· 2025-07-21 07:04

Core Insights - The article emphasizes the importance of context engineering in developing AI agents, highlighting the need for rapid iteration and improvement in response to evolving models and technologies [1][2]. Group 1: KV Cache Design - KV cache hit rate is identified as the most critical metric for AI agents in production, directly impacting latency and cost [4]. - The average input to output token ratio in Manus is approximately 100:1, which significantly benefits from KV caching, reducing the cost of cached input tokens to $0.30 per MTok compared to $3 per MTok for uncached tokens [5]. - Key practices to improve KV cache hit rate include maintaining stable prompt prefixes, appending content only, and marking cache breakpoints explicitly [8][9][10]. Group 2: Tool Management - As agents develop more capabilities, the complexity of the action space increases, leading to potential inefficiencies if tools are dynamically added or removed during iterations [11][14]. - Manus employs a context-aware state machine to manage tool availability without removing tools, thus preventing confusion and maintaining KV cache integrity [14][15][16]. Group 3: Context as a File System - The article discusses the limitations of context windows in modern large language models, suggesting that a file system can serve as an infinite context, allowing agents to read and write files as structured external memory [21]. - Manus implements a recoverable compression strategy, retaining essential information like URLs while allowing for context length reduction [24]. Group 4: Attention Manipulation - Manus uses a "todo.md" file to keep track of tasks, which helps maintain focus and avoid losing sight of goals during complex tasks [26][30]. - Retaining errors in the context is proposed as a method to improve agent behavior, allowing the model to learn from mistakes and reduce the likelihood of repeating them [32][35]. Group 5: Sample Diversity - The article warns against the pitfalls of few-shot prompting in agent systems, which can lead to repetitive and suboptimal actions [36]. - Introducing structured variations in actions and observations can help break patterns and adjust the model's attention, enhancing overall performance [37][38]. Group 6: Conclusion - Context engineering is deemed essential for AI agents, influencing their speed, recovery capabilities, and scalability [39]. - The future of agents will focus on constructing context effectively, underscoring the importance of thoughtful design [40].

上下文工程

上下文学习

状态空间模型（SSM）

Artificial Intelligence

Artificial Intelligence

Manus

「Tokens是胡扯」，Mamba作者抛出颠覆性观点，揭露Transformer深层缺陷

机器之心· 2025-07-09 09:52

Core Viewpoint - The article discusses the trade-offs between State Space Models (SSM) and Transformers, arguing that tokenization is a limitation that SSM can overcome, leading to better computational efficiency and modeling capabilities [1][3][61]. Group 1: State Space Models (SSM) - SSM is defined as a modern version of recurrent neural networks (RNN) with key features that allow it to match the language modeling performance of Transformers [8][10]. - A significant characteristic of SSM is that its hidden state dimension is greater than the input and output dimensions, allowing for better context storage [9][10]. - The model's state update function must be expressive enough to accurately encode and retrieve necessary information, which is achieved through dynamic transfer matrices in selective SSM [11][12]. - Mamba, a specific SSM, integrates parallelization and memory management techniques to enhance computational efficiency [13][14]. - The article highlights that SSMs can outperform Transformers in language modeling tasks when computational resources are matched [53][56]. Group 2: Transformers - Transformers excel in tasks requiring fine-grained operations on individual tokens, but they suffer from quadratic complexity, limiting their efficiency [82][86]. - The article argues that Transformers have an inductive bias that affects their modeling capabilities, making them sensitive to the resolution and semantic content of the data [83][85]. - Despite their strengths, Transformers are not the ultimate solution for all modeling tasks, and there is still significant work to be done in the field [89]. Group 3: Tokenization - Tokenization is a critical step in language modeling, but it introduces limitations in understanding language details [39][40]. - The article posits that removing tokenization could lead to better model performance and aligns with the essence of deep learning, which aims to minimize manual feature engineering [44][45]. - The author suggests that without tokenization, models could learn more effective patterns directly from raw data, enhancing their capabilities [46][52].

长视频理解新突破！Mamba混合架构让显存消耗腰斩，处理10万视频token不费力

量子位· 2025-03-27 04:16

Core Viewpoint - The article introduces the Vamba model, a hybrid Mamba-Transformer model designed for efficient understanding of long videos, significantly improving processing efficiency without compressing video tokens [1][10]. Group 1: Model Design and Efficiency - Vamba improves the efficiency of processing video tokens during training and inference by redesigning the model architecture rather than compressing video tokens [1][4]. - The model can process four times more video frames under the same hardware conditions compared to traditional Transformer architectures, with over 50% reduction in training memory consumption and doubled training speed [4][9]. - Vamba retains the original spatiotemporal features of videos, avoiding information loss that occurs with traditional downsampling or pooling methods [5][10]. Group 2: Technical Innovations - The core design of Vamba involves breaking down the costly causal self-attention operations into two more efficient components: cross-attention for text tokens and a state space model (SSM) based Mamba-2 module for video tokens [6][7]. - The Mamba-2 module reduces the computational complexity from quadratic to linear, allowing for effective processing of long video sequences [7][9]. - Vamba's architecture allows for efficient alignment of text and video information, enhancing the model's ability to analyze video content based on user queries [9][10]. Group 3: Performance Evaluation - Extensive experiments show that Vamba outperforms existing efficient long video understanding models by approximately 4.3% on the LVBench benchmark [5][10]. - The model demonstrates superior performance across various video duration benchmarks, showcasing its competitive edge in long, medium, and short video understanding tasks [10].