Today's hot take: DeepSeek-OCR crushes every architecture
自动驾驶之心· 2025-10-27 00:03
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which significantly reduces the number of tokens required to store and process information by using images as memory carriers instead of relying solely on text tokens (see the sketch below) [3][6][12].

Group 1: Model Capabilities
- DeepSeek-OCR can store nearly the same amount of information using only one-tenth of the tokens required by traditional models [40][41].
- In tests, DeepSeek-OCR needed only 100 visual tokens to surpass GOT-OCR 2.0, which requires 256 tokens, and fewer than 800 visual tokens to outperform MinerU 2.0, which typically requires over 6,000 tokens [13][14].
- The model supports multiple resolutions and compression modes, allowing it to adapt to documents of different complexity, for example using only 64 visual tokens for simple documents [18][21].

Group 2: Data Collection and Utilization
- DeepSeek-OCR can capture previously uncollected data from two-dimensional content, such as graphs and figures in academic papers, which traditional models could not interpret [32][33].
- The model can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its efficiency in data collection [35].

Group 3: Resource Efficiency
- By using images for memory, DeepSeek-OCR reduces the computational load, allowing a significant decrease in token usage without sacrificing performance [40][41].
- The model maintains 96.5% accuracy while using only one-tenth of the original token count, demonstrating its effectiveness in resource management [41][42].

Group 4: Open Source and Community Contributions
- The development of DeepSeek-OCR is a collaborative effort that draws on various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [51][53].
- The integration of multiple open-source models has enabled DeepSeek to create an AI capable of "thinking in images," showcasing the power of community-driven innovation [53].
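To make the image-as-memory idea above concrete, here is a minimal illustrative sketch (not DeepSeek-OCR code): it renders a long text onto a page image with PIL and compares a rough text-token estimate against a hypothetical 100-visual-token budget. The 4-characters-per-token rule, the page size, and the wrapping width are assumptions for illustration, not measurements from the model.

```python
# Illustrative sketch only -- render text into an "optical memory" page and
# compare token counts. The 4-chars-per-token estimate and the 100-visual-token
# budget are assumptions, not measurements of DeepSeek-OCR.
import textwrap
from PIL import Image, ImageDraw

def render_page(text: str, size=(1024, 1024), chars_per_line: int = 100) -> Image.Image:
    """Render plain text onto a white page image (the image acts as the memory carrier)."""
    page = Image.new("RGB", size, "white")
    wrapped = "\n".join(textwrap.wrap(text, chars_per_line))
    ImageDraw.Draw(page).text((20, 20), wrapped, fill="black")
    return page

def estimated_text_tokens(text: str) -> int:
    # Rule of thumb: roughly 4 characters per BPE token (assumption).
    return max(1, len(text) // 4)

if __name__ == "__main__":
    document = "DeepSeek-OCR stores long documents as rendered pages. " * 200
    page = render_page(document)
    visual_token_budget = 100  # the 100-token setting cited in the summary
    text_tokens = estimated_text_tokens(document)
    print(f"text tokens ~{text_tokens}, visual tokens {visual_token_budget}, "
          f"compression ~{text_tokens / visual_token_budget:.1f}x")
```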
DeepSeek's newly open-sourced model is a little uncanny
创业邦· 2025-10-25 10:14
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which uses images to store information instead of relying solely on text tokens, significantly improving data compression and model efficiency [5][11][26].

Group 1: Model Functionality
- DeepSeek-OCR can convert large amounts of text into images that serve as a memory carrier for the AI, allowing more efficient data storage [9][14].
- The model demonstrates superior performance by using fewer visual tokens than traditional models, achieving better results with less resource consumption [11][26].
- In tests, DeepSeek-OCR used only 100 visual tokens to outperform GOT-OCR 2.0, which required 256 tokens, and fewer than 800 visual tokens versus over 6,000 for MinerU 2.0 [11][14].

Group 2: Data Collection and Utilization
- The model can capture previously uncollected data from two-dimensional content, such as graphs and figures in academic papers, which traditional models could not interpret [22][24].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its potential to enlarge the training datasets of future models [24].
- The model's ability to remember the position of images and their surrounding text allows for a more comprehensive understanding of the data [18][22].

Group 3: Resource Efficiency
- By using image-based memory, DeepSeek-OCR can cut the number of required tokens to one-tenth of the original while maintaining 96.5% accuracy (see the sketch below) [26][27].
- The model's design allows token usage to be adjusted dynamically based on document complexity, optimizing resource allocation [14][15].
- Even at 20-fold compression, the model retains around 60% accuracy, showcasing its robustness [27].

Group 4: Open Source Collaboration
- DeepSeek-OCR is an open-source project that integrates contributions from global open-source communities, using datasets and models from companies such as Huawei, Baidu, Meta, and OpenAI [32][34].
- This collaborative effort has produced a model capable of "thinking in images," highlighting the importance of community-driven innovation in AI development [34].
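The 10x/96.5% and 20x/~60% figures above suggest a simple way to reason about how aggressively to compress. The sketch below is a rough illustration rather than anything from the paper: it linearly interpolates between the two reported points to pick the largest compression ratio that still meets a target accuracy, and the linear shape of the curve between those points is an assumption.

```python
# Illustrative sketch only: interpolate between the two accuracy points reported
# above (about 96.5% at 10x compression, about 60% at 20x) to pick a compression
# ratio for a target accuracy. The linear shape between the points is an
# assumption for illustration, not a curve from the paper.
def expected_accuracy(compression_ratio: float) -> float:
    (x0, y0), (x1, y1) = (10.0, 0.965), (20.0, 0.60)   # (ratio, accuracy) from the summary
    if compression_ratio <= x0:
        return y0
    if compression_ratio >= x1:
        return y1
    t = (compression_ratio - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

def max_ratio_for(target_accuracy: float, step: float = 0.5) -> float:
    """Largest compression ratio whose interpolated accuracy still meets the target."""
    ratio = 10.0
    while ratio < 20.0 and expected_accuracy(ratio + step) >= target_accuracy:
        ratio += step
    return ratio

print(max_ratio_for(0.90))   # -> 11.5 under the linear assumption
```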
The new model DeepSeek open-sourced yesterday is a little uncanny
36Ke· 2025-10-22 01:00
Core Insights
- DeepSeek has introduced a new model called DeepSeek-OCR that compresses text information into images, achieving a significant reduction in token usage while maintaining high accuracy [5][31][39].

Group 1: Model Capabilities
- DeepSeek-OCR can store large amounts of text as images, yielding a more efficient representation of information than traditional text-based models [9][10].
- The model needs only 100 visual tokens to outperform previous models that required 256, and it achieves strong results with fewer than 800 visual tokens versus the more than 6,000 used by other models [14][31].
- DeepSeek-OCR supports multiple resolutions and compression modes, ranging from Tiny up to Gundam, adapting dynamically to document complexity (see the sketch below) [17][18].

Group 2: Data Utilization
- The model can capture previously unused data in documents, such as graphs and figures, which traditional models could not interpret effectively [24][26].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its potential to enlarge the training datasets of future models [29].
- By using image memory, the model significantly reduces the computational load, allowing longer conversations to be processed without a proportional increase in resource consumption [31].

Group 3: Open Source Collaboration
- The development of DeepSeek-OCR is a collaborative effort that integrates various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [38][39].
- The model's architecture reflects a collective achievement of the open-source community, showcasing the potential of collaborative innovation in AI development [39].
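The multi-mode design above lends itself to a simple dispatch rule. The sketch below is a hypothetical heuristic, not the released model's logic: the mode names follow the article (Tiny through Gundam), the 64-token budget for simple documents comes from the companion summaries, and the remaining budgets and complexity thresholds are assumptions for illustration.

```python
# Hypothetical mode dispatch: pick a visual-token budget from document complexity.
# Mode names follow the article; budgets other than Tiny's 64 and the thresholds
# below are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class OcrMode:
    name: str
    visual_tokens: int

MODES = [
    OcrMode("Tiny", 64),      # simple documents (figure cited in the summaries)
    OcrMode("Small", 100),    # assumed budget
    OcrMode("Base", 256),     # assumed budget
    OcrMode("Gundam", 800),   # assumed budget for dense or multi-column pages
]

def pick_mode(char_count: int, has_figures: bool) -> OcrMode:
    """Crude complexity heuristic: more text or embedded figures -> larger budget."""
    if has_figures or char_count > 6000:
        return MODES[3]
    if char_count > 3000:
        return MODES[2]
    if char_count > 1000:
        return MODES[1]
    return MODES[0]

print(pick_mode(800, has_figures=False).name)   # -> "Tiny"
```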
FleetCor (FLT) - 2025 H2 - Earnings Call Transcript
2025-08-27 01:02
Financial Data and Key Metrics Changes
- Overall Total Payment Volume (TPV) grew by 3%, but growth was inconsistent across brands and regions [3][4].
- Underlying Profit Before Tax (PBT) fell to just under $290 million, with significant impacts in Q1 and Q4 due to macro conditions [3][4].
- The company aims to hold underlying costs flat relative to FY 2025, despite a 3% increase in costs over the last twelve months [5][6].

Business Line Data and Key Metrics Changes
- The Corporate division grew its top line to $12.3 billion, with 6% PBT growth excluding Asia [7][8].
- The Leisure division grew TPV year on year, primarily from lower-margin brands, with profit falling due to soft trading conditions [8][9].
- Other segments were flat year on year, with increased profit contributions from operating businesses [9].

Market Data and Key Metrics Changes
- ANZ and The Americas reported solid profit growth, while EMEA and Asia declined [3][4].
- The UK corporate travel brand underperformed, and Asia faced operational challenges leading to additional provisions [4][5].
- The company expects EMEA and Asia to return to more appropriate levels by 2026 [4].

Company Strategy and Development Direction
- The company is focusing on productivity gains, cost reduction, and targeted investments in technology and AI [5][6][20].
- A new Global Business Services division aims to support frontline teams and improve operational efficiency [5][6].
- The company is exploring M&A opportunities to accelerate growth in specialist businesses [15][16].

Management's Comments on Operating Environment and Future Outlook
- Management acknowledges a challenging operating environment due to geopolitical tensions and macroeconomic conditions but remains optimistic about medium- to long-term growth [2][3].
- Promising signs are emerging in key markets, and the company is prepared for a market rebound [23][24].
- Management expects a challenging first half of FY 2026 but anticipates a stronger second half [43][44].

Other Important Information
- The company has undertaken $450 million in capital management initiatives, including debt repayment and share buybacks [9].
- Investment in TP Connect increased by $7 million to enhance airline content and new revenue streams [8].
- The company is launching a travel retail loyalty program to enhance customer engagement and drive growth [35][36].

Q&A Session Summary
Question: Can you provide details on the impact of lower overrides in FY 2025 and the potential upside for 2026?
- Management indicated that lower overrides significantly impacted the leisure business, particularly in the last quarter, and emphasized the importance of growth to achieve higher override tiers [48][52].
Question: What are the potential impacts of changes to payment surcharges in Australia?
- Management has evaluated the potential impacts and has prepared various options to mitigate any negative effects [54][57].
Question: Can you clarify the outlook for the first half of FY 2026?
- Management expects the like-for-like comparison to be relatively flat year on year, with improvements anticipated in Asia [60][62].
Question: What should be expected for the other segment's loss in FY 2026?
- Management expects the loss to decrease to around $70 million, with improvements anticipated from operating businesses [68][70].
Question: How is Corporate Traveler positioned in the UK and Europe?
- Management expressed confidence in the UK market, highlighting recent management changes and improvements to the product offering [90][92].
Breaking through SAM's limitations! Meituan proposes X-SAM: a unified framework sweeping 20+ segmentation benchmarks
自动驾驶之心· 2025-08-12 23:33
Core Insights
- The article introduces X-SAM, a new segmentation framework that overcomes the limitations of the Segment Anything Model (SAM) by enabling multi-task processing and integrating multi-modal capabilities [3][4][5].

Group 1: Limitations of SAM
- SAM was initially seen as a universal solution for visual segmentation but has significant limitations, including its single-task focus, its inability to understand text instructions, and the inefficiency of needing multiple models for different tasks [5][6][7].

Group 2: Innovations of X-SAM
- X-SAM integrates SAM's visual segmentation capabilities with the multi-modal understanding of large language models (LLMs) through a unified input format, a dual-encoder architecture, and multi-stage training [12][13][21].
- The unified input format lets various segmentation tasks be processed in a consistent manner, enhancing the model's ability to understand both text and visual prompts [13][15].
- The dual-encoder architecture pairs a global image encoder with a segmentation encoder, optimizing both overall scene understanding and pixel-level detail (see the sketch below) [14][19].
- Multi-stage training involves fine-tuning the segmentation model, aligning visual and language features, and mixed fine-tuning across diverse datasets to improve generalization [21][23].

Group 3: Performance Metrics
- X-SAM has demonstrated superior performance across more than 20 datasets and 7 core tasks, achieving state-of-the-art results on various segmentation benchmarks [27][28].
- On the COCO dataset, X-SAM achieved a panoptic quality (PQ) score of 54.7, closely following the best-performing model, Mask2Former [31].
- For open-vocabulary segmentation, X-SAM's average precision (AP) reached 16.2, significantly outperforming other models [31].
- In referring segmentation tasks, X-SAM achieved cumulative Intersection over Union (cIoU) scores of 85.1, 78.0, and 83.8 across different datasets, surpassing competitors [32].

Group 4: New Task Introduction
- X-SAM introduces a new task, Visual Grounding Detection (VGD) segmentation, which allows the model to segment all instances of a class based on visual prompts, even across different images [25][26][35].
- In experiments, X-SAM achieved average precision scores of 47.9 to 49.7 on VGD segmentation, significantly exceeding existing models [35].

Group 5: Future Directions
- The research team plans to extend X-SAM's capabilities to video segmentation and dynamic scenes, aiming to strengthen its application in temporal visual understanding [43].
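The dual-encoder design described above can be sketched structurally as follows. Everything here is a stand-in under stated assumptions (convolutional patch embeddings in place of the real encoders, a 4096-dimensional LLM embedding space, fusion by concatenation); it is not the released X-SAM implementation, only a shape-level illustration of routing two visual streams into a shared token space.

```python
# Structural sketch of a dual-encoder front end: one coarse "global" stream for
# scene understanding and one finer "segmentation" stream for pixel-level detail,
# both projected into the LLM's embedding space. Sizes and modules are assumptions.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        # Stand-ins for the global image encoder and the segmentation encoder.
        self.global_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)
        self.seg_encoder = nn.Conv2d(3, 256, kernel_size=8, stride=8)
        # Projections that align both visual streams with LLM token embeddings.
        self.global_proj = nn.Linear(256, llm_dim)
        self.seg_proj = nn.Linear(256, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        g = self.global_encoder(image).flatten(2).transpose(1, 2)   # (B, Ng, 256)
        s = self.seg_encoder(image).flatten(2).transpose(1, 2)      # (B, Ns, 256)
        tokens = torch.cat([self.global_proj(g), self.seg_proj(s)], dim=1)
        return tokens   # visual tokens to interleave with text-instruction tokens

tokens = DualEncoder()(torch.randn(1, 3, 512, 512))
print(tokens.shape)   # torch.Size([1, 5120, 4096]) under these patch strides
```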
Breaking through SAM's limitations! Sun Yat-sen University's X-SAM: a unified framework sweeping 20+ segmentation benchmarks
自动驾驶之心· 2025-08-12 10:37
Core Insights
- The article introduces X-SAM, a new segmentation framework that overcomes the limitations of the Segment Anything Model (SAM) by enabling multi-task processing and integrating multi-modal understanding capabilities [3][4][5].

Group 1: Limitations of SAM
- SAM was initially seen as a universal solution for visual segmentation but has significant limitations, including its inability to handle multiple tasks simultaneously and its lack of understanding of textual instructions [2][5][6].
- SAM is designed for single-object segmentation driven by visual prompts and cannot perform complex tasks such as semantic, instance, or panoptic segmentation [6].
- There is a gap between visual segmentation and multi-modal understanding: existing models can either understand images or perform pixel-level segmentation, but not both effectively [5][6].

Group 2: Innovations of X-SAM
- X-SAM is designed to fill the gap left by SAM, providing a unified segmentation framework that can handle various tasks and input types [7][8].
- The architecture includes a dual-encoder system that processes both visual and textual inputs, allowing a comprehensive understanding of images and instructions [12][14].
- X-SAM introduces a unified input format that standardizes how different segmentation tasks are processed, enabling the model to understand both textual and visual prompts [13][15].

Group 3: Performance and Testing
- X-SAM has been tested on more than 20 segmentation datasets and 7 core tasks, outperforming existing models in every category [4][27].
- In visual grounding (VGD) segmentation, the model achieved an average precision (AP) of 47.9 to 49.7, significantly surpassing previous models [26][35].
- In COCO panoptic segmentation, X-SAM achieved a panoptic quality (PQ) of 54.7, demonstrating its robustness on foundational segmentation tasks [31].

Group 4: Training Methodology
- X-SAM employs a multi-stage training strategy that includes fine-tuning the segmenter, pre-training for alignment, and mixed fine-tuning across various datasets [21][23].
- The training process incorporates a data-balancing resampling strategy so that smaller datasets are not overshadowed by larger ones, optimizing overall model performance (see the sketch below) [24].
- The architecture allows simultaneous training on multiple tasks, enhancing the model's generalization capabilities [37].

Group 5: Future Directions
- The research team plans to extend X-SAM's capabilities to video segmentation and dynamic scenes, aiming to bridge the gap between static image understanding and video comprehension [43].
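The data-balancing resampling mentioned in Group 4 can be illustrated with a small sketch. The square-root damping below is an assumed choice, the dataset names and sizes are made up for the example, and the paper's exact scheme may differ; the point is that damped weights keep small datasets visible during mixed fine-tuning.

```python
# Sketch of dataset-balanced sampling: draw each dataset with probability
# proportional to a dampened power of its size so small benchmarks are not
# drowned out. The sqrt exponent and the dataset sizes are assumptions.
import random

def sampling_weights(dataset_sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    damped = {name: size ** alpha for name, size in dataset_sizes.items()}
    total = sum(damped.values())
    return {name: w / total for name, w in damped.items()}

def sample_dataset(weights: dict[str, float]) -> str:
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

sizes = {"coco_panoptic": 118_000, "refcoco": 20_000, "reason_seg": 1_200}
weights = sampling_weights(sizes)
print({k: round(v, 3) for k, v in weights.items()})
# With alpha=0.5 the smallest set gets ~7% of draws instead of ~0.9% under raw sizes.
```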
A look at DreamVLA: teaching robots to look first, think next, then act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].

Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models map visual inputs and language commands directly to actions, which lets irrelevant information interfere in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking" that predicts world knowledge, including dynamic regions, depth information, and semantic features, before planning actions [5][27].

Group 2: Model Architecture and Functionality
- DreamVLA operates on a perception-prediction-action cycle, treating the task as an inverse-dynamics problem: derive the necessary actions from predicted future states (see the sketch below) [7][27].
- The model processes three types of input: visual images, language commands, and the robot's own state, each handled by a dedicated encoder [10][14].

Group 3: World Knowledge Prediction
- Rather than predicting actions directly, DreamVLA first predicts world knowledge comprising dynamic regions, depth maps, and semantic features [11][18].
- Dynamic-region prediction uses CoTracker to identify moving objects and generate masks that highlight relevant areas while filtering out static backgrounds [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps to assist with obstacle avoidance [13][17].
- Semantic prediction employs DINOv2 and SAM to extract high-level semantic information, which is then encoded into a unified "world embedding" for action generation [18][22].

Group 4: Action Generation
- The action-generation component uses a diffusion Transformer to produce future action sequences from the latent action embedding derived from the multi-modal inputs [23][27].
- A structured attention mechanism ensures coherent multi-step action reasoning and prevents cross-modal knowledge leakage [19][31].

Group 5: Performance and Validation
- DreamVLA achieved an average task-completion length of 4.44 on the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, and a real-world task success rate of 76.7% [25][27].
- Ablation studies confirmed the contributions of the individual components, demonstrating the model's robustness and generalization capabilities [25][31].
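A minimal control-flow sketch of the perception-prediction-action cycle is given below. All modules are simple stand-ins (linear layers in place of CoTracker, the depth head, DINOv2/SAM semantics, and the diffusion Transformer), and the tensor sizes are assumptions; the sketch only shows the ordering the article describes: predict world knowledge first, then derive the action from the predicted future state.

```python
# Control-flow sketch of "think before acting": stand-in heads predict world
# knowledge, then an action head maps the predicted state to a command.
# All modules and dimensions are illustrative assumptions, not DreamVLA code.
import torch
import torch.nn as nn

class DreamVLASketch(nn.Module):
    def __init__(self, obs_dim=512, world_dim=256, action_dim=7):
        super().__init__()
        self.dynamic_head = nn.Linear(obs_dim, world_dim)   # stand-in for dynamic-region prediction
        self.depth_head = nn.Linear(obs_dim, world_dim)     # stand-in for depth-map prediction
        self.semantic_head = nn.Linear(obs_dim, world_dim)  # stand-in for DINOv2/SAM semantics
        self.action_head = nn.Linear(3 * world_dim, action_dim)  # stand-in for the diffusion action head

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # 1) "Think": predict world knowledge about the future state.
        world = torch.cat(
            [self.dynamic_head(obs), self.depth_head(obs), self.semantic_head(obs)], dim=-1
        )
        # 2) "Act": inverse dynamics from the predicted state to an action.
        return self.action_head(world)

action = DreamVLASketch()(torch.randn(1, 512))
print(action.shape)   # torch.Size([1, 7]) -- e.g. a 7-DoF end-effector command
```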