Today's hot take: DeepSeek-OCR crushes every architecture
自动驾驶之心· 2025-10-27 00:03
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which significantly reduces the number of tokens required to store and process information by using images as memory carriers instead of relying solely on text tokens (see the sketch below) [3][6][12].

Group 1: Model Capabilities
- DeepSeek-OCR can store nearly the same amount of information using only one-tenth of the tokens required by traditional models [40][41].
- In tests, DeepSeek-OCR needed only 100 visual tokens to surpass GOT-OCR 2.0, which requires 256 tokens, and fewer than 800 visual tokens to outperform MinerU 2.0, which typically requires over 6,000 tokens [13][14].
- The model supports multiple resolutions and compression modes, allowing it to adapt to documents of different complexity, for example using only 64 visual tokens for simple documents [18][21].

Group 2: Data Collection and Utilization
- DeepSeek-OCR can capture previously uncollected data from two-dimensional content, such as graphs and figures in academic papers, which traditional models could not interpret [32][33].
- The model can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its efficiency in data collection [35].

Group 3: Resource Efficiency
- By using images for memory, DeepSeek-OCR reduces the computational load, allowing a significant decrease in token usage without sacrificing performance [40][41].
- The model maintains 96.5% accuracy while using only one-tenth of the original token count, demonstrating its effectiveness in resource management [41][42].

Group 4: Open Source and Community Contributions
- The development of DeepSeek-OCR is a collaborative effort that draws on various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [51][53].
- The integration of multiple open-source models has enabled DeepSeek to create an AI capable of "thinking in images," showcasing the power of community-driven innovation [53].
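To make the image-as-memory idea above concrete, here is a minimal illustrative sketch (not DeepSeek-OCR code): it renders a long text onto a page image with PIL and compares a rough text-token estimate against a hypothetical 100-visual-token budget. The 4-characters-per-token rule, the page size, and the wrapping width are assumptions for illustration, not measurements from the model.

```python
# Illustrative sketch only -- render text into an "optical memory" page and
# compare token counts. The 4-chars-per-token estimate and the 100-visual-token
# budget are assumptions, not measurements of DeepSeek-OCR.
import textwrap
from PIL import Image, ImageDraw

def render_page(text: str, size=(1024, 1024), chars_per_line: int = 100) -> Image.Image:
    """Render plain text onto a white page image (the image acts as the memory carrier)."""
    page = Image.new("RGB", size, "white")
    wrapped = "\n".join(textwrap.wrap(text, chars_per_line))
    ImageDraw.Draw(page).text((20, 20), wrapped, fill="black")
    return page

def estimated_text_tokens(text: str) -> int:
    # Rule of thumb: roughly 4 characters per BPE token (assumption).
    return max(1, len(text) // 4)

if __name__ == "__main__":
    document = "DeepSeek-OCR stores long documents as rendered pages. " * 200
    page = render_page(document)
    visual_token_budget = 100  # the 100-token setting cited in the summary
    text_tokens = estimated_text_tokens(document)
    print(f"text tokens ~{text_tokens}, visual tokens {visual_token_budget}, "
          f"compression ~{text_tokens / visual_token_budget:.1f}x")
```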
DeepSeek's newly open-sourced model is a little uncanny
创业邦· 2025-10-25 10:14
Core Viewpoint
- DeepSeek has introduced a new model, DeepSeek-OCR, which uses images to store information instead of relying solely on text tokens, significantly improving data compression and model efficiency [5][11][26].

Group 1: Model Functionality
- DeepSeek-OCR can convert large amounts of text into images that serve as a memory carrier for the AI, allowing more efficient data storage [9][14].
- The model demonstrates superior performance by using fewer visual tokens than traditional models, achieving better results with less resource consumption [11][26].
- In tests, DeepSeek-OCR used only 100 visual tokens to outperform GOT-OCR 2.0, which required 256 tokens, and fewer than 800 visual tokens versus over 6,000 for MinerU 2.0 [11][14].

Group 2: Data Collection and Utilization
- The model can capture previously uncollected data from two-dimensional content, such as graphs and figures in academic papers, which traditional models could not interpret [22][24].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its potential to enlarge the training datasets of future models [24].
- The model's ability to remember the position of images and their surrounding text allows for a more comprehensive understanding of the data [18][22].

Group 3: Resource Efficiency
- By using image-based memory, DeepSeek-OCR can cut the number of required tokens to one-tenth of the original while maintaining 96.5% accuracy (see the sketch below) [26][27].
- The model's design allows token usage to be adjusted dynamically based on document complexity, optimizing resource allocation [14][15].
- Even at 20-fold compression, the model retains around 60% accuracy, showcasing its robustness [27].

Group 4: Open Source Collaboration
- DeepSeek-OCR is an open-source project that integrates contributions from global open-source communities, using datasets and models from companies such as Huawei, Baidu, Meta, and OpenAI [32][34].
- This collaborative effort has produced a model capable of "thinking in images," highlighting the importance of community-driven innovation in AI development [34].
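The 10x/96.5% and 20x/~60% figures above suggest a simple way to reason about how aggressively to compress. The sketch below is a rough illustration rather than anything from the paper: it linearly interpolates between the two reported points to pick the largest compression ratio that still meets a target accuracy, and the linear shape of the curve between those points is an assumption.

```python
# Illustrative sketch only: interpolate between the two accuracy points reported
# above (about 96.5% at 10x compression, about 60% at 20x) to pick a compression
# ratio for a target accuracy. The linear shape between the points is an
# assumption for illustration, not a curve from the paper.
def expected_accuracy(compression_ratio: float) -> float:
    (x0, y0), (x1, y1) = (10.0, 0.965), (20.0, 0.60)   # (ratio, accuracy) from the summary
    if compression_ratio <= x0:
        return y0
    if compression_ratio >= x1:
        return y1
    t = (compression_ratio - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

def max_ratio_for(target_accuracy: float, step: float = 0.5) -> float:
    """Largest compression ratio whose interpolated accuracy still meets the target."""
    ratio = 10.0
    while ratio < 20.0 and expected_accuracy(ratio + step) >= target_accuracy:
        ratio += step
    return ratio

print(max_ratio_for(0.90))   # -> 11.5 under the linear assumption
```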
The new model DeepSeek open-sourced yesterday is a little uncanny
36Ke· 2025-10-22 01:00
Core Insights
- DeepSeek has introduced a new model called DeepSeek-OCR that compresses text information into images, achieving a significant reduction in token usage while maintaining high accuracy [5][31][39].

Group 1: Model Capabilities
- DeepSeek-OCR can store large amounts of text as images, yielding a more efficient representation of information than traditional text-based models [9][10].
- The model needs only 100 visual tokens to outperform previous models that required 256, and it achieves strong results with fewer than 800 visual tokens versus the more than 6,000 used by other models [14][31].
- DeepSeek-OCR supports multiple resolutions and compression modes, ranging from Tiny up to Gundam, adapting dynamically to document complexity (see the sketch below) [17][18].

Group 2: Data Utilization
- The model can capture previously unused data in documents, such as graphs and figures, which traditional models could not interpret effectively [24][26].
- DeepSeek-OCR can generate over 200,000 pages of training data per day on a single A100 GPU, indicating its potential to enlarge the training datasets of future models [29].
- By using image memory, the model significantly reduces the computational load, allowing longer conversations to be processed without a proportional increase in resource consumption [31].

Group 3: Open Source Collaboration
- The development of DeepSeek-OCR is a collaborative effort that integrates various open-source resources, including Huawei's Wukong dataset and Meta's SAM for image feature extraction [38][39].
- The model's architecture reflects a collective achievement of the open-source community, showcasing the potential of collaborative innovation in AI development [39].
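The multi-mode design above lends itself to a simple dispatch rule. The sketch below is a hypothetical heuristic, not the released model's logic: the mode names follow the article (Tiny through Gundam), the 64-token budget for simple documents comes from the companion summaries, and the remaining budgets and complexity thresholds are assumptions for illustration.

```python
# Hypothetical mode dispatch: pick a visual-token budget from document complexity.
# Mode names follow the article; budgets other than Tiny's 64 and the thresholds
# below are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class OcrMode:
    name: str
    visual_tokens: int

MODES = [
    OcrMode("Tiny", 64),      # simple documents (figure cited in the summaries)
    OcrMode("Small", 100),    # assumed budget
    OcrMode("Base", 256),     # assumed budget
    OcrMode("Gundam", 800),   # assumed budget for dense or multi-column pages
]

def pick_mode(char_count: int, has_figures: bool) -> OcrMode:
    """Crude complexity heuristic: more text or embedded figures -> larger budget."""
    if has_figures or char_count > 6000:
        return MODES[3]
    if char_count > 3000:
        return MODES[2]
    if char_count > 1000:
        return MODES[1]
    return MODES[0]

print(pick_mode(800, has_figures=False).name)   # -> "Tiny"
```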
FleetCor (FLT) - 2025 H2 - Earnings Call Transcript
2025-08-27 01:02
Financial Data and Key Metrics Changes
- Overall Total Payment Volume (TPV) grew by 3%, but growth was inconsistent across brands and regions [3][4].
- Underlying Profit Before Tax (PBT) fell to just under $290 million, with significant impacts in Q1 and Q4 due to macro conditions [3][4].
- The company aims to hold underlying costs flat relative to FY 2025, despite a 3% increase in costs over the last twelve months [5][6].

Business Line Data and Key Metrics Changes
- The Corporate division grew its top line to $12.3 billion, with 6% PBT growth excluding Asia [7][8].
- The Leisure division grew TPV year on year, primarily from lower-margin brands, with profit falling due to soft trading conditions [8][9].
- Other segments were flat year on year, with increased profit contributions from operating businesses [9].

Market Data and Key Metrics Changes
- ANZ and The Americas reported solid profit growth, while EMEA and Asia declined [3][4].
- The UK corporate travel brand underperformed, and Asia faced operational challenges leading to additional provisions [4][5].
- The company expects EMEA and Asia to return to more appropriate levels by 2026 [4].

Company Strategy and Development Direction
- The company is focusing on productivity gains, cost reduction, and targeted investments in technology and AI [5][6][20].
- A new Global Business Services division aims to support frontline teams and improve operational efficiency [5][6].
- The company is exploring M&A opportunities to accelerate growth in specialist businesses [15][16].

Management's Comments on Operating Environment and Future Outlook
- Management acknowledges a challenging operating environment due to geopolitical tensions and macroeconomic conditions but remains optimistic about medium- to long-term growth [2][3].
- Promising signs are emerging in key markets, and the company is prepared for a market rebound [23][24].
- Management expects a challenging first half of FY 2026 but anticipates a stronger second half [43][44].

Other Important Information
- The company has undertaken $450 million in capital management initiatives, including debt repayment and share buybacks [9].
- Investment in TP Connect increased by $7 million to enhance airline content and new revenue streams [8].
- The company is launching a travel retail loyalty program to enhance customer engagement and drive growth [35][36].

Q&A Session Summary
Question: Can you provide details on the impact of lower overrides in FY 2025 and the potential upside for 2026?
- Management indicated that lower overrides significantly impacted the leisure business, particularly in the last quarter, and emphasized the importance of growth to achieve higher override tiers [48][52].
Question: What are the potential impacts of changes to payment surcharges in Australia?
- Management has evaluated the potential impacts and has prepared various options to mitigate any negative effects [54][57].
Question: Can you clarify the outlook for the first half of FY 2026?
- Management expects the like-for-like comparison to be relatively flat year on year, with improvements anticipated in Asia [60][62].
Question: What should be expected for the other segment's loss in FY 2026?
- Management expects the loss to decrease to around $70 million, with improvements anticipated from operating businesses [68][70].
Question: How is Corporate Traveler positioned in the UK and Europe?
- Management expressed confidence in the UK market, highlighting recent management changes and improvements to the product offering [90][92].
Breaking through SAM's limitations! Meituan proposes X-SAM: a unified framework sweeping 20+ segmentation benchmarks
自动驾驶之心· 2025-08-12 23:33
Core Insights
- The article introduces X-SAM, a new segmentation framework that overcomes the limitations of the Segment Anything Model (SAM) by enabling multi-task processing and integrating multi-modal capabilities [3][4][5].

Group 1: Limitations of SAM
- SAM was initially seen as a universal solution for visual segmentation but has significant limitations, including its single-task focus, its inability to understand text instructions, and the inefficiency of needing multiple models for different tasks [5][6][7].

Group 2: Innovations of X-SAM
- X-SAM integrates SAM's visual segmentation capabilities with the multi-modal understanding of large language models (LLMs) through a unified input format, a dual-encoder architecture, and multi-stage training [12][13][21].
- The unified input format lets various segmentation tasks be processed in a consistent manner, enhancing the model's ability to understand both text and visual prompts [13][15].
- The dual-encoder architecture pairs a global image encoder with a segmentation encoder, optimizing both overall scene understanding and pixel-level detail (see the sketch below) [14][19].
- Multi-stage training involves fine-tuning the segmentation model, aligning visual and language features, and mixed fine-tuning across diverse datasets to improve generalization [21][23].

Group 3: Performance Metrics
- X-SAM has demonstrated superior performance across more than 20 datasets and 7 core tasks, achieving state-of-the-art results on various segmentation benchmarks [27][28].
- On the COCO dataset, X-SAM achieved a panoptic quality (PQ) score of 54.7, closely following the best-performing model, Mask2Former [31].
- For open-vocabulary segmentation, X-SAM's average precision (AP) reached 16.2, significantly outperforming other models [31].
- In referring segmentation tasks, X-SAM achieved cumulative Intersection over Union (cIoU) scores of 85.1, 78.0, and 83.8 across different datasets, surpassing competitors [32].

Group 4: New Task Introduction
- X-SAM introduces a new task, Visual Grounding Detection (VGD) segmentation, which allows the model to segment all instances of a class based on visual prompts, even across different images [25][26][35].
- In experiments, X-SAM achieved average precision scores of 47.9 to 49.7 on VGD segmentation, significantly exceeding existing models [35].

Group 5: Future Directions
- The research team plans to extend X-SAM's capabilities to video segmentation and dynamic scenes, aiming to strengthen its application in temporal visual understanding [43].
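The dual-encoder design described above can be sketched structurally as follows. Everything here is a stand-in under stated assumptions (convolutional patch embeddings in place of the real encoders, a 4096-dimensional LLM embedding space, fusion by concatenation); it is not the released X-SAM implementation, only a shape-level illustration of routing two visual streams into a shared token space.

```python
# Structural sketch of a dual-encoder front end: one coarse "global" stream for
# scene understanding and one finer "segmentation" stream for pixel-level detail,
# both projected into the LLM's embedding space. Sizes and modules are assumptions.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        # Stand-ins for the global image encoder and the segmentation encoder.
        self.global_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)
        self.seg_encoder = nn.Conv2d(3, 256, kernel_size=8, stride=8)
        # Projections that align both visual streams with LLM token embeddings.
        self.global_proj = nn.Linear(256, llm_dim)
        self.seg_proj = nn.Linear(256, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        g = self.global_encoder(image).flatten(2).transpose(1, 2)   # (B, Ng, 256)
        s = self.seg_encoder(image).flatten(2).transpose(1, 2)      # (B, Ns, 256)
        tokens = torch.cat([self.global_proj(g), self.seg_proj(s)], dim=1)
        return tokens   # visual tokens to interleave with text-instruction tokens

tokens = DualEncoder()(torch.randn(1, 3, 512, 512))
print(tokens.shape)   # torch.Size([1, 5120, 4096]) under these patch strides
```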
Breaking through SAM's limitations! Sun Yat-sen University's X-SAM: a unified framework sweeping 20+ segmentation benchmarks
自动驾驶之心· 2025-08-12 10:37
Core Insights
- The article introduces X-SAM, a new segmentation framework that overcomes the limitations of the Segment Anything Model (SAM) by enabling multi-task processing and integrating multi-modal understanding capabilities [3][4][5].

Group 1: Limitations of SAM
- SAM was initially seen as a universal solution for visual segmentation but has significant limitations, including its inability to handle multiple tasks simultaneously and its lack of understanding of textual instructions [2][5][6].
- SAM is designed for single-object segmentation driven by visual prompts and cannot perform complex tasks such as semantic, instance, or panoptic segmentation [6].
- There is a gap between visual segmentation and multi-modal understanding: existing models can either understand images or perform pixel-level segmentation, but not both effectively [5][6].

Group 2: Innovations of X-SAM
- X-SAM is designed to fill the gap left by SAM, providing a unified segmentation framework that can handle various tasks and input types [7][8].
- The architecture includes a dual-encoder system that processes both visual and textual inputs, allowing a comprehensive understanding of images and instructions [12][14].
- X-SAM introduces a unified input format that standardizes how different segmentation tasks are processed, enabling the model to understand both textual and visual prompts [13][15].

Group 3: Performance and Testing
- X-SAM has been tested on more than 20 segmentation datasets and 7 core tasks, outperforming existing models in every category [4][27].
- In visual grounding (VGD) segmentation, the model achieved an average precision (AP) of 47.9 to 49.7, significantly surpassing previous models [26][35].
- In COCO panoptic segmentation, X-SAM achieved a panoptic quality (PQ) of 54.7, demonstrating its robustness on foundational segmentation tasks [31].

Group 4: Training Methodology
- X-SAM employs a multi-stage training strategy that includes fine-tuning the segmenter, pre-training for alignment, and mixed fine-tuning across various datasets [21][23].
- The training process incorporates a data-balancing resampling strategy so that smaller datasets are not overshadowed by larger ones, optimizing overall model performance (see the sketch below) [24].
- The architecture allows simultaneous training on multiple tasks, enhancing the model's generalization capabilities [37].

Group 5: Future Directions
- The research team plans to extend X-SAM's capabilities to video segmentation and dynamic scenes, aiming to bridge the gap between static image understanding and video comprehension [43].
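The data-balancing resampling mentioned in Group 4 can be illustrated with a small sketch. The square-root damping below is an assumed choice, the dataset names and sizes are made up for the example, and the paper's exact scheme may differ; the point is that damped weights keep small datasets visible during mixed fine-tuning.

```python
# Sketch of dataset-balanced sampling: draw each dataset with probability
# proportional to a dampened power of its size so small benchmarks are not
# drowned out. The sqrt exponent and the dataset sizes are assumptions.
import random

def sampling_weights(dataset_sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    damped = {name: size ** alpha for name, size in dataset_sizes.items()}
    total = sum(damped.values())
    return {name: w / total for name, w in damped.items()}

def sample_dataset(weights: dict[str, float]) -> str:
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

sizes = {"coco_panoptic": 118_000, "refcoco": 20_000, "reason_seg": 1_200}
weights = sampling_weights(sizes)
print({k: round(v, 3) for k, v in weights.items()})
# With alpha=0.5 the smallest set gets ~7% of draws instead of ~0.9% under raw sizes.
```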
A look at DreamVLA: teaching robots to look first, think next, then act
具身智能之心· 2025-08-11 00:14
Core Viewpoint
- The article introduces DreamVLA, a new Vision-Language-Action model that enhances robotic decision-making by integrating comprehensive world knowledge, allowing robots to predict dynamic environments and make more accurate action decisions [1][27].

Group 1: Background and Need for Advanced VLA Models
- Traditional VLA models map visual inputs and language commands directly to actions, which lets irrelevant information interfere in complex environments [3][5].
- DreamVLA addresses this by adding a layer of "thinking" that predicts world knowledge, including dynamic regions, depth information, and semantic features, before planning actions [5][27].

Group 2: Model Architecture and Functionality
- DreamVLA operates on a perception-prediction-action cycle, treating the task as an inverse-dynamics problem: derive the necessary actions from predicted future states (see the sketch below) [7][27].
- The model processes three types of input: visual images, language commands, and the robot's own state, each handled by a dedicated encoder [10][14].

Group 3: World Knowledge Prediction
- Rather than predicting actions directly, DreamVLA first predicts world knowledge comprising dynamic regions, depth maps, and semantic features [11][18].
- Dynamic-region prediction uses CoTracker to identify moving objects and generate masks that highlight relevant areas while filtering out static backgrounds [12][15].
- Depth prediction estimates the spatial relationships of objects, generating depth maps to assist with obstacle avoidance [13][17].
- Semantic prediction employs DINOv2 and SAM to extract high-level semantic information, which is then encoded into a unified "world embedding" for action generation [18][22].

Group 4: Action Generation
- The action-generation component uses a diffusion Transformer to produce future action sequences from the latent action embedding derived from the multi-modal inputs [23][27].
- A structured attention mechanism ensures coherent multi-step action reasoning and prevents cross-modal knowledge leakage [19][31].

Group 5: Performance and Validation
- DreamVLA achieved an average task-completion length of 4.44 on the CALVIN ABC-D benchmark, outperforming previous methods by 3.5%, and a real-world task success rate of 76.7% [25][27].
- Ablation studies confirmed the contributions of the individual components, demonstrating the model's robustness and generalization capabilities [25][31].
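A minimal control-flow sketch of the perception-prediction-action cycle is given below. All modules are simple stand-ins (linear layers in place of CoTracker, the depth head, DINOv2/SAM semantics, and the diffusion Transformer), and the tensor sizes are assumptions; the sketch only shows the ordering the article describes: predict world knowledge first, then derive the action from the predicted future state.

```python
# Control-flow sketch of "think before acting": stand-in heads predict world
# knowledge, then an action head maps the predicted state to a command.
# All modules and dimensions are illustrative assumptions, not DreamVLA code.
import torch
import torch.nn as nn

class DreamVLASketch(nn.Module):
    def __init__(self, obs_dim=512, world_dim=256, action_dim=7):
        super().__init__()
        self.dynamic_head = nn.Linear(obs_dim, world_dim)   # stand-in for dynamic-region prediction
        self.depth_head = nn.Linear(obs_dim, world_dim)     # stand-in for depth-map prediction
        self.semantic_head = nn.Linear(obs_dim, world_dim)  # stand-in for DINOv2/SAM semantics
        self.action_head = nn.Linear(3 * world_dim, action_dim)  # stand-in for the diffusion action head

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # 1) "Think": predict world knowledge about the future state.
        world = torch.cat(
            [self.dynamic_head(obs), self.depth_head(obs), self.semantic_head(obs)], dim=-1
        )
        # 2) "Act": inverse dynamics from the predicted state to an action.
        return self.action_head(world)

action = DreamVLASketch()(torch.randn(1, 512))
print(action.shape)   # torch.Size([1, 7]) -- e.g. a 7-DoF end-effector command
```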