Chain of Thought (CoT)
Some Recent Takeaways from Working on VLA
自动驾驶之心· 2025-12-11 00:05
Core Insights
- The article discusses the challenges and recent progress of Vision-Language Models (VLMs) for autonomous driving, highlighting hallucination, weak 3D spatial understanding, and processing speed as the main issues [3].

Group 1: Challenges in VLM
- Hallucination manifests as generating non-existent information and failing to perceive relevant data; it can be mitigated with dynamic perception techniques [3].
- Insufficient 3D spatial understanding stems from pre-training tasks being predominantly 2D; incorporating spatial localization tasks during training helps [3].
- Processing speed remains a concern; potential remedies include KV Cache, visual token compression, and mixed-data training to improve model efficiency [3].

Group 2: Learning Paradigms and Model Improvements
- The learning paradigm should shift from imitation learning (SFT) to preference learning (DPO, GRPO), and training multiple tasks simultaneously yields better results than training single tasks sequentially [3].
- Mixing pre-training data back into fine-tuning is a simple and effective way to prevent catastrophic forgetting in foundation models [3].
- Richer supervisory signals lead to better model representations, achieved by attaching auxiliary task heads to the VLM (a minimal sketch of such a head follows this summary) [3].

Group 3: Interaction and Evaluation
- Current VLMs exhibit insufficient interaction between vision and language, which limits their usefulness as base models; improving this interaction is crucial [3].
- The trajectory output format is flexible, and several approaches yield satisfactory results, though industry prefers diffusion heads for speed [3].
- Evaluation remains challenging because training and testing conditions are inconsistent, which calls for better alignment of objectives and data distributions [3].
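To make the auxiliary-head idea in Group 2 concrete, below is a minimal PyTorch-style sketch of attaching an extra supervision head to a VLM backbone and mixing its loss into training. Every name here (the `backbone` module, the 3D-coordinate target `object_xyz`, the 0.1 loss weight) is an illustrative assumption, not the article's actual implementation.

```python
import torch
import torch.nn as nn

class VLMWithAuxHead(nn.Module):
    """Hypothetical wrapper: a VLM backbone plus an auxiliary supervision head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int = 1024, vocab_size: int = 32000):
        super().__init__()
        self.backbone = backbone                      # any VLM returning per-token hidden states
        self.lm_head = nn.Linear(hidden_dim, vocab_size)   # main next-token / trajectory head
        self.aux_head = nn.Sequential(                # auxiliary head, e.g. 3D spatial regression
            nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, 3)
        )

    def forward(self, pixel_values, input_ids):
        hidden = self.backbone(pixel_values, input_ids)        # (B, T, hidden_dim), assumed interface
        return self.lm_head(hidden), self.aux_head(hidden[:, 0])  # main logits + aux prediction

def training_step(model, batch, aux_weight: float = 0.1):
    ce_loss, mse_loss = nn.CrossEntropyLoss(), nn.MSELoss()
    logits, aux_pred = model(batch["pixel_values"], batch["input_ids"])
    loss_main = ce_loss(logits.flatten(0, 1), batch["labels"].flatten())
    loss_aux = mse_loss(aux_pred, batch["object_xyz"])          # extra 3D supervision signal
    return loss_main + aux_weight * loss_aux
```

In practice the auxiliary target could be depth, 3D box centers, or occupancy labels, whatever spatial signal is cheap to obtain; the point from the article is only that an extra supervised head sharpens the shared representation.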
Just Now: OpenAI Open-Sources Two Reasoning Models That Run on a Laptop or Phone, with Performance Close to o4-mini
量子位· 2025-08-05 21:09
Core Insights
- OpenAI has released two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, its first open-source model release since GPT-2 in 2019 [3][4][5].
- The models are released under the Apache 2.0 license and can be used commercially without licensing fees [5].
- The gpt-oss models perform strongly on reasoning tasks, although they still lag behind proprietary models in code generation and complex reasoning [5][25].

Model Specifications
- gpt-oss-120b has 117 billion parameters and can run on a single 80GB GPU, achieving performance close to the proprietary o4-mini model [6][7].
- gpt-oss-20b has 21 billion parameters and can run on consumer-grade devices with 16GB of memory, performing similarly to o3-mini [6][7].
- Both models use a mixture-of-experts (MoE) architecture so that only a subset of parameters is active during inference (a toy routing sketch follows this summary) [29][30].

Performance Evaluation
- Across benchmarks, gpt-oss-120b outperformed OpenAI's o3-mini and matched or exceeded o4-mini in programming, general problem solving, and tool use [41][42].
- gpt-oss-20b achieved results comparable to o3-mini, excelling in particular at competition math and health-related questions [47].
- The models support three reasoning-effort levels, low, medium, and high, letting developers trade off latency against performance [38].

Technical Features
- The models employ advanced pre-training and post-training techniques, focusing on reasoning and on efficiency for broad deployment [27][35].
- They support a maximum context length of 128k tokens and use an attention mechanism designed to improve memory efficiency [31][33].
- OpenAI has also released the tokenizer used by these models, further supporting the open-source initiative [34].

Strategic Importance
- OpenAI views the release as a significant step forward for open-weight models, improving accessibility for developers and researchers [60].
- The open-source nature of the gpt-oss models lowers barriers for emerging markets and smaller organizations, promoting the democratization of AI technology [61][62].
- The initiative aims to foster a healthy open-source model ecosystem and broader adoption of AI for societal benefit [62].
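The mixture-of-experts point under Model Specifications is what lets a 117B-parameter model run on one 80GB GPU: only a few experts fire for each token. Below is a toy routing sketch of that general mechanism in PyTorch; it is not OpenAI's released gpt-oss architecture, and the expert count, hidden size, and top-k value are made-up illustrative numbers.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: each token is processed by only its top-k experts."""

    def __init__(self, dim: int = 64, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)     # scores experts for every token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                        # only the selected experts run
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoELayer()(x).shape)   # torch.Size([16, 64])
```

Real MoE implementations vectorize the expert dispatch and add load-balancing losses during training; the explicit loops here are only for readability.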
An Object Detection Model That Can "Think": IDEA Proposes Rex-Thinker, a Chain-of-Thought Referring Object Detection Model with Breakthroughs in Both Accuracy and Interpretability
机器之心· 2025-06-30 10:23
Core Insights
- The article introduces Rex-Thinker, a new approach from IDEA that brings chain-of-thought reasoning into visual referring tasks, significantly improving the model's ability to understand and locate objects through human-like reasoning [2][5].

Group 1: Innovation and Methodology
- Rex-Thinker builds an interpretable reasoning framework with three main steps, Planning, Action, and Summarization, allowing the model to break language instructions down into verifiable sub-steps [5][10].
- The model adopts a retrieval-based detection strategy: an open-vocabulary detector first generates candidate boxes, and the model then reasons over each candidate to produce structured outputs [9][10].
- The final output is standardized as JSON, improving the interpretability and reliability of the reasoning process (a hypothetical parsing sketch follows this summary) [10].

Group 2: Training and Data
- The HumanRef-CoT dataset was created by augmenting the existing HumanRef dataset with 90,000 chain-of-thought reasoning examples generated by GPT-4o, providing a foundation for training models with reasoning capabilities [12][14].
- Training proceeds in two phases: supervised fine-tuning (SFT) on HumanRef-CoT, followed by GRPO-based reinforcement learning that further improves reasoning quality and robustness [16][19].

Group 3: Performance and Results
- Rex-Thinker shows clear gains on the HumanRef Benchmark: adding CoT supervision raises the average DF1 score by 0.9 points and improves the rejection score by 13.8 percentage points [21].
- On the RefCOCOg dataset, Rex-Thinker exhibits strong transfer, achieving competitive performance without task-specific fine-tuning, and improves further with light GRPO adjustment [22].

Group 4: Visualization and Interpretability
- The article visualizes Rex-Thinker's reasoning process, showing how the model verifies conditions step by step and either outputs results or declines to predict, underscoring its clear reasoning path and interpretability [24].
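Because the summary stresses that Rex-Thinker standardizes its Planning, Action, and Summarization trace as JSON, here is a small hypothetical sketch of how a downstream consumer might parse such a trace and recover the accepted boxes. The schema (`planning`, `actions`, `summary`, `match`, `rejected`) is an assumption for illustration; the paper's actual field names may differ.

```python
import json

# Hypothetical reasoning-trace schema; field names are illustrative, not Rex-Thinker's exact format.
raw_output = """
{
  "planning": "Find the person holding an umbrella, then check who is on the left.",
  "actions": [
    {"candidate_box": [120, 80, 260, 400], "verdict": "holds umbrella, on the left", "match": true},
    {"candidate_box": [300, 90, 430, 410], "verdict": "no umbrella", "match": false}
  ],
  "summary": {"matched_boxes": [[120, 80, 260, 400]], "rejected": false}
}
"""

def extract_matches(model_output: str) -> list[list[int]]:
    """Parse the structured trace and return the boxes the model accepted ([] if it declined)."""
    trace = json.loads(model_output)
    if trace["summary"]["rejected"]:          # the model explicitly declined to predict
        return []
    # Cross-check the summary against per-candidate verdicts for interpretability.
    accepted = [a["candidate_box"] for a in trace["actions"] if a["match"]]
    assert accepted == trace["summary"]["matched_boxes"], "summary disagrees with actions"
    return accepted

print(extract_matches(raw_output))   # [[120, 80, 260, 400]]
```

A structured trace like this is what makes the reasoning auditable: each candidate box carries its own verdict, and a rejection is an explicit, checkable outcome rather than an empty prediction.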