Model Fusion

腾讯研究院AI速递 20250827
腾讯研究院· 2025-08-26 16:01
I. NVIDIA releases the Jet-Nemotron small-model series (2B/4B)
1. Jet-Nemotron is NVIDIA's latest small-model series, built by an all-Chinese team; it introduces post neural architecture search (PostNAS) and a new linear-attention module, JetBlock (a generic linear-attention sketch follows this digest item);
2. The models stand out on math, code, commonsense, retrieval, and long-context benchmarks, outperforming mainstream open-source full-attention language models such as Qwen3, Gemma3, and Llama3.2;
3. Inference throughput on an H100 GPU improves by up to 53.6x, with the advantage especially pronounced on long contexts, making this an important piece of NVIDIA's small-model strategy.
https://mp.weixin.qq.com/s/8ZbWGnogg40sHknVBWHH1Q
II. 面壁's new multimodal flagship MiniCPM-V 4.5: an 8B model that outperforms 72B (Generative AI)
1. 面壁小钢炮 MiniCPM-V 4.5 is the first multimodal model with "high-refresh-rate" video understanding; at 8B parameters it surpasses the Qwen2.5-VL 72B model;
2. It reaches same-size SOTA on the MotionBench and FavorBench leaderboards, can take in up to 6x as many video frames, and achieves a 96x visual compression rate;
3. It adopts 3D-Resampler high-density video compression, unified OCR, and knowledge-reasoning ...
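The Jet-Nemotron item above names JetBlock only as a "new linear-attention module" without further detail. As a general illustration of why linear attention helps long-context throughput (keys and values are compressed into a fixed-size state, so cost grows linearly rather than quadratically with sequence length), here is a minimal, generic sketch in PyTorch. It is not NVIDIA's JetBlock; the elu-based feature map and the non-causal formulation are assumptions borrowed from common linear-attention designs.

```python
# Generic linear-attention sketch (illustrative only; NOT NVIDIA's JetBlock,
# whose internals are not described in the digest above).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Non-causal linear attention over (batch, seq_len, dim) tensors.

    Cost is O(n * d^2): keys/values are first folded into a fixed-size
    (d x d) summary, instead of the O(n^2 * d) of softmax attention.
    """
    # Positive feature map (elu + 1), a common choice in linear-attention work.
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)   # fixed-size KV summary
    z = k.sum(dim=1)                          # (batch, dim) normalizer
    num = torch.einsum("bnd,bde->bne", q, kv)
    den = torch.einsum("bnd,bd->bn", q, z).unsqueeze(-1) + eps
    return num / den

# Usage: the KV summary stays (dim x dim) no matter how long the sequence is.
q = torch.randn(1, 2048, 64)
k = torch.randn(1, 2048, 64)
v = torch.randn(1, 2048, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([1, 2048, 64])
```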
ICML 2025 | CoTo: Letting LoRA Training Hit Its Stride, Mastering Both Model Fusion and Pruning
机器之心· 2025-07-26 12:17
Core Viewpoint
- The article introduces CoTo, a progressive training strategy designed to enhance the robustness and effectiveness of Low-Rank Adaptation (LoRA) models, addressing issues such as training instability and performance drop after pruning [1][4][23].

Summary by Sections

Conventional LoRA Training Issues
- LoRA faces challenges including "lazy training," where optimization gets stuck near suboptimal solutions, limiting generalization [7]
- There is a hierarchical imbalance in training, with gradient updates concentrated on top layers, leading to undertraining of lower layers [7]
- These issues complicate downstream operations like model fusion and pruning, often resulting in unsatisfactory outcomes [7]

CoTo Strategy
- CoTo employs a simple yet effective progressive activation strategy, initially deactivating a portion of LoRA adapters to encourage uniform gradient flow across all layers [5][8]
- The activation probability of adapters is gradually increased during training, returning to standard fine-tuning mode in later stages [8] (a minimal sketch of this schedule follows this summary)

Experimental Results
- CoTo significantly improves the fusion and pruning capabilities of LoRA models, enhancing single-task generalization performance and training efficiency [12][23]
- In linear interpolation tasks, CoTo models maintain smooth performance transitions, unlike standard LoRA, which experiences sharp declines [13]
- CoTo outperforms standard LoRA in both structured and unstructured pruning scenarios, demonstrating enhanced fault tolerance [17]

Performance and Efficiency Improvements
- CoTo consistently boosts performance across various benchmarks, including visual and language tasks, and achieves over 24% training acceleration when applied to HiRA [24][23]

Ablation Studies
- Rigorous ablation studies validate the design choices of CoTo and provide insights into effective regularization of LoRA [21]

Conclusion
- CoTo effectively resolves hierarchical imbalance and lazy optimization issues in LoRA training, enhancing model robustness and simplifying downstream operations like fusion and pruning [23]
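The summary describes CoTo's mechanism only at a high level: stochastically deactivate LoRA adapters early in training and ramp the activation probability up until training matches standard fine-tuning. Below is a minimal sketch of that idea; the linear ramp, the 75% warmup fraction, and the 0.25 starting probability are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a CoTo-style progressive activation schedule.
# Assumed schedule: linear ramp, 75% warmup, starting probability 0.25.
import random

def coto_activation_prob(step: int, total_steps: int,
                         p_start: float = 0.25, warmup_frac: float = 0.75) -> float:
    """Probability that any given LoRA adapter is active at `step`.

    Ramps linearly from p_start to 1.0 over the first warmup_frac of training;
    afterwards every adapter is always active (standard fine-tuning).
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step >= warmup_steps:
        return 1.0
    return p_start + (1.0 - p_start) * step / warmup_steps

def sample_adapter_mask(num_layers: int, p: float) -> list[bool]:
    """Independently decide, per layer, whether its LoRA adapter is active."""
    return [random.random() < p for _ in range(num_layers)]

# Usage inside a hypothetical training loop over a 24-layer model:
total_steps = 10_000
for step in (0, 2_500, 7_500, 9_999):
    p = coto_activation_prob(step, total_steps)
    mask = sample_adapter_mask(24, p)
    print(f"step {step}: p={p:.2f}, {sum(mask)}/24 adapters active")
```

An inactive adapter simply contributes nothing to that step's forward pass, which pushes gradient signal toward lower layers early in training; it also leaves the learned adapters more tolerant of being dropped or linearly interpolated later, consistent with the fusion and pruning results summarized above.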
No Need to Wait for R2! A Third Party Adds Deep Thinking to the New DeepSeek V3, Which Reasons for 101 Seconds to Solve the "7-Meter Sugarcane Through a 2-Meter Door" Puzzle
量子位· 2025-04-28 06:36
梦晨 from 凹非寺
量子位 | official account QbitAI

1.2 trillion parameters, 5.2 PB of training data, efficient use of Huawei chips... all one can say is that if even half of it is true, it would be remarkable.

The founder of Hugging Face recommends simply staying put amid the noise: turn on update notifications for the officially verified account and you will hear the moment anything ships.

Is DeepSeek about to release R2?? The rumors keep multiplying, and they are hard to verify. Setting aside whether the leaked figures are accurate, there does seem to be one consensus: if R2 really exists, its base model will be the new DeepSeek V3-0324. Part of the reason many expect R2 by the end of April is that R1 followed V3 by roughly one month.

Now, without waiting for DeepSeek itself, the open-source community has started adding deep thinking to V3-0324 on its own. The new model, DeepSeek-R1T-Chimera, matches the original R1 in capability but runs faster, emitting about 40% fewer output tokens, and its weights are open under the MIT license. In effect it pairs close-to-R1 capability with close-to-V3-0324 speed, combining the strengths of both. Moreover, this was achieved not by fine-tuning or distillation, but by merging the two models, DeepSeek V3-0324 and R1.

The R1+V3 merged model
The new R1T-Chimera model is not an official DeepSeek release; it comes from ...
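The excerpt says R1T-Chimera was produced by merging the V3-0324 and R1 models rather than by fine-tuning or distillation. As a generic illustration of weight-space merging (not the actual Chimera recipe, which the excerpt does not specify), here is a minimal sketch that interpolates two checkpoints sharing an identical architecture.

```python
# Naive weight-space merge of two checkpoints with identical architectures --
# an illustrative sketch of "model merging" in general, NOT the actual
# R1T-Chimera recipe, which the excerpt does not describe.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Element-wise interpolation: (1 - alpha) * A + alpha * B for every tensor."""
    return {name: (1.0 - alpha) * w_a + alpha * sd_b[name]
            for name, w_a in sd_a.items()}

# Usage with two small stand-in models of identical shape:
model_a = torch.nn.Linear(8, 8)
model_b = torch.nn.Linear(8, 8)
merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.5)
model_a.load_state_dict(merged)  # model_a now holds the merged weights
```

Real merges are usually more selective than a uniform average (for example, taking different layers or components from each parent model), but the interpolation above captures the basic idea of combining weights directly instead of running any additional training.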