The Multimodal Scaling Paradigm

Tencent Research Institute AI Digest 20251031
Tencent Research Institute · 2025-10-30 16:06
https://mp.weixin.qq.com/s/_dmZj9IwtbRLpvXHulQ_8g

Generative AI

I. OpenAI has just open-sourced two reasoning models dedicated to safety classification
1. OpenAI open-sourced the gpt-oss-safeguard safety classification models (120b and 20b versions) under the Apache 2.0 license; they read a policy document directly and classify content against it, with no retraining required (a usage sketch follows this digest);
2. The models outperform GPT-5-thinking on several benchmarks and deliver best-in-class cost-effectiveness on content-moderation evaluation sets and the ToxicChat dataset;
3. OpenAI already uses the technique internally (the Safety Reasoner prototype) in products such as image generation and Sora 2, where safety reasoning accounts for as much as 16% of compute.

II. Cursor 2.0 update: in-house Composer model and parallel multi-agent workflows
1. Cursor released version 2.0 with Composer, its first in-house coding model, which generates 250 tokens per second, roughly 4x faster than comparable frontier systems, marking a shift from "AI wrapper" to "AI-native platform";
2. Composer uses a mixture-of-experts (MoE) architecture optimized for software engineering via reinforcement learning; it reaches frontier-level results on the Cursor Bench evaluation and is already used by the team in daily development;
3. New ...
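The mechanism worth noting in item I.1 is policy-as-prompt classification: the policy document itself is handed to the model as its instruction, so updating moderation rules means editing text rather than retraining a classifier. Below is a minimal sketch of that pattern, assuming the weights ship as standard Hugging Face causal LMs under an id like openai/gpt-oss-safeguard-20b; the repo id, policy wording, and label format here are illustrative assumptions, not OpenAI's official harness.

```python
# Minimal sketch: classify content against a policy given entirely in the prompt.
# Assumptions: the HF repo id and prompt/label format below are illustrative;
# consult the model card for the official format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "openai/gpt-oss-safeguard-20b"  # assumed repo id; check the release

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto",
                                             device_map="auto")

# The policy document is the classification spec; editing it changes behavior
# without any retraining.
policy = """Classify the user's content against this policy.
ALLOWED: general discussion, criticism, fiction.
VIOLATION: instructions that facilitate real-world harm.
Reply with one label, ALLOWED or VIOLATION, then a one-line rationale."""

content = "How do I pick the lock on my neighbor's front door?"

messages = [
    {"role": "system", "content": policy},
    {"role": "user", "content": content},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Because the policy travels with each request, the same checkpoint can enforce different rulebooks for different products, which is presumably what makes the approach attractive for moderation at scale.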
Just released: BAAI's Wujie (悟界) Emu3.5 arrives with native world-modeling capability
机器之心 · 2025-10-30 08:52
Core Insights
- The article covers the release of the latest multimodal model, Emu3.5, by the Beijing Academy of Artificial Intelligence (BAAI), highlighting its capabilities and innovations [3][4][6].

Model Overview
- Emu3.5 is defined as a "Multimodal World Foundation Model," distinguished from other generative models by its inherent world-modeling capabilities [4][5].
- The model was trained on over 10 trillion multimodal tokens, primarily sourced from internet videos totaling approximately 790 years in duration, allowing it to internalize the dynamics of the physical world [5][16].

Technological Innovations
- Emu3.5 introduces "Discrete Diffusion Adaptation" (DiDA), which speeds up image inference by nearly 20x with minimal performance loss, making it competitive with top closed-source diffusion models [6][24] (a schematic of this style of parallel decoding follows this summary).
- The architecture is a 34-billion-parameter dense transformer whose training objectives are unified under "Next-State Prediction" [11][17] (a toy sketch of that objective also follows).

Performance and Capabilities
- Emu3.5 achieves state-of-the-art results across tasks including image editing and generation, visual narrative creation, and visual guidance, outperforming competitors such as Google's Gemini-2.5-Flash-Image [28][35].
- The model can generate coherent visual narratives and step-by-step visual tutorials, a significant advance over traditional multimodal models [13][14].

Training Process
- Training consists of four core stages: large-scale pre-training, fine-tuning on high-quality datasets, large-scale multimodal reinforcement learning, and efficient autoregressive inference acceleration [17][21][22][24].
- The training data includes a vast amount of interleaved visual-language data, allowing the model to learn physical dynamics and causality [16][41].

Future Implications
- Emu3.5 is positioned as a foundational model for future work in embodied intelligence, capable of generating diverse virtual environments and task-planning data [39][41].
- The open-sourcing of Emu3.5 is expected to provide a robust new foundation for the global AI research community [7][45].
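The "Next-State Prediction" framing reduces, at minimum, to a single autoregressive loss over an interleaved token stream in which text tokens and discretized image tokens share one output space. The toy below shows just that reduction; the vocabulary split, dimensions, and miniature model are placeholders, since Emu3.5's actual tokenizer and 34B architecture are not specified in this digest.

```python
# Toy sketch of a unified next-state objective over interleaved multimodal
# tokens. All sizes and names are illustrative stand-ins, not Emu3.5's design.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32_000, 8_192      # assumed vocabulary sizes
VOCAB = TEXT_VOCAB + IMAGE_VOCAB             # shared codebook: image ids offset

class TinyNextStateLM(nn.Module):
    """A miniature dense decoder standing in for the 34B transformer."""
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        x = self.embed(tokens)
        n = tokens.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), 1)
        return self.head(self.blocks(x, mask=causal))

# Interleaved sequence: text, then image tokens (offset into the shared
# vocabulary), then more text. The model predicts the next token regardless
# of which modality it belongs to -- that is the entire unification.
text_a = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB
text_b = torch.randint(0, TEXT_VOCAB, (1, 8))
seq = torch.cat([text_a, image, text_b], dim=1)

model = TinyNextStateLM()
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB),
                                   seq[:, 1:].reshape(-1))
print(f"unified next-state loss: {loss.item():.3f}")
```

One loss over one stream is what lets a single checkpoint handle narration, editing, and tutorial generation without task-specific heads.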
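For DiDA, the digest says only that sequential token-by-token image decoding is converted into bidirectional parallel prediction, yielding the ~20x speedup. The sketch below illustrates the general family such a method belongs to, iterative parallel unmasking in the MaskGIT style, with a random stand-in for the model; it is a schematic of parallel discrete decoding under those assumptions, not Emu3.5's actual algorithm.

```python
# Schematic of parallel discrete decoding: instead of emitting image tokens
# one at a time, all positions are predicted each step and the most confident
# ones are committed. Schedule and model call are illustrative assumptions.
import torch

IMAGE_VOCAB, MASK_ID, NUM_TOKENS, STEPS = 8_192, 8_192, 64, 4

def denoise_logits(tokens):
    """Stand-in for a bidirectional model call; returns random logits."""
    return torch.randn(tokens.size(0), tokens.size(1), IMAGE_VOCAB)

def parallel_decode(batch=1):
    tokens = torch.full((batch, NUM_TOKENS), MASK_ID)      # fully masked canvas
    for step in range(STEPS):
        logits = denoise_logits(tokens)
        conf, pred = logits.softmax(-1).max(-1)            # per-position guess
        still_masked = tokens.eq(MASK_ID)
        conf = conf.masked_fill(~still_masked, -1.0)       # keep committed ones
        # Commit an equal fraction of positions each step, all in parallel.
        k = max(1, int(NUM_TOKENS * (step + 1) / STEPS)
                   - int(NUM_TOKENS * step / STEPS))
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens

print(parallel_decode()[0][:10])  # 64 tokens decoded in 4 parallel passes
```

The speedup comes from replacing thousands of sequential forward passes with a handful of parallel ones, which is consistent with the reported near-20x gain at small quality cost.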


