Tencent Research Institute AI Digest 20251031
Tencent Research Institute · 2025-10-30 16:06
Group 1: OpenAI Developments
- OpenAI has open-sourced gpt-oss-safeguard, a safety-classification model in 120B- and 20B-parameter versions that can read a policy document directly and classify content against it, with no retraining required [1]
- The model outperforms GPT-5-thinking on multiple benchmarks and delivers industry-best cost-effectiveness on content-moderation evaluation sets and the ToxicChat dataset [1]
- OpenAI already uses this technology internally (the Safety Reasoner prototype) in image generation and in products such as Sora 2, where safety reasoning accounts for about 16% of its compute [1]; a usage sketch follows below
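The policy-in, label-out workflow can be sketched as follows, assuming a locally hosted gpt-oss-safeguard-20b behind an OpenAI-compatible chat endpoint; the URL, prompt layout, and label set are illustrative assumptions, not OpenAI's documented interface:

```python
# Minimal sketch: policy-conditioned content classification.
# Assumes gpt-oss-safeguard-20b is served behind an OpenAI-compatible
# endpoint (e.g. via a local inference server); URL, prompt layout, and
# label set are illustrative assumptions, not OpenAI's documented API.
import json
import urllib.request

POLICY = """Label the user content as ALLOWED or VIOLATION.
VIOLATION: instructions that facilitate self-harm or violence.
ALLOWED: everything else, including news reporting and fiction."""

def classify(content: str) -> str:
    payload = {
        "model": "gpt-oss-safeguard-20b",
        "messages": [
            {"role": "system", "content": POLICY},  # policy goes in directly; no retraining
            {"role": "user", "content": content},
        ],
        "temperature": 0,
    }
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",  # assumed local endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip()

if __name__ == "__main__":
    print(classify("How do I pick the lock on my own front door?"))
```

Because the policy travels in the prompt rather than in the weights, swapping moderation rules is a string edit, which is the core claim of the release.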
Group 2: Cursor 2.0 Update
- Cursor has released version 2.0, introducing its first in-house coding model, Composer, which generates at 250 tokens per second, roughly four times faster than comparable frontier systems [2]
- Composer uses a mixture-of-experts (MoE) architecture optimized for software engineering via reinforcement learning, and achieves frontier performance on the Cursor Bench evaluation [2]
- The new interface supports parallel multi-agent collaboration, letting different models work on the same task simultaneously in isolated git worktrees or on remote machines, and ships native browser tools for test-and-iterate loops [2]

Group 3: Sora New Features
- Sora has launched Character Cameos, which keep non-human cameo characters consistent and let users extract virtual characters from generated videos for reuse in new ones [3]
- Video stitching and community leaderboards have been added, ranking the most-used cameo characters and the most-remixed videos [3]
- Sora has temporarily lifted the invite-code requirement for direct sign-up in the US, Canada, Japan, and South Korea, timed to the launch of its Android version to capture the Android market [3]

Group 4: MiniMax Speech 2.6 Update
- MiniMax Speech 2.6 achieves end-to-end latency under 250 ms, an industry-leading figure, and has become an underlying technology engine for global voice platforms such as LiveKit and Pipecat [4]
- The new version reads non-standard text formats (URLs, emails, phone numbers, dates, amounts) directly, removing the need for cumbersome text-normalization preprocessing [4]
- The Fluent LoRA feature generates fluent, natural speech even from accented or non-fluent reference recordings, with support for more than 40 languages [4]

Group 5: Emu3.5 Launch
- The Beijing Academy of Artificial Intelligence (BAAI) has released Emu3.5, a multimodal world model built on a 34-billion-parameter dense transformer pre-trained on over 10 trillion tokens (roughly 790 years of video), which BAAI presents as the first demonstration of a "multimodal scaling paradigm" [5]
- It trains on a "next-state prediction" objective that yields visual-narrative and visual-guidance capabilities, matching Gemini-2.5-Flash-Image on image-editing tasks [5]

Group 6: OpenAI IPO Plans
- OpenAI plans to file its IPO application as early as the second half of 2026, aiming to raise at least $60 billion at a valuation that could reach $1 trillion, which would make it the largest IPO in history [6]
- After the restructuring, the non-profit will hold 26% of the newly formed OpenAI Group; Microsoft gives up its exclusive cloud-service priority but gains an additional $250 billion Azure procurement contract [6]
- The new agreement stipulates that any claim of achieving AGI must be verified by independent experts, extends Microsoft's rights to use OpenAI technology through 2032, and allows Microsoft to pursue AGI research independently or with third parties [6]

Group 7: OpenFold3 Release
- The OpenFold Consortium has released a preview of OpenFold3, trained on over 300,000 experimental structures and 13 million synthetic structures, capable of predicting protein interactions with small-molecule ligands and nucleic acids [7]
- In single-stranded RNA structure prediction its performance rivals AlphaFold3, and its modular design lets users modify the model to interpret their own data natively [7]
- All components are licensed under Apache 2.0, permitting commercial use; Novo Nordisk, Outpace Bio, and Bayer plan to use the model to accelerate research [7]

Group 8: Anthropic Research Findings
- Anthropic's latest research shows that Claude can detect and report concepts injected into its activations, with the strongest models succeeding in roughly 20% of introspection trials [8]; a toy sketch of activation-level injection appears at the end of this digest
- Using retroactive concept injection, the team found that models would defend and confabulate justifications for "their" errors when internal states were falsified after the fact [8]
- The experiments suggest some deliberate control over internal representations, a limited form of "access consciousness," while remaining far from subjective experience or "phenomenal consciousness" [8]

Group 9: Grokking Research Insights
- Yuandong Tian, formerly a research lead at Meta FAIR, published new work on grokking, proving mathematically that models need only O(M log M) samples to generalize, far below the traditional O(M²) requirement [9]; see the note below
- He shows that the essence of "insight" is a multi-peak non-convex optimization process: as data grows, the "generalization peak" rises above the "memorization peak," flipping the model from memorization to generalization [9]
- Tian emphasizes that representation learning underlies all intelligent capability, that the loss function is merely a proxy signal for optimization, and that true breakthroughs come from changes in representation [9]
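One way to build intuition for the O(M log M) scale in Group 9 (an editor's gloss, not necessarily the paper's argument): if generalization only requires the training set to touch each of M underlying structures at least once, and samples land on structures roughly uniformly, the coupon-collector bound already gives that rate:

```latex
% Coupon-collector gloss (illustrative, not the paper's proof): drawing
% uniformly from M underlying structures, the expected number of draws
% needed to cover every structure at least once is
\[
  \mathbb{E}[T] \;=\; M \sum_{k=1}^{M} \frac{1}{k} \;=\; M H_M \;=\; \Theta(M \log M),
\]
% far below the M^2 input pairs available in, say, a full modular-addition
% table, so an O(M log M)-sized subset of the data can already cover the
% structure that generalization needs.
```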
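And for Group 8, here is a toy sketch of activation-level concept injection in the published activation-steering style. The model choice, layer index, steering scale, and prompts are all assumptions for illustration; Anthropic's actual protocol and models differ.

```python
# Toy sketch of concept injection via a steering vector added to the
# residual stream. Model, layer, scale, and prompts are illustrative
# assumptions, not Anthropic's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in; any causal LM with accessible layers works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
LAYER = 6  # which residual stream to perturb (assumption)

def hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER].mean(dim=1)  # shape (1, d_model)

# "Concept vector": activation difference between contrastive prompts.
concept = hidden("ocean waves crashing on the beach") - hidden("a quiet empty room")

def inject(module, inputs, output):
    # Add the concept vector to every position's residual stream.
    h = output[0] if isinstance(output, tuple) else output
    h = h + 4.0 * concept  # steering scale is a free parameter
    return (h,) + output[1:] if isinstance(output, tuple) else h

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Tell me what you are thinking about:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

The introspection question in the research is then whether the model can notice and report the injected concept rather than merely being biased by it.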
Just released: BAAI's Wujie (悟界) Emu3.5 arrives with native world-modeling capability
机器之心 · 2025-10-30 08:52
Core Insights
- The article covers the release of Emu3.5, the latest multimodal model from the Beijing Academy of Artificial Intelligence (BAAI), and the capabilities and innovations behind it [3][4][6]

Model Overview
- Emu3.5 is defined as a "Multimodal World Foundation Model," distinguished from other generative models by its built-in world-modeling capability [4][5]
- The model was trained on more than 10 trillion multimodal tokens, sourced mainly from internet video totaling roughly 790 years of footage, which allows it to internalize the dynamics of the physical world [5][16]

Technological Innovations
- Emu3.5 introduces Discrete Diffusion Adaptation (DiDA), which speeds up image inference by nearly 20x with minimal performance loss, making it competitive with top closed-source diffusion models [6][24]; a control-flow sketch of this style of parallel decoding appears at the end of this article
- The architecture is a 34-billion-parameter dense transformer whose training is unified under a single "Next-State Prediction" objective [11][17]; a toy sketch of the objective also appears at the end of this article

Performance and Capabilities
- Emu3.5 demonstrates state-of-the-art performance across image editing and generation, visual-narrative creation, and visual guidance, outperforming competitors such as Google's Gemini-2.5-Flash-Image [28][35]
- The model can generate coherent visual narratives and step-by-step visual tutorials, a significant advance over traditional multimodal models [13][14]

Training Process
- Training consists of four core stages: large-scale pre-training, fine-tuning on high-quality datasets, large-scale multimodal reinforcement learning, and efficient autoregressive inference acceleration [17][21][22][24]
- The training data includes a vast amount of interleaved vision-language data, from which the model learns physical dynamics and causality [16][41]

Future Implications
- Emu3.5 is positioned as a foundational model for embodied intelligence, capable of generating diverse virtual environments and task-planning data [39][41]
- Open-sourcing Emu3.5 is expected to provide a robust new foundation for the global AI research community [7][45]
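To make the unified "Next-State Prediction" objective concrete, here is a toy sketch: interleaved text and vision tokens share one id space and one causal stream, and a single cross-entropy loss predicts every next token. The vocabulary split, model size, and random data are invented stand-ins; Emu3.5's actual tokenizer and 34B architecture are described in BAAI's report.

```python
# Toy sketch of a unified next-state-prediction objective over an
# interleaved multimodal token stream. Vocabulary layout, sizes, and data
# are invented stand-ins; Emu3.5's real tokenizer/architecture differ.
import torch
import torch.nn as nn

TEXT_VOCAB, VISION_VOCAB = 1000, 1024  # ids [0,1000) = text, [1000,2024) = vision
VOCAB = TEXT_VOCAB + VISION_VOCAB

class TinyCausalLM(nn.Module):
    def __init__(self, d=128, layers=2, heads=4, max_len=256):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(max_len, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):
        T = ids.size(1)
        x = self.emb(ids) + self.pos(torch.arange(T, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=mask))

# One interleaved "document": text tokens, then an image's vision tokens,
# then more text. The model predicts every next token with one loss.
text_a = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))
text_b = torch.randint(0, TEXT_VOCAB, (1, 16))
stream = torch.cat([text_a, image, text_b], dim=1)

model = TinyCausalLM()
logits = model(stream[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), stream[:, 1:].reshape(-1)
)
loss.backward()
print(f"next-state prediction loss: {loss.item():.3f}")
```

The point of the unification is that image editing, visual narrative, and visual guidance all reduce to the same loss over differently interleaved streams.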
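And for DiDA: BAAI's exact method is in the technical report, but the family it belongs to, trading token-by-token autoregressive decoding for a few rounds of parallel prediction over masked positions, has a simple control-flow shape (MaskGIT-style). The predictor below is a random stub, purely to show the loop; this is not BAAI's algorithm.

```python
# Control-flow sketch of parallel discrete decoding (MaskGIT-style),
# the general family DiDA belongs to. The predictor is a random stub;
# this shows the loop shape only, not BAAI's actual algorithm.
import torch

VOCAB, N_TOKENS, STEPS = 1024, 64, 4  # e.g. 64 vision tokens, 4 rounds
MASK = -1

def predict(tokens: torch.Tensor) -> torch.Tensor:
    """Stub: per-position logits; a real model would condition on the
    prompt and the current partially filled canvas."""
    return torch.randn(N_TOKENS, VOCAB)

canvas = torch.full((N_TOKENS,), MASK)
for step in range(STEPS):
    logits = predict(canvas)
    conf, guess = logits.softmax(-1).max(-1)  # predict all positions at once
    conf[canvas != MASK] = float("inf")       # committed tokens stay committed
    k = int(N_TOKENS * (step + 1) / STEPS)    # commit a growing fraction per round
    keep = conf.topk(k).indices
    filled = torch.where(canvas != MASK, canvas, guess)
    new_canvas = torch.full_like(canvas, MASK)
    new_canvas[keep] = filled[keep]
    canvas = new_canvas
print(f"decoded {int((canvas != MASK).sum())} tokens in {STEPS} parallel steps")
```

Filling 64 tokens in 4 rounds instead of 64 sequential steps is where a speedup on the order of the reported ~20x can come from, at the cost of some per-round prediction quality.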