AI Roundup: AgiBot Launches Robot World Model Platform Genie Envisioner; Zhipu AI Releases GLM-4.5V Visual Reasoning Model
China Post Securities·2025-08-25 11:47
- The Genie Envisioner platform introduces a video-centric world-modeling paradigm that models robot-environment interaction directly in visual space, preserving spatial structure and temporal evolution. This improves cross-domain generalization and long-horizon task execution, reaching a 76% success rate on long-step tasks such as folding cardboard boxes versus 48% for the π0 model[12][13][16]
- Genie Envisioner comprises three core components: GE-Base, a multi-view video world foundation model trained on 3,000 hours of real robot data; GE-Act, a lightweight 160M-parameter action decoder enabling real-time control (a toy action-decoder sketch follows this list); and GE-Sim, a hierarchical action-conditioned simulator for closed-loop policy evaluation and large-scale data generation[16][17][19]
- The GLM-4.5V visual reasoning model, with 106B total parameters and 12B active parameters, achieves state-of-the-art (SOTA) results across 41 multimodal benchmarks spanning image, video, and document understanding as well as GUI-agent tasks. It incorporates 3D-RoPE and bicubic-interpolation mechanisms to strengthen 3D spatial-relationship perception and high-resolution adaptability (see the positional-embedding sketch below)[20][21][22]
- GLM-4.5V follows a three-stage training strategy: pretraining on large-scale multimodal corpora, supervised fine-tuning on chain-of-thought samples, and reinforcement learning combining RLVR and RLHF. This layered training yields strong document processing and emergent abilities such as generating structured HTML/CSS/JavaScript code from screenshots or videos[23][24][26]
- VeOmni, a fully modular multimodal training framework, decouples model definition from distributed-parallel logic, enabling flexible parallel strategies such as FSDP, HSDP+SP, and EP (see the strategy-registry sketch below). It reaches 43.98% MFU at a 64K sequence length, supports sequences up to 192K, and is reported to cut engineering overhead by more than 90%[27][28][31]
- VeOmni introduces asynchronous sequence parallelism (Async-Ulysses) and COMET technology for MoE models, achieving linear scaling of training throughput for 30B-parameter models at sequence lengths up to 160K. It also integrates dynamic batching and FlashAttention to minimize memory waste and optimize operator-level recomputation (see the Ulysses re-layout and token-budget batching sketches below)[31][32][34]
- Skywork UniPic 2.0, a unified multimodal framework, integrates image understanding, text-to-image (T2I) generation, and image-to-image (I2I) editing in a single model. It applies a progressive dual-task reinforcement strategy (Flow-GRPO) that optimizes image editing and T2I in sequence, leading benchmarks such as GenEval and GEdit-EN (see the group-relative-advantage sketch below)[35][38][39]
- UniPic 2.0 relies on Skywork-EditReward, an image-editing-specific reward model that provides pixel-level quality scores. This design enables precise recognition of image elements and generation of matching textual descriptions, scoring 83.5 on MMBench, comparable to 19B-parameter models[38][42][43]
- FlowReasoner, a query-level meta-agent framework, dynamically generates a personalized multi-agent system for each individual query. It trains with GRPO reinforcement learning under a multi-objective reward and reaches 92.15% accuracy on the MBPP dataset, outperforming baselines such as Aflow and LLM-Blender[63][64][68]
- FlowReasoner is trained in three stages: synthesizing reasoning data via distillation, supervised fine-tuning for basic workflow generation, and reinforcement learning with external execution feedback. It generalizes robustly, maintaining high accuracy even when the underlying worker model is replaced[66][68][69]
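GE-Act is described above as a lightweight 160M-parameter decoder that turns GE-Base's video latents into real-time control. The toy sketch below shows one common way such an action head can be structured: learned per-step queries cross-attend to world-model features and are projected to a chunk of future actions. All dimensions and module names are illustrative assumptions, not GE-Act's published architecture.

```python
# Toy sketch of a lightweight action head decoding world-model video latents
# into a chunk of future robot actions. Shapes and names (latent_dim,
# ActionDecoder) are illustrative, not GE-Act's actual design.
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    def __init__(self, latent_dim=512, action_dim=7, chunk_len=16, n_layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=latent_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Learned queries, one per future action step in the chunk.
        self.action_queries = nn.Parameter(torch.randn(chunk_len, latent_dim))
        self.to_action = nn.Linear(latent_dim, action_dim)

    def forward(self, video_latents):            # (B, T, latent_dim)
        B = video_latents.shape[0]
        queries = self.action_queries.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(tgt=queries, memory=video_latents)
        return self.to_action(h)                 # (B, chunk_len, action_dim)

decoder = ActionDecoder()
latents = torch.randn(2, 32, 512)                # stand-in for GE-Base features
print(decoder(latents).shape)                    # torch.Size([2, 16, 7])
```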
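GLM-4.5V's bicubic-interpolation mechanism for high-resolution adaptability is summarized above without detail. The standard recipe it most plausibly resembles is resizing a ViT's learned 2-D positional-embedding grid with bicubic interpolation, sketched below; the grid sizes and embedding width are assumptions for illustration.

```python
# Standard recipe for adapting a ViT's learned 2-D positional embeddings to a
# new input resolution via bicubic interpolation. Grid sizes are assumptions;
# GLM-4.5V's exact embedding layout is not specified in the summary above.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, old_grid*old_grid, dim) -> (1, new_grid*new_grid, dim)."""
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so F.interpolate can treat it as an image.
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

pos = torch.randn(1, 16 * 16, 768)    # embeddings trained at a 16x16 patch grid
print(resize_pos_embed(pos, 16, 32).shape)   # torch.Size([1, 1024, 768])
```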
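VeOmni's key design point above is decoupling model definition from distributed-parallel logic. The sketch below illustrates that separation with a tiny strategy registry built on stock PyTorch wrappers; it is a minimal pattern in VeOmni's spirit, not VeOmni's actual API.

```python
# Minimal sketch of decoupling model definition from parallelization: the
# model is defined once, and a strategy name selects the distributed wrapper
# applied to it. Not VeOmni's actual API.
import torch
import torch.nn as nn
import torch.distributed as dist

def apply_fsdp(model):
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    return FSDP(model)          # shards parameters, gradients, optimizer state

def apply_ddp(model):
    from torch.nn.parallel import DistributedDataParallel as DDP
    return DDP(model)           # replicates the model, all-reduces gradients

PARALLEL_STRATEGIES = {"fsdp": apply_fsdp, "ddp": apply_ddp, "none": lambda m: m}

def build(strategy="none"):
    # The model definition knows nothing about the parallel strategy.
    model = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128))
    if strategy != "none" and not dist.is_initialized():
        raise RuntimeError("init_process_group() first for distributed wrappers")
    return PARALLEL_STRATEGIES[strategy](model)

# Single-process demo: the same definition, untouched by parallel logic.
print(build("none"))
```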
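Ulysses-style sequence parallelism, which Async-Ulysses builds on, shards the sequence dimension across devices and performs an all-to-all before attention so each device ends up holding the full sequence for a subset of heads. The single-process simulation below shows only that tensor re-layout, with "ranks" simulated as a list of tensors; the communication-compute overlap that gives Async-Ulysses its name is not modeled.

```python
# Single-process simulation of the Ulysses sequence-parallel re-layout: an
# all-to-all swaps a sequence shard for a head shard, so each rank sees the
# full sequence for n_heads / world_size attention heads.
import torch

world_size, seq_len, n_heads, head_dim = 4, 16, 8, 32
# Each "rank" holds a contiguous sequence shard with all heads: (seq/P, H, d).
shards = [torch.randn(seq_len // world_size, n_heads, head_dim)
          for _ in range(world_size)]

def ulysses_all_to_all(shards):
    P = len(shards)
    out = []
    for r in range(P):                       # receiving rank
        # Gather rank r's head slice from every sequence shard.
        pieces = [s[:, r * (n_heads // P):(r + 1) * (n_heads // P), :]
                  for s in shards]
        out.append(torch.cat(pieces, dim=0))  # full sequence, subset of heads
    return out

after = ulysses_all_to_all(shards)
print(after[0].shape)   # torch.Size([16, 2, 32]): full sequence, 2 of 8 heads
```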
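The dynamic batch processing credited to VeOmni above typically means packing variable-length samples against a token budget so padding waste is minimized. Below is a generic greedy sketch of that idea under an assumed budget; it is not VeOmni's implementation.

```python
# Generic token-budget batching to cut padding waste: sort samples by length,
# then greedily pack them while the padded footprint (longest sample in the
# batch times batch size) stays within the budget.
def dynamic_batches(lengths, token_budget=1024):
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for i in order:
        candidate = current + [i]
        padded = max(lengths[j] for j in candidate) * len(candidate)
        if padded > token_budget and current:
            batches.append(current)
            candidate = [i]
        current = candidate
    if current:
        batches.append(current)
    return batches

lengths = [37, 512, 40, 900, 35, 480]
for b in dynamic_batches(lengths):
    print(b, "padded tokens:", max(lengths[j] for j in b) * len(b))
```

Sorting by length keeps similarly sized samples together, which is what drives the padding savings; shuffling between epochs would then happen at the batch level rather than the sample level.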
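Both Flow-GRPO (UniPic 2.0) and FlowReasoner's GRPO training share one core mechanic: advantages are computed relative to a group of rollouts for the same query rather than from a learned value critic. The sketch below shows that group normalization, with an assumed multi-objective weighting standing in for FlowReasoner's reward mix, which the summary above does not specify.

```python
# Core of GRPO-style training: sample a group of rollouts per query, score
# each, and form advantages by normalizing rewards within the group, removing
# the need for a learned value critic. Reward weights are illustrative.
import torch

def multi_objective_reward(correct, efficiency, complexity,
                           w=(1.0, 0.3, 0.2)):
    # Hypothetical objective weighting (FlowReasoner's exact mix differs).
    return w[0] * correct + w[1] * efficiency - w[2] * complexity

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (groups, group_size) -> same-shape advantages."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 queries, 4 sampled workflows each.
rewards = multi_objective_reward(
    correct=torch.tensor([[1., 0., 1., 1.], [0., 0., 1., 0.]]),
    efficiency=torch.rand(2, 4),
    complexity=torch.rand(2, 4))
print(group_relative_advantages(rewards))
```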