SenseTime open-sources the NEO multimodal model architecture, achieving deep unification of vision and language
| Model | LLM | # Data | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Modular Vision-Language Models (2B)* | | | | | | | | | |
| Qwen2-VL | Qwen2-1.5B | - / - / - | 41.1 | 74.9 | 49.5 | 48.0 | | - | 41.7 |
| InternVL2.5 | InternLM2.5-1.8B | >6B / 100M / 16M | 43.6 | 74.7 | 60.8 | 53.7 | - | 90.6 | 42.6 |
| Qwen2.5-VL | Qwen2.5-1.5B | - / - / - | 51.2 | 79.1 | 61.8 | 55.9 | - | - | 46.3 |
| InternVL3 | Qwen2.5-1.5B | >6B / 100M / 22M | 48.6 ... |