Multimodal Large Language Models
Tencent Makes a Big Move: Hunyuan Image 2.0 "Draws as You Speak", and the Image Is Ready by the Time You Finish Describing It
量子位· 2025-05-16 03:39
Core Viewpoint - Tencent has launched the Hunyuan Image 2.0 model, which generates images in real time with millisecond-level response, so an image takes shape while the user is still describing it by voice, text, or sketch [1][6].

Group 1: Features of Hunyuan Image 2.0
- The model supports a real-time drawing board: users sketch elements, add a text description, and get an image immediately [3][29].
- It accepts several input methods: text prompts, voice input in Chinese and English, and uploaded reference images that guide generation [18][19].
- Users can refine a generated image by adjusting parameters such as reference image strength, and a separate feature automatically enhances composition and depth [27][35].

Group 2: Technical Highlights
- Hunyuan Image 2.0 is a much larger model: its parameter count is an order of magnitude higher than that of its predecessor, Hunyuan DiT, which improves performance [37].
- The model uses a high-compression image codec developed by Tencent, which shortens the encoded sequence and speeds up generation while maintaining quality [38].
- It uses a multimodal large language model (MLLM) as the text encoder, which improves semantic understanding and text-image matching compared with traditional text encoders [39][40].
- The model has undergone reinforcement learning training to make generated images more realistic and closer to real-world requirements [41].
- Tencent has developed an in-house adversarial distillation scheme that produces high-quality images in fewer sampling steps [42].

Group 3: Future Developments
- Tencent's team says more details will follow in upcoming technical reports, including information about a native multimodal image generation model [43][45].
- The new model is expected to excel at multi-round image generation and real-time interactive experiences [46].
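GPT-4o Loses to Qwen and No Model Earns a Passing Grade! Joint Team from UC Berkeley, HKU and Others Proposes a New Multimodal Benchmark That Tests Multi-View Understanding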
量子位· 2025-05-14 06:07
Core Insights - The article introduces All-Angles Bench, a new benchmark for evaluating the multi-view understanding of multimodal large language models (MLLMs) [2][4].

Group 1: Overview of All-Angles Bench
- All-Angles Bench aims to assess the multi-view understanding of MLLMs comprehensively. It contains over 2,100 manually annotated multi-view question-answer pairs across 90 real-world scenarios [2][8].
- The benchmark includes six challenging tasks that probe a model's understanding of 3D scenes: Counting, Attribute Identification, Relative Distance, Relative Direction, Object Manipulation, and Camera Pose Estimation [8][9].

Group 2: Performance Evaluation
- A total of 27 leading MLLMs were benchmarked, including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o. The results show a significant gap between model performance and human-level understanding [4][14].
- On the Camera Pose Estimation task, human annotators reached 88.9% accuracy, while top models such as Gemini-2.0-Flash lagged behind by over 50% [16].

Group 3: Findings and Analysis
- Some open-source models, such as Ovis2-34B and Qwen2.5-VL-72B, outperformed closed-source models on direction-sensitive tasks, likely because of their stronger video understanding and visual localization capabilities [17].
- The analysis found inconsistencies in MLLM responses, particularly on relative-direction tasks, which points to difficulties with multi-view understanding [20][23].
- MLLMs struggled to integrate fragmented information across views and often miscounted objects that were only partially visible [24][31].

Group 4: Recommendations for Improvement
- The article suggests that prompt optimization alone is not enough to improve multi-view understanding; dedicated multi-view training is needed for substantial gains [32].
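As a rough illustration of how per-task results like those above are usually tallied, here is a minimal sketch that groups question-level outcomes by task and reports accuracy for each. The six task names come from the summary; the record format, the function name `per_task_accuracy`, and the toy answers are assumptions made for this example and are not taken from the benchmark's code or data.

```python
from collections import defaultdict

# Task names as listed in the summary above.
TASKS = (
    "Counting", "Attribute Identification", "Relative Distance",
    "Relative Direction", "Object Manipulation", "Camera Pose Estimation",
)


def per_task_accuracy(records):
    """Return {task: accuracy} for an iterable of (task, predicted, gold) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        if task not in TASKS:
            raise ValueError(f"unknown task: {task}")
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


# Made-up placeholder answers, NOT results from the benchmark.
toy_records = [
    ("Counting", "B", "B"),
    ("Counting", "A", "C"),
    ("Relative Direction", "A", "A"),
    ("Camera Pose Estimation", "C", "A"),
]
scores = per_task_accuracy(toy_records)
for task, accuracy in scores.items():
    print(f"{task}: {accuracy:.1%}")
```

Reporting accuracy per task rather than as a single average is what makes gaps such as the one on Camera Pose Estimation visible.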
Singapore Fintech Company Raises $10 Million After Launching a Financial Trading AI Agent That Watches the Market Around the Clock | 早起看早期
36氪· 2025-05-12 23:56
Core Viewpoint - RockFlow, a Singapore-based AI fintech company, has completed a $10 million A1 funding round to strengthen its AI technology and launch its financial AI agent, Bobby [3][4].

Group 1: Company Overview
- RockFlow operates five offices globally and serves over 30 countries in nine languages. It previously received tens of millions in investment from top-tier Silicon Valley funds [4].
- In April 2023 the company launched TradeGPT, which it describes as the world's first trading AI product. TradeGPT uses multimodal LLM capabilities to analyze large volumes of market information and price-volume data [4].

Group 2: Product Development
- RockFlow is developing an AI agent architecture tailored to financial investment scenarios, built on multimodal large language models (MLLMs), Fin-Tuning, RAG, Multi-Agent, and CoT [4][5].
- The AI agent aims to improve understanding and generation, process multi-source data efficiently, and deliver precise financial analysis and investment recommendations [4][5].

Group 3: Investment Process
- For investment trading, RockFlow's AI agent reduces the traditionally complex process to four core steps: real-time information acquisition, analysis, trading strategy construction, and order execution [5].
- The AI agent monitors market dynamics and analyzes extensive data, including financial metrics and social media sentiment, to surface personalized real-time trading opportunities [5][6].

Group 4: User Interaction
- Users state their needs in natural language, and the AI agent generates a personalized investment allocation and trading strategy based on their profit goals and risk preferences [6].
- The AI agent can also create complex conditional orders and automate investment tasks, helping users manage profits and losses [6].

Group 5: Future Outlook
- Bobby, the financial AI agent product, is set to launch globally soon. The team behind it includes experts in AI, financial mathematics, and investment trading [6].
Li Auto's MCAF Reshapes the Visual Cognition Paradigm for Assisted Driving
理想TOP2· 2025-04-25 12:43
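The following article is from AcademicDaily, written by AcademicDaily. AcademicDaily is a technical exchange platform that tracks, recommends, and interprets AI work such as large models, and is dedicated to sharing frontier technology.

Inside Li Auto, MCAF is known as the "third eye" of autonomous driving. It is compatible with Li Auto's in-house Mind GPT-3o and BEV large models and requires no retraining.

MCAF is a multimodal coarse-to-fine attention focusing framework. Its core target is the key bottleneck in long-video understanding. Current video understanding handles long videos (>5 minutes) poorly: mainstream methods (such as Video-MLLM) rely on global compression or uniform sampling, which loses detail and wastes computation. MCAF addresses this problem directly. Through multimodal hierarchical attention and a temporal expansion mechanism, it balances information retention against computational efficiency, and this is its core value.

On the Video-MME dataset, where videos average 60 minutes in length, MCAF outperforms other agent-based methods (such as VideoTree and DrVideo) by about 3-5 percentage points.

Unlike VideoTree and similar methods, which need an additional reward model to assess confidence, MCAF uses a single LLM to run the whole generate-evaluate-adjust loop. This simplifies the architecture (for example, the code implementation needs only one LLM interface) and avoids the compatibility problems of coordinating multiple models, which makes it better suited to real deployment.

However, on NEx ...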
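To make the generate-evaluate-adjust idea concrete, here is a minimal sketch of a coarse-to-fine loop driven by one model interface. It is not MCAF's actual algorithm, which the excerpt does not spell out. Everything here is an assumption made for illustration: the `SingleLLM` signature, the rule of halving the time window around the frame the model flags, the confidence threshold, and the deterministic `toy_llm` stand-in.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: ONE callable plays the single LLM. Given a question and
# the sampled frame indices, it returns (answer, confidence, most_relevant_frame).
SingleLLM = Callable[[str, List[int]], Tuple[str, float, int]]


def sample_frames(start: int, end: int, count: int) -> List[int]:
    """Pick up to `count` evenly spaced frame indices in [start, end)."""
    step = max((end - start) // count, 1)
    return list(range(start, end, step))[:count]


def coarse_to_fine(llm: SingleLLM, question: str, num_frames: int,
                   frames_per_round: int = 8, max_rounds: int = 4,
                   threshold: float = 0.8):
    """Answer from a coarse pass, then narrow the window while confidence is low."""
    start, end = 0, num_frames
    answer = ""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        frames = sample_frames(start, end, frames_per_round)
        answer, confidence, focus = llm(question, frames)  # generate + evaluate
        if confidence >= threshold:
            break
        # Adjust: halve the window and centre it on the frame the LLM flagged.
        half = max((end - start) // 4, 1)
        start, end = max(focus - half, 0), min(focus + half, num_frames)
    return answer, rounds, (start, end)


def toy_llm(question: str, frames: List[int]) -> Tuple[str, float, int]:
    """Stand-in for a real model: it is confident only near frame 730."""
    target = 730
    nearest = min(frames, key=lambda f: abs(f - target))
    confidence = 1.0 if abs(nearest - target) <= 25 else 0.3
    return f"event near frame {nearest}", confidence, nearest


answer, rounds, window = coarse_to_fine(toy_llm, "When does the car brake?", 1600)
print(answer, rounds, window)
```

Because the same callable both answers and scores its own answer, no separate reward model appears anywhere in the loop, which is the architectural point the article makes.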
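10x Throughput With No Performance Loss: A KV Cache Quantization Strategy for Multimodal Models Arrives, Plug-and-Play With No Changes to the Original Model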
量子位· 2025-04-03 02:12
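Contributed by the CalibQuant team
量子位 | WeChat official account QbitAI

A 10x throughput improvement on InternVL-2.5 with almost no loss in model performance. CalibQuant, the latest 1-bit KV cache quantization scheme for multimodal large models, has arrived. By combining post-scaling with calibration, it significantly reduces GPU memory and compute costs and can be used directly, with no changes to the original model.

Plug-and-play, seamless integration

Multimodal large language models have shown excellent performance across many applications. However, their computational overhead at deployment remains a key bottleneck.

The KV cache improves inference efficiency to some extent by trading GPU memory for computation, but as the KV cache grows, memory usage keeps rising and throughput becomes severely limited.

To address this challenge, the authors propose CalibQuant, a simple yet efficient quantization strategy for the visual KV cache that substantially reduces memory and compute overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme and applies post-scaling and calibration techniques designed around the intrinsic patterns of the visual KV cache, which keeps it efficient without sacrificing model performance.

Using Triton for runtime optimization, the authors achieved a 10x throughput improvement on the InternVL-2.5 model. The method is plug-and-play and can be seamlessly integrated into various existing multi ...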
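The excerpt does not give CalibQuant's exact formulas, so the following is only a generic sketch of what 1-bit cache quantization can look like: each channel of the cached keys keeps its minimum and its range, and every value is stored as a single bit. The second scoring function shows one plausible reading of "post-scaling" (an assumption, not the authors' stated method): fold the per-channel range into the query so that attention scores are computed straight from the bits, without rebuilding the keys. All function names are hypothetical, and the calibration step is not modelled.

```python
def quantize_1bit(keys):
    """Per-channel min/max quantization: one bit per cached value."""
    channels = list(zip(*keys))
    lows = [min(channel) for channel in channels]
    scales = [max(channel) - min(channel) for channel in channels]
    codes = [
        [1 if s > 0 and (v - lo) / s >= 0.5 else 0
         for v, lo, s in zip(key, lows, scales)]
        for key in keys
    ]
    return codes, lows, scales


def scores_dequantize_first(query, codes, lows, scales):
    """Baseline: rebuild every key from its bits, then take dot products."""
    out = []
    for code in codes:
        key_hat = [lo + bit * s for bit, lo, s in zip(code, lows, scales)]
        out.append(sum(q * k for q, k in zip(query, key_hat)))
    return out


def scores_post_scaled(query, codes, lows, scales):
    """Same scores, but the scale is applied to the query once, not to each key."""
    offset = sum(q * lo for q, lo in zip(query, lows))
    scaled_query = [q * s for q, s in zip(query, scales)]
    return [offset + sum(sq * bit for sq, bit in zip(scaled_query, code))
            for code in codes]


# Toy cache of three 2-channel keys; values are illustrative only.
keys = [[0.0, 2.0], [1.0, -2.0], [4.0, 0.5]]
query = [0.5, 1.0]
codes, lows, scales = quantize_1bit(keys)
baseline = scores_dequantize_first(query, codes, lows, scales)
post_scaled = scores_post_scaled(query, codes, lows, scales)
print(codes, baseline, post_scaled)
```

The two scoring paths agree because the dequantized key is `low + bit * scale` per channel, so its dot product with the query splits into a constant offset plus a dot product between the rescaled query and the raw bits.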
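A New Breakthrough in Long-Video Understanding! Hybrid Mamba Architecture Halves GPU Memory Consumption and Handles 100,000 Video Tokens With Ease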
量子位· 2025-03-27 04:16
Core Viewpoint - The article introduces Vamba, a hybrid Mamba-Transformer model designed for efficient long-video understanding. It significantly improves processing efficiency without compressing video tokens [1][10].

Group 1: Model Design and Efficiency
- Vamba makes video tokens cheaper to process during training and inference by redesigning the model architecture rather than compressing the tokens [1][4].
- On the same hardware, the model can process four times as many video frames as a traditional Transformer architecture, with training memory consumption reduced by more than 50% and training speed doubled [4][9].
- Vamba retains the original spatiotemporal features of the video and avoids the information loss caused by traditional downsampling or pooling [5][10].

Group 2: Technical Innovations
- The core design splits the costly causal self-attention operation into two more efficient components: cross-attention for text tokens and a Mamba-2 module, based on a state space model (SSM), for video tokens [6][7].
- The Mamba-2 module reduces computational complexity in the number of video tokens from quadratic to linear, which makes long video sequences tractable [7][9].
- The architecture aligns text and video information efficiently, improving the model's ability to analyze video content in response to user queries [9][10].

Group 3: Performance Evaluation
- Extensive experiments show that Vamba outperforms existing efficient long-video understanding models by approximately 4.3% on the LVBench benchmark [5][10].
- The model performs strongly on benchmarks across video durations, showing that it is competitive on long, medium, and short video understanding tasks [10].
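The quadratic-versus-linear claim can be made tangible with a back-of-the-envelope count. The sketch below compares the number of token-to-token interactions when every token attends to every other token against a split design in which only text tokens attend (to video and text) while video tokens take a single linear pass. The cost model is an assumption for illustration: it ignores the hidden size, the causal-mask factor, and all other layers, so it is not a prediction of Vamba's measured memory or speed. The 100,000 video tokens echo the headline; the 100 text tokens are an arbitrary example value.

```python
def full_attention_pairs(video_tokens: int, text_tokens: int) -> int:
    """Every token interacts with every token: quadratic in the total length."""
    total = video_tokens + text_tokens
    return total * total


def hybrid_pairs(video_tokens: int, text_tokens: int) -> int:
    """Text tokens attend to all tokens; video tokens get one linear pass."""
    text_side = text_tokens * (video_tokens + text_tokens)
    video_side = video_tokens  # one state update per video token
    return text_side + video_side


video_tokens, text_tokens = 100_000, 100
full = full_attention_pairs(video_tokens, text_tokens)
hybrid = hybrid_pairs(video_tokens, text_tokens)
print(f"full attention : {full:,} interactions")
print(f"hybrid split   : {hybrid:,} interactions")
print(f"ratio          : {full / hybrid:.0f}x")
```

The gap widens as the video gets longer, because the first count grows with the square of the video length while the second grows only in proportion to it.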