Multimodal Large Models
China-EU Cooperation in AI Holds Great Promise
Zheng Quan Shi Bao· 2025-08-28 23:05
Core Viewpoint
- Competition in AI between China and the EU is significant: China focuses on innovation and development while the EU emphasizes standards and regulation, creating potential collaboration opportunities despite the differing approaches [1][2].

Investment and Infrastructure
- The EU plans to invest €30 billion in AI infrastructure, including 13 regional AI factories and gigawatt-scale super data centers, but faces challenges such as insufficient energy supply and the need for unified fiscal policies to mobilize private capital [1].
- In contrast, China benefits from abundant renewable energy resources and government support, allowing it to advance its AI capabilities without energy supply constraints; it accounts for 15% of global computing power [2].

Collaboration Opportunities
- China and the EU could establish open-source whitelists and AI patent pools, create national AI laboratories, and collaborate on research institutions, enhancing cross-border cooperation while maintaining data privacy [3].
- Increased procurement of computing resources and supportive import/export tax policies could benefit both regions, allowing China to diversify its computing capabilities and the EU to reduce its reliance on the US [3].

Application Focus
- The EU is focusing on vertical applications in sectors like healthcare, climate, and agriculture due to infrastructure limitations, while China is rapidly advancing in AI technology and applications, becoming a leading market for AI [3].
- The EU's emphasis on quality and compliance in AI applications offers valuable lessons for China as it expands the boundaries of its AI industry [3].

Governance and Regulation
- The EU's AI Act is the first comprehensive AI regulation globally, aiming to establish a strong governance image while raising compliance costs for businesses [4].
- China is pursuing a flexible governance approach that combines technological sovereignty with ethical standards, and has launched the Global AI Innovation Governance Center to promote collaborative governance [4].

Potential for Cooperation
- There is a significant opportunity for China and the EU to collaborate on AI governance, particularly on risk classification and human oversight, where the two sides share an understanding of the underlying principles [5].
- Establishing a technical committee and a negotiation mechanism could facilitate cooperation and align regulatory standards between the two regions [6].
The Embodied Intelligence Heart (具身智能之心) Technology Exchange Group Has Been Established!
具身智能之心· 2025-08-28 08:36
Group 1
- The newly established Embodied Intelligence Heart Technology Exchange Group covers a range of advanced topics, including VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, target navigation, mapping and localization, and navigation [1]
- Interested individuals can add the assistant's WeChat (AIDriver005) to join the community [2]
- To expedite entry into the group, include a note with your institution/school, name, and research direction [3]
Autonomous Driving Heart (自动驾驶之心) Is Recruiting Business Partners! Model Deployment / VLA / End-to-End Directions
自动驾驶之心· 2025-08-28 08:17
Core Viewpoint
- The article announces the recruitment of business partners for the autonomous driving sector, highlighting the need for expertise in a range of advanced technologies and offering attractive incentives to candidates [2][3][5].

Group 1: Recruitment Details
- The company plans to recruit 10 outstanding partners for autonomous-driving course development, research-paper guidance, and hardware development [2].
- Candidates with expertise in large models, multimodal models, diffusion models, and other advanced technologies are particularly welcome [3].
- Preferred qualifications include a master's degree or higher from a university ranked in the QS top 200, with priority given to candidates with significant top-conference publications [4].

Group 2: Incentives and Opportunities
- The company offers resource sharing related to autonomous driving, including job recommendations, PhD application support, and study-abroad guidance [5].
- Attractive cash incentives are part of the compensation package [5].
- Opportunities to collaborate on entrepreneurial projects are also available [5].
Embodied Intelligence Heart (具身智能之心) Is Recruiting B-End and C-End Training Instructors
具身智能之心· 2025-08-28 01:20
Group 1
- The article announces the recruitment of instructors for embodied-intelligence training, targeting both B-end (business) and C-end (consumer) training services, with compensation above industry standards [1]
- The training covers advanced topics including VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, sim2real, multimodal large models, simulation, motion control, and target navigation [2]
- B-end training serves enterprises, universities, and research institutions, while C-end training focuses on students and job seekers; responsibilities include curriculum design and material preparation [3]

Group 2
- Candidates must hold a doctoral degree (current doctoral students included), with preference for those who have published two papers in A-level or Q1 journals/conferences or have two years of industry experience [3]
- Interested individuals can add the specified WeChat contact for further inquiries [4]
[Private Fund Research Notes] Jinglin Asset Management Conducts Research on Dahong Technology
Zheng Quan Zhi Xing· 2025-08-28 00:12
Group 1
- The core point of the news is that Jinglin Asset Management has conducted research on a listed company, focusing on its technological advancements and growth prospects [1]
- The company, Dahong Technology, achieved 50% year-on-year revenue growth in the second quarter, driven by its self-developed BlackEye multimodal model technology [1]
- Dahong Technology plans to establish 50 ultra-high-definition channels by 2025, targeting 650 million terminal devices [1]
- The upcoming release of BlackEye 2.0 on September 19 will enhance the company's video technology, enabling interactive and comprehensible video processing [1]
- The company is collaborating with military-civilian integration institutions to advance its remote control systems, which can operate in various network modes, including satellite links in offline environments [1]
- Dahong Technology is pursuing high-value-added transformation through standardization and AI-driven efficiency gains while continuing to invest in new technology directions [1]

Group 2
- Shanghai Jinglin Asset Management is a private fund management company registered with the Asset Management Association of China, investing primarily in the stocks of domestic and foreign listed companies [2]
- The company has a strong performance record: its Jinglin Stable Trust achieved a compound annual return of 26.84% as of April 30, 2015, significantly outperforming the CSI 300 Index [2]
- Jinglin Asset Management follows a value-investment philosophy, emphasizing fundamental analysis and valuing stocks based on industry structure and a company's position in the value chain [2]
- The firm has a specialized team of over 50 professionals with extensive industry experience, enabling a deeper understanding of market dynamics and investment opportunities [2]
- Jinglin Asset Management is recognized as one of China's top private equity investment institutions, consistently delivering substantial returns to investors [2]
To Keep AI from Gaming the Question Bank, the Latest Covers of Nature and Other Top Journals Have Been Turned into a Dataset Testing Models' Scientific Reasoning
36Ke· 2025-08-26 01:25
Core Insights
- The emergence of advanced multimodal models such as GPT-4o and Gemini 2.5 Pro has raised concerns about how AI capabilities are evaluated, as existing "question banks" become outdated [1][17]
- A new dynamic benchmark called MAC (Multimodal Academic Cover) has been proposed to continuously assess AI using the latest scientific content [1][20]

Group 1: Benchmark Development
- The MAC benchmark draws on the latest covers of 188 top journals, including Nature, Science, and Cell, building a test set from over 25,000 image-text pairs [3][20]
- The benchmark aims to evaluate whether multimodal models understand the deep connections between artistic visual elements and scientific concepts [3][20]

Group 2: Testing Methodology
- The MAC benchmark includes two tasks designed to prevent AI from relying on superficial visual features: selecting the corresponding text for a journal cover, and matching images to cover stories; a minimal sketch of this matching setup follows this summary [6][14]
- The design incorporates "semantic traps" so that only models with a genuine grasp of the scientific concepts can select the correct answers [6][14]

Group 3: Model Performance
- The top-performing model, Step-3, achieved an accuracy of only 79.1% on the MAC benchmark, a significant gap compared with its near-perfect performance on other benchmarks [4][16]
- The open-source model Qwen2.5-VL-7B reached just 56.8% accuracy, underscoring the limitations of current models when faced with the latest scientific content [4][16]

Group 4: Continuous Challenge Mechanism
- The MAC benchmark employs a dual dynamic mechanism to stay challenging: the data evolves with scientific knowledge, and problem construction uses advanced embedding models to build more sophisticated semantic traps [20][22][23]
- This approach keeps the benchmark relevant and challenging as both scientific knowledge and AI capabilities advance [20][22][23]

Group 5: Future Directions
- The research team plans to expand the MAC benchmark to more scientific journals and other forms of dynamic scientific content, such as conference papers and science news [23]
- The benchmark will be updated annually to keep pace with rapid advances in AI, ensuring it remains a relevant evaluation tool [23]
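To make the matching tasks concrete, here is a minimal sketch of the image-to-text variant described above. It is illustrative only: the `model.score` interface and the option handling are assumptions, not the MAC authors' code, and the real benchmark presumably prompts a multimodal LLM with all candidate texts at once.

```python
# Hedged sketch of MAC's image-to-text matching task (not the authors' code).
# Distractor stories ("semantic traps") are assumed to come from covers with
# similar embeddings, so surface-level visual matching is insufficient.
import random

def evaluate_image_to_text(model, cover_image, true_story, distractor_stories):
    """Return 1 if the model picks the true cover story, else 0.

    `model.score(image, text)` is a hypothetical interface returning a
    relevance score for an (image, text) pair.
    """
    candidates = [true_story] + distractor_stories
    random.shuffle(candidates)                      # avoid positional bias
    scores = [model.score(cover_image, text) for text in candidates]
    picked = candidates[scores.index(max(scores))]  # highest-scoring option
    return int(picked == true_story)
```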
It's 2025: How Far Have Multimodal Large Models for Unified Generation and Understanding Come?
自动驾驶之心· 2025-08-25 23:34
Core Viewpoint
- The article surveys the development of unified multimodal large models through mid-2025, focusing on image understanding and generation and highlighting the field's main advances and challenges [1][2].

Group 1: Overview of Multimodal Large Models
- Here, "unified multimodal large models" refers primarily to models that integrate image understanding and generation; other modalities, such as Omni-LLMs, are excluded because there are fewer academic papers in that area [3].
- Notable early works in this domain include Google's Unified-IO, Alibaba's OFA, and Fudan's AnyGPT, which have significantly influenced subsequent research [3].

Group 2: Key Research Directions
- Research on the "integrated generation and understanding" of multimodal large models centers on two questions: how to design the visual tokenizer, and how to build a suitable model architecture [14].
- ByteDance's TokenFlow model employs different visual encoders for understanding and generation, using high-level semantic features for understanding and low-level features for generation [16][17].

Group 3: Model Architectures and Techniques
- The Semantic-Priority Codebook (SPC) approach was introduced to improve image-reconstruction quality, highlighting the importance of semantic features in the quantization process [19][23].
- The QLIP model from UT Austin and Nvidia optimizes the visual tokenizer by aligning generation-oriented visual features with semantic information, using a single unified visual encoder for both tasks [28][30].

Group 4: Training Strategies
- QLIP is trained in two phases: the first learns semantically rich feature representations, and the second improves image-reconstruction quality [30][32].
- The UniTok model employs multi-codebook quantization to raise codebook utilization, integrating visual features for both understanding and generation [35][36].

Group 5: Recent Innovations
- The DualToken model uses a single visual encoder to extract features for both understanding and generation, with separate visual codebooks for semantic and pixel features (see the sketch after this summary) [39][41].
- Tencent's TokLIP model likewise adopts a single-encoder approach, focusing on aligning visual features with text features through a combination of loss functions [42][44].
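To illustrate the shared-encoder, dual-codebook idea behind models like DualToken, here is a minimal PyTorch sketch under stated assumptions: the dimensions, codebook sizes, and class names are invented for illustration, and real tokenizers add encoder/decoder networks, commitment losses, and task-specific heads.

```python
# Minimal sketch of a dual-codebook visual quantizer (illustrative only;
# sizes and names are assumptions, not any paper's released code).
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    """Quantizes one visual feature stream against two codebooks:
    a semantic codebook (for understanding) and a pixel codebook (for generation)."""

    def __init__(self, dim=256, num_codes=8192):
        super().__init__()
        self.semantic_codes = nn.Embedding(num_codes, dim)
        self.pixel_codes = nn.Embedding(num_codes, dim)

    @staticmethod
    def _nearest(z, codebook):
        # z: (N, dim); codebook.weight: (K, dim)
        dists = torch.cdist(z, codebook.weight)   # (N, K) pairwise L2 distances
        idx = dists.argmin(dim=-1)                # index of nearest code
        quantized = codebook(idx)
        # Straight-through estimator so gradients flow back to the encoder.
        return z + (quantized - z).detach(), idx

    def forward(self, features):
        z_sem, sem_idx = self._nearest(features, self.semantic_codes)
        z_pix, pix_idx = self._nearest(features, self.pixel_codes)
        return z_sem, z_pix, sem_idx, pix_idx
```

In this setup the semantic token indices would feed the understanding branch of the LLM while the pixel token indices drive the image decoder, which is the division of labor the article attributes to DualToken.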
To Keep AI from Gaming the Question Bank, the Latest Covers of Nature and Other Top Journals Have Been Turned into a Dataset Testing Models' Scientific Reasoning | Shanghai Jiao Tong University
量子位· 2025-08-25 15:47
Core Viewpoint
- The article describes the MAC (Multimodal Academic Cover) benchmark, which evaluates the true capabilities of advanced AI models such as GPT-4o and Gemini 2.5 Pro against the latest scientific content, addressing the problem of outdated "question banks" in AI assessment [1][5].

Group 1: Benchmark Development
- The MAC benchmark uses the latest covers of 188 top journals, including Nature, Science, and Cell, to build a test set from over 25,000 image-text pairs, ensuring models are evaluated on current and complex scientific concepts [3][4].
- The research team designed two testing tasks, "select the text for an image" and "select the image for a text," to assess a model's understanding of the deep connections between visual elements and scientific concepts [17][18].

Group 2: Testing Results
- Even top models such as Step-3 achieved only 79.1% accuracy on the latest scientific content, revealing significant limitations relative to their near-perfect results on other benchmarks [4][19].
- Models such as GPT-5-thinking and Gemini 2.5 Pro, while proficient at visual recognition, still struggle with deep reasoning tasks that require cross-modal scientific understanding [19].

Group 3: Dynamic Benchmarking Mechanism
- The MAC benchmark continuously updates its dataset and questions, maintaining the challenge level as scientific knowledge evolves [24][26].
- In a comparison experiment, all models performed worse on the latest data (MAC-2025) than on older data (MAC-Old), demonstrating that the natural evolution of scientific knowledge provides an ongoing challenge for AI models [26].

Group 4: DAD Methodology
- The DAD (Divide and Analyze) method structures the reasoning process into two phases, a detailed visual description followed by high-level analysis, simulating how a human expert thinks; a sketch of this two-step prompting follows this summary [21][22].
- This two-step approach significantly improved the accuracy of multiple models, showing the value of extended reasoning time in multimodal scientific-understanding tasks [22][23].

Group 5: Future Prospects
- The MAC benchmark is expected to grow into a more comprehensive evaluation platform, with plans to cover more scientific journals and dynamic scientific content such as conference papers and science news [28].
- As AI capabilities approach human levels, the MAC benchmark can serve as a "touchstone" for understanding the boundaries of AI capability and the path toward true intelligence [28].
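Here is a minimal sketch of two-step prompting in the spirit of DAD. The prompt wording and the `chat` helper are assumptions for illustration, not the paper's code; the point is only the describe-then-analyze split.

```python
# Illustrative two-phase prompting in the spirit of DAD (Divide and Analyze).
def dad_answer(chat, cover_image, question):
    """`chat(image, prompt)` is a hypothetical call to a multimodal LLM
    that returns its text response."""
    # Phase 1: exhaustive, low-level visual description of the cover.
    description = chat(
        cover_image,
        "Describe every visual element of this journal cover in detail, "
        "without interpreting its scientific meaning yet.",
    )
    # Phase 2: high-level scientific analysis grounded in that description.
    return chat(
        cover_image,
        f"Visual description: {description}\n"
        f"Using the description and the image, answer: {question}",
    )
```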
Geling Deep Vision 2025 Interim Report: Diversified Businesses Gain Momentum, Q2 Revenue Up 70% Year-on-Year
Huan Qiu Wang· 2025-08-25 11:59
Core Viewpoint
- Geling Deep Vision reported significant Q2 2025 revenue growth, up 70% compared with Q2 2024, while revenue for the first half of 2025 declined 17.22% year-on-year [1]

Group 1: Financial Performance
- In Q2 2025, Geling Deep Vision achieved revenue of 34.7981 million yuan, an increase of approximately 70% over the same period in 2024 [1]
- For the first half of 2025, the company's revenue totaled 42.4728 million yuan, a year-on-year decrease of 17.22% [1]

Group 2: Business Strategy and Development
- The company is diversifying its business structure and growth system under a "2+2" strategy covering smart finance and urban management, alongside innovation in government, special applications, and smart education [1]
- Geling Deep Vision continues to invest in R&D in key areas to maintain its technological lead, particularly in multimodal large models [1]

Group 3: Technological Advancements
- The company has a strong technical foundation in computer vision and continues to evolve its self-developed visual model Glint-MVT, whose latest version outperforms previous iterations on tasks such as OCR and segmentation [2]
- It has launched Super-Agent, an enterprise-grade platform for financial scenarios that integrates multiple modules to address challenges in risk control and marketing [2]

Group 4: Product Launches and Collaborations
- New products and upgrades span AIPC, industry large-model all-in-one machines, and smart education, including the "Mo Ren AIPC," which combines secure hardware with intelligent applications [3]
- Geling Deep Vision has established partnerships with leading technology companies such as Baidu and Feiteng to strengthen the domestic computing foundation and promote AI industry upgrades [3]
AI Roundup: Zhiyuan Robotics Launches the Genie Envisioner Robot World-Model Platform; Zhipu Releases the GLM-4.5V Visual Reasoning Model
China Post Securities· 2025-08-25 11:47
- The Genie Envisioner platform introduces a video-centric world-modeling paradigm that models robot-environment interactions directly in visual space, preserving spatial structure and temporal evolution. This approach improves cross-domain generalization and long-sequence task execution, achieving a 76% success rate on long-step tasks such as folding cardboard boxes, versus 48% for the π0 model [12][13][16]
- Genie Envisioner comprises three core components: GE-Base, a multi-view video world foundation model trained on 3,000 hours of real robot data; GE-Act, a lightweight 160M-parameter action decoder enabling real-time control; and GE-Sim, a hierarchical action-conditioned simulator for closed-loop policy evaluation and large-scale data generation [16][17][19]
- The GLM-4.5V visual reasoning model, with 106B total parameters and 12B active parameters, achieves state-of-the-art (SOTA) performance across 41 multimodal benchmarks spanning image, video, and document understanding as well as GUI-agent tasks. It incorporates 3D-RoPE and bicubic-interpolation mechanisms to improve 3D spatial-relationship perception and high-resolution adaptability [20][21][22]
- GLM-4.5V employs a three-stage training strategy: pretraining on large-scale multimodal corpora, supervised fine-tuning with "chain of thought" samples, and reinforcement learning with RLVR and RLHF techniques. This layered training yields superior document-processing capabilities and emergent abilities such as generating structured HTML/CSS/JavaScript code from screenshots or videos [23][24][26]
- VeOmni, a fully modular multimodal training framework, decouples model definition from distributed-parallelism logic, enabling flexible parallel strategies such as FSDP, HSDP+SP, and EP. It achieves 43.98% MFU for 64K-sequence training, supports sequence lengths up to 192K, and reduces engineering complexity, improving efficiency by over 90% [27][28][31]
- VeOmni introduces asynchronous sequence parallelism (Async-Ulysses) and COMET technology for MoE models, achieving linear scalability in training throughput for 30B-parameter models at sequence lengths up to 160K. It also integrates dynamic batching and FlashAttention to minimize memory waste and optimize operator-level recomputation [31][32][34]
- Skywork UniPic 2.0, a unified multimodal framework, integrates image understanding, text-to-image (T2I) generation, and image-to-image (I2I) editing in a single model. It employs a progressive dual-task reinforcement strategy (Flow-GRPO) that optimizes image editing and T2I tasks sequentially, achieving superior performance on benchmarks such as GenEval and GEdit-EN [35][38][39]
- UniPic 2.0 leverages Skywork-EditReward, an image-editing-specific reward model, to provide pixel-level quality scores. This design enables precise recognition of image elements and generation of corresponding textual descriptions, scoring 83.5 on MMBench, comparable to 19B-parameter models [38][42][43]
- FlowReasoner, a query-level meta-agent framework, dynamically generates a personalized multi-agent system for each query. It is trained with GRPO reinforcement learning and a multi-objective reward mechanism, reaching 92.15% accuracy on the MBPP dataset and outperforming baselines such as AFlow and LLM-Blender; a sketch of GRPO's group-relative advantage follows this list [63][64][68]
- FlowReasoner uses a three-stage training process: supervised fine-tuning with synthetic data, SFT fine-tuning for workflow generation, and RL with external feedback for capability enhancement. It demonstrates robust generalization, maintaining high accuracy even when the base worker model is replaced [66][68][69]
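Both Flow-GRPO (UniPic 2.0) and FlowReasoner's training build on GRPO, whose core idea is to normalize each sampled response's reward against the other responses in its group. The sketch below shows that generic group-relative advantage computation, not either project's implementation.

```python
# Sketch of GRPO's group-relative advantage: responses that beat their
# group's average reward get positive advantage, others negative.
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps), over one group of responses
    sampled for the same query; eps guards against a zero-variance group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: four responses sampled for one query.
print(grpo_advantages([0.2, 0.9, 0.5, 0.4]))
```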