Multimodal Large Models
具身智能之心 is recruiting B-end and C-end training instructors
具身智能之心· 2025-08-28 01:20
Group 1
- The article announces the recruitment of instructors for embodied intelligence training, covering both B-end (business) and C-end (consumer) services, with compensation above industry standards [1]
- The training spans advanced topics including VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, sim2real, multimodal large models, simulation, motion control, and target navigation [2]
- B-end training targets enterprises, universities, and research institutions, while C-end training targets students and job seekers; responsibilities include curriculum design and material preparation [3]
Group 2
- Candidates must hold a doctoral degree or higher (current doctoral students included), with preference for those who have published two papers in A-level or Q1 journals/conferences or have two years of industry experience [3]
- Interested candidates can add the specified WeChat contact for further inquiries [4]
[Private Fund Research Notes] Jinglin Asset Management conducts research on Danghong Technology
Zheng Quan Zhi Xing· 2025-08-28 00:12
Group 1
- The core viewpoint is that Jinglin Asset Management has conducted research on a listed company, focusing on its technological advancements and growth prospects [1]
- The company, Danghong Technology, achieved 50% year-on-year revenue growth in the second quarter, driven by its self-developed BlackEye multimodal model technology [1]
- Danghong Technology plans to establish 50 ultra-high-definition channels by 2025, targeting 650 million terminal devices [1]
- The upcoming release of BlackEye 2.0 on September 19 will enhance the company's video technology, enabling interactive and comprehensible video processing [1]
- The company is collaborating with military-civilian integration institutions to advance its remote control systems, which can operate across various network modes, including satellite links in offline environments [1]
- Danghong Technology is pursuing high value-added transformation through standardization and AI-driven efficiency gains while continuing to invest in new technology directions [1]
Group 2
- Shanghai Jinglin Asset Management is a private equity fund management company registered with the Asset Management Association of China, investing primarily in domestic and foreign listed company stocks [2]
- The firm has a strong performance record: its Jinglin Stable Trust achieved a compound annual return of 26.84% as of April 30, 2015, significantly outperforming the CSI 300 Index [2]
- Jinglin Asset Management follows a value investment philosophy, emphasizing fundamental analysis and stock valuation based on industry structure and each company's position in the value chain [2]
- The firm employs a specialized team of over 50 professionals with extensive industry experience, enabling a deeper understanding of market dynamics and investment opportunities [2]
- Jinglin Asset Management has been recognized as one of China's top private equity investment institutions, consistently delivering substantial returns to investors [2]
To keep AI from cramming the test, the latest covers of Nature and other top journals were made into a dataset to probe models' scientific reasoning
36Kr· 2025-08-26 01:25
Core Insights
- The emergence of advanced multimodal models like GPT-4o and Gemini 2.5 Pro has raised concerns about AI evaluation as existing "question banks" become outdated [1][17]
- A new dynamic benchmark called MAC (Multimodal Academic Cover) has been proposed to continuously assess AI using the latest scientific content [1][20]
Group 1: Benchmark Development
- The MAC benchmark uses the latest covers from 188 top journals, including Nature, Science, and Cell, to build a testing dataset from over 25,000 image-text pairs [3][20]
- The benchmark evaluates whether multimodal models understand the deep connections between artistic visual elements and scientific concepts [3][20]
Group 2: Testing Methodology
- The MAC benchmark includes two testing tasks designed to keep AI from relying on superficial visual features: selecting corresponding texts for journal covers and matching images to cover stories [6][14]
- The design incorporates "semantic traps" so that only models with a genuine grasp of the scientific concepts can select the correct answer (see the sketch after this list) [6][14]
Group 3: Model Performance
- The top-performing model, Step-3, achieved an accuracy of only 79.1% on the MAC benchmark, a significant gap from its near-perfect performance on other benchmarks [4][16]
- The open-source model Qwen2.5-VL-7B scored just 56.8%, highlighting the limitations of current AI models when faced with the latest scientific content [4][16]
Group 4: Continuous Challenge Mechanism
- The MAC benchmark employs a dual dynamic mechanism: the data evolves with scientific knowledge, and problem construction uses advanced embedding models to create increasingly sophisticated semantic traps [20][22][23]
- This approach keeps the benchmark relevant and challenging as both scientific knowledge and AI capabilities advance [20][22][23]
Group 5: Future Directions
- The research team plans to expand the MAC benchmark to more scientific journals and other forms of dynamic scientific content, such as conference papers and science news [23]
- The benchmark will be updated annually to keep pace with rapid advances in AI, ensuring it remains a relevant tool for evaluating AI capabilities [23]
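To make the "semantic trap" idea concrete, below is a minimal sketch of embedding-based hard-negative mining: distractor captions are chosen to be maximally similar to the correct one, so keyword overlap alone cannot single out the answer. The embedding model name, candidate pool, and captions are illustrative assumptions, not details taken from the MAC paper.

```python
# Illustrative sketch of embedding-based "semantic trap" construction.
# Model name and captions are hypothetical placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_trap_options(correct_caption: str, candidate_pool: list[str],
                       n_traps: int = 3) -> list[str]:
    """Pick the pool captions most semantically similar to the correct one,
    so superficial keyword matching cannot identify the answer."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([correct_caption] + candidate_pool,
                       normalize_embeddings=True)
    target, pool = emb[0], emb[1:]
    similarity = pool @ target                 # cosine, since normalized
    hardest = np.argsort(-similarity)[:n_traps]
    return [candidate_pool[i] for i in hardest]

# Hypothetical usage: the correct caption plus hard distractors form one item.
options = build_trap_options(
    "Cryo-EM structure of a stalled ribosome complex",
    ["A ribosome translating mRNA under normal conditions",
     "Satellite imagery tracking Arctic sea-ice loss",
     "Gut microbiome diversity across human populations"])
```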
It's 2025: how far have multimodal large models for generation and understanding come?
自动驾驶之心· 2025-08-25 23:34
Core Viewpoint
- The article reviews the development of unified multimodal large models, focusing on image understanding and generation, through mid-2025, and highlights the field's major advances and open challenges [1][2]
Group 1: Overview of Multimodal Large Models
- Here, "unified multimodal large models" refers primarily to models that integrate image understanding and generation; other modalities, such as those handled by Omni-LLMs, are excluded because fewer academic papers cover them [3]
- Notable early works in this area include Google's Unified-IO, Alibaba's OFA, and Fudan's AnyGPT, all of which significantly influenced subsequent research [3]
Group 2: Key Research Directions
- Research on unifying generation and understanding centers on two questions: how to design the visual tokenizer, and how to construct a suitable model architecture [14]
- ByteDance's TokenFlow model employs separate visual encoders for understanding and generation, using high-level semantic features for understanding and low-level features for generation (a minimal quantization sketch follows this list) [16][17]
Group 3: Model Architectures and Techniques
- The Semantic-Priority Codebook (SPC) approach was introduced to improve image reconstruction quality, underscoring the importance of semantic features in the quantization process [19][23]
- The QLIP model from UT Austin and Nvidia optimizes the visual tokenizer by aligning generation-oriented visual features with semantic information, using a single unified visual encoder for both tasks [28][30]
Group 4: Training Strategies
- QLIP trains in two phases: the first learns semantically rich feature representations, and the second improves image reconstruction quality [30][32]
- The UniTok model uses multi-codebook quantization to raise codebook utilization, integrating visual features for both understanding and generation [35][36]
Group 5: Recent Innovations
- The DualToken model extracts features for both understanding and generation with a single visual encoder, routing semantic and pixel-level features to different visual codebooks [39][41]
- Tencent's TokLIP model likewise adopts a single-encoder approach, aligning visual features with text features through a combination of loss functions [42][44]
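The recurring building block in these tokenizers is vector quantization: continuous encoder features are snapped to the nearest entry in a learned codebook, yielding discrete visual tokens. Below is a minimal PyTorch sketch of that lookup, with a dual-codebook routing step added purely for illustration; it is a simplification under stated assumptions, not the actual TokenFlow or DualToken implementation.

```python
# Minimal vector-quantization sketch; codebook sizes are illustrative.
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Map each feature vector to its nearest codebook entry.

    features: (N, D) encoder outputs; codebook: (K, D) learned entries.
    Returns discrete token ids (N,) and quantized vectors (N, D).
    """
    distances = torch.cdist(features, codebook)   # (N, K) L2 distances
    token_ids = distances.argmin(dim=1)           # nearest entry per vector
    quantized = codebook[token_ids]
    # Straight-through estimator so gradients flow back to the encoder.
    quantized = features + (quantized - features).detach()
    return token_ids, quantized

# Hypothetical dual-codebook routing: semantic tokens for understanding,
# pixel-level tokens for generation.
feats = torch.randn(16, 256)
semantic_ids, _ = quantize(feats, torch.randn(8192, 256))
pixel_ids, _ = quantize(feats, torch.randn(16384, 256))
```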
To keep AI from cramming the test, the latest covers of Nature and other top journals were made into a dataset to probe models' scientific reasoning | Shanghai Jiao Tong University
量子位· 2025-08-25 15:47
Core Viewpoint
- The article covers the development of the MAC (Multimodal Academic Cover) benchmark, which evaluates the true capabilities of advanced AI models like GPT-4o and Gemini 2.5 Pro on the latest scientific content, addressing the problem of outdated "question banks" in AI assessment [1][5]
Group 1: Benchmark Development
- The MAC benchmark draws on the latest covers from 188 top journals, including Nature, Science, and Cell, to build a testing dataset from over 25,000 image-text pairs, ensuring models are evaluated on the most current and complex scientific concepts [3][4]
- The research team designed two testing tasks, "selecting text from images" and "selecting images from text," to probe the AI's understanding of the deep connections between visual elements and scientific concepts [17][18]
Group 2: Testing Results
- Even top models like Step-3 achieved only 79.1% accuracy when faced with the latest scientific content, revealing significant limitations relative to their near-perfect results on other benchmarks [4][19]
- Models such as GPT-5-thinking and Gemini 2.5 Pro, though proficient at visual recognition, still struggle with deep reasoning tasks that require cross-modal scientific understanding [19]
Group 3: Dynamic Benchmarking Mechanism
- The MAC benchmark continuously updates both its dataset and its questions, maintaining the challenge level as scientific knowledge evolves [24][26]
- In a comparison experiment, every model performed worse on the latest data (MAC-2025) than on older data (MAC-Old), demonstrating that the natural evolution of scientific knowledge provides an ongoing challenge for AI models [26]
Group 4: DAD Methodology
- The DAD (Divide and Analyze) method improves AI performance by splitting the reasoning process into two phases, a detailed visual description followed by high-level analysis, simulating how human experts think (a sketch of this flow follows the list) [21][22]
- This two-step approach significantly improved the accuracy of multiple models, showing the value of extending reasoning time in multimodal scientific understanding tasks [22][23]
Group 5: Future Prospects
- The MAC benchmark is expected to grow into a more comprehensive evaluation platform, with plans to include more scientific journals and dynamic scientific content such as conference papers and science news [28]
- As AI capabilities approach human levels, the MAC benchmark can serve as a "touchstone" for understanding the boundaries of AI capability and the path toward true intelligence [28]
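As a concrete illustration of the two-phase DAD flow, here is a minimal sketch. `query_vlm` is a hypothetical stand-in for whatever multimodal chat API is in use, and the prompts paraphrase the idea rather than quote the paper.

```python
# Hedged sketch of the two-phase DAD ("Divide and Analyze") prompting flow.
def query_vlm(image_path: str, prompt: str) -> str:
    # Hypothetical placeholder: plug in your multimodal model client here.
    raise NotImplementedError

def dad_answer(image_path: str, question: str, options: list[str]) -> str:
    # Phase 1: exhaustive low-level description of the cover art.
    description = query_vlm(
        image_path,
        "Describe every visual element of this journal cover in detail: "
        "objects, composition, colors, text, and visual metaphors.")
    # Phase 2: high-level scientific reasoning over that description.
    choice_list = "\n".join(f"{i}. {o}" for i, o in enumerate(options, 1))
    return query_vlm(
        image_path,
        f"Cover description:\n{description}\n\n"
        f"Question: {question}\nOptions:\n{choice_list}\n"
        "Reason about which scientific story the visual metaphors encode, "
        "then answer with the option number.")
```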
DeepGlint (格灵深瞳) 2025 semi-annual report: diversified businesses gain momentum, Q2 revenue up 70% year-on-year
Huan Qiu Wang· 2025-08-25 11:59
On the product side, DeepGlint builds on its self-developed vision models and AI Infra, focusing on its two strategic tracks of smart finance and city management plus two innovation areas, government/special applications and smart education. Working with industry partners to explore end-to-end deployment, it has launched platform-level products and a range of vertical AI agent applications that together cover a wide span of scenarios.

Technical breakthroughs are translating quickly into product competitiveness. DeepGlint has launched an enterprise-grade product for financial scenarios, the Super-Agent financial super-assistant platform. By integrating modules such as a knowledge base, MCP services, and large-model APIs, the platform can assemble multiple agents equipped with financial-expert knowledge, easing the engineering pain points of applying large models to core businesses such as risk control and marketing, where rollout is difficult and costly. The platform is now being piloted at several banks to explore end-to-end deployment in more vertical scenarios.

Beyond this, DeepGlint has released new or upgraded products in AI PCs, industry large-model all-in-one machines, and smart education. For example, its 墨刃AIPC, built on an autonomous and controllable Xinchuang (信创) foundation, brings security…

Against the backdrop of the country's intensifying "AI+" initiative, ecosystem collaboration among AI companies is becoming a key engine of industry growth. DeepGlint has already partnered with leading technology companies such as Baidu, Phytium (飞腾), Hygon (海光), and 光环云, further consolidating the domestic computing-power foundation and strengthening supply-chain resilience, and is pooling upstream and downstream ecosystem partners to drive the upgrading and iteration of the AI industry. (旺旺)
AI Roundup: AgiBot (智元) launches the robot world model platform Genie Envisioner; Zhipu releases the GLM-4.5V visual reasoning model
China Post Securities· 2025-08-25 11:47
- The Genie Envisioner platform introduces a video-centric world modeling paradigm, directly modeling robot-environment interactions in visual space, which preserves spatial structure and temporal evolution information. This approach improves cross-domain generalization and long-sequence task execution, achieving a 76% success rate on long-horizon tasks like folding cardboard boxes versus 48% for the π0 model [12][13][16]
- Genie Envisioner comprises three core components: GE-Base, a multi-view video world foundation model trained on 3,000 hours of real robot data; GE-Act, a lightweight 160M-parameter action decoder enabling real-time control; and GE-Sim, a hierarchical action-conditioned simulator for closed-loop strategy evaluation and large-scale data generation [16][17][19]
- The GLM-4.5V visual reasoning model, with 106B total parameters and 12B activated parameters, achieves state-of-the-art (SOTA) performance across 41 multimodal benchmarks spanning image, video, and document understanding and GUI agent tasks. It incorporates 3D-RoPE and a bicubic interpolation mechanism to strengthen 3D spatial relationship perception and high-resolution adaptability (a sketch of the interpolation step follows this list) [20][21][22]
- GLM-4.5V uses a three-stage training strategy: pretraining on large-scale multimodal corpora, supervised fine-tuning on "chain of thought" samples, and reinforcement learning with RLVR and RLHF techniques. This layered training yields superior document processing and emergent abilities such as generating structured HTML/CSS/JavaScript code from screenshots or videos [23][24][26]
- VeOmni, a fully modular multimodal training framework, decouples model definition from distributed parallel logic, enabling flexible parallel strategies such as FSDP, HSDP+SP, and EP. It achieves 43.98% MFU for 64K-sequence training, supports sequence lengths up to 192K, and reduces engineering complexity while improving efficiency by over 90% [27][28][31]
- VeOmni introduces asynchronous sequence parallelism (Async-Ulysses) and COMET technology for MoE models, achieving linear scaling of training throughput for 30B-parameter models at 160K sequence lengths. It also integrates dynamic batching and FlashAttention to minimize memory waste and optimize operator-level recomputation [31][32][34]
- Skywork UniPic 2.0, a unified multimodal framework, integrates image understanding, text-to-image (T2I) generation, and image-to-image (I2I) editing in a single model. It uses a progressive dual-task reinforcement strategy (Flow-GRPO) to optimize image editing and T2I tasks sequentially, delivering superior performance on benchmarks like GenEval and GEdit-EN [35][38][39]
- UniPic 2.0 leverages Skywork-EditReward, an image-editing-specific reward model, to provide pixel-level quality scores. This design enables precise recognition of image elements and generation of matching textual descriptions, scoring 83.5 points on MMBench, comparable to 19B-parameter models [38][42][43]
- FlowReasoner, a query-level meta-agent framework, dynamically generates a personalized multi-agent system for each query. It employs GRPO reinforcement learning with multi-objective reward mechanisms, achieving 92.15% accuracy on the MBPP dataset and outperforming baselines such as Aflow and LLM-Blender [63][64][68]
- FlowReasoner is trained in three stages: supervised fine-tuning on synthetic data, SFT fine-tuning for workflow generation, and RL with external feedback for capability enhancement. It demonstrates robust generalization, maintaining high accuracy even when the base worker model is replaced [66][68][69]
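The general mechanism behind bicubic high-resolution adaptation in vision transformers is resizing the learned patch position embeddings to match a finer input grid. Below is a minimal PyTorch sketch of that step; the grid sizes and embedding width are illustrative assumptions, not GLM-4.5V's actual configuration.

```python
# Sketch of resizing ViT position embeddings via bicubic interpolation.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int):
    """pos_embed: (1, old_grid*old_grid, D) learned patch position embeddings."""
    d = pos_embed.shape[-1]
    # Reshape the token sequence back into a 2D grid of embeddings.
    grid = pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

# e.g. adapt embeddings trained at a 24x24 patch grid to a 48x48 input grid
resized = resize_pos_embed(torch.randn(1, 24 * 24, 1024), 24, 48)
```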
What are the entry points for moving from autonomous driving to embodied intelligence?
自动驾驶之心· 2025-08-24 23:32
Core Viewpoint
- The article discusses the transition from autonomous driving to embodied intelligence, highlighting the similarities and differences in algorithms and tasks between the two fields [1]
Group 1: Algorithm and Task Comparison
- Embodied intelligence largely carries over the algorithms used in robotics and autonomous driving, including training and fine-tuning methods as well as large models [1]
- Notable differences remain in specific tasks, including data collection methods and a greater emphasis on execution hardware and mechanical structure [1]
Group 2: Community and Learning Resources
- A full-stack learning community named "Embodied Intelligence Heart" (具身智能之心) has been established to share knowledge on algorithms, data collection, and hardware solutions in embodied intelligence [1]
- The community's key focus areas include VLA, VLN, Diffusion Policy, reinforcement learning, robotic arm grasping, pose estimation, robot simulation, multimodal large models, chip deployment, sim2real, and robot hardware structure [1]
A quick read of Danghong Technology's (688039) 2025 interim report: revenue up, losses narrowing, profitability improving
Zheng Quan Zhi Xing· 2025-08-23 22:58
Core Viewpoint
- Danghong Technology's (688039) latest financial report shows a positive trend in revenue and profit margins, indicating improved operational efficiency and potential growth across several business segments [1]
Financial Performance
- Total revenue for the first half of 2025 reached 133 million yuan, up 12.7% year-on-year [1]
- Net profit attributable to shareholders was -6.15 million yuan, an improvement of 85.27% over the prior year [1]
- In Q2 2025, total revenue was 83.9 million yuan, up 50.44% year-on-year, with net profit of 5.74 million yuan, an increase of 130.65% [1]
- Gross margin improved to 42.21%, a year-on-year increase of 26.44%, while net margin improved to -7.17%, up 81.59% [1]
- Selling, administrative, and financial expenses totaled 35.65 million yuan, or 26.81% of revenue, a slight year-on-year increase of 4.76% [1]
Key Financial Metrics
- Earnings per share improved to -0.05 yuan, up 86.49% year-on-year [1]
- Operating cash flow per share was 0.0 yuan, a 100.53% year-on-year increase [1]
- Cash and cash equivalents decreased 40.64% to 86.93 million yuan due to operating expenditures [1]
Business Segments
- AI products and multimodal large model derivatives have found rapid market application, particularly boosting the media culture business and the in-vehicle intelligent cockpit business [8]
- The smart connected vehicle business is expected to grow significantly as demand rises for in-cabin multimodal interaction and intelligent entertainment cockpit experiences [12]
- The industrial and satellite business focuses on intelligent video analysis and data mining, strengthening capabilities in high-precision inspection and real-time satellite remote sensing [13]
- The media culture business is evolving from hardware supplier to comprehensive intelligent video ecosystem service provider, capitalizing on opportunities in the ultra-high-definition video industry [13]
Recommending a "private kitchen" for large-model AI!
自动驾驶之心· 2025-08-23 16:03
Group 1
- The article highlights the growing interest in large model technologies, particularly RAG (Retrieval-Augmented Generation), AI Agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and deployment and inference optimization [1]
- A community named "Large Model Heart Tech" is being established around these technologies, with the goal of becoming the largest large-model technology community in China [1]
- The community is also building a knowledge platform to provide industry and academic information and to cultivate talent in the large model field [1]
Group 2
- The article describes the community as a serious, content-driven platform aimed at nurturing future leaders [2]