Multimodal Large Models
What this autonomous driving community plans to do in 2026...
自动驾驶之心· 2026-01-02 08:08
Core Viewpoint
- The article emphasizes the establishment of a comprehensive community for autonomous driving, aiming to provide a platform for knowledge sharing, technical discussions, and career opportunities in the field [4][17].

Group 1: Community Development
- The "Autonomous Driving Heart Knowledge Planet" has been created to address the high trial-and-error costs for newcomers in the autonomous driving industry, offering a structured learning environment [4][5].
- The community has grown to over 4,000 members and aims to expand to nearly 10,000 within two years, focusing on both academic and industrial needs [5][18].
- Various activities such as face-to-face meetings, expert interviews, and industry research will continue to be organized to meet the diverse needs of members [4][5][18].

Group 2: Learning Resources
- The community has compiled over 40 technical learning paths, covering topics from entry-level to advanced autonomous driving technologies [7][18].
- Members have access to exclusive video tutorials and documents that facilitate learning in areas such as perception fusion, SLAM, and decision-making [11][18].
- A comprehensive list of open-source projects and datasets related to autonomous driving has been made available to assist members in their research and projects [35][37].

Group 3: Industry Insights
- The community plans to conduct industry research focusing on the scaling of autonomous driving technologies, particularly in the L4 domain, which is expected to regain attention in the coming year [4][18].
- Regular discussions with industry experts will provide insights into the latest trends, challenges, and opportunities in the autonomous driving sector [7][18].
- The community aims to connect members with job opportunities at leading companies in the autonomous driving industry, facilitating career advancement [11][20].
Redefining temporal grounding for large video models! Nanjing University and Tencent jointly propose TimeLens, upgrading both data and algorithms
机器之心· 2026-01-02 01:55
Core Insights
- The rapid development of multimodal large language models (MLLMs) has improved video understanding, but a significant limitation remains in accurately determining "when" events occur in videos, a task known as Video Temporal Grounding (VTG) [2].
- The research team from Nanjing University, Tencent ARC Lab, and Shanghai AI Lab introduced TimeLens, which addresses the shortcomings of existing evaluation benchmarks and provides a reliable assessment framework along with high-quality training data [2][29].

Data Quality Issues
- Existing VTG benchmarks such as Charades-STA, ActivityNet Captions, and QVHighlights contain numerous annotation errors, including vague descriptions and incorrect time-boundary markings [7].
- The high error rate in these benchmarks leads to unreliable evaluation results that overestimate the capabilities of open-source models [11].

TimeLens-Bench
- To rectify the issues in existing datasets, the team created TimeLens-Bench, a high-quality evaluation benchmark that accurately reflects models' temporal grounding capabilities [11].
- Comparisons between TimeLens-Bench and the original benchmarks revealed that previous evaluations significantly overestimated open-source models while obscuring the true performance of proprietary models [11].

High-Quality Training Data: TimeLens-100K
- The team developed TimeLens-100K, a large-scale, high-quality training dataset built through an automated cleaning and re-labeling process, which has been shown to significantly enhance model performance [13].

Algorithm Design Best Practices
- TimeLens conducted extensive ablation studies to derive effective algorithm design practices for VTG tasks, focusing on timestamp encoding and training paradigms [15].
- The optimal timestamp encoding method identified is the Interleaved Textual Encoding strategy, which simplifies implementation while achieving superior results (a minimal sketch follows this summary) [17].
- The Thinking-free RLVR training paradigm was found to be the most efficient, allowing models to output localization results directly without requiring complex reasoning traces [19][21].

Key Training Techniques
- Early stopping is crucial in RL training, as continuing beyond a plateau in the reward metric can degrade model performance [23].
- Difficulty-based sampling, which selects challenging training samples, is essential for maximizing model performance during RLVR training [23].

Performance Validation
- The TimeLens-8B model demonstrated exceptional performance, surpassing open-source models such as Qwen3-VL and outperforming proprietary models such as GPT-5 and Gemini-2.5-Flash across multiple core metrics [27][28].
- This result underscores the potential of smaller open-source models to compete with larger proprietary models through systematic improvements in data quality and algorithm design [28].

Contributions and Future Directions
- TimeLens not only establishes a new SOTA open-source model but also provides valuable methodologies and design blueprints for future research in video temporal grounding [29].
- The code, models, training data, and evaluation benchmarks for TimeLens have been open-sourced to facilitate further advances in VTG research [30].
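To make the interleaved textual encoding concrete, here is a minimal sketch of the general idea: frame features are interleaved with plain-text timestamps in the prompt, so the model reads time as ordinary text rather than through a dedicated temporal embedding. The function name and the `<frame_i>` placeholder format are illustrative assumptions, not TimeLens's actual implementation.

```python
# Illustrative sketch of interleaved textual timestamp encoding for VTG.
# Assumption: frames are sampled at known times; the exact token format
# used by TimeLens may differ.

def build_interleaved_prompt(frame_times_s, query):
    """Interleave textual timestamps with frame placeholders.

    frame_times_s: sample times (in seconds) of the visual frames.
    query: the natural-language event description to localize.
    """
    parts = []
    for i, t in enumerate(frame_times_s):
        parts.append(f"Time {t:.1f}s:")   # timestamp rendered as plain text
        parts.append(f"<frame_{i}>")      # placeholder for visual features
    parts.append(
        f'Question: During which time span does "{query}" occur? '
        "Answer with start and end times in seconds."
    )
    return "\n".join(parts)

if __name__ == "__main__":
    # Frames sampled every 2 seconds from a 10-second clip.
    prompt = build_interleaved_prompt([0.0, 2.0, 4.0, 6.0, 8.0],
                                      "a person opens the door")
    print(prompt)
```

Because the timestamps are ordinary text tokens, this style needs no architectural changes, which is consistent with the summary's claim that it simplifies implementation.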
SenseTime's Kapi camera tops the "Photo & Video" chart on China's App Store, after leading in popularity across multiple overseas markets
Xin Lang Cai Jing· 2025-12-31 16:01
Core Insights
- SenseTime's Kapi camera topped the Apple App Store's "Photo & Video" category in China as of December 31, showcasing its strong market performance and innovative features [2][7].
- Launched on December 20, the Kapi camera is positioned as the first true "AI photography assistant," breaking traditional app limitations with advanced features such as scene recognition and filter recommendations [2][3][8].
- Kapi's success signals a significant shift in consumer-level AI applications, from experimental technology to mainstream usage, strengthening confidence in AI commercialization [5][10].

Product Performance
- Kapi camera reached the number-one spot in the Philippines App Store free-app category on December 8 and has maintained a strong presence in the "Photo & Video" category across multiple countries, including the UK, Italy, France, and Germany [3][8].
- The app's professional image-processing workflow replicates the Apple Log curve, allowing users to capture images with cinematic quality and rich detail without post-processing (a short illustrative sketch follows this summary) [3][8].

Technological Advancements
- SenseTime's SenseNova V6.5 Pro ranked first in the domestic evaluation by SuperCLUE, scoring 75.35 and achieving the highest visual-reasoning score among domestic models [4][9].
- The report notes that SenseTime's models are approaching the average level of leading models on basic cognitive dimensions, with SenseNova V6.5 exceeding the average in visual reasoning, indicating a competitive edge in the industry [9][10].
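For background on what a log curve buys you: log encoding compresses the sensor's linear light values so that highlight and shadow detail survive quantization and can be re-graded later. The sketch below uses generic placeholder constants; it is not Apple's published Apple Log transfer function, only an illustration of the curve family Kapi is reported to replicate.

```python
import math

# Generic log-style camera transfer function. The constants A, B, C, D
# are illustrative placeholders, NOT Apple's published Apple Log values.
A, B, C, D = 0.25, 10.0, 1.0, 0.0

def log_encode(linear):
    """Map linear scene light (0..1+) to a compressed log-domain signal."""
    return A * math.log10(B * linear + C) + D

def log_decode(encoded):
    """Invert the encode to recover linear light for later grading."""
    return (10 ** ((encoded - D) / A) - C) / B

if __name__ == "__main__":
    for x in (0.01, 0.18, 1.0, 4.0):  # shadows, mid-gray, white, highlight
        y = log_encode(x)
        print(f"linear {x:5.2f} -> log {y:5.3f} -> back {log_decode(y):5.2f}")
```

The key property is visible in the output: a 400x range of linear light collapses into a much narrower encoded range, which is what preserves detail in a limited-bit-depth file.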
Xingyuan Intelligent and Zhenghe Industrial reach strategic cooperation, focusing on seven directions to build a comprehensive collaborative innovation system
IPO早知道· 2025-12-31 05:26
Core Viewpoint
- Qingdao Zhenghe Industrial Co., Ltd. and Beijing Xingyuan Intelligent Robot Technology Co., Ltd. have signed a strategic cooperation agreement to leverage their strengths in smart hardware and multimodal large model technology for the development of intelligent robots [2][6].

Group 1: Strategic Cooperation Focus
- The collaboration will focus on seven core areas to establish a comprehensive collaborative innovation system [3].
- Both parties will develop a dexterous hand with autonomous perception, planning, decision-making, and adaptability for humanoid robots, collaborative robots, and robotic arms [4].
- Xingyuan Intelligent will utilize its core technology in embodied intelligence and multimodal operation models to develop a large model for the dexterous hand and conduct real-world testing [4].

Group 2: Technical and Product Development
- Building on Zhenghe Industrial's chain-type intelligent dexterous hand, both companies will collaborate on scenario training, data collection, and model iteration [5].
- A project working group will be established to promote cooperation in technology, products, and scenarios, with regular communication mechanisms [5].
- Zhenghe Industrial will invite experts from Xingyuan Intelligent to advise on dexterous-hand development, model adaptation, and business expansion [5].

Group 3: Resource Sharing and Market Development
- Both companies will open the necessary technical interfaces and scenario data to jointly create training and pilot application scenarios [5].
- Zhenghe Industrial will conduct scenario training and technical validation of Xingyuan Intelligent's models on its automated production lines [5].
- The partnership aims to develop various industrial and commercial scenarios, sharing resources and promoting the deployment and sales of both companies' products [5].

Group 4: Vision and Future Development
- The agreement establishes a strategic partnership in which Xingyuan Intelligent focuses on creating a universal embodied brain that connects digital intelligence with the physical world [6].
- Zhenghe Industrial will integrate hardware and software to deliver versatile functions in specific industrial and commercial scenarios [6].
- The collaboration is expected to deepen technical integration and resource synergy, providing a pathway for Zhenghe Industrial's business layout in the field of embodied intelligent robots [6].
Intelligence empowers the future, chains link the ecosystem | Zhenghe Industrial x Xingyuan Intelligent reach strategic cooperation
Xin Lang Cai Jing· 2025-12-31 01:44
Core Viewpoint
- Qingdao Zhenghe Industrial Co., Ltd. has signed a strategic cooperation agreement with Beijing Xingyuan Intelligent Robot Technology Co., Ltd. to leverage their respective strengths in smart hardware development and multimodal large model technology, aiming to enhance the competitiveness of end-effector solutions for humanoid and collaborative robots and thereby promote high-quality development of the smart robotics industry [1][9].

Group 1: Strategic Cooperation Details
- The cooperation will focus on seven core areas, establishing a comprehensive collaborative innovation system [2][3].
- Zhenghe Industrial will concentrate on the research and manufacturing of intelligent dexterous hands and control systems, while Xingyuan Intelligent will focus on multimodal large models, including dexterous-hand operation models [2][3].
- A project working group will be established to advance cooperation in technology, products, and scenarios, with regular communication mechanisms to synchronize progress [11].

Group 2: Technical Development and Application
- The collaboration aims to develop dexterous hands with autonomous perception, planning, decision-making, and adaptability for humanoid robots, collaborative robots, and robotic arms [2][3].
- Both companies will share downstream industrial and commercial scenarios and customer resources to promote the deployment, technical promotion, and sales of Xingyuan's large models and Zhenghe's dexterous hands [3][11].
- Training and testing of the dexterous-hand large model will be conducted in real industrial scenarios to enhance its adaptability and generalization capabilities [11][12].

Group 3: Company Background and Vision
- Zhenghe Industrial is a leader in chain-system technology and the first A-share listed company in China's chain-transmission industry, with products spanning sectors including automotive and industrial chains [7][16].
- Xingyuan Intelligent aims to create a universal embodied brain that connects digital intelligence with the physical world, drawing on a team of top scientists and business leaders from prestigious institutions [12][14].
- The upcoming launch of the new-generation industrial interactive embodied operation robot "Spirit G2" will feature Xingyuan's embodied-brain product, showcasing advanced capabilities for real-time perception and intelligent decision-making [5][14].
Is 3D space too hard to understand? RoboTracer lets robots understand complex spatial instructions and reason about 3D spatial trajectories, acting precisely even in the open world
机器之心· 2025-12-30 12:10
The main authors of this paper are from Beihang University, Peking University, the Beijing Academy of Artificial Intelligence (BAAI), and the Institute of Automation, Chinese Academy of Sciences. The first author is Zhou Enshen, a PhD student at Beihang University whose research focuses on embodied intelligence and multimodal large models. The co-first author and project lead is Chi Cheng, a researcher at BAAI. The corresponding authors are Sheng Lu, a professor at Beihang University, and Zhang Shanghang, a researcher and assistant professor at the School of Computer Science, Peking University.

We want embodied robots to truly enter the real world, and in particular everyone's home, to help with daily tasks such as watering plants, tidying up, and cleaning. But a home environment is not as clean, uniform, and controllable as a lab: objects come in many kinds, are placed haphazardly, and can change at any time, which makes it much harder for a robot to "see, understand, and act well" in the 3D physical world.

Imagine coming home from work and telling your household service robot: "Water each pot of flowers in order from left to right; the watering can should stop 1-5 cm above each flower before pouring, so the watering is more even." (as shown in the figure below)

For a person this is natural, but for a robot the difficulty is not in the "watering" itself: the instruction implicitly contains many spatial constraints, some qualitative (from left to right, above) and some quantitative (1-5 cm). In cluttered open-world scenes, getting a robot to follow these constraints reliably remains a challenge even for today's most advanced vision-language-action (VLA) models.

One direct way forward is to let a vision-language model (VLM) gen ...
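Since the article is truncated at this point, the following is only a hypothetical sketch of how such an instruction could be decomposed into machine-checkable constraints: a qualitative visiting order plus a quantitative hover height. The `Pose` layout and the clearance handling are illustrative assumptions, not RoboTracer's actual representation.

```python
from dataclasses import dataclass

# Hypothetical decomposition of "water each flower left to right,
# stopping 1-5 cm above it" into checkable spatial constraints.

@dataclass
class Pose:
    x: float  # meters, increasing to the robot's right
    y: float
    z: float  # height above the table

def left_to_right_order(flowers):
    """Qualitative constraint: visit order sorted by x-coordinate."""
    return sorted(flowers, key=lambda name_pose: name_pose[1].x)

def hover_waypoint(flower: Pose, clearance_m=(0.01, 0.05)):
    """Quantitative constraint: stop 1-5 cm above the flower.

    Targets the midpoint of the allowed clearance band.
    """
    lo, hi = clearance_m
    return Pose(flower.x, flower.y, flower.z + (lo + hi) / 2)

if __name__ == "__main__":
    flowers = [("rose", Pose(0.60, 0.2, 0.30)),
               ("tulip", Pose(0.10, 0.2, 0.25)),
               ("daisy", Pose(0.35, 0.2, 0.28))]
    for name, pose in left_to_right_order(flowers):
        wp = hover_waypoint(pose)
        print(f"{name}: hover at z={wp.z:.3f} m before pouring")
```

The point of the sketch is the split the article highlights: the ordering constraint is symbolic and resolution-free, while the clearance constraint is metric and must be grounded in 3D perception.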
Hehe Information files a second listing application with the Hong Kong Stock Exchange; Scanner All-in-One has been live for over 15 years with MAU exceeding 100 million
Zhi Tong Cai Jing· 2025-12-29 23:29
Company Overview
- Shanghai Hehe Information Technology Co., Ltd. (688615.SH) has submitted a listing application to the Hong Kong Stock Exchange, with China International Capital Corporation as its sole sponsor [1].
- The company has long been dedicated to AI technology innovation, providing products to over a billion users globally across diverse industries [4].
- Hehe Information is a leader in the global text intelligence technology field, driven by a multimodal large language model capable of processing various forms of data [4].

Financial Performance
- The company reported revenues of approximately RMB 988 million, RMB 1.188 billion, RMB 1.438 billion, and RMB 1.303 billion for fiscal years 2022, 2023, 2024, and the nine months ended September 30, 2025, respectively [6].
- Gross profit for the same periods was approximately RMB 827 million, RMB 1.000 billion, RMB 1.212 billion, and RMB 1.126 billion, with gross margins of 83.7%, 84.3%, 84.3%, and 86.4% [7].
- Net profit for fiscal years 2022, 2023, 2024, and the nine months ended September 30, 2025 was approximately RMB 284 million, RMB 323 million, RMB 401 million, and RMB 351 million, respectively [8].

Market Position
- According to Zhaoshang Consulting, Hehe Information ranks first in China and fifth globally among companies whose C-end efficiency AI products have monthly active users exceeding one million [4].
- The company's flagship product, "Scanner All-in-One," is the largest image-text-processing AI product globally, with over 100 million monthly active users and a compound annual growth rate (CAGR) of over 20% from 2022 to 2024 [4].

Product Development
- Hehe Information has developed a range of C-end products, including "Scanner All-in-One," "Business Card All-in-One," and "Qixinbao," leveraging AI technology to enhance user experience [5].
- The company has introduced innovative features such as AI precise recognition and AI business-card insights, further expanding its product offerings [5].

Industry Insights
- The global AI product market is projected to reach USD 46.5 billion in 2024 and to grow to USD 228 billion by 2029, a CAGR of 37.4% (the arithmetic is checked in the snippet below) [11].
- The C-end AI product market is expected to grow from USD 10.9 billion in 2024 to USD 77.1 billion by 2029, with efficiency AI products holding the largest market share [13][16].
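As a quick check on the growth figures: compound annual growth rate is CAGR = (end / start)^(1/years) - 1. The snippet below reproduces the stated 37.4% for the global market and applies the same formula to the C-end figures; it is a worked arithmetic example, not additional data from the report.

```python
# Verifying the stated CAGR figures: CAGR = (end / start) ** (1 / years) - 1.

def cagr(start, end, years):
    return (end / start) ** (1 / years) - 1

# Global AI product market, 2024 -> 2029 (USD billions, from the article).
print(f"global: {cagr(46.5, 228.0, 5):.1%}")   # ~37.4%, matching the report
# C-end AI product market over the same period (growth rate implied
# by the given figures; the report does not state it explicitly here).
print(f"c-end:  {cagr(10.9, 77.1, 5):.1%}")
```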
Volcano Engine becomes the exclusive AI cloud partner of the CMG Spring Festival Gala
Xin Lang Cai Jing· 2025-12-29 04:37
Sina Tech, December 29 - The sub-venue announcement of China Media Group's 2026 Spring Festival Gala named Volcano Engine the exclusive AI cloud partner of the 2026 Gala.

Volcano Engine said that, building on industry-leading multimodal large models and cloud computing technology, it will be deeply involved in the Gala's programming, online interaction, and live video streaming, lending its technology to this reunion celebration for Chinese people worldwide.

As ByteDance's cloud and AI service platform, Volcano Engine has provided technical support for Douyin's Spring Festival Gala livestreams over the past five years, including successfully handling 70.3 billion red-packet interactions during Douyin's partnership with the 2021 CMG Gala.
Can AI really understand the physical world? FysicsWorld: filling the gap in omni-modal interaction and physical perception evaluation
机器之心· 2025-12-28 04:44
Core Insights
- The article discusses the rapid paradigm shift in multimodal large language models toward unified omni-modal models capable of processing and generating information across modalities including language, vision, and audio [2][4].
- The driving force behind this shift is the complexity of the real physical world, where humans have always relied on multimodal information to understand and interact with their environment [3].
- A new benchmark, FysicsWorld, has been introduced to evaluate models' capabilities in understanding, generating, and reasoning across multiple modalities in real-world scenarios [4][10].

Introduction to Multimodal Models
- Multimodal models are evolving from simple combinations of visual and textual data to more complex integrations that include audio and other sensory modalities [12].
- There is a growing expectation that these models accurately understand and interact with complex real-world environments [12].

FysicsWorld Benchmark
- FysicsWorld is the first unified benchmark designed to assess models' abilities across multimodal tasks, covering 16 tasks that span various real-world scenarios [6][10].
- The benchmark includes a cross-modal complementarity screening strategy to ensure that tasks require genuine multimodal integration rather than single-modal shortcuts (a sketch of this filtering idea follows this summary) [8][23].

Evaluation Framework
- The evaluation framework covers tasks from basic perception to high-level interaction, enabling a thorough assessment of model capabilities [15][17].
- The benchmark aims to address the limitations of existing evaluation systems, which often focus on text-centric outputs and lack real-world applicability [16].

Performance Insights
- Initial evaluations on FysicsWorld reveal significant performance gaps among current models, particularly on tasks requiring deep cross-modal reasoning and interaction in real-world contexts [31].
- The results indicate that while models have made progress on basic multimodal tasks, they still struggle with complex scenarios that require robust integration of multiple sensory inputs [31][34].

Future Directions
- The article emphasizes the need for further advances in cross-modal integration, dynamic environment understanding, and physical-constraint reasoning to achieve true omni-modal intelligence [35].
- FysicsWorld serves as a critical tool for researchers to map and improve models' capabilities in real-world multimodal interaction [36].
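The cross-modal complementarity screening can be pictured as a filter that keeps a sample only when no single modality suffices to answer it but the combined modalities do. The sketch below is a hypothetical formulation of that idea; the probe interface and the keep/reject rule are assumptions, not FysicsWorld's published procedure.

```python
# Hypothetical sketch of cross-modal complementarity screening:
# keep a benchmark sample only if single-modality probes fail on it
# while the full multimodal input succeeds.

from typing import Callable, Dict

def is_complementary(sample: Dict,
                     probe: Callable[[Dict, tuple], bool],
                     modalities=("video", "audio", "text")) -> bool:
    """probe(sample, kept_modalities) -> True if a probe model answers
    correctly when only `kept_modalities` are visible to it."""
    # Reject if any single modality is already sufficient (shortcut).
    for m in modalities:
        if probe(sample, (m,)):
            return False
    # Require that the combined modalities actually solve the sample.
    return probe(sample, modalities)

def screen(dataset, probe):
    """Filter a candidate pool down to genuinely multimodal samples."""
    return [s for s in dataset if is_complementary(s, probe)]
```

The design choice this illustrates: by construction, a model cannot score well on the screened set by exploiting one dominant modality, which is exactly the shortcut behavior the benchmark aims to rule out.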
Baidu X-Driver: a VLA with closed-loop evaluation
自动驾驶之心· 2025-12-28 03:30
Core Viewpoint
- The article discusses the development and evaluation of X-Driver, a unified multimodal large language model (MLLM) framework designed for closed-loop autonomous driving, and emphasizes the importance of closed-loop evaluation metrics for assessing the performance of autonomous driving systems [2][3][23].

Group 1: Methodology and Architecture
- X-Driver integrates a Chain-of-Thought (CoT) reasoning mechanism into the MLLM to enhance decision-making in autonomous driving, processing inputs from camera data and navigation commands [6][11].
- The system operates in a closed loop: actions taken by the vehicle affect the real-world environment, generating new sensory data for continuous optimization [7][24].
- The architecture builds on LLaVA, a multimodal model that aligns image and text features to provide a comprehensive understanding of driving scenarios [9][10].

Group 2: Training and Reasoning Process
- The CoT fusion training method uses high-quality CoT prompt data to improve reasoning and decision-making in driving scenarios [11][12].
- The model decomposes tasks into sub-tasks such as object detection and traffic-signal interpretation, then integrates the results to generate final driving decisions [17][18].
- Training covers accurate perception of complex 3D driving environments and adherence to traffic regulations, ensuring safe navigation [15][22].

Group 3: Closed-loop Evaluation and Results
- Closed-loop evaluation is conducted in the CARLA simulation environment, with Driving Score and Success Rate as the key performance indicators (a CARLA-style scoring sketch follows this summary) [27][28].
- The Bench2Drive dataset, containing over 2 million frames, is used to assess closed-loop driving performance under various conditions [27].
- Results indicate that incorporating CoT reasoning significantly improves decision accuracy, though the success rate in closed-loop simulation remains around 20% [30][31].
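For readers unfamiliar with the metrics: the CARLA leaderboard defines Driving Score as route completion scaled by a multiplicative infraction penalty. The sketch below follows that style; the penalty coefficients and the success-rate rule are illustrative and may differ from the exact configuration used by Bench2Drive and X-Driver.

```python
# CARLA-leaderboard-style Driving Score: route completion scaled by a
# multiplicative infraction penalty. Coefficients below are illustrative.

PENALTIES = {
    "collision_pedestrian": 0.50,
    "collision_vehicle": 0.60,
    "collision_static": 0.65,
    "red_light": 0.70,
    "stop_sign": 0.80,
}

def driving_score(route_completion: float, infractions: list[str]) -> float:
    """route_completion in [0, 1]; infractions is a list of event names.

    Each infraction multiplies the score down, so repeated violations
    compound rather than add.
    """
    penalty = 1.0
    for event in infractions:
        penalty *= PENALTIES.get(event, 1.0)
    return 100.0 * route_completion * penalty

def success_rate(episodes: list[dict]) -> float:
    """Fraction of episodes finishing the route with no infractions."""
    done = [e for e in episodes if e["completed"] and not e["infractions"]]
    return len(done) / len(episodes) if episodes else 0.0

if __name__ == "__main__":
    # 90% of the route completed with one red-light violation: 90 * 0.7 = 63.
    print(driving_score(0.9, ["red_light"]))
```

Because Success Rate demands a fully completed, infraction-free episode, it is a far stricter criterion than Driving Score, which is consistent with the reported ~20% success rate coexisting with meaningful accuracy gains.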