Multimodal Large Models

Want to learn more about large models — how can one get started systematically?
自动驾驶之心· 2025-08-14 23:33
Group 1
- The article emphasizes the growing interest in large model technologies, particularly RAG (Retrieval-Augmented Generation), AI Agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and deployment and inference optimization [1]
- A community named "Large Model Heart Tech" is being established to focus on these technologies and aims to become the largest domestic community for large model technology [1]
- The community is also creating a knowledge platform to provide industry and academic information, as well as to cultivate talent in the field of large models [1]

Group 2
- The article describes the community as a serious content-driven platform aimed at nurturing future leaders [2]
AI Watch | From F1 to Football: Behind a Data Expert's Career Crossover, AI's Path to Commercial Breakthrough
Huan Qiu Wang Zi Xun· 2025-08-14 05:27
Group 1
- The core point of the article highlights the intersection of AI and sports, particularly through the appointment of Mike Sansoni from the F1 Mercedes team to Manchester United as data director, emphasizing the potential for AI to enhance decision-making in football [1]
- The move signifies a growing recognition within the AI industry that expertise can be transferable across sectors, as evidenced by Sansoni's transition from F1 data analysis to football [1]
- The integration of AI in sports is expected to involve data analysis for player recruitment and tactical insights, showcasing the versatility of AI applications [1]

Group 2
- The AI industry is witnessing a shift toward commercialization, with significant advancements in AI programming and the emergence of profitable applications in various sectors, including healthcare [2]
- Companies like Anthropic are capitalizing on the lucrative market for AI programming, with a notable increase in valuation due to their dominance in this area [2]
- Google has established a competitive edge in multi-modal scene generation, indicating potential expansion into gaming and film, which are seen as promising markets for AI [2]
- The healthcare sector is identified as a viable area for AI applications, particularly in organizing medical data and improving quality control, despite current limitations in diagnostic capabilities [2]

Group 3
- The commercialization of large models has found breakthroughs since the release of GPT-4, with discussions around the acceleration of technology development and its interrelated nature [4]
- The concept of "accelerating returns" suggests that advancements in one technology can spur growth in others, leading to faster-than-expected developments in the tech landscape [4]
World's First "Girl Group" Robot Auctioned for 10,580 Yuan, Integrated with JD's Joy Inside AI Agent
Sou Hu Cai Jing· 2025-08-13 18:35
Core Insights
- The auction of the humanoid robot Lingtong NIA-F01, valued at 9,999 yuan, concluded with a final price of 10,580 yuan, indicating strong market interest in innovative robotic products [1][4]

Group 1: Product Features
- The Lingtong NIA-F01 is marketed as the "world's first girl group robot," standing 56 centimeters tall and weighing under 700 grams, designed for a compact and durable user experience [1]
- The robot features a soft PVC skin for a smooth touch and a robust skeleton made of ABS and metal, enhancing its durability [1]
- It supports user customization of makeup and body design, catering to individual preferences [1]

Group 2: Technical Capabilities
- The robot is equipped with 6-8 millimeter micro digital servos, offering up to 34 degrees of freedom for intricate movements such as head turns and hand waves [3]
- It integrates multiple sensors for enhanced interaction, including dual cameras for facial expression recognition and matrix microphones for emotional tone detection, creating a "perception - understanding - response" feedback loop [3]
- The robot can adapt its communication style based on user preferences and emotional states, demonstrating a level of "intelligence" in interactions [3]

Group 3: User Interaction and Customization
- Users can co-create the robot's persona, voice, and action library, allowing for a unique and personalized experience [3]
- The robot can incorporate voice samples and personality traits, enabling users to design a robotic companion with specific characteristics [3]
- It connects with JD's Joy Inside conversational AI, providing high emotional intelligence in dialogue and a wide range of character options for diverse interaction scenarios [4]
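The "perception - understanding - response" loop described above can be sketched as a tiny state machine. This is purely illustrative Python under assumed inputs and rules; none of the function names, states, or mappings below come from the robot's actual software.

```python
# Hypothetical sketch of a "perception - understanding - response" loop.
# All states and rules here are illustrative assumptions for demonstration.

def perceive(camera_expression: str, voice_tone: str) -> dict:
    """Bundle raw sensor readings (camera, microphone) into one observation."""
    return {"expression": camera_expression, "tone": voice_tone}

def understand(observation: dict) -> str:
    """Map an observation to a coarse emotional state."""
    if observation["expression"] == "smile" and observation["tone"] == "upbeat":
        return "happy"
    if observation["tone"] == "flat":
        return "tired"
    return "neutral"

def respond(state: str) -> str:
    """Pick a response style matching the inferred state."""
    styles = {"happy": "playful", "tired": "soothing", "neutral": "friendly"}
    return styles[state]

def interaction_step(camera_expression: str, voice_tone: str) -> str:
    """One full cycle: perception -> understanding -> response."""
    return respond(understand(perceive(camera_expression, voice_tone)))
```

A real system would replace each stage with learned models, but the closed loop (sense, infer a state, choose a style) is the structure the summary describes.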
VLA: When Will It Reach Large-Scale Deployment?
Zhong Guo Qi Che Bao Wang· 2025-08-13 01:33
Core Viewpoint
- The discussion around VLA (Vision-Language-Action model) is intensifying, with contrasting opinions on its short-term feasibility and potential impact on the automotive industry [2][12]

Group 1: VLA Technology and Development
- The Li Auto i8 is the first vehicle to feature the VLA driver model, positioning it as a key selling point [2]
- Bosch's president for intelligent driving in China, Wu Yongqiao, expressed skepticism about the short-term implementation of VLA, citing challenges in multi-modal data acquisition and training [2][12]
- VLA is seen as an "intelligent enhanced version" of end-to-end systems, aiming for a more human-like driving experience [2][5]

Group 2: Comparison of Driving Technologies
- There are two main types of end-to-end technology: modular end-to-end and one-stage end-to-end, with the latter being more advanced and efficient [3][4]
- The one-stage end-to-end model simplifies the process by directly mapping sensor data to control commands, reducing information loss between modules [3][4]
- VLA is expected to outperform traditional end-to-end models by integrating multi-modal capabilities and enhancing decision-making in complex scenarios [5][6]

Group 3: Challenges and Requirements for VLA
- The successful implementation of VLA relies on breakthroughs in three key areas: cross-modal feature alignment, world model construction, and dynamic knowledge base integration [7][8]
- Current automotive chips are not designed for large AI models, leading to performance limitations in real-time decision-making [9][11]
- The industry is experiencing a "chip power battle," with companies like Tesla and Li Auto developing their own high-performance AI chips to meet VLA's requirements [11][12]

Group 4: Future Outlook and Timeline
- Some industry experts believe 2025 could be a pivotal year for VLA technology, while others suggest it may take 3-5 years for widespread adoption [12][13]
- Initial applications of VLA are expected to be in controlled environments, with broader capabilities emerging as chip technology advances [14]
- Long-term projections indicate that advancements in AI chip technology and multi-modal alignment could lead to significant breakthroughs in VLA deployment by 2030 [14][15]
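The contrast drawn above between modular and one-stage end-to-end driving stacks can be illustrated with a toy sketch. Both functions below are invented stand-ins (no real perception or planning code); the point is only the structure: a modular pipeline hands intermediate results between stages, while a one-stage model maps raw sensor data directly to control commands.

```python
# Toy contrast between a modular pipeline and a one-stage end-to-end model.
# The stages, thresholds, and weights are made-up assumptions; real systems
# use learned neural networks, not these hand-written rules.

def modular_pipeline(sensor_frame: list) -> dict:
    """Perception -> prediction -> planning -> control as separate modules.
    Each hand-off passes only a summary, so detail can be lost between stages."""
    objects = [x for x in sensor_frame if x > 0.5]          # perception: detect
    risk = len(objects) / max(len(sensor_frame), 1)          # prediction: assess
    target_speed = 30.0 * (1.0 - risk)                       # planning: set goal
    return {"throttle": target_speed / 30.0, "brake": risk}  # control: actuate

def one_stage_end_to_end(sensor_frame: list) -> dict:
    """A single mapping from raw sensor data to control commands; here a
    trivial weighted sum stands in for a trained network."""
    activation = sum(sensor_frame) / max(len(sensor_frame), 1)
    return {"throttle": max(0.0, 1.0 - activation),
            "brake": min(1.0, activation)}
```

The one-stage version never discretizes the scene into "objects," which is the sense in which the article says it reduces information loss between modules.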
What Are the Hot Topics in Large Model Research for 2025?
自动驾驶之心· 2025-08-12 23:33
A community that takes content seriously, and a place that cultivates future leaders.

With autonomous-driving VLA so popular right now, this is a good opportunity to learn more about large-model technology: which directions are worth pursuing, and where are today's hot spots? To that end, we have set up the 大模型之心Tech community, a platform focused on large-model RAG, AI Agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and large-model deployment and inference optimization. Everyone interested in large-model technology is welcome to follow us.

If you want to study further, you are also welcome to join our 大模型之心Tech Knowledge Planet. Its goal is to build the largest large-model technology community in China, continuously supplying the industry and individuals with talent and with industry and academic information. The planet is rapidly building out its modules — join us and keep pace with large models. ...
Breaking Through SAM's Limitations! Meituan Proposes X-SAM: A Unified Framework Sweeping 20+ Segmentation Benchmarks
自动驾驶之心· 2025-08-12 23:33
Core Insights
- The article discusses the introduction of X-SAM, a new segmentation framework that overcomes the limitations of the Segment Anything Model (SAM) by enabling multi-task processing and integrating multi-modal capabilities [3][4][5]

Group 1: Limitations of SAM
- SAM was initially seen as a universal solution for visual segmentation but has significant limitations, including its single-task focus, inability to understand text instructions, and the inefficiency of needing multiple models for different tasks [5][6][7]

Group 2: Innovations of X-SAM
- X-SAM integrates SAM's visual segmentation capabilities with multi-modal understanding from large language models (LLMs) through a unified input format, a dual-encoder architecture, and multi-stage training [12][13][21]
- The unified input format allows various segmentation tasks to be processed in a consistent manner, enhancing the model's ability to understand both text and visual prompts [13][15]
- The dual-encoder architecture consists of a global image encoder and a segmentation encoder, optimizing both overall scene understanding and pixel-level detail [14][19]
- Multi-stage training involves fine-tuning the segmentation model, aligning visual and language features, and mixed fine-tuning across diverse datasets to enhance generalization [21][23]

Group 3: Performance Metrics
- X-SAM has demonstrated superior performance across more than 20 datasets and 7 core tasks, achieving state-of-the-art results on various segmentation benchmarks [27][28]
- On the COCO dataset, X-SAM achieved a panoptic quality (PQ) score of 54.7, closely following the best-performing model, Mask2Former [31]
- For open-vocabulary segmentation, X-SAM's average precision (AP) reached 16.2, significantly outperforming other models [31]
- In referring segmentation tasks, X-SAM achieved cumulative Intersection over Union (cIoU) scores of 85.1, 78.0, and 83.8 across different datasets, surpassing competitors [32]

Group 4: New Task Introduction
- X-SAM introduces a new task called Visual Grounding Detection (VGD) segmentation, which allows the model to segment all instances of a class based on visual prompts, even across different images [25][26][35]
- In experiments, X-SAM achieved average precision scores of 47.9 to 49.7 for VGD segmentation, significantly exceeding existing models [35]

Group 5: Future Directions
- The research team plans to extend X-SAM's capabilities to video segmentation and dynamic scenes, aiming to enhance its application in temporal visual understanding [43]
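The dual-encoder design attributed to X-SAM above can be sketched at a whiteboard level: one encoder produces a scene-level summary, the other keeps a pixel-aligned feature map, and a decoder combines both to label pixels. The functions, shapes, and thresholding rule below are illustrative assumptions, not the paper's actual architecture.

```python
# Whiteboard-level sketch of a dual-encoder segmentation design: a global
# scene summary plus a dense per-pixel map, fused by a decoder. All three
# components here are toy stand-ins for learned networks.

def global_image_encoder(image: list) -> float:
    """Collapse the image into one scene-level feature (here, the mean)."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

def segmentation_encoder(image: list) -> list:
    """Keep a dense, pixel-aligned feature map (here, a copy of the image)."""
    return [row[:] for row in image]

def decode_mask(scene_feature: float, dense_features: list) -> list:
    """Label each pixel by comparing its dense feature to the scene context."""
    return [[1 if p > scene_feature else 0 for p in row]
            for row in dense_features]

image = [[0.1, 0.9], [0.8, 0.2]]
mask = decode_mask(global_image_encoder(image), segmentation_encoder(image))
```

The design point the summary makes is that neither encoder alone suffices: the global branch supplies context for "what is in the scene," while the dense branch preserves the pixel-level detail a mask needs.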
WRC 2025 Focus (1): General-Purpose Embodied Intelligence on Display, with the GOVLA Architecture as a Highlight
Haitong Securities International· 2025-08-12 01:01
Investment Rating
- The report does not explicitly provide an investment rating for the industry or specific companies within it

Core Insights
- The 2025 World Robot Conference (WRC) showcased over 200 companies and 1,500 exhibits, highlighting advancements in swarm intelligence, humanoid robotics, and multi-modal large models [1][15]
- China's robotics industry is projected to generate nearly RMB 240 billion in revenue in 2024, maintaining its status as the largest industrial robot market globally for 12 consecutive years [4][18]
- The commercialization of general-purpose humanoid robots follows a phased approach, transitioning from algorithm validation to household applications [3][17]

Summary by Sections

Event Overview
- WRC 2025 opened on August 8, 2025, in Beijing, featuring over 200 companies and 1,500 exhibits, including more than 50 humanoid robot manufacturers [1][15]

Industry Achievements
- The conference highlighted breakthroughs in swarm intelligence, humanoid robotics, and fully self-developed embodied intelligence systems, with notable demonstrations from companies such as UBTech and Unitree [2][16]

Market Dynamics
- In the first half of 2025, industrial robot output reached 370,000 units, a 35.6% year-on-year increase, while service robot output reached 8.824 million units, up 25.5% year-on-year [4][18]
- Industrial robots are utilized across 71 major and 241 sub-categories of the national economy, with applications in automotive manufacturing, electronics, and healthcare [4][18]

Technological Framework
- The Global & Omni-body Vision-Language-Action Model (GOVLA) represents a significant technological advancement, enabling coordinated control and task execution across various environments [3][17][20]
- The phased rollout of humanoid robots includes stages from algorithm validation to public service and ultimately to household assistance [3][17]

Future Outlook
- The report indicates a strong foundation for future consumer adoption of humanoid robots, with a focus on high-value B2B markets in the early stages [3][17]
The 具身智能之心 Technical Exchange Group Has Been Established!
具身智能之心· 2025-08-11 06:01
The 具身智能之心 technical exchange group has been established! It focuses on VLA, VLN, teleoperation, Diffusion Policy, reinforcement learning, VLA+RL, sim2real, multimodal large models, simulation, motion control, goal navigation, mapping and localization, navigation, and related directions. Interested readers can add the assistant's WeChat account AIDriver005 to be invited into the community. Note: include your organization/school + name + research direction in the request for fast approval! ...
OpenAI Releases Its Most Powerful AI Model, GPT-5; Intel CEO Sends All-Hands Letter Responding to Resignation Calls; WeChat Employee Responds to "Changing the Phone Date Can Recover Expired Files" | Q News
Sou Hu Cai Jing· 2025-08-10 02:43
Group 1: OpenAI and AI Models
- OpenAI has officially released its latest AI model, GPT-5, which features intelligent model version switching, lower hallucination rates, enhanced coding capabilities, and personalized settings [1][3]
- GPT-5 achieved state-of-the-art scores on key coding benchmarks, scoring 74.9% on SWE-bench Verified and 88% on the Aider polyglot test, positioning it as a strong coding collaborator [3]
- The model excels at front-end coding tasks, outperforming previous versions in 70% of internal tests [3]

Group 2: Intel and CEO Response
- Intel CEO Lip-Bu Tan addressed employees in a letter, clarifying misconceptions and indicating he will not resign, emphasizing his commitment to the company's future goals and investments [4][5]
- Intel has a 56-year history of semiconductor production in the U.S. and plans to invest billions in semiconductor R&D and manufacturing, including a new fab in Arizona [4]

Group 3: Microsoft Layoffs
- Microsoft has initiated a new round of layoffs in Washington state, cutting approximately 40 positions and bringing total layoffs in the state to 3,160 this year [6]
- The layoffs are part of a broader plan to cut over 15,000 jobs globally, with the latest round being relatively small compared to previous months [6]

Group 4: ByteDance Recruitment
- ByteDance has launched its 2026 campus recruitment, offering over 5,000 positions, a significant increase from the previous year's 4,000+ offers [10]
- The recruitment covers various roles, with a 23% increase in R&D positions, particularly in algorithms and front-end development [10]

Group 5: Gaming and Service Outages
- Multiple games under NetEase experienced login issues, leading to a significant outage that lasted over 2 hours, attributed to internal server problems [8][9]
- The outage affected several popular titles, causing widespread player frustration and highlighting the challenges of troubleshooting large-scale service disruptions [8][9]

Group 6: AI Developments
- OpenAI released two open-weight AI models, gpt-oss-120b and gpt-oss-20b, which can mimic human reasoning and perform complex tasks, although they are not fully open-source [13]
- Google DeepMind introduced Genie 3, a universal world model capable of generating interactive 3D environments in real time, marking a significant advancement in world-modeling technology [14][15]
Soochow Securities: How Far Are We from a True Embodied Intelligence Large Model?
智通财经网· 2025-08-09 14:20
2. From the architecture and data sides, what progress have robot large models made?

The rapid evolution of robot large models stems from coordinated breakthroughs on both the architecture and data fronts. On the architecture side, the field has moved from early language-planning models such as SayCan, to RT-1's end-to-end action output, to PaLM-E and RT-2 fusing multimodal perception into a unified model space, so that large models now possess a complete chain of "interpreting images, understanding tasks, and generating actions." In 2024, π0 introduced an action-expert model with an action output frequency of 50 Hz; in 2025, Helix achieved a fast-slow-brain parallel architecture, pushing the control frequency past 200 Hz and markedly improving the fluency and responsiveness of robot manipulation. On the data side, a structured system has formed in which three types of data work together: internet data, simulation data, and real-robot action data. The first two provide pre-training scale and generalization scenarios, while the third directly improves the model's practical capability in the physical world. Real-robot data collection depends heavily on high-precision motion-capture equipment; optical motion capture, with its precision advantage, suits centralized training facilities and is likely to become a core data source for training embodied models. The mainstream training paradigm is rapidly iterating beyond "low-quality pre-training + high-quality post-tuning," and the leap in model intelligence is shifting from "piling up data" to "optimizing structure."

3. What is the future development direction of large models?

According to Zhitong Finance APP, Soochow Securities published a research report stating that future embodied large models will continue to evolve along three dimensions: modality expansion, reasoning mechanisms, and data composition. Current mainstream models mostly focus on vision, language, and ...
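The fast-slow-brain idea mentioned above (a Helix-style split between a low-rate planner and a 200 Hz controller) can be sketched as two loops running at different rates, with the fast loop reusing the latest plan between planner updates. The planner and controller below are toy stand-ins; only the 50 Hz and 200 Hz rates come from the text, and everything else is an invented illustration.

```python
# Toy sketch of a fast-slow dual-rate control architecture: a slow planner
# updates the goal at 50 Hz while a fast controller steps toward it at 200 Hz.
# The goal schedule and proportional gain are made-up assumptions.

SLOW_HZ = 50    # planner rate, like π0's 50 Hz action-expert output
FAST_HZ = 200   # controller rate, like Helix's 200 Hz control frequency

def slow_planner(t: float) -> float:
    """Infrequently recomputed goal, e.g. a target joint angle."""
    return 1.0 if t < 0.5 else 0.0

def fast_controller(goal: float, position: float) -> float:
    """High-rate proportional step toward the current goal."""
    return position + 0.2 * (goal - position)

position, goal = 0.0, 0.0
for tick in range(FAST_HZ):                 # simulate one second of control
    t = tick / FAST_HZ
    if tick % (FAST_HZ // SLOW_HZ) == 0:    # planner fires every 4th tick
        goal = slow_planner(t)
    position = fast_controller(goal, position)
```

The payoff of the split is that the controller keeps reacting at 200 Hz even while the planner, which may be a large and slow model, only refreshes the goal at 50 Hz.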