Multimodal Models

Igor Babuschkin, co-founder of Elon Musk's xAI, departs to move into AI safety venture capital
Sou Hu Cai Jing· 2025-08-14 05:40
Core Insights
- Babuschkin, a key figure in xAI's engineering team, has played a significant role in building the company's technical architecture and supercomputing clusters, helping xAI become a leader in AI model development within just two years [1]
- Babuschkin plans to establish a venture capital firm, Babuschkin Ventures, focusing on supporting AI safety research and startups aimed at "advancing humanity and unlocking the mysteries of the universe" [1]
- Elon Musk expressed gratitude towards Babuschkin for laying the foundation for xAI, stating that the company's achievements would not have been possible without him [1]
- xAI has initiated a global talent recruitment plan, emphasizing the need for experts in AI safety and multimodal models [1]
Both a "Sherlock Holmes" and a "Leeuwenhoek": Zhipu open-sources the visual reasoning capability that OpenAI has kept under wraps
机器之心· 2025-08-12 03:10
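Report by 机器之心. Authors: Zhang Qian, Chen Chen. Just from looking at the picture, can you guess where this is? When a colleague came back from a business trip and dropped this picture into the group chat, we guessed for a long time and got nowhere. The mystery was solved only when another colleague handed the picture to Zhipu's new model, GLM-4.5V. Given a screenshot of the photo (so that the model could not use the photo's EXIF metadata), GLM-4.5V quickly reasoned its way to the result. That's right: the place in the picture is on the banks of the Danube. Although the angle and style of our colleague's shot were nothing like the polished photos on lifestyle-sharing apps, Zhipu's new model still gave the correct answer through in-depth analysis. You might say that OpenAI's o3 and o4 mini have had this ability for a while, so it is nothing special. But what if I told you that this model is open source? We hear it also took part in the global points competition of the well-known game 「图寻」 (Tuxun), playing against more than 20,000 human players for 7 days. Out of curiosity, we opened the game and tried it, and were stumped right away: the competition allows only 3 minutes of thinking time. Rounds with landmarks are manageable, but for ordinary streets and mountain roads like these, without some accumulated knowledge of culture and geography it is hard even to narrow down the general area, let alone pin down the latitude and longitude as the task requires. Yet after 7 days under exactly these rules, GLM-4.5V beat 99.99% of human players. What does playing this game well mean? It means GLM-4.5V has very strong visual reasoning ...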
Zhipu has just open-sourced its strongest multimodal model, GLM-4.5V.
数字生命卡兹克· 2025-08-11 14:20
Core Viewpoint
- The article highlights the release of GLM-4.5 and its successor GLM-4.5V, emphasizing their advanced capabilities in multimodal processing and superior performance in benchmark tests [1][2][6].
Model Release and Specifications
- GLM-4.5V is a multimodal model with 106 billion total parameters and 12 billion active parameters, making it one of the largest open-source multimodal models available [3].
- The model has achieved state-of-the-art (SOTA) results in 41 out of 42 evaluation benchmarks, showcasing its strong performance [4][6].
Benchmark Performance
- A detailed comparison of GLM-4.5V against other models shows its leading performance across various tasks, including visual question answering and reasoning [5].
- For instance, in the MMBench v1.1 benchmark, GLM-4.5V scored 88.2, outperforming other models like Qwen2.5-VL and GLM-4.1V [5].
Open Source and Accessibility
- GLM-4.5V is available for download on multiple platforms, including GitHub and Hugging Face, although its large size may pose deployment challenges for consumer-level applications [7][8].
- The model can be accessed through the z.ai platform for those who prefer not to handle the deployment themselves [8][9].
Testing and Capabilities
- Initial tests conducted on GLM-4.5V demonstrated its ability to accurately solve complex visual reasoning tasks, indicating its advanced cognitive capabilities [10][14][23].
- The model also exhibits impressive video understanding capabilities, able to analyze and summarize video content effectively, which is a significant advancement in multimodal AI [48][54][66].
Pricing and Economic Viability
- The API pricing for GLM-4.5V is competitive, with input costs at 2 yuan per million tokens and output costs at 6 yuan per million tokens, making it an attractive option in the multimodal model market [83].
Conclusion
- The continuous development and open-source approach of companies like Zhipu AI signify a shift in the AI landscape, promoting accessibility and innovation in the field [86][90][94].
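To make the quoted GLM-4.5V API pricing concrete, here is a minimal sketch of the arithmetic. It assumes only the two prices stated in the summary (2 yuan per million input tokens, 6 yuan per million output tokens); the function name and the example workload are hypothetical and are not taken from any official SDK or price list.

```python
# Prices quoted in the summary above, in yuan per one million tokens.
INPUT_YUAN_PER_MILLION = 2.0
OUTPUT_YUAN_PER_MILLION = 6.0


def estimate_cost_yuan(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a workload at the quoted per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * INPUT_YUAN_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_YUAN_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workload: 500,000 input tokens and 100,000 output tokens.
    print(f"{estimate_cost_yuan(500_000, 100_000):.2f} yuan")
```

At these prices an output token costs three times as much as an input token, so output-heavy workloads dominate the bill.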
Embodied intelligence and humanoid robots amid the AI wave: is China following behind or running neck and neck?
Guan Cha Zhe Wang· 2025-08-03 05:35
Group 1
- The core theme of the discussion revolves around "embodied intelligence" and its significance in the development of humanoid robots and AGI (Artificial General Intelligence) [1][2]
- The conversation highlights the advancements in humanoid robots, particularly focusing on companies like Tesla and Boston Dynamics, and their impact on the global robotics landscape [1][2][3]
- The panelists discuss China's position in the AI race, questioning whether it is merely following the US or is on the verge of overtaking it [1][2]
Group 2
- Midea's entry into humanoid robotics is driven by its existing technological advantages in components and a complete product line, marking a strategic shift from its traditional home appliance business [4][5]
- The acquisition of KUKA Robotics in 2016 has allowed Midea to expand its capabilities in industrial technology and automation, serving various sectors including automotive and logistics [4][5]
- The discussion emphasizes the importance of application-driven development in humanoid robotics, with Midea exploring both full humanoid and wheeled robots for different use cases [13][15]
Group 3
- The panelists from various companies, including Grasping Deep Vision and Zhenge Fund, share insights on the evolution of AI and robotics, focusing on the integration of computer vision and machine learning in their products [5][6][8]
- Grasping Deep Vision, as a pioneer in AI computer vision, has developed applications across finance, security, and education, showcasing the versatility of AI technologies [5][6]
- Zhenge Fund's investment strategy emphasizes early-stage funding in cutting-edge technology sectors, including AI and robotics, aiming to support innovative startups [6][8]
Group 4
- The discussion on humanoid robots highlights the historical context, mentioning significant milestones like Honda's ASIMO and Boston Dynamics' Atlas, and contrasting them with recent advancements in China and the US [8][10]
- The panelists note that the complexity of humanoid robots, with an average of 40 joints, poses significant engineering challenges, but advancements in reinforcement learning are simplifying the development process [9][10]
- The future of humanoid robots is seen as promising, with expectations of rapid advancements in the next 5 to 10 years driven by technological breakthroughs and application-driven demands [9][10]
Group 5
- The conversation touches on the debate between wheeled versus bipedal humanoid robots, with arguments for the practicality of wheeled robots in industrial settings and the necessity of bipedal robots for complex environments [13][16]
- The panelists discuss the potential of "super humanoid robots" designed for specific industrial applications, aiming to exceed human efficiency in tasks like assembly and logistics [15][16]
- The importance of dexterous hands in humanoid robots is emphasized, with a focus on the trade-offs between complexity, cost, and functionality in various applications [21][25]
Group 6
- The concept of "embodied intelligence" is defined as the ability of robots to interact with the physical world, moving beyond traditional control methods to achieve more autonomous decision-making [28][30]
- The panelists explore the role of world models and video models in enhancing the capabilities of humanoid robots, suggesting that these models can improve the robots' understanding of dynamic environments [35][39]
- Reinforcement learning is highlighted as a crucial component in the development of humanoid robots, with discussions on optimizing reward systems to enhance learning outcomes [41][42]
21 Dialogue | SenseTime's Lin Dahua: Embodied intelligence requires connecting digital space and physical space
21 Shi Ji Jing Ji Bao Dao· 2025-07-28 08:10
Core Insights
- The rise of large language models (LLMs) marks a significant leap in AI technology, but achieving Artificial General Intelligence (AGI) requires more than just text understanding and generation [2]
- The development of AI is transitioning from single language models to a new stage of multimodal integration, which is essential for reaching AGI [2][3]
- The future of AI lies in the fusion of multimodal information and interaction with the physical world, with a full-scale adoption of multimodal models expected by the second half of 2025 [2][3]
Multimodal Development
- The evolution of large models is moving towards deeper cross-modal understanding, transitioning from mere comprehension to cognitive processing [4][6]
- Early multimodal architectures had limitations, but advancements like the Gemini model are integrating image and video information into pre-training processes, enhancing cross-modal modeling capabilities [6]
- Effective training of multimodal models can lead to superior performance in pure language tasks compared to single language models [6]
Embodied Intelligence
- Embodied intelligence is viewed as one of the ultimate forms of AGI, with significant attention in 2025 [3]
- The development of agents is crucial for the practical application of large model capabilities, but current agents still face challenges in complex real-world scenarios [7]
- The reliability and success rate of agents in real-world applications are critical for their perceived value [7]
Key Challenges
- A major challenge for achieving AGI is the ability to generalize reasoning from narrow domains to complex real-life scenarios [8]
- Current multimodal models exhibit insufficient spatial understanding, which is a significant barrier to the realization of embodied intelligence [8]
- The data acquisition methods for embodied intelligence are limited, primarily relying on robotic operations, which results in lower data throughput compared to digital models [10]
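21 Dialogue | Lianhui Technology CEO Zhao Tiancheng: "非常答" (unconventional answers) on the direction in which embodied intelligence is evolving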
Sou Hu Cai Jing· 2025-07-28 04:37
Core Insights
- The 2025 World Artificial Intelligence Conference (WAIC) held in Shanghai showcased a significant interest in AI applications, particularly in embodied intelligence and multimodal models [1][2]
- Lianhui Technology, a pioneer in multimodal models, has launched the world's first "OmAgent" platform, which focuses on physical world applications rather than digital spaces [1][2]
Company Developments
- Lianhui Technology has developed its multimodal model from its first generation in 2021 to the fifth generation, with an iteration speed of approximately one year per generation [2]
- The company has established its international headquarters in Zhangjiang, Shanghai, to leverage the concentration of intelligent terminals and embodied robots, as well as rich application scenarios in logistics, ports, and industrial manufacturing [2]
Industry Trends
- The current trend in AI applications is characterized by a shift towards the integration of various technologies, with embodied intelligence being a major focus for 2023 [1]
- The evolution of embodied intelligence is seen as progressing through different stages, with various hardware carriers at different maturity levels, indicating a phased approach to deployment [2]
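Qiming Venture Partners again releases its ten AI outlooks at WAIC 2025, covering foundation models, AI applications, embodied intelligence and more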
IPO早知道· 2025-07-28 03:47
Core Viewpoint
- Qiming Venture Partners is recognized as one of the earliest and most comprehensive investment institutions in the AI sector in China, having invested in over 100 AI projects, covering the entire AI industry chain and promoting the rise of several benchmark enterprises in the field [2].
Group 1: AI Models
- In the next 12-24 months, a context window of 2 million tokens will become standard for top AI models, with more refined and intelligent context engineering driving the development of AI models and applications [4].
- A universal video model is expected to emerge within 12-24 months, capable of handling generation, reasoning, and task understanding in video modalities, thus innovating video content generation and interaction [6].
Group 2: AI Agents
- In the next 12-24 months, the form of AI agents will transition from "tool assistance" to "task undertaking," with the first true "AI employees" entering enterprises, participating widely in core processes such as customer service, sales, operations, and R&D, thus shifting from cost tools to value creation [8].
- Multi-modal agents will increasingly become practical, integrating visual, auditory, and sensor inputs to perform complex reasoning, tool invocation, and task execution, achieving breakthroughs in industries such as healthcare, finance, and law [9].
Group 3: AI Infrastructure
- In the AI chip sector, more "nationally established" and "nationally produced" GPUs will begin mass delivery, while innovative new-generation AI cloud chips focusing on 3D DRAM stacking and integrated computing will emerge in the market [11].
- In the next 12-24 months, token consumption will increase by 1 to 2 orders of magnitude (roughly 10x to 100x), with cluster inference optimization, terminal inference optimization, and soft-hard collaborative inference optimization becoming core technologies for reducing token costs on the AI infrastructure side [12].
Group 4: AI Applications
- The paradigm shift in AI interaction will accelerate in the next two years, driven by a decrease in user reliance on mobile screens and the rising importance of natural interaction methods like voice, leading to the birth of AI-native super applications [14].
- The potential for AI applications in vertical scenarios is immense, with more startups leveraging industry insights to deeply engage in niche areas and rapidly achieve product-market fit, adopting a "Go Narrow and Deep" strategy to differentiate from larger companies [15].
- The AI BPO (Business Process Outsourcing) model is expected to achieve commercial breakthroughs in the next 12-24 months, transitioning from "delivering tools" to "delivering results," and expanding rapidly in standardized industries such as finance, customer service, marketing, and e-commerce through a "pay-per-result" approach [15].
Group 5: Embodied Intelligence
- Embodied intelligent robots will first achieve large-scale deployment in scenarios such as picking, transporting, and assembling, accumulating a wealth of first-person perspective data and tactile operation data, thereby constructing a closed-loop flywheel of "model - robot body - scene data," which will drive model capability iteration and ultimately promote the large-scale landing of general-purpose robots [17].
Guoxin Securities Daily Morning Report - 20250728
Guoxin Securities Co., Ltd· 2025-07-28 02:06
Domestic Market Overview
- The domestic market experienced a weak consolidation with a decrease in trading volume, with the Shanghai Composite Index closing at 3593.66 points, down 0.33%, and the Shenzhen Component Index at 11168.14 points, down 0.22% [1][5][10]
- Among the 30 sectors tracked, 9 sectors saw gains, with notable increases in computer, electronics, and light manufacturing, while construction materials, construction, and food and beverage sectors faced significant declines [1][5][10]
- The total trading volume for the A-share market was 181.55 billion yuan, showing a decrease compared to the previous day [1][5][10]
Overseas Market Overview
- The three major U.S. stock indices saw slight gains, with the Dow Jones up 0.47%, S&P 500 up 0.4%, and Nasdaq up 0.24%. Notably, Tesla's stock rose over 3% [2][5]
- The performance of Chinese concept stocks was mixed, with many declining, including a drop of over 10% for Xiaoying Technology [2][5]
Key News Highlights
- The 2025 World Artificial Intelligence Conference was attended by Premier Li Qiang, emphasizing the rapid development of AI technology and its integration into the economy [3][12]
- The establishment of the China Capital Market Society was announced, aiming to enhance research and development in the capital market [3][21]
- A trade agreement was reached between the U.S. and the EU, which includes a 15% tariff on EU goods entering the U.S. and a commitment from the EU to increase investment in the U.S. [3][22][23]
Industrial Insights
- In June, the profit decline of industrial enterprises above designated size narrowed, with total profits amounting to 715.58 billion yuan, a year-on-year decrease of 4.3%, which is an improvement from the previous month [16][17]
- The equipment manufacturing sector showed significant growth, with a 7.0% increase in revenue and a profit increase of 9.6%, contributing positively to overall industrial profits [17][18]
- The manufacturing sector is advancing towards high-end, intelligent, and green production, with notable profit increases in high-end equipment manufacturing and smart products [18][19]
Agricultural Sector Developments
- A new plan to promote agricultural product consumption was released, focusing on optimizing supply, innovating distribution, and enhancing market activation [20]
- The plan aims to meet diverse consumer needs and improve the quality of agricultural products while leveraging e-commerce platforms for better market reach [20]
Hands-on with the red-hot Step 3 from 阶跃星辰 (Jieyue Xingchen): SOTA performance, the king of open-source multimodal reasoning
机器之心· 2025-07-26 08:19
Core Viewpoint
- The article highlights the launch of Step 3, a new generation of open-source base model by Jieyue Xingchen, which is positioned as a leading open-source VLM (Vision-Language Model) that excels in various benchmarks and has significant commercial potential [1][2][11].
Group 1: Model Features and Performance
- Step 3 is recognized for its strong performance, surpassing other open-source models in benchmarks such as MMMU, MathVision, and SimpleVQA [1][41].
- The model integrates multi-modal capabilities, combining text and visual understanding, which is essential for real-world applications [10][39].
- Step 3 is designed to balance intelligence, cost, efficiency, and versatility, addressing key challenges in AI deployment [7][8].
Group 2: Technical Innovations
- The underlying architecture of Step 3 utilizes a proprietary MFA (Multi-matrix Factorization Attention) design, optimizing for efficiency and performance, particularly on domestic chips [29][31].
- The model features a total parameter count of 321 billion, with 316 billion dedicated to LLM (Large Language Model) and 5 billion for the visual encoder, showcasing its extensive capabilities [33][34].
- Step 3 employs advanced distributed inference techniques, enhancing resource allocation and reducing operational costs [38].
Group 3: Commercialization and Market Impact
- The launch of Step 3 marks a significant step towards commercialization for Jieyue Xingchen, with expectations of substantial revenue growth, projected to approach 1 billion yuan in 2025 [54].
- The model has already been integrated into various smart devices, with partnerships established with over half of the top 10 domestic smartphone manufacturers [54].
- The establishment of the "Model-Chip Ecological Innovation Alliance" with multiple chip manufacturers signifies a strategic move to foster collaboration and reduce costs in the AI ecosystem [51][52].
Group 4: Industry Positioning
- Step 3 is positioned as a solution to the pressing industry need for a practical, open-source multi-modal reasoning model, filling a significant market gap [58][60].
- The article emphasizes the shift from competitive pricing strategies to collaborative innovation as a sustainable growth path for the industry [59][60].
- Jieyue Xingchen's rapid iteration and comprehensive model matrix have solidified its reputation as a leader in the multi-modal AI space [57].
Yuekai Market Daily - 20250725
Yuekai Securities· 2025-07-25 07:53
Market Overview
- The A-share market saw most major indices decline today, with the Shanghai Composite Index falling by 0.33% to close at 3593.66 points, and the Shenzhen Component Index decreasing by 0.22% to 11168.14 points. The ChiNext Index dropped by 0.23% to 2340.06 points, while the Sci-Tech 50 Index increased by 2.07% to 1054.20 points. Overall, 2724 stocks declined, 2532 stocks rose, and 158 stocks remained flat, with total trading volume in the Shanghai and Shenzhen markets amounting to 12189 billion yuan, a decrease of 6258.16 billion yuan from the previous trading day [1][2].
Industry Performance
- Among the primary industries, electronic, computer, real estate, light manufacturing, textile and apparel, and media sectors led the gains, while construction decoration, building materials, food and beverage, coal, comprehensive, and steel industries experienced declines [1][2].
Sector Highlights
- The top-performing concept sectors today included GPU, Kimi, multimodal models, ChatGPT, photolithography machines, intelligent agents, servers, selected rare metals, AIGC, artificial intelligence, machine vision, ASIC chips, selected semiconductors, Xiaohongshu platform, and Pinduoduo partners [2].
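The index and breadth figures in the market overview above lend themselves to a quick sanity check. The sketch below backs out an approximate previous close from a closing level and its percentage change, and converts the advance/decline counts into shares. It assumes the reported percentages are rounded to two decimals, so the implied previous close is only approximate; the function names are hypothetical.

```python
def implied_previous_close(close: float, pct_change: float) -> float:
    """Back out the prior close from today's close and the percentage change."""
    return close / (1 + pct_change / 100)


def breadth_shares(declined: int, rose: int, flat: int) -> dict[str, float]:
    """Return each group's share of all counted stocks, in percent."""
    total = declined + rose + flat
    return {
        "declined": declined / total * 100,
        "rose": rose / total * 100,
        "flat": flat / total * 100,
    }


if __name__ == "__main__":
    # Shanghai Composite: closed at 3593.66, down 0.33%.
    print(f"{implied_previous_close(3593.66, -0.33):.2f}")
    # Breadth: 2724 declined, 2532 rose, 158 flat.
    for group, share in breadth_shares(2724, 2532, 158).items():
        print(f"{group}: {share:.1f}%")
```

On these numbers, slightly more than half of the counted stocks declined, which fits the description of most major indices closing lower.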