Multimodal Large Models
The large-model fundamentals for embodied AI are all here......
具身智能之心· 2025-09-20 16:03
Core Viewpoint
- The article emphasizes the importance of a comprehensive community for learning and sharing knowledge about large models, particularly in the fields of embodied AI and autonomous driving, highlighting the establishment of the "Large Model Heart Tech Knowledge Planet" as a platform for collaboration and technical exchange [1][3]

Group 1: Community and Learning Resources
- The "Large Model Heart Tech" community aims to provide a platform for technical exchange related to large models, inviting experts from renowned universities and leading companies in the field [3][67]
- The community offers a detailed learning roadmap for various aspects of large models, including RAG, AI Agents, and multimodal models, making it suitable for beginners and advanced learners alike [4][43]
- Members can access a wealth of resources, including academic progress, industrial applications, job recommendations, and networking opportunities with industry leaders [7][70]

Group 2: Technical Roadmaps
- The community has outlined specific learning paths for RAG, AI Agents, and multimodal large models, detailing subfields and applications to facilitate systematic learning [9][43]
- For RAG, the community provides resources on subfields such as Graph RAG, Knowledge-Oriented RAG, and applications in AIGC [10][23]
- The AI Agent section includes comprehensive introductions, evaluations, and advancements in areas such as multi-agent systems and self-evolving agents [25][39]

Group 3: Future Plans and Engagement
- The community plans to host live sessions with industry experts, allowing members to engage with leading figures in academia and industry [66]
- There is a focus on job sharing and recruitment information to empower members in their career pursuits within the large model domain [70]
But I still want to say: individuals and small teams are advised not to touch large-model training!
自动驾驶之心· 2025-09-20 16:03
Core Viewpoint
- The article emphasizes the importance of utilizing open-source large language models (LLMs) and retrieval-augmented generation (RAG) for businesses, particularly for small teams, rather than fine-tuning models without sufficient original data [2][6]

Group 1: Model Utilization Strategies
- For small teams, deploying open-source LLMs combined with RAG can cover 99% of needs without the necessity of fine-tuning [2]
- In cases where open-source models perform poorly in niche areas, businesses should first explore RAG and in-context learning before considering fine-tuning specialized models [3]
- The article suggests assigning more complex tasks to higher-tier models (e.g., o1 series for critical tasks and 4o series for moderately complex tasks) [3]

Group 2: Domestic and Cost-Effective Models
- The article highlights the potential of domestic large models such as DeepSeek, Doubao, and Qwen as alternatives to paid models [4]
- It also encourages the consideration of open-source models or cost-effective closed-source models for general tasks [5]

Group 3: AI Agent and RAG Technologies
- The article introduces the concept of Agentic AI, stating that if existing solutions do not work, training a model may not be effective either [6]
- It notes the rising demand for talent skilled in RAG and AI Agent technologies, which are becoming core competencies for AI practitioners [8]

Group 4: Community and Learning Resources
- The article promotes a community platform called "大模型之心Tech," which aims to provide a comprehensive space for learning and sharing knowledge about large models [10]
- It outlines various learning pathways for RAG, AI Agents, and multimodal large model training, catering to different levels of expertise [10][14]
- The community also offers job recommendations and industry opportunities, facilitating connections between job seekers and companies [13][11]
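The "open-source LLM + RAG instead of fine-tuning" pattern the article recommends can be sketched in a few lines. This is a minimal illustration with a toy bag-of-words retriever; all names (`embed`, `retrieve`, `build_prompt`) and the similarity scheme are illustrative assumptions, and a real deployment would swap in an embedding model, a vector store, and a locally hosted open-source LLM that receives `prompt`.

```python
# Minimal RAG sketch: retrieve relevant context, then build a grounded prompt.
# The "embedding" here is a toy bag-of-words Counter; cosine similarity ranks docs.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble the retrieved context into a prompt for an LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\nQuestion: {query}"

docs = [
    "The warranty covers repairs for two years after purchase.",
    "Our office is closed on public holidays.",
    "Returns are accepted within 30 days with a receipt.",
]
prompt = build_prompt("How long does the warranty last?", docs)
print(prompt)
```

The point of the pattern is that domain knowledge lives in the retrieved documents rather than in fine-tuned weights, so updating the knowledge base requires no training run at all.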
Zidong Taichu 4.0 Large Model Released; Wuhan Accelerates AI Industry Cluster Development
Zheng Quan Shi Bao Wang· 2025-09-19 12:39
Core Insights
- The 2025 East Lake International Artificial Intelligence Summit Forum was held in Wuhan, where the Zidong Taichu (ZDTC) 4.0 multimodal reasoning model was officially launched, developed by the Chinese Academy of Sciences and the Wuhan Artificial Intelligence Research Institute [1]
- The ZDTC 4.0 model demonstrates significant breakthroughs in high-level semantic understanding and reasoning capabilities, evolving from "pure text thinking" to "fine-grained multimodal semantic thinking" [1]
- The model can achieve deep understanding of 180-minute long videos and provide precise answers in seconds, topping six datasets in long-video reasoning and retrieval capabilities [1]

Industry Developments
- The ZDTC Cloud platform was launched as the first native collaborative cloud for multimodal large models in China, offering comprehensive capabilities from computing power support to application implementation [2]
- Over the past three years, Wuhan's AI industry has grown by more than 30% annually, with the industry scale expected to exceed 70 billion yuan in 2024 [2]
- Wuhan has gathered over 1,000 AI-related companies and more than 60 large models with over 1 billion parameters in use, forming a complete AI industry chain [2]

Policy and Innovation
- Wuhan has implemented a series of policies to promote AI industry development, focusing on smart chips, smart terminals, smart connected vehicles, and smart equipment [3]
- The city is advancing product innovation and industrialization in areas such as smart wearables, smart cockpits, and humanoid robots [3]
Shengshu Technology Completes Series A Financing of Several Hundred Million Yuan; New Model Version to Be Released Next Week
Feng Huang Wang· 2025-09-19 06:42
Group 1
- The core point of the article is that Shengshu Technology, a multimodal startup, has completed a Series A financing round amounting to several hundred million RMB, led by Bohua Capital with participation from existing investors and new industry partners [1]
- The new funding will be used for model research and technological innovation, aiming to explore the intelligence limits and application breadth of multimodal large models [1]
- Shengshu Technology's CEO, Luo Yihang, emphasized the focus on product expansion, user service, industry collaboration, and global business layout as part of the financing strategy [1]

Group 2
- Shengshu Technology has previously completed three rounds of financing, including angel, angel+, and Pre-A rounds, with notable investors such as Qiming Venture Partners, Ant Group, and Baidu's strategic investment arm [1]
- The company recently released the Vidu Q1 reference-to-image model, which competes directly with Google's Nano Banana, supporting up to seven reference images simultaneously and achieving a breakthrough in multi-subject consistency and high fidelity [1]
- An upcoming version of Vidu is set to be released next week, focusing on capabilities in the image-to-video domain [1]
Shengshu Technology Completes Series A Financing of Several Hundred Million Yuan: Just Released Vidu Q1 Reference-to-Image, Benchmarked Head-On Against Nano Banana
IPO早知道· 2025-09-19 02:37
Core Insights
- The article discusses the recent Series A financing of Shengshu Technology, which raised several hundred million RMB to enhance model research and technological innovation in multimodal large models [2][3]
- Shengshu Technology's core product, Vidu, is designed for AI image, video, and audio generation, targeting industries such as internet, advertising, e-commerce, and education [2][3]

Financing and Investment
- The Series A financing was led by the Liangxi Digital Industry Fund managed by Bohua Capital, with participation from Baidu's strategic investment arm, the Beijing AI Industry Investment Fund, and other existing shareholders [2]
- The Liangxi Digital Industry Fund focuses its investments on the artificial intelligence sector, aligning with Shengshu Technology's ongoing development in the multimodal field [3]

Product Development and Market Impact
- Vidu, launched globally in July 2024, has achieved annual recurring revenue (ARR) of over $20 million within eight months, covering over 200 countries and regions [3]
- The product has rapidly gained traction, reaching over 30 million users and 6,000 developers and enterprises globally [3]

Competitive Landscape
- Shengshu Technology's Vidu product is positioned against competitors such as Google's Nano Banana, showcasing its capabilities in AI video generation and image creation [3]
Jinqiu Fund Portfolio Company Shengshu Technology Completes New Series A Round of Several Hundred Million Yuan | Jinqiu Spotlight
锦秋集· 2025-09-19 02:17
Core Insights
- Jinqiu Capital invested in Shengshu Technology as an early institutional investor in mid-2023 [1]
- Shengshu Technology completed a new round of financing amounting to several hundred million RMB, led by Bohua Capital, with participation from various investors including Baidu and Qiming Venture Partners [2][5]
- The company focuses on independent research and development of multimodal large models and applications, with its core product Vidu capable of AI image, video, and audio generation [5][6]

Company Overview
- Shengshu Technology was established in March 2023, with a core team drawn from top global universities and industry professionals, showcasing strong industry experience and global technology deployment capabilities [5]
- Vidu has rapidly gained traction, covering over 30 million users and 6,000 developers and enterprises across more than 200 countries and regions, generating over 400 million videos [5][6]

Market Potential
- Shengshu Technology's CEO, Dr. Luo Yihang, indicated that the commercialization of multimodal generation technology in the digital content industry is accelerating, with significant market space and global growth potential expected over the next three years [6]
- The new round of financing will be used for model research and technological innovation, as well as to enhance product expansion, user service, industry collaboration, and global business layout [6]

Product Development
- Vidu launched globally in July 2024, introducing reference-based generation for images and videos and achieving key breakthroughs in consistency for commercial content creation [5][6]
- The number of reference-based generated videos and images has exceeded 100 million, with over 50% of the generated content being commercial material [5]
星动纪元 Is Hiring! Multiple Openings in Embodied Multimodal, Reinforcement Learning, and More
具身智能之心· 2025-09-17 00:02
Core Viewpoint
- The article outlines job descriptions and requirements for positions related to multimodal reinforcement learning, data processing, and embodied intelligence, emphasizing the need for advanced skills in AI and machine learning technologies [6][14][15]

Group 1: Job Descriptions
- Responsibilities include the research, design, and implementation of cutting-edge multimodal reinforcement learning algorithms to address complex real-world problems [6]
- Roles involve the collection, processing, cleaning, and analysis of multimodal data to create high-quality training datasets [14]
- Duties include the development and optimization of multimodal models, covering training, fine-tuning, and enhancing performance across different tasks [6][15]

Group 2: Job Requirements
- Candidates should hold a master's degree or higher in computer science, artificial intelligence, or robotics, with at least one year of research experience in computer vision or embodied intelligence [13]
- Proficiency in programming languages such as Python and deep learning frameworks such as PyTorch is essential, along with strong engineering implementation skills [13]
- Experience publishing papers at top academic conferences (e.g., CVPR, NeurIPS) and contributions to open-source projects are preferred [13][19]

Group 3: Additional Qualifications
- Familiarity with multimodal data cleaning, labeling, and loading, as well as an understanding of data optimization techniques, is required [14]
- Candidates should have experience with large language models and multimodal models, including knowledge of their capabilities and applicable scenarios [14]
- High standards for data quality and attention to detail are necessary, along with proficiency in data processing tools such as Pandas and NumPy [14]
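The multimodal data-cleaning work the posting describes (deduplicating pairs, dropping missing or low-quality annotations before training) can be sketched with the Pandas tooling it mentions. The column names (`image_path`, `caption`), the length threshold, and the `clean_pairs` helper are illustrative assumptions, not details from the posting itself.

```python
# Sketch of a multimodal data-cleaning pass with Pandas:
# deduplicate image-caption pairs, drop missing and too-short captions.
import pandas as pd

def clean_pairs(df: pd.DataFrame, min_caption_len: int = 10) -> pd.DataFrame:
    """Keep one row per image, then drop missing or very short captions."""
    df = df.drop_duplicates(subset="image_path")      # one row per image file
    df = df.dropna(subset=["caption"])                # remove unannotated rows
    df = df[df["caption"].str.len() >= min_caption_len]  # filter trivial captions
    return df.reset_index(drop=True)

raw = pd.DataFrame({
    "image_path": ["a.jpg", "a.jpg", "b.jpg", "c.jpg"],
    "caption": [
        "A robot arm grasping a red cup",
        None,
        "ok",
        "A drone hovering over a field",
    ],
})
clean = clean_pairs(raw)
print(clean)  # only the two well-annotated, unique pairs survive
```

In a real pipeline the same pass would typically be followed by perceptual-hash deduplication of the images themselves and model-based caption quality scoring, but the filter-and-dedupe skeleton stays the same.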
Large-Model Startups Going Global, Safeguarded by Cloud Computing | Innovation Scenarios
Tai Mei Ti APP· 2025-09-16 09:42
Core Insights
- The launch of Sora has positioned the AI video generation sector as a focal point in the global AI landscape, attracting significant attention from capital and media [3]
- Aishi Technology has rapidly developed its video model, PixVerse, which has become one of the largest and fastest video generation models globally, surpassing 60 million users in just two years [3][4]
- The company faces challenges in technology iteration and global expansion, particularly in managing dispersed data and complying with local regulations [3][4][5]

Group 1: Technology and Product Development
- Aishi Technology has released six iterations of its video model, PixVerse, focusing on enhancing user experience and generation speed [3][7]
- The company aims to lower the psychological barrier for users to create videos by leveraging AI technology [4]
- The multimodal video model requires advanced GPU capabilities and efficient real-time data processing to meet user demands [4][6][7]

Group 2: Global Expansion and Data Management
- Aishi Technology's global operations necessitate the aggregation and management of vast amounts of data across different regions, posing challenges in data migration and cost [5][6]
- The partnership with Alibaba Cloud is aimed at addressing these challenges by utilizing its extensive global cloud service network [9][10]
- The collaboration includes optimizing cross-region data transfer and enhancing data processing capabilities through advanced cloud solutions [9][10]

Group 3: Cost Efficiency and Resource Utilization
- Aishi Technology seeks to optimize cloud computing costs while maintaining high performance and resource utilization [7][12]
- The company has transitioned to Alibaba Cloud's Hologres for real-time data analysis, which supports large-scale data processing [9][10]
- The deployment of CADT (Cloud Speed Deployment) has significantly reduced the time and complexity involved in managing cloud applications [14]

Group 4: Future Collaboration and Growth
- Aishi Technology plans to deepen its collaboration with Alibaba Cloud to enhance service stability and efficiency for its global AI video generation users [15]
- The partnership will expand across cloud computing, data storage, and large model applications, driving the continuous development of AI video generation technology [15]
Topping the Apple App Store Chart! How Did Google's Viral "Nano Banana" Beat ChatGPT?
Zheng Quan Shi Bao· 2025-09-16 07:54
Core Insights
- Google's market capitalization has reached $3 trillion, and its AI application Gemini has surpassed ChatGPT to become the top free app in the Apple App Store [1]
- Gemini has also topped the charts in countries such as Canada, India, and Morocco, breaking the dominance ChatGPT had held since its launch [1]

Group 1: Product Performance
- Gemini's download numbers have exceeded those of ChatGPT, marking a significant shift in the competitive landscape of AI applications [1]
- The success of Gemini is attributed to the launch of the image editing product Nano Banana, which has seen over 200 million image edits and attracted over 10 million new users since its release [2][3]

Group 2: Technological Advancements
- Nano Banana features several technological improvements over previous multimodal models, including natural-language-driven image editing, character consistency, multi-image fusion, and reduced barriers to 3D modeling [3][8]
- The model allows users to perform precise edits using simple natural language commands, enhancing user experience and accessibility [3]

Group 3: Market Impact
- The positive market response to Nano Banana and favorable antitrust rulings have contributed to a rise in Google's stock price, with analysts raising Alphabet's target price from $225 to $280 [7]
- The success of Nano Banana has sparked competition in the image generation space, with companies such as ByteDance and Shengshu Technology launching similar models [8][9]

Group 4: Investment Opportunities
- The shift toward multimodal models is expected to create investment opportunities in both computational power and application sectors, as the compute demand for video reasoning is significantly higher than for text [9]
- The commercial viability of multimodal products is anticipated to outpace that of text-based products, indicating a pivotal moment in the development of AI applications [9]
Topping the Apple App Store Chart! How Did Google's Viral "Nano Banana" Beat ChatGPT?
证券时报· 2025-09-16 07:51
Core Viewpoint
- Google's market capitalization has reached $3 trillion, and its AI application Gemini has surpassed ChatGPT to become the top app on the Apple App Store [1][2]

Group 1: Gemini's Performance
- Gemini has achieved over 2 million downloads in the US App Store, surpassing ChatGPT, and has also topped the charts in Canada, India, and Morocco [2]
- The success of Gemini is attributed to the launch of the image editing product Nano Banana, which has significantly improved image quality and editing control [4]

Group 2: Nano Banana Features
- Nano Banana allows users to edit images using simple natural language commands, eliminating the need for traditional editing tools [4]
- The model maintains character consistency across different scenes and actions, which is crucial for brand character creation and script generation [4]
- It supports the fusion of multiple images and incorporates world knowledge to understand complex scenes for editing tasks [5]
- Nano Banana reduces the barriers to 3D modeling by generating 2D designs that include essential structural and material information [5]

Group 3: Market Impact and Competitors
- The popularity of Nano Banana has sparked competition in the image generation space, with companies such as ByteDance and Shengshu Technology launching similar models [10]
- Analysts believe that the native multimodal model architecture is gaining industry recognition, with OpenAI's and Google's models showing advantages in performance and deployment [10]
- The demand for computational power is expected to increase due to the higher requirements of native multimodal models compared to non-native ones [11]