星动纪元 is hiring! Openings in embodied multi-modality, reinforcement learning, and more
具身智能之心· 2025-09-17 00:02
Core Viewpoint
- The article outlines job descriptions and requirements for positions in multi-modal reinforcement learning, data processing, and embodied intelligence, emphasizing the need for advanced skills in AI and machine learning [6][14][15].

Group 1: Job Descriptions
- Research, design, and implementation of cutting-edge multi-modal reinforcement learning algorithms to address complex real-world problems [6].
- Collection, processing, cleaning, and analysis of multi-modal data to create high-quality training datasets [14].
- Development and optimization of multi-modal models, including training, fine-tuning, and enhancing performance across tasks [6][15].

Group 2: Job Requirements
- A master's degree or higher in computer science, artificial intelligence, or robotics, with at least one year of research experience in computer vision or embodied intelligence [13].
- Proficiency in Python and deep learning frameworks such as PyTorch, along with strong engineering implementation skills [13].
- Publications at top academic conferences (e.g., CVPR, NeurIPS) and contributions to open-source projects are preferred [13][19].

Group 3: Additional Qualifications
- Familiarity with multi-modal data cleaning, labeling, and loading, and an understanding of data optimization techniques [14].
- Experience with large language models and multi-modal models, including knowledge of their capabilities and applicable scenarios [14].
- High standards for data quality, attention to detail, and proficiency in data processing tools such as Pandas and NumPy [14].
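The data-processing skills the posting names (Pandas and NumPy for cleaning multi-modal training data) can be illustrated with a minimal sketch. All column names and sample rows below are hypothetical, not from the posting:

```python
import numpy as np
import pandas as pd

# Hypothetical annotation table: each row pairs an image path with a caption.
df = pd.DataFrame({
    "image_path": ["a.jpg", "b.jpg", None, "d.jpg"],
    "caption": ["a red cube", "", "a blue ball", "a green cone"],
})

# Drop rows missing either modality, a typical first cleaning pass.
clean = df.dropna(subset=["image_path"])
clean = clean[clean["caption"].str.len() > 0]

# Simple quality signal: caption length in tokens.
clean = clean.assign(n_tokens=clean["caption"].str.split().map(len))
print(clean.reset_index(drop=True))
```

A real pipeline would layer modality-specific checks (image decodability, caption language filters, deduplication) on top of this tabular pass.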
A large-model startup goes overseas, with cloud computing as its safeguard | Innovation Scenarios
Tai Mei Ti APP· 2025-09-16 09:42
Core Insights
- The launch of Sora has made AI video generation a focal point of the global AI landscape, attracting significant attention from capital and media [3].
- Aishi Technology has rapidly developed its video model, PixVerse, which has become one of the largest and fastest-growing video generation models globally, surpassing 60 million users in just two years [3][4].
- The company faces challenges in technology iteration and global expansion, particularly in managing dispersed data and complying with local regulations [3][4][5].

Group 1: Technology and Product Development
- Aishi Technology has released six iterations of PixVerse, focusing on user experience and generation speed [3][7].
- The company aims to lower the psychological barrier for users to create videos by leveraging AI technology [4].
- The multi-modal video model requires advanced GPU capability and efficient real-time data processing to meet user demand [4][6][7].

Group 2: Global Expansion and Data Management
- Global operations require aggregating and managing vast amounts of data across regions, posing challenges in data migration and cost [5][6].
- The partnership with Alibaba Cloud addresses these challenges through its extensive global cloud service network [9][10].
- The collaboration includes optimizing cross-region data transfer and enhancing data processing capabilities with advanced cloud solutions [9][10].

Group 3: Cost Efficiency and Resource Utilization
- Aishi Technology seeks to optimize cloud computing costs while maintaining high performance and resource utilization [7][12].
- The company has adopted Alibaba Cloud's Hologres for real-time data analysis, which supports large-scale data processing [9][10].
- Deploying CADT (Cloud Speed Deployment) has significantly reduced the time and complexity of managing cloud applications [14].

Group 4: Future Collaboration and Growth
- Aishi Technology plans to deepen its collaboration with Alibaba Cloud to enhance service stability and efficiency for its global AI video generation users [15].
- The partnership will expand across cloud computing, data storage, and large-model applications to drive the continued development of AI video generation technology [15].
Topping Apple's App Store chart! How did Google's viral "Nano Banana" beat ChatGPT?
Zheng Quan Shi Bao· 2025-09-16 07:54
Core Insights
- Google's market capitalization has reached $3 trillion, and its AI application Gemini has surpassed ChatGPT to become the top free app in the Apple App Store [1].
- Gemini has also topped the charts in countries such as Canada, India, and Morocco, breaking the dominance ChatGPT has held since its launch [1].

Group 1: Product Performance
- Gemini's downloads have exceeded ChatGPT's, marking a significant shift in the competitive landscape of AI applications [1].
- Gemini's success is attributed to the launch of the image editing product Nano Banana, which has seen over 200 million image edits and attracted over 10 million new users since release [2][3].

Group 2: Technological Advancements
- Nano Banana improves on previous multimodal models with natural language-driven image editing, character consistency, multi-image fusion, and a lower barrier to 3D modeling [3][8].
- The model lets users perform precise edits with simple natural language commands, enhancing user experience and accessibility [3].

Group 3: Market Impact
- The positive market response to Nano Banana and favorable antitrust rulings have lifted Google's stock price, with analysts raising Alphabet's target price from $225 to $280 [7].
- Nano Banana's success has spurred competition in image generation, with companies such as ByteDance and Shengshu Technology launching similar models [8][9].

Group 4: Investment Opportunities
- The shift toward multimodal models is expected to create investment opportunities in both computational power and applications, as the demand for video reasoning is significantly higher than for text [9].
- The commercial viability of multimodal products is expected to outpace that of text-based products, marking a pivotal moment in the development of AI applications [9].
Topping Apple's App Store chart! How did Google's viral "Nano Banana" beat ChatGPT?
证券时报· 2025-09-16 07:51
Core Viewpoint
- Google's market capitalization has reached $3 trillion, and its AI application Gemini has surpassed ChatGPT to become the top app on the Apple App Store [1][2].

Group 1: Gemini's Performance
- Gemini has achieved over 2 million downloads in the US App Store, surpassing ChatGPT, and has also topped the charts in Canada, India, and Morocco [2].
- Gemini's success is attributed to the launch of the image editing product Nano Banana, which has significantly improved image quality and editing control [4].

Group 2: Nano Banana Features
- Nano Banana lets users edit images with simple natural language commands, eliminating the need for traditional editing tools [4].
- The model maintains character consistency across different scenes and actions, which is crucial for brand character creation and script generation [4].
- It supports fusing multiple images and draws on world knowledge to understand complex scenes for editing tasks [5].
- Nano Banana lowers the barrier to 3D modeling by generating 2D designs that carry essential structural and material information [5].

Group 3: Market Impact and Competitors
- Nano Banana's popularity has sparked competition in image generation, with companies such as ByteDance and Shengshu Technology launching similar models [10].
- Analysts believe the native multimodal architecture is gaining industry recognition, with OpenAI's and Google's models showing advantages in performance and deployment [10].
- Demand for computational power is expected to rise because native multimodal models have higher requirements than non-native ones [11].
Minglue Technology CEO Wu Minghui to attend the 2025 Tencent Global Digital Ecosystem Conference
Xin Lang Cai Jing· 2025-09-16 03:14
Core Insights
- Global large-model technology is evolving at an accelerating pace, with industry applications deepening progressively [1].
- Vertical large models are becoming the key to enterprise AI adoption, addressing the limitations of general-purpose large models in proprietary data and industry know-how [1].
- Minglue Technology's CEO, Wu Minghui, will present at the Tencent Global Digital Ecosystem Conference, discussing practical applications of multimodal large models in marketing scenarios [1].

Industry Trends
- The shift toward vertical large models reflects growing recognition of their role in overcoming the challenges general large models face [1].
- The focus on industry-specific applications points to a trend of AI solutions tailored to specialized knowledge and data [1].

Company Developments
- Minglue Technology is showcasing its latest technological breakthroughs and practical achievements in AI [1].
- The upcoming presentation at a major conference underlines the company's commitment to advancing AI applications in marketing [1].
Paper walkthrough: HKUST's PLUTO, the first planner to surpass rule-based baselines!
自动驾驶之心· 2025-09-15 23:33
Core Viewpoint
- The article discusses the development and features of the PLUTO model in the end-to-end autonomous driving domain, emphasizing its two-stage architecture and its direct encoding of structured perception outputs for downstream control tasks [1][2].

Summary by Sections

Overview of PLUTO
- PLUTO is trained with three main losses: a regression loss, a classification loss, and an imitation learning loss, which together drive the model's performance [7].
- Additional auxiliary losses are incorporated to aid convergence [9].

Course Introduction
- The article introduces a new course, "End-to-End and VLA Autonomous Driving," developed with top algorithm experts from leading domestic manufacturers, aimed at the challenges learners face in this fast-moving field [12][15].

Learning Challenges
- The course targets the difficulty learners face from rapid technological change and knowledge fragmented across many domains, which makes the fundamentals hard for beginners to grasp [13].

Course Features
- The course is designed for quick entry into the field, builds a framework for research skills, and combines theory with practical application [15][16][17].

Course Outline
- The chapters cover the history and evolution of end-to-end algorithms, background on the relevant technologies, and detailed treatments of one-stage and two-stage end-to-end methods [20][21][22][29].

Practical Application
- Practical assignments, such as RLHF fine-tuning, let students apply theory in real-world scenarios [31].

Instructor Background
- The instructor, Jason, has a strong academic and practical background in cutting-edge end-to-end and large-model algorithms, lending the course credibility [32].
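The three losses named in the PLUTO overview (regression, classification, imitation) plus auxiliary terms can be sketched as a weighted sum. The loss forms, weights, and tensor shapes below are illustrative stand-ins, not PLUTO's actual formulation:

```python
import numpy as np

def pluto_total_loss(pred_traj, gt_traj, score_probs, gt_label,
                     il_pred, il_target, aux_losses=(), w=(1.0, 1.0, 1.0, 0.1)):
    """Schematic weighted sum of the losses the article names.

    Loss forms, weights, and shapes are illustrative, not PLUTO's exact ones.
    """
    reg = np.mean((pred_traj - gt_traj) ** 2)           # trajectory regression
    picked = score_probs[np.arange(len(gt_label)), gt_label]
    cls = -np.mean(np.log(picked + 1e-9))               # candidate classification (NLL)
    il = np.mean((il_pred - il_target) ** 2)            # imitation learning term
    aux = sum(aux_losses)                               # auxiliary convergence aids
    return w[0] * reg + w[1] * cls + w[2] * il + w[3] * aux

# Toy call: perfect regression/imitation, uniform scores over 4 candidates.
total = pluto_total_loss(np.zeros((2, 10, 2)), np.zeros((2, 10, 2)),
                         np.full((2, 4), 0.25), np.array([0, 1]),
                         np.zeros((2, 8)), np.zeros((2, 8)))
print(total)
```

In a real planner these terms operate on batched network outputs, and the weights are tuned; the sketch only shows how the named losses combine into one training objective.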
Target Audience and Expected Outcomes
- The course is aimed at learners with a foundational understanding of autonomous driving and related technologies, with the goal of bringing them to the level of an end-to-end autonomous driving algorithm engineer within a year [36].
Everything about large models and autonomous driving
自动驾驶之心· 2025-09-15 23:33
Group 1
- The article highlights growing interest in large-model technologies, particularly RAG (Retrieval-Augmented Generation), AI agents, multimodal large models (pre-training, fine-tuning, reinforcement learning), and deployment and inference optimization [1].
- A community named "Large Model Heart Tech" is being established around these technologies, aiming to become the largest domestic community for large-model technology [1].
- The community is also building a knowledge platform to provide industry and academic information and to cultivate talent in the large-model field [1].

Group 2
- The article describes the community as a serious, content-driven platform aimed at nurturing future leaders [2].
New open-source model reproduces o3-style visual reasoning, achieving deep thinking without extensive training
量子位· 2025-09-15 03:59
Core Viewpoint
- The article discusses Mini-o3, an advanced visual language model (VLM) that enables multi-round visual reasoning, substantially improving on previous models by supporting deep reasoning across dozens of steps [1][2][15].

Group 1: Model Development
- Mini-o3 is a collaboration between ByteDance and the University of Hong Kong, designed to perform long-horizon visual search without extensive training resources [13].
- The model extends its reasoning from a training limit of 6 rounds to dozens of rounds at test time, showcasing advanced multi-modal reasoning [2][15].

Group 2: Key Design Features
- Mini-o3 rests on three design elements: the VisualProbe dataset for exploratory reasoning, an iterative data collection process for diverse reasoning strategies, and a super-round masking strategy that balances training efficiency with test-time scalability [17][19][34].
- The VisualProbe dataset consists of thousands of visual search challenges built specifically for deep reasoning tasks, enhancing the model's training [17][38].

Group 3: Training Phases
- Training proceeds in two phases: a cold-start supervised fine-tuning (SFT) phase that activates multi-round tool use, and a reinforcement learning (RL) phase that optimizes interaction rounds [19][25].
- The cold-start SFT phase uses a small number of manually constructed samples to generate diverse reasoning trajectories, yielding roughly 6,000 cold-start reasoning paths [24][46].

Group 4: Performance Evaluation
- Mini-o3 outperforms existing models on visual search tasks, achieving the best results across benchmarks including VisualProbe, V*Bench, and HR-Bench [43][44].
- Its performance is attributed to maintaining complex, deep reasoning trajectories, with significant gains on challenging tasks [44][48].
Group 5: Experimental Insights
- Ablations show that removing RL data drops performance by about 8.6 points on VisualProbe-Hard, underscoring the importance of challenging RL samples for encouraging complex reasoning [45].
- The super-round masking technique effectively improves RL performance, especially in multi-round interaction scenarios, by stabilizing training and enabling extended reasoning at test time [48].

Group 6: Conclusion and Future Directions
- Mini-o3's technical framework offers practical guidance for developing multi-round interactive multi-modal models and their reinforcement learning applications [52].
- The research team has open-sourced all related code, encouraging further exploration and development in this field [53].
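One plausible reading of the super-round masking strategy described above is that trajectories exceeding the training round budget are excluded from the policy loss rather than penalized, so long reasoning is not discouraged and can stretch further at test time. The sketch below is an interpretation under that assumption, not Mini-o3's actual implementation:

```python
import numpy as np

def masked_policy_loss(logps, advantages, n_rounds, max_train_rounds=6):
    """Sketch of a 'super-round' mask for a policy-gradient loss.

    Trajectories whose round count exceeds the training budget are
    masked out of the objective instead of receiving a negative reward.
    This is an interpretation of the article, not Mini-o3's exact code.
    """
    logps = np.asarray(logps, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    mask = (np.asarray(n_rounds) <= max_train_rounds).astype(float)
    # Average only over trajectories within the round budget.
    denom = max(mask.sum(), 1.0)
    return -(mask * advantages * logps).sum() / denom

# Trajectory 3 used 9 rounds, over the 6-round budget: it contributes nothing.
loss = masked_policy_loss(logps=[-1.0, -2.0, -0.5],
                          advantages=[1.0, -1.0, 2.0],
                          n_rounds=[4, 6, 9])
print(loss)
```

Because over-budget trajectories contribute zero gradient rather than a penalty, the policy is never taught that "more rounds is bad," which matches the article's claim that reasoning can extend to dozens of rounds at test time.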
Recruiting several experts to co-build a platform (world models / model deployment)
自动驾驶之心· 2025-09-14 03:44
Group 1
- The article announces the recruitment of 10 partners for the autonomous driving sector, focusing on course development, paper guidance, and hardware research [2][5].
- The main areas of expertise sought include large models, multimodal models, diffusion models, SLAM, 3D object detection, and closed-loop simulation [3].
- Candidates from QS top-200 universities with a master's degree or higher, especially those with top-conference experience, are preferred [4].

Group 2
- Benefits include resource sharing for job seeking, PhD recommendations, and study abroad opportunities [5].
- Attractive cash incentives and opportunities for entrepreneurial project collaboration are highlighted [5].
- Interested parties are encouraged to make contact via WeChat for collaboration inquiries [6].
Robotics industry tracking: dexterous hands set for upgrades under industry leaders, with sentiment expected to improve
Orient Securities· 2025-09-14 02:12
Investment Rating
- The report maintains a "Positive" rating for the mechanical equipment industry, indicating expected performance exceeding the market benchmark by over 5% [6][20].

Core Insights
- The release of Tesla's next-generation dexterous hand is expected to enhance the flexibility and functionality of the dexterous hand industry, supporting an optimistic outlook for the industry chain [3][9].
- Dexterous hand technology has gone through significant iterations: Tesla's third-generation model achieves 22 degrees of freedom, up substantially from the first generation's 11 [9][10].
- Advances in dexterous hand technology will raise product value and push the industry toward higher degrees of freedom and functionality [14].

Summary by Sections

Industry Overview
- The report tracks the robotics industry, particularly the dexterous hand segment, which is poised for upgrades and increased market activity [1][5].

Technological Advancements
- Tesla's dexterous hand has evolved through multiple iterations, with the latest model featuring 26 actuators per arm, significantly enhancing its operational capabilities [10][9].
- Integrating multiple sensors into dexterous hands is expected to create a multi-modal data collection platform, improving AI training efficiency and model generalization [13][9].

Investment Recommendations
- The report identifies investment targets in the dexterous hand industry chain, including Zhenyu Technology (300953, Buy), Hanwei Technology (300007, Not Rated), Nanshan Zhishang (300918, Not Rated), and Mingzhi Electric (603728, Not Rated) [3].