Core Viewpoint
- Alibaba (China) Limited has applied for a patent on a method, apparatus, and device for training large language models based on chain-of-thought reasoning, aiming to improve the models' interpretability and review accuracy [1]

Group 1: Patent Details
- The method obtains multiple initial sample data items, comprising images, auxiliary text information, and the standard review results for the images [1]
- It generates chain-of-thought data from the initial sample data and assembles a chain-of-thought dataset [1]
- A base large language model is fine-tuned on this dataset to produce an intermediate large language model [1]

Group 2: Methodology
- The intermediate large language model and the initial sample data are used to iteratively generate multiple intermediate chain-of-thought data items [1]
- A preset reward function determines a reward value for each intermediate chain-of-thought data item [1]
- Finally, reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm is applied to obtain the target large language model [1]
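
The reward and GRPO steps described above can be illustrated with a minimal Python sketch. It shows only the group-relative normalization that gives GRPO its name: each reasoning chain sampled for the same input is scored against the mean and standard deviation of its own group. The function name, the `eps` constant, and the reward values are hypothetical; the patent summary does not disclose the actual reward function or any hyperparameters, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled reasoning chains.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is an assumption made here.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward values for four chains sampled for one image
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
```

Chains with above-average reward receive a positive advantage and are reinforced, while below-average chains receive a negative one. Because the baseline comes from the group itself, this formulation does not need a separately trained value model.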
Alibaba Patent Application Related to Large Language Model Training Is Published