Speechocean (688787) - Investor Relations Activity Record (May 31, 2024)
Speechocean (SH:688787), 2024-05-31 09:17

Group 1: Data Service Trends
- The demand for data services in reinforcement learning is increasing, with a trend toward more verticals such as law, finance, and healthcare [3]
- Evaluation metrics for reinforcement learning annotation are becoming more diverse, requiring annotators to assess models along multiple dimensions [3]
- The shift from unimodal to multimodal data annotation is evident, with a focus on text-image and text-video combinations [3][4]

Group 2: Automation in Data Annotation
- Current data annotation tasks for large models primarily focus on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), with a strong reliance on human input [4]
- Some projects have begun to implement algorithmic pre-annotation strategies to enhance the efficiency of manual annotation and verification [4]

Group 3: Multimodal Development and Data Needs
- The transition to multimodal models will create new data demands, such as generating images from text inputs, requiring machines to understand and map keywords to image tags [5]
- The importance of high-quality multimodal training datasets will increase, driving growth in the data service industry [5]

Group 4: Synthetic Data Considerations
- Synthetic data is viewed as a necessary byproduct of AI development, serving as an effective supplement to data collection, though it has limitations in replicating real-world features [5][6]
- Most companies still rely on real-world data for model training, but they will monitor advancements in synthetic data technology and adjust their business strategies accordingly [6]

Group 5: Copyright Data and Value Proposition
- The company's value lies in aggregating diverse copyrighted data, cleaning it, and providing tailored services based on client needs [6]
- Copyrighted data must be cleaned to a high standard before it can be used for model training, ensuring compliance with legal requirements [6]

Group 6: Differences in Data Requirements
- The data requirements for pre-training large models are similar to those of traditional deep learning but differ in scale, quality, and sources [7][8]
- Pre-training data typically involves token counts in the hundreds of billions, compared with around 1 billion tokens for traditional models, necessitating a richer variety of data sources [7][8]
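The "algorithmic pre-annotation" workflow mentioned in Group 2 can be sketched as follows. This is a minimal illustrative example, not Speechocean's actual pipeline: a model proposes a label with a confidence score, and only low-confidence items are routed to human annotators for verification. All names, labels, and the confidence threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Item:
    """One data point passing through the annotation pipeline."""
    text: str
    pre_label: Optional[str] = None
    confidence: float = 0.0
    needs_human_review: bool = False

def pre_annotate(
    items: List[Item],
    model: Callable[[str], Tuple[str, float]],
    threshold: float = 0.9,
) -> List[Item]:
    """Let a model propose labels; flag low-confidence items for human review."""
    for item in items:
        label, conf = model(item.text)
        item.pre_label = label
        item.confidence = conf
        item.needs_human_review = conf < threshold
    return items

# Toy stand-in for a real pre-annotation model (purely illustrative):
# texts mentioning "contract" are tagged as the "legal" vertical.
def toy_model(text: str) -> Tuple[str, float]:
    if "contract" in text:
        return "legal", 0.95
    return "general", 0.6

items = pre_annotate([Item("signed the contract"), Item("weather update")], toy_model)
for it in items:
    print(it.pre_label, it.needs_human_review)
```

Under this scheme, annotators spend their time verifying the uncertain minority of items rather than labeling everything from scratch, which is where the efficiency gain described in the record comes from.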