No More "Firefighting" Over Alerts: How AIOps Is Reshaping Tencent Music's Intelligent Operations System
Sou Hu Cai Jing·2025-12-10 11:37

Core Insights
- The article discusses how companies can leverage AI to enhance operational efficiency and quality while driving intelligent operations, focusing on Tencent Music's practices in AIOps [1][2]
- The upcoming AICon event aims to explore the integration of AI into business operations, emphasizing the creation of scalable and commercializable AI systems [1][36]

Group 1: AI Integration in Operations
- Tencent Music has multiple applications catering to different user groups, supported by a collaborative development team focused on foundational capabilities like microservices and observability [2]
- The company is exploring innovative AI applications to improve user experience while integrating AI with existing technology frameworks to enhance engineering systems [2][3]
- The exploration of AI is centered around three traditional elements: quality, efficiency, and cost, with a focus on generating tangible value through AI [3]

Group 2: AIOps Implementation
- The AIOps framework is structured around perception, decision-making, and execution, aiming to leverage AI capabilities for measurable outcomes [3]
- The DevOps framework is crucial for continuous integration, delivery, and operations, allowing developers to focus on coding while standardizing other processes [6]
- The SRE system aims to ensure the effectiveness and controllability of changes during deployment, alongside the continuous improvement of the SLA system to maintain business quality [6][7]

Group 3: Alarm Management and AI Optimization
- The company has significantly reduced the number of alarm calls from over 3,000 to around 200 per month by enhancing the effectiveness of monitoring data and implementing the 3-Sigma algorithm [11][15]
- AI is utilized to analyze alarm types and root causes, with a workflow that includes problem analysis, plugin invocation, and knowledge base integration to generate solutions [20][21]
- A comprehensive classification of alarms has been established, with AI automatically tagging them, revealing that business logic errors account for approximately 40% of issues [25][27]

Group 4: Data and Customization
- A complete data banking system has been developed to unify data collection and analysis, enhancing root cause analysis capabilities within the AIOps framework [30]
- The company is focusing on standardizing business systems, particularly return codes, to improve operational efficiency and response to alarms [28]
- Custom alarms for specific business lines are being developed, with an emphasis on ensuring AI understands their meanings and can provide comprehensive solutions [28][30]

Group 5: Future Directions in AIOps
- Future initiatives include enhancing intelligent Q&A systems, automating execution based on AI conclusions, and upgrading algorithms to improve alarm accuracy [32][35]
- The strategic approach for AIOps development is to integrate cloud-native and intelligent analysis to create a more advanced and valuable AI system [35]
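
The 3-Sigma rule mentioned under Group 3 flags a metric sample only when it falls more than three standard deviations from the mean of recent history, which is one common way to suppress alarms caused by normal fluctuation. The article does not describe how Tencent Music applies it, so the following is only a minimal sketch under stated assumptions: a rolling window over a single numeric metric, with hypothetical function names (`three_sigma_bounds`, `is_anomalous`, `filter_alerts`) and illustrative defaults (`window=30`, `k=3.0`).

```python
from statistics import mean, pstdev


def three_sigma_bounds(history, k=3.0):
    """Return the (low, high) band mean +/- k * standard deviation.

    Hypothetical helper for illustration; k=3.0 gives the classic 3-Sigma band.
    """
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of the window
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """True when value lies outside the k-sigma band built from history."""
    if len(history) < 2:
        # Too little history to estimate spread; do not alert.
        return False
    low, high = three_sigma_bounds(history, k)
    return value < low or value > high


def filter_alerts(samples, window=30, k=3.0):
    """Return indices of samples that breach the band of the preceding window.

    The window size is an assumed parameter, not a value from the article.
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        if is_anomalous(samples[i], history, k):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Steady metric around 100 with one spike.
    samples = [100, 102, 98, 101, 99] * 6 + [100, 500, 101]
    print(filter_alerts(samples))
```

In this sketch only the spike is flagged, while ordinary jitter around the baseline stays inside the band. One design caveat: an outlier that enters the window inflates the standard deviation and widens the band for later samples, so a production setup would likely exclude flagged points from the history or use a more robust spread estimate.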