Workflow
Bilibili (B站) Just Crashed Again?! Here's the Real Reason

Incident Overview

- The Bilibili outage was significant, lasting nearly four hours and affecting a wide range of services, which led to widespread discussion on social media platforms [3][19].
- The issues began around 5 PM and included homepage errors, video playback failures, and comment section unavailability, culminating in a complete service disruption [4][7][9][11].

Technical Analysis

- The root cause of the incident was identified as a failure in the Service Discovery system, which is crucial for routing user requests to the appropriate backend servers [19][20].
- Approximately 10% of requests failed due to this issue, indicating that Bilibili had multiple instances of the Discovery service deployed, allowing some requests to still succeed [17][21].
- The 504 Gateway Timeout errors confirmed that requests reached the gateway but the backend services were unresponsive, highlighting how heavily the request path depends on the Discovery system [20][21] (see the gateway sketch below).

Lessons Learned

- The incident underscored the importance of infrastructure, particularly foundational services like Service Discovery, which are invisible to users but have a massive impact when they fail [26].
- It also demonstrated the necessity of high availability design: the contained 5%-10% failure rate indicated that effective disaster recovery mechanisms were in place [26] (see the failover sketch below).
- Monitoring and alerting systems proved valuable in quickly identifying the Discovery failure, showcasing the effectiveness of Bilibili's monitoring infrastructure [28] (see the probe sketch below).
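To make the 504 behavior concrete, here is a minimal sketch, not Bilibili's actual gateway or Discovery code: the Registry interface, the "video-api" service name, and the URL path are illustrative assumptions. It shows a gateway that must resolve a backend through service discovery before proxying; when the lookup fails, the request has already been accepted, so the failure surfaces as a 504 Gateway Timeout rather than a connection error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Registry abstracts a service-discovery lookup; the interface and the
// service name "video-api" below are illustrative assumptions.
type Registry interface {
	Resolve(ctx context.Context, service string) ([]string, error)
}

// brokenRegistry simulates the Discovery failure during the incident.
type brokenRegistry struct{}

func (brokenRegistry) Resolve(ctx context.Context, service string) ([]string, error) {
	return nil, errors.New("discovery unavailable")
}

func gatewayHandler(reg Registry) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// The request has already reached the gateway, so failures from
		// here on show up as 5xx responses rather than connection errors.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		addrs, err := reg.Resolve(ctx, "video-api")
		if err != nil || len(addrs) == 0 {
			// Discovery is down or returned no instances: the backend is
			// effectively unreachable and the gateway reports 504.
			http.Error(w, "upstream timed out", http.StatusGatewayTimeout)
			return
		}

		// Normal path: proxy the request to one of the resolved backends
		// (reverse-proxy wiring omitted for brevity).
		fmt.Fprintf(w, "proxied to %s\n", addrs[0])
	}
}

func main() {
	srv := httptest.NewServer(gatewayHandler(brokenRegistry{}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/homepage")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("gateway returned:", resp.Status) // 504 Gateway Timeout
}
```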
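The 5%-10% failure rate is consistent with the gateway spreading lookups across several Discovery replicas and not retrying when it happens to hit a broken one. The failover sketch below simulates that under assumed conditions; the replica count, addresses, and no-retry behavior are illustrative, not confirmed details of Bilibili's deployment.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

type replica struct {
	addr    string
	healthy bool
}

// resolveOnce picks a random Discovery replica for a single lookup and fails
// if that replica happens to be the broken one (no retry or failover,
// mirroring how a partial outage can leak through to users).
func resolveOnce(replicas []replica) error {
	r := replicas[rand.Intn(len(replicas))]
	if !r.healthy {
		return errors.New("lookup failed via " + r.addr)
	}
	return nil
}

func main() {
	// 10 replicas with 1 of them down: roughly 10% of lookups fail,
	// matching the partial failure rate described above.
	replicas := make([]replica, 10)
	for i := range replicas {
		replicas[i] = replica{
			addr:    fmt.Sprintf("discovery-%d.internal:8080", i),
			healthy: i != 0,
		}
	}

	const total = 100000
	failed := 0
	for i := 0; i < total; i++ {
		if resolveOnce(replicas) != nil {
			failed++
		}
	}
	fmt.Printf("failed lookups: %.1f%%\n", 100*float64(failed)/float64(total))
}
```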
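On the monitoring point, one common way to catch this class of failure early is to probe the discovery lookup path itself instead of waiting for user-facing 504s. The probe sketch below is an assumed pattern, not Bilibili's actual alerting setup; the probe interval, threshold, and alert output are illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// resolve is a stand-in for the real Discovery lookup the gateway performs;
// here it always fails, simulating the incident.
func resolve(ctx context.Context, service string) error {
	return errors.New("discovery unavailable")
}

func main() {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	consecutiveFailures := 0
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		err := resolve(ctx, "video-api")
		cancel()

		if err != nil {
			consecutiveFailures++
		} else {
			consecutiveFailures = 0
		}

		// A few failed probes in a row (about 30 seconds here) is enough to
		// page an on-call engineer, well before user reports pile up.
		if consecutiveFailures >= 3 {
			fmt.Println("ALERT: service discovery lookups failing:", err)
		}
	}
}
```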