Production software keeps breaking and it will only get worse - Anish Agarwal, Traversal.ai
AI Engineer · 2025-07-10 16:29
Problem Statement
- The current software engineering workflow is inefficient, with too much time spent troubleshooting production incidents [2][9]
- Existing approaches to automated troubleshooting, such as AIOps and LLMs, have fundamental limitations [10][11][12][13][14][15][16][17][18]
- Troubleshooting is getting harder because of AI-generated code and increasingly complex systems [3][4]

Solution: Traversal's Approach
- Traversal combines causal machine learning (statistics), reasoning models (semantics), and a novel agentic control flow (swarms of agents) for autonomous troubleshooting [19][20][21][22][23][24]
- Causal machine learning helps identify cause-and-effect relationships in data, addressing the issue of correlated failures [20][21]
- Reasoning models provide semantic understanding of logs, metrics, and code [22]
- Swarms of agents enable an efficient, exhaustive search through telemetry data [23][24]

Results and Impact
- Traversal has achieved a 40% reduction in mean time to resolution (MTTR) for DigitalOcean, a cloud provider serving hundreds of thousands of customers [32][37]
- Traversal orchestrates a swarm of expert AIs to sift through petabytes of observability data in parallel, providing users with the root cause of incidents within five minutes [39][40]
- Traversal integrates with various observability tools, processing trillions of logs [45]

Future Applications
- The principles of exhaustive search and swarms of agents can be applied to other domains such as network observability and cybersecurity [47]
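
To make the MTTR figure under Results and Impact concrete, here is a minimal sketch of how MTTR and a relative reduction are commonly computed: the mean of (resolved time minus detected time) over incidents, then (before - after) / before. The incident timestamps and the helper names `mttr` and `relative_reduction` are hypothetical, chosen only so that the arithmetic yields a 40% reduction. They are not data or code from the talk, and the talk summary does not say how the reported number was measured.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolution: average of (resolved - detected)."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


def relative_reduction(before, after):
    """Fractional drop from `before` to `after` (0.4 means 40%)."""
    return (before - after) / before


def incident(start, minutes):
    """Build a hypothetical (detected, resolved) pair."""
    return (start, start + timedelta(minutes=minutes))


# Hypothetical incidents, not real measurements.
t0 = datetime(2025, 1, 1, 9, 0)
before_incidents = [incident(t0, m) for m in (60, 120, 120)]
after_incidents = [incident(t0, m) for m in (40, 60, 80)]

mttr_before = mttr(before_incidents)  # mean of 60, 120, 120 minutes
mttr_after = mttr(after_incidents)    # mean of 40, 60, 80 minutes
reduction = relative_reduction(mttr_before, mttr_after)

if __name__ == "__main__":
    print(f"MTTR before: {mttr_before}")
    print(f"MTTR after:  {mttr_after}")
    print(f"Reduction:   {reduction:.0%}")
```

Note that a mean is sensitive to a few very long incidents, so in practice a team might also track the median or a high percentile alongside MTTR.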