Infrastructure as Code
Infra that fixes itself, thanks to coding agents — Mahmoud Abdelwahab, Railway
AI Engineer · 2025-11-24 20:16
Infrastructure Monitoring and Issue Detection
- The system proactively monitors application infrastructure, including services, resource metrics (CPU, memory), and HTTP metrics (request error rate, failed requests) [5][8][9]
- It compares metrics against predefined thresholds to identify affected services. Rather than firing on a single alert, it analyzes a slice of time, which reduces noise from spiky workloads [5][10][11]
- For each suspicious service it gathers additional context, including project health, logs, and potentially upstream provider status, to avoid false positives caused by legitimately high usage or external issues [12][13]

Automated Issue Resolution
- Upon detecting an issue, the system formulates a detailed plan, using AI to analyze the application architecture, performance data, and errors [14][38]
- A coding agent then clones the repository, creates a to-do list based on the plan, implements fixes, and generates a pull request [15]
- The coding agent is OpenCode, an open-source AI agent, deployed on a server with the necessary tools and Git configured so that it can open pull requests [22][23][25][26][27]

Durable Workflows and Implementation
- The system uses durable workflows to manage complex logic and ensure reliability, with automatic retries and caching of successful steps [16][18][19][20]
- The workflow fetches the application architecture, resource metrics, and HTTP metrics via API calls [21][31][32][34]
- The system formats the collected information and passes it to the coding agent to generate a fix [33][35][37]

Demonstration and Results
- A demonstration walks through the workflow, from issue detection to the opening of a pull request with proposed changes [6][29][30][40]
- The pull request includes a summary of changes, analysis, root causes, and fixes, so it can be reviewed and merged [40][41]
- The demonstration highlights a scenario where memory usage is high (apparently 31.96 GB out of a maximum of 32 GB), triggering the automated fix [33]
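
The detection step described above, checking metrics against thresholds over a slice of time rather than reacting to a single reading, can be illustrated with a minimal Python sketch. This is not the code from the talk: the `Sample` type, the function names, the 90% usage ratio, and the 80% "sustained" fraction are all assumptions chosen for illustration, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One metric reading (hypothetical shape, not a real API response)."""
    timestamp: float  # seconds
    value: float      # e.g. memory usage in GB


def sustained_breach(samples, threshold, window_seconds, now, min_fraction=0.8):
    """Return True if at least `min_fraction` of the readings inside the
    time window are above `threshold`. A single spike does not qualify."""
    recent = [s for s in samples if now - window_seconds <= s.timestamp <= now]
    if not recent:
        return False
    over = sum(1 for s in recent if s.value > threshold)
    return over / len(recent) >= min_fraction


def affected_services(memory_by_service, memory_limit_gb, window_seconds, now,
                      usage_ratio=0.9):
    """List services whose memory stayed above usage_ratio * limit
    for most of the window (usage_ratio is an assumed threshold)."""
    threshold = memory_limit_gb * usage_ratio
    return sorted(
        name
        for name, samples in memory_by_service.items()
        if sustained_breach(samples, threshold, window_seconds, now)
    )


# Illustrative data: "api" sits near the 32 GB limit for the whole window,
# while "worker" spikes once and otherwise stays low.
now = 600.0
memory_by_service = {
    "api": [Sample(float(t), 31.9) for t in range(0, 601, 60)],
    "worker": [Sample(float(t), 31.0 if t == 300 else 8.0)
               for t in range(0, 601, 60)],
}

flagged = affected_services(memory_by_service, memory_limit_gb=32,
                            window_seconds=600, now=now)
print(flagged)  # ['api']
```

Only the service with sustained high usage is flagged; the one-off spike on the second service is ignored, which is the noise reduction the summary attributes to analyzing a time slice. In the described system, a flagged service would then have its logs and project health collected before any plan is handed to the coding agent.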