Agent Performance
- Current agents have a low overall bug-find rate and generate a significant number of false positives [1][2]
- Some agents have a true positive rate of less than 10% for finding bugs [2]
- Three of the six agents benchmarked had a true positive rate of 10% or less across 900+ reports [3]
- One agent produced 70 issues for a single task, all of which were false positives [4]
- Cursor had a 97% false positive rate across 100+ repos and 1,200+ issues [4]
- Thinking models are significantly better at finding bugs in a codebase [8][18]
- Agents do not look at files holistically, which leads to high variability across runs [20]

Implications for Developers
- Alert fatigue erodes trust in agents, which can let real bugs slip into production [5]
- Developers are unlikely to sift through numerous false positives to identify actual bugs [4]

Recommendations for Improving Agent Performance
- Use bug-focused rules with scoped instructions that detail security issues and logical bugs [6]
- Name explicit classes of bugs in rules, such as "auth bypasses" or "SQL injection" [11]
- Require fix validation: agents must write tests and pass them before changes are incorporated [12]
- Manage context thoroughly by feeding in diffs of code changes and preventing key files from being summarized [15]
- Ask agents to create a step-by-step component inventory of the codebase [16]
- Bias the model with specific security information such as the OWASP Top 10 [9][10]
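
Several of the recommendations above can be combined in a small amount of glue code. The following is a minimal sketch, not the method shown in the talk: the function names (`build_review_prompt`, `fix_passes_tests`), the prompt wording, and the `--- diff ---` / `--- full file ---` markers are all assumptions made for illustration. It builds a scoped prompt that names explicit bug classes, asks for a component inventory, passes the diff and any key files verbatim, and gates a proposed fix on a passing test command.

```python
import subprocess
from typing import Mapping, Optional, Sequence

# Illustrative defaults: these two bug classes are the examples named in the talk.
BUG_CLASSES = ("auth bypasses", "SQL injection")


def build_review_prompt(
    diff: str,
    bug_classes: Sequence[str] = BUG_CLASSES,
    pinned_files: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a scoped, bug-focused review prompt (hypothetical format)."""
    parts = [
        "Review only the diff below. Report security issues and logical bugs.",
        "Look explicitly for these classes of bugs:",
    ]
    parts += [f"- {name}" for name in bug_classes]
    parts.append("First, list each component the diff touches, step by step.")
    # Key files are included in full so they are not summarized away.
    for path, text in (pinned_files or {}).items():
        parts.append(f"--- full file: {path} ---\n{text}")
    parts.append(f"--- diff ---\n{diff}")
    return "\n".join(parts)


def fix_passes_tests(test_command: Sequence[str]) -> bool:
    """Return True only if the test command exits with status 0."""
    result = subprocess.run(list(test_command), capture_output=True)
    return result.returncode == 0
```

In use, the prompt string would be sent to whichever agent is being driven, and a proposed fix would be kept only if something like `fix_passes_tests(["pytest", "-q"])` returns `True`; the test command itself is an assumption and depends on the project.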
How to Improve your Vibe Coding — Ian Butler
AI Engineer·2025-08-03 04:32