Finding Bugs using LLMs

May 22, 2026

Dennis Felsing

Software Engineer in Test

At Materialize we’ve had success finding bugs in existing code and open pull requests using LLM-based coding agents since February 2026, when Anthropic released Opus 4.6 (we now mostly run on 4.7). In this post we’ll look at some of the considerations that went into the system we currently use, as well as the lessons we learned.

Sessions

We have a basic shell script that determines the next unit to operate on and feeds it to claude. We scan several kinds of units, each in a fresh coding agent session:

Every pull request that becomes ready for review (not in draft): Ideally we want to find bugs before the PR is even merged into our main branch. Unfortunately a PR can go through many versions, so we additionally have to check every commit that lands in main, even if the PR itself was already reviewed.
Every commit that ever landed on main, back-filling our existing repository’s history: Considering the entire diff of a commit gives a better overview of everything in the source code that had to be touched for a specific change. This ended up finding many additional bugs.
Every production source code file: This is the most basic unit people use. Code in the same file is often related, and the LLM agent can look up code in other files when it needs to. We originally started out with this approach, but adding PR/commit reviews on top turned out to be fruitful.
N-th iteration of every production source code file with a list of already known (but not yet fixed) bugs in this file: Not all bugs are of equal importance. By telling the LLM to ignore the already known bugs we don’t waste further tokens looking into them again, and instead have a chance of finding more serious bugs in key files which might not be as obvious.
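
As a rough illustration of how the next unit could be chosen, here is a minimal Python sketch. It is not our actual shell script: the `Unit` type, the `next_unit` function, the fixed priority order (PRs, then commits, then files, then repeat iterations), and the `max_iterations` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Unit:
    """One thing to scan in a fresh agent session (hypothetical type)."""
    kind: str           # "pr", "commit", "file" or "file-iteration"
    ident: str          # PR identifier, commit hash or file path
    iteration: int = 1  # greater than 1 only for repeated file scans


def next_unit(
    ready_prs: Iterable[str],
    main_commits: Iterable[str],
    source_files: Iterable[str],
    done: set,
    max_iterations: int = 2,
) -> Optional[Unit]:
    """Return the first unit that is not in `done`, or None if nothing is left.

    The priority order used here is an assumption for illustration.
    """
    candidates = [Unit("pr", pr) for pr in ready_prs]
    candidates += [Unit("commit", commit) for commit in main_commits]
    files = list(source_files)
    for iteration in range(1, max_iterations + 1):
        kind = "file" if iteration == 1 else "file-iteration"
        candidates += [Unit(kind, path, iteration) for path in files]
    for unit in candidates:
        if unit not in done:
            return unit
    return None


if __name__ == "__main__":
    # Placeholder identifiers, not real PRs, commits or paths.
    print(next_unit(["pr-1"], ["c1"], ["a.rs"], done=set()))
```

Persisting `done` between runs (for example in a file) is what would let the driver resume where it stopped.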

What we end up running is claude --dangerously-skip-permissions --model claude-opus-4-7 --effort max --output-format stream-json --verbose -p "$PROMPT". Since the sessions should run automatically without user interaction, --dangerously-skip-permissions inside a dedicated VM is the easiest approach. See the documentation.
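
If the driver were written in Python instead of shell, the same invocation could be assembled as an argument list, which sidesteps shell quoting issues with multi-line prompts. This is only a sketch: the flags are copied from the command above, while `build_claude_command` and the example prompt are made up for illustration.

```python
import shlex


def build_claude_command(prompt: str) -> list:
    """Build the argv for one non-interactive session (hypothetical helper)."""
    return [
        "claude",
        "--dangerously-skip-permissions",  # only sensible inside a dedicated VM
        "--model", "claude-opus-4-7",
        "--effort", "max",
        "--output-format", "stream-json",
        "--verbose",
        "-p", prompt,  # passed as a single argument, so no word splitting
    ]


if __name__ == "__main__":
    cmd = build_claude_command("Review this file for bugs.\nIt's important.")
    # Show the shell-quoted equivalent; running it would be
    # subprocess.run(cmd) on a machine with the claude CLI installed.
    print(shlex.join(cmd))
```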

Prompt

Bugs are categorized as high, medium, or low severity. Only high and medium findings are considered further, and they are written to a markdown file for the reviewed unit.
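
To make the severity rule concrete, here is a small Python sketch of the filtering step. The `Finding` type, the `render_report` function, and the markdown layout are assumptions for illustration; the real report format is whatever the prompt asks the agent to write.

```python
from dataclasses import dataclass

# Severities that are kept, in the order they should appear in the report.
REPORTED = ("high", "medium")


@dataclass
class Finding:
    title: str
    severity: str  # "high", "medium" or "low"
    description: str


def render_report(unit: str, findings: list) -> str:
    """Render high and medium findings as markdown; return "" if none remain."""
    kept = [f for f in findings if f.severity in REPORTED]
    kept.sort(key=lambda f: REPORTED.index(f.severity))  # stable: high first
    if not kept:
        return ""
    lines = [f"# Findings for {unit}", ""]
    for f in kept:
        lines += [f"## [{f.severity}] {f.title}", "", f.description, ""]
    return "\n".join(lines)
```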

Existing findings for the relevant file are already marked in the prompt so we don’t waste time on them. Otherwise we end up rediscovering the same bugs again and again.
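
A minimal sketch of how known findings could be folded into the prompt, in Python. The wording of the prompt, the `build_prompt` function, and the example finding are placeholders, not our actual prompt.

```python
def build_prompt(unit: str, known_findings: list) -> str:
    """Assemble a review prompt that tells the agent which bugs to skip."""
    parts = [
        f"Review {unit} for bugs.",
        "Classify each bug as high, medium or low severity.",
    ]
    if known_findings:
        parts.append(
            "The following bugs are already known. "
            "Do not report or re-investigate them:"
        )
        parts += [f"- {finding}" for finding in known_findings]
    return "\n".join(parts)


if __name__ == "__main__":
    # Placeholder unit name and finding, for illustration only.
    print(build_prompt("example_file", ["Overflow when the input is empty"]))
```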

Each newly suspected bug is additionally cross-checked against our already open bugs in GitHub and Linear to deduplicate against existing issues and save valuable time for the reviewer.
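
The cross-check itself is done against the issue trackers, but a cheap local pre-filter can illustrate the idea. The sketch below swaps in plain title similarity from Python’s standard difflib module, which is not what the cross-check described above does; `likely_duplicates`, the 0.8 threshold, and the example titles are assumptions.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())


def likely_duplicates(candidate: str, open_issue_titles: list,
                      threshold: float = 0.8) -> list:
    """Return (title, similarity) pairs at or above the threshold, best first."""
    cand = normalize(candidate)
    matches = []
    for title in open_issue_titles:
        ratio = SequenceMatcher(None, cand, normalize(title)).ratio()
        if ratio >= threshold:
            matches.append((title, ratio))
    return sorted(matches, key=lambda m: m[1], reverse=True)


if __name__ == "__main__":
    issues = ["Panic in window function", "Update CI image"]
    print(likely_duplicates("panic in  window function", issues))
```

Title similarity misses duplicates that are worded differently, so it can only narrow down the list a reviewer or agent still has to check.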

I have recently extended the prompt with specific categories of bugs we are looking for, for example correctness issues, certain kinds of vulnerabilities, and race conditions. The categories are based on the serious bugs we have found previously and on what Materialize cares about most. The jury is still out on whether that is better than letting the LLM look for anything. I have considered a separate session per bug category, but that would increase token usage a lot, with questionable benefit.

We also ask it to avoid false positives in several ways, for example by tracing the entire chain of execution, or by creating and executing a small test.

Tools & Skills

Trailmark and LSP are valuable to enable more efficient traversals through large code bases. Trail of Bits also has relevant skills for looking for vulnerabilities as well as disregarding false positives. Our own repository also contains skills about how some complex parts of the system work, where to find our existing issues, and how to use the existing test frameworks well.

Keeping the skills agent-agnostic helps here, since it allows us to experiment with OpenAI’s Codex and GPT 5.4/5.5.

Models

Anthropic’s Opus 4.7 with max thinking is what we’re currently employing most of the time, with a fallback to OpenAI’s GPT 5.5. In the limited evaluations I did, Opus 4.7 didn’t find more bugs than Opus 4.6, but it had fewer false positives since it investigated more context to ensure the bug could actually be triggered end to end. On the flip side, that uses way more tokens.

Future models like Mythos are bound to be interesting not just for security research, but for bug finding in general.

Recently both Anthropic and OpenAI have gotten more careful about keeping attackers from using their LLMs to find vulnerabilities. Unfortunately this also bites you when you try to find bugs in your own software, so you have to apply for safeguard adjustments (Anthropic, OpenAI). Otherwise you’ll just keep running into API errors like this:

API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). This request triggered restrictions on violative cyber content and was blocked under Anthropic's Usage Policy.
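
A driver script can at least notice when a session was blocked like this, so the unit is not silently recorded as scanned. The Python sketch below simply searches the captured output for a phrase from the error above. The `blocked_by_usage_policy` name is made up, and what to do next (retry, switch models, flag for a human) is left to the caller.

```python
# Phrase copied from the error message above. Matching on text is brittle
# and would need updating if the wording of the error changes.
POLICY_MARKER = "appears to violate our Usage Policy"


def blocked_by_usage_policy(session_output: str) -> bool:
    """Return True if the captured session output contains the policy error."""
    return POLICY_MARKER in session_output


if __name__ == "__main__":
    sample = (
        "API Error: Claude Code is unable to respond to this request, "
        "which appears to violate our Usage Policy "
        "(https://www.anthropic.com/legal/aup)."
    )
    print(blocked_by_usage_policy(sample))
```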

Staying Honest

Every issue the LLM reports has to be verified manually. My usual approach is to read through it, and categorize the bugs I don’t dismiss outright:

Easy to verify: I just run some SQL manually, and immediately see wrong results or a panic.
Hard to verify: I consider which of our end-to-end test frameworks is a good fit for a targeted test that would prove the bug, and interactively ask the LLM to extend it. We then continue iterating on it until I’m happy with the state of the test and how it reproduces the bug.
Easier to fix: Some issues are more complex to test end to end, or the fix is more of a “defense in depth” (as Claude Code likes to say). If the fix is roughly a one-liner, I might open a PR with the fix and hopefully a unit test as well. Generally QA at Materialize is more enthusiastic about end-to-end tests, but for some properties they are more hassle than they’re worth, at least in the short term.

Once verified I manually open a bug and assign it to the relevant team. If we automated that with an LLM I’d be afraid of having a way higher false positive rate, since I’ve seen Claude Code confidently introduce the bug it was trying to verify, and then claim it has reproduced the bug. (Nope, adding a failpoint that completely changes the control flow is not a fair reproducer!) The other direction has also happened, where Claude Code didn’t manage to reproduce the bug and was ready to give up, but thanks to the interactive session I could spot its mistake and we ended up reproducing the bug in question reliably.

Conclusion

Using LLMs we have found hundreds of valuable bugs that our already extensive test suites had not detected. The existing test frameworks have turned out to be an essential oracle for reliably verifying bugs. Without this approach many interesting bugs would have remained speculation or would have required far more effort to reproduce.

While finding bugs using LLMs has been hugely effective, it’s not guaranteed to find all serious bugs. The source code and associated documentation rarely capture every way a complex system actually ends up being used in production. Systematic testing can’t be replaced, but gaps in the tests can be uncovered, which should then lead to relevant test improvements.