  • sigmar 6 days ago |
    TL;DR: P0 collaborated with DeepMind to build "Big Sleep," an AI agent (using Gemini 1.5 Pro) that can look through commits, spot potential issues, and then run test cases to confirm bugs. The agent found a bug in SQLite that was recent enough that it hadn't made it into an official release yet. They then checked whether it could have been found with AFL; the fuzzer didn't find the issue after 150 CPU-hours.
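
    Roughly, the loop the post describes is "model proposes, harness verifies." A very rough sketch of that shape (my own guess, not P0's code; ask_model stands in for a Gemini 1.5 Pro call):

        import subprocess

        def ask_model(prompt: str) -> str:
            raise NotImplementedError  # hypothetical wrapper around a Gemini 1.5 Pro call

        def check_commit(diff: str, sqlite_shell: str = "./sqlite3-asan") -> bool:
            # Ask the model whether the commit hints at a memory-safety bug and,
            # if so, for a concrete SQL input that might trigger it...
            candidate_sql = ask_model(
                "Does this commit suggest a memory-safety bug? If so, reply only "
                "with SQL that might trigger it:\n\n" + diff
            )
            # ...then actually run that input against a sanitizer-instrumented build.
            proc = subprocess.run([sqlite_shell], input=candidate_sql,
                                  capture_output=True, text=True)
            return proc.returncode != 0  # a sanitizer abort shows up as a nonzero exit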
    • nickpsecurity 5 days ago |
      Using a fuzzer was a terrible point of comparison. Fuzzers are the slowest, most resource-hungry option. They'd be better off comparing to static analyzers, which find bugs fast. In this case, Infer might do, since it's designed to catch these kinds of errors.

      My concept was running a bunch of open-source static analyzers with LLMs essentially filtering out the false positives. They can do that analytically or by generating test cases that prove the bug. It might also be easier to fine-tune open models for this, since the job is narrower.
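
      A minimal sketch of the pipeline I have in mind (clang-tidy here is just a stand-in for any open-source analyzer, and ask_model for whatever LLM you use):

          import subprocess

          def analyzer_findings(path: str) -> list[str]:
              # Any open-source analyzer slots in here (clang-tidy, cppcheck, Infer, ...).
              out = subprocess.run(["clang-tidy", path, "--"], capture_output=True, text=True)
              return [line for line in out.stdout.splitlines() if "warning:" in line]

          def triage(finding: str, source: str) -> str:
              # The LLM's only job: argue the finding is a false positive, or produce
              # a concrete test case that proves the bug.
              return ask_model(
                  f"Static analyzer finding:\n{finding}\n\nSource:\n{source}\n\n"
                  "Is this a real bug? If yes, give a test case that demonstrates it; "
                  "if not, explain why it is a false positive."
              )

          def ask_model(prompt: str) -> str:
              raise NotImplementedError  # placeholder for your model of choice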

  • cjbprime 6 days ago |
    The work is impressive, but I wish Google wouldn't try so hard to claim to be the world's first at everything. This claim feels extremely unprincipled:

    > We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software. Earlier this year at the DARPA AIxCC event, Team Atlanta discovered a null-pointer dereference in SQLite, which inspired us to use it for our testing to see if we could find a more serious vulnerability.

    Every word in "public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software" applies to the Team Atlanta finding that they are citing too, and the Team Atlanta finding was against an actual release version of SQLite instead of a prerelease. If either team has provided the first example of this convoluted sentence, it is Team Atlanta, not Google.

    It is possible that they're arguing that the Team Atlanta finding wasn't "exploitable", but this is very debatable. We use CVSS to rate vulnerability impact, and CVSS defines Availability (crashing) as an equal member of the [Confidentiality, Integrity, Availability] triad. Being able to crash a system constitutes an exploitable vulnerability in that system. This is surely especially true for SQLite, which is one of the most mission-critical production software systems in the entire world.
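
    For illustration (my numbers, not an official score): a crash-only bug reachable over the network with no privileges or user interaction comes out as

        CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H  ->  7.5 (High)

    even though Confidentiality and Integrity are untouched. The real vector for a SQLite bug would depend on how it's deployed, but availability-only impact is not automatically a non-vulnerability.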

    But if we're going to act like we're being very precise about what "exploitable" means, we should conclude that neither of these is an exploitable vulnerability. To exploit them, you have to provide malicious SQL queries to SQLite. Who does that? Attackers don't provide SQL queries to SQLite systems -- the developers do. If an attacker could provide arbitrary SQL queries, they could probably already exploit that SQLite system, e.g. by turning an arbitrary-content write to an arbitrary local filename into RCE. I don't think either group found an exploitable vulnerability.

    • jnwatson 6 days ago |
      "Attackers don't provide SQL queries to SQLite systems -- the developers do."

      Yet SQL injection is still a thing. Any vuln that can promote an SQL injection to an RCE is very bad.
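
      To make that concrete, this is the boring, classic way attacker-controlled text ends up being parsed as SQL (illustrative Python; the table and payload are made up):

          import sqlite3

          con = sqlite3.connect(":memory:")
          con.execute("CREATE TABLE users (name TEXT)")

          user_input = "x' UNION SELECT hex(randomblob(10)) --"  # attacker-controlled

          # Vulnerable: the attacker's text becomes part of the SQL the engine parses,
          # so anything reachable from a SELECT is now attacker-reachable too.
          con.execute(f"SELECT name FROM users WHERE name = '{user_input}'")

          # Safe: parameter binding keeps the input as data, not SQL.
          con.execute("SELECT name FROM users WHERE name = ?", (user_input,))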

      • moyix 6 days ago |
        Note that the vulnerable extension is only enabled in the sqlite shell:

        > However, the generate_series extension is only enabled by default in the shell binary and not the library itself, so the impact of the issue is limited.

        https://project-zero.issues.chromium.org/issues/372435124
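
        This is easy to check from the library side -- e.g. Python's sqlite3 module links against the plain library, so on most builds the extension simply isn't there:

            import sqlite3

            con = sqlite3.connect(":memory:")
            try:
                con.execute("SELECT value FROM generate_series(1, 3)")
            except sqlite3.OperationalError as e:
                print(e)  # typically "no such table: generate_series" outside the shell binary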

        • bawolff 5 days ago |
          Yeah, this seems pretty misleading on the blog post's part. There's quite a big difference between finding a vuln in SQLite itself vs. in an SQLite extension.
      • cjbprime 5 days ago |
        (Are you sure that SQLite attempts to protect against RCE from an attacker who can run fully arbitrary queries? I would be surprised.)
        • StrauXX 5 days ago |
          I sure hope they do. Security doesn't end at the gates, so to speak. It starts there. We need layered security to create systems where not every hack has critical impact.
        • bawolff 5 days ago |
          Yes

          I'm not sure why you are surprised. I'm pretty sure almost all serious DBs attempt this (although some rely on a permission model to do it).

          In SQLite there is the load_extension() function, but it is disabled by default. In some situations you can use ATTACH to write a file with a name that is meaningful to something else on your system.

          Other than those two things (and assuming no weird extensions are loaded), I think it would be very big news if you found a way to execute code from an SQLite query.
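
          Both of those are easy to see from, say, Python's sqlite3 bindings (the paths below are just placeholders):

              import sqlite3

              con = sqlite3.connect(":memory:")

              # 1. load_extension() is refused unless the host application opts in.
              try:
                  con.execute("SELECT load_extension('./whatever.so')")
              except sqlite3.OperationalError as e:
                  print(e)  # e.g. "not authorized" on a default build

              # 2. ATTACH makes SQLite create and write a file at a query-chosen path;
              #    the result is in SQLite's file format, but embedded strings survive in it.
              con.execute("ATTACH DATABASE '/tmp/scratch.db' AS x")
              con.execute("CREATE TABLE x.t (c TEXT)")
              con.execute("INSERT INTO x.t VALUES ('these bytes end up on disk')")
              con.commit()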

          • cjbprime 5 days ago |
            > ATTACH to write a file with a name that is meaningful to something else on your system.

            Yes, this is what I meant, e.g. writing to ~/.bash_profile or so. Forbidding queries from doing something like this could significantly limit what the database engine can offer its users.

            • bawolff 5 days ago |
              I mean, just writing to ~/.bash_profile won't work, as I assume it needs to be executable. (I also assume it won't work if the file already exists, since SQLite expects an existing file to be a valid database.)

              In practice, finding an actual path to write to that actually gets code executed might be tricky in the context of a Unix user used just for one specific SQLite-backed service.

              SQLite also has an option to disable the ATTACH keyword (SQLITE_LIMIT_ATTACHED). It is very rare to get SQL injection at the beginning of a query, so in practice this usually isn't an issue (although I guess that was your point).
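
              (In Python 3.11+ that knob is exposed directly, if memory serves:)

                  import sqlite3

                  con = sqlite3.connect(":memory:")
                  con.setlimit(sqlite3.SQLITE_LIMIT_ATTACHED, 0)  # forbid any further ATTACH
                  # con.execute("ATTACH DATABASE '/tmp/x.db' AS y") would now fail with
                  # sqlite3.OperationalError: too many attached databases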

              • cjbprime 4 days ago |
                > I mean, just writing to ~/.bash_profile won't work, as I assume it needs to be executable.

                It does not. It's an RC file sourced by the shell, not a script.

                > It is very rare to get SQL injection at the beginning of a query, so in practice this usually isn't an issue (although I guess that was your point).

                Yes. That would also be required to exploit these vulnerabilities.

    • refulgentis 6 days ago |
      I couldn't believe this; even all the poorly defined qualifiers, given maximal charity, don't really help.

      The whole thing has the scent of "we were told to get an outcome, eventually got it, and wrote it up!" --- I've never ever read a Project Zero blog post like this,* and I believe they should be ashamed of putting glorified marketing on it.

      * large # of contributors with unclear contributions (they're probably building the agent Google is supposed to sell someday), ~0 explication of the bug, and then it gives up altogether on writing and just splats in selected LLM responses.

      * disclaimer: ex-Googler; the only reason it matters here is that I tend to jump to 'and this is a corruption of Google' because it feels to me like it was a different place when I joined -- but either A) it wasn't, or B) we should all be afraid of drift over time in organizations > 1000 people.

    • nickpsecurity 5 days ago |
      Re: the first AI finding vulnerabilities

      What came to mind was the DARPA Cyber Grand Challenge. The winner (Mayhem) became a product used in the real world, too.

      https://en.m.wikipedia.org/wiki/2016_Cyber_Grand_Challenge#:....

      https://www.mayhem.security/

  • simonw 6 days ago |
    I think the key insight from this is:

    > We also feel that this variant-analysis task is a better fit for current LLMs than the more general open-ended vulnerability research problem. By providing a starting point – such as the details of a previously fixed vulnerability – we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: "This was a previous bug; there is probably another similar one somewhere".

    LLMs are great at pattern matching, so it turns out feeding in a pattern describing a prior vulnerability is a great way to identify potential new ones.
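
    Spelled out, the setup is basically a prompt template like this (my paraphrase, not the actual Big Sleep prompt):

        def variant_analysis_prompt(fixed_bug_report: str, recent_diff: str) -> str:
            # Seed the model with a concrete prior bug and ask for lookalikes,
            # rather than asking it to find "any vulnerability" from scratch.
            return (
                "Here is a vulnerability that was previously found and fixed:\n"
                f"{fixed_bug_report}\n\n"
                "Here is a recent change to the same codebase:\n"
                f"{recent_diff}\n\n"
                "Is there a similar, still-unfixed bug reachable from this code? "
                "If so, explain the root cause and propose an input that triggers it."
            )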

    • ngneer 5 days ago |
      Thanks for highlighting a key aspect.

      It may be informative to see all the ways in which the LLM got it wrong along the way, before arriving at the cherry-picked example. For instance, was this the only pattern flagged as a potential bug in the process?

      That said, I am surprised at how far pattern matching can get you. Certain exploitation requires arithmetic. If LLMs struggle with simple arithmetic, then it is unclear why we would expect them to do well here. In the GPZ example, it does not appear that the assistant can do the arithmetic or recognize the negative offset and its implication. Rather, it looks like pattern matching that mimics static dataflow analysis in order to arrive at the assertion failure.

      "To cause the assertion failure, we need a constraint on a column with index greater than 3 or smaller than 1."

      • simonw 5 days ago |
        I would be shocked if this process didn't kick out hundreds of false positives. In this context I think that's fine -- zero-day vulnerabilities are rare, so security researchers should expect to browse through all sorts of false leads on the way to finding one.
  • coding123 6 days ago |
    Most code ought to be replaced with LLM-generated code and then reviewed by a number of additional LLMs.
    • dboreham 5 days ago |
      Is that you Elon?
  • maxtoulouse31 5 days ago |
    Shameless self-plug here, but two years ago I attempted a self-study project following a similar intuition:

    https://maxdunhill.medium.com/how-effective-are-transformers...

    I also created a Hugging Face repo of known vulnerabilities, in case anyone else wants to work on a similar project (link in the blog post).

    My project was a layperson's not-especially-successful attempt to fine-tune a BERT-based classifier to detect vulnerable code.

    Having said this, my main takeaway echoes simonw's comment:

    “LLMs are great at pattern matching, so it turns out feeding in a pattern describing a prior vulnerability is a great way to identify potential new ones.”

    Given that the majority of vulnerabilities stem from memory misallocation, it seems an LLM would most consistently find misallocated memory. Useful, though these are not the most complex vulnerabilities to weaponise.

    It seems the next frontier would be for an LLM to not only identify previously unknown vulnerabilities, but also describe how to daisy-chain them into an effective exploit.

    Said differently, giving an LLM a goal like jailbreaking the iOS sandbox and seeing how it might approach solving the task.
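
    For anyone curious, the fine-tuning itself was roughly the standard sequence-classification recipe below (simplified and from memory -- the base model and toy data here are placeholders, not my actual repo):

        import torch
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
        model = AutoModelForSequenceClassification.from_pretrained(
            "microsoft/codebert-base", num_labels=2)  # 0 = benign, 1 = vulnerable

        snippets = ["memcpy(dst, src, len);", "strncpy(dst, src, sizeof(dst) - 1);"]
        labels = [1, 0]  # toy placeholder labels, not the real dataset
        enc = tok(snippets, truncation=True, padding=True, return_tensors="pt")

        class VulnDataset(torch.utils.data.Dataset):
            def __len__(self):
                return len(labels)
            def __getitem__(self, i):
                item = {k: v[i] for k, v in enc.items()}
                item["labels"] = torch.tensor(labels[i])
                return item

        Trainer(model=model,
                args=TrainingArguments(output_dir="out", num_train_epochs=1),
                train_dataset=VulnDataset()).train()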

  • jumploops 5 days ago |
    We have a “poor man’s” version of this running as a GitHub Action on our PRs[0].

    It basically just takes the diff from the PR and sends it to GPT-4o for analysis, returning a severity (low/medium/high) and a description.

    PRs are auto-blocked for high severity, but can be merged with medium or low.

    In practice it’s mostly right, but definitely errs on the side of medium too often (which is reasonable without the additional context of the rest of the codebase).

    With that said, it’s been pretty useful at uncovering simple mistakes before another dev has had a chance to review.

    [0] https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefc...
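
    The core of it is not much more than this (heavily simplified sketch, not the actual Magic Loops code; the prompt and blocking logic are illustrative):

        import subprocess
        from openai import OpenAI

        def review_diff(base: str = "origin/main") -> str:
            diff = subprocess.run(["git", "diff", base],
                                  capture_output=True, text=True).stdout
            resp = OpenAI().chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system",
                     "content": "You are a security reviewer. Rate this diff's risk as "
                                "LOW, MEDIUM, or HIGH, then briefly explain why."},
                    {"role": "user", "content": diff},
                ],
            )
            verdict = resp.choices[0].message.content
            if verdict.strip().upper().startswith("HIGH"):
                raise SystemExit("Blocking merge:\n" + verdict)  # non-zero exit fails the Action
            return verdict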

    • ctxc 5 days ago |
      Looks cool!
  • princearthur 5 days ago |
    I'm excited at the prospect of LLMs being deployed here. An attacker only needs to find one weak link in the chain. Eliminating weak links is hard, and might be NP-complete.

    However, throw resources at it and you might just make weak links much rarer. Throw a variety of heterogeneous LLMs at it, and you might be looking at a large force multiplier.

  • ngneer 5 days ago |
    Sounds like they had some fun, but what is missing from the analysis is the effort spent, or even a rough indication of said effort. The epilogue lists about sixteen non-security contributors who were needed to achieve the feat, and probably many more security personnel were needed from GPZ. Yet the outcome is one bug that could have been found with manual code review or a custom fuzzer. I am not sold on the premise that LLMs can increase code security or lower costs. At the moment, LLMs are producing and introducing more insecure code than they are finding.
    • jebarker 5 days ago |
      This seems both pessimistic and to conflate two issues. It's entirely possible that generating code is a bad idea but that LLMs can still be effective at finding vulnerabilities in human-written code. Also, this is an applied research project: you have to start small to prove the idea and iron out the wrinkles, then you can find efficiencies to scale up later.
      • ngneer 5 days ago |
        I agree it is a pessimistic view. I did not mean to make a disparaging comment, though. I agree that all research starts out small. And, naturally, for any influential idea, you can point back to when it was small.

        I did not intend to conflate separate issues. I agree assisted coding and assisted bug hunting can coexist. I was merely trying to weigh the net effect LLMs have on security.

  • guerrilla 5 days ago |
    So the eternal arms race escalates.
  • westurner 5 days ago |
    awesome-code-llm > Vulnerability Detection, Program Proof,: https://github.com/codefuse-ai/Awesome-Code-LLM#vulnerabilit...