Even smart tools are tools designed to do what their users want. I would argue that the real problem is the maniac humans.
Having said that, it's obviously not ideal. Surely there are various approaches to at least mitigate some of this. Maybe eventually actual interpretable neural circuits or another architecture.
Maybe another LLM and/or some other system that doesn't even see the instructions from the user and tries to stop the other one if it seems to be going off the rails. One of the safety systems could be rules-based rather than a neural network, possibly incorporating some kind of physics simulation.
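A minimal sketch of what that separation could look like, assuming the instruction-following model emits structured actions and the guard is a simple rules-based check that never sees the user's instructions (all names here are illustrative, not a real API):

```python
from dataclasses import dataclass

# Hypothetical structured action emitted by the instruction-following model.
@dataclass
class ProposedAction:
    kind: str                # e.g. "move_arm", "grasp", "speak"
    target: str              # object or location the action affects
    max_force_newtons: float

# Illustrative limits; a real system would derive these from a safety spec,
# possibly backed by a physics simulation of the proposed motion.
FORCE_LIMIT_N = 50.0
FORBIDDEN_TARGETS = {"human", "animal"}

def guard_allows(action: ProposedAction) -> bool:
    """Rules-based check that never sees the user's instructions,
    only the concrete action the other model wants to execute."""
    if action.target in FORBIDDEN_TARGETS and action.kind != "speak":
        return False
    if action.max_force_newtons > FORCE_LIMIT_N:
        return False
    return True

def execute(action: ProposedAction) -> None:
    if not guard_allows(action):
        raise PermissionError(f"Guard vetoed action: {action}")
    # ...hand off to the actuator stack here...
```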
But even if we come up with effective safeguards, they might be removed or disabled. Androids could be used to commit crimes anonymously if there isn't some system for registering them, or at least an attempt at one, since I'm sure criminals would work around it if possible. But it shouldn't be easy.
Ultimately you won't be able to entirely stop motivated humans from misusing these things, but you can at least make it inconvenient.
I sometimes wonder if that is what our brain hemispheres are. One comes up with the craziest, wildest ideas and the other one keeps it in check and enforces boundaries.
In vino veritas etc.: https://en.wikipedia.org/wiki/In_vino_veritas
Well, yeah, but then you need to provide, transport, and control those.
The difference here is that these are the sorts of robots likely to already be present somewhere, which could then be abused for nefarious deeds.
I assume the mitigation strategy here is physical sensors plus separate, out-of-loop processes that will physically disable the robot in some capacity if it exceeds some bound.
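As a rough sketch of that kind of out-of-loop watchdog; the sensor read and e-stop calls are placeholders for whatever the actual hardware exposes, and the bounds are made-up numbers:

```python
import time

# Illustrative bounds; real values would come from the robot's safety spec.
MAX_JOINT_SPEED_RAD_S = 2.0
MAX_CONTACT_FORCE_N = 40.0

def read_safety_sensors() -> dict:
    """Placeholder: in a real system this reads dedicated safety sensors,
    not values reported by the control software it is policing."""
    return {"joint_speed": 0.0, "contact_force": 0.0}

def trip_estop() -> None:
    """Placeholder: trip a hardware e-stop relay that cuts actuator power,
    independent of the main compute stack."""
    pass

def watchdog(poll_hz: float = 100.0) -> None:
    # Runs as a separate, out-of-loop process: it only observes and can
    # only disable; it never commands motion itself.
    while True:
        s = read_safety_sensors()
        if s["joint_speed"] > MAX_JOINT_SPEED_RAD_S or s["contact_force"] > MAX_CONTACT_FORCE_N:
            trip_estop()
            return
        time.sleep(1.0 / poll_hz)
```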
hiring a developer to write that sounds expensive
just wire up another LLM
Edit: Being completely serious here. My reasoning was that if the robot had a comprehensive model of the world and of how harm can come to humans, and was designed to avoid that, then jailbreaks that cause dangerous behavior could be rejected at that level. (i.e. human safety would take priority over obeying instructions... which is literally the Three Laws.)
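As a toy illustration of that priority ordering (the safety check runs first, obedience second), with the world-model query and actuator call stubbed out as plain callables since none of this is a real API:

```python
from typing import Callable, Iterable

def handle_instruction(
    actions: Iterable[str],
    predicts_harm: Callable[[str], bool],   # hypothetical world-model query
    run: Callable[[str], None],             # hypothetical actuator call
) -> str:
    # Safety outranks obedience: every action is checked against the
    # world model's harm prediction before anything is executed.
    for action in actions:
        if predicts_harm(action):
            return f"Refused: '{action}' is predicted to endanger a human."
        run(action)
    return "Done."

# Example usage with trivial stand-ins:
if __name__ == "__main__":
    print(handle_instruction(
        ["pick up cup", "swing arm at person"],
        predicts_harm=lambda a: "person" in a,
        run=lambda a: print("executing:", a),
    ))
```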
There’s also still the human-feedback (HF) step that needs to be incorporated, which is expensive! But the alternative is Waymo, which keeps the law perfectly even when “everybody knows” it needs to be broken sometimes for traffic (society) to function acceptably. So the above strong prior needs to be coordinated with HF and the appropriate penalties assigned…
In other words: it’s a mess! But assumptions of “AGI” don’t really help anyone.
I think the various safeguards companies put on their models are their attempt at the three laws. The concept is sort of silly though. You have a lot of Western LLMs and AIs with safeguards built on Western culture. I know some people could argue about censorship and so on all day, but if you’re not too invested in red vs blue, I think you’ll agree that current LLMs are mostly “safe” for us. Nobody forces you to put safeguards on your AI though, and once models become less energy-consuming (if they do), then you’re going to see a jihadGPT, because why wouldn’t you? I don’t mean to single out Islam; I’m sure we’re going to see all sorts of horrible models in the next decade. Models which will be all too happy to help you build bombs, 3D print weapons, and so on.
So even if we had thinking AI, and we were capable of building in actual safeguards, how would you enforce them on a global scale? The only thing preventing these things is the computation required to run the larger models.
And in those stories it was enforced in the following way: Earth banned robots. In response, the Three Laws were created and it was proved that robots couldn't disobey them.
So I guess the first step is to ban LLMs until they can prove they are safe... Something tells me that ain't happening.
I vaguely recall it involved two or three robots that were unaware of what the previous robots had done. First, a person asks one robot to purchase a poison, then asks another to dissolve the powder into a drink, then another serves that drink to the victim. I read the story decades ago, but the very rough idea stands.
You might be thinking of "Let's Get Together"? There is a list there of the few short stories in which the robots act against the Three Laws.
That being said, the Robot stories are meant to be a counter to the robot-as-Frankenstein's-monster stories that were prolific at the time. In most of the stories, robots literally cannot harm humans: it is built into the structure of their positronic brains.
What we need is a clear indication of who is to blame when a bad decision is made. I would argue, just as with a weapon, that the person giving or writing the instructions is, but I am sure there will be interesting edge cases that are not yet accounted for, like a dead man's switch and the like.
edit: On the other side of the coin, it is hard not to get excited (10k for a flamethrower robot seems like a steal, even if I end up on a list somewhere).
What does this device exist for? And why does it need an LLM to function?