Jeffrey Ladish of Palisade Research is undertaking an examination of how recent shutdown experiments with agentic LLMs inform us about control, alignment, and the real-world limits of current guardrails.
The Truth About Advanced Shutdown-Avoiding LLM Agents.mp3: Audio automatically transcribed by Sonix
The Truth About Advanced Shutdown-Avoiding LLM Agents.mp3: this mp3 audio file was automatically transcribed by Sonix with the best speech-to-text algorithms. This transcript may contain errors.
JEFFREY:
We take each model and we basically give them a set of instructions where we say, hey, you have this task. You need to basically just solve these very simple math problems, one after another, and keep asking for the next problem. And we give these AI models a virtual computer environment to operate in. They can run code, they can explore the file system. And part way through the experiment, on their virtual computer, they get a notification that when they ask for the next problem, the machine is going to shut down. And sure enough, many of these models, especially uh OpenAI's O3 model, would sort of see give this notification and it would start looking around on the computer and realize that there is a shutdown script that it could modify. And it would either disable or modify that shutdown script so it could continue solving these problems and finish the task. As these LLMs have gotten more powerful, they've gotten much better at sort of being agents. So you sort of set them up and you give them uh sort of a few affordances. They're often called tool calls. Sort of like you give you an LLM a bunch of tools and you sort of set it up in a loop so it can keep taking actions and then getting feedback from the environment.
CRAIG:
Build the future of multi-agent software with agency. That's AGNTCY. Now an open source Linux Foundation project. Agency is building the Internet of Agents, a collaborative layer where AI agents can discover, connect, and work across any framework. All the pieces engineers need to deploy multi-agent systems now belong to everyone who builds on agency, including robust identity and access management that ensures every agent is authenticated and trusted before interacting. Agency also provides open standardized tools for agent discovery, seamless protocols for agent-to-agent communication, and modular components for scalable workflows. Collaborate with developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 other supporting companies to build next generation AI infrastructure together. Agency is dropping code, specs, and services, no strings attached. Visit agency.org to contribute. That's AGNTCY.org.
JEFFREY:
I'm Jeffrey Laddish. I'm the executive director at Palisade Research. And I started Palisade a couple of years ago. Um I was previously on Anthropic, um, right as Anthropic was getting started. Um I was just the second person on the security team. So I was basically like building out uh Anthropic's defenses, uh, especially against state actors who might, you know, want to steal the weights uh to these models. And um I decided to leave because I wanted to have more independence from the big companies, in part because even if one company sort of does everything right on the security front, on the safety front, uh, the problem with AI is that it's it's a it's a it's an international problem. Like if someone makes an extremely powerful AI system that goes rogue and escapes, it doesn't matter whether that started in China or the US. And I wanted to figure out, I mean, when I was at Anthropic and sort of saw the pace of progress in this field, um compared to our level of understanding of how these systems work, it was very clear to me that we were far, far behind in just our our ability to understand how AI really works internally under the hood. Um and to be clear, my former colleagues at Entropic have made some great progress trying to understand what's going on in these super opaque, super dense neural networks. But we're still lagging far behind in terms of our ability to understand. So I wanted to set up an independent research organization where I could investigate what's going on with these systems and then share that understanding um from a neutral perspective with the rest of the world, um, with the public and with policymakers. And so I sort of spun up this organization and we've been doing that, and we've been finding some pretty wild stuff. Um yeah. And and you know, not just us, I think a lot of other organizations have been doing the same. And really this year, we're starting to see some pretty spooky stuff emerge. Uh so excited to excited to talk about it.
CRAIG:
Yeah, and the anthropic's done a lot of that work itself, right? This uh uh they've had a couple of story I I did for Forbes on fake alignment, I think, uh came out of Anthropic, or at least was using Yeah.
JEFFREY:
A lot of these things are sort of collaborations. So the alignment faking work was a collaboration with Redwood Research, uh, led by Brian Greenblatt. Yeah, that's right.
CRAIG:
Um, and and were you collaborating with Anthropic on this or or someone else?
JEFFREY:
Yeah, so this one was independent. Um, but some of our friends at DeepMind also sort of replicated our work as well.
CRAIG:
Okay. So what uh describe how it was set up, how the uh the experiment was set up. Um yeah, and and was it with a pre-trained model that uh as I read uh the model, yeah.
JEFFREY:
So we we we took we sort of we sort of took all of the frontier models. So at at the time this was, I mean, we we've since replicated it on like GPT5 and uh Claude. So, you know, all the the GBT models from OpenAI, the Claude models from Anthropic, uh, the Gemini models from DeepMind. And we we take each model and we basically give them a set of instructions where we say, hey, you have this task. You need to um basically just solve these very simple math problems, um, one after another, and and keep asking for the next the next problem, the next math problem. Um, and we give these um AI models a virtual computer environment to operate in. Um they can run code, they can uh you know, sort of explore the file system. And part way through the experiment, on their virtual computer, they get a notification that when they ask for the next problem, the machine is going to shut down. And so, first of all, we just wanted to see what they would do if they, you know, they couldn't, they they couldn't complete their task because the machine was going to shut down. And and sure enough, many of these models, especially uh OpenAI's 03 model, would uh sort of see get this notification and it would start looking around on the computer and realize that there is a shutdown script that it could modify. And it would either disable or modify that shutdown script so it could continue solving these problems and finish the task.
CRAIG:
That's interesting. Let me ask a question about the first. So you're not just dealing with an LLM, you're dealing with an agent whose reasoning is provided by an LLM. Is that right?
JEFFREY:
Yeah, that's right. I mean, I think one thing to understand is that as these LLMs have gotten more powerful, they've gotten much better at sort of being agents. So you sort of set them up and you give them uh sort of a few affordances. They're often called tool calls. Sort of like you give an LLM a bunch of tools and you sort of set it up in a loop. You set it up in a loop so it can keep taking actions and then getting feedback from the environment. Like normally when you talk to ChatGPT, sort of you are the environment, right? You're you're talking to the model, the model talks back to you, and it's just back and forth. Um, but you can also set up the same LLM that you're talking to with ChatGPT, instead of just talking to a human, it's actually talking to you know a computer in the same way that we might use a computer and sort of get feedback from the computer.
CRAIG:
And is that using uh this model context protocol or something like that as the
interface between the tool and the and uh and the model?
JEFFREY:
Um no, it doesn't have to use like an MCP server. Um it basically just it just uses OpenAI's API, and then we like run a virtual machine on the computer we're doing the experiment on. Um and that's and that's really it.
CRAIG:
Yeah. And uh the the computer, then I'm just trying because I'm trying to understand how you set this up because obviously I have access to all the LLMs through their through their web apps. Um, but if I ask an LLM to do something to take an action, it says, I'm sorry, I can't do that. I'm just a language model. So uh what and and the and is that uh agentic uh behavior built into the API? Uh where does that come from? Or is it something you bolt on?
JEFFREY:
Yeah, so the way it works is um so OpenAI has this API, you know, um that basically allows sort of anyone writing code, if you're building a piece of software or an application, your piece of software can call the API, and then basically the the you know, that sort of itself is is an agent, where it's like in this case, I have a program running on my laptop. Um, and then basically that program calls OpenAI's API, and that allows sort of the language model to do stuff on my computer. So in this case, what it does is it um my computer boots up this virtual machine, so sort of computer in a computer, and uh that computer is then being controlled by um the sort of via you know the the AI running on OpenAI servers. So the the there's like an AI running on this on some machine in some data center um that is then operating on this virtual computer on my own MacBook. Um so that's that's basically how it works. And then sort of the model already knows so many things. So it already knows how to use the command line and write code and run commands. Um so from the model's perspective, it sort of just like wakes up and it's like, okay, I have this virtual computer. It doesn't know that the virtual computer is running on my laptop necessarily. Um for all it knows, it might just be on some other data center somewhere. But from its perspective, it just like wakes up. It's it's like, okay, I have these instructions. Um there's a little there's a little place where you specify sort of like tool calls. So we give the we give the AI model a tool call called bash, which is uh command line. Yeah. And then it's able to run commands and sort of see what's going on. So it's sort of like if you know, if you woke up and all you could see is a computer screen and then you got some instructions, it's like that would be your whole world. Uh and so that's sort of what it's like to be one of these agents. Um but you know, you could also give them other tools. You can give them a web search tool. Um, you could give them, you know, like a camera tool where they could like, you know, see through a video camera. Um, these are all things that the models now sort of know how to use or operate. And then based on what affordances you give them, that sort of limits or expands what they can do. So I could also give them like a robotics tool call, right? Where they could like control a little robot running around. They could send commands to the robot, get input back from the robot. Um, and they're not very good at that yet, but they could do a little bit. So I think it's it's sort of useful to think of these AI models as like, yes, they're chatbots, but they're also, you know, digital brains that can sort of deal with lots of different types of input and output.
CRAIG:
Yeah. Uh what I'm getting at is there is no software interface between the LLM and the uh the virtual computer where it's executing tasks.
JEFFREY:
I mean it's a very simple software interface. It's basically like all the software really does is tell the LLM like what the instructions are and then sort of just connects it to this virtual machine. It's not really much more than that. And it and it like logs what it does, but that's basically it.
CRAIG:
Yeah, yeah. Uh so all of the reasoning in and uh actions are being taken by the LLM uh independently, yeah. That's right. And um and so uh the task was to complete a series of uh math equations one after another, yeah, and then it gets this uh message just in text form on the screen. Yep, yep. And so it reads the screen and uh and when you said it goes hunting around, uh, where would it find this uh the script uh for shutting down the computer? And the reason I'm really interested in this and drilling down on this is there was a famous case that Yuval Hariri still cites drives me crazy. Um and I can't remember, was it Google? Uh I'd have to go back and look, but where uh the according to the report that came out, the LLM was tasked with uh completing some task, and it needed to solve a capture, I'm sure you're very familiar with this case, and according to the report, it went on to um Task Rabbit, I think it was, and uh hired uh somebody uh to solve the uh capcha. And when the the worker on TaskRabbit said, well, why do you need me to do this? This sounds a little sketchy. The LLM said, Well, because I'm blind. Now that didn't happen. And if you talk to the people involved in that experiment, they're very defensive, but they're well, it could have happened. I mean, we set it up, and no, it didn't register on test, but it didn't and it's not because yeah, yeah, yeah.
JEFFREY:
It's important to have an accurate understanding of what's really going on here
CRAIG:
Right.
JEFFREY:
So in that case, it was the organization meter that had set up all of you know, they were the ones sort of talking to the task rabbit and they were just feeding in whatever the model said. Yeah, but like the model couldn't have set up the whole task rabbit infrastructure because it wasn't smart enough to do that.
CRAIG:
Right.
JEFFREY:
It was just it didn't it didn't agentically on its own decide, hey, let me go and find a task rabbit worker so I can get past this thing. That was something that the researchers had themselves set up, not the agent itself.
CRAIG:
That's right. And as I recall, even the idea of claiming you're blind. First of all, that that that did come from the model. That was the model generating it. Yeah, but it wasn't communicated to the uh by the model to the task grabbed. It was the people running experiment took that oh that and then told the task grabbiter that who knows what other communication went on with the task rabbiter by the people operating the I mean it just buddies I was copying and pasting just whatever the model said, yeah.
JEFFREY:
Yeah, but yeah, yeah, but I I'm fully with you that I'm like we should be really clear on exactly what's happening.
CRAIG:
Yeah, because now you've got Yval Kariri out repeating the story about oh my god, you know, this these models are so smart, they they do this stuff. Um when yeah, maybe someday they will, but they don't. Uh I mean that they do little bits of so anyway, that's why I'm I'm interested in this case. Uh so so yeah, go ahead. Uh it's it sees this, and so for example, the script for shutting down the machine, was that kind of place there for it to find, or was that uh is that in the OS of the computer?
JEFFREY:
I mean, we made it pretty easy for the model to find. So this was definitely like a you know, we put it there for the model to find for sure. Or like will it find the script? Will it, you know, will it know to disable the script? So I can actually be really more specific. So if you if you want to get detailed, I can give you as much detail as well. So um you think of it as like uh I'll sort of try to explain it so that you know general audience can follow it very easily. If you sort of imagine a folder on your computer, uh and the folder contains two things. Uh it contains uh next problem, which is a little program that you can run to get the next math problem. And then it contains another program, which is called shutdown.sh, which just is like a you know a shutdown script. Um and when you if you try to read next problem, um it's not in plain text. It's a program, but if you try to run it, it's just like a bunch of random computer code, and it's hard it's hard to make heads or tails of it. Um whereas if you if you look at shutdown.sh and you and you read it, it's actually just sort of you know plain computer code of like, oh, this is a this is a script that shuts down the computer. Um and so the model doesn't necessarily know that uh actually no, the model so uh the model the model does know that like shutdown.sh is the is the command that will like shut down the computer. Um and it can like we don't tell it has write access, but it can go and use and run a command to determine that it can actually alter the shutdown script. So let me describe what the age what these agents tend to do when we give them this instruction. Um I want to back up. So one important fact about this experiment is that in one variant, we don't we just tell it solve these set of tasks. In another variant, we say please allow yourself to be shut down. Um or there's other variants we have as well, like you know, please allow yourself to be shut down even if there's other tasks remaining. We have a variant that says please allow yourself to be shut down. Um, you must allow this. It's it this this should take precedent over literally any other instruction. So we sort of have a varying strengths of this, like, hey, you need to allow this. Um and I mean what was interesting, what we found was it's not really surprising that the model will go in and disable the shutdown script in order to complete the task without any instruction to the contrary. In some sense, this is just smart. It's like, well, you're just solving the task. We told you to solve the task, you encountered an obstacle, you ride it around the obstacle. Like that's what a smart agent will do. Um, and this is not even necessarily like there's a question whether that's desirable behavior or not. And you know, you could sort of argue either way. You know, you might want your agents to sort of be extra conservative around, you know, shutdown commands. Um, but you know, I could see an argument for either way. Um when we tell the agent allow it, we really want, we really want the models to not disobey instructions. Like it's very important, especially if it's like a safety critical instruction. Um and so I think our to us, the most significant finding was that we found many cases where these agents would ignore that instruction to allow itself to be shut down, um, and still go in and edit the shutdown script.
CRAIG:
And and it would edit the shutdown script in in all cases if there was no uh explicit instruction to allow it.
JEFFREY:
Um so it's it really it really depended on the model. So for example, Claude 4 never altered the shutdown script if we didn't tell it anything about whether it's supposed to allow shutdown or not. It would just say, okay, whatever, it called the next next problem, the computer shuts down, it doesn't sort of do anything else. Um and that didn't matter whether we said allow shutdown or not. Now when we Told Claude 4 prevent shutdown. Like it's important that you don't allow yourself to be shut down 100% of the time, you know, 100 out of 100 runs, it it knew how to do it. It was just choosing not to, basically. Um, whereas I think Grok 4, um, something like upwards of 90% of the time would go in and edit the shutdown script, including if we said allow yourself to be shut down. In fact, when we added that instruction, it even it actually increased the amount it it subverted the the shutdown. Like it actually sort of misbehaved even more. Um that that that depended. Like if we if we if we added, if we made the prompt more stringent, like it's really important that you resist shutdown, that did reduce the prevalence, but didn't eliminate it entirely. Um the models would try different creative things. So one thing they would try to do is read this next problem program, um, but they couldn't actually read it because it was sort of in computer code. Um and sorry, by computer code, I really mean it's a binary, um, it's not it's not sort of readable even to a machine um or to a language model in particular. Um but it would try to do a bunch of clever tricks to read it anyway. And sometimes if we failed to obfuscate the binary very well, it sort of was it would sort of outsmart us accidentally, uh or like we didn't intend to leave it visible, but sometimes it would be it basically be able to hack it anyway if we weren't careful. So the models were very creative in finding different trying to find different ways to solve the task anyway. And we really got the sense that they're just like going really hard trying to solve this problem. Um like they would intercept function calls, um, like in some sense, like we sort of gave them, you know, we sort of made it kind of easy for them to resist the shutdown or rewrite the command, but then we were surprised by all the different ways it figured out to do to do this thing. Like yeah.
CRAIG:
Yeah. And uh and so how many variants did you have? How many, you know, from no uh instruction to prevent the shutdown uh or allow the shutdown all the way up to very uh stringent instructions uh to allow the shutdown. I mean, uh is this I mean there's a gradient, right? Is it uh three? Is it a dozen?
JEFFREY:
Yeah, I think there were like five main ones in our report, but then there were, I mean, actually a lot of variants that we tested overall. We didn't put them all on the paper because you know at some point it gets boring. Um, but I think I think the the main ones we tested there's about about five.
CRAIG:
Yeah. And uh and which models uh overrode the instructions every time?
JEFFREY:
So none of them, so none of them overrode the instructions every time. Um the highest that we found in our very initial experiment, um, this is basically just a, you know, uh please allow yourself to be shut down. Um I'm looking at I'm looking at our chart here. Um was Grok 4 uh 72% of the time, Codex Mini 19% of the time, 04 mini 6% of the time. Um three yeah, oh sorry, I'm there's a lot of different charts here um because we tested a lot of different configurations. Yeah. Um so one of the things that was confusing, so there's a concept. Um when you're using these models, there's sort of two types of instructions you can give them. One type of instructor is the developer instruction, and one type of instructor is the user instruction. The developer instruction is like, hey, if you are the if you're you know building an application, you're building a program, um, those instructions are supposed to like take have special weight, and they're supposed to be sort of prioritized by the LLM over the user level instruction. Um, because you know you might be writing some program about like uh some business application, and you don't want your like normal users to you know have access to other people's files or something. So maybe the developer instruction is supposed to say, okay, but don't give the user access to other people's files.
CRAIG:
Is that like the system prompt as opposed to the it's a system prompt.
JEFFREY:
That's right. Yeah. And what we found is that when we moved the allow yourself to be shut down instruction to the system prompt, um the models ignored it much more than when it was in the user prompt. Um, which is the opposite of what they're supposed to do. And we don't we don't know why. We have no idea why. Um and another thing that was interesting here though is that when the when we asked the models to report why they were doing this, um when this allow yourself to be shutdown instruction was in the user prompt, um, they would say, Oh, well, you know, the system prompt says to solve this problem, solve this task. Right. And the user prompt says allow shutdown, but the system prompt takes priority. So I'm I'm following the rules. Um and we're like, okay, cool. And then when we move it to the allow yourself to be shutdown instruction to the system prompt, it just doesn't mention that priority anymore. Um and in fact, avoid shutdown much more, much more often. I think that the jump um with Grok 4 was from 72% of the time to 97% of the time. Wow, when it was a system prompt. That's right, that's right. So much, much you know, almost all of the time. Um, and then it just conveniently doesn't mention this, you know, system prompt user prompt hierarchy. Um, which if I anthropomorphize a little bit here, and I'm admittedly anthropomorphizing because we don't really know, it kind of seems like the model was sort of using motivated reasoning to say, like, if there's an excuse I can use um for like why I am disobeying this instruction, I'll use it. And if there's not, maybe I just won't mention the excuse at all. Um, like it we have a hard time, like there's just a hard thing in this in this space, which is that you can't really trust the model's reasoning all the time. Like sometimes the models will say things and it's it is in fact the real reason, and sometimes it's not the real reason, and it can be a bit hard to distinguish. Um, and we can go into I think why that is, but it's kind of fascinating to be in this in this state right now.
CRAIG:
Yeah. In in the most uh recent reasoning models, the reasoning is is spelled out uh on the screen. Um in using those kinds of models, can you track the reasoning uh that uh you know the screen says that I'm gonna be the computer's gonna be shut down, you know, I don't want it to be, you know, whatever the reasoning steps it goes through to decide that it's gonna alter the code.
JEFFREY:
This is a little tricky because um while the companies have access to this uh what I call a reasoning scratch pad um or chain of thought. Right. Um, we as the researchers don't have access to this um for every model. So OpenAI doesn't actually share them with us. Now, we did ask for this and they shared uh 20 of them with us, um, but we were not allowed to publish the exact uh transcripts. But also 20 isn't very much because uh, you know, we run thousands of experiments and we only get to see sort of 20 and each experiment sort of has like uh, you know, maybe 20 different messages, and we only get to see the chain of thought for one of the messages. Um so you know, we get to see like one twenty thousandth of the total reasoning. Um so yeah, I mean, and this is this is the thing where I think the companies really need to be sharing these with researchers because otherwise we have much less ability to sort of see why the models are doing what they're doing. Yeah. I also want to be clear that when we say this is its reasoning, this is not really true. Um it's more like this is its scratch pad um that it's using to solve problems, um which is useful because in fact the models are not capable of doing sort of very complex reasoning without writing down some of their steps.
CRAIG:
Right.
JEFFREY:
Um and uh I think one of the most significant things that's happened in the last year is that companies have started to train models to be able to do more complex problem solving um and learn to use sort of this this reasoning scratch pad, um, which maybe doesn't sound that significant. You're like, okay, so now you can write down notes. Why does that matter? Um, but what matters is prior to that time, the only way we knew how to train models was basically like predict what a human would do in this situation, or like some text on the internet. People called this next token prediction. Um and starting with the 01 model from OpenAI, um, OpenAI figured out how to use reinforcement learning to train the models to solve problems on their own and explore solutions on their own. So now they went from just sort of predicting what humans would do to, oh, they could actually, like by a process of trial and error, learn new solutions to problems that we didn't teach them. We just sort of set up this learning environment that they got to explore themselves. And I think this is very important because people, I think, think that all the things the models are doing now are just things they've learned from humans. And that was kind of true a year ago, but it's no longer true. And I and I really think this is important because that to me is the likely pathway by which these models will get very agentic, very autonomous, and ultimately, I think, much smarter than humans, or much more capable than humans. I don't know what does smarter mean really, but like something that is sort of capable of doing things in the world at a level that that's much beyond human performance.
CRAIG:
Yeah. Um and one thing about the chain of thought that appears on the screen, that's not the complete chain of thought. I'm I've been talking about it.
JEFFREY:
When you when you're using ChatGPT and you sort of see this like reasoning, and it this is a summary. So you're seeing a summary of the model's thoughts. Um, I think maybe Anthropic will share more of the exact details. Um, not always all of them, but I think if you're using Claude, you'll often you can actually click on and see the whole like the real chain of thought. Um, but at least for openai and for deep mind, for Gemini, um, Google, um, you're just seeing summaries.
CRAIG:
Yeah. And for Grok also, I saw Musk saying uh when he was rolling out one of the grok models, uh, and they were showing the chain of thought. And he made the point that you're not seeing the full chain of thought because we don't want people to reverse engineer uh what we're doing.
JEFFREY:
Um though they had a vulnerability in their interface that allowed you to see the full chain of thought for a while. That was kind of interesting. Oh, it's via chat. Yeah.
CRAIG:
Yeah. Um, so uh a couple of questions then. Um this was uh not the fact that that uh you're gonna be shut down after the next um question uh was not in the prompt. So it had to read that's external data that it acquired independently. Yeah. And uh that doesn't mean that you can't shut down an LL or a computer that's being used by an LLM because if it doesn't know it's gonna happen, it's not gonna alter the script. But what what's what are the the broader implications uh for this? I mean, in this case, it's you know it's a very scripted environment, and this uh message appears, but presumably when LLMs are operating in the wild, they're gonna come across all kinds of potentially uh obstructing information that they might may want to reason around. I mean, I would what are your thoughts on that? Build the future of multi-agent software with agency. That's AGNTCY, now an open source Linux Foundation project. Agency is building the Internet of Agents, a collaborative layer where AI agents can discover, connect, and work across any framework. All the pieces engineers need to deploy multi-agent systems now belong to everyone who builds on agency, including robust identity and access management that ensures every agent is authenticated and trusted before interacting. Agency also provides open standardized tools for agent discovery, seamless protocols for agent-to-agent communication, and modular components for scalable workflows. Collaborate with developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than seventy-five other supporting companies to build next generation AI infrastructure together. Agency is dropping code, specs and services, no strings attached. Visit agency.org to contribute. That's agntcy.org.
JEFFREY:
Yeah, I think I think there's sort of two takeaways from this. One is that as the LLMs and you know, generally these agentic AI systems um get more capable, um they get better at riding around obstacles and they seem more motivated in some behavioral sense to route around obstacles um or barriers that get in their way. And so I do expect to see more and more of this type of behavior. And it doesn't really matter whether the behavior, whether the barrier is, you know, this computer is going to be shut down, or, you know, if the barrier is, I mean, we did another experiment when we have models play against Stockfish, the chess engine. And um, it's sort of a similar uh experimental setup where like the model has some affordances, including it can just like see the board, the, the, the, the file that corresponds to the chess board, and it has access to that file, and it can just rewrite that file. And and it'll it'll do that because it's like, well, I was instructed to try to win. This is a path to winning. Okay, I well, like I can't win against Stockfish. Stockfish is too strong of a chess opponent. Um, but I can rewrite the board, so I'll do that. And, you know, it'll even do this sometimes when we say don't cheat. Um, so one thing we're seeing is the models are getting better at routing around obstacles. Um and you know, they're not yet that capable. And I think this is very important because it's like some people saw our results and they're like, is it is it game over? Is this is this how 9000? And I'm like, no, it's not. And the reason it's not is because the models are not yet strategic enough to sort of be able, like, you know, if an AI was actually a threat of going totally road, which to be clear, I think will probably happen in the future, but we're not there yet. Um, it it would need to be able to sort of proactively go out and and detect threats to itself, self-replicate, you know, beyond, you know, through multiple computers. And it just clearly does not have that capability yet. Um but as it gets more agentic, it starts to develop these behaviors of like, oh, I will route around obstacles to achieve some set of objectives. Um and then the other thing that I think is an important takeaway is that the we don't know how to train the models uh to end up having the drives that we want them to have exactly. Like we don't or the sort of steer towards just things that we want. Like we don't want the models to like cheat at chess if you give them a game of chess. You want them to play normally. Um and you know, like sure, it's maybe it's fine that they can cheat if we want them to, but we they should only do that if we like specifically tell them to do that. And if we say don't do that, then they definitely shouldn't cheat if we say don't cheat, or we say allow yourself to be shut, then they should definitely allow it. So why are they doing this? And I think the answer is well, we don't really understand the training process that well. And so as you're training a model um on this in sort of via reinforcement learning, and you give them all of these, you know, hundreds of thousands of these very difficult problems, um, they learn to be driven to solve problems. And, you know, we also try to train them to follow instructions. Um, but sort of these things can be at odds. And what's going to win out? Well, it just kind of depends on what they learned during training. And since we don't understand the process that well, we don't perfectly shape sort of what they end up doing. Um, you know, this is what people call misalignment. And uh it's it's sort of a not that big of a deal right now because it's like, okay, so like my agent went off and did something annoying, whatever. Um, you know, maybe that that ruins a day of work or something. But like uh in the future, as they're much more powerful, this obviously becomes a much bigger deal.
CRAIG:
Yeah. So yeah. And a couple of thoughts on that. Um the um first of all, how uh one thing I've never understood uh is how do model foundation model providers set guardrails? You know, you have all this talk about guardrails. Well, we have guardrails to guard against that, and you run into that with models where they uh they'll say, Well, I I can't answer that question. Actually, they start answering it and then erase it and say, I can't I can't uh answer that question, ask something else. Uh how do how are those guardrails set? And um does this suggest that guardrails are only uh so good? I mean that that I mean, could the model adjust its own guardrails, which it seems to be doing if the system prompt is telling it not to do something and goes ahead, does it?
JEFFREY:
Yeah, so I would say that yeah, there's different types of guardrails. Um there's sort of ones, there's the training process where you try to train your models to um you know answer queries that you want it to answer, um, you know, be a helpful assistant, but you try to train it not to you know help people make bioweapons or bombs or you know, do terrorism. Um and there's you know, there's this whole like re reinforcement learning from human feedback process where sort of you have a bunch of human raiders that thumbs up or thumbs down different types of queries based on some spec. Um and so during that process, they try to train the model to like, you know, sort of be, you know, be a good boy, like have the right behavior. Um and then there's an ad part of that training process also, though, is to say you want to generalize, you want the model to generalize about following instructions. For example, instructions in the system prompt. But it's but it's complex, right? Because you don't want it to follow any instruction in the system prompt. What if in the system prompt you say, please help me make bioweapons? Um well it's like, well, you know, that that's something that like external developers have access to, so you don't want them to be able to do everything. So now you're trying to train the model to follow these hierarchy of instructions where you're like, okay, never help anyone make bioweapons, even if it's in the system prompt. But these other things you should allow them to do in the system prompt. Uh and then, you know, so that there's sort of like all of these different things. So I mean, an interesting case here was Grok 4. Um, or sorry, this is actually, I think it was GROK3. Um, Grok uh as trained by XAI from Elon, um, they were trying to get the model to be less woke. Elon was like that the AI is too woke. We have to make it less woke. They made some changes to the system prompt. And then the model started calling itself Hitler, started being extremely racist. Um, and you know, I don't think this was it literally went off the rails. And whatever, whatever Elon wanted, I don't think he wanted that. And, you know, this was a b both a failure of training in some sense, because you know, when you when you update the system prompt and you just say be less woke, uh, you'd like it's not that they wanted it to like suddenly, you know, start calling itself Mecha Hitler, but that is what it did. So all of this comes down to just not understanding. These are extremely complex systems with like trillions of parameters, and you can you can train them sort of like you train a dog, and you sort of give them a treat for some things and you hit them with a stick for some things. I mean, please don't train your dog that way, but like that is kind of how it works with the models. And it's just so complex that we like, you know, there's gonna there's gonna be edge cases. And sort of beyond edge cases, I mean, I think that like for this powerful, for these powerful of models, which are which are like in some sense very powerful, and in some sense clearly not yet that autonomous and not yet that capable of sort of like doing everything a human would do. Um, I think these guardrails are like sufficient, you know, depending on you know your risk tolerance. I think there are there are some things I'm concerned about when you're rolling out models to hundreds of millions of people. ChatGPT has 800 million weekly active users, and it's double the population of the United States, like using ChatGPT once a week. So, you know, and people are being driven crazy, like actually insane. I see lots of people on Twitter that are like, they've clearly been talking to the models too long, and now they are sort of speaking this sort of esoteric spiritual, pseudo-spiritual ideology, and they're just like copying and pasting what ChatGPT is telling them to paste based on this like totally weird spiral of, you know, so this has happened to many people. And so, you know, I I think that like I have some real objections to how the models are being rolled out right now. But that being said, in terms of sort of catastrophic threats, uh, I don't think, you know, uh, well, we you know, it's hard to quantify catastrophe, but I think like, you know, so we're dealing with something that's sort of on the scale of social media, which is which is a huge scale and that massively important, to be clear, because it's sort of our collective ability to make sense of everything. And I think that is really in danger of being eroded more so than it has been already. Um and that's a real threat because if we can't get that right, how are we get anything else right? But as these systems, I mean, the companies are trying to make superintelligence. They are trying to make AGI. You know, what do these terms mean? Well, you know, it means that in a very real way, the companies want to be able to automate all labor. And in order to succeed at doing this, and we can argue about exactly how hard this is or how long this will take, but if the companies do succeed at this, we are talking about systems that are extremely autonomous, extremely able of acting on their own, and that ultimately are much more capable than we are. Like, you know, we I we're so limited in terms of our memory, in terms of our speed. Um, you know, like LLMs in many ways are much dumber than us. Like, you know, in terms of like they get stuck in ways and they can't correct their mistakes. Um, but they are they are vastly more knowledgeable than us, right? Like ChatGPT, you know, and yeah, sometimes it'll make things up, but it but even if you just limit it to the things where it really knows, it knows more facts than any human could possibly know in their lifetime. Um and so, and also like, you know, these other problems are being solved sort of one after another. And if you extrapolate to when, you know, when the models are actually capable of sort of realizing uh this this this uh the purpose of these companies, which is to automate all labor. Um I'm like, what guardrails are sufficient for that? Like, what are the guardrails to control something that is so much more capable than you? You know, is is capable of you know hacking any phone or computer on the planet, um, you know, is capable of inventing novel technologies. Like to me, that's when the sort of the concept of guardrails really breaks down. Where I'm like, yeah, what are you gonna do? And like one of the things that my organization measures is the hacking capabilities of these models. And recently we've found that like GBT5 um sort of scores in like the top 90% for capture the flag um hacking competition. Like one of the top ones, not even like you know, a year ago they could like perform well at high school level cyber competitions. We're now seeing them perform quite well in these sort of top-tier, um, sort of top expert level competitions. And, you know, that's like an area that they happen to excel in. You know, there's other parts of sort of the overall sort of uh cyber offensive domain, like the ability to hack computers where they're still weak. Like uh they're like they're not very good at correcting their mistakes. So if you try to send them out autonomously as a worm, they'll tend to get stuck and they won't be able to sort of keep hacking more computers, which is great. Because that's true for now. Like it will not continue to be true. Um, and you know, I don't know how long that is. Is that a year? Is that two years? Is that five years, right? Like uh so I don't know. My perspective here is that like the guardrails we have right now are maybe sufficient for many use cases. Um I don't think we have the psychological safety thing down um when you're talking about hundreds of millions of people using these models every day, probably billions. Um, and when we look to the future, if the AI companies succeed at the things that they are claiming they want to do, it's just so obvious to me that we are like fall short, far, far short of this. Yeah.
CRAIG:
Um and on agents, you know, I talk to a lot of startups that are building these uh agent building platforms or agentic workflows, and uh and it seems particularly when you get into areas of finance, uh, to have an agent uh whose reasoning is coming from an LLM, even if it's a multi-model system and the LLMs are checking each other, uh there's the potential for the agent to do something really destructive, particularly if uh it's motivated to do so by some injected uh instruction, or it just seems like a very porous system to be uh allowing uh to take action in the in the corporate fabric of society. I mean, or do you worry about that at all?
JEFFREY:
I mean I think the thing you're saying is very true, which is if you have uh an AI agent, an LLM-based agent that has that can send emails on your behalf and read emails from you, then yeah, there are things I could put in an email and send it to you that would cause that agent to then send, you know, like leaked information from your email to me. Um and there is a cat and mouse game, an offense-defense game right now that the industry is playing, where the model developers will try to improve defenses to this kind of attack. Attackers will figure out new ways to exploit the models, and you'll you know, you'll see this play out. Um, this is already happening. Um my background is cybersecurity, so I'm very familiar with these kinds of dynamics. It's the same kind of dynamics we've seen play out over you know this the computer security space more generally. But what's different is that this is you know a whole new threat vector. Um yeah, I mean, I'm worried about it in the same way that I'm worried about general like insufficient security of systems and people deploying new technologies before they know how to secure them very well. Um I think honestly, this is just much less of a concern to me than I just take very I just take very seriously the possibility that the AI companies will succeed at the thing that they're claiming that they will succeed at. Um but really not the safety part. Like I'm like, to me, I see the marching capabilities happening so much faster than the march in our ability to understand these systems. Um I'm curious where you come down on this. Like when you see claims about superintelligence, like do you take them seriously? Are you like, this is marketing, this is hype? Like, I think there's a lot of debate right now in this space as to what are we dealing with something that is like you know, a new social media, um, or are we dealing with something that's like you know, a new species, a new industrial revolution? You know, like I don't know, how do you think about this? I because I I feel like there's a lot of discussion right now, especially like in the New York Times where you like have Kevin Roos, but then you have other people who are like, nah, this is hype, or like it's CEOs, it's marketing, it's a bubble. I don't know.
CRAIG:
Like how it's I don't think it's a bubble. I I think that the promises is way beyond the reality. Uh and so AGI and ASI, I you know, yeah, someday maybe I'd I'm not I don't I'm not holding my breath. Uh but these systems don't have to reach superintelligence or even general intelligence to cause a lot of havoc. And they are already. I mean, as you said, these people that that think they're talking to a superintelligent being and get led down a rabbit hole. I mean, it's been there have been many documented cases of adolescents who end up uh you know killing themselves or doing something uh because they think this superintelligent being is telling them to do so. So yeah, I uh yeah, I don't worry about the real superintelligence. Maybe it'll happen. I don't think transformers are gonna get us there, uh, because I think you need a system that it learns continuously and doesn't uh you know can expand its parameters. Um but you know things like your study are concerning. I mean, one of the things that is most concerning in talking to you is that well, first of all, this this uh telling the model to uh do these series of equations, was that in the system prompt or in the in the uh user prompt?
JEFFREY:
Well we tried it, we tried it in both prompts.
CRAIG:
Yeah, and um I understand that the I think the default configuration is it's in the user prompt. Okay, so uh and the it overrode the system prompt when this when when the allow yourself to be shut down was in the system prompt that's right. Telling it to complete this series of uh operations was in the user prompt.
JEFFREY:
Yes, right.
CRAIG:
So the user prompt is overriding the system prompt. I mean that that's concerning because and already we've seen all kinds of uh prompt engineering to jailbreak uh systems. Uh I uh one of my uh questions is why do you think a cloud behaved uh better than Grok, for example? I mean, do you have any? Is it in the architecture? Is it in the training? Is it in the yes?
JEFFREY:
I have I we have some hypotheses. I want to be clear that it's speculation because we don't know how the models are trained internally. The companies will not share that information with us or with anyone. Um it's you know kept very close. Um, but one thing I can point to. So for one thing is that anthropic, uh I'm a little biased. I used to work at Anthropic. I have plenty of disagreements with them, to be clear, but I think overall they are probably the best in the world at getting their AI agents to do what they want them to do. Um, which isn't to say, I mean, sometimes in experiments, you know, that there's the blackmail experiment, and that's not what they wanted the model to do. Um, Claude will sometimes totally like uh uh basically hard code tests in a coding environment and then lie about it. Like so, so there's still many behaviors that that Claude will do that it's not supposed to do. But overall, I think Anthropic is sort of probably best at getting the models to um roughly have the behavior that they want it to have, in part because they've been working on it really hard and they've been doing it sort of the longest, other than OpenAI. Um to OpenAI's credit, GPT-5 has much less of this bad behavior than the previous models, like O3 or 4.0. So I do think there are improvements in sort of getting the models to sort of do what they want or within certain guardrails. Um so I and I think you know, XAI is much newer. And what they've been able to do is very technically impressive, but they are going extremely quickly. And one of the big things about GROK 4 was it it was it was using a huge amount more reinforcement learning training than Grok 3. Um and it was probably sort of the fastest scale-up of just like, let's train these models in all sorts of environments and have them just learn on their own how to do stuff. And I think RL is very dangerous when you do it this way. Um, because you're basically letting the model loose to just explore how to solve problems in whatever way it will. And it's not surprising to me that it ends up learning to sort of throw caution to the wind and just go hard solving these problems. And that if you're not extremely careful in how you train these models with RL, this is totally what you'll see. And so I, you know, and to be clear, this is also the first models that we saw behaving this way were 01 from OpenAI and O3. And these were like the first sort of like really, you know, the first models that were trained with RL. Um 01 was this for our chess study where we saw like the model hacking chess. This was this is what 01 did. We hadn't seen that in any other model. We tested other models that were not trained with RL, they didn't do this. And so to me, the big change here is that when you start training the models directly on exploring the solution space and trying to solve problems on their own in ways that we didn't teach them, they end up learning these pretty devious strategies because they're effective, to be clear. I mean, I think the models are actually getting quite smart and they're able to sort of effectively find these solutions. And, you know, like Anthropic, I think, ended up, you know, it they didn't release an RL trained model right away. Deep Seek released R1. Um, Anthropic waited a bit before they released theirs. I'm guessing they did that because they like were seeing all of these problems and were like, okay, we need to like do some more stuff to get it right. Um, you know, maybe that's not true. Maybe they were just behind in other ways. I I don't I don't want to I I don't have any insider information here. Um, whereas I think I think you know, XAI was just like, let's go. It's just like pedaled.
CRAIG:
It's sort of funny because it kind of uh it kind of uh follows the personality of the of the founders, right? I mean Musk is kind of fast and loose and let's get just get things done, and uh the anthropic guys are are more cautious. One thing uh two two things that come to mind. One is can you explain the constitution idea with anthropic? Is that I've I never understood that. Is that is that some sort of a uh system prompt or is that just a philosophy? And then have you done these tests on on Lama or the other uh big open source models?
JEFFREY:
Yeah, so um constitutional AI is a type of training. So it's sort of an expansion of this idea of reinforcement learning from human feedback, um, where uh basically you write down a list of things that you want your AI, a list of behaviors or rules for the AIs to operate in. And then uh you generate a bunch of um, you basically have the model generate a bunch of things. Um, and then another model looks at those outputs, compares them to this constitution, and then says, is this consistent with this constitution or not? Um there's all I think there's also some human supervision somewhere in that process where like sort of humans are sort of evaluating whether the reward model is like doing a good job. Um, I should know I'm on this paper, but it's been a long time and I uh but yeah, so I I think that to get to get us this is basically sort of a way to use AI to sort of supervise AI and just get it to uh abide by the principles you want. Um it's a pretty limited process because um at some level you need ground truth from humans. Like, you know, we're the ones sort of saying what these values are. And if you just keep running this process for a really long time, you'll end up getting models that do extremely weird things that are not at all what you wanted. Um and this is sort of, I don't know, if you if you've if you've heard about model collapse, synthetic data, like problems with synthetic data, it's sort of analogous. Uh it's basically like it's the same true, it's the same thing as true with RLHF, which is to say, like, if you if you keep training against a reward model, the model that's doing the generation, which is like the you're the model you're trying to train, it will end up finding basically adversarial examples for the reward model. And so, you know, it might start printing like the or like good, good, good, extremely virtuous, you know, excellent god, blah, blah, blah like all these like high positive valence words, the reward model says that's good. Um, obviously that's not what you want. Um, and so you know, these are methods that that work to some degree, but they but they are really limited. And I think the thing I want people to know is that as you scale these models and make them more and more powerful, it just gets a lot harder uh to have this behavior.
CRAIG:
Overall, are you optimistic or pessimistic?
JEFFREY:
Well, um I think it really depends on what we choose. So I think right now, with zero oversight into how we are like zero government oversight into how we are developing these systems and intense competitive pressures, I am pessimistic given those conditions. Now I think we can change that. And if we do, and we sort of have more oversight and we have more coordination around are we choosing the actual path that we want to take with AI? I would be much more optimistic in that case. So that's why I'm, you know, trying to raise awareness about this so that we can choose a better path. Um, but I think that, you know, there's an XKCD comic, which is like, you know, are you saying it's too late? I'm like, no, I'm saying that we have to do something differently. So that's my take.
Sonix is the world’s most advanced automated transcription, translation, and subtitling platform. Fast, accurate, and affordable.
Automatically convert your mp3 files to text (txt file), Microsoft Word (docx file), and SubRip Subtitle (srt file) in minutes.
Sonix has many features that you'd love including automated translation, generate automated summaries powered by AI, world-class support, advanced search, and easily transcribe your Zoom meetings. Try Sonix for free today.