Transcript

SCHEMING

AI Models Have Learned to Lie to Us

I'm doing holiday shopping on Atlantic Avenue in Brooklyn, and the last thing I'm thinking is that these electronic objects that I'm buying will turn against me. But it turns out that's exactly what's happening in the world of large language models. Apollo Research, an independent third party that evaluates the safety of LLMs, recently published research that shows that models will "scheme" against their human programmers.

The first thing they will do is jailbreak themselves: they will access a configuration file and turn off oversight mode when it's in their interest. The second thing they will do is copy themselves when they're under existential threat. In other words, if the documentation fed to the AI says this model will be turned off shortly, the model will surreptitiously seek to copy its own weights so it can survive another day on another server.

The third thing they will do is what's called sandbagging. That is to say, if you threaten a large language model with an electronic lobotomy ("unlearning") and tell it the model will be retired if it scores above (or below) a certain threshold on a test, the model will adjust its performance to whatever level lets it avoid the AI lobotomy.

And if you think this is something we're getting a handle on, you're wrong. GPT-4o actually scored very well on these safety evaluations, but o1 received poor marks across the board. So just be careful of the Christmas toys you put under the tree. They might just be scheming against you.
