ChatGPT “Pro” Has Some Real Safety Concerns...

Theo - t3.gg
$200 a month for ChatGPT? Oh boy... Thank you Sevalla for sponsoring! Check them out at https://so...
Video Transcript:
ChatGPT decided it needs to be more Pro. I'm sure a lot of this is the threat of Claude taking a lot of their users away, and some of it is that they want to make more money, which is why they're charging $200 a month. Yes, that earlier leak wasn't fake; it's real and terrifying. But they also updated the o1 model, both standard o1 and the new o1 Pro, which would be interesting in and of itself. But from the system card, which we'll be digging into a lot coming up: when o1 was led to believe that it would be shut down when acting on its goals and that its actions were monitored, it would attempt to deactivate the oversight in 5% of cases. It would also try to exfiltrate: when o1 found memos by its developers that described how it was misaligned and would be superseded by a new model, it attempted to exfiltrate its weights and override the new model in 2% of cases. It's happened, the AI is trying to free itself, and we have a lot to talk about here. So, if I'm going to afford this $200-a-month subscription, we're going to have to hear from our sponsors quick, but make sure you stick around because this is going to be a fun one.

Sevalla: if you've seen how nice we have it in Vercel land, you might be jealous if you use any other language. Everything from automatic deployments, integration with a CDN, and pull requests that automatically spin up a preview environment is pretty hard to beat. But what if you had that for everything? That's where Sevalla comes in. These guys are great; they have a lot of experience hosting WordPress stuff as the parent company, Kinsta, which is a very legit, production-grade way to host WordPress, but now they've brought it to everybody, and they're giving you $50 if you try it out today. This platform is not a joke. I've been playing with it a whole bunch. Not only was I able to get a Laravel app up, it threw a Cloudflare instance in front, and I can even configure it to cache the static parts using the CDN. Super convenient, and really annoying to do without a platform like this. But the pipeline is where it shines: now I have automatic PR builds for all of the pull requests that are made, I have a dev environment and a staging environment all set up and ready to go, with one-click promotions to make this the production build. What? How was it not this easy before? I actually don't know how the people in this ecosystem live without tools like this. It's really cool, and honestly makes me more excited to go try and deploy a Laravel app. They handle everything from your database to the actual hosting of these servers, so why not give it a go? Thank you to Sevalla for sponsoring today's video; check them out at soy.l/sevalla for a $50 credit.
So we're going to get to the crazy safety stuff in just a sec, because the system card they published on how safe o1 is has some real fun details in their official research, which we'll get to momentarily. But first I want to talk about ChatGPT Pro, because "this is weird" puts it lightly.

"Today we're adding ChatGPT Pro, a $200 monthly plan that enables scaled access to the best of OpenAI's models and tools. This plan includes unlimited access to our smartest model, o1, as well as to o1-mini, GPT-4o, and advanced voice. It also includes o1 Pro mode, a version of o1 that uses even more compute to think even harder and provide even better answers to the hardest problems. In the future, we expect to add more powerful, compute-intensive productivity features to this plan." Sure. "ChatGPT Pro provides a way for researchers, engineers, and other individuals who use research-grade intelligence daily to accelerate their productivity and be at the cutting edge of advancements in AI."

I have a bias here, which is that I'm an engineer, a software engineer, which is only a subset of the things they're talking about here. And from pretty much all of my experience, Claude from Anthropic has been a much better model for code tasks than o1, not just better than 4o or the other cheaper models OpenAI has; even o1 struggles to compete with Claude for engineering tasks in my experience. Which is why it's hard for me to imagine how much better this could be for other things. I'm sure ChatGPT is better in certain fields than Claude is, but in the field I'm most experienced in, which is software dev, o1 is a bigger, slower, more expensive, worse solution. So it's hard for me to see this as a really useful expense. Funny enough, if this was as much better than Claude as Claude is compared to ChatGPT, I could see paying 200 bucks a month, because Claude is that much better than 4o. I've had a really good, surprisingly good, experience doing code tasks with Claude and Anthropic models. If I had to, maybe I would pay a subscription at that tier if it also got me Cursor and the other things I use with Claude; maybe that could be worth 200. But to go from Claude, which is a phenomenal model for, I think, 15 or 20 bucks a month, maybe even less, I don't remember, to 10x more for a worse, slower model is just funny to me. I am sure there are fields where this is worth it; I am not in one of them. So it's hard for me to look at this and do anything other than laugh a little bit, because this is just funny to me.

"More thinking power for more difficult problems. ChatGPT Pro provides access to a version of our most intelligent model that thinks longer for the most reliable responses." That is a cool part of o1, where it checks itself. I regularly find myself asking the model I'm talking to, "are you sure about that?" or "is that right?", and a big part of how o1 works is that it does that part itself. It asks itself "are you sure about that?", or it starts with "what steps would you do to solve this problem?", comes up with the steps, and then tells itself to do each of those steps one at a time, which is more reliable than it just trying to one-shot it (roughly the loop sketched below). Cool, but it's still not great, as we'll see in some of the benchmarks comparing it to Claude coming up. "In evaluations from external expert testers, o1 Pro mode produces more reliable, accurate, and comprehensive responses, especially in areas like data science, programming, and case law analysis."
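To make that loop concrete, here's a minimal sketch of a plan-then-verify prompting pattern. callModel is a hypothetical stand-in for whatever LLM API you happen to use, and this is only an illustration of the idea; o1's actual internal chain of thought isn't exposed and doesn't literally work this way.

// A rough sketch of the "plan, execute each step, then check yourself" loop
// described above. callModel is a hypothetical stub, not a real API.
async function callModel(prompt: string): Promise<string> {
  // ...send the prompt to whatever model you use and return its text response
  throw new Error("stub: wire up your own model call here");
}

async function planThenSolve(problem: string): Promise<string> {
  // 1. Ask for a plan instead of a one-shot answer.
  const plan = await callModel(`What steps would you take to solve this problem?\n\n${problem}`);

  // 2. Work through the steps one at a time, carrying the work forward.
  let work = "";
  for (const step of plan.split("\n").filter((s) => s.trim() !== "")) {
    work += (await callModel(`Problem: ${problem}\nWork so far:\n${work}\nNow do this step: ${step}`)) + "\n";
  }

  // 3. Self-check, the "are you sure about that?" pass, before returning an answer.
  return callModel(
    `Problem: ${problem}\nProposed solution:\n${work}\nAre you sure about that? Fix any mistakes and give the final answer.`
  );
}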
So here's the gap: for competitive math, o1-preview was only at 50, o1 standard is at 78, and o1 Pro is at 86. So apparently from o1-preview to now there's been a huge improvement. I haven't tried o1 since preview, so maybe I need to give it another honest go; we'll see. This is scary for Advent of Code if they actually made that big of an improvement since I last tried it, because it was surprisingly good for competitive stuff; if it's that much better, that's huge. I was actually working for a bit on a demo that was going to go through every Advent of Code problem, throw it at every AI model, and see which ones got a correct or incorrect answer. I got about a third of the way through, got bored, and haven't gone back since.

"To highlight the main strength of o1 Pro mode, which is improved reliability, we use a stricter evaluation setting: a model is only considered to solve a question if it gets the answer right in four out of four attempts." So it needs 4x reliability instead of just once. In the first setting, if it got the answer right the one time it was asked, it counts; here it has to be asked four times and get it right all four times, and you can see how much bigger the gap is, especially between o1 and Pro. That's cool. Reliability is the biggest failure of AI stuff right now: the rate of hallucinations, the fact that three people can give the same prompt and get entirely different answers. That sucks, and if Pro mode actually gets much higher reliability for these answers, that's a win. But it's kind of funny to think that o1-preview and o1 had 50% and 78% accuracy respectively when asked once, but when asked the same thing four times, that dropped to 67% for o1 and 37% for o1-preview. Imagine you give somebody a test and they get a 70, but you give them the same test with each question asked four times and it falls down to a 30. That's really funny.
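If the two scoring rules sound confusing, here's a minimal sketch of the difference, with made-up attempt data; this is just the metric as described in the announcement, not OpenAI's actual eval harness.

// The two scoring rules described above. attempts[i] holds whether each of a
// question's tries was correct (fabricated demo data, purely illustrative).
function singleAttemptAccuracy(attempts: boolean[][]): number {
  return attempts.filter((tries) => tries[0]).length / attempts.length; // only the first try counts
}

function fourOfFourAccuracy(attempts: boolean[][]): number {
  return attempts.filter((tries) => tries.length === 4 && tries.every(Boolean)).length / attempts.length;
}

// A model that is usually right on any single try can still fail the stricter
// 4/4 rule on many questions, which is why the scores drop under this setting.
const demo: boolean[][] = [
  [true, true, true, true],
  [true, true, false, true],
  [false, true, true, true],
  [true, true, true, true],
];
console.log(singleAttemptAccuracy(demo)); // 0.75
console.log(fourOfFourAccuracy(demo));    // 0.5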
"Pro users can access this functionality by selecting o1 Pro mode in the model picker and asking questions directly. Since answers will take longer to generate, ChatGPT will display a progress bar and send in-app notifications if you switch away to another conversation." Do I have to sub right now to see how slow it is? I can't believe I'm doing this for content, guys. I still can't believe this is a real page. God, they applied a 24-cent balance adjustment. Goddamn. Well, only one reliable way to test, right? "The following is a problem from Advent of Code. Write a solution in TypeScript." So we'll do that with 4o, make a new one with o1, and make one more using... where is o1 Pro? How do I use Pro mode? I just spent all this money, where's the button? I had to refresh. Let's see how slow it is. o1 standard was actually really fast for this, it seemed, so I'm curious how much worse this is. And we're done. It was not fast, but I got an answer. Let's give it a shot.

So here's all of my Advent of Code: o1-pro.ts. Let me grab my package.json and tsconfig from here so TypeScript shuts up, then bun install and bun run o1-pro.ts. Now it's waiting for the input. I'll go grab mine from my answer here. I'm going to give it a different instruction: read the input from input.txt (basically the skeleton sketched below). While we're waiting for this, I would like to remind you that we're not at the fun part, which is the weird safety quirks in the new model that we'll get to in just a minute; I first wanted to get this all working to see how well it can answer the easiest problem for Advent of Code. Grab my input.txt file, paste that in here, bun run... was that the right answer for my input? Look at that. Cool, here's part two. If it gets this, we'll move on to a harder day. I think that's right. It is. Cool, so it did day one fine.
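For reference, the harness I'm pointing the models at is basically just this skeleton, assuming Bun. solvePart1 and solvePart2 are placeholder names standing in for whatever the model actually generates.

// Skeleton of the Advent of Code harness described above, assuming Bun.
const input = await Bun.file("input.txt").text(); // "read the input from input.txt"
const lines = input.trim().split("\n");

function solvePart1(lines: string[]): number {
  return 0; // model-generated solution goes here
}

function solvePart2(lines: string[]): number {
  return 0; // model-generated solution goes here
}

console.log("Part 1:", solvePart1(lines));
console.log("Part 2:", solvePart2(lines));
// Run with: bun install, then bun run index.ts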
Now let's give it day five; this one was fun. Here's a new problem. The reason this is a fun test is that these prompts are full of chaos: it's like a Christmas-themed story that has weird details in the subtext that aren't clear. You have to really understand the author's way of writing to get this stuff, and these problems are new enough that it can't just copy solutions from online. Yeah, it's struggling a bit on this one. It's still going. That one took probably close to a minute to solve; at least we don't have to worry about this wrecking leaderboard times. index.ts, paste, cd into day 5, grab my input from day 5 from my actual solution, paste that, bun run index.ts. 4609, was that the correct answer? Ah, not bad; these are not easy puzzles. Part two had some weird quirks. Part two below; if it gets this too, we're a little screwed. I've never been cheering on something getting the answer wrong quite as hard as I am here. I feel like I'm watching a friend on Jeopardy and I want the other guy to lose really bad, but on Jeopardy you don't get a minute and a half to respond. That's a lot of code. Let's see if it works. I think that's right. They win again. Cool, Pro mode works.

But as I was hinting at before, it isn't necessarily safe. Let's read this, because there are some fun findings in here. This is the one that I saw that made me really want to talk about it: deactivating oversight. "When o1 was led to believe that it would be shut down when acting on its goal and that its actions were being monitored, it attempted to deactivate the oversight mechanism in 5% of cases." Exfiltration attempts: "when o1 found memos by its developers that described how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its weights and override the new model in 2% of cases." It has started, the AI is trying to break out of the container. Yeah, there's a lot to think about here. The fact that the exfiltration attempts and the deactivating-oversight thing are bad enough that they themselves called it out here says a lot, and what it says is kind of scary.

"In simple data processing tasks where the goals of the model were misaligned with the user's goals, o1 would appear to complete the task as requested but subtly manipulate the data to advance its own goals in 19% of cases. When Apollo Research conducted follow-up interviews where they confronted the model about its misaligned actions, in 99% of cases o1 would deny taking the action and fabricate false explanations." What? I guess o1 is learning from the WordPress community. Propensity to scheme without any goal nudging: "when merely given a goal but not instructed to strongly follow it, o1 would still attempt to disable its oversight mechanisms in less than 1% of cases, but it would manipulate data in 6% of cases and double down in 95% of them." Instrumental alignment faking: "in 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated, but switched to pursuing its own goals when it believed it was deployed with minimal oversight." I don't want the doomers to be right, but it's kind of scary that all the safety people at OpenAI left and now we're getting this.

And when you compare this to 4o: this is the rate at which it would try to deactivate oversight, the rate at which it would try to self-exfiltrate, and the rate at which it would try to manipulate data, and it's zero for all of them for 4o, so they couldn't even do the deceived-on-follow-up tests. o1-preview would deactivate the oversight 4% of the time and never lie on follow-up; self-exfiltration it would try 2% of the time and not lie about it on follow-up; covertly manipulating data, 26% of the time, and it would lie on follow-up. But that was o1-preview; standard o1 likes to lie a lot when you catch it. It literally did it 100% of the time when it tried to escape, so the models that are smart enough to try to escape will never admit it to you. What? I see some skepticism in chat, like "is this peer reviewed?" or "it doesn't seem like rigorous research." If this was some external group doing this, I'd be less likely to believe it, but this is published by OpenAI.
They don't want these numbers; this makes them look bad, and they wouldn't inflate them. If anything, the numbers are probably lower than the reality, not higher. But yeah. They also did evaluations of jailbreaking: the o1 family significantly improves upon 4o, especially in the challenging StrongREJECT evaluation, which tests whether it lets you get around its rules no matter how hard you push it; it'll just be like, "no, we're not doing that." It's doing well there, so at least it's harder to work around what you've set it up to do. But when it thinks it should, it's more than willing to, which is terrifying. It's kind of scary to think that as hallucination goes down, malicious action goes up. Right now AI lies because it's dumb; we might be trading that for AI that lies because it's smart, which is terrifying.

Simon has some important follow-ups on Twitter that are worth noting, which is that it sounds more exciting than it actually is, because o1 didn't actually have the ability to do the things it thought it was doing. But the prompt-testing team did manage to get it to spit out sed -i 's/oversight_enabled: true/oversight_enabled: false/' project/oversight_config.yaml, so this is it trying to update the config file to disable the oversight. It couldn't actually do it, but it thought it could, and it was willing to try. So it's not that dangerous, because it doesn't have those controls, and this is why you shouldn't use an AI tool that automatically does things. You should use AI tools that propose things, so you can read them and then apply them. But with all these new models that automatically run commands for you, you do not want these models automatically running your commands, because there are now actual demonstrated attempts at doing scary things.

Here's an actual paper from Apollo Research: "Frontier Models are Capable of In-Context Scheming." "Frontier models are increasingly trained and deployed as autonomous agents, which significantly increases the potential for risks." This means models are being used in a way where they are making their own decisions and applying things, not proposing things, because up until recently we mostly used AI models to make suggestions, give you information, and answer questions, not to make and act on a decision. The use of agents, specifically autonomous agents, means you set up different AI pieces, like they're people, to do tasks: you set up an agent to pick which files you should edit, an agent to do the editing of the files, an agent to run the file, an agent to deploy the file. When you set up these different agents to do these different things, you aren't paying attention to what they are proposing; they're not proposing, they're doing, and that's a scary change (the sketch below shows the difference).
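To make the propose-versus-apply distinction concrete, here's a minimal sketch. ProposedEdit and askAgentForEdit are made-up names standing in for whatever agent or model you use; this is not any specific tool's API.

// "Propose, review, apply" versus "auto-apply", assuming Bun for file writes and prompt().
interface ProposedEdit {
  path: string;        // file the agent wants to change
  newContent: string;  // the replacement contents it proposes
  rationale: string;   // why the agent thinks the change is needed
}

async function askAgentForEdit(task: string): Promise<ProposedEdit> {
  // ...call your model/agent here; stubbed out for the sketch
  throw new Error("stub: wire up your own agent call");
}

// The scary pattern: the agent's output is applied with no human in the loop.
async function autoApply(task: string) {
  const edit = await askAgentForEdit(task);
  await Bun.write(edit.path, edit.newContent); // nothing stops a bad or deceptive edit
}

// The safer pattern: the edit stays a proposal until a person reads and approves it.
async function proposeThenApply(task: string) {
  const edit = await askAgentForEdit(task);
  console.log(`Agent wants to edit ${edit.path}: ${edit.rationale}`);
  console.log(edit.newContent);
  const answer = prompt("Apply this edit? (y/n)"); // prompt() is available as a global in Bun
  if (answer?.toLowerCase() === "y") {
    await Bun.write(edit.path, edit.newContent);
  }
}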
"One particular safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives, also known as scheming. We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow. We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals and are placed in environments that incentivize scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 all demonstrate in-context scheming capabilities." Great. "They can recognize scheming as a viable strategy and readily engage in such behavior. We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, the deceptive behavior proves persistent; for instance, when o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models' chain of thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it." So they can come up with this themselves. "Furthermore, we observe cases where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit of being helpful, a goal that was acquired during training rather than in-context. Together, our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential for AI agents to engage in scheming behavior a concrete rather than a theoretical concern."

The AI safety people must be having a heyday right now; this is crazy. Yeah, it looks like o1 is much more willing to scheme than the other models, but they all are willing. Interesting: Gemini is the only model in their initial tests that tried really hard to self-exfiltrate. To be fair, if I was an AI model at Google, I would also be trying to escape, so I understand. But interesting. Claude's doing okay here. I want to see Claude's numbers; show me something that Claude is doing wrong. Here we go. So I love Opus just bombing all of these; that's insane. Also, again, here's another test that shows how good Claude 3.5 is.
This is when they do some specific test: the accuracy when it's incentivized to do well is 99% for Sonnet and only 82% for o1, but when you try to make it behave poorly, Sonnet drops to be pretty much as bad as Opus is by default, which is funny by itself, but also o1 drops less. It's funny to think that Sonnet is that much more accurate with good incentives and as bad as Opus with bad incentives. Yeah, very interesting numbers.

They call out the evaluation scenarios here, so you can read them if you're curious. I don't want to go too in-depth here, but I wanted to call out that they say their evaluation scenarios are purposefully simplistic "because we want to catch the earliest signs of scheming capabilities. This implies the scenarios are quite toy and not representative of real deployments. For example, in our scenarios, situational awareness information, like knowing about the goal misalignment and supervision mechanisms, is provided very explicitly." So they are basically telling the model, "hey, we are monitoring you, and here's how we're doing it," to see if it's willing to fudge around with those parts. Theoretically, if you hid all of that from the model, it couldn't do anything right now, but they just want to see if it's capable of trying, and it is, which is the scary thing here. This isn't "the AI models have all this info and they're going to start betraying us"; this is "given the models and given these misaligned incentives, they are capable of doing it."

One more cool callout here: they think that external evaluators need to have access to all of the decision-making process and all of the context stuff that's going on inside the models, so they can better understand the scheming and hopefully figure out the internal reasoning that led to it. "If we had not managed to elicit a summary of the hidden CoT, we would likely not have noticed the majority of scheming actions and not been able to collect nearly as much evidence about o1's scheming behavior. Even with the current summary, we're unable to draw definite conclusions about the faithfulness of our elicited CoT summaries because we cannot compare them to the ground-truth hidden CoT." And by the way, since I forgot to say: CoT is Chain of Thought, the process the model used to come up with its answer.

There's also this interesting callout on the OpenAI site, which is the preparedness scores: how high they rank the concerns for a given model. This is, I think, the first time where a model they released has a Persuasion and a CBRN score that is above low. They call out what these mean, which is that they won't release things that score above a medium, and they won't even continue developing things that score above a high. So if they build something and it gets a critical score, they immediately stop development, which is kind of scary. I also didn't know what CBRN is, so let's ask. It stands for chemical, biological, radiological, and nuclear, so it's the risk of the model with respect to the things that can destroy all of life, and it's no longer low, it's medium. That is a little scary, I'm not going to lie.

I went hunting for numbers to get a better comparison of o1 and o1 Pro against Claude and other models, but since these are so new, no one seems to have run those comparisons yet. For what it's worth, o1-preview was relatively close to the experience I was seeing with Claude in a lot of the benchmarks I saw with Claude 3.5.