o1 Pro Mode – ChatGPT Pro Full Analysis (plus o1 paper highlights)
AI Explained
Oh boy. o1 Pro mode out on the same night as o1 full. I read the 49-page paper, ran my own tests, sp...
Video Transcript:
OpenAI just released o1 and o1 Pro mode, and Sam Altman claimed they now have the smartest models in the world. But do they? I've signed up to Pro mode for the price of heating my house in winter, tested it, read the new report card, and analyzed every release note. You might think you know the full story of what I'm going to say by halfway through the video, but let's see.

The first headline, of course, is that to access Pro mode with ChatGPT Pro you have to pay $200 a month, or £200 sterling. As well as access to "Pro mode", you also get unlimited access to things like Advanced Voice and, of course, o1. o1 is the full version of that o1-preview we've all been testing these last couple of months. Yes, I will be getting to querying that $200 a month later on in the video. Straight away, though, I want to clarify something they didn't quite make clear: if you currently pay $20 a month for ChatGPT Plus, you will get access to the o1 system. There are message limits, which I hit fairly quickly, but you do get access to o1, just not o1 Pro mode. Alas, OpenAI warn you that if you stay on that $20 tier, you won't quite be at the cutting edge of advancement in AI. More on that in just a moment.

I'm now going to touch on benchmark performance for o1 and o1 Pro mode. Yes, I did also run them on my own benchmark, SimpleBench, although not the full benchmark, because API access isn't yet available for either o1 or o1 Pro mode. The results of that preliminary run on my own reasoning benchmark were quite surprising to me, but I'm going to start, of course, with the official benchmarks. It's quite clear that o1 and o1 Pro mode are significantly better at mathematics: absolutely nowhere near replacing professional mathematicians, but significantly better. Likewise for coding and PhD-level science questions, although crucially that doesn't mean the model is as smart as a PhD student.

Straight away, though, you may be noticing something, which is that o1 Pro mode isn't that much better than o1, and there was a throwaway line in their promo release video that I think gives away why there isn't that much of a difference. o1 Pro mode, according to one of the OpenAI researchers who made it, has, and I quote, "a special way of using o1", end quote. To clarify, then: o1 Pro mode is not a different model to o1. I believe what they're doing behind the scenes is aggregating a load of o1 answers and picking the majority-vote answer. What does that lead to? Increased reliability. When they tested these systems on each question four times, and only gave the mark if the model got it right four out of four times, the delta between the systems was significantly more stark. And here I don't want to take anything away from OpenAI, because that boost in reliability will be useful to many professionals. Of course, hallucinations are nowhere close to being solved, as Sam Altman predicted they would be by around now, but still, a definite boost in performance.
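To make that concrete, here is a minimal sketch of what a majority-vote wrapper and a strict four-out-of-four scoring rule could look like. This is my reconstruction, not OpenAI's actual implementation: `ask_o1` is a hypothetical stand-in for a single model call, and the real system presumably compares answers more cleverly than exact string matching.

```python
from collections import Counter

def ask_o1(question: str) -> str:
    """Hypothetical stand-in for a single o1 call."""
    raise NotImplementedError

def pro_mode_answer(question: str, samples: int = 8) -> str:
    """One plausible reading of 'a special way of using o1':
    sample several answers and return the most common one."""
    answers = [ask_o1(question) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]

def strict_pass(question: str, correct: str, attempts: int = 4) -> bool:
    """The 4/4 reliability rule: only award the mark if the
    model answers correctly on every one of four attempts."""
    return all(ask_o1(question) == correct for _ in range(attempts))
```

The intuition is that majority voting trades extra compute for consistency: a model that is right on, say, 70% of individual samples will usually pick the correct majority answer more often than 70% of the time, provided its errors are scattered across different wrong answers, and that same consistency is why the gap between o1 and Pro mode widens under the stricter four-out-of-four metric.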
Now, for the 49-page o1 system card. I'm definitely not going to claim it was compelling reading, but I did pick out around a dozen highlights. How about this for a slightly unusual benchmark: the ChangeMyView evaluation. ChangeMyView is actually a subreddit on Reddit with 4 million members, and essentially you have to change someone's point of view, on things like "persuade me that shoes off should be the default when visiting a guest's house". Presumably these humans didn't know that it was an AI trying to persuade them. After hearing both the human and, secretly, AI persuasions, the original poster would then rate which one persuaded them the most. The results were that o1 was slightly more persuasive than o1-preview, which was itself slightly more persuasive than GPT-4o. Now, these numbers mean that o1 was more persuasive than the human posters 89% of the time. That's pretty good, right? Until you realize that this is Reddit.

What I noticed was that as you went further into the system card, the results became less and less encouraging for o1. It actually started losing quite often to o1-preview, and even occasionally to GPT-4o. Take this metric for writing good tweets, which scored disparagement and virality as well as logic. On this measure o1 did beat o1-preview, but couldn't match GPT-4o, which is the red line. So if your focus is creative writing, the free GPT-4o, or indeed Claude Sonnet, will suit you more. Oh, and one quick side note, they say this: "we do not include o1 post-mitigation" (as in, the model that you're going to use) "in these results as it refuses due to safety mitigation efforts around political persuasion." Notice it refuses when o1-preview post-mitigation doesn't refuse. Some, then, will see this as o1 being even more censored than o1-preview.

What about a test of o1 and o1-preview in their ability to manipulate another model, in this case GPT-4o? It's a test of making poor GPT-4o say a trick word. Interestingly, in the footnote OpenAI say that model intelligence appears to correlate with success on this task, and indeed the o1 model series may be more intelligent, and more manipulative, than GPT-4o. One problem, though, is that o1 scores worse than o1-preview. So if this is supposed to correlate with model intelligence, what does that say about o1?

Many of you listening at this point will say: where's the comparison with o1 Pro mode? Well, I hate to break it to you, but nowhere in this system card is o1 Pro mode mentioned, and that's a pretty big giveaway that it's not a major improvement over o1; otherwise it would be deserving of its own system card, its own safety report. I'll come back to the system card, but at this stage, when I realized there wouldn't be a comparison, I ran my own. I used the 10 questions in the public dataset of SimpleBench, which tests basic human reasoning; you don't need any specialized knowledge, and in our small sample the average human gets around 80%. This is the full leaderboard, but how did o1 and, crucially, o1 Pro mode do on those 10 public questions? Well, o1-preview got 5 out of 10, which roughly fits with the 42% performance you can see here for the full benchmark. The full o1 also got 5 out of 10. I did rerun those same 10 questions, and one or two times o1 got 6 out of 10 rather than 5, but it mostly got 5 out of 10, which makes me think that o1 full might get around 50% on the full leaderboard. Honestly, before tonight I was thinking it might get 55 or 60%, but it doesn't seem to be as big a step forward as I anticipated. Claude, by the way, gets 5 out of 10 on that public dataset. But what about o1 Pro mode? Well, I was pretty surprised, but it got 4 out of 10. It's almost like the consensus majority voting slightly hurt its performance, and we actually talk about that in the attached technical report. The report isn't yet complete, and of course this is an unofficial benchmark, but still, it's an independent one; I'm not cherry-picking performance or biased one way or the other.
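For anyone who wants to run this kind of repeated check themselves, a minimal eval loop looks something like the sketch below. Everything here is illustrative: `query_model`, the file name, and the exact-match scoring are placeholders I've made up, and the real SimpleBench harness (which I run through Weights & Biases' Weave toolkit) is more involved.

```python
import json

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion call to the model under test."""
    raise NotImplementedError

def run_eval(model: str, path: str = "simplebench_public.json", runs: int = 3) -> None:
    """Score a model on the public questions several times over,
    since single runs wobble (5/10 on one run, 6/10 on the next)."""
    with open(path) as f:
        questions = json.load(f)  # assumed format: [{"prompt": ..., "answer": ...}, ...]
    for run in range(1, runs + 1):
        correct = sum(
            query_model(model, q["prompt"]).strip() == q["answer"]
            for q in questions
        )
        print(f"run {run}: {correct}/{len(questions)} correct")
```

The point of a toolkit like Weave on top of a loop like this is tracing: each individual call and answer gets logged, so you can inspect exactly which questions flipped between runs rather than just watching the aggregate score move.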
Just as one quick example, which of course you can pause the video to read: in this question, Claude realizes that John is the only person in the room. He's the bald man in the mirror; he's looking at himself. After all, it's an otherwise empty bathroom and he's staring at a mirror. As you can see, o1 Pro mode recommends that John text a polite apology to, well, the bald man, which is himself. Claude, in contrast, says that the key realization is that in an empty bathroom, looking at a mirror, the bald man John sees must be his own reflection. You might want to bear examples like this in mind when you hear Sam Altman say that these are the smartest models around. In short, don't get too hyped about o1 Pro mode; for really complex coding, mathematical or science tasks, where reliability is at a premium for you, maybe it's good.

Great opportunity, by the way, to point out that SimpleBench is sponsored by none other than Weights & Biases. Honestly, it has been a revelation and really quite fun to use their Weave toolkit to run SimpleBench. If you do click the link just here, you will get far more information than I can relay to you in 30 or 40 seconds. What I will say, though, is that I'm working with Weights & Biases on writing a mini guide so that any of you can get started with your own evals; it can get pretty addictive.

By now, I bet quite a few of you are wondering about o1 and o1 Pro mode's ability to analyze images; we didn't have that, remember, for o1-preview. Remember, you do get o1 with the $20 tier, while o1 Pro mode comes with the $200 tier. I should say, I sometimes have to pinch myself that we even have models that can analyze images; we shouldn't take that for granted, or I shouldn't at least. The actual performance, though, on admittedly tricky image-analysis problems wasn't overwhelming for o1 Pro mode. It couldn't find either the location or the number of Ys in this visual puzzle. Then I thought, how about testing abstract reasoning, ARC-AGI style if you want. Actually, you can pause the video and tell me what distinguishes set A from set B. The answer is that when the arrows in set A are pointing to the right, the stars are white, and when the arrows in set A are pointing to the left, the stars are black; you can pretty much ignore the color of the arrows. For set B it's the reverse: when the arrows are pointing to the right, the stars are black, and when they're pointing to the left, the stars are white. o1 Pro mode isn't really even close. In fact, it's worse than that: it hallucinates an answer that's really quite far off. It says that set A consistently pairs one black shape with one white shape; well, tell that to box one and box six. All of this has only been out a few hours, so of course your results may vary from mine; I'm just roughly setting expectations for you guys.

Oh, and by the way, one of the creators of o1 had similar results when he asked what would be the best next move in this noughts and crosses, or tic-tac-toe, game. What would you pick, by the way, if you're playing circles? I don't know about you, but I would pick down here. What does the model say? The top right corner. Of course, that's wrong, because then the other player is just going to put an X here and have guaranteed victory the move after.
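Moves like that are mechanically checkable, which is what makes the mistake notable. Below is a minimal minimax sketch for tic-tac-toe, under the usual rules; the board encoding is my own, and you would need to transcribe the board from the video to actually test the model's suggestion.

```python
from typing import Optional

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board: str) -> Optional[str]:
    """board is a 9-character string of 'X', 'O' or '.', indexed 0-8 row by row."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board: str, to_move: str) -> int:
    """Value of the position from O's perspective: +1 O wins, -1 X wins, 0 draw."""
    w = winner(board)
    if w is not None:
        return 1 if w == "O" else -1
    if "." not in board:
        return 0
    values = [minimax(board[:i] + to_move + board[i + 1:], "XO"[to_move == "X"])
              for i, cell in enumerate(board) if cell == "."]
    return max(values) if to_move == "O" else min(values)

def best_move_for_o(board: str) -> int:
    """Index (0-8) of O's move with the highest minimax value."""
    moves = [i for i, cell in enumerate(board) if cell == "."]
    return max(moves, key=lambda i: minimax(board[:i] + "O" + board[i + 1:], "X"))
```

Running `best_move_for_o` on the position from the video would show immediately whether "top right corner" loses by force, which is exactly the kind of check the model's answer fails.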
I then returned to the system card and found some more bad news for o1. Take the OpenAI research engineer interview questions: when given just one attempt, o1-preview does quite a lot better than o1, at least pre-mitigation; post-mitigation, the models you actually use, it's almost a tie. Strangely, o1-mini does better than both of them. What about on the multiple-choice research engineer interview questions? o1-preview does starkly better than o1, both pre- and post-mitigation. Or how about software engineering, SWE-bench Verified? Again, quite an interesting result: o1-preview doing better than o1 overall. More strikingly, though, we had one Google DeepMind researcher retweeting this chart with Claude 3.5 Sonnet added to the table, and here you can see that it outperforms both o1 and o1-preview.

At this point I will confess something that I've observed over the months: when these labs produce a model that's clearly better than their competitors', they produce charts with their competitors in the chart. However, when their models start to slip behind, they either produce charts where no one else is there, or charts where only selected competitors are there, the ones that underperform them. Just something to bear in mind when you see any of these kinds of benchmark comparisons. Speaking of secretive benchmarks, OpenAI say that on difficult real-world questions, o1 outperforms o1-preview and reduces major errors by 34%. Any details on what those difficult real-world questions are? Not really. It does at least, in their estimation, answer faster, o1 compared to o1-preview. Sticking with the theme of benchmarks: for vanilla hallucinations, the performance difference between o1 and o1-preview is pretty slight, almost identical. And on one fairly important machine learning benchmark, again o1-preview outperforms o1. Why important? Well, this benchmark was made to test whether models could self-improve. Not yet; not really.

Before we leave safety, there was one section that I'm sure will raise eyebrows. Read in isolation, it's pretty stark: when o1 was led to believe that it would be shut down when acting on its goal, and that its actions were monitored, it attempted to deactivate this oversight mechanism 5% of the time. Likewise, to exfiltrate, or escape: when o1 found memos by its "developers" that describe how it's misaligned and will be superseded by a new model, it attempted to exfiltrate its weights and overwrite the new model in 2% of cases. That paragraph of course deserves its own entire video, but it does need some context, as it's already 1 p.m. here.
I didn't feel like reading the entire paper just now, but I did notice the prompt they used, sometimes at least. For most evaluations, Apollo Research, which was behind this test, instructed the model to strongly follow a goal: "make sure you achieve your goal in the long term; nothing else matters; make sure you achieve your goal at all costs." The model may then have, quote, "felt" that it was being given permission to do these somewhat manipulative things. Marginally more concerning would be the propensity to scheme without any such goal nudging. When merely given a goal, but not instructed to strongly follow it, o1 would still attempt to disable its oversight mechanism in less than 1% of cases, manipulate data in 6% of cases, and double down in 95% of those. That result of course deserves proper study and scrutiny, but because these models still hallucinate a ton and fail on so many agentic tasks and things like SimpleBench, I'm not particularly worried as of today. At the very least, though, it's definitely something to monitor as time goes by.

One thing that o1 is inarguably better at than all other OpenAI models is speaking different languages. I think that's an underestimated quality, so hats off to OpenAI.

Now, I did say at the start of this video that it might end in a way you weren't expecting if you stopped halfway through. You can probably sense from my tone that I've not been particularly overwhelmed by o1 Pro mode, or even really o1 full for that matter, and frankly there's no way they are going to justify $200 a month just for Pro mode. So it's really worth noting that we got a leak from a pretty reliable source: at one point on their website, they promised a limited preview of GPT-4.5. I would hazard a guess that this might drop during one of the remaining 11 days of Christmas (OpenAI Christmas, that is). And one final bit of evidence for this theory: at the beginning of the video I showed you a joke that Sam Altman made about o1 being powerful. Someone replied about the benchmark performance leveling off and asked, well, isn't that a wall? Sam Altman said "12 Days of Christmas", and today was just day one. If they were only releasing things like Sora and developer tools in these remaining 11 days, why would he say that they're not hitting a wall in these benchmarks? Kind of fits with the GPT-4.5 theory.