OpenAI's New AI GPT-o1 STUNS The ENTIRE INDUSTRY Surprises Everyone! (STRAWBERRY RELEASED!)
TheAIGRID
Video Transcript:
So OpenAI have finally announced and released their new large language model, which is called OpenAI o1, and it is arguably the smartest model in the world. This is a model that has been highly anticipated, and it is one that is so smart that I think you're going to want to watch this entire video until the end, because some of the capabilities are truly remarkable. So let's take a look at everything that has happened, and I'll explain all the key details you'll want to know.

Right here you can see that it says "Learning to reason with LLMs: we are introducing OpenAI o1, a new LLM trained with reinforcement learning to perform complex reasoning." o1 is quite different from standard models like ChatGPT in that it thinks before it answers, meaning it can produce a long internal chain of thought before responding to the user: the model lays out a plan, walks through that plan, and then gives a final output. Now, one of the most incredible things about this model is that it currently exceeds human PhD level on a variety of benchmarks. It clearly states here that OpenAI o1 ranks in the 89th percentile on competitive programming questions on Codeforces, which is absolutely insane, because it means the model is at expert level, something that previously only a system from Google could achieve, and only with huge amounts of compute. It also places among the top 500 students in the United States in a qualifier for the USA Mathematical Olympiad, and it exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems. They say that while the work needed to make this model as easy to use as current models is still ongoing, they are releasing an early version, o1-preview, for immediate use today in ChatGPT and the API. So if you're wondering whether this model is actually out today: yes, it is. And if you're wondering whether it's out in the EU just yet, it's likely to be delayed by a few hours, something like six to eight, so just be patient and eventually you will see the model appear in your menu.
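Since the preview is live in the API, here is a minimal sketch of what calling it can look like with the official openai Python client. The prompt string is just an illustrative placeholder, and the launch-time restrictions mentioned in the comment are my understanding rather than something stated in the video:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Minimal sketch: o1-preview "thinks" internally before answering.
# At launch it reportedly did not support system messages, temperature,
# or streaming, so the request is kept deliberately bare.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many Rs are in 'strawberry'?"}],
)
print(response.choices[0].message.content)
```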
Now, one of the craziest things about this model is that it was trained with large-scale reinforcement learning, which means the model learns to think productively using its chain of thought in a highly data-efficient training process. They state: "We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pre-training, and we are continuing to investigate them." For those of you unsure what that means, they're basically saying that this approach scales really well, and right now they're figuring out how on Earth to keep scaling it, because the model keeps getting smarter with more train-time compute, and it keeps getting smarter when it's given more time to think, the test-time compute. In other words, they currently don't see any limits, apart from compute, on how smart these models are going to get.

If we take a look at the graph, the implications are even more shocking: this looks like some kind of new scaling law. The crazy thing is that the graph shows that as train-time compute goes up, accuracy during training keeps going up, and likewise, as test-time compute goes up (plotted on a log scale), the accuracy of o1 keeps going up too. It doesn't take a genius to figure out what these models will probably be able to do given more compute and more resources. The reason I find this so fascinating is that it suggests we may have entered a new paradigm in how these AI models are trained and delivered to users: train-time and test-time compute are both scaling remarkably, and accuracy increases with compute. For the many individuals who have doubted that compute is all you need, this shows that in the new paradigm, compute might be the most important route to extra performance. Combine that with chain of thought and reinforcement learning, and we have a system whose future intelligence I truly can't fathom, considering that we're still limited by our levels of compute.

Now, if we look at what this model can do, we can turn to the evaluations: "To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and machine learning benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting." What we see here are three different models: o1-preview, which is essentially a distilled version of o1, and then the full o1, which won't actually be available today. o1-preview is the distilled version of o1 (or Strawberry, or Q*, whatever you want to call it), and it is the model available today; as for the full o1, given that we are restricted by compute, it's likely that model arrives sometime next year or further in the future. The most important thing here is the remarkable gap between GPT-4o and o1-preview. There's barely any similarity on these benchmarks; it's almost apples to oranges, with o1-preview simply dwarfing GPT-4o in raw performance on challenging tasks. On competition math it's almost a four-times increase, on Codeforces almost a six-times increase, and on PhD-level science questions (GPQA Diamond) there's a remarkable jump that even, shockingly, surpasses expert human level, which is a whole new paradigm in terms of how we view ourselves on the scale of intelligence. This is genuinely groundbreaking, and these aren't the only benchmarks. Trust me when I tell you that these benchmarks surprised even me, and I'm someone who definitely expected remarkable performance from these models. But let's take a look at what else is going on.
This is where we take a look at the GPT-4o versus o1 improvement across four different areas. On the machine learning benchmarks there is quite the improvement on MMMU, MMLU, MATH-500, and MathVista; noticeably, MATH-500 is at 94.8%, which is a remarkable jump. The main thing to understand about this release is that the model performs much better on math and other tasks that require long reasoning steps. We can see the same in chemistry, physics, and biology, and across many of the AP exams. What's incredible here is that it says o1 rivals the performance of human experts: recent frontier models do so well on the GSM8K and MATH benchmarks that those benchmarks are no longer effective at differentiating models. Basically, these models have somewhat saturated those benchmarks, so they're no longer useful for telling models apart. So instead, they evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT-4o only solved 12% of problems (1.8 out of 15), while o1 averaged 74% (11.1 out of 15) with a single sample per problem, 83% (12.5 out of 15) with consensus among 64 samples, and 93% when re-ranking 1000 samples with a learned scoring function. What they're stating here is absolutely incredible, and most people won't appreciate why: getting 74% with a single sample is remarkable because it means you input a single prompt and take the model's one response. Of course, using many samples will largely improve the score, but getting such a dramatic result from a single response is incredible, and at 93% with re-ranking, the improvement over GPT-4o is stunning.
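To make "consensus among 64 samples" concrete, here is a hedged sketch of the idea, often called self-consistency: sample many independent answers and take a majority vote. The sample_answer function here is a hypothetical stand-in for one model call that returns only the final answer:

```python
from collections import Counter

def consensus_answer(sample_answer, problem: str, n: int = 64) -> str:
    """Majority vote over n independent samples (the 'cons@64' idea).

    sample_answer: hypothetical callable that runs the model once on
    `problem` and returns just its final answer as a string.
    """
    answers = [sample_answer(problem) for _ in range(n)]
    # The most common final answer wins; ties break arbitrarily.
    return Counter(answers).most_common(1)[0][0]
```

Re-ranking 1000 samples with a learned scoring function is the same family of trick, except that instead of a majority vote, a trained verifier scores every sample and the top-scored one is returned.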
We can also see how it compares to PhDs. They evaluated o1 on GPQA Diamond, a difficult benchmark that tests for expertise in chemistry, physics, and biology. To compare models to humans, they recruited experts with PhDs to answer GPQA Diamond questions, and they found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. Interestingly enough, they do state that these results do not imply that o1 is more capable than a PhD in all respects, only that the model is more proficient at solving some problems that a PhD would be expected to solve. You can also see that with its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts there. Overall, what we see is once again incredible: this is the first model to surpass the performance of human experts on GPQA, which is supposed to be remarkably difficult, and on top of that its vision perception capabilities are competitive with human experts, so we can expect those vision capabilities to prove remarkably strong once tested across a variety of areas.

This is where we get to the coding section, and my, oh my, is there a lot to cover. This is where they talk about further fine-tuning a version of o1, and this version managed to perform far better. It says this model competed in the 2024 IOI under the same conditions as the human contestants: it had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem. It then goes on to state that with a relaxed submission constraint they found the model's performance improved significantly, and when allowed 10,000 submissions per problem, the model achieved a score of 362.14, above the gold medal threshold, even without any test-time selection strategy. That is a remarkable statement, considering it was only a few months ago that Google demonstrated the ability to get silver at the International Mathematical Olympiad; once again, it seems like OpenAI might be raising the bar even further. Finally, they simulated competitive programming contests hosted by Codeforces to demonstrate the model's coding skill, with evaluations that closely match the competition rules and allow for ten submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors, while this fine-tuned model far exceeded both GPT-4o and o1, achieving an Elo rating of 1807 and performing better than 93% of competitors. A rating of 1807 actually puts it at Candidate Master level, the highest rating for any AI system that I've ever seen, which makes it the current state of the art at coding. Absolutely incredible.
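For a sense of what an Elo gap of 1807 versus 808 means, here is the standard Elo expected-score formula. This is not from the video, and note that the 93rd-percentile figure is about standing among all rated competitors, not a head-to-head win probability:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo formula: expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# The fine-tuned model (1807) against a GPT-4o-level player (808):
print(round(elo_expected_score(1807, 808), 3))  # ~0.997
```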
Now, for those of you wondering how this model actually works internally, and how they managed to get a model this smart, some of the tricks are in how the model was trained: with reinforcement learning, and specifically trained to use a chain of thought when responding. Chain of thought basically means that instead of immediately responding to a problem, as prior models do, you lay out the problem step by step, then come to a solution based on those steps, verifying along the way that each step you take will eventually lead to a good solution. What we can see here is an insane example where GPT-4o is pitted against o1-preview, with both tasked to decipher/decode a ciphertext using the example provided. It's a one-shot example: we are shown gibberish text that converts to the text "Think step by step," and then told to use that example to decode another jumbled text, which I would have no clue how to do. o1-preview manages to get it right (the final words are "there are three Rs in strawberry"), while GPT-4o outputs words that are just completely wrong and asks for additional decoding rules for the cipher. The really nice thing is that in this demo we can actually see the chain of thought, which unfortunately you won't be able to see in the released model. That chain of thought is really, really long. If we click this button, it says things like "First, what's going on here? We are given an example: think step by step," and "Our task is to use the example above to decode this gibberish, so the first part is to figure out how this was decoded." If I scroll down, the amount of work being done is absolutely incredible: the model works through step after step, arguably hundreds of steps, before coming to a final solution, sometimes checking its own work, and then finally outputting the response. The final output we get is a very condensed rendition of the internal chain of thought, but showing it in this small demo is really powerful, because we get to see firsthand how much work is being done behind the scenes. We can also see this in the coding section, where there is an extremely long chain of thought that we can show and hide; in the math section, where there is another long chain of thought for multi-step math word problems; the same in the crossword example; the same in science; and the same in the healthcare niche, which is rather fascinating because we can watch it use step-by-step reasoning to come to a diagnosis. I have no doubt this is going to become remarkably adept at diagnosing individuals with remarkable accuracy.
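For the curious, the trick in the published cipher demo appears to be that each plaintext letter is the average of the alphabet positions of a pair of ciphertext letters. That reading is my inference from the one-shot example, not something the video spells out, but it checks out on the example text; a minimal sketch:

```python
def decode_pair_average(ciphertext: str) -> str:
    """Decode the demo cipher: each pair of ciphertext letters averages
    (by alphabet position, a=1..z=26) to one plaintext letter."""
    words = []
    for word in ciphertext.lower().split():
        letters = []
        for i in range(0, len(word) - 1, 2):
            a, b = ord(word[i]) - 96, ord(word[i + 1]) - 96
            letters.append(chr((a + b) // 2 + 96))
        words.append("".join(letters))
    return " ".join(words)

print(decode_pair_average("oyfjdnisdr rtqwainr acxz mynzbhhx"))
# -> "think step by step"
```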
Now, continuing with coding, there are two videos that I would love to show you.

"All right, so the example I'm going to show is writing code for a visualization. I sometimes teach a class on Transformers, which is the technology behind models like ChatGPT. When you give a sentence to ChatGPT, it has to understand the relationships between the words: it's a sequence of words, you have to model that, and Transformers use what's called self-attention to do it. I always thought that if I could visualize this self-attention mechanism, with some interactive components, it would be really great; I just don't have the skills to do that. So let's ask our new model, o1-preview, to help me out. I just typed in this command, and we'll see how the model does. Unlike previous models like GPT-4o, it thinks before outputting an answer, so while it's thinking, let me show you some of the requirements I'm giving it. The first is to use an example sentence, 'the quick brown fox.' The second is that when hovering over a token, it should visualize edges whose thicknesses are proportional to the attention scores, meaning that if two words are more relevant to each other, the edge between them is thicker, and so on. One common failure mode of existing models is that when you give them a lot of instructions to follow, they can miss one, just like humans can if you give too many at once. Because this reasoning model can think slowly and carefully, it can go through each requirement in depth, which reduces the chance of missing an instruction. It output the code, so let me copy and paste it into an editor, save it as HTML, and open it in the browser. You can see that when I hover over a token, it shows the arrows to 'quick' and 'brown' and so on, and when I hover away they go away, so that's a correctly rendered version. When I click on a token, it shows the attention scores, just as I asked for. Maybe there's a little bit of overlap in the rendering, but other than that, it's actually much better than what I could have done. So this model did really nicely; I think this can be a really useful tool for me to come up with a bunch of different visualizations for my new teaching sessions."

So there we got a direct example of o1 performing a multi-step reasoning task that involves coding a web page with specific features, something that would prove quite difficult for current state-of-the-art systems, and it goes to show just how advanced o1-preview is.
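Under the hood, the attention scores that would set those edge thicknesses come from scaled dot-product attention, softmax(QK^T / sqrt(d)). Here is a toy sketch with random embeddings; a real Transformer would also apply learned query/key projection matrices, which are omitted here for brevity:

```python
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights: softmax(Q @ K.T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Toy example: 4 tokens ("the", "quick", "brown", "fox"), embedding dim 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = attention_weights(X, X)  # self-attention: tokens attend to each other
# W[i, j] is how much token i attends to token j, i.e. the thickness a
# visualization like the one in the demo would give the edge between them.
```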
There is also this video highlighting more coding capability: "I want to show an example of a coding prompt that o1-preview is able to do but previous models might struggle with, and the prompt is to write the code for a very simple video game called Squirrel Finder. The reason o1-preview is better at prompts like this is that when it wants to write a piece of code, it thinks before giving the final answer, so it can use the thinking process to plan out the structure of the code and make sure it fits the constraints. So let's try pasting this in. To give a brief overview of the prompt: the game Squirrel Finder has a koala that you move using the arrow keys; strawberries spawn every second and bounce around, and you want to avoid the strawberries; after three seconds a squirrel icon comes up, and you want to find the squirrel to win. There are a few other instructions, like putting 'OpenAI' on the game screen, displaying instructions before the game starts, etc. First, you can see that the model thought for 21 seconds before giving the final answer, and during its thinking process it was gathering details on the game's layout, mapping out the instructions, setting up the screen, etc. Here's the code it gave; I'll paste it into a window and we'll see if it works. You see, there are instructions, and let's try to play the game. Oh, the squirrel came very quickly, but oops, this time I was hit by a strawberry. Let's try again: you can see the strawberries appearing, and let's see if I can win by finding the squirrel. Looks like I won."

Now, if you're wondering about some of the other benchmarks, you can see right here that o1 completely dwarfs GPT-4o on the traditional benchmarks too. And while there aren't any ridiculous jumps in performance there (I say that only in the sense that this is already state of the art), I think most people are underestimating the raw capability on display, in terms of how smart this model truly is, for it to perform multi-step reasoning across such a wide range of tasks. You can pause the video and look at these, but some of the most notable are competition math, competition code, and GPQA Diamond, which are among the most difficult tasks for AI systems. And for the standard ones, these scores are all pass@1, which is remarkable considering that scores like these on MATH, MMLU, and MMMU previously seemed unattainable.

What's also interesting about this model is the human preference data: o1-preview is only preferred for subjects that require more calculation. For mathematical calculation, the win rate versus GPT-4o is a lot higher, and the same goes for data analysis and computer programming, but in personal writing and editing text, the win rate versus GPT-4o doesn't exceed 50%, which means GPT-4o is most likely superior for personal writing, as rated by human voters.

Now, one of the most insane things about this model that you probably do want to know is that it has a limit of 30 messages a week, meaning that when the model appears in your ChatGPT, depending on what region you're in, understand that you only get 30 messages a week, which works out to only about four messages a day.
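As a footnote on the "pass@1" metric mentioned above: pass@k is the probability that at least one of k sampled solutions is correct, and the standard unbiased estimator for it (popularized by the HumanEval/Codex paper, not shown in the video) looks like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: draw n samples, of which c are correct;
    returns the probability that a random size-k subset contains a hit."""
    if n - c < k:
        return 1.0  # fewer than k wrong samples, so every subset has a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=30, k=1))  # pass@1 is just the raw success rate
```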