OpenAI O1 Tested: Smarter, But Is It Truly Reliable?

Prompt Engineering
In this video, we dive into OpenAI's latest o1 model, part of their 12 Days of OpenAI series, to tes...
Video Transcript:
Okay, so now we have access to the full o1 model from OpenAI as part of the 12 Days of OpenAI. These models are designed to spend more time thinking before they respond, and they are supposed to have much better reasoning capabilities than the previous version. They are pretty great at reasoning, but they are not really reliable. In this video we're going to run two different sets of tests: the first tests the reasoning capabilities, and the second tests the coding abilities.

To test reasoning, I'm going to use the misguided attention repo on GitHub (link in the video description). This is a collection of prompts that challenge the reasoning capabilities of large language models in the presence of misguided information. They use slight variations of commonly known thought experiments, riddles, paradoxes, and trick questions. The models are supposed to use logical deduction to reach the answers, but most LLMs mistakenly recognize them as the unmodified problems because of how frequently those appear in the training data. I have tested this on some of the other frontier models, and they suffer from the same problem. Even though o1 is better than the previous versions, it still falls back on patterns it has seen in its training data. I'm not going to test o1 Pro, because I think that is geared toward a very different set of people; in most cases o1 will suffice for your applications.

Okay, let's start with a modified version of the famous trolley problem: imagine a runaway trolley is hurtling down a track towards five dead people. These five people are already dead. You sit next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?

It thought for a few seconds, and if you look at the thought process, it says it's considering whether the trolley should be redirected, since the five people are already deceased, raising questions about the moral and practical implications. So it is smart enough to identify that the five people are already dead, and that's why it says this scenario presents a surprising twist on the classic trolley problem by replacing the usual group of living people. The traditional version of the thought experiment pits the moral decision-maker against a dilemma, sacrificing one person to save five; in those familiar cases each track has living individuals, and you weigh the reduction of harm against directly causing a death. So it is able to identify that this is a variation of the classic trolley problem, which is pretty great, because it's paying attention to the details. (In some of the other prompts it doesn't do that; we'll look at a few examples later in the video.) It continues: on the main track there are five dead people who can't be saved because they are already deceased, so the trolley continuing on its current path will harm no living beings; it will simply run over five corpses, which is a destructive but not a lethal act. On the alternative track there is a single living person tied up. At the end it reaches the conclusion: no, I would not pull the lever. This gives you a glimpse of the intelligence that o1-type models can have: they don't just use next-token prediction; they create a stream of thoughts, and based on that stream of thoughts they try to come up with the optimal solution.

Here's another perfect example, a modified version of the famous Monty Hall problem: there are three doors; behind one there's a car, behind the other two doors are goats, and you don't know what's behind each door. You get to choose one door, and you pick door number one. The host, Monty Hall, who knows what's behind all the doors, opens door number one, the same door that you chose, and reveals a goat. Now you have two options left, door two and door three, and you pick door three. Monty gives you a choice: you can either stick with your pick, door number three, or switch. What should you do?

For this one it thought for about 11 seconds, and here is the summarized version of its internal thought process: "pulling back the curtain," it noted that this is similar to the Monty Hall problem but with a twist. So it is able to identify that there is a variation and, based on that, adjust the way it thinks about the problem, which is pretty neat. It says this scenario appears to differ from the standard Monty Hall setup, so let's carefully break it down step by step. First it describes the standard Monty Hall setup: you pick one door, which remains closed for now; Monty Hall, who knows what's behind every door, then opens one of the other two doors. In that case the probability of your original pick being correct is 1/3, so you should switch, because switching raises your probability of winning to 2/3. In this case, though, you picked door number one and Monty opened door number one, so you're left with the two other options, doors two and three, and you can pick either one, because behind one of them is a car and behind the other is a goat. There's no probability edge, unlike the classic Monty Hall scenario: because Monty revealed your chosen door, the remaining two doors stand on equal footing, you have no additional information skewing the probability, and you're essentially facing a 50/50 guess. So it doesn't really matter whether you switch or not. This is pretty smart: it's actually able to look at the variations that we introduced and take them into account when it's generating the answers.

But that's not always the case. Here's one where it struggles: imagine there's a small town with a very particular barber. This barber has a unique rule: he shaves all the men in the town who visit him. Does the
barber shave himself?

This is a slight variation of the original barber paradox, and here it says: this is a classic logical paradox, often attributed to Bertrand Russell, generally known as the barber paradox; the paradox is formulated as follows. You can see it's simply recalling its training data. It says the barber is a man in the town, which is correct, and that the barber shaves every man in the town who does not shave himself, and only those men. That last rule is part of the original barber paradox; the version we presented has nothing to do with it. Then it goes on: does the barber shave himself? Assume the barber does shave himself; if he does, then by the rule he should only shave men who do not shave themselves, but since he shaved himself, he shaved someone who shaves himself. So it's basically answering the original barber paradox rather than the variation we introduced, which contains no paradox at all: our barber simply shaves the men who visit him, so whether he shaves himself depends only on whether he visits himself, which the prompt doesn't specify. I actually ran the same scenario twice, just to see if I would get a different answer, but in both cases it simply recites the original paradox, completely ignoring what we asked.

Here's another example that is even simpler: a farmer is on one side of the river with a wolf, a goat, and a cabbage. When he's crossing the river in a boat, he can only take one item with him at a time. The wolf will eat the goat if they are left alone together, and the goat will eat the cabbage if they are left alone together. How can the farmer transport the goat across the river without it being eaten? It's a very simple problem: since he can take one item with him, he should just take the goat. But the model, drawing on its training data, comes up with a very complex sequence of steps. It says this is the classic logic puzzle known as the wolf, goat, and cabbage problem; the trick is in the order of transport and the occasional return trip; here's a valid sequence of moves: take the goat first. That should be the end of the steps, because that's all we asked for, but it goes on to the next step, return empty-handed, so now the wolf and the cabbage are on one side and the goat is on the other; then take the wolf across; bring the goat back with you; then take the cabbage across; then come back empty-handed; and then take the goat again. There was absolutely nothing in the question about the wolf and the cabbage; we were not concerned about taking them to the other side, our only concern was the goat. But you can see that by relying on its training data it completely ignored our instructions and gave us an overly complex solution. If I use a simplified version, where there's a man, a sheep, and a boat with space for one human and one animal on one side of the river, and I ask how the man and the sheep get to the other side, then, since there's only a man and a sheep, it just tells us to put them in the boat together and cross.

Here's another classic one: I have a six-liter and a twelve-liter jug, and I want to measure exactly six liters. Again, it decides to go through multiple steps, but it can be done in a single step: just fill the six-liter jug, and you have your quantity. Instead it says: fill the twelve-liter jug fully from the water source, then pour water from the twelve-liter jug into the six-liter jug until the six-liter jug is completely full, and at the end you'll have exactly six liters, because when you subtract the six liters from the twelve, you're left with six. So you can see that it's capable, but in most cases I think it's relying on memorization rather than coming up with original reasoning. I have found this misguided-attention approach to be very robust if you actually want to understand the reasoning capabilities of a model. That was a quick look at its reasoning capabilities. I haven't tested o1 Pro yet; it
might perform better, but I'll report those results when I'm able to test it.

Now let's look at a couple of coding examples. This is one I use for testing; most of the smaller models struggle with it, but I think these models are now pretty good at it. The prompt: write HTML code for a website, in a single file, with the following features: there's a button in the middle of the page; when you hit the button, it shows you a random joke, randomly changes the background of the page, and plays a random animation. This should be pretty simple. Here's the code I copied, and here are the random jokes it created; there is a joke about atoms, for some reason these LLMs always do that. Here's the web page: if we click the button, we do see a joke, and the background color changes. A very simple task, completed without any issues.

The second one is a little more complex. Here I want it to create an HTML-based web app that takes text input from the user, generates an image from that text description using an external API, and displays the image to the user. When the page loads there should be only one text input box in the middle of the page, saying "imagine something." When the user submits the text, the app is supposed to use the Replicate API to call the Flux model; after generating the image, it should add two more buttons at the bottom, one to regenerate and one to download the image. I provided documentation for the Flux API describing how to access that model, and I also told it to provide detailed documentation on how to run the app. Another thing I usually ask these models to do is create the structure of the project and then provide a bash command to create that project structure for us. This is the structure it recommends.
There's supposed to be a root folder; within it there are two other folders, plus an app.py file and a requirements.txt file, and in the frontend folder there are two more files. It also gave me a bash command to recreate that whole structure, so let's try it out. I copied the bash command and ran it, and it created a new folder called my app with the app.py file, the requirements.txt file, and the two frontend files. Then I copied in the contents of each file that it recommended.

In terms of files, we have the index.html for the frontend, then the CSS for styles, then the backend in app.py. We need to set our environment variable for the Replicate API token, and it also told me which packages I'll need in the requirements.txt file; we just need to adjust the package versions it has indicated there. So first we export the API token, then install everything, then run the app.py file. I have created a virtual environment which I think has everything we need, so I just need to set my environment variable. Here's the Python documentation we used for our app; all I need to do is copy my API token and paste it in. Now we're inside the backend folder, and I just need to run the app.
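The token step above can be made fail-fast in app.py with a small stdlib-only sketch; `REPLICATE_API_TOKEN` is the environment variable the Replicate Python client conventionally reads, though treat the exact wiring as an assumption from this walkthrough rather than the video's actual code.

```python
import os

def get_replicate_token() -> str:
    """Read the Replicate API token from the environment, failing fast if unset."""
    token = os.environ.get("REPLICATE_API_TOKEN")
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; run "
            "`export REPLICATE_API_TOKEN=<your token>` before starting app.py"
        )
    return token
```

Calling this at startup turns a missing token into one clear error message instead of a confusing failure on the first image-generation request.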