Okay. So back in September OpenAI released o1, and it was proclaimed as the first reasoning model, and we saw them make a lot of claims about it.
We saw them release o1-mini and o1-preview, and supposedly in the next couple of weeks they're going to release the actual full o1 model, which is going to be interesting when it comes out. In this video, what I want to look at is exactly how fast open source is catching up, because literally in the past week or two we've seen a number of different companies release open-weights versions of these kinds of models.
It's interesting to see how close they actually come to the OpenAI models, and what I want to investigate is how much of an edge OpenAI really has at the moment, especially with all these Chinese open-weights models breathing down their neck. So just to recap a little: OpenAI released this model along with a bunch of new evals and benchmarks at the time.
Probably the big ones they focused on were three main benchmarks: competition math (AIME, which I think is aimed at high school students trying to qualify for the Math Olympiad), Codeforces (a coding one), and GPQA, the Google-proof PhD-level science questions. "Google-proof" here basically means that even if you gave people access to Google, the chances of them getting the right answer would still be very low.
Now, of course, at the time OpenAI didn't actually say how this model was made. One of the key things, obviously, was that they were using some kind of test-time compute, because the inference generation was quite different depending on the difficulty of the question. And if we scan through some of the Twitter posts from the people who worked on it, we can certainly see that chain of thought, and training chain of thought into the model, was a really key factor.
So just to revise a bit: up until recently, a standard LLM was created with a very large amount of pre-training. We're talking about web-scale pre-training, where the model is just doing autoregressive prediction of the next token. And then it would have some kind of post-training.
Traditionally, most of the post-training that people have talked about publicly has been some kind of supervised fine-tuning or instruction tuning, followed by reinforcement learning, whether that's RLHF (reinforcement learning from human feedback) or RLAIF (from AI feedback). And then at inference time, you didn't use a lot of compute; you would just do a forward pass through the model.
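Just to make that pre-training objective concrete, here's a minimal sketch in PyTorch, my own illustration rather than anyone's actual training code: every position in the sequence is trained to predict the token that follows it.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer ids from web-scale text;
    # `model` is any decoder-only LM mapping token ids to logits
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                    # (batch, seq_len-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions for CE
        targets.reshape(-1),
    )
```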
Now, with this new way of creating reasoning models, you still have the pre-training and post-training, but your post-training will be quite different: you might have some kind of self-play or RL setup for generating and scoring the next chain of thought or reasoning trace to come out. And we saw this very clearly in a paper that came out from OpenAI a year and a half ago, which is what led to the whole "Strawberry" model concept, et cetera.
That's the idea of "Let's Verify Step by Step" as you go through, and you can use RL to do that. But the big difference is often going to be at inference time, where you produce these reasoning traces or trees.
You can imagine that you've got something that scores them, where you keep the best ones and discard the others, and by gradually building up these reasoning traces you come to the right conclusion in the end.
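One of the simplest versions of that idea is best-of-N sampling with a verifier. Here's a hedged sketch: `generate` and `verifier_score` are hypothetical stand-ins for a sampling LLM and a trained verifier or reward model, since OpenAI hasn't said what o1 actually does at inference time.

```python
# Sample several reasoning traces, score each with a verifier,
# and keep only the best one.
def best_of_n(question, generate, verifier_score, n=8):
    traces = [generate(question) for _ in range(n)]    # n chains of thought
    scored = [(verifier_score(question, t), t) for t in traces]
    scored.sort(reverse=True, key=lambda pair: pair[0])
    return scored[0][1]                                # keep the best trace
```

More elaborate variants score partial traces step by step or search over trees rather than independent samples, but the keep-the-best-scored-trace idea is the same.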
As I mentioned before, OpenAI had some key papers on this. Back in 2021, they talked about training verifiers for this kind of thing. In 2023, they did the "Let's Verify Step by Step" work. And even people who worked on the original chain-of-thought paper, like Jason Wei, have since moved from Google to OpenAI.
So this whole idea of using chain of thought for reasoning traces is definitely something we know a lot of the researchers at OpenAI have been very interested in. The interesting thing when the o1 models came out, or rather when o1-mini and o1-preview came out, was that OpenAI made out that this was something really difficult to do, that it required a lot of different styles of training and obviously a lot more compute for inference, et cetera. And the general narrative they were pushing was that this was something only they would be able to do,
and that only other proprietary model companies would be able to catch up. In fact, if we ask o1-preview itself how long before open source is able to reproduce OpenAI's o1 models, we see it saying that the open source community typically lags behind the state of proprietary models by one to two years,
and that it might take anywhere from 18 months to three years. So that brings us to the open models. Over the space of the last two weeks, we've literally seen three different AI labs come out with their own reasoning models. In this video, I'm going to look at the DeepSeek-R1-Lite-Preview,
which they've got up for use on their site, and they're saying they're going to release the weights. We'll look at the Qwen QwQ model, for which the weights are already out.
And we'll look at the Marco-o1 model, which comes out of Alibaba International. Now, all three of these are coming out two to two and a half months after the OpenAI release. This is not 18 months later.
This is not three years later. This is not even nine to twelve months later, like catching up to GPT-4 was. We're seeing real acceleration in the more open lab models out there.
Let's jump in and take a look at some of these models, see what they can do, and see how close they actually come to o1-preview and o1-mini.
My guess is that the ball is going to be back in OpenAI's court in the next week or two to release the full o1, as the open-weights labs are catching up really quickly. All right, let's jump in.
Okay. So the first one we'll look at is the DeepSeek-R1-Lite-Preview. I think this was the first one released, about a week and a half ago now.
We can see that they're running the same benchmarks as OpenAI, and their numbers show them doing substantially better than o1-preview on some of these. I don't think OpenAI actually released o1-mini numbers for all of these benchmarks,
so I think people have basically just taken the o1-preview numbers. Now, they're getting 52 on the AIME benchmark, and my guess is that's a single-sample (pass@1) number.
For comparison, o1-preview was getting 44 on that, o1-preview with more samples and consensus voting was getting about 56, and the full o1 is supposedly getting around 83 at its peak.
So it's going to be interesting to see when the full o1 comes out, but clearly this is a good result, and it's a similar story with the MATH benchmark. Interestingly, on the PhD-level GPQA questions, they're actually quite a bit behind o1-preview.
I wonder if that's just down to o1-preview being a bigger model, et cetera. We should note the o1 scores too; I'll flash them up here. The full o1, as opposed to o1-preview, is supposedly given a lot more test-time compute for answering these.
Now, we don't know how long those answers are going to take. Will it take 15 minutes to get to the end? All of those things are still open questions at the moment. Another nice thing the DeepSeek people have done is put together a chart showing what happens as they increase these reasoning traces.
On the x-axis we've got the average number of thought tokens per problem, and basically, the longer those thoughts are, the better the accuracy gets, which is not surprising. This is one of the reasons the full o1 may do so much better than o1-preview: it just generates much longer
chain-of-thought traces. Now, remember, OpenAI is not showing us those chain-of-thought traces; we can't actually see what they are.
With these open models, they are actually making the traces available for you to look at, learn from, and see how the model thought through the problem. At the time I'm recording this, DeepSeek hasn't actually open-sourced the model weights or released an API for this model.
One of the cool things, though, is that DeepSeek does serve their own models, so you should eventually be able to put a few dollars into an account and use the full-size version of this as an API.
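DeepSeek's platform exposes an OpenAI-compatible API, so once the reasoning model is served, usage should look roughly like this sketch. The model name here is my assumption; at recording time only the chat interface had the reasoning mode, so treat this as illustrative.

```python
from openai import OpenAI

# DeepSeek's platform speaks the OpenAI wire protocol, so the standard
# client works with a swapped base_url.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # hypothetical id for the R1-style model
    messages=[{"role": "user", "content": "How many Rs are in strrawberry?"}],
)
print(resp.choices[0].message.content)
```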
Now, jumping into their chat interface, where we can actually play with the model: one of the first things I do with these models is try to trick them, obviously. The common test is counting the Rs in "strawberry", but if you look very carefully here, I've deliberately spelt strawberry wrong.
So I've got four Rs in there. Now, DeepSeek lets us toggle this Deep Think mode, which is the reasoning version, on or off.
For these first examples, I've started with it off. When I ask how many Rs are in strawberry, it just comes back with three. And you can imagine that nowadays there's so much talk on the internet about how many Rs are in strawberry that any model pre-trained on recent internet data, and I stress recent, is probably going to get it right just because people have talked about it so much.
So this is deliberately trying to confuse it, and sure enough, it comes back and says there are three Rs, when clearly there are not three; there are four. All right, the next one is another common logic question that people use a lot:
Sally has four brothers. Each of her brothers has three sisters. How many sisters does Sally have? Presuming Sally is female,
we can work out that she has two: all four brothers share the same three sisters, and Sally is one of them. But here we get three. It says she has three sisters, which would include herself, right? Now I want to show you that if we turn Deep Think on and ask the same question, we start to see it actually work through the problem.
Now, they don't have any markdown or anything marking out the different chains of thought here, so we've just got one continuous stream of chain of thought. My guess is that each of these is divided into sub-chunks where it's thinking about the problem in different ways.
First off, quite commonly, you want it to rephrase the problem to itself. So you can see: Sally has four brothers; that part seems straightforward.
So there are four male siblings in the family. Then it notes that each brother has three sisters: thinking about one brother, he has three sisters, but Sally is one of those sisters.
So does that mean there are only three sisters in total? You can see it goes through some nice thinking as it works toward the answer, and it comes back with the conclusion that Sally has two sisters.
That fits: Sally plus her two sisters make up each brother's three sisters. All right, next, back to our how-many-Rs question, with strawberry spelt wrong. You can see here it says it needs to figure out how many Rs are in it.
"Let's see. I should look at the problem." And the first thing it does is spell the word out with hyphens in between.
Now, this is pretty impressive, because hyphenating forces the tokenizer to split the word into individual letters rather than keep it as one chunk. This is where a lot of models go wrong: they've already got a token (or a couple of tokens) covering the whole word, and they just stick with that. But we can see this one has managed to split it up, so it can go through letter by letter.
Then you can see it comes back and says positions 3, 4, 9, and 10 are Rs: "that seems like 4 Rs, but wait, let me make sure I didn't miss any or count any extra." So it actually does a check, going through and circling each of the Rs.
And then it finally comes back with the answer: there are 4 Rs in "strrawberry". Again, remember, I deliberately spelled this wrong to see whether it would pick that up. So that was very impressive, for me, with this model.
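As a quick aside, you can check both halves of that yourself: the ground-truth count, and the tokenizer effect. This sketch uses tiktoken, OpenAI's tokenizer library, purely for illustration; DeepSeek's actual tokenizer is different, so the exact split is an assumption.

```python
import tiktoken

word = "strrawberry"                      # deliberately misspelled
print(word.count("r"))                    # 4, at positions 3, 4, 9, 10 (1-indexed)

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(word)))              # a few multi-letter chunks
print(len(enc.encode("-".join(word))))    # hyphenated: near one token per letter
```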
The next thing you can try is different kinds of logical statements and so on. One that impressed me, and I'll just scroll down here (you can see that with this one it did quite a lot of thinking),
is the kind of question I often use for checking reasoning: ask about something in the past and how things would have turned out if one detail had been different. So here I'm asking it to explain the consequences for World War II if the nuclear weapon hadn't been used. You can see it starts off by rephrasing the question to itself again, trying to figure out what would have happened in World War II if nuclear weapons hadn't been used.
"First, I need to understand the role nuclear weapons played in the actual outcome." It goes through that, and then works through different lines of thinking: okay, what are some of the options that could have happened?
One option might have been an invasion of Japan. One of the things I find fascinating here is that it even goes on to think through what the Soviet Union would have done: had the Soviet Union attacked from the north, you might have had a situation where Japan was split in two, similar to how Germany was split in two.
Now, for this kind of question there's no really right or wrong answer, but for me it's really interesting to look at how it reasons to itself about the different possibilities.
It ends up with conclusions covering a continuation of conventional warfare, a possible invasion of Japan, the political and diplomatic implications, the role of the Soviet Union, the economic impact, a whole bunch of angles on the same problem.
And I think that's one of the key features of really good chain-of-thought traces: you get the model to attack the same problem in many different ways. After this, I tried a few other things, like composing a piece of music with prime numbers, which was also very interesting;
it did quite a long thinking process for that. And then I asked things like: I have a certain number of coins; if I group them in certain ways, work out the lowest number of coins I could have.
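This is a classic remainder puzzle, and it's worth seeing how little machinery the ground truth actually needs. The groupings in this sketch are made up for illustration, since the exact ones aren't shown on screen:

```python
def smallest_coins(constraints, limit=10_000):
    # constraints: (group_size, leftover) pairs, e.g. grouped in 3s
    # leaves 2, in 5s leaves 3, in 7s leaves 2 (hypothetical values)
    for n in range(1, limit):
        if all(n % size == leftover for size, leftover in constraints):
            return n  # smallest count satisfying every grouping

print(smallest_coins([(3, 2), (5, 3), (7, 2)]))  # -> 23
```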
Now, interestingly, this one took 23 seconds of inference, whereas OpenAI's o1-preview came up with the exact same answer in five seconds. So I do wonder how relevant that's going to be when the full o1 comes out.
Is it basically a lot more efficient in the way it generates these traces? Now, you have to remember that DeepSeek, if I understand correctly, "only" have 10,000 GPUs, and I say "only" in inverted commas there, but that's certainly way less than most of the top labs have.
So perhaps for inference they're not using as big a stack as OpenAI is, et cetera. All right, let's jump on and look at the next model. The second model comes from Qwen.
These guys are really on a roll. They've had the Qwen 2.5 Coder 32B model, which has done really well (I wanted to do a video about it and just didn't get around to it), and I feel a lot of their models are the go-to defaults now for open-source work. So with QwQ they've introduced this idea of reasoning and reflecting deeply.
They talk a little about it having these recursive reasoning loops going on, and they mention some of the limitations, et cetera. If we come down and look at their benchmarks,
they're kind of interesting: on GPQA they're actually beating DeepSeek quite easily. They're not getting as high as o1-preview, but they're certainly beating the DeepSeek model. On AIME, the Math Olympiad one, they're a bit below DeepSeek-R1, but in this case both of them beat OpenAI's o1-preview, and both are in turn beaten by o1-mini.
So it really shows you that it's going to be hard to benchmark these things unless you give every model the same amount of compute: how many iterations do you allow it to think before it commits to a final answer? You can imagine that the full o1, when it comes out, is just going to be given much more compute and will therefore probably be more expensive, et cetera. On LiveCodeBench, they're a little behind DeepSeek-R1.
But let's jump in and test it on the same questions I gave the previous model. Okay, I'll start with the Sally-has-four-brothers problem. We've certainly come a long way from the early LLMs, which would basically work out four times three and give that as the answer. Here we can see it does quite extensive thinking as it goes through.
Now, there are little steps in here, but again we're missing markdown, so we don't see where each chain-of-thought trace starts and ends. And you can see that it comes down and actually doesn't get this one: it comes back saying she has three sisters.
So on this one, it's not doing as well. If we try the strawberry one, where we've deliberately spelled strawberry with four Rs, you can see it works through and gets that the answer is four. Now, interestingly, I've run these multiple times, and each time the thinking comes out slightly different.
Sometimes what it says to itself is: no, no, strawberry is supposed to have three Rs, I need to check again. And it goes back through and checks, only to confirm that there really are four Rs in the way I've spelt it. So it will often give an answer saying you may have mistyped this, in which case the answer would be three, but as written the final answer is four.
Another thing we see sometimes, which I think is really interesting, and you can see this in a different run of the same question, is that it gets to the four Rs, it gets the positions and so on, but then it falls into a loop. It starts saying: alternatively, perhaps the question is a trick designed to make me think there are more Rs than there actually are; perhaps the answer is four and I'm overthinking it.
And so it goes round this loop, on and on. You can see it basically saying: the answer is 4, I should accept that.
The answer is 4, I should stop overthinking. The answer is 4, I should just write that. The answer is 4, I should conclude here.
The answer is 4, I should finalize my response. This is perhaps a bit scary, right? It's gone into this loop, and the loop just goes on and on.
You can see down here it's still generating as I record this; it's been going for a few minutes at this point, and it seems like it can't come to a final conclusion, just cycling through the loop again and again. I'll leave it running and maybe come back to see what happens.
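As an aside, this failure mode is exactly the kind of thing an inference wrapper could watch for. Here's a hedged sketch, my own illustration rather than anything these models' servers actually do, of a crude loop detector that fires when the tail of the output keeps repeating the same n-gram; real stacks would also lean on repetition penalties and stop conditions.

```python
def is_looping(text, ngram=6, repeats=4):
    words = text.split()
    if len(words) < ngram * repeats:
        return False
    tail = words[-ngram:]
    # count how many times the final n-gram already occurred earlier
    count = sum(
        words[i:i + ngram] == tail
        for i in range(len(words) - ngram)
    )
    return count >= repeats
```

You'd call this on the accumulated output every few tokens and cut generation off once it fires.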
Okay, the last question I wanted to try was the World War II one: what if nuclear weapons hadn't been used? For me this is a much more subjective thing, but it's interesting to look at how it groups the traces together. You can see, again, we've got "first I need to understand the context leading up to..."
So you've got this pattern where these models tend to rephrase the problem to themselves and then build back out from there. Now, when the first o1 models came out, I actually had a go at making something like this myself with a Phi model, a very small model, to see what it could do. And you certainly can get it to generate longer chains of reasoning for this kind of thing.
The challenge becomes how you preseed it for doing that. I gave a whole talk at a meetup at Google about creating a "poor man's o1"; that was a couple of months ago now.
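Roughly, the preseeding idea looks like this. This is a hedged sketch of the general technique, not the actual code from that talk, and the model id is just an example of a small Phi-family model on Hugging Face:

```python
from transformers import pipeline

# Example small instruct model; swap in whichever you have locally.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
)

# Preseed: force explicit steps plus a self-check before the answer.
preseed = (
    "Think through the problem in numbered steps. After the steps, "
    "write 'Check:' and verify each one, then give 'Final answer:'.\n\n"
)
question = ("Sally has four brothers. Each of her brothers has three "
            "sisters. How many sisters does Sally have?")

out = generator(preseed + "Question: " + question,
                max_new_tokens=512, do_sample=False)
print(out[0]["generated_text"])
```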
I'll put the full talk up on the Patreon, where people can go through and watch it if they're interested. Again, with this one, we can see that it almost gets into loops: it comes up with an answer but struggles to reach a conclusion. I don't know if it's exactly repeating itself, but it's certainly circling.
Okay, we've got to a final answer, though we don't have the depth we got from DeepSeek. To be fair to the Qwen team, my guess is that, as they often do, they've released something quite early and will iterate to make it better and better.
So I really expect future versions of this model to improve as they go along, but it's good to see where they're at now. Okay, the third model is Marco-o1.
One of the cool things they've done is actually release a paper. I think at this point they'd be the first to say this is not as fully fleshed out as the o1 models, or perhaps even as the DeepSeek models, but it comes from a different team within Alibaba.
You can see that what they've tried to do is look at how to reproduce the o1s using MCTS, or Monte Carlo Tree Search. The basic idea is that you have the model generate different trees of chain of thought, you have a way of scoring those, and you train the model to generate better trees going forward. The cool thing here is that they've used a very small model, Qwen2-7B-Instruct, as their base model, and just built on that.
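Here's a very stripped-down sketch of that search idea as I read the paper, not their actual code: treat partial chains of thought as tree nodes, expand a few candidate next steps, score them, and search toward high-value complete traces. `propose_steps` and `score` are hypothetical stand-ins for the LLM sampler and a scoring signal (the paper, as I read it, derives the reward from the model's own token confidences).

```python
import math, random

class Node:
    def __init__(self, trace, parent=None):
        self.trace, self.parent = trace, parent   # trace: list of thought steps
        self.children, self.visits, self.value = [], 0, 0.0

def select(node, c=1.4):
    # UCB1: balance high-value children against rarely visited ones
    return max(node.children, key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def mcts(root, propose_steps, score, iterations=50):
    for _ in range(iterations):
        node = root
        while node.children:                       # selection
            node = select(node)
        # expansion: assumes propose_steps returns at least one candidate
        for step in propose_steps(node.trace):
            node.children.append(Node(node.trace + [step], parent=node))
        leaf = random.choice(node.children)
        reward = score(leaf.trace)                 # rollout / evaluation
        while leaf:                                # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).trace
```

The training side then uses the traces the search finds to fine-tune the model, so it gets better at proposing good steps in the first place.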
They've put the model up on Hugging Face, so you can have a play with it, though I don't see an official Space for it, so I'm reluctant to try the unofficial ones in case they're not set up properly. And like I said, I think this one is more of an academic test to see what's going on. One thing I do like, though, is that they've released a dataset alongside it.
They've actually released a whole bunch of these chain-of-thought demonstrations, showing what the different thoughts are and how the model works through them. I think this is one of the things that will separate the people making good models from the also-rans: how do you actually generate interesting chains of thought? Now, there are lots of papers out there about chain of thought;
there have been some really interesting ideas coming out of DeepMind on this, and I think some of these teams would benefit from looking at papers like Promptbreeder and the DeepMind papers that followed it. Now, I'm conscious the video has gone on quite a long time already. The point I'd like to make is not about any one of these models in particular; I think each of these teams is going to build better versions over the coming weeks and months.
For me, the biggest takeaway is just how quickly independent labs, with nowhere near the compute of OpenAI, Google, or even Meta, are able to produce models that get very close to (and sometimes surpass) the benchmarks OpenAI published, even after OpenAI worked on this idea for so long. This is a very common pattern: a fast follower can catch up to a leader just by looking at what they've done and trying out the things they think they've done. And that does seem to be what's happening with reasoning models.
Hopefully in the next week or two we'll see the full o1 come out from OpenAI, and you can imagine we may see reasoning models from Google, and maybe from Meta, in the new year. It's certainly an interesting space to watch.
One of the reasons is that this is not just about bigger and bigger base models; it's about learning to use and scale compute at test time, or inference time, as opposed to building the world's biggest GPU cluster and pre-training for three or six months to get the biggest model out there. So I think it's a really good sign, at least for open-weights models, if not fully open-source ones: a lot of these releases don't give us the training code or the datasets, but they do give us open weights we can run.
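On that note, I think we're going to see people reconfigure tools like Ollama to run these with multiple inference passes. As a hedged sketch of what that might look like, here's a simple self-consistency loop; Ollama really does serve a REST API on localhost:11434, but the model tag is whatever you've pulled locally, and in practice you'd parse out just the final answer before voting.

```python
import requests
from collections import Counter

def self_consistent_answer(prompt, model="qwq", n=5):
    answers = []
    for _ in range(n):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        answers.append(r.json()["response"].strip())
    # majority vote over the sampled answers
    return Counter(answers).most_common(1)[0][0]
```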
All of these things are just getting better over time, and my guess is we'll see local versions of some of these models in the not-too-distant future. So anyway, I'd love to hear your thoughts on these models if you've tried them out.
I think there are actually some other open models doing this kind of thing that have come out in the past few days that I may have missed here; let me know about those. And what is it that you want better-quality reasoning models for, that you can't do with the current models out there?
I know there are definitely uses around code and things like that, but are there other things people are trying to do? That's one of the things I find fascinating about this. Anyway, as always, if you like the video, please click like and subscribe, and I will talk to you in the next video.
Bye for now.