So this is absolutely incredible. Microsoft have released a research paper that talks about an LLM that can truly self-improve. I know it might sound like sci-fi nonsense, but trust me, this is absolutely incredible. Take a look at this research paper, called "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking," and this is pretty crazy, because essentially the model is able to use its own thinking to make itself even smarter. We can see that it says: we present rStar-Math to demonstrate that small language models (SLMs for short) can rival or even surpass the math reasoning capability of OpenAI o1, and this is, guys, without distillation from superior models. That sentence is just crazy. If you don't know what model distillation is, it's a process where you have a larger "teacher" model and you transfer its knowledge into a smaller "student" model: the teacher model's knowledge is essentially distilled down into the smaller model, and the student is fine-tuned to perform similarly to the teacher. You know how we had o1-preview and the actual o1? That was essentially this idea: we had the full o1, and then o1-preview, which was a similarly fine-tuned version that just performed a bit below the capabilities of the teacher model.
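Just so the contrast is clear, here's a minimal sketch of what ordinary knowledge distillation looks like in PyTorch: the soft-label KL loss between a teacher's and a student's output distributions. This is the step rStar-Math is claiming to skip; it's a generic illustration, not anything from the paper, and the logits below are just random placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: train the student to match the teacher's
    output distribution, softened by a temperature."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions,
    # scaled by T^2 as in the classic Hinton-style formulation.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# toy usage: random logits standing in for real model outputs
teacher_logits = torch.randn(4, 32000)   # a large "teacher" LLM (placeholder)
student_logits = torch.randn(4, 32000)   # a small "student" LLM (placeholder)
loss = distillation_loss(student_logits, teacher_logits)
```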
Now, what they're stating in this paper, which is why it's so crazy, is that they can actually beat these large language models without that distillation, so it's absolutely crazy. And they say that rStar-Math achieves this by exercising deep thinking through Monte Carlo tree search, which is basically a form of the AI searching through a range of possibilities, and it's pretty crazy. Now it gets even crazier, because when we take a look at the initial benchmarks, we can see that on the MATH benchmark it improves Qwen2.5-Math-7B, a 7-billion-parameter model, from 58.8% to 90%, and Phi-3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by 4.5% and 0.9%, and it also posts strong results on the USA Math Olympiad (AIME).
Now, if you don't know why this is such a big deal: one of the biggest criticisms people have had is that these small language models have just been trained on the benchmark data itself in order to get those high marks on the benchmarks. The crazy thing here is that this model improves itself in order to get to this level, so it isn't trained on all of that stuff; it's able to improve itself. Now, this is the craziest part, because this is where we have the diagram that provides an overview of the rStar-Math system, specifically focusing on the Monte Carlo tree search, which you can see over here, and on the self-evolution framework. This entire thing is how the model is able to think in a way that allows it to bootstrap itself to greater intelligence. On the left you can see the Monte Carlo tree search deep thinking: this is where the system explores multiple paths; think of it like a person reasoning through the consequences of their decisions, or just like a decision tree. The small language model acts as the policy that generates reasoning steps for the solution, and then we have a process preference model (PPM), which is basically something that scores whether each step is good, and I'll dive more into this later because it's actually pretty cool what they've done. Now, each one of these nodes that you can see here represents a step in solving the problem; you can see right here this one is 0.7, this one 0.5.
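To make the search idea concrete, here's a heavily simplified, toy sketch of step-level MCTS: a stand-in "policy" proposes a few candidate next steps, rollouts are played out to a final answer, and a reward for a correct answer is backed up into per-step Q values. The propose/reward functions here are invented placeholders, not the paper's actual policy or verifier.

```python
import math
import random

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps      # partial solution so far (list of step strings)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.q = 0.0            # running estimate of how good this step is

    def ucb(self, c=1.4):
        # standard UCT: exploit high-Q steps, explore rarely visited ones
        if self.visits == 0:
            return float("inf")
        return self.q + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def propose_steps(steps):
    """Stand-in for the policy SLM: propose a few candidate next steps."""
    return [steps + [f"step-{len(steps)}-option-{i}"] for i in range(3)]

def is_terminal(steps):
    return len(steps) >= 3

def reward(steps):
    """Stand-in for answer checking: 1 if the final answer is 'correct'."""
    return 1.0 if all(s.endswith("option-0") for s in steps) else 0.0

def mcts(root, rollouts=200):
    for _ in range(rollouts):
        node = root
        # selection: walk down the tree by UCB until a leaf
        while node.children:
            node = max(node.children, key=Node.ucb)
        # expansion
        if not is_terminal(node.steps):
            node.children = [Node(s, parent=node) for s in propose_steps(node.steps)]
            node = random.choice(node.children)
        # simulation: random continuation to a final answer
        steps = node.steps
        while not is_terminal(steps):
            steps = random.choice(propose_steps(steps))
        r = reward(steps)
        # backpropagation: steps on paths to correct answers end up with higher Q
        while node is not None:
            node.visits += 1
            node.q += (r - node.q) / node.visits
            node = node.parent

root = Node([])
mcts(root)
best = max(root.children, key=lambda n: n.q)
print(best.steps, round(best.q, 2))
```

Obviously nothing like the real system, but it shows why steps that consistently sit on correct solution paths end up with higher Q values than steps that lead nowhere.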
These are the steps the model is going to take, and you can see right here that the ones that are incorrect are given low values and the ones that are correct are given higher values. This is where the system assigns Q values to each step based on its contribution to the final solution: steps leading to correct answers get higher Q values, and steps leading to wrong answers get lower Q values. They then filter these steps so that only the high-quality ones, the ones in green, are retained to construct the final solution trajectory, and this basically makes sure the AI trains on the best reasoning paths. Now, what's crazy about all of this is the self-evolution framework, because while Monte Carlo tree search is good, you do need to add something on top of it to get a language model to improve itself. This is where we get the four-round process through which the system improves itself, enhancing both the policy (the small language model) and the reward model (the PPM). You can see that there are four rounds here, and this is where you guys want to pay attention. It starts with terminal-guided Monte Carlo tree search, which you can see right here. Then, at round two, it introduces PPM-r2, which scores the results of the previous steps more effectively, and the data generated in this round further improves the model: the policy you see here as SLM-r1 gets upgraded to SLM-r2. At round three, the policy leverages the reward model to directly predict the Q values during the search process, which generates higher-quality solutions and training data, and that trains the next iteration of the model, SLM-r3. Then at round four the final model emerges, with a further-improved reward model and an even stronger policy, and this is where it achieves state-of-the-art performance. This is absolutely incredible, because the model starts here not knowing that much and eventually manages to iteratively improve itself to state-of-the-art through this process, generating all of its own synthetic data along the way.
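Roughly, the self-evolution loop they describe looks something like the skeleton below. Every function here is a simplified stand-in I've written to show how the rounds feed into each other (generate with MCTS, filter by Q value, retrain the PPM, fine-tune the policy); it is not code from the paper.

```python
def run_mcts(policy, reward_model, problems):
    """Generate step-by-step solution trees with MCTS and return
    trajectories annotated with per-step Q values."""
    return []  # stand-in

def filter_trajectories(trajectories):
    """Keep only trajectories that reach a verified-correct final answer
    and whose steps carry high Q values."""
    return [t for t in trajectories if t.get("correct")]

def train_ppm(trajectories):
    """Build step-level preference pairs from the Q values and train the
    process preference model on them."""
    return "PPM trained on %d trajectories" % len(trajectories)  # stand-in

def finetune(policy, trajectories):
    """Supervised fine-tuning of the small language model on the filtered,
    high-quality reasoning traces."""
    return policy + "+1"  # stand-in for an improved checkpoint

policy = "SLM-r0"        # the base small language model
reward_model = None      # round 1 is terminal-guided, no PPM yet
for round_idx in range(1, 5):
    trajectories = run_mcts(policy, reward_model, problems=[])
    best = filter_trajectories(trajectories)
    reward_model = train_ppm(best)   # the reward model improves each round...
    policy = finetune(policy, best)  # ...and so does the policy SLM
    print("after round", round_idx, "->", policy)
```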
Now, when we take a look at the benchmarks, you can see how the model continuously improves its math reasoning capabilities through self-evolved deep thinking, and that by round two the model actually surpasses GPT-4o. We can already see what the base model is capable of: it achieves 58.8 on MATH, gets nothing on AIME, nothing on this other benchmark, and just doesn't do that well. Then it begins the improvement process outlined above, and it jumps up to 75.2, then 10, then 50, then 35 across the benchmarks, improving a decent amount. At round two, after the round-one improvements, the model is already surpassing GPT-4o; at round three it gets even better; and at round four it becomes state-of-the-art, where it can even surpass o1. And remember, guys, this model is a 7-billion-parameter base model, and it's surpassing state-of-the-art models. Of course, this is one specific benchmark, focused on math, but when we think about how crazy that is, we can start to imagine how incredibly smart these models are going to get. If a 7-billion-parameter base model can outperform o1 or GPT-4o, imagine a mixture of experts built from models like these that can improve themselves; what kind of model are we going to get then? Now I wanted to talk about how this model actually improves itself. The basics are that the model generates solutions, then it evaluates those solutions, then it retrains on the best solutions, and then it repeats, and that entire feedback loop continually refines its reasoning abilities. You might be thinking, okay, but what about the unsolved problems? For problems it cannot solve, the model uses multiple reasoning attempts, essentially additional rollouts with new random seeds, until it finds a solution, which ensures the model is always learning from its failures. I don't think you guys understand how crazy this is: after the initial round it generates its own training data, and that is a major shift away from traditional training methods, which rely heavily on distillation from larger teacher models. This approach is incredibly powerful because self-generated training data eliminates the need for manual labeling of large datasets, and in the paper they actually talk about how time-consuming that is. It's also cost-effective and efficient, especially for specialized tasks like math reasoning, and the model isn't limited to its own initial training data, which means it can keep improving over time. Now, remember how in another paper (maybe you watch this channel, maybe you don't) they spoke about what OpenAI's o1 is probably based on: outcome-based reward models, where you just look at whether or not the final outcome is right, versus process-based reward models. With a process-based reward model, say you were doing a math question and writing down every single step: you get rewarded for every step that is correct, so even if the final output isn't correct you still get some reward. With an outcome-based reward model, even if multiple steps are correct, if the answer is wrong we don't care, we throw it away; with a process-based one, if the first three steps are correct and the last two aren't, you're rewarded for the first three correct steps. Now, what I found super interesting is that they actually changed the reward model: they use a process preference model, and they say, "we introduce a novel training method that trains a process preference model by constructing step-level positive-negative preference pairs." Let me explain this for you guys: basically, they've changed what they're doing here; they're using essentially a hybrid of process reward modeling and this PPM method. The process preference model is trained to guide the reasoning process by comparing steps, which avoids assigning precise scores to each step and makes it more robust and reliable, and this is the core innovation in the paper that's used to drive the Monte Carlo tree search. Overall, the PPM is the backbone of rStar-Math, especially in the later rounds, where it outperforms the traditional process reward model approach; they essentially use a process reward signal initially to bootstrap the model, then transition to the more effective process preference model once the system matures, and this hybrid approach balances the benefits of the two.
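My reading is that the PPM is trained with a pairwise ranking objective: for each step, a preferred candidate (one that led to correct answers, i.e. high Q) is pushed above a rejected candidate (low Q), rather than regressing an exact score. Here's a rough sketch of that kind of loss; the variable names and toy scores are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

def ppm_pairwise_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: push the PPM to score the step that led to
    correct answers above the step that led to wrong answers, without
    ever assigning an 'exact' reward value to either."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# toy scores the PPM might produce for a batch of step pairs
score_good = torch.tensor([0.8, 0.3, 1.2])   # steps from correct trajectories
score_bad  = torch.tensor([-0.5, 0.1, 0.4])  # steps from incorrect trajectories
print(ppm_pairwise_loss(score_good, score_bad))
```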
So now let's look at the axes. Along the bottom, on the x-axis, we have the number of sampled solutions, and as the sampled solutions increase, the accuracy increases; that's something we already expect, but it basically shows how many solutions are being sampled. On the y-axis we have accuracy, and we can also see the different models: the green line is rStar-Math, then there are the Qwen models with Best-of-N, which is a different process, and we can also see the o1-mini and o1-preview accuracy marked across all the different benchmarks. Clearly o1-preview wasn't tested on OlympiadBench or College Math, but for o1-mini we can see where its accuracy sits. What's crazy is that as we sample more solutions, the rStar-Math model keeps getting better in its responses, and remember, this is a 7-billion-parameter model going toe-to-toe with much larger models. This does appear to hold for all the models: they all seem to score higher on the benchmarks as sampling goes up. There's a decent first jump from one to four solutions, but as you sample more and more, the gains become smaller and smaller. What's interesting, though, is that even with fewer sampled solutions, rStar-Math performs much more effectively than a lot of these models, and it's really surprising that just sampling more solutions and processing the information in a different way gets the model to higher levels of reasoning.
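Mechanically, the "sampled solutions" axis works like this: the Best-of-N baselines sample N full solutions and let a reward model pick the best one, so accuracy tends to climb as N grows (rStar-Math's own numbers come from sampling MCTS trajectories scored by the PPM, but the test-time-scaling intuition is the same). A toy sketch, with placeholder sampling and scoring functions:

```python
import random

def sample_solution(problem, seed):
    """Placeholder for sampling one full reasoning trace from the model."""
    random.seed(seed)
    answer = random.choice([41, 42, 42, 42, 43])  # toy answers
    return {"trace": f"reasoning for {problem}", "answer": answer}

def score_solution(solution):
    """Placeholder for a reward model scoring a whole trajectory."""
    return 1.0 if solution["answer"] == 42 else 0.0

def best_of_n(problem, n):
    # with more samples, it's more likely at least one candidate is
    # correct for the verifier to pick out
    candidates = [sample_solution(problem, seed) for seed in range(n)]
    return max(candidates, key=score_solution)

for n in (1, 4, 16, 64):
    print(n, best_of_n("some hard math problem", n)["answer"])
```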
Something I keep coming back to is that the rStar-Math model is actually performing better than this 72B model, which just goes to show that the method is really effective, and it tells me that Monte Carlo tree search plus the process preference model is something we may need to explore with larger models. Now that I'm starting to read these research papers in more detail, we're starting to see that maybe the claims of superintelligence aren't that far-fetched. And essentially, the ability to achieve such high accuracy with fewer sampled solutions makes rStar-Math really cost-effective and scalable, particularly for tasks requiring mathematical reasoning. Now, another interesting thing I found in the paper was that, as we discussed, it generates its own data, and the crazy thing is that its own generated data was better than GPT-4's. We can see here that it says even randomly sampled code-augmented chain-of-thought solutions from our small language model yield comparable or better performance than the GPT-4-synthesized NuminaMath and MetaMath datasets, which indicates that the policy SLMs, after rounds of self-evolution, can generate high-quality math solutions, and it says these results demonstrate the huge potential of the method to self-generate higher-quality reasoning data without relying on advanced LLM distillation, which is basically incredible. This statement refers to randomly selected reasoning paths generated by the model during training: even without selecting the absolute best solutions, the random samples are higher quality, and that's pretty crazy, because NuminaMath was synthesized by GPT-4, a much larger and more powerful model, and those datasets, NuminaMath and MetaMath, are considered high quality. So when we talk about surpassing them, that actually demonstrates the strength of rStar-Math and the self-evolution process. The quality of these reasoning solutions isn't just luck; it's the result of self-evolution, and through the iterative rounds we talked about, those four rounds, the SLM becomes skilled enough to generate solutions that rival those produced by an advanced model like GPT-4. This is absolutely crazy now that we are breaking the dependency on large language models.
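A quick aside on what "code-augmented chain-of-thought" means: as I understand the paper, each reasoning step pairs a natural-language thought with a bit of Python that can actually be executed, and steps whose code fails to run can be thrown away. Here's an illustrative example of that idea; the exact format and the toy problem are mine, not copied from the paper.

```python
# One "code-augmented" reasoning step, roughly in that spirit: the
# natural-language thought is a comment, and the actual computation is
# Python that can be executed to check the step.
step = """
# Step 2: the train covers 180 km in 2.5 hours, so its speed is distance / time.
speed = 180 / 2.5
"""

scope = {}
try:
    exec(step, scope)          # steps whose code fails to run can be discarded
    print("step executed, speed =", scope["speed"])   # 72.0
except Exception as err:
    print("step rejected:", err)
```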
And we have to think about this: when we look at how a 7B model is being made to generate data that is better than what GPT-4, reportedly a 1.8-trillion-parameter model, synthesized, we can ask, okay, what about a much larger model using these kinds of methods to improve itself iteratively over time? When we take a look at what that means for the future, it's going to be crazy. Now, something crazy about this paper as well is that they talk about emergent capabilities. Emergent capabilities in AI are capabilities that arise without prediction: they show up without being explicitly trained for, without the researchers specifically wanting them or expecting them. We can see here, in Section 5, the findings and discussion, the emergence of an intrinsic self-reflection capability. It says a key breakthrough in OpenAI's o1 is its intrinsic self-reflection capability: when the model makes an error, it recognizes the mistake and can self-correct with an answer, yet this has consistently been found to be largely ineffective in open-source LLMs, which of course includes the models they're using here, the Phi series and the Qwen series. It says the community has actively explored various approaches, including self-correction and self-reflection, to explicitly train or prompt LLMs to develop such a capability, meaning people have previously used other models or prompting to try to make that the standard way of thinking. But in their experiments they unexpectedly observed that their Monte Carlo tree search driven deep thinking exhibits self-reflection during problem solving. As shown in figure 4, which I'll show you in a second, the model initially formalizes an equation in the first three steps, which would lead to an incorrect answer, but in the fourth step the model recognizes the low quality of its earlier steps and refrains from continuing along the initial problem-solving path; instead it backtracks and re-solves the problem using a new, simpler approach, ultimately arriving at the correct answer. And the crazy thing is that we can actually see this right here: this is where it goes to solve the problem, it gets it wrong, and then this is where it has its intrinsic self-reflection, thinking outside the box to find an easier solution. We can see the scores as the figure progresses: the PPM scores are negative here, and as we go down, the PPM scores increase in value. And remember, they said that notably no self-reflection training data or prompt was included, suggesting that advanced System 2 reasoning can foster intrinsic self-reflection. I want to say I'm now starting to understand where the AI hype is leading, because if they can get a small language model to match the math reasoning capability of OpenAI o1 through a thinking process that is essentially search and test-time compute, through four rounds of self-evolution and multiple sampled solutions, then what happens when we start to apply this to models that are much larger, with much larger compute budgets? It's quite likely we're going to get models that are so smart we won't even know what to do with them, so it's going to be pretty crazy when we think about all the kinds of solutions these models are going to be able to come up with.
And I'm starting to understand why, in a few interviews, certain people at Microsoft were talking about the fact that we could get fully self-improving AI by 2030. Take a look here: "That's a narrow version of AI, when we use it for a specific purpose, right? A more general AI is going to be one where it has things like recursive self-improvement: it could edit its own code in order to get better, right? It could self-improve, or it would have autonomy, it could act independently of your direct command, or you give it a very general command and it goes off and does all sorts of sub-actions that are super complicated, like maybe even invent a new product and create a website for it, and then set up a drop-ship for it, and then go and market it and take all the income and then do the accounts and so on. I mean, I think that's kind of plausible in, say, three to five years, before 2030; I think we'll definitely have that, and it might well be much, much sooner." And remember, it was Eric Schmidt who essentially said that we are going to get self-improving AI in the future, so it makes sense that once we do, we need to be very careful and may need to unplug those models. What he also mentioned was that such a model would be recursively self-improving, meaning the model is able to improve itself in a way that is, I would say, kind of open-ended: after its initial training is complete, the model can just keep improving itself again and again, adding new tools and abilities, which is a completely different kind of intelligence, the kind that could ruin humanity, and that's why people are concerned. Let's do a thought experiment: say the model was able to continually improve itself, to go out and use tools, train on other data, collect things from the internet, upload itself to humanoid robots or whatever, and do AI research. When we think about all of that, and it has access to a large amount of compute, we can start to realize how these things could self-improve beyond our current reasoning. "So we think we're okay now, and we're worried about the future, and the point at which you really want to get worried is called recursive self-improvement. Recursive self-improvement means: go learn everything, start now, and don't stop until you know everything. This recursive self-improvement could eventually allow self-invocation of things, and imagine a recursive self-improvement system which gets access to weapons, and we're doing things in biology that we cannot currently understand, so there is a threshold. Now, my standard joke about that is that when that thing starts learning on its own, do you know what we're going to do? We're going to unplug it, because you can't have these things running randomly around, if you will, in the information space without understanding at all what they're doing." I think when we actually look at this, the conclusion here says that rStar-Math is a self-evolved System 2 deep thinking approach that significantly boosts the math reasoning capabilities of small LLMs, achieving state-of-the-art OpenAI o1-level performance, and that the approach demonstrates that SLMs can self-generate high-quality training data for frontier-level math reasoning, and this is absolutely crazy.
Extensive experiments across four different sizes of SLMs and challenging benchmarks demonstrate the superiority of rStar-Math, achieving leading results while outperforming existing math-reasoning LLMs and Best-of-N baselines. They also reveal key findings, including the emergence of self-reflection and the effectiveness of the PPM in identifying critical intermediate steps, such as theorem-application steps.