Grok 3 DESTROYS everyone... #1 in EVERY Category


Wes Roth

The latest AI News. Learn about LLMs, Gen AI and get ready for the rollout of AGI. Wes Roth covers t...

Video Transcript:

So Elon Musk and the xAI team released Grok 3. I did a whole live stream covering it and testing out Grok 3 live, so in this video let's go over the biggest takeaways from the Grok 3 launch. Tomorrow I'll post a video testing Grok 3 out: how will it perform compared to the other leading models? First and foremost, it seems better than o3-mini-high on things like the AIME and some of the other benchmarks that they put it through, and this is specifically the reasoning models. So that's the 2024

AIME, and this is the 2025, the brand-new AIME. As you can see here, it's again beating out o3-mini-high and o1. So it's better than Gemini, DeepSeek R1, o1, and o3-mini-high; both Grok 3 mini reasoning and Grok 3 reasoning are better. So this is putting it kind of where we expect the full o3 model to be. All things considered, this growth is kind of insane. Here in blue you see the rise of Grok compared to the GPTs, OpenAI's models. It catches up rather fast.

What's more, as far as we know, Grok 3 reasoning is xAI's first reasoning model, so it skipped sort of the o1-to-o3 progression and with its first reasoning model got right up to o3, again according to the benchmarks. That might be a little bit off, but it's definitely surpassed o1. I think this is kind of important because it weighs in on that whole discussion about Nvidia and GPUs and scaling laws. This is the largest compute cluster, I believe, in the world located in one space, and they

built it really fast. As Elon sometimes likes to joke, it's like speedrunning Factorio in real life. So Colossus is a total of 200,000 GPUs. Phase one took 122 days to get to 100,000 GPUs, fully training synchronously from scratch, and then phase two took 92 days to expand to 200,000 GPUs. So this is bigger than the other projects by Google, OpenAI, and Microsoft, and this does kind of seem to indicate that if you have the capital to buy all the GPUs, then you can jump to the front of the line for building these

AI models. Again, there are still a lot of tests to be done, and we still have to see how well these models perform, but just from the initial results, assuming that the benchmarks are accurate, it does seem like there is no moat, really. It's just Nvidia-goes-brrr GPUs: if you have enough GPUs, you get to build those big models and saturate the benchmarks. Not only that, but Elon's saying that they are planning to 5x this thing, so they're going to build it out to 1 million GPUs. You get Grok on Premium+,

but there's also a new special tier called SuperGrok, which has guaranteed access to Grok 3 and unlocks DeepSearch and Think. Think is sort of what makes it into the reasoning model. There's also a voice mode that some early testers, it sounds like, have been getting their hands on. I haven't had a chance to do this yet. I did have a chance to test out the reasoning capabilities for about an hour, an hour and a half. I was trying to do it live on

this live stream. Unfortunately I was very tired, so I do want to hit it fresh again in the morning. Early results: I'm not ready to call it quite yet, but it does seem strong. It seems at least as good as o3-mini-high. If you recall Dr. Kyle, the PhD physicist who was doing research into black holes and had o1-preview write up the entire code that he used for his PhD dissertation, he joined the live stream and suggested this problem that he said he actually

submitted to Humanity's Last Exam, so he suggested running it through Grok 3. It's a fairly complicated problem with tons and tons of work that you have to show to get to the answer. Now, the final answer that Grok gave me was incorrect. At the end of my live stream we organized it so Dr. Kyle would start his own live stream, and I was able to redirect my live stream to his, so he kind of picked up where I left off, because

at that point I was falling asleep, and he was able to test a lot of the stuff with Grok 3. It does seem like when he ran it through Grok 3, it did come up with the correct answer, although some of the other models also came up with the 5.5 answer. So I plan to rewatch that live stream and see if they figured out what the issue was, because I would be interested to know where it went wrong and why. I also tested with my series

of prompts where it has to create a snake game that plays itself and then create a reinforcement learning agent to learn to play that game using PyTorch. It created kind of an advanced version of the snake game with a script to play itself. Pretty cool, nothing major. It did create the PyTorch reinforcement learning pipeline to train it how to play, and this was where there were just a few issues. I feel like it did an okay job; this is where things kind of started breaking down, but

at this point it was very much past my bedtime, and I wasn't really in that debugging, troubleshooting state of mind where I could have continued. So I don't want to call it yet. I do want to take tomorrow, hit it with some hard prompts, and see where it lands. I want to finish watching Dr. Kyle's live stream tomorrow when I wake up and see what he thinks of it, because he hit it with some pretty difficult physics and science problems. But it really does feel like xAI started with a late

start. They started kind of behind the pack, but they built up their hardware, their infrastructure, beyond what everybody else has, and have now created a model that, well, might be the number one model that's available to us right now. Again, I'm not quite ready to call it, but take, for example, the Chatbot Arena and the "chocolate" model, which was an early Grok 3. This is where people test two models side by side blindly: they don't know which one's which, they select the one that they like most, and, you

know, chocolate reigned supreme. Here's the post from Chatbot Arena, where all these LLMs duke it out. They're saying that xAI's early version of Grok 3, code-named "chocolate", is now number one in the Arena. As you can see, here are all the rankings, and an early version of Grok 3 is solidly in the number one spot. It's the first-ever model to break a 1400 score, and it's number one across all categories, a milestone that keeps getting harder to achieve. That's after just under 8,000 votes.
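For a rough sense of how blind side-by-side votes can turn into a score like 1400, here is a minimal sketch of a classic Elo update. This is an assumption for illustration only: the video does not describe how the Arena actually computes its ratings, and the function names, the K-factor of 32, and the starting ratings are all hypothetical.

```python
# Minimal Elo sketch (illustrative assumption, not the Arena's actual method).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A wins a vote against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one head-to-head vote.

    k is a hypothetical step size; larger k means each vote moves ratings more.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; the voter prefers model A.
    print(update(1400.0, 1400.0, a_won=True))
```

Each vote moves ratings by a bounded amount that shrinks when the outcome was already expected, which is one reason a ranking built on a few thousand votes can still drift.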

So it might still shift up and down a little bit, but again, it's solid. Being number one across all categories is a difficult feat to achieve. My early testing seemed to show that it's, again, like o3-mini-high, and I would need a lot more testing to say if it's better or worse, but it's in that sort of range. You can see here this version of Grok 3 is number one overall, number one overall with style control, number one at hard prompts and hard prompts with style control,

it's number one at coding, number one at math, number one at creative writing, number one at instruction following, number one at longer query, and number one at multi-turn. Another important thing to understand here is the AIME. It's a high-level math competition. When we're looking at o3-mini, o1, and Grok 3, they were all trained in 2024, so the data was collected in 2024. So when we're looking at the AIME for 2024, it is possible that maybe those results were

in the data set it was trained on. But now we have the results for the AIME 2025. There are parts one and two, and for how those scores are calculated, we're looking at the combination of both, kind of the average of the two. So o3-mini-high scores 86 on the AIME 2025, and again, we're fairly certain that this is not in its training data set. o1, sort of a medium, scores 79. I think this might be one of the better benchmarks that we have right now, specifically for testing

these reasoning models. So here's o1 at 79, here's o3-mini-high at 87 (it's technically 86.5, so 87), and notice Grok 3 and Grok 3 mini are both above it, at 90 and 93. So again, I did some early testing. It did some things right, some things wrong, but I just didn't have enough time to fully stress test it and figure out where it sits, in my opinion, for the stuff that I tested on for my use cases. But at this point I think we can safely say

that it certainly seems like Grok 3 is now the new reigning king, probably. We'll do a deep dive tomorrow, but at this point it's kind of hard to deny it. The full announcement video was posted right before this one; links below. In the announcement video they have a few test cases they run it through, and they also explain how they built that data center. If you were following that whole thing about whether we needed GPUs, with Nvidia going up and down, and how DeepSeek, how those models

affected how we thought about Nvidia and GPUs and how important those investments were, I think this provides some strong, solid arguments that the demand for GPUs will still be sky-high. Again, just the idea that Elon is sitting on 200,000 of these and is planning to 5x it to 1 million, I think that by itself is rather telling. The progress they've made is definitely telling, and it seems like they're crediting the fact that they have 10 to 15x more training compute than Grok 2. So the jump

from Grok 2 to Grok 3 was 10 to 15x. That's another data point. So there's a lot more to come, stay tuned. And without further ado: do you know where bad rainbows go? To prism. Funny, right? But don't worry, it's a light sentence, just enough to give them time to reflect. Subscribe if you found this video enlightening.
