OpenAI o3 and What It Means for Academia

Sovorel
OpenAI just released a new AI frontier model, o3. Is this Artificial General Intelligence (AGI), and...
Video Transcript:
I have to talk about the new o3 model from OpenAI. This is a major, major new development in AI: it didn't just break through the evaluation benchmarks that measure how good an AI is at reasoning; it didn't just break the record, it monumentally blew past it. It has moved way beyond what was thought to be coming next.
Right? We were expecting a 10% increase, a 20% increase, but no! Its new level was crazy.
In fact, it's so good at reasoning now that they're changing the actual tests used to measure how capable an AI is at understanding and reasoning through novel questions. So that's a major thing. Right?
We have the ARC Prize organization that created this benchmark, this test, and now they're changing what they do because of the capabilities of OpenAI's new model. The reason they do that is to test the model's ability to learn new skills on the fly. We don't just want it to repeat what it's already memorized; that's the whole point here.
Now, ARC-AGI version one took five years to go from 0% to 5% with leading frontier models. However, today, I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified: on low compute, o3 scored 75.7% on ARC-AGI's semi-private holdout set. This is extremely impressive because it is within the compute requirements we have for our public leaderboard, and it is the new number one entry on ARC-AGI-Pub. So congratulations!
Thank you so much! Now, as a capabilities demonstration, when we ask o3 to think longer and actually ramp up to high compute, o3 was able to score 87.5% on the same hidden holdout set.
This is especially important because comparable human performance sits at an 85% threshold, so being above it is a major milestone. We have never tested a system or model that has done this before, so this is new territory in the AGI world. Congratulations on that!
Congratulations for making such a great benchmark! Yeah, and the work is not over yet; these are still the early days of AI. So, we need more enduring benchmarks like ARC-AGI to help measure and guide progress, and I am excited to accelerate that progress.
I'm excited to partner with OpenAI next year to develop our next frontier benchmark! Amazing! So, this o3. It's funny that it's named o3 when the previous one was o1; they've jumped straight to o3.
They said that's because they didn't want to step on an existing name: there's a telecommunications company in the UK called O2. They didn't want to use that name, so they jumped to o3. But it's funny, because the name kind of fits given what a huge leap this is from 1 to 3.
It's more than double, more than triple, so it's a big deal. So I want us to think about that, because now we have to ask: what does this mean for academia? Does it mean anything?
And the answer is yes, it does mean something. This isn't AGI; this isn't artificial general intelligence. It can't just solve everything now through reasoning.
No. Going through the test, there were plenty of questions it couldn't solve that would be easy for you and me as humans to answer, but it was able to answer a lot of novel questions. This is important because it means the model is answering questions that weren't in its pre-training data.
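To make that concrete: an ARC-AGI task shows the model a few input/output grid pairs and asks it to infer the transformation rule and apply it to a fresh test input. Here is a minimal sketch of the task format in Python; the grids and the "mirror each row" rule are an invented toy example, not an actual benchmark task:

```python
# Sketch of the ARC-AGI task format: a few demonstration pairs, then a
# test input whose output the model must produce. Grids are lists of
# rows, and each cell is an integer 0-9 standing for a color.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [{"input": [[5, 0], [6, 0]]}],
}

# A solver must infer the rule from "train" alone. Here the (invented)
# rule is "mirror each row left to right".
def solve(grid):
    return [row[::-1] for row in grid]

print(solve(task["test"][0]["input"]))  # -> [[0, 5], [0, 6]]
```

Because each task has its own rule, memorization doesn't help; the model has to work the rule out on the spot.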
So the model was looking at the question and figuring it out on its own, through different mechanisms of trying to understand. It's using real reasoning to work something out. Now, of course, there are all sorts of debates about what real reasoning is, and that's great!
It's great that we're having those debates, and it's great that we have this kind of competition, this kind of benchmarking, because something really interesting is happening, and I really want to express this because it's the same thing that's happening in academia. Right? Initially, we had the Turing test: the idea that if an AI can come across as real through its emotional responses, or through understanding what you're asking, then it has real intelligence, right?
But then, of course, we were able to achieve that with ChatGPT. We almost achieved it before then, right, with ELIZA and all of that. It convinced a lot of people.
But now, even with ChatGPT, we had something really powerful, and so that Turing test kind of faded away, and it didn't make sense anymore. Right? It's like, what were we thinking before?
That's not a true definition of intelligence. And now, in the same way, with the previous test we had within ARC for trying to understand what intelligence is and whether we can achieve it: the fact that this o3 model met it and surpassed it means we're going to say, well, okay, we're changing it. That's not real intelligence now; it's going to be more like this.
So we're going to create something new, which is actually not bad at all. Right? I think we should be doing that, because what's happening is that we're learning more about ourselves and about what intelligence is. As we develop AI and its capabilities get more and more powerful, we're starting to see, "Oh yeah, intelligence is more than just that, and it requires this and this and this." So, I think that's great.
Now, going back to this idea of AI holding up a mirror to us in education: I think it's great, because it's forcing us to re-evaluate and rethink what a good education is. Take writing essays, right? It used to be very common for you as a student to show up to class, get a lecture, and then get an assignment: "Write this type of essay and turn it in within two weeks." Okay, and that's what we did. Of course, the problem was that that was never good education. Maybe when we didn't have books, it made some sense.
But hey, we have lots of information out there. We have YouTube, we have the internet, and we have the understanding now that, "Oh, that isn't good." Because, one, students aren't learning very well through a lecture, right?
They need more engagement, more hands-on learning, more interaction. And with going away for two weeks to write an essay, well, now students can easily turn to AI and have it write the essay. But even before AI, they could turn to a friend, pay for a service, have someone else do it, or copy from the internet.
So, it never was a good assessment technique to simply hand the assignment to the student, wait two weeks, and then evaluate them. Right? We need more formative assignments, more formative assessments, more things going on in class so that we can truly evaluate and hold the student accountable.
And now when the student writes an essay, we should look for additional things—not just the product itself, but the process. Can the student understand what they're doing? Can they talk about what they've done?
If they turn in an essay, I read it, and then I ask them questions about it: do they know what they did? So the AI is sort of holding up a mirror to us, in that we're starting to think much better about education, understanding that the student needs to go through an educational experience, not just be given data and then have to recall it. No, they have to fully understand.
So, holding students more accountable for their learning, I think, is excellent, and it really helps us understand education that much better, thanks to AI holding up that mirror to us. But let's go back and think a little more about what's going on here, because the reasoning capability has increased so much, and this is important for us to understand in academia.
Yes, this new o3 model is very expensive to run; accomplishing this record-breaking result in this competition cost a lot of money. Some calculations put the compute cost at around $300,000 to answer these questions, right?
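To put that figure in perspective, here is a rough back-of-the-envelope calculation. It assumes the semi-private evaluation set contains about 100 tasks, which matches public descriptions of ARC-AGI, and it takes the $300,000 estimate at face value; neither number is official:

```python
# Rough cost-per-task estimate, under the assumptions stated above.
estimated_total_cost = 300_000  # USD, the estimate quoted in this video
approx_num_tasks = 100          # assumed size of the semi-private set

cost_per_task = estimated_total_cost / approx_num_tasks
print(f"~${cost_per_task:,.0f} per task")  # -> ~$3,000 per task
```

Thousands of dollars per question is obviously not something you hand to a classroom of students today.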
Part of that cost is time: it took something like 16 hours to work through and answer everything. That's what the o3 model does: it takes longer to understand what's going on, to understand the question, and then reason it through. In fact, what it's actually doing is something that I thought would be more important for actually getting us to AGI.
I mentioned this in the past regarding using multiple systems, and it looks like that's exactly what's happening. It uses one system to understand the question, then generates a bunch of different candidate answers, and then uses another system to look at those answers and evaluate the possibilities: "Is this correct? Are those reasoning steps logical for reaching this solution?"
" It's even creating some synthetic data to go through the possibilities, so it's putting all that together from multiple systems and then coming up with the best solution. And that takes time. So, it's much longer than simply using ChatGPT, where it's just using one system, one large language model, to look at its previously trained information and then come up with a solution.
Now, this is important for us to understand for a couple of different reasons. The research so far suggests that a regular large language model like GPT-4 works better for things like writing essays, where there isn't a 100% "Hey, here's the actual answer." No; there we need more creativity, putting things together in various ways. But when we're talking about the STEM fields, an o1-style model that works through many different possible answers and then asks, "What is the best reasoning to get to the best answer?" lets us say, "Yes, this is the best answer." In the sciences, that makes sense because there is one correct answer, or at least one answer that is more correct than the others. So, that's the difference here: an o1-style model is awesome for science, while the GPT models are going to be awesome for things like writing essays. So, we're getting both of these things.
Now, this comes back to academia: if these o1 and o3 models are going to be that much more powerful for science, that has a lot of implications for research, right? For doing research in these science fields, we run into issues of equity.
It's going to cost money to access these things. Now, of course, prices will come down as they find ways to make the model more efficient and improve the overall process. So we have that to look forward to, but for now it is going to cost more.
That's going to be an important thing to understand: what about access to these tools? Am I going to require students to use this? If I do, I need to make sure they can actually access the tool.
So that's going to be a significant issue. Now, a lot of what is learned from the o3 model, from this revolutionary new capability, is going to be extracted and applied to the GPT models as well. So we can definitely look forward to future improvements in models like GPT-4 and GPT-5 that draw on this different type of system.
But for now, it's just important to see that yes, definite improvements are happening with these AI models. Remember the video I made just a couple of weeks ago? I talked about how AI is not slowing down; it's not slowing down by any stretch of the imagination.
No, we're not running out of data, and no, we're not slowing down, because there are new techniques, new possibilities, new ways of improving, of which the o3 model is a prime example. So AI is definitely moving forward and enhancing its overall capabilities. Now, there's also a big call here for safety.
What's going to happen with this new model that can reason so much more and has these enhanced capabilities? OpenAI is calling for people to contribute, to be part of a safety group that does all sorts of testing to help ensure the model can be released safely for everyone to use. I went and tried to see if I could sign up, but you do have to be a Plus member.
So you have to have a paid subscription to be part of this safety group. That's something to know; if you are a Plus member, definitely go out there and be part of the solution: help ensure we have a safe overall product that can be used in academia and across the world. The final thing I want you to understand deals with AGI.
Right, this is not AGI; it's not generalized intelligence. But it is definitely one step closer. We are one step closer to understanding what AGI even is, and we're one step closer to getting there by being able to do this.
The way they've structured things seems to indicate that they'll be able to keep improving this for the foreseeable future by scaling up inference, the reasoning the model performs when it answers, and solving problems based on that understanding. So it's moving beyond just large language models and using additional techniques to come up with novel solutions to novel problems: very important for research and pivotal for moving toward AGI. So that's a significant consideration for us to contemplate.
Yes, achieving AGI is looking more and more realistic, sooner rather than later. Therefore, we in academia really need to make sure we're staying on top of this and focusing on it. This idea of engaging with knowledge goes beyond the facts and figures we often concentrate on in the classroom.
No, it's about experience, critical thinking, and the ability to use and implement ideas: manipulating content and putting it together in different ways. Creativity, critical thinking, and application are all crucial areas we need to push. To really do that, we need to create experiences within the classroom.
I can't stress that enough. What we as humans can maximize in the classroom is this real experience—social learning with other students, going through and displaying our capabilities, engaging in real discussions with emotion, role-playing. All these elements should be maximized in education to ensure that we remain relevant.
The AI can simply present information; we do much more, because we provide the human element. So keep that in mind and continue to develop these capabilities, as that's what we need to prioritize within education. All right, I hope you got a lot out of this.
This is a huge deal—improvements in AI are continuing to move forward toward AGI. Many new possibilities are emerging as we head into 2025, which will be unbelievable. There will be lots happening with AI and robots, and this is another prime example of how AI can reason.
When faced with novel situations, a robot will be able to use these more advanced reasoning capabilities to tackle challenges effectively. So again, next year, 2025, will truly be the year of the robot. You have that to look forward to as well!
Lots going on! If you got a lot out of this, please like and share so that we can continue to develop our channel here.
And, uh, please comment; I really want to know what you think about all this. Is this a big deal? Is this going to change the way you do something?
What are your thoughts on AI? I really want to know, so please share so that we can continue to develop our community of inquiry. Thank you, and remember: learning is for life.