Just 12 hours ago, OpenAI released a system called Deep Research, based on o3, their most powerful language model. They call it an agent, and I've spent all morning reading every note and benchmark they released and testing it myself on 20 use cases. Because the name somewhat reminded me of something, I also, of course, compared my results with DeepSeek R1 with search and Google's Deep Research. Yes, by the way, OpenAI used the exact same name as Google for their product; I did hear they were considering calling it o3 Pro-ar Mini, but instead went with one of their competitors' product names.

Now, these are of course just my initial tests, and remember: to get this thing you need to spend $200 a month, and use a VPN if you're in Europe, so bear all of that in mind. Overall, I am impressed, but with a pretty big caveat, and I'll leave it to you to judge whether it can do a single-digit percentage of all economically valuable tasks in the world.

Just quickly though: yes, it is powered by the new o3 model, and in case you're not familiar with all the names, that's their most powerful one, not the o3-mini that was announced just a few days ago (I did do a video on that one), which is different from o1 Pro mode (which I also did a video on), and by the way, both of those are different from GPT-4o and GPT-4. Basically, it's their best model, and they're using it to do this deep research; that's kind of all that really matters.
Just quickly before my tests: you may have heard about a benchmark called Humanity's Last Exam, which I think is pretty inappropriately titled. What it essentially tests is really arcane, obscure knowledge, and whether the model can piece together those bits of knowledge to get a question right. So it actually didn't surprise me that much that on this, quote, Humanity's Last Exam, the performance of this Deep Research agent, when given access to the web, shot up. My main takeaway from its performance on this benchmark is that if you want obscure knowledge, OpenAI's Deep Research agent is the place to go. Oh, and by the way, the lead author of that exam says he doesn't expect it to survive the optimization pressures of 2025.

More interesting for me, actually, was the GAIA benchmark, about whether an AI can truly be a useful assistant. Why would it be more interesting? Well, three reasons. First, the tasks are more relatable: research this specific conference and answer this specific, nuanced question (that's just level one, by the way), while level-three questions are things like this one: search a very obscure set of standards and research what percentage of those standards had been superseded by 2023.
Reason number two is that the benchmark was co-authored by noted LLM skeptic Yann LeCun. Here was the state of the art in April 2024, quote: "We show that human respondents obtain 92%, versus 15% for GPT-4 equipped with plugins." I checked, by the way, and one of those plugins was indeed web search. They go on: this notable performance disparity, 92% for humans versus 15% for GPT-4 with search, contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills. Which leaves us with the third reason: yes, OpenAI's Deep Research agent got around 72-73% on this benchmark. That's, by the way, if you pick the answer it outputs most often out of 64 attempts; if you're harsher and just take its first answer, it still gets 67%. So two things are true simultaneously: the performance leap in just the last, say, nine months is incredible, from 15% to 67 or 72%, but it does still remain true that human performance, if you put the effort in, is significantly higher at 92%.
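For clarity, "the answer it outputs most often out of 64" is just a majority vote over repeated samples, sometimes written cons@64. Here's a minimal sketch of the idea; the sample_answer callable and the 64 attempts are hypothetical stand-ins, not OpenAI's actual evaluation code:

```python
from collections import Counter

def consensus_answer(sample_answer, question, n=64):
    """Ask the same question n times and return the most common answer.

    sample_answer is any callable returning one answer string per call:
    a hypothetical stand-in for a model API call sampled with temperature > 0.
    """
    answers = [sample_answer(question) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n  # the majority answer and how often it appeared

# The "harsher" first-answer number would instead just take one sample:
# first_answer = sample_answer(question)
```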
Now, just before we get to the DeepSeek R1 comparison and Gemini's Deep Research, I can't lie: the first thing I wanted to do when I got my hands on o3 (essentially, since it's hidden inside Deep Research) was test it on my own benchmark, SimpleBench. It tests spatial reasoning, or you could just say common sense or basic reasoning. Unfortunately, the test didn't really work out, because the model relentlessly asks me questions instead of actually answering the question. Now, you could say that's actually a brilliant thing, because any AGI should ask you clarifying questions. I will say, though, that on average it doesn't just ask you one question; it tends to ask you four or five, even when you beg it just to answer the question. So, super annoying or a sign of AGI? I'm going to let you decide on that one.

But on the actual common sense, the actual spatial reasoning, it kind of flops. Maybe that's harsh; I only tested it on maybe eight of the questions, but I saw no real sign of improvement. I'm not going to spend more time in this video on questions like this, but essentially it doesn't fully grok the real world: it doesn't get that Cassandra, in this case, would still be able to move quite easily. For another question, it has an instinct that something might be up here, but when I say "proceed with a reasonable assumption on each of those points," it still flops. I must admit it was kind of interesting watching it cite all sorts of obscure websites to find out whether a woman could move forwards and backwards if she had her hands on her thighs. Eventually I just gave up on asking it SimpleBench questions, because it would keep asking me questions until I was essentially solving the puzzle for it. Multiple times, by the way, when I refused to answer the questions it was giving me, it just went silent and kind of stopped.

Pro tip, by the way: if you want to get out of this logjam, just go to the refresh button and pick any other model, and it will work, though still presumably using o3, which I guess is the only model they're using for Deep Research. This is what it looks like, by the way: you just select "deep research" at the bottom; it's not actually a model that you choose in the top left.
And I'm actually going to stick on this page, because this was a brilliant example of it doing really well. I have a fairly small newsletter, read by fewer than 10,000 people, called Signal to Noise, and so I also tested DeepSeek R1 and Deep Research from Google, with the same question to each of them: read all of the Beehiiv newsletter posts from the Signal to Noise newsletter written by AI Explained, find every post in which the DICE ("Does It Change Everything?") rating is a five or above, and print the "So What?" sections of each of those posts. Here's my latest post, for example, and if you scroll down you can see the DICE rating, which here is a three.

As it likes to do, it asked me some clarifying questions, but then it got to it, and it found them: the two posts that had a DICE rating of five or above. It also sussed out and analyzed exactly what those DICE ratings meant, and indeed printed the "So What?" sections. I was like, yeah, that would actually save me some real time if I had to search through it all myself.
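If you wanted to do this task without an agent, it's essentially scrape, filter, extract. Here's a rough sketch of what that might look like; everything about the site structure (the archive URL, the "DICE rating:" label, the "So what?" heading) is an assumption for illustration, not Beehiiv's or the newsletter's actual markup:

```python
import re
import requests
from bs4 import BeautifulSoup

# Hypothetical archive URL; the real Beehiiv archive structure will differ.
INDEX_URL = "https://signaltonoise.beehiiv.com/archive"

def post_urls(index_url=INDEX_URL):
    soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    return [a["href"] for a in soup.select("a") if "/p/" in a.get("href", "")]

def dice_rating(text):
    # Assumes the rating appears somewhere in the post as e.g. "DICE rating: 5".
    match = re.search(r"DICE rating:\s*(\d)", text, re.IGNORECASE)
    return int(match.group(1)) if match else None

for url in post_urls():
    body = BeautifulSoup(requests.get(url).text, "html.parser").get_text("\n")
    rating = dice_rating(body)
    if rating is not None and rating >= 5:
        # Assumes a "So what?" heading introduces the section we want to print.
        print(url, rating, body.split("So what?", 1)[-1][:500])
```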
The web version of DeepSeek was completely busy for the entire few hours in which I tested it, but I still tested R1. How did I do that? Well, I used R1 in Perplexity Pro and asked the same question, and apparently there are no entries with a DICE rating of five or above. Obviously Perplexity is amazing and R1 with search is incredible, and they're both free up to a point, but yes, if I have a particularly difficult query, I'm probably going to use Deep Research. It costs me a bunch of money to subscribe to it currently, but yes, I'm going to use it.

Speaking of usage, by the way, apparently I get 100 queries per month on the Pro tier, the Plus tier will have 10 per month, and the free tier will apparently get a very small number soon enough (yes, he wrote "plus tier" there, but he meant free tier).

How about Gemini Advanced and their, quote, Deep Research? They must be furious, by the way, that OpenAI just dumped on their name. But anyway, how did they do? Unfortunately, in my experience, it's one of the worst options here: for example, it says it can't find any DICE ratings at all for any newsletters in Signal to Noise. From then on, I stopped testing Deep Research from Gemini and just focused on Deep Research versus DeepSeek.

The TL;DR is that Deep Research was better than DeepSeek R1 pretty much every time, although it hallucinated very frequently. Also, DeepSeek didn't aggravate me by relentlessly asking questions, but again, I'll leave it up to you whether that's a good thing or a bad thing. I did check, on your behalf, whether we could force the model not to ask clarifying questions, and as you can see, that just does not work for this particular query.
I wanted to see how many benchmarks there are in which the human baseline is still double the best current LLM, and they had to be up to date: benchmarks where, say, o3-mini has been tested. I know my benchmark isn't officially recognized; I just wanted to see if there were others that were text-based but still had that massive delta between human and AI performance. As we just saw, the GAIA benchmark doesn't have that anymore.
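The criterion itself fits in a couple of lines. Here it is applied to the GAIA numbers quoted earlier, just as a sketch of the filter I had in mind, not of how Deep Research actually searched:

```python
def human_baseline_still_double(human_score, best_llm_score):
    """True if the human baseline is at least double the best LLM score."""
    return human_score >= 2 * best_llm_score

# GAIA, April 2024: humans at 92% vs GPT-4 with plugins at 15% -> would qualify
print(human_baseline_still_double(92, 15))  # True
# GAIA now: humans at 92% vs Deep Research at ~67-72% -> no longer qualifies
print(human_baseline_still_double(92, 67))  # False
```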
When it asked me a clarifying question, I said: focus only on LLMs for now, and, as I said, please just find all benchmarks that meet those criteria, no other criterion for this task; they don't even have to be widely recognized benchmarks; please, please, no more questions. At that point it said, "I'll let you know as soon as I find the relevant benchmarks that fit these conditions," but then it just stopped. As I said, this happens occasionally, so I prodded it ("go on then"), and then it went and did it.

I was impressed, again, that it did identify SimpleBench, which is pretty obscure, as such a benchmark. I didn't know my name was Philip Wang, though; my mother will be surprised. But it did say CodeElo was another example of such a benchmark, and I was like, wow, there's another one, great: "human coders vastly outperform current models; in fact, the best model's rating falls in roughly the bottom 20% of human Codeforces participants." I was like, that's interesting. As with all of the outputs, though, including the newsletter one, I wanted to actually check whether the answers were true, and no, they weren't, not in the case of CodeElo, where, as you can see, o3-mini has not been benchmarked, but even o1-mini gets in the 90th percentile. By definition, that means the best model is not in the bottom 20% of performers. Now, some of you may point out that CodeElo is based on Codeforces, and o3-mini has been tested on Codeforces, but nevertheless the statement highlighted is still not true.
This, then, for me, captures the essence of the problem: Deep Research is great for finding a needle in a haystack, if you're able to tell needles apart from screws, because yes, it will present you with both screws and needles. But remember, in many cases it did save you from scrambling on your knees through the haystack, so there's that.

What about that exact same question on the benchmarks, but this time to the official DeepSeek R1 with search? The server was working briefly for this question, so I got an answer. Problem is, the answer was pretty terrible. I know it's free, I know it's mostly open source, and I know it's humbled the Western giants, but that doesn't mean DeepSeek R1 is perfect. Yes, HaluBench is a real benchmark, and I did look it up; it was hard to find, but I did find it. Problem one, though: after half an hour of trying, I could find no source for this "human evaluators got 85% accuracy" claim (the benchmark, by the way, is about detecting hallucinations). What about the best-performing LLM being GPT-4 Turbo, which supposedly gets 40%? If true, that would indeed meet my criterion of the human baseline being more than double the best LLM performance. Completely untrue, though, as you can see from this column, where GPT-4 Turbo not only doesn't get 40%, it's not even the best-performing model; actually, the entire focus of the paper is on the Lynx model, which is the best-performing model.

OK, now going back to Deep Research: I got a cool result that I'm curious whether others will be able to reproduce in their own domain. I asked the model 50 questions about a fairly obscure creole language, Mishan Creole. I didn't give it any files; I just clicked deep research and waited. I think it asked me some clarifying questions (yes, of course it did).
I know what you're thinking: that's kind of random, Philip; why are you telling us about this? What did it get? Well, it got around 88%. You're thinking, OK, that's a bit random, but I guess cool. Here's the interesting bit, though: I then tested GPT-4o, which is the model most commonly used in the free tier of ChatGPT, but I actually gave it the dictionary from which these questions came. Yes, it's around 100 pages, but surely a model with direct access to the source material would score more highly? Alas, no: it actually got 82%. Of course, smaller models can get overwhelmed by the amount of context they have to digest, and Deep Research can just spend enormous amounts of compute on each question and, in this case at least, score more highly.

Now, I know this is totally random, but I so believed something like this was coming that I actually built a prototype a couple of weeks ago. The way it works is that I'd submit, say, an article, or any bit of text, or a tweet, and I would get o1 to produce research directions that would add context and nuance to the article (helpful for, say, a journalist or a student). Then each of those directions would be sent to Sonar Pro, the latest API from Perplexity, which of course can browse the web. If interesting results were returned, o1 would incorporate them; if not, it would cross them out. And then, after going through all five results from Sonar Pro, o1 would synthesize the most interesting bits, the juiciest bits of nuance, and produce something like an essay, with citations. And yes, it helped my workflow, for all of one week, until being completely superseded now by Deep Research.
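For the curious, here's a rough sketch of that pipeline's shape. The prompts and wiring are illustrative guesses rather than the actual prototype code; it assumes Perplexity's OpenAI-compatible endpoint and that API keys are set in the environment:

```python
import os
from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
# Perplexity exposes an OpenAI-compatible endpoint; key assumed in PERPLEXITY_API_KEY.
pplx_client = OpenAI(api_key=os.environ["PERPLEXITY_API_KEY"],
                     base_url="https://api.perplexity.ai")

def ask(client, model, prompt):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def research(article, n_directions=5):
    # 1. o1 proposes research directions that would add context and nuance.
    directions = ask(openai_client, "o1",
        f"Suggest {n_directions} research directions, one per line, that would "
        f"add context and nuance to this article:\n\n{article}")

    # 2. Each direction goes to Sonar Pro, which can browse the web.
    findings = [ask(pplx_client, "sonar-pro", d)
                for d in directions.splitlines() if d.strip()]

    # 3. o1 keeps the interesting findings, drops the rest, and writes an
    #    essay-style synthesis with citations.
    return ask(openai_client, "o1",
        "Synthesize the most interesting, nuanced findings below into a short "
        "essay with citations; ignore anything uninteresting or irrelevant.\n\n"
        + "\n\n".join(findings))
```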
So pour one out for my prototype, SearchAI, which is now completely redundant. Here it is: this is the report that it generated, and you can see the citations below (let me move that down). It was really fun, and I was proud of that one.

Now, the slick presentation that OpenAI gave did include hidden gems, like "is Deeper Seeker a good name?" in the chat history, but it didn't go into much detail beyond the release notes about, for example, what sites Deep Research could or could not browse. For example, in my testing it couldn't browse YouTube, although strangely it could get this question right by relying on sources that quoted YouTube. For those who follow the channel: in my last video I asked you guys to help me find a video in which I predicted that OpenAI's valuation would double this year (which it has done), and it did find the right video, but not by searching YouTube. That was kind of wild. Ask it for the timestamp, though, and because it can't look at YouTube, it can't actually get that right.
What about shopping advice, though? This time I was really specific: it had to be a highly rated toothbrush, available in the UK, with a battery life of over two months, and I even gave it the site on which to research what the previous price history had been. Essentially, I wanted to know whether the purchase I had just made was a good deal. Truth is, I'd already done the research, but I just wanted to see if it could do the same thing. I had, as usual, to wade through a barrage of being questioned/interrogated by the model about details, some of which I'd already told it, but nevertheless it finally did the research, and it did indeed find the toothbrush that I had bought, so that was great.

Unfortunately, even though I'd given it the specific website on which to research the previous price history, it didn't actually do that: none of these links correspond to CamelCamelCamel, and that's despite it saying that it had used CamelCamelCamel. It said "using CamelCamelCamel"; yes, that is the name of the website, but none of the links correspond to that website. You might think, well, maybe it got the answer right from the website without quoting it, but no: if you actually go to the website, you can see that the cheapest this toothbrush had been was 63, not the price quoted (I think 66) by Deep Research. In short, don't trust it, even when it says it has visited a site.

How about DeepSeek R1 with search? Well, it completely hallucinated the battery life: it claimed 70 days, and it's actually 30 or 35 for this toothbrush. And yes, we can see the thinking, but that means we can see it completely making something up on the spot. It said, "now check this site for Amazon UK... great, suppose the historical low is 40," which it is not, by the way; it didn't bother actually checking the site. So it gives me this hypothetical result, but then, in the summary, it states it as a fact: "it's currently selling for this." Notice that it actually knows this is a hypothetical, but phrases it like a fact in the summary. Now, you might say I'm being overly harsh, or too generous, but honestly I'm just kind of processing how fast things are advancing.
Every chart and benchmark, it seems, is going up and to the right. Correct me if I'm wrong, but it seems like these kinds of small hallucinations are the last thin line of defense for so much of white-collar work. On one prompt, I got Deep Research to analyze 39 separate references in the DeepSeek R1 paper, and though it hallucinated a little bit, the results were extraordinary in their depth. In short, if these models weren't making these kinds of repeated hallucinations, wouldn't this news effectively be a redundancy notice for tens of millions of people?

And I'm not going to lie: one day that redundancy notice may come for me, because I was casually browsing YouTube the other day and I saw a channel that was clearly AI-generated. The voice was obviously AI-generated (I know many people accuse me of being an AI, but I'm not; this voice, though, trust me, it was), and yet none of the comments were referencing it, and the analysis was pretty decent and the video editing was pretty smooth. I'm sure there's a human in the loop somewhere, but come next year, or the year after, or possibly the end of this year, there will be videos analyzing the news in AI instantly, the moment it happens, with in-depth, massive analysis, far quicker than I can ever manage. Obviously, I hope you guys stick around, but man, things are progressing fast, and sometimes I'm just like: this is a lot to process.

For now, though, at least: yes, it does struggle with distinguishing authoritative information from rumors, although it does a better job than DeepSeek R1 with search, and unfortunately much better than Deep Research from Gemini. Not quite as good, I think, as me, for now, but the clock is ticking. Thank you so much for watching, hope you stick around even in that eventuality, and have a wonderful day.