the following is a conversation with Dylan Patel and Nathan Lambert Dylan runs semi analysis A well respected research and Analysis company that specializes in semiconductors gpus CPUs and AI Hardware in general Nathan is a research scientist at the Allen Institute for AI and is the author of the amazing blog on AI called interconnects they are both highly respected red and and listened to by the experts researchers and engineers in the field of AI and personally I'm just a fan of the two of them so I used the Deep seek moment that shook the AI World
a bit as an opportunity to sit down with them and lay it all out from Deep seek open AI Google xai meta anthropic to Nvidia and tsmc and to us China Taiwan relations and everything else that is happening at the cutting Ed of AI this conversation is a deep dive into many critical aspects of the AI industry while it does get super technical we try to make sure that it's still accessible to folks outside of the AI field by defining terms stating important Concepts explicitly spelling out acronyms and in general always moving across the several
layers of abstraction and levels of detail there is a lot of hype in the media about what AI is and isn't the purpose of this podcast in part is to cut through the hype through the and the low resolution analysis and to discuss in detail how stuff works and what the implications are let me also if I may comment on the new open AI 03 mini reasoning model the release of which we were anticipating during the conversation and it did indeed come out right after its capabilities and costs are on par with our expectations as
we stated open AI 03 mini is indeed a great model but it should be stated that uh deep SEC car 1 has similar performance on benchmarks is still cheaper and it reveals its Chain of Thought reasoning which O3 mini does not it only shows a summary of the reasoning plus R1 is open weight and uh 03 mini is not by the way I got a chance to play with uh O3 mini and uh anecdotal Vibe checkwise I felt that O3 mini specifically O3 mini high is uh better than R1 still for me personally I find
that Claude Sona 35 is the best model for programming except for tricky cases where I will use 01 Pro to brainstorm either way many more better AI models will come including reasoning models both from American and Chinese companies they will continue to shift the cost curve but the quote deep seek moment is indeed real I think it will still be remembered 5 years from now as a pivotal event in Tech History due in part to the geopolitical implications but for other reasons too as we discuss in detail from many perspectives in this conversation this is
leex Freedman podcast to support it please check out our sponsors in the description and now dear friends here's Dyan Patel and Nathan Lambert a lot of people are curious to understand China's deep seki models so let's lay it out Nathan can you describe what deep seek V3 and deep seek R1 are how they work how they're trained Let's uh look at the big picture and then we'll zoom in on the details yeah so deep seek V3 is a new mixture of experts Transformer language model from Deep seek who is based in China they have some
new specifics in the model that we'll get into largely this is a open weight model and it's a instruction model like what you would use in chat GPT um they also release what is called the base model which is before these techniques of posttraining most people use instruction models today and those are what's served in all sorts of applications this was released on I believe December 26th or that week and then weeks later on January 20th deep seek released deep seek R1 which is a reasoning model which really accelerated a lot of this discussion this
resenting model has a lot of overlapping training steps to deep seek V3 and it's confusing that you have a base model called V3 that you do some too to get a chat model and then you do some different things to get a reasoning model I think a lot of the AI industry is going through this challenge of communications right now where open AI makes fun of their own naming scheme they have gbt 4 they have open ai1 and there's a lot of types of models so we're going to break down what each of them are
there's a lot of technical specifics on training and go from high level to specific and kind of go through each of them there's so many places we can go here but maybe let's go to open weights first what does it mean for model to be open weights and what are the different flavors of Open Source in general yeah so this discussion has been going on for a long time in AI it became more important since chat gbt or more focal since trat BT at the end of 2022 open weights is the accepted term for um
when model weights of a language model are available on the internet for people to download those weights can have different licenses which is the effectively the terms by which you can use the model there are licenses that come from history and open source software there are licenses that are designed by companies specifically um all of llama deep seek quen mistol these popular names in open weight models have some of their own licenses it's complicated because not all the same models have the same terms the big debate is on what makes a model open weight it's
like why are we saying this term it's kind of a mouthful it sounds close to open source but it's not the same there's still a lot of debate on the definition and soul of open- source AI open source software has a rich history on freedom to modify freedom to take on your own freedom for many restrictions on how you would use the software and what that means for AI is still being defined so uh for what I do I work at the Allen Institute for AI we're a nonprofit We want to make AI open for
everybody and we try to lead on what we think is truly open source there's not full agreement in the community but for us that means releasing the training data releasing the training code and then also having open weights like this and we'll get into the details of the models and again and again as we try to get deep into how the models will train were trained we will say things like the data processing data filtering data quality is the number one determinant of the model quality and then a lot of the training code is the
determinant on how long it takes to train and how faster experimentation is so without fully open- Source models where you have access to this data it is hard to know or it's harder to replicate so we'll get into cost numbers for deeps B3 on mostly GPU hours and how much you could pay to rent those yourselves but without the data the replication cost is going to be far far higher and same goes for the code we should also say that this is probably one of the more open models out of the frontier models so like
in this full spectrum where probably the fullest open source like you said open code open data open weights this is not open code this is probably not open data and this is open weights and the licensing is uh MIT license or it's uh I mean there's some nuance and the different models but it's towards the free in terms of the open source movement these are the kind of the good guys yeah deep seek is doing fantastic work for disseminating understanding of AI their papers are extremely detailed in what they do and for other teams around
the world they're very actionable in terms of improving your own training techniques uh and we'll talk about licenses more the Deep seek R1 model has a very permissive license it's called the M license that effectively means there's no Downstream restrictions on commercial use there's no use case restrictions you can use the outputs from the models to create synthetic data and this is all fantastic I think the closest pier is something like llama where you have the weights and you have a technical report and the technical report is very good for llama one of the most
red p PDFs of the year last year is the Llama 3 paper but in some ways it's slightly less actionable it has less details on the training specifics like less plots um and so on and the Llama 3 license is more restrictive than MIT and then between the deep sea custom license and the Llama license we could get into this whole Rabbit Hole I think we we we'll make sure we want to go down the license rabbit hole before we do specifics yeah and I mean so it should be stated that one of the implications
of deep secret puts pressure on llama and everybody else on open AI to push towards uh open source and that's the other side of Open Source that uh you mentioned is how much is published in detail about it so how open are you with the sort of the insights behind the code so like how good is the technical reports are they hand wavy or is there actual uh details in there and that's one of the things that deep seek did well is they publish a lot of the details yeah especially in the deeps V3 which
is their pre-training paper they were very clear that they are doing inter itions on the technical stack that go at many different levels for example on their to get highly efficient training they're making modifications at or below the Cuda layer for NVIDIA chips I have never worked there myself and there are a few people in the world that do that very well and some of them are at Deep seek and these types of people are at Deep seek and leading American frontier Labs but there are not many places to help people understand the other implication
of open weights just you know there's uh a topic we'll return to often here so there's a uh fear that China the nation might have interest in um stealing American data violating privacy of American citizens what can we say about open weights to help us understand what what the weights are able to do yeah in terms of stealing people's data yeah so these weights that you can download from hugging face or other platforms are very big matrices of numbers you can download them to a computer in your own house that has no internet and you
can run this model and you're totally control of your data that is something that is different than how a lot of language model usage is actually done today which is mostly through apis where you send your prompt to gpus run by certain companies and these companies will have different distributions and policies on how your data is stored if it is used to train future models where it is stored if it is encrypted and so on so the open weights you have your fate of data in your own hands and that is something that is deeply
connected to the soul of Open Source so it's not the model that steals your data it's clovers hosting the model which could be China if you're using the Deep seek app or it could be perplexity uh you know you're trusting them with your data or open AI you're trusting them with your data and some of these are American companies some of these are Chinese companies but the model itself is not doing the stealing it's the host all right so uh back to the basics what's the difference between deep seek V3 and deep seek R1 can
we try to like lay out the confusion potential yes so for one I have very understanding of many people being confused by these two model names so I would say the best way to think about this is that when training a language model you have what is called pre-training which is when you're predicting the large amounts of mostly internet text you're trying to predict the next token and what to know about these new deep seek models is that they do this internet large scale pre-training once to get what is called Deep seek V3 base this
is a base model it's just going to finish your sentences for you it's going to be harder to work with than chat GPT and then what deep seek did is they've done two different posttraining regimes to make the models have specific desirable behaviors so what is the more normal model in terms of the last few years of AI an instruct model a chat model a quote unquote aligned model a helpful model there are many ways to describe this is more standard post training so this is things like instruction tuning reinforce learning from Human feedback we'll
get into some of these words and this is what they did to create the deeps V3 model this was the first Model to be released and it is very high performant it's competitive with gp4 llama 405b so on and then when this release was happening we don't know their exact timeline or soon after they were finishing the training of a different training process from the same next token prediction base model that I talked about which is when this new reasoning training that people have heard about comes in in order to create the model that is
called Deep seek R1 the r through this conversation is good for grounding for reasoning and the name is also similar to open AI 01 which is other reasoning model that people have heard about and we have to break down the training for R1 in more detail because for one we have a paper detailing it but also it is a far newer set of techniques for the AI community so is a much more rapidly evolving area of research maybe we should also say the big two categories of training of pre-training and posttraining these umbrella terms that
people use so what is pre-training and what is posttraining and what are the different flavors of things underneath posttraining umbrella yeah so pre-training I'm using some of the same words to really get the message across is you're doing what is called autor regressive prediction to predict the next token in a series of documents this is done over standard practice is trillions of tokens so this is a ton of data that is mostly scraped from the web in some of deep se's earlier papers they talk about their training data being distilled for Math and I shouldn't
use this word yet but taken from common crawl and that's a public access that anyone listening to this could go download data from the common crawl website this is a crawler that is maintained publicly yes other tech companies eventually shift to their own crawler and deepy likely has done this as well as most Frontier Labs do but this sort of data is something that people can get started with and you're just predicting text in a series of documents this is can be scaled to be very efficient and there's a lot of numbers that are thrown
around in AI training like how many floating Point operations or flops are used and then you can also look at how many hours of these gpus that are used and it's largely one loss function taken to a very large amount of of compute usage you just you set up really efficient systems and then at the end of that you have this space model and pre-training is where there is a lot more of complexity in terms of how the process is emerging or evolving and the different types of training losses will use I think this is
a lot of techniques grounded in the natural language processing literature the oldest technique which is still used today is something called instruction tuning or also known as supervised fine tuning these acronyms will be if or sft it's that people really go back and forth throughout them and I will probably do the same which is where you add this formatting to the model where it knows to take a question that is like explain the history of the Roman Empire iror to me and or something you a sort of question you'll see on Reddit or stack Overflow
and then the model will respond in a information dense but presentable manner the core of that formatting is in this instruction tuning phase and then there's two other categories of loss functions that are being used today one I will classify as preference fine tuning preference fine tuning is a generalized term for what came out of reinforcement learning from Human feedback which is rhf this reinforce learning from Human feedback is credited as the technique that helped uh chat GPT break through it is a technique to make the responses that are nicely formatted like these Reddit answers
more in tune with what a human would like to read this is done by collecting parse preferences from actual humans out in the world to start and now AIS are also labeling this data and we'll get into those trade-offs and you have this kind of contrastive loss function between a good answer and a bad answer and the model learns to pick up these Trends there's different implementation ways you have things called reward models you could have direct alignment algorithms there's a lot of really specific things you can do but all of this is about fine-tuning
to human preferences and the final stage is much newer and will'll link to what is done in R1 and these reasoning models is I think open ai's Nam for this they had this new API in the fall which they called the reinforcement fine-tuning API this is the idea that you use the techniques of reinforcement learning which is a whole framework of AI there's a deep literature here to summarize it's often known as trial and error learning or the subfield of AI where you're trying to make sequential decisions in a certain potentially un potentially noisy environment
there's a lot of ways we could go down that but fine-tuning language models where they can generate an answer and then you check to see if the answer matches the true solution for math or code you have an exactly correct answer for math you can have unit tests for code and what we are doing is we are checking the language models work and we're giving it multiple opportunities on the same questions see if it is right and if you keep doing this the models can learn to improve in verifiable domains uh to a great extent
it works really well it's a newer technique in the academic literature it's been used at Frontier labs in the US that don't share every detail uh for multiple years so this is the idea of using reinforcement learning with language models and it has been taking off especially in this deep seek moment and we should say that there's a lot of exciting stuff going on on the uh again across the stack but the post training probably this year there's going to be a lot of interesting developments in the post training we'll we'll talk about it uh
I almost forgot to talk about the the difference between uh deep seek V3 and R1 on the user experience side so forget the technical stuff forget all that just people that don't know anything about AI they show up like what's the actual experience what's the use case for each one when they actually like type and talk to it what what is he good at and that kind of thing so let's start with deep seek V3 again it's what more people would have tried something like it you ask it a question it'll start generating tokens very
fast and those tokens will look like a very human legible answer it'll be some sort of markdown list it might have formatting to help you draw to the core details in the answer and it'll generate tens to hundreds of tokens a token is normally a word for common words or a subword part in a longer word and it'll look like a very high quality Reddit or stack Overflow answer these models are really getting good at doing these across a wide variety of domains I think even things that if you're an expert things that are close
to The Fringe of knowledge they will still be fairly good at I think Cutting Edge AI topics that I do research on these models are capable for study Aid and they're regularly updated this changes is with the Deep seek R1 what is called these reasoning models is when you see tokens coming from these models to start it will be a large chain of thought process we'll get back to Chain of Thought in a second which looks like a lot of tokens where the model is explaining the problem the model will often break down the problem
be like okay they asked me for this let's break down the problem I'm going to need to do this and you'll see all of this generating from the model it'll come very fast in most user experiences these AP are very fast so you'll see a lot of tokens a lot of words show up really fast it'll keep flowing on the screen and this is all the reasoning process and then eventually the model will change its tone in R1 and it'll write the answer where it summarizes its reading reasoning process and writes a similar answer to
the first types of model but in deep seeks case which is part of why this was so popular even outside the AI Community is that you can see how the language model is breaking down problems and then you get this answer on a technical side they they train the model to do this specifically where they have a section which is reasoning and then it generates a special token which is probably hidden from the user most of the time which says okay I'm starting the answer so the model is trained to do this two-stage process on
its own if you use a similar model and say openai open ai's user interface is trying to summarize this process for you nicely by kind of showing the sections that the model is doing and it'll kind of Click through it'll say breaking down the problem making calculation cleaning the result and then the answer will come for something like open AI maybe it's useful here to go through like an example of a deep seek R1 reasoning yeah so the if if you're looking at the screen here what you'll see is a screenshot of the deep seek
chat app and at the top is thought for 1517 seconds with the drop- down arrow underneath that if we were in an app that we were running the drop- down arrow would have the reasoning so in this case uh the question the specific question which you know I'm philosophically SL pothead inclin so this is uh asking deep deep SEC car1 for one truly novel insight about humans and it reveals the reasoning and basically the TR truly novel aspect is was pushing the reasoning to constantly sort of the model asking itself is this truly novel so
it's actually challenging itself to be more novel more counterintuitive uh more uh less cringe I suppose so some of the reasoning says uh this is just snapshots alternatively humans have a unique meta emotion where they feel emotions about their own emotions you feeling guilty about being angry this recursive emotional layering creates complex motivational drives that don't exist in other animals the inside is that human emotions are nested so it's like it's reasoning through how humans feel emotions it's reasoning about meta emotions going to have pages and Pages this it's almost too much to actually read
but it's nice to skim as it's coming it's stream of it's a James Joyce extreme of Consciousness and then it goes wait the user wants something that's not seen anywhere else let me dig deeper and consider the human ability to hold contradictory beliefs simultaneously cognitive dissonance is known but perhaps the function is to allow flexible adaptation so on and so forth I mean that really captures the public imagination that holy this isn't uh I mean intelligent slash almost like like an inkling of siience because like you're thinking through you're self-reflecting you're deliberating and the final
result of that after 157 seconds is humans instinctively convert selfish desires into Cooperative systems by collectively pretending abstract rules money laws rights are real these shared hallucinations act as quote games where competition is secretly redirected to benefit the group turning conflict into society's fuel pretty profound I mean you know this is AAL digression but a lot of people have found that these reasoning models can sometimes produce much more eloquent text that a at least interesting example I think depending on how open minded you are you find language models interesting or not and there's a spectrum
there well I mean it's some of the we'll talk about different benchmarks of s but some is just a Vibe like that in itself is a let's say quote fire tweet yeah if I I'm trying to produce something something where people are like oh okay so that's CH thought we'll probably return to it more how are they able to achieve such low cost on the training and the inference maybe you could talk the training first yeah so there's there's two main techniques that they implemented that are probably the majority of their efficiency and then there's
a lot of implementation details that maybe we'll gloss over or get into later that sort of contribut to it but those two main things are one is they went to a mixture of experts model uh which which we'll Define in a second and then the other thing is that they invented this new technique called MLA lat and attention both of these are are big deals mixture of experts is something that's been in the literature for a handful of years and open AI with gp4 was the first one to productize a mixture of experts model and
what this means is when you look at the common models around uh that most people have been able to interact with are open right think llama llama is a dense model I.E every single parameter or neuron is activated as you're going through the model for every single token you generate right now with a mixture of experts model you don't do that right how how does a human actually work right is like oh well my visual cortex is active when I'm thinking about you know Vision task and like you know other things right my my amydala
is when I'm scared right these different aspects of your brain are focused on different things a mixture of experts model attempts to approximate this to some extent it's nowhere close to what a brain architecture is but different portions of the model activate right you'll have a set number of experts in the model and a set number that are activated each time and this dramatically reduces both your training and inference cost because now you're you know if you think about the parameter count as the sort of total embedding space for all of this knowledge that you're
compressing down during training when you're embedding this data in instead of having to activate every single parameter every single time you're training or running inference now you can just activate a subset and the model will learn which expert to route to for different tasks and so this is a humongous innovation in terms of hey I can continue to grow the total embedding space of parameters and so deep seeks model is you know 600 something billion parameters right uh relative to llama 405b it's 405 billion parameters right llama relative to llama 70b it's 70 billion parameters
right so this model technically has more embedding space for information right to compress all of the world's knowledge that's on the internet down but at the same time it is only activating around 37 billion of the parameters so only 37 billion of these parameters actually need to be computed every single time you're training data or inferencing data out of it and so versus versus again the Llama model 70 billion parameters must be activated or 405 billion parameters must be activated so you've dramatically reduced your compute cost when you're doing training and inference with this mixture
of experts architecture so we break down where it actually applies and go into the Transformer is that useful let's go let's go into the Transformer the Transformer is a thing that is talked about a lot and we will not cover every detail uh essentially the Transformer is built on repeated blocks of this attention mechanism and then a traditional dense fully connected multi-layer perception whatever word you want to use for your normal neural network and you alternate these blocks there's other details and where mixture of experts is applied is that this dense model the dense model
holds most of the weights if you count them in a Transformer model so you can get really big gains from those mixure of experts on parameter efficiency at training an inference because you get this efficiency by not activating all of these parameters we should also say that a Transformer is a giant neural network yeah and then there's for 15 years now there's What's called the Deep learning Revolution networks gotten larger and larger and at a certain point the scaling laws appeared where people realized this is a scaling law Shirt By the way representing scaling laws
where it became more and more formalized that bigger is better across multiple dimensions of what bigger means so uh and but these are all sort of neural networks we're talking about and we're talking about different architectures of how construct to construct these neural networks such that the training and the inference on them is super efficient yeah every different type of model has a different scaling LW for it which which is effectively for how much compute you put in the architecture will get to different levels of performance at test tasks a mixture of experts is one
of the ones at training time even if you don't consider the inference benefits which are also big at training time your efficiency with your gpus is dramatically improved by using this architecture if it is well implemented so you can get effectively the same performance model in evaluation scores with numbers like 30% less compute I think there's going to be a wide variation depending on your implementation details and stuff but it is just important to realize that this type of technical Innovation is something that gives huge gains and I expect most companies that are serving their
models to move to this mixture of experts implementation historically the reason why not everyone might do it is because it's a implementation complexity especially when doing these big models so this is one of the things this deep seek gets credit for is they do this extremely well they do mixture of experts extremely well this architecture for what is called Deep seek moee is the shortened version of mixture of experts is multiple papers old this part of their training infrastructure is not new to these models alone and same goes for what Dylan mentioned with multi-ad lat
and attention this is all about reducing memory usage during inference and same things during training by using some fancy low rank approximation math if you get into the details with this latent attention it's one of those things I look at it's like okay this they're doing really complex implementations because there's other parts of language model such as uh embeddings that are used to extend the context length the common one that deep seek used is Rotary positional and pendings which is called rope and if you want to use rope with a normal Moe it's kind of
a sequential thing you take these you take two of the attention matrices and you rotate them by a complex value rotation which is a matrix multiplication with deep seek MLA with this new attention architecture they need to do some clever things because they're not set up the same and it just makes the implementation complexity much higher so they're managing all of these things and these are probably the sort of things that open AI these closed labs are doing we don't know if they're doing the exact same techniques but they actually shared them with the world
which is really nice to like this is The Cutting Edge of efficient language model training and some of this is requires low-level engineering just is a giant mess and trickery so as I understand they went below Cuda so they go super low programming of gpus effectively Nvidia builds this Library called nickel right uh in which you know when you're training a model you have all these communications between every single layer of the model and you may have over 100 layers what does a nickel stand for it's nccl Nvidia Communications collectives Library nice um and so
D when when you're training a model right you're going to have all these all reduces and all gathers right uh between each layer between the uh multier perceptron or feed forward Network and the attention mechanism you'll have you'll have basically the model synchronized right um or you'll have all the you'll have all reducer and all gather um and and this is a communication between all the gpus in the network whether whether it's in training or inference so Nvidia has a standard Library this is one of the reasons why it's really difficult to use anyone else's
Hardware uh for training is because no one's really built a standard Communications Library um and and nvidia's done this at a sort of a higher level right a deep seek because they have certain limitations around the gpus that they have access to the interconnects are limited to some extent um by the restrictions of the gpus that were shipped into China legally not the ones that are smuggled but legally shipped in uh that they used to train this model they had to figure out how to get efficiencies right and one of those things is that instead
of just calling the Nvidia Library nickel right they instead created their they scheduled their own Communications uh which which the lab some of the labs do right um em meta talked about in llama 3 how they made their own custom version of nickel this is they didn't they didn't talk about the implementation details this is some of what they did probably not as well as maybe not as well as deep seek Because deep seek you know necessity is the mother of innovation and they had to do this whereas uh in the casa you know open
AI has people that do this sort of stuff anthropic Etc uh but you know deep seek certainly did it publicly and they may have done it even better because they were gimped on a certain aspect of the chips that they have access to and so they scheduled Communications um you know by scheduling specific SMS SMS you could think of as like the core on a GPU right so there's hundreds of cores or there's you know a bit over a 100 cores SMS on a GPU and they were specifically scheduling hey which ones are running the
model which ones are doing all reduce which one are doing all gather right and they would flip back and forth between them and this requires extremely low-level programming this is what nickel does automatically or other Nvidia libraries handle this automatically usually yeah exactly and so so technically they're using you know PTX which is like sort of like you could think of it as like an assembly type language it's not exactly that or instruction set right like coding directly to assembly or instruction set it's not exactly that but uh that's still part of technically Cuda but
it's like do I want to write in Python you know pytorch equivalent and call Nvidia libraries do I want to go down to the ca level right or uh you know and code even lower level or do I want to go all the way down to the assembly or Isa level and and there are cases where you go all the way down there at the very big Labs but most companies just do not do that right because it's a waste of time and the efficiency gains you get are not worth it but deep seeks implementation
is so complex right especially with their mixture of experts right um people have done mixture of experts but they're generally 8 16 experts right and they activate to so you know one of the words we like Ed like to use is like sparsity Factor right or usage right so so you might have four you know one fourth of your model activate right and and and that's what Mist draws uh mixol model right uh their model that really catapulted them to like oh my God they're really really good um openi has also had models that aree
and and so have all the other labs that are major closed but what deep seek did that maybe only the leading Labs have only just started recently doing is have such a high sparity factor right it's not 1/4 of the model right two out of eight experts activating every time you go through the model it's eight out of 256 and there's different implementations for mixture of experts where you can have some of these experts that are ever always activated which this just looks like a small neural network and all the tokens go through that and
then they also go through some that are selected by this routing mechanism and one of the Innovations in deep seeks architecture is that they change the routing mechanism in mixture of expert models there's something called an auxiliary loss which effectively means during training you want to make sure that all of these experts are used across the tasks that the model sees why there can be failur and mixture of experts is that when you're doing this training the one objective is token prediction accuracy and if you just let toing go with a mixture of expert model
on your own it can be that the model learns to only use a subset of the experts and in thee literature there's something called the auxiliary loss which helps balance them but if you think about the loss functions of deep learning this even connects to the bitter lesson is that you want to have the minimum inductive bias in your model to let the model learn maximally and this auxiliary loss this balancing across experts could be seen as intention with the prediction accuracy of the tokens so we don't know the exact extent that the Deep seeke
change which is instead of doing an auxiliary loss they have an extra parameter in their routing which after the batches they update this parameter to make sure that the next batches all have a similar use of experts and this type of change can be big it can be small but they add up over time and this is the sort of thing that just points to them innovating and I'm sure all the labs that are training biges are looking at this sort of things which is getting away from the auxiliary loss some of them might already
use it but you just keep you keep accumulating gains and we'll talk about the philosophy of training and how you organize these organizations and a lot of it is just compounding small improvements over time in your data in your architecture and your post trainining and how they integrate with each other deep seek does the same thing and some of them are shared or a lot we have to take them on face value that they share their most important details I mean the architecture and the weights are out there so we're seeing what they're doing and
it adds up going back to sort of the like efficiency and complexity point right it's 32 versus a four right for like mix draw and othere models that have been publicly released so this ratio is extremely high and sort of what Nathan was getting at there was when you have such a different level of sparsity um you can't just have every GPU have the entire model right the model's too big there's too much complexity there so you have to split up the model um with different types of parallelism right and so you might have different
experts on different GPU nodes but now what what happens when a to you know this set of data that you get hey all of it looks like this one way and all of it should route to one part of my you know model right um so so when all of it rout routes to one part of the model then you can have the you can have this overloading of a s certain set of the GPU resources or certain set of the gpus and then the rest of the the training Network sits idle because all of
the tokens are just routing to that so this is the biggest complexity one of the big complexities with running a very you know sparse mixture of experts model uh I.E you know this 32 ratio versus this uh four ratio is that you end up with so many of the experts just sitting their Idol so how do I load balance between them how do I schedule the communications between them this is a lot of the like extremely low-level detailed work that they figured out in the public first and potentially like second or third and the world
and maybe even first in some cases what uh lesson do you uh in the direction of the better lesson do you take from all of this where is this going to be the direction where a lot of the gain is going to be which is this kind of lowlevel optimization or is this a shortterm thing where the biggest gains will be more on the algorithmic high level side of like posttraining is is this like a short-term leap because they figured out like a hack because constraints Necessities the mother of invention or is is there still
a lot of gains I think we should summarize what the bitter lesson actually is about is I the bitter lesson essentially if you paraphrase it is that the types of training that will win out in deep learning as we go are those methods that which are scalable in learning and search is what it calls out and the scale word gets a lot of attention in this the interpretation that I use is effective to avoid adding the human priors to your learning process and if you read the original essay this is what it talks about is
how researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term while simply enabling these deep Learning Systems to work efficiently and for these bigger problems in the long term might be more likely to scale and continue to drive success and therefore we were talking about relatively small implementation changes to the mixture of experts model and therefore it's like okay like we will need a few more years to know if one of these were actually really crucial to the bitter lesson but the bitter
lesson is really this long-term Arc of how Simplicity can often win and there's a lot of sayings in the industry like the models just want to learn you have to give them the simple lost landscape where you put compute through the model and and they will learn and get barriers out of the way that that's where the power something like nickel comes in where standardized code that could be used by a lot of people to create sort of simple innovations that can scale which is why the hacks the I imagine the code base for deep
seek is probably a giant mess I'm sure they have deep seek definitely has code bases that are extremely messy where they're testing these new ideas multi-head late in attention probably start could start in something like a Jupiter notebook or somebody tries something on a few gpus and that is really messy but the stuff that trains deep seek V3 and deep seek R1 those libraries if you were to present them to us I would guess are extremely high quality code high quality readable code I think there is one aspect to note though right is that there
is the general General ability for that to transfer across different types of runs right you may make really really high quality code for one specific model architecture at one size and then that is not transferable to hey when when I make this architecture tweak everything's broken again right like that's that's something that could be uh you know with their with their specific low-l coding of like scheduling SMS is specific to this model architecture and size right and whereas like nvidia's collectives library is more like hey it'll work for anything right you want to do an
all reduce great I don't care what your model architecture is it'll work uh and you're giving up a lot of performance when you do that uh in many cases but it's it's worth for them to do the specific uh optimization for the specific run given the constraints that they have regarding compute I wonder how stressful it is to like you know these Frontier models like initiate training like to have the code to push the button that like you're now spending a large amount of money and time to train this like there must I mean there
must be a lot of innovation on the debugging stage of like making sure there's no know issues that you're monitoring and visualizing every aspect of the training all that kind of stuff when when people are training they have all these various dashboards but like the most simple one is your loss right uh and it continues to go down but in reality especially with more complicated stuff likee the biggest problem with it or FPA training which is another Innovation you know going to a lower Precision number format I.E less accurate is that you end up with
lost bikes right and and no one knows why the Lost bike happen and for long some of them you do some of them you do some of them are data I give a ai's example of what blew up our earlier models is a subreddit called microwave gang we love to shout this out it's a real thing you can pull up microwave gang essentially it's a subreddit where everybody makes posts that are just the letter M so it's like so there's extremely long sequences of the letter M and then the comments are like beep beep because
that's when the microwave ends but if you pass this into a model that's trained to be a normal producing text it's extremely high loss because normally you see an M you don't predict M's for a long time so like this is something that causes a l spikes for us but when you have much like this is this is old this is not recent and when you have more mature Data Systems that's not the thing that causes the LW Spike and what Dylan is saying is true but it's like it's it's levels to this sort of
idea with regards to the stress right these people are like you know you'll go out to dinner with like a friend that works at one of these labs and they'll just be they'll just be like looking at their phone every like 10 minutes and they're not like you know it's one thing if they're texting but they're just like like is the Lost is the L tokens tokens per second lost not blown up they're just walking watching this and the heart rate goes up if there's a spike and some level of spikes is normal right it'll
it'll recover and be back sometimes a lot of the old strategy was like you just stop the run restart from the old version and then like change the data mix and then it keeps going there are even different types of spikes so Durk grenal has a theory A2 that's like Fast spikes and slow spikes where there are sometimes where you're looking at the loss and there other parameters you can see it start to creep up and then blow up and that's really hard to recover from so you have to go back much further so you
have the stressful period where it's like flat or might start going up and you're like what do I do whereas there also law spikes that are it looks good and then there's one spiky data point and what you can do is you just skip those you you see that there's a spike you're like okay I can ignore this data don't update the model and do the next one and it'll recover quickly but these like un trickier implementations so as you get more complex in your architecture and you scale up to more gpus you have more
potential for your loss blowing up so it's like there there's and there's a distribution the whole idea of grocking also comes in right it's like just because it slowed down from improving and loss doesn't mean it's not learning because all of a sudden it could be like this and it could just Spike down and loss again because it learned truly learned something right uh and it took some time for it to learn that it's not like a gradual process right and that's that's what humans are like that's what models are like so it's it's really
a stressful task as you mentioned and the whole time the the the dollar count is going up every company has failed runs you need failed runs to push the envelope on your infrastructure so a lot of news Cycles are made of X company had y failed run every company that's trying to push the frontier of AI has these so is yes it's noteworthy because it's a lot of money and it can be weektoon setback but it is part of the process but how do you get if you're deep seek how do you get to a
place where holy there's a successful combination of hyper parameters a lot of small failed runs and so so rapid uh iteration through failed runs until and successful ones you just and then you build a su tuation like this this mixture of expert works and then this implementation of MLA works key hyper parameters like learning rate and regularization and things like this and you find the regime that works for your code base I've talking to people at Frontier Labs there's a story that you can tell where training language models is kind of a path that you
need to follow so you need to like unlock the ability to train a certain type of model or a certain scale and then your code base and your internal knoow what type of parameters work for it is kind of known and you look at the Deep seek papers and models they' they've scaled up they've added complexity and it's just continuing to build the capabilities that they have there there's the concept of a YOLO run um so YOLO you only live once um and and what it is is like you know there's there's there's all this
experimentation you do at the small scale right uh research ablations right like you have your jupyter notebook whether you're experimenting with MLA on like three gpus or whatever um and you're doing all these different uh things like hey do I do four expert four active experts 128 experts do I arrange the experts this way you know all these different uh model architecture things you're testing at a very small scale right couple researchers few gpus tens of gpus hundreds of gpus whatever it is and then all of a sudden you're like okay guys no more no
more around right uh no more screwing around everyone take all the resources we have let's pick what we think will work and just go for it right YOLO and this is where that sort of stress comes in is like well I know it works here but some things that work here don't work here and some things that work here don't work down here right in this terms of scale right so it's it's it's really truly a YOLO run and and sort of like there is this like like discussion of like certain researchers just have like
this methodical nature like they can find the whole search space and like figure out all the ablations of different research and really see what is best and there's certain researchers who just kind of like you know have that innate gut instinct of like this is the Yol run like you know looking at the data this is it this is why you want to work in post training because the GPU cost for training is lower so you can make a higher percentage of your training runs Yol will runs yeah for for now yeah for now for
for now so some of this is fundamentally luck still luck is skill right in many cases yeah I mean it looks lucky right when you're but the hill to climb if you're on one of these labs and you have an evaluation you're not crushing there's a repeated Playbook of how you improve things there are localized improvements which might be data improvements and these add up into the whole model just being much better and when you zoom in really close it can be really obvious that this model is just really bad at this thing and we
can fix it and you just add these up so like some of it feels like look but on the ground especially with these new reasoning models we're talking to is just so many ways that we can poke around and normally it's that some of them give big improvements the search space is near infinite right and and yet the amount of computing time you have is is very low and you're you're you have to hit release schedules you have to not get blown past by everyone otherwise you know what happened with deep seek you know crushing
meta and mistr and coher and all these guys they moved too slow right they they maybe were too methodical I don't know they didn't hit the Yello run whatever the reason was maybe they weren't as skilled uh whatever what you know you can call it luck if you want but at the end of the day it's skill so 2025 is the year of the YOLO run it seems like all the labs are like going in I I I think it's even more impressive what openi did in 2022 right at the time no one believed in
mixture of experts models right at Google uh who had all the researchers uh opening ey had such little compute and they devoted all of their compute for many months right all of it 100% for many months to gp4 with a brand new architecture with no belief that hey let me spend a couple hundred million dollars which is all of the money I have on this model right that is truly YOLO yeah right now now you know people like all these like training run failures that are in the media right it's like okay great but like
actually a lot huge chunk of my GPS are doing inference I still have a bunch doing research constantly and yes my biggest cluster is training but like on on this YOLO run but like that YOLO run is much less risky than like what opening I did in 2022 or maybe what deep seek did now or you know like sort of like hey we're just going to throw everything at it the big Winners throughout human history are the ones who are willing to do yellow at some point okay uh what do we understand about the hardware
it's been trained on deep seek deep seek is very interesting this a second to take zoom out out of who they are first of all right highflyer is a hedge fund that has historically done quantitative trading in China as well as elsewhere and they have always had a significant number of gpus right in the past a lot of these high frequency trading algorithmic Quant Traders used fpgas uh but it shifted to gpus definitely and there's both right but gpus especially and deep and and highflyer which is the hedge fund that owns deep seek and everyone
who works for deep seek is part of highflyer to some extent right uh it's same same parent company same owner same CEO they had all these resources and infrastructure for trading and then they devoted a humongous portion of them to training models uh both language models and otherwise right because these these these te techniques were heavily AI influenced um you know more recently people have you know realized hey trading with um you know like even even when you you go back to like Renaissance and all these all these like quantitative firms natural language processing is
the key to like trading really fast right understanding a press release uh and making the right trade right and so deep seek has always been really good at this and even as far back as 2021 they they have press releases and papers saying like Hey we're the first company in China with an a100 cluster this large those 10,000 a100 gpus right this is this is in 2021 now this wasn't all for training you know large language models this was mostly for training models for their quantitative aspects their or quantitative trading as well as you know
a lot of that was natural language processing to be clear right um and so this is the sort of History right so verifiable fact is that in 2021 they built the largest chin uh cluster at least they claim it was the largest cluster in China 10,000 gpus before export controls started yeah it's like they've had a huge cluster before any conversation of export controls so then you step it forward to like what have they done over the last four years since then right um obviously they've continued to operate the hedge fund probably make tons of
money and the other thing is that they've leaned more and more and more into AI the CEO Le CH Fang uh Leon you're not putting me spot on this we discuss this Leon Fang right the CEO he own maybe Leon Fang he he owns maybe a little bit more than half the company allegedly right um is an extremely like Elon Jensen kind of figure where he's just like involved in everything right um and so over that time period he's gotten really in-depth into AI he actually has a bit of a like a if you if
you see some of his statements a bit of an eak Vibe almost right total AGI Vibes like we need to do this we need to make a new ecosystem of open AI we need China to lead on this sort of ecosystem because historically the Western countries have led on on software ecosystems and in straight up acknowledges like in order to do this we need to do something different deep seek is his way of doing this some of the translated interviews with him are so he has done interviews yeah you think he would do a western
interview or no or is there controls on there hasn't been one yet but okay I would try it well I just got a Chinese translator so it's great this is this is all push um so fascinating figure engineer pushing full on into AI leveraging the success from The High Frequency trading very direct quotes like we will not switch to closed Source when ask about this stuff very long-term motivated in how the ecosystem of AI should work and I think from a Chinese perspective he wants the Chinese company a Chinese company to build this vision and
So he's the sort of quote-unquote visionary behind the company. The hedge fund — the quantitative firm — still exists; slowly he got turned to this full view of AI everything, and at some point it maneuvered into him making DeepSeek. DeepSeek has done multiple models since then. They've acquired more and more GPUs, and they share infrastructure with the fund. So there is no exact public number for the GPU resources they have, besides those 10,000 GPUs they bought in 2021 — and they were fantastically profitable. And then this paper claims they used only 2,000 H800 GPUs, which are a restricted GPU that was previously allowed in China but is no longer allowed (there's a new version now). It's basically Nvidia's H100 for China, with some restrictions on it, specifically around the interconnect speed — which is why they had to do this crazy SM scheduling stuff.
So going back to that: the 2,000 figure is obviously not true in terms of their total available GPU count. But for this training run, do you think 2,000 is the correct number, or no? This is where it takes a significant amount of zoning in. What do you call your training run? Do you count all of the research and ablations you ran to pick all the stuff? Because yes, you can do a YOLO run, but at some level you have to test at small scale, and then test at medium scale, before you go to large scale. Accepted practice is that, for any given model that is a notable advancement, you're going to spend 2 to 4x the compute of the full training run on experiments alone. So a lot of the compute that's being scaled up is probably used, at this time, in large part for research. And research begets the new ideas that get you huge efficiency gains; research gets you o1, research gets you breakthroughs, and then you need to bet on them. So some of the pricing strategy they will discuss has the research baked into the price. The numbers that DeepSeek specifically stated publicly are just the 10,000 GPUs in 2021 and then 2,000 GPUs for only the pre-training of V3. They did not discuss the cost of R1; they did not discuss the cost of all the other RL for the instruct model they made. They only discussed the pre-training for the base model, and they did not discuss anything on research and ablations, or the resources that are shared with the fund.
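To make that accounting concrete, here's a back-of-the-envelope sketch in Python. The only sourced figure is the roughly 2.79M H800-hours (about $5.6M at $2/GPU-hour) that DeepSeek reported for V3 pre-training; the research multiplier and the post-training share are illustrative assumptions following the 2-4x rule of thumb above, not disclosed numbers.

```python
# Back-of-the-envelope: how research/ablation compute inflates the true cost
# of a frontier model beyond the headline pre-training number.
# Only the pre-training GPU-hours figure is from DeepSeek's paper;
# everything else is an illustrative assumption.

headline_gpu_hours = 2.79e6   # DeepSeek-V3's reported ~2.79M H800-hours (pre-training only)
gpu_hour_cost_usd = 2.0       # the rental-equivalent price used in their paper

research_multiplier = 3.0     # the "2-4x of the full run on experiments" rule of thumb
post_training_share = 0.1     # assumed extra fraction for RL / instruction tuning (not disclosed)

pretraining_cost = headline_gpu_hours * gpu_hour_cost_usd
total_cost = pretraining_cost * (1 + research_multiplier + post_training_share)

print(f"headline pre-training cost: ${pretraining_cost / 1e6:.1f}M")
print(f"plausible all-in cost:      ${total_cost / 1e6:.1f}M")
```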
And again, the fund is using all these GPUs too; we know they're very profitable, and they had those 10,000 GPUs from 2021. So some of the research we've done suggests they actually have closer to 50,000 GPUs. "We" being SemiAnalysis — we should say that you're one of the world experts in figuring out what everybody's doing, in terms of semiconductors, cluster build-outs, who's doing what in terms of training runs. So that's the "we." Okay, go ahead. We believe they have something closer to 50,000 GPUs. Now, this is split across many tasks: again, the fund, plus research and ablations. For ballpark, how much would OpenAI or Anthropic have? The clearest example we have, because Meta is also open: they talk about order of 60K to 100K H100-equivalent GPUs in their training clusters. Llama 3, they said, was trained on 16,000 H100s, but the company Meta publicly disclosed last year that they bought something like 400,000-plus GPUs. So of course only a tiny percentage goes to training; most of it is serving me the best Instagram Reels, or whatever. I mean, we could get into the cost of ownership for a 2,000-GPU cluster, or 10,000 — there are just different sizes of companies that can afford these things — and DeepSeek is reasonably big. Their compute allocation is one of the top few in the world. It's not OpenAI or Anthropic, etc., but they have a lot of compute.
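As a rough sense of what "affording these things" means, here's a toy total-cost-of-ownership sketch. Every input — capex per GPU, power draw, electricity price, opex fraction — is an illustrative assumption chosen for scale, not a SemiAnalysis figure.

```python
# Rough yearly total-cost-of-ownership for a GPU cluster of a given size.
# All constants below are illustrative assumptions, not quoted figures.

def cluster_tco_per_year(num_gpus: int,
                         capex_per_gpu: float = 30_000.0,  # assumed all-in: GPU + networking + datacenter
                         depreciation_years: float = 4.0,
                         power_per_gpu_kw: float = 1.4,    # assumed, incl. cooling overhead
                         power_cost_kwh: float = 0.08,
                         opex_fraction: float = 0.10):     # staff/maintenance as fraction of capex per year
    capex_per_year = num_gpus * capex_per_gpu / depreciation_years
    power_per_year = num_gpus * power_per_gpu_kw * 24 * 365 * power_cost_kwh
    opex_per_year = num_gpus * capex_per_gpu * opex_fraction
    return capex_per_year + power_per_year + opex_per_year

for n in (2_000, 10_000, 50_000):
    print(f"{n:>6,} GPUs ~ ${cluster_tco_per_year(n) / 1e6:,.0f}M/year")
```

With these assumptions a 2,000-GPU cluster runs on the order of $25M a year, and 50,000 GPUs pushes toward $600M a year — which is why "different sizes of companies can afford these things."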
Can you zoom out and talk about the Hopper architecture — the Nvidia Hopper GPU architecture — and the difference between the H100 and H800? You mentioned the interconnects. Yeah. So Ampere was the A100, and then Hopper is the H100. People use them synonymously in the US because there was really just the H100, and now there's the H200 — but mostly the same thing. In China, there have been different salvos of export restrictions. Initially the US government limited on a two-factor scale: chip interconnect versus FLOPS. Any chip that had interconnect bandwidth above a certain level and floating-point operations above a certain level was restricted. Later the government realized this was a flaw in the restriction, and they cut it down to just floating-point operations. So the H800 had high FLOPS, low communication? Exactly. The H800 had the same FLOPS performance as the H100, but it had the interconnect bandwidth cut. DeepSeek knew how to utilize this: hey, even though we're cut back on the interconnect, we can do all this fancy stuff to figure out how to use the GPU fully anyway. That was back in October 2022. But later — end of 2023, implemented in 2024 — the US government banned the H800. By the way, this H800 cluster, these 2,000 GPUs, was not even purchased in 2024; it was purchased back in late 2022, and they're just getting the model out now, because it takes a lot of research, etc. So the H800 was banned.
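A toy sketch of that two-factor logic, to make the H800 story concrete. The threshold values and chip specs below are placeholders for illustration; the actual BIS rules define the limits differently and these are not the real regulatory numbers.

```python
# Toy model of the export-control logic described above: the original
# October 2022 rule restricted chips exceeding BOTH a FLOPS threshold AND an
# interconnect threshold; the 2023 revision dropped the interconnect test.
# Thresholds and specs are placeholders, not the actual BIS numbers.

from dataclasses import dataclass

@dataclass
class Gpu:
    name: str
    flops_tf: float          # dense FP16 TFLOPS (illustrative)
    interconnect_gbs: float  # GPU-to-GPU bandwidth in GB/s (illustrative)

FLOPS_LIMIT = 300.0          # placeholder threshold
INTERCONNECT_LIMIT = 600.0   # placeholder threshold

def restricted_2022(g: Gpu) -> bool:
    # original rule: both factors had to exceed the limits
    return g.flops_tf > FLOPS_LIMIT and g.interconnect_gbs > INTERCONNECT_LIMIT

def restricted_2023(g: Gpu) -> bool:
    # revised rule: FLOPS alone decides
    return g.flops_tf > FLOPS_LIMIT

h100 = Gpu("H100", flops_tf=1000, interconnect_gbs=900)
h800 = Gpu("H800", flops_tf=1000, interconnect_gbs=400)  # same FLOPS, cut interconnect

print(restricted_2022(h100), restricted_2022(h800))  # True False -> H800 was allowed
print(restricted_2023(h100), restricted_2023(h800))  # True True  -> H800 then banned
```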
And now there's a new chip called the H20. The H20 is cut back only on FLOPS; the interconnect bandwidth is the same, and in some ways it's actually better than the H100, because it has better memory bandwidth and memory capacity. So Nvidia is working within the constraints of what the government sets and building the best possible GPU for China. Can we take an actual tangent — we'll return to the hardware — on the philosophy, the motivation, the case for export controls? What is it? Dario Amodei just published a blog post about export controls. The case he makes is that if AI becomes super powerful — and he says by 2026 we'll have AGI, or super powerful AI — then whoever builds it will have a significant military advantage. And because the United States is a democracy and, as he says, China is authoritarian or has authoritarian elements, you want a unipolar world where the super powerful military — because of the AI — belongs to a democracy. It's a much more complicated world geopolitically when you have two superpowers with super powerful AI and one is authoritarian. So that's the case he makes, and so the United States wants to use export controls to slow China down, to make sure China can't do the gigantic training runs that would presumably be required to build AGI. This is very abstract. I think this can be the goal of how some people describe export controls — this super powerful AI — and you touched on the training-run idea. There are not many worlds where China cannot train AI models. I think export controls are capping the amount of compute, or the density of compute, that China can have.
And if you think about the AI ecosystem right now: all of these AI companies' revenue numbers are up and to the right, AI usage is continuing to grow, and more GPUs are going to inference. A large part of export controls, if they work, is simply that the amount of AI that can be run in China will be much lower. On the training side, DeepSeek V3 is a great example: a very focused team can still get to the frontier of AI on 2,000 GPUs, which is not that hard to get, all things considered. They're still going to have those GPUs; they're still going to be able to train models. But if there's going to be a huge market for AI — if you need 100,000 GPUs just serving the equivalent of ChatGPT — then with good export controls, it also just makes it so that AI can be used much less. And I think that is a much easier goal to achieve than trying to debate what AGI is. And if you have these extremely intelligent autonomous AIs in data centers, those are the things that could be running in GPU clusters in the United States, but not in China. To some extent, training a model does effectively nothing. The thing Dario is speaking to is the implementation of that model, once trained, to create huge economic growth, huge increases in military capabilities, huge increases in people's productivity, betterment of lives — whatever you want to direct super powerful AI towards, you can — but that requires a significant amount of compute. And so the US government has effectively said — and training will always be only a portion of the total compute; we mentioned Meta's 400,000 GPUs, of which only 16,000 made Llama — that whether the compute Meta dedicates to inference goes to recommendation systems trying to hack our minds into spending more time watching ads, or to a super powerful AI
that's doing productive things, it doesn't matter what exact use our economic system decides on — that capability can be delivered in whatever way we want. Whereas with China: export restrictions — you're never going to be able to cut everything off, and I think that's quite well understood by the US government. You can't cut everything off. They'll make their own chips; they're trying to make their own chips. They'll be worse than ours, but the whole point is to just keep a gap. In a world of 2-3% economic growth, cutting off high tech rather than making money off of it is really dumb, by the way. But in a world where super powerful AI comes about and starts creating significant changes in society — which is what all the AI leaders and big tech companies believe, and I think super powerful AI will change society massively — this compounding effect of the difference in compute is really important. There's some sci-fi out there where AI is measured by how much power is delivered to its compute. That's one way of thinking about economic output: how much power are you directing towards that AI? Should we talk about reasoning models, as a way to make this actionable, something people can actually see? So, the reasoning models that are coming out, with R1 and o1 —
they're designed to use more compute. There are a lot of buzzy words in the AI community about this: test-time compute, inference-time compute, whatever. Dylan has good research on this; you can get to specific numbers on the ratio of compute used at training versus at inference. These reasoning models are making inference way more important for doing complex tasks. In the fall — in December — OpenAI announced this o3 model. Another thing about AI when things move fast: we get both announcements and releases. Announcements are essentially blog posts where you pat yourself on the back and say you did things; releases are when the model is out there, the paper is out there, etc. So OpenAI has announced o3. We can check whether o3-mini is out as of recording, but that doesn't really change the point, which is that the breakthrough result was on something called the ARC-AGI task — the Abstraction and Reasoning Corpus, a task for artificial general intelligence. François Chollet is the guy behind it. It's a multi-year-old paper and a brilliant benchmark. And the number for OpenAI's o3 to solve it: it used some number of samples in the API — the API has a thinking-effort setting and a number of samples — they used a thousand samples to solve the task, and it comes out to something like five to twenty dollars per question. You're putting in what is effectively a math puzzle, and it takes orders of dollars to answer one question. That is a lot of compute.
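To see how a thousand samples turns in dollars per puzzle, here's the arithmetic as a tiny sketch. The token count per sample and the blended API price are assumptions picked to land in the publicly discussed $5-20 range, not OpenAI's disclosed o3 pricing.

```python
# Rough cost arithmetic for the reported o3 ARC-AGI runs:
# many samples per question -> dollars per question.
# Token count and price are assumptions, not disclosed figures.

samples_per_question = 1_000     # the reported number of samples
tokens_per_sample = 5_000        # assumed chain-of-thought length per sample
price_per_million_tokens = 4.0   # assumed blended API price, USD

cost_per_question = (samples_per_question * tokens_per_sample / 1e6) * price_per_million_tokens
print(f"~${cost_per_question:.0f} per question")  # ~$20 with these assumptions
```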
If this is going to take off in the US, OpenAI needs a ton of GPUs on inference to capture it. They have this ChatGPT Pro subscription, which is $200 a month, which Sam Altman said they're losing money on — which means people are burning a lot of GPUs on inference. I've signed up for it, I've played with it; I don't think I'm a power user, but I use it. And that is the thing a Chinese company, under even medium-strength export controls — there will always be loopholes — might not be able to do at all. The headline result for o3 is also spectacular coding performance, and if that feeds back into AI companies being able to experiment better... So presumably the idea is that for an AGI, a much larger fraction of the compute would be used for this test-time compute, for the reasoning. The AGI goes into a room, thinks about how to take over the world, and comes back in 2.7 hours — and that's going to take a lot of compute. This is what people like the CEOs and leaders of OpenAI and Anthropic talk about: autonomous AI models, where you give them a task and they work on it in the background. I think my personal definition of AGI is much simpler: I think language models are a form of AGI, and all this super powerful stuff is a next step that's great if we get these tools. A language model has so much value in so many domains that it is a general intelligence to me. But this next step — agentic things that are independent and can do tasks that aren't in the training data — is the outlook these AI companies are driving for. I think the terminology Dario uses here is "super powerful AI." So I agree with you on AGI: I think we already have something exceptionally impressive, something Alan Turing would for sure say is AGI. But Dario is referring to something which, once you possess it, gives you a significant military and geopolitical advantage over other nations. It's not just that you can ask it how to cook an omelet. And he has a much more positive view in his essay "Machines of Loving Grace."
I've read into this. I don't have enough background in the physical sciences to gauge exactly how confident I am about whether AI can revolutionize biology, but I'm safe saying AI is going to accelerate the progress of any computational science. So we're doing a depth-first search here on topics, taking a tangent of a tangent — let's continue on that depth-first search. You said you're both feeling the AGI. So what's your timeline? Dario says 2026 for the super powerful AI that's agentic to a degree where it's a real security threat — that level of AGI. What's your timeline? I don't like to attribute specific abilities, because predicting specific abilities and when they arrive is very hard. Mostly, if I'm going to say I'm feeling the AGI, it's that I expect continued, rapid, surprising progress over the next few years. Something like R1 is less surprising to me, coming from DeepSeek, because I expect there to be new paradigms where substantial progress can be made. DeepSeek R1 is so unsettling because we were kind of on this path with ChatGPT — it's getting better, it's getting better — and then we got a new direction for changing the models, and we took one step like that, a step up, so it looks like a really steep slope. And then we're going to take more steps. It's just really unsettling when you have these big steps, and I expect that to keep happening. I've tried OpenAI Operator, I've tried Claude computer use — they're not there yet. I understand the idea, but it's just so hard to predict which breakthrough will make something like that work. And I think it's more likely that we get breakthroughs that work in ways we didn't anticipate. Everyone wants agents; Dario has a very eloquent way of describing this; I just think there's going to be more than that. Just expect these things to come. I'm going to have to try to pin you down to a date on the AGI timeline — like the nuclear-weapon moment.
The moment where, on the geopolitical stage, there's a real shift — because we're talking about export controls — when do you think, just to throw out a date, that would be? For me it's probably after 2030. That's what I would say. So define that, because to me it kind of almost has already happened. Look at elections in India and Pakistan: people get AI voice calls and think they're talking to the politician. The AI diffusion rules, which were enacted in the last couple weeks of the Biden admin — and it looks like the Trump admin will keep them and potentially strengthen them — limit cloud computing and GPU sales to countries that are not even related to China. Portugal and all these normal countries are on the "you need approval from the US" list. Portugal, and all these countries that are allies — Singapore — they freaking have F-35s and we don't let them buy GPUs? This, to me, is already at that scale. Well, that just means the US military is really nervous about this new technology; it doesn't mean the technology is already there. They might just be very cautious about this thing they don't quite understand. But that's a really good point. The robocalls, swarms of semi-intelligent bots, could be a weapon, could be doing a lot of social engineering. I mean, there's been tons of talk since the 2016 elections — Cambridge Analytica and all this stuff, Russian influence. Every country in the world is pushing stuff onto the internet and
has narratives it wants — every technically competent country, whether it's Russia, China, the US, Israel, etc. People are pushing viewpoints onto the internet en masse, and language models crash the cost of very intelligent-sounding language. But there's some research showing that distribution is actually the limiting factor, so language models haven't yet really changed the misinformation equation; the internet is still the internet. There's a blog, AI Snake Oil, and some of my friends at Princeton write on this stuff. So there is research on it. It's a default assumption everyone makes — I would have thought the same thing — that misinformation gets far worse with language models. But in terms of internet posts and the things people have been measuring, there hasn't been an exponential increase or anything extremely measurable. With the things you're describing, like voice calls, it could be happening in modalities that are harder to measure. So it's too soon to tell; political instability via the web is monitored by a lot of researchers to see what's happening. On the AGI question: if you make me give a year, I'd say, okay — I have AI CEOs saying this, and they've been saying two years for a while. I think people like Dario at Anthropic have thought about this deeply, so I need to take their words seriously, but I also understand they have different incentives. So I would add a few years to that, which is how you get something like 2030, or a
little after 2030. I think, to some extent, we'll hit capabilities at a certain point where any one person could say, okay, if I can leverage those capabilities for X amount of time, this is AGI — call it '27, '28. But then the cost of actually operating that capability — yeah, this is going to be my point — is so extreme that no one can deploy it at scale, en masse, to completely revolutionize the economy at the snap of a finger. So I don't think it will be a snap-of-the-finger moment; it's a physical constraint. It'll be: oh, the capabilities are here, but I can't deploy it everywhere. One simple example, going back to 2023, was when Bing with GPT-4 came out and everyone was freaking out about search. Perplexity came out. If you did the cost of implementing GPT-3 into every Google search, it came out to: okay, this is just physically impossible to implement. And as we step forward to the test-time compute thing: a query — you ask ChatGPT a question — costs cents, for their most capable chat model, to get an answer back. To solve an ARC-AGI problem, though, costs five to twenty bucks, and it's only going up from there. That's a 1,000x to 10,000x factor difference in the cost to respond to a query versus do a task. And the task here is simple, to some extent — but also, what are the tasks we actually want AI to do?
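The ratio being gestured at, as a one-liner; the per-query and per-task costs are the ballparks from this conversation, not measured prices.

```python
# The "cost to answer a query vs. cost to do a task" gap, as a simple ratio.
# Figures are the conversational ballparks, not measured prices.

chat_query_cost = 0.01      # a ChatGPT-style query: order of a cent
arc_task_cost_low = 5.0     # solving one ARC-AGI question: $5-20
arc_task_cost_high = 20.0

print(f"{arc_task_cost_low / chat_query_cost:,.0f}x to "
      f"{arc_task_cost_high / chat_query_cost:,.0f}x")
# -> 500x to 2,000x; with sub-cent queries it stretches to the 1,000-10,000x cited
```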
Okay — the "AGI," quote-unquote, that we have today can do ARC-AGI. Three years from now, it can do much more complicated problems, but the cost will be measured in thousands and hundreds of thousands of dollars of GPU time, and there just won't be enough power, GPUs, and infrastructure to operate it and thereby shift everything in the world at the snap of a finger. But at that moment, who gets to control and point the AGI at a task? This was in Dario's post: hey, China can, more effectively and more quickly than us, point their AGI at military tasks. And they have been, in many ways, faster at adopting certain new technologies into their military, especially with regard to drones. The US maybe has a long-standing lead in large fighter-jet-type aircraft and bombers, but when it comes to asymmetric arms such as drones, China has completely leapfrogged the US and the West. The fear Dario is pointing at, I think, is: great, we'll have AGI in the commercial sector, but the US military won't be able to implement it super fast, while the Chinese military could, and they could direct all their resources to implementing it in the military — solving military logistics, or solving some other aspect like disinformation targeted at a certain set of people so they can flip a country's politics, something actually catastrophic — versus the US, where, because allocation is more capitalistic, it goes towards whatever has the highest return on investment, which might be building
factories better, or whatever. So — everything I've seen suggests people's intuition fails on robotics. There's this kind of general optimism. I've seen it with self-driving cars: people think it's a much easier problem than it is. Similar with drones — I understand that one a little bit less, but I've seen the reality of the war in Ukraine and the usage of drones on both sides, and it seems that humans still far outperform fully autonomous systems. AI is an assistant, but humans flying FPV drones, where the human controls most of it, far, far outperform AI systems. So it's not obvious to me that we're going to have swarms of autonomous robots anytime soon in the military context. Maybe the fastest I can imagine is 2030 — which is why I said 2030 for the super powerful AI. Whenever you have large-scale swarms of robots doing military actions, that's when the world starts to look different to me. That's the thing I'm really worried about. But there could also be cyber-war-type technologies, from social engineering to actual swarms of agents that find attack vectors in our code bases and shut down power grids — that kind of stuff. It could be one of those things where, on some given weekend, the power goes out, nobody knows why, and the world changes forever. Just the power going out for two days across the United States would lead to murder, to chaos. But going back to export controls: do you see them as a useful way to control the balance of power geopolitically in the context of AI?
Going back to my viewpoint: if you believe we're going to stay in the stage of economic growth and change we've been in for the last 20 years, then export controls are absolutely guaranteeing that China wins long-term. If you do not believe AI is going to make significant changes to society in the next ten years, or five years — five-year timelines are what the more aggressive executives of AI companies and even big tech companies believe, and even ten-year timelines are reasonable — but once timelines are below that period, the only way to create a sizable advantage or disadvantage for America versus China is to constrain compute. Because talent is not really the constraint: China arguably has more talent, more STEM graduates, more programmers. The US can draw upon the world's people, which it does — there are tons of foreigners in the AI industry; so many of these AI teams are people without a US passport. Many of them are Chinese people moving to America, and that's great — that's exactly what we want. But talent isn't a measurable advantage for the US one way or the other; it truly is just compute. Now, even on the compute side, when we look at chips versus data centers: China has this unprecedented ability to build ridiculous sums of power, like clockwork. They're always building more and more power. They've got steel mills that individually are the size of the entire US industry, and they've got aluminum
mills that consume gigawatts and gigawatts of power. And when we talk about what's the biggest data center — OpenAI made this huge thing about Stargate; their announcement, once it's fully built out in a few years, is 2 gigawatts of power — that's still smaller than the largest industrial facilities in China. China, if they wanted to build the largest data center in the world, could, if they had access to the chips. So it's just a question of when, not if. So their industrial capacity far exceeds the United States'? Exactly. So long-term, they're going to be manufacturing chips there? Chips are a little bit more specialized; I'm specifically referring to the data centers. Chip fabs take huge amounts of power too, don't get me wrong, but that's not necessarily the gating factor there. The gating factor on how fast people can build the largest clusters today in the US is power. It could be power generation, power transmission, substations, transformers, building the data center itself — these are all constraints on the US industry's ability to build larger and larger training systems and to deploy more and more inference compute. I think we need to make clear why the timing matters now, for people who don't think about this. Essentially, with export controls, you're making it so China cannot make or get cutting-edge chips. And the idea is that if you time this wrong, China is pouring a ton of money into
their chip production, and if you time it wrong, they'll end up with more capacity for production, more capacity for energy, and they'll figure out how to make the chips and have more capacity than the rest of the world to make them. And they'll sell their Chinese chips to everybody — they might subsidize them. So if AI takes a long time to become differentiated, we've capped the financial performance of American companies: Nvidia can sell less, TSMC cannot sell to China, so we have less demand to keep driving the production cycle. So that's the assumption behind the timing being less than ten years, or five: long-term, China wins because of these restrictions, unless AI does something in the short term — which I believe AI will, making massive changes to society in the medium-short term. That's the big unlock there. And even today: if Xi Jinping decided to get, quote-unquote, "scale-pilled" — i.e., decide that scaling laws are what matter, just like US executives such as Satya Nadella, Mark Zuckerberg, and Sundar Pichai, the executives of the biggest and most powerful tech companies, have decided — they're scale-pilled and they're building multi-gigawatt data centers, whether in Texas or Louisiana or Wisconsin, wherever it is. They're building massive things that cost as much as their entire previous global data-center budget in one spot. This is what they've committed to for next year, the year after, etc. They're so convinced that this is the way, that this is what
they're doing. But if China decided to, they could do it faster than us. This is where the restrictions come in: it is not clear that China as a whole has decided, from the highest levels, that this is a priority. The US sort of has — you see Trump talking about DeepSeek and Stargate within the same week, and the Biden admin as well had a lot of discussions about AI. It's clear they think about it. Only just last week did DeepSeek meet the second-in-command of China. They haven't even met the top — Xi hasn't sat down with them — and they only just announced a subsidy of roughly one trillion RMB, roughly $160 billion, which is close to the spending of Microsoft, Meta, and Google combined for this year. So they're realizing it just now. But that's where the export restrictions come in and say: hey, you can't ship the most powerful US chips to China — you can ship a cut-down version; you can't ship the most powerful chips to all these countries we know are just going to rent them to China — you have to limit the numbers; and the same with manufacturing equipment and tools, all these different aspects. It all stems from AI and what, downstream, can slow them down in AI. The entire set of semiconductor restrictions — if you read them, they're very clear: it's about AI and military-civil fusion of technology. It's very clear. And then
from there it goes: oh well, we're banning them from buying lithography tools, etch tools, deposition tools — and, oh, this random subsystem from a tiny random company. Why are we banning this? Because all of it, the US government has decided, is critical to AI systems. I think the famous flashpoint is the transition from 7 nanometer to 5 nanometer chips, where it was Huawei that had the 7 nanometer chip a few years ago, which caused another political brouhaha, almost like this moment. And then it comes down to ASML and EUV — what is that? Extreme ultraviolet lithography. To set context on the chips: what Nathan's referring to is that in 2020, Huawei released their Ascend 910 chip, an AI chip, the first on 7 nanometer — before Google did, before Nvidia did — and they submitted it to the MLPerf benchmark, an industry-standard machine learning performance benchmark. It did quite well; it was the best chip at submission. This was a huge deal. The Trump admin, of course — this was 2019 — banned Huawei from getting 7 nanometer chips from TSMC, so they had to switch to domestically produced chips, which was a multi-year setback. Many companies have done 7 nanometer chips, and the question is how much Huawei was subsidizing production of that chip — we don't know. Intel has made 7 nanometer chips that are not profitable, and things like that. So this is how it all feeds back into the economic engine of export controls.
So you're saying that, for now, Xi Jinping has not felt the AGI, but it feels like the DeepSeek moment might mean there are meetings going on now where he's going to start wearing the same "feel the AGI" t-shirt, and things escalate. I mean, he may have woken up last week: Liang Wenfeng met the second-in-command, they had a meeting, and then the next day they announced the AI subsidies, the trillion RMB. So it's possible this DeepSeek moment is truly the beginning of a cold war. That's what a lot of people are worried about; people in AI have been worried that this is going towards a cold war, or already is one. It's not DeepSeek's fault, but a bunch of factors came together into an explosion — it all has to do with the stock market going down, probably, and some mass hysteria — that eventually led to Xi Jinping having meetings and waking up to this idea. And the US government realized this on October 7th, 2022, before ChatGPT released: that October 7th restriction, which dropped and shocked everyone, was very clearly aimed at AI. Everyone was like, what the heck are you doing? Stable Diffusion was out then, but not ChatGPT. So there were starting to be rumblings about what generative AI could do to society, but it was very clear, I think, at least to the National Security Council and those sorts of folks, that this was where the world was headed — this cold war that's happening. So is there any concern that the export controls push China to take military action on Taiwan?
This is the big risk. The further you push China away from having access to cutting-edge American and global technologies, the more likely they are to say: well, because I can't access it, no one should access it. And there are a few interesting aspects of that. China has an urban-rural divide like no other, and a male-female birth ratio like no other — to the point where, if you look across most of China, the ratio isn't that bad, but when you look at single dudes in rural China, it's something like a 30-to-1 ratio. And those are disenfranchised dudes — quote-unquote, the US has an incel problem, and China does too; it's just suppressed or crushed down in some way. What do you do with these people? And at the same time, you're not allowed to access the most important technology — at least the US thinks so, and China is maybe starting to think this is the most important technology, judging by the subsidies they're starting to dump into it. They thought EVs and renewables were the most important technology, and they dominate those now. They started thinking about semiconductors in the late 2010s and early 2020s, and they've been dumping money in and catching up rapidly. And they're going to do the same with AI, because they're very talented. So the question is: when does this hit a breaking point? If China sees this as — hey, if not
having access, and if starting a true hot war — taking over Taiwan, or trying to subvert its democracy in some way, or blockading it — hurts the rest of the world far more than it hurts them, this is something they could potentially do. So is this pushing them towards that? Potentially. I'm not really a geopolitics person, but it's obvious that the world regime of peace and trade is super awesome for economics — and at some point it could break. I think we should comment on why the Chinese economy would be hurt by that: they're export-heavy. The United States buys so much; if that goes away, their economy suffers. And they also just would not be able to import raw materials from all over the world — the US would shut down the Strait of Malacca. At the same time, you could argue almost all of the GDP growth in America since the '70s has been either population growth or tech, because your life today is not that much better than someone's in the '80s outside of tech. You still drive cars — but they all have semiconductors in them everywhere; fridges have semiconductors everywhere. There are these funny stories about Russians taking apart laundry machines because they had certain Texas Instruments chips they could repurpose and put into their anti-missile systems, their S-400s or whatever — you would know more about this. Everything about semiconductors is
so integral to every part of our lives. So can you explain the role of TSMC in the story of semiconductors, and maybe also how the United States can break its reliance on TSMC? I don't think it's necessarily about breaking the reliance; I think it's about getting TSMC to build in the US. But taking a step back: TSMC produces most of the world's chips, especially on the foundry side. There are a lot of companies that build their own chips — Samsung, Intel, STMicroelectronics, Texas Instruments, Analog Devices, NXP, all these kinds of companies — but more and more of them are outsourcing to TSMC, and have been for multiple decades. Can you explain the supply chain there, and where most of TSMC's manufacturing sits? Sure. Historically, the supply chain was that companies would build their own chips: a company would get started, design the chip, build the chip, and sell it. Over time this became really difficult,
because the cost of building a fab continues to compound every single generation. Of course, figuring out the technology is incredibly difficult regardless — and really hard to get right, by the way: Intel's failing, Samsung's failing, etc. — but even ignoring technical capability, just the dollars required to build the next-generation fab keep growing. Moore's law halves the cost of chips every two years; there's a separate law that roughly doubles the cost of fabs every handful of years. So a leading-edge fab that's going to be profitable today, building 3 nanometer chips — or 2 nanometer in the future — costs north of $30-40 billion. And that's just for a token amount; that's the base building block, and you probably need to build multiple.
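That "separate law" is often called Moore's second law or Rock's law. A compounding sketch shows how you get to today's numbers; the starting cost and doubling period below are illustrative assumptions, not quoted figures.

```python
# Compounding sketch of "the cost of a leading-edge fab doubles every handful
# of years" (Moore's second law / Rock's law).
# Starting cost and doubling period are illustrative assumptions.

base_year, base_cost_bn = 2004, 3.0   # assumed leading-edge fab cost then
doubling_years = 6                    # "every handful of years"

for year in range(base_year, 2029, doubling_years):
    cost = base_cost_bn * 2 ** ((year - base_year) / doubling_years)
    print(f"{year}: ~${cost:.0f}B")
# 2004: $3B, 2010: $6B, 2016: $12B, 2022: $24B, 2028: $48B --
# consistent with the "north of $30-40B" quoted for a 2-3nm fab today
```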
So when you look at the industry, going back 20 or 30 years, there were 20 to 30 companies that could build the most advanced chips, and they would design them themselves and sell them. Companies like AMD would build their own chips; Intel, of course, still builds their own chips — they're very famous for it; IBM would build their own chips; and you could keep going down the list. All these companies built their own chips, and slowly they kept dropping like flies. That's because of what TSMC did: they created the foundry business model — "I'm not going to design any chips; I'm just going to contract-manufacture chips for other people." One of their early customers was Nvidia. Nvidia is the only semiconductor company doing more than a billion dollars of revenue that was started in the era of foundries; every other company started before then and at some point had fabs, which is actually incredible. AMD and Intel and Broadcom — everyone had fabs at some point. Some, like Broadcom, were a merger or amalgamation of various companies that rolled up, but even today Broadcom has fabs: they build iPhone RF radio chips in Colorado for Apple. For most of these companies, the fabs were thrown away, sold off, or rolled into something else, and now everyone relies on TSMC — including Intel: their latest PC chip uses TSMC chips. It also uses some Intel chips, but it uses TSMC process.
Can you explain why the foundry model is so successful for these companies? Is it economies of scale? Yeah. Like I mentioned, the cost of building a fab is so high, and the R&D is so difficult. When you look at the companies that had their own vertical stack, there was an antiquated process of being hyper-customized to each specific chip. But as we've gone through the last 50 years of electronics and semiconductors, you need more and more specialization, because Moore's law has died and Dennard scaling has died — chips are not getting better just for free from manufacturing. You have to make real architectural innovations. Google is not just running on Intel CPUs for web serving: they have a YouTube chip, they have TPUs, they have Pixel chips — a wide diversity of chips that generate all the economic value of Google, running all the services and so on. And that's just Google; it's like this across any company in the industry. Cars contain 5,000 chips, 200 different varieties of them, all these random things. A Tesla door handle has two chips — it's ridiculous, and it's a cool door handle; you don't think about it, but it has two penny-cost chips in there. Anyway: as you get more diversity of chips and more required specialization, and the cost of fabs continues to grow, you need someone laser-focused on building the best process technology and making it as flexible as possible. I think you can say it simply: the cost per fab goes up, and if you're a small player that makes a few types of chips, you're not going to have the demand to pay back the cost of the fab. Whereas TSMC can take many different customers and aggregate all that demand into one place, and then they're the only player making enough money from building chips to fund the next fab.
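A toy amortization sketch of that aggregation argument; the fab cost, capacity, and utilization numbers are made up for illustration, not real TSMC figures.

```python
# Why aggregating demand wins: amortizing a fixed fab cost over wafer volume.
# A single-product vertically integrated player vs. a foundry pooling many
# customers. All numbers are illustrative assumptions.

fab_capex_bn = 20.0         # assumed leading-edge fab cost
wafers_per_year = 600_000   # assumed fab capacity at full utilization

def capex_per_wafer(utilization: float, amortization_years: float = 5.0) -> float:
    wafers = wafers_per_year * utilization * amortization_years
    return fab_capex_bn * 1e9 / wafers

print(f"single-product firm at 20% fill: ${capex_per_wafer(0.20):,.0f}/wafer")
print(f"foundry at 95% fill:             ${capex_per_wafer(0.95):,.0f}/wafer")
```

With these assumptions, the foundry's capital cost per wafer is nearly 5x lower, which is the whole business model in one ratio.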
This is kind of why the companies slowly get killed: they have a chip that was profitable ten years ago and is good enough, but the cost to build the next generation goes up. They may try and fail, because they don't have the money to make it work, and then they don't have any chips. Or they build it and it's too expensive. There are so many failure points: you could have one little process — some chemical etch, some plasma etch, some little step — that you didn't engineer right, and now the whole company falls apart; you can't make chips. Super powerful companies like Intel had the wherewithal to weather the storm — they still exist today, even though they really screwed up their manufacturing six or seven years ago. But AMD almost went bankrupt; they had to sell their fabs to Mubadala, of the UAE, and that became a separate company called GlobalFoundries, a foundry firm. And then AMD, on the way back up, was able to focus on making chiplets and a bunch of different chips for different markets, focusing on specific workloads rather than all of these different things. So you get more diversity of chips, you have more companies than ever designing chips, but fewer companies than ever manufacturing them. And this is where TSMC comes in: they've just been the best. They are so good at
it. They're customer-focused; they make it easy for you to fabricate your chips; they take all of that complexity and try to abstract a lot of it away from you. They make good money — not insane money, but good money — and they're able to aggregate all this demand and keep building the next fab, and the next fab, and the next fab. So why is Taiwan so special for TSMC? Why is it happening there, and can it be replicated inside the United States? There are aspects of it where I'd say yes, and aspects where I'd say no. TSMC is way ahead because Morris Chang, a former executive at Texas Instruments, wasn't promoted to CEO and said, screw this, I'm going to go make my own chip company. He went to Taiwan and made TSMC. There's a whole lot more story there — it could have been Texas Instruments; it could have been "TSMC" but as the Texas Semiconductor Manufacturing Company instead — and here we are, sitting in Texas. So that sounds like a human story: he didn't get promoted. And beyond the brilliance of Morris Chang, which I wouldn't underplay, there's also a different level to how this works. In Taiwan, the top percent of graduates — the students who go to the best school, National Taiwan University — the top percent of those all go to work at TSMC. And guess what their pay is?
Their starting pay is like $70,000, $80,000 — which is like starting pay for a good graduate in the US, not the top. The top graduates are making hundreds of thousands of dollars at the Googles and Amazons, and now, I guess, the OpenAIs of the world. So there's a large dichotomy in what the top 1% of the society is doing and where they're headed, because of economic reasons. Intel never paid that well, and it didn't make sense to them. So that's one aspect: where is the best talent going? Second is the work ethic. We like to work — you work a lot, we work a lot — but at the end of the day, what is the amount and nature of work that a fab requires? Fabs are not work-from-home jobs. You go into the fab, and it's grueling work. If there's any amount of vibration — an earthquake happens, vibrating the machines — they're either broken, you've scrapped some of your production, or in many cases they're not calibrated properly anymore. So when there's an earthquake — there have been recent earthquakes — TSMC doesn't call their employees. They just go to the fab. They just show up. The parking lot gets slammed, and people go into the fab and fix it. It's like ants. A hive of ants doesn't
get told by the queen what to do; the ants just know. One person specializes in this one task: you're going to take this one tool, you're the best person in the world at it, and this is what you're going to do for your whole life — this one task in the fab, some special chemistry plus nano-manufacturing on one line of tools that continues to get iterated on. It's a specific plasma etch for removing silicon dioxide; that's all you focus on for your whole career, and it's such a specialized thing. The skills aren't transferable. AI today is awesome because people can pick it up like that; semiconductor manufacturing is very antiquated and difficult. None of the materials are online for people to read easily; the papers are very dense, and it takes a lot of experience to learn. So the barrier to entry is much higher too. So you have all these people who are super specialized, who will work 80 hours a week in a factory, in a fab, and if anything goes wrong, they'll show up in the middle of the night because of some earthquake. Their wife says, there was an earthquake, and he says, great, I'm going to the fab. Would you, as an American, do that? These sorts of things are, I guess, what exemplify why TSMC is so amazing. Now, can you replicate it in the US? Let's not ignore that Intel was the leader
in manufacturing for over 20 years. They brought every technology to market first, besides EUV: strained silicon, high-k metal gates, FinFET — the list goes on and on — technologies that Intel brought to market first, made the most money from, and manufactured at scale first, best, and at the highest profit margins. So we shouldn't say Intel can't do this; it's that the culture has broken. They invested in the wrong things. They said no to the iPhone. They had all these issues — mismanagement of fabs, mismanagement of designs, this lock-up — and at the same time, all these brilliant people, the 50,000 PhDs or masters who have been working on specific chemical or physical processes, on nano-manufacturing processes, for decades in Oregon — they're still there, still producing amazing work. It's just that getting it to the last mile of production, at high yield, where you can manufacture dozens and hundreds of different kinds of chips with a good customer experience, has broken. It's that customer experience. Part of it, people will say, is that Intel was too pompous in the 2000s and 2010s. They just thought they were better than everyone. The tool guys would say, I don't think this is mature enough, and Intel would say, ah, you just don't know — we know. That sort of stuff would happen. So can the US bring leading-edge semiconductor manufacturing to the US? Emphatically, yes. And we are; it's happening. Arizona is
getting better and better as time goes on. TSMC has built roughly 20% of their 5 nanometer capacity in the US. Now, this is nowhere near enough — 20% of capacity in the US is like nothing — and furthermore, it's still dependent on Taiwan existing. There's an important way to separate this out: there's R&D, and there's high-volume manufacturing. Effectively, there are three places in the world doing leading-edge R&D: Hsinchu, Taiwan; Hillsboro, Oregon; and Pyeongtaek, South Korea. Those three places do the leading-edge R&D for the rest of the world's leading-edge semiconductors. Manufacturing can be distributed more globally, but this is where the dichotomy exists: who's actually modifying the process, developing the next generation, improving it? It's Hsinchu, it's Hillsboro, it's Pyeongtaek. It is not the rest of these fabs, like Arizona. Arizona is a paperweight: if Hsinchu disappeared off the face of the planet, within a year or a couple of years Arizona would stop producing too. It's actually pretty critical. One of the things I like to say is, if I had a few missiles, I know exactly where I could cause the most economic damage — and it's not targeting the White House. It's the R&D centers of TSMC, Intel, Samsung, and then some of the memory guys, Micron and SK Hynix, because they define the future evolution of semiconductors, and everything is moving so rapidly that it really is fundamentally about
R&D. And it is all about TSMC. You cannot purchase a vehicle without TSMC chips. You cannot purchase a fridge without TSMC chips. One of the few things you can purchase without them, ironically, is a Texas Instruments graphing calculator, because they actually manufacture in Texas. But outside of that — a laptop, servers, GPUs — none of this stuff can exist without TSMC. And in many cases it's not even the leading-edge, sexy 5 nanometer, 3 nanometer, 2 nanometer chip; oftentimes it's just some boring power IC converting one voltage to another — and it's made at TSMC. This is what China is investing in as well. They can build out this long tail of fabs where the techniques are much more known — you don't have to figure out these problems with EUV. They're investing in this, and then they have large supply for things like the car door handles and the random stuff, and that trickles down into this whole economic discussion as well: they have far more of that capacity than we do, and having supply for things like this is crucial to normal life. So they're starting to invest in high-volume manufacturing, but they're not doing the R&D? They do R&D on their own; they're just way behind. I would say: in 2015, China had a five-year plan where they defined certain goals for 2025, including 80% domestic production of semiconductors. They're not going to hit that, to be clear, but they are,
in certain areas, really, really close. BYD is probably going to be the first company in the world that doesn't have to use TSMC for making chips, because they have their own fabs. Now, they still have to buy some chips from foreign suppliers — for example, around self-driving ADAS capabilities, because those are really high-end — but consider that an internal combustion engine has 40 chips just for controlling flow rates and such, and EVs are even more complicated: all these different power ICs and battery management controllers and so on. They're insourcing all of that, and it's something China has been doing since 2015. So on the trailing edge, they're building enormous capacity. On the leading edge — the 5 nanometer and beyond, where the GPUs are — they're still behind, and the US restrictions are trying to stop them on the latter. But all that's happened is: yes, they've slowed down on 5 nanometer, 3 nanometer, etc., but they've accelerated on the 45 nanometer, 90 nanometer power ICs, analog ICs, the random chip in my keyboard — that kind of stuff. So there's an angle where the US's export-control actions have been so inflammatory in slowing China's progress on the leading edge that China has turned around and accelerated its progress elsewhere, because they know this is so important: if the US is going to lock them out here,
what if they lock us out on the trailing edge as well? So, going back: can the US build it here? Yes, but it's going to take a ton of money. I truly think that to revolutionize and completely insource semiconductors would take a decade and a trillion dollars. Is some of it also culture — the extreme competence, the extreme work ethic in Taiwan? I think if the demand is there and the money is on the line, American companies figure it out. It's going to take hand-holding with the government, but I do think the culture helps TSMC break through; it's easier for them. TSMC has something like 90,000 employees; it's not actually that insane an amount. The Arizona fab has 3,000 people from Taiwan, and these people — their wives were like, yeah, we're not going to have kids unless you sign up for the Arizona fab; we go to Arizona and we have our kids there. There's also a Japan fab where the same thing happened. So these wives drove these dudes to go to Japan or America to have their kids there. There's an element of culture, sure — Taiwan works that hard — but also, the US has done it in the past; they could do it now. And we can just import — I say "import" — the best people in the world, if we want to. That's where the immigration conversation gets tricky, and there's been a lot of debate over it, but it seems absurdly controversial to import the best people in the world. I don't understand why it's controversial. That's
one of the easy ways to win. Sure, we agree with you. And even if you can't import those people, I still think you could do a lot to manufacture most of it in the US if the money's there. It's just way more expensive, and it's not profitable for a long time. That's the context of the CHIPS Act being only about $50 billion, relative to some of the renewable initiatives that were passed in the Inflation Reduction Act and the infrastructure act, which total in the hundreds of billions of dollars. The amount of money the US is spending on the semiconductor industry is nothing, whereas all these other countries have structural advantages, in terms of work ethic and hours worked and things like that, but also the number of STEM graduates and the percentile of their best people going into the field. They also have differences like, hey, there are tax benefits in the law, and they've been in the law for 20 years. And then some countries have massive subsidies. China has something like $200 billion of semiconductor subsidies a year; we're talking about $50 billion in the US over like six years. So the difference in subsidy amounts is also huge. And Trump has been talking about tariffing Taiwan recently. That's one of these things where, okay, maybe he doesn't want to subsidize the US semiconductor industry. Obviously tariffing Taiwan is going to cause a lot of things to get much more expensive, but does it change the equation for TSMC building more fabs in the US? That's what he's positing. So can you lay out, by the way, it's incredible how much you know about so much. We told you, Dylan knows all this stuff. Okay, so you laid out why TSMC is really important. If we look out into the future, 10, 20 years out, the US-China relationship seems like it can go to a
dark place of cold war, escalated cold war, or even hot war, or to a good place of anything from frenemies to cooperation to working together. So in this complicated game-theory game, what are the different trajectories? What should the US be doing? What do you see as the possible trajectories of US-China relations as both leaders start to feel the AGI more and more and see the importance of chips and the importance of AI? I mean, ultimately the export controls are pointing towards a separate-futures economy. I think the US has made it clear to Chinese leaders that we intend to control this technology, at whatever cost to global economic integration. It's hard to unwind that; the card has been played. To the same extent, they've also limited US companies from entering China. So it's been a long time coming. At some point there was a convergence, but over at least the last decade it's been branching further and further apart: US companies can't enter China, Chinese companies can't enter the US, the US is saying, hey, China, you can't get access to our technologies in certain areas, and China is rebutting with the same thing; they've restricted certain materials, like gallium, where they've tried to limit the US. There's a US drone company that's not allowed to buy batteries, and they have military customers, and this drone company just tells the military customers, hey, just get them from Amazon, because I can't actually physically get
them. There are all these things happening that point to further and further divergence. I would love it if we could all hold hands and sing Kumbaya, but I have zero idea how that could possibly happen. Is the divergence good or bad for avoiding war? Is it possible that the divergence, in terms of manufacturing chips and training AI systems, is actually good for avoiding military conflict? It's an objective fact that the world has been at its most peaceful when there are global hegemons, or regional hegemons, in the historical context. The Mediterranean was at its most peaceful ever when the Romans were there. China had very peaceful times and warring times, and the peaceful times were when dynasties had a lockhold over not just themselves but all their tributaries around them. Likewise, the most peaceful time in human history has been when the US was the global hegemon, the last handful of decades. Now we've seen things start to slide, with Russia-Ukraine, with what's going on in the Middle East, and the Taiwan risk; all these different things are starting to bubble up. Still, it's objectively extremely peaceful. Now, what happens when it's not one global hegemon but two? Obviously China will be competitive, or may even overtake the US; it's possible. And this change in global hegemony, I don't think it ever happens super peacefully. When empires fall, which is a possible trajectory for America, they don't fall gracefully. They don't just slide out of irrelevance; usually there's a lot of shaking. So what the US is trying to do is maintain its top position, and what China is trying to do is become the top position, and obviously there's butting of heads here, in the most simple terms. And that could take shape in all kinds of ways, including proxy wars. It seems like it's already happening. As much as I want there to be centuries of prolonged peace, it looks like further international instability is ahead. And the US's current task is, hey, if we control AI, if we're the leader in AI, and AI significantly accelerates progress, then we can maintain the global hegemony position. As an American, I hope that works; I guess that's going to lead to peace for us. Now, obviously other people around the world get affected negatively, and obviously the Chinese people are not going to be in as advantageous a position if that
happens. But this is the reality of what's being done and the actions that are being carried out. So can we go back to the specific details of the different hardware? There's this nice graphic in the export controls of which GPUs are allowed to be exported and which are not. Can you explain the difference? From a technical perspective, are the H20s promising? Yeah. So, and I think we'd have to dive really deep into the reasoning aspect and what's going on there, but the US has gone through multiple iterations of the export controls. The H800 was at one point allowed, back in '23, but then it got banned, and by then DeepSeek had already built their cluster of, they claimed, 2,000 of those; I think they actually have many more, something like 10,000. And now this H20 is the legally allowed chip. Nvidia shipped a million of these last year to China; for context, Nvidia's total was something like four or five million GPUs. So the percentage of GPUs that were this China-specific H20 is quite high, roughly 20%, 25%. And this H20 has been neutered in one way, but it's actually upgraded in other ways. You can think of chips along three axes for AI, ignoring software stack and exact architecture, just raw specifications: there's floating point operations, flops; there's memory bandwidth and memory capacity, i.e., memory I/O; and then there is interconnect,
chip-to-chip interconnections. All three of these are incredibly important for making AI systems, because AI systems involve a lot of compute and a lot of moving memory around, whether to memory or to other chips. Of these three vectors, the US initially had two of them controlled, flops and interconnect bandwidth, and then they said, no, no, no, we're going to remove the interconnect bandwidth restriction and make it very simple: only flops. But now Nvidia can make a chip that is cut down on flops; it's like one-third that of the H100 on spec-sheet paper performance for flops, and in the real world it's closer to half, or maybe even 60% of it. But on the other two vectors it's just as good for interconnect bandwidth, and for memory bandwidth and memory capacity the H20 actually has more memory bandwidth and more memory capacity than the H100.
Now, recently, in our research we cut our estimate of Nvidia's H20 production for this year down drastically. They were going to make another two million of them this year, but they just canceled all the orders a couple of weeks ago. In our view, that's because they think they're going to get restricted. Because why would they cancel all these orders for the H20? They shipped a million of them last year, they had orders in for a couple million this year, and it's just gone, for the H20 and the B20, a successor to the H20. Now why would they do this? I think it's very clear: the H20 is actually better for certain tasks, and that certain task is reasoning. Reasoning is incredibly different. When you look at the different regimes of models, pre-training is all about flops. There are things you do, like mixture of experts, which we talked about, to trade off interconnect, or to trade off other aspects and lower the flops and
rely more on interconnect and memory. But at the end of the day, flops are everything. We talk about models in terms of how many flops they are, so we talk about, oh, GPT-4 is 2e25: 2 times 10 to the 25th, 25 zeros of floating point operations, for training. And the US has an executive order, which Trump recently unsigned, which was, hey, at 1e26, once you hit that number of floating point operations, you must notify the government, and you must share your results with us. There's a level of model where the US government must be told, and that's 1e26. So as we move forward, this is incredibly important. Flops is the vector the government has cared about historically, but the other two vectors are arguably just as important, especially as we come to this new paradigm, which the world is only just learning about over the last six months: reasoning.
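To make those flop counts concrete, here is a quick back-of-envelope using the standard ~6 x parameters x tokens approximation for training compute. The Llama 3.1 405B figures are Meta's published ones; everything here is illustrative arithmetic, not anyone's official accounting.

```python
def training_flops(params: float, tokens: float) -> float:
    # common approximation: ~6 FLOPs per parameter per training token
    # (forward pass + backward pass combined)
    return 6 * params * tokens

# Llama 3.1 405B: 405B parameters, ~15.6T training tokens (Meta's reported numbers)
flops = training_flops(405e9, 15.6e12)
print(f"{flops:.1e} FLOP")   # ~3.8e25, roughly double GPT-4's rumored 2e25
print(flops >= 1e26)         # False: still under the 1e26 reporting threshold
```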
And do we understand firmly which of the three dimensions is best for reasoning? So, interconnect? The flops don't matter as much? Is it memory? Memory. Context length. We're going to get into technical stuff real fast. There are two articles in this that I could show, maybe graphics that might be interesting to pull up. For the listeners, we're looking at the section on o1 inference architecture tokenomics. How do you want to explain the KV cache before we talk about this? Yeah, we need to go through a lot of specific technical things about Transformers to make this easy for people, because it's incredibly important; this changes how models work. But resetting: why is memory so important? It's because so far we've talked about parameter counts, and with mixture of experts you can change how many active parameters versus total parameters there are, to embed more data but have fewer flops. But more important, another aspect of this humongous revolution of the last handful of years is the Transformer and the attention mechanism. The attention mechanism is how the model understands the relationships between all the words in its context, and that is separate from the parameters themselves. It's something you must calculate: how each token, each word in the context length, relates to every other one. And I think, Nathan, you should explain the KV cache better. The KV cache is one of the optimizations. Yeah, so the
attention operator has three core things: queries, keys, and values. QKV is what goes into it; you'll look at the equation and see these matrices multiplied together. The words query, key, and value come from information retrieval backgrounds, where the query is the thing you're trying to retrieve the values for, and you access the keys. My background's not in information retrieval, but it's fun to have these backlinks. What effectively happens is that when you're doing these matrix multiplications, you have matrices that are the size of the context length, the number of tokens you put into the model, and the KV cache is effectively a form of compressed representation of all the previous tokens in the model. So when you're doing this, and we're talking about autoregressive models, which predict one token at a time, you start with whatever your prompt was, you ask a question like 'who was the president in 1825,' and the model then generates its first token. For each of these tokens you're doing the same attention operation, multiplying these query, key, and value matrices, but the math works out nicely so that when you're doing this repeatedly, you can keep appending the new keys and values to the KV cache. You keep track of the previous values you're inferring over in this autoregressive chain, and you keep it in memory the whole time. This is a really crucial thing to manage when serving inference at scale. There are far bigger experts on this, and there are so many levels of detail you can go into.
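Here is a minimal sketch of that append-as-you-go pattern for a single attention head, with toy dimensions and random stand-ins for trained weights; nothing here is from any real serving stack.

```python
import numpy as np

d = 64                      # head dimension (toy value)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))  # stand-ins for trained weights

K_cache = np.zeros((0, d))  # grows by one row per token: linear in context length
V_cache = np.zeros((0, d))

def attend_one_token(x):
    """Attend one new token embedding x over everything cached so far."""
    global K_cache, V_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])    # append instead of recomputing the past
    V_cache = np.vstack([V_cache, v])
    scores = (K_cache @ q) / np.sqrt(d)  # one new row of attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over all previous tokens
    return weights @ V_cache             # weighted mix of cached values

for _ in range(5):                       # the autoregressive loop: one token at a time
    out = attend_one_token(np.random.randn(d))
```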
Essentially, one of the key quote-unquote drawbacks of the attention operator and the Transformer is that there is a form of quadratic memory cost in proportion to the context length. As you put in longer questions, the memory used to make that computation goes up quadratically. You'll hear about a lot of other language model architectures that are subquadratic or linear attention forms, like state space models; we don't need to go down all of those now. And then there are innovations on attention to make this memory usage, and the ability to attend over long contexts, much more accurate and high performance. Those innovations help with the memory constraint and with performance. So if you put a book into, I think, Gemini, the model with the longest context length that people are using (Gemini is known for 1 million and now 2 million tokens of context), sometimes it'll draw facts out of it. It's not perfect, but they're getting better. There are two things here: one, to be able to serve this at the memory level; Google has magic with their TPU stack where they can serve really long contexts. And then there are also many decisions along the way, subtle changes to these computations in attention, changes to the architecture and the data, to actually make long-context performance work. But serving long contexts is extremely memory constrained, especially when you're making a lot of predictions. I actually don't know why input and output tokens are priced so differently, but I think essentially with output tokens you have to do more computation, because you have to sample from the model.
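As a rough illustration of that quadratic growth, here are toy numbers for the attention score matrices that naive attention materializes (innovations like FlashAttention exist precisely to avoid materializing them); the head count is an assumption.

```python
bytes_fp16 = 2
heads = 32                               # assumed head count; one layer, batch of 1
for seq_len in [1_024, 4_096, 16_384, 65_536]:
    scores = seq_len * seq_len * heads * bytes_fp16   # the (seq x seq) score matrices
    print(f"{seq_len:>6} tokens -> {scores / 1e9:7.2f} GB of attention scores")
# 4x the context means 16x the memory: 0.07 GB, 1.07 GB, 17.18 GB, 274.88 GB
```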
I can explain that. So today, if you use a model through an API, OpenAI charges a certain price per million tokens, and that price for input and output tokens is different. The reason is that when you're inputting a query into the model, let's say you have a book, you must now calculate the entire KV cache for it, this key-value cache. And that is a parallel operation: all of the tokens can be processed at one time, and therefore you can dramatically reduce how much you're spending. The flop requirements for generating a token and for an input token are identical; if I input one token or if I generate one token, I have to go through the model either way. But the difference is that I can do the input, i.e., the prefill, i.e., the prompt, simultaneously, in a batched nature, and therefore it is all flops. I think the pricing most providers use is that input tokens are about a quarter of the price of the output tokens. Correct. Output tokens are so expensive because I can't do them in parallel. It's autoregressive: every time I generate a token, I must not only read the whole model into memory and activate it, calculate it to generate the next token; I also have to read the entire KV cache. Then I generate a token, I append the KV for that one token I generated to the KV cache, and I do it again. So this is a non-parallel operation, whereas in the case of prefill, or the prompt, you pull the whole model in and you calculate 20,000 tokens at once. These are features that APIs are shipping, like prompt caching and prefilling, because you can drive prices down and make APIs much faster. If you run a business and you're going to keep passing the same initial content to Claude's API, you can load that into Anthropic's API and always keep it there.
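A toy way to see the asymmetry; model_pass below is a made-up stand-in for one forward pass, and the point is just the count of sequential passes.

```python
def model_pass(tokens):
    """Stand-in for one forward pass; the real cost grows with len(tokens)."""
    return len(tokens)

prompt = ["tok"] * 20_000          # a long document

# Prefill: every prompt token is processed in one parallel, batched pass.
sequential_passes = 1
model_pass(prompt)

# Decode: token t+1 depends on token t, so each output token needs its own pass,
# plus a full read of the weights and the ever-growing KV cache.
output = []
for _ in range(1_000):
    model_pass(prompt + output)    # conceptually; the KV cache avoids recomputation
    output.append("tok")
    sequential_passes += 1

print(sequential_passes)           # 1,001 sequential steps vs 1 for the whole prompt
```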
But it's very different when we get to the reasoning models, which we showed an example of earlier, reading some of that mumbling stuff. What happens is that the output context length is so much higher. I learned a lot about this from Dylan's work, which is essentially: as the output length gets higher, you're paying this quadratic cost in terms of memory used, and then on the GPUs we have, effectively, you're going to run out of memory. And they're all trying to serve multiple requests at once, doing this batch processing where not all of the prompts are exactly the same; it's really complex handling. And as context lengths get longer, there's this, I think you call it a critical batch size, where your ability to serve more users, how much you can parallelize your inference, plummets because of this long context. So your memory usage is going way up with these reasoning models, and you still have a lot of users,
so effectively the cost to serve multiplies by a ton. We're looking at a plot where the x-axis is sequence length, i.e., how many tokens are being generated plus the prompt. So if I put in a book, that's a million tokens; if I put in 'the sky is blue,' that's like six tokens or whatever. We should say that what we're calling reasoning, chain of thought, is extending this sequence length, and it's mostly output. So before, three months ago, whenever o1 launched, all of the use cases for long context length were 'let me put a ton of documents in and then get an answer out': a single prefill, compute a lot in parallel, and then output a little bit. Now, with reasoning and agents, this is a very different idea. Now I might only have, 'hey, do this task,' or I might have all these documents, but at the end of the day the model is not just producing a little bit; it's producing tons of information. This chain of thought just continues to go and go and go, and so the sequence length is effectively that: if it's generated 10,000 tokens, it's 10,000 sequence length, plus whatever you put in the prompt. And what this chart is showing, and it's a logarithmic chart, is that as you grow from 1K to 4K, or 4K to 16K, the memory requirements for your KV cache grow so fast that you end up capped on sequence length or on the number of users you can serve.
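To put rough numbers on that, here is the standard KV-cache arithmetic, using Llama 3.1 405B's public configuration (126 layers, 8 KV heads via GQA, head dimension 128) and a bf16 cache. These are illustrative figures, not the chart's exact methodology.

```python
layers, kv_heads, head_dim, bytes_bf16 = 126, 8, 128, 2

# keys + values, for every layer, for every cached token
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_bf16
print(f"{kv_per_token / 1e6:.2f} MB per token")   # ~0.52 MB

batch = 64
for seq_len in [1_024, 4_096, 16_384]:
    total = batch * seq_len * kv_per_token
    print(f"batch {batch} x {seq_len:>6} tokens -> {total / 1e9:6.1f} GB of KV cache")
# ~34 GB, ~135 GB, ~541 GB of cache alone, while a typical 8x80 GB H100 node has
# 640 GB of HBM total, before you even load the model weights. That's the wall.
```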
So this is showing for a 405B model, at batch size 64. Llama 3.1 405B, yeah. And batch size is crucial: you want a higher batch size to parallelize your throughput, 64 different users at once, and therefore your serving costs are lower, because the server costs the same. This is 8 H100s, roughly $2 an hour per GPU; that's $16 an hour, which is somewhat of a fixed cost. You can do things to make it lower, of course, but it's like $16 an hour. Now, how many users can you serve, how many tokens can you generate, and then you divide the two and that's your cost. With reasoning models, this is where a lot of the complexity comes about, and why memory is so important: if you have limited amounts of memory, then you can't serve as many users, and your serving speeds get lower, so your costs get a lot worse.
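That 'divide the two' is worth writing out. All of these numbers are illustrative, just to show how throughput drives cost per token.

```python
server_cost_per_hour = 8 * 2.0      # 8 H100s at ~$2/hr each: $16/hr, roughly fixed

# high-batch chat serving vs. memory-bound reasoning serving (made-up throughputs)
for tokens_per_second in [10_000, 1_000]:
    tokens_per_hour = tokens_per_second * 3600
    cost_per_million = server_cost_per_hour / tokens_per_hour * 1e6
    print(f"{tokens_per_second:>6} tok/s -> ${cost_per_million:5.2f} per million tokens")
# $0.44/M vs $4.44/M: a 10x drop in throughput is a 10x rise in cost per token.
```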
Because all of a sudden, if I was used to, hey, on this $16-an-hour server I'm serving Llama 405B, or I'm serving DeepSeek V3, and it's all chat-style applications, i.e., we're just chatting, the sequence lengths are a few thousand. When you use a language model, it's a few thousand tokens of context most of the time; sometimes you drop in a big document, but then you process it, you get your answer, you throw it away, and you move on to the next thing. Whereas with reasoning, I'm now generating tens of thousands of tokens in sequence, so this KV cache has to stay resident: you have to keep loading it, you have to keep it in memory constantly, and now this bumps out other users. If there's a reasoning task, and the model is capable of reasoning, then all of a sudden that memory pressure means I can't serve as many users simultaneously. Let's go into DeepSeek again. So we're in the post-
DeepSeek-R1 time, I think, and there are two sides to this market watching how hard it is to serve. On one side, we're going to talk about DeepSeek themselves. They now have a chat app that got to number one on the App Store. Disclaimer: number one on the App Store is measured by velocity, so it's not necessarily saying that more people have the DeepSeek app than the ChatGPT app, but it is still remarkable. Claude has never hit number one on the App Store, even though everyone in San Francisco is like, oh my god, you've got to use Claude, don't use ChatGPT. So DeepSeek hit this. They also launched an API product recently where you can ping their API and get these super long responses for R1. At the same time as these are out (we'll get to what's happened to them), because the model weights for DeepSeek R1 are openly available and the license is very friendly, the MIT license, commercially available, all of these mid-size companies and big companies are trying to be first to serve R1 to their users. We are trying to evaluate R1 because we have really similar research going on; we released a model and we're trying to compare to it. And out of all the companies that are quote-unquote serving R1, at prices that are way higher than the DeepSeek API, most of them barely work and the throughput is really low. To give context: one part of the freakout was that China reached capabilities; the other aspect is that they did it so cheap. The 'so cheap' part we kind of talked about on the training side;
why is it so cheap on the inference side, too? It works well and it's cheap. Why is R1 so damn cheap? I think there are a couple of factors here. One is that they do have model architecture innovations. This MLA, this new attention that they've done, is different from the attention in 'Attention Is All You Need,' the original Transformer attention. Now, others have already innovated here; there's a lot of work like MQA, GQA, local-global, all these different innovations that try to bend the curve. It's still quadratic, but the constant is now smaller. Related to our previous discussion, this multi-head latent attention can save about 80 to 90% of the memory used by the attention mechanism, which helps especially at long context. It's 80 to 90% versus the original, but versus what people are actually doing, it's still an innovation. And this 80 to 90% doesn't mean the whole model is 80 to 90% cheaper, just this one part of it. And not just that: other people have implemented techniques like local-global sliding window and GQA/MQA too.
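Rough per-token, per-layer KV cache sizes for the different attention variants, using DeepSeek V3-ish dimensions from the paper (128 heads, head dimension 128; for MLA, a 512-dim compressed latent plus a 64-dim decoupled RoPE key). Counted in elements, and approximate.

```python
heads, head_dim = 128, 128

mha = 2 * heads * head_dim   # full multi-head attention: every key and value cached
gqa = 2 * 8 * head_dim       # grouped-query attention with 8 KV heads
mla = 512 + 64               # MLA: one compressed latent + one shared RoPE key

print(mha, gqa, mla)                               # 32768, 2048, 576
print(f"MLA vs MHA: {1 - mla / mha:.1%} smaller")  # ~98.2%
print(f"MLA vs GQA: {1 - mla / gqa:.1%} smaller")  # ~71.9%
```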
But anyway, DeepSeek's attention mechanism is a true architectural innovation; they did tons of experimentation, and this dramatically reduces the memory pressure. It's still there: it's still attention, it's still quadratic, it's just dramatically reduced relative to prior forms. All right, so that's the memory pressure. I should say, in case people don't know, R1 is 27 times cheaper than o1. We think OpenAI had a large margin built in. So there are multiple factors; we should break down the factors. I think it's two bucks per million
tokens of output for R1, and $60 per million tokens of output for o1. Yeah, let's look at this. I think this is very important. There's that drastic gap between DeepSeek's and OpenAI's pricing, but DeepSeek is offering the same model (because they open-weighted it to everyone else) for a much lower price than what others are able to serve it for. So there are two factors here. Their model is cheaper; it is 27 times cheaper, though I don't remember the number exactly off the top of my head. We're looking at a graphic showing different places serving V3, DeepSeek V3, which is similar to DeepSeek R1, and there's a vast difference in serving cost. What explains that difference? Part of it is that OpenAI has a fantastic margin. When they're doing inference, their gross margins are north of 75%. So that's a 4 to 5x factor right there of the cost difference: OpenAI is just making crazy amounts of money because they're the only one with the capability. Do they need that money? Are they using it for R&D? They're losing money, obviously, as a company, because they spend so much on training. So the inference itself is very high margin, but it doesn't recoup the cost of everything else they're doing. So yes, they need that money, because the revenue and margins pay for continuing to build the next thing, alongside raising more money.
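Writing out that back-of-envelope with the rough numbers from this discussion; all of them are approximations the speakers are quoting, not audited figures.

```python
r1_price, o1_price = 2.0, 60.0      # $ per million output tokens, roughly
print(f"headline gap: {o1_price / r1_price:.0f}x")           # ~30x

margin = 0.75                       # OpenAI's assumed serving gross margin
o1_cost = o1_price * (1 - margin)   # implied cost to serve: $15/M
print(f"margin alone explains: {o1_price / o1_cost:.0f}x")   # 4x
print(f"gap left after margin: {o1_cost / r1_price:.1f}x")   # ~7.5x
# The remaining multiple is what architecture (MLA and friends) and serving
# efficiency, and possibly under-pricing, have to account for.
```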
So the suggestion is that DeepSeek is really bleeding out money? Well, here's one thing (we'll get to this in a second): DeepSeek doesn't have any capacity to actually serve the model. They stopped signups; the ability to use it is basically non-existent now for most people, because so many people are trying to use it and they just don't have the GPUs to serve it. OpenAI has hundreds of thousands of GPUs between them and Microsoft to serve their models; DeepSeek has a factor much lower. Even if you believe our research, which is 50,000 GPUs, and a portion of those are for research and a portion are for the hedge fund, they still have nowhere close to the GPU volumes and capacity to serve the model at scale. So it is cheaper, and a part of that is OpenAI making a ton of money. Is DeepSeek making money on their API? Unknown; I don't actually think so. And part of that is this chart: look at all the other providers. Together AI, Fireworks AI, these are very high-end companies: Fireworks is ex-Meta, and Together AI has Tri Dao, the inventor of FlashAttention, which is a huge efficiency technique. These are very efficient, good companies, and I do know those companies make money, not tons of money on inference, but they make money, and they're serving at like a 5 to 7x difference in cost. So when you add it up: OpenAI making tons of money is like a 5x difference, the companies that are trying to make money on this model are at like a 5x difference, and there is still a gap. There's still a gap, and that is just DeepSeek being really freaking good: the model architecture, MLA, the way they did all these things, there's legitimate efficiency difference there. All their low-level libraries that we talked about in training, some of them probably translate to inference, and those weren't released. So we may go a bit into conspiracy land: is it possible the Chinese government is subsidizing DeepSeek? I actually don't think they are. When you look at the Chinese labs: Huawei has a lab, there's Moonshot AI, and there are a couple of other
labs out there that are really close with the government, and then there are labs like Alibaba and DeepSeek which are not close with the government. And we talked about this CEO, this revered figure, who's quite different, who sounds awesome, and who has very different viewpoints (based on the translated Chinese interviews) than what the CCP might necessarily want. Now, to be clear: does he run it as a loss leader because he can fund it through his hedge fund? Yeah, sure. So the hedge fund might be subsidizing it. Yes, I mean, they absolutely did, because DeepSeek has not raised much money. They're now trying to raise a round in China, but they have not raised money historically; it's all just been funded by the hedge fund, and he owns over half the company, like 50 to 60% of the company is owned by him. In some of the interviews there's discussion of how doing this is a recruiting tool. You see this at the American companies too: having GPUs, recruiting tool; being at the cutting edge of AI, recruiting tool. And open-sourcing. They got so much talent; they were so far behind, and they got so much talent because they just open-sourced stuff. More conspiracy thoughts: is it possible, since they're a hedge fund, that they timed everything with this release and the pricing, that they shorted Nvidia stock and the stock of US companies, and released it with just perfect timing to be able to make money? They released it on Inauguration Day; they know what's on the international calendar. I mean, if you listen to their motivations for AI: they released V3 on December 26th. Who releases the day after Christmas? No one looks at it. They had released the papers before this, the V3 paper and the R1 paper, so people had been looking at them and going, wow. And then they just released the R1 model. I think they're just shipping as fast as they can, and who cares about Christmas, who cares about, you know, get it out before Chinese New Year, which obviously just happened. I don't think they
actually were timing the market or trying to make the biggest splash possible. I think they're just shipping. I think that's one of their big advantages. We know that a lot of the American companies are very invested in safety, and that is the central culture of a place like Anthropic (and I think Anthropic sounds like a wonderful place to work), but if safety is your number one goal, it takes way longer to get artifacts out. That's why Anthropic is not open-sourcing things, by their own claims. But there are reviews internally; Anthropic mentions things to international governments; there's been news of how Anthropic has done pre-release testing with the UK AI Safety Institute. All of these things add inertia to the process of getting things out, and we're on a trend line where the rate of progress is very high. So if you reduce the time from when your model is done training (you run evals, that's good) to when you get it out, you maximize the perceived quality of your outputs. DeepSeek does this so well. Dario explicitly said Claude 3.5 Sonnet was trained 9 to 10 months ago, and I think it took them another handful of months to release it. So there is a significant gap here, especially with reasoning models. The word on the San Francisco street is that Anthropic has a better model than o3, and they won't release it. Why? Because chains of thought are scary, and they are legitimately scary. If you look at R1, it flips back and forth between Chinese and English, sometimes it's gibberish, and then the right answer comes
out. And for you and me it's like, great. This is why people are infatuated: you're telling me this is a high-value thing, and it works, and it's doing this? It's amazing. I mean, you talked about that chain of thought for the philosophical question, which is not something they trained it to be philosophically good at; it's just an artifact of the chain-of-thought training. But that's super important in this sense: can I inspect your mind and what you're thinking right now? No. So I don't know if you're lying to my face, and chain-of-thought models are a way to see that. This is a true quote-unquote risk: between a chat application, where, hey, I ask the model to say bad words or how to make anthrax and it tells me (that's unsafe, sure, but it's something I can get out relatively easily), versus: what if I tell the AI to do a task, and then it does the task all of a sudden, randomly, in a way that I don't want? A task is very different from a response. So the bar for safety is much higher, at least in Anthropic's case, whereas DeepSeek is like: ship it. Yeah. So, I mean, the bar for safety is probably lowered a bit because of DeepSeek. There are parallels here to the space race. The reason the Soviets probably put a man in space first is because their approach to safety was, well, the bar for safety was lower. They killed that dog, and all these things. They were less risk-averse than the US space program, and there are parallels here: there's probably going to be downward pressure on that safety bar for the US companies. This is the situation that Dario wants to avoid. Dario talks about the difference between the race to the bottom and the race to the top, and the race to the top is where there's a very high standard on safety, there's a
very high standard on how your model performs on certain crucial evaluations, and if enough companies hold to it, they will converge. This is the idea. And ultimately, AI is not confined to one nationality or to one set of morals for what it should mean, and there are a lot of arguments about whether we should stop open-sourcing models. If the US stops, it's pretty clear (I mean, it's way easier to see now, with DeepSeek) that a different international body will be the one that builds it. We talked about the cost of training: DeepSeek has this shocking $5 million number. Think about how many entities in the world can afford a hundred times that, to have the best open-source model that people use in the world. It's a scary reality, which is that these open models are probably going to keep coming for the time being, whether or not we want to stop them. Stopping them might make it even worse and harder to prepare for; it just means that the preparation, and understanding what AI can do, is that much more important. That's why I'm here, at the end of the day. But letting that sink in for people, especially those not in AI: this is coming. There are some structural things in a globally interconnected world that you have to accept. Yeah. You sent me something that Mark Zuckerberg mentioned on the earnings call. He said: 'I think in light of some of the recent news, the new competitor DeepSeek from China, one of the things we're talking about is that there's going to be an open-source standard globally, and I think for our kind of national advantage it's important that it's an American standard. So we take that seriously. We want to build the AI system that people around the world are using, and I think that, if anything, some of the recent news has only strengthened our conviction that this is the right thing to be focused on.' So, yeah, open sourcing. Mark Zuckerberg is not new to having American values in how he presents his company's trajectory. I think Meta's products have long since been banned in China, and I respect him saying
it directly. And there's an interesting aspect here: just because it's open weights or open source doesn't mean it can't be subverted. There have been many open-source software bugs; for example, there was a Linux bug, found after years, that was clearly a backdoor, a recent one, discovered because somebody asked, why is this taking half a second to load? And it was, oh crap, there's a backdoor here, that's why. This is very much possible with AI models. Today, the alignment of these models is very visible: I'm not going to say bad words, I'm not going to teach you how to make anthrax, I'm not going to talk about Tiananmen Square, I'm going to say Taiwan is just an eastern province. All these things depend on who you are and what you're aligning the model for. Even xAI is aligned a certain way: not aligned in the woke sense, but there are certain things imbued within the model. Now, when you release this publicly as an instruct model with open weights, this can then proliferate. But as these systems get more and more capable, what you can embed deep down in the model is not as clear. And that is one of the big fears:
if an American model or a Chinese model is the top model, you're going to embed things that are unclear. It could be unintentional too. British English is dead, because American LLMs won: the internet is American, and therefore 'color' is spelled the way Americans spell it. This is just the factual nature of it now. Like Karpathy said, English is the hottest programming language, and that English is defined by a bunch of companies that are primarily in San Francisco. The right way to spell optimization is with a z, just in case people are wondering; I think it's an s in British English. You can take it as something silly, something as silly as spelling, which Brits and Americans will probably laugh about; I don't think we care that much, though some people will. But this can boil down into very, very important topics, like subverting people. Chatbots: Character AI has shown that they can talk to kids, or adults, and make people feel a certain way, and that's unintentional alignment. But what happens when there's intentional alignment deep down in the open-source standard? It's a backdoor today, like the one for Linux that we discovered, or some encryption system. China uses different encryption than NIST defines, the US NIST, because they think, at least, that there are backdoors in it. What happens
when the models are backdoors not just to computer systems but to our minds? Yeah, they're cultural backdoors. The thing that amplifies the relevance of culture with language models is that we are used to this mode of interacting with people in back-and-forth conversation, and we now have a very powerful computer system that slots into a social context we're used to, and we don't know the extent to which people can be impacted by that. So this is an actual concern with a Chinese company that is providing open-weights models: there could be some secret Chinese government requirement for these models to have a certain kind of backdoor. I don't necessarily think it'll be a backdoor in that sense, because once it's open weights, it doesn't phone home. It's more about whether it recognizes a certain system. It could be a backdoor in the sense of, hey, if you're building something in software, all of a sudden it's a software agent and it programs in this backdoor that only we know about. Or it could be subverting the mind to think that some opinion is the correct one. Anthropic has research on this, where they show that if you put certain phrases in at pre-training, you can then elicit different behavior when you're actually using the model, because they've poisoned the pre-training data.
I don't think, as of now, anybody in a production system is trying to do anything like this. I think Anthropic is doing very direct work on it, and mostly it's subtle things: we don't know how these models are going to generate tokens, what information they're going to represent, or what their complex internal representations are. Well, we're talking about Anthropic, which is generally permeated with good humans trying to do good in the world. We just don't know of any labs (this would be done in a military context) that are explicitly training so that the front door looks like a happy LLM, but underneath it's a thing that will, over time, do the maximum amount of damage to our quote-unquote enemies. There's this very good quote from Sam Altman, who can be a hype beast sometimes, but one of the things he said, and I think I agree, is that superhuman persuasion will happen before superhuman intelligence. If that's the case, then before we get this AGI/ASI stuff, we can embed superhuman persuasion towards an ideal, whatever the ideal of the model is. And again, today I truly don't believe DeepSeek has done this, but it is a sign of what could happen. So one of the dystopian worlds is described by Brave New World: we could just be stuck scrolling Instagram, looking at cute puppies or worse, and talking to bots that are giving us a narrative, and we completely get lost in that world, which is controlled by somebody else, rather than thinking independently. That's a major concern as we rely more and more on these kinds of systems. I mean, we've already
seen this with recommendation systems. Yeah, recommendation systems hack the dopamine-driven reward circuit, but the brain is a lot more complicated. What other circuits, quote-unquote feedback loops, in your brain can you hack or subvert? Recommendation systems are purely trying to increase time on ads and so on, but there are so many more goals that can be achieved through these complicated models. There's no reason, in some number of years, that you can't train a language model to maximize time spent on a chat app. Right now they are trained; I mean, is that not what Character AI has done? Their time per session is like two hours. Yeah, Character AI very likely could be optimizing for this. The way that this data is collected is naive: you're presented a few options and you choose between them. But that's not the only way these models are going to be trained. It's naive stuff, like talking to an anime girl, but yeah, this is a risk. It's a bit of a cliche thing to say, but over the past year I've had a few stretches of time where I didn't use social media or the internet at all, and just read books and was out in nature, and it clearly has an effect on the mind. I feel like I'm returning, of course, I was raised before the internet really took off, but I'm returning to some. I know where you're going. I mean, you can see it physiologically. If I take three days backpacking or something, you're breaking down the addiction cycles. I feel like I'm more in control of my mind there. There's a sovereignty of intelligence that happens when I'm disconnected from the internet. The more I use the internet and social media, the more other people are controlling my mind. That's definitely a feeling, and in the future it will be not other people but algorithms, or other people presented to me via algorithms. There are already tons of AI bots on the internet, and right
now it's not frequent, but every so often I'll reply to one and it instantly replies, and I'm like, crap, that was a bot. That is just going to become more common; they're going to get good. One of the hilarious things about technology over its history is that the illicit adult entertainment industry has always adopted technologies first, whether it was video streaming, or now the independent adult content creators who have their subscription pages. They heavily utilize generative AI; diffusion models and all that are already huge there. But now these subscription-based individual creators use bots to approximate themselves and chat with their whales. People pay a lot for it. A lot of the time it's them, but often there are agencies that do this for these creators, at mass scale, so the largest creators are able to talk to hundreds or thousands of people at
a time because of these bots. So it's already being used there; obviously video streaming and other technologies have gone there first, and it's going to come to the rest of society too. There's a general concern that models get censored by the companies that deploy them. One case where we've seen that (and maybe censorship is one word for it; alignment, via RLHF or some other way, is another) is the black Nazi image generation with Gemini, as you mentioned. We also see it with Chinese models refusing to answer what happened on June 4th, 1989, at Tiananmen Square. So how can this be avoided? Maybe can you, in general, talk about how this happens and how it can be avoided? You gave multiple examples. There are probably a few things to keep in mind here. One is the Tiananmen Square factual-knowledge case: how does a fact get embedded into, or removed from, the models? Two is the Gemini case, what you called the black Nazi incident, which is when Gemini as a system had this extra thing put into it that dramatically changed the behavior. And then three is what most people would call general alignment, RLHF post-training. Each of these has a very different scope in how it's applied. If you just look at the model weights, auditing specific facts is extremely hard, because you have to comb through the pre-training data, and that's terabytes of files, and look for very specific words or hints of the words. So I guess one way to say it is that you can insert censorship
or alignment at various stages in the pipeline, and what you're referring to now is at the very beginning: data selection. So if you want to get rid of facts in a model, you have to do it at every stage; you have to do it at pre-training. Most people think that pre-training is where most of the knowledge is put into the model, and then you can elicit and steer that in different ways, whether through post-training or through systems afterwards. This is where the whole business of jailbreaking models comes from: GPT will not tell you how to make anthrax, but if you try really, really hard, you can eventually get it to tell you about anthrax, because they didn't filter it from the pre-training dataset. By the way, removing facts has such an ominous, dark feel to it. And I almost think it's practically impossible, because you effectively have to remove them from the internet. Did they remove the thing from the subreddits? Hmm. It gets filtered out. So you have quality filters, which are small language models that look at a document and tell you, how good is this text? Is it close to a Wikipedia article, which is a good thing that we want language models to be able to imitate? So couldn't you have a small language model that filters mentions of Tiananmen Square in the data? Yes, but is it going to catch wordplay or encoded language? People have been memeing on games and other stuff, finding ways to say things without saying Tiananmen Square. There are always different ways to get around it.
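A minimal sketch of that kind of pipeline filter; both the quality scorer and the blocklist here are invented stand-ins, and the comment on the blocklist is the whole point.

```python
def quality_score(doc: str) -> float:
    """Stand-in for a small language model scoring 'is this Wikipedia-like text?'"""
    return 0.9 if len(doc.split()) > 50 else 0.2

BLOCKLIST = ["tiananmen"]          # naive keyword filter

def keep(doc: str) -> bool:
    if any(word in doc.lower() for word in BLOCKLIST):
        return False               # misses euphemisms, memes, and encoded language
    return quality_score(doc) > 0.5

raw_corpus = ["some crawled document ...", "another crawled document ..."]
filtered_corpus = [doc for doc in raw_corpus if keep(doc)]
```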
And then there's the fact that the internet as a whole does tend to have a slight left bias, because it's always been richer, more affluent, younger people on the internet relative to the rest of the population, so there is already inherently a slight left bias on the internet. And so how do you filter things that are this complicated? Some of these are factual versus non-factual questions (Tiananmen Square is obviously the factual example), but it gets a lot harder when you're talking about aligning to an ideal. Grok, for example: Elon has tried really hard to make the model not be super PC and woke, but the best way to do pre-training is to throw the whole freaking internet at it and figure the rest out later. At the end of the day, the model at its core still has some of these ideals. You still ingested Reddit r/politics, which is probably the largest political discussion board in the world that's freely available to scrape, and guess what? That's left-leaning. There are some aspects like that you just can't censor unless you try really, really, really hard. So the base model will always have some Trump derangement syndrome, because it's trained on so much of it? It'll have the ability to express it. What if there's a wide representation in the data? This is what happens: a lot of what is called post-training is a series of techniques to get the model on the rails of a really specific behavior. And you've also ingested data like Twitter, or Reddit r/The_Donald, which is super pro-Trump, and then you have fascist subreddits, or communist subreddits. The model in pre-training ingests everything; it has no worldview. Now, it does have some skew, because more of the text is skewed a certain way, which is generally slightly left, but also somewhat intellectual; the general internet is a certain way. Mhm. And
then, as Nathan's about to describe eloquently, you can elicit certain things out of it, and there's a lot of history here, so we can go through multiple examples of what happened. Llama 2 was a launch where the phrase 'too much RLHF,' or 'too much safety,' was the whole narrative after Llama 2's chat models were released. The examples are things like: you would ask Llama 2 chat, 'how do you kill a Python process,' and it would say, 'I can't talk about killing because that's a bad thing.' Anyone trying to design an AI model will probably agree that that's just, ah, model, you messed up a bit on the training there. I don't think they meant to do this, but this was in the model weights. Separately, there are things called system prompts, which are, when you're querying a model, a piece of text that is shown to the model but not to the user. A fun example is that your system prompt could be 'talk like a pirate,' so no matter what the user says to the model, it'll respond like a pirate. In practice, what they are is: 'you are a helpful assistant; you should break problems down; if you don't know about something, don't tell them; your date cutoff is this; today's date is this.' It's a lot of really useful context for how to answer a question well. And Anthropic publishes their system prompts, which I think is great, and there's a lot of research that goes into this.
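In the common chat-API message format, that looks something like this; the wording below is invented for illustration, not any lab's actual prompt.

```python
messages = [
    {"role": "system", "content": (
        "You are a helpful assistant. Break problems down step by step. "
        "If you don't know something, say so. Your knowledge cutoff is 2024-06; "
        "today's date is 2025-02-03."
    )},
    {"role": "user", "content": "How do I kill a Python process?"},
]
# The model conditions on both messages; the user only ever sees their own text.
# Swap the system content for "Talk like a pirate" and every reply changes tone.
```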
And one of your previous guests, Amanda Askell, is probably the most knowledgeable person, at least in the combination of execution and sharing; she's the person who should talk about system prompts and the character of models. Yeah, and people should read these system prompts, because you're trying to nudge the model, sometimes through extreme politeness, to be a certain way. And you could use this for bad things. We've done tests like: what if I tell the model to be a dumb model? Which evaluation scores go down? We'll see this behavior where it could sometimes say, 'I'm supposed to be dumb,' and sometimes it doesn't affect, say, math abilities as much, but the quality of anything judged by humans drops through the floor. Let's go back to post-training, specifically RLHF, around Llama 2. It was too much: too much safety prioritization was baked into the model weights. This makes the model refuse things in a really annoying way for users; it's not great. It caused a lot of awareness to be attached to RLHF, that it makes the models dumb, and it stigmatized the word in AI culture. As the techniques have evolved, that's no longer the case: all of these labs have very fine-grained control over what they get out of the models through techniques like RLHF. Although different labs are at definitely different levels. On one end of the spectrum is Google; then maybe OpenAI does less, and Anthropic does less; and on the other end of the spectrum is xAI. But they all have different forms of RLHF trying to make the models a certain way. And the important thing to say is that no matter how you want the model to behave, these RLHF and preference-tuning techniques also improve performance. On things like math evals and code evals, there is something innate to these what-are-called contrastive loss functions (we could start to get into RL here; we don't really need to), but RLHF also boosts performance on anything from a chat task to a math problem to a code problem. So it is becoming a much more useful tool to these labs.
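For the curious, the 'contrastive' objective behind most reward models in the RLHF pipeline is a pairwise, Bradley-Terry-style loss. Here is a minimal standalone sketch of just the loss:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): minimized as the reward model learns
    # to score the human-preferred completion above the rejected one.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, -1.0))   # ~0.05: already ranked correctly
print(preference_loss(-1.0, 2.0))   # ~3.05: ranking is backwards, large gradient
```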
So this takes us through the arc: we've talked about pre-training and that whole side of things, and we've talked about post-training and how you can mess post-training up. It's a complex, multifaceted optimization with 10-to-100-person teams converging on one artifact, and it's really easy not to do it perfectly. And then there's the third case, which is what we talked about with Gemini. The thing about Gemini is that it was a served product: Google has their internal model weights, they've done all the processes we talked about, and in the served product, what came out was that they had a prompt rewriting user queries to boost diversity or something, and that just made the outputs blatantly wrong. It was some sort of organizational failure that put that prompt in that position. I think Google executives have probably owned this; I don't pay attention to that level of detail, but it was a mess-up in execution that led to this ridiculous thing. At the system level, the model weights might have been fine.
So at the very end of the pipeline there was a rewrite, something like a system prompt. It was the system prompt, or what in industry is called prompt rewriting. Especially for image models: if you're using DALL-E, or ChatGPT can generate you an image, you'll say "draw me a beautiful car." These leading image models benefit from highly descriptive prompts, so what happens if you do that on ChatGPT is that a language model behind the scenes rewrites the prompt, making it more descriptive, and then that is passed to the image model. Prompt rewriting is something that's used at multiple levels of industry, it's used effectively for image models, and the Gemini example is just a failed execution of it.
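Here is a toy sketch of that prompt-rewriting pattern, with both models replaced by stubs; the rewriting instruction and the appended details are invented for illustration, not any product's actual pipeline.

```python
# Sketch of prompt rewriting: a hidden language-model step expands a
# terse image request before the image model sees it. Both models are
# stand-in stubs.

class StubLLM:
    def generate(self, text: str) -> str:
        # A real system would call a language model; this stub just
        # appends descriptive details to whatever request it is given.
        request = text.split(": ", 1)[-1]
        return request + ", golden-hour lighting, 35mm photo, high detail"

class StubImageModel:
    def render(self, prompt: str) -> str:
        return f"<image generated from: {prompt!r}>"

def generate_image(user_prompt: str, llm=StubLLM(), image_model=StubImageModel()) -> str:
    instruction = "Rewrite this image request to be highly descriptive: "
    detailed = llm.generate(instruction + user_prompt)  # hidden from the user
    return image_model.render(detailed)

print(generate_image("draw me a beautiful car"))
```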
Big philosophical question here: with RLHF, to generalize, where is human input, human-in-the-loop, human data, most useful at the current stage? For the past few years, the highest-cost human data has been in these preferences, comparing model outputs. I would say highest cost and highest total usage: a lot of money has gone to these pairwise comparisons, where you have two model outputs and a human is comparing between the two of them. In earlier years there was a lot of instruction-tuning data, creating highly specific examples, something like a Reddit question in a domain you care about. Language models used to struggle on math and code, so you would pay experts in math and code to come up with questions and write detailed answers that were used to train the models. Now there are many model options that are way better than humans at writing detailed and eloquent answers for things like math and code. They talked about this with the Llama 3 release, where they switched to using Llama 3 405B to write their answers for math and code, but in their paper they talk about how they use extensive human preference data, which is something they haven't gotten AIs to replace. There are other techniques in industry, like constitutional AI, where you use human data for preferences and AI for preferences, and I expect the AI part to scale faster than the human part. But among the research we have access to, humans are still in this preference loop.
So as reasoning becomes a bigger and bigger deal, as we said, where's the role of humans in that? It's even less prevalent. The remarkable thing about these reasoning results, and especially the DeepSeek R1 paper, is this result they call DeepSeek R1-Zero: they took one of these pre-trained models, DeepSeek V3 Base, and then did this reinforcement learning optimization on verifiable questions, verifiable rewards, for a lot of questions and a lot of training, and these reasoning behaviors emerge naturally. Things like "wait, let me see," "wait, let me check this," "oh, that might be a mistake." They emerge from only having questions and answers, and when you're using the model, the part you look at is the completion. In this case, all of that just emerges from this large-scale RL training, and that model, whose weights are available, has no human preferences added into the post-training. The DeepSeek R1 full model does have some of this human preference tuning, this RLHF, after the reasoning stage, but the very remarkable thing is that you can get these reasoning behaviors at all, and it's very unlikely that humans are writing out reasoning chains. It's very unlikely that they somehow hacked OpenAI and got access to o1's reasoning chains. It's something about the pre-trained language models and this RL training where you reward the model for getting the question right, and therefore it's trying multiple solutions, and this chain of thought emerges.
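A toy sketch of that verifiable-rewards recipe: sample several completions per question and give a binary reward to the ones whose final answer checks out. The model here is a stub, and real systems update a policy with gradient methods (DeepSeek used GRPO) rather than just filtering, so treat this as the shape of the loop, not the algorithm.

```python
# Toy RL-with-verifiable-rewards loop. StubModel stands in for a real
# policy; a real trainer computes advantages and takes gradient steps
# instead of merely filtering completions.
import random

def verifiable_reward(completion: str, gold_answer: str) -> float:
    # Binary reward: did the completion end with the right answer?
    return 1.0 if completion.strip().endswith(gold_answer) else 0.0

def rollout_batch(model, question: str, gold_answer: str, k: int = 8):
    completions = [model.sample(question) for _ in range(k)]
    rewards = [verifiable_reward(c, gold_answer) for c in completions]
    # Keep only the rewarded traces; these are what get reinforced.
    return [c for c, r in zip(completions, rewards) if r > 0.0]

class StubModel:
    def sample(self, question: str) -> str:
        # Stand-in for sampling a chain of thought plus a final answer.
        return f"let me check... the answer is {random.choice(['4', '5'])}"

print(rollout_batch(StubModel(), "What is 2+2?", "4"))
```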
This might be a good place to mention the eloquent and insightful tweet of the great and powerful Andrej Karpathy. He had a bunch of thoughts, but one of them was: "Last thought. Not sure if this is obvious." You know something profound is coming when he says he's not sure if it's obvious. "There are two major types of learning, in both children and in deep learning. There is (1) imitation learning (watch and repeat, i.e., pre-training, supervised fine-tuning) and (2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo: (1) is learning by imitating expert players, (2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all magic, is always 2. 2 is significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the aha moment when DeepSeek, or o1, etc., discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are emergent," three exclamation points, "and this is actually seriously incredible, impressive, and new, and is publicly available and documented. The model could never learn this with imitation, because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful toward the final outcome." Anyway, the AlphaZero metaphor, the analogy here, can you speak to that, the magic of the chain of thought he's referring to?
I think it's good to recap AlphaGo and AlphaZero, because it plays nicely with these analogies between imitation learning and learning from scratch. With AlphaGo, the beginning of the process was learning from humans: the first expert-level Go player in DeepMind's series of models started with some human data. The reason it's called AlphaZero is that there was zero human data in the loop, and that change to AlphaZero made a model that was dramatically more powerful for DeepMind. This removal of the human prior, the human inductive bias, makes the final system far more powerful. We mentioned the bitter lesson hours ago, and this is all aligned with that. There's been a lot of discussion in language models, and this is not new: it goes back to the whole Q* rumors, which, if you piece together the pieces, were probably the start of OpenAI figuring out its o1 stuff. When the Q* rumors came out last November, there was a lot of intellectual drive to know when something like this was going to happen with language models, because we know these models are so powerful and we know this approach has been so successful in the past. It is a reasonable analogy that this new type of reinforcement learning training for reasoning models is when the door opens to this. We don't yet have the equivalent of move 37, the famous move where DeepMind's AI playing Go completely stumped Lee Sedol. We don't have something at that level of focal point, but that doesn't mean the trajectory of the technology is different; the impact of this general training is still incredibly new.
What do you think that point would be, a move 37 for chain of thought, for reasoning? Scientific discovery, when it uses this sort of reasoning and produces something we fully don't expect? I think it's actually probably simpler than that. It's probably something related to computer use or robotics rather than scientific discovery, because the important aspect here is that models take so much data to learn; they're not sample efficient. They take the entire web, over 10 trillion tokens, to train on. That would take a human thousands of years to read, and a human doesn't know most of that; a lot of the stuff models know better than humans do. And yet humans are way, way more sample efficient. That is because of self-play. How does a baby learn what its body is? It sticks its foot in its mouth and says, oh, this is my body. It sticks its hand in its mouth and calibrates the touch on its fingers against the most sensitive touch sensor it has, its tongue. That's how babies learn: it's just self-play, over and over and over again. Now we have something similar to that with these verifiable tasks, whether it's a unit test in code or a mathematically verifiable task: generate many traces of reasoning, keep branching them out, and then check at the end, hey, which one actually has the right answer? Most of them are wrong. Great, these are the few that are right.
Maybe we use some sort of reward model on top of this to select the best one to preference as well. But now you've started to get better and better at these benchmarks, and you've seen, over the last six months, scores skyrocketing on a lot of different benchmarks. All the math and code benchmarks are pretty much solved, except for FrontierMath, which is designed to be questions that are almost impractical for most people, exam-level, open-math-problem type things. On the math problems that are somewhat reasonable, which is like somewhat complicated word problems or coding problems, it's just what Dylan is saying. The thing here is that this only works with verifiable tasks. We earlier showed an example of the really interesting thing that happens when chain of thought is applied to a non-verifiable task; it's just like a human chatting, thinking about what's novel for humans, a unique thought. But this form of training only works when it's verifiable, and from here the thought is: okay, we can continue to scale this current training method by increasing the number of verifiable tasks. In math and coding, coding probably has a lot more to go; math has a lot less to go in terms of what's verifiable. Can I create a solver, generate reasoning traces toward it, then prune the ones that don't work and keep the ones that do? Those are going to be solved pretty quickly. But even if you've solved math, you have not actually created intelligence.
So this is where I think the aha moment of computer use or robotics will come in, because now you have a sandbox or playground that is infinitely verifiable. Messing around on the internet, there are so many actions you can do that are verifiable. It'll start off with things like: log into a website, create an account, click a button here. But then it'll get to the point where it's, hey, go do a task on Tasker or whatever, all these various task websites; hey, go get hundreds of likes. And it's going to fail. It's going to spawn hundreds of accounts and fail on most of them, but this one got to a thousand, great, now you've reached the verifiable thing, and you just keep iterating this loop over and over. Same with robotics: there you have an infinite playground of tasks, from "hey, did I put the ball in the bucket?" all the way up to "did I build a car?" There's a whole trajectory to speedrun of what models can do. At some point I truly think we'll spawn models where initially all the training is in sandboxes, but then at some point the language model pre-training is going to be dwarfed by this reinforcement learning. You'll pre-train a multimodal model that can see, read, write, vision, audio, etc., but then you'll have it play in a sandbox infinitely: figure out math, figure out code, figure out navigating the web, figure out operating a robot arm, and it'll learn so much. The aha moment, I think, will be when this is available to then create something that's not good, like: oh, cool, part of it was figuring out how to use the web, and now all of a sudden it's figured out really well how to get hundreds of thousands of followers, real followers with real engagement, on Twitter, because all of a sudden this is one of the things that's verifiable. And maybe not just engagement but making money. Yes, that could be the thing where, almost fully automated, it makes $10 million by being an influencer, selling a product, creating the product. I'm not referring to a hype product, but an actual product, like, holy shit, this thing created a business, it's running it, it's the face of the business, that kind of thing. Or maybe a number-one song: it creates the whole infrastructure required to create the song, to be the influencer who represents that song. It makes a lot of sense.
That could be the move. I mean, our culture respects money in that kind of way, and it's verifiable: the bank account can't lie. Exactly. There's surprising evidence that once you set up the ways of collecting a verifiable domain, this can work. There was a lot of research before R1 on math problems, approaching math with language models just by increasing the number of samples: you can just try again and again and again, and you look at how often the language models get it right. What we see is that even very bad models get it right sometimes, and the whole idea behind reinforcement learning is that you can learn from very sparse rewards. The space of language, the space of tokens, whether you're generating language or tasks for a robot, is so big: the tokenizer for a language model can be something like 200,000 things, and at each step it can sample from that big a space. If it can generate a bit of signal that it can climb onto, that's what the whole field of RL is about: learning from sparse rewards. The same thing has played out in math, where for very weak models that sometimes generate answers, we already see research that you can boost their math scores. You can do this sort of RL training for math. It might not be as effective, but if you take a 1-billion-parameter model, something 600 times smaller than DeepSeek, you can boost its grade-school math scores very directly with a small amount of this training.
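To put a rough number on how big that sampling space is (the sequence length here is chosen purely for illustration):

```python
# With a ~200,000-token vocabulary, the number of distinct 20-token
# completions is astronomical, which is why learning from sparse
# rewards is the hard part.
vocab_size = 200_000
seq_len = 20
print(f"{vocab_size:,}^{seq_len} = {float(vocab_size) ** seq_len:.2e} sequences")
```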
So it's not to say this is coming soon. Setting up the verification domains is extremely hard, and there's a lot of nuance in this, but there are some basic things we have seen before, where it's at least expectable that there's a domain and there's a chance this works. All right, so we have fun things happening in real time. This is a good opportunity to talk about the other reasoning models, o1 and o3. Just now, OpenAI, as perhaps expected, released o3-mini. What are we expecting from the different flavors? Can you lay out the different flavors of the o-series models and the Gemini reasoning model?
Something I would say about these reasoning models is: we talked a lot about reasoning training on math and code, and what is done is that you have the base model, trained on a lot of the internet, you do this large-scale reasoning training with reinforcement learning, and then, as the DeepSeek R1 paper detailed (for me, one of the big open questions is how you do this), they did reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL. They did a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models, and then they did RLHF, but made it math-heavy. Some of this transfers. We looked at that philosophical example early on; one of the big open questions is how much it transfers. If we bring in other domains after the reasoning training, are all the models going to become eloquent writers by reasoning? Is this philosophy stuff going to be opened up? We don't know, in the research, how much this will transfer. There are other ideas, like how we can make soft verifiers and things like that, but there is more training after reasoning, which makes these reasoning models easier to use, and that's what we're using right now. With o3-mini and o1, these have gone through extra techniques designed for human preferences after being trained to elicit reasoning.
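Here is a sketch of rejection sampling as just described: sample several candidates per prompt, score them with a reward model, and keep only the top ones as supervised fine-tuning data. The policy and reward model are stand-in stubs, not DeepSeek's actual pipeline.

```python
# Rejection sampling sketch: "heavily filtered instruction tuning with
# some reward models." Stubs stand in for the real policy and reward model.
import random

class StubPolicy:
    def generate(self, prompt: str) -> str:
        return f"candidate answer #{random.randint(0, 99)} to {prompt!r}"

class StubRewardModel:
    def score(self, completion: str) -> float:
        return random.random()  # a real reward model scores quality

def rejection_sample(policy, reward_model, prompt: str, k: int = 8, keep: int = 1):
    candidates = [policy.generate(prompt) for _ in range(k)]
    ranked = sorted(candidates, key=reward_model.score, reverse=True)
    return ranked[:keep]  # only the best survive as SFT data

print(rejection_sample(StubPolicy(), StubRewardModel(), "hypothetical math question"))
```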
I think one of the things people are ignoring is that Google's Gemini Flash Thinking is both cheaper than R1 and better, and they released it at the beginning of December, and nobody's talking about it, no one cares. It has a different flavor to it; its behavior is less expressive than something like o1, it has less range. Qwen released a model last fall, QwQ, which was their preview reasoning model, and DeepSeek had R1-Lite last fall, and those models kind of felt like they were on rails, where they really only could do math and code. o1 can answer anything. It might not be perfect for some tasks, but it's flexible; it has some richness to it. This is kind of the question of how cooked, how undercooked, a model is. It's good to get a model out the door, but it's hard to gauge, and it takes a lot of taste to say: is this a full-fledged model, can I use it for everything? They're probably more similar for math and code. My quick read is that Gemini Flash is not trained the same way as o1, but is taking an existing, more normal training stack and adding reasoning to it. I'm sure they're going to have more; they've done quick releases on Gemini Flash Thinking, and this is the second version since the holidays. It's evolving fast, but it takes longer to build the training stack where you're doing this at large scale. Can we try the same question from earlier, the one about human nature? What was the human nature one? The reason I can ramble about this so much
is that we've been working on this at Ai2 before o1 was fully available to everyone and before R1, on what is essentially using this RL training for fine-tuning. We use it in our Tulu series of models, and you can elicit the same behaviors, where the model says "wait" and so on, but it's so late in the training process that this kind of reasoning expression is much lighter. There's essentially a gradation, and just how much of this RL training you put in determines how the output looks. So we're now using Gemini 2.0 Flash Thinking Experimental 01-21, and it summarized the prompt as "humans: self-domesticated apes." Okay. So wait, is this revealing the reasoning? "Here's why this is novel." Okay, click to expand. "Analyze the request: novel is the keyword." See how it just looks a little different; it looks like a normal output. Yeah, in some sense it's better structured, it makes more sense. And it latched onto "human," and then it went into organisms, and, oh wow, "apex predator," "focus on domestication," "apply domestication to humans," "explore the idea of self-domestication." Not good, not good, where is this going? "Refine and articulate the insight: greater facial expressiveness and communication ability," yes, "plasticity and adaptability," yes, "dependence on social groups," yes. All right. And it does self-critique and refines further: "wow, is this truly novel, is it well supported," so on and so forth. And the insight it's getting at is: humans are not just social animals but profoundly self-domesticated apes, and this self-domestication is the key to understanding our unique cognitive and social abilities. Self-domesticated apes.
I prefer the DeepSeek response. I mean, it's novel; the insight is novel. That's a good book title, Self-Domesticated Apes. There could be a case made for that. Yeah, it's cool, and it's revealing the reasoning. It's magical. This is really powerful. Hello everyone, this is Lex, with a quick intermission recorded after the podcast. Since we reviewed responses from DeepSeek R1 and Gemini 2.0 Flash Thinking during this conversation, I thought at this moment it would be nice to insert myself, quickly doing the same for OpenAI o1 Pro and o3-mini with the same prompt, the prompt being: give one truly novel insight about humans. And I thought I would, in general, give my vibe check and vibe-based anecdotal report on my own experience with the new o3-mini model, now that I've gotten a chance to spend many hours with it in different kinds of contexts and applications. I would probably categorize this question as an open-ended philosophical question, and in particular the emphasis on novelty, I think, is a nice way to test one of the capabilities of the model: can it come up with something that makes you pause and almost surprises you with its brilliance?
So, that said, my general review after running each of the models on this question a bunch of times is that o1 Pro consistently gave brilliant answers, ones that gave me pause and made me think, both cutting in insight and really nicely phrased, with wit, with clarity, with nuance, over and over, consistently generating the best answers. After that is R1, which is less consistent but again delivered brilliance. Gemini Flash 2.0 Thinking was third, and last was o3-mini, actually. It often gave quite a generic answer, at least to my particular sensibilities. That said, in a bunch of other applications I tested for brainstorming purposes, it actually worked extremely well and often outperformed R1, but on this open-ended philosophical question it did consistently worse. Now, another important element for each of these models is how the reasoning is presented. DeepSeek R1 shows the full chain-of-thought tokens, which I personally just love for these open-ended philosophical questions. It's really interesting to see the model think through it.
But also, just stepping back, as a person who appreciates intelligence and reasoning and reflection, reading these raw chain-of-thought tokens of R1, there's something genuinely beautiful about observing the path of deliberation in an intelligent system. I think we don't always have that explicitly laid out for us humans, so to see it in another intelligent system, the nonlinearity of it, akin to Ulysses or Finnegans Wake by James Joyce, is just beautiful to watch. Anyway, as we discussed in the episode, DeepSeek R1 talked about humans being able to convert selfish desires into cooperative systems by collectively pretending abstract rules like money, laws, and rights are real, and these shared hallucinations act as games where competition is secretly redirected to benefit the group, turning conflict into society's fuel. Gemini 2.0 Flash Thinking said humans are not just social animals but self-domesticated apes, and this self-domestication is the key to understanding our unique cognitive and social abilities. Now, it's important to say that the chain of thought there was really interesting: it was looking through the entire evolution of life on Earth, considering apex predators, and considering how from that we ended up where we are. I think that domestication by choice is a really interesting angle. Again, it's one of those things where, when somebody presents a different angle on a seemingly obvious thing, it just makes me smile. And the same with DeepSeek R1: these hallucinations of money, laws, and rights, us collectively pretending they're real, playing games with them that look like competition when secretly we're just cooperating with each other, and that is the fuel of progress. Beautifully put. Now, OpenAI o1 Pro consistently delivered bangers. I can go through many of them,
but the first one was: humans are the only species that turns raw materials into symbolic resources, then uses those symbols to reorganize the very materials they came from, creating a closed feedback loop between meaning and matter. I just ran it again. Banger after banger, I'm telling you: humans are unique among known species in that they simultaneously rewrite two layers of reality, the external world and their own private mental landscapes, and then merge these two rewritten layers into a continuous personal narrative that feels objectively true. "Feels true." This is poetry, okay. And then o3-mini-high, for me, was smart, fast actually, and kind of generic; it never quite got there for me. Here's the first one I got from o3-mini: humans are not fixed beings but rather ongoing narratives, dynamic stories that we continuously write, edit, and reinterpret; this narrative plasticity is more than just memory or self-reflection, it's an intrinsic cognitive process that acts like an internal error-correction system; it allows us to adapt our identities and values over time in response to new experiences, challenges, and social contexts. Now, it almost sneaks up to something approximating cutting insight, with "narrative plasticity" in quotes, but then it goes back to the sort of generic. I don't know. All of these models are incredible for different reasons. There are a lot of concerns, as we discussed in this episode, but there are a lot of reasons to be excited as well. And I've probably spoken for too long; I am severely sleep-deprived, borderline delirious, so hopefully some of this made sense. And now, dear friends, back to the episode.
I think, to Nathan's point, when you look at the reasoning models, to me, even when I used R1 versus o1, there was that rough-around-the-edges feeling. And Flash Thinking, I didn't use this version, but the one from December definitely had that rough-around-the-edges feeling too, where it's just not fleshed out in as many ways. Sure, they added math and coding capabilities via these verifiers in RL, but it feels like they lost something in certain areas. And o1 is worse-performing than ChatGPT in many areas as well, to be clear. Not by a lot, though. And some of R1 definitely felt to me like it was worse than V3 in certain areas: doing this RL expressed and learned a lot, but then it weakened in other areas. I think that's one of the big differences between these models and what o1 offers. And then OpenAI has o1 Pro, and what they did with o3, which is also very unique, is that they stacked
search on top of chain of thought. Chain of thought is one thing: it's one chain, it backtracks, goes back and forth. But how they solved the ARC-AGI challenge was not just the chain of thought; it was also sampling many times, i.e., running them in parallel and then selecting. Is running in parallel actually search? I don't know if we have the full information on how o1 Pro works, so I don't have enough information to confidently say that it is search. It is parallel samples. Yeah, and then it selects something, and we don't know what the selection function is. The reason we're debating is that since o1 was announced there's been a lot of interest in techniques like Monte Carlo tree search, where you break down the chain of thought into intermediate steps. We haven't defined chain of thought. Chain of thought is from a paper from years ago, where the idea was introduced of asking a language model, which at the time was much less easy to use, to "think step by step," and it would induce the model to produce this bulleted list of steps. Chain of thought is now almost a default in models: if you ask a math question, you don't need to tell it to think step by step. The idea with Monte Carlo tree search is that you would take an intermediate point in that chain, do some sort of expansion, spend more compute, and then select the right branch. That's a very complex form of search that has been used in things like MuZero and AlphaZero, potentially; I know MuZero does this. Another form of search is just asking five different people and taking the majority answer. There's a variety: it could be complicated, it could be simple. We don't know what it is, just that they are not issuing one chain of thought in sequence; they're launching many in parallel. In the ARC-AGI case, the run that really shocked everyone, that beat the benchmark, launched a thousand in parallel, and then they would get the right answer maybe 70 to 90 percent of the time, whereas if they just launched one, it was more like 30 percent.
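The "ask five people and take the majority answer" form of search is easy to sketch; here it is with a stub sampler standing in for full chain-of-thought rollouts (this variant is often called self-consistency; whether OpenAI does anything like it is, as discussed, unknown).

```python
# Majority voting over independent samples. The stub sampler stands in
# for a full chain-of-thought rollout that ends in a final answer.
from collections import Counter
import random

def majority_vote(sample_fn, question: str, n: int = 5) -> str:
    answers = [sample_fn(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def stub_sampler(question: str) -> str:
    # A weak model: right 70% of the time; voting amplifies that.
    return random.choices(["42", "41"], weights=[0.7, 0.3])[0]

print(majority_vote(stub_sampler, "hypothetical math question"))
```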
There are many extensions to this. The simplest framing is that our language models to date have been designed to give the right answer the highest percentage of the time in one response, and we are now opening the door to different ways of running inference on our models, in which we need to re-evaluate many parts of the training process. That normally opens the door to more progress, but we don't know if OpenAI changed a lot, or if just sampling more and selecting from the choices is what they're doing, or if it's something more complex where they changed the training and they know the inference mode is going to be different. So, we're talking about o1 Pro, $200 a month, and they're losing money. This fascinating exploration of the test-time compute space: is it actually possible? Do we have enough compute for that? Do the financials make sense? So the fantastic thing is, and it's in the chart I pulled up earlier:
the cost for GPT-3 has plummeted. If you scroll up just a few images: the important question is, is cost a limiting factor here? My view is that we'll have really awesome intelligence, before AGI even, before it permeates throughout the economy, and this is sort of why. GPT-3 was trained in, what, 2020, and the cost of running inference on it was $60 to $70 per million tokens, so the cost per unit of intelligence was ridiculous. As we scaled forward a couple of years, we've had a 1200x reduction in the cost to achieve the same level of intelligence as GPT-3. Here, on the x-axis, is time over just a couple of years, and on the y-axis is log-scale dollars to run inference on a million tokens. You have a linear decline, on a log scale, from GPT-3 through GPT-3.5 to Llama: it's something like 5 cents per million tokens now, versus $60. That's 1200x. Those aren't the exact numbers, but 1200x is the number I remember; a humongous decrease in cost per intelligence.
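Quick arithmetic on that trend, using the rough numbers quoted in the conversation (all approximate):

```python
# Rough numbers from the discussion above, not exact figures.
gpt3_cost = 60.0        # $ per million tokens around GPT-3's launch
reduction = 1200.0      # claimed total reduction at GPT-3 quality
years = 3.0             # roughly the elapsed time
print(f"today: ${gpt3_cost / reduction:.3f} per million tokens")
print(f"annualized: ~{reduction ** (1 / years):.1f}x cheaper per year")
```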
Now, the freak-out over DeepSeek is: oh my god, they made it so cheap. Actually, if you look at this trend line, they're not below the trend line, first of all, at least relative to GPT-3. They are the first to hit it, which is a big deal, but they're not below it. Now we have GPT-4: what's going to happen with these reasoning capabilities? It's a mix of architectural innovations, a mix of better data, and it's going to be better training techniques and all of these different, better inference systems, better hardware, going from each generation of GPU to new generations, or ASICs. Everything is going to take this cost curve down and down and down. And then, can I just spawn a thousand different LLMs to attack a task and then pick from one of them?
Or whatever search technique I want, a tree, Monte Carlo tree search. Maybe it gets that complicated; maybe it doesn't, because it's too complicated to actually scale. Who knows. Bitter lesson, right? The question, I think, is when, not if, because the rate of progress is so fast. Nine months ago Dario was saying the cost to train and run inference was this, and now we're much better than that, and DeepSeek is much better than that. That cost curve for GPT-4, which was also roughly $60 per million tokens when it launched, has already fallen to around $2, and we're going to get it down to cents, probably, for GPT-4 quality. That's the base for the reasoning models like o1 that we have today, and o1 Pro is spawning multiple, and o3, and so on and so forth. These search techniques are too expensive today, but they will get cheaper, and that's what's going to unlock the intelligence.
So it gets cheaper and cheaper and cheaper. The big DeepSeek R1 release freaked everybody out because of the cheapness; one of the manifestations of that is that Nvidia's stock plummeted. Can you explain what happened? And also, just explain this moment, and whether Nvidia is going to keep winning. We're both Nvidia bulls here, I would say, and in some ways the market response is reasonable. Nvidia's biggest customers in the US are major tech companies, and they're spending a ton on AI. If a simple interpretation of DeepSeek is that you can get really good models without spending as much on AI, then in that capacity it's like, oh, maybe these big tech companies won't need to spend as much on AI, and the spend goes down. The actual thing that happened is much more complex: there are social factors, there's DeepSeek rising in the App Store, the social contagion that was happening. And some of it, I don't trade, I don't know anything about financial markets, but it builds up over the weekend, the social pressure: if it had been during the week, there would have been multiple days of trading as this was really emerging, but it came over the weekend, and then everybody wants to sell at once, and that is a social contagion, I think. And there were a lot of false narratives, like, hey, these guys are spending billions on models. They're not spending billions on models; no one has spent more than a billion dollars on a model that's been released publicly. GPT-4 was a couple hundred million, and then they've reduced the cost with 4 Turbo and 4o.
But billion-dollar model runs are coming, and that includes pre-training and post-training. And then the other number: hey, DeepSeek didn't include everything. A lot of the cost goes to research and all that sort of stuff, a lot of the cost goes to inference, a lot goes to post-training. None of these things were factored in. Research salaries: all these things are counted in the billions of dollars that OpenAI is spending, but they weren't counted in the $5 or $6 million that DeepSeek reportedly spent. So there's a bit of misunderstanding of what these numbers are. And then there's also the element that Nvidia has just been a straight line up, and there have been so many different narratives trying to push Nvidia down. I don't mean push the stock down; everyone is just looking for a reason to sell or to be worried. It was Blackwell delays: every two weeks there's a new report about their GPUs being delayed. There was the whole thing about scaling laws ending. It's so ironic, it lasted a month; it was literally just, hey, models aren't getting better, they're just not getting better, there's no reason to spend more, pre-training scaling is dead. And then it's o1, o3, R1, and now it's: wait, models are progressing too fast, slow down the progress, stop spending on GPUs.
But the funniest thing, I think, that comes out of this is that Jevons paradox is true. AWS pricing for H100s has gone up over the last couple of weeks, since a little after Christmas, since V3 was launched. H200s are almost out of stock everywhere, because the H200 has more memory and therefore R1 wants that chip over the H100. We were trying to get GPUs on short notice this week for a demo and it wasn't that easy; we were trying to get just 16 or 32 H100s for a demo, and it was not very easy. For people who don't know, Jevons paradox is when efficiency goes up and, somehow, counterintuitively, total resource consumption goes up as well. In semiconductors we've had 50 years of Moore's law: every two years, half the cost, double the transistors, just like clockwork. It's slowed down, obviously, but the semiconductor industry has grown the whole time. It's been wavy; there are obviously ebbs and flows.
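A toy version of Jevons paradox in one calculation, with a purely illustrative demand elasticity; the mechanism, not the numbers, is the point.

```python
# If efficiency improves 10x (tokens per dollar) and demand for tokens
# is sufficiently price-elastic, total GPU demand still rises.
efficiency_gain = 10.0
price_ratio = 1 / efficiency_gain                # effective price falls 10x
elasticity = 1.5                                 # hypothetical elasticity
tokens_demanded = price_ratio ** (-elasticity)   # ~31.6x more tokens wanted
gpu_demand = tokens_demanded / efficiency_gain   # ~3.2x more GPUs needed
print(f"tokens: {tokens_demanded:.1f}x, GPUs: {gpu_demand:.1f}x")
```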
And I don't expect AI to be any different; there are going to be ebbs and flows. But in AI it's just playing out at an insane timescale: Moore's law was 2x every two years, this is 1200x in like three years. The scale of improvement is hard to wrap your head around. Yeah, I was confused, because to me Nvidia's stock should have gone up on that, but maybe it went down because there's suspicion of foul play on the side of China, or something like this. But if you just look purely at the actual principles at play here, it's obvious. Yeah, Jevons paradox. The more progress AI makes, the higher the derivative of AI progress is, the sooner the market's going to be bigger and expanding, and Nvidia is in the best place, because Nvidia is the only one that does everything reliably right now. It's not like an Nvidia competitor arose; it's another company that's using Nvidia, that historically has been a large Nvidia customer, and has had press releases cheering about being China's biggest Nvidia customer. Obviously they've quieted down, but I think another element is that they don't want to say how many GPUs they have. Yes, they have H800s, yes, they have H20s, and they also have some H100s, which were smuggled in. Can you speak to that, to the smuggling? What's the scale of smuggling that's feasible for a nation-state to do, or for companies? I think there are a few angles of smuggling here.
One is ByteDance, which is arguably the largest smuggler of GPUs for China. China's not supposed to have GPUs, and ByteDance has over 500,000 of them. Why? Because they're rented from companies around the world. They rent from Oracle, they rent from Google, they rent from all these major players and a bunch of smaller cloud companies too, all the neoclouds of the world. They rent so many GPUs, and they also buy a bunch. And they do this mostly for what Meta does: serving TikTok, serving the next-best recommendation. Same as Meta, to be clear; that's the use today, and it's a valid use, hacking the dopamine circuit. Now, that's theoretically very much restricted by the AI diffusion rules, which came out in the last weeks of the Biden administration, and the Trump administration looks like they're going to keep them. These limit even allies, like Singapore, which is something like 20 to 30 percent of Nvidia's revenue. But Singapore has had a moratorium on building data centers for like 15 years, because they don't have enough power. So where are those GPUs going? I'm not claiming they're all going to China, but a portion are. Many are going to Malaysia, including to Microsoft and Oracle, which have big data centers in Malaysia; they're going all over Southeast Asia, probably India as well. There's stuff routing around, but the diffusion rules are very de facto: you can only buy this many GPUs if you're this country, and you can only rent a cluster this large if you're a Chinese company. They're very explicit about trying to stop
smuggling. And a big chunk of it was, hey, a random company buys 16 servers and ships them to China. There's actually a photo I saw from someone in the semiconductor industry, who leads a team for networking chips that competes with Nvidia: he sent a photo of a guy checking into a first-class United flight from San Francisco to Shanghai or Shenzhen with a Supermicro box that is this big, which can only contain GPUs. And he was booking first class because, think about it: $3,000 to $5,000 for your first-class ticket, the server costs $240,000 to $250,000 in the US, and you sell it for $300,000 in China. Wait, you just got a free first-class ticket and a lot more money. That's small-scale smuggling, though; most of the large-scale smuggling is companies in Singapore and Malaysia routing them around, or renting GPUs, completely legally.
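The back-of-the-envelope math in that anecdote, using the quoted figures (all approximate):

```python
# Figures are the approximate ones quoted in the anecdote above.
ticket = 4_000            # first-class SFO to Shanghai, ~$3-5K
server_cost_us = 245_000  # ~$240-250K server cost in the US
sale_price = 300_000      # quoted resale price in China
profit = sale_price - server_cost_us - ticket
print(f"profit per trip: ~${profit:,}")  # the ticket pays for itself
```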
I want to jump in on the scale. Some people with a higher-level economics understanding say that as you go from $1 billion of smuggling to $10 billion, you're trying to hide entire levels of economic activity, and that's the most reasonable bound to me: there's going to be some level at which it's so obvious that it's easy to find this economic activity. Yeah, so my belief is that last year, roughly, Nvidia made a million H20s, which are legally allowed to be shipped to China, and which, as we talked about, are better for reasoning inference at least; maybe not training, but reasoning inference, and inference generally. They also had a couple hundred thousand, we think 200,000 to 300,000, GPUs routed to China from Singapore, Malaysia, the US, wherever. Companies spawn up, buy 16 GPUs, 64 GPUs, whatever it is, and route them. And Huawei is known for having spun up a massive network of companies to get the materials they need after they were banned in 2018. So it's not otherworldly. But I agree with Nathan's point: you can't smuggle $10 billion of GPUs.
And then the third source, which was just now banned, and which wasn't considered smuggling but effectively is, is China renting. I believe from our research that Oracle's biggest GPU customer is ByteDance, and for Google, I think, they're their second-biggest customer. And you go down the list of clouds, especially these smaller cloud companies that aren't the hyperscalers, think beyond CoreWeave and Lambda even: there are 60 different new cloud companies serving Nvidia GPUs, and I think ByteDance is renting a lot of these, all over. These companies were renting GPUs to Chinese companies, and that was completely legal up until the diffusion rules, which happened just a few weeks ago. Even now, you can rent GPU clusters that are less than 2,000 GPUs, or you can buy GPUs and ship them wherever you want if there are fewer than 1,500 of them. So there are still some ways to smuggle.
But as the numbers grow: a hundred-something billion dollars of revenue for Nvidia last year, two-hundred-something billion this year, and next year it could nearly double again or more, based on what we see with data center footprints being built out all across the US and the rest of the world. It's going to be really hard for China to keep up with these rules. Yes, there will always be smuggling, and DeepSeek-level models, GPT-4-level models, o1-level models, will be trainable on what China can get, even the next tier above that. But if we speedrun a couple more jumps, to billion-dollar models, ten-billion-dollar models, then it becomes, hey, there is a compute disadvantage for China, for training models and serving them. And the serving part is really critical. DeepSeek cannot serve their model today; it's completely out of inventory. It actually already started falling in App Store downloads, because you download it, you try to sign up, and they say we're not taking registrations because they have no capacity. You open it up, you get less than five tokens per second, if you even get your request approved, because there's just no capacity, because they just don't have enough GPUs to serve the model, even though it's incredibly efficient. It would be fascinating to watch the smuggling, because, I mean, there's drug smuggling, that's a market, there's weapons smuggling, and GPUs will surpass those at some point in highest value per kilogram, probably by far.
I have another question for you, Dylan. Do you track model API access internationally? How easy is it for Chinese companies to use hosted model APIs from the US? Yeah, I mean, that's incredibly easy. OpenAI publicly stated that DeepSeek uses their API, and they say they have evidence. This is another element of the training regime: people at OpenAI have claimed that it's a distilled model, i.e., you're taking OpenAI's model, generating a lot of output, and then training on that output in your own model. And even if that's the case, what DeepSeek did is still amazing efficiency-wise, by the way. Distillation is standard practice in industry. If you're at a closed lab where you care about terms of service and IP closely, you distill from your own models; if you're a researcher and you're not building any products, you distill from the OpenAI models. This is a good opportunity: can you explain, big picture, what distillation is as a process? We talk a lot about training language models. They're trained on text, and in post-training you're trying to train on very high-quality text that you want the model to match the features of, or, if you're using RL, you're letting the model find its own thing. But for supervised fine-tuning, for preference data, you need some completions, what the model is trying to learn to imitate, and what you do there is, instead of human data, or instead of the model you're currently training, you take completions from a different, normally more powerful, model. There are rumors that the big models people are waiting for, the GPT-5s of the world, the Claude 3 Opuses of the world, are used internally to do this distillation process. There are also public examples: Meta explicitly stated, not necessarily distilling, but they used the 405B as a reward model for the 70B in their Llama 3.2 or 3.3 release. This is all the same topic.
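Here is a sketch of distillation as just described: sample completions from a stronger "teacher" model, then fine-tune a smaller "student" on them as ordinary supervised data. Both models are stand-in stubs, not anyone's actual pipeline.

```python
# Distillation sketch: the teacher writes answers, the student is
# fine-tuned to imitate them. Stubs stand in for real models.

class StubTeacher:
    def generate(self, prompt: str) -> str:
        return f"a detailed, high-quality answer to {prompt!r}"

class StubStudent:
    def __init__(self):
        self.examples = []
    def train_step(self, prompt: str, target: str) -> None:
        # A real trainer would take a gradient step on the imitation
        # (cross-entropy) loss; this stub just records the pair.
        self.examples.append((prompt, target))

def distill(student, teacher, prompts):
    for prompt in prompts:
        target = teacher.generate(prompt)   # teacher writes the answer
        student.train_step(prompt, target)  # student learns to imitate it
    return student

student = distill(StubStudent(), StubTeacher(), ["What is RLHF?", "Explain MoE."])
print(len(student.examples), "distilled training examples")
```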
So is this ethical? Is it legal? Why does that Financial Times article headline say "OpenAI says there's evidence China's DeepSeek used its model to train competitor"? At least on the academic and research side, this has a long history, because you're trying to interpret OpenAI's rules. OpenAI's terms of service say that you cannot build a competitor with outputs from their models. Terms of service are different from a license, which is essentially a contract between organizations. If you've agreed to terms of service on an OpenAI account and I violate them, OpenAI can cancel my account. That's very different from a license that says how you can use a downstream artifact. A lot of it hinges on a term that is very unclear in the AI space: what is a competitor? And then the ethical aspect of it is: why is it unethical for me to train on your model when you trained on the internet's text? There's a bit of hypocrisy, because OpenAI, and potentially most of these companies, trained on the internet's text without permission. There's also a clear loophole, which is that I generate data from OpenAI, I upload it somewhere, and then somebody else trains on it, and the link has been broken; they're not under the same terms-of-service contract.
This is why there's a lot of uncertainty, a lot of to-be-discovered details that don't make a lot of sense. It's why a lot of models today, even if they trained on zero OpenAI data, will, if you ask who trained them, say "I'm ChatGPT, trained by OpenAI," because there's so much copy-paste of OpenAI outputs on the internet that you just weren't able to filter it out, and there was nothing in the RL, or post-training, or SFT, whatever, that says, hey, I'm actually a model by the Allen Institute instead. We have to do this. If we serve a demo, we do research, and we use OpenAI APIs, because it's useful and we want to understand post-training, our research models will say they were written by OpenAI unless we put in the system prompt we talked about: "I am Tulu, I am a language model trained by the Allen Institute for AI."
And if you ask more people around industry, especially in post-training, it's a very doable task to make the model say who it is, or to suppress the OpenAI thing. So on some level, it might be that DeepSeek didn't care that it was saying it was made by OpenAI. If you're going to upload model weights, it doesn't really matter, because anyone serving it in an application, who cares a lot about serving, is going to tailor it to their specific task when serving it, and it doesn't matter that it's saying it's ChatGPT. I guess one of the ways to do that is with a system prompt or something like that, if you're serving it, to say who you are. Yeah, that's what we do: if we host a demo, the system prompt says, "you are Tulu 3, a language model trained by the Allen Institute for AI." We also have benefited from OpenAI data, because it's a great research tool. Do you think there's any truth or value to OpenAI's claim that there's evidence China's DeepSeek used its model to train?
I think everyone has benefited regardless, because the data is on the internet, and therefore it's in your pre-training now. There are subreddits where people share the best ChatGPT outputs, and those are in the pre-training data. I think they're trying to shift the narrative, trying to protect themselves. We saw this years ago when ByteDance was actually banned from some OpenAI APIs for training on outputs. There are other AI startups where most people in AI culture will say: they just told us they trained on OpenAI outputs, and they never got banned. That's how they bootstrapped their early models, because it's much easier to get off the ground using this than to set up human pipelines and build a strong model from scratch. So there's a long history here, and a lot of the communications seem like narrative control. Actually, over the last couple of days, we've seen a lot of people distill DeepSeek's model into Llama models, because the DeepSeek models are kind of complicated to run inference on, with their mixture of experts and their 600-plus billion parameters and all that. People distilled them into the Llama models, because the Llama models are so easy to serve, and everyone's built the pipelines and tooling for inference with them, because Llama is the open standard. So we've seen it come back around in a sort of roundabout. Is it bad? Is it illegal? Maybe it's illegal, whatever. I don't know about that, but it could break contracts.
I don't think it's illegal in any real sense; no one's going to jail for this, ever. Fundamentally, I think it's ethical, or I hope it's ethical, because the moment we ban that kind of thing, it's going to make everybody much worse off. And actually, this is difficult, but I think you should be allowed to train on the internet. I know a lot of authors and creators are very sensitive about it, and it's a difficult question, but the moment you're not allowed to train on the internet... I agree. I have a schizo take on how you can solve this, because it already works. And I have a reasonable take after. All right. So, (a) Japan has a law under which you're allowed to train on any training data and copyrights don't apply if you want to train a model; (b) Japan has 9 gigawatts of curtailed nuclear power; (c) Japan is allowed under the AI diffusion rule to import as many GPUs as they'd like. So we have a market to make here: we build massive data centers, we rent them to the labs, and then we train models in a legally permissible way, and there are no ifs, ands, or buts, and the models have no potential copyright lawsuit from the New York Times or anything like that. No, it's just completely legal. Genius.
The early copyright lawsuits have fallen in favor of AI training, and I would say the long tail of use is going to go on the side of AI too, which is: if you scrape trillions of tokens of data, you're not looking at any one piece and saying this one New York Times article is so important to me. But if you're doing audio generation for music, or image generation, and you say "make it in the style of so-and-so," that's a reasonable case where you could figure out what the profit margin on inference is. I don't know if it's going to be the 50/50 of the YouTube creator program or something, but I would opt into that program as a writer, please. It's going to be a rough journey, but there will be some solutions like that that make sense. There's a long tail, though, where it's just on the internet. One of the other things that Financial Times article implied leads to a more general question: how difficult is spying, espionage, the stealing of actual secret code and data from inside companies? How much of that is being attempted? Code and data are hard to steal, but ideas are easy.
Silicon Valley operates on top employees getting bought out by other companies for a pay raise, and a large reason these companies do this is to bring ideas with them. In California, certain non-competes are illegal, and whether or not there are NDAs and things, that's how a lot of it happens. Recently there was somebody from Gemini who helped make the 1-million-token context length, and everyone is saying the next Llama, I mean, he went to the Meta team, is going to have 1-million context length. That's kind of how the world works. As far as industrial espionage and the like, it has been greatly successful in the past. The Americans did it to the Brits, the Chinese have done it to the Americans, and so on. It is a fact of life, so to argue industrial espionage can be stopped is probably unrealistic. You can make it difficult, but even then, there are all these stories about how the F-35 and F-22 have already essentially been given to China, in terms of design plans
and so on. Stealing code and data between companies, as opposed to nation-states, is probably very difficult, but ideas are discussed a lot, whether at a house party in San Francisco, or when a company changes employees, or through the mythical honeypot that always gets talked about. Someone gets honeypotted, because so many people working on AI are single dudes in their 20s and 30s. Not everyone, but an insane percentage. And, to be clear, a honeypot is when a spy, a female spy, approaches you. Or male, it's San Francisco. But as a single dude in his late 20s, I will say we are very easily corrupted. Not that I've been corrupted myself, but we are. Everybody else; not me, I'm too oblivious. And I am not single, so I'm safe from that one espionage vector.
sure to close all security vulnerabilities. So Dylan, you collect a lot of information about the mega clusters of each of the major AI companies. Can you talk through the build-outs that stand out? Yeah, I think the really important thing about these mega cluster build-outs is that they're completely unprecedented in scale. US data center power consumption has been slowly on the rise over decades, even through the cloud computing revolution — data center consumption as a percentage of total US power has been climbing slowly, but it's only now at 2 to 3%. By the end of this decade, it could be around 10% by 2028-2030 — and when I say 10%, traditional, non-AI data center people say that's nuts, while people in AI who have really looked at this, at the Anthropics and OpenAIs, say that's not enough. This is both globally distributed and distributed throughout the US, as well as centralized clusters. The distributed-throughout-the-US part is exciting, and it's the bulk of it — say OpenAI or Meta is adding a gigawatt, but most of it is distributed through the US for inference and all these other things. So maybe we should lay out what a cluster is. Does this include AWS? Maybe
it's good to talk about the different kinds of clusters, what you mean by mega clusters, what a GPU is, what a computer is — maybe not that far back. So what do we mean by clusters? I thought I was about to do the Apple ad: what's a computer? Traditionally, data centers and data center tasks have been a distributed systems problem that's capable of being spread very far and wide — i.e., I send a request to Google, it gets routed to a data center somewhat close to me, it does whatever search ranking and recommendation, and sends a result back. The nature of the task is changing rapidly, and there are two tasks people are really focused on now. It's not database access, it's not serve me the right page or the right ad — it's now inference, and inference is dramatically different from traditional distributed systems, though it looks a lot more similar to them than training does. And then there's training. The inference side is still: I'm going to put thousands of GPUs in blocks all around these data centers and run models on them; a user submits a request, or a service does — they're in Word and say "help me, Copilot," on Windows Copilot or Apple Intelligence or whatever it is — and it gets kicked off to a data center. The data center does some work and sends it back. That's inference, and that's going to be the bulk of compute.
There are thousands of data centers that we're tracking with satellites and all these other things, and those are the bulk of what's being built — that's what's getting millions of GPUs. But the scale of the largest single cluster is also really important. When we look back at history, through the age of AI, it was a really big deal when they did AlexNet on, I think, two GPUs or four GPUs — I don't remember. It's a big deal because they used GPUs, and they used multiple ones. But over time the scale has just been compounding. Skip forward to GPT-3, then GPT-4 — GPT-4 was 20,000 A100 GPUs, an unprecedented run in terms of size and cost: a couple hundred million dollars on a YOLO run, and it yielded this magical improvement that was perfectly in line with what was experimented at smaller scale, right on the log-scale plot. They have that plot in the technical report — the scaling laws were perfect. But that's not a crazy number. 20,000 A100s, each GPU consuming roughly 400 watts, and when you add in the whole server — everything — it's 15 to 20 megawatts of power. Maybe you could look up the power consumption of a person, because the numbers are going to get silly. But that 15 to 20 megawatts was a standard data center size; it was just unprecedented for it all to be GPUs running one task. For reference, a toaster has a similar power consumption to an A100. Then the H100 comes around, and they increase the power from roughly 400 to 700 watts per GPU, and with all the associated stuff around it, it's roughly 1,200 to 1,400 watts for everything — networking, CPUs, memory, and so on.
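A quick back-of-the-envelope sketch of that power arithmetic — the per-GPU figures are the rough ones quoted above, so treat this as illustrative rather than exact:

```python
# Back-of-the-envelope cluster power math using the rough figures quoted above.
def cluster_power_mw(num_gpus: int, watts_per_gpu_all_in: float) -> float:
    # "All-in" watts includes networking, CPUs, memory, etc., not just the GPU die.
    return num_gpus * watts_per_gpu_all_in / 1e6

# GPT-4-era cluster: 20,000 A100s at ~1,000 W all-in each (400 W GPU + server overhead).
print(cluster_power_mw(20_000, 1_000))   # ~20 MW
# 2024-scale cluster: 100,000 H100s at ~1,400 W all-in each.
print(cluster_power_mw(100_000, 1_400))  # ~140 MW
```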
So we should also say what's required. You said power — a lot of power is required, and a lot of heat is generated, so cooling is required. And because a lot of GPUs or CPUs have to be connected, there's a lot of networking. Yeah, sorry for skipping past that. The data center itself is complicated, but these were still standardized data centers at GPT-4 scale. Now step forward to the scale of clusters people built last year, and it ranges widely: from "these are standard data centers and we're just using multiple of them, connecting them together with a ton of fiber between them, a lot of networking" — that's what OpenAI and Microsoft did in Arizona, where they have 100,000 GPUs — to Meta, similar thing: they took their standard existing data center design, which looks like an H, and connected multiple of them together. They first did 16,000 GPUs — 24,000 GPUs total, with only 16,000 running on the training run, because GPUs are very unreliable and they need spares to swap in and out — all the way to the 100,000 GPUs they're training Llama 4 on currently, 128,000 or so. Think about it: 100,000 GPUs at roughly 1,400 watts apiece is 140 megawatts, 150 megawatts for 128,000. So you've jumped from 15-20 megawatts to almost 10x that — 150 megawatts —
in two years, from 2022 to 2024. And some people like Elon — he admittedly got into the game a little bit late for pre-training large language models, and he says it himself; xAI was started later — bent heaven and hell to get his data center up and get the largest cluster in the world, 200,000 GPUs. He bought a factory in Memphis, he's upgrading the substation, and at the same time he's got a bunch of mobile power generation, a bunch of single-cycle gas turbines. He tapped the natural gas line that's right next to the factory, and he's just pulling a ton of gas, burning it, and generating all this power — in an old appliance factory that shut down and moved production to China long ago. And he's got 200,000 GPUs in it. Now, what's the next scale? All the hyperscalers have done this; the next scale is something even bigger.
Elon, just to stick on the topic, is building his own natural gas plant — a proper one — right next door. He's deploying tons of Tesla Megapack batteries to smooth out the power, and all sorts of other things — industrial chillers to cool the water down, because he's water-cooling the chips. All these crazy things to get the cluster bigger and bigger. But look at what OpenAI did with Stargate in Abilene, Texas — what they've announced, at least; it's not all built. Elon says they don't have the money; there's some debate about this, but at least the first section's money is definitely accounted for, and there are multiple sections. At full scale, that data center is going to be 2.2 gigawatts — 2,200 megawatts of power in, and roughly 1.8 gigawatts, 1,800 megawatts, delivered to chips. This is an absurd scale — 2.2 gigawatts is more than most cities consume, to be clear — delivered to a single connected cluster to train these models, to do both the pre-training and the post-training, all of this stuff. It's insane — that's a nuclear power plant's worth of power, again. And everyone is doing this. Meta in Louisiana: they're building two massive natural gas plants and then this massive data center. Amazon has plans for this scale, Google has plans for this scale, xAI has plans for this scale.
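A side note on the gap between those two Stargate numbers: the ratio of total facility power to power delivered to chips is essentially what the data center industry calls PUE (power usage effectiveness) — that framing is standard industry practice, not something named in the conversation. A quick sanity check on the figures quoted above:

```python
# PUE = total facility power / IT (chip) power, using the rough Stargate figures above.
total_facility_gw = 2.2
chip_power_gw = 1.8
pue = total_facility_gw / chip_power_gw
print(f"PUE ≈ {pue:.2f}")  # ≈ 1.22 — cooling, power conversion, etc. consume the rest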
All of these companies that are racing are racing hard, doing multi-gigawatt data centers to build this out, because they think that — obviously pre-training scaling will continue to some extent — but also all this post-training stuff, where you have an RL sandbox for computer use or whatever, all these verifiable domains where the model just keeps learning and learning through self-play or whatever it is, makes the AI so much more capable. Because the line does go up: as you throw more compute, you get more performance. The shirt is about scaling laws. To some extent it is diminishing returns — you 10x the compute and you don't get a 10x better model, you get diminishing returns — but you also get efficiency improvements, so you bend the curve. And data centers at this scale are wreaking a lot of havoc on the power grid.
Nathan was mentioning that Amazon has tried to buy this nuclear power plant site from Talen — and if you look at Talen's stock, it's just skyrocketing; they're building a massive multi-gigawatt data center there. You just go down the list and there are so many ramifications. One interesting thing: in certain regions of the US, transmitting power costs more than actually generating it, because the grid is so slow to build while the demand for power keeps coming — ramping up a natural gas plant or even a coal plant is easy enough to do, but transmitting the power is really hard. In some parts of the US, like Virginia, it costs more to transmit power than to generate it. There are all sorts of insane second-order effects here. Can the power grid support this kind of growth? There were executive orders — a Biden executive order before the end of the year, and then Trump had some more executive orders — which hopefully reduce the regulations so that,
yes, things can actually be built. But this is a big, big challenge: building enough power fast enough. Are you going to basically have a nuclear power plant next to a data center for each one of these? The fun thing here is that it's too slow — to build a power plant, or to reconfigure an existing power plant, is too slow. So therefore you must use natural gas. Data center power consumption is flat, around the clock, which is why nuclear is a very natural fit long-term. But in the short term you can't do it with solar or anything like that, precisely because data center draw is constant — you're telling me I'm going to buy tens of billions of dollars of GPUs and idle them because the power isn't being generated? Power is cheap — if you look at the cost of a cluster, less than 20% of it is power; most of it is the capital cost and depreciation of the GPUs.
So it's like, well, screw it, I'll just build a natural gas plant. This is what Meta is doing in Louisiana, this is what OpenAI is doing in Texas, and in all these different places — they may not be doing it directly, but they're partnered with someone. And there are a couple of hopes. Elon, what he's doing in Memphis, is the extreme: they're not just using combined-cycle gas, which is super efficient; he's also using single-cycle turbines and mobile generators, which are less efficient. But there's also the flip side: solar generation follows one curve, wind follows a different one, and they're not correlated, so if you stack both of those, plus a big chunk of batteries, plus a little bit of gas, it is possible to run it greener — it's just that the timescale for that is slow. So people are trying, but Meta basically said, whatever, never mind my sustainability pledge. Or they'll buy a PPA, a power purchasing agreement, where there'll be a massive wind farm or solar farm somewhere, and they'll just pretend those electrons are being consumed by the data center — but in reality they're paying for the power over there and selling it to the grid, while buying power here. And Microsoft quit on some of their sustainability pledges too. Elon — what he did in Memphis is objectively somewhat dirty, but he's also doing it in an area where there's a bigger natural gas plant right next door, plus a wastewater treatment plant and a garbage dump nearby. And he's obviously made the world a lot cleaner than that one data center will hurt it. So I think it's fine, to some extent — and maybe AGI solves global warming, whatever it is. This is the attitude that people at the labs have:
we'll just use gas, because the race is that important, and if we lose, that's way worse. I should say that I got a chance to visit the Memphis data center. Oh, wow. It's kind of incredible. I visited with Elon and the team, and the rate of innovation there is insane, because my sense is that nobody's ever done anything of this scale — and certainly nobody has done anything of this scale at the rate that xAI is doing it. So they're figuring it out — I was sitting in on all these brainstorming meetings, and it's insane and exciting: they're trying to figure out what the bottlenecks are, how to remove them, how to make sure everything works. There are so many really cool things about putting together a data center, because everything has to work. The people doing the sysadmin work and the machine learning are the exciting part, but really the people that run everything are the folks who know the low-level software and hardware that runs it all — the networking, all of that — so you have to have procedures that test everything. I think they're using Ethernet — I don't know how they got that networking working, but they're using Nvidia Spectrum-X Ethernet. The unsung heroes are the cooling and electrical systems, which just get glossed over. But one story that maybe exemplifies how insane this stuff is:
when you're training, in the most simplistic terms, you're running through the model a bunch, and then you exchange everything and synchronize the weights. That's a step — a step in model training — and with every step your loss goes down, hopefully. It doesn't always, but in the simplest terms, you compute a lot and then you exchange. The interesting thing is that GPU power is most of the draw; networking power is some, but a lot less. So while you're computing, your GPU power draw is high, but when you're exchanging weights — if you're not able to overlap communication and compute perfectly — there may be a period where your GPUs are just idle while you exchange the gradients and do the model update, and then you start training again. So the power goes up and down; it's super spiky. And funnily enough, when you talk about the scale of data center power, you can blow stuff up so easily. Meta actually accidentally upstreamed something to open-source code in PyTorch, where they added an operator — and I kid you not, whoever made this, I want to hug the guy — it's something like PYTORCH_NO_POWERPLANT_BLOWUP, equal to zero or one. And what it does is amazing: when you're exchanging the weights, the GPUs will just compute fake numbers so the power doesn't spike too much, and the power plants don't blow up, because the transient spikes screw stuff up. Well, that makes sense — you have to do that kind of thing, you have to make sure they're not idle. Yeah, and Elon's solution was to throw in a bunch of Tesla Megapacks and a few other things. Everyone has different solutions, but Meta's at least was publicly and openly known: set this operator, and it makes the GPUs compute nothing meaningful so the power doesn't spike. But that just tells you how much power you're dealing with.
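A minimal sketch of the idea behind that kind of knob, assuming a standard PyTorch distributed setup — the environment-variable name is rendered as described above, and the helper function is illustrative, not Meta's actual implementation:

```python
# Sketch: keep GPUs busy with throwaway matmuls while gradient all-reduces are
# in flight, so facility power draw stays flat instead of spiking between the
# compute and communication phases of each training step.
import os
import torch
import torch.distributed as dist

SMOOTH_POWER = os.environ.get("PYTORCH_NO_POWERPLANT_BLOWUP", "0") == "1"

def sync_gradients(model: torch.nn.Module, device: torch.device) -> None:
    # Launch all-reduces asynchronously so dummy work can overlap with them.
    # (Averaging across ranks is omitted here for brevity.)
    handles = [
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True)
        for p in model.parameters()
        if p.grad is not None
    ]
    if SMOOTH_POWER:
        filler = torch.randn(4096, 4096, device=device)
        # Burn flops until communication finishes; the results are discarded.
        while not all(h.is_completed() for h in handles):
            _ = filler @ filler
    for h in handles:
        h.wait()
```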
I mean, it's insane. People should just go Google what X watts does at every scale — go from one watt to a kilowatt to a megawatt, stare at that list, see how high up a gigawatt is, and it's mind-blowing. Can you say something about the cooling? I know Elon's using liquid cooling, I believe in all cases; that's a new thing — most of them don't use liquid cooling. Is there something interesting to say about the cooling? Yeah.
Air cooling has been the de facto standard: throw a bunch of metal — heat sinks, heat pipes, etc. — and fans at it, and that's been enough. People have been dabbling in water cooling; Google's TPUs have been water-cooled for a few years. But with GPUs, no one's ever done water cooling at the scale Elon just did. Now, for the next generation of Nvidia's highest-end GPUs, water cooling is mandated — you have to water-cool them — but Elon did it on the current generation, and that required a lot of stuff. If you look at some of the satellite photos of the Memphis facility, there are all these external water chillers sitting in what look like shipping containers — he has about 90 of those water chillers sitting outside, 90 different containers that chill the water, bring it back to the data center, distribute it to all the chips, pull the heat out, and send it back. This is both a way to cool the chips and an efficiency thing. And going back to that three-vector thing — memory bandwidth, flops, and interconnect — the closer the chips are together, the easier it is to do high-speed interconnect, and that's another reason to go to water cooling: you can pack the chips right next to each other and get higher-speed connectivity. I've got to ask you: in one of your recent posts there's a section called "cluster measuring contest" — there's another word there, but I won't say it. Who's got the biggest now, and who's going to have the biggest? Today, the individual largest is Elon: Elon's cluster in Memphis, 200,000 GPUs. Meta has around 128,000, OpenAI has 100,000.
Now, to be clear, other companies have more GPUs than Elon; they just don't have them in one place. For training, you want them tightly connected. There are techniques people are researching and working on that let you train across multiple regions, but for the most part you want them all in one area, so you can connect them with high-speed networking. So Elon today has 200,000 GPUs — 100,000 H100s and 100,000 H200s. Meta, OpenAI, and Amazon all have on the order of 100,000, a bit less. But this year, people are building much more. Anthropic and Amazon are building a cluster of 400,000 Trainium 2 — Amazon's own chip, an attempt to get away from Nvidia. Meta and OpenAI have plans for hundreds of thousands, and by next year you'll have 500,000-to-700,000-GPU clusters. And note those GPUs are much higher power consumption than existing ones: Hopper is 700 watts, and Blackwell goes to 1,200 watts. So the power per chip is growing, and the number of chips is growing. Elon said he'll get to a million — you think that's actually feasible? I mean, I don't doubt Elon. The filings he has for the power plant and the Tesla battery packs — the permits and stuff are public record — make it clear he has some crazy plans for Memphis, but it's not quite clear what the timescales are.
I'd just never bet against him — he's going to surprise us. So what's the idea with these clusters? If you have a million GPUs, what percentage, in, say, two or three years, is used for training — pre-training — and what percent for actual inference? These mega clusters make no sense for inference. You could route inference there and just not train, but most of the inference capacity goes elsewhere: hey, I've got a 30-megawatt data center here, 50 megawatts there, 100 over here — I'll just throw inference into all of those. The mega clusters, the multi-gigawatt data centers, are where I want to train, because that's where all my GPUs are collocated and I can connect them together at super high networking speed — that's what you need for training. Now, pre-training is the old scale: you'd increase parameters, you'd increase data, and the model gets better. That doesn't apply the same way anymore, because there isn't much more data on the pre-training side. Yes, there's video, audio, and image data that has not been fully taken advantage of, so there's a lot more scaling possible — a lot of people take transcripts of YouTube videos, and that gets you a lot of data, but it doesn't get you all the learning value out of the video and image data itself. So there's still scaling to be done on pre-training.
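For reference, the standard rule of thumb behind that old "increase parameters, increase data" scale — a common approximation in the field, not something stated in the conversation — is that pre-training compute is about 6 × parameters × tokens:

```python
# Common approximation: pre-training FLOPs ≈ 6 * N (parameters) * D (tokens).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Illustrative example: a 70B-parameter model trained on 15T tokens.
print(f"{train_flops(70e9, 15e12):.1e} FLOPs")  # ≈ 6.3e+24
```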
But this post-training world is where all the flops are going to be spent: the model is going to play with itself, it's going to self-play, do verifiable tasks, do computer use in sandboxes, maybe even simulated robotics. All of these are environments where compute gets spent in quote-unquote post-training. I think we're going to drop the "post" from post-training at some point — it's going to be pre-training, and then just training — because for the bulk of the last few years pre-training has dwarfed post-training, but with these verifiable methods — especially ones that scale potentially infinitely, like computer use and robotics, not just math and coding, where you can verify what's happening — it seems you can spend as much compute as you want on them. Especially with the context-length increase: the end of pre-training is when you extend the context length for these models, and we talked earlier in the conversation about how a long input is much easier to manage than a long output. A lot of these post-training and reasoning techniques rely on a ton of sampling, and it's becoming increasingly long-context, so effectively your compute efficiency goes down. FLOPS is the standard for how you measure it, but with RL — where you have to move your weights around in a different way than in pre-training and straight generation — it's going to become less efficient, and FLOPS will be a less useful metric. Then, as the infrastructure gets better, it will probably go back to FLOPS.
So all of this is most likely going to run on Nvidia, right? Are there any competitors? Google. I kind of ignored them. What's the story with TPU? The TPU is awesome — it's great. Google is just a bit more tepid on building data centers for some reason. They're building big data centers, don't get me wrong — they actually have the biggest cluster, period; I was talking about Nvidia clusters before, but Google has the biggest cluster. The way they do it is very interesting, though. They have two data center super-regions, in which the cluster isn't physically all on one site — the TPUs, not GPUs, are spread across sites roughly 30 miles from each other. In Iowa and Nebraska they have four data centers that are just right next to each other. Why doesn't Google flex its cluster size? They go to multi-data-center training — there are good images of this in the Semianalysis multi-datacenter piece, so I'll show you what I mean. This is an image of what a standard Google data center looks like — and by the way, their data centers look very different from anyone else's. What are we looking at here? If you look at the center of this image, there are these big rectangular boxes — those are where the actual chips are kept. If you scroll down a little further, you can see water pipes, chiller cooling towers at the top, and a bunch of diesel generators — the diesel generators are backup power. The data center itself looks physically smaller than the water chillers: the chips are actually the easy part to keep together; cooling all the water for the water cooling is the hard part. So Google has a very advanced infrastructure for the TPU that no one else has, and they've stamped a bunch of these data centers out in a few regions. If you go a little further down — that one is Microsoft's, in Arizona, where GPT-5, quote-unquote, will be trained, if it doesn't exist already. But each of these Google data centers — I've shown a couple of images of them — is really closely collocated in the same region: Nebraska, Iowa, and they also have a similar complex in Ohio. And they've connected these data centers together with super-high-bandwidth fiber. So the point here is that Google has very advanced infrastructure, very tightly connected in a small region. Elon will always have the biggest fully-connected cluster, because it's all in one building — and he's completely right about that. Google has the biggest cluster, and by a significant margin, but you have to go across multiple sites — spread over three or more. Why doesn't Google compete with Nvidia? Why don't they sell TPUs?
I think there are a couple of problems with that. One: TPU has been a way of making search really freaking cheap and building models for it. A big chunk of Google's TPU purchases and usage — all of it, basically — is for internal workloads: search, now Gemini, YouTube, ads. That's where all their TPUs are being spent, and that's what they're hyper-focused on. So certain aspects of the architecture are optimized for their use case and not optimized elsewhere. One simple example: they open-sourced the Gemma model and called it Gemma 7B, but it's actually more like 8 billion parameters, because the vocabulary is so large. And the reason they made the vocabulary so large is that the TPU's matrix multiply unit is massive — that's what they've optimized for — so they decided to make the vocabulary large too, even though it makes no sense to do so on such a small model, because that fits their hardware. So Gemma doesn't run as efficiently on a GPU as a Llama does — and vice versa, Llama doesn't run as efficiently on a TPU as Gemma does. There are these aspects of hardware-software co-design everywhere: all their search models, their ranking and recommendation models, all these models that are AI but not gen-AI, have been hyper-optimized for TPUs forever.
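To put rough numbers on the Gemma point — using Gemma's published vocabulary size of about 256K and a hidden size of 3,072, so treat the exact figures as approximate:

```python
# Why a huge vocabulary inflates the parameter count of a small model.
vocab_size = 256_000  # Gemma's tokenizer vocabulary (approximate)
d_model = 3_072       # Gemma 7B hidden size
embedding_params = vocab_size * d_model
print(f"{embedding_params / 1e9:.2f}B parameters in the embedding table alone")  # ~0.79B
# That's roughly a tenth of the model's ~8.5B total parameters spent on vocabulary,
# which is why "Gemma 7B" comes out closer to 8B+ in practice.
```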
The software stack is super optimized too, but none of that software stack has been released publicly — only very small portions, JAX and XLA, have been. The experience inside Google, when you're training on TPUs as a researcher, is that you don't need to know anything about the hardware in many cases. It's pretty beautiful. But as soon as you step outside — and a lot of them do leave Google, and then end up going back — they leave and start a company because they have all these amazing research ideas, and then they hit: wait, infrastructure is hard, software is hard. And that's on GPUs; if they try to use TPUs, it's the same thing, because they don't have access to all that internal code. So how do you convince a company whose golden goose is search, where they're making hundreds of billions of dollars, to start selling TPUs — something they used to buy only a couple billion dollars' worth of? I think in 2023 they bought a couple billion, and now they're buying like $10 billion to $15 billion worth. But how do you convince them they should buy twice as many, figure out how to sell them, and make $30 billion? Who cares about making $30 billion? Won't that $30 billion eventually exceed the search profit? Oh, you're always going to make more money on services. To be clear, today people are spending a lot more on hardware than on services, because the hardware front-runs the service spend.
But you're investing — and if there's no revenue for AI stuff, or not enough revenue, then obviously it blows up; people won't continue to spend on GPUs forever. Nvidia is trying to move up the stack with software they're trying to sell and license. But Google has never had that DNA of "this is a product we should sell." Google Cloud does, but it's a separate organization from the TPU team, which is a separate organization from the DeepMind team, which is a separate organization from the search team — there's a lot of bureaucracy. Wait, Google Cloud is a separate team from the TPU team? Technically, TPU sits under infrastructure, which sits under Google Cloud. But Google Cloud, which is about renting stuff out, and TPU architecture have very different goals, in hardware and in software. The JAX and XLA teams do not serve Google's external customers, whereas Nvidia's various CUDA teams — for things like NCCL — serve external customers. The internal teams like JAX and XLA more so serve DeepMind and search; their customer is different, so they're not building a product for the outside. Do you understand why AWS keeps winning versus Azure versus Google Cloud? Google Cloud is tiny, isn't it, relatively? Google Cloud is third. Microsoft is the second biggest, but Amazon is the biggest. And Microsoft deceptively includes things like Microsoft Office 365 and some of these enterprise-wide licenses in its cloud numbers,
so in reality the gulf is even larger. Microsoft is still second, though — Amazon is way bigger. Why? Because using AWS is better and easier, and in many cases it's cheaper. And it was first. Yeah, but a lot of things that were first didn't win — it's that it's easier, and it's harder to switch away than to stay; there are big fees for switching, too. AWS generates over 80% of Amazon's profit — I think over 90%. That's insane. The distribution centers are run like, one day we'll decide to make money from this, but they haven't yet — they make a tiny little profit. One day Amazon Prime will triple in price. You would think they'd improve the AWS interface, because it's horrible, clunky — but everybody deals with it. One would think. I actually think Google's interface is sometimes nice, but they also don't care about anyone besides their top customers, and their customer service sucks. All these companies optimize for the big customers — it's meant for business. But Amazon has always optimized for the small customer too: obviously they optimize a lot for the big customers, but when they started, they would just go to random Bay Area events and give out credits, or you'd just put in your credit card and use it, back in the early days. So the business has grown with its customers. That's why Snowflake is all over Amazon — because Snowflake, in the beginning, when Amazon didn't care about them, was still using Amazon. And of course, one day, Snowflake and Amazon ended up with a super huge partnership. That's the case broadly: Amazon's user experience and quality are better. Also, a lot of the silicon they've engineered gives them a lower cost structure in traditional cloud — storage, CPU, networking — and in databases. I think four of Amazon's top five gross-profit products are database-related products, like Redshift and all these things.
So Amazon has a very good silicon-to-user-experience pipeline with AWS. Google's silicon teams are awesome internally — the TPU, the YouTube chip, some of the other chips they've made — but the problem is they're not serving external customers with them; they're serving internal customers. Nvidia's entire culture, by contrast, is designed from the bottom up to do this. There's a recent book, The Nvidia Way by Tae Kim, that details how they look for future opportunities and ready their CUDA software libraries so that new applications of high-performance computing can very rapidly be built on CUDA and Nvidia chips. That's entirely different from Google as a services business. Nvidia, it should be said, is a truly special company — the culture, everything — they're really optimized for exactly that. Speaking of which, is there anybody who can even challenge Nvidia hardware-wise? Intel? AMD? I really don't think so. We went through a very long process of working with AMD on training on their GPUs, inference, and so on.
They're decent — their hardware is better in many ways than Nvidia's. The problem is their software is really bad, and while I think they're getting better, getting better faster, the gulf is so large, and they don't spend enough resources on it — or haven't historically. Maybe they're changing their tune now, but for multiple months, we at Semianalysis were the ones submitting the most bugs. Why are we submitting the most bugs? Because they only cared about their biggest customers, so they'd ship them a private image, blah blah blah — and it's like, okay, but I'm just using PyTorch and I want to use the publicly available libraries, and you don't care about that. So they're getting better, but I think AMD is not a real threat. Intel is obviously in dire straits right now and needs to be saved somehow — very important for national security, for American technology. Can you explain the "obviously"? Why are they in dire straits?
Going back to earlier: only three firms can do leading-edge R&D — TSMC in Hsinchu, Samsung in Pyeongtaek, and Intel in Hillsboro. Samsung's doing horribly; Intel's doing horribly. We could be in a world where there's only one company that can do R&D, and that one company already manufactures most of the chips — they've been gaining market share anyway. That's the critical thing: the rest of the world's semiconductor industry, and therefore tech, relies on Taiwan, so what happens to Taiwan matters — and that's obviously precarious. As for Intel, they've been slowly, steadily declining. They were on top in servers and PCs, but now Apple's done the M1, Nvidia's releasing a PC chip, Qualcomm's releasing a PC chip, and in servers the hyperscalers are all making their own Arm-based server chips. And Intel has no AI silicon wins — only very small ones. They never got into mobile, because they said no to the iPhone. All these things have compounded, and they've lost their process technology leadership — they were ahead for 20 years, and now they're behind by at least a couple of years. They're trying to catch back up, and we'll see if their 18A/14A strategy works out, where they try to leapfrog TSMC. But Intel is losing tons of money anyway, and they just fired their CEO — even though the CEO was the only person who understood the company well. We'll see; he was not the best, but he was pretty good — a relatively technical guy. Where does Intel make most of its money? CPUs, still — PCs and data center CPUs. But data center CPU workloads are all going to the cloud, and Amazon, Microsoft, and Google are making Arm-based CPUs. On the PC side, AMD has gained market share, Nvidia's launching a PC chip — whether it succeeds is another question — MediaTek and Qualcomm have launched chips, and Apple's doing well. So Intel could get squeezed a little in PC, although PCs generally, I imagine, will mostly stick with Intel on the Windows side. Let's talk about the broad AI race. Who do you think wins? We talked about Google — the default leader has been Google, because of their infrastructure advantage. Well, in the news, OpenAI is the leader.
They're leading — they have the best model that people can use, and they have the most AI revenue. Yeah, OpenAI is winning. So who's making money on AI right now? Is anyone making money? Accounting-profit-wise, Microsoft is making money, but they're spending a lot on capex, and that gets depreciated over years. Meta is making tons of money, but with recommendation systems, which are AI — just not Llama; Llama's losing money for sure. I think Anthropic and OpenAI are obviously not making money overall, because otherwise they wouldn't be raising money; they have to raise money to build more. Although, theoretically, they are making money on the models themselves: you spent a few hundred million dollars on GPT-4 and it's doing billions in revenue, so on its own it's obviously making money. But they had to keep funding research to get the compute-efficiency wins and move down the curve, toward that 1,200x cost reduction that has been achieved for GPT-3. For GPT-4, maybe we're only at a couple hundred X now, with GPT-4 Turbo and 4o — and there will probably be another one, cheaper than GPT-4o, that comes out at some point. And that research costs a lot of money. Yep, exactly. That's the thing that doesn't get talked about with cost: when you refer to the cost of a model, it's not just the training and test runs — it's the actual research, the manpower to do things like reasoning. Now that that exists, they're going to scale it.
They're going to do a lot of research still. People focus on the payback question, but it's really easy to justify: GDP is humans plus industrial capital, and if you can make intelligence cheap, then you can grow a lot. That's the dumbed-down way to explain it, but that's basically the investment thesis. I think only Nvidia, and the other hardware vendors, are actually making tons of money. The hyperscalers are all, on paper, making money, but in reality they're spending a lot more on purchasing the GPUs — and you don't know if each GPU will still make that much money in two years. You don't know if, all of a sudden, OpenAI goes kaput, and now Microsoft has hundreds of thousands of GPUs they were renting to OpenAI — GPUs they effectively paid for themselves through their investment — that no longer have a customer. That's always a possibility. I don't believe it will happen; I think OpenAI will keep raising money, and others will keep raising money, because the returns, once we have AGI, are going to be huge. So do you think multiple companies will get there? I don't think it's winner-take-all. Okay — and let's not call it AGI, as if it arrives in a single day; it's a gradual thing, superpowerful AI as a gradually — rapidly — increasing
set of features that are useful. So you're saying a lot of companies will benefit — it just seems absurd that all of these companies are building gigantic data centers. There are companies that will benefit from AI but not because they train the best model. Meta has so many avenues to benefit from AI across all of their services: people spend time on Meta platforms, and it's a way to make more money per user per hour. Yeah — it seems like Google, xAI — and Tesla, important to say — and Meta will benefit not directly from the LLMs, but from the intelligence: the additional boost of intelligence applied to the products they already sell. Whether that's the recommendation system, or, for Elon, who's been talking about Optimus, the robot — potentially the intelligence of the robot, and then you have personalized robots in the home, that kind of thing. He thinks it's a 10-plus-trillion-dollar business — at some point, maybe; not soon, but who knows. Let's do a TAM analysis: 8 billion humans, so let's build 8 billion robots and pay them the average salary — and yeah, there we go, 10 trillion, more than 10 trillion. I mean, if there are robots everywhere, why does it have to be just 8 billion robots? Yeah, of course — I'm going to have one robot, you're going to have twenty. I can see the use case for that. So I guess the benefit would be in the products these companies sell, which is why OpenAI is in a trickier position, because
all of the value of OpenAI right now, as a brand, is in ChatGPT, and for most users there's not that much of a reason they need OpenAI to be spending billions and billions of dollars on the next best model when they could just license Llama 5 and be way cheaper. So ChatGPT is an extremely valuable entity to them, but the chat application clearly does not have tons of room to keep growing — the standard chat use case, where you're just using it for random questions, and the cost continues to collapse; DeepSeek V3 is the latest, biggest example. And it's going to get supported by ads: Meta already serves Llama 405B and probably loses money on it, but at some point the models are going to get so cheap that they can just be served for free, ad-supported. That's what Google's going to be able to do, and they've obviously got a bigger reach. So chat is not going to be the only use case. It's these reasoning, code, agents, computer-use things where OpenAI has to actually go to make money in the future; otherwise they're kaput. But X, Google, and Meta have these other products. So isn't it likely that OpenAI and Anthropic disappear eventually, unless they're so good at models? They are — but it's such a cutting-edge race. I mean, it depends on where you think AI capabilities are going. You have to keep winning. Yes, you have to keep winning as you climb.
Even if capabilities keep going super rapidly in an awesome direction, there's still a boost for X in terms of data, Google in terms of data, Meta in terms of data and other products — and there's just a huge amount of money there. But the whole idea is that human data is kind of tapped out. We don't care — we all care about self-play, verifiable domains. Still: AWS does not make a lot of money on each individual machine, and the same can be said for the most powerful AI platform — even though the calls to the API are so cheap, there's still a lot of money to be made by owning that platform. There are a lot of discussions that tokens — tokenomics, LLM APIs — are the next compute layer, the next paradigm for the economy, kind of like energy and oil. But you have to believe that APIs and chat are not where AI is stuck — that it's actually tasks, agents, robotics, and computer use, and those are the areas where all the value will be delivered, not the API, not the chat application. Is it possible it all just becomes a commodity, and you have the very thin wrapper, like Perplexity? Just joking. There are a lot of wrappers making a lot of money. But do you think it's possible that people would even forget what OpenAI and Anthropic are, because there'll be wrappers around the API that route dynamically? If model progress is not rapid, yeah,
it becomes a commodity. DeepSeek V3 shows this, and the GPT-3 chart from earlier showed it too: a Llama 3B is 1,200x cheaper than GPT-3. Anyone whose business model was GPT-3-level capabilities is dead; anyone whose business model is GPT-4-level capabilities is dead. It's a common saying that the best businesses being made now are the ones predicated on models getting better — which would be the wrappers, riding the wave of the models.
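The arithmetic behind a number like that 1,200x is simple — GPT-3's launch price was around $0.06 per 1K tokens; the serving price for a modern 3B-class model below is an assumption, used only to show how the ratio falls out:

```python
# Illustrative cost-collapse arithmetic for GPT-3-level capability.
gpt3_usd_per_million_tokens = 60.00        # ~$0.06 per 1K tokens at GPT-3's launch
small_model_usd_per_million_tokens = 0.05  # assumed price for a cheap 3B-class model today
ratio = gpt3_usd_per_million_tokens / small_model_usd_per_million_tokens
print(f"{ratio:.0f}x cheaper")  # 1200x
```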
In the short term, the company that could make the most money is the one that figures out what advertising-targeting method works for language model generations. We have Meta's ads, which are hyper-targeted in-feed rather than within specific pieces of content, and we have the search ads used by Google — and Amazon has risen a lot in search ads. But within a response from ChatGPT, it is not clear how you get a high-quality placed ad. If you can do that, with model costs coming down, you could get super high revenue per user — that revenue is totally untapped, and it's not clear technically how it's done. Yeah — the AdSense-style innovation that Google did: one day you'll have an ad in GPT output, and that's going to make billions. And it could be very subtle. It could be in conversation — we have voice mode now — it could be some way the voice introduces certain things. It's much harder to measure, and it takes imagination. But it shouldn't come off shady, or you'd get public blowback — you have to do it loudly enough that it's clearly an ad, and balance all of that. So that's the open question they're trying to solve. Anthropic and OpenAI? They might say they don't care about that at all right now. I think places like Perplexity are experimenting on that more. Oh, interesting. Yeah, for sure — Perplexity, Google, Meta care about this. I think OpenAI and Anthropic are purely laser-focused on AGI. Yeah, agents and AGI:
if I build AGI, I can make tons of money, or I can pay for everything. And this is predicated — back on the export-control thing — on how far away you think AGI is. If you think AGI is 5-10 years away or less — these labs think it's two to three years away — and you assume they're rational actors, which they mostly are, then what you do under a two-year AGI timeline versus a five-year versus a ten-year is very, very different. Do you think agents are promising? We have to talk about this — this is the excitement of the year. It's the generic hype term a lot of business folks are using: AI agents are going to revolutionize everything. Okay, so mostly the term agent is obviously overblown. We've talked a lot about reinforcement learning as a way to train for verifiable outcomes. An agent should mean something that is open-ended, is solving a task independently on its own, and is able to adapt to uncertainty. The term "agent" gets applied to a lot of things, like Apple Intelligence,
which we still don't have after the last WWDC — orchestrating between apps. That type of tool-use thing is something language models can do really well. Apple Intelligence, I suspect, will come eventually; it's a closed domain — your messages app integrating with your photos, with AI in the background. That will work, and it has been described as an agent by a lot of software companies trying to get into the narrative. The real question is: in what ways can we get language models to generalize to new domains and solve their own problems in real time — maybe with some tiny amount of training while they're doing it, fine-tuning themselves, or with in-context learning, which is the idea of storing information in the prompt, which you can use learning algorithms to update. And whether or not you believe that will actually generalize to something like me saying, "Book my trip to go to Austin in two days, I have XYZ constraints," and actually trusting it — I think there's a human-computer interaction problem there, of coming back for information.
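For readers outside the field, a minimal illustration of what in-context learning means in practice — the "training data" lives entirely in the prompt, and the model call here is a hypothetical placeholder, not any specific vendor's API:

```python
# In-context learning: task examples are stored in the prompt itself;
# the model's weights never change.
few_shot_prompt = """Classify the sentiment of each review.

Review: "The battery died in two hours." -> negative
Review: "Setup took thirty seconds, flawless." -> positive
Review: "Arrived late but works fine." ->"""

# response = llm.complete(few_shot_prompt)  # hypothetical client call
# The model infers both the task and the label format purely from the examples.
```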
Well, what's your prediction there? Because my gut says we're very far away from that. I think OpenAI's framing is relevant — I don't know if you've seen the five levels, where chat is level one, reasoning is level two, and agents is level three, and there are a couple more levels after that. It's important to note that we were in chat for a couple of years, and we just theoretically got to reasoning; that will be here for a year or two, and then agents. At the same time, people can try to approximate the capabilities of the next level, but agents are doing things autonomously,
doing things for minutes at a time, hours at a time, whereas reasoning is doing things for tens of seconds at a time and then coming back with an output that I still need to verify and check. And the biggest problem is, of course, the same as in manufacturing — the whole Six Sigma thing: how many nines of reliability do you get, and then you compound the nines onto each other, multiplying the per-step success rate by the number of steps. In semiconductor manufacturing, with tens of thousands of steps, even five nines — 99.999% per step — is not enough, because when you compound it that many times you end up with something like 60% yield. Really low yield, or zero. And it's the same thing with agents: you're chaining tasks together, and even the best LLMs on particularly good benchmarks don't get 100% — they get a little below that, because there's a lot of noise. So how do you get to enough nines? This is the same thing as self-driving. We can't have broad self-driving because — without it being super geofenced like Waymo, and even then they have a bunch of teleoperators to make sure it doesn't get stuck — it doesn't have enough nines. And self-driving has quite a lot of structure, because roads have rules, it's well defined, and there's regulation. When you're talking about computer use for the open web, or an open operating system, there's no such structure — it's a mess.
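The compounding-nines point is easy to make concrete — the step counts and per-step rates below are illustrative, chosen only to show how quickly reliability decays:

```python
# How per-step reliability compounds across a chain of steps.
def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

# Semiconductor-style: five nines per step across 50,000 process steps.
print(f"{chain_success(0.99999, 50_000):.1%}")  # ~60.6% yield
# Agent-style: a model that is right 95% of the time, chained over 20 steps.
print(f"{chain_success(0.95, 20):.1%}")         # ~35.8%
```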
So I'm always skeptical of any system that is tasked with interacting with the human world, the open, messy thing. If we can't get intelligence that's enough to solve the human world on its own, we can create infrastructure — like the human teleoperators for Waymo — over many years, that enables certain workflows. There's a company — I don't remember the name — but that's literally their pitch: we're going to be the human operator when agents fail; you just make an API call and we fix it. It's hilarious. There are going to be teleoperation markets when we get humanoid robots: somebody around the world who's happy to fix the fact that mine can't finish loading my dishwasher when I'm unhappy with it. That's just going to be part of the Tesla service package. I'm just imagining an AI agent talking to another AI agent — one company has an AI agent that specializes in helping other AI agents. But if you can make things that are good at one step, you can stack them together — so that's why I say, if it takes a long time, we're going to build the infrastructure that enables it. You see the Operator launch: they have partnerships with certain websites — with DoorDash, with OpenTable, things like this. Those partnerships are going to let them climb really fast, their model's going to get really good at those things, and it's a proof of concept that might become a network effect where more companies want to make it easier for AI. Some companies will say, no, let's put blockers in place.
Y and this is a story of the internet we've seen we see it now with training data for language models where companies are like no you have to pay like business working it out that said I think like Airlines have a very and hotels have high incentive to make their site work really well and they usually don't like if you look at how many clicks it takes to order airplane ticket it's insane I don't you actually can't call an American Airlines agent anymore they they don't have a phone number it's I mean it's it's it's
horrible on the interface front. And then to imagine that agents will be able to deal with that website, when I as a human struggle; I have an existential crisis every time I try to book an airplane ticket. I think it's going to be extremely difficult to build an AI agent that's robust to that. But think about it: United accepted the Starlink terms, which is, they have to provide Starlink for free, and the users are going to love it. What if one airline is like, we're going to take a year and we're going to make our website have white text that works perfectly for the AIs, so that every time anyone asks an AI about a flight, they buy whatever airline it is? Or they just say: here's an API, and it's only exposed to AI agents, and if anyone queries it, the price is 10% higher for any flight, but we'll let you see any of our flights and you can just book any of them. Here you go, agent. And it's like: and I made the price 10% higher. Awesome.
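A minimal sketch of that hypothetical agent-only fare endpoint; the route, header, fields, fares, and surcharge below are all made-up illustrative assumptions, not any real airline's API:

```python
# Hypothetical agent-facing fare API, per the scenario above.
from flask import Flask, jsonify, request

app = Flask(__name__)

FARES = {("SFO", "AUS"): 250.00}  # made-up base fares in USD

AGENT_SURCHARGE = 1.10  # the hypothetical 10% markup for agent bookings

@app.route("/agent/v1/fares")
def agent_fares():
    # Gate on a header that only registered AI agents would send.
    if request.headers.get("X-Agent-Key") is None:
        return jsonify({"error": "agents only"}), 403
    origin = request.args.get("origin", "SFO")
    dest = request.args.get("dest", "AUS")
    base = FARES.get((origin, dest))
    if base is None:
        return jsonify({"error": "no such route"}), 404
    # Full inventory is visible, but every fare carries the surcharge.
    return jsonify({"origin": origin, "dest": dest,
                    "price": round(base * AGENT_SURCHARGE, 2)})

if __name__ == "__main__":
    app.run()
```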
And am I willing to pay that for, hey, book me a flight to CX? It's like, yeah, whatever. I think computers and the real world and the open web are really, really messy, but if you start defining the problem in narrow regions, people are going to be able to create very, very productive things and ratchet down cost massively. Now, crazy things like robotics in the home, those are going to be a lot harder to do, just like self-driving, because there are just a billion different failure modes. But agents that can navigate a certain set of websites and do certain sets of tasks, or take a photo of your fridge, or upload your recipes, and then it figures out what to order from Amazon/Whole Foods food delivery: that's going to be pretty quick and easy to do. I think there's going to be a whole range of business outcomes,
and there's going to be tons of optimism that people can just figure out ways to make money. To be clear, these sandboxes already exist in research. There are people who have built clones of all the most popular websites, of Google, Amazon, blah blah blah, to make it so you can train these things, and I mean, OpenAI probably has them internally. It's the same as DeepMind's robotics team, which for years has had clusters for robotics where you interact with robots fully remotely: they just have a lab in London and you send tasks to it, it arranges the blocks, and you do this research. Obviously there are techs there that fix stuff, but we've turned these cranks of automation before: you go from sandbox to progress, and then you add one more domain at a time and generalize. I think in the history of NLP and language processing, with instruction tuning and tasks per language model, it used to be that one language model did one task, and then in the instruction-tuning literature there's this point where you start adding more and more tasks together and it just starts generalizing to every task. And we don't know where on this curve we are. I think for reasoning, with this RL and verifiable domains, we're very early, but we don't know where the point is where you just start training on enough domains and, poof, more domains start working and you've crossed the generalization barrier. Well, what do you think about the programming context? Software engineering: that's where I personally, and I know a lot of people, interact with AI the most. There's a lot of fear and angst too, from current CS students, but
that's also the area where probably the most AI revenue and productivity gains have come, whether it be Copilot or Cursor or what have you, or just standard ChatGPT. I know very few programmers who don't have ChatGPT, and actually many of them have the $200 tier, because it's that good. I think in that world we already see it with SWE-bench, if you've looked at the benchmark, made by some Stanford students. I wouldn't say it's really hard, but I wouldn't say it's easy either. I think it takes someone who's been through at least a few years of CS, or a couple years of programming, to do SWE-bench well, and the models went from 4% to 60% in like a year. And where are they going to go next year? It's going to be higher. It probably won't be 100%, because, again, that nines thing is really hard to do, but we're going to get to some point where that's close, and then we're going to need harder software engineering benchmarks, and so on and so forth. The way people think of it now is: it can do code completion, easy. It can do some function generation and I have to review it, great. But really, software engineering agents, I think, can be done sooner than any other agent, because it's a verifiable domain: you can always unit test or compile.
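That verifiability is the key property. A sketch of what the automatic check can look like for code, assuming a repo with a pytest suite; the reward wrapper is an illustrative framing, not any lab's actual pipeline:

```python
# The "verifier" for software agents: code either compiles and passes
# its tests, or it doesn't: an automatic, objective signal.
import subprocess

def verify(repo_dir: str) -> bool:
    """Run the test suite; return code 0 means the patch is verified."""
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                            capture_output=True, text=True)
    return result.returncode == 0

def reward(repo_dir: str) -> float:
    # Binary reward an agent (or an RL trainer) can optimize against.
    return 1.0 if verify(repo_dir) else 0.0
```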
And there are many advantages, like it can inspect the whole code base at once, which no engineer really can. Only the architects can really think about this stuff, the really senior people, and they can define things and then the agent can execute on it. So I think software engineering costs are going to plummet like crazy. And one interesting aspect of that is, when software engineering costs are really low, you get very different markets. In the US, you have all these platform SaaS companies, Salesforce and so on and so forth. In China, no one uses platform SaaS; everyone just builds their own stack, because software engineering is much cheaper in China, partially because of the number of STEM graduates, etc.; it's generally just cheaper to do. At the same time, code LLMs have been adopted much less in China, because the cost of an engineer there is much lower. But what happens when every company can just invent their own business logic really cheaply and quickly? You stop using platform SaaS; you start building custom-tailored solutions, and you change them really quickly. Now, all of a sudden, your business is a little bit more efficient too, potentially, because you're not dealing with the hell that is some random platform SaaS company's stuff not working perfectly and having to adjust workflows, or random business automation cases that aren't necessarily AI-required; it's just logic that needs to be built that no one has built. All of these things can happen fast with software, I think. And then the other domain is: industrial, chemical, mechanical engineers suck at coding, generally. And their tools, like semiconductor engineers' tools, are 20 years
old. All the tools run on XP; even ASML lithography tools run on Windows XP. And a lot of the analysis happens in Excel. It's like, guys, you can move 20 years forward with all the data you have gathered and do a lot better; you just need the engineering skills of software engineering to be delivered to the actual domain experts. So I think that's the area where I'm super duper bullish on AI generally creating value. The big picture is that I don't think it's going to be a cliff. I think a really good example of how growth changes is when Meta added Stories: Snapchat was on an exponential, Meta added Stories, it flatlined. Software engineers: up and to the right. AI is going to come in, and it's probably just going to flatten; it's not like everyone's going to lose their job. It's hard, because the supply corrects more slowly: the number of students is still growing, and that'll correct on a multi-year delay, but the number of jobs will just turn, and then maybe in 20 or 40 years it'll be well down. But in the next few years, there's never going to be the Snap moment where it's like, software engineers aren't useful. I think also the nature of what it means to be a programmer, and what kind of jobs programmers do, changes. Because I think there needs to be a human in the loop of everything you've talked about: there's a really important human in that picture, correcting the code, fixing things larger than the context length. Yeah, and
debugging. Debugging by reading the code, understanding it, steering the system: no, no, no, you missed the point, adding more to the prompt. And the human designing the perfect Google button; Google's famous for having people design buttons that are so perfect, and it's like, how is AI going to do that? It could give you all the ideas, fine. I mean, that's the thing; you can call it taste. One thing humans can do is figure out what other humans enjoy, better than AI systems. That's where the preference comes from; you're loading that in, but ultimately humans are the greatest preference generators. And humans are actually very good at judging between two things. This goes back to the core of what RLHF and preference tuning is: it's hard to generate a good answer for a lot of problems, but it's easy to see which one is better. And that's how we're using humans for AI now: judging which one is better.
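A minimal sketch of that "judging is easier than generating" idea as it shows up in preference tuning: the standard Bradley-Terry-style pairwise loss, with made-up toy scores:

```python
# Pairwise preference loss: given a human judgment that answer A beats
# answer B, push the reward model to score A above B. No "correct
# answer" is ever needed; only comparisons.
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    # -log sigmoid(r_chosen - r_rejected), the Bradley-Terry objective.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores from a hypothetical reward model:
print(preference_loss(2.0, 0.5))   # small loss: model agrees with the human
print(preference_loss(0.5, 2.0))   # large loss: model disagrees
```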
And that's what software engineering could look like: the PR review. Here are a few options, here are some potential pros and cons, and humans are going to be the judges. The thing I would very much recommend is that programmers start using AI and embrace the role of supervisor of the AI system, and partner of the AI system, versus writing from scratch, or not learning coding at all and just generating stuff. Because I think there actually has to be a pretty high level of expertise as a programmer to be able to manage increasingly intelligent systems. I think it's that, and then becoming a domain expert in something. Sure. Because, seriously, if you go look at aerospace or semiconductors or chemical engineering, everyone is using really crappy platforms, really old software. The job of a data scientist is like a joke in many cases, and in other cases it's very real. But it's: bring what the forefront of human capability is to your domain, and even if that forefront comes from the AI, in your domain you're at the forefront. So you have to be at the forefront of something, and then leverage the rising tide that is AI for everything else. Oh yeah, there's so much low-hanging fruit everywhere, in terms of where software can help automate a thing or digitize a thing, in the legal system. I mean, that's why DOGE is exciting. I've gotten to hang out with a bunch of the DOGE folks, and, I mean, government is so old school; it's begging for the modernization
of software, of organizing the data, all this kind of stuff. In that case it's by design, because bureaucracy protects centers of power and so on, but software breaks down those barriers, so it hurts those that are holding on to power but ultimately benefits humanity. So there are a bunch of domains of that kind. One thing we didn't fully finish talking about is open source. First of all, congrats: you released a new model. Yeah, this is Tulu. I'll explain what a Tulu is. A Tulu is a hybrid camel: what you get when you breed a dromedary with a Bactrian camel. Back in the early days after ChatGPT, there was a big wave of models coming out, like Alpaca, Vicuna, etc., that were all named after various mammal species, so Tulu, the brand, is multiple years old, and it comes from that. And we've been playing at the frontiers of post-training with open-source code. The first part of this release was in the fall, where we built on Llama's open models, open-weight models, and then we add in our fully open code, our fully open data. There's a popular benchmark, ChatbotArena, and that's generally the metric by which these chat models are evaluated: humans compare random models from different organizations. And if you looked at the leaderboard in November or December, among the top 60 models, from tens to twenties of organizations, none of them had open code or data for just post-training; among those, even fewer, or none, had pre-training data and code available. But post-training is much more accessible at this time; it's still pretty cheap, and you can do it.
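For context, ChatbotArena-style leaderboards typically turn those pairwise human votes into a ranking with an Elo-style (Bradley-Terry) update; a minimal sketch with made-up ratings:

```python
# Elo-style update from one human vote between two anonymous models.
K = 32  # step size; real leaderboards tune this or fit Bradley-Terry

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_winner: float, r_loser: float) -> tuple[float, float]:
    e = expected(r_winner, r_loser)
    return r_winner + K * (1 - e), r_loser - K * (1 - e)

# Toy example: the lower-rated model wins an upset vote.
model_a, model_b = 1000.0, 1100.0
model_a, model_b = update(model_a, model_b)
print(round(model_a), round(model_b))  # a gains ~20 points, b loses ~20
```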
And the thing is, how high can we push this number, where people have access to all the code and data? So that's the motivation of the project. We draw on lessons from Llama; Nvidia had a Nemotron model where the recipe for their post-training was fairly open, with some data and a paper; and it's putting all these together to try to create a recipe that people can use to fine-tune models like GPT-4 to their domain. So to be clear, in the case of Tulu, maybe you can talk about OLMo too, but in the case of Tulu, you're taking Llama 3 405B. Tulu has been a series of recipes for post-training, so we've done multiple models over the years. And you're open-sourcing everything? Yeah. If you start with an open-weight base model, the whole model technically isn't open source, because you don't know what Llama put into it, which is why we have the separate thing that we'll get to. But it's about opening parts of the pipeline where people can zoom in and customize. I hear from startups and businesses that are like, okay, I can take this post-training and try to apply it to my domain. We talk about verifiers a lot; we use this idea of reinforcement learning with verifiable rewards, RLVR, kind of similar to RLHF, and we applied it to math. And the model today: we applied it to the Llama 405B base model from last year, and we have our other stuff, our instruction tuning and preference tuning, but the math thing is interesting. It's easier to improve this math benchmark; there's a benchmark, MATH, in all capitals. Tough name, because the name of the benchmark is the area that you're evaluating. We're researchers, not brand strategists.
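A minimal sketch of the RLVR reward for math: unlike RLHF's learned preference model, the reward is just a programmatic check against a known answer. The boxed-answer parsing below is an illustrative assumption about the answer format:

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the last \\boxed{...} span, a common MATH answer format."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary, programmatically checkable reward; no learned reward model.
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# Toy check:
print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... so the answer is \boxed{41}", "42"))  # 0.0
```

Training then looks like ordinary RL: sample completions from the policy, score them with this reward, and reinforce the ones that score 1.0.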
And this is something that the DeepSeek paper talked about as well: at this bigger model size, it's easier to elicit powerful capabilities with this RL training, and then they distill from the big model down to the small model. With the model we released today, we saw the same thing. At Ai2 we don't have a ton of compute; we can't train 405B models all the time, so we just did a few runs, and they tend to work. It just shows that there's a lot of room for people to play in these things. And they crushed Llama's actual release, right? They're way better than it. Yeah, our eval numbers, and I mean, we have extra months on this, but our eval numbers are much better than the Llama instruct models that they released. And you also said better than DeepSeek V3? Yeah, on our eval benchmark. DeepSeek V3 is really similar. We have a safety benchmark, to understand if a model will say harmful things and things like that, and that's what draws it down mostly. The eval is an amalgamation of multiple benchmarks, or what do you mean? Yeah, we have ten evaluations. This is standard practice in post-training: you choose the evaluations you care about. In academics and smaller labs you'll have fewer evaluations; in companies, you'll have one domain that you really care about; in frontier labs, you'll have tens to twenties, maybe even a hundred evaluations of specific things. So we choose a representative suite of things that look like chat, precise instruction following, which is like 'respond only in emojis,' does the model follow weird things like that, math, code, and you create a suite like this. So safety would be one of ten, and that type of suite reflects what the broader community of AI cares about. For example, in comparison to DeepSeek: our model's average eval score would be about 80 including safety, and similar without it, and DeepSeek would be about a 79 average score without safety, and their safety score would bring theirs down below that.
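A tiny sketch of that suite-averaging arithmetic; the benchmark names and per-benchmark scores below are invented for illustration, and only the with-versus-without-safety comparison mirrors the numbers discussed:

```python
# How a multi-benchmark eval average moves when one category (safety)
# drags a model down. All numbers below are made up for illustration.
scores = {
    "chat": 82, "instruction_following": 78, "math": 81, "code": 80,
    "knowledge": 79, "reasoning": 80, "truthfulness": 81,
    "multilingual": 78, "long_context": 80, "safety": 81,
}

def average(s: dict, include_safety: bool = True) -> float:
    items = {k: v for k, v in s.items() if include_safety or k != "safety"}
    return sum(items.values()) / len(items)

print(f"with safety:    {average(scores):.1f}")
print(f"without safety: {average(scores, include_safety=False):.1f}")
# A model scoring 40 on safety instead of 81 would drop ~4 points on the
# 10-benchmark average while looking identical on the other nine.
```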
So you beat them even ignoring safety? Yeah. This is something that matters to me internally: I don't want to win only by how you shape the eval benchmark. Some people may not care about safety in their model; safety can come downstream, safety can be added when you host the model for an API; safety is addressed at a spectrum of locations in an application. So if you want to say that you have the best recipe, you can't just gate it on things that some people might not want. And this is just the march of progress. We benefit if we can release a model later: we have more time to learn new techniques, like this RL technique. We had started this in the fall, and it's now really popular with reasoning models. The next thing to do for open-source post-training is to scale up verifiers, to scale up data, to replicate some of DeepSeek's results, and it's awesome that we have a paper to draw on; it makes it a lot easier. That's the type of thing that is going on between academic and closed frontier research in AI. Since you're pushing open source, what do you think is the future of it? Do you think DeepSeek actually changes things, since it's open source, or open weight, pushing the movement in the open direction? This goes back to the license discussion. DeepSeek R1 with a friendly license is a major reset: it's the first time that we've had a really clear frontier model that is open weights, with a commercially friendly license, with
no restrictions on downstream use cases: synthetic data, distillation, whatever. This has never been the case at all in the history of AI in the last few years since ChatGPT; there have been models that are off the frontier, or models with weird licenses that you can't really use. Isn't Meta's license pretty much permissive, except for five companies? So this goes to what open-source AI is. There are also use-case restrictions in the Llama license, which say you can't use it for specific things, so if you come from an open-source software background, you would say that that is not an open-source license. What kind of things are those, though? At this point I can't pull them all off the top of my head. It used to be that military use was one, and they removed that for Scale. It'll be things like CSAM, child sexual abuse material; that's the type of thing that is forbidden there. But that's enough, from an open-source background, to say it's not an open-source license. And also, the Llama license has this horrible thing where you have to name your model 'Llama' if you touch the Llama model. It's the branding thing: if a company uses Llama, technically the license says that they should say 'Built with Llama' at the bottom of their application. And from a marketing perspective, that just hurts. I could suck it up as a researcher; I'm like, oh, it's fine, it says 'Llama-' on all of our materials for this release. But this is why we need truly open models: we don't know DeepSeek R1's data. Wait, so you're saying I can't make a cheap copy of Llama and pretend it's mine, but I can do this with the Chinese model? Yeah, hell yeah, that's what I'm saying. And that's why this whole open-language-models thing, the OLMo thing, is to try to keep a model where everything is open, with the data, as close to the frontier as possible. We're compute constrained, we're personnel constrained; we rely on getting insights from people. Like, John Schulman tells us to do
RL on outputs, and we can make these big jumps, but it just takes a long time to push the frontier of open source. And fundamentally, I would say that's because open-source AI does not have the same feedback loops as open-source software. We talked about open-source software for security; it's also just that you build something once and you can reuse it, and if you go into a new company, there are so many benefits. But if you open source a language model, you have this data sitting around, you have this training code, and it's not that easy for someone to come and build on it and improve it, because you need to spend a lot on compute and you need to have expertise. So until there are feedback loops for open-source AI, it seems like mostly an ideological mission. People like Mark Zuckerberg are saying, America needs this, and I agree with him. But in the time when the ideological motivation is high, we need to capitalize and build this ecosystem around: what benefits do you get from seeing the language model data? And there's not a lot out there. We're going to try to launch a demo soon where you can look at an OLMo model and a query and see what pre-training data is similar to it, which was legally risky and complicated. But it's like: what does it mean to see the data that the AI was trained on? It's hard to parse. It's terabytes of files; I don't know what I'm going to find in there. But that's what we need to do as an ecosystem if people want open-source AI to be financially useful. We didn't really talk about Stargate. I
would love to get your opinion on what the new administration, the Trump administration, is doing from the American side to support AI infrastructure and the efforts of the different AI companies. What do you think about Stargate? What are we supposed to think about Stargate, and does Sam have the money? Yeah, so I think Stargate is an opaque thing. It definitely doesn't have $500 billion; it doesn't even have $100 billion. What they announced is this $500 billion number; Larry Ellison, Sam Altman, and Trump said it, and they thanked Trump. And Trump did do some executive actions that significantly improve the ability for this to be built faster. One of the executive actions he did is that, on federal land, you can just basically build data centers and power, pretty much like that, and the permitting process is basically gone, or you file after the fact. So, again, I had a schizo take earlier; another schizo take: if you've ever been to the Presidio in San Francisco, beautiful area, you could build a power plant and a data center there if you wanted to, because it is federal land; it used to be a military base. But obviously this would piss people off. It's a good bit. Anyways, Trump has made it much easier to do this. Generally, Texas has the only unregulated grid in the nation as well. Let's go, Texas. And so ERCOT enables people to build faster as well. In addition, the federal regulations are coming down, and so Stargate is predicated on this, and this is why that whole show happened. Now, how they came up with a $500 billion number is beyond me. How they came up with a $100 billion number makes sense to some extent, and there's actually a good table in that most recent Stargate piece that I had that I would like to show; it's a table about cost.
There, you passed it already; it's that one. So this table kind of explains what happens. Stargate is in Abilene, Texas: the first $100 billion of it. That site is 2.2 gigawatts of power in, and about 1.8 gigawatts of power consumed. Oracle was already building the first part of this before Stargate came about, to be clear; they've been building it for a year. They tried to rent it to Elon, in fact, but Elon was like, it's too slow, I need it faster, so then he went and did his Memphis thing. And so OpenAI was able to get it, with this weird joint venture called Stargate. They initially signed a deal with just Oracle for the first section of this cluster. That first section of the cluster is roughly $5 billion to $6 billion of server spend, and then there's another billion or so of data center spend. And then, likewise, if you fill out that entire 1.8 gigawatts with the next two generations of Nvidia chips, GB200, GB300, VR200, and you fill it out completely, that ends up being roughly $50 billion of server cost, plus there's data center cost, plus maintenance cost, plus operation cost, plus all these things. And that's where OpenAI gets to their $100 billion announcement, because they said $100 billion is phase one, which is this Abilene, Texas data center: $100 billion of, quote unquote, 'total cost of ownership.' So it's not CapEx, it's not investment; it's $100 billion of total cost of ownership. And then there will be future phases: they're looking at other sites that are even bigger than this 2.2 gigawatts, by the way, in Texas and elsewhere, so they're not completely ignoring that. But there is this $100 billion number that they say is for phase one, which I do think will happen, and they don't even have the money for that. And furthermore, it's not $100 billion of capital: it's $50 billion of spend, and then $50 billion of operational cost, power, rental pricing, etc., because OpenAI is renting the GPUs from the Stargate joint venture.
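A back-of-envelope sketch of the cost decomposition being walked through here, using only the round numbers cited in the conversation; how the operational half breaks down internally is not specified and is left as an assumption:

```python
# Rough reconstruction of the "phase one = $100B TCO" arithmetic,
# using the figures cited in the conversation (all approximate, in $B).
first_section_servers = 6       # initial Oracle-built section, servers
first_section_datacenter = 1    # "another billion or so" of data center

full_site_servers = 50          # filling all 1.8 GW with GB200/GB300/VR200
operations = 50                 # power, maintenance, rental pricing, etc.

print(f"First Oracle-built section: ~${first_section_servers + first_section_datacenter}B")
phase_one_tco = full_site_servers + operations
print(f"Phase one TCO: ~${phase_one_tco}B")   # the announced ~$100B
print(f"Of which actual server spend: ~${full_site_servers}B")
```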
So what money do they actually have? SoftBank is going to invest, Oracle is going to invest, OpenAI is going to invest. OpenAI is on the line for $19 billion; everyone knows they've only got $6 billion from their last round and $4 billion in debt. But there is news of SoftBank maybe investing $25 billion into OpenAI, so that's part of it: $19 billion can come from there. So OpenAI does not have the money at all, to be clear. The ink is not dried on anything. OpenAI has zero dollars for this $50 billion, of which they're legally obligated to put $19 billion of CapEx into the joint venture, and the rest they're going to pay via renting the GPUs from the joint venture. And then there's Oracle. Oracle has a lot of money. They're building the first section completely, and they were paying for it themselves: this $6 billion of CapEx, $10 billion of TCO. They were going to do that first section; they're paying for that. As far as the rest of the sections, I don't know how much Larry wants to spend. At any point he could pull out. This is, again, completely voluntary; nothing is signed on this. But he potentially could contribute tens of billions of dollars; to be clear, he's got the money, Oracle's got the money. And then there's MGX, the UAE fund, which technically has $1.5 trillion for investing in AI, but again, I don't know how real that money is, and there is no ink signed for this. SoftBank does not have $25 billion of cash; they'd have to sell down their stake in Arm, which is the leader in CPUs, and they IPO'd it. This is obviously what they've always wanted to do; they just didn't know where to redeploy the capital. Selling down the stake in Arm makes a ton of sense, so they can sell that down and invest in this if they want to, and invest in
OpenAI if they want to. As far as money secured: the first 100,000-GB200 cluster can be funded; everything else after that is up in the air. The money's coming; I believe the money will come. I personally do. It's a belief, okay; it's a belief that they are going to release better models and be able to raise more money. But the actual reality is that Elon's right: the money does not exist. What does the US government have to do with anything? What does Trump have to do with everything? Is he just a hype man? Trump is reducing the regulation so they can build it faster, and he's allowing them to do it, because any investment of this size is going to involve antitrust stuff. So obviously he's going to allow them to do it, and he's going to enable the regulations to actually allow it to be built. I don't believe there are any US government dollars being spent on this, though. Yeah, and I think he's also just creating a general vibe that regulation will go down and this is the era of building. So if you're a builder, you want to create stuff, you want to launch stuff, this is the time to do it. And we've had this 1.8-gigawatt data center in our data for over a year now, and we've been sending it to all of our clients, including many of these companies that are building the multi-gigawatt clusters. But that's at a level that maybe executives don't see; seeing $500 billion, $100 billion, and then everyone asking them about it: it could spur an even faster arms race. Because there's already an arms race, but this $100 billion, $500 billion number, Trump talking about it on TV, could spur the arms race to be even faster, and more investors to flood in, and so on and so forth. So I think you're right in that sense: Trump championing it means people are going to build more, and his actions are going to let
people build more. What are you excited about in these upcoming years, in terms of cluster build-outs, in terms of breakthroughs in AI? The best possible future you can imagine in the next couple of years, two, three, four years: what does that look like? It could be very specific technical things, like breakthroughs on post-training, or it could just be size, big. Yeah, I mean, the impressive clusters. I really enjoy tracking the supply chain and who's involved in what, I really do. It's really fun to see the numbers, the cost, who's building what capacity, helping them figure out how much capacity they should build, winning deals, strategic stuff; that's really cool. I think technologically, there's a lot around the networking side that really excites me, with optics and electronics kind of getting closer and closer, whether it be co-packaged optics or some new forms of switching. This is internal to a cluster? Yeah. Also multi-data-center training: people are putting so much fiber between these data centers and lighting it up with so much bandwidth that there's a lot of interesting stuff happening on that end. Telecom has been really boring since 5G, and now it's really exciting again. On that, can you educate me a little bit about the speed of things? The speed of memory versus the speed of interconnect versus the speed of fiber between data centers: are these orders of magnitude different? Can we at some point converge toward a place where it all just feels
like one computer? No, I don't think that's possible. It's only going to get harder to program, not easier; it's only going to get more difficult and complicated, with more layers. The general image that people like to have is this hierarchy of memory. On-chip is really close, localized within the chip: there you have registers, and those are shared between some compute elements. Then you'll have caches, which are shared between more compute elements. Then you have memory, like HBM or DRAM, DDR memory or whatever it is, and that's shared between the whole chip. And then you can have pools of memory that are shared between many chips, and then storage, and you keep zooming out. The access latencies across data centers, within the data center, and within a chip are all different, so you're obviously always going to have different programming paradigms for this. It's not going to be easy; programming this stuff is going to be hard. Maybe AI can help with programming this.
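To make that hierarchy concrete, here are rough, order-of-magnitude access latencies; these are ballpark assumptions that vary widely by hardware generation, not measurements of any specific system:

```python
# Ballpark access latencies across the memory/network hierarchy.
# Rough orders of magnitude only; real values vary by hardware.
latency_ns = {
    "register":               0.3,           # within a compute element
    "on-chip cache (L1/L2)":  1.0,
    "HBM / DRAM":             100.0,         # shared across the whole chip
    "chip-to-chip link":      1_000.0,       # pooled memory across chips
    "within data center":     10_000.0,      # network hop, ~microseconds
    "between data centers":   10_000_000.0,  # long-haul fiber, ~milliseconds
}

for level, ns in latency_ns.items():
    print(f"{level:24s} ~{ns:>14,.1f} ns  ({ns / 0.3:>12,.0f}x a register)")
```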
But the way to think about it is that the more elements you add to a task, you don't get strong scaling: if I double the number of chips, I don't get 2x the performance. This is just a reality of computing, because there are inefficiencies.
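The classic way to quantify that is Amdahl's law; a minimal sketch, where the 95% parallel fraction is an arbitrary illustrative number:

```python
# Amdahl's law: if a fraction p of the work parallelizes perfectly and
# (1 - p) is serial (communication, synchronization, stragglers...),
# speedup on n chips is capped well below n.

def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

p = 0.95  # illustrative: 95% of the workload parallelizes
for n in (1, 2, 8, 64, 1024):
    print(f"{n:5d} chips -> {amdahl_speedup(p, n):6.2f}x speedup")
# 2 chips gives ~1.9x, not 2x; 1024 chips gives only ~19.6x,
# and the ceiling as n grows is 1 / (1 - p) = 20x.
```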
And there's a lot of interesting work being done to make it more linear, whether it's networking the chips together more tightly, or cool programming models, or cool algorithmic things you can do on the model side. DeepSeek did some of these really cool innovations because they were limited on interconnect, but they still needed to parallelize. Everyone's always doing stuff; Google's got a bunch of work, and everyone's got a bunch of work about this. That stuff is super exciting on the model, workload, and innovation side. On hardware, solid-state transformers are interesting for the power side. There's all sorts of stuff on batteries, and all sorts of stuff up and down the stack: if you look at every layer of the compute stack, from lithography and etch, to fabrication, to optics, to networking, to power, to transformers, to cooling, and you just go up and up the stack, even air conditioners for data centers are innovating. Even copper cables: you wouldn't think it, but there are some innovations happening with copper cables, with the density of how you can pack them. It's all of these layers of the stack, all the way up to the models. Human progress is at a pace that's never been seen before. I'm just imagining you sitting back in a lair somewhere with screens everywhere, just monitoring the supply chain, all these clusters, all the information you're gathering. I mean, you have a big team. There's a big team. You do quite incredible work with SemiAnalysis, I
mean, just keeping your finger on the pulse of human civilization in the digital world. It's pretty cool just to watch, to feel that. Yeah, thank you. I guess we're all feeling the AGI; from meme to reality. Nathan, are there breakthroughs that you're looking forward to, potentially? I had a while to think about this while listening to Dylan's beautiful answer. He didn't listen to me. No, I knew this was coming. Realistically, training models is very fun, because there's so much low-hanging fruit, and the thing that makes my job entertaining, I train models, I write analysis about what's happening with models, is that there is obviously so much more progress to be had. And the real motivation why I do this somewhere where I can share things is that I don't trust the people who are like, trust me, bro, we're going to make AI good; we're the ones that are going to do it, you can trust us, and we're just going to have all the AI. I would like a future where more people have a say in what AI is and can understand it. And it's a little bit less fun that it's not just a positive thing of, this is all really fun: training models is fun and bringing people in is fun. But if AI really is going to be the most powerful technology of my lifetime, we need to have a lot of people involved in making it, and making it open helps with that:
as accessible as possible, as open as possible. Yeah. My read of the last few years is that more openness would help the AI ecosystem, in terms of having more people understand what's going on, whether that's researchers from non-AI fields, or governments, or everyone. It doesn't mean that openness will always be the answer; I think we'll then reassess what the biggest problem facing AI is and take a different angle on the wild ride that we're on. And for me, even just from the user experience: anytime you have, like Karpathy said, the aha moments, the magic, like seeing the reasoning, the chain of thought; there's something really, just fundamentally beautiful about that. It's putting a mirror to ourselves and seeing, oh, it is solving intelligence, as the cliched goal of these companies says. And you get to understand why we humans are special, the intelligence within us is special, and, for now, also why we're special in that we seem to be conscious and the AI systems, for now, aren't, and we get to
explore that mystery. So it's just really cool to get to explore these questions that I would have never imagined would even be possible, back when I was just watching with excitement Deep Blue beat Kasparov. I wouldn't have ever thought this kind of AI would be possible in my lifetime. This really feels like AI. It's incredible. I started in AI learning to fly a quadrotor. Learn to fly: and it just learned to fly up. It would hit the ceiling and stop, and we'd catch it. It's like, okay, that is really stupid compared to what's going on now. And now you could probably, with natural language, tell it to learn to fly, and it's going to generate the control algorithm required to do that. Probably. There are low-level blockers, like we had to do some weird stuff for that, but you definitely could. That's our robotics conversation. Yeah, when you have to interact with the actual physical world, it's hard. What gives you hope about the future of human civilization, looking into the next 10 years, 100 years, 1,000 years?
How long do you think we'll make it? You think we've got a thousand years? Humans will definitely be around in a thousand years. I think there are ways that very bad things could happen, and there'll be way fewer humans, but humans are very good at surviving. There have been a lot of things; that is true. We're not necessarily good at long-term credit assignment of risk, but when the risk becomes immediate, we tend to figure things out. And for that reason, there are physical constraints to things like AGI, the recursive-improvement-to-kill-us-all type stuff. For those physical reasons, and for how humans have figured things out before, I'm not too worried about an AI takeover. There are other international things that are worrying, but there's just fundamental human goodness, and trying to amplify that. We're at a tenuous time, and, I mean, if you look at humanity as a whole, there have been times where things go backwards, there are times when things don't happen at all, and we're on what should be a very positive trajectory right now. Yeah, there seems to be progress, but just like with power, there are spikes of human suffering, and we want to try to minimize the number of spikes. Generally, humanity is going to suffer a lot less; I'm very optimistic about that. I do worry about techno-fascism type stuff arising as AI becomes more and more prevalent and powerful, and those who control it can do more and more. Maybe it doesn't kill us all, but at some point every very powerful human is going to want a brain-computer interface, so that they can interact
with the AGI and all of its advantages in many more ways, and merge their mind with it, so that its capabilities, or that person's capabilities, can be leveraged much better than anyone else's. And therefore it won't be one person to rule them all, but, and this is the thing I worry about, it'll be a few people, hundreds, thousands, tens of thousands, maybe millions of people, ruling whoever's left, and the economy around it. And I think that's the thing that's probably more worrisome: human-machine amalgamations. This enables an individual human to have more impact on the world, and that impact can be both positive and negative. Generally, humans have had positive impacts on the world, at least on society, but it's possible for individual humans to have hugely negative impacts, and AGI, at least as I think the labs define it, which is not a runaway sentient thing but rather just something that can do a lot of tasks really efficiently, amplifies the capabilities of someone causing extreme damage.
But for the most part, I think it'll be used for profit-seeking motives, which will increase the abundance and supply of things and therefore reduce suffering. Yeah, that's the goal. Scrolling on a timeline is just stasis; scrolling holds the status quo of the world. Is that a positive outcome? It's like, if I have food tubes and I'm scrolling and I'm happy, is that a positive outcome, while expanding out into the cosmos? Well, this is a fun time to be alive, and thank you for pushing the forefront of what is possible in humans, and thank you for talking today. This was fun. Thanks for having us. Thanks for having us. Thanks for listening to this conversation with Dylan Patel and Nathan Lambert. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Richard Feynman: "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled." Thank you for listening, and hope to see you next time.