Visual-spatial intelligence is so fundamental, it's as fundamental as language. We've got the ingredients: compute, a deeper understanding of data, and advances in algorithms. We are at the right moment to really make a bet, to focus, and just unlock that.

Over the last two years we've seen this massive rush of consumer AI companies and technology, and it's been quite wild. But you've been doing this now for decades, so maybe walk through a little bit of how we got here, your key contributions and insights along the way.

It is a very exciting moment. Just zooming back: AI is in a very exciting moment. I personally have been doing this for two decades plus. We have come out of the last AI winter; we have seen the birth of modern AI; we have seen deep learning taking off, showing us possibilities like playing chess; then we started to see the deepening of the technology and industry adoption of some of the earlier possibilities, like language models. And now I think we're in the middle of a Cambrian explosion, in almost a literal sense, because in addition to text you're seeing pixels, video, audio, all coming out with possible AI applications and models. So it's a very exciting moment.

I know you both so well, and many people know you both because you're so prominent in the field, but not everybody grew up in AI, so maybe it's worth going through your quick backgrounds just to level-set the audience.

Yeah, sure. I first got into AI at the end of my undergrad. I did math and computer science for undergrad at Caltech, and that was awesome, but toward the end of it there was this paper that came out, at the time a very famous paper: the "cat paper," from Quoc Le, Andrew Ng, and others who were at Google Brain at the time. That was the first time I came across this concept of deep learning, and to me it just felt like this amazing technology. It was also the first time I came across the recipe that would come to define the next decade-plus of my life: you can take amazingly powerful learning algorithms that are very generic, couple them with very large amounts of compute, couple them with very large amounts of data, and magical things start to happen when you combine those ingredients. I first came across that idea around 2011, 2012-ish, and I just thought: oh my God, this is going to be what I want to do. So it was obvious you've got to go to grad school to do this stuff.
Then I saw that Fei-Fei was at Stanford, one of the few people in the world at the time who was on that train. It was just an amazing time to be in deep learning and computer vision specifically, because that was really the era when this went from the first nascent bits of technology that were just starting to work, and really got developed and spread across a ton of different applications. Over that time we saw the beginnings of language modeling; we saw the beginnings of discriminative computer vision, where you could take pictures and understand what's in them in a lot of different ways; and we also saw some of the early bits of what we would now call generative modeling, generating images and generating text. A lot of those core algorithmic pieces actually got figured out by the academic community during my PhD years. There was a time I would wake up every morning and check the new papers on arXiv, and just be ready; it was like unwrapping presents on Christmas, knowing that every day there would be some amazing new discovery, some amazing new application or algorithm somewhere in the world. What happened in the last two years is that everyone else in the world came to the same realization, using AI to get new Christmas presents every day. But for those of us who have been in the field for a decade or more, we've had that experience for a very long time.

Obviously I'm much older than Justin. I came to AI through a different angle, which is physics, because my undergraduate background was physics. Physics is the kind of discipline that teaches you to ask audacious questions and to think about the remaining mysteries of the world. In physics, of course, that's the atomic world, the universe, and all that, but somehow that kind of training got me into the audacious question that really captured my own imagination, which is intelligence. So I did my PhD in AI and computational neuroscience at Caltech. Justin and I didn't actually overlap, but we share the same alma mater at Caltech.

Oh, and the same adviser.

Yes, the same adviser: your undergraduate adviser and my PhD adviser, Pietro Perona. My PhD era, which is similar to your PhD era, was when AI was still in the winter in the public eye. But it was not in the winter in my eye, because it was that hibernation before spring: there was so much life. Machine learning and statistical modeling were really gaining power. I think I was in the native generation of machine learning and AI, whereas I look at Justin's generation as the native deep learning generation.
Machine learning was the precursor of deep learning, and we were experimenting with all kinds of models. But one thing became clear at the end of my PhD and the beginning of my time as an assistant professor: there was an overlooked element of AI that is mathematically important for driving generalization, and the whole field was not thinking that way. It was data. We were thinking about the intricacies of Bayesian models, kernel methods, and all that, but what my students and my lab realized, probably earlier than most people, is that if you let data drive models, you can unleash a kind of power we hadn't seen before. And that was really the reason we made a pretty crazy bet on ImageNet. Forget about the scale we're seeing now: at that point the NLP community had their own datasets, which I remember were small, like the UC Irvine dataset or some other NLP dataset, and the vision community had their datasets too, but all on the order of thousands or tens of thousands of data points. We said: we need to drive it to internet scale. Luckily, it was also the coming of age of the internet, so we were riding that wave, and that's when I came to Stanford.

So these epochs are what we often talk about. ImageNet is clearly the epoch that created, or at least made popular and viable, computer vision. And in the GenAI wave we talk about two core unlocks: one is the Transformers paper, which is attention, and the other we talk about is Stable Diffusion. Is that a fair way to think about this, that there are these two algorithmic unlocks that came from academia or Google, and that's where everything comes from? Or has it been more deliberate? Have there been other big unlocks that brought us here that we don't talk about as much?

Yeah, I think the big unlock is compute. I know the story of AI is often the story of compute, but no matter how much people talk about it, I think people underestimate it. The amount of growth that we've seen in computational power over the last decade is astounding. The first paper credited with the breakthrough moment in computer vision for deep learning was AlexNet, a 2012 paper in which a deep neural network did really well on the ImageNet challenge and just blew away all the other algorithms, the types of algorithms Fei-Fei had been working on in grad school. AlexNet was a 60-million-parameter deep neural network, and it was trained for six days on two GTX 580s, the top consumer card at the time, which came out in 2010. I was looking at some numbers last night just to put this in perspective. The newest, latest and greatest from NVIDIA is the GB200. Do either of you want to guess how much raw compute separates the GTX 580 and the GB200?

Shoot, no idea. Go for it.

It's in the thousands. I ran the numbers last night: if you scale that six-day training run on two GTX 580s, it comes out to just under five minutes on a single GB200.
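To make that comparison concrete, here's a back-of-envelope sketch in Python using only the numbers quoted above (six days on two GTX 580s versus just under five minutes on one GB200); the implied raw-compute factor is indeed "in the thousands":

```python
# Back-of-envelope check of the quoted speedup, using only numbers
# from the conversation: a six-day AlexNet training run on two
# GTX 580s versus "just under five minutes" on a single GB200.

ALEXNET_DAYS = 6        # original 2012 training time
ALEXNET_GPUS = 2        # two GTX 580s
GB200_MINUTES = 5       # quoted equivalent time on one GB200

gpu_minutes_2012 = ALEXNET_DAYS * 24 * 60 * ALEXNET_GPUS   # 17,280 GPU-minutes
implied_factor = gpu_minutes_2012 / GB200_MINUTES

print(f"2012 cost: {gpu_minutes_2012:,} GTX 580 GPU-minutes")
print(f"Implied raw-compute factor: ~{implied_factor:,.0f}x")   # ~3,456x
```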
Justin is making a really good point. The 2012 AlexNet paper on the ImageNet challenge used a literally classic model: the convolutional neural network. That model was published back in the 1980s; it's one of the first papers I remember learning as a graduate student, and it also has more or less six or seven layers. So practically, what's the only difference between AlexNet and that ConvNet? The GPUs (the two of them) and the deluge of data.

Well, that's where I was going to go. I think most people now are familiar with, quote, "the bitter lesson," and what the bitter lesson says is: if you make an algorithm, don't be cute; just make sure you can take advantage of available compute, because the available compute will show up. On the other hand, there's another narrative that seems to me just as credible, which is that it's actually new data sources that unlocked deep learning. ImageNet is a great example. A lot of people say self-attention from Transformers is great, but they'll also say it's a way to exploit human labeling of data, because it's the humans that put the structure in the sentences. And if you look at CLIP, they'll say: well, we're using the internet to have humans label images via the alt tag. So that's a story of data, not a story of compute. Is the answer just both, or is one more important than the other?

I think it's both, but you're hitting on another really good point. There are actually two eras here that, to me, feel quite distinct in the algorithmics. The ImageNet era is actually the era of supervised learning. In the era of supervised learning, you have a lot of data, but you don't know how to use data on its own. The expectation of ImageNet and other datasets of that time period was that we're going to get a lot of images, but we need people to label every one; for all of the training data we're going to train on, a human labeler has looked at every image and said something about it. The big algorithmic unlock since then is that we know how to train on things that don't require human-labeled data.

As the naive person in the room who doesn't have an AI background: it seems to me that if you're training on human data, the humans have labeled it; it's just not explicit.

I knew you were gonna say that, Martin. I knew it! Philosophically, that's a really important question, but it's actually more true of language than of pixels.

Fair enough. Yeah, 100 percent. But I do think it's an important point: it's still human-labeled, just more implicit than explicit.

It is still human-labeled, yes. The distinction is that for this supervised-learning era, our learning tasks were much more constrained. You would have to come up with an ontology of concepts that you wanted to discover. If you were doing ImageNet,
like Fei-Fei and her students did at the time, you'd spend a lot of time thinking about which thousand categories should be in the ImageNet challenge. Other datasets of that time did the same; for the COCO dataset for object detection, they thought really hard about which 80 categories to put in there.

So let's walk toward generative AI. When I was doing my PhD, and actually before that, I took machine learning from Andrew Ng and then something Bayesian and very complicated from Daphne Koller; it was very complicated for me. A lot of that was just predictive modeling. Then there's the whole vision stuff that you all unlocked. But the generative stuff showed up, I would say, in the last four years, and to me it's very different: you're not identifying objects, you're not predicting something, you're generating something. So maybe walk through the key unlocks that got us there, why it's different, whether we should think about it differently, and whether it's part of a continuum or not.

It is so interesting. Even during my graduate years, generative models were there. We wanted to do generation; nobody remembers this, but even with letters and digits we were trying to do some generation, Geoff Hinton had papers on generation, and we were thinking about how to generate. In fact, if you think from a probability-distribution point of view, you can mathematically generate; it's just that nothing we generated would ever impress anybody. So the concept of generation was there mathematically and theoretically, but nothing worked.

I do want to call out Justin's PhD. Justin was saying he got enamored by deep learning, so he came to my lab. Justin's entire PhD is almost a mini-story of the trajectory of the field. He started his first project on data. I forced him to; he didn't like it.

In retrospect, I learned a lot of really useful things.

I'm glad you say that now. So we moved Justin to deep learning, and the core problem there was taking images and generating words.

Well, actually, I think there were three discrete phases on this trajectory. The first was matching images and words: we have an image, we have words, and can we say how well they align? My first paper, of my PhD and in fact my first academic publication ever, was image retrieval with scene graphs. Then we went into generation, taking pixels and generating words; Justin and Andrej really worked on that, but that was still a very lossy way of generating and getting information out of the pixel world.
Then, in the middle, Justin went off and did a very famous piece of work; it was the first time someone made it real-time.

Yeah, so the story there: a paper came out in 2015, "A Neural Algorithm of Artistic Style," led by Leon Gatys, and it showed these real-world photographs that they had converted into van Gogh's style. We're kind of used to seeing things like this in 2024, but this was 2015. The paper just popped up on arXiv one day and it blew my mind. I got this generative brainworm in my brain in 2015, and it did something to me; I thought, oh my God, I need to understand this algorithm, I need to play with it, I need to make my own images into van Gogh. So I read the paper, and over a long weekend I reimplemented the thing and got it to work. It was actually a very simple algorithm; my implementation was about 300 lines of Lua, because this was pre-PyTorch and at the time we were using Lua Torch. But it was slow, because it was optimization-based: for every image you want to generate, you have to run an optimization loop, a gradient descent loop, per image. The images were beautiful, but I just wanted it to be faster.
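For readers curious what that per-image optimization loop looks like, here is a heavily condensed, hypothetical PyTorch sketch of the Gatys-style algorithm Justin describes (his original was about 300 lines of Lua Torch; the file paths, layer choices, and loss weight here are illustrative assumptions, and a real implementation would also normalize inputs with ImageNet statistics):

```python
import torch
import torch.nn.functional as F
from torchvision.io import read_image
from torchvision.models import vgg16

# Pretrained VGG-16 as a frozen feature extractor.
vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

LAYERS = (3, 8, 15, 22)  # relu1_2, relu2_2, relu3_3, relu4_3

def features(x):
    """Collect activations at a few conv layers."""
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYERS:
            feats.append(x)
        if i == LAYERS[-1]:
            break
    return feats

def gram(f):
    """Gram matrix: channel-to-channel feature correlations (the 'style')."""
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def load(path):
    return (read_image(path).float() / 255.0).unsqueeze(0)  # (1, 3, H, W)

content = load("photo.jpg")        # hypothetical input paths
style = load("van_gogh.jpg")

content_feats = features(content)
style_grams = [gram(f) for f in features(style)]

# Start from the content image and optimize the pixels directly.
x = content.clone().requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.02)

for step in range(300):            # the slow inner loop, run per output image
    opt.zero_grad()
    feats = features(x)
    content_loss = F.mse_loss(feats[-1], content_feats[-1])
    style_loss = sum(F.mse_loss(gram(f), g) for f, g in zip(feats, style_grams))
    loss = content_loss + 1e4 * style_loss
    loss.backward()
    opt.step()
```

The slowness Justin mentions is visible in the loop: every output image costs hundreds of gradient steps, which is what the feed-forward speedups he describes next were designed to eliminate.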
And Justin just did it. It was actually, I think, your first taste of academic work having industry impact.

A bunch of people had seen this artistic style transfer stuff at the time, and I and a couple of others came up, around the same time, with different ways to speed it up; mine was the one that got a lot of traction.

So I was very proud of Justin. But there's one more thing I was very proud of Justin for, to connect to generative AI: before the world understood GenAI, Justin's last piece of work in his PhD, which I knew about because I was forcing you to do it...

That one was fun.

...actually took language as input and got a whole picture out. It was one of the first generative works of this kind, using GANs, which were so hard to use. The problem was that we were not ready to use a natural piece of language; you heard that Justin worked on scene graphs, so we had to input a scene-graph language structure: the sheep, the grass, the sky, laid out in a graph. It literally was one of our photos, right? Then he and another very good master's student, Agrim, got that GAN to work. So you can see the progression: from data, to matching, to style transfer, to generating images. You asked whether this was an abrupt change: for people like us it was already happening, a continuum, but for the world the results feel more abrupt.
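As a concrete illustration of what "inputting a scene graph" means, here is a minimal, hypothetical sketch of that kind of structured input (the exact format and model details in the actual scene-graph-to-image work differ):

```python
# A scene graph is just objects plus typed relationships between them,
# roughly the structured input Fei-Fei describes, used in place of
# free-form natural language. Names and format here are illustrative.

scene_graph = {
    "objects": ["sheep", "grass", "sky"],
    "relationships": [
        # (subject_index, predicate, object_index)
        (0, "standing on", 1),   # sheep standing on grass
        (2, "above", 1),         # sky above grass
    ],
}

# In broad strokes, a graph network embeds the objects and relations,
# predicts a rough scene layout (one box per object), and a generator
# network renders pixels from that layout.
```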
So, I read your book, and for those listening, it's a phenomenal book; I really recommend you read it. It seems that for a long time, and I'm talking to you, Fei-Fei, a lot of your research and your direction has been toward spatial stuff, pixel stuff, and intelligence, and now you're doing World Labs, and it's around spatial intelligence. So maybe talk through it: has this been part of a long journey for you? Why did you decide to do it now? Is it a technical unlock, a personal unlock? Just move us from that milieu of AI research to World Labs.

Sure. For me it is both personal and intellectual. You talked about my book: my entire intellectual journey is really this passion to seek North Stars, but also believing that those North Stars are critically important for the advancement of our field. At the beginning, I remember after graduate school I thought my North Star was telling the stories of images, because for me that's such an important piece of visual intelligence, which is part of what you'd call AI or AGI. But when Justin and Andrej did that, I was like: oh my God, that was my life's dream, so what do I do next? It came a lot faster than I expected; I thought it would take a hundred years.

Visual intelligence is my passion, because I do believe that for every intelligent being, whether people or robots or some other form, knowing how to see the world, reason about it, and interact in it, whether you're navigating or manipulating or making things (you can even build civilization upon it), is essential. Visual-spatial intelligence is so fundamental, it's as fundamental as language, possibly more ancient and more fundamental in certain ways. So it's very natural for me that World Labs' North Star is to unlock spatial intelligence. And the moment is right to do it. Like Justin was saying, we've got these ingredients: we've got compute; we've got a much deeper understanding of data, way deeper than in the ImageNet days, and we're so much more sophisticated compared to those days; and we've got advances in algorithms, including from co-founders at World Labs like Ben Mildenhall and Christoph Lassner, who were at the cutting edge of NeRF. We are at the right moment to really make a bet, to focus, and just unlock that.

I just want to clarify for folks who are listening: you're starting this company, World Labs, and spatial intelligence is how you generally describe the problem you're solving. Can you try to crisply describe what that means?

Yeah. Spatial intelligence is about machines' ability to perceive, reason, and act in 3D space and time: to understand how objects and events are positioned in 3D space and time, how interactions in the world can affect those 3D (really 4D) positions over space-time, and to perceive, reason about, generate, and interact with the world. It's really taking the machine out of the mainframe, out of the data center, and putting it out into the world, understanding the 3D and 4D world in all of its richness.

So, to be very clear: are we talking about the physical world, or just an abstract notion of a world?
I think it can be both, and that encompasses our long-term vision. Even if you're generating worlds, even if you're generating content, doing that positioned in 3D has a lot of benefits. And if you're recognizing the real world, being able to put 3D understanding into the real world is part of it as well.

Great. Just for everybody listening: the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field, at the same level. These four decided to come out and do this company now, and so I'm trying to dig into why now is the right time.

Yeah. This is again part of a longer evolution for me. After my PhD, when I really wanted to develop into my own independent researcher for my later career, I was thinking: what are the big problems in AI and computer vision? The conclusion I came to around that time was that the previous decade had mostly been about understanding data that already exists, but the next decade was going to be about understanding new data. The data that already exists is all the images and videos that were maybe already on the web. The new data? People have smartphones, those smartphones have cameras, those cameras have new sensors, and those cameras are positioned in the 3D world. It's not just that you're going to get a bag of pixels from the internet, know nothing about it, and try to say whether it's a cat or a dog. We want to treat images as universal sensors to the physical world: how can we use them to understand the 3D and 4D structure of the world, either in physical spaces or in generative spaces? So I made a pretty big pivot post-PhD into 3D computer vision, predicting the 3D shapes of objects with some of my colleagues at FAIR at the time.

Later, I got really enamored by this idea of learning 3D structure through 2D, because (we talk about data a lot) 3D data is hard to get on its own, but there's a very strong mathematical connection here: our 2D images are projections of a 3D world, and there's a lot of mathematical structure we can take advantage of. So even if you have only a lot of 2D data, a lot of people have done amazing work figuring out how to back out the 3D structure of the world from large quantities of 2D observations. And then in 2020, you asked about breakthrough moments, there was a really big breakthrough moment from our co-founder Ben Mildenhall, at the time, with his paper NeRF: Neural Radiance Fields. It was a very simple, very clear way of backing out 3D structure from 2D observations, and it just lit a fire under this whole space of 3D computer vision.
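For the curious, here is a minimal, hypothetical sketch of the core NeRF idea: a small MLP maps 3D points to color and density, a ray through the scene is rendered by alpha-compositing samples along it, and posed 2D photos supervise the rendered pixels. This is a toy version; the real method adds positional encoding, view-dependent color, and hierarchical sampling:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Map a 3D point to an RGB color and a volume density."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),             # (r, g, b, sigma)
        )

    def forward(self, points):               # points: (N, S, 3)
        out = self.mlp(points)
        rgb = torch.sigmoid(out[..., :3])     # colors in [0, 1]
        sigma = torch.relu(out[..., 3])       # non-negative density
        return rgb, sigma

def render_rays(model, origins, dirs, near=2.0, far=6.0, samples=64):
    """Sample points along each ray and alpha-composite them into pixels."""
    t = torch.linspace(near, far, samples)                         # (S,)
    points = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]
    rgb, sigma = model(points)                                     # (N,S,3), (N,S)
    delta = (far - near) / samples
    alpha = 1.0 - torch.exp(-sigma * delta)           # opacity of each sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.roll(trans, 1, dims=-1)             # exclusive cumprod:
    trans[:, 0] = 1.0                                 # light surviving so far
    weights = alpha * trans                           # (N, S)
    return (weights[..., None] * rgb).sum(dim=-2)     # (N, 3) pixel colors

# Training is just photometric supervision: shoot rays from photos with
# known camera poses and minimize MSE between rendered and observed pixels.
```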
I think there's another aspect here that maybe people outside the field don't quite appreciate: that was also the time when large language models were starting to take off. A lot of the language-modeling stuff had actually been developed in academia; even during my PhD I did some early work with Andrej Karpathy on language modeling, in 2014. LSTMs, I still remember, RNNs, GRUs; this was pre-Transformer. But at some point, around the GPT-2 era, you couldn't really do those kinds of models in academia anymore, because they took way, way more resources. There was one really interesting thing, though, about the NeRF approach Ben came up with: you could train these models in an hour, a couple of hours, on a single GPU. So there was a dynamic that happened: a lot of academic researchers ended up focusing on these problems, because there was core algorithmic stuff to figure out, and because you could actually do a lot without a ton of compute; you could get state-of-the-art results on a single GPU. Because of those dynamics, a lot of researchers in academia were moving to think about the core algorithmic ways we could advance this area. Then I ended up chatting with Fei-Fei more, and I realized that we were actually...

She's very convincing.

She's very convincing, well, there's that. But you know, we talk about trying to figure out your own research trajectory, independent from your adviser; well, it turns out we ended up converging on similar things.

Okay, well, from my end, when I want to talk to the smartest person, I call Justin; there's no question about it. I do want to talk about a very interesting technical story of pixels that most people who work in language don't realize. In the earlier era of the field of computer vision, those of us who work on pixels had a long history in an area of research called 3D reconstruction. It dates back to the 70s. Humans have two eyes, so it generally starts with stereo photos: you take two photos and try to triangulate the geometry to make a 3D shape out of it. It is a really, really hard problem, and to this day it's not fundamentally solved, because there's correspondence and all that.
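The classical pipeline Fei-Fei describes can be sketched in a few lines: given the same point seen in two calibrated views, triangulation reduces to a small linear least-squares problem (the direct linear transform). The cameras below are made up for illustration, and, as she notes, the genuinely hard, still-unsolved part in practice is finding the correspondences in the first place:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Recover a 3D point from two views by the DLT method.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: the matching 2D point (u, v) in each image.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],   # each row enforces one projection equation
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)  # null vector of A = homogeneous 3D point
    X = vt[-1]
    return X[:3] / X[3]

# A toy stereo pair: an identity camera and one shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0, 1.0])          # ground-truth 3D point
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]        # its projection in view 1
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]        # its projection in view 2

print(triangulate(P1, P2, x1, x2))               # ~ [0.5, 0.2, 4.0]
```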
So this whole field, an older way of thinking about 3D, had been going along and making really good progress. But when NeRF happened, in the context of generative methods, in the context of diffusion models, suddenly reconstruction and generation started to really merge. Now, within a really short period of time, in the field of computer vision it's hard to talk about reconstruction versus generation anymore. We suddenly have a moment where, whether we see something or imagine something, both can converge toward generating it. That, to me, is a really important moment for computer vision, but most people missed it because we're not talking about it as much as LLMs.

Right. So in pixel space there's reconstruction, where you reconstruct a scene that's real, and if you haven't seen the scene, then you use generative techniques; these things are very similar. Throughout this entire conversation you've been talking about language and you've been talking about pixels, so maybe it's a good time to talk about how spatial intelligence and what you're working on contrasts with language approaches, which of course are very popular now. Is it complementary? Is it orthogonal?

I think they're complementary.

I don't mean to be too leading here, so maybe just contrast them. Everybody says: listen, I know OpenAI, and I know GPT, and I know multimodal models, and a lot of what you're talking about, they've got pixels and they've got language; doesn't that kind of do what we want with spatial reasoning?

Yeah. I think to answer that, you need to open up the black box a little bit and look at how these systems work under the hood. With the language models and the multimodal language models we're seeing nowadays, their underlying representation under the hood is a one-dimensional representation. We talk about context lengths, we talk about Transformers, sequences, attention; fundamentally, their representation of the world is one-dimensional, and these things operate on a one-dimensional sequence of tokens. That's a very natural representation when you're talking about language, because written text is a one-dimensional sequence of discrete letters. That underlying representation is the thing that led to LLMs, and with the multimodal LLMs we're seeing now, you end up shoehorning the other modalities into that underlying representation of a 1D sequence of tokens.
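A concrete way to see that "shoehorning": this is roughly how ViT-style multimodal models turn an image into a 1D token sequence. A simplified, hypothetical sketch; the patch size and embedding width are typical values, not any specific model's:

```python
import torch

def patchify(image, patch=16, dim=768):
    """Cut an image into patches and embed each one as a 'token'.

    image: (3, H, W) -> (num_patches, dim): a flat 1D sequence. Any 2D
    (or 3D) structure must then be rediscovered from position embeddings.
    """
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    proj = torch.nn.Linear(c * patch * patch, dim)   # learned in a real model
    return proj(patches)

tokens = patchify(torch.rand(3, 224, 224))
print(tokens.shape)   # torch.Size([196, 768]): 196 tokens in a 1D sequence
```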
Now, when we move to spatial intelligence, it's kind of going the other way: we're saying the three-dimensional nature of the world should be front and center in the representation. From an algorithmic perspective, that opens the door for us to process data in different ways, to get different kinds of outputs out of it, and to tackle slightly different problems. Even at a coarse level, you might look at this from the outside and say, oh, multimodal LLMs can look at images too. Well, they can, but they don't have that fundamental 3D representation at the heart of their approach.

I totally agree with Justin. The 1D versus fundamentally-3D representation is one of the most core differentiators. The other thing is slightly philosophical, but really important, at least to me: language is fundamentally a purely generated signal. There's no language out there; you don't go out into nature and find words written in the sky for you. Whatever data you feed in, you can pretty much, with enough generalizability, regurgitate the same kind of data out; that's language to language. But the 3D world is not like that. There is a 3D world out there that follows the laws of physics, that has its own structures due to materials and many other things. To fundamentally back that information out, and be able to represent it and generate it, is just quite a different problem. We will be borrowing similar ideas, useful ideas, from language and LLMs, but this is, fundamentally and philosophically to me, a different problem.

Right. So language is 1D, and probably a bad representation of the physical world, because it's been generated by humans and it's probably lossy. But there's a whole other modality of generative AI models, which is pixels: 2D images and 2D video. One could say that when you look at a video you can see 3D stuff, because you can pan a camera or whatever. So how would spatial intelligence be different from, say, 2D video?

When I think about this, it's useful to disentangle two things: one is the underlying representation, and two is the user-facing affordances you have. Here's where it can get confusing: fundamentally we see in 2D. Our retinas are 2D structures in our bodies, and we've got two of them, so fundamentally our visual system perceives 2D images. But the thing is, depending on what representation you use, different affordances can be more or less natural. Even if, at the end of the day, you're seeing a 2D image or a 2D video, your brain is perceiving it as a projection of a 3D world. There are things you might want to do: move objects around, move the camera around. In principle, you might be able to do these with a purely 2D representation and model, but it's just not a fit for the problems you're asking the model to do. Modeling the 2D projections of a dynamic 3D world is a function that probably can be modeled, but by putting a 3D representation into the heart of a model, there's just going to be a better fit between the kind of representation the model works with and the kind of tasks you want it to do. So our bet is that by threading a little more 3D representation under the hood, we'll enable better affordances for users.

This also goes back to the North Star for me. Why is it spatial intelligence, and not flat pixel intelligence? Because I think the arc of intelligence has to go to what Justin calls affordances. If you look at evolution, the arc of intelligence eventually enables animals and humans, especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, make a sandwich, whatever you do in this 3D world. Translating that into technology, that native 3D-ness, is fundamentally important for the floodgate of possible applications, even if the serving of some of them looks 2D; it's innately 3D.

I think this is actually a very subtle and incredibly critical point, so it's worth digging into, and a good way to do that is to talk about use cases. Just to level-set this: we're talking about building a technology,
call it a model, that can do spatial intelligence. So maybe, in the abstract, what might that look like? A little more concretely, what would be the potential use cases you could apply it to?

There are a couple of different kinds of things we imagine these spatially intelligent models being able to do over time. One that I'm really excited about is world generation. We're all used to text-to-image generators, and we're starting to see text-to-video generators, where you put in a prompt and out pops an amazing image or an amazing short clip. But I think you could imagine leveling this up and getting 3D worlds out. One thing we could imagine spatial intelligence helping us with in the future is upleveling these experiences into 3D, where you're not getting just an image out, or just a clip out, but a fully simulated, vibrant, interactive 3D world.

For gaming, maybe?

Maybe for gaming, maybe for virtual photography, you name it. Even if you just got this to work, there would be a million applications.

For education?

For education, yeah. I guess one of my points is that in some sense this enables a new form of media, because we already have the ability to create virtual interactive worlds, but it costs hundreds of millions of dollars and a ton of development time. As a result, the place where people drive this technological ability is video games. We do have the ability as a society to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, the only economically viable use of that technology in its form today is games that can be sold for $70 apiece to millions and millions of people to recoup the investment. If we had the ability to create these same virtual, interactive, vibrant 3D worlds far more cheaply, you would see a lot of other applications, because if you bring down the cost of producing that kind of content, people are going to use it for other things. What if you could have a sort of personalized 3D experience that's as good, as rich, and as detailed as one of those AAA video games that cost hundreds of millions of dollars to produce, but catered to a very niche thing that maybe only a couple of people would want? That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the generative realm.

If I think about a world, I actually think about things that are not just scene generation; I think about stuff like movement and physics. In the limit, is that included? And second: if I'm interacting with it, are there semantics? By that I mean: if I open a book, are there pages, are there words in it, do they mean something? Are we talking about a full-depth experience, or are we talking about a static scene?

I think we'll see a progression of this technology over time. This is really hard stuff to build, so the static problem is a little bit easier. But in the limit, we want this to be fully dynamic, fully interactable, all the things you just said.

I mean, that's the definition of spatial intelligence.

Yeah. So there is going to be a progression; we'll start with more static, but everything you've said is on the roadmap of spatial intelligence. This is kind of in the name of the company itself, World Labs: the "world" is about building and understanding worlds. This is actually a little bit inside baseball; I realized after we told people the name that they don't always get it, because in computer vision, in reconstruction and generation, we often make a distinction, a delineation, between the kinds of things you can do. The first level is objects: a microphone, a cup, a chair, discrete things in the world. A lot of the ImageNet-style stuff that Fei-Fei worked on was about recognizing objects in the world. The next level up from objects, I think of as scenes: scenes are compositions of objects. Now we've got this recording studio, with a table and microphones and people in chairs, some composition of objects. But we envision worlds as a step beyond scenes. Scenes are maybe individual things, but we want to break the boundaries: go outside the door, step away from the table, walk out the door, walk down the street, see the cars buzzing past and the leaves on the trees moving, and be able to interact with those things.

Another thing that's really exciting, just to pick up those words "new media": with this technology, the boundary between the real world and the virtual, imagined, augmented, or predicted world becomes blurry. The real world is 3D, so in the digital world you have to have a 3D representation to even blend with the real world.
You cannot have a 2D or a 1D representation and interface with the real 3D world in an effective way. With 3D, it unlocks; the use cases can be quite limitless because of this.

Right. So the first use case Justin was talking about would be the generation of virtual worlds, for any number of purposes. The one you're just alluding to would be more like augmented reality?

Yes. Right around the time World Labs was being formed, the Vision Pro was released by Apple, and they used the term "spatial computing." They almost stole ours, but we're spatial intelligence. Spatial computing needs spatial intelligence.

That's exactly right. We don't know what hardware form it will take: goggles, glasses, contact lenses. But that interface between the true real world and what you can do on top of it, whether it's augmenting your capability to work on a piece of machinery and fix your car even if you're not a trained mechanic, or just a Pokémon GO++ for entertainment, suddenly this piece of technology is going to be the operating system, basically, for AR, VR, and mixed reality.

In the limit, what does an AR device need to do? It's this thing that's always on, it's with you, it's looking out into the world, so it needs to understand the stuff you're seeing and maybe help you out with tasks in your daily life. But I'm also really excited about this blend between the virtual and the physical. It becomes really critical: if you have the ability to understand what's around you in real time, in perfect 3D, it actually starts to deprecate large parts of the real world as well. Right now, how many differently sized screens do we all own for different use cases? Too many, right? You've got your phone, your iPad, your computer monitor, your TV, your watch. These are all basically differently sized screens because they need to present information to you in different contexts and in different positions. But if you have the ability to seamlessly blend virtual content with the physical world, it deprecates the need for all of those: it ideally, seamlessly blends the information you need to know in the moment with the right mechanism for giving you that information.

Another huge case for blending the digital virtual world with the 3D physical world is agents being able to do things in the physical world. Humans can use these mixed-reality devices to do things. Like I said, I don't know how to fix a car, but if I have to, I put on these goggles or glasses and suddenly I'm guided to do it. But there are other types of agents, namely robots, any kind of robot, not just humanoids. Their interface, by definition, is the 3D world, but their compute, their brain, by definition is the digital world. What connects a robot's brain to the real world, from learning to behaving? It has to be spatial intelligence.

So you've talked about virtual worlds, you've talked about augmented reality, and now you've just talked about the purely physical world, basically, which would be robotics. For any company, that would be a very large charter, especially if you're going to get into each one of these different areas. So how do you think about the idea of deep tech versus any of these specific application areas?

We see ourselves as a deep tech company, a platform company that provides models that can serve these different use cases.

Of these three, is there any one that you think is more natural early on, that people can expect the company to lean into?

I think it suffices to say the devices are not totally ready. I actually got my first VR headset in grad school, and it was one of those transformative technology experiences: you put it on and you're like, oh my God, this is crazy. I think a lot of people have that experience the first time they use VR. So I've been excited about this space for a long time, and I love the Vision Pro; I stayed up late to order one of the first ones, the first day it came out. But I think the reality is that it's just not there yet as a platform for mass-market appeal, so very likely, as a company, we'll move into a market that's more ready.

I think there can sometimes be simplicity in generality. We have this notion of being a deep tech company; we believe there are some underlying fundamental problems that need to be solved really well, and that, solved really well, can apply to a lot of different domains.
We really view the long arc of the company as building and realizing the dreams of spatial intelligence writ large.

So this is a lot of technology to build, it seems to me.

Yeah, it's a really hard problem. Sometimes people who are not directly in the AI space see AI as one undifferentiated mass of talent, and those of us who have been here longer realize that a lot of different kinds of talent need to come together to build anything in AI, and this in particular. We've talked a little bit about the data problem; we've talked a little bit about some of the algorithms I worked on during my PhD; but there's a lot of other stuff you need to do this, too. You need really high-quality, large-scale engineering. You need a really deep understanding of the 3D world. And there are actually a lot of connections with computer graphics, because they've been attacking a lot of the same problems from the opposite direction. So when we think about team construction, we think about how to find the absolute best experts in the world in each of the different subdomains necessary to build this really hard thing.

When I thought about how to form the best founding team for World Labs, it had to start with a group of phenomenal multidisciplinary founders. Of course, Justin was natural for me: Justin, I'd known you through your years as one of my best students and one of the smartest technologists. But there were two other people I had known by reputation, one of whom Justin had even worked with, that I was drooling over. One is Ben Mildenhall; we talked about his seminal work on NeRF. The other is Christoph Lassner, who has been renowned in the computer graphics community, especially for his foresight in working on a precursor of the Gaussian splatting representation for 3D modeling, five years before Gaussian splatting took off. When we talked about the potential possibility of working with Christoph Lassner, Justin just jumped off his chair.

Ben and Christoph are absolute legends. Maybe just quickly talk about how you thought about the build-out of the rest of the team, because again, there's a lot to build here and a lot to work on, not just in AI or graphics, but systems and so forth.

Yeah. This is what I'm personally most proud of so far: the formidable team. I've had the privilege of working with the smartest young people in my entire career, from the top universities, being a professor at Stanford, but the kind of talent we've put together here at World Labs is just phenomenal; I've never seen this concentration. And I think the biggest differentiating element is that we're believers in spatial intelligence. All of the multidisciplinary talents, whether it's systems engineering, machine learning infrastructure, generative modeling, data, or graphics, all of us, whether through our personal research journeys, our technology journeys, or even our personal hobbies, believe that spatial intelligence has to happen at this moment, with this group of people. That's how we really formed our founding team, and that focus of energy and talent is really just humbling to me; I just love it.

So, I know you've been guided by a North Star. The thing about North Stars is that you can't actually reach them, because they're in the sky, but they're a great way to have guidance. How will you know when you've accomplished what you set out to accomplish? Or is this a lifelong thing that's going to continue indefinitely?
First of all, there are real North Stars and virtual North Stars; sometimes you can reach the virtual ones.

Fair enough. Good enough in the world model, exactly.

Like I said, I thought one of my North Stars, storytelling about images, would take a hundred years, and Justin and Andrej, in my opinion, solved it for me. So we could get to our North Star. For me, it's when so many people and so many businesses are using our models to unlock their needs for spatial intelligence: that's the moment I'll know we have reached a major milestone.

Actual deployment, actual impact.

Exactly.

Yeah, I don't think we're ever going to get there. I think this is such a fundamental thing: the universe is a giant, evolving, four-dimensional structure, and spatial intelligence writ large is understanding that in all of its depths and figuring out all the applications of it. So we have a particular set of ideas in mind today, but I think this journey is going to take us places we can't even imagine right now.

The magic of good technology is that technology opens up more possibilities and unknowns, so we will keep pushing, and the possibilities will keep expanding.

Brilliant. Thank you, Justin. Thank you, Fei-Fei. This was fantastic.

Thank you, Martin.

Thank you, Martin.