What is Spatial AI? "The Next Frontier of AI Architecture"

45.32k views · 8,889 words
Matthew Berman
Fei-Fei Li's interview with a16z, plus my reaction. Try Mammouth now for just $10 today: https://mammo...
Video Transcript:
What is spatial artificial intelligence? Well, Fei-Fei Li, also known as the Godmother of AI, just raised hundreds of millions of dollars to build an AI company around spatial intelligence. I'm going to explain what spatial intelligence is, and we're also going to watch an interview with her and famed venture capital firm a16z, where she talks about the future of artificial intelligence being one in which it understands the real world. And we're doing another giveaway of a 24-inch Dell monitor you can win; all you have to do is subscribe to my newsletter. I'll drop all the
information in the description below. So first, who is Fei-Fei Li? She is a well-known computer scientist who has made many contributions to the world of artificial intelligence, and she really focuses on and has a passion for visual intelligence, understanding the real world. Yann LeCun said language alone is not enough to create a world model, which can also be described as artificial intelligence that understands the real world; it needs to actually see the world, and language alone can't do that, and that's why spatial intelligence is so promising. So let's look at some of her major contributions
before diving into this interview of hers. First, ImageNet: this is her most well-known contribution. ImageNet is a large-scale visual data set launched in 2009; it contains millions of labeled images across thousands of categories, which revolutionized the field of computer vision and deep learning. Basically, it is a data set that helps computers see the real world. She was an assistant professor at the University of Illinois, then an assistant professor at Princeton, and most recently became a full professor at Stanford University. But now she has started her own company, and that's what
she's going to talk about, along with what the company actually does, in this interview. So here's the interview; it's on the a16z YouTube channel, I'll drop a link in the description below if you want to watch it in full, and I'm going to play this at 1.5x speed because it's quite a long video. So maybe walk through a little bit about how we got here, kind of like your key contributions and insights along the way. So it is a very exciting moment, right? Just zooming back, AI is in a very exciting moment.
I personally have been doing this for for two decades plus and you know we have come out of the last AI winter we have seen the birth of modern AI then we have seen deep learning taking off showing us possibilities like plain tests but then we're starting to see the the the deepening of the technology and the industry um adoption of uh of some of the earlier possibilities like language models and now I think we're in the middle of a Cambrian explosion in almost a literal sense because now in addition to texts you're seeing pixels
videos, audio, all coming out with possible AI applications and models, so it's a very exciting moment. I know you both so well, and many people know you both so well because you're so prominent in the field, but not everybody like grew up in AI, so maybe it's kind of worth just going through your quick backgrounds to level set the audience. Yeah, sure. So I first got into AI at the end of my undergrad; I did math and computer science for undergrad at Caltech, it was awesome, but then towards
the end of that there was this paper that came out that was at the time a very famous paper, the cat paper, from Quoc Le and Andrew Ng and others who were at Google Brain at the time, and that was like the first time that I came across this concept of deep learning. And to me it just felt like this amazing technology, and that was the first time that I came across this recipe that would come to define the next, like, more than decade of my life, which is that you can get these amazingly
powerful learning algorithms that are very generic, couple them with very large amounts of compute, couple them with very large amounts of data, and magic things start to happen when you combine those ingredients. So I first came across that idea around 2011, 2012-ish, and I just thought, oh my God, this is going to be what I want to do, so it was obvious you've got to go to grad school to do this stuff. And then I sort of saw that Fei-Fei was at Stanford, one of the few people in the
world at the time who was kind of on that train, and that was just an amazing time to be in deep learning and computer vision specifically, because that was really the era when this went from these first nascent bits of technology that were just starting to work and really got developed and spread across a ton of different applications. And I remember during that time, around 2012, is when we really started to see the first commercially available image understanding tools. I remember seeing Meta launching a product like this where you could simply describe
something in the image and then it would Circle it and that at the time was absolutely mind-blowing we probably take it for granted now but really when I saw that for the first time over a decade ago I was just completely blown away so then over that time we saw the beginnings of language modeling we saw the beginnings of discriminative computer vision you could take pictures and understand what's in them in a lot of different ways we also saw some of the early bits of what we would Now call gen generative modeling generating images generating
text. A lot of those core algorithmic pieces actually got figured out by the academic community during my PhD years; there was a time I would just wake up every morning and check the new papers on arXiv, and it was like unwrapping presents on Christmas every morning. I still feel that way anytime a new paper is published on arXiv and it's interesting and exciting; I can't wait to read it, and I just love that he had that excitement so far back in the past. But what was fundamental, that my students and
my lab realized, probably earlier than most people, is that if you let data drive models, you can unleash a kind of power that we haven't seen before, and that was really the reason we went on a pretty crazy bet on ImageNet, which is, you know what, just forget about any scale we're seeing now; the scale then was thousands of data points, and at that point the NLP community had their own data sets, I remember. And so essentially that is what OpenAI did: they took that famous paper, Attention Is All You
Need, and they just continued to scale up the data set and the billions of parameters available to train the models, and what they figured out pretty quickly, probably before most other people, at least in bringing it to a commercial application, is yeah, if we scale it up, those scaling laws definitely apply. The UC Irvine data set, or some data set in NLP, was small in comparison; the community has their data sets, but all on the order of thousands or tens of thousands, and we were like, we need to drive it to internet scale, and luckily it was also the
coming of age of the internet, so we were riding that wave, and that's when I came to Stanford. So these epochs are what we often talk about, like ImageNet is clearly the epoch that created, or at least maybe made popular and viable, computer vision. In the GenAI wave we talk about two kind of core unlocks: one is the Transformers paper, which is attention, and we talk about Stable Diffusion. Is that a fair way to think about this, which is like there's these two algorithmic unlocks that came from academia or Google, and
like that's where everything comes from, or has it been more deliberate? So he's describing the Attention Is All You Need paper, which is the basis for the Transformer model, which is the basis for basically all large language models today, and Stable Diffusion, which is the basis for creating all of the generative art models out there. And the interesting thing, something that just happened, or really we just figured out how powerful it can be, is inference-time compute, and that is just another dimension that we're able to scale artificial intelligence on: not only increasing the data that goes into training the models, but also just allowing the models to quote-unquote think over time, allowing them to think more and scaling up that thinking process; the number of tokens that they use will improve the output. So we're in just as fascinating a time as it was way back then.
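To make that idea a little more concrete, here is a minimal sketch of what scaling inference-time compute can look like: give the model a bigger thinking budget per attempt, sample several attempts, and keep the best one under some scoring function (best-of-n). The `generate` and `score` functions below are hypothetical stand-ins for a model call and a verifier, not any lab's actual implementation.

```python
import random

def generate(prompt: str, max_thinking_tokens: int) -> str:
    """Hypothetical stand-in for a language model call (not a real API)."""
    # Pretend a larger token budget lets the model refine its answer further.
    refinement_steps = max_thinking_tokens // 512
    return f"answer to {prompt!r} after {refinement_steps} refinement steps"

def score(candidate: str) -> float:
    """Hypothetical stand-in for a verifier or reward model; random here."""
    return random.random()

def best_of_n(prompt: str, n_samples: int = 8, max_thinking_tokens: int = 4096) -> str:
    # Two ways to spend more inference-time compute: a bigger thinking budget
    # per sample, and more independent samples, keeping the highest-scoring one.
    candidates = [generate(prompt, max_thinking_tokens) for _ in range(n_samples)]
    return max(candidates, key=score)

print(best_of_n("How many primes are below 100?"))
```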
Or have there been other kinds of big unlocks that brought us here that we don't talk as much about? Yeah, I think the big unlock is compute. Like, I know the story of AI is often the story
of compute, but even no matter how much people talk about it, I think people underestimate it, right? The amount of growth that we've seen in computational power over the last decade is astounding. The first paper that's really credited with the breakthrough moment in computer vision for deep learning was AlexNet, which was a 2012 paper where a deep neural network did really well on the ImageNet challenge and just blew away all the other algorithms, the types of algorithms that Fei-Fei had been working on
in grad school. That AlexNet was a 60-million-parameter deep neural network, and it was trained for six days on two GTX 580s, which was the top consumer card at the time and came out in 2010. So I was looking at some numbers last night just to put these in perspective. And funnily enough, the GTX 580s that they're describing: right around that time, 2010, 2011, I purchased four or five of them, threw them into a laundry basket, put a fan next to it, and mined Bitcoin. The newest, latest and
greatest from Nvidia is the GB200. Do either of you want to guess how big a raw compute factor we have between the GTX 580 and the GB200? Shoot, no, go for it. It's in the thousands. So I ran the numbers last night: that training run of six days on two GTX 580s, if you scale the seconds, comes out to just under five minutes on a single GB200. Justin is making a really good point here.
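Here's that back-of-the-envelope math. The six days on two GTX 580s is from the interview; the per-GPU speedup factor is an assumed illustrative number (just something in the thousands that reproduces the "just under five minutes" figure), not an official benchmark.

```python
# Back-of-the-envelope check of the claim in the interview.
alexnet_days = 6          # AlexNet's training time (from the interview)
alexnet_gpus = 2          # number of GTX 580s used
gpu_speedup = 3_500       # assumed GTX 580 -> GB200 per-GPU factor (illustrative only)

gpu_seconds_2012 = alexnet_days * 24 * 3600 * alexnet_gpus   # total GPU-seconds back then
seconds_on_one_gb200 = gpu_seconds_2012 / gpu_speedup        # same work on a single GB200
print(f"{gpu_seconds_2012:,} GPU-seconds in 2012 -> ~{seconds_on_one_gb200 / 60:.1f} minutes today")
# 1,036,800 GPU-seconds in 2012 -> ~4.9 minutes today, i.e. "just under five minutes"
```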
What they're describing about that 2012 AlexNet paper is the exact reason why Nvidia is now one of the most valuable companies in the entire world. They've been making GPUs, parallel processing units, for decades, and they had only been used for video games and then for Bitcoin mining, because parallel processing and churning through all of that math is what GPUs are really good at. And now Jensen Huang and Nvidia had the foresight to realize the AI wave is going to need a ton of parallel compute, and they positioned themselves incredibly well; maybe one of the best corporate success
stories of all time. Practically the only difference between AlexNet and the ConvNets that came before it, what's the difference? It's the GPUs, the two GPUs, and the deluge of data. Yeah, well, so that's where I was going to go, which is, I think most people now are familiar with, quote, the bitter lesson, and the bitter lesson says: if you make an algorithm, don't be cute; just make sure you can take advantage of available compute, because the available compute will show up, right? And so, on the
other hand, there's another narrative, which seems to me to be just as credible, which is that it's actually new data sources that unlock deep learning, right? Like, ImageNet is a great example. A lot of people will say self-attention is great from Transformers, but they'll also say it's a way you can exploit human labeling of data, because it's the humans that put the structure in the sentences; and if you look at CLIP, they say, well, we're using the internet to actually have humans use the alt tag
to label images, right? And so that's a story of data, that's not a story of compute, and so is it just both, or is one more than
the other? I think it's both, but you're hitting on another really good point. I think there are actually two epochs that to me feel quite distinct in the algorithmics here. The ImageNet era is actually the era of supervised learning, and in the era of supervised learning you have a lot of data, but you don't know how to use the data on its own. The expectation of ImageNet and other data sets of that time period was that we're going to get a lot of images, but we need people to label every one, and all of the training data that we're going to train on, a person, a human labeler, has looked at every one and said something about that image. And the big algorithmic unlocks since then are that we know how to train on things that don't require human-labeled data. So I'm going to pause there for a second; we've been talking about this over the last week in
pretty recent videos: humans are really the fundamental limiter of AI growth. Whether you're talking about the data sets themselves or the labeling of the data sets, humans are the limitation, or even the research itself. If AI models are able to generate really great, unlimited data for other models in a really high-quality way, then all of a sudden we're going to have that intelligence explosion. Then we have unsupervised learning, which is basically what AlphaGo was all about, and so you have this system in which it could just try a bunch of permutations of the game Go
and figure out which ones worked best with essentially no human intervention whatsoever. And then finally we have projects like The AI Scientist by Sakana AI, which is actually doing the research on algorithmic unlocks for future models, and again completely autonomously with no human intervention. These multiple different dimensions are going to allow us to scale AI so much more quickly than we ever thought was possible. Today's video is brought to you by Mammouth. Mammouth AI brings all of the best models together in one place for one price: Claude, Llama, GPT-4o, Mistral, Gemini Pro, and even GPT-o1,
and rather than having to pay for each of these AIs separately, you pay $10 to Mammouth and they bring it all together in one place. Plus they have image generation: Midjourney, Flux Pro, DALL-E, and Stable Diffusion, again all for $10. Models are frequently updated as soon as they're released, so be sure to check out Mammouth for access to all the best models for one low price: mammouth.ai, that is m-a-m-m-o-u-t-h dot a-i. Thanks again to Mammouth. So let's walk to gen AI. So when I was doing my
PhD, before that you came, so I took machine learning from Andrew Ng, and then I took, like, Bayesian something very complicated from Daphne Koller, and it was very complicated for me; a lot of that was just predictive modeling. And then I remember the whole kind of vision stuff that you unlocked, but then the generative stuff has shown up, I would say, in the last four years, which is to me very different: you're not identifying, you're not, you know, predicting something, you're generating something. And so maybe kind of walk through
Yeah, that's interesting that he says you're not predicting something, because that is kind of what it does; it's basically guessing at what the next token is, so it really is just predicting. The key unlocks that got us there, and then why it's different, and if we should think about it differently, and is it part of a continuum or is it not? It is so interesting; even during my graduate time, generative models were there. We wanted to do generation; nobody remembers, even with the, uh, letters and numbers, we were trying to do some, you know,
Geoff Hinton had papers on generation; we were thinking about how to generate, and in fact, if you think from a probability distribution point of view, you can mathematically generate. It's just that nothing we generated would ever impress anybody, right? So this concept of generation, mathematically, theoretically, was there, but nothing worked. So then I do want to call out Justin's PhD. Justin was saying that he got enamored by deep learning, so he came to my lab. Justin's entire PhD is almost a mini story of the trajectory of the
field. He started his first project in data; I forced him to, he didn't like it. In retrospect, I learned a lot of really useful things. I'm glad you say that now. So we moved Justin to deep learning, and the core problem there was taking images and generating words. Well, actually, I think there were three discrete phases on this trajectory. The first one was actually matching images and words; my first paper, both of my
PhD and, like, ever, my first academic paper ever, was the image retrieval with scene graphs paper. And then we went into the generative, taking pixels and generating words, and Justin and Andrej really worked on that, but that was still a very, very lossy way of generating and getting information out of the pixel world. And then in the middle, Justin went off and did a very famous piece of work, and it was the first time that someone made it real time, right? Yeah, so the story there is there was this paper that came
out in 2015, A Neural Algorithm of Artistic Style, led by Leon Gatys, and the paper came out and they showed these real-world photographs that they had converted into van Gogh style. We are kind of used to seeing things like this in 2024, but this was 2015, so this paper just popped up on arXiv one day and it blew my mind. I just got this brainworm in my brain in 2015, and it did something to me, and I thought, oh
my God, I need to understand this algorithm, I need to play with it, I need to make my own images into van Gogh. So then I read the paper, and over a long weekend I reimplemented the thing and got it to work. It was actually a very simple algorithm; my implementation was like 300 lines of Lua, because at the time this was pre-PyTorch, so we were using Lua Torch. But it was a very simple algorithm, and it was slow,
right, so it was an optimization-based thing: every image you want to generate, you need to run this optimization loop, run this gradient descent loop, for every image that you generate. The images were beautiful, but I just wanted it to be faster, and Justin just did it, and it was actually, I think, your first taste of an academic work having an industry impact. A bunch of people had seen this artistic style transfer stuff at the time, and me and a couple of others at the same time came up with different ways to speed it up.
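For anyone curious what that per-image optimization loop looks like, here is a compressed sketch in the spirit of Gatys et al. (2015), written with PyTorch and torchvision rather than the original Lua Torch. The layer indices, loss weights, and random placeholder images are illustrative choices, not the exact recipe from the paper or from Justin's reimplementation.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def features(x, layers=(1, 6, 11, 20, 29)):
    """Collect activations at a few ReLU layers (indices are illustrative)."""
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats.append(x)
    return feats

def gram(f):
    """Gram matrix of a feature map: the style statistic used in the paper."""
    b, c, h, w = f.shape
    f = f.view(c, h * w)
    return f @ f.t() / (c * h * w)

content_img = torch.rand(1, 3, 256, 256, device=device)  # placeholders; load real photos here
style_img = torch.rand(1, 3, 256, 256, device=device)
target_content = features(content_img)
target_style = [gram(f) for f in features(style_img)]

# The key point from the transcript: every output image is its own optimization
# loop over the pixels, which is why the original method was slow.
image = content_img.clone().requires_grad_(True)
opt = torch.optim.Adam([image], lr=0.02)
for step in range(300):
    opt.zero_grad()
    feats = features(image)
    content_loss = F.mse_loss(feats[-1], target_content[-1])
    style_loss = sum(F.mse_loss(gram(f), g) for f, g in zip(feats, target_style))
    (content_loss + 1e4 * style_loss).backward()
    opt.step()
```

Justin's follow-up work replaced this per-image loop with a feed-forward network trained once, which is what made it real time.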
Yeah, but mine was the one that got a lot of traction, right? So I was very proud of Justin, but there's one more thing I was very proud of Justin for, to connect to generative AI, which is that before the world understood generative AI, Justin's last piece of work in his PhD, which I knew about because I was forcing you to do it, was actually inputting language and getting a whole picture out. It's one of the first generative works; it used GANs, which were so hard to
use. But the problem is that we were not ready to use a natural piece of language, so, Justin, you heard, he worked on scene graphs, so we had to input a scene graph language structure, so you know, the sheep, the grass, the sky, in a graph way; it literally was one of our photos, right. And then he and another very good master's student, Agrim, got that GAN to work. So you can see, from data, to matching, to style transfer, to generative images, we're starting to see. You ask
if this is an abrupt change: for people like us it's already happening in a continuum, but for the world, the results are more abrupt. So yeah, and it's interesting because a lot of people say the same thing today about AGI and ASI: it's not going to be one sudden point in time, it's going to be gradual and incremental. So it's interesting to see that people who have been in the industry a long time, not myself, but Fei-Fei and Justin, have seen this gradual continuum over a decade, multiple decades. Next they
are going to start talking about the importance of AI being able to understand the 3D, natural world and why that is going to unlock so much potential, so let's watch. It seems for a long time, and I'll talk to you, Fei-Fei, like a lot of your research and your direction has been towards kind of spatial stuff and pixel stuff and intelligence, and now you're doing World Labs and it's around spatial intelligence, so maybe talk through, you know, has this been part of a long journey
for you? Like, why did you decide to do it now? Is it a technical unlock, is it a personal unlock? Just kind of move us from that milieu of AI research to World Labs. Sure, for me it is both personal and intellectual, right? My entire, you talk about my book, my entire intellectual journey is really this passion to seek North Stars, but also believing that those North Stars are critically important for the advancement of our field. So at the beginning, I remember after graduate school I thought my North Star was
telling stories of images, because for me that's such an important piece of visual intelligence, as part of what you call AI or AGI. But when Justin and Andrej did that, I was like, oh my God, that was my life's dream, what do I do next? It came a lot faster; I thought it would take a hundred years to do that. But visual intelligence is my passion, because I do believe for every intelligent being, like people or robots or some other form, knowing how to see the world, reason about
it, interact in it, whether you're navigating or manipulating or making things, you can even build civilization upon it. Visual, spatial intelligence is so fundamental; it's as fundamental as language, possibly more ancient and more fundamental in certain ways. So it's very natural for me that World Labs' North Star is to unlock spatial intelligence. And thinking back to what I said at the beginning of the video, this is kind of what Yann LeCun has been saying: language models alone are not enough to create world models. We will
see if that's true or false. Fei-Fei seems to agree, although she's not saying it as definitively as Yann LeCun has, but what she is saying is AI's ability to interpret the real world is absolutely fundamental. Let's keep watching. The moment to me is right to do it. Like Justin was saying, compute: we've got these ingredients, we've got compute, we've got a much deeper understanding of data, way deeper than the ImageNet days; compared to those days we're so much more sophisticated, and we've got some advancement of algorithms, including co-founders of World Labs
like Ben Mildenhall and Christoph Lassner, who were at the cutting edge of NeRF, so that we are in the right moment to really make a bet and to focus and just unlock that. So I just want to clarify for folks that are listening to this: you're starting this company, World Labs, and spatial intelligence is kind of how you're generally describing the problem you're solving. Can you maybe try to crisply describe what that means? Yeah, so spatial intelligence is about machines' ability to perceive, reason, and act in 3D
space and time: to understand how objects and events are positioned in 3D space and time, how interactions in the world can affect those 3D, 4D positions over space-time, and to perceive, reason about, generate, and interact with it, really taking the machine out of the mainframe or out of the data center and putting it out into the world and understanding the 3D, 4D world with all of its richness. So a company that has a tremendous amount of 3D real-world data that can be used to train spatial
intelligence is Tesla, of course. They have millions and millions of miles of real-world data that has been ingested through the cameras in Tesla vehicles and just piped into a large database, and they are just training on it constantly. But they can do so much more with that data than just Autopilot, which by the way is an incredible feat in itself; that data can also be used to train Optimus, their robot, and the Optimus robot will then know how to operate within the real world, how to interpret the real world, understand the events that
are occurring, exactly what Justin is saying. Let's keep watching. To be very clear, are we talking about the physical world, or are we just talking about an abstract notion of a world? I think it can be both, and that encompasses our vision long term. Even if you're generating worlds, even if you're generating content, doing that positioned in 3D has a lot of benefits; or if you're recognizing the real world, being able to put 3D understanding into the real world as well is part of
it so he's saying not only can we interpret and understand the real world using spatial intelligence but we can actually generate a world and then we're getting into simulation Theory and stuff we've talked about quite a bit on this channel which I find fascinating but then we think about things like Sora which was able to generate video that looked incredibly realistic the physics in the video looked realistic and they weren't really using spatial intelligence to do that so there are multiple tracks happening at the same time right now that are trying to unlock this real
world intelligence. So I mean, just for everybody listening, the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in the field, at the same level. These four decided to come out and do this company now, and so I'm trying to dig into why now is the right time. Yeah, I mean, this is again part of a longer evolution for me, but really, after my PhD, when I was really wanting to develop into my own independent researcher for my later career, I was just thinking, what
are the big problems in Ai and computer vision um and the conclusion that I came to about that time was that the previous decade had mostly been about understanding data that already exists um but the next decade was going to be about understanding new data and if we think about that the data that already exists was all of the images and videos that maybe existed on the web already and the next decade was going to be about understanding new data right like people are people have smartphones smartphones are collecting cameras those cameras have new sensors
those cameras are positioned in the 3D world it's not just you're going to get a bag of pixels from the internet and know nothing about it and try to say if it's a cat or a dog we want to treat these treat images as universal sensors to the physical world and how can we use that to understand the 3D and 4D structure of the world um either in physical spaces or or or generative spaces so I made a pretty big pivot post PhD into 3D computer vision predicting 3D shapes of objects with some of my
colleagues at FAIR at the time. Then later I got really enamored by this idea of learning 3D structure through 2D, right, because we talk about data a lot; 3D data is hard to get on its own, but there's a very strong mathematical connection here: our 2D images are projections of a 3D world, and there's a lot of mathematical structure here we can take advantage of. So even if you have a lot of 2D data, there are a lot of people who have done amazing work to figure out how you can
back out the 3D structure of the world from large quantities of 2D observations oh that's so interesting something I had not really thought about yeah we collect a ton of 2D data 2D because it needs to basically be projected onto a 2d screen and so all of our cameras taking photos taking videos it's all 2D representations of 3D environments now I immediately think to the Apple Vision Pro and now I have an iPhone 16 and that can take spatial video which is 3D to the best of my understanding so all of a sudden we're going
to have this huge flood of 3D video coming in to train these future models on, which is interesting to think about. I haven't thought a ton about it, so that's all I'm going to say about it for now, but let me know what you think in the comments. And then in 2020, you asked about breakthrough moments, there was a really big breakthrough moment from our co-founder Ben Mildenhall, at the time, with his paper NeRF, neural radiance fields, and that was a very simple, very clear way of backing out 3D structure from 2D observations.
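As an aside, here is a toy, untrained sketch of the core NeRF idea: render a pixel by sampling density and color along a camera ray and compositing the samples with the volume rendering quadrature. A hand-written red sphere stands in for the learned MLP; in a real NeRF that field is a neural network fit so that its renders match the training photos, which is how the 3D structure gets backed out of 2D observations.

```python
import numpy as np

def field(points):
    """Stand-in for NeRF's MLP: density and RGB at each 3D point (a soft red sphere)."""
    dist = np.linalg.norm(points - np.array([0.0, 0.0, 4.0]), axis=-1)
    density = 20.0 * (dist < 1.0)                        # opaque inside the unit sphere
    color = np.tile([1.0, 0.2, 0.2], (len(points), 1))   # constant red
    return density, color

def render_ray(origin, direction, n_samples=64, near=0.1, far=8.0):
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction              # sample points along the ray
    density, color = field(points)
    delta = np.diff(t, append=far)                        # spacing between samples
    alpha = 1.0 - np.exp(-density * delta)                # opacity of each segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha                       # contribution of each sample
    return (weights[:, None] * color).sum(axis=0)         # composited pixel color

pixel = render_ray(origin=np.zeros(3), direction=np.array([0.0, 0.0, 1.0]))
print(pixel)  # a red-ish pixel, since this ray passes through the sphere
```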
That just lit a fire under this whole space of 3D computer vision. I think there's another aspect here that maybe people outside the field don't quite understand, which is that this was also a time when large language models were starting to take off. A lot of the stuff with language modeling actually had gotten developed in academia; even during my PhD I did some work with Andrej Karpathy on language modeling in 2014, LSTMs, I remember LSTMs, RNNs, GRUs, this was pre-Transformer. But then at some point, around the GPT-2 time,
you couldn't really do those kinds of models anymore in academia because they took way more resourcing. But there was one interesting thing: the NeRF approach that Ben came up with, you could train these in an hour, a couple of hours, on a single GPU. So I think at that time there was a dynamic here that happened, which is that a lot of academic researchers ended up focusing on these problems, because there was core algorithmic stuff to figure out and because you could actually
do a lot without a ton of compute, and you could get state-of-the-art results on a single GPU. Because of those dynamics, a lot of researchers in academia were moving to think about what are the core algorithmic ways that we can advance this area as well. Then I ended up chatting with Fei-Fei more, and I realized that we were actually, she's very convincing, she's very convincing. Well, you know, you were talking about trying to figure out your own independent research
trajectory from your adviser; well, it turns out we ended up, oh no, kind of converging on similar things. Okay, well, from my end, I wanted to talk to the smartest person, so I called Justin, and there's no question about it. I do want to talk about a very interesting technical issue, or technical story, of pixels that most people who work in language don't realize, which is that before this era, in the field of computer vision, those of us who work on pixels actually have a long history in an area of research called
reconstruction, 3D reconstruction, which, you know, dates back to the 70s. You can take photos, because humans have two eyes, right, so in general it starts with stereo photos, and then you try to triangulate the geometry and make a 3D shape out of it. It is a really, really hard problem; to this day it's not fundamentally solved, because there's correspondence and all that. So this whole field, which is an older way of thinking about 3D, has been going around and it has been making really good progress.
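That classical pipeline is worth seeing in miniature: given the same point observed in two images (a known correspondence) and known camera matrices, you can triangulate its 3D position with the standard linear (DLT) method. The cameras and point below are made up for illustration; real systems also have to solve the correspondence problem Fei-Fei mentions, which is the hard part.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation: solve A X = 0 for the homogeneous 3D point X."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # null vector of A, up to scale
    return X[:3] / X[3]

def project(P, X):
    """Pinhole projection of a 3D point to 2D pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two toy cameras: identity intrinsics, the second shifted one unit along x (a stereo pair).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

point = np.array([0.5, 0.2, 5.0])                   # ground-truth 3D point
x1, x2 = project(P1, point), project(P2, point)     # its two 2D observations
print(triangulate(P1, P2, x1, x2))                   # recovers ~[0.5, 0.2, 5.0]
```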
But when NeRF happened, in the context of generative methods, in the context of diffusion models, suddenly reconstruction and generation started to really merge. Now, within really a short period of time in the field of computer vision, it's hard to talk about reconstruction versus generation; we suddenly have a moment where, if we see something or if we imagine something, both can converge towards generating it, and that's to me a really important moment for computer vision, but most people missed it because we're not talking about it as much as LLMs, right? So in pixel
space there's reconstruction, where you reconstruct a scene that's real, and then if you don't see the scene, you use generative techniques, right? So these things are kind of very similar, and we've seen a ton of examples of NeRF; Bilawal Sidhu posts NeRF examples all the time, and they are incredible to see: here's a 2D image that was translated into 3D. And this is a lot of the same technology that is being used in, like, Apple Photos, and when you're scrolling through Apple TV and you're seeing the poster for a movie kind of
rotate in 3D, this is all the same technology. Now they're going to talk about the difference between language models and spatial models, so let's watch. Throughout this entire conversation you're talking about language and you're talking about pixels, so maybe it's a good time to talk about how spatial intelligence and what you're working on contrasts with language approaches, which of course are very popular now. Like, is it complementary, is it orthogonal? Yeah, I think they're complementary. I don't mean to be too leading here, like maybe just contrast them; like, everybody says,
listen, I know OpenAI and I know GPT and I know multimodal models, and a lot of what you're talking about is, like, they've got pixels and they've got language, and doesn't this kind of do what we want to do with spatial reasoning? Yeah, so I think to answer that you need to open up the black box a little bit of how these systems work under the hood. So with language models and the multimodal language models that we're seeing nowadays, their underlying representation under the hood is a one-dimensional representation.
We talk about context lengths, we talk about Transformers, we talk about sequences, attention; fundamentally their representation of the world is one-dimensional, so these things fundamentally operate on a one-dimensional sequence of tokens. This is a very natural representation when you're talking about language, because written text is a one-dimensional sequence of discrete letters, so that kind of underlying representation is the thing that led to LLMs. And now with the multimodal LLMs that we're seeing, you kind of end up shoehorning the other modalities into this underlying representation of a 1D sequence of tokens.
Now, when we move to spatial intelligence, it's kind of going the other way, where we're saying that the three-dimensional nature of the world should be front and center in the representation. So from an algorithmic perspective, that opens up the door for us to process data in different ways, to get different kinds of outputs out of it, and to tackle slightly different problems. So even at a coarse level, you kind of look outside and you say, oh, multimodal LLMs can look at images too; well, they can, but I think they don't
have that fundamental 3D representation at the heart of their approaches. That's so fascinating, and I'm learning as we're watching this right now. As we're adding multiple modalities to large language models, as he said, we're fitting three dimensions into a one-dimensional space, and that seems really inefficient now that I'm hearing him describe it that way. But if we take the opposite approach and we start with 3D, and that is fundamentally how the model works and understands the world, then converting it to 1D to use language might be easy. But let's keep watching.
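To make the contrast concrete, here's a schematic comparison of the two kinds of underlying representations, with made-up sizes: an LLM's context is one flat sequence of token IDs, and images get flattened into that same sequence, while a spatially grounded model can keep an explicitly 3D structure, such as a voxel grid or a point cloud, at its core.

```python
import numpy as np

# 1D: a context window is just a sequence of discrete tokens.
context_length = 8192
token_sequence = np.zeros(context_length, dtype=np.int64)             # shape (8192,)

# Even an image fed to a multimodal LLM gets "shoehorned" into that 1D sequence,
# e.g. a 16x16 grid of patch tokens flattened into 256 positions.
image_patch_tokens = np.zeros((16, 16), dtype=np.int64).reshape(-1)    # shape (256,)

# 3D-first: a scene kept as a voxel grid or a point cloud preserves which things
# are next to which in space, rather than in reading order.
voxel_grid = np.zeros((128, 128, 128, 4), dtype=np.float32)            # x, y, z, RGBA
point_cloud = np.zeros((100_000, 6), dtype=np.float32)                 # x, y, z, r, g, b

for name, arr in [("token sequence", token_sequence), ("flattened image", image_patch_tokens),
                  ("voxel grid", voxel_grid), ("point cloud", point_cloud)]:
    print(f"{name:16s} shape={arr.shape}")
```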
I totally agree with Justin. I think the 1D versus fundamentally 3D representation is one of the most core differentiators. The other thing, it's slightly philosophical but really important, for me at least, is that language is fundamentally a purely generated signal. There's no language out there; you don't go out into nature and there are words written in the sky for you. So this again sounds very similar to how Yann LeCun describes large language models and their limitations, so it seems like they're on the same page, although approaching solving the problem in different
ways. Whatever data you feed it, you pretty much can just somehow regurgitate, with enough generalizability, the same data out, and that's language to language. But the 3D world is not like that: there is a 3D world out there that follows the laws of physics, that has its own structures due to materials and many other things, and to fundamentally back that information out and be able to represent it and be able to generate it is just fundamentally quite a different problem. We will be borrowing similar ideas, or useful ideas, from language and LLMs, but this
is fundamentally philosophically to me a different problem right so language 1D and probably a bad representation of the physical world because it's been generated by humans and it's probably lossy there's a whole another modality of generative AI models which are pixels and these are 2D image and 2D video and like one could say that like if you look at a video it looks you know you can see 3D stuff because like you can pan a camera or whatever it is and so like how would like spatial intelligence be different than say 2D video here when
I think about this it's useful to disentangle two things um one is the underlying representation and then two is kind of the the user facing affordances that you have um and here's where where you can get sometimes confused because um fundamentally we see 2D right like our retinas are 2D structures in our bodies and we've got two of them so like fundamentally our visual system perceives 2D images um but the problem is that depending on what representation you use there could be different affordances that are more natural or less natural so even if you are
at the end of the day you might be seeing a 2D image or a 2d video um your brain is perceiving that as a projection of a 3D World so there's things you might want to do like move objects around move the camera around um in principle you might be able to do these with a purely 2D representation and model but it's just not a fit to the problems that you're asking the model to do right like modeling the 2D projections of a dynamic 3D world is is a function that probably can be modeled but
by putting a 3D representation Into the Heart of a model there's just going to be a better fits between the kind of representation that the model is working on and the kind of tasks that you want that model to do so our bet is that by threading a little bit more 3D representation under the hood that'll enable better affordances for for users and this also goes back to the Northstar for me you know why is it spacial intelligence why is it not flat pixel intelligence is because I think the Arc of intelligence has to go
to what Justin calls affordances, and the arc of intelligence, if you look at evolution, right, the arc of intelligence eventually enables animals and humans, especially humans as intelligent animals, to move around the world, interact with it, create civilization, create life, make a sandwich, whatever you do in this 3D world. And translating that into a piece of technology that is natively 3D is fundamentally important for the floodgate of possible applications, even if for some of them the serving looks 2D, but it's innately 3D.
and again I want to think back to the Apple Vision Pro and the Oculus and this whole arvr Revolution that has yet to come but all of a sudden we are capturing all of this information about the 3D World which can be used to train models that are based on spatial intelligence spatial awareness so very interesting to think about I don't know a ton about these topics but definitely fascinating to Think Through I think this is actually very subtle and Incredibly critical point and so I think it's worth digging into and a way to do
this is talking about use cases and so just to level set this we're talking about generating a technology let's call it a model that can do spatial intelligence so maybe in the abstract what might that look like kind of a little bit more concretely what would be the potential use cases that you could apply this to so I think there's there's a couple different kinds of we imagine these spatially intelligent models able to do over time um and one that I'm really excited about is World Generation we're we're all used to something like a text
image generator, starting to see text-to-video generators, where you put in a prompt and out pops an amazing image or an amazing short clip. But I think you could imagine leveling this up and getting 3D worlds out, so one thing that we could imagine spatial intelligence helping us with in the future is upleveling these experiences into 3D, where we're not getting just an image out or just a clip out, but you're getting out a full simulated, vibrant, and interactive 3D world. Yeah, we're definitely in the territory of simulation theory
now and once again I'm going to bring up Apple Vision Pro and other arvr Technologies and again thinking about Sora as a world generator we've talked about world generators in the past and even the AI Doom project that we talked about a couple weeks ago where a diffusion model was actually able to predict frame by frame the world of the 1990s game Doom with all of the logic all of the 3D of it but with no game engine no predefined code nothing like that and that is what he's describing this could transform video games could
transform all content, movies, TV, but think even beyond that; I mean, this could completely change the way that we view reality. Maybe for gaming, right, maybe for gaming, maybe for virtual photography, you name it; I think even if you got this to work, there'd be a million applications. For education? Yeah, for education. I mean, I guess one of my things is that in some sense this enables a new form of media, right, because we already have the ability to create virtual interactive worlds, but it costs hundreds
of millions of dollars and a ton of development time, and as a result, where do people drive this technological ability? It's video games, right? Because we do have the ability as a society to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, the only economically viable use of that technology in its form today is games that can be sold for $70 apiece to millions and millions of people to recoup the investment. If we
had the ability to create these same virtual interactive vibrant 3D worlds um you could see a lot of other applications of this right because if you bring down that cost of producing that kind of content then people are going to use it for other things what if you could have an inter like sort of a personalized 3D experience that's as good as rich as detailed as one of these AAA video games that cost hundreds of millions of dollars to produce but it could be catered to like this very Niche thing that only maybe a couple
people would want that particular thing. That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by spatial intelligence in the generative realm. Yeah, agreed completely; that sounds incredible. I just want to describe a world and I want to go explore it; it doesn't have to be a video game, maybe it is, maybe it isn't, but I just want to describe different worlds and see what it would be like to live within them, and that is just so cool, so
futuristic. I want to think more about it and I want to experience it. All right, in this last section they're going to talk about how all of this applies to AR and VR, something that I've been talking about a lot in this video, so I'm excited to hear what they have to say about it. The case that Justin was talking about would be like the generation of a virtual world for any number of use cases; the one that you're just alluding to would be more of an augmented reality, right? Yes. Just around the time World Labs
was being formed, the Vision Pro was released by Apple, and they used the term spatial computing; we're like, they almost stole our term, but we're spatial intelligence. So spatial computing needs spatial intelligence, that's exactly right. We don't know what hardware form it will take, whether it'll be goggles, glasses, contact lenses, but that interface between the true real world and what you can do on top of it, whether it's to help you augment your capability to work on a piece of machinery and fix your car even if you are not a trained
mechanic, or to just be a Pokémon Go++ for entertainment, suddenly this piece of technology is going to be the operating system, basically, for AR, VR, mixed reality. Yeah, this is definitely a new way to think about computing in general. As large language models get better, they seemingly are becoming the operating system of the future, but then beyond that, maybe spatial intelligence and spatial computing is the operating system of the 3D world's future, and there's so much to think about. It is so cool; drop all of your thoughts in the
comments below; I want to read them, I want to know what you think. In the limit, like, what does an AR device need to do? It's this thing that's always on, it's with you, it's looking out into the world, so it needs to understand the stuff that you're seeing and maybe help you out with tasks in your daily life. But I'm also really excited about this blend between virtual and physical; that becomes really critical. If you have the ability to understand what's around you in real time in perfect 3D, then it actually starts to deprecate
large parts of the real world as well. Like, right now, how many differently sized screens do we all own for different use cases? Too many, right? You've got your phone, you've got your iPad, you've got your computer monitor, you've got your TV, you've got your watch; these are all basically differently sized screens because they need to present information to you in different contexts and different positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it kind of deprecates the need for all
of those. It just, ideally, seamlessly blends information that you need to know in the moment with the right mechanism of giving you that information. Yeah, I just talked about this in a previous video: the perfect implementation of hardware for AI in general is one that you basically don't have to wear at all; it just understands the world around you, it can see it, it can hear it, it can sense the 3D aspects of it, and then of course it can project onto it. I don't know what that looks like. Mark Zuckerberg thinks it's
glasses; Apple and Tim Cook think it's goggles. We'll see. Another huge case of being able to blend the digital, virtual world with the 3D physical world is for any agents to be able to do things in the physical world. If humans use these mixed-reality devices to do things, like I said, I don't know how to fix a car, but if I have to, I put on these goggles or glasses and suddenly I'm guided to do that. But there are other types of agents, namely robots, any kind of robots, not just humanoid,
and their interface by definition is the 3D world, but their compute, their brain, by definition is the digital world. So what connects that, from the learning to the behaving, between a robot brain and the real world? It has to be spatial intelligence. So that's it for today; I cannot wait to learn more about this topic, it just seems so fascinating, another approach to intelligence. If you enjoyed this video, please consider giving a like and subscribe, and I'll see you in the next one.