Jim Fan on Nvidia’s Embodied AI Lab and Jensen Huang’s Prediction that All Robots will be Autonomous

Sequoia Capital
AI researcher Jim Fan has had a charmed career. He was OpenAI’s first intern before he did his PhD a...
Video Transcript:
So from the chip level, which is the Jetson Thor family, to the foundation model, Project GR00T, and also to the simulation tools and the utilities that we build along the way, it will become a platform, a computing platform, for humanoid robots and for intelligent robots in general. I want to quote Jensen here. One of my favorite quotes from him is that everything that moves will eventually be autonomous. I believe in that as well. It's not the case right now, but let's say 10 years or more from now, if we believe that there will be as many intelligent robots as iPhones, then we'd better start building that today.

Hi, and welcome to Training Data. We have with us today Jim Fan, Senior Research Scientist at NVIDIA. Jim leads NVIDIA's embodied AI agent research with a dual mandate spanning robotics in the physical world and gameplay agents in the virtual world. Jim's group is responsible for Project GR00T, NVIDIA's humanoid robots that you may have seen on stage with Jensen at this year's GTC. We're excited to ask Jim about all things robotics: why now, why humanoids, and what's required to unlock a GPT-3 moment for robotics. Welcome to Training Data.

Thank you for having me.

We're so excited to dig in today and learn about everything you have to share with us around robotics and embodied AI. Before we get there, you have a fascinating personal story. I think you were the first intern at OpenAI. Maybe walk us through some of your personal story and how you got to where you are.

Absolutely, I would love to share the stories with the audience.
Back in the summer of 2016, some of my friends said there's a new startup in town and you should check it out. I didn't have anything else to do, because I had been accepted to a PhD program and that summer I was idle, so I decided to join this startup, and it turned out to be OpenAI. During my time at OpenAI we were already talking about AGI, back in 2016. My intern mentors were Andrej Karpathy and Ilya Sutskever, and together we worked on a project called World of Bits. The idea is very simple: we want to build an AI agent that can read computer screens, read the pixels from the screen, and then control the keyboard and mouse. If you think about it, this interface is as general as it can get. All the things that we do on a computer, like replying to emails or playing games or browsing the web, can be done through this interface, mapping pixels to keyboard and mouse control. That was actually my first attempt at AGI at OpenAI, and the first chapter of my journey in AI agents.

I remember World of Bits, actually. I didn't know you were a part of that; that's really interesting.

Yeah, it was a very fun project, and it was part of a bigger initiative called OpenAI Universe, which was a bigger platform for integrating all applications and games into this framework.
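To make the generality of that interface concrete, here is a minimal sketch of an agent loop over screen pixels and keyboard/mouse events. The class and environment interface below are hypothetical illustrations, not the actual World of Bits API.

```python
import numpy as np

class PixelsToActionsAgent:
    """Hypothetical agent with a World-of-Bits-style interface:
    observe raw screen pixels, emit keyboard/mouse events."""

    def act(self, screen: np.ndarray) -> dict:
        # screen: an H x W x 3 RGB array captured from the display.
        # A real agent runs a learned policy here; this stub just
        # clicks the center of the screen.
        h, w, _ = screen.shape
        return {"type": "mouse_click", "x": w // 2, "y": h // 2}

def run_episode(env, agent, max_steps=100):
    """Generic interaction loop: email, games, and web browsing all
    reduce to pixels in, keyboard/mouse events out."""
    screen = env.reset()
    for _ in range(max_steps):
        action = agent.act(screen)
        screen, _reward, done = env.step(action)
        if done:
            break
```

The same loop works for any application, which is exactly why the interface is so general, and also why it is so hard to learn.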
What do you think were some of the unlocks then, and what were some of the challenges that you had with agents back then?

Back then the main method that we used was reinforcement learning; there was no LLM, no Transformer, back in 2016. The thing is, reinforcement learning works on specific tasks, but it doesn't generalize. We couldn't give the agent arbitrary language and instruct it to do arbitrary things that we can do with a keyboard and mouse. So back then it kind of worked on the tasks that we designed, but it didn't really generalize. That started my next chapter: I went to Stanford and started my PhD with Professor Fei-Fei Li, and we started working on computer vision and also embodied AI. During my time at Stanford, which was from 2016 to 2021, I witnessed the transition of the Stanford Vision Lab, led by Fei-Fei, from static computer vision, like recognizing images and videos, to more embodied computer vision, where an agent learns perception and takes actions in an interactive environment. That environment can be virtual, as in simulation, or it can be the physical world. So that was my PhD, transitioning to embodied AI. After I graduated I joined NVIDIA and have stayed there ever since, carrying my work over from my PhD thesis and still working on embodied AI to this day.

So you oversee the embodied AI initiative at NVIDIA. Maybe say a word on what that means and what you all are hoping to accomplish.
The team that I am co-leading right now is called GEAR, G-E-A-R, which stands for Generalist Embodied Agent Research. To summarize what our team works on in three words: we generate actions. We build embodied AI agents, and those agents take actions in different worlds. If the actions are taken in the virtual world, that's gaming AI and simulation, and if the actions are taken in the physical world, that's robotics. Earlier this year, at the March GTC keynote, Jensen unveiled something called Project GR00T, which is NVIDIA's moonshot effort at building foundation models for humanoid robotics, and that's basically what the GEAR team is focusing on right now. We want to build the AI brain for humanoid robots, and even beyond.

What do you think is NVIDIA's competitive advantage in building that?

That's a great question. One is, for sure, compute resources. All of these foundation models require a lot of compute to scale up, and we do believe in scaling laws. There were scaling laws for LLMs, but the scaling laws for embodied AI and robotics are yet to be studied, so we're working on that. The second strength of NVIDIA is simulation. Before NVIDIA was an AI company, it was a graphics company, so NVIDIA has many years of expertise in building simulation: physics simulation, rendering, and real-time acceleration on GPUs. We are using simulation heavily in our approach to building robotics.
The simulation strategy is super interesting. Why do you think most of the industry is still very focused on real-world data, the opposite strategy?

I think we need all kinds of data; simulation and real-world data by themselves are not enough. At GEAR we divide the data strategy into roughly three buckets. One is internet-scale data, all the text and videos online. The second is simulation data, where we use NVIDIA simulation tools to generate lots of synthetic data. And the third is real robot data, where we collect data by teleoperating the robot and recording it on the robot platforms. I believe a successful robotics strategy will involve the effective use of all three kinds of data, mixing them, and delivering a unified solution.

Can you say more about that? We were talking earlier about how data is fundamentally the key bottleneck in making a robotics foundation model actually work. Can you say more about your conviction in that idea, and what exactly it takes to make great data to break through this problem?

The three kinds of data that I just mentioned have different strengths and weaknesses. Internet data is the most diverse, and it encodes a lot of common-sense priors. For example, most of the videos online are human-centered, because we love to take selfies and record each other doing all kinds of activities, and there are also a lot of instructional videos online. We can use that to learn how humans interact with objects and how objects behave under different situations.
That kind of data provides a common-sense prior for the robot foundation model, but internet-scale data doesn't come with actions; we cannot download the motor-control signals of the robots from the internet. That goes to the second part of the data strategy, which is using simulation. In simulation you can have all the actions, and you can also observe the consequences of the actions in that particular environment. The strength of simulation is that it's basically infinite data, and the data scales with compute: the more GPUs you put into the simulation pipeline, the more data you will get. The data also comes in much faster than real time. If you collect data only on the real robot, you're limited to 24 hours per day, but with GPU-accelerated simulators we can accelerate real time by 10,000x, so we can collect data at much higher throughput given the same wall-clock time. That's the strength. The weakness is that no matter how good the graphics pipeline is, there will always be a simulation-to-reality gap: the physics will be different from the real world, the visuals will never look exactly as realistic as the real world, and there is also a diversity issue, because the content in the simulation will not be as diverse as all the scenarios we encounter in the real world. Then there's real robot data. That data doesn't have the sim-to-real gap, because it's collected on the real robot, but it's much more expensive to collect: you need to hire people to teleoperate the robots, and again you're limited by the speed of the world of atoms, you only have 24 hours per day, and you need humans to collect the data, which is also very expensive. So we see these three types of data as having complementary strengths, and I think a successful strategy is to combine their strengths and remove their weaknesses.
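As a purely illustrative sketch of what combining those three buckets might look like in a training pipeline, here is a toy data mixer; the source names, weights, and `sample_example` interface are placeholders, not NVIDIA's actual pipeline.

```python
import random

# Three complementary data sources; the weights are arbitrary placeholders.
DATA_SOURCES = {
    "internet_video": {"weight": 0.5, "has_actions": False},  # diverse, common-sense prior
    "simulation":     {"weight": 0.3, "has_actions": True},   # near-infinite, sim-to-real gap
    "real_robot":     {"weight": 0.2, "has_actions": True},   # no sim-to-real gap, expensive
}

def sample_mixed_batch(datasets, batch_size=256, rng=random):
    """Draw one training batch from the weighted mixture of sources.
    `datasets` maps each source name to an object with .sample_example()."""
    names = list(DATA_SOURCES)
    weights = [DATA_SOURCES[n]["weight"] for n in names]
    sources = rng.choices(names, weights=weights, k=batch_size)
    return [datasets[s].sample_example() for s in sources]
```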
The cute GR00T robots that were on stage with Jensen, that was such a cool moment. If you had to dream: in one, five, ten years, what do you think your group will have accomplished?

This is pure speculation, but I hope that we can see a research breakthrough in robot foundation models maybe in the next two to three years. That's what we call a GPT-3 moment for robotics. After that it's a bit uncertain, because to have robots enter people's daily lives there are a lot more things than just the technical side: the robots need to be affordable and mass-produced, and we also need safety for the hardware, and privacy and regulations. Those will take longer, so it's harder to predict when robots will be able to hit a mass market, but I do hope the research breakthrough will come in the next two to three years.

What do you think will define what a GPT-3 moment for robotics looks like?

That's a great question.
I like to think about robotics as consisting of two systems, system one and system two. That comes from the book Thinking, Fast and Slow, where system one is the low-level motor control that's unconscious and fast. For example, when I'm grasping this cup of water, I don't really think about how I move my fingertips at every millisecond. That's system one. System two is slow and deliberate; it's more like the reasoning and planning that actually uses our conscious brainpower. I think the GPT-3 moment will be on the system-one side, and my favorite example is the verb "open." Just think about the complexity of that word: opening a door is different from opening a window, which is different from opening a bottle or opening your phone. Humans have no trouble understanding that "open" means different motions when you're interacting with different objects, but so far we have not seen a robotics model that can generalize these verbs at the level of low-level motor control. I hope to see a model that can understand these verbs in their abstract sense and generalize to all the scenarios that make sense to humans. We haven't seen that yet, but I'm hopeful that moment could come in the next two to three years.

What about system-two thinking? How do you think we get there? Do you think some of the reasoning efforts in the LLM world will be relevant in the robotics world as well?

Absolutely.
For system two, we have already seen very strong models that can do reasoning, planning, and coding as well; these are the LLMs and frontier models we already have today. But integrating the system-two models with system one is a research challenge in itself. The question is, for a robot foundation model, do we have a single monolithic model, or do we have some kind of cascaded approach where the system-two and system-one models are separate and communicate with each other in some way? I think that's an open question, and again, there are pros and cons. The first idea, the monolithic model, is cleaner: there's just one model and one API to maintain. But it's also a bit harder to control, because you have different control frequencies. The system-two model operates at a slower control frequency, let's say one hertz, one decision per second, while system one, like the motor control of me grasping this cup of water, will likely run at a thousand hertz, where I need to make these tiny muscle decisions a thousand times per second. It's really hard to encode both in a single model, so maybe a cascaded approach will be better. But then how do systems one and two communicate, through text or through some latent variables? It's unclear, and I think it's a very exciting new research direction.
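To make the frequency mismatch concrete, here is a rough sketch of what a cascaded system-one/system-two loop could look like, with a planner around 1 Hz and a low-level controller around 1 kHz. The object interfaces are hypothetical, and whether the goal is passed as text or as a latent vector is exactly the open question described above.

```python
PLANNER_HZ = 1        # system two: slow, deliberate reasoning (e.g. an LLM/VLM)
CONTROLLER_HZ = 1000  # system one: fast, unconscious motor control

def cascaded_control(planner, controller, robot, duration_s=10):
    """Hypothetical cascade: the planner refreshes a high-level goal about once
    per second; the controller turns that goal plus the latest proprioception
    into motor commands a thousand times per second."""
    steps_per_replan = CONTROLLER_HZ // PLANNER_HZ
    goal = planner.plan(robot.observe())              # text or latent vector
    for step in range(int(duration_s * CONTROLLER_HZ)):
        if step % steps_per_replan == 0:
            goal = planner.plan(robot.observe())      # ~1 Hz replanning
        command = controller.act(robot.proprioception(), goal)  # ~1 kHz
        robot.send_motor_command(command)
```

A monolithic model would have to emit both the 1 Hz decisions and the 1 kHz commands from a single stream, which is what makes the single-model option hard to control.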
Is your instinct that we'll get that breakthrough in system-one thinking through scale and Transformers? What is going to work, or is it cross your fingers, hope, and see?

I certainly hope that the data strategy I described will get us there, because I feel that we have not pushed Transformers to their limit yet. At the essential level, Transformers take tokens in and output tokens, and ultimately the quality of the tokens determines the quality of those large Transformer models. For robotics, as I mentioned, the data strategy is very complex: we have all the internet data, and we also need simulation data and real robot data. Once we're able to scale up the data pipeline with all those high-quality actions, we can tokenize them and send them to a Transformer to compress. So I feel we have not pushed Transformers to their limit yet, and once we figure out the data strategy, we may see emergent properties as we scale up the data and the model size. I'm calling that the scaling law for embodied AI, and it's just getting started. I'm very optimistic that we will get there.
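As a toy illustration of what "tokenize the actions and send them to a Transformer" can mean, here is a simple binning tokenizer for continuous motor commands. The bin count and ranges are arbitrary, and real pipelines may use learned or more sophisticated action tokenizers.

```python
import numpy as np

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0   # assume joint commands normalized to [-1, 1]

def tokenize_actions(actions: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS),
    so action sequences can be trained with the same next-token objective as text."""
    clipped = np.clip(actions, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(np.int64)

def detokenize_actions(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning to recover approximate continuous motor commands."""
    return tokens.astype(np.float32) / (NUM_BINS - 1) * (HIGH - LOW) + LOW
```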
I'm curious to hear what you're most excited about personally when we do get there. What's the industry or application or use case that you're really excited to see this completely transform in the world of robotics today?

There are actually a few reasons we chose humanoid robots as the main research focus to tackle. One reason is that the world is built around the human embodiment, the human form factor. All our restaurants, factories, hospitals, and all our equipment and tools are designed for the human form and the human hand. So in principle, a sufficiently good humanoid hardware platform should be able to support any task that a reasonable human can do. The humanoid hardware is not there yet today, but I feel that in the next two to three years the humanoid hardware ecosystem will mature, and we will have affordable humanoid hardware to work on. Then it becomes a problem about the AI brain, about how we can drive that humanoid hardware. Once we have that, once we have the GR00T foundation model that can take any instruction in language and perform any task that a reasonable human can do, we unlock a lot of economic value. We can have robots in our households helping us with daily chores like laundry, dishwashing, and cooking, or with elderly care, and we can also have them in restaurants, hospitals, and factories, helping with all the tasks that humans do. I hope that will come in the next decade, but again, as I mentioned in the beginning, this is not just a technical problem; there are many things beyond the technology. So I'm looking forward to that.
Any other reasons you've chosen to go after humanoid robots specifically?

There are also more practical reasons, in terms of the training pipeline. There's lots of data online about humans; it's all human-centered, all these videos of humans doing daily tasks and having fun. The humanoid robot form factor is closest to the human form factor, which means the model that we train on all of that data will have an easier time transferring to the humanoid form factor rather than to other form factors. For robot arms, say, how many videos do we see online of robot arms and grippers? Very few. But there are many videos of people using their five-fingered hands to work with objects. So it might be easier to train for humanoid robots first, and once we have that, we'll be able to specialize the model to robot arms and other more specific robot forms. That's why we're aiming for full generality first.

I didn't realize that. So are you exclusively training on humanoids today, versus robot arms and robot dogs as well?

For Project GR00T we are aiming more toward humanoids right now, but the pipeline that we're building, including the simulation tools and the real-robot tools, is general-purpose enough that we can also adapt it to other platforms in the future. So yes, we're building these tools to be generally applicable.
You've used the term "general" quite a few times now. I think there are some folks, especially from the robotics world, who think that a general approach won't work and you have to be domain- and environment-specific. Why have you chosen to go after a general approach? The Richard Sutton "bitter lesson" has been a recurring theme on our podcast; I'm curious if you think it holds in robotics as well.

Absolutely. I would like to first talk about the success story in NLP that we have all seen. Before ChatGPT and GPT-3, the world of NLP had a lot of different models and pipelines for different applications, like translation, coding, doing math, and creative writing. They all used very different models and completely different training pipelines. Then ChatGPT came and unified everything into a single model. Before ChatGPT we call those models the specialists, and GPT-3 and ChatGPT we call the generalists. Once we have the generalists, we can prompt them, distill them, and fine-tune them back onto the specialized tasks, and we call those the specialized generalists. According to the historical trend, it's almost always the case that the specialized generalists are far stronger than the original specialists, and they're also much easier to maintain, because you have a single API that takes text in and spits text out. So I think we can follow the same success story from the world of NLP, and it will be the same for robotics.
Right now, in 2024, most of the robotics applications we have seen are still in the specialist stage: specific robot hardware for specific tasks, collecting specific data, using specific pipelines. Project GR00T aims to build a general-purpose foundation model that works on humanoids first but later will generalize to all kinds of different robot forms, or embodiments. That will be the generalist moment we are pursuing, and once we have that generalist, we'll be able to prompt it, fine-tune it, and distill it down to specific robotics tasks; those are the specialized generalists. But that will only happen after we have the generalist. It will be easier in the short run to pursue the specialists; it's simply easier to show results when you can focus on a very narrow set of tasks. But we at NVIDIA believe that the future belongs to the generalists, even though it will take longer to develop and there are more difficult research problems to solve. That's what we're aiming for first.

The interesting thing about NVIDIA building GR00T, to me, is also what you mentioned earlier, which is that NVIDIA owns both the chip and the model itself. What do you think are some of the interesting things that NVIDIA could do to optimize GR00T on its own chip?

At the March GTC, Jensen also unveiled the next generation of edge computing chips, called the Jetson Thor chip, and it was actually co-announced with Project GR00T. The idea is that we will have the full stack as a unified solution for customers. So from the chip level, which is the Jetson Thor family, to the foundation model, Project GR00T, and also to the simulation tools and the utilities that we build along the way, it will become a platform, a computing platform, for humanoid robots and for intelligent robots in general. I want to quote Jensen here. One of my favorite quotes from him is that everything that moves will eventually be autonomous. I believe in that as well. It's not the case right now, but let's say 10 years or more from now, if we believe that there will be as many intelligent robots as iPhones, then we'd better start building that today.

That's awesome.
Are there any particular results from your research so far that you want to highlight, anything that gives you optimism or conviction in the approach you're taking?

Yes, we can talk about some prior work we have done. One work that I was really happy about is called Eureka. For this work we did a demo where we trained a five-fingered robot hand to do pen spinning. Very useful, and it's superhuman with respect to myself, because I gave up pen spinning back in childhood and I'm not able to do it; if I tried a live demo I would fail miserably. So I'm not able to do this, but the robot hand is. The idea we use to train this is that we prompt an LLM to write code against the simulator API that NVIDIA has built, the Isaac Sim API, and the LLM outputs the code for a reward function. A reward function is basically a specification of the desirable behavior that we want the robot to perform: the robot is rewarded if it's on the right track, or penalized if it's doing something wrong. Typically the reward function is engineered by a human expert, usually a roboticist who really knows the API; it takes a lot of specialized knowledge, and reward-function engineering is by itself a very tedious and manual task. What Eureka did was design an algorithm that uses an LLM to automate this reward-function design, so that the reward function can instruct the robot to do very complex things like pen spinning. It's a general-purpose technique that we developed, and we plan to scale it up beyond pen spinning. It should be able to design reward functions for all kinds of tasks, and it can even generate new tasks using the NVIDIA simulation API, so that gives us a lot of space to grow.
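Here is a paraphrased sketch of that loop: an LLM proposes reward-function code, a policy is trained against it in simulation, and the evaluation results are fed back so the LLM can refine its next proposal. The callables below are injected placeholders, not the actual Eureka or Isaac Sim APIs.

```python
def eureka_style_loop(task_description, query_llm, train_policy, evaluate,
                      num_iterations=5, candidates_per_iter=4):
    """Illustrative Eureka-style reward search.
    query_llm(prompt) -> reward-function source code (string)
    train_policy(reward_code) -> policy trained with RL in simulation
    evaluate(policy) -> scalar score such as task success rate"""
    best_code, best_score = None, float("-inf")
    feedback = "no previous candidates"
    for _ in range(num_iterations):
        for _ in range(candidates_per_iter):
            code = query_llm(
                f"Write a reward function for this task: {task_description}\n"
                f"Feedback from earlier candidates: {feedback}"
            )
            score = evaluate(train_policy(code))
            if score > best_score:
                best_code, best_score = code, score
        feedback = f"best score so far = {best_score:.3f}"
    return best_code
```

The key substitution is that the tedious, expert-driven step of writing and tuning the reward becomes something the LLM iterates on automatically.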
Why do you think, I mean, I remember five years ago researchers were working on solving Rubik's Cubes with a robot hand and things like that, and it felt like robotics went through maybe a trough of disillusionment. In the last year or so it feels like the space has really heated up again. Do you think there is a "why now" around robotics this time around, and what's different? We're reading that OpenAI is getting back into robotics; everybody is now spinning up their efforts. What do you think is different now?

I think there are quite a few key factors that are different now. One is the robot hardware. Since the end of last year we have seen a surge of new robot hardware in the ecosystem; there are companies like Tesla working on Optimus, Boston Dynamics, and so on, and a lot of startups as well. So we are seeing better and better hardware, and those platforms are becoming more and more capable, with better dexterous hands and better whole-body reliability. The second factor is pricing. We also see a significant drop in the price, and in the manufacturing cost, of humanoid robots. Back in 2001, NASA had a humanoid developed called Robonaut, and if I recall correctly it cost north of $1.5 million per robot. Most recently there are companies that are able to put a price tag of about $30,000 on a full-fledged humanoid, which is roughly comparable to the price of a car.
And there's always this trend in manufacturing where the price of a mature product tends toward the cost of its raw materials, and a humanoid typically takes only about 4% of the raw material of a car. So it's possible that we will see the cost trend downward even more, and there could be an exponential decrease in price in the next couple of years, which makes this state-of-the-art hardware more and more affordable. That's the second factor in why I think humanoids are gathering momentum. The third is on the foundation-model side. We are seeing the system-two problem, the reasoning and planning part, being addressed very well by the frontier models, the GPTs and Claudes and Llamas of the world. These LLMs are able to generalize to new scenarios and to write code, and the Eureka project I just mentioned leverages those coding abilities to help develop new robot solutions. There has also been a surge in multimodal models, improving computer vision, the perception part. So these successes also encourage us to pursue robot foundation models, because we think we can ride on the generalizability of these frontier models and add actions on top of them, generating action tokens that will ultimately drive these humanoid robots.
I completely agree with all of that. I also think so much of what we've been trying to tackle to date in the field has been how to unlock the scale of data you need to build this model, and all the research advancements we've made, many of which you've contributed to yourself, around sim-to-real and other things, and the tools NVIDIA has built with Isaac Sim and others, have really accelerated the field, alongside teleoperation and cheaper teleoperation devices and things like that. So I think it's a really, really exciting time to be building here.

I agree.

I'd love to transition to talking about virtual worlds, if that's okay with you. I think you started your research more in the virtual-world arena. Maybe say a word on what got you interested in Minecraft versus robotics. Is it all related in your world? What got you interested in virtual worlds?

That's a great question. For me, my personal mission is to solve embodied AI, and for AI agents embodied in the virtual world, that means things like gaming and simulation. That's why I also have a very soft spot for gaming; I enjoy gaming myself.

What do you play?

I play Minecraft. I try to; I'm not a very good gamer, and that's why I also want my AI to avenge my poor skills. I have worked on a few gaming projects before. The first one was called MineDojo, where we developed a platform for building general-purpose agents in the game of Minecraft. For the audience who are not familiar, Minecraft is a 3D voxel world where you can do whatever you want: you can craft all kinds of recipes and different tools, and you can go on adventures. It's an open-ended game with no particular score to maximize and no fixed storylines to follow. We collected a lot of data from the internet: there are videos of people playing Minecraft, there are wiki pages that explain every concept and mechanism in the game, which are multimodal documents, and there are forums like the Minecraft subreddit where a lot of people talk about the game in natural language. We collected these multimodal datasets and were able to train models to play Minecraft. That was the first work, MineDojo.
Later, the second work was called Voyager. We had the idea for Voyager after GPT-4 came along, because at that time it was the best coding model out there. So we thought: what if we use coding as action? Building on that insight, we developed the Voyager agent, which writes code to interact with the Minecraft world. We use an API to first convert the 3D Minecraft world into a text representation, and then have the agent write code using the action APIs. But just like human developers, the agent is not always able to write the code correctly on the first try, so we give it a self-reflection loop: it tries something, and if it runs into an error or makes a mistake in the Minecraft world, it gets the feedback and can correct its program. Once it has written a correct program, that's what we call a skill, and we save it to a skill library, so that in the future, if the agent faces a similar situation, it doesn't have to go through that trial-and-error loop again; it can retrieve the skill from the skill library. You can think of the skill library as a code base that the LLM interactively authored all by itself; there's no human intervention, and the whole code base is developed by Voyager. So that's the second mechanism, the skill library. The third one is what we call an automated curriculum: the agent knows what it knows and what it doesn't know, so it's able to propose the next task that's neither too difficult nor too easy for it to solve, then follow that path and discover all kinds of different skills and tools while traveling across the vast world of Minecraft. Because it travels so much, we call it the Voyager. That was one of our team's earliest attempts at building AI agents in an embodied world using foundation models.
Talk about the curriculum piece more. I think that's really interesting, because it feels like one of the more unsolved problems in the reasoning and LLM world generally: how do you make these models self-aware so they know how to take the next step to improve? Maybe say a little bit more about what you built on the curriculum and reasoning side.

Absolutely. I think a very interesting emergent property of those frontier models is that they can reflect on their own actions; they kind of know what they know and what they don't know, and they're able to propose tasks accordingly. For the automated curriculum in Voyager, we gave the agent one high-level directive: find as many novel items as possible. That's just one sentence of goal. We didn't give any instruction on which objects to discover first or which tools to unlock first; we didn't specify, and the agent was able to discover all of that by itself, using this combination of coding, prompting, and the skill library. It's kind of amazing that the whole system just works. I would say it's an emergent property once you have a very strong reasoning engine that can generalize.
Why do you think so much of this kind of research has been done in the virtual world? I'm sure it's not entirely because a lot of deep-learning researchers like playing video games, although I'm sure that doesn't hurt either. What are the connections between solving things in the virtual world and in the physical world, and how do the two interplay?

As different as gaming and robotics seem to be, I see a lot of similar principles shared across the two domains. Embodied agents take perception as input, which can be a video stream along with some sensory input, and they output actions. In the case of gaming those are keyboard and mouse actions, and for robotics they are low-level motor controls, but ultimately the API looks the same. These agents also need to explore the world; they have to collect their own data in some way, which is what we call reinforcement learning and self-exploration, and that principle is again shared between physical agents and virtual agents. The difference is that robotics is harder, because you also have a simulation-to-reality gap to bridge: in simulation the physics and the rendering will never be perfect, so it's really hard to transfer what you learn in simulation to the real world, and that is by itself an open-ended research problem. Robotics has the sim-to-real issue; gaming doesn't, because you are training and testing in the same environment. I would say that's the difference between them. Last year I proposed a concept called the foundation agent, where I believe ultimately we will have one model that can work as both a virtual agent and a physical agent. For the foundation agent, there are three axes over which it will generalize: number one is the skills it can do, number two is the embodiments, the body forms or form factors it can control, and number three is the worlds, the realities, it can master.
In the future, I think a single model will be able to perform a lot of different skills on a lot of different robot or agent forms, and generalize across many different worlds, virtual or real. That's the ultimate vision that the GEAR team wants to pursue: the foundation agent.

Pulling on the thread of virtual worlds, and gaming in particular, and what you've unlocked already with some reasoning and emergent behavior, especially working in an open-ended environment: what are some of your own personal dreams for what is now possible in the world of games? Where would you like to see AI agents innovate in games today?

I'm very excited by two aspects. One is intelligent agents inside the games. The NPCs that we have these days follow fixed scripts, and they're all manually authored. What if we have NPCs, non-player characters, that are actually alive? You can interact with them, they can remember what you told them before, and they can take actions in the game world that change the narrative and change the story for you. This is something we haven't seen yet, but I feel there's huge potential there: everyone who plays the game will have a different experience, and even for one person, if you play the game twice you don't get the same story, so each game has infinite replay value. That's one aspect. The second aspect is that the game itself can be generated, and we already see many different tools doing subsets of this grand vision I just mentioned.
There is text-to-3D for generating assets, there are text-to-video models, and of course there are language agents that can generate storylines. What if we put all of them together, so that the game world is generated on the fly as you're playing and interacting with it? That would be truly amazing, a truly open-ended experience.

Super interesting. For the agent vision in particular, do you think you need GPT-4-level capabilities, or do you think you can get there with a Llama 8B, for example, alone?

I think the agent needs the following capabilities. One is, of course, it needs to hold an interesting conversation, it needs a consistent personality, it needs long-term memory, and it needs to take actions in the world. For those aspects, I think currently the Llama models are pretty good, but not good enough to produce very diverse and really engaging behaviors, so I do think there's still a gap to close. The other thing is inference cost: if we want to deploy these agents to gamers, then either it's very low-cost and hosted in the cloud, or it runs locally on the device; otherwise it's unscalable in terms of cost. So that's another factor to be optimized.
Do you think all this work in the virtual-world space exists in service of the physical world, in the sense that you're learning things from it so you can accomplish things in the physical world? Or, said differently, is it enough of a prize in its own right, and how do you think about prioritizing your work between the physical and virtual worlds?

I just think the virtual world and the physical world will ultimately be different realities on a single axis. Let me give one example. There is a technique called domain randomization, and the way it works is that you train a robot in simulation, but you train it in 10,000 different simulations in parallel, and each simulation has slightly different physical parameters: the gravity is different, the friction, the weight, everything is a bit different. So it's actually 10,000 different worlds. Now assume we have an agent that can master all 10,000 different configurations of reality at once; then our real physical world is just the 10,001st virtual simulation, and in this way we're able to generalize from sim to real directly. That's actually exactly what we did in a follow-up work to Eureka, called DrEureka, where we trained agents with all kinds of different randomizations in simulation and then transferred them zero-shot to the real world without further fine-tuning. So I do believe that if we have all kinds of different virtual worlds, including games, and we have a single agent that can master all kinds of skills in all of those worlds, then the real world just becomes part of this bigger distribution.
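As a toy sketch of that idea, here is what sampling thousands of randomized training worlds might look like; the parameters and ranges are arbitrary placeholders, not DrEureka's actual randomization settings.

```python
import random

def sample_randomized_world(base_config, rng=random):
    """Toy domain randomization: each training world perturbs the physics,
    so a policy that masters the whole distribution should treat the real
    world as just one more sample from it."""
    world = dict(base_config)
    world.update({
        "gravity_m_s2": rng.uniform(8.0, 11.0),   # around Earth's 9.81
        "friction": rng.uniform(0.3, 1.2),
        "mass_scale": rng.uniform(0.8, 1.2),      # scale the robot's link masses
    })
    return world

# 10,000 parallel worlds, each with slightly different physics; the real world
# is then, loosely speaking, the 10,001st configuration.
worlds = [sample_randomized_world({"robot": "quadruped"}) for _ in range(10_000)]
```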
Do you want to share a little bit about DrEureka, to ground the audience in that example?

Oh yeah, absolutely. For the DrEureka work we built upon Eureka and still use the LLM as a kind of robot developer. The LLM writes code, and the code specifies the simulation parameters, the randomization parameters. After a few iterations, the policy that we train in simulation is able to generalize to the real world. One specific demo that we showed is a robot dog walking on a yoga ball: it's able to stay balanced and even walk forward. One very funny comment I saw was from someone who actually asked his real dog to do this task, and his dog wasn't able to do it, so in some sense our neural network has super-dog performance.

I'm pretty sure my dog would not be able to do it either. Call it ADI, artificial dog intelligence.

Yeah, that's the next benchmark.
In the virtual-worlds sphere, I think there have been a lot of incredible models that have come out recently on both the 3D and the video side, all of them Transformer-based. Do you think we're there in terms of the architecture that's going to take us to the promised land and we just need to scale it up, or do you think there are fundamental breakthroughs still required on the model side?

I think for robot foundation models we haven't pushed the limit of the architecture yet. The data is the hard problem right now, and it's the bottleneck, because as I mentioned earlier, we can't download the action data from the internet; it doesn't come with the motor-control data, so we have to collect it either in simulation or on the real robots. Once we have a very mature data pipeline, we'll just push the tokens to the Transformer and have it compress those tokens, just like Transformers predicting the next word on Wikipedia. We're still testing these hypotheses, but I don't think we have pushed Transformers to their limit yet. There is also a lot of research going on right now on alternative architectures to Transformers, and I'm personally super interested in those. There's Mamba, and recently there was test-time training; there are a few alternatives, and some of them have very promising ideas. They haven't yet scaled to frontier-model performance, but I'm looking forward to seeing alternatives to Transformers.

Have any caught your eye in particular, and why?

I mentioned the Mamba work and also test-time training. These models are more efficient at inference time: instead of attending to all the past tokens the way Transformers do, they have inherently more efficient mechanisms, so I see them holding a lot of promise. But we need to scale them up to the size of the frontier models and really see how they compare head-on with the Transformer.

Awesome. Should we close out with some rapid-fire questions?

Yeah, sure.
Okay, let's see. Number one: outside the embodied-AI world, what are you most interested in within AI?

I'm super excited about video generation, because I see video generation as a kind of world simulator, where we learn the physics and the rendering from data alone. We have seen OpenAI's Sora, and later a lot of new models catching up to Sora, so this is an ongoing research topic.

And what does the world simulator get you?

I think it gets us a data-driven simulation in which we can train embodied AI, and that would be amazing.

Nice. What are you most excited about in AI on a longer-term horizon, 10 years or more?

On a few fronts. One, on the reasoning side, I'm super excited about models that code. I think coding is such a fundamental reasoning task, and it also has huge economic value. Maybe 10 years from now we'll have coding agents that are as good as human-level software engineers, and then we'll be able to accelerate a lot of development using the LLMs themselves. The second aspect is, of course, robotics. I think 10 years from now we'll have humanoid robots at the reliability and agility of humans, or even beyond, and I hope that at that time Project GR00T will be a success and we'll have humanoids helping us in our daily lives.

I just want robots to do my laundry.

Yeah, that has always been my dream too.

What year are robots going to do our laundry?

As soon as possible. I can't wait.
Who do you admire most in the field of AI? You've had the opportunity to work with some greats, dating back to your internship days, but who do you admire most these days?

I have too many heroes in AI to count. I admire my PhD advisor Fei-Fei. I think she taught me how to develop good research taste: sometimes it's not about how to solve a problem, but about identifying what problems are worth solving, and the "what" problem is actually much harder than the "how" problem. During my PhD years with Fei-Fei I transitioned to embodied AI, and in retrospect that was the right direction to work on; I believe the future of AI agents will be embodied, whether for robotics or for the virtual world. I also admire Andrej Karpathy. He's a great educator, and I think he writes code like poetry, so I look up to him. And I admire Jensen a lot. Jensen cares deeply about AI research and knows a lot, even down to the technical details of the models, and I'm super impressed, so I look up to him a lot.

Pulling on the thread of having great research taste, what advice do you have for founders building in AI in terms of finding the right problems to solve?

I feel that research papers these days are becoming more and more accessible; they have some really good ideas, and they're more and more practical instead of just theoretical machine learning. So I would recommend keeping up with the latest literature, and also just trying out all the open-source tools that people have built.
For example, at NVIDIA we built simulator tools that everyone can access; just download them and try them out, and you can train your own robots in simulation. Get your hands dirty.

And maybe pulling on the thread of Jensen as an icon, what is some practical, tactical advice you'd give to founders building in AI, something they could learn from him?

I think it's identifying the right problem to work on. NVIDIA bets on humanoid robotics and embodied AI because we believe this is the future: if we believe that, say, 10 years from now there will be as many intelligent robots in the world as iPhones, then we'd better start working on that today. So, long-term future visions.

I think that's a great note to end on. Jim, thank you so much for joining us. We loved learning about everything your group is doing, and we can't wait for the future of laundry-folding robots.

Awesome, thank you so much for having me.

Thank you, thanks.