What was your intuition around deep learning? Why did you know it was going to work? Did you have any intuition that it would lead to this kind of success?

Okay, well, first of all, thank you so much for all the kind words. A lot has changed thanks to the incredible power of deep learning. My personal starting point was that I was interested in artificial intelligence for a whole variety of reasons, starting from an intuitive appreciation of its impact, and also a lot of curiosity about questions like: what is consciousness? What is the human experience? It felt like progress in artificial intelligence would help with that.

The next step was, well, back then I was starting out in 2002-2003, and it seemed like learning was the thing that humans can do that computers can't do at all. In 2002-2003, computers could not learn anything, and it wasn't even clear that it was possible in theory. So I thought that making progress in learning, in artificial learning, in machine learning, would lead to the greatest progress in AI. Then I started to look around for what was out there, and nothing seemed too promising. But to my great luck, Geoff Hinton was a professor at my university, and I was able to find him, and he was working on neural networks.

It immediately made sense, because neural networks had the property that, through learning, we are automatically programming parallel computers. Back then the parallel computers were small, but the promise was that if you could somehow figure out how learning in neural networks works, then you can program small parallel computers from data. It was also similar enough to the brain, and the brain works, so you had these several factors going for it. Now, it wasn't clear how to get it to work, but of all the things that existed, it seemed to have by far the greatest long-term promise.

The Big Bang of AI. Fast forward to now: you came out to the Valley, you started OpenAI with some friends, and you're the chief scientist now. What was the initial idea about what to work on at OpenAI? Because you worked on several things, and some of the trails of
inventions and work you could see led up to the ChatGPT moment. But what was the initial inspiration? How would you approach intelligence from that moment, such that it led to this?

Yeah. So obviously, when we started, it wasn't 100% clear how to proceed, and the field was also very different compared to the way it is right now. Right now we are already used to these amazing artifacts, these amazing neural nets that are doing incredible things, and everyone is so excited. But back in 2015, early 2016, when we were starting out, the whole thing seemed pretty crazy. There were so many fewer researchers, maybe between a hundred and a thousand times fewer people in the field compared to now. Back then you had maybe a hundred people, most of them working at Google/DeepMind, and that was that. Then there were people picking up the skills, but it was still very, very scarce, very rare.

We had two big initial ideas at the start of OpenAI that had a lot of staying power, and they have stayed with us to this day. I'll describe them now.

The first big idea, one which I was especially excited about very early on, is the idea of unsupervised learning through compression. Some context: today we take it for granted that unsupervised learning is this easy thing, you just pre-train on everything and it all does exactly as you'd expect. In 2016, unsupervised learning was an unsolved problem in machine learning that no one had any insight into, any clue as to what to do.

That's right.

Yann LeCun would go around giving talks saying that you have this grand challenge in unsupervised learning. And I really believed that really good compression of the data would lead to unsupervised learning. Now, compression is not language that's commonly used to describe what is really being done, until recently, when it suddenly became apparent to many people that those GPTs actually compress the training data. You may recall the Ted Chiang article in The New Yorker, which also alluded to this. But there is a real mathematical sense in which training these autoregressive generative models compresses the data. And intuitively you can see why that should work: if you compress the data really well, you must extract all the hidden secrets which exist in it. Therefore, that is the key.

So that was the first idea we were really excited about, and it led to quite a few works at OpenAI, including the sentiment neuron, which I'll mention very briefly. This work might not be well known outside of the machine learning field, but it was very influential, especially in our thinking. The result was that when you train a neural network (back then it was not a Transformer, it was before the Transformer: a small recurrent neural network, an LSTM, some of the work you've done yourself, so the same LSTM with a few twists) to predict the next token in Amazon reviews, the next character, we discovered that if you predict the next character well enough, there will be a neuron inside that LSTM that corresponds to its sentiment. That was really cool, because it showed some traction for unsupervised learning, and it validated the idea that really good next-character prediction, next-something prediction, compression, has the
property of discovering the secrets in the data. That's what we see with these GPT models, right? You train, and people say it's just statistical correlation. I mean, at this point it should be so clear to anyone.

That observation also, for me, intuitively opened up the whole world of where to get the data for unsupervised learning. Because I do have a whole lot of data: if I can just make you predict the next character, and I know what the ground truth is, I know what the answer is, I can train a neural network with that. So that observation, and masking, and other approaches, opened my mind about where the world would get all the data for unsupervised learning.

Well, you've always believed that scaling will improve the performance of these models.

Yes.

Larger networks, deeper networks, more training data would scale it. There was a very important paper that OpenAI wrote about the scaling laws, the relationship between the loss and the size of the model and the size of the dataset. When Transformers came out, they gave us the opportunity to train very, very large models in a very reasonable amount of time. But which came first: the intuition about the scaling laws and the size of models and data, or your journey of GPT-1, 2, 3? Did you see the evidence of GPT-1 through 3 first, or was it the intuition about the scaling law first?

The intuition. The way I'd phrase it is that I had a very strong belief that bigger is better, and that one of the goals we had at OpenAI was to figure out how to use scale correctly. There was a lot of belief at OpenAI about scale from the very beginning. The question is what to use it for precisely. Because right now we're talking about the GPTs, but there is another very important line of work which I haven't mentioned: the second big idea. I think now is a good time to take a detour to it, and that's reinforcement learning. It clearly seems important as well. What do you do with it?
So the first really big project that was done inside OpenAI was our effort at solving a real-time strategy game. For context: a real-time strategy game is like a competitive sport. You need to be smart, you need to have fast reaction times, there's teamwork, and you're competing against another team. It's pretty involved, and there is a whole competitive league for that game. The game is called Dota 2. We trained a reinforcement learning agent to play against itself, with the goal of reaching a level where it could compete against the best players in the world. That was a major undertaking as well, and a very different line of work: reinforcement learning.

Yeah, I remember the day you announced that work. And this, by the way, is what I was asking about earlier: there's a large body of work that has come out of OpenAI, and some of it seemed like detours. But in fact, as you're explaining now, these seeming detours really led up to some of the important work we're now talking about, ChatGPT.

Yeah. I mean, there has been real convergence, where the GPTs produce the foundation, and the reinforcement learning from Dota morphed into reinforcement learning from human feedback.

That's right.

And that combination gave us ChatGPT.

You know, there's a misunderstanding that ChatGPT is in itself just one giant large language model. There's a system around it that's fairly complicated. Could you explain briefly
for the audience the fine-tuning, the reinforcement learning, the various surrounding systems that allow you to keep it on rails, to give it knowledge, and so on?

Yeah, I can. The way to think about it is that when we train a large neural network to accurately predict the next word in lots of different texts from the internet, what we are doing is learning a world model. It may look on the surface like we are just learning statistical correlations in text, but it turns out that to just learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world. There is a world out there, and it has a projection onto this text. And so what the neural network is learning is more and more aspects of the world, of people, of the human condition: their hopes, dreams, and motivations, their interactions, and the situations that we are in. The neural network learns a compressed, abstract, usable representation of that. This is what's being learned from accurately predicting the next word. And furthermore, the more accurate you are at predicting the next word, the higher the fidelity, the more resolution you get in this process.

So that's what the pre-training stage does. But what it does not do is specify the desired behavior that we wish our neural network to exhibit. You see, what a language model really tries to do is answer the following question: if I had some random piece of text on the internet which starts with some prefix, some prompt, what will it complete to? As if you just randomly ended up on some text from the internet. But this is different from what I actually want: an assistant which will be truthful, which will be helpful, which will follow certain rules and not violate them. That requires additional training. This is where the fine-tuning and the reinforcement learning from human teachers, and other forms of AI assistance, come in. It's not just reinforcement learning from human teachers; it's also reinforcement learning from human and AI collaboration. Our teachers are working together with an AI to teach our AI to behave.

Here we are not teaching it new knowledge; that's not what's happening. We are communicating to it what it is that we want it to be. And this process, the second stage, is also extremely important. The better we do the second stage, the more useful, the more reliable this neural network will be. So the second stage is extremely important too, in addition to the first
stage of learning everything, learning as much as you can about the world from the projection of the world.

ChatGPT came out just a few months ago, the fastest-growing application in the history of humanity. There are lots of interpretations about why, but some things are clear. It is the easiest application anyone has ever created for anyone to use. It performs tasks, it does things that are beyond people's expectations. Anyone can use it. There are no instruction sets, there are no wrong ways to use it; you just use it. And if your instructions or prompts are ambiguous, the conversation refines the ambiguity until your intent is understood by the application, by the AI. The impact, of course, is clearly remarkable. Now, yesterday (this is the day after GPT-4), just a few months later, the performance of GPT-4 in many areas is astounding: SAT scores, GRE scores, bar exams, the number of tests it is able to perform at very capable levels, very capable human levels. Astounding. What were the major differences between ChatGPT and GPT-4 that led to its improvements in these areas?

So, GPT-4 is a pretty substantial improvement on top of ChatGPT, across very many dimensions. We trained GPT-4, I would say, more than six months ago, maybe eight months ago; I don't remember exactly. The first big difference between ChatGPT and GPT-4, and perhaps the most important difference, is that the base model on top of which GPT-4 is built predicts the next word with greater accuracy.
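The claim that better next-word prediction is, in a precise mathematical sense, better compression can be illustrated with a toy sketch. The code below is purely illustrative (it is not anything from OpenAI's training stack): it compares the number of bits an arithmetic coder would need to encode a string under a know-nothing uniform model versus a simple bigram model that has learned the string's structure. The cross-entropy a model achieves on the data is exactly its achievable code length.

```python
import math
from collections import defaultdict

def bits_to_encode(text, prob):
    # Code length under a model: -sum of log2 p(next char | context).
    # This is what an arithmetic coder would achieve, so a better
    # next-character predictor is literally a better compressor.
    return sum(-math.log2(prob(text[:i], c)) for i, c in enumerate(text))

text = "abababababababab"
alphabet = sorted(set(text))

# Model 1: uniform over the alphabet; it knows nothing about the data.
def uniform(context, c):
    return 1.0 / len(alphabet)

# Model 2: bigram counts with add-one smoothing; it has "learned" the pattern.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def bigram(context, c):
    if not context:
        return 1.0 / len(alphabet)
    prev = context[-1]
    total = sum(counts[prev].values()) + len(alphabet)
    return (counts[prev][c] + 1) / total

print(bits_to_encode(text, uniform))  # 16.0 bits: 1 bit per character
print(bits_to_encode(text, bigram))   # far fewer: the model found the structure
```

The same identity is what connects a GPT's training loss to compression of its training data: lowering next-token cross-entropy shortens the code.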
This is really important, because the better a neural network can predict the next word in text, the more it understands it. This claim is now perhaps accepted by many, but it might still not be completely intuitive as to why that is. So I'd like to take a small detour and give an analogy that will hopefully clarify why more accurate prediction of the next word leads to more understanding, real understanding.

Consider an example. Say you read a detective novel. It has a complicated plot, a storyline, different characters, lots of events, mysteries, clues. It's unclear. Then, at the last page of the book, the detective has gathered all the clues, gathered all the people, and is saying: "Okay, I'm going to reveal the identity of whoever committed the crime, and that person's name is..." Predict that word.

Predict that word, exactly. My goodness.

Right? Now, there are many different words. But by predicting those words better and better and better, the understanding of the text keeps on increasing. GPT-4 predicts the next word better.

People say that deep learning won't lead to reasoning, that deep learning won't lead to reasoning. But in order to predict that next word, to figure out from all the characters that were there, their strengths, their weaknesses, their intentions, the context, to be able to predict that word, who the murderer was, that requires some amount of reasoning, a fair amount of reasoning. So how is it that it's able to learn reasoning? And if it learned reasoning,
you know, one of the things I was going to ask you is: of all the tests that were taken by ChatGPT and GPT-4, there were some tests that GPT-3 or ChatGPT was already very good at, there were some tests that GPT-3 or ChatGPT was not as good at that GPT-4 was much better at, and there were some tests that neither is good at yet. Some of it has to do with reasoning, it seems. Maybe in calculus, it wasn't able to break the problem down into its reasonable steps and solve it. Yet in some areas it seems to demonstrate reasoning skills. So in predicting the next word, is it learning reasoning? And what are the limitations of GPT-4 now that would enhance its ability to reason even further?

You know, reasoning isn't this super well-defined concept, but we can try to define it anyway: it's when you're able to somehow think about something a little bit and get a better answer because of your reasoning. And I'd say that with our neural nets, maybe there is some kind of limitation which could be addressed by, for example, asking the neural network to think out loud. This has proven to be extremely effective for reasoning. But I think it also remains to be seen just how far the basic neural network will go. I think we have yet to fully tap out its potential.
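The "think out loud" idea Ilya mentions is usually implemented as chain-of-thought prompting: instead of asking for the answer directly, the prompt invites the model to produce intermediate steps before committing to an answer. A minimal sketch of the two prompt styles; `complete` in the comment is a hypothetical stand-in for any text-completion API, not a real library call.

```python
question = (
    "A bat and a ball cost $1.10 in total. "
    "The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Direct prompting: the model must emit the answer in one shot.
direct_prompt = question + "\nAnswer:"

# Chain-of-thought prompting: the model is nudged to reason step by step
# first, which empirically improves accuracy on multi-step problems.
cot_prompt = question + "\nLet's think step by step."

# With a real completion API this would look something like:
#   answer = complete(cot_prompt)   # `complete` is hypothetical
print(cot_prompt)
```

The difference is purely in the prompt text; the extra tokens the model generates while "thinking out loud" are what carry the intermediate reasoning.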
I mean, there is definitely some sense in which reasoning is still not quite at the level of some of the other capabilities of the neural network, though we would like the reasoning capabilities of the neural network to be higher. I think it's fairly likely that business as usual will keep improving the reasoning capabilities of the neural network; I wouldn't confidently rule out that possibility.

Yeah, because one of the things that is really cool is that you can ask ChatGPT a question and tell it: before you answer the question, first tell me what you know. Usually, when somebody answers a question, if they give you the foundational knowledge they have, or the foundational assumptions they're making, before answering, that really improves my belief in the answer. They're also demonstrating some level of reasoning. So it seems to me that ChatGPT has this inherent capability embedded in it.

Yeah, to some degree. One way to think about what's happening now is that these neural networks have a lot of these capabilities; they're just not quite very reliable. In fact, you could say that reliability is currently the single biggest obstacle to these neural networks being useful, truly useful. It is still sometimes the case that these neural networks hallucinate a little bit, or make some mistakes which are unexpected, which you wouldn't expect a person to make. It is this kind of unreliability that makes them substantially less useful. But I think that perhaps with a little bit more research, with the current ideas that we have, and perhaps a few more ambitious research plans, we'll be able to achieve higher reliability as well. That will be truly useful: it will allow us to have very accurate guardrails, which are very precise.

That's right.

And it will make it ask for clarification where it's unsure, or say that it doesn't know something when it doesn't know, and do so extremely reliably. So I'd say that these are some of the bottlenecks, really. It's not about whether it exhibits some particular capability, but more about how reliably, to what degree.
Exactly, yeah. Multimodality: GPT-4 has the ability to learn from text and images, and to respond to input from text and images. First of all, the foundation of multimodality learning: of course, Transformers have made it possible for us to learn from multimodality, from tokenized text and images. But at the foundational level, help us understand how multimodality enhances the understanding of the world beyond text by itself. My understanding is that when you do multimodality learning, even when it is just a text prompt, the text understanding could actually be enhanced. Tell us about multimodality at the foundation: why it's so important, what the major breakthrough was, and the characteristic differences as a result.

So there are two dimensions to multimodality, two reasons why it is interesting. The first reason is a little bit humble: multimodality is useful. It is useful for a neural network to see, vision in particular, because the world is very visual. Human beings are very visual animals; I believe a third of the human cortex is dedicated to vision. And so, by not having vision, the usefulness of our neural networks, though still considerable, is not as big as it could be. So it is a very simple usefulness argument: it is simply useful to see, and GPT-4 can see quite well.

There is a second reason to vision, which is that we learn more about the world by learning from images, in addition to learning from text. That is also a powerful argument, though it is not as clear-cut as it may seem. I'll give you an example. Or rather, before giving an example,
I'll make a general comment. For a human being, for us human beings, we get to hear about one billion words in our entire life.

That's amazing.

Only one billion words. That's not a lot.

That's not a lot. Does that include my own words in my own head?

Make it two billion, but you see what I mean. We can see that, because a billion seconds is about thirty years, and we don't get to hear more than a few words a second, and we're asleep half the time. So a couple billion words is the total we get in our entire life. It becomes really important for us to get as many sources of information as we can, and we absolutely learn a lot more from vision. The same argument holds true for our neural networks as well, except for the fact that the neural network can learn from so many words. So things which are hard to learn about the world from text, in a few billion words, may become easy to learn from trillions of words.
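The back-of-envelope numbers here check out. A quick sanity check of the lifetime word budget, where the exposure rate and waking fraction are rough assumptions for illustration, not measurements:

```python
# Sanity-check the "about a billion words in a lifetime" estimate.
seconds_per_year = 60 * 60 * 24 * 365        # 31,536,000
seconds_in_30_years = 30 * seconds_per_year  # 946,080,000 -- roughly a billion,
                                             # so "a billion seconds is ~30 years"

# Assume ~2 words/second of language exposure, awake half the time:
words_per_second = 2
waking_fraction = 0.5
lifetime_words = seconds_in_30_years * words_per_second * waking_fraction

print(f"{lifetime_words:.2e}")  # on the order of 1e9
```

A model trained on trillions of tokens sees several thousand times more language than a person ever hears, which is the asymmetry Ilya is pointing at.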
I'll give you an example: consider colors. Surely one needs to see to understand colors. And yet the text-only neural networks, which have never seen a single photon in their entire life, if you ask them which colors are more similar to each other, will know that red is more similar to orange than to blue, and that blue is more similar to purple than to yellow. How does that happen? One answer is that information about the world, even visual information, slowly leaks in through text. Slowly, not as quickly, but when you have a lot of text, you can still learn a lot.

Of course, once you also add vision, and learning about the world from vision, you will learn additional things which are not captured in text. But I would not say that it is binary, that there are things which are impossible to learn from text only. I think of it more as an exchange rate. In particular, if you are like a human being and you want to learn from a billion words, or a hundred million words, then of course the other sources of information become far more important.

On the context of the scores that I saw, the thing that was really interesting was the data you published on which tests were performed well by GPT-3 and which tests were performed substantially better by GPT-4. How did multimodality contribute to those tests, do you think?

Oh, I mean, in a pretty straightforward way: anytime there was
a test where, to understand the problem, you need to look at a diagram. For example, in some math competitions: there is a math competition for high school students called the AMC 12, and there, presumably, many of the problems have a diagram. GPT-3.5 does quite badly on that test. GPT-4 with text only does, I don't remember exactly, but it's like maybe going from a 2% to a 20% success rate. But then, when you add vision, it jumps to a 40% success rate. So the vision is really doing a lot of work; the vision is extremely good. And I think being able to reason visually as well, and to communicate visually, will also be very powerful and very nice, things which go beyond just learning about the world. There are several things there: you can learn about the world, you can reason about the world visually, and you can communicate visually. Now, in the future, perhaps in some future version, if you ask your neural net, "Hey, explain this to me," then rather than just producing four paragraphs,
it will produce, "Hey, here's a little diagram which clearly conveys to you exactly what you need to know."

That's incredible. Tell us whatever you can about where we are now and what you think the not-too-distant future holds. Pick your horizon, a year or two: where do you think this whole language model area will be, in some of the areas you're most excited about?

You know, predictions are hard, and although it's a little difficult to say things which are too specific, I think it's safe to assume that progress will continue, and that we will keep on seeing systems which astound us in the things that they can do. The current frontiers will be centered around reliability, around the system being trusted: really getting to a point where you can trust what it produces, really getting to a point where, if it doesn't understand something, it asks for a clarification, says that it doesn't know something, says that it needs more information. I think those are perhaps the areas where improvement will lead to the biggest impact on the usefulness of those systems, because right now that's really what stands in the way. Say you're asking a neural network to summarize some long document, and you get a summary. Are you sure that some important detail wasn't omitted? It's still a useful summary, but it's a different story when you know that all the important points have been covered. In particular, it's okay if there's some ambiguity; that's fine. But if a point is clearly important, such that anyone else who saw it would say "this is really important," then when the neural network also recognizes that reliably, that's when you know. The same goes for the guardrails, and the same for its ability to clearly follow the intent of the user, of its operator. So I think we'll see a lot of that in the next two years.

Yeah, because the progress in those two areas will make this technology trusted by people to use, and able to be applied to so many things. I was thinking that was going
to be the last question, but I did have another one, sorry about that. Okay, so: ChatGPT to GPT-4. When you first started using GPT-4, what are some of the skills it demonstrated that surprised even you?

Well, there were lots of really cool things that it demonstrated, which were quite cool and surprising. It was quite good, so I'll mention two. Let's see, I'm just trying to think about the best way to go about it. The short answer is that the level of its reliability was surprising. With the previous neural networks, if you asked them a question, sometimes they might misunderstand something in a kind of silly way; with GPT-4, that stopped happening. Its ability to solve math problems became far greater: you could really do a derivation, a long, complicated derivation, convert the units and so on, and that was really cool.

You know, like many people noticed: you ask it for a proof, and it works through the proof. It's pretty amazing.

Not all proofs, naturally, but quite a few.

Or another example would be, as many people noticed, its ability to produce poems where every word starts with the same letter. It follows instructions really, really clearly. Not perfectly still, but much better than before.

Really good.

And on the vision side, I really love how it can explain jokes. It can explain memes: you show it a meme and ask it why it's funny, and it will tell you, and it will be correct. The vision part, I think, is
also very striking: really, actually seeing it, when you can ask follow-up questions about some complicated image with a complicated diagram and get an explanation. That's really cool.

But yeah, overall, I will say, to take a step back: I've been in this business for quite some time, actually almost exactly 20 years. And the thing which I find most surprising is that it actually works. It turned out to be the same little thing all along, which is no longer little, and is a lot more serious and much more intense. But it's the same neural network, just larger, trained on maybe larger datasets, in different ways, with the same fundamental training algorithm. So it's like, wow. That is what I find the most surprising. Whenever I take a step back, I go: how is it possible that those ideas, those conceptual arguments (well, the brain has neurons, so maybe artificial neurons are just as good, and so maybe we just need to train them somehow with some learning algorithm) turned out to be so incredibly correct? That would be the biggest surprise.