ChatGPT: 30 Year History | How AI Learned to Talk

Art of the Problem
This video explores the journey of AI language models, from their modest beginnings through the deve...
Video Transcript:
A Big Bang occurred when ChatGPT was released: the first widely available computer program the average person could talk to as if it were another human, beating the Turing test and doing things most agreed, myself included, were not possible when I started this series four years ago. The richness, or infinite potential, that language offers is why many experts in linguistics and computation strongly believed computers would never come close to understanding human language. Many of them have now changed their minds. If I were to take an hour doing something, ChatGPT-4 might take one second; it's quite terrifying. It felt as if not only were my belief systems collapsing, but the entire human race was going to be eclipsed and left in the dust soon. So, what happened? So far in this series we've covered the last few decades of neural network research focused on narrow problems with a fixed goal, where people trained an artificial neural network on one task with a big database of example inputs and outputs to learn from, known as supervised learning. In this case the learning signal was the difference between the guess and the correct answer. This led to neural networks that could do one kind of thing really well, such as classify images, detect spam or predict your next YouTube video. But each network was like a silo and left no clear road to more general-purpose systems. You can think of these siloed networks as modelling intuition only, not reasoning, because reasoning involves a chain of thoughts; it's a sequential process. And so, to crack open this problem of making neural networks more general purpose, we first needed to train neural networks to talk.
Looking back, we can see the origin of this kind of experimentation in the mid-1980s. One inspired paper in 1986, by Jordan, trained a neural network to learn sequential patterns. In his initial experiments he trained a tiny network, with only a handful of neurons, to predict simple sequences of two symbols. To give the network memory, he borrowed from how we think our minds work, which is to have an ongoing state of mind that helps us decide our next action given what we currently observe. He added a set of memory neurons he called state units to the side of the network, added connections from the output to the state units, connected the state units to the middle of the network, and finally connected them to themselves.
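To make that wiring concrete, here is a minimal sketch, in Python with numpy, of a forward pass through such a network. This is my own toy reconstruction of the idea, not Jordan's code, and the sizes and weights are made up: the state units receive a copy of the previous output and of their own previous value, and feed into the hidden layer alongside the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 4, 2   # a tiny network for sequences over two symbols

W_in    = rng.normal(0, 0.5, (n_hidden, n_in))    # input units  -> hidden layer
W_state = rng.normal(0, 0.5, (n_hidden, n_out))   # state units  -> hidden layer (the memory path)
W_out   = rng.normal(0, 0.5, (n_out, n_hidden))   # hidden layer -> output units
decay   = 0.5                                     # state units also connect to themselves

def step(x, state):
    """One time step: the hidden layer sees the current input AND the remembered state."""
    h = np.tanh(W_in @ x + W_state @ state)
    y = np.exp(W_out @ h); y /= y.sum()           # softmax guess of the next symbol
    new_state = y + decay * state                 # output feeds the state units, which also feed themselves
    return y, new_state

state = np.zeros(n_out)
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]:
    y, state = step(x, state)
    print(np.round(y, 2))
```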
This architecture resulted in a state of mind which depended on the past and could affect the future, what he called a recurrent neural network. Another key innovation was how he set up a prediction problem for the network to learn: he trained the network by simply hiding the next letter in a sequence. With this approach, the learning signal is the difference between the network's guess of the next symbol and the true next symbol in the data.
And, critically, after training it, he set up the network to generate data by connecting the output of the network back on itself and kicking it off with a single letter, after which it would generate the pattern it had learned. He observed that the network would make mistakes, but those mistakes would go away after it was trained on the pattern more. And he noticed the learned sequences were not just memorized; they were generalized. In another experiment, he trained a network on a spatial pattern; after training, he fed in a point and plotted the results, and the network would correctly continue the cyclical pattern. However, when he tried to start it on a new point outside the path it had learned, the network would follow the same cyclical pattern but from a different position at a different scale, and gradually it would return back to the stable sequence.
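Both ideas, training by hiding the next symbol and generating by feeding the network's own output back in, fit in a few lines. The sketch below is a deliberately tiny stand-in (a single-layer predictor rather than a full recurrent network) that shows the training signal and the generation loop on a repeating pattern.

```python
import numpy as np

text = "abcabc" * 100                             # a simple repeating pattern to learn
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V = len(chars)

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0; return v

def softmax(z):
    z = np.exp(z - z.max()); return z / z.sum()

rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (V, V))                    # next-character predictor weights

# Training: the learning signal is the gap between the guess and the hidden true next character.
for a, b in zip(text[:-1], text[1:]):
    x, target = one_hot(idx[a]), one_hot(idx[b])
    guess = softmax(W @ x)
    W -= 0.5 * np.outer(guess - target, x)        # nudge the weights to reduce the prediction error

# Generation: connect the output back to the input and kick it off with a single letter.
c = "a"
generated = c
for _ in range(11):
    p = softmax(W @ one_hot(idx[c]))
    c = chars[int(p.argmax())]
    generated += c
print(generated)                                  # continues the learned pattern: "abcabcabcabc"
```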
Jordan writes: “When a network learns to perform a sequence, it essentially learns to follow a trajectory through state space, and these learned trajectories tend to be attractors,” borrowing a term from chaos theory. He viewed an attractor as the generalized pattern learned by the network, which was represented in the connection weights of the inner layers. Five years later, another researcher, Jeffrey Elman, picked up on Jordan's research and did the same thing with a slightly bigger network of 50 neurons, and trained it on language. At first, he used 200 short sentences he created. Interestingly, he didn't provide any word boundaries; he simply applied a stream of letters to the network 10 times and at each step trained it to get closer to making the correct prediction of the next letter. The first interesting thing he noticed was that the network learned word boundaries on its own.
He shows this in a plot where, at the onset of a new word, the chance of error or uncertainty is high, and as more of the word is received the error rate declines, since the sequence is increasingly predictable. At the end of the word the error would jump back up again, but not as high as before. This reflects what we saw in information theory, where an intelligent signal contains decreasing entropy over sequence length. He then notes that it's worth looking into whether the network has any understanding of the meaning behind these words. He probed the internal neurons in the context units as the network was processing words, then plotted them and compared the spatial arrangement. What he found was that the network would spatially cluster words based on meaning. For example, it separated nouns into inanimate and animate, and within these groups he saw subcategorization: animate objects were broken down into human and non-human clusters, inanimate objects were broken down into breakable and edible clusters. And so he emphasizes that the network was learning these hierarchical interpretations. But Elman notes that, according to Noam Chomsky, this shouldn't be possible. How could a little network understand words semantically?
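The probing procedure itself is simple to sketch. In the code below the hidden vectors are random placeholders; in Elman's actual experiment they would be the recorded context-unit activations of the trained network as it reads each word, and comparing or clustering those recorded vectors is what produces the animate/inanimate and human/non-human groupings described above.

```python
import numpy as np

def hidden_state_for(word):
    """Placeholder for the context-unit activations of a trained network reading `word`."""
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    return rng.normal(size=50)                    # Elman's network had on the order of 50 hidden units

words = ["man", "woman", "dog", "cat", "cookie", "sandwich", "rock", "glass"]
H = np.stack([hidden_state_for(w) for w in words])
H /= np.linalg.norm(H, axis=1, keepdims=True)     # normalise so dot products are cosine similarities

sim = H @ H.T                                     # pairwise similarity of word representations
for i, w in enumerate(words):
    nearest = words[int(np.argsort(-sim[i])[1])]  # the most similar *other* word
    print(f"{w:10s} closest to {nearest}")
```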
But Elman argued his experiments showed otherwise: everything could be learned from patterns in the language. This approach to training neural networks by hiding the next event closely aligns with how humans learn. He referenced an idea that preverbal children begin the process of language acquisition by listening and talking along in their mind with a speaker, always guessing the next word, and they can learn from these internal mistakes. He also had a fascinating insight: since we can represent words as points in high-dimensional space, then sequences of words, or sentences, can be thought of as a pathway, and similar sentences seem to follow similar pathways. Our thought follows a pathway. It's useful to pause and consider that your own mind is often on a pathway of thought at many levels. But still, these networks were small and seen as toy problems, and so, for well over a decade, none of this research on language models saw the light of day. It really wasn't until 2011 that an important confluence of researchers pushed this specific experiment ahead. Interestingly, the application they mentioned was mundane-sounding: this is an important problem because better character-level prediction could improve compression of text files. More speculatively, achieving the limit in text compression requires an understanding that is equivalent to intelligence. This is in line with one theory that biological brains, at their core, are prediction machines, and so, if we think of intelligence as the ability to learn, this views learning as the compression of experience into a predictive model of the world. I'll see you every second for the next hour, or what have you; each look at you is a little bit different. I don't store all those second-by-second images; I don't store 3,000 images. I somehow compact this information. In this paper, they trained a much larger network, with thousands of neurons this time and millions of connections, to do next-letter prediction as the early researchers had done. After training, they had the model generate language by feeding its output back into the input and kicking off the process with some starting text.
For example, they gave the prompt “the meaning of life is” and the network responded “the tradition of the ancient human reproduction”. But beyond a few words the pathway of thought fell off course and veered into the nonsensical, and so clearly learning was happening, but they were hitting the capacity of the network to maintain coherent context over long sequences. At the end of the paper they claimed that if we could train a much bigger network, with millions of neurons and billions of connections, it is possible that brute force alone would be sufficient to achieve even higher standards of performance. And still, few took this line of research seriously, probably due to the mistakes, but a dedicated few pushed this effort ahead. Another key figure is Andrej Karpathy. He did the same experiment again, but this time on a bigger network with more layers, and his results were even better, more plausible. Notably, he trained them on all of Shakespeare and noted that “I can barely recognize these from actual Shakespeare”.
And when he trained them on mathematics papers he said “you get plausible looking math, it's quite astonishing”, and just like the early researchers he noticed how it learned in phases, and he writes: “what's beautiful about this is we didn't have to hardcode any of it. The network decided what was useful to keep track of.” This is one of the cleanest and most compelling examples of where the power in deep learning is coming from. And so this was more evidence that setting up a system with the broad goal of learning to speak could then be retasked on arbitrary narrow goals we might simply ask of it. A turning point came in 2017, when a team of researchers at a lab called OpenAI built on Karpathy's work and set up a larger recurrent network and trained it on a massive set of 82 million Amazon reviews, the largest model to date. When they probed the neurons in this network they found neurons deeper in the network which had learned complex concepts. For example, they reported the discovery of a sentiment neuron: a single neuron within the network that directly corresponded to the sentiment of the text, how positive or negative it sounded. They showed this neuron's activation as it processed text, perfectly classifying the sentiment. This was striking because at the time sentiment classification was something industry used commonly, and it required specialized systems trained on that one task; but in this case, they didn't do any of that work. The sentiment neuron emerged out of the process of learning to predict the next word. And to show that this network had a good internal model, or understanding, of sentiment, they had the network generate text and while doing so they forced that sentiment neuron to be positive or negative, and it then spit out positive and negative reviews which were entirely artificial but indistinguishable from human-written reviews. And they write: “It is an open question why our model recovers the concept of sentiment in such a precise, disentangled, interpretable and manipulable way.” And this was just one neuron in a network full of these representations of abstract concepts it learned from that data as a result of trying to predict it. For future directions for their work, they mentioned the next key step, which was data diversity, but simply going bigger this time was hitting a practical limit, because there is a key problem with recurrent neural networks.
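Before turning to that problem, here is a conceptual sketch of the two sentiment-neuron experiments just described: reading one hidden unit as a sentiment score, and clamping it while generating. Everything below, the model, the unit index, and the decoding step, is a placeholder standing in for OpenAI's trained review model, purely to show the shape of the probe-and-clamp procedure.

```python
import numpy as np

SENTIMENT_UNIT = 2388                              # hypothetical index of the discovered unit

def hidden_state(text):
    """Placeholder for running the trained character model over `text`."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=4096)

def next_char(h):
    """Placeholder decoder from a hidden state to the next character."""
    return chr(97 + int(abs(h.sum())) % 26)

def sentiment_score(text):
    return hidden_state(text)[SENTIMENT_UNIT]      # the probe: just read one neuron

def generate_with_clamp(prompt, value, n_chars=40):
    out = prompt
    for _ in range(n_chars):
        h = hidden_state(out)
        h[SENTIMENT_UNIT] = value                  # the clamp: force the neuron positive or negative
        out += next_char(h)
    return out

print(sentiment_score("this film was wonderful"))
print(generate_with_clamp("This product ", value=+3.0))
```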
Because recurrent networks process data serially, all the context had to be squeezed into a fixed internal memory, and this was a bottleneck limiting the ability of the network to handle context over long sequences of text; meaning gets squeezed out. Practically, it was apparent when generating long enough statements with a recurrent neural network: it might make sense for a while, but after a few sentences it would always drift off into gibberish. Learning these long-range dependencies was a key challenge faced by the field. An alternative approach to recurrent neural networks tried to tackle this problem by simply processing the entire input sequence of text in parallel, but this requires many layers of depth in order to compensate for the lack of memory. This approach is tempting, but the resulting network becomes impossible to train. But also in 2017, another groundbreaking paper came out which was focused on the problem of translating between languages and offered a solution to this memory constraint: attention. The key insight behind their approach was to create a network with a new kind of dynamic layer which could adapt some of its connection weights based on the context of the input, known as a self-attention layer. This allowed it to do in one layer what traditional networks would have needed several layers to accomplish.
This leads to a shallower but wider network that was practical to train. These self-attention layers work by allowing every word in the input to look at and compare itself to every other word, and to absorb meaning from the most relevant words, to better capture the context of its intended use in that sentence. This is done with the addition of attention heads, which are several small networks inside the layer acting as a kind of lens which words can use to examine other words. And this is done by simply measuring the distance between all the word pairs in concept space: similar concepts will be closer in this space, leading to a higher connection weighting. Consider the sentence “the river has a steep bank”. In the self-attention layer, the word “bank” would compare itself to every other word to find conceptual similarities. For example, the words “river” and “bank” are both related in the context of a riverbank, and so this would lead to a higher weighting in that context. This leads to a second operation where each word absorbs meaning from its connections based on the strength of the weighting. This allows the word to adjust its representation, or meaning, to push towards the concept, or direction, of a riverbank. And as we go through the network we're going to make the embedding vectors for a word get better and better, because they're going to take into account more and more contextual information. And that's why we call them transformers.
They take each word and transform its meaning, shaped by the words around it. To get a sense of this in action, let's look at how a transformer network generates music by predicting the next note. In this visualization, each colored line is a different attention head, and the weight of the line is the amount of attention it gives to each location. Notice each attention head looks for different kinds of patterns in the music. The more attention heads you give a network, the more powerful it becomes, and notice that to select the next note at each step, all patterns are taken into consideration. This is a network architecture that can look at everything, everywhere, all at once.
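Here is a minimal numpy sketch of what one self-attention layer with a couple of heads does, using made-up embeddings for “the river has a steep bank”. In a real transformer the projections are learned, the dimensions are far larger, and it is the trained weights that make “bank” attend strongly to “river” in this sentence.

```python
import numpy as np

words = ["the", "river", "has", "a", "steep", "bank"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(words), 8))               # toy word embeddings, one row per word

def attention_head(X, d_head=4):
    """One head: every word compares itself to every other word, then absorbs meaning."""
    Wq, Wk, Wv = (rng.normal(0, 0.5, size=(X.shape[1], d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)             # pairwise comparison of all words
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: how much each word attends to each other word
    return weights @ V, weights                    # each word's new representation is a weighted blend

head_outputs = []
for h in range(2):                                 # several heads, each acting as a different "lens"
    out, weights = attention_head(X)
    head_outputs.append(out)
    print(f"head {h}: attention from 'bank' ->", np.round(weights[words.index("bank")], 2))

Z = np.concatenate(head_outputs, axis=1)           # concatenated head outputs feed the next layer
print(Z.shape)                                     # (6, 8): a transformed vector for every word
```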
No internal memory is needed: the network's memory is replaced by self-reference within the layer. But, critically, the “Attention Is All You Need” paper still had one foot in the old paradigm, with a narrow focus only on the problem of translation, trained in a supervised way; they were not after a general-purpose system which could do anything you asked of it. But researchers at OpenAI saw this result and immediately tried this more powerful transformer architecture on the exact same next-word prediction problem, now at a larger scale than was possible before. The next year they published a paper which introduced a model called GPT, and this time they had a much larger network that could capture hundreds of words of input, or context, at a time. It had multiple layers, each with self-attention followed by a fully connected layer, and this time they trained the network on 7,000 books from a variety of domains.
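Structurally, the stack just described can be sketched as below: each layer is masked self-attention followed by a fully connected layer, with residual connections, repeated many times. This is a forward pass with random weights and is heavily simplified; real GPT models add layer normalisation, positional embeddings, multiple heads, and a final projection to next-word probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16                                             # toy embedding size

def masked_self_attention(X, W):
    Q, K, V = X @ W["q"], X @ W["k"], X @ W["v"]
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    mask = np.triu(np.ones_like(scores), k=1)      # a word may only attend to earlier words
    scores = np.where(mask == 1, -1e9, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def feed_forward(X, W):
    return np.maximum(0, X @ W["ff1"]) @ W["ff2"]  # the fully connected layer after attention

def gpt_forward(X, layers):
    for W in layers:                               # GPT-1 stacked 12 such layers; GPT-3 used 96
        X = X + masked_self_attention(X, W)        # attention sub-layer with a residual connection
        X = X + feed_forward(X, W)                 # feed-forward sub-layer with a residual connection
    return X

layers = [{"q": rng.normal(0, 0.1, (d, d)), "k": rng.normal(0, 0.1, (d, d)), "v": rng.normal(0, 0.1, (d, d)),
           "ff1": rng.normal(0, 0.1, (d, 4 * d)), "ff2": rng.normal(0, 0.1, (4 * d, d))} for _ in range(4)]

tokens = rng.normal(size=(10, d))                  # stand-in embeddings for 10 input tokens
print(gpt_forward(tokens, layers).shape)           # (10, 16): one contextualised vector per token
```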
The results were exciting. If you prompted this network with a segment of text, it would continue that passage much more coherently than before. But, more importantly, it showed some capability in answering general questions, and these questions did not need to be present in the training data. This is known as zero-shot learning. It's a remarkable feature, and it highlighted the potential of language models to generalize from their training data and apply it to arbitrary tasks. They followed this experiment immediately with GPT-2. This time, they used the exact same approach but with a dataset scraped from a large portion of the web, and they used a much larger network with around 300,000 neurons.
The results surprised even the researchers. They tested it on tasks such as reading comprehension, summarization, translation and question answering. Amazingly, it could translate languages as well as systems trained only on translation, without any translation-specific training. But, except for a news cycle about potential misuse to generate fake news, this development was mostly ignored, even by experts in the field. The problem was that GPT-2 still eventually drifted off into nonsense after many sentences; it couldn't hold coherence or context for really long periods of time. And so it was obviously still a trick, right?
But the team now understood that this could be solved simply by making everything bigger again, especially the context window. Next, they did the same experiment again but made the network 100 times bigger. GPT-3 had 175 billion connections and 96 layers, and a much longer context window of around a thousand words. This time they trained it on the entire common web plus Wikipedia, as well as several book collections. Again, it showed increased performance across all measures, but one capability really jumped out: once training was complete, you could still teach the network new things, known as “in-context learning”. In the GPT-3 paper, researchers showed a simple example where they first gave the definition of a made-up word, “gigamuru”, and then asked the model to use that word in a sentence, which it did perfectly. This is known as the wug test, and it's a key milestone in the linguistic development of children. But this was just the tip of the iceberg. The key point is we could change the behavior of the network without changing the network weights. That is, a frozen network can learn new tricks. In-context learning works because it leverages the internal models of individual concepts, which the network can combine or compose arbitrarily. So you can think of two layers of learning: a core in-weight learning, which happens during training, and then a layer of in-context learning, which happens during use, or inference. Many pointed out that we seem to have stumbled into a new computing paradigm where the computer operates at the level of thoughts, where a thought is a response to a prompt. This put the programming of these systems in anyone's hands: the prompt is the program. But in terms of the general public, GPT-3 still remained relatively unknown. To enable general use, they took GPT-3 and shaped its behavior to better follow human instructions, with further training on more examples of good versus bad human instruction.
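In-context learning requires no change to the network at all; the prompt is the program. The gigamuru example reads roughly as below, where generate() is a hypothetical stand-in for whatever interface serves the frozen model.

```python
# The "program" is just text: a definition the model has never seen, followed by a task.
prompt = (
    'A "gigamuru" is a type of Japanese musical instrument.\n'
    'An example of a sentence that uses the word gigamuru is:'
)
print(prompt)

# response = generate(prompt)   # hypothetical call to the frozen model
# print(response)               # e.g. a sentence correctly using "gigamuru",
#                               # produced without any change to the network weights
```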
This further training pushed the learning pressure beyond simply the next word and towards the next phrasing: not only what to say but how to say it. Every time he does something a little closer to what we want him to do, we reinforce him. Twenty minutes, and the pigeon has learned to peck the disc to get food. The result, known as InstructGPT, could engage in human conversations much more effectively, and this became the consumer-facing product, ChatGPT. This kicked off the most exciting year of experimentation in AI history, as over 100 million people used this system in public and reported their results in a fire hose of surprises. One key observation after its release was its ability to talk to itself and think out loud. A highly shared paper showed that simply adding the phrase “think step by step” at the end of your prompt dramatically improved the performance of ChatGPT, because this kicked off an iterative loop where the sub-thoughts were written down into meaningful chunks, allowing it to follow a chain of reasoning as long as needed, resulting in fewer errors. This led to an explosion of experiments all building on this idea of self-talk. Then people tried to put these agents into virtual worlds and gave them tasks, and they would learn to use tools to accomplish them, talking to themselves along the way. Researchers also applied this tool use to the real world, plugging language models into external computer systems through APIs, allowing them to make calls, place orders and perform arbitrary tasks. And finally we gave them physical senses through cameras and actuators. In fact, every task performed by computers could now be re-engineered with an LLM at the core of the process. I don't think it's accurate to think of large language models as a chatbot or like some kind of a word generator. I think it's a lot more correct to think about it as the kernel process of an emerging operating system. You have an equivalent of random access memory, or RAM, which in this case for an LLM would be the context window, and you can imagine this LLM trying to page relevant information in and out of its context window to perform your task. And when researchers pressed ahead with networks even 10 times bigger, on GPT-4 and beyond, the trend continued again, and today there is a race to build the most capable intelligent agent of them all: the oracle humans have always dreamed of and feared. Some speculate this moment marks a unification of the field of AI around a single direction. Instead of specialized networks focused on specific kinds of data, all researchers pointed to treating all perceptions as language, that is, a series of information-bearing symbols, and then training networks on prediction, powered by networks with self-attention. This leads to a more general system that can be retasked on any arbitrary narrow problem. It seems that there's something fundamental about the ability to learn to get better at predicting future perceptions that is core to learning in both biological and artificial neural networks. Imagination is a great survival mechanism because it minimizes surprise, and since our actions are part of our perceptions, we also learn the result of our actions as a byproduct of this perception-prediction problem. So the question now is: did we invent the core of a tool to end all tools, a potential solution to mental automation, which was the original dream of computer science? Not all agree with this, and some are even insulted. Well, this is glorified autofill.
These systems are designed in such a way that in principle they can tell us nothing about language, about learning, about intelligence, about thought. Nothing. The idea that it's just sort of predicting the next word and using statistics: there's a sense in which that's true, but it's not the sense of statistics that most people understand. From the data it figures out how to extract the meaning of the sentence, and it uses the meaning of the sentence to predict the next word. It really does understand, and that's quite shocking. Chomsky's whole view of language is kind of crazy when you look back on it, because language is about conveying meaning, it's about conveying stuff. Well, for what it's worth, Jeff, I've always thought Chomsky was completely wrong, from my undergraduate age, and I think he sent natural language processing down the wrong route for a long time. And even the three godfathers of deep learning are not on the same page anymore. So linguistic abilities and fluency are not related to the ability to think; those are two different things. But people we respect a lot, like Yann, think they don't really understand, and it's crucial to resolve this issue, and we may not be able to come to a consensus about other issues until we've resolved that issue. I've never seen the AI community become fragmented the way it feels like it's trending right now. And at the root of this divide is a philosophical question. One group believes these models trick us into thinking they are smarter than they are, like mirrors that reflect our own thoughts in ways we didn't anticipate, and the other side believes that if it looks like thought then it is thought. And so the line between simulating thought and actual thought is becoming ever more blurred, or perhaps there is no line.
This was, and still is, something people are struggling to get their heads around. I asked her to my office and sat her down at the keyboard, and then she began to type, and of course I looked over her shoulder to make sure that everything was operating properly. After two or three interchanges with the machine she turned to me and she said, would you mind leaving the room, please? And yet she knew, as Weizenbaum did, that Eliza didn't understand a single word that was being typed into it. You're like my father in some ways. You don't argue with me. Why do you think I don't argue with you? You're afraid of me. Does it please you to think I'm afraid of you?