Large Language Models (LLMs) - Everything You NEED To Know
Matthew Berman
A brief introduction to everything you need to know about Large Language Models (LLMs) to go from kn...
Video Transcript:
This video is going to give you everything you need to go from knowing absolutely nothing about artificial intelligence and large language models to having a solid foundation of how these revolutionary technologies work. Over the past year, artificial intelligence has completely changed the world, with products like ChatGPT potentially upending every single industry and how people interact with technology in general. In this video I will be focusing on LLMs: how they work, ethical considerations, applications, and so much more. This video was created in collaboration with an incredible program called AI Camp, in which high school students learn all about artificial intelligence, and I'll talk more about that later in the video. Let's go.

So first, what is an LLM? Is it different from AI, and how is ChatGPT related to all of this? LLM stands for large language model, which is a type of neural network that's trained on massive amounts of text data. It's generally trained on data that can be found online: everything from web scraping to books to transcripts. Anything that is text-based can be trained into a large language model.

Taking a step back, what is a neural network? A neural network is essentially a series of algorithms that try to recognize patterns in data, and really what they're trying to do is simulate how the human brain works. LLMs are a specific type of neural network that focuses on understanding natural language, and as mentioned, LLMs learn by reading tons of books, articles, and internet text; there's really no limitation there.

So how do LLMs differ from traditional programming? Well, traditional programming is instruction-based, which means "if X, then Y": you're explicitly telling the computer what to do, giving it a set of instructions to execute. But with LLMs it's a completely different story. You're teaching the computer not how to do things, but how to learn how to do things, and this is a much more flexible approach that is really good for a lot of applications that traditional coding previously could not accomplish.

One example application is image recognition. With image recognition, traditional programming would require you to hard-code every single rule for how to identify different letters: A, B, C, D. But if you're handwriting these letters, everybody's handwritten letters look different, so how do you use traditional programming to identify every single possible variation? That's where the AI approach comes in. Instead of giving the computer explicit instructions for how to identify a handwritten letter, you give it a bunch of examples of what handwritten letters look like, and then it can infer what a new handwritten letter is based on all of the examples it has. What also sets machine learning and large language models apart in this new approach to programming is that they are much more flexible and adaptable, meaning they can learn from their mistakes and inaccuracies and are thus much more scalable than traditional programming.

LLMs are incredibly powerful at a wide range of tasks, including summarization, text generation, creative writing, question answering, and programming. If you've watched any of my videos, you know how powerful these large language models can be, and they're only getting better. Know that right now, large language models and AI in general are the worst they'll ever be, and as we generate more data on the internet and as we use synthetic data, which means data created by other large language models, these models are going to get better rapidly. It's super exciting to think about what the future holds.
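To make that contrast between hand-written rules and learning from examples concrete, here is a minimal Python sketch. It uses scikit-learn's built-in handwritten digit images as a stand-in for handwritten letters; the library, dataset, and model choice are my own illustrative assumptions rather than anything the video prescribes.

```python
# A minimal sketch of the "learn from examples" approach described above.
# Handwritten digits stand in for handwritten letters; note that no
# recognition rules are hard-coded anywhere.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 8x8 grayscale images plus their labels
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Instead of if/else rules for every possible handwriting variation,
# the model infers the patterns from thousands of labeled examples.
model = SVC(gamma=0.001)
model.fit(X_train, y_train)

print("accuracy on unseen digits:", model.score(X_test, y_test))
```

The same idea scales up: swap the tiny digit images for billions of sentences and the small classifier for a neural network, and you have the recipe behind large language models.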
Now let's talk a little bit about the history and evolution of large language models; we're going to cover just a few of them in this section. The history of LLMs traces all the way back to the ELIZA model from 1966, which was really the first language model. It had pre-programmed answers based on keywords and a very limited understanding of the English language, and, like many early language models, you started to see holes in its logic after a few back-and-forths in a conversation. After that, language models really didn't evolve for a very long time. Although technically the first recurrent neural network, or RNN, was created in 1924, they weren't really able to learn until 1972. These learning models are a series of neural networks with layers, weights, and a whole bunch of other stuff that I'm not going to get into in this video. RNNs were really the first technology that was able to predict the next word in a sentence rather than having everything pre-programmed, and that was really the basis for how current large language models work. Even after this, and the advent of deep learning in the early 2000s, the field of AI evolved very slowly, with language models far behind what we see today.

This all changed in 2017, when a team at Google released a research paper about a new technology called transformers; the paper was called "Attention Is All You Need." A quick side note: I don't think Google even knew quite what they had published at the time, but that same paper is what led OpenAI to develop ChatGPT, so obviously other computer scientists saw the potential of the transformer architecture. This new architecture was far more advanced: it required less training time and had many other features, like self-attention, which I'll cover later in this video. Transformers allowed for pre-trained large language models like GPT-1, which was developed by OpenAI in 2018. It had 117 million parameters and was completely revolutionary, but it was soon outclassed by other LLMs. Then BERT was released in 2018, with 340 million parameters and bidirectionality, meaning it had the ability to process text in both directions, which helped it have a better understanding of context. As a comparison, a unidirectional model only has an understanding of the words that came before the target text.

After this, LLMs didn't develop a lot of new technology, but they did increase greatly in scale. GPT-2 was released in early 2019 with 1.5 billion parameters, then GPT-3 in June of 2020 with 175 billion parameters, and it was at this point that the public started noticing large language models. GPT-3 had a much better understanding of natural language than any of its predecessors, and this is the type of model that powers ChatGPT, which is probably the model you're most familiar with. ChatGPT became so popular because it was so much more accurate than anything anyone had ever seen before, and that was really because of its size and because it was now built into a chatbot format, so anybody could jump in and really understand how to interact with the model.
ChatGPT, running on GPT-3.5, came out at the end of 2022 and started the current wave of AI that we see today. Then, in March 2023, GPT-4 was released, and it was incredible, and it's still incredible to this day.
It reportedly has a whopping 1.76 trillion parameters and likely uses a mixture-of-experts approach, which means it has multiple models that are each fine-tuned for specific use cases, and when somebody asks it a question, it chooses which of those models to use. Then they added multimodality and a bunch of other features, and that brings us to where we are today.

All right, now let's talk about how LLMs actually work in a little more detail. The process can be split into three steps. The first step is called tokenization: there are neural networks trained to split long text into individual tokens, and a token is essentially about three-quarters of a word. If it's a shorter word, like "hi," "that," or "there," it's probably just one token, but a longer word like "summarization" is going to be split into multiple pieces. The way tokenization happens is actually different for every model; some of them separate prefixes and suffixes. Let's look at an example: "What is the tallest building?" Here "what," "is," "the," "tall," "est," and "building" are all separate tokens, so the suffix is separated off of "tallest" but not off of "building," because the tokenizer takes the context into account. This step is done so models can understand each word individually, just like humans: we understand each word individually and as groupings of words.

The second step is something called embeddings. The large language model turns those tokens into embedding vectors, essentially a bunch of numerical representations of those tokens. This makes it significantly easier for the computer to read and understand each word and how the different words relate to each other, and these numbers all correspond to positions in an embeddings vector database. The final step in the process is transformers, which we'll get to in a little bit, but first let's talk about vector databases. I'm going to use the terms "word" and "token" interchangeably, so just keep that in mind; they're almost the same thing, not quite, but almost.

The word embeddings I've been talking about are placed into something called a vector database. These databases are storage and retrieval mechanisms that are highly optimized for vectors, and again, those are just long series of numbers. Because the words are converted into these vectors, the model can easily see which words are related to other words based on how similar they are, how close they are, based on their embeddings, and that is how the large language model is able to predict the next word based on the previous words. Vector databases capture the relationships between data as vectors in multidimensional space. I know that sounds complicated, but it's really just a lot of numbers. Vectors are objects with a magnitude and a direction, which both influence how similar one vector is to another, and that is how LLMs represent words: each word gets turned into a vector capturing semantic meaning and its relationships to other words. Here's an example: the words "book" and "worm," which independently might not look related, are in fact related concepts because they frequently appear together (a bookworm is somebody who likes to read a lot), and because of that they will have embeddings that are close to each other. So models build up an understanding of natural language using these embeddings, looking for similarity between different words, terms, groupings of words, and all of these nuanced relationships.
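Here is a small Python sketch to make tokenization and embedding similarity a bit more concrete. It uses the open-source tiktoken tokenizer and tiny made-up three-dimensional "embeddings" purely for illustration; the encoding name and the toy vectors are my own assumptions, and real models learn embeddings with hundreds or thousands of dimensions during training.

```python
# Toy illustration of tokenization and embedding similarity.
# Assumes `pip install tiktoken`; the encoding name is just one example.
import math
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("What is the tallest building?")
print(tokens)                              # token ids (integers)
print([enc.decode([t]) for t in tokens])   # the text piece behind each id

# Made-up 3-dimensional "embeddings"; real models learn much larger vectors.
embeddings = {
    "book":  [0.90, 0.80, 0.10],
    "worm":  [0.85, 0.75, 0.20],
    "pizza": [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    """Closer to 1 means the two vectors point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["book"], embeddings["worm"]))   # high: related
print(cosine_similarity(embeddings["book"], embeddings["pizza"]))  # much lower
```

The "book"/"worm" pair ends up with a high similarity score because their toy vectors point in nearly the same direction, which is the same intuition a vector database uses when it retrieves related words.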
The vector format helps models understand natural language better than other formats, and you can kind of think of all this like a map: if you have a map with two landmarks that are close to each other, they're likely going to have very similar coordinates. It's kind of like that.

Okay, now let's talk about transformers. Matrix representations can be made out of those vectors we were just talking about. This is done by extracting information out of the numbers and placing it into a matrix through an algorithm called multi-head attention. The output of the multi-head attention algorithm is a set of numbers that tells the model how much each word, and its order, is contributing to the sentence as a whole. We transform the input matrix into an output matrix, which will then correspond to a word having the same values as that output matrix. So basically we're taking that input matrix, converting it into an output matrix, and then converting it into natural language, and the word is the final output of this whole process. This transformation is done by the algorithm that was created during the training process, so the model's understanding of how to do this transformation is based on all of the knowledge it was trained with: all of that text data from the internet, from books, from articles, et cetera. It learned which sequences of words go together, and their corresponding next words, based on the weights determined during training. Transformers use an attention mechanism to understand the context of words within a sentence. It involves calculations with the dot product, which is essentially a number representing how much a word contributed to the sentence. It will find the differences between the dot products of words, give correspondingly large values for attention, and take a word into account more if it has higher attention.
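To give a rough feel for that calculation, here is a toy NumPy sketch of scaled dot-product attention, the operation at the heart of multi-head attention. The tiny four-dimensional vectors, the single head, and the absence of learned projection matrices are simplifications I'm assuming for illustration; a real transformer learns those weights during training and runs many attention heads in parallel.

```python
# Toy scaled dot-product attention (single head, made-up 4-dimensional vectors).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each output row is a weighted mix of V's rows."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # dot products: how much each word attends to each other word
    weights = softmax(scores)       # each row sums to 1
    return weights @ V, weights

# Pretend these are the already-embedded tokens of a three-word sentence.
# In a real model, Q, K, and V come from multiplying the embeddings by learned matrices.
x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
output, weights = attention(x, x, x)

print("attention weights:\n", weights)  # how strongly each word "looks at" the others
print("output vectors:\n", output)
```

Words whose vectors produce large dot products with each other get large attention weights, which is exactly the "take that word into account more" behavior described above.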
Now let's talk about how large language models actually get trained. The first step of training a large language model is collecting the data, and you need a lot of data. When I say billions of parameters, that is just a measure of how much is actually going into training these models, and you need to find a really good data set. If you have really bad data going into a model, you're going to have a really bad model: garbage in, garbage out. So if a data set is incomplete or biased, the large language model will be too. And data sets are huge; we're talking about massive amounts of data. They take data from web pages, books, conversations, Reddit posts, X posts, YouTube transcriptions, basically anywhere we can get some text data, and that data is becoming incredibly valuable. Let me put into context how massive the data sets we're talking about really are: the little bit of text on screen is 276 tokens; zoomed out, a single pixel represents that many tokens; and a representation of 285 million tokens is still only 0.02% of the 1.3 trillion tokens that some large language models train on.

There's an entire science behind data pre-processing, which prepares the data to be used to train a model: everything from looking at data quality to labeling consistency, data cleaning, data transformation, and data reduction, but I'm not going to go too deep into that. This pre-processing can take a long time, and it depends on the type of machine being used, how much processing power you have, the size of the data set, the number of pre-processing steps, and a whole bunch of other factors that make it really difficult to know exactly how long pre-processing will take. But one thing we know takes a long time is the actual training. Companies like NVIDIA are building hardware specifically tailored to the math behind large language models, and this hardware is constantly getting better. The software used to train these models is getting better too, so the total time to train a model is decreasing, but the size of the models is increasing. Training these models is extremely expensive, because you need a lot of processing power and electricity, and these chips are not cheap; that is why NVIDIA's stock price has skyrocketed and their revenue growth has been extraordinary.

For the training process itself, we take the pre-processed text data we talked about earlier and feed it into the model, and then, using transformers (or whatever technology the model is actually based on, but most likely transformers), it will try to predict the next word based on the context of that data. It adjusts the weights of the model to get the best possible output, and this process repeats millions and millions of times, over and over again, until we reach some optimal quality.

The final step is evaluation. A small amount of the data is set aside for evaluation, the model is tested on this held-out data set, and then the model is adjusted if necessary. The metric used to determine the effectiveness of the model is called perplexity: it measures how surprised the model is by the evaluation text, giving a good (low) score when the model assigns high probability to the actual next words and a bad (high) score when it doesn't. Then we also use RLHF, reinforcement learning from human feedback: users or testers actually test the model and provide positive or negative scores based on the output, and once again the model is adjusted as necessary.
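Here is a tiny, hedged sketch of that "predict the next word and adjust the weights" loop, including a perplexity calculation at the end. It assumes PyTorch is installed, and the twelve-word corpus, the embedding-plus-linear "model," and the hyperparameters are all toy stand-ins; real LLMs do the same thing with transformer layers, billions of parameters, and trillions of tokens.

```python
# A toy version of next-word-prediction training, using PyTorch.
import torch
import torch.nn as nn

text = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(text))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in text])

# Training pairs: each word is used to predict the word that follows it.
x, y = ids[:-1], ids[1:]

# A minimal "language model": an embedding layer plus a linear layer.
model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):      # real training runs for vastly longer
    logits = model(x)        # predicted scores for each possible next word
    loss = loss_fn(logits, y)  # how wrong those predictions were
    optimizer.zero_grad()
    loss.backward()          # work out how to nudge every weight
    optimizer.step()         # adjust the weights

# Perplexity is the exponential of the average loss:
# lower means the model is less "surprised" by the text.
print("perplexity:", torch.exp(loss).item())
```

In a real pipeline the perplexity would be computed on the held-out evaluation data rather than the training text, but the formula is the same.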
All right, let's talk about fine-tuning now, which I think a lot of you are going to be interested in, because it's something the average person can get into quite easily. We have these popular large language models that are trained on massive data sets to build general language capabilities, and these pre-trained models, like BERT and GPT, give developers a head start versus training models from scratch. Then in comes fine-tuning, which allows us to take these raw foundation models and tune them for our specific use cases. Let's think about an example. Say you want to fine-tune a model to take pizza orders: to have conversations, answer questions about pizza, and finally let customers buy a pizza. You can take a pre-existing set of conversations that exemplify the back-and-forth between a pizza shop and a customer, load that in, fine-tune the model, and all of a sudden that model is going to be much better at having conversations about pizza ordering. The model updates its weights to get better at understanding the relevant pizza terminology, questions, responses, tone, everything. Fine-tuning is much faster than a full training run, it produces much higher accuracy for the target use case, and it allows pre-trained models to be adapted to real-world applications. Finally, you can take a single foundation model and fine-tune it any number of times for any number of use cases, and there are a lot of great services out there that let you do that. Again, it's all about the quality of your data: if you have a really good data set to fine-tune a model on, the model is going to be really good, and conversely, if you have a poor-quality data set, it's not going to perform as well.
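There are many ways to do this in practice (hosted fine-tuning services, adapter methods like LoRA, and so on). As one illustration, here is a minimal sketch using the Hugging Face transformers and datasets libraries, with a small GPT-2 variant standing in for the foundation model; the two pizza dialogues, the model name, and the hyperparameters are placeholder assumptions, not anything the video prescribes.

```python
# A minimal fine-tuning sketch: continue next-token training on domain data.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

dialogues = [
    "Customer: Do you have gluten-free crust? Shop: Yes, in medium and large.",
    "Customer: One large pepperoni, please. Shop: Great, that will be $14.",
]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Tokenize the example conversations.
dataset = Dataset.from_dict({"text": dialogues}).map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pizza-model", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    # mlm=False means plain next-token prediction, the same objective as pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()   # updates the pre-trained weights on the pizza conversations
```

With a real data set of a few hundred or thousand conversations instead of two toy lines, the same loop is what nudges a general-purpose model toward your specific use case.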
All right, let me pause for a second and talk about AI Camp. As mentioned earlier, this video, all of its content, and the animations were created in collaboration with students from AI Camp. AI Camp is a learning experience for students aged 13 and above: you work in small, personalized groups with experienced mentors, and together you create an AI product using NLP, computer vision, and data science. AI Camp has both a three-week and a one-week program during the summer that require zero programming experience, and they also have a new ten-week program during the school year, which is less intensive than the one-week and three-week programs, for students who are really busy. AI Camp's mission is to provide students with deep knowledge of artificial intelligence, which will position them to be ready for AI in the real world. I'll link an article from USA Today in the description all about AI Camp, but if you're a student, or the parent of a student in this age range, I would highly recommend checking it out. Go to ai-camp.org to learn more.

Now let's talk about the limitations and challenges of large language models. As capable as LLMs are, they still have a lot of limitations. Recent models continue to get better, but they are still flawed: incredibly valuable and knowledgeable in certain ways, but deeply flawed in others. With math, logic, and reasoning, they still struggle a lot of the time compared with humans, who understand concepts like that pretty easily. Bias and safety also continue to be a big problem. Large language models are trained on data created by humans, which is naturally flawed; humans have opinions on everything, and those opinions trickle down into these models. These data sets may include harmful or biased information, and some companies take their models a step further and apply a level of censorship to them. That's an entire discussion in itself, whether censorship is worthwhile or not; I know a lot of you already know my opinions on this from my previous videos. Another big limitation of LLMs historically has been that they only have knowledge up to the point when their training occurred, but that is starting to be solved, with ChatGPT being able to browse the web, for example, or Grok from X.