Thanks for being here. Welcome to UDub CSE. So, as you might have been pitched on this already, this is one of the top computer science departments in the world.
And if you don't believe me, I actually had offers to be a professor at MIT, Harvard, and Princeton, and I came here because I think it's a better department. So, that's the hard sell. Today I'm just going to give you a talk about AI, which is what I work on, my particular research interests, and what I think are the most exciting trends and things going on in AI right now.
So, you know, you might have seen AI in the news a lot in recent years. Capabilities have really exploded. How many people are using ChatGPT on a regular basis?
Awesome. Not to do your homework, of course. Of course.
Um, there you go. There you go. But uh you know, even though it's only made it into the New York Times so frequently in the last few years, what you might not realize is you've probably been using AI every day since like 2000.
I was looking this up. So Netflix has been using machine learning based content recommendations since 2000. Um and you know, even the post office was doing optical character recognition since the 90s, right?
So you're interacting with these systems already every day: voice recognition, machine translation, and, you know, we're already using it in medicine to detect tumors in radiology scans with accuracy that's greater than a team of trained oncologists, right? And so it's only recently that this generative AI thing has come on top of everything we've already been doing for decades now.
Um but any idea what underlies all of this stuff? Yes, language model. Large language models are a big deal.
They're more on this generative AI side, but Netflix doesn't need a large language model necessarily to tell you what movie to watch actually. Yeah. Yeah, that's where I was actually going with this.
So, um, the key technology that is kind of my specialty is called machine learning. And we'll get to that part in a second. So, machine learning is just algorithms that learn from data without being explicitly programmed.
So, if you have more data, they work better, right? So they're learning patterns in data. This can be pattern recognition and it can be making predictions about what's going to happen for a new data point that we haven't seen before.
Um so for example you might say these are some features I have about you know some medical images and I want to decide which ones uh are predictive of having cancer or not having cancer. You can do all this LLM stuff just like you said. Um, and then my particular area, which we'll talk about a little bit later, is actually trying to do something that's a little more agentic, like having an agent that optimizes its behavior as it interacts with the world.
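To make that medical-image example concrete, here's a minimal sketch of the "learn from data, then predict on a new data point" idea (the features and labels here are made up):

```python
# Fit a classifier on labeled examples, then predict on an unseen data point.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: hypothetical features extracted from a medical image
# (e.g., lesion size, contrast); labels: 0 = benign, 1 = tumor (toy data).
X = np.array([[2.1, 0.4], [5.8, 0.9], [1.3, 0.2], [6.5, 0.7]])
y = np.array([0, 1, 0, 1])

model = LogisticRegression().fit(X, y)  # "learning patterns in data"
print(model.predict([[5.0, 0.8]]))      # prediction for a new, unseen case
```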
Um, so we'll talk a bit more about that. Um, why has AI taken off in recent years? I would say the answer is this idea of using neural networks or deep learning.
So, we've kind of had a revolution in this area. I would say kicking off really in like 2012 and just accelerating and accelerating from there. So the way deep learning or neural networks work is you take a neural net and neural nets were actually originally conceptualized by some neuroscientists to kind of mimic the human brain.
And so you have a bunch of these neurons, and what you learn to do is strengthen or weaken the weights of the connections going from the data you're perceiving, which could be an image, into these neurons in the hidden layer, to be able to predict some output, like is it a tumor or not. And then what has happened with deep learning is you add many, many more hidden layers, plus numerous computational optimizations on top to get it to train. And this has really taken off.
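If it helps to picture that, here's a tiny sketch of such a network (toy layer sizes, assuming PyTorch; not any particular model from the talk):

```python
# A rough sketch of the structure described above: input features feed into
# hidden layers of "neurons" whose connection weights are learned during
# training; deep learning just stacks many such layers.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),   # weights from the input (e.g., image features) to a hidden layer
    nn.ReLU(),
    nn.Linear(128, 128),  # modern deep nets add many, many more of these hidden layers
    nn.ReLU(),
    nn.Linear(128, 1),    # output: a score for something like "tumor or not"
    nn.Sigmoid(),
)
print(model(torch.randn(1, 64)))  # a prediction for one (random) input
```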
The way these things work is they work better if you have more data. So as we get more data, more GPUs, more compute, things just tend to work better and better and better. And so that's where you're seeing a lot of the gains keep coming from.
So we're now in this like crazy LLM generative AI era. So what do people think? Is this like a good thing?
Are you excited about it? Do you have some concerns? Anyone want to venture a thought?
Speak up. I can't quite hear you. I mean, I think, whether we're concerned or not, AI is with us.
There's no escaping it. Whether we're concerned or not, adapt to it. We should learn to adapt.
Whether we're concerned or not, it's not going away. Yeah. Anybody else want to share?
Yes. I think there's a lot of room to improve our efficiency with AI, and I think it's going to be very important in the future, but I think it's also important to consider ethics and bias in AI and make sure that we're also leaving room for human involvement in AI development. That's very nice.
Yeah. So, uh it could potentially accelerate a lot of our work, maybe freeing up people's drudgery and labor. That's like the promise, but we have a lot of concerns around bias and ethics and safety.
I think that's really good. Yeah. So, we'll talk a little bit about that in the talk.
And actually, speaking of that, um that's some of the stuff I work on. So, why am I talking to you today? Uh this is the lab I run here at UDub.
So, the social reinforcement learning lab. Um and I kind of work on this question of how do we make AI better when we can learn from other intelligent agents that might be in our environment? And that includes humans.
And it turns out that learning from humans is one of the predominant ways of making sure these models are safer and less biased. So we'll be talking about how to do that today. Now, it might be obvious to you why learning from other agents could be useful for tasks where you have to coordinate with other agents, like autonomous driving, or interact with humans.
But I actually think it's a way to get just better performance out of AI in general. So be able to learn more complex behavior and generalize to new environments because this idea of generalization is a critical weakness of machine learning. Machine learning says given some data I can fit that data but if I go outside of the data that I've seen it can break.
So how do you address that issue? Right? All right.
So my particular area is called reinforcement learning. And in this uh domain of machine learning, we care about having an agent that can interact with an environment over time. So what you see is the environment will be in a particular state which we say is s at time t.
The agent takes an action and then the environment will transition to the next state and the agent can get some reward. And what you're doing in RL is you're trying to maximize your reward over the course of interacting with the environment over time. And so that's like you're just optimizing behavior over time.
That's a key idea. It actually distinguishes RL from other areas of machine learning because uh instead of just making a single prediction like does this person have cancer, you have to predict all the actions like a robot should take so it can like open a door if that makes sense. All right.
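If it helps to see that loop written down, here's a generic sketch (env and agent are placeholders standing in for any environment and learning agent, not a specific library):

```python
# The RL interaction loop: the environment is in state s_t, the agent takes
# action a_t, the environment transitions and returns a reward, and the agent
# tries to maximize the total reward it collects over time.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                         # a_t
        next_state, reward, done = env.step(action)       # s_{t+1}, r_t
        agent.update(state, action, reward, next_state)   # learn from this experience
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```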
So I do a bunch of stuff. It kind of spans this space from you know learning from human feedback to train language models to doing things like coordinating with real people. So I do a lot of studies where we actually evaluate with real humans in the world.
I was fortunate to be part of this project at Google where we trained the first robot that can play human-level table tennis. It can beat all of the amateur table tennis players, it wins at about a 50% rate against intermediate players, and all of the expert table tennis players beat the robot.
So, still some work to do there, but it was a lot of fun. I also work on general multi-agent RL for optimization of things like satellite assignment, and it's a really interesting way to do adversarial training: you make sure your agent is robust to any input it could get, and that helps with this generalization issue I mentioned.
But for the purposes of today, we're going to talk about this learning from human feedback piece of it. And in fact, rather than, you know, giving my normal spiel about my own work, I actually want to give you a bit of a rant about the use of reinforcement learning to train large language models, or LLMs, like ChatGPT. So, how many people saw the recent news about DeepSeek?
Yeah. Okay. You guys are very informed.
You know what you're doing here. So, DeepSeek made a big splash, enough to affect the stock market. It was a big deal.
A Chinese company was able to train a supposedly much smaller model to get excellent performance on reasoning tasks, beating OpenAI's best model, o1. So it was very exciting, and I thought the paper was great. I really liked it as well.
But in the wake of it coming out, I have to say there were a lot of just bad takes on Twitter. People started saying things like "DeepSeek proved that reinforcement learning can work for LLMs," "We're now in the RL-for-LLMs era," and even "RL might finally be starting to work for LLMs."
And this kind of bugged me, because I've been working on RL for LLMs for a long time. I think it worked, you know, before this year. And if we go back to what I would call the big transformative moment when we really started hearing about AI in the news, I would pinpoint that to ChatGPT in 2022.
What was the key differentiator that made ChatGPT different from previous GPT models? Does anybody know? Not transformers.
Previous GPTs were transformers. They were always general purpose transformer models. That's been happening since 2017.
What happened in 2022? Yes, exactly. That's that was the buildup I was going for.
ChatGPT uses RL fine-tuning. So, how does it work? ChatGPT does this procedure called reinforcement learning from human feedback.
So the way it works is you pre-train your model on all of the text that you could possibly get your hands on that humanity has ever generated. So people from OpenAI have now been saying pre-training is tapped out because they've literally trained on all the text, all the books, all the everything you can scrape off the internet, all the text in the world. So you make this model that can generate all this text.
But what's the potential problem with that? If I train on the entire internet, we see some issues. Yes.
Too much data that could uh increase web hosting costs for people with like small websites that get scraped too much. Oo, nuanced issue. Nuanced issue.
So, he's saying you're scraping sites that can't afford the traffic. Yeah. The scraping bots are causing huge problems for certain businesses that maybe don't have enough money to pay for 95% of their traffic being just AI scraping.
Being AI scraped, very interesting. Not where I was going with that, but a nice issue. What were you going to say? Plagiarism.
Plagiarism is definitely an issue. So you're training on copyrighted material. Definitely an issue.
Also not where I was going but valid. Yes. Misinformation.
Misinformation. Okay. This is closer to what I was saying.
Yes, when you train on everything that's ever been put on the internet, some of that stuff is false, right? Guess what? Some of that stuff is also toxic, racist, sexist.
You've got all kinds of problems. You're capable of generating all kinds of bad stuff. Now, so what reinforcement learning from human feedback says is, I'm going to take some outputs of my model and ask a person to tell me, is this appropriate or is it not?
Which response do I like better? And also, does this sound more helpful? Does this sound more like a conversational assistant that's usefully helping the user?
Because guess what? If you just sample text, it's not actually a helpful assistant chatbot. It's just text.
So, how do you actually make it chat? You use this reinforcement learning from human feedback procedure. It turns out OpenAI did this in 2022.
What they found is that they can train a 1.5 billion parameter model. So that's how many weights are in the network.
And if they train it with this reinforcement learning from human feedback procedure, people will like it better than a model that's more than 100 times the size but only trained with the previous approach of supervised learning. And the dashed line is actually human answers to the questions. And what you see is that you're now getting superhuman performance from the language model in terms of answering people's questions.
And so that was actually the introduction of reinforcement learning. It was already in 2022 that OpenAI was saying we need to invest in RLHF fine-tuning, because that's more cost-effective than continuing to make bigger and bigger models, and we're literally running out of data to train them. So, you know, when DeepSeek came out and said the same thing, that we can train a smaller model if we do enough RL fine-tuning, that's awesome, but it's not that novel to me.
But everything always looks cooler and more interesting the farther away from it you are. I think is sort of a truism about research. But anyway, um going into this deepseek paper, I actually loved it because the way it reads is really it's reading like a love letter to RL.
So they write stuff like: the model is able to learn alternative approaches to problem solving that arise spontaneously. These behaviors are not explicitly programmed, but emerge as a result of the model's interaction with the reinforcement learning environment. It's a captivating example of how RL can lead to unexpected and sophisticated outcomes.
Rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives and it autonomously develops advanced problem solving capabilities. Pretty cool. Pretty cool stuff.
But the reason I really like that is because it reminds me of why I got into RL in the first place. I went to my first AI conference, I think back in 2015, and I saw this video where they were trying to train a humanoid-like robot in simulation to get up onto a ledge, with no information about how to do this. And it spontaneously learned that it has to first use its hands, then put its leg up on the ledge, then lift itself up, all from just optimizing the objective of getting up onto the ledge.
And I was like, "This is so cool." It blew my mind. So, unfortunately, I couldn't find that video from 2015, but it looks something like this, where you've got a humanoid robot.
You just optimize it with RL and it learns a super cool behavior to maximize the incentive. So, I think that's why I was so taken with this, and why I'm motivated to talk to you today about this idea of reinforcement learning fine-tuning of LLMs: how that works now, some of my own early work in this space, and some recent work from my group if we have time. So, first I'm going to take you on a little trip down memory lane, talking about my own early efforts in RL fine-tuning of LLMs.
And I'm going to talk about this in the context of music generation because it's a little fun to see. So back when I was an intern at Google Brain in 2016, my job, my internship project was to try to get uh AI models to generate music and we had access to like all of the Google Play Music data. So we had a lot of data.
Very good. But the problem is that the models back then were not transformers, as you so astutely pointed out. They were these things called LSTMs.
So, let's see if you think they sound as good as modern AI. This is an LSTM trained on all the Google Play data we could get. [Music] So, not the most engaging composition, right?
Um it's got a weird pause in the middle. It doesn't really have the right structure. Like, it's not so good, right?
So our idea was: what if we could use reinforcement learning fine-tuning to try to force the model to follow some rules of music theory? So we went back to this handbook on 18th century counterpoint, and we found a bunch of nice rules that we could code up as a program to score the composition and see if it followed those rules. Like, the highest note in the composition should only happen once, right?
Um you could come back to motifs, stuff like this. So I kind of wrote up a program to score things according to this. And then the idea was we were going to pre-train on the music data and keep training with RL to learn the rules of music theory.
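This isn't the actual 2016 reward code, but just to give a flavor, a rule-based scorer along those lines might look like:

```python
# Score a composition (a list of MIDI pitches) against a couple of hand-coded
# music-theory rules, so that RL has something to optimize.
def music_theory_reward(notes):
    reward = 0.0
    if notes:
        # Rule: the highest note in the composition should only happen once.
        if notes.count(max(notes)) == 1:
            reward += 1.0
        else:
            reward -= 1.0
        # Rule: discourage hammering the same note over and over.
        repeats = sum(1 for a, b in zip(notes, notes[1:]) if a == b)
        reward -= 0.1 * repeats
    return reward

print(music_theory_reward([60, 62, 64, 67, 64, 62, 60]))  # a toy melody
```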
Now why do you think we have to pre-train on the data instead of just going straight to the RL part? Yes. It would take so much longer if you just trained it with reinforcement.
It would take so much longer if you just trained it with reinforcement learning. That's right. Because the way reinforcement learning works is trial and error learning.
So basically, let's say I was trying to learn the three-word sentence that humans like the most using human feedback. I basically would have to try every combination of three words that I could get. Even if my vocabulary was only 10,000 words, if I do this, I end up with a trillion possible combinations.
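Just to spell out that count:

$$10{,}000^3 = 10^{12},$$

i.e., on the order of a trillion candidate three-word sequences.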
Do you think I can afford to pay people to label each of those trillion sentences? No. Right?
So what we have to do is use this pre-trained model to restrict the search space to valid, probable English sentences, and then we can ask humans which ones they actually like. So this is why we have to pre-train and then fine-tune. And some people were starting to do pre-training and fine-tuning with RL even back in 2015, 2016.
Um, but the problem with this approach is it has some big issues. So if you just do this pre-train RL fine-tune, you end up with this kind of Homer Simpson problem. So as it tries to optimize for the RL rewards, it's going to overly exploit this faulty reward function, erasing what it learned about the data.
And this is the problem because if we're learning from something like human feedback data, the data is really limited. And so we can't trust that data entirely. We have to remember what we learned about how to speak English, right?
And uh if I'm coding up as a little 2016 intern some you know music theory reward function, it's not going to perfectly describe what it means to make beautiful music. So the reward function is flawed. And what I found when I just trained my model with pre-training and fine-tuning on my reward function is this is the ultimate musical composition that maximizes the rewards.
CCC. So obviously we can't fully trust a reward function. So what I decided to do was something like this.
We'd say let's pre-train on the data, but we're going to keep a copy of that model that we pre-trained on the data around. Keep it fixed. Keep it frozen.
And we're going to call this like our data prior. This tells us the probability of taking a particular action or playing a particular note given the state of the composition. And we're going to use that data prior to constrain our RL updates.
So we're going to say don't go too hard on optimizing the reward. Stay close to what you've learned from the data. And what that does, you can visually you can kind of think of it as like if this is the space of all possible language or all possible music that you've learned from the data, stay within that, but maximize the rewards.
Don't go off the manifold of the real data. Okay, here's what that looks like in math.
And there's always a slide where you have to go a little deeper into the math and lose a few people, but trust me, it's only a couple of slides. So instead of just maximizing reward, you also minimize the KL divergence, which is a distance measure between probability distributions, between the RL policy that you're trying to train and your pre-trained data prior. What that looks like is that, on top of your normal RL objective, you penalize the log probability under your RL policy and reward the log probability under the prior.
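Written out roughly (a sketch of the objective with β as a trade-off weight, not the exact formulation from any one paper):

$$
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_t r(s_t, a_t)\Big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s_t)\,\|\, p_{\text{prior}}(\cdot \mid s_t)\big),
\qquad
D_{\mathrm{KL}}(\pi \,\|\, p_{\text{prior}}) = \mathbb{E}_{a \sim \pi}\big[\log \pi(a \mid s_t) - \log p_{\text{prior}}(a \mid s_t)\big].
$$

Expanding the KL term is where the "penalize the policy's log probability, reward the prior's log probability" reading comes from.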
This is like doing entropy regularization plus staying close to the data. At the time we put this in some fancy math. We had a bunch of different algorithms for doing this.
But the key differentiator of our work from prior work is this insight that you should minimize divergence from this prior trained on data. And let's see how well that worked for music generation. So we're going to see a new composition now with this KL control technique.
Much more catchy, right? So we had some press come out at the time about this that called it like a toothpaste jingle. So, I don't know if that's like the ultimate scientific accomplishment, you know, to make toothpaste jingles automatically, but um we actually found that this technique had some staying power.
So, it turns out to be pretty useful. So, we were able to also apply it to the problem of doing drug discovery. In drug discovery, you can have a data set of known molecules that are drugs encoded as sequences of characters.
Okay? But if you just train a generative model on that data and you sample from it, often you'll get something that isn't a valid drug, plus the drug might not have properties that you want it to have. So you can measure things like SA, the synthetic accessibility of the drug.
So what RL lets you do is optimize for those metrics: I want to not only generate valid drugs, but they should also be synthetically accessible. So we used this same KL control fine-tuning approach, and we found that it actually became a pretty state-of-the-art drug discovery method at the time.
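As a sketch of the kind of reward RL gets to optimize here (using RDKit only to check validity; the property terms from the real work are left as a placeholder):

```python
# Reward valid molecules; in the real system you would also add terms for
# desirable properties such as synthetic accessibility (omitted here).
from rdkit import Chem

def molecule_reward(smiles):
    mol = Chem.MolFromSmiles(smiles)  # returns None if the string is not a valid molecule
    if mol is None:
        return -1.0
    reward = 1.0
    # reward += property_score(mol)   # hypothetical property term, e.g., SA
    return reward

print(molecule_reward("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: valid, reward 1.0
print(molecule_reward("not a molecule"))          # invalid, reward -1.0
```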
So then we said, let's actually use this for training language models. This paper was, I think, the first to put a language model online, have it talk to people, and use reinforcement learning to learn from their feedback. But we did it a little bit differently.
So we said collecting like actual feedback labels that the human likes or doesn't like the response is kind of painful and expensive. Like that requires the person to really deviate from what they're doing with their chatbot. So what we said is could we detect implicit cues in the text itself to try to learn whether the human liked the response or not.
So in this dialogue you're seeing a pretty bad quality model from 2019. The person says "hey, what's up" and the model says "hi, sorry to hear", which is a little confusing. So when the user says "that didn't make sense", what did work in 2019 is that you can detect the sentiment pretty accurately in the user's response.
So we can see that this is not a positive sentiment. This is the detected sentiment of the user. And so we use that as a negative reward, to penalize the model for saying something that led the user to be dissatisfied.
And then similarly, when the conversation is going well and the user is responding positively, we can incentivize that. And so we trained on this and compared these different reward functions. Interestingly, at the time, we found that optimizing for these implicit cues was more effective, in terms of human ratings of conversation quality and the humans' sentiment-based responses, than optimizing for manual votes, upvotes and downvotes.
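As a sketch of the implicit-reward idea (the model choice and scaling here are illustrative, not the setup from the 2019 paper):

```python
# Turn the sentiment of the user's *next* message into a reward for the bot's
# previous response: positive sentiment rewards it, negative sentiment penalizes it.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # loads a default sentiment classifier

def implicit_reward(user_reply):
    result = sentiment(user_reply)[0]       # e.g., {'label': 'NEGATIVE', 'score': 0.98}
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

print(implicit_reward("that didn't make sense"))     # negative reward
print(implicit_reward("thanks, that helps a lot!"))  # positive reward
```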
So this implicit feedback actually worked better than explicit feedback. This, however, has fallen out of favor. The way OpenAI trains ChatGPT is all explicit feedback these days.
But what did continue to last is this idea of doing KL control, maintaining this proximity to your pre-trained model when you train with RL. That was actually followed up by OpenAI in their series of work on fine-tuning language models with human feedback, and it's still in InstructGPT and, you know, direct preference optimization.
I work at Google one day a week, and they still use it, and it's actually still in the DeepSeek paper. So it turns out that if you are doing RL fine-tuning of LLMs, you do need to do this. You need to stay close to the data and not just do the RL part.
Was that a question? Okay, nice. All right, coming to how RLHF actually works in the most modern models, because the title of the talk was how ChatGPT does this. RLHF is now the predominant approach to improving the safety and alignment of LLMs, which might make you feel a little alarmed, because this approach is pretty basic and actually has some problems with it. But as they show in this original paper, it does work to make models that listen to the user's instructions and try to follow them, follow the constraints given by the user, hallucinate less, and use appropriate language.
So this is how we make sure LLMs are safe. When you're talking to ChatGPT, the reason it doesn't tell you how to build a bomb if you ask it how to build a bomb is through this, plus some hacks they do on top to make sure they don't get in trouble. Yes, this is the technical way they do it.
Um, so what is it? Basically what you do, as we said, we collect a bunch of human feedback data. The human feedback data is collected in a specific way.
We figure humans aren't actually that good at giving absolute ratings of how good the response is. So the human is just asked to compare two responses and say which one is better and which one is worse. Using that we train a reward model.
The reward model tries to predict what the human would say, in terms of which response was preferred, on a novel sample. Then you hook that up to some reinforcement learning training. So you're training your LLM by generating some outputs, having the reward model rate them, and doing this RL fine-tuning with the KL control technique.
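A stripped-down sketch of those two pieces, assuming PyTorch (real systems layer a lot of engineering on top of this):

```python
# (1) Pairwise reward-model loss on "chosen vs. rejected" responses, and
# (2) the per-token reward with a KL penalty that keeps the policy close to
#     the frozen reference (pre-trained) model.
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # Bradley-Terry style objective: push r(chosen) above r(rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    # Reward from the reward model, minus a penalty for drifting away from the
    # frozen pre-trained reference model (the KL control idea from earlier).
    return reward - beta * (logp_policy - logp_reference)
```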
And of course, this leads to some big money. So it's great. People are taking pictures, so I'll take a pause.
Okay. But as I mentioned, it's kind of scary that it's still a pretty simple recipe. The two key equations that break down all of what RLHF is doing are: this RL fine-tuning with reward maximization while minimizing divergence from the prior, or as it's become known, the reference policy; and this reward model, which was adapted for machine learning around 2017 but is actually a much older model called the Bradley-Terry-Luce, or BTL, model from behavioral economics. What it says is that the probability that the human will pick one answer over another depends exponentially on the underlying reward they assign to that answer.
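In symbols, roughly: if r(y) is the underlying reward a rater assigns to a response y, the BTL model says

$$
P(y_1 \succ y_2) \;=\; \frac{\exp\big(r(y_1)\big)}{\exp\big(r(y_1)\big) + \exp\big(r(y_2)\big)} \;=\; \sigma\big(r(y_1) - r(y_2)\big),
$$

and fitting r to the pairwise comparisons is exactly the reward-model training step sketched above.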
And what you try to learn with this model is that reward function. So that's what you're going to use to train this language model. The problem with this model is it assumes there's a single universal reward function that describes everyone's preferences.
for how they'd like to use ChatGPT. Do you think that's true? No.
Can anybody give an example of how they'd like ChatGPT to work that they think is a little different from the average? Are you brave? That's kind of a personal question, but yes. I have a standard test whenever I go to it.
I ask it to tell me how to do a brake job on a specific model car. Okay. And it's really fascinating because most of the time it has no clue.
Sometimes it gets close, but you know, it doesn't know the car you're talking about. Interesting. So, it doesn't know the car you're talking about.
That's interesting. Well, I love that. So, I'm going to go off on a bit of a tangent. There's a great tweet that I think explains how LLMs work really perfectly, and it's basically, "Why are LLMs so much better at playing chess than they are at playing tic-tac-toe?"
Does anybody know why? Care to hazard a guess? Remember, the more data you have, the better these things work.
So, what are we learning about how much data there is? Yes, tic-tac-toe is already solved. So, there's not a lot of data for how to play it optimally, because we already know how to play it optimally.
Yeah. People don't discuss tic-tac-toe strategy on the internet, but they do for chess. And there's whole books written about chess.
So, the more data you have, the better it works. But conversely, if you don't have data, it's not going to work. So, LLMs can't play tic-tac-toe, hilariously, even though it's an easier game.
Um, good point. Okay, coming back to why we might have different preferences about uh how to use LLMs, though. Let's just give an example.
I want very concise, fast answers. I don't want it to give me a bunch of, you know, AI slop gobbledygook when I'm trying to get a response to something. Somebody else might not prefer that.
They might prefer to have long, lengthy explanations so they can really get a sense of what's going on. So if we both gave preferences with this human feedback data training, how is the model going to handle that? Because those preferences actually conflict.
So let's find out. Um the other issue with this is people don't all have the same values. Like when we talk about an LLM being safe and aligned to human values, what does that mean?
Right? Different cultures might have different values that are important to respect. And so if we all have to adhere to the values dictated by a bunch of OpenAI researchers, that might be kind of weird, right?
So recently, and this is work coming out of UDub by some of our faculty and students here, uh people have been calling for more pluralistic alignment of AI that captures more diverse perspectives on what is an appropriate response. And this paper outlines several different ways you could do that. You could check out the paper if you're interested.
We're going to focus here on a technical method for achieving distributional alignment, which means: how can I even understand what everybody wants, that is, model the distribution of what people want? It turns out that if you don't do this, the BTL model cannot accommodate feedback from diverse groups of people. There's a recent paper that showed this with an example of a college admissions system.
So this is maybe very relevant to the folks in this room, obviously. Let's say it's a chatbot for college admissions help at a wealthy university, where most of the students are high socioeconomic status, so they're wealthy. Most of the students don't want to see information about financial aid, right? So if you ask them their preferences, they'll say: don't show me that. But the problem is that if you train on that preference data, the resulting BTL model that's used for training LLMs will learn to just never show the financial aid information. The problem is that, for the small group of students who need it, that information was really important.
They really needed that information and so it's just going to override the minority group's preferences and adhere to the majority group. So that seems pretty bad, right? Even if you don't care about supporting minority preferences with LLMs, there's actually a really important performance related issue why you need to do this.
This is an example where we trained this model in a little simulated robotic setting, where some group of people have a preference to go to the bottom corner and some people have a preference to go to the top. Let's say the people are exactly divided in half. So there is no majority.
The BTL model that tries to learn these preferences and fit a reward function learns this reward function: in other words, that the robot should go to the middle. But guess what?
Nobody wanted that. So this is like now bad for everyone. So even if you're just like a profit maximizing evil giant corporation, you should still care about this because you're now making your model not work for most of your users.
So what we did recently is this work on how you actually personalize reinforcement learning from human feedback to account for diverse preferences. This was presented as a spotlight at this year's NeurIPS, which is like the top AI conference. The key idea here is that while the vanilla model will just learn an incorrect, homogeneous reward that ignores minorities, we want to be able to adapt to individual users.
So, we're going to take a few preference ratings from each user and try to learn a latent representation of that user, which is like a vector embedding of the user. And that Z vector embedding is going to condition this reward model to enable it to capture preferences from diverse people. Um so I don't have time to get into too much of the technical details of how we did that.
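Very roughly, though, the conditioning looks something like this (a sketch with my own placeholder names, not the paper's exact architecture):

```python
# Infer a latent user embedding z from a few of that user's preference ratings,
# and condition the reward model on z so different users can get different rewards.
import torch
import torch.nn as nn

class PersonalizedRewardModel(nn.Module):
    def __init__(self, feature_dim=512, z_dim=32):
        super().__init__()
        # Encodes a user's few labeled comparisons into a latent vector z.
        self.user_encoder = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
        # Scores a response, conditioned on the user embedding.
        self.reward_head = nn.Sequential(
            nn.Linear(feature_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_feedback_features, response_features):
        # user_feedback_features: (num_ratings, feature_dim); response_features: (num_responses, feature_dim)
        z = self.user_encoder(user_feedback_features).mean(dim=0)  # pool the user's ratings into one z
        z = z.expand(response_features.size(0), -1)                # one copy of z per response to score
        return self.reward_head(torch.cat([response_features, z], dim=-1))
```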
But I will say it does work. So using this we can now learn separated reward functions for different groups of users, automatically from the data. And we show that in simulated robotics tasks, for example, where people might have different preferences about where to put away items in their home, we're able to adapt to each user much more effectively.
So our method is in purple versus these baseline techniques like the BTL based ones. Um we're able to generalize to unseen users, scale up to at least a hundred users. Um I'm going to gloss over this because I'm running out of time.
And then in language models, it took a few fancy tricks and kind of a fancy architecture, which you can read about in the paper. But we were able to much better capture people's preferences in datasets where they diverge. And we see that the model is actually able, in this learned embedding space, this learned vector space, to separate users that have different preferences and cater to them.
All right, so I'm just about out of time. I do want to save some time for questions. I'm going to gloss over the future work and just stop there.
Let me know if you have questions. Thank you so much for coming. Couple things.
I'm still a little confused by the KL control thing. Um, so you were saying something about rules that we set for or something like that. Yeah.
So the question is about what the KL control method is. The idea is you pre-train your model on the entire internet. That's your LLM.
That's your reference policy. You keep it frozen. It's not learning anymore.
It's fixed. And then when you're training with RL, you're training a new copy of it that is updating, but you're measuring how far that new copy is diverging from the original model of the data. And so it's basically how much it's diverging from the data distribution.
And you penalize that. So it can diverge a little from the original distribution, but it should basically stay close. You can think of it this way: your data prior models the probability of saying this text, given all of the human text in the world. You don't want to stray too far from that when you're learning to optimize rewards, because if you do, and I didn't show this in the talk, which I should have...
If you just train on an RL reward that, let's say, includes asking questions, and you don't have the penalty, it'll just start asking nonsensical questions, like "who you what now you what", right? So you need this to actually retain the ability to understand language, basically. Yeah. I'll let someone else ask a question.
We can chat later if you want to find me. Yeah. Okay.
Yes. I have a question about your education background. Sure.
If you could go back in time and redo it, would you come to UDub for computer science, or would you stay with your...? So, I actually went to my hometown university in Canada. You guys have never heard of it, but I'm from Saskatchewan. So, I went... Oh, hey. Oh, amazing.
Nice. Nice. Um I had a great time.
I was just kind of a big fish in a small pond, had a lot of good friends, had a fun time. Then I went and did a master's at UBC, and started realizing, oh, there's a lot more of the world going on. And then I applied to MIT and did my PhD at MIT.
It was a good journey for me, so I wouldn't go back and redo it. But if I were in undergrad now, I would definitely come to UDub, as evidenced by the fact that I came here as a professor. So, you know, I'm only in my second year as a professor here, and so I had some choices two years ago of where to go, and this is where I wanted to be.
So, yes. What made you choose Udub over the other schools? Oh, so many reasons.
So, I think it is one of the best departments to do this type of research. Unlike MIT, where I did my PhD, which was very slow to adapt to the deep learning era, they only got a deep learning class in 2018, UDub has been doing this for a while and invested in the compute infrastructure that you need to do it.
They have great NLP faculty. NLP, natural language processing, is large language models. They have some of the top people in the world.
Arguably it's the top NLP department. And they just believe in and respect deep learning as a discipline and a field. Whereas some of the older departments, and I almost want to pull up an amazing meme here but I don't have time, are a little bit more like: statistical learning theory, we should just be doing theory. Why are we training these things on GPUs?
Why do you need GPUs? So I feel like UDub embraced the modern AI era, and it's one of the best places to do this. Yeah.
Yes. Can undergraduate students have opportunities to do research like this with you, or is that too advanced for undergrad students? I actually have quite a few undergrads working with the lab and working on projects. It is something where you'd probably want to take the machine learning, reinforcement learning, and deep learning courses before you can really start contributing, but it's definitely something undergrads do.
Yeah. Yeah. Yes.
Are these slides available? I can share them. Yeah.
We'll have to talk about how to do that, but yeah, cool. All right. Hi everybody.
Thank you so much for coming to our lecture.