Why Fine-Tuning is Dead w/ Emmanuel Ameisen

Hamel Husain
Arguments for why fine-tuning has become less useful over time, as well as some opinions as to where...
Video Transcript:
Yeah, fine-tuning is dead. Long live fine-tuning. The idea of this talk came, I think, mostly from some fun tweets. I tend to believe that fine-tuning is less important than it once was, and Hamel challenged me to actually defend that take. And so that's what I'll try to do here. All right, so who am I? Emmanuel is my name. I've been doing ML for, gosh, almost 10 years now. I was doing data science originally, then I worked in ML education. So: I trained models for demand prediction as a data scientist, helped people train models in ML education, and wrote a practical guide on how to train ML models. Then I worked as a staff engineer at Stripe, where I trained more models. And now I work at Anthropic, where I've fine-tuned some models, and currently I'm helping out with efforts to actually understand how these models work. Very important. And just to plug you a little bit more, because I think you're being a little humble: there's a website, mlpowered.com, where you can see Emmanuel's book. It's a classic in machine learning, and I would say in applied machine learning.
So definitely check it out. I appreciate the plug. Yeah, check it out. I one day hope to update it with some LLM-specific tips; it's just general machine learning knowledge for now. You can get the first chapter for free on that website, no commitment if you hate it. This is the most important slide of the talk: this talk is my opinion, not Anthropic's. I say this mainly because Anthropic, among other things, offers fine-tuning, so if they thought fine-tuning were dead, that wouldn't really make sense. This is mostly hot takes and beliefs based on seeing the field evolve over my career, rather than anything that Anthropic believes. So: I've been training models for 10 years, and I don't recommend it. This is, I don't know, really the talk in two slides. I think it kind of sucks for a variety of reasons, and if you've talked to enough people who do it a lot, as I'm sure a lot of you do, they'll tell you all of the horror stories that come with it.
I want to talk about three things. We'll get to the horror stories, the difficulty, in the third part. But first, trends I've observed over the past, let's say, 10 years. Then some performance observations on the fine-tuning work that I've seen shared in various papers. And then we'll talk about the difficulty. So first, trends. I think that in machine learning, the best way to have a lot of impact is to be suspicious of anything that sounds cool. Anytime in my career when something sounded like the cool thing to do, it tended to be the case that doing it was actually vaporware, and really you should have been doing the boring stuff. So what that means is: in 2009, maybe people were saying, oh my god, machine learning is a big applied thing now, we want to train models. But really, if you look at delivering value, what you needed was data analysts and data scientists who write good SQL queries. That's what you should have spent your time on, even though it sounded less cool at the time. In many places, this is still true today. Fast forward to 2012, the deep learning revolution, or maybe 2012 is a bit early, 2012 to 2014, let's say. Everybody wanted to use deep learning. You had startups doing, I don't know, fraud prediction with some random forest, saying, ah, surely now we need to use deep learning. Actually, that was too early. At that time it was very, very hard to get deep learning models to work. You should have just used XGBoost. Do the boring thing. 2015: now deep learning is in full swing, there's a bunch of papers, the exponential is starting. Surely what you want to do is invent a new loss function or improve on the theory of an optimizer. But really, if you wanted to make your model better in practice, you just wanted to clean your dataset, notice the obvious errors, and fix them; that would get you about a 10 times larger improvement at about a tenth of the effort. And in 2023, we have a similar thing: training your own foundation models, in some cases, and also just fine-tuning. It's very appealing; it sounds very cool. As far as I can tell, it's actually rarely the first thing you should reach for, or the most useful thing to go for. So just based on priors, we should be suspicious of fine-tuning, because it's the coolest-sounding thing, and the coolest thing is probably going to be the worst use of my time.
I made a little chart that I think illustrates this somewhat. Oh no, hold on. Yeah, okay, same thing; we talked about this already. This beautifully drawn chart is my take on, if you're doing machine learning in a practical way, what was the best way to leverage your time? At the start, hopefully you can see my mouse, but at the start, people just trained models. There was no fine-tuning, because there were essentially no models to take and fine-tune, or at least it was exceedingly rare. So you trained your own random forest, or your own SVM, or your own whatever, even your own MLP, and that was that. Then, not really when ImageNet came out, but a few years after, with VGG and later ResNet, fine-tuning became a thing that a lot more people started to pay attention to. You could take a pre-trained model, in those cases image models, fine-tune it on your smaller dataset for cheaper, and get something better than if you had trained from scratch. As time went on, that became more useful. I think BERT was also a big moment, where the same thing started becoming useful for text: you could fine-tune BERT models, or other models. So I would say there's this general trend: fewer people training, more people fine-tuning. And then around GPT-3, maybe a little after, because it took time for people to really pick it up, there was this concept of: do you even need to do any backward pass on your data at all? Maybe you just take the model and it works, right? That concept is the original promise of LLMs: you actually don't need to train, they learn in context, you can just prompt them. And I like this chart because, well, I don't know if fine-tuning is dead. I don't know if it was just a blip, or if this chart will go back up in prevalence. But at least the trend is: it used to be that there was nothing you could do other than training, then at some point you could replace your training with fine-tuning, and now there's this whole other category of applications where you don't need to do any training at all. And so the question is: how do you extrapolate that trend? The realistic, not-fun, not-hot-take answer is that nobody knows. But my hot take is: line goes up. I think we're going to keep seeing that orange line increase in the future. Maybe I'll pause really briefly in case there are questions.
To state the semi-obvious: if we go back to your slides about what not to do, most of the things you said not to do were the thing to do a few years later. You said, don't train ML models, and then a few years later: yes, you should be using XGBoost. Then you said don't do deep learning, but at some point later, you should do that. Does that mean that if you say not to fine-tune now, it's going to be the hot thing, and it's going to be worthwhile in a few years? I mean, I don't know, right? Maybe. I think this is not true for all of them. Notably, I think inventing a new loss function is still not a thing you should do, even almost 10 years later. So it depends. It's certainly the case that as soon as something comes up, like deep learning, people want to do it. Sometimes it will actually be useful; sometimes it won't be.
And it's hard, when something first comes out, to know whether it'll be in the "invent a new loss function" category or the "use deep learning" category. Let me ask a couple of questions from the chat. Yes. We got one from some Simon Willison guy: I get the argument for not fine-tuning LLMs, but how about embedding models? Is there relatively easy value to be had from, for example, fine-tuning an embedding model on a few thousand blog articles to get better quality semantic search? I think that's interesting. To me, that's a similar question to fine-tuning a model. I think right now we're focused a lot on improving the LLMs rather than the embedding models; comparatively, there's not that much activity in the embedding-provider space. But if you buy that these models are going to keep getting better, I feel like you'll just have very general embeddings that work well. Where this gets tricky, and where you might always need fine-tuning, or a different paradigm entirely, is when your company has a product called the XS23 that nobody knows about outside your company, and you want to build search based only on embeddings. That might require some fine-tuning, or some combined RAG, which honestly is what I've seen work really well: you do a combination of some keyword search and some embedding search. What about the case where, with RAG and retrieval in the domain-specific setting, what people consider good ranking or retrieval can be very specific? It can be hard to capture that in an embedding, no matter how good the embedding is. Yeah, that's the part I wonder about the most. Not to add spoilers, but at the end of the talk I'll say that "fine-tuning is dead" is the hot-take version; the realistic thing I believe could happen is that fine-tuning ends up being 5% of usage versus 50%, or something. So it's totally possible that for your very specific search you'd still want it. But it's complicated, because, and this gets into what I talk about later, as these models get better, you can imagine them just being able to understand, in context, what your specific search is. You could have your LLM drive your search and do some sort of query expansion, just because it understands the context really well. For pure embeddings, I don't know yet. It's possible that you would always need to fine-tune some embeddings for some retrieval.
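To make the combined keyword-plus-embedding idea concrete, here is a minimal sketch of hybrid retrieval. It assumes the rank_bm25 and sentence-transformers packages; the documents, model name, and blend weight are illustrative, not a recommendation.

```python
# Hybrid retrieval: blend keyword (BM25) scores with embedding
# similarity. Assumes the rank_bm25 and sentence-transformers
# packages; documents, model name, and weights are illustrative.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "The XS23 supports up to 4 concurrent streams.",
    "Refunds are processed within 5 business days.",
    "XS23 firmware can be updated over USB.",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5, k: int = 2) -> list[str]:
    # Keyword scores catch exact product names ("XS23") that a
    # general-purpose embedding model may represent poorly.
    kw = np.asarray(bm25.get_scores(query.lower().split()))
    if kw.max() > 0:
        kw = kw / kw.max()  # normalize keyword scores to [0, 1]
    # Embedding scores catch paraphrases with no keyword overlap.
    qv = embedder.encode([query], normalize_embeddings=True)[0]
    emb = doc_vecs @ qv
    blended = alpha * kw + (1 - alpha) * emb
    return [docs[i] for i in np.argsort(-blended)[:k]]

print(hybrid_search("how do I upgrade my XS23?"))
```

The keyword half is what rescues a term like "XS23" that a general-purpose embedding model has never seen.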
Are there any benchmarks or data comparisons of how good your results are from doing a better job of prompting versus fine-tuning? Yeah, there are some; I have some later in the talk. Actually, I have RAG versus fine-tuning, which is kind of similar. I would imagine prompting to be a little worse, maybe, than RAG. Actually, that depends; maybe comparable to RAG. So I looked up some of the papers I could find, which I'll share in the next section. OK. Cool. Yeah, and I'll say that I'm also all ears.
If people here have papers comparing this performance, send them over; I was surprised not to find that much. I'll be completely honest: I didn't spend super long doing a literature review, but I spent 15 to 30 minutes looking for papers on this, in addition to the ones I was already aware of, and I didn't find much that I thought was super informative. So. This is an example, I think from the OpenAI forum, of fine-tuning GPT-3. It's one of the first examples that comes up when you look at fine-tuning versus RAG, at least that I could find, and I think it's relatively illustrative of what happens in some cases, not all of them, obviously. In this one, you have the base model, and the fine-tune, which I put at the start because I think it's the worst-case scenario: it doesn't seem to be doing any better. And then you have context injection, which is basically RAG, if I remember well, and then various different models. So it is the case that sometimes fine-tuning doesn't work.
So I have a question about this that maybe you can help me understand. I always see these comparisons of fine-tuning versus RAG, and I get really confused, because my fine-tuning always includes RAG; it includes examples of RAG. What do you mean, fine-tuning versus RAG? It's not a versus thing; you do both. Agreed. Can I just say, I'll answer this in two slides? Yeah. Okay. Agreed with you, though. Well, okay, maybe the one thing I'll answer now: in two slides I have a comparison including fine-tuning, RAG, and both. But the other thing I'll say is that one of the reasons this is a hot take of mine is that it's also a matter of prioritization. If you're at a given point and you have the choice between fine-tuning and RAG, it's very important to know which one's going to be harder and which one's going to give you the biggest lift. Of course, you can always do both. But still, if there are two options, you want to know which one is the most efficient anyway. Oh yeah, that makes sense. I definitely agree with that, too: you should do RAG first. Yeah. I would bet that at the end of this talk, we'll find we actually agree on most things. But you've got to have a few hot takes. Okay, so I think this is... yeah, exactly. This is basically what you were asking me about.
This was a paper, I link it here, comparing fine-tuning and RAG. It's on relatively old models; the paper after this uses more recent models, but as far as I can tell, the trend held. This is kind of hard to read, but basically: this is your baseline, where you haven't done anything; this is where you just do RAG; this is where you just do fine-tuning; and this is fine-tuning plus RAG. I would just ignore this other dimension, which is whether you do LoRA or full fine-tuning. And then this is, I don't recall, different prompting methodologies. Or, yeah, this is the fine-tuning dataset that you use: is it formatted as question and answer? Anyway, you find that across all of these models, the increase comes almost entirely from RAG. Notably, this is even more true for the larger models, where you get 58 with RAG, 61 with fine-tuning plus RAG, and way less with fine-tuning alone. It's less true for the small model, where you get quite a bit more with fine-tuning plus RAG. But if you look at the trend, especially for the base and large models, you've gotten almost all of your benefit from RAG. It's technically true that if you add fine-tuning, you get more benefit, but you go from 63.13 to 63.29, as opposed to the 10x of going from 6.7 to 63. So I think it's worth stopping here, because this is actually a very confusing point; people get stuck on this.
I'm of the mindset that if your model needs context from your data, you should always do RAG. You don't want to try to fine-tune all of that knowledge in from all your documents and expect that your model is going to memorize it; there's no world where that's a good idea. So if your application could use RAG, you should do RAG. I think people get confused when they see papers like this, as if fine-tuning were a different way to get to the same place. No, there's no option: you have to use RAG, I mean, in most applied situations, to make it work. I think this is why you doing this course is good, though, because I don't think this is common knowledge. Some of the other practitioners I've talked to know, or have the intuition, which I think is correct, that fine-tuning isn't really the right solution if you want to add knowledge; that's just not what it's for, in most cases. And sure, you can say: for this use case it makes a little sense, for that one maybe a little more. But I actually think that's not well known. One of the reasons I'm on this hobby horse is to say: no, in most cases, when your problem is that the model doesn't know about whatever your business model is, the solution is usually not to fine-tune it. It's to just tell it what your business model is. So, yeah. Makes sense, yeah. I think I have... Yeah, this is similar. I found this paper that was doing RAG plus fine-tuning. Not to belabor the point, but one thing I was curious about is that there's probably a model-size effect going on.
Anecdotally, from some papers and various experiments I've been running, it seems like fine-tuning is more beneficial on smaller models than bigger ones, potentially. So I thought it was interesting that this paper was doing this with small models, smallish models. And I think this is another example of what we're talking about. I don't remember what the use case is for this... oh, it's knowledge. So, yeah: for knowledge, even for small models, you want RAG. And how do you interpret this table? The columns on the very right-hand side, the FT plus RAG, does that mean it's getting worse with fine-tuning plus RAG? That's how I interpret it. That's a little bit surprising, yeah. My interpretation is that this is within the noise. I would guess, just based on base-plus-fine-tune being pretty close, slash even worse here, that the fine-tune in this example just doesn't do much, and I wouldn't over-index on this being slightly lower. Basically, fine-tune plus RAG probably does just as well as RAG would. You know, it's interesting to look at this without reading the paper, because we don't know what task is being scored, or at least I can't tell. If you wanted to measure adherence to a writing style, say, then I suspect RAG doesn't do much for you; if it's just writing style, then fine-tuning might. Right. We could pick a task and then get the results to tell any story we want.
It's just a matter of what task we're optimizing for. Totally. Okay, so this is a great point, because as I was writing my slides and giving a diabolical laugh, like, ha ha, RAG is beating fine-tuning, I thought: well, okay, I should do a search for papers that show fine-tuning beating RAG. And I didn't find many examples. A lot of the examples I've seen are Twitter threads or something. So this is mostly a call: if you, the hosts of this workshop, or anybody attending, have good papers that show this, for what you're talking about, style examples or things fine-tuning is more suited for, please send them my way. I think it's hard to explain, because even when I try to explain knowledge, like, hey, fine-tuning is not good for adding knowledge, that word is not fine-grained enough. What do you mean, adding knowledge? There's a certain kind of knowledge for which it does make sense. And they ask, oh, what kind of knowledge? And I'm like, oh, okay. It becomes an intuition, and I haven't expressed the intuition as clearly as I'd want to. Well, my maybe scaling-pilled hot take is that this intuition, the boundary between those categories, changes with every model generation. Maybe a good example: it used to be the case that you could say that learning a style of speaking, some style, not knowledge but a way of saying things, requires fine-tuning. But the better, more recent models can learn a style from a two-line prompt, which the older models couldn't do. For style, that's now less true, but there are still other things where maybe fine-tuning makes more sense. So the concept of what counts as knowledge changes with every model generation, basically. Yeah, no, that makes sense. I have that same experience as well. I think there are a bunch of cases where we could look at it and not even be sure whether it counts as style or content.
Say someone fine-tuned on marketing copy from the makers of the XS32 widget, and everywhere the text says "the best widget is" and you fill in the blank, it's always the XS32. That's sort of knowledge: the knowledge that the XS32 is some great widget. But actually, it's just that whenever they express positive emotion, that's the widget they express it about. So it's sort of tone, and maybe knowledge is actually not a very clear abstraction. Yeah. I mean, notably, and this is way outside the bounds of this presentation, but it's not clear, even from the early work we have on interpreting these models, that the concept of knowledge, as we're discussing it here, is something separate from the concept of style within their actual weights. I would bet that for many cases it isn't. It's not like there's a thing in this attention head or whatever that is the knowledge versus something else. I think even the model doesn't have that clean separation. We've got our first audience question. You want to go ahead, Ian? Yeah, sure. Hi, thanks for this talk. So at my company, we have a very complex knowledge base that we've curated, a hundred thousand hours of time, I bet, for precision oncology, which is genomics and cancer. My intuition is that, using those curated rules, I'm going to be able to create a fine-tuned model that does a good job at creating a first draft of a curation, right?
So we curate what are called guidelines and clinical trial documents. Does that fit in your model? What would be your advice, based on that description? Oh boy. I think the part that's hard, going back to what I was talking about, is that as a non-expert in this domain, I have a worse intuition than you do on where this falls on the knowledge-versus-style spectrum. One comparison I would draw, and I talk about it in the next slide, is that there have been attempts to train LLMs specific to finance or agriculture or whatever. Those often, in the short term, beat the current open or available models, but then, and I have an example of this later, they often just get beaten by the next bigger, smarter model. So that's one thing I would consider: look at the trend. Take the Claude 3 models or the GPT models, take the smartest and the dumbest one, and see how they compare with no fine-tuning; that can give you a hunch of whether the next generation is probably going to be good enough on its own. The other thing I like to use to decide whether fine-tuning has a chance to help, and whether I need it at all, is to just keep adding a bunch of examples to my prompt, of basically the shape of what I would fine-tune on. Probably other people have said this at this conference, but seeing how model performance increases as you add examples can give you a hunch of how well it would increase with fine-tuning on a large dataset. And notably, if you see it plateau after a while, you don't need to fine-tune, in my opinion.
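A hedged sketch of that probe: sweep the number of in-context examples and watch where the eval score flattens. The `call_model` helper, `train_pairs`, and `eval_pairs` are placeholders for whatever client and data you use.

```python
# Probe: does performance keep improving as you add in-context
# examples, or does it plateau? A flat curve well before the
# context limit suggests fine-tuning on the same data won't add
# much. call_model, train_pairs, and eval_pairs are placeholders.
import random

def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM API client call."""
    raise NotImplementedError

def build_prompt(shots, question):
    examples = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
    return f"{examples}\n\nQ: {question}\nA:"

def accuracy_at_k(train_pairs, eval_pairs, k, seed=0):
    rng = random.Random(seed)
    shots = rng.sample(train_pairs, k)
    correct = 0
    for question, gold in eval_pairs:
        answer = call_model(build_prompt(shots, question))
        correct += gold.strip().lower() in answer.strip().lower()
    return correct / len(eval_pairs)

# Sweep k; if accuracy is flat from, say, k=32 onward, the model
# has likely extracted what it can from this shape of data.
# for k in [0, 4, 16, 32, 64, 128]:
#     print(k, accuracy_at_k(train_pairs, eval_pairs, k))
```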
We have a related question from Simon Willison, who asks: does fine-tuning ever work for adding knowledge? I see this question come up all the time, and the thing that's confusing about answering it is that there's some knowledge that is kind of intrinsic to a world model. If I think about, and this could be wrong, a physics constant, like the gravity of the Earth: my intuition tells me a language model is okay with memorizing that. I've seen that so many times; I don't want to retrieve it with RAG. But something more specific about Hamel, about my life, changing facts? Okay, that makes sense: something the model clearly wasn't trained on is clearly RAG. But there's a fuzzy middle ground where I have strong intuitions that I don't know how to convey. I understand intuitively that there's some knowledge I want the language model to internally grok, fundamentals about the world, but other knowledge I would never expect it to internalize. So how do you explain this aspect? Yeah, I think it's complicated. The first thing I'll say is that for most things where I want to add knowledge, let's say I want to add the knowledge to some model that, I don't know, I like strawberries or whatever.
So when it sees my name, it knows: oh, by the way, Emmanuel likes strawberries. First of all, that's almost always something you could just do with a prompt, right? Or some RAG: when there's a question about me, you retrieve a description saying Emmanuel likes strawberries. If you have a good instruction-following model, it'll just use it. So then the question is: if I were to change the weights of the model for it to do this, how do I do that? And this is often actually not that trivial, because if you just fine-tune on a bunch of prompts like "What does Emmanuel like? He likes strawberries," then, with a dumb model, it'll only tell you that I like strawberries if you ask it what I like. But oftentimes, what you want from your fine-tuning is for the model to know this information and use it when it's relevant in other contexts. So maybe you think: what I want to fine-tune is, say we're a business that does shopping recommendations, so when Emmanuel logs in and asks random questions, the model goes, oh, by the way, did you buy your strawberries this week? And you fine-tune on these specific prompts for this specific context. And then my qualm with calling this knowledge is: are you adding knowledge to the model, or are you basically fitting it so that, in the narrow distribution of the few prompts you've given, it leans more towards mentioning strawberries? I think this gets at how these models fundamentally learn, and as far as I know, we don't have a satisfying answer. But I'm pretty convinced that a lot of fine-tuning ends up in that surface-level realm: in this specific context, for this specific question shaped like the fine-tuning dataset, I will tell you this thing, versus having actually learned, whatever that means for a model, that Emmanuel likes strawberries. Hopefully that made sense.
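For contrast, the prompt-side version of "Emmanuel likes strawberries" is a few lines. A sketch, with the profile store and the `call_model` client as illustrative placeholders:

```python
# "Adding knowledge" without touching weights: retrieve stored
# facts and inject them into the system prompt. The profile store
# and call_model client are illustrative placeholders.
PROFILES = {"emmanuel": ["Emmanuel likes strawberries."]}

def answer(user_id: str, question: str) -> str:
    facts = " ".join(PROFILES.get(user_id, [])) or "none"
    system = (
        "You are a shopping assistant. Known facts about this user: "
        f"{facts} Use them when relevant, even if not asked directly."
    )
    return call_model(system=system, prompt=question)  # placeholder
```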
Yeah, no, I don't think there's a wrong answer. I think the answer to a lot of this stuff, and the reason it's so fun to work on, is that we don't know. Hopefully we find out soon. I have a few examples of... Oh, yeah, go ahead. We have another question. Great. So, in terms of knowledge, one of the examples that I feel like has perhaps worked out, that I've heard about: take multilingual models that are not highly trained on non-English text but have some of it, so the model sort of understands a little bit of a language, but in general its quality on non-English languages is kind of crap. If you then fine-tune a lot more on the other language you want better quality on, that tends to work. And it seems like it's sort of giving it new knowledge, but it's not: it already knew some of that language, it was just kind of crap at it because it didn't have many examples. Is that maybe a case where fine-tuning makes sense? Yeah, I think that's interesting. The first thing this brings to mind is that, in some of the interpretability work we've shared, if you look at the way the model represents various things, at least in large, competent models, the concept of a table, or the Golden Gate Bridge, it represents the concept the same way across different languages. And this is increasingly true as the model gets smarter. So I could buy that fine-tuning a model to be better at a language is slightly easier than other types of fine-tuning, because it has already seen this: it has already learned to map different languages onto shared concepts. It needs only a really small change to map this new language it somehow hasn't fully learned onto the same internal parts of the model that do the work for it in English. So I could see that working for that specific thing. Oh, sorry, go ahead.
No, no. So, similarly, do you think a similar kind of thing happens when people make fine-tuned code models that are good at doing software programming, essentially? Or do you think that's a different thing happening? Yeah, with fine-tuned code models there are a bunch of other concerns; it depends what you want. If you want code completion, that's the style thing we were talking about: you want the model to not go, oh hi, I'm Claude, let me tell you about code; you just want it to auto-complete the current line you're writing. And there are speed concerns as well with code models. But the knowledge of your codebase, that's actually something RAG works really well for, and I'm not sure you need fine-tuning there; even just having good context on what you're currently working on goes far. To be clear, I don't think there's good data on this, this is my hot take, but I'm not convinced that fine-tuning on your whole codebase gives huge gains compared to putting as much of that codebase as you can in the context. The other thing I'll add, related to this: for one of Google's launches recently, and for the Claude 3 launches, an example was shared of a model not knowing a rare language; you put, I think, 100 pages of that language in the context, and all of a sudden it can do the language without any fine-tuning. And I mention it because the fact that this is comparable to fine-tuning is really interesting, and I'll talk a bit more later about why.
Okay, I'm going to jump ahead; feel free to interrupt me if I haven't answered your question properly or if you have other ones. Okay. This is basically what Hamel was saying: fine-tuning is not the solution for domain knowledge. This was the other paper I mentioned; I think it's an agriculture paper. If you fine-tune GPT-4 and do RAG, you get 61% performance, and if you just do RAG, you get 60%. So again, strictly speaking, if you really care about that 1%, fine-tuning is useful, but you got most of the lift from RAG. I don't know how big the error bars are here. But this confirms what we were saying: this is a domain-knowledge kind of task, and so fine-tuning feels less useful. The other thing that I think is challenging for fine-tuning, especially at the cutting edge, is that you're aiming at a moving target. There are many labs, Anthropic among them, continuously working on making models better. This is the BloombergGPT example. I'm realizing, I apologize, that I think I used a bad figure, but essentially the BloombergGPT model claimed to do better than ChatGPT at the time, or GPT-3, on some financial analysis tasks. They pre-trained their own model, so it's not fine-tuning, it's pre-training here, which I guess doesn't really show in this table. But then, I think six months later, GPT-4 came out, and it was just way better than their custom model at basically everything. So I have a question about this; I'm really glad you brought it up. First of all, with the Bloomberg example, they made a big deal out of it: hey, we did this pre-training, it cost millions of dollars, whatever. My first reaction was: why did you pre-train a model rather than fine-tune one? But that's a different question.
The second question: okay, these frontier models, or whatever you want to call them, are getting better; you're saying it's a moving target. But if you have a fine-tuning pipeline, let's say for Claude, just because I don't want to call attention to OpenAI, so I don't get kicked out... If you have a good fine-tuning pipeline where you're fine-tuning even these big models, can't you just keep moving with the state of the art? This new model came out; let me fine-tune that, and keep fine-tuning each new one. Assuming those APIs are exposed; for the most powerful models, maybe they're not yet. But I'm curious. I mean, the question is: can you always take the latest model and fine-tune on it? I think the answer is yes. Although, if you take the BloombergGPT example, they pre-trained, right? But presumably the whole point of their exercise was that they have a large dataset, so the cost to fine-tune wouldn't be much cheaper, because they'd probably mostly train on their own dataset. And so it becomes: do you want to pay that much money anytime a new model appears? If you have a pipeline that takes an arbitrary model, does some RAG and some prompting, you can swap models easily; but if you have to re-fine-tune, that gets pretty heavy. So it's possible; it's just a matter of cost. And then the other thing, on that note: the value of fine-tuning larger and larger models seems, and I'm curious, maybe you have more experience on this, I tried to find papers but this is mostly anecdote, seems to get smaller and smaller. As these models get better, they just get better at everything. Yeah. Okay. So the most common tactic is to use the data generated by the most powerful model to train the faster model one step below it, and to keep walking that ladder up: a new model came out, let's use it, get better data, and so on. Of course, analyze the gap, but usually you get a much faster model with hopefully similar or better performance. You do have to look at the cost, though; you have to do the analysis of whether the exercise even makes sense.
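A sketch of one rung of that ladder: label prompts with the strongest model, then fine-tune the smaller one on its outputs. The `call_model` client is a placeholder, and the JSONL rows mirror the common chat fine-tuning layout; check your provider's docs for the exact schema.

```python
# One rung of the ladder: label prompts with the strongest model,
# then fine-tune the smaller model on its outputs. call_model is a
# placeholder; the JSONL rows mirror the common chat fine-tuning
# format, but check your provider's docs for the exact schema.
import json

def build_distillation_set(prompts, teacher="strongest-model"):
    rows = []
    for prompt in prompts:
        completion = call_model(model=teacher, prompt=prompt)  # placeholder
        rows.append({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]})
    return rows

with open("distill.jsonl", "w") as f:
    for row in build_distillation_set(["Summarize this support ticket: ..."]):
        f.write(json.dumps(row) + "\n")
```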
Yeah, exactly. I think that, at least as far as I can tell, a lot of teams and companies are underestimating the cost and overestimating the value of this stuff. But yeah, you certainly can do it. Again, I wish there were a few more examples; like we were saying, I haven't seen many papers where it's: we do this, we train the super small model, and it actually works better. I think I've seen it for some evals that seem like exactly what was fit to, in examples where the model then doesn't generalize. So it's nice for a paper, but whether you can actually use it for anything useful is not clear to me. But you can do it. I think it's a matter of cost at this point, and of value. There are certain use cases where it totally makes sense, and certain use cases where it's pretty far down the priority list, in my opinion. I have... Oh, no. Yeah, this is what we were talking about. Unless there are any questions, I think this is basically what we were talking about, and I'll just dive into the difficulty. Yeah, please. So this is something where I feel like, Hamel, you've made the same graph, or at least we've talked about exactly this.
When you're doing ML, the optimal use of your time, even if you're doing fine-tuning, to be clear, even if you're training models from scratch, is usually about 80% data work: collect it, label it, enrich it, clean it, get more of it, see how it's broken. Then 18% is general engineering: how do you serve your model, how do you monitor it, how do you make sure it works and there's no drift, all that sort of stuff. Maybe 2% is debugging: my model doesn't train, I need a GPU, something like that. And that leaves about 0% for cool architecture research. At least that's been my experience. The reason I mention this is that machine learning is hard even if you don't train the models. If you actually buy this, "I'm not going to fine-tune, I'm just going to use RAG and prompt some LLMs," you still need to do basically all of this. You can drop maybe the model code, since you don't have it, but you still need to set up input validation, set up filtering logic, set up output validation, monitor your inputs, monitor your latency, monitor your outputs, do backtesting of your prompts and RAG systems, do evaluation, have a train/test split so you can experiment, potentially A/B test. There's this whole world of things. And maybe the reasonable version of the hot take is not that fine-tuning is dead; it's that if you talk to me, I will only allow you to do fine-tuning once you've done all of this first, because all of these are higher on the hierarchy of needs than fine-tuning. Which, by the way, is something you laid out in that great article in O'Reilly recently, so I don't think it's a super controversial take. But the thing that grinds my gears with fine-tuning is that people often don't do any of this, and then fine-tune a model before they even have an eval, which I think is problematic.
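A minimal sketch of that non-training scaffolding, input validation, output validation, and logging for later backtesting, with `call_model` again a placeholder and every check illustrative:

```python
# The unglamorous majority of the work: validate inputs, validate
# outputs, and log everything so prompts can be backtested later.
# call_model is a placeholder; every check here is illustrative.
import json
import time

MAX_INPUT_CHARS = 8_000

def validate_input(text: str) -> str:
    if not text.strip():
        raise ValueError("empty input")
    return text[:MAX_INPUT_CHARS]  # crude filtering logic

def validate_output(text: str) -> str:
    if not text.strip():
        raise ValueError("empty completion")
    return text

def handle(request_text: str) -> str:
    start = time.monotonic()
    prompt = validate_input(request_text)
    completion = validate_output(call_model(prompt))  # placeholder
    # Structured log of inputs, outputs, and latency for monitoring.
    print(json.dumps({
        "in": prompt[:200],
        "out": completion[:200],
        "latency_s": round(time.monotonic() - start, 3),
    }))
    return completion
```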
Yeah, that drives me nuts too. Can you go back to the previous slide for one second? Yes. Okay. One question I have: these are not very precise numbers. No, no, but that's fine. So, the collecting-the-dataset part: okay, you don't need to do that if you're not fine-tuning, though you can do some of it for your evals. But what about looking at data? You still do that. For me, looking at data takes up to 80%; it's almost the same, in a way, the cleaning and the looking. Oh, totally. To be clear, maybe this is not the way I should have formatted this, but the point is mostly that when you do fine-tuning, you're not going to do very different things: you're still going to spend most of your time looking at data. It's true in both cases, although, as I mentioned, the failure mode is that people just don't want to do this, and instead go on a side quest to fine-tune. Not all fine-tuners. I see, okay. So you're still doing this 80% no matter what. Totally. And it's even more dangerous if you go straight into fine-tuning without it. Yeah, exactly: the failure mode is that you don't want to do this, so instead you do some fine-tuning on some random dataset you're not going to look at. I've seen that go poorly. Makes sense. Right. So basically, most of your time should be spent on all of this, and once you have all of it, then I think fine-tuning makes sense. The way I think about it: this is all the stuff that's necessary to actually have a working ML system. Even before you train your first model, you should have this infrastructure in place, and only once you've done all of it should you consider fine-tuning. The first thing I always recommend to friends or people I talk to who are building is eval sets: representative eval sets, large eval sets, eval sets that are easy to run.
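In that spirit, an eval set can start as small as this sketch. The cases and the substring checks are illustrative, and `call_model` is a placeholder:

```python
# Eval sets first: even a tiny, runnable set beats none. The cases
# and substring checks are illustrative; call_model is a placeholder.
EVAL_SET = [
    {"input": "Cancel my order #123", "must_contain": "cancel"},
    {"input": "What's your refund window?", "must_contain": "refund"},
]

def run_evals(prompt_template: str) -> float:
    passed = 0
    for case in EVAL_SET:
        output = call_model(prompt_template.format(input=case["input"]))
        passed += case["must_contain"].lower() in output.lower()
    score = passed / len(EVAL_SET)
    print(f"{passed}/{len(EVAL_SET)} cases passing ({score:.0%})")
    return score
```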
Then: spend real time on your prompts. I can't count the number of times I've asked somebody, have you thought hard about your prompt? They say, oh yeah, I've worked so hard on it. And then I look at it, and in two hours we get it from something like 30% to 98% accuracy. I am not a genius at this; it's just actually spending the time making your prompts clear. There are a bunch of really good prompting guides, so I won't talk about it much here; we can discuss in the questions if you're curious. Yeah. One observation here that's really interesting: the first and the last bullet are the same things you should spend your time on in classic ML. So not that much has changed, in a way. Totally; this is the same message we've been repeating for years. But there is this narrative: hey, there's this new profession, AI engineering, and you don't necessarily need to do ML or think about ML. I'm curious what you think about that. My take is that it was always the case that you should spend your time on that and not on the, let's say, math part of things. And now it's even clearer, because the math has been abstracted away from you, in many cases in the form of an API, if you use API providers for models. And so there's a strong temptation, one that I have and understand, for people who are just interested in interesting problems to say: no, I want to get back and do the fun ML stuff. If that's your reason for fine-tuning, it's bad. But then, to your point, once you've done all the actual work of machine learning and you realize, ah, the only way to get extra performance on this thing that's important, or to get lower price or latency, is to fine-tune, then it makes sense. Basically, it's the same as it always was, but the gravitational pull of the fun stuff is even stronger now: oh, what? I don't even get to see a Jupyter Notebook, I just have an API call? That's no fun. Yeah. The last thing I'll say is actually pretty important; I've saved it for last. It's just about looking at trends and extrapolating them.
You either decide that the trend line is going to continue, or that it's going to break, and again, the real answer is that nobody knows. But just look at the trends in model prices and context sizes. This is the price for a roughly equivalent model, not even equivalent, because models have gotten better, but this is the price of a Claude Haiku slash GPT-3.5-ish level model. And the Claude Haiku of today is certainly better than what that price bought in 2021, so it's actually even cheaper than this shows. The price has gone from, I think, 60 dollars per megatoken in 2021 to, if I remember, a blended price of something like half a dollar now. And context size has gone from 2k at the start, maybe 4k, to now 200k, a million, 1.5 million; I think I've heard 10 million. It's possible that both of these trend lines stop, but I think it's important to consider: what if they don't?
One thing that's not pictured here is latency, which has decreased in the same fashion; models are getting faster and faster. If in 2025 or 2026 you have a model with 100 million tokens of context, crazy low latency, and, if the trend keeps going, prices another 10 or 100 times cheaper, then you just don't fine-tune. You throw everything in context: these models are amazing at learning from context, and if they're fast, you get your response immediately. So there's a really interesting question here. Obviously you can't extrapolate any exponential, or even a straight line, forever; there's always a point at which it stops. Depending on when this line of price-per-intelligence, plus latency, which is very important in my opinion, stops, it tells you which use cases you should even consider for fine-tuning: the ones where you're well outside the context-window limit, or where chunking through that context, even at these ever-increasing speeds, would take too long for the application. And then there's prefix caching, which is starting to be done. Yeah, exactly. You know, if Anthropic were to offer that... okay, okay, this is an Emmanuel talk, not an Anthropic talk. But all jokes aside, I assume that things like prefix caching will be a common thing, certainly within a few years. And if that's the case, and you can imagine that your fine-tuning dataset is easy to formulate as a prefix most of the time, then that changes the equation even further.
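A sketch of that formulation, using Anthropic-style prompt caching as one concrete instance; `load_examples` is a placeholder, and the exact parameters may evolve, so check the current API docs:

```python
# Your would-be fine-tuning set as a cached prefix, using
# Anthropic-style prompt caching. Model name and load_examples are
# illustrative; check current API docs for exact parameters.
import anthropic

client = anthropic.Anthropic()
FEW_SHOT_PREFIX = "\n\n".join(
    f"Q: {q}\nA: {a}" for q, a in load_examples()  # placeholder loader
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": "Answer in our house style.\n\n" + FEW_SHOT_PREFIX,
        # The large static prefix is cached; subsequent calls pay
        # mostly for the new question, not the whole example set.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Q: How do I reset the XS23?"}],
)
print(response.content[0].text)
```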
Honestly, I think this is why I put this chart last, because it's what started the debate. The thing that makes me most bullish on "we won't really need to fine-tune models as much" is this chart; it's just the direction in which we're going. That, combined with the other beautiful chart I showed earlier of prompting growing, is a really interesting trend, and if it holds for even a year or two longer, I think it eliminates the need for fine-tuning for many, many applications. Yeah. There's one question from Lorianne: can I fine-tune to replace few-shot examples, especially when I need to save on tokens? I mean, I know the answer to that. But one correlated question: I talk with people like Harrison from LangChain a lot, and I ask, what are some interesting things people are doing? He tells me a lot of people are doing dynamic few-shot examples. Think of it like RAG: you have a database of few-shot examples, and you just pull the most relevant ones. He says that works really well. Do you see that? The people you talk to, are they doing that? Is that common? Yes, I've seen lots of examples of this. This is common because few-shot examples often become unwieldy: you want them to be evenly distributed, so if you have a complex use case where your model can do 10 things, you kind of want one example of each of the 10 things, and maybe one example of the model doing a thing one way and one of it doing the thing another way. It's doable, but you can quickly blow up your context window. So fetching relevant examples is something that works really, really well, and it's pretty common.
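A minimal sketch of that dynamic few-shot pattern, assuming the sentence-transformers package; the example bank and routing labels are illustrative:

```python
# Dynamic few-shot: embed a bank of examples once, then pull only
# the most relevant ones into each prompt. Assumes the
# sentence-transformers package; the bank and labels are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

EXAMPLE_BANK = [
    ("Refund request for order #55", "route: billing"),
    ("App crashes when I open settings", "route: bug_report"),
    ("How do I export my data?", "route: how_to"),
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
bank_vecs = embedder.encode(
    [q for q, _ in EXAMPLE_BANK], normalize_embeddings=True
)

def few_shot_prompt(user_input: str, k: int = 2) -> str:
    qv = embedder.encode([user_input], normalize_embeddings=True)[0]
    top = np.argsort(-(bank_vecs @ qv))[:k]  # most similar examples
    shots = "\n\n".join(
        f"Input: {EXAMPLE_BANK[i][0]}\nOutput: {EXAMPLE_BANK[i][1]}"
        for i in top
    )
    return f"{shots}\n\nInput: {user_input}\nOutput:"

print(few_shot_prompt("I want my money back for order #99"))
```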
I would say that, in my hierarchy of needs, there's all the stuff we talked about, and then: really work on your prompt. I tweeted something about this yesterday, because I'd just helped the tenth person with the same loop. It's: work on your prompt, find the examples that don't work, add them either as an example in your prompt or as one you can retrieve and add conditionally, and do this ten times. Only after that, consider anything else. That tends to work really well. Cool. Yeah, well, thanks for having me; hopefully this was interesting to everyone. This was very interesting, yeah. Cool. Awesome. All right. Thank you. See you, everyone.