Welcome back to the MAD podcast. Today we have a very special conversation with Douwe Kiela, the CEO of Contextual AI, a startup that has raised $100 million to bring to the enterprise what's known as retrieval augmented generation, or RAG, a fundamentally important technique for making AI accurate and reliable. Douwe is an AI researcher by background and actually one of the inventors of RAG. After getting his PhD in computer science at Cambridge, Douwe became a research scientist at Facebook AI Research (FAIR), then head of research at Hugging Face, before starting Contextual AI in 2023, while also teaching as an adjunct professor at Stanford University. We started the conversation with Douwe's thoughts on the latest AI model innovations, including GPT-4.5, Sonnet 3.7 and of course DeepSeek: "I think some people are maybe a little bit underwhelmed; expectations are also getting maybe a little bit inflated with all of these releases." We then did a very educational deep dive into the fundamentals of RAG: "RAG is actually a very simple idea; it's answering a very basic question, which is how do I get the language model to work on top of data that it was not trained on." And then we talked about how Contextual is helping usher in the next era of RAG, agentic RAG: "we're a platform for RAG agents, and you can specialize these RAG agents for different use cases using the platform very easily." We closed with a few thoughts on the reality of deploying AI in the enterprise today: "we had people ask us when are we getting to 100% accuracy, and I had to give them the bad news that the answer is probably never." Douwe is perhaps the leading authority in the world on the topic of RAG, as
he bridges both AI research and real-world deployment. So please enjoy this wonderful conversation with him. Hey Douwe, welcome. Hi, thanks for having me. So I thought a fun way to start the conversation would be to riff on the crazy pace of releases over the last couple of weeks. We're recording this the day after GPT-4.5 was released, what was formerly known as Orion, and now the largest OpenAI model that has been released, even though they don't disclose the number of parameters. And obviously before that there was Claude Sonnet 3.7, and then Grok 3, and then, going all the way back to January of this year, Deep Research, Operator, all those things. So I'm curious, maybe with your AI researcher hat on, what do you make of all of this? What catches your imagination, what do you think is more or less interesting? Yeah, it's exciting times, right, there's so much happening it's hard to keep up. But I think the new model releases are interesting. I think some people are maybe
a little bit underwhelmed, in terms of, you know, expectations are also getting maybe a little bit inflated with all of these releases. For me the most exciting thing by far is DeepSeek, where that really, I think, changed the narrative in the AI ecosystem around what's possible, who actually is an incumbent, and what is the moat that some of these companies have. So yeah, I think it's really great for the world that we have kind of an existence proof now that it's actually not that hard to do this, and so you don't need to invest all that much in data, and you can use synthetic data and get a pretty good model out of that. So that's really exciting. And on your point about some level of disappointment or mixed feelings, in particular the early vote, for what it's worth, on GPT-4.5 does indeed seem to be a little mixed, with people saying that actually it may not be better than 4o. Sort of the same thing, by the way,
for 3.7 Sonnet, some people are saying, well, for certain use cases, especially the more precise ones, it may not be as good as 3.5. What do you make of that, in a context where, especially for 4.5, there's been that whole discussion around scaling laws and whether we were sort of hitting a wall? What do you think? Yeah, I think it depends a little bit on how you measure things, right. I think there's this unhealthy obsession with benchmarks in the field where it's like, oh,
this one didn't get much better at math or coding but maybe got better at all these other things that we are not actually measuring with public benchmarks right so um I'm sure that that on their internal leaderboards uh those new models are a lot better than the old models um and that's why they shipped them uh but it might not not be borne out as as obviously on the the benchmarks that everybody's looking at what about all the reasoning stuff uh and in particular this whole evolution of the sort of dominant paradigm towards test time
compute and all the things; maybe define, for any listener that may not be following this sort of thing every five minutes on Twitter, what test time compute means. Yeah, so we used to all be obsessed with training scaling laws, thinking about data scaling laws and parameter scaling laws: the bigger the model and the more data you have, the better the model would get. But there are obviously limitations to that, and with o1, and I think also some other work done by other folks, it became very clear that there also is a test time scaling law. I think this is something that in the machine learning literature has been known for a while, I don't think it had been labeled as a scaling law necessarily, but obviously you can also do things during test time, so when you're actually doing inference with the model, that will help you get to a better answer. And so knowing that that was possible, and that it would actually scale, that the more time you spend thinking about something the better your answer gets, that's really an exciting paradigm shift, I think. So there are these trade-offs now that you have to make around train compute, train data, test time compute, maybe test time data as well if you want to augment the context using retrieval. So there's a lot of exciting new opportunities coming with this test time compute paradigm.
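To make the test time scaling idea concrete, here is a minimal, purely illustrative sketch of one simple way to trade inference compute for answer quality: sample several chain-of-thought answers and keep the most common final answer (often called self-consistency). This is not OpenAI's or anyone's actual method; the `generate` function is a hypothetical stand-in for whatever model API you use, and it is reused by the later sketches in this transcript.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a call to your language model of choice."""
    raise NotImplementedError("plug in your model API here")

def best_of_n(question: str, n: int = 8) -> str:
    # Spend more test-time compute by sampling n independent reasoning traces...
    answers = [generate(f"Q: {question}\nThink step by step, then answer.") for _ in range(n)]
    # ...and keep the final answer the samples agree on most often (self-consistency).
    finals = [a.strip().splitlines()[-1] for a in answers]
    return Counter(finals).most_common(1)[0][0]
```

Larger `n` means more compute spent at inference time, which is the trade-off being described here.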
And OpenAI was saying yesterday during the launch that 4.5 was going to be the last model to just focus on unsupervised learning and not include reasoning. Do you think that's where the world is going, all models are going to be a combination of unsupervised learning and this sort of test time compute paradigm? It's not really a real dichotomy there, I think. You could argue that GPT-4 is also already a reasoning model, it just hasn't been trained on reasoning specifically, but it can do chain of thought, right? So if it can do chain of thought it's basically already a reasoning model,
it just hasn't really been optimized for that. So I don't think that's a huge shift in any way, it's just a sort of natural next step. All right, so going back to DeepSeek that you mentioned a minute ago, what's sort of fascinating is that a couple of weeks ago, when DeepSeek came out, it felt like a Sputnik moment, you know, Marc Andreessen said that, others said that, and the entire world freaked out for like two to three days. And fast forward to today, it almost feels
like some of it, at least in the press, was swept under the rug and we're just proceeding as originally planned. But it's still a big deal, so maybe explain why it's a big deal and where do you see the impact? Yeah, I think the reason why it's a big deal is very different from what you would see in the press, right. I don't know why journalists do this, but they really like a very clear adversarial story, it's like the US versus China, or something like that that is very easy to understand for people, and I guess that's what people enjoy reading or what people click on, and that's why that's become part of the dominant narrative. And the Sputnik moment, calling it that, also kind of feeds into that, goes in that direction of a cold war on AI or things like that. I think that's all completely overblown. But I think it is very interesting technology, and so, what I was saying earlier,
right, it's really an existence proof. It's like, when you have a bunch of GPUs and you know how to train a language model, and if you can train up a pretty good base model, so that's their DeepSeek-V3, then you can give it reasoning capabilities relatively cheaply, and by that I don't mean in terms of compute, I mean in terms of data. So the data that they got came from other language models, right, so it's synthetic data; they also had some human annotators. They're a
little bit unclear about where the data really came from uh there are some questions about that uh in the broader Community but it it it just shows that you can train an amazing model on relatively little data um and and have that thing be uh almost as good and sometimes even better than uh real Frontier models um so that's very exciting but the the other wrong narrative that I see everywhere it's like oh it only costs like 5 point something million dollars to train it's like that was a single training run right so so how
many training runs did they have to do to figure out what the optimal training run would be uh so I I I would guess that they spent at least 100x the amount of that that single training run right so uh it's a lot more expensive you have to pay for people data like all the operations than for the data center itself uh and then you need to figure out how to train the model optimally and then you do one single training run at the end of it so that that that single training run costing like
$6 million, that's pretty expensive for a single training run. So what's your take then on the sort of geopolitical war, or maybe lack thereof, between China and the US? Is that overblown as well, is that more of a media creation, what's your take on it? I think it's probably good for the world. I think the Chinese economy is very good at producing things cheaply and in a highly optimized way, and so if you're not in the foundation model race,
then it could be a very good thing so for for me uh it's great if language models get commoditized um because then we can use them for all kinds of interesting applications uh in in a much uh much better way so what we do is we contextualize the language models so that they can do their job right so I think that's actually where where all the really interesting problems are right now it's not even really about language models anymore that has almost been solved right that's kind of why you see things plateauing off a little
bit as well. What's really interesting is how you make those language models do valuable things for enterprises, how do you solve real problems, how do you deliver ROI. There's a lot of anticipation around that: we need to show that there is a return on investment for all of these huge investments that have gone into AI, and the way to do that is to really build systems that solve the problems, and the language model is really a very small part of that much bigger system. As a great transition into this, maybe a last question on the general frontier model, large model part of the discussion. We're going to spend a bunch of time talking about RAG and RAG 2.0, and obviously a key objective of RAG is to reduce hallucination and all those good things. At the model level, if you take the Anthropic and OpenAI models, are you seeing progress around hallucination control specifically, and if so, what are the drivers? So on the model specifically,
I think models still hallucinate, and I don't even necessarily think that it's a bad thing. So first of all, hallucination is very ill-defined, right. I think a lot of people are conflating it with being wrong, but I think hallucination is a very specific type of being wrong, where you make up information that is not grounded in what you consider to be the ground truth. So I think for a general purpose language model, if you deploy it in a marketing department or for creative writing, then hallucination is a feature, it's not a bug, right: you want it to generate beautiful prose, and maybe you don't care that much about the factuality of things. So the underlying problem is that we want these language models to be good at everything, so they have to be generalists, and they also have to be useful for creative writing. And I think where we're headed is that we will have more specialized language models. So we have our grounded language model that has specifically been
trained to be grounded, and that model hallucinates much less, it's really much more strongly coupled to the context. And so it's not great for creative writing, but it's very, very good at RAG problems. So let's get into RAG. First of all, it'd be awesome if you could define quickly what that is, for anyone that's still learning about the space, and then we'll get into all sorts of technical details, and then I'd love to go into the origin story of it. So you're the lead author on the RAG paper at FAIR that came out in 2020, I'd love to hear that story. So definition, and then the birth of RAG. Yeah, sure. So RAG is actually a very simple idea, it's answering a very basic question, which is how do I get a generative model, so a language model, to work on top of data that it was not trained on. So RAG is really the way that everybody has GenAI work on their data, and the advantage of that is that you
can always stay up to date you don't have to constantly retrain the model when new information uh arises and so uh it because it's so easy and it's such a sort of intuitive way to do this it has really become the dominant paradigm and so the the story of the history of rag uh and this was just really a great team collaboration with lots of fantastic people involved uh I originally became interested in this problem because I was uh interested in grounding and my PhD thesis had been about grounding in in different modalities so Vision
or audio, or even smell and taste, you can ground in all kinds of different perceptual modalities. So then I arrived at FAIR and I was thinking about multimodality and grounding, and then I was like, why can't we ground text in other text? So we then need some source of truth, and Wikipedia is the obvious answer, despite all its flaws. So we basically said, okay, let's say everything in Wikipedia is true, so now when we generate answers, they can be grounded in what we consider to be true without actually having to be trained on that. So if you update something in Wikipedia, that should be reflected in the answer without needing to retrain. We got very lucky with that project because FAISS had just happened, which was basically the first vector database: Facebook AI Similarity Search. It's a great open source library that is still getting used a lot these days. And so we had a vector database, basically the very
first vector database, and we had a generative model, and we put them together, and then things worked, so that was great. And you alluded to this, but how does it even work in a research organization like FAIR, how does one decide to focus on this topic or that topic and get approval? Does it actually work that way? No, there were no approvals, although now there are approvals; at the time this was the beautiful era of AI. I think I'm incredibly lucky to have been a part of that. There were no rules. I arrived as a postdoc and I think I had like six interns in my first year, and I could work on whatever I wanted. I was looking at Wittgensteinian language games and all kinds of weird stuff, emergent communication, multi-agent systems, and a lot of the things that I was looking at at the time are starting to become cool now, but at the time a lot of people thought that I was kind of
crazy for even looking at these things and and yeah so I was I was very blessed that they let me look at that stuff um and so rag is really um really an example of that right it's like just thinking of of kind of Frontier ideas and then it turns out that they actually work and so when the rag paper came out I don't think a lot of people were super excited about it it was like oh this is kind of cute it happened much later like you can even see this in the citation profile
on Google Scholar, it's like, oh, the RAG paper is nice, and then it's like, holy, RAG works. So was there any particular thinking behind the name, or any of those things, because it's become such an industry term? I think Patrick Lewis, the first author, also often jokes on podcasts and things that we could have come up with a much better name than this, but I guess it was a pretty decent name, otherwise it wouldn't have stuck, right. And so there was a lot of other interesting work happening at the time, so Google had a great paper on REALM, that was more like a BERT-style masked language modeling approach, so it wasn't really generative, but they had a lot of the same ideas in there and that also worked really well. But the reason RAG became the way you name these things is because it's generative, right, so we were the first ones to have a generative model there. Yeah, and it's funny how these things happen, right, there tends to be something in the water at
any point in time when different groups of very smart people in different organizations think similarly. I think I heard you somewhere talk about how, when Transformers came out, from your perspective that was actually sort of underwhelming, can you go into that quickly? Yeah, I mean, it's just funny how history gets sort of rewritten over time, I guess by the PR departments of large enterprises. But when the Transformers paper came out, at FAIR we were working on very similar ideas, obviously, I guess a bit more in the convolutions space because of Yann LeCun's background, and I guess that's what a lot of people were also exploring at the time. But the idea of the Transformers paper is really just, can we cut the recurrence of the RNN? At the time we had RNNs, right, can we cut the recurrence, because then we can do parallel processing much more efficiently on a GPU. And that turned out to work pretty well. You had to do a couple of tricks then, because you lose your ability to understand positions, so you need positional embeddings, and they came up with some really cool ways to do that, and you ideally want the attention mechanism to have multiple heads, so that became multi-head attention. And that is really just what the Transformer architecture is. So I would say, and maybe I'm biased because one of my best friends is on the original attention paper, but that was the real breakthrough, just figuring out that you have this attention mechanism that actually
allows you to do a much better job at representation learning, and as a result at generating correct sequences autoregressively. Let's get into RAG and the architecture in some detail, maybe the 1.0 version first, and then we'll spend time talking about 2.0 and agentic RAG and what Contextual does. But maybe as an introduction slash deep dive, how does that work, this retriever, this generator, all those good things, what's the core architecture? Yeah, the core architecture is
very simple. You have a language model, so that's the G, generation, and then you want to give that context, and the way you do that is by augmenting it, the A, using retrieval, the R. So that's R-A-G, RAG. And how you do the retrieval has been changing constantly over time. In the initial paper we used a vector database, FAISS; the words "vector database" didn't exist at the time, but FAISS was the first vector database. And I think people over time have started figuring out that that has all kinds of limitations. So how a vector database works is you just have embeddings: you encode pieces of information, or chunks of documents, as a vector, and then you do basic dot product similarity search. But that has issues, where you're just looking for chunks that are similar to the question, but you don't necessarily want to find chunks that are similar to the question, you want to find chunks that are relevant to answering the question, right? So you need to do different things with the representations, with the embeddings. So a lot of modern RAG deployments are very different from the original ideas in the paper: you still have a vector database, but the way you encode things is very different, and then you usually also have a sparse component, so it's not just dense search, not just vector search, you're also doing even older, traditional keyword-based search, so BM25 or TF-IDF style algorithms. Or Elastic, or that kind of stuff? Yeah, so that's what Elastic is very good at, right.
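To make the hybrid search idea concrete, here is a minimal, illustrative sketch, not Contextual AI's implementation, of blending dense vector similarity with a sparse keyword score over the same chunks. The `embed` function is a toy hashing-trick stand-in for a real embedding model, and the keyword score is simple term overlap standing in for BM25/TF-IDF.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing-trick embedding; a real system would use a trained embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def keyword_score(query: str, chunk: str) -> float:
    # Crude term-overlap score standing in for BM25 / TF-IDF.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def hybrid_search(query: str, chunks: list[str], alpha: float = 0.5, k: int = 5) -> list[str]:
    q_vec = embed(query)
    dense = np.array([float(np.dot(q_vec, embed(c))) for c in chunks])   # dense similarity
    sparse = np.array([keyword_score(query, c) for c in chunks])          # sparse similarity
    scores = alpha * dense + (1 - alpha) * sparse                         # simple linear blend
    return [chunks[i] for i in np.argsort(-scores)[:k]]                   # wide net of candidates
```

The blend weight `alpha` here is exactly the kind of knob Douwe mentions a bit later: in most systems it's a hyperparameter you hand-tune, whereas in Contextual's system that balance is learned.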
And so then you have this hybrid retrieval system, and you can cast a relatively wide net with that, so it's very cheap to do, but it's not going to be great. So on top of that you need to have a reranker that actually filters out most of the stuff that you probably shouldn't have retrieved in the first place. So you get this kind of cascade of retrievals, where you narrow down the search and the model that does the decision making gets smarter and smarter the further in the cascade you go. So the final ranker step can be a pretty beefy model, where you can even, in our case, give it instructions, which is awesome, right. You can really tell it, I believe this source much more than this source, and I have a strong preference for recency, and if it's a PDF then I believe it much more than if it's our internal Slack or something. So that type of ranking really factors into the overall retrieval results.
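As an illustration of how source preferences like these could factor into ranking, here is a deliberately simple, hypothetical sketch that rescales candidate scores with heuristic boosts for a preferred source type and for recency. Contextual's actual reranker is a trained model that takes such instructions as natural language; this only shows the kind of signal it can act on.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Candidate:
    text: str
    source: str        # e.g. "pdf", "slack"
    updated: datetime
    score: float       # relevance score from the earlier retrieval stage

def rerank(candidates: list[Candidate], preferred_source: str = "pdf",
           source_boost: float = 1.5, recency_half_life_days: float = 180.0) -> list[Candidate]:
    now = datetime.now()
    for c in candidates:
        boost = source_boost if c.source == preferred_source else 1.0   # trust PDFs over Slack
        age_days = (now - c.updated).days
        recency = 0.5 ** (age_days / recency_half_life_days)            # prefer recent documents
        c.score = c.score * boost * (0.5 + 0.5 * recency)
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```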
And then the final step is that it goes to a language model, and hopefully that language model doesn't hallucinate, but you have no guarantees, right. So that's kind of how the original, naive RAG evolved into advanced RAG, where you have this pretty sophisticated pipeline, and it's really a system, right, these are all different models. To unpack some of this at a very practical level, what goes back to the model is in the form of a prompt, it pushes information into the context window? Yeah, exactly, so your retrieval results, the context, that's given to the language model as part of the prompt. It's like, this is the question, and then you go off and ask the question to your retrieval system, you get the results, you put them into the prompt, and then you ask the language model to do its magic.
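In the simplest case, "putting the results into the prompt" is just string assembly, along these lines; this is a generic sketch rather than any particular vendor's prompt template, and `hybrid_search` and `generate` are the illustrative stand-ins sketched earlier in this transcript.

```python
def answer_with_rag(question: str, chunks: list[str]) -> str:
    # Retrieve candidates, then stuff the surviving chunks into the context window.
    retrieved = hybrid_search(question, chunks, k=5)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(retrieved))
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)   # the language model does its magic on the augmented prompt
```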
And for the hybrid search, how does the system decide when to do a term search versus a vector search, is there presumably intelligence there in allocating certain types of queries? In most systems there is no intelligence there, so in a lot of advanced RAG that's sort of a hyperparameter that you just tune. In our case that is learned, so it is something that you can learn. But we take a very different approach: we kind of started from the observation that RAG is really not about models, it's about the system of models, and so the real question is how do you get all these models to work together in the right way.
So in our case, each of those components: the language model has been trained to be grounded, so it's state-of-the-art at grounded generation; then we have a reranker that has been specifically trained to work well together with that grounded language model; then we have our retrieval step before that; and all of these components are jointly optimized, they're trained on the same data distribution, and that means that they have been trained to work well together, and you can really see the
difference in benchmarks, in terms of performance, when you do that. And before 2.0, maybe one final question before we transition to that part: you mentioned that RAG has become one of the, or the, dominant architecture, which, certainly from my perspective as an industry observer, I mean investor, but industry observer, feels like it's the case. A few months or a year or so ago there was some level of debate, and if I had to summarize it, some people were saying, you know, RAG is great,
but there's actually a lot you can do with fine tuning and then perhaps more recently there was a school of thought that seem to be saying rag is actually going to go away because the context window of models is extended so much that you just can put the entire context uh into into the window uh where are we on that debate today yeah I think I think it's so fascinating it's very related to the US versus China observation actually it's like somehow so it's journalists and Venture capitalists I guess who who really like these dichotomies
right so you know some those two categories uh increasingly tend to overlap as as we create uh you know more and more content that the that the world really wants to see clearly no it's amazing content but uh yeah yeah so to answer your question that these are not dichotomies right so it's not rag or fine-tuning and it's not rag or long context it's all three of those things ideally so so if you have a rag system you can actually fine-tune it and that will probably be better you don't want to just fine tune the
language model, you want to fine-tune the entire system, so also the ranker, also the embeddings, ideally as much as possible. So that basically just says: if you know what you're specializing for and you have a more specialized problem, then using machine learning you can always do better on that problem. So the question is, do you want to invest the resources that are required for this, right, you will need compute to do that and you will need to spend time to actually make it work. But if you make that very easy for people, like we do, then you can have an out-of-the-box amazing RAG system, and then you can specialize that very easily to make it even better on the use case you care about, so then you get the best of both worlds. And one common misconception about fine-tuning is a lot of people think that you can inject new knowledge into a model using fine-tuning, and that is not true. You can make existing knowledge sort of come out more, or you can make it, you know,
adopt a specific style of communication or something like that but telling it like actually this thing I taught you or like this piece of information that you were pre-trained with is no longer true now whoever like Trump is the president instead of when you were trained it was Biden like those those types of things you really cannot get into a model with fine-tuning easily so you need rag anyway and you want to do fine tuning anyway so that that's not a dichotomy and then the third one is about long context where where my example is
always a basic question like, who is the headmaster in Harry Potter? It's Dumbledore, and you didn't have to read all seven books to figure that answer out, right, you could have just searched for "headmaster" maybe, or read chapter one, and then you would get the answer. So long context models are inherently incredibly wasteful, you're paying for all this compute, and that's maybe why some of the companies are trying to really sell long context window models: they will make more money from that
right, because you're spending more on compute for reading Harry Potter for every single simple question. So you want to have a long enough context window so that you can fit in the relevant pieces of information, so that the language model can do its job. If you have RAG you can do this over millions or trillions of documents and everything will still work, and then you just pick the pieces of information that are relevant and give them to the language model as context. So again, there is no dichotomy here, you probably want both, and even for different types of problems you want to have different solutions. If my problem is, summarize this document, then I will probably want to put it in a long context as much as I can. If my problem is, what is the voltage of pin 7 on chip X for this particular manufacturer and how does it compare to this other one, that's a RAG question, right, you should just retrieve that information, do the comparison, reason over it. So they
are just kind of different solutions for the same underlying problem, which is that you have more information than you can feasibly spend compute on. So, as you alluded to, the fundamental idea is to think in systems and not in models. The way I understand it, the general idea is, instead of a Frankenstein kind of assembly, to have all the components of a RAG architecture deeply integrated and learned together; can you describe it, I guess, in better terms than mine? Oh, that was a great
description actually I mean so that that's really the basic idea right so it's really like starting from the idea idea that it's a system and so by the way that that also includes extraction right so so there there are lots of very interesting problems in terms of document understanding where existing Solutions really fall short and and so if you want to have a Enterprise grade rag system you are only as good as the data that goes into that rag system so if you can't extract the the data in the right way so if you have
like a sort of table structure and it has nested information, if you can't get that out in the right way then your RAG system is going to fail completely, right. So that is partly garbage in, garbage out? Exactly, yeah, so it's like beautiful data, beautifully formatted PDFs, very easy to read for humans, and then AI can't do anything with that information because it can't get it out. Is that a human challenge or is that a technical challenge, or both? And by human I mean, you know, internally the enterprise deciding what
information goes into the vector database. Yeah, so it shouldn't be a human problem; sometimes it is now, for some companies, but it shouldn't be, right. You should just have an AI system that can just read a document, that's not that much to ask of an AI system. But understanding PDFs is a great example: how humans understand PDFs is by looking at them visually, but how a machine understands them is by looking at the actual encoding, the raw information that makes up that PDF. So what we're doing is we're actually looking at the PDF the same way that a human does: we're looking at the entire layout, we have a layout segmentation model that says, hey, this is a chart, this is a table, this is a piece of text, and if it's a table then it's extracted differently, using a different table extraction model, and if it's a graph then it's extracted differently, because that's a different modality for the data.
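A minimal sketch of that routing idea, purely illustrative: a layout segmentation step labels each region of the page, and each element type is sent to its own extractor. The region labels and extractor functions here are hypothetical placeholders, not Contextual's models.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # "text", "table", or "chart", as labeled by a layout segmentation model
    payload: str   # raw content of the region

def extract_text(r: Region) -> str:
    return r.payload                       # plain text passes through

def extract_table(r: Region) -> str:
    return f"[table]\n{r.payload}"         # a real system would rebuild rows/columns here

def extract_chart(r: Region) -> str:
    return f"[chart] {r.payload}"          # a real system would describe or parse the chart

EXTRACTORS = {"text": extract_text, "table": extract_table, "chart": extract_chart}

def extract_document(regions: list[Region]) -> list[str]:
    # Route each region to the extractor for its modality; the output chunks
    # are what ends up in the retrieval index for the RAG system.
    return [EXTRACTORS.get(r.kind, extract_text)(r) for r in regions]
```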
All of that information is then put together in your extraction output, and that is what ends up in your retrieval database for your RAG system. But just to double-click on this, who decides that the PDF goes into the vector database in the first place, is that a complicated sort of enterprise discussion? And I guess the parallel to that is, is there a world where at some point the RAG system can just go fetch the information wherever it is, versus having one database that's considered the RAG database? Yeah,
so that world is already here, I think. There are different ways to do that: one is to make sure that you have access to all of these different systems and then you synchronize that with your vector database, or the other is to just have your language model reason and use tools and actually call into other APIs, right. So I don't have to index all of Slack if I can call Slack's search API and get the information that I need. That has benefits because then I also don't have
to worry about sort of entitlements or role based access control or things like that so there are different strategies you can follow there um but but I think you're right like for Enterprises that often really is just a human decision it's like what data goes into these platforms and how do I control that data uh that's that's a really important problem uh so uh just giving the system access to all data in a company is often not not really the right way to do it but ideally um you you should be able to put any
data that you want into it and then expect it to work, and that actually is often not true, right. Real world data is very noisy; you can build a very awesome demo on a couple of PDFs and things will probably work, but then you have to scale it up to a million PDFs and then everything breaks down. And the reason for that is that a lot of these advanced RAG systems still actually don't have this ranker working well enough to resolve a lot of these data conflicts, and because the
parts are completely disjoint, right, the retriever and the ranker and the language model are completely unaware of each other. So our approach, where everything is much more tightly integrated, means that you can be much more robust when scaling up to much more, and much noisier, data. So that's the extraction piece and the ranker piece; the contextual language model, so CLMs, what is that and where does that fit into the Contextual RAG 2.0 architecture? Yeah, so for us, that's really, we need to
have a language model that is grounded much more strongly than a general purpose language model, because we have very low tolerance for hallucination, and that's because we are an enterprise company and because we're focused on RAG, and with RAG you want to give true answers, that's the whole idea of RAG, right. We found that standard language models are just not good enough for what we need, and that's why we've trained language models to be much more strongly grounded and to be much more contextual, to really look at what's in the context; and anything that is not in the context, it will just say, I don't know. Which is one of the superpowers of our system, the ability to say I don't know; that's really what you want if you're in a high-stakes setting.
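As a rough illustration of that behavior at the prompt level (Contextual trains its grounded language model to do this natively rather than relying on prompting), a naive version looks something like the following; `generate` is the same hypothetical model stub used in the earlier sketches.

```python
REFUSAL = "I don't know based on the provided context."

def grounded_answer(question: str, context: str) -> str:
    prompt = (
        "Using only the context below, answer the question. "
        f"If the context does not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = generate(prompt)
    # In a high-stakes setting you would also verify that the claims in the answer
    # are actually supported by the context, rather than trusting the model's word.
    return answer
```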
And are the CLMs based on open source, or are those models that you started from scratch? Yeah, so we initialize from open source components; the main one we use, we have some flexibility there, it depends on the customer, but the main one we use is Llama. So we initialize with Llama and then do a lot of training on top of that to make sure that Llama is actually grounded, because the original Llama that we start off with is not all that grounded. Another great contribution by Meta and Facebook to the AI world. Yeah, Llama and PyTorch and so much, right. Yeah, it's sort of amazing, as an aside, you know, the impact of Facebook originally and Meta now on
technology in general is something that not everybody realizes, right, like React and, yeah, they don't get enough credit honestly, the world would be a very different place without PyTorch and React and Llama and things like that. So yeah, I think they're doing amazing things. Let's talk about some of your fine-tuning and alignment techniques: GRIT, KTO, LENS, what are those? Yeah, so KTO is a different way to do DPO, so direct preference optimization. So with RLHF, when that came out, right, reinforcement learning from human feedback, everybody was like, oh, we have these kinds of preferences, so essentially we have maybe two different possible answers that the language model might give and now we want to train it to say, actually, this one is more preferred by humans than this one. So this was originally kind of the secret sauce behind ChatGPT, that's why ChatGPT suddenly really started working, because it captured people's preferences. But the problem with RLHF is that you need to have this very heavy reward
model, and that needs to be trained up on these pairwise preferences, or you need to have multiple of these generations, which is really problematic. So DPO then said, okay, actually we can do some smart math and then we don't need the reward model anymore, which was great, you can directly optimize on the preferences. But then when we were thinking about this, we were like, that's still not ideal, because in the real world you don't want to collect a thumbs up for every thumbs down, right. Like when you generate an example and
somebody tells you, actually, that wasn't a good example, then you want to be able to learn from that without being told what the right example was, or vice versa, right. When I get, oh, that was great, should I then go and generate a couple of bad versions of this in order to train on it? That just doesn't make sense. So with KTO we were like, can we directly optimize on the feedback without there being preference pairs, and that turned out to work really well. We've done some work since on anchored preference optimization as well, so APO, and that really is the best direct preference optimization technique that we've seen. So it has all kinds of benefits around not having to rely on preference data, you can just incorporate direct feedback data, and it's very sample efficient, so you can train on this when you only have like a hundred examples and really make a meaningful difference. And that's great because data annotation is very expensive and very cumbersome, right, so the more you can make the system learn just on its own, the better it is.
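To make the "no preference pairs" point concrete, here is a deliberately simplified sketch of the idea, loosely inspired by KTO rather than a faithful reproduction of the published loss (real KTO uses a separately estimated KL reference point and per-class weights). Each example carries only a thumbs-up or thumbs-down label, and the loss pushes the policy's implied reward above or below a baseline accordingly, with no paired better/worse completion needed.

```python
import torch

def unpaired_feedback_loss(policy_logps: torch.Tensor,
                           ref_logps: torch.Tensor,
                           thumbs_up: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """policy_logps / ref_logps: log p(completion | prompt) under the policy and a frozen
    reference model; thumbs_up: 1.0 for good examples, 0.0 for bad ones."""
    # Implied reward: how much more likely the policy makes this completion than the reference.
    reward = beta * (policy_logps - ref_logps)
    # Crude batch-mean baseline standing in for KTO's KL reference point.
    baseline = reward.mean().detach()
    # Good examples: push reward above the baseline; bad examples: push it below.
    loss_good = 1.0 - torch.sigmoid(reward - baseline)
    loss_bad = 1.0 - torch.sigmoid(baseline - reward)
    return torch.where(thumbs_up.bool(), loss_good, loss_bad).mean()
```

The point of the construction is exactly what Douwe describes: a single thumbs-up or thumbs-down is usable training signal on its own.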
Another very interesting part of what you all do at Contextual seems to be around agents, and that gets us into agentic RAG, and that was very eye-opening for me when I was prepping for this. When I first came across Contextual a few months ago, my mental model was that one can think of a retriever as a tool as part of an agentic workflow, RAG being a specialized sort of agent use case; is that the right way to think about it? I think so. What we have, our product, is a platform for building agents, but you need these agents to work on your data, and the way you do that is through RAG, right. So we're a platform for RAG agents, and you can specialize these RAG agents for different use cases using the platform very easily. You have a general RAG agent and then
you have specialized RAG agents, is that right? So why is that, and who does what? Yeah, so the starting point is really that you can create your own agent in minutes, it's very easy to do, you don't have to do much, and out of the box that will already be much better than what you could have built yourself, probably, with a kind of open source RAG framework. And then you can specialize that using machine learning, so you can use our tune API to really make all the components optimized
for the specific problem that you're you're trying to solve uh so that that's how you get to specialized rag agents where you can really hit the production bar and and actually uh deploy this in a setting where you have much more control over what is right and wrong right you don't you don't just have prompting you can actually like train that entire system to be good at what you needed to be good at and you train them uh in the way one would train a regular AI agent I mean you know bearing in mind that
like nobody really knows what an AI agent actually is today, or maybe you do, but I certainly don't. Where I'm going with this is, you know, the concepts of planning, of failure modes, retrying, sharing your work transparently, all those things, is that part of what you do? Yeah, that's part of it. I mean, we're also working on that like everybody else, so I wouldn't say that journey is finished in any way yet, but that is really where
everything is headed. So for us, a RAG agent is really about: you get a question, and then you formulate a plan about where am I going to get this information from. It's like, maybe I need some unstructured data, but actually I also need to query this database, so we also support structured data, which is very important, and then maybe I need to call this API. So you formulate a plan, or what we call our mixture of retrievers figures that out, so that's kind of an intelligent retrieval strategy, and then that goes to our ranker, and then that goes to our grounded language model. So where this is all headed is really this test time compute paradigm, where, like you said, retrieval is just one of many tools, and the model can decide for itself, actually, I retrieved this thing but it's not exactly what I'm looking for, let me try this. So that's all very much happening, and that's how agents are going to be valuable on your data, which is really where we need them to work.
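A toy sketch of that plan-then-retrieve loop, only to illustrate the shape of an agentic RAG step and not Contextual's mixture of retrievers: the agent picks a tool (document search, a database, an external API), inspects what it got, and retries with a different tool if it isn't satisfied. The tool bodies, `CORPUS`, and the `generate`, `hybrid_search`, and `grounded_answer` calls are the hypothetical stand-ins from the earlier sketches.

```python
CORPUS: list[str] = ["example chunk one", "example chunk two"]

def search_documents(query: str) -> str:
    return "\n".join(hybrid_search(query, CORPUS, k=3))   # unstructured data

def query_database(query: str) -> str:
    return "<rows from a SQL query>"                       # structured data (stub)

def call_api(query: str) -> str:
    return "<response from an external API>"               # tool call (stub)

TOOLS = {"documents": search_documents, "database": query_database, "api": call_api}

def rag_agent(question: str, max_steps: int = 3) -> str:
    evidence = []
    for _ in range(max_steps):
        # Ask the model to plan the next retrieval step given what it has so far.
        plan = generate(f"Question: {question}\nEvidence so far: {evidence}\n"
                        f"Which tool next ({list(TOOLS)})? Or say DONE.")
        if "DONE" in plan:
            break
        tool = next((t for t in TOOLS if t in plan), "documents")
        evidence.append(TOOLS[tool](question))             # act, then reassess
    return grounded_answer(question, "\n\n".join(evidence))
```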
And if you crack that, is there a world where Contextual becomes an agent platform full stop, not just for RAG, not just for information, but taking any action in the enterprise? I think most of the interesting problems are actually RAG problems, that's why it's such a dominant paradigm, right. An agent that doesn't work on your data is not very useful, so if you need it to work on your data, you probably need a RAG agent to begin with. One more topic
I thought was interesting while prepping for this was synthetic data; it sounds like your thinking has evolved on how helpful it is, especially in the context of RAG, if you could go into that. Yeah, I think synthetic data, so o1 and, I guess, DeepSeek showed that synthetic data is actually pretty valuable, and we have known about that for a while, because it is a great way to actually train up a RAG system end to end. So you can use synthetic data to make all of your components better, and you can also make sure that the components work well together, through some really clever tricks with synthetic data, so a lot of our joint optimization is just leveraging that observation. And that's also how we can be very good in terms of specialization, so ideally you want to be able to specialize without needing any data annotation, right, you don't need to have any examples of what right and wrong looks like, you just tell me what you want and then I will figure out what right and wrong looks like and train on that.
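One common, generic way to get that kind of training signal without human annotation is to have a model generate question-answer pairs from the customer's own documents and then train or evaluate the RAG components on them. A minimal sketch of that idea follows, again using the hypothetical `generate` stub from earlier; it is not a description of Contextual's actual tricks.

```python
def synthesize_qa_pairs(chunks: list[str], per_chunk: int = 2) -> list[dict]:
    pairs = []
    for chunk in chunks:
        for _ in range(per_chunk):
            # Ask a model to invent a question that this chunk answers.
            question = generate(f"Write one question answered by this passage:\n{chunk}")
            answer = generate(f"Passage:\n{chunk}\n\nQuestion: {question}\nAnswer briefly:")
            # The chunk itself is the gold retrieval target, so the same synthetic example
            # can supervise the retriever, the reranker, and the grounded generator.
            pairs.append({"question": question, "answer": answer, "gold_chunk": chunk})
    return pairs
```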
And so that is really unlocking lots of very cool opportunities, I think, for these specialized RAG agents. To close the conversation, I'd love to go into a little bit of your experience selling AI in the real world, in the enterprise. There's a lot of cool stuff on Twitter, a lot of cool demos, but it feels like you're exactly at the spot where the rubber meets the road in terms of actually deploying AI
for real world use cases in the enterprise. What have you seen in terms of where we are in adoption, opportunities and obstacles, and where do you think we are in the cycle? Yeah, I think we're still pretty early. There has been this sort of first wave where everybody wanted to do everything themselves, and so we have had lots of build-versus-buy hurdles to overcome there. I think people are starting to realize much more that you cannot build a sort of RAG platform like ours yourself and then keep maintaining that forever and keep up with all the latest innovations and trends that happen continuously; you see how quickly things change in AI, right. So I think the market is sort of waking up to that observation much more. And in terms of maturity, I think a lot of folks initially were thinking about it as sort of, we need to get something in production, so it was kind of aiming too
low. It's like, oh, we do some internal enterprise search and now we can ask who our 401(k) provider is; that's great, you know, but you could probably already do that before GenAI, and that's not where the ROI comes from. So that's why we're so focused on these really high value knowledge worker use cases, where you have knowledge professionals and, if you can make them even 10% better at their job, you can save companies millions or hundreds of millions of dollars by doing
that successfully. But doing that is a very different game, because real world data is very noisy, and you need to specialize if you really want to solve these problems at a level of accuracy where it's actually useful and delivering real value. What's the level of tolerance you've observed for hallucination or inaccuracy, for these non-deterministic systems? Yeah, so that has been evolving. I think initially we had people ask us, when are we getting to 100% accuracy, and I had to give them the bad news that the answer is probably never.
but just like with with humans right like in the in the financial sector there's a reason we have regulators and there's a reason we have very stringent processes around um what people are allowed to do uh and how we kind of control that and how we double check that there aren't any mistakes that crash the market right so so um in a very similar way we need to look at the inaccuracies of AI systems and then work ways to mitigate the risks um and so that is is something I'm really seeing much more of now
especially in financial services and healthcare, where, you know, a small mistake can have huge consequences. You just want to think very carefully about the inaccuracies: accuracy is sort of table stakes, but the inaccuracies, that's really where you need to think about ways to handle them. So we do things like providing very fine-grained audit trails for regulators, making sure that we verify each of the claims that the system makes, and that we flag things if the system is not sure about its answer, things like that.
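As a rough illustration of that verification step (a generic sketch, not Contextual's audit trail implementation), you can ask a model to attribute each claim in the answer to a retrieved chunk and flag the ones it cannot support; `generate` is the same hypothetical stub as before.

```python
def verify_claims(answer: str, retrieved_chunks: list[str]) -> list[dict]:
    sources = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(retrieved_chunks))
    report = []
    # Treat each sentence of the answer as a claim to be checked against the sources.
    for claim in [s.strip() for s in answer.split(".") if s.strip()]:
        verdict = generate(
            f"Sources:\n{sources}\n\nClaim: {claim}\n"
            "Which source number supports this claim? Reply with the number or UNSUPPORTED."
        )
        report.append({
            "claim": claim,
            "source": verdict.strip(),
            "flagged": "UNSUPPORTED" in verdict,   # surfaces claims the system is not sure about
        })
    return report   # this per-claim trail is the kind of artifact a regulator could audit
```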
So what can we expect from Contextual AI in the next year? Yeah, more progress on this mission to change the way the world works, that's ultimately what we're trying to do, so literally how people do their jobs. The big things we're focused on are the intersection of structured and unstructured data, I think that's really exciting, there's so much cool stuff to do there; multimodality is obviously a big theme; and test time reasoning,
but then really focused on retrieval, because that's really the only way you get these agents to work on your data and your problems. Thank you so much, terrific, I really appreciate you sharing all of this with us, and excited for the next year. Thanks so much for having me. Hi, it's Matt Turck again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, I would be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode from; this really helps us build the podcast and get great guests. Thanks, and see you on the next episode.