The current AI chatbots are good at everything except being practical, and it's really frustrating that they can't be used consistently in work settings, because they'll just hallucinate the most out-of-pocket thing you can imagine. And since we don't have the patience to wait for the megacorps to train an even more powerful AI, we need ways to utilize the chatbots we have right now to get ahead of the curve, before AGI replaces us to write email slop. So there's this short-term workaround called RAG, which stands for retrieval-augmented generation.
In all of the latest research, and even in services like ChatGPT or Claude, RAG has drastically improved the performance and usability of these LLMs. Instead of recalling key information from a neural network that has compressed its training data, RAG retrieves accurate information from a separately stored collection of uncompressed documents, documents the LLM may not even have been trained on. This method provides results that are both cost-effective and accurate, without needing to train or fine-tune the LLM, which could cost tens of thousands of dollars. And remember the web browsing function that a lot of chatbots have? That function is also an extension of RAG, which makes RAG perfect when you need to reference a large amount of documents that cannot fit within the context window of an LLM.

So with these insane benefits, why would it be a short-term solution? Well, before we can answer that, we first need to understand how it generally works. Let's take naive RAG as an example. Oh yeah, keep in mind the field is still a bit new, so the process can be broken down a bit differently from place to place, but
I would decompose it into three main stages: indexing, retrieval, and generation. In the indexing stage, you index your documents so the AI can easily retrieve them later. The indexing process can vary, but the most common way is to divide the documents into meaningful chunks and store them in a vector form that can be searched easily. This stage usually doesn't repeat: once all your documents have been indexed, they sit in a vector database. That then brings us to the retrieval stage, where we need to retrieve the information for the LLM to use.
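To make that concrete, here's a minimal sketch of the indexing stage, assuming plain-text documents. The chunk size, the sentence-transformers checkpoint, and FAISS as the vector store are all illustrative choices, not the one true setup:

```python
# Minimal indexing-stage sketch: chunk documents, embed chunks, store vectors.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

documents = [
    "RAG retrieves facts from uncompressed documents stored outside the LLM.",
    "Indexing splits each document into chunks and embeds them as vectors.",
]

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Naive fixed-size character chunking with overlap; real pipelines often
    # split on sentence or semantic boundaries instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = [c for doc in documents for c in chunk(doc)]

# Encode chunks into vectors; normalizing makes inner product = cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# Store the vectors in a FAISS index; this step runs once, up front.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))
```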
To know what information the LLM needs, we first look at the user input to see what the query is about, so we can bring out the most relevant data from the vector database for the LLM to work with. While there is the classic word-frequency matching between the input query and the documents, it can't capture the semantic information between the words, so a BERT model, an encoder-only transformer, is used to encode both the documents and the input query and provide a measurement of their semantic similarity. By measuring their vector distances, we can find the most semantically relevant information from the documents and provide it to the LLM for further processing.

Which brings us to the last stage: generation. This stage relies on the LLM to use the retrieved content and the input query to formulate coherent and contextually relevant responses. It needs to strike a balance between following the information in the reference documents and transforming it into a response that actually answers the input query. So without any fine-tuning, the LLM is able to respond to questions about your documents with RAG.
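Continuing the indexing sketch above, retrieval and generation together are only a few lines. `ask_llm` is a hypothetical stand-in for whatever chat-model API you'd actually call:

```python
# Retrieval stage, then generation stage, reusing `model`, `index`, `chunks`.
def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query the same way as the chunks, then take the k nearest.
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0] if i != -1]

def answer(query: str) -> str:
    # Generation: hand the retrieved chunks plus the question to the LLM.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return ask_llm(prompt)  # hypothetical LLM call
```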
And now you'd realize that with that many components, RAG could have a single point of failure in several places. There are various moving parts, like how you index information, how you retrieve it, and how good the model is at blending and presenting the output, that all have the power to affect the quality of RAG. So it's kind of reasonable to call it a short-term solution, because it is certainly a hacky way to bypass an LLM's limitations by introducing more unstable variables. But being hacky about it kind of transforms RAG into a whole new field of research with a more applied mentality rather than a theoretical one, as a better architectural model would be more of a long-term solution. So naturally, this simple pipeline has evolved into something even more complex, which brings us to the million-dollar question: what is the current meta for RAG?

Okay, there are way too many variants right now, and sometimes it boils down to what works best with your own data, but here's roughly how the meta looks. Starting from the indexing stage: other than chunking the document semantically, using LLMs to better organize the information to be retrieved,
a trainable embedding model, which is what converts text to vectors, can be used to better connect the input query and the relevant documents when storing and comparing them within a vector database. So for things like indentation, which carries much more significant meaning in code than in typical writing, an embedding model that is fine-tuned on code would be much more mindful of this detail when converting the text into vector encodings, and later on, when the AI is retrieving, the influence of indentation is much better respected.
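Since the embedder is a drop-in component, swapping in a domain-tuned checkpoint is a one-line change. A rough sketch; the code-tuned model name below is hypothetical, so substitute a real checkpoint from Hugging Face:

```python
# Swapping the general-purpose embedder for a code-tuned one.
from sentence_transformers import SentenceTransformer, util

code_model = SentenceTransformer("your-org/code-embedder")  # hypothetical name

snippet_a = "def f():\n    return 1"
snippet_b = "def f():\nreturn 1"  # indentation broken, so the meaning changed

# A code-aware embedder should score these as less similar than a
# general-purpose one would, because indentation matters in code.
sim = util.cos_sim(code_model.encode(snippet_a), code_model.encode(snippet_b))
print(float(sim))
```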
Another really new and promising method is something called GraphRAG. This technique uses a knowledge graph: it utilizes LLMs to extract entities, relationships, and key claims from your documents, then hierarchical clustering with the Leiden technique is used to organize the graph, which can be easily visualized for better traceability and is much more explainable and auditable than looking at a vector database. Which brings us to the retrieval stage, where the model can now not only retrieve the most relevant information for the input query but also obtain the context of the retrieved content, thanks to the knowledge graph and the structured data previously generated. This makes mistakes much more traceable and preventable, as answers that might be contextually irrelevant to the input query can be ignored.
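Here's a toy sketch of the GraphRAG idea. The triples would really come from LLM extraction over your documents, and Louvain community detection (which ships with networkx) stands in here for the Leiden clustering that GraphRAG actually uses:

```python
# Toy GraphRAG-style sketch: triples -> graph -> entity communities.
import networkx as nx

triples = [  # (entity, relationship, entity), normally extracted by an LLM
    ("RAG", "retrieves from", "vector database"),
    ("GraphRAG", "extends", "RAG"),
    ("GraphRAG", "clusters with", "Leiden"),
]

graph = nx.Graph()
for head, relation, tail in triples:
    graph.add_edge(head, tail, relation=relation)

# Group related entities into communities so retrieval can pull in a whole
# neighborhood of context, not just one isolated chunk.
communities = nx.community.louvain_communities(graph, seed=42)
print(communities)
```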
But for the model to retrieve information accurately, the previously mentioned embedding model needs to encode the input query too, right? However, not all of the input query is needed for search: things like greetings, newline notation, or end-of-sentence tokens should be completely left out. So an input-query-rewriting LLM can help here, condensing or even transforming the query into its key information, which is then encoded into vector form to be compared and searched within the vector database or the knowledge graph, retrieving more accurate information. Additionally, a hybrid search can be used in the meantime, like FAISS nearest-neighbor search plus word-frequency matching, to increase the chance of getting the desired retrieval. Optionally, web search can be done at this stage too, which is really useful to ensure any time-sensitive information or citations are correct, and this is also the part where you can literally insert any APIs.
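A sketch of query rewriting plus hybrid search, blending BM25 word-frequency scores with the FAISS cosine scores from the earlier sketch. `ask_llm` is again a hypothetical stand-in, and the `alpha` weight is illustrative, not tuned:

```python
# Hybrid search: rewrite the query, then mix sparse (BM25) and dense scores.
from rank_bm25 import BM25Okapi  # pip install rank-bm25
import numpy as np

bm25 = BM25Okapi([c.lower().split() for c in chunks])

def rewrite_query(raw: str) -> str:
    # Strip greetings, filler, and formatting down to the key information.
    return ask_llm(f"Rewrite this as a concise search query: {raw}")

def hybrid_search(raw_query: str, k: int = 3, alpha: float = 0.5) -> list[str]:
    query = rewrite_query(raw_query)
    # Dense scores from the vector index.
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), len(chunks))
    dense = np.zeros(len(chunks))
    dense[ids[0]] = scores[0]
    # Sparse scores from word-frequency matching, normalized to a similar range.
    sparse = bm25.get_scores(query.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)
    combined = alpha * dense + (1 - alpha) * sparse
    return [chunks[i] for i in np.argsort(-combined)[:k]]
```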
Then, in the final stage, generation, something called reranking is often used now. Instead of retrieving only once, you retrieve the top-k results in the retrieval stage and pass them into a reranking model to see which results are actually relevant enough to enter the input prompt, and the reranking model can also be fine-tuned to be domain-specific. Another function, called autocut, can also be used to remove unrelated retrieved results based on gaps in their similarity distances. And sometimes the content relevance score measured by the reranking model has a threshold in place, so if the retrieved information is not relevant enough, it forces the model to say it doesn't know anything about the input query instead of hallucinating or providing bad results. So yeah, that's roughly the current meta of RAG.
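A sketch of reranking with autocut and a relevance threshold. The cross-encoder checkpoint is a common public reranker, while the threshold and gap values are illustrative, not tuned:

```python
# Rerank top-k candidates, then cut on absolute score and on score gaps.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str],
           threshold: float = 0.0, max_gap: float = 2.0) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    kept = []
    for i, (score, cand) in enumerate(ranked):
        if score < threshold:                             # relevance threshold
            break
        if i > 0 and ranked[i - 1][0] - score > max_gap:  # autocut on score gap
            break
        kept.append(cand)
    # An empty list means nothing was relevant enough: answer "I don't know"
    # instead of hallucinating.
    return kept
```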
But since I've only been talking about the conceptual ideas, here are some relevant resources you can use to build your own RAG. For a more general RAG framework, LlamaIndex is one of the more popular ones; it also has a library called LlamaParse, which is good for organizing your documents for retrieval. For embedding models, there are a lot of fine-tuned ones on Hugging Face that are free to download, so pick one at your own discretion. For RAG-optimized models, the Command R models from Cohere are some of the best, and they also offer some really easy-to-use rerank and embedding models, but of course they're not free. For GraphRAG, you can check out Microsoft's official GitHub and yoink their code from there, and I think LlamaIndex also has an implementation, so you can check that out. You can also check out RAGAS, which is a framework that helps you evaluate your RAG pipelines.
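For comparison with the from-scratch sketches above, a naive RAG pipeline in LlamaIndex is only a few lines. This mirrors LlamaIndex's own starter example, with "data/" as a placeholder folder of your documents:

```python
# Naive RAG in LlamaIndex (pip install llama-index).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)  # indexing stage
query_engine = index.as_query_engine()              # retrieval + generation
print(query_engine.query("What do these documents say about indexing?"))
```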
So on the topic of running RAG locally, I want to talk about an incredible application that can boost your workflow, and that is ThinkBuddy. It's not just another AI app; it's a full-fledged macOS AI lab for 10x devs like you guys. What sets ThinkBuddy apart is its LLM remix feature: imagine combining the strengths of GPT-4, Claude 3.5 Sonnet, and Gemini Pro in one response by collecting the best parts of each answer, and that's exactly what ThinkBuddy does. You get access to 10+ leading models without any extra subscriptions. But the deep macOS integration is the actual game changer: you can capture your screen and ask the AI questions, use customizable hotkeys to trigger prompt templates, and basically get AI help instantly, anytime. For example, say I have a hotkey set up to analyze code: just select the text, use the shortcut, and ThinkBuddy provides suggestions on how to improve it. The voice input, powered by OpenAI's Whisper, lets you dictate while you multitask; it supports 80+ languages with great accuracy, and they even fix Whisper's output using GPT-4's advanced reasoning. And for you developers and researchers, ThinkBuddy handles PDF, DOCX, XLSX, and many other file types: you can ask the LLMs questions sourced from your documents and see each model's response, plus a remixed answer, in under 10 seconds. And don't worry about privacy: all chat storage is local, plus they're working on integrating local LLMs soon, which makes processing even faster and more secure. Now here's the deal you don't want to miss: ThinkBuddy is offering an exclusive lifetime deal, but it's only available for August, and it's closing very soon. The regular lifetime deal sale is over, but you can use the code bycloud with the link down in the description to get 30% off, bringing the price down to just 130 bucks, which is basically around the same price as having ChatGPT and Claude for over 3 months. This code is limited to the first 50 sales, and if you're still on the fence, they offer a 30-day no-questions-asked refund policy. ThinkBuddy also has a free basic tier for you to try out, and their basic version is only 10 bucks a month, still cheaper than subscribing to multiple AI services separately.
I've also been trying out ThinkBuddy for a while now, and so far I've found it extremely fitting for my workflow, so if you want to experience this AI powerhouse, check them out using the link down in the description, and thank you ThinkBuddy.ai for sponsoring this video. If you like what I talked about today, you should definitely check out my newsletter, where I'll be breaking down the latest and hottest research papers coming out left and right on a weekly basis. So even if I'm late to the news or don't have the chance to cover it in a video, you'll 100% catch the juiciest stuff on there. But anyways, a big shout-out to andul lelas, chrisad do, Alex J, Alex Marice, migam, Dean Fel, robbers aasa, and many others who support me through Patreon or YouTube. Follow my Twitter if you haven't, and I'll see you all in the next one.