Hey everybody, I'm Jerry, co-founder and CEO of LlamaIndex, and I'm excited to be here today to talk about the future of knowledge assistants. So let's get started. First, everybody's building stuff with LLMs these days. Some of the most common use cases we're seeing throughout the enterprise include the following: document processing, tagging, and extraction; knowledge search and question answering (if you've followed our Twitter for the past year or so, we've talked about RAG probably 75% of the time); and generalizing that question-answering interface into an overall conversational agent that can not only do a one-shot query and search but actually store your conversation history over time. And of course this year a lot of people are excited about building agentic workflows that can not only synthesize information but actually perform actions and interact with a lot of services to get you back the thing that you need.

So let's talk specifically about this idea of building a knowledge assistant, which we've been very interested in since the very beginning of the company. The goal is to build an interface that can take in any task as input and get back some sort of output. The input forms could be a simple question, a complex question, or a vague research task, and the output forms could be a short answer, a research report, or a structured output.

RAG was just the beginning. Last year I said that RAG was basically just a hack, and there's a lot you can do on top of it to make it more advanced and sophisticated.
If you build a knowledge assistant with a very basic RAG pipeline, you run into the following issues. First is a naive data processing pipeline: you put your documents through some basic parser, do some sentence splitting and chunking, do top-k retrieval, and then you realize, even though it only took you ten minutes to set up, that it's not suitable for production. It also doesn't really have a way of understanding more complex, broader queries, so there's no query understanding or planning. There's no sophisticated way of interacting with other services. And it's stateless, so there's no memory.

In this setting we've said that RAG is kind of boring. If it's just the simple RAG pipeline, it's really just a glorified search system on top of retrieval methods that have been around for decades, and there are a lot of questions and tasks that naive RAG can't answer. To make that concrete, here's roughly what that baseline looks like.
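(A minimal sketch of that ten-minute naive pipeline, using llama-index defaults; the `./data` directory and the sample question are hypothetical.)

```python
# Naive RAG: basic parsing, default chunking, top-k retrieval, one LLM call.
# Assumes OPENAI_API_KEY is set and ./data contains your documents.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()   # basic parsing
index = VectorStoreIndex.from_documents(documents)        # sentence splitting + embeddings
query_engine = index.as_query_engine(similarity_top_k=5)  # top-k retrieval + synthesis

print(query_engine.query("What was revenue in Q1?"))      # single stateless prompt call
```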
One thread we've been pulling on a lot is figuring out how to go from that simple search and naive RAG setup to building a general context-augmented research assistant. We'll talk about these three steps, with some cool feature releases in the mix. The first step is advanced data and retrieval modules: even if you don't care about the fancy agentic stuff, you need good core data-quality modules to help you get to production. The second is advanced single-agent query flows: building an agentic RAG layer on top of existing data services as tools, to enhance the level of query understanding your QA interface provides. And the third, which is quite interesting, is this whole idea of a general multi-agent task solver, where you extend beyond the capabilities of a single agent towards multi-agent orchestration.

So let's talk about advanced data and retrieval as the first step. The first thing is that any LLM app these days is only as good as your data. Garbage in, garbage out: if you're an ML engineer you've heard that statement many times, so this shouldn't be net new, but it applies in the case of LLM app development as well. Good data quality is a necessary component of any production-grade LLM application, and you need a data processing layer to translate raw unstructured and semi-structured data into a form that's good for your LLM app. The main components of data processing are parsing, chunking, and indexing.

Let's start with parsing. Some of you might have seen these slides already, but the first thing everybody needs to build a proper RAG pipeline is a good PDF parser, or a PowerPoint parser, or some parser that can extract those complex documents into a well-structured representation instead of just shoving them through PyPDF. If you have a table in a financial report and you run it through PyPDF, it's going to destroy and collapse the information and blend the numbers and text together, and what ends up happening is you get hallucinations. One of the key things about parsing is that even good parsing by itself can improve performance: even without advanced indexing and retrieval, good parsing helps to reduce hallucinations.
A simple example: we took the weekend Caltrain schedule and parsed it through LlamaParse, one of our offerings, into a well-structured document format. Because LLMs can actually understand well-spatially-laid-out text, when you ask questions over it you're able to get back the correct train times for a given column (I know the text is a little faint, that's totally fine, I'll share these slides later on). Versus if you shove it into PyPDF, you get a whole bunch of hallucinations when you ask questions over this type of data.
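Using LlamaParse looks roughly like this (a minimal sketch; the file name is hypothetical, and you'll need a Llama Cloud API key):

```python
# pip install llama-parse; assumes LLAMA_CLOUD_API_KEY is set.
from llama_parse import LlamaParse

# Parse into markdown so tables keep their row/column structure
# instead of collapsing into a blob of text.
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("./caltrain_weekend_schedule.pdf")  # hypothetical file
print(documents[0].text[:500])
```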
So that's step one: you want good parsing. And you can combine this with advanced indexing modules to model heterogeneous data within a document; one common recipe is sketched below. One announcement we're making today: we opened up LlamaParse a few months ago, and it now has tens of thousands of users and tens of millions of pages processed; it's gotten very popular. In general, if you're an enterprise developer who has a bucket of PDFs and wants to shove them in without having to worry about some of these decisions, come sign up.
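(Continuing from the LlamaParse output above, here's one sketch of indexing heterogeneous content, text plus embedded tables, with llama-index's markdown element parser; the model choice is an assumption.)

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

# Split the parsed markdown into plain-text nodes plus table objects,
# summarizing each table so it can be retrieved alongside the text.
node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"))
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# Index text and table summaries together; queries that hit a table
# summary recurse into the underlying table.
index = VectorStoreIndex(nodes=base_nodes + objects)
query_engine = index.as_query_engine(similarity_top_k=5)
```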
This is basically what we're building on the LlamaCloud side.

The next step is advanced single-agent flows. So now we have good data and retrieval modules, but in the end we're still using a single LLM prompt call. How do we go a little beyond that into something more interesting and sophisticated? We did an entire course on this with Andrew Ng at deeplearning.ai, and we've also written extensively about it in the past few months. Basically, you can layer different agent components on top of a basic RAG system to build something that's a lot more sophisticated in query understanding, planning, and tool use. The way I like to break this down, because they all have trade-offs, is that on the left side you have simple components that come with lower cost and lower latency, and on the right you could build full-blown agent systems that can operate and even work together with other agents. Some of the core agent ingredients that we see as pretty fundamental to building QA systems these days include function calling and tool use, being able to do query planning (whether sequential or in the style of a DAG), and maintaining conversation memory over time, so it's a stateful service as opposed to a stateless one.

We've pioneered this idea of agentic RAG, where RAG isn't just a single LLM prompt call whose whole responsibility is to synthesize the information; you actually use LLMs extensively during the query understanding and processing phase, instead of only feeding the query directly to a vector database. In the end, everything is just an LLM interacting with a set of data services as tools. This is a pretty important framework to understand, because in any piece of LLM software you're going to have LLMs interacting with other services, whether that's a database or even other agents, as tools, and you're going to need to do some sort of query planning to figure out how to use those tools to solve the tasks you're given.
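(A minimal agentic-RAG sketch with llama-index: data services wrapped as tools, with an agent doing the query planning. The two per-document query engines, `march_engine` and `june_engine`, are hypothetical; assume they were built like the index above over two separate filings.)

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI

# Each data service (here, a hypothetical per-document query engine)
# becomes a tool the agent can call.
tools = [
    QueryEngineTool(
        query_engine=march_engine,  # hypothetical, built like the index above
        metadata=ToolMetadata(
            name="march_10q",
            description="Answers questions about the March 10-Q filing.",
        ),
    ),
    QueryEngineTool(
        query_engine=june_engine,   # hypothetical, built like the index above
        metadata=ToolMetadata(
            name="june_10q",
            description="Answers questions about the June 10-Q filing.",
        ),
    ),
]

# The agent plans tool calls and keeps conversation memory, rather than
# making a single retrieve-then-synthesize prompt call.
agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o"), verbose=True)
print(agent.chat("Compare revenue growth between the March and June quarters."))
```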
We've also talked about agent reasoning loops. Probably the most stable one we've seen so far is some sort of while loop over function calling, or ReAct, but we've also seen fancier agent papers arise that deal with DAG-based planning (planning out an entire DAG of decisions) or tree-based planning (planning out an entire set of possible outcomes and trying to optimize over them). The basic while-loop pattern is sketched below.
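(A minimal sketch of the while-loop-over-function-calling pattern, written directly against the OpenAI chat completions API; the `search_docs` stub is hypothetical and stands in for a real retrieval service.)

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()

def search_docs(query: str) -> str:
    """Hypothetical stub; swap in a real query engine or data service."""
    return f"(top-k results for: {query})"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the document index.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def agent_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        msg = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS,
        ).choices[0].message
        if not msg.tool_calls:       # no tool calls left: final answer
            return msg.content
        messages.append(msg)         # keep the assistant turn in history
        for call in msg.tool_calls:  # execute each requested tool call
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": search_docs(**args),
            })
    return "Stopped after max_steps without a final answer."
```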
The end result is that you're able to build personalized QA systems that are capable of handling more complex questions: comparison questions across multiple documents, maintaining user state over time so users can revisit the thing they were looking for, and looking up information from not only unstructured data but also structured data, by treating everything as a data service or tool.

But there are some remaining gaps here. First, and we've had some interesting discussions with other people in the community about this, a single agent generally cannot solve an infinite set of tasks. If anyone's tried to give a thousand tools to an agent, the agent is going to struggle and generally fail, at least with current model capabilities. So one principle is that specialist agents tend to do better: agents that are a bit more focused on a given task with a given input. The second gap is that agents are increasingly interfacing with services that may themselves be other agents, and so we might want to think about a multi-agent future. So let's talk about multi-agents and what that means for this idea of knowledge assistants: multi-agent task solvers.
First of all, why multi-agents? We've mentioned this a little already, but they offer a few benefits beyond a single-agent flow. First, they can specialize and operate over a focused set of tasks more reliably, so you can stitch together different agents that work together to solve a bigger task. Another set of benefits is on the systems side: by having multiple copies of even the same LLM agent, you can parallelize a bunch of tasks and do things a lot faster. And third, with a multi-agent framework, instead of having a single agent access a thousand tools, you could have each agent operate over, say, five to ten tools, and therefore use a weaker and faster model, so there are potential cost and latency savings.

There are of course some fantastic multi-agent frameworks that have come out in the past few months, and many of you might be using those or building your own. In general, some of the challenges in building this reliably in production include, first, choosing between letting the agents operate amongst themselves in some sort of unconstrained flow, versus injecting constraints between the agents, where you explicitly force an agent to operate in a certain way given a certain input. The second is thinking about how these agents actually operate in production: currently the bulk of agents are implemented as functions in a Jupyter notebook, and we might want to think about defining the proper service architecture for agents in production and what that looks like.
So today I'm excited to launch a preview of a new repo we've been working on called llama-agents. It's an alpha feature, but basically it represents agents as microservices. In addition to the fantastic work that a lot of these multi-agent frameworks have done, the core goal of llama-agents is to think about every agent as a separate service, and to figure out how these different services can operate together, communicate with each other through a central API communication interface, and work together to solve a given task in a way that is scalable, can handle multiple requests at once, and is easy to deploy to different types of infrastructure. Each agent can encapsulate a set of logic but still communicate with the others and be reused across different tasks. It's really about how you take these agents out of a notebook and into production. This is an idea we've had for a while now, and we see it as a key ingredient in helping you build a production-grade knowledge assistant, especially as the world gets more agentic over time.

The core architecture is that every agent is represented as a separate service. You can write the agents however you want, with llama-index or with another framework, and we have interfaces for building a custom agent; then you deploy it as a service. The agents interact with each other via a message queue, and orchestration happens between the agents via a general control plane. We took some inspiration from existing resource allocators, for instance Kubernetes and other open-source systems-level projects. The orchestration can be either explicit, where you explicitly define the flows between services, or implicit, where some sort of LLM orchestrator figures out what tasks to delegate to whom given the current state of things.
One thing I want to show you is how this relates to the idea of knowledge assistants, because we think multi-agents are going to be a core component of them. This is a demo we whipped up showing how to run llama-agents on a basic RAG pipeline. It's a pretty trivial pipeline: there's a query rewriting service, and then a default agent that basically just does RAG, search and retrieval. You can also add other components and services, like reflection, other tools, or even a general tool service. The core of the demo is that, given some input, the services communicate with each other through an API protocol, which allows you to, for instance, launch a bunch of different client requests at once, handle task requests from different directions, and have each agent operate as an encapsulated microservice. The query rewrite agent takes in a query and rewrites it into a new query, and then the second agent takes that query, does search and retrieval, and outputs a final response. If you've built a RAG pipeline before, the actual logic here should be relatively trivial, but the goal is to show how you can turn even something trivial into a set of services that you can deploy. A rough sketch of the setup is below.
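(A sketch based on the llama-agents alpha API at launch; since the repo is in alpha, these class names may change, so check the repo for the current interface. The `tools` list is assumed to hold the query engine tools from the earlier agentic-RAG sketch.)

```python
from llama_agents import (
    AgentService,
    AgentOrchestrator,
    ControlPlaneServer,
    SimpleMessageQueue,
    LocalLauncher,
)
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

# Two agents: a query rewriter, and a RAG agent over the document tools
# (`tools` is assumed from the earlier sketch).
rewrite_agent = ReActAgent.from_tools(
    [], llm=OpenAI(), context="Rewrite the user's query into a clearer search query.",
)
rag_agent = ReActAgent.from_tools(tools, llm=OpenAI())

# Shared message queue, plus a control plane whose LLM orchestrator
# decides which service to delegate each task to.
message_queue = SimpleMessageQueue()
control_plane = ControlPlaneServer(
    message_queue=message_queue,
    orchestrator=AgentOrchestrator(llm=OpenAI(model="gpt-4o")),
)

# Each agent becomes its own service; the description tells the
# orchestrator what the service is good for.
rewrite_service = AgentService(
    agent=rewrite_agent,
    message_queue=message_queue,
    description="Rewrites raw user queries into clearer search queries.",
    service_name="query_rewrite_agent",
)
rag_service = AgentService(
    agent=rag_agent,
    message_queue=message_queue,
    description="Answers questions by searching the document index.",
    service_name="rag_agent",
)

# LocalLauncher runs everything in one process for testing; in production
# each service can be deployed and scaled independently.
launcher = LocalLauncher([rewrite_service, rag_service], control_plane, message_queue)
print(launcher.launch_single("when do weekend trains arrive in palo alto?"))
```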
And there's another example, on a backup slide, that again highlights that you can have multiple agents all operating and working together to achieve a given task.

The QR code is linked here. First of all, this is in alpha mode, and we're really excited to share it with the community. We're very public about the roadmap, so check out the Discussions tab for what's in there and what's not. We're launching with about a dozen initial tutorials showing how to build a set of microservices that help you build that production-grade agentic knowledge-assistant workflow, and there's also a repo linked that should be public now. In general, we're excited to get feedback from the community about what a general communication protocol should look like, how we integrate with some of the other awesome work the community has done, and how we achieve this core mission of building something production grade: a multi-agent assistant.

And this is the last component, which I already mentioned: if you're interested in the data-quality side of things, say you don't care about agents at all and just care about data quality, we're opening up a waitlist for LlamaCloud more generally, so that you can deal with all those decisions I mentioned (the parsing, chunking, and indexing) and ensure that your bucket of PDFs with embedded charts, tables, and images is processed and parsed the right way.
If you're an enterprise developer with that use case, come talk to us. So that's basically it. Thanks for your time, and I hope you enjoyed it.