So, Anthropic just introduced prompt caching for Claude, which can reduce costs by up to 90% and latency by up to 85%, which is huge. Did they just kill RAG with this new feature? Now, Google was the first to introduce context caching with their Gemini models.
There are some similarities, but some major differences as well between these two approaches. We will discuss them later in the video. I'll show you how to get started and what type of performance difference you can expect.
Before looking at the code example, let's see what was released today. Prompt caching enables developers to cache frequently used context between API calls. Anthropic models have a huge context window of 200,000 tokens; however, if you're chatting with long documents, you have to send them with every prompt, which becomes very expensive.
Hence, prompt caching is going to be extremely helpful. Now, customers can provide Claude with more background information and example outputs (few-shot prompting), reducing costs by up to 90% and latency by up to 85%. These numbers are great, but they won't be consistent across use cases, and we are going to look at some of them.
This feature is available for both Claude 3.5 Sonnet and Claude 3 Haiku.
Support for Claude 3 Opus is coming soon. As I said in the beginning, context caching has been available for the Gemini models, and there are some interesting differences between the two, which I'm going to highlight throughout the video. So, what are some use cases for prompt caching?
Well, the first one is conversational agents. If you're having a long-form conversation and there is a substantial chat history, you can put that chat history in the cache and just ask questions from that. Another example use case is coding assistants.
Usually, codebases are pretty huge, so you can put them in the prompt cache and then use your subsequent messages for question-and-answer. Large document processing and detailed instruction sets also apply: if you have a highly detailed system prompt with a lot of few-shot examples, this is going to be very helpful because you can send those once and then have subsequent conversations while they are cached.
Agentic search and tool use is another example, especially if you have to define your tools and what their inputs and outputs are. You can put those definitions in the prompt cache, send them once, and save a lot of money. Another example is talking to books, papers, documentation, podcast transcripts, and other long-form content.
This is a very enticing application for RAG, and with these long-context models, especially with prompt caching or context caching, it becomes viable to put these documents in the context rather than chunking them, computing embeddings, and then doing retrieval over them. Now, here's a table that shows the type of reduction in cost and latency you can expect for different applications. If you're chatting with your documents and sending 100,000 tokens without caching, it would take about 12 seconds to generate a response.
But with caching, it's only about 2.4 or 2.5 seconds, which is an 80% reduction in processing time or latency and a 90% reduction in cost.
If you're doing few-shot prompting with 10,000 tokens, you can expect about a 31% reduction in latency and about an 86% reduction in cost. If you're doing a multi-turn conversation, like a 10-turn conversation, you can expect about a 75% reduction in latency but only about a 53% reduction in cost. Now, the way cached tokens are charged versus the input/output tokens is different, which is why you see these reductions in cost as well.
We saw the cost reduction because cached tokens cost only 10% of the base input token price, which is a huge reduction of 90%. However, you also need to keep in mind that writing to the cache costs about 25% more than the base input token price for any given model, so there is an overhead when writing to the cache for the first time, but then there is a substantial reduction in cost. Now, the Gemini models do it in a different way; there is no cost associated with the actual cached token, but there is a storage cost of $1 per million tokens per hour.
Okay, so here is what the pricing looks like. The input token cost increases by 25% when writing to the cache, so if the base price was $3 per million tokens, a cache write costs $3.75 per million tokens. For subsequent cached reads, it's about $0.30 per million tokens, just 10% of the base price. It doesn't have any impact on the output token cost.
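To make that arithmetic concrete, here is a minimal sketch, assuming a base input price of $3 per million tokens and a prompt that is entirely cacheable; the numbers are illustrative, not official pricing.

```python
# Illustrative input-token cost comparison for prompt caching (not official pricing).
BASE_INPUT = 3.00                  # $ per million base input tokens (assumed)
CACHE_WRITE = BASE_INPUT * 1.25    # first write: 25% surcharge -> $3.75 per million
CACHE_READ = BASE_INPUT * 0.10     # subsequent reads: 10% of base -> $0.30 per million

def input_cost(prompt_millions: float, calls: int) -> tuple[float, float]:
    """Input-token cost of `calls` requests that all reuse the same prompt."""
    without_cache = BASE_INPUT * prompt_millions * calls
    with_cache = CACHE_WRITE * prompt_millions + CACHE_READ * prompt_millions * (calls - 1)
    return without_cache, with_cache

# A 100k-token prompt reused across 10 calls: $3.00 without caching vs about $0.65 with it.
print(input_cost(0.1, 10))
```

In this sketch the one-time write surcharge is already recovered by the second call, and every call after that costs roughly a tenth of the uncached price.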
Here are the API reference documents for this new feature.
Keep in mind that prompt caching is still in beta, so the API can change over time. Before looking at the code example, let's look at the key differences between their approach and the Gemini context caching. The first major one is how many tokens you can cache.
The minimum cacheable prompt length is 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus (when it's available), and 2,048 tokens for Claude 3 Haiku, which are very sensible limits.
When it comes to Gemini context caching, the minimum input token count for context caching is about 32,000 tokens. Now, what if your document is not 32,000 tokens in length? That means you can't really use the context caching feature in Gemini; however, you will be able to use that feature in Cloud.
So, that's a plus one for Claude, but they have their own limitation: the cache has a 5-minute lifetime, refreshed each time the cached content is used. So if you cache something, you can only use it within a 5-minute window.
If you don't use it within 5 minutes, you'll have to cache it again. Now, on the other hand, I think Google has more sensible limits, so if you don't set a time to live with the Gemini API, then it's set to a default value of 1 hour. However, you can change it to whatever time limit you want, but keep in mind there is this additional cost associated with storage when it comes to context caching of Gemini.
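For comparison, here is a rough sketch of setting a longer time to live with the google-generativeai SDK. The class and parameter names reflect my reading of the SDK docs at the time, and `book_text` is a placeholder, so treat the exact names as an assumption.

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder key

book_text = "..."  # placeholder: the long document you want to cache (must meet the ~32k-token minimum)

# Cache the document once and keep it alive for two hours instead of the default one hour.
# Remember the storage cost of roughly $1 per million tokens per hour while it lives.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    display_name="book-cache",
    contents=[book_text],
    ttl=datetime.timedelta(hours=2),
)

# Build a model that answers on top of the cached content.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("What is the title of the book?").text)
```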
Now, I think this 5-minute window also really limits the usability of the feature, and I think it's one reason it can't replace RAG: in the case of RAG, you only need to embed your documents and put them in a vector store once, and then you can retrieve them whenever you need. Here, every 5 minutes you will need to re-send the content to the cache with the additional 25% surcharge, and that can really add up over time. But I think there is another way in which you can combine RAG with this kind of caching, and I'll talk about that later in the video.
Context caching is great, but it's not a replacement for RAG. If you want to learn more about RAG beyond basics, I have a course on it where I walk you through a step-by-step process of how to build robust RAG systems for your own applications. If that's something you're interested in, check out the video description.
Now, back to the video. Before looking at a quick example, let's look at some best practices for effective caching. They recommend caching stable, reusable content like system instructions, background information, large context, or frequent tool definitions.
Place cached content at the beginning of the prompt for the best performance. So, it seems like if you put this at the start of your prompt, it's going to be more usable and useful for the model. Then, use cache breakpoints strategically to separate different cacheable prefix sections.
I'll talk about this later in the video, but you can define up to four different cache breakpoints in a single prompt, and then regularly analyze cache hit rates and adjust your strategy as needed. Okay, so how does it work? Well, it's a little different from normal API calls.
Now, you will need to add a cache_control block in your API call, and you will also need to include a specific beta header in your API requests. They have a beta namespace, `client.beta.prompt_caching`, that you can use for prompt caching, or you can also use the normal Anthropic API client. So, here are a couple of examples of where you can use this; the first is large context caching.
If you have a large context, you add the cache_control block at the end of the content you want cached, and that caches everything up to that point. So, for example, here we have a system instruction, and at the end of the system instruction we put this block. That means anything in the system instruction is now going to be cached; anything following it is a normal API call that uses this cached context.
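Here is a minimal sketch of that pattern in Python; the model name, the beta header value, and the `long_background` variable are assumptions based on the documentation at launch, not code from the video.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_background = "..."  # placeholder: the large, stable context you want cached

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant. Use the background below.\n\n" + long_background,
            # everything up to and including this block gets cached
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points of the background."}],
    # beta header required when using the regular client for prompt caching
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```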
There are also ways to check from the response whether there were any cache hits and how many tokens were read from the cache; we'll see that when we look at more concrete code examples. Now, another useful use case is caching tool definitions.
Usually, tool definitions can be very large if you have a lot of tools, so you can cache those as well. Here is an example with two tools: one is "Get Weather," which has all the corresponding properties, and the second one is "Get Time." If you put the cache_control block at the end of the tool definitions, the whole tool list is cached, and you can use it right away in subsequent API calls.
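A sketch of that tool-caching pattern might look like this, reusing the same client and beta header as above; the tool schemas here are made up for illustration.

```python
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
    {
        "name": "get_time",
        "description": "Get the current time for a given timezone.",
        "input_schema": {
            "type": "object",
            "properties": {"timezone": {"type": "string", "description": "IANA timezone"}},
            "required": ["timezone"],
        },
        # putting the breakpoint on the last tool caches the entire tool list
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=300,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
```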
Now, another interesting application is continuing a multi-turn conversation. For example, if you have a very long conversation with the model and you want to cache part of the chat history, you can do that, using up to four different cache breakpoints.
As an example, here is a very long system prompt that is being cached at the beginning of the conversation. But let's assume you keep using this model to have a conversation where you have the user role and the assistant is responding. But at a certain point in the conversation, you decide you want to cache that conversation.
All you need to do is add the cache block. So, as an example, it's placed here, which means anything before that point is cached, and now you have two cache breakpoints in your conversation: one on the original system message and the second one here. Later in the conversation, you can add yet another one, which means you're caching this conversation at three different points.
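Here is a rough sketch of what those incremental breakpoints look like; `long_system_prompt` and the conversation content are placeholders I made up for illustration.

```python
long_system_prompt = "..."  # placeholder: detailed instructions plus few-shot examples

messages = [
    {"role": "user", "content": "Here is the document we'll discuss: ..."},
    {"role": "assistant", "content": "Understood. Ask me anything about it."},
    # ...more turns of the conversation...
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks. What are the main themes?",
                # second breakpoint: everything in the conversation up to here is cached
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
]

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # first breakpoint: the system prompt
        }
    ],
    messages=messages,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
```

On later turns you would add another breakpoint further down the message list, up to the four-breakpoint limit mentioned earlier.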
Let's look at a practical example and see what type of reduction in latency you can expect. This is based on the example provided in their own cookbook. So first, we need to install the dependencies; that includes the Anthropic SDK and Beautiful Soup, and we need the latter because we want to extract text from a web page.
Next, we set up our API keys. I'm storing everything as environment variables using the secrets in my Google Colab notebook. After that, we need to tell it which model to use.
So I'm going to be using the Claude 3.5 Sonnet model. We're going to be downloading text from a web page, and that's why we have a function that receives the URL of a text file and downloads its contents.
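A minimal version of that helper might look like the following; the Project Gutenberg URL is my assumption for the plain-text edition, not necessarily the exact one the cookbook uses.

```python
import requests
from bs4 import BeautifulSoup

def fetch_text(url: str) -> str:
    """Download a page and return its plain text."""
    response = requests.get(url)
    response.raise_for_status()
    # for a plain-text file this is effectively a pass-through;
    # for an HTML page it strips the markup
    return BeautifulSoup(response.text, "html.parser").get_text()

# Pride and Prejudice, plain-text edition (assumed URL)
book_text = fetch_text("https://www.gutenberg.org/cache/epub/1342/pg1342.txt")
print(len(book_text), book_text[:500])
```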
So, next, we provide a link; in this case, it's a link to the Project Gutenberg website, and the text we're retrieving is the plain text of the book *Pride and Prejudice*. You can see that this book contains roughly 700,000 characters, but here we are just printing the first 500. Now, this is a substantially big book that we're going to be putting in the cache, but before that, let's look at a single-turn conversation using a non-cached API call and see what type of latency to expect.
So, here is the function that makes that API call. First, we have a user role where we put the whole text of the book as the context. Notice that I'm using the cache_control block here; that means everything up to this point is going to be cached, but the cache is only read on subsequent calls, not during the first one.
So, the first API call is not going to be served from the cache, but for any subsequent API calls, the book content will be, because we are putting a cache breakpoint here. Also keep in mind that when we send this for the first time, there's a 25% surcharge on the tokens being written to the cache. After that, we have another user input, which is basically the prompt, and it says, "What is the title of the book? Only output the title." We take this set of messages and pass them to the normal client from Anthropic; the model is Claude 3.5 Sonnet, it can generate up to 300 tokens, and we pass on the messages.
Now, since we're using the normal client, we also need to add the beta flag in the request header, which tells the Anthropic client to use this beta feature; or, as an alternative, you can use the beta namespace, which is basically `client.beta.prompt_caching`, and call its create endpoint. Again, although we are using the cache block here, the first API call is not going to be served from the cache.
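Putting the pieces together, a sketch of that call using the beta namespace and timing the response might look like this; the question, the model name, and the `book_text` variable from the earlier helper are assumptions layered on what the cookbook describes. On the first run the usage statistics should report tokens written to the cache; on a repeat run within the 5-minute window they should report tokens read from it.

```python
import time

def ask_book(question: str) -> str:
    """Send the whole book plus a question; the book sits behind a cache breakpoint."""
    start = time.time()
    response = client.beta.prompt_caching.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "<book>" + book_text + "</book>",
                        "cache_control": {"type": "ephemeral"},  # cache the book contents
                    },
                    {"type": "text", "text": question},
                ],
            }
        ],
    )
    print(f"took {time.time() - start:.1f}s")
    # usage reports how many tokens were written to or read from the cache
    print(response.usage)
    return response.content[0].text

print(ask_book("What is the title of this book? Only output the title."))
```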
Now, I ran into some issues with my API, and I think it has to do with the beta nature of the API right now, so I'll adjust the notebook and put it in the video description. For now, we're going to look at the results from the official cookbook for this code. When we make the non-cached API call, which is the first API call based on the code block I showed you, it took about 22 seconds.
Now, there are 17 input tokens, which is the prompt we're passing on, and eight output tokens; the output tokens are the title of the book. If we use the same code block to make a subsequent API call, it will use the cached content, because we set the cache_control right after the contents of the book. So, there's another function that we define, but it has the same structure.
Since it's a subsequent call to this API endpoint, it will actually use that cached content. In this case, it took only about 4 seconds. The total input tokens are again 17, because we're passing on the same message.
However, the number of output tokens in this case is different because it returned this text: "The title of this book is *Pride and Prejudice*." Although we simply asked for the title, it added some further text, and that's why you see an increase in the number of output tokens. If you were to make another subsequent call, it would take about this same amount of time, rather than the initial 21 or 22 seconds.
Now, an obvious question is going to be: which implementation is better, context caching from Gemini or prompt caching from Anthropic? I would argue you can make a case for both. One application for Anthropic's prompt caching is something like a chat-with-your-documents app, where the user chats with a number of PDF files within a single session.
I think that's the kind of setup where the 5-minute lifetime is workable, since the cache is refreshed each time it's used within the session. Also, the Anthropic API lets you cache a fairly small amount of content, a small number of tokens. So, for example, if you're chatting with a single PDF file, you don't need a huge cache; something like 4,000 or 8,000 tokens is more than enough, and since it's going to be a single session, it makes sense.
Now, if it's hundreds of thousands of tokens and the conversation is going to take much longer than 5 minutes, then I think the Anthropic implementation is not a good option. In that case, you probably want to look at Gemini's implementation, because there can be interruptions of longer than 5 minutes while the user is talking with the documents or a codebase, and it also lets you set any arbitrary lifetime you want. So, the default is only 1 hour, but you can extend it to as long as you want.
Now, the second question that everybody has is: does long context with context caching replace RAG? The simple answer is no. In enterprise settings, you will encounter knowledge bases that span well beyond millions of tokens.
In that case, you can't really put just a subset of documents in the context of something like Gemini, because you actually need the whole knowledge base in order to extract the most relevant documents. But I think long context really helps RAG as well: rather than retrieving a small number of chunks, you can retrieve whole documents, put those whole documents in the context of these LLMs, and that will help the models create better answers. So, I think having long context will supercharge RAG but not replace it.
If you have a different thought on this topic, let me know; I would love to have a discussion in the comment section below. I hope you found this video useful. Thanks for watching, and as always, see you in the next one!