AI Engineering in 76 Minutes (Complete Course/Speedrun!)

Marina Wyss - Gratitude Driven
Buy the AI Engineering book here to continue your learning! https://amzn.to/42kjXb2 All images are ...
Video Transcript:
hey everyone today we're diving into the book AI Engineering by Chip Huyen 800 pages of really great content about this in-demand field that's offering salaries of $300,000 or more in this video I'm summarizing everything from the book to help you get a high-level overview of the field we'll talk about Foundation models prompt engineering rag fine tuning agents how to build a system improving inference and more I also want to mention this is a super high-level overview of a very detailed technical book don't expect to learn all the details just from watching this video I
really recommend using this as a way to get an overview of what the field looks like and use it as a jumping off point for your own research and exploration so what exactly is AI engineering and how is it different from traditional machine learning let's break it down AI engineering has exploded recently for two simple reasons AI models have gotten dramatically better at solving real problems while the barrier to building with them has gotten much lower this perfect storm has created one of the fastest growing engineering disciplines today at its core AI engineering is about
building applications on top of foundation models those massive AI systems trained by companies like OpenAI or Google unlike traditional machine learning Engineers who build models from scratch AI Engineers leverage existing ones focusing less on training and more on adaptation these Foundation models work through a process called self-supervision instead of requiring humans to painstakingly label data these models can learn by predicting parts of their input data this breakthrough solved the data labeling bottleneck that held back AI for years as these models scaled up with more data and computing power they evolved from simple language models
to what we now call large language models or llms and they didn't stop there they've expanded to handle multiple types of data including images and video often becoming large multimodal models nowadays we're seeing Foundation models power everything from coding assistants like GitHub Copilot to image generation tools writing aids customer support bots and sophisticated data analysis systems now that we've covered what AI engineering is let's dig deeper into Foundation models themselves how they're trained how they work and why understanding their architecture matters for AI Engineers Foundation models at their core can only know what they've been
trained on this might seem obvious but it has profound implications if a model hasn't seen examples of a specific language or concept during training it simply won't have that knowledge most large Foundation models are trained on web crawled data which brings some inherent problems this data often contains clickbait misinformation toxic content and fake news to combat this teams use various filtering techniques for instance OpenAI only used Reddit links with at least three upvotes when training GPT-2 the language distribution in training data is also heavily skewed about half of all crawled data is in English
which means languages with millions of speakers are often underrepresented this is why specialized models for specific languages and domains are becoming increasingly important also the distribution of domains in one of the main training data sets leans heavily towards business Tech news and art in terms of model architecture most Foundation models use Transformer architectures based on the attention mechanism but to understand why Transformers were such a breakthrough we need to look at what came before Transformers were invented to solve the problems of sequence to sequence models which used recurrent neural networks for tasks like translation these
had two main components an encoder that processes inputs and a decoder that generates outputs both worked sequentially token by token the problem is that the decoder only has access to a compressed representation of the entire input imagine trying to answer detailed questions about a book when all you have is a brief summary also input processing and output generation are done sequentially so it's slow for long sequences Transformers solved this with the attention mechanism which allows the model to weigh the importance of different input tokens when generating each output token it's like being able to reference
any page in the book while answering questions plus Transformers can process input tokens in parallel making them much faster during inference Transformers work in two steps first pre-fill process all the input tokens in parallel to create the intermediate State and second decode generate one output token at a time the attention mechanism uses three types of vectors first query vectors these represent what information the model is looking for next key vectors like indices of previous tokens and finally value vectors the actual content of the previous tokens the model computes how much attention to give each input
token by comparing the Q and K vectors a high similarity score means that the token's content V will heavily influence the output this is why longer context windows are computationally expensive more tokens mean more K and V vectors to compute and store attention is almost always multi-headed allowing the model to focus on different groups of tokens simultaneously in Llama 2 7B there are 32 attention heads for example a complete Transformer consists of multiple Transformer blocks each containing an attention module and a neural network module the number of blocks is often called the number of layers.
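To make the Q, K, V description above concrete, here's a minimal NumPy sketch of single-head scaled dot-product attention. This is illustrative only; real implementations add causal masking, multiple heads, and learned projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: (seq_len, d_k) query vectors -- what each token is looking for
    K: (seq_len, d_k) key vectors   -- indices of the previous tokens
    V: (seq_len, d_v) value vectors -- the actual content of the previous tokens
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between queries and keys
    weights = softmax(scores, axis=-1)   # how much attention each token receives
    return weights @ V                   # weighted sum of the value vectors

# toy example: 4 tokens with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```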
before and after each block there's an embedding module that converts tokens and their positions into vectors and finally an unembedding layer that maps output vectors to token probabilities so that's a super high-level look at this I would really recommend either reading the book or checking out StatQuest for an awesome overview of Transformers and the attention mechanism I'll link that in the description that's really how I learned while Transformers dominate they're not the only architecture models like RWKV which combines RNN based approaches with parallelization capabilities are gaining traction for certain applications in general larger models
with more parameters have greater capacity to learn and perform better the number of parameters helps us estimate the compute resources needed for training and inference as well however note the parameter count can be misleading with sparse models so those with many zeros which can be more efficient a large sparse model might require less compute than a smaller dense one when designing models compute is often the limiting factor the chinchilla scaling law helps calculate the optimal model size and data size for a given compute budget it suggests that the number of training tokens should be about
20 times the model size so a 3 billion parameter model needs about 60 billion training tokens while the cost for achieving the same model performance is decreasing over time the cost for improvements remains high going from a 3% to a 2% error rate might require an order of magnitude more data compute or energy but even small performance improvements can make a huge difference for downstream applications as we keep scaling models we're approaching two significant bottlenecks first training data there's concern we'll run out of high-quality internet data in the next few years forcing models to train
on AI generated content potentially causing performance degradation or requiring access to proprietary data like copyrighted books and medical records second electricity data centers already consume 1 to 2% of global electricity limiting how much larger they can grow without significant energy breakthroughs pre-trained Foundation models face two main issues they're optimized for text completion not conversation and their outputs can be factually incorrect or ethically problematic posttraining aims to address these issues through two main steps first supervised fine tuning supervised fine tuning optimizes the model for conversations instead of completion this requires high quality instruction data showing the
kinds of requests the model should handle and how it should respond it's essentially teaching the model what good responses look like second preference fine tuning preference fine tuning aligns the model with human values using reinforcement learning often called reinforcement learning from Human feedback this involves training a reward model that scores outputs based on human preferences and optimizing the foundation model to generate responses that maximize these scores while reinforcement learning from Human feedback has been the standard approach newer methods like direct preference optimization DPO are gaining traction some companies even skip the reinforcement learning step entirely
instead generating multiple outputs and selecting those with high reward model scores this is a strategy called best-of-N Foundation models don't just produce a single definitive answer they generate probabilities for possible outputs how we sample from these probabilities dramatically affects the model's responses the simplest approach is greedy sampling always picking the highest probability token but this leads to repetitive predictable text to introduce creativity we use sampling techniques temperature controls how confident the model is in its predictions higher temperature values like 0.7 to 1 make outputs more creative but potentially less accurate while lower temperatures
close to zero make outputs more deterministic and focused top K sampling restricts the model to choosing from only the K most likely next tokens typically between 50 and 500 depending on how diverse you want the responses to be top P sampling selects the smallest set of tokens whose cumulative probability exceeds a threshold p a value of 0.9 means the model will only consider tokens that together make up 90% of the probability mass this probabilistic nature explains many of the behaviors we see in Foundation models like inconsistency with minor input changes and hallucinations where models confidently state incorrect information.
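As an illustration of these sampling strategies, here's a simplified NumPy sketch that applies temperature, top-k, and top-p filtering to a vector of token logits. Production inference engines implement this far more efficiently, and the specific logit values below are made up for the example.

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Sample one token id from raw logits using temperature, top-k, and top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # top-k: keep only the k most likely tokens
    top_k_ids = np.argsort(probs)[::-1][:top_k]

    # top-p: within those, keep the smallest set whose cumulative probability >= p
    sorted_probs = probs[top_k_ids]
    cutoff = np.searchsorted(np.cumsum(sorted_probs) / sorted_probs.sum(), top_p) + 1
    kept_ids = top_k_ids[:cutoff]

    kept_probs = probs[kept_ids] / probs[kept_ids].sum()  # renormalize what's left
    return int(rng.choice(kept_ids, p=kept_probs))

print(sample_token([2.0, 1.5, 0.3, -1.0, -2.0], temperature=0.7, top_k=3, top_p=0.9))
```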
now that we understand Foundation models a little more let's talk about one of the most crucial yet underappreciated aspects of AI engineering evaluation for some applications figuring out evaluation can consume the majority of your development effort it's how you mitigate risks uncover opportunities and gain visibility into where your system is failing evaluating AI systems is significantly harder than traditional ml models for several reasons first the problems these models solve are often inherently complex evaluating a mathematical proof or the quality of a summary requires deep expertise you might need to read an entire book
just to judge if a summary captures the key points correctly second tasks are typically open-ended with many possible correct responses unlike classification where there's one right answer a question like write me a poem about resilience has countless valid responses third Foundation models are black boxes you can only evaluate them by observing their outputs not by understanding their internal workings fourth publicly available evaluation benchmarks quickly become saturated which is when the model achieves perfect scores what was a challenging test yesterday becomes an easy exercise today and finally for general purpose models you need to evaluate not
just known tasks but discover new capabilities that might extend beyond human abilities all of this is made worse by a general underinvestment in evaluation compared to model development so let's start with some fundamental metrics used to evaluate language models during training most autoregressive language models are trained using cross entropy or its relative perplexity these metrics essentially measure how well the model predicts the next token in a sequence entropy measures how much information on average a token carries the higher the entropy the more information dense each token is and the more unpredictable the language if
you can perfectly predict what I'll say next what I say carries no new information language models learn the distribution of their training data the better a model learns this distribution the better it becomes at predicting what comes next resulting in lower cross entropy a perfectly trained model would achieve cross entropy equal to the entropy of the training data itself and the KL divergence between the two will be zero perplexity is simply the exponential of cross entropy it measures the amount of uncertainty a model has when predicting the next token higher perplexity means that there
are more possible options the model is considering what counts as good perplexity depends entirely on the data more structured data has lower expected perplexity because it's more predictable the larger the vocabulary the higher the perplexity because there are more possible options and the longer the context length the lower the perplexity tends to be while perplexity is useful for guiding training and serves as a proxy for a model's general capabilities it becomes less reliable for models that have undergone significant posttraining with SFT or RLHF as models get better at completing tasks they might actually get worse at predicting the next token in a statistical sense.
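As a concrete illustration of the relationship between cross entropy and perplexity, here's a small sketch; the token probabilities are made up for the example.

```python
import numpy as np

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) the model assigns to the actual next tokens."""
    return -np.mean(np.log(token_probs))

def perplexity(token_probs):
    """Perplexity is simply the exponential of cross entropy."""
    return np.exp(cross_entropy(token_probs))

# probabilities a hypothetical model assigned to the true next token at each position
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.3, 0.25]

print(perplexity(confident))  # ~1.1 -- few effective options, very predictable text
print(perplexity(uncertain))  # ~5.1 -- the model is effectively choosing among ~5 options
```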
perplexity can also be used to detect if a text was in a model's training data because it would be unusually good at predicting those tokens and to identify nonsensical text which would have abnormally high perplexity for some tasks we can perform exact evaluation where there's no ambiguity about the correct answer like multiple choice questions this is in contrast to subjective evaluation like grading an essay the gold standard here is functional correctness evaluating whether the system performs its intended functionality for example if I ask a model
to book a restaurant reservation did it make the correct reservation this is the ultimate metric for any application though it's not always clear how to measure it in coding tasks functional correctness translates to execution accuracy does the code run and produce the expected output for gaming Bots we can measure objective performance metrics like win rates when reference data is available we can evaluate outputs by comparing their similarity to this ground truth this approach is bottlenecked by how much and how fast reference data can be generated either by humans or AI there are three main ways
to compare outputs to references first exact match a binary measure that works for simple questions with definitive answers like who was the first woman to win the Nobel Prize second lexical similarity a continuous measure of how much the tokens overlap between the output and reference this can use techniques like edit distance how many changes are needed to transform one text into another or n-gram overlap metrics like BLEU and ROUGE the drawback is that you need a comprehensive set of reference responses and the references themselves can be wrong plus higher lexical similarity doesn't necessarily mean
a better response there are many ways to express the same idea third semantic similarity this is a continuous measure of whether two texts have the same meaning regardless of the specific words used this is typically implemented by comparing text embeddings using metrics like cosine similarity the advantage is that it doesn't require references but it does depend on the quality of the underlying embedding algorithm.
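To make semantic similarity concrete, here's a minimal sketch that compares two texts via embeddings and cosine similarity. The embed() function is a stand-in for whatever embedding model you actually use, not a real API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice, call your embedding model of choice here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)  # 384 dimensions, a common embedding size

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = "Marie Curie was the first woman to win the Nobel Prize."
candidate = "The Nobel Prize was first awarded to a woman when Marie Curie won it."
score = cosine_similarity(embed(reference), embed(candidate))
print(f"semantic similarity: {score:.3f}")  # meaningless with the random placeholder,
                                            # but shows where a real embedding model plugs in
```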
one of the most powerful and common methods for evaluating AI models in production is using another AI model as a judge these AI judges are fast easy to use and relatively cheap compared to human evaluators they can work without reference data and can judge attributes like correctness toxicity hallucinations and more studies have shown that AI judges can correlate strongly with human evaluators sometimes showing higher agreement than between different human judges they can also explain their decisions which helps with transparency you can use AI judges to score outputs compare outputs to references or pick the best of two responses since language models are generally better with text than numbers AI judges tend to perform better with classification tasks than numerical scoring when creating prompts for AI judges you need
to include the evaluation task criteria and scoring system few shot examples generally work better than zero shot which we'll talk about later in the prompt engineering section though longer prompts do increase costs interestingly you don't always need your strongest model as the judge specialized smaller models can often perform evaluation tasks effectively which helps reduce costs and latency however of course AI judges have limitations like all AI applications they're probabilistic the same judge given the same input can produce different scores if prompted differently or simply run twice this makes evaluation results harder to reproduce or trust
additionally metrics aren't standardized across different systems one system's definition of faithfulness might differ from another's models also exhibit biases they might prefer responses from the same model this is called self bias favor the first answer in a comparison this is position bias or prefer lengthier answers verbosity bias you can mitigate these biases through techniques like randomizing the order of responses but this also increases costs now that we understand evaluation let's tackle one of the most crucial decisions in AI engineering model selection with the increasing number of readily available Foundation models the challenge isn't
developing models but selecting the right one for your application during application development you'll go through model selection multiple times as you progress through different adaptation techniques for instance when doing prompt engineering you might start with the strongest model to evaluate feasibility then work backwards to see if smaller models would suffice if you decide to fine-tune you might start with a small model to test your code before moving to a larger one the selection process typically involves two key steps first finding the best achievable performance on the task and then second mapping models along a cost
performance axis and choosing the model that gives the best performance for your budget your criteria for evaluating a model can be organized into four buckets first domain specific capabilities how well does the model understand your specific domain for example if you're summarizing legal documents how well does it understand legal terminology second General capabilities how coherent faithful or factually consistent are the outputs third instruction following capabilities does the model follow the format and structure you requested and fourth cost and latency how expensive is the model to run and how quickly does it respond sometimes rather than
evaluating absolute quality you just need to determine which model is best for your use case this can be done through point-wise evaluation so you score each model independently or comparative evaluation where you directly compare outputs when evaluating models you also need to differentiate between hard attributes and soft attributes hard attributes are impossible or impractical to change these include license restrictions training data composition model size privacy requirements and the level of control you need these are often determined by the model providers or your own internal policies and they can significantly limit your pool of options soft
attributes on the other hand can be improved through adaptation techniques like prompt engineering or fine-tuning these include things like accuracy toxicity and factual consistency a high level workflow for model selection looks like this filter out models whose hard attributes don't work for you then use publicly available information like benchmark performance to narrow down to the most promising candidates third run your own experiments to find the best model given all of your objectives fourth continually monitor your chosen model in production to detect failures and collect feedback most companies won't build Foundation models from scratch so another question
is whether to use commercial model apis or host an open source model yourself let's clarify some terminology first originally open source meant any model you could download and use but some argue that a model should only be considered truly open source if its training data is also publicly available this allows for more flexible usage like retraining from scratch with modifications models with open weights but closed training data are sometimes called open weight models while those with both open weights and open data are open models so most so-called open source models are actually just open weight
these models also come with different licenses that may restrict commercial use or limit how you can use the model's outputs for training other models for a model to be accessible to users a machine needs to host and run it the service that hosts the model and handles queries is often called the inference service while the interface the users interact with is the model API after creating a model developers can choose to open source it make it accessible via an API or both typically model providers open source their weaker models and keep their best ones behind
paywalls whether to host a model yourself or use a model API depends on several factors first data privacy if your company has strict data privacy policies that prevent sending data outside the organization externally hosted model apis are not an option there's also the risk that API providers might use your data to train their models next data lineage and copyright most models aren't transparent about their training data and intellectual property laws around AI are still evolving it's unclear whether using a model trained on copyrighted data could create legal issues for your product next performance the
gap between open sourced and proprietary models is closing but the strongest models will likely remain proprietary commercial apis often provide additional capabilities out of the box like scalability function calling so accessing external tools for example structured outputs and output guard rails these can be challenging to implement yourself so many companies turn to API providers however this means you'll be restricted to their functionality you might not be able to fine-tune or access log probabilities for example typically proprietary models are easy to start with and scale but they can become expensive with heavy usage and offer less
flexibility it's wise to design your application with a standard internal API so you can easily swap between models if needed control is another consideration what happens if your API provider goes out of business changes their terms of service or is banned in certain regions and if you want to run a model on device third-party apis aren't an option there are numerous benchmarks for different use cases and a tool that helps you evaluate a model on multiple benchmarks is called an evaluation harness for example OpenAI Evals lets you run any of around 500 existing
benchmarks to evaluate their models when using public leaderboards you need to consider which benchmarks to include in your aggregated ranking how to weigh different benchmarks and how to handle benchmarks that use different metrics like accuracy F1 BLEU etc keep in mind that the goal is to select a small subset of models for more rigorous testing with your own benchmarks and metrics public benchmarks rarely represent your application's needs perfectly and they may suffer from data contamination which is when the models were trained on the same data they're being evaluated on to deal with contamination you first
need to detect it using heuristics like n-gram overlap and perplexity if perplexity on the evaluation data is unusually low it's possible the model has seen this during training once you've narrowed down your model candidates you need a robust evaluation pipeline evaluate both the end-to-end output and each component's intermediate outputs independently you can use something called turn-based evaluation where you assess the quality of each output and task-based evaluation where you measure whether the system completes a task and how many turns it takes first think about what makes a good response factors like relevance factual consistency
and safety then create test queries and generate multiple responses to see how models perform develop detailed rubrics with examples for your scoring system whether you use binary scores continuous scales or something else depends on your data and your needs the key is to make your rubric unambiguous so that human evaluators can follow it consistently most importantly tie your evaluation metrics to business metrics if your customer support chatbot's factual consistency is 80% what does that mean for the business perhaps you can automate 30% of customer support requests at that level but at 90% consistency you
could automate 50% this lets you quantify the business impact of model improvements you'll also need to establish a usefulness threshold for instance your chat bot must be 90% factually consistent to be viable in production different criteria might require different evaluation methods you might use a specialized toxicity classifier semantic similarity metrics to measure relevance and an AI judge to assess factual consistency you can even mix and match evaluation methods for the same criteria for example maybe use a cheap classifier on all your data and an expensive AI judge on just 1% for high quality signals while
automated metrics are preferable for scale don't hesitate to include human evaluation even in production just do it on a subset of data to keep costs manageable it's also crucial to evaluate application on different slices of data or users to ensure it performs well across segments and avoid biases this helps you identify areas for improvement and prevent Simpsons Paradox where a model performs better on aggregate but worse on each individual subset how much evaluation data you need depends on your application and methods generally you want enough to be reliable but not so much that costs become
prohibitive a good way to test reliability is to create multiple bootstrap samples of your evaluation set and see if they yield similar results if you get 90% on one bootstrap but 70% on another your evaluation pipeline isn't trustworthy.
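Here's a small sketch of the bootstrap check just described, assuming you already have per-example scores from an evaluation run; the 85% pass rate below is a made-up example.

```python
import numpy as np

def bootstrap_scores(example_scores, n_bootstraps=20, seed=0):
    """Resample the evaluation set with replacement and report the score spread."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(example_scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_bootstraps)]
    return min(means), max(means)

# per-example pass/fail results from a hypothetical evaluation run
results = np.array([1] * 85 + [0] * 15)  # 85% pass rate on 100 examples
low, high = bootstrap_scores(results)
print(f"bootstrap range: {low:.0%} - {high:.0%}")
# if one bootstrap gives ~90% and another ~70%, the eval set is too small to trust
```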
finally evaluate the reliability of your pipeline itself first is it getting signals right do better responses indeed get higher scores next do better evaluation metrics lead to better business outcomes third how reliable is the pipeline if you run it twice do you get the same results fourth how correlated are your metrics you don't need two metrics if they're perfectly correlated but completely uncorrelated metrics might indicate problems and finally what cost and latency does your evaluation pipeline add to your application model selection remains one of the hardest but most important topics in AI engineering with the rapidly growing number of foundation models available your challenge isn't developing models but selecting the right one for your specific needs balancing performance cost privacy and control now let's dive into what might be the most accessible yet surprisingly nuanced aspect of AI engineering prompt engineering if you've ever used ChatGPT you've already done some form of prompt engineering
but there's much more to it than just typing questions prompt engineering refers to the process of crafting instructions that guide a model to generate your desired outcome it's the easiest and most common model adaptation technique because unlike fine tuning it doesn't change the model's weights you're just telling the model what you want it to do while it's the most accessible entry point to AI engineering don't be fooled into thinking that it's simplistic effective prompt engineering requires the same experimental rigor as any machine learning task you should extract maximum value from prompting before moving to more
resource intensive techniques like fine-tuning that said understanding prompt engineering alone isn't enough for production ready systems you'll still need knowledge of Statistics engineering and classical ml for experiment tracking evaluation and data set curation prompts typically consist of one or more of these components first the task description this includes the model's role and expected output format for example you are a helpful medical assistant analyze the following symptoms and suggest possible conditions listing them in order of likelihood next examples these show the model how to perform the task for instance if you want a model to classify
text as toxic or non-toxic you might include examples of each third the concrete task this is the specific job you want the model to do like answering a question or summarizing a book how much prompt engineering you need depends on the model's robustness to prompt perturbation a robust model shouldn't produce dramatically different outputs if you write the number 5 as a digit versus writing it out as five this robustness is strongly correlated with a model's overall capability it's also worth noting that different models have different preferred prompt structures for example GPT-4 typically performs better when the task description
is at the beginning of the prompt while llama 3 does better when the task appears at the end teaching models what to do via prompts is known as in context learning each example in your prompt is called a shot so we get the terms few shot zero shot and one shot learning how many examples you need depends on both the model and your application so experimentation is necessary the number of examples you can include is limited by the model's context length and for API models your cost constraints many modern models distinguish between system and user
prompts the system prompt contains the task description telling the model what role to play its goals and constraints the user prompt contains the specific task or query almost all applications like ChatGPT have system prompts usually created by the application developers rather than end users these system and user prompts are combined using a template that can vary between models and versions if you use the wrong template you might experience unexpected performance issues even small mistakes like an extra new line can cause problems when constructing inputs make sure to follow the model's chat template exactly.
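As an illustration, here's a sketch of combining a system prompt and a user prompt with a chat template. The template shown follows the general shape of Llama 2's format, but you should always pull the exact template from your model's documentation or tokenizer rather than hard-coding it like this.

```python
# Illustrative only: the exact special tokens differ between models and versions,
# and even a stray newline can degrade performance.
LLAMA2_STYLE_TEMPLATE = (
    "<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"
)

def build_prompt(system_prompt: str, user_prompt: str) -> str:
    return LLAMA2_STYLE_TEMPLATE.format(
        system_prompt=system_prompt.strip(),
        user_prompt=user_prompt.strip(),
    )

print(build_prompt(
    system_prompt="You are a helpful medical assistant. List possible conditions "
                  "in order of likelihood.",
    user_prompt="I have a headache and mild fever. What could this be?",
))
```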
this is especially important if you're using third party tools to construct prompts as template mismatches often lead to silent failures models typically understand instructions better when they appear at the beginning or end of the prompt rather than buried in the middle let's go through some key strategies for effective prompt engineering first write clear and explicit instructions if you want a model to score an essay explain the scoring system you want it to use should it allow fractional scores what should it do if it can't determine an answer be specific to reduce ambiguity second ask the
model to adopt a Persona asking a model to respond as a particular character or expert can significantly change its output style and focus for example respond as an experienced pediatrician or answer as if you were explaining it to a 10-year-old third provide examples examples can dramatically shift a model's response style for instance asking will Santa bring me presents without examples might get a straight no Santa is fictional response but if you provide an example of a Whimsical answer about the Tooth Fairy the model is more likely to play along four specify the output format tell
the model exactly how you want the response structured this might mean requesting things like no preambles so none of this based on the content of this essay I'd give it a score of dot dot dot you can also ask for specific formats like Json or markdown and particular sections or headings five break complex tasks into simpler subtasks this not only improves performance but also makes monitoring debugging and parallelization easier however it can increase the latency perceived by users if they don't see the intermediate outputs you can also use cheaper models for simpler steps to reduce
cost six give the model time to think several techniques can improve model reasoning Chain of Thought prompting so think this through step by step process instructions so something like first analyze the key themes second identify the author's perspective and so on next self-critique ask the model to check its own work these approaches generally improve quality but increase latency and token usage seven iterate systematically this is so important different techniques work better for different models so experimentation is crucial always version your prompts and use an experiment tracking tool with standardized evaluation metrics and data also separate
prompts from code store them in configuration files rather than hardcoding them this will make it way easier to update various tools aim to automate the prompt engineering workflow including OpenPrompt and DSPy these tools let you specify input and output formats evaluation metrics and evaluation data then essentially they perform AutoML to find the optimal prompts however these tools can be expensive if they make many API calls under the hood they also might produce prompts with typos or other issues and they may not keep up with changing model requirements for these reasons it's best to
start with manual prompt engineering before moving to automated tools you can also use AI models themselves to write and refine prompts once your application is available to users it may face attacks from malicious actors trying to exploit it three main types of prompt attacks include prompt extraction attacks where attackers might try to extract your system prompt to either replicate or exploit your application jailbreaking and prompt injection the attacks attempt to subvert the model's safety features or get it to perform unauthorized actions like providing instructions for harmful activities or executing dangerous code and third information extraction
these attacks try to get the model to reveal sensitive information from its training data or context to defend against these attacks consider the following strategies use benchmarks to evaluate safety against adversarial attacks conduct security red teaming to proactively find weaknesses be explicit in your prompts about what information the model should not return repeat the system prompt before and after user inputs to remind the model of its constraints Design Systems with safety boundaries like running generated code only in isolated environments require human approval for potentially impactful actions define out of scope topics for your application use
anomaly detection to identify unusual prompts and implement guardrails on both inputs and outputs when evaluating your system's security track both the violation rate so how often attacks succeed and the false refusal rate how often the model incorrectly refuses legitimate requests you need to balance these metrics perfect security with too many false refusals creates a really frustrating user experience by approaching prompt engineering with this combination of creativity and rigor you can extract remarkable performance from Foundation models without the complexity and expense of fine-tuning remember that small changes in your prompts can lead to significant improvements in
output quality so experiment widely and measure carefully now that we've covered prompt engineering let's explore how to give Foundation models access to information beyond what they were trained on to solve a task effectively a model needs two things instructions on how to perform the task and the necessary information to complete it two dominant patterns have emerged for providing models with the information they need retrieval augmented generation or rag and the agentic pattern rag allows models to retrieve relevant information from external data sources while the agentic pattern enables models to use tools like web search and
apis to gather information actively while rag is primarily used for context construction the agentic pattern can do much more let's start with rag first so what is rag retrieval augmented generation is a technique that enhances a model's generation capabilities by retrieving relevant information from external memory sources these sources could be an internal database a user's previous chat sessions or even the internet you can think of rag as a technique to construct context specific to each query connecting the model with information it wasn't trained on or might have forgotten a rag system consists of two main
components a retriever that fetches the information from the external memory source and a generator the foundation model that produces a response based on the retrieved information in today's rag systems these components are often trained separately with many teams using off-the-shelf Retriever and models however fine tuning the entire rag system from end to end can significantly improve performance the success of a rag system heavily depends on its retriever a retriever performs two main functions indexing and querying indexing involves processing data so that it can quickly be retrieved later this is the preparatory step where you organize
your knowledge base querying is the process of sending a search query to retrieve data relevant to it how you index your data determines how you retrieve it later let's walk through a simple example imagine your external memory as a database of documents like contracts or meeting notes these documents can range from 10 tokens to a million tokens in length naively retrieving whole documents would make your context arbitrarily long potentially exceeding the model's context window to avoid this you typically split each document into smaller chunks which we'll discuss later for each user query your goal is
to retrieve the data chunks most relevant to that query then with some postprocessing to join the retrieved chunks with the user's prompt you get the final prompt that goes to the model many existing retrieval algorithms can be used for rag retrieval works by ranking documents based on their relevance to a given query and algorithms differ in how they compute these relevance scores first term-based retrieval this is also called lexical retrieval and this approach finds relevant documents based on keywords while this is straightforward it has several limitations so many documents might contain a term without truly being
about it and queries can be long with many terms that aren't equally important so TF-IDF can help address this also simple tokenization can miss semantic relationships term-based retrieval is generally faster than embedding based approaches during both indexing and querying it also works well out of the box with existing systems like Elasticsearch embedding based retrieval is another option this approach computes relevance at the semantic level rather than a lexical one ranking documents based on how closely their meaning aligns with the query the process works like this convert your original data to embeddings using an
embedding model store these embeddings in a vector database when a query comes in convert it to an embedding using the same model fetch the K data chunks whose embeddings are closest to the query embedding and return them vector search is typically framed as a k nearest neighbor search problem this can be computationally expensive for large data sets so approximate nearest neighbor algorithms are often used instead in practice most developers won't implement vector search themselves but will use existing vector databases these databases organize vectors into buckets trees or graphs using various heuristics to increase the likelihood that similar vectors are stored close to each other.
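Here's a minimal sketch of the embedding-based retrieval loop just described, using brute-force cosine similarity instead of a real vector database; embed() is a placeholder standing in for your embedding model, and the chunks are made up.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding model; returns a deterministic fake unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# indexing: embed every chunk once and store the vectors
chunks = [
    "The contract renewal date is March 1, 2025.",
    "Meeting notes: the team agreed to migrate to the new billing system.",
    "Employee handbook section on parental leave policy.",
]
index = np.stack([embed(c) for c in chunks])

# querying: embed the query with the same model and take the k nearest chunks
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = index @ q                  # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]  # exact k-NN; real systems often use ANN indexes
    return [chunks[i] for i in top]

print(retrieve("When does the contract renew?"))
```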
embedding based retrieval can significantly outperform term-based retrieval over time especially if you fine-tune your embedding model and retriever but it has its downsides it can make it harder to search for specific names or error codes and generating embeddings can be expensive and introduce latency a production retrieval system typically combines several approaches for example a cheaper less precise retriever like term-based search might first fetch candidates and then a more precise but expensive mechanism like KNN finds the best options among those candidates depending on your task certain
tactics can increase the chance of retrieving relevant documents the simplest approach is to divide documents into chunks of equal length based on characters words sentences or paragraphs overlapping chunks can ensure that important boundary information is included in at least one chunk smaller chunk sizes allow for more diverse information since you can fit more chunks into the model's context but this can also result in the loss of important context smaller chunks also increase computational overhead especially for embedding based retrieval there's no universal best chunk size or overlap percentage you just need to experiment based on your specific data and task.
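A minimal sketch of fixed-size chunking with overlap, word-based here for simplicity; you could just as well split on tokens, sentences, or paragraphs.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into word chunks of `chunk_size`, each sharing `overlap` words
    with the previous chunk so boundary information isn't lost."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

document = "word " * 450  # stand-in for a real document
chunks = chunk_text(document, chunk_size=100, overlap=20)
print(len(chunks), "chunks")  # 6 chunks of up to 100 words, overlapping by 20
```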
the initial document rankings generated by the retriever can be further refined to be more accurate this is especially useful when you need to reduce the number of retrieved documents due to context window limitations documents could be reranked based on various factors such as recency so maybe you give more weight to newer data or additional relevant signals next let's talk about query rewriting also known as query reformulation normalization or expansion this technique involves rewriting queries to include necessary context for example if a user asks what's its population after previously asking about Paris
the query might be expanded to what's the population of Paris each chunk can be augmented with relevant context to make it easier to retrieve this might include metadata like tags and keywords or for e-commerce products it could be information like descriptions and reviews you can also augment chunks with context from the full document to help them retain more of the original meaning for example maybe a summary of the entire document when choosing a retrieval solution consider first what retrieval mechanisms it supports term-based embedding based and or hybrid for vector databases what embedding models and
Vector search algorithms are supported also consider scalability both for data storage and query traffic you'll need to think about indexing speed and batch processing capabilities query latency pricing structure and compliance requirements as well it's also important to note that rag isn't limited to just text it can also be used with multimodal and tabular data for instance if a user asks what's the color of the house in the Pixar movie up a multimodal rag system might first retrieve an image of the house to help the model answer similarly rag can work with tabular data using text
to SQL conversion the system can execute a query on a database and then generate a response based on the results for complex database schemas you might need an intermediate step to predict which table to use for each query especially if there are too many tables to fit all the schemas in the context window in the next part we'll explore the agentic pattern which goes beyond passive retrieval to actively interact with external tools and apis the agentic pattern is a more active approach to extending AI capabilities this is a rapidly evolving field so consider this section
more experimental than the others we've covered at its broadest definition an agent is anything that can perceive its environment and act upon it for AI systems this means that a model can observe its environment make decisions based on those observations take actions that affect the environment and learn from the outcomes of those actions the environment is defined by the use case for a game playing Agent the game is the environment for a web scraping agent the internet is the environment what makes agents powerful is the set of tools they have access to for example chat
GPT is an agent that can search the web execute python code and generate images among other capabilities remember our rag example with tabular data that was actually a simple agent with three actions generating SQL queries executing those queries and producing a response let's see how this works in practice if a user asks project the sales revenue over the next 3 months the agent might first reason about how to accomplish the task then generate a SQL query to fetch historical sales data next it would execute that query against the database analyze if the retrieved information is
sufficient possibly generate and execute additional queries and then create a projection based on the gathered data finally it would conclude that the task has been successfully completed compared to simpler AI applications agents require more powerful models because they often need to perform multiple steps to complete a task the overall success rate decreases with each step because of compounding errors and the stakes are higher since agents have access to potentially powerful tools speaking of tools agents can be equipped with various tools which fall into several categories first knowledge augmentation tools these could be things like text
or image retrievers as in rag SQL executors for database access web search capabilities apis for accessing inventory systems email readers etc and web browsers for navigating online content whether public or private next we have capability extension tools like calculators since AI models often struggle with complex math time zone or unit converters translation services and code interpreters we also have write action tools so tools that enable the agent not just to read but also write to systems these can automate workflows but require strong security protocols complex tasks require planning and there are many possible ways to
decompose a task not all approaches will be successful and not all will be efficient to help with debugging and to prevent cases where a model executes unnecessary API calls planning should be decoupled from execution the process typically works like this first ask the agent to generate a plan then validate the plan before execution and then only execute once validated plans can be validated using heuristics like removing plans with invalid actions or too many steps or by using another AI model as a judge you can even generate several plans in parallel and then ask an
evaluator to pick the most promising one for particularly important or sensitive tasks you might want a human in the loop to review plans before execution while Foundation model agents use the model itself as the planner reinforcement learning agents are trained using reinforcement learning algorithms this approach uses more resources than Foundation models but could offer performance improvements in the future the simplest way to turn a model into a plan generator is through prompt engineering you tell the model what functionality it has available and the expected inputs and outputs for each tool you can improve your prompts
by writing better system prompts with more examples providing clearer descriptions of tools and their parameters simplifying functions as much as possible using a stronger model or fine-tuning a model specifically for plan generation as a practical tip always ask the system to report what parameter values it uses for each function call this provides a sanity check that can catch many issues before execution another useful approach is to generate plans in natural language first then translate them to the exact function calls in a second step this helps if function names change over time or if you find a model specifically for plan creation the translation can often be done by a smaller cheaper model.
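Here's an illustrative sketch of turning a model into a plan generator through prompt engineering along the lines described above, with plan validation before execution. The tool names, the JSON plan format, and the call_model() stub are all hypothetical; in a real system call_model() would hit your model API.

```python
import json

# hypothetical tool descriptions the agent is allowed to use
TOOLS = """
- fetch_sales(start_date, end_date): returns monthly sales figures from the database
- run_sql(query): executes a read-only SQL query and returns the rows
- project(series, months): fits a simple forecast and returns projected values
"""

PLANNER_SYSTEM_PROMPT = f"""You are a planning agent. You have access to these tools:
{TOOLS}
Given a task, output a JSON list of steps. Each step must name the tool and report the
exact parameter values you intend to use, so they can be sanity-checked before execution.
Do not execute anything yourself."""

def call_model(system: str, user: str) -> str:
    """Stub standing in for a real model API call; returns a canned plan."""
    return json.dumps([
        {"tool": "fetch_sales", "params": {"start_date": "2023-01", "end_date": "2024-12"}},
        {"tool": "project", "params": {"series": "monthly_sales", "months": 3}},
    ])

def generate_plan(task: str) -> list[dict]:
    plan = json.loads(call_model(system=PLANNER_SYSTEM_PROMPT, user=task))
    # validate the plan before executing it: no invalid tools, no runaway step counts
    allowed = {"fetch_sales", "run_sql", "project"}
    assert all(step["tool"] in allowed for step in plan), "plan uses an invalid tool"
    assert len(plan) <= 10, "plan has too many steps"
    return plan

print(generate_plan("Project the sales revenue over the next 3 months."))
```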
agents can fail in various ways so it's important to have robust evaluation methods there are lots of different things that can go wrong so we could have planning failures like using invalid tools using valid tools but with invalid parameters using valid tools with incorrect parameter values or failing to achieve the goal or satisfy constraints to evaluate planning capability create a data set where each example is a tuple of task available tools and constraints for each task use the
agent to generate multiple plans and compute metrics like the percentage of generated plans that are valid how many attempts it takes to get a valid plan percentage of tools called that are valid and how often invalid tools are called you could also have tool failures so that could include things like bad translation from high level plans to specific function name no access to the required tools or tools giving incorrect outputs like poorly generated SQL queries for this your efficiency metrics might be how many steps does the agent need on average to complete a task what's
the cost to complete a task how long does each action typically take are there particularly slow or expensive actions and how does the agent compare to baselines which might be another agent or a human one of the key challenges for agents is remembering information over time a memory system allows a model to retain and utilize information across interactions AI models typically have three main memory mechanisms there's the internal knowledge embedded in the model itself through training there's the context window which is kind of your short-term memory for immediate session specific information and finally external
data sources like rag systems this is kind of like your long-term memory information that is essential to all tasks should be incorporated via training rarely needed information should reside in long-term memory while short-term memory is for immediate context specific information benefits of a well-designed memory management system include storing information longer than the context window allows persisting information between sessions making a model more consistent in its responses and actions by combining rag for information access tools for capability extension planning for complex tasks and memory systems for continuity agents can tackle increasingly sophisticated problems while this field
is still evolving rapidly it represents one of the most promising frontiers in AI engineering as with all powerful technologies agent systems require careful consideration of safety security and ethical use the more capable an agent becomes the more critical it is to ensure it operates within appropriate boundaries and with proper oversight now let's explore fine tuning the process of adapting a model to a specific task by further training it and adjusting its weights while prompt engineering and rag are relatively lightweight techniques fine-tuning offers deeper customization but requires more resources and expertise so when to fine-tune
fine tuning can improve a model's performance in two ways first by enhancing domain specific capabilities like coding or answering medical questions and second improving instruction following abilities like adhering to specific output formats however fine tuning requires significant upfront investment it often needs more memory than what's available on a single GPU making it expensive this is why reducing memory requirements has become a primary motivation for many fine-tuning techniques that we'll discuss later so you should consider fine-tuning when you've already exhausted what you can achieve with prompt-based methods you need to produce consistent structured outputs and you're
working with smaller models that need to perform better on specific tasks a common approach is model distillation fine-tuning a small model to imitate a larger model's behavior using data generated by the large model on specific tasks a small fine-tuned model may outperform a larger general purpose model on the other hand you should avoid fine-tuning if you need a general purpose model fine-tuning can improve performance on specific tasks but degrade performance on others or if you're just starting to experiment with a project many teams jump straight to fine-tuning before thoroughly exploring simpler approaches so what
about fine-tuning versus rag after you've maximized performance gains from prompting choosing between Rag and fine-tuning depends on whether your model's failures are information based or behavior-based if the model fails because it lacks information like private company data or recent events rag gives the model better access to that information if the model has behavioral issues which I think is very funny to say like outputs that are factually correct but irrelevant or they're in the wrong format fine tuning might help more if your model has both issues start with rag because it's easier begin with a simple
term-based solution and evolve from there in many cases combining rag and fine tuning will give you the biggest performance boost so the workflow to adapt a model to a task might be first design evaluation criteria and an evaluation pipeline then try to get the model to perform the task with prompting alone add more examples to the prompt from there at that point if the model continues to have information based failures try more advanced rag like embedding based retrieval if it continues to have behavioral issues opt for fine-tuning finally combine Rag and fine-tuning for a bigger
performance boost because of the scale of foundation models memory is a major bottleneck for both inference and fine tuning the memory requirements for fine-tuning are typically much higher than for inference due to how neural networks are trained neural networks are typically trained using back propagation each training step consists of a forward pass where we compute the output from the input and a backwards pass where we update the model's weights using signals from the forward pass during inference only the forward pass is executed during training both passes are needed the key contributors to a model's memory
footprint during fine tuning are the total number of parameters the number of trainable parameters and the numerical representation of these parameters a trainable parameter is one that can be updated during fine tuning so during pre-training all model parameters are updated during inference no parameters are updated and during fine-tuning some or all of the parameters may be updated parameters that remain unchanged are called Frozen parameters one way to reduce training memory is through gradient checkpointing also called activation recomputation where activations aren't stored but recomputed as needed this increases training time but reduces memory requirements the key
insight here is that the more trainable parameters we have the higher the memory footprint reducing the number of trainable parameters reduces memory requirements this is the motivation behind parameter efficient fine-tuning which we'll talk about in a bit another way to reduce the memory footprint is through quantization converting a model from a format with more bits to one with fewer bits for a 13 billion parameter model using 32-bit floating point each parameter requires 4 bytes resulting in 52 GB total so if you reduce each value to 16 bits the memory needed drops to 26 GB.
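The arithmetic behind those numbers is simple enough to sketch; note this only covers the weights themselves, while training adds further overhead for activations, gradients, and optimizer states.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the model weights, ignoring activations,
    gradients, and optimizer states."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

for bits in (32, 16, 8, 4):
    print(f"13B params at {bits}-bit: {weight_memory_gb(13e9, bits):.1f} GB")
# 32-bit: 52 GB, 16-bit: 26 GB, 8-bit: 13 GB, 4-bit: 6.5 GB
```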
inference is typically done using as few bits as possible 16 8 or even 4 bits training is more sensitive to numerical precision so it's usually done in mixed precision with some operations in higher precision like 32-bit and others in lower precision like 16- or 8-bit different numerical formats balance range so the span of values that can be represented and precision how exactly a number can be represented there are a few different formats reducing precision can cause values to change or result in errors so it's important to load models in their intended format for example when Llama
2 was released its weights were optimized for BF16 causing significantly worse quality when loaded with FP16 now let's talk about PEFT in the early days of smaller models full fine-tuning so updating all the model parameters was common this required a lot of high-quality annotated data and substantial computational resources as models grew people started using partial fine tuning focusing on specific layers like only the last layer this reduces memory requirements but it isn't very parameter efficient parameter efficient fine-tuning techniques insert additional parameters into strategic locations in the model to achieve strong fine-tuning performance with a small
number of trainable parameters while this can increase inference latency slightly as adapters add computational steps PEFT methods are generally not only parameter efficient but also sample efficient they can work with just a few thousand examples compared to the millions potentially needed for full fine-tuning PEFT methods fall into two categories so we have adapter-based methods also called additive methods that add new model weights and then we have soft prompt-based methods that introduce special trainable tokens the most popular adapter-based method is LoRA low-rank adaptation unlike traditional adapters LoRA incorporates additional parameters without increasing inference latency instead of adding new layers LoRA uses modules that can be merged back into the original layers LoRA works by decomposing weight matrices into products of smaller matrices then updating only these smaller matrices for a weight matrix with dimensions n by m LoRA first chooses a smaller dimension r the rank then creates two matrices A which is n by r and B which is r by m during fine-tuning only A and B are updated while the original weights remain frozen for inference A and B can be multiplied together and added to the original weights
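Here is a minimal PyTorch sketch of that idea, written to match the dimensions just described (W is n by m, A is n by r, B is r by m, only A and B receive gradients). It is illustrative rather than any particular library's implementation, and the alpha/r scaling factor is a common convention I'm assuming rather than something stated here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, n: int, m: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # stands in for a pretrained n x m weight matrix; frozen during fine-tuning
        self.weight = nn.Parameter(torch.randn(n, m) * 0.02, requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(n, r) * 0.01)  # trainable, n x r
        self.lora_B = nn.Parameter(torch.zeros(r, m))          # trainable, r x m; zero init so the update starts at 0
        self.scale = alpha / r                                 # common scaling convention (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.lora_A @ self.lora_B                      # n x m low-rank update, only A and B get gradients
        return x @ (self.weight + self.scale * delta)

    @torch.no_grad()
    def merge(self) -> None:
        # fold A @ B into W for inference so no extra latency is added
        self.weight += self.scale * (self.lora_A @ self.lora_B)
```

With n = m = 4096 and r = 8, the trainable parameters for this one matrix drop from roughly 16.8 million to roughly 65 thousand, which is where the memory savings come from.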
the efficiency of LoRA depends both on the chosen rank and which matrices it's applied to it's primarily applied to the attention modules in Transformer models if you want to fine-tune a model for multiple tasks you have several options first simultaneous fine-tuning training on a data set with examples from all tasks at once this is harder and requires more data or you could do sequential fine-tuning where you first train on task A and then on task B but this can cause catastrophic forgetting where the model loses its ability on earlier tasks or you can try model merging where you fine-tune on different tasks separately then combine the resulting models model merging offers greater flexibility than fine-tuning alone if you have two models that excel at different aspects of the same task you can merge them into a single model that outperforms both this approach can be done without GPUs it can improve performance while reducing the memory footprint it's an excellent option for on-device deployment and it can facilitate federated learning where multiple devices train using separate data unlike ensembling which combines the outputs of multiple models merging combines the models themselves this improves performance without the higher inference cost of running multiple models several merging approaches exist so we have summing where we just add the weight values of the constituent models together this is the most common we could have layer stacking so we take different layers from different models and stack them this is also called frankenmerging or concatenation where we just combine the parameters this is less recommended because it doesn't reduce memory compared to separate models
so here's a practical fine-tuning approach and what a typical development path might look like first test your fine-tuning code using the cheapest fastest model you have and ensure it works then test your data by fine-tuning a midsize model if training loss doesn't decrease with more data something might be wrong after that run experiments with your target model to see how far you can push performance and then map the price performance frontier and select the model that makes the most sense for your use case alternatively a distillation path looks like this start with a small data set and the strongest model you can afford then train the best possible model with this small data set use this fine-tuned model to generate more training data and use the expanded data set to train a cheaper model when choosing fine-tuning methods here are some things to consider so for beginners start with adapter techniques like LoRA before attempting full fine-tuning understand that data volume matters full fine-tuning typically requires thousands to millions of examples while PEFT can work with hundreds also you'll need to know how many fine-tuned models you need adapter methods let you serve multiple variants that share a base model there are also some key hyperparameters that you should know these ones in particular
significantly impact fine-tuning results so we have the learning rate just like in machine learning if the loss curve fluctuates the learning rate is likely too high if it's stable but decreases very slowly the rate's probably too low generally start larger and decrease over time we also have batch size larger batches process training examples faster but require more memory small batches lead to more unstable training so to address instability you can accumulate gradients across several batches as sketched below we also need to think about the number of epochs smaller data sets typically need more epochs than larger ones for millions of examples one to two epochs might suffice for thousands of examples 4 to 10 may be needed reduce epochs if you see overfitting we also have prompt loss weight for instruction fine-tuning this determines how much prompts should contribute to the loss compared to the responses if it's set to 100% prompts and responses contribute equally if it's 0% the model learns only from responses the default is typically 10%
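Here's a minimal sketch of the gradient accumulation trick mentioned above, taking one optimizer step per group of micro-batches so the effective batch size grows without extra memory; the model, loader, optimizer, and loss function are placeholders for your own training setup.

```python
# Gradient accumulation: one optimizer step per `accumulation_steps` micro-batches,
# which mimics a larger batch size without its memory cost.
def train_with_grad_accumulation(model, loader, optimizer, loss_fn, accumulation_steps=8):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        (loss / accumulation_steps).backward()   # scale so accumulated gradients average out
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                     # one update per accumulated group
            optimizer.zero_grad()
    # any leftover micro-batches at the end are dropped here for brevity
```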
while the technical process of fine-tuning has been simplified by frameworks that handle the training process and suggest sensible defaults the strategic decisions around fine-tuning remain complex the key is knowing when to fine-tune which technique to use and how to balance the trade-offs between performance resources and data requirements while most companies can't afford to train foundation models from scratch nearly all can differentiate themselves through high quality data sets for adaptation as the saying goes garbage in garbage out and nowhere is this more true than in data set engineering we're witnessing a shift from model-centric to data-centric approaches in AI development model-centric AI tries to improve performance by enhancing the models themselves so designing new architectures increasing model sizes or developing new
training techniques data-centric AI on the other hand focuses on improving performance by enhancing the data developing better data processing techniques and creating high quality data sets that allow superior models to be trained with fewer resources for companies adapting foundation models rather than building them from scratch the data-centric approach offers the greatest competitive advantage the type of data you need depends on your adaptation task for self-supervised fine-tuning you need sequences of relevant domain data for instruction fine-tuning you need data in instruction-response format for preference fine-tuning you need data in the format of an instruction with a winning response and a losing response and for reward modeling you need either preference data or examples with explicit scores
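For concreteness, here are toy records in each of those formats; the field names and contents are hypothetical, since every fine-tuning API defines its own schema.

```python
# Illustrative examples of the data formats described above.
instruction_example = {
    "instruction": "Summarize the following support ticket in one sentence.",
    "response": "The customer cannot reset their password after the latest app update.",
}

preference_example = {
    "instruction": "Explain what an inference server does.",
    "winning_response": "An inference server hosts models, allocates hardware to run them, and returns outputs for incoming requests.",
    "losing_response": "It trains models on GPUs.",
}

reward_modeling_example = {
    "prompt": "Write a polite follow-up email about the quarterly report.",
    "response": "Hi Sam, just checking in on the quarterly report. Let me know if you need anything from me.",
    "score": 4,  # explicit quality score
}
```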
your training data should exhibit the behaviors you want your model to learn this can be particularly challenging for complex behaviors like chain-of-thought reasoning or tool use in agent workflows when developing conversational applications you need to consider whether you require single-turn data multi-turn data or both single-turn data helps train a model to respond to individual instructions while multi-turn data teaches the model how to solve tasks through dialogue like clarifying user intent before addressing the task or incorporating
corrections a small amount of high quality data can outperform a large amount of noisy data a principle confirmed by teams working on models like Llama 3 they found that human generated data is often prone to errors and inconsistencies particularly for nuanced policies leading them to develop AI assisted annotation tools to ensure high quality which is interesting to me but what makes data high quality there are several factors to consider first relevance the examples should be relevant to your target task legal text from the 19th century might not be relevant for answering contemporary legal questions you'll
also need alignment with task requirements if your task focuses on factual consistency annotations need to be factually correct if it demands creativity annotations should be creative we also need to think about consistency annotations should be consistent across examples and annotators they need to be correctly formatted so data should adhere to the expected structure they need to be sufficiently unique you want minimal duplicates in your data set they need to be compliant and follow internal and external policies and you need coverage your training data needs to cover the range of possible problems you want to solve
requiring sufficient diversity missing coverage in important areas will result in poor performance for those cases no matter how much data you have overall but how much data do you need asking how much data you need is kind of like asking how much money you need the answer varies widely depending on your situation several factors influence data requirements so if you're fine-tuning then the fine-tuning technique matters full fine-tuning typically requires orders of magnitude more data than parameter efficient methods like LoRA with tens of thousands to millions of examples full fine-tuning might be appropriate with just hundreds to a few thousand examples PEFT methods will likely work better it also depends on your task complexity a simple sentiment classification task requires much less data than complex question answering about financial filings for example the base model performance also makes a difference so the closer the base model is to your desired performance the fewer examples you'll need larger more capable base models generally require fewer examples to fine-tune effectively OpenAI's fine-tuning guide demonstrates that with fewer examples around 100 more advanced models give better fine-tuning results however after fine-tuning on a large data set around 550,000 examples all models perform similarly regardless of their initial capabilities so in short with limited data use PEFT methods on more advanced models with abundant data full fine-tuning on smaller models becomes viable before investing in a large data set start with a small well-crafted set of around 50 examples to see if fine-tuning improves your model if you see clear improvements more data will likely help further if you see no improvement with a small data set a larger one rarely solves the problem though be careful to rule out other issues like poor hyperparameters
or data quality first in most cases you should see improvements after fine-tuning with just 50 to 100 examples you can also reduce the amount of high quality data you need by first fine-tuning on more accessible data so one path might be self-supervised to supervised first fine-tune on domain specific documents then on targeted question answer pairs or less relevant to more relevant data first fine-tune on adjacent domains with abundant data then on your specific domain or synthetic to real data first fine-tune on AI generated examples then on limited real examples experimenting with subsets of your current data set so maybe 25 50 and 100% can help estimate how much more data you'll need a steep performance gain with increasing data set size suggests significant improvement from doubling your data a plateau indicates diminishing returns
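A sketch of that subset experiment might look like the following; `finetune_and_evaluate` is a placeholder for whatever fine-tuning and evaluation pipeline you already have, and the fractions are just the 25/50/100% split mentioned above.

```python
import random

# Fine-tune on growing fractions of the data and compare evaluation scores to
# estimate whether collecting more data is likely to pay off.
def data_scaling_experiment(examples, finetune_and_evaluate, fractions=(0.25, 0.5, 1.0), seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)       # fixed seed so subsets are nested and repeatable
    results = {}
    for frac in fractions:
        subset = examples[: int(len(examples) * frac)]
        results[frac] = finetune_and_evaluate(subset)   # returns an eval score from your pipeline
    return results   # steep gains between fractions suggest more data will help
```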
so let's say you need more data how can you get it if you don't have enough for your use case if possible you'll want to create a data flywheel that leverages user interactions to continually improve your product this offers a significant competitive advantage or you could also just check available data sets you can often mix and match different sources though all data must be thoroughly verified for quality and appropriate licensing when annotating your own data the challenge isn't just the annotation process but creating clear guidelines you need to explicitly define what makes a good response can a response be correct but unhelpful what distinguishes a score of three versus a four these guidelines are crucial both for human and AI powered annotations trust me one of the hardest machine learning problems I've ever had to solve was an issue with human labelers data augmentation creates new examples from existing data which is another option so
you could do things like flipping an image to create a new variant or you could use data synthesis this generates artificial data that mimics real data properties like simulating mouse movements on a web page the key difference between augmented data and synthetic data is that augmented data is derived from real data while synthetic data is created from scratch data synthesis therefore is particularly valuable for addressing privacy concerns when working with sensitive information together some combination of these techniques should allow you to produce data at scale increase coverage across your problem space and possibly improve quality
with AI generated data since humans aren't always great at creating consistent data but of course make sure to measure the quality of your AI generated data just like you would for human generated data once you have your data you need to process it data processing can be time-consuming but it is critical for quality here are some best practices start with filtering tasks and test scripts before big runs avoid changing data in place so you want to make sure to keep the originals perform exploratory data analysis on distributions and outliers examine interannotator disagreement and resolve conflicts fact check and manually inspect examples deduplicate data to prevent over-representation clean formatting tokens like HTML and markdown which can improve performance and reduce input size remove non-compliant data so anything like PII toxic material or copyrighted content filter out low-quality data identified during verification if you have more data than your compute budget allows use active learning to select the most helpful examples and ensure data is in the right format for your model using the appropriate tokenizer and chat template
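Two of those steps, stripping markup and exact deduplication, can be sketched in a few lines; real pipelines also do fuzzy deduplication, PII filtering, and quality scoring, so treat this as illustrative only, and note the `{"text": ...}` record shape is an assumption.

```python
import hashlib
import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML/markup tags
    return re.sub(r"\s+", " ", text).strip()     # normalize whitespace

def deduplicate(examples):
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.md5(clean_text(ex["text"]).lower().encode("utf-8")).hexdigest()
        if key not in seen:                      # keep only the first copy of each exact duplicate
            seen.add(key)
            unique.append(ex)
    return unique
```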
while all these steps require a lot of effort they're essential for creating data sets that will help your model to shine in the competitive landscape of AI applications well-engineered data sets often make the difference between mediocre and exceptional performance now let's dive into one of the most practical aspects of AI engineering inference optimization after all a model's real world usefulness boils down to two crucial factors how much it costs to run and how quickly it responds these characteristics inference cost and latency ultimately determine which applications can practically use AI and at what scale let's start by understanding what we mean by inference in the AI life cycle there are
two distinct phases in an AI model's journey training and inference training builds the model while inference uses the model to compute outputs for given inputs in a production environment the component responsible for running model inference is called an inference server this server hosts available models allocates hardware resources to execute them and returns responses to users the inference server is part of a broader inference service that also handles receiving routing and pre-processing requests so what does this mean for you well if you're using a model API like those from OpenAI or Google you're essentially outsourcing this inference service but if you decide to host models yourself you'll need to build optimize and maintain your own inference infrastructure to optimize inference we first need to understand what's slowing things down generally speaking AI workloads face two types of bottlenecks compute bound bottlenecks occur when the limiting factor is the computational power available tasks requiring intensive calculations like image generation are typically compute bound memory bandwidth bound bottlenecks occur when the limiting factor is how quickly data can move between memory and processors autoregressive language model inference is typically memory bandwidth bound profiling tools like Nvidia Nsight can help determine which bottleneck affects your workload through something called a roofline chart what's important to understand is that different optimization techniques address different bottlenecks a compute bound workload might benefit from more powerful chips or distributing work across multiple chips meanwhile a memory bandwidth bound workload might see better results from chips with higher memory bandwidth
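A rough way to reason about which side of the roofline you're on is to compare a workload's arithmetic intensity (FLOPs per byte moved) with the chip's ratio of peak FLOPs to memory bandwidth. The sketch below assumes about 2 FLOPs per parameter per generated token, FP16 weights, and A100-class peak numbers, all of which are ballpark assumptions rather than figures from the book.

```python
# Compare arithmetic intensity to the chip's compute-to-bandwidth ratio to guess
# whether a workload is compute bound or memory bandwidth bound.
def bottleneck(flops: float, bytes_moved: float, peak_flops: float, peak_bandwidth: float) -> str:
    arithmetic_intensity = flops / bytes_moved        # FLOPs per byte for the workload
    chip_balance_point = peak_flops / peak_bandwidth  # FLOPs per byte the chip can feed
    return "compute bound" if arithmetic_intensity > chip_balance_point else "memory bandwidth bound"

# batch-1 decoding of a 13B model: ~2 FLOPs per parameter, but every FP16 weight byte is read
print(bottleneck(flops=2 * 13e9, bytes_moved=2 * 13e9,
                 peak_flops=312e12, peak_bandwidth=2e12))   # -> memory bandwidth bound
```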
now that we understand bottlenecks let's look at how inference is actually served many providers offer two distinct types of inference APIs each optimized for different use cases online APIs optimize for latency processing requests as soon as they arrive chatbots typically use online APIs since users expect quick responses batch APIs on the other hand optimize for cost processing multiple requests together more efficiently but with higher latency applications without strict response time requirements like periodic report generation or synthetic data creation can benefit from batch processing the key is matching your inference type to your application's needs so now how do we measure if our inference is performing well that brings us to our next section here are some key inference performance metrics to optimize effectively we need to know what we're measuring several metrics help
us evaluate inference performance the first and perhaps most notable metric is latency the time from when users send a query until they receive a complete response for autoregressive models like LLMs latency breaks down into two components so we have the time to first token TTFT which is how quickly the first token is generated after receiving a query and then we have time per output token TPOT how long it takes to generate each subsequent token the total latency then equals TTFT plus TPOT times the number of output tokens some teams also measure time to publish TTP because the first generated token isn't always immediately shown to users especially when the model first generates a plan or uses chain-of-thought reasoning one important note about latency since it varies across requests looking at percentiles gives you much more meaningful information than simple averages
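Here's a tiny illustration of that formula and of percentile reporting; the TTFT and TPOT values and the token counts are made-up numbers, and the percentile helper is a simple nearest-rank version.

```python
import statistics

def total_latency(ttft_s: float, tpot_s: float, num_output_tokens: int) -> float:
    return ttft_s + tpot_s * num_output_tokens   # total latency = TTFT + TPOT * output tokens

# made-up requests: same TTFT and TPOT, different response lengths
latencies = sorted(total_latency(0.4, 0.03, n) for n in (50, 120, 200, 400, 800))

def percentile(sorted_values, p):
    idx = round(p / 100 * (len(sorted_values) - 1))  # nearest-rank percentile
    return sorted_values[idx]

print("mean:", round(statistics.mean(latencies), 2))        # the mean hides the long tail
print("p50:", percentile(latencies, 50), "p95:", percentile(latencies, 95))
```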
beyond latency we also care about throughput which is the number of output tokens per second an inference service can generate across all requests higher throughput typically means lower cost which is why optimizing for it matters for production systems it's worth mentioning that most AI applications face a fundamental latency throughput tradeoff techniques like batching can improve throughput but may increase latency for individual requests your optimization strategy needs to balance these competing priorities based on your specific application needs finally utilization metrics tell us how efficiently we're using our resources we have model FLOPs per second utilization which is the ratio of observed throughput relative to the theoretical maximum at peak computing power and model bandwidth utilization which measures the percentage of available memory bandwidth being used
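As a rough illustration of the FLOPs utilization idea, the sketch below converts observed token throughput into implied FLOPs and divides by the chip's peak; the 2-FLOPs-per-parameter-per-token estimate and the hardware numbers are assumptions for the sake of the example.

```python
# Rough model FLOPs utilization: implied FLOPs from observed throughput divided
# by the accelerator's theoretical peak.
def flops_utilization(tokens_per_second: float, num_params: float, peak_flops: float) -> float:
    flops_per_token = 2 * num_params                     # rough forward-pass estimate
    return tokens_per_second * flops_per_token / peak_flops

print(f"{flops_utilization(tokens_per_second=2000, num_params=13e9, peak_flops=312e12):.2f}")
# ~0.17 with these illustrative numbers, i.e. well below the theoretical peak
```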
now that we know what to measure let's look at the hardware that powers inference at the heart of inference performance is specialized hardware an accelerator is a chip designed to speed up specific types of computation for AI workloads the dominant accelerators are GPUs though specialized AI chips are growing in popularity you might be wondering about the difference between CPUs and GPUs it comes down to their architecture CPUs have a few powerful cores typically up to 64 for high-end machines which are optimized for general purpose computing GPUs on the other hand have thousands of smaller cores optimized for parallel processing this makes them ideal for the matrix multiplication operations that dominate ML workloads interestingly training and inference have different
hardware requirements training demands more memory due to backpropagation and is generally more difficult to perform in lower precision inference often emphasizes latency over throughput since users are typically waiting for responses when evaluating hardware for inference consider three key questions can it run your workloads how long does it take to do so and how much does it cost the specific hardware specifications to focus on include FLOPS computing power memory size and memory bandwidth for compute bound workloads prioritize chips with more FLOPS for memory bound workloads focus on higher bandwidth and more memory with the hardware
foundations covered let's move on to techniques for optimizing at the model level now we're getting into the real tactics for speeding up inference let's start with model level optimizations techniques that make the models themselves more efficient model compression reduces a model's size potentially making it faster there are several approaches here quantization which we already discussed reduces numerical Precision pruning removes less important parameters or sets them to zero and distillation which we also already discussed trains a smaller model to mimic a larger one among these options weight only quantization is by far the most popular because
it's relatively easy to implement works well for many models out of the box and delivers significant benefits without that much effort another challenge specific to language models is their autoregressive nature they generate text one token at a time which creates a sequential bottleneck several techniques address this limitation speculative decoding uses a faster but less powerful model to generate candidate tokens which are then verified by the target model it's like having an assistant draft responses and a manager quickly review and approve them another technique inference with reference copies tokens from the input when appropriate for example when answering questions about a document rather than generating them from scratch this can significantly speed up responses for document-based queries parallel decoding aims to generate multiple tokens simultaneously breaking the sequential constraint additionally attention mechanism optimization improves the efficiency of the attention calculations in Transformer models which can be particularly memory intensive at an even lower level kernels and compilers optimize how models run on specific hardware kernels are specialized code optimized for hardware accelerators common optimization techniques include vectorization parallelization loop tiling and operator fusion compilers bridge machine learning models and hardware converting model operations into optimized code for specific
accelerators but optimization doesn't stop at the model level let's look at how we can optimize the entire inference service we can achieve significant performance gains by efficiently managing resources across an entire inference service one of the most powerful techniques is batching which combines multiple requests to process together batching can be implemented in different ways so we have static batching which groups a fixed number of inputs but all requests must wait until the batch is full this is simple but can lead to inconsistent latency dynamic batching sets a maximum time window processing the batch when either it's full or the time limit has been reached this provides more consistent latency guarantees finally we have continuous batching which allows responses to be returned as soon as they're completed with new requests added to maintain batch size this provides the best user experience but is more complex to implement
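A minimal sketch of the dynamic batching idea is below: requests accumulate until the batch is full or a time window expires, then the whole batch is dispatched. It is single-threaded and only checks the timeout when a new request arrives; a real inference server would do this concurrently with a background timer, so treat the class and its parameters as illustrative.

```python
import time

class DynamicBatcher:
    def __init__(self, run_batch, max_batch_size: int = 8, max_wait_s: float = 0.05):
        self.run_batch = run_batch            # callable that processes a list of requests
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.pending = []
        self.window_start = None

    def submit(self, request):
        if not self.pending:
            self.window_start = time.monotonic()   # open a new time window
        self.pending.append(request)
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.pending) >= self.max_batch_size
        timed_out = (time.monotonic() - self.window_start) >= self.max_wait_s
        if full or timed_out:
            batch, self.pending = self.pending, []
            self.run_batch(batch)                  # dispatch the whole batch at once
```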
another powerful technique is decoupled prefill and decode which separates these two phases of LLM inference since they have different computational needs handling them separately prevents resource competition and improves overall efficiency for applications with repetitive patterns prompt caching stores overlapping text segments like system prompts or reference documents to avoid reprocessing them with each query this is particularly valuable for applications with long conversations or multiple queries about the same document as models grow larger a single machine may not be sufficient this is where parallelism comes in distributing work across multiple machines replica parallelism creates multiple copies of the model each handling different requests this is the simplest approach and works well for high throughput scenarios model parallelism splits a single model across machines either through tensor parallelism which is breaking operations into smaller pieces pipeline parallelism dividing the model into sequential stages context parallelism splitting input sequences across devices or sequence parallelism splitting different operations across machines so what technique should you implement we just talked about a lot the optimal combination depends on your specific workloads and performance requirements for applications prioritizing low latency replica parallelism may be best despite higher costs for most use cases the most impactful techniques are typically quantization tensor parallelism replica parallelism and attention mechanism optimization by thoughtfully applying these techniques you can dramatically improve both the speed and cost effectiveness of your AI applications making them more responsive to users while keeping your infrastructure costs manageable in
our next and final section we'll see how all these components come together in a complete AI application architecture and how user feedback creates a virtuous cycle of continuous improvement now that we've explored all the individual components of AI engineering it's time to pull everything together let's see how these pieces fit into a complete architecture and how user feedback creates a powerful loop that helps these systems improve over time the simplest AI application architecture looks like this your application receives a query sends it to a model either through a third party API or self-hosted model
and returns the response to the user no bells no whistles just direct input and output but real world applications rarely stay this simple let's walk through how these architectures typically evolve as your needs grow more sophisticated the first enhancement most applications need is better context construction giving the model access to information required to produce useful outputs this is essentially feature engineering for foundation models so you might add RAG systems to search and retrieve information from your knowledge base agent capabilities to gather information from external tools document uploading functionality to analyze specific content or more these
additions ensure the model has the necessary context to provide accurate relevant responses stage two add guardrails for protection as your application grows in capability you'll need guardrails to protect both your system and your users input guardrails protect against leaking private information to external APIs and malicious prompts that could compromise your system output guardrails catch different types of failures quality failures like empty responses incorrect formatting or factually incorrect content or security failures like toxic content PII exposure or unauthorized actions
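Here's a deliberately small sketch of what an input and output guardrail pair can look like, using a couple of regex checks; the patterns are illustrative assumptions and nowhere near a complete PII or safety detector, which in practice would use dedicated classifiers.

```python
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN-like numbers (illustrative)
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),   # email addresses
]

def input_guardrail(user_query: str) -> bool:
    # True means the query looks safe to forward to an external model API
    return not any(p.search(user_query) for p in PII_PATTERNS)

def output_guardrail(model_response: str) -> str:
    if not model_response.strip():
        return "Sorry, I couldn't generate a response, please try again."   # quality failure fallback
    for p in PII_PATTERNS:
        model_response = p.sub("[REDACTED]", model_response)                # mask leaked PII
    return model_response
```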
the key again is balancing protection with user experience overly restrictive guardrails create frustrating experiences while inadequate ones could leave you vulnerable stage three implement model routing and gateways as your application matures you may discover that one model doesn't fit all your needs different queries require different approaches and this is where model routing comes into play a model router typically includes an intent classifier that predicts what the user is trying to do and then directs the query to the appropriate model or pipeline these routers should be fast and inexpensive so you can use multiple of them without adding significant latency or cost along with routing you'll need a model
gateway this is an intermediate layer that provides a unified interface to different models both self-hosted and commercial along with access control and cost management fallback policies to handle rate limits or API failures load balancing and logging and analytics the gateway approach makes your codebase much more maintainable if a model API changes you only need to update the gateway not every application that uses it it's a classic example of separation of concerns in software engineering
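To make routing plus a gateway-style fallback concrete, here's a toy sketch; the keyword-based intent classifier, the model names, and the single TimeoutError fallback are all stand-ins for what would really be a trained classifier and your gateway's actual client and policies.

```python
def classify_intent(query: str) -> str:
    q = query.lower()
    if any(word in q for word in ("bug", "error", "stack trace", "code")):
        return "technical_support"
    if any(word in q for word in ("refund", "billing", "invoice")):
        return "billing"
    return "general"

ROUTES = {                                # hypothetical model names behind the gateway
    "technical_support": "strong-code-model",
    "billing": "small-cheap-model",
    "general": "general-chat-model",
}

def route_query(query: str, call_model) -> str:
    model_name = ROUTES[classify_intent(query)]
    try:
        return call_model(model_name, query)          # call_model is your gateway client
    except TimeoutError:
        return call_model("fallback-model", query)    # simplistic fallback policy
```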
next stage four optimize with caching as your user base grows performance and cost optimization become increasingly important this is where caching enters the picture inference caching includes techniques like KV caching to optimize the attention mechanism and prompt caching to avoid reprocessing identical prompt components caching is particularly valuable for multi-step processes like chain-of-thought reasoning or queries requiring time-consuming actions like retrieval or web searches for implementation your options range from in-memory storage which is fast but has limited capacity to databases like PostgreSQL and Redis you'll also need an eviction policy like least recently used or least frequently used to manage cache sizes as you scale
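Here's a minimal least-recently-used cache built on Python's OrderedDict to illustrate the eviction idea; keying on the exact prompt string is a simplification, since real prompt caches usually match shared prefixes like system prompts or documents rather than whole prompts.

```python
from collections import OrderedDict

class LRUPromptCache:
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, prompt: str):
        if prompt not in self.store:
            return None
        self.store.move_to_end(prompt)           # mark as most recently used
        return self.store[prompt]

    def put(self, prompt: str, response: str):
        self.store[prompt] = response
        self.store.move_to_end(prompt)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)       # evict the least recently used entry
```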
stage five add complex logic and write actions this is where the most sophisticated AI applications go beyond simple question answering to incorporate complex multi-step reasoning flows agentic patterns with loops and decision-making and write actions that make changes to the environment write actions like sending emails placing orders or initiating transfers dramatically increase your system's capabilities but also introduce significant risks so they should be implemented with sufficient caution and appropriate safeguards as your architecture grows in complexity keeping track of everything becomes increasingly challenging this is where monitoring and observability become critical while related they serve slightly different purposes monitoring tracks external outputs to detect when something goes wrong but doesn't necessarily
help identify the cause it's like knowing your car broke down but not why observability on the other hand ensures that sufficient information about your system's internal state is collected so that when something goes wrong you can diagnose the issue without deploying new code it's like having sensors throughout your car that can pinpoint exactly what failed there are three key metrics that can help you evaluate your observability MTTD or mean time to detection how long it takes to detect an issue MTTR or mean time to response and CFR change failure rate which is the percentage of deployments that
result in failures each component in your pipeline should have its own metrics and you should understand how these metrics correlate to your business's North Star metrics remember the golden rule of observability just log everything when metrics indicate a problem detailed logs help you identify exactly what went wrong as your application evolves to include multiple models data sources and tools managing these interactions can become increasingly complex this is where an orchestrator becomes valuable helping you specify how these components work together AI orchestrator tools like LangChain LlamaIndex Flowise Langflow and Haystack help manage
these complex pipelines however it's often wise to start building your application without an orchestrator first to understand the core mechanics before adding another layer of abstraction now let's talk about what might be the most valuable asset in AI engineering user feedback this feedback provides proprietary data that can give you a genuine competitive advantage while everyone can access the same Foundation models only you have access to how your specific users interact with your system user feedback comes in two main forms explicit feedback is directly provided by users this is things like Thumbs Up and Down ratings
star ratings or written comments implicit feedback is inferred from user behavior this could be things like early termination error corrections or question clarifications complaint messages sentiment frequency of regenerating responses and conversation length when designing your feedback systems consider carefully when to request input you could ask for feedback at the beginning of the experience like asking for skill level in a language learning app or when something unexpected happens like slow response time or at natural decision points like offering between two alternative responses the goal is to gather valuable insights without disrupting the user experience remember
that every request for feedback creates friction so use these opportunities wisely while we've covered each component separately a mature AI application integrates all these elements into a cohesive system the architecture you choose should align with your specific use case technical constraints and business objectives one important thing to remember is that complexity should serve a purpose only add components that solve real problems for your application sometimes a simpler architecture with fewer moving Parts is more reliable and easier to maintain than a complex one with every Bell and whistle the field of AI engineering is still rapidly
evolving with new techniques and best practices emerging daily the most successful AI engineers maintain flexibility in their architecture allowing them to incorporate new advances while providing stable reliable experiences to their users and that wraps up our journey through AI engineering we've covered an incredible amount of ground from understanding foundation models and evaluation to mastering prompt engineering RAG agents fine-tuning data set engineering and optimization techniques of course this was a super high-level overview of a very detailed book so I really recommend using this as a starting point to check out the book on your own
I had a great time putting this together and I plan to do more videos covering technical content like this in the future so let me know in the comments which book you want me to summarize next and don't forget to subscribe so you don't miss it when the next one comes out thanks so much for watching and I'll see you next time