AI Agents, Meet Test Driven Development

AI Engineer
Deploying agentic workflows in production is tough—bugs, hallucinations, and unexpected behavior can...
Video Transcript:
Hi everybody, my name is Anita, and I'm currently leading GenAI growth and education here at Vellum. Over the last few years we've worked with hundreds of companies that have successfully deployed reliable AI solutions in production, from simple to more advanced agentic workflows. One thing became very clear: the companies that adopted a test-driven development approach were able to build more reliable and stronger systems for production. Today I'm excited to share how you can apply that same approach to build your own effective agentic workflow that actually works. But before we jump in, let's take a step back and truly understand how we got here in the first place.

I'm so excited to get started, let's do it. So let's go back to 2023. Everyone was building AI wrappers, and most people argued that there was no defensibility strategy around them. Fast forward to today: we have Cursor AI, the most popular and widely used AI-powered IDE, which just hit $100 million ARR in 12 months, the fastest-growing SaaS in the history of SaaS. So why and how did this happen? Because models got better at coding? Sure.
Because AI adoption skyrocketed? That's absolutely correct. Because coding was an obvious first target to be disrupted by these AI models? There's no doubt about that. But more importantly, we built new techniques and patterns for how to orchestrate these models so they work better, sync better with our data, and run effectively in production. We rely on these techniques because there are clear limits to model performance: hallucinations are still a thing, overfitting is still a problem, and developers needed more structured outputs. And while model providers started to ship better tooling to solve for all of this, we didn't see another leap like the one between GPT-3.5 and GPT-4. Those big jumps started to slow down. For years, making models bigger and feeding them more data kept making them smarter, but then we hit a wall: no matter how much more data we added, the improvements slowed down and models started to reach their limits on existing tests.

But is this true? Did we really hit that wall? It seems there were other avenues and new training methods we still hadn't explored, so let's see what happened next. I don't really think there's an issue here, because since then, in just the last two to three months, we've seen new training methods that push the field forward. For example, we got the DeepSeek R1 model, the first model trained without using any labeled data; this method is called pure reinforcement learning. Reportedly this is what OpenAI used to train their reasoning models like o1 and o3. All these reasoning models use chain-of-thought thinking at inference time, at response time, to generate their answers, which allows them to think before they answer our questions and to solve more complex reasoning problems. On top of this, all of these model providers are giving their models more capabilities, like tool use, more capabilities for research, and near-perfect OCR accuracy in the case of Gemini 2.0 Flash, really pushing the field forward.

However, traditional benchmarks are so saturated that people are starting to introduce new ones that actually capture the performance of these new reasoning models. For example, the benchmark on this slide, Humanity's Last Exam, measures performance on truly difficult tasks, and you can clearly see from the table that even the latest, very smart models struggle with these challenges. So yes, models are getting better and the field is moving forward, but for an AI product that actually works in production, success isn't just about the models anymore. It's about how you build around them, and that's exactly
what's been evolving in parallel to model training. We were learning how to prompt these models better, and we developed more advanced techniques like chain of thought. Then we realized we should ground the models' responses in our own data, so RAG became an important part of our workflows. Then we learned that for multi-threaded conversations, memory is one of the most important pieces, and long context from the latest models enabled new use cases. Then we started to think about the hierarchy of our responses, so we began experimenting with graph RAG. And just lately we've been thinking about using these reasoning models, which take a lot more time to think at inference time but open up new areas and use cases, and about agentic RAG, making our workflows even more powerful so they can work on their own. The field is still evolving, but even using these techniques isn't enough. You need to understand your problem deeply and take a test-driven development approach to find the right mix of techniques, models, and logic that will actually work for your use case.

That brings me to the first main topic of this presentation: test-driven development for building reliable AI products. The best AI teams I've seen follow this structured approach: they experiment, then they evaluate, then they scale, and when they finally deploy in production they never stop working on their workflow. They capture all of those responses so they can continuously monitor, observe, and improve their product for their customers. Let's look at what you can do at every stage of this process.
Before you build anything production-grade, you need to experiment a lot. You need to prove whether these AI models can actually solve your use case. Try different prompting techniques, for example few-shot or chain of thought; some will work great for simple tasks, others will help with more complex reasoning. Test various techniques: prompt chaining is usually well received, because things often work better if you split your instructions across multiple prompts, or you can adopt a more agentic workflow like ReAct, which has a stage to plan, then reason and refine, before it actually gives you an answer. A really important part of this stage is involving your domain experts, because engineers shouldn't be the ones tweaking prompts; bringing in those experts will save a lot of engineering time, and once you get this phase right you'll have proof that the approach works, which is when engineering time should come in. You should also stay model-agnostic: incorporate and test different models, and think about which models can do the job best for your use case. For example, Gemini 2.0 Flash is actually really good at OCR, and that's something we've seen work really well lately.
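To make prompt chaining concrete, here is a minimal sketch in Python. The `call_llm` helper, the prompts, and the support-ticket scenario are all hypothetical placeholders for whichever provider and use case you're experimenting with, not part of the talk:

```python
# Minimal prompt-chaining sketch: split one big instruction into two smaller,
# sequential prompts so each step is easier for the model to get right.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your model provider's SDK."""
    raise NotImplementedError("wire this to your provider of choice")

def summarize_ticket(ticket_text: str) -> str:
    # Step 1: extract the key facts first, so the next prompt works from clean input.
    facts = call_llm(
        "Extract the customer's problem, product area, and urgency "
        f"from this support ticket:\n\n{ticket_text}"
    )
    # Step 2: write the final summary from the extracted facts, not the raw ticket.
    return call_llm(
        "Write a two-sentence summary for an on-call engineer "
        f"based on these extracted facts:\n\n{facts}"
    )
```

The same idea extends to ReAct-style workflows, where an extra planning prompt decides what to do before the answer is generated.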
So let's say that at this stage you know these AI models can actually work for you; you have a few examples where they show really good performance. But how can you test whether this will hold up in production, when you may have hundreds, thousands, or even millions of requests per minute? This is where evaluation comes in. At this stage you create a dataset of hundreds of examples to test your models and workflows against. You'll need to balance quality, cost, latency, and privacy, and you're definitely going to make trade-offs, because no AI system gets all of these perfectly right. For example, if you need high quality, maybe you can sacrifice some speed; if cost is critical, you might need a lighter, cheaper model. This is the stage where you define your priorities, and it's always better to define them early in the process.

Use ground-truth data wherever possible. Having your subject-matter experts design these datasets, and testing your models and workflows against them, is very useful. Synthetic benchmarks help, but they won't really evaluate these models for your own use case, so it's usually far more powerful to use your ground-truth data. Don't worry if you don't have ground-truth data, though: you can use an LLM to evaluate another model's responses, which is a standard and reliable way to evaluate your models. Very importantly, make sure you're using a flexible testing framework, whether you build it in-house or use an external service. Your AI isn't static, so your evaluation workflow shouldn't be either; it should be able to capture all of these non-deterministic responses. You need to be able to define custom metrics and write those metrics in Python or TypeScript, so you shouldn't be stuck with a very rigid framework: customizability is a big deal here. Finally, run evaluations at every stage. Have guardrails that check internal nodes and verify that the models are producing correct responses at every step in your workflow, test while you're prototyping, and come back to this evaluation phase once you have some real data.
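As a rough illustration, an LLM-as-judge evaluation over a small dataset could look like the sketch below. `run_workflow`, `call_llm`, the judging prompt, and the pass threshold are all hypothetical stand-ins for your own pipeline and criteria:

```python
# Sketch of an eval loop: run each example through the workflow, then ask a
# second model (the "judge") to grade the output against the expected answer.
from dataclasses import dataclass

@dataclass
class Example:
    input: str
    expected: str   # ground-truth answer written by a subject-matter expert

def run_workflow(user_input: str) -> str:
    raise NotImplementedError("your prompt/RAG/agent pipeline goes here")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

def judge(output: str, expected: str) -> int:
    reply = call_llm(
        "Score the candidate answer from 1 to 5 for factual agreement with the "
        f"reference answer. Reply with a single digit.\n\nReference: {expected}\n"
        f"Candidate: {output}"
    )
    return int(reply.strip()[0])

def evaluate(dataset: list[Example], pass_score: int = 4) -> float:
    passed = sum(judge(run_workflow(ex.input), ex.expected) >= pass_score for ex in dataset)
    return passed / len(dataset)   # fraction of examples meeting the bar
```

Custom metrics (format checks, latency budgets, cost per request) can be added alongside the judge score in the same loop.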
But how are you going to get real data? Let's say you've evaluated your workflows extensively with your subject-matter experts and with the data they've created, and you're now satisfied with the product, so you're ready to deploy it to production. Once that happens, is your job done? With AI development, you need to monitor more than deterministic outputs. You need to log all LLM calls and track their inputs, outputs, and latency, because AI models are really unpredictable; you need to be able to debug issues and understand how your AI behaves at every step of the way. This becomes even more important with agentic workflows, because they're more complex, can take different paths through your workflow, and make decisions on their own. You should also handle API reliability: maintain stability in your API calls with retries and fallback logic to prevent outages. For example, two months ago OpenAI had about four hours of downtime; if you have fallback logic in your productionized solution, your system knows to switch to another model and keep running.
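A minimal sketch of that kind of reliability layer is shown below. The model names, the `call_model` helper, and the retry counts are illustrative placeholders, not a specific vendor's API:

```python
# Retry a primary model, fall back to a secondary one, and log latency for
# every call so behavior can be debugged later. All names here are illustrative.
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("call your provider's SDK here")

def reliable_call(prompt: str, primary: str = "model-a",
                  fallback: str = "model-b", retries: int = 2) -> str:
    for model in (primary, fallback):
        for attempt in range(retries):
            start = time.time()
            try:
                reply = call_model(model, prompt)
                log.info("model=%s attempt=%d latency=%.2fs", model, attempt, time.time() - start)
                return reply
            except Exception as err:   # e.g. timeout or provider outage
                log.warning("model=%s attempt=%d failed: %s", model, attempt, err)
    raise RuntimeError("all models and retries exhausted")
```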
You should definitely have version control and staging environments, and always deploy to a controlled environment before rolling out to the wider public, because with AI you need to be careful that updating one prompt doesn't introduce a regression in another prompt or another part of your workflow. Make sure new updates won't break whatever you already have in production. And most importantly, decouple your AI deployments from your scheduled app deployments, because chances are you'll need to update your AI features more frequently than the app as a whole.

So let's say you've now deployed and you're starting to capture responses from your users, creating a feedback loop that surfaces edge cases in production so you can continuously improve your workflow. You capture those cases, run evaluations again, and test whether the new prompts you develop actually solve them. You should also think about building a caching layer: if your system handles repeat queries, caching can drastically reduce costs and improve latency. Instead of calling an expensive LLM for the same request multiple times, you can store and serve frequent responses instantly, and this is pretty much standard these days when building with AI.
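For exact repeat queries, even a simple in-memory cache keyed on the prompt illustrates the idea. This is only a sketch: a production version would more likely use a shared store such as Redis and possibly semantic matching for near-duplicate queries, and `call_llm` is again a placeholder:

```python
# Serve repeated requests from a cache instead of paying for the same LLM call twice.
from functools import lru_cache

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

@lru_cache(maxsize=1024)          # exact-match cache; semantic caching is a common upgrade
def cached_llm(prompt: str) -> str:
    return call_llm(prompt)
```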
And finally, let's say your product has been running reliably in production for long enough that you feel comfortable going back to that data and using it to fine-tune a custom model. That model can produce better responses for your specific use case, reduce your reliance on API calls, and often run at lower cost. This whole process is becoming more important than ever with agentic workflows, because those workflows use a wide range of tools, call different APIs, and have multi-agent structures that execute a lot of things in parallel. So with agentic workflows and this test-driven approach, it's not just about measuring performance at every step of your workflow; you also need to assess the behavior of the agents themselves, to make sure they're making the right decisions and following the intended logic. This year more than ever, everyone is talking about agentic workflows, but what does that actually mean?
I'd love to talk more about how you can build these agentic workflows, but I'm not here to give you the perfect definition of what an AI agent is. Instead, I'm going to define different agentic behaviors and the different levels at which they can be built. If you think about it, every AI workflow has some level of agentic behavior in it; it's just a question of how much control, reasoning, and autonomy it has. We've looked at the past, the present, and where we're headed, and from that we put together a framework that defines four or five levels of agentic behavior. I'll go into more detail on each level, but keep in mind this is not a final framework and it's not set in stone: as models evolve, it can expand, the lines can blur, and a lot of things can shift. For now, it gives us a way to define where we are today and what we expect to see next.

At the first level, L0, you have an LLM call, you retrieve some data from your vector database, you might have some inline evals, and finally you get a response back from the workflow. Notice that in this workflow there's no reasoning, planning, or decision-making beyond what's baked into the prompt and the model's behavior. The model is doing all the reasoning within the prompt itself; there's no external agent organizing decisions or planning actions, although there is some reasoning and some agentic behavior at the model level.
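In code, an L0 workflow is little more than a straight line. This sketch assumes hypothetical `search_vector_db`, `call_llm`, and `passes_inline_check` helpers standing in for your retrieval layer, model client, and guardrail:

```python
# L0: retrieve context, call the model, run an inline check, return the answer.
# No planning or tool selection; any "reasoning" lives inside the prompt itself.

def search_vector_db(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("your retrieval layer goes here")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

def passes_inline_check(answer: str) -> bool:
    return len(answer) > 0            # placeholder guardrail, e.g. a format or safety check

def l0_workflow(question: str) -> str:
    context = "\n".join(search_vector_db(question))
    answer = call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    return answer if passes_inline_check(answer) else "Sorry, I couldn't produce a reliable answer."
```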
If we move from L0 to L1, the workflow can now use tools. The system is no longer just calling APIs; it now knows when to call them and when to take those actions. This is where we start to see more agentic behavior, because the model can decide whether to call a specific tool, or whether to query the vector database to retrieve more data, before it actually generates an output. Memory starts to play a key role here, because we're going to have multi-threaded conversations, and a lot of this can potentially happen in parallel, so we need to capture context throughout the whole workflow. Evaluation is also needed at every step of the way, because we need to ensure these models are making the right decisions, using the right tools, and returning accurate responses. These workflows can be as simple as the one on the slide, or much more complicated, with different branching happening at every stage, where you might have ten different tools and the agent needs to reason about whether to call the first five or the last two. Again, this is where we see a lot more agentic behavior.
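A stripped-down sketch of that decision step might look like this. The routing prompt, the tool registry, and the helper names are all hypothetical, not a specific framework's API:

```python
# L1: the model picks which tool (if any) to call before answering, and the
# exchange is appended to memory so later turns keep the full context.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

TOOLS = {
    "search_docs": lambda q: "...retrieved passages...",   # e.g. vector DB lookup
    "get_weather": lambda q: "...API response...",          # e.g. external API call
}

def l1_step(user_message: str, memory: list[str]) -> str:
    decision = call_llm(
        "You can call one of these tools: " + ", ".join(TOOLS) +
        ". Reply with the tool name, or 'none'.\n\n"
        "Conversation so far:\n" + "\n".join(memory) + f"\nUser: {user_message}"
    ).strip()
    tool_output = TOOLS[decision](user_message) if decision in TOOLS else ""
    answer = call_llm(f"User: {user_message}\nTool output: {tool_output}\nAnswer the user.")
    memory.append(f"User: {user_message}")
    memory.append(f"Assistant: {answer}")
    return answer
```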
L2 is where these workflows move beyond tool use, which, to be fair, is not always simple; the previous level's workflows can be very complex. But now we see structured reasoning. An L2 workflow will notice triggers, plan actions, and execute tasks in a structured sequence. That means it can break a task down into multiple steps, retrieve some information, decide to call another tool, evaluate how useful the result was and refine it if it thinks it needs refining, and, once it has run through this loop, generate the final output. You'll notice that agentic behavior here starts to look more intentional, because the system isn't just calling the tools that are listed for its use; it's actively deciding what needs to be done and spending more time thinking about it, instead of just deciding whether a tool should be called. One important point at this stage is that the process is still finite: once the workflow completes the steps it planned, it terminates rather than running continuously. Still, it's a real leap forward from just calling tools.
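Here's a hedged sketch of that plan-act-evaluate pattern. The prompts, the refinement cap, the naive "looks good" check, and the `execute_step` helper are illustrative assumptions; a real L2 system's planning and evaluation logic would be much richer:

```python
# L2: the system plans steps, executes them, checks its own work, refines,
# and then terminates once the plan is complete (the process is still finite).

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

def execute_step(step: str) -> str:
    raise NotImplementedError("tool call, retrieval, or sub-prompt for this step")

def l2_workflow(task: str, max_refinements: int = 3) -> str:
    plan = call_llm(f"Break this task into numbered steps:\n{task}").splitlines()
    results = [execute_step(step) for step in plan if step.strip()]
    draft = call_llm(
        f"Task: {task}\nStep results:\n" + "\n".join(results) + "\nWrite the final answer."
    )
    for _ in range(max_refinements):                  # finite loop, not an open-ended agent
        critique = call_llm(f"Critique this answer for the task '{task}':\n{draft}")
        if "looks good" in critique.lower():          # naive stop condition for the sketch
            break
        draft = call_llm(f"Improve the answer using this critique:\n{critique}\n\nAnswer:\n{draft}")
    return draft
```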
L3, however, is where we see more autonomy, with decisions that aren't defined by us as the creators of the workflow. An L3 system can proactively take actions without waiting for direct input. Instead of responding to a single request and then terminating, it stays alive, continuously monitors its environment, and reacts as needed. For example, it can watch your email, Slack, Google Drive, or any other external service you give it access to, and plan its next moves, whether that's executing actions in real time or asking the human for more input. This is where our AI workflows become less of a tool and more of an independent system that can truly make our work easier; this one, for example, could act like a marketer that prepares a video or a presentation you can just take and use whenever you want. The final stage is a fully creative workflow: at L4 the AI moves beyond automation and reasoning and becomes an inventor. Instead of just executing predefined tasks or reasoning within some bounds, it can create its own new workflows and its own utilities, whether that's agents, prompts, function calls, or tools it needs designed, and it will solve problems in novel ways. True L4 is definitely out of reach right now because of constraints with today's models, like overfitting (models really love their training data) and inductive bias, where models make assumptions based on that same training data. That makes this a very hard task today, but it's the goal: AI that doesn't just follow instructions but invents, improves, and solves problems in ways we didn't even think of before.
So I would say that L1 is where we're seeing a lot of production-grade solutions today. At Vellum we've worked with companies like Redfin, Drata, and Headspace, all of which have deployed production-grade AI solutions that fall within the L1 segment. Again, just using tools can mean a very simple workflow or a very complex one; the focus, though, is on orchestration. How do we get our models to interact with our systems better? How do we make our models work with our data better? How do we make sure that whatever we retrieve from our vector databases is the right, correct context for the question the user is asking? We've experimented with different modalities and with all of the techniques mentioned earlier, and test-driven development truly makes its case here, because you need to try out different tools and models and be able to continuously improve on them, to build not only a more efficient system but a system that keeps working better and better. L2, however, is where I think we're going to see the most innovation this year. This is where a lot of AI agents are being developed to plan and reason using models like o1, o3, or DeepSeek R1. We might see a bunch of different use cases and a lot of innovation in the UI and UX of these systems, where we'll definitely create some new experiences for users. Essentially, this will be a way for us to build true reasoners that can handle complex tasks.
So you're going to have a bunch of these agents just working for you, doing different things. L3 and L4, however, are still limited by today's models as well as by the surrounding logic, but that doesn't mean there isn't a lot of innovation happening within those two levels as well. If you want to learn more about how to build your own AI agent, I've included everything I've shared in this presentation and more, for example the architectures you can build, the stages you should test, and similar material. We also feature top researchers and practitioners who have shared their learnings on how to build these systems for production, so feel free to scan the QR code on the screen to download this resource.

Now I think it's time to get more practical. I want to show you how I built my own SEO agent. This agent automates my whole SEO process, from keyword research to content analysis and finally content creation. It decides whether to use tools, and it has an embedded evaluator that acts as an editor and tells the agent whether it's doing a good job.
Let's look at a quick sketch of how this agent works. In a minute I'll show you a real demo, but first I want to give you a high-level overview of the steps this workflow takes. Looking at the sketch on the screen, you'll notice this workflow sits somewhere between an L1 and an L2 agentic workflow. The SEO analyst and the researcher take a keyword, call Google Search, and analyze the top-performing articles for that keyword. They identify the good parts of those articles that we should amplify in our own article, but they also identify missing segments and areas of improvement we should definitely write about, to make sure our article actually performs better than the ones we're competing against. Once the research and planning is done, the writer has everything it needs to produce the first draft. That first draft is then passed to the editor, an LLM-based judge that evaluates whether the draft is good enough based on predefined rules we've set in its prompt. The feedback is passed back to the writer, and this loops continuously until certain criteria are met. Within this loop we also have a memory component that captures all the previous exchanges between the writer and the editor. Finally, we get an article that's actually a very useful piece of content, not generated slop, because it uses all of this context in a smart way and gives me a pretty impressive first draft to work with.
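A rough sketch of that writer/editor loop, with the memory component and the break conditions described here (the editor declares the draft excellent, or the loop hits its iteration cap), could look like the following. All helper names, prompts, and the loop cap are hypothetical, not the actual Vellum workflow:

```python
# Writer/editor loop: the writer drafts, an LLM-based editor critiques against
# predefined rules, and the exchange is kept in a shared memory until the
# editor is satisfied or the loop reaches its cap.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("your model client goes here")

def write_article(brief: str, memory: list[str]) -> str:
    return call_llm("Research brief:\n" + brief + "\n\nEditor feedback so far:\n" +
                    "\n".join(memory) + "\n\nWrite (or revise) the article draft.")

def edit_article(draft: str) -> str:
    return call_llm("You are an editor. Evaluate this draft against our style and "
                    "SEO rules. Say 'EXCELLENT' if it needs no changes, otherwise "
                    "give concrete feedback:\n\n" + draft)

def seo_agent(brief: str, max_loops: int = 2) -> str:
    memory: list[str] = []                      # chat history between writer and editor
    draft = write_article(brief, memory)
    for _ in range(max_loops):
        feedback = edit_article(draft)
        if "EXCELLENT" in feedback.upper():     # break condition like the one in the demo
            break
        memory.append(feedback)
        draft = write_article(brief, memory)
    return draft
```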
So now let's see the demo. For the sake of time, I'll start running the workflow while I explain what this agent does at every step. We ran this workflow with the keyword "chain of thought prompting". The SEO analyst takes that keyword, along with some other parameters like my writing style and the audience we're trying to cater to, and analyzes the top articles Google is ranking for that specific keyword. It tries to identify good components from those articles that we need to reinforce in our article, but it also identifies missed opportunities, which the researcher then uses to run another search and capture more data so our article can be better than the ones we just analyzed. Once the SEO analyst is done with its job, the researcher gathers more information about the things previously identified as missing pieces of the puzzle, and then the writer takes all of that information as input and tries to create a great first draft using that data as context. The content generated by the writer isn't going to be a slop-type article that's not really useful; it actually uses all the context we're sending from the different articles we just analyzed. You could also connect your RAG here, so it looks into your own database of articles and learnings, and it can really create something that's extremely useful.

Now the editor says, okay, this is a good enough article, but here's some feedback. It passes that feedback through the memory component, which is the chat history between these two, and then through a node that structures that input. For the sake of this demo, the condition for the loop is that it breaks if the evaluator tells us this is an excellent post, which rarely happens, and we also said that the loop breaks once it has run at least one time. It already ran once, and we still got more feedback from the editor, but for this demo let's look at the output we got: "Mastering Chain-of-Thought Prompting in AI: A Comprehensive Guide for Developers". I think it's pretty okay, pretty nice. I might change the title, but I know the components of this article are the actual components other articles are writing about, so this was great. The latency was around 118 seconds, and this usually takes around 300 seconds to run
when we have more evaluation loops. But it's pretty great: it gives me a foundation I can keep building this content on, and it saves me a lot of time. The product I just used is called Vellum Workflows, and it was designed to bridge the gap between product and engineering teams so they can speed up AI development while still following the test-driven approach we've talked so much about in this presentation. One thing became clear, though: developers want more code, more control, and more flexibility, and they want to own their definitions in their codebase. So today I'm excited to introduce our Workflows SDK. It provides all the building blocks you need, it's infinitely customizable, and it has a self-documenting syntax, so you can see how the agent works right in your code. It's also expressive enough that you can understand what's happening at every stage of your code. The best part is that the UI and the code stay in sync, so whether you're defining, debugging, or improving your workflows, everyone on your team stays aligned. I hope you like it; it's open source and free, and you can check it out on GitHub. Feel free to scan this QR code to check out the repo.

And that's a wrap. Thank you so much for listening, and I hope you learned something new today. If you want to talk more about AI, feel free to scan the QR code on the screen to connect on LinkedIn, or if you have any questions, send me a message over email or on Twitter and I'll follow up for sure.