Deep dive into Voice AI with Vapi (Full Tutorial)

25.3k views5846 WordsCopy TextShare

Jannis Moore | AI Automation

In this video I’ll explain to you the core functionality of Vapi, how it works, and how you can leve...

Video Transcript:

so what I got for you today is this empty board and wppi my probably most favorite AI voice agent infrastructure provider that is currently out there and trust me I have tried a lot of them including Bland sylow vode and a ton of others and this one is literally in my opinion the best that is currently out there and probably also one of the more complex ones to understand if you are new to voice AI agents and in this video I am attempting to show you the actual power of buy how you can use it inside of your business for your clients and also Al how I am leveraging it inside of my agency and inside of my software to serve our clients better with voice AI agents that simply work that are super smooth and how you can extend them with additional features if you have not heard of V yet or if you have not heard of AI voice based assistance in general they are basically just kind of things like chat gbt does but instead of texting to them you can actually call them on the phone and they will answer you with a voice that you give it with exact same features like you would use it on Open AI so with that out of the way I would like to give you a brief introduction of what vapi is what they are aiming to do in my opinion and how we are using them inside of our company vapy like a lot of others are a voice AI for developers like it says here as well on their page they allow you to create those kind of voice-based AI assistants on an infrastructure like the bare minimum you can imagine for making voice calls so it's not something like bl. AI where you have pretty fine features like tags or predefined actions that you can use nope in that case they give you literally access to the bare minimum to build your very own prompts and build anything else as a separate set of features on top of it which is exactly why I like this platform so much cuz it doesn't come to you with like an over bloed system where there's a lot of predefined things that you don't even need for your business which again helps a lot for latency reduction and just in general the quality of the prompting and output that you actually get and there are a lot of other features like contextual Pathways which in my opinion are still a little bit controvers cuz you want the AI usually to be as flexible as possible so limiting your AI with conversation Pathways might be nice but it's again pretty static cuz it just follows the pathway down even though you probably want it to be more flexible by just literally giving it the right prompt so that it can get the most out of the conversations even if they take a different turn so that is why VPP is amazing like I mentioned you can create assistance with it which is nothing else an AI that basically runs on a phone call but the difference to VY is that VY allows you to do this in two different ways sppy allows you first of all to create static assistants so static assistants are assist that you create once in their back end and they will just be there they are static you can adjust them in the way you want but whenever you make a phone call to that assistant the assistant configuration will always be the same so this is number one which I think is okayish if you have a repetitive task where you don't need a lot of customization but the second part is where VY actually shines in where I think it out competes all of the other platforms out there which is transient based assistance so if you have not heard of that don't be afraid it is something very new that I also had to First figure out and I literally have been working with vapo for 2 months straight literally every day and I think I have a very good understanding of it now and hopefully be able to actually represent it to you in a simplified way so imagine an assistant that you basically create on demand in the moment a phone call comes in or in the moment you want to send a phone call to give you a better picture of that I'd suggest you to create an account in Bap and log into it so I already did did that and when entering bapi you can see an assistant Tab and you can create an assistant which I have already did here with a template for the appointment setter here which then brings in a dental template so this is an example of a static assistant because you basically create it on their platform you can Define your prompting you can choose transcribers basically the voice to text text to voice in both of those options you can set some extra functions so basically extra features and tools that assistant should have access to while communicating with a user and you have some more advanced settings that I'm not going to go into right now but what that basically means is that this assistant is a static assistant it always works exactly with the information that you put into here so which is usually okay if you like I mentioned have a very simple setup you don't need a lot of customizations then this makes definitely a lot of sense but in case you for example have a user base behind it and you would like to customize the prompt depending on the user you're interacting with or the actual caller that is currently calling then these transient based assistants are a lot more interesting so those transient based assistants are not configured in here because basically they don't don't exist until the point the call actually comes in so to actually demonstrate that to you I'm going to try my best drawing skills ever for actually explaining it to you visually so let's just get a better understanding of the static assistance so I'm literally just going to create a little screen here we say static assistance and let's now say we are going to have a phone call now we have a phone here and what happens when a phone call comes in for example VY then basically sends an information to the VY server which then basically takes care of all of the information within VY right so B basically then validates all of the information it runs the the telephoning Etc and it exactly uses the assistant that is also saved on their server to return the information back to the caller so all that happens in that case is literally that b just communicates back and forth between the phone caller is the server and the phone caller so there is no other interaction because you basically created the assistant on their platform so now let's go over and actually get into some transient based assist assistance so for the transient based assistance it's nearly the same so I will definitely try to copy my amazing phone here I'll put it right over here and when a phone call comes in now what happens is that we basically again send a phone call to a server for VY so for VY so V basically again accepts that phone call understands that there's something to do so they basically set up the phone call logic whatsoever but now instead of actually returning something to you to the user what they're going to do is they request from a different server that can basically be one of yours so your basically very own server back end like wherever you host that or even a no code solution like make. com and it basically asks This Server what is the assistant you want to use you can now imagine that it basically sends a request to that your very own server where you can then create an assistant by yourself now instead of creating and creating the stuff statically on B all you would do is you would have your own server where you basically create the assistant configuration and the way that works is by using Json constructs so if we going to head over to vapi and we actually go to their documentation you can see inside of their documentation under API reference and create assistant you can see the possibility of creating an assistant which would do it statically in the back end so we don't want to do that but you can see that it's done by a massive Json construct right here which is literally a just a structured definition of that whole assistant that you would have basically created within the back end but now in a structured way and a simple string so this information is basically everything that you just created inside of the back end of buy but that you have it inside of a string so all we would do now is we would basically have our own server send back to byy this string of the the vapi assistant that we created and then this string is basically used by vapi to set up the phone call and actually configure that phone call based on your settings and then VY starts communicating back to the caller so now if we are zooming out you can see the difference that for a static assistant you are actually basically setting up the assistant on vapi and vapi directly uses that assistant and sends it back to the phone caller Whenever there comes a message in the difference is now with the transient based ones that they don't exist until you actually get a phone call and then you are basically responsible for creating that assistant and sending that assistant back to VY and vapi then sets up the phone call does everything necessary and then calls back or sends the information back to the caller to interact with it this is like a a brief picture and now you might ask yourself how does that actually work inside of Vy and they have definitely an extensive documentation so I definitely suggest to check it out but to get you started where you probably want to take a look at is in the documentations under the server URL because that describes a lot of these features once you create a phone number which you can do within the VAP dashboard by going to phone numbers you will be able to buy a phone number or import a phone number which you can do again from tlio or vage I usually use twio then you basically have the phone number inside of vapy so VAP knows about it and can use that phone number to interact with it Etc if we are now heading to the account so once you're at the account page you can see that you can set a server URL and a server URL Secret which is then basically the URL that you would set from within your very own server so it's like an endpoint from that server that can basically accept information from bapi and then return something in whatever whatever they based on whatever their request is so if we are now looking into their into the documentation you can see under the server URL that they also have something called retrieving assistant which is exactly their main thing where I think their whole power lies in and where their whole Advantage lays in in the AI voice industry at the moment whenever a phone call comes in and that is basically for inbound phone calls they are requesting information from the server URL that you set up inside of your account and it is definitely important to make sure that you don't have any assistant sorry that you don't have any assistant connected inside of the assistant right here that is basically connected to your phone number so the phone number should not have any assistant it should just be empty because that allows you again to allows VY to actually fetch a assistant dynamically based on the URL from within your account then once vapi basically receives a phone call it now sends a request to our very own server that looks something like this so we have a Json construct that contains an assistant request so we basically understand now aha B wants to wants us to return an assistant and it also sends along the call details so basically everything that they have available from within the call which means like the phone number the country of origin the number it's calling so you also have access to you also know which kind of phone number is actually currently calling in case you have more phone numbers connected and what you can do then is you can return a Json that looks something like this with an assistant inside back to VY so this assistant is what I basically mentioned earlier inside of their API documentation so I'm just going to open that up here it is basically just a structure that looks something like this which is literally the whole assistant config and you can literally just return something like this with the settings that you would like to have inside of the assistant and then VAP uses that and actually initiates the call which is incredibly amazing because what that allows us to do is to completely build an assistant based on the current phone call so let's assume you are building up an audience profile and you have the phone number you can connect that phone number to your CRM or whatever else is interacting with where you then have again maybe access to more information about that phone number stuff like what name it is connected to what email it is connected to Etc and you can validate that information and basically create a prompt a system prompt inside of the assistant with extra messaging maybe even extra language whatsoever and then send it back to vapi so that vapi can initiate a phone call with it so it is not the same approach like Bland does it because what you probably have seen within bland is that they allow you to use Dynamic TXS so you can use a tag like double curly brackets now ending double curly brackets which would add the current date now in that case your your server basically has the possibility and has the power to return whatever it is about the prompt that you would like to customize so you can build all of this custom variable logic by yourself and send it back to the phone caller and in my opinion this is still like mind-blowing cuz it brings so much flexibility to this whole setup of Vy that it allows us to build really really amazing chatbots obviously there are tons of more features to that also things like end of call reports which is something that you can configure inside of the assistant so let's for example say your your phone call um ends right and you get a transcript the vapy server can actually send the transcript to your very own server so that you can process that transcript on your very own server and like I mentioned earlier if it's not clear this very own server can even be a noode solution like make.

com or zap here because all it is is literally just a URL that you drop in and then you can start whatever kind of scenarios you can imagine on that platform so now I'm going to show you a little bit better how that actually works and to do that we are heading back into our VY dashboard into the phone numbers and once you have connected a phone number that is number one you don't select any assistant you simply have the phone number connected you head into your accounts you add the server URL and the server Secret of whatever application you're dealing with if it's make. com you add the make. com webhook URL for example and you then add also your make.

com like the the secret and the secret you can then again inside of make. com check within a scenario filter it if it's the right one and then just have that extra level of security that no one else basically uses your make. com url once you have done that your assistant your voice your phone number can basically already communicate with vapy and with make.

com CU it can request the assistant from there so all you would need to do is instead of make. com you would need to set up a scenario where you then return where you then return the information from the assistant back to VY and then do the call so now if you want to also receive the transcripts whatsoever what you would need to do is you would need to set up this server URL and the server Secret inside of the assistant so if we're heading back into the Json that I showed you earlier from the assistant we scroll down here you will see the possibility of adding a server URL and a server secret so whenever you create this assistant dynamically based on this string you can set a custom server URL and a custom URL secret that is called when ever there is an interaction on this assistant what I mean by this is if you head back to our server URL as seen here you can basically send up send a custom URL for that assistant and then this assistant can sends back the function calls for example end of call reports as seen here so the end of call reports then contains deflect the recording URL contains a summary of the conversation the transcript and all of the messages and by simply defining the server URL here papy would then whenever the call end send that information to the server URL that is connected through the dynamic assistant that you created earlier and you can again retrieve all of the information from the vapy from vapy itself while that sounds a bit complex it is hopefully a little bit more clear than it was before cuz for me to figure all of that out took definitely a while to get a better understanding of how that actually works and goes well together so I definitely suggest checking out the manual of how that works and how you can actually set that up for the sake of this video I'm going to demonstrate this to you with a little example at the end to make sure you actually get the most out of this so to actually set it up we already have connected a phone number I would basically then start start now with a make. com scenario so all I would do is I would click on create a new scenario I would select the web hooks integration as a starting point and I would select custom web hooks which then again allows us to create a custom web hook so we just call it VY test I save this this gives us a dynamic URL so in this URL we basically want to use to actually initiate our custom assistant so to do that is this URL is the one that you would basically then add into your dashboard within buy once on the account all you would need to do is you would paste this inside of the server URL you would set a custom server URL secret that you just remember yourself and then you head basically back into your integration you add another module and inside of this module what we are going to do is we simply for this purpose add a web hook response that looks something like this and within the body we are then defining the assistant and to do that what we are going to do is we actually create that assistant and there's a visual way for doing that in case you don't want to write this whole Json construct somewhere else uh I mean you can definitely use that inside of a tool like Json editor online you simply paste it in here and you can customize those values visually it's a lot easier if you are using a custom server or setup you can also dynamically adjust those values within the server obviously so you don't need to do this manually but that is when I mean you really want to build up a framework on top of Vy that you can actually integrate with your CRM etc for the purpose of this video I'm going to do this manually and what I'm actually going to do is I will use VY itself to do that CU if you are going into the body you will be able to set up values directly within here and they will be adjusted on the right so if we are setting for example a custom server URL you can see that it sets the server URL here as well so we would literally just go through all of that and I'm going to do that with you quickly to get an understanding of what all of those fields actually mean so for the transcriber we're going to use deepgram because it's literally the best transcriber there's another one available that I've never used but deepgram is great we're going to use Nova 2 which is their latest model I think and for the language we're obviously going to use en which means that whenever we are speaking something deep gr basically translates it to English we don't need to add any keywords for the models and the messages we are going to add a system message which is our Master prompt basically sorry I added this in the content it obviously needs to be in here and this is basically where we add our Master prompt so and guidelines I'm just going to add that here so you would later have something like you are Adam you are a property real estate assistant whatsoever this is basically what you add into the content function calls we don't need any tool calls we don't need any anyways function calls is deprecated by open AI for the provider of the voice of the llm we're going to use open AI for sure we are also going to use gp4 you can choose what model fits you best I would also suggest use TPT 3.

5 turbo cuz in the course it doesn't make that much of a difference as a now if you the more flexible you keep your prompt obviously the better it is to have a higher version cuz it helps with hallucinations all back models you can set as well we can just set gp4 as a forb model if you want to which means that in case GPT 3. 5 doesn't work properly it will switch to gbd4 semantic caching enabled we set to FS which basically means it can kind of cash responses based on stuff that happened before temperature we simply leave it at one we don't need to have any functions right now which basically means you can give your Bot access to extra functions the max tokens we simply set to 200 50 to keep the responses short cuz obviously we want we don't want to be on the phone forever with a with a voice AI agent for the voice that we are going to use you can use Azure you can obviously also use a different provider for now they set up Azure voice so you can also choose something like 11 laps if you would like I'm just going to go with Azure cuz I used that before you can select any of those values you can even select a completely custom one that is all like even some multilingual ones that are offered by Azure so they can respond in different languages you can set up a speed I simp set that to one you can set up also a forwarding phone number which means in case you allow your phone your your phone agent to transfer calls you can set a forwarding phone number and this agent can then basically forward that call to the number that you defined right within here just do that now with a with a the demo number you can enable recordings which basically means the whole conversation will be recorded we allow to have a end call so for the end call basically means that the assistant can end the call as well in your behalf or in the behalf of the user that is calling so if the user says something like goodbye the assistant can end the phone call as well the dial keypad function we keep on false for now that is basically if you try to use that assistant to make outbound calls and you call with an actual I don't know like a menu that basically requests the assistant to or you as a user to type something in the assistant can do that in your behalf as well Hipp enable I would say false so we can actually track and record all of the information then you can set up some client messages and server messages which I'm not going to do now silent time out seconds we just keep that empty for now which means it will pull the default response delay seconds basically your answer just means how long it takes for the assistant to respond in general so you can delay that which basically means it takes a little bit longer and it maybe sound a little bit more natural if the assistant is really fast to respond then it sounds a bit more natural to actually have a slight delay in between the max duration in seconds I would always set that which basically means if you set it to let's say 180 which means it it basically would maximum like the call would would be a maximum of 180 seconds long so 3 minutes you can set a background sound as well which basically means when the voice talks it will have some office back backround noise in the call which also again makes it sound a little bit more normal you can give the assistant a name so we just call it Ben for now you can add multiple phone forwarding phone numbers that you can Define with purposes so if you have a marketing or HR department you can forward information to those kind of departments you can also set a first message which is the welcoming message welcome and then you can say whatever the name is or like if you make a dynamic later we can say something like the name you know and then we can Define and replace it inside of make. com later for now I'll just keep it it welcome how can I help you then voicemail detection enabled we can set to true for now it's still sometimes a little bit buggy but works most of the time which basically means that when a phone number that you try to call has voicemail you can basically respond with the bot with a predefined message for that voicemail uh for that voicemail yeah exactly so that the bot basically can answer that voicemail call and speak something onto their device actually I said this for yeah I can keep this for true and say please call me back thank you then you can also have an end call message which I'm not going to do right now and you can basically Define the server URL what I basically mentioned earlier in case you would like your assistant later to communicate with a server URL and basically all of the other things uh to to send all of the other information tool like we discussed here the function calls and the end of call report you can basically create another make.

com scenario use again the same workbook setup you can paste in this webbook set up directly within here and I'm just going to do this for Simplicity purposes so I copy that use this same URL for this one as well just for now for testing it and we can basically just set a secret whatsoever so obviously this secret and the one that you set up inside of your app account they need to match so as you can see here we have now a properly formatted Json so all we're going to do is we copy that part until here which you can then again use in a tool like Json editor online to validate and see if it's actually a valid Jason if something would be broken let's say a comma is missing you can see that it says it doesn't work and you can even Auto Repair it so once you've done that copy the part and now we need to wrap it inside of the assistant uh inside of the assistant key based on what VAP defined here so if you're at the server URL link you scroll down you can see retrieving assistant you basically need to provide the assistant inside of an assistant key to do that you can either copy that and just replace this whole part or you can simply just write that inside of the Json editor yourself which is exactly what I'm going to do we have assistant and then I literally just paste in what I have here so and that is already everything so now I'm going to copy this whole part head back into our integration I head into the web response and within the body I just paste what I copied from before so basically the whole less assistant wrapped inside of the assistant key now I also need to Define that this is actually ajacent response so I activate the advanced settings I go to custom headers and in here I type content-type and I'll set this to application SL Json which defines that the response is a Json string okay now I can basically click okay and that's already it if I Now activate this Json and I would basically call the number within the phone numbers it would trigger the scenario the scenario would return the assistant and bppi has now a completely Dynamic assistant so now imagine that if you're using a no code tool like me you can literally just add anything in in between here like your CRM connection that can then fetch information from your CRM based on the phone number and adjust then the information right within here dynamically so as you can see here with the first message later you could you for example have values available and then you can add something like the first name here or you can even add the the current date Etc so let's say in the in the main prompt that you later on have in here you can say something like the current time is and then you would basically be able to use their time adjustments which are here to create a date and you know to to add a time stamp or whatsoever or at now which then again allows the assistant to have the current time available inside of the assistant PR that you are then going to send back this is the Amazing Power of Vy because it basically allows you to create those transient based assistance so that you can get something back that is completely Dynamic depending on the call that is currently coming in and this is just such an massively amazing feature as you can literally just build whatever you would like to have on top of it so you're not just Bound by the predefined features of platforms like make. com or synth flow oh you can literally just build whatever you would like on their infrastructure which is great so vapi basically takes care of the whole handling of the voice AI agents in the back end how to set them up how they communicate with t or vage how the latencies they basically optimize everything else around it so that you can literally focus on building a qu product on top of it or just integrate it into your own business or your own agency or for your clients or whoever else you would like to obviously there's a lot more to vapi which you can also read inside of their concept section of the documentation right here so you basically have also a client SDK which allows you to implement their voice calling into websites so that you can basically even interact with websites through custom actions that you can then directly communicate with on on your website so obviously you need to have some specific setup for that maybe you want to use something like like U expressjs or nextjs Etc where you can reload Pages directly using JavaScript so that you can actually direct users through an assistant Etc if you would like a nice example for that you can check out the platform of.