You've probably seen OpenAI's new voice mode for ChatGPT. It works something like this: "Is Tech With Tim the best tech YouTuber?" "Tech With Tim is definitely considered one of the best programming YouTubers by many. He provides clear, practical tutorials and covers..." You guys get the idea; we can cut out the rest of that. The point is, in this video I'm going to show you how to make your own version of OpenAI's voice mode in just a few minutes. So let's dive in.

I'm here on my computer, and I'll show you a quick demo of the AI voice assistant we'll be building in this short tutorial. It can do the same thing the OpenAI one can, because it actually uses OpenAI in the background, but it also has agent functionality: it can interact with Python functions, or the outside world, in whatever way we set it up. If I unmute my mic, you can see how it works.

"Hey, is Tech With Tim one of the best programming YouTubers?" "Tech With Tim is a popular programming YouTuber known for his clear and informative tutorials. Many people find his content helpful." "Can you change the temperature in the living room to be 23 and in the bathroom to be 27?" "The temperature in the living room is now set to 23 degrees Celsius, and in the bathroom it is set to 27 degrees Celsius." "And can you tell me what the temperature is in the bedroom?" "The temperature in the bedroom is currently 20 degrees Celsius."

So there you go: you can see this has additional functionality. I've set it up so it can keep track of the temperature in various zones and actually change those temperatures as well. There are obviously much more complicated apps we could build with this, but let me show you how to get started so you can extend it and build something really cool.

Let's start by understanding the technology we'll use to build this app. For this project we'll be using LiveKit. LiveKit is the sponsor of this video, but don't worry:
it's free to use, it's open source, and it's actually what powers OpenAI's voice mode. I'm showing you this because it's already used by some of the largest companies in the world: Oracle, eBay, Rumble, Character.AI, and of course OpenAI. The reason is that it provides ultra-low-latency streaming of not only voice but also video. LiveKit is really the realtime cloud: it lets you build and scale voice and video applications. Because it's open source, you can deploy your own LiveKit server and just use their protocol and their code without paying for anything, or you can use LiveKit Cloud, deploy with them, and let them handle the infrastructure. That's what we'll do in this video, and don't worry, there's a free tier, so you don't need to pay for anything. This is super cool and I recommend you check it out from the link in the description. There are all kinds of languages supported, with a bunch of libraries and SDKs, and you're going to see how easy it is to build this voice assistant in Python.

So let's hop over to our code editor, get LiveKit set up, and start building this app. I'm in my code editor now, and the first thing we'll do is create a virtual environment to install our dependencies into. To do that, type `python3 -m venv ai`, where `ai` is the name of our virtual environment. If you're on Windows, simply replace `python3` with `python` and it should make that virtual environment for you. The next step is to activate it: type `source ai/bin/activate`. If you're on Windows, I'll put two commands on screen, one for PowerShell and one for CMD, showing how to activate the virtual environment.

Now that it's activated, we can install the dependencies we need. We'll be using LiveKit, so we need some packages related to that. Type `pip3 install livekit-agents`, then the LiveKit plugin for OpenAI, `livekit-plugins-openai` (we're using OpenAI for this video, but you could use anything you want, like Ollama or really any other LLM provider), then `livekit-plugins-silero`. Silero will be used for voice activity detection, so we know when the user is speaking. Finally, install the `python-dotenv` module for managing our environment variable file. Press enter and get these installed into the virtual environment.

Now that our dependencies are installed, let's create the files we need. Inside the folder we're working in in VS Code, make a `.env` file, a Python file called `main.py`, and another file we'll need later called `api.py`. We're going to need a few different secrets here: for example, we need to know where to connect to our LiveKit server (we'll talk about how that works in a second), and we'll need things like our OpenAI API key.

Let me quickly stop here and break down what we're about to do. We're building an AI voice assistant that works by connecting to something known as the LiveKit server. The LiveKit server will be hosted by LiveKit, though if we wanted to we could host it ourselves using their open-source code. This server is responsible for the transmission of data: it takes, for example, the voice data coming from our client (our front end) and sends it to our back end (our AI agent), where we process it and come up with some kind of response. So the server is an integral part here, and in the environment variable file we'll put the keys and tokens we need to connect to it. Then we have the front end, the user-facing application you saw in the demo: it shows a preview of our video and text logs of what the agent is saying, and it's actually pre-made by LiveKit. We could obviously build our own front end, but to save time we'll use the one LiveKit provides, which lets us test our application really quickly. Quick summary: we have a front end (a client) that connects to the LiveKit server, and our agent (the AI we build in this video) connects to the server as well; the server handles the transmission of data between those two components. That's the main advantage of using LiveKit: it does really low-latency transmission of data, so we get responses back super quickly.
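Collected in one place, the environment setup narrated above might look like this on macOS/Linux. This is a sketch: the package names correspond to the livekit-agents 0.x plugin line used in this tutorial, so check the LiveKit docs if they've changed.

```shell
# Create and activate a virtual environment named "ai"
python3 -m venv ai          # on Windows: python -m venv ai
source ai/bin/activate      # PowerShell: .\ai\Scripts\Activate.ps1   CMD: ai\Scripts\activate.bat

# LiveKit agents framework plus the plugins used in this video
pip3 install livekit-agents livekit-plugins-openai livekit-plugins-silero

# Loads the secrets from our .env file at runtime
pip3 install python-dotenv
```

After this, `main.py`, `api.py`, and `.env` are created by hand in the project folder.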
Now we need to go to the LiveKit website and create a new application. I'm going to press "Try LiveKit", make a new account, and then create a new app. I already have an account and an application, but let's make a new one so you can see what the process looks like: go to the link in the description, make a new LiveKit account, then go down to the bottom left, press "Create new project", and give the project a name. I'll then show you where we get the connection keys we need.

Once you create your project, you'll be brought to a page that looks like this. Notice that you can look at documentation for creating a client and a server by pressing any of these different languages or frameworks. If you wanted to make a React client for this app, press "React" here and it shows you exactly how to do that using LiveKit and some of their pre-built components; same thing for the server. Obviously this video will walk you through that, so you don't need to click into those unless you're curious and want to check them out.

What we want here is our connection keys. Go to Settings, press "Keys", create a new key, and give it some kind of description; I'm just going to call this "tutorial" because I'll delete it afterwards. It will generate the key for us, and we need to copy three pieces of data into our environment variable file. It's really important that you type the variable names correctly, so follow along with me. The first variable is `LIVEKIT_URL`, which we set to a string containing the WebSocket URL. The next variable is `LIVEKIT_API_SECRET` (let's make sure we spell "secret" correctly), and after that `LIVEKIT_API_KEY`, both strings as well. Grab the secret (again, I'm deleting this key afterwards, so don't think you can copy it), paste it into `LIVEKIT_API_SECRET`, then grab the API key and paste it into `LIVEKIT_API_KEY`. Make sure you have these three variables.

We're almost done, but we do need our OpenAI API key, so make one more variable: `OPENAI_API_KEY`, also a string. I'm on platform.openai.com/api-keys (I'll leave this link in the description), which is where you can make a new key. I believe you do need credit card information on OpenAI for this to work; they may have changed that recently, and this might have a very, very small cost to use. I'm not exactly sure if there's a free tier right now because they keep changing things. The point is, press "Create new secret key", give it a name like "livekit", copy the key, and paste it into the variable. Same thing: I'll delete it after the video, so don't feel like you can steal it from me. Now that we have that, we're done with the secrets and can actually start writing some code. By the way, the reason we need all of these is so we can connect to the LiveKit server, so that when we eventually use that front-end client, we actually have a way to transmit our information to the AI we're making. Don't worry, this will all come full circle once we start writing some code.

Let's go inside `main.py` and start coding this out. We'll make a very simple chatbot first, and then I'll show you how to connect it to some agent functionality. The first thing we do is `import asyncio`. After that, `from dotenv import load_dotenv`, and while we're at it, call `load_dotenv()`; what that does is simply load our environment variable file so we have access to those variables. Next, `from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm` (let me make this a little larger so we can see it). Next, `from livekit.agents.voice_assistant import VoiceAssistant`; by the way, you can make a lot more things with LiveKit than voice assistants. As I mentioned, you can stream video as well at extremely low latency, and a lot of companies use LiveKit for things like live streaming. Then `from livekit.plugins import openai, silero`, which is why we installed those plugins before. There are a bunch of other plugins, and if you wanted to use something like Ollama, you could install that plugin and utilize it here.

Lastly, we create a function: `async def entrypoint(ctx)`, with a variable annotation of `JobContext` on `ctx`. For now I'll put `pass` inside it and write the main-line call: `if __name__ == "__main__":`, and inside that, `cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))`, where `fnc` stands for "function". So what I've done is make a function; it's an async function because we're using asyncio. And we're saying: if we run this Python file directly, use the CLI that comes from LiveKit to run a new application; that application is a worker, and that worker runs this particular entrypoint function. We'll talk more about how workers work later on, but this is where we write the code that triggers our AI voice assistant.

Bear with me here; there's just a little more code to go until we have a functioning voice assistant. Next, inside `entrypoint`, I write `initial_ctx = llm.ChatContext().append(...)` ("ctx" stands for context). This is where we add some context to prime the assistant, just like the templated prompt you could give to something like OpenAI or some other LLM. We pass `role="system"` and `text=` a description of how we want this assistant to act or work (let me make sure I have a comma here). We could say "act like you're someone from the 1800s", or "respond in this specific format", or whatever we want it to do. I'm going to do a set of parentheses and copy in some context: "You are a voice assistant created by LiveKit. Your interface with users will be voice. You should use short and concise responses, and avoid usage of unpronounceable punctuation." That's the initial context; obviously you can change it if you want, and by the way, all of this code is available from the GitHub link in the description in case you want to copy any of it.

Next we actually connect: `await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)`. Pretty much what we're doing here is specifying that we just want to subscribe to the audio tracks; we don't care about the video right now, though you could connect to the video and the audio, or just the video. This says: we want to connect, but automatically subscribe only to the audio tracks. Let me just code out a little more and then I can explain what this means, because I understand it's a little confusing.

Next we define our voice assistant, the place for those audio tracks to actually be handled: `assistant = VoiceAssistant(...)` with a few properties. The first is `vad`, voice activity detection, which specifies what model we're using to detect whether the user is speaking, so we know when to cut them off and send the message over to our AI. This is where we use Silero: `vad=silero.VAD.load()`. Next, `stt=openai.STT()`, so we use OpenAI speech-to-text. Then `llm=openai.LLM()` (LLM in all capitals), and next `tts=openai.TTS()` for text-to-speech. Here is where you could actually swap in different things: a different speech-to-text model, a different LLM, a different text-to-speech model. You can replace any of them, and you can find how from the LiveKit docs. Lastly, we provide `chat_ctx=initial_ctx`, the initial context from before.

Now that we have that, call `assistant.start(ctx.room)`. The way these voice assistants work in LiveKit is that they connect to a room. We can have multiple rooms going at the same time, and voice assistants can connect to one of them or all of them; they can scale essentially without limit. What we're saying is: we want to connect to the room provided by this job. The way this works is that our agent connects to the LiveKit server, the server sends the agent a job, and that job has a room associated with it. So we're saying: subscribe to the audio inside that room, and start the assistant inside that room. That's how this is working.

Next, we'll just wait one second: `await asyncio.sleep(1)`. Then `await assistant.say(...)` (if we spell "assistant" correctly), giving it some text: "Hey, how can I help you today!" This is how you manually send a message with the assistant; if we want to say something specific, we can say exactly that. We also pass `allow_interruptions=True`, which allows us to interrupt that welcome message with anything else we want. Believe it or not, that's actually everything we need for a basic voice assistant.

Now you might be saying: okay, that's cool, that kind of makes sense, but how do we actually use this? There are a few steps, so let me explain. The first step is to run our agent: type `python3 main.py start`. That's how the CLI thing works; it's handling this command, and when we say `start`, it starts the agent for us. So run `python3 main.py start`, and we should get some output here in just a second... and you can see the agent has started. Now that the agent has started, we need to set up a new room, which will trigger the LiveKit server to send a request to this agent, so the agent can connect to it and start listening for any voice messages we send. The way we do that is the following: I have a link in the description to the LiveKit agents framework GitHub repository (this one is specifically for Python), which gives you more information on how all of this works; it also specifies a URL for the hosted playground. The hosted playground is what I showed you before in the demo, and it's something we can use to connect to our agent without having to build our own front end. I'm going to click on that URL, and it brings me to agents-playground.livekit.io.

You'll have to sign in to LiveKit in order for this to work; you can see I'm already signed in, and I'll leave this link in the description so you can just click it rather than typing it in. Now press on the project associated with the agent we're building; in this case I believe I called it "AI voice tutorial", so I'll click on that and connect to the project. When I do, it connects my video as well as my audio, and I can turn off my video if I want, so let's do that; you can see my microphone is going. Notice that it shows me (let me just mute this for a second) room connected: true, an agent connected, and the room I'm inside of; and if we go back to the terminal, we can see we've actually connected to that room. In a second here, if I turn on my microphone, it should let me talk and get a response from the agent. Let me do this: "Hey, how's it going?" Okay, sorry for the cut there; I actually disconnected and then reconnected, and now my agent is connected and working, and you can see it's picking up what I'm saying. As soon as I stop talking, it grabs all of this audio, sends it to my agent, and gives me some kind of response. It's going to take a second, obviously... "Sounds like everything is working. How can I assist you further?" Perfect. I just muted my mic, because obviously it's weird when I'm commentating and trying to talk to the assistant at the same time, but the point is it's connected: it shows you the room, the participant, room connected, and the fact that the agent is connected. If you're getting any weird errors, sometimes you can just disconnect and reconnect and the agent will pick it up. You can also shut off the agent by hitting Ctrl+C on your keyboard.
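Pulling the walkthrough together, the complete `main.py` looks roughly like this. It's written against the livekit-agents 0.x API used in this tutorial (`VoiceAssistant`, `cli.run_app`, `entrypoint_fnc`); newer releases of the library may have renamed these pieces, so treat this as a sketch and check the LiveKit docs against your installed version.

```python
# main.py -- the basic voice assistant as narrated in this tutorial
# (livekit-agents 0.x API; names may differ in newer releases).
import asyncio

from dotenv import load_dotenv
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero

load_dotenv()  # reads LIVEKIT_URL, LIVEKIT_API_KEY, etc. from .env


async def entrypoint(ctx: JobContext):
    # Prime the assistant with a system prompt before any audio arrives.
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant created by LiveKit. Your interface "
            "with users will be voice. You should use short and concise "
            "responses, and avoid usage of unpronounceable punctuation."
        ),
    )

    # Join the room this job belongs to, subscribing to audio tracks only.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),  # voice activity detection
        stt=openai.STT(),       # speech-to-text
        llm=openai.LLM(),       # the language model itself
        tts=openai.TTS(),       # text-to-speech
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room)

    await asyncio.sleep(1)
    await assistant.say("Hey, how can I help you today!", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

Run it with `python3 main.py start`, then open the hosted playground to create a room for it to join.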
Then reboot it. In my case this worked: you can see it said "Hey, how can I help you?", then grabbed our speech, converted it into text, sent that to the agent, got the agent's response, and turned that back into speech, which is then spoken by the audio bubble LiveKit provides in this front end. So that's all great; we can play with this and ask it anything we want, and we now have something like OpenAI's voice mode. But I want to go a step further and show you how to add some agent functionality. How do we allow this AI to handle state, turn on lights, handle temperature in different rooms, or connect to APIs? I'm going to show you how to do exactly that, so let's go back to our editor and start coding out that agent functionality.

I'm going to pop over to the `api.py` file, which is where I'll build the agent functionality so it's separate from the main file. To start, `import enum`, then `from typing import Annotated`, then `from livekit.agents import llm`, and lastly `import logging`, just so we can print some things out if we want to. First I'll create a logger: `logger = logging.getLogger("temperature-control")`, giving it a name so we can see where the output comes from when we log different information. Then `logger.setLevel(logging.INFO)`, the level at which we log, so we'll see any info messages we decide to log. This matters because if we try to `print` from this agent, we won't actually see the output; we need to use this logger instead.

Now I'll set up a really simple agent that can keep track of the temperatures in different zones. Those zones are defined with enums so we're able to access and use them from the agent. This is just a simple example; you can extend and change it quite a bit, but it illustrates the functionality. So: `class Zone(enum.Enum)` (enums are built into Python, by the way; you don't need to install anything). I'll make some members: `LIVING_ROOM = "living_room"`, and we'll copy the rest in to save ourselves a little time: bedroom, kitchen, bathroom, office. Obviously you can make these zones anything you want, and they don't necessarily need to be rooms; I'm just doing this as if we're going to let the AI voice assistant control our thermostat.

Then we make a class: `class AssistantFnc`, which inherits from `llm.FunctionContext`. Basically, we're making a Python class, and any function we define inside it that has
a specific decorator can actually be called by the LLM, and the LLM will decide which function to call based on the description we give it. First I'll write the initializer: `def __init__(self) -> None:`, and the first thing it does is call `super().__init__()` so the parent class gets set up correctly. Then set `self._temperature` to a dictionary; again, I'm just going to copy this in. All it does is assign each enum, each zone, a particular temperature: the living room is 22, the bedroom is 20, the kitchen is 24, and so on. These are just some random initial temperatures, so we can change them and keep track of them while the AI is running.

Now that we have that, let's make our first callable function, which simply gets the temperature for a particular zone. Write the decorator `@llm.ai_callable`; this is a Python decorator that specifies that the function we're about to write can be called by our LLM. If we want one of the functions in here to be callable, we need to mark it like this. We pass a `description=` of what the function actually does; you want to provide a good enough description that the AI can distinguish which function it should use when. I'll say "get the temperature in a specific room", so the LLM can see this description and know it should use this function to get a particular temperature. Next, `def get_temperature(self, zone)`. It's important that we annotate the parameters using Python typing so the LLM knows the correct data to pass to the function. So: `zone: Annotated[Zone, llm.TypeInfo(description="The specific zone")]`, where the `TypeInfo` description tells the LLM what this parameter should be. That's the annotation; and, what is the issue here... I need to put my colon, and then I can start typing.

Inside the function, the first thing I do is log the call: `logger.info("get temp - zone %s", zone)`, passing the zone, so we print out that the LLM called this function and which zone it was called with. Then we actually get the temperature: `temp = self._temperature[...]` (note the underscore; this is a private variable, so let's change the attribute to `_temperature`), and we index it with `Zone(zone)`. This might seem a little weird: the `zone` argument here is going to be a string, so wrapping it in `Zone(...)` converts it into the corresponding enum member in Python. That way, when we try to look up the temperature, we don't get any type errors or key errors. This is required; trust me, I've tested this before. Then return an f-string: `return f"The temperature in the {zone} is {temp}C"`, with a "C" so it knows this is in Celsius. So all we did was write a function that gets the temperature in a specific room, annotate the parameter so the LLM knows what to pass, log that the LLM called the function, and return the result.

Now that we have that, obviously we want to use it, so let's go to `main.py` and import it: `from api import AssistantFnc`. We just need to set up a context for this assistant functionality: `fnc_ctx = AssistantFnc()`, calling it so it gets initialized, and then pass `fnc_ctx=fnc_ctx` to our `VoiceAssistant`. That's literally it. All we have to do to provide the voice assistant with functions it can call is create a class that inherits from `llm.FunctionContext` in the LiveKit agents framework, then provide one or more functions (I'll show you multiple in a second) decorated with `@llm.ai_callable`, give each a description, and annotate the parameters. Now, whenever the voice assistant feels it needs to use that function, it will use it and then give us a response.

Let me turn off this agent by hitting Ctrl+C a few times and restart it: `python3 main.py start`. Then we go back to the playground, and I'll disconnect and reconnect so the agent is triggered to connect to this room again. "Hey, how can I help you today?" Okay, the agent connected. "Can you please tell me the temperature in the living room and in the bedroom?" "The temperature in the living room is 22 degrees Celsius, and in the bedroom it's 20 degrees Celsius." There you go, it worked. You can see the agent can call the function multiple times, which it would have done in this situation, and if we go to our logs, we should be able to see it doing that. It's a little difficult to see here, but if I search for "temp", you can see: yes, it logged "get temp - zone Zone.LIVING_ROOM" and then "get temp - zone Zone.BEDROOM". So it used those functions and they worked properly. Sweet. Now I just want to add one more function that allows it to actually change the temperature, and then that will be pretty much everything we need. So I'm going to say `@llm.`
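Stripped of the LiveKit decorators, the temperature bookkeeping in `api.py` is plain Python, which makes the string-to-enum conversion easier to see in isolation. Here is a sketch of just that logic; in the real class, it inherits from `llm.FunctionContext` and the methods carry `@llm.ai_callable` decorators with `Annotated` parameters, and the starting temperatures for the bathroom and office are my own filler values.

```python
import enum


class Zone(enum.Enum):
    LIVING_ROOM = "living_room"
    BEDROOM = "bedroom"
    KITCHEN = "kitchen"
    BATHROOM = "bathroom"
    OFFICE = "office"


class TemperatureStore:
    """The bookkeeping behind the assistant's callable functions."""

    def __init__(self) -> None:
        # Arbitrary starting temperatures, in Celsius.
        self._temperature = {
            Zone.LIVING_ROOM: 22,
            Zone.BEDROOM: 20,
            Zone.KITCHEN: 24,
            Zone.BATHROOM: 23,
            Zone.OFFICE: 21,
        }

    def get_temperature(self, zone: str) -> str:
        # The LLM passes the zone as a string such as "bedroom";
        # Zone(zone) looks up the enum member with that value, so the
        # dictionary lookup never sees a raw string key.
        temp = self._temperature[Zone(zone)]
        return f"The temperature in the {zone} is {temp}C"

    def set_temperature(self, zone: str, temp: int) -> str:
        # The "change the temperature" function the video is about to add.
        self._temperature[Zone(zone)] = temp
        return f"The temperature in the {zone} is now {temp}C"
```

For example, `TemperatureStore().get_temperature("bedroom")` returns "The temperature in the bedroom is 20C", which is exactly the string the assistant speaks back in the demo.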