In this video, we're going to make an offline virtual assistant that uses a local LLM, just like ChatGPT. That means the speech recognition and the LLM all run on your computer, so you're not sending your data to some random server somewhere. It's all local.
So let's go over how to do it. Just like in my last video, we're going to break this down into three different steps. Step number one: how can I make the computer listen to what I'm saying, so that it knows to do something?
This is called speech recognition. Just like in my last video, I literally just typed "speech recognition in Python," found this pip package, and went from there. The difference, though, is that this time we're actually going to use something called Whisper, specifically OpenAI Whisper. OpenAI is the company responsible for ChatGPT, so whether you like ChatGPT or not, they can make some good stuff. They're kind of good at this. So first things first, we're going to import a bunch of different things here. I'm not going to go over all of the imports; all this code will be linked in the description below, so if you want to download it and try it out yourself, feel free. Although you will have to follow a couple of unique steps to get some things working, so make sure you stay till the end.
All right. So for speech recognition, the first thing we're going to do is import speech_recognition as sr. Then we're going to create a source and a recognizer using sr.Microphone and sr.Recognizer. Importing it as sr basically just means we don't have to type speech_recognition all the time. Then we're going to use Whisper to load a model, and we're going to use that model to transcribe some audio into text.
So let's go over how we do that real quickly. Here we have "print listening for command." We'll open up this listen_for_command function. We take "with source as s"... and I realize I'm not even using s, so we can drop that. There we go. We're going to adjust for ambient noise, then we're going to listen for some audio.
Once we get the audio, we're going to try writing it to a WAV file. This will create some WAV files in the current directory, although it just keeps overwriting the same couple, so it's not really a big deal; it takes maybe less than a megabyte of space. After that, we call base_model.transcribe, which is something the Whisper model gives us, and pass in our WAV file. The result is a dictionary that includes a "text" value, and we can then interpret that.
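Put together, the listen-and-transcribe flow looks roughly like this. This is just a sketch of the shape of the code: get_audio and transcribe are stand-ins for the real recognizer.listen(source) and base_model.transcribe calls, and the function name is mine, not from the video.

```python
import os
import tempfile

def transcribe_command(get_audio, transcribe):
    # get_audio() stands in for recognizer.listen(...) and should return raw
    # WAV bytes; transcribe(path) stands in for base_model.transcribe, which
    # returns a dict containing a "text" key.
    wav_path = os.path.join(tempfile.gettempdir(), "command.wav")
    with open(wav_path, "wb") as f:
        f.write(get_audio())
    result = transcribe(wav_path)
    return result["text"].strip()
```

In the real script, the same WAV file just gets overwritten on every loop, which is why it never uses more than a little disk space.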
So if I go ahead and run this... it's loading. "Hello, YouTube." Hey, we can see it works.
All right. So this is amazing, but like I was saying before, we're loading this from some "base" value, right? But where's that coming from? Well, it comes from the cached whisper base.pt file, and on Windows that's located under your user folder. So mine is C:\Users\&lt;my username&gt;\.cache\whisper. If you're on Mac, it's a very similar thing: /Users/&lt;your username&gt;/.cache/whisper. So here we can see we have tiny and base. But what is tiny or base? Well, tiny and base are models that are available from OpenAI for free, that you can just download.
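Since that cache location comes up a few more times, here's a tiny helper that builds the same path on Windows, Mac, or Linux. The function name is mine for illustration; the ~/.cache/whisper location itself is just Whisper's default download directory.

```python
import os

def whisper_model_path(model_name):
    # Whisper caches downloaded weights as <name>.pt under ~/.cache/whisper;
    # os.path.expanduser("~") resolves to your user folder on any platform
    # (e.g. C:\Users\<you> on Windows, /Users/<you> on Mac).
    return os.path.join(os.path.expanduser("~"), ".cache", "whisper", model_name + ".pt")
```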
Now, the difference between tiny, base, and all these other ones is the size: the number of parameters in the model, but also how much memory it's going to use while it's running and how fast it is. So if you want something that's really, really good, then maybe use the large one, but you're going to need about 10 gigabytes of VRAM (video RAM) to do it. Anyway, we're using base here, which is really little. I think you'd probably get a better result with something like small or medium; I might try that in a future video, but for now we're just using base. But how do we get it? Let me comment this out here: load_model takes an input that can be a model name or a path to a model file.
So if I just type "base" here, go back, delete the cached file, and run this again, the first time you run it, it's actually going to download that model and store it. So let's see what happens. We can see it's downloading, and once it's downloaded, we'll see it pop up here. You can actually see that it's here; it's just not fully there yet. "Hello, YouTube." And it's listening for commands again.
All right, so we can see once again that that works; it downloaded the model. So then, instead of using the name "base", we're actually going to use the base model path and replace "base" here. If you wanted medium or large or something like that, you'd do the same thing, just changing base to medium or large or whatever you want.
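One way to express that "name or path" idea in code is a helper like this. The helper itself is my own sketch, but whisper.load_model really does accept either a model name like "base" or a path to a .pt file.

```python
import os

def model_name_or_path(name):
    # whisper.load_model("base") downloads the model on first use, while
    # whisper.load_model("/path/to/base.pt") loads an existing file, so
    # prefer the cached .pt when it already exists.
    cached = os.path.join(os.path.expanduser("~"), ".cache", "whisper", name + ".pt")
    return cached if os.path.exists(cached) else name
```

whisper.load_model(model_name_or_path("medium")) would then work either way: the cached file if it's present, a fresh download if not.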
There is one other thing, though. For whatever reason, the way load_model works is that it actually checks the internet before it loads a model, basically to see the current state of things with OpenAI. It does that to make sure it has the most up-to-date version of the model and stuff like that. Honestly, I'm fine with that; however, I know some people aren't.
So to make sure this is actually fully offline, we can make it so it doesn't have to do that. You might notice we have these two curl commands here. These curl commands will download a file and put it wherever you execute the command. So if we go back to this .cache/whisper directory, we can see that we actually have this vocab.bpe and this encoder.json file. That's where we need these files to end up.
So in order to get these, let me go ahead and delete them first, just so you can see I'm not lying. On Windows (if you're on Mac, I'll put a link in the description with the location) it's just at your user folder, under .cache\whisper. Then all you have to do is Shift + right-click and choose "Open PowerShell window here." This opens PowerShell already at the right directory. Then you just copy the curl command and right-click into the window. It's really weird; I don't know why they don't support Ctrl+V, but if you right-click and then press Enter, it will download that file and save it here. So now we have vocab.bpe. Then we do the exact same thing with encoder.json. And there we go: now we have encoder.json. Awesome.
Now, one thing I'm going to say: if you try that and it says the file doesn't exist (because I just ran into that for some reason), there's actually an answer online with direct links to these files as well. I'll throw that in the description too, so if you want, you can just download them that way rather than doing a curl request. All right. So now all we have to do is update a file that Whisper has as a dependency.
To do that, you're going to go to where your pip packages are located. This is really easy if you use a virtual environment; if you want to know how to do that, check out this video up here. Basically, within your virtual environment folder there's a Lib folder, and this will have all of your packages. I'd recommend doing it that way, because then you have everything in one spot. Now, the thing we're going to look for is called tiktoken_ext. We're going to go in there and open openai_public.py. What you'll notice is that there are these two entries here.
Here it's actually trying to go and fetch the files we just downloaded: every single time you run this, it tries to download them and see if there's anything new it needs to get. Kind of a silly requirement. So what we're going to do is import os, then use the same os.path.expanduser trick here and point the first entry at vocab.bpe. Then we do the same thing for the second one (I need a comma) and point it at encoder.json.
Cool. Now that means that when we run this, we're going to get the base model from the cached copy it downloaded itself, and when it loads, rather than going to the internet to find this vocab.bpe or this encoder.json, it's just going to use the ones we already have downloaded and cached on our machine. So: offline. Good stuff.
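The edit inside tiktoken_ext/openai_public.py ends up looking something like this. The exact surrounding code varies between tiktoken versions, so treat it as a sketch of the idea: swap the two download URLs for the paths of the files we saved into the cache.

```python
import os

# Local copies downloaded into Whisper's cache directory; these two paths
# replace the vocab.bpe and encoder.json URLs in openai_public.py.
vocab_bpe_file = os.path.join(os.path.expanduser("~"), ".cache", "whisper", "vocab.bpe")
encoder_json_file = os.path.join(os.path.expanduser("~"), ".cache", "whisper", "encoder.json")
```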
So if we go ahead and try this again with assistant.py listening for a command, we can see if it works. "Subscribe." Yeah, perfect. You will notice that it's a little bit slower than some of the stuff I've done in past videos, where we used a service on the internet. That's because if you send a small amount of audio to a server built for exactly this kind of work, it can run huge models and respond really quickly, whereas on our own computer, things are a little less performant. But if you don't care about that and would rather have everything local, this is still the way to go.
All right. So our assistant can now listen for a command. Perfect. The next thing it needs to do is be able to respond to a command, like say something. Right, so let's go ahead and do that.
So in this case, we have this respond function, and basically what we have here is: if sys.platform == "darwin". What the heck does that mean? Well, if we import sys, we can check the platform, and "darwin" means it's a Mac. So if it's a Mac, we're going to shell out with os.system and use the built-in say command, which lets you speak text: say, then whatever your text is. That's great, but I'm not on a Mac right now, at least. So instead we're going to use something called engine, and engine comes from pyttsx3. This is just a local library you can download (I have it included in requirements.txt), so go ahead and install that and it will let you say stuff.
Now, I will say it's really robotic. So if I do this, you'll see that it talks like this, like a not-very-good robot: "Please subscribe." Awesome. But you know what? It works. We're not going for perfect; we're going for something that works and is offline. That is the goal. So awesome, we've already done our second thing. We just needed to import a simple library: again, if it's not Darwin (not a Mac), then we import pyttsx3, initialize it, and use this engine whenever we want to speak. Pretty simple.
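The platform check boils down to a one-liner like this; tts_backend is my own name for the decision, not something from the script.

```python
import sys

def tts_backend(platform=None):
    # sys.platform is "darwin" on macOS, where the built-in `say` command is
    # available; everywhere else we fall back to the pyttsx3 engine.
    platform = platform or sys.platform
    return "say" if platform == "darwin" else "pyttsx3"
```

So respond() shells out with os.system(f'say "{text}"') when this returns "say", and otherwise calls engine.say(text) followed by engine.runAndWait().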
So we basically have everything we need now, right? We have speech recognition: we can say something and it converts it into text. And we have text-to-speech: we can hand it text and it can say it to us.
So the next thing we really want is just to be able to do stuff. In order to do stuff, I have this perform_command function, where we can give it a command, and I have a bunch of stuff here; you can see, just a bunch of global variables. Let's go ahead and uncomment this stuff. The first thing we do here is print the command. Then I have the things I had before, like where you can append something to a list of tasks and then list out all of your tasks.
You can take a screenshot. You can open up Chrome to the best YouTube channel that exists. But the new one I added, because, well, we're adding an LLM to our machine, is called "ask a question." The way it works is you say "ask a question," and if that matches, it comes back and says, "What's your question?" Then you ask your question and it says it's thinking. Then it uses a model to generate a response with a maximum of 200 tokens; tokens are roughly word pieces, so that caps how long the reply is, though it's not exactly a character count. Where does model come from, though?
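The dispatch inside perform_command is basically just string matching on the transcribed text. Here's a stripped-down sketch with the side-effecting calls injected so the shape is testable; the phrases and function names here are illustrative, not the exact ones from the script.

```python
def perform_command(command, tasks, speak):
    # speak() stands in for the text-to-speech call.
    command = command.lower().strip()
    if "add a task" in command:
        tasks.append(command)
        speak("Task added.")
    elif "list tasks" in command:
        speak("; ".join(tasks) or "No tasks yet.")
    elif "ask a question" in command:
        speak("What's your question?")
        return "awaiting_question"
    return None

def answer_question(question, speak, generate, max_tokens=200):
    # generate() stands in for gpt4all's model.generate; max_tokens caps the
    # reply length in tokens (word pieces), not characters.
    speak("Thinking.")
    speak(generate(question, max_tokens=max_tokens))
```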
Well, this model comes from this here: GPT4All.
Now you might say: what the heck is that? That's another thing we haven't covered yet. Well, you'll also notice this allow_download=False, which is a handy thing, because we want to be offline: you can specify that you never want it to download anything. GPT4All takes two different kinds of constructor arguments. I might be wrong here, but I'm pretty sure you can either specify the name of a model you want and it will download it, or you can specify a path. In this case, we're specifying our path and telling it not to download anything, just to use that path. But what the heck is this path and where does it come from?
Well, this path comes from the GPT4All program. Now, I know what you're thinking: that seems sketchy, and honestly, that's what I initially thought too. But GPT4All is something made by a company called Nomic AI. It's basically a free, local version of an LLM, a large language model, which is what ChatGPT is and all those types of things. So if you actually want a local LLM running, this is the easiest way I've found to do it.
The reason is that all you have to do is install it (the Windows installer, the OS X installer, Ubuntu, whatever you need), then come down here and you can see the performance benchmarks of a bunch of different models. Like, how good are they at... well, look at the average; I don't know what all these benchmarks mean. I don't even know what HellaSwag measures. But anyway, look at all those: there are a bunch of cool models, with benchmarks for a bunch of different things. And if we look down here, the models list their size, how much RAM they're going to use, and what they're good for: "best overall fast chat model," "best overall fast instruction following model" (so it can figure out how to do stuff), models with good quality, "best overall larger model," which is going to use more resources. Anyway, you can look at all of these models and choose the one you want. Now you have two options here.
Number one, you can just download the model file directly, and if you do that, I don't think you actually need to install their program. I installed it because I think it's fun to play with anyway, but do whatever you want. You should be able to just download the file, and when it downloads, you can put it at whatever path you want. In this case, I just added it to this directory, so I have the GPT4All model file in the same directory as my assistant.py. That way I can reference it with a path relative to this file. So that's if you want to go the route of downloading it from the website.
If you want to download it from the program instead (which is what I did), you can install the program, open it, go to download models, and pick the model you want. I have this GPT4All Falcon, which is free. It says it's very fast; to be honest, it's not that fast, but it's okay. It's local, so that's fine. When it downloads, you'll also see there's a download path, so you know where to go to get the file. So yeah: download the thing, it'll download it for you, and you can also test it out right there.
So you can see, if I switch to this other model, it's going to load the model, and then I can type something like, "What is Python?" And you can see it's really fast: it says Python is a high-level, interpreted programming language, blah blah blah. Pretty cool. So anyway, once we have all of this in place, we can just reference the model file, say we're not going to download anything, and then we have this model. To use it, we simply call model.generate with the command and the token limit, and it gives us the output.
You can see here, if I just run this as an example: there you go. We asked it, "Are virtual assistants amazing or what?" And it said, "Virtual assistants are definitely amazing. They can help with a wide range of tasks," blah blah blah. So, pretty cool. I did speed that up, because it's a little slow.
But again, if you're just doing this for fun and trying things out, I think that's totally fine, and you're running everything locally, which is awesome. So, if you want to see how this whole thing works together: I have this main function, which has a global listening-for-word flag to track whether we're listening for the trigger word. Then, while should_run is true (which is just something so I can exit the loop), we listen for a command, like we did before. If we're listening for the trigger word and we hear it, we set the flag to false, and then within perform_command, later on, we set it back to true.
And our trigger word is "Andy." Why Andy? Because I don't know any Andys, so I have no reason to say Andy, except when I'm watching The Office. Before, my trigger word was "baloney," back when I was using my other speech recognition, and this one just couldn't turn the audio into the text "baloney." It doesn't know how to spell baloney, which, fair, I didn't really either. But because we're using a local model that's slimmed down and not really meant to do anything super well, I guess it couldn't figure out how to spell baloney when I was saying it. So that's why.
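The main loop reduces to roughly this shape: a sketch with the listening and acting functions injected, and with a bounded loop standing in for `while should_run:` so it can be exercised directly. The names are mine, not the script's exact ones.

```python
def wake_word_heard(text, trigger="andy"):
    # True when the trigger word appears anywhere in the transcribed text.
    return trigger in text.lower()

def main_loop(listen, act, trigger="andy", max_iterations=10):
    # listen() returns one transcribed utterance; act() handles a command.
    # Until the wake word is heard, everything else is ignored; after one
    # command is handled, we go back to waiting for the wake word.
    listening_for_trigger = True
    for _ in range(max_iterations):
        text = listen()
        if listening_for_trigger:
            if wake_word_heard(text, trigger):
                listening_for_trigger = False
        else:
            act(text)
            listening_for_trigger = True
```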
So let's go ahead and run this, though. We'll uncomment main and give it a go. "Andy." "Ask a question." "What's your question?" "What is better, Python or JavaScript?" "Thinking."
So you can see it missed some of that there; it heard just "Python or JavaScript." So there are some things you could probably tweak to make it a bit better. "Which is better for a beginner?" "Both Python and JavaScript are great programming languages to learn as a beginner..." and it looks like we hit our 200-token limit there. But pretty cool.
You know, we got a response. I did kill it there, because I'm talking now and didn't want it to get confused about anything. But yeah, it can do everything it could do before: it can take a screenshot, it can open Chrome, everything. Again, I will say the command recognition is not as good as it was before, because we're using a slimmed-down model for the speech recognition; it's just not quite as good. But, again, for something running locally, it's pretty cool.
So that's it. We have now made a fully locally running virtual assistant, which also has an LLM, so it can respond to basically anything. Make it write a book for you or something, I don't know. So if you liked this video, please drop a like down below and subscribe, because, yeah, I mean, why not, right?
It's free. In some of the next videos, we're actually going to look at machine learning. So not necessarily how to build some kind of LLM, but how to use machine learning to solve some interesting problems. So if you're interested in that, make sure you do subscribe. Well, that's it for this time. Remember, I'm Jake. You're awesome. And this has been building a locally running virtual assistant. Cheers.