Creating Low Latency Voice Agents - Open Source 🗣️🗣️🗣️

Prompt Engineering
Hey there! I introduce Verbi, a modular voice assistant I created with community support. We'll disc...
Video Transcript:
"Hey, what is your name, and can you tell me about yourself?" "Nice to meet you! My name is Verbi, and I'm a friendly AI assistant here to help answer your questions and make your life easier. I'm constantly learning and improving so I can provide you with the most accurate and helpful information. I'm a chatty bot, but I'll keep my answers short and sweet." "Can you tell me a fun thing to do while I'm on a trip to Los Angeles over the weekend?" "LA has plenty of fun activities! Why not explore the Griffith Observatory at sunset: take the tram ride, gaze at the city skyline, and enjoy the stunning views of LA. If you're visiting on a Friday or Saturday, stick around for the free public lectures and events. It's a unique and unforgettable LA experience."

A few days ago, Google announced Gemini Live, their AI-powered voice assistant. It's their answer to GPT-4o's advanced voice mode, which was announced back in May 2024 but which we are still waiting for. To Google's credit, Gemini Live is available to Advanced users and comes with some pretty amazing voices: "Here's one of the voices I have, but there are more to choose from." "Here's another voice I can use. You can pick me now and always make a change later in settings." "Or maybe you'd like to listen to a voice like this, a pretty great one if I do say so myself."

This got me thinking about whether we can replicate some of the functionality of these voice assistants using the tools we already have. We're going to create a system with four components. First, user input and output: we capture the user's speech and pass it to a speech-to-text model that transcribes it. The transcribed text goes to an LLM, which generates a response as text; that text is then converted into speech by another model and played back to the user. The system also has memory, so it remembers previous conversations.

Keep in mind that, unlike Gemini Live or GPT-4o, this is not a multimodal system; we are wiring together different models to build a voice-to-voice chat assistant, and the goal is a system with the lowest achievable latency using the tools available on the market. To put all these models together we need a framework, and luckily I have been working on a project called Verbi, a voice assistant I created with the help of the community a few months ago but never really got a chance to cover on this channel. It is a modular voice-assistant application for experimenting with state-of-the-art transcription, response-generation, and text-to-speech models. It has support for a number of API providers, including the OpenAI, Groq, ElevenLabs, Cartesia, and Deepgram APIs, with support for others coming soon, and you can also run speech-to-text, text-to-speech, and local LLM models as part of the system. We already have 159 stars, so check it out and give it a star if you like it.

The idea is that you can select different models from different providers and see which one gives you the best possible latency and response accuracy. In my experiments, I found that the speech-to-text model hosted by Groq is the best in terms of latency, and for LLM inference the Groq-hosted models are also best in class. For text-to-speech we're going to use Cartesia AI; Cartesia and Deepgram have some of the best text-to-speech systems with the lowest latency. You can replace all of these models with local, open-source ones, but you will need a pretty powerful computer to run them and get responses in an acceptable time frame. Verbi has support for local models, and I'll show you how to set those up, although I'm not going to walk through a whole tutorial in this video. So we'll set this up locally, and then I'll show you a few options.
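The four-stage loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Verbi's actual code: `transcribe`, `generate_response`, and `synthesize` are placeholder functions standing in for whichever providers you configure.

```python
# Minimal sketch of the four-stage voice-assistant loop:
# speech -> text -> LLM response -> speech, with conversation memory.
# The three stage functions are placeholders; in a real system each one
# would call the configured provider (Groq, OpenAI, Cartesia, ...).

def transcribe(audio: bytes) -> str:
    """Speech-to-text stage (placeholder: treat the audio as UTF-8 text)."""
    return audio.decode("utf-8")

def generate_response(history: list[dict]) -> str:
    """LLM stage (placeholder: echo the last user message)."""
    return f"You said: {history[-1]['content']}"

def synthesize(text: str) -> bytes:
    """Text-to-speech stage (placeholder: return the text as bytes)."""
    return text.encode("utf-8")

def run_turn(audio_in: bytes, history: list[dict]) -> bytes:
    """One full turn: STT -> append to memory -> LLM -> TTS."""
    user_text = transcribe(audio_in)
    history.append({"role": "user", "content": user_text})
    reply = generate_response(history)
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)

# The history list IS the memory: it persists across turns.
history = [{"role": "system", "content": "You are a helpful assistant called Verbi."}]
audio_out = run_turn(b"hello", history)
```

Because the whole conversation history is passed to the LLM stage on every turn, the assistant "remembers" earlier exchanges without any extra machinery.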
First, let's talk about its features. It has a modular design, so you can easily switch between different models for transcription, LLMs, and text-to-speech. It supports multiple API providers, including OpenAI, Groq, and Deepgram, with more planned, plus local models, and it can record audio and play it back.

Here's the project structure. If you want to change any model, you just need to change it in config.py. The step-by-step instructions to get started: clone the repo, create a virtual environment, install the required packages, provide the API keys you'll need, and update the config file, which controls which model you use for transcription, the LLM, and text-to-speech. As I said, you can run local models via Ollama, there is support for Fast Whisper, and for local text-to-speech we currently support the MeloTTS model. We would love contributions from the community, so if you think you can add extra features to this project, please clone the repo and submit a PR; I would love to integrate those.

To clone it, just use `git clone` with the repo URL. As you can see, I have already cloned the repo, so I'll skip that step. Next, I create a new conda environment with `conda create -n verbi python=3.10`; it asks to install some packages, and we say yes. Let's activate the environment with `conda activate verbi`, and you can see we are now inside it. Then we install all the requirements with `pip install -r requirements.txt`, which pulls in the packages we need to run the different models.

Next, we need to set which model to use for which purpose. Go to the voice_assistant package (folder) and open config.py. Here you set your transcription model, response model, and text-to-speech model. The possible transcription values are OpenAI, Groq, Deepgram, and the Fast Whisper API; there are detailed instructions in the repo on setting up the Fast Whisper API if you want to run the Whisper model locally. For the LLM you can use these options, and we'll be adding more soon. For text-to-speech you can currently use OpenAI, Deepgram, ElevenLabs, the MeloTTS model (run locally; again, instructions are in the repo), or Cartesia AI. In my case, I have found Groq for transcription plus LLM inference to be the best option for the lowest possible latency, and Cartesia also has a pretty amazing response time for text-to-speech, so I'm going to set those up.

We also need to provide API keys for these models in order to use them. Look at the examples.env file, fill in the API keys for the models you're using, and rename examples.env to .env. I'm not going to show you my .env because it has all my API keys.

For text-to-speech, the two best options are Deepgram and Cartesia; these are the two solutions that will give you the best possible latency. ElevenLabs is another option, but its latency is a bit higher than the other two. For example, here in Cartesia we can select different voices; I'll pick this Australian man and see how long it takes to generate speech for this text: "In the early dawn of human civilization, innovation was driven by necessity, from the invention of the wheel to the creation of written language." That's pretty fast, right? That's why I wanted to use Cartesia for my text-to-speech model.

So here are my settings again: Groq for transcription and for the LLM response, and Cartesia for text-to-speech; we can switch them around pretty easily. For Groq, we're going to use the Llama 3 8B model. To run the system, we run the file run_voice_assistant.py, which pieces all the different models together. You also provide your system message here; I'm saying "You are a helpful assistant called Verbi. You are friendly and fun, and you will help the users with their requests. Your answers are short and concise." If you want to give your assistant a persona, just change this system prompt. You can also select different voices, and there are a number of other customizations you can make; I'll probably create a subsequent video walking through the architecture and the code base, so let me know if that's something you're interested in. To run our voice assistant, all you need to do is run the voice-assistant script.
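As a rough picture of what the configuration step involves, here is an illustrative config.py-style fragment. The constant names and accepted value strings below are approximations for illustration only; the authoritative names and options live in the repo's actual config.py.

```python
# Illustrative config.py-style settings (names and value strings are
# approximations; check Verbi's real config.py for the exact ones).
import os

TRANSCRIPTION_MODEL = "groq"   # e.g. "openai", "groq", "deepgram", local Whisper
RESPONSE_MODEL = "groq"        # which provider serves the LLM
TTS_MODEL = "cartesia"         # e.g. "openai", "deepgram", "elevenlabs", "melotts"

# API keys are read from the environment, which python-dotenv would
# populate from the .env file you renamed from examples.env.
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
CARTESIA_API_KEY = os.getenv("CARTESIA_API_KEY")

def validate(model: str, allowed: set[str]) -> str:
    """Fail fast on an unsupported provider string."""
    if model not in allowed:
        raise ValueError(f"unsupported model: {model!r}")
    return model

validate(TTS_MODEL, {"openai", "deepgram", "elevenlabs", "melotts", "cartesia"})
```

Centralizing the provider choice in one module is what makes the "swap models in config.py" workflow possible: every pipeline stage reads these constants instead of hard-coding a vendor.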
"Are there any beaches I can visit?" "LA has some amazing beaches! You should totally check out Venice Beach, Santa Monica Pier, and Malibu Beach; they're all within a 30-minute drive from each other. Venice Beach has a lively boardwalk, street performers, and great eats. Santa Monica Pier has an iconic Ferris wheel and amusement park. Malibu Beach is a bit more laid-back, with stunning views and great surfing spots. Which one sounds like a vibe?"

Okay, I'm going to close this, but those were the real-time responses we were getting; there was no editing, and this is the speed you can expect, which is pretty amazing. Right now we're just talking with the LLM itself, but you can connect the LLM to another system where it can look up information, or give the LLM function calling if you want it to retrieve information from something like a web server or do web browsing for you. The possibilities are limitless here, and I'm going to be building on top of this project.

Now, just to show you how easy it is to swap models: let's say we want to use the OpenAI version of the Whisper API, so I'll set that, and let's also use an OpenAI model, which is going to be GPT-4o, while I still keep Cartesia for the voice.
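The function-calling idea mentioned above could be layered in as a small tool-dispatch step between the LLM and the text-to-speech stage. This is a hypothetical sketch, not part of Verbi: the `tool` registry, the `get_weather` tool, and the `dispatch` helper are all invented here for illustration.

```python
# Hypothetical sketch of a tool-calling layer: if the LLM emits a
# structured tool call instead of plain text, run the tool and feed
# the result back before synthesizing speech. Not part of Verbi.

TOOLS = {}

def tool(fn):
    """Register a function the LLM is allowed to call by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> str:
    # Placeholder: a real tool would hit a weather API here.
    return f"It is sunny in {city}."

def dispatch(call: dict) -> str:
    """Run the tool the model asked for and return its result as text."""
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A tool call as the LLM might emit it in structured form:
result = dispatch({"name": "get_weather", "arguments": {"city": "Los Angeles"}})
```

The same dispatch pattern would cover web lookups or any other retrieval: the LLM names a registered function, the loop executes it, and the text result flows onward to the TTS stage.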
Let's run this again. "Hey, I'm planning a trip to Paris. What are the different things that I should take during summer?" "That sounds amazing! For a summer trip to Paris, make sure to pack: one, light clothing (t-shirts, shorts, dresses); two, comfortable walking shoes; three, sunglasses and a hat; four, a light jacket or sweater for cooler evenings; five, sunscreen; six, a travel adapter for electronics; seven, a reusable water bottle; eight, a small backpack or tote for daily outings; nine, a travel guide or map; ten, your..." All right, so there was a little bit more latency compared to the previous approach, but let's try another voice system; in this case we're going to replace it with ElevenLabs. "Can you tell me a fun fact about the Sun and the solar system?" "Sure! Did you know that the Sun makes up about 99...
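When comparing provider combinations like this, it helps to time each pipeline stage separately rather than eyeballing the total. A minimal sketch, with placeholder lambdas standing in for the configured STT/LLM/TTS providers:

```python
# Time each pipeline stage independently to see where latency comes from.
# The three lambdas are placeholders for whichever providers you configured.
import time

def timed(stage_name, fn, *args):
    """Run one pipeline stage and report how long it took."""
    start = time.perf_counter()
    out = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{stage_name}: {elapsed * 1000:.1f} ms")
    return out, elapsed

stt = lambda audio: "hello"      # stand-in for the transcription call
llm = lambda text: "hi there"    # stand-in for the LLM call
tts = lambda text: b"audio"      # stand-in for the speech synthesis call

text, t1 = timed("transcription", stt, b"raw-audio")
reply, t2 = timed("llm", llm, text)
audio, t3 = timed("tts", tts, reply)
total = t1 + t2 + t3
```

Breaking the total down per stage makes swaps like the one above (Groq Whisper vs. OpenAI Whisper, Cartesia vs. ElevenLabs) directly comparable instead of a subjective "that felt slower."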