Kokoro Local TTS Custom Voices

Sam Witteveen
Kokoro is a small, really high-quality TTS model that can be run both in Colab and locally...
Video Transcript:
Okay, so as the popularity of the Gemini and the OpenAI bidirectional APIs has taken off, people are starting to look much more at building voice apps. And of course, the downside of using an external API, whether that's OpenAI, Google, or even things like ElevenLabs, is that you're having to send your data out there. So many people, including myself, are still on the hunt for a really good local TTS system, and in today's video I'm going to talk about one that I've been playing with and using on my local computer for a variety of different tasks, all running locally, and even able to run without a GPU.

The text-to-speech model I'm talking about is Kokoro-82M. This is a very small text-to-speech model that is getting really outstanding results. There's no blog post for this, no fancy startup announcement or press release or anything; this is literally just a model that's been released on Hugging Face and on GitHub, and it's taken off as people have started to play with it. One of the reasons it has taken off is that it's performing so well in the TTS Arena on Hugging Face. If we look at the TTS Arena leaderboard, we can see that Kokoro is by far the best-ranked TTS system there that's open, where you can actually access the weights.

There's also a Hugging Face Space where you can try it out. Here I've basically chosen the voice Sky, and I'm just generating: "hi there everyone, this is Sky from Kokoro." You can see they've got both American voices and also British voices; if you want to pick a British voice, you can come in and do that: "hi there everyone, this is Delta from Kokoro." And you can see they've also got some voices for French, Japanese, Korean, and Chinese in here, so if you just want to try it out, come in and have a play with it.

While there's no fancy blog post or anything like that about this model at the moment, we can see some interesting information about it. It was trained on less than 100 hours of audio, which is pretty amazing, and it seems the architecture is based on StyleTTS 2, which actually has a repo up online and a whole paper about it if you want to go and see it. One of the cool things about this project is that it seems to have taken off quite quickly: not only have they released the weights, there seem to be plans to make a next version and train it on more data to get it even better than it is now. One of the interesting things is the way the model works: you've got the actual model, and then you've got embeddings in what they're calling a voice pack. They're now offering to make a voice pack for the particular voice you want if you just contribute data to the next training run, so if you're interested in that, you should definitely come in and have a look.

On top of this, the community has already started building a number of external projects around it. One that is really good is the kokoro-onnx GitHub repo; I'll talk about that in the code and show you how I actually use it to run the model locally. There's also an interesting project, Kokoro-FastAPI, and the idea here is that it creates a FastAPI endpoint that emulates the OpenAI-compatible speech endpoint, so if you've already got things set up to use the OpenAI voices, you can just swap them over to this.
And then another one that looks really interesting is a whole inference system in Rust, which is definitely worth checking out if you're looking at putting this into production for speed. So let me walk you through the code. I'll go through a Colab version at the start just to explain how things work, and then I'll show you how I'm running a local version using the ONNX inference system. Let's jump in and have a look.

Okay, so like I showed you before, they've got some guide code that you can get started with in Colab, and that's what I'm using here. I'm going to walk through quickly how you can use it, talk a little bit about it, and then we'll talk about making custom voices as well; if you want to make some custom voices, there are a few ways you can do that.

First off, you need to understand that you've got two main components here: the model, which you're going to be running things through, and an embedding for each of the voices. Each voice has its own embedding, which gives it the characteristics of that voice. The first one, which I think they made for the leaderboard, was basically a combination of the Bella and Sarah voices. Another interesting thing in here: if the voice is American it will start with 'a', and if the voice is British it will start with 'b'. They're saying they're going to release more voice packs, which basically means more embeddings of different voices that you could use. And obviously they've got the Sky voice in here.
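To make that two-component setup concrete, here's a small sketch. The real voice packs are PyTorch tensors loaded from `.pt` files (something like `torch.load("voices/af_sky.pt")`); the file names here are assumptions, and NumPy stand-ins of the same shape are used so the structure is clear without downloading anything.

```python
# Hedged sketch of the two components: a model plus a per-voice embedding
# ("voice pack"). Real packs are torch tensors; NumPy stand-ins used here.
import numpy as np

EMBED_SHAPE = (511, 1, 256)  # shape of each voicepack tensor, per the video

rng = np.random.default_rng(0)
voices = {
    "af_sky":  rng.normal(size=EMBED_SHAPE),  # 'a' prefix = American accent
    "bf_emma": rng.normal(size=EMBED_SHAPE),  # 'b' prefix = British accent
}

# Generation is then conceptually: audio = model(text, voices["af_sky"])
print(voices["af_sky"].shape)  # (511, 1, 256)
```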
That voice is very similar to a certain famous voice, and you can see, if we just go through and run this, sure enough we can play it: "how could I know? it's an unanswerable question. AGI is like asking an unborn child if they'll lead a good life; they haven't even been born." Okay, you can see we've got the voices out there. Also, because of the way the model is trained, it's actually trained on phonemes; these are the actual phonemes that are generated, and I think it can use both British and American phoneme sets for different pronunciations.

All right, if we want to change the voice now, we can come up here; for example, if we want to use the Sky voice, we can come in and run with that, and that gives us the Sky voice (the same sample text plays). If we want to use one of the male voices, we run it the same way. So that's how you can just generate audio. Now, if you want to save the audio, here's basically what you can do: you just convert it to a WAV file. Sure enough, if you want to see it in here, I've saved that output.wav, and you would just come in here and download it if you want to use it.
The next thing up is that if we look at the voices, for each voice we've got this embedding: a tensor of shape 511 x 1 x 256. So if you want to blend voices, there are a number of different ways we can do it. Here we're going to load three different voices, and you'll see that all I'm changing is the voice pack: the first voice, then the second voice, and then the third one is a guy, just so we've got something that's clearly different. To blend these, we basically want to merge two of the tensors together, and we can do that in a variety of ways. The simplest way is going to be an average: we just add the two together and divide by two.
Often when we do that, though, it's going to sound very much like one of the two voices. Really, what we're trying to do is interpolate between the two voices, and there are a number of ways you can try doing this. One is a weighted average, where we pick how much of one voice we want versus the other. You can hear that we've now got a weighted blend of the first voice, Emma, and the guy, Lewis. You can play around with the weights and try different things; that will certainly get you some interesting results.

The other way is to just start interpolating between the voices, and there are a few ways to do that as well. If you interpolate linearly from one voice to the other, you'll generally find that the voice stays the same for a long stretch and then changes quite quickly to the other voice, so you're constantly trying to find that sweet spot. This one here is starting to give us a new female voice that's not quite the Emma voice and not quite the Isabella voice. The final way I'll show you is what's called spherical interpolation; the idea for this actually comes from things like StyleGAN, where people found that plain linear interpolation didn't work as well.
You wouldn't have as much control, and I think it was the StyleGAN 3 paper, or one of the papers around it, that worked out that spherical interpolation between these was one of the best approaches. I've put a simple example of this here, and you can really play with it to get lots of different results; you can see now we can make a voice that's somewhere between the guy and one of the female voices. So you can take the voices that they've got in the voice packs and mix them to create new custom voices, and all you need to save and load is this new tensor. You'll see when we look at it that it's the exact same shape as the actual voicepack embeddings, so this allows you to make some extra voices if you want to play with it. If you really wanted to get serious about this, you could actually train up a model that goes from voice to embedding; it doesn't require a retrain of their TTS model, it just creates new embeddings, but that's a little bit beyond this tutorial. Anyway, you can play with it in Colab; it works really well.

One of the things I was also going to show you is the ONNX version, but I'll show you that actually running locally, because I kind of feel that if you're going to use it locally, you're probably better off using the ONNX version rather than the stock PyTorch one.
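Here's a hedged sketch of spherical interpolation (slerp) between two voice embeddings, plus saving the blended tensor. Treating the whole flattened embedding as one vector is my simplification, not necessarily what the Colab does; and in the real PyTorch setup you'd save with `torch.save(custom, "my_voice.pt")` rather than `np.save`.

```python
# Hedged sketch: slerp between two voice embeddings, then save the result,
# which has the same shape as a stock voicepack. NumPy stand-ins throughout.
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherically interpolate between tensors a and b, flattened to vectors."""
    a_flat, b_flat = a.ravel(), b.ravel()
    a_n = a_flat / (np.linalg.norm(a_flat) + eps)
    b_n = b_flat / (np.linalg.norm(b_flat) + eps)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))  # angle between
    if omega < eps:                       # nearly parallel: fall back to lerp
        return (1.0 - t) * a + t * b
    so = np.sin(omega)
    out = (np.sin((1.0 - t) * omega) / so) * a_flat \
        + (np.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape)

rng = np.random.default_rng(0)
guy = rng.normal(size=(511, 1, 256))
girl = rng.normal(size=(511, 1, 256))

custom = slerp(guy, girl, 0.5)       # a brand new voice between the two
np.save("my_voice.npy", custom)      # same shape as a stock voicepack
```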
So let's take a look at that. All right, if you want to run this locally, I think one of the best ways to do it is to use this kokoro-onnx package, which has been put together by, I think, "thewh1teagle"; very cool package. Because people have already released an ONNX version of the model, and the embeddings are the same, what they've done here is make it easy to get it going as an ONNX package, which you can then run really fast on your computer. I'm using a Mac Mini here, and you'll see that it runs way faster than real time.

The key thing you want to do is basically just pip install this kokoro-onnx, and make sure you've got uv installed. If you don't know what uv is, go and have a look at it; you can just do `brew install uv` on a Mac. I'm not sure about Windows, you'll need to look into that. Then you can basically just set up a virtual environment with uv, install the two packages, and it will create the whole setup for you. The two things you need to copy across are the actual ONNX version of the model and the voices JSON with the embeddings. Once you've copied those across and got this set up, there are a bunch of examples in here that you can use; they've got some really nice examples with and without the phonemes, and as I'm recording, it literally seems like they're still adding new examples as they go. Once you've got that done, you can just use their hello example, where you set the voice you want to use, the text, a speed, etc., and then to run it you just use `uv run`; it will go through and generate the audio, and you can see there it's already generated.
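The hello example amounts to something like the sketch below. The class name, file names, and `create()` signature follow the kokoro-onnx repo's examples as of the time of the video; treat all of them as assumptions and check the repo's current examples before relying on them. The import is inside the function so the sketch loads even without the package installed.

```python
# Hedged sketch of local inference with the kokoro-onnx package. Names and
# signatures are assumptions from that repo's examples; verify against it.
def synthesize(text, voice="af_sarah", speed=1.0, out_path="audio.wav"):
    """Generate speech locally with kokoro-onnx and write a WAV file."""
    import soundfile as sf
    from kokoro_onnx import Kokoro

    # These are the two files you copy across from the release assets:
    kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")
    samples, sample_rate = kokoro.create(
        text, voice=voice, speed=speed, lang="en-us"
    )
    sf.write(out_path, samples, sample_rate)
    return out_path

if __name__ == "__main__":
    synthesize("Hello Sam, how are you doing today? "
               "This audio was generated by Kokoro.")
```

You'd then run it with `uv run hello.py` inside the uv-managed environment.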
If I click on this, we can hear it: "hello Sam, how are you doing today? this audio was generated by Kokoro." We've got the generated audio there, which you can use, so you can set this up as something you just call any time you want a TTS system. It makes it super easy to run and very performant, and like I showed you before, you could use any of the blended voices in here as well if you wanted to do something like that.

So one of the cool things I like about Kokoro is not just how good it is quality-wise, but that it's also very easy to get started with and use on your system, a lot easier than many of the previous open TTS systems we've seen in the past. Anyway, have a play with this; as always, I'll put the Colab and the links in the description so you can get started and play with it yourself. Let me know in the comments how you'd actually want to use this. We could certainly take this and combine it with a speech-to-text or ASR system to start to be able to have a local agent you could have a conversation with, and don't forget, the cool thing with this approach is that you don't have to pay for any APIs to do any of that. All right, as always, if you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.