This free AI Text-to-Speech is insane! Add emotions & make podcasts

129.96k views, 7614 words

AI Search

F5-TTS full tutorial, installation, testing. Free, open-source AI voice cloner with expressive voice...

Video Transcript:

This is the best text-to-speech voice cloner I've used yet. You can control emotions for the voice: "What if no one likes it? What if all this effort was for nothing?" "After countless late nights I'm exhausted, but I know it's worth it to chase my dreams." You can easily generate an audiobook or podcast with it: "I totally get that, Anna. It's scary to think about missing out on experiences, right?" "And honestly, I'm just tired of feeling this way." "Let's dive into today's topic and shake off those blues." And you only need a few seconds of a reference voice. Best of all, it's free and open source. So in this video, I'm going to show you how to use it and how to install it locally on your computer.

So first of all, this tool is called F5-TTS, and it is based on the diffusion transformer architecture, which is also the backbone behind all the best image and video generators out there right now. It turns out that this diffusion transformer architecture also works well with text-to-speech and voice cloning. So here are some examples.

First of all, the impressive thing is you only need a few seconds of a person's voice. For example, here is the reference audio. It's only 5 seconds. Let me play this for you: "Some call me Nature. Others call me Mother Nature." So with just 5 seconds of audio, if you get it to read out this script, here's what you get: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences." And it even works well with Chinese. So here's the same voice, but reading a Chinese script. Pretty cool.

So here's another example. Here is the reference audio. This is the original voice. Are you familiar with it? "Slice the steak and place the strips on top, then garnish with the dried cranberries, pine nuts, and blue cheese." All right, a pretty dramatic voice. So that's 9 seconds of reference, and if you get the same voice to
read this script, here is what you get: "Perhaps they are driven by the delicious blend of flavors, or it could be the appealing visual presentation. At the end of the day, our choices in food reflect our personal preferences and sometimes even our lifestyle or belief system." So you can see, or hear, I should say, that it matches the tone and expressiveness of the original voice very well. And again, here's a Chinese example with the same voice. So again, even though it's in a different language, it still matches the tone and expressiveness of the original voice. Very impressive.

Here is an example of a female voice. Let me play you the original voice first. This is 14 seconds long: "You don't know how much trouble you've gotten yourself into. Look, if one of the others get to you first, they'll report you. Alpha Grant has a search out, and if they see you on human territory, they'll be shunned." All right, a very scared and panicky voice. Now if you get it to read this text, let's see what we get: "Your safety and the pack's reputation are at stake. Your bravery is admirable, but sometimes bravery is knowing when to retreat. Please consider returning with me. We can work out a plan, but only if you're willing to listen." And again, you can hear it preserves that scared and panicky voice.

And there's more you can do with this. You can also mix different languages in the same sentence. Let me play you the original voice first. Wow, it sounds like a very dramatic evil witch. Anyways, if you get it to read this, there's a mix of Chinese and some English words. Let's see what we get: "perance". Very, very interesting.

And you can also control the speed. So here's the original voice: "Some call me Nature. Others call me Mother Nature." And then here's the text it will read out, but instead of just 1x speed, you can specify, like, 0.7x speed. Let's hear what that sounds like: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and
enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences." All right, so overall pretty good, except for this part here, where it kind of ignored the semicolon and there wasn't a pause. Anyways, let's play this at 1.3x speed now and hear what that sounds like: "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences." So it indeed reads this at a faster pace. Again, it kind of ignores the pause after the semicolon here. The semicolon is probably throwing it off, but other than that, this sounds pretty good.

And there's more you can do with this, and this is the most impressive feature of F5-TTS. You can upload different clips of the same voice but with different emotions, and then you can get it to output this text in that emotion. So for example, let me play you the reference, the original clip, first. This is her happy sound: "Kids are talking by the door." So let's hear how this sounds in a happy tone: "I was, like, talking to my friend, and she's all, um, excited about her, uh, trip to Europe, and I'm just, like, so jealous, right?" All right, so you can hear it does kind of have that happy tone in there.

And then here's her sad voice: "Kids are talking by the door." And notice this sample is only 2 seconds long, but it's still able to clone her voice and get it to speak this sentence, which is pretty impressive. All right, anyways, let's hear the sad version of this now: "Like, talking to my friend, and she's all, um, excited about her, uh, trip to Europe, and I'm just, like, so jealous, right?" Not bad.

And then finally, here we have a fearful voice. Here's the original voice: "Kids are talking by the door." All right, and then here is that voice reading this text: "I was, like, talking to my friend, and she's all, um, excited about her, uh, trip to Europe, and I'm just, like, so jealous, right?" And you can hear
it does preserve that higher-pitched, slightly frightened voice from this original clip. So, a very powerful tool.

Finally, this is also shown to do well with hard sentences like these ones. Let me try to read this: "Active artists always appreciate artistic achievements and applaud awesome artworks." Oh my goodness, what a mouthful. So let's play this original voice: "You need not think to keep out of the way of him." So just 3 seconds of audio, and let's hear the output of this voice reading this script: "Active artists always appreciate artistic achievements and applaud awesome artworks." Pretty good. Now let's hear another example: "After which, in his preoccupied way, he explained." Here is the hard sentence she needs to read out. Let's see if she can pull it off: "Brave bakers boldly baked big batches of brownies in beautiful bakeries." There you go, very good. And then just one last example. Let me play this one for you: "Became more deliberate and watchful." All right, and here's the output: "Daring dancers dazzled during dynamic dance displays, drawing delighted crowds." So overall, not bad.

But enough demos. Let's actually jump in and try this out now. I'll link to both this demo page and their GitHub page in the description below, and they actually have a few places for you to try this out. Currently, I believe this works with a CUDA GPU, and you need at least 8 GB of VRAM. If you don't have that, you can always use these online spaces. For example, here is a Hugging Face Space which allows you to do the same thing as the local version. Now, for this video, I'm also going to show you how to install this locally, so let's do that right now in order to set up this interface on our computer.

By the way, I'm using a Dell Precision 5690. You can integrate a powerful RTX 5000 Ada into this, so this is a perfect compact combo for running AI tools locally. Huge thanks to Dell and Nvidia for sponsoring this.

All right, here is how you install this and run it locally. If you go to their GitHub page, which I'll
link to in the description below, in the middle of the page it gives you all the instructions to install this. Now, first of all, we need to git clone the repository, and in order to do that, you do need to have Git installed on your computer. Here is how you install Git. If you already have Git installed, feel free to skip to the next section.

All we've got to do is download the latest release for whatever operating system you're using. I'm using Windows, so I'm just going to click on "Download for Windows". I'm running 64-bit, so I'm going to click on this to download, and it's now downloading this .exe file. Once that's completed, all we've got to do is open that .exe file and then follow the steps. I'm going to click Next. I'm just going to go with the default install location, which is Program Files\Git, so I'll click Next for that. Then I'm just going to leave this at the default, click Next again, and click Next here. We're just going to use the default settings for all of these. There are a lot of settings that you need to go through, so I'm just going to click Next for all of these. All right, and then it should go ahead and install all the files. This might take a few minutes. Perfect, so now we have Git installed.

All right, so assuming you have Git installed on your computer, first of all go to whichever location you want to have this installed. For me, I'm going to install it on my desktop, so I'm going to go to Desktop, and then in the top bar here type in cmd, and this will open up Command Prompt with the current folder as Desktop. So let's open this back up. I'm going to copy this first line and then paste it in here. Perfect. This is basically cloning this repository into a folder on my desktop. If you open up my desktop again, you can see that now I have this F5-TTS folder, and if I open that, you can see that it basically cloned all the files that we see in this repository.

All right, so the next step is we need to change the directory to this
folder, because right now we are still in Desktop. So we need to go one folder in, to this new folder that it just created, F5-TTS. All right, so right now we are in this F5-TTS folder, which again should look like this.

Now, the next step, before we run these lines of code, is to create a virtual environment. This is basically an environment that contains all the packages and dependencies that this tool uses, and it's separate from all the other tools that you're currently using on your computer. This is important because this tool might use some packages and dependencies that are different versions and might conflict with other tools on your computer. So you want to create a separate virtual environment and install everything there.

Now, to create this separate virtual environment, you need to use Anaconda. If you don't have Anaconda installed, here's how to do it. If you have it already, feel free to skip to the next section. Now I'm just on anaconda.com, and actually what I'm going to do is install Miniconda. This is a minimalist version of Anaconda. If you install the full Anaconda, it installs a lot of packages and dependencies that you might not need. This just takes up more room on your computer, and of course the installation time is a bit longer. But with Miniconda, it's just a bare-bones package, and you can always install additional packages and dependencies afterwards. So I'm going to click on "Latest Miniconda installer links by Python version", and I'm using Windows, so I'm going to install one of these. Now, free and open-source AI tools usually do not support Python 3.12, so it's better to install the Python 3.11 version. I'm going to click on this, which should download an .exe file to your computer. Once it's finished downloading, simply double-click on it and then follow the steps to complete the installation. I'm going to click Next and then agree, and then let's set this to All Users. I'm going to go with the default destination folder, and then I'm going to check this as well: "Clear the package cache upon completion". This just gives you back some more disk space without affecting functionality. All right, once that's completed, let's click Next, and then we are finished.

Now, we aren't done yet. If you open up Command Prompt and type in conda --version, you're still going to see that conda is not recognized. This is because we haven't added Anaconda to our PATH yet. So let's exit out of this, and then to add it to our PATH, we simply search for this function, "Edit the system environment variables". We're going to click on this, then click on Environment Variables, then click on the one that says Path, and then click Edit. Here's where you add in the path of Anaconda. It depends on where you installed Anaconda. For me, I installed it in ProgramData, so it's going to be in ProgramData\miniconda, and then if I double-click on Scripts, you can see that conda is here. So this is the folder we want to paste in. I'm going to right-click on this and then Copy as path, and then back in the Environment Variables window, I'm going to click New, paste in the path here, and then click OK, and then OK, and then OK again. Now if you open up Command Prompt again and type in conda --version, you should see that we are running 24.5.0. This shows that we have successfully installed Anaconda.

All right, so assuming you have Anaconda installed, once you're in this F5-TTS folder, let's again type in cmd to open up Command Prompt within this folder. Then we are going to type in conda create -n, which is telling it to create a new environment. Let's call this F5, and then for Python, let's set it to version 3.10, which is what it specifies here. I'm not sure if it works for later versions of Python like 3.11 or 3.12, so just to be safe, we should always go with whatever Python version they specify on their GitHub. So in our case, we are going to type in 3.10 and then press Enter. Now it's going to go ahead and create this virtual environment based on Python 3.10. I'm going to press Enter to proceed, and this is going to take a while to install everything. Once that's done, you should see these two lines, and now we need to activate the environment before proceeding further. So let's type in conda activate and whatever we named our environment, in our case F5, and you can see that the environment is activated because it has the environment name in parentheses before every line.

All right, so the next step is to install torch and torchaudio based on your CUDA version. This is very important, and so yes, you do need to have a CUDA GPU for this, at least at the time of this recording. In order to check your CUDA version, let's open up Command Prompt and simply type in nvcc --version, and you can see that we are running CUDA 11.8. So let me exit out of this. This is what this cu118 stands for at the end of this URL. So if you're not using 11.8, if you're using 12.1, then this 118 that you see here and here should be changed to 121. But anyways, we are using CUDA 11.8, so I'm just going to copy both these lines and then paste them all in here. The first line is installing torch, which is 2.7 GB, so this is going to take a while. I'm going to pause the video and come back when it's done. All right, so right now it has successfully finished installing torch. Next we need to install torchaudio, which is basically the second line we pasted in. I just need to press Enter here, and perfect, we have successfully installed both torch and torchaudio.

Now the next step is to install all the requirements in requirements.txt, which are listed in this file. It needs to install all these dependencies, so I'm going to go back here and copy this line and then paste it in here. Again, because there is a long list of dependencies, this might take a while depending on the speed of your internet connection. All right, if all goes well, you should see all of this with no errors, and that signifies that you have installed all the requirements.

Now let's go ahead and actually run this interface. Let's copy this line of code and then paste it in here. If this is your first time running this Gradio interface, it's going to download some additional models. For example, this model.safetensors file is around 1.6 GB, so again, it's going to take a while to download. All right, and if all goes well, you should see this link. I believe you need to hold down Ctrl and click on this link, and it will open this up in your browser. Now, this is completely offline. Even though this opens in your browser, this is just a Gradio interface, and you don't need to have internet to run this. And that's basically how you install this and get it up and running.

Thanks to Wondershare Filmora for sponsoring this video. Filmora is a video editor loaded with AI features, and I personally use Filmora for all my YouTube videos. They've just released version 14, and it's filled with AI superpowers. The AI Copilot Editing feature allows you to edit a video just by talking to it like a chatbot. They also have a Smart Short Clips feature, which automatically converts your long-form videos into short clips for social media. Plus they have an AI audio enhancer, which uses AI to turn your audio into studio quality with just a click of a button: "Hey everyone, welcome back to my channel." "Hey everyone, welcome back to my channel." They also have a new AI sound effect feature, where you can create any sound effect you want with just a single prompt. You can also use AI to remove vocals and denoise audio with ease. They also have speech-to-text, so you can easily create subtitles, as well as text-to-speech for easily making voiceovers. They also have AI smart masking and AI smart cutout, which allow you to remove objects or change the background of your video in seconds. Publishing is also made easy, so you can make thumbnails in seconds with their AI thumbnail creator, plus easily create titles and captions with their AI copywriting feature. Edit videos like a pro and save a ton of time with Wondershare Filmora 14. Try it out for free via the link in the description below.

Now, once you have it up and running, all you need to do is upload an audio file to use as a reference voice. I'm going to upload a sample clip, so let me just play this for you.
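As a recap of the install sequence from the walkthrough, here is a minimal Python sketch that turns a CUDA version into the wheel tag mentioned in the video (11.8 becomes cu118, 12.1 becomes cu121) and prints the commands in order. The names `cuda_tag` and `install_plan` are made up for illustration. The `--index-url` form and URL pattern follow PyTorch's public wheel index, which may differ from the exact lines on the project's GitHub page; any version pins shown there are not stated in the video, so they are left out. Treat this as a sketch under those assumptions, not the project's official installer.

```python
import re


def cuda_tag(cuda_version: str) -> str:
    """Turn a CUDA version such as '11.8' into a wheel tag such as 'cu118'."""
    match = re.fullmatch(r"(\d+)\.(\d+)", cuda_version.strip())
    if match is None:
        raise ValueError(f"expected MAJOR.MINOR, got {cuda_version!r}")
    major, minor = match.groups()
    return f"cu{major}{minor}"


def install_plan(cuda_version: str, env_name: str = "F5",
                 python_version: str = "3.10") -> list:
    """Return the commands from the walkthrough, in order, as plain strings.

    Assumption: unpinned torch/torchaudio from the PyTorch wheel index.
    The project's README may pin specific versions instead.
    """
    index_url = f"https://download.pytorch.org/whl/{cuda_tag(cuda_version)}"
    return [
        f"conda create -n {env_name} python={python_version}",
        f"conda activate {env_name}",
        f"pip install torch --index-url {index_url}",
        f"pip install torchaudio --index-url {index_url}",
        "pip install -r requirements.txt",
        "python gradio_app.py",
    ]


if __name__ == "__main__":
    # Print the plan for the CUDA version reported by `nvcc --version`.
    for command in install_plan("11.8"):
        print(command)
```

The point of keeping the tag in one function is that only the CUDA version changes between machines; everything else in the sequence stays the same.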
"This indie film festival looks fascinating. Shall we go and broaden our cinematic horizons?" Let's type in some sample text like "hi there", and if you click Synthesize, this is probably not going to work yet. You can see we got an error here, and if you open up Command Prompt, it says ffmpeg was not found. This means that we don't have ffmpeg installed on our computer yet, so now I'm going to show you how to install ffmpeg from scratch.

You'll need to go to this page called gyan.dev. The colors are very strange for this page; it seems to be very desaturated. But anyways, on this page click on ffmpeg-git-full, and you can install this wherever you want. Then let's select this folder and extract it to our C drive. I'm going to press OK and then exit out of this. Right now, on our C drive, you should see this ffmpeg folder, and if you open it, it should contain these files. Anyways, let's go back one folder and rename this to just ffmpeg so that it has a shorter name.

The next step is to add this to our environment variable. In your Windows search bar, search for "Edit the system environment variables". Let's open this, then click on Environment Variables, and then in System variables scroll down until you find Path. Click on Path, then click Edit, and then click New. After you've done that, let's open up this ffmpeg folder, and for this bin folder, we're going to copy this path. I'm going to right-click and then click Copy as path, and then back over here, let's click New and paste in the path, but without the quotation marks, and then press OK. So again, press New and paste in the path to your ffmpeg\bin folder, and once you've done that, click OK to exit out of this, and then OK again. To verify that you have this installed, open up a new Command Prompt and type in ffmpeg -version. You should see this, which signifies that it is added to your environment variable.
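The "ffmpeg was not found" error and the `ffmpeg -version` check can also be done from Python before launching the interface. This is a minimal sketch using only the standard library; the helper names `find_tool` and `tool_version_line` are made up for illustration, and this is not how F5-TTS itself performs the check.

```python
import shutil
import subprocess


def find_tool(name: str = "ffmpeg"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def tool_version_line(name: str = "ffmpeg"):
    """Return the first line printed by `<name> -version`, or None if missing."""
    path = find_tool(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    line = tool_version_line()
    if line is None:
        # PATH changes only apply to terminals opened after the edit.
        print("ffmpeg not found on PATH; add its bin folder and open a new terminal")
    else:
        print(line)
```

If this still reports that ffmpeg is missing right after editing the Path variable, the usual cause is an old Command Prompt window that was opened before the change.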
Going back to Gradio, let's actually exit out of this first. We'll need to restart it afterwards. Now, going back to our F5-TTS folder, again at the top type in cmd to open up your Command Prompt, and first of all we are going to type in conda activate and then F5, which is what we named our virtual environment. After that is done, let's type in pip install ffmpeg to install ffmpeg. All right, after that has finished installing, I believe we also need to install ffmpeg-python. All right, after it has finished installing this, we should be all good to go now.

So we've got to start everything again, and in order to do that, we simply open up our F5-TTS folder, and at the top here let's type in cmd, which will again open up Command Prompt within our selected folder. The first thing we need to do again is use conda to activate the virtual environment, which we named F5. So let's press Enter, and you can see now that all lines start with F5 in parentheses, which means we are now in this virtual environment. The next step is to basically open up the Gradio interface, so we need to use Python to open this file, gradio_app.py. So I'm going to type in python gradio_app.py, and since we've already downloaded all the models the first time we ran this code, this time it's going to be a lot faster. All right, perfect. I'm going to hold Ctrl and click on this link to open up the Gradio interface.

Now let's drop in an audio file here. Note that the audio you drop in should be less than 15 seconds, and ideally it should be in WAV format. You can try MP3; I just prefer using WAV format, which is better quality. If your reference audio is over 15 seconds, it's actually going to automatically cut it to 15 seconds anyways. This is actually pretty revolutionary, because most of the previous tools that we used, like RVC, require at least a few minutes of audio to train a voice, but here it just requires a few seconds of audio and it can clone the voice from that sample, which is pretty crazy.

All right, so I'm going to use this 8-second clip of this American female. Let me play this for you: "Hi there. Need a smart, confident, friendly young adult voice? I'm ready and willing, so let's get started and get your audience absolutely hooked." All right, and that's pretty much it. For the text to generate, I'm going to paste this line, which I just generated with ChatGPT. I'm going to select F5 first. I'm also going to show you E2 in a second, but let's just do F5, and then let's click Synthesize. Because this is only dealing with audio, it's actually pretty quick. You don't need any high-end hardware to make this work; you can get it to work with probably 8 GB of VRAM. All right, so you can see there it only took like 30 seconds to generate. Now let me play this for you: "You know, it's funny how we spend so much time trying to predict the future. I mean, look at me right now. I came here thinking I'd find clarity among the rustling leaves and chirping birds. Instead, I'm struck by this overwhelming sense of uncertainty. And that's okay, isn't it? There's a certain beauty in not knowing what comes next." All right, so that's pretty good, and it even gives you a spectrogram.
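On the 15-second limit for reference audio mentioned above: if you would rather choose the cut yourself than rely on the automatic one, a WAV file can be trimmed ahead of time with Python's standard `wave` module. This is a minimal sketch assuming an uncompressed PCM WAV file; `trim_wav` and `MAX_SECONDS` are names made up for illustration, and this is not the trimming logic the tool uses internally.

```python
import wave

MAX_SECONDS = 15  # reference-clip limit mentioned in the video


def trim_wav(src_path: str, dst_path: str,
             max_seconds: float = MAX_SECONDS) -> float:
    """Copy at most `max_seconds` of audio from src to dst.

    Returns the duration of the written file in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        max_frames = int(max_seconds * params.framerate)
        frames_to_copy = min(params.nframes, max_frames)
        audio = src.readframes(frames_to_copy)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and sample rate from the source.
        dst.setparams(params)
        # writeframes() also corrects the frame count stored in the header.
        dst.writeframes(audio)

    return frames_to_copy / params.framerate
```

A call such as `trim_wav("reference.wav", "reference_15s.wav")` (file names hypothetical) keeps the first 15 seconds; clips that are already shorter are copied unchanged.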
And to download it, you can click on this button. All right, so down here you can choose either F5-TTS, which is the AI that we're featuring today, or E2 TTS, which is another text-to-speech model, by Microsoft. Note that the newer one, F5, has some significant improvements over E2. F5 has, in theory, better quality: they claim that it has fewer artifacts and clones the voice more accurately. However, you could play around with both of these to see which one you prefer. Right now I'm using E2. I'm going to click Synthesize so you can hear the difference, and notice the time up here. Right now it's taking like 19, 20, 21 seconds. This is pretty quick. So yeah, that only took around 20-something seconds. Let's hear what this sounds like: "You know, it's funny how we spend so much time trying to predict the future. I mean, look at me right now. I came here thinking I'd find clarity among the rustling leaves and chirping birds. Instead, I'm struck by the overwhelming sense of uncertainty. And that's okay, isn't it? There's a certain beauty in not knowing what comes next." So you can hear that with E2, the voice sounds a bit more robotic and less natural, but the difference is quite subtle, and I'm very impressed by both of these AI models, how they're able to take just 8 seconds of sample audio and clone the voice in 20 seconds. That's pretty insane. Anyways, let's click this to download, and then we'll move back to F5 for now.

There are also some advanced settings which I want to go over. Here is the reference text, or basically the transcript of this audio clip. So if we play this: "Hi there. Need a smart, confident, friendly young adult voice?" So for example, if this is what she says, you can actually type in the transcript here: "Hi there, need a smart," et cetera, et cetera. If you know the transcript already and you paste it in here, that just saves it more time, since it doesn't need to auto-generate the transcript, and it provides more accuracy. But usually I just leave this blank because I
don't have the transcript, and it will just auto-transcribe this audio sample for me first. Then here there's also a toggle to remove silence, which basically helps remove silences in your generation. Let's leave this off. And then for the speed, you can actually adjust the speed of the audio, so let's make this super slow, like 0.