what's going on everybody welcome back to the excel tutorial series today we will be looking at how to clean data in excel [Music] now knowing how to clean data in excel is actually extremely useful there are a ton of techniques to do this i'm going to be showing you the ones that i probably use the most i feel like are the most helpful to kind of do the bulk or the majority of the data cleaning that you're going to do in excel like i said there's so many different ways and very specific things that you
can do but i'm going to highlight some of the bigger ones that i find the most useful and some of you may be thinking well i'll just do my data cleaning in sql or python or when i get it ready to put it in tableau but honestly a lot of the data cleaning at least a lot of the big stuff i tend to do in excel if the data set is small enough to fit in excel and so i think it's actually really really useful to know how to do this because you'll most likely be
doing it more than you think now before we jump into the tutorial i want to give a shout out to the sponsor of this video and is a brand new sponsor it is unlocked by z by hp unlocked is a movie that's actually broken up into four parts and each of them have a unique data science challenge associated with it now i'm going to read this next part because it's extremely interesting each challenge represents a different topic so there's data visualization text analysis audio signal processing and computer vision and you can submit your answers in
your work on their website for a chance to win one of 10 zbook studio laptops or a free trip to the kaggle world championships so i'll leave a link in the description where you can go watch the movie and then do the challenges and then submit your answers for a chance to win you should also go check out their hackathon where you can do these projects with other people just like you who are trying to figure out these answers and submit them to win as well so go check that out thank you again to the
sponsor of this video unlocked by z by hp now without further ado let's jump onto my screen and get started with the tutorial all right so let's jump right into it i have this u.s president's data set i got the base data set from kaggle uh but i added some my own data and then i messed some stuff up as well just to kind of demonstrate some of these things that we're going to be looking at today this is not a full project so you know we're actually going to be using this to create any
visualizations or anything like that so you know all this is just for demonstration purposes but we will be doing a full project in about two or three videos uh in this excel series where we're going to be doing from start to finish with a real data set so you know if that's something that you're you're wanting then we'll absolutely be doing that now something that you may be wondering is how do you actually identify what you need to clean in the data what do you know to look for well some of the obvious things are
things like formatting and standardization so things like you know this james monroe is in all caps that happens all the time within real data um and so you know you want to standardize that or this all lowercase you want to standardize that you want that all to be the same there's also things like um right here where we have this wig and this wig with a bunch of random stuff after it this happens all the time where it's not completely standardized and you may even notice um you know there are some spelling errors in here
and i'll we'll kind of look through that in a little bit and then you know there are things like additional spaces where there shouldn't be spaces there are things like currencies that you need to be aware of if you were importing this into or we're going to be importing this into a sql database things like currencies can be just a problem or be really unnecessary it may actually cause more issues in the long run so you may just want to you know take that to the base value and then dates are always an issue always
always always so always look at your dates make sure they're formatted correctly make sure they're all the same these are the types of things that right when i glance at this data set these are things that i'm looking for one other thing that is actually the first thing that we're going to start out with is you want to make sure that your data is not duplicated because if your data has duplicate data in it and you don't want that it's not supposed to be there there are some specific use cases where duplicated data is okay
um you know you want to get rid of that and it's very easy to do in excel the first thing we're going to do we're going to go to this data tab we're going to go right over here and we're going to get see if there's any uh duplicates in our data so we're just gonna go up to remove duplicates it's gonna automatically choose all of your columns to to check against so it's gonna for from a all the way through i it's gonna see is the exact same data in all these rows and if
it is it's going to get rid of it and so we're going to click ok and it did find one duplicate and i'll show you that one real quick um because you know it was right here so barack obama was here twice and then i'm gonna hit control i hit control z to go back i'm gonna hit control y to go forward and it removed that uh that row completely now in this example you may be able to spot that with your eye but in a real data set where you have ten thousand a hundred
thousand rows there's absolutely no way you're going to see that uh or very very unlikely that you are going to see that there's duplicated data in there so just running a quick um dedupe or or removing of duplicates that is really important to make sure that you have gotten rid of those things so that's one of the first things that i do um we're going to go into a lot of these different columns and i'm going to kind of show you different techniques or things that i do when i look at actual data so i'm
going to come right over here i'm going to insert and this is what i actually do i usually create a separate column especially when i'm working with this because i don't want to change this one um i don't want to go in here and you know say equals upper equals proper et cetera there's a lot of different ways that you can change names or not a lot but the main ones that you can change names and all of them are completely okay so for example i'm going to hit equal upper oops upper and i'm gonna
go like this and close my parenthesis so i selected the cell i close my parenthesis i hit enter it is comple and i'm gonna hit um in the bottom right i'm gonna double click this it's gonna apply to all of them it is completely okay to have your data like this if you want it to be like that if you want to be all lower you can do that if you want to be in proper case you can do that there are oops there are different uses for all of them and honestly as long as
it's all the same typically it's okay but if um you know for example if you're selling this like a third-party company or something like that they may have what they want for their ingestion process when they take your file in if you send you know a weekly file or a monthly file they may want it exactly how they want it and you can change that to to what they want but as long as it's standardized for you it's all the same for you that is a good thing so now we have all of these um
in the proper case that's typically what i i do or i use upper those are the ones i use the most i don't usually use um lower and if you go in here and you type in lower you know it changes it to all lower i don't typically do that and i'm going to add i'm going to oops i'm going to say president dash fixed and so now all of these names all of these different uppercase and lowercase these are all fixed and it just makes it so much easier to read and you don't have
different uppercase and lowercase issues it's all the same so i'm going to keep that right there if we move a little bit to the right if you look at this prior now this prior is a mess it has stuff all over and to be honest this is not really something that i would probably be using um like in a real data set i would look at this column and i'd say this is pretty useless um if i had a very specific use case for this this data in this column i might try to you know
parse it out and do something but i don't uh this this is a completely useless comp to me so i'm actually going to skip this one i'm going to go to this party one in this party one to me it looks pretty important because this is something that i know i can group by and i can create visualizations with and and kind of break that out and if you look right here we're going to add we're going to add a filter so now let's open up party and take a look so if we look right
here we have democratic democratic dash republican federalist nonpartisan republican republicans wig and wig with a date and some information on the back of it and then some blanks um and it's really important when we when we're looking at these ones that we think we might group by that we have these um properly grouped so republican and republicans to me right off the bat looks like a spelling error and so i'm just gonna deselect all i'm gonna go to republican republicans and it's literally republican all the way down except for this last one and to me
that's just something that i would update so i would just go right here i do that if i didn't do that and then i try to create let's say a pivot table on here i'll have its own group of republicans and it wouldn't be added to republican and maybe that's on purpose but let's just presume that we know this data extremely well and that's not supposed to be like that right again that just comes back to knowing your data really well understanding what it um you know what it should look like and we know that
it should not be like that so we're going to fix that the next thing that we're going to fix and as you can see it got rid of it next thing we're going to fix is this wig that's just like an error that's that's some issue on the the data side and we're just gonna fix that by updating it and that's it i would always be keeping um a a copy of this with the raw data uh somewhere else because this is presumably like a working document this is not a um you know you you
aren't saving over your original file let's just say that and then let's take a look at these blanks real quick um okay so there are these rows right here that have nothing i i think we're okay but if we see anything different 4748 okay so yeah it's just these ones right here that have no data in it anyways it's just seeing it in the filter so not an issue at all so okay we're looking good we've gone all the way over we fixed this president we skipped this one we we cleaned up this party and
i kept this one in here because i'm not exactly sure if that's a democratic or republican so i'm gonna keep it its own thing um i'm not a huge uh history buff in that aspect the next one right here is um the next one right here is really easy uh this is something that happens all the time especially on actually most often it happens on numerical data so like uh you know there'll be a number of 1001 and then there'll be a space after it for absolutely no reason and it happens all the time it
does happen like this as well where you'll see this and all you got to do is do trim and select the the cell we're gonna close that parenthesis and we're gonna apply that all the way down what is so fantastic about the trim is that it's really intuitive and it knows basically everything it needs to do for example um it gets rid of the spaces before it gets rid of extra spaces in the middle and um it'll get rid of extra spaces at the end which you wouldn't be able to see but they are there
and they they absolutely can cause issues if you have spaces at the end that you cannot see let's take this one for example like if i had spaces at the end that can cause issues when you insert or put that into a database that happens a lot with numbers you know when you're putting that into sql that can cause issues and so you really it is important to actually do that trim and you can do that on all of your columns or just ones that you know you're having issues with but once you import that
data into sql you will know if there's an issue or not when you actually try to start using it so we're going to say vice and we're going to say fixed oops there we go this next one is one that you'll run into a lot when you're working with numerical data you will encounter so many different issues um one that i run into a lot is i've worked with a lot of cost data or pricing data and when it's in excel it sometimes comes in with um these currencies like a dollar sign a pound sign
things like that and when you put that into sql it just is a nuisance right you're not going to be able to run um it's going to go in as a text or it's going to be like a string right because it has that special character and you don't want that you don't have to then go in and then change things around you just want to be able to start you know doing calculations on those numbers so what you can do is sometimes it'll come in as a text sometimes it'll come in as um currency
which i think this one's a currency we're just going to change that to be a number and then we're going to get rid of these oops and get rid of those that it doesn't look as pretty but that is much more useful than actually having the currency on there with the decimals this actually is so much easier when you when you want to use it for almost anything because you're able to add and do things properly in other systems in excel i think it does understand it but you know that can cause issues so there
is how you do that the next thing that we're going to look at is these dates and just notoriously whenever i see a date field i know there's going to be an issue with it it's very rare that i get a date field that is perfect uh it just it it genuinely is um is a novelty when that happens and most of the time it has to do with um let's say a date comes into excel and it's in a text format or date comes into excel and they're not the same in this example they
are not the same um and we just want them to all be similar they say date if you look right here it says date it says date it looks like it should be the same but if we go like this it all looks the same right there's no issues at all if we were to um try to use that it may or may not be an issue but we don't want to leave that to chance later on if you're using this with python or something like that it can cause issues maybe not in sql because
it may see the underlying um what's in the underlying cell not just what we see but some systems won't and so you want to make sure that they're all the same and so you know what we were doing back here with um oops with the party and we were looking at this uh this filter and identifying the issues i usually do that on date fields as well and and oftentimes um you know just for just for demonstration purposes oftentimes i will get something like that and then i'll come up here and i'll notice that there's
this one random number that happens all the time all the time and so you want to make sure that you that you look at these things and just just do at least a quick glance if not kind of doing a kind of a deep dive into it but all we're going to do is we're going to do both of these and we're going to do a short date and let's take a look and see if that fixed it and so now they are all the same format and that is fantastic that is exactly what we
want we're going to go back through here we're going to get rid of these again this is a working um this is a working document oops uh we need to we're gonna i'm gonna do control shift down oops let me go back up do ctrl shift down and copy and what i'm going to do right now is i'm actually going to copy all right let me do it right here i'll show you sometimes i do this doesn't just depends i'm going to go right here i'm going to hit right click and i'm going to paste
as a value which means it's not going to take the calculation of the formula that i just did it's going to actually paste it as that value so we just replaced it right here you can see up here it says equals trim of g2 this now now that i copied and pasted it over as a value it got rid of that calculation and now it is actually a string so we don't need this anymore and i'll do the same thing over here as well i'm going to control shift down copy and i just hit the
right key or the left key sorry now i'm going to right click and i'm going to do paste as a value and again it has this proper and now it doesn't have the proper it's actually the value that was here so that's really important to note and we're going to get rid of that one and so now what we have is is already looking much better now one of the last things i want to look at is deleting columns that we are not going to use and this is why it's so important to keep a
backup or or the raw data not in this file because if you start saving over this file this is your raw file that can mess up a lot of things and that happens to me before and it's terrible and then you have to request another file or you have to go back and find it or something like that it's terrible um so so this is our working document so we can mess with this and do whatever we want for our purposes now for us um i can already tell you that this prior is a bunch
of nonsense and we do not need it we're not going to use it for anything and it and if we have um this is a small very small data set it only has like let's say you know one two three four five six seven eight we have like eight columns that we're you know kind of using that has data eight or nine now that's a small data i've had ones with literally like hundreds and it has so many columns uh so much data and sometimes it's good to just trim it back to the things you
know you're going to use this to me is absolutely useless we're going to delete that and then right over here it's pretty redundant um it's just one number off but if we scroll down just a little bit um it goes it's basically just counts it's a i you could even call it a unique um identifier if you want sure why not but we don't need both um so we're going to get rid of this first one and now we have more of the useful and relevant data rather than the stuff that we absolutely know that
we are not going to use um these date updateds and date creators we may never use them but we might um so it doesn't hurt to keep it on hand those other ones are ones that we are almost certain we will never use again keep a backup just in case you need it you can always go back and get it so you know if you go back to what we started with and you look at what we have now it is much cleaner it's much more usable and these are small subtle changes um especially with
this very small data set of only like 50 rows or 46 rows but you're gonna be working with data sets that are thousands tens of thousands hundreds of thousands of rows and you need to know how to kind of look at this data standardize it um format it properly for what you're going to be using it for if you're keeping it in excel there are different things that you may do than if you're putting it into a database or going to be using it with you know using python to access it so you need to
kind of know your use case but these are some things that i do all the time to kind of clean up the data before i use it for something whether i'm creating pivot tables or i'm inserting it into or i'm putting it into sql these are things i do all the time and so hopefully that helps give you kind of an idea of some of the things that you should be looking for when you're actually cleaning data and it's really important to understand why you're actually making these changes and the reason you're making these changes
because some of the things that i did today may not be things you want to do on a different data set that has different uses and different purposes for so you know take everything that i've said and apply it with a little grain of salt to your data set because your specific needs may be different than what i wanted when i was cleaning my data set so i hope this was helpful i hope this gave you a small glimpse of some of the things that i'm looking for when i clean a data set or i
get a new data set in and i'm kind of you know analyzing it figuring out what i need to fix in it i hope this has been helpful with that being said thank you so much for watching i really appreciate it if you like this video be sure to like subscribe below and i'll see in the next video [Music] you