Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)

Keith Galli
Video Transcript:
What's up guys, and welcome back to another video. In this video we are not talking about the fluffy animal, or the song by Desiigner, but instead we're going to dive into the pandas library for Python, which I find to be one of the most useful libraries for anything data science related in Python. This works as a standalone video if you've never done anything with pandas, taking you from zero to fairly comfortable in one sitting. It's also a good video if you already have some Python and pandas experience but you're trying to figure out how to do something specific; if you're in that second case, look down in the comments and I'll pin a timeline of what we cover so you can find exactly what you're looking for quickly.

One question you might have as you watch this video is: why pandas? A lot of what you'll see me doing you could probably replicate in Excel, but there are specific benefits to using pandas. The first is flexibility: what you can do in Excel is very limited compared to what you can do with the whole Python programming language. The second, equally important reason is that pandas lets you work with much larger data sets than Excel does; Excel really struggles once you start loading in very large files, so with pandas you can work with genuinely big data. If you're finding this video useful, don't forget to click that subscribe button, because I'll be making a lot more tutorials building on this type of stuff in the future.
OK, to begin this video you should already have Python 3 installed. Then open up a terminal window and type pip install pandas if you don't already have the library; as you can see, I already have it. Once you have the library we can start loading in data right away, and we'll use that data to learn everything we need to know about pandas.

There's a link in the description to my GitHub page, where I have a CSV of the data we'll be using for this video. The data is on Pokemon; I found it on Kaggle, a good open machine learning website where you can do all sorts of challenges, and I thought it was perfect for an introductory video on pandas. You don't have to be a huge fan of Pokemon, but it's a great data set to get started with. Click on the CSV version and you can get a feel for what's in the data: all the different Pokemon and their stats. We'll be doing all sorts of manipulation and analysis on this data throughout the video. Click on Raw, then save the file; I called mine pokemon_data.csv. The .csv extension matters for loading it in properly, but you can name it whatever you like. One thing to note: save the data in the same directory as the code you're writing, so the file is easy to load.

Once you have the data saved locally, open up your favorite text editor. For this video I'm going to be using Jupyter notebooks, because I like using those for data science work, but you can use Sublime Text, PyCharm, or whatever you like to write your code in. I'm going to clear everything I have on the screen right now.
OK, so the first thing we're going to do is load the data into pandas. You can open up the CSV and look at it; it's exactly what you saw on the GitHub page. We're going to load it into what's called a DataFrame. It's important to know about DataFrames, because that's the object type pandas uses to let you manipulate everything.

The first thing we need to do is import pandas to get the library. It's annoying to have to type out pandas every time you reference it, so we usually import pandas as pd. Then, to quickly get our data loaded in, we'll say df (for DataFrame, or you could call it poke for Pokemon) equals pd.read_csv, a really useful built-in function, and pass in the path to the CSV. If the CSV is in the same location as the code you're writing, you can just pass the name of the file, pokemon_data.csv. If I run this, it loads the data in, although you can't see that yet; but if I do print(df), you can see all the data is there in that df variable.

If you don't want to print out all of the data, there are two useful functions for looking at just the top or just the bottom. I can do df.head() and specify a number of rows, say 3; I think the default, if you don't pass anything in, is 5. Now I see just the top three rows, which makes the data a little easier to read. If I wanted the bottom three rows instead, I could do df.tail(3); as you can see, the index changes and we get those bottom rows.
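Here's a minimal sketch of the loading steps just described, assuming you saved the Kaggle CSV as pokemon_data.csv next to your script or notebook:

    import pandas as pd

    # load the CSV into a DataFrame
    df = pd.read_csv('pokemon_data.csv')

    print(df.head(3))   # first three rows
    print(df.tail(3))   # last three rows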
format so if you want to try this or you have a set of data that you're trying to manipulate you can also do I'll load those files in so I can do PD dot read Excel that's another built-in function to pandas and my excel file was like Pokemon data xlsx I believe just check yeah I don't know yeah I think that's the extension we'll get an error if not I'll comment this line out too and I can do a print of DF xlsx ahead three as you can see that same data is read in from
that excel file and then the last thing we can try to do I'll move this just so it's a little bit cleaner just down here comment it out real quick I can also load in that tab separated file so this one's a little bit different I can do PD read CSV Pokemon data dot txt and watch what happens when I run this it's probably not giving me an error let's see oh it didn't print yeah let's see what happens I think it's gonna yeah it loaded it all in is like this one single column so
the difference with this tab separated file and just to remind you what this looks like just instead of having commas that's separating the different columns its tabs we need to in our read CSV function specify a delimiter it's actually separating them in this case it's a tab which is specified by /t I believe I don't remember the differences between forward slash and back slash and yeah look at that we have the columns in the way they were looking when we were just doing the CSV also note for this TSV the tab separated file you could
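A quick sketch of loading the other two formats; the filenames pokemon_data.xlsx and pokemon_data.txt assume you saved the GitHub copies under those names, and reading .xlsx requires the openpyxl package:

    import pandas as pd

    # Excel version of the same data
    df_xlsx = pd.read_excel('pokemon_data.xlsx')
    print(df_xlsx.head(3))

    # tab-separated version: tell read_csv the fields are split by tabs
    df_txt = pd.read_csv('pokemon_data.txt', delimiter='\t')
    print(df_txt.head(3))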
All right, let's move on to the next thing: actually reading our data easily within the pandas framework. The first step is reading the headers. Our data has several headers, and we can see what they are with print(df.columns). As you can see, there's the Pokedex number, the name of the Pokemon, the two types, all of the stats information, and whether or not they're legendary; those are all the columns we can work with. One other note: this is a Jupyter notebook, and I'll save it on the GitHub page once I'm finished with the video, so if you use Jupyter notebooks you can follow along by downloading it from the GitHub page or cloning it.
Now that we know our columns, let's read a specific column. We still have the DataFrame we loaded up above, and say I wanted the names of the Pokemon: if I do print(df['Name']) and run that, I get all the Pokemon, and pandas abbreviates the output so I'm not printing 800 rows. I could also specify that I only want, say, rows 0 through 5, and now I just get the top five names. One interesting thing: you can also do df.Name, attribute-style; that doesn't work for column names containing spaces, so I usually stick with the brackets. If you want multiple columns at the same time, change that single name into a list of column names separated by commas, say Name, Type 1, and HP. Now we're getting three different columns, and they don't even have to be in their original order, which is nice if you want to look at specific things without being cluttered by everything else.
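The column-reading options just described, as a short sketch; the column names 'Name', 'Type 1', and 'HP' assume the headers used in the Kaggle file:

    print(df.columns)                    # list all the headers
    print(df['Name'][0:5])               # one column, first five rows
    print(df.Name)                       # attribute access (only works without spaces in the name)
    print(df[['Name', 'Type 1', 'HP']])  # several columns at once, in any order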
Moving on to reading rows: let me remind you what's in our data set by printing the first few rows again. Let's say we wanted the row at index 1, the one that has Ivysaur, Grass, Poison, and so on. To grab that row we can use the iloc function on the DataFrame, which stands for integer location. If I pass iloc an index of 1, it gives me everything that was in that row. I can also use it to get multiple rows: df.iloc[1:4] gives me all of those rows, so that's another way to get rows wherever you want in the DataFrame. The same iloc function can grab a specific location. Say I wanted the name Venusaur: counting from zero, it's in the second row and the first column, so I can do print(df.iloc[2, 1]), second row, position 1, and as you can see that gives us Venusaur.
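A compact sketch of those iloc lookups; the [2, 1] example assumes the name column is the second column, as in the Kaggle file:

    print(df.iloc[1])      # everything in the row at integer position 1
    print(df.iloc[1:4])    # rows 1 through 3
    print(df.iloc[2, 1])   # row 2, column 1 -- 'Venusaur' in this data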
Building on this, one thing I often find myself doing is iterating through each row of my data set as I'm reading it. To do that, what I'd recommend (there are probably other ways) is for index, row in df.iterrows(). Iterrows, iterate through rows, is probably the easiest way to go row by row and access any data you might want. I can do print(index, row) and run that; it isn't formatted super nicely, but it goes through the first row and its data, then the second row, and so on. What's nice is that if I just wanted the name in each row, I could iterate through and grab only that piece of information. I find this useful for all sorts of tasks when I'm working with my data.
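A minimal sketch of that row-by-row loop:

    # walk the DataFrame row by row; index is the row label, row is a Series
    for index, row in df.iterrows():
        print(index, row['Name'])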
One additional function I want to get to right now, which I'll go into in more depth a little later, is df.loc. In addition to iloc we have loc, which is used for finding specific data in our data set that isn't just integer-position based; it's based on textual or numerical conditions. One really cool thing you can do with it: I can do df.loc and access only the rows where, say, Type 1 is equal to 'Fire'. That should give us Charmander, Charmeleon, Charizard, and so on, and when I run it you can see it's only giving me the rows whose Type 1 equals Fire. I could do the same thing to look at just the Grass Pokemon, and now we get Bulbasaur, Ivysaur, Venusaur, etc. You can keep going and use multiple conditions, so this is super powerful for all sorts of conditional filtering, and I'll get into the fancier, more advanced stuff later in the video.
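The equivalent loc filters, as a sketch:

    # rows where Type 1 is exactly 'Fire' (case sensitive)
    print(df.loc[df['Type 1'] == 'Fire'])

    # same idea for the Grass Pokemon
    print(df.loc[df['Type 1'] == 'Grass'])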
While we're on this topic, another useful thing we can do with our DataFrame is call the describe method, which gives us high-level stats like mean, standard deviation, and so on. For some columns it's not very useful (we don't really care about the mean of the Pokedex number), but for HP, Attack, Defense, Special Attack and the rest it's a nice little method, because you get all these metrics and can quickly get a feel for your data.

Another useful thing is sorting the values. Instead of going from the first Pokedex number downwards, we could sort alphabetically by name: I call sort_values and pass in the column I want to sort by, so sorting by Name gives me alphabetical order. If I want it the other way, I pass ascending=False and it's descending. You can also combine multiple columns: say we sort by Type 1 and use HP as the second sort parameter. That gives us, I'd guess, the Bug Pokemon first, since that's the first type alphabetically, with HP ordered within it, and as you can see, Bug comes first with the lowest HP at the top. We can pass descending here too, but with two columns, a single True or False applies to both, so now Type 1 is descending as well, starting from the last type alphabetically. If you want one ascending and one descending you can pass a list, like ascending=[1, 0], so Type 1 goes A through Z while HP goes from high to low. Sorting values is very useful as well.
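A sketch of the describe and sorting calls just covered:

    print(df.describe())                                       # count, mean, std, min, max for numeric columns

    print(df.sort_values('Name'))                              # alphabetical by name
    print(df.sort_values('Name', ascending=False))             # reverse alphabetical
    print(df.sort_values(['Type 1', 'HP'], ascending=[1, 0]))  # Type 1 A-Z, HP high to low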
Now that we know how to read our data, let's start making some changes to it. Looking at the data again, one change I think would be cool to make: we have all these stat categories, and it would be nice to add them together into a Total column, which could help us rank which Pokemon are the best. One thing that's true about most things in programming is that there are multiple ways to do this. One way, probably the easiest to read but not the fastest, is to create the new column directly: df['Total'] = df['HP'] + df['Attack'] and so on for each of the other stat columns. When I run that I don't see anything, but if I do df.head(5), there on the right side is the new Total column. Whenever you do something like this, I recommend double-checking that it really is the total you were trying to compute: 49 plus 65 plus 65 plus 45 and so on. The total could easily look like a plausible number, and if you don't verify it you're in dangerous territory; here we get 318, which is exactly what we're looking for, so that's one way to do it.

Before showing another way, I want to drop that column first. One thing about Jupyter notebooks is that even if I comment the line out and re-run, the DataFrame is still in memory, so the Total column doesn't go away on its own. To drop a specific column I can do df.drop(columns=['Total']). When I run that, the column doesn't disappear at first, and the reason is that drop doesn't remove the column in place; it returns a new DataFrame, so you have to assign it back with df = df.drop(columns=['Total']). Now the Total column over on the right side disappears. That was dropping a column.

Now let's add the column back in a different, maybe more succinct way. df['Total'] stays the same on the left, but this time I'm going to use the iloc function we learned, integer location. I want all the rows, so the first input is a colon, which means everything; then the columns I want to add together are HP through Speed, which, counting 0, 1, 2, 3, 4, start at the fourth column. Then there's a .sum function, and since we're adding horizontally we specify axis=1; axis=0 would add vertically down each column. And we have our totals again. One thing you might have noticed: because I still had the correct 318 from before, I realized the 273 this produced was wrong, which is exactly why it's good to make that check. The error I made was that the range shouldn't end at 9: to include Speed it has to go one further, because the end of a slice is exclusive, so 10 means the tenth column is the first one we don't include. Re-running with df.iloc[:, 4:10].sum(axis=1), the totals are now correct and match the math we did below.
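Both approaches as a sketch; the stat column names and the positions 4 through 9 assume the layout of the file used in the video, where HP through Speed are named and placed as in the Kaggle data:

    # way 1: spell the columns out (easy to read)
    df['Total'] = (df['HP'] + df['Attack'] + df['Defense']
                   + df['Sp. Atk'] + df['Sp. Def'] + df['Speed'])

    # drop returns a new DataFrame, so assign it back
    df = df.drop(columns=['Total'])

    # way 2: slice the stat columns by position and sum across each row
    df['Total'] = df.iloc[:, 4:10].sum(axis=1)   # axis=1 adds horizontally; the 10 is exclusive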
The last change we'll make before we re-save this as an updated CSV: let's say we didn't want the Total column all the way over on the right side; I'd say it makes more sense for it to sit either to the left of HP or to the right of Speed. We can do this a few different ways, and the way I'm going to choose probably isn't the most efficient, but it makes sense given what we've done already. Remember that we could grab specific columns, like Total, HP, and Defense, and that I can order them however I want. So if I wanted to reorder my columns and keep the result, I could just do df = df[...] in whatever order I choose. Because it's annoying to type out all the column names, I'm going to get the columns as a list with cols = list(df.columns). You don't have to know off the top of your head why this works; I'm looking at the documentation as I do this, and I recommend you do the same. There's great stuff in there, I can't cover everything in this single video, so definitely check the documentation out; I'll link to it in the description.

So I'm getting the columns, and instead of naming them individually I'm going to use ranges: the first four columns, then Total, then the remaining columns. Something like cols[0:4] gets me the first four in the same order, plus cols[-1], which is just reverse indexing to grab Total, and then finally the remaining stats, columns 4 up to 12; we use 12 because that's the first one we don't include. When I first run this I get an error: can only concatenate list (not "str") to list. That tells me something in here is messed up, and what I'm seeing is that because cols[-1] is a single column, it comes back as a plain string, so I have to wrap it in brackets to make it a one-element list. Run it again, and the Total column has switched over to the left side; there we go.

One comment I want to make: as I said before, this kind of change doesn't really modify our data at all, it's just a visual thing, so I didn't care too much about how I went about it. But I really want to note that you should be careful when hard-coding numbers like this. If your data changes, these hard-coded positions break, so using actual column names is safer. Even calculating the Total the way we did is a bit dangerous; instead of using 4 to 10 you could look up the index of your start column, HP, and the index of Speed, and build the range from those. That would be a little safer, I'd say.
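A sketch of the reorder, plus one possible name-based version of the Total calculation along the lines suggested above. The get_loc lookups are my own illustration of "get the index of your start", not something shown in the video, and they assume HP through Speed are contiguous columns:

    cols = list(df.columns)

    # first four columns, then Total (wrapped in a list so the + works), then the stats
    df = df[cols[0:4] + [cols[-1]] + cols[4:12]]

    # safer Total: look the positions up by name instead of hard-coding 4 and 10
    start = df.columns.get_loc('HP')
    end = df.columns.get_loc('Speed')
    df['Total'] = df.iloc[:, start:end + 1].sum(axis=1)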
All right, now that we're done with all of this, let's move on to saving our new CSV. Just as a reminder, the previous cells are where I actually defined this DataFrame; it's not coming out of thin air. I have this DataFrame and now I want to save the updated version, starting with a CSV. Just like we had read_csv, pandas has a built-in to_csv function, so I can call df.to_csv('modified.csv') and it will take whatever is in the DataFrame and output it as nice comma-separated values. Because the next cell ran, we know it worked; I can check my directory, and there's modified.csv. If I open it up, all the information is there, including the Total column, which is cool.

The one thing that might be annoying is that all these index values got saved on the left, and I don't really care to have those. The quick fix, if you don't want to save the indexes with your data, is to pass index=False. Run that again, reopen modified.csv, and that column is gone; now the Pokedex number is the leftmost column, which is perfect.

There's also a built-in to_excel function, so even though we're just working with a DataFrame, it's easy to output to Excel format: df.to_excel('modified.xlsx'), again with index=False. Now I have the actual Excel file; I could load it, but for the sake of time I won't. And finally, since we loaded three formats, I might as well save three formats; the last one is the tab-separated file. We can use to_csv again and call it modified.txt with index=False. The one catch is that to_csv has no delimiter parameter, which is kind of annoying, but there is a separator parameter, sep, you can pass in; look at the documentation if you need a reminder, as I'm doing while I speak. So I specify that I want tabs instead of the default commas. Run that, open it up, and you can see all the data is there, no indexes on the left, all separated by tabs. That looks pretty good.
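The three save calls in one place, as a sketch; writing .xlsx assumes openpyxl is installed:

    df.to_csv('modified.csv', index=False)              # comma separated, no index column
    df.to_excel('modified.xlsx', index=False)           # Excel output
    df.to_csv('modified.txt', index=False, sep='\t')    # tab separated via the sep parameter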
All right, now that we've done all of that, let's move into some more advanced pandas stuff, starting with more advanced filtering of our data. Just as a reminder, this is our DataFrame. The first example I showed before was that we could filter on a specific type with df.loc, saying df['Type 1'] == 'Grass' (note the double equals), so we only get the rows that actually have Grass as their first type. We can also pass in more than one condition. I can do something like Type 1 equals Grass and Type 2 equals Poison; if I join them with a plain "and" and run it, we get an error. You have to wrap each condition in parentheses and use the & sign instead of the word "and" (the & operator binds more tightly than ==, so without parentheses Python misreads the expression). With the two conditions in parentheses joined by &, we only get the Grass/Poison Pokemon. If instead we wanted Type 1 equals Grass or Type 2 equals Poison, we'd use the | sign. It's a little different from what you're normally used to, it's just the convention of the pandas library, so look it up if you forget. Running that, either condition can match, and as you can see there's a Bug type in there whose second type is Poison, so the two conditions really were combined with "or" instead of "and". And we don't have to stick to text conditions: I can also add, say, that the HP has to be a fairly high value; looking at the data, 70 feels like a good cutoff, so HP greater than 70. With that added, only five rows make it through the filter.
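The combined filters as a sketch:

    # conditions are wrapped in parentheses and joined with & / | rather than and / or
    new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]

    either = df.loc[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Poison')]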
There are a couple of different things you can do with this. First, if you wanted a new DataFrame containing just the filtered data, I could do new_df = (that whole loc expression), and now printing new_df gives just those five rows, while printing df still shows everything. Also worth mentioning: I can easily save this new DataFrame as a new CSV to checkpoint my work; if I were doing this with many different filters I could keep more specific CSV files to dive into in more depth. I'll call this one filtered.csv, and if I run it you'll see the file contains exactly the data I just grabbed.

One thing to note when you filter your data down: when I print out new_df, the index looks weird, because the rows keep their original index values from before the filter. That can get annoying if you're trying to do additional processing on the new DataFrame, so if you want to reset the index you can use new_df.reset_index(). To start, just set new_df = new_df.reset_index(), and now when I print it out we have 0, 1, 2, 3, 4. By default it saves the old index as a new column; if you don't want that, pass drop=True, which gets rid of the old indices entirely. Finally, if you don't want to assign the result back to a new DataFrame, you can do this in place, which probably conserves a little memory: with inplace=True I don't set it to a new variable at all, it just changes the values within new_df itself, and as you can see we get new indexes for our filtered data. This is useful to be aware of, because if you're going through the new DataFrame row by row and trying to get a specific spot, what looks like the fourth row might still need to be indexed by its old position, and that would get really annoying, so resetting indexes is helpful here.
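A sketch of checkpointing the filtered rows and resetting the index:

    new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]
    new_df.to_csv('filtered.csv')           # checkpoint the filtered rows

    new_df = new_df.reset_index(drop=True)  # renumber 0..n-1 and throw away the old index
    # or, equivalently, without reassigning:
    # new_df.reset_index(drop=True, inplace=True)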
In addition to equality, greater than, less than, not equal and so on, there are other kinds of conditions we can use, basically anything you can think of. One thing that's been bugging me with this data (maybe because my Pokemon knowledge is a little outdated) is all these Mega versions of Pokemon, and I'm not quite sure what that really means. Let's say I wanted to filter out all the names that contain "Mega". That's tough to do with equals signs, because "contains" is not the same as "equal to"; we want to allow lots of different names. So I can filter on whether the name includes "Mega" by doing the following: df.loc, pass in df['Name'], then use the .str accessor (this is just something to remember about the contains function) and .contains('Mega'). If I run this, you'll see all the rows whose names include the word Mega. If we want the reverse, here's another good symbol to remember, because it's not what you might expect: within the loc filter, "not" isn't the exclamation point, it's the tilde, the squiggly line ~. If I run it with the tilde, we drop all the ones that had Mega in the name, and as you can see there are no Megas left in our data. So that's pretty useful.
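The two contains filters as a sketch:

    megas = df.loc[df['Name'].str.contains('Mega')]      # names containing the substring
    no_megas = df.loc[~df['Name'].str.contains('Mega')]  # ~ negates the condition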
Taking this even a step further, I find the contains function to be very powerful, because in addition to exact words we can pass in regex expressions and do all sorts of complicated filtering. As a first example, let's say we want a simple way to check whether Type 1 is either Grass or Fire. To do that, first import the re module for regular expressions; I'd recommend looking into regular expressions if you don't know them, they're super powerful for filtering data based on textual patterns. I can pass regex=True and search for 'fire|grass', since in a regex the pipe means "or", so it should match either fire or grass. I run that and, shoot, it didn't give me anything, and the reason is that the capitalization was off; with 'Fire|Grass' it does work, Type 1 Grass, Type 1 Fire, and so on. A probably nicer way to handle this, since you might have all sorts of funky capitalization, is to go back to the lowercase pattern but add a flag: flags=re.I, which means ignore case. Run that again and Grass and Fire are both grabbed even though I wrote the pattern without capital letters.

One more example: let's say I wanted all Pokemon names that start with "pi". The first example you might think of is Pikachu, but there's also Pidgeotto and probably a bunch of newer ones I don't know. To get names starting with "pi" I can use a regex like 'pi[a-z]*': it has to have "pi", then the following letters can be a through z, and the asterisk means zero or more of them. This is all just regex knowledge, so if it seems foreign, look into regexes. When I first run this I get nothing, because I left the filter on Type 1; changing it to Name, we get Caterpie in there too, so my regex is also matching "pi" in the middle of a name. To stop that, I can anchor the pattern to the start of the string with a caret, '^pi[a-z]*'. Run that, and now we only get names that begin with "pi". There are many different use cases where filtering your data in this kind of complex way comes in handy.
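Both regex filters as a sketch:

    import re

    # Type 1 is fire OR grass, ignoring case
    fire_or_grass = df.loc[df['Type 1'].str.contains('fire|grass', flags=re.I, regex=True)]

    # names that start with "pi" (the caret anchors the match to the beginning of the string)
    pi_names = df.loc[df['Name'].str.contains('^pi[a-z]*', flags=re.I, regex=True)]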
Building off the filtering we did in the last examples, we can also change our DataFrame based on the conditions we filter by. Imagine I didn't like the name Fire for Type 1 and decided that if you're a Fire type, you're actually a Flamer. Let's make that change (I know it goes against Pokemon tradition, but it's just to show you): df.loc, with the condition df['Type 1'] == 'Fire', and then after a comma I can specify the column I want, 'Type 1', and set it equal to 'Flamer'. The first time it looks like something's off, because I have an extra bracket; fix that and run it. I don't see anything, but if I print df you can see that Type 1 is now Flamer instead of Fire. If I wanted to change it back, I can do the same thing with 'Fire', and now I have Fire again.

We can also make the target a different column; it doesn't have to be the same column we're filtering on. Maybe you decide that all Fire-type Pokemon are legendary: set Legendary to True under that condition, and now all the Fire Pokemon are legendary, which obviously isn't true, but it's cool that we can use a condition on one column to set the value of another. Of course, I've now kind of messed up this DataFrame, so what I can do is use my checkpoint, the modified CSV: df = pd.read_csv('modified.csv') loads the checkpoint I made a while back, and the Legendary column is fixed simply by reloading.

You can also change multiple columns at a time. I'm doing this purely as a demonstration, so the changes themselves don't matter, but imagine we wanted to say that if Total is greater than 500 it's a pretty good Pokemon. To modify multiple columns at once, I can pass in a list of column names and set them all to some test value: if Total is greater than 500, those two columns get the test value instead of their normal values. Printing the DataFrame, every row whose Total is greater than 500 was modified as we wanted. Another neat thing is that you can set them individually as well, say 'Test 1' and 'Test 2', so now we're specifying what Generation becomes and what Legendary becomes when that condition is met, and it updates appropriately. I'll comment all this out and reload the DataFrame as it was initially; I'm just resetting the changes because I don't want them to stick, but it shows that you can do these things, and they become super useful.
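A sketch of the conditional assignments above; note that newer pandas versions may warn when an assignment changes a column's dtype (for example writing text into the Generation or Legendary columns):

    # rename a type only where the condition holds, then change it back
    df.loc[df['Type 1'] == 'Fire', 'Type 1'] = 'Flamer'
    df.loc[df['Type 1'] == 'Flamer', 'Type 1'] = 'Fire'

    # use one column's condition to set a different column
    df.loc[df['Type 1'] == 'Fire', 'Legendary'] = True

    # set several columns at once, with one shared value or one value per column
    df.loc[df['Total'] > 500, ['Generation', 'Legendary']] = 'TEST VALUE'
    df.loc[df['Total'] > 500, ['Generation', 'Legendary']] = ['Test 1', 'Test 2']

    # reload the checkpoint to undo the demo changes
    df = pd.read_csv('modified.csv')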
All right, we're going to end this video with something I find very useful, and that's using the groupby function to do aggregate statistics. Let's start by loading in that checkpointed modified.csv we created; as a reminder, this is what it looks like. With groupby we can start doing some really cool analysis. For example, say I wanted to see the average HP and Attack of all the Pokemon grouped by which type they are, maybe to figure out whether specific types have better skills: a Rock or Steel Pokemon would probably have high defense, maybe a Poison Pokemon would have high attack. We can start seeing that as a holistic measure using groupby. I can do df.groupby(['Type 1']) and take the mean, and here we get all the stats broken down by their mean for each Type 1, so Bug has an average attack of around 70. To make this even more useful I can use the sort function we learned, sort_values on, say, Defense with ascending=False, so I can see the Type 1 with the highest average defense. As I mentioned, this shows what I was expecting: averaged over all the Steel Pokemon, they have the highest defense, which is cool to see. Instead of Defense we could also look at which Type 1 has the highest average attack; I was thinking maybe Poison, but it turns out to be Dragon, which does make sense: a lot of Dragon Pokemon are legendary and super powerful, and a dragon would be very scary to fight, so obviously they should have good attack. Checking best HP, Dragon comes out on top there as well. It's a handy little analysis tool: group by Type 1 and see these stats per group.

I can also do things like .sum. The three aggregations that come to mind immediately (there might be others) are mean, sum, and count. If I take the sum, I get all of the HPs added up per type, and you should always be thinking about why you're doing something when doing anything data science related: in this case the sum doesn't really make sense, because without knowing how many Type 1 Pokemon are Bug or Dark and so on, comparing raw sums is meaningless; but you can do it. Then you also have count. If I run that, I get the counts of Pokemon for each Type 1: 69 are Bug, 31 are Dark, 32 are Dragon, and so on. To clean this up a little, since the same number repeats across most columns (count only counts rows where the cell is filled in, which is why one value is lower, like the 52 here, where Type 2 is blank), you can add a helper column: df['count'] = 1, just a one for every row. Now I can do that same groupby count, and to make my life easier just grab the 'count' column, and I have a neat little table: 69 Bug, 31 Dark, 32 Dragon, etc. That's a lot easier to read.
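The groupby calls as a sketch; the numeric_only=True argument is something newer pandas versions need so that the text columns don't cause an error when averaging (the video, on an older version, just calls .mean()):

    # average stats per Type 1, sorted by defense
    print(df.groupby(['Type 1']).mean(numeric_only=True)
            .sort_values('Defense', ascending=False))

    print(df.groupby(['Type 1']).sum(numeric_only=True))   # usually less meaningful here

    # counts per type, using a helper column of ones
    df['count'] = 1
    print(df.groupby(['Type 1']).count()['count'])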
Now that it's in an easier-to-read format, I can also group by multiple parameters at the same time: I can do Type 1 and Type 2 to look at all the subsets, so of the Type 1 Bug Pokemon, how many have a Type 2 of Electric, how many have Fighting, and so on. This kind of grouping and counting gets really useful when you're working with a really massive data set, and I'll get into that quickly, though I won't go through a full example. Imagine you had a data set so big you couldn't even load it all into one DataFrame. This groupby with count and sum is super useful because you can take your DataFrame and squish it: if you're tracking how many times some event occurs, you can group by and count, and shrink your DataFrame down based on that. I'm not going to show this in full, since the data set we've been working with isn't that big and I don't feel like bringing in a new one right at the end, but imagine you're working with a file on the order of 20 gigabytes. That's pretty big, and you might not know how best to process it. One really useful thing about pandas is that it lets you read a file like that in chunks. Instead of reading in all 20 gigabytes (unless you have a very powerful machine you won't be able to load all of that into memory), you can load it in, say, 100 megabytes at a time.

Normally when reading the CSV we'd just do pd.read_csv('modified.csv') and that loads everything in. Instead, we can pass in the chunksize parameter; I'll say chunksize=5 just for the example, which means 5 rows are passed in at a time. If I do for df in pd.read_csv('modified.csv', chunksize=5), then df inside the loop is 5 rows of my total data set. Because this is measured in rows, and you might prefer to think in terms of memory, you can do a little math on the rows to figure out how much memory each chunk takes (each row is probably on the order of tens of bytes), and if you run into a situation where you don't have enough memory you can always shrink the chunk size. For a really big data set we might set the chunk size to a hundred thousand rows at a time, which is a lot of rows but nowhere near that full 20 gigabytes; for our example we're just loading five rows at a time. I can print each chunk to show that it's happening: the first DataFrame is five rows, the second is the next five rows, the third is the next five, and so on, so the data is loaded in chunks of five.

Where this becomes useful with the aggregate stuff we just went through: you could define a new DataFrame, say new_df = pd.DataFrame(columns=df.columns), which creates an empty DataFrame with the same column names as the original. Then, for each chunk, you could do something like results = df.groupby(['Type 1']).count(), and use the pandas concat function, which appends two DataFrames together: new_df = pd.concat([new_df, results]). What this does is take your new DataFrame as you go through the chunks, append the results, and store it back. As you do more iterations you're building up a new DataFrame with the information from your original, really large data set, but because each chunk is aggregated with a groupby and count, you're shrinking the data down, so the final new DataFrame carries the meaning of the big original DataFrame while being a lot smaller.
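A sketch of the chunked pattern under the same assumptions (modified.csv, chunksize=5 to mirror the demo). The video concatenates into new_df inside the loop; here the per-chunk results are collected in a list and concatenated once at the end, which does the same thing while avoiding warnings that newer pandas raises when concatenating with an empty frame:

    import pandas as pd

    # for a genuinely large file you'd use something like chunksize=100_000
    chunk_counts = []
    for chunk in pd.read_csv('modified.csv', chunksize=5):
        # aggregate each chunk down before keeping it
        chunk_counts.append(chunk.groupby(['Type 1']).count())

    # stitch the per-chunk aggregates into one (much smaller) DataFrame
    new_df = pd.concat(chunk_counts)
    # optionally merge the duplicate Type 1 rows that came from different chunks:
    # new_df = new_df.groupby(level=0).sum()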
You can then do more analysis on this shrunken-down new DataFrame. Hopefully that makes sense; if you need me to clarify any of it, just leave a comment down below and I'll try to clear things up.

All right, that's all I'm going to do in this video. Hopefully you feel like you have a handle on the pandas library now. If you felt like you learned something, make sure to hit that subscribe button; it would mean a lot to me. I'm going to build off this video in future videos, such as plotting the data in our DataFrames and doing more advanced stuff with things like regular expressions. It won't all be specifically about pandas, but it will be useful information you can build your pandas skills on, so subscribe for all of that. If you have any questions about anything I covered in the video, leave a comment down below and I'll try to help you out and clarify, and if there are any additional pandas features you'd love to see that I've missed, leave a comment saying what they are and maybe I'll make a follow-up part two to this video. All right, that's all I've got. Thank you guys again for watching, and peace out.