Python for Data Analytics - Full Course for Beginners

Luke Barousse
🧮 Course Problems & Certificate 👉 https://lukeb.co/python 📝 Course Code & Notes 👉 https://lukeb....
Video Transcript:
Data nerds, welcome to this full course tutorial on Python for data analytics. This is the course I wish I would have had when I first started as a data analyst. For this, you're going to be coding right alongside me as we master the basics of programming. From there, we'll learn how to analyze data using popular libraries like pandas and Matplotlib, and finally we'll put all your newly learned skills to the test by building your very own portfolio project. Now, this tool that we're going to learn is one of the most popular programming languages
for data nerds. Don't believe me? Let's check out the data. For data scientists, it's in two out of every three job postings. For data engineers, it's similarly requested, right after SQL. And for data analysts, it's in one of every three job postings, right after Excel. This popularity is fueled not only by how easy it is to learn but also by how easy it is to read. This course is for absolute beginners. You don't have to have any experience in programming, let alone Python or even data analytics. I'm going to be explaining and breaking down all
the different concepts and topics you need to know. In the first of three chapters, we're going to be covering the basics of Python. You're going to be learning about important concepts like data types, loops, and functions, all while coding alongside me in Google Colab, a free and popular tool to use for analysis. In the second chapter, we'll be moving into more advanced concepts. We'll be mastering two of the most popular data analytics libraries, pandas and Matplotlib, all while extracting insights out of a real-world data set. We'll also be learning how to set up
and run Python locally on your computer for this chapter. Finally, for the last one, we'll be moving into building a portfolio project for you to showcase your newly learned skills. You'll be uncovering insights on that real-world data set of data science job postings, and you can even apply these insights to your job search in real life. Now, I'm a big believer in open-sourcing knowledge, so this course and the course material are completely free. You'll not only get my step-by-step video instructions on how to run and use Python on popular free-to-use tools
but I'll also be giving you access to all the notes and code for this course, so you can follow right alongside me and work towards building your very own portfolio project to showcase at the end. Now, although this is free and I will be generating ad revenue from this video, it's not going to be enough to pay for the expenses needed to build this course. So I have an option for those that want to contribute to this course. Using the link below, you'll be granted access to extra perks to help you learn Python more efficiently.
There are practice problems for each lesson you go through to reinforce what you've learned in the video. They range from easy to hard and provide not only hints but also the solution. With this, you'll have community access to be able to ask questions and also answer other people's questions for these problems. And finally, these data nerds will be awarded a certificate of completion that they'll be able to post to LinkedIn and showcase as experience in using Python. One quick shout-out before we go further, and that's to Kelly Adams, who helped produce this course. She's the
brains behind producing all these different lessons and exercises we'll be completing, and I'm super appreciative of her help, because frankly I probably couldn't have finished the course without her. All right, so let's get into that first question: what is Python? Well, it was created by this dude, Guido van Rossum, back in 1991, and he named it after my parents' favorite TV show, Monty Python. Over the past 30 years, this tool has evolved into a multi-purpose programming language. It's used for a variety of things. It's in the source code of a lot of popular websites,
along with helping build a lot of popular AI and machine learning solutions. Personally, I like to use it for automating solutions, including things like web scraping. Now, because of this multi-purpose aspect, Python is great for data analysis. It's great at not only collection and cleaning but also analyzing and presenting. Now, interestingly enough, although Python is only the third most popular skill for data analysts, it's implemented into all of these different solutions. Most all databases, at least reputable ones, have some sort of Python API built in to allow you to connect to them using SQL. Last
year Excel integrated it, and now you can use your spreadsheet as a code editor. Oh, and the same can be done for popular viz tools like Power BI and even Tableau. So based on these integrations, I'd value Python higher than these other skills. When comparing the interest in Python, SQL, and Excel over the past few decades, Python is growing consistently year over year, so I don't think it's going to be trending out of style anytime soon. Now, Python is not only free, but it's also completely open source. If you wanted to, you could go on to
GitHub right now and check out the source code. You could even write a pull request to implement a new feature in Python, but I don't recommend it. And because it's free and open source, a lot of top tech companies are contributing to help build out this language further, a lot of them from FAANG. I mean MAANG. Or is it MAMAA? We're not including Apple, so it's MAMA. Let's now get into the course material, and for this the first thing we need to cover is how you're actually going to run Python on your own machine in
order to complete this course. In addition to that, I'm going to be going over all the different resources that I'm making publicly available to set you up for success in completing this. Now, there are two major ways to run Python. One is locally via a code editor, and the other option is inside of your browser, commonly known as the cloud. Let's dive into running it locally first. For this, assuming it's installed, the first thing you need to do is actually load a file. From there, all we have to do is do some light coding to get what
visualization we want. Then, inside of a terminal window, I can run the file, and we get a graph in a new window. Now, I'm not a fan of using Python files. Personally, I prefer to use Jupyter notebooks, as each cell shows its output right below it. In the case of that Python file code, I can just copy and paste that code into a new cell and then run it, and the visualization is displayed right in the file itself, below the code. So for the remainder of this course, we're going to be using
Jupyter notebooks, because frankly they're just easier to use for data analytics. But we're not going to start out this course actually running locally, so no need to install anything right now. Instead, we're going to use Colab, which is a Jupyter notebook inside of your web browser. Any of that code that we're able to run locally, you're also able to run here inside of Colab. For all of the basics chapter, we're going to be using Google Colab. It's not until we get to the advanced chapter that we're going to actually install and run Python locally,
but don't worry, if you want to continue to use Google Colab, you can. Now, for the course material, I've made every lesson available on GitHub. I've organized the folders into chapters, and the notebooks are organized in the same order as the lessons. You can just come right in and click where you need to go, and this will provide all the notes along with code and their associated solutions. One cool thing about this is you can come up to the top here and select Open in Colab, and as expected it opens up the notebook inside
of Colab, and you can actually go in and, if you want to, run the different cells and test out different things. So that was the course notes and code, but what data will we actually be analyzing and manipulating in order to learn Python? Well, I made this app called datanerd.tech that collects job postings and displays the top skills of data nerds. I can just come in and select something like data analyst in the United States, and it will tell me what the top skills are, along with certain trends for skills and how much
jobs are even paying. Anyway, I've made the data from this app publicly available for you to go in and analyze for this course. It's hosted completely free to use on Hugging Face, and if you wanted to, you could download the data from here. However, later in the course I'm going to show you how to access this data programmatically, which is a lot simpler. So what kind of data is included in this? Well, we have job postings from a variety of data science jobs, with the top ones being data
analysts, data engineers, and data scientists. Additionally, we have data from around the world, with the primary location being the United States, but a lot of other countries are available as well. And this data comes from a variety of different companies within tech, healthcare, and whatnot. Now, for those that are choosing to support the course, you're going to have those extra practice problems as well for resources, and if you get stuck on any practice problems, you'll be able to use the comment section below to ask for help. Which brings me to the last point: how
should you go about getting help if you get stuck, which you're going to eventually? Now, you could post a comment below, but I recommend the even faster solution of using any good old AI chatbot. I've had tons of success in speeding up my workflow using chatbots like ChatGPT, Gemini, and even Claude. All right, so that's a walkthrough of all the different resources you're going to need for this course. I'll frequently be reminding you of these resources and to go access them, but I want to make you aware at the very beginning. All
right, let's now dive into this first lesson and actually get into the basics chapter in learning Python. With that, I'll see you in the next one. Let's get into actually coding. For this, we're going to be going into an overview of how to access and also use Google Colab, and the first thing we need to do is actually get you inside of this notebook. The first thing you need to do is make sure you're logged into Google, and if you're using YouTube, these two accounts should be connected: if you're logged into this, you're logged into
Google. But if you didn't realize that it's 2024 and you don't have a Google account, you can just go to google.com, select Sign in, and then from there actually create your account. Once signed in, you can access Colab in two different ways. The first is just by using this URL of colab.research.google.com. That thing's a mouthful. It will navigate you to the starting notebook, and from here you just select New notebook. The other option is to just go to your Google Drive, which is at drive.google.com, and then from there select New, then More, and underneath there
you can select a Google Colaboratory notebook. Anytime I start with a new notebook, I like to just rename it to what I'm actually using it for, in this case getting started. The first thing we're going to focus on is the notebook itself, where we're going to be actually writing the code. We'll get to the menu bars in a second. So inside of here, I can add things like code or text. For code, I'm adding a code cell, and then for text, I'm adding a text cell. For the coding cell, I can type right inside of
it. In this case, I put this in as a coding cell; I can run it, and it's going to output right below it. For the text cell, or markdown cell, I can do a similar thing where I can double-click it and get into editing it. On the left-hand side is where I actually type my text, and on the right-hand side is the display of how it's going to look. So let's talk a little bit more about these coding cells, specifically how to actually run them. So let's say I want to
do some complicated math, and Python is really good at this. I can put into this code cell something like 2 + 2, and then from there, when I want to run it, press the play button. However, I'm going to recommend you learn keyboard shortcuts, as they speed up your workflow a lot. So instead of pressing that play button, I'll put in 2 + 2 and press Command+Enter, and it's going to run the cell. However, besides this Command or Control+Enter, what I can do instead is, whenever I put
something in like 2 + 2 again, I can press Shift+Enter, and now it will execute that cell and then put the next cell in. So then I can go forward in doing my next coding example, and it sets me up to move forward. Now, sometimes you want to do multiple operations within a cell. Let's say I wanted to add those numbers again, 2 + 2, but then I also wanted to find out what a multiplication of 3 * 3 is. In this case, whenever I run the cell, I only get an output of 9, which is the last statement.
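As a rough sketch of the cell described here (assuming a Colab or Jupyter notebook, where only the value of the last expression in a cell is displayed automatically), the two operations could look like this. The print() calls at the end are one way to show both results.

```python
# In a notebook cell, only the value of the last expression is displayed.
2 + 2   # evaluated, but nothing is shown for this line
3 * 3   # this is the last expression, so the cell displays 9

# Wrapping each expression in print() shows both values.
print(2 + 2)
print(3 * 3)
```

Run as a plain Python file instead of a notebook cell, the two bare expressions display nothing at all, and only the print() lines produce output.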
It doesn't output that first line. So if we want to see both of these, we can use the built-in print function. Anytime you type in these built-in functions, it's going to pop up here with some hints to tell you what this function actually does. In this case, it goes over the arguments. Don't worry too much about what's going on there, but it basically says it prints the values to a stream, or to sys.stdout, which is what we're going to be doing. First, it talks about what you can input into this print statement, so basically what you
can put in between the parentheses, and then it gives a single text description of what it does, and it basically says it prints the values. Anyway, let's continue on to actually see what it's going to do. So we'll put in print(2 + 2), and then we'll also put in print(3 * 3), and in this case I'll run it again, pressing Control+Enter, and I get both that 4 and 9 back out of this. Now, technically, since that last line always outputs out of a cell, I don't necessarily have to put a print statement
around that 3 * 3. Instead, I can just run it, and it will still output that 4 and then that last line of 9. So what if you want to include something in your code to tell maybe somebody else what you're doing inside of here? Well, that is a comment, and it's denoted by the hashtag symbol (#) that you're going to put first, and then from there a statement to the right of it. Whenever I run this, it's going to have no output, because everything in there is basically ignored by the Python interpreter. Well, at least everything
to the right of the hashtag. And we can have comments at the beginning of a line, or we can even have them to the right of what we're trying to execute, and in this case, when I execute it, it's going to only read that 2 + 2 and output 4. So that's enough with coding cells for the moment. Let's move on to those text cells. Inside of these text cells, you're allowed to use a markup language called Markdown. This allows you to format your text on the screen. If I don't use any special symbols,
it just outputs normal-looking text. If I wanted to do something like a heading, I would put a hashtag, then a space, and then put something like heading, and it's going to make it much bigger. If I wanted to bold something, I would just wrap it in asterisks, double asterisks on both sides. If I wanted to do a nicely formatted list, I could just start it with something like one, two, or three, or even start it with things like bullet points. If you wanted to insert a link, you'd use square brackets to put the text
that you want to have the link for, and then inside of actual parentheses you put the link that you want to go to. In this case, if you want to find out more about Markdown, you can go to this link right here. When we execute the cell, we can see all these different things are formatted, along with that link to the Markdown guide that you can go to to learn more about how to use Markdown. Let's now get into a quick walkthrough of the functionality outside the notebook, with the menu bar. First, we're going to
start with the sidebar, and right here we have the table of contents, and it has all the different sections we've made so far, so you can easily navigate to them if you want to. Next is find and replace, which, if you've used Word before, I think you know how it's done. From there, there's variables. We haven't defined any variables yet, but as we do in the next lesson that we're going to be covering, they'll start popping up right here. If you have any deep dark secrets you want to keep from others, you can put them
in here. But on a serious note, this is for things like APIs that sometimes have keys. In that case, you want to put the key here and not in the notebook itself, because if you shared the notebook, others would have access to it, whereas these secrets are local to you. And then lastly in this section, we have files, where you can go and actually access any files that your notebook has access to. Right here you can even do things like mount your Google Drive. We're not going to do this right now. On the bottom left are things
I use a little bit less, like code snippets. If I needed some sort of an example, I could go here and get a hint for it. I don't really use that much. Next is the command palette, and this is a really good location to know in case you forget where something is. Say, hey, I want to go to the table of contents: you can go in there, type it in, and it'll navigate you right to it. And finally, a terminal window. I don't have Colab Pro, and you probably don't either, so you don't have access to this,
so don't worry about it. Moving to that top bar, you can see we have things like add code or add text. You can see how much disk and RAM you're using, and you can even connect to different services and runtimes, but that's well beyond the scope of this course. The last thing is Colab AI, which is basically like ChatGPT or Gemini, Google's version, inside of here, and in it you can ask a variety of questions. It's a large language model, and it will provide you the results. Now, whether you have this AI assistant
or not is going to depend on a variety of conditions. Frankly, I don't really think it matters too much, as really any free chatbot these days is good enough to use. Moving to this last top menu bar, I'm not going to go into each one of these, as they're pretty self-explanatory in what they do, with the exception of Runtime. If you ever log into a notebook and you want to run all the cells in it without going through and pressing Shift+Enter, well, you can select Run all, and in this case it's going through
running all the cells again. Now, if Google Colab is just completely acting wonky, then you want to move on to restarting the session, and that's going to clear everything within here, and then from there I'd go back and actually run all the cells. The last thing to cover in this section is special, or magic, commands. This is a very advanced concept, and we're going to be covering it more in future lessons, but you need to be aware of it before we get to that point, as the syntax looks slightly different than that of Python.
Now, on my Mac, I have a terminal window, and this allows me to programmatically control my computer and install different packages that I might need for Python. Now, this Jupyter notebook is not running on my computer; it's running on another computer. We need terminal access sometimes in order to install different packages that we'll need for this course, and I'm guessing most of you haven't paid for Colab Pro, so you don't have access to this terminal in the bottom left-hand corner. But there's a workaround: I can start my code with an exclamation point
and then type in some sort of bash command, such as ls to list the directories. Pressing Shift+Enter, I can see the directory in this case is sample_data, which, if I just open this up, I can also see as sample_data. Don't worry too much about this ls command. It's specific to bash, and it's not something you have to memorize. All right, next up are line magic commands, and they're prefixed by this percent symbol. Now, when I type this, I actually have a pop-up that shows me all the different
shortcuts that are available. With this, we're going to use the timeit magic command, and what this does is actually time an operation. In this case, I want to time how long it's going to take to execute this Python operation, and we see that it's 304 microseconds. But what happens if I had multiple lines of code? Don't worry if you don't understand this code. Say I had multiple lines of this Python code and I wanted to see how long it actually took. In this case, this line magic, or this single percent sign, is not going to
work on multiple lines. In this case, I want to actually put in two percent symbols, and now it's going to time the entire cell. Whenever I run it, I can actually see the time that it took for all of this. So that's a quick crash course on how to use Google Colab. Now it's your turn. If you haven't done it already, log in and get a Colab notebook set up. For those that are supporting the course, you have some practice problems now to work through and get you more familiar with working in the
Colab environment. With that, I'll see you in the next one. So let's get into one of the most fundamental concepts behind Python, and that is the use of variables. In it, we're going to be talking about how to not only create a variable but also assign it and from there do further manipulation, which is pretty common in data analytics tasks. So here's the data set we're going to be working with a few lessons down the road, and it revolves around data science job postings. Anyway, we're going to be converting a lot of these different things
inside of it into variables and using them. In this case, look here: we have this salary column with a bunch of different salaries in it, so let's make a variable around this. For creating a variable, I'm going to put the variable name on the left-hand side, so in this case I'll just name it salary, and then I'll use an equal sign to assign a value. In this case, let's just do 100,000. Now I can run this by pressing Shift+Enter, which creates that new cell. Now salary is assigned to 100,000. Now, from the last
video, recall I can come over here and look at the different variables we have inside of here. Salary is appearing; it's an integer, and it actually shows the value right here of 100,000. Anyway, it's a good place to go in case you want to keep track of all the different variables that you have. Now, in this case, salary is the variable name, so if I were to actually type this out and press Shift+Enter, it will display below here the contents of salary: 100,000. Now, I want to be clear, this is not
the same as what we were doing before. If I were to type in salary here with quotation marks around it, which is a string, which we'll cover in another lesson, that's completely different. It's going to output the string itself, salary. So let's say I want to do some manipulations with this, and I want to calculate what the total salary is, and in this case it's the base salary times 1 plus the bonus rate. I can specify that base salary variable and assign it the value. We're just going to go with 100,000 as we did before. I
can then also assign right underneath it that bonus rate, and we'll say that it's only 10%, so 0.10. I can press Shift+Enter and load these variables in. Once again going into that variable sidebar, if I want to (you don't have to do this every single time), I can actually go and see that they're actually assigned inside of here. So now I can define this formula that we have above. Total salary is going to be the new variable, and it's equal to that base salary times 1 plus the bonus rate. Running it,
I can do Shift+Enter, and then we can actually print it out below, pressing Shift+Enter again. Now, variable names have to be letters, numbers, or underscores. So let's say I had something like Luke's salary and a value of also 100,000. In this case, we're going to get an error message saying it's invalid syntax, and we're going to have this caret right here showing us where the error is, and we can see that it's basically between this Lukes and salary. The problem is you can't use a space in between
here for variable names. I could do something where there's no space here, but it's pretty common practice whenever you name variables to use lowercase and also use things like underscores to separate different words. It basically makes it more readable. And because it can be only numbers, letters, or underscores, I can't do something like add an apostrophe s in this case. Now, going back to that data set, we can see from here that we're going to have a lot of other different variables besides salary, and we're not limited to just numbers.
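A minimal sketch of the variable steps above. The underscore names (base_salary, bonus_rate, total_salary, lukes_salary) are assumed spellings of the names spoken in the lesson, following the lowercase-with-underscores convention it describes.

```python
# Create a variable: name on the left, value on the right.
salary = 100000

# The bare name refers to the stored value; quoting it makes a string instead.
print(salary)     # the integer stored in salary
print("salary")   # just the text salary

# Total salary is the base salary times 1 plus the bonus rate.
base_salary = 100000
bonus_rate = 0.10
total_salary = base_salary * (1 + bonus_rate)
print(total_salary)  # about 110000; floats may show a tiny rounding error

# Names may contain only letters, numbers, and underscores.
# Luke's salary = 100000   <- invalid syntax (space and apostrophe)
lukes_salary = 100000
```

Because the bonus rate is a decimal, the total comes back as a floating-point number rather than an integer, which is why the printed value may not be exactly 110000.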
You could have things like text, dates, and Boolean values. So let's create some more example variables of different things we can do. As an example, I can assign things like a company name, which would be a string. When I run this, underneath here I can actually see it and see that it is a string by the single quotes on the outside of it. I could also record whether something like a job is required to be work from home or not, and assign it a Boolean value of something like True or False, and looking at
it here, I can see True is actually stored inside of here. Now, here's some other information as well. Let's say this all relates to a single job posting. So now I'm also going to input maybe the job ID, the job title, and also the salary behind it. Now let's say, for fun, I want to actually print out all the different variables that I have for this one job. I have a title of data job description and then print out all the different ones below. When I go to run this with Shift+Enter, I get
this error message of NameError: name job WWF is not defined. Basically, I named this variable incorrectly. Now, I want to show you something from before I even executed this, so I'm going to just start this cell all over again and paste it into here. What you're going to notice is they actually have some syntax highlighting inside of here to hint you towards issues. In this case, I have this yellow curly line underneath this job WWFH, and it says, hey, it's not defined. And I can clearly scroll up and see that, okay, these don't
match. I need to modify it, and then from there the warnings go away. Running the code itself, you get all the values. All right, so that's a major overview of variables, and for those that have purchased the course notes and problems, you have some to work through and actually solve. Understanding this is really core to Python, as Python focuses heavily on designating everything as an object, which you can therefore actually assign to a variable. So we're going to do some pretty fun things coming up in the future. All right, with that, see
you in the next one. Now that you've gotten some experience working inside of Colab and writing your first few Python statements, we need to move into understanding the basic definitions in Python. A few of these you've become familiar with, such as objects, variables, and functions, and others you haven't necessarily seen yet, such as classes, methods, and attributes. So jumping into that first definition of objects: everything in Python is an object, and it's basically a record of data. Previously, we defined the variable salary and we set it equal to the integer
object of 100,000. Now, technically, not only is that 100,000 an object, but that salary variable itself is also an object. And if you recall from previously, we used a print function (we're going to get to print functions in a second) to actually display a variable. Here it is displaying that 100,000 that salary is keeping. The print function itself is also an object. If I want to inspect what type of object something is, I can use the type function, and for the argument you place in between parentheses, it's the object
itself, which is highlighted in blue, and it's going to output the object's type. So in this case, if we put something like 100,000 into it, we can see that this is of the type int, or integer, which actually brings us back to our core definition of what objects are: they're an instance of a class. So this 100,000 is an instance of the class integer. If I had a string such as data analyst, I could run the type function on this, passing in data analyst, and for this I'm going to see that it's of the type
string. So I'm going to go into more detail on classes in a second, but I'm telling you that integers and also strings are classes. How do I know that? Well, we can use the help function, and we can run this on str, or string, and this provides details on whatever you pass into it. So in our case, passing str, it says, hey, this is Help on class str in module builtins. So this confirms it's the class string. All right, next up is variables, and we had
a whole lesson on this. We defined things like job title, job location, and even job salary, and with this we used something like the equal operator to set a variable equal to a value. Now, it's important to understand that the variable name itself points to an object. What exactly do I mean by this? Well, I can use something like the id function to get the identification number of an object to see if it's maybe related to another object. In our case, I can pass in the variable job title, and whenever I run this,
I get a unique ID number for it. I could also run this id function passing in the argument of job location. In this case, just looking at the last three digits, I can see that these are two different IDs, as one references data analyst and one references United States. Now, this is where it's important to understand that variables identify, or point to, a specific object. In this case, I have job 1 equal to data analyst and job 2 equal to data analyst. One would think, if I ran the id of job 1 along with the id
of job 2, that these would have the same ID because they're the same string of data analyst, but if we inspect those last three digits, they're not the same ID. However, if I took a new variable such as job 3 and assigned job 1 to job 3, in this case, when I run id on job 1 and on job 3, we can see that these IDs are the same, because they point to the same object. Now, variables aren't confined to just only objects
like integers or strings. You could also assign a function to a variable. So in the case of this print function right here, we can assign it to a variable. I create this new variable called my print func and set it equal to print, and then I can call my print func, add some parentheses to it, and print out the same statement. I'll be honest, I don't really use that. The main point is to understand that you can assign pretty much anything to a variable. Next up is functions. We've talked a lot about functions already.
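The ideas above (type, id, and assigning a function to a variable) can be sketched as follows. The identifiers job_1, job_3, and my_print_func are assumed spellings of the spoken names. Whether two separately created but equal strings share an ID depends on the Python implementation and on how the code is run, so this sketch relies only on the assignment case.

```python
salary = 100000
print(type(salary))            # <class 'int'>
print(type("Data Analyst"))    # <class 'str'>

job_1 = "Data Analyst"
job_3 = job_1                  # job_3 now points to the same object as job_1
print(id(job_1) == id(job_3))  # True

# A function is an object too, so a variable can point to it.
my_print_func = print
my_print_func("Data Analyst")
```

The actual ID numbers differ from run to run, so comparing two IDs is more informative than reading either number on its own.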
But it's pretty important to understand that they're basically manipulators of different objects. In order for you to run a function, you need parentheses. So if I just ran this print right here, it's going to report back that it's a function, but if I actually add the parentheses and run this, nothing's going to output, because I didn't put anything in as the arguments. And so if I pass something in as the argument, since print needs something to display, it will then actually print something out. Now, this is a built-in function. If I was routinely printing out the statement
of what's up data nerds, I could build this into its own function. Now, we're going to be going over some coding for functions and classes here. In no way are you expected to go through and actually follow along and do the coding of these. They are shown for example purposes only, for you to understand the definitions we're going over. Anyway, a function starts with def, and that means define, and then you provide the name of the function, so in this case it's called greet. You need to put opening and closing parentheses after it,
whether or not you want to provide arguments, which we'll get to in a second, and then from there a colon to say, hey, we're starting the actual code for the function. Everything indented underneath the function is then run anytime you call this function, so in this case we're returning what's up data nerds. I can run this greet function by putting opening and closing parentheses after greet, and in that case it returns what's up data nerds. So going back to those three variables of job title, job location, and job
salary, what happens if we wanted to print those out nicely formatted? Well, I could use this print call that I've gone through and customized. Don't worry about the code itself, you're going to be able to do this in a few lessons, but whenever I run this, it prints them out very nicely formatted. But what happens if I need this printed out quite frequently? Well, instead I can create this function called display info, and I can pass three arguments into it: title, location, and salary. I know those are
the three arguments it requires because they're within those parentheses right there, and then from there it's returning that print statement that we saw above. Now I can call this display info function, and it tells me I need to provide title, location, and salary, so I can pass in those variables of job title, job location, and job salary. Don't worry, the names of the variables don't have to match the argument names word for word; they just have to be in the right positional order. In this case, if I run this, it outputs it nicely formatted, and I
don't have to use all this code that I had above. All right, next, what we're going to be getting into is a very advanced topic, and it revolves around classes. Classes are templates for objects. For that previous example of 90,000, whenever we investigated its type, we found that it was the type int, and running that help function on int to actually inspect the integer object tells me that this is help on class int. So int is a class, and this 90,000 above here is an instance of that class. If I
inspect what's returned from this help function, it provides not only a bunch of information about what an integer is but also gets into what methods are defined inside of this class. We're going to get to methods in a bit, but let's get into building our own class ourselves. Don't worry, you don't need to follow any of this code as I build it; once again, I'm mainly showing this so that you understand the definitions. Anyway, we have our job title, job location, and job salary. Let's say we want to store this; it's very
standard job data, and we want to store it inside of a single class instance. In this case we can define a class called job post, and then from there I have a colon; everything under that colon that is indented is specific to that class. Now, you probably see something recognizable inside of this class, and that is this def, or define. You would think this is a function inside of here, but this is actually a method, which we'll get to in a little bit. Anyway, this code goes through and actually saves
all those different job titles, job locations, and salaries inside of that job post. Once again, don't worry about the code here; you'll be mastering classes when we get to that lesson. The important thing to understand is I can create an instance of this job post class by calling it, and from there it prompts me to provide the title, location, and salary, so I can pass in job title, job location, and job salary, these variables right here. In this case I've created an instance of this class. There's a bunch of gibberish
that prints below, because nothing special is set up right now to control how it prints, but we basically created an instance. Now, this brings us to the second-to-last definition that we'll be covering, and that is attributes. We learned about variables earlier; attributes are like variables of an object. Going back to that job post class that we created, we actually defined attributes inside of it: title, location, and salary. So back with that instance of job post that we created, I can assign it to a variable; in this case we'll call
it job_1 and set it equal to this, and we can access the attributes of job_1, in this case title, location, and salary, by using dot notation. I type job_1 and then a dot, and from there three things pop up that you can access: location, salary, and title. Let's go with title, and from this instance of the class we output data analyst. Scrolling back up to check, yeah, that job title should be data analyst. I can also run .location on this, and it outputs United States. Now, it's important
to understand these are attributes, so you do not put opening and closing parentheses after them; if you do, you're going to get a TypeError. Remember, an attribute is like a variable of the object itself, so you don't normally put opening and closing parentheses after it. This brings us to the last definition we'll cover, which we're going to be using a lot in the upcoming lessons, and that's methods. They're basically the functions of an object. Inside of our job post class, we had that first method right here, which is called the dunder init method; there's
two double underscores right there, and the Python slang for that is dunder init. Anyway, in it we defined our attributes. Now, if you recall from previously, we created this function of display info, and whenever we call this function with those three variables of job title, job location, and job salary, it outputs the formatted text. This is a function right here because it's not inside of a class, but we could turn this into a method. What I can do is take this code up here and put it into the class itself, make sure that it's actually
indented properly, and then from there update some variable names. Don't worry about this too much; like I said, we're going to be going over classes in other lessons. The important thing to understand is that we're creating a method, and we now have this display info method for our job post. So, similar to before, I can create a variable called job_1, which is an instance of that job post class with our three variables that we defined. I can see an attribute of it by running .title
on it and see data analyst, and I can run a method on this as well, again using that dot notation. Specifically, I can type job_1 and a period, and look, multiple things are appearing now, including this cube icon that marks display info as a method. Now, it's important that we put opening and closing parentheses after this; that runs the display info method, and it displays the job nicely formatted. Now here's a sneak peek into the upcoming lessons.
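To tie the definitions from this lesson together, here is a minimal sketch of a plain function, a class with attributes, and a method called with dot notation. It is only an illustration: the class name JobPost, the sample values, and the exact output format are assumptions, not the course notebook's code. The last lines preview the integer add method that the sneak peek walks through next.

```python
# Hypothetical sample values standing in for the notebook's variables.
job_title = "Data Analyst"
job_location = "United States"
job_salary = 90000


def display_info(title, location, salary):
    """A plain function: it lives outside any class."""
    return f"JOB: {title}\nLOCATION: {location}\nSALARY: ${salary:,}"


class JobPost:
    """A class is a template; each JobPost(...) call creates one instance."""

    def __init__(self, title, location, salary):
        # Attributes: variables that belong to this particular instance.
        self.title = title
        self.location = location
        self.salary = salary

    def display_info(self):
        # A method: a function defined inside the class, called with dot notation.
        # Here it simply reuses the module-level function above.
        return display_info(self.title, self.location, self.salary)


job_1 = JobPost(job_title, job_location, job_salary)

print(job_1.title)           # attribute access: no parentheses
print(job_1.display_info())  # method call: parentheses required

# Preview: integers are objects with methods too.
salary = 90000
print(salary.__add__(1))     # same result as salary + 1
```

The positional order matters when creating the instance: title first, then location, then salary, regardless of what the variables passed in are called.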
Let's say we had this variable called salary and we set it equal to 90,000. Remember, running type on salary, we can see that it's an integer object. Running the help function on int to see what's available, we can see that this is help on class int. Scrolling through, this is sort of the documentation of class int, and we can see that it has methods defined here such as add and bool. So in this case I could use this add method on salary: I can use that dot notation and then from there put
in add and then the number we want to add to it. Let's just add one, and we have 90,001. We're going to be going into more of the methods for these different data types coming up; I just wanted to give you a sneak peek of how they're actually related. All right, so now you're a master of those six definitions: objects, functions, variables, classes, methods, and attributes. We're going to be using these terms a lot over the coming lessons, so it's important that you understand them. And for those that purchased the
course practice problems, I have some problems for you to practice these terms further. With that, I'll see you in the next one, where we're going to be getting into understanding different data types, which are their own individual classes used to define stuff like integers, strings, and whatnot. All right, see you there. So we ended the last section talking about objects, and in this one we're going to basically continue on with that, talking about data types and specifically the built-in data types of Python. Now, we've encountered some basic types already, including things like
the integer. For text or words, we saw that this could be made into a string; values that have a decimal place would fall under a float; and if we need a true or false value, we can use a Boolean. Going back to that data set that I keep on teasing, which we're going to be using in upcoming sections, it shows the data type underneath each of these columns. In this case they're showing a lot of strings,
Booleans, and then also things like datetimes, floats, lists, and dictionaries. So we have a plethora of different data types we're going to be working with once we get to that data set, but we need to get more familiar with how we can actually use these data types and the superpowers they have. What do I mean by superpowers? I'm mainly referring to the methods or capabilities that are available to certain data types. Let's go with an example. Let's say we have two variables: total
salary of 110,000 and a bonus of 10,000. I can run this cell by pressing Shift+Enter, and then, just to confirm the data types of these, I can use the type function. When I open the parentheses, I can see that it's going to tell me what the object's type is, so I can just put total salary in here, press Shift+Enter, and it tells me it's int, or integer. Because of this, it has certain superpowers, or methods, that we can use on it. So in our case we
maybe want to calculate the base salary, which we could do by setting base salary equal to total salary minus the bonus. Then I'm going to go ahead and print it out right below. Pressing Shift+Enter, we can see that it's 100,000. So with this integer object that we have, I can do things like subtract and also add, like we showed in the previous video. Now, just because integers can do operations like addition and subtraction doesn't mean that other class types, such as strings
in this case, can do the same thing. Let's say we have a job title of data analyst and we want to remove the word data. I'm going to go ahead and run this by pressing Shift+Enter, and then I want to see what the data type of job title is, which is string. So let's say I wanted to remove data from data analyst. I create this new variable of
job title and set it equal to job title minus remove word, and then from there I also want to print it out to the screen, so I'll type it up here. Now watch: whenever I run this, it's going to give me a TypeError, and that's because it's an unsupported operation. Specifically, it says unsupported operand types for the minus symbol, str and str. Basically, this is geek talk saying we can't do this type of operation. Now, we're going to get into exploring the different methods or capabilities available to these different data
types, but first I want to start with this table right here, showing that we're not limited to just those four data types I showed in the intro video. We have a plethora of built-in data types core to Python that we have the ability to work with. We've already seen and worked with a lot of these, and many of them we're going to be covering in the upcoming sections, diving into greater detail to understand what methods or capabilities are available to these different objects. As a spoiler, we're
going to be focusing on basically all of these, with the exception of that binary type; that's a little bit more advanced than we need for data analytics. All the other ones are really core to what we'll be using in data analysis. So how can we learn more about the capabilities of these different data types? Well, there's actually a help function built into Python to help you with this. It's as simple as typing help, and this explains that in between the parentheses you're going to put a request
which is an object, and explains that this is a wrapper around pydoc.help that provides a helpful message when help is typed at the Python interactive prompt. I'm going to go down to this last statement, where it says calling help(thing) prints help for the Python object thing. So in this case I can put something in for that thing; let's put in str and dive into the help for it. Pressing Shift+Enter, I get this bad boy printed out, and it starts with help on class str. So str
is a class. We'll be covering classes more in an upcoming section, but for now it's just important to understand that classes are one of the core building blocks for creating objects inside of Python, such as a string. It then gets into the documentation, starting with class str(object). First it shows how you can use str as a function, which we're going to get to in a bit, and next it provides a description. We're just going to look at this first statement right here: it creates
a new string object from the given object. The rest of this gets into a lot of detail that I don't want you to focus on too much just yet, but underneath here is what I do want you to focus on: methods defined here. These are the capabilities that it has. I'm going to scroll down a little bit, and we can see that we can do add with it, so if you add strings together, it will join the other string onto it. Another method available is len, so whenever we do something like
a length on it, we can see the length of the string. Now, these are all the double-underscore methods, or dunder methods, of the string class, and these are just special ones; if we go back to that addition, we can use the plus operator in order to add things. Another thing we can do is a method called capitalize, which returns a capitalized version of the string. So, just to demonstrate some of these capabilities that we've found in this string class, we can actually do a lot of these operations. Here we're
doing the addition one, where I want to add data to nerd. Pressing Shift+Enter, I can see that it adds them together. Now, if I wanted to capitalize the string, I can call the capitalize method: I'm going to just put a period and then capitalize, followed by opening and closing parentheses. Running it with Shift+Enter, I thought it didn't do what I wanted, but it did: it returned a capitalized version of the string, or more specifically, it makes the first character
uppercase and the rest lowercase. In this case, if we scroll back down (oh my gosh, this is a lot of documentation) to the data nerd capitalize call, it lowercased the n because it's following that description. Anyway, we're going to talk more in depth about strings in the upcoming section, so don't worry if those methods don't necessarily make sense just yet; we're going to dive into it deeper. What I want to focus on now is this: right here in this help documentation, it tells me that there's a function available with this string, and
I can run the str function on an object to turn it into a string. Spoiler alert: we can use this type of function with a lot of different data types. In this case we have string at the top, and we can run this str function on an object to turn it into a string, but we can also do this with a multitude of other built-in data types in Python. You may be like, Luke, in this case here, where job title is set to something like data analyst and is thus
already a string, why would I need to run a function on this to turn it into a string when it already is a string? Well, the purpose of these functions is to turn an object into another type. For example, let's say we had a job ID and we set it to this value of 102. When we run type on it, we're going to see it's an integer. Let's say we actually want the job IDs to have decimal places, because future values will potentially need decimal places to capture the ID.
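To make the ideas so far concrete, here is a minimal sketch of what integers and strings can and cannot do, plus the conversion functions just mentioned. The values mirror the ones used in this lesson; names such as combined, job_id_as_float, and job_id_as_text are made up for illustration, and the walkthrough that follows covers the float conversion step by step.

```python
total_salary = 110000
bonus = 10000

# Integers support subtraction.
base_salary = total_salary - bonus
print(type(total_salary), base_salary)

# Strings do not: subtracting one string from another raises TypeError.
job_title = "Data Analyst"
try:
    job_title - "Data "
except TypeError as err:
    print("TypeError:", err)

# Strings do support + (concatenation) and methods such as capitalize().
combined = "Data" + "Nerd"
print(combined.capitalize())  # first character uppercase, the rest lowercase

# Conversion functions build a new object of another type.
job_id = 102
job_id_as_float = float(job_id)
job_id_as_text = str(job_id)
print(type(job_id), type(job_id_as_float), type(job_id_as_text))
```

Running help(str) or help(float) in a notebook cell lists the methods each of these types supports.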
How could we do this? Well, we could do this by wrapping that float function around 102, and in this case, when we go and print out the type of job ID, we can see that it is now a float. I can also go into that variable menu on the left-hand side and see that it's now keeping track of job ID as a float. One thing to note that's special to Jupyter notebooks is that job ID is now saved as a float within this entire notebook. Although right here we have the job ID of 102
and we see that it's an int, or at least it says it's an int right here, if I were to put a piece of code underneath here and check its type, it's still going to say it's a float, because inside of this kernel it's now been updated. That's just something to think about when working with Jupyter notebooks and moving around between different cells: which variables are stored, and what you actually access, depends on what has been run in that environment. Anyway, now that we
have this float type for job ID, if I wanted to learn more about it, I would use that help function on float. Running it with Shift+Enter, it's going to do the same thing, providing me this class overview, showing the function itself and then all the different methods available to this float object. We're going to be going into a lot more of these built-in data types in upcoming sections; in the next one we'll get into strings. So don't get too discouraged if you're not following along with these different methods or capabilities of
these different data types just yet. All right, for those that purchased the course problems, it's now time for you to dive in and start working with all these different built-in data types within Python. With that, see you in the next one. So, strings are the first built-in data type that we're going to be diving deeper into. A lot of the core concepts that we're learning in this section are going to be applicable to a lot of the other built-in data types that we'll be covering coming up. So why are we covering
strings first? Well, if we go back to that data set that I keep teasing, strings make up a lot of the different columns, so a lot of the data that you're going to have to manipulate later in order to get insights is going to be strings. We previously discussed defining certain objects like a string, and in this video we're going to be diving into the methods of a string, such as, in this case, using upper to uppercase python. In the next video we'll get into a more advanced
topic, string formatting, where we can insert that skill of python into another string. So, jumping back into a Jupyter notebook, let's dive deeper on this. Strings can be defined either in double quotes, as I have it here, saying this is a string in double quotes, or within single quotes; either way it's going to be treated as a string. Now, the most common mistake beginners make with strings is forgetting to put these quotes around the text, and you'll see this from the syntax error that you're going to receive
with it: you get this caret pointing right at it, saying, hey, there's invalid syntax here. And recall, if we have some sort of variable such as skill, defined as python, I can invoke the type function on skill to understand what the type of it is, and it's a string, str. So let's get into manipulating this string by using those built-in methods for a string. The first thing I want to try is uppercasing all the letters within python for that skill.
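As a rough sketch of the points above (quoting, checking the type, and the uppercase and lowercase calls described next), the following can be run in a notebook cell. The variable names double_quoted and single_quoted are made up for illustration, and compile() is used here only as a way to demonstrate the missing-quotes syntax error without stopping the cell.

```python
double_quoted = "This is a string in double quotes"
single_quoted = 'This is a string in single quotes'
print(type(double_quoted), type(single_quoted))

skill = "Python"
print(type(skill))

# Leaving the quotes off is a syntax error. compile() lets us observe the
# error for a line of source text without crashing this script.
try:
    compile("skill = data analyst", "<example>", "exec")
except SyntaxError as err:
    print("SyntaxError:", err.msg)

# String methods return a new string; the original is left unchanged.
print(skill.upper())
print(skill.lower())
print(skill)
```

Because upper and lower return copies, skill itself still holds the original text afterwards.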
So I can type skill.upper, add opening and closing parentheses, press Shift+Enter, and it's going to capitalize everything. Conversely, I can do skill.lower with opening and closing parentheses, run this, and it's going to put it in all lowercase. In no way should you have all of these different methods of all the different data types we're going through memorized. That's why you need to remember that help function: we can insert something like str into it, run it, and get help understanding what the string can do. So in
this case, when Kelly and I were researching examples to show for this video, we used this function to look at what methods are defined here, and then we scrolled on down past these dunder methods (we'll go back to these in a second) to see what other ones we could use. We found this upper method, which states that it returns a copy of the string converted to uppercase, and then lower, which returns a copy of the string converted to lowercase. Now, real quick, I want to go over
why you're seeing this self and also this forward slash in here. These are the arguments that we're providing to the method. Now, we ran this previously; I'm going to simplify it to where we have the string python and then we use lower. Running this with Shift+Enter, we can see we have lowercase python. But as you can see, there's no self inside of here, so it's sort of confusing what's going on there, especially for those new to Python, and there's also no forward slash. Well, if we were to write the
verbose version of this, we would write out the class, str, then apply that lower method to it, and then inside the parentheses we would put python as the self. In this case, when we run it, it's going to give us the same thing, that lowercase python. Now, what about that forward slash in there? Well, the short answer is that you never type it yourself, but the technical definition is that arguments to the left of it are positional-only, while arguments to the right can be passed either by position or by keyword, and this self is a positional-only argument.
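The short form and the verbose form described above can be compared side by side. This is a minimal sketch; the names short_form and verbose_form are made up for illustration.

```python
skill = "Python"

# Usual form: the string before the dot is passed in as self automatically.
short_form = skill.lower()

# Verbose form: call the method on the class and pass the string as self.
verbose_form = str.lower(skill)

print(short_form, verbose_form)
print(short_form == verbose_form)

# self sits to the left of the "/" in the signature, so it is positional-only
# and cannot be passed by name.
try:
    str.lower(self=skill)
except TypeError as err:
    print("TypeError:", err)
```

In everyday code only the short form is used; the verbose form is just a way to see where self comes from.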
Anyway, don't worry too much about that right now; we'll cover more of that in the functions section. For right now, it's just important to understand that whenever we define a skill, in this case skill equal to python, and run the lower method on it with opening and closing parentheses, Python automatically fills in that self, in this case with python, and therefore lowercases it. Now, sometimes methods take more than just that self argument. In this case, let's look at this replace method. Its documentation
says it returns a copy with all occurrences of substring old replaced by new. Inside the parentheses we'd specify old, new, and then the count. Let's start by actually writing this. I could say skill and then from there replace, and let's say I want to create Jython, which is a Java implementation of Python; I could say I want to replace P with J. So we've got the old and then we've got the new, but what about this count? Well, as you can see here, count is equal to -1, and if we
read the documentation, count is the maximum number of occurrences to replace, and -1, the default value, means replace all occurrences. Because this argument already has a default value, I don't need to specify it here. Running this with Shift+Enter, I can see that it replaced the P with a J. Let's show another case where we actually use that count argument. Let's say we have this job title of data analyst, and we've now changed our language; for some reason we don't use data anymore, we want to use doto. If you're looking at it, you can see
that this means replacing the first two a's with o's, but we don't want to touch the rest of the a's. In this case I can write job title dot replace: first I'm going to specify the old, which is a, then the new, which is o, and then the count, and in this case I'll just say two. Pressing Shift+Enter, we now have doto analyst. Now, with our doto analyst example, it's pretty common that we want to split words apart. Right now our doto analyst is
one combined string. Let's say we want to break it up into two separate strings, doto and analyst. Conveniently, strings have this split method, which returns a list of the substrings in the string, using sep as the separator string. We haven't gotten to lists yet, but you don't need to know too much about them to understand this method. Anyway, the separator is used to split the string, so in our case the separator in doto analyst is the space in the middle. Right now sep defaults to None, which splits on any whitespace, but we're going to
actually define that when we do this. Maxsplit, which designates the maximum number of splits that we want to do, is -1, which means there's no limit. In our case we really only want to split it once. So I take the job title and then do split, and opening up the parentheses we have the self and then that forward slash. Now remember, everything to the right of that forward slash can be passed as a keyword argument, so I'm going to write it by keyword and specify sep equal to
a space in this case. Additionally, I specify maxsplit equal to one. Running this with Shift+Enter, I can see that it did separate out data and analyst. But remember, we wanted doto analyst. In this case, with job title dot replace, we didn't assign the result back to job title, so we need to fix our code right here. I'm going to run these cells again: job title equal to data analyst, then job title equal to this replace call, to replace
it and get doto analyst, and now when I run the split I do see doto on one side and analyst on the other. It's in a list (we'll cover lists in a bit), but we've separated it nonetheless. Now, these keyword arguments, shown here with that maxsplit equal to one, and those positional arguments can be very confusing, so don't get too deterred by it. Just to make sure we're on the same page, going back to that replace method, you see we can specify old, new, and count, and they're all to the left of
that forward slash. So in this case, if I were to put count equal to two in here, specifying it as a keyword argument, and try to run this, I'm going to get an error that says str.replace() takes no keyword arguments (newer Python versions, 3.13 and later, do accept count as a keyword). So I can't specify it by name, and I'll run this again without it. Conversely, looking at the split method, sep and maxsplit were to the right of that forward slash, which means they can be given as keyword arguments, but we wouldn't necessarily have to name them, and I'm going to
show you an error with this. We wouldn't have to name them, so if I run this with just the values, it's still going to work. However, let's say we wanted to leave the separator as None, and I removed the space and only had the one in there: it's not going to know that the one is meant for maxsplit, so when I run this with Shift+Enter, it raises an error. So although keyword arguments can work positionally, as when I ran it with the space and the one, I don't recommend it. I recommend actually naming them, so that way
your code is more reliable and you don't make mistakes like I could have right there. Now, the last thing to cover is something that I conveniently skipped over when we first introduced these different methods, and that's these dunder methods, which stands for double underscore, as shown by the double underscore before and after the method name. These are also called magic methods. Let's look at this add method. It's pretty self-explanatory: it has self and then another value, and what happens is it will add itself to the
other. I'm going to write this out the long way, which is not what you'd normally do, just to explain how it works. I'm going to type double underscore, add, and then double underscore once more. Remember, it's going to take that self, so we can pass something like data as the first argument, and then for the other value we can add in analyst; I'm going to put a space before that. Now, whenever I run this by pressing Shift+Enter, it merges those two together. Writing this a little bit more shorthand, although
I don't recommend doing this either, I can invoke this dunder add method directly on data and then pass as the argument that space and then analyst (I spelled that wrong), and pressing Shift+Enter, it does the same thing, giving data analyst. But neither of those is what you're actually going to write. Python is pretty cool in its readability and usability, so a lot of these dunder methods are hooked up to operators. Instead of having to write all that different crap that I had
to write before, I can just write it simply like this: data plus analyst, making sure that I put them in quotes. Whenever I run this, I get data analyst. This addition symbol is using the dunder add method right here; we're just not seeing it, because it happens on the back end, and it makes the code a lot more readable. Looking at all these different versions, with the plus I can tell at a glance that this is adding the two together. This is what makes Python so great. Now, besides add,
we also have other things. As we found out before, we don't have subtract, but we do have this one for multiply: if I have something like data analyst and I want to display it 10 times, I can say data analyst times 10, and running this, it's going to repeat it 10 times. Now, these magic methods or dunder methods don't only work with operators like plus, minus, and multiplication; some are also called by certain built-in functions. In this case, if we want to get the length of a string, we could
use this len function. Instead of writing out the dunder len method, I can just wrap the string inside of len, and whenever I run this I get the same value of 12. Once again, I find this a lot more readable and easier to use. All right, so that's a crash course on not only how to use methods with strings but on methods in general. A lot of these core concepts that we've covered on methods are going to be applicable to other data types that we're going to be covering coming up.
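As a recap of the string methods from this lesson, here is a minimal sketch covering replace with a count, split with named arguments, and the dunder methods behind the plus operator and len. The variable names parts, long_way, direct, and operator_form are made up for illustration, and the repeat count of 3 is just shorter than the 10 used in the video.

```python
job_title = "Data Analyst"

# replace(old, new, count): count is passed by position here.
# Strings are immutable, so the result must be assigned back to keep it.
job_title = job_title.replace("a", "o", 2)
print(job_title)

# split: naming sep and maxsplit keeps the call readable.
parts = job_title.split(sep=" ", maxsplit=1)
print(parts)

# Passing only 1 positionally fills sep, not maxsplit, so it fails.
try:
    job_title.split(1)
except TypeError as err:
    print("TypeError:", err)

# Dunder methods sit behind operators and built-in functions.
long_way = str.__add__("Data", " Analyst")
direct = "Data".__add__(" Analyst")
operator_form = "Data" + " Analyst"
print(long_way == direct == operator_form)

print("Data Analyst " * 3)        # the * operator uses __mul__
print(len("Data Analyst"))        # len() uses __len__
print("Data Analyst".__len__())   # the long way, same value
```

The operator and built-in forms (+, *, len) are the ones worth using day to day; the dunder calls are shown only to make the connection visible.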
Don't get deterred if it's not quite clicking, because you're going to get plenty of practice with this. For those that bought the course problems, you now have some problems to go work through and solve using the different methods that I've shown here and even more. All right, with that, see you in the next one. Heyo, editor Luke here, and I want to do a quick intermission on how to use resources such as chatbots and also the internet as you go through this, because I think a lot of the concepts, especially what we just covered in strings
on the different methods, can be pretty confusing, especially for those that are new to programming. So let's first talk about AI chatbots. I particularly like ChatGPT and also Perplexity. We're going to stick to ChatGPT, specifically the free version, in this example, so you don't have to pay to do this. In the case of understanding methods, I'm just going to go to the chatbot and ask it: hey, what are methods in Python? Explain it like I'm five. Additionally, I'll say, explain this using the concept of
strings. All right, it goes through and explains this much more simply than even I did, saying: think of methods as special tools that can do cool stuff with words (strings) in Python, and it gives some examples of these. I usually like to take it further by having it show me actual coding examples that I can run, so I prompted it further: show me how to use common methods in Python for strings, actually show the code, and show what you expect as the output. Okay, and
then scrolling up it provides even more examples than I did in this case where it goes through and shows things like oh this is how you do the upper and then if I scroll down I can see what is the expected output from this now using these chatbots is also great for troubleshooting those error messages if you recall back from whenever we did replace and we were investigating what to actually put into it we saw old new and then that positional argument of count whenever I tried to run it with count actually written in as a keyword there we
got this type error anyway let's say you didn't know what was going on here or if I received a different error message what I can do is I can just copy this all I'm going to command C and then inside of ChatGPT it's actually sophisticated enough that I don't even have to provide it a prompt I can just provide it the error message itself press enter and it will provide me the updated code basically it removed that count keyword in front of it in order to fix it but we can actually take it a
step further in this case I want to understand what's going on here so I prompted why did I get this error explain it to me and it goes in further to explain that hey it's not a keyword argument it's a positional argument so you're not going to actually specify that keyword anyway just wanted to share that insight because you're going to be getting a lot of error messages and you're probably going to have a lot of different questions as you're going through this and so take advantage of even these free chatbots to help explain
these concepts more for what you may need all right with that see you in the next one as we get into more on string formatting so now that you've played around or should have played around with different methods inside of strings let's build on that further with this topic of string formatting so jumping back in my Jupyter notebook we're going to be going over five different operations for string formatting and some of them you may have even played with already and well the first one I know you did okay so let's
say we have these two variables we're going to be playing with them a lot during these exercises we have one variable of role which is the string of data analyst and then another of skill which is the string of python now let's say we want to write out to the screen role and then we're going to be provided a role in this case data analyst that we're going to write out to the screen so for concatenation we can use the operator of plus which is that dunder add method and then from there specify role running this
shift enter we can see we get role:data analyst with no space right here I'm actually going to put a space press play and now we have role colon space data analyst so we did see that one before but there's actually another one and I want to actually go into using that help function again on str or string and shift enter that and that is this format method right here so for the syntax for this we're going to start it with the string itself append on as we saw before that format method and then inside the parentheses this is
probably new to many of you this asterisk args and then this double asterisk kwargs which stand for arguments and keyword arguments we'll go more into breaking this down in the function section but just understand that these asterisks and these double asterisks are special syntax that lets us actually apply or put inside of these parentheses many different arguments or many different keyword arguments anyway what does this thing even do well it returns a formatted version of S using substitutions from args and kwargs the substitutions are identified by braces so I'm just going
to copy and paste this to make this easier into here and I'm going to comment it out to actually comment out everything you select it and then from there you press command and forward slash Windows users are going to press control forward slash but it's a quick way to comment out things anyway I just want to have it here so I can actually see it and reference it quicker so similar to before we want to create the string role data analyst remember data analyst is in that role variable already similar to before
we want to create a string and we're going to start first very simply doing role and then specifying data analyst so let's start with this string portion right here that we're going to stick in front of the format I'm going to write out role now additionally it says the substitutions are identified by braces so we're going to put these braces inside of this string and then from there I'm going to type in that format method and I'm going to specify inside of here what I want to put inside of these braces so pressing shift enter
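A rough sketch of the two approaches covered up to this point, concatenation with plus and the format method with a single pair of braces. The variable name role comes from the narration; the exact strings and capitalization are assumptions.

```python
role = "Data Analyst"

# Concatenation: the space after the colon has to be typed by hand.
concatenated = "Role: " + role

# str.format: the braces mark where the argument gets substituted.
formatted = "Role: {}".format(role)

print(concatenated)
print(formatted)
```

Both lines build the same string; the format version keeps the template in one piece instead of gluing parts together.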
we can see it specifies role data analyst now let's take this a step further let's say we want to instead now have multiple different variables put inside of this string and in this case I want to define not only the role but also the skills required in this case python so similar to before I'm going to have that role and then the braces right there and then next I'll have the skills required and then the braces for that as well from there I'll add that format method and I'll specify in here role and skill going
back to that documentation these arguments are actually going to be passed through this asterisk args or this unpack operator inside of arguments so both of these are going to be applied to this running shift enter we get role data analyst skill required python looks like I also formatted it wrong I didn't put in the colon oh my goodness okay that's better anyway this allows us not only to put two but any number of different variables inside of here and then still format inside the string so that is the format method and I'll be honest that's
not necessarily my favorite I still find this pretty hard to read like if I was going through it role I'd have to look back and forth to understand what's going on here and I have to also make sure that I actually align these properly now I could modify this further and actually add names inside of the braces here such as role and skill which look the same as these variable names but are actually different anyway it gets really confusing now because we have these two things in here role and skill we could use
keyword arguments instead of role equal to role skill equal to skill whenever I run this I still get the same results and this would maybe potentially be used in a case where I have like five or 10 different values here and I want to actually specify them anyway just want to make you aware of this I'm actually not a fan of this I'm more a fan of f-strings they're not only fun to use they're also pretty fun to say so here I have my variables defined and with f-strings as the name goes
you put an f at the beginning of your string itself so in this case I'm going to have an f and then either single or double quotes and then I'm going to write the similar statement that we're trying to do role and then I'm going to put curly brackets and specify the variable that I want to actually appear inside of here similarly I'll do the same thing with skill required and I'm going to put curly brackets and specify skill inside of them so now this f-string very much works like that format
method that we went over where we're going to specify these variables inside the curly brackets pressing shift enter boom I have it role data analyst skill required python personally I'm more of a fan of this right because it's so much more readable I can just read over this role okay I expect the role next and then skill required I expect the skill to be next whereas with the format one yeah we can actually specify variables but then we have to do it all and configure this
at the end it's a lot more work with that f-strings I feel are your best bet now another one I think that you should be aware of is this printf-style string formatting this came in a previous version of python and it was carried on into Python 3 I'm not a fan of using this but it does come up from time to time so I think you should be aware of it in case you see it but I don't think you should necessarily learn how to
use it anyway similar to before I can use single or double quotes so in this case I'll specify role and I'm going to use that percent symbol and then s and then we'll continue we'll write both these in a second but I want to just show this for the time being from then you put a percent symbol again and then after that you put role in parentheses so running shift enter we have role data analyst now if I want to add both variables I can do as you expect I'll put skill required right here and
then put that percent symbol and s and then inside of the parentheses put skill and now in this case it's going to output both role and then skill required I find this really weird to read I'm not a fan of it I don't recommend using it but you should know about it the last method we're going to be going over is join and diving back into the documentation first for the method you have join and in parentheses you have self and then you have the iterable so it has to be an iterable which we're going to
go over that in a second anyway it's going to concatenate any number of strings the string whose method is called is inserted in between each given string the result is returned as a new string so in this case they give the example of there's a period here and we're joining it with this list which we haven't gone over lists yet but this list which we can see has multiple strings inside of it and what it does is it adds a period between them so I find more use out of this method
whenever I have something like a string which is an iterable and I want to actually divide it up so say I had something like years of experience and I want to separate it by all these values via comma I could have actually typed it out zero comma space one comma space whatnot that's a lot of repetition python is good at automating things so what I can do is specify how I want to actually separate all those numbers so in this case I'm going to use a comma and then a space and then from there put
in join and we need to list that iterable like I said years experience which is a string is an iterable we can actually cycle through it I'm having some spelling issues and we can run this shift enter and now we can see 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 now we haven't covered lists yet but I think you can understand the basic concept right you have items inside of here you have a list so say I have this list of skills all these skills right
now they're inside this list they're not a string so if I call type on skills and actually run this we'll see that it's a list anyway let's say I want to get it into a string and in this string similar to before I want it to have that comma and then a space after each one from there I can just add that join method specify skills inside of there and then run it and then now I have this full string that I can copy right here and it has python SQL
and Excel in it all right so that is string formatting if you remember anything from this video in the future I hope that you remember and it comes as no surprise f-strings they're the most powerful the most readable and the easiest to use so take advantage of them they're very useful all right got a couple practice problems for those of you that purchased the course's practice problems and with that I'll see you in the next one so the core of data analytics really revolves around using basic math and if you've used a calculator before like most people
you've used different operators that we're going to be going over in this section in order to perform math we're going to first warm up with arithmetic operators doing things like adding and subtracting from there build on it further to use assignment operators which not only assign a value but also let us do a lot of that arithmetic while assigning a new value and then finally get into comparison operators which basically review where to point Pac-Man in order to figure out what's the greater number so jumping into a Jupyter notebook we've already shown before how to
do basic math adding something like two which is an integer and then 3.0 which is a float we get the final answer of 5.0 we also saw previously that we're not limited to just integers or floats sometimes we can also use strings as well to do operations like an add and we're not limited to just add with strings we can do something like in this case multiplying a string times a certain number and then from there it's going to output that string repeated that number of times in this case 10 now what happens if
I wanted to use something like the minus operator so in this case I'm using a float of 3.0 and an integer of two I can see I'm going to get 1.0 but for a string let's say I wanted to subtract if you will up from what's up and actually run this I'm going to get a type error it's going to say unsupported operand types for minus str and str the operands in this case are the what's up and then the up and then the operator is that minus symbol anyway how do we know what is
actually supported for these different operators that are available to us and specifically in this string case well we go back to one of my favorite functions help and then type in str from there shift enter and then inside of here I can see underneath these magic methods which are these dunder methods that have the double underscores before and after the name I have this add right here so I know I can use the plus operator in this case also I have the dunder method for mul or multiply in this case I can use
the multiply operator but if I look through these different methods and try to find the subtract method or sub I'm not going to find it and it is kind of intuitive what to expect but I think you should at least know the documentation for it as when I run help on int I can actually see that the subtract method is actually there and it uses that minus sign in order to subtract them now recall that you can call these magic methods technically so in this case I assigned four to the variable four and
then from there ran the magic method of sub on it and then use the value of two so basically 4 minus 2 in this case we're going to get two but as I talked about previously that's just to show you that the method's available the actual pythonic way to actually do this would be 4 - 2 and then shift enter get two now I do want to cover some of these methods that you may not have seen or have experience with before they do come up from time to time so let's say I had something
like 5 / 2 one thing to note real quick anytime I do these operators I'm usually going to put a space in between the number and the operator you don't necessarily have to do that it's still going to work just fine but it just makes it more readable and easier to go through anyway getting back to it so we have this 5 divided by 2 and that's going to give us a value of as we expect 2.5 now let's say we wanted to maintain right now we converted it to a float the answer is in
a float let's say we want to maintain it as an integer in this case I could do something like floor division which is using two forward slashes instead of just one single one in this case I'm going to get two alternatively if I wanted to get the remainder of what's left over after we divide 5 by 2 which we would expect two goes into five two times so one is left over we could use the modulus operator which is basically the percent symbol so 5 modulus 2 running this shift
enter this would be a pretty common operation we'd see if we'd have to do this of 5 by 2 is two with a remainder of one in order to maintain integers and not get a float quick reminder on order of operations remember it goes parentheses exponents multiplication division and then addition and subtraction so let's say we had two salary values that of the minimum salary of 20,000 and the max salary of 80,000 notice here I have these underscores in between here this underscore is actually ignored by the interpreter when
reading these values right here and this is pretty common to use especially whenever you're getting to thousands millions or billions in order for you to be able to count how many different thousand separators there are anyway we'll show it work in a little bit so let's say we wanted to get the average salary Well we'd have to add the minimum salary and then from there add it to the max salary and then finally divide it by two in this case I expect order of operations parenthesis go first along with the division at the end and
then I want to actually print this out to make sure that I got it what do we expect the results to be I had to do the math real quick 50,000 so as expected python ignored that separator right there it's not going to actually print it out with it in there anyways but I can just take it out to also show that it's going to work as well still 50,000 next up is assignment operators and we've already gone over the equal sign before but these are going to combine the arithmetic operations along with an assignment
operator and all the different arithmetic operations are still available so diving in a little bit deeper to the core assignment operator of equal let's say we have two applicants applicant one Kelly applicant two Luke let's say we want to switch those applicants well I could do something like this I could create a temporary variable and then assign it to let's say applicant one and then from there assign applicant one to be equal to applicant two and then conversely make it to where applicant two no spelling error there is equal
to that temporary variable and then when I go to actually print these out running shift enter we can see now that applicant one and two it was previously Kelly Luke it's now Luke Kelly well the cool thing about python is you can actually do assignments of multiple values on the same line let me show you what I mean let's get rid of this right here cuz we're not going to save that so I'm going to rerun this line up here assigning Kelly and Luke anyway this assignment within one line allows us to use something like
a comma so applicant 1 comma applicant 2 is equal to applicant 2 comma applicant 1 so we can swap values on the same line and then whenever we run this we can see now it is Luke and then Kelly now this is pretty cool because you can't do this in every programming language all right moving on to the next assignment operator let's say we have something like x equal to 1 and then from there we want to add one to this variable of x so I could write it to where I have x equal
to if I need to add one to x I would define it again as x + 1 and then running shift enter and then printing out what x actually is we can see that it is now two well as you expected there's actually an easier way to do this so instead of actually having to write x again what I can do is use this add and assign operator and then from there just put the value that I want to add this is basically the same as x = x + 1 as we showed
up above and then whenever I run this and actually print out x below we could see that it is two conversely if I wanted to multiply x by 2 I could go x and use the multiply and assign operator and then from there add that value of two in and then also print x right underneath it shift enter we'll see that that is now two now all the previous arithmetic operators are available but I'll be honest I most frequently see the add and assign for this so that's all we're going to cover on
this all right last up in this section are comparison operators and this just like you did in second grade is verifying whether values are equal to each other not equal or even greater or less than so let's say I wanted to compare two salary values which we're going to be doing a lot of in the project section of this video but for now we're just going to compare these two variables of salary for Kelly and salary for Luke so for this I can check if the salary of Kelly is equal to the salary of
Luke by using this double equal sign whenever I run this this is going to return either true or false and in this case it returned false now this operator is not necessarily limited to just variables so you could technically put in something like is one equal to 1 then running this as you expect you would get true now a common mistake that you are probably going to make if you're new to python is you're going to forget sometimes that whenever you're trying to do some sort of comparison you need to put two equal signs you're
probably going to put one equal sign in if I were to do this of setting an integer of one equal to one or really any integer it's going to say syntax error cannot assign to literal here maybe you meant double equal signs so pretty nice that they actually give you this little hint here now as far as the importance of these things like comparison operators cuz remember right we said they return either a true or false value you're probably like Luke why the heck do I even care about this well they're going to be
used inside of other things like we're going to be going over if statements in the future don't need to worry about how to actually write this code just want to show it as example but in this case is evaluating if salary Kelly is greater than salary Luke print this statement got to love pythonic uh python because it's so easy to read all right so shift enter as you expected it's going to print Kelly is paid more than Luke so these comparison operators are going to be used inside of other different functions Loops or lists or
whatnot in order to conduct operations more efficiently and coincidentally we're going to be covering more on conditional statements or these if statements here in the next lesson all right for those that purchased the course problems you have some course problems to actually work through right now and get more familiar with these different operators with that see you in the next one so as we alluded to in the last section on comparison operators we're going to be using those types of operators inside of conditional statements these statements do quite simple things of evaluating whether a condition
is true or false and then executing code based on that so jumping into my Jupyter notebook there's three main keywords that we're going to be covering in this of if elif and else and we're going to be focusing on if first just to break down the simplicity of it so we start a conditional statement by defining a keyword and they always have to start with the if keyword then for the condition we're going to just keep it simple for the time being and we're going to actually just assign it of true
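As a minimal sketch of the simplest case just described, here is an if statement whose condition is literally True. The greeting variable is hypothetical, added only so the result can be inspected afterward.

```python
greeting = ""  # hypothetical variable, only here so the result can be checked

# The condition is just True, so the indented line always runs.
if True:
    greeting = "What's up, data nerds!"

# With False as the condition, the indented line is skipped.
if False:
    greeting = "this never gets assigned"

print(greeting)
```

Swapping True for a comparison such as two values joined by the double equal sign gives the practical form used later in this section.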
we'll use those comparison operators in this place in a second but we're just going to keep it simple anyway you have if the keyword the conditional value and then from there you have the colon now for the action we want to complete I pressed enter after going through the colon and automatically if I highlight this I can see it puts two spaces in there so it indented it anyway now I can put in the action I want it to actually take so in this case if true which is going to be
true because I defined it as true it's going to print what's up data nerds running shift enter prints what's up data nerds conversely if I change this to false I expect nothing to happen because it's now false quick formatting note this indentation is actually very important whenever we go through and execute our python code the interpreter behind the scenes reads this indentation and actually uses it so if I were to take that away and just have it underneath the statement and press play I'm going to get an indentation error conversely whenever I actually
pressed enter it went through and automatically put in two spaces in here so sometimes you'll see two other times I'll just press tab which is also going to give two inside of here sometimes you'll see people put in four spaces and if I run it here with four spaces this is still going to work and actually I can put in any number of spaces and it's still going to work the important thing is that it's indented we're just going to maintain for consistency just those two spaces so let's move into a more practical
example of how I'd see this conditional statement used in the real world let's say I'm trying to filter out applicants for a job based on a skill specifically I'm looking for applicants with the job skill of SQL and we have this one applicant and their skill is SQL so let's get to defining that we'll start with the keyword of if and then from there we're going to be comparing the job skill if it's equal to that applicant skill and then from there I'm going to press enter and so if those two skills are equal to each
other I want to print out that skills match so I'm going to go ahead and run this press shift enter and as you saw I made a common error right here right so if job skill basically it says syntax error points the caret up at job skill invalid syntax maybe you meant the actual comparison operator not the equal sign anyway actually putting in another equal sign into there and running this we can see that skills match but now let's move on into that elif keyword let's say we have a condition now where we're checking those
skills of SQL let's say they don't match this but say they have a lot of years of experience and so they would still be something that we want to match and flag and basically cue that we want to look at this person well in that case we can use an elif statement so I'm actually going to just take this right here and paste it right below it so elif is going to begin on the line right below it it's not going to be indented so it's going to be in line with that if
keyword and we'll say elif the years experience which we haven't defined yet we'll define it in a little bit is greater than or equal to five then we do want to actually match this candidate so we'll say print enough experience no SQL and I'm going to go ahead and just define that years experience right above we'll do an example value to make sure it matches first of equal to six I'm also going to move these other variables down underneath here so we can modify them and use them easier correcting that print statement so we want to
make sure that this is well first let's actually just run this as it is we can see we're going to match on the SQL skill because SQL is equal so let's just change this applicant skill let's say they have the more superior skill of python just kidding and then actually run it so we're going to fail basically or have false for the first condition of if because python does not match SQL then for the elif because years experience is greater than or equal to 5 which is six in this case it's going to print enough experience
no SQL so what happens if we change this to something like three run it nothing's going to print now I don't really like how this doesn't print something out or capture some sort of value whenever it doesn't meet the skills or experience so in that case that's a great fallback for our last keyword to cover and that is else and basically else is going to execute if both the if and elif statements are not executed so in this case I'm going to once again go down and start the new keyword of else and
then from there all I have to do is put a colon in there's no condition for this it's just going to run if both if and elif are false and then from there actually print out no skill or experience so whenever I go ahead and run this based on these conditions I have that final statement of no skill or experience now one thing I glossed over with elif you can have multiple elif keywords within here so let's say in the case of that job skill since I'm really partial towards python let's say that we have
a python skill for an applicant in this case we could print out something like this applicant should know SQL basically if you know python you probably should know SQL anyway going ahead and actually running this shift enter I can see we get no skill or experience and that's because I wrote this wrong I wrote this with the job skill and I'm trying to look at the applicant skill right so that's why we always actually try to troubleshoot so running this again with applicant skill in there pressing shift enter this applicant should know SQL and the last
thing to know on this is sometimes you may read some others' functions and you may see this keyword of pass this is typically put into place whenever maybe you don't know what you want to put there for the time being but you need something to be within or underneath that keyword of if or elif so in this case when I run this nothing gets executed here but let me show you more what this means so let's say we actually defined if job skill is equal to applicant skill and I'm not
sure what we need to put in just yet if I want to run this it's going to say syntax error incomplete input what I can do instead is actually press enter and then for the time being just to run it and get it through put in pass and from time to time you may see hardcore coders put in things like three dots which are called an ellipsis and in this case it's going to be doing the same thing as that pass keyword but I actually like to write out my keywords and
make it more readable so we're just going to put pass in all right that's enough of me rambling on conditional statements now your turn to give it a try and test out different scenarios with these conditional statements all right and with that see you in the next one so I just recorded this entire video on lists and realized I wasn't recording so now we're recording it again anyway so this one and the next three are on the four different types of data containers that I find that I use time and time again and that
are lists dictionaries sets and tuples we're going to be starting with lists because not only are they the most commonly used but also they're one of the easiest to get up and running quickly lists are denoted by their square brackets and can contain any data type within them and in this section we're going to be going over common methods associated with lists for altering them adding to them and changing them we're even going to dive into things like unpacking lists so that we can access individual items of them and why are lists important well we use them not only
in the real world but we're also going to be using them inside our data set on job postings where each of these rows is a job posting and if we scroll on over to the right for each of these postings they have a set of skills that is required for it here I'm selected on the second line where I have skills like python R SQL Cognos Alteryx Tableau and SPSS each one of those skills is a string inside of a list and this data set has hundreds of thousands of different skills required for jobs and
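To make the description above concrete, here is a sketch of what the skills for that one posting might look like as a Python list of strings. The variable name and the lowercase spelling are assumptions; only the skill names come from the narration.

```python
# Hypothetical representation of one posting's skills as a list of strings.
posting_skills = ["python", "r", "sql", "cognos", "alteryx", "tableau", "spss"]

print(type(posting_skills))  # the container itself is a list
print(len(posting_skills))   # how many skills this posting asks for
```

Each row of the data set would carry its own list like this one, which is what makes combining and counting skills across postings possible.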
python is great at going through and basically combining all these lists so we can analyze them so here we are inside my Jupyter notebook I have this table here don't worry about it too much especially since we have lists dictionaries sets and tuples I want to be able to show as we go through this how they actually relate but the first thing we're going to look at with lists is their use they're for a collection of ordered items and we're denoting them with those square brackets and if I wanted to define a list
with those square brackets I can put anything in it from an integer to a string like python and then I can even put another list inside of there containing things like SQL and the number two so any number of items can be in that list you're pretty much limited by your computer resources pressing shift enter we can see it inside of there but actually let's talk about a practical example a list of skills some sort of list that maybe you have of skills you have in this I have
SQL Tableau and Excel all these items are strings inside this list so what can we do with this let's say I wanted to maybe add or even remove certain items or certain skills from this list well how would I find that out well using my favorite function help and then putting list within it press shift enter here's the help on class list it's its own class so a list is an object we can see things like add that we saw before which we're going to get to but what we care about is adding values and then removing
from it so actually I want to look at other methods including things like append where we want to append an object to the end of the list and for this all we have to define is an object to append to it additionally we have methods like remove which removes the first occurrence of a value so for our job skills list let's say I wanted to append a new item to it you can probably guess what skill I'm going to add to this it's going to be python press shift enter we added
it to it and then actually checking out that list pressing shift enter we can see we have SQL Tableau Excel and then python as it added it to the end now if I wanted to remove an item from the list I could do job_skills.remove and then from there it takes a value and it says remove the first occurrence of a value so in this case it's expecting a string for what we need to put in here so we'll go ahead and remove Tableau pressing shift enter and then actually showing this list right underneath it
we can see we now have SQL Excel python Tableau is now removed now another function I find myself using all the time is the length function and that's denoted by len and then open parentheses and then inside of that you can put any object that has a length so you know before I put the string python in there and ran it and we could see that python is six characters long anyway I can do this also for lists so I can put in job skills running it shift enter we see we get three three items are in
there so speaking of three let's go over the indexes of a list so for this list here we have three items they are indexed as 0 1 and 2 in programming we don't start at the number one instead we start at the number zero why because we're better than everyone else but there's an important reason why we're actually doing this as well let's actually put this indexing to use so for this list of skills let's say I wanted maybe the second item in this list so I could say job skills and then from
there I'm going to use square brackets and inside of those square brackets I'm going to then provide the index number of it remember it goes 0 1 2 I want the second item so it's going to be uh no not two it's going to be one so putting that in running shift enter we can see the second item is Excel now you're probably like Luke why does it even matter to understand the index well because we know the index we can more precisely control where we place something in this list remember python lists are ordered in nature so
we can use this insert method to insert an object before an index so as expected for the arguments it takes the index and also the object we want to insert into the list conversely we can use this pop method which removes and returns the item at an index and it defaults to the last value so for these job skills let's say we wanted to put Tableau back in but this time in between Excel and python so I start by defining the list use insert and then when the helper
box comes up we have to provide the index first remember it's the index that you want to put it before so in this case 0 1 2 we want to put it before two and we want to put in Tableau running shift enter it's now inserted into the list but we want to see it so I'm going to go ahead and print it we see we have SQL Excel Tableau and python and now let's say we want to remove python from it we can use that pop method so I can define job skills and oh
my goodness I'm not typing right job skills and then the pop method and remember this takes the index but the default is the last one so I could put two in this case and then press shift enter oh I meant oh dang it all right well it doesn't matter so it popped index two in this case so we popped Tableau what do we have job skills shift enter okay we've got SQL Excel python and just to show a point if I wanted to remove python from that list I could just use pop with no index and then from there whenever I
run this it's going to pop out python and looking at the list job skills we can see we have only SQL and Excel now even fewer skills now these index values aren't only for accessing items one by one maybe we want to access multiple values and accessing multiple values is done by slicing now slicing has a very similar syntax to how we access an index except now we're going to be inserting this colon here we have multiple values of start stop and step so let's just start with that start and stop
first so if I have job skills and I want to get all items in the list I could say hey start at zero and then go until the end of this and as we know the indexes go to two for this running shift enter we can see from this we only have SQL and Excel and for this one thing to remember right the starting index is inclusive and the ending index is exclusive so it's not included in this so in this case actually if we want to include all items of the list we're going to
put three running shift enter we get SQL Excel and python now both of those values actually are the default values of start and stop so technically if I were to write this and have it include all the values of a list I'd want to more appropriately have it as this with only the colon in there it's going to have all the items of the list now similarly say I wanted only the first item in the list I could do this zero colon one run shift enter like I said any of those default values you're typically
not going to see so you're just going to leave them blank in this case and in this case we're going to get SQL now the last item to cover in this is step and that's the number of steps to take between items it defaults to one so if I were to run this on job skills leaving those default values with a colon include another colon and then adding the one whenever I run this shift enter we get all the items of the list now whenever I change this last value to two to do every other
one or every second one when I run this I'm only going to get SQL and python so it does every second one conversely if I change this to a three run shift enter we're only going to get the first item because there aren't enough items to actually step any further let's actually do this on a bigger list to actually show the effects of this so we have these two lists here this is Luke's skills and then also Kelly's skills and they're two separate lists now I can define a new list if I want
and call this all skills and I'm going to set it equal to my skills plus Kelly's skills then whenever I print this out we should have all those skills added together shift enter python BigQuery R python SQL Looker now one thing to notice about this is this list has repeating values in it and that's okay that's allowed with lists also it maintained the order in which we added those two lists together anyway getting back to our example if we wanted to see every second item in that list
we could call out all skills and then from there colon colon and then we'll put in two shift enter we can see every second item which is python R and then SQL now what happens if we want the last item in this list so as we recall we have all skills and if I wanted to do this I would have to count over so 0 1 2 3 4 5 in this case so I would put in here five and then a colon and I'll just leave the rest blank running shift enter we can see
we get Looker but that was actually a lot of work to access the last item in a list well conveniently these lists can also be indexed in reverse using negative values so we don't start at zero because zero denotes the first item in the list so we start at negative one and in this example Excel is at -1 because that's the last item and SQL is the second to last item so an actually easier way if I wanted to get the last item in this list is I would
just go in and put -1 colon shift enter and I'd get Looker conversely if I wanted to get the last two items I'd just put -2 in there and I get the last two items the last concept to cover is unpacking and this involves assigning each of the values of an iterable such as a list to a variable in one single line and we can go even further with it so back with our core list of job skills with python Excel and SQL if I wanted to assign each one of those values to its
own variable let's say skill one skill two and then skill three I can then assign them to that list and then I'm going to print out each one of those variables just to show what's inside of it calling a print statement on each one of those running shift enter we can see we have python Excel and SQL inside each one of those individual variables and we did this all in one line this is called unpacking now let's say I only cared about that first item in that list and I didn't really care about whatever's second or
maybe there's even going to be a fourth value inside of here such as Looker so I'll run this so we have even more values so I'll define that variable for the skill that I'm concerned with so skill concerned but then for the other ones I want to assign them to maybe a variable of skill don't care this isn't necessarily appropriate syntax anyway skill don't care I want to assign all these values to this and I want to set it equal to that job skills but remember right
if I were to run this right now I'm going to get a ValueError there's too many values to unpack basically there's four items in this list and I'm only defining two variables here anyway what I can do now is actually use an unpack operator which is just an asterisk it's also called a star operator and what's going to happen here is this unpacks an iterable into here so in this case with skill concerned we put the first item into here and then all the rest of those skills because we have this unpack operator in
front of it are going to go into there so printing this out I have skill concerned and skill don't care great names I know running shift enter we have python in that first one and then all those other values in a list inside this other one now the unpack operator is an advanced concept but it does come up from time to time especially when you need to dive in and understand functions which we're going to get to more in an upcoming section so you need to be aware of it but you don't need to be a master
of it so going back up to our fancy-dancy table that I made we've gone over lists already right so we understand it's a collection of ordered items we understand that it's a sequence it's ordered it uses indexes and duplicate values are allowed and then this last one here says mutability it's mutable what does that mean well using ChatGPT to define this in python objects are either mutable or immutable based on whether they can be altered after creation so mutable objects can be changed after creation examples include lists dictionaries and sets things that
can't be are things like strings integers and tuples we proved that lists are mutable whenever we appended python onto our list and also when we removed Tableau from our list because we made a change to the list therefore it is mutable it's a common word that comes up from time to time for python programmers so you need to be aware of it and now you may be wondering when would you need an immutable object well we'll get to that in an upcoming section all right for those that purchased the course problems you have some problems to get
to start diving in deeper to understand how to alter lists and some fun problems all right with that see you in the next one now time to get into a more advanced container data type moving on from lists now into dictionaries now this data type is very similar to the physical form of a dictionary I don't actually have a book of a dictionary but I do have this app here on my phone that I use from time to time to actually look up words and in dictionaries the definition is stored based on the term itself
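Carried over to code, that term-to-definition idea could be sketched roughly as follows. This is a minimal illustration only: the variable name `definitions` and its entries are invented here and are not taken from the course notebook.

```python
# Hypothetical entries, invented for illustration (not from the course files).
definitions = {
    "python": "a large non-venomous snake",
    "tuple": "an ordered, fixed group of items",
}

# Look a definition up by its term using square brackets.
python_definition = definitions["python"]

# .get() returns None instead of raising KeyError when the term is missing.
missing = definitions.get("notebook")

print(python_definition)
print(missing)
```

The term plays the role of the key and the definition plays the role of the value, which is the same shape the job-skills dictionaries take later in this section.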
so in this case the definition is what's stored underneath the term python and dictionaries in python are very similar in this case the dictionary is what's in between the curly brackets looking at just this first line of databases and Postgres databases is the key and then the colon denotes that everything following it is the value associated with that key from there we use a comma to separate it from the next key value pair in this we're going to be going over popular methods including this one for identifying the keys of a dictionary and you're probably going
to see some that are very familiar that resemble ones that we learned with lists and so why is this important well if we go back to that data science job postings data set that we're going to be using we not only have for each posting a list of skills but we also have a dictionary which has the type of skill for a key and then the value of it is a list of the associated skills for that type so in this first example we have types of skills of analyst tools databases and programming so let's create our
first dictionary we're going to do it by starting with curly brackets and then first we need to define what the key is in this case we're going to be storing skills and we're going to start with databases first so we'll put in there database this is the key from there we put in a colon to denote that now we want to put in the value in this case I'm going to put my favorite database based on the SQL course that I just made and I'm putting this also in a string both of
these are a strings we'll talk about data types in a second anyway running this pressing control enter I have database and postgress this key value can be almost any data type it has to be hashable meaning it has to have a unique value so you could do something like four in here but you couldn't do something like a list so I couldn't do data a list of database as the key it's going to tell me that it's unhashable anyway just cleaning this up now unlike Keys the values can be any number of data types so
I could have in this case a list here for the value running it we can see that it is I can even have other things in here too like an integer another string and we haven't even got to it yet but a tuple running this all shift enter oops I didn't put this in a list because I'm having multiple different values in there they have to be stored in something so I'll put them in the list anyway running this shift enter we can see all the different values because it's still a list are
inside of there so for now we're just going to stick with strings so we've defined one key and value if we want to define another one we're going to put a comma and then we'll define what we want to put in I want to do one for language and we're going to do as you probably guessed it my favorite one python running it here I'm going to press shift enter and we have all the values here now let's assign it to a variable and set it equal to this additionally we're going to add one
more key value pair to this and that's library and we'll be adding one that we'll be covering here soon pandas I'm getting this red syntax highlighting because I put a comma here instead of the correct colon anyway if you notice this is sort of hard to read and go through very quickly and see what the key value pairs are so because this is enclosed within curly brackets we can actually break this up into multiple lines and in this case it'd be common to start a new line right here tab it over same here we're going
to start a new line on language start a new line on library and then move this over now I'm getting some syntax highlighting because it looks like I actually deleted my curly bracket and now we've got to run it to actually see it so pressing shift enter we can see that we have this it's all on one line when it shows it to you here but anyway with this here we can actually see this better and actually investigate a dictionary so we're going to be defining dictionaries in this very similar format from here on
out but it's not necessary you could write it on one line you can just remove the indentation if you want you can do the indentation all over the place no matter what it's going to run because you have it within those curly brackets so what can we do with these dictionaries well like anything we're going to actually use the help function to actually investigate that dict data type so running this shift enter this is the help on class dict in module builtins the first we're going to look at is probably the most
common method used with this and it's actually a magic method with the double underscores and it's __getitem__ and if we look down here yes they have it written as x for the dictionary name and then y for the key but it's very similar to how we indexed lists we can do something like an index lookup on dictionaries so what does that actually mean well let's actually show it in some code so if I type out job type skills and then from there put a square bracket right after it and
then inside of there put something like database and running shift enter we're going to call the key of database and get the value of Postgres additionally they have the get method right here that returns the value for a key if the key is in the dictionary else it defaults to None but I'll be honest I don't use that method that much however ones I do use are this keys method and it provides a set-like object providing a view on D's keys where D is the dictionary we haven't covered sets yet but you're going to probably know
enough already with lists to understand what's going to come out of this and also the values method and it basically provides an object providing a view on D's values or the dictionary's values so starting out with job type skills and then adding the keys method using that opening and closing parentheses running this I get the keys of this conversely whenever we use the values method we're going to get the values from that dictionary Postgres python pandas if we want to remove something from this dictionary similar to what we did with lists we can use that
pop method and so we do pop opening parentheses we can see that for this you provide a key in order to pop out a value so let's pop out that library value running shift enter we can see we get pandas out of there and then whenever we run job type skills we only have now the database key and the language key now we can remove items from a dictionary but what if we want to add to it well this is where the update method comes into play so I define the dictionary use the update method and
then inside of here I'm going to provide it with a dictionary so let's say I want to add cloud technologies to this so I'll start it by defining cloud and then after that I'll add in what cloud provider I'm going to go with I prefer Google Cloud so in this case running this shift enter so we'll notice we got a ValueError dictionary update sequence element has length 17 2 is required anyway what's going on here is we forgot something right the colon so putting in a colon here right we've got to use a colon to
separate the key from the value running shift enter boom all right actually checking this to make sure it is there we have database language and now cloud there's actually an easier method that I would actually recommend for adding values so if you recall whenever I did that job type skills and then used the square brackets let's say we wanted to look at what cloud was now running shift enter we can see that it's Google Cloud we can use this same notation in order to define a new key and then also its value so let's
say I want to add the skill of version control I would define it here and then let's say I just ran it right now because we don't have a version control in there basically it says hey version control KeyError it's not there anyway we're going to create it now we'll set it equal to a value and we'll say inside of it we're going to put git there so running this again shift enter bam it should be in there let's actually check job type skills shift enter boom now I have database language cloud
and version control in it so going back up to that fancy-dancy table we've covered a lot about these dictionaries they're for key and value pairs and they're a mapping data type unlike lists dictionaries aren't accessed by position since python 3.7 they do keep the order items were inserted in but there's no positional index so don't try to rely on any order in it now dictionaries do allow duplicate values within it now because of the key itself you can't duplicate that but you can duplicate the values the same value could be in multiple different keys and we did show
the last point of it being mutable by the fact that we were able to pop out items and also update items in this dictionary it's mutable it's able to be changed all right now it's your turn to dive in and test out those practice problems also feel free to test out what I said about how values can be duplicated but keys cannot in order to clarify what the definition is and what the characteristics of a dictionary are all right with that see you in the next one now that we covered lists and dictionaries
it's mostly downhill from here for the rest of these container data types we're going to be moving now into sets sets are very similar to lists with the exception that instead of using square brackets you're using those curly brackets there's no colon so it doesn't become a dictionary as we're going to show sets are mutable so you can add and remove items from sets however sets are unique in the fact that if we were to convert this list into a set it's going to remove items that are duplicated like python in this case inside my
Jupyter notebook let's compare sets to those other data types sets are mainly designed for unique items and so they don't allow duplicate values however they're unordered and there's no indexing available with them so let's get into defining our first set we're going to do something similar to our list of skills that we did before we're going to define a variable of job skills and then define our set so I'm going to start by creating these curly brackets from there I'm going to put some skills in there now I'm putting only a
bunch of strings in here once again you can put other data types inside of a set as long as they're hashable and running shift enter we can see that when we print out job skills now we have all those values but notice something about it whenever I defined a list before all those different items stayed in that order now whenever I go and call that set of job skills they're not in the same order that they were in previously it's showing them in alphabetical order but don't always count on that so don't think that you can use sets in order to
alphabetize things now for lists whenever I wanted to find an item in a certain position I could just use a square bracket and then put in an index in this case we have four items so one is what I'll do in this case but for sets it's not going to work because set object is not subscriptable that's fancy talk for saying there's no index you can't do this now going to my favorite function help let's actually look at some common methods that I use on sets here we are on the help on class set in
module builtins for it we have things like add where it adds an element to a set and like lists we also have that pop method where we remove and return an arbitrary set element so a lot of the same things we were able to do with lists we're going to be able to do with these sets as well so if I wanted to do something like add a new skill such as Looker I can go ahead and press shift enter and then printing that out job skills run shift enter we can see now we have
Looker in there now let's test that feature of having only unique values let's add something to this that's already in this set specifically we'll add something simple like SQL running this pressing shift enter it runs but then whenever we actually look inside of here we're going to see that SQL is still only listed once now let's actually remove an element from this set let's use that pop method that we used before remember whenever we did it we did job skills then open parentheses let's say I want to remove something like Tableau if I run this
oh I don't have the correct format here I've got to put pop okay running this shift enter I'm actually going to get a TypeError set.pop takes no arguments so unlike lists where we could provide an index value to remove something we can't do the same thing here with sets because well they're unordered so we don't have an index so if I actually ran this job skills pop method it's going to pick an arbitrary value from this and remove it in this case it picks statistics just arbitrarily and then if
I were to run job skills pressing shift enter we can see that now we have Looker python SQL Tableau and no longer statistics in it so if I wanted to remove a specific value from this I'd actually have to use the remove method and then specify something like Tableau in this case pressing shift enter and then it did it now inspecting that set we see we have Looker python SQL you're probably like Luke what the heck are we using these sets for then well the primary use case that I find for them is being
able to extract out unique values whether it's from something like a list or tuple and being able to do that very efficiently so let's say I have this list of skills and inside of it you can see that I have multiple values repeated what I can do now is I can actually convert that skill list to a set so running shift enter we can see that it removes the duplicate values from it and then if I wanted to actually convert it back to a list I could just wrap it again in that function
list all of this within the function of list running this pressing shift enter it's now back to a list and it does it really dang quick all right now it's your turn to give it a try and give sets a little run for their money seeing how they're used in order to make sure that we have unique values within some sort of container that we need to keep a list of values in all right with that see you in the next one so we're moving now into the last container data type that you really
need to be aware of and this one's going to have a lot of similarities to lists as it's a sequence data type however there's one big difference tuples are defined by parentheses and then you put items in here here we have strings but it really can be any item inside of here now unlike lists however if I want to actually modify a tuple I'm not able to do things like remove or even add or append items to this tuple from the fancy-dancy chart that I made we can see that tuples are for fixed data
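A minimal sketch of that contrast between tuples and lists, assuming made-up skill values (the names `luke_skills` and `job_skills` echo the transcript, but the contents here are illustrative only):

```python
# A tuple is written with parentheses, a list with square brackets.
luke_skills = ("python", "sql", "tableau")   # illustrative values
job_skills = ["python", "sql", "tableau"]

# Both are sequences, so indexing and slicing work the same way.
first_skill = luke_skills[0]
first_two = luke_skills[:2]

# A list can be changed in place.
job_skills.append("excel")

# A tuple cannot: it has no append method, and item assignment fails.
has_append = hasattr(luke_skills, "append")
try:
    luke_skills[0] = "r"
    assignment_error = None
except TypeError as err:
    assignment_error = type(err).__name__

print(first_skill, first_two, job_skills, has_append, assignment_error)
```

The only methods a tuple offers are read-only ones such as `count` and `index`, which is why it suits data that should stay fixed.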
so if we needed to have a fixed collection of items that we don't want to be able to alter we'd put that in a tuple now what is the applicability of this well to be honest it's more applicable in software engineering where say we have a fixed collection of items that we don't want to change and they're going to use that a lot in software engineering in data analytics we're going to see this less but it does have its applicability so let's say I had a fixed set of skills and I'm not planning on learning
any new skills anytime soon anyway I can set this variable of Luke skills equal to and create this tuple for items that I don't want to change inside of it in it I defined a series of skills and then I want to print it out below this so pressing shift enter I can see that I made a spelling mistake whoops-a-daisy run this again shift enter I have all those different items from our tuple right here now tuples are indexable so you can call things by their index like if I wanted to actually define
Luke skills here and then look at the first item in that tuple oh what is going on here I need square brackets and then zero running this press shift enter we access that python similarly I can do like we did before with slicing I can access the first two items in that tuple by using the same slicing method that we used with lists and just to prove that these tuples are immutable if I were to take Luke's skills and then try to maybe append on a new skill of R
running this I would get an AttributeError tuple object has no attribute append so what attributes does it have well using my favorite function help and looking at tuple this is the help on class tuple in module builtins now from this we can see there's a lot of different magic methods with it but not a lot of just common or basic methods built in for what you could actually do with tuples so we're not going to be doing a lot with this now tuples are immutable but let's say we do need to actually
update maybe the items in here how can we do this with operators let's say I have these skills still of Luke's and then I have these new skills because I finally decided to get back to learning I could technically write Luke skills and then add this to new skills running shift enter I keep on putting too many s's in here running shift enter we see we get this new tuple back right here and investigating further underneath the help for tuple we can see that they do have this __add__ functionality for tuples you're
probably like Luke I thought it was immutable you couldn't change it well you're actually creating a completely new object whenever you modify this so I could set this equal to a new variable of skills and then I want to print this out below this I'll print skills doing shift enter it's going to do the same conversely I could do Luke's skills here and then we're going to do that plus and assign operator and since it already includes Luke's skills here I'm going to take that out running this now we've actually modified this Luke skills
but this tuple that was once assigned here is not the same one here and now you're probably like Luke how do you know that this is a new object well actually if you want to investigate the object itself there's this fancy function called id and it returns a unique value that identifies an object so in this case I'm going to run id on Luke skills and see what it is and it's this number right here but let's look at what the id is right here for this Luke skills
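The identity check being set up here can be sketched as follows, with a list included for contrast. The values are illustrative, and the extra names `old_tuple` and `old_list` are additions made for this sketch: they keep the original objects alive so the comparison is reliable.

```python
luke_skills = ("python", "sql")      # illustrative values
new_skills = ("r", "looker")

# + builds a brand-new tuple and leaves both originals untouched.
skills = luke_skills + new_skills

# += on a tuple rebinds the name to a new object.
old_tuple = luke_skills              # keep a reference to the original
luke_skills += new_skills
tuple_is_same_object = luke_skills is old_tuple

# += on a list extends the existing object in place.
luke_list = ["python", "sql"]
old_list = luke_list
luke_list += ["r", "looker"]
list_is_same_object = luke_list is old_list

print(id(old_tuple), id(luke_skills), tuple_is_same_object)
print(id(old_list), id(luke_list), list_is_same_object)
```

Comparing `id()` values is only meaningful while both objects still exist, since Python may reuse an id once an object has been discarded; holding a reference to the original sidesteps that.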
I'm going to use a print function here to actually be able to see this and then put inside of it the id of Luke's skills pressing shift enter we can see that the ids do not match conversely I'm going to change this up and actually show you what this would be with lists where lists are actually mutable objects so I've changed Luke skills and also the new skills to lists I'm adding to it here and I'm now printing out these ids for both these lists running shift enter the ids the unique values of this actual
list are the same because it's the same object whereas with this one the tuple it's getting recreated or redefined whenever we're doing this plus assign operator anyway that's going too much into theory let's move on so we just wrapped up covering the four most common container data types lists dictionaries sets and also tuples but I don't want you to think that these are the only four container data types now in an upcoming section on loops we're going to be using the range function what is the range function so I can
put range and then you put a value inside of here and if we look at how it's defined let's actually just look at the first one range with the stop position so let's actually look at it so if I do range five and then press shift enter it returns back this range(0, 5) what the heck is actually going on with this range function well let's actually convert it into a tuple to see what's going on inside of here and running tuple of range of five running shift enter we can see that it's actually values
0 1 2 3 4 it doesn't include five because the stop value is exclusive I can also do things like provide the start value so in this case I want it to start at two running this again I get 2 3 4 so this is really convenient if we need to iterate through a certain set of numbers say we want to go from one to 100 so in this case we'll do 1 and 100 and let's say we only want the odd numbers so we'll do every two running shift enter we can see
we get all the different odd numbers from 1 to 99 this last argument right here as you're probably guessing by now is the step value or how much we increment by between the start and the stop anyway that was a little bit of a tangent getting into the fact that there's other container types besides just these four here but these are some of the most popular that you're going to be seeing used in data analytics so now it's your turn to give it a try and test out these tuples a lot of these concepts are
going to be very similar to what you already learned previously with lists dictionaries and also sets with that see you in the next one quick check in before we dive into the section on operators we're about halfway through with covering the basics of just python itself core python once we get through these core Basics then we're going to be able to move into things like pandas and matplotlib which are going to allow us to actually dive into that data set that I've been teasing this entire time and actually explore it so I bring
that up to let you know there's a light at the end of the tunnel we're almost halfway through the core functionality of python you should be really proud of the work that you've put in so far in learning this I promise you it's not going to go to waste for this section we'll be covering four different operators and we're going to be starting with logical operators first which are denoted by keywords like and or and not so what's the purpose of the operators well they can be applied in multiple different ways so in
our data set we have columns on whether a job is classified as work from home whether it has a mention of a degree and whether it has health insurance or not each of these is classified as either true or false for meeting a condition so let's dive into putting these operators to work so we're going to start by defining a variable of job work from home and set it to the value of true we'll also create a variable for job health insurance and set it also to true so let's say we want a job that is work
from home and also has health insurance how could we actually check this well we can use that and operator so I do job work from home and job health insurance and what we're doing in this case is checking whether both values are true and if they are it returns true now I'm looking for a job that is requiring both of these situations to be true so if I set health insurance to maybe false running this now it's going to return false anytime either one of these is not true it's going to be false and
we're not limited to just comparing two values we can compare any number of values We'll add an additional variable of job no degree mentioned so whether a uh degree is required or not for the job itself we'll set this to true and then from there I'll just add it on with an and statement adding in that job no degree mention okay running this still we're going to be checking whether all of these conditions are true we're using an and statement for each so as you'd expect we're going to get false unless we have every single
one of these is true we're not going to get a true statement back in this case we will now let's move on to that or operator for the time being we're going to get rid of this job no degree mention for a minute let's say we only want to meet one condition or the other we want to either work from home or we want health insurance doesn't matter which one it is but we just at least want to have one of these to be true in that case we can use an or statement and if either
of these are true as we're going to see it's going to return true however if we change one of these to just false we're going to get true still because we meet one of these conditions only in the situation where both of these are actually false are we going to get a false return adding back in that job no degree mention and then removing this right here this one's set to true we can once again like with and chain multiple conditions the last keyword to cover in this is not and not just goes through and
flips the true or false value so in this case I'm putting not true it's going to return false so in the case of here if I want to do job work from home job health insurance and then for some reason I said oh I care about the opposite of whether a degree is mentioned I put a not statement in here now I'm going to change that to false these logical operators are going to be used in a lot of different functions and a lot of different operations specifically we've already covered
conditional statements and so using like the if keyword I can put our different values in here that we're evaluating for the if statement put a colon and then from there I'd print um meets conditions so running this do we meet the conditions of this running shift enter no we don't meet the conditions of it the next ones to cover are membership operators they're checking whether an object is inside of another object so jumping into an example of this I could check whether something like data is inside of the term data nerd running this I would
expect it to return true because it's inside of there now this is case sensitive so if I was looking for a lowercase data inside of here and running this it would return false I do see it from time to time used with strings but I see it more often used with things like lists and when I think of that I think of our job skills list so here I have this one defined with only three skills but imagine we have a list with hundreds of skills in it and we need to check whether it contains a value
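A short sketch of these membership checks, covering both the string case and the job skills list case. The exact phrase and the three skills are assumptions for illustration, not the course's actual values.

```python
# Illustrative values, assumed for this sketch.
phrase = "Data Nerd"
job_skills = ["Python", "SQL", "Excel"]

has_data = "Data" in phrase             # substring check, matches the capital D
has_lowercase_data = "data" in phrase   # case sensitive, so no match
has_python = "Python" in job_skills     # list check compares whole elements
missing_python = "Python" not in job_skills

print(has_data, has_lowercase_data, has_python, missing_python)
```

On a string, in looks for a substring, while on a list it looks for an element equal to the value, so checking for "Py" in job_skills would come back False here.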
well I can say what I want to check for so in this case python in job skills and then check it conversely if I want to verify a skill is not in a list I would just use that keyword of not and in this case it returns false this is one thing I love about python it's so readable you can just read it right here python not in job skills moving on to Identity operators these are used to basically identify whether something is the very same object as another item now this commonly gets confused with the equality comparison operator say
in the case of here I wanted to compare the values of salary for Kelly to the salary of me in this case we can look at it and see hey they're both equal running this shift enter we get true those new to python may confuse this with the is operator salary Kelly is salary Luke so running this we're actually going to get false even though these values are equal so why is this well the technical reason is the is operator is that sounds like improper English the is operator is verifying if these two variables are using
the same memory location for the value that they're holding so if you remember I used that ID function before to return the identification number or the memory location of a certain variable so I can do here ID of salary Kelly running shift enter and then I'll also do that for salary Luke running shift enter as well we can see that these IDs are not equal I can even actually this is a pretty good use case of doing this um I can take this and I can check whether this is equal to it are these values equal um
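A minimal sketch of equality versus identity. It uses lists rather than the salary numbers, because whether two equal integers share one object can depend on how the interpreter stores them, so the number version may not reproduce everywhere. The name copied_skills is a hypothetical added for illustration.

```python
core_skills = ["Python", "R", "SQL"]
luke_skills = core_skills            # second name for the same list object
copied_skills = list(core_skills)    # hypothetical: new list, equal contents

same_values = copied_skills == core_skills   # == compares contents
same_object = copied_skills is core_skills   # is compares identity
alias_same_object = luke_skills is core_skills

print(same_values, same_object, alias_same_object)
print(id(core_skills) == id(luke_skills), id(core_skills) == id(copied_skills))
```

In this sketch is agrees with comparing the two id values, which is the check being run by hand in the walkthrough.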
in this case no it's false so when would using this identity operator in data analytics be relevant well to be honest it's not really going to be relevant but I want to make sure you actually understand it because it does show up from time to time a potential case of this is say we had a list of core skills of python R SQL and then we said hey these are Luke's skills they're assigned to core skills and then running this shift enter we assigned one to the other in this case we can use that is operator
in order to verify if it is in fact the same item and it turns out it is the same item they both have the same ID so therefore that's true the last operators we're going to be going over are bitwise operators and when I say go over we're really just going to briefly review them so that way you're aware of them these operators work on the binary value of a number so if I use this bin function to look at the binary value of something like 42 running shift enter I can see
that that's basically what the computer is reading 42 as these operators are working to compare the bits of these binary values one by one and then returning certain values depending on them we're not going to go into it much mainly I want you to be aware of it because this and symbol this or symbol and then also this not symbol are actually going to get reused again basically they're going to get overridden whenever we get into pandas so I want you to be aware that these symbols are available so although we covered four different
operators in here and now we've covered a total of seven for this section right here I want you mainly just focusing on those logical operators and also membership operators those are the most important ones that there are when it comes to actually analyzing data all right so you have some practice problems around those and with that I'll see you in the next one don't repeat yourself or DRY is a core principle in programming and Loops are one of the ways we can go about ensuring we're not repeating ourselves it not only looks bad
but it's very inefficient to have repetitive lines of code one after the other and for loops and while Loops which are what we're going to go over in this video help solve that so let's jump into a quick example of how powerful these types of Loops are don't worry if you're not following along we're going to dive into it deeper later the main purpose of this is I want you to just understand the power of this so I have this list that I've already defined and imported from the data set on data science job postings
and it's a list of all the job titles from the job postings in the data set we're going to be working with you can see things like here we have data engineer senior data engineer do we have any data analyst what's going on here okay data analyst financial and Regulatory reporting so lot of different jobs in here if I actually wanted to see the length of this list of how many jobs we're going to be going through um scrolling oh my goodness what's going on here so if I actually wanted to see the length of
these jobs going on here I can see that there are 787,000 different jobs in here now you can imagine if I needed to go through this list and maybe identify like I was trying to do how many data analyst jobs are inside of there this would be very cumbersome to do one by one but this is where for loops and while Loops can help us so let's say I want to get a collection of all the different data analyst jobs within this list and I want to put it into its own list I'd first
start off by defining an empty list called analyst list and then from there I can use a for Loop to iterate through the job list that we have don't worry we're going to go over syntax in a little bit but you can pretty much read this right for job in job list I want to evaluate if it contains data analyst if data analyst is in that job itself that we're going to be iterating through so for each one of those items in the list we'll be iterating through checking to make sure that data analyst is
in it if so we want to do something what do we want to do we want to add it to that analyst list so we'll do analyst list dot append and we'll append that job so running this pressing shift enter we go through it you can see it was done in well under a second we went through 700,000 jobs anyway let's actually look at that analyst list printing it out pressing shift enter we can see all these things inside of here have data analyst within the job title and if we wanted to see
the length of it I could just run length of analyst list to see how many out of that 787,000 are there so 163,000 out of that are data analyst jobs so you can imagine what we're going to be able to do with these for Loops let's actually dive into it further and break this down using a simple example we have this list called numbers so we're going to start this for loop with the keyword for from there we're going to define a new variable for each of those um in this case numbers we're going to
cycle through so we're going to write number this is a new variable that we'll use within this for statement and from there we're going to use the keyword in for number in numbers and numbers is your iterable that we're going to be iterating through we'll jump into more about iterables and iterators here in a second after we get through this so for number in numbers we want to do something we're going to do something simple right now we're just going to print the number pressing shift enter to actually see it goes through number
by number and prints it out right below so getting back to this iterable of here we have numbers and it's a list what data types are iterable well everything we previously covered on dictionaries lists sets and tuples which is conveniently why we covered them before this is iterable so you can use any of them in this case additionally other data types are iterable as well such as a string let's say we had a variable called characters and we set it to the value of python similar to before I can define a for Loop saying for character in
characters print character running shift enter we're going to get each letter of python printed out this variable right here can be anything that you define I mean I can make it as simple as an underscore and then also put this underscore right here running shift enter it's going to give me the same results so let's move into a real world example with using for Loops let's say I have a dictionary of job applicants for a job and I'm looking to hire somebody with more than 5 years of experience so this only has four values in it right
now but really I mean I'm sure any type of job application process is going to have way more than this fun little fact with this you can generate any type of data set like the ones I'm using for this course even more quickly by using something like ChatGPT let me show you so inside your favorite chatbot prompt it with something like this generate a python dictionary titled years experience with name as the key and years as the value use random values for each make it 20 items long and once it generates the code in this case it wants you
to run the code to actually generate this dictionary I was hoping to give it to you but nonetheless we can still do it I can just paste the code right in pressing play seeing we have years experience and this has even more values than we had before we're not going to use this one but this is just something to be aware of we're going to use the same years experience that I just generated with only four values so I like to start simple whenever we're building something like this I'm just going to start by saying
for and I'm going to give it a variable of person in years experience and then we end that with a colon remember pressing enter making sure that it's indented in I just want to print out person now remember when we're iterating through each one of these items in this dictionary what's it going to print is it going to print the key or the value is it going to print both well let's find out it prints just the keys so how do we get the values out well we're going to use my favorite function for this help and go into dict
to investigate how we could potentially get this out inside of this help output on the class dict we can see they have this method right here called values it's called on the dictionary D and it returns an object providing a view on D's values so we can use this for the values so for this I'm going to add the method to years experience with dot values and I know I'm getting the values now out of this so instead of doing person I'm actually going to title it what it is it is value and then I'll
put it inside of here because we're no longer using person running this again shift enter we get the values out of it okay so we learned how to get the keys and we learned how to get the values how do we get both well the next method you need to be aware of is items and dictionary dot items provides a set-like object providing a view on D's items or the dictionary's items so in this case each item is going to be basically key followed by a comma and then value to have that pair of the key and
value so I'm going to go up here and modify this further and I'm going to change this method now to items and I'll change that first variable to key comma value so it's going to say for key value which is the Tuple like object in years experience items I want to print key value shift enter we got Luke three um all the way through you get the point now remember the point of this we're trying to find an applicant that has more than five years of experience so let's modify this further we're going
to be using an if statement to basically now check if a value is greater than five if so what do we want to do with this well we want to alert whoever is using this that this key has whatever number of years of experience so running this pressing shift enter we can see that Kelly has 6 years experience there's no more in here so only Kelly's going to appear right now so to wrap this up on for Loops we have to use some sort of iterable to iterate through to check all
the different values if you remember from that last video we did we went over that range function I just want to cover this real quick because you may see this and we converted this from here into a tuple to actually see the different values inside of it so range is commonly used within for Loops to actually iterate over a certain number of things like say I need to do something a 100 times so I could say for blank in range of 100 I want to print that blank out so shift enter and I have a
syntax error can't forget the colon and it prints out all the different numbers so you'll commonly see this type of syntax specifically this range function being used and so I just wanted to make you aware of it next up are while loops and while Loops are checking for a condition similar to what if statements do in checking conditions on whether it's true or false and while it remains true the code underneath it is going to continue to execute so it follows the syntax of while and then the condition in this case I'm going to
just throw in true and then a colon and then underneath it what we want to do in this case we want to maybe just print uh data nerd okay I don't recommend running this on your computer but I'm going to go ahead and run it and it's going to continue to run and run and run because it's always meeting this while condition I'm going to go ahead and press stop right here hopefully it stops okay good it stopped now if I change this to while false it's never going to meet this condition underneath here so
running shift enter and scrolling to it we can see that we have while false print data nerd and nothing's printed out underneath it because it never meets the condition to do that now I'm going to give a caveat with this while Loops are much less common than for Loops for Loops are so powerful in what you can already do that I really don't find too much of a need for while Loops but those that have maybe coded in something like VBA or R may be familiar with something like this where you have something like a
count and we'll start this count at one and while the count is less than five we want to oh didn't get done typing that out um we want to print the count and for each pass that we're going through we're going to add one to the count so now running this what happens here is it prints 1 2 3 4 once it gets to five it doesn't meet that condition doesn't print it out you're probably like Luke why do I have to learn about this well there are some conditions
that you may run into where you need to do this so going back to that years experience dictionary that we have imagine this thing is super long and we need to go through it but we only care about returning the first instance of meeting a condition and then from there we can just terminate it we don't need to move on we want to just find one occurrence of it so in this case we want to find an applicant that has more than 5 years of experience but we just want to return the first applicant so I'll add
in Mary with seven years of experience so we're going to start with a while loop to check if the value inside this dictionary so we'll just go ahead and call it years is less than five we want it to stop whenever it gets greater than five so we don't want to continue on so already I see that this years um has a yellow squiggly line under it so I need to just go ahead and Define it right now I'm going to Define it and set an initial value of zero okay now I want to iterate
through all of these in this years experience if you remember dictionaries don't support positional indexing and so we can't use something like an index to iterate through it but we can actually convert this dictionary into a list and then it has an index for each of these key value pairs in order to iterate through so let me show you what I mean by this I'm going to actually I know I haven't finished this while statement right here um I'll say this is the years list and here we're going to Define that we want to create a list out
of years experience let's actually see what this looks like first uh I'm going to comment out this line right here pressing shift enter to see what years list actually looks like we can see it only has the keys in it so remember we need to put that items method in there to actually convert that into key value pairs then running this it's a list of key value tuples and if I wanted to access one of the items in there remember we can use those square brackets and then Define like hey I want
to see the last item in this list press shift enter I put the three in there and I'll get Alex I don't know why I said last not the last item anyway getting rid of this let's actually get back into programming that while loop so while years are less than five remember we started out with it being set to zero we converted our years into a list we want to get the key and value from this years list remember we have to give it an index now once again when I get done with this this index
is not defined yet we also want to Define this as zero just for troubleshooting sake I'm going to go ahead and print out each one of those key value pairs now the other thing to do is right we have that index we want to be iterating through each item in that list so we need to add to it like we did before with the count by using that uh plus equals sign and do it by increments of one so let's go ahead and do this run this now all right so silly me I
said while years less than five right we don't have a value in here for years technically this value we need to actually use right here should be the years itself and we'll change this also to years so let's actually run this again okay so in this scenario we printed out Luke and then Kelly and then as you can see goes through each one of these it prints through each one of these so now we're getting close to what we want I don't necessarily care about printing out this Luke at three years remember we just want
to see Kelly and that's it so now we're going to include an if statement here so if years are greater than or equal to five we want it in that case to print out so run this again we can see now we're only getting Kelly and then we'll just add a little message here of key has so many years of experience running this we now see Kelly has six years of experience so as you can tell from that that for Loop was a lot less involved than that while loop and luckily for you you're
probably going to experience more for Loops than you do while Loops in data analytics when using python so that's one thing good to know about this all right so now it's your turn to give it a try testing out those practice problems for different for and while Loops while building on as you've seen here all those different things we've learned previously on lists sets tuples and dictionaries in order to combine this knowledge really building up and you're going to be surprised at the end of this course how much you actually do know all right with
that I'll see you in the next one as we move forward in this video we're going to start combining a lot of the things we're learning specifically here we have list comprehension which combines basically lists and also Loops so what is this list comprehension well we're going to show real quick with code we're going to break it down even further as we go throughout this but I just want to show an example real quick in this the list comprehension or what's going on is inside of the brackets here which is a list and what
we're doing is we're generating numbers from 0 up to 100 and if I go ahead and print out this variable underneath it pressing shift enter we can see that it has a list of values from 0 to 100 or really 99 so let's actually think through what we'd have to do if we didn't know about this list comprehension and we just needed to generate it with the knowledge that we had already the first thing we'd have to do is Define a blank list anytime you want to Define anything blank a dictionary whatever or even list in
this case you just put the brackets and you don't put anything in there then from there we need to iterate through a loop and put it into this blank list so I would use for number in we'll say range 100 because we want to do the same thing we want to add that number to the numbers list so we'd use numbers dot append and inside of here put number underneath this I want to print out now what we did here so print out that numbers list we've done the exact same thing that we did before from
0 to 99 and so look at this it was a lot more succinct when we did this previously in only one line of code compared to this right here where it took multiple lines of code and multiple basically unnecessary steps that we can compress if you will um in making a list so let's go over the syntax of this so walking through this step by step we're going to start by defining the list comprehension which we'll actually be doing with these square brackets and the first thing in here is the variable that we
want to put into this list so in this case we're going to just call it X we haven't defined it yet but we are going to after this variable now we need to use a for Loop if you will to now iterate through um what we want to so in our case we wanted to generate the numbers from uh 0 to 99 so we'd use for x in range 100 so with this syntax we're basically going through and iterating through this iterable of range 100 each item that it gets it's
assigning to this x at the front right here and that's getting added to the list so running this and actually printing it out below I'm going to say numbers shift enter okay so everything's printed there now let's modify this to actually show what's going on here further let's say I wanted to do an operation to each one of those numbers as we're iterating through the 100 in this case let's say I wanted to multiply them times two so every time that I'm cycling through each one of these numbers it's multiplying the value times 2 adding
it to the list and then we have our final list and here we see we have this new list where basically everything is going by twos until we get to around 200 or 198 and we're not limited to just doing math operations I can even do things like maybe I want to convert X to a float um and doing this here now I have all these different items converted to a float moving on to the iterator itself of that range 100 we're not limited just using range 100 we can use any item that is iterable
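A minimal sketch pulling these steps together: the loop-and-append version, the equivalent comprehension, and the two variations that apply an operation to each item. The names numbers_loop, doubled, and as_floats are hypothetical, added so that each result can be kept side by side.

```python
# Loop version: start with an empty list and append each number.
numbers_loop = []
for number in range(100):
    numbers_loop.append(number)

# Comprehension version: same result in one line.
numbers = [x for x in range(100)]

# The expression at the front can transform each item.
doubled = [x * 2 for x in range(100)]
as_floats = [float(x) for x in range(100)]

print(numbers == numbers_loop)
print(doubled[:3], doubled[-1])
print(as_floats[:3])
```

The general shape is expression, then the for clause, all inside square brackets, so whatever is written at the front is what ends up in the new list.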
we talked about how lists sets tuples dictionaries and uh things like this range function are iterable and even things like strings so if I modify this removing that float I can put inside of here a string to iterate through in this case I'm going to put in Python running I know it's called numbers but we're not going to change that running through that we have the letters of python inside of a list and although this section's called list comprehension we're not necessarily limited to just doing list comprehension we can do set and
dictionary all you have to do is actually change what's going on with the brackets themselves so instead of using square brackets I'll use curly brackets in this case running shift enter we get well you can't read it as python anymore uh because now we're using a set but it has the letters of python in there you're probably like Luke what's the purpose of this list comprehension what's a real world example well if you recall back to the last video where we actually went through and analyzed that job list of all 700,000 different jobs for whether they
contain data analyst and then put that back into a list we can basically do the same thing with list comprehension I know we have this added if statement in it we'll get to that in a second um but we can do this with list comprehension so I'm going to start up by defining that list of an analyst list and then from there starting it by opening the brackets and we're going to Define a variable we're going to be iterating through that job list so I'm going to name the variable job and so we're going to
cycle through that by saying for job in job list all right so we're going to stop right there and just print out where we're at I'm going to print out analyst list and it has a whole bunch of values I'm actually curious to see what is the length of this list itself to make sure that we have all the values there still okay so we still have around 787,000 things in there which if we actually look at the length of job list running shift enter we can see that they do match and I actually
modified this further to make it more readable and now we can compare the original list is this many and the new list is that many jobs anyway let's continue on with our list comprehension so we've already defined that we can do the for here now let's incorporate this if statement in there and as you're probably guessing you can just put this right after the for Loop so in this case if and then from there I'm going to Define the condition for now we're just going to put true in there to show yeah I'm going to run it so we're going
to iterate through each item in the list from there it's going to check if it's true if we meet this condition for that item and if it is true it's going to put it in so whenever I run this next one okay I still have the same amount of jobs in there now if I change this to false nothing should get put into there so I'm going to run it run this one there's now zero jobs in it so we're going to change that if condition to be what we were trying to use before of
data analyst in job and now running it running this one and then this other one we can see like before the original list has around 700,000 and the new list has around 163,000 and when I go to actually print out this list to inspect it we can see all these different jobs have data analyst in the title so this is a pretty awesome feature where we were able to condense down what was here four lines of code not counting this space right here to one single line of code and I also feel it's a heck of a
lot more readable to actually go through and understand what we're actually doing here in generating this new list all right that's enough of me yapping now it's time for you to actually dive in and test out these different list comprehension methods with some practice problems that we've got for this with that I'll see you in the next one so we've been going pretty hard at learning all these different fundamentals about Python and I want to go through an exercise combining all these different things that we've learned so far in order to Showcase an example of how in a real
world scenario where we'd want to actually use the skills that we're learning. This exercise is pretty simple: we want to filter which jobs we're going to apply for based on a list of skills. We're going to assume that we have these skills, my skills, of Python, SQL, and Excel. And what roles are we going to be filtering from? Well, inside of our repo we can go into the exercise itself, where I'm providing this list of dictionaries called job roles. I'm going to go ahead and Command+C, or copy, this and paste
it in here. We can see that yes, it is a list of dictionaries for different job roles, where we have a key for the job title itself and then a key for the skills, holding a list of skills. So in the case of this first one, the data analyst role requiring Python, SQL, and Excel, we would want to match. Now where it gets tricky is that we also need to match where roles have additional skills. So in the case of this business analyst role, we
have those skills of Python, SQL, and Excel, but it also includes Tableau and Power BI. This is pretty representative of what I would advise if you're applying for jobs: if you have some of the skills but not necessarily all of them, I would still apply for the role. Anyway, let's get into coding this. The thought process is that we're going to loop through this list of job roles and then, secondly, verify that everything in the my skills list is in fact inside the skills key of the dictionary. We also need to set a condition
for whether a job is qualified or not, but we'll get there in a second. So the first thing I'm going to do is loop through those job roles and print out each one, assigning a variable of job for each job role. We get displayed below all of those different dictionaries inside that list. So we did that first part, and now we need to access the skills inside of these dictionaries. When we have this job variable, it's basically a dictionary that's
getting printed out, so we can use that key of skills and print out that list of skills right here. Now you may be thinking we can loop through these job skills and check whether each one is inside my skills, but we run into errors with this: we wouldn't match on this business intelligence analyst, because we'd fail the condition when we hit Tableau or Power BI. What we want to do instead is cycle through my skills, so we'll define for skill
in my skills. And just checking on this as we go, I'm going to print out that skill right above the job skills list. So now we have everything we need: we'll check SQL, Excel, Python, SQL, Excel, Python, and we'll check each of these lists as we go through. We'll now establish a conditional statement to see if the skill is inside of this list by saying if skill in job skills. I'm going to indent this over, and I'm going to continue to print that skill and job skills for
demonstration purposes; we'll delete this in a little bit. We can see it only prints out Python, SQL, Excel whenever the lists themselves contain those values from the my skills list. But we're checking each item iteratively, so in this case these top three things printed are just for the data analyst itself. So we need to establish a Boolean condition in order to keep track of when a certain job doesn't have all the skills necessary. Since this is specific to the job itself, I'm going to define this new variable of qualified up
here and set it equal to True. Whenever I get to a new job, that condition gets reset back to True: basically, innocent until proven guilty. Then once it's proven guilty, or not true, we'll end the loop. So we need something that will change qualified to False. But if I was to run this as is, while also printing qualified out, we'd see it set to False even though Python's in there. So we need to do basically the opposite, right? So: if skill not in job skills.
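The loop as described up to this point can be sketched roughly as below. This is a minimal sketch under assumptions: the key names `role` and `skills`, the "Data Scientist" title, and the entries themselves are illustrative stand-ins based on the roles mentioned, not the exact list from the course repo. The `qualification` dictionary is a hypothetical helper added only to inspect the results.

```python
# Illustrative data: key names and entries are assumptions, not the repo's exact list.
my_skills = ["Python", "SQL", "Excel"]

job_roles = [
    {"role": "Data Analyst", "skills": ["Python", "SQL", "Excel"]},
    {"role": "Data Scientist", "skills": ["Python", "R", "Machine Learning", "Deep Learning"]},
    {"role": "Business Intelligence Analyst", "skills": ["Python", "SQL", "Excel", "Tableau", "Power BI"]},
]

qualification = {}  # hypothetical helper dict, used here only to inspect results

for job in job_roles:
    qualified = True  # reset for every job: innocent until proven guilty
    for skill in my_skills:
        if skill not in job["skills"]:
            qualified = False  # one missing skill is enough to disqualify
    qualification[job["role"]] = qualified

print(qualification)
```

Looping over `my_skills` rather than over each job's skills is what lets roles with extra requirements, such as Tableau or Power BI, still count as a match.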
Now running this, we can actually see that skill SQL is not in this list, and continuing on, Excel is not in this list, so we've met this condition. And as you can see from this first posting right here, which I can tell because the lists are the same, Python, R, machine learning, and deep learning, we don't necessarily need to go through SQL and then move on to Excel: we've already established that we haven't met the conditions, so we just need to break at that point. Running this now, there should be
no, yep, there are no more repeated lists in here. At this point we don't need to be printing this anymore, so I can go ahead and remove it. Now we want to see what jobs we're actually qualified for, so I can say if qualified, i.e. it's still True, print out that job. Running Control+Enter, we have the data analyst and also the business intelligence analyst. Now, job roles was provided to me as a list and my skills was provided to me as a list, so I want to get this
back into a list. Anytime I need to create a new list, I'm going to have to define a new empty list. In this case I'll define qualified roles equal to this blank list, which is just these empty square brackets, and I can prove it's a list by running the type function on those square brackets: we can see that it is a list, it's just empty. Anyway, you have to initialize the list if you want to run methods on it. Specifically, instead of printing out the jobs, we want to take that qualified roles list and
append that job to it, specifically the role from it. Running Control+Enter, and actually I should have printed this out below. So now looking at qualified roles, we get the data analyst and business intelligence analyst. Now technically you can write this all more succinctly. I'll rewrite this entire thing from up here down below: you could write it more succinctly if you use this all function, which uses a sort of list comprehension inside of it to compare the skills and make sure that everything in the my skills list
is inside of the job skills. But we're not going to go into more detail on this all function; instead you can do this for homework and evaluate how you could have implemented it. Now looking at this function, if I move these dang comments out of the way, we can see that with how short it is, it's a good candidate for list comprehension. This is what it looks like in list comprehension form, which is actually not too bad. You could technically do this, although I find myself, even as an experienced programmer, having issues
just writing out list comprehensions quickly and easily. I like to start with the functions themselves, and then if I need to condense them down, I can condense them down later. Basically, you're not right or wrong with either one. If it's easier for you to go through and write it out entirely like this, I'm perfectly fine with you doing that. So with that, let's jump back into the Python standard library. We only have about five more sections to cover for the standard library before we start moving to other libraries
like pandas and Matplotlib, where we'll actually be working with the data set. All right, with that, see you in the next one. [Music]

Now I'm all for automation. Just watch this: "Hey Google, turn blue lights on." I try to automate anything that I can in my life to make it easier, and that's exactly what functions do inside of Python: they streamline our code and make it easier for us to execute things with fewer lines. And we've previously seen the use case for functions working through
these problems. Let's say, for example, we wanted to see the length of this skill list right here. All we have to use is the len function and pass skill list into it. Running Shift+Enter, we can see it has three items in it. Imagine if every time I wanted to get the length of something like a dictionary or list, I had to write out these three lines of code right here and then even go as far as to print that count to see that the count is three. So I love
functions, and there are five different types that we're going to be covering: some that we've already covered, and some that we're going to be covering in the next few lessons. In this section we're going to dive deeper into understanding what built-in functions are available to you and also how to build your own functions. In the following lessons we're going to be jumping into lambda functions, or anonymous functions, then standard library functions, and then finally my favorite, using third-party libraries like pandas, NumPy, and Matplotlib. Now for any hardcore programmers watching
this, there are additional kinds of functions out there, like generator functions, async functions, and recursive functions, but those are out of scope for data analytics in general, so we're not going to focus on them; we're going to focus on these. We've seen a lot of built-in functions already: print, which prints out other things; type, which tells us the data type of a certain variable; len; and even the range function, where we did something like zero to five. Oh, and my favorite one: help.
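As a quick recap of the built-ins mentioned so far, here is a small sketch. The `skill_list` values are an assumed example, and the manual counting loop is only one plausible version of the "three lines of code" that len replaces.

```python
skill_list = ["Python", "SQL", "Excel"]  # example list, assumed for illustration

# Counting by hand: the kind of boilerplate that len() saves us from writing.
count = 0
for _ in skill_list:
    count += 1

print(count)              # manual count of the items
print(len(skill_list))    # same result from the built-in
print(type(skill_list))   # the data type of the variable
print(list(range(0, 5)))  # range stops before 5, so this covers 0 through 4
# help(len)               # uncomment to print the built-in documentation for len
```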
But are we limited to just these? Which ones are actually available? Well, if you want to, you could execute this line of code. We're not going to dive deep into it, but it basically looks at what's available in terms of built-in functions and returns their names in a list using list comprehension. We can look inside of here and see that it has things we've already used before, like len and print. But an even better way is probably to just go to the Python documentation
for built-in functions; they provide this convenient, dandy little chart of all the different functions, along with documentation explaining each of them. Now we're not going to dive a lot deeper into built-in functions, but I do want to focus on some math-based built-in functions that I use from time to time and that you should definitely be aware of. Let's say I had a list of data salaries and I wanted to find different characteristics of this list. If I want to find the minimum, all I'm going to use is the min function,
supplying it that data salaries list. Here I find that the lowest is 85,000. Conversely, I can use max to find the max value, or sum to add up all the values within this list and display the total down below. And my favorite: if I ever need something organized better, I can use the sorted function to sort these values into numerical order. So now let's shift focus and get into building our own, if you will, user-defined functions in order
to do something we need. We're going to start simple. You remember that in earlier sections we came up with a formula to calculate the total salary using the base salary and the bonus rate. Well, imagine if we have a bunch of different base salaries and a bunch of different rates that we need to feed into this. This is a perfect use for a function, so let's start defining it. Anytime you start a function,
you need to start it with the keyword def, which stands for define. From there we name the function, and you can call it whatever you want; we're going to call this one calculate salary. Then, and it's very important here, you have to put parentheses and then a colon. Now I want to make sure I wrote this right, so, similar to our loops and conditional statements, I can just put in the keyword pass and then run it to make sure that it executes
properly. It's working just fine. We need to go further, though, and define the body. For right now, just for demonstration purposes, I'm going to paste those variables inside of here to simplify this. So we have the base salary inside our function and we have the bonus rate, and we can tell they're inside the function because they're indented. From there I'm going to calculate the total salary and set it equal to our previous formula of the base salary times 1 plus the bonus rate. I'm going to remove this
pass because we don't need it anymore now that we have contents in there. Running this again to make sure it works, we get no errors. Good. So now that we've defined this function, let's actually call it and see what we get back. Just calling calculate salary and adding the parentheses, because parentheses after a function name make it a function call, I run it with Shift+Enter, and I didn't get anything. Why didn't I get anything? Well, I didn't return anything in the function itself, so it returned None, basically. I need to specify
what I want to return back from this. So I'll specify that here and say return total salary. Running Shift+Enter here and then Shift+Enter down here, now whenever I run this function we get the total salary as output, because that's what it returned. Now you're probably like, "Luke, this doesn't do me any good; I have to go into this function to input those variables." Okay, let's make this function actually usable by giving it arguments. I'm going to remove those variables from inside
of here and define the arguments inside the parentheses right here: we're going to pass it base salary and also bonus rate. Running this, I can see that the function definition works fine. Now let's run calculate salary again and see what we get. We get an error, and it tells us calculate salary is missing two required positional arguments, base salary and bonus rate. Because the function is built like this, we
have to provide the base salary and bonus rate arguments. So I'm going to put in base salary and bonus rate right here, and now running it, it passes them in and gives us the value. Now one technical note: the variable names I'm using right here are the same as the argument names, and that's not actually necessary. Say I want to just plug in values, 150,000 and a rate of 0.25 as the bonus rate. I can do that and it's still going to run. Or if I wanted to,
I can even change this one to salary one and this one to rate one, run the cell up here, and pass salary one and rate one inside of here. Running it, we're going to get that 110,000. I'm going to shift things around a little bit before we continue, just to make sure that we have it all organized. So now I have the function definition up above and the actual function call right below it, and it's executing. I just want to make sure you understood that
you don't need to have the variables above the function itself; that's not necessary at all. But the function does have to be defined before you call it. Now, building on this example further, what happens if we routinely have the same bonus rate and we don't want to slow down our coding by putting in that bonus rate every single time? Well, we can add an optional argument. Say I want to define that the typical bonus rate is 10%. You may be tempted to put something right inside
the function body to specify that we're only going to be using a bonus rate of 0.1, run it, and then remove the argument from the call. Well, if we did this, we'd get a missing positional argument error, because we didn't include that bonus rate. Here's what we do instead: we define what the bonus rate is right next to the argument itself, with an equal sign and then the default value we want assigned to it, which makes it optional. In this case I'm specifying it to
be 0.1, and I'm going to run this now. Below this, we're still calling calculate salary and we're only providing one argument, salary one. Whenever I run this, we get the value, and no matter what salary I put in there, it's going to work. And in a case where the rate is different, let's say 0.5 (I definitely want to find that job where it's 50%), I could put back in salary one and rate one, and whenever I run this we can see, yeah, 150,000, so 50% on top. It now takes
in rate one, and whenever we provide it, it overrides that optional argument's default of 0.1.

So that's the basics on built-in functions, which you're obviously familiar with, and also user-defined functions, which we have some practice problems on for you to dive deeper and test your skills. In the next few sections we'll be moving on to lambda, standard library, and third-party library functions, so I'm looking forward to that. With that, I'll see you in the next one.

So lambda functions, also known as anonymous functions, are very similar in how they operate
compared to user-defined functions and built-in functions, in that they calculate or perform some sort of operation. But before diving into this, let me be very clear: lambda functions are more of an advanced feature. If you don't find yourself using lambda functions and instead just use regular user-defined functions, there's nothing wrong with that. Lambda functions are just a great way to flex on your co-workers. So here we are in the types of functions we're covering, about halfway through with lambda functions; I promise it only gets easier from here. Lambda functions are written
by starting with lambda, and from there you provide an argument. In this case we're just going to name it x for the time being. It can be any number of arguments; we're just going to do one for right now. From there you put a colon and then the expression we want to evaluate on that x argument. We're going to keep it simple right now: we'll multiply x times 2. Pressing Shift+Enter, we see that it's
just a function object here, but we need to actually use it and see how it works. So I'll name this lambda function mul 2 and set it equal to that; really ingenious right there. Run this again, and now, because Python treats functions as first-class objects, I can assign this function to a variable name and then use it. So I can call mul 2, and in the parentheses that call this lambda function I can provide an argument, say three. Running Shift+Enter, we get six back. Now this is two lines
of code right here. Technically the point of a lambda function is to be able to write something within one line, so I could take the function, wrap it in parentheses, and then put another set of parentheses after it to make the function call and provide the argument of three. Running Shift+Enter, we get it. Now if I want to provide multiple arguments to this function, I can separate them with a comma. So in this case we're going to pass x and y to it, and then
we'll say x times 2 plus y, and in the parentheses we need to specify both values, so I'll add, well, let's do four. Running Shift+Enter, we get 10: 3 times 2 is 6, plus 4 is 10. So you're probably like, "Luke, how does this actually apply in the real world?" Well, let's go back to the previous example from our functions section, where we used this calculate salary function to compute total salary as base salary times 1 plus the bonus rate. Now let's
say we have a list of salary values we want to pass into this function, and in our case we're going to assume that the bonus rate is in fact 0.1 for all of them. So how could we, with what we know right now, implement this without using lambdas? Well, we can use list comprehension. I'm going to define this variable of total salary list and set it equal to the start of our list comprehension, and we'll define a variable for
cycling through each of the salaries in the salary list. I'm going to name it salary, so: for salary in salary list. Now you know I like to iterate on things and make sure my code's working just fine, so I'm going to print out what I have so far underneath here. Right now we're not doing anything; we're just cycling through this list and reprinting it. Okay, so all the values that we see above are still in there. Now for
list comprehension, in that first position I can provide a function to apply to each item, so in this case we can use calculate salary. I type in calculate salary and put salary into it. Now whenever I run this, it's going to pass each of those values into calculate salary and feed the results out. Bam, we have all of those through list comprehension. But right now, looking at only these lines of code, we're at like four lines of code. I'm going to make it even shorter.
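The function and the list comprehension described so far might look roughly like this. It is a sketch under assumptions: the snake_case names are inferred from how the variables are spoken, and the values in `salary_list` are made up for illustration, since the actual list is not shown here.

```python
def calculate_salary(base_salary, bonus_rate=0.1):
    """Return total salary: the base salary plus the bonus."""
    total_salary = base_salary * (1 + bonus_rate)
    return total_salary


# Illustrative values; the course's actual salary list is not shown here.
salary_list = [50000, 75000, 100000]

# Apply the function to every salary, relying on the default bonus rate of 0.1.
total_salary_list = [calculate_salary(salary) for salary in salary_list]
print(total_salary_list)

# Passing a second argument overrides the default rate.
print(calculate_salary(100000, 0.5))
```

Because these are floating-point calculations, printed results can carry tiny rounding artifacts; wrapping a value in round() is one way to tidy the display.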
And technically, running this again just to make sure it works, we can shorten this down and remove that total salary variable, so the function itself is only about one line of code. And anytime we get to writing a function with one line of code, that's a use case for lambda functions. So we'll go ahead and comment this out. We're going to be altering this function right here and basically replacing it with a lambda function. Remember, when we want to call a lambda function inline, we need to define it inside parentheses. Then I'll write lambda and the variable we're
going to pass into it, in this case just x, and then we'll pass back out x. Now for this lambda function itself, we need to pass something into it, so we're also going to put parentheses after it for the arguments, and we'll pass salary inside. Let's go ahead and run this. And I'm having a little typo issue: lambda was spelled wrong, I need to get that right. Anyway, this basically gives us our list back, so now we can put
an expression in it to actually alter the values. I can do times 1.1, and pressing play, we have the corrected values back. This example was shown to demonstrate that with lambda functions you can write things succinctly in one single line of code. However, this isn't completely succinct: technically we could just write salary times 1.1 and get the same results, and that's what you actually should do in this case. But I wanted to show, for demonstration purposes, how you could
actually use lambda inside of list comprehension. So let's move on to a slightly more complex example of how we can use lambda functions to clean up or filter data. Here I have a dictionary of some jobs that are available. In it I have a job title, the different job skills, and this final value on whether it's remote or not, which is a Boolean value of True or False. If you don't want to type out all the values I have on the screen right now and you want to follow along,
go to your favorite chatbot and type in this prompt: "Generate a dictionary titled job data. It's a collection of data science job postings. The keys and data types for values for the dictionary are job title as a string, job skills as a list, and remote as a Boolean." Okay, and it only gives me one value, so I need to specify: provide, I don't know, five values. From there you can just copy the code and put it into your notebook. Anyway, we need to make sure that it runs and is formatted properly so we can continue on. Because
we want to filter this data, we're going to use a built-in function for this, and that is the filter function. Let's inspect it using my favorite function, help, on filter. So this is the help on class filter in module builtins. The signature is defined up at the top: inside of filter you pass a function, so in our case we're going to be passing our lambda function, and then from there you pass an iterable, which for us is our list, job data.
It returns an iterator yielding those items of iterable for which function(item) is true. That's just fancy talk for saying it's going to use the function to filter our values and give back the filtered values. So let's show what this does very simply. I'm going to write that filter function, and first we put in the function it needs, so I'm going to define the lambda function right here. We're going to be iterating through the jobs themselves, so I'm just going to make
up this variable called job. I want to keep this simple right now: I just want to find those where the job being remote is True. Remember, this function returns an iterator yielding those items of the iterable for which the function is true. So I'm just going to put in there job and index it with remote, because we want to look at the job's remote value. From there, the next argument is the iterable, which in this case is job data. Running Shift+Enter, we get a syntax error, and I think it's because we need
to wrap this all within parentheses to make it a function. I'm going to run it again. All right, oh my goodness, this is a silly error; I just realized what I did. I misspelled lambda. I should have just looked at the highlighting right there. And I don't actually need the parentheses around it. Running this, we get a filter object returned, so we need to convert it to something we can actually see. I'm going to convert it into a list and print that out. Pressing
Shift+Enter, we get back only the jobs where remote is True, and it looks like three of them are. But remember, we want to find not only the remote jobs but also those where the job skills contain Python, so I need to modify the operation we're doing in here. We can use the and operator to make sure that both conditions are true in order to satisfy this filter function. We want to verify that python, oops,
we need to type it like it is in the list, Python, is in the job skills. Okay, so now we have it checking the job skills. Let's run this. Bam. Now we have two in here, and we can see the job skills has Python in both of them and remote is True in both. So those are really the two major ways that I've seen lambda functions used: inside of something like list comprehension, or passed to functions like filter, or even apply and map. Now don't get overwhelmed by this, especially if
you're having trouble wrapping your head around these lambda functions. I frequently will start by just writing out a normal, plain user-defined function, and then once I realize that it can be written as a lambda function, I'll eventually shift it to that. So there's nothing wrong with still using user-defined functions, but you look hella cool if you do it this way. All right, with that, I'll see you in the next one.

So we've been talking a lot about functions lately, and if you remember, we built that function that calculates salary by looking at
what the base salary is and multiplying it by some sort of rate. Well, imagine now that I shift to a new project or a new Python file and I want to do that same calculation again. Do I need to rewrite that entire function, or could I use something like modules? Jumping back into my Jupyter notebook, we're continuing our discussion on functions, but we're not going to be building any more functions. Instead we're going to be talking about how we can get other functions,
or even classes, and this is done through things like modules, which we'll talk about in this one. In the next video we're going to talk about how to get them through libraries; a library is just a collection of modules. Anyway, enough of that. Let's get into building our own module, and we're going to do a simple example first. This Jupyter notebook has access to an environment: specifically, if I come over here to this files icon, I have access to the files inside of here. Inside the sample data folder they have basically
other data in here that we're going to extract later, when we look at how we can connect to it ourselves. Let's not get ahead of ourselves. Okay, so we have access to this environment right here; specifically, if we have something like a Python file, or module, we can access it from our Jupyter notebook. Let's see how it's done. First I'm going to right-click in this area and select New file. For simplicity I'm going to name this my_module (module names are typically
lowercase), and I'm going to end it with .py, which makes it a Python file. So this is my module. I'm going to double-click on it, and it opens over here on the right-hand side. I'm going to close this other panel so we have more space. Now I'm just going to put something in here: we're going to define a skill list, a list with Python, SQL, and Excel. All right, it had this asterisk up in the top left-hand
corner, which means it wasn't saved, but now it's auto-saved, so I don't have to do anything. If it doesn't save, all you've got to do is press Command+S. Now how can I access this variable, or list, inside of my module? We're going to use the keyword import and then the name of the module. Let's run it to make sure it works correctly. Bam, it works just fine. Okay, scrolling on down to the next cell. Now I want
to see this skill list inside of my Jupyter notebook. Because I've imported my module, I'm going to write my module again and then, similar to what we saw before with methods, use the dot notation to specify what I want to access. Google Colab automatically gives hints for what you might want, and it already identifies that skill list is available. So now if I run this, pressing Shift+Enter, we're going to get that list right below
it inside of here now listing variables inside of a module is honestly not really a common use case a more common use case is actually defining a function inside of here so let's define a function that whenever we provided the argument of a skill name it provides back that that skill is my favorite skill which will be in the form of a string so I'll start by defining a function and I'm going to call this skill inside of here I'm going to put that the variable that I want to Define is skill name for what
I wanted to return I'm going to use an F string for this and I'm defined inside of curly brackets this skill name is my favorite skill exclamation point to make sure that we're uh making a point here so we have our function defined and we also have our variable in here so saving this pressing command s so all changes are saved so let's actually try to call this we're going to use my module and then similarly we're going to use that dot notation and in this we don't have we're just going to define the function
that we're want to call and that is skill right now I want you to watch something only skill list is popping up so I'm going to do skill and remember we need to provide it with an argument name so of course you know what I'm going to put in here I'm going to put python running uh this by pressing shift enter we're going to get this error message right here attribute error module my module has no attribute skill now we save this file F right and updated it but the problem is the updated contents whenever
we imported it in are not updated inside of this Jupyter notebook for this new function that we actually just defined now the problem is this Jupyter notebook is being really lazy it already loaded in my module previously before we defined this skill function and so even though I rerun this to import my module it actually doesn't go through and reload the contents of this so we basically need to start this whole session over again I'm going to come up here in runtime and select restart session and run all and it's going to say restart and
run all yep that's what I want to do I'm going to move this over so we can see better and delete this one down here okay so now it is actually working inside of here so don't be alarmed if you experience the same errors as me sometimes like any electronic device where you've got to flip the on off switch you've got to do the same thing with your Jupyter notebook just flip it off and on so let's get into a practical example of how I'd actually set this up for a real case scenario where I need to use
this once again I'm going to create this module inside of my files by creating a new file I'm going to name it something appropriate to what it actually does I'm going to be doing a lot of different functions for maybe analyzing job data so I'm going to call this job analyzer and then of course end it with py so I'm going to take this function that I used previously by command C or copying it and then pasting it in into the job analyzer py file and then command s to save it I'm going to delete
that previous function we don't need anymore and now we're going to import it in so we'll import job analyzer and then we're going to Define it here so we'll use job analyzer do calculate salary remember we only need that base salary in there so I'm just going to put that in for right now and then running this bad boy we get what we expect of 100 0,000 now import isn't the only statement we can use in here this is cumbersome to write out this module along with the function name after it if we're only using
maybe one or two functions there's an easier way to do this instead of only using that import keyword we can use another keyword and that's specifically from so we'll use from job analyzer import calculate salary okay so we're going to be importing the function of calculate salary from job analyzer because we've imported it in like this we no longer need to use this dot notation and whenever we run this we get the final results that we expect now let's say we have multiple functions inside of our job analyzer module and I just
added this other one of calculate bonus all it does is take a total salary and base salary and then from there it returns what the bonus rate should be by performing this formula right here anyway let's say we want to import both of these in well after writing calculate salary we could do something like put a comma and put calculate bonus which would work just fine which I'm going to show real quick so I call the function calculate bonus giving it a value of 110,000 and
then 100,000 for the base running this it says cannot import this is the same issue that we were facing before and so what I'll just do is I'll say restart session and run all so now I'm getting that calculated bonus rate down below it anyway this is sort of cumbersome sometimes if we want to list out every single function we want to import if there's not a lot of functions inside of the module itself we can do something else and use this star notation and this says hey import all the different functions that you have in
there just for demonstrative purposes I'm going to make sure to restart session and run all whenever I do this to make sure that we're running clean with this and bam we just ran it again and with this we're now able to actually import in all the different ones available so what happens now if I want to give this python file to somebody else to use as a module as well and then they may have questions in understanding how to use this remember before we could use that help function in order to investigate things like strings lists and dicts
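
The import styles covered so far can be pulled together in one runnable cell. This is only a rough sketch under a few assumptions: the bodies of calculate_salary and calculate_bonus and the 0.1 default bonus rate are made up for illustration, and the module is written to a temporary folder rather than the Colab files pane so the cell is self-contained. It also uses importlib.reload, which is not what the video does; it is a standard library alternative to restarting the session after a module file changes.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for creating job_analyzer.py in the Colab files pane.
# The function body and the 0.1 default are assumptions for illustration.
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "job_analyzer.py"
module_path.write_text(
    "def calculate_salary(base_salary, bonus_rate=0.1):\n"
    "    return base_salary * (1 + bonus_rate)\n"
)
sys.path.insert(0, str(module_dir))  # let Python find the new file

# Style 1: import the whole module, then reach in with dot notation.
import job_analyzer
total_salary = job_analyzer.calculate_salary(100_000)

# Edit the file on disk by appending a second (also hypothetical) function.
with module_path.open("a") as f:
    f.write(
        "\ndef calculate_bonus(total_salary, base_salary):\n"
        "    return (total_salary - base_salary) / base_salary\n"
    )

# The module object already in memory does not see the new function yet.
seen_before_reload = hasattr(job_analyzer, "calculate_bonus")

# Reload the module from disk instead of restarting the whole session.
importlib.reload(job_analyzer)
seen_after_reload = hasattr(job_analyzer, "calculate_bonus")

# Style 2: pull specific names in, so no dot notation is needed.
from job_analyzer import calculate_bonus, calculate_salary
bonus_rate = calculate_bonus(110_000, 100_000)

print(seen_before_reload, seen_after_reload)
print(round(total_salary), bonus_rate)
```

The star form (from job_analyzer import *) would bring in every public name from the module in one go; it is left out of the sketch because listing names explicitly makes it easier to see what ended up in the notebook.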
can we do it here well we can so I can put in something like the function I want to investigate calculate salary and from there run shift enter now similar to before it says help on function calculate salary in module job analyzer and it has what the function is along with the arguments but then nothing underneath it if you recall from our strings and lists when we investigated them they actually had a description underneath well a neat functionality of python is I can provide a docstring inside of the function itself to define what the function does
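
As a rough sketch of what a documented function could look like: the first docstring line follows the wording used in the video, while the function body, the 0.1 default, and the exact Args and Returns wording are assumptions for illustration.

```python
def calculate_salary(base_salary, bonus_rate=0.1):
    """
    Calculate the total salary based on the base salary and bonus rate.

    Args:
        base_salary (float): The base salary.
        bonus_rate (float): The bonus rate. Defaults to 0.1 (an assumed value).

    Returns:
        float: The total salary.
    """
    return base_salary * (1 + bonus_rate)


# help() prints the signature followed by the docstring text.
help(calculate_salary)

# The same text is stored on the function object itself.
print(calculate_salary.__doc__)
```

Without the docstring, help would show only the signature line, which is the "nothing underneath it" situation described above.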
and all a docstring is is a multi-line string that acts like a comment so we use those three double quotes at the beginning and the end and then you can write multiple lines in between in here I have that it's calculate the total salary based on the base salary and bonus rate I'm going to go ahead and command s to make sure that this is saved and now I'm going to do that same thing of that runtime restart session and run all moving this over so we can actually see a little bit better so now we have our
help function we have what the function is that we're asking about and also we have the description of that function to help tell you what's going on here now sometimes whenever you're reading this kind of documentation you'll also see these extra lines included in the docstring and that's args or arguments and it gives information about in this case base salary and the bonus rate saying what data type it is and then what it is and then finally returns down here says it's returning a float that is the total salary and I
can go ahead and also include this for this other one as well it'd be nice to have command s saving this and then just to actually see to make sure that it's actually working just fine I'm going to restart and run the whole session it's popping up I'm going to close this out and now we have what looks like a legit module that we've imported a function from where we can actually run help on it and it understands what we need to provide to it now shifting gears a little bit python itself has access inside
of the Python standard library certain modules that are available right now for you to use you just have to import them in our case into our Jupyter notebook and so in their documentation they go into a lot of different modules that they have such as numeric and mathematical modules functional programming modules and even a host of others we're not going to go into all of them just yet we'll go more in the next section I just want you to focus on this for the time being that in the math one they have things
like math random statistics these are common ones we have access to right now if I go ahead and click on statistics it gives me an overview of everything that is available to use with this module of statistics I'm more of a nerd so I actually want to check out what the source code is now we're inside of GitHub in the repository for python if you remember from before CPython is the implementation of python that we're using and check this out at the very beginning they have a docstring talking about all the different functions that are available
inside of this module of statistics let's actually check out this mean function whose description is arithmetic mean average of the data scrolling down to line 468 we finally get to the mean function it takes one argument data they have a docstring here that basically goes over how to use it and then for the code it looks like it's about 4 lines of code to actually generate what the mean is and it actually makes use of other different functions inside of this module in order to calculate the mean we're not going to dive into this
code here I just wanted to show you the source code for it and how we're actually going to now use it inside our Jupyter notebook so back inside of our Jupyter notebook let's say we have a list of salaries and this was just randomly generated feel free if you're following along to just make up five different numbers or however many numbers you want to make up and put them into something like a list or tuple so I'm going to go ahead and run this and then from there we're going to get into using this
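
Here is a small sketch of using the statistics module on a salary list like that, first with dot notation and then with names imported directly. The numbers are made up (as suggested above), and one value is repeated on purpose so that mode has a clear answer.

```python
import statistics
from statistics import mean, median, mode

# Made-up salaries; 101000 appears twice so the mode is unambiguous.
salary_list = [98000, 101000, 102000, 99000, 101000]

# Dot notation: module name, then the function.
print(statistics.mean(salary_list))

# Names imported with "from ... import ..." can be called directly.
print(mean(salary_list))
print(median(salary_list))
print(mode(salary_list))
```

Listing mean, median, and mode by name keeps it obvious which functions were brought into the notebook.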
so I'm going to start by importing in statistics remember we want to use the mean function so I'll do statistics dot mean and then inside of that we provide the data which we can have this pop up here now and it actually shows us some examples of how we can use this but I'm going to just put in salary list running this pressing shift enter we get the mean now what happens now if I want to use something like median and also mode well that's going to get cumbersome writing this all out so let's
actually import them in separately so I've modified that import to say from statistics import mean mode and median now we could replace this all with that asterisk and then import them all but I'm not going to recommend this whenever you get into bigger modules even this one I mean it's relatively small but it's getting big we don't want to necessarily import them all because if we in the future Define some sort of variable or function that has the same name as this we're going to run into issues so in order to remove any of those
issues and to keep track of what functions we actually have imported into the environment I recommend just listing them out so we've gone through and called all of our different functions I want to run all these cells I'm just going to come up here and click run all and now we can see we have the mean mode and median for your homework for this I want you diving into what modules are available to you inside of the Python standard library now they have a whole host of ones which you can go through and actually
just find out where it is by using command F and then typing in something like module but mainly I'm going to recommend that you focus your efforts underneath these numeric and mathematical modules as data analysis is basically a bunch of math all right with that you have some practice problems now to go through in order to create your own modules and also start diving in deeper into a lot of these modules that are available for you all right with that see you in the next one so like any data source the data that we're using for
this project requires some cleanup so we're going to be using modules from the python standard library in order to clean this data up let me show you what I mean so here we are in our data set which is located on Hugging Face and it has the different columns with it which has the column name and then also the data type of that column and then the values underneath it it's got tons of them anyway let's pay attention to the data type so if we look at a lot of these we can see they're strings booleans
and even over here for the salary data it's a float now we run into a problem with two of The Columns first is this job posted date if we look down here we can see that it's actually a date and a time and python has a datetime object however it's not recognizing this it's calling it a string additionally if we look at the skills column we can see that it is a list of skills and this one also is categorized as a string now the reason for this is pretty simple all this data is located
in a file like this inside of a CSV or comma separated values we can see this by these commas that are separating it anyway diving into it further looking specifically at those job skills we can see that these lists like this top one right here it has quotes around the list itself so it's getting cast as a string in this case so let's actually get to work in this exercise of cleaning up both that job skills column to convert it into a list so we can actually manipulate it and also that job date column to
make sure it's a datetime object and make it easier to manipulate now this fake data that we're using has three different keys and then values specifically it has job title the job skills like we're familiar with and now this one of the job date now if you don't want to type out all those different values and I don't recommend you do come into something like ChatGPT with the prompt of generate more fake data for this list and I would just provide it an example of what it needs to provide for that first list or that first
job entry inside that dictionary the key thing for this fake data is to make sure that whenever you provide it an example you have the job skills this list wrapped in double quotes and then for the job date it can be double quotes or single quotes for that date value all right so let's get to cleaning this up and we're going to start with the datetime cuz frankly it's a little bit easier to deal with for both of these we're going to be using the python standard library and specifically for datetime we're going to
be using datetime now inside the documentation of datetime I can see on the left that they have a bunch of different available types they have date only time only datetime timedelta and then time zone information whether my data is dates only or dates and times I find myself gravitating towards this datetime object and inside of this class of datetime dot datetime it has different methods available for it like in this case we can use something like finding out what the date and time is
now so let's just try it out for this I'm going to import in date time it's part of python standard Library so it's one of the modules so I can just go from date time import date time now this may be sort of confusing on why am I listing it twice but if you recall back we could also be importing date time importing in date or in something like just time we're going to import the date time object running shift enter we have it imported in so if I wanted to run the now method on
datetime I would just type datetime and then dot now running shift enter it provides me back this datetime object which I know how it's formatted which is basically it has the year the month the day the hour the minute seconds and even microseconds so you can see when I'm filming this right now so what are we wanting to do with this date we want to convert it to the correct object that it needs to be so let's actually inspect what's actually coming out by accessing one of these elements inside of our list
of dictionaries of these different jobs so I'm going to first start by typing the variable of data science jobs and I just want to access the first element so I'm going to type in zero inside these square brackets right here run shift enter we get the first one now I want to access the job date of this so once again I'm going to open up those square brackets and just type in here job date running it we have it right here so just to make sure that we're on the same page I'm going
to run the type function on this because I want to see and verify what is the type of this and running that we can see it is a string we want to get it into a datetime so back inside the documentation I can see that underneath that datetime object we have the method of strptime which is string parse time we have a string we want to parse it to time for this it takes two arguments the date string or the string itself and then how we want to format it and then it
returns a datetime corresponding to that date string so I think we understand what we're using for the date string but how do we format or what do we put in this format so if we go back all the way up to the top of the documentation they have this link up here for skip to the format codes and it provides this table where it has the associated code the meaning and then an example first we're going to want to get the year out of this so we're going to use this percent and then capital
Y next we're going to want the month as a zero-padded decimal number basically 01 02 using percent lowercase m and then finally we want the day of the month basically percent lowercase d so let's do this I'm going to set this up as test date as a variable of this remove this parenthesis run shift enter and now let's actually try it out so I'm going to do datetime dot strptime then inside there we need two things we need the date string and then the format note that the format is a string
as well so I'll put in this one first test date and then for this I'm going to define it remember we need the year which was percent Y capital Y then from there percent m for the month using a lowercase m and then also percent d lowercase as well and I can go ahead and also establish that we're going to be using these hyphens or dashes in between each one of those codes all right let's go ahead and run this and see what we get and we get back this datetime object specifying the year month and day
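
A minimal sketch of that parsing step, assuming the date string looks like year-month-day with dashes (the value below is made up; the real data may differ):

```python
from datetime import datetime

# Made-up date string in year-month-day form.
test_date = "2023-05-12"

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day.
# The dashes in the format string must match the dashes in the data.
parsed = datetime.strptime(test_date, "%Y-%m-%d")

print(type(test_date).__name__)  # str
print(type(parsed).__name__)     # datetime
print(repr(parsed))              # datetime.datetime(2023, 5, 12, 0, 0)
print(parsed)                    # 2023-05-12 00:00:00
```

The case of each code matters: uppercase %M means minutes, not months. If only the calendar date is wanted, calling parsed.date() on the result is one way to drop the time portion.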
you're probably like Luke this isn't in the format that we specified hold up a second if I use the print statement with this and wrap it in parentheses when I go to do this it prints out the year month and day in that format I wanted and along with that since it's a datetime object the hour and minutes and seconds there's no value for those here so it just appends zeros right to the side if I didn't want that time value there which I don't care about right now I would instead
import in here the date object and it has a very similar method for how you can actually convert this to the date you need but like I said I don't care about this addition of time we're going to leave it as datetime so what happens now if I want to go through and clean up all these datetimes cuz right here we just did our test case of only one how would I actually do this well very simply we're going to be using a for loop first I write for job in data science jobs and
then a colon and what I want to do for this portion now is basically just replace the values inside of the job date with datetime objects in place of these string objects I don't need to keep these strings at all so I specify job and then the key I care about which is job date as that's the one I want to replace and I'm going to set it equal to basically exactly what we did up here so I'm just going to take that real quick place it into here and instead of using that test date I want
to use the job and job date so basically we're looping through every one of those entries in the list and we're going through and converting them one by one into the correct format and saving it into its previous location so running this it ran just fine let's actually inspect it so printing it out to inspect it we can now see that inside of here are datetime objects so task number one is complete we've cleaned up those datetime objects now we need to clean up these skills and convert them into a list now for this we
can't use datetime for this because obviously it's not a datetime to clean up so back inside the python standard library we need something to actually convert this well under Python language services we're going to use ast which stands for abstract syntax trees now to be honest I had to use ChatGPT in order to find this module this is not something I commonly use and this is not commonly an issue you'll probably run into converting strings into lists however the datetime one is very common so you should definitely commit that one to memory anyway
the function we care about is literal eval and it evaluates either an expression node or a string so a string in our case containing only a python literal or container display and lists are containers and it also confirms down here a bunch of other different data types that you could feed into this if they were wrapped in quotes in order to convert it into what it actually needs to be so we're going to be taking the same approach to loop through each of those values inside of that list that contains our
dictionaries of jobs and then perform this function on each one of those values so I'm actually just going to come up here and start typing underneath it but before we even do that we need to actually import in that module of ast so for each of those jobs in there I'm going to specify job and then job skills because we want to replace what's currently in there so we're going to set that as the variable we're going to do ast dot literal eval and then from there we need to place the string inside of there so
we basically need to put in the current contents of job skills for each of these jobs so let's go ahead and run this and I'm getting this error because if we look at it it's pointing at the line above it on job date I tried to combine this but anyway what it's saying is strptime argument one must be a string not a datetime object we converted it already if we look at data science jobs it's already datetime anyway what we need to do is we basically have
to just run this whole script again so I'm going to come up here to runtime and I'll say run all okay so now we cleared that error because basically we passed back in data science jobs with our dates as strings and then also our list of skills in a string it went through this loop now actually let's inspect it printing it out right below we can see now we have the job skills in a list it's no longer inside the quotes and we have our datetime objects I'm not going to lie data cleaning
is one of the most time-consuming aspects of my job as a data analyst it can sometimes take up 50 to 80% of the time on a project alone so learning things like this is only going to help you later I promise you it's not for nothing I think that's a double negative so it means it's for something all right if you haven't done it already it's your turn now to go through and actually perform these different operations dig into that datetime library and then also feel free to check out that ast library with that
I'll see you in the next one so picking up where we left off with modules we learned that we can build them ourselves and they're very useful at automating routine tasks by providing us some sort of function or method so we can write less code in order to conduct some sort of operation with that we learned we can build our own modules or go to something like the python standard library and get modules from that but there's a whole heck of a lot of libraries even beyond the python standard library that are available to you
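
As a quick reminder of how far the standard library alone can go, here is a sketch of the cleanup from the previous lesson using datetime and ast. The key names, job entries, and date format are assumptions for illustration (the real data may differ), and the isinstance checks are an addition that is not in the video: they let the cell be rerun without hitting the strptime error that shows up when a value has already been converted.

```python
import ast
from datetime import datetime

# Made-up job entries: the skills list and the date both arrive as strings.
data_science_jobs = [
    {"job_title": "Data Scientist",
     "job_skills": "['Python', 'SQL', 'Machine Learning']",
     "job_date": "2023-05-12"},
    {"job_title": "Data Analyst",
     "job_skills": "['Excel', 'SQL', 'Tableau']",
     "job_date": "2023-05-14"},
]

for job in data_science_jobs:
    # Only convert values that are still strings, so rerunning is harmless.
    if isinstance(job["job_date"], str):
        job["job_date"] = datetime.strptime(job["job_date"], "%Y-%m-%d")
    if isinstance(job["job_skills"], str):
        # literal_eval turns the text "['a', 'b']" into a real list.
        job["job_skills"] = ast.literal_eval(job["job_skills"])

for job in data_science_jobs:
    print(job["job_title"], job["job_date"], job["job_skills"])
```

ast.literal_eval only accepts Python literals (strings, numbers, lists, dicts, and so on), which makes it a safer choice than eval for text that comes out of a data file.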
let's jump into my notebook to do a quick example you don't have to follow along I just want to show how third-party libraries can help speed up your workflow so let's say we want to start analyzing the data which we're going to be doing here very shortly when we dive into that data set that I keep on teasing anyway we can go under files in the left hand pane peek inside of the sample data folder and see that we have CSVs which are basically a bunch of data how do we actually access this with python this data set just
has a list of columns and then a bunch of values underneath it don't worry too much about what's included in this data so if I want to read the CSV I'm going to first need to open it I'm going to use the open function for this in this you specify the file path so this is inside the sample data folder and I want to look at the California housing test CSV so running this we have some sort of text IO wrapper object right here so I'm going to assign this to a variable of file because
that's what it is it's a file we've opened now with this file I want to do something to it and specifically looking at the different methods that are available I want to read it so with this printed out we can see that this basically has all the different contents we need although it's in a very unorganized manner like I'm not going to be able to use this I need to actually put this into another data type if you will in order to then go through and actually look at the longitude column latitude housing median age something like
that so what I'm going to do is save this file dot read to a variable called content also for good practice I'm going to go ahead and just close this file out so we're at three lines of code not too bad but now we need to put this into like I said a data type that we can use in our case I think the best one for this I'm going to say it's a dictionary where if we look at these values here I'm going to assign each one of these column names as a key and then
the values underneath it I'm going to assign as the values basically a list of all the values so this is going to take a for loop to enumerate through all the different rows of that file that we've read into content and then another loop inside of it to make sure that we assign each value inside the column to its associated list once again don't worry too much about this code I don't want you actually having to repeat it I just want to show now we're at 12 lines of code so let's go ahead and
run this and then printing out what we have which is called data dict we can see that we have it in a dictionary now so if I want to access something like this total rooms key right here we provide the key of total rooms and then if I wanted to do something like sum up all these different values I could run the sum function on this list right here and I'm already running into errors because apparently some values inside of those lists aren't just integers it looks like some of them are also strings this data
isn't completely clean so I have to use something like this list comprehension that goes through each of the different values in the list and then basically cleans up some values in order to get it and we're finally able to sum it and get it anyway this is a whole host of work and that's what I'm trying to show by this we're now up to like 13 lines of code which honestly is completely ridiculous for doing something that's probably going to be routine so instead I can use a library like pandas which I just import it
by saying import pandas now we haven't done this before but typically when you import pandas you use an alias which uses the as keyword and we rename it as PD this just makes it a lot shorter to access it from there I'm going to create a variable called contents to store what we want in there and I'm going to say hey for PD which is pandas I want to read a CSV which we did previously and then from there I'm going to put in the file path to that CSV and I'm going to print it
out below it so bam now we have all this data from the CSV and it's not in a dictionary it's in a data frame we're going to get to that data type in a little bit but it's all here but because you know dictionaries you actually probably know enough to manipulate and work with these data frames so in case we want to go and access this total bedrooms column we can just specify it just like a dictionary key and I'm going to type it here total bedrooms and so now we can see this has
all the different values for the total bedrooms and then if we want to sum it up they actually have inside of the data frame object the method of sum and now that I'm inspecting it compared to our last method I'm actually finding out that this sum calculation that I built right here is not correct I trust this pandas method more for summing it up so the point is this we used three lines of code in order to expedite our data analysis that is pretty common compared to 13 lines of code so that's where packages and
libraries are so important they're going to speed up your workflow so where do you go in order to access these mythical packages and libraries I keep talking about well depending on how you're installing these packages you may find them in one of two locations or at least two common locations the first one is this pypi.org so I can search for packages that I may need in this case let's actually look for that pandas one we have a few different options pop up and I know it's this one pandas 2.2.2 and the first thing it provides
is the code to actually install pandas but hold up real quick if we scroll down here we can see an overview of what the package is the table of contents some of its features and then where to get it so there's two major locations you can get packages or libraries from and that is PyPI which we're on right here using this pip install pandas and then conda or Anaconda and it tells you this command right here to use now this is where we get into a very complicated concept in python of managing your environment and all
the different packages that are within your environment we're not going to be doing that in the basic section we're going to keep it really simple but at the beginning of the advanced section we'll be diving deeper into this for now all you need to know is that we're going to be using this pip command in order to install packages while we're working in Google Colab that is their package manager of choice for that environment whenever we get into the advanced section we're going to shift over to conda and use that as a method to install
packages Anaconda is completely free and it's actually the preferred choice for environments for those that work in things like data science and AI as there's a lot of support behind it and it's highly reliable but we're really not going to talk much more about Anaconda until we get into the advanced section so basically just forget about it for right now so let's get into how we install and also check to see what packages are available within our environment inside of Google Colab remember we're using pip for this now pip is not a python
command it's a bash or shell command and you would enter it through your terminal if you recall from earlier in this session we talked about the terminal and right now you have to have Google Colab Pro in order to access that and that's not necessarily needed for this instead we can use that special shell command of exclamation point and then from there provide that command that we saw previously for installing pandas now you saw that I ran this already but I hadn't installed it into the environment but I'm going to just run it right now
to show you it provides back that requirement already satisfied for all these different files it needed to install for pandas to be installed so it was already met with this so what packages are inside this environment well using an exclamation point I'll put in pip list and this is going to list all the different packages in this environment here Google Colab is really nice and they provide a whack of libraries and packages that are available to you right now scrolling down I can see in fact pandas is installed and specifically the version 2.0.3 but for funsies
let's actually install a package that isn't installed inside of Colab I found this one right here it's called pyjokes and the package itself provides one-line jokes for programmers jokes as a service up at the top I can see I can pip install pyjokes so I'm just going to copy that over and then inside the notebook pasting it after an exclamation point and installing it in it goes through and you can see it's collecting pyjokes it's downloading the files it needs and then basically saying it's successfully installed pyjokes now although I've installed
pyjokes into the environment that doesn't mean that it is actually ready to use in this python file once again you have to actually import it in I'm not really sure how to use this so I'm going to run the help function on pyjokes looking at the documentation on this it's pretty bad there's not much on here that I know what to do with so I'm just going to go on a whim here I'm going to type in pyjokes and then I'm going to type in a period from there I can
see based on this it has two functions of get joke and get jokes so let's just do get joke with open and close parentheses and it provides this a QA engineer walks into a bar runs into a bar crawls into a bar dances into a bar tiptoes into a bar rams a bar jumps into a bar okay that's pretty funny cuz a QA engineer has to test all the different methods of getting into a bar all right so what third party libraries do you need to know for data nerds well
this course focuses heavily on pandas and matplotlib along with a brief intro to numpy and seaborn pandas is really popular for accessing tabular data and then using something like a data frame to analyze it matplotlib has a different use case in that we can take some sort of data that we want to visualize and make it into basically any type of plot that we want these being the two most important libraries that we're going to spend the majority of our time for the remainder of this course
in I want you to start getting more familiar with these for your homework one quick note before we leave on defining what a package and a library are I've been using these terms throughout explaining all this and sometimes I use them interchangeably so starting with the basics a module is a single python file containing definitions and statements we created modules two videos ago a package is a collection of python modules under a common namespace a package can also contain sub packages so basically folders with all these different modules
in it all wrapped up within a single package so we have modules and then packages and now we have libraries a library is a collection of packages or modules that are grouped together to provide functionality so both pandas and matplotlib are technically libraries because they have a group of packages inside of them and also include other things like pre-written code configuration data and documentation to help with development and basically speed up our workflow with data analytical tasks so if you think of it this way libraries can be packages and thus we can go to things like PyPI or
Anaconda and download them with a package manager but a package alone if it only includes things like modules in it and it's very basic can't be considered a library anyway I digress what's important is you understand that we have access to third party packages and libraries that are going to speed up your workflow so you want to use them all right we have some practice problems available for you now to go through and download different packages and test them out inside Google Colab with that see you in the next one this
video is going to be optional if you will a lot of the concepts and what we're going to be doing in this is only going to be applied in this video here and we're not going to apply it further however everything we're going over with classes is going to help you better understand how these different data types like strings and lists and even data frames actually operate behind the scenes so let's look at a very simple version implementing a list data type using a class now I want to show how this list data type would
be coded and so that's why I have this basically simplified version of list all within this class we're going to go to it in a second now for this example I did want to show you the source code for the python library but like I said before python a lot of it's written in C and here the actual list object is written in C I don't know C I'm not trying to go through in this and try to explain it to you so that's why we're going to this simplified version right here anyway this is
a class right here and I'm naming this class Luke's list and inside of this class you see the structure is very similar to before it's got class at the beginning and a colon but inside the class itself we have different methods right here six different methods we'll dive into some of these whenever we get into our full-on example but for right now I'm just trying to demonstrate how the list class would work anyway I'm going to go ahead and run this and now this class object of Luke's list is available so the first thing I need
to do is create an instance of this list so I'm going to call this my list and set it equal to Luke's list and I need to put those parentheses after it to instantiate that object running shift enter now when I run my list I should see a blank list okay it's a blank list inside this class I have this add method and it adds an item to the end of the list by basically doing an append so let's actually add something to it I'm going to define my list and remember I'm going to use
that dot notation to actually access those methods right now we only have add available or being able to view that and then from there I'll say we'll add data nerd running shift enter bam we got it now actually print out the contents of that list we got data nerd inside of it if I want to add something else to it I can just add that to it as well so now we have finance nerd in there print out my list to actually see what's going on there we got them all now we're not going
to do all these different methods here but we're going to just do one more specifically this len method which returns the length of the list so I can just use that len function and pass in my list and it should print out two there's two items in there okay so that's a class that we've built for a custom list classes are really important when we need to bundle up in this case a bunch of different methods and make an object behave in a certain way and I think we have a good use case for this
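A rough sketch of what a simplified list class like the one demonstrated might look like. The class name `LukesList`, the private `_items` attribute, and the choice of four methods are assumptions for illustration; the class in the video has six methods that are not all shown, and the real built-in list is implemented in C.

```python
class LukesList:
    """Simplified stand-in for the built-in list (illustrative only)."""

    def __init__(self):
        # Internal storage for the items (hypothetical attribute name).
        self._items = []

    def add(self, item):
        # Add an item to the end of the list by doing an append.
        self._items.append(item)

    def __len__(self):
        # Lets the built-in len() function work on instances.
        return len(self._items)

    def __repr__(self):
        # Controls what is shown when the object is displayed or printed.
        return repr(self._items)


my_list = LukesList()      # the parentheses create an instance
my_list.add("Data Nerd")
my_list.add("Finance Nerd")
print(my_list)             # ['Data Nerd', 'Finance Nerd']
print(len(my_list))        # 2
```

The point of the sketch is only that dot notation such as `my_list.add(...)` and built-ins such as `len(...)` end up calling methods defined inside the class.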
recall back whenever we were covering modules and specifically we were working with a base salary and we defined these two functions mainly just the first one we focused on but I also added the second one in enabling the ability to calculate a total salary by looking at the base salary and bonus rate and also calculating what the bonus value is based on the total salary and base salary so what we could do is we could create our own basically currency data type for this base salary and thus allow us to do these functions when called
as a method along with a few others so this is what we're going to end up coding during this video and it's a class on base salary and it's going to have four different methods inside of it now before we get into coding it I want to demonstrate how it works so you understand the significance before we actually start coding this out so I'm going to go ahead and play this to make sure that that class based salary is in our environment so the first thing I'm going to do is initiate the class and I
do this by just calling base salary and then I know I have to provide an argument we're going to just do 100,000 and that's the as I'm looking up at the documentation that's the base salary it also takes things like the bonus rate and the symbol we want to use for this right now we're going to leave both those blank anyway running shift enter I get back provided to me which is pretty neat 100,000 formatted with a currency symbol and with a comma separating it so I'm going to assign this to a variable of data
salary so that way we can actually start playing with some of the methods of it running data salary to make sure that we have it in there yep it's 100,000 so let's run this method of calculate total salary so I Define data salary and then put the dot and then from there I can see the methods by this cube right here of calculate bonus or calculate total salary and then I provide opening closing parentheses so it goes through and calculates that it's 110,000 and it has it formatted correctly based on that 10% bonus rate now
let's say for some reason we come up here and we say hey the bonus rate is actually .2 well I can run play here we can see now that it's still 100,000 for the base salary but then we go to calculate the total salary it goes up to 120,000 so let's get into building this thing and to make sure that we're all cleared out and not actually running this in here this code that I have I'm going to reset the environment so the first thing that needs to be done in defining a class is starting
it with the keyword class and then from there we're going to name it this is typically done for a class using what we call camel case or more precisely Pascal case and that's where you use an uppercase letter for the start of every word you're typically not going to see underscores or all lowercase used for classes because that's more reserved for variables I want to be able to quickly identify this is a class so this is how it's named all right with anything we're going to test iteratively so I'm just going to provide the
pass keyword here and then make sure that it runs it runs just fine so the first thing we're going to define is the dunder init method and methods start just like functions using def or define and then we're going to use two underscores hence dunder for double underscore then put in init and then two more underscores and then put a colon dunder init is a magic method anytime you see these double underscores these are basically predefined methods inside of python so that whenever you run the code python knows that this specifically named method
has a special use case specifically the one that we're going to be using and from there I'm going to add opening and closing parentheses a colon and then like usual I'm just going to run pass to make sure that it's working properly now we're going to be defining multiple methods in here and they all need to be able to work with each other and to basically pass around whatever data we have initialized inside of this object of base salary and we can do this by defining the first term or the first positional argument
in init as self so this now running it is the simplest form of this base salary class I can come down here and then actually run it to see what we get printed out and nothing it basically tells me it's a base salary object I can pass it something like let's say we want to pass it a salary of 100,000 it's going to tell me this that base salary init takes one positional argument which we see there's only one positional argument self but two were given it automatically assumed the object itself is
self and then this 100,000 is now the second one so we need to build in base salary as a parameter into this so I'm going to add this base salary so running this and now this one with a second positional argument we get this back that base salary object we have it but let's say I want to call it so we'll define this as the variable of salary and let's say I want to see the base salary so I want to call something like base salary if
I run shift enter it's going to go hey it has no attribute base salary so what I can do inside this dunder init method is initialize this variable of base salary that's why it's conveniently named init it initializes these variables and I'm going to set the attribute of base salary and this is an attribute not a method equal to the variable that we're passing in here of base salary so just so we're clear this base salary that we're passing in through right here is this value right here which we're assigning
to the attribute of base salary so I ran this class again now let's run this below and we can see that base salary now the attribute has 100,000 right this is not a method so we're not going to be including opening and closing parentheses after it it's not callable like that it's an attribute so we're going to be accessing it like this attributes are basically variables also I'm noticing we don't need this pass statement anymore that's just not necessary even running this it's still good okay now we need to pass in two
more variables that we're going to be passing this class and that is the bonus rate and then also the currency symbol we want to use for this so we're going to set the bonus rate equal to 0.1 and then for the symbol we're going to set it equal to the string of a dollar sign now just like this base salary I need to now define these attributes and set these values to them so I define the attribute of bonus rate to bonus rate and symbol to symbol so running this and then running this down below
we still have it I can even change this attribute down here to see hey what is the symbol and there's a dollar sign now one quick thing self is not a keyword and it's not the required name of this parameter right here so actually I'm going to change all of these from self just to demonstrate I'm going to use control shift L and I have selected all of these and I'm going to rename it object for object so now running this here up above and then running below it still works this name right here doesn't really matter
what you name it but self is most commonly what it's named now the next method we want to build into this is to format how it's output so if I were to see just salary by itself printing it out it's providing this base object right now whenever I call salary that variable I want to see it printed out similar to what we did above whenever we had that original class in there actually formatted correctly with this number with the dollar sign and then the comma every thousands place well we can use another method for this
and it's once again a magic method it is the dunder repr method this method like all methods inside of this class is going to have that passing of that first argument of self to it so we have all of these things like base salary bonus rate and symbol passed in through this for this one that's all we'll need we'll go down to the next line and this repr method because it's a magic method it's known by python whenever you go to print out the contents of a variable with this base salary it's going
to do whatever you tell it to do within this method so I'm going to need to use the return keyword and I want it to return something we're just going to say return data nerd for the time being to demonstrate the purpose of this running play here and then also in here and then finally here we get data nerd printed out that's not very useful all right I actually want to see the $100,000 so we're going to use an f-string for this I'm going to use the curly brackets to pass in self.
symbol and then after self.symbol I want to pass in the actual base salary so I'll specify self.base_salary running shift enter on this shift enter again and then shift enter we got now 100,000 now remember we want this 100,000 to have a comma separator in it and there's actually a special way we can do this inside of an f-string I don't want to dive too much into this because we want to focus on classes but if we have something like value and it's equal to a million here and I put it into
an f-string we know we can put a variable inside of curly brackets and then it'll print out below basically as a string running this again okay if I want to format that value because we can do that it's like a special feature I can put a colon and then from there I want to format it with a comma so I'll add a comma onto the end of that additionally if I wanted to format how many decimal places after this I'd put a decimal point and then for our case let's say we want to have
two decimal places after it I would add two and then f okay now I have 1 million with two decimal places I don't want anything after it so I'm going to change it to a zero running shift enter okay so I'm going to take this format specifier that I just defined right there command C it and I'm going to place it right inside of here inside of our class run our class run this one and then finally we have 100,000 with the comma and no decimal places if you want to learn more about how to format
variables inside of f-strings you can go right here to the documentation I'll include a link below I do do this from time to time but it's not the focus of this we're going to focus on classes so we're going to continue on all right so we're actually almost done we just have to add now those two methods that we've defined previously for calculating total salary and calculating that bonus so what I did is just copy and paste in those functions that we built previously and put them into here they're not going to work
right now and we're going to go over how to actually fix this to make them work but at least there some boiler plate code to start with we're going to start first just modifying this calculate salary function right here remember the purpose of this is to calculate total salary which is our base salary times 1 plus our bonus rate so this total salary is actually going to be used not only in this calculate salary but also in this calculate bonus because of that I actually want to Define total salary up here in the init method
so I can define a new attribute of self.total_salary and I can set it equal to basically this formula right here which I'm going to copy up here now that we have total salary defined inside of self we can pass self in to the calculate salary method right here now what do we want to return well we want to return similar to this dunder repr method what the total salary is formatted similar to this so actually I'm just going to take this code right here put it inside of here
and we have self.symbol but now I want to provide back self.total_salary similarly let's do the same thing here I need to initialize an attribute of the bonus and set it equal to these contents right here now we do have to modify this further total salary you can see it has these lines underneath it it's not defined we don't pass total salary in but it is initialized under self.total_salary so what I can do is write self.total_salary and now we have it in there we still got the same formula then
from here I'm going to go down and modify the method of calculate bonus passing in self and we want to provide out basically the bonus formatted as a numerical value like all the other ones so I'm going to provide it this f-string self.symbol and now for this I want to have self.bonus boom so let's actually run this I hope it works we haven't run this recently okay and it works running the next line to see yep we have the symbol we have salary so let's run some of these methods now on
it I can run salary. calculate salary and then this is a method so I'm going to use the open and closing parenthesis and we get 110,000 similarly I can do salary. calculate bonus run this and the bonus is zero that's not right let's see what the error is I'm actually calculating the bonus rate again which if we already did that's not what I actually want here I want to have the actual bonus value so I basically want to do the total salary minus the base salary and I'm going to go ahead and remove this right
here running all these lines of code again getting down to the bonus this time when I run it bam $10,000 now technically analyzing this class these names of calculate salary and calculate bonus aren't exactly true because we're actually calculating inside the init I'd probably update this further to call these show salary and also show bonus but that's just minor things but I do think this gives a good demonstration of when you may need to implement something like a class in order to automate certain tasks that maybe are repetitive like in this case with manipulating salary
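Putting the steps above together, the finished class might look roughly like the sketch below. The snake_case names (`BaseSalary`, `base_salary`, `calculate_salary`, and so on) are assumed from how they are spoken in the video; the defaults of a 0.1 bonus rate and a dollar sign, the `:,.0f` format specifier, and the bonus computed as total salary minus base salary all follow the walkthrough.

```python
class BaseSalary:
    """Currency-like wrapper around a base salary (names are assumed)."""

    def __init__(self, base_salary, bonus_rate=0.1, symbol="$"):
        # Attributes are plain variables stored on the instance via self.
        self.base_salary = base_salary
        self.bonus_rate = bonus_rate
        self.symbol = symbol
        # Derived values are computed once here and reused by the methods.
        self.total_salary = base_salary * (1 + bonus_rate)
        self.bonus = self.total_salary - base_salary

    def __repr__(self):
        # ':,.0f' adds a thousands separator and shows zero decimal places.
        return f"{self.symbol}{self.base_salary:,.0f}"

    def calculate_salary(self):
        # Returns the formatted total salary (base plus bonus).
        return f"{self.symbol}{self.total_salary:,.0f}"

    def calculate_bonus(self):
        # Returns the formatted bonus value.
        return f"{self.symbol}{self.bonus:,.0f}"


salary = BaseSalary(100000)
print(salary)                     # $100,000
print(salary.calculate_salary())  # $110,000
print(salary.calculate_bonus())   # $10,000
```

Because the derived values are computed in `__init__`, the two methods only format and return them, which is why names like show salary and show bonus would arguably describe them better.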
like I said in the beginning this section is completely optional I just mainly wanted you to understand the power of classes inside of python to streamline workflows so I do have some practice problems for this but like the video they're optional as well but I do suggest checking them out and trying them out so that way you can learn more and really deepen your knowledge of python with that after this we're going to be diving now more into packages specifically numpy pandas and matplotlib all right with that see you in the next one all right
so congratulations you've covered all the fundamentals of core python we're going to now be shifting our focus into diving into libraries specifically pandas and matplotlib and this one numpy now this is such a powerful library I'd be remiss if I didn't cover it in this course let's actually jump in and I'm going to show an example of using numpy compared to just using python alone and basically how fast and how good it is at doing analytics so we're going to create some fake data in order to test the limits of
this and it's going to be a million salaries inside of a list and we're going to have values ranging from 50,000 to 100,000 for this we're going to import in the random module random has a function called randint and we can see it takes two different values a and b which is basically a range including both endpoints so we'll put inside of here 50,000 and 100,000 running this quickly we can see we get this random value of around 53,000 running it again 81,000 think you get
the point now let's do a list comprehension to basically make a list of all these different random values and for this remember we want to create a million values so we'll say for and we'll just provide an underscore as the loop variable in range of 1 million and then I'll close the brackets for this and then running this we have our list of a million values and it looks like I printed it too all right so we'll assign this to the variable of salary list one note this is all for
demonstration purposes to demonstrate how computationally efficient numpy is you don't necessarily have to follow along so now with this salary list let's say we want to find something of analytical importance such as the mean or average we can use the statistics module in order to figure that out right so I import statistics and then with statistics I use its mean function and I pass into it salary list running this we get the mean okay around 75,000 we don't really care about that we care about how long this is taking because we want
to compare this to whenever we start using numpy so what we can actually use is a magic command using the two percentage signs and then the list will pop up of what we want to use we want to use this one of timeit we want to time this operation right here and this will run this multiple times and then provide us what is the average time that it took to run this whole thing right here actually I'm going to stop real quick it says 482 milliseconds I'm going to stop I don't want to count the
import statistics as part of that that's not fair to count that in it so I'll import it above and then we'll time it again okay much less this time because we didn't have to import the library 335 milliseconds now let's do this with numpy and for numpy we import numpy and we use the alias so as np it is very common practice to use np for numpy and pd for pandas so running this and importing the library I want to also time using numpy to find the mean with the timeit command so
I'll put in here timeit and then we're going to use np dot as you guessed it we're probably going to use mean and then we're going to pass into it in parentheses that list so passing salary list into this and then running it this one takes only 52 milliseconds both of them ran seven runs each but that is about a sixth of the time and you're like Luke this is milliseconds what does that even matter well right here we're only working with a million rows of data and I've worked with even
bigger or larger data sets than this so I ran this again using 10 million so 10 times the amount of data and then we actually see stuff that actually is going to affect us when we use the statistics median function this one takes 5 seconds however whenever we use numpy's median function this takes only half a second I can promise you you get bored after about 3 seconds so with this type of delay and the time that it takes to compute you're going to want to take advantage of and use numpy instead now you're probably like
Luke in that other video on libraries you specifically called out that we're going to be focusing on pandas here and also matplotlib only and didn't really stress that we're going to be covering numpy well I want to cover numpy briefly in this section of the video because pandas if we scroll down and look here underneath the dependencies we can see they have three different packages or libraries supporting it and the number one is numpy and if we actually navigate over to numpy and then scroll on down to this section we can see
that there's a whole ecosystem around it so numpy not only supports libraries like pandas but also statistical libraries like scikit-learn additionally looking underneath this visualizations tab it supports matplotlib and even seaborn which we're going to be covering later on so although we're not always going to be directly importing numpy as np and using it inside of a notebook we're still going to be taking advantage of numpy basically on the back end behind the code with pandas and matplotlib actually accessing numpy and using it for us to compute faster
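The speed comparison from earlier can be reproduced outside a notebook along these lines. This is a minimal sketch that swaps the notebook's `%%timeit` magic for the standard library `timeit` module, and it uses a smaller list than the video's one million values so it finishes quickly; the variable names are assumptions, and the timings will differ from machine to machine.

```python
import random
import statistics
import timeit

import numpy as np

N = 100_000  # the video uses 1,000,000; smaller here so the sketch runs fast
RUNS = 5     # how many times each operation is timed

# Fake salary data: random whole numbers from 50,000 to 100,000 inclusive.
salary_list = [random.randint(50_000, 100_000) for _ in range(N)]

# Total time for RUNS calls of each mean implementation on the same list.
py_time = timeit.timeit(lambda: statistics.mean(salary_list), number=RUNS)
np_time = timeit.timeit(lambda: np.mean(salary_list), number=RUNS)

print(f"statistics.mean: {py_time / RUNS * 1000:.1f} ms per run")
print(f"np.mean:         {np_time / RUNS * 1000:.1f} ms per run")
```

Note that `np.mean` here still has to convert the python list into an array on every call, so converting once with `np.array(salary_list)` beforehand would likely widen the gap further.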
so in the remainder of this video I just want to show some basic operations that you can do with numpy in order for you to understand and appreciate it but in no way is any of this stuff really required knowledge for you to continue on and to be successful in using things like pandas and matplotlib all right so let's jump in we're going to start by first we have imported numpy as np we're going to define one of the core data types inside of numpy to do operations and that's an array to define an
array as you guessed you type np and then you call the function array and inside of here you provide an object we're going to provide a list of values for right now we're going to keep it simple with just a few different numbers running this it will print down below the array let's actually save this to a variable of my array so we can start manipulating it further with my array we can do things like run the mean didn't spell that correctly we're going to change that to my array and my array change it to
the mean so we see the mean's 2.5 so we understand arrays are the basic data type of numpy and we can do a lot of different things with it let's move to actually a more practical example of let's say we're working with data science job postings and we have our data inside of arrays so I have three different arrays available for us first our job titles of four different job titles the base salaries of four different values and the bonus rates of four different rates as you were guessing each one of those the index
of zero is for all those items at index zero and the index at three is all those different values so let's run this and now what's cool about arrays is we can actually do computations between arrays so let's say I wanted to figure out what the total salary is based on these base salaries and bonus rates but I wanted to get the array from that so I can define a variable of total salary and for this I can use those same python operators from before of plus multiplication and subtraction in order to do operations on
these arrays so I define that this variable is equal to the base salaries times 1 plus the bonus rates then I want to see it right below it so I'll print out total salaries also just to maintain consistency I'm using salary I don't know why I'm doing this but anyway we're going to just change this to maintain consistency of salaries all right so now we have this array output that shows our total salaries and if you want to you can go through and check and see that is 90,000 times 0.12 this yeah should be now if
I want to see the mean of these total salaries I can just do numpy mean and then put in that total salaries so you can imagine if we have hundreds or thousands of different values inside of something like numpy arrays we can do our calculations a lot quicker than if they were in something like a standard list inside python now let's say we have some values inside of our data that we have missing data meaning sometimes let's say we collect a job title but we don't necessarily know the salary so I'll add in one more
job in here and we'll call it AI engineer and for this one we don't have a value for the base salaries now we don't want to put something like zero here because let's say we did we'll put zeros for each of these if we run this and then we see the total salaries out we can see it's zero but what's going to happen is it's going to affect our mean it's going to drag our mean down to even lower of 66,000 because we have the zero in there we don't want zero in there now python
has this value called None if I go ahead and run it it well doesn't even print anything because it's None if I print what is the type of None it will print below that it is NoneType so let's try that instead of using zero so I'll put None in here and None in here we'll run this one again we'll run total salaries again and we automatically get our first error it basically says hey it's a type error unsupported operand types for that plus operator operands are what's on each side of
here so when we do one plus None as it says int and NoneType one is the int and NoneType is the bonus rate we get that hey you can't do this calculation so what would be a better way to store that we have no values for this well numpy if we write it out has a nan value and that means not a number and this is specifically for numerical data if I run the type function on this it returns back that even though there's not a value there it interprets this as a float and
as we know from before floats and ints can be added together so we're not going to get an error there so I can come up here and I'm going to replace that None with np.nan and additionally do it here with np.nan running this and now total salaries again we don't get any issues and our final value is nan now for our mean we're going to go ahead and calculate this but we're going to get back a value of nan meaning this function right here isn't meant to handle these nan values so there's actually
a special function to handle those values we'll add nan in the front of it it's called nanmean you can see there's a whole bunch of other functions that do things when there's nan values present anyway going ahead and running this now we get back that value that we saw before of 83,200 now there are a ton of other features inside of numpy that we could go through I could make a whole course on it if you will but we're going to stop there because I really
want to shift focus now to pandas and matplotlib however if you are interested in learning more about numpy inside of our source code for this course Kelly put together this awesome resource on all the different operations and math that you can do with numpy and it's a wealth of knowledge additionally we have some course problems for you to go through and test out different methods and functions inside of numpy to get yourself more familiar with it all right with that after this we're going to be jumping next into pandas one of my
favorite libraries in python with that see you in the next one if I only had the option to use one python library for the rest of my life it'd hands down be pandas by far it's the industry standard in handling tabular data so anything that Excel or CSVs can handle pandas can definitely handle it so this is the GitHub repo for pandas and if we look at the stars for this it's at 42,000 this alone puts it as a very top library now as far as how to use this as far as understanding pandas I don't
really recommend anything else for documentation I go to what they have here at pandas.pydata.org and if you go to the URL on the screen inside of here you're going to have some different documentation I recommend just going to the API reference keep this window open we're going to be referencing it throughout this video and many more so inside our jupyter notebook we need to import in pandas now if you recall pandas is already installed in this environment here here I'm running pip list and it's going to list out the different libraries available and right here we
have pandas now whenever we run this locally using Anaconda as our environment we're going to have to install pandas but that's for future us to worry about so since it's installed all I'm going to do is put import pandas as pd and as a reminder pd is just the industry standard for making a shorthand notation for pandas so we don't have to write it out fully now that we imported it in let's actually load some data so back in the documentation underneath API reference I can go to input output and that means how we're going to get
data in and also how we can get it out of here. Now, there are a couple of popular ways to read in data that you need to be familiar with. Specifically, we have this one right here, pandas.read_csv, which is a function of pandas. Another popular one I find myself using is the read_excel function, and as you'd expect, you can read in an Excel file. We're using these functions in order to read in this data, and then we're going to place it into a data type called a DataFrame, which we're
going to go over in a bit. So let's test this read_csv function. I can type pd.read_csv and then an opening parenthesis, and I can see from this that it takes a file path. Now remember, we have inside of our folders over here in the left-hand pane, underneath sample_data, this california_housing_test.csv. I'm going to click these three dots right next to it and select Copy path. Now, a note for anytime you're pasting a path: let's just say I go ahead and paste this path into here and I run it. I'm going
to get invalid syntax right here, and that's because it expects this file path to be a string, and right now we're basically giving it some sort of weird variable name. So I'm going to go ahead and wrap this in single quotes, or double quotes if you want. Running this again, we now have all of our data inside of a DataFrame. So let's get into a brief intro to DataFrames themselves. Whenever we create this object right here, this DataFrame, we need to actually assign it to a variable, and so it's common for
if we're just doing one DataFrame, I'm just going to name it df. I'm going to set that as the variable name. Running this again, we have df available. If I type in type(df), I get back some pretty in-depth documentation on this, but the main thing is this: it's two-dimensional, size-mutable, potentially heterogeneous tabular data (that's really a lot of nerd talk). The data structure also contains labeled axes, i.e., rows and columns. Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. The primary
pandas data structure. So, as it says, it's a dictionary-like container. Let's go ahead and run df again to actually see what's going on here. Inside of this we can see we have the columns up here on the top, along with the index right here on the left-hand side, and I can tell it's the index also because there's no label on top of it. And then, just like an Excel spreadsheet or Google Sheets or a CSV, this data is laid out in rows and columns. So let's start with these columns
first. Because it's a dict-like object, we can use a similar notation in order to access columns of it. So let's access the total_rooms one in here. I can type df, and then, similar to a key of a dictionary, I'm going to do those square brackets. Inside of here, notice that because the dataset is loaded, it automatically provides hints on what we might want to put inside. So I selected total_bedrooms. Running this again with Shift+Enter, we can see we get back all the different values for that column.
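The loading and bracket-access steps above can be sketched as follows. This is a minimal illustration, not the notebook from the video: the CSV text is a made-up, in-memory stand-in for the Colab sample file, with only three of its columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for sample_data/california_housing_test.csv;
# the rows below are invented for illustration only.
csv_text = io.StringIO(
    "total_rooms,total_bedrooms,population\n"
    "3885,661,1537\n"
    "1510,310,809\n"
    "3589,507,1484\n"
)

# read_csv accepts a path given as a string (note the quotes in the video)
# or, as here, a file-like object.
df = pd.read_csv(csv_text)

# Bracket notation, like a dictionary key, returns one column as a Series.
bedrooms = df["total_bedrooms"]

print(type(df))
print(bedrooms)
```

In a real notebook you would pass the copied path as a quoted string instead of the in-memory text.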
This is not the only way we can access this column data, and I'm going to change it up every now and then and do something different. Specifically, we can use dot notation in a similar way, calling out the column name itself, and in this case it's going to provide us exactly the same results. I'm lazy, so I really prefer this method, and I'm going to do this when possible. So we talked about accessing the columns, but what if we want to access by the index? Let's say I want to access the
first item in total_bedrooms. Similar to a list index, I can provide the square brackets and then the number of the index I want to access. Running Shift+Enter, I get 661. I can also do the same type of notation with the form where we write out the column name as a string; providing that last row of data, 299, it returns 263. So you're probably like, look, that's cool, but let's actually access the data we're going to use for this course. So let's do it. This data is open source to everybody, and
I've made it available through Hugging Face. Hugging Face is a popular open-source repository, and it's specifically designed for the NLP community, or natural language processing community. Anyway, they provide a great platform to host datasets. As I click here, I can see there's a wealth of different datasets in here, and this is where I hosted mine for free for you to access. So how do we actually get this dataset into your notebook? Well, if I go into Files and versions and actually go to the CSV itself, data_jobs.csv, you could download the
CSV and then use that same read_csv function that I used before to bring it into your environment, but that's a little bit too much work. I've got an even simpler solution. Digging into the Hugging Face documentation on how to actually load a dataset from the Hub (that's where the dataset is located), I can see that they have this load_dataset function, which allows me to provide basically the name of our dataset and then load it into our environment after we've imported datasets. Now, one note: we will have to
install datasets into our environment, as it's not available in Google Colab, and it also won't be available in Anaconda whenever we get to running this locally. So the first thing I'm going to do is run pip install datasets. Remember, we're running a terminal command right here, so I'm going to start with the exclamation point and call out !pip install datasets. Running Shift+Enter, it goes through and has now installed all those different things. So now I'm going to import it using from datasets; we want to
import the load_dataset function. Okay, we have it inside of our environment, so now we're going to use load_dataset, and for this we need to provide the path, or the string name. If we go back to my dataset itself, we can see that the dataset name is lukebarousse/data_jobs, and I can actually come up here and just copy this right here and then paste it in. Now once again, look at this yellow highlight that's going on. Common mistake: people just go ahead and run this automatically;
we're going to get an error, because this needs to be a string value. So I'm going to go ahead and put single quotes around this and run it again with Shift+Enter. Now it's downloading a bunch of different things; it downloaded the README and then it downloaded the data itself. I can see down here, from inspecting the object it's providing back, that inside of this DatasetDict, which is the data type they're using to provide you with all this data, there's a key and a value, where at the
key of train we need to access that train portion to actually get our dataset. So for this I'm going to put this into a variable instead, and I'm going to call it dataset and set it equal to this. Running Shift+Enter, it's already loaded in, so nothing really happens. From there I'm going to do dataset, and remember, it's a dictionary-like object, so I'm going to use those square brackets and put 'train' inside of them. All right, and it outputs this, where we have the dataset itself now inside of
here. But if I run a type check on this to see what this actually is, it comes back that it's a Dataset. We want it as a DataFrame, so I'm going to remove that type call right there, and we're going to run a method that Dataset actually has. I typed in to and then an underscore to see what we could do with it, and we can transform it into a multitude of different things. But you know what we're going to transform it into: we're going to
transform it using to_pandas, and then from there we want to put opening and closing parentheses. Run it, and now we have it imported as a DataFrame; we can tell just by the way it looks. Once again, I'm going to set it to that variable df, then run this again with Shift+Enter, printing it out right below. So we're going to stop here in order for you to load in that datasets library using pip install and then import the data into a DataFrame like we've done here. In
the next section we're going to be doing a deeper dive on how to actually inspect data, running things like this info method right here, where when I run it I get all the different column data back: how many non-null values are in each column and what data type it is. We're going to be doing this and a whole host of other things. Also, you do have a few practice problems to play around with this dataset and access its different attributes. All right, with that, I'll see
you in the next one. So for the next three videos we're going to be diving into a lot of the basics behind pandas. Specifically, we're going to be looking at inspecting, cleaning, and then analyzing data, in that order, over the next three sessions. This is the fundamental basis that you need to know, and in this video and the other ones we're going to be covering a lot of different methods and functions pretty rapid-fire. So with that, let's get into it. I'm starting with a new notebook; you don't necessarily have to, but
as a minimum you have to have imported pandas as pd, loaded in that datasets library (specifically the load_dataset function), then loaded the dataset that we're going to be using for most of this course and also for the project, and finally put it into a DataFrame. So let's get into methods of actually viewing our data. You saw previously I can type df and run it, and it prints it all out here: it prints the first five rows and also the last five rows. It also tells me
how many rows and columns are available. I typically don't like this; I prefer to just specify that I want to see the top rows, so I'm going to run this head method on here. I have to use the opening and closing parentheses for this. Running it, I get the first five rows. I can also provide an int as an argument to this, and it returns the first n rows. So in this case I want to return the first 10 rows; run it, and bam, I get the first 10 rows. So head gets me the first
10; what would get me the last 10? Well, tail. So I can type in tail here, and I'm actually going to just remove this 10, and by default it provides five rows. Once again, I can provide an int value inside of here and it gives that many back to me. Now, for both this head and also tail method, I typically use them in conjunction with specifying the column name. Right now we have a ton of column names, and maybe I want to just identify some that I actually want to use instead. So if you
recall, to access the column name it's similar to a dictionary. I'm going to specify the brackets and then from there specify that we're going to look at job_title_short, and with this I can run it with this head method to provide the top five. As shown, this provides the top five values. I can once again put 10 inside of there to get the top 10 values. Anyway, removing this head, what happens now if I want to look at multiple different columns? Well, in this case I need to provide the DataFrame a list of column
titles. What do I mean by this? I'm going to put brackets around this object right here, the column job_title_short, and I'm going to run it again. It runs just fine, and actually with this one I would say it's formatted a little bit nicer; I like this formatting. Anyway, what I'm doing in here is providing a list of column titles, so now I can use a comma to specify another column name. In this case we're going to do something like job_location. Running Shift+Enter,
we now have multiple different columns. Now, one of the biggest mistakes that I see with beginners is they forget to include those brackets to make this into a list, and so if I were to run this without them, we're going to get a KeyError. It's going to read this incorrectly as a tuple, and we need to be providing it as a list. So I'll come back up here and put those square brackets in. Rerunning it, I get the multiple columns back, and once again I can run that head
or tail method on it and get those appropriate values back. But now let's say I want to access maybe the 100th row of this data. If I were to put 100 in here, I could do that and then run this, but I'm going to have missing values in here. So let's just say I wanted to view rows 90 to 100; I can't necessarily do this with this method, it doesn't provide for that. Well, for this we can use a property of DataFrame, which is somewhat similar to a method. Anyway, it is iloc, and
that stands for integer location, and we can use integers to call out things like the index number of a row and also that of a column, which we'll get to. So let's show this without selecting any columns. I can say for the df I want to use that iloc property, and then from there I'm going to specify I want row 90, so I'm just going to put in 90. Running Shift+Enter, it provides this back with not only the column names but also the data itself associated with that row. Now, if I wanted
multiple values, 90 to whatever, iloc supports slicing, so we can do what we did before with something like a list: I specify that I want 90:100. Running Shift+Enter, I get a DataFrame back in a much more readable manner, I feel, inside of this tabular format. Now, with iloc you can also provide a second argument, using a comma to separate it, and once again you can provide the integer location, this time of the column. So let's go with 0 for job_title_short. Running Shift+Enter, I get back the job_title_short
of all these different rows. Once again, I'm not really liking this formatting, so let's include more than one column. I'm going to specify 0:1. Sorry, that's not going to select anything more, right? We actually have to do 0:2 if we want to get the next one over, because it's exclusive of that last number, and now I get the job_title_short and the job_title columns. One quick note about this dataset: the job_title_short column is basically a much shorter value for the job titles; the job_title column
is what the actual job description listed. So Data Management Analyst has been categorized as a Data Analyst, and it makes it a lot easier with this job_title_short column whenever we want to dive into specifically looking at data analysts, data scientists, and data engineers; we don't have to necessarily filter through this job_title column. Anyway, I digress. The main thing to remember here is that I don't really care for accessing the columns through integers, because you've got to count over and I think it's way too confusing. So instead I would go about using
iloc to get those 90 to 100 rows in this case, and then from there specifying the two columns, or multiple columns, that I want within this portion of the DataFrame to then call back those objects. So let's shift gears real quick. Now that we understand how to actually access different columns and rows of the dataset, I want to go forward and look at how we can investigate the data types, the columns, and a lot of info and key characteristics of it. Specifically, we're going to be covering these different functions, methods, and
properties that are available. The first method is info, so I can run df.info() on this and I get a lot of key characteristics back. I get the class of it, the number of rows included, and the total number of columns, which it then further breaks down, providing the column index, the name, and how many non-null values it has. So in something like our salary data right here, we can see that there's a lot less data, only about 3,000 values compared to almost 780,000 job postings. And then
finally the data type. On here there are a lot of different objects, but we can see Boolean values for things like job_work_from_home and job_no_degree_mention, along with float values (so decimal values) for the yearly salary and also the hourly salary. Now, the rest of these are classified as objects, but technically if we inspect them they're strings. Don't worry too much about that; pandas classifies them as object rather than string just to make it a little bit easier on the processing of the data. Now, this info method to me is one of
the most powerful, and you'll frequently see these other properties, functions, and methods getting used. But as you can see from here, I have to run one, two, three, four, five different ones to get the same information that I got out of just the single one with info. So if you have to memorize one, I recommend just memorizing info. Next, we can use the describe method, and with that it's going to compute statistics for only the numerical columns. So it provides key statistics about the yearly and hourly salary, giving the count, mean, min, max, and
all these different percentiles. So this is great if we want a quick snapshot of how the numerical data is trending. The last thing I want to look at in this dataset is this job_title_short column. Remember, I told you that job_title_short basically buckets all the job titles to categorize them more cleanly and provide simple names associated with each job title. Well, let's say I want to get all the unique values inside of this job_title_short column. I can start by specifying the DataFrame
and then the job_title_short column (once again, I can use either the string value inside the brackets or dot notation), and then from there I need to specify the unique method. With this I get an array back, which, luckily, we covered with NumPy before this, so we know what an array is. Anyway, it has inside of it all the different job titles available, and it includes eight different job titles, and we're going to be diving into all of these different job titles as we go through the project.
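The inspection calls covered in this section (head, tail, iloc, info, describe, unique) can be sketched together on a small DataFrame. This is only an illustration under stated assumptions: the six rows are invented, and the column names job_title_short, job_location, and salary_year_avg are assumed to match the course dataset.

```python
import pandas as pd

# Tiny made-up stand-in for the job postings DataFrame.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Analyst",
                        "Data Scientist", "Data Analyst", "Data Engineer"],
    "job_location": ["Austin, TX", "Berlin, Germany", "Anywhere",
                     "Paris, France", "Anywhere", "Toronto, Canada"],
    "salary_year_avg": [90000.0, None, 120000.0, None, None, 135000.0],
})

first_three = df.head(3)                                 # first n rows
last_two = df[["job_title_short", "job_location"]].tail(2)  # list of columns, last n rows

rows_1_to_3 = df.iloc[1:4]          # integer-location slice; the end is exclusive
first_two_cols = df.iloc[1:4, 0:2]  # second argument selects columns by position

df.info()                 # prints row count, columns, non-null counts, dtypes
stats = df.describe()     # summary statistics for the numeric columns only
titles = df["job_title_short"].unique()  # distinct values, in order of appearance

print(stats)
print(titles)
```

Selecting the columns by name first and then slicing rows with iloc, as the video suggests, avoids having to count column positions.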
I want to inspect our DataFrame specifically for certain job titles that we've identified here. Specifically, I want to dive into maybe the data analyst jobs. I can see that they're at random locations, and I don't want to necessarily have to call out, like, hey, only pull up index 0, 2, and 4. Well, we can use a comparison operator to actually filter for this. Let me show you what I mean. I'm going to say df and then job_title_short, and for this column specifically I want the values that are equal to Data Analyst. Now, if
I go ahead and run this, we're going to get an error, and that's because we're using invalid syntax. Remember, anytime we're using any of these values in here, it's a string value, so I need to put it inside of some sort of quotes; I'm going to use single quotes here. So whenever I run this, if you remember back to our DataFrame, like I said, we had the data analyst located in the index of rows 0, 2, and 4, and it returns True for these values. So, being that it went through and actually
classified which index values have an associated value of True or False, we can actually use this to filter a DataFrame. I'm going to put in here df and then open up a square bracket and put a closing bracket around it. So this is the condition right here that we're evaluating, and we're putting it inside of df to filter for this condition. Running Shift+Enter, I now get back, as expected, all the different data analyst job postings, and it even tells me at the bottom there are 196,000 of them.
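The comparison-then-filter pattern just described can be sketched like this. The five rows are invented so that the analyst postings sit at index 0, 2, and 4, as in the walkthrough; the column names are assumed to match the course dataset.

```python
import pandas as pd

# Made-up rows; "Data Analyst" appears at index 0, 2, and 4.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Analyst",
                        "Data Scientist", "Data Analyst"],
    "salary_year_avg": [90000.0, None, 120000.0, None, None],
})

# Comparing a column to a quoted string gives a Boolean Series, one value per row.
is_analyst = df["job_title_short"] == "Data Analyst"

# Passing that Boolean Series inside df[...] keeps only the rows marked True.
analysts = df[is_analyst]

print(is_analyst)
print(analysts)
```

The mask keeps the original index labels, which makes it easy to confirm which rows survived the filter.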
Now I want to continue on with this analysis, and I don't want to only look at data analysts; I also want to look at where the salary for these data analysts is greater than 100,000. Right now there are a bunch of NaN values, or not-a-number values, and I want to actually see those that are greater than 100,000. To do this we need to use an operator that allows us to provide two different conditions, and remember, we want both of these to be true. For right now we're going to just throw
in and, and then from there we're going to provide the df and then salary_year_avg greater than 100,000. Now, if we were to run it like this, we're going to run into a couple of issues. The first is this ValueError: the truth value of a Series is ambiguous. So it's having problems resolving this. The first thing we're going to fix is that we actually need to wrap these in parentheses, and that's going to separate them to make sure that we have True evaluated on this side and True evaluated on that side as
well. Running Shift+Enter, we're still getting that the truth value of a Series is ambiguous, and that's because of this operator right here: and is the incorrect operator. We need to use instead the bitwise and operator, which is an ampersand. Running Shift+Enter, we now have our results back, and we can see that it's not only the data analyst postings, but also, going over here to salary_year_avg, all these values are greater than 100,000. Now, if you don't remember from the bitwise lecture, we also specified that to
use or we use this pipe symbol, and that stands for or. In this case I can run it here, and although the data doesn't show it, or maybe it does show it, we have data analyst postings but not necessarily all with salary data; they met the or condition. Now, what happens if I just want to have all numerical values returned for this salary_year_avg? I'm going to turn this back to an ampersand right here. Well, for this we can access this salary_year_avg column and run
the notna method on it. Before running it in here, I'm just going to copy it and paste it down here to show you what it returns. Basically, if you remember from our DataFrame, most of the values did not have, or rather, most of the values did have NA in them, and so it reports notna as False in those cases. We want values with salaries, so we want the True values of this. Coming back up here and actually running this now, we can see once
again we have all the data analyst postings, but now we have all the different salary_year_avg values, including those as low as 65,000. All right, let's take a breather. That was a ton of different functions, methods, and properties, all associated with DataFrame, but these are key, core ones that I use all the time in pandas, so it's worth your time to invest in learning them and committing all of these to memory. I promise you it will save you some headaches. All right, it's time for you to get some practice with actually implementing
these different methods and properties that you've learned here in the practice problems, and after that I'll see you in the next one, where we're going to dive into actually cleaning up this dataset. See you then. Now that we've inspected our data, we need to move into cleaning. If I look at the time that I spend as a data analyst, the majority of it is spent on this alone, so needless to say this is an important section. Anyway, let's jump into cleaning up some data. Similar to last time, we're going to start with importing the necessary
libraries and then loading our dataset into a DataFrame. Now, there are a few different columns we need to clean up, but we're just going to focus on one for right now. Specifically, running this info method and scrolling down, we can see all the different columns and all the different data types, but right now job_posted_date is an object. If I want to look at the first value of job_posted_date, I'm just going to use the square brackets and access it by its index location. Whenever I run this, I can see that it's
basically a string because of these single quotes around it, and I can confirm this by running type on it to see that it's a str. But remember, Python has the ability to use datetime objects, and we can do a whole host of things with them that I'm going to show in a little bit. So we want to convert this to a datetime. Well, pandas to the rescue with its to_datetime function: this function converts a scalar, array-like, Series, or DataFrame/dict-like to a pandas datetime
object. That DataFrame/dict-like is basically what we're going to be using; we're going to be specifying the column. So I can specify pd.to_datetime, and then we need to provide the argument, and for this we're going to provide the job_posted_date column. Running this, we get back all these different datetimes, and we can see that it converted to a datetime by this dtype. Now, it converted basically only in this cell; if we inspect the actual DataFrame itself by running the info method on it, we
can see that job_posted_date is still an object. So what I need to do is reassign that job_posted_date column to itself once it's converted to a datetime. I put that at the beginning of this, putting it inside of square brackets and calling out job_posted_date, and then from there I set it equal to this function call right here. You'll notice I used bracket notation here instead of dot notation, and that's something I commonly do anytime I'm assigning a value to a column; otherwise we're going to run into problems
that I'm going to show in a little bit. Anyway, I'm going to run this and reassign it to that job_posted_date. Now when I run this info method, I can see that job_posted_date is in fact a datetime. With this job_posted_date column, we now have access to a new property, specifically this dt, which stands for datetime; it's a datetime property, and it allows me to go even further with other properties that are available, in order to convert it maybe from a datetime
now to a date, or even pull out things like the month and the year, so there's a whole host of things we can get out of it. For our future analysis it's pretty common to use something like month in order to bucket the data and then compare it month to month, so I'm going to use month in this case. Running Shift+Enter (this is a property and not a method, so we don't have to use those opening and closing parentheses), we can see it ran, and it reports back that it's an int and it has
all those different values. Now, this is cool and all, but running that df.info() method again, we can see that the month value is not saved inside of here; we only have that job_posted_date. We want to get this value inside of our DataFrame and associated with each one of those rows, so we're going to do something similar to what we did up here, but we need to create a new column. Now, for demonstration purposes only, I'm going to run this using dot notation, and I'm going to say, hey, we want a new column
of job_posted_month, and then set it equal to this value. If you recall, I said this is a bad idea, and I'm going to show you why. Anyway, running this, I now get back this UserWarning: pandas doesn't allow columns to be created via a new attribute name. Basically, it doesn't let you use dot notation here. That's why I brought up earlier that I want to use bracket notation in order to specify what the new column is. So now, running this again with Shift+Enter and running the info method again, we now have
that job_posted_month, and like we saw before it's an int, and it's now assigned to this DataFrame. I can even go in and inspect it by running df and scrolling all the way over; we can see that each of these rows has its month value. I like to double-check it, so I'll just make sure, and yeah, it's all checking out. So now that I have this job_posted_date converted to a datetime, I actually want to do something useful with it. Specifically, I want to organize my
data in a logical manner, basically from the first job posting at the beginning of the year all the way to the last job posting at the end of the year. For this we're going to use the DataFrame method sort_values. It only requires one argument, by, where you specify the column name that you want to sort the values by. You can also do things like select whether you want ascending or descending order (the default is ascending, set to True), and another one we're going to
mess with here shortly is inplace, but we'll get to that in a second. So I've gone ahead and specified the DataFrame along with that sort_values method, and now we need to specify job_posted_date inside of it. We can see right now that the values aren't sorted; running this, they are in fact sorted in ascending order. Now, one thing to note with this: we have sort_values, we've run it, it's right here, and once again we can see that it's in the order that we want. But if I were to
come down here now and run df and then inspect job_posted_date, we can see that job_posted_date is not in fact sorted. Now, we can do one of two things whenever I sort values like this. I could just reassign this variable df to this new sorted df, but I actually have one better than this, and that comes with using inplace. If we look at the documentation for this, we can see that with inplace we specify whether it's True or False. If you remember from the
documentation, inplace defaults to False, so I need to specify this. Because it's a keyword argument, I need to write inplace and set it equal to True. Now whenever I run this, nothing's going to print out below, but when I run df itself and inspect job_posted_date, all of it is in order, as we'd expect. Also, one little note: it's not required, but remember that first argument in here is by. Typically you're going to see by= actually still put in there for
that key that we want to specify, and it's still going to work, so I recommend you just do it this way; the values are still going to be sorted. Now let's get into our final cleanup for this section. I want to analyze only the yearly salary data. If you remember, we have yearly salary data and also hourly salary data. So, one, I want to remove this salary_hour_avg column, and two, I want to get rid of any rows that don't have values in it, or basically values that are NaN.
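The datetime conversion and sorting steps covered so far, plus the two cleanup steps just outlined, could be sketched as follows. Treat this as an illustration only: the three rows are invented, and the column names job_posted_date, salary_year_avg, and salary_hour_avg are assumed to match the course dataset.

```python
import pandas as pd

# Made-up rows standing in for the job postings data.
df = pd.DataFrame({
    "job_posted_date": ["2023-06-16 13:44:15",
                        "2023-01-14 13:18:07",
                        "2023-10-10 13:14:55"],
    "salary_year_avg": [None, 110000.0, 95000.0],
    "salary_hour_avg": [45.0, None, None],
})

# Convert the string column to datetimes and assign it back with bracket notation.
df["job_posted_date"] = pd.to_datetime(df["job_posted_date"])

# New columns must be created with brackets, not dot notation.
df["job_posted_month"] = df["job_posted_date"].dt.month

# Sort oldest to newest; inplace=True modifies df instead of returning a copy.
df.sort_values(by="job_posted_date", inplace=True)

# Step one: drop the hourly salary column (axis=1 means columns).
df.drop(labels="salary_hour_avg", axis=1, inplace=True)

# Step two: drop rows where the yearly salary is missing (subset takes a list).
df.dropna(subset=["salary_year_avg"], inplace=True)

print(df)
```

Reassigning the result (for example `df = df.dropna(subset=[...])`) is an equivalent alternative to `inplace=True`.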
So let's start by dropping this salary_hour_avg column. For this we're going to use the DataFrame drop method, and in it we need to specify the labels, or basically the columns. If we come down here to the parameters and look at axis, we can see that with 0 we're specifying by the index. We want to specify by the column name, so we're actually going to use 1 here to specify by columns. So on our DataFrame variable I'm going to run drop, and the first thing
I need to specify is the labels, and for this I'm going to specify the column name salary_hour_avg. Now, the other argument we need to specify, remember, is axis; we need to set that equal to 1. Running this, we can see, scrolling over, that it no longer has that column. But once again, if I run that df.info() method, we can see that salary_hour_avg is still within this DataFrame. So, similar to what we did before with using inplace equal to True for sort_values,
we can do the same thing here. Running Shift+Enter, it executed the cell, and then running the info method on the DataFrame, we can see that we no longer have that hourly average data. So now we need to get our DataFrame down to those 20,000 rows that have salary data, so we need to remove all those null values, or NaN values, and conveniently DataFrame has this method dropna. Now, there are a couple of things we need to pay attention to with this, especially with the parameters. The first is this axis:
when we were dropping the hourly data, that was a column we were trying to drop. In this case we're trying to drop complete rows that have NA values for a specific column. So in our case we want to leave the default value of 0 set, because it drops rows which contain missing values. But which rows do we want to target? Well, that comes into play with subset, and that's the column label or sequence of labels; in our case, column labels. The docs specify that if you are dropping rows, these would
be a list of columns to include, so it needs to be a list. Back inside, I'm going to call this dropna method, and the only argument we need to provide for right now is subset, and we need to set that equal to, like we said, a list, and in it we'll specify salary_year_avg. Running Shift+Enter, we can see that all those different NA values were removed. Now, once again, if I run this df.info() method, we can see that for salary_year_avg it still has, I mean, this
says the non noov vals but specifically when we look at all the different entries in it we still have over 780,000 000 so it didn't filter it down to this basically 22,000 that we needed to be so we guessed it we probably need to use something like in place so copying and paste from up here and placing it as a second argument right here I'm going to go ahead and run this and with that ran I'm now going to run now one quick note on this this cleanup that we did right here to only get
down to 22,000 values is not necessarily the data set that we're going to be only focusing on for the remainder of the course we're still going to be analyzing all those 700,000 different jobs but I just wanted to go through this example to show how you would might filter down a data frame to get something of where you want to actually analyze all right now it's your turn to get into in diving and clean the data set I have some practice problems to go through and practice a lot of these different methods that we went
through just now in order to better understand the data set and to help clean it up with that I'll see you in the next one so we've gone through and inspected our data cleaned it and now we're into the fun part of actually analyzing it and in this basic chapter we're going to focus on two main features of pandas for analyzing data that I use all the time and that's aggregating our data to find things like min max count and those type of things and then also even in a large picture actually grouping our data
so going down into things like analyzing job countries or job titles things like that so jumping back into the notebook we have our standard import statements loading the data set and then also the data cleanup to make sure that we get that job posted date formatted correctly now because we have this date formatted correctly whenever I run the describe method on this we actually get updated info before we only had the yearly and hourly salary whenever we were using this method but now we even have the job posted date since we
converted it to a datetime and it has a whole host of information in it anytime I'm starting any analysis this is the method I'm going to and using first along with my other favorite method of info now there's a host of different methods that you can run on the data frame and even on columns themselves in order to better understand maybe some of the key statistics about a column now we're not going to run through all these but we are going to do a few so just to show a
point this info method shows the count here of each one of those columns but I can also run the df.count method here and we're getting similar results now if I move into more of a numerical analysis like say I want to get the median value well we can only do this on numerical data as you can see here we get a TypeError could not convert string to float Data Analyst basically it ran into an error on that first column of our data set of job title short so for some of these methods I actually need to
go in and put in the column name that I really want to dive into deeper for the median which would be only numerical columns something like salary year average and here we get the returned results for the median with no longer an error now whenever we ran this describe method before it provided a lot of these key statistics here but specifically with this min and max sometimes whenever this is provided I want to go in further and investigate what's going on here because we may have very strange values
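
The column-level statistics described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up data frame, not the course data set; the column names job_title_short and salary_year_avg are assumed to mirror the spoken "job title short" and "salary year average" columns.

```python
import pandas as pd

# Tiny made-up frame; the column names are assumptions that mirror the course data set.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Scientist", "Data Analyst"],
    "salary_year_avg": [90000.0, 15000.0, 140000.0, None],
})

# count() reports the non-null entries per column, similar to the counts shown by info()
non_null_counts = df.count()

# Calling median() on the whole frame can raise a TypeError in recent pandas versions
# because of the text column, so the numeric column is selected first.
median_salary = df["salary_year_avg"].median()

# min() and max() on the same column are a quick check for strange values
min_salary_value = df["salary_year_avg"].min()
max_salary_value = df["salary_year_avg"].max()

print(non_null_counts)
print(median_salary, min_salary_value, max_salary_value)
```

Selecting the column before calling the statistic keeps the result to a single number, which is easier to sanity-check than the full describe output.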
in this case they have a minimum salary of around $15,000 which is extremely low so let's investigate it so first to confirm I'm just going to run the min method on this and confirm yep 15,000 now I'm going to run this other method of idxmin and that's going to provide the index value of this minimum salary this was one of my frustrations when I was moving from Excel to python as well I couldn't easily find something like I was doing with control F but these kinds of methods allow you to do that so I'm going to
set this equal to a variable of min salary so now to access that I'm going to use that iloc property on the data frame and specify in square brackets the min salary which is the index of it so it appears that this is a data engineering role in Brazil and it is in fact full-time for that $15,000 so investigating further with ChatGPT because I'm not from Brazil it provides back that this is actually significantly higher than the typical income there so all right this isn't an anomaly then now if you remember before whenever we were investigating that
job title short column which has like data analyst data scientist and data engineer in it we used the unique method on it in order to basically provide back an array of all the different job titles now this is cool but it actually doesn't provide much value I want to find out what are the counts of each of these unique values so instead I find myself using the value_counts method on this and it not only provides the numerical counts it also organizes them in descending order so we can see things like data
analysts data engineers and data scientists take up the majority of all the roles and are the three highest then from there business analysts and software engineers followed by the senior roles of data analysts data scientists and engineers then finally we close this out with machine learning engineers and cloud engineers so we've been using these statistical methods but we've really been using them on one-off calculations basically like hey I want to find out what are the unique value counts of job title short or I want to find out what is the minimum salary but what
happens if we want to dive deeper now into understanding multiple different aspects like what is the minimum salary of all of these different job title shorts well as you guessed it there's a method for this specifically it's the groupby method of data frame now the key parameter we need to focus on for this is just by and it's used to determine the groups for the groupby also if you haven't been diving into the documentation yet I think you should as it provides a lot of different examples on how to run these different methods
and functions that I'm going over and so it provides a wealth of knowledge all right so let's run this method I'm going to do df.groupby and inside the parentheses we want to list the column that we're going to be grouping by in our case we want to group by that job title short column now I'm going to go ahead and just run this and it's going to provide this object back which isn't really useful now remember we're trying to find what is the minimum value from that salary year average column and right now
we're doing groupby on job title short but then it's going to try to aggregate all the columns so we want to now specify what columns we want to do this aggregation method of min on and we do that by specifying it in square brackets now running this we still don't have what we need it's still providing back this groupby generic SeriesGroupBy object and so now that we've grouped these we can perform a lot of these different methods that I talked about previously using things like min max count and I
can apply this to the end of it so let's watch it let's actually add in min in this case and I'm being silly because I put job title short in there and you let me get away with it and it's supposed to be the salary column that's why we have these values that are repetitive in there anyway let's try this again and this is actually supposed to be salary year average so let's run this again all right now we're getting values that we want we not only see that lowest role that we found earlier
of the data engineer at 15,000 but also a whole host of other ones now I'll be honest I don't really care too much about the min I'm going to actually change this to something like the median that's a better representation of the data itself now we got back the median values now let's talk about manipulating each one of these even further so we learned that there's definitely fluctuations in salaries between different countries so we could also do a groupby not only grouping by the job title short but also the country now if I
go in here and use a comma and add job country and then try to run this I'm going to get an error a ValueError no axis named job_country for object type DataFrame basically we have to provide this as a list if we're going to provide multiple values anytime you provide multiple column values you should probably be putting them into a list so now running this we're getting a whole heck of a lot of different values previously we were only getting around eight values now since we're using all these different countries that I've aggregated for this
we're getting around 1300 values for this I don't want to dive into all that just yet so actually we're going to just go back to the original one so we figured out how we can do groupbys on multiple different columns that we want to do now let's say we want to actually perform aggregations on multiple columns specifically we have the yearly salary data what happens if we also want to look at the hourly salary data so anytime that I want to put multiple column values inside of here I have to use a list and
then I specify salary hour average close the bracket boom and now we have both of these values back providing that yearly median value and the hourly now one last thing to cover with this groupby let's say now yeah I can get the minimum values of job title short for the yearly salary but what if I also want the max value so I want to do multiple different operations or aggregation methods on it well there's a method for that it's called agg and inside of here as you guessed you need to provide a list
of arguments so I can provide in there the string values of the aggregation methods I want to do for this so in this case min and max I can also just for funsies add in median and now we have all those different values the minimum the maximum the median for that aggregation of the groupby on job title short so of all these concepts we've covered groupby can be one of the hardest ones to wrap your head around so dive in and test out a lot of the different things that I've
done here I also have some practice problems available for you to go through and actually change it up a little bit and do different columns as well with that I'll see you in the next one we'll be jumping into an exercise and combining all these different things of cleaning inspecting and analyzing data into one exercise with that see you in the next one so as we've seen groupby is a very powerful feature for aggregating and analyzing different groups of data so we're going to be going through and doing a little bit deeper dive analyzing the US job
market feel free to adapt it to whatever country you're in specifically looking at some key statistics around these different job titles available within the country now this exercise is designed for those that maybe aren't as comfortable with using groupby or aren't as familiar with it and want some more practice if you've used it before and you're very confident with it I would recommend going ahead and just skipping this video and moving on into the matplotlib intro as there's not going to be any new concepts in here we're just going to really drill down into understanding
how to use groupby so inside my Jupyter notebook I've gone ahead and imported the libraries loaded the data set and also cleaned it up by cleaning up that date column converting it to a datetime one note if you're running this after the runtime maybe disconnects and you need to reconnect you're going to have to do that pip install datasets again just go ahead and uncomment that and put that in so first let's look at what countries are available by looking at that job country column now this isn't really useful so let's actually
use an aggregation method to actually see something and I'm going to run value_counts on this to get the value counts of those different countries now there's 160 total it's not going to output all of them here I just really care about seeing the top 20 so I'm going to go ahead and put head of 20 here now if you're still not seeing your country and you want to make sure it's available we can use the isin method on this and for this we have to provide an iterable so we can't
necessarily just provide a string we'll need to provide a list and let's search if Brazil is in here and this is just going through each of those rows and returning back whether it's true or false for that row this doesn't really provide a lot of help for us so I'm going to then run the any method on it which in this case shows there are jobs pertaining to Brazil because we get true back now let's run a country that may not be in here something like North Korea and that's
going to be false I don't have any job data for North Korea so that was awkward I'm going to go ahead and delete that all right let's move on now we need to create a new data frame with only our country of choice filtered for with that job data so in this case I'm going to call this new data frame us jobs and I'm going to set it equal to the data frame and now we want to filter it right so we're going to use the bracket notation and
inside of it we need to specify that we want the job country equal to our country of choice now I made a few mistakes with this let's see if we can catch them first is this I used a single equals sign which is the assignment operator and we need to use the comparison operator of double equals here now next I have SyntaxError invalid syntax perhaps you forgot a comma this honestly is a really bad hint onto what's wrong here I like to use instead the highlighting here and if you can't see it United has that yellow line underneath it on
top of that red and that's because we should be including this as a string right we're searching for United States as a string so now that that runs let's actually show this data frame and inspect it to make sure that everything looks right inspecting the job country column we can see that in fact there's nothing but United States in it so we have the data we need now we need to perform a groupby aggregation in order to analyze it for things like the median min max and also a count and this is all on
that salary year average column which I'm noticing there's a lot of NaN values in so let's also clean this up of any NaN values just to speed up the process so I'm setting the us jobs data frame equal to itself again but we're going to filter it and so inside our brackets we want to set the condition that we want to meet for that salary year average column and we can use the notna method on this basically if it's not a NaN value so it's true it's going to stay in there
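
The country check and the two filtering steps above can be sketched like this. It is a minimal example on a small made-up data frame; the column names job_country, job_title_short and salary_year_avg and the variable name us_jobs are assumptions based on the names spoken in the video.

```python
import pandas as pd

# Small made-up frame; column names are assumed to mirror the course data set.
df = pd.DataFrame({
    "job_country": ["United States", "Brazil", "United States", "Canada"],
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Scientist", "Data Analyst"],
    "salary_year_avg": [90000.0, 15000.0, None, 80000.0],
})

# isin() expects an iterable such as a list; any() collapses the row-by-row
# True/False values into a single answer
has_brazil = df["job_country"].isin(["Brazil"]).any()
has_north_korea = df["job_country"].isin(["North Korea"]).any()

# == is the comparison operator; the country is passed as a string
us_jobs = df[df["job_country"] == "United States"]

# keep only the rows where the salary is not NaN
us_jobs = us_jobs[us_jobs["salary_year_avg"].notna()]

print(has_brazil, has_north_korea)
print(us_jobs)
```

Both filters work the same way: the expression inside the brackets produces one True or False per row, and only the True rows are kept.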
inspecting the data frame to make sure that it removed all the NaN values looks like it did all right we're good so let's start building that groupby method on us jobs remember we need to provide that first value of what we're actually grouping by which in our case is that column of job title short and we're grouping by the job title column but we need to analyze the salary so inside of square brackets I put the salary year average and then finally I can do the aggregation method let's just do
something like count now one quick note this bracket notation right here is actually kind of optional in that I could do a count of all the different columns so running this we can see it provides basically the counts back for everything but this doesn't necessarily work with all aggregation methods let's say I wanted to do something like run the aggregation method of min what's it going to do on something like a text column like the job title column well it's going to throw an error basically whenever it goes through looking for what is the lesser
value using less than or equal to it's going to find that operation is not supported between instances of str and float and so because a text column like job title can mix strings with missing values stored as floats it can't get a minimum of that so here we only want to run this aggregation function on the numerical column so running it again on only the salary year average column we can see we get it now one quick note this isn't formatted really how I like it it looks like it's just a bunch of text printed on the screen and we can fix that because whenever we pass this column of
values that we want to actually aggregate if we have to do more than one we can do it in a list and so even though there's only one item I can provide it inside of a list here and then whenever I get it back it provides a much more user-friendly view that I can actually look at and feel more confident analyzing anyway let's continue on remember we want to do an aggregation of that salary column in multiple different ways so we're going to use the agg method for this and if we're only providing one aggregation
method we only have to provide one like this but if we need to provide more we need to put them into brackets to signify that it's a list so in this case I'm going to add min to this and now I have multiple in here additionally now I've added max and also count into here and so now we have four different attributes of interest that we want to explore and have available now one last thing to do on this to clean it up right now it's basically a hot mess like where do I need
to look at first I typically like to analyze from top to bottom and to me the most important column of interest is the median one so I want to sort these values by median so I'm going to add the sort_values method to this and the first argument that we see is by so we need to pass by what are we sorting it by well we're sorting it by median and median in this case needs to be defined as a string now I got this error message right here specifically calling out a KeyError
on the median value now anytime I get these types of errors and you will get them as well all you need to do is just copy this error message come over to whatever your favorite chatbot is and paste it in now Gemini provided a solution that I'll be honest actually makes it more complex than necessary going with good old ChatGPT it finds the error right away and I can see inspecting the code the problem has to do with how I actually put in the columns here let me show you what I mean I'm going to
go ahead and remove this sort values and I'm going to run this again so whenever we ran this and I provided this in a list for salary year average you can see that it subsets the data even more whereas when I remove the list it doesn't subset it to that other level so now whenever I run the sort values by median I get it back you're going to run into errors like this all the time with python so it's very important that you understand how to use the Internet or chat Bots to help you quickly
troubleshoot and get through this now I don't like how this is sorted so I'm going to adjust this this method has an ascending parameter that we can set to true or false so I'm going to set this to false and now we have all the median values put in the order exactly like we want it and looking through this you're probably like Luke this would look better if we actually graphed it and to your luck we're going to do that so with that I'll see you in the next video where we actually jump into
plotting with matplotlib now that we mastered the basics of analyzing data we're going to be moving into actually visualizing the analytics that we're performing and just like pandas is the go-to source when cleaning up and analyzing data matplotlib is the go-to source whenever we're visualizing data now the code for matplotlib is made publicly available because it's open source inside of GitHub but I don't really find much use out of this once again I'm going to want to go to the documentation I mean go to matplotlib.org to get
this now this library has a host of different visualizations we can use for this I'm going to go in here under plot types and the majority of visualizations that I find myself using are here inside of this pairwise data or underneath statistical distributions if you ever need a reference for how to do something like say a box plot you can just come inside of here and it provides you a simple code breakdown of how you should go about building these plots now the other major resource from matplot lib going back to that homepage are the
cheat sheets and you can access it on the left hand side of the pane here clicking on the cheat sheets and particularly I'm a fan of both these two up at the top of here this one on the left hand side goes into a lot of detail of all the different charts that are available and then the second one is great for formatting getting maybe colors and different orientations that you need for building your plot anything you build with mat plot lib is highly customizable so there's a lot of things you can do and you
can make single visualizations that are 100 lines of code long and by the end of this basics chapter you're going to be able to go through and customize all the different things here that it's showing on what you actually can customize inside of a plot so let's get into making sure matplotlib is installed and importing it in to see what libraries are installed we're going to use that exclamation point and then type out pip list to get the list of libraries inside of here we can see that matplotlib is installed along
with two other associated libraries of matplotlib-inline and matplotlib-venn we're not going to worry about those for the time being now let's get to importing it in so I can import in all of matplotlib if I want but even running this I don't necessarily want this instead I want a specific module out of matplotlib and I'm just going to show you where it is you don't need to memorize this but inside the lib folder underneath the actual library of matplotlib we care about this pyplot module that's what we want to load in
and it basically says that matplotlib.pyplot is a state-based interface to matplotlib it provides an implicit MATLAB-like way of plotting it also opens figures on your screen and acts as the figure GUI manager so I'm going to copy this little example it has right here to just make sure after we load it in it's working properly so like we said we want to import matplotlib.pyplot specifically we use that dot notation to specify that within that matplotlib folder we want to access pyplot
and as we can see from the documentation here it's commonly renamed as plt so I'm going to put that in as plt and then run it so we're going to create this x variable here where we're creating a list of values from 0 to 5 in 0.1 increments and then for the y-axis we're going to be plotting a sine wave we put this inside of plt.plot which we're going to go over in the next video I just want to actually show and demonstrate that we can plot now first thing I get is NameError name
np is not defined silly me I didn't import numpy as well so rerunning the cell above it and then below it we get inside of our Jupyter notebook the actual plot itself so I save it as sample plot on my desktop and now if I want to share any of these visualizations with my friends who probably don't care about it but if they did I could then go and give them this file so now it's your turn to give it a try and import in matplotlib specifically that pyplot module also feel
free to generate a visualization if you want to but we will be jumping into that more in the next video all right with that I'll see you in the next one so let's actually dive into learning how to use this matplotlib library for plotting by diving into a data analytical question actually we're going to do multiple the first one we're going to do is a line chart and for this we're going to be looking at what is the trend of job postings over time for the second one we're going to use a bar
chart and for this we're going to dive deeper into understanding how many jobs are associated with each of these job titles so inside of your Jupyter notebook if you haven't done so already you're going to be pip installing the datasets library from there importing our libraries specifically don't forget we now need to import matplotlib and that pyplot module as plt from there I loaded the data and did the data cleanup we're going to start with a simple example first we have two variables here x and y which we'll use appropriately for each axis and they're
both lists of numbers 1 2 3 4 basically so we can have a straight line so we want to plot this well for this according to the matplotlib documentation inside the pyplot module we have this plot function available it plots y versus x as lines or markers so this is great for line charts or scatter plots so what do we need to pass into this well we have this star operator and then args which is an unpacking operator and actually scrolling down into the documentation we can see that it's commonly referenced
that we're going to be passing x and also y to that so with that plot function I just add in x and then also y so as expected we get provided back this line chart which basically graphs 1 2 3 4 against each other now it also prints out this text up here about the object itself if I wanted to get rid of this I could run the additional function of plt.show and this is going to remove that from it so from time to time you're going to see me include it
or if I'm lazy I'm not going to include it also to prove that I'm not completely lazy inside the documentation on the show function it says that hey it displays all open figures so it's made to actually display those figures whether in a Jupyter notebook or if you're running that from a python file that's where you're actually going to more commonly see it scrolling down they have this note that it auto shows in Jupyter notebooks and the Jupyter backends call show automatically at the end of every cell by default so it's not necessary and it
says thus you usually don't have to call it explicitly there but if you're using something like ChatGPT or Claude to get answers you're going to commonly see that they include this show function so let's now jump into actually plotting some actual data and we're going to just start with a simple one first of the job posting dates just as a reminder running the head method on job posted date we can see that this includes not only a date but also a time now let's go ahead and just plot the date values from this entire
data frame and we're going to just plot it against itself so we're going to get a straight line and we start by calling that plot function and then we define the x and also y which we can just take from the data frame itself and so it's sort of repetitive here I have the job posted date as x and the job posted date as y but we still need to get to our final goal of plotting the job postings over time so what we need for this is we
need to perform some sort of aggregation in order to count how many job postings are happening relative to that job posted date so with our job posted date column we can run the method value_counts on it and this is going to provide for each of the datetimes a count of the different job postings at the associated datetime now let's save this to a variable and we're going to call this date counts and before we plot it I just want to show that with date counts we did this operation on a data frame column but
it's no longer a data frame it's now a series we're going to talk more about that in a minute anyway let's actually plot this so we pull in plt.plot for this we have to provide the x value and also the y value now if I go ahead and actually pull up date counts again for us to inspect it what we have on the left hand side for this series is actually the index and these values on the right hand side are actually the core portion of the series so this 12 10 10
10 is actually what the date counts are so for this plt.plot I can pass date counts and for the x-axis I want to use basically that job posted date so I'm going to specify the index and like I said the core of the series is the values of these counts right here so technically all I have to put in is date counts now running shift enter we get this hot mess right here which has I mean you can see the dates are in order right here and then
the different counts are up here but it's going all over the place and that's because whenever we inspect this series of date counts we can see that the index column is not in the numerical order or in that actual chronological order that we need it to be matplotlib is not going to sort it automatically for us we have to actually sort it before providing it so for date counts I can specify that we want to actually sort this index specifying date counts again I provide the method of sort_index
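
The counting and sorting steps above can be sketched as follows, using a few made-up timestamps instead of the course data. The column name job_posted_date and the variable name date_counts are assumptions based on the names spoken in the video.

```python
import pandas as pd

# A few made-up timestamps; the column name is assumed to mirror the course data set.
dates = pd.to_datetime([
    "2023-03-01 10:00",
    "2023-01-15 09:30",
    "2023-03-01 10:00",
    "2023-02-10 12:00",
])
df = pd.DataFrame({"job_posted_date": dates})

# value_counts() returns a Series: the index holds the datetimes, the values hold the counts,
# ordered by count rather than by date
date_counts = df["job_posted_date"].value_counts()

# sort_index() puts the index back into chronological order before plotting
date_counts = date_counts.sort_index()

print(date_counts)
```

Without the sort_index step a line plot connects the points in whatever order the series lists them, which is why the unsorted version zigzags across the chart.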
running this and then running right below it we now have as we can see here based on the year dates we have it in actual chronological order finally actually plotting it boom still a hot mess but the lines aren't going back and forth so why is this a hot mess well if we inspect it right we can see that we have the datetimes as the index and sometimes they're only a few seconds apart and we just have these values of like one for each where before we saw like 12 or 10 anyway
the aggregation we're doing here is not correct we need to do something where we either aggregate it by day week month or even quarter we're going to go about aggregating it by month so I'm going to specify a new column name in the data frame of job posted month and I'm going to use the job posted date in order to calculate this specifically I'm going to use the DT accessor so date time and call out that we want to have the month attribute of this running this and then calling the data frame we can see
now that job posted month is added and it has it in a numerical form for that month but similar to before we need to perform the value_counts and also sort it so with a new variable of monthly counts I set it equal to the data frame or rather the column of job posted date and then I want to run the value_counts method on it additionally for this I want to sort the values because it's going to be coming in an unsorted order and so I provide itself back to it and provide sort_index
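
A minimal sketch of the monthly aggregation, on made-up dates, that counts on the new month column rather than on the raw timestamp. The names job_posted_date, job_posted_month and monthly_counts are assumptions based on the names spoken in the video.

```python
import pandas as pd

# Made-up posting dates; column names are assumed to mirror the course data set.
df = pd.DataFrame({
    "job_posted_date": pd.to_datetime([
        "2023-01-05", "2023-03-12", "2023-01-20", "2023-02-02", "2023-01-31",
    ])
})

# The dt accessor exposes datetime attributes; month gives 1 through 12
df["job_posted_month"] = df["job_posted_date"].dt.month

# Count postings per month, then sort by the month number in the index
monthly_counts = df["job_posted_month"].value_counts().sort_index()

print(monthly_counts)
```

With matplotlib imported as plt, passing monthly_counts.index as x and monthly_counts as y to plt.plot would draw one point per month instead of one point per timestamp.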
so let's see what we get out of this when we print it out okay I already messed up because I put job posted date in here we need job posted month what am I doing right now we have what we want we have the months in the order that we want them and then the counts of this as well so let's get back into plotting this we're going to run that plot function from our pyplot module or plt and for the x-axis I'm going to provide the monthly counts and I want to provide the
index for this for the x-axis and then for the y-axis I can just provide monthly counts the entire series and bam now we have something that's actually usable and we can get insights out of it we can basically see that January as expected has a surge of job postings the beginning of the year new budgets a lot of people coming in and so they're trying to get people at the beginning of the year and then it sort of evens off for the rest of the year now with this monthly counts I called the index and
then I provided the series itself but we can actually access the values inside of there by adding dot values and this is going to provide us exactly the same plot as we had before it's a little bit more explicit in the amount of information we include in it you may see it from time to time so I just want you to be aware of it but now let's understand better what actually is this series going on here so we've been primarily focusing on using data frames which are a core component or object inside
of pandas, but pandas also has this other object called a Series. If you think back to NumPy, when we had a 1D array, this is effectively what a Series is. I can create a Series by calling pandas and running Series on it, passing it some values. In this case that's a list from 10 to 50, and it prints out 10, 20, 30, 40, 50. I can also provide an index argument. Previously it was numbered from 0 to 4; when I add this list
of a, b, c, d, e, the index updates to a, b, c, d, e. If I want to access the Series index, I first need to assign it to a variable, so I'll assign series equal to this. When I type series.index I get the index values back, and when I type series.values I get the values back as an array. Now, when we previously plotted job posted date versus job posted date, we were effectively providing the plot a Series,
if you will. I mean, as you can see up here, it even says it's a Series. So a column of data within a DataFrame is also known as a Series. You've seen this before, but we're using Series more explicitly now, so I just want to make sure we understand the difference. Moving on: we've now mastered how to make line charts, and we need to move into making bar charts. For this we're going to plot the counts of the different job titles we have. If I inspect the
job_title_short column, we can see we have things like data analysts, scientists, and engineers, and there are eight different job titles within it. We want to aggregate this just like we did before, and we can do that by once again using the value_counts method, probably a method you need to commit to memory. This gives us back a Series of the different counts. Let's go ahead and assign this to a variable, which I called job_counts. So now we need to plot, and conveniently Matplotlib has this bar function in
it. We're going to provide two values, x and then also the height, and as we can see from the documentation (really fancy here) it says it makes a bar plot. So calling the bar function from our pyplot module, I'm going to specify the job_counts index and then also pass in the job_counts Series itself. Now this thing is also a hot mess, right? If we look at the job titles down here, they're overlapping each other. If we wanted to, I could run the head method on this and cut
it down more, so that we only have three of the jobs, and then when we plot it we can actually read all the different values. But we're going to do something different instead: I want to use the barh function, which is for a horizontal bar plot. Similarly, you provide it the values of y and, in this case, width. So all you have to do to our original code is add an h to the end. Running Shift+Enter, bam, we now have all these things. And if you
notice, it's in descending order, and I don't really like this; we're going to fix that in a second. Anyway, I actually want more values in this. Now that it's horizontal, we can bring back all those job titles, so I'm going to remove that head call and rerun everything below it. All right, now this is better: we have all of our different jobs in there. But it's not in the order that I want, and this is because Matplotlib, in a horizontal bar chart, starts plotting from the bottom and
then goes upward, which makes sense if you think about it: whenever we plot something like a normal bar chart, we start on the left-hand side, with data analyst in this case, and go all the way to the right. So it similarly plots closest to the x-axis first and works in reverse order. Anyway, there's a really simple fix for this. With job_counts, we can sort the values, and in this case we want to set ascending equal to True. Running this
again, nothing happened, because I didn't assign it back to itself. Fixing that, we finally get the plot that we want. And as a reminder from the beginning of this video, we could also run the plt.show function at the very end. What that does is return just the visualization to us, without that extra text. All right, so now it's your turn to give it a try: jump in and get your feet wet actually playing around with how to plot different
Series and also DataFrames. So with that, I'll see you in the next one. All right, so we need to do some cleanup now. You noticed from the previous section that it takes a lot of time to get the data into the right shape to plot, and I also spend almost as much time on the back end actually cleaning up the plots once I have the data. So remember the plot that we had for the counts of different job titles, the one we're doing as a vertical chart right here. Anyway, there are a lot of
problems with this. As we can see on the x-axis, we can't distinguish what the names are. What does the y-axis even stand for? Why isn't there a title? There are so many things that need to be cleaned up. Well, in this section we'll go over how to actually clean up a visualization and make it presentable, so that you can share it with others. Inside my notebook I have the standard install of the dataset, which you've got by now. What you need to have is the data
so that we can work with what we're going to plot here, specifically that job_counts Series. What we did was take job_title_short, run value_counts on it, and then finally run the bar function using the index and the Series itself, showing it below. Let's do something simple like adding a title. Inside the documentation for the pyplot module we have this title function available, and all it does is set a title. What we need to provide to it is just a
label; in our case we need to provide a string. But let's actually look at an example of how this is implemented by scrolling down and going to the first one that it has. The title of this one is about masked and NaN data, and if we scroll down to its code we can see that the title was implemented right here. Now, I want you to ignore the majority of the code up here; it's not really important. What I care about is how we've sort of
stacked all the different things that we want to do to the plot. In this case, you can probably figure out that we added a legend to the plot, we added a title, and then we showed it. We didn't define a variable and go about it in an object-oriented approach; instead, Matplotlib enables what's called a stateful approach, so I can just stack up all the different commands that I want to apply. So in this case I want to run the title function, and I named this one Postings by Job
Title. Anyway, if you've worked with something like MATLAB before, this stateful approach is a common way of programming there, and it's how you're going to do it here. So now whenever I run this, it combines all these things into the current state, if you will, and applies it to this figure right here. And if I come down and start a new plot, the attributes I've assigned are not going to be applied to that new plot; they're only applied to what
I've run the code on right there. So now that I know that, let's add some other labels to this. I'm going to provide a y label for this axis: with plt I run the ylabel function, and similar to the title I provide a string, which I'll call Count of Job Postings. All right, so now that the y-axis is in order, let's get to the x-axis. This thing is a hot mess and we need to clean it up. Now, if you're curious about the names of the different parts of a
chart, if you come to the cheat sheets and open the intermediate one, it breaks down the anatomy of a chart and basically explains: this is the y-axis label, this is the x-axis label, these are minor ticks, these are major ticks, this is the grid, there's the legend. And frankly, the way I've found to go about customizing plots is to use something like ChatGPT and ask it what to do, because if I'm in the Matplotlib documentation and I want to format the x-axis labels, and I type in x-axis labels and run the search, I'll be honest,
it basically sends me back to the pyplot module; it doesn't provide a lot of good information, at least from search. So inside ChatGPT I can provide something like this: how can I format the x-axis labels on the graph to be about 45 degrees? I want to slant them. I also typically like to provide the code, so that it can just update it for me. Inspecting the code that it provided, I can see that it gives me plt.xticks with a rotation argument of 45, so I just
insert that right on in and press play. Now inspecting the chart, I can see that the labels are at a 45° angle, but there's something going on where they're not aligning properly to the tick marks. This would be very deceiving if we gave it to somebody. There's a way to fix this. One way is to just give it back to ChatGPT and ask it to help, but I would normally encourage you, if you can, to go back to the documentation and see if there's actually anything
in there, because you may learn some things by actually digging into the documentation. Unfortunately for us, there's nothing there, but I learned from ChatGPT that there's this special argument we can pass, ha, or horizontal alignment, and for this we're going to pass the value right. What it does is align these labels horizontally, either to the left, to the right, or to the center if you wanted that; I think it was using center in this case. So whenever I rerun this with right in there, it updates
this chart to be more aligned with what I would expect, with the end of each name aligned right on its tick mark. Everything looks good; I really like this one. If I wanted to, I'd go ahead and save this image and hand it out to all my friends. All right, so now it's your turn to give it a try and actually practice this by cleaning up some problems that we have for you. As always, do make use of tools like ChatGPT and Claude to help you with
cleaning up these visualizations; you learn a lot from doing that. But I highly recommend you don't become overreliant on them. Make sure that you're still coding out the majority of it, so you're getting familiar with what you need to know. All right, with that, see you in the next one. So don't be mad at me, but I kind of lied to you: there's a slightly easier way to plot, using pandas. I wanted to make sure that you understood Matplotlib first before introducing you to this shortcut. So let me show you what I
mean by jumping right into this. Just like before, we've imported all our data (if you haven't done it already, pip install datasets), and we're still looking at the same example from the last video, where we performed that count of job_title_short and then plotted it as a bar chart. Ignoring all that formatting for the time being, we're going to focus on the core code itself: with the pyplot module, we're using the bar function in order to specify
the index and also the job counts. I'll be honest, this is kind of annoying to have to remember: I have to access the index property of job_counts, and then what do I put here, job_counts, or do I need to include values as well? Well, inside the pandas documentation, for both Series and DataFrame, we have this plot function. Here's the same exact one for DataFrame as well. Remember, we've been working with both Series
and DataFrames. We're just going to stay here on the DataFrame page, because basically everything below this is very similar. With our DataFrame or Series we run this plot function, and we can provide *args or **kwargs, which is basically saying we can pass multiple different arguments to it, because it's using the unpacking operator. The actual parameters we can provide are listed right below. The ones we care about are x and y, which are the labels or positions that we want to use.
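
As a rough illustration of what *args and **kwargs mean in a signature like that one, here is a plain-Python sketch. The two functions are hypothetical helpers written for this example; they are not part of pandas.

```python
def show_arguments(*args, **kwargs):
    """Hypothetical helper: collect whatever it is given and hand it back."""
    # Positional arguments arrive as a tuple, keyword arguments as a dict.
    return args, kwargs


def forward(*args, **kwargs):
    """Pass everything through unchanged using the unpacking operators."""
    return show_arguments(*args, **kwargs)


positional, keyword = forward("job_counts", kind="bar", x="job_title_short")
print(positional)  # ('job_counts',)
print(keyword)     # {'kind': 'bar', 'x': 'job_title_short'}
```

This is why a single plot function can accept many optional settings: it gathers whatever keyword arguments it receives and passes them along.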
So if we're using something like a DataFrame, we can just provide the string value of the column. Now this is pretty cool, and what I really like about it is this kind parameter right here. No longer do I need to memorize all these different functions for different plots; instead I can just specify the plot type via a string value. So let's actually do this on job_counts and run the plot method on it. Remember, job_counts is a Series, and because of that I'm not going to need to provide any x or
y value; all I need to do is provide kind, set equal to bar. Running this, we get this bad boy, which is not only much less code, but we can also actually read the axis labels for once, since they're not overlapping. Personally, comparing this to the previous code right here, I think this makes it a whole lot simpler. And if we want to change it to something like a line, we just put that in, run Shift+Enter, and now we have it. I don't recommend using a
line chart for this; we're going to be using a bar chart. And now you're probably like, Luke, what about all that formatting I worked on before? Well, everything that we did before still carries over, because pandas is basically using Matplotlib in the background, specifically the pyplot module. So I can still go through and specify all the different things that I want to happen on the figure. Running this, I have everything updated, although we can see that I now have this x label down here of job_title_short. I want to
remove that, so I'm going to specify with the pyplot module that the x label should be an empty string. Running this again, we can see we removed it. So that was for job_counts, which, just to reiterate, is of type Series. Now let's see how this is done on something like a DataFrame, which we already have with df. In this case, with the plot method we're going to need to provide the x and y parameters, because our DataFrame has multiple different columns in it.
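
The two cases can be sketched side by side as follows. This uses a small made-up DataFrame; the column names (including salary_year_avg for the salary column) are assumptions based on the narration, and the titles are just example strings.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in data; column names are assumptions, not the real dataset.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Analyst",
                        "Data Scientist", "Data Analyst", "Data Engineer"],
    "job_posted_date": pd.to_datetime(["2023-01-03", "2023-02-14", "2023-04-01",
                                       "2023-06-20", "2023-09-09", "2023-12-01"]),
    "salary_year_avg": [90000, 130000, 85000, 140000, 95000, 125000],
})

# Series: the index supplies the x positions, so only `kind` is needed.
job_counts = df["job_title_short"].value_counts()
ax_bar = job_counts.plot(kind="bar")

# pandas draws with Matplotlib, so pyplot calls still act on the current figure.
plt.title("Postings by Job Title")
plt.ylabel("Count of Job Postings")
plt.xlabel("")                       # drop the automatic x label
plt.xticks(rotation=45, ha="right")  # slant the labels and anchor their ends

# DataFrame: several columns exist, so name the x and y columns explicitly.
ax_line = df.plot(x="job_posted_date", y="salary_year_avg", kind="line")

# In a plain script (outside a notebook), call plt.show() to display the figures.
```

Each plot call returns a Matplotlib Axes object, which is why the pyplot formatting functions continue to work afterwards.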
It needs to know what to use. Specifically, I want to plot over time, using a line chart, how salary trended over the year in 2023. So with our DataFrame we run the plot method, and we first specify the x value, which in our case is job_posted_date, provided as a string. Next comes the y value, the salary year average column, and finally, what kind of chart do we want? We want a line chart. Running this ends up giving
us this line chart, which, I'll be honest, doesn't provide a lot of value, but we can see that the values are fluctuating. Based on the axis, I can see most of it is less than 200,000; that's what that's signifying right there. And it's showing the dates from the beginning of the year to the end of the year. Anyway, this example is mainly just to demonstrate that you can plot DataFrames. The DataFrame in our case has over 700,000 values, so we're going to get a lot of noise in it. We would want
to take a similar approach of breaking this down, maybe on a monthly basis, and then plotting it. I'll leave that for you to do; we're not going to do that for the time being. The more important thing is that you now understand how to plot with Series and DataFrames. All right, so I've got some practice problems for you to go through and practice plotting with pandas, which basically wraps Matplotlib. I'm curious to know which one you like better: do you like using Matplotlib or do you like using pandas? I'm going
to be switching between them through the remainder of this course, basically using whichever is easier in each case, so I want you to be aware of that since I'll be changing back and forth. I'll see you in the next one. So to wrap up the basics section, we're going to work on a problem that moves us toward our final project. For this we're going to calculate the median salary based on the job title itself, so we can better distinguish between high-paying and low-paying roles, and
frankly, you can use this to determine what role you should be aiming for based on salary. Inside my notebook I have the standard import of the libraries, loading of the data, and then finally its cleanup. We want to aggregate the job titles, or basically group them, and you probably know the best method for this: it's going to be groupby. Calling df.groupby, we pass in what we want to group by, which is job_title_short. Now what do we want to run an aggregation on? Well, we
want to do this on the salary, so using square brackets we specify salary year average, the column for salary. Finally, we need to call our aggregation function, which, as I said, is going to be median. Running this, we get a Series returned with all of these different roles in it. Notice they're out of order, so I'm going to add one more method onto this, sort_values. Running this, we now have it in ascending order, from lowest to highest. Okay, great. Now you're probably like, Luke, why
are we running the median and not the mean for these salaries? To that I would say, that's a great question. What I'm going to do is run this code, which is exactly the same as below except that I've changed the aggregation to mean, for the average. Now I'm going to give you a second to think about it, but look at how these salaries compare to each other, especially the ones in the lower brackets. With the exception of senior data engineer and senior data scientist, these median values are lower than the averages.
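
A small sketch with made-up numbers shows how the two aggregations can diverge: one unusually large salary is enough to pull the mean well above the median. The column names and values here are assumptions for illustration, not figures from the course dataset.

```python
import pandas as pd

# Made-up salaries; the single 400,000 value stands in for a high outlier.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst"] * 5 + ["Data Engineer"] * 3,
    "salary_year_avg": [80000, 85000, 90000, 95000, 400000,
                        120000, 130000, 140000],
})

# Group by job title, then select the salary column to aggregate.
grouped = df.groupby("job_title_short")["salary_year_avg"]

# sort_values defaults to ascending order (lowest to highest).
median_salary = grouped.median().sort_values()
mean_salary = grouped.mean().sort_values()

print(median_salary)
print(mean_salary)
```

In this toy data the outlier even changes the ranking: by median the analyst role sits below the engineer role, but by mean it lands above it.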
In order to show this better, I went ahead and plotted a histogram of the salary distribution, and there's the code up here. We're going to go into this more in the advanced section, since there's a lot of special functionality in what I did here, but I just wanted to show this histogram for the time being. If you're not familiar with a histogram, it shows the number of counts of something at a certain value. So in this case, for a salary of around 100,000, we're seeing around 6,000 jobs that have
that. Now the key thing to note is that this goes from zero all the way up to a million, and that's because there are a number of high outliers up in this region, greater than 600,000 or even greater than 400,000. If we were to use the mean, as we showed up here, these high outliers drag the average higher, so it doesn't necessarily reflect a value that's representative of what I'd actually see as a salary. So in the case of data analyst, it says the average is 93,000, but if I'm actually
job searching, I'm probably going to see something more like 90,000. And updating this graph to show where the median and mean fall, we can see the mean sits slightly to the right because of how this data is skewed. Anyway, enough theory on the numbers; let's jump back into plotting. Back up here with our Series, I'm going to create a variable to save it to, job_salary, and with that we can run, as we learned previously, the plot method on it, since it is a Series. I'm going to specify for this that I
want to use a horizontal bar chart. So this is looking good, but I want to clean it up some more. I've added an x label for the salary and denoted that it's in US dollars. For the y label, I don't think it's really necessary, so we're going to put an empty string right here to remove it, and then I put a title on top. So now we get something that's actually usable, that we're comfortable sharing with our
friends, and we can now start to analyze some trends in our data. Specifically, senior roles for data scientists, engineers, and analysts are all paid more than their junior counterparts, as expected, and we're confirming this with the data. Additionally, to my surprise, machine learning engineers and software engineers are actually paid less than data scientists and data engineers, so it pays to be in the data field. Data analysts aren't far off of that, and they're not last, so that's one good thing. From there we're followed by cloud engineers, and last is
business analyst. This makes sense for data analysts versus business analysts, because business analysts typically have less technical skills than data analysts. All right, sweet, so we've got a head start on our project before we jump into the advanced section, where we'll be leaving Google Colab and working on your own personal computer. With that, see you in the advanced section. All right, welcome to the advanced chapter; I'm breaking out a different flannel for this portion of the video. Anyway, for
the entire chapter, and also for the remainder of this video, we're going to be using Python locally on your own computer, and we'll walk through the setup for both Windows and Mac users. They're pretty much the same, but I want to make sure we cover any differences. For this video we're going to focus on installing Python through the Anaconda distribution, walking you through all the steps required to get it set up and ready to go so we can start coding. In the next video
we're going to set up our code editor of choice, which is Visual Studio Code, along with the workflow for using this editor with Python. So why are we now running Python locally on our computer instead of using something like Google Colab in the cloud? Well, there are a few different pros and cons. First, data privacy: if you're working with confidential or even protected data, like HIPAA data, this ensures the data doesn't leave your machine and you don't have to worry about compromising it. Next is the learning aspect: you get
a lot of experience setting things up and gain deeper insight into how Python works by doing this. Third is the cost: although Google Colab is free, if you use it enough you do have to start paying for it, whereas if you have a computer it's already sitting there, so it's technically almost free. And finally, this doesn't require internet access, so you can work on it wherever you go, even on a plane. Now, there are some drawbacks. As we saw with Google Colab, cloud setup is a lot easier; there are no installation steps like the ones we're
about to go through. Also, managing Python environments can be a pain in the ass, but that's a separate subject. Next, the cloud gives you access to high-performance computing that you may not currently have; depending on how shitty your computer is, it may actually be a better option to run in the cloud versus locally. And the third reason, which is probably the most powerful, is that collaboration is a lot easier in the cloud: multiple users can work on the same project at the same time, and you don't have to worry about conflicts. Now,
you may have concerns about whether you have a powerful enough computer to do this. My recommendation is that if it was built within the past decade, it's probably more than powerful enough, because Python doesn't take up a lot of resources when you're running it. But if you still have concerns, or if you run into any issues during the setup, you have some alternate options. First is Lightning AI, which provides a VS Code-like environment inside your web browser. They have a free tier available that
would allow you to complete the entire course. One note: there is currently a waitlist, so you need to apply now if you want to do this. The other option is GitHub Codespaces, which, similar to Lightning AI, provides a fully configured, secure environment. This option is completely free for up to 60 hours a month, and I think you can complete the remainder of this course in those 60 hours, so you could do it for free. Now, if you decide to forgo installing Python and VS Code and running it locally, that's your option;
you can do that, but I'm not going to provide any support on how to use things like Lightning AI or Codespaces. I'm only going to be covering the local option. Now, during this install, if you run into any problems, I highly recommend that you go to something like ChatGPT and paste in the error message, and it's going to guide you through what you need to do. It's a lot quicker than putting in a YouTube comment that you're having this issue and then praying that I'm going to come and actually answer it. I
would use ChatGPT instead; it's going to be a lot quicker, and it's going to help you in the long run with understanding how to use this. So let's get into the Anaconda distribution install, which is going to install Python on your computer. We're going to use virtual machines of both Windows and macOS, so that whichever one you're using, you know what to do. Don't worry if you don't understand virtual machines; that's not an important concept here. I mainly have this here so you can
see how to do it on each. The first thing we need to do is verify whether conda is installed on your system. I just created both of these environments, Windows and macOS, so they're fresh installs and there should be nothing installed in them. To verify whether conda is installed on your Windows machine, go to search, type cmd, and open up the Command Prompt. There you're going to type conda info. These users should get a message saying conda
is not recognized as an internal or external command, operable program or batch file. This basically means conda is not installed. However, if it does give back information on conda, then it's installed and you don't need to install it. Similarly, on a Mac you're going to use Spotlight search, so press Command+Space and type in terminal to open up your terminal. Inside the terminal, type the same command, conda info. What you should get back is command not found: conda, or something similar. In that case conda
is not installed. If it is, you can forgo the installation of conda. So open up your favorite web browser, whether on Windows or Mac, and go to anaconda.com/downloads. They want you to provide an email address for this distribution; if, like me, you don't feel comfortable doing that, just click skip registration. Your operating system should already be selected, so you can go ahead and click download. On Mac we can select Intel or Apple silicon. I have Apple silicon, the M1 chip, or actually
the M2 chip, so I'm going to install with that. Windows users don't have this option; you're just going to select download, and it's going to start downloading. Once the package is downloaded, double-click it to launch it. You may get a warning message saying this will run a program to determine if the software can be installed; sure, allow it. Now, on both Windows and Mac, the installer is going to walk you through the install process. We're going to keep the default values for
all of it, so it should be pretty similar. We hit continue. They have this long document that you should probably read if you had a lawyer, and I'm going to click continue. Same thing with the license: you should probably be reading this, but we're going to go ahead and click continue as well. On Windows it's going to ask whether you want to install for just me or all users. Just me is recommended, so go with that, and it's going to specify this Anaconda 3 folder underneath your root folder. I'm going to go
with that as well. These are the default selections that it came with, so I'm just going to go ahead and click install on Windows. For Mac users it's a little bit different: it asks whether you want to install for all users of this computer, because it's going to store it at the root location in this Anaconda folder. This is perfectly fine; I'm comfortable with that, so I would go ahead and go with it. It warns you it's going to take 4 GB of space; that's fine as well. Then you have
to enter your password to begin the install. On Windows, once the installation is complete, I'm going to click next. It's going to tell you something about coding with Anaconda in the cloud; we're going to skip that for the time being by selecting next. Then I'm going to go ahead and leave this enabled to launch Anaconda Navigator, which is the GUI, upon finishing, and I'm going to uncheck this other one because I don't need it. For Mac users it says the exact same thing about coding in the cloud; sure, I'm going to continue,
and then there's nothing else to click, so I'm just going to go ahead and close this. It then asks whether I want to move the installer to the trash; yes, I want to do this. During this time it also popped up Anaconda Navigator, the GUI interface, on the Mac, so don't be alarmed if it does that. On both Windows and Mac you may get a popup saying there's a newer version of Anaconda Navigator available, we strongly recommend you update, do you wish to update? Yes, I do. And I don't know why
we have multiple popups coming up here, but do you want to quit Anaconda Navigator? Don't show me this again; yes, just quit it. Update now. I'm going to do the same thing on Windows: do you want to allow it to update? I'm going to click quit Anaconda (oh my gosh) and click update now. Anytime Anaconda asks you to update, just go ahead and update it; it's always good to have the current version. Once it's updated, we can go through and actually launch the Navigator. Do this on both Mac and Windows.
So that completes the setup for both. Now you have this Anaconda Navigator, which you can access through your applications. I'll be honest, I don't really use this app much, so we're not going to tour it and go through how it functions; we're going to be using other methods. But I do want to verify that Python is installed. On Mac, if you still have your terminal open, go ahead and close it and then start it back up; it needs to be restarted. On Windows you need to do the same
thing and close out the Command Prompt. We're going to use a new terminal inside Windows, specifically the Anaconda Prompt, and you can find it by searching for Anaconda Prompt. To see if Python is installed, which it should be, we're going to type python, then dash dash, and then version. From here we can see that Python 3.11 is installed on our system. Running this on the Mac, we don't have to use the Anaconda Prompt, which is why I love Macs. Anyway, it says that Python 3.11.7 is
installed as well, so we're good to go. Also, if you notice, both of these now show base inside parentheses, which you can see here on the Mac and also on your Windows machine. This base refers to the environment that Python is installed in. We're going to go into more on environments in the third video of this chapter, but for right now I just want you to be aware of it. So if you've been following along in this video, we're done with this. We're now
going to continue in the next video with installing a code editor, so we can actually get to coding instead of typing things like python version only to find out the version, which is completely useless. Okay, with that, I'll see you in the next one. So now that we have Python installed on our computer, we need to jump into installing a code editor to use and actually run this Python code. For this part of the install there aren't a lot of differences between Macs and Windows, so I will call
out if there are from time to time but other than that we're just going to stick with one operating system now we could run python from a terminal or the Anaconda Prompt by just typing in python then we have these three arrows (>>>) right here where I can give it a command so as a test I have it print what's up data nerds and it outputs that but we don't want to be doing our coding right here inside of this prompt instead we're going to be installing VS Code but why is that well according to the 2023
developer survey which actually surveyed about 90,000 different developers not only did they identify that among programming scripting and markup languages python is the third most popular language with almost half of respondents voting for this skill but in regards to the integrated development environment it wasn't even close Visual Studio Code which we're about to install is by far the most popular IDE to use for coding and it's free and I've been using it for years so we're going to go ahead and use it so if you go to this URL it provides
you the download option to download for Windows or for Mac click your appropriate installer once it's downloaded open it up and accept the agreement and it says it's going to put this into the program files that's perfectly fine to put that as a location and it wants to know where to place the shortcut I'm just going to keep everything default and go to next here I'm going to leave the default selected if you want to you can create a desktop icon and then from there I'm going to click install after the install I'm just going
to click finish which will go ahead and launch it automatically back on my Mac machine I promise there's not a lot of differences it automatically just went ahead and installed it I don't think we're going to have to go through any sort of setup yep it just automatically launches so you're not going to have those popups and go through all those selections like all those Windows users for Windows users they do have this walkthrough of what you need to do to set it up I'm going to just skip that for now and
click mark done so now that we have VS Code installed we need to connect this editor to python and we can do this by installing an extension if you come over here to the left hand side on the activity bar I can come in here and I can search python but actually it's just going to pop up as the most popular I'm going to click install we're installing an extension and this is ultimately going to allow us to connect to Anaconda or python that's running locally so I'm going to click install and it's
going to go through and set it all up once again it's going to have some setup that you can actually walk through for setting up python and everything like that I'm going to just skip this for the time being and mark done I'm also going to close out these extensions okay now that we have the python extension installed we still haven't connected the two so we need to connect it so we're going to open the command palette by pressing control shift p on Windows or for Mac users command shift p and I'm going to type python
select interpreter from here it should pop up with that python 3.11.7 base version that we saw before and you should see some sort of file path like this of anaconda3\python.exe this is for Windows users for Mac users it's very similar but it's going to be in this location of opt/anaconda3/bin/python and I like that as well now we need to verify that we've actually set it up properly and python is now running I'm going to click new file to create a new python file and I'm just going to
select it here I'm going to create a simple python command that prints what's up data nerds real original I know now you've noticed up in the right hand corner we can run this python file but actually we need to save this first so that it knows where to access it so I'm going to press control s if I'm on Windows command s if I'm on a Mac and I'm just going to save this python file here and call it test.py and conveniently up at the top it tells me the save location so let's get into running
this now by coming up to the right hand corner and selecting run file it's going to run it right below here in our terminal and inside of here it's specifying to basically use python or the python.exe file that we have installed through Anaconda and then run this test file which when it runs this print statement it prints it right below this Mac users should see the same thing in their terminal window with the same specifications as before and then printing out what's up data nerds so there's no longer really any
more differences between Windows and Mac so I'm going to be staying in Windows for the remainder of this you can go ahead and close out of your test.py file and now we need to create a folder in order to store this project for the remaining exercises and the final project that we're going to be building so I'm going to click here to open folder you need to save this project folder wherever you feel most confident doing this I recommend saving this in something like documents or on Mac you have a developer folder for me I'm
just going to put the folder right here and I'm going to name it python_data_project I'm using these underscores instead of spaces because it just makes it easier to access programmatically later on anyway I'm going to save it and now we're going to select this folder it's going to ask you if you trust the authors of the files in this folder I'm the author I mostly trust myself so now that we have that project folder set up let's do a quick walkthrough of VS Code and how you should be using it up
at the top is a menu I feel like that's pretty self-explanatory over on the left is the activity bar first we have the explorer which allows us to get to our different files if I wanted to create a new file inside of our python data project I click that new file icon and then I type in something like test.py and it's going to create a python file inside of here I can paste some code and then if I want to I can go and try to run this file I don't know if you saw that but
VS Code automatically saved this file before running it which is pretty nice anyway the next thing is search so if we wanted to search for different keywords we may have used we can find that we used that print statement in our test.py file next is source control which we're going to go into in a lot more detail when we get into the project so just keep that in the back of your mind there's a debug panel right here I don't use it as much especially when I'm working with Jupyter notebooks but it
is available if you're working with python files and need to debug them next is extensions which we're going to get to next and also testing if we want to configure or do any tests I'm not really using that much let's go back to those extensions so remember we installed python I can come here and see that and we could uninstall it if we don't want it but you can install any extensions that you want located in this extension marketplace so another one that you're going to need for this is Jupyter
and this allows us to run those Jupyter notebooks so I'm going to go ahead and click install and once again like all of these they have some sort of walkthrough that they'll let you do I'm going to click mark done and do it myself so now I can come inside of the explorer create a new file I'll call it test and give it the extension .ipynb and press enter to create the new Jupyter notebook now if you can't remember that extension of ipynb you can go to the command palette and type in Jupyter and you just
select here create new Jupyter notebook so now I have this Jupyter notebook which if you remember from Colab is set up very similarly it's telling us which environment we're in base and we're going to go more into environments in the next video but we can go ahead and run some code and test it out right inside of here so running a simple print command I can go ahead and press play now I get this popup from Windows security saying do you want to allow public and private network access to this app and I'll
be honest sometimes with Jupyter notebooks I do run APIs and stuff and connect to public networks and private networks so I'm going to go ahead and select and allow both of these options functionality is going to be really limited if you don't allow this but I'm more than confident that it's safe to do this anyway getting back to our prompt it printed what's up data nerds and I just confirmed running this on a Mac you shouldn't get this popup now with Jupyter notebooks they put a lot of functionality right up at the top
of here we talked before about how we can create code and also markdown cells so I can turn this into something like a heading now one cool thing about this though is say I make a mistake and I actually don't want this as a markdown cell I can select it right here and then I can just reselect what I actually want it to be in this case let's say I want it to be a python cell and this is actually a comment they have options to run all or to restart your kernel if you want
to and then if I had some sort of variable like x equal to 1 I could then open this variables view and see very easily all the different things I have all right so that wraps up the install of VS Code which you should have been following along with now there's no practice problems or anything for this but I would actually go through and check out these extensions and see if there's any that maybe you want to install I like this one right here vscode-icons so after selecting
install I also select what file icon I want to use anyway now whenever I create folders or something say I create a SQL folder it will actually modify the icon I don't know if you saw that I'll do it again when I make something like a SQL folder it will add a little database on top of it so it actually changes up the icons and makes them look pretty cool another one that I like is Rainbow CSV I'm going to go ahead and install this one as well when I open a CSV I've got an example
one in this case it goes through and for the columns at the top you can see like job_work_from_home it has the same color and I can actually follow it through now with that Rainbow CSV it's still a little bit hard to read I also like Edit CSV I work with a lot of CSVs as you can tell back to that original CSV I had I have this option now if I want to edit a CSV it puts it into a sort of Excel-like format where I can actually go through and then if
I wanted to I could actually change different values in here and then from there I can just apply changes to file and save all right so the only problem for this one is for you to dive into those extensions and maybe install some extra ones that you want if you find any you think are useful feel free to drop them in the comments below next video we're going to be jumping into virtual environments using conda to navigate that and then after that we'll be jumping into some more advanced concepts using pandas and matplotlib
all right with that see you in the next one all right I'm not going to lie virtual environments can be a tough concept to grasp so what is going on here is these virtual environments allow us to have these individual if you will isolated environments with our own python versions and all the libraries and packages needed for maybe a data science project we're working on and if I decide to work on another data science project I can create another virtual environment for that project and so it keeps these two projects if you will separate
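To make the idea of an isolated environment a bit more concrete, here is a minimal sketch (not something shown in the video) that asks python which interpreter and environment folder it is running from. It only uses the standard library, and the helper name describe_interpreter is made up for illustration. Running it under two different environments should print two different paths.

```python
import sys
from pathlib import Path


def describe_interpreter():
    """Return basic facts about the python interpreter running this code."""
    return {
        # Full path to the python binary that is executing this script
        "executable": sys.executable,
        # Root folder of the active environment (for example an Anaconda install)
        "environment_root": sys.prefix,
        # Last folder name of that root, which is often (not always) the env name
        "environment_name": Path(sys.prefix).name,
        # Version as major.minor.micro, similar to what python --version reports
        "version": ".".join(str(part) for part in sys.version_info[:3]),
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

The exact paths depend on where Anaconda was installed, so treat the output as something to compare between environments rather than a fixed value.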
so we'll jump into more theory at the end of this but I think the best way to learn is to actually play around so to manage these virtual environments we're going to be using conda which comes as part of the Anaconda distribution now conda provides a lot of documentation for this which I can get to by selecting conda up above here and they provide a whole user guide for this the one thing that I find myself gravitating to is this cheat sheet here under the user guide and opening this cheat sheet up it provides all the
different conda commands that you need to know in order to manage virtual environments so where the heck are we going to run these conda commands well you run these inside your terminal here on the Mac I can run something like conda env list and it provides a list of all the different virtual environments I've created right now we only have this base environment which is located in this folder location and as you can see right here on the left hand side base is the one that is activated right now that we're using similarly on
a Windows machine I can use that Anaconda Prompt that was installed and run the same command in order to see this well with VS Code they provide an even easier way to access the terminal right inside your coding window you come up to the menu at the top and select new terminal or personally I like to use control tilde which is right there next to the number one key pressing this it's going to pop up a terminal window right at the bottom of this window and once again I can run this command
to list all the environments in here right now we have base so we're going to walk through pretty quickly how to create a new virtual environment and then we're going to actually set one up for the course so you can just watch this first part if you want so first thing I'm going to create a new project if you will and I'm going to open a folder and I'm just going to create a project on my desktop and call it delete because I want to delete it later we're going to open this if you
get a warning about trusting the authors you're the author yeah I trust them then inside of here I'm going to create a Jupyter notebook so I'm going to just name it test.ipynb now with this notebook currently as we can see in the upper right hand corner it's using the base environment we want to create a new environment and so I brought up a terminal at the bottom and I brought it up in base and the nice thing about VS Code is that basically as it says here the selected conda environment was successfully activated anytime
we create new environments the terminal should automatically connect to the right virtual environment so I don't want to see this again I'm going to click don't show again so we're going to run this command up at the top for creating an environment with a specified python version so write conda space create and then I'm going to write a tack n (-n) and this is a flag to signify the name of the environment we want to create for this I typically name this the same as the project I'm working in to know that they're
coordinated so I'm just going to name this delete because we're going to be deleting it and then from there we're going to press space and we're going to call out python and then specify the python version the current python version that's stable is 3.11 so that's what we're going to go with now it's identified that it needs to not only install python as it has right here under the following packages will be downloaded but also a bunch of other different packages that may be necessary at the bottom it says do you want to proceed and
it has y or n the y is default so actually I can just press enter and it will start downloading them all now that it's all downloaded it says hey to activate this environment use conda activate delete and then whenever we're done we can deactivate it by using conda deactivate so I'm going to run conda activate delete delete probably wasn't the best of names for this because it's probably confusing to you but we're going to just go with it so now we can tell in our terminal that this delete environment is activated because
it's over here now in parentheses previously we had base for this command when we ran up here but now we have delete but this is only in our terminal right here that we have this activated up here in our Jupyter notebook we're still under base so we need to change that so I come up in here and I select this and as we notice delete is not appearing in here sometimes VS Code needs a quick reload in order to understand that it actually has this new environment available so I'm going to go to the
command palette by pressing command shift p or control shift p for Windows users and type in reload window and then go ahead and press enter to reload it now it's detecting kernels and it noticed whenever I got here that there's probably multiple ones so it has me now going through and actually selecting the kernel now when I do this it asks hey do I want to connect to a python environment or an existing Jupyter server we're not using Jupyter servers we're using python environments and then inside of here we have our base environment which was
there before but now also that delete one is popping up inside of here now with it selected let's actually test it out and when I print what's up data nerds it pops up saying running cells with this environment requires the ipykernel package and the ipykernel package is the package required for Jupyter notebooks to run so you do want this so I'm going to go ahead and click install I'm also going to close out of this terminal window but down at the bottom we can see that it is installing it from conda-forge
which is basically conda anyway we have it printed out now below so it looks like python's running fine now another command that you need to know is conda list if you remember from Google Colab we used pip list in order to list the packages installed for this we're going to be using conda list anytime that I have a bunch of stuff inside my terminal I can just type clear and it will automatically put me up at the top of the terminal I like doing that anyway running conda list it goes through and lists
all the packages in the environment of delete and there's a whole host of different packages installed within here specifically python 3.11.9 all right cool I'm going to clear this out now like I said I don't want to keep this environment this was just practice to show you I want to actually delete it now so to delete it we're going to use conda remove then the name flag provide the name and then the all flag to basically delete everything inside of it so conda remove we'll specify the name of delete and then we'll do the
flag it's two tacks and then all (--all) okay we're going to get an error message with this and it's going to say hey you cannot remove the current environment deactivate and then run conda remove again so you can't be inside of an environment and delete it at the same time sort of makes sense so I'm going to run conda deactivate anytime I deactivate I'm going to go back into the base environment so we're back inside of base I can run inside of here conda env list to see a list of all the different environments right now
we have base and delete and now I want to run that remove command anytime you've run commands previously you can just press the up arrow and cycle through all those different ones I'm going to cycle back to the conda remove one and now I can run it and it's going to say hey remove all packages in the environment the following packages will be removed pretty much all of them or actually all of them and by default y is selected so I'm just going to press enter and then from there it says everything found within
the environment including any conda environment configurations and any non-conda files will be deleted that's perfectly fine no is the default here so I need to specify yes and bam so clearing this out and then doing conda env list I can see now only the base environment is there all right so the only thing left to do is now just get rid of this project right here so after I close out since I saved it to my desktop all I have to do is just move it to trash now that we've mastered the basics of how to use conda
environments let's go through and actually set up the conda environment that we'll be working with for the remainder of this project with VS Code we're going to open back up that project that we created in that last video I saved mine under documents so I'll open it up and I'm going to create a new python notebook if you don't have one already called test and this is an ipynb type file now up in the right hand corner we can see that it says select kernel which is where we need to select our environment right now
if we navigate into the terminal we can see that using conda env list there's only one environment base so I'm going to create that new environment by doing conda create with that name flag and for the name I'm going to go with python course I know I typically said to use the project name but that's too long so we're just going to go with python course and then I'm going to specify the version of python that I want as of the filming of this video I recommend going with 3.11 3.12 is available and I'll get to why I'm not using that at the
end of this now I can also specify any other libraries I want to install during this so let's also install pandas during this it asks if I want to proceed yep and the last thing to do is just copy the statement for activating the environment and paste it right here so now we've switched from base to python course I'm going to go ahead and close this out we don't need that so now inside our Jupyter notebook I need to select that environment going to python environments right now python course isn't appearing so I'm just going to go to
the command palette with command shift p and reload this window let's try again and now we can see that python course is available running a simple print statement of what's up data nerds it prompts me as usual that I need to also install the ipykernel library I could have done that whenever I created this environment I was a little lazy and forgot to do that so now that's working let's test to make sure that pandas in fact did get installed we're going to go ahead and close out the terminal window here and we're going to import pandas as
pd running shift enter awesome all of it loaded now if you recall when we were working in Google Colab we also had these other import statements of from datasets import load_dataset and then also import the pyplot module from matplotlib as plt if we go and try to run this right now we're going to get an error message that the module is not found specifically it's going to flag on that first one of datasets not found opening up the terminal using control tilde I can confirm that these libraries
are not installed by running conda list and scrolling through it we can see that we have python in here and we also have pandas but there's no datasets or matplotlib installed well except for this matplotlib entry right here but that's not the full one that we want so I'm going to go ahead and clear this so just like we did that pip install we're going to do conda install and before you install anything make sure you're in the correct conda environment we're in python course we don't want to be installing this into base
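As a rough illustration of that check, here is a small sketch that reads the CONDA_DEFAULT_ENV environment variable, which conda sets when an environment is activated. The environment name python_course is an assumption (swap in whatever name you used), and the helper names are made up for this example. One caveat: a notebook kernel started by an editor may not always have this variable set, so the terminal prompt in parentheses is still the most direct thing to look at.

```python
import os

# Assumed environment name for illustration; change it to match your own
EXPECTED_ENV = "python_course"


def active_conda_env(environ=os.environ):
    """Return the active conda environment name, or None if it is not set."""
    return environ.get("CONDA_DEFAULT_ENV")


def safe_to_install(expected=EXPECTED_ENV, environ=os.environ):
    """True only when the active conda environment is the expected one."""
    return active_conda_env(environ) == expected


current = active_conda_env()
if current is None:
    print("CONDA_DEFAULT_ENV is not set, so no conda environment was detected")
elif safe_to_install():
    print(f"in {current}, okay to install packages here")
else:
    print(f"in {current}, activate {EXPECTED_ENV} before installing")
```

Passing the environment mapping in as a parameter keeps the check easy to try out with a plain dictionary instead of the real shell environment.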
so back to our command we have conda install we're going to add datasets and matplotlib we just add these and separate them by spaces you can add as many libraries as you want pressing enter it asks if I want to proceed yes I do all right and it finished the install so I'm going to clear that out and close out this terminal now let's try to run this again and it looks like we loaded both of these I'm just going to confirm real quick by generating a quick little plot of a list of
one two three four and bam there we have it so now we are all set up and ready to go to get back into the advanced section and start actually programming now I want to conclude this with one final example to demonstrate the importance of why we need these virtual environments specifically here this conda environment in order to manage and maintain these packages so right now using conda list inside of our environment of python course we can see that python 3.11.9 is installed which is what I'm recommending as of filming this now prior to this
I found an error if I tried to use the most recent version of python and let me show you what I mean by that so first I'm going to get out of this python course environment and I'm going to do this by running conda deactivate and we're going to be creating a new environment with the most recent version of python so we're going to do conda create we'll name this for convenience py312 and for this we're going to install the most recent version of python and previously I was doing python equal to 3.11
but technically I can just leave it blank and it's going to install the most recent version I'll show you this and so when we look at what's going to install here for python it's installing python 3.12 anyway continuing on to actually install this the environment has been created we are in the base environment I'm going to now activate this py312 environment I'm going to clear the console and also I'm going to switch the notebook to that py312 it's not appearing so I'm going to reload the window and I select another kernel python environments py312 okay
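Since the kernel picker and the terminal can point at different environments, one way to double check from inside the notebook is to ask python for its own version. This is a minimal sketch under that assumption, using only the standard library; matches_version is a hypothetical helper name, and the (3, 11) pair is just the version recommended in this video.

```python
import sys


def matches_version(required, actual=sys.version_info):
    """True when the (major, minor) part of `actual` equals `required`."""
    return tuple(actual[:2]) == tuple(required)


# Show what the currently selected kernel is actually running
running = ".".join(str(part) for part in sys.version_info[:3])
print(f"this kernel is running python {running}")

# Compare against the version the course environment is expected to use
if matches_version((3, 11)):
    print("matches the 3.11 environment")
else:
    print("this is not python 3.11, check which kernel is selected")
```

Only the major and minor numbers are compared here, so 3.11.7 and 3.11.9 would both count as a match.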
so now it's inside of here as I expect okay I didn't install ipykernel again sorry about that probably should have done that by default you should probably do that by default I think we're getting the picture here okay so it runs just fine printing what's up data nerds as expected it's going to fail whenever we try to import pandas and also try to import these libraries I'm actually going to put these in the same cell and try to run them again they're not going to run okay we need to
install these packages so we're going to do conda install pandas datasets and also matplotlib asking if I want to proceed yes I do okay so the packages are now installed I'm going to go ahead and clear this out and then close out of that all right so now we have this installed now I'm going to try to run this and unlike the previous time when we had python 3.11 we're getting this AttributeError specifically we're running into problems with the from datasets import load_dataset we can see this by it being
the first thing that's causing this issue that's what the arrow is pointing to and then these troubleshooting logs go through each step and we basically have this AttributeError readonly attribute issue well diving into it what is happening is this datasets library isn't necessarily fully compatible just yet with python 3.12 specifically a lot of its different dependencies like I think I looked up something like the library pyarrow yeah right here pyarrow aren't functioning properly anyway this is a perfect example of why conda environments are so important if
for some reason I wanted to test out some new features in 3.12 because python 3.12 has new features I could create a new environment and do it separately from this project where I have to use python 3.11 in order to get all my libraries to work so it maintains these completely separately and you don't have to worry about conflicts if I want to all I have to do is just go back to my python environments and select python course and then whenever I run this now with python 3.11 with
the same libraries installed it runs just fine so as you can tell I'm really passionate about understanding how to use these virtual environments and making sure that you're setting them up properly and using them friends don't let friends not use virtual environments so now it's your turn if you haven't done it already to go through and set up your conda environment for the remainder of this project we also have some practice problems to get you used to using conda in the terminal so make sure you make use of that conda cheat sheet with that see
you in the next one where we're diving back into pandas so in this advanced chapter we're going to start diving into a lot of the problems that we're going to be solving for our final project so we save a lot of time whenever we get there and don't have to do a lot of work because of that in the beginning of this video we're going to go over a quick overview of a lot of the different problems and insights we're going to eventually solve and then finally moving into
a new method of accessing data going beyond just that iloc method into the loc method so here I am with VS Code open I have the project folder of our final project in here I'm also going to go ahead and just select this entire thing and create a folder as you should for this advanced section also if you want to you could create a new folder for that basics section that we just went through and you could put all those different Jupyter notebooks inside of this folder here because we're going to eventually get to
uploading it to GitHub and it could show all your work for any of the work that you did in Colab all you have to do is go to that notebook and then from there select file and then down at the bottom it has download you want to download the Jupyter notebook version then with your notebook or notebooks downloaded you could just go and drop them into here and not only is it right here right now but it has all the different work that you did right inside your own editor so diving into a quick
overview of what we're going to be solving with our final project for this you're going to take the perspective of an aspiring data nerd looking to analyze top paying roles and skills in the field of data science I'm going to be analyzing all of this from the perspective of a data analyst from the United States but you can adapt this to your needs not only can you select what job title you want to use for this we identified eight different ones back in the basics section but also I have a host of different
countries you can choose from for the first problem we're going to look at what are the most demanded skills for data analysts and we'll be building a visualization to show the percent demand of a certain skill for not only data analysts but also senior data analysts to be able to compare them to each other next we'll move into how in-demand skills are trending for data analysts and for this we'll be evaluating how they trended over 2023 for the top five skills once we understand that we can then dive into how well do jobs and then
the skills pay for data analysts in the basics chapter we actually solved this first part by analyzing what the median salary was for these different job titles but now we're going to dive deeper into data analysts specifically to evaluate the top trending skills and what their salary is and a little spoiler python's one of the highest and then finally we'll move into our last visualization of what is the most optimal skill to learn for data analysts for this I wanted to look at not only what a skill is paying but also how in
demand a skill is so for this we're going to be using a scatter plot and we're going to break it down further when we get to this chapter all right enough yapping about the project let's actually get back into learning pandas and learning python I went ahead and created this new Jupyter notebook I'm titling it one pandas accessing data I just use this numbering of one as you can see from these folders here one two three so that way it gets organized automatically also you're going to notice on my screen whenever I start
a new Jupyter cell that it has this press command I to ask GitHub Copilot to do something now this is because I have GitHub Copilot installed which is an AI coding assistant I don't recommend you using it just yet but I just want you to be aware of why it says this on my screen if VS Code hasn't done it already we need to go ahead and select our kernel specifically using that conda environment of python course now just like we did in Colab we need to import all our libraries so pandas datasets
and matplotlib we need to then go and load our data set from the load_dataset function and then put it into a data frame and then finally do that one data cleanup that we're doing right now of cleaning up the job posted date converting it to a datetime loading this it takes about 36 seconds I'm not going to lie I'm on a pretty fast internet connection here and I have an M2 chip in my MacBook right here so it's pretty fast so those speeds may not be typical to what you may
see now I keep on getting this warning message down here of this tqdm warning of IProgress not found please update jupyter and ipywidgets this is basically a display that shows a progress bar underneath your loading it's pretty fancy you can go ahead and do this if you want but if you don't it's not a big deal you're just going to have this warning message warning messages are not that bad and inside my terminal I ran conda list and inspecting it not only do I not have Jupyter but I also don't have ipywidgets
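If you'd rather check from inside the notebook instead of the terminal, here is a minimal sketch using only the standard library. The helper name `is_installed` is made up for illustration, and the package names in the loop are just the ones mentioned above.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the current Python environment can find the package."""
    # find_spec returns None when a top-level package cannot be located
    return importlib.util.find_spec(package_name) is not None


# Hypothetical check for the packages behind the tqdm progress-bar warning
for name in ("ipywidgets", "tqdm"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

This only tells you whether the active kernel can see the package, so make sure the notebook is using the same conda environment you install into.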
so I'm going to run conda install jupyter and ipywidgets it asks me if I want to proceed I do and it finished installing it so running this again and finally this one's going to be a little bit faster because we already have the data set and we no longer have this warning message all right like I said that step is completely optional but now I don't have the warning message and I like that better so previously we learned that if we wanted to look at maybe the first row
of our data we could use that iloc method and then from there using square brackets we could use an index-based notation in order to look at it so in this case I'm looking at the first row or zero and we have it displayed back if I want to display the first 10 rows so 0 to 10 remember 10 is exclusive running this we have all of our data back from those first 10 rows now the one drawback of this if I want to get maybe the first column back in this I have to
use the index notation of this so in this case zero running this we get the results underneath it and it's not too bad but let's say something else running it back to show the full data frame let's say I wanted this job work from home column I'd have to go over one two three four five six and then remember that right it's a zero-based index so I'm going to do five right here and then we get it back I think that was correct yeah job work from home all right we were correct so that
works fine let's go back and look at that data frame one last time now let's say I want to actually include multiple columns specifically I want to look at these three columns right here salary rate salary year average and salary hour average this is a mess to try to count over to them and try to figure out and that's where the loc method is way better so on our data frame we now run loc and we still use that same notation using those brackets where the first part specifies the row index and the second part
defines the columns now if you remember from our data frame itself this index is already numbers if you will so we're still going to use that number notation for the loc method so when I specify this loc of showing rows 0 to 10 running this we have those first rows back one thing to note is that unlike iloc a loc slice includes the end label so 0 to 10 actually returns 11 rows so not much different here because we have that index notation of using numbers already but when we get to defining our columns we can not only define it by using something like the name itself so salary rate in our case
but we can also define multiple columns and use that slicing method and now identify I want to go from salary rate to salary hour average running this now bam we get all these columns back now this is a bunch of NaN values right here which is pretty useless for me for actually interpreting this so let's actually first filter it to include more values and for this I just listed a colon to basically select all the different rows and then from there we want to drop NaN values but specifically I care about dropping
it for this salary rate because if it's a salary yearly average there's year in this column and if it's salary hour average it'll have hour in this column so basically I just want to remove the rows that say none so I specify subset equal to salary rate running this again bam now like I said we got all our different values we can actually see a slice of the data itself so now it's your turn to give this loc method a try and that dropna that we just covered is actually a
sneak peek into what we'll be covering in the next section on cleaning up data sets but for now I want you focusing on the problems that you'll have to do which are focused on the loc method with that I'll see you in the next one so we have a mini little project we're going to work on in this section of data cleanup specifically we need to clean up our data frame and for this I have a coworker that wants access to this data but they have specific requirements first we have around 3,000 salary data points however
the data set's around 700,000 so this coworker wants those NaN or not a number values filled in with something like a median value additionally after we fill that in they want us to go through and remove any duplicate job entries so that way they can go through it look at jobs and then if a salary isn't there they can see what the median or expected salary is for it one quick note on this data cleanup everything we do is not going to be necessarily applied to the final data frame that we're going to
be working with in the final project this is more of a case of showing how you can use functionality to clean up data sets so for this and all the follow-on I'm not going to say it every single time but I created a new Jupyter notebook inside of my number two advanced folder and I've gone ahead and imported the libraries loaded the data and done the data cleanup of the datetime make sure that you have the correct conda environment selected so last video we learned about using that loc method and that we can
actually use the name of either rows or in this case columns to filter data and see it and when we dived into it we found that there were a lot of NaN values in here so we're going to be filling it in with the median values and this is actually pretty typical in something like machine learning if there are missing values you could go through and fill it in with the median and then you could train models on this so let's calculate the median values first we're going to do this for the salary year average
column and we can run just the median method on this running control enter it's around 115,000 similarly I can run this on the hourly average column and this one is around $46 an hour I'm going to set both of these equal to a variable so that way we can use it later so these variables are now saved now we need to get into how are we actually going to fill in these NaN values without replacing the current values that are in there well pandas has this fillna method and the first thing you
need to specify is value one note down here they've got a deprecation warning so you should always pay attention to that but that has to deal with the method parameter which allows you to do some sort of forward fill or back fill we're not going to be going into that but if you're curious about that they have special other methods for this that they include right here so let's get in and fill it but anytime I go to modify a data frame I like to keep my original one intact so I'm going to create a
new data frame called DF filled and I'm going to set it equal to that original data frame that we have and we're going to specify how we want to fill this in for each one of the columns we have to do each one of the columns individually and run the method on each so I'm going to specify the salary year average column and then I'm going to use the fillna method on this now the first thing I pass to it is the value so I calculated the median salary year up here so I'm going to
go ahead and just paste that right into here running control enter to see those first few values and I can basically see it got filled in but we haven't set it equal to itself so if we actually ran to look at this column we would see that it's still NaN so I'm going to go ahead and set it equal to itself and run control enter cool we're also going to do this for the hourly column as well which is very much similar syntax of changing the column names and then also that variable
here to hour running control enter so now inspecting the data frame of filled on those different columns we can go in and see that it filled in with these median values so the first thing that we need to do for a coworker is done now we need to move into the next step of dropping duplicates you guessed it we have a method for drop_duplicates and it returns a data frame with duplicate rows removed now this also allows you to specify column labels but we're not going to specify that just yet so I'm
going to create a new data frame of DF unique and set it equal to that data frame of filled now with this data frame of unique I'm going to remove those duplicates so I'm going to run the drop_duplicates method on it and like I said we're not going to specify anything inside of it running control enter we can see we have around 787,000 rows but how many rows did we actually drop while running this well what I did is typed out this code in order to compare the
length of these different data frames so first we see the filled data frame length then the unique and then finally it does a subtraction of the filled minus unique in order to see how many rows were actually dropped right now there's no rows dropped that's because when I did this drop_duplicates I didn't set it equal to itself to actually save it back so running it again all right now we have 109 removed and this means in this data set right now there are 109 duplicate entries and
that's based on looking at this entire data frame and finding out where there's repeats where all the values in one row are found in another I'll be honest this is pretty impressive that we just went through 780,000 rows in 7 seconds and found all these duplicates that quick now for dropping these duplicates we didn't specify a subset or we didn't specify a column we could go further and we are actually going to go further and select certain columns from here to then use those columns specifically to filter for if there's any duplicates so based on
what I know about the data set I want to filter actually based on two things first is the job title column which is very much unique as you see here we have like data analyst and the job title is data analytics and then on here we have data analytics but then we have manager of data analytics so it's a very unique thing that we can actually filter on additionally I don't want to filter out all the data engineers just because we match on this one data engineer so the other column we're going to look
at is company name so that way we're dropping repetitive jobs from the same company so we're going to continue to use that unique data frame that we created up above and once again we're going to run the drop_duplicates method on it this time in here we're going to specify the subset we're going to set it equal to a list of those column titles so specifically job title and also company name and then to actually show what work we did I'm going to copy these print statements up from above and paste them down
below so now let's actually run this bad boy so now based on this our original data frame's around 787,000 the drop duplicates is around only 500,000 so we dropped almost 280,000 probably repetitive jobs from different companies and with my experience of looking into this data set there's a lot of companies that will sometimes spam job boards with repetitive jobs so this is a good way of going through and cleaning it up if necessary so now it's your turn to play around with this drop_duplicates and .loc method in order to manipulate these data
frames and clean them up I have some practice problems ready for you with that see you in the next one we're going to dive deeper into not only cleaning but also managing different data frames in this section we're going to be focusing on two important methods that I use all the time for data management first we're going to go over the sample method which is pretty straightforward and then we're going to cover the copy method which although it looks easy on the surface is a lot more advanced so let's get into it so inside my notebook
I've imported all the libraries loaded the data and cleaned it up if you remember before we could use methods like head in order to see the first five rows in the data frame and then if we wanted to see the last five rows in the data frame we could use something like tail and that's going to provide these but I'll be honest this isn't necessarily representative of the data set itself and so like in this case the end result has mostly Germany postings so it's not necessarily a true sample of the data so similar
to those methods we have the sample method which returns a random sample of items from an axis of the object just like head and tail we can specify the number to sample or we can even specify something like the fraction as a float of how much of the data to return so running this method on our data frame we get back just well one result I'm going to put in that I want to get 10 results back and now we have a sampling of results and we can tell this actually by this index over in the left-hand corner as
it's very varied on which jobs are returned so this is super useful to use and every time I run this it's going to provide new jobs within this sample as we can see from this so if you have a certain example that you find through your sample you can output the same results every single time by setting this random state so I'll specify that random_state and we're going to set it to a value we'll just do 42 now whenever I run this the same results are going to return every single time so we
don't have to worry about any variation in what we're returning back just a little fun fact if you're ever running through code and see this 42 pop up which you will from time to time this comes from the book The Hitchhiker's Guide to the Galaxy and it's the answer to life the universe and everything just a little programming humor next method to cover is copy and if we run it on something like a data frame it makes a copy of the object's indices and data so we're going to start by creating a new data
frame called data frame original and all it's doing is loading our data set into that and we can just inspect it to make sure that I'm not lying to you and from this it has the 770,000 rows along with all the different data inside of it so copy creates a copy of the data frame but you may be like Luke why do I need to use this copy method when I can do something like this where I'm creating this new data frame called data frame altered it's equal to the original and then if
I were to print out the data frame altered underneath this running control enter I can see it once again has around 787,000 results and everything's in there so what's the purpose of this whole copy method well let's go back to that example that we did previously where we filled in that salary yearly average data that was missing those NaN values with median values so if we inspect that column of the altered data frame we can see that as expected most of them have NaN values in it looking at the first five rows I'm going
to start by calculating that median salary and setting it equal to basically the median method of that salary yearly average column of the altered data frame looking at the results we have it at 115,000 now with that altered data frame I want to fill in those NaN values in that column and I'm going to set it equal to itself and I'm going to fill the NaN values with that median salary and then finally we're also going to print out those values right below it okay so we can see that those NaN values previously are now
filled in with that median value of 115,000 but what about our original data frame technically those should still be NaN values but whenever we inspect it we actually find that the original data frame got updated just as our altered data frame did and I'll be honest this is not a good thing because technically I could have multiple different data frames that I've done multiple different calculations on and I wouldn't want those calculations on one data frame to affect the other so why is this happening well it has to do with how we
assigned DF altered equal to DF original when we use this assignment operator the data frame original had a data frame in it that had a unique ID this new data frame altered was also set equal to that same data frame and that's a bunch of theory so let's go ahead and actually prove this so with our original data frame we can run something that I've shown you before of that id function and this returns the identity of an object itself which is a unique ID number so if I look at the ID of
DF original it's this number and if I look at the data frame altered ID it also has the same number and if I'm too lazy to go through and actually compare number to number I can use the comparison operator to find out if it's true that these two IDs are equal to each other and it is in fact true so these variables are referencing the same data frame so let's actually show this in action I'm going to reload the data set so we have this unique ID for our data frame original from there I'm
going to now create that altered data frame again and I'm going to set it equal to df_original but this time I'm going to use the copy method with it all I've got to specify is copy open and close parenthesis run it now I made these fancy print statements where I show the ID of the original the altered and then comparing the two and with this we can see that the altered data frame ID is different and it also confirms this with the comparison operator saying that it is false and now whenever I go through
and calculate that median salary to then be placed within the NaN values of the salary year average column it's done correctly on the altered data frame and then for the original data frame this one is unaltered so from here on out you're going to notice us using this copy method any time we need to create a copy of the data frame and do some sort of operations on it this is basically to protect us to make sure that we're not causing harm somewhere else all right so now it's your turn to test out that sample method
and also copy method with some practice problems I have for you with that I'll see you in the next one if you've worked with either Microsoft Excel or Google Sheets pivot tables then this section of the video is going to feel right at home whenever we get to the project section we're going to be using this pivot table method in order to analyze what are the trends of skills over time and the pivot_table function makes this super easy to do however there's still a little bit of cleanup we need to do before
we can get to that so in the time being we're actually going to be performing a pivot on something else specifically we're going to be looking at the top three jobs of data analysts data engineers and data scientists and how the median salaries trend across the six most popular countries in our data set and pivot tables make this super easy to make so inside the pandas documentation we can see that we have this pivot_table method for data frame and there's four key parameters we're going to be using for this video values index columns and
aggfunc we're not going to overload you with all four at once instead we're going to focus on just two for right now specifically index and aggfunc for the index you provide a column or column name and that's what we're going to be grouping by and then for aggfunc you list the aggregation function like min max size whatever that may be so in a new notebook I've gone in and imported all the libraries loaded the data put it into the data frame DF let's run the pivot_table method on this and we're only
going to focus on two right now specifically I want to get the count of the job titles so I'm going to start by specifying the index which we're going to use that job title short and then that aggfunc for this I'm going to specify size because we want the size of the job title short column we're not going to do count so running this we get back what we expected now you may be like Luke I could just do a groupby on job title short and also run that size aggregation function then actually
running this and for this we get the same exact results I'm just going to stick with groupby well bear with me for a second while we learn even more so now let's say we not only want to group by that job title short column but we also want to aggregate it to get the median yearly salary well in this case we need to add this values parameter in order to specify the column or columns to aggregate if you have multiple you can provide it as a list so with that original formula that we had
before where I'm pivoting using that job title short and getting the size of it I'm just going to add values to the front of it specifying salary year average and then for the aggfunc I'm going to specify median because we want the median salary but now you may be like Luke I could do the groupby method on that job title short with the salary year average column as the column that we're going to aggregate and run that median method on that and we get back the same results granted this comes in a series but
I could make it into a data frame but we get the same results nonetheless and to that I'd say you're correct but this final one groupby can't necessarily do with a single line of code we're going to now be calculating what is the median salary for these different job titles but broken down by country and this brings us to the last parameter to cover which is columns this is the keys to group by on the pivot table column and just so we're clear index was what we group by on that index column so by rows
columns is what we group by on the columns so just modifying our original pivot table formula that we had previously I'm going to add in columns here and I'm going to specify job country running control enter I have a mistake right here I'm missing a little parenthesis pressing control enter we have it back and this has it by countries up at the top and the job titles along the index I'll be honest I don't really like wide tables I like long tables so what I'm going to do is I'm going to move around these index
and also columns I'm just going to have them trade places running this again because like I said index is for the rows columns is for the columns we've now transposed this if you will and have the different job titles up at the top and then the job countries down the index I feel it's a little bit easier to read although there's 103 countries to go through now so recall this is the graph we want to make now we want to aggregate and find the top six countries along with data analysts data engineers and data scientists with
their respective median salaries for each of these countries so we can compare it so in order to find out what those six top countries are I'm just going to do a value_counts on it and we can see it organized by this so I'm going to only do the top six and then finally for this I don't really care about those values from this I care about the index column from this so I'm going to do the dot index on this and we return back the index in an array so now for that pivot table
we just built I'm going to set it to that variable of DF job country salary and we need to go ahead and filter this data frame that we pivoted and get out those countries so for this I'm going to set it equal to itself and in order to identify what index values we're going to use remember we got these index values and they're like a string we can filter this using the loc method which is pretty cool so I specify in here top countries and then printing it out right below it so
we can inspect to make sure we've done this correctly bam we got the top six countries now I want to filter the job titles to include these three specific job titles and all we're doing in here is basically filtering the columns that we want from this because of that we don't need to use any type of iloc or loc method for this instead we're going to just pass in that list inside of here so job titles running shift enter we can see now bam we got this so all that is left now is we need
to plot it so we can run this plot method on it specifying the kind equal to bar because I want it as a bar chart so running this now we get zooming out we get this bad boy which is pretty impressive I mean look at this one line of code and because we pivoted it into a pivot table pandas understands how to plot it and it plots it super simply now I do want to clean this up a little bit by specifying the X and Y labels the title and then rotating the tick marks
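To recap the pivot steps in code, here is a minimal sketch on a tiny made-up data frame. The column names (`job_title_short`, `job_country`, `salary_year_avg`) and all the values are assumptions for illustration, not the course data set, and it keeps the top two countries instead of six so the toy data stays small.

```python
import pandas as pd

# Tiny made-up stand-in for the job postings data; column names are assumptions
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Engineer",
                        "Data Engineer", "Data Scientist", "Data Analyst"],
    "job_country": ["United States", "Germany", "United States",
                    "Germany", "United States", "India"],
    "salary_year_avg": [90000.0, 100000.0, 130000.0,
                        120000.0, 140000.0, 80000.0],
})

# index -> rows, columns -> column headers, values -> column to aggregate
pivot = df.pivot_table(
    values="salary_year_avg",
    index="job_country",
    columns="job_title_short",
    aggfunc="median",
)

# value_counts sorts by frequency; .index gives the country labels back
top_countries = df["job_country"].value_counts().head(2).index

# Filter rows by label with loc, then pass a list to keep certain columns
job_titles = ["Data Analyst", "Data Engineer"]
pivot = pivot.loc[top_countries][job_titles]
print(pivot)
```

From a frame shaped like this, calling `pivot.plot(kind="bar")` is the one-line bar chart step described above.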
but then running this bad boy I get something that I'm super proud of with this we can do some trend analysis right now I'm finding countries like India and Germany have some of the highest paying jobs for data analysts engineers and scientists although it looks like Germany slightly tops out not going to lie this is pretty tempting for somebody living in the United States to move back to Germany I've been there before loved it so may strongly consider it so pivot tables can save you a lot of frustration and make it super
simple to graph insights very quickly now one note you could have potentially done this with groupby by running a for loop basically aggregating all the different job titles and all the different countries that's a mess pivot tables save you from having to write even more lines of code and make your job easier so make use of them all right we got some practice problems for you to dive into with that see you in the next one so far we've learned a lot about how we can manipulate data frames either
to filter them and then also do things like pivoting them and when we do operations like this it kind of jacks up our index and so we need to know certain methods in order to fix our index back so jumping back in we have our standard code for importing libraries loading the data set and then also performing data cleanup inspecting the data frame using the sample method that we learned previously we can see the index over here on the left it doesn't have a name and then everything's off to the right but this index
is obviously numbers let's inspect it more we can use df.index which accesses the index attribute and it tells me it's a range index going from zero to the last number and it's stepping by one which makes sense although I did sample here we could also do head and show that yes it is in numerical order just like this range index is claiming another attribute we can look at is df.index.name to see what is the name and there's nothing here it's blank right now now attributes are basically just a variable assigned
to a particular object in this case the data frame so name is technically just like a variable holder so if I wanted to in this case I could assign a name to this attribute of let's say job index running shift enter and then printing out the data frame below we can see that we now have basically a new line inserted in here with job index on it and we can go further now and actually investigate the items another attribute we can look at is the data type so I can use df.index.dtype and we're
going to get returned back that this is an int or integer because these are integers in here for this index so now that we have that refresher out of the way we're going to be focusing on diving deeper into these three methods specifically reset_index set_index and sort_index and we're going to go through explaining them on our data frame so let's begin explaining it starting with reset_index first so let's say I want to create a new data frame that's filtered for jobs particular to the United States so I create a new
variable called DF USA and I set it equal to our data frame filtering it on DF and then the job country setting it equal to the United States so running this and printing it right below I can see that yes it in fact did filter for the United States but we have a problem here the job index here doesn't go in increments of one now because we've filtered out some of these countries we have missing values in here so we can use the reset_index method for this and as expected it resets
the index now there's a host of different parameters we can use with this but we're going to focus just on inplace and that's whether to modify the data frame rather than creating a new one right now it's set to false we're going to set this to true so we just don't have to create a new variable so with our DF USA data frame I can type in reset_index and then inside of here I'm going to specify inplace equal to true and then I want to show it right below it so I'll put
this in and bam all right so now with this we've kept our job index in here but now we have this new numerical index we can see it goes down to right here to the end of 2,600 rows so it has what we want in this it's a reset now technically we could go through now and drop this job index if we will but I would actually argue that I'd want to keep it let's say in the future when we learn about something like merging data frames and I want to use an
index to merge it back I can then use that in this case so I'm going to keep it for the time being the next method to look at is set_index and it sets the data frame index using an existing column so we can take an existing column and just move it over to the index now for that recently altered data frame let's say for some reason I wanted to go back to that job index really we can go to any column but we're going to go back to job index so I can specify a
data frame using the set_index method and the first thing to specify is the keys specifically in this case we're going to specify that column title we want to set as the index so I'm going to call job index here similar to before I'm a little lazy and I don't want to put that variable at the beginning of this so I'm going to just say hey inplace equal to true also really now I was going to do this on the original data frame we want to do this with the USA data frame
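As a rough sketch of reset_index followed by set_index, here is a small made-up example. The data frame contents and the `job_index` name are assumptions for illustration only, and `.copy()` is used so the filtered frame is independent of the original.

```python
import pandas as pd

# Small made-up data frame; the values are for illustration only
df = pd.DataFrame({
    "job_country": ["United States", "Germany", "United States", "India"],
    "job_title_short": ["Data Analyst", "Data Engineer",
                        "Data Scientist", "Data Analyst"],
})
df.index.name = "job_index"

# Filtering keeps the original labels, so the index now has gaps (0, 2)
df_usa = df[df["job_country"] == "United States"].copy()

# reset_index moves the old index into a column and adds a fresh 0..n-1 index
df_usa.reset_index(inplace=True)
print(df_usa)

# set_index moves a column back into the index; the unnamed 0..n-1 index
# that reset_index created is discarded
df_usa.set_index("job_index", inplace=True)
print(df_usa)
```

With inplace equal to true nothing is returned, so the data frame has to be printed on its own line to see the result.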
so running this and then printing it out right below it to actually inspect it we can see that we now have that job index back in here and inspecting it we can see that it just drops what we previously had up here this index column that was not named it just completely drops it so something to think about whenever you use this method is that we lose data sometimes the last method to cover is sort_index this returns a data frame sorted by label in this case we don't need to necessarily specify the
index because well it's sorting the index now with the operations we just did on our current data frame DF USA these values remained in order so sorting the index here I don't really find a use case for it however I do find a use case for this whenever I'm pivoting or grouping data specifically let's look at this previous example we did where we pivoted the job title short column to find the median salary and then also added in the min and the max running this cell we can see that in fact the index is
job title short has all these different values in it now we could sort by the median values and we saw this previously using the sort values method for this I can call median pivot and then use the sort values method first we'll need to specify buy and this actually has two levels of column titles of median and then next of the salary year average so I actually need to provide this as a tuple of median and then salary year average running this we can see that we get this but we get in descending order so
instead I can specify ascending equal to True just kidding ascending equal to False had that backwards okay so now we have it in descending order anyway that's just to demonstrate how to sort by the median well similar to this we can now use the sort_index method in order to sort these values so I specify that median pivot and then do sort_index and then running this we get these values sorted and they're sorted here in alphabetical order if I wanted to make sure that we actually modified the current data frame that we have I could once
again use inplace and set it equal to True and then it's not going to print it below so now I have to actually show it below running median pivot and bam now one note you could technically put inside that sort_values method that we did that job title short since the index does have a name and whenever you run it it would sort by it right now it's doing it in descending order but it can still work so you're like well when would this be a use case then and that's when
maybe the index doesn't have a name similar to what we saw up here when we first ran the data frame and we didn't have a name for it we definitely want to use the sort_index method in this case when it doesn't have a name so being able to manipulate indexes is going to get you out of trouble from time to time and also keep your data in a very logical and structured manner so I got some practice problems
for you to go through now and play around with these different index attributes and also these different index methods with that I'll see you in the next one we're going to dive into inspecting what the demand is for certain job titles over the course of a year I'm going to do it specific to the United States we're going to end up getting this graph right here and it's going to be building on a couple key concepts that we've just learned being able to group our data by month and job title but also being able to
sort it properly so we can get those months in the proper order so as always we imported libraries the data set and cleaned it up so I'm going to filter original data frame for that job country of the US you can do whatever country you want and then as good practice right as we learned earlier we're going to use that copy method with this so we created a new unique data frame with our data frame we're going to now extract out the month values from this like I said these are all from 2023 so we'll
extract out the month but I don't want the month number I actually want the verbiage of like January February March so I'm going to create a new column of job posted month and I'm going to set it equal to the data frame of the US specifically that job posted date then we're going to use that dt accessor and then specifically scrolling through what's available we're going to use strftime to now format it properly in order to get the month name we need to use the percent sign and then a capital B as a refresher
if you're curious about all the different codes you can go into the documentation and actually investigate what all the different percent signs and the letters next to them actually mean in our case we're using this one of month as locale's full name and then it shows little examples to the right remember this has to be inside of quotes because you're providing the argument as a string I got a little bit of a typo right here I'm going to not call that monthy and once we do this I actually want to see
this data frame now to inspect it going into it we can see we now have this job posted month right next to it and it looks like it didn't work and I'm an idiot I modified the original data frame of df should have modified the US one only this is going to be fine for the time being we're just going to continue running this now scrolling over we can see okay we have all the different months in there everything looks good so now we need to pivot our data in order to get it in the format necessary to
plot it specifically I want to put the job months along the index and along the index for a specific reason and then along the columns I'm going to have the different job titles remember we're going to be aggregating the count of these different jobs over time so on the correct data frame this time I'm going to run that pivot table method on it first we're going to specify the index we're going to specify it as a job posted month the next thing we're going to specify are the columns that we want to use for this
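A minimal sketch of the month-name column and the pivot being set up here, counting postings with aggfunc set to size. The sample rows and the snake_case column names are assumptions, and fill_value=0 is an addition not mentioned in the video so that empty month and title combinations show as 0.

```python
import pandas as pd

# Hypothetical sample of US postings; column names are assumed, not confirmed.
df_us = pd.DataFrame({
    "job_posted_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-11", "2023-08-02"]
    ),
    "job_title_short": [
        "Data Analyst", "Data Engineer", "Data Analyst", "Data Analyst",
    ],
})

# %B formats each date as the locale's full month name (e.g. January).
df_us["job_posted_month"] = df_us["job_posted_date"].dt.strftime("%B")

# Months on the index, job titles on the columns, row counts in the cells.
df_us_pivot = df_us.pivot_table(
    index="job_posted_month",
    columns="job_title_short",
    aggfunc="size",
    fill_value=0,  # assumption: show 0 instead of NaN for missing combinations
)
print(df_us_pivot)
```

The resulting index comes back in alphabetical order of the month names, not calendar order.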
specifically we want to use that job title short column and then finally we want to find the counts of all those different job postings or job title shorts in each month so I'm just going to set aggfunc equal to size so we're basically counting how many job title shorts there are running this we get our data frame with job titles at the top months along here and then all the different counts in between now look at this look at job posted month it's conveniently in alphabetical order but that provides us no
use because we don't want it in alphabetical order we want it in chronological order and as we showed in the previous video if I try to sort this I can either sort it ascending or descending there's nothing in there for me to sort it chronologically so we have some cleanup work to do and these types of tasks always take up the majority of my time when I'm trying to actually make these visualizations like I feel I'm 99% of the way there but I've got to actually clean it up more anyway we're
going to take these different job posted months and get the actual number associated with each month and then sort by that so first let's actually set this to a variable of df_US_pivot so the first thing we're going to do is actually reset that index and that's to push that month in as a column so that way when we add the numbers we can then sort it by those numbers running it like this I can see it basically flattened it out so that way that job posted month is now
in there I'm going to use inplace equal to True to set this now I want to create a new column for this specifically that number of the job month so on this data frame I'm going to create this new column called month number and similar to up here when we cleaned up our data to get the job posted date and used this pd.to_datetime function we're going to do the same thing but instead we're going to provide it that month value along with specifying how we previously
formatted it so it knows what's going on there in our case scrolling back up we can see we used this percent capital B so I'm going to just copy that and paste it into here now let's see what we actually get out of here whenever we do this I'm going to paste this right below it and run it with control enter first of all it doesn't have a year in there so it automatically just assumes the year is 1900 and the day is the first but it does have all of those different months in there so now what I
want to do is extract out only that month so we use that dt accessor and specify we just want the month running this now we get the month numbers from it looking at the entire data frame we now see we have this added month number in which August is 8 yep so I know this is good so now we need to sort our values by the month number now if I try to run this it's going to sort it by the month number but if you notice we got another index put into here and
that's because I'm running all this code in one cell and I have this inplace equal to True so basically every single time I run the cell it resets the index again and adds another index column in here and eventually we get an error so if you're running this in the same cell you have to basically do a run all that's why I actually recommend probably breaking this out into different cells anyway we have it sorted by our month number so I'll also set this to inplace equal to True and the next thing we need to do now is
actually set the index of this and for this we want to use that job posted month column running all we can see that it's now sorted by the months they're in the correct chronological order but now we have this month number column that we don't need so once again we run df_US_pivot.drop because we want to drop that column and we specify that month number and once again I want to do this with inplace equal to True it looks like I got an error month no not found in axis
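The cleanup steps above (reset the index, derive a month number, sort, restore the index, drop the helper column) can be sketched as one chain. This is a sketch under assumptions: the frame contents and the names df_us_pivot and month_no are hypothetical, and it swaps the repeated inplace=True calls used in the video for method chaining so the cell can be re-run safely. Note that drop needs columns= to target a column, since by default it looks for the label among the row labels and reports it as not found in axis.

```python
import pandas as pd

# Hypothetical pivot: month names on the index, counts per job title.
df_us_pivot = pd.DataFrame(
    {"Data Analyst": [80, 95, 120], "Data Engineer": [60, 70, 65]},
    index=pd.Index(["August", "February", "January"], name="job_posted_month"),
)

df_sorted = (
    df_us_pivot
    .reset_index()  # move the month names into a regular column
    .assign(
        # Parse the full month name (%B); the year defaults to 1900,
        # which is fine because only the month number is kept.
        month_no=lambda d: pd.to_datetime(
            d["job_posted_month"], format="%B"
        ).dt.month
    )
    .sort_values("month_no")       # calendar order instead of alphabetical
    .set_index("job_posted_month")
    .drop(columns="month_no")      # columns= is required to drop a column
)
print(df_sorted)
```

Because nothing is modified in place, running the cell twice produces the same result instead of accumulating extra index columns.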
we need to specify this as a column running this all again bam so I'm actually going to inspect what this looks like now I don't like this job title short label and I also realized I made an error whenever I set that index I didn't do inplace equal to True so I'm going to see if this actually fixes that problem with that job title short okay it did so now we have our job posted month and that is on the index like we want it so it's going to make it super easy
to plot this along with all their different job titles so now we can just jump right into plotting it I'm going to do df_US_pivot and I'm just going to plot this bad boy as a line chart and we get this and this is actually a little overwhelming in the amount of data here but at least the months are in order right now and there are eight of the job title shorts on here it's a little cluttered so actually I want to clean this up to only have the top three jobs or the top
three jobs that have the most counts so for this I just want to get a list of the top three in here and I want to get it into a list because then I can use it to filter this data frame further so with this I'll do top three equal to and I'll use our original US data frame for this specifying that job title short column I'm trying to find the counts of all these different job title shorts so we'll use the value_counts method on it and I just want to see the top three because that's
all I really care about running this so I can actually see it below it we have these available and remember we can also use the attribute of index in this case with what we just ran and so we have it here it's an Index currently but I can convert it using the tolist method and now we have it as a list so I'll just set this equal to itself to basically reassign it and we have this list available so now whenever we go in to actually plot this I can specify inside of
brackets that top three list and then from there once again plot it specifying kind equal to line okay so now we have those top three it's not as cluttered and we can actually do some analysis on this right now inspecting it we can see there's a super high amount in January and also a burst here in August so the January does make sense for the fact that that's normally when new budgets are approved for the year and so they ramp up hiring as far as the August I'd have to dive into this
further on why that is sort of like an outlier here also engineers are very consistent they don't conform to the same pattern as analysts and scientists that's a unique insight now there is some cleanup I would do with this like putting a title and improving the labels on this and yeah now we have our final graph here which has an appropriate title and appropriate labels for each of the axes so bam we just demonstrated a great use of not only pivots but also when and why we need to manipulate indexes in order to
potentially aggregate and get months in a certain order or other values that you may encounter all right so we're halfway through this pandas section in the advanced chapter we only have a few more sections left specific to merging and concatenating data frames basically combining data frames very very useful all right with that see you in the next one from time to time you're going to find that your data is incomplete so we're going to need to merge it with new data take for example that previous example we did where we were analyzing data science jobs
over time how they were trending what happens now if we wanted to combine this data with jobs in the tech industry specifically those of developers and see how they trended with data science jobs well we can use merge for this we're going to dive into how to build this visualization after we first go over a simple example to understand how to use the merge method now I have two data frames for this data frame jobs is only five rows long and it's very similar to our data frame that we currently have with things like job
title company name and job location the second data frame is not similar to what's in our data set currently and it's something we may have to Source at some point and this is company information specifically I have information like the company name industry and Company size there's a whole bunch of third party Services out there that will collect data like this and provide it for you and then we could technically merge this with our source data frame so for this we can use the merge method and we'll be using the core data frame of our
data frame jobs as the left data frame and then we can see from the arguments and parameters we can provide it we need to provide a right one which for right it is the object to merge with so it's the data frame on the right we're going to go into this how parameter in our second example so hold off on this for the time being but the other parameter I want you to focus on is this on and it's the column or index level names to join on in our case both data frames have the
same name of company name so we can just use this on in the case that you had different names in each of the data frames you could specify left_on for the left data frame and right_on for the right data frame in this example we're only going to use on so with df jobs I'm going to run the merge method I'm going to provide it that df companies and we're going to specify on specifically company name running control enter I get the job ID job title company name and job location from the
original data frame and now it's merged with that companies data frame with industry and company size pretty cool so let's move into a more realistic example that you're going to be able to follow along with in the notebook I've gone through and imported all the libraries loaded the data and cleaned it up just like we've been doing now in the last video we made this data frame right here where it analyzed jobs in the US specific to different job titles and how the counts of those different job titles trend over the different months of the
year specifically in 2023 we're going to be using this data frame from the last exercise if you start a new notebook bring that code into this one but with this it's somewhat Limited in the fact that we only have data science job postings this doesn't include necessarily All Tech job postings and a commonly associated field to data science is software development so there's a lot of jobs in software development that I maybe want to compare these different jobs to well I have access to this data truth be told it's completely fictitious so don't take any
of these numbers for granted but it has a similar schema or setup as our other data frame we have a job posted month and then in it it aggregates the count for four different types of jobs specifically front-end developers back-end developers full-stack developers and UI/UX designers so because I have this data we can use the merge method in order to merge it with our original data frame so I've made this data publicly available so you don't have to try to recreate it and it's available at this URL of https colon then lukeb.co slash
software CSV you need to include the https at the beginning in order for it to cue in that this is a website that you're pulling the CSV from anyway we're using the read_csv function of pandas and whenever we import it in we can see it has the job posted month and then all the different jobs right now it has a default index in here whereas if we look at our original data frame the job posted month is the actual index so we need to fix that I can either do something like set_index
or I can just specify index_col which is the index column and specify that it is job posted month running this now we have job posted month shifted to the index also I'm going to actually assign this to a variable and I've named it df_US_software_pivot real original I know all right so it has everything we need in it now we need to get to merging these two data frames so with our data frame of df_US_pivot I'm going to run the merge method on this specifying the right one so
df_US_software_pivot and for both of these we're using that index of job posted month running control enter I can see now that we've combined all of these and now we have all our data science jobs but also our software developer jobs as well now going back to that merge method we sort of glossed over this how parameter and how it's defaulted to inner there's a host of different options available of left right outer inner and cross now if you're not familiar with joins they're also commonly done in something like SQL and this is
a visual representation from my SQL for data analytics course where I represent left joins right joins inner joins and full outer joins anyway I'm going to link to the timestamp in that SQL course where I go more in depth on how to do these left joins right joins inner joins and full outer joins so if you're not familiar with these types of joins check that out real quick and then come back here for the time being we're going to leave this how parameter as inner as it's the most common method to actually
join data frames and so we're not going to really even touch it but I do find myself from time to time also having to do right joins it's another commonly done one as well that's why it's important you understand the difference between inner and right one other resource I recommend for this is that pandas cheat sheet that we talked about earlier I'll link it below and over here on the second page underneath combine data sets it goes into different visual representations of how you can do left right inner and outer joins with merge the cheat
sheet's pretty handy anyway back to our problem we want to now plot the top five jobs based on count of jobs in the year and see how they trend over time so the first thing I'm going to do is set this equal to a variable of df_US_merged and then we're going to create a list of the top five jobs and then once we get these top five jobs we can then filter the data frame to include only those so it makes it easier to plot so with this df_US_merged we're going to do a
sum and I'm just going to go ahead and print this out right below it so we can see what we're doing as we're going along but whenever I sum them it's going through and summing those columns it's really convenient that it's in a pivot table already and we get the total counts of all these things and this series isn't in descending order so we actually need to use sort_values and for this we're going to specify ascending as False running this I now have it sorted in a descending manner we only
want the top five so I'll run head five actually I don't even think I have to put a five in there I don't it's automatically five and then we want the index because this is a series and so the numbers are the values and then the index is the names so I just want the index back okay so I have the index back and I'm just going to go ahead and also run another method on this of tolist now a little pro tip whenever it's getting this long when we're appending this many methods and
attributes with this so we can actually put this all within parentheses whenever I run this it's going to run just fine but the fact that it has parentheses allows me to now break this up into multiple different lines so in this case we can more programmatically see every step along the way and running it it still works and so if I want to include comments along the way I could now in this manner as well and so like in this case get the index of the top five all right enough getting off track let's actually
plot this bad boy for this I'm going to specify my data frame of df_US_merged I'm going to specify the column names of the top five I have a little typo here change that to merged and then for the plot we're going to make this a line plot running control enter bam we now got it although I want to clean it up a little bit now I want to add some different things to fix this chart right now because like the legend's over the top of things it needs some titles and things like that
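A minimal sketch of the merge and the top-five selection walked through above. The counts are made up, the snake_case variable names are assumptions, and left_index/right_index is one way to join on the shared job posted month index (the exact arguments used in the video are not shown in the transcript).

```python
import pandas as pd

months = pd.Index(["January", "February", "March"], name="job_posted_month")

# Hypothetical monthly counts for data roles and software roles.
df_us_pivot = pd.DataFrame(
    {
        "Data Analyst": [100, 80, 90],
        "Data Scientist": [70, 60, 65],
        "Data Engineer": [50, 55, 52],
    },
    index=months,
)
df_us_software_pivot = pd.DataFrame(
    {
        "Front-End Developer": [150, 140, 145],
        "Back-End Developer": [130, 120, 125],
        "Full-Stack Developer": [90, 85, 88],
        "UI/UX Designer": [30, 35, 32],
    },
    index=months,
)

# Join the two pivots on their shared month index (how defaults to "inner").
df_us_merged = df_us_pivot.merge(
    df_us_software_pivot, left_index=True, right_index=True
)

# Wrapping the chain in parentheses allows one step per line, with comments.
top_5 = (
    df_us_merged
    .sum()                         # total postings per job title
    .sort_values(ascending=False)  # largest totals first
    .head()                        # head() returns 5 rows by default
    .index                         # the job titles are the index labels
    .tolist()                      # convert the Index to a plain list
)
print(top_5)
```

From there, df_us_merged[top_5].plot(kind="line") draws only those five columns.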
so I had this title of monthly job postings for top tech jobs in the US an X label for the months to include that from 2023 job count for the Y label a y limit basically shifting it up to 20,000 cuz right now it's about 14,000 so that we can get that Legend to appear making sure the legend appears and now let's plot it all right this is actually more readable and is something we can actually use now and with this we can see some certain Trends specifically we can actually now see visually that front-end
and back-end developers have a lot higher job counts and this is pretty realistic because developer jobs are usually more apparent or more frequent than data science jobs and then from there it looks like full-stack developers are more in line with data analysts and data scientists none of them have really any outrageous values throughout the year they look like they're pretty steady all right so now it's your turn with those practice problems to dive into merging some data frames now as a reminder if you're not as familiar with inner and right joins I highly
recommend you check out the following URL from my SQL course in order to learn more about all these different join methods because they are going to come up from time to time whenever you're merging data with that see you in the next one back when I was a junior data analyst and only knew Excel I would get these monthly reports and I would have to go through every month and aggregate them and pull them into what I was maintaining as the master document that had all the data not going to lie this was a pain
in the ass but knowing what I know now about Python and concatenate this function alone makes it super simple to bring together multiple documents we're going to be working with this simple example first basically just to show the power of this function so here I have two data frames very similar to our jobs data frame and in it I have January data and in the one below it I have February data I want to go through and actually combine these two data frames now if I were to merge these two data frames together
which I don't recommend if I were to do that that's going to leave us a result that is not really what I wanted I wanted to actually stack these data frames on top of each other here I've created completely new columns and made this data a lot wider and basically unusable for what I need it for now concatenate as shown here in the bottom half of this reshaping data section shows that we can combine data frames that have in this case all the same columns and this is known as appending rows of data frames
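A minimal sketch of the contrast described above: merging two monthly frames widens the data, while pd.concat stacks them. The frames, their columns, and the outer merge on job_id are assumptions chosen to illustrate the effect; the transcript does not show the exact merge that was run.

```python
import pandas as pd

# Hypothetical monthly frames sharing the same columns.
df_jan = pd.DataFrame({
    "job_id": [1, 2],
    "job_title": ["Data Analyst", "Data Engineer"],
    "month": ["January", "January"],
})
df_feb = pd.DataFrame({
    "job_id": [3, 4],
    "job_title": ["Data Scientist", "Data Analyst"],
    "month": ["February", "February"],
})

# Merging duplicates the shared columns with _x/_y suffixes: wider, not longer.
wide = df_jan.merge(df_feb, on="job_id", how="outer")
print(wide.columns.tolist())

# Concatenating appends the February rows under the January rows.
stacked = pd.concat([df_jan, df_feb])
print(stacked)
```

In the stacked result each frame keeps its original row labels, so the index runs 0, 1 and then 0, 1 again.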
this is what we're going to be focusing on you could also technically use it to append Columns of a data frame but now we're getting into more of doing merge operations and I would recommend using merge instead for that so concat is a function of pandas and we're going to call it by calling pd. concat and for this we provide objects and objects in this case is a sequence or mapping of series or data frame objects we're basically going to provide it a list of the data frames the other thing to note is that this
axis automatically defaults to zero so basically we're going to be concatenating by appending the rows of the data frames and not the columns the other thing to note is copy and copy is defaulted automatically to create a copy of these so it's automatically set to True so we don't have to worry about it altering that original data frame so let's combine these two data frames I'm going to call pandas and then that concat function inside of here I'm going to provide a list of those data frames so job posting January and then
job posting February now running this we have our two data frames combined I can see up here at the top we have our January data and right below we have our February data one thing to note about this is the index column right here goes 0 1 2 3 4 and then repeats again if we wanted to create a new index we could set ignore_index equal to True running this again it now goes from 0 to 9 in this case so let's jump into an example that you can follow along with we've gone
through and imported our libraries loaded the data and cleaned it up so right now we only have one data frame inside of here and in this case it has looking at that job posted date it has an assortment of months so I have an idea we're going to create some fake data in order to actually do this concatenate function on it specifically we're going to make individual data frames for each month so the first thing we need to do is go about creating a column that we can identify each month by so I'll create a
new column called job posted month real original and for this I'll use the job posted date to extract this from running that dt accessor on it and specifically then the method of strftime I'm going to specify to only provide the month remember before we did that percent capital B and that provided the full month name I just want a three-letter month name and so we're going to do lowercase b running this data frame below so we can actually inspect what went on here scrolling over we can see we have this job posted month and
now we have all these different three-letter months with it all right that looks good so now we need to break this up into 12 different data frames because this data frame has all the months of the year and I feel the easiest way to do this is to store them inside of a dictionary which we haven't really done before here's a look at what I'm trying to create with this dictionary for it we have our different keys which are those month values that we have in there and then for the values the data
frame itself is inside there a dictionary's values can contain any data type and in this case they can contain a data frame so we can create this dictionary because right now it's not made we can tell by these yellow squiggly lines we can make this with a dictionary comprehension but the first thing we need to do is get this list of month names so I want to create a list of all the different months so I'm going to create this variable called months and in it I'm going to get the unique
values of the job posting month using that unique method running months right underneath it I got a little bit of a typo not job posting month job posted month running that again we can see we have an array of all the different months available so now we need to start building that dictionary comprehension for this we're going to start simple and then build on it I'm going to create a new variable called month I'm going to use it for not only the key but also the value and then the main part of the dictionary comprehension
is that for loop so for month the variable we defined in months okay so this is the key and value and then it's cycling through using the for loop running this I can see inside of here we have a dictionary with the months as the keys and the values as the same thing but as shown up here we want to have the data frames inside of there so for the value I'll specify it as the data frame but I want to filter that data frame based on the month so I'll specify inside of
square brackets the data frame again job posted month and then use the comparison operator to compare it to make sure that it equals to that month so now whenever we run it we have a hot mess um because it actually it's not going to display it like we had up here with the data frame variable name listed all nicely and stuff it's actually going to show like December and then here is the actual data frame showing it right after so I need to now assign this to a variable and I'm going to call this dict
months and I want to go ahead and actually see one of those months so for dict_months I'm going to access one of the keys we'll just go for January and now we're getting that data frame shown back in a lot better manner and this includes or should include all the January data which scrolling through it yep looks correct so all this work was to create some fake data so we can run this concatenate function so let's actually get into concatenating so for this my boss needs all the quarter one data specifically
January February and March so I'll go ahead and start by running the concat function and then specifying in a list all those months of data so starting at January 1st adding February and then finally March and similar to before we want to ignore the index so that way it resets for all of it running this pressing control enter scrolling over job posted date we have January up at the top March down the middle I don't really know if it worked just fine I'm going to actually run something else so I'll assign this to a variable
of df q1 and for this I want to inspect that job posted month and see the value counts of it specifically we're going to actually run a plot on it I like to see things visually running control enter we have well looks like pretty good right our January data February data and March data as expected January is a little bit higher than February and March as we've seen before in our EDA so this matches with what I expect to now give to the boss and I'm confident in
it so concatenate along with merge are two powerful functions and methods that I use all the time in order to combine different data frames and being able to understand the use case of each of those individual ones is super important in data analytics all right so now I have some practice problems for you to go through and practice this concatenate function and with that I'll see you in the next one so in that last example we were able to extract out those quarter 1 results into its own individual data frame the next step that
I'm sure the boss actually wants is an Excel file of all this data so that's what we're going to do in this section by exploring all the popular methods that I use in order to export my data to a file that others and even myself can use so for this example we're going to be picking up right where we left off where we created that data frame of quarter 1 results using the dictionary that had January February and March in it and this data frame has over 220,000 rows so we're going to be exporting a
lot of rows to a file so let's explore what methods are actually available to export our data and I'm going to do this by typing in to and then underscore and this provides a whole host of options that we can use some of the ones we're going to be highlighting here are to_clipboard to_csv and to_excel and then we'll also talk about things like to_sql pickle and also Parquet files so let's start out with this to_clipboard method and I'm not going to copy the entire data frame to my clipboard because
I don't want the thing to crash so I'm going to just specify head in this case to only copy five rows going back to to_clipboard I'm going to specify in this case sep equal to comma so that way it knows hey whenever you copy this put commas in between each of these values running control enter it is complete so I'm assuming it's copied to the clipboard so I'm going to go ahead and add a markdown cell underneath here so inside of here I'm going to then use command V
to paste, and we get back the first five rows of data from the clipboard. But let's move into a method that I actually use all the time, and that is to_csv. I'm frequently doing some sort of analysis and creating different DataFrames, and I want to export and save that DataFrame in case I want to access it and analyze it in the future. Anyway, inside of to_csv, we're going to specify the file name that we want to save to, so in this case I'll just name it quarter 1.
Running Control+Enter, I get a green arrow, and it went through and saved it inside of here. Now this is a comma-separated values file, if you will, but it doesn't have the correct notation. Technically, whenever you run this, you need to include the file extension that you want to save it with, so in this case I want to save it as a CSV. Running Control+Enter again, now I can see that it's saved here as a .csv file. I'm going to go ahead and delete the first one because saving without an extension is not good practice.
Going in here and inspecting it, it looks like it has all the data; I'll actually scroll all the way down, and we can see we have nearly 221,000 rows in here. And so now if I wanted to read that back in, all I would have to do is use the pandas read_csv function and specify this file of quarter 1.csv. Running this on our CSV, it takes less than a second to load, and inspecting it, it looks like we have all our data, around 220,000 rows.
The one thing I will note about this is this first column right here, and that's Unnamed: 0. We didn't specify when we read this in that the first column is the index; also, if you notice, there's nothing right here for the index name. So what we can do when we read the CSV is specify index_col equal to 0. Running Control+Enter again, we can see that this fixes that issue. Let's now export this data in the file format that our boss wants, specifically to Excel. Similar to to_csv, we need to actually
specify the file name that we're going to save this to, so we're going to do quarter 1, and remember, we need to include the extension, so I'm going to name it with .xlsx. Okay, we're going to run this, and we get an error. The error says ModuleNotFoundError: no module named openpyxl. Basically, we need this Python-to-Excel module to actually do this, so let's install it.
I'm going to pull up the terminal by pressing Control+tilde, and I'm going to run conda install openpyxl. One thing to note: make sure that your environment, in this case python course, is activated and you're installing it into the correct environment. VS Code should automatically pick up, whenever you're working in this project, that that is the correct environment, so you shouldn't have that issue. Okay, it says the following new packages will be installed; I'm okay with that, so I'm going to go ahead and press Enter, and it says it's complete. Now running this again, after a minute and 6 seconds, which is pretty hefty, it went through and exported this out into that Excel file.
Inspecting this Excel file, it has all the different columns that we had previously, and scrolling on down, it has the 221,000 rows. One thing to note: Excel has a limit of about a million rows per sheet (1,048,576), so you really need to think about that if you have data that is greater than a million rows. Now there are three other methods that you need to be aware of as you become more advanced at Python, and these are to_sql, to_parquet, and to_pickle. You're probably familiar with SQL, and with this one you would
need to import a library like SQLAlchemy, and then from there you can export it to a database of your choice; if you have a connection to it, you can set that up and export to it. I don't find myself doing that all the time, so whenever you encounter this, just go to ChatGPT. Now these last two are going to save you some time if you're dealing with really large datasets and you're trying to export them and then also load them into Python very quickly.
Parquet and pickle files, as denoted by the parquet and pickle in their file names, are very efficient files, so whenever you use these, you can read in the data a lot quicker, sometimes even more quickly than CSVs. So when you become a hardcore data nerd with Python, definitely start implementing these so you can flex on your coworkers. So now that we've gotten that task off our list of exporting that Excel file for our boss, we only have two more sections in the advanced chapter specific to pandas; then we'll do an exercise, and from there we're going to be moving into Matplotlib. All right, for those that purchased the practice problems, you have a few to work through now to test out exporting the different file types. With that, see you in the next one.
So we haven't even jumped into exploring and analyzing this job skills column that has that list of skills, and that's because there's a slight problem with it: it's not stored as a list inside of our DataFrame right now. As you can see up top, it's being annotated as a string. Luckily, we have an apply method for
DataFrames that we're going to be covering in this section right here, where we can actually go through and apply something to that column to clean it up. But we're going to start with some simple examples first. So jumping into our Jupyter notebook, we've imported our libraries, loaded our data, and cleaned it up, just like usual. For this first and second example, we're going to be looking at the salary year average column, and there's a bunch of NaN values in it, so I'm going to go ahead and filter them out using the notna method from pandas.
I can specify that I want to filter the data on that salary year average column; I put that within brackets to filter it, and then on the end I provide that salary year average column. All right, so now we have this back. Anyway, these are the salaries that we have, and what we want to do with this problem is calculate the projected salary for next year. This is just going to be some rough math: we're going to apply what inflation is right now, around 3%, and basically take these values and multiply them by 1.03 to see what would be 3% higher next year.
So with that, how can we actually get inside the DataFrame and apply this to the entire column right here? Well, we can use the apply method. Now normally I would take you into the pandas documentation, but we haven't used that help function in a while, so I think we should use it. Anyway, apply can be applied on a DataFrame; it's a method of a DataFrame. So if I put df.apply inside help and then run Control+Enter, I get the help on this method, basically the same thing that you're going to get from the pandas docs if you go there as well.
So this apply method takes one or multiple arguments, and we're going to just focus on the first one for the time being, and that's func: a function to apply to each column or row. We'll get to this in a second, how
we can apply this to rows, but for the time being we're just going to focus on one column. So the first thing we need to do is create a function, and we'll define it with def as projected_salary. It'll just take one argument, salary, and what we want to return out of this function is the salary times 1.03, so that 3% inflation. Now it's pretty simple: I can just take a column of interest, in this case salary year average, use apply on it, and provide this function of projected_salary.
Now because I'm providing the function itself, I don't need to put the opening and closing parentheses after it; I'm just providing the function name, and apply will call it as necessary. Anyway, running this, we can see, well, no values here, because we've got some NaN values in here. So to make this simpler to view going forward, and to filter out these NaN values, I'm just going to create a new DataFrame called df_salary and set it equal to our filtered DataFrame, which basically removes all these NaN values, and then down here I'm going to reference that as well. Running Control+Enter, now we have back all these different salary values.
Now let's actually inspect this and see how it compares to that original salary year average column. So I'm going to create a new column inside of here and call it salary year inflated, and then I'll set it equal to all this. I need to correct this to use the new df_salary DataFrame, and then for df_salary I want to show these two columns of salary year average and salary year inflated. Running Control+Enter once again, similar to last time, we're getting the SettingWithCopyWarning. Remember, anytime we're making alterations to a DataFrame that was filtered from an original one, we want to create a copy of it first, so make sure we're making copies.
All right, now that warning message goes away, and inspecting it further, we can see that in fact the salary year average column was multiplied by 1.03. Now this function is actually pretty short, and for providing this apply method a function, we could use something
like an anonymous function. So let's try writing this again with an anonymous function. I'm going to use that df_salary DataFrame and that salary year average column, and we're going to be applying a lambda function to it. For this you may have to dust off your hats, since we haven't used lambda functions in a while, but as a reminder: we're going to write lambda, then you define a specific variable, so in this case we're going to call it salary, then you provide a colon, and after the colon you provide the mathematical or other operation you want it to do; in our case we want to do salary times 1.03.
So once again, let's see what this salary year inflated column looks like next to the salary year average column, and comparing it to the last one, it looks like all the math was done correctly. Now for those hardcore data nerds, you may have noticed that I could actually rewrite this without that apply method or lambda function, as this is actually a really simple case where we're only multiplying that salary year average column by 1.03. Calling out the different columns to display below, we can see that it still gets it done. But this was mainly to give you an introduction to this apply method when running it on a column.
So let's actually use this apply method in a situation where it's actually applicable and you can't simplify it, and that's in the case of our job skills; specifically, we need to convert this from a string to a list. If I display that first list of
skills, I can see that the list has double quotes around it, and even putting type around it to confirm, running this I get back that it's a string. So how can we actually convert this back to a list? If you recall our lesson on data types, you could wrap certain objects inside of a data type like list and try to convert them. However, in this case, if I try to do that, it turns this thing into a jumbled mess; it converts every single character into its own item in the list. It doesn't work.
Instead, we're going to go inside the Python standard library and use this module of ast, which stands for abstract syntax trees. Honestly, I had never heard of it before until ChatGPT pointed me towards it, so it's sort of special for our unique use case. Anyway, with its literal_eval function you provide a node or a string, so in our case we're going to provide that string to it, and the documentation basically explains that it converts it to the container data type, in our case a list, back to what it was supposed to be.
Let's actually show this in action. So inside of this code block I'm going to import that ast module; it's part of the Python standard library, so we don't have to conda install anything. Then we're going to use that function of literal_eval and apply it to that string list. Running Control+Enter, we can see that now we have this back as a list, and I can confirm this by running type on it, and it comes back and confirms it is a list.
So let's now clean up this column using the apply method. And just to show real quick, we can't actually run ast.literal_eval on this entire column; if I try to go ahead and do this, I'm going to get a ValueError because I didn't pass in the right kind of input. We have to use the apply method for this. So the first thing I'm going to do is start by creating a function, and I'm going to just call this clean_list, and I'll call the variable for this skill_list.
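As a quick sketch of the string-to-list conversion described above: the skills string here is made up for illustration, and the variable names are hypothetical rather than the ones used in the notebook.

```python
import ast

# Hypothetical example value; the real column holds similar list-like strings.
skills_str = "['excel', 'sql', 'python']"

# Wrapping the string in list() splits it into single characters,
# which is the "jumbled mess" mentioned above.
as_chars = list(skills_str)

# ast.literal_eval parses the string as a Python literal and returns a real list.
skills = ast.literal_eval(skills_str)

print(as_chars[:3])           # ['[', "'", 'e']
print(skills)                 # ['excel', 'sql', 'python']
print(type(skills).__name__)  # list
```

Note that literal_eval only accepts literals (strings, numbers, lists, dicts, and so on), which is why it is a safer choice than eval for this kind of cleanup.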
This function is going to be going through each of those items in the column, so I can just return the literal_eval of skill_list. So let's actually write this out. For this, I just want to replace it right on top of the original column of job skills, so I'm going to set it equal to itself and then write that apply method, passing in clean_list. Now we're going to get an error here whenever we run this; specifically, we're going to get a ValueError: malformed node or string: None.
And I can show this by filtering the DataFrame to only show NA values inside of the job skills column; I'm using this isna method and passing in the column I want to filter on. Anyway, scrolling over to job skills, we can see that we have a bunch of None values. This is causing errors for us because it's trying to run literal_eval on None when it's expecting a string to be passed. So what we can do is add an if statement, and unlike what we did down here with isna, we can use notna, specifically calling the pandas notna function on that skill_list.
If it's not NA, we want to return the literal_eval of skill_list; otherwise, we're going to leave it as None. Okay, so let's try to run this again. Everything ran fine, but let's actually inspect the first element of this, and we can see that we get a list back, and confirming the type of this, we can see that it is in fact a list, so it did
do the conversion correctly. Now in this case, because of how simple this function is, I would encourage you to convert it into a lambda function; I like to write a function first and then convert it. So with that job skills column, setting it equal to itself and then using apply, we're going to use a lambda function. Then we want to define the variable, which is going to be skill_list (I'm going to fix this typo real quick; I need to have lambda in there). Then we're going to go into what we want to actually do: we want to perform an ast.literal_eval on skill_list.
We're going to get the same error as before, a ValueError: malformed node or string: None, so we need to correct for this and basically include an if statement inside of our lambda. I can just copy that if statement from up here and paste it in here. Whenever I run this, I'm also going to get an error, expected an else after an if, so we just have to specify else: if it is None, just return back that skill_list, so whatever the value is currently.
Running Control+Enter, everything looks like it ran fine. Displaying the first value, I can see that it is a list, and confirming with type again, it is in fact a list. Now for future notebooks, when we go into actually analyzing the skills, you're going to notice that I use this again inside of the import libraries section that I normally have up here. So this code is pretty important, and you need to understand what's going on there because you're going to be using it. Now I want to cover
one more example with this apply method, because we've only been applying it to a column and we may need to apply it to a row. Previously, we went through and just applied 3% inflation to every single salary. But let's say we have a new problem where we're assuming that senior roles, such as senior data analysts, engineers, and scientists, are going to have a projected salary that's 5% higher, whereas the other roles are only 3% higher. So we're going to need to revise the code that we wrote previously in order to do this more complex apply.
Pulling up that help function on apply again just to refresh: previously we only covered the func argument, but now we need to look at axis. The default is 0, which specifies that it applies the function to each column. But if we want to apply a function to each row, basically so we can access multiple columns in a row or check a column, we want to set axis to 1. In this case, because this one's a little bit more complex, we're going to be building a function first and then passing it into our apply method.
So let's start by building out how we're going to use that apply method. We're going to be using that df_salary DataFrame again; because this has the NaN salaries filtered out, it's just easier to display. With this DataFrame we're going to be creating a new column of salary year inflated, and in this case, instead of providing the DataFrame and the column, we're just going to provide the DataFrame, because we want the entire row to go to the function this time, and we're running the apply method on that.
We'll be using a function; we haven't made this function yet, but we'll call it projected_salary. And we need to specify the parameter of axis equal to 1. So let's actually build this function now. This function, like I said, is called projected_salary, and inside of it we're going to be passing the row. What we need to do inside of this, as it's grabbing each one of these rows, is check, hey, what is the value
in this job title short column. If it's not senior, I want to take that salary year average and multiply it by 1.03; however, if senior is inside of it, like in this case, we want to apply 1.05. So this is a great case for an if statement. If Senior is in the row, specifically in that job title short column, then we want to return the row's salary year average multiplied by 1.05.
So this is actually a functioning function right now. To make sure that it's working correctly, I want to see the results below here, so we're going to display the job title short column, salary year average, and that salary year inflated. Running Control+Enter, bam. Okay, so we have it back now to where it's only handling the one condition; specifically, we have this one senior data engineer role, and it did the appropriate multiplication for it. Now we need to solve for all these other rows.
So in this case we're going to use an else statement, and in that case we're just going to return 1.03 times the row's salary year average. Now running this, all the values fill in, and we can confirm that it worked correctly. Now all this code can be made into a lambda function and simplified into one line, though it's getting a little long. If you want to do that after this, go for it; I'm not going to walk through it because we've done enough of that practice already, but I'm just pointing out that it can be done.
All right, so that's the apply method. I'll be honest, when I first encountered this method, it was sort of difficult for me to wrap my head around how it's applied to columns and also to rows, so you may need to watch this section of the video again in order to better understand it. You're not alone; I had problems understanding this fully when I was first learning it. All right, for those that purchased the course practice problems, you have some problems now to go through and test out
how to use apply in different ways. With that, I'll see you in the next one. So we just spent the entire last section working through how to clean up this column of job skills to convert it into a list. It was a lot of work, but I promise it's going to pay off, because now we're going to get into visualizing it and actually be able to see what the top skills are for something like a data analyst. But in order to do this, we need to use the explode method on this job skills column to clean it up a little further. Let me show you.
So why are we even learning about this explode method? Well, let's look at an example using fake data. I have a DataFrame with a few jobs and then a list of skills that each job requires. If we wanted to count up the number of times something like Python or Excel appears, it's going to be a little bit difficult in this format. Specifically, if I call the DataFrame and then try to run value_counts on that job skills column, running Control+Enter, I'm going to get a TypeError, and it's mainly due to the fact that a list is not hashable.
Now, you could use some loops or maybe a list comprehension to loop through each of the rows of the dataset, aggregate what skills are inside of each of the job postings, and then from there make a totals column (ignore this job title short and job skills), which could hold totals for each of those skills. But if we look at the code that I had to write for this, this is one, two, three, four, five lines of code, and that's just way too ridiculous.
Instead, what we can do with that original DataFrame is call it and run that explode method, which we're going to go over shortly, specifying the skills column. Whenever I run this, it's going to explode it out to where every single value from that list is its own individual row. So now we see we have three data analyst rows, two data scientist rows because there are two skills, and three data engineer rows because there are three skills; scrolling up, we can even
confirm this by looking at it. Anyway, this type of format makes it really easy to then run, if we wanted to, that value_counts method on the job skills column, and we can get a quick count of how many times each of these skills appears. If we want to take it a step further, we can even plot it, and we get this bad boy, which shows the counts of all the different skills.
So inside my Jupyter notebook I've gone in and imported all the libraries and loaded the data, and then for data cleanup I've added an extra line. Specifically, we have our line here for cleaning up the job posted date, and then our final one that we just did in the last section, using that apply method to clean up those skill lists into a list data type. So let's jump right into exploding out this job skills column. Here I have it displaying along with job title short.
We're going to run this explode method on this column of job skills, so I'm going to run the help function on explode in order to understand better what it needs. It takes two parameters, which are column and ignore_index; we're going to ignore this ignore_index for the time being. So for this we specify a column, and it transforms each element of a list-like to a row, replicating index values. So let's run this explode method, and we're going to just provide the column name of job skills. Running Control+Enter, and looking at the index and also the rows, we can see that they're now duplicated. Scrolling over to the job skills column, we can see now that it's broken out into individual rows for each skill. Pretty cool.
Now we want to assign this to a new DataFrame, so I'm going to call that df_exploded and set it equal to this. I'm going to go ahead and run it, and printing it out again, it looks good. So let's actually visualize these job skills now. Since we have them individually, we can do a value_counts on this, so I'll call out the job skills column and then run value_counts on it. This looks pretty good.
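To make the explode-then-count idea above concrete, here is a minimal sketch on toy data. The DataFrame contents are invented, and the column names job_title_short and job_skills are assumptions based on how the columns are spoken in the video.

```python
import pandas as pd

# Hypothetical toy data; column names are assumed, not confirmed by the transcript.
df_fake = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Scientist", "Data Engineer"],
    "job_skills": [
        ["excel", "sql", "python"],
        ["python", "r"],
        ["aws", "python", "airflow"],
    ],
})

# explode gives each list element its own row and repeats the original index.
df_exploded = df_fake.explode("job_skills")

# With one skill per row, value_counts can tally how often each skill appears.
skill_counts = df_exploded["job_skills"].value_counts()

print(df_exploded)
print(skill_counts.head(10))
```

Running value_counts directly on df_fake["job_skills"] would fail because lists are unhashable, which is the reason for exploding first.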
But I don't like staring at numbers, so let's actually visualize this with only the top 10 results. I'll run the head method on this, and then we'll run plot on it, specifying that we want a bar chart. And bam, we get this bad boy that shows the top 10 skills visualized by their associated counts. So let's actually take this a step further, because right now we're looking at all the job titles: data analysts, data engineers, and data scientists. I actually want to group the counts not only by
skill but also by that job title short column; that way we can go through and pick out whether we want to look at data analysts, data engineers, or data scientists. So for this we're going to go back to that variable that we had of df_exploded and build on that. We want to group by not only job title short but also job skills, so we'll call out df_exploded, use that groupby method, and then inside of the arguments we're going to provide the
list of both of those, so job skills and then also job title short. Then, getting to the aggregation function, we want a count of all these, and the best way to do that is size. Running Control+Enter, we can see now it's grouping by those skills first, so in this case we have airflow, and this is the count for each one of those different job title shorts. Now as far as the ordering for this, I don't really think it matters
too much; if I ran it the other way, we'd just have job title short on the left, broken up by all the skills associated with each job title short. So I'm going to assign this to the variable skills_count. And if we notice, the type of this skills_count that we just created is a Series. Series aren't bad to work with for plotting, but I personally prefer DataFrames, so we're going to go ahead and convert this into a DataFrame.
I'll do this by creating a new variable called df_skills_count (real original, I know), and I'll set it equal to skills_count; in order to transform it into a DataFrame, we need to use reset_index. I'm going to print it out below and see how it looks different. So we end up getting this one now, and the count column right now has this 0 title, so we actually need to change that and specify the name for it, basically the column that holds what were previously the Series values and is now titled 0. We'll name it skill_count. Running Control+Enter, it changes to skill_count.
Now the next thing I want to do is sort these values, and sorting by skill_count is going to be necessary so that whenever we're visualizing it and only showing the top 10 results, we can just use the head method to pull out those top 10. So once again I'll define that variable of df_skills_count and set it equal to itself with the values sorted; we'll use the sort_values method, and we need to provide what column we want to sort by, so we'll specify that skill_count. Running Control+Enter, we get this, but now they're all in ascending order and we want descending, so we'll specify ascending equal to False. Okay, and now we have it in that correct order from highest to lowest.
So this is now what we need in order to plot. We're going to first just plot data analysts, basically the top 10 skills for data analysts. I'm going to create individual variables in case I want to change
it and make that easier. So I'll set job_title equal to Data Analyst, and then for top_skills I'll set that equal to 10 for right now. Then as far as our final DataFrame that is filtered for those values, I'll pass in our original one of df_skills_count, and we're going to be filtering it by that job title short column; we want to make sure that it's equal to job_title, the variable that we defined above as Data Analyst. Additionally, we want to get the top 10 values, so we're going
to run the head method, passing in top_skills. Let's go ahead and show this DataFrame right below, and now we have, only for data analysts, the top skills in descending order along with the count. Now this is something we can actually plot, so I'll run the plot method on this. For kind, we could go with the bar chart, but I like the horizontal bar chart for these, so we'll add an h to the end to make it barh. This DataFrame has multiple columns in it, so we need to specify the x
and y values: I'll call out job skills for x, and for y I'll call out skill_count. Running Control+Enter, bam, we get this bad boy. Now the one problem I already see in this, which we've encountered before with horizontal bar charts, is the order in which it plots: it starts plotting at the bottom left-hand corner and then works its way up from there, so that's why the largest value ends up at the bottom. So we actually need to reverse the axis. I could come up into the code and put this ascending back to True
and then from there, instead of doing head, do tail, but I have a better method than this. Matplotlib has this method for inverting the y-axis. The one sort of issue, or one more learning point, we have to go through with this method is the fact that it's run on the axes, so in order to get the axes we need to run this function of get current axes, or gca. We're going to go more into this function and also into inverting the y-axis later, but I wanted to introduce it here. For the time
being, just understand that this is going to get the axes, and then we're going to invert the y-axis. Back in our notebook, I'm going to call the pyplot module with our alias of plt, run the gca function on it, and then run invert_yaxis on that. And as we can tell, the axis is now inverted. Now with how we built this, we can go in and actually look at other job titles as well. That's the good thing about Python: you can make something programmatically
changeable. So in this case I want to look at data engineers and the top... oh, I typed 115, didn't mean that, that's way too many; we're going to bring that down to only 15. And now we have the top 15 for data engineers. All right, so there's a little bit more cleanup that I want to do before I'm actually satisfied with this visualization. The first thing is I'm going to specify a title using an f-string; with this I can specify the top 15 skills for the job title of Data
Engineer, and now it's appearing formatted correctly. The next thing I'm going to do is change that x label to job posting count. I don't really like the y label; I feel like the title's already explanatory enough for that, so we're going to leave that blank. The last thing to notice is the legend down here; I feel this is very much redundant and not necessary, so we want to remove it. We can access our legend by doing plt.legend, and this has a method called set_visible, and inside
of it you pass a Boolean of True or, in our case, False, since we don't want it to be visible. And now it's gone. The last thing to add is plt.show, and now we've cleaned it up to what we want, with a very concise title and appropriate axes. So that is the explode method, and it makes our life super simple when getting into analyzing these skills. Now it's your time to go through and actually do that. For those that purchased the course practice problems, we have some available
for you right now, and with that, I'll see you in the next one, as we're going to be going into an exercise building further on these top skills of data analysts, but analyzing the trend of these skills across the year. All right, see you there. Now that we've wrapped up covering everything that you need to know for pandas in the advanced chapter, basically for the remainder of the course, I want to go over a quick exercise that covers a lot of those different methods that we just learned. Specifically, we're going to be using all those
methods of explode apply and pivot in order to get this bad boy which is going to show the trend of skills throughout the year so let's get into it here in my notebook I've gone through and imported our libraries loaded our data set and then they done the data clear up remember we're still doing the job postto date cleaning along with converting those job skills to a list so way we can do the explode method now for this analysis that we're going to be doing I only want to focus on data analysts you can focus
on whatever job title you want but that's what I'm going to focus on so I'm going to create a new data frame called df_DA and set it equal to our original data frame where we're then filtering it for that job title short column ensuring that it equals data analyst now I don't want to alter this original data frame so it's just good practice to use that copy method and I'll run this inspecting this data frame looks like we have all data analysts in it and now we need to get into it right we're going to be aggregating
these skills on a monthly basis so we need to go ahead and extract that month out of the job posted date so I'll specify a new column of job posted month I'm also going to specify that it's the number and this is going to be equal to in that data frame the job posted date specifically we want to use that dt accessor of month displaying the data frame below we can see now that we got this job posted month number on the side next thing we need to do now is actually get into exploding
out those job skills so for this I'm going to create a new data frame or new variable altogether for this I don't want to save over the original data frame and I'll specify our df_DA data frame and pass that explode method to it specifying we want to use the job skills column and I want to see what this looks like whenever we do this so I'll call out that I want to look at it afterwards and it looks like the index is repeating along with those job skills now being separated out so now we're going to
use both of these columns we have our job skills column all broken out and we have the associated job posted month number with it now what we have to do is pivot it so on this df_DA_explode I'm going to run the pivot_table method and for this I want to use the job posted month as the index and the skills as the columns so I'll specify the index as equal to job posted month and the columns as equal to job skills other thing we have to specify is the aggfunc and for this
we're wanting to do a count of all those job skills per job posted month so we're not going to use count we're going to use size running control enter we have all our different values in there now I am noticing one thing there are NaN values in here and really what this means is there was no value there so technically in this case it is actually zero we want it to be zero because if we go to plot this and there's NaN values it will just skip it so we need some values in there
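As a rough sketch of the pivot just described, assuming a tiny made-up exploded data frame (the names df_DA_explode, job_posted_month_no and job_skills are assumptions standing in for the course notebook's): the first call leaves missing month and skill pairs as NaN, and the second uses the pivot_table parameter fill_value to turn them into zeros.

```python
import pandas as pd

# Hypothetical, already-exploded data: one row per (posting, skill).
df_DA_explode = pd.DataFrame({
    "job_posted_month_no": [1, 1, 1, 2, 2, 3],
    "job_skills": ["sql", "sql", "excel", "sql", "python", "excel"],
})

# aggfunc="size" counts the rows in each (month, skill) group.
# Pairs that never occur come back as NaN.
pivot_with_nan = df_DA_explode.pivot_table(
    index="job_posted_month_no",
    columns="job_skills",
    aggfunc="size",
)

# fill_value=0 replaces those missing pairs with 0 so a line plot
# has a value for every month.
df_DA_pivot = df_DA_explode.pivot_table(
    index="job_posted_month_no",
    columns="job_skills",
    aggfunc="size",
    fill_value=0,
)

print(pivot_with_nan)
print(df_DA_pivot)
```

Using size rather than count here means rows are counted per group directly, without needing a separate values column.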
we need zero so for this we can specify the parameter of fill_value and pass in zero running control enter so now we're almost there we have our job posted month on the index and all of our different skills on the columns but this is a lot of skills if we try to plot it well let's try I'll show you I'm going to save this as the variable of df_DA_pivot I know real original and for this we're going to run the plot method on it and for this we'll specify the kind is
equal to a line chart okay so not so bad to start but oh my oh goodness this legend just keeps going and going and going so we definitely need to make that legend disappear even then we can see there's just too many skills on here I want to actually filter down the number of skills that are on this visualization and we need to filter down by those that have the highest count here and right now it's just sorted alphabetically so what we can do in order to solve this is displaying our df_DA_pivot table we
need to create a column or sorry we need to create a row called total that basically sums up all the values inside of here and this is actually pretty easy to do for this we're going to create a new index value and so we'll use the loc indexer for this and we'll specify that it's called total so it's creating a new row called total and this is going to be equal to the data frame and then running the sum method on this pretty neat that we can do this so easily I press control enter
it's displaying below and then now we have this total row along the bottom of here so now what we need to do is sort these columns and specifically by what's in the total row so I'll specify once again that total row by using that loc indexer and we're going to sort the values of this running control enter we get this series back which has for the index the skill names and then also the counts for all of it I actually don't want it in this order I want to change this so I'm going to put ascending
equal to false and I promise this is all going to make sense so we have this right now so really what we care about with this is getting the job skills back which are the column titles and from there we can actually sort the data frame based on those column titles so I can run the index on this and we have basically index values which is like a list of all the different columns and now with this we can pass this to our pivot using the square bracket notation to basically say hey all
these values inside of here sort the columns by this and bam now we get this where all these different columns are sorted in the order that they should be with the least at the end so the only thing left to do now is just drop this total row because it's going to jack up our plotting if it's in there we don't need it anymore so we're going to drop it um but before that I need to save our work to the variable of df_DA_pivot and then with that df_DA_pivot I
want to then go ahead and drop that total row printing this displaying this out below make sure we did it right I didn't make any mistakes see that bam now we have everything all sorted correctly and that total row is now gone so the only thing left now to do is to actually plot this so to plot this I'll just call out df_DA_pivot and we're going to access what we want from this data frame here based on iloc so I just want the first five values I don't want to call them out specifically
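The total-row trick walked through above can be sketched roughly like this, assuming a small made-up pivot (the numbers and the name df_DA_pivot are illustrative, not the course data): add a Total row, use its sorted index to reorder the columns, drop the helper row, then slice the leading columns with iloc.

```python
import pandas as pd

# Hypothetical pivot: months on the index, skills on the columns.
df_DA_pivot = pd.DataFrame(
    {"excel": [3, 2], "python": [1, 2], "sql": [5, 6]},
    index=pd.Index([1, 2], name="job_posted_month_no"),
)

# Add a 'Total' row holding each column's sum.
df_DA_pivot.loc["Total"] = df_DA_pivot.sum()

# Sort the Total row from largest to smallest; its index is the skill names.
sorted_skills = df_DA_pivot.loc["Total"].sort_values(ascending=False).index

# Reorder the columns using that index, then save the result.
df_DA_pivot = df_DA_pivot[sorted_skills]

# Drop the helper row so it is not plotted as if it were a month.
df_DA_pivot = df_DA_pivot.drop("Total")

# All rows, first two columns (the course uses the first five).
top_skills = df_DA_pivot.iloc[:, :2]
print(top_skills)
```

Slicing by position with iloc avoids having to spell out the skill names, which is the point made in the walkthrough.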
we'll just use iloc and we need to pass in the rows first we want to pull all the rows so we're going to pass in that colon and then for the next one we only want the first five values so it automatically assumes zero colon five we don't have to write the zero I can actually just show this down below we have what we want let's actually plot it and we're going to use a line chart for this and bam and like every visualization you probably should clean it up I'm going to add this title
of top five skills for data analysts per month add a y label of count I'm going to remove this x label cuz I feel like we've already specified that in the title that it's per month and you should know what that is so going ahead and running this bam we now have what we want now as far as what we're seeing with this we have very high demand in the beginning but remember we had a lot of job postings in January so that's why these counts are above normal when we get to the project I'm going
to shift us from plotting using a count to using something that's more representative via percentage specifically we're plotting something like what is the percent of job postings with SQL what is the percent of job postings with Excel and this is going to help clear up these issues that we're seeing here of these abnormalities in months like January and August now personally I don't like this month number notation down here but we're going to stop here for the exercise but what is available in GitHub in the code for this is I've gone through and cleaned
up those job posting months similar to what we did previously so now whenever we plot it it has like January March May July September November and I was able to do this with four lines of code we've all walked through this before so I'm not going to do that again here all right so that wraps up everything you need to know for pandas and you should be super proud of everything you've learned so far we're now going to be shifting gears and getting into exploring more advanced features inside of Matplotlib so with that
see you in the next one so imagine this scenario you've analyzed the top skills for data scientists data engineers and even data analysts but now you want to compare the values across each other but they're all on separate plots how do you do this well conveniently that pyplot module from Matplotlib offers a subplots function in order for us to graph everything on one single figure and the best way to understand this is through their cheat sheets I find looking underneath here at subplot layouts with this function we can specify the rows and columns
for the number of axes if you will on the figure itself whenever we run this function of plt.subplots specifying a 3 by 3 we get back two objects in a tuple a figure and then the axes once again this axes object is each of these individual plots on this figure so enough with the theory let's actually jump in I've gone through and imported all the different libraries you need loaded the data and then done our standard data cleanup cleaning up the date and the job skills we're going to start by creating a subplot on basically
a 1 by 1 figure so for this it provides two objects it provides fig and ax this is the standard nomenclature for this you can really name it whatever you want but it's pretty common to use fig and ax and then we'll call out the plt module specifically the subplots function okay running control enter just make sure it works it plots a basic graph here if I wanted to I could add in the different rows and columns of this so in this I specified 2 by 2 and I created a 2 by 2 subplot area we're
going to keep it simple for the time being and I'm going to remove that and with keeping with Simplicity we're just going to plot the counts of the job title short column for this we'll use the value counts method running control enter to see what we're getting it's the numbers for all the different job titles so running the plot method on top of this specifying that we're going to want a bar chart we're going to then run control enter we have it now with this we need to specify we need to actually start using this
ax and we can pass in this axis for the figure itself into here so I can specify ax equal to ax which is this one up here and when I run this it's not going to change anything it's still going to be those same values of those different job title counts for this but let's now get into actually plotting two plots so not only do I want to plot the value counts of the job title short column but also we're going to run value counts on the job schedule type so whether it's full-time contract or
intern and then I'm only looking at the first three values for this I want to get both of these plots into separate bar charts and put them on the same figure so I'll start with specifying that fig and ax equal to plt.subplots and then we specify the rows and then the columns and I just notice it's highlighting as I go so it's telling me that the number of columns is two that's pretty cool once every day so now I'm going to copy and paste both of these into here and with these we want
to plot both of these so I'm going to specify the plot method on this specifying the kind equal to bar and in this one we most definitely need to specify ax I'm going to copy the plot and also paste it up here now if I just pass ax onto both of these we're going to run into issues with this and that's because now we need to specify of this ax we need to specify the index of the ax that we're using so similar to a list anytime we pass an index we're going to use that
bracket notation so in this one we're going to use zero for the first one and then for this one we're going to use one running play we can see that okay we have the job titles on the left and we have the job schedule type on the right if I wanted to I could just have that one and that zero trade places running control enter we now have job schedule type on the left and the job titles on the right so you may notice by this that we have some overlap here between these
two different visualizations that are in their separate axes well we can go ahead and fix this by passing in a function that we haven't used before called tight_layout and this function we want to apply to the entire figure so remember we have figure and axes axes is the individual plots figure is the entire thing so I can specify fig.tight_layout running control enter bam now we get something that's a lot more readable where they're actually separated and they're not overlapping now I do want to call out because sometimes people will plot differently
but we're using the pandas methods in order to plot these different visualizations now this is the same visualization but we're using Matplotlib's designed way of actually building visualizations I'm going to go ahead and run it just to show you and it's the same thing well kind of they don't turn those labels on their side so we have a bunch of overlap but it's basically the same anyway what's going on here well instead of up here running this plot method down here we're now calling out each of the different axes
and then running the applicable plot method on it actually providing the values that we want to go into this plot and filtering it out and then also doing the tight_layout like I said before this is more verbose and I'm not really a fan of it I'm much more a fan of using pandas to plot this so let's now get into that final example that I talked about where we want to get into plotting all the different counts of top skills for not only data scientists data engineers but also data analysts on one
single plot and for this we're going to use that same data frame that we created during the explode lesson where we went through copied the data frame into skills exploded all the skills out grouped them reset the index and then sorted the values by the different skill counts that was a quick version I'm not going to go through it again cuz we already did it so for this I want to plot data analysts data engineers and data scientists so the first thing I'm going to do is just create a list and we're using a list
because we're going to loop through each of these different job titles basically pull them out of here and then plot them each individually onto our figure so we need to start by creating our fig and ax and we'll be specifying the subplots function and for this we're going to be specifying rows and columns in this case we're stacking it so it's going to be three rows and one column now like I said we're going to be looping through each one so I'm going to create a loop with this so for and we're going to pass
in two variables we're going to pass in i and also the job title and I'll show you why we're going to be using for this in enumerate job titles and I spelled enumerate wrong okay fix it enumerate job titles just as a quick refresher for this let's actually print out i and job title so we can see what's going on here so whenever we enumerate through a list it provides first the index and then it provides the actual value I provided in here job titles I should have just done job title and now it
provides the index and then the actual name itself of the job title so the first thing I want to do inside of here is filter the data frame for if we have data scientist first I want to filter it just for data scientist and just for the top five skills and just checking the name of the data frame is df_skills_count so inside of here I specify df_skills_count and we want to filter this data frame specifically on that job title short column where this is equal to the job title and we
only want the first five values out of this I'm not sure if this is going to display but we're going to go ahead and try it um no it didn't display below let's try actually print okay printed the data scientists data engineers and data analysts we're getting the correct results we're going in the right manner okay so this is what we want to actually plot so I'm going to set a variable of basically df_plot equal to this right here and then with our data frame that we're going to plot we're going
to run the plot method on it specifying what kind we want well we want a horizontal bar chart and then we need to specify the x and y values so job skills skill count now the last thing we need to provide remember is the axes what axis we want to plot it on conveniently we did this enumerate function which provides an index so we can pass into it ax create a bracket notation and then just pass in i let's plot this bad boy all right not bad so far I don't know who each
of these graphs are associated with we also have some overlap we'll fix but off to a good start so far in order to show the different titles on there we can just add the parameter title and set it equal to job title running control enter I forgot to put a comma running control enter we now have all those different job titles up next thing I want to do is similar to last time remember we inverted the y-axis so we're going to do the same thing here but remember we had to run that gca
method in order to access an axes so in this case we're just going to call out the axes specifically so we'll say ax then we'll specify the index and for this one we'll call out invert_yaxis running control enter bam this inverted all those different values next thing I want to do is take out this job skills on the y label because I feel it's redundant okay running this we now have that removed the only thing left to do for each of these individual axes before we actually make it spaced out better is
remove this legend of skill count like that's just like completely unnecessary and similar to what we did a few lessons ago we can access the legend by calling legend of the axis and we can set the visibility of it or set_visible equal to a Boolean value of False okay so this now removes that similar to how we spaced it out before we're going to access that figure and we're going to call that tight_layout okay so this now has all of this now actually spread out it's a lot more readable but I want a
title across the top so once again I'm going to access that figure and I'm going to set the suptitle so the main title for it and for the title we're going to specify that counts of top skills in job postings with a font size I'm going to specify here the parameter of 15 now I'm having an issue right now where it's overlapping and that's because of this tight_layout I'm actually going to change that to move that underneath I'm hoping that fixed yeah that does fix it so that actually spaces it out better
to where the title's not on top make sure that you have that in the correct order all right I promise the only last thing to do in this is right now I'm finding data scientists and data engineers have a similar axis but whenever we look at data analysts they're not on the same axis it's not aligned properly so for the axis itself we can specify the set_xlim and we're going to specify not only the zero in this case so what is the lowest value but also the highest value right now it's looking like it's
around 120,000 that would be the best to make sure we capture all the different values so I'll specify in here 120,000 running control enter it's now setting it and also all of these are now on the same axis and this is pretty insightful now we can take a lot of different insights from this like data scientists and data engineers have some of the highest with Python and data analysts are a little bit below that then when it comes to SQL not only do data engineers cover it but also data analysts and then data scientists a little
bit less now like I mentioned in the last video we're going to be getting away from counts in the project section and specifically we're going to be shifting more to analyzing what is the likelihood or the percentage of a skill appearing in a job posting so I feel like it's going to be much more representative vice comparing counts to counts because sometimes there's more data scientist or data engineer jobs than data analyst all right if you haven't worked through this already it's now your turn to do this along with some practice problems to
familiarize yourselves with how to use this subplots function with that I'll see you in the next one we're going to be diving into more advanced features of Matplotlib see you there so I have a love-hate relationship with pie charts frankly they're pretty good at showing some things but then other things when there's too many variables involved you can't really even use it so we're not only going to be walking through how to actually plot a pie chart but also when are the best scenarios to use these visuals so jumping into a Jupyter notebook
we have our standard import statements we loaded the data set and we've done our standard data cleanup so let's get into plotting one of these columns we're going to start with this job work from home because it's a Boolean column and we're either going to have a true or false value and so with this we're only comparing two things which is actually really great for pie charts so with that job work from home column we're going to run the value_counts method on it in order to get the counts of the different values so now
that we have this we can actually run the plot method on this specifying the kind of chart we want for this we want a pie chart running control enter we get this one I'll be honest this chart does need some cleanup because I don't know what we're comparing true to false for also we got this random looking count on the side so let's actually clean that up real quick so I'll specify the title of work from home status additionally I'm going to clean up that y label all right so this is much better
at showing the work from home status and we're comparing the two values it's an overwhelming majority of false values for the requirement to work from home so I'd argue this is actually a really good pie chart to use now I've taken all this code above and I've pasted it below and just changed this column now to show the job title short column let's go ahead and look at that one and I have the wrong title here and updating it for job titles in Luke's data set real original I know so with this I'm not a
fan of it because it's hard to compare the differences between these specifically whenever I'm looking at this for data analysts data engineers and data scientists I have a hard time telling which one is actually higher when we compare this to doing the same thing below plotting it with a horizontal bar chart we compare it to this chart where it shows the job titles in Luke's data set we can clearly see that data analysts outpace data engineers outpace data scientists much better especially with how many there are here so we need to be very careful
when we're using these pie charts that we're using them on the correct thing if we have more than two I would start questioning if you should be using a bar chart instead now getting back to that original pie chart there's two things I want to customize on this first is the starting angle of where this is actually starting when you initially look at a pie chart you're going to look at the top center portion of it so I want to orient this to where these values of the true were up at the top so I
can specify the argument of startangle and pass it in I want it to start at 90 degrees so this is much better of what I want and just to show how this is going if I were to put in 180 it's pointing all the way to the left because the reference point is all the way over here on the right where zero originally is so we have 0 90 180 and then just for fun this is where 270 would be now the other thing I want on here are percentages outlining what the true values
are and what the false values are so for this I'm going to specify autopct or auto percentage and for this we need to use a format specification mini language so I'm just going to put something in real quick to show what it actually looks like it's going to be percent symbol one and then f and what this is going to return is the associated percentage for each of these so true is at 8.8 percent and 91.1 for false so if you're curious about the documentation I'll provide the link below and it's underneath the string section
and it goes into a lot of detail of all the different special characters that you could use inside of this to format a string but the quick crash course is this the percentage introduces that we're going to actually be formatting the text from here so if I actually delete it and press control enter it's only going to throw in 1f as what the text is right here so I'm going to go put that back the one is the minimum width of the number so right here I have eight but there I have 91 which
is two so it can be one or more and then if I want decimal places I can specify a point in this case I only want one extra decimal place so I'll specify point one running this now 8.8 and 91.2 the f in this specifies that this is a floating point number now the only last thing to add to this is we need a percentage symbol on there specifically for this we need to specify two percentage symbols for it to actually do that so now I have it completely formatted how I want it with appropriate title
and different labels overall pretty impressed now let's take this up a notch we have three different columns in here that provide Boolean values for different job postings specifically work from home no degree mention or health insurance whether it's required or not so let's use our knowledge of subplots and actually plot all these pie charts onto one single plot now in order to do this we're going to have to obviously define that plt.subplots and then we get back that tuple of figure and axes we're going to start simple first by only plotting one and then
we'll build on more using a for loop so I'm going to only plot the job work from home first and we're going to use Matplotlib's way of plotting this basically calling ax and then from there calling its method of pie and inside of here I'm going to provide that job work from home and specifically the value counts make sure this is plotting correctly I'm just going to have it show right below here all right so pretty ugly but it's showing below here but now we're using basically the Matplotlib notation for
this vice the pandas built-in plot method now I want to use a loop for this cuz I have three different pie charts I don't want to go through and write this out three different times that's just too cumbersome I'm pretty lazy so what I'm going to do is define this dictionary of the different columns I'll call out the column name itself so job work from home and then for the value I'm going to provide what we want as the header on this pie chart so we're going to have three different pie charts so we need
to update our subplots for that remember for this we specify rows and then columns so it'll be one row and three columns now we'll get into defining our for loop so we'll call for and I'm going to fill in what we're going to do here in a second because I want to look at this we're cycling through our dict_column items and specifically we want to run the enumerate function on this right because when we run enumerate on like a list it provides us an index back and we're going to be using that for
this ax right here to provide an appropriate index for it to specify where to plot the pie chart we also need to update this in here we're going to get to that real quick right now so what we're going to be getting back while we're enumerating these dict_column items is we're going to be getting our index of i and then we're also going to be getting a tuple so inside a parenthesis of the column because we specified column in the dictionary first and the title because it specified the title second and that's the values
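A minimal sketch of that loop header, assuming a dictionary named dict_column whose keys are column names and whose values are display titles (job_work_from_home comes from the walkthrough; the other two column names and all three titles are assumptions for illustration): enumerate hands back the index plus a (key, value) tuple, which is unpacked inside parentheses.

```python
# Keys are column names, values are the titles to show above each pie chart.
# Only job_work_from_home is named in the walkthrough; the rest are assumed.
dict_column = {
    "job_work_from_home": "Work from Home",
    "job_no_degree_mention": "No Degree Mentioned",
    "job_health_insurance": "Health Insurance",
}

seen = []
# enumerate yields (index, item); each item from .items() is itself a
# (key, value) tuple, so it is unpacked inside parentheses.
for i, (column, title) in enumerate(dict_column.items()):
    seen.append((i, column, title))
    print(i, column, title)
```

The index i is what later selects the matching position in the array of axes.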
also I'm realizing now I have spelled enumerate wrong so I'm going to quick fix that and for the time being I'm not going to plot anything I want to actually show all these values in case you aren't keeping up with this so we're going to print i column and title commenting this out as well running control enter so we get the index for the first one of zero we get the column of job work from home and then we get the title now this is a tuple so you have to put this in
parentheses if I try to run this without them I'm going to get a ValueError all right so let's actually get into building all those different plots then so we already have that we're cycling through the different indexes and we provide the appropriate axes via our bracket notation so i in this case and then we're going into building the pie chart we now need to specify the column in this case here and we're going to basically be doing the same method of value counts on each one so let's run this to see what we have right now
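Roughly, the loop at this stage might look like the sketch below. It assumes a small made-up Boolean data frame in place of the real job postings, the same hypothetical dict_column as before, and a headless Matplotlib backend so it runs outside a notebook (that line can be dropped in Jupyter).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up Boolean columns standing in for the real data set.
df = pd.DataFrame({
    "job_work_from_home": [False, False, False, True],
    "job_no_degree_mention": [False, True, False, True],
    "job_health_insurance": [False, False, True, False],
})

# Column names and titles here are illustrative assumptions.
dict_column = {
    "job_work_from_home": "Work from Home",
    "job_no_degree_mention": "No Degree Mentioned",
    "job_health_insurance": "Health Insurance",
}

# One row, three columns: ax is an array of three Axes objects.
fig, ax = plt.subplots(1, 3)

for i, (column, title) in enumerate(dict_column.items()):
    # Matplotlib's own pie method, fed the True/False counts for this column.
    # title is unpacked here but not used yet at this stage.
    ax[i].pie(df[column].value_counts())
```

Each call draws on the axes picked out by i, so the three pies land side by side on one figure.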
all right pretty cool not too bad um we got three pie charts they may be right maybe not be right we got to actually update it so we haven't used this title yet so I'm going to go ahead and set that now next by specifying that ax at i and then set_title passing in title running control enter we now have titles above each one now we need to actually get the values inside of here so I'm going to come up here to the other pie we did I'm just going to copy this start
angle and then this percentage formatting and paste it right into here and then running this we now have percentages and yeah this is looking good the only thing we don't know is which of these values are true and which of these are false so we actually need to pass in one more parameter to this and this is labels and you provide a list of labels and you have to pass them in the correct order or they'll be wrong specifically we're going to pass in false and then true running control enter we have it and then I can
come back up here I'm going to just verify the false is the bigger value for the work from home and it is over here if I were to get these labels in the wrong order I'm going to update in the wrong order it's going to reverse it and you're going to be providing people bad data so don't do that so updating this for the correct labels bam we now got our final visualization using loops in order to enumerate through and basically graph everything and we don't have to write repetitive code it's pretty cool what we just
did here all right so now it's your turn to get in and get your hands dirty making these pie charts remember it's very important that you're using it whenever you have the correct number of variables to actually showcase if you're getting more than two or three you need to consider going to something like a bar chart with that see you in the next one after you work those practice problems see you there all right now that we got pie charts out of the way we get into more of my favorite visualizations starting with this one
of scatter plots these types of plots are great at showing correlation between multiple variables so in this case right here we're able to see how things like median salary and also the demand or count of a skill correlate to each other just a spoiler alert on the insights of this if we look at it we can see Python is up in the top corner meaning not only is it a highly paid skill but also it has a high demand based on where it is on the x-axis so super great insights out of these type of
plots we'll get to building that one in a bit let's start with a simple example so let's get into just plotting a simple example first and I'm creating this fake data frame here inside of it has different job skills of these skills right here and then their made-up counts and also made-up pay and then we then transform this dictionary into a data frame now it's a data frame we can actually plot it so I specify df.plot for the kind I call out that it is a scatter plot and then we need to provide
the x values so what we're going to put on the x axis and in our case we're going to use that skill count column and then for the Y AIS we're going to pass in that it is the skill pay okay running control enter bam we have our scatter plot super simple to make this chart needs to be cleaned up it's a hot mess but at least we're understanding now we can plot things by their X and Y coordinates according to the skill also this isn't labeled we'll fix that later in this so we need
to get our data in a form similar to what we provided here of that skill pay and also the skill count aggregated by each of those skills so what we're going to need to do is create this data frame which it has three columns the first one is the index of the job skills itself and then for each one of those skills we need to know what the median salary is and the count is and the purpose of this format is so that way we can use that median salary as the Y AIS and the
count of those skills as the xaxis on our scatter plot so as always I've imported on our libraries loaded the data set and then done our standard data cleanup for this one since the skill and is very dependent on the job title we're going to filter this data set down to data analyst we're replacing the entire data frame of DF so we don't necessarily need to make a copy of it in this case I me you can but it's not really necessary because we're just replacing the entire data frame so the first thing we need
to do is explode out that skill column it's sort of fun to say too so I'll create a new variable called df_exploded and I'm going to run that explode method on the job skills column showing it below we can see that we have those repeated indexes and now that job skills column's broken out so now what we need to do with this is we need to do a grouping specifically for the job skills we need to group the job skills in order to get the count of these skills but then also for that
salary year average column we need to aggregate for each of those skills the median salary so with this exploded data frame let's actually start grouping it we want to specify that we're using the job skills column for this just running a simple size on this we can see different counts of all these different job skills in this but remember we want not only the counts of these skills but also what is their associated median salary since there's multiple aggregation functions we need to use that agg method now I'm going to press enter to go down
to the next line because we're going to do something that we haven't done before we're going to be using basically like a dictionary if you will in order to define keys and values for how we want to do these different aggregations we'll start with something simple we'll start with the skill count that's what we want as the new column name so I'm naming the column right now and I'm setting it equal to a tuple of two values for this we have to provide the column that we want to do the aggregation on so we want
to be doing this on the job skills column and we want to do the aggregation method of count so now running this we can see just to reiterate that skill count is now the column name that we're creating and then we're using the job skills column to perform a count and these values should be similar to what we saw just now whenever we ran size but we also want to do that median salary so I'm going to create a new variable called median salary I say new variable but it's a new column name and we're
going to set it equal to that tuple of the column of interest so salary year average and then the aggregation method median running this one we now have skill count and median salary side by side bam this is great now I'm going to call this data frame that we got provided back a variable of skill stats and that's so we can get into actually sorting the values of it because right now it's out of order it's only in alphabetical order by the names we want to sort these values by that skill count column running control
enter we have this but it's now in ascending order so we need to set ascending equal to false okay now we have the skill counts with the highest so SQL Excel python up at the top and then the final thing we're going to do is we only need the first 10 values of this so I'm just going to run the head method on it specifying 10 and we'll set this all equal to the variable of skill stats so we can get into now plotting it so with that skill stats I specify the plot method and
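As a rough sketch of the explode, group, aggregate, sort and head steps just described: the tiny data frame below is made up for illustration, and the column names job_skills and salary_year_avg are assumptions about how the course data set is labeled.

```python
import pandas as pd

# Made-up stand-in for the filtered data analyst postings (hypothetical rows).
df = pd.DataFrame({
    "job_skills": [["sql", "python"], ["sql", "excel"], ["python"], ["sql"]],
    "salary_year_avg": [90000.0, 70000.0, 110000.0, 80000.0],
})

# One row per (posting, skill) pair; the original index values repeat.
df_exploded = df.explode("job_skills")

# Named aggregation: new_column_name=(source_column, aggregation_method).
skill_stats = df_exploded.groupby("job_skills").agg(
    skill_count=("job_skills", "count"),
    median_salary=("salary_year_avg", "median"),
)

# Most requested skills first, keeping at most the top 10 rows.
skill_stats = skill_stats.sort_values(by="skill_count", ascending=False).head(10)
print(skill_stats)
```

Each keyword passed to agg becomes a column in the result, which is why skill_count and median_salary end up side by side.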
we're going to be doing a scatter plot as always we need to provide an x value so we're using skill count for this one and then a y value and we'll be doing median salary for this one running control enter we have our scatter plot now with median salary on the y-axis and skill count on the x I'm going to clean this bad boy up a little bit by providing an x label specifying we want the count of job postings a y label of median yearly salary also like to usually specify that hey it's USD
dollars an appropriate title of salary versus count of job postings for the top 10 skills and running that tight layout like we did before to basically shrink it down and make sure everything fits and now we have this which isn't too bad but there's a slight problem with it if you're picking up on it what are the names of all these skills like what's associated where's python where's SQL I don't know well unfortunately Matplotlib doesn't make annotating this easy we'll go into the Seaborn library later in the advanced
chapter and it makes it a lot easier to do this but Matplotlib doesn't so we have to do somewhat of a workaround to get this to work for this we're going to be using the text function from the pyplot module and it adds the text that you specify to the axes at the location x y in data coordinates so basically we provide three separate things we provide an x value a y value and then what we want to put in so on our plot up here I'm going to add that plt.text we're just going
to do a simple example first remember we need to define x so I'm going to say x 50,000 y 90,000 both of these are near the middle and then for the text or string I'm just going to say Luke running this we get Luke in the middle here so I could technically go through find all these different points and then provide the appropriate label or we can use Python to actually do this for us scrolling back up to the data frame we're working with we basically just need to iterate through here using a for loop
accessing the job skills column to get the name to provide us the text and then as far as the x and y coordinate we just provide the appropriate value from the column using something like iloc which we can access this index using the enumerate function let me show you what I mean so let's start by creating that for loop and for this like I said we need the index and then we also need the column name we'll get the coordinates here in a second I'll show you how we're going to do that separately so
specify in and we're going to enumerate from this that way we can get oh my gosh I can't spell enumerate at all enumerate and then we want to enumerate through the skill stats specifically the index of this so we're going to be basically provided just this list right here and they're going to provide index values so if I print out i and text I'm also going to comment out all the other stuff I can press command forward slash to comment out everything else running this you can see we get for i and text
we get the index and then the actual value of the index so let's pull that graph back up by uncommenting all this other stuff and pressing play okay and so we have our graph back anyway we need to now get into actually enumerating through this plt.text function I'm going to go ahead and delete this print statement and for this we need to provide the x value first which is the count of job postings so for that skill stats data frame I need to call out that skill count column and then we want
the specific value from the location of the row so we're going to use iloc for this specifying i similarly we're also going to pass in that median salary for the y position specifying its iloc location to get that y value and then finally we want to provide that text we have which is getting back the skill name so let's go ahead and run this and bam now we have this graph actually labeled with all of these different values in here granted we had to do this special for loop in order to figure
it out but we're able to do it nonetheless and personally I find this visualization one of the most insightful that you're going to get out of this entire course it shows that based on demand and salary things like python SQL and Tableau maybe even Excel even though it's sort of a lower salary these are high quality skills that you can focus on because not only do they pay well but also they're in demand and remember this is for data analysts so if we want to look at another job title you'd have to actually filter the data
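The labeling loop outlined here might look something like the sketch below. The skill_stats values are invented for illustration, and the Agg backend line is only there so the snippet runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aggregated data: one row per skill, indexed by skill name.
skill_stats = pd.DataFrame(
    {"skill_count": [3000, 2000, 1000], "median_salary": [90000, 98000, 84000]},
    index=["sql", "python", "excel"],
)

ax = skill_stats.plot(kind="scatter", x="skill_count", y="median_salary")

# enumerate gives the row position i and the index label txt (the skill name).
for i, txt in enumerate(skill_stats.index):
    x = skill_stats["skill_count"].iloc[i]    # x coordinate of row i
    y = skill_stats["median_salary"].iloc[i]  # y coordinate of row i
    plt.text(x, y, txt)                       # place the label in data coordinates

plt.xlabel("Count of Job Postings")
plt.ylabel("Median Yearly Salary ($USD)")
plt.tight_layout()
```

Because iloc selects by position, the i from enumerate lines up with the same row that supplied the label.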
frame for that now it's your turn to get to work building the scatter plot and also working those practice problems with that I'll see you in the next one before we move any further exploring other plots like histogram and box plots we need to get more familiar with how we can customize all these different plots so for the first half of this lesson we're going to be focusing on this line chart here where we evaluated the trend of skills over time and we're going to talk about all the different parameters that you can adjust on
it in order to make your visualizations pop more now everything that we apply on this line chart can be similarly applied to other charts as well so it's going to be of good use on all visualizations now for the second part of this lesson we're going to be moving back into that scatter plot that we just worked on but we're going to be upgrading it not only will we be updating it so that labels won't be sitting on top of each other but also we're going to be updating that y-axis so that way it's provided in
a format that's easier to read all right with that let's jump right into it so here I am in my notebook we've imported all our libraries loaded the data set done data cleanup I've added one more step in here of filtering for data analyst roles all of this analysis is going to be specific to data analysts so I wanted to filter for it and just get that out of the way now we're going to be working on that chart that we created in that exercise number 12 on trending skills and I went ahead and copied
and pasted all the different code in here and then plotted it down below and as a refresher the data frame itself all it is are columns with the job skills and then for each of the rows it's indexed by the job posted month and we have this all sorted in order of the highest requested skill down to the least so let's get into customizing this and for this we're going to be customizing the line chart area so I don't really care about the title or the X and Y labels right now we're just going to
leave them be now for the final visualization that we're going to be building or customizing completely we're going to be adjusting all the different parameters and so I'm going to be going through them at a pretty fast pace of all the different parameters that we can cover for this but I just want to give you a sneak peek of what we're going to be getting to so starting back with our original graph because there's going to be multiple parameters inside of the parentheses of plot I'm going to go ahead and press enter and it's going to make
it indented the indentation doesn't really matter it's not going to affect python because it's in parenthesis it's going to do everything anyway we're going to first specify the kind and then we know it's line because it's a line we can do different things like specify the line width I'm actually going to make this a little bit bigger than what it is right now and specify the line width equal to four that pops a lot more I don't necessarily recommend making it this big for all your charts but if you need to make a point this
is it next parameter is line style and say I wanted to make it a bunch of dots well I can specify in this case providing it the string of colon pressing control enter we now have a bunch of little dots now there's a few different options for this it standard comes as a line but you could also do two dashes to signify that you want it to be a dash line or Dash and Dot if you wanted a combination of both of these we'll go back all right the next thing that we're going to look
at is the color because right now this is using the standard Matplotlib color mapping we want to provide a different colormap and we're going to specify viridis and we'll go into more in a second but I'm going to go ahead and run it and I spelled viridis wrong and I think I'm even saying it wrong it's viridis I don't even know I've never heard of this color and I'm also getting the error because it says invalid syntax perhaps you forgot a comma and I did after here so make sure we have commas
after each one of those running again we now have this new updated colorway now if you're curious about different colormaps that you can provide to this you can come in here to the cheat sheets and zoom in on this they have a whole bunch of different colormaps that you can actually provide depending on what kind of color scheme that you want next up we're going to move to the markers and we'll identify a marker first by just specifying how we want it plotted on there I'm going to do an o a
lowercase o whenever I run this it now has dots on here but as far as options go for this you can have a whole variety of different things I think another common one would be something like an asterisk and in that case it plots stars which is pretty cool I guess we'll just keep it lowercase o for the time being next up is the marker size itself so we're going to make this the size of five to make it slightly bigger than our line running control enter makes it well actually I think it
actually shrunk it a little bit and the last parameter that we're going to be doing on this figure is the figsize and this specifies the width along with the height so let's say I'm going to make it super long and then super short so in this case it's a 10 by 2 I don't really like the dimensions of this I'll change this to 10 by 5 and now this looks a little bit better now with all these additions I'm going to go back and add in that title y label x label everything like that and we
now have our final visualization which is not bad at all now if you're curious about what parameters can be provided to this you just need to go to the applicable plot that you're calling via Matplotlib because in this case yes we're using the plot function of pandas but we're graphing it using the plot function from Matplotlib so if you're curious about what different parameters you can provide for our chart when plotting it via pandas you can come into the Matplotlib library under the pyplot module and scroll down to
see all those different parameters and a lot of the same parameters are here like things like linewidth all right let's shift gears and now get into customizing that scatter plot that we did in the last lesson here I've gone through and copy and pasted the code into here from last time and I will call out that I did specify a number of skills to plot on there specifically for that head method that we filter our data frame here on job skills and right now I'm setting it at 20 last time we used it at 10
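Pulling together the line chart customizations from the first half of this lesson, a minimal sketch could look like this. The small data frame is invented (the real one is indexed by posting month with one column per skill), and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly counts, one column per skill.
df_trend = pd.DataFrame(
    {"sql": [50, 48, 52], "excel": [40, 41, 39], "python": [30, 33, 35]},
    index=[1, 2, 3],
)

ax = df_trend.plot(
    kind="line",
    linewidth=4,         # thicker than the default line
    linestyle=":",       # dotted; "--" is dashed and "-." is dash-dot
    colormap="viridis",  # any named Matplotlib colormap
    marker="o",          # "*" would draw stars instead
    markersize=5,
    figsize=(10, 5),     # (width, height) in inches
)
plt.title("Trending Skills")
plt.tight_layout()
```

pandas handles colormap and figsize itself and hands the remaining keywords such as linewidth, linestyle, marker and markersize through to Matplotlib, which is why the Matplotlib documentation is the place to look them up.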
this time we're using at 20 we're doing it at more and the reason is this with only 10 last time there was no overlap but as soon as I start adding more than that we can see that there's a lot of overlap in the text itself and it gets harder and harder to read this also I just like setting things as a variable because I can always easily update it so in this case I wanted 50 I could have 50 skills and then whenever I run it down below it'll plot all 50 on here and
wow that is a hot mess anyway let's go change that back to 20 all right this looks a lot better so what we're going to be working towards is updating this visualization to use a special library in order to make sure that we don't have these overlaps and if there are it provides arrows basically pointing to the associated dot with the skill so for this we're going to be using adjustText and specifically this is the documentation for it and it's made to provide this function right here of adjust_text and the whole
purpose of this library is to adjust text as shown here and it can do some pretty complicated things now back to that documentation we need to provide it the parameter of texts which is a list a list of Matplotlib text objects to adjust well we just worked with that plt.text in the last lesson so it's going to be pretty similar with this anyway the first thing we need to do is actually import that function so we'll do from adjustText import adjust_text now you have not installed this library yet so you'll
need to open up your terminal using control tilde make sure you're in the correct python environment and then put in conda install adjusttext I'm going to run it but it's basically going to say hey you got it already and it confirms all requested packages already installed that's for me I've got it installed if you don't you need to do that okay I'm going to go ahead and close this terminal below and we're not going to start over for this plot right here I'm going to take all of this code and bring it down into here
so I'm going to come down here and insert that in press control enter and get that visualization showing now you can see that here adjust_text is grayed out we're going to add that in we need to put that below where we're actually generating all that text so we're going to put right here adjust_text and the argument that we're going to be providing to it is the first argument of texts which is a list so I'll just call it what it is texts this is not defined yet we actually need to create a list
to do this so I'll say texts equal to an empty list now this list is a list of all of these different plt.text calls that have basically the x value the y value and what we want for the text so we need to insert this or append this into texts so I'm just going to call out texts.append and wrap this entire thing in parentheses all right let's run this and see what we get all right not bad it's actually moving stuff around there's no overlap anymore but you can't necessarily tell like in this case
over here like which one's Oracle and which one's SQL Server which one's flow we need some arrows going to it to explain that now going back to that documentation anytime I'm sort of caught up on what I want to do with it I can just look at some different visuals and see potentially what I want to use in this case I want to use this arrowprops parameter I'm just going to copy this entirely in and I want to use that to actually specify what to actually point to and that was going inside of our
adjust_text function so I'll paste it in right in there with the same things that we were using in there and pretty neat I don't really like this red value we're actually going to change this to gray and the line width we'll make it a little bit bigger all right bam now we have arrows pointed to each one so we now know which one the flow is Oracle SQL Server all of these so much better representation of the data set now the last thing we're going to be doing is customizing this y-axis specifically I want
to better format these values to be a dollar sign then the first basically three digits of the thousands and then specifying that's the thousand by a K value so it's not these long numbers that we have right here so in order to do this we're going to have to use the class of FuncFormatter and inside of it or what we're going to provide to it is a function specifically we'll be providing a lambda function to make it easy anyway the function takes in two inputs a tick value x and a position pos and returns
a string containing the corresponding tick labels that's why it's underneath ticker and let's get into modifying this using that FuncFormatter now for this we're not modifying the figure itself we're going to be modifying the axis so we need to access that axes object so we can do that one of two ways I can call out fig ax and then from there specify plt.subplots and then just pass in no arguments and now we have access to the axes object running it like this with control enter alternatively
I can specify ax is equal to inside of our pyplot module we have a function called gca for get current axes and running this will also get the current axes so we'll just leave that one for the time being because we don't really need fig now the one problem with this function call of gca if we look down here our scatter plot's completely disappeared we actually have to call this getting of the current axes after we've actually generated the plot so I'm going to come down here at least underneath here
and call out the axes itself now running this again we can actually see we get all our values back okay so we want the y-axis of this axes object so I'm going to specify ax.yaxis and then we're going to call the method of set_major_formatter and now this is where we're going to put in that plt.FuncFormatter okay remember inside of here we need to pass a function specifically it's going to be a lambda function and with that lambda function we get two values it said x and also pos
or position and just for the time being with those two arguments I'm only going to return back the x that it provides I'm going to go ahead and run this whenever I run x you're going to notice the labels down here are going to change with x and they're going to change now to basically a floating point number anyway we're providing the number back that's what's inside of this x argument also I'm realizing now this is the y-axis so we're going to change this actually to y alternatively if I wanted to
see the position just to understand what it is it returns the index value if you will of each of these tick marks but we don't really need that I don't really care about pos but I have to call it all right so we have our y value now we actually need to get into actually formatting this and the easy way of doing this is actually just passing this in via an f-string so I can do an f-string and we're just going to pass in the variable of y for the time being just
to make sure that it's running but we can format this now so if I want maybe a dollar sign at the beginning I'll put a dollar sign and then I want a K at the end to get what we want at the end we now have a dollar sign and K now we want to cut off basically these decimal places and also these last three zeros so what we can do is take y divide it by 1,000 trying this we have it but now we have this zero and that's not really necessary so
what I'm going to do is convert this operation that I'm doing to an int running control enter bam we now got what we want that y-axis is now formatted exactly how we need it with those two to three digits in between the dollar sign and also that K value to indicate a thousand now when you jump into it if you want to you can get into formatting the x-axis but like I hinted to in past lessons we're going to be actually changing this out to a percentage based on what is the skills
percentage based on all job postings so I'm not getting into that for the time being now formatting axes doesn't always come super easy to me so don't be afraid if it's not coming easy to you as well and feel free to take advantage of things like ChatGPT I'll frequently just take the code that I currently have that needs to be cleaned up and I'll copy it and I'll provide it a statement of what I want to do of hey I want to format the x-axis to have commas in it then from there I'll just
paste this code into it and in this case it went through and did a lot of the same things that we just did well it did exactly the same thing except for the x-axis anyway I'm going to come up here and copy the code just to check it paste it on into here and this new line of code now goes through and updates that x-axis to have commas in it so I'll be honest I don't have all these formatting options memorized I have to rely on things like Google and ChatGPT to get me to the
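As a preview of where this part of the lesson ends up, here is a hedged sketch of collecting the labels in a list and handing them to adjust_text. The skill_stats values are invented, the arrow settings are just one reasonable choice, and the try/except is an addition so the snippet still runs if adjustText is not installed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

try:
    from adjustText import adjust_text
except ImportError:
    adjust_text = None  # library missing: labels stay where plt.text put them

# Hypothetical aggregated data with two skills that sit close together.
skill_stats = pd.DataFrame(
    {"skill_count": [3000, 2990, 1000], "median_salary": [90000, 90500, 84000]},
    index=["sql", "python", "excel"],
)

skill_stats.plot(kind="scatter", x="skill_count", y="median_salary")

# Keep every Text object that plt.text returns so it can be repositioned later.
texts = []
for i, txt in enumerate(skill_stats.index):
    x = skill_stats["skill_count"].iloc[i]
    y = skill_stats["median_salary"].iloc[i]
    texts.append(plt.text(x, y, txt))

# Nudge overlapping labels apart and draw gray arrows back to their points.
if adjust_text is not None:
    adjust_text(texts, arrowprops=dict(arrowstyle="->", color="gray", lw=1))

plt.tight_layout()
```

The call to adjust_text has to come after the loop, since it can only move labels that already exist on the plot.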
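The formatter being built up here can be sketched as two small functions, one showing the leftover decimal and one showing the integer conversion that removes it. The function names and the scatter data are made up for illustration; a lambda passed straight to FuncFormatter behaves the same way.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def salary_label_float(y, pos):
    """Divide by 1,000 only: the float division leaves a trailing .0."""
    return f"${y / 1000}K"


def salary_label_int(y, pos):
    """Divide by 1,000 and convert to int to drop the decimal part."""
    return f"${int(y / 1000)}K"


fig, ax = plt.subplots()
ax.scatter([1000, 2000, 3000], [84000, 90000, 98000])  # hypothetical points

# pos (the tick position) is required by the signature even though it is unused.
ax.yaxis.set_major_formatter(FuncFormatter(salary_label_int))

print(salary_label_float(90000.0, 0))  # $90.0K
print(salary_label_int(90000.0, 0))    # $90K
```

The same formatter can be attached to ax.xaxis when the values to clean up sit on the x-axis instead.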
end results all right now it's your turn to dig in with some practice problems and also if you haven't done it already finish that exercise that we just did with that I'll see you in the next one so we've covered the major type of plots already with line scatter and bar charts now we're going to be moving into more of statistical analysis visualizations particularly in this session of the histograms and the next one of box plots histograms are great at showing the distribution of values in this case right here where we're looking at salary we're
able to quickly see by this that the salaries are very much grouping around 100,000 and then when we diverge from that they're becoming less and less frequent now what we are noticing from this that the data is skewed to the right basically we have that long tail where the data is going out to almost $400,000 and this is pretty common whenever analyzing salary but enough of me actually talking about it let's get into it so I've gone through and done the standard Imports loaded the data and done all the data cleanup of cleaning up the
job posted date and the job skills both these actually aren't really necessary for this section now I want to filter down to not only specific job titles but also a specific region because salary is going to fluctuate depending on where you are around the world and in our case job title so I filter it for data analyst and also for the job country of where I'm at right now the United States I've also used this copy method to make sure we're not altering the original data frame now in order to generate this visual
is pretty simple you just provide it a column and it automatically does the aggregations necessary to build the histogram so here I am with the salary year average column and all I'm going to do is call plot on this and I'm going to specify for the kind equal to hist running control enter we get this thing which is an ugly piece of mess but nonetheless it's showing pretty similar to what we saw in the intro but we can refine this further right now the bins so basically how wide each of these are along here there's about
1 2 3 4 5 6 7 8 9 10 about 10 bins right now so what I can do is I can specify a larger number of bins such as bins equal to 30 and now it's making them a lot smaller and we can see a lot of things within the data set if we go too big of a number we're going to start to see a bunch of like ups and downs throughout the data set and it's not going to tell us as much as we want to so I'm going to settle on
30 the last thing I'm going to do to customize this is add an edge around each one of those bins so I'll specify edgecolor and set it equal to black now that we're familiar with how to make histograms I want to go through and clean this up I think the main thing I want to focus on is down here on the x-axis and these are the salaries and right now we have a pretty long tail personally I don't find a lot of value here after 300,000 I know we have an outlier out here of around
375,000 but we don't necessarily need that in here so I could come up here and call out plt.xlim and you provide two values to this you provide the start of zero and then where we want to go so in our case let's go to 250,000 running control enter now we can see a lot closer into the histogram I could even now that we've zoomed in probably boost up those numbers a little bit on bins and yeah this thing is looking beautiful okay now what we want to do is format this
x-axis in the last session we went through and actually formatted these number values to be a dollar sign and then cut it off on the thousands place so I'm not going to walk through all this code again I just copied and pasted it from the last time and also updated the variables to be x now for both of these since we're doing the x-axis technically anyway now we have number values I think the only thing left to do now is actually add some titles and I've added this title of distribution of United States data analyst
yearly salaries and updated the x and y labels for this to get our final visualization and bam we now have this bad boy we could probably put in the Wall Street Journal for how good it looks showing the distribution of our different salaries and we can really see how now this median is probably somewhere around 80 to 90,000 for these yearly salaries along with showcasing how it has this longer tail so a lot of great insights coming out of this all right so that's probably my fastest lesson in the advanced chapter but now that
you have this knowledge of histograms after you work through those practice problems you're going to be then set up to go into box plots next which we're going to build on this with that I'll see you in the next one all right now that we're familiar with our first statistical distribution plot specifically histograms we're going to now move into box plots which I feel are really great at comparing values specifically we could look at something like the distribution of salaries across data analysts data engineers and data scientists we're going to be doing that now the
reason why we covered histograms first is because box plots really build on that they both show distributions of numerical values and box plots are very much coordinated to this so after we go through a simple example of graphing we're going to break down how to interpret box plots so here we are in my notebook I've imported the libraries I'm loading the data and going through all the data cleanup for it just like last lesson we're going to be focusing on only data analysts in the United States because we need to make sure we're not including
too many jobs when looking at a single distribution also we've gone through and dropped all those NaN values so let's inspect that column that we're going to be looking at most and for this I just want to get a sample of the top 10 values of it and as we can see they're numerical values none of them are really the same they're distributed all over the place so this is why box plots are great for that so jumping into building it using pandas all I'm going to do is list that data frame
along with the column of interest so in our case salary year average and call the plot method on it and specify I want it to be a box plot and we get a box plot alternatively if I wanted to use Matplotlib to plot this I could use the boxplot function and for the argument all I provide is that column of salary year average running this I get the same looking thing now personally I don't like the orientation of up and down using that y-axis because I would prefer it to be horizontal so
whether you're using the pandas built-in method to plot this or Matplotlib you can just specify vert equal to false and vert stands for vertical anyway this is something more in line with what I feel I can interpret a lot more easily so let's break down what's going on inside of this box plot so this was a great diagram I found on KDnuggets breaking it all down we're going to start in the middle first with the box itself the yellow line right here in this case is the median so it's representing the 50th
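The two ways of drawing the box plot mentioned here, through pandas and through Matplotlib directly, might be sketched like this with the horizontal orientation already applied. The salary values and the df_da_us name are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical salaries standing in for the filtered data analyst postings.
df_da_us = pd.DataFrame(
    {"salary_year_avg": [60000.0, 75000.0, 90000.0, 105000.0, 200000.0]}
)

# Option 1: the pandas plot method, drawn horizontally.
ax = df_da_us["salary_year_avg"].plot(kind="box", vert=False)

# Option 2: Matplotlib's boxplot function on the same column, in a new figure.
# Newer Matplotlib releases may prefer orientation="horizontal" over vert=False.
plt.figure()
result = plt.boxplot(df_da_us["salary_year_avg"], vert=False)
```

plt.boxplot returns a dictionary of the drawn artists (boxes, medians, whiskers, fliers), which is handy when a single piece needs restyling afterwards.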
percentile value and we've talked about median and average before I prefer median especially in salaries because of that skewed right that's happening I don't want to use average because we would think that the salary would be higher than it actually is now the Box itself has on the left side is at the 25th percentile and the other it's at the 75th so everything in the box is 50% of the data and this is also known as the interquartile range now we also have these lines protruding out the side they look kind of like whiskers so
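To make the pieces of the box concrete, here is a small sketch that computes them by hand on made-up salaries. The 1.5 times IQR rule used for the whisker limits is Matplotlib's default; treating it as the rule behind the diagram is an assumption.

```python
import numpy as np

# Hypothetical salaries with one very high value.
salaries = np.array(
    [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 300000]
)

# Left edge of the box, the median line, and the right edge of the box.
q1, median, q3 = np.percentile(salaries, [25, 50, 75])

# Interquartile range: the width of the box, holding the middle 50% of the data.
iqr = q3 - q1

# Matplotlib's default whiskers stop at the furthest points inside these limits.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything beyond the limits is drawn as an individual outlier point.
outliers = salaries[(salaries < lower_limit) | (salaries > upper_limit)]

print(q1, median, q3, iqr)
print(outliers)
```

With these numbers the single 300,000 salary falls past the upper limit, so it would show up as a lone point to the right of the whisker.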
you also hear this chart referred to as a box and whiskers chart from time to time anyway these lines that go out on each side are meant to signify the minimum and maximum they're not the true minimum and maximum they're calculated by these formulas and they're definitely not something you need to memorize but they are something you need to understand that they're not necessarily the true minimum and maximum because we could have data points that protrude even further out and those we call outliers and they're plotted as individual points when they fall beyond
those whiskers so I stayed up all last night and I ended up making this beautiful thing which I'm quite proud of and for it we have both that histogram that we plotted in the last lesson and also the current box plot that we have for data analyst salary in the United States and I made sure to plot them on the same x-axis now it's pretty neat with this because now we can see how this data actually correlates between this histogram and this box plot specifically for the median where it's marked with that blue line
here I put a red dashed line also on the histogram above to show where the median is and look at it it's right here where the peak is of the most values although that's not always the case next is the interquartile range and I've had this outlined by the yellow dashes on each side so even look at this histogram where we can actually see the bins of all the data we can see more realistically okay yeah 50% of the data does look like it falls within those yellow dotted lines and then finally from there
we have our minimum to maximum for minimum we don't have any outliers on the low side but look on the high side we have a lot of different outliers with this remember that maximum is not the true maximum because well we have more than that anyway pretty proud of this visualization I may put it on my fridge anyway let's build on this box plot that we have from our first example for data analysts only I want to actually now get into comparing different salaries of different job titles in the United States and trying to compare
histograms on top of each other I'm not really a fan of but box plots are great for this so what do we need to plot all three of these salaries of data analysts data engineers and data scientists well this is what we're going to be providing back into that boxplot function the first variable is the job list and that is a list of all the different series values for the salary for data engineers data scientists and data analysts and then along with this we will also need to provide labels so just the name of data analyst
data engineer and data scientist and then of course I like vert false I'm picky so we're going to go with that so let's start with the job titles first because that's the simplest I'm just going to make a list of the different job titles so now that we have that list up above we had filtered the data frame to be for data analysts in the US we need to basically do this but remove the data analyst part so instead of this data frame being data frame DA US I'm just going to
remove this it's going to be data frame US we still want that country of United States so this portion is going to remain the same but I want to alter this portion here of the job title short with this column of DF job title short we can run the isin method and pass in job titles so that list above basically it's going to filter through it and only return the rows that have data analysts data engineers and data scientists in it I'm going to go ahead and comment that out and then show this down below
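As a rough sketch of the plan laid out above (a list of job titles, an isin filter, one salary series per title, labels, and a horizontal orientation), the steps could look like the code below. The tiny data frame is made-up illustration data, and the column names `job_title_short`, `job_country`, and `salary_year_avg` are assumptions standing in for the spoken names; they are not confirmed by the transcript.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; the column names are assumed.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer", "Data Scientist",
                        "Data Analyst", "Data Scientist", "Software Engineer"],
    "job_country": ["United States"] * 5 + ["Canada"],
    "salary_year_avg": [90000.0, 130000.0, 140000.0, None, 155000.0, 120000.0],
})

job_titles = ["Data Analyst", "Data Engineer", "Data Scientist"]

# Keep US rows whose title appears in job_titles, then drop rows with no salary.
df_US = df[(df["job_country"] == "United States")
           & (df["job_title_short"].isin(job_titles))].copy()
df_US = df_US.dropna(subset=["salary_year_avg"])

# One salary Series per job title, in the same order as job_titles.
job_list = [df_US[df_US["job_title_short"] == job_title]["salary_year_avg"]
            for job_title in job_titles]

fig, ax = plt.subplots()
ax.boxplot(job_list, vert=False)   # vert=False draws the boxes horizontally
ax.set_yticklabels(job_titles)     # the lesson passes the labels to boxplot directly
```

Setting the tick labels after plotting is a choice made here to stay compatible across Matplotlib versions; passing the labels straight into the boxplot call, as the lesson does, gives the same picture.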
so doing a value counts on the job title short column we can see we have data analysts data scientists data engineers now inspecting that salary year average column we're going to have to remember we have a bunch of NaN values so I want to remove those out of here so for data frame US I'm going to set it equal to itself and I'm going to run dropna and specifically we'll specify the subset equal to salary year average so running this and then running below okay now we removed all our NaN values so now
we have a data frame that has our data scientists data Engineers data analysts all that data into it so looking at what's left we need to still get that job list which is going to be a list of each of those different series values if you will for the salary year average column for data analyst for data scientist and for data engineers and so we can do list comprehension for this so I'll specify job list here and we'll start building that list comprehension now we're going to define the how we're going to filter the data
frame at the beginning so I'll just put some blanks for the time being but to understand we're going to be going through for job title which is a new variable I'm defining in job titles so we're cycling through this list in here and we're going to be passing it to here on what we want to filter for these different series so specifically we want that DF US and that salary year average column and we want to actually filter this even further by the job title short so I'll put another set of brackets in there we'll specify the DF US
job title short and we're going to use the comparison operator hey is it equal to this job title that we're going to be passing through it okay so let's see if it runs and we didn't have any mistakes so let's actually inspect it below so I'm going to call job list and I just want to call the first one so we'll specify zero in that list and we get a series back of all the different values of the salary year average so we got what we want so now we're going to uncomment this we're going to
plot it now we're going to pass in that job list along with the labels of the job titles and the vert equal to false and bam now we got this one which definitely needs some cleanup especially that x-axis but we have now all the values put onto a single plot I'm going to add some titles specifically adding the title of salary distribution in the United States yearly salary for that x label and then a plt.show so we can get rid of this stuff right here now it's looking better still
want to clean up this bottom axis now in order to do the bottom axis we need to get that axis attribute so we'll call plt.gca to get the current axis and I can set this equal to ax now with our axis we want to modify the x-axis and specifically want to set the major formatter inside of here we're going to use that FuncFormatter function and for this I'm going to just copy and paste what we used before although we did the y-axis before so I'm going to change this just to x to
have the nomenclature right it's still going to work anyway it's going through dividing the values by a thousand so we get rid of those last three zeros and then adding a dollar sign and a K onto it let's run this bad boy and it looks like I have an issue not enough parentheses on here and now I do have enough parentheses all right oh yeah this is good the only last thing I've noticed I have this major outlier up here I would definitely want to
be truthful about this but in my case I care about looking more into this so I'm going to filter the x values from 0 to 600,000 so I can look better into it so for this I'm going to specify plt.xlim with 0 and 600,000 running this bam now we have this beautiful bad boy so we can interpret a lot of things out of this first of all data scientists look how big their box is compared to things like data analysts and data engineers it has a lot bigger spread for the
salaries we can also see a bigger spread in the outliers that go out from it the other insight we can see from this is obviously data scientists are more likely to have higher salaries than data engineers and likewise data engineers than data analysts which we're going to need to investigate more on why that is the case when we move into the project and it has a lot to do with the skills and the years of experience required but this is pretty neat to actually see visually how these different salaries are distributed and what you
could expect depending on the role that you choose so now it's your turn to dig in and try it yourself we also got some practice problems for you to go through and make some different box plots as well to better understand how to use these bad boys all right with that see you in the next one so we're getting into the final exercise for this advanced chapter and I promise it's for good use what we're building here is going to be going towards what we're trying to solve in the
final project specifically we need to dive further into those skills and analyze for data analysts in the United States that's my situation feel free to adapt it to yours what is the pay for their associated skills and specifically I want to analyze this from two perspectives one I want to just see in general what are the top 10 paying skills and then related to that I want to see for skills that have the most demand or the highest count what are their associated salaries so we're going to be taking advantage of everything we've learned so far in this
advanced chapter to be able to plot all of this on a single plot so let's jump in I'm inside of a fresh new notebook I've imported our libraries loaded the data set and done our standard data cleanup similar to last time because we're analyzing salaries I want to get very specific of what we're looking at and we're going to be filtering for data analysts in the United States make sure when you filter you run that copy method additionally because we're only looking at salary we don't need any of those other postings it's just going to slow us
down that don't contain salary data so we're going to go ahead and just remove it by dropping NaN values so now we have just this final data frame and it has only data analysts in it we need to now go forward and clean up this skills column basically explode it out I love saying that so I call the explode method on this and specify job skills and then just to inspect it I'm going to call this on the job skills column to see what happened there running control enter we can see nothing happened here because I didn't
save it to the original data frame this is why it's always good to check your work and okay now we have all the skills broken out so now we need to get two new data frames aggregated based on the salary year average column and job skills specifically we need that median salary associated with each skill but then we also need the counts of each of these skills so that way we can filter later to get what are the highest count skills to display at the bottom of the chart so basically we need to
do multiple aggregations so we're going to use the groupby method for this so with our exploded out data frame I'm going to call the groupby method on it and these groupings need to be done on the skills themselves we basically want to see the counts of the skills and then the associated median salary of those skills so I specify job skills now for both of these we're going to be running a count and also specifically the median on that salary year average with both of these I'm going to run now that agg method
and we're going to specify with a list that we want to do count and also median let's go ahead and run control enter and now we have it back right now it's sorted in alphabetical order by skill with count and median so I'm going to call this DF DA US group and set it equal to this now we need to get two new data frames we need one data frame with basically the top 10 highest salaries and then we need a second data frame with the top 10 skills that have the highest count with their associated salary so
for the top pay one I'll create this new data frame called DF DA top pay and set this equal to that DF DA US grouping that we have now but we need to sort the values of this so we're going to run sort values specifying by as median because we want to use this salary column to sort by and then we'll just throw in ascending equal to false to make sure that it does it in the correct order and finally I'm just going to run a head on it of only 10 we only want to display 10 values showing
it down below and pressing control enter we get the top 10 skills based on median salary so let's create that second data frame we need now for the skills with the highest count so we'll call this one DF DA skills and we'll set it equal to that original data frame of the US grouping for this one we'll sort values by count and pretty much everything else is the same ascending equal to false and head of 10 let's inspect this one as well all right we can see as we expect SQL Excel Python near the top with
the highest counts and then their associated median salaries great let's now get into plotting this bad boy so since we're using two plots on this we need to run that subplots function so I'll specify fig and ax and set this equal to plt.subplots then we need to specify the rows so two in this case and then columns of one so we're going to plot the top pay data frame first so I'm going to run the plot method on this specifying the kind equal to barh next we need to specify what y
is equal to do we want to do the count or do we want to do the median values what are we actually plotting here we want to plot the median values from there we're going to set the ax equal to ax and then we're calling that zero index now we're going to do the exact same thing for the other one of skills and I'm going to go ahead and just copy everything from above and paste it in below because it's exactly the same except for the ax we got to change that to one all right
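The steps so far in this exercise (explode, group by skill, aggregate count and median, sort into two top-10 frames, then plot each on its own subplot) can be sketched roughly as follows. The four postings are made-up illustration data, and the names `job_skills`, `salary_year_avg`, `df_DA_US`, `df_DA_top_pay`, and `df_DA_skills` are assumed spellings of the spoken names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up postings; each job_skills entry is a list of skills (names assumed).
df_DA_US = pd.DataFrame({
    "salary_year_avg": [90000.0, 110000.0, 100000.0, 125000.0],
    "job_skills": [["sql", "excel"], ["sql", "python"], ["excel"], ["python", "aws"]],
})

# explode returns a new frame with one row per (posting, skill); save the result.
df_DA_US = df_DA_US.explode("job_skills")

# Count of postings and median salary for each skill.
df_DA_US_group = df_DA_US.groupby("job_skills")["salary_year_avg"].agg(["count", "median"])

# Top 10 by median salary, and top 10 by count.
df_DA_top_pay = df_DA_US_group.sort_values(by="median", ascending=False).head(10)
df_DA_skills = df_DA_US_group.sort_values(by="count", ascending=False).head(10)

# Two rows, one column; each frame is drawn on its own axis.
fig, ax = plt.subplots(2, 1)
df_DA_top_pay.plot(kind="barh", y="median", ax=ax[0])
df_DA_skills.plot(kind="barh", y="median", ax=ax[1])
```

With only four skills in the toy data, head(10) simply returns every row; on the real data set it would trim each frame to ten skills.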
let's go ahead and plot this see what we have so far all right oh not too bad definitely some cleanup needed in the formatting but we have what we want well one thing the one for the top 10 most in-demand skills is not sorted correctly we didn't sort this so let's actually fix that real quick we sorted that by count up here but then we needed to sort it again basically by the median after this so I'm going to just take this little snippet right here and then append it on to the
end run this again and run this okay so both of these now are sorted in well a decent manner but we're going to fix that now one way we've done this before is just calling ax and then from there specifying zero and then running that invert_yaxis pressing control enter okay we inverted it an alternate method that you can do for this is whenever we pass the data frame itself we can just call it in reverse order and we can do this by specifying colon colon -1 because the last value in this is the step
value and in this case we're saying hey plot it in reverse I'm going to remove this and run it again and bam we have it like this so it's really up to you on what you want to do there's so many different ways you can go about it but now we have both of these sorted in the correct order now the next thing is I want both of these plots to be on the same x-axis so right now this one at the bottom goes to 100,000 and this top one goes well beyond that this is not
good to compare to one another so we basically need to set an xlim on this bottom one and extend it out to whatever the top one is using so since we're modifying the axis we'll specify the ax of one and then we'll call that set_xlim and what's neat about this we can then get whatever this upper axis limit is by specifying ax of zero so for the first one and then running get_xlim okay so now running this hopefully this works oh it does work okay so now the axes are perfectly matched
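A small sketch of the two fixes just described, reversing the row order with a step of -1 and copying the top plot's x limits onto the bottom plot, is shown here. The two frames are made-up stand-ins for the aggregated data frames, and their names and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up aggregated frames, already sorted from highest to lowest median.
df_DA_top_pay = pd.DataFrame({"median": [200000.0, 180000.0, 170000.0]},
                             index=["skill_a", "skill_b", "skill_c"])
df_DA_skills = pd.DataFrame({"median": [100000.0, 95000.0, 90000.0]},
                            index=["python", "sql", "excel"])

fig, ax = plt.subplots(2, 1)

# [::-1] reverses the rows so the largest bar ends up at the top of a barh plot.
# Calling ax[0].invert_yaxis() after plotting is the alternative from the lesson.
df_DA_top_pay[::-1].plot(kind="barh", y="median", ax=ax[0])
df_DA_skills[::-1].plot(kind="barh", y="median", ax=ax[1])

# Reuse the top plot's x limits on the bottom plot so the bars are comparable.
ax[1].set_xlim(ax[0].get_xlim())
```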
and it actually is representative of the data now I'm just going to do some minor formatting cleanup with these labels here so for the first one I'm adding a title and then removing the x and y labels similarly for the bottom graph I'm doing the same thing of set title but I do keep an x label down here for median salary I only think we need one additionally we have this currency and we've formatted it before so I'm not going to go through this again but we're using that FuncFormatter in order to format
it so I not only use it for the top one but also for the bottom one let's see where we're at right now okay this is looking good but now we're starting to have overlap between the two specifically this title's running into the axis up here so on the figure itself we can run tight_layout and the only last thing I think is now we need to remove this legend because it's sort of redundant to have this median right here for this we can just specify inside of here legend equal
to false running this we have now our final visualization yes and this is obviously specific to data analysts in the United States but it's really interesting to see some insights from this one none of these tools up here in the top 10 highest paid skills are necessarily super popular with the exception of maybe I would say Hugging Face but I could say the one clear thing about all of these is they seem to rely on understanding cloud technologies and how to use these types of technologies in order to build data solutions so don't overlook learning about the cloud
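Stepping back to the formatting done in this exercise, the finishing touches (the currency formatter, titles and labels, legend equal to false, and tight_layout) could be sketched like this. The data and the title wording are placeholders invented for the example, and `dollars_k` is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Made-up aggregated frames standing in for the two top-10 data frames.
df_DA_top_pay = pd.DataFrame({"median": [200000.0, 180000.0, 170000.0]},
                             index=["skill_a", "skill_b", "skill_c"])
df_DA_skills = pd.DataFrame({"median": [100000.0, 95000.0, 90000.0]},
                            index=["python", "sql", "excel"])


def dollars_k(x, pos):
    """Format a tick value such as 150000 as '$150K'."""
    return f"${int(x / 1000)}K"


fig, ax = plt.subplots(2, 1)

# legend=False drops the redundant 'median' legend from each plot.
df_DA_top_pay[::-1].plot(kind="barh", y="median", ax=ax[0], legend=False)
df_DA_skills[::-1].plot(kind="barh", y="median", ax=ax[1], legend=False)
ax[1].set_xlim(ax[0].get_xlim())

# Placeholder titles; the top plot gets no axis labels.
ax[0].set_title("Highest paid skills (placeholder title)")
ax[0].set_xlabel("")
ax[0].set_ylabel("")
ax[1].set_title("Most in-demand skills (placeholder title)")
ax[1].set_xlabel("Median Salary (USD)")
ax[1].set_ylabel("")

# Same currency formatting on both x-axes.
for axis in ax:
    axis.xaxis.set_major_formatter(FuncFormatter(dollars_k))

fig.tight_layout()  # keeps the bottom title from running into the top axis
```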
as far as the topmost in demand skills for data analysts this seems to be more generic level skills like python R SQL powerbi and these are skills not only a data analyst would use but also potentially business analyst and managers so not bad work there make sure you're saving the work that we did here because we're going to be using the same plot inside of our project all right we're now moving into the final lesson of this Advanced chapter and sort of a bonus one I wanted to include on the Seaborn Library I'm super excited
to show you some of the visualizations we're going to build with this with that I'll see you in the next one this lesson is going to be somewhat of a sneak peek into how to use the Seaborn library for visualizations now it's only one lesson because frankly it's pretty easy to pick up especially if you've already been using Matplotlib Seaborn is built right on top of Matplotlib so just like pandas is built on NumPy Seaborn is built on Matplotlib so it's super easy to integrate Seaborn with Matplotlib as we're
going to show so you're probably like Luke why do I need to learn another dang library well if you go back to the plot from the last exercise this thing's great but if I wanted to maybe color it in order to show higher paying skills or basically use a more contrasting color to draw my eyes to it it's going to be super hard to do this in Matplotlib but with only changing a few lines of code I can implement this in Seaborn and it allows me to color graphs a lot easier then the real
power of Seaborn is unlocked whenever we get into more advanced visualizations like that histogram we did well we can turn it into this plot right here where we actually smooth the lines and make it a lot more visually appealing and it's not only for histograms but also for box plots what we did previously with them we're now able to make into I feel much more beautiful and color contrasting to really draw my eyes to where I need to look in these types of visualizations so Seaborn is going to save our butts now Seaborn has a
lot more visualizations than I demonstrated there you can actually go to their homepage and check underneath gallery and inspect any different one that you may be like oh this one on black holes looks pretty good and it will provide all the different code needed to build this and even as demonstrated here they're still using Matplotlib along with Seaborn and there's a lot of familiar code that you've seen before like generating subplots with fig and ax so everything you've learned so far in Matplotlib is still going to apply when you use Seaborn now before
we get into writing away with Seaborn I think it's important to understand the actual syntax we're going to be using to plot this so if we navigate back to the Seaborn page we can go into API this has all the different code that we need for this you can just scroll down the left hand side until you find the plot of choice so let's jump into my notebook I've gone through and imported the libraries loaded the data set and started the data cleanup now we need to install and then import Seaborn
so using control tilde I can pull up the terminal and we're going to run conda install seaborn in my case I've already installed it because I was working on building this course but you'll go through the loading process and you'll need to accept then install and proceed forward okay for me it says all requested packages already installed so I'm good to go in here I'm going to do import seaborn and it's very common to use the alias sns all right running this we now have Seaborn imported into here for the first two examples
we're only going to be using data analyst data in the United States also I'm going to be breezing through the code because all these different graphs we've generated before so we're not going to spend a lot of time going back and explaining them anyway run this control enter then the first example we're going to move into is the bar charts that we created in the last exercise and I've gone ahead and copied and pasted all the different code into here and we have the different charts below it so what we're going to
do in order to show that this is in fact pretty much the same I'm going to take all of this code right here and I'm going to paste it below here run it make sure it's still running and generating the visualization we're going to now alter this with what we can now use in Seaborn specifically all the stuff that we modified the axes for all this stuff can stay it's actually this portion right here whenever we run that plot method on both of those data frames so I'm going to go ahead and comment that out
on both of these and create some room now this bar plot code is very similar to what we did back with the box plot we need to provide that data and then also the x and y in our case we want that horizontal bar chart so the x values are going to be that median salary and the y values are going to be these skills themselves so basically the index now this example that we're about to do is exactly why I love Seaborn more than Matplotlib it's because we can also specify this
hue and hue allows us to specify a column in order to color by so we can have a gradient in this case we're going to specify the hue of the salary and so whenever we have things like a higher salary we can make it to where it has a darker color and when it has a lower salary it has a lighter color and to determine what color to actually use we're going to define this underneath the palette parameter this is provided typically via a string and we'll kind of cover more about the options of the
different palettes in a bit so we need to actually generate these plots now using Seaborn so we'll specify sns dot in our case we want a bar plot that's the method that we're going to be running for this the first thing we need to provide is data as shown on the screen and that is just the data frame itself and I can't really see because it's covering up but I want this DF DA top pay okay so that is our data the next thing I want although it's not highlighting it is x and for this we
want that median column down here it's also y it does switch but I promise that's for a good reason so we not only need to provide the x but we also need to provide the y in that case we're going to be doing the data frame itself but we need to provide the index and then the last thing we have to provide is this ax to basically specify which one we're actually using now with this built we need to do the same thing basically on the bottom so I'm going to copy and paste it down
here and then I'm just going to alter it based on this data frame right here so first replacing the data what is going in there this DF DA skills we're still using median with that data frame itself we want to reference the index for the y value and then the ax is actually one like I said everything else is going to stay the same running control enter bam we have exactly the same looking thing as above here although it does look like the bars are a little bit bigger but I promise you this is where it's
good now we need to get into coloring it well I like this example in the Seaborn library to help understand or show what I want to do we want to go basically with sequential as it gets from high to low we want to color it in a gradient type format diverging is very much similar but it's going to go from one color to another versus from a light color to a dark one what we don't want to do in this case is qualitative that is we don't want to have a separate individual color for each skill
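To illustrate the bar plot call together with a sequential coloring, here is a minimal sketch. The data is made up, the palette string "dark:b_r" is one of Seaborn's sequential palette names (a dark-to-blue ramp, reversed), and the sketch keeps the skills in an ordinary column named `job_skills` (an assumed name) instead of passing the data frame index as the lesson does.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up top-pay frame; the lesson keeps the skills in the index instead.
df_DA_top_pay = pd.DataFrame({
    "job_skills": ["skill_a", "skill_b", "skill_c"],
    "median": [200000.0, 180000.0, 170000.0],
})

fig, ax = plt.subplots()

# hue colors each bar by its median salary; the palette is sequential,
# so bars with similar salaries get similar shades.
sns.barplot(data=df_DA_top_pay, x="median", y="job_skills",
            hue="median", palette="dark:b_r", ax=ax)

# The hue legend is redundant here, so remove it if Seaborn drew one.
legend = ax.get_legend()
if legend is not None:
    legend.remove()
```

Swapping "dark:b_r" for a qualitative palette would give every skill an unrelated color, which is the outcome the lesson is steering away from.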
we want the color to be associated with the salary itself so back in our code we're going to provide the hue value and that says hey what column do we need to look at in order to color off of and we'll also add that down here well now that we've specified what we're coloring off of we need a color palette to choose from and Seaborn has a host of different ones to choose from I'm a big fan of the color blue so we're going to use these
two right here and basically combine them for the top one I want a dark to blue and then for the bottom one we use a blue to a white so inside of here I'm going to specify palette and this is a string in this case we're going to provide dark:b similarly down here I'm going to do the same but I'm going to call this one light:b and the b just means blue now I'm going to go ahead and plot this and we're going to have issues with this already this isn't going in
the order of coloring that I want I wanted dark up at the top basically to draw your eyes immediately to it and I wanted them to be basically associated so in this case the blue is good anyway I want to reverse the coloring it's actually pretty easy with Seaborn to do this for every one of these different color palettes you have access to you just add in an _r to reverse it and now it's going from blue all the way to black now I'm noticing now that the legend is getting in the way so
I'm going to go ahead and actually remove that I'm going to place that in here removing both of these running control enter we have bam this which is actually now more in line with the coloring scheme we need and it really draws your attention up to the top and then you slowly from there go down to the bottom based on the coloring scheme now we can also alter the themes of the graphs themselves so this is like the default theme but we can change up the backgrounds and even how light or maybe dark the different
colors are and how the axes are basically by just specifying set_theme and then from there specifying a style personally I'm a fan of this one right here of this ticks style so I'm going to go ahead and just copy this right here we're not going to use that custom params and I'm going to come up here at the top and paste it in like I said we're not going to use this custom params parameter we're not going to go over it anyway control enter now this thing is much more pleasing to the eyes and I think
compared to our last one it actually gets the job done on directing our attention next let's move into that histogram that we worked with previously we're going to be reusing a lot of this code here but anyway this was the visualization that we got with that now let's alter it using Seaborn here I navigate to histplot which is for a histogram however we're not going to be using this because whenever I scroll down here to show these different images that are generated by it I want a smooth curve for this and right now this
has very jagged edges and you can't do this with the histplot function but you can do this with displot or distribution plot this allows much smoother curves in the final visualization and is more aligned with what we want now this one's going to be pretty simple on what we need to provide to it we just need to provide the data and then how we want to actually plot anyway the data itself if we scroll down to the parameters you can provide something like a data frame an array mapping or sequence in this case
we're going to provide the column of salary year average so all the data is going to be included when we provide this so we don't need to worry about providing an x or a y the other thing that we need to specify is kind there's three different types you can use in this we're going to be using that kde so I'm going to copy all this code with the exception of that top plot method I'm going to come down here and paste it now Seaborn has a histplot method but we're actually going
to use the displot method I personally like the style of the displot better for the values we need to provide for it we can provide the data source but since it's just a series that we're going to be plotting here I'm going to just provide the column of DF DA of salary year average that's all we really need to get started and we'll go ahead and plot it now this already compared to this is not too bad but I wanted a more smooth outline of the distribution here so what
we're going to specify inside of here is the kind we're going to use the string of kde which is kernel density estimation let's look at this one now all right so now this is basically an outline of it I like this a lot more I can specify some other parameters like fill set it to true so basically fill it in so it's not all that white stuff so overall not bad and even like last time if I wanted to set the theme in this case using ticks again nothing is going to change because I've already
implemented it inside the environment for this so if I wanted to change it I could but it's already set in the environment anyway we now have our final visualization at least for the histogram and I feel it looks a lot better than the last one we could definitely put this bad boy in the Wall Street Journal all right one more example to go now with plotting box plots in Seaborn this is where we start to get slightly different from the syntax that we used in Matplotlib previously we had specified the data as only a series
of values for the salary year average but in this case we need to provide x and also y values for this so instead for our data we're just going to provide the name of the data frame and then for x and y we'll just provide the column names and it'll know to use that data frame in order to access those columns pretty cool as we can minimize the amount of code we have to write for this this was our previous box plot example that we ran in the box plots lesson and we can spice it
up once again with Seaborn so I'm going to take all this code with the exception of that box plot right there that we ran and just paste it on below for this one I'll just call that boxplot method on the Seaborn library specifying the data is equal to the data frame of the US we have x and y values in this so we need to specify all of it for the x we're basing it off the salary so salary year average then for y we're going to
be using the job title short column okay running this pressing control enter looks like I have a typo in here job title short control enter again bam and now we have this one look how fast this one was I don't like this over here on the y-axis label so we'll go ahead and remove that with plt.ylabel and we'll specify an empty string running control enter bam but look at this look how quickly we were able to make this visualization out of what we already did in Matplotlib and personally I hope you
feel the same that this one here the Seaborn one is a lot more visually appealing so I obviously commonly use ChatGPT to help build visualizations and frequently it provides back code using the Seaborn library so it pays to be able to know and understand what's going on there I don't feel it's that much harder to learn now that you've understood and mastered Matplotlib so I do highly recommend you learn it as well all right we have some practice problems for those that purchased the course practice problems and with that I'll see you in the next
one we're going to dive into actually building our final project all right congratulations so far see you there all right it's time to get into the final chapter you know what that means got to get on a new flannel now this is my business flannel and basically we're going to be getting into business with this third chapter and really it's a two-fold approach that we're going through here the first thing is we're going to be building on all the skills that you've been learning over the first two chapters in that basics and advanced chapter and
we're going to be doing this by building several beautiful visualizations that we're going to get into in a little bit and the main point will be to put together all of the different work and insights you've gained over this course into a singular location so that way you can then share it with the world and showcase the experience that you've gained in learning python so what is the ultimate goal of this project well we're going to be investigating top-paying roles and skills in the data science industry and you'll be able to select which roles and
location you want to use in order to fine-tune your analysis to get insights specific to you to accomplish this you'll be using python like we've been using on that same data set and for those that are job seekers taking this course you can use these insights in your job search in order to maximize and optimize where you get your next job so what will be the final deliverables for this project well opening up the explorer we have a folder created here called 3_Project and we're going to be investigating four different questions and then the
first one the first notebook was basically this one right here where we do some EDA now with these notebooks we're also going to be having an accompanying README markdown file and this markdown file is going to detail all the different analysis that we've done along with providing insight for any non-python users to go in and read what you've done this makes it perfect to list on your resume so that way recruiters can look and identify that you do have relevant experience in using python and in the next video we're going to be showing how
we can link our project to GitHub so we can not only show all the different code we've done but also show all those visualizations so for this project I'm going to continue to explore data analyst roles in the United States but you're not limited to that at all as you're probably more than familiar with that job title short column in our data frame you have a host of options that you can choose from and pick what you want to specialize in also you don't have to necessarily limit it to just one role similarly you can pick the
country that you're located in there's a 99.9% chance that I have that country available for you to use although there is a lot of data from the United States there's still a lot for other countries and this is going to allow you to explore job postings from a variety of companies we can see some of the most popular include Citi Walmart Accenture and Deloitte so a lot of top names so what questions are we going to be exploring in these follow-on notebooks that we're about to build well let's walk through them each real quick
first we'll investigate what are the most demanded skills for the top three most popular data roles and this one we're going to build on further from the advanced chapter on matplotlib specifically this visualization may be built on counts of skills requested in US job postings but we're going to step up this visualization a notch and instead of using a count we're going to use a percentage or likelihood of a skill being requested in a job posting I feel percentages are much more applicable and useful to understand if I'm a data scientist I can see that
python is in 72% of job postings which is an overwhelming majority nearly three-fourths of job postings require this skill so super valuable insights from this one alone the next thing to explore will be how are in-demand skills trending for data analysts and you may recognize this plot from our exercise on evaluating trending skills back in the advanced chapter but once again we're going to build on it further specifically we're going to clean up this plot with Seaborn and we're going to change this to also show a likelihood of a job posting containing a skill so
that way it has more applicable numbers for us to evaluate from there we're going to look at how well do jobs and skills pay for data analysts and for this we need to evaluate other roles besides data analyst so we're going to build on our box plot lesson by building this out to include even more job titles and reporting it appropriately so we can get some insights of how data analysts compare to other roles additionally our work at the end of the advanced chapter is going to go to good use because now we have a
use case for that visualization of understanding what are the highest paying skills and what are the most in-demand skills and so this is also going to go in this section finally we're getting to our last question to understand what is the most optimal skill to learn for a data analyst and for this we're going to be modifying the graph from our scatter plot lesson and updating the x-axis so that we include the percentage or the likelihood of a skill appearing in a job posting additionally as a bonus with what we learned about Seaborn we're
going to also color code this in order to find deeper insights with core technologies and how they are trending whether it's for a programming language database or cloud technology a little sneak peek it pays to know programming languages all right now it's your turn to get to work you need to create a folder inside of here that's going to house all your different project work I conveniently named mine 3_Project to basically have it in order with the chapters that we've done so far we're going to be adding each one of these notebooks and this
images folder and this README as we go along so don't worry about that right now you only need to create one Jupyter notebook right now and title it something like EDA intro as we need to go through and perform exploratory data analysis so in this notebook I'm going ahead and importing all the libraries loading the data and doing the data cleanup just like we've done before with that done I'm going to now filter down for US data analyst roles and I'm going to specify the variable of df_DA_US and then filter that main
data frame that we imported in for the United States and for data analysts now for this I want to pick out a few columns from this data frame in order to perform some initial EDA and also if there's anything we've done in the advanced or basics chapter feel free to bring it into here as EDA for this I'm going to explore some things like job location to see where all the cities and states are an analysis of the boolean columns of work from home degree requirement and health insurance and then
finally an analysis of what the main companies are in this data set so let's get into that job location first so for this I'm going to use the value counts method and look at the top 10 of these running control enter I see quite a bit of spread across the United States with anywhere being the top one which also correlates to all those remote jobs so let's plot this and I want to use Seaborn for this because I like the customization of it and because of that it makes it easier if it's in a data
frame so I'm going to use the method to_frame running control enter we can see this turns now into a data frame with columns of job location and count so I'll assign this data frame to df_plot and we're going to be using that df_plot throughout all these different plots that we're going to be making so since we're using Seaborn for this I'm going to call sns.barplot first thing I need to specify is data so that's the data frame itself of df_plot next thing we need to specify is our x value
and since we're doing the horizontal bar chart it's the values that we need to show our count in this case and then for y we're going to be using that job location this is good enough for now we'll go ahead and plot this and we get this visualization now I want to spice this up a bit with some coloring so I specify for the hue I want to use count to color it by for the palette we're going to use my favorite one of dark:b and we're going to put that in a reverse order
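
A minimal sketch of the bar plot described so far, not the course's exact code: the handful of `job_location` rows are made up, the names `df_DA_US` and `df_plot` follow the narration, and `dark:b_r` is assumed to be the reversed form of the dark:b palette. `rename_axis` plus `reset_index(name=...)` is used instead of `to_frame` only so the column names come out the same across pandas versions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the filtered US data analyst postings
df_DA_US = pd.DataFrame({
    "job_location": [
        "Anywhere", "Anywhere", "Anywhere",
        "New York, NY", "New York, NY",
        "Atlanta, GA",
    ]
})

# Count postings per location, keep the top 10, and turn the result
# into a DataFrame with job_location and count columns
df_plot = (
    df_DA_US["job_location"]
    .value_counts()
    .head(10)
    .rename_axis("job_location")
    .reset_index(name="count")
)

# Horizontal bars: counts on x, locations on y, colored by the count
sns.barplot(
    data=df_plot,
    x="count",
    y="job_location",
    hue="count",
    palette="dark:b_r",  # assumed spelling of dark:b in reverse order
)
plt.show()
```

With the real data set the only change should be where `df_DA_US` comes from; the plotting call stays the same.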
running control enter we have this now and this is obviously a different color than we're used to or you've probably seen previously and that's because we were using a theme so I can come up here and specify for Seaborn I'm going to set theme and specify the style equal to ticks running control enter all right that's a lot better coloring that I like all right I don't really like this legend here so we're going to remove that as well and we'll specify this as false now I need to clean up all these different columns and
titles so I've changed the titles the X labels and Y labels and then also added this despine to basically remove this border around the graph running control enter bam we got this and I realized I had an improper title so I'm going to rename this as counts of job locations for data analysts in the United States now the next thing we said we're going to explore is these boolean values of job work from home no degree mention and health insurance but we already did this in our pie chart example so I'm just going to take the
applicable code that I've already done from that Jupyter notebook and I'm going to basically paste it right in here and the only thing that I should have to change is the data frame I believe right there to specify we're using the data frame for data analysts in the US let's run this bad boy sweet that was pretty quick got to love python for being able to just redo work very easily with previous code that you've done and a quick update so now we have for data analysts in the United States it looks like 7.5% of
jobs offer the ability to work from home 28% have a requirement to have a degree which I think that's pretty low that's actually pretty cool on the other side only 35% of data analyst jobs offer health insurance all right moving into the last thing that we said we're going to explore and that was the company name so what are the different counts of the company name now I'm going to just copy and paste this job locations code that we used previously because it's basically going to be a bar chart and it's going to display very
similar data instead of using job location however we're going to be using company name so I need to replace job location here and here I can press command shift L it's going to highlight all of it so then I can go in and replace it with company name and then I also need to update that title to counts of companies for data analysts in the United States and we get this one where we see there's a lot of top companies as well from this so Dice United Health Citi and Hamilton so a lot of good companies inside
of here that we have access to for data analysts in the United States all right so that wraps up the exploratory data analysis I'm going to do for this video and with this alone we've uncovered a lot of good insights on job location on what are the different requirements or offerings from a job and additionally what companies are available for this now in no way are you limited to only performing the analysis of those three visualizations feel free to take this further or feel free to even dig back into what we've done previously in
the basics and advanced chapter and bring it into this notebook to show the EDA for this all right with that I'll see you in the next one we're going to be getting into setting up a GitHub repository for our project see you there have you ever wondered how teams of people work together in order to build things like Google Facebook or even ChatGPT well version control systems are the core technology behind coordinating all those efforts in order to put code together and then build an app and these same principles can be applied to what
we're going to be doing in this lesson in setting up our git and also GitHub in order to set up a Version Control system to track our changes and then share our code to the world so what the heck is git and what the heck is GitHub well it all starts with the core technology of git which is a version control system and these projects or folders that you have git inside of can then be hosted on things like GitHub that's an online solution that allows you to collaborate with others on your code now here
is my project folder for my SQL for data analytics course inside of here I have different folders like this section here is on my advanced SQL queries and it has all the different SQL files for it this project itself has the version control system of git implemented when I press command shift period here I can do that on Mac the following hidden files are shown and hidden files are denoted by a period at the beginning of the name it basically just hides it from here I can toggle it on and off if I want to
anyway I'm referring to this folder right here of .git inside of this bad boy is all the contents needed or files needed in order to stay up to date on what changes I'm doing within my project now those files are hidden for a reason as you shouldn't touch them at all so I'm going to go ahead and make them disappear again but the main thing to understand is any changes I do are tracked in that .git folder now here we are inside of GitHub for the same SQL project which is accessible at this public URL so
anybody can come in and access and view all the contents of this course if they wanted to steal a SQL query they could go into my advanced SQL file and then navigate in and they have access to all my code so combining both of these tools is super powerful because git like track changes in Microsoft Word allows me to go back to previous revisions if I mess up and recover my saved work alternatively GitHub allows me to store all this stuff in a remote repository so that way if I need to go or share
this with others they can go to this now I've been using this term repository but basically it's that folder that contains all the different files and folders or contents of your project and git is managing all those changes while GitHub is hosting them so with this you'll hear common terms like local repository and remote repository this is a little bit of an oversimplification but local repositories are what's located on your computer locally and you can use version control systems like git in order to maintain them now alternatively remote repositories store your local repositories and there's a
lot of popular options for this not just GitHub for a remote repository you can use things like Bitbucket or even GitLab but GitHub seems to be the most popular so that's why we're going with it and these types of online services are great for collaborating Kelly and I worked to build this course using git and also GitHub to actually track and manage all these different changes for it so it's super valuable so what are we going to do in this well four simple steps first thing we need to do is get git installed onto
your machine the second thing we need to do is set up a GitHub account and prep everything so we can get our repository onto GitHub next we're going to go back to our project folder that you have on your computer and initialize the repository basically start git with tracking the changes on it and then finally probably the easiest of all we're going to push it onto GitHub so let's jump in the first thing of downloading git go to the URL that's shown on the screen right here and download the applicable one for your operating
system for Windows this is super simple to install first you'll need to allow the app to make changes to your device accept some lawyer talk accept the window that it has keep all the defaults selected for what it needs to install and where it needs to install for default editor we're going to use Visual Studio Code we'll leave the default for letting Git decide and also for adjusting the path we'll keep the default for OpenSSH and using OpenSSL and then we'll also keep checkout Windows-style use MinTTY fast-forward or merge
Git Credential Manager enable file system caching we won't enable any new experimental support and then it'll load oh my goodness that was a lot then it'll be finished for Mac users you're going to run brew install git inside of your terminal now for some of you this may not work because you may not have brew installed so running this zsh command not found brew we need to install brew so back in here I can click on how to install Homebrew via this link and it provides a terminal command on how to actually install this system
of Homebrew which manages different packages on your Mac system it's super useful so I highly recommend you install it all right so pasting that in pressing enter it may prompt you for your system password I'll put that in and it says it's going to install these new folders for this perfectly fine with it okay brew is installed so it does say run these two commands in your terminal to add Homebrew to your path so I'm going to go ahead and copy this the first one at least press enter okay so the first one looked
like it ran and then I'll copy the second one right here paste that one in press enter okay we should be good to go and then now we can run brew install git and unlike that pesky install on Windows we're now done now for both Mac and Windows users you can check it's installed by just typing git inside of your terminal and it should display a list of commands that you can run associated with git now one thing that we need to set up now that you may be prompted for later if you don't
do is setting up your username and email so git knows who you are so the first thing I'm going to provide is my name with git config --global and then from there user.name and provide your name do this in quotes it's a string next thing is providing the email so we'll do git config --global and then user.email and specify our email and then bam we're all set the next thing we need to do is navigate to github.com sign up and now if you have an account already all
you're going to do is go ahead and sign in otherwise go through this process of filling in your email and providing your credentials in order to set up an account they'll send an email to your email and then from there you'll verify and you'll be set up with your GitHub account once you've done the verification you should be directed back to your account and from here I would go through and set up your profile information so it's all set up I include things like a bio and all my different social media links now this
has a bunch of pinned repositories on it right now which we'll get to eventually for you but just understand this is how if somebody wants to come to my profile and access a course they can get it from here so we've installed git and set up GitHub now it's time to initialize our local repository now there's a few ways we can create local repositories we can do it inside VS Code and make it super simple via this method there's also alternate options you can do that I think you should know about like GitHub Desktop which is
its own app you can install on your computer to manage repositories additionally you could go directly to github.com and do it there as well but then you'd have to actually pull it from the remote repository and the last one is using the command line basically using git commands in order to add pull and push different changes with git I do recommend you learn this option at some point but we're going to stick to the easy option of VS Code first so here we are in my VS Code inside of my project for this python course
now I may not have the same folder structure as you but I've been using one for basics two for advanced and then three to keep my project stuff for the project itself I only have one file inside of here of that intro file that we made in the last lesson anyway over here on the left hand side of your screen we're going to click source control and you're going to have two options for this now if you have no intention of publishing to GitHub or creating a remote repository the first option is for you
to initialize repository otherwise we're going to go with that second option of publish to GitHub as not only is it going to initialize this local repository but also it's going to publish it to GitHub to create our remote repository now I'm going to go ahead and click that then it's going to say publish to a private repository or a public repository I want the world to see it so we're going to make it public it then asks me which files should be included in the repository I want all of these folders I don't want this
.DS_Store remember files starting with a period are hidden files this one in this case is not necessary so I unselect it I'm going to then click okay now for me it's publishing it to GitHub and it says successfully published the name of the project repository to GitHub and then I can open it on GitHub you may during this be prompted to sign into GitHub via VS Code so you need to go ahead and do that in order for VS Code to then be linked to your GitHub account I was already set up with my VS Code for that
anyway once you've done that open it on GitHub so now our local repository has been pushed up to GitHub and we have a remote repository available anybody can come in and download this I can even go into that number three project folder open up our EDA and it's going to include all the different code that we did along with the visualizations that we built all right so we've now done those final steps of initializing a local repository and then pushing it up to GitHub told you the last step was going to be the simplest so we're
going to have a little fun now we're actually going to practice performing pushes and also pulls from GitHub and we're going to be doing this all through VS Code so starting with the push changes first this is for sending our changes that we do locally on a computer up into GitHub and this will allow for these changes to be visible there so back inside VS Code you're going to notice now that you have this .gitignore file this thing just ignores that file that we previously said to ignore don't worry about it for right now anyway
let's make a change to this repository so that we have something to push so inside of our project I'm going to create a new Jupyter notebook for our next lesson that we're going to be going over I'm going to title it two and we're going to be going into skills counting with that and I'm going to make that an .ipynb file now after I did this you may have noticed over on the left hand side that a one appeared next to our version control and basically source control has already picked up that we have a change
here and right now it has a U associated with it basically it means it's untracked so we need to add this change to our staging area now anytime you add something you need to include a message on what you actually did here and you want to keep it super short and I'm just going to say add project two file from there I'm going to click commit so this was a staged change and right now it has been committed to our local repository but it's not up on our remote repository hence why we have
this now that says sync changes and it says push one commit to origin/main and that's the branch that we're in on our remote repository so I'm going to go ahead and click sync changes now inside of GitHub I can navigate into that three project folder and in fact see that we've added that python notebook to this we're going to ignore that it's invalid for right now we haven't added anything to it we will next lesson so that's how we do push changes with VS Code now we need to move into pulling changes and for this
we need to have a change on our remote repository and then actually pull that down into our local repository so inside the remote repository in GitHub if you scroll down it has this here of add a README a README file we're going to be building out more in the project but basically it outlines what you did for it so I'm going to click add a README and it's going to change or alter the project and I'm going to put a to-do of need to fill this in okay and then up in the
right-hand corner I'm going to click commit changes and it's going to have a commit message created already of create README and from there I'm going to commit directly to the main branch and commit changes so now this README is part of our project it even has that to-do of need to fill this in with it so inside of VS Code in our project locally that README is not available right now so what I need to do is actually pull it down so I come down here into source control I can click these
three dots right here and click pull it does a little loading in the background and now when I navigate over to the files we have this README here along with the to-do of need to fill this in so that is the basics of how you can go about pushing and pulling changes from GitHub this is super powerful when implemented properly and allows you to collaborate with others and build some pretty cool solutions all right so if you haven't done it already it's your turn now to go through and set up git and GitHub in the
next lesson we're going to be moving on to building out our project and working on that first problem for it with that see you in the next one so let's get into tackling our first project problem and for this we're going to be looking at what are the most in-demand skills for not only data analysts but those other two top roles of data scientists and data engineers now previously we built this visualization in the format charts lesson from the advanced chapter this was great at showing a high-level look of not only what the top
skills are we should focus on for data analysts but also how they compare to those other two roles but there's two major issues that we're going to fix with our new visualization first instead of having counts of job postings which is sort of frivolous what is the percentage or likelihood of a skill to appear in a job posting that's what we're actually going to calculate additionally we're going to add some highlighting along with some formatting to cue the viewers of these charts in to the highest percentage skills now jumping into my Jupyter notebook I've defined
our question that we're trying to answer along with the methodology now we're already done with steps one and two of cleaning up the skills and calculating the skill count which we plotted previously we just need to bring that code back in here and then move on to steps three and four which is calculating the skill percentage and then plotting those final findings for all of this I'm going to be going back to that previous notebook that I have on format charts copying all the different code that we need and then just pasting it right into
here so this obviously includes all the importing libraries loading the data and then doing that data clean up next we need to keep all the different job titles in this data frame but I do want to filter it specific to the United States so I add this condition and set up this new data frame of dfus then we use the explode method to break out those list of the job skills and get them into a new data frame of DF skills we can then visualize to see what is actually here now and we can see
the job skills are broken out now that DF skills is broken out we can do a groupby on the job skills and also job title short column displaying this below this we can see we have job title short along with the skill and then the counts of each one of these with their association this right now is technically a series when I run type on it so I want to transfer this into a data frame to make it a little bit easier to manipulate so I run reset index on it and we rename that
values column to skill count displaying it below this has the same results the only thing left to do is now sort these skill counts to be from highest to lowest so on this data frame of skill count I sort the values by skill count in descending order and I do this with inplace equal to true but you could just set it equal to the variable itself but now we have the higher skill counts near the top now we need to get the top three roles and we can just define a list which is what we did
previously personally I like to programmatically get anything like that instead of hardcoding it in so here I'm getting the unique values in the job title short column and it's making it into an array so I'll make it into a list using the tolist method and now I only want the top three results so I'll use a slice to do this specifying I want colon 3 then we have data scientist analyst and engineer I'm picky I want it in alphabetical order so I'm going to use the sorted function on this and now we have data analysts engineers
and scientists next up is plotting it remember we want three separate plots for this so for the subplots I'll insert the length of job titles as the number of rows and then specify we only want one column next now that we have this fig and ax we need to iterate through each one of these and then plot the associated graph so we'll use a for loop pulling out the index and then also the job title name out of our job titles list using the enumerate function then we'll define the data frame we want to plot which is our data frame skills count which
is this data frame up here but we need to filter by that job title short column and only get the top five values so we'll specify this filter and the head equal to five and then from there we'll plot it using a bar chart specifying the X and Y values along with what axis it needs to go on additionally we'll put a title above each of these subplots with the job title itself let's go ahead and run this and this is what we got in the matplotlib section but let's clean this up a little bit
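
A rough sketch of the steps narrated above (explode, groupby, reset index, sort, pick the roles, then one subplot per role), not the notebook's exact code: the five sample postings are made up, and the variable names `df_US`, `df_skills`, `df_skills_count` and `job_titles` are assumed from how they are spoken in the video.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the US postings; job_skills already holds Python lists
df_US = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Engineer",
                        "Data Scientist", "Data Scientist"],
    "job_skills": [["sql", "excel"], ["sql"], ["sql", "python"],
                   ["python", "sql"], ["python"]],
})

# One row per (posting, skill) pair
df_skills = df_US.explode("job_skills")

# Count rows per skill and role, convert the Series to a DataFrame,
# and sort so the highest counts come first
df_skills_count = (
    df_skills.groupby(["job_skills", "job_title_short"])
    .size()
    .reset_index(name="skill_count")
    .sort_values(by="skill_count", ascending=False)
)

# First three roles found in the sorted counts, then alphabetized
job_titles = sorted(df_skills_count["job_title_short"].unique().tolist()[:3])

# One row of subplots per role, a single column
fig, ax = plt.subplots(len(job_titles), 1)

for i, job_title in enumerate(job_titles):
    # Top five skills for this role
    df_plot = df_skills_count[
        df_skills_count["job_title_short"] == job_title
    ].head(5)
    df_plot.plot(kind="barh", x="job_skills", y="skill_count",
                 ax=ax[i], title=job_title)

plt.show()
```

Assigning the result of `sort_values` back to the variable, as done here, is the alternative to `inplace=True` that the narration mentions.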
I'm going to invert the axis remove the Y labels and remove that legend as well and then also put up a super title if you will above all of it and I forgot to implement a tight layout so we don't have overlap okay so now we're all caught up to where we were with the matplotlib section and we can actually see we have the counts of all the different skills so now we need to convert these counts to a percentage and if we scroll back up to our DF skills count data
frame we can see we have the skill counts for each of the different roles and the skill what we need to do is perform an aggregation to get actually how many job postings we have for data scientists data analyst and also data engineers and then with this total number of job postings we take that skill count divide it by that total and get what is the percentage of a skill in a job posting so with our us data frame it's very important that we use the correct data frame for this we need to do a
value count of the job title short column and so I'll add in that value counts method right now this is a series so I'm going to reset the index in order to make it into a data frame and I want to name that values column jobs total so I'll set name equal to jobs total running control enter we now have this in data frame and this has the counts for all the different jobs so we need to now merge it with our DF skills count on that job title short column so we use the pandas
merge function for this we need to specify the left data frame in this case we'll use DF skills count then we need to define the right data frame of DF job title count now I just realized we got the yellow syntax warning because we actually didn't save it into this variable for this data frame up here that we made so I'll go ahead and run this control enter okay continuing on for the merge how do we want to do this what we're going to do is a left join basically everything in the left table of the DF
skills count which is below we want to maintain and we just want to merge what's available onto this data frame finally we need to specify the column that we're merging on which is job title short okay running control enter and now we have that jobs total column appended right onto this data frame so we'll name this one DF skills percent now I'll show that and now with this because in the same data frame we have the count which should be less than the jobs total we're going to do some math
to calculate what is that skill percentage so for this data frame I'll specify a new column called skill percent and set it equal to the skill count divided by jobs total run this control enter we now have skill percent and the math looks all right now this is in decimal form because it's a fraction but I want this to be in a full value so whenever we go to plot it we can see things like 20 or 30% and not 0.2 or 0.3 so I'll multiply this all times 100 and that will fix that so
now let's plot this I'm going to come up here to this plot up here and just copy everything we have and then paste it down underneath to have a starting point remember we defined this DF skills percent so for our data frame to plot I'm going to change this to DF skills percent and I want the filter to use it as well and then the only thing left to change is we don't want to plot the count we want to plot the percentage so this should now work pressing control enter and Bam we have it
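The count-to-percentage steps described above can be sketched roughly as follows. The data is invented and the names (`df_us`, `df_job_title_count`, `df_skills_perc`, `jobs_total`, `skill_percent`) are assumptions modelled on the spoken names, not necessarily the exact ones in the course notebook.

```python
import pandas as pd

# Hypothetical postings table: one row per job posting.
df_us = pd.DataFrame({
    "job_title_short": ["Data Analyst"] * 4 + ["Data Engineer"] * 2,
})
# Hypothetical skill counts per role.
df_skills_count = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Engineer"],
    "job_skills": ["sql", "excel", "sql"],
    "skill_count": [2, 1, 1],
})

# Total postings per role. value_counts() returns a Series, so reset_index
# turns it into a DataFrame; rename_axis keeps the key column name the same
# across pandas versions.
df_job_title_count = (
    df_us["job_title_short"]
    .value_counts()
    .rename_axis("job_title_short")
    .reset_index(name="jobs_total")
)

# Left join: keep every row of the skill-count table and attach its role total.
df_skills_perc = pd.merge(
    df_skills_count, df_job_title_count, how="left", on="job_title_short"
)

# Share of postings that mention the skill, scaled to 0-100 for plotting.
df_skills_perc["skill_percent"] = (
    100 * df_skills_perc["skill_count"] / df_skills_perc["jobs_total"]
)
```

Computing the totals from the postings table rather than from the exploded skills table matters, because a posting with several skills would otherwise be counted several times.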
now for these percentages you should be seeing very similar ones with things like data engineers being around 60 to 70% SQL for data analysts around 50% and then data scientists around 60 to 70% if you don't have these values you need to go back to that original data frame where we calculated the totals and double check your work now frankly this chart is good enough to go for you but I'm a perfectionist so we're going to clean it up a little bit this is all pretty much optional if you want to do this but I
guess what isn't optional is actually changing the title to reflect what this actually is and that's the likelihood of skills requested in job postings okay but let's get into cleaning this up I want to upgrade this chart by using Seaborn instead and I'm going to call out the bar plot for this I'm noticing I didn't import Seaborn with this so actually I need to go back up here and specify import Seaborn anyway now that that's loaded I've gone through and specified the arguments of data of DF plot above X is the skill percentage Y is
the job skills what we're using as the axis for the hue itself we want to use skill count to color this because that's the whole point of using the Seaborn library so we can actually color it and then this palette of dark blue putting it in reversed order let's go ahead and check what this updated visualization looks like okay not too bad I'm not a fan of this theme though especially this color of the blue that it's using so I'm going to set the theme to the style of ticks and this is much more
pleasing to the eyes I'm going to go ahead and remove our previous Matplotlib code since we have that copied over now the next thing I did was go through and set the title itself invert the y-axis remove both the X and Y labels along with the legend and then I set an X limit of 0 to 78 basically so that way all of the charts were on the same axis so let's go ahead and look at this so now we've cleaned up the axes and it's all on a comparable percentage scale so we
can look from one to the other and it's a lot more usable also I made a mistake here so Seaborn actually doesn't do that thing where Matplotlib plots backwards so I can remove this invert axis now the next thing I want to do is similar to what we did with our scatter plots of using that plt.text function to label all our different data points I want to do the same thing with these bars but provide it with what is the percentage so what we'll need to do is Loop through our DF plot
specifically going through on that skill percentage column so we can get those values from here and then plot it and this Loop is going to be inside of our Loop already that is plotting a graph so in the case of data analyst once we get to the end of this of setting the X limit we'll then go through and it will label one by one each of these points so I'll need to indent this further and then start from here for this we're going to be also cycling through the index and then value in the
DF plot skill percent column and we want the index so we have to use enumerate now previously we were using plt.text for this but instead of plt we're going to be using the actual axis itself and if you recall back to this function we need to specify the x value to plot over and then the Y value so we need to First specify the x of what percent and then the Y of how many index levels down to plot it so X as far as the value would be V and then the Y would be
that n or that index position from it and then from there we want to actually plot the value itself on this so I'll put V in there we're going to go ahead and run this and see what we get all right so there's a couple problems with this first it's too close to the bar so I'm going to fix that by adjusting the X position to V + 1 okay so now it's further away from the bar the next thing is it's not centered it's like slightly up some so we can
specify the vertical position by calling out va equal to in this case center okay so now the numbers are much more in line with the bar but we actually need to clean up these values now because the decimal places are way too long so what we can do for this is just specify it as an F string so I'm just going to start simple first and put the variable inside the F string and then Define it with curly brackets around it running control enter nothing's going to change I want a percent symbol at the
end so we'll add that we got a percent at the end and then the last thing we need to do is just convert this to an int cuz we don't want any decimal places with it or I don't at least now one thing about converting to an int we're slightly not being truthful in our percentage because int truncates rather than rounds so data scientist SQL is 50% but if we come up here to data scientist and look at SQL it's at 50.9% so technically we probably need to actually change that and we can do that with the string formatting method that we learned previously specifying
that we want zero decimal places and that we're specifying it as a floating point all right and now we have the updated percentages SQL is now at 51% the very last thing I'm going to do is I feel like this x-axis on both data analysts and data engineers is repetitive especially since it's down at the bottom and we also have all these data labels on it just way too many numbers on this and so we can remove this by specifying the axis itself and then saying set_xticks and
we set it to an empty list running control enter basically removes everything but I actually do want it to plot for data scientist or better said I want to keep the x-axis ticks on the last plot so instead I can define an if statement that if I is not equal to two which in our case we're going to programmatically say by the length of job titles minus one in that case we want to set the xticks to nothing running control enter we have this bad boy now our final visualization I told y'all
I was picky but I think it actually conveys what we want to show here which captures what are the top skills of these three top roles and as we can see from this Python and also SQL are very common across all three of these roles so if you haven't taken my SQL course yet you probably need to take that next so what do we do now that we got to these final results well we need to capture them somewhere specifically in a readme well inside of a repository we've created three folders of the basics advanced and
project sections along with that readme if you remember we created it up on GitHub and then pulled it down this readme is where we're going to be capturing all the different insights from our project and so we can start now so we don't have to do that later specifically I want to capture the insights from this visualization in this readme and what do I mean by this readme well here's a sneak peek at what we're going to build around this well in this it's going to start with something like an overview and then the
questions we're going to be answering in this project this first one here what are the skills most in demand for the top three most popular data roles is what we're doing now in the beginning portion I Define the tools I use data preparation and cleanup and then getting into the analysis itself that's where we need to be putting what we learned from these insights there from there I share any code that I find is relevant along with the image itself or the graph we finally made and then finally capturing any insights which is where we're
at right now and that's why we need to do that in our readme right now and then from there it repeats for all the other different ones as well so let's actually get to formatting this bad boy and building it out first thing I'm going to do is specify a section for the analysis we're going to be doing so I'm going to call that the analysis now remember we're in a markdown file so we're going to be using things like a hashtag and then space to format the document we can actually see what the
readme is going to look like formatted by clicking over here in the top right hand corner to open preview to the side and now we can see that we have the analysis I like to keep these side by side as we're going through it so we're going to start by looking at what are the most demanded skills for the top three most popular data roles from there I just give a brief overview of what we were actually trying to tackle if you notice from this I have some squiggly lines underneath here basically calling out
some grammar issues that I have I have the Grammarly extension installed as you can see by here it says Grammarly I'm in no way saying that you need to have Grammarly for this but it is good at going through and actually cleaning up your markdown files to make them sound like you're not a blabbering idiot the next thing I'm going to specify is where to view my notebook and we can provide a link to this specifically I'm going to call out the name of the notebook and then we need to provide inside a
parentheses cuz see now it's actually a link the link to this file if I come over to the three project folder I can right-click skill demand and select copy relative path then from there I can paste it inside of here now when I come over here and click this link of two skill demand notebook it takes me to my notebook directly pretty cool and this will work in GitHub as well next I like to show some relevant code that I did in order to generate my visualization you can really pick whatever code that you
find of most importance to you and that really showcases a certain skill you want to demonstrate and for me I like demonstrating that I'm pretty good at making charts so I'm just going to put in this short code snippet for that finally we're going to get into the results now I want to put the image that we generated here into the readme so what we can do is actually go back to this plot here and we want to save this and so what I'm going to do is save it inside of our three
project folder and I like to have organization so inside of that project folder I'm going to create a new folder called images and then I'm going to name the graph appropriately of something like skill demand all data roles you can name it whatever you want anyway saving this now if we scroll back over we have this images folder with this in here so I can copy this relative path as well now back inside of our read me we need to insert a link similar to what we did above with the python notebook in the brackets
here it's just going to be the alt text so you can just be descriptive of what the graph is so I just specify visualization of top skills for data nerds then inside parentheses I'm going to paste in that relative path of that image and then finally we need to make it into an image and not a link so we're going to put an exclamation point at the front of this and that makes it into an image so let's go ahead and save this by pressing command s now we only have one last section
to do of identifying what are the insights so what are we taking out of this actual visualization right here that we learned so I went through and talked about three insights from this that I felt are most relevant to me specifically talking about Python SQL and then also specialized technical skills like cloud skills so I'd recommend doing this first but if you're having trouble drawing insights from these visualizations I have a backup plan for you if you go back to our Jupyter notebook and then you scroll down to the image of interest you can instead
of just saving it you can also copy it and you can go to any chatbot assistant paste it in and you can prompt it something like provide me short bullet point insights from this visualization I'll be honest these insights are mediocre at best but it is a good starting point you can even play with it further so I can prompt it with something like provide a more holistic look at this for all three roles and the relevant skills and technologies this I feel is a lot better analysis as it specifically calls out how Python and SQL are
so core across all three roles and then finally it gives some key takeaways for all the different roles all right so that wraps it up on the format that we're going to be going through now for the remainder of the questions we're going to be using a jupyter notebook to then perform some analysis and then we're going to be going into our read me and filling it out with the applicable insights all right so it's now your turn to dive in and investigate remember you don't have to necessarily limit it to those data analyst data
engineer and data scientists you can spread it out or adapt it if you want something like machine learning engineer to roles that are more applicable to that anyway I'll see you in the next one where we're going over evaluating the trend of these skills over time so now that we found out what are the top skills of data analysts along with other top data roles we can now dive further into this and specifically we need to find out if there's any trend with these skills
like should we be worried that any of these technologies are going out of style or they're actually booming so if you remember back from the advanced chapter we actually did a bunch of analysis on this already and we analyzed how skills were trending for the top five for data analysts there's just one major problem with this plot it's going by counts and because we did counts we have a little bit of skewed data if we go all the way back to the basics chapter where we plotted the number of job postings per month we had a
surge in January also in August for some reason and those surges can be seen in the counts of these skills so what we need to do once again is to calculate the likelihood of these skills appearing in a job posting basically calculate that percentage also we're going to do some formatting cleanup obviously this chart looks a lot better so let's jump into it so here we are in my Jupyter notebook and I've gone ahead and created a new one for answering this question of how are in demand skills trending for data analysts the methodology is
this it's going to be similar to last time in that we need to aggregate the skill counts on a monthly basis which we've done previously and now our new analysis is reanalyzing based on percentages of the total jobs and then plot this new demand and upgrade the charts so I've gone through and imported all the libraries including Seaborn this time loaded the data sets and then also performed all that data cleanup for this analysis we're only focusing on data analysts in the United States so I'm filtering the data frame on this and creating a copy of the data frame
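A small sketch of the filter-and-copy step just mentioned, on invented rows. The column names (`job_title_short`, `job_country`, `job_posted_date`, `job_posted_month_no`) and the name `df_DA_US` are assumptions based on the spoken names. The last line adds a month-number column only to show why the copy is useful: the new column lands on the filtered frame and the original stays untouched.

```python
import pandas as pd

# Hypothetical slice of the job postings data.
df = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Engineer"],
    "job_country": ["United States", "Germany", "United States"],
    "job_posted_date": pd.to_datetime(["2023-01-15", "2023-02-01", "2023-03-10"]),
})

# Boolean mask combining both conditions; each side needs its own parentheses.
mask = (df["job_title_short"] == "Data Analyst") & (df["job_country"] == "United States")

# .copy() gives an independent frame, so adding columns later does not
# modify (or trigger chained-assignment warnings about) the original df.
df_DA_US = df[mask].copy()

# Safe to add a column on the copy.
df_DA_US["job_posted_month_no"] = df_DA_US["job_posted_date"].dt.month
```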
we're going to aggregate this on a monthly basis so I'm creating a new column of job posted month number with these month numbers now in there I can now explode out this data frame on the job skills column and assign this to a new data frame of DF data analyst US explode so let's go ahead and pivot this using that data frame that we exploded going to use our pivot table method specifying the index of job posted month number and for the columns I want to use that skills so job skills for the agg
function itself we're going to be specifying this as size cuz we're doing a count of the different skills for any NaN values we want to fill them in with a zero so we're going to specify the fill value equal to zero running control enter we have this all printed out notice the skills though are in alphabetical order but job posted month number is in the correct order so I'll assign this to a new data frame of basically this of pivot and we need to be able to sort those columns the job skill columns so we create
this new row of total using our .loc method and for this we're getting the sum of those columns and now we have that total row so now we need to sort this data frame so we'll specify the data frame we're going to be sorting and we don't want to sort by a column we want to sort by a row the row of total so we need to use the .loc method for this to access that total row and we want to sort the values based on this and of course we want to set
ascending equal to false running control enter we're going to get an error and that's because if I take this value that we're sorting by and paste it down here to see what we're actually evaluating we're passing in right now these values these counts we need to pass in the index because it's a series so I'm going to specify index here okay now it actually sorted by those index values of SQL Excel Tableau Python and it's in the correct order I'm going to assign this to our original data frame and the only
thing really left now to do is to actually drop that total row so let's see what we got now okay now we have it sorted appropriately with job posted month number and the skills and no total row now just to check to see what we're working with we can plot this using a line chart this has a bunch of different lines in it but we can basically see that we are back to what we had previously for all these different skills granted we have this legend here and all this stuff that probably needs to be
cleaned up we're going to be making our other visualization using percentages so not really too worried about cleaning this bad boy up and now we need to get this DF DA US pivot from counts into a percentage what is the percent of the skill appearing relative to all job postings so in order to do this we need to go back to our original data frame of data analysts in the United States and then we're going to perform a groupby on job posted month number and we want to see the size of
it we can see all the different months and what are the number of job postings in each of those individual months I'm going to call this data analyst totals so now that we have these totals per month we need to go through and basically divide the appropriate row by these totals to get the percentage here now we could use a for Loop to Loop through each one of those months and divide appropriately but pandas actually makes it easier with their div method we're going to provide it two values the first is the parameter other which takes
any single or multiple element data structure or list-like object in our case we're passing a series so the data analyst totals and then the axis whether to compare by the index or columns since we're using the series and we have a certain index for it and we want to compare it to the index of the data frame we need to specify this as zero so with our DF DA US pivot I can run the div method on it we're going to pass in that data analyst totals and then specify the axis is equal to zero now
similar to last time I don't like that they're in decimal notation I want them in actual full numbers so I'm going to take these DA totals and divide them by 100 before passing them in and because we're dividing by them we now have the full numbers they're basically the same just multiplied by 100 so I'm going to assign this all to a new data frame called DF DA US percent real original now I want to do one more thing before getting to plotting this and that's clean up this month column I don't like using numerical values I don't think it looks good visually
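The pivot, sort and divide steps walked through above can be sketched like this on a tiny invented data set. Variable and column names (`df_DA_US`, `df_pivot`, `DA_totals`, `df_percent`, `job_posted_month_no`, `job_skills`) are assumptions modelled on the spoken names.

```python
import pandas as pd

# Hypothetical postings: a month number and a list of skills per posting.
df_DA_US = pd.DataFrame({
    "job_posted_month_no": [1, 1, 2, 2],
    "job_title_short": ["Data Analyst"] * 4,
    "job_skills": [["sql", "excel"], ["sql"], ["sql", "python"], ["python"]],
})

# One row per (posting, skill), then count rows per month and skill.
df_exploded = df_DA_US.explode("job_skills")
df_pivot = df_exploded.pivot_table(
    index="job_posted_month_no", columns="job_skills", aggfunc="size", fill_value=0
)

# Temporary "Total" row, used only to order the skill columns by overall count.
df_pivot.loc["Total"] = df_pivot.sum()
# .index of the sorted Series is the list of column names in sorted order.
df_pivot = df_pivot[df_pivot.loc["Total"].sort_values(ascending=False).index]
df_pivot = df_pivot.drop("Total")

# Postings per month, counted before exploding so each posting counts once.
DA_totals = df_DA_US.groupby("job_posted_month_no").size()

# axis=0 aligns the Series with the row index (months). Dividing the totals
# by 100 first scales the result from fractions to 0-100 percentages.
df_percent = df_pivot.div(DA_totals / 100, axis=0)
```

In this toy data month 1 has two postings, both listing sql and one listing excel, so the month 1 row comes out as 100 for sql and 50 for excel.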
I want actual month names and this is the code for it I'm not going to run through it line by line because we've gone over it multiple times before but if you want to you can pause the screen and take a look at all the code here and with this we have our updated data frame with the month names itself and then all the different skills so let's get into plotting this bad boy so the first thing with this data frame is I only want those first five columns so I can use the .iloc method
and specify that I want all the rows and then I only want the first five values so SQL to Power BI and we're going to be using Seaborn to plot this so I'll specify Seaborn and then from there we're using a line plot now the first thing we need to do is specify the data source so I'm going to set this data frame up here of DF plot equal to this and then inside of here for data I'm going to set it equal to DF plot let's go ahead and run this and then we'll
clean it up after that so off to a good start that is looking like it should along with everything on here let's get into cleaning this up I don't like the dashes so I'm going to specify dashes as false along with that palette and I found this color scheme of tab10 which I kind of like so we're going to go with it now I've been also setting the theme of all these so I'm going to set the theme of the style of ticks and it changes a little bit of the font and formatting with
this I kind of like how it looks next thing I'm going to do is clean up the titles and then also the labels so adding an appropriate title to it and appropriate labels and for now removing the legend but not too bad remember we remove that legend because I actually want to go through and label each one of these lines and have an appropriate label next to it so you can quickly identify what line is associated with what skill without having to go to that damn legend once again we're going to be using that
plt.text function for this now we don't have multiple axes so we can leave it as plt.text and what we need to do is go through each of those lines and write next to it what we want to have on it so we can do a loop basically for I in range 5 and then inside of plt.text we want to First specify the x position so for the x position these are numbered in index values so it starts actually at zero so we want to start at around 11 then for the Y value we
want to have the very last value in the data frame itself so we can do is specify DF plot and then use that .iloc method to access the last row and then for the column itself we just pass in that I that's why we're doing this I in range so it knows which column to go over to now what value do we want to plot in there so the string that's next well we want the column name so we can call it the data frame plot specify the columns and it's
provided in a list and then we can access this or the element in the list by also passing in in I all right so this is good probably for right now let's run control enter not too bad it's plotting it right at 11 I don't like it that close we're going to do 11.2 and yeah looks good except for we have this giant Line running through our names so what we can do now is use that Seaborn function of despine running control enter removes the spine makes this look a lot more visually appealing now there's
one last thing to do I want to format this y-axis specifically I want to have a percent sign next to this so it knows it's not a full 60 so in order to do this we need to access the axis and we didn't do any subplots here I mean you can do that if you want but I'm just going to do ax is equal to plt.gca which gets the current axis and then with our axis I'm going to access the y-axis and for this we need to specify a formatter for it now inside the documentation
you can see that you need to basically set the formatter of the major ticks we need to pass in some sort of formatter and Matplotlib actually has some underneath this ticker module but they don't have it actually listed here the one we need to actually format as a percent so we import it in so we specify from matplotlib.ticker import PercentFormatter and then inside of here we can specify PercentFormatter all right running control enter on this we now have this in
percentage I don't like the zero on the end I told y'all I'm really picky so inside of here we can actually specify decimals equal to zero and Bam now we have our final visualization that's all cleaned up and ready to go so I'm already noticing some trends from this first of all SQL is somewhat steady but it's also declining it's still the highest skill so I'm not too concerned especially since it's high Excel also looks like it has a decline to it but Python Tableau and Power BI which are in third fourth and fifth place seem to
have steady values with it although I'd argue Python is slightly increasing so you know what we need to do now we need to capture the insights that we're getting from this visualization in our readme so we already have our first question there I'm going to scroll on down and I'm going to put it in our question of how are in demand skills trending for data analysts from there put any code blocks that you feel are appropriate to this I love plotting so I'm going to include all my code for actually generating the plots
and I simplify it removing any unnecessary things that basically bog it down I just want to keep the most important things in there next thing we're going to do is add in that final graph that we got in our visualization so we can go back to our Jupyter notebook click save save it inside of that images folder that we created previously and then with that new image saved inside of there we can just go and copy that relative path I'm going to paste it into here and we need to make sure it actually appears as an
image so we put an exclamation point and the very last thing we need to do now is just add our insights in so I added all those insights in focusing first on SQL then Excel then those three remaining skills and the key point of doing this analysis was to ensure that if we actually targeted or went to learn any of these five skills we're not learning a skill that is potentially going to be deprecated and not used in the future this graph basically proves all these Technologies are safe for the time being to learn so
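For reference, the line-labelling and percent-axis steps from this lesson can be sketched as follows. This is a simplified version under stated assumptions: the data is invented, pandas' own line plot stands in for the Seaborn line plot used in the video, and hiding the top and right spines by hand stands in for Seaborn's despine. The label position is derived from the number of rows instead of being hard-coded.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Hypothetical monthly percentages for three skills.
df_plot = pd.DataFrame(
    {"sql": [55.0, 52.0, 50.0], "excel": [42.0, 41.0, 40.0], "python": [28.0, 29.0, 30.0]},
    index=["Jan", "Feb", "Mar"],
)

# With a text index the points sit at x = 0, 1, 2, ...
ax = df_plot.plot(kind="line", legend=False)

# Label each line slightly to the right of its last point instead of a legend.
last_x = len(df_plot) - 1
for i in range(df_plot.shape[1]):
    # y is the last row of column i, the label is that column's name.
    ax.text(last_x + 0.2, df_plot.iloc[-1, i], df_plot.columns[i])

# Show y tick labels as "50%" rather than "50" or "50.0%".
ax.yaxis.set_major_formatter(PercentFormatter(decimals=0))

# Hide the top and right borders so nothing runs through the labels.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
```

`PercentFormatter` assumes by default that 100 on the axis means 100%, which matches data already scaled to 0-100.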
now with confirmation that these top five skills are relatively stable we can now dive into our next question and understanding how well not only jobs for data analysts but also skills pay so that's what we'll be tackling in the next one with that see you there all right so we're halfway through the project and this one by far should be the easiest because we did a lot of pre-work already in the advanced chapter for this we're going to be going back and drawing up our box plots but I want to expand our search
to not only include the top three roles but also their associated senior roles to see how well they pay and once we have a good snapshot of how data analysts pay we can then move into that other visualization that we already made on how well different skills pay for it not only the top paid skills but also the most popular so let's jump right in for evaluating the median salary for those top six data jobs as always we're going to be importing all the libraries loading the data and doing some data cleanup for the first
analysis we're only going to focus on the United States but we need to keep all those different job titles and we'll use this method to drop any NaN values for the salary year average column similar to how we got the top three jobs in the last lesson I'm running a similar set of methods in order to get the top six and with this top six we can then filter another data frame called DF US top 6 specify we want to filter the job title short column using the isin method basically checking hey are
these titles in this list if it is keep it in the data frame so back in the lesson on Seaborn we plotted these three data roles data scientists data engineers and data analysts a lot of this formatting everything's going to remain the same from this so I'm just going to go ahead and copy this right here coming back into our notebook that we were working in previously I'm going to go ahead and paste it and we need to update it for this new data frame so specifically data frame us top six everything else I believe
should remain the same so I'm going to go ahead and plot it and we get this bad boy which not too bad although I do want to fix it up a little bit right now visually I would want to actually have this to where the median salary slowly ascends for each of these different values and so we need to reorder these rows so that it looks like data analyst is lowest followed by senior data analysts and then data engineer data scientist and so on so we need to basically sort this y-axis by the
median salary so coming back up here underneath our data frame that we created we're going to need to perform a groupby aggregation to find out what are the different median salaries for these job titles and then order it by that so I can do a groupby specifying job title short and then using that salary year average column that we want to do the aggregation on of the median salary so let's go ahead and run this not bad we want this in descending order though so with a sort values method I'm going to specify
ascending equal to false and then we don't really actually need what these median values are we want the index of this series because this is a series right here and so I'm going to specify index okay now we have a list of the different job titles in the correct order that we wanted I'm going to call this list job order so now what's convenient about Seaborn is they provide a parameter of order and so I can specify job order now running this we get them now basically in descending order now I was really surprised by
the fact that not all senior roles are more than the junior roles so senior data analyst is higher than data analyst as we'd expect but it's not higher than data engineers and data scientists but their respective seniors are definitely higher so I think this provides a great Insight that if you're working as a data analyst and then you're ready to go to the next level you may consider besides going to a senior data analyst role to go to maybe a data scientist or even data engineer so before we proceed further we need to basically capture
those insights we just said there inside of our readme going back to the graph that we created I'm going to go ahead and save it and now let's add these insights I'm going to start with the title of how well do jobs and skills pay for data analysts and we'll start with this first section of salary analysis for data nerds basically our box plot section we have a little section in here on visualizing data cuz that's my favorite along with including those box plot results underneath it and then finally I wrap it up with
these insights capturing how the senior roles compare and what I feel about data analyst roles so let's move into that second visualization we're trying to build and that's investigating the median salary versus skills for data analysts and just like the last visualization we've done this before so I'm going to be flying through this I need to create a new data frame because now I don't want to include the top six roles I only want to include data analyst so I do that along with the country of the United States making sure I run the copy method on this
additionally since we're only analyzing jobs with salary I'm going to drop any values that are NaN next we need to explode out that job skills column because currently it's in a list and then let's actually investigate it by looking at the salary year average column and the job skills column yep looks good the first thing I want to do is find those top paying skills and so what we're going to do is do a groupby on job skills using that salary year average column as an aggregation aggregating count and median now we want the
top paying skills at the top of this data frame so we're going to use sort values on this to sort by the values of median setting ascending to False from there we'll set it equal to a new data frame of df_DA_top_pay and for this new data frame I only need the top 10 values and now we need to repeat that groupby again and aggregation but we want to get a data frame now that has the most popular skills in it so in this case we don't want to sort by the
value of median here instead we want to sort by the count so now these have the top popular skills so SQL Excel Python Tableau yep now with this we're going to be graphing this data frame and when we graph it we want the salaries in the appropriate order so now that we have the top 10 we need to actually sort it by the median values so now we have that top pay data frame along with the most popular skills data frame from there I'm going to go back into our notes
from Seaborn see the visualization we have here I'm going to come in now and actually take all this different code we did to generate this and I'm going to paste it on in for the data frames I still used the same nomenclature for the names so the data frames are correct along with their columns so we'll go ahead and run this and bam we have our other visualization that we're trying to get so with this visualization we're seeing a lot of good insights for the top highest paid skills we're seeing things like technologies
around dplyr Bitbucket and GitLab so web service technologies are a lot more popular one note about this if we actually scroll back up to this data frame up here and look at the count of these jobs the counts are around two or three ranging from one to six so not a lot of jobs over the course of the year requesting this but they have a high salary with it conversely if you look at our most popular ones they have in the hundreds and thousands of jobs available so I wouldn't neglect your Python studies to learn something like Ansible a niche
technology like that is going to be good to know but it's going to pay off more to know something like Python in this case which is the top paying of the top 10 most popular skills pretty meta that we discovered this using Python anyway another trend I'm noticing with this is that Microsoft products like PowerPoint Excel and Word seem to have a lower related salary compared to programming and visualization software so it pays to know more advanced technologies basically so you know what we need to do we need to capture these insights in our README
so for me I added in the code to go about visualizing these different graphs and then added in that visualization that we just built and then finally wrapped it up with the insights talking about the core technologies that pay the most along with the most popular technologies and what insights I got out of that so we've learned two major things here one we understand now where data analysts stack up to other data science roles and two what skills pay the most for data analysts in the next and final iteration of our project we're
going to be using what we learned on skill pay here and then also what we learned previously on the counts of skills and combining those both together in order to find out what is the most optimal skill to learn so with that hope you're ready to bring this to a close and I'll see you in the next one all right onto the final question to solve for this project super excited we're almost done with this for this we're going to be working off that scatter plot we built in the scatter plot lesson and then improved on
in the advanced customization section so we get all those names not being on top of each other anyway this visualization still needs to be upgraded right we need to shift it to this new visualization that now has percentages instead of counts so we can actually do an apples-to-apples comparison of what's going on here but we're going to spice this visualization up just a little bit more we're going to actually color code it specifically we're going to be able to see which ones correlate to programming tools databases cloud technologies whatnot so we can actually
extract deeper insights into not only singular technologies but also technologies at large so here in VS Code we're going to start a new notebook analyzing what is the most optimal skill to learn for data analysts we're going to start with this first one of grouping the job skills in order to determine what the median salary and likelihood of being in a posting is basically that percentage first thing we do is import the libraries load all the data and then clean it up next thing we need to do is filter our data frame
for not only data analyst job postings but also the country of interest of the United States and we're running the copy method on this similar to before we're going to drop any job postings that don't have salary data we're going to explode out the job skills column and then let's go and investigate it yep imported and good so let's perform a grouping of this data frame in order to group the job skills based on the median salary and count so I groupby on the job skills column using that salary year average as the aggregation
of count and also median this is a good start but I want the higher counted skills up at the top and whenever we convert to a percentage it's still going to be in the same order so I order by count and I set ascending equal to False we're going to have multiple columns in this data frame so I want to be very clear about what these different column names are in this case count and median aren't descriptive enough and it may be confusing later on so I'm going to rename our
data frame that we just created above of df_DA_skills and rename the columns to skill count and median salary displaying it below we have updated names now we need one more thing in our data frame specifically we need that percentage and so what we can do is use our original data frame of df_DA_US the one where we removed all the postings without salary data we can get the length of this and this will provide us the number of job postings so I'll create a new variable called data analyst job count and assign that the length of
the data frame now we need to create that new column of skill percent and for this one we'll set it equal to the skill count divided by that data analyst job count right here multiplied by 100 because I don't want to keep it as a decimal because I'm very particular running control enter we have now this final data frame that we can use for this and if you're doing this you should be noticing that these top skills specific to data analysts should be around 50% top skills for data scientists should be around 70%
or so and same for data engineers if it's any different than this you probably didn't calculate the correct length of the data frame now this data frame has 168 rows I'm not going to want to plot every single one of these on our scatter plot instead we want to have a cutoff percentage for what values we do want to include on there I'm going to set a skill percent variable equal to five and I've played around with these numbers and five for data analysts seems to work to not basically
clog up the visualization so with that variable I'm going to create a new data frame to basically filter our old data frame down to make sure that we have values greater than 5% printing this data table out underneath it we can see that yep it's about 12 different skills I think this gives pretty good insight into what we want to focus on so going back to the advanced customization lesson we plotted that there and I'm going to basically copy all this different code and then paste it into the notebook that we're
working with right now so with this we're almost there we can see that there is some yellow highlighting specifically we have the wrong data frame inside of here so I'm going to double click this one up here press command shift L and then change them all to be that df_DA_skills_high_demand additionally I'm just going to change this title down here to most optimal skills for data analysts in the US let's go ahead and run this see if it works and bam we get this bad boy we're almost done we forgot to change the
count of the job postings so I'm going to double click skill count I notice it's in two places press command shift L and then put in skill percent running control enter we now have a percentage down there so I'll update that x label so instead of being count of job postings it's percent of data analyst jobs now a few things I do want to clean up with this visualization I know we're going to go into actually color coding the different skills in this but what we're going to do here is also going to be transferable to
the final visualization first is we need to clean up the salary values on the y-axis in order to do that like last time we have to access the axes so I'll do ax equal to plt.gca for get current axes for this we're going to be doing the y-axis and we can do the set_major_formatter this one we're going to use the built-in one from the module of pyplot specifically FuncFormatter and we've customized this similarly on the past ones using this lambda function so I'm going to go ahead and just copy
this and then go ahead and run it all right so we have this formatted now the only thing left now to format is the x-axis to get it as percent and we did that back in that second problem we were trying to solve so I'm going to copy the code for that one as well then inside of here I want to specify ax.xaxis and paste in that set_major_formatter and then that PercentFormatter right now we haven't imported it in so we need to actually import that in from matplotlib import PercentFormatter
okay we'll go ahead and run this it's from matplotlib.ticker wrong module okay so now we have not only the x and y-axis formatted correctly we have everything here that we need so as we identified when we first talked about scatter plots we want to be in this upper right-hand quadrant right here because not only is it a high demand skill but it's also a high paying skill conveniently for us there's nothing directly there but there's things around there things like Python Tableau SQL and then we're getting into outside boundaries
here with Oracle R and Excel but we want to take this visualization a step further and I want to color it by the core technologies now inside of our original data frame if you scroll all the way over to the right we have not only a job skills column but we also have a job type skills column and this includes a dictionary in it so if we look at the first 10 values we can see that first it has a key like in this case analyst tools or async or cloud which is like the technology and
then inside for the value we have a list so for analyst tools we have things like Tableau SAP Jira or we could have even multiple tools so Tableau and Splunk in this case so I went through and cleaned up this data set to basically well let me just run this code that I went through and what it does is it aggregates all the values in the data frame and basically puts it into a singular dictionary so in our case analyst tools now has all the analyst tools that appear within a list conversely when we get
to the next one for databases we get all the different databases in a list and then next for libraries all the different libraries in a list and so on so we have all the data we need now this code I'm not going to go through line by line because we've gone through all the different things needed to understand what's going on here so I'll give you that as a little bit of homework or feel free to just copy and paste this and put it into your code so with this dictionary I'm going to make
it into a data frame and assign the columns of technology and skills we need to break out this skills column so we're going to explode it out specifying that skills column and then we're going to print out this and this gives us a singular data frame of technology with their associated skill so now we can merge that back to our original data frame specifically our data frame that has our different skill counts median salary and skill percentage so I'll copy this data frame and with it we are going to merge that data frame technology and
we need to specify how we're going to actually do this so for left on this will be using the job skills column and then for the right on for the one we just actually created we can scroll up and see that it's skills only so we'll specify skills running control enter we have now the singular one where it has not only the skills but also the technology so now we have the data frame that we need to plot so I told you all that work we did earlier wasn't for nothing we're actually going
to copy all this code and just reuse it one thing first I'm going to define this as df_plot because that's what we want to plot now down here underneath this I'm going to just throw this in for the time being we're actually going to modify it so we do see that the visualization gets made but remember we want to color code this based on the technologies and we want to include the technologies so we need to use the Seaborn library for this vice just using Matplotlib or in this case the pandas method
of plotting which uses Matplotlib so we go ahead and comment this out and instead we're going to define sns.scatterplot we have quite a few arguments for this first is data we're going to set that equal to df_plot next is x and we want to use the skill percent for this just like above from there we're going to do y equal to the median salary which is what we did above as well and now for the hue to actually color it so hue equal to in our case we want to
set it equal to technology and that's that column up here of technology so let's go ahead and display what we have below right now all right not too bad so we have actually the different technologies coloring it in I do want to change the theme and then remove this spine because I've been doing that frequently for all our other visualizations I want to keep them the same so I'm just going to add in this despine function along with setting the theme of ticks which we've been using previously
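Pulling the last several steps together (skill percent and cutoff, the technology dictionary turned into a data frame, the merge, and the color coded scatter plot), a rough sketch could look like the one below. It runs on made-up numbers rather than the real job postings data: the `technology_dict` contents, the `DA_job_count` value, the `skill_percent_cutoff` name, and the helper `plot_optimal_skills` are all assumptions for illustration. It also assumes `job_skills` is a regular column rather than the index.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter, PercentFormatter

# Hypothetical stand-in for the dictionary aggregated from job_type_skills.
technology_dict = {
    "programming": ["python", "sql", "r"],
    "analyst_tools": ["excel", "tableau"],
    "other": ["ansible"],
}

# Hypothetical grouped skill table (made-up counts and salaries).
df_DA_skills = pd.DataFrame(
    {
        "job_skills": ["sql", "python", "excel", "r", "tableau", "ansible"],
        "skill_count": [2500, 2000, 1500, 1000, 500, 2],
        "median_salary": [91000.0, 98000.0, 84000.0, 92500.0, 93000.0, 124000.0],
    }
)

DA_job_count = 4000  # assumed number of data analyst postings with salary data
df_DA_skills["skill_percent"] = df_DA_skills["skill_count"] / DA_job_count * 100

# Keep only skills above the cutoff so the plot is not cluttered.
skill_percent_cutoff = 5
df_DA_skills_high_demand = df_DA_skills[
    df_DA_skills["skill_percent"] > skill_percent_cutoff
]

# One row per (technology, skill) pair.
df_technology = pd.DataFrame(
    list(technology_dict.items()), columns=["technology", "skills"]
).explode("skills")

# Attach the technology label to each high demand skill.
df_plot = df_DA_skills_high_demand.merge(
    df_technology, left_on="job_skills", right_on="skills"
)


def plot_optimal_skills(df_plot):
    """Scatter of skill percent vs median salary, colored by technology."""
    sns.set_theme(style="ticks")
    fig, ax = plt.subplots()
    sns.scatterplot(
        data=df_plot, x="skill_percent", y="median_salary", hue="technology", ax=ax
    )
    sns.despine(ax=ax)
    # Salaries as $98K and percentages as 50% on the tick labels.
    ax.yaxis.set_major_formatter(FuncFormatter(lambda y, pos: f"${int(y / 1000)}K"))
    ax.xaxis.set_major_formatter(PercentFormatter(decimals=0))
    ax.set_xlabel("Percent of Data Analyst Jobs")
    ax.set_ylabel("Median Yearly Salary")
    ax.set_title("Most Optimal Skills for Data Analysts in the US")
    return ax
```

Calling `plot_optimal_skills(df_plot)` in a notebook cell draws the chart. The merge here is an inner join, so a skill that has no entry in the technology dictionary would silently drop out of the plot, which is worth checking on real data.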
so bam we have this bad boy now from which we can start to extract insights about the technology itself as we can see things like programming for Python and SQL also for R SAS and Go are in some of the most optimal skills in that upper right-hand corner granted R and SAS are more towards the middle but based on having none in there I think it's close enough next up is analyst tools so this is like visualization tools or Microsoft Excel things like that things that are specific to an analyst and they're sort of like
middle of the pack with things like PowerPoint and Word like we discovered before at the lower end of being paid well and then finally we do see some databases and cloud technologies appear but they're not really prevalent so I think the insight from this is focus on those programming and also analytical tools if you're a data analyst in order to get the most optimal pay so you know what we've got to do now we've got to update our README for all that different work and insights we just discovered I'm going to start first with the code necessary in
order to visualize all our data and then next add in that visualization for what we just uncovered then lastly getting into those insights sharing everything that we just learned about what is the most optimal skill and proving most of all that Python is the most superior skill although I may be biased on that one all right so congratulations on wrapping up this last problem we're tackling for the project we're now going to be going forward with moving into actually making this project publicly available not only on GitHub making sure it's updated but also on LinkedIn as well so
I'm personally proud of the work that we went through to get to this and I hope you are as well and getting into sharing it with that I'll see you in the next one all right we're almost to the finish line we just need to get this work uploaded to GitHub and also update our README in order to include relevant information and guide our viewers on how they should understand our project so for the README we need to have some sections to basically build a story of what we did here so here's a breakdown
of the different components that we're going to be including in our README and this is per the suggestion of Kelly and this is how she formats all of her different READMEs and projects and blogs based on her past success with demonstrating her experience now we've already done the analysis section of this README document and so frankly that's the bulk of the work the rest of the stuff is pretty easy to fill in oh and all this is just a recommendation you can feel free to adapt it to your own needs for the introduction section I
just provide an overview basically telling them what this project was all about and that I completed it in conjunction with my Python course for the background that's when I jump into the questions themselves defining the four problem statements that I'm trying to solve from this and next up is tools we used make sure you're listing all the different tools for this this is a great section to demonstrate all your different experiences with these different technologies and then we get to the analysis which basically includes all the stuff from our last four lessons where we went through
and added the images and the different code blocks and then we get into wrapping it up so I went over what I learned and frankly I learned a lot more about Python even building this course so thank you for that next I provide some insights that I captured from the data itself understanding skills salaries and market trends and then from there jumping into challenges I faced with this whether that was data problems you had or issues with actually visualizing include it in this section and then finally I just have a little wrap-up section at the end
to conclude all this work basically capturing in a few lines what I did all right so now we have a README done we need to get it on GitHub we go to the source control tab I'm going to give this the title of final commit sync changes and then I'm going to navigate over to GitHub to make sure it uploaded one thing to know is you can have up to six different repositories pinned in here so in my case this Python data analytics course repository I'm going to put it up at the top left if you're not seeing it
there or you have more than that there you can customize it and add the appropriate one to it anyway let's go on into this all right bam now we have all of our different code and now this README updated with all this different analysis that we did this was nothing short of a lot of hard work so you should be super proud of it now that we have this all on GitHub we only got to do one more thing and that's update our LinkedIn to include this project and our relevant experience that we gained from
this course so with that I'll see you in the next one where we get into LinkedIn all right a little bittersweet we're on the last lesson of this course it's been a lot of work I'm super proud of what you've done so far all right we're going to jump into how we can actually update our LinkedIn to demonstrate all of our different experience that we gained from this for those that purchased the course problems you'll be getting a course certificate and you can update your LinkedIn for this now it's not too late if you didn't purchase those
course problems as you can still get that and get that course certificate you don't necessarily have to work through all the problems to get that certificate you made it this far you deserve it after we update for the certificate we're going to update our project section to show this portfolio project and then make a post after you complete the end of course survey I'll be sending you via email your course certificate and you can just download that and save it for this LinkedIn section all right so once you navigate to LinkedIn and then navigate to
your profile itself you'll go under add profile section specifically underneath the recommended you'll add licenses and certifications for the name you'll just name it Python for Data Analytics which is the name of the course for the issuing organization you can select me and have a picture that pops up for it next is the issue date there's no need to select an expiration there's no expiration for the credential ID there's a specific ID that you have on your certificate paste that in here next is the credential URL whenever I sent you that email you also should have
received a URL associated with that certificate you just copy that one it's a very unique URL and you paste it in here next up is the skills I recommend putting these five in there as they're the core technologies we focused on Python Git Matplotlib pandas and Jupyter you can feel free to add other technologies too like Seaborn and GitHub the last thing to include is the certificate itself so you go in and add the media of that document you downloaded you just add the media to it and then finally all you got to do is
click save the next thing whether you purchased the course problems or not you can actually add the project to your LinkedIn so once again we're going to go to this add profile section and under recommended we're going to select add projects I'll select a project name of data science job analysis with Python from there I include a short description which has a call to action for them to go to my GitHub page and check out my project once again for those skills add in those core skills that you want to add feel free to add in
like I said Seaborn and GitHub next is the media in our case we will include that URL to our GitHub project so I can select add media add a link and then from there paste it in and click add from there include an appropriate title and click apply the last few items to include are the start and end date and then any contributors in my case Kelly helped me out a lot with this project so I'm including her and that's it for the project section we can go ahead and click save the last thing
to do is make a post in order to share your project and completed certificate and I'd also link your project in the comments so others can check it out don't forget that if you're making this post to tag Kelly and me in it as I'd love to check out your projects and see the different work that you've done for this all right so congratulations this has been nothing short of your hard work I know a lot of effort went in personally for me just building this course so I know you've put a lot in
actually working through it solving all the different problems and getting to that final project with that I want to congratulate you on that now I'm going to be making a course on Python and Excel coming in the future but if you're curious about what to learn now I would recommend if you haven't done it already taking my SQL for data analytics course as we learned from this analysis right here SQL is right up there with Python and I'm using both of those in conjunction with each other all the time so I highly recommend learning both all
right with that really appreciate your time and following through all this and I'll see you in the next course