Data Analytics with Python Prof. Ramesh Anbanandam Department of management studies Indian Institute of Technology, Roorkee Lecture No 3 Python fundamentals Okay? We will continue our lecture.
How to access different rows and columns, because, it has very important applications. When the data file is very big sometimes you need to access only some rows or some columns for your calculation purpose. That we will learn how to access a particular rows or particular columns there is looking at columns, rows and cells.
When you look at this see print df. head when I use this command and getting there are different. For example; the first column says 0, 1, 2, 3, 4, country, year, population, continent, life expectations, gdppercapita.
Suppose I want to get the first row as we know that the Python counts from 0. If you want to know the first row you type a print df. loc, it is a location in square bracket 0.
Will do that you will get the details which are there in the first row. So first if I want to know hundredth row so printed df. loc 99.
We knew that python count from zero. If I want to know 100th row you have to type 99. So it should be in Square bracket you can see the details in the 100th row.
Suppose we want to know the last row in the data set. So print df. tail n equal to 1.
If you type n equal to – 1, it will not work, that we will see why if you want to know the last row simply type to df. tail n equal to 1, you will get to know that what is the last two, So we will see that. (Video Starts: 02:01) Now we are going to use this command to see the last row, that is a detail about the last rows.
Now we can subset a multiple rows at a time. For example; there will be requirement we have to select 100th row, 1st row 100th rows and 1000th rows. For that purpose you type this command print df.
loc. You see there are two square brackets 0, 99, 999, you will see what output where getting. So type print df.
loc. Yes, so we are able to see the 1st row, 100th row, 1000th row. There is another way we can subset rows by row number by using this command iloc.
Previously loc, now we are going to use iloc. Suppose for type I want to get the 2nd row, if I type print df. iloc 1.
I will get the details about the 2nd row. Okay? Yeah, this is a detail about the 2nd row.
Suppose I want to know 100th row by using iloc command so go there. Yes? That is the details about the 100th row.
You see that if I want to access the last row by using iloc command. So you can directly type print df. iloc in squared bracket - 1.
So that will be the details of the last row. So what you can do we can open our Excel file you can verify what was the title, the last row and soon. (Video Ends: 04:27) See then important note here with iloc command.
We can pass in the - 1 to get the last row, but same thing that we could not do with loc. That is the difference between loc command and iloc command. Suppose we want to get the first 100th and 1000th rows, using iloc command.
So we are going to type this print df. iloc 0, 99, 999. Let us see what answer we are getting.
(Video Starts: 04:58) Yes? See, we have getting 1st, 100th and 1000th row. (Video Ends: 05:21) So far we are seeing subsetting rows.
Now we will see subsetting columns, the Python slicing syntax used a colon, colon represents all the rows. If you have just a colon that attribute refers to everything. So if you just want to get the first columns using a loc, or iloc syntax.
We can write something like df. loc[ : , which column we need to refer, to subset the columns. The next slide I need to show that we are going to subset the columns with loc, not the position of the colon.
It is used to select all rows. You see that, subset equal df. loc: , I want to see only two columns that is year and population.
So when you type this way you will get all the rows only two columns details that is year and population. You will type this so when you type print subset. head.
You can get the first 5 rows. So you will see how it appearing. (Video Starts: 06:26) Subsets equal to Subset is object because from the df is the initial object which has all the details.
Now I am going to fetch only few columns from the df object that I am going to saved in the name subset, subset is the object. So all the rows but I need only year column and population column so I am going to type I want to see the first 5 rows, see that I am able to see 1st 5 rows, only for 2 cells. That is year and population.
This is the way to get only 2 cells from the 2 columns from the Big Files. (Video Ends: 07:21) There is another example subset column with iloc, iloc will allow us to use integers - 1 will select the last column. The same thing whatever we have seen in the previously so subset equal to df.
iloc:, represents all the rows. Then [2 , 4 , - 1, then we can see by using this command print subset. head 1st 5 rows.
(Video Starts: 07:46) See that we are able to see the last column and the population column, life expectancy column. You can open our Excel sheet you can verify whether we are getting the right answer or not. (Video Ends: 08:26) Sometime there is another way for subsetting columns by using the command called range.
First will make range of numbers we are going to save that range of a number in object called small _ range, so small _ range equal to list range 5. Print small range will get 0, 1, 2, 3, and 4. Now this small _ range, object can be used to access the corresponding columns.
So if I type a subset equal to df. iloc:, small _ range I can get. 1st column, 2nd column, 3rd column, 4th column and 5th column, so we will try this.
(Video Starts: 09:09) small_range is an object, we are going to create a range. Suppose we want to see what small_range is. So it is up to 0 to 4, that means 1 to 5.
Now we are going to subset using that object called small _ range using ilocation command. df. ilocation:small_ range we see that here we are able to see 5 column that is a country, year, population, continent and life expectancy.
(Video Ends: 10:21) So far we have seen subsetting only rows and columns. Now we are going to subset rows and columns simultaneously. For example; using loc command so if you type print df.
loc 42 countries. We can check in the 42 label in country columns. What is the cell name, there cell name is Angola.
Will try this. (Video Starts: 10:47) Going to see in that file in 42nd label in country column what value is there so that is an Angola, Yes? (Video Ends: 11:09) Yes, we can see what is in the using the same ilocation we can see in 42nd label in 0th column.
Now we can represent column also with 0 columns, what value it is, you will see that. You can verify you have to get to the answer. You can open the Excel file.
You can verify we are correctly accessing the cell or not. (Video Starts: 11:29) Print df. iloc in 42nd label 0th column what is the value it is Angola.
(Video Ends: 11:46) Next we can subset multiple rows and columns. For example; get the 1st, 100th and 100th rows from the 1st, 4th and 6th column. So now we are going to, simultaneously we are going to fetch rows and columns and corresponding cells.
So print to df. iloc 0, 99, 999. Similarly column labels is 0, 3, 5.
Let us see what answer. (Video Starts: 12:13) This accessing rows and columns are very important functions because nowadays data file comes with a lot of rows and lot of columns. We need not use all the columns, all the rows for further analysis.
Sometimes we need only specific rows or specific columns. So these basic commands will help you, how to access a particular rows and columns, that will be very useful when we do further analysis using Python. Yeah?
This is the value so that means 1st row, 100th row 1000th row, 1st column and soon. (Video Starts: 13:08) And there is another way if you use the column names directly it makes the code a bit easier to read. In terms of number and so you see number column.
If you use for representing column, if you use column name we can see what is there, so simply type the column name. So we use this command, print df. loc 0, 99, 999.
Then directly will type the column name country, life expectancy, gdpPercap you see there is a square bracket here. (Video Starts: 13:36) That you have to do as the same that Life capital Exp, Yes? This is because country, life expectation this is the easy way to because we cannot remember column name.
(Video Ends 14:48) This was not only that instead of see suppose if you put a 10 column 13 that corresponding rows will be displayed. So print df. loc 10 to 13, the 10th row 11th, row 12th, row 13th, row will be shown and in columns country and life expectancy and gdpPercap so we will try this command.
(Video Starts: 15:11) That means we can see the range of rows at a time. You are able to see the 10th row, 11th, 12th and 13th. (Video Ends: 16:17) Okay?
Next see print df. head we can see we can able to see 1st, 10 rows. The 10th row some time for each year in our data what was the average life expectancy.
To answer this question we need to split our data into parts per year and then we can get the life expectancy column and calculate the mean. So what is happening there is a command which I go to use called groupby, we look at the data it is not grouped. So when you use this command print df.
groupby year,and life expectancy and corresponding mean. The mean of the on the in the year 1952, the mean of the life expectancy variable is 49. 05.
In 57, 51. 09. We look at the data; it is not in this order.
So the groupby by year this command is grouping all the values, with respect to year. So we will see what is the answer for this, we will verify this. (Video Starts: 17:15) When you open that Excel file you will see that the Excel file will be in some other form it is not grouped by year, different years are appearing at different places.
So this command that is a group by will help you to group the data in year wise. Yes, you see that you are able to get 1952 the life expectancy was 49 years you see that when you look at this data. When year increases the life expectancy year also increases due to advancement of medical facility available and the standard of life is also increasing.
(Video Ends: 18:42) Now, we can form a stacked table. Stacked table is using the group by command. So you type this multi _ group _ variable = df .
\ . See the \ represents to breaking the command we can use \. Otherwise you can write straightaway also no problem.
df. group by year, continent, life expectancy,gdp per capita, then we can find the mean. Then we will get this output for that means in 1952, in Africa, the life expectancies 39, in America 53, in Asia 46 in Europe 64 will try this command.
(Video Starts: 19:28) When we takes these command you will get an output, that is a stacked table. That is very useful for interpreting the whole dataset, is kind of a way of summarizing the data in the form of table. Multi_group.
You see that now year wise. It is very, very useful command it is year by 1952, some country Africa. What was the average year 1957 Africa.
We see that if you look at only the Africa data. 52 to 39 in 57 41, in 62 43, in 67 45, see that we can interpret this way, by looking at the, this table. Suppose you have to flatten this.
(Video Ends 21:24) If, you need to flatten the data frame. You can use this reset underscore index method, just to type flat = multi _ group _ var . reset _ index.
Then you see now the data is again. Now it is flattened. The same data set, which was it in the table form now it in the simple learned form.
So we will try this comment. (Video Starts 21:48) This is what you are doing the data manipulation, because from the big data file, we have to learn this kind of fundamental data manipulation methods that will be very useful, in coming classes. So able to use reset _ index command to flatten the, that stacked table.
See that now we can see first 15 rows. Now it is data is flattened into the normal form. (Video Ends 22:41) The next one is grouped frequency counts.
By using nunique command, we can get a count of unique values on the panda series. So when you type print df. groupby continent, country.
nunique, you can get unique values that means frequency. Okay, will try this command. (Video Starts: 23:04) Print, See Africa 52, America is 25, Asia 33.
When you look at the data, again, you go to excel,Excel data you can interpret what is the 52 means, what is the America 25 and soon. (Video Ends: 23:49) Now, some basic plot a way to construct two things one is year and life expectancy. So we are going to create a new object that is called Global _ yearly_ life _expectancy.
By grouping year and life expectancy, with respect to its mean. Then we are going to print it. So you are going to get two values one is year.
Next one is life expectancy. That is a mean life expectancy, you will see this. (Video Starts: 24:17) There is a new object.
The object name is called Global _ yearly _ life expectancy. Yes, see that year, and supposed we want to plot it. We will see we are going to plot this data, how we are going to plot it.
(Video Ends: 25:28) Simply, just that object name. plot. That automatically takes this was output, which I got is in x axis in a year, in y axis, average life expectancy.
We will run this. (Video Starts 25:40) So, what this data says that, when the year 1950 - 2000 you see when the year increases, the life expectancy also increases. (Video Ends: 26:07) Just we have seen only the simple plot, in coming classes, we will see some of the visual representation of the data.
We are going to see a histogram, frequency polygon, ogive curves,pie chart, stem and leaf plot and pareto chart and scatter plot . Suppose, this is the data, see what is there in East, west, north. In column first quarter, second quarter, third quarter, fourth quarter.
(Refer Slide Time: 26:30) Suppose the very easiest way is the graph. By using this is called bar graph, bar chart. Bar chart is different regions are labeled as different colors.
This is a method of visual representation of the data. If you look at this, the eastern side in third quarter, there are more sales. Okay.
The another way to represent visually, the data is pie chat, is the first quarter, third quarter. You look at this, third quarter, which is in blue in color. There are more sales.
And most importantly the pie chart, we can get pie chart only for categorical variable. The variable is continuous, you cannot use bar chart, you cannot use pie chart. So the pie chart is used only for categorical variable.
That is for only count data. The another one is the Multiple bar chart. This is another way to represent the data visually.
Another one is a simple pictogram. See, this is the frequency table. See, next one is frequency polygon.
This figure is drawn from the previous table, which was shown in the previous slide. So below 20 around 13,14. This represents frequency polygon.
When you connect the midpoint, you see that this is the. This is called frequency polygon. Then the, this one is the cumulative frequency.
It is not always, you cannot connect the midpoint, you have to be very careful with the data is continuous, then only you can connect one this bar. The data is not continuous, you cannot connect it. (Refer Slide Time: 28:24) Next one is a histogram .
The histogram was constructed from the given table. You see. The lower limit of the table values is going to in x axis.
The frequency is shown in the y axis. You see that this is data in continuous data. Okay, that was histogram.
The purpose of histogram is, the histogram will give you a rough idea what is the nature of the data whether, what kind of distribution it follows. Whether it is following bell shaped curve, whether the data is skewed right or skewed left. Next one is the frequency polygon which I have shown you.
If, the midpoint of histogram are connected then there is called frequency polygon. Because, the frequency polygon is used to know the trend. Trend of the data.
The next one is ogive curve. This is cumulative frequency curve . So what is happening in the, for example 20- under 30, the upper limit of the interval is taken the x axis, the cumulative frequency is taken in the y axis.
For example, the first interval. 20 - 30. So 30 the upper interval is 6.
For 40, upper interval is to 24, that is marked. Because the advantage of this ogive curve is, supposed if we want to know below 16, how many numbers are there, that can be read directly from the ogive curve. That is the purpose of ogive curve.
Next one is the relative frequency curve. Exactly similar to that now actual frequency that relative frequency was taken. Okay.
The next way to represent the data using pareto chart. The Pareto chart is having some applications in quality control also. This is to identify which is more important, important variable.
Assume that, if you look at this Pareto chart. There are 3 axes one is frequency. In x axis, different name is given poor wiring, short in coil, defective plug, other.
You see there is one more variable in terms of percentage. For example, I am a quality control engineer, suppose my motor is failing so often. I want to know there are different reason for failing of the motor.
I want to know what are the main reasons, due to which the motor fails. So what I have done. First I have go to frequency table, that is due to poor wiring, the motor was falling for failing 40 times, frequencies 40.
Due to short in coil, the motor was failed 30 times. Due to defective in plug, the motor was failed 25 times. Due to some other reasons the motor was failed by say below 10 times.
So the first technique is for drawing this one, we have to arrange in the descending order of their frequency. So in x axis that values are taken. Then the cumulative frequencies plotted on the, this axis.
For example, how to interpret this table is. You see, here this value corresponding this only 70. So 70 % of the failure is due to only two reasons, that is poor wiring and short circuiting.
So what is the meaning of this one is, if you are able to address these 2 problems, 70% of the failures can be eliminated. So the purpose of a Pareto chart is, to identify which is critical for us. Generally it is called 80-20 principle.
This is called the Pareto principle . That is 80% of the problems are due to 20% of the reasons. So similarly here, when you look at this, the cell, here need not always 80, see the 70% the failures, only due to 2 factors that is due to poor wiring and short coil.
So this is the pareto chart. The next one is scatter plot. The scatter plot is so far what ever seen only for one variable, the scatter plot is used for two variable.
In x axis registered vehicle, y axis the gasoline sales. So this says the scatter plot says, when the number of registered vehicle is increasing the gasoline sales is also increasing. So the scatter plot is used to know the trend out the data.
Some of the basic principle for excellent graph. One is the graph should not distort the data. The graph should be very simple.
It should not contain unnecessary adornments. So, so much decoration in the graph is not required, the scale on the vertical axis should begin at 0. All axes should be properly labeled.
Weather should be x axis or y axis, it has to be properly labeled. The graph should contain a title. The simplest possible graph should be used for given set of data.
These are the basic principle of excellent graph. See when you look at this one. The left hand side it is a bad representation of the graph.
What is happening lot of animations, unnecessary pictures. The right hand side, it is a simple graph x axis is taken as year, in y axis it has taken the wage. So it is showing some trend.
But when you look on the left hand side it is not giving any idea. What is happening year with respect to wage. Another one you look at the left side picture and right side picture.
Both are the same data. But what is happening. When here in the left side picture the scale is 0 to 100, here it is 0 to 25 just by changing the scale, we are able to get different interpretation.
You see that when the when the scale is increased. It looked like flat. If you are drawing in smaller scale.
You see that look like there's a lot of variations. So what is the learning is that we are to use proper scale to draw the picture. The next one is the graphical error, no 0 point on the vertical axis.
When you look at the left side of the figure January, February, March, April, May, June, the month is given in x axis. Monthly sales is given y axis. But the problem on the left hand side is it did not start from 0.
The right side is you see that the small Brake is given. So, that, even though, 0 to 36 there is no data, you have to make a small break like this. So that, we can come to know it start from 0.
So this is the right hand side is the right way of drawing the graph. This is the basic requirement. In this lecture, what you have seen, how to access particular rows and columns by using basic commands.
Then we have seen the different visualization techniques, different theories of the visualization technique. The next class will take in some sample data. By using the sample data with the help of the sample data will try to visualize the data.
By having different tools like a pie chart, bar chart, pictogram, Pareto chartor simple graph. Thank you, we will see you next class.