What is a quantile, how is it calculated, and why is it important in statistics? My name is Alexandre Patriota, I am a professor of statistics, and in this video I will talk about quantiles and their application in data analysis. Quantiles indicate positions in a data set and can be calculated in a variety of ways.
The statistical program R provides nine algorithms for calculating quantiles, which can be studied in more detail in an article published in The American Statistician in 1996. In this video I will discuss only one way to obtain the quantiles. The fundamental idea of the quantile is to divide the ordered data set at certain positions.
The values at these positions can be compared with theoretical values, which helps us check whether certain probability models can be used to describe the observed data. The quantile p is the value that splits the ordered data set so that p times 100 percent of the data lie below it and (1 - p) times 100 percent lie above it. For example, if p = 10 percent, we need to find the value that separates the ordered data into 10 percent below and 90 percent above.
Before we see how to calculate the quantile p, note that by definition the median is the fifty percent quantile. In addition, the first percentile is the one percent quantile, the second percentile is the two percent quantile, and so on. The first decile is the ten percent quantile, the second decile is the twenty percent quantile, and so on.
The first quartile is the 25 percent quantile, the second quartile is the median or fifty percent quantile, and the third quartile is the 75 percent quantile. So it is enough to learn what a quantile is; all of these concepts are particular cases. To calculate the quantile p, you must first sort the data, that is, arrange them in ascending order.
A simple way to calculate the quantile p is to find the value that occupies the position p times (n + 1), where n is the number of observations. Note that this position will not always be an integer. It can even be less than one or greater than n.
If it is less than or equal to 1, the quantile is the minimum of the data set. If it is greater than or equal to n, the quantile is the maximum of the data set. Consider the 11 annual wages of a micro company and their ordered values.
The number 13 occupies the first, second, third and fourth positions. The number 15 occupies the fifth position, the number 16 occupies the sixth position, and so on until the eleventh position, which is occupied by the number 250. To find the fifty percent quantile, we multiply p by (n + 1).
In this case p = 0.5 and n = 11, that is, we must multiply 0.5 by 12.
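The position rule can be checked with a short sketch. Python is used here only for illustration (the video itself works in R), and the helper name `quantile_position` is hypothetical. Exact fractions are used so that products such as 0.4 times 12 carry no rounding noise.

```python
from fractions import Fraction

def quantile_position(p, n):
    """Position of the quantile p in an ordered sample of size n: p * (n + 1).

    p is given as a decimal string (for example "0.5") so that it can be
    converted to an exact fraction. This helper is an illustrative assumption.
    """
    return Fraction(p) * (n + 1)

n = 11  # number of wages in the example

# Proportions discussed in this example and their positions.
for p in ("0.5", "0.4", "0.25", "0.75", "0.94"):
    print(p, float(quantile_position(p, n)))
```

An integer result, such as 6 for p = 0.5, points directly at an observed value; a fractional result, such as 4.8, falls between two observed values.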
As a result, the position of the fifty percent quantile is the sixth position in the ordered data set. So the median is exactly the value in the sixth position, that is, 16 thousand reais per year. To find the forty percent quantile, we multiply 0.4 by 12, which gives us position 4.8. This position is not observed in the data set, since it is not an integer.
It indicates a value between 13 and 15. Since position 4.8 is closer to the fifth position than to the fourth, we give greater weight to the value in the fifth position than to the value in the fourth.
The forty percent quantile can be obtained by means of the weighted average 13 x 0.2 plus 15 x 0.8, which gives us 14.6. We will see later the justification for this formula by means of a rule of three. For now, note that 0.8 is the weight for the upper value and 0.2 is the weight for the lower value. The 25 percent quantile is at position 0.25 x 12, that is, position three, which is occupied by the value thirteen. So the first quartile = 13.
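The weighted average used for the forty percent quantile can be verified directly. This is only an arithmetic check written in Python (an illustrative choice, not the R code from the video), and the variable names are hypothetical.

```python
from fractions import Fraction

position = Fraction("4.8")          # position of the 40% quantile: 0.4 * 12
lower_value, upper_value = 13, 15   # values at the 4th and 5th positions

weight_upper = position - 4         # 0.8: how far the position is past the 4th
weight_lower = 5 - position         # 0.2: how far the position is before the 5th

# Weighted average of the two neighbouring values.
q40 = lower_value * weight_lower + upper_value * weight_upper
print(float(q40))
```

The two weights add up to 1, and the value nearer to position 4.8 (the fifth one) receives the larger weight.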
The seventy-five percent quantile is at position 9, which is occupied by the value 20. Therefore the third quartile = 20. The 94 percent quantile, in turn, is at position 11.28, which is above the maximum position. In this case, the quantile is the maximum value. In general, the first step is to sort the data and calculate the position of the quantile: p x (n + 1).
We will denote the position of the quantile by pn. If the position of the quantile is less than or equal to 1, then the quantile is the minimum of the data set. If the quantile position is greater than or equal to n, then the quantile is the maximum of the data set.
If the quantile position is an integer value, then the quantile is the value in the data set that occupies that position. The most complicated case occurs when the quantile position is not an integer value. In this case, we need to find two values in the data set that are closest to that position.
The positions of the two values closest to the quantile are given by the floor and the ceiling of the quantile position. The floor of pn, denoted P1, is the largest integer that does not exceed pn, and the ceiling of pn, denoted P2, is the smallest integer above pn. For example, if the position of the quantile is 14.3, then the floor is 14, which is the position of the value just below the quantile, and the ceiling is 15, which is the smallest integer above 14.3. The formula for calculating the quantile p is given by (P2 - pn) multiplied by the value that occupies the position immediately before that of the quantile, plus (pn - P1) multiplied by the value that occupies the position immediately after that of the quantile.
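The floor, the ceiling and the interpolation formula can be sketched as follows. This is a minimal illustration in Python, assuming a non-integer position pn counted from 1; the function name `interpolate` is hypothetical, and the five data values are the first five ordered wages from the example.

```python
import math
from fractions import Fraction

def interpolate(sorted_data, pn):
    """Weighted average of the two ordered values around a non-integer position pn.

    Positions are counted from 1, as in the video, so the value at position k
    is sorted_data[k - 1]. Intended only for non-integer pn strictly between
    1 and len(sorted_data).
    """
    p1 = math.floor(pn)   # position immediately before the quantile
    p2 = math.ceil(pn)    # position immediately after the quantile
    return (p2 - pn) * sorted_data[p1 - 1] + (pn - p1) * sorted_data[p2 - 1]

# Floor, ceiling and weights for the position 14.3 mentioned in the text.
pn = Fraction("14.3")
print(math.floor(pn), math.ceil(pn))
print(float(math.ceil(pn) - pn), float(pn - math.floor(pn)))

# The forty percent quantile of the wages, at position 4.8.
first_five_wages = [13, 13, 13, 13, 15]
print(float(interpolate(first_five_wages, Fraction("4.8"))))
```

For position 14.3 the weights are 0.7 for the value at position 14 and 0.3 for the value at position 15, so the nearer value again counts more.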
Note that P2 - P1 is always equal to 1. Let's try to understand what is behind this formula. As previously seen, the values x(P1) and x(P2) of the ordered data set are simply the observed values closest to the required quantile, and therefore the quantile must be calculated using these values.
Consider that there is no preference in the distribution of points between x(P1) and x(P2), that is, any value between x(P1) and x(P2) would be equally likely to be observed. Under this assumption, the distance between x(P1) and the quantile is to the distance between their positions as the distance between x(P2) and the quantile is to the distance between their respective positions. Solving this proportion, we find the formula for calculating the required quantile.
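One way to write out this rule of three is sketched below. The symbol q for the required quantile and the subscript notation are assumptions introduced here for the derivation; they are not taken from the video.

```latex
% Rule of three: value distances are proportional to position distances.
\[
\frac{q - x_{(P_1)}}{p_n - P_1} = \frac{x_{(P_2)} - q}{P_2 - p_n}
\]
% Cross-multiplying:
\[
(q - x_{(P_1)})\,(P_2 - p_n) = (x_{(P_2)} - q)\,(p_n - P_1)
\]
% Collecting the terms in q:
\[
q\,\bigl[(P_2 - p_n) + (p_n - P_1)\bigr]
  = (P_2 - p_n)\,x_{(P_1)} + (p_n - P_1)\,x_{(P_2)}
\]
% Since P_2 - P_1 = 1, the bracket equals 1:
\[
q = (P_2 - p_n)\,x_{(P_1)} + (p_n - P_1)\,x_{(P_2)}
\]
```

With P1 = 4, P2 = 5, pn = 4.8, x(P1) = 13 and x(P2) = 15, this gives 0.2 x 13 + 0.8 x 15 = 14.6, the value found for the forty percent quantile.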
The code presented on the screen shows a function with two arguments to calculate the quantiles. The first argument is the vector of raw data and the second is the proportion p for the quantile of interest. The function sorts the data, calculates the position pn, and after that checks whether the position is less than one.
If so, it returns the minimum of the data. If not, the function checks whether the position is greater than n. If so, it returns the maximum of the data, and if not, the function checks whether the position is an integer.
If the position is an integer, it returns the value that occupies that position. If not, the function calculates the quantile using the formula studied in this class. To test this function, generate random numbers from a standard normal distribution, then calculate all percentiles using the studied function and using the type 6 algorithm of R.
Finally, check whether all results are identical. We will see in other classes how to calculate theoretical quantiles by means of cumulative distribution functions, but before that we will have to introduce probability models and random variables. Thank you for your attention and see you next time.
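For reference, the function described in this class can be sketched as follows. This is a Python rendering under stated assumptions, not the R code shown on screen: the function name `quantile` and its argument names are hypothetical, and in the wage list the values at positions 7, 8 and 10 are placeholders, because the video only states the values at the other positions.

```python
import math
import random

def quantile(data, p):
    """Quantile p of data, using the position p * (n + 1) and linear interpolation."""
    x = sorted(data)          # step 1: order the data
    n = len(x)
    pn = p * (n + 1)          # step 2: position of the quantile (counted from 1)
    if pn <= 1:               # at or below the first position: minimum
        return x[0]
    if pn >= n:               # at or above the last position: maximum
        return x[-1]
    p1 = math.floor(pn)
    if pn == p1:              # integer position: the value at that position
        return x[p1 - 1]
    p2 = p1 + 1               # non-integer position: weighted average
    return (p2 - pn) * x[p1 - 1] + (pn - p1) * x[p2 - 1]

# Annual wages in thousands of reais. ASSUMPTION: 18, 19 and 100 are
# hypothetical placeholders for positions 7, 8 and 10, which the video
# does not state. The quantiles checked below do not depend on them.
wages = [13, 13, 13, 13, 15, 16, 18, 19, 20, 100, 250]

for p in (0.25, 0.4, 0.5, 0.75, 0.94):
    print(p, round(quantile(wages, p), 2))

# Test suggested in the video: all percentiles of standard normal draws.
random.seed(1)
sample = [random.gauss(0, 1) for _ in range(1000)]
percentiles = [quantile(sample, k / 100) for k in range(1, 100)]
print(percentiles[49])  # the sample median
```

Because the position is computed in floating point, a product such as 0.4 x 12 is stored as a number very slightly different from 4.8, so results are rounded for display. To reproduce the comparison from the video, the same sample would be passed to R's `quantile` with `type = 6`; readers working in Python with NumPy installed could instead try `numpy.quantile` with `method="weibull"`, which the NumPy documentation lists as method 6 of the 1996 article. That comparison is not run here.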