What types of data do we find in statistical analysis? My name is Alexandre Patriota, I am a professor of statistics, and in this video I will talk about the types of data that can be observed in practice and the models these data are associated with.
The Science of Statistics
First, before thinking about more abstract models, we need to characterize the data observed in studies.
It is worth mentioning that there are at least two types of study: experimental and observational. Each type of study induces different relationships between observations and calls for different statistical models to describe the uncertainty of these relationships. In this video I will focus only on data types. The types of studies will be covered in future videos.
Data set
What is a data set?
The data set is one of the products of a study. It contains the main characteristics that one is interested in studying in a population or sample. These characteristics may be qualitative or quantitative, and it is from the data set that inferential analyses are made.
Data types
Before any inferential analysis, we must describe and summarize the data to understand which statistical models can be used. The first step is the characterization of data types.
As an example, we will consider deforestation in the Legal Amazon, which comprises nine states: Acre, Amapá, Amazonas, Pará, Rondônia, Roraima, and Tocantins, in the North; Mato Grosso, in the Central-West; and Maranhão, in the Northeast. From the INPE website we obtained data on deforestation in each of these states since 2004. Here I will present only the data from 2019, in order to illustrate the types of data that we can observe.
The table presents five variables in its columns: the state, deforestation, the total area, the classification of deforestation per thousand square kilometers, and the population. They are called variables because their values are not constant and vary according to natural rules or laws that are known in some cases and unknown in others. For example, the state variable can take the values Acre, Amapá, Amazonas, Pará, Rondônia, Roraima, Tocantins, Mato Grosso, or Maranhão.
We know exactly what rules define the states and their respective total areas. But not all variables have known rules. Note that although we know how deforestation occurs, we cannot explain in advance which factors contribute to its increase or decrease.
In other words, we do not know exactly what the causes of deforestation are, and therefore this variable has a source of uncertainty that influences its value at some point in the study. Deforestation, total area, and population are numerical variables, and therefore we can perform arithmetic operations on each one. That is, we can add the values that each of them assumes and present the total, as shown in the table.
The same cannot be done with the variables "State" and "classification of deforestation". So we have two types of variables: qualitative variables, which describe categories, names, and qualities, and quantitative variables, which describe numerical quantities that can be added, subtracted, multiplied, and divided.
Qualitative variables
Qualitative variables are further subdivided into nominal and ordinal. The categories of a nominal variable cannot be ordered quantitatively, as is the case with the variable "State". Other nominal variables are country, region, sex, and religion, among others. Note that we can order countries according to area or population, but in these specific cases the variable would not be country but area or population.
So make sure that the variable in question is the one whose type you mean to determine. Ordinal variables, in turn, can be ordered according to a quantitative criterion inherent to the variables in question. This is the case with the classification of deforestation.
Deforestation greater than 2 thousand square kilometers is greater than deforestation between 1 and 2 thousand square kilometers, which, in turn, is greater than deforestation of less than 1 thousand square kilometers. Other ordinal variables are education, social class, corporate position, and severity of a disease, among others. Those who have completed higher education have more schooling than those who completed only elementary education, who, in turn, have more schooling than those who are illiterate.
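The ordering of the deforestation classes can be sketched in a few lines of Python. This is only an illustration: the function names, the category labels, and the treatment of values exactly equal to 1 or 2 are assumptions, and the input values are made up rather than taken from INPE.

```python
# Ordered categories, from lowest to highest (labels are illustrative).
LEVELS = ["less than 1", "between 1 and 2", "greater than 2"]


def classify(deforestation_thousand_km2: float) -> str:
    """Map a deforestation value (in thousand km^2) to its ordinal category.

    Assumption: values exactly equal to 1 or 2 fall in the middle class.
    """
    if deforestation_thousand_km2 < 1:
        return LEVELS[0]
    if deforestation_thousand_km2 <= 2:
        return LEVELS[1]
    return LEVELS[2]


def rank(level: str) -> int:
    """Position of a category in the ordering; only the order is meaningful."""
    return LEVELS.index(level)


# Hypothetical inputs, not INPE figures.
low, high = classify(0.4), classify(2.7)
print(low, "<", high, ":", rank(low) < rank(high))
```

The categories can be compared through their rank, but the ranks themselves are not quantities: adding or averaging them would have no clear meaning.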
Those in the upper middle class have more income than those in the lower middle class, and so on.
Quantitative variables
Quantitative variables are subdivided into discrete and continuous. Discrete variables describe a countable quantity, that is, their potential values can all be listed in order, as is the case with the population. The population of Acre in 2019 was 881,935 inhabitants; the next potential values would be 881,936, 881,937, and so on.
Note that we were able to create an ordered enumeration of all potential values of the population variable, and therefore this variable is considered discrete. Continuous variables describe measurements whose potential values are numbers that cannot be enumerated in an unambiguous order. This is the case for the variables "Deforestation" and "Total area".
Consider the total area of Acre. What would be the next potential value? Note that we are unable to list a sequence as we did in the previous case, and that is why this variable is considered continuous.
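The contrast between the two cases can be illustrated with a small Python sketch. The helper names are hypothetical and the area values are made up; only the population figure comes from the video. Exact fractions are used so that the arithmetic is not affected by rounding.

```python
from fractions import Fraction


def next_count(n: int) -> int:
    """Discrete case: every count has a well-defined next potential value."""
    return n + 1


def value_between(a: Fraction, b: Fraction) -> Fraction:
    """Continuous case: the midpoint lies strictly between two distinct values."""
    return (a + b) / 2


population_acre = 881_935  # figure quoted in the video
print(next_count(population_acre))  # 881936

# Hypothetical area and a hypothetical candidate for the "next" area.
area = Fraction(100)
candidate = Fraction(101)
for _ in range(3):
    # Whatever candidate we propose, a closer value always exists.
    candidate = value_between(area, candidate)
    print(candidate)  # 201/2, then 401/4, then 801/8
```

For a count, the question "what comes next?" has a single answer. For a measurement, any proposed answer can be replaced by a value closer to the starting point, so no enumeration in order is possible.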
In practice, all recorded quantitative variables are discrete, since a classical computer can only store a finite number of bits, that is, the values are truncated, according to some criterion, at some decimal place. However, continuous variables and continuous models are crucial in the development of approximate inferential statistical theory, and it is important to know which variables have non-enumerable potential values in order to define the appropriate statistical model.
Related models
Each type of variable is associated with a type of statistical model. For nominal qualitative variables, we can use, for example, logistic regression models or item response theory models.
For ordinal qualitative variables, we can use ordinal regression models and multivariate logistic models, among others. For discrete quantitative variables, we can use, for example, Poisson models, binomial models, and geometric models, among others. For continuous quantitative variables, we can use the normal model, the elliptical model, the skew-normal (asymmetric normal) model, the gamma model, and the beta model, among many others.
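As a compact summary, the association between variable types and example models mentioned in the video can be written as a lookup table. This is a sketch only: the dictionary and function names are hypothetical, the lists are examples rather than an exhaustive catalogue, and "skew-normal" stands for the asymmetric normal model.

```python
# Example models for each variable type, as listed in the video.
EXAMPLE_MODELS = {
    "qualitative nominal": ["logistic regression", "item response theory"],
    "qualitative ordinal": ["ordinal regression", "multivariate logistic"],
    "quantitative discrete": ["Poisson", "binomial", "geometric"],
    "quantitative continuous": [
        "normal", "elliptical", "skew-normal", "gamma", "beta",
    ],
}

# Type of each column of the deforestation table, as classified in the video.
COLUMN_TYPES = {
    "state": "qualitative nominal",
    "deforestation": "quantitative continuous",
    "total area": "quantitative continuous",
    "classification of deforestation": "qualitative ordinal",
    "population": "quantitative discrete",
}


def candidate_models(column: str) -> list[str]:
    """Return example models for the type of the given column."""
    return EXAMPLE_MODELS[COLUMN_TYPES[column]]


print(candidate_models("population"))  # ['Poisson', 'binomial', 'geometric']
```

The table is only a starting point: the choice of a model also depends on the type of study and on the relationships between the observations.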
Data storage
In practice, data are stored in a raw data table in which each column presents the observed values of a variable and each row refers to a sample unit. The raw data are not always in a suitable format to be stored on a computer, so an initial task is to make the necessary transformations so that the computer reads the variables correctly, taking into account their characteristics.
For example, the numbers 1, 2, 3, 4, and 5 are often used to represent the categories strongly disagree, disagree, neutral, agree, and strongly agree. In these cases, the analyst needs to specify the nature of the data so that the computer does not improperly apply arithmetic operations to them. We will see in other videos how to summarize the raw data to interpret the results observed in the experiment.
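One possible way to declare the nature of the agreement scale in Python is to store the codes 1 to 5 as an ordered enumeration instead of plain integers. The class name and the raw answers below are hypothetical; this is a minimal sketch, not the only way to do it.

```python
from enum import Enum
from functools import total_ordering


@total_ordering
class Agreement(Enum):
    """Ordinal scale: the codes define an order, not a quantity."""
    STRONGLY_DISAGREE = 1
    DISAGREE = 2
    NEUTRAL = 3
    AGREE = 4
    STRONGLY_AGREE = 5

    def __lt__(self, other):
        # Comparison is allowed because the categories are ordered.
        if isinstance(other, Agreement):
            return self.value < other.value
        return NotImplemented


raw_codes = [4, 2, 5]  # hypothetical raw answers
answers = [Agreement(code) for code in raw_codes]

print(max(answers).name)  # STRONGLY_AGREE

try:
    answers[0] + answers[1]  # arithmetic is deliberately not defined
except TypeError:
    print("addition is not defined for categories")
```

Because `Agreement` is a plain `Enum` and not an integer type, comparisons work but sums and means raise an error, which protects against treating the codes as quantities by accident.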
Thank you all for your attention and see you next week.