Probability: Types of Distributions

365 Data Science

Video Transcript:

Hello, again! In this lecture we are going to talk about various types of probability distributions and what kind of events they can be used to describe. Certain distributions share features, so we group them into types.

Some, like rolling a die or picking a card, have a finite number of outcomes. They follow discrete distributions. Others, like recording time and distance in track & field, have infinitely many outcomes.

They follow continuous distributions. We are going to examine the characteristics of some of the most common distributions. For each one we will focus on an important aspect of it or when it is used.

Before we get into the specifics, you need to know the proper notation we use when defining distributions. We start off by writing down the variable name for our set of values, followed by the “tilde” sign. This is followed by a capital letter depicting the type of the distribution and some characteristics of the dataset in parentheses.

The characteristics are usually the mean and variance, but they may vary depending on the type of the distribution. Alright! Let us start by talking about the discrete ones.
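As an illustration of the notation just described, a variable X that follows a Normal distribution with a given mean and variance could be written as shown here. The symbols X, N, mu and sigma are conventional choices assumed for this example, not ones named in the lecture.

```latex
X \sim N(\mu, \sigma^2)
```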

We will get an overview of them and then we will devote a separate lecture to each one. So, we looked at problems relating to drawing cards from a deck or flipping a coin. Both examples show events where all outcomes are equally likely.

Such outcomes are called equiprobable and these sorts of events follow a discrete Uniform Distribution. Then there are events with only two possible outcomes – true or false. They follow a Bernoulli Distribution, regardless of whether one outcome is more likely to occur.

Any event with two outcomes can be transformed into a Bernoulli event. We simply assign one of them to be “true” and the other one to be “false”. Imagine we are required to elect a captain for our college sports team.

The team consists of 7 domestic students and 3 international students. We assign the captain being domestic to be “true” and the captain being international to be “false”. Since the outcome can now only be “true” or “false”, we have a Bernoulli distribution.
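The captain example can be sketched in a few lines of Python. This is only an illustration and it rests on an assumption the lecture does not state: every one of the 10 students is equally likely to be elected. The names `p_true` and `bernoulli_trial` are made up for this sketch.

```python
import random

# Team composition from the example: 7 domestic and 3 international students.
DOMESTIC = 7
INTERNATIONAL = 3

# Assumption: every student is equally likely to be elected captain.
p_true = DOMESTIC / (DOMESTIC + INTERNATIONAL)   # P(captain is domestic)
p_false = 1 - p_true                             # P(captain is international)


def bernoulli_trial(p, rng):
    """Return True with probability p, otherwise False."""
    return rng.random() < p


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [bernoulli_trial(p_true, rng) for _ in range(10_000)]
share_true = sum(trials) / len(trials)

print(f"P(true) = {p_true:.1f}, P(false) = {p_false:.1f}")
print(f"Simulated share of 'true' outcomes: {share_true:.3f}")
```

Repeating the single trial many times, as the loop does, should give a share of “true” outcomes close to `p_true`.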

Now, if we carry out a similar experiment several times in a row we are dealing with a Binomial Distribution. Just like the Bernoulli Distribution, the outcomes for each iteration are two, but we have many iterations. For example, we could be flipping the coin we mentioned earlier 3 times and trying to calculate the likelihood of getting heads twice.
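The coin example can be worked out directly. The sketch below assumes a fair coin, so the chance of heads on each flip is 0.5, and uses the standard Binomial probability formula; the helper name `binomial_pmf` is chosen for this illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumption: a fair coin, so P(heads) = 0.5 on every flip.
p_two_heads = binomial_pmf(k=2, n=3, p=0.5)
print(p_two_heads)  # 0.375

# The probabilities of 0, 1, 2 and 3 heads together cover every outcome.
total = sum(binomial_pmf(k, 3, 0.5) for k in range(4))
print(total)  # 1.0
```

Three of the eight equally likely sequences of three flips contain exactly two heads, which matches the 0.375 above.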

Lastly, we should mention the Poisson Distribution. We use it when we want to test out how unusual an event frequency is for a given interval. For example, imagine we know that so far LeBron James averages 35 points per game during the regular season.

We want to know how likely it is that he will score 12 points in the first quarter of his next game. Since the time interval is shorter, our expectations for the outcome should change with it. Using the Poisson distribution, we are able to determine the chance of LeBron scoring exactly 12 points for the adjusted time interval.
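A rough sketch of that calculation follows. It assumes a game has four quarters of equal weight, so the expected score for one quarter is a quarter of the game average, and it treats points as a simple Poisson count. Both are simplifications made for this illustration, and the names `poisson_pmf` and `lam_quarter` are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return lam**k * exp(-lam) / factorial(k)


points_per_game = 35     # average from the example
quarters_per_game = 4    # assumption: four equally productive quarters

# Adjust the expected count to the shorter interval.
lam_quarter = points_per_game / quarters_per_game

p_12 = poisson_pmf(12, lam_quarter)
print(f"Expected points in one quarter: {lam_quarter}")
print(f"P(exactly 12 points in the quarter): {p_12:.4f}")
```

The key step is rescaling the average to the interval in question before plugging it into the formula.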

Great, now on to the continuous distributions! One thing to remember is that since we are dealing with continuous outcomes, the probability distribution would be a curve as opposed to unconnected individual bars. The first one we will talk about is the Normal Distribution.

The outcomes of many events in nature closely resemble this distribution, hence the name “Normal”. For instance, according to numerous reports throughout the last few decades, the weight of an adult male polar bear is usually around 500 kilograms. However, there have been records of individual bears weighing anywhere between 350kg and 700kg.

Extreme values, like 350 and 700, are called outliers and do not feature very frequently in Normal Distributions. Sometimes, we have limited data for events that resemble a Normal distribution. In those cases, we use the Student’s-T distribution.

It serves as a small sample approximation of a Normal distribution. Another difference is that the Student’s-T accommodates extreme values significantly better. Graphically, that is represented by the curve having fatter “tails”.

Now imagine only looking at the recorded weights of the last 10 sightings across Alaska and Canada. With so few observations, any single extreme value would make up a much bigger part of the sample than it should. Overall, this results in more values extremely far away from the mean, so the curve would probably more closely resemble a Student’s-T distribution than a Normal distribution.
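The “fatter tails” can be seen by comparing the two density curves at points far from the mean. The sketch below uses standardized units (distance from the mean in standard deviations) and assumes 9 degrees of freedom, which is the usual choice for a sample of 10 observations; that link to the 10 sightings is an assumption of this example. The helper `student_t_pdf` is written out by hand from the standard density formula.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def student_t_pdf(x, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


standard_normal = NormalDist(mu=0.0, sigma=1.0)
df = 9  # assumption: 10 observations, so 10 - 1 degrees of freedom

print("x     normal     student-t")
for x in (0.0, 2.0, 3.0, 4.0):
    print(f"{x:<5} {standard_normal.pdf(x):.5f}    {student_t_pdf(x, df):.5f}")
```

Near the mean the Normal curve is slightly higher, while far from the mean the Student’s-T curve stays higher, which is what the fatter tails refer to.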

Good job, everyone! Another continuous distribution we would like to introduce is the Chi-Squared distribution. It is the first asymmetric continuous distribution we are dealing with as it only consists of non-negative values.

Graphically, that means that the Chi-Squared distribution always starts from 0 on the left. Depending on its degrees of freedom, the curve of the Chi-Squared graph is usually skewed to the right: most of its mass sits near 0 and a long tail stretches toward larger values. Unlike the previous two distributions, the Chi-Squared does not often mirror real life events.
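One way to see this shape is to build Chi-Squared values by hand. The sketch below relies on a standard fact not covered in the lecture: a Chi-Squared value with `df` degrees of freedom can be produced by summing `df` squared standard Normal draws. The choice of 3 degrees of freedom and the helper name are arbitrary picks for this illustration.

```python
import random
from statistics import mean, median


def chi_squared_sample(df, rng):
    """One Chi-Squared draw: the sum of df squared standard Normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(0)  # fixed seed so the run is repeatable
df = 3                  # hypothetical degrees of freedom
samples = [chi_squared_sample(df, rng) for _ in range(10_000)]

print(f"smallest value: {min(samples):.4f}")  # never below 0
print(f"median:         {median(samples):.4f}")
print(f"mean:           {mean(samples):.4f}")
```

Every value is non-negative, and the mean lands above the median because a few large values in the long tail pull it upward.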

However, it is often used in Hypothesis Testing to help determine goodness of fit. The next distribution on our list is the Exponential distribution. The Exponential distribution is usually present when we are dealing with events that are rapidly changing early on.

An easy-to-understand example is how online news articles generate hits. They get most of their clicks when the topic is still fresh. The more time passes, the more irrelevant it becomes and interest dies off.
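The news article example can be sketched with the Exponential density and its cumulative form. The rate of 0.5 per hour is a hypothetical number picked for illustration, not a figure from the lecture, and the function names are made up for this sketch.

```python
from math import exp


def exponential_pdf(t, rate):
    """Density at time t: highest at t = 0 and decaying afterwards."""
    return rate * exp(-rate * t) if t >= 0 else 0.0


def exponential_cdf(t, rate):
    """Share of all events that have happened by time t."""
    return 1 - exp(-rate * t) if t >= 0 else 0.0


rate = 0.5  # hypothetical decay rate per hour

print("hours  density  share of clicks so far")
for hours in (0, 1, 2, 6, 12):
    density = exponential_pdf(hours, rate)
    share = exponential_cdf(hours, rate)
    print(f"{hours:<6} {density:.4f}   {share:.4f}")
```

The density column drops quickly in the first few hours, while the cumulative share climbs fast early on and then levels off, mirroring how interest dies off.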

The last continuous distribution we will mention is the Logistic distribution. We often find it useful in forecast analysis when we try to determine a cut-off point for a successful outcome. For instance, take a competitive e-sport like Dota 2.

We can use a Logistic distribution to determine how much of an in-game advantage at the 10-minute mark is necessary to confidently predict victory for either team. Just like with other types of forecasting, our predictions would never reach true certainty.
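A minimal sketch of such a cut-off calculation follows. Everything numeric here is hypothetical: the advantage is measured in an unspecified in-game unit, an even game is assumed to sit at an advantage of 0, and the scale of 2000 is invented for illustration. In practice these parameters would have to be estimated from match data.

```python
from math import exp, log


def logistic_cdf(x, loc, scale):
    """Probability of winning given an advantage of x."""
    return 1 / (1 + exp(-(x - loc) / scale))


def logistic_quantile(p, loc, scale):
    """Advantage needed for the win probability to reach p (0 < p < 1)."""
    return loc + scale * log(p / (1 - p))


loc = 0.0       # hypothetical: no advantage means a 50/50 game
scale = 2000.0  # hypothetical spread, in in-game advantage units

cutoff = logistic_quantile(0.9, loc, scale)
print(f"Advantage needed for a 90% win probability: {cutoff:.0f}")
print(f"Win probability at that advantage: {logistic_cdf(cutoff, loc, scale):.2f}")
```

Note that `logistic_quantile` is undefined at p = 1, which matches the point that predictions never reach true certainty.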