Mathematics of Maximizing Profit in Gambling/Investing - Kelly Criterion

EpsilonDelta

In this video, we introduce the Kelly criterion which is the formula that gives optimal risk that ma...

Video Transcript:

Suppose there is a game with a probability of winning of 1/3, and if you win, you get 1.8 times the amount of money that you risked on the play, and if you lose, you lose the entire wager. Then we can calculate the expected value of the payout by summing, over all possible events, the probability of each event times the payout of each event, and once we calculate the expected value for this game, it comes out to a negative value.
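As a quick check of that expected value, here is a minimal sketch. The transcript does not say whether "1.8 times" means the net winnings or the total amount returned including the stake, so both readings are computed; the function name `expected_value` is my own.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of probability * payout over all (probability, payout) pairs."""
    return sum(prob * payout for prob, payout in outcomes)

p_win = Fraction(1, 3)
p_lose = 1 - p_win

# Reading 1 (assumption): a win pays a net 1.8 units per unit risked.
ev_net = expected_value([(p_win, Fraction(18, 10)), (p_lose, Fraction(-1))])

# Reading 2 (assumption): a win returns 1.8 units in total, a net gain of 0.8.
ev_total = expected_value([(p_win, Fraction(8, 10)), (p_lose, Fraction(-1))])

print(float(ev_net), float(ev_total))  # both are negative
```

Either way the expected payout per unit risked is below zero, which is all the argument needs.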

You have no edge at all in this game, so you may get lucky the first few times, but there is no long-term winning strategy. So you will inevitably donate all your money to the house if you keep playing. Now let's consider a second game where, if you win, you get paid the amount of money you risk.

Or in other words, you double your betting amount. Suppose that you have a hundred percent chance of winning this game. Then it doesn't quite matter what happens if you lose since you are guaranteed to win.

And you are expected to double your money each time you play the game. Then what is your long-term winning strategy? Well you essentially have a money printer, so you might as well put your entire life savings on each play.

Even better, you can maximize your profit by playing on margin, meaning you should borrow the biggest amount of money that you possibly can and put up all your assets, including your kidney and liver, as collateral for the margin. We looked at two extreme examples: one where you have no edge at all, and there is no long-term winning strategy, and another where you are guaranteed to win each time and your winning strategy is to bet everything each time. So the really interesting problem is when your expected payout is positive, but you are not guaranteed to win.

Let's take a look at one such example: Suppose you buy some S&P 500 index fund, which is a fund that takes the top 500 American companies and weights them by each company's value. And let's say you implement a strategy to sell at a profit when the price doubles and cut your losses when the price drops by 25 percent. Taking the historical data starting from 2003 and backtesting this strategy, I calculated that the probability of a win is roughly 59 percent.

Since we are expected to win more often than we lose, and each win pays much more than what we lose on each loss, the expected payout per play obviously has to be a positive value. This game definitely is worth playing, but how much should you risk each play? One possible strategy commonly used by the members of the subreddit WallStreetBets is called food stamps or Lambo, which is a strategy to risk everything in order to maximize your potential profit.

So for this strategy, we can maximize the expected value by maximizing the value of r. So you take 4x leverage to risk everything you own. Naively that is the optimal strategy since you are maximizing the expected average profit.

You may get lucky the first few times, but if you continue to go all-in on each play, you will eventually lose everything. The very act of maximizing expected payout actually leads to ruin. So there must be a better long-term strategy, right?

Let's start by defining the problem. Suppose there is a game where you know the probability of winning, which we will call p, and q, the probability of losing, which equals 1 minus p. The game is repeated n times, and each play, you will risk a fixed percentage r of your portfolio.

Each time you win you will gain tr, and each time you lose you will lose sr. And let's put on some auxiliary conditions. The amount you can lose is at most your entire portfolio, so you cannot have a negative balance, and you are not guaranteed to win.

And we want n to be sufficiently large since we are looking for a long-term sustainable strategy. So you are expected to win some and lose some, but still end up profitable in the long run. Then what fraction of your portfolio should you risk each play to maximize the long-term average profit?

You may have noticed how we are restricted to binary scenarios, but in the real world, there are many games with more than two outcomes. For example, the Powerball lottery has nine different ways you can win, each with a different payout. And you can even have a game with a continuum of outcomes that depend on some continuous distribution.

As interesting as those scenarios sound, they are too difficult to cover in a single video, so for this video we will focus only on binary outcomes. Before we start solving the problem, I want to start with the answer, so we can work our way towards it. So the optimal risk we should take per play is p, the probability of a win, over s, the percentage lost per loss, minus q, the probability of a loss, over t, the percentage gained per win: r = p/s - q/t.

The amount you should risk grows as the probability of a win goes up, as the first term is linearly proportional to p, and the optimal risk also goes up as the percentage gain per win grows, since the second term is negative and inversely proportional to t. This intuitively makes sense, as an investment is considered better if you win more often, and each time you win you win more money. Similarly, the optimal risk decreases as the probability of a loss and the amount lost per loss go up.
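The formula is simple enough to try directly. The sketch below assumes the index-fund example maps to p = 0.59, t = 1 (the price doubles) and s = 0.25 (a 25 percent drop); the function name `kelly_fraction` is my own.

```python
import math

def kelly_fraction(p, t, s):
    """Optimal risk r = p/s - q/t for a binary game.

    p: probability of winning
    t: fractional gain per win, per unit risked
    s: fractional loss per loss, per unit risked
    """
    q = 1.0 - p
    return p / s - q / t

# Index-fund example from earlier (assumed mapping: t = 1, s = 0.25).
r_index = kelly_fraction(0.59, t=1.0, s=0.25)
print(round(r_index, 4))

# A game with worse odds and a smaller gain than loss gives a negative r.
print(kelly_fraction(1 / 3, t=0.8, s=1.0) < 0)
```

Under these assumptions the suggested risk is about 1.95 times the portfolio, which is leveraged but well short of the 4x all-in position.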

This formula is known as the Kelly Criterion. It's best to look at some extreme cases of the formula. So if we have a perfectly balanced game with equal chance of win and loss, and equal gain and loss, the optimal risk is nothing, meaning this isn't a game worth playing.

If we instead have a game where you're less likely to win, and your gain per win is less than your loss per loss, then your optimal r value comes out negative. What this means is that you don't want to be playing the game; you want to be sitting on the other side and taking a short position, being the casino or the insurance company that is stealing money from the players. Now what if the game is extremely good in your favor?

Then your r value can come out to something bigger than one, meaning you should invest not only your entire portfolio, but also take on leverage and invest with borrowed money. Notice how I put quotation marks around the word average. The first question I want to ask is: what even is an average?

There is a generalized mathematical definition of the word as mathematicians love to abstractify everything. But for now, we will just think of it as some measure of central tendency or a good representative element of a dataset. And the average that we are all very familiar with is the arithmetic mean, or sometimes just called THE mean when the context is clear.

And we compute it by adding up all the numbers in the data set and dividing by how many numbers there are. There is another kind of mean called the quadratic mean, or sometimes called root mean square and we get it by taking the arithmetic mean of all the squares of numbers then taking the square root. Another common mean that we use is called the harmonic mean and we get it by taking the arithmetic mean of all the reciprocals then taking the reciprocal.

And using a similar idea, we can extend the definition to work for any real exponent alpha, except for zero. And even for zero, we can just take the limit as alpha goes to zero to define the zero mean, which is known as the geometric mean. As alpha grows bigger, more weight is placed on bigger values, and as alpha gets smaller, more weight is placed on smaller values.
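For reference, the generalized (power) mean is usually written as below. The on-screen formula is not part of the transcript, so the notation here (positive data x_1, ..., x_n and M_alpha for the mean) is my own.

```latex
M_\alpha(x_1,\dots,x_n) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\alpha}\right)^{1/\alpha}
\quad (\alpha \neq 0),
\qquad
M_0 = \lim_{\alpha \to 0} M_\alpha = \left(\prod_{i=1}^{n} x_i\right)^{1/n}
```

Setting alpha to 1, 2, and -1 gives the arithmetic, quadratic, and harmonic means respectively.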

So if we take the limit as alpha goes to negative infinity, we get the smallest value of the dataset, and if we take the limit to the positive infinity, we get the maximum value. Now when is it appropriate to use these different types of means? Imagine one person having three dollars and another person having five dollars.

If Karl Marx saw this, he would want to equally distribute wealth. Then the most appropriate average would be the arithmetic mean. Now consider a second situation where you are a successful day trader and tripled your money the first year and 5x'ed your money the second year. Then what is your average growth per year?

It is equivalent to roughly 3.873x'ing your money each year. So when we have compound growth, and want to calculate the average growth, the geometric mean is the most appropriate.

Now consider the next situation where person A can finish a task in three days, and person B can finish a task in five days. Then what is the average number of days it takes a person to finish the task? If both of them are working on the task concurrently, we can add the rates together and we can divide by 2 to get the average rate.

So it means that person A can finish 1/3 of a job per day, and person B is working at 1/5 of a job per day. Then it is equivalent to each person finishing 4/15 of a job per day. So on average it takes 15/4 days to finish a task, and this is an appropriate situation to take the harmonic mean.

Now consider the next scenario where there are two chambers filled with the same homogeneous ideal gas and each chamber contains the same amount of gas. Now suppose the mean speed of the molecules in the left chamber is 3, and the right chamber is 5. So the right chamber is a bit hotter.

Now, once you connect the two chambers together, they will eventually reach a thermal equilibrium, and mean speed of the molecules on both sides of the chamber will eventually balance out to a number between 3 and 5. So what is this average speed after mixing? Well we know that the temperature on the left and the right will eventually reach an arithmetic mean, and since temperature is, roughly speaking, the average kinetic energy of the molecules, it is proportional to the square of the mean speed.

And for those of you who are curious what the constant of proportionality is, it is π times the mass of a gas molecule, all over 8 times the Boltzmann constant. So now to find the mean speed after thermal equilibrium, we would have to take the root mean square of 3 and 5. We looked at multiple different scenarios to use different kinds of means.
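The four scenarios all use the numbers 3 and 5, so they can be compared side by side. This is a small sketch of the generalized mean; the function name `power_mean` is my own, and it assumes positive inputs.

```python
import math

def power_mean(values, alpha):
    """Generalized mean of positive numbers; alpha = 0 gives the geometric mean."""
    n = len(values)
    if alpha == 0:
        # Limit case: exponentiate the arithmetic mean of the logs.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** alpha for v in values) / n) ** (1.0 / alpha)

data = [3.0, 5.0]
for name, alpha in [("harmonic", -1), ("geometric", 0), ("arithmetic", 1), ("quadratic", 2)]:
    print(f"{name:>10}: {power_mean(data, alpha):.3f}")
```

The results come out as 15/4, the square root of 15, 4, and the square root of 17, in increasing order.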

Notice how as alpha grows bigger, the mean takes more weight of the bigger value which is 5. So from these four examples, the harmonic mean is the smallest and the quadratic mean is the biggest. There are two more averages that are very commonly used.

How much does an average American make per year? There are ridiculous outliers like billionaires, which would skew the arithmetic mean too far to the right, and zero-income people, which would make the geometric mean zero. So instead of any of the generalized means, we typically take the median in practice, which is the middle number.

And it is a pretty good representative of how much an average person makes. And another commonly used average is the mode. So when we say an average American owns a smartphone, we probably mean most Americans.

And the majority is a pretty good representative of an average person. Now we have a solid notion of what an average is, so we are ready to take the first step. Each time we win, we gain tr, so it is the same as multiplying how much money we have by (1+tr).

This essentially is the same as 30% gain really meaning 130% the original amount. And we can come up with something similar for each time we lose. Suppose we win k times out of n games, then we will lose (n-k) games.

Multiplying these factors together gives (1+tr)^k × (1-sr)^(n-k), and I will call this big R for the total return. And we would want to multiply this by the initial capital to see how much money we have after n games. And for this expression, I want to use the convention that 0^0 is equal to 1, which sounds like nonsense.

But suppose you go all-in on each play: if you get lucky every time, you have multiplied your money n times by the factor of (1+tr), and if you get unlucky just once, then you have lost all your money. So to make sense out of this situation, we will choose to define that 0^0 is equal to 1 for this expression. Of course we can treat this rigorously with limits, but we just want a model that works.
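Here is a minimal sketch of the total return as a function of the number of wins. The parameter values are hypothetical; conveniently, Python's power operator already evaluates `0.0 ** 0` as `1.0`, which matches the convention chosen here.

```python
def total_return(k, n, r, t, s):
    """Multiplier on the starting capital after k wins and n - k losses."""
    return (1 + t * r) ** k * (1 - s * r) ** (n - k)

# Hypothetical even-money game (t = s = 1), risking 10% per play.
print(total_return(k=6, n=10, r=0.1, t=1.0, s=1.0))

# All-in (r = 1/s): winning every play relies on 0**0 == 1 ...
print(total_return(k=10, n=10, r=1.0, t=1.0, s=1.0))  # 1024.0
# ... while a single loss wipes out everything.
print(total_return(k=9, n=10, r=1.0, t=1.0, s=1.0))   # 0.0
```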

Now we are expected to win about np games, where p is the probability of winning one game. And we are expected to lose about nq games. So this is the average total return, right?

Earlier, we have talked about all the different types of averages, and which one is this one? Well to really make sense out of this, we need to introduce a new toolkit called random variables. Consider a scenario where you toss an unfair coin 3 times.

There are eight different events possible, and we can calculate the probability of each event by multiplication since each throw is independent. Then we can come up with the random variable for the total number of heads so X of the event HHT would be 2. This example illustrates what a random variable is.

It simply is a function that takes an event and assigns a numerical value. And by doing so, we turn the raw non-numerical data, such as sequence of heads and tails, into numerical data and it is much easier to work with. Well that is the more modern interpretation, but before the rise of set theory and measure theory, mathematicians naively thought of it as a variable that can take on the values of a random event.

And for this scenario, X can take on the values 0, 1, 2, and 3. And in either perspective, we can calculate the probability of X = 2 by adding the probabilities of all possible ways of two heads showing up. And we could do this similarly for 3.

But what's the probability of 7 heads showing up out of 3 tosses? That's just impossible, so the probability is 0. Now we can plot the probability for each value of the random variable, and we call this the probability mass function.

This is one very specific example of a class of random variables that we call binomial random variables, denoted B, with two parameters: n for the number of trials, and p for the probability of winning one trial. And we can just write B when the context is clear. And since B represents the number of wins, it can take on the values from 0 to n.

How do we calculate the probability of getting two wins out of four trials if the probability of winning is 1/3? Well, there are six different ways you can win twice, and each of these events has a probability of (1/3)^2 × (2/3)^2, and we can get 6 by taking 4 choose 2. So doing the very same thing in general, the probability of B = k is n choose k × p^k × q^(n-k) where q is (1 - p).
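The same calculation can be sketched in a few lines, using exact fractions so that nothing is lost to rounding. The function name `binomial_pmf` is my own.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(B = k) for a binomial random variable with n trials and win probability p."""
    if k < 0 or k > n:
        return 0  # impossible outcome, such as 7 heads out of 3 tosses
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(comb(4, 2))                          # 6
print(binomial_pmf(2, 4, Fraction(1, 3)))  # 8/27
```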

Next I want to come up with the notion of the arithmetic mean for random variables and how it compares with the arithmetic mean of a statistical dataset. So let's say there are 10000 independent experiments, each consisting of ten dice throws, and for each experiment, we want to count the number of times six shows up. We can create a histogram to tally up the results and we can see that it is fairly common to see 6 showing up one or two times, but getting 10 sixes in a row is practically impossible. The probability model that represents one instance of the experiment is the binomial distribution of 10 trials and the probability of winning equaling 1/6.

And if we plot the probability mass function, the shape is essentially identical to the histogram because as long as the number of experiments is large enough, the number of experiments where 3 sixes came up out of the total number of experiments should approximately equal the probability of 3 sixes showing up in a single experiment. And this holds for any other values as well. So we can define the expected value, which is the probability equivalent of the arithmetic mean by summing over the product of the value of the random variable times the probability of getting that value.

And as long as the number of experiments is large enough, the arithmetic mean approaches the expected value. Now, what is the expected value of the binomial distribution? Well, this summation is fairly tricky to compute, so for those of you that are interested, I will leave a link in the description below for the derivation.

So for now, let's take a small leap of faith and say it equals np. And it is a reasonable answer that agrees with our intuition, since if we toss a coin 100 times, we expect about 50 heads, and if we roll a die 60 times, we expect about 10 ones. It should be clear now that we need to construct a random variable that depends on the binomial distribution for our problem.
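The leap of faith can at least be checked numerically. This sketch sums k × P(B = k) directly from the definition, with exact fractions, for the coin and die examples; the function name is my own.

```python
from fractions import Fraction
from math import comb

def binomial_expected_value(n, p):
    """E[B] computed directly from the definition: sum of k * P(B = k)."""
    q = 1 - p
    return sum(k * comb(n, k) * p**k * q ** (n - k) for k in range(n + 1))

print(binomial_expected_value(100, Fraction(1, 2)))  # 50
print(binomial_expected_value(60, Fraction(1, 6)))   # 10
```

This is a spot check for two parameter choices, not a proof of the general identity.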

So now we introduce how the transformation of random variables works. Suppose that X is a random variable that can take on the four values -1, 0, 1 and 2, with each value having some probability. Then Y defined as 1/X can take on the reciprocals of the values of X, and the probability of each value of Y is the same as the probability of getting the corresponding X. So as an example, if we are trying to find the probability of Y equaling 1/2, first substitute 1/X for Y, then solve for X.

Notice how the plot of the probability mass function of Y is out of order, so we just have to rearrange it to make it more intuitive to read. Now what if the transformation is not one-to-one? For example consider the square transformation, then Y can only take on three values since both ±1 squared equals 1.

So if we are trying to calculate the probability of Y equaling 1, Y is equal to X^2, and there are two possible values of X with X^2 = 1, so we have to add the two probabilities. So we have to redraw the PMF (probability mass function) by merging the two values. This essentially is how transformations of all discrete random variables work.

Now how do we find the expected value of Y? We can just use the definition of the expected value of Y, which is the sum over all values of Y times the probability of getting that value. We can instead just sum over all possible values of X^2, times the probability of getting each value of X.

And during the summation, values that were not one-to-one will appropriately add together. This essentially illustrates the proof of the discrete version of the law of the unconscious statistician, often called by its acronym LOTUS, and it is one of the most important theorems in probability theory. This theorem is significant since we can find the expected value of a transformed distribution without knowing anything about the transformed distribution.
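The square transformation makes a compact worked example. In the sketch below only the values -1, 0, 1, 2 come from the text; the probabilities are hypothetical, and the helper `transform_pmf` is my own.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical probabilities for X (they only need to sum to 1).
pmf_x = {-1: Fraction(1, 10), 0: Fraction(2, 10), 1: Fraction(3, 10), 2: Fraction(4, 10)}

def transform_pmf(pmf, g):
    """PMF of Y = g(X): probabilities of X values with the same image are added."""
    out = defaultdict(Fraction)
    for x, prob in pmf.items():
        out[g(x)] += prob
    return dict(out)

def square(x):
    return x * x

pmf_y = transform_pmf(pmf_x, square)

# Expected value from the definition, using the merged PMF of Y ...
ev_definition = sum(y * prob for y, prob in pmf_y.items())
# ... and via LOTUS, using only the PMF of X.
ev_lotus = sum(square(x) * prob for x, prob in pmf_x.items())

print(pmf_y[1], ev_definition, ev_lotus)
```

The bars for X = -1 and X = 1 merge into a single bar for Y = 1, and both routes to the expected value agree.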

Now we are ready to formally define R, the total return, as a random variable that depends on the binomial distribution. And this makes sense since B is a random variable that takes on the number of wins. And we want to repeatedly multiply (1 + tr) for each win, and (1 - sr) for each loss.

Let's try taking the expected value and see what happens. And we can use LOTUS to evaluate the expected value. At this point the summation looked too nasty, so I evaluated it using Wolfram Mathematica.

So kudos to you if you can try and simplify it by hand. At this point, it's probably good to stop and think about what this means. In the very beginning, we informally introduced the random variable G to denote the net gain per single play, and we also calculated the expected value of net change per single play.

So we can make the substitution for the expected value of the total return. This expression intuitively makes sense, since if we have a net change of E(G) per play, then we are on average multiplying by (1+E(G)) per play. And this expression is maximized if we take the little r, which is the risk, to be as big as possible.
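The simplification can be checked without a computer algebra system, since the sum is a binomial expansion of (p(1+tr) + q(1-sr))^n. The sketch below assumes E(G) = (pt - qs)r, which is my reading of the net gain per play because the on-screen formula is not in the transcript; the parameter values are hypothetical.

```python
from fractions import Fraction
from math import comb

def expected_total_return(n, p, r, t, s):
    """E[R] via LOTUS: sum over k of (1+tr)^k (1-sr)^(n-k) * P(B = k)."""
    q = 1 - p
    return sum(
        comb(n, k) * p**k * q ** (n - k) * (1 + t * r) ** k * (1 - s * r) ** (n - k)
        for k in range(n + 1)
    )

# Hypothetical parameters, chosen only to exercise the identity.
n, p, r, t, s = 12, Fraction(3, 5), Fraction(1, 4), Fraction(1), Fraction(1, 2)
expected_gain_per_play = (p * t - (1 - p) * s) * r  # E(G), assumed form

print(expected_total_return(n, p, r, t, s) == (1 + expected_gain_per_play) ** n)  # True
```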

It's not too hard to see that maximizing the expected value is not the best strategy. But to provide a concrete example, suppose there is a game where you flip a fair coin, and if you flip heads you quadruple your risk, and if you flip tails you lose your wager. And if we take the strategy to maximize the expected value of the total return, the only way we can win is to consecutively flip heads, and if we flip just one tail we lose all of our money.

As the number of trials gets larger, your chance of becoming rich approaches zero exponentially fast. And you are almost certainly guaranteed to lose everything. Maximizing the expected value was not the best choice, so what if we maximize the median instead?
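To put numbers on the all-in coin game, here is a rough sketch. It assumes a win multiplies the whole bankroll by 4, which is one reading of "quadruple"; the function name is my own.

```python
def all_in_summary(n, p_win=0.5, win_multiplier=4.0):
    """Survival probability and expected bankroll multiple after n all-in plays.

    win_multiplier = 4.0 is an assumption about what "quadruple" means here.
    """
    survival_probability = p_win ** n                  # must win every single play
    expected_multiple = (p_win * win_multiplier) ** n  # a loss contributes zero
    return survival_probability, expected_multiple

for n in (1, 5, 10, 20):
    survive, expected = all_in_summary(n)
    print(n, survive, expected)
```

The expected bankroll doubles with every play while the chance of having any money left halves, so the large average comes entirely from one very unlikely streak.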

For a continuous distribution, the definition of the median is obvious: it is the value that splits the probability into two halves. But we need a more general definition that works for discrete random variables as well. So the definition that we will use for the median is any real value m such that the probability of X ≤ m is at least half, and the probability of X ≥ m is at least half.

So if we look at this binomial distribution X, the total probability less than or equal to 2 is more than half, and the probability greater than or equal to 2 is also more than half. So for this distribution, the median is equal to 2. Let's take a look at another binomial distribution, which happens to be symmetric and bimodal.

And the total probability less than or equal to 2 and greater than or equal to 2 is at least half. So 2 is the median of this distribution. And by the same reasoning, 3 is a median as well.

And if we take any number in between 2 and 3, such as 2.5, that satisfies the definition of the median too. So for this case, the median is the closed interval from 2 to 3.
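This definition translates directly into a check. The sketch below uses a binomial distribution with n = 5 and p = 1/2 as a hypothetical symmetric, bimodal example, since the transcript does not give the parameters of the one on screen.

```python
from fractions import Fraction
from math import comb

def is_median(pmf, m):
    """True if P(X <= m) >= 1/2 and P(X >= m) >= 1/2."""
    at_most = sum(prob for x, prob in pmf.items() if x <= m)
    at_least = sum(prob for x, prob in pmf.items() if x >= m)
    return at_most >= Fraction(1, 2) and at_least >= Fraction(1, 2)

# Hypothetical example: binomial with n = 5, p = 1/2 (modes at 2 and 3).
pmf = {k: comb(5, k) * Fraction(1, 2) ** 5 for k in range(6)}

print([m for m in (1, 2, 2.5, 3, 4) if is_median(pmf, m)])  # [2, 2.5, 3]
```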

Then what is the median of a binomial distribution? It is the expected value rounded to the nearest integer. Well, that was a lie: we don't actually have a closed-form formula for the median in terms of n and p, but we do know a few things about it.

And one thing for sure is that median could either take on np rounded up or down, or maybe both at the same time. This isn't something we could prove quickly, but I want to provide an intuition for big values of n since we are interested in long-term winning strategy. So for large values of n, the binomial distribution starts to look more and more like the bell curve, so the mean and the median should roughly be equal to each other.

But since B can only take on integer values, so does the median. For the rest of the video we will just approximate that the median of the binomial distribution is roughly equal to the mean. Now how do we compute the median of a transformed distribution?

For example, if we have some distribution X, and if we apply a power to it, the probability mass function essentially looks the same since there was no rearrangement in the ordering of the bars. So the median would be represented by the same bar on the left and the right. So the median of the transform is the same as the transform of the median.

Now what if we apply a monotonically decreasing transformation like 2^-X? Then the ordering completely flips, but the middle is still the middle. So the median commutes with the transformation for this case as well.

What if we have a transformation that is not strictly increasing, like the integer part of X divided by 3? I marked where the original median was with the blue arrow, and since this function is not one-to-one, we would have to merge the bars together. Since 4 was the original median, everything up to 4 is at least half.

So everything up to 5 is at least half as well. And everything down to 4 is at least half, so everything down to 3 is at least half as well. So 1, which is the transform of 4, is the median of the new distribution.

So we can just find the median of a transform by taking the transform of the median as long as g is a monotonic function, whether it's increasing or decreasing, strict or not. So can we find the median of the total return? We can write the transformation as a single exponent function.
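The three kinds of monotonic transformation discussed here can be checked on a small example. The distribution below (binomial with n = 8, p = 1/2, whose median is 4) is hypothetical, as are the helper names.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

def is_median(pmf, m):
    """True if P(X <= m) >= 1/2 and P(X >= m) >= 1/2."""
    at_most = sum(prob for x, prob in pmf.items() if x <= m)
    at_least = sum(prob for x, prob in pmf.items() if x >= m)
    return at_most >= Fraction(1, 2) and at_least >= Fraction(1, 2)

def transform_pmf(pmf, g):
    """PMF of g(X), merging bars that land on the same value."""
    out = defaultdict(Fraction)
    for x, prob in pmf.items():
        out[g(x)] += prob
    return dict(out)

# Hypothetical distribution with median 4.
pmf_x = {k: comb(8, k) * Fraction(1, 2) ** 8 for k in range(9)}

transforms = {
    "increasing 2**x": lambda x: Fraction(2) ** x,
    "decreasing 2**-x": lambda x: Fraction(2) ** (-x),
    "non-strict x // 3": lambda x: x // 3,
}
results = {name: is_median(transform_pmf(pmf_x, g), g(4)) for name, g in transforms.items()}
print(results)  # every entry is True
```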

And if the base of the exponent is greater than one, it is increasing, if it's equal to 1, then it is constant, and if it's in between 0 and 1, then it is decreasing. If the base is 0 or negative we run into some issues with the monotonicity, so if we restrict both the numerator (1+tr) and the denominator (1-sr) to be positive, we get a restriction of allowed range for r. Intuitively, 1/s is the maximum possible risk that we can take, which is equivalent to all-in, but what does negative 1/t signify?

That actually is the maximum risk that we can possibly take if we are taking the short position. So now we have a reasonable range for r, and within this range, the transformation is monotonic. So we can compute the median of big R by taking the transformation of the median of the binomial distribution, and since that median can be approximated by np, after some manipulation we arrive at this expression: (1+tr)^(np) × (1-sr)^(nq).

And we can think of this as a single variable function of the risk since other variables are fixed parameters. So this allows us to do some single variable differential calculus. 1/s and -1/t are the roots of this function, so if we graph this function with respect to the risk, it looks something like an upside down parabola with a single maximum.

I normalized the graph so that the peak is always at the same height since it is really what we want to know about. So if we vary the value of p, the probability of winning, the location of the maximum changes. But if we change the value of n, the shape of the graph changes, but where the maximum is does not.
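The claim that n changes the shape but not the location of the peak can be checked with a crude grid search. This sketch assumes the median expression (1+tr)^(np) × (1-sr)^(nq) and reuses the index-fund parameters (p = 0.59, t = 1, s = 0.25); the function names and grid size are my own choices.

```python
def median_return(r, n, p, t, s):
    """Approximate median of the total return: (1+tr)^(np) * (1-sr)^(nq)."""
    q = 1.0 - p
    return (1 + t * r) ** (n * p) * (1 - s * r) ** (n * q)

def argmax_on_grid(n, p, t, s, steps=20000):
    """Grid search for the maximizing r strictly inside (-1/t, 1/s)."""
    lo, hi = -1.0 / t, 1.0 / s
    grid = [lo + (hi - lo) * i / steps for i in range(1, steps)]
    return max(grid, key=lambda r: median_return(r, n, p, t, s))

# n is varied to see whether the location of the peak moves.
for n in (1, 10, 100):
    print(n, round(argmax_on_grid(n, p=0.59, t=1.0, s=0.25), 3))
```

A grid search only locates the peak to within the grid spacing, so this is a sanity check rather than a derivation.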

And this illustrates one of the key techniques in calculus: instead of finding the maximum of the original function, we can instead find the maximum of a transformed function as long as the transform is monotonically increasing. And we can take the natural log as well, which is also a monotonic function, and since it splits products into sums, it makes the calculus so much easier. Now let's take the derivative, which turns logs into reciprocals, and we have to pull out the coefficient by the chain rule.

And we set it equal to zero in order to find that single maximum. At this point all the heavy lifting is done, so I will leave it to you to solve for r, and find that the r that maximizes the median is indeed Kelly's formula. We computed the optimal risk for the mean and the median, so we might as well try the mode, since those three are the averages you learn in elementary school statistics.
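For readers who want to see the remaining algebra, here is one way to carry it out. It assumes the median is approximated by (1+tr)^(np) (1-sr)^(nq) and uses p + q = 1 in the second-to-last step.

```latex
\begin{aligned}
\ln \operatorname{median}(R) &\approx np\,\ln(1+tr) + nq\,\ln(1-sr) \\
\frac{d}{dr}\Big[\, np\,\ln(1+tr) + nq\,\ln(1-sr) \Big]
  &= \frac{npt}{1+tr} - \frac{nqs}{1-sr} = 0 \\
pt\,(1-sr) &= qs\,(1+tr) \\
pt - qs &= (p+q)\,st\,r = st\,r \\
r &= \frac{pt - qs}{st} = \frac{p}{s} - \frac{q}{t}
\end{aligned}
```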

And unlike median, mode is fairly intuitive to define for random variables. It simply is the value that gives the biggest probability, and there can be multiple modes. Then what is the mode of a binomial distribution?

Unlike the median, there is a closed-form formula in terms of n and p. This is a bit of overkill, and just like the median, we will say that the mode is an integer value close to the expected value, and we can approximate it using the expected value. Then what is the mode of a transformed random variable?

If we simply take a discrete random variable and permute the values, then the biggest bar is still preserved. But if we merge some bars, what was originally the biggest may not be the new biggest. So one condition where mode commutes with g is that g is one-to-one.

By the way, this only holds for discrete random variables and fails for continuous random variables in general. So if you can come up with an example, leave it in the comments below. Now what if we stack all the bars into one?

That is, what if the transformation is a constant function? Then every single value, including what was originally the mode, is transformed into the new mode. So mode commutes with g if g is a constant function.

I'm sure there are more interesting examples, but these two conditions are the ones we will be using. Just like how we did for median, the exponential function is either one-to-one or constant as long as the base of the exponent is greater than zero. And we found the appropriate conditions.

So the mode of the transform is the transform of the mode. So it will simplify to the exact same value that the median came out to. So once we find the r that maximizes the mode, we will get Kelly's formula once again.

If we plot the probability mass function of the binomial distribution, it looks roughly normal. But if we transform it to the total return, the ordering is preserved but the shape gets skewed. But if we instead plot this on a log scale, the shape becomes normal again.

By the way the choice of the base being 5 was completely arbitrary in terms of the shape that it gives, it just happened to give me the best picture. So this raises suspicion that there is an exponential or multiplicative behavior lurking in the background. So we should try taking the geometric mean.

Just like how we defined the arithmetic mean for a random variable, we can generalize this for any mean. But things get pretty tricky if we send alpha to 0. And for the case of the sample geometric mean, the limit approaches the product of all the data points, followed by taking the n-th root.

Now we will take the natural log on both sides, since the power drops down as a coefficient and the product splits into a sum. And once we exponentiate both sides, we have an alternative formula for the geometric mean that does not have a product in it. In fact this looks much closer to the generalized formula above compared to the formula that involves the product.

So it could be thought of as the missing link between the general form and the product form. In any case, we will use this formula in a similar way to define the geometric mean for a random variable. Let's first compute the expected value of the log of R, which can be evaluated using LOTUS.

Then we can drop the powers and split up the sum, and by the linearity of summation, we can split up the sum into three separate summations, pulling anything that does not involve k outside of each summation. Notice how the circled part literally is the expected value of the binomial distribution.

And the one in the blue circle is the sum of all the probabilities, which should add to one. Now we can factor out like terms, and make some substitutions, and this should look awfully familiar once we take the exponent to find the geometric mean. It comes out to exactly the same expression as the median and the mode, and once again, we can derive Kelly's formula from here.
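The simplification of the expected log return can be spot-checked numerically. The sketch assumes the simplified form is n(p ln(1+tr) + q ln(1-sr)), which is what the three summations reduce to; the choice r = 1 and n = 50 is hypothetical.

```python
import math
from math import comb

def expected_log_return(n, p, r, t, s):
    """E[ln R] via LOTUS, summing the log of the total return over the binomial PMF."""
    q = 1.0 - p
    return sum(
        comb(n, k) * p**k * q ** (n - k)
        * (k * math.log(1 + t * r) + (n - k) * math.log(1 - s * r))
        for k in range(n + 1)
    )

def closed_form(n, p, r, t, s):
    """n * (p ln(1+tr) + q ln(1-sr)), the simplified expression."""
    q = 1.0 - p
    return n * (p * math.log(1 + t * r) + q * math.log(1 - s * r))

# Index-fund parameters with a hypothetical r = 1 (fully invested, no leverage).
args = dict(n=50, p=0.59, r=1.0, t=1.0, s=0.25)
print(math.isclose(expected_log_return(**args), closed_form(**args)))  # True
```

Exponentiating the closed form gives (1+tr)^(np) × (1-sr)^(nq), the same expression found for the median and the mode.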

I want to wrap it up with a final remark that provides an interpretation of the formula. Since the geometric mean is supposed to be the multiplicative average, we can take the n-th root to find the average net gain per single play. If we win, we multiply the principal capital by (1+tr), and if we lose we multiply it by (1-sr), so we can think of this as the average multiplier per single play.

And if we instead find the geometric mean of the total return for the binomial random variable of one trial, we get this exact value as well.
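To close the loop, the per-play multiplier (1+tr)^p × (1-sr)^q can be compared across a few risk levels. This sketch reuses the index-fund parameters (p = 0.59, t = 1, s = 0.25); the comparison levels and the function name are my own.

```python
def growth_per_play(r, p, t, s):
    """Geometric-mean multiplier per play: (1+tr)^p * (1-sr)^q."""
    q = 1.0 - p
    return (1 + t * r) ** p * (1 - s * r) ** q

p, t, s = 0.59, 1.0, 0.25      # index-fund example from earlier
kelly = p / s - (1 - p) / t    # about 1.95

levels = [("no leverage", 1.0), ("Kelly", kelly), ("3x", 3.0), ("all-in", 4.0)]
for label, r in levels:
    print(f"{label:>12}: {growth_per_play(r, p, t, s):.4f}")
```

Under these assumptions the Kelly risk gives the largest multiplier of the four, and the all-in level gives zero because a single loss wipes out the portfolio.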