Mathematics of Maximizing Profit in Gambling/Investing - Kelly Criterion

245.32k views5586 WordsCopy TextShare
EpsilonDelta
In this video, we introduce the Kelly criterion which is the formula that gives optimal risk that ma...
Video Transcript:
Suppose there is a game with the  probability of winning of 1/3, and if you win, you will get 1. 8 times the  amount of money that you risked on the play, and if you lose you lose the entire wager. Then we can calculate the  expected value of the payout by summing over all possible  events of the probability of each event times the payout of each event, and once we calculate the expected value for  this game, it comes out to a negative value.
You have no edge at all in this game,  so you may get lucky first few times but there is no long-term winning strategy. So you will inevitably donate all your  money to the house if you keep playing. Now let's consider a second game where if you  win you get paid the amount of money you risk.
Or in other words, you double your betting amount. Suppose that you have a hundred  percent chance of winning this game. Then it doesn't quite matter what happens  if you lose since you are guaranteed to win.
And you are expected to double your  money each time you play the game. Then what is your long-term winning strategy? Well you essentially have a money printer, so you might as well put your  entire life savings on each play.
Even better you can maximize your  profit by playing on a margin, meaning you should borrow the biggest  amount of money that you possibly can and put all your assets including your kidney  and liver as a collateral for the margin. We looked at two extreme examples:  one where you have no edge at all, and there is no long-term winning strategy, and another where you are  guaranteed to win each time and your winning strategy is  to bet everything each time. So the real interesting problem is  when your expected payout is positive, but you are not guaranteed to win.
Let's take a look at one such example: Suppose you buy some S&P500 Index Fund, which is a fund that takes  top 500 American companies and averages them by their company's value. And let's say you implement a strategy to  sell at a profit when the price doubles and cut your losses when the  price drops by 25 percent. Taking the historical data starting from  2003 and back testing this strategy, I calculated that the probability  of win is roughly 59 percent.
Since we are expected to win more, and each win  pays much more than how much we lose each loss, The expected payout per play  obviously has to be a positive value. This game definitely is worth playing,  but how much should you risk each play? One possible strategy commonly used by the  members of the subreddit WallStreetBets is called food stamps or Lambo, which is a strategy to risk everything in  order to maximize your potential profit.
So for this strategy, we can maximize the  expected value by maximizing the value of r. So you take 4x leverage to  risk everything you own. Naively that is the optimal strategy since you  are maximizing the expected average profit.
You may get lucky first few times, but  if you continue to all-in each play, you will eventually lose everything. The very act of maximizing expected  payout actually leads to ruin. So there must be a better  long-term strategy right?
Let's start by defining the problem. Suppose there is a game where you know the  probability of winning, which we will call p, and q, the probability of  losing, which equals 1 minus p. The game is repeated n times, and each play, you will risk a fixed  percentage r of your portfolio.
and each time you win you will gain tr, and each time you lose you will lose sr. And let's put on some auxiliary conditions. The amount you can lose is that  most your entire portfolio, So you cannot have a negative balance, and you are not guaranteed to win.
And we want n to be sufficiently large since we  are looking for a long-term sustainable strategy. So you are expected to win some and lose some, but still end up profitable in the long run. Then what fraction of your  portfolio should you risk each play to maximize the long-term average profit?
You may have noticed how we are  restricted to binary scenarios, but in the real world, there are many  games with more than two outcomes. For example the Powerball lottery  has nine different ways you can win, each with different payout. And you can even have a game  with a continuum of outcomes that depend on some continuous distribution.
As interesting as those scenarios sound, they are too difficult to cover on a single video, so for this video we will  focus only on binary outcomes. Before we start solving the problem, I want to start with the answer,  so we can work our way towards it. So the optimal risk we should take  per play is p, the probability of win over s, the percentage loss per loss minus q, the probability of loss over t, the percentage gained per win.
The amount you should risk grows  as the probability of win goes up, as the first term is linearly  proportional with respect to p, and the optimal risk also goes up  as percentage gain per win grows, since the second term is negative  inversely proportional with respect to t. This intuitively makes sense as an investment is considered better if you win  more, and each time you win you win more money. Similarly the optimal risk decreases as the probability of loss and the  amount of loss per loss goes up.
This formula is known as the Kelly Criterion. It's best to look at some  extreme cases of the formula. So if we have a perfectly balanced  game with equal chance of win and loss, and equal gain and loss, the optimal risk is nothing, meaning this isn't a game worth playing.
If we instead have a game where  you're less likely to win, and your gain per win is less than your losses, then your optimal r value comes out to negative. What this means is that you don't  want to be playing the game, but you want to be sitting on the other side and taking a short position being the casino or the insurance  company stealing money from you. Now what if the game is  extremely good in your favor?
Then your r value can come out  to something bigger than one, meaning you should invest not  only your entire portfolio, but take a leverage and  invest with borrowed money. Notice how I put quotation  marks around the word average. The first question I want to  ask is: what even is an average?
There is a generalized  mathematical definition of the word as mathematicians love to abstractify everything. But for now, we will just think of it  as some measure of central tendency or a good representative element of a dataset. And the average that we are all very  familiar with is the arithmetic mean, or sometimes just called THE  mean when the context is clear.
And we compute it by adding up  all the numbers in the data set and dividing by how many numbers there are. There is another kind of mean  called the quadratic mean, or sometimes called root mean square and we get it by taking the arithmetic  mean of all the squares of numbers then taking the square root. Another common mean that we  use is called the harmonic mean and we get it by taking the  arithmetic mean of all the reciprocals then taking the reciprocal.
And using similar idea, we can extend the  definition to work for all real numbers, except for zero. And even for zero, we can just take the limit as alpha  goes to zero to define the zero mean which is known as the geometric mean. As alpha grows bigger, more  weight is placed on bigger values, and as alpha gets smaller, more  weight is placed on smaller values.
So if we take the limit as  alpha goes to negative infinity, we get the smallest value of the dataset, and if we take the limit to the positive infinity, we get the maximum value. Now when is it appropriate to use  these different types of means? Imagine one person having three dollars  and another person having five dollars.
If Karl Marx saw this, he would  want to equally distribute wealth. Then the most appropriate average  would be the arithmetic mean. Now consider a second situation  where you are a successful day trader and tripled your money the first year and 5x'ed money the second year, then what is your average growth per year?
It is equivalent to roughly  3. 873x'ing your money each year. So when we have a compound growth,  and want to calculate average growth, geometric mean is the most appropriate.
Now consider the next situation where person A can finish a task in three days, and person B can finish a task in five days. Then what is the average number of days  it takes a person to finish the task? If both of them are working  on the task concurrently, we can add the rates together and we can divide by 2 to get the average rate.
So it means that person A can  finish 1/3 of a job per day, and person B is working at 1/5 of a job per day. Then it is equivalent as if each  person was finishing 4/15 per day. So on average it takes 15/4 days to finish a task, and this is an appropriate  situation to take the harmonic mean.
Now consider the next scenario where there are two chambers filled  with the same homogeneous ideal gas and each chamber contains the same amount of gas. Now suppose the mean speed of the  molecules in the left chamber is 3, and the right chamber is 5. So the right chamber is a bit hotter.
Now, once you connect the two chambers together, they will eventually reach a thermal equilibrium, and mean speed of the molecules on both sides  of the chamber will eventually balance out to a number between 3 and 5. So what is this average speed after mixing? Well we know that the temperature  on the left and the right will eventually reach an arithmetic mean, and since temperature is, roughly speaking,  the average kinetic energy of the molecules, it is proportional to the  square of the mean speed.
And for those of you who are curious  what the constant of proportionality is, it is π times the mass of the air molecule, all over 8 times the Boltzmann constant. So now to find the average mean  velocity after the thermal equilibrium, we would have to take the  root mean square of 3 and 5. We looked at multiple different scenarios  to use different kinds of means.
Notice how as alpha grows bigger, the mean takes more weight of  the bigger value which is 5. So from these four examples, the harmonic mean is the smallest and  the quadratic mean is the biggest. There are two more averages  that are very commonly used.
How much does an average American make per year? and since there are ridiculous  outliers like billionaires which would skew the arithmetic  mean too far to the right, and zero income people which  would make geometric mean zero. Instead of any of the generalized means, we typically take the median in practice which is the middle number.
And it is a pretty good representative  of how much an average person makes. And another commonly used average is the mode. So when we say an average  American owns a smartphone, we probably mean most Americans.
And the majority is a pretty good  representative of an average person. Now we have a solid notion of what an average is, so we are ready to take the first step. Each time we win, we gain tr, so it is same as multiplying  (1+tr) to how much money we have.
This essentially is the same as 30% gain  really meaning 130% the original amount. And we can come up with something  similar for each time we lose. Suppose we win k times out of n games, then we will lose (n-k) games.
I will call this big R for the total return. And we would want to multiply  this to the initial capital to see how much money we have after n games. And for this expression, I want to use  the convention that 0^0 is equal to 1, which sounds like a nonsense.
But suppose you all-in each play, and if you get lucky, you have multiplied  your money n times by the factor of (1+tr), and if you get unlucky just once,  then you lost all your money. So to make sense out of this situation, we will choose to define that 0^0  is equal to 1 for this expression. Of course we can treat this  rigorously with limits, but we just want a model that works.
Now we are expected to win about np games, where p is the probability of winning one game. And we are to lose about nq games. So this is the average total return right?
Earlier, we have talked about all  the different types of averages, and which one is this one? Well to really make sense out of this, we need to introduce a new  toolkit called random variables. Consider a scenario where you  toss an unfair coin 3 times.
There are eight different events possible, and we can calculate the probability  of each event by multiplication since each throw is independent. Then we can come up with the random  variable for the total number of heads so X of the event HHT would be 2. This example illustrates  what a random variable is.
It simply is a function that takes an  event and assigns a numerical value. And by doing so, we turn  the raw non-numerical data, such as sequence of heads and tails, into numerical data and it  is much easier to work with. Well that is the more modern interpretation, but before the rise of set  theory and measure theory, mathematicians naively thought of it as a variable  that can take on the values of a random event.
and for this scenario, X can take  on the values 0, 1, 2, and 3. And in either perspective, we can  calculate the probability of X = 2 by adding probabilities of all  possible ways of two heads showing up. And we could do this similar for 3.
But what's the probability of 7  head showing up out of 3 toss? That's just impossible, so the probability is 0. Now we can plot the probability for  each values of the random variable, and we call this the probability mass function.
This is one very specific example  of a class of random variables that we call binomial random variable, denoted B, with two parameter  n for the number of trials, and p for the probability of winning one trial. And we can just write B when the context is clear. and since B represents the number of wins, it can take on the values from 0 to n.
How do we calculate the probability  of getting two wins out of four trials if the probability of winning is 1/3? well there are six different  ways you can win twice, and each of these events have a  probability of (1/3)^2 × (2/3)^2, and we can get 6 by taking 4 choose 2. So doing the very same thing in general, the probability of B = k is n choose  k × p^k × q^(n-k) where q is (1 - p).
Next I want to come up with the notion of  the arithmetic mean for random variables and how it compares with the arithmetic  mean of a statistical dataset So let's say there are 10000  independent experiments, each consisting of ten dice throws, and for each experiment, we want to  count the number of times six shows up. We can create a histogram to tally up the results and we can see that it is fairly common  to see 6 showing up one or two times, but getting 10 sixes in a row  is practically impossible. The probability model that represents  one instance of the experiment is the binomial distribution of 10 trials  and the probability of winning equaling 1/6.
And if we plot the probability mass function, the shape is essentially  identical to the histogram because as long as the number  of experiments is large enough, the number of experiments where 3 sixes came  up out of the total number of experiments should approximately equal the probability  of 3 sixes showing up in a single experiment. And this holds for any other values as well. So we can define the expected value, which is  the probability equivalent of the arithmetic mean by summing over the product of  the value of the random variable times the probability of getting that value.
And as long as the number of  experiments are large enough, the arithmetic mean approaches the expected value. Now, what is the expected value  of the binomial distribution? Well this summation is fairly tricky to compute, so for those of you that are interested, I will leave a link in the  description below for the derivation.
So for now, let's take a small  leap of faith and say it equals np. And it is a reasonable answer  that agrees with our intuition since if we toss a coin 100  times, we expect about 50 heads, and if we toss the dice 60  times, we expect about 10 ones. It should be clear now that we  need to construct a random variable that depends on the binomial  distribution for our problem.
So now we introduce how transformation  of random variables work. Suppose that X is a random variable that  can take on the four values -1, 0, 1 and 2, with each value having some probability. Then Y defined as 1/X can take on  the reciprocals of the values of X, and the probabilities of  each value of Y is same as the probabilities of getting the corresponding X, so as an example, if we are trying to  find the probability of Y equaling 1/2, first substitute Y for 1/X, then solve for X.
Notice how the plot of the probability  mass function of Y is out of order, so we just have to rearrange it  to make it more intuitive to read. Now what if the transformation is not one-to-one? For example consider the square transformation, then Y can only take on three values  since both ±1 squared equals 1.
So if we are trying to calculate  the probability of Y equaling 1, Y is equal to X^2, and there are  two possible values of X^2 = 1, so we have to add the two probabilities. so we have to redraw the PMF (probability  mass function) by merging the two values. This essentially is how transformations  of all discrete random variables work.
Now how do we find the expected value of Y? We can just use the definition  of the expected value of Y, which is the sum over all values of Y times  the probability of getting that value. We can instead just sum over  all possible values of X^2, times the probability of getting each value of X.
And during the summation, values that were not one-to-one  will appropriately add together. This essentially illustrates the  proof of the discrete version of the law of the unconscious statistician, which is often called LOTUS, which is its acronym, and it is one of the most important  theorems in probability theory. This theorem is significant since we can find the expected value of  a transformed distribution without knowing anything about the transform distribution.
Now we are ready to formally  define R, the total return, as a random variable that depends  on the binomial distribution. And this makes sense since B is a random  variable that takes on the number of wins. And we want to repeatedly  multiply (1 + tr) for each win, and (1 - sr) for each loss.
Let's try taking the expected  value and see what happens. And we can use a LOTUS to  evaluate the expected value. At this point the summation looked too nasty, so I evaluated it using Wolfram Mathematica.
So kudos to you if you can  try and simplify it by hand. At this point, it's probably good to  stop and think about what this means. In the very beginning, we informally introduced the random variable  G to denote the net gain per single play, and we also calculated the expected  value of net change per single play.
so we can make the substitution for  the expected value of the total return. This expression intuitively makes sense as if we have a net change of E(G) per play, then we are on average  multiplying by (1+E(G)) per play. And this expression is maximized if we  take the little r, which is the risk, to be as big as possible.
It's not too hard to see that maximizing  the expected value is not the best strategy. But to provide a concrete example, suppose  there is a game where you flip a fair coin, and if you flip head you quadruple your risk,  and if you flip tail you lose your wager. And if we take the strategy to maximize  the expected value of the total return, The only way we can win is  to consecutively flip heads, and if we flip just one tail  we lose all of our money.
As the number of trials get larger, your chance of becoming rich  approaches zero exponentially fast. And you are almost certainly  guaranteed to lose everything. Maximizing the expected value  was not the best choice, so what if we maximize the median instead?
For continuous distribution, the  definition of median is obvious, which is the value that splits  the probability into two halves. But we need a more general definition that  works for discrete random variable as well. So the definition that we will  use for median is any real value m such that the probability  of X ≤ m is at least half, and probability X ≥ m is at least half.
So if we look at this binomial distribution X, the total probability less than  or equal to 2 is more than half, and the probability greater than or  equal to 2 is also more than half. So for this distribution,  the median is equal to 2. Let's take a look at another  binomial distribution, which happens to be symmetric and bimodal.
And the total probability less than or equal to 2 and greater than or equal to 2 is at least half. So 2 is the median of this distribution. And by the same reasoning, 3 is a median as well.
And if we take any number in  between 2 and 3 such as 2. 5, that satisfies the definition of median too. So for this case, the median is the  set of closed interval from 2 to 3.
Then what is the median of  a binomial distribution? It is the expected value  rounded to the nearest integer. Well that was a lie, we don't actually  have a close formula to find the median in terms of n and p, but we  do know a few things about it.
And one thing for sure is that median could either take on np rounded up or down, or maybe both at the same time. This isn't something we could prove quickly, but I want to provide an  intuition for big values of n since we are interested in  long-term winning strategy. So for large values of n, the binomial distribution starts to  look more and more like the bell curve, so the mean and the median should  roughly be equal to each other.
But since B can only take on  integer values, so does the median. For rest of the video we  will just approximate that the median of the binomial distribution  is roughly equal to the mean. Now how do we compute the median  of a transformed distribution?
For example, if we have some distribution X, and if we apply a power to it, the probability mass function  essentially looks the same since there was no rearrangement  in the ordering of each bars. So the median would be represented by  the same bar on the left and the right. So the median of the transform is  the same as transform of the median.
Now what if we apply a monotonically  decreasing transformation like 2^-X? Then the ordering completely flips,  but the middle is still the middle. So the median commutes with the  transformation for this case as well.
What if we have a transformation  that is not strictly increasing, like the integer part of X divided by 3. I marked where the original  median was with the blue arrow, and since this function is not one-to-one, we would have to merge the bars together. Since 4 was the original median,  everything up to 4 is at least half.
So everything up to 5 is at least half as well. And everything down to 4 is at least half, so everything down to 3 is at least half as well. so 1, which is the transform of 4, is  the median of the new distribution.
So we can just find the median of a transform by taking the transform of the median as long as g is a monotonic function, whether it's increasing or  decreasing, strict or not. So can we find the median of the total return? We can write the transformation  as a single exponent function.
And if the base of the exponent is  greater than one, it is increasing, if it's equal to 1, then it is constant, and if it's in between 0 and  1, then it is decreasing. If the base is 0 or negative we run  into some issues with the monotonicity, so if we restrict both the numerator  (1+tr) and the denominator (1-sr) to be positive, we get a restriction  of allowed range for r. Intuitively, 1/s is the maximum  possible risk that we can take, which is equivalent to all-in, but what does negative 1/t signify?
That actually is the maximum  risk that we can possibly take if we are taking the short position. So now we have a reasonable range for r, and within this range, the  transformation is monotonic. So we can compute the median of big R by taking the transformation of the  median of the binomial distribution, then median can be approximated by np, so after some manipulation, we arrive at this expression.
And we can think of this as a  single variable function of the risk since other variables are fixed parameters. So this allows us to do some single  variable differential calculus. 1/s and -1/t are the roots of this function, so if we graph this function  with respect to the risk, it looks something like an upside  down parabola with a single maximum.
I normalized the graph so that the  peak is always at the same height since it is really what we want to know about. So if we vary the value of p,  the probability of winning, the location of the maximum changes. But if we change the value of n,  the shape of the graph changes, but where the maximum is does not.
And this illustrates one of  the key techniques in calculus: instead of finding the maximum  of the original function, we can instead find the maximum  of a transformed function as long as the transform is monotonic. and we can take natural log as well  which is the monotonic function since it splits products into sums, and it makes calculus so much easier. Now let's take the derivative  which turns logs into reciprocals, and we have to pull out the  coefficient by chain rule.
and we set it equal to zero in  order to find that single maximum. At this point all the heavy lifting is done, so I will leave it to you to solve for r, and get that r that maximizes the  median is indeed the Kelly's formula. We computed the optimal risk for mean and median, so we might as well try the  mode since those three are the averages you learn in  elementary school statistics.
And unlike median, mode is fairly  intuitive to define for random variables. It simply is the value that  gives the biggest probability, and there can be multiple modes. Then what is the mode of a binomial distribution?
Unlike the median, there is a  close formula in terms of n and p. This is a bit of overkill, and just like median, we will say that the mode is an integer  value close to the expected value, and we can approximate it  using the expected value. Then what is the mode of a  transformed random variable?
If we simply take a discrete random  variable and permute the values, then the biggest bar is still preserved. But if we merge some bars, what was originally the biggest  may not be the new biggest. So one condition where mode commutes  with g is that g is one-to-one.
By the way, this only holds  for discrete random variables and fails for continuous  random variables in general. So if you can come up with an example,  leave it in the comments below. Now what if we stack all the bars into one?
That is, the transformation  is a constant function? Then every single value, including  what was originally the mode, transformed into the new mode. So mode commutes with g if  g is a constant function.
I'm sure there are more interesting examples, but these two conditions are  the ones we will be using. Just like how we did for median, the exponential function is  either one-to-one or constant as long as the base of the  exponent is greater than zero. And we found the appropriate conditions.
So mode of the transform is transform of the mode. So it will simplify to exact same  value that median came out to. So once we find the r that maximizes the mode, we will get Kelly's formula once again.
If we plot the probability mass  function of the binomial distribution, it looks roughly normal. But if we transform it to the total return, the ordering is preserved  but the shape gets skewed. But if we instead plot this on a log scale, the shape becomes normal again.
By the way the choice of the base being 5 was completely arbitrary in  terms of the shape that it gives, it just happened to give me the best picture. So this raises suspicion that there is an exponential or multiplicative  behavior lurking in the background. So we should try taking the geometric mean.
Just like how we defined the  arithmetic mean for a random variable, we can generalize this for any mean. But things get pretty tricky  if we send alpha to 0. And for the case of sample geometric mean, the limit approach the product of  each data, then taking the n-th root.
Now we will take a natural log on both sides since the power drops as the  coefficient and product splits into sum. And once we exponentiate both sides, we have an alternative formula for geometric mean that does not have product in it. In fact this looks much closer  to the generalized formula above compared to the formula that involves the product.
So it could be thought of as the  missing link between the general form and the product form. In any case, we will use  this formula in a similar way to define the geometric  mean for a random variable. Let's first compute the  expected value of the log of R, which can be evaluated using LOTUS.
Then we can drop the powers and split up the sum, and by the linearity of summation, we can split up the sum into  three separate summation. and pulling out things that does not  involve k outside of the summation. Notice how the circled part literally is the expected value of the binomial distribution.
And the one in the blue circle  is sum of each probability, which should add to one. Now we can factor out like terms,  and make some substitutions, and this should look awfully familiar once we take the exponent  to find the geometric mean. It comes out to exactly same  expression as the median and the mode, and once again, we can derive  the Kelly's formula from here.
I want to wrap it up, with a final remark that provides  an interpretation of the formula. Since the geometric mean is supposed  to be the multiplicative average, we can take the n-th root to find  the average net gain per single play. If we win, we multiply by  (1+tr) to the principal capital, and if we lose we multiply by (1-sr), so we can think of this as the  average multiplier per single play.
And if we instead find the  geometric mean of the total return for the binomial random variable of one trial, we get this exact value as well.
Copyright © 2024. Made with ♥ in London by YTScribe.com