This is a random forest, a powerful machine learning algorithm based on decision trees, and this is tennis, a sport I'm kind of obsessed with. In this video I'll build a random forest, throw a bunch of tennis data at it, make it predict tennis match outcomes, and then have it predict the winner of a big tennis tournament, and we'll see how it performs. But first, I need data. I want a lot of data. I want every single break point, every single double fault. I want everything: their backhands, their forehands, their heights, their dad's name, their mom's name, their grandma's secret lasagna recipe. Yeah, I need that stuff too, man, I'm surviving on noodles over here. And then I found it: the holy grail of tennis datasets. A file so massive, so detailed, that opening it crashed my computer, summoned three statisticians, and made my Excel beg for mercy. The famous 2008 Wimbledon final between Rafa Nadal and Roger Federer? I've got it. The longest Grand Slam final of all time? I've got it. Even that one time a kid beat Rafa and Novak Djokovic back-to-back? I have it.
I have it all: every single ATP (Association of Tennis Professionals) match from 1981 to 2024, baby. But before using this data, I want to try building a decision tree from scratch. That's right: no scikit-learn, no PyTorch, just my good old friend NumPy and me. And decision trees are pretty awesome. Think of them like a choose-your-own-adventure book, but instead of deciding whether you fight a dragon or run away, it's deciding who won a tennis match. Let's take the Titanic disaster as an example. We've got a lot of data on passengers: things like their age, their cabin, and their ticket class. A decision tree works by asking a series of yes/no questions to classify whether someone survived. For example, let's take Miss Elizabeth Bonnell. A simple decision tree might start by asking: did she pay more than 20.20 for her ticket? Yes, she did, so we go left. Next question: was she in first class? Yes again. At this point our tree confidently predicts that she survived, because, well, she actually did. And we can use this tree again and again to predict other passengers. That's a really simple tree, though.
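That walk-through is literally just nested if/else statements. Here's a sketch of it in Python; the function name is my own, and sending every "no" answer straight to "did not survive" is a simplification, since a real tree would keep asking more questions down those branches:

```python
# A hand-written version of the tiny example tree above.
# The thresholds mirror the walk-through: fare > 20.20, then first class.
def predict_survival(fare, pclass):
    if fare > 20.20:           # "did she pay more than 20.20 for her ticket?"
        if pclass == 1:        # "was she in first class?"
            return "survived"
        return "did not survive"   # simplification: a real tree keeps asking
    return "did not survive"

# Miss Elizabeth Bonnell: a fare above the threshold (illustrative value), first class.
print(predict_survival(fare=25.0, pclass=1))  # survived
```

Every path from the top of the tree to a leaf is one chain of these questions.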
Here's what I got when training on the full Titanic dataset, and you might be asking: how do we build this huge tree? Well, this is the coolest part: we don't need any fancy algorithm like with neural nets. No matrix multiplication, no gradient descent, none of that fancy crap. Just some logic and some simple arithmetic. Here's how it works. First, we grab all the Titanic data and start with an empty tree. Now the goal is to find the variable that best splits the passengers into survivors and non-survivors. It turns out the most powerful first split is passenger class. All right, so we split the data: first-class passengers go one way and everyone else goes the other way. But hold up, there's still some impurity: not everyone in first class survived. So what now? Well, we look for the next best split, and guess what, the strongest predictor now is sex. So we slice the data again and boom: every female in first class survived. That means we have a pure node, which we mark as "survived". For the other branches of the tree we keep repeating this process, finding the best split, dividing the data, and checking purity, until, ta-da, we have ourselves a fully grown, absolutely magnificent decision tree.
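That whole loop, grab the data, find the best split, divide, check purity, recurse, fits in a few functions with nothing but NumPy. This is a minimal sketch of the idea, not my exact implementation: I'm assuming Gini impurity as the purity measure, and the function names and `max_depth` cutoff are my own choices:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 0 means the node is pure."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature/threshold pair and return the split with lowest weighted impurity."""
    best = (None, None, np.inf)
    for feat in range(X.shape[1]):
        for thr in np.unique(X[:, feat]):
            left = X[:, feat] <= thr
            if left.all() or (~left).all():      # split must actually divide the data
                continue
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best[2]:
                best = (feat, thr, score)
    return best

def grow_tree(X, y, depth=0, max_depth=3):
    """Recursively split until a node is pure (or we hit max_depth)."""
    feat, thr, _ = best_split(X, y)
    if gini(y) == 0.0 or depth == max_depth or feat is None:
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]          # leaf: majority label
    left = X[:, feat] <= thr
    return (feat, thr,
            grow_tree(X[left], y[left], depth + 1, max_depth),
            grow_tree(X[~left], y[~left], depth + 1, max_depth))

def predict(tree, x):
    """Walk down the tree answering yes/no questions until we reach a leaf."""
    while isinstance(tree, tuple):
        feat, thr, left, right = tree
        tree = left if x[feat] <= thr else right
    return tree
```

On the Titanic data this would pick passenger class as the first split, exactly as described; a production version also needs stopping rules like a minimum node size.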
Woo, let's go! Implementing this in Python was actually pretty straightforward, but I don't think it's going to be very fast. Great, so now we have a fully working decision tree classifier. But before we use it to predict the outcomes of tennis matches and gamble some money, we need to clean up the tennis data, because it's looking kind of gross. Okay, tennis-data cleaning montage: boom, combine the datasets, remove empty data, get the ranking difference between winner and loser, do this, do that, okay, bada bing bada boom, my beautiful dataset is complete. It has 95,000 tennis matches with their corresponding statistics, and there are a lot of statistics I calculated, let me tell you. For example, the head-to-head for every match, that is, the number of times a player has won or lost against the other player; the players' age difference; their height difference; the number of matches won in their last 50 matches; and a bunch of other stats that I won't get into. It's a lot.
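Two of those engineered features can be sketched with pandas and a single pass over the matches sorted by date. The column names (`winner_id`, `loser_id`) and function names here are my assumptions, not necessarily the ones in my actual dataset:

```python
from collections import defaultdict, deque
import pandas as pd

def add_head_to_head(df):
    """Head-to-head wins of the winner over the loser *before* each match."""
    h2h = defaultdict(int)
    wins_before = []
    for w, l in zip(df["winner_id"], df["loser_id"]):
        wins_before.append(h2h[(w, l)])   # record the count before this match
        h2h[(w, l)] += 1                  # then count this win
    df["winner_h2h_wins"] = wins_before
    return df

def add_recent_form(df, window=50):
    """Matches won by the eventual winner in their last `window` matches."""
    recent = defaultdict(lambda: deque(maxlen=window))
    form = []
    for w, l in zip(df["winner_id"], df["loser_id"]):
        form.append(sum(recent[w]))
        recent[w].append(1)               # winner records a win
        recent[l].append(0)               # loser records a loss
    df["winner_recent_wins"] = form
    return df
```

The key detail is that each feature only uses matches played *before* the current one, otherwise the classifier would be peeking at the answer.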
But before we throw all this stuff at our classifier, we need to plot our data. Plot, plot, plot, plot. So I cooked up a quick SNS pair plot and, bam, I got this beauty: a bunch of variables plotted against each other, showing some interesting patterns. Of course, some variables are absolutely useless, like player ID, but I want to draw your attention to this one, player Elo, because it seems to be splitting the data really well. So let me show you how I calculated it. The Elo rating system is a way to approximate a player's skill level. It's most commonly used in chess, but I decided to apply it to my tennis dataset. Let's take Roger Federer as an example. At the start of his career his Elo rating was around 1,500, pretty average, but as he kept winning matches his rating skyrocketed, as he eventually became one of the greatest players of all time. Here's his Elo progression plotted against every tennis player ever; as you can see, he is way up there. And if you're wondering about those two other lines: naturally, those are Rafa Nadal and Novak Djokovic, two other tennis legends. And the cool thing is that Elo is fairly easy to code.
Let's take the 2023 Wimbledon final between Carlos Alcaraz and Novak Djokovic. Alcaraz was rated about 2,163 and Djokovic was rated about 2,120, according, of course, to my calculations. In a shocking comeback, Alcaraz won the match, which, by the way, I watched live, and it was really cool. And since he won, his rating has to be updated. So we take this handy little formula and calculate his new rating: it turns out Alcaraz gained about 14 points and Djokovic lost about 14 points. So it's pretty easy, actually.
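That "handy little formula" is the standard Elo update: the winner's gain is proportional to how surprising the win was. The K-factor of 32 below is my assumption; it's the classic chess value, and it happens to reproduce the roughly 14-point swing from the example:

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_winner, r_loser, k=32):
    """Return the new (winner, loser) ratings after one match."""
    e_w = expected_score(r_winner, r_loser)
    delta = k * (1.0 - e_w)          # the winner gains what the loser gives up
    return r_winner + delta, r_loser - delta

# 2023 Wimbledon final: Alcaraz (~2163) beats Djokovic (~2120)
new_alcaraz, new_djokovic = elo_update(2163, 2120)
print(round(new_alcaraz - 2163, 1))  # about 14 points gained
```

Note that the total number of points is conserved: whatever the winner gains, the loser loses.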
According to my tennis Elo rankings, the current best player in the world is Jannik Sinner, who just won the Australian Open, followed by Novak Djokovic and Carlos Alcaraz, and here's how their Elo has evolved over time. This represents their overall Elo, but in tennis the surface you're playing on really matters: playing on clay, grass, or hard courts is really different. So I also implemented surface-specific Elo. For example, Rafa Nadal is known as the King of Clay: he's won 14 French Opens and has a 112 wins versus 4 losses record at the event. The guy is a beast.
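Surface-specific Elo is just the same update applied to a separate rating table per surface. This is a sketch of the idea under my assumptions (starting rating 1,500, K of 32, made-up player names):

```python
from collections import defaultdict

# One rating table per surface; 1500 is the usual starting rating.
surface_elo = defaultdict(lambda: defaultdict(lambda: 1500.0))

def update_surface_elo(winner, loser, surface, k=32):
    """Apply the standard Elo update, but only to the given surface's table."""
    table = surface_elo[surface]
    e_w = 1.0 / (1.0 + 10 ** ((table[loser] - table[winner]) / 400))
    delta = k * (1.0 - e_w)
    table[winner] += delta
    table[loser] -= delta

# A clay win moves only the clay ratings; grass and hard are untouched.
update_surface_elo("PlayerA", "PlayerB", "clay")
```

A player like Nadal then ends up with one ordinary hard-court rating and one monstrous clay rating.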
And as you can see, his clay Elo is really, really, really good; in fact, it's the highest I've seen. As a final example, here's how Carlos Alcaraz's grass Elo changed after winning back-to-back Wimbledon titles in 2023 and 2024. Tennis Elo turns out to be quite good at predicting who will win, so let's take all this data and fit it into a decision tree to see how well it classifies winners. And while my model is training, I want to thank Brilliant for sponsoring this video. Brilliant is an online learning platform for computer science and maths.
They have great courses on everything from calculus and linear algebra all the way up to neural networks. In fact, I have done a lot of stuff behind the scenes that I haven't really shown in the video, like principal component analysis, linear regression, and a bunch more, and while I won't have time to explain it, Brilliant has some fantastic introductory courses on data analysis and probability, where you learn by experimenting with hands-on examples. They make learning fun by giving you puzzles and little games to test your understanding. They even have courses on search engines, cryptocurrencies, quantum computing, how AI works, and loads and loads of other fascinating things. So if you want to support this channel and explore exciting and interesting topics, click the link in the description and use my code "green code" for a 30-day free trial and a 20% discount on the annual premium subscription. Seriously, they're awesome, so go check them out. Ooh, okay, our decision tree finished training, and I have some good news and some bad news. The good news is that it gave us a pretty cool-looking decision tree. The bad news is that my implementation was really slow, like, painfully slow.
So I ended up having to use scikit-learn's version. But for all the haters out there: my code works, okay? I tested it on smaller datasets and it works just fine; it's just not ready to handle 95,000 tennis matches. Anyway, using that classifier out of the box we get 74% accuracy, which sounds really promising until you realize that simply predicting based on Elo alone gives you 72% accuracy. So yeah, we can do better. To take things to the next level, we need random forests. A single decision tree tends to have high variance, so it's quite sensitive to the specific data it's trained on. But if we create multiple trees, each making its own prediction, we can combine the results through a majority vote and get a more stable and accurate model. Now, building a decision tree is deterministic: if the input stays the same, we'll build exactly the same tree over and over. So the trick to building a random forest is to build many trees using different random subsets of the data and different subsets of the variables. This way there's a little bit more variation, which makes the model more robust.
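Those two tricks, bootstrap samples of the rows and random subsets of the features, are the whole recipe. Here's a sketch using scikit-learn's tree as the base learner for brevity (my from-scratch version used my own tree); the function names and the square-root feature-count rule are my own choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Bag of trees: each sees a bootstrap sample and a random feature subset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                         # bootstrap: sample rows with replacement
        cols = rng.choice(d, size=max(1, int(np.sqrt(d))), replace=False)
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Majority vote over all the trees (binary labels 0/1)."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Because each tree sees different rows and different columns, their mistakes are partly uncorrelated, and the vote averages them out.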
I also implemented my own random forest from scratch, but guess what: it was too slow for my huge dataset, again. Using scikit-learn, though, I got 76%. Not bad, not bad. We can even visualize our forest, and it looks like this. Pretty sick, right? Now, at this point things got tricky. I tried to improve my model by running a grid search, tweaking the data, and fine-tuning the random forest parameters, but no matter what I tried, I got stuck at 76-77% accuracy. So I decided to try XGBoost. An XGBoost classifier is like a random forest on steroids: it uses boosting, plus regularization to prevent overfitting and to keep trees from growing too large. It's kind of hard to explain in this video, but maybe if you like the video I could explain it in another one. You know, up to you. The button's right there, though. Just saying. Anyway, I got a staggering 85% accuracy, baby! Yoohoo! And as you can see, the most important features it recognized were the Elo surface difference and total Elo, which is pretty cool.
Just for fun, I also decided to quickly train a neural net to see how it would do on this data, and I got a decent 83% accuracy. Now, to end this video, I wanted to see if my model could predict the winner of this year's Australian Open. You see, I trained my models on tennis matches up until December 2024, so this year's Australian Open was not in my dataset. And out of the 116 matches I could find, my model correctly predicted 99 of them and got only 17 wrong, and hence had an accuracy of 85% on this year's Australian Open. More importantly, it correctly predicted that Jannik Sinner would win every single one of his matches. So my model correctly predicted the winner of a Grand Slam, which is pretty sick if you ask me. I had a lot of fun with this video, so if you want me to try to predict this year's Wimbledon champion using XGBoost from scratch, like the video and comment "potato" below. If 50 people comment "potato", I guess I'll make a second part. Hope you enjoyed it, and I'll see you in the next one!