Introduction
I’ve been a huge fan of this time of year for over twenty years, going back to the first Final Four (FF) I attended in San Antonio in 1998. My father has been a huge fan since even before that when he went to his first FF in 1986 in Dallas. In fact, it was that FF that cemented his now everlasting love for the Duke Blue Devils. I try not to give him too much grief about that.
I have since been lucky enough to go to over 10 FFs (and countless other Regional games here in Texas), the most recent of which was last year in San Antonio.
A little over a year ago I was attending a Data Science Bootcamp by General Assembly. I was still fairly new to Python and Machine Learning, but I decided to use both of those to try and predict who would win in the 2018 NCAA Men’s Tournament and enter into the competition sponsored by Google and Kaggle. We were still a handful of weeks from graduating when the tournament first started so I was still fairly “green” when it came to modeling. I uploaded one submission to Kaggle and ended up placing 736th out of 934 teams…not all that great, but then again maybe not all that bad for somebody with just a few weeks of ML experience.
Near the tail end of this last NFL season, I had been working with different models to predict which teams would win every weekend. After the season was over, I extended that learning to the College Basketball (CB) season. Since then I have developed a script that scrapes CB scores from the previous day, runs a model, and then spits out win probabilities for games on the current day. Currently I post those every day on Twitter (@danger009mouse). At some point, I would love to automate posting those daily predictions on this website. Since I plan on continuing the win predictions with the Major League Baseball season, maybe I’ll have some time to do just that.
As far as the tournament goes, I feel I have even more confidence in what I’m doing now than I did a year ago, so I made two different models to upload. I’ll highlight how both brackets turned out using both of those models below.
Model 1: Points, Points, and more Points!
This model is a fairly simple model. The main inputs are basically the scores for each game played throughout the season plus a couple of other different variables. It’s highly weighted toward the current season since I started to find that teams that were highly successful this year (like Texas Tech) were being dampened by the fact that they’ve not always been all that good. Basically my original model was penalizing a team like Tech for being mediocre for several years as they’ve only had three winning seasons over the last eight. As for this tournament, unfortunately this model didn’t quite have all that many upsets predicted.
By the way, I wanted to credit my longtime friend of 23 years, Jason Geach (@GeachJason on Twitter). The graphics below are based on a tournament pool that he’s been running for 16 years! Thanks to him for putting in all the time that he does to put this together as well as provide detailed and hilarious commentary throughout the Tournament. It really is a hoot!
Included in each bracket below you’ll see who the model picks to win as well as the winning probability for that team. Any team that had a greater than 50% chance of winning I considered to be moving on in the tournament.
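For anyone curious, the "greater than 50% moves on" rule can be sketched in a few lines of Python. This is only an illustration and not my actual script: the team names, the ratings, and the `win_prob` function are all made up, and the rating-based formula is just a stand-in for whatever a real model would output. The second function shows one way to turn game-by-game probabilities into a team's overall chance of winning a region, assuming each game is independent.

```python
from typing import Callable, Dict, List

# Hypothetical strength ratings, invented purely for this example.
RATINGS = {"A": 1700.0, "B": 1500.0, "C": 1600.0, "D": 1550.0}


def win_prob(a: str, b: str) -> float:
    """Stand-in model: chance that team a beats team b (symmetric by design)."""
    return 1.0 / (1.0 + 10 ** ((RATINGS[b] - RATINGS[a]) / 400.0))


def pick_bracket(teams: List[str], prob: Callable[[str, str], float]) -> List[List[str]]:
    """Advance the team with a > 50% chance in every game.

    teams is listed in bracket order and its length must be a power of two.
    Returns the field for each round, ending with the single winner.
    An exact 50/50 game sends the second team through.
    """
    rounds = [list(teams)]
    current = list(teams)
    while len(current) > 1:
        current = [a if prob(a, b) > 0.5 else b
                   for a, b in zip(current[0::2], current[1::2])]
        rounds.append(current)
    return rounds


def region_odds(teams: List[str], prob: Callable[[str, str], float]) -> Dict[str, float]:
    """Exact chance that each team wins the whole bracket, treating games as independent."""
    if len(teams) == 1:
        return {teams[0]: 1.0}
    half = len(teams) // 2
    left = region_odds(teams[:half], prob)
    right = region_odds(teams[half:], prob)
    odds = {}
    # A team wins the bracket if it wins its half and then beats whoever emerges from the other half.
    for a, pa in left.items():
        odds[a] = pa * sum(pb * prob(a, b) for b, pb in right.items())
    for b, pb in right.items():
        odds[b] = pb * sum(pa * prob(b, a) for a, pa in left.items())
    return odds


field = ["A", "B", "C", "D"]
picks = pick_bracket(field, win_prob)
odds = region_odds(field, win_prob)
print(picks[-1][0], {team: round(p, 4) for team, p in odds.items()})
```

The picks only ever show the favorite in each game, while the odds keep track of every path through the bracket, which is why a team can be "picked" to win a region and still have well under a 50% chance of actually doing it.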
Not too many surprises here with only one “upset”. I put that in quotes since I don’t necessarily consider a #9 seed beating a #8 seed to really be all that much of an upset. Duke comes out on top as we go through the tourney and shows a 43.02% chance of winning the East Bracket while Michigan State has a 17.23% chance of going to the Final Four.
Again, not too much of a surprise with Gonzaga coming out of the West. Here, though, the model has Nevada beating Michigan in the Round of 32 which could be very surprising (unless you’re from Nevada, of course). That would give Texas Tech a pretty clear shot of making it to the Elite 8 to play Gonzaga for the right to go to the Final Four. With this model, Texas Tech has a 13.43% chance of reaching the Final Four. Hey, Red Raider fans, there is a chance! Keep the faith alive!
I know these brackets seem a little boring since the model seems to be picking all of the better (lower-numbered) seeds to win, but look at how close some of the underdogs come. In the South, I’m looking specifically at the Villanova-St. Mary’s match-up. This game will come up again in the 2nd model below. Villanova has just a 51.57% chance of winning, so it’s basically a coin flip whether St. Mary’s pulls off the upset. The Cincinnati-Iowa game is also fairly close to being a coin flip.
At the end, though, this model has Virginia coming out on top with a 45.55% chance of going to the Final Four. Purdue’s chances are at 12%, but they’ll have to get through Tennessee first…who unfortunately ended the SEC Tournament rather badly in that tournament’s championship game!
Again, no surprises here. I don’t think I really give Wofford much credit, but we’ll see in a second that they should definitely be a team to keep an eye on throughout this tournament. I’m also going to be keeping an eye on Houston. If they truly can make it to the Sweet 16 they have a pretty good chance of turning the tables on Kentucky to go to the Elite 8. For this bracket, the #1 seed, North Carolina, has the lowest probability of all the #1 seeds of making the Final Four at 26.96%. In this model, Kentucky has a 17.83% chance of making the Final Four.
Here we have my predictions for the Final Four. With three ACC teams making the Final Four, it means that I’m predicting an all-ACC Championship game…which could be boring for just about anybody outside of the ACC. Ha! In this case, I see Duke beating Gonzaga and Virginia beating UNC before beating Duke in the title game. All in all, Virginia has a 14.82% chance of winning the Championship game while Duke has a 12.05% chance of winning.
Model 2: More Advanced Basketball Analytics
This model is basically an upgrade to the model I used for last year’s Kaggle competition. Using the data set with detailed statistics for every game played from 2003-2019, I did some feature engineering using Team Evaluation Metrics from NBAstuffer. This includes Possessions, Effective FG Percentage, Turnover Rate, Offensive Rebounding Percentage, and Free Throw Rate. Then I calculated the “Four Factors” by weighting each of those metrics. In addition to those features, I added in the Elo rating (which was a metric I used in football as well) from FiveThirtyEight. I also used several of the advanced statistical metrics from Ken Pomeroy (highly recommend his site if you don’t know about it already). Putting it all together, I used a simple logistic regression to get win probabilities for each possible game throughout the Tournament. Here are those results:
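Before getting to the brackets, here is a rough sketch of what the feature engineering step can look like for a single game. Treat it as an assumption-heavy illustration rather than my exact code: the `BoxScore` fields are hypothetical names, the 0.475 free-throw coefficient in the possession estimate is one commonly used college value (other sources use 0.44), and the 40/25/20/15 weights are the ones usually attributed to Dean Oliver, which may not match the weighting actually used in this model. The `composite` score at the end is likewise just one simple way to combine the four numbers.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BoxScore:
    """One team's totals for one game (field names are hypothetical)."""
    fgm: int   # field goals made
    fga: int   # field goals attempted
    fgm3: int  # three-pointers made
    ftm: int   # free throws made
    fta: int   # free throws attempted
    orb: int   # offensive rebounds
    drb: int   # defensive rebounds
    tov: int   # turnovers


def possessions(t: BoxScore, ft_weight: float = 0.475) -> float:
    """Estimate possessions; ft_weight is an assumed coefficient."""
    return t.fga - t.orb + t.tov + ft_weight * t.fta


def four_factors(team: BoxScore, opp: BoxScore) -> Dict[str, float]:
    """Shooting, turnovers, offensive rebounding, and free throws for one team."""
    return {
        "efg": (team.fgm + 0.5 * team.fgm3) / team.fga,   # effective FG%
        "tov_rate": team.tov / possessions(team),          # turnovers per possession
        "orb_pct": team.orb / (team.orb + opp.drb),        # share of available offensive boards
        "ft_rate": team.ftm / team.fga,                    # free throws made per shot attempt
    }


# Assumed weights (commonly cited, not necessarily the ones used in the post).
WEIGHTS = {"efg": 0.40, "tov_rate": 0.25, "orb_pct": 0.20, "ft_rate": 0.15}


def composite(ff: Dict[str, float]) -> float:
    """Weighted score; turnovers count against the team, the rest count for it."""
    return (WEIGHTS["efg"] * ff["efg"]
            - WEIGHTS["tov_rate"] * ff["tov_rate"]
            + WEIGHTS["orb_pct"] * ff["orb_pct"]
            + WEIGHTS["ft_rate"] * ff["ft_rate"])


# Made-up box scores for a single game.
team = BoxScore(fgm=30, fga=60, fgm3=8, ftm=15, fta=20, orb=10, drb=25, tov=12)
opp = BoxScore(fgm=25, fga=58, fgm3=6, ftm=12, fta=18, orb=8, drb=22, tov=14)

ff = four_factors(team, opp)
print(ff, round(composite(ff), 4))
```

In practice these per-game numbers would be averaged over a season for each team, and the differences between the two teams in a matchup would become inputs to the logistic regression alongside the Elo and Ken Pomeroy metrics.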
One thing you’ll find about this model is that it’s highly confident about each of the four top seeds…as in over 90% confident in most of their games. As you can see above, this model gives Duke a huge chance of winning all the way through to the Final Four. The only real upset I see here is Minnesota winning over Louisville, and a rather confident win at that. For this model, Duke has a 68.13% chance of reaching the Final Four while Michigan State has a 20.3% chance. Quite frankly, when it comes to the postseason, always bet on Michigan State!
Here’s where things get really interesting. Granted this model loves Gonzaga, but take a look at the 5-12 matchup. Every single year you can count on a 12-seed beating a 5-seed. Here the model is predicting that Murray State will upset Marquette! The model is also predicting Florida to upset Nevada. I had high hopes for Nevada throughout the season, but they lost two out of three games to San Diego State including a loss in the Mountain West Championship game. Once again there is Texas Tech. This model has them reaching the Elite 8 with a coin flip of a chance to beat Michigan. All in all, Gonzaga has a 79.73% chance of reaching the Final Four in this bracket while Texas Tech has a 4.52% chance of getting that Final Four bid. So, yes, Red Raider fans, there’s still a chance, predicted by both models!!
I had to double check this bracket a few times. First off the model is predicting that the defending champions, Villanova, will be knocked off in the first round by Saint Mary’s. Saint Mary’s!!! And confidently so as well! We also have Iowa beating Cincinnati to advance. After that, there aren’t too many surprises, with Virginia and Tennessee meeting in the Elite 8. For this bracket, Virginia has an 86.75% chance of reaching the Final Four while Tennessee has an overall chance of 6.64%.
I listen to a podcast where one of the hosts graduated from Wofford. Ever since they cracked the Top 25 in the last few weeks of the regular season, he’s been boasting about their basketball program. He’s not exactly one of my favorite hosts, but I imagine he’d be quite pleased at the results of this model. It has Wofford advancing all the way to the Elite 8. If they can get past Kentucky, then they might just be able to cruise past Houston to do it! Not only that, but this model doesn’t even confidently give North Carolina that much of a chance of beating Wofford! What!?!? The other upset centers around Ohio State beating Iowa State in the first round. Iowa State seems to have fallen off the wagon these last few weeks, though, so I can totally see that happening. Overall, North Carolina has a 41.25% chance of going to the Final Four while Wofford has an 18.77% chance! Wow! Seriously.
And so this is my Final Four for the second model. While the first one had Duke winning with a 57.26% chance, this model has Gonzaga advancing with roughly the same chance. This model, with so much love for Virginia, though, has Virginia basically rolling over UNC and Gonzaga to take the crown of Champion! All in all Gonzaga has an 11.76% chance of winning the championship while Virginia has a 57.29% chance of winning. Incredible!
Final Thoughts
So those are both of the brackets I’m turning in for our little tournament pool. The CSV files with predictions for 2278 different possible combinations of games will also be uploaded to Kaggle for that competition. I’m honestly a little bit worried about that second model, though. With the log loss scoring system, if any of the teams with over 90% chance of winning lose, then I’m a goner for sure. We’ll see how it goes!!
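To show why that worries me, here is a small sketch of how log loss treats a confident miss. It assumes the standard binary log loss formula; the clipping value `eps` is my own choice for the example and not necessarily what Kaggle uses. It also checks where the 2278 comes from: every possible pairing of the 68 tournament teams.

```python
import math
from typing import Sequence


def log_loss(probs: Sequence[float], outcomes: Sequence[int], eps: float = 1e-15) -> float:
    """Average binary log loss.

    probs[i] is the predicted chance that the first team wins game i,
    outcomes[i] is 1 if it did and 0 if it did not.
    eps keeps the log away from zero (assumed value).
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Every possible matchup among 68 teams.
n_pairs = math.comb(68, 2)

confident_right = log_loss([0.90], [1])  # said 90%, team won
confident_wrong = log_loss([0.90], [0])  # said 90%, team lost
cautious_wrong = log_loss([0.55], [0])   # said 55%, team lost

print(n_pairs, confident_right, confident_wrong, cautious_wrong)
```

A 90% pick that loses costs about 2.30 for that game, compared with about 0.105 when it wins, so a single confident miss wipes out the benefit of roughly twenty confident hits. A cautious 55% pick that loses costs far less, which is why the first model is the safer entry even if it is the more boring one.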
Good luck, everybody, and I’ll see you again on Monday to wrap up the 1st week of action. In the meantime, keep an eye on my Twitter feed as I make daily predictions on this tournament as well as all the other tournaments going on in college basketball. Keep in mind that those probabilities will more than likely be different than what you see here since those will have updated daily scores for modeling purposes!