Your Last Shot (Might) Matter: The Hot Hands Effect in Professional Tennis

June 30th, 2022

Tools/methods used in this article: python, numpy, pandas, conditional probabilities, bootstrap uncertainty estimates and uncertainties from formal propagation of errors. Discussion of selection effects and sampling biases. Code is contained in sets of annotated Jupyter notebooks on github here.

Key results (the TL;DR)

  • I answer the following question: How much more likely is it for a player's serve to land in, given that their previous serve was also in? I consider sequences of first serves only, and quantify the hot-hand magnitude. I consider a causal model of "hot handedness", where making a first serve increases the probability of the next serve going in by X%. A hot hand magnitude of 0% thus corresponds to no hot-hand effect at all.

  • Using data from 110,000 first serves across 32 of the top men's professionals, there is modest evidence for a hot-hands effect of 1% -- meaning a player is on average 1% more likely to make their first serve if they also made their first serve in the previous point.

  • The evidence is mild: 10-to-1 in favor of a 1% hot hands effect. (Formally, the average hot hands effect is 1% plus or minus 0.5%).

  • The fact that the hot hands effect is a measly 1% speaks to the amazing consistency of professional athletes.

  • Ten times as much data (1 million service points total) would be enough to decisively pin down a 1% hot hands effect.

Introduction

Any player will admit to you that tennis is a mental game. At the amateur level especially, the decisive element as to whether you win or lose is often entirely within yourself: how can you stay calm and focused, and avoid the dreaded choke?

For those who are new to tennis, a match is played as follows. It is made up of points, about five of which comprise a game. A handful of games are played to determine the winner of each set. The player who wins the majority of sets (usually best 2 of 3 sets) wins the match. There are tie-break rules and a variety of others for games and sets, but none of those rules are relevant here. All that matters is how a point is played.

A point begins by one player tossing the ball overhead and hitting it onto the other side of the court. This is called a serve. The serve has to land in an appropriate box on the other side. If it does not, it is a fault -- and you get to perform a second serve. Players almost always use a more reliable technique for their second serve; one that has a higher chance of landing in. Top pros often make only 60% of their first serves in, while their second serve percentage can hover near 90%. The reason is simple: if you miss, or "fault", both serves, you have lost that point. Arthur Ashe summed up the importance of the serve perfectly.

“Life is like a tennis game. You can’t win without serving.”

— Arthur Ashe

The centrality of the serve explains why tennis hinges largely on personal performance, much like professional darts. The serve is so important and yet only involves you and the ball. This makes tennis a perfect arena in which to study the hot-hands effect.

The hot-hands effect is colloquially when a player is "on-fire"; when it feels like they can't miss. Many studies consider "being hot" to be a state that the player can access. Once they are in this "hot" state, they have higher chances of making their shots. Typically, people study this with questions like: if a player has made three shots in a row, are they more likely to make their fourth (compared to their average accuracy)? We consider a simpler question though:

If a player makes a first serve, are they more likely to make their next first serve? (compared to their average accuracy)

The hot-hands effect has been studied in great detail in many games: basketball, baseball (like this excellent FiveThirtyEight article), darts, and even tennis. As far as I can tell, though, this is the first time anyone has studied it in the context of tennis first serves. Because first and second serves differ so much in form, I am only looking at first serves. So, can one first serve affect the next, or are they independent of each other?

A quick note about the hot-hands effect: it is indistinguishable from a cold-hands effect. If a player who made their previous serve is more likely to make their next serve, then conversely, a player who missed their last serve is more likely to miss their next one as well. Before embarking on any data science exploration, it's useful to have a physical expectation of what you might find. It is easy to be led astray by data and arrive at (and convince yourself of) an incorrect conclusion (knock on wood). Prior to crunching any numbers, this is the expectation that I had:

There should be very little cold-hands/hot-hands effect in professional tennis players because they are extremely well trained.

Pros practice tennis, well, professionally; all day. They train to minimize their variance. The hot-hands effect should be small if it exists at all. But this is precisely what makes this problem so interesting to me. I love subtle and difficult problems. Ones where the result is going to be equally or even more subtle -- such investigations require great precision and care with the data (knock on wood again).

In what follows, I'm going to first build up a formal probabilistic model of the hot hands effect. Then I will describe the fake data I created to validate my method. After verifying everything on test data, I will apply my analysis to over one-hundred-thousand professional tennis serves from top players like Roger Federer and Novak Djokovic (see the acknowledgements section for the source of these data).


Probabilities and First Serves

The hot-hands effect is more-or-less synonymous with the concept of correlation. Is the outcome of one first serve correlated with the next?

Most people actually have a very good intuition for correlation, but may not know it yet. Some simple examples:

  • You live in Los Angeles and it is sunny at 12 pm. The weather at 1 pm will, probably, also be sunny. We would say that the weather at 12 pm and at 1 pm is highly correlated (in LA).

  • You live in Denver, and it is sunny at 12 pm. But, the weather at 1pm could be sleet and snow. In Denver, the weather can be uncorrelated between one hour and the next.

  • You see someone walking down the street, and they take one step forward. You can be pretty sure that their next step will be forward, because a pedestrian's steps are correlated.

  • You see a drunkard stumbling down the street. They take one step forward, but you cannot be sure that their next step will be in the same direction as the first, because a drunk's steps are highly uncorrelated.

And one final example: you are defending Stephen Curry. He takes a step forward, and so you assume his next step will be in the same direction. You are wrong and he dribbles around you (and maybe your ankles get broken) because Stephen Curry's steps are highly uncorrelated.

We are interested in correlation between serves. Is the outcome of one first serve correlated with the next? Formally we can encapsulate this with conditional probabilities. So imagine the following: you are watching a tennis match with your buddy, and one player is about to serve. You two like to gamble, so you make a bet on the outcomes of serves. You want to be careful about whether you bet that this player will fault (F) or whether their serve will be good (G). Because you are a careful bettor, you consider these probabilities:

  • P(F) is the probability of a player faulting on this first serve. While P(G) is the probability this first serve goes in.

Again, for pros, P(F) is about 0.4 (40%), and P(G) is 0.6 (60%). It has to be the case that P(G) + P(F) = 1, because probabilities have to add to 1. If your buddy is putting up $1 even-money for your $1, then of course you bet on the serve going in because P(G) is larger than P(F). Unsurprisingly, the serve goes in and you win the bet. Your buddy is angry, and so they make another bet, this time more nuanced because they want to try and outsmart you. They ask you the following.

You've been beating me all day on these wagers, so I want to try something different. I know this player's average first serve percentage is 60% -- but their last serve went in, so I think they are running hot. I bet that there is a bigger than 60% chance that their next serve goes in too.

They are asking you to bet on a conditional probability. They are wagering on the probability of a Good serve given (conditional on) the fact that the previous/last serve was G also. We refer to this probability as P(G|LG) -- called the probability of a G serve given LG (LG for Last serve being G). The other probability in the mix also is P(G|LF): The probability of this serve being G, given that the previous serve was a Fault (LF). So in probability language, your buddy simply wants to bet on this being true:

P(G|LG) is larger than P(G)

In fact, this bet will be true only if the following is true as well.

P(G|LG) > P(G|LF)
Or in plain English: the chance is higher for a player's current serve to go in if their last serve was in also.

And that is the bet you settle on. And P(G|LG) > P(G|LF) is the hot hands effect!

A quick note: Why hot and cold hands always go hand-in-hand

If P(G|LG) > P(G|LF), then hot hands exists. This model is sometimes referred to as a "causal model of hot handedness." A positive event can cause another, and vice-versa. The vice-versa is a very important nuance. Because if P(G|LG) > P(G|LF), then it is a truism that:

if you missed your last shot, then you are more likely to miss this one as well.

So you cannot have hot streaks without also having cold streaks. If your average (so P(G)) is extremely high, then you are going to have more hot streaks than cold ones simply because on average you make more serves in than out. So the corollary is really only entertaining in a perfectly mediocre player with blazing hot hands:

In this pathological example, let's say a player's chance of making a first serve is P(G) = 0.5, and that P(G|LG) is a whopping 1. This means that the instant the player makes their first serve, they cannot miss. The next serve will go in, and so the next after that as well, and so on... They have white-hot hot hands.
But, you can show that this also means that P(F|LF) is 100% as well! So if the player starts out the game with a miss, then they will continue to miss forever! This would be a player who has both the world's best player within them, as well as the world's absolute worst.
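
The "you can show" step can be sketched in a few lines. The sketch assumes the long-run fraction of good serves must equal P(G), i.e. P(G) = P(G|LG) P(G) + P(G|LF) P(F); that stationarity condition and the function name are my own illustrative choices, not something taken from the notebooks.

```python
def p_good_given_last_fault(p_good, p_good_given_last_good):
    """Solve P(G) = P(G|LG) P(G) + P(G|LF) P(F) for P(G|LF).

    Assumes the serve sequence is stationary, so the long-run
    first-serve rate equals P(G).
    """
    p_fault = 1.0 - p_good
    if p_fault == 0.0:
        raise ValueError("P(G|LF) is undefined for a player who never faults")
    return (p_good - p_good_given_last_good * p_good) / p_fault


# The pathological player: P(G) = 0.5 and P(G|LG) = 1.
p_g_lf = p_good_given_last_fault(0.5, 1.0)
p_f_lf = 1.0 - p_g_lf  # P(F|LF), the chance of a fault after a fault
print(p_g_lf, p_f_lf)
```

For the pathological player this gives P(G|LF) = 0, so P(F|LF) = 1: once they miss, they keep missing.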

If we find that professional players have P(G|LG) > P(G|LF), then their chance of a fault also increases if they faulted on the last serve. If they can be hot, then they can be cold. Professionals train to be as consistent as possible, and so by necessity, training out the cold-streaks will also train out the hot-streaks. The duality of "hot-hands always implies cold-hands" is really this: if professionals train to give their game 100% of themselves all the time, yet they can frequently "flip" on and give 110% of themselves, then that implies that the 100% they were supposedly giving before is actually more like 91% (= 100/110). This intuition was behind the initial hypothesis that the hot-hands effect should be small in professionals. Because pros train to give 100% and, as Limmy put it:

"A person cannot give it any more than 100%"
- Limmy, Limmy's show.

Sample and Data Selection

Please see the acknowledgements section for the source of these data and the license information. Many thousands of hours of crowd-sourced effort (and not from me!) went into compiling these serve data.

The data come from crowdsourced shot-by-shot professional tennis data by The Tennis Abstract Match Charting Project. I will skip over the parsing and reading in of these massive files -- but the code to do so is included with the repository for this article.

These data contain a treasure trove of information, but important to us is whether or not a player's first serve went in. We have to however select a sample of players and matches that are sufficiently uniform. We make these important cuts:

  1. We are only going to use match data between 2015 and 2020, to keep the number of years to a minimum yet still have a large enough sample size (about 100,000 serves) so that our uncertainties (how well we can estimate the hot hands effect) are small. We make this 5-year cut because players change over time. Players will be roughly consistent over a 2-3 year period (four to five years is pushing it, but still OK). If we go longer than that, we will have to split players into their different eras, because e.g., Federer from 2005 is not identical to Federer in 2020.

  2. Different players have different first serve percentages. We do not want to bin together players as if they were one person. So we only keep players for whom we have more than 1000 serves on record, and we keep all players separate.

After these two cuts, we have a database of 110,685 first serves, grouped by game. With roughly 110,000 serves in total, we should be able to tease out a hot hands effect as small as about 0.4%, averaging over all players. The serve sequences, grouped by game, will be lists like: [first serve was in, first serve was out, first serve was in,...] etc. for each player and for each 5-8 serve game they played.

These data come from the following players:

['Grigor Dimitrov', 'Jo Wilfried Tsonga', 'David Goffin', 'Marin Cilic', 'Stan Wawrinka', 'Stefanos Tsitsipas', 'Rafael Nadal', 'Richard Gasquet', 'Milos Raonic', 'Roberto Bautista Agut', 'Karen Khachanov', 'Roger Federer', 'Andrey Rublev', 'John Isner', 'Matteo Berrettini', 'Alexander Zverev', 'Feliciano Lopez', 'Daniil Medvedev', 'David Ferrer', 'Gael Monfils', 'Novak Djokovic', 'Pablo Carreno Busta', 'Tomas Berdych', 'Dominic Thiem', 'Diego Sebastian Schwartzman', 'Kevin Anderson', 'Benoit Paire', 'Gilles Simon', 'Fabio Fognini', 'Kei Nishikori', 'Denis Shapovalov', 'Andy Murray']

Some players are more highly sampled than others (you can guess who if you are familiar with tennis), and so those players will have the best estimate of their hot-hands effect. At the top of the ticket, we have roughly 11000 serves from Federer and 9000 from Djokovic. 10,000 serves for any one player means that we will be able to resolve a hot hands effect of 1% or larger, for these players individually.
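
These resolution figures can be sanity-checked with a back-of-the-envelope estimate. The sketch below is an assumption-laden approximation rather than the calculation used in the notebooks: it treats serves as independent (true h = 0) and treats P(G|LG) and P(G|LF) as two binomial proportions. The function name and the default P(G) of 0.6 are illustrative.

```python
import math


def hot_hand_uncertainty(n_pairs, p_good=0.6):
    """Approximate 1-sigma uncertainty on h = P(G|LG) - P(G|LF).

    Assumes independent serves, so both conditional probabilities are
    binomial proportions with the same success rate p_good.
    """
    n_after_good = n_pairs * p_good         # pairs whose previous serve was in
    n_after_fault = n_pairs * (1 - p_good)  # pairs whose previous serve was a fault
    variance = p_good * (1 - p_good) * (1 / n_after_good + 1 / n_after_fault)
    return math.sqrt(variance)


for n in (10_000, 110_000, 1_000_000):
    print(n, round(100 * hot_hand_uncertainty(n), 2), "%")
```

Under these assumptions the variance simplifies to 1/N, so the uncertainty is about 1/sqrt(N) whatever the first-serve percentage: roughly 1% for 10,000 pairs and a few tenths of a percent for 100,000. Note that N here counts consecutive pairs within a game, which is somewhat smaller than the number of serves.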

This raises a very interesting bias: sampling bias. Why do Federer and Djokovic have the most serves on record? After all, all pros play a similar number of matches per year. Well:

  • These are top five players, and these data come from televised tournaments. The top five players typically last the longest in these tournaments, and so will play the most televised matches.

  • This is crowd-sourced data. Individuals who love tennis had to watch each match and record what they saw. The top players have the biggest fan base, and so more people pay attention to their matches.

We might be affected by sampling bias if we grouped players together. But because we do not, we are safe from it.

Inferring conditional probabilities from real serve data

We now have an enormous set of first serve data. Our goal is to see if P(G|LG) > P(G|LF), and in particular, by how much.

If P(G|LG) = P(G|LF) + h , then we say there is a hot-hands effect of magnitude h.

The goal is to find the value of h. If h = +10%, for instance, then a player is 10% more likely to make their first serve if their last serve went in also.

An interesting note is that h could, in principle, be negative. Then it would lead to an odd conclusion that a player is less likely to make their first serve if their last serve went in. This is confusing, but not so if you frame it the other way: a player is less likely to miss, if they missed their last serve also. So a negative hot hands effect could be interpreted as a "clutch-up" effect, whereby poor performance causes the player to "get their head in the game" and improve. More on this in the results section. (Spoiler: this is probably not happening for any of the players in this data set).

To calculate h, we need to analyze sequences of first serves and calculate empirically P(G|LG), P(G|LF). The best way to do this is sequence counting. One game of our crowdsourced data, after some formatting, looks like this:

Player: Roger Federer | Game of the match: 11
First serves: G F G G G

This means that in the 11th game of this particular match, Roger made his first first-serve (denoted by G, for in), missed the first-serve in the second point (F), and made the third, and so on. (In the code, the data are all converted to binary: G is denoted by 1 and F by 0.)

The probability of a fault, P(F), and the probability of making a serve, P(G), are really simple to compute: count all the occurrences of F and all the occurrences of G. We will call the number of faults N[F] and the number of made serves N[G]. So:

P(F) = N[F]/(N[F] + N[G]) and

P(G) = N[G]/(N[F] + N[G])

For Roger Federer, we get that P(G) = 0.67; so Roger has a 67% chance of making a first serve. This is close to his ATP stats for first serves during this same time period, so we have passed one easy check. John Isner, by contrast, has a first serve percentage of 73% across this 5 year period.

The more interesting quantities are the conditional probabilities P(G|LG) and P(G|LF). To get these, we count pairs of consecutive serves. So how many times does G G occur? And for P(G|LF), how many times does the sequence F G occur? In the previous example (G F G G G), we have 2 occurrences of G G and 1 of F G. We have roughly 10,000 serves for Roger across 2000 service games. So we simply need to count sequences, using each game independently, and add those numbers together. Formally, we want to count sequences of GG (call the number N[GG]) and divide it by the total count of any two-serve sequence whose previous serve was Good.

P(G|LG) = N[GG] / (N[GF] + N[GG])

P(G|LF) = N[FG] / (N[FF] + N[FG])

In plain English, P(G|LG) is equal to the number of good serves where the last serve was also good, divided by the number of all sequences where the last serve was good (so the present serve could be good or bad, hence the sum in the denominator).

This has been dense, so let's recap: we need to count pairs of consecutive good serves, and pairs of bad serves followed by good. This allows us to calculate P(G|LG) and P(G|LF). If P(G|LG) = P(G|LF) then there is no hot hands effect -- the current serve is independent of the last serve. But if, say, P(G|LG) = 0.68 (for Roger) and P(G|LF) = 0.66, then the hot hands effect is 2%. And simply put:

The hot hands magnitude = P(G|LG) - P(G|LF)
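
The counting recipe above can be sketched as follows. This is a minimal illustration, not the notebook code: it assumes each game is stored as a compact string such as "GFGGG", and the function names are hypothetical.

```python
from collections import Counter


def count_pairs(games):
    """Count consecutive first-serve pairs within each game.

    `games` is a list of strings such as "GFGGG" (G = in, F = fault).
    Pairs never straddle two games.
    """
    counts = Counter()
    for game in games:
        for previous, current in zip(game, game[1:]):
            counts[previous + current] += 1
    return counts


def hot_hand_magnitude(games):
    """Return (P(G|LG), P(G|LF), h) from a list of game strings.

    Raises ZeroDivisionError if no pair starts with G or none starts with F.
    """
    n = count_pairs(games)
    p_g_lg = n["GG"] / (n["GF"] + n["GG"])
    p_g_lf = n["FG"] / (n["FF"] + n["FG"])
    return p_g_lg, p_g_lf, p_g_lg - p_g_lf


# The single example game from the text: G F G G G
print(dict(count_pairs(["GFGGG"])))
print(hot_hand_magnitude(["GFGGG"]))
```

A single game is far too little data to say anything about h, of course; the point is only that the counts match the worked example (2 occurrences of G G, 1 of F G). In practice the list would hold every game on record for one player.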

Verifying the Model on Test Data; the dangers of cherry-picking

You can skip this section if you would like, but then you would be trusting that I did not make any mistakes (and I strongly recommend you do not assume that I made no mistakes). In this section, I verify the code and methods on "mock" (a.k.a fake) datasets with a known hot-hands effect.

Whenever you set out and create an analysis procedure or pipeline, it is necessary to test it thoroughly. Consider a machine learning video processing software that identifies cats. Suppose your end-goal is to have security cameras that tell the owners when a cat is at the water bowl in front of their house. There are two ways to test this invention:

  1. Sell and sign contracts for your cat identifying video cameras, distribute them to homes, and collect money. Then when things don't work, you hastily try and push out software updates until the cameras reliably identify cats.

  2. OR: before you even put the software on an actual video camera, create other software that creates CGI cat videos. Thousands of them. Hundreds of thousands of fake cat videos. And throw some dogs in there. This is what is called a "mock" dataset. In machine learning parlance, it is also commonly called a validation data set. You want to verify that your software can reasonably discern cats from non-cats, before you ship out your software to production.

Number 1. is an extreme example of what is called "testing in production." As you can guess, it is very bad practice. For instance, what if you were designing self driving car software that has to identify pedestrians? Testing in production could, literally, be fatal. 2. is a much better solution. And 2. is exactly what I have done with this code here. I created two mock datasets of fake tennis players: one with zero hot hands effect and one with a 10% effect.
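
To give a flavor of what such a mock dataset can look like, here is a small sketch. It is not the generator from the notebooks: it assumes a simple two-state chain in which P(G|LG) = P(G|LF) + h, a fixed number of serves per game, and conditionals chosen so that the long-run first-serve rate stays at P(G). All names and defaults are illustrative.

```python
import random


def simulate_games(n_games, p_good, h, serves_per_game=6, seed=0):
    """Generate mock games of first serves with a known hot-hand magnitude h."""
    rng = random.Random(seed)
    p_g_lf = p_good * (1 - h)  # keeps the long-run rate at p_good
    p_g_lg = p_g_lf + h
    games = []
    for _ in range(n_games):
        serves = ["G" if rng.random() < p_good else "F"]
        for _ in range(serves_per_game - 1):
            p = p_g_lg if serves[-1] == "G" else p_g_lf
            serves.append("G" if rng.random() < p else "F")
        games.append("".join(serves))
    return games


def estimate_h(games):
    """Recover h by counting consecutive pairs within each game."""
    n = {"GG": 0, "GF": 0, "FG": 0, "FF": 0}
    for game in games:
        for previous, current in zip(game, game[1:]):
            n[previous + current] += 1
    return n["GG"] / (n["GG"] + n["GF"]) - n["FG"] / (n["FG"] + n["FF"])


for true_h in (0.0, 0.10):
    mock = simulate_games(n_games=20_000, p_good=0.6, h=true_h)
    print(true_h, round(estimate_h(mock), 3))
```

If the pipeline is working, the recovered values should scatter around the inputs of 0 and 0.10, with a spread set by the number of serve pairs.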

Long story short, repeating the analysis outlined in Inferring conditional probabilities from real serve data is successful. The 0% dataset yields a hot hands effect of 0.05% +- 0.14% (more on this +- below), and the 10% case yields an estimate near 10%. See Figure 1 below. This figure shows the inferred hot hands % for 30 fictional "players". The dot for each player is the hot hands % h, while the blue bars show the interval of confidence. The interval of confidence is for "1 sigma", and can be understood heuristically as follows: the dot (the observed data point) can fluctuate within that interval or bar just from random noise.

In the next section, we do this analysis on real data (which really just amounts to pressing Go on the code with the real data). But look at one crucial conclusion from the left panel of Figure 1. Every fictional player (with "names" 0 through 29) has a 0% underlying hot/cold hands effect. Emphatically 0. Yet, due to the finite number of samples, the inferred value (the dot in the plot) can be slightly above or below 0%. This is measurement error. It is unavoidable and can only be shrunk by having more serves in the dataset. The important corollary here is that we cannot look at player 29, for instance, and say oh, they have a hot hands effect of 1.5%. The error bar helps us avoid this, because they have 1.5% but only with what is called 1 sigma significance, which is to say it's not very significant at all. You can see that the confidence is low because the bar on player 29's data point is very close to intersecting with 0. The odds in favor of player 29 having a real hot hands effect are only 3 to 1. And we know for a fact that h is 0%, because we made this dataset from scratch. This illustrates the danger of cherry-picking a dataset for outliers.

You are bound to find outliers in any dataset, but it's possible (and the case here) that there is no profound physical reason behind their outliery-ness. The reason may simply be unavoidable random scatter. It also tells us the importance of having accurately estimated uncertainties (the bars in the Figure), because you can assess how significantly a data point is offset from some value. None of the data points in Figure 1 (left panel) are said to be offset from 0% with high significance. Both figure panels are just random noise, perfectly expected, about the input values of 0% and 10%.

Figure 1: The estimate of the hot hands effect for two sets of 30 fake players. On the left is a scenario where each player has a true underlying hot hands effect of 0%. No player on the left has any hot hands effect, truly, but measurement uncertainty causes us to erroneously infer values that fluctuate around 0%. On the right is a 10% hot hands scenario. Again, measurement uncertainty causes jitter around 10%. In both panels, the dots are the mean for each player and the bars denote plus-or-minus (+-) the uncertainty with the estimate of that mean (which is about 1% here). The number of serves in the fake data set was chosen to be very close to the number in the real dataset.

Results: 10-to-1 evidence for a 1% hot-hands effect

The same analysis used on the mock data is applied to the 110,000 serves compiled by the Tennis Abstract Match Charting Project. See the acknowledgements section for the data; the code is here. The result is summarized by Figure 2 below. Note the intentional visual similarity between it and the mock data, Figure 1.

Figure 2: The estimate of the hot hands effect for 32 real professional tennis players. In both panels, the dots are the mean for each player and the bars denote plus-or-minus (+-) the uncertainty with the estimate of that mean. The average hot hands magnitude is about 1% +- 0.5%.


There is a lot to unpack in Figure 2. On average, across all 32 players, the hot hands effect is 0.91% with an uncertainty of 0.56% (proper inverse variance weighting was used). So there is modest evidence in favor of a real hot hands effect, but the odds are just modest; roughly 10:1 in favor. In astrophysics, we would not consider this a firm detection -- this would merely be a suggestion that this effect probably exists, but it's not rock-solid.
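
For readers unfamiliar with inverse variance weighting, here is a minimal sketch of the standard formula: each player's estimate is weighted by 1/sigma^2, so well-sampled players count for more. The function name is hypothetical and the numbers are made up for illustration; they are not values from the real data set.

```python
import math


def inverse_variance_mean(values, sigmas):
    """Combine per-player estimates, weighting each by 1 / sigma^2.

    Returns the weighted mean and its 1-sigma uncertainty.
    """
    weights = [1.0 / s**2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)


# Made-up hot hands estimates and uncertainties, in percent.
h_values = [2.0, 0.0, 4.0]
h_sigmas = [1.0, 1.0, 2.0]
print(inverse_variance_mean(h_values, h_sigmas))
```

The third player has twice the uncertainty of the others and so gets a quarter of their weight, which pulls the combined value toward the two better-measured players.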

Here are some important features to note about Figure 2: the players with the smallest uncertainty on their hot hands magnitude h are those with the highest number of serves on record (like Roger Federer and Novak Djokovic). To look at this further, I used bootstrap uncertainties to calculate the probability density of Roger's hot hands effect. This is shown in Figure 3b. Likewise, I computed the probability density for John Isner's h in Figure 3a.
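
A bootstrap of this kind can be sketched as below. I am assuming here that whole games are resampled with replacement (resampling individual serves would break up the pairs); the notebooks may differ in the details, the function names are illustrative, and the mock games are generated on the spot rather than taken from the real data.

```python
import random
import statistics


def pair_counts(game):
    """Return (N[GG], N[GF], N[FG], N[FF]) for one game string."""
    pairs = [a + b for a, b in zip(game, game[1:])]
    return tuple(pairs.count(key) for key in ("GG", "GF", "FG", "FF"))


def h_from_counts(counts):
    """Hot hands magnitude from a list of per-game pair counts."""
    gg, gf, fg, ff = (sum(column) for column in zip(*counts))
    return gg / (gg + gf) - fg / (fg + ff)


def bootstrap_h(games, n_resamples=500, seed=0):
    """Resample games with replacement; return mean and spread of h."""
    rng = random.Random(seed)
    per_game = [pair_counts(g) for g in games]
    draws = [h_from_counts(rng.choices(per_game, k=len(per_game)))
             for _ in range(n_resamples)]
    return statistics.mean(draws), statistics.stdev(draws)


# Mock player with independent serves (true h = 0), 2000 games of 6 serves.
rng = random.Random(1)
mock_games = ["".join("G" if rng.random() < 0.6 else "F" for _ in range(6))
              for _ in range(2000)]
mean_h, sigma_h = bootstrap_h(mock_games)
print(round(mean_h, 4), round(sigma_h, 4))
```

The list of resampled h values is what gets histogrammed to produce a probability density like those in Figures 3a and 3b, and its standard deviation serves as the error bar.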

Roger Federer shows no evidence of a hot-hands effect. His hot-hands effect is 1%, but the uncertainty on that value is also 1%, so it is in statistical agreement with 0% -- i.e., Roger's nerves are seemingly unaffected by missing a serve. John Isner on the other hand, who is known for having a massive serve that sometimes goes cold, has a hot hands effect of 5%. This is modestly statistically significant, at about 80-to-1 odds. However, as I discuss in the next section, for a subtle statistical reason, those odds are in reality substantially smaller. So John Isner does not have any strong evidence in favor of a hot-hands effect either.

Interestingly, some players appear to have a "negative" hot hands effect. This is the so-called clutch-up effect we talked about earlier. Novak Djokovic appears to have h of about negative 1%. Unfortunately though, none of the players have a significant negative hot hands effect. So there is no conclusion to draw here. The apparently negative value observed for Novak is indistinguishable from random scatter due to measurement error.


Figure 3a. The distribution of probabilities for John Isner's hot hands magnitude. The most probable value is 5%, but the uncertainty in that value is large, about 2%.

Figure 3b. Same as 3a., but for Roger Federer. The most probable value is 1%, with an uncertainty of about 1%.


Discussion

Across 110,000 professional tennis first serves, there is mild evidence in favor of a hot hands effect. The average hot hands effect is 0.91% with an uncertainty of 0.56%, meaning that if a professional tennis player makes a first serve, then they are (on average) about 1% more likely to make their next first serve.

Remember the bet that you and your buddy made:

the chance is higher for a player's current serve to go in if their last serve was in also.

So now you have data to back up your decision; and lucky for you, it is data that your buddy does not have. The data favor that statement being true at roughly 10:1, so if your buddy bets that it is true, taking the other side is only favorable if they give you better than 10:1 odds (this is not investment advice or an encouragement to gamble).

Thanks for sticking with me, and hopefully you learned some things along the way.

Afternote: why John Isner's whopping 5% hot hands effect, is unfortunately, probably not real

We found in Figure 3a that John Isner's hot hands effect appears to be 5%, and that there were ostensibly about 80-to-1 odds in favor of this (we also call this "2.5 sigma" evidence, in astrophysics). But we should take those 80-to-1 odds with a grain of salt, because of the following:

We arrived at a set of hot hands magnitudes h, for a large number of players first; and then after, we picked a player with large h.

So we cherry-picked an outlier. John Isner is roughly a 2.5 sigma outlier, and for a data set of about 30 players, we expect at least one such outlier roughly a third of the time from noise alone. So John could, in reality, have an underlying hot hands effect of 0% -- but because of measurement error (i.e., luck with how serves happened to land), he was just the 1 of 30. In a set of 30 players, there will usually be a 2-sigma outlier and sometimes a 2.5-sigma outlier. So there may not be any physical meaning to his 5%. This is also what is called a selection effect. The 80-to-1 odds were quoted without taking the selection effect into account. If we did take that subtle effect into account, those odds would probably be significantly lower.
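
How often should an outlier like this turn up by chance? A rough sketch, assuming every player has a true h of 0, Gaussian measurement errors, independent players, and a two-sided definition of "outlier" (all of these are my assumptions for illustration):

```python
import math


def two_sided_p_value(n_sigma):
    """Chance that Gaussian noise alone lands at least n_sigma from zero."""
    return math.erfc(n_sigma / math.sqrt(2))


def chance_of_any_outlier(n_sigma, n_players):
    """Chance that at least one of n_players is an n_sigma outlier by luck."""
    return 1 - (1 - two_sided_p_value(n_sigma)) ** n_players


print("odds for one pre-chosen player:", round(1 / two_sided_p_value(2.5)))
print("any 2.5-sigma outlier in 30:", round(chance_of_any_outlier(2.5, 30), 2))
print("any 2.0-sigma outlier in 30:", round(chance_of_any_outlier(2.0, 30), 2))
```

Under these assumptions, a single player chosen in advance would be a 2.5-sigma outlier only about once in 80 tries, but the chance that somebody in a group of 30 is one comes out at roughly 0.3, in the same ballpark as the estimate above. For a 2-sigma outlier it is about 0.75. That gap is the selection effect.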

Acknowledgements and Data

I thank Mileta Cekovic for pointing me to the Tennis Abstract Match Charting data. This study would not be possible without those data.

This work is based on crowdsourced shot-by-shot professional tennis data by The Tennis Abstract Match Charting Project and is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The data in this article was obtained from https://github.com/JeffSackmann/tennis_MatchChartingProject , commit hash f82e786a931a1ed5b4fcb52a1abb172d90493efd