Approximating probabilities

Introduction

Suppose we want to know the probability of some event. There are two ways we can find that: use algebra, calculus, probability theory, etc. to find the answer mathematically, or we can write a program to find the answer through the Monte Carlo method.

For example, in the famous birthday problem (taught in all intro to probability theory courses) we are interested in the probability that, in a set of $n$ randomly chosen people, some pair of them will have the same birthday (which we refer to as event $A$).

To follow along I recommend installing R and the RStudio IDE. Visit RStudio Education for resources (many of which are free!) for learning R.

Analytical solution

Using Kolmogorov axioms and conditional probability we can derive an analytical solution for $P(A)$:

$$P(A) = 1 - \frac{n! \cdot {\binom {365}{n}}}{365^{n}}$$

Deriving an exact solution analytically is an interesting challenge but we have computers now and can find an approximate answer to this scenario and more. Using random number generation we can simulate an outcome and then count how many times a particular outcome occurred out of all the simulations.

For evaluating the accuracy of our simulation-based results, we will be using the following R code for calculating $P(A)$:

pa <- function(n) {
  1 - (factorial(n) * choose(365, n)) / (365^n)
}

For example, out of 23 people with uniformly distributed birthdays, the probability that two people will share a birthday is 51%.

Simulation-based solution

To facilitate some of the work, I rely on the purrr R package, so let’s attach that:

library(purrr)

We start with a basic function that randomly generates $n$ birthdays, assuming a year with 365 days and that each day is equally likely:

birthdays <- function(n) {
  sample(1:365, n, replace = TRUE)
}

Then we have a function which simulates a single outcome: either 2 or more people share a birthday (there are duplicate values) or they were all born on different days:

simulate_a <- function(n) {
  bdays <- birthdays(n)
  any(duplicated(bdays))
}

Approximating $P(A)$ involves simulating a bunch of outcomes and calculating the proportion of those outcomes where the event $A$ was observed:

pa2 <- function(n, simulations) {
  # output a TRUE/FALSE in each simulation:
  simulation_outcomes <- map_lgl(1:simulations, ~ simulate_a(n))
  
  # proportion of simulations that are TRUE:
  sum(simulation_outcomes) / simulations
}

Lastly, we should do a few repetitions to get more approximations because we shouldn’t trust just one.

repeat_pa2 <- function(n, simulations, repeats) {
  map_dbl(1:repeats, ~ pa2(n, simulations))
}

Now we can compare pa(n) and pa2(n) at different values of $n$:

set.seed(23)

approx_pa <- map_dfr(
  c(5, 10, 20, 23, 30, 40, 50),
  function(n) {
    pas <- repeat_pa2(n, simulations = 1e5, repeats = 10)
    data.frame(
      n = n,
      analytical = pa(n),
      simulated = sprintf("%.3f (%.3f-%.3f)", mean(pas), min(pas), max(pas)))
  }
)

n	analytical	simulated
5	0.027	0.027 (0.026-0.028)
10	0.117	0.117 (0.115-0.118)
20	0.411	0.412 (0.409-0.415)
23	0.507	0.507 (0.504-0.509)
30	0.706	0.707 (0.705-0.709)
40	0.891	0.892 (0.890-0.893)
50	0.970	0.970 (0.970-0.971)

Simulation gives us probabilities pretty close to the ones we get using the analytically-derived formula!

Extension

Suppose we were instead interested in a different question:

What is the probability of event $B$, that in a group of $n$ people, there will be a sequence of $m$ consecutive days in which there is at least one birthday each day?

Good luck trying to derive a generalized solution analytically! Luckily, we have computers! Let’s start by defining a function that simulates a single outcome:

simulate_b <- function(n, m) {
  bdays <- birthdays(n) # re-using from before
  
  # two birthdays on same day doesn't matter anymore:
  bdays <- unique(bdays)
  
  # calculate how many days between each birthday:
  diffs <- c(0, diff(sort(bdays))) # 33, 35, 36 => 0, 2, 1
  
  for (i in m:length(diffs)) {
    # event B is m differences of 1 days between birthdays:
    if (sum(diffs[(i - m + 1):i] == 1) == m) {
      return(TRUE) # stop checking at 1st observance of B
    }
  }
  
  return(FALSE) # event B was not observed
}

We proceed as before: we run simulate_b a bunch of times and calculate the proportion of those times where $B$ occurred:

pb <- function(n, m, simulations) {
  simulation_outcomes <- map_lgl(1:simulations, ~ simulate_b(n, m))
  mean(simulation_outcomes)
}

And repeat it a few times to capture the uncertainty:

repeat_pa2 <- function(n, m, simulations, repeats) {
  map_dbl(1:repeats, ~ pb(n, m, simulations))
}

So, what is the probability that in a group of, say, 77 people (with uniformly random birthdays across 365 days) we will find 6 consecutive days in which there is at least one birthday each day?

set.seed(42)

summary(approx_pb <- repeat_pa2(77, 6, 1e5, 10))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002000 0.002093 0.002210 0.002197 0.002267 0.002460

For that particular scenario and event, the probability is approximately 0.22%. The way we’ve coded it means we can quickly approximate the probability for any number of people $n$ and any number of days $m$.

Exercises

If you’re interested in practicing this skill, I encourage you to practice with the following:

Verify to yourself, through simulation, that in the Monty Hall problem, contestants who switch doors have a $\frac{2}{3}$ chance of winning but contestants who stick to their initial choice have a $\frac{1}{3}$ chance of winning.
Suppose two people are playing non-stop rock–paper–scissors (also called roshambo). Player 1 (P1) always chooses one of the three shapes at random, but Player 2 (P2) always chooses between the two shapes they didn’t pick the previous round. Is one tactic better than the other – that is, which one has a higher probability of winning than the other one? (Or are they both about the same?)
Suppose P1 adjusts their strategy and always picks the shape that P2 picked the previous round, while P2 sticks to their original strategy. So, for example, if P2 chose “paper” (a flat hand) in round 1, then P1 will definitely pick “paper” in round 2, while P2 will randomly pick between “scissors” and “rock”. Is this a better or worse strategy for P1 than always randomly picking between the three shapes? How does the win probability of P1’s new strategy compare to the win probability of P2’s strategy?