Overview
This unit is the first of two parts in which we introduce probability theory. Probability theory is fundamental if we want to analyse randomised strategies. This is particularly useful in certain parts of machine learning and AI. Ever wondered how Monte Carlo methods work? Ever wondered how chess engines explore such a large state space when figuring out optimal strategies? It turns out that when the scale of a problem gets large enough, randomising your approach might not give you the best solution, but we can hope to maximise our chances of getting something "good enough".
To do this, we first need to talk about the basic rules of probability theory, and we will also reason about basic probability in this unit. We will follow this up by introducing common distributions in the subsequent unit, finishing off with bounds and deviations.
Introduction
Let’s begin with an example scenario called the Monty Hall Problem:
You're at a gameshow, and there are $3$ doors. One of the doors, with equal chance ($1$ in $3$), has a car behind it; the other two doors have goats behind them. The game goes like this:
- You can choose any door you wish.
- After that, of the remaining doors that you did not pick, the host will pick a door that has a goat, and reveal it to you.
- The host then offers you a choice: Do you stay on the door you originally picked? Or do you switch over to the other door?
- After you decide which door you want, the host reveals what is behind it. If you ultimately chose the door with the car, you win the car. Otherwise, you win the goat.
People have long debated as to what the best strategy is. Should we always stick? Should we always switch? Does it even matter? Is it always a $50/50$ chance?
Interestingly, we can use probability theory to understand why the strategy of always switching is actually a good strategy. (We will win the car with probability $2/3$.)
To understand why, we should take a formal approach to this problem.
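Before the formal treatment, here is a quick empirical sanity check you can run yourself: a minimal Python sketch that simulates the game many times under both strategies. (The `play` helper and the door labels are illustrative choices, not part of the formal analysis below.)

```python
import random

def play(switch: bool) -> bool:
    """Play one round of Monty Hall; return True if we win the car."""
    doors = ["A", "B", "C"]
    car = random.choice(doors)    # the car is placed uniformly at random
    pick = random.choice(doors)   # our initial pick, also uniform
    # The host reveals a goat door: one that is neither our pick nor the car.
    revealed = random.choice([d for d in doors if d not in (pick, car)])
    if switch:
        # Switch to the single remaining door (not our pick, not revealed).
        pick = next(d for d in doors if d not in (pick, revealed))
    return pick == car

trials = 100_000
for name, switch in [("stay", False), ("switch", True)]:
    wins = sum(play(switch) for _ in range(trials))
    print(f"{name:6s} wins about {wins / trials:.3f} of the time")
# stay lands near 1/3, switch near 2/3
```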
Considering the Possible Outcomes
Let’s look at the possible outcomes that we might have from this game. For this, let’s say:
- Call the possible doors $A$, $B$, and $C$.
- One of the doors is chosen uniformly at random to contain the car. Let's call this outcome door $X$.
- You have an equal chance of taking any of the doors uniformly at random. Let's call this outcome door $Y$.
- The host will consider the two doors you didn't pick, and he will reveal one door that has a goat. Let's call this outcome door $Z$.
- Then from there, we need to pick either outcome door $Y$ (staying) or the remaining unopened door (switching).
So let's first think about all of the possible sequences of outcomes, writing each as a triple $(X, Y, Z)$:
Think of it this way: if outcome door $X$ is the same as outcome door $Y$ (e.g., the car was behind door $A$, and we chose door $A$), then the host can reveal either door $B$ or door $C$.
So based on the tree, notice how we can have a sequence of outcomes $(A, A, B)$ or a sequence of outcomes $(A, A, C)$.
On the other hand, if outcome door $X$ is not the same as outcome door $Y$ (e.g., the car was behind door $A$, and we chose door $B$), then the host has only one possible choice to reveal: the door that we did not choose, and that does not have the car. For example, if the car was behind door $A$, and we chose door $B$, then the host must reveal door $C$.
So based on the tree, we can have a sequence of outcomes like $(A, B, C)$ or $(B, A, C)$. But never something like $(A, B, B)$ or $(A, B, A)$.
So if we listed out all of the possible sequences of outcomes, we get this list:
$$(A,A,B), (A,A,C), (A,B,C), (A,C,B),$$
$$(B,B,A), (B,B,C), (B,A,C), (B,C,A),$$
$$(C,C,A), (C,C,B), (C,A,B), (C,B,A)$$
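If you would like to double-check that nothing is missing from this list, here is a small sketch that enumerates the same sample space by applying the host's rule directly:

```python
from itertools import product

doors = ["A", "B", "C"]
# A valid outcome is (car door, picked door, revealed door), where the host
# never reveals the picked door and never reveals the car.
outcomes = [
    (car, pick, rev)
    for car, pick, rev in product(doors, repeat=3)
    if rev != pick and rev != car
]
print(len(outcomes))  # 12
print(outcomes)
```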
Okay, so we know which outcomes are possible. What we care about is using this to help analyse the probability that always switching wins us the car. For that, we first need to identify the subset of outcomes that we care about. We call these subsets of outcomes *events*.
For example, if we asked, "what is the probability that the car is behind door $B$?", then the set of outcomes that corresponds to this event is $\{(B,B,A), (B,B,C), (B,A,C), (B,C,A)\}$. Similarly, for "what is the probability that the car is behind door $C$?", the corresponding set is $\{(C,C,A), (C,C,B), (C,A,B), (C,B,A)\}$.
In this case, let's ask ourselves: "If we always choose the option to switch, what is the probability of winning?" Let's first think about which outcomes correspond to us winning if we always chose to switch. For example, suppose the car was behind door $A$, we chose door $B$, and the host was forced to reveal door $C$. Then of the remaining options, we either choose to stay on door $B$ or switch to door $A$; so switching to door $A$ lets us win the car.
Let's think about it a little more: every time we choose a door that does not have the car, switching wins! Either:
- The car is behind door $A$, and if we picked either door $B$ or $C$, the host will eliminate the other goat door, so if we switched, we will definitely end up at door $A$.
- The car is behind door $B$, and if we picked either door $A$ or $C$, the host will eliminate the other goat door, so if we switched, we will definitely end up at door $B$.
- The car is behind door $C$, and if we picked either door $A$ or $B$, the host will eliminate the other goat door, so if we switched, we will definitely end up at door $C$.
Okay, so this means here is the set of outcomes we care about:
$$\{(A,B,C), (A,C,B), (B,A,C), (B,C,A), (C,A,B), (C,B,A)\}$$
The probability of an event
Okay so carrying on from the previous part, the next thing to ask is what is the probability we obtain any of these outcomes? Let’s go back to the tree and tag it with probabilities. Recall:
- Each door has a $1$ in $3$ chance of having the car.
- Each door has a $1$ in $3$ chance of being picked by us.
- The host has to pick a door that we did not pick and that has a goat to reveal (choosing with equal probability when there are two options).
So the tree, if we tagged it with probabilities, looks like this:
The tree is looking a little dense, but here's the idea: what is the probability of the sequence of outcomes $(A, A, B)$? There was a $1/3$ chance of the car being behind door $A$; after that, there is a $1/3$ chance of us picking door $A$; and after that, there is a $1/2$ chance that the host reveals door $B$. That's why the edges of the tree look that way. So the total probability is given as:
$$\frac{1}{3} \times \frac{1}{3} \times \frac{1}{2} = \frac{1}{18}$$
We could also ask something like: what's the chance that the sequence of outcomes is $(A, B, C)$? Again, if you check the tree, there is a $1/3$ chance that the car is behind door $A$. After that, there is a $1/3$ chance of us picking door $B$. Then there's a probability of $1$ that the host picks door $C$ (he has no other choice). So the total probability is given as:
$$\frac{1}{3} \times \frac{1}{3} \times 1 = \frac{1}{9}$$
So again, the set of outcomes we care about is:
$$\{(A,B,C), (A,C,B), (B,A,C), (B,C,A), (C,A,B), (C,B,A)\}$$
Notice that if we refer to the tree, each and every one of these outcomes has probability $1/9$. Because there are $6$ of these outcomes, the total probability is $6 \times \frac{1}{9} = \frac{2}{3}$. We are more likely than not to win if we always switch!
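As a cross-check of this computation, here is a sketch that tags each outcome with its tree probability (using exact fractions) and sums over the switching-wins event:

```python
from fractions import Fraction

doors = ["A", "B", "C"]
third = Fraction(1, 3)

prob = {}
for car in doors:
    for pick in doors:
        # Doors the host may reveal: not picked, and no car behind them.
        options = [d for d in doors if d not in (pick, car)]
        for rev in options:
            # Pr[car placement] * Pr[our pick] * Pr[host's choice]
            prob[(car, pick, rev)] = third * third * Fraction(1, len(options))

assert sum(prob.values()) == 1  # the outcome probabilities sum to 1
switch_wins = [o for o in prob if o[0] != o[1]]  # switching wins iff pick != car
print(sum(prob[o] for o in switch_wins))  # 2/3
```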
Probability Spaces, Outcomes and Events
So let’s start covering the basic concepts, while using the Monty Hall Problem (in Introduction) as an example. To model probabilistic phenomena, we have:
- A set of all possible outcomes; this is called the sample space. Typically, we use the notation $\Omega$ to denote the set of all possible outcomes.
- Each outcome is associated with a probability value. This value must be between $0$ and $1$.
- An event $E$ is a subset of the sample space $\Omega$. The probability of an event is just the sum of the probabilities of its outcomes.
In terms of notation, the probability of an outcome $\omega$ is written as $\Pr[\omega]$. And typically, the probability of an event $E$ is written as:
$$\Pr[E] = \sum_{\omega \in E} \Pr[\omega]$$
(Sample Space) So for example, in the Monty Hall problem, our sample space was given as:
$$\Omega = \{(A,A,B), (A,A,C), (A,B,C), (A,C,B), (B,B,A), (B,B,C), (B,A,C), (B,C,A), (C,C,A), (C,C,B), (C,A,B), (C,B,A)\}$$
(Probability Values) The tree that we drew assigned each outcome a probability value between $0$ and $1$. For example, we said that the probability of the outcome $(A,A,B)$ was $1/18$, whereas the probability of the outcome $(A,B,C)$ was $1/9$.
(Event) Lastly, we considered the strategy of always switching. Then the event that we win is given by this subset:
$$\{(A,B,C), (A,C,B), (B,A,C), (B,C,A), (C,A,B), (C,B,A)\}$$
We noted that the probability of each outcome here was $1/9$, so the probability of the entire event was $2/3$ (since there are $6$ outcomes in this event, each with probability $1/9$).
A few more rules about probability:
Note that for probabilities to make sense, we need a few more rules. These will be helpful for us when figuring out probabilities of certain events. Here are the rules:
- The probability of any outcome is between $0$ and $1$, i.e., $0 \le \Pr[\omega] \le 1$.
- The sum of the probabilities of all outcomes is exactly $1$, i.e., $\sum_{\omega \in \Omega} \Pr[\omega] = 1$.
Here are also some helpful facts. Recall from counting and set theory that:
- If $A$ and $B$ are disjoint sets, then $|A \cup B| = |A| + |B|$.
- If $A \subseteq B$, then $|B \setminus A| = |B| - |A|$.
We have similar rules in probability, they go like so:
- If $A$ and $B$ are disjoint sets of outcomes (disjoint events), then: $\Pr[A \cup B] = \Pr[A] + \Pr[B]$.
- If $\Omega$ is the sample space, and $A$ is an event ($A \subseteq \Omega$), then since $\Pr[\Omega] = 1$: $\Pr[\Omega \setminus A] = 1 - \Pr[A]$.
We also write $\Omega \setminus A$ as $\bar{A}$. Let's think about what the event $\bar{A}$ means for a little. It's basically the "complement" event of $A$. In other words, $\Pr[\bar{A}]$ is the probability that $A$ did not happen.
An example of what we have seen so far:
Let's work through an example to talk about some of these points. Imagine we had $3$ items: $1$, $2$, and $3$. Let's say that we took any possible permutation of the items uniformly at random, then asked: what is the probability that the first item remained in its position?
Let’s fall back to the framework that we just mentioned. This means:
- Listing out the set of all possible outcomes
- Identifying the event that we cared about
- Figuring out the probability of the event
Since we are sampling any possible permutation of the $3$ items, this means that there are $3! = 6$ possible outcomes:
$$(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)$$
So our sample space is the set that contains these permutations.
Moving on, we asked, "what is the probability that the first item remained in its position?". This is the event that we cared about. Again, this is a set of outcomes. This happens to be:
$$\{(1,2,3), (1,3,2)\}$$
This is exactly the set of outcomes where the first item remained in its position.
Lastly, we need to figure out the probability of the event, which is basically asking: what is the probability we obtained $(1,2,3)$ as the permutation, plus the probability we obtained $(1,3,2)$ as the permutation.
Since we said that each permutation is taken uniformly at random, this means that the probability of any outcome is $1$ divided by the number of outcomes. Since there are $6$ outcomes here, each outcome occurs with probability $1/6$.
So looking back at our event, the probability that the first element remains in position $1$ is given as:
$$\Pr[\{(1,2,3), (1,3,2)\}] = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$
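Here is the same computation as a tiny sketch, enumerating the permutations with `itertools`:

```python
from fractions import Fraction
from itertools import permutations

omega = list(permutations([1, 2, 3]))     # all 3! = 6 outcomes, uniform
event = [p for p in omega if p[0] == 1]   # first item stays in position 1
print(Fraction(len(event), len(omega)))   # 1/3
```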
Example Part 2: Complement events
What about if we asked something like: what is the probability that the first item is not at position $1$? Now the event is given as:
$$\{(2,1,3), (2,3,1), (3,1,2), (3,2,1)\}$$
Now, we could have directly computed this and said, "There are $4$ outcomes in our event, so the probability is $4/6 = 2/3$." Did you notice? This set of outcomes is actually the same as:
$$\Omega \setminus \{(1,2,3), (1,3,2)\}$$
So really, we can think of the event that "the first item is not at position $1$" as the complement of the event "the first item is at position $1$". So, we could also answer this as $1 - \frac{1}{3} = \frac{2}{3}$.
Example Part 3: Disjoint Events
What about answering the following question: what is the probability that the first item is in position $2$ or position $3$?
Then we can think of these as disjoint events:
$$A_2 = \{(2,1,3), (3,1,2)\}, \quad A_3 = \{(2,3,1), (3,2,1)\}$$
$A_2$ here corresponds to the event that the first item is in position $2$. $A_3$ here corresponds to the event that the first item is in position $3$.
Again, notice that $A_2$ and $A_3$ are disjoint. So we can compute the probability as $\Pr[A_2 \cup A_3] = \Pr[A_2] + \Pr[A_3]$, and we know each of $\Pr[A_2]$ and $\Pr[A_3]$ is actually just $\frac{2}{6} = \frac{1}{3}$, giving a total of $\frac{2}{3}$.
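Both the complement rule and the disjoint-union rule are easy to verify the same way; a short sketch:

```python
from fractions import Fraction
from itertools import permutations

omega = list(permutations([1, 2, 3]))
A1 = [p for p in omega if p[0] == 1]   # item 1 in position 1
A2 = [p for p in omega if p[1] == 1]   # item 1 in position 2
A3 = [p for p in omega if p[2] == 1]   # item 1 in position 3

# Complement rule: Pr[not A1] = 1 - Pr[A1]
print(1 - Fraction(len(A1), len(omega)))        # 2/3
# Disjoint events: Pr[A2 or A3] = Pr[A2] + Pr[A3]
print(Fraction(len(A2) + len(A3), len(omega)))  # 2/3
```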
Example Part 4: Dice
Just to re-iterate the point a little bit, imagine we were given a fair dice that gives either $1$, $2$, $3$, $4$, $5$, or $6$ uniformly at random: what is the probability that the dice gives us an even number?
Again, the sample space is now the set $\Omega = \{1, 2, 3, 4, 5, 6\}$. The event is $E = \{2, 4, 6\}$. So the probability of $E$ (written as $\Pr[E]$) is basically:
$$\Pr[E] = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
Conditional Probability
Why Conditional Probability?
Let's move on to talk about something called conditional probability. Think about the following kinds of questions:
- Given the event that a dice rolled an even number, what is the probability the dice rolled a $6$?
- Given the event that, after flipping a few coins, we saw a certain number of heads, what is the probability the first coin we flipped turned up heads?
How should we analyse this? Why should we do it? Here's another example scenario that might be a little more motivating or relevant: medical testing. Medical tests (like the COVID test kits that everyone has) in real life are not perfect. There is always a small chance of error.
- A test could report positively that you have a condition, even if you don’t. This is called a false positive.
- A test could report negatively that you don’t have a condition, even if you do. This is called a false negative.
(Bear in mind that “positive”/“negative” here isn’t talking about “good/bad” news, it’s talking about affirming or rejecting the fact that you have the condition.)
Now this extends beyond medical tests. Imagine we had to write an image classifier that tells you whether an image contains an apple.
- There’s a chance the classifier reports a false positive: the image has no apple, but it says there is one.
- There’s a chance the classifier reports a false negative: the image has an apple, but it says there is none.
Definition:
So here’s an example of a statement we care about: Given that a test reports positive, what is the probability it is a true positive? (i.e., If a test says yes, what is the chance the answer is actually really yes?)
Think of the event that the test reports positive as our condition. In general, for events $A$, $B$, we write:
$$\Pr[A \mid B]$$
to represent the probability of event $A$ occurring, given that event $B$ happens. To be clear, there is no sense of time here. (There are two things called "a priori probability" and "posterior probability", but we will not be covering them here.) We are only promised that event $B$ happens; we do not need to insist whether it happens before or after event $A$ (either is OK).
So what is the quantity $\Pr[A \mid B]$? How do we compute this? Well, the definition is given as:
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}$$
This looks a lot more doable! But just to be clear: $\Pr[A \cap B]$ can be thought of as the probability that both event $A$ and event $B$ occur.
An example:
Let’s revisit the following question:
Given the event that a dice rolled an even number, what is the probability the dice rolled a $6$?
Again, this is a dice, so our sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$. We now need to create our events $A$ and $B$. Let's say that event $B$ is the event that the dice rolled an even number. This means that:
$$B = \{2, 4, 6\}$$
Meanwhile, $A$ is the event we rolled a $6$, so $A = \{6\}$. So what is the probability of the event $A \cap B$? Since $A \cap B = \{6\}$, this means:
$$\Pr[A \cap B] = \frac{1}{6}$$
And so:
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{1/6}{1/2} = \frac{1}{3}$$
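This kind of computation is mechanical enough to sketch in a few lines, counting outcomes in the (uniform) sample space:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # fair dice, uniform outcomes

def pr(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

B = {2, 4, 6}   # condition: rolled an even number
A = {6}         # event: rolled a 6
print(pr(A & B) / pr(B))   # Pr[A | B] = (1/6) / (1/2) = 1/3
```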
A Visual Intuition:
What is going on with conditional probability? Here’s the idea:
The probability $\Pr[A \mid B]$ is basically the ratio of the size of $A \cap B$ to the size of $B$. We're basically saying: we know we're in set $B$, so what's the probability we're in set $A$? In particular, we don't consider the outcomes in set $A$ that are outside of $B$.
Back to Testing: Another Example
So, back to testing. Let's go back to the example of a classifier that needs to tell you if there is an apple in the image or not.
Here are 4 possible questions we can ask:
- Given that a classifier reports positive, what is the probability it is a true positive?
- Given that a classifier reports positive, what is the probability it is a false positive?
- Given that a classifier reports negative, what is the probability it is a true negative?
- Given that a classifier reports negative, what is the probability it is a false negative?
Let’s try looking at the first one for our example.
Let $A$ be the event that there actually is an apple, and let $B$ be the event that the classifier reports positive. Then again, given the classifier reports positive, how confident we are that there actually is an apple is given by $\Pr[A \mid B]$. Now, the probability $\Pr[A \cap B]$ is the probability that the classifier reports positive and the image has an apple, whereas $\Pr[B]$ is the probability the classifier returns positive (on all images).
So here's an example. Let's say we have a dataset with $10$ pictures (we could not afford many pictures), and $6$ of them have apples, while $4$ of them do not have apples.
Let's also say we record what our classifier says about each image: suppose it says "yes, apple" on $4$ of the $6$ apple images, and on $1$ of the $4$ non-apple images (so $5$ "yes" responses in total).
Now, $\Pr[A \cap B]$ is $4/10$, because there are exactly $4$ images where there is an apple and the classifier says "yes". On the other hand, $\Pr[B]$ is $5/10$, because the classifier says yes $5$ times out of $10$.
So:
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{4/10}{5/10} = \frac{4}{5}$$
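A minimal sketch of the same computation; the `data` list below just encodes the illustrative counts from above ($4$ true positives, $2$ missed apples, $1$ false positive, $3$ true negatives):

```python
from fractions import Fraction

# Each entry is (has_apple, classifier_says_apple) for one of the 10 images.
data = ([(True, True)] * 4 + [(True, False)] * 2
        + [(False, True)] * 1 + [(False, False)] * 3)

n = len(data)
pr_A_and_B = Fraction(sum(a and b for a, b in data), n)  # apple AND "yes"
pr_B = Fraction(sum(b for _, b in data), n)              # classifier says "yes"
print(pr_A_and_B / pr_B)                                 # Pr[A | B] = 4/5
```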
Bayes' Theorem
Now, is this the only way to compute that value? Not quite. There is an alternative way that is quite helpful. It turns out Bayes' Theorem gives us an alternative formulation:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]}$$
So we could instead compute these quantities:
- $\Pr[B]$ (the probability that the classifier says yes),
- $\Pr[A]$ (the probability that we have an apple),
- $\Pr[B \mid A]$ (the probability that, among the cases where we have an apple, the classifier says yes).
For the last one, notice that we have $6$ apples, and among the apple images, the classifier says yes $4$ out of the $6$ times. So the final quantity is $\Pr[B \mid A] = 4/6 = 2/3$. Plugging it all in:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]} = \frac{(2/3) \cdot (6/10)}{5/10} = \frac{4}{5}$$
The values line up! But why do we even bother with this alternative formulation? As the next example shows, sometimes we can easily find $\Pr[B \mid A]$, $\Pr[A]$, and $\Pr[B]$, even when we have no direct way of computing $\Pr[A \mid B]$.
Another example for Bayes’ Theorem
Let’s go back to medical testing, and let’s say we were given the following pieces of information:
- $1\%$ of the population have condition $X$.
- The test, administered on a patient who has the condition, has a $95\%$ chance of reporting positive.
- Administering the test on anyone, the test reports positive with $10\%$ chance.
Now, let's say we took the test, and it was positive. What is the probability we have condition $X$?
Let $A$ be the event we have condition $X$, and let $B$ be the event that the test said we were positive. We want $\Pr[A \mid B]$. Notice how we do not have direct information about $\Pr[A \mid B]$. But instead, line 1 gives us $\Pr[A] = 0.01$, line 2 gives us $\Pr[B \mid A] = 0.95$, and line 3 gives us $\Pr[B] = 0.1$.
So plugging this all in, we get:
$$\Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]} = \frac{0.95 \times 0.01}{0.1} = 0.095$$
which happens to be around $9.5\%$, so at least based on this set-up, it seems that the test was not super informative.
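Plugging the same three numbers into Bayes' Theorem in code (the percentages are the ones assumed in this example):

```python
pr_A = 0.01          # Pr[have condition X]: 1% of the population
pr_B_given_A = 0.95  # Pr[test positive | have condition]
pr_B = 0.10          # Pr[test positive] when administered on anyone

# Bayes' Theorem: Pr[A | B] = Pr[B | A] * Pr[A] / Pr[B]
pr_A_given_B = pr_B_given_A * pr_A / pr_B
print(pr_A_given_B)  # approximately 0.095
```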
An Example Table
In case this has still been a little unintuitive, hopefully the next explanation makes more sense.
|  | Test says positive | Test says negative | Total |
|---|---|---|---|
| Actually positive | $15$ | $5$ | $20$ |
| Actually negative | $10$ | $70$ | $80$ |
| Total | $25$ | $75$ | $100$ |
So if we made a table with all the possible entries, then letting $A$ be the event of actually being positive, and $B$ be the event that the test says positive, we can say:
- $\Pr[A] = \frac{20}{100}$, which is basically the total number of actually positive people divided by the total number of people.
- $\Pr[\bar{A}] = \frac{80}{100}$, which is basically the total number of not positive (i.e., negative) people divided by the total number of people.
- $\Pr[B] = \frac{25}{100}$, which is basically the total number of people who tested positive divided by the total number of people.
Also, for conditional probability:
- $\Pr[B \mid A] = \frac{15}{20}$. Why? We were promised that the people are actually positive (there are $20$ of them), and of those $20$, $15$ of them tested positive.
- $\Pr[A \mid B] = \frac{15}{25}$. Similarly, we are promised that the people have tested positive (there are $25$ of them), and of those $25$, $15$ of them are actually positive.
We can also ask something like: "What's the probability that we are actually positive, given that the test returned negative?" Then this is given as:
$$\Pr[A \mid \bar{B}] = \frac{5}{75} = \frac{1}{15}$$
Again, this is because we were promised that $75$ people did not test positive (i.e., tested negative), and among those $75$ people, $5$ of them were actually positive.
So to bring this back to the concept at the beginning of this section:
- $\Pr[B \mid A]$ is called the true positive rate.
- $\Pr[\bar{B} \mid A]$ is called the false negative rate.
- $\Pr[B \mid \bar{A}]$ is called the false positive rate.
- $\Pr[\bar{B} \mid \bar{A}]$ is called the true negative rate.
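Given a table like the one above, all four rates fall out of simple row-wise ratios; a sketch with the same (assumed) counts:

```python
from fractions import Fraction

# Counts from the table above: rows = actual condition, columns = test result.
TP, FN = 15, 5    # actually positive: test says positive / negative
FP, TN = 10, 70   # actually negative: test says positive / negative

print("true positive rate :", Fraction(TP, TP + FN))  # Pr[B | A]         = 3/4
print("false negative rate:", Fraction(FN, TP + FN))  # Pr[not B | A]     = 1/4
print("false positive rate:", Fraction(FP, FP + TN))  # Pr[B | not A]     = 1/8
print("true negative rate :", Fraction(TN, FP + TN))  # Pr[not B | not A] = 7/8
```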
Independent Events
Lastly, let's round off by looking at events in a little more detail. This is actually a concept that is a lot more helpful for next week, but it's better to start on it now. It starts off with the following vague intuition:
How do we say that two events are unrelated?
For example we could have flipped two different coins far away from each other. Intuitively we would like to think that the two outcomes should not affect each other.
Formally, we will say that two events $A$ and $B$ are independent if either:
- $\Pr[B] = 0$, or
- $\Pr[A \mid B] = \Pr[A]$.
A vague intuition you can have for line 2 is that "given that $B$ has happened, the probability of event $A$ remains unchanged." Though this is quite informal, it might be a little bit helpful.
Another way to say that two events $A$ and $B$ are independent is if:
$$\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$$
Example for Independence:
Assume we have two fair coins, $c_1$ and $c_2$ (each has a sample space $\{H, T\}$, and the probability of each outcome is $1/2$). Assume that $c_1$ and $c_2$ are independent. That is to say, for any $x, y \in \{H, T\}$:
$$\Pr[c_1 = x \cap c_2 = y] = \Pr[c_1 = x] \cdot \Pr[c_2 = y]$$
Then what is the probability that both coins produce tails? In other words, what is:
$$\Pr[c_1 = T \cap c_2 = T]$$
Well, based on our assumption, it is the case that:
$$\Pr[c_1 = T \cap c_2 = T] = \Pr[c_1 = T] \cdot \Pr[c_2 = T] = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$
To be clear, we cannot draw the same conclusion if we did not assume independence.
An example with seemingly dependent events:
Let's consider a $3$-coin example. Again, assume we have two fair coins, $c_1$ and $c_2$ (each has a sample space $\{H, T\}$, and the probability of each outcome is $1/2$). Assume that $c_1$ and $c_2$ are independent. Now also assume a third coin $c_3$, where:
$c_3$ looks at $c_1$ and $c_2$. Then:
- If $c_1 = c_2$, then $c_3$ returns tails.
- Otherwise, $c_3$ returns heads.
So again, if we listed out the sequence of outcomes and their probabilities with a tree, it looks like this:
Again, if we stuck to first finding the sample space, we'll find that this is our set of outcomes (each triple listing $c_1$, $c_2$, $c_3$):
$$\{(H,H,T), (H,T,H), (T,H,H), (T,T,T)\}$$
And if we looked carefully, each outcome again has probability $1/4$ of occurring.
Let's look at a few events and their probabilities. What are the sets of outcomes that correspond to the following events?
- The event that $c_1$ is heads?
- The event that $c_1$ is tails and $c_2$ is heads?
- The event that $c_1$ is tails, $c_2$ is heads, and $c_3$ is heads?
- The event that $c_1$ is tails, $c_2$ is tails, and $c_3$ is heads?
(Question 1) So there are a few ways to do this; let's begin with question 1. We really just want all the outcomes where the first value is $H$. So the event is basically the set of outcomes $\{(H,H,T), (H,T,H)\}$.
(Question 2) What about the second one? Well, we could manually go through each outcome and find that the only one is $(T,H,H)$. Alternatively, we could also think first about the event that $c_2$ is heads. This is the set of outcomes $\{(H,H,T), (T,H,H)\}$. Also, similar to the previous question, we know that the event where the first coin is tails is the set $\{(T,H,H), (T,T,T)\}$.
So alternatively, we can think of this as the set $\{(H,H,T), (T,H,H)\} \cap \{(T,H,H), (T,T,T)\} = \{(T,H,H)\}$. Both methods work.
(Question 3) The third one works in a similar way, but basically the only option is $(T,H,H)$. So the event is $\{(T,H,H)\}$.
(Question 4) The final one is interesting. The outcome $(T,T,H)$ does not exist! So the event is $\emptyset$.
What about their probabilities? Again, we can use the tree to figure this out. But basically, if we do, you might notice that (questions 2 and 3 share the same event):
$$\Pr[\{(H,H,T), (H,T,H)\}] = \frac{1}{2}, \quad \Pr[\{(T,H,H)\}] = \frac{1}{4}, \quad \Pr[\emptyset] = 0$$
Okay, what was the point of all this? Let’s explore which coins are independent. Let:
- $A$ be the event that $c_1 = H$
- $B$ be the event that $c_2 = H$
- $C$ be the event that $c_3 = H$
So again, here are the respective sets:
$$A = \{(H,H,T), (H,T,H)\}, \quad B = \{(H,H,T), (T,H,H)\}, \quad C = \{(H,T,H), (T,H,H)\}$$
And let’s consider the following 2 questions:
- Are events $A$ and $C$ independent of each other?
- Are events $A \cap B$ and $C$ independent of each other?
You might think the answer is no to both; after all, coin $c_3$ does look at coins $c_1$ and $c_2$. But it turns out, based on our definition, events $A$ and $C$ are actually independent of each other! Let's see this formally:
$$\Pr[A \cap C] = \Pr[\{(H,T,H)\}] = \frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2} = \Pr[A] \cdot \Pr[C]$$
Since $\Pr[A \cap C] = \Pr[A] \cdot \Pr[C]$, the two events are independent. This might be a little counter-intuitive, but mathematically it checks out.
However, it is still true that events $A \cap B$ and $C$ are not independent of each other. Let's also see this formally:
$$\Pr[(A \cap B) \cap C] = \Pr[\emptyset] = 0 \neq \frac{1}{4} \cdot \frac{1}{2} = \Pr[A \cap B] \cdot \Pr[C]$$
Since $\Pr[(A \cap B) \cap C] \neq \Pr[A \cap B] \cdot \Pr[C]$, the events are not independent.
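Both checks can be verified mechanically by counting outcomes; a short sketch:

```python
from fractions import Fraction

# Sample space of the 3-coin example: triples (c1, c2, c3), each with Pr 1/4.
omega = {("H","H","T"), ("H","T","H"), ("T","H","H"), ("T","T","T")}

def pr(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

A = {o for o in omega if o[0] == "H"}   # c1 = H
B = {o for o in omega if o[1] == "H"}   # c2 = H
C = {o for o in omega if o[2] == "H"}   # c3 = H

print(pr(A & C) == pr(A) * pr(C))          # True:  A and C are independent
print(pr(A & B & C) == pr(A & B) * pr(C))  # False: A ∩ B and C are not
```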