Overview

This unit is the second of two parts in our introduction to probability. Again, the ideas here are pretty fundamental for analysing randomised strategies and algorithms, which is something we do quite often in machine learning.

To tie it all up, we’ll introduce probability distributions, talk about expected values and variances, and motivate them with some bounds we commonly use in computer science.

Introduction

Picking up from where we left off: last time, we focused on probability spaces, events, and how to compute things like conditional probabilities, and we mostly assumed things were uniform or independent. Extending this, we’ll first talk about random variables and probability distributions. What we will show in this unit are the staple approaches to randomised analysis. If you were ever curious how something like the multiplicative weights update algorithm works, the techniques shown on this page are a must.

Random Variables

For starters, let’s talk about random variables. Think of random variables as, basically, values that we care about. Now this might seem pretty abstract, so before we go into an example, think of random variables as functions that take outcomes as input and output values.

An Example With Coins

Defining Random Variables

Let’s go back to using coins. Let’s say we had 3 fair and independent coins. So each coin produces either heads or tails, each with probability $\frac{1}{2}$, and all 3 coins’ outcomes do not affect each other.

We know that we have 8 possible outcomes, so our sample space looks like the following:

$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$

So here comes the question: What if we wanted to count the number of heads?

Here’s how we would do it: we would make a random variable, let’s call it $X$. $X$ has to specify, for each outcome, what it will output as a number. In this case, since we want to count the number of heads, we can simply wave our hands and declare “let $X$ count the number of heads.”

From this, we know, for example, that $X(HHH) = 3$, and $X(HTH) = 2$, also $X(TTT) = 0$.

So $X$ is a random variable that counts the number of heads.

Let’s try another question: What if we wanted to indicate that the coins were all heads?

Here’s how we would do it: we would make a random variable, this time let’s call it $Y$ so we don’t collide our names. $Y$ is going to output value $1$ on input $HHH$, i.e. $Y(HHH) = 1$. On any other input outcome, $Y$ will return $0$. So for example, $Y(HTH) = 0$, and $Y(TTH) = 0$. Also $Y(TTT) = 0$.

$Y$ is a special kind of random variable that we commonly call an indicator random variable. These modest types of random variables actually do a lot of heavy lifting in computer science. You’ll see one such example later in this unit. Think of indicator random variables as indicating whether some event has occurred.

Both $X$ and $Y$ are examples of random variables.
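To make this concrete, here’s a minimal Python sketch (the function names `X` and `Y` are just mine, mirroring our two random variables) that treats random variables literally as functions from outcomes to numbers:

```python
from itertools import product

# All 8 outcomes of flipping 3 coins, e.g. ('H', 'H', 'T').
outcomes = list(product("HT", repeat=3))

def X(outcome):
    """Counts the number of heads in an outcome."""
    return outcome.count("H")

def Y(outcome):
    """Indicates (0 or 1) whether the outcome is all heads."""
    return 1 if outcome == ("H", "H", "H") else 0

for outcome in outcomes:
    print("".join(outcome), X(outcome), Y(outcome))
```

Running this prints each outcome alongside the values that $X$ and $Y$ assign to it, which is exactly the “functions from outcomes to values” picture.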

Random Variables vs. Events

Let’s look at $X$ again. If we split the sample space up, we notice that we can “break up” or “partition” the sample space based on what value $X$ outputs:

  1. For the event $\{TTT\}$, $X$ outputs $0$.
  2. For the event $\{TTH, THT, HTT\}$, $X$ outputs $1$.
  3. For the event $\{THH, HTH, HHT\}$, $X$ outputs $2$.
  4. For the event $\{HHH\}$, $X$ outputs $3$.

So what is $\mathbb{P}[X = 2]$? Well, it is just the probability that we either see: $THH$, or $HTH$, or $HHT$. In other words:

$$\mathbb{P}[X = 2] = \mathbb{P}[\{THH, HTH, HHT\}]$$

Since we assumed each coin is fair and independent, each outcome occurs with probability $\frac{1}{8}$. There are $3$ such outcomes, so $\mathbb{P}[X = 2] = \frac{3}{8}$.

What about $Y$? This time around:

  1. For the event $\{HHH\}$, $Y$ outputs $1$.
  2. For the event $\{HHT, HTH, HTT, THH, THT, TTH, TTT\}$, $Y$ outputs $0$.

So what is $\mathbb{P}[Y = 1]$? Recall that:

$$\mathbb{P}[Y = 1] = \mathbb{P}[\{HHH\}]$$

Now if we look at $\{HHH\}$, we might notice that this is actually just the single outcome $HHH$ itself. So $\mathbb{P}[Y = 1]$ is $\frac{1}{8}$. What about $\mathbb{P}[Y = 0]$? Well, by a similar argument, it is $\frac{7}{8}$.

So:

$$\mathbb{P}[Y = 1] = \frac{1}{8}, \qquad \mathbb{P}[Y = 0] = \frac{7}{8}$$

Functions on Random Variables

Here’s an interesting idea: can we operate on random variables? We sure can! And this idea is super useful. For example, what is $Y^2$? It is a random variable that first takes an outcome as input, applies $Y$ to it, and outputs the square of the result.

So for example, we could ask something like: What is $\mathbb{P}[Y^2 = 1]$?

Let’s think about this a little bit. $Y$ can only take two possible values: $0$ or $1$. When $Y$ outputs $0$, then so does $Y^2$. When $Y$ outputs $1$, so does $Y^2$.

So $\mathbb{P}[Y^2 = 1]$ is the same as $\mathbb{P}[Y = 1]$. And similarly, $\mathbb{P}[Y^2 = 0]$ is the same as $\mathbb{P}[Y = 0]$. So $\mathbb{P}[Y^2 = 1] = \frac{1}{8}$.

Another Example:

Instead of squaring, we can also do something like adding random variables together. Here’s an alternative example.

Let $X_1$ and $X_2$ be random variables for two separate, independent and fair 6-sided dice. So again, $X_1$ and $X_2$ can each take values in $\{1, 2, 3, 4, 5, 6\}$.

Then technically, even something like $X_1 + X_2$ is a random variable. And we can ask something like: what is $\mathbb{P}[X_1 + X_2 = 3]$? This is basically asking: what is the probability that the total of the two dice is $3$?

We know that this means either ($X_1 = 1$ and $X_2 = 2$) or ($X_1 = 2$ and $X_2 = 1$). So:

$$\mathbb{P}[X_1 + X_2 = 3] = \frac{1}{36} + \frac{1}{36} = \frac{2}{36} = \frac{1}{18}$$
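Here’s a quick sketch that double-checks this by brute force, enumerating all 36 equally likely outcomes of the two dice:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Keep the outcomes where the sum is 3.
favourable = [(x1, x2) for (x1, x2) in outcomes if x1 + x2 == 3]

print(favourable)                                # [(1, 2), (2, 1)]
print(Fraction(len(favourable), len(outcomes)))  # 1/18
```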

Probability Distributions

Now that we’ve talked about random variables, the next thing is to finally talk about probability distributions themselves. Think of a probability distribution as the function that assigns a probability to each value a random variable can take. Intuitively, you can think of this as a chart that says, for each value, what the probability is.

Example 1:

For $X$, the number of heads among our 3 fair coins:

$$\mathbb{P}[X = 0] = \tfrac{1}{8}, \quad \mathbb{P}[X = 1] = \tfrac{3}{8}, \quad \mathbb{P}[X = 2] = \tfrac{3}{8}, \quad \mathbb{P}[X = 3] = \tfrac{1}{8}$$

Example 2:

For $Y$, the indicator that all 3 coins are heads:

$$\mathbb{P}[Y = 0] = \tfrac{7}{8}, \quad \mathbb{P}[Y = 1] = \tfrac{1}{8}$$

Example 3:

For a single roll $D$ of a fair 6-sided die:

$$\mathbb{P}[D = k] = \tfrac{1}{6} \quad \text{for each } k \in \{1, 2, 3, 4, 5, 6\}$$

Again, think of a distribution as basically saying “for this value, we assign this probability”.

We will look at some very common discrete probability distributions in CS:

  1. Bernoulli
  2. Geometric
  3. Binomial
  4. Uniform

Bernoulli Distribution

So let’s begin with the Bernoulli distribution. This is the distribution for indicator random variables. Again, recall that since indicator random variables only take values $0$ or $1$, the Bernoulli distribution has to assign a probability $p$ for when $Y = 1$, and consequently, this means $Y = 0$ with probability $1 - p$. Think of $p$ as the parameter of the distribution. This single value determines the entire distribution.

So to be clear, an indicator random variable $Y$ has a Bernoulli distribution with parameter $p$ if:

$$\mathbb{P}[Y = 1] = p \qquad \text{and} \qquad \mathbb{P}[Y = 0] = 1 - p$$

Example:

Let’s say we roll 1 fair die with 6 faces, and each face is produced with probability $\frac{1}{6}$.

So let $Y$ be the random variable that indicates if the die turns up with a number that is at least 2, i.e. $Y = 1$ if we see a value of $2$ or more.

Then we can say that $\mathbb{P}[Y = 1] = \frac{5}{6}$. So $Y$ follows a Bernoulli distribution with parameter $p = \frac{5}{6}$.
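If you want to play with Bernoulli random variables empirically, here’s a small sketch (the helper name `bernoulli` is mine, not a standard library function):

```python
import random

def bernoulli(p):
    """Samples an indicator that is 1 with probability p, and 0 otherwise."""
    return 1 if random.random() < p else 0

# Simulate the die indicator: Y = 1 when a fair die shows at least 2.
samples = [bernoulli(5 / 6) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 5/6, about 0.833
```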

Geometric Distribution

Let’s build off the Bernoulli distribution and make use of it for something else. Given a random variable $Y$ that follows a Bernoulli distribution with parameter $p$, let $T$ be the number of times we need to try before $Y = 1$.

Example:

As a concrete example, this is as if we are making coin flips, where the coin has probability $p$ of returning heads, and we are asking: how many times do we need to flip before we see heads? And here we are going to assume that every flip of the coin is independent of its previous outcomes.

To be clear, if $T$ is the random variable that outputs the number of times we need to try, then $T$ follows the geometric distribution.

So what is the probability that $T = 1$? Well, that happens when the very first flip is heads, which happens with probability $p$.

What about the probability that $T = 2$? Well, that happens with probability $(1 - p) \cdot p$.

What about the probability that $T = k$? Do you see the pattern? We must have flipped $k - 1$ many tails in a row, then flipped a heads. So the probability is $(1 - p)^{k - 1} \cdot p$.

So in general, if we had a random variable $T$ that followed the geometric distribution with parameter $p$, then:

$$\mathbb{P}[T = k] = (1 - p)^{k - 1} \, p$$
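Here’s a sketch that samples $T$ by literally flipping a biased coin until it lands heads, then compares the empirical frequencies against the formula $(1 - p)^{k - 1} p$:

```python
import random

def geometric(p):
    """Flips a p-biased coin until heads; returns how many flips it took."""
    flips = 1
    while random.random() >= p:  # tails: flip again
        flips += 1
    return flips

p, n = 0.5, 100_000
samples = [geometric(p) for _ in range(n)]
for k in range(1, 6):
    empirical = samples.count(k) / n
    exact = (1 - p) ** (k - 1) * p
    print(k, round(empirical, 4), exact)
```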

Binomial Distribution

Let’s build off the Bernoulli distribution again, but instead ask a different question. What if we instead had $n$ independent copies of $Y$ (each with a Bernoulli distribution with parameter $p$), and we took $n$ trials, and asked: how many of the $n$ trials returned $1$?

Example:

Let’s say we had 3 coins, where each coin returns heads with probability $p$. Let’s let $X$ be the random variable that counts the number of heads. Then again, what’s the probability that $X = 2$?

Well, one way we could do this is to manually count this. So we know that there are 3 possible outcomes we need: $HHT$, $HTH$, $THH$. We know that a heads happens with probability $p$, and a tails happens with probability $1 - p$. So:

$$\mathbb{P}[X = 2] = 3 \cdot p^2 (1 - p)$$

But what about in general? What if we had more coins than $3$? Manually counting gets very cumbersome. Let’s try to be smarter with how we count.

Of the $n$ coins, we choose $k$ of them to be heads, so the rest must be tails. So there are $\binom{n}{k}$ possible outcomes. For each such outcome, the probability it occurs is $p^k (1 - p)^{n - k}$.

So in general, the probability is actually:

$$\mathbb{P}[X = k] = \binom{n}{k} p^k (1 - p)^{n - k}$$

To be clear, the binomial distribution takes 2 parameters: $n$, the number of trials, and $p$, the probability of success of each independent trial.
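In code, this probability is a one-liner once you have binomial coefficients; here’s a sketch checking it against the hand count from the 3-coin example:

```python
from math import comb

def binomial_pmf(n, p, k):
    """P[X = k] for a binomially distributed X with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = 0.5
print(binomial_pmf(3, p, 2))  # 0.375
print(3 * p**2 * (1 - p))     # 0.375, matching the manual count
```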

Uniform Distribution

The last distribution is the uniform distribution. This is the one we have been playing with the most. In general, we have a set of $n$ values, $\{v_1, v_2, \dots, v_n\}$. Each value is picked with probability $\frac{1}{n}$.

If we let $X$ be the random variable that outputs any of the $n$ values uniformly at random, then $X$ has the uniform distribution.

Expectation

Now that we have seen random variables and distributions, here’s a key question:

If we ran an experiment where we had a random variable $X$, took many independent samples, and then output the average value, what should we hope/expect to see?

It turns out, the answer is:

$$\mathbb{E}[X] = \sum_{x} x \cdot \mathbb{P}[X = x]$$

Here’s the intuition: this is the value we “expect” to see from the random variable.

Example 1: A fair die

For example, if we roll a 6-sided fair die, what is $\mathbb{E}[X]$? Based on our formula, this happens to be:

$$\mathbb{E}[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6}$$

which evaluates to $\frac{21}{6} = 3.5$.

Think of it this way: if we rolled this die many, many, many times and took the average value, it should be close to $3.5$.
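This “average of many samples” claim is easy to test; here’s a tiny Monte Carlo sketch:

```python
import random

# Roll a fair die a million times; the average should be near 3.5.
n = 1_000_000
total = sum(random.randint(1, 6) for _ in range(n))
print(total / n)  # e.g. 3.4996...
```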

Example 2: A Bernoulli Distributed Random Variable

If we have a random variable $X$ that has a Bernoulli distribution with parameter $p$, what is $\mathbb{E}[X]$? Again, based on our formula, this happens to be:

$$\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

It might look quite surprising that the expected value is exactly the probability $p$. But this is actually a very useful fact. We use it quite often in computer science!

Example 3: Payoff Functions

Let’s say we have a contract that with probability $\frac{1}{3}$ will pay us $5$ dollars, and with probability $\frac{2}{3}$ will pay us nothing ($0$ dollars). What is the expected payoff?

Notice here that if we first let $Y$ be a Bernoulli distributed indicator random variable with $p = \frac{1}{3}$, where $Y = 1$ when we get paid, then our payoff is given as:

$$\text{payoff} = 5Y$$

So it boils down to asking: what is $\mathbb{E}[5Y]$? Since $Y$ takes values either $0$ or $1$, $5Y$ takes values either $0$ or $5$. So:

$$\mathbb{E}[5Y] = 5 \cdot \tfrac{1}{3} + 0 \cdot \tfrac{2}{3} = \tfrac{5}{3}$$

Properties About Expectation

So the last example was actually a teaser for some nice properties of expectation. We won’t prove them in this course, so you can take these as fact (though they are provable).

  1. If $c$ is a constant, then $\mathbb{E}[cX] = c \, \mathbb{E}[X]$.
  2. Linearity of expectation: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$, even when $X$ and $Y$ are not independent.

As a warning, we cannot generally say that $\mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y]$. This is true when $X$ and $Y$ are independent, but otherwise, we have to be careful.

Example 4: Expectation of Binomial Distributions

Let’s go through yet another example. This time we will be asking: what is the expected value of a binomially distributed random variable $X$ with parameters $n$ and $p$?

If we faithfully followed the formula, we have that:

$$\mathbb{E}[X] = \sum_{k=0}^{n} k \cdot \binom{n}{k} p^k (1 - p)^{n - k}$$

Except, that looks awfully complicated to analyse! So we’re going to pull out a very neat trick, and have our Bernoulli random variables do a lot of heavy lifting for us.

We are going to let $Y_i$ be an indicator random variable with parameter $p$ that represents whether the $i$-th trial was a success or not. Then:

$$X = Y_1 + Y_2 + \cdots + Y_n$$

Why do we want to do this though? Here’s the idea: using the properties of expectation, we know that:

$$\mathbb{E}[X] = \mathbb{E}[Y_1 + Y_2 + \cdots + Y_n] = \mathbb{E}[Y_1] + \mathbb{E}[Y_2] + \cdots + \mathbb{E}[Y_n]$$

Remember, $\mathbb{E}[Y_i] = p$, because $Y_i$ is an indicator random variable with probability $p$.

So for general values of $n$ and $p$, the math becomes:

$$\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[Y_i] = np$$
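Here’s a quick empirical sanity check of $\mathbb{E}[X] = np$, building $X$ exactly as a sum of indicators:

```python
import random

n, p, trials = 10, 0.3, 100_000
total = 0
for _ in range(trials):
    # One binomial sample: sum of n independent Bernoulli indicators.
    X = sum(1 for _ in range(n) if random.random() < p)
    total += X
print(total / trials, n * p)  # both close to 3.0
```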

Example 5: Expectation of Geometric Distributions

Since we’ve covered Bernoulli and Binomial, for the sake of completeness let’s do Geometric as well. Let $T$ be a geometrically distributed random variable with parameter $p$. The math for this one is a little more involved, so let’s jump straight into it. Again, by our definition of expectation, we have that:

$$\mathbb{E}[T] = \sum_{k=1}^{\infty} k \cdot (1 - p)^{k - 1} p$$

Now this is pretty hard to resolve, so let’s work through some magic. First of all, let’s give the sum a name, $S = \mathbb{E}[T]$, and consider $(1 - p)S$ as well.

So what is $S - (1 - p)S$? Let me lay it out term by term:

$$S = p + 2(1 - p)p + 3(1 - p)^2 p + 4(1 - p)^3 p + \cdots$$

$$(1 - p)S = (1 - p)p + 2(1 - p)^2 p + 3(1 - p)^3 p + \cdots$$

If you notice, we’re grouping terms based on their power of $(1 - p)$. What happens if we subtracted them this way? Then:

$$S - (1 - p)S = p + (1 - p)p + (1 - p)^2 p + (1 - p)^3 p + \cdots$$

And the last series is actually geometric! Re-writing this, we get:

$$S - (1 - p)S = p \sum_{k=0}^{\infty} (1 - p)^k = p \cdot \frac{1}{1 - (1 - p)} = p \cdot \frac{1}{p} = 1$$

Okay, that was weird. Let’s also resolve the left hand side:

$$S - (1 - p)S = pS, \qquad \text{so} \qquad pS = 1 \quad \Longrightarrow \quad \mathbb{E}[T] = S = \frac{1}{p}$$

So that gives us our expectation, which hopefully is quite intuitive. If we have a coin that returns heads with probability $\frac{1}{3}$, we would expect to flip it 3 times before we see a heads.
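And a sketch verifying $\mathbb{E}[T] = \frac{1}{p}$ by simulation, using the same coin-flipping loop as before:

```python
import random

def geometric(p):
    """Number of p-biased flips until the first heads."""
    flips = 1
    while random.random() >= p:
        flips += 1
    return flips

p, n = 1 / 3, 100_000
print(sum(geometric(p) for _ in range(n)) / n)  # close to 1/p = 3
```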

Variance

So expectation was nice and all, and it tells us what the random variable “averages” around, but it doesn’t tell us how spread apart the values are. For that, we need variance.

Intuitively, variance is a measure of how much the random variable tends to deviate from its expected value.

Formally, it is defined as:

$$\mathrm{Var}[X] = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]$$

Except this form is not very helpful, so let me show you a more useful, equivalent form:

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Again, friendly reminder that $\mathbb{E}[X^2]$ is, in general, not equal to $\mathbb{E}[X]^2$.

Example 1: Variance of a Bernoulli Distributed Random Variable

For example, given a Bernoulli distributed random variable $Y$ with probability $p$, what is $\mathrm{Var}[Y]$? We know that $\mathbb{E}[Y] = p$, so we know that $\mathbb{E}[Y]^2 = p^2$. But what is $\mathbb{E}[Y^2]$?

So we’ve done this before! $Y$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$. So similarly, $Y^2$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$.

So again:

$$\mathbb{E}[Y^2] = 1 \cdot p + 0 \cdot (1 - p) = p$$

So, putting the two together:

$$\mathrm{Var}[Y] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 = p - p^2 = p(1 - p)$$

Example 2: Variance of a Binomially Distributed Random Variable

As another example, what about the variance of a binomially distributed random variable $X$ with $n$ trials and probability $p$? Again, let’s fall back to the neat trick that I mentioned: let $Y_i$ be an indicator that indicates whether the $i$-th trial was a success. In which case:

$$X = Y_1 + Y_2 + \cdots + Y_n$$

So now:

$$X^2 = \left( \sum_{i=1}^{n} Y_i \right)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} Y_i Y_j$$

If this is not obvious, think about how $(a_1 + a_2 + \cdots + a_n)^2$ can be written as $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j$.

Now again, we want $\mathbb{E}[X^2]$, so:

$$\mathbb{E}[X^2] = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}[Y_i Y_j] = \sum_{i = j} \mathbb{E}[Y_i Y_j] + \sum_{i \neq j} \mathbb{E}[Y_i Y_j]$$

Now let’s look at what’s going on in each sum separately. The first sum sums over when $i = j$, so each term is just the same as $\mathbb{E}[Y_i^2]$. As before, we know that since $Y_i$ is an indicator random variable, $\mathbb{E}[Y_i^2] = p$. So:

$$\sum_{i = j} \mathbb{E}[Y_i Y_j] = \sum_{i=1}^{n} \mathbb{E}[Y_i^2] = np$$

What about when $i \neq j$? $Y_i$ and $Y_j$ both only output either $0$ or $1$. Then, $Y_i Y_j$ is $1$ only when both $Y_i$ and $Y_j$ are $1$; otherwise, it is $0$. So now:

$$\mathbb{E}[Y_i Y_j] = \mathbb{P}[Y_i = 1 \text{ and } Y_j = 1]$$

Since $Y_i$ and $Y_j$ are independent, we know that $\mathbb{P}[Y_i = 1 \text{ and } Y_j = 1] = p \cdot p = p^2$. So putting this back into the sum (there are $n(n-1)$ ordered pairs with $i \neq j$):

$$\sum_{i \neq j} \mathbb{E}[Y_i Y_j] = n(n - 1) p^2$$

So finally, putting this back in:

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = np + n(n - 1)p^2 - (np)^2 = np - np^2 = np(1 - p)$$
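As before, we can sanity-check $\mathrm{Var}[X] = np(1 - p)$ empirically:

```python
import random

n, p, trials = 10, 0.3, 200_000
samples = [sum(1 for _ in range(n) if random.random() < p)
           for _ in range(trials)]

mean = sum(samples) / trials
variance = sum((x - mean) ** 2 for x in samples) / trials
print(variance, n * p * (1 - p))  # both close to 2.1
```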

Summary:

So we have that:

  1. Bernoulli distributed random variables: $\mathrm{Var}[Y] = p(1 - p)$
  2. Binomially distributed random variables: $\mathrm{Var}[X] = np(1 - p)$
  3. Geometrically distributed random variables: $\mathrm{Var}[T] = \frac{1 - p}{p^2}$

We will skip the proof for geometric random variables because it involves using some amount of calculus.

Properties of Variance:

  1. If $X$ and $Y$ are independent random variables, then $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Bounds

So we’ve worked quite hard to figure out what the expectation and variance are for random variables. But why? What’s so important about these things?

In computer science, we often have “bad” events that we want to avoid. For example, long running times in Las Vegas algorithms, errors in classification, hashing collisions, and so on. Anytime there is any amount of randomness, we will have to somehow argue that bad events don’t happen too often. Hopefully you’ll see what I mean beyond this course, when you finally use these ideas.

So to do so, we commonly use the Markov and Chebyshev bounds! These bounds are great if we are happy with a good enough, one-sided upper bound on the probability. Typically we will be finding the probabilities of bad events and saying they don’t occur too often. So in CS at least, these are great.

Markov Bound:

If $X$ is a non-negative random variable, and $a > 0$, then:

$$\mathbb{P}[X \geq a] \leq \frac{\mathbb{E}[X]}{a}$$

For example, if $X$ is a binomially distributed random variable with $n$ trials and success probability $p$, we can say something like:

$$\mathbb{P}[X \geq a] = \sum_{k \geq a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But this is hard to analyse, and is not even in a closed form. What if we could sacrifice some amount of precision for an easier bound to work with? So if we instead applied Markov’s bound, we have:

$$\mathbb{P}[X \geq a] \leq \frac{\mathbb{E}[X]}{a} = \frac{np}{a}$$

So for something like $a = 2np$, this works out to be $\mathbb{P}[X \geq 2np] \leq \frac{1}{2}$. See how simple that was? Sometimes an imprecise answer is good enough. The Markov bound is one such way to get a “good enough” imprecise answer.
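To see just how loose (but convenient) Markov can be, here’s a sketch comparing the exact binomial tail against the bound, for hypothetical parameters $n = 100$, $p = \frac{1}{2}$, $a = 75$:

```python
from math import comb

def binomial_tail(n, p, a):
    """Exact P[X >= a] for a binomially distributed X."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(a, n + 1))

n, p, a = 100, 0.5, 75
print(binomial_tail(n, p, a))  # tiny: on the order of 1e-7
print(n * p / a)               # Markov only promises <= 0.666...
```

The bound is far from tight, but it took one division instead of a 26-term sum.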

Chebyshev Bound:

If $X$ is a random variable, and $a > 0$, then:

$$\mathbb{P}\left[ \left| X - \mathbb{E}[X] \right| \geq a \right] \leq \frac{\mathrm{Var}[X]}{a^2}$$

For example, if $X$ is a binomially distributed random variable with $n$ trials and success probability $p$, we can say something like:

$$\mathbb{P}\left[ |X - np| \geq a \right] = \sum_{k \,:\, |k - np| \geq a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But again, this is a lot simpler if we could use Chebyshev to say something like (assuming we are happy with a good enough, one-sided bound):

$$\mathbb{P}\left[ |X - np| \geq a \right] \leq \frac{\mathrm{Var}[X]}{a^2} = \frac{np(1 - p)}{a^2}$$
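And the analogous comparison for Chebyshev, again with hypothetical parameters $n = 100$, $p = \frac{1}{2}$, $a = 20$:

```python
from math import comb

def binomial_pmf(n, p, k):
    """P[X = k] for a binomially distributed X."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, a = 100, 0.5, 20
mu, var = n * p, n * p * (1 - p)

# Exact P[|X - np| >= a], summing over the deviating values of k.
exact = sum(binomial_pmf(n, p, k) for k in range(n + 1) if abs(k - mu) >= a)

print(exact)       # very small, around 1e-4
print(var / a**2)  # Chebyshev promises <= 25/400 = 0.0625
```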