Overview

This unit is the second of two parts in our introduction to probability. Again, the ideas here are pretty fundamental when it comes to things like analysing randomised strategies or algorithms, something that is quite common in machine learning.

To tie it all up, we’ll introduce probability distributions, talk about expected values and variances, and motivate them with some bounds we commonly use in computer science.

  1. Random variables
  2. Probability distributions
  3. Expectation and variance
  4. Markov and Chebyshev bounds

Introduction

Picking up from where we last left off: previously, we focused a lot on probability spaces, events, and how to compute things like conditional probability, and we mostly assumed things were uniform or independent. Extending from this, we’ll first talk about random variables and probability distributions. What we will show in this unit are the staple approaches to randomised analysis. If you were ever curious how something like the multiplicative weights update algorithm works, the techniques shown on this page are a must.


Part 1: Random Variables

For starters, let’s talk about random variables. Think of random variables as, basically, values that we care about. Now this might seem pretty abstract, so before we go into an example, think of random variables as functions that take outcomes as inputs and output values. We’ll begin with a simple example with coins.

Defining Random Variables

Let’s go back to using coins. Let’s say we had three fair and independent coins. Each coin produces either heads or tails, each with probability $1/2$, and all three coins’ outcomes do not affect each other.

We know that we have $2^3 = 8$ possible outcomes, so our sample space looks like the following:

$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$

Here comes the question: what if we wanted to count the number of heads?

Here’s how we would do it: we would make a random variable, let’s call it $X$. $X$ has to specify, for each outcome, what it will output as a number. In this case, since we want to count the number of heads, we can simply wave our hands and declare “let $X$ count the number of heads”.

From this, we know that this means, for example, that $X(HHH) = 3$, $X(HTH) = 2$, and also $X(TTT) = 0$.

So $X$ is a random variable that counts the number of heads.

Let’s try another question: what if we wanted to indicate that the coins were all heads?

Here’s how we would do it. We would make a random variable, this time let’s call it $Y$ so our names don’t collide. $Y$ is going to output value $1$ on input $HHH$, i.e., $Y(HHH) = 1$. On any other input outcome, $Y$ will return $0$. For example, $Y(HTH) = 0$, $Y(TTH) = 0$, and also $Y(TTT) = 0$.

$Y$ is a special kind of random variable that we commonly call an indicator random variable. These modest types of random variables actually do a lot of heavy lifting in computer science. You’ll see one such example later in this unit. Think of indicator random variables as indicating whether some event has occurred.

Both $X$ and $Y$ are examples of random variables.

Definition: Random variables

A random variable is a representation of some quantity which depends on one or several outcomes.

An indicator random variable is a random variable whose value is either $0$ (representing that the outcome at hand does not belong to a certain event), or $1$ (the outcome does belong to that event).
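To make the “random variables are functions” view concrete, here is a small sketch (the string encoding of outcomes and the names `X` and `Y` mirror the coin example above; they are just illustrative choices):

```python
from itertools import product

# Sample space for three coin flips: each outcome is a string like "HHT".
omega = ["".join(flips) for flips in product("HT", repeat=3)]

# X: outcome -> number of heads. A random variable is just a function on outcomes.
def X(outcome):
    return outcome.count("H")

# Y: indicator random variable for the event "all three coins are heads".
def Y(outcome):
    return 1 if outcome == "HHH" else 0

print(X("HHH"), X("HTH"), X("TTT"))  # 3 2 0
print(Y("HHH"), Y("HTH"))            # 1 0
```

Note that neither function mentions probability at all: the randomness lives in how outcomes are drawn from the sample space, not in the variable itself.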

Random Variables vs. Events

Let’s look at $X$ again. If we split the sample space up, we notice that we can “break up” or “partition” the sample space based on what value $X$ outputs:

  1. For the event $\{TTT\}$, $X$ outputs $0$.
  2. For the event $\{HTT, THT, TTH\}$, $X$ outputs $1$.
  3. For the event $\{HHT, HTH, THH\}$, $X$ outputs $2$.
  4. For the event $\{HHH\}$, $X$ outputs $3$.

So what is $\Pr[X = 2]$? Well, it is just the probability that we see either $HHT$, or $HTH$, or $THH$. In other words:

$$\Pr[X = 2] = \Pr[\{HHT, HTH, THH\}]$$

Since we assumed each coin is fair and independent, each outcome occurs with probability $1/8$. There are $3$ such outcomes, so $\Pr[X = 2] = 3/8$.

What about $Y$? This time around:

  1. For the event $\{HHH\}$, $Y$ outputs $1$.
  2. For the event $\{HHT, HTH, HTT, THH, THT, TTH, TTT\}$, $Y$ outputs $0$.

So what is $\Pr[Y = 1]$? Recall that (why?):

$$\Pr[Y = 1] = \Pr[\{\omega \in \Omega : Y(\omega) = 1\}]$$

Now if we look at $\{\omega \in \Omega : Y(\omega) = 1\}$, we might notice that this is actually just $\{HHH\}$ itself. So $\Pr[Y = 1]$ is $\Pr[\{HHH\}] = 1/8$. What about $\Pr[Y = 0]$? Well, by a similar argument, it is $7/8$.

So:

$$\Pr[Y = 1] = \frac{1}{8}, \qquad \Pr[Y = 0] = \frac{7}{8}$$
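We can check these probabilities by brute-force enumeration over the sample space (a sketch; exact arithmetic via `fractions` avoids floating-point noise):

```python
from fractions import Fraction
from itertools import product

omega = ["".join(flips) for flips in product("HT", repeat=3)]

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))

# Pr[X = 2]: outcomes with exactly two heads.
p_x2 = prob({w for w in omega if w.count("H") == 2})
# Pr[Y = 1]: the all-heads event; Pr[Y = 0] is its complement.
p_y1 = prob({w for w in omega if w == "HHH"})

print(p_x2, p_y1, 1 - p_y1)  # 3/8 1/8 7/8
```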

Functions on Random Variables

Here’s an interesting idea: can we operate on random variables? We sure can! And this idea is super useful. For example, what is $Y^2$? It is a random variable that takes an outcome as input, applies $Y$, and outputs the square of the result.

For example, we could ask something like: what is $\Pr[Y^2 = 1]$?

Let’s think about this a little bit. $Y$ can only take two possible values: $0$ or $1$. When $Y$ outputs $0$, then so does $Y^2$. When $Y$ outputs $1$, so does $Y^2$.

So $\{Y^2 = 1\}$ is the same as $\{Y = 1\}$. Similarly, $\{Y^2 = 0\}$ is the same as $\{Y = 0\}$. So $\Pr[Y^2 = 1] = \Pr[Y = 1] = 1/8$.

Another example

Instead of squaring, we can also do something like adding random variables together. Here’s an alternative example.

Let $D_1$ and $D_2$ be random variables for two separate, independent and fair $6$-sided dice. Again, $D_1$ and $D_2$ can each take values in $\{1, 2, 3, 4, 5, 6\}$.

Then technically, even something like $D_1 + D_2$ is a random variable, and we can ask something like “What is $\Pr[D_1 + D_2 = 3]$?” This is basically asking “What is the probability that the sum of the two outcomes is equal to $3$?”

We know that this means either ($D_1 = 1$ and $D_2 = 2$) or ($D_1 = 2$ and $D_2 = 1$). So:

$$\Pr[D_1 + D_2 = 3] = \frac{1}{36} + \frac{1}{36} = \frac{1}{18}$$
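The same enumeration idea works for the two dice (a sketch, again with exact fractions):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Pr[D1 + D2 = 3]: count the outcomes whose sum is 3.
hits = [(d1, d2) for (d1, d2) in outcomes if d1 + d2 == 3]
p = Fraction(len(hits), len(outcomes))

print(hits, p)  # [(1, 2), (2, 1)] 1/18
```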


Part 2: Probability Distributions

Now that we’ve talked about random variables, the next thing is to finally talk about probability itself. Think of a probability distribution as the function that assigns a probability to each value a random variable can take. Intuitively, you can think of this as a chart that basically says, for each value, what its probability is.

Example 1: the distribution of $X$, the number of heads among our three coins:

$$\Pr[X = 0] = \frac{1}{8}, \quad \Pr[X = 1] = \frac{3}{8}, \quad \Pr[X = 2] = \frac{3}{8}, \quad \Pr[X = 3] = \frac{1}{8}$$

Example 2: the distribution of the indicator $Y$:

$$\Pr[Y = 0] = \frac{7}{8}, \quad \Pr[Y = 1] = \frac{1}{8}$$

Example 3: the distribution of a single fair $6$-sided die $D$:

$$\Pr[D = 1] = \Pr[D = 2] = \dots = \Pr[D = 6] = \frac{1}{6}$$

Again, think of a distribution as basically saying “for this value, we assign this probability”.

We will look at some very common discrete probability distributions in CS:

  1. Bernoulli
  2. Geometric
  3. Binomial
  4. Uniform

Bernoulli Distribution

So let’s begin with the Bernoulli distribution. This is the distribution for indicator random variables. Again, recall that since indicator random variables only take values $0$ or $1$, the Bernoulli distribution has to assign a probability $p$ for when the variable is $1$, and consequently, this means the variable is $0$ with probability $1 - p$. Think of $p$ as the parameter of the distribution. This single value determines the entire distribution.

Example:

Let’s say we roll a fair die with $6$ faces, where each face is produced with probability $1/6$.

Let $X$ be the random variable that indicates if the die turns up with a number that is at least $5$, i.e., if we see a value of $5$ or $6$.

Then we can say that $\Pr[X = 1] = \Pr[\{5, 6\}] = 2/6 = 1/3$. So $X$ follows a Bernoulli distribution with parameter $1/3$.

Definition: Bernoulli distribution

An indicator random variable $X$ follows a Bernoulli distribution with parameter $p$ if the following hold true:

$$\Pr[X = 1] = p, \qquad \Pr[X = 0] = 1 - p$$

Geometric Distribution

Let’s build off the Bernoulli distribution and make use of it for something else. Given a random variable $X$ that follows a Bernoulli distribution with parameter $p$, let $T$ be the number of times we need to try before $X = 1$.

Example:

As a concrete example, this is as if we are making coin flips, the coin has probability $p$ of returning heads, and we are asking “How many times do we need to flip before we see heads?” Here, we are going to assume that every flip of the coin is independent of its previous outcomes.

To be clear, if $T$ is the random variable that outputs the number of times we need to flip, then $T$ follows the geometric distribution.

So what is the probability that $T = 1$? Well, that happens when the very first flip is heads, which happens with probability $p$.

What about the probability that $T = 2$? Well, that happens with probability $(1 - p) \cdot p$: one tail, then one head.

What about the probability that $T = k$? Do you see the pattern? We must have flipped $k - 1$ many tails in a row, then flipped a heads. So the probability is $(1 - p)^{k - 1} \cdot p$.

So in general, if we had a random variable $T$ that followed the geometric distribution with parameter $p$, then $\Pr[T = k] = (1 - p)^{k - 1} p$.

Definition: Geometric distribution

A random variable $T$ follows a geometric distribution with parameter $p$ if, for every integer $k \ge 1$:

$$\Pr[T = k] = (1 - p)^{k - 1} p$$
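The pattern above is easy to sketch in code (the helper name `geometric_pmf` is just an illustrative choice; here with a fair coin, $p = 1/2$):

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """Pr[T = k]: k - 1 tails in a row, then one heads."""
    return (1 - p) ** (k - 1) * p

p = Fraction(1, 2)
# Halves each time: 1/2, 1/4, 1/8, ...
print([geometric_pmf(k, p) for k in (1, 2, 3)])
```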

Binomial Distribution

Let’s build off the Bernoulli distribution again, but instead ask a different question. What if we instead had $n$ independent copies $X_1, \dots, X_n$ of $X$ (each following a Bernoulli distribution with parameter $p$), and we took $n$ trials, and asked “How many copies out of the $n$ trials returned $1$?”

Example:

Let’s say we had $3$ coins, and each coin returns heads with probability $p$. Let $B$ be the random variable that counts the number of heads. Then again, what’s the probability that $B = 2$?

Well, one way we could do this is to manually count this. So we know that there are $3$ possible outcomes we need: $HHT$, $HTH$, $THH$. We know that a heads happens with probability $p$, and a tails happens with probability $1 - p$. So:

$$\Pr[B = 2] = 3 \cdot p^2 (1 - p)$$

But what about in general? What if we had more coins than $3$? Manual counting gets very cumbersome. Let’s try to be smarter with how we count.

Of the $n$ coins, we choose $k$ of them to be heads, so the rest must be tails. So there are $\binom{n}{k}$ possible outcomes. For each outcome, the probability it occurs is $p^k (1 - p)^{n - k}$.

So in general, the probability is actually $\binom{n}{k} p^k (1 - p)^{n - k}$.

To be clear, the binomial distribution takes two parameters: $n$, the number of trials, and $p$, the probability of success of each independent trial.

Definition: Binomial distribution

A random variable $B$ follows a binomial distribution with parameters $n$ and $p$ (denoted $B \sim \mathrm{Bin}(n, p)$) if, for every integer $0 \le k \le n$:

$$\Pr[B = k] = \binom{n}{k} p^k (1 - p)^{n - k}$$
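A quick sketch of the counting argument above, using `math.comb` for the binomial coefficient and checking it against the manual 3-coin count:

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Pr[B = k]: choose which k trials succeed, times the probability of one such outcome."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p = Fraction(1, 2)
# The 3-coin example: Pr[B = 2] = 3 * p^2 * (1 - p).
print(binomial_pmf(2, 3, p))                         # 3/8
# Sanity check: the probabilities over all values of k sum to 1.
print(sum(binomial_pmf(k, 3, p) for k in range(4)))  # 1
```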

Uniform Distribution

The last distribution is the uniform distribution. This is the one we have been playing with the most. In general, we have a set of $n$ values, $\{v_1, v_2, \dots, v_n\}$. Each value is picked with probability $1/n$.

If we let $U$ be the random variable that outputs any of the $n$ values uniformly at random, then $U$ has the uniform distribution.

Definition: Uniform distribution

A random variable $U$ follows a uniform distribution over $n$ values $\{v_1, \dots, v_n\}$ if, for every $i$:

$$\Pr[U = v_i] = \frac{1}{n}$$


Part 3: Expectation and Variance

Expectation

Now that we have seen random variables and distributions, here’s a key question:

If we ran an experiment where we had a random variable $X$, and we took many independent samples, then output the average value, what should we hope/expect to see?

It turns out, the answer is the expectation $\mathbb{E}[X]$:

Definition: Expectation

The expectation of a random variable $X$ is defined by:

$$\mathbb{E}[X] = \sum_{x} x \cdot \Pr[X = x]$$

Here’s the intuition: this is the value we “expect” to see from the random variable.

Example 1: A fair die

For example, if we roll a $6$-sided fair die $D$, what is $\mathbb{E}[D]$? Based on our formula, this happens to be:

$$\mathbb{E}[D] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6}$$

which evaluates to $3.5$.

Think of it this way: if we rolled this die many, many, many times and took the average value, it should be close to $3.5$.

To be clear, let’s say we rolled the die for $n$ rounds. Letting $W_1$ be the value you roll in round 1, $W_2$ be the value you roll in round 2, so on and so forth… Then the “average” value you roll would be

$$\frac{W_1 + W_2 + \dots + W_n}{n}$$

and as $n \to \infty$, the above value would approach $3.5$.
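This “long-run average” intuition is easy to test empirically. Below is a small simulation sketch (the seed and the number of rolls are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so the experiment is repeatable

# Roll a fair six-sided die many times and take the average.
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]
average = sum(rolls) / n

# By the reasoning above, the average should hover around 3.5.
print(round(average, 2))
```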

A betting game:

So consider a game where you’d pay $3$ dollars to participate. And in the game, we roll a fair die, and whatever the rolled value is, you win that many dollars. E.g. if the die rolls to a $4$, you win $4$ dollars, gaining a $1$ dollar profit. Should you play?

(Just to be clear: if you played 5 rounds, and rolled $1$, $2$, $3$, $1$, $2$, then you paid $15$ dollars and won $9$ dollars, so you suffered a loss of $6$ dollars, or a profit of $-6$ dollars.)

Well, if you played $n$ rounds, you’d pay $3n$ dollars. You’d also expect to win around $3.5n$ dollars. I.e. expect around $0.5$ dollars of profit per round. Why? Because your expected winning across $n$ rounds is

$$\mathbb{E}[W_1 + W_2 + \dots + W_n] = 3.5n$$

But you also paid $3n$ dollars to play all $n$ rounds. So your expected profit would be “winnings − paid amount”, i.e.

$$3.5n - 3n = 0.5n \text{ dollars}$$

Of course the game is randomised, so it wouldn’t be exact. But as $n$ gets large, you should see this behaviour.

Example 2: A weighted die

Okay, let’s say we had a 6-sided die $D'$ that is not fair. So something like:

$$\Pr[D' = 1] = \Pr[D' = 2] = \Pr[D' = 3] = \Pr[D' = 4] = \Pr[D' = 5] = \frac{1}{10}, \qquad \Pr[D' = 6] = \frac{1}{2}$$

Well, we can apply the same type of analysis:

$$\mathbb{E}[D'] = (1 + 2 + 3 + 4 + 5) \cdot \frac{1}{10} + 6 \cdot \frac{1}{2} = \frac{15}{10} + 3 = 4.5$$

Which is $4.5$, just over the $3.5$ of the fair die.

The same betting game:

In which case, if you were asked again: would you pay $3$ dollars to participate in the same betting game with this weighted die? Should you? I would!

In $n$ rounds, I would expect to make an expected profit of $4.5n - 3n = 1.5n$ dollars now instead. Which is actually even better than before. (The computation works the same as in the previous example’s case.)

Example 3: A Bernoulli Distributed Random Variable

If we have a random variable $X$ that has a Bernoulli distribution with parameter $p$, what is $\mathbb{E}[X]$? Again, based on our formula, this happens to be:

$$\mathbb{E}[X] = 1 \cdot \Pr[X = 1] + 0 \cdot \Pr[X = 0] = 1 \cdot p + 0 \cdot (1 - p) = p$$

This looks quite surprising: the expected value is exactly the probability. But this is actually a very useful fact. We use this quite often in computer science!

Example 4: Payoff functions

Let’s say we have a contract that with probability $p$ will pay us $100$ dollars, and with probability $1 - p$ will pay us nothing ($0$ dollars). What is the expected payoff?

Notice here that if we first let $X$ be a Bernoulli-distributed indicator random variable with parameter $p$, where $X = 1$ when we get paid, then our payoff is given as:

$$100 \cdot X$$

So it boils down to asking: what is $\mathbb{E}[100X]$? Since $X$ takes values either $0$ or $1$, then $100X$ takes values either $0$ or $100$. So

$$\mathbb{E}[100X] = 100 \cdot p + 0 \cdot (1 - p) = 100p$$

Properties of Expectation

The last example actually was a teaser into some nice properties of expectation. We won’t prove them in this course, so you can take these as fact (though they are provable).

  1. If $c$ is a constant, then $\mathbb{E}[c] = c$.
  2. For a constant $c$ and a random variable $X$, $\mathbb{E}[cX] = c \cdot \mathbb{E}[X]$.
  3. For any random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$. This is known as linearity of expectation.

As a warning, we cannot generally say that $\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$. This is true when $X$ and $Y$ are independent, but otherwise, we have to be careful.
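This warning is easy to see concretely. In the sketch below, both “copies” in the product are the same fair die, so they are as dependent as possible: linearity still holds for the sum, but the expectation of the product is not the product of expectations:

```python
from fractions import Fraction

# Distribution of a fair six-sided die.
faces = range(1, 7)
pr = Fraction(1, 6)

E_D = sum(d * pr for d in faces)          # E[D]
E_sum = sum((d + d) * pr for d in faces)  # E[D + D]: linearity holds even though both copies are the same die
E_prod = sum((d * d) * pr for d in faces) # E[D * D], where both factors are the SAME (fully dependent) die

print(E_D, E_sum, E_prod, E_D * E_D)  # 7/2 7 91/6 49/4
```

Here $\mathbb{E}[D \cdot D] = 91/6$, while $\mathbb{E}[D] \cdot \mathbb{E}[D] = 49/4$: not equal, because the two factors are not independent.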

Example 5: Expectation of binomial distributions

Let’s go through yet another example. This time we will be asking: what is the expected value of a binomially distributed random variable $B$ with parameters $n$ and $p$?

If we faithfully followed the formula, we have that:

$$\mathbb{E}[B] = \sum_{k = 0}^{n} k \cdot \binom{n}{k} p^k (1 - p)^{n - k}$$

Except, that looks awfully complicated to analyse! So we’re going to pull out a very neat trick, and have our Bernoulli random variables do a lot of heavy lifting for us.

We are going to let $X_i$ be an indicator random variable with parameter $p$, representing whether the $i$-th trial was a success or not. Then:

$$B = X_1 + X_2 + \dots + X_n$$

Why do we want to do this though? Here’s the idea: using the properties of expectation, we know that:

$$\mathbb{E}[B] = \mathbb{E}[X_1 + X_2 + \dots + X_n] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \dots + \mathbb{E}[X_n]$$

Remember, $\mathbb{E}[X_i] = p$, because $X_i$ is an indicator random variable with probability $p$.

So putting this together, for general values of $n$ and $p$, the math becomes:

$$\mathbb{E}[B] = \sum_{i = 1}^{n} \mathbb{E}[X_i] = np$$
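We can sanity-check $\mathbb{E}[B] = np$ against the original, complicated-looking formula for some small hypothetical parameters, say $n = 5$ and $p = 1/3$:

```python
from fractions import Fraction
from math import comb

n, p = 5, Fraction(1, 3)

# Expectation straight from the definition: sum over k of k * Pr[B = k].
expectation = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

print(expectation, n * p)  # 5/3 5/3
```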

Example 6: Expectation of geometric distributions

Since we’ve covered the Bernoulli and binomial distributions, for the sake of completeness, let’s do the geometric distribution as well. Let $T$ be a geometrically distributed random variable with parameter $p$. The math for this one is a little more involved, so let’s jump straight into it. Again, by our definition of expectation, we have that:

$$\mathbb{E}[T] = \sum_{k = 1}^{\infty} k \cdot (1 - p)^{k - 1} p$$

Now this is pretty hard to resolve, so let’s work through some magic. First of all, write $q = 1 - p$ and let:

$$S = \sum_{k = 1}^{\infty} k q^{k - 1}$$

so that $\mathbb{E}[T] = pS$.

So what is $S - qS$? Let me lay it out term by term:

$$S = 1 + 2q + 3q^2 + 4q^3 + \dots$$
$$qS = q + 2q^2 + 3q^3 + \dots$$

If you notice, we’re grouping terms based on their power of $q$. What happens if we subtract them this way? Then:

$$S - qS = 1 + q + q^2 + q^3 + \dots$$

And the last series is actually a geometric series! Re-writing this, we get:

$$S - qS = \frac{1}{1 - q} = \frac{1}{p}$$

Okay, that was weird. Let’s also resolve the left hand side:

$$S - qS = (1 - q)S = pS = \mathbb{E}[T]$$

So that gives us our expectation, $\mathbb{E}[T] = 1/p$, which hopefully is quite intuitive. If we have a coin that returns heads with probability $p$, we would expect to flip it $1/p$ times before we see a heads.
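We can also sanity-check $\mathbb{E}[T] = 1/p$ numerically by truncating the infinite sum (the choice $p = 0.25$ and the cutoff of 200 terms are arbitrary; the neglected tail is vanishingly small):

```python
# Partial sums of sum over k of k * (1-p)^(k-1) * p approach 1/p.
p = 0.25

partial = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 201))

print(round(partial, 6))  # 4.0, i.e. 1/p
```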

Variance

So expectation was nice and all, and it tells us what the random variable “averages” around, but it doesn’t tell us how spread apart the values are. For that, we need variance.

Intuitively, variance is a measure of how much the random variable can vary.

Formally, it is defined as:

Definition: Variance

The variance of a random variable $X$ is defined by:

$$\mathrm{Var}[X] = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]$$

The first form is not that useful, so usually we use the second form (whose proof has been provided below):

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Again, friendly reminder that $\mathbb{E}[X]^2$ means $(\mathbb{E}[X])^2$, and this is, in general, not equal to $\mathbb{E}[X^2]$.

Example 1: Variance of a Bernoulli-distributed random variable

For example, given a Bernoulli distributed random variable $X$ with probability $p$, what is $\mathrm{Var}[X]$? We know that $\mathbb{E}[X] = p$, so we know that $\mathbb{E}[X]^2 = p^2$. But what is $\mathbb{E}[X^2]$?

So we’ve done this before! $X$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$. So similarly, $X^2$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$.

So again:

$$\mathbb{E}[X^2] = 1 \cdot p + 0 \cdot (1 - p) = p$$

So, putting the two together:

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1 - p)$$

Example 2: Variance of a binomially distributed random variable

As another example, what about the variance of a binomially distributed random variable $B$ with $n$ trials, and probability $p$? Again, let’s fall back on the neat trick that I mentioned: let $X_i$ be an indicator that indicates whether the $i$-th trial was a success. In which case:

$$B = X_1 + X_2 + \dots + X_n$$

So now:

$$B^2 = \left( \sum_{i = 1}^{n} X_i \right)^2 = \sum_{i = 1}^{n} \sum_{j = 1}^{n} X_i X_j$$

If this is not obvious, think about how $(a + b)^2$ can be written as $(a + b)(a + b) = a^2 + ab + ba + b^2$.

Now again, we want $\mathbb{E}[B^2]$, so:

$$\mathbb{E}[B^2] = \sum_{i = 1}^{n} \sum_{j = 1}^{n} \mathbb{E}[X_i X_j] = \sum_{i = j} \mathbb{E}[X_i X_j] + \sum_{i \neq j} \mathbb{E}[X_i X_j]$$

Now let’s look at what’s going on in each sum separately. The first sum sums over when $i = j$, so this is just the same as $\sum_{i} \mathbb{E}[X_i^2]$. As before, we know that since $X_i$ is an indicator random variable, $\mathbb{E}[X_i^2] = p$. So:

$$\sum_{i = j} \mathbb{E}[X_i X_j] = \sum_{i = 1}^{n} \mathbb{E}[X_i^2] = np$$

What about when $i \neq j$? Note that $X_i$ and $X_j$ both only output either $0$ or $1$. Then, $X_i X_j$ is $1$ only when both $X_i$ and $X_j$ are $1$; otherwise, it is $0$. So now:

$$\mathbb{E}[X_i X_j] = \Pr[X_i = 1 \text{ and } X_j = 1]$$

Since $X_i$ and $X_j$ are independent, we know that $\Pr[X_i = 1 \text{ and } X_j = 1] = \Pr[X_i = 1] \cdot \Pr[X_j = 1] = p^2$. There are $n(n - 1)$ ordered pairs with $i \neq j$, so putting this back into the sum:

$$\sum_{i \neq j} \mathbb{E}[X_i X_j] = n(n - 1) p^2$$

So finally, putting this back in:

$$\mathrm{Var}[B] = \mathbb{E}[B^2] - \mathbb{E}[B]^2 = np + n(n - 1)p^2 - (np)^2 = np - np^2 = np(1 - p)$$
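As with expectation, we can sanity-check $\mathrm{Var}[B] = np(1 - p)$ against the definition $\mathbb{E}[B^2] - \mathbb{E}[B]^2$ for small hypothetical parameters, say $n = 5$ and $p = 1/3$:

```python
from fractions import Fraction
from math import comb

n, p = 5, Fraction(1, 3)

def pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

E_B  = sum(k * pmf(k) for k in range(n + 1))
E_B2 = sum(k * k * pmf(k) for k in range(n + 1))
variance = E_B2 - E_B**2

print(variance, n * p * (1 - p))  # 10/9 10/9
```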

Summary

So we have that:

  1. Bernoulli-distributed random variables: expectation $p$, variance $p(1 - p)$
  2. Binomially distributed random variables: expectation $np$, variance $np(1 - p)$
  3. Geometrically distributed random variables: expectation $1/p$, variance $(1 - p)/p^2$

We will skip the proof for geometric random variables because it involves using some amount of calculus.

Properties of Variance

  1. If $X$ and $Y$ are independent random variables, then $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Part 4: Bounds

So we’ve worked quite hard to figure out what the expectation and variance are for random variables. But why? What’s so important about these things?

In computer science, we often have “bad” events that we want to avoid. For example, long running times in Las Vegas algorithms, errors in classification, hashing collisions, and so on. Anytime there is any amount of randomness, we will have to somehow argue that bad events don’t happen too often. Hopefully you’ll see what I mean beyond this course, when you finally use these ideas.

So to do so, we commonly use Markov and Chebyshev bounds! These bounds are great if we are happy with a good enough, one-sided upper bound on the probability. Typically we will be finding the probabilities of bad events and saying they don’t occur too often. So in CS at least, these are great.

Markov Bound

Definition: Markov bound

If $X$ is a non-negative random variable, and $a > 0$, then:

$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$

For example, let $B$ be a binomially distributed random variable with $n$ trials and success probability $p$. We can say something like:

$$\Pr[B \ge a] = \sum_{k \ge a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But this is hard to analyse, and is not even in a closed form. What if we could sacrifice some amount of precision for an easier bound to work with? So if we instead applied Markov’s bound, we have:

$$\Pr[B \ge a] \le \frac{\mathbb{E}[B]}{a} = \frac{np}{a}$$

So for something like $a = 2np$, this works out to be $\Pr[B \ge 2np] \le 1/2$. See how simple that was? Sometimes an imprecise answer is good enough. The Markov bound is one such way to get a “good enough” imprecise answer.
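To see how loose (but convenient) Markov’s bound is, we can compare it against the exact tail probability for some hypothetical parameters, say $n = 100$ and $p = 1/2$:

```python
from math import comb

n, p = 100, 0.5
a = 75  # threshold; here E[B] = np = 50

# Exact tail probability Pr[B >= a], summing the binomial pmf.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

# Markov: Pr[B >= a] <= E[B] / a.
markov = (n * p) / a

print(f"exact={exact:.3e}  markov={markov:.3f}")
```

The exact tail is tiny, while Markov only promises $2/3$: a very loose bound, but one we got with no work at all.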

Chebyshev Bound

Definition: Chebyshev bound

If $X$ is a random variable and $a > 0$, then:

$$\Pr\left[ \left| X - \mathbb{E}[X] \right| \ge a \right] \le \frac{\mathrm{Var}[X]}{a^2}$$

For example, let $B$ be a binomially distributed random variable with $n$ trials and success probability $p$. We can say something like:

$$\Pr\left[ \left| B - np \right| \ge a \right] = \sum_{k : |k - np| \ge a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But again, this is a lot simpler if we could use the Chebyshev bound to say something like (assuming we are happy with a good enough, one-sided bound):

$$\Pr\left[ \left| B - np \right| \ge a \right] \le \frac{\mathrm{Var}[B]}{a^2} = \frac{np(1 - p)}{a^2}$$
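And the same comparison works for Chebyshev (same hypothetical parameters $n = 100$, $p = 1/2$; the deviation $a = 15$ is an arbitrary choice):

```python
from math import comb

n, p = 100, 0.5
a = 15  # deviation from the mean E[B] = 50

def pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exact probability of deviating by at least a from the mean.
mean = n * p
exact = sum(pmf(k) for k in range(n + 1) if abs(k - mean) >= a)

# Chebyshev: Pr[|B - np| >= a] <= Var[B] / a^2 = np(1-p) / a^2.
chebyshev = (n * p * (1 - p)) / a**2

print(f"exact={exact:.4f}  chebyshev={chebyshev:.4f}")
```

Again the bound is loose, but it needed nothing beyond the variance we computed earlier, which is exactly why these bounds are so convenient in randomised analysis.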