Overview

This unit is the second of two parts in our introduction to probability. Again, the ideas here are pretty fundamental for analysing randomised strategies and algorithms, which is something we do quite often in machine learning.

To tie it all up, we’ll introduce probability distributions, talk about expected values and variances, and motivate them with some bounds we commonly use in computer science.

Introduction

Picking up from where we left off: last time, we focused on probability spaces, events, and how to compute things like conditional probabilities, and we mostly assumed things were uniform or independent. Extending this, we’ll first talk about random variables and probability distributions. What we will show in this unit are the staple approaches to randomised analysis. If you were ever curious how something like the multiplicative weights update algorithm works, the techniques shown on this page are a must.

Random Variables

For starters, let’s talk about random variables. Think of random variables as, basically, values that we care about. Now this might seem pretty abstract, so before we go into an example, think of random variables as functions that take outcomes as input and output values.

An Example With Coins

Defining Random Variables

Let’s go back to using coins. Let’s say we had 3 fair and independent coins. So each coin produces either heads or tails, each with probability $\frac{1}{2}$, and all 3 coins’ outcomes do not affect each other.

We know that we have 8 possible outcomes, so our sample space looks like the following:

$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$

So here comes the question: What if we wanted to count the number of heads?

Here’s how we would do it: we would make a random variable, let’s call it $X$. $X$ has to specify, for each outcome, what it will output as a number. In this case, since we want to count the number of heads, we can simply wave our hands and declare “let $X$ count the number of heads.”

From this, we know, for example, that $X(HHH) = 3$, and $X(HTH) = 2$, also $X(TTT) = 0$.

So $X$ is a random variable that counts the number of heads.

Let’s try another question: What if we wanted to indicate that the coins were all heads?

Here’s how we would do it: we would make a random variable, this time let’s call it $Y$ so we don’t collide our names. $Y$ is going to output value $1$ on input $HHH$, i.e. $Y(HHH) = 1$. On any other input outcome, $Y$ will return $0$. So for example, $Y(HTH) = 0$, and $Y(TTH) = 0$. Also $Y(TTT) = 0$.

$Y$ is a special kind of random variable that we commonly call an indicator random variable. These modest types of random variables actually do a lot of heavy lifting in computer science. You’ll see one such example later in this unit. Think of indicator random variables as indicating whether some event has occurred.

Both $X$ and $Y$ are examples of random variables.
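To make this concrete, here’s a minimal Python sketch (the function names `X` and `Y` are just mine, mirroring our two random variables) that treats random variables literally as functions from outcomes to numbers:

```python
from itertools import product

# All 8 outcomes of flipping 3 coins, e.g. ('H', 'H', 'T').
outcomes = list(product("HT", repeat=3))

def X(outcome):
    """Counts the number of heads in an outcome."""
    return outcome.count("H")

def Y(outcome):
    """Indicates (0 or 1) whether the outcome is all heads."""
    return 1 if outcome == ("H", "H", "H") else 0

for outcome in outcomes:
    print("".join(outcome), X(outcome), Y(outcome))
```

Running this prints each outcome alongside the values that $X$ and $Y$ assign to it, which is exactly the “functions from outcomes to values” picture.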

Random Variables vs. Events

Let’s look at $X$ again. If we split the sample space up, we notice that we can “break up” or “partition” the sample space based on what value $X$ outputs:

  1. For the event $\{TTT\}$, $X$ outputs $0$.
  2. For the event $\{TTH, THT, HTT\}$, $X$ outputs $1$.
  3. For the event $\{THH, HTH, HHT\}$, $X$ outputs $2$.
  4. For the event $\{HHH\}$, $X$ outputs $3$.

So what is $\mathbb{P}[X = 2]$? Well, it is just the probability that we either see: $THH$, or $HTH$, or $HHT$. In other words:

$$\mathbb{P}[X = 2] = \mathbb{P}[\{THH, HTH, HHT\}]$$

Since we assumed each coin is fair and independent, each outcome occurs with probability $\frac{1}{8}$. There are $3$ such outcomes, so $\mathbb{P}[X = 2] = \frac{3}{8}$.

What about $Y$? This time around:

  1. For the event $\{HHH\}$, $Y$ outputs $1$.
  2. For the event $\{HHT, HTH, HTT, THH, THT, TTH, TTT\}$, $Y$ outputs $0$.

So what is $\mathbb{P}[Y = 1]$? Recall that:

$$\mathbb{P}[Y = 1] = \mathbb{P}[\{HHH\}]$$

Now if we look at $\{HHH\}$, we might notice that this is actually just the single outcome $HHH$ itself. So $\mathbb{P}[Y = 1]$ is $\frac{1}{8}$. What about $\mathbb{P}[Y = 0]$? Well, by a similar argument, it is $\frac{7}{8}$.

So:

$$\mathbb{P}[Y = 1] = \frac{1}{8}, \qquad \mathbb{P}[Y = 0] = \frac{7}{8}$$

Functions on Random Variables

Here’s an interesting idea: can we operate on random variables? We sure can! And this idea is super useful. For example, what is $Y^2$? It is a random variable that first takes an outcome as input, applies $Y$ to it, and outputs the square of the result.

So for example, we could ask something like: What is $\mathbb{P}[Y^2 = 1]$?

Let’s think about this a little bit. $Y$ can only take two possible values: $0$ or $1$. When $Y$ outputs $0$, then so does $Y^2$. When $Y$ outputs $1$, so does $Y^2$.

So $\mathbb{P}[Y^2 = 1]$ is the same as $\mathbb{P}[Y = 1]$. And similarly, $\mathbb{P}[Y^2 = 0]$ is the same as $\mathbb{P}[Y = 0]$. So $\mathbb{P}[Y^2 = 1] = \frac{1}{8}$.

Another Example:

Instead of squaring, we can also do something like adding random variables together. Here’s an alternative example.

Let $X_1$ and $X_2$ be random variables for two separate, independent and fair 6-sided dice. So again, $X_1$ and $X_2$ can each take values in $\{1, 2, 3, 4, 5, 6\}$.

Then technically, even something like $X_1 + X_2$ is a random variable. And we can ask something like: what is $\mathbb{P}[X_1 + X_2 = 3]$? This is basically asking: what is the probability that the total of the two dice is $3$?

We know that this means either ($X_1 = 1$ and $X_2 = 2$) or ($X_1 = 2$ and $X_2 = 1$). So:

$$\mathbb{P}[X_1 + X_2 = 3] = \frac{1}{36} + \frac{1}{36} = \frac{2}{36} = \frac{1}{18}$$
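Here’s a quick sketch that double-checks this by brute force, enumerating all 36 equally likely outcomes of the two dice:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Keep the outcomes where the sum is 3.
favourable = [(x1, x2) for (x1, x2) in outcomes if x1 + x2 == 3]

print(favourable)                                # [(1, 2), (2, 1)]
print(Fraction(len(favourable), len(outcomes)))  # 1/18
```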

Probability Distributions

Now that we’ve talked about random variables, the next thing is to finally talk about probability distributions themselves. Think of a probability distribution as the function that assigns a probability to each value a random variable can take. Intuitively, you can think of this as a chart that says, for each value, what the probability is.

Example 1:

For $X$, the number of heads among our 3 fair coins:

$$\mathbb{P}[X = 0] = \tfrac{1}{8}, \quad \mathbb{P}[X = 1] = \tfrac{3}{8}, \quad \mathbb{P}[X = 2] = \tfrac{3}{8}, \quad \mathbb{P}[X = 3] = \tfrac{1}{8}$$

Example 2:

For $Y$, the indicator that all 3 coins are heads:

$$\mathbb{P}[Y = 0] = \tfrac{7}{8}, \quad \mathbb{P}[Y = 1] = \tfrac{1}{8}$$

Example 3:

For a single roll $D$ of a fair 6-sided die:

$$\mathbb{P}[D = k] = \tfrac{1}{6} \quad \text{for each } k \in \{1, 2, 3, 4, 5, 6\}$$

Again, think of a distribution as basically saying “for this value, we assign this probability”.

We will look at some very common discrete probability distributions in CS:

  1. Bernoulli
  2. Geometric
  3. Binomial
  4. Uniform

Bernoulli Distribution

So let’s begin with the Bernoulli distribution. This is the distribution for indicator random variables. Again, recall that since indicator random variables only take values $0$ or $1$, the Bernoulli distribution has to assign a probability $p$ for when $Y = 1$, and consequently, this means $Y = 0$ with probability $1 - p$. Think of $p$ as the parameter of the distribution. This single value determines the entire distribution.

So to be clear, an indicator random variable $Y$ has a Bernoulli distribution with parameter $p$ if:

$$\mathbb{P}[Y = 1] = p \qquad \text{and} \qquad \mathbb{P}[Y = 0] = 1 - p$$

Example:

Let’s say we roll 1 fair die with 6 faces, and each face is produced with probability $\frac{1}{6}$.

So let $Y$ be the random variable that indicates if the die turns up with a number that is at least 2, i.e. $Y = 1$ if we see a value of $2$ or more.

Then we can say that $\mathbb{P}[Y = 1] = \frac{5}{6}$. So $Y$ follows a Bernoulli distribution with parameter $p = \frac{5}{6}$.
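If you want to play with Bernoulli random variables empirically, here’s a small sketch (the helper name `bernoulli` is mine, not a standard library function):

```python
import random

def bernoulli(p):
    """Samples an indicator that is 1 with probability p, and 0 otherwise."""
    return 1 if random.random() < p else 0

# Simulate the die indicator: Y = 1 when a fair die shows at least 2.
samples = [bernoulli(5 / 6) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 5/6, about 0.833
```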

Geometric Distribution

Let’s build off the Bernoulli distribution and make use of it for something else. Given a random variable $Y$ that follows a Bernoulli distribution with parameter $p$, let $T$ be the number of times we need to try before $Y = 1$.

Example:

As a concrete example, this is as if we are making coin flips, where the coin has probability $p$ of returning heads, and we are asking: how many times do we need to flip before we see heads? And here we are going to assume that every flip of the coin is independent of its previous outcomes.

To be clear, if $T$ is the random variable that outputs the number of times we need to try, then $T$ follows the geometric distribution.

So what is the probability that $T = 1$? Well, that happens when the very first flip is heads, which happens with probability $p$.

What about the probability that $T = 2$? Well, that happens with probability $(1 - p) \cdot p$.

What about the probability that $T = k$? Do you see the pattern? We must have flipped $k - 1$ many tails in a row, then flipped a heads. So the probability is $(1 - p)^{k - 1} \cdot p$.

So in general, if we had a random variable $T$ that followed the geometric distribution with parameter $p$, then:

$$\mathbb{P}[T = k] = (1 - p)^{k - 1} \, p$$
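Here’s a sketch that samples $T$ by literally flipping a biased coin until it lands heads, then compares the empirical frequencies against the formula $(1 - p)^{k - 1} p$:

```python
import random

def geometric(p):
    """Flips a p-biased coin until heads; returns how many flips it took."""
    flips = 1
    while random.random() >= p:  # tails: flip again
        flips += 1
    return flips

p, n = 0.5, 100_000
samples = [geometric(p) for _ in range(n)]
for k in range(1, 6):
    empirical = samples.count(k) / n
    exact = (1 - p) ** (k - 1) * p
    print(k, round(empirical, 4), exact)
```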

Binomial Distribution

Let’s build off the Bernoulli distribution again, but instead ask a different question. What if we instead had $n$ independent copies of $Y$ (each with a Bernoulli distribution with parameter $p$), and we took $n$ trials, and asked: how many of the $n$ trials returned $1$?

Example:

Let’s say we had 3 coins, where each coin returns heads with probability $p$. Let’s let $X$ be the random variable that counts the number of heads. Then again, what’s the probability that $X = 2$?

Well, one way we could do this is to manually count this. So we know that there are 3 possible outcomes we need: $HHT$, $HTH$, $THH$. We know that a heads happens with probability $p$, and a tails happens with probability $1 - p$. So:

$$\mathbb{P}[X = 2] = 3 \cdot p^2 (1 - p)$$

But what about in general? What if we had more coins than $3$? Manually counting gets very cumbersome. Let’s try to be smarter with how we count.

Of the $n$ coins, we choose $k$ of them to be heads, so the rest must be tails. So there are $\binom{n}{k}$ possible outcomes. For each such outcome, the probability it occurs is $p^k (1 - p)^{n - k}$.

So in general, the probability is actually:

$$\mathbb{P}[X = k] = \binom{n}{k} p^k (1 - p)^{n - k}$$

To be clear, the binomial distribution takes 2 parameters: $n$, the number of trials, and $p$, the probability of success of each independent trial.
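In code, this probability is a one-liner once you have binomial coefficients; here’s a sketch checking it against the hand count from the 3-coin example:

```python
from math import comb

def binomial_pmf(n, p, k):
    """P[X = k] for a binomially distributed X with parameters n and p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p = 0.5
print(binomial_pmf(3, p, 2))  # 0.375
print(3 * p**2 * (1 - p))     # 0.375, matching the manual count
```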

Uniform Distribution

The last distribution is the uniform distribution. This is the one we have been playing with the most. In general, we have a set of $n$ values, $\{v_1, v_2, \dots, v_n\}$. Each value is picked with probability $\frac{1}{n}$.

If we let $X$ be the random variable that outputs any of the $n$ values uniformly at random, then $X$ has the uniform distribution.

Expectation

Now that we have seen random variables and distributions, here’s a key question:

If we ran an experiment where we had a random variable $X$, took many independent samples, and then output the average value, what should we hope/expect to see?

It turns out, the answer is:

$$\mathbb{E}[X] = \sum_{x} x \cdot \mathbb{P}[X = x]$$

Here’s the intuition: this is the value we “expect” to see from the random variable.

Example 1: A fair die

For example, if we roll a 6-sided fair die, what is $\mathbb{E}[X]$? Based on our formula, this happens to be:

$$\mathbb{E}[X] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6}$$

which evaluates to $\frac{21}{6} = 3.5$.

Think of it this way: if we rolled this die many, many, many times and took the average value, it should be close to $3.5$.
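This “average of many samples” claim is easy to test; here’s a tiny Monte Carlo sketch:

```python
import random

# Roll a fair die a million times; the average should be near 3.5.
n = 1_000_000
total = sum(random.randint(1, 6) for _ in range(n))
print(total / n)  # e.g. 3.4996...
```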

Example 2: A Bernoulli Distributed Random Variable

If we have a random variable $X$ that has a Bernoulli distribution with parameter $p$, what is $\mathbb{E}[X]$? Again, based on our formula, this happens to be:

$$\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1 - p) = p$$

It might look quite surprising that the expected value is exactly the probability $p$. But this is actually a very useful fact. We use it quite often in computer science!

Example 3: Payoff Functions

Let’s say we have a contract that with probability $\frac{1}{3}$ will pay us $5$ dollars, and with probability $\frac{2}{3}$ will pay us nothing ($0$ dollars). What is the expected payoff?

Notice here that if we first let $Y$ be a Bernoulli distributed indicator random variable with $p = \frac{1}{3}$, where $Y = 1$ when we get paid, then our payoff is given as:

$$\text{payoff} = 5Y$$

So it boils down to asking: what is $\mathbb{E}[5Y]$? Since $Y$ takes values either $0$ or $1$, $5Y$ takes values either $0$ or $5$. So:

$$\mathbb{E}[5Y] = 5 \cdot \tfrac{1}{3} + 0 \cdot \tfrac{2}{3} = \tfrac{5}{3}$$

Properties About Expectation

So the last example was actually a teaser for some nice properties of expectation. We won’t prove them in this course, so you can take these as fact (though they are provable).

  1. If $c$ is a constant, then $\mathbb{E}[cX] = c \, \mathbb{E}[X]$.
  2. Linearity of expectation: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$, even when $X$ and $Y$ are not independent.

As a warning, we cannot generally say that $\mathbb{E}[XY] = \mathbb{E}[X] \, \mathbb{E}[Y]$. This is true when $X$ and $Y$ are independent, but otherwise, we have to be careful.

Example 4: Expectation of Binomial Distributions

Let’s go through yet another example. This time we will be asking: what is the expected value of a binomially distributed random variable $X$ with parameters $n$ and $p$?

If we faithfully followed the formula, we have that:

$$\mathbb{E}[X] = \sum_{k=0}^{n} k \cdot \binom{n}{k} p^k (1 - p)^{n - k}$$

Except, that looks awfully complicated to analyse! So we’re going to pull out a very neat trick, and have our Bernoulli random variables do a lot of heavy lifting for us.

We are going to let $Y_i$ be an indicator random variable with parameter $p$ that represents whether the $i$-th trial was a success or not. Then:

$$X = Y_1 + Y_2 + \cdots + Y_n$$

Why do we want to do this though? Here’s the idea: using the properties of expectation, we know that:

$$\mathbb{E}[X] = \mathbb{E}[Y_1 + Y_2 + \cdots + Y_n] = \mathbb{E}[Y_1] + \mathbb{E}[Y_2] + \cdots + \mathbb{E}[Y_n]$$

Remember, $\mathbb{E}[Y_i] = p$, because $Y_i$ is an indicator random variable with probability $p$.

So for general values of $n$ and $p$, the math becomes:

$$\mathbb{E}[X] = \sum_{i=1}^{n} \mathbb{E}[Y_i] = np$$
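Here’s a quick empirical sanity check of $\mathbb{E}[X] = np$, building $X$ exactly as a sum of indicators:

```python
import random

n, p, trials = 10, 0.3, 100_000
total = 0
for _ in range(trials):
    # One binomial sample: sum of n independent Bernoulli indicators.
    X = sum(1 for _ in range(n) if random.random() < p)
    total += X
print(total / trials, n * p)  # both close to 3.0
```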

Example 5: Expectation of Geometric Distributions

Since we’ve covered Bernoulli and Binomial, for the sake of completeness let’s do Geometric as well. Let $T$ be a geometrically distributed random variable with parameter $p$. The math for this one is a little more involved, so let’s jump straight into it. Again, by our definition of expectation, we have that:

$$\mathbb{E}[T] = \sum_{k=1}^{\infty} k \cdot (1 - p)^{k - 1} p$$

Now this is pretty hard to resolve, so let’s work through some magic. First of all, let’s give the sum a name, $S = \mathbb{E}[T]$, and consider $(1 - p)S$ as well.

So what is $S - (1 - p)S$? Let me lay it out term by term:

$$S = p + 2(1 - p)p + 3(1 - p)^2 p + 4(1 - p)^3 p + \cdots$$

$$(1 - p)S = (1 - p)p + 2(1 - p)^2 p + 3(1 - p)^3 p + \cdots$$

If you notice, we’re grouping terms based on their power of $(1 - p)$. What happens if we subtracted them this way? Then:

$$S - (1 - p)S = p + (1 - p)p + (1 - p)^2 p + (1 - p)^3 p + \cdots$$

And the last series is actually geometric! Re-writing this, we get:

$$S - (1 - p)S = p \sum_{k=0}^{\infty} (1 - p)^k = p \cdot \frac{1}{1 - (1 - p)} = p \cdot \frac{1}{p} = 1$$

Okay, that was weird. Let’s also resolve the left hand side:

$$S - (1 - p)S = pS, \qquad \text{so} \qquad pS = 1 \quad \Longrightarrow \quad \mathbb{E}[T] = S = \frac{1}{p}$$

So that gives us our expectation, which hopefully is quite intuitive. If we have a coin that returns heads with probability $\frac{1}{3}$, we would expect to flip it 3 times before we see a heads.
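And a sketch verifying $\mathbb{E}[T] = \frac{1}{p}$ by simulation, using the same coin-flipping loop as before:

```python
import random

def geometric(p):
    """Number of p-biased flips until the first heads."""
    flips = 1
    while random.random() >= p:
        flips += 1
    return flips

p, n = 1 / 3, 100_000
print(sum(geometric(p) for _ in range(n)) / n)  # close to 1/p = 3
```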

Variance

So expectation was nice and all, and it tells us what the random variable “averages” around, but it doesn’t tell us how spread apart the values are. For that, we need variance.

Intuitively, variance is a measure of how much the random variable tends to deviate from its expected value.

Formally, it is defined as:

$$\mathrm{Var}[X] = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]$$

Except this form is not very helpful, so let me show you a more useful, equivalent form:

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Again, friendly reminder that $\mathbb{E}[X^2]$ is, in general, not equal to $\mathbb{E}[X]^2$.

Example 1: Variance of a Bernoulli Distributed Random Variable

For example, given a Bernoulli distributed random variable $Y$ with probability $p$, what is $\mathrm{Var}[Y]$? We know that $\mathbb{E}[Y] = p$, so we know that $\mathbb{E}[Y]^2 = p^2$. But what is $\mathbb{E}[Y^2]$?

So we’ve done this before! $Y$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$. So similarly, $Y^2$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$.

So again:

$$\mathbb{E}[Y^2] = 1 \cdot p + 0 \cdot (1 - p) = p$$

So, putting the two together:

$$\mathrm{Var}[Y] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 = p - p^2 = p(1 - p)$$

Example 2: Variance of a Binomially Distributed Random Variable

As another example, what about the variance of a binomially distributed random variable $X$ with $n$ trials and probability $p$? Again, let’s fall back to the neat trick that I mentioned: let $Y_i$ be an indicator that indicates whether the $i$-th trial was a success. In which case:

$$X = Y_1 + Y_2 + \cdots + Y_n$$

So now:

$$X^2 = \left( \sum_{i=1}^{n} Y_i \right)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} Y_i Y_j$$

If this is not obvious, think about how $(a_1 + a_2 + \cdots + a_n)^2$ can be written as $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j$.

Now again, we want $\mathbb{E}[X^2]$, so:

$$\mathbb{E}[X^2] = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}[Y_i Y_j] = \sum_{i = j} \mathbb{E}[Y_i Y_j] + \sum_{i \neq j} \mathbb{E}[Y_i Y_j]$$

Now let’s look at what’s going on in each sum separately. The first sum sums over when $i = j$, so each term is just the same as $\mathbb{E}[Y_i^2]$. As before, we know that since $Y_i$ is an indicator random variable, $\mathbb{E}[Y_i^2] = p$. So:

$$\sum_{i = j} \mathbb{E}[Y_i Y_j] = \sum_{i=1}^{n} \mathbb{E}[Y_i^2] = np$$

What about when $i \neq j$? $Y_i$ and $Y_j$ both only output either $0$ or $1$. Then, $Y_i Y_j$ is $1$ only when both $Y_i$ and $Y_j$ are $1$; otherwise, it is $0$. So now:

$$\mathbb{E}[Y_i Y_j] = \mathbb{P}[Y_i = 1 \text{ and } Y_j = 1]$$

Since $Y_i$ and $Y_j$ are independent, we know that $\mathbb{P}[Y_i = 1 \text{ and } Y_j = 1] = p \cdot p = p^2$. So putting this back into the sum (there are $n(n-1)$ ordered pairs with $i \neq j$):

$$\sum_{i \neq j} \mathbb{E}[Y_i Y_j] = n(n - 1) p^2$$

So finally, putting this back in:

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = np + n(n - 1)p^2 - (np)^2 = np - np^2 = np(1 - p)$$
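As before, we can sanity-check $\mathrm{Var}[X] = np(1 - p)$ empirically:

```python
import random

n, p, trials = 10, 0.3, 200_000
samples = [sum(1 for _ in range(n) if random.random() < p)
           for _ in range(trials)]

mean = sum(samples) / trials
variance = sum((x - mean) ** 2 for x in samples) / trials
print(variance, n * p * (1 - p))  # both close to 2.1
```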

Summary:

So we have that:

  1. Bernoulli distributed random variables: $\mathrm{Var}[Y] = p(1 - p)$
  2. Binomially distributed random variables: $\mathrm{Var}[X] = np(1 - p)$
  3. Geometrically distributed random variables: $\mathrm{Var}[T] = \frac{1 - p}{p^2}$

We will skip the proof for geometric random variables because it involves using some amount of calculus.

Properties of Variance:

  1. If $X$ and $Y$ are independent random variables, then $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Bounds

So we’ve worked quite hard to figure out what the expectation and variance are for random variables. But why? What’s so important about these things?

In computer science, we often have “bad” events that we want to avoid. For example, long running times in Las Vegas algorithms, errors in classification, hashing collisions, and so on. Anytime there is any amount of randomness, we will have to somehow argue that bad events don’t happen too often. Hopefully you’ll see what I mean beyond this course, when you finally use these ideas.

So to do so, we commonly use the Markov and Chebyshev bounds! These bounds are great if we are happy with a good enough, one-sided upper bound on the probability. Typically we will be finding the probabilities of bad events and saying they don’t occur too often. So in CS at least, these are great.

Markov Bound:

If $X$ is a non-negative random variable, and $a > 0$, then:

$$\mathbb{P}[X \geq a] \leq \frac{\mathbb{E}[X]}{a}$$

For example, if $X$ is a binomially distributed random variable with $n$ trials and success probability $p$, we can say something like:

$$\mathbb{P}[X \geq a] = \sum_{k \geq a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But this is hard to analyse, and is not even in a closed form. What if we could sacrifice some amount of precision for an easier bound to work with? So if we instead applied Markov’s bound, we have:

$$\mathbb{P}[X \geq a] \leq \frac{\mathbb{E}[X]}{a} = \frac{np}{a}$$

So for something like $a = 2np$, this works out to be $\mathbb{P}[X \geq 2np] \leq \frac{1}{2}$. See how simple that was? Sometimes an imprecise answer is good enough. The Markov bound is one such way to get a “good enough” imprecise answer.
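To see just how loose (but convenient) Markov can be, here’s a sketch comparing the exact binomial tail against the bound, for hypothetical parameters $n = 100$, $p = \frac{1}{2}$, $a = 75$:

```python
from math import comb

def binomial_tail(n, p, a):
    """Exact P[X >= a] for a binomially distributed X."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(a, n + 1))

n, p, a = 100, 0.5, 75
print(binomial_tail(n, p, a))  # tiny: on the order of 1e-7
print(n * p / a)               # Markov only promises <= 0.666...
```

The bound is far from tight, but it took one division instead of a 26-term sum.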

Chebyshev Bound:

If $X$ is a random variable, and $a > 0$, then:

$$\mathbb{P}\left[ \left| X - \mathbb{E}[X] \right| \geq a \right] \leq \frac{\mathrm{Var}[X]}{a^2}$$

For example, if $X$ is a binomially distributed random variable with $n$ trials and success probability $p$, we can say something like:

$$\mathbb{P}\left[ |X - np| \geq a \right] = \sum_{k \,:\, |k - np| \geq a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But again, this is a lot simpler if we could use Chebyshev to say something like (assuming we are happy with a good enough, one-sided bound):

$$\mathbb{P}\left[ |X - np| \geq a \right] \leq \frac{\mathrm{Var}[X]}{a^2} = \frac{np(1 - p)}{a^2}$$
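And the analogous comparison for Chebyshev, again with hypothetical parameters $n = 100$, $p = \frac{1}{2}$, $a = 20$:

```python
from math import comb

def binomial_pmf(n, p, k):
    """P[X = k] for a binomially distributed X."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, a = 100, 0.5, 20
mu, var = n * p, n * p * (1 - p)

# Exact P[|X - np| >= a], summing over the deviating values of k.
exact = sum(binomial_pmf(n, p, k) for k in range(n + 1) if abs(k - mu) >= a)

print(exact)       # very small, around 1e-4
print(var / a**2)  # Chebyshev promises <= 25/400 = 0.0625
```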