Overview

This unit is the second of two parts in our introduction to probability. Again, the ideas here are pretty fundamental when it comes to things like analysing randomised strategies or algorithms, something that is quite common in machine learning.

To tie it all up, we’ll introduce probability distributions, talk about expected values and variances, and motivate them with some bounds we commonly use in computer science.

  1. Random variables
  2. Probability distributions
  3. Expectation and variance
  4. Markov and Chebyshev bounds

Introduction

Picking up from where we last left off: previously, we focused a lot on probability spaces, events, and how to compute things like conditional probability, and we mostly assumed things were uniform or independent. Extending from this, we’ll first talk about random variables and probability distributions. What we will show in this unit are the staple approaches to randomised analysis. If you were ever curious how something like the multiplicative weights update algorithm works, the techniques shown on this page are a must.


Part 1: Random Variables

For starters, let’s talk about random variables. Think of random variables as, basically, values that we care about. Now this might seem pretty abstract, so before we go into an example, think of random variables as functions that take outcomes as inputs and output values. We’ll begin with a simple example with coins.

Defining Random Variables

Let’s go back to using coins. Let’s say we had three fair and independent coins. Each coin produces either heads or tails, each with probability $1/2$, and all three coins’ outcomes do not affect each other.

We know that we have $2^3 = 8$ possible outcomes, so our sample space looks like the following:

$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$

Here comes the question: what if we wanted to count the number of heads?

Here’s how we would do it: we would make a random variable, let’s call it $X$. $X$ has to specify, for each outcome, what it will output as a number. In this case, since we want to count the number of heads, we can simply wave our hands and declare “let $X$ count the number of heads”.

From this, we know that this means, for example, that $X(HHH) = 3$, $X(HTH) = 2$, and also $X(TTT) = 0$.

So $X$ is a random variable that counts the number of heads.

Let’s try another question: what if we wanted to indicate that the coins were all heads?

Here’s how we would do it. We would make a random variable, this time let’s call it $Y$ so our names don’t collide. $Y$ is going to output value $1$ on input $HHH$, i.e., $Y(HHH) = 1$. On any other input outcome, $Y$ will return $0$. For example, $Y(HTH) = 0$, $Y(TTH) = 0$, and also $Y(TTT) = 0$.

$Y$ is a special kind of random variable that we commonly call an indicator random variable. These modest types of random variables actually do a lot of heavy lifting in computer science. You’ll see one such example later in this unit. Think of indicator random variables as indicating whether some event has occurred.

Both $X$ and $Y$ are examples of random variables.

Definition: Random variables

A random variable is a representation of some quantity which depends on one or several outcomes.

An indicator random variable is a random variable whose value is either $0$ (representing that the outcome at hand does not belong to a certain event), or $1$ (the outcome does belong to that event).
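To make the “random variables are functions” view concrete, here is a small sketch (the string encoding of outcomes and the names `X` and `Y` mirror the coin example above; they are just illustrative choices):

```python
from itertools import product

# Sample space for three coin flips: each outcome is a string like "HHT".
omega = ["".join(flips) for flips in product("HT", repeat=3)]

# X: outcome -> number of heads. A random variable is just a function on outcomes.
def X(outcome):
    return outcome.count("H")

# Y: indicator random variable for the event "all three coins are heads".
def Y(outcome):
    return 1 if outcome == "HHH" else 0

print(X("HHH"), X("HTH"), X("TTT"))  # 3 2 0
print(Y("HHH"), Y("HTH"))            # 1 0
```

Note that neither function mentions probability at all: the randomness lives in how outcomes are drawn from the sample space, not in the variable itself.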

Random Variables vs. Events

Let’s look at $X$ again. If we split the sample space up, we notice that we can “break up” or “partition” the sample space based on what value $X$ outputs:

  1. For the event $\{TTT\}$, $X$ outputs $0$.
  2. For the event $\{HTT, THT, TTH\}$, $X$ outputs $1$.
  3. For the event $\{HHT, HTH, THH\}$, $X$ outputs $2$.
  4. For the event $\{HHH\}$, $X$ outputs $3$.

So what is $\Pr[X = 2]$? Well, it is just the probability that we see either $HHT$, or $HTH$, or $THH$. In other words:

$$\Pr[X = 2] = \Pr[\{HHT, HTH, THH\}]$$

Since we assumed each coin is fair and independent, each outcome occurs with probability $1/8$. There are $3$ such outcomes, so $\Pr[X = 2] = 3/8$.

What about $Y$? This time around:

  1. For the event $\{HHH\}$, $Y$ outputs $1$.
  2. For the event $\{HHT, HTH, HTT, THH, THT, TTH, TTT\}$, $Y$ outputs $0$.

So what is $\Pr[Y = 1]$? Recall that (why?):

$$\Pr[Y = 1] = \Pr[\{\omega \in \Omega : Y(\omega) = 1\}]$$

Now if we look at $\{\omega \in \Omega : Y(\omega) = 1\}$, we might notice that this is actually just $\{HHH\}$ itself. So $\Pr[Y = 1]$ is $\Pr[\{HHH\}] = 1/8$. What about $\Pr[Y = 0]$? Well, by a similar argument, it is $7/8$.

So:

$$\Pr[Y = 1] = \frac{1}{8}, \qquad \Pr[Y = 0] = \frac{7}{8}$$
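We can check these probabilities by brute-force enumeration over the sample space (a sketch; exact arithmetic via `fractions` avoids floating-point noise):

```python
from fractions import Fraction
from itertools import product

omega = ["".join(flips) for flips in product("HT", repeat=3)]

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))

# Pr[X = 2]: outcomes with exactly two heads.
p_x2 = prob({w for w in omega if w.count("H") == 2})
# Pr[Y = 1]: the all-heads event; Pr[Y = 0] is its complement.
p_y1 = prob({w for w in omega if w == "HHH"})

print(p_x2, p_y1, 1 - p_y1)  # 3/8 1/8 7/8
```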

Functions on Random Variables

Here’s an interesting idea: can we operate on random variables? We sure can! And this idea is super useful. For example, what is $Y^2$? It is a random variable that takes an outcome as input, applies $Y$, and outputs the square of the result.

For example, we could ask something like: what is $\Pr[Y^2 = 1]$?

Let’s think about this a little bit. $Y$ can only take two possible values: $0$ or $1$. When $Y$ outputs $0$, then so does $Y^2$. When $Y$ outputs $1$, so does $Y^2$.

So $\{Y^2 = 1\}$ is the same as $\{Y = 1\}$. Similarly, $\{Y^2 = 0\}$ is the same as $\{Y = 0\}$. So $\Pr[Y^2 = 1] = \Pr[Y = 1] = 1/8$.

Another example

Instead of squaring, we can also do something like adding random variables together. Here’s an alternative example.

Let $D_1$ and $D_2$ be random variables for two separate, independent and fair $6$-sided dice. Again, $D_1$ and $D_2$ can each take values in $\{1, 2, 3, 4, 5, 6\}$.

Then technically, even something like $D_1 + D_2$ is a random variable, and we can ask something like “What is $\Pr[D_1 + D_2 = 3]$?” This is basically asking “What is the probability that the sum of the two outcomes is equal to $3$?”

We know that this means either ($D_1 = 1$ and $D_2 = 2$) or ($D_1 = 2$ and $D_2 = 1$). So:

$$\Pr[D_1 + D_2 = 3] = \frac{1}{36} + \frac{1}{36} = \frac{1}{18}$$
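The same enumeration idea works for the two dice (a sketch, again with exact fractions):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Pr[D1 + D2 = 3]: count the outcomes whose sum is 3.
hits = [(d1, d2) for (d1, d2) in outcomes if d1 + d2 == 3]
p = Fraction(len(hits), len(outcomes))

print(hits, p)  # [(1, 2), (2, 1)] 1/18
```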


Part 2: Probability Distributions

Now that we’ve talked about random variables, the next thing is to finally talk about probability itself. Think of a probability distribution as the function that assigns a probability to each value a random variable can take. Intuitively, you can think of this as a chart that basically says, for each value, what its probability is.

Example 1: the distribution of $X$, the number of heads among our three coins:

$$\Pr[X = 0] = \frac{1}{8}, \quad \Pr[X = 1] = \frac{3}{8}, \quad \Pr[X = 2] = \frac{3}{8}, \quad \Pr[X = 3] = \frac{1}{8}$$

Example 2: the distribution of the indicator $Y$:

$$\Pr[Y = 0] = \frac{7}{8}, \quad \Pr[Y = 1] = \frac{1}{8}$$

Example 3: the distribution of a single fair $6$-sided die $D$:

$$\Pr[D = 1] = \Pr[D = 2] = \dots = \Pr[D = 6] = \frac{1}{6}$$

Again, think of a distribution as basically saying “for this value, we assign this probability”.

We will look at some very common discrete probability distributions in CS:

  1. Bernoulli
  2. Geometric
  3. Binomial
  4. Uniform

Bernoulli Distribution

So let’s begin with the Bernoulli distribution. This is the distribution for indicator random variables. Again, recall that since indicator random variables only take values $0$ or $1$, the Bernoulli distribution has to assign a probability $p$ for when the variable is $1$, and consequently, this means the variable is $0$ with probability $1 - p$. Think of $p$ as the parameter of the distribution. This single value determines the entire distribution.

Example:

Let’s say we roll a fair die with $6$ faces, where each face is produced with probability $1/6$.

Let $X$ be the random variable that indicates if the die turns up with a number that is at least $5$, i.e., if we see a value of $5$ or $6$.

Then we can say that $\Pr[X = 1] = \Pr[\{5, 6\}] = 2/6 = 1/3$. So $X$ follows a Bernoulli distribution with parameter $1/3$.

Definition: Bernoulli distribution

An indicator random variable $X$ follows a Bernoulli distribution with parameter $p$ if the following hold true:

$$\Pr[X = 1] = p, \qquad \Pr[X = 0] = 1 - p$$

Geometric Distribution

Let’s build off the Bernoulli distribution and make use of it for something else. Given a random variable $X$ that follows a Bernoulli distribution with parameter $p$, let $T$ be the number of times we need to try before $X = 1$.

Example:

As a concrete example, this is as if we are making coin flips, the coin has probability $p$ of returning heads, and we are asking “How many times do we need to flip before we see heads?” Here, we are going to assume that every flip of the coin is independent of its previous outcomes.

To be clear, if $T$ is the random variable that outputs the number of times we need to flip, then $T$ follows the geometric distribution.

So what is the probability that $T = 1$? Well, that happens when the very first flip is heads, which happens with probability $p$.

What about the probability that $T = 2$? Well, that happens with probability $(1 - p) \cdot p$: one tail, then one head.

What about the probability that $T = k$? Do you see the pattern? We must have flipped $k - 1$ many tails in a row, then flipped a heads. So the probability is $(1 - p)^{k - 1} \cdot p$.

So in general, if we had a random variable $T$ that followed the geometric distribution with parameter $p$, then $\Pr[T = k] = (1 - p)^{k - 1} p$.

Definition: Geometric distribution

A random variable $T$ follows a geometric distribution with parameter $p$ if, for every integer $k \ge 1$:

$$\Pr[T = k] = (1 - p)^{k - 1} p$$
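The pattern above is easy to sketch in code (the helper name `geometric_pmf` is just an illustrative choice; here with a fair coin, $p = 1/2$):

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """Pr[T = k]: k - 1 tails in a row, then one heads."""
    return (1 - p) ** (k - 1) * p

p = Fraction(1, 2)
# Halves each time: 1/2, 1/4, 1/8, ...
print([geometric_pmf(k, p) for k in (1, 2, 3)])
```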

Binomial Distribution

Let’s build off the Bernoulli distribution again, but instead ask a different question. What if we instead had $n$ independent copies $X_1, \dots, X_n$ of $X$ (each following a Bernoulli distribution with parameter $p$), and we took $n$ trials, and asked “How many copies out of the $n$ trials returned $1$?”

Example:

Let’s say we had $3$ coins, and each coin returns heads with probability $p$. Let $B$ be the random variable that counts the number of heads. Then again, what’s the probability that $B = 2$?

Well, one way we could do this is to manually count this. So we know that there are $3$ possible outcomes we need: $HHT$, $HTH$, $THH$. We know that a heads happens with probability $p$, and a tails happens with probability $1 - p$. So:

$$\Pr[B = 2] = 3 \cdot p^2 (1 - p)$$

But what about in general? What if we had more coins than $3$? Manual counting gets very cumbersome. Let’s try to be smarter with how we count.

Of the $n$ coins, we choose $k$ of them to be heads, so the rest must be tails. So there are $\binom{n}{k}$ possible outcomes. For each outcome, the probability it occurs is $p^k (1 - p)^{n - k}$.

So in general, the probability is actually $\binom{n}{k} p^k (1 - p)^{n - k}$.

To be clear, the binomial distribution takes two parameters: $n$, the number of trials, and $p$, the probability of success of each independent trial.

Definition: Binomial distribution

A random variable $B$ follows a binomial distribution with parameters $n$ and $p$ (denoted $B \sim \mathrm{Bin}(n, p)$) if, for every integer $0 \le k \le n$:

$$\Pr[B = k] = \binom{n}{k} p^k (1 - p)^{n - k}$$
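A quick sketch of the counting argument above, using `math.comb` for the binomial coefficient and checking it against the manual 3-coin count:

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Pr[B = k]: choose which k trials succeed, times the probability of one such outcome."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p = Fraction(1, 2)
# The 3-coin example: Pr[B = 2] = 3 * p^2 * (1 - p).
print(binomial_pmf(2, 3, p))                         # 3/8
# Sanity check: the probabilities over all values of k sum to 1.
print(sum(binomial_pmf(k, 3, p) for k in range(4)))  # 1
```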

Uniform Distribution

The last distribution is the uniform distribution. This is the one we have been playing with the most. In general, we have a set of $n$ values, $\{v_1, v_2, \dots, v_n\}$. Each value is picked with probability $1/n$.

If we let $U$ be the random variable that outputs any of the $n$ values uniformly at random, then $U$ has the uniform distribution.

Definition: Uniform distribution

A random variable $U$ follows a uniform distribution over $n$ values $\{v_1, \dots, v_n\}$ if, for every $i$:

$$\Pr[U = v_i] = \frac{1}{n}$$


Part 3: Expectation and Variance

Expectation

Now that we have seen random variables and distributions, here’s a key question:

If we ran an experiment where we had a random variable $X$, and we took many independent samples, then output the average value, what should we hope/expect to see?

It turns out, the answer is the expectation $\mathbb{E}[X]$:

Definition: Expectation

The expectation of a random variable $X$ is defined by:

$$\mathbb{E}[X] = \sum_{x} x \cdot \Pr[X = x]$$

Here’s the intuition: this is the value we “expect” to see from the random variable.

Example 1: A fair die

For example, if we roll a $6$-sided fair die $D$, what is $\mathbb{E}[D]$? Based on our formula, this happens to be:

$$\mathbb{E}[D] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6}$$

which evaluates to $3.5$.

Think of it this way: if we rolled this die many, many, many times and took the average value, it should be close to $3.5$.

To be clear, let’s say we rolled the die for $n$ rounds. Letting $W_1$ be the value you roll in round 1, $W_2$ be the value you roll in round 2, so on and so forth… Then the “average” value you roll would be

$$\frac{W_1 + W_2 + \dots + W_n}{n}$$

and as $n \to \infty$, the above value would approach $3.5$.
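This “long-run average” intuition is easy to test empirically. Below is a small simulation sketch (the seed and the number of rolls are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so the experiment is repeatable

# Roll a fair six-sided die many times and take the average.
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]
average = sum(rolls) / n

# By the reasoning above, the average should hover around 3.5.
print(round(average, 2))
```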

A betting game:

So consider a game where you’d pay $3$ dollars to participate. And in the game, we roll a fair die, and whatever the rolled value is, you win that many dollars. E.g. if the die rolls to a $4$, you win $4$ dollars, gaining a $1$ dollar profit. Should you play?

(Just to be clear: if you played 5 rounds, and rolled $1$, $2$, $3$, $1$, $2$, then you paid $15$ dollars and won $9$ dollars, so you suffered a loss of $6$ dollars, or a profit of $-6$ dollars.)

Well, if you played $n$ rounds, you’d pay $3n$ dollars. You’d also expect to win around $3.5n$ dollars. I.e. expect around $0.5$ dollars of profit per round. Why? Because your expected winning across $n$ rounds is

$$\mathbb{E}[W_1 + W_2 + \dots + W_n] = 3.5n$$

But you also paid $3n$ dollars to play all $n$ rounds. So your expected profit would be “winnings − paid amount”, i.e.

$$3.5n - 3n = 0.5n \text{ dollars}$$

Of course the game is randomised, so it wouldn’t be exact. But as $n$ gets large, you should see this behaviour.

Example 2: A weighted die

Okay, let’s say we had a 6-sided die $D'$ that is not fair. So something like:

$$\Pr[D' = 1] = \Pr[D' = 2] = \Pr[D' = 3] = \Pr[D' = 4] = \Pr[D' = 5] = \frac{1}{10}, \qquad \Pr[D' = 6] = \frac{1}{2}$$

Well, we can apply the same type of analysis:

$$\mathbb{E}[D'] = (1 + 2 + 3 + 4 + 5) \cdot \frac{1}{10} + 6 \cdot \frac{1}{2} = \frac{15}{10} + 3 = 4.5$$

Which is $4.5$, just over the $3.5$ of the fair die.

The same betting game:

In which case, if you were asked again: would you pay $3$ dollars to participate in the same betting game with this weighted die? Should you? I would!

In $n$ rounds, I would expect to make an expected profit of $4.5n - 3n = 1.5n$ dollars now instead. Which is actually even better than before. (The computation works the same as in the previous example’s case.)

Example 3: A Bernoulli Distributed Random Variable

If we have a random variable $X$ that has a Bernoulli distribution with parameter $p$, what is $\mathbb{E}[X]$? Again, based on our formula, this happens to be:

$$\mathbb{E}[X] = 1 \cdot \Pr[X = 1] + 0 \cdot \Pr[X = 0] = 1 \cdot p + 0 \cdot (1 - p) = p$$

This looks quite surprising: the expected value is exactly the probability. But this is actually a very useful fact. We use this quite often in computer science!

Example 4: Payoff functions

Let’s say we have a contract that with probability $p$ will pay us $100$ dollars, and with probability $1 - p$ will pay us nothing ($0$ dollars). What is the expected payoff?

Notice here that if we first let $X$ be a Bernoulli-distributed indicator random variable with parameter $p$, where $X = 1$ when we get paid, then our payoff is given as:

$$100 \cdot X$$

So it boils down to asking: what is $\mathbb{E}[100X]$? Since $X$ takes values either $0$ or $1$, then $100X$ takes values either $0$ or $100$. So

$$\mathbb{E}[100X] = 100 \cdot p + 0 \cdot (1 - p) = 100p$$

Properties of Expectation

The last example actually was a teaser into some nice properties of expectation. We won’t prove them in this course, so you can take these as fact (though they are provable).

  1. If $c$ is a constant, then $\mathbb{E}[c] = c$.
  2. For a constant $c$ and a random variable $X$, $\mathbb{E}[cX] = c \cdot \mathbb{E}[X]$.
  3. For any random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$. This is known as linearity of expectation.

As a warning, we cannot generally say that $\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$. This is true when $X$ and $Y$ are independent, but otherwise, we have to be careful.
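This warning is easy to see concretely. In the sketch below, both “copies” in the product are the same fair die, so they are as dependent as possible: linearity still holds for the sum, but the expectation of the product is not the product of expectations:

```python
from fractions import Fraction

# Distribution of a fair six-sided die.
faces = range(1, 7)
pr = Fraction(1, 6)

E_D = sum(d * pr for d in faces)          # E[D]
E_sum = sum((d + d) * pr for d in faces)  # E[D + D]: linearity holds even though both copies are the same die
E_prod = sum((d * d) * pr for d in faces) # E[D * D], where both factors are the SAME (fully dependent) die

print(E_D, E_sum, E_prod, E_D * E_D)  # 7/2 7 91/6 49/4
```

Here $\mathbb{E}[D \cdot D] = 91/6$, while $\mathbb{E}[D] \cdot \mathbb{E}[D] = 49/4$: not equal, because the two factors are not independent.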

Example 5: Expectation of binomial distributions

Let’s go through yet another example. This time we will be asking: what is the expected value of a binomially distributed random variable $B$ with parameters $n$ and $p$?

If we faithfully followed the formula, we have that:

$$\mathbb{E}[B] = \sum_{k = 0}^{n} k \cdot \binom{n}{k} p^k (1 - p)^{n - k}$$

Except, that looks awfully complicated to analyse! So we’re going to pull out a very neat trick, and have our Bernoulli random variables do a lot of heavy lifting for us.

We are going to let $X_i$ be an indicator random variable with parameter $p$, representing whether the $i$-th trial was a success or not. Then:

$$B = X_1 + X_2 + \dots + X_n$$

Why do we want to do this though? Here’s the idea: using the properties of expectation, we know that:

$$\mathbb{E}[B] = \mathbb{E}[X_1 + X_2 + \dots + X_n] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \dots + \mathbb{E}[X_n]$$

Remember, $\mathbb{E}[X_i] = p$, because $X_i$ is an indicator random variable with probability $p$.

So putting this together, for general values of $n$ and $p$, the math becomes:

$$\mathbb{E}[B] = \sum_{i = 1}^{n} \mathbb{E}[X_i] = np$$
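We can sanity-check $\mathbb{E}[B] = np$ against the original, complicated-looking formula for some small hypothetical parameters, say $n = 5$ and $p = 1/3$:

```python
from fractions import Fraction
from math import comb

n, p = 5, Fraction(1, 3)

# Expectation straight from the definition: sum over k of k * Pr[B = k].
expectation = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

print(expectation, n * p)  # 5/3 5/3
```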

Example 6: Expectation of geometric distributions

Since we’ve covered the Bernoulli and binomial distributions, for the sake of completeness, let’s do the geometric distribution as well. Let $T$ be a geometrically distributed random variable with parameter $p$. The math for this one is a little more involved, so let’s jump straight into it. Again, by our definition of expectation, we have that:

$$\mathbb{E}[T] = \sum_{k = 1}^{\infty} k \cdot (1 - p)^{k - 1} p$$

Now this is pretty hard to resolve, so let’s work through some magic. First of all, write $q = 1 - p$ and let:

$$S = \sum_{k = 1}^{\infty} k q^{k - 1}$$

so that $\mathbb{E}[T] = pS$.

So what is $S - qS$? Let me lay it out term by term:

$$S = 1 + 2q + 3q^2 + 4q^3 + \dots$$
$$qS = q + 2q^2 + 3q^3 + \dots$$

If you notice, we’re grouping terms based on their power of $q$. What happens if we subtract them this way? Then:

$$S - qS = 1 + q + q^2 + q^3 + \dots$$

And the last series is actually a geometric series! Re-writing this, we get:

$$S - qS = \frac{1}{1 - q} = \frac{1}{p}$$

Okay, that was weird. Let’s also resolve the left hand side:

$$S - qS = (1 - q)S = pS = \mathbb{E}[T]$$

So that gives us our expectation, $\mathbb{E}[T] = 1/p$, which hopefully is quite intuitive. If we have a coin that returns heads with probability $p$, we would expect to flip it $1/p$ times before we see a heads.
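We can also sanity-check $\mathbb{E}[T] = 1/p$ numerically by truncating the infinite sum (the choice $p = 0.25$ and the cutoff of 200 terms are arbitrary; the neglected tail is vanishingly small):

```python
# Partial sums of sum over k of k * (1-p)^(k-1) * p approach 1/p.
p = 0.25

partial = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 201))

print(round(partial, 6))  # 4.0, i.e. 1/p
```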

Variance

So expectation was nice and all, and it tells us what the random variable “averages” around, but it doesn’t tell us how spread apart the values are. For that, we need variance.

Intuitively, variance is a measure of how much the random variable can vary.

Formally, it is defined as:

Definition: Variance

The variance of a random variable $X$ is defined by:

$$\mathrm{Var}[X] = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]$$

The first form is not that useful, so usually we use the second form (whose proof has been provided below):

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$

Again, friendly reminder that $\mathbb{E}[X]^2$ means $(\mathbb{E}[X])^2$, and this is, in general, not equal to $\mathbb{E}[X^2]$.

Example 1: Variance of a Bernoulli-distributed random variable

For example, given a Bernoulli distributed random variable $X$ with probability $p$, what is $\mathrm{Var}[X]$? We know that $\mathbb{E}[X] = p$, so we know that $\mathbb{E}[X]^2 = p^2$. But what is $\mathbb{E}[X^2]$?

So we’ve done this before! $X$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$. So similarly, $X^2$ can only take value $1$ with probability $p$ or $0$ with probability $1 - p$.

So again:

$$\mathbb{E}[X^2] = 1 \cdot p + 0 \cdot (1 - p) = p$$

So, putting the two together:

$$\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1 - p)$$

Example 2: Variance of a binomially distributed random variable

As another example, what about the variance of a binomially distributed random variable $B$ with $n$ trials, and probability $p$? Again, let’s fall back on the neat trick that I mentioned: let $X_i$ be an indicator that indicates whether the $i$-th trial was a success. In which case:

$$B = X_1 + X_2 + \dots + X_n$$

So now:

$$B^2 = \left( \sum_{i = 1}^{n} X_i \right)^2 = \sum_{i = 1}^{n} \sum_{j = 1}^{n} X_i X_j$$

If this is not obvious, think about how $(a + b)^2$ can be written as $(a + b)(a + b) = a^2 + ab + ba + b^2$.

Now again, we want $\mathbb{E}[B^2]$, so:

$$\mathbb{E}[B^2] = \sum_{i = 1}^{n} \sum_{j = 1}^{n} \mathbb{E}[X_i X_j] = \sum_{i = j} \mathbb{E}[X_i X_j] + \sum_{i \neq j} \mathbb{E}[X_i X_j]$$

Now let’s look at what’s going on in each sum separately. The first sum sums over when $i = j$, so this is just the same as $\sum_{i} \mathbb{E}[X_i^2]$. As before, we know that since $X_i$ is an indicator random variable, $\mathbb{E}[X_i^2] = p$. So:

$$\sum_{i = j} \mathbb{E}[X_i X_j] = \sum_{i = 1}^{n} \mathbb{E}[X_i^2] = np$$

What about when $i \neq j$? Note that $X_i$ and $X_j$ both only output either $0$ or $1$. Then, $X_i X_j$ is $1$ only when both $X_i$ and $X_j$ are $1$; otherwise, it is $0$. So now:

$$\mathbb{E}[X_i X_j] = \Pr[X_i = 1 \text{ and } X_j = 1]$$

Since $X_i$ and $X_j$ are independent, we know that $\Pr[X_i = 1 \text{ and } X_j = 1] = \Pr[X_i = 1] \cdot \Pr[X_j = 1] = p^2$. There are $n(n - 1)$ ordered pairs with $i \neq j$, so putting this back into the sum:

$$\sum_{i \neq j} \mathbb{E}[X_i X_j] = n(n - 1) p^2$$

So finally, putting this back in:

$$\mathrm{Var}[B] = \mathbb{E}[B^2] - \mathbb{E}[B]^2 = np + n(n - 1)p^2 - (np)^2 = np - np^2 = np(1 - p)$$
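As with expectation, we can sanity-check $\mathrm{Var}[B] = np(1 - p)$ against the definition $\mathbb{E}[B^2] - \mathbb{E}[B]^2$ for small hypothetical parameters, say $n = 5$ and $p = 1/3$:

```python
from fractions import Fraction
from math import comb

n, p = 5, Fraction(1, 3)

def pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

E_B  = sum(k * pmf(k) for k in range(n + 1))
E_B2 = sum(k * k * pmf(k) for k in range(n + 1))
variance = E_B2 - E_B**2

print(variance, n * p * (1 - p))  # 10/9 10/9
```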

Summary

So we have that:

  1. Bernoulli-distributed random variables: expectation $p$, variance $p(1 - p)$
  2. Binomially distributed random variables: expectation $np$, variance $np(1 - p)$
  3. Geometrically distributed random variables: expectation $1/p$, variance $(1 - p)/p^2$

We will skip the proof for geometric random variables because it involves using some amount of calculus.

Properties of Variance

  1. If $X$ and $Y$ are independent random variables, then $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Part 4: Bounds

So we’ve worked quite hard to figure out what the expectation and variance are for random variables. But why? What’s so important about these things?

In computer science, we often have “bad” events that we want to avoid. For example, long running times in Las Vegas algorithms, errors in classification, hashing collisions, and so on. Anytime there is any amount of randomness, we will have to somehow argue that bad events don’t happen too often. Hopefully you’ll see what I mean beyond this course, when you finally use these ideas.

So to do so, we commonly use Markov and Chebyshev bounds! These bounds are great if we are happy with a good enough, one-sided upper bound on the probability. Typically we will be finding the probabilities of bad events and saying they don’t occur too often. So in CS at least, these are great.

Markov Bound

Definition: Markov bound

If $X$ is a non-negative random variable, and $a > 0$, then:

$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$

For example, let $B$ be a binomially distributed random variable with $n$ trials and success probability $p$. We can say something like:

$$\Pr[B \ge a] = \sum_{k \ge a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But this is hard to analyse, and is not even in a closed form. What if we could sacrifice some amount of precision for an easier bound to work with? So if we instead applied Markov’s bound, we have:

$$\Pr[B \ge a] \le \frac{\mathbb{E}[B]}{a} = \frac{np}{a}$$

So for something like $a = 2np$, this works out to be $\Pr[B \ge 2np] \le 1/2$. See how simple that was? Sometimes an imprecise answer is good enough. The Markov bound is one such way to get a “good enough” imprecise answer.
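To see how loose (but convenient) Markov’s bound is, we can compare it against the exact tail probability for some hypothetical parameters, say $n = 100$ and $p = 1/2$:

```python
from math import comb

n, p = 100, 0.5
a = 75  # threshold; here E[B] = np = 50

# Exact tail probability Pr[B >= a], summing the binomial pmf.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

# Markov: Pr[B >= a] <= E[B] / a.
markov = (n * p) / a

print(f"exact={exact:.3e}  markov={markov:.3f}")
```

The exact tail is tiny, while Markov only promises $2/3$: a very loose bound, but one we got with no work at all.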

Chebyshev Bound

Definition: Chebyshev bound

If $X$ is a random variable and $a > 0$, then:

$$\Pr\left[ \left| X - \mathbb{E}[X] \right| \ge a \right] \le \frac{\mathrm{Var}[X]}{a^2}$$

For example, let $B$ be a binomially distributed random variable with $n$ trials and success probability $p$. We can say something like:

$$\Pr\left[ \left| B - np \right| \ge a \right] = \sum_{k : |k - np| \ge a} \binom{n}{k} p^k (1 - p)^{n - k}$$

But again, this is a lot simpler if we could use the Chebyshev bound to say something like (assuming we are happy with a good enough, one-sided bound):

$$\Pr\left[ \left| B - np \right| \ge a \right] \le \frac{\mathrm{Var}[B]}{a^2} = \frac{np(1 - p)}{a^2}$$
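And the same comparison works for Chebyshev (same hypothetical parameters $n = 100$, $p = 1/2$; the deviation $a = 15$ is an arbitrary choice):

```python
from math import comb

n, p = 100, 0.5
a = 15  # deviation from the mean E[B] = 50

def pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exact probability of deviating by at least a from the mean.
mean = n * p
exact = sum(pmf(k) for k in range(n + 1) if abs(k - mean) >= a)

# Chebyshev: Pr[|B - np| >= a] <= Var[B] / a^2 = np(1-p) / a^2.
chebyshev = (n * p * (1 - p)) / a**2

print(f"exact={exact:.4f}  chebyshev={chebyshev:.4f}")
```

Again the bound is loose, but it needed nothing beyond the variance we computed earlier, which is exactly why these bounds are so convenient in randomised analysis.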