Tricki
a repository of mathematical know-how

Revision of Bounding probabilities by expectations from Sat, 28/03/2009 - 22:54

Quick description

If X is a random variable that takes non-negative values, then the probability that X\geq t is at most \mathbb{E}X/t. This simple result is called Markov's inequality. Clever uses of Markov's inequality can have surprisingly powerful consequences. The main trick is to choose an increasing function F and apply Markov's inequality to the random variable F(X) instead. If F is rapidly growing but \mathbb{E}F(X) is not too large, then one can prove that the probability that X\geq t, which equals the probability that F(X)\geq F(t), is very small.

Prerequisites

A familiarity with basic concepts in probability such as random variables and expectation.

Example 1

Let us begin with a proof of Markov's inequality itself. By the law of total probability we know that

\mathbb{E}X=\mathbb{E}[X|X\geq t]\mathbb{P}[X\geq t]+\mathbb{E}[X|X<t]\mathbb{P}[X<t].

Since X\geq 0 (by hypothesis), this is at least \mathbb{E}[X|X\geq t]\mathbb{P}[X\geq t], which is obviously at least t\mathbb{P}[X\geq t], and from this it follows that \mathbb{P}[X\geq t]\leq\mathbb{E}X/t, as we wanted.

Example 2

It is frequently useful in probability theory (and also in probabilistic combinatorics) to prove that a random variable is close to its mean most of the time. Of course, this may not always be true, which is why we invent concepts such as variance to give us some measure of how concentrated a random variable is about its mean. A precise relationship between the variance of a random variable X and the concentration of X about its mean \mu is given by Chebyshev's inequality, which states that if t>0, then \mathbb{P}[|X-\mu|\geq t]\leq\mathrm{Var}X/t^2.

To prove this, we first recall the definition of the variance of X, which is \mathbb{E}(X-\mu)^2. This instantly suggests a function that we can plug into Markov's inequality: instead of looking at the random variable X we shall look at (X-\mu)^2. (The function x\mapsto(x-\mu)^2 is not increasing, but this is all right because we are actually interested in the random variable |X-\mu|.)

The random variable (X-\mu)^2 is non-negative, so by Markov's inequality we find that

\mathbb{P}[(X-\mu)^2\geq t^2]\leq\mathbb{E}(X-\mu)^2/t^2=\mathrm{Var}X/t^2,

and we are done, since \mathbb{P}[(X-\mu)^2\geq t^2]=\mathbb{P}[|X-\mu|\geq t].

General discussion

Suppose that we have a random variable X with mean \mu such that \mathbb{E}|X-\mu|=r and \mathrm{Var}X=\mathbb{E}(X-\mu)^2=s^2. Then Markov's inequality implies that \mathbb{P}|X-\mu|\geq t]\leq r/t, while Chebyshev's inequality implies that \mathbb{P}|X-\mu|\geq t]\leq s^2/t^2. Since all probabilities are at most 1, Markov's inequality tells us nothing unless t\geq r, and Chebyshev's inequality tells us nothing unless t\geq s. Since \mathbb{E}(X-\mu)\geq(\mathbb{E}|X-\mu|)^2, this tells us that Markov's inequality will usually give us some information for smaller values than Chebyshev's inequality. However, if \mathbb{E}(X-\mu) and (\mathbb{E}|X-\mu|)^2 are reasonably comparable, then Chebyshev's inequality is much stronger for larger values of t. This illustrates a general principle: a good bound on \mathbb{E}F(X) for a quickly growing positive function F gives us a good upper bound for the probability that X is large. In the next example, we shall see a very strong result of this kind.

Example 3

Let X_1,X_2,\dots,X_n be independent random variables, each of mean 0 and each taking values in the interval [-1,1]. What is the probability that |X_1+\dots+X_n|\geq t?

To get a preliminary feel for this question, let us use the fact that variances of independent events add together. Each X_i obviously has variance at most 1, since it cannot differ from its mean by more than 1. Therefore, the variance of |X_1+X_2+\dots+X_n| is at most n, so by Chebyshev's inequality the probability that |X_1+\dots+X_n|\geq t is at most n/t^2. This tells us that once t gets past \sqrt{n}, the probability decays fairly rapidly. However, we shall now see that it decays far more rapidly than Chebyshev's inequality implies.

To be continued.

Comments

a suggestion for improvement

Here is a point which I feel needs to be emphasized more in Example 3.

If we try to make the argument work without leaving \lambda to be optimized later, but by, say, initially setting \lambda=1, the argument will fail: we will get a useless bound. The reason: we need to pick a function F which will grow rapidly while E F(X) will grow slowly. In this case, E e^{X_1 + \ldots X_n} grows pretty fast with n.

What kind of functions F work for this? Well, F(X)=X^2 works: out of the n^2 terms in (X_1 + \cdots + X_n)^2, almost all have expectation 0. Similarly, F(X)=X^{2k} works fine as well: in the sum (X_1 + \cdots + X_n)^{2k}, we will be able to easily cancel many terms because they have expectation 0.

The central insight behind the proof in Example 3 is that picking F(x)=(1+t)^x for really, really small t works. The reason: the average of 1+t and 1/(1+t) is really close to 1. That is: near one, inversion is about the same as subtraction. This insight can be pushed a little further to argue that if X has mean 0, the quantity E (1+t)^X should be pretty close to 1. Thus one might hope that E[(1+t)^{X_1 + \cdots X_n}] = (E[(1+t)^{X_1}])^n grows slowly - after all, it is "approximately" 1^n = 1.