Tricki
a repository of mathematical know-how

Revision of Bounding probabilities by expectations from Sat, 28/03/2009 - 21:01

Quick description

If X is a random variable that takes non-negative values, then the probability that X\geq t is at most \mathbb{E}X/t. This simple result is called Markov's inequality. Clever uses of Markov's inequality can have surprisingly powerful consequences. The main trick is to choose an increasing function F and apply Markov's inequality to the random variable F(X) instead. If F is rapidly growing but \mathbb{E}F(X) is not too large, then one can prove that the probability that X\geq t, which equals the probability that F(X)\geq F(t), is very small.

Prerequisites

A familiarity with basic concepts in probability such as random variables and expectation.

Example 1

Let us begin with a proof of Markov's inequality itself. By the law of total probability we know that

\mathbb{E}X=\mathbb{E}[X|X\geq t]\mathbb{P}[X\geq t]+\mathbb{E}[X|X<t]\mathbb{P}[X<t].

Since X\geq 0 (by hypothesis), this is at least \mathbb{E}[X|X\geq t]\mathbb{P}[X\geq t], and from this the result follows.

General discussion

Comments

a suggestion for improvement

Here is a point which I feel needs to be emphasized more in Example 3.

If we try to make the argument work without leaving \lambda to be optimized later, but by, say, initially setting \lambda=1, the argument will fail: we will get a useless bound. The reason: we need to pick a function F which will grow rapidly while E F(X) will grow slowly. In this case, E e^{X_1 + \ldots X_n} grows pretty fast with n.

What kind of functions F work for this? Well, F(X)=X^2 works: out of the n^2 terms in (X_1 + \cdots + X_n)^2, almost all have expectation 0. Similarly, F(X)=X^{2k} works fine as well: in the sum (X_1 + \cdots + X_n)^{2k}, we will be able to easily cancel many terms because they have expectation 0.

The central insight behind the proof in Example 3 is that picking F(x)=(1+t)^x for really, really small t works. The reason: the average of 1+t and 1/(1+t) is really close to 1. That is: near one, inversion is about the same as subtraction. This insight can be pushed a little further to argue that if X has mean 0, the quantity E (1+t)^X should be pretty close to 1. Thus one might hope that E[(1+t)^{X_1 + \cdots X_n}] = (E[(1+t)^{X_1}])^n grows slowly - after all, it is "approximately" 1^n = 1.