Tricki
a repository of mathematical know-how

How to use martingales

Quick description

An important general principle in probability is that a sum of bounded independent random variables, each of mean 0, has a very small probability of having a large modulus, where "large" means substantially larger than its standard deviation. For example, if you toss a coin 1000 times, then the probability that the number of heads does not lie between 400 and 600 is tiny. This principle can be proved rigorously – see Example 3 of bounding probabilities by expectations. Also, and this is the main theme of this article, the assumption of independence can be weakened in a way that makes the principle much more widely applicable.
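
For readers who like to see numbers, here is a minimal simulation sketch of the coin-tossing claim (the figures of 1000 tosses and 10,000 trials, and the function name count_heads, are illustrative choices of ours, not part of the article's argument):

    import random

    def count_heads(n_tosses):
        # number of heads in n_tosses independent fair coin tosses
        return sum(random.randint(0, 1) for _ in range(n_tosses))

    trials = 10_000
    bad = sum(1 for _ in range(trials) if not 400 <= count_heads(1000) <= 600)
    # The true probability of landing outside [400, 600] is of order 10^{-10},
    # so this essentially always prints 0.
    print("fraction outside [400, 600]:", bad / trials)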

Prerequisites

Basic concepts of probability, graph theory, metric spaces.

Contributions wanted. This article could use additional contributions. More examples wanted.

Example 1

Let S_n be the symmetric group on n elements. There is a natural graph with S_n as its vertex set, where two permutations \rho and \sigma are joined if and only if \rho\sigma^{-1} is a transposition, or equivalently \rho=\tau\sigma for some transposition \tau. This graph gives us a natural metric on S_n: the distance between two permutations \rho and \sigma is the smallest k such that we can find k transpositions \tau_1,\dots,\tau_k with \rho=\tau_1\dots\tau_k\sigma.

Now let \phi:S_n\rightarrow\mathbb{R} be a function that is 1-Lipschitz with respect to this distance. This is equivalent to the assumption that |\phi(\rho)-\phi(\sigma)|\leq 1 whenever there is a transposition \tau such that \rho=\tau\sigma. (An obvious example of such a function is the function \phi where \phi(\rho) is defined to be the smallest k such that \rho is a product of k transpositions.) Since every permutation can be written as a product of at most n-1 transpositions, and an n-cycle needs exactly n-1 transpositions, the diameter of S_n in this metric is n-1, so |\phi(\rho)-\phi(\sigma)| cannot exceed n-1. We shall show that \phi is "highly concentrated", in the sense that for almost all pairs (\rho,\sigma) the difference |\phi(\rho)-\phi(\sigma)| is far smaller than n: it is rarely much more than \sqrt{n}.
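
As a quick aside, not needed for what follows: the distance can be computed explicitly, since it is a standard fact that the distance between \rho and \sigma equals n minus the number of cycles of \rho\sigma^{-1}; in particular the example function above is \phi(\rho)=n minus the number of cycles of \rho. Here is a minimal Python sketch of this, together with a spot check of the 1-Lipschitz property (all names and parameters are our own illustrative choices):

    import random

    def num_cycles(p):                       # p is a permutation of 0..n-1 as a list/tuple
        seen, cycles = set(), 0
        for i in range(len(p)):
            if i not in seen:
                cycles += 1
                j = i
                while j not in seen:
                    seen.add(j)
                    j = p[j]
        return cycles

    def phi(p):                              # least k with p a product of k transpositions
        return len(p) - num_cycles(p)

    def distance(rho, sigma):                # transposition metric on S_n
        sigma_inv = [0] * len(sigma)
        for i, v in enumerate(sigma):
            sigma_inv[v] = i
        return phi([rho[sigma_inv[i]] for i in range(len(rho))])   # phi(rho o sigma^{-1})

    # Spot check of the 1-Lipschitz property: multiplying by a transposition
    # merges or splits a cycle, so phi changes by exactly 1.
    n = 8
    for _ in range(1000):
        rho = list(range(n)); random.shuffle(rho)
        s, t = random.sample(range(n), 2)
        tau_rho = [t if v == s else s if v == t else v for v in rho]   # (s t) o rho
        assert abs(phi(rho) - phi(tau_rho)) == 1
    print("1-Lipschitz spot check passed")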

General discussion

Before we get on to the proof, we need to state Azuma's inequality, and before we do that we need to say what a martingale is. The latter is simple. Let Z_0,Z_1,\dots,Z_n be a sequence of random variables. They are a martingale if for every k the expectation of Z_k given the values of Z_0,Z_1,\dots,Z_{k-1} is Z_{k-1}.

It is perhaps clearer to express this in terms of the martingale difference sequence X_1,\dots,X_n, where X_k=Z_k-Z_{k-1} for each k. Then the martingale condition is that the expectation of X_k given the values of Z_0 and X_1,\dots,X_{k-1} is always 0. An obvious example is when the X_i are independent random variables with mean 0, but as we shall see there is an easy way of generating many more examples than this.
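
To illustrate this with a toy example of our own (not taken from the article), here is a martingale difference sequence whose terms are not independent: each step is a fair random sign whose magnitude depends on the history, but whose conditional mean is still 0.

    import random

    def sample_path(n):
        Z, path = 0.0, [0.0]
        for k in range(1, n + 1):
            size = 1.0 if Z >= 0 else 0.5          # step size depends on the past...
            X = random.choice([-size, size])       # ...but is still mean 0 given the past
            Z += X
            path.append(Z)
        return path

    # E[X_k | past] = 0 at every step, so Z_0, ..., Z_n is a martingale,
    # even though the X_k are not independent of one another.
    print(sample_path(10))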

Suppose that each variable X_i takes values in [-1,1]. Then Azuma's inequality states that the probability that |X_1+\dots+X_n|\geq t is at most 2\exp(-t^2/4n). More generally, if each X_i takes values in [-\lambda_i,\lambda_i], then the probability that |X_1+\dots+X_n|\geq t is at most 2\exp(-t^2/4\sum_i\lambda_i^2).
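
As a rough numerical illustration (the values of n, t, and the number of trials are arbitrary choices of ours), one can compare the bound with the empirical tail in the simplest case of independent \pm 1 signs; the bound is far from tight, but it does capture the exponential decay in t^2/n.

    import math
    import random

    n, t, trials = 1000, 100, 20_000
    hits = 0
    for _ in range(trials):
        s = sum(random.choice([-1, 1]) for _ in range(n))   # X_1 + ... + X_n
        if abs(s) >= t:
            hits += 1

    print("empirical tail:", hits / trials)                  # typically around 0.0015
    print("Azuma bound  :", 2 * math.exp(-t * t / (4 * n))) # about 0.164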

Now let us return to our example, and see how we can define a martingale in a very natural way.

Example 1, continued

Recall that we have a function \phi defined on S_n, with the property that if \rho and \sigma are permutations such that \rho\sigma^{-1} is a transposition, then |\phi(\rho)-\phi(\sigma)|\leq 1. In order to define a martingale difference sequence, we choose a random permutation \rho and "reveal information about \rho little by little".

What does this mean? Well, we look at the values of \rho(1), \rho(2), \rho(3), and so on, and at the kth stage we define Z_k to be the expected value of \phi(\rho) given the values \rho(1),\dots,\rho(k). In other words, Z_0 is just \mathbb{E}\phi, Z_1 is the expected value of \phi(\sigma) given that \sigma(1)=\rho(1), and in general Z_k is the expected value of \phi(\sigma) given that \sigma(i)=\rho(i) for every i\in\{1,2,\dots,k\}.
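
For very small n this martingale can be computed by brute force. The following sketch is our own illustration (all names are ours), with \phi taken to be the minimum number of transpositions needed to write \rho, i.e. n minus its number of cycles; running it, one sees Z_0=\mathbb{E}\phi, Z_n=\phi(\rho), and every difference X_k lying in [-1,1], as the argument below shows must happen.

    from itertools import permutations
    import random

    def phi(p):                              # n minus the number of cycles of p
        seen, cycles = set(), 0
        for i in range(len(p)):
            if i not in seen:
                cycles += 1
                j = i
                while j not in seen:
                    seen.add(j)
                    j = p[j]
        return len(p) - cycles

    n = 6
    all_perms = list(permutations(range(n)))
    rho = random.choice(all_perms)

    Z = []
    for k in range(n + 1):
        # Z_k: average of phi over all sigma agreeing with rho on the first k values
        consistent = [s for s in all_perms if s[:k] == rho[:k]]
        Z.append(sum(phi(s) for s in consistent) / len(consistent))

    print("rho =", rho)
    print("Z_k =", [round(z, 3) for z in Z])                              # Z_0 = E[phi], Z_n = phi(rho)
    print("X_k =", [round(Z[k] - Z[k - 1], 3) for k in range(1, n + 1)])  # each lies in [-1, 1]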

What can we say about X_k=Z_k-Z_{k-1}? We would like to prove an upper bound for the maximum absolute value that X_k can take, and we would also like to prove that the expectation of X_k given the values of X_1,\dots,X_{k-1} is 0.

X_k is the amount by which our expectation of \phi(\sigma) changes when we add to our existing knowledge that \sigma(i)=\rho(i) for every i<k the further knowledge that \sigma(k)=\rho(k). To get an upper bound for this, let us define A_r, for each r not in the set \{\rho(1),\dots,\rho(k-1)\}, to be the set of all \sigma such that \sigma(i)=\rho(i) for every i<k and \sigma(k)=r. Now let us pick two such values s and t and define a bijection \beta_{st} between A_s and A_t as follows. Given a permutation \sigma\in A_s, let \beta_{st}(\sigma) be the permutation \tau that agrees with \sigma except that \tau(k)=t and \tau(\sigma^{-1}(t))=s. To put this another way, \beta_{st}(\sigma)=(st)\sigma, where (st) is the transposition that exchanges s and t.

Now the distance between \sigma and \beta_{st}(\sigma) is at most 1 (and exactly 1 if s\ne t). Therefore, |\phi(\sigma)-\phi(\beta_{st}(\sigma))|\leq 1 for every \sigma\in A_s. Since \beta_{st} is a bijection between A_s and A_t, it follows that \mathbb{E}[\phi(\sigma)|\sigma\in A_s] differs from \mathbb{E}[\phi(\sigma)|\sigma\in A_t] by at most 1.

In other words, once we know what \rho(k) is, the difference it can make to our expectation of \phi(\rho) is at most 1. Also, the average of \phi(\sigma) given that \sigma(i)=\rho(i) for every i<k is the average over s of the averages of \phi(\sigma) over the sets A_s (since all the sets A_s have the same size), so the expectation of Z_k given Z_0,\dots,Z_{k-1} is Z_{k-1}, as we wanted. (Equivalently, since Z_0 is constant, the expectation of X_k given X_1,\dots,X_{k-1} is 0.)
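
To spell this out in symbols: write m=n-k+1 for the number of possible values of \sigma(k), and for each such value r write E_r for the expectation of \phi(\sigma) given that \sigma(i)=\rho(i) for every i<k and \sigma(k)=r. The possible values of \rho(k) are equally likely given \rho(1),\dots,\rho(k-1), so Z_{k-1}=\frac{1}{m}\sum_r E_r, while Z_k=E_{\rho(k)}. Hence |X_k|=|E_{\rho(k)}-\frac{1}{m}\sum_r E_r|\leq\frac{1}{m}\sum_r|E_{\rho(k)}-E_r|\leq 1, and averaging Z_k over the possible values of \rho(k) gives back Z_{k-1}.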

We have now checked the facts we wanted to check: we have a martingale, and the random variables X_i in the martingale difference sequence take values in [-1,1]. It follows from Azuma's inequality that the probability that |\phi(\rho)-\mathbb{E}\phi|\geq t is at most 2\exp(-t^2/4n). From this we see that even though the diameter of S_n is n-1, for the vast majority of pairs (\rho,\sigma) the difference |\phi(\rho)-\phi(\sigma)| is at most, say, 100\sqrt{n} (there is nothing special about the constant 100: any fixed multiple of \sqrt{n} works, with a failure probability that shrinks rapidly as the multiple grows).
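
To put numbers to this (illustrative only), here is the bound 2\exp(-t^2/4n) evaluated at a few multiples of \sqrt{n}; note that once t is expressed as a multiple of \sqrt{n} the bound no longer depends on n.

    import math

    for c in (1, 10, 100):                      # t = c * sqrt(n)
        bound = 2 * math.exp(-c * c / 4)        # 2*exp(-t^2/(4n)) with t = c*sqrt(n)
        print(f"t = {c:>3}*sqrt(n): bound = {bound:.3e}")
    # c = 1 gives a trivial bound (> 1); c = 10 already gives about 3e-11;
    # c = 100 underflows to 0.0 in double precision (the true value is 2*exp(-2500)).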

General discussion

The proof of Azuma's inequality is a relatively straightforward generalization of the proof given in Example 3 of bounding probabilities by expectations, which deals with the special case where the martingale difference sequence consists of independent random variables that take values in [-1,1].
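
In outline, one standard route to the inequality is as follows. If |X|\leq\lambda_k and the expectation of X given the past is 0, then convexity of the exponential gives \mathbb{E}[e^{\lambda X}\mid\text{past}]\leq\cosh(\lambda\lambda_k)\leq e^{\lambda^2\lambda_k^2} (using \cosh x\leq e^{x^2}). Multiplying these conditional bounds together for k=1,\dots,n gives \mathbb{E}\,e^{\lambda(X_1+\dots+X_n)}\leq e^{\lambda^2\sum_k\lambda_k^2}, and then Markov's inequality with the optimal choice \lambda=t/2\sum_k\lambda_k^2 gives \mathbb{P}[X_1+\dots+X_n\geq t]\leq e^{-t^2/4\sum_k\lambda_k^2}; applying the same bound to -X_1,\dots,-X_n and adding the two tails yields the factor 2.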

Comments

possible typo?

The article says

"Then the martingale condition is that the expectation of X_k given the values of X_1,\dots,X_{k-1} is always 0."

I see that the martingale condition implies this, but is it true that the martingale condition "is" this? This condition can be satisfied without the sequence Z_i being a martingale.

Indeed, suppose A, B are independent random variables with mean zero. Define
Z_0 = A, Z_1 = A + B, Z_2 = 2A+B. Clearly, E[X_1]=0 and E[X_2|X_1]=E[A|B]=0, so the above condition holds. On the other hand, this is not a martingale: E[Z_2|Z_1, Z_0] = Z_1 + Z_0, which may not equal Z_1.

Thanks – I've corrected

Thanks – I've corrected it. I've tended to deal with martingales, such as the one in this permutations example, where Z_0 is constant, which is why I wrote what I wrote.

Trivia

There is a typo in the second paragraph of the section "Example 1, continued": one should delete the symbol \mathbb{E} in the two occurrences of

the expected value of \mathbb{E}\phi(\sigma) given that

And, reading the last paragraph of this section, I wondered why you picked 100\sqrt{n} as an example of an extremely unlikely deviation. First, 10\sqrt{n} would already yield a rather respectable (i.e., small) bound, namely, something like 10^{-11}. Second, there is nothing specific about 100\sqrt{n} here, although the formulation seems to indicate otherwise: you might want to add a "for example" somewhere.

Thank you for this article.