Note:MathJax

Measure Concentration II

Posted: 2018-09-27.

This post will discuss how to get some key tail bounds from subgaussian assumptions, then introduce the wonderful world of martingale inequalities. This is a follow-up to posts on measure concentration and subgaussian variables.

Recap, and Common Inequalities from Subgaussianity

Recall that in measure concentration, the idea is to predict the unpredictable, to bound the unknowable, to bring order to chaos itself. Something like that, anyway. We have a set of random variables $Y_1,\dots,Y_n$, usually independent, and we want to make accurate predictions about some aggregation of them like the mean.

In measure concentration, we saw Markov's, Chebyshev's, and Chernoff inequalities. We then saw a nice general approach using subgaussian variables. Recall that $Y$ is $\sigma^2$-subgaussian if $\mathbb{E} e^{t (Y - \mathbb{E} Y)} \leq e^{t^2 \sigma^2 / 2}$ for all $t \in \mathbb{R}$, and this implies \[ \Pr[|Y - \mathbb{E} Y| \geq t] \leq 2e^{-t^2 / (2\sigma^2)}. \] Also recall that $c + Y$ is $\sigma^2$-subgaussian, that $cY$ is $c^2\sigma^2$-subgaussian, and that $Y + Z$, if $Z$ is independent and $\alpha^2$-subgaussian, is $\sigma^2 + \alpha^2$ subgaussian.

First, let us cite a generalization of a result from the previous blog post. I omit the proof because I haven't been able to find or come up with a proof I really like, but you can easily find one online.

Hoeffding's Lemma. If $Y \in [a,b]$, then $Y$ is $\left(\frac{b-a}{2}\right)^2$-subgaussian.

We can use subgaussianity to prove some famous and useful inequalities. We'll do one now, and some more in the next section.

Theorem (Hoeffding's Inequality). Let $Y_1,\dots,Y_n$ be independent with $Y_i \in [a_i,b_i]$. Let $X = \frac{1}{n}\sum_i Y_i$ be their average. Then

\[ \Pr[ |X - \mathbb{E} X| \geq t] \leq 2\exp\left( \frac{-2n^2 t^2}{\sum_i (b_i - a_i)^2}\right). \]

Proof. Each $Y_i$ is $(b_i-a_i)^2/4$ subgaussian, so their sum is $\frac{1}{4}\sum_i (b_i-a_i)^2$ subgaussian, and their average is $\frac{1}{4n^2}\sum_i (b_i-a_i)^2$ subgaussian.

Martingales

The martingale, native to math classrooms and certain casinos, can be distinguished from the common nightingale by its distinctive tendency to return to the origin.

Just kidding, actually a martingale is not a bird at all but a kind of harness used to keep horses on a steady path.

In our context however, a martingale is a sequence of random variables $Y_1, Y_2, \dots$ that are possibly correlated, but satisfy $\mathbb{E}[ Y_j | Y_1, \dots, Y_{j-1}] = Y_{j-1}$. (They should also have finite expectation.) Let's also define $Y_0 = 0$, so $\mathbb{E}Y_1 = 0$. A good picture to keep in mind is an unbiased random walk, i.e. $Y_j$ is the position on the horizontal axis at time $j$, and at each time we take some number of steps either right or left with the expected position equal to the current one.

The first point is that the "Chernoff method" still works. But pay attention to the difference in the proof: we no longer have independent random variables, so we have to use the conditional distributions given the previous variables.

Thoerem (Azuma-Hoeffding Inequality). If $Y_0, Y_1,\dots,Y_n$ is a martingale with $Y_0 = 0$ and $|Y_j - Y_{j-1}| \leq c_i$, then for all $t$,

\[ \Pr[ |Y_n| \geq t ] \leq 2\exp\left(\frac{-t^2}{2 \sum_i c_i^2} \right) \]

Proof. Let $X_j = Y_j - Y_{j-1}$ for $j=1,\dots,n$. Then $Y_n = \sum_{j=1}^n X_j$, and we can write (with any $\theta \gt 0$):

\begin{align} \Pr[ Y_n \geq t ] &= \Pr[ \sum_j X_j \geq t ] \\ &= \Pr[ e^{\theta \sum_j X_j} \geq e^{\theta t} ] \\ &\leq \frac{1}{e^{\theta t}} \mathbb{E} e^{\theta X_1} \cdots e^{\theta X_n} . \end{align}

So far, so good, but we have no assumption that the $X_j$ are independent, so we cannot decompose this expectation into a product of their expectations. We can use conditioning, however. Conditioned on $Y_0,\dots,Y_{j-1}$, each $X_j$ has mean zero and is in $[-c_j,c_j]$, so is $c_j^2$-subgaussian.

\begin{align} & \frac{1}{e^{\theta t}} \mathbb{E} e^{\theta X_1} \cdots e^{\theta X_n} \\ &= \frac{1}{e^{\theta t}} \mathbb{E}_{Y_1} e^{\theta X_1} \mathbb{E}_{Y_2|Y_1} e^{\theta X_2} \cdots \mathbb{E} \left[ e^{\theta X_n} \mid Y_1,\dots,Y_{n-1} \right] \\ &\leq \frac{1}{e^{\theta t}} \prod_{j=1}^n e^{\theta^2 c_j^2 / 2} \\ &= \exp\left(\frac{1}{2} \theta^2 (\sum_j c_j^2) - \theta t\right) . \end{align}

Now as usual we optimize $\theta$ (just set the derivative equal to zero) and choose $\theta = t / \sum_j c_j^2$ to get

\begin{align} \Pr[Y_n \geq t] &\leq \exp\left(\frac{1}{2} \frac{t^2}{(\sum c_j^2)^2}(\sum c_j^2) - \frac{t^2}{\sum c_j^2} \right) \\ &= \exp\left( \frac{-t^2}{2\sum_{j=1}^n c_j^2} \right) . \end{align} This proves one side of the inequality. The same proof with minor changes proves $\Pr[ Y_n \leq -t]$ is bounded by the same quantity, with proves the theorem.

We can use the nice properties of subgaussians to make this proof cleaner and more general.

Lemma. If $Y_0,\dots,Y_n$ is a martingale with $Y_0 = 0$ and each difference variable $X_j = Y_j-Y_{j-1}$ is $\sigma_j^2$-subgaussian conditioned on $Y_1,\dots,Y_{j-1}$, then $Y_n$ is $\sum_{j=1}^n \sigma_j^2$ subgaussian.

Proof.

\begin{align} \mathbb{E} e^{\beta Y_n} &= \mathbb{E} e^{\beta \sum_{j=1}^n X_j} \\ &= \mathbb{E} e^{\beta X_1} \cdots e^{\beta X_n} \\ &= \mathbb{E}_{Y_1} e^{\beta X_1} \mathbb{E}_{Y_2|Y_1} e^{\theta X_2} \cdots \mathbb{E} \left[ e^{\theta X_n} \mid Y_1,\dots,Y_{n-1} \right] \\ &\leq \prod_{j=1}^n e^{\beta^2 \sigma_j^2 / 2} \\ &= \exp\left(\frac{\beta^2}{2} \sum_{j=1}^n \sigma_j^2 \right). \end{align}

Here I admit to being a bit subtle in one respect: I take each expectation over $Y_j$ conditioned on $Y_1,\dots,Y_{j-1}$, after which $X_j$ is of course determined. I don't want to take the expectations over $X_j$ directly because we don't technically have any guarantee about the distribution of $X_j$ conditioned on $X_0,\dots,X_{j-1}$. These may not carry enough information -- we need to condition on the original martingale.

Corollary (Azuma-Hoeffding for subgaussian differences). Let $Y_0,\dots,Y_n$ be a martingale with $Y_0 = 0$. Suppose that the distribution of $Y_j$ conditioned on $Y_0,\dots,Y_{j-1}$ is $\sigma_j$-subgaussian. Then for all $t$,

\[ \Pr[ |Y_n| \geq t ] \leq \exp\left(\frac{-t^2}{2\sum_{j=1}^n \sigma_j^2}\right) . \]

By the way, just to be clear: you might notice that Hoeffding's inequality has a factor of $2$ in the exponent while the above Azuma-Hoeffding has a factor of $\frac{1}{2}$. This arises because in Hoeffding's, each $X_i \in [a_i,b_i]$ leading to $(\frac{b_i-a_i}{2})^2$-subgaussianity, whereas in Azuma-Hoeffding, each $X_i \in [-c_i,c_i]$, leading to $(\frac{c_i - (-c_i)}{2})^2 = c_i^2$-subgaussianity. So there is a factor of $4$ difference, but it is just due to the parameterization of the setup: the parameter $c_i$ is only half the range containing $X_i$.

Doob Martingales and McDiarmid's Inequality

So far, our concentration inequalities have mostly been about sums of random variables. Now, we'll talk about slightly more general martingales and give a concentration inequality that applies to any function of random variables satisfying a kind of Lipschitz property.

In general, the random variables we're interested in, $\{Y_j\}$, might depend on some underlying state or extra information. For example, maybe in addition to your position in the random walk, you sometimes randomly change the size of your steps. The history $Y_0,\dots,Y_{j-1}$ mightn't capture this information. So we want a slightly more general definition of a martingale.

The sequence $Y_1,\dots,$ is a martingale with respect to $Z_1,\dots$ if for all $j$, $\mathbb{E} \left[ Y_j \mid Z_1,\dots,Z_{j-1} \right] = Y_{j-1}$. I will continue to assume $Y_0 = 0$. In this case, let $X_j = Y_j - Y_{j-1}$; then $X_1,\dots$ is a martingale difference sequence with respect to $Z_1,\dots$.

Now we can make a kind of silly, but really useful observation. Let $Z_1,\dots,Z_n$ be any set of random variables, arbitrarily correlated. Let $f(Z_1,\dots,Z_n)$ be any real-valued function. Define \[ f_j(z_1,\dots,z_j) = \mathbb{E} f(z_1,\dots,z_j,Z_{j+1},\dots,Z_n) , \] that is, $f_j$ gives the expected value of $f$ conditioned on values for the first $j$ inputs: \[ f_j(z_1,\dots,z_j) = \mathbb{E} \left[ f(Z_1,\dots,Z_n) \mid Z_1=z_1,\dots,Z_j=z_j \right] . \] Then we call the sequence $Y_0,\dots,Y_n$ defined by $Y_j = f_j(Z_1,\dots,Z_j)$ the Doob martingale with respect to $f$ and $Z_1,\dots,Z_n$.

This definition takes a bit of digesting. What we are doing is ordering the collection of variables $Z_1,\dots,Z_n$ and plugging them into $f$ one at a time. Initially, we have $Y_0 = f_0 = \mathbb{E} f(Z_1,\dots,Z_n)$, a constant. (I now relax the requirement that $Y_0 = 0$.) Then we reveal $Z_1$ and set $Y_1$ to be the expectation of $f$ conditioned on $Z_1$. And so on; $Y_j$ is a random variable depending only on $Z_1,\dots,Z_j$, equal to the expectation of $f$ given these realizations. Excercise: check that $Y_1,\dots,Y_n$ is a martingale with respect to $Z_1,\dots,Z_n$.

This feels a bit funny because we've taken this seemingly-special and exclusive class of objects, martingales, and shown how to turn any function of any random variables into one. It turns out to be useful if we have some guarantees about the function's behavior.

Theorem (McDiarmid's Inequality). Let $Z_1,\dots,Z_n$ be a collection of random variables and $f(z_1,\dots,z_n)$ a function. Suppose that for all $\vec{z},\vec{z}'$ that are equal except at a single coordinate $j$, we have $| f(\vec{z}) - f(\vec{z}') | \leq c_j$. Then

\[ \Pr[ |f(\vec{Z}) - \mathbb{E} f(\vec{Z})| \geq t ] \leq 2e^{-t^2 / 2\sum_{j=1}^n c_j^2} . \]

Proof. Consider the Doob martingale $Y_0,\dots,Y_n$. If, for example, we write $z_1,z_2$ for the realizations of $Z_1$ and $Z_2$, then \[ Y_2 - Y_1 = \mathbb{E} \left[ f(z_1, z_2, Z_3,\dots,Z_n) - f(z_1,Z_2,Z_3,\dots,Z_n) \right] . \] By assumption, for any realizations $z_2',\dots,z_n'$, \[ |f(z_1,z_2,z_3',\dots,z_n')-f(z_1,z_2',z_3',\dots,z_n') | \leq c_2 . \] So $|Y_2 - Y_1| \leq c_2$, and similarly each $|Y_j - Y_{j-1}| \leq c_j$. Applying Azuma's inequality, generalized to the case $Y_0 \neq 0$, gives the result.