Sub-Gamma Variables and Concentration
Posted: 2019-02-03.
For variables that are usually small, subgaussian concentration bounds can often be loose. We can get better Bernstein bounds using the "sub-gamma" approach.
The general approach is very similar to subgaussians:
- Define $(v,c)$-sub-gamma variables (two parameters in this case).
- Show that sub-gamma variables satisfy (Bernstein) concentration bounds.
- Show that summing and scaling (up) sub-gamma variables give sub-gamma variables.
- Show that common variables (especially $[0,1]$-bounded) are actually sub-gamma.
Example
Say $Y = Y_1 + \dots + Y_n$ where each $Y_i$ is an independent Bernoulli. An approach based on subgaussian random variables can guarantee that the standard deviation of $Y$ is at most $O(\sqrt{n})$ and that $Y$ is within this distance of its mean with high probability.
But what if each $Y_i$ is quite unlikely to be 1? What if $\mathbb{E} Y \ll \sqrt{n}$, for example?
In this case better bounds are true, and we can use sub-gamma variables to prove them! In particular, this post will show how to upper-bound $Y$ by a bound depending on its mean, with high probability.
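To make this concrete, here is a small numerical sketch comparing a Hoeffding-style (subgaussian) deviation bound $\sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}$ with the Bernstein-style bound $\sqrt{2\,\mathbb{E}Y\,\ln\tfrac{1}{\delta}} + \ln\tfrac{1}{\delta}$ that the rest of this post derives. The particular values of $n$, $p$, and $\delta$ are mine, chosen only for illustration.

```python
import math

# Hypothetical example: n independent Bernoulli(p) variables with p small,
# so E[Y] = n*p is much smaller than sqrt(n).
n, p, delta = 10_000, 0.001, 1e-6
mean_Y = n * p                      # E[Y] = 10
t = math.log(1 / delta)             # t = ln(1/delta)

# Subgaussian (Hoeffding-style) deviation: sqrt(n/2 * ln(1/delta)).
hoeffding_dev = math.sqrt(n / 2 * t)

# Bernstein-style deviation from the sub-gamma bound, with v = E[Y], c = 1:
# sqrt(2 * v * t) + c * t.
bernstein_dev = math.sqrt(2 * mean_Y * t) + t

print(f"E[Y]          = {mean_Y:.1f}")
print(f"Hoeffding dev = {hoeffding_dev:.1f}")   # roughly 263
print(f"Bernstein dev = {bernstein_dev:.1f}")   # roughly 30
```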
Sub-Gammaness
Recall that the main tool we have for proving concentration bounds is applying Markov's inequality to something like $e^{f(Y)}$. So it is not surprising that, like with subgaussian variables, we impose a condition on the moment generating function $\mathbb{E} e^{\lambda Y}$.
Let the random variable $Y$ have mean $\mu$. We say $Y$ is $(v,c)$-sub-gamma (on the right) if, for all $\lambda$ with $0 \lt \lambda \lt 1/c$, \[ \mathbb{E} e^{\lambda(Y-\mu)} \leq \exp\left( \frac{v ~ \lambda^2}{2(1-c\lambda)} \right) . \] Note that this is one-sided: it only works for $\lambda \gt 0$, whereas the subgaussian condition was symmetric. So we are only going to be able to prove upper bounds using this definition.
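As a quick numerical sanity check of this definition (anticipating the fact, proved later in this post, that a $[0,1]$-valued variable with mean $p$ is $(p,1)$-sub-gamma), here is a sketch for a single Bernoulli. The value of $p$ and the grid of $\lambda$ values are arbitrary choices of mine.

```python
import numpy as np

p = 0.05                                  # hypothetical Bernoulli mean
lambdas = np.linspace(0.001, 0.999, 999)  # lambda in (0, 1/c) with c = 1

# Exact MGF of the centered Bernoulli: E[exp(lambda * (X - p))].
mgf = p * np.exp(lambdas * (1 - p)) + (1 - p) * np.exp(-lambdas * p)

# Sub-gamma bound with v = p, c = 1.
bound = np.exp(p * lambdas**2 / (2 * (1 - lambdas)))

print("definition holds on grid:", bool(np.all(mgf <= bound)))  # expected: True
```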
For intuition in the following bound, (spoiler alert) a sum of independent Bernoullis is $(v,1)$-sub-gamma where $v$ is its mean.
Theorem. If $Y$ is $(v,c)$-sub-gamma (on the right), then for all $t \gt 0$,
\[ \Pr[Y - \mathbb{E}Y \geq \sqrt{2vt} + ct] \leq e^{-t} . \]Proof. Start with the standard "Chernoff" approach: for all $\theta \gt 0$ and $\lambda \in (0,1/c)$, by exponentiating and applying Markov's inequality, \begin{align*} \Pr[Y-\mathbb{E}Y \geq \theta] &\leq e^{-\lambda \theta} \mathbb{E} e^{\lambda (Y-\mathbb{E}Y)} \\ &\leq \exp\left(-\lambda \theta + \frac{v \lambda^2}{2(1-c\lambda)} \right) \end{align*} by definition of sub-gamma.
Now, I don't have a non-magical way to continue, so here's a magical "guess-and-check" proof. Let $t = \lambda \theta - \frac{v\lambda^2}{2(1-c\lambda)}$. Then the above expression says $\Pr[Y - \mathbb{E}Y \geq \theta] \leq \exp(-t)$. Now, we're going to find a choice of $\lambda$ such that $\theta = \sqrt{2vt} + ct$. This will prove the theorem. (We'll also have that for any $t \gt 0$, there is such a choice of $\lambda$.)
Okay, here's some magic. By definition, we have $\lambda\theta = t + \frac{v\lambda^2}{2(1-c\lambda)}$. Now we subtract $tc\lambda + \sqrt{2tv}\lambda$ from both sides of this equality: \begin{align*} \lambda \theta - tc\lambda - \sqrt{2tv}\lambda &= t - tc\lambda -\sqrt{2tv}\lambda + \frac{v\lambda^2}{2(1-c\lambda)} \\ &= \left(\sqrt{t(1-c\lambda)} - \lambda\sqrt{\frac{v}{2(1-c\lambda)}}\right)^2 . \end{align*} Now we choose $\lambda$ such that $t = \frac{\lambda^2v}{2(1-c\lambda)^2}$, which some calculus shows we can do for any positive $t$ using some $\lambda \in (0,1/c)$. With this choice the square on the right-hand side is zero, so \[ \lambda \theta - tc\lambda - \sqrt{2tv}\lambda = 0 , \] and dividing by $\lambda \gt 0$ gives $\theta = \sqrt{2tv} + ct$, which completes the proof.
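(An aside, not spelled out above but easy to verify: the equation $t = \frac{\lambda^2 v}{2(1-c\lambda)^2}$ has an explicit solution. Taking positive square roots gives $\sqrt{2t}\,(1-c\lambda) = \lambda\sqrt{v}$, i.e. \[ \lambda = \frac{\sqrt{2t}}{\sqrt{v} + c\sqrt{2t}} , \] which lies in $(0,1/c)$ for every $t \gt 0$ because $c\lambda = \frac{c\sqrt{2t}}{\sqrt{v} + c\sqrt{2t}} \lt 1$.)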
This theorem basically says there are two regimes. If $v$ (roughly the mean, for Bernoulli sums) is medium-small, then the $\sqrt{2vt}$ term dominates and promises that, for instance, the probability of exceeding the mean by $\sqrt{2v\ln(1/\delta)}$ (about $\sqrt{\ln(1/\delta)}$ "standard deviations" $\sqrt{v}$) is at most $\delta$. If $v$ is really small or $c$ is large (which corresponds to variables with large ranges or heavy tails), then the second term $ct$ kicks in.
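Here is a sketch that turns the theorem into a usable bound and checks it empirically. The simulation parameters are hypothetical, and the check uses a sum of independent Bernoullis, which (as shown later in this post) is $(np,1)$-sub-gamma.

```python
import numpy as np

def subgamma_deviation(v, c, t):
    """Deviation from the theorem: Pr[Y - E[Y] >= sqrt(2*v*t) + c*t] <= exp(-t)."""
    return np.sqrt(2 * v * t) + c * t

# Hypothetical check: Y = sum of n independent Bernoulli(p), which is (n*p, 1)-sub-gamma.
rng = np.random.default_rng(0)
n, p, t = 10_000, 0.001, 3.0
trials = 200_000

Y = rng.binomial(n, p, size=trials)          # each sample is one realization of Y
threshold = n * p + subgamma_deviation(n * p, 1.0, t)

empirical = np.mean(Y >= threshold)
print(f"empirical tail  = {empirical:.2e}")  # should be <= exp(-t) ~ 5.0e-2
print(f"theorem's bound = {np.exp(-t):.2e}")
```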
Calculus of Sub-Gammas
Sub-gammas wouldn't be useful unless we could easily show that new variables are sub-gamma. Here are the facts that let us do that.
Theorem. If $Y$ is $(v,c)$-sub-gamma, then $Y + r$ is $(v,c)$-sub-gamma for any constant $r$.
Proof. Immediate; $Y- \mathbb{E}Y = Y+r - \mathbb{E}[Y+r]$.
Theorem. If $Y$ is $(v,c)$-sub-gamma, then for any $a \geq 1/c$, $aY$ is $(a^2v, ac)$-sub-gamma.
Proof. Exercise (straightforward).
Theorem. If $Y_1$ and $Y_2$ are independent, $Y_1$ is $(v_1,c_1)$-sub-gamma, and $Y_2$ is $(v_2,c_2)$-sub-gamma, then $Y_1 + Y_2$ is $(v_1+v_2,\max\{c_1,c_2\})$-sub-gamma.
Proof. Exercise (straightforward). The key step is, letting $\bar{c} = \max\{c_1,c_2\}$, to use the bounds $c_1 \leq \bar{c}$ and $c_2 \leq \bar{c}$ in each moment generating function bound (independence lets the moment generating function of the sum factor).
So for example, we get that a sum of independent $(a_i,c)$-sub-gamma variables is $(\sum_i a_i, c)$-sub-gamma.
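This calculus can be packaged as simple parameter bookkeeping. The sketch below is my own illustration (not code from the post): it tracks $(v,c)$ under shifting, scaling up, and summing independent variables, mirroring the three theorems above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubGamma:
    """Parameters of a (v, c)-sub-gamma (on the right) random variable."""
    v: float
    c: float

    def shift(self, r: float) -> "SubGamma":
        # Adding a constant r changes nothing: Y + r - E[Y + r] = Y - E[Y].
        return self

    def scale(self, a: float) -> "SubGamma":
        # Scaling up by a >= 1/c gives (a^2 * v, a * c).
        assert a >= 1 / self.c
        return SubGamma(a * a * self.v, a * self.c)

    def __add__(self, other: "SubGamma") -> "SubGamma":
        # Sum of *independent* sub-gamma variables: (v1 + v2, max(c1, c2)).
        return SubGamma(self.v + other.v, max(self.c, other.c))

# Example: a sum of 8 independent Bernoulli(0.25) variables, each (0.25, 1)-sub-gamma.
total = sum((SubGamma(0.25, 1.0) for _ in range(8)), start=SubGamma(0.0, 1.0))
print(total)  # SubGamma(v=2.0, c=1.0)
```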
But note sub-gamma variables don't give us quite as much power as Gaussians: $-Y$ is not necessarily sub-gamma, since the definition isn't symmetric, and we also can't generally scale the second parameter $c$ below $1$. (One could define two-sided sub-gamma variables, but it's not clear that interesting variables satisfy this. The applications I've seen focus on e.g. Bernoullis of small mean where the left-tail restriction would only get in the way.)
Examples
Another way sub-gammas wouldn't be useful is if no variables were ever sub-gamma. By pure luck, this doesn't happen (otherwise I'd be really embarrassed at this point in the blog post).
Theorem. Suppose $Y_i$ is in $[0,1]$ with mean $p$; then $Y_i$ is $(p,1)$-sub-gamma.
Proof. We proceed in three steps.
- We show $\mathbb{E} e^{\lambda Y_i} \leq p e^{\lambda} + (1-p)$.
- Using this, we show $\mathbb{E} e^{\lambda(Y_i-p)} \leq \exp\left[ p\left( e^{\lambda} - 1 - \lambda\right)\right] .$
- We show that for $0 \lt \lambda \lt 1$, $e^{\lambda} -1 - \lambda \leq \frac{\lambda^2}{2(1-\lambda)}$.
(1) Given any outcome $y \in [0,1]$, we have $e^{\lambda y} \leq y e^{\lambda} + (1-y)$. (This follows by considering a Bernoulli $X$ with mean $y$ and applying Jensen's inequality to $\mathbb{E} e^{\lambda X}$.) Okay, so
\begin{align*} \mathbb{E} e^{\lambda Y_i} &\leq \mathbb{E}\left[ Y_i e^{\lambda} + (1-Y_i) \right] \\ &= p e^{\lambda} + (1-p). \end{align*}(2) Using the previous fact and the ever-useful inequality $1+x \leq e^x$,
\begin{align*} \mathbb{E} e^{\lambda(Y_i-p)} &= e^{-\lambda p} \mathbb{E} e^{\lambda Y_i} \\ &\leq e^{-\lambda p} \left(p e^{\lambda} + (1-p)\right) \\ &\leq e^{-\lambda p} e^{p(e^{\lambda}-1)} \\ &= \exp\left[p\left(e^{\lambda} - 1 - \lambda\right)\right] . \end{align*}(3) Notice that $1+\lambda$ gives the first two terms of the Taylor series of $e^{\lambda}$:
\begin{align*} e^{\lambda} - 1 - \lambda &= \frac{\lambda^2}{2} + \frac{\lambda^3}{6} + \cdots \\ &\leq \frac{\lambda^2}{2}\left(1 + \lambda + \lambda^2 + \cdots\right) \\ &= \frac{\lambda^2}{2(1-\lambda)} , \end{align*} where the geometric series converges to $\frac{1}{1-\lambda}$ because $\lambda \lt 1$. Combining (2) and (3), for all $0 \lt \lambda \lt 1$ we get $\mathbb{E} e^{\lambda(Y_i - p)} \leq \exp\left(\frac{p\lambda^2}{2(1-\lambda)}\right)$, which is exactly the $(p,1)$-sub-gamma condition.

This gives us strong tail bounds for sums of bounded variables.
Corollary. Let $Y = \sum_i a_i Y_i$ where the $Y_i \in [0,1]$ are independent with means $p_i$ and each $a_i \geq 1$. Let $v = \sum_i a_i^2 p_i$; then $Y$ is $(v, \max_i a_i)$-sub-gamma. In particular, with probability at least $1-\delta$, we have \[ Y - \mathbb{E} Y \leq \sqrt{2 v \ln\frac{1}{\delta}} + \max_i a_i \ln\frac{1}{\delta} . \] In particular, if $Y$ is a sum of independent $[0,a]$-bounded variables for $a \geq 1$, then with probability $1-\delta$, \[ Y - \mathbb{E} Y \leq \sqrt{2 a \mathbb{E}Y \ln\frac{1}{\delta}} + a \ln \frac{1}{\delta} . \]
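As a usage sketch of this corollary (the weights, means, and failure probability below are made-up numbers, not from the post), the high-probability upper bound can be computed in a few lines:

```python
import numpy as np

def weighted_sum_upper_bound(a, p, delta):
    """High-probability bound from the corollary: with probability >= 1 - delta,
    sum_i a_i * Y_i exceeds its mean by at most sqrt(2*v*ln(1/delta)) + max_i a_i * ln(1/delta),
    where v = sum_i a_i^2 * p_i and the Y_i in [0,1] are independent with means p_i."""
    a, p = np.asarray(a, float), np.asarray(p, float)
    t = np.log(1 / delta)
    v = np.sum(a**2 * p)
    return np.sum(a * p) + np.sqrt(2 * v * t) + np.max(a) * t

# Hypothetical example: 1000 variables, weights in [1, 2], small means.
rng = np.random.default_rng(1)
a = rng.uniform(1, 2, size=1000)
p = rng.uniform(0, 0.01, size=1000)
print(f"E[Y] ~ {np.sum(a * p):.2f}, 99.9%-upper bound ~ {weighted_sum_upper_bound(a, p, 1e-3):.2f}")
```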
Notes.
The interested reader can work out, for example, that the exponential distribution is sub-gamma as are more general gamma distributions.
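(Here is a quick sketch of the simplest case, the exponential with rate $1$, which the reader may prefer to work out independently: for $0 \lt \lambda \lt 1$, \[ \mathbb{E} e^{\lambda(X - 1)} = \frac{e^{-\lambda}}{1-\lambda}, \qquad \log\frac{e^{-\lambda}}{1-\lambda} = -\lambda - \log(1-\lambda) = \sum_{k \geq 2}\frac{\lambda^k}{k} \leq \frac{\lambda^2}{2(1-\lambda)} , \] using the same geometric-series trick as in step (3) above, so $X$ is $(1,1)$-sub-gamma.)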
Interestingly, to my mind, the sub-gamma definition isn't "tight" for these distributions the way the subgaussian definition is tight for Gaussians. I think this is because sub-gamma variables are sort of capturing several limits or tail behaviors at once, depending on how $\lambda$ and $c$ relate.
For example, in the above proof for $[0,1]$-bounded variables, we obtained $\mathbb{E} e^{\lambda X} \leq e^{p(e^{\lambda}-1)}$ when $\mathbb{E}X = p$, which is tight for a Poisson variable with mean $p$. But to get the definition of sub-gamma, we relaxed this bound a bit further. Why? Poissons are stable in one sense, namely that sums of independent Poissons are Poisson, so a definition of "sub-Poisson" variables seems very natural and would apply to $[0,1]$ variables with mean $p$.
The problem is that Poissons don't behave so nicely under scaling (e.g. a scaled Poisson is no longer Poisson). Scaling is exactly the power we gained with sub-gammas, but we may have given up some mathematical elegance as compared to subgaussians.
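(For reference, the computation behind the "tight for a Poisson" claim: if $X$ is Poisson with mean $p$, then \[ \mathbb{E} e^{\lambda X} = \sum_{k \geq 0} e^{-p} \frac{p^k}{k!} e^{\lambda k} = e^{-p} e^{p e^{\lambda}} = \exp\left(p\left(e^{\lambda} - 1\right)\right) , \] so the bound from step (2) holds with equality.)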