Chernoff bound

A brief introduction to Chernoff bounds.

If you’re already familiar with Chernoff bounds, you may prefer to skip directly to the statement and proof of a typical Chernoff bound.

Chernoff bounds (a.k.a. tail bounds, Hoeffding/Azuma/Talagrand inequalities, the method of bounded differences, etc. [1, 2]) are used to bound the probability that some function (typically a sum) of many “small” random variables falls in the tail of its distribution (far from its expectation).



The distribution of a sum of “small” random variables

[Figure: the distribution of a sum of small random variables, concentrated around its expectation \mu .]

Let Y be a sum of many small random variables. For example, imagine that Y is the number of heads in n coin flips. The figure shows the typical distribution of such a sum around its expectation \mu . For \varepsilon \in (0,1), the sum has probability about e^{-\varepsilon ^2\mu } of deviating from \mu by more than a factor of 1+\varepsilon (that is, an additive deviation of \varepsilon \mu ).

For example, taking \varepsilon =\Theta (1/\sqrt \mu ), the sum has constant probability of deviating from \mu by an additive \Theta (\sqrt \mu ).

Or, taking \varepsilon =\Theta (\sqrt {\log (n)/\mu }), the sum has probability n^{-\Theta (1)} of deviating from \mu by more than \Theta (\sqrt {\mu \log n}).

A typical Chernoff bound quantifies this behavior. It says that the event that Y exceeds (1+\varepsilon ) times its expectation \mu is rare: it happens with probability at most e^{-\varepsilon ^2\mu /3}. (A similar bound holds for \Pr [ Y \le (1-\varepsilon )\mu ].)
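
To get a feel for the bound, here is a minimal simulation sketch in Python (the values of n, \varepsilon , and the number of trials below are arbitrary choices for illustration; the empirical frequency should come out below the bound):

\begin{verbatim}
import math
import random

# Y = number of heads in n fair coin flips, so mu = E[Y] = n/2.
# Estimate Pr[Y >= (1+eps)*mu] by simulation and compare it to the
# Chernoff bound exp(-eps^2 * mu / 3).
n, eps, trials = 100, 0.2, 100_000
mu = n / 2
threshold = (1 + eps) * mu

hits = sum(
    bin(random.getrandbits(n)).count("1") >= threshold  # heads in n flips
    for _ in range(trials)
)

print("empirical tail frequency:       ", hits / trials)
print("Chernoff bound exp(-eps^2*mu/3):", math.exp(-eps**2 * mu / 3))
\end{verbatim}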

Example applications

Here are a couple of examples using a Chernoff bound.

Example (balls in bins / max bin).

Throw 6 n\ln n balls randomly into n bins. Let Y be the number of balls thrown into the first bin (any fixed bin would do). Then Y is a sum of 6 n\ln n independent random \{ 0,1\} -variables:

\[ Y=\sum _{t} [\text {ball } t \text { goes in the first bin}]. \]

The expectation of Y is \mu = 6\ln n. By the Chernoff bound, for any \varepsilon \in (0,1],

\[ \Pr [Y \ge (1+\varepsilon )\mu ] \; <\; \exp (- \varepsilon ^2\mu /3 ) \; =\; 1/n^{2\varepsilon ^2}. \]

Taking \varepsilon = 1 gives \Pr [Y \ge 12\ln n] < 1/n^2: the probability that a given bin receives at least 12\ln n balls is less than 1/n^2.

The naive union bound then implies that the probability that any of the n bins receives at least 12\ln n balls is less than 1/n.
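
For a quick empirical check of this example, here is a small simulation sketch in Python (n is an arbitrary illustrative choice):

\begin{verbatim}
import math
import random
from collections import Counter

# Throw 6*n*ln(n) balls into n bins uniformly at random and compare the
# maximum bin load to the 12*ln(n) threshold from the argument above.
n = 1000
balls = int(6 * n * math.log(n))

loads = Counter(random.randrange(n) for _ in range(balls))
max_load = max(loads.values())

print("expected load per bin, 6 ln n:", round(6 * math.log(n), 1))
print("maximum observed load:        ", max_load)
print("threshold 12 ln n:            ", round(12 * math.log(n), 1))
\end{verbatim}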

Exercise (more balls in bins).

Show that if n^2 balls are thrown, then the probability that any bin has more than n+O(\sqrt {n\ln n}) balls is at most 1/n.
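
(If you want to check your parameter choice afterwards: one choice that works, stated here without the verification, is \varepsilon = \sqrt {6\ln (n)/n} with \mu = n, so that \exp (-\varepsilon ^2\mu /3) = 1/n^2 and (1+\varepsilon )\mu = n+\sqrt {6n\ln n}, followed by a union bound over the n bins.)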

Expectations of exponentials

The proof of the Chernoff bound for a sum Y works by reasoning about the expectation of (1+\varepsilon )^{Y}. Here is some intuition for this approach. (For the full proof, go here.)

Recall the humble Markov bound: for any non-negative random variable X and any c>0, the probability that X is at least c is at most \textrm{E}[X]/c. This is useful but somewhat weak. Can we strengthen it by transforming X?

To take a concrete example, let X_t be the number of heads in t random coin flips. Consider \textrm{E}[3^{X_t}]. By direct calculation (conditioning on the outcome of flip t+1, which is independent of the first t flips),

\[ \textrm{E}\big [3^{X_{t+1}} \big ] \; =\; {\textstyle \frac{1}{2}} \textrm{E}\big [3^{X_t}\big ] + {\textstyle \frac{1}{2}}\, 3 \cdot \textrm{E}\big [3^{X_t}\big ] \; =\; 2\cdot \textrm{E}\big [3^{X_t}\big ]. \]

Since \textrm{E}\big [3^{X_0}\big ] = 1, it follows by induction that \textrm{E}\big [3^{X_t}\big ] = 2^t.
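
As a sanity check, the identity is easy to verify exactly for small t (a Python sketch; X_t is the Binomial(t,1/2) count of heads, as above):

\begin{verbatim}
import math

# Verify E[3^(X_t)] = 2^t exactly for small t, where X_t ~ Binomial(t, 1/2):
# E[3^(X_t)] = sum over k of C(t, k) * 3^k / 2^t.
for t in range(11):
    expectation = sum(math.comb(t, k) * 3**k for k in range(t + 1)) / 2**t
    assert expectation == 2**t
    print("t =", t, " E[3^(X_t)] =", expectation)
\end{verbatim}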

Does this give anything useful? Applying the Markov bound to 3^{X_t} gives

\[ \Pr [X_t \ge c] \; =\; \Pr \big [3^{X_t} \ge 3^c\big ] \; \le \; \frac{\textrm{E}\big [3^{X_t}\big ]}{3^{c}} \; =\; \frac{2^t}{3^{c}} \; \approx \; 2^{t - 1.58\, c}. \]

For example, taking t=100 and c=70, we can conclude that the probability of getting 70 or more heads in 100 flips is at most about 5/10000. Choosing a base other than 3 may give a better bound.
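
For comparison, both the bound and the exact tail probability can be computed directly (a Python sketch for t=100 and c=70; the exact probability turns out to be noticeably smaller than the bound):

\begin{verbatim}
import math

# Compare the bound 2^t / 3^c (Markov applied to 3^(X_t)) with the exact
# tail probability Pr[X_t >= c] for t = 100 flips and c = 70 heads.
t, c = 100, 70

markov_bound = 2**t / 3**c
exact_tail = sum(math.comb(t, k) for k in range(c, t + 1)) / 2**t

print("Markov bound 2^t / 3^c:", markov_bound)   # roughly 5/10000
print("exact Pr[X_t >= 70]:   ", exact_tail)
\end{verbatim}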

The proof of the Chernoff bound generalizes this idea, and optimizes the base of the exponent. Typically, in order to bound the probability of deviating by a 1+\varepsilon factor from the mean, the best base for the exponent is something like 1+\varepsilon \approx e^\varepsilon . To prove bounds on deviations below (as opposed to above) the mean, one uses a base less than 1. For most settings, these bounds are tight up to small constant factors in the exponent.
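
For reference, here is a compressed sketch of where that optimization leads when Y is a sum of independent \{ 0,1\} -valued random variables with total expectation \mu (the full proof fills in each inequality):

\[ \Pr [Y \ge (1+\varepsilon )\mu ] \; \le \; \frac{\textrm{E}\big [(1+\varepsilon )^{Y}\big ]}{(1+\varepsilon )^{(1+\varepsilon )\mu }} \; \le \; \frac{e^{\varepsilon \mu }}{(1+\varepsilon )^{(1+\varepsilon )\mu }} \; =\; \left(\frac{e^{\varepsilon }}{(1+\varepsilon )^{1+\varepsilon }}\right)^{\mu } \; \le \; e^{-\varepsilon ^2\mu /3} \qquad (\varepsilon \in (0,1]). \]

The first step is Markov's bound applied to (1+\varepsilon )^{Y}, the second uses \textrm{E}[(1+\varepsilon )^{X_i}] = 1+\varepsilon p_i \le e^{\varepsilon p_i} for each \{ 0,1\} -variable X_i with \Pr [X_i=1]=p_i (together with independence), and the final inequality is a calculus fact valid for \varepsilon \in (0,1].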