CS3958: Advanced Algorithms

这是 ACM 班 2022 Fall 高级算法课程的个人整理小笔记. ~~众所周知, 我都是在期末复习的时候才认真做一遍笔记的.~~

课程框架：

Basic Probabilistic Tools
- Concentration Inequalities
- Martingale
Optimization
- First-Order Optimization
- Online Optimization
Markov Chain and Sampling
- An Application of MCMC (Markov Chain Monte Carlo)
- Perspectives for studying MCMC
- Stochastic Process
- Spectrum
- Operator

Lec1

Concentration Inequalities

Theorem 2 (Markov Inequality) For any non-negative r.v. $X$ and $a > 0$ ,
$\mathbf{Pr}[X \ge a] \le \frac{\mathbf{E}[X]}{a}$
Proof:
$\mathbf{E}[X] \ge \int_{X \ge a} X d \mu \ge a \int_{X \ge a} d \mu = a \mathbf{Pr}[X \ge a]$

Example 4 (Coupon Collector) Let $X$ be the number of draws.
$\mathbf{E}[X] = n H_n$
Proof: Let $X_i$ be the number of draws with $i$ cards collected.
$\mathbf{E}[X_i] = (\frac{n-i}{n}\mathbf{E}[X_{i+1}]+1) + \frac{i}{n} \mathbf{E}[X_i]$

$\mathbf{E}[X_0] = n(\frac{1}{n} + \frac{1}{n-1} + \dots + 1) = n H_n)$

另解: 令 $X_i$ 为有 $i$ 个 coupon 后得到一个新的 coupon 的抽取次数.

Theorem 3 (Chebyshev’s Inequality) For any r.v. $X$ with bounded $\mathbf{E}[X]$ and $a > 0$ ,
$\mathbf{Pr}[|X-\mathbf{E}[X]| \ge a ] \le \frac{\mathbf{D}[X]}{a^2}$
Proof:
$\mathbf{Pr}[|X-\mathbf{E}[X]|^2 \ge a^2] \le \frac{\mathbf{D}[X]}{a^2}$
Example 3: (Streaming Model) Input a sequence $a_1$ , $a_2$ , $\dots$ , $a_m$ . Compute $m$ .

Brute-Force: maintain a counter $k$ with $O(\log m)$ memory. Best algorithm for the exact answer.

Morris’ algorithm:

X <- 0
On each input: X <- X+1 with prob 2^(-X).
return 2^X-1

Theorem 1: $\mathbf{E}[\hat{m}] = m$ . $\hat{m}$ is the output of Morris' algorithm, $m$ is the true answer.

Proof: By induction.
$\begin{split} \mathbf{E}[\hat{m}] &= \mathbf{E}[2^{X_m}]-1 \ &= \mathbf{E}[\frac{2^{X_{m-1}+1}}{2^{X_{m-1}}} + (1-\frac{1}{2^{X_{m-1}}})2^{X_{m-1}}] \ &= \mathbf{E}[2^{X_{m-1}}] + 1 = m \end{split}$
Uses $O(\log \log m)$ bits of memory. We want to
$\mathbf{Pr}[|\hat{m} - m| > \varepsilon] \le \delta$
For fixed $\varepsilon$ , smaller $\delta$ is better.

Lemma 4:
$\mathbf{E}[(2^{X_m})^2] = \frac{3}{2}m^2 + \frac{3}{2}m + 1$
Proof: $m=1$ , $\mathbf{E}[(2^{X_m})^2]$ = 4.
$\begin{split} \mathbf{E}[(2^{X_m})^2] &= \mathbf{E}[\frac{(2^{X_{m-1}+1})^2}{2^{X_{m-1}}} + (1-\frac{1}{2^{X_{m-1}}})(2^{X_{m-1}})^2] \ &= \frac{3}{2}(m-1)^2 + \frac{3}{2}(m-1) + 1 + 3m \ &= \frac{3}{2}m^2 + \frac{3}{2}m + 1 \end{split}$
With Lemma 4,
$\mathbf{D}[\hat{m}] = \mathbf{E}[\hat{m}^2] - (\mathbf{E}[\hat{m}])^2 = E[(2^{X_m}-1)^2] - m^2 \le \frac{m^2}{2}$
With Chebyshev
$\mathbf{Pr}[|\hat{m}-m|\ge \varepsilon m] \le \frac{1}{2\varepsilon^2}$
If $\varepsilon$ is smaller, $\frac{1}{2\varepsilon^2} > 1$ , the bound is useless.

The Averaging Trick

We can run Morris's algorithm $t$ times in parallel and the final output is the average
$\hat{m}^* := \frac{\sum_{i=1}^t \hat{m}_i}{t}$
We have
$\mathbf{D}[\hat{m}^*] = \frac{\mathbf{D}[\hat{m}]}{t}$

$\mathbf{Pr}[|\hat{m}^* - m| \ge \varepsilon m] \le \frac{1}{t \cdot 2 \varepsilon^2}$

For a fixed $\delta$ , $t \ge \frac{1}{2\varepsilon^2 \delta}$ , $\mathbf{Pr}[|\hat{m}^* - m| \ge \varepsilon m] \le \delta$ . New algorithm uses $O(\frac{\log \log n}{\varepsilon^2 \delta})$ bits of memory.

Chernoff Bound

Theorem 5 (Chernoff Bound) Let $X_1, \dots, X_n$ be independent r.v. such that $X_i \sim \text{Ber}(p_i)$ . Let $X = \sum_{i=1}^n X_i$ and $\mu = \mathbf{E}[X] = \sum_{i=1}^n p_i$ . Then we have
$\mathbf{Pr}[X \ge (1+\delta) \mu] \le \bigg( \frac{e^\delta}{(1+\delta)^{1+\delta}} \bigg)^\mu$
If $0<\delta<1$ .
$\mathbf{Pr}[X \le (1-\delta) \mu] \le \bigg( \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \bigg)^\mu$
Proof: core idea: use $f(X) = e^{aX}$ for $a > 0$ and use Markov.

Corollary 6 For any $0<\delta<1$ ,
$\mathbf{Pr}[X \ge (1+\delta) \mu] \le \text{exp} \bigg( - \frac{\delta^2}{3} \mu \bigg)$

$\mathbf{Pr}[X \le (1-\delta) \mu] \le \text{exp} \bigg( - \frac{\delta^2}{2} \mu \bigg)$

$\mathbf{Pr}[|X - \mu| \ge \delta \mu] \le 2\text{exp} \bigg( - \frac{\delta^2}{3} \mu \bigg)$

Proof: 略

The Median Trick

Further boost the performance of Morrie's algorithm using the median trick. Choose $t = \frac{3}{2 \varepsilon^2}$ in the averaging trick and run it $s$ times in parallel.
$\mathbf{Pr}[|\hat{m_i}^* - m| \ge \varepsilon m] \le \frac{1}{3}$
At last output the median $\hat{m}^{**}$ of $\hat{m_1}^*, \hat{m_2}^*, \dots, \hat{m_s}^*$ .

Let $Y_i$ be the indicator of the (good) event that
$|\hat{m_i}^* - m|<\varepsilon \cdot m$
Then $Y := \sum_{i=1}^s Y_i$ satisfies $\mathbf{E}[Y] \ge\frac{2}{3} s$ . If the median is bad, then at least half of $\hat{m_i}^*$ is bad which is

$Y \le \frac{1}{2} s$ . By Chernoff,
$\mathbf{Pr}[Y \le \frac{1}{2} s] \le \mathbf{Pr}[|Y - \mathbf{E}[Y]| \ge \frac{1}{6}s] \le 2 \text{exp} \bigg( -\frac{s}{108} \bigg)$
Therefore, for $t = O(\frac{1}{\varepsilon^2})$ and $s = O(\log \frac{1}{\delta})$ , we have $\mathbf{Pr}[|\hat{m}^* - m| \ge \varepsilon m] \le \delta$ .

This new algorithm uses
$O(\frac{1}{\varepsilon^2} \cdot \log \frac{1}{\delta} \cdot \log \log n)$
bits of memory. ( $O(\log \frac{1}{\delta}) << O(\frac{1}{\delta})$ ).

Lec 2

Threshold Behavior of Random Graphs

Threshold Behavior: For a random graph $G(n, p)$ , these is a threshold function $r$ such that:

when $p << r(n)$ , almost no graph satisfies the property.
when $p >> r(n)$ , almost every graph satisfies the property.

Formally, (Threshold function) Given a graph property $P$ , a function $r: \mathbf{N} \rightarrow [0, 1]$ is called a threshold if:

(a) if $p(n) << r(n)$ , $\mathbf{Pr}[G \text{ satisfies } P] \rightarrow 0$ when $n \rightarrow \infty$ ;
(b) if $p(n) >> r(n)$ , $\mathbf{Pr}[G \text{ satisfies } P] \rightarrow 1$ when $n \rightarrow \infty$ ;

Theorem 2 The property "G contains a 4-clique" has a threshold function $n^{-\frac{2}{3}}$ .

Proof: Let
$X_S = \begin{cases} 1&\text{ if } G[S] \text{ is a clique } \ 0&\text{ otherwise } \end{cases}$
Let $X = \sum_{S \in \binom{[n]}{4}} X_S$ , then by the linearity
$\mathbf{E}[X] = \sum_{S \in \binom{[n]}{4}} \mathbf{E}[X_S] = \binom{n}{4} p^6 \approx \frac{n^4 p^6}{24}$
If $p << n^{-\frac{2}{3}}$ , $\mathbf{E}[X] = o(1)$ .
$\mathbf{Pr}[X \ge 1] \le \mathbf{E}[X] \rightarrow 0$
For (b), we can not use Markov because large expectation does not imply large values with high probability. Therefore, we have to consider the variance. (反算 + Chebyshev)
$\mathbf{Pr}[X = 0] \le \mathbf{Pr}[|X-\mathbf{E}[X]| \ge \mathbf{E}[X]] \le \frac{\mathbf{D}[X]}{(\mathbf{E}[X])^2}$
Calculate by consider $|S \cap T| = 2, 3$ we get (if $p >> n^{-\frac{2}{3}}$ )
$\mathbf{D}[X] \le n^6 p^{11} + n^5p^9 + n^4 p^6 = o(\mathbf{E}[X]^2)$

Hoeffding's Inequality

Example 1 (Tossing coins) How many times should we toss the coin (denote the times as $T$ ) such that with $99\%$ prob, $\hat{p} \in [(1-\varepsilon)p, (1+\varepsilon)p]$ for a given precision $\varepsilon$ ？ $p$ is the probability pf "head".

Use Chernoff
$\mathbf{Pr}[|\hat{p} - p| \ge \varepsilon p] = \mathbf{Pr}[|\hat{p}T - pT| \ge \varepsilon pT] \le 2 \exp{\bigg( - \frac{\varepsilon^2}{3} pT\bigg)} \le 0.01$

So $T \ge \frac{3 \log 200}{\varepsilon^2 p} = O(\frac{1}{\varepsilon^2})$ .

Chernoff only works for independent Bernoulli r.v. Then we have its generalized version for almost surely bounded r.v. $X_i$ .

Theorem 3 (Hoeffding’s inequality) Let $X_1, \dots , X_n$ be independent r.v. where $X_i \in [a_i ,b i] $almost surely ($ a_i \le b_i $). Let$ X = \sum{i=1}^n X_i $and$ \mu := \mathbf{E}[X]$. Then
$\mathbf{Pr}[|X-\mu| \ge t] \le 2 \exp{\bigg( -\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \bigg)}$
Proof: The key is to have a nice upper bound on the MGF (moment generating function) like what we did in Chernoff bound. So Firstly we have

Lemma (Hoeffding’s lemma) Let $X$ be a r.v. with $\mathbf{E}[X] = 0$ and $X \in [a, b]$ . Then
$\mathbf{E}[e^{\alpha X}] \le \exp \bigg(\frac{\alpha^2(b-a)^2}{8}\bigg)$
for all $\alpha \in \mathbb{R}$ .

We replace $X_i$ by $X_i - \mathbf{E}[X_i]$ here and by symmetry we only need to prove
$\mathbf{Pr}[X \ge t] \le \exp \bigg( -\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \bigg)$
Then we have
$\mathbf{Pr}[X \ge t] \le \frac{\prod_{i=1}^n \mathbf{E}[e^{\alpha X_i}]}{e^{\alpha t}} \le \exp \bigg( -\alpha t + \frac{\alpha^2}{8} \sum_{i=1}^n (b_i - a_i)^2 \bigg)$
Optimize $\alpha = \frac{4t}{\sum_{i=1}^n (b_i - a_i)^2}$ Q.E.D.

Comparing Chernoff Bound and Hoeffding’s Inequality

Let $X_1,\ \dots,\ X_n$ be i.i.d. r.v. where $X_i \sim \text{Ber}(p)$ . Let $X = \sum_{i=1}^n X_i$ .

Let $t = \delta \mu$ . Use Chernoff
$\mathbf{Pr}[|X - \mu| \ge t] \le 2 \exp \bigg( -\frac{1}{3} \delta^2 pn \bigg)$
Use Hoeffding
$\mathbf{Pr}[|X - \mu| \ge t] \le 2 \exp \bigg( - \frac{2\delta^2 \mu^2}{n} \bigg) = 2 \exp \bigg( - 2\delta^2 p^2 n \bigg)$
So Chernoff > Hoeffding.

Example 2 (Meal Delivery) Let $X_1,\ \dots,\ X_n$ be i.i.d. r.v. representing the weight of each meal. For any $i$ , $\mu = \mathbf{E}[X_i] = 300$ , and $X_i \in [250, 350]$ . Then we can give a estimate of the total number $n$ by $\hat{n} = \frac{X}{\mu} = \frac{\sum_{i=1}^n X_i}{\mu}$ . $n > 200$ .
$\mathbf{Pr}[|\hat{n} - n| \ge \delta n] = \mathbf{Pr}[|X - n \mu| \ge \delta n \mu] \le 2 \exp \bigg(- \frac{2}{10000} \delta^2 \mu^2 n\bigg) \le 0.03 \%$

Concentration on Margtinalges

Example 3 (Fair games) For each round the gambler wins $X_t$ money. Because it is a fair game, $\mathbf{E}[X_t] = 0$ . Let $Z_t = \sum_{i=1}^t X_i$ be the amount of money he won after t-th rounds. Then it is clear
$\mathbf{E}[Z_{t+1}\ \vert\ X_0,\ \dots,\ X_t] = Z_t$
Proof:
$\mathbf{E}[Z_{t+1}\ \vert\ \overline{X_{0,t}}] = \mathbf{E}[Z_{t} + X_{t+1}\ \vert\ \overline{X_{0,t}}] = Z_t + \mathbf{E}[X_{t+1}] = Z_t$
( $Z_t$ is $\sigma(X_0,\ \dots,\ X_t)$ measureable.)

Definition 5 In a probability space $(\Omega, \mathscr{F}, \mathbf{Pr})$ , a sequence of finite varabiles ${Z_n}_{n \ge 0}$ is a martingale if
$\forall n \ge 1, \mathbf{E}[Z_n\ \vert\ Z_1,\ \dots,\ Z_{n-1}] = Z_{n-1}$
Sometimes we say ${Z_n}$ is a martingale w.r.t. another sequence ${X_n}$ if
$\forall n \ge 1, \mathbf{E}[Z_n\ \vert\ X_1,\ \dots,\ X_{n-1}] = Z_{n-1}$
More formally, for a filtration $\mathscr{F}1 \subseteq \mathscr{F}_2 \subseteq \dots \mathscr{F} $, and$ Z_i $is$ \mathscr{F}_i $-measuable, then we call$ {Z_n}$ is a martingale if
$\forall n \ge 1, \mathbf{E}[Z_n\ \vert\ \mathscr{F}_{n-1}] = Z_{n-1}$
Supermartingale:
$\forall n \ge 1, \mathbf{E}[Z_n\ \vert\ \mathscr{F}_{n-1}] \le Z_{n-1}$
Submartingale:
$\forall n \ge 1, \mathbf{E}[Z_n\ \vert\ \mathscr{F}_{n-1}] \ge Z_{n-1}$
If ${Z_n}$ is a martingale then the following property is immediate

Proposition 6 For any $n \ge 1$ , $\mathbf{E}[Z_n] = \mathbf{E}[Z_0]$ .

Example 4 (1-dim random walk) $X_t \in {-1, +1}$ . $Z_t = \sum_{i=1}^t X_i$ . Then ${Z_n}$ is a martingale w.r.t. ${X_t}$ .

Example 5 (The product of independent random variables) $X_1,\ \dots,\ X_n$ are independent r.v. with $\mathbf{E}[X_i] = 1$ . Let $P_k = \prod_{i=1}^k X_i$ .
$\mathbf{E}[P_k\ \vert\ \overline{X_{k-1}}] = \mathbf{E}[P_{k-1} X_k\ \vert\ \overline{X_{k-1}}] = P_{k-1} \mathbf{E}[X_k] = P_{k-1}$
Example 6 (Polya's urn) 一开始罐子里只有一黑一白两个球. 每次摸一个球, 将它复制并将两个都放回.令 $X_n$ 为第 n 轮后罐子里的黑球数量. Let $Z_n = \frac{X_n}{n}$ . Then ${Z_n}_{n \ge 2}$ is a martingale w.r.t. ${X_n}_{n \ge 2}$ . 注意, 我们的轮数从第二轮开始算. 也就是第 n 轮后罐子里有 n 个球.
$\begin{split} \mathbf{E}[Z_{n+1}\ \vert\ \overline{X_{n}}] &= \frac{1}{n+1} \mathbf{E}[Z_{n}(X_{n} + 1) + (1-Z_n) X_{n} \ \vert\ \overline{X_{n}}] \ &= \frac{1}{n+1} \mathbf{E}[Z_{n} + X_{n} \ \vert\ \overline{X_{n}}] \ &= \frac{Z_n + X_n}{n+1} = \frac{X_n}{n} \end{split}$
Example 7 (Doob's martingale) Let $X_1,\ \dots,\ X_n$ be a sequence of r.v. (unnecessarily independent). Let $f(\overline{X_n}) = f(X_1,\ \dots,\ X_n) \in \mathbb{R}$ . For $i \ge 0$ , we define
$Z_i = \mathbf{E}[f(\overline{X_n})\ \vert\ \overline{X_i}]$
In particular, we have $Z_0 = \mathbf{E}[f(\overline{X_n})]$ and $Z_n = f(\overline{X_n})$ . ${Z_n}$ 可以被看作一个随着信息不断增加，对函数信息估计不断准确的过程。

Proposition 7 ${Z_n}$ is a martingale w.r.t. ${X_n}$ .

Proof:
$\mathbf{E}[Z_i\ \vert\ \overline{X_{i-1}}] = \mathbf{E}[\mathbf{E}[f(\overline{X_n})\ \vert\ \overline{X_i}]\ \vert\ \overline{X_{i-1}}] = \mathbf{E}[f(\overline{X_n})\ \vert\ \overline{X_{i-1}}] = Z_{i-1}$
这里用了一下双期望公式:
$\overline{X_i} \supset \overline{X_{i-1}}, \sigma(\overline{X_i}) \subset \sigma(\overline{X_{i-1}})$
如果有两个条件期望，看 sigma 代数比较大的那个.

Azuma-Hoeffding's Inequality

Hoeffding 的推广.

设有一列随机变量 ${X_n}_{n \ge 1}$ 满足 $X_i \in [a_i,\ b_i]$ . 不失一般性可设 $\mathbf{E}[X_i] = 0$ (否则做替换 $X_i - \mathbf{E}[X_i]$ ) 令 $S_k = \sum_{i=1}^k X_i$ . 如果 ${S_n}$ w.r.t. ${X_n}$ 是鞅, 那么
$\mathbf{Pr}[|S_n - S_0| \ge t] \le 2 \exp \bigg(- \frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \bigg)$
Proof: 略

Lec 3

Martingale (cont'd)

Example 1 (Balls-in-a-bag) 袋子里有 $g$ 个绿球和 $r$ 个红球. 我们想通过抽球来估计球的比例 $\frac{r}{r+g}$ . 有两种方法(场景):

Draw with replacement
Let $X_i := 1[\text{the ball is red}]$ . Then $X_i \sim \text{Ber}(\frac{r}{r+g})$ .By Hoeffding,
$\mathbf{Pr}[|X - \mathbf{E}[X]| \ge t] \le 2 \exp \bigg( - \frac{2t^2}{n} \bigg)$
$\mathbf{E}[X] = n\frac{r}{r+g}$ .

Draw without replacement

Also $\mathbf{E}[X] = n\frac{r}{r+g}$ . 因为不放回抽取可以看作先对抽球序列做一个排列，然后按排列一个个抽. 在所有排列中，第 i 个位置为 red 的排列个数当然是 $\mathbf{E}[X_i] = \frac{r}{r+g}$ .

接下来考虑函数 $f(X_1, X_2, \dots, X_n) = \sum_{i=1}^n X_i$ 以及它的 Doob 序列 $Z_i = \mathbf{E}[f(\overline{X_n})\ \vert\ \overline{X_i}]$ .

为了符合 Azuma-Hoeffding 的形式，我们需要做一个差分. 令 $Y_i = Z_i - Z_{i-1},\ i \ge 1$ . 因此
$Z_n - Z_0 = Z_n - \mathbf{E}[f] = \sum_{i=1}^n Y_i$
注意 Azuma-Hoeffding 需要我们给出一个 $Y_i$ 的上下界. 令 $S_i = \sum_{j=1}^i X_j$ ,
$Z_i = \mathbf{E}[f(\overline{X_n})\ \vert\ \overline{X_i}] = S_i + (n-i) \frac{r-S_i}{r+g-i}$

$\begin{split} Y_i = Z_i - Z_{i-1} &= [S_i + (n-i) \frac{r-S_i}{r+g-i}] - [S_{i-1} + (n-i+1) \frac{r-S_{i-1}}{r+g-i+1}] \ &= \frac{g+r-n}{g+r-i} \bigg( X_i + \frac{S_{i-1} - r}{g+r-i+1} \bigg) \in [-1, 1] \end{split}$

Therefore
$\mathbf{Pr}[|Z_n - Z_0| \ge t] = \mathbf{Pr}[|X-\mathbf{E}[X]| \ge t] \le 2 \exp \bigg(-\frac{t^2}{2n}\bigg)$

McDiarmid's Inequality

我们上面使用的这个 Doob 序列非常强大. 因此我们可以采用类似的思想进行推广.

Definition 1 (c-Lipschitz function) A function $f(x_1,\ \dots,\ x_n)$ satisfies c-Lipschitz condition if
$\forall i \in [n],\ \forall x_1, \dots, x_n,\ \forall y_i: |f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, y_i, \dots, x_n)| \le c$
Theorem 2 (McDiarmid's Inequality) 若 f 为 c-Lipschitz function 且 $X_1, \dots, X_n$ 为 independ random variables，那么我们有
$\mathbf{Pr}[|f(X_1, \dots, X_n) - \mathbf{E}[f(X_1, \dots, X_n)]| \ge t] \le 2 \exp \bigg( - \frac{2t^2}{nc^2} \bigg)$
Proof: 略，和上面的证明类似

下面是两个 McDiarmid's Inequality 的应用.

Example 2 (Pattern Matching) 令 $B \in {0, 1}^k$ 为一个固定字符串. 对于一个随机的字符串 $X \in {0, 1}^n$ , $B$ 在 $X$ 中期望出现的次数为？

记 $X_i$ 为 $X$ 的第 i 位， $Y_i,\ 1 \le i \le n-k+1$ 为 $1[X_i X_{i+1} \dots X_{i+k-1} = B]$ . 那么所求即为
$Y = \sum_{i=1}^{n-k+1} Y_i$

$\mathbf{E}[Y] = \sum_{i=1}^{n-k+1} \mathbf{E}[Y_i] = \frac{n-k+1}{2^k}$

记 $f(X_1, \dots X_n) = Y$ ，显然 $f$ 是 k-Lipschitz 的. 那么
$\mathbf{Pr}[|Y-\mathbf{E}[Y]| \ge t] \le 2 \exp \bigg( - \frac{2t^2}{nk^2}\bigg)$
Example 3 (Chromatic Number of $G(n, p)$ )

The most natural way: introduce a varaible $X_e = 1 [\text{the edge exists}]$ for every pair of vertices. Then $\chi(G)$ is 1-Lipschitz.
$\mathbf{Pr}[|f-\mathbf{E}[f]| \ge t] \le 2 \exp \bigg( - \frac{2t^2}{\binom{n}{2}} \bigg)$
这个 bound 并不是很好，因为要让右边为常数， $t = \Omega(n)$ ，没啥用

我们考虑改变一个点的染色至多影响其所有相邻的边。因此令所有点为 ${v_1, \dots, v_n}$ ，然后令 $Y_i,\ 2 \le i \le n$ 为 $v_i$ 与 ${v_1, \dots, v_{i-1}}$ 之间的连边情况（encode 起来）。然后将染色数看作 $f(Y_2, \dots, Y_n)$ . 显然 $f$ 也是 1-Lipschitz 的。
$\mathbf{Pr}[|f-\mathbf{E}[f]| \ge t] \le 2 \exp \bigg( - \frac{2t^2}{n-1} \bigg)$

Stopping Time

问题：如果下标是一个 random variable $\tau$ ，是否仍有 $\mathbf{E}[Z_{\tau}] = \mathbf{E}[Z_0]$ ？先定义好什么叫 "下标是随机变量"。

Definition 3 (Stopping Time) 令 $\tau \in \mathbb{N} \cup {\infty}$ 为一个随机变量. 我们称其为一个 stopping time，如果 $\forall t \ge 0$ ，事件 " $\tau \le t$ " 是 $\mathscr{F}_t$ -measurable 的。

用人话讲就是，在 $t$ 时刻是否停了，根据前面的信息就可以确定。例如，一个赌徒第一次连赢五场就停手，记这个时间为 $\tau$ ，它就是停时（赢没赢五场看最近五场就知道）；但是他最后一次连赢五场这个时间就不是停时，因为你只看前面的轮次无法确定是否是 "最后一次"。

Optional Stopping Theorem (OST)

这个定理给 $\mathbf{E}[Z_\tau] = \mathbf{E}[Z_0]$ 成立提供了三个充分条件。

Theorem 4 (Optional Stopping Theorem) ${X_t}_{t \ge 0}$ 是一个 martingale 并且 $\tau$ 是一个停时 w.r.t. ${\mathscr{F}t}{t \ge 0} $. 那么$ \mathbf{E}[X_\tau] = \mathbf{E}[X_0]$ 如果下面三个条件之一满足：

$\tau$ a.s. 有界。即 $\exists n \in \mathbb{N}$ s.t. $\mathbf{Pr}[\tau \le n] = 1$ ;

$\mathbf{Pr}[\tau < \infty] = 1$ 且存在有限的 M 使得 $\forall t < \tau,\ |X_t| \le M$ ;

$\mathbf{E}[\tau] < \infty$ 且存在常数 $c$ 使得 $\mathbf{E}[|X_{t+1}-X_t|\ \vert\ \mathscr{F}_t] \le c$ 对于任意 $t < \tau$。

Example 4 (Boy or Girl) 下面三种情况的性别比例是否是1：1？

直到生出男孩，一直生；

直到男孩数>女孩数，一直生；

直到男孩数>女孩数或者总孩子数达到10个，一直生.

我们可以直接用 1-dim random walk 来看这个问题。记 $X_t$ 为 $t$ 步后的坐标（也即男孩数-女孩数）。

第一种情况：直接满足 OST 条件三。显然期望有限，并且由于每次移动一步， $c=1$ ；

第二种情况：三个条件都不满足。实际上它肯定不是1：1；

第三种情况：直接满足条件一。当然也可以满足条件二。

满足 OST 条件后 $\mathbf{E}[X_\tau] = \mathbf{E}[X_0] = 0$ 。即期望为1：1.

Lec 4

Proof of OST

First we have
$\forall n \in \mathbb{N},\ \mathbf{E}[X_{\min{n, \tau}}] = \mathbf{E}[X_0]$
So naturally we can decompose $X_r$ into two terms
$\forall n \in \mathbb{N},\ X_\tau = X_{\min{n, \tau}} + 1[\tau > n] \cdot (X_{\tau} - X_n)$

$\mathbf{E}[X_\tau] = \mathbf{E}[X_0] + \lim_{n \rightarrow \infty} \mathbf{E}[1[\tau > n] \cdot (X_{\tau} - X_n)]$

So we only need to verify that
$\lim_{n \rightarrow \infty} \mathbf{E}[1[\tau > n] \cdot (X_{\tau} - X_n)]$
The first condition: obvisouly $1[\tau > n] = 0$ .

The second one:
$\lim_{n \rightarrow \infty} \mathbf{E}[1[\tau > n] \cdot (X_{\tau} - X_n)] \le 2M \lim_{n \rightarrow \infty} \mathbf{E}[1[\tau > n]] = 2M \lim_{n \rightarrow \infty} \mathbf{Pr}[\tau > n] = 0$
because $\mathbf{Pr}[\tau<\infty] = 1$ .

The third one:
$\begin{split} \mathbf{E}[1[\tau > n] \cdot (X_{\tau} - X_n)] &\le \mathbf{E}[\sum_{t=n}^\infty |X_{t+1} - X_t| \cdot 1[\tau > t]] \ &= \mathbf{E} \bigg[ \mathbf{E}[\sum_{t=n}^\infty |X_{t+1} - X_t| \cdot 1[\tau > t]]\ \bigg\vert\ \mathscr{F}_n \bigg] \ &= \mathbf{E} \bigg[ \mathbf{E}[|X_{n+1}-X_n|\ \vert\ \mathscr{F}_n] \cdot 1[\tau > n] + \mathbf{E}\bigg[\sum_{t=n+1}^\infty |X_{t+1} - X_t| \cdot 1[\tau > t]\ \bigg\vert\ \mathscr{F}_n\bigg] \bigg] \ &\le c \mathbf{Pr}[\tau > n] + \mathbf{E}\bigg[\sum_{t=n+1}^\infty |X_{t+1} - X_t| \cdot 1[\tau > t]\bigg] \ &\le \cdots \ &\le c \sum_{t \ge n} \mathbf{Pr}[\tau > t] \end{split}$
Because $\mathbf{E}[\tau] = \sum_{t=0}^\infty \mathbf{Pr}[\tau > t]<\infty$ . Therefore $\sum_{t \ge n} \mathbf{Pr}[\tau > t] \rightarrow 0$ when $n \rightarrow \infty$ .

Doobs martingale inequality

用 OST 可以得出一个序列最大值的 concentration 性质。

Claim 2 Let ${X_t}_{t \ge 0}$ be a martingale w.r.t. itself where $X_t \ge 0$ for every $t$ . Then
$\mathbf{Pr}\bigg[ \max_{0 \le t \le n} X_t \ge \alpha \bigg] \le \frac{\mathbf{E}[X_0]}{\alpha}$
Proof: Define $\tau$ as the index of the first element greater $\alpha$ occurs.
$\tau := \min(n, \min_{t\le n} {t\ \vert\ X_t \ge \alpha})$
It satisfies OST first condition. Then
$\mathbf{Pr}\bigg[ \max_{0 \le t \le n} X_t \ge \alpha \bigg] \le \mathbf{Pr}[X_\tau \ge \alpha] \le \frac{\mathbf{E}[X_\tau]}{\alpha} = \frac{\mathbf{E}[X_0]}{\alpha}$

1-dim Random Walk with Two Absorbing Barriers

两个吸收壁 $x=-a$ , $x=b$ . 记 $\tau$ 为停止时间，求 $\mathbf{E}[\tau]$ .

可以看出无论这个人在哪里，只要他往一个方向走 $a+b$ 步（ $2^{-(a+b)}$ 概率）那么一定结束。因此在无穷的时间内 $\mathbf{E}[\tau]<\infty$ . 并且 $\mathbf{E}[|X_{t+1}-X_t|\ \vert\ \mathscr{F}t] \le 1 $. 因此$ \mathbf{E}[X\tau] = \mathbf{E}[X_0] = 0 $. 此外$ \mathbf{E}[X_\tau] = P_a(-a) + (1-P_a) b $, 故$ P_a = \frac{b}{a+b} $.$

我们需要构造一个和时间t $有关的新 martingale. 构造$ Y_t = X_t^2 - t$. 容易验证它是鞅且满足 OST 第三条件. 故
$\mathbf{E}[\tau] = \mathbf{E}[X_\tau^2]=a^2 P_a + (1-P_a) b^2 = ab$

Pattern Matching

给定一个 pattern $P \in {0, 1}^l$ 。我们知道在随机串中 $P$ 出现次数的期望
$\mathbf{E}[\text{# of occurrence of } P \text{ in } S] = \frac{n-l+1}{2^l}$
那么第一次出现的期望时间是多少？这和 $P$ 本身有关（和它的回文程度有关），因为它影响失配时从哪个状态重新启动。

令 $P=p_1p_2\dots p_l$ . 对于任意 $n \ge 0$ , 我们假设在第 $n+1$ 次抛硬币前有一个新的赌徒 $G_{n+1}$ 带着 1 单位的钱进场。他赌从他进场开始，接下来 $l$ 次翻面结果和 $P$ match. 每次他赢了，他就有两倍的钱，并且在下回合全部押回去；如果输了，他就失去所有的钱。

令 $X_t$ 表示 t 轮的翻面结果， $M_i(t)$ 表示 t 轮时赌徒 $G_i$ 拥有的钱数量。令 $Z_t := \sum_{i=1}^t (M_i(t) - 1)$ 表示所有的赌徒在 t 轮翻面后钱的总和。那么 $Z_t$ 是个 martingale。令 $\tau$ 为第一次有赌徒全赢（即匹配上了），那么由 OST 的第二个条件 $\mathbf{E}[Z_\tau] = \mathbf{E}[Z_0]=0$ . 故
$\mathbf{E}[\tau] = \sum_{i=1}^\tau \mathbf{E}[M_i(\tau)]$
可以看出 $\tau-l+1$ 之前的赌徒一定都离场了，而 $\tau-l+1 \sim \tau$ 之间的赌徒取决于串本身头尾匹配多少。因此上式就等于
$\mathbf{E}[\tau] = \sum_{i=1}^l 2^i \chi_i$
其中 $\chi_i = 1[p_1p_2 \dots p_i = p_{l-i+1} \dots p_l]$ .

计算样例：模式串 $P=HH$ ， $\mathbf{E}[\tau] = 2 + 4 = 6$ ； $P=HT$ ， $\mathbf{E}[\tau] = 4$ 。因为如果 HT 匹配中，第一位匹配成功，第二位匹配成 H，那么我们至少还有一个 H。也就是越不回文越快匹配。最坏的情况就是 HH，因为如果一成为 T 就直接输。

Wald's Equation

Theorem 4 (Wald's Equation) 若 $X_1, X_2, \dots$ 为非负的 i.i.d. copy of r.v. $X$ ，并且 $T$ 是这个序列的一个停时。且 $\mathbf{E}[T], \mathbf{E}[X]<\infty$ . 那么
$\mathbf{E}[\sum_{i=1}^T X_i] = \mathbf{E}[T] \mathbf{E}[X]$
Proof: 令 $Z_i := \sum_{j=1}^i (X_j - \mathbf{E}[X])$ .

An Application: A Routing Problem: 假设有 n 个发送器，每一轮每个发送器都有 $\frac{1}{n}$ 的概率将消息发往路由器 R。如果同时有两个消息及以上那么所有消息作废。问期望多久时间，路由器能集齐每个发送器的消息？

考虑除开发送失败，这就是一个 coupon collector。因此我们考虑两次成功发送之间的期望时间。发送成功的概率是
$p = n(1-\frac{1}{n})^{n-1}\frac{1}{n} \approx e^{-1}$
令 $X_i$ 表示已经成功收到 $i-1$ 个包裹，成功发送下一个包裹需要多久。那么 $X_1 \sim \text{Geo}(p)$ . 因此 $\mathbf{E}[X_1] = e$ . 故
$\mathbf{E}[\sum_{i=1}^T X_i] = \mathbf{E}[T] \cdot \mathbf{E}[X_1] = e n H_n$

Lec 5

Multi-Armed Bandit

有 $k$ 个 arm，每个 arm 的 reward 服从某种分布 $f_i \in [0, 1]$ 且均值为 $\mu_i$ . 不失一般性可设 $\mu_1 \ge \mu_2 \ge \dots \ge \mu_k$ . 现在假设我们可以拉 $T$ 轮，目标是最大化期望意义上的 reward。我们不知道每个 arm 的分布情况。

我们记 $a_t$ 为第 t 轮拉的杆，因此第 t 轮的 reward $X_t \sim f_{a_t}$ . 我们的目标等价于最小化 regret（后悔）：
$R(T) := T\mu_1 - \mathbf{E}[\sum_{i=1}^T X_t] \ge 0$
我们记 $\Delta_i = \mu_1 - \mu_i \ge 0$ ， $n_i(t) = \sum_{s=1}^t 1[a_s = i]$ （也就是到 t 轮位置拉了第 i 个杆多少次），那么

Proposition 1
$R(T) = \sum_{i=2}^k \Delta_i \cdot \mathbf{E}[n_i(T)]$

The Explore-then-Commit (ETC) Algorithm

先每个杆拉 $L$ 次，也就是总共探索了 $kL$ 次。然后选择最好的杆一直拉。我们记 $\hat{\mu}i$ 为探索后得到的均值，
$R(T) = L \sum_{i=1}^k \Delta_i + (T-kL) \sum_{i=2}^k \Delta_i \cdot \mathbf{Pr} [\hat{\mu_i} > \max_{j \neq i} \hat{\mu_j}]$
后面那个式子的意思是 "第 i 根杆是 explore 后最好的"。

由于第 1 根杆事实上是最好的，所以
$\mathbf{Pr} [\hat{\mu_i} > \max_{j \neq i} \hat{\mu_j}] \le \mathbf{Pr} [\hat{\mu_i} > \hat{\mu_1}]$
令 $X_j$ 表示第 $j$ 轮 $f_i$ 回馈的 reward， $Y_j$ 表示 $f_1$ 回馈的 reward， $Z_j = X_j - Y_j \in [-1, 1]$ . 那么 $\mathbf{E}[Z] = \mathbf{E}[\sum_{j=1}^L Z_i] = -L \Delta_i$ .
$\mathbf{Pr} [\hat{\mu_i} > \hat{\mu_1}] = \mathbf{Pr} [Z > 0] = \mathbf{Pr} [Z - \mathbf{E}[Z] > L \Delta_i] \le \exp \bigg( -\frac{2(L\Delta_i)^2}{4L} \bigg) = \exp \bigg( -\frac{L\Delta_i^2}{2} \bigg)$
因此
$\begin{split} R(T) &\le L \sum_{i=1}^k \Delta_i + (T-kL) \sum_{i=2}^k \Delta_i \cdot \exp \bigg( -\frac{L\Delta_i^2}{2} \bigg) \ &\le \sum_{i=1}^k \bigg( L + T \Delta_i \exp \bigg( -\frac{L\Delta_i^2}{2} \bigg) \bigg) \end{split}$
这里做了一些放缩。 $-kL$ 这一项肯定是不关键的，因为 explore 部分一定比 commit 部分短很多，这部分损失可以略掉。

现在这里要确定参量 $L$ 。我们先考虑最坏情况的 $\Delta_i$ , 求导可得 $\Delta_i = \sqrt{\frac{1}{L}}$ 时取得最大值，故
$RHS \le \sum_{i=1}^k \bigg( L + \frac{T e^{-1/2}}{\sqrt{L}} \bigg)$
optimize $L = \Theta(T^{\frac{2}{3}})$ ，有 $R(T) \le \Theta(kT^{\frac{2}{3}})$

The Upper Confidance Bounds (UCB) Algorithm

ETC 算法改进之处：这个算法应该在 explore 阶段就 adaptively 使用目前获得的所有信息。

UCB 算法对于每个 arm i 维护了一个置信区间 $[a_i, b_i]$ ，使得 $\mu_i$ 高概率落在这个区间内。

对于给定的 $\delta_i(t)$ ，我们考虑如何选择 $c_i(t)$ ，其中 $a_i(t) = \hat{\mu}_i - c_i(t)$ ， $b_i(t) = \hat{\mu}_i + c_i(t)$ ，使得（为了简洁下面略掉 $t$ ）
$\mathbf{Pr}[|\mu_i - \hat{\mu_i}| > c_i] \le 2 \exp (-2n_i c_i^2) \le \delta_i$
因此我们选择 $c_i = \sqrt{\frac{\log(2/\delta_i)}{2 n_i}}$ .

注意到如果 upper bound $b_i$ 比较大，我们去 explore 它是有好处的，因为这要么说明 $\hat{\mu}_i$ 大，要么说明 $n_i$ 小（即它还没有被充分探索）

regret bound 这里就略去了，思想就是分类讨论。最后结果是
$R(T) \le T \Delta + \frac{6 k \log T}{\Delta} + k^2 = \Theta(\sqrt{k T \log T})$
这个 regret bound 对 $\sqrt{\log T}$ 这个因子来说是 optimal 的。

Lec 6

An Online Number Guessing Game

这个游戏有 $T$ rounds. 每个 round $t$ ，

adversary 选择一个数 $y_t \in [0, 1]$ ;

你选择一个数 $x_t \in [0, 1]$ (without knowing $y_t$ )

$y_t$ 被揭开，你付出 $(x_t - y_t)^2$ 的代价

目标是最小化 $\sum_{t=1}^T (x_t - y_t)^2$ .

这个游戏有两种 settings:

Stochastic Setting: 每个 $y_t$ 从一个固定的分布 $\mathscr{D}$ 中独立地抽取;

Adversarial Setting: adversary 可以任意选择 $y_t$ , 就好像他知道 player 的策略一样。

Stochastic Setting

我们需要最小化 $\mathbf{E}[\sum_{t=1}^T (x_t - y_t)^2]$ . 如果 $Y \sim \mathscr{D}$ ，我们只需要最小化 $x_t^2 - 2 \mathbf{E}[Y] x_t$ . 显然如果我们知道分布，那么最优策略就是每次选期望。那如果我们不知道分布呢？由于我们已经知道最佳策略，我们可以定义 regret
$R(T) = \sum_{t=1}^T (x_t - y_t)^2 - \sum_{t=1}^T (x^* - y_t)^2$
其中 $x^* = \mathbf{E}[Y]$ .

The Adversarial Setting

注意这时的 regret 是
$R(T) = \max_{y_1, y_2, \dots, y_T} \bigg( \sum_{i=1}^T (x_t-y_t)^2 - \min_{x \in [0, 1]} \sum_{t=1}^T (x - y_t)^2 \bigg)$
这时这个 regret 可能是负的。式子里面减去的那一项相当于是这个问题 ""离线版本" 的最优策略。

## The Follow-The-Leader Algorithm

可以算出离线最优解
$x^* = \arg\min_{x \in [0, 1]} \sum_{t=1}^T (x-y_t)^2 = \frac{\sum_{t=1}^T y_t}{T}$
那么如果这个问题是在线的，在进行第 t 轮时我们只知道 $1 \sim t-1$ 轮的信息。自然我们可以给出 $t-1$ 轮时的离线最优解 $\frac{1}{t-1} \sum_{s=1}^{t-1} y_s$ .

那么我们每次都采用离线最优解，即 $x_t = x_{t-1}^*$ . 这种算法有
$R(T) \le 2 H_T = \Theta(\log T)$
其中 $H_t = \sum_{t=1}^T \frac{1}{t}$ .

Proof:
$\begin{split} \R(T) &= \sum_{t=1}^T (x_t - y_t)^2 - \sum_{t=1}^T (x_T^* - y_t)^2\\&\le \sum_{t=1}^T (x_t - y_t)^2 - \sum_{t=1}^T (x_t^* - y_t)^2 \ &= \sum_{t=1}^T (x_t + x_t^* - y_t) (x_t - x_t^_) \ &\le 2 \sum_{t=2}^T |x_{t-1}^_ - x_t^*| \le 2 H_T \end{split}$

Online Learning

我们可以推广上述游戏的 setting，那么这就得到了我们 online learning / optimization 的框架。令 $V \subseteq \mathbb{R}^d$ . 我们也玩 $T$ rounds，并且每一轮

你选择一个 $x_t \in V$ .

adversary 选择一个损失函数 $l_t: V \rightarrow \mathbb{R}$ ;

你付出代价 $l_t(x_t)$

然后我们需要最小化 regret
$R(T) = \sum_{t=1}^T l_t(x_t) - \min_{x \in V} l_t(x)$
Follow-the-leader 也可以用于这个推广的问题，也就是我们每轮选择局部最优解
$x_t = \arg \min_{x} \sum_{t=1}^{T-1} l_t(x)$
敏锐的读者应该能发现 $R(T)$ 依赖于 $l_t(x_{t-1}^) - l_t(x_t^)$. 因此我们很容易给出一个例子卡掉 FTL:

Example 1 (Failure of FTL) 考虑 $V = [-1, 1]$ , $l_t(x) = a_t x$ . 这里 $a_1 = -0.5$ 并且对于 $t > 1$ :
$a_t = \begin{cases} 1&\text{ if } t \text{ is even} \ -1&\text{ if } t \text{ is odd} \end{cases}$
那么一开始第一次乱选，假设选 $x_1^* = 0$ . 然后第二次肯定选 $x_2^* = 1$ ，第三次选 $x_3^* = -1$ ... 后面就是偶数选 $1$ 奇数选 $-1$ ，这样 regret bound 和 $T$ 是线性的，还不如每次选 $0$ .

而且为了能进行 FTL，你需要先解一个离线优化问题，这并不总是容易的。这让我们先把关注点放到离线优化上来。

Lec 7

Convex Optimization

凸集的定义： $\forall x, y \in V \subseteq \mathbb{R}^n,\ \lambda \in [0, 1]$ ，有 $\lambda x + (1- \lambda)y \in V$ .

凸函数定义： $\text{epi}(f) = {(x, y) \in \mathbb{R}^{n+1}\ \vert\ y \ge f(x)}$ .

一些有用的性质：

Jensen 不等式

泰勒展开

一阶优化条件
$x^* = \arg \min_{x} f(x) \Longleftrightarrow \nabla f(x^*) = 0$

$x^* = \arg \min_{x \in V} f(x) \Longleftrightarrow \forall y \in V,\ \nabla f(x^_)^T (y - x^_) \ge 0$

Gradient Descent Algorithm

梯度下降就是如下的更新规则
$x_{t+1} = x_t - \eta \nabla f(x_t)$
定量地计算每次梯度下降的下降程度，我们需要引入势能函数 $\phi(x) := \frac{1}{2} \Vert x - x^* \Vert^2$ . 那么
$\begin{split} \phi(x_{t+1}) - \phi(x_t) &= \frac{1}{2} \bigg(\langle x_{t+1} - x^*, x_{t+1} - x^* \rangle - \langle x_{t} - x^*, x_{t} - x^* \rangle \bigg) \ &= \frac{1}{2} \bigg(\langle x_{t} - \eta \nabla f(x_t) - x^*, x_{t} - \eta \nabla f(x_t) - x^* \rangle - \langle x_{t} - x^*, x_{t} - x^* \rangle \bigg) \ &= -\eta \langle \nabla f(x_t), x_t - x^* \rangle + \frac{1}{2} \eta^2 \Vert \nabla f(x_t)^2 \Vert \ &\le \eta(f(x^_) - f(x_t)) + \frac{1}{2} \eta^2 \Vert \nabla f(x_t)^2 \Vert \end{split}$
那么求和加调整式子，我们有这个经典的 bound
$\sum_{t=0}^{T-1} (f(x_t) - f(x^_)) \le \frac{\phi(x_0) - \phi(x_T)}{\eta} + \frac{\eta}{2} \sum_{t=0}^{T-1} \Vert \nabla f(x_t)^2 \Vert^2$
如果 $f$ 是 L-Lipschitz 的，也即有 $\Vert \nabla f \Vert \le L$ . 那么
$\sum_{t=0}^{T-1} (f(x_t) - f(x^_)) \le \frac{\Vert x_0 - x^_ \Vert^2}{2\eta} + \frac{\eta TL^2}{2} \le \Vert x_0 - x^* \Vert L \sqrt{T}$
如果是 constrained 的，我们需要做一个投影
$x_{t+1} = \prod_V (x_t - \eta \nabla f(x_t))$
这个算法就变成 PGD (projected gradient descent). 上面的 bound 对 PGD 仍然成立，这是因为我们只要证明
$\phi(\prod_V (x_t - \eta \nabla f(x_t))) \le \phi(x_t - \eta \nabla f(x_t))$
也即投过之后势能更小，那么上面的分析仍然成立。投影之后势能当然会更小，因为 $x^$ 是在 $V$ 内，如果跑到 $V$ 之外了，那投进来当然是更好的。证明也很简单，设 $y = x_t - \eta \nabla f(x_t)$，只要证 $\Vert x_{t+1} - x^ \Vert^2 \le \Vert y - x^* \Vert^2$. 由于
$x_{t+1} = \arg\min_{x \in V} \Vert x - y \Vert^2$
因此由一阶优化条件 $\langle x^-x_{t+1}, y -x_{t+1} \rangle \le 0$. 故
$\begin{split} \Vert y - x^_ \Vert^2 &= \Vert y - x_{t+1} + x_{t+1} - x^* \Vert^2 \ &= \Vert y - x_{t+1} \Vert^2 + \Vert x^* - x_{t+1} \Vert^2 - 2 \langle x^_-x_{t+1}, y -x_{t+1} \rangle\\&\le \Vert x^_ - x_{t+1} \Vert^2 \end{split}$

Online Gradient Descent

回忆我们对 online learning 的假设。那么我们针对这个问题的 online gradient descent / online projected gradient descent (OPGD) 算法可以直接得到
$x_{s+1} = x_s - \eta_s \nabla l_t(x_s)$
我们可以用和 GD 相似的方法分析出一个相似的 bound。课程这里用分析 "连续版本" 的方法来得到这个 bound。

大致地讲，其引入如下 settings:

对于 $s = 0, 1, \dots, T$ ，$\mathscr{T}s = \sum{i=0}^{s-1} \eta_i $. 也就是说把两轮之间看作有一段$ \eta_i $的时间。这样以 1 的速度匀速移动过去就走了$ \eta_i$ 的距离.

$y_0 = x_0$

$\frac{\text{d} y_t}{\text{d} t} = -g_t$，其中 $g_t = \nabla l_{s_t}(x_{s_t})$，$s_t$ 是这个时间所属的轮数。

形象来讲，把一步步 gradient descent 变成了每次在两个点之间做匀速直线运动。这样分析（具体略去）可以得到 regret bound
$\sum_{s=0}^{T-1} \frac{\phi(x_s) - \phi(x_{s+1})}{\eta_s} + \frac{\eta_s \Vert \nabla l_s (x_s) \Vert^2}{2} \le diam(V) \cdot L \cdot \sqrt{T}$
假设 $l$ 是 L-Lipschitz. 用在最开始的那个猜数问题，这个界比 FTL 要糟糕很多。

Strongly Convex Function

如果 $l_s$ 满足 $\mu$ -strongly convex，这个性质即对于函数 $f$ ，对任意 $x, y$
$f(y) \ge f(x) + \nabla f(x)^T (y-x) + \frac{\mu}{2} \Vert y -x \Vert^2$
选择合适的 $\mu_s = \frac{1}{(s+1) \mu}$ 那么这个 bound 可以进一步优化为
$\sum_{s=0}^{T-1} l_s(x_s) - l_s(x^*) \le \frac{L^2}{2 \mu} H_T$
用在猜数上和 FTL 的 bound 一样好。

Lec 8

Learning with Expert Advice

在这个问题中，我们可以将 $[n]$ 中的每个元素当成一个 expert 并且在每一轮，player 需要选择一个 expert 来 follow。然后 adversary 展示每个专家的 loss。因此每一轮：

The player picks some $x_t \in [n]$ ;

The adversary picks some $l_t: [n] \rightarrow \mathbb{R}$ ;

$l_t$ is revealed and the player pays the cost $l_t(x_t)$.

那么 regret 函数就是 $\sum_{t=0}^{T-1} l_t(x_t) - l_t(x^*)$ . 显然 $V = [n]$ 并不是个凸集，因此不能直接用之前的方法。我们可以用一个随机化的方法：

The player pciks some distribution $x_t \in \Delta_{n-1}$ where $\Delta_{n-1} ={x \in [0, 1]^n: \sum_{i=1}^n x_i = 1}$ . 这个记号以后会经常出现，名字叫 probability simplex.

The adversary picks some $l_t: [n] \rightarrow \mathbb{R}$ ;

The player plays $A_t \sim x_t$ ;

$l_t$ is revealed and the player pays the cost $l_t(A_t)$.

那么对于一个固定的 expert $j \in [n]$ ，期望 regret 为
$\mathbf{E}[\sum_{t=1}^T l_t(A_t) - l_t(j)] = \sum_{t=1}^T \langle l_t, x_t - e_j \rangle$
那么这一问题就 reduce 了一个 $\Delta_{n-1}$ 上个在线凸优化问题。我们可以设定初始分布为 $x_1 = (\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n})$ . 然后后面用 OPGD. 这样期望 regret bound 就可以得到
$\frac{diam(\Delta_{n-1})}{2 \eta} + \frac{\eta T n}{2} = \Theta(\sqrt{nT})$

Online Stochastic Gradient Descent

比较 learning with expert advice 问题与 multi-armed bandits 问题，主要有两处不同：

每个 round 结束后，multi-armed bandits 只能看到 $l_t(A_t)$ 而不是整个 $l_t$ 向量。

adversary 从一个固定的分布中选择每轮的 $l_t$ 。

我们放松第二点，允许它像 learning with expert advice 一样任意选 $l_t$ . 这称作 adversarial multi-armed bandit problem.

由于我们只知道一个位置的 $l_t(A_t)$ ，我们需要用这个信息估计一个 $\hat{l_t}$ . 首先我们要满足无偏性，也就是说 $\mathbf{E}[\hat{l_t}] = l_t$ . 然后将这个估计的 $\hat{l_t}$ 放入 OGD 算法. 一个自然的选择是
$\forall i \in [n],\ \hat{l_t} = \frac{1_{A_t = i}}{x_t(i)} l_t(i)$
那么显然有 $\mathbf{E}[\hat{l_t}] = l_t$ . 那么
$\begin{split} \mathbf{E}\bigg[ \sum_{r=1}^T \langle l_t, x_t - e_j \rangle \bigg]&\le \frac{1}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \mathbf{E} [\Vert \hat{l_t} \Vert^2] \end{split}$
注意
$\mathbf{E} [\Vert \hat{l_t} \Vert^2] = \mathbf{E} [\mathbf{E} [\Vert \hat{l_t} \Vert^2]\ \vert\ \mathscr{F}_{t-1}] = \mathbf{E} \bigg[\sum_{i=1}^n x_t(i) (\frac{l_t(i)}{x_i(t)})^2 \bigg] \le \mathbf{E} \bigg[\frac{1}{x_t(i)}\bigg]$
由于 $x_t(i)$ 可能很趋近于 0，因此我们需要一个下界。我们修改算法，每次更新的时候给定一个下界
$\tilde{x_t} = (1 - \alpha) x_t + \alpha \cdot \mathbf{u}$
其中 $\mathbf{u} = (\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n})$ . 那么 bound 改进为
$\begin{split} \mathbf{E}\bigg[ \sum_{r=1}^T \langle l_t, x_t - e_j \rangle \bigg]&\le \frac{1}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \mathbf{E} [\Vert \hat{l_t} \Vert^2] + \alpha \cdot T \ &\le \frac{1}{2 \eta} + \frac{\eta n^2 T}{2 \alpha} + \alpha \cdot T \le 3 \cdot 4^{-1/3} \cdot (nT)^{2/3} \end{split}$
这个 bound 对 $T$ 是 $T^{\frac{2}{3}}$ . 最好的 bound 是 $\sqrt{T}$ (使用 OSMD).

Online Shortest Paths

将每个路径看成一个 expert: path 的数量太多了.

由于我们的策略是考虑 expert 上的分布，我们也可以考虑 $u$ 到 $v$ 之间的 probability flow（概率流，有点像网络流）

那么我们只要指定一个初始概率流，根据 OSGD 调整它就可以了。每轮的 loss 是 $\sum_{e \in E} p(e) \cdot w_t(e)$ .

Lec 9

这一章讲 Mirror Descent. 由于不在考试范围内先不整理. 做作业时候看过了一遍.

Lec 10

Basics of Markov Chains

Definition 1 (Markov Chain) 一列随机变量 $X_0, X_1, \dots$ 是 Markov Chain，如果对于任意的 $t$ 以及任意的 $a_0, a_1, \dots, a_{t+1}$ ,
$\mathbf{Pr}[X_{t+1} = a_{t+1} \vert X_0 = a_0, X_1 = a_1, \dots, X_t = a_t] = \mathbf{Pr}[X_{t+1} = a_{t+1} \vert X_t = a_t]$
也就是说 $t+1$ 时的状态只取决于 $t$ 时的状态。

记 $\Omega$ 为状态集，即 $X_t$ 能取到的所有可能的值。这里我们只考虑 $\Omega = [n]$ ，并且假设转移是 time-homogeneous 的，也就是任何时间的转移矩阵都是一样的。那么我们可以用一个转移矩阵（transition matrix） $P = (p_{ij})_{i, j \in [n]}$ 来表示。

同样一个 Markov chain 还可以看作在图上随机游走。在任意时间 $t \ge 0$ ，我们用 $\mu_t \in \Delta_{n-1}$ 表示 $X_t$ 的分布，即
$\mu_t(i) := \mathbf{Pr}[X_t = i]$
我们有 $\mu_{t+1}^T = \mu_{t}^T P$ . 那么有下面的结论

Proposition 2 $\mu_{t}^T = \mu_{0}^T P^t$ .

Definition 3 (Stationary Distribution) 一个分布 $\pi$ 是 $P$ 的一个 stationary distribution 如果它是转移的不动点
$\pi^T P = \pi^T$
我们对以下问题感兴趣：

每一个 Markov Chain 都有 stationary distrubution 吗？

是否唯一？

是否收敛到这个 stationary distrubution？

收敛速率？

Fundamental Theorem of Markov Chains

Theorem 4 (Perron-Frobenius Theorem) 每一个非负矩阵 A 都有一个非负实特征向量 $x$ ，对应特征值 $\lambda = \rho (A) = \max {|\lambda_i|}$ .

证明这里从略。

那么由于 Markov Chain 的转移矩阵 $P$ 是一个 stochastic matrix，则有
$P \cdot \mathbf{1} = \mathbf{1}$
而 $P^T$ 和 $P$ 特征值相同，因此 $\rho(P^T)=1$ . 根据上面的定理，存在对应非负实特征向量 $\pi$ 使得
$P^T \pi = \pi$
因此 $\frac{\pi}{\Vert \pi \Vert}$ 就是所要的 stationary distribution.

接下来考虑收敛性和惟一性。考虑如下 Markov Chain
$P = \left[ \begin{matrix} 1-p&p \ q&1-q \end{matrix} \right]$
很简单可以验证
$\pi = (\frac{q}{p+q}, \frac{p}{p+q})^T$
接下来我们可以来 check 其是否收敛。
$\begin{split} \Delta_t &:= |\mu_t(1) - \pi(1)| = |1-p-q| \cdot \Delta_{t-1} \end{split}$
因此可以看出除了 $p=q=0$ 或 $p=q=1$ ，其它情况都是收敛的。

当 $p=q=0$ 时，这个图退化成两个点，不连通。因此我们可以引出第一个限制：

Definition 5 (Irreducibility) A finite Markov Chain is irreducible if its transition graph is strongly connected.

如果 P 的转移图不强连通，我们称 $P$ reducible.

当 $p=q=1$ 时，可以看出图是一个二元环，此时其会在这两个点之间左右横跳。因此我们引入第二个限制：

Definition 6 (Aperiodicity) A Markov Chain is aperiodic if for any state $v$ ，it holds that
$\gcd{|c|\ \vert\ c \in C_v} = 1$
也即每个点上的环的长度的最大公约数为 1，这样才能避免 "周期性"。

结合这两个条件就有

Theorem 7 (Fundamental theorem of Markov chains) If a finite Markov chain $P \in \mathbb{R}^{n \times n}$ is irreducible and aperiodic, then it has a unique stationary distribution $\pi \in \mathbb{R}^n$ . Moreover, for any distribution $\mu \in \mathbb{R}^n$ ,
$\lim_{t \rightarrow \infty} \mu^T P^t = \pi^T$

Coupling

Definition 11 (Total Variation Distance) The total variation distance between two distribution $\mu$ and $\nu$ on a countable state space $\Omega$ is given by
$D_{TV}(\mu, \nu) = \frac{1}{2} \sum_{x \in \Omega}|\mu(x) - \nu(x)|$
Lemma 12
$D_{TV}(\mu, \nu) = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|$
这里 $\mu(A) := \sum_{x \in A} \mu(x)$ .

两个分布的 coupling 只是他们的一个联合分布。

Definition 13 (Coupling) 令 $\mu$ 和 $\nu$ 为相同空间 $\Omega$ 上的两个分布。令 $\omega$ 是 $\Omega \times \Omega$ 上的一个分布。如果 $(X, Y) \sim \omega$ 能推出 $X \sim \mu$ 以及 $Y \sim \nu$ ，那么 $\omega$ 称作 $\mu$ 和 $\nu$ 的一个 coupling。

Lemma 14 (Coupling Lemma)
$\mathbf{Pr}_{(X, Y) \sim \omega} [X \neq Y] \ge D_{TV} (\mu, \nu)$
And there exists a coupling $\omega^$ such that
$\mathbf{Pr}_{(X, Y) \sim \omega^_} [X \neq Y] = D_{TV} (\mu, \nu)$
Proof: 略

Coupling 给两个分布的距离提供了一个上界。

Lec 11

Proof of FTMC

Claim 2 Let $P \in \mathbb{R}^{n \times n}$ 是一个 irreducible 并且 aperiodic 的 Markov chain. 那么它满足
$\exists t^*: \forall i, j \in [n]: P^{t^*} (i, j) > 0$
Proof: ireducibility 蕴含了
$\forall i, j: \exists t: P^t(i, j) > 0$
由于 aperiodic 的性质，对于任意一对 $i, j$ ，存在 $t_{ij}$ 使得 $\forall t \ge t_{ij}$ , $P^t(i, j) > 0$ . 因此取这些 $t_{ij}$ 的 max 上面得证。

下面我们来证明 FTMC。我们只需要证
$\lim_{t \rightarrow \infty} D_{TV} (\mu_t, \pi) = 0$
我们构造一个 coupling ${X_t}$ 和 ${Y_t}$ ，其中 $X_0 \sim \mu_0$ ， $X_1 \sim \mu_1$ ， $\dots$ ，而 $Y_i \sim \pi$ 。令 $(X_t, Y_t) \sim \omega_t$ . 且如果某时刻 $t$ 有 $X_t = Y_t$ ，那么令之后它们都相等，即 $X_{t'} = Y_{t'}\ \forall t' > t$ .

由上面的 claim，存在一个全局 $t^$，使得 $\forall i, j$, $P^{t^}(i, j) = \alpha \ge 0$. 那么接下来我们来证明
$\mathbf{Pr}[X_{t^_} = Y_{t^_}] \ge \alpha^2$
简略说一下，如果 $t^$ 前它们已经相遇了，这时候它们相等的概率就是 1；如果直到 $t^-1 $它们还没相遇，那么$ \mathbf{Pr}[X_{t^} = Y_{t^} | \overline{B}] \ge \mathbf{Pr}[X_{t^} = 1] \mathbf{Pr}[Y_{t^} = 1] = \alpha^2 $.$

这样我们就有\mathbf{Pr}[X_{t^} \neq Y_{t^}] \le (1-\alpha^2)$. 类似地
$\mathbf{Pr}[X_{2t^_} \neq Y_{2t^_}] = \mathbf{Pr}[X_{t^_} \neq Y_{t^_}] \mathbf{Pr}[X_{2t^_} \neq Y_{2t^_} \vert X_{t^_} \neq Y_{t^_}] \le (1-\alpha^2)^2.$
这说明 $\mathbf{Pr}[X_t \neq Y_t] \rightarrow 0$ 当 $t \rightarrow \infty$ 时. 证毕.

Mixing Time

给定一个误差 $\varepsilon$ ，我们称一个 Markov chain 对于此误差的 mixing time 为任给初始分布，使得其离 stationary distribution 距离小等于误差的所需最少的时间。也即
$\tau_{mix}(\varepsilon) := \arg \min_{t} \max_{\mu_0} D_{TV}(\mu_t, \pi) \le \varepsilon$
通常将 $\tau_{mix}(1/4)$ 简记为 $\tau_{mix}$ .

考虑我们的 coupling lemma，我们只要能构造一个 coupling $\omega_t$ 使得
$\mathbf{Pr}_{(X_t, Y_t) \sim \omega_t} [X_t \neq Y_t] \le \varepsilon$
那么就有 $\tau{mix}(\varepsilon)\le t $. 在实践上，我们可以假设$ X_t $和$ Y_t $来自任意的初始分布（也就是任给两个初始分布$ \mu_0 = X_0 $，$ \nu_0 = Y_0 $。那么当然对$ \pi$ 也是成立的。）.

Example 1 (Random walk on hypercube) 每次选一个位置 $i \in [n]$ ，有一半的概率将这个位翻转。分析这个过程的 mixing time.

先要验证这个 graph 是 irreducible 且 aperiodic 的。

我们先把问题等价成每次选一个位置 $i$ ，然后选一个 $b \in {0, 1}$ ，将这个位设成 $b$ . 第一个 coupling 就是让两个人行为完全一样，这样假设每个位都是不同的，但是假设一个位相同了，它们后面就永远相同了。这是一个 coupon collector，因此 mixing time 为
$\tau_{mix}(\varepsilon) \le n \log \frac{n}{\varepsilon}$
观察到其实我们一次能改变两位. 所以我们可以优化我们的 coupling 策略如下：

如果当前我们操作的位相同，那么表现一样；

如果当前操作的位不同，选择两个不同的位将它们改到相同；

剩一个，只要表现相同即可。

这能减少一半的时间：
$\tau_{mix} \le \frac{1}{2} n \log n + O(n)$
Example 2 (Shuffling Cards) 有一叠 $n$ 张卡片，考虑如下洗牌规则：随机抽一张将它放到顶上。问 mixing time？

coupling：两叠卡片每次选相同的卡（花色和点数都一样），这样洗好的时间其实也是 $n$ 张牌的 coupon collector. 因此
$\tau_{mix} (\varepsilon) \le n \log \frac{n}{\varepsilon}$

Lec 12

Reversible Chains

一个 Markov chain 是 (time) reversible 的如果存在一个分布 $\pi$ 满足
$\forall i, j \in [n],\ \pi(i) P(i, j) = \pi(j) P(j, i)$
这个条件也叫做 细致平衡条件（Detailed Balance Condition）. 直观理解：transition graph 是 "无向图".

Proposition 1 如果 P 对分布 $\pi$ reversible，那么 $\pi$ 是 $P$ 的一个 stationary distribution.

Proof:
$\pi^T P(j) = \sum_{i\in [n]} \pi(i) P(i, j) = \sum_{i\in [n]} \pi(j) P(j, i) = \pi(j)$
reversible chain 这一名字来源于如果 $X_0, X_1, \dots, X_t$ 是 reversible 的，那么分布 $(X_0, X_1, \dots, X_t)$ 和分布 $(X_t, X_{t-1}, \dots, X_1, X_0)$ 是完全相同的分布。

由于 reversible chain 的矩阵是 symmetric 的（in some sense），这带来了很多好的性质。

Metropolis Algorithm

给定一个分布 $\pi$ ，怎样构造一个 Markov chain 收敛到 $\pi$ ？很容易我们可以令 $P = \mathbf{1} \pi^T$ . 但如果我们只允许 $P$ 的一个给定的 entries 集合非 0 呢？这一问题等价于在给定的图 $G$ 上赋予转移概率，使得 $\pi$ 是它的 stationary distribution.

令 $\Delta$ 为图中除了自环以外的最大度数。对于每个 $i \in [n]$ ，令邻居集合（不包含自己） ${j_1, j_2, \dots, j_d}$

"uniformly" 随机选择一个 $k \in [\Delta + 1]$ .

如果 $1 \le k \le d$ ，那么有 $\min{\frac{\pi(j_k)}{\pi(i)}, 1}$ 的概率移动到 $j_k$ ；

否则如果 $d+1 \le k \le \Delta+1$ ，do nothing.

最后产生的 transition matrix 就是
$P(i, j) = \begin{cases} 0& i \not\sim j \ \frac{1}{\Delta+1} \min{\frac{\pi(j)}{\pi(i)}, 1}&i \neq j, i \sim j \ 1 - \sum_{k \neq i} p_{ik}&i = j \end{cases}$
这样构造出来的 $P$ 是 reversible 的：
$\pi(i) P(i, j) = \pi(i) \frac{1}{\Delta+1} \min{\frac{\pi(j)}{\pi(i)}, 1} = \frac{1}{\Delta+1} \min{\pi(i), \pi(j)} = P(j, i) \pi(j)$

Sample Proper Coloring

考虑对于 $q \ge \Delta$ ，对 $G = (V, E)$ 进行 $q$ 染色的方案数计数是 #P-hard 的。我们考虑近似算法。有如下结论：an approximate counting algorithm is equivalent to an uniform sampler (in many cases). 这里只展示一个方向：

设 $\mathcal{C}$ 为所有染色的集合，给定一个 proper coloring $\sigma$ ，并且我们有一个可以 uniformly 从 $\mathcal{C}$ 中生成一个 proper coloring 的 oracle. 记 $Z = |\mathcal{C}|$ . 那么
$\begin{split} \frac{1}{Z} &= \mathbf{P}r_{x \sim \mathcal{C}}[x = \sigma] \ &= \mathbf{P}r_{x \sim \mathcal{C}}[x(1) = \sigma(1) \land x(2) = \sigma(2) \land \dots] \ &= \prod_{i=1}^n \mathbf{Pr} \bigg[ x(i) = \sigma(i)\ \bigg\vert\ \bigcap_{j 计算这个东西只需要多项式个数的 samples.$

接下来我们来使用 MCMC 制作这个 sampler. 假设我们已经有了一个 proper coloring，我们做如下转移：

Pick $v \in V$ and $c \in [q]$ uniformly at random.

Recolor $v$ with $c$ if possible.

这个图是 aperiodic 的因为有自环。对于 $q \ge \Delta+2$ ，这个图是 irreducible 的。

mixing time: 略

Spectrum of Reversible Markov Chain

Theorem 2 (Spectral Decomposition Theorem) 如果 $P \in \mathbb{R}^{n \times n}$ 是一个 symmetric matrix，那么它有着 $n$ 个实根以及与之相对应的 $n$ 个两两正交的特征向量满足
$P = \sum_{i=1}^n \lambda_i v_i v_i^T$
如果我们令 $V = [v_1, v_2, \dots, v_n]$ 以及 $\Lambda = diag(\lambda_1, \lambda_2, \dots, \lambda_n)$ . 那么上面式子可以写成这个形式
$P = V \Lambda V^T$
下面我们来推导 reversible markov chain 的谱分解定理。令 $\Pi = diag(\pi)$ . 定义 $Q = \Pi^{1/2} P \Pi^{-1/2}$ . 那么 $Q$ 是 symmetric 的
$Q(i, j) = \pi(i)^{1/2} P(i, j) \pi(j)^{-1/2} = \pi(j)^{1/2} P(j, i) \pi(i)^{-1/2} = Q(j, i)$
因此我们可以对 $Q$ 进行谱分解
$Q = \sum_{i=1}^n \lambda_i u_i u_i^T$
如果我们定义 $v_i := \Pi^{-1/2} u_i$ , 那么
$P = \sum_{i=1}^n \lambda_i \Pi^{-1/2} u_iu_i^T \Pi^{1/2} = \lambda_i v_i v_i^T \Pi$
我们声称 $v_1, \dots, v_n$ 是 $P$ 相对于 $\lambda_1, \dots, \lambda_n$ 的特征向量. 这是因为（正交性）
$\begin{split} P v_j &= \sum_{i=1}^n \lambda_i \Pi^{-1/2} u_iu_i^T \Pi^{1/2} \Pi^{-1/2} u_i \ &= \lambda_i v_j \end{split}$
如果我们定义内积 $\langle x, y \rangle_\Pi := x^T \Pi y$ , 那么在这个内积下 $v_1, \dots, v_n$ 是正交的。

Lec 13

Variational Characterization of Eigenvalues

在本章，我们假设 $P$ 是一个 reversible chain w.r.t. $\pi$ 并且有特征值 $\lambda_1, \dots, \lambda_n$ 和特征向量 $v_1, \dots, v_n$ . 因此有分解 $P = \lambda_i v_i v_i^T \Pi$ . 假设 $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ . 那么下列 proposition 成立

Proposition 1 $1 = \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge -1$

Proof: 由于 $P$ 是一个 stochastic matrix， $|\lambda_i| \le 1$ . 并且 $1$ 是一个特征值，对应的特征向量为 $\mathbf{1}$ .

定义 Rayleigh quotient
$R_P(\mathbf{x}) = \frac{\langle \mathbf{x}, P\mathbf{x} \rangle_\Pi}{\langle \mathbf{x}, \mathbf{x} \rangle_\Pi}$
我们可以将 $\lambda_1$ 写成
$\lambda_1 = \max_{\mathbf{x} \neq 0} R_P(\mathbf{x})$
这可以通过 $P$ 的 spectral decomposition 来验证. 假设 $\mathbf{x} = \sum_{i=1}^n a_i v_i$ .
$R_P(\mathbf{x}) = \sum_{i=1}^n\frac{a_i^2}{\sum_{j=1}^n a_j^2} \lambda_i$
这可以看成所有特征值的加权和。因此最大的情况当然是全部放在最大的特征值上。一般地，
$\lambda_k = \max_{\mathbf{x} \neq 0, \mathbf{x} \perp span(v_1, \dots, v_{k-1}} R_P(\mathbf{x})$
这也等价于
$\lambda_k = \max_{\text{k-dim subspace } V \subseteq \mathbb{R}^n} \min_{\mathbf{x} \in V \setminus {0}} R_P(\mathbf{x})$

FTMC for Reversible Chains

由谱分解
$P^t = \sum_{i=1}^n \lambda_i^t v_i v_i^T \Pi = \mathbf{1} \pi^T + \sum_{i=2}^n \lambda_i^t v_i v_i^T \Pi$
由于 $\mu^T \mathbf{1} \pi^T = \pi^T$ . 因此
$\lim_{t \rightarrow \infty} \mu^T P^t = \pi^T + \lim_{t \rightarrow \infty} \mu^T \bigg( \sum_{i=2}^n \lambda_i^t v_i v_i^T \Pi \bigg)$
因此只要这部分成立
$\lim_{t \rightarrow \infty} \mu^T \bigg( \sum_{i=2}^n \lambda_i^t v_i v_i^T \Pi \bigg) = 0$
只有两种情况会使得这个不对： $\lambda_2 = 1$ 或者 $\lambda_n = -1$ . 我们下面将展示这两种情况分别对应reducibility 和 periodicity。回忆 reversible chain 的 irreducible 意味着它 connected，aperiodical 意味着有奇环（not bipartite）

Proposition 3 $\lambda_2 = 1$ 当且仅当这个图不连通.

Proof:
$\begin{split} \lambda_2 &= \max_{\mathbf{x} \neq 0,\ \mathbf{x} \perp \mathbf{1}} R_P(\mathbf{x}) \ &= \max_{\mathbf{x} \neq 0,\ \mathbf{x} 1 - \perp \mathbf{1}} 1 - \frac{\sum_{(i, j) \in E} \pi(i) P(i, j) (x(i) - x(j))^2}{ \sum_{i \in V} \pi(i) x(i)^2} \end{split}$
ok. （仔细想想）

Proposition 4 $\lambda_n = -1$ iff 这个图是二分图。
$\begin{split} \lambda_n &= \min_{\mathbf{x} \neq 0} R_P(\mathbf{x}) \ &= \min_{\mathbf{x} \neq 0} \frac{\sum_{(i, j) \in V^2} \pi(i) P(i, j) (x(i) + x(j))^2 }{2 \sum_{i=1}^n \pi(i) x(i)^2} -1 \end{split}$
$x(i) = -x(j)$ 即图可以被 2-染色.（仔细想想）

Relaxation Time

上面我们分析了 $\lambda_2$ 和 $\lambda_n$ 对收敛性的影响。事实上，如果 $\lambda_2<1$ 且 $\lambda_n > -1$ ，或者等价的 $P$ 是 aperiodic 且 irreducible，那么
$\lim_{t \rightarrow \infty} \mu^T \bigg( \sum_{i=2}^n \lambda_i^t v_i v_i^T \Pi \bigg) = 0$
事实上这两个特征值的绝对值还影响收敛速度。我们定义 $\lambda^* = \max{|\lambda_2|, |\lambda_n|}$ . 定义 relaxation time 为
$\tau_{rel} := \frac{1}{1 - \lambda^*}$
这个 relaxation time 和 mixing time 有一个关系

Theorem 5
$(\tau_{rel}-1) \log \frac{1}{2\varepsilon} \le \tau_{mix}(\varepsilon) \le \tau_{rel} \log \frac{1}{\varepsilon \pi_{min}}$

From Coupling to Spectral Gap

Theorem 6 (Mu-Fa Chen, 1998) 如果有一个 coupling ${\omega_t}$ 使得
$\mathbf{E}_{(X_{t+1}, Y_{t+1}) \sim \omega_t} [d(X_{t+1}, Y_{t+1})\ \vert\ (X_t, Y_t)] \le (1 - \alpha) d (X_t, Y_t)$
那么有 $\lambda^* \le 1 - \alpha$ .

Spectral Gap 即最大的特征值和第二大的特征值之间的距离.

Proof: 略

Coupling: 概率

Spectral Gap: $1 - \lambda^*$ 代数

Graph Expansion: 几何

Graph Expansion

Example 1 考虑图上的随机游走. 由于 $P(i, j) = \frac{1}{deg(i)}$ ，因此 stationary distribution 为 $\pi(i) \sim deg(i)$ . 可以看出完全图的 mixing time 要快于其它图.（mixing time 为常数，几乎走一步就行）

令 $P$ 是一个 $\Omega$ 上的 reversible chain。定义 $i$ 到 $j$ 的概率流 $Q(i, j) := \pi(i) P(i, j)$ . 我们想要定义 "瓶颈": 将图分成两部分，在这两部分之间的流是很少的。先定义 $Q(S, \overline{S}) := \sum_{i \in S, j \in \Omega \setminus S} Q(i, j)$ . 我们定义 $S$ 的 expansion 为
$\Phi(S) = \frac{Q(S, \overline{S})}{\pi(S)}$
这里 $\pi(S) = \sum_{i \in S} \pi(i)$ .

意义：从 $\Omega$ 中抽一个点，condition on 这个点落在 $S$ ，它按照 markov chain 走一步走到 $\overline{S}$ 的概率。

$P$ 的 expansion 定义为
$\Phi(P) = \min_{S \subseteq \Omega: \pi(S) \le \frac{1}{2}} \Phi(S)$
我们总是取小于一半的 $S$ . 使得 $\Phi(S) = \Phi(P)$ 的 $S$ 是这个图的瓶颈.

下面的定理说明了 small expansion implies slow mixing.

Theorem 7
$\tau_{mix}(\varepsilon) \ge \frac{1-2\varepsilon}{2} \frac{1}{\Phi(P)}$
Proof: 令 $S$ 为这个图中的瓶颈.
$\begin{split} \mathbf{Pr}[X_t \in \overline{S}\ \vert\ X_0 \in S] &= \frac{\mathbf{Pr}[X_t \in \overline{S} \land X_0 \in S]}{\mathbf{Pr}[X_0 \in S]} \ &\le \frac{\sum_{i=0}^{t-1} \mathbf{Pr}[X_i \in S \land X_{i+1} \in \overline{S}]}{\mathbf{Pr}[X_0 \in S]} \ &= t \mathbf{Pr}[X_1 \in \overline{S}\ \vert\ X_0 \in S] = t \cdot \Phi(S) \end{split}$
因此
$\mathbf{Pr}[X_t \in S\ \vert\ X_0 = x_0] \ge 1 - t \cdot \Phi(S)$
那么
$D_{TV} (P^t(x_0, \cdot), \pi) \ge 1 - t \cdot \Phi(S) - \pi(S) \ge \frac{1}{2} - t \cdot \Phi(S)$

令它 $\ge \varepsilon$ ，得到 $t \le \frac{1-2 \varepsilon}{2} \frac{1}{\Phi(P)}$ . 因此 mixing time 必须大等于它.

Lec 14

Graph Expansion (Cont')

Graph Expansion 还能在一般的带权无向图上定义
$\Phi(S) = \frac{\sum_{i \in S, j \in \Omega \setminus S} w(i, j) }{\sum_{i, j \in S} w(i, j)}$
我们回顾 sampling coloring. 我们已经证明了 $\tau_{mix}(\varepsilon) \le q n \ log \frac{n} \varepsilon$ 对于 $q > 4 \Delta$ . 这里我们用 Graph Expansion 来做。

考虑一个 star，并且设 1 是中间的点。记 $Z$ 是所有的 proper colorings， $S$ 是所有将 1 染成 1 的 proper colorings. 那么有
$Q(S, \overline{S}) = \sum_{i \in S, j \in \overline{S}} \pi(i) P(i, j) = \frac{(q-1)(q-2)^{n-1}}{Z \cdot nq}$
$\pi(S) = \frac{(q-1)^{n-1}}{Z}$ . 那么
$\Phi(S) = \frac{1}{nq} \frac{(q-2)^{n-1}}{(q-1)^{n-2}} = \frac{q-1}{nq} \cdot \bigg( 1 - \frac{1}{q-1}\bigg)^{n-2} \le \frac{1}{n} \exp \bigg( - \frac{n-2}{q-1} \bigg).$
也就是说 $\tau_{mix} = \Omega \bigg( n \cdot \exp \frac{n-2}{q-1} \bigg)= n^{\omega(1)}$ 若 $q = O(\frac{n}{\log n})$ .

Cheeger's Inequality

考虑 $P$ 的 Laplacian，也就是 $L = I - P$ . 有分解
$L =\sum_{i=1}^n (1-\lambda_i) v_i v_i^T \Pi$
我们记它的特征值为 $\gamma_i = 1 - \lambda_i$ . 那么有 $0 = \gamma_1 \le \gamma_2 \le \dots \le \gamma_n \le 2$ .

Cheeger's Inequality 为
$\frac{\gamma_2}{2} \le \Phi(P) \le \sqrt{2 \gamma_2}$

没有风会经过的旅店

CS3958: Advanced Algorithms

Lec1

Concentration Inequalities

The Averaging Trick

Chernoff Bound

The Median Trick

Lec 2

Threshold Behavior of Random Graphs

Hoeffding's Inequality

Concentration on Margtinalges

Azuma-Hoeffding's Inequality

Lec 3

Martingale (cont'd)

McDiarmid's Inequality

Stopping Time

Optional Stopping Theorem (OST)

Lec 4

Proof of OST

Doobs martingale inequality

1-dim Random Walk with Two Absorbing Barriers

Pattern Matching

Wald's Equation

Lec 5

Multi-Armed Bandit

The Explore-then-Commit (ETC) Algorithm

The Upper Confidance Bounds (UCB) Algorithm

Lec 6

An Online Number Guessing Game

Stochastic Setting

The Adversarial Setting

Online Learning

Lec 7

Convex Optimization

Gradient Descent Algorithm

Online Gradient Descent

Strongly Convex Function

Lec 8

Learning with Expert Advice

Online Stochastic Gradient Descent

Online Shortest Paths

Lec 9

Lec 10

Basics of Markov Chains

Fundamental Theorem of Markov Chains

Coupling

Lec 11

Proof of FTMC

Mixing Time

Lec 12

Reversible Chains

Metropolis Algorithm

Sample Proper Coloring

Spectrum of Reversible Markov Chain

Lec 13

Variational Characterization of Eigenvalues

FTMC for Reversible Chains

Relaxation Time

From Coupling to Spectral Gap

Graph Expansion

Lec 14

Graph Expansion (Cont')

Cheeger's Inequality

关于 pip 的网络问题

麦乐鸡味的近况随记

Sirius

Comments | NOTHING