\(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\e}{\varepsilon}\) \(\newcommand{\cD}{\mathcal{D}}\) \(\newcommand{\poly}{\text{poly}}\) \(\newcommand{\cN}{\mathcal{N}}\) \(\newcommand{\tensor}{\otimes}\) \(\newcommand{\E}{\mathop{\mathbb{E}}}\) \(\renewcommand{\hat}{\widehat}\) \(\newcommand{\iprod}[1]{\langle #1 \rangle}\) \(\newcommand{\pE}{\tilde{\mathbb{E}}}\) \(\newcommand{\Paren}[1]{\left ( #1 \right )}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\Tr}{\text{Tr}}\)
sam hopkins (miller fellow, uc berkeley)
input: \(X_1,\ldots,X_n \in \R^d\) independent copies of \(X\)
output: estimator \(\hat{\mu}(X_1,\ldots,X_n)\) of \(\mu = \E X\).
for this talk assume isotropic: \(\E (X - \mu)(X - \mu)^\top = I\)
obvious “algorithm” – the empirical mean
empirical mean: \(\overline{\mu} = \tfrac 1 n \sum X_i\)
mean square error: \(\E \| \overline{\mu} - \mu \|^2 = \frac 1 n \E \|X - \mu\|^2 = \frac d n\)
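a quick numerical sanity check of the \(d/n\) mean-square error (a minimal sketch; the gaussian data and the sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 50, 200, 2000

# isotropic data with mean zero: E||mu_bar - mu||^2 should be d/n
errs = np.empty(trials)
for t in range(trials):
    X = rng.standard_normal((n, d))   # n samples, identity covariance
    mu_bar = X.mean(axis=0)           # empirical mean
    errs[t] = np.sum(mu_bar ** 2)     # squared error (true mean is 0)

print(errs.mean(), d / n)             # both ~ 0.25
```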
what about the tail?
how small is \(\Pr(\| \overline{\mu} - \mu \| > t)\)?
confidence interval (c.i.): \(\Pr(\| \overline{\mu} - \mu \| > t) \leq \delta\) implies \(\delta\)-c.i. of width \(t\)
gaussian case:
\(X \sim \cN(\mu, I) \Rightarrow \Pr( \|\overline{\mu} - \mu\| > t + \underbrace{\sqrt{d/n}}_{\approx \E \|\overline{\mu} - \mu \| } ) \leq \underbrace{\exp(-t^2 n)}_{d\text{-independent subgaussian tail}}\)
heavy-tailed case:
covariance \(I \Rightarrow \Pr( \|\overline{\mu} - \mu\| > t ) \leq \underbrace{\frac d {t^2 n}}_{d\text{-dependent fat tail}}\)
only assume \(\E X\) and \(\E X X^\top\) are finite.
heavy tails also occur in high dimensions: corruptions, power laws in large networks, etc.
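a quick simulation contrasting the two tails (a sketch: standardized student-t samples stand in for a heavy-tailed isotropic \(X\); all names and sizes are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 20, 100, 5000

def deviations(sampler):
    """norms ||mu_bar - mu|| of the empirical mean over many trials (mu = 0)."""
    return np.array([np.linalg.norm(sampler((n, d)).mean(axis=0))
                     for _ in range(trials)])

gauss = deviations(rng.standard_normal)
# student-t with 3 degrees of freedom, rescaled to unit variance:
# heavy (polynomial) tails, but still covariance I
heavy = deviations(lambda size: rng.standard_t(3, size=size) / np.sqrt(3.0))

for q in (0.9, 0.99, 0.999):
    print(q, np.quantile(gauss, q), np.quantile(heavy, q))
# the extreme quantiles separate: fat polynomial tail vs subgaussian tail
```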
recall: i.i.d. samples \(X_1,\ldots,X_n\) of \(X\), try to estimate \(\mu = \E X\)
how confident can you be? is there an estimator \(\hat{\mu}\) with \(\Pr( |\hat{\mu} - \mu| > t ) \leq \exp(-\Omega(t^2 n))\) when \(X\) is heavy-tailed?
no. but you can fake it!
theorem (old, \(d=1\)):
for every \(t\), exists \(\hat{\mu}_t\) such that \(\Pr( | \hat{\mu}_t - \mu | > t ) \leq \exp(-\Omega(t^2 n))\)
and \(\hat{\mu}_t\) is poly-time computable
if \(d=1\) can estimate \(\mu\) as if \(X\) were Gaussian, even if only \(\E X, \E X^2\) exist
what happens for \(d > 1\)?
recall gaussian case: \(X \sim \cN(\mu, I) \Rightarrow \Pr( \|\overline{\mu} - \mu\| > t + \underbrace{\sqrt{d/n}}_{\approx \E \|\overline{\mu} - \mu \| } ) \leq \underbrace{\exp(-t^2 n)}_{d\text{-independent subgaussian tail}}\)
theorem (lugosi-mendelson, 2018):
for every \(t\), exists \(\hat{\mu}_t\) such that
\[\Pr \left ( \| \hat{\mu}_t - \mu \| > O \left (t + \sqrt{d/n} \right ) \right )\leq \exp(-t^2 n)\]
assuming only \(\E(X-\mu)(X-\mu)^\top = I\).
new combinatorial notion of high-dimensional median, appears to require \(\exp(d)\) time
theorem (this work): same, but \(\hat{\mu}_t\) is computable in time \(O(dn) + (dt)^{O(1)}\)
covariance \(I\), \(n\) samples
(disclaimer: “tail” not strictly accurate because one estimator \(\hat{\mu}_t\) for each \(t\))
| estimator | dim. | tail | time | ref. |
|---|---|---|---|---|
| empirical mean | any | \(\frac{d}{t^2 n}\) | poly | folklore |
| (many) | \(1\) | \(\exp(-t^2 n)\) | poly | (many) |
| geometric median | any | \(\exp(-t^2 n / d)\) | poly | [Minsker 13, Hsu-Sabato 13] |
| tournament median | any | \(\exp(-t^2 n)\) | exp | [Lugosi-Mendelson 18] |
| median-sdp | any | \(\exp(-t^2 n)\) | poly | this work |
theorem: given \(X_1,\ldots,X_n,\delta\), can find \(\hat{\mu}\) such that
\[ \Pr \left ( \left \| \hat{\mu} - \mu \right \| > C \left ( \sqrt{\frac{d}{n}} + \sqrt{\frac{\log(1/\delta)}{n}} \right ) \right ) \leq \delta \]
for a universal constant \(C\).
main innovation: new semidefinite programming (sdp) algorithm for high-dimensional median, based on sum of squares method (sos).
sos familiarity is not a prerequisite for this talk.
median of means [nemirovsky-yudin, jerrum-valiant-vazirani, alon-matias-szegedy]
\(X_1,\ldots,X_n\) i.i.d. copies of \(X\) with \(\E X = \mu\) and \(\E (X - \mu)^2 = 1\)
empirical mean \(\delta\)-c.i. width \(O\Paren{ \sqrt{\frac{1}{n\delta}}}\)
median of means \(\delta\)-c.i. width \(O \Paren{ \sqrt{\frac{\log(1/\delta)}{n}}}\)
\(1/\delta\) becomes \(\log(1/\delta)\)
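(filling in the empirical-mean step: chebyshev's inequality gives the width directly)
\[\Pr \Paren{ |\overline{\mu} - \mu| > t } \leq \frac{\E (\overline{\mu} - \mu)^2}{t^2} = \frac{1}{n t^2} \, , \quad \text{which is} \leq \delta \text{ once } t \geq \sqrt{\tfrac{1}{n\delta}} \, .\]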
median of means: split the samples into \(k \approx \log(1/\delta)\) buckets, let \(Z_i\) be the empirical mean of the \(i\)-th bucket, and output \(\hat{\mu} = \mathrm{median}(Z_1,\ldots,Z_k)\).
key insight: number of outliers among \(Z_1,\ldots,Z_k\) concentrates even when sum of outliers does not.
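a minimal one-dimensional median-of-means sketch (the bucket count and the heavy-tailed test distribution are illustrative choices):

```python
import numpy as np

def median_of_means(x, delta, rng=None):
    """1-d median of means: split into k ~ log(1/delta) buckets,
    average within each bucket, output the median of the bucket means."""
    rng = rng or np.random.default_rng()
    k = max(1, int(np.ceil(np.log(1.0 / delta))))
    buckets = np.array_split(rng.permutation(x), k)
    return np.median([b.mean() for b in buckets])

rng = np.random.default_rng(2)
# heavy-tailed test data: standardized student-t, mean 0, variance 1
x = rng.standard_t(3, size=2000) / np.sqrt(3.0)
print(median_of_means(x, delta=0.01, rng=rng), x.mean())
```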
recall gaussian case:
\(X \sim \cN(\mu, I) \Rightarrow \Pr( \|\overline{\mu} - \mu\| > t + \sqrt{d/n} ) \leq \exp(-t^2 n)\)
i.e. \(\delta\)-c.i. radius \(O\Paren{\sqrt{\frac dn} + \sqrt{\frac {\log 1/\delta} n}}\)
goal: match this, assuming only \(\E (X - \mu)(X - \mu)^\top = I\)
buckets: \(Z_1,\ldots,Z_k\) with \(\E Z = \mu\) and \(\E (Z - \mu)(Z-\mu)^\top = \Gamma = \tfrac kn I\)
problem: can have \(\|Z_i - \mu\| \approx \sqrt{k d/n}\) for most \(Z_i\)
result: dimension-dependent tail \(\exp(-t^2 n / d)\)
i.e. \(\delta\)-c.i. radius \(O\Paren{\sqrt{\frac{ d \log(1/\delta)}{n}}}\)
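a sketch of this route, in the spirit of [Minsker 13, Hsu-Sabato 13]: the geometric median of the bucketed means, here via plain weiszfeld iterations (function names, iteration count, and tolerance are my choices):

```python
import numpy as np

def geometric_median(Z, iters=200, eps=1e-8):
    """weiszfeld's algorithm: iteratively re-weighted average, with
    weights inversely proportional to distance from the current point."""
    x = Z.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(Z - x, axis=1), eps)  # avoid /0
        w = 1.0 / dist
        x_new = (w[:, None] * Z).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

def geomedian_of_means(X, delta):
    """geometric median of k ~ log(1/delta) bucketed means."""
    k = max(1, int(np.ceil(np.log(1.0 / delta))))
    Z = np.stack([b.mean(axis=0) for b in np.array_split(X, k)])
    return geometric_median(Z)
```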
cannot ask for \(2k/3\) of the \(Z_i\)'s to be non-outliers in euclidean norm — typically almost all of them are \(\approx \sqrt{kd/n}\) away from \(\mu\)
instead, ask for every direction to have at most \(k/3\) outliers!
input: \(X_1,\ldots,X_n,\delta\)
buckets: bucketed means \(Z_1,\ldots,Z_k\) for \(k = \log(1/\delta)\)
centrality: \(x\) is \(r\)-central if in all directions \(u\), for at least \(2k/3\) \(Z_i\),
\(|\iprod{Z_i,u} - \iprod{x,u}| \leq r\)
alternative interpretation: if \(x\) is \(r\)-central then, in every direction \(u\), \(x\) is within distance \(r\) of the median of the projections \(\iprod{Z_i,u}\)
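for intuition only, a heuristic centrality check along randomly sampled directions (names are mine); it can only *refute* centrality — certifying it over all directions \(u\) is exactly the hard part the sdp below addresses:

```python
import numpy as np

def looks_r_central(x, Z, r, n_dirs=1000, rng=None):
    """heuristic: sample random unit directions u and check that at least
    2k/3 of the Z_i satisfy |<Z_i - x, u>| <= r in each one."""
    rng = rng or np.random.default_rng()
    k, d = Z.shape
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        if np.sum(np.abs((Z - x) @ u) <= r) < 2 * k / 3:
            return False   # witness direction with more than k/3 outliers
    return True            # passed every sampled direction (not a certificate!)
```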
remember: \(X_1,\ldots,X_n\) samples in \(k = \Theta( \log(1/\delta))\) buckets, \(Z_i\) is mean in \(i\)-th bucket.
output: \(x \in \R^d\) of “best centrality”, i.e. with the smallest \(r\) such that \(x\) is \(r\)-central
running time??
\(10^5\)-foot view
a convex (sdp) relaxation of the set of \(r\)-central \(x\)’s
with enough constraints that each step of the lugosi-mendelson analysis also applies to the sdp (the heart of the “proofs to algorithms” SoS method)
\(\text{poly}(d,k)\) constraints enough to capture all important properties of \(r\)-central points
at the heart of the matter: an sdp for testing \(r\)-centrality
\[ \{ x \, : \, \text{ for all $u$, at most $k/3$ $Z_i$'s have $|\iprod{Z_i,u} - \iprod{x,u}| > r$} \} \]
how would you know if you found such a good \(x\)?
given \(x\), start with a quadratic program in \(b = b_1,\ldots,b_k, u = u_1,\ldots,u_d\):
\[\max_{u,b} \sum b_i \text{ such that } b_i^2 = b_i, \|u\|^2 = 1, \text{ and } b_i \iprod{Z_i - x,u} \geq b_i r\]
relax \((b,u)(b,u)^\top\) to block PSD matrix
\[\left ( \begin{array}{cc} B & W \\ W^\top & U \end{array} \right ) \]
\[\begin{align} SDP(Z_1,\ldots,Z_k,x) = & \max \, \Tr B \text{ such that} \\ & B_{ii} \leq 1 \\ & \Tr U = 1 \\ & \iprod{Z_i - x, W_i} \geq r B_{ii} \\ & \left ( \begin{array}{cc} B & W \\ W^\top & U \end{array} \right ) \succeq 0 \end{align}\]
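a minimal cvxpy transcription of this sdp (a sketch under my naming and solver defaults, not the paper's fast \(O(dn) + (dt)^{O(1)}\) implementation):

```python
import numpy as np
import cvxpy as cp

def centrality_sdp_value(Z, x, r):
    """value of the centrality sdp at candidate x; a value <= k/3 certifies
    (since the sdp upper-bounds the quadratic program) that no direction
    has more than k/3 outlier buckets, i.e. that x is r-central."""
    k, d = Z.shape
    M = cp.Variable((k + d, k + d), PSD=True)   # relaxes (b,u)(b,u)^T
    B, W, U = M[:k, :k], M[:k, k:], M[k:, k:]
    constraints = [
        cp.diag(B) <= 1,                        # b_i^2 = b_i, relaxed
        cp.trace(U) == 1,                       # ||u||^2 = 1, relaxed
        cp.sum(cp.multiply(Z - x, W), axis=1) >= r * cp.diag(B),
    ]
    prob = cp.Problem(cp.Maximize(cp.trace(B)), constraints)
    prob.solve()
    return prob.value
```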
key lemma: with high probability, the centrality sdp certifies that \(\mu\) is \(r\)-central
proof ideas:
Grothendieck’s inequality to relate sdp to empirical first moment of samples
bounded differences on the sdp itself to get concentration
main theorem: the first polynomial-time estimator for heavy-tailed mean estimation matching the empirical mean's performance in the gaussian setting
proof insight: sdps are more outlier-robust than averages, while achieving better statistical rates than trimmed averages or naive medians
open problem: is there a practical algorithm whose empirical performance improves on state-of-the-art for heavy-tailed estimation?
thanks!