Introduction
The main focus of my research is on the algorithmic foundations of machine learning and statistics. I seek to understand when and how we can design efficient, reliable, and provably robust algorithms for extracting information from complex data. My work spans several areas—including algorithmic robust statistics, supervised learning, foundations of deep learning, statistical-computational tradeoffs, distribution testing, learning structured distributions, and applied probability—unified by a common goal: to bridge the gap between statistical and computational efficiency. I am broadly interested in developing methods that achieve strong theoretical guarantees while informing the design of practical, trustworthy learning systems.
This page offers an overview of my research contributions over the past fifteen years across several interrelated areas. Each section begins with a brief summary capturing the main ideas and motivation, followed by a more detailed discussion that outlines the key results and representative papers.
Brief Overview
Real-world datasets are rarely clean: outliers, corruptions, and systematic noise often compromise
the reliability of standard statistical methods.
The key challenge is to build algorithms that continue to perform accurately and efficiently,
even when part of the data has been distorted or adversarially corrupted.
My work established the field of algorithmic robust statistics,
which studies how to design algorithms that are both statistically
and computationally efficient in the presence of data corruption.
In a 2016 paper, we developed the first efficient methods
for robust mean and covariance estimation in high dimensions that nearly match
classical information-theoretic guarantees, resolving longstanding open problems.
Subsequent work broadened this framework to a range of settings—including mixture and graphical models and supervised learning—while generalizing and refining the algorithmic framework introduced in 2016 and uncovering intriguing connections to geometry, optimization, and computational complexity. Several of these methods, particularly those based on the iterative filtering technique, have been implemented and evaluated on real datasets, achieving state-of-the-art performance in practical tasks such as defenses against data poisoning and backdoor attacks, automatic outlier removal in genetic datasets, and out-of-distribution detection.
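To give a sense of how the iterative filtering technique operates, here is a minimal Python sketch of robust mean estimation via filtering. It is purely illustrative: it assumes the inliers have covariance approximately equal to the identity and that at most an \(\varepsilon\) fraction of points are corrupted, and the stopping threshold and removal rule are simplified placeholders rather than the analyzed choices.

```python
import numpy as np

def filter_mean(X, eps, max_iter=20):
    """Illustrative sketch of iterative filtering for robust mean estimation.

    Assumes inliers with covariance approximately the identity and at most an
    eps fraction of arbitrarily corrupted rows in X.  The threshold and the
    removal rule below are simplified placeholders.
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
        lam, v = eigvals[-1], eigvecs[:, -1]        # top eigenpair
        # If no direction has abnormally large empirical variance, the
        # corruptions cannot have shifted the mean by much: return it.
        if lam <= 1 + 10 * eps:
            return mu
        # Otherwise, score points by their deviation along the top direction
        # and remove the most extreme ~eps fraction before iterating.
        proj = X @ v
        scores = (proj - np.median(proj)) ** 2
        X = X[scores <= np.quantile(scores, 1 - eps)]
    return X.mean(axis=0)
```

The certificate driving the method is spectral: a large top eigenvalue of the empirical covariance witnesses that outliers are biasing the mean along a specific direction, and projecting onto that direction localizes them so they can be removed.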
More broadly, robust statistics has become a lens for studying algorithmic problems that at first appear unrelated to data corruptions. Specifically, it has inspired the development of new algorithmic techniques in other areas, including in learning mixture models and natural classes of neural networks.
Highlights
Basic Results and Survey of the Field
The following work gave the first computationally efficient algorithms for robust estimation in high dimensions, settling a fundamental problem in Robust Statistics that had remained open since the early 1960s.
At the technical level, the paper introduced a general methodology
for designing provably robust efficient estimators that has been applied to a wide range of settings.
A brief expository article based on the above work appeared in the research highlights section of the
Communications of the ACM:
The focus of my subsequent work in this area has been to develop a general algorithmic theory of robustness. A detailed exposition of subsequent developments can be found in my book on the topic, which is a good entry point to the field:
The following sections provide highlights of my subsequent work in this area, organized into four main research threads.
Robust Algorithms for Complex Settings
After the 2016 work, a natural direction was to go beyond mean and covariance estimation by developing robust algorithms for more complex settings. These include robust supervised learning (both regression and classification), robustly learning various mixture models, and learning with a majority of outliers.
Near-Linear Time Algorithms
Once polynomial-time solutions for basic robust estimation tasks were known, a natural next step was to design faster and simpler methods. I initiated this direction by giving the first near-linear-time algorithm for high-dimensional robust mean estimation. A subsequent work crystallized a connection to continuous (non-convex) optimization, showing that robust mean estimation can be solved via first-order methods (e.g., gradient descent) applied to an appropriate non-convex objective. Building on these ideas, we obtained almost-linear-time algorithms for learning separated mixture models (including sub-Gaussian and bounded-covariance families), giving the first runtime improvements on this fundamental task since the classical work of Vempala and Wang from 2001. Related advances also led to the first memory-efficient robust learners in the streaming model.
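One way this connection is often formalized (stated schematically here, with the notation \(\Delta_{n,\varepsilon}\) and \(\mu_w\) introduced only for illustration) is to optimize soft weights over the sample points so as to minimize the largest eigenvalue of the induced weighted covariance:
\[
\min_{w \in \Delta_{n,\varepsilon}} \ \lambda_{\max}\Big(\sum_{i=1}^{n} w_i\,(x_i - \mu_w)(x_i - \mu_w)^{\top}\Big),
\qquad \mu_w = \sum_{i=1}^{n} w_i\,x_i,
\]
where \(\Delta_{n,\varepsilon}\) is the set of weight vectors with \(\sum_i w_i = 1\) and \(0 \le w_i \le \tfrac{1}{(1-\varepsilon)n}\). The objective is non-convex, but, roughly speaking, approximate stationary points already induce weighted means \(\mu_w\) that are accurate robust estimates, which is the sense in which first-order methods solve the problem.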
Robustness as a Lens
Motivated by robustness, this line of work has also led to the development of techniques of broader algorithmic and mathematical interest. One direction establishes covering bounds for algebraic varieties, yielding new algorithmic methods for learning latent variable models. Another develops Sum-of-Squares (SoS) certifiability results—including those for sparse singular values and moment bounds for subgaussian distributions—that yield near-optimal algorithms for a range of well-studied problems, both within robust statistics and beyond. A complementary thread introduces a general method for implicit high-order moment tensor estimation, enabling efficient learning of mixture and broader latent-variable models without explicitly forming large tensors.
Bridging Theory and Practice
While much of this research is theoretical, several of the resulting algorithms
have proven effective in real-world data analysis tasks. In particular, the iterative
filtering framework introduced in the 2016 work has been directly implemented in
scalable systems used for automatic outlier removal, data decontamination, and
reliability enhancement in large-scale learning pipelines.
Building on this algorithmic foundation, later works extended these ideas to improve
the robustness of more general stochastic optimization procedures and develop principled defenses
against data and model corruption. For example, they have been used for anomaly
detection in genetic datasets, mitigating data-poisoning and backdoor attacks,
and improving out-of-distribution detection in deep learning models.
This line of work demonstrates how theoretically grounded frameworks for robustness
can translate into scalable and reliable methods for modern machine learning systems.
The following papers highlight a few representative collaborative works in this direction; related
approaches have since been leveraged across various domains within
the broader machine learning community.
Brief Overview
Supervised learning traditionally assumes that training labels are correct, yet in many
settings labels may be corrupted or adversarial. A central challenge is to understand
under which noise models accurate learning is possible, and how noise affects the
statistical and computational complexity of such problems.
My work has developed a systematic algorithmic theory of learning in the presence of
noisy labels, addressing both semi-random models—such as Massart and Tsybakov noise—and fully adversarial label noise.
A key result established the first polynomial-time algorithm for distribution-free PAC learning
of halfspaces in the presence of Massart noise, resolving a longstanding open problem in computational learning theory.
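For reference, in the Massart (bounded) noise model the label of each example is obtained by flipping the value of an unknown target halfspace \(\operatorname{sign}(w^{*}\cdot x)\) independently, with a flip probability that may depend on the example but is uniformly bounded below \(1/2\):
\[
y \;=\;
\begin{cases}
\operatorname{sign}(w^{*}\cdot x) & \text{with probability } 1-\eta(x),\\
-\operatorname{sign}(w^{*}\cdot x) & \text{with probability } \eta(x),
\end{cases}
\qquad \eta(x) \le \eta < 1/2 .
\]
The difficulty is that the flip probabilities \(\eta(x)\) are unknown and may vary adversarially from point to point, which defeats many standard noise-tolerant techniques.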
Building on the distribution-free setting, I subsequently investigated stronger noise
models in distribution-specific regimes, including Tsybakov and general Massart noise,
as well as adversarial label corruption. These results established both new algorithmic
techniques and nearly matching lower bounds, clarifying the precise regimes where
efficient learning is possible and where inherent barriers arise. Together, they
provide a comprehensive view of how label noise affects the statistical and
computational complexity of learning fundamental concept classes.
Highlights
Distribution-free Learning
This line of work concerns learning under label noise without any assumptions on the underlying data distribution. It began with the first provably efficient algorithm for distribution-free PAC learning of halfspaces under Massart noise, and continued with refinements that clarified both the algorithmic landscape and its fundamental limitations.
Beyond noise tolerance, this line of work led to progress on a classical problem in the theory of computation: the construction of Forster transforms. These are transformations that regularize data without changing its essential geometry, and have long played a role in functional analysis, communication complexity, coding theory, and more. Motivated by a connection to the problem of learning halfspaces with label noise, we developed the first strongly polynomial-time algorithm for computing such transforms, which in turn yields a strongly polynomial distribution-free learner for halfspaces—a surprising result, given that proper PAC learning of halfspaces is equivalent to linear programming.
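Roughly speaking, a Forster transform of points \(x_1,\ldots,x_n \in \mathbb{R}^d\) (in suitably general position) is an invertible linear map \(A\) that brings the radially rescaled points into isotropic position:
\[
\frac{1}{n}\sum_{i=1}^{n} \frac{(A x_i)(A x_i)^{\top}}{\lVert A x_i\rVert_2^{2}} \;=\; \frac{1}{d}\, I_d .
\]
Since applying a linear map and rescaling each point radially maps origin-centered halfspaces to origin-centered halfspaces, the transform preserves the learning problem while making the data well-conditioned.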
Distribution-specific Learning
This line of work examines learning with label noise under additional assumptions on the input distribution—such as Gaussian or log-concave marginals—with the goal of leveraging geometric structure to obtain better algorithms. These results extend the theory beyond the distribution-free setting and demonstrate that distributional structure can enable efficient algorithms that are provably impossible in the distribution-free regime.
Brief Overview
Deep neural networks achieve remarkable empirical success, yet our theoretical
understanding of when such models can be learned efficiently remains limited.
My work has approached this question through single-index models (SIMs) and
multi-index models (MIMs)—theoretical frameworks that isolate core nonlinear
phenomena in a mathematically tractable form. The central aim is to determine when
these models admit computationally efficient learning algorithms, and when inherent
barriers—statistical or computational—make learning impossible,
including in the presence of label noise or adversarial perturbations.
Within this framework, I have developed algorithms and computational lower bounds
that delineate the feasibility of learning nonlinear predictors—from individual neurons to
multi-index representations.
Highlights
Single-Index Models
A generalized linear model (GLM) has the form \(f(x)=\sigma(w\cdot x)\) with a known activation
\(\sigma\) (e.g., ReLU or sigmoid), while a single-index model (SIM) keeps the same form
but allows \(\sigma\) to be unknown, typically restricted to a structured family
(e.g., all monotone functions). My work investigates when such nonlinear predictors can
be learned by efficient algorithms under realistic data imperfections
(agnostic/adversarial label noise), and clarifies where computational barriers arise.
This line of work develops robust algorithms for GLMs and SIMs under
adversarial label noise. The algorithmic results in this thread achieve a constant-factor approximation to the optimal loss—the strongest
guarantee possible for polynomial-time algorithms in the agnostic setting.
The following paper showed that gradient-based methods can succeed in the agnostic setting, for a broad family of known activations \(\sigma\), including ReLUs and sigmoids.
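As a concrete, purely illustrative example of such a gradient-based learner, the sketch below runs projected (sub)gradient descent on the empirical square loss of a single ReLU neuron; the function names, step size, iteration count, and unit-ball projection are placeholder choices for exposition, not the parameters analyzed in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def learn_relu_neuron(X, y, steps=1000, lr=0.1):
    """Illustrative sketch: projected (sub)gradient descent on the empirical
    square loss of the model f(x) = relu(w . x).  Hyperparameters and the
    projection radius are placeholder choices."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        z = X @ w
        # Subgradient of (1/n) * sum_i (relu(w . x_i) - y_i)^2 with respect to w.
        grad = (2.0 / n) * (X.T @ ((relu(z) - y) * (z > 0)))
        w = w - lr * grad
        # Project the iterate back onto the unit ball.
        norm = np.linalg.norm(w)
        if norm > 1.0:
            w = w / norm
    return w
```

The notable point in this line of work is that such simple first-order procedures can achieve nontrivial agnostic guarantees despite the non-convexity of the loss, under suitable distributional assumptions.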
Subsequent work developed alternative algorithmic approaches—using a convex surrogate whose analysis leverages a connection to local error bounds from optimization—to obtain robust learning guarantees under much milder distributional assumptions:
The next advance gave general algorithms for all monotone activations (still with known activation), assuming Gaussian inputs:
Finally, we move from GLMs to SIMs, where the activation is unknown. This shift makes the problem more challenging and calls for new techniques beyond gradient methods. The following work develops an optimization framework guided by a problem-specific vector field:
Multi-Index Models and Representation Learning
While single-index models capture the behavior of individual neurons, modern learning tasks
require the recovery of multiple latent features. Extending our algorithmic tools
to multi-index models marks a fundamental shift where new computational challenges arise.
My work in this domain investigates when multi-index models (MIMs)—a broad class encompassing
shallow neural networks, intersections of halfspaces, multiclass linear classifiers,
and many other non-linear predictors—can be learned efficiently.
The core challenge is to determine the algorithmic feasibility of such models and to
understand when inherent computational barriers arise. In a sequence of works, I developed
algorithms and lower bounds for MIMs, qualitatively characterizing the
complexity of learning, both in the realizable setting and in the presence of noisy labels.
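Formally, a multi-index model depends on its input only through a low-dimensional linear projection: for an unknown matrix \(W \in \mathbb{R}^{K\times d}\) with \(K \ll d\) and an unknown (typically structured) link function \(g\),
\[
f(x) \;=\; g(Wx).
\]
Different choices of \(g\) recover the examples above, such as one-hidden-layer networks, intersections of halfspaces, and multiclass linear classifiers.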
A natural first step beyond SIMs is learning positive linear combinations of ReLUs (and other natural activations), which already correspond to natural depth-2 networks with multiple hidden units. My work established that such models are efficiently learnable in the PAC framework:
The computational complexity of the above algorithm scales exponentially with the number of hidden units. In the following paper, inspired by robustness, we developed an algebraic geometry approach leading to quasi-polynomial dependence:
The next advance removed positivity constraints, yielding algorithms for general linear combinations of ReLUs with stronger accuracy guarantees:
The next stage tackled fully general MIMs, beginning with discrete-valued targets:
A companion result addressed real-valued MIMs, establishing analogous guarantees:
A related line of work investigated the power of access models, showing that query access can strictly improve learnability over random sampling. A representative paper is given below:
Brief Overview
Many fundamental learning problems admit statistically optimal solutions, yet it remains
unclear when such solutions can be achieved efficiently. Understanding these
statistical-computational tradeoffs is central to my work.
There are two general approaches to establishing such barriers:
(1) proving unconditional lower bounds for broad classes of algorithms
(e.g., SQ algorithms, Sum-of-Squares algorithms), and
(2) developing efficient reductions from average-case problems believed to be hard.
My research has primarily advanced
the first approach, and has established connections between the two approaches.
A significant component of my work in this area is based on the Statistical Query (SQ) framework,
using the Non-Gaussian Component Analysis (NGCA) methodology introduced in my
FOCS 2017 paper. This framework has been leveraged to prove
strong, and in many cases optimal, SQ lower bounds for a wide range of learning tasks,
including robust estimation, learning mixture models, and
supervised learning (both classification and regression).
More recently, my work in this area has expanded beyond Statistical Queries
to other computational models, including Polynomial Threshold Function (PTF) tests
and Sum-of-Squares (SoS) algorithms,
and has also led to reduction-based hardness results inspired by SQ constructions.
Highlights
This line of work investigates when learning problems are solvable in principle,
yet remain intractable for efficient algorithms. A central focus is on establishing
unconditional lower bounds for broad classes of algorithms, primarily through the
Statistical Query (SQ) framework and the Non-Gaussian Component Analysis (NGCA)
methodology introduced in the following paper.
These techniques provide a unified lens
for proving computational barriers across a diverse range of learning settings,
and have also inspired subsequent work on reduction-based hardness and lower
bounds in other algorithmic models such as Sum-of-Squares (SoS) and Polynomial
Threshold Function (PTF) tests.
Basic Framework
The NGCA framework for proving SQ lower bounds was introduced in the following paper. Perhaps surprisingly, this methodology was inspired largely by questions in robust statistics.
Since its introduction, the NGCA methodology has become a flexible tool for establishing
statistical-computational tradeoffs across learning theory. It provides a general template
for constructing families of hard instances whose low-order moments match Gaussian
moments, showing that any algorithm capable of solving these problems efficiently would
yield an NGCA solver. A good entry point for understanding these developments
(with a complete set of applications up to 2023) is
Chapter 8 of my Algorithmic Robust Statistics book.
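In its simplest form, the hard instances are hidden-direction distributions: fix a univariate distribution \(A\) whose first \(m\) moments match those of the standard Gaussian, and for a secret unit vector \(v\) consider the distribution on \(\mathbb{R}^d\) that behaves like \(A\) along \(v\) and like a standard Gaussian on the orthogonal complement, with density
\[
P^{A}_{v}(x) \;=\; A(v\cdot x)\;\phi_{\perp v}\big(x - (v\cdot x)\,v\big),
\]
where \(\phi_{\perp v}\) denotes the standard Gaussian density on the subspace orthogonal to \(v\). Under mild conditions on \(A\), any SQ algorithm that distinguishes \(P^{A}_{v}\) (for an unknown \(v\)) from \(\mathcal{N}(0, I_d)\) must use either queries of very small tolerance or super-polynomially many queries; lower bounds for concrete learning problems then follow by embedding such hidden-direction instances into them.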
Below are illustrative examples from my work that build on this framework.
Statistical-Computational Tradeoffs for Unsupervised Learning
Beyond the settings considered in the FOCS 2017 paper above, the first group of applications concerns unsupervised settings, including robust estimation and learning mixture models.
Statistical-Computational Tradeoffs for Supervised Learning
NGCA-based techniques have also yielded lower bounds for a range of supervised learning problems, both in noiseless and noisy settings.
Beyond SQ-hardness
To further understand the robustness of these barriers, subsequent work established hardness not only within the SQ framework, but also through reductions and within other algorithmic models such as SoS and PTF tests.
Brief Overview
Distribution testing provides the algorithmic foundations for reasoning about
data-generating processes. Given sample access to one or more unknown distributions,
the goal is to decide whether they satisfy a desired property—such as identity,
independence, or structural shape—or are far from satisfying it. My work has
developed general and sharp characterizations of the sample complexity of such
testing tasks, both in discrete and in structured settings.
For discrete distributions, my results yield sample-optimal algorithms for
fundamental problems such as identity, closeness, and independence testing. For
structured families—such as monotone, unimodal, log-concave, and mixture
distributions—my work introduced the $A_k$-distance framework, an analytic
tool that captures the right geometric parameters governing testing complexity and
enables near-optimal testers even for continuous distributions.
Highlights
My work in distribution testing is organized into two main threads, depending on whether structural assumptions are imposed on the distribution.
Testing Discrete Distributions
This line of work investigates the sample complexity of testing fundamental properties of discrete distributions, such as identity, closeness, and independence, under minimal assumptions. A recurring theme is obtaining optimal bounds (up to constant factors) and developing techniques that are broadly applicable across testing problems.
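To illustrate the flavor of these testers, here is a minimal Python sketch of a classical collision-based uniformity test, a simplified relative of the sample-optimal testers developed in this line of work; the acceptance threshold below is an illustrative placeholder rather than an analyzed constant.

```python
import numpy as np

def collision_uniformity_test(samples, n, eps):
    """Illustrative sketch of a collision-based uniformity tester.

    Accepts (returns True) when the empirical collision probability is close
    to 1/n, i.e., consistent with the uniform distribution over a domain of
    size n, and rejects when it is noticeably larger.  The threshold is a
    placeholder choice.
    """
    samples = np.asarray(samples)
    m = len(samples)
    counts = np.bincount(samples, minlength=n)
    # Unbiased estimator of the collision probability sum_i p_i^2.
    collision_prob = np.sum(counts * (counts - 1)) / (m * (m - 1))
    return collision_prob <= (1 + 2 * eps**2) / n
```

The statistic works because the uniform distribution minimizes the collision probability \(\sum_i p_i^2\) at \(1/n\), while any distribution that is \(\varepsilon\)-far from uniform in total variation has collision probability exceeding \(1/n\) by a term on the order of \(\varepsilon^2/n\).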
Testing Structured Distributions
For structured distribution families, testing under the total variation distance turns out to be possible even for continuous distributions. In contrast to the discrete unstructured setting, the domain size is no longer the right complexity parameter. My work introduced the \(A_k\)-distance framework—a general methodology for identifying the appropriate complexity measure and designing near-optimal testers for a wide range of structured families.
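Concretely, the \(A_k\) distance compares two distributions only on sets of bounded interval complexity:
\[
\lVert p - q \rVert_{A_k} \;=\; \max_{S \in \mathcal{A}_k} \,\lvert p(S) - q(S)\rvert,
\]
where \(\mathcal{A}_k\) is the collection of all subsets of the domain expressible as unions of at most \(k\) intervals. For a structured family whose members are well-approximated by piecewise-constant or piecewise-polynomial functions with few pieces, closeness in an appropriate \(A_k\) distance essentially certifies closeness in total variation, so the parameter \(k\), rather than the domain size, governs the testing complexity.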
Beyond shape-constrained families, my work has developed new methodologies for testing in high-dimensional parametric settings, including structured probabilistic models such as Bayesian networks, where dependencies among variables must also be inferred.
Brief Overview
A central goal of my work on distribution learning is to understand
when complex probability distributions can be estimated both statistically optimally
and computationally efficiently. This research has developed a unified algorithmic framework
for nonparametric density estimation, introducing approximation and Fourier analytic techniques
that apply across both discrete and continuous structured settings.
A key thread established the use of piecewise polynomial approximation to learn univariate
structured distributions, yielding sample-optimal and near-linear time algorithms,
with further applications to data summarization. A second line of work developed efficient
algorithms for sums of independent random variables, including Poisson binomial
and multinomial distributions, through new moment-matching and Fourier-analytic methods.
A third direction focused on structured multivariate distributions—such as log-concave
and histogram-based densities—providing the first sample and computationally efficient algorithms
for these nonparametric models.
Highlights
Understanding how to efficiently estimate an unknown probability distribution from samples is one of the classical questions in statistics and machine learning. My research on nonparametric distribution learning provides a unified algorithmic perspective on this problem, developing methods that are both sample-optimal and computationally efficient across a range of structured settings. This work combines ideas from approximation theory, harmonic analysis, and convex optimization, and has influenced subsequent developments in both theory and applications.
Distribution Learning via Piecewise Polynomial Approximation
This line of work introduced a general framework for learning arbitrary univariate distributions via piecewise polynomial approximation.
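At a high level (stated informally, with constants and logarithmic factors suppressed), the guarantee takes the following agnostic form: if the unknown density \(f\) is within total variation distance \(\mathrm{OPT}_{t,d}\) of some piecewise polynomial with \(t\) pieces and degree \(d\), then the learner outputs a hypothesis \(\hat f\) satisfying
\[
d_{\mathrm{TV}}\big(\hat f, f\big) \;\le\; C\cdot \mathrm{OPT}_{t,d} \;+\; \varepsilon
\]
from on the order of \(t(d+1)/\varepsilon^2\) samples, which is information-theoretically optimal for this class, and in near-linear time. Because many natural shape-constrained families (monotone, unimodal, log-concave, and others) are well-approximated by piecewise polynomials with few pieces, this single framework yields near-optimal learners for all of them.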
Learning Sums of Independent Random Variables
This thread focused on sums of independent random variables, a fundamental family that includes Poisson binomial and Poisson multinomial distributions. Some of the techniques developed here are of broader interest in probability and algorithmic game theory.
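For concreteness, a Poisson binomial distribution (PBD) is the law of a sum of independent Bernoulli random variables with possibly distinct means,
\[
X \;=\; \sum_{i=1}^{n} X_i, \qquad X_i \sim \mathrm{Bernoulli}(p_i)\ \text{independent},
\]
and a Poisson multinomial distribution is the vector-valued generalization in which each \(X_i\) is an independent random vector supported on the standard basis vectors of \(\mathbb{R}^k\). The structural results underlying these learners, such as moment-matching results and covering theorems for these families, are what make them relevant beyond density estimation, including in algorithmic game theory.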
Learning Multivariate Nonparametric Families
The final thread studies structured nonparametric families in any fixed dimension, such as log-concave and histogram-based densities.
Brief Overview
This line of work develops new probabilistic limit theorems and analytic tools that lead to efficient algorithms, with applications across learning theory, stochastic optimization, and computational complexity.
Highlights
Pseudorandomness and Derandomization
My early work showed that the Berry–Esseen Central Limit Theorem continues to hold when the summed random variables are only locally independent. This "local independence CLT", a new probabilistic limit theorem, was motivated by questions in computational complexity and derandomization, and it has been a useful tool in learning theory, streaming algorithms, and hardness of approximation.
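For reference, the classical Berry–Esseen theorem quantifies Gaussian approximation for sums of fully independent random variables: if \(X_1,\ldots,X_n\) are independent with zero means, finite third moments, and \(\sigma^2 = \sum_i \mathbf{E}[X_i^2]\), then
\[
\sup_{t\in\mathbb{R}} \left| \Pr\!\left[\frac{X_1+\cdots+X_n}{\sigma} \le t\right] - \Phi(t) \right|
\;\le\; C\,\frac{\sum_{i=1}^{n}\mathbf{E}\lvert X_i\rvert^{3}}{\sigma^{3}},
\]
where \(\Phi\) is the standard normal CDF and \(C\) is an absolute constant. The result above establishes that a bound of this flavor survives when full independence is relaxed to local independence, which is what makes it applicable to the derandomization questions that motivated it.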
A follow-up work extended these techniques to quadratic functions, establishing quantitative CLTs for degree-$2$ polynomials and introducing analytic methods that have been used in several subsequent works.
Analysis of Threshold Functions
This direction provides a refined analytic understanding of polynomial threshold functions, yielding strong quantitative bounds on their sensitivity properties.
Learning from Limited Information
A complementary line of work gives the first efficient algorithms for learning halfspaces and, more generally, degree-$d$ polynomial threshold functions (PTFs) in a limited-information setting, where only partial access to the attributes is available. The central structural insight underlying these algorithms is a robust identifiability theorem: if two degree-$d$ PTFs have approximately matching degree-$\le d$ Fourier coefficients (their so-called Chow parameters), then the functions themselves are close in $L_1$ distance.
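Here, for a function \(f:\{-1,1\}^n \to \{-1,1\}\), the degree-\(d\) Chow parameters are its low-degree Fourier coefficients,
\[
\hat f(S) \;=\; \mathbf{E}_{x \sim \{-1,1\}^n}\Big[f(x)\,\prod_{i\in S} x_i\Big], \qquad \lvert S\rvert \le d,
\]
with \(x\) uniform over the hypercube; for \(d=1\) these are the classical Chow parameters of a halfspace. The identifiability theorem says that this small collection of statistics robustly determines a degree-\(d\) PTF up to small \(L_1\) error, and the algorithms recover a hypothesis from approximations of these parameters.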
Geometric Probability
More recently, I have explored high-dimensional geometric probability with algorithmic implications—for example, deriving nearly tight bounds for fitting an ellipsoid to Gaussian random samples, a question at the interface of geometry, optimization, and statistics.
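Concretely, the ellipsoid fitting problem asks: given i.i.d. samples \(x_1,\ldots,x_n \sim \mathcal{N}(0, I_d)\), when does there exist, with high probability, an origin-centered ellipsoid passing exactly through all of the points, i.e., a positive semidefinite matrix \(\Lambda\) with
\[
x_i^{\top} \Lambda\, x_i \;=\; 1 \quad \text{for all } i \in [n]\,?
\]
The question is how large \(n\) may be as a function of \(d\) before such a fit ceases to exist; the conjectured threshold scales quadratically in the dimension, and this is the regime addressed by the nearly tight bounds mentioned above.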