
Ankit Pensia
Postdoctoral Fellow, IBM Research

Hi! I am a Herman Goldstine Postdoctoral Fellow at the MIT-IBM Watson AI Lab. My research interests include robust and heavy-tailed statistics, inference under constraints, and, more broadly, machine learning and high-dimensional statistics.

I received my Ph.D. from the Computer Sciences department at UW-Madison in 2023, where I was advised by Po-Ling Loh, Varun Jog, and Ilias Diakonikolas. My dissertation, which received the CS Department's Graduate Student Research Award, is available at this link. I also interned at Google Research NYC with Pranjal Awasthi and Satyen Kale. Prior to Madison, I spent five memorable years at IIT Kanpur.

Please feel free to contact me at ankitpensia94@gmail.com.

Selected Publications (Full list, Google Scholar, dblp, Semantic Scholar)

Robust and Heavy-Tailed Statistics

Outliers and heavy-tailed distributions pose significant challenges to standard inference procedures; designing estimators that withstand them is the subject of robust statistics. I am interested in exploring the statistical and computational landscape of such estimation problems.

  • A Sub-Quadratic Time Algorithm for Robust Sparse Mean Estimation [link] [abstract]
    We study the algorithmic problem of sparse mean estimation in the presence of adversarial outliers. Specifically, the algorithm observes a corrupted set of samples from $\mathcal{N}(\mu,\mathbf{I}_d)$, where the unknown mean $\mu \in \mathbb{R}^d$ is constrained to be $k$-sparse. A series of prior works has developed efficient algorithms for robust sparse mean estimation with sample complexity $\mathrm{poly}(k,\log d, 1/\epsilon)$ and runtime $d^2 \mathrm{poly}(k,\log d,1/\epsilon)$, where $\epsilon$ is the fraction of contamination. In particular, the fastest runtime of existing algorithms is quadratic $\Omega(d^2)$, which can be prohibitive in high dimensions. This quadratic barrier in the runtime stems from the reliance of these algorithms on the sample covariance matrix, which is of size $d^2$. Our main contribution is an algorithm for robust sparse mean estimation which runs in subquadratic time using $\mathrm{poly}(k,\log d,1/\epsilon)$ samples. We also provide analogous results for robust sparse PCA. Our results build on algorithmic advances in detecting weak correlations, a generalized version of the light-bulb problem by Valiant (2015).
    A. Pensia. ICML, 2024 (Spotlight)
    Existing algorithms for robust sparse estimation ran in $\Omega(d^2)$ time, where $d$ is the dimension. Thus, while sparsity improves statistical performance (in terms of sample complexity), these benefits did not previously translate into faster algorithms because of the high computational cost. We developed the first subquadratic-time algorithm. For an introduction to this problem (and this result), please see the slides of this survey talk.
  • Outlier Robust Mean Estimation with Subgaussian Rates via Stability [link] [abstract]
    We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.
    with I. Diakonikolas and D. M. Kane. NeurIPS, 2020
    We show that recent outlier-robust algorithms also achieve subgaussian confidence intervals for heavy-tailed distributions, and vice versa. In particular, we identify the "stability" condition as the bridge between these two contamination models. In a recent paper at NeurIPS 2022, we extended these results to robust sparse mean estimation, i.e., when the mean is sparse. A toy sketch of the spectral filtering idea underlying these algorithms appears right after this list.
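
For readers who want a concrete picture of the filtering idea behind these works, here is a minimal, illustrative sketch in Python (a toy version, not the exact algorithm analyzed in any of the papers above): repeatedly look at the direction of largest empirical variance and, if that variance is suspiciously large for identity-covariance inliers, discard the points with the largest projections along it. The function name, the variance threshold, and the "drop the extreme 5%" rule are arbitrary choices made only for illustration.

    import numpy as np

    def filtered_mean(X, n_iter=10, var_threshold=9.0):
        """Toy spectral filter for robust mean estimation (illustration only).

        X is an (n, d) array; a small fraction of rows may be adversarially
        corrupted, and the inliers are assumed to have covariance roughly I_d.
        Repeatedly find the direction of largest empirical variance; if that
        variance is much larger than 1, drop the points with the largest
        squared projections along that direction.
        """
        X = np.asarray(X, dtype=float)
        for _ in range(n_iter):
            centered = X - X.mean(axis=0)
            cov = centered.T @ centered / len(X)
            eigvals, eigvecs = np.linalg.eigh(cov)
            top_var, top_dir = eigvals[-1], eigvecs[:, -1]
            if top_var <= var_threshold:
                break  # no direction looks suspicious; trust the empirical mean
            scores = (centered @ top_dir) ** 2  # outliers get large scores
            X = X[scores <= np.quantile(scores, 0.95)]  # drop the extreme 5%
        return X.mean(axis=0)

    # Example: 5% of the points are shifted far away from the true mean (zero).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    X[:50] += 10.0
    print(np.linalg.norm(X.mean(axis=0)))    # naive mean is pulled away from 0
    print(np.linalg.norm(filtered_mean(X)))  # filtered mean stays much closer to 0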

Inference under Communication, Memory, and Privacy Constraints

The proliferation of big data has led to distributed inference paradigms such as federated learning, which impose constraints on communication bandwidth, memory usage, and privacy. My research focuses on understanding the impact of these constraints.

  • Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints [link] [abstract]
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal{A}$ of two distributions $p$ and $q$, defined as $\mathcal{A} := \{(\mathbf{T} p, \mathbf{T} q) : \mathbf{T} \in \mathcal{C}\}$, where $\mathcal{C}$ is the set of channels characterizing the constraints.
    with A. R. Asadi, V. Jog, and P. Loh. COLT, 2023
    We characterize the minimax-optimal sample complexity of hypothesis testing under local differential privacy and communication constraints, develop instance-optimal algorithms, and show a separation between the sample complexities of binary and ternary distributions. In this paper, we build on our recent prior work, where we investigated the cost of communication constraints alone. A small illustration of a standard local privacy channel (randomized response) appears right after this list.
  • Streaming Algorithms for High-Dimensional Robust Statistics [link] [abstract]
    We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.
    with I. Diakonikolas, D. M. Kane, and T. Pittas. ICML, 2022
    We develop the first computationally efficient and sample-efficient streaming algorithm with $o(d^2)$ memory usage for a variety of robust estimation tasks; in fact, our algorithm uses only $\tilde{O}(d)$ space. All prior robust algorithms needed to store the entire dataset, leading to memory usage that is at least quadratic in the dimension. In a recent paper at ICML 2023, we extended these results to robust PCA.
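
To make the local privacy channels in the COLT 2023 paper above concrete, here is a small Python sketch of binary randomized response, a standard pure-LDP mechanism, together with a naive plug-in test on the privatized bits. This is only an illustration under the binary-support setting; the function names and the simple nearest-mean decision rule are my own choices here, not the instance-optimal constructions from the paper.

    import numpy as np

    def randomized_response(bits, eps, rng):
        """Binary randomized response: an eps-LDP channel for {0,1}-valued data.

        Each bit is kept with probability e^eps / (e^eps + 1) and flipped
        otherwise, so no single report reveals much about the underlying bit.
        """
        keep_prob = np.exp(eps) / (np.exp(eps) + 1.0)
        flip = rng.random(len(bits)) >= keep_prob
        return np.where(flip, 1 - bits, bits)

    def private_test(private_bits, p1, q1, eps):
        """Naive plug-in test: decide Bernoulli(p1) vs. Bernoulli(q1).

        Compares the empirical mean of the privatized bits with the mean the
        channel would induce under each hypothesis and picks the closer one.
        (For illustration only; not the optimal test from the paper.)
        """
        keep_prob = np.exp(eps) / (np.exp(eps) + 1.0)
        induced_mean = lambda r: keep_prob * r + (1.0 - keep_prob) * (1.0 - r)
        m = private_bits.mean()
        return "p" if abs(m - induced_mean(p1)) < abs(m - induced_mean(q1)) else "q"

    # Example: distinguish Bernoulli(0.4) from Bernoulli(0.6) under eps = 1.
    rng = np.random.default_rng(1)
    raw = (rng.random(5000) < 0.4).astype(int)       # true distribution: Bernoulli(0.4)
    private = randomized_response(raw, eps=1.0, rng=rng)
    print(private_test(private, p1=0.4, q1=0.6, eps=1.0))  # "p" with high probability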

Machine Learning and Statistics

I have a broad interest in the fields of machine learning and statistics.

  • The Sample Complexity of Simple Binary Hypothesis Testing [link] [abstract]
    The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either: (i) the prior-free setting, with type-I error at most $\alpha$ and type-II error at most $\beta$; or (ii) the Bayesian setting, with Bayes error at most $\delta$ and prior distribution $(\alpha, 1-\alpha)$. This problem has only been studied when $\alpha = \beta$ (prior-free) or $\alpha = 1/2$ (Bayesian), and the sample complexity is known to be characterized by the Hellinger divergence between $p$ and $q$, up to multiplicative constants. In this paper, we derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0 \le \alpha, \beta \le 1/8$ in the prior-free setting; and (ii) all $\delta \le \alpha/4$ in the Bayesian setting. In particular, the formula admits equivalent expressions in terms of certain divergences from the Jensen--Shannon and Hellinger families. The main technical result concerns an $f$-divergence inequality between members of the Jensen--Shannon and Hellinger families, which is proved by a combination of information-theoretic tools and case-by-case analyses. We explore applications of our results to robust and distributed hypothesis testing.
    with V. Jog and P. Loh. COLT, 2024
    We revisit simple binary hypothesis testing, a fundamental problem in statistics. Although the problem has been studied for a century, its non-asymptotic sample complexity was not known (in contrast, its asymptotic error rates are very well understood). We close this gap and derive tight sample complexity bounds; the classical symmetric-error characterization that we generalize is recalled after this list.
  • Generalization Error Bounds for Noisy, Iterative Algorithms [link] [abstract]
    In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.
    with V. Jog and P. Loh. ISIT, 2018
    We provide a general recipe for bounding the generalization error of noisy, iterative algorithms using information-theoretic tools; one prominent example of such an algorithm is Stochastic Gradient Langevin Dynamics (SGLD). The mutual-information bound underlying this recipe is recalled right below.
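
For reference, the mutual-information bound of Xu and Raginsky (2017) that this line of work builds on states, in its standard form, that if the loss function is $\sigma$-sub-Gaussian under the data distribution, then

    $$ \bigl| \mathbb{E}\bigl[ L_\mu(W) - L_S(W) \bigr] \bigr| \;\le\; \sqrt{\frac{2\sigma^2 \, I(S;W)}{n}}, $$

where $S = (Z_1, \dots, Z_n)$ is the training sample, $W$ is the algorithm's output, $L_\mu$ is the population risk, and $L_S$ is the empirical risk. Our ISIT 2018 paper bounds $I(S;W)$ for noisy, iterative algorithms such as SGLD.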
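
For context on the COLT 2024 paper above: in the classical symmetric regime (constant and equal error probabilities, say $\alpha = \beta = 1/3$), the sample complexity of simple binary hypothesis testing is characterized, up to universal constants, by

    $$ n^*(p, q) \;\asymp\; \frac{1}{H^2(p, q)}, \qquad H^2(p, q) \;=\; 1 - \sum_{x} \sqrt{p(x)\, q(x)}. $$

Our result extends this type of characterization to asymmetric error probabilities and to the Bayesian setting, where the answer involves divergences from the Jensen--Shannon and Hellinger families.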

Teaching Experience

  1. Graduate Teaching Assistant

    1. Mathematical Foundations of Machine Learning, UW-Madison, Fall 2018

    2. Introduction to Bioinformatics, UW-Madison, Fall 2017

    3. Probabilistic Mobile Robotics, IIT Kanpur, Fall 2016, Spring 2016

    4. Communication Systems, IIT Kanpur, Summer 2016

  2. Academic Mentor

    1. Representation and Analysis of Random Signals, IIT Kanpur, Fall 201