Ankit Pensia
PhD candidate
Department of Computer Sciences
University of Wisconsin-Madison

Hi! I am a final-year PhD student in the Computer Sciences department at UW-Madison.

I am advised by Prof. Po-Ling Loh and Prof. Varun Jog. I also work closely with Prof. Ilias Diakonikolas. During Summer 2022, I was a research intern at Google Research NYC with Dr. Pranjal Awasthi and Dr. Satyen Kale. Before coming to the lovely town of Madison, I spent five memorable years at IIT Kanpur.

Please feel free to contact me at


My research interests include robust and heavy-tailed statistics; inference under communication, memory, and privacy constraints; and machine learning more broadly.

Please see below for more details.

Selected Publications

A comprehensive list of my published works can be found here. Relevant information can also be found on my profiles at Google Scholar, dblp, and Semantic Scholar.

Robust and Heavy-Tailed Statistics

Outliers and heavy-tailed distributions can pose significant challenges to standard inference procedures. These challenges are addressed in the field of robust statistics, which focuses on developing algorithms that are resistant to the effects of these extreme and atypical data points. I am interested in exploring the statistical and computational landscapes of such algorithms.

  • Outlier Robust Mean Estimation with Subgaussian Rates via Stability [link] [abstract]
    We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.
    with I. Diakonikolas and D. M. Kane. NeurIPS, 2020
We show that recent outlier-robust algorithms also achieve subgaussian confidence intervals for heavy-tailed distributions, and vice versa. In particular, we identify the "stability" condition as the bridge between these two contamination models. In a recent paper at NeurIPS 2022, we extended these results to the setting where the mean is further constrained to be sparse, i.e., robust sparse mean estimation.
  • Robust regression with covariate filtering: Heavy tails and adversarial contamination [link] [abstract]
    We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step.
    with V. Jog and P. Loh. Manuscript, 2020
We propose a simple (computationally efficient) filtering step that removes potentially harmful covariates before running robust regression. Combined with Huber regression or least trimmed squares, this yields simple, fast, and provably robust regression algorithms.
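The two-step recipe above, first filter the covariates, then run a classical robust regression estimator on the survivors, can be sketched on synthetic data. The norm-based filter and the IRLS Huber solver below are illustrative simplifications chosen for brevity, not the paper's actual (spectral) filtering algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: clean Gaussian covariates, then corrupt a small fraction
# so they act as high-leverage outliers.
n, d = 500, 5
X = rng.standard_normal((n, d))
beta = np.ones(d)
y = X @ beta + rng.standard_normal(n)
X[:10] *= 50.0

# Step 1 (simplified filter): discard points whose covariate norm is far
# above the typical scale. The paper's filter is more refined; a norm
# threshold conveys the idea.
norms = np.linalg.norm(X, axis=1)
keep = norms <= 3.0 * np.median(norms)

# Step 2: a classical robust estimator on the remaining data -- here,
# Huber regression solved by iteratively reweighted least squares.
def huber_irls(X, y, delta=1.35, iters=50):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        r = y - X @ w
        wts = np.where(np.abs(r) <= delta, 1.0,
                       delta / np.maximum(np.abs(r), 1e-12))
        w = np.linalg.solve(X.T @ (wts[:, None] * X), X.T @ (wts * y))
    return w

beta_hat = huber_irls(X[keep], y[keep])
print(np.round(beta_hat, 2))  # close to the true coefficient vector of ones
```

Without the filtering step, the corrupted high-leverage covariates would drag even the Huber estimate far from the truth; with it, the classical estimator behaves as if the data were clean.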

Inference under Communication, Memory, and Privacy Constraints

The proliferation of big data has led to the development of distributed inference paradigms such as federated learning and the use of edge devices. These distributed setups impose constraints on communication bandwidth, memory usage, and privacy. My research focuses on understanding the impact of these constraints on sample complexity and developing computationally-efficient algorithms.

  • Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints [link] [abstract]
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) : \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.
    with A. R. Asadi, V. Jog, and P. Loh. Manuscript, 2023
We characterize the minimax-optimal sample complexity of hypothesis testing under local differential privacy and communication constraints, develop instance-optimal algorithms, and show a separation between the sample complexities for binary and ternary distributions. In this paper, we build on our recent prior work, where we showed that the cost of the communication constraints alone is mild (at most logarithmic) for binary hypothesis testing but severe (exponential) for $M$-ary hypothesis testing.
  • Streaming Algorithms for High-Dimensional Robust Statistics [link] [abstract]
    We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.
    with I. Diakonikolas, D. M. Kane, and T. Pittas. ICML, 2022
We develop the first (computationally- and sample-efficient) streaming algorithm with $o(d^2)$ memory usage for a variety of robust estimation tasks; in fact, our algorithm uses only $\tilde{O}(d)$ space. All prior robust algorithms needed to store the entire dataset, which requires memory at least quadratic in the dimension. We also develop a near linear-time robust mean estimation algorithm under a unified notion of "stability".
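The joint range set $\mathcal A = \{(\mathbf T p, \mathbf T q)\}$ in the privacy results above can be illustrated with the standard randomized-response channel, used here purely as an example (the paper optimizes over all channels satisfying the constraints). The key effect to observe is that privatization contracts the distance between the two hypotheses, which is one driver of the increased sample complexity:

```python
import numpy as np

eps = 1.0  # LDP privacy parameter
e = np.exp(eps)

# Randomized-response channel for binary inputs: report the true bit with
# probability e^eps / (1 + e^eps), flip it otherwise. Columns sum to 1.
T = np.array([[e, 1.0],
              [1.0, e]]) / (1.0 + e)

# Two hypotheses over a binary support.
p = np.array([0.2, 0.8])
q = np.array([0.6, 0.4])

# One point (Tp, Tq) of the joint range set A for this particular channel.
Tp, Tq = T @ p, T @ q

# Total variation distance shrinks after privatization.
tv = lambda a, b: 0.5 * np.abs(a - b).sum()
print(tv(p, q), tv(Tp, Tq))
```

Sweeping over all admissible channels $\mathbf T$ traces out the full set $\mathcal A$; the instance-optimal algorithms in the paper hinge on the extreme points of this set.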

Machine Learning and Statistics

I have a broad interest in the fields of machine learning and statistics. In particular, my research has explored the representational power of neural networks and the generalization error of learning algorithms.

  • Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient [link] [abstract]
The strong lottery ticket hypothesis (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. (2020) establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random one that is a factor $O(d^4 l^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width $d$ and depth $l$ can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on connecting pruning random ReLU networks to random instances of the SubsetSum problem. We then show that this logarithmic over-parameterization is essentially optimal for constant depth networks. Finally, we verify several of our theoretical insights with experiments.
    with S. Rajput, A. Nagle, H. Vishwakarma, and D. Papailiopoulos. NeurIPS, 2020
We prove the strong lottery ticket hypothesis with optimal bounds on over-parameterization. We prove these results by connecting the pruning of neural networks to random SubsetSum, a well-studied topic in TCS. This leads to an exponential improvement in the required over-parameterization of the width.
  • Generalization Error Bounds for Noisy, Iterative Algorithms [link] [abstract]
    In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.
    with V. Jog and P. Loh. ISIT, 2018
    We provide a general recipe to bound generalization error of noisy, iterative algorithms using information-theoretic tools. One prominent example of such algorithms is Stochastic Gradient Langevin Dynamics (SGLD).
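The random-SubsetSum phenomenon behind the lottery-ticket result above can be seen in a toy experiment (an illustration of the general phenomenon, not the paper's construction): with only a modest number of random weights, some subset already sums to within a small error of any bounded target, which is exactly what "approximation by pruning" asks for.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 16  # number of random weights; the theory needs only log(1/accuracy) many
weights = rng.uniform(-1, 1, n)

# Enumerate all 2^n subset sums (feasible for small n).
sums = [0.0]
for w in weights:
    sums = sums + [s + w for s in sums]
sums = np.sort(np.array(sums))

# Every target in [-1, 1] is well approximated by some subset ("pruning").
targets = np.linspace(-1, 1, 201)
errs = np.array([np.abs(sums - t).min() for t in targets])
print(errs.max())  # worst-case approximation error over all targets
```

The worst-case error here is tiny despite only 16 random weights, mirroring the logarithmic over-parameterization result: doubling the number of weights roughly squares the number of candidate subsets.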

Teaching Experience

  1. Graduate Teaching Assistant

    1. Mathematical Foundations of Machine Learning, UW-Madison, Fall 2018

    2. Introduction to Bioinformatics, UW-Madison, Fall 2017

    3. Probabilistic Mobile Robotics, IIT Kanpur, Fall 2016, Spring 2016

    4. Communication Systems, IIT Kanpur, Summer 2016

  2. Academic Mentor

    1. Representation and Analysis of Random Signals, IIT Kanpur, Fall 201