## Publications (Google Scholar, dblp, Semantic Scholar)

- A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm

with Ilias Diakonikolas, Daniel M. Kane, Jasper C.H. Lee, and Thanasis Pittas

Manuscript, 2023

[Abstract] [arXiv]We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb{R}^d$ such that an unknown $\alpha<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal N(\mu ,\Sigma)$, the goal is to output a list of $O(1/\alpha)$ hypotheses at least one of which is close to $\Sigma$ in relative Frobenius norm. Our main result is a $\mathrm{poly}(d,1/\alpha)$ sample and time algorithm for this task that guarantees relative Frobenius norm error of $\mathrm{poly}(1/\alpha)$. Importantly, our algorithm relies purely on spectral techniques. As a corollary, we obtain an efficient spectral algorithm for robust partial clustering of Gaussian mixture models (GMMs) -- a key ingredient in the recent work of [BDJ+22] on robustly learning arbitrary GMMs. Combined with the other components of [BDJ+22], our new method yields the first Sum-of-Squares-free algorithm for robustly learning GMMs. At the technical level, we develop a novel multi-filtering method for list-decodable covariance estimation that may be useful in other settings.

- Nearly-Linear Time and Streaming Algorithms for Outlier-Robust PCA

with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas

International Conference on Machine Learning**(ICML)**, 2023 (To Appear)

[Abstract] [arXiv]We study principal component analysis (PCA), where given a dataset in $\mathbb R^d$ from a distribution, the task is to find a unit vector $v$ that approximately maximizes the variance of the distribution after being projected along $v$. Despite being a classical task, standard estimators fail drastically if the data contains even a small fraction of outliers, motivating the problem of robust PCA. Recent work has developed computationally-efficient algorithms for robust PCA that either take super-linear time or have sub-optimal error guarantees. Our main contribution is to develop a nearly linear time algorithm for robust PCA with near-optimal error guarantees. We also develop a single-pass streaming algorithm for robust PCA with memory usage nearly-linear in the dimension.

- Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints

with Amir R. Asadi, Varun Jog, and Po-Ling Loh

Conference on Learning Theory**(COLT)**, 2023 (To Appear)

[Abstract] [arXiv] [Slides (20min)] [Slides (1hr)]We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) : \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.

- Gaussian Mean Testing Made Simple

with Ilias Diakonikolas and Daniel M. Kane

SIAM Symposium on Simplicity in Algorithms**(SOSA)**, 2023

[Abstract] [arXiv] [Conference version]We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(\mu,\Sigma)$ for some unknown covariance $\Sigma$ and mean $\mu \in \mathbb{R}^d$ satisfying $\|\mu\|_2 \geq \epsilon$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $\Theta(\sqrt{d}/\epsilon^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time.

- Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions

with Ilias Diakonikolas, Daniel M. Kane, and Jasper C.H. Lee

Advances in Neural Information Processing Systems**(NeurIPS)**, 2022

[Abstract] [arXiv] [Conference version]We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.

- List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering

with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas

Advances in Neural Information Processing Systems**(NeurIPS)**, 2022**(Oral)**

[Abstract] [arXiv] [Conference version]We study the problem of list-decodable sparse mean estimation. Specifically, for a parameter $\alpha \in (0, 1/2)$, we are given $m$ points in $\mathbb R^n$, $\lfloor \alpha m \rfloor$ of which are i.i.d. samples from a distribution $D$ with unknown $k$-sparse mean $\mu$. No assumptions are made on the remaining points, which form the majority of the dataset. The goal is to return a small list of candidates containing a vector $\hat \mu$ such that $\|\hat \mu - \mu\|_2$ is small. Prior work had studied the problem of list-decodable mean estimation in the dense setting. In this work, we develop a novel, conceptually simpler technique for list-decodable mean estimation. As the main application of our approach, we provide the first sample and computationally efficient algorithm for list-decodable sparse mean estimation. In particular, for distributions with ''certifiably bounded'' $t$-th moments in $k$-sparse directions and sufficiently light tails, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\textrm{poly}(mn^t)$. For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta (\sqrt{\log(1/\alpha)})$ with quasi-polynomial sample and computational complexity. We complement our upper bounds with nearly-matching statistical query and low-degree polynomial testing lower bounds.

- Communication-constrained hypothesis testing: Optimality, robustness, and reverse data processing inequalities

with Varun Jog and Po-Ling Loh

Manuscript, 2022

[Abstract] [arXiv]We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.

- Robust Sparse Mean Estimation via Sum of Squares

with Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, and Thanasis Pittas

Conference on Learning Theory**(COLT)**, 2022

[Abstract] [arXiv] [Conference version]We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with `certifiably bounded' $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \textrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs to algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively best possible.

- Streaming Algorithms for High-Dimensional Robust Statistics

with Ilias Diakonikolas, Daniel M. Kane, and Thanasis Pittas

International Conference on Machine Learning**(ICML)**, 2022

[Abstract] [arXiv] [Conference version]We study high-dimensional robust statistics tasks in the streaming model. A recent line of work obtained computationally efficient algorithms for a range of high-dimensional robust estimation tasks. Unfortunately, all previous algorithms require storing the entire dataset, incurring memory at least quadratic in the dimension. In this work, we develop the first efficient streaming algorithms for high-dimensional robust statistics with near-optimal memory requirements (up to logarithmic factors). Our main result is for the task of high-dimensional robust mean estimation in (a strengthening of) Huber's contamination model. We give an efficient single-pass streaming algorithm for this task with near-optimal error guarantees and space complexity nearly-linear in the dimension. As a corollary, we obtain streaming algorithms with near-optimal space complexity for several more complex tasks, including robust covariance estimation, robust regression, and more generally robust stochastic optimization.

- Simple Binary Hypothesis Testing under Communication Constraints

with Varun Jog and Po-Ling Loh

IEEE International Symposium on Information Theory**(ISIT)**, 2022

[Abstract] [Conference version] [PDF] [Subsumed by this paper]We study simple binary hypothesis testing under communication constraints, a.k.a. 'decentralized detection'. Here, each sample is mapped to a message from a finite set of messages via a channel before being revealed to a statistician. In the absence of communication constraints, it is well known that the sample complexity is characterized by the Hellinger distance between the distributions. We show that the sample complexity of hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting, and demonstrate that distributions exist in which this characterization is tight. We also provide a polynomial-time algorithm which achieves the aforementioned sample complexity. Our proofs rely on a new reverse data processing inequality and a reverse Markov's inequality, which may be of independent interest.

- Sharp Concentration Inequalities for the Centered Relative Entropy

with Alankrita Bhatt

Information and Inference: a Journal of the IMA, 2022

[Abstract] [arXiv] [Journal version]We study the relative entropy between the empirical estimate of a discrete distribution and the true underlying distribution. If the minimum value of the probability mass function exceeds an $\alpha>0$ (i.e. when the true underlying distribution is bounded sufficiently away from the boundary of the simplex), we prove an upper bound on the moment generating function of the centered relative entropy that matches (up to logarithmic factors in the alphabet size and $\alpha$) the optimal asymptotic rates, subsequently leading to a sharp concentration inequality for the centered relative entropy. As a corollary of this result we also obtain confidence intervals and moment bounds for the centered relative entropy that are sharp up to logarithmic factors in the alphabet size and $\alpha.$

- Statistical Query Lower Bounds for List-Decodable Linear Regression

with Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, and Alistair Stewart

Advances in Neural Information Processing Systems**(NeurIPS)**, 2021**(Spotlight)**

[Abstract] [arXiv] [Conference version]We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x,y) \in \mathbb{R}^d\times \mathbb{R}$ and a parameter $0<\alpha<1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1−\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\textrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.

- Estimating location parameters in sample-heterogeneous distributions

with Varun Jog and Po-Ling Loh

Information and Inference: a Journal of the IMA, 2021

[Abstract] [arXiv] [Journal version] [PDF]Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $d$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth, and median estimators, and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near-optimal when data are i.i.d. and when the fraction of ``low-noise'' distributions is as small as $\Omega\left(\frac{d \log n}{n}\right)$, where $n$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points.

- Robust regression with covariate filtering: Heavy tails and adversarial contamination

with Varun Jog and Po-Ling Loh

Manuscript, 2020

[Abstract] [arXiv]We study the problem of linear regression where both covariates and responses are potentially (i) heavy-tailed and (ii) adversarially contaminated. Several computationally efficient estimators have been proposed for the simpler setting where the covariates are sub-Gaussian and uncontaminated; however, these estimators may fail when the covariates are either heavy-tailed or contain outliers. In this work, we show how to modify the Huber regression, least trimmed squares, and least absolute deviation estimators to obtain estimators which are simultaneously computationally and statistically efficient in the stronger contamination model. Our approach is quite simple, and consists of applying a filtering algorithm to the covariates, and then applying the classical robust regression estimators to the remaining data. We show that the Huber regression estimator achieves near-optimal error rates in this setting, whereas the least trimmed squares and least absolute deviation estimators can be made to achieve near-optimal error after applying a postprocessing step.

- Outlier Robust Mean Estimation with Subgaussian Rates via Stability

with Ilias Diakonikolas and Daniel M. Kane

Advances in Neural Information Processing Systems**(NeurIPS)**, 2020

[Abstract] [arXiv] [Conference version]We study the problem of outlier robust high-dimensional mean estimation under a finite covariance assumption, and more broadly under finite low-degree moment assumptions. We consider a standard stability condition from the recent robust statistics literature and prove that, except with exponentially small failure probability, there exists a large fraction of the inliers satisfying this condition. As a corollary, it follows that a number of recently developed algorithms for robust mean estimation, including iterative filtering and non-convex gradient descent, give optimal error estimators with (near-)subgaussian rates. Previous analyses of these algorithms gave significantly suboptimal rates. As a corollary of our approach, we obtain the first computationally efficient algorithm with subgaussian rate for outlier-robust mean estimation in the strong contamination model under a finite covariance assumption.

- Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

with Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos

Advances in Neural Information Processing Systems**(NeurIPS)**, 2020**(Spotlight)**

[Abstract] [arXiv] [Conference version]The strong /lottery ticket hypothesis/ (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. (2020) establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $l$, by pruning a random one that is a factor $O(d^4l^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor $O(\log(dl))$ wider and twice as deep. Our analysis heavily relies on connecting pruning random ReLU networks to random instances of the +SubsetSum+ problem. We then show that this logarithmic over-parameterization is essentially optimal for constant depth networks. Finally, we verify several of our theoretical insights with experiments.

- Extracting robust and accurate features via a robust information bottleneck

with Varun Jog and Po-Ling Loh

IEEE Journal on Selected Areas in Information Theory, 2020

[Abstract] [Journal version] [PDF]We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck, by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small when parametrized by the inputs. We present two formulations where the relevance of the features to output labels is measured using either mutual information or MMSE. By tuning the regularization parameter, we can explicitly trade off the opposing desiderata of robustness and accuracy when constructing a classifier. We derive optimal solutions to both robust information bottleneck formulations when the inputs and outputs are jointly Gaussian, proving that the optimally robust features are also jointly Gaussian in this setting. We also propose methods for optimizing variational bounds on the robust information bottleneck objectives in general settings using stochastic gradient descent, which may be implemented efficiently in neural networks. Our experimental results for synthetic and real data sets show that the proposed feature extraction methods indeed produce classifiers with increased robustness to perturbations.

- Mean estimation for entangled single-sample distributions

with Varun Jog and Po-Ling Loh

IEEE International Symposium on Information Theory**(ISIT)**, 2019

[Abstract] [Conference version] [PDF] [Subsumed by this paper]We consider the problem of estimating the common mean of univariate data, when independent samples are drawn from non-identical symmetric, unimodal distributions. This captures the setting where all samples are Gaussian with different unknown variances. We propose an estimator that adapts to the level of heterogeneity in the data, achieving near-optimality in both the i.i.d. setting and some heterogeneous settings, where the fraction of 'low-noise' points is as small as $\frac{\log n} {n}$. Our estimator is a hybrid of the modal interval, shorth, and median estimators from classical statistics. The rates depend on the percentile of the mixture distribution, making our estimators useful even for distributions with infinite variance.

- Deep Topic Models for Multi-label Learning

with Rajat Panda, Nikhil Mehta, Mingyuan Zhou, and Piyush Rai

International Conference on Artificial Intelligence and Statistics**(AISTATS)**, 2019

[Abstract] [Conference version]We present a probabilistic framework for multi-label learning based on a deep generative model for the binary label vector associated with each observation. Our generative model learns deep multi-layer latent embeddings of the binary label vector, which are conditioned on the input features of the observation. The model also has an interesting interpretation in terms of a deep topic model, with each label vector representing a bag-of-words document, with the input features being its meta-data. In addition to capturing the structural properties of the label space (e.g., a near-low-rank label matrix), the model also offers a clean, geometric interpretation. In particular, the nonlinear classification boundaries learned by the model can be seen as the union of multiple convex polytopes. Our model admits a simple and scalable inference via efficient Gibbs sampling or EM algorithm. We compare our model with state-of-the-art baselines for multi-label learning on benchmark data sets, and also report some interesting qualitative results.

- Generalization Error Bounds for Noisy, Iterative Algorithms

with Varun Jog and Po-Ling Loh

IEEE International Symposium on Information Theory**(ISIT)**, 2018

[Abstract] [arXiv] [Conference version]In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.

Adapted from jemdoc + new design.