We consider two prevalent data-centric constraints in modern machine learning: (a) restricted data access with potential computational constraints, and (b) poor data quality. Our goal is to provide theoretically sound algorithms and practices for such settings.
Under (a), we focus on federated learning (FL) where data is stored locally on decentralized clients, each with individual computational constraints, and on differentially private training where data access is impaired due to the privacy-preservation requirement. Specifically, we propose an accelerated FL algorithm attaining the best known complexity for smooth non-convex functions under arbitrary client heterogeneity and compressed communication. We also provide a theoretically justified recommendation for setting the clip norm in differentially private stochastic gradient descent (DP-SGD) and derive new convergence results for DP-SGD with heavy-tailed gradients. We validate the effectiveness of our methods via extensive experimentation.
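To make the clipping operation in DP-SGD concrete, the following is a minimal sketch of one DP-SGD step: each per-example gradient is clipped to a maximum norm and Gaussian noise calibrated to that norm is added. It assumes a least-squares loss for illustration; the function name and interface are illustrative, not the dissertation's algorithm, and the clip-norm recommendation itself is developed in the corresponding chapter.

```python
import numpy as np

def dp_sgd_step(w, X, y, clip_norm, noise_mult, lr, rng):
    """One DP-SGD step on a least-squares loss 0.5 * (x @ w - y)^2.

    Per-example gradients are clipped to `clip_norm`, summed, and
    perturbed with Gaussian noise of std `noise_mult * clip_norm`.
    """
    n = len(y)
    residuals = X @ w - y                       # per-example residuals
    grads = residuals[:, None] * X              # per-example gradients, shape (n, d)
    norms = np.linalg.norm(grads, axis=1)
    # Scale each per-example gradient so its norm is at most clip_norm
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale[:, None]
    noise = rng.normal(0.0, noise_mult * clip_norm, size=w.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / n
    return w - lr * noisy_grad
```

With `noise_mult = 0` and a clip norm larger than every per-example gradient norm, the step reduces to plain (mini-batch) gradient descent, which is a useful sanity check when experimenting with the clip norm.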
Under (b), we consider the problem of learning with noisy labels. Specifically, we focus on retraining a model on the same training set on which it was initially trained, using its own hard predictions (1/0 labels) or soft predictions (raw unrounded scores) as targets. Surprisingly, this simple procedure improves the model's performance, even though retraining acquires no extra information. We theoretically characterize this surprising phenomenon for linear models; to our knowledge, our results are the first of their kind. Empirically, we demonstrate the efficacy of selective retraining for improving training under local label differential privacy, where the goal is to safeguard the privacy of only the labels by injecting label noise.
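The retraining idea above can be sketched as follows for a linear model: fit on the (noisy) labels, then refit on the model's own hard or soft predictions over the same training set. This is a minimal illustration using a least-squares fit; the function names are illustrative and the sketch omits the selective aspects analyzed in the dissertation.

```python
import numpy as np

def fit_linear(X, y):
    # Least-squares fit of a linear model; labels assumed in {0, 1}
    return np.linalg.lstsq(X, y, rcond=None)[0]

def retrain_on_own_predictions(X, y_noisy, hard=True):
    """Fit on noisy labels, then refit on the model's own predictions.

    hard=True  -> retrain on thresholded 1/0 predictions
    hard=False -> retrain on raw unrounded scores
    """
    w0 = fit_linear(X, y_noisy)                 # initial fit on noisy labels
    scores = X @ w0                             # model's own predictions
    targets = (scores > 0.5).astype(float) if hard else scores
    return fit_linear(X, targets)               # retrain on those predictions
```

Note that with soft targets and a full-column-rank design matrix, the least-squares refit exactly recovers the initial model, so any gain from soft-label retraining in this linear setting must come from the additional structure (e.g. selectivity or rounding) studied in the analysis.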