#### In the Pipeline

**Distribution Compression in Near-linear Time. (arxiv, code, slides)***Abhishek Shetty, Raaz Dwivedi, and Lester Mackey.*

View details »In distribution compression, one aims to accurately summarize a probability distribution P using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling n points from a Markov chain and identifying sqrt(n) points with Õ(1/sqrt(n)) discrepancy to P. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size n. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of 4 in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers sqrt(n) points with O(sqrt(log n/n)) integration error and better-than-Monte-Carlo maximum mean discrepancy in O(n log^3 n) time and O(sqrt(n) log^2 n) space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.

**Generalized Kernel Thinning. (arxiv, code, slides)***Raaz Dwivedi and Lester Mackey.*

View details »The kernel thinning (KT) algorithm of Dwivedi and Mackey (2021) compresses a probability distribution more effectively than independent sampling by targeting a reproducing kernel Hilbert space (RKHS) and leveraging a less smooth square-root kernel. Here we provide four improvements. First, we show that KT applied directly to the target RKHS yields tighter, dimension-free guarantees for any kernel, any distribution, and any fixed function in the RKHS. Second, we show that, for analytic kernels like Gaussian, inverse multiquadric, and sinc, target KT admits maximum mean discrepancy (MMD) guarantees comparable to or better than those of square-root KT without making explicit use of a square-root kernel. Third, we prove that KT with a fractional power kernel yields better-than-Monte-Carlo MMD guarantees for non-smooth kernels, like Laplace and Matern, that do not have square-roots. Fourth, we establish that KT applied to a sum of the target and power kernels (a procedure we call KT+) simultaneously inherits the improved MMD guarantees of power KT and the tighter individual function guarantees of target KT. In our experiments with target KT and KT+, we witness significant improvements in integration error even in 100 dimensions and when compressing challenging differential equation posteriors.

**Learned Benchmarks for Subseasonal Forecasting. (arxiv, code, SubseasonalClimateUSA dataset)***Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, and Lester Mackey.*

View details »We develop a subseasonal forecasting toolkit of simple learned benchmark models that outperform both operational practice and state-of-the-art machine learning and deep learning methods. Our new models include (a) Climatology++, an adaptive alternative to climatology that, for precipitation, is 9% more accurate and 250% more skillful than the United States operational Climate Forecasting System (CFSv2); (b) CFSv2++, a learned CFSv2 correction that improves temperature and precipitation accuracy by 7-8% and skill by 50-275%; and (c) Persistence++, an augmented persistence model that combines CFSv2 forecasts with lagged measurements to improve temperature and precipitation accuracy by 6-9% and skill by 40-130%. Across the contiguous U.S., our Climatology++, CFSv2++, and Persistence++ toolkit consistently outperforms standard meteorological baselines, state-of-the-art machine and deep learning methods, and the European Centre for Medium-Range Weather Forecasts ensemble. Overall, we find that augmenting traditional forecasting approaches with learned enhancements yields an effective and computationally inexpensive strategy for building the next generation of subseasonal forecasting benchmarks.

**Social Norm Bias: Residual Harms of Fairness-Aware Algorithms. (arxiv)***Myra Cheng, Maria De-Arteaga, Lester Mackey, and Adam Tauman Kalai.*

View details »Many modern learning algorithms mitigate bias by enforcing fairness across coarsely-defined groups related to a sensitive attribute like gender or race. However, the same algorithms seldom account for the within-group biases that arise due to the heterogeneity of group members. In this work, we characterize Social Norm Bias (SNoB), a subtle but consequential type of discrimination that may be exhibited by automated decision-making systems, even when these systems achieve group fairness objectives. We study this issue through the lens of gender bias in occupation classification from biographies. We quantify SNoB by measuring how an algorithm's predictions are associated with conformity to gender norms, which is measured using a machine learning approach. This framework reveals that for classification tasks related to male-dominated occupations, fairness-aware classifiers favor biographies written in ways that align with masculine gender norms. We compare SNoB across fairness intervention techniques and show that post-processing interventions do not mitigate this type of bias at all.

**An Open Repository of Real-Time COVID-19 Indicators. (medrxiv)***Alex Reinhart, Logan Brooks, Maria Jahja, Aaron Rumack, Jingjing Tang, Wael Al Saeed, Taylor Arnold, Amartya Basu, Jacob Bien, Angel A. Cabrera, Andrew Chin, Eu Jing Chua, Brian Clark, Nat DeFries, Jodi Forlizzi, Samuel Gratzl, Alden Green, George Haff, Robin Han, Addison J. Hu, Sangwon Hyun, Ananya Joshi, Jimi Kim, Andrew Kuznetsov, Wichada La Motte-Kerr, Yeon Jin Lee, Kenneth Lee, Zachary C. Lipton, Michael X. Liu, Lester Mackey, Kathryn Mazaitis, Daniel J. McDonald, Balasubramanian Narasimhan, Natalia L. Oliveira, Pratik Patil, Adam Perer, Collin A. Politsch, Samyak Rajanala, Dawn Rucker, Nigam H. Shah, Vishnu Shankar, James Sharpnack, Dmitry Shemetov, Noah Simon, Vishakha Srivastava, Shuyi Tan, Robert Tibshirani, Elena Tuzhilina, Ana Karina Van Nortwick, Valerie Ventura, Larry Wasserman, Jeremy C. Weiss, Kristin Williams, Roni Rosenfeld, and Ryan J. Tibshirani.*

View details »The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from de-identified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data is available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.

**Near-optimal inference in adaptive linear regression. (arxiv)***Koulik Khamaru, Yash Deshpande, Lester Mackey, and Martin J. Wainwright.*

View details »When data is collected in an adaptive manner, even simple methods likeordinary least squares can exhibit non-normal asymptotic behavior. As anundesirable consequence, hypothesis tests and confidence intervals based on asymptotic normality can lead to erroneous results. We propose a family ofonline debiasing estimators to correct these distributional anomalies in least squares estimation. Our proposed methods take advantage of the covariance structure present in the dataset and provide sharper estimates in directions for which more information has accrued. We establish an asymptotic normality property for our proposed online debiasing estimators under mild conditions on the data collection process and provide asymptotically exact confidence intervals. We additionally prove a minimax lower bound for the adaptive linear regression problem, thereby providing a baseline by which to compare estimators. There are various conditions under which our proposed estimators achieve the minimax lower bounds up to logarithmic factors. We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.

**Sampling with Mirrored Stein Operators. (arxiv, code)***Jiaxin Shi, Chang Liu, and Lester Mackey.*

View details »We introduce a new family of particle evolution samplers suitable for constrained domains and non-Euclidean geometries. Stein Variational Mirror Descent and Mirrored Stein Variational Gradient Descent minimize the Kullback-Leibler (KL) divergence to constrained target distributions by evolving particles in a dual space defined by a mirror map. Stein Variational Natural Gradient exploits non-Euclidean geometry to more efficiently minimize the KL divergence to unconstrained targets. We derive these samplers from a new class of mirrored Stein operators and adaptive kernels developed in this work. We demonstrate that these new samplers yield accurate approximations to distributions on the simplex, deliver valid confidence intervals in post-selection inference, and converge more rapidly than prior methods in large-scale unconstrained posterior inference. Finally, we establish the convergence of our new procedures under verifiable conditions on the target distribution.

**Kernel Thinning. (arxiv, code, slides)***Raaz Dwivedi and Lester Mackey.*

View details »We introduce kernel thinning, a new procedure for compressing a distribution P more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel k and O(n^2) time, kernel thinning compresses an n-point approximation to P into a sqrt(n)-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is O_d(n^{-1/2} sqrt(log n)) for compactly supported P and O_d(n^{-1/2} sqrt((log n)^{d+1} log log n)) for sub-exponential P on R^d. In contrast, an equal-sized i.i.d. sample from P suffers Omega(n^{-1/4}) integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform P on [0,1]^d but apply to general distributions on R^d and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning, in dimensions d=2 through 100.

**Stein's Method Meets Statistics: A Review of Some Recent Developments. (arxiv)***Andreas Anastasiou, Alessandro Barp, Francois-Xavier Briol, Bruno Ebner, Robert E. Gaunt, Fatemeh Ghaderinezhad, Jackson Gorham, Arthur Gretton, Christophe Ley, Qiang Liu, Lester Mackey, Chris. J. Oates, Gesine Reinert, and Yvik Swan.*

View details »Stein's method is a collection of tools for analyzing distributional comparisons through the study of a class of linear operators called Stein operators. Originally studied in probability, Stein's method has also enabled some important developments in statistics. This early success has led to a high research activity in this area in recent years. The goal of this survey is to bring together some of these developments in theoretical statistics as well as in computational statistics and, in doing so, to stimulate further research into the successful field of Stein's method and statistics. The topics we discuss include: explicit error bounds for asymptotic approximations of estimators and test statistics, a measure of prior sensitivity in Bayesian statistics, tools to benchmark and compare sampling methods such as approximate Markov chain Monte Carlo, deterministic alternatives to sampling methods, control variate techniques, and goodness-of-fit testing.

**Independent finite approximations for Bayesian nonparametric inference. (arxiv)***Tin D. Nguyen, Jonathan Huggins, Lorenzo Masoero, Lester Mackey, and Tamara Broderick.*

View details »Bayesian nonparametric priors based on completely random measures (CRMs) offer a flexible modeling approach when the number of latent components in a dataset is unknown. However, managing the infinite dimensionality of CRMs typically requires practitioners to derive ad-hoc algorithms, preventing the use of general-purpose inference methods and often leading to long compute times. We propose a general but explicit recipe to construct a simple finite-dimensional approximation that can replace the infinite-dimensional CRMs. Our independent finite approximation (IFA) is a generalization of important cases that are used in practice. The independence of atom weights in our approximation (i) makes the construction well-suited for parallel and distributed computation and (ii) facilitates more convenient inference schemes. We quantify the approximation error between IFAs and the target nonparametric prior. We compare IFAs with an alternative approximation scheme -- truncated finite approximations (TFAs), where the atom weights are constructed sequentially. We prove that, for worst-case choices of observation likelihoods, TFAs are a more efficient approximation than IFAs. However, in real-data experiments with image denoising and topic modeling, we find that IFAs perform very similarly to TFAs in terms of task-specific accuracy metrics.

**Metrizing Weak Convergence with Maximum Mean Discrepancies. (arxiv)***Carl-Johann Simon-Gabriel, Alessandro Barp, and Lester Mackey.*

View details »Theorem 12 of Simon-Gabriel and Scholkopf (JMLR, 2018) seemed to close a 40-year-old quest to characterize maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures. We prove, however, that the theorem is incorrect and provide a correction. We show that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel k, whose RKHS-functions vanish at infinity, metrizes the weak convergence of probability measures if and only if k is continuous and integrally strictly positive definite (ISPD) over all signed, finite, regular Borel measures. We also show that, contrary to the claim of the aforementioned Theorem 12, there exist both bounded continuous ISPD kernels that do not metrize weak convergence and bounded continuous non-ISPD kernels that do metrize it.

**Weighted Meta-Learning. (arxiv)***Diana Cai, Rishit Sheth, Lester Mackey, and Nicolo Fusi.*

View details »Meta-learning leverages related source tasks to learn an initialization that can be quickly fine-tuned to a target task with limited labeled examples. However, many popular meta-learning algorithms, such as model-agnostic meta-learning (MAML), only assume access to the target samples for fine-tuning. In this work, we provide a general framework for meta-learning based on weighting the loss of different source tasks, where the weights are allowed to depend on the target samples. In this general setting, we provide upper bounds on the distance of the weighted empirical risk of the source tasks and expected target risk in terms of an integral probability metric (IPM) and Rademacher complexity, which apply to a number of meta-learning settings including MAML and a weighted MAML variant. We then develop a learning algorithm based on minimizing the error bound with respect to an empirical IPM, including a weighted MAML algorithm, α-MAML. Finally, we demonstrate empirically on several regression problems that our weighted meta-learning algorithm is able to find better initializations than uniformly-weighted meta-learning algorithms, such as MAML.

**A Kernel Stein Test for Comparing Latent Variable Models. (arxiv)***Heishiro Kanagawa, Wittawat Jitkrittum, Lester Mackey, Kenji Fukumizu, and Arthur Gretton.*

View details »We propose a nonparametric, kernel-based test to assess the relative goodness of fit of latent variable models with intractable unnormalized densities. Our test generalises the kernel Stein discrepancy (KSD) tests of (Liu et al., 2016, Chwialkowski et al., 2016, Yang et al., 2018, Jitkrittum et al., 2018) which required exact access to unnormalized densities. Our new test relies on the simple idea of using an approximate observed-variable marginal in place of the exact, intractable one. As our main theoretical contribution, we prove that the new test, with a properly corrected threshold, has a well-controlled type-I error. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test (Bounliphone et al., 2015), which cannot exploit the latent structure.

**Teacher-Student Compression with Generative Adversarial Networks. (arxiv, GAN-TSC code, TSC Score code)***Ruishan Liu, Nicolo Fusi, and Lester Mackey.*

View details »More accurate machine learning models often demand more computation and memory at test time, making them difficult to deploy on CPU- or memory-constrained devices. Teacher-student compression (TSC), also known as distillation, alleviates this burden by training a less expensive student model to mimic the expensive teacher model while maintaining most of the original accuracy. However, when fresh data is unavailable for the compression task, the teacher's training data is typically reused, leading to suboptimal compression. In this work, we propose to augment the compression dataset with synthetic data from a generative adversarial network (GAN) designed to approximate the training data distribution. Our GAN-assisted TSC (GAN-TSC) significantly improves student accuracy for expensive models such as large random forests and deep neural networks on both tabular and image datasets. Building on these results, we propose a comprehensive metric---the TSC Score---to evaluate the quality of synthetic datasets based on their induced TSC performance. The TSC Score captures both data diversity and class affinity, and we illustrate its benefits over the popular Inception Score in the context of image classification.

#### Publications

**DeepMiner: Discovering Interpretable Representations for Mammogram Classification and Explanation. (arxiv, code)***Jimmy Wu, Bolei Zhou, Diondra Peck, Scott Hsieh, Vandana Dialani, Lester Mackey, and Genevieve Patterson.*

Harvard Data Science Review. October 2021.

View details »We propose DeepMiner, a framework to discover interpretable representations in deep neural networks and to build explanations for medical predictions. By probing convolutional neural networks (CNNs) trained to classify cancer in mammograms, we show that many individual units in the final convolutional layer of a CNN respond strongly to diseased tissue concepts specified by the BI-RADS lexicon. After expert annotation of the interpretable units, our proposed method is able to generate explanations for CNN mammogram classification that are correlated with ground truth radiology reports on the DDSM dataset. We show that DeepMiner not only enables better understanding of the nuances of CNN classification decisions, but also possibly discovers new visual knowledge relevant to medical diagnosis.

**Optimal Thinning of MCMC Output. (arxiv, website, video, slides)***Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Lester Mackey, and Chris. J. Oates.*

Journal of the Royal Statistical Society, Series B (JRSSB). To appear.

View details »The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo can be sub-optimal in terms of the empirical approximations that are produced. Typically a number of the initial states are attributed to "burn in" and removed, whilst the chain can be "thinned" if compression is also required. In this paper we consider the problem of selecting a subset of states, of fixed cardinality, such that the approximation provided by their empirical distribution is close to optimal. A novel method is proposed, based on greedy minimisation of a kernel Stein discrepancy, that is suitable for problems where heavy compression is required. Theoretical results guarantee consistency of the method and its effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations. Software is available in the "Stein Thinning" package in both Python and MATLAB, and example code is included.

**Online Learning with Optimism and Delay. (arxiv, code, slides)***Genevieve Flaspohler, Francesco Orabona, Judah Cohen, Soukayna Mouatadid, Miruna Oprescu, Paulo Orenstein, and Lester Mackey.*

International Conference on Machine Learning (ICML). July 2021.

View details »Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms -- DORM, DORM+, and AdaHedgeD -- arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.

**Knowledge Distillation as Semiparametric Inference. (arxiv, code, slides)***Tri Dao, Govinda M. Kamath, Vasilis Syrgkanis, and Lester Mackey.*

International Conference on Learning Representations (ICLR). May 2021.

View details »A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements - cross-fitting and loss correction - to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.

**Initialization and Regularization of Factorized Neural Layers. (arxiv, code, blog)***Mikhail Khodak, Neil Tenenholtz, Lester Mackey, and Nicolo Fusi.*

International Conference on Learning Representations (ICLR). May 2021.

View details »Factorized layers - operations parameterized by products of two or more matrices - occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to that of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.

**Cross-validation Confidence Intervals for Test Error. (arxiv, code, slides)***Pierre Bayle, Alexandre Bayle, Lucas Janson, and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). December 2020.

View details »This work develops central limit theorems for cross-validation and consistent estimators of its asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for k-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller k-fold test error than another. These results are also the first of their kind for the popular choice of leave-one-out cross-validation. In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature.

**Stochastic Stein Discrepancies. (arxiv, code)***Jackson Gorham, Anant Raj, and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). December 2020.

View details »Stein discrepancies (SDs) monitor convergence and non-convergence in approximate inference when exact integration and sampling are intractable. However, the computation of a Stein discrepancy can be prohibitive if the Stein operator -- often a sum over likelihood terms or potentials -- is expensive to evaluate. To address this deficiency, we show that stochastic Stein discrepancies (SSDs) based on subsampled approximations of the Stein operator inherit the convergence control properties of standard SDs with probability 1. Along the way, we establish the convergence of Stein variational gradient descent (SVGD) on unbounded domains, resolving an open question of Liu (2017). In our experiments with biased Markov chain Monte Carlo (MCMC) hyperparameter tuning, approximate MCMC sampler selection, and stochastic SVGD, SSDs deliver comparable inferences to standard SDs with orders of magnitude fewer likelihood evaluations.

**Minimax Estimation of Conditional Moment Models. (arxiv, code, blog)***Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis.*

Advances in Neural Information Processing Systems (NeurIPS). December 2020.

View details »We develop an approach for estimating models described via conditional moment restrictions, with a prototypical application being non-parametric instrumental variable regression. We introduce a min-max criterion function, under which the estimation problem can be thought of as solving a zero-sum game between a modeler who is optimizing over the hypothesis space of the target model and an adversary who identifies violating moments over a test function space. We analyze the statistical estimation rate of the resulting estimator for arbitrary hypothesis spaces, with respect to an appropriate analogue of the mean squared error metric, for ill-posed inverse problems. We show that when the minimax criterion is regularized with a second moment penalty on the test function and the test function space is sufficiently rich, then the estimation rate scales with the critical radius of the hypothesis and test function spaces, a quantity which typically gives tight fast rates. Our main result follows from a novel localized Rademacher analysis of statistical learning problems defined via minimax objectives. We provide applications of our main results for several hypothesis spaces used in practice such as: reproducing kernel Hilbert spaces, high dimensional sparse linear functions, spaces defined via shape constraints, ensemble estimators such as random forests, and neural networks. For each of these applications we provide computationally efficient optimization methods for solving the corresponding minimax problem (e.g. stochastic first-order heuristics for neural networks). In several applications, we show how our modified mean squared error rate, combined with conditions that bound the ill-posedness of the inverse problem, lead to mean squared error rates. We conclude with an extensive experimental analysis of the proposed methods.

**Single Point Transductive Prediction. (arxiv, code, talk)***Nilesh Tripuraneni and Lester Mackey.*

International Conference on Machine Learning (ICML). July 2020.

View details »Standard methods in supervised learning separate training and prediction: the model is fit independently of any test points it may encounter. However, can knowledge of the next test point

**x**_{★}be exploited to improve prediction accuracy? We address this question in the context of linear prediction, showing how techniques from semiparametric inference can be used transductively to combat regularization bias. We first lower bound the**x**_{★}prediction error of ridge regression and the Lasso, showing that they must incur significant bias in certain test directions. We then provide non-asymptotic upper bounds on the**x**_{★}prediction error of two transductive prediction rules. We conclude by showing the efficacy of our methods on both synthetic and real data, highlighting the improvements single point transductive prediction can provide in settings with distribution shift.**Opportunities and Challenges for Analyzing Cancer Data at the Inter- and Intra-Institutional Levels. (article)***Julie Wu, Jordan Bryan, Samuel M. Rubinstein, Lucy Wang, Michele Lenoue-Newton, Raed Zuhour, Mia Levy, Christine Micheel, Yaomin Xu, Suresh K. Bhavnani, Lester Mackey, and Jeremy L. Warner.*

JCO Precision Oncology. June 2020.

View details »**PURPOSE:**Our goal was to identify the opportunities and challenges in analyzing data from the American Association of Cancer Research Project Genomics Evidence Neoplasia Information Exchange (GENIE), a multi-institutional database derived from clinically driven genomic testing, at both the inter- and the intra-institutional level. Inter-institutionally, we identified genotypic differences between primary and metastatic tumors across the 3 most represented cancers in GENIE. Intra-institutionally, we analyzed the clinical characteristics of the Vanderbilt-Ingram Cancer Center (VICC) subset of GENIE to inform the interpretation of GENIE as a whole.**METHODS:**We performed overall cohort matching on the basis of age, ethnicity, and sex of 13,208 patients stratified by cancer type (breast, colon, or lung) and sample site (primary or metastatic). We then determined whether detected variants, at the gene level, were associated with primary or metastatic tumors. We extracted clinical data for the VICC subset from VICC's clinical data warehouse. Treatment exposures were mapped to a 13-class schema derived from the HemOnc ontology.**RESULTS:**Across 756 genes, there were significant differences in all cancer types. In breast cancer, ESR1 variants were over-represented in metastatic samples (odds ratio, 5.91; q < 10e-6). TP53 mutations were over-represented in metastatic samples across all cancers. VICC had a significantly different cancer type distribution than that of GENIE but patients were well matched with respect to age, sex, and sample type. Treatment data from VICC was used for a bipartite network analysis, demonstrating clusters with a mix of histologies and others being more histology specific.**CONCLUSION:**This article demonstrates the feasibility of deriving meaningful insights from GENIE at the inter- and intra-institutional level and illuminates the opportunities and challenges of the data GENIE contains. The results should help guide future development of GENIE, with the goal of fully realizing its potential for accelerating precision medicine.**Approximate Cross-validation: Guarantees for Model Assessment and Selection. (arxiv, code, talk)***Ashia Wilson, Maximilian Kasy, and Lester Mackey.*

International Conference on Artificial Intelligence and Statistics (AISTATS). June 2020.

View details »Cross-validation (CV) is a popular approach for assessing and selecting predictive models. However, when the number of folds is large, CV suffers from a need to repeatedly refit a learning procedure on a large number of training datasets. Recent work in empirical risk minimization (ERM) approximates the expensive refitting with a single Newton step warm-started from the full training set optimizer. While this can greatly reduce runtime, several open questions remain including whether these approximations lead to faithful model selection and whether they are suitable for non-smooth objectives. We address these questions with three main contributions: (i) we provide uniform non-asymptotic, deterministic model assessment guarantees for approximate CV; (ii) we show that (roughly) the same conditions also guarantee model selection performance comparable to CV; (iii) we provide a proximal Newton extension of the approximate CV framework for non-smooth prediction problems and develop improved assessment guarantees for problems such as L1-regularized ERM.

**Importance Sampling via Local Sensitivity. (arxiv, talk)***Anant Raj, Cameron Musco, and Lester Mackey.*

International Conference on Artificial Intelligence and Statistics (AISTATS). June 2020.

View details »Given a loss function F : X -> R+ that can be written as the sum of losses over a large set of inputs a1, ..., an, it is often desirable to approximate F by subsampling the input points. Strong theoretical guarantees require taking into account the importance of each point, measured by how much its individual loss contributes to F(x). Maximizing this importance over all x yields the sensitivity score of ai. Sampling with probabilities proportional to these scores gives strong guarantees, allowing one to approximately minimize of F using just the subsampled points. Unfortunately, sensitivity sampling is difficult to apply since (1) it is unclear how to efficiently compute the sensitivity scores and (2) the sample size required is often impractically large. To overcome both obstacles we introduce local sensitivity, which measures data point importance in a ball around some center x0. We show that the local sensitivity can be efficiently estimated using the leverage scores of a quadratic approximation to F and that the sample size required to approximate F around x0 can be bounded. We propose employing local sensitivity sampling in an iterative optimization method and analyze its convergence when F is smooth and convex.

**Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. (article)***Thomas Schaffter, Diana S. M. Buist, Christoph I. Lee, Yaroslav Nikulin, Dezso Ribli, Yuanfang Guan, William Lotter, Zequn Jie, Hao Du, Sijia Wang, Jiashi Feng, Mengling Feng, Hyo-Eun Kim, Francisco Albiol, Alberto Albiol, Stephen Morrell, Zbigniew Wojna, Mehmet Eren Ahsen, Umar Asif, Antonio Jimeno Yepes, Shivanthan Yohanandan, Simona Rabinovici-Cohen, Darvin Yi, Bruce Hoff, Thomas Yu, Elias Chaibub Neto, Daniel L. Rubin, Peter Lindholm, Laurie R. Margolies, Russell Bailey McBride, Joseph H. Rothstein, Weiva Sieh, Rami Ben-Ari, Stefan Harrer, Andrew Trister, Stephen Friend, Thea Norman, Berkman Sahiner, Fredrik Strand, Justin Guinney, Gustavo Stolovitzky, and the DM DREAM Consortium (including Lester Mackey).*

Journal of the American Medical Association Network Open. March 2020.

View details »**Importance:**Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives.**Objective:**To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms.**Design, Setting, and Participants:**In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016.**Main Outcomes and Measurements:**Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve and algorithm specificity compared with radiologists' specificity with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessment was developed and evaluated.**Results:**Overall, 144,231 screening mammograms from 85,580 US women (952 cancer positive less than or equal to 12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166,578 examinations from 68,008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists' sensitivity, lower than community-practice radiologists' specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity.**Conclusions and Relevance:**While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.**Minimum Stein Discrepancy Estimators. (arxiv)***Alessandro Barp, Francois-Xavier Briol, Andrew B. Duncan, Mark Girolami, and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). December 2019.

View details »When maximum likelihood estimation is infeasible, one often turns to score matching, contrastive divergence, or minimum probability flow learning to obtain tractable parameter estimates. We provide a unifying perspective of these techniques as minimum Stein discrepancy estimators and use this lens to design new diffusion kernel Stein discrepancy (DKSD) and diffusion score matching (DSM) estimators with complementary strengths. We establish the consistency, asymptotic normality, and robustness of DKSD and DSM estimators, derive stochastic Riemannian gradient descent algorithms for their efficient optimization, and demonstrate their advantages over score matching in models with non-smooth densities or heavy tailed distributions.

**Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. (arxiv)***Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu.*

Advances in Neural Information Processing Systems (NeurIPS). December 2019.

View details »Sampling with Markov chain Monte Carlo methods typically amounts to discretizing some continuous-time dynamics with numerical integration. In this paper, we establish the convergence rate of sampling algorithms obtained by discretizing smooth Ito diffusions exhibiting fast Wasserstein-2 contraction, based on local deviation properties of the integration scheme. In particular, we study a sampling algorithm constructed by discretizing the overdamped Langevin diffusion with the method of stochastic Runge-Kutta. For strongly convex potentials that are smooth up to a certain order, its iterates converge to the target distribution in 2-Wasserstein distance in Õ(dε^{-2/3}) iterations. This improves upon the best-known rate for strongly log-concave sampling based on the overdamped Langevin equation using only the gradient oracle without adjustment. In addition, we extend our analysis of stochastic Runge-Kutta methods to uniformly dissipative diffusions with possibly non-convex potentials and show they achieve better rates compared to the Euler-Maruyama scheme in terms of the dependence on tolerance ε. Numerical studies show that these algorithms lead to better stability and lower asymptotic errors.

**Accelerating Rescaled Gradient Descent: Fast Optimization of Smooth Functions. (arxiv, code)***Ashia Wilson, Lester Mackey, and Andre Wibisono.*

Advances in Neural Information Processing Systems (NeurIPS). December 2019.

View details »We present a family of algorithms, called descent algorithms, for optimizing convex and non-convex functions. We also introduce a new first-order algorithm, called rescaled gradient descent (RGD), and show that RGD achieves a faster convergence rate than gradient descent provided the function is strongly smooth -- a natural generalization of the standard smoothness assumption on the objective function. When the objective function is convex, we present two novel frameworks for "accelerating" descent methods, one in the style of Nesterov and the other in the style of Monteiro and Svaiter, using a single Lyapunov. Rescaled gradient descent can be accelerated under the same strong smoothness assumption using both frameworks. We provide several examples of strongly smooth loss functions in machine learning and numerical experiments that verify our theoretical findings. We also present several extensions of our novel Lyapunov framework, including deriving optimal universal tensor methods and extending our framework to the coordinate setting.

**Measuring Sample Quality with Diffusions. (arxiv, code)***Jackson Gorham, Andrew B. Duncan, Sebastian J. Vollmer, and Lester Mackey.*

Annals of Applied Probability. October 2019.

View details »Stein's method for measuring convergence to a continuous target distribution relies on an operator characterizing the target and Stein factor bounds on the solutions of an associated differential equation. While such operators and bounds are readily available for a diversity of univariate targets, few multivariate targets have been analyzed. We introduce a new class of characterizing operators based on Ito diffusions and develop explicit multivariate Stein factor bounds for any target with a fast-coupling Ito diffusion. As example applications, we develop computable and convergence-determining diffusion Stein discrepancies for log-concave, heavy-tailed, and multimodal targets and use these quality measures to select the hyperparameters of biased Markov chain Monte Carlo (MCMC) samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate MCMC. Our results establish a near-linear relationship between diffusion Stein discrepancies and Wasserstein distances, improving upon past work even for strongly log-concave targets. The exposed relationship between Stein factors and Markov process coupling may be of independent interest.

**Improving Subseasonal Forecasting in the Western U.S. with Machine Learning. (arxiv, our SubseasonalRodeo dataset, slides, poster)***Jessica Hwang, Paulo Orenstein, Judah Cohen, Karl Pfeiffer, and Lester Mackey.*

International Conference on Knowledge Discovery and Data Mining (KDD). August 2019.

View details »Water managers in the western United States (U.S.) rely on longterm forecasts of temperature and precipitation to prepare for droughts and other wet weather extremes. To improve the accuracy of these longterm forecasts, the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a year-long real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two to four weeks and four to six weeks in advance. Here we present and evaluate our machine learning approach to the Rodeo and release our SubseasonalRodeo dataset, collected to train and evaluate our forecasting system.

Our system is an ensemble of two nonlinear regression models. The first integrates the diverse collection of meteorological measurements and dynamic model forecasts in the SubseasonalRodeo dataset and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone is significantly more accurate than the debiased operational U.S. Climate Forecasting System (CFSv2), and our ensemble skill exceeds that of the top Rodeo competitor for each target variable and forecast horizon. Moreover, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40-50% for temperature and 129-169% for precipitation. We hope that both our dataset and our methods will help to advance the state of the art in subseasonal forecasting.

**Stein Point Markov Chain Monte Carlo. (arxiv, code)***Wilson Ye Chen, Alessandro Barp, Francois-Xavier Briol, Jackson Gorham, Mark Girolami, Lester Mackey, and Chris. J. Oates.*

International Conference on Machine Learning (ICML). June 2019.

View details »An important task in machine learning and statistics is the approximation of a probability measure by an empirical measure supported on a discrete point set. Stein Points are a class of algorithms for this task, which proceed by sequentially minimising a Stein discrepancy between the empirical measure and the target and, hence, require the solution of a non-convex optimisation problem to obtain each new point. This paper removes the need to solve this optimisation problem by, instead, selecting each new point based on a Markov chain sample path. This significantly reduces the computational cost of Stein Points and leads to a suite of algorithms that are straightforward to implement. The new algorithms are illustrated on a set of challenging Bayesian inference problems, and rigorous theoretical guarantees of consistency are established.

**Stratification of amyotrophic lateral sclerosis patients: a crowdsourcing approach. (article)***Robert Kueffner, Neta Zach, Maya Bronfeld, Raquel Norel, Nazem Atassi, Venkat Balagurusamy, Barbara DiCamillo, Adriano Chio, Merit Cudkowicz, Donna Dillenberger, Javier Garcia-Garcia, Orla Hardiman, Bruce Hoff, Joshua Knight, Melanie L. Leitner, Guang Li, Lara Mangravite, Thea Norman, Liuxia Wang, The ALS Stratification Consortium (including Lester Mackey), Jinfeng Xiao, Wen-Chieh Fang, Jian Peng, Chen Yang, Huan-Jui Chang, and Gustavo Stolovitzky.*

Scientific Reports. January 2019.

View details »Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease where substantial heterogeneity in clinical presentation urgently requires a better stratification of patients for the development of drug trials and clinical care. In this study we explored stratification through a crowdsourcing approach, the DREAM Prize4Life ALS Stratification Challenge. Using data from >10,000 patients from ALS clinical trials and 1479 patients from community-based patient registers, more than 30 teams developed new approaches for machine learning and clustering, outperforming the best current predictions of disease outcome. We propose a new method to integrate and analyze patient clusters across methods, showing a clear pattern of consistent and clinically relevant sub-groups of patients that also enabled the reliable classification of new patients. Our analyses reveal novel insights in ALS and describe for the first time the potential of a crowdsourcing to uncover hidden patient sub-populations, and to accelerate disease understanding and therapeutic development.

**A Multifactorial Model of T Cell Expansion and Durable Clinical Benefit in Response to a PD-L1 Inhibitor. (article, code, The ASCO Post)***Mark DM Leiserson, Vasilis Syrgkanis, Amy Gilson, Miroslav Dudik, Sharon Gillett, Jennifer Chayes, Christian Borgs, Dean F Bajorin, Jonathan Rosenberg, Samuel Funt, Alexandra Snyder, and Lester Mackey.*

PLOS One. December 2018.

View details »Checkpoint inhibitor immunotherapies have had major success in treating patients with late-stage cancers, yet the minority of patients benefit. Mutation load and PD-L1 staining are leading biomarkers associated with response, but each is an imperfect predictor. A key challenge to predicting response is modeling the interaction between the tumor and immune system. We begin to address this challenge with a multifactorial model for response to anti-PD-L1 therapy. We train a model to predict immune response in patients after treatment based on 36 clinical, tumor, and circulating features collected prior to treatment. We analyze data from 21 bladder cancer patients using the elastic net high-dimensional regression procedure and, as training set error is a biased and overly optimistic measure of prediction error, we use leave-one-out cross-validation to obtain unbiased estimates of accuracy on held-out patients. In held-out patients, the model explains 79% of the variance in T cell clonal expansion. This predicted immune response is multifactorial, as the variance explained is at most 23% if clinical, tumor, or circulating features are excluded. Moreover, if patients are triaged according to predicted expansion, only 38% of non-durable clinical benefit (DCB) patients need be treated to ensure that 100% of DCB patients are treated. In contrast, using mutation load or PD-L1 staining alone, one must treat at least 77% of non-DCB patients to ensure that all DCB patients receive treatment. Thus, integrative models of immune response may improve our ability to anticipate clinical benefit of immunotherapy.

**S2S reboot: An argument for greater inclusion of machine learning in subseasonal to seasonal forecasts. (article)***Judah Cohen, Dim Coumou, Jessica Hwang, Lester Mackey, Paulo Orenstein, Sonja Totz, and Eli Tziperman.*

WIREs Climate Change. December 2018.

View details »The discipline of seasonal climate prediction began as an exercise in simple statistical techniques. However, today the large government forecast centers almost exclusively rely on complex fully coupled dynamical forecast systems for their subseasonal to seasonal (S2S) predictions while statistical techniques are mostly neglected and those techniques still in use have not been updated in decades. In this Opinion Article, we argue that new statistical techniques mostly developed outside the field of climate science, collectively referred to as machine learning, can be adopted by climate forecasters to increase the accuracy of S2S predictions. We present an example of where unsupervised learning demonstrates higher accuracy in a seasonal prediction than the state-of-the-art dynamical systems. We also summarize some relevant machine learning methods that are most applicable to climate prediction. Finally, we show by comparing real-time dynamical model forecasts with observations from winter 2017/2018 that dynamical model forecasts are almost entirely insensitive to polar vortex (PV) variability and the impact on sensible weather. Instead, statistical forecasts more accurately predicted the resultant sensible weather from a mid-winter PV disruption than the dynamical forecasts. The important implication from the poor dynamical forecasts is that if Arctic change influences mid-latitude weather through PV variability, then the ability of dynamical models to demonstrate the existence of such a pathway is compromised. We conclude by suggesting that S2S prediction will be most beneficial to the public by incorporating mixed or a hybrid of dynamical forecasts and updated statistical techniques such as machine learning.

**Global Non-convex Optimization with Discretized Diffusions. (arxiv, poster)***Murat A. Erdogdu, Lester Mackey, and Ohad Shamir.*

Advances in Neural Information Processing Systems (NeurIPS). December 2018.

View details »An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.

**Random Feature Stein Discrepancies. (arxiv, code, poster)***Jonathan H. Huggins and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). December 2018.

View details »Computable Stein discrepancies have been deployed for a variety of applications, including sampler selection in posterior inference, approximate Bayesian inference, and goodness-of-fit testing. Existing convergence-determining Stein discrepancies admit strong theoretical guarantees but suffer from a computational cost that grows quadratically in the sample size. While linear-time Stein discrepancies have been proposed for goodness-of-fit testing, they exhibit avoidable degradations in testing power---even when power is explicitly optimized. To address these shortcomings, we introduce feature Stein discrepancies (ΦSDs), a new family of quality measures that can be cheaply approximated using importance sampling. We show how to construct ΦSDs that provably determine the convergence of a sample to its target and develop high-accuracy approximations---random ΦSDs (RΦSDs)---which are computable in near-linear time. In our experiments with sampler selection for approximate posterior inference and goodness-of-fit testing, RΦSDs typically perform as well or better than quadratic-time KSDs while being orders of magnitude faster to compute.

**Stein Points. (arxiv, code, bib)***Wilson Ye Chen, Lester Mackey, Jackson Gorham, Francois-Xavier Briol, and Chris J. Oates.*

International Conference on Machine Learning (ICML). July 2018.

View details »An important task in computational statistics and machine learning is to approximate a posterior distribution p(x) with an empirical measure supported on a set of representative points {x_i}_{i=1}^n. This paper focuses on methods where the selection of points is essentially deterministic, with an emphasis on achieving accurate approximation when n is small. To this end, we present `Stein Points'. The idea is to exploit either a greedy or a conditional gradient method to iteratively minimise a kernel Stein discrepancy between the empirical measure and p(x). Our empirical results demonstrate that Stein Points enable accurate approximation of the posterior at modest computational cost. In addition, theoretical results are provided to establish convergence of the method.

**Orthogonal Machine Learning: Power and Limitations. (arxiv, slides, code, bib)***Lester Mackey, Vasilis Syrgkanis, and Ilias Zadik.*

International Conference on Machine Learning (ICML). July 2018.

View details »Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a k-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear regression setting popular in causal inference, we show that we can construct second-order orthogonal moments if and only if the treatment residual is not normally distributed. Our proof relies on Stein's lemma and may be of independent interest. We conclude by demonstrating the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.

**Accurate Inference for Adaptive Linear Models. (arxiv, code, bib)***Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy.*

International Conference on Machine Learning (ICML). July 2018.

View details »Estimators computed from adaptively collected data do not behave like their non-adaptive brethren. Rather, the sequential dependence of the collection policy can lead to severe distributional biases that persist even in the infinite data limit. We develop a general method -- W-decorrelation -- for transforming the bias of adaptive linear regression estimators into variance. The method uses only coarse-grained information about the data collection policy and does not need access to propensity scores or exact knowledge of the policy. We bound the finite-sample bias and variance of W-decorrelation and develop asymptotically correct confidence intervals based on a novel martingale central limit theorem. We then demonstrate the empirical benefits of the generic W-decorrelation procedure in two different adaptive data settings: the multi-armed bandit and the autoregressive time series settings.

**Expert identification of visual primitives used by CNNs during mammogram classification. (arxiv, code, poster, bib)***Jimmy Wu, Diondra Peck, Scott Hsieh, Vandana Dialani, Constance D. Lehman, Bolei Zhou, Vasilis Syrgkanis, Lester Mackey, and Genevieve Patterson.*

SPIE Medical Imaging. February 2018.

View details »This work interprets the internal representations of deep neural networks trained for classifying diseased tissue in 2D mammograms. We propose an expert-in-the-loop interpretation method to label the behavior of internal units in convolutional neural networks (CNNs). Expert radiologists identify that the visual patterns detected by the units are correlated with meaningful medical phenomena such as mass tissue and calcificated vessels. We demonstrate that several trained CNN models are able to produce explanatory descriptions to support the final classification decisions. We view this as an important first step toward interpreting the internal representations of medical classification CNNs and explaining their predictions.

**Empirical Bayesian Analysis of Simultaneous Changepoints in Multiple Data Sequences. (arxiv, code, bib)***Zhou Fan and Lester Mackey.*

Annals of Applied Statistics. December 2017.

View details »Copy number variations in cancer cells and volatility fluctuations in stock prices are commonly manifested as changepoints occurring at the same positions across related data sequences. We introduce a Bayesian modeling framework, BASIC, that employs a changepoint prior to capture the co-occurrence tendency in data of this type. We design efficient algorithms to sample from and maximize over the BASIC changepoint posterior and develop a Monte Carlo expectation-maximization procedure to select prior hyperparameters in an empirical Bayes fashion. We use the resulting BASIC framework to analyze DNA copy number variations in the NCI-60 cancer cell lines and to identify important events that affected the price volatility of S&P 500 stocks from 2000 to 2009.

**Measuring Sample Quality with Kernels. (arxiv, slides, code, bib)***Jackson Gorham and Lester Mackey.*

International Conference on Machine Learning (ICML). August 2017.

View details »Approximate Markov chain Monte Carlo (MCMC) offers the promise of more rapid sampling at the cost of more biased inference. Since standard MCMC diagnostics fail to detect these biases, researchers have developed computable Stein discrepancy measures that provably determine the convergence of a sample to its target distribution. This approach was recently combined with the theory of reproducing kernels to define a closed-form kernel Stein discrepancy (KSD) computable by summing kernel evaluations across pairs of sample points. We develop a theory of weak convergence for KSDs based on Stein's method, demonstrate that commonly used KSDs fail to detect non-convergence even for Gaussian targets, and show that kernels with slowly decaying tails provably determine convergence for a large class of target distributions. The resulting convergence-determining KSDs are suitable for comparing biased, exact, and deterministic sample sequences and simpler to compute and parallelize than alternative Stein discrepancies. We use our tools to compare biased samplers, select sampler hyperparameters, and improve upon existing KSD approaches to one-sample hypothesis testing and sample quality improvement.

**Improving Gibbs Sampler Scan Quality with DoGS. (arxiv, bib)***Ioannis Mitliagkas and Lester Mackey.*

International Conference on Machine Learning (ICML). August 2017.

View details »The pairwise influence matrix of Dobrushin has long been used as an analytical tool to bound the rate of convergence of Gibbs sampling. In this work, we use Dobrushin influence as the basis of a practical tool to certify and efficiently improve the quality of a discrete Gibbs sampler. Our Dobrushin-optimized Gibbs samplers (DoGS) offer customized variable selection orders for a given sampling budget and variable subset of interest, explicit bounds on total variation distance to stationarity, and certifiable improvements over the standard systematic and uniform random scan Gibbs samplers. In our experiments with joint image segmentation and object recognition, Markov chain Monte Carlo maximum likelihood estimation, and Ising model inference, DoGS consistently deliver higher-quality inferences with significantly smaller sampling budgets than standard Gibbs samplers.

**Predicting Patient "Cost Blooms" in Denmark: A Longitudinal Population-based Study. (pdf, bib)***Suzanne Tamang, Arnold Milstein, Henrik Toft Sorensen, Lars Pedersen, Lester Mackey, Jean-Raymond Betterton, Lucas Janson, and Nigam Shah.*

BMJ Open. January 2017.

View details »**Objectives:**To compare the ability of standard vs. enhanced models to predict future high-cost patients, especially those who move from a lower to the upper decile of per capita healthcare expenditures within one year - i.e., "cost bloomers."**Design:**We developed alternative models to predict being in the upper decile of healthcare expenditures in Year 2 of a sample, based on data from Year 1. Our six alternative models ranged from a standard cost-prediction model with four variables (i.e., traditional model features), to our largest enhanced model with 1,053 nontraditional model features. To quantify any increases in predictive power that enhanced models achieved over standard tools, we compared the prospective predictive performance of each model.**Participants and setting:**We used the population of Western Denmark between 2004 and 2011 (2,146,801 individuals) to predict future high-cost patients and examine characteristics of high-cost cohorts. Using the most recent two-year period (2010-11) for model evaluation, our whole-population model used a cohort of 1,557,950 individuals with a full year of active residency Year 1 (2010). Our cost-bloom model excluded the 155,795 individuals who were already high cost at the population level in Year 1, resulting in 1,402,155 individuals for prediction of cost bloomers in Year 2 (2011).**Primary outcome measures:**Using unseen data from a future year, we evaluated each model's prospective predictive performance by calculating the ratio of predicted high-cost patient expenditures to the actual high-cost patient expenditures in Year 2 - i.e., cost capture.**Results:**Our best enhanced model achieved a 21 percent and 30 percent improvement in cost capture over a standard diagnosis-based model for predicting population-level high-cost patients and cost bloomers, respectively.**Conclusions:**In combination with modern statistical learning methods for analyzing large datasets, models enhanced with a large and diverse set of features led to better performance, especially for predicting future cost bloomers.**Predicting inpatient clinical order patterns with probabilistic topic models vs. conventional order sets. (pdf, bib)***Jonathan H. Chen, Mary K. Goldstein, Steven M. Asch, Lester Mackey, and Russ B. Altman.*

Journal of the American Medical Informatics Association. September 2016.

View details »**Objective**Build probabilistic topic model representations of hospital admissions processes and compare the ability of such models to predict clinical order patterns as compared to preconstructed order sets.**Materials and Methods**The authors evaluated the first 24 hours of structured electronic health record data for > 10 K inpatients. Drawing an analogy between structured items (e.g., clinical orders) to words in a text document, the authors performed latent Dirichlet allocation probabilistic topic modeling. These topic models use initial clinical information to predict clinical orders for a separate validation set of > 4 K patients. The authors evaluated these topic model-based predictions vs existing human-authored order sets by area under the receiver operating characteristic curve, precision, and recall for subsequent clinical orders.**Results**Existing order sets predict clinical orders used within 24 hours with area under the receiver operating characteristic curve 0.81, precision 16%, and recall 35%. This can be improved to 0.90, 24%, and 47% by using probabilistic topic models to summarize clinical data into up to 32 topics. Many of these latent topics yield natural clinical interpretations (e.g., "critical care," "pneumonia," "neurologic evaluation").**Discussion**Existing order sets tend to provide nonspecific, process-oriented aid, with usability limitations impairing more precise, patient-focused support. Algorithmic summarization has the potential to breach this usability barrier by automatically inferring patient context, but with potential tradeoffs in interpretability.**Conclusion**Probabilistic topic modeling provides an automated approach to detect thematic trends in patient care and generate decision support content. A potential use case finds related clinical orders for decision support.**Efron-Stein Inequalities for Random Matrices. (pdf, bib)***Daniel Paulin, Lester Mackey, and Joel A. Tropp.*

Annals of Probability. September 2016.

View details »This paper establishes new concentration inequalities for random matrices constructed from independent random variables. These results are analogous with the generalized Efron-Stein inequalities developed by Boucheron et al. The proofs rely on the method of exchangeable pairs.

**Multivariate Stein Factors for a Class of Strongly Log-concave Distributions. (arxiv, bib)***Lester Mackey and Jackson Gorham.*

Electronic Communications in Probability. September 2016.

View details »We establish uniform bounds on the low-order derivatives of Stein equation solutions for a broad class of multivariate, strongly log-concave target distributions. These "Stein factor" bounds deliver control over Wasserstein and related smooth function distances and are well-suited to analyzing the computable Stein discrepancy measures of Gorham and Mackey. Our arguments of proof are probabilistic and feature the synchronous coupling of multiple overdamped Langevin diffusions.

**Jet-Images -- Deep Learning Edition. (pdf, code, bib)***Luke de Oliveira, Michael Kagan, Lester Mackey, Benjamin Nachman, and Ariel Schwartzman.*

Journal of High Energy Physics. July 2016.

View details »Building on the notion of a particle physics detector as a camera and the collimated streams of high energy particles, or jets, it measures as an image, we investigate the potential of machine learning techniques based on deep learning architectures to identify highly boosted W bosons. Modern deep learning algorithms trained on jet images can out-perform standard physically-motivated feature driven approaches to jet tagging. We develop techniques for visualizing how these features are learned by the network and what additional information is used to improve performance. This interplay between physically-motivated feature driven tools and supervised learning algorithms is general and can be used to significantly increase the sensitivity to discover new particles and new forces, and gain a deeper understanding of the physics within jets.

**Fuzzy Jets. (pdf, code, bib)***Lester Mackey, Benjamin Nachman, Ariel Schwartzman, and Conrad Stansbury.*

Journal of High Energy Physics. June 2016.

View details »Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets. To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets, are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variables. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.

**Measuring Sample Quality with Stein's Method. (arxiv, poster, code, bib)***Jackson Gorham and Lester Mackey.*

Advances in Neural Information Processing Systems (NeurIPS). December 2015.

View details »To improve the efficiency of Monte Carlo estimation, practitioners are turning to biased Markov chain Monte Carlo procedures that trade off asymptotic exactness for computational speed. The reasoning is sound: a reduction in variance due to more rapid sampling can outweigh the bias introduced. However, the inexactness creates new challenges for sampler and parameter selection, since standard measures of sample quality like effective sample size do not account for asymptotic bias. To address these challenges, we introduce a new computable quality measure based on Stein's method that quantifies the maximum discrepancy between sample and target expectations over a large class of test functions. We use our tool to compare exact, biased, and deterministic sample sequences and illustrate applications to hyperparameter selection, convergence rate assessment, and quantifying bias-variance tradeoffs in posterior inference.

**Weighted Classification Cascades for Optimizing Discovery Significance in the HiggsML Challenge. (pdf, bib)***Lester Mackey, Jordan Bryan, and Man Yue Mo.*

Proceedings of the NeurIPS Workshop on High Energy Physics, Machine Learning, and the HiggsML Data Challenge. August 2015.

View details »We introduce a minorization-maximization approach to optimizing common measures of discovery significance in high energy physics. The approach alternates between solving a weighted binary classification problem and updating class weights in a simple, closed-form manner. Moreover, an argument based on convex duality shows that an improvement in weighted classification error on any round yields a commensurate improvement in discovery significance. We complement our derivation with experimental results from the 2014 Higgs boson machine learning challenge.

**Distributed Matrix Completion and Robust Factorization. (pdf, website, code, bib)***Lester Mackey, Ameet Talwalkar, and Michael I. Jordan.*

Journal of Machine Learning Research. April 2015.**Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. (pdf, article, bib)***Robert Kuffner, Neta Zach, Raquel Nore, Johann Hawe, David Schoenfeld, Liuxia Wang, Guang Li, Lilly Fang, Lester Mackey, Orla Hardiman, Merit Cudkowicz, Alexander Sherman, Gokhan Ertaylan, Moritz Grosse-Wentrup, Torsten Hothorn, Jules van Ligtenberg, Jakob H. Macke, Timm Meyer, Bernhard Scholkopf, Linh Tran, Rubio Vaughan, Gustavo Stolovitzky, and Melanie L. Leitner.*

Nature Biotechnology. November 2014.- Editor's choice for Science Translational Medicine, November 2014.
**Combinatorial Clustering and the Beta Negative Binomial Process. (pdf, code, bib)***Tamara Broderick, Lester Mackey, John Paisley, and Michael I. Jordan.*

IEEE Transactions on Pattern Analysis and Machine Intelligence. April 2014.**Matrix Concentration Inequalities via the Method of Exchangeable Pairs. (pdf, bib, Joel Tropp's talk)***Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp.*

Annals of Probability. March 2014.**Corrupted Sensing: Novel Guarantees for Separating Structured Signals. (pdf, bib)***Rina Foygel and Lester Mackey.*

IEEE Transactions on Information Theory. February 2014.**Distributed Low-rank Subspace Segmentation. (pdf, code, bib)***Ameet Talwalkar, Lester Mackey, Yadong Mu, Shih-Fu Chang, and Michael I. Jordan.*

IEEE International Conference on Computer Vision (ICCV). December 2013.**The Asymptotics of Ranking Algorithms. (pdf, bib)***John C. Duchi, Lester Mackey, and Michael I. Jordan.*

Annals of Statistics. November 2013.**Joint Link Prediction and Attribute Inference using a Social-Attribute Network. (pdf, website, bib)***Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Eui Chul Richard Shin, Emil Stefanov, Elaine (Runting) Shi, and Dawn Song*.

ACM Transactions on Intelligent Systems and Technology. March 2013.**Divide-and-Conquer Matrix Factorization. (pdf, website, code, bib)***Lester Mackey, Ameet Talwalkar, and Michael I. Jordan.*

Advances in Neural Information Processing Systems (NeurIPS). December 2011.**Visually Relating Gene Expression and in vivo DNA Binding Data. (pdf, bib)***Min-Yu Huang, Lester Mackey, Soile Keranen, Gunther Weber, Michael Jordan, David Knowles, Mark Biggin, and Bernd Hamann*.

IEEE International Conference on Bioinformatics and Biomedicine (BIBM). November 2011.**Mixed Membership Matrix Factorization. (pdf, supp info, slides, code, bib)***Lester Mackey, David Weiss, and Michael I. Jordan*.

International Conference on Machine Learning (ICML). June 2010.

Handbook of Mixed Membership Models and Their Applications. November 2014.**On the Consistency of Ranking Algorithms. (pdf, slides, bib)***John Duchi, Lester Mackey, and Michael I. Jordan*.

International Conference on Machine Learning (ICML). June 2010.- Winner of the ICML 2010 Best Student Paper Award.

**Deflation Methods for Sparse PCA. (pdf, poster, code, bib)***Lester Mackey*.

Advances in Neural Information Processing Systems (NeurIPS). December 2008.**Fault-tolerant Typed Assembly Language. (pdf, bib)***Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker*.

ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). June 2007.- Joint winner of the PLDI 2007 Best Paper Award.
- Selected for SIGPLAN CACM Research Highlights, September 2008.
**Static Typing for a Faulty Lambda Calculus. (pdf, bib)***David Walker, Lester Mackey, Jay Ligatti, George Reis, and David August*.

ACM SIGPLAN International Conference on Functional Programming (ICFP). September 2006.**Participatory Design with Proxies: Developing a Desktop-PDA System to Support People with Aphasia. (pdf, bib)***Jordan Boyd-Graber, Sonya Nikolova, Karyn Moffatt, Kenrick Kin, Joshua Lee, Lester Mackey, Marilyn Tremaine, and Maria Klawe*.

SIGCHI Conference on Human Factors in Computing Systems (CHI). April 2006.

#### Other Work

**Deriving Matrix Concentration Inequalities from Kernel Couplings. (arxiv)***Daniel Paulin, Lester Mackey, and Joel A. Tropp. May 2013***Feature-Weighted Linear Stacking. (arxiv, Joe Sill's talk)***Joint work with Joe Sill, Gabor Takacs, and David Lin. November 2009.***Anomaly Detection for Asynchronous and Incomplete Data. (pdf)***Joint work with John Duchi and Fabian Wauthier.*

Advanced Topics in Computer Systems (UC Berkeley CS 262A, E. Brewer). December 2008.**Scalable Dyadic Kernel Machines. (pdf)**

Advanced Topics in Learning and Decision Making (UC Berkeley CS 281B, P. Bartlett). May 2008.**Latent Dirichlet Markov Random Fields for Semi-supervised Image Segmentation and Object Recognition. (pdf)**

Statistical Learning Theory (UC Berkeley CS 281A, M. Jordan) and Computer Vision (UC Berkeley CS 280, J. Malik). December 2007.