Department of Mathematical Sciences

309 SCEN

University of Arkansas

Fayetteville, AR 72701

P 479-575-3351 (main)

P 479-575-5588 (enrollment)

F 479-575-8630

E-mail: math@uark.edu

# Spring Lecture Series

**45th Annual Spring Lecture Series**

**Discrete Random Structures in Bayesian Nonparametrics**

**November 10 - 13, 2020 (10:30am - 4:00pm)**

Principal Speaker: **Igor Prünster** (Dipartimento di Scienze delle Decisioni, Università Bocconi)

### Invited Speakers

**Cecilia Balocchi** (University of Torino)

**Michele Guindani** (University of California at Irvine)

**Li Ma** (Duke University)

**Steve MacEachern** (The Ohio State University)

**Ramsés Mena** (Universidad Nacional Autónoma de México)

**Peter Müller** (The University of Texas at Austin)

**Long Nguyen** (University of Michigan)

**Surya Tokdar** (Duke University)

**Yanxun Xu** (Johns Hopkins University)

**Mingyuan Zhou** (The University of Texas at Austin)

### Organizers

Avishek Chakraborty, Jyotishka Datta, Giovanni Petris

### Schedule of Talks & Abstracts

**Discrete Random Structures in Bayesian Nonparametrics**

**Igor Prünster** (Dipartimento di Scienze delle Decisioni, Università Bocconi)

The popularity of Bayesian nonparametric inference has grown enormously in recent years. Indeed, the Bayesian nonparametric viewpoint naturally allows for rich and flexible probabilistic modeling and provides a natural way to quantify uncertainty. The main ingredients of Bayesian Nonparametrics are infinite dimensional probability distributions on function spaces such as spaces of density functions, of hazard rates, of distribution functions or any other functions related to modeling observations. Such probability models act as prior distributions and, combined with the data, provide posterior distributions, which form the basis of Bayesian inference, and allow for accurate function estimation.

The aim of the five lectures is to give an overview of the state of the art of Bayesian Nonparametrics and to highlight promising research directions, with a special focus on the discrete random structures ubiquitous in Bayesian Nonparametrics. Connections with other seemingly unrelated areas of research in Mathematics and Science, such as combinatorics, the theory of special functions, population genetics and ecology, will be pointed out. Illustrations in the contexts of mixture modeling, species sampling, survival analysis and curve estimation will be provided throughout.

**1) Foundations and history of Bayesian Nonparametrics**

The theoretical foundations of Bayesian Nonparametrics date back to the seminal work, developed during the 1930s, by de Finetti, who introduced the concept of exchangeability and the celebrated representation theorem for exchangeable sequences. Rather than a mere body of estimation procedures, the Bayesian approach is actually a coherent way of inductive reasoning or, quoting de Finetti, it represents a “logic for uncertainty”. The required mathematical complexity prevented its development until the breakthrough achieved by Ferguson in 1973 with the introduction of the Dirichlet process. During the following three decades, crucial theoretical and computational advances made it possible to concretely implement Bayesian nonparametric procedures on real-world problems, and important results concerning their asymptotic validation, according to frequentist criteria, were achieved. An overview of the foundations, the Dirichlet process and subsequent developments, with pointers to current research, will be provided.

**2) The predictive approach to Bayesian inference and discrete random structures**

Prediction is the natural outcome of Bayesian inference and de Finetti himself thoroughly stressed its role as the fundamental form of inference. In this lecture we adopt a predictive approach in de Finetti’s spirit, consider discrete random probability measures and investigate the induced exchangeable random partitions and prediction rules. By using species sampling problems as a reference example to help intuition, we discuss structural properties of exchangeable prediction schemes in terms of flexibility and desired behaviors. Special attention will be devoted to the corresponding properties of functionals of interest, most notably the number of clusters (which according to the applied context at hand takes on different interpretations such as the number of mixture components, distinct genes, agents’ types etc.). Key distributional and asymptotic properties will be highlighted.
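As a minimal illustration, not taken from the lectures, the best-known exchangeable prediction rule is the one induced by the Dirichlet process (the Chinese restaurant process): after `i` observations, the next one starts a new cluster with probability `theta / (theta + i)` and otherwise joins an existing cluster with probability proportional to its size. The function name `crp_partition` and the parameter `theta` are hypothetical names chosen for this sketch, which simulates the cluster sizes so that the number of clusters can be inspected.

```python
import random


def crp_partition(n, theta, seed=0):
    """Simulate cluster sizes for n observations under the Dirichlet process
    prediction rule (Chinese restaurant process) with concentration theta.

    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[j] = number of observations currently in cluster j
    for i in range(n):
        # With i observations seen so far, open a new cluster
        # with probability theta / (theta + i).
        if rng.random() < theta / (theta + i):
            sizes.append(1)
        else:
            # Otherwise join existing cluster j with probability sizes[j] / i.
            j = rng.choices(range(len(sizes)), weights=sizes)[0]
            sizes[j] += 1
    return sizes


if __name__ == "__main__":
    sizes = crp_partition(n=500, theta=2.0)
    print("number of clusters:", len(sizes))
    print("largest clusters:", sorted(sizes, reverse=True)[:5])
```

Under this particular rule the number of clusters grows slowly (logarithmically) with the sample size; other discrete random structures lead to different growth behaviors.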

**3) Models beyond the Dirichlet process**

Until the end of the last century, most models were based on the Dirichlet process (or allied random distributions), used either as a prior distribution itself or as a key ingredient in more complex hierarchical mixtures. This development certainly had the merit of putting the Dirichlet process in the toolbox of applied statisticians. But even if its role as the cornerstone of Bayesian Nonparametrics is legitimate, there was no reason to confine modeling and estimation to the Dirichlet process as the only framework, all the more so since some of its drawbacks had become quite evident by then.

A large portion of research has been devoted to the introduction of tractable generalizations of the Dirichlet process, motivated either by applied needs or by theoretical interest. We provide an overview of a large set of such generalizations by using completely random measures as a unifying concept. All the considered models can be seen as suitable transformations of completely random measures, and this makes it possible to highlight their inherent features as well as the interesting distributional structures they share a posteriori in several statistical problems.

**4) Partial exchangeability and dependent nonparametric models**

Exchangeability of the observations is unrealistic when data are recorded under different experimental conditions or, more generally, for covariate-dependent data. Popular examples include meta-analysis, change-point phenomena, topic modelling, information retrieval and time series. In these cases, the set of covariates introduces a source of heterogeneity that needs to be modeled. This has represented one of the main research lines in Bayesian Nonparametrics since the pioneering work of MacEachern in 2000, where the dependent Dirichlet process was introduced and studied. In this lecture we show how these problems can be naturally embedded in the framework of partial exchangeability and stress that the corresponding models exhibit dependence structures spanning between the two extreme cases: exchangeability and unconditional independence. The former coincides with complete homogeneity across data corresponding to different covariates, whereas the latter corresponds to complete heterogeneity with no borrowing of strength across populations. The natural question of introducing a suitable measure of dependence for partially exchangeable models is also addressed. We discuss current options and highlight a recent proposal based on the Wasserstein distance. Widely used models and the dependence structures they induce will serve as illustrations.

**5) Dependent nonparametric priors derived from completely random measures**

In this lecture we revisit popular dependent nonparametric priors and discuss new proposals. In contrast to the mainstream approach, which consists in defining dependent nonparametric priors by means of dependent stick-breaking representations, here we tackle the problem by adopting a “random measure” approach. This crucially enables us to investigate analytically some distributional properties of the resulting collection of random probability measures. These distributional properties provide key insights into the inferential implications of the considered models. In particular, we focus on three natural ways to combine random measures, namely additive, hierarchical and nested structures. These have gained huge popularity in the Machine Learning literature. Thanks to our approach we are able to derive partially exchangeable partition probability functions and posterior representations, which in turn allow us to single out important limitations and advantages of the considered models. The analytical results can be readily used to devise efficient sampling schemes.

10:25am Opening Remarks

10:30am Igor Prünster

11:30am Cecilia Balocchi

2:00pm Igor Prünster

3:00pm Surya Tokdar

#### Bayesian Spatial Clustering of Crime in Philadelphia

**Cecilia Balocchi** (Università di Torino)

In this talk, I will present two related projects. In the first project, we propose a model for spatial clustering of crime dynamics that induces spatial correlation between regions with homogeneous behaviors, while flexibly accounting for discontinuities. In fact, in urban environments, natural barriers or aspects of the built environment can lead to clusters of neighborhoods with distinct trends. Rather than directly sampling from the posterior distribution, we introduce an ensemble particle-based optimization procedure that simultaneously targets several partitions with the largest posterior probability, using a new local search strategy. We apply our developed methodology to estimate changes in crime over the 2006-18 period in the City of Philadelphia at the census tract level. In the second project, we consider the problem of clustering data available at multiple granularities: crime data could be analyzed at a finer neighborhood level or at a coarser police district level. Instead of choosing the best granularity level, we want to model the data at multiple levels, as the variation can happen at different resolutions in different regions of the city. The nested Dirichlet Process allows the modeling of partitions at multiple levels but restricts fine resolution clusters to be nested into coarse resolution ones. We extend the nested Dirichlet Process to allow for more flexible partitions that do not have this constraint and we propose the Nested Hierarchical Dirichlet Process. We use this new model to analyze crime frequencies in Philadelphia.

#### Joint Quantile Regression under Spatial Dependency

**Surya Tokdar** (Duke University)

Four decades ago, Roger Koenker and Gib Bassett introduced the idea of quantile regression (QR). Today, QR is widely recognized as a fundamental statistical tool for analyzing complex predictor-response relationships, with a growing list of applications in ecology, economics, education, public health, climatology, and so on. In QR, one replaces the standard regression equation of the mean with a similar equation for a quantile at a given quantile level of interest. But the real strength of QR lies in the possibility of analyzing any quantile level of interest, and perhaps more importantly, contrasting many such analyses against each other with fascinating consequences.
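As a minimal illustration, not part of the talk, the classical way to target a quantile rather than the mean is to minimize the check (pinball) loss, which weights positive residuals by the quantile level `tau` and negative residuals by `1 - tau`. The sketch below covers only the intercept-only case, and the function names `check_loss` and `best_constant` are hypothetical. It is not the joint estimation framework described in this abstract.

```python
def check_loss(u, tau):
    """Check (pinball) loss of a residual u at quantile level tau in (0, 1)."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def best_constant(ys, tau):
    """Return the data point that minimizes the total check loss.

    For an intercept-only model a minimizer can always be found among the
    observed values, and it is a sample quantile at level tau.
    """
    return min(ys, key=lambda q: sum(check_loss(y - q, tau) for y in ys))


if __name__ == "__main__":
    ys = [1, 2, 3, 4, 100]
    # tau = 0.5 recovers a median, which is insensitive to the outlier 100.
    print(best_constant(ys, 0.5))
```

Repeating the minimization for several values of `tau` gives one fit per quantile level, which is the kind of contrast across levels that the paragraph above refers to.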

In spite of the popularity of QR, it is only recently that an analysis framework has been developed (Yang and Tokdar, JASA 2017) which transforms Koenker and Bassett’s four-decade-old idea into a model-based inference and prediction technique in its full generality. In doing so, the new joint estimation framework has opened doors to many important advancements of the QR analysis technique to address additional data complications. In this talk I will present recent such developments, specifically focusing on the issue of additional dependence between observation units. Such dependency manifests in many common situations, e.g., when one simultaneously measures multiple response variables per observation unit, when a response is measured repeatedly over time and/or space, or when data is drawn from a network of individuals. This talk will focus primarily on addressing spatial dependency.

Joint work with PhD student Xu Chen.

10:30am Igor Prünster

11:30am Ramsés Mena

2:00pm Long Nguyen

3:00pm Yanxun Xu

#### Beta-Binomial Stick-Breaking Nonparametric Prior

**Ramsés Mena Chávez** (Universidad Nacional Autónoma de México)

A new class of nonparametric prior distributions, termed Beta-Binomial stick-breaking process, is proposed. By allowing the underlying length random variables to be dependent through a Beta marginals Markov chain, an appealing discrete random probability measure arises. The chain's dependence parameter controls the ordering of the stick-breaking weights, and thus tunes the model's label-switching ability. Also, by tuning this parameter, the resulting class contains the Dirichlet process and the Geometric process priors as particular cases, which is of interest for MCMC implementations. Some properties of the model are discussed and a density estimation algorithm is proposed and tested with simulated datasets.
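For context, a minimal sketch of the standard stick-breaking construction for the Dirichlet process, in which the length variables are independent Beta(1, theta) draws. This is the independent baseline only, not the Beta-Binomial process proposed in the talk, which instead links the length variables through a Markov chain. The function name `stick_breaking_weights` and its parameters are hypothetical.

```python
import random


def stick_breaking_weights(theta, k, seed=0):
    """Return the first k stick-breaking weights and the leftover stick mass.

    Length variables are i.i.d. Beta(1, theta), as in the Dirichlet process;
    the dependent (Markov chain) version from the abstract is not implemented.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # mass of the stick not yet broken off
    for _ in range(k):
        v = rng.betavariate(1.0, theta)  # fraction of the remaining stick
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


if __name__ == "__main__":
    weights, leftover = stick_breaking_weights(theta=2.0, k=10)
    print([round(w, 3) for w in weights], round(leftover, 3))
```

The weights and the leftover mass always sum to one, so truncating at `k` terms leaves a known amount of unassigned probability.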

#### Parameter Estimation and Interpretability in Bayesian Mixture Modeling

**Long Nguyen** (University of Michigan)

We study posterior contraction behaviors for parameters of interest in the context of Bayesian mixture modeling, where the number of mixing components is unknown while the model itself may or may not be correctly specified. Posterior contraction rates are given under optimal transport distances for two popular types of prior specification: one explicitly requires a prior distribution on the number of mixture components, and the other is a nonparametric Bayesian approach which places a prior on the space of mixing distributions. Paraphrasing George Box, all mixture models are misspecified, but some may be more interpretable than others: it will be shown that the modeling choice of kernel density functions plays perhaps the most impactful role in determining the posterior contraction rates in the misspecified situations. Drawing on concrete parameter estimation rates, I will highlight some aspects of the interesting tradeoffs between model expressiveness and interpretability that a statistical modeler must negotiate in the rich world of mixture modeling.

This work is joint with Aritra Guha and Nhat Ho.

#### A Bayesian Nonparametric Approach for Inferring Drug Combination Effects on Mental Health in HIV Patients

**Yanxun Xu** (Johns Hopkins University)

Due to high viral evolutionary dynamics and the occurrence of drug-resistant mutations, modern antiretroviral therapy (ART) for treating HIV typically combines three or four drugs of different mechanisms or against different targets to more efficiently suppress viral load in people living with HIV. Understanding the effects of ART drug combinations on central nervous system (CNS) function, especially mental health, can help prevent patients from discontinuing their ART to avoid undesirable downstream effects, such as decreased health outcomes and increased likelihood of HIV transmission. The emergence of electronic health records offers researchers unprecedented access to HIV data. However, modeling such data is very challenging due to the high dimensionality of the drug combination space, the patient heterogeneity, and the sparseness of the observed drug combinations. We develop a Bayesian nonparametric approach to learn drug combination effects on HIV patients’ mental health after adjusting for socio-demographic, behavioral, and clinical factors, building upon the subset-tree kernel method and a distance-based Chinese restaurant process that takes into account patients’ treatment histories. We evaluate the proposed approach through extensive simulation studies, and apply the method to a real-world dataset from the Women’s Interagency HIV Study, yielding desirable and interpretable results.

10:30am Igor Prünster

11:30am Li Ma

2:00pm Steve MacEachern

3:00pm Igor Prünster

4:00pm Social

#### Hidden Markov Polya Trees for High-Dimensional Distributions

**Li Ma** (Duke University)

The Polya tree (PT) process is a general-purpose Bayesian nonparametric model that has found wide application in a range of inference problems. The PT has a simple analytic form and the resulting posterior computation boils down to straightforward beta-binomial conjugate updates along a partition tree over the sample space. Recent developments in PT models show that the performance of these models can be substantially improved by (i) incorporating latent state variables that characterize local features of the underlying distributions and (ii) allowing the partition tree to adapt to the structure of the underlying distribution. Despite these advances, however, some important limitations of the PT remain, including (i) the sensitivity of the posterior inference to the choice of the partition points, and (ii) the lack of computational scalability to multivariate problems beyond a small number ($<10$) of dimensions. We consider a modeling strategy for PT models that incorporates a very flexible prior on the partition tree along with latent states that can be first-order dependent (i.e., following a Markov process), and introduce a hybrid algorithm that combines sequential Monte Carlo (SMC) and recursive message passing for posterior inference that can readily accommodate PT models with or without latent states as well as flexible partition points in problems up to 100 dimensions. Moreover, we investigate the large sample properties of the tree structures and latent states under the posterior model. We carry out extensive numerical experiments in the context of density estimation and two-sample testing, which show that flexible partitioning can substantially improve the performance of PT models in both inference tasks. We demonstrate an application to a flow cytometry data set with 20 dimensions and over 200,000 observations.

#### Misspecification and Familial Null Hypotheses

**Steve MacEachern** (The Ohio State University)

One of the most basic questions in Statistics is "Are the centers of two distributions the same?". The question is answered formally through a hypothesis test with a null of "no difference", whether the analysis be classical or Bayesian. Traditional formulations of the problem rely on the belief that the models are perfectly specified. Robustness to violations of assumptions is typically studied under conditions that do not change the validity of the null hypothesis (e.g., symmetric contaminations). The actual deficiencies in model and data are likely to change a “true null” into a false statement.

The imperfections of model and data suggest the use of flexible nonparametric models for the two distributions and also suggest consideration of a family of measures of center (e.g., for real-valued data, those based upon Huber’s loss function). Each measure of the center generates a testing problem. The resulting family of null hypotheses constitutes a familial null hypothesis. Profile methods replace the original question with “Is there a measure of center in the family for which the centers of the two distributions are the same?”

This is joint work with Catherine Forbes, Mario Peruggia, and Ryan Thompson.

10:30am Igor Prünster

11:30am Peter Müller

2:00pm Mingyuan Zhou

3:00pm Michele Guindani

#### Subgroup Analysis with Time to Event Outcomes

**Peter Müller** (University of Texas)

We discuss a utility-based Bayesian approach to population finding and subgroup analysis. The approach casts the population finding process as a formal decision problem and fits the data with a flexible probability model, such as random forests or other nonparametric Bayesian models. In contrast, the decision is constrained to be parsimonious and interpretable. We define a utility function that addresses the competing aims of the desired report. We illustrate the approach with a joint time-to-event and toxicity outcome for subgroup analysis, and with a time-to-event outcome in the context of an umbrella trial master protocol.

Xu, Mueller, Tsimberidou, Berry (2018). A Nonparametric Bayesian Basket Trial Design. Biometrical Journal. https://onlinelibrary.wiley.com/doi/full/10.1002/bimj.201700162

Morita and Mueller, P. (2017). Bayesian Population Finding with Biomarkers in a RCT. Biometrics, 73, 1355-1365.

#### ACT: Asymptotic Conditional Transport

**Mingyuan Zhou** (University of Texas)

We propose conditional transport (CT) as a new divergence to measure the difference between two probability distributions. The CT divergence consists of the expected cost of a forward CT, which constructs a navigator to stochastically transport a data point of one distribution to the other distribution, and that of a backward CT which reverses the transport direction. To apply it to the distributions whose probability density functions are unknown but random samples are accessible, we further introduce asymptotic CT (ACT), whose estimation only requires access to mini-batch based discrete empirical distributions. Equipped with unbiased sample gradients that are straightforward to compute, the ACT divergence is amenable to mini-batch stochastic gradient descent based optimization. To demonstrate the efficacy of the ACT divergence in training a deep generative model, we show that it is sufficient to modify the adversarial game of an existing generative adversarial network (GAN) to a game played by a generator, a forward navigator, and a backward navigator, which try to minimize a distribution-to-distribution transport cost by optimizing both the distribution of the generator and conditional transport-path distributions specified by the navigators, versus a critic that does the opposite by inflating the point-to-point transport cost. On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing GAN with the ACT divergence is shown to consistently improve the performance.

#### A Common Atom Model for the Bayesian Nonparametric Analysis of Nested Data

**Michele Guindani** (University of California, Irvine)

The use of high-dimensional data for targeted therapeutic interventions requires new ways to characterize the heterogeneity observed across subgroups of a specific population. In particular, models for partially exchangeable data are needed for inference on nested datasets, where the observations are assumed to be organized in different units and some sharing of information is required to learn distinctive features of the units. In this talk, we propose a nested Common Atoms Model (CAM) that is particularly suited for the analysis of nested datasets where the distributions of the units are expected to differ only over a small fraction of the observations sampled from each unit. The proposed CAM allows a two-layered clustering at the distributional and observational level and is amenable to scalable posterior inference through the use of a computationally efficient nested slice-sampler algorithm. We further discuss how to extend the proposed modeling framework to handle discrete measurements, and we conduct posterior inference on a real microbiome dataset from a diet swap study to investigate how the alterations in intestinal microbiota composition are associated with different eating habits. We further investigate the performance of our model in capturing true distributional structures in the population by means of a simulation study.

*Updated 11/23/2020*