Fall 2022 Seminars

List of Seminars

A Distributed Approach for Learning Spatial Heterogeneity

Zhengyuan Zhu, Professor, Department of Statistics

Iowa State University

Date: Friday, December 2, 2022

Abstract

Spatial regression is widely used for modeling the relationship between a spatial dependent variable and explanatory covariates. In many applications there is spatial heterogeneity in such relationships, i.e., the regression coefficients may vary across space. It is a fundamental and challenging problem to detect the systematic variation in the model and determine which locations share common regression coefficients and where the boundary is. In this talk, we introduce a Spatial Heterogeneity Automatic Detection and Estimation (SHADE) procedure for automatically and simultaneously subgrouping and estimating covariate effects for spatial regression models, and present a distributed spanning-tree-based fused-lasso regression (DTFLR) approach to learn spatial heterogeneity in distributed network systems, where the data are locally collected and held by nodes. To solve the problem in parallel, we design a distributed generalized alternating direction method of multipliers algorithm, which has a simple node-based implementation scheme and enjoys a linear convergence rate. Theoretical and numerical results as well as real-world data analysis will be presented to show that our approach outperforms existing methods in terms of estimation accuracy, computation speed, and communication costs.

About the speaker

Dr. Zhengyuan Zhu is the College of Liberal Arts and Sciences Dean's Professor, Director of the Center for Survey Statistics Methodology, and Professor of Statistics in the Department of Statistics at Iowa State University. He received his B.S. in Mathematics from Fudan University and Ph.D. in Statistics from the University of Chicago. His research interests include spatial statistics, survey statistics, machine learning, statistical data integration, and applications in environmental science, agriculture, remote sensing, and official statistics. He is a fellow of the American Statistical Association and an elected member of the International Statistical Institute.

Are Decision Trees as Powerful as Neural Networks?

Jason M. Klusowski, Assistant Professor, Department of Operations Research and Financial Engineering

Princeton University

Abstract

Decision trees and neural networks are conventionally seen as two contrasting approaches to learning. The popular belief is that decision trees compromise accuracy for being easy to use and understand, whereas neural networks are more accurate, but at the cost of being less transparent. In this talk, we challenge the status quo by showing that, under suitable conditions, decision trees that recursively place splits along linear combinations of the covariates achieve modeling power and predictive accuracy comparable to those of single-hidden-layer neural networks. Importantly, the analytical framework presented here can accommodate many existing computational tools in the literature, such as those based on randomization, dimensionality reduction, and mixed-integer optimization.

About the speaker

Jason Klusowski is an assistant professor in the Department of Operations Research and Financial Engineering (ORFE) at Princeton University. Prior to joining Princeton, he was an assistant professor in the Department of Statistics at Rutgers University, New Brunswick. He received his PhD in Statistics and Data Science from Yale University in 2018. His research interests lie broadly in statistical machine learning, with an emphasis on describing the tensions among interpretability, statistical accuracy, and computational feasibility.

Principal Flow, Sub-Manifold and Boundary

Zhigang Yao, Associate Professor, Department of Statistics and Data Science

National University of Singapore

Date: Friday, November 11, 2022

Abstract

While classical statistics has dealt with observations which are real numbers or elements of a real vector space, nowadays many statistical problems of high interest in the sciences deal with the analysis of data which consist of more complex objects, taking values in spaces which are naturally not (Euclidean) vector spaces but which still feature some geometric structure. I will discuss the problem of finding principal components of multivariate datasets that lie on a nonlinear Riemannian manifold embedded in a higher-dimensional space. The aim is to extend the geometric interpretation of PCA, while being able to capture the non-geodesic form of variation in the data. I will introduce the concept of a principal sub-manifold, a manifold passing through the center of the data, and at any point on the manifold extending in the direction of highest variation in the space spanned by the eigenvectors of the local tangent space PCA. We show that the principal sub-manifold yields the usual principal components in Euclidean space. We illustrate how to find, use and interpret the principal sub-manifold, by which a principal boundary can be further defined for data sets on manifolds.

About the speaker

Zhigang Yao is an Associate Professor in the Department of Statistics and Data Science at the National University of Singapore (NUS). His current research is focused on the interface between statistics and geometry, especially on the manifold fitting problem. Currently he is a member of the Center of Mathematical Sciences and Applications at Harvard University. He also holds a courtesy joint appointment with the Department of Mathematics at NUS. He is a Faculty Affiliate of the Institute of Data Science (IDS) at NUS. He has held several visiting positions, including a Visiting Professorship at EPFL. He received his Ph.D. in Statistics from the University of Pittsburgh in 2011. His thesis advisors were Bill Eddy at Carnegie Mellon and Leon Gleser at the University of Pittsburgh. He was an Assistant Professor at NUS from 2014 to 2020. Before joining NUS, he worked with Victor Panaretos as a postdoctoral researcher at EPFL from 2011 to 2014.

Dynamic Mechanistic Spatio-Temporal Modeling for (Re)Emerging Epidemics

Ali Arab, Associate Professor, Department of Mathematics and Statistics
Georgetown University

Date: Friday, November 4, 2022

Abstract

The dynamics of emerging and reemerging epidemics are complex to understand and thus difficult to model. Moreover, data for rare conditions (over time and space) often include excess zeros, which may result in inefficient inference and ineffective prediction for such processes. This is a common issue in modeling rare or emerging diseases, diseases that are not common in specific areas or time periods, and conditions that are hard to detect. A common approach to modeling data with excess zeros is to use zero-modified models (i.e., hurdle and zero-inflated models). Here, we discuss a mechanistic science-based modeling framework to effectively model the dynamics of disease spread based on zero-modified hierarchical modeling approaches. Our proposed method combines ideas from mechanistic physical-statistical modeling and zero-modified modeling to effectively model the early stages of infectious disease pandemics, which is critical in combating the spread of the disease. To demonstrate our work, we provide a case study of modeling the spread of Lyme disease based on confirmed cases of the disease in Virginia during the period 2001-2016.
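
For readers unfamiliar with zero-modified counts, the two model families named in the abstract can be written down for a Poisson base distribution. The sketch below is illustrative only and is not the speaker's hierarchical spatio-temporal model; the function names and the parameters `pi` (probability attached to zero) and `lam` (Poisson mean) are assumptions introduced here.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of count k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zero_inflated_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson: with probability pi the count is a structural
    zero; otherwise it is drawn from Poisson(lam), which can also give zero."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k: int, pi: float, lam: float) -> float:
    """Hurdle Poisson: zero occurs with probability pi; positive counts
    follow a Poisson(lam) distribution truncated at zero."""
    if k == 0:
        return pi
    return (1.0 - pi) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
```

The difference is in how zeros arise: the zero-inflated form mixes two sources of zeros, while the hurdle form treats every zero as coming from a single binary stage.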

About the speaker

Ali Arab is an Associate Professor of Statistics in the Department of Mathematics and Statistics of Georgetown University. His methodological research is in spatio-temporal and spatial statistics, and hierarchical Bayesian modeling. He is interested in applications of statistics in environmental science, epidemiology of infectious diseases, ecology, and science and human rights. His current research is focused on developing methodological tools for studying problems at the intersection of climate change and social/natural phenomena. In particular, these projects are focused on bird phenology and climate change, climate- and conflict-driven forced migration, and climate change and vector-borne disease. Ali also serves as one of the American Statistical Association representatives to the American Association for the Advancement of Science (AAAS) Science and Human Rights Coalition.

Environmental Exposures and Public Health

Jenna Krall, Assistant Professor, Department of Global and Community Health

George Mason University

Date: Friday, October 28, 2022

Abstract

Air pollution is associated with increased cardiorespiratory emergency department visits and hospitalizations.  Because air pollution is a chemical mixture of both particles and gases, challenges remain in determining which air pollutants are most harmful.  Furthermore, because the air pollution mixture varies over time and space, identifying exposure settings that are most harmful is critical to protecting health.  This talk will discuss statistical and methodological approaches for estimating exposure to environmental mixtures and their impacts on health.

About the speaker

Jenna R. Krall, PhD is an Assistant Professor in the Department of Global and Community Health at George Mason University in Fairfax, VA.  Her research interests include estimating exposure to environmental mixtures, such as air pollution, and their impacts on human health.  Dr. Krall received her PhD in Biostatistics from Johns Hopkins University and completed a postdoctoral fellowship in Biostatistics at Emory University. 

Challenges in Constructing Human-centric Natural Language Interfaces

Ziyu Yao, Assistant Professor, Department of Computer Science

George Mason University

Date: Friday, October 21, 2022

About the speaker

Ziyu Yao is an Assistant Professor in the Department of Computer Science at George Mason University. She graduated with a PhD degree from the Ohio State University in 2021. Her research interests lie in Natural Language Processing, Artificial Intelligence, and their applications to other disciplines. In particular, she has been focusing on developing natural language interfaces (e.g., question answering systems) that can reliably assist humans in various domains (e.g., Software Engineering and Clinical Informatics). She was awarded the Presidential Fellowship by OSU in 2020 and the Graduate Student Research Award by the OSU CSE department in 2021. Her work in NLP for Clinical Informatics won the Best Paper Award at IEEE BIBM 2021. More about Dr. Yao can be found here: https://ziyuyao.org/.

Efficient Shape-constrained Inference for the Autocovariance Sequence from a Reversible Markov Chain

Hyebin Song, Assistant Professor, Department of Statistics

The Pennsylvania State University

Date: Friday, October 14, 2022

Abstract

In this talk, I will present a novel shape-constrained estimator of the autocovariance sequence resulting from a reversible Markov chain. A motivating application for studying this problem is the estimation of the asymptotic variance in central limit theorems for Markov chains. Asymptotic variance is a key quantity in quantifying the uncertainty of the sample mean from Markov chain iterates, so accurate estimation of asymptotic variance has both statistical and practical significance. Our approach is based on the key observation that the representability of the autocovariance sequence as a moment sequence imposes certain shape constraints, which we can exploit in the estimation procedure. I will discuss the theoretical properties of the proposed estimator and provide strong consistency guarantees for it. Finally, I will empirically demonstrate the effectiveness of our estimator in comparison with other current state-of-the-art methods for Markov chain Monte Carlo variance estimation, including batch means, spectral variance estimators, and the initial convex sequence estimator.
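
As background for the comparison mentioned above, the batch means method can be sketched in a few lines. This is the classical baseline named in the abstract, not the speaker's shape-constrained estimator; the function name and the choice to drop leftover iterates are assumptions made for this illustration.

```python
from statistics import fmean


def batch_means_variance(chain, batch_size):
    """Estimate the asymptotic variance of the sample mean of a chain.

    The chain is cut into non-overlapping batches of length batch_size
    (trailing leftovers are dropped), and the sample variance of the batch
    means, scaled by batch_size, is returned.
    """
    n_batches = len(chain) // batch_size
    if n_batches < 2:
        raise ValueError("need at least two full batches")
    used = chain[: n_batches * batch_size]
    overall = fmean(used)
    means = [
        fmean(used[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]
    spread = sum((m - overall) ** 2 for m in means)
    return batch_size * spread / (n_batches - 1)
```

In practice the batch size is chosen to grow with the chain length, so that batch means become nearly uncorrelated; the sketch leaves that choice to the caller.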

About the speaker

Hyebin Song is an assistant professor at Pennsylvania State University. She received her PhD in Statistics from the University of Wisconsin-Madison in 2020, advised by Garvesh Raskutti. Her research interests include high-dimensional statistics and semi-parametric inference, shape-constrained inference, and applications in biomedical research.

Copula-based approaches for analyzing non-Gaussian spatial data

Huixia Judy Wang, Department Chair and Professor, Department of Statistics
George Washington University

Date: Friday, October 7, 2022

Abstract

Many existing methods for analyzing spatial data rely on the Gaussian assumption, which is violated in many applications such as wind speed, precipitation and COVID mortality data. In this talk, I will discuss several recent developments of copula-based approaches for analyzing non-Gaussian spatial data. First, I will introduce a copula-based spatio-temporal model for analyzing spatio-temporal data and a semiparametric estimator. Second, I will present a copula-based multiple indicator kriging model for the analysis of non-Gaussian spatial data by thresholding the spatial observations at a given set of quantile values. The proposed algorithms are computationally simple, since they model the marginal distribution and the spatio-temporal dependence separately. Instead of assuming a parametric distribution, the approaches model the marginal distributions nonparametrically and thus offer more flexibility. The methods also provide convenient ways to construct both point and interval predictions based on the estimated conditional quantiles. I will present some numerical results, including analyses of wind speed and precipitation data. If time allows, I will also discuss recent work on a copula-based approach for analyzing spatial count data.
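
The idea of handling the marginal distribution and the dependence separately can be illustrated with a rank transform. The sketch below is not the speaker's estimator: it assumes distinct observations, uses the common rank / (n + 1) convention, and picks a Gaussian copula scale purely for illustration; all function names are hypothetical.

```python
from statistics import NormalDist


def pseudo_observations(values):
    """Map each value to rank / (n + 1), an empirical-CDF estimate in (0, 1).

    Values are assumed distinct; ties simply receive consecutive ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]


def normal_scores(values):
    """Push pseudo-observations through the standard normal quantile
    function, giving data on the scale of a Gaussian copula."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(u) for u in pseudo_observations(values)]
```

Because only ranks enter the transform, no parametric form is imposed on the marginal distribution; dependence would then be modeled on the transformed scale.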

About the speaker

Huixia Judy Wang is the Chair of the Department of Statistics at George Washington University. She received her PhD in Statistics from the University of Illinois at Urbana-Champaign in 2006. She taught at North Carolina State University for eight years and then moved to George Washington University. She served as Program Director in the Division of Mathematical Sciences at the National Science Foundation for several years. Her research interests span a wide range of fields, which include quantile regression, extreme value theory and applications, bioinformatics and biostatistics, nonparametric and semiparametric methods, in addition to regression, survival analysis, longitudinal and spatial data analysis, and missing data.

Explaining Adverse Actions in Credit Decisions Using Shapley Decomposition

Tianshu Feng

Assistant Professor, George Mason University

Date: Friday, September 30, 2022

Abstract

When a financial institution declines an application for credit, an adverse action (AA) is said to occur. The applicant is then entitled to an explanation for the negative decision. The talk focuses on credit decisions based on a predictive model for probability of default and proposes a methodology for AA explanation. The problem involves identifying the important predictors responsible for the negative decision and is straightforward when the underlying model is additive. However, it becomes non-trivial even for linear models with interactions. We consider models with low-order interactions and develop a simple and intuitive approach based on first principles. We then show how the methodology generalizes to the well-known Shapley decomposition and the recently proposed concept of Baseline Shapley (B-Shap). Unlike other Shapley techniques in the literature for local interpretability of machine learning results, B-Shap is computationally tractable since it involves just function evaluations. An illustrative case study is used to demonstrate the usefulness of the method.
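
To make the "just function evaluations" point concrete, Baseline Shapley can be sketched by exhaustive enumeration: a coalition of predictors takes its values from the applicant, and all other predictors take theirs from a baseline point. The code below is a minimal illustration under that definition, not the speaker's methodology; the function names and the toy `score` model are hypothetical, and the loop over all subsets is practical only for a handful of predictors.

```python
from itertools import combinations
from math import factorial


def baseline_shapley(f, x, baseline):
    """Exact Baseline Shapley attributions of f(x) - f(baseline).

    A coalition S is evaluated by taking features in S from x and the rest
    from baseline, so only plain evaluations of f are needed.
    """
    d = len(x)

    def value(subset):
        point = [x[i] if i in subset else baseline[i] for i in range(d)]
        return f(point)

    attributions = []
    for j in range(d):
        others = [i for i in range(d) if i != j]
        phi = 0.0
        for size in range(d):
            # Shapley weight for coalitions of this size that exclude j.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {j}) - value(s))
        attributions.append(phi)
    return attributions


# Hypothetical score with one interaction term, for illustration only.
def score(z):
    return z[0] + 2.0 * z[1] + z[0] * z[1]
```

For `score` evaluated at `[1.0, 1.0]` against the baseline `[0.0, 0.0]`, the interaction term is split equally between the two predictors, and the attributions add up to the difference between the two scores.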

About the speaker

Tianshu Feng's work is data-driven and centers on the systematic approach to processing, visualizing, analyzing, modeling, and examining data with complex features. This involves developing and applying novel, flexible, and reliable models via interdisciplinary collaborations in various areas, such as transportation, bioinformatics, healthcare, and finance. His research interests include machine learning and statistical modeling, explainable AI, model fairness and robustness, data exploration, and active learning. Prior to joining Mason, Tianshu was a Quantitative Analytics Specialist at Wells Fargo. He received his PhD degree in Industrial Engineering from the University of Washington and his Bachelor's degree in Statistics from the University of Science and Technology of China.

Scalable Bayesian p-generalized Probit and Logistic Regression Via Coresets

Katja Ickstadt
Professor and Department Chair, Department of Statistics, Technical University of Dortmund

Date: Friday, September 23, 2022

Abstract

In this talk, we consider data reduction techniques like sketching and coresets that retain the statistical information with only small distortion, quantified by theoretical bounds. Our approaches address resource restrictions like memory access, communication cost, and runtime. Coresets are small, possibly weighted data sets designed to approximate an input data set with respect to a computational problem. Often, they are subsets of the input data obtained via sampling techniques. Here, we study coresets for generalized linear models, in particular for binary outcomes. We will present coreset approaches for logistic regression as well as for p-generalized probit regression, the latter also in a Bayesian framework. The resulting reduced data sets have better scaling properties and allow for efficient computations via the established (classic) algorithms.

About the speaker

Katja Ickstadt (Faculty of Statistics, TU Dortmund University, Dortmund, Germany) studied mathematics with a focus on technology at the Technical University of Darmstadt, Germany, where she received her doctorate in mathematics in 1994. Before her habilitation in mathematics at the Technical University of Darmstadt in 2001, she spent several years abroad, with research and teaching at the University of Basel, Switzerland, Duke University, North Carolina, USA, and the University of North Carolina in Chapel Hill, USA. In her research, Katja Ickstadt focuses on regression methods for very large, high-dimensional data, spatial and spatio-temporal models for biological and epidemiological problems, and the analysis of Gaussian process models. In particular, Bayesian methods are in the foreground. She is involved in the German region of the International Biometric Society and is Co-editor of Biometrics.

Big Spatial Data Learning: A Parallel Solution

Shan Yu
Assistant Professor, Department of Statistics, University of Virginia

Date: Friday, September 9, 2022

Abstract

Nowadays, we are living in the era of “Big Data.” A significant portion of big data is big spatial data captured through advanced technologies or large-scale simulations. Explosive growth in spatial and spatiotemporal data emphasizes the need for developing new and computationally efficient methods and credible theoretical support tailored for analyzing such large-scale data. Parallel statistical computing has proved to be a handy tool when dealing with big data. However, it is hard to execute the conventional spline regressions in parallel. In this talk, I will present a novel parallel smoothing technique for generalized partially linear spatially varying coefficient models, which can be used under different hardware parallelism levels. Moreover, combined with concurrent computing, the proposed method can be easily extended to distributed systems. The newly developed method is evaluated through several simulation studies and an analysis of the US Loan Application Data.

About the speaker

Dr. Shan Yu joined the Department of Statistics at the University of Virginia as an Assistant Professor in August 2020 after receiving her Ph.D. from Iowa State University. Her research interests focus on advanced statistical methods for complex-structured data, statistical machine learning, and "big data" analytics. Specifically, she has been engaged in projects utilizing non-/semi-parametric regression methods, spatial/spatiotemporal data analysis, biomedical imaging analysis, statistical epidemiology, and trajectory data analysis.