Fall 2023 Seminars

List of Seminars

Challenges and Opportunities in Statistics and Data Science: Ten Research Areas

Xihong Lin

Professor

Department of Biostatistics and Department of Statistics

Harvard University

Date: Friday, November 10, 2023

Abstract

As a data-driven discipline that deals with many aspects of data, statistics is a critical pillar in the rapidly evolving landscape of data science. The increasingly vital role of data, especially big data, in many disciplines presents the field of statistics with unparalleled challenges and exciting opportunities. Statistics plays a pivotal role in data science by assisting with the use of data and decision making in the face of uncertainty. In this article, we present ten research areas that could make statistics and data science more impactful on science and society. These areas will help better transform data into knowledge, actionable insights, and deliverables, and promote more collaboration with computer scientists, other quantitative scientists, and domain scientists.

About the Speaker

Xihong Lin is Professor and Former Chair of the Department of Biostatistics and Coordinating Director of the Program in Quantitative Genomics at the Harvard T. H. Chan School of Public Health, Professor in the Department of Statistics at the Faculty of Arts and Sciences of Harvard University, and Associate Member of the Broad Institute of MIT and Harvard. Dr. Lin’s research interests lie in the development and application of scalable statistical and machine learning methods for the analysis of massive and complex genetic, genomic, epidemiological, and health data. Some examples of her current research include analytic methods and applications for large-scale whole genome sequencing studies, biobanks, and electronic health records; techniques and tools for whole-genome variant functional annotation; analysis of the interplay of genes and environment; multiple phenotype analysis; and polygenic risk prediction and heritability estimation. Dr. Lin was elected to the National Academy of Medicine in 2018 and the National Academy of Sciences in 2023. She received the 2002 Mortimer Spiegelman Award from the American Public Health Association, the 2006 Committee of Presidents of Statistical Societies (COPSS) Presidents’ Award, the 2017 COPSS F. N. David Award, the 2008 Janet L. Norwood Award for Outstanding Achievement of a Woman in Statistics, the 2022 National Institute of Statistical Sciences Jerome Sacks Award for Outstanding Cross-Disciplinary Research, and the 2022 Marvin Zelen Leadership in Statistical Science Award. She is an elected fellow of the American Statistical Association (ASA), the Institute of Mathematical Statistics, and the International Statistical Institute.

Event Organizers

David Kepplinger

Nicholas Rios

How to Design the Best Biomarker-Guided Clinical Trial?

Anastasia Ivanova

Professor

Department of Biostatistics

University of North Carolina at Chapel Hill

Date: Friday, November 3, 2023

Abstract

In a clinical trial with a predefined subgroup, it is assumed that the biomarker-positive subgroup has the same or a higher treatment effect compared to its complement, the biomarker-negative subgroup. In these trials, the treatment effect is usually evaluated in the biomarker-positive subgroup and in the whole population. Statistical testing of the treatment effect in the biomarker-negative subgroup is usually not done, since it requires a larger sample size. As a result, the new intervention can be shown to be effective in the overall population even though it is only effective in the biomarker-positive group. What can we do to improve decision making in such trials?
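As a rough, hypothetical illustration of the dilution problem described above (all numbers invented for this sketch), a quick simulation shows how an overall test can reach significance even when the treatment works only in the biomarker-positive subgroup:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000  # patients per arm per subgroup (hypothetical)

# Treatment shifts outcomes by 0.3 SD in the biomarker-positive subgroup only.
pos_trt = rng.normal(0.3, 1, m)
pos_ctl = rng.normal(0.0, 1, m)
neg_trt = rng.normal(0.0, 1, m)  # no effect in the biomarker-negative subgroup
neg_ctl = rng.normal(0.0, 1, m)

def z_stat(a, b):
    """Two-sample z-statistic for the difference in means."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

z_overall = z_stat(np.concatenate([pos_trt, neg_trt]),
                   np.concatenate([pos_ctl, neg_ctl]))
z_negative = z_stat(neg_trt, neg_ctl)

print(f"overall z = {z_overall:.2f}")            # typically well above 1.96
print(f"biomarker-negative z = {z_negative:.2f}")  # typically near 0
```

The diluted overall effect is still large enough to cross the usual significance threshold, while the biomarker-negative subgroup on its own shows no signal, which is exactly the decision-making trap the abstract raises.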

About the Speaker

Anastasia Ivanova is a Professor of Biostatistics at the University of North Carolina at Chapel Hill. Anastasia’s research area is complex designs for clinical trials, including adaptive designs, designs with re-randomization and enrichment, and dose-finding designs. She has published more than 170 peer-reviewed papers. She is an ASA Fellow and an Associate Editor of Statistics in Medicine.

Event Organizers

Nicholas Rios

David Kepplinger

A Novel Extreme Value Autoencoder Framework for Probabilistic Model Emulation and Calibration

Likun Zhang

Assistant Professor

Department of Statistics

University of Missouri

Date: Friday, October 27, 2023

Abstract

Large physics-based simulation models are crucial for understanding complex problems related to energy and the environment. These models are typically quite computationally expensive, and there are numerous computational and uncertainty quantification (UQ) challenges when using them in the context of calibration, inverse problems, UQ for forward simulations, and model parameterization. Surrogate model emulators have proven useful in recent years to facilitate UQ in these contexts, particularly when combined with Bayesian inference. However, traditional methods for model emulation such as Gaussian processes, polynomial chaos expansions, and, more recently, neural networks and generative models do not naturally accommodate extreme values, which are increasingly relevant for many complex processes such as environmental impacts due to climate change and anomaly detection. Many statistical methods have been developed to flexibly model the simultaneous occurrence of extremal events, but most of them assume that the dependence structure of concurrent extremes is time invariant, which is unrealistic for physical processes that exhibit diffusive dynamics at short time scales. We propose to develop a novel probabilistic statistical framework that explicitly accommodates concurrent and dependent extremes within a conditional variational autoencoder (CVAE) engine, enabling fast and efficient uncertainty quantification in model calibration, inverse modeling, ensemble prediction, and parameter estimation contexts. We also propose a new validation framework tailored to assess skill in fitting extreme behavior in model outputs. Our approach addresses, for the first time, the need for efficient surrogate emulators of expensive simulation models that can accurately characterize, in a rigorous probabilistic manner, extreme values that are dependent in space and time and across processes.

About the Speaker

Dr. Zhang is currently an assistant professor in the Department of Statistics at the University of Missouri. He received his Ph.D. in Statistics from Penn State in 2020, after which he worked with climate scientists at Lawrence Berkeley National Laboratory for two years. His research focuses on extreme value theory and flexible spatial extremes modeling, which has been used to study a variety of weather processes and to detect changes in their long-term climatology. He also incorporates deep learning techniques into spatial extremes modeling so that domain scientists can study dependent extremes on datasets with a massive number of locations.

Event Organizers

David Kepplinger

Nicholas Rios

Functional Data Analysis Through the Lens of Deep Neural Networks

Guanqun Cao

Associate Professor

Department of Computational Mathematics, Science, and Engineering

Michigan State University

Date: Friday, October 20, 2023

Abstract

Functional data refer to curves or functions, i.e., the data for each variable are viewed as smooth curves, surfaces, or hypersurfaces evaluated at a finite subset of some interval in 1D, 2D, and 3D. Advancements in modern technology have enabled the collection of sophisticated, ultra high-dimensional datasets, thus boosting the investigation of functional data. In this talk, I will first introduce a deep neural networks-based robust method to perform nonparametric regression for multi-dimensional functional data. The proposed estimators are based on sparsely connected deep neural networks with the rectified linear unit (ReLU) activation function. Meanwhile, the estimators are less susceptible to outlying observations and model misspecification. For any multi-dimensional functional data, we provide uniform convergence rates for the proposed robust deep neural network estimators. Then, I will present a new approach, called functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data, which are then used to predict the class label of a future data function. Unlike popular functional discriminant analysis approaches, which only work for one-dimensional functional data, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data.
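To make the "principal components, then classifier" recipe concrete, here is a minimal numpy-only sketch on synthetic curves. The curves, grid, and a nearest-centroid rule (standing in for the trained deep network of the FDNN approach) are all illustrative assumptions, not the speaker's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)  # common evaluation grid for all curves

# Two classes of noisy curves that differ in their mean function.
def sample_curves(mean_fn, n):
    return mean_fn(t) + rng.normal(0, 0.3, (n, t.size))

X0 = sample_curves(np.sin, 80)
X1 = sample_curves(lambda s: np.sin(s) + 0.5 * s, 80)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 80)

# Functional PCA via SVD of the centered data matrix; keep 5 score dimensions.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
scores = (X - mu) @ Vt[:5].T  # projections onto the leading eigenfunctions

# A deep network would be trained on these scores; here a simple
# nearest-centroid rule in score space stands in for the classifier.
c0 = scores[y == 0].mean(axis=0)
c1 = scores[y == 1].mean(axis=0)
pred = (np.linalg.norm(scores - c1, axis=1)
        < np.linalg.norm(scores - c0, axis=1)).astype(int)
print("training accuracy:", (pred == y).mean())
```

The key design point carried over from the abstract is that classification happens on low-dimensional principal component scores of the curves rather than on the raw discretized functions.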

About the Speaker

Dr. Guanqun Cao is an Associate Professor in the Department of Statistics and Probability and the Department of Computational Mathematics, Science, and Engineering at Michigan State University. Prior to joining Michigan State University in 2023, she was a faculty member at Auburn University. She obtained her Ph.D. in Statistics from the Department of Statistics and Probability at Michigan State University. Working at the interface of statistics, mathematics, and computer science, Dr. Cao is interested in developing cutting-edge statistical methods for problems in data science and big data analytics. The methods she has developed find wide application in engineering, neuroimaging, environmental studies, and biomedical science. Dr. Cao is an Elected Member of the International Statistical Institute.

Event Organizers

Nicholas Rios

David Kepplinger

From Algorithms for Anomaly Detection to Spatial and Temporal Modeling and Bayesian Ultra-High Dimensional Variable Selection

Hsin-Hsiung Huang

Associate Professor

Department of Statistics and Data Science

University of Central Florida

Date: Friday, October 13, 2023

Abstract

Inspired by our investigation of spatiotemporal data analysis for the NSF ATD challenges, we have studied Bayesian clustering, variable selection for mixed-type multivariate responses, and Gaussian process priors for spatiotemporal data. The proposed Bayesian approaches effectively and efficiently fit high-dimensional data with spatial and temporal features. We further propose a two-stage Gibbs sampler, which leads to a consistent estimator with a much faster posterior contraction rate than a one-step Gibbs sampler. For Bayesian ultrahigh-dimensional variable selection, we have developed Bayesian sparse multivariate regression for mixed responses (BS-MRMR), a shrinkage-prior model for mixed-type response generalized linear models. We consider a latent multivariate linear regression model associated with the observable mixed-type response vector through its link function. Under our proposed BS-MRMR model, multiple responses belonging to the exponential family are simultaneously modeled, and mixed-type responses are allowed. We show that the MBSP-GLM model achieves posterior consistency, and we quantify its posterior contraction rate. Additionally, we incorporate Gaussian processes into zero-inflated negative binomial regression. To overcome the computational bottleneck that GPs face when the sample size is large, we adopt the nearest-neighbor GP approach, which approximates the covariance matrix using local experts. We provide simulation studies and real-world gene data examples.

About the Speaker

Dr. Hsin-Hsiung Bill Huang is an Associate Professor in the Department of Statistics and Data Science at the University of Central Florida (UCF). Dr. Huang received his Ph.D. in Statistics from the University of Illinois at Chicago and two MS degrees from the Georgia Institute of Technology and National Taiwan University as well as the BA in Economics and BS in Mathematics degrees from National Taiwan University. His scholarly interests and expertise include Bayesian ultrahigh dimensional variable selection, regularized low-rank matrix-variate regression, clustering, classification, and dimension reduction.

His research addresses challenges in analyzing big data and in developing new statistical methods for real-data problems, often through interdisciplinary collaborations. He has developed new statistical methods for computed tomography (CT), including a statistical reconstruction algorithm for positronium lifetime imaging using time-of-flight positron emission tomography (PET), as well as algorithms for threat detection and for modeling large spatiotemporal data. He received the UCF Research Incentive Award (RIA) in 2021. His current research is partially supported by a National Science Foundation (NSF) Algorithms for Threat Detection (ATD) grant awarded in 2019, on which he is principal investigator (PI), with a supplement in 2023; a new ATD grant in 2023; and an NIH R01 grant on which he has served as co-investigator since 2019. His team, UCF, won top places in both the 2021 and 2022 ATD challenge competitions.

Event Organizers

Nicholas Rios

David Kepplinger

Another Look at Assessing Goodness-of-Fit of Time Series Using Fitted Residuals

Richard Davis

Professor of Statistics

Department of Statistics

Columbia University

Date: Friday, October 6, 2023

Abstract

A fundamental and often final step in time series modeling is to assess the quality of fit of a proposed model to the data. Since the underlying distribution of the innovations that generate a model is typically not prescribed, goodness-of-fit tests typically take the form of testing the fitted residuals for serial independence. However, these fitted residuals are inherently dependent since they are based on parameter estimates. Thus, standard tests of serial independence, such as those based on the autocorrelation function (ACF) or distance correlation function (DCF) of the fitted residuals, need to be adjusted. The sample splitting procedure in Pfister et al. (2018) is one such fix for the case of models for independent data, but fails to work in the dependent case.

In this paper, sample splitting is leveraged in the time series setting to perform tests of serial dependence of fitted residuals using the ACF and DCF. Here the first a_n of the data points are used to estimate the parameters of the model, and then, using these parameter estimates, the last s_n of the data points are used to compute the estimated residuals. Tests for serial independence are then based on these s_n residuals. As long as the overlap between the a_n and s_n data splits is asymptotically ½, the ACF and DCF tests of serial independence often have the same limit distributions as though the underlying residuals were indeed iid. This procedure obviates the need to adjust the construction of confidence bounds for both the ACF and DCF in goodness-of-fit testing. (This is joint work with Leon Fernandes.)
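The splitting idea can be sketched for a simple AR(1) model on synthetic data (the a_n/s_n notation follows the abstract; the particular split sizes and the least-squares fit are illustrative choices, not the paper's exact procedure): estimate the parameter on the first block, form residuals on the last block, and compare their sample ACF to the usual iid bounds.

```python
import numpy as np

rng = np.random.default_rng(2)
n, phi = 5000, 0.6

# Simulate an AR(1) series with iid standard normal innovations.
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

# Overlapping splits: first a_n points for estimation, last s_n for residuals,
# chosen here so the two splits overlap in half the series.
a_n = s_n = 3 * n // 4
head = x[:a_n]
phi_hat = (head[1:] @ head[:-1]) / (head[:-1] @ head[:-1])  # least-squares AR(1) fit

tail = x[-s_n:]
resid = tail[1:] - phi_hat * tail[:-1]  # fitted residuals on the second split

# Lag-1 sample autocorrelation of the residuals vs. the iid 95% bound.
r = resid - resid.mean()
acf1 = (r[1:] @ r[:-1]) / (r @ r)
print(f"phi_hat = {phi_hat:.3f}, lag-1 residual ACF = {acf1:.4f}, "
      f"iid bound = {1.96 / np.sqrt(len(r)):.4f}")
```

Because the residuals come from a data block not fully used for estimation, their ACF behaves (asymptotically) like that of iid noise, which is what lets the unadjusted confidence bounds be used.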

About the Speaker

Richard Davis is the Howard Levene Professor of Statistics at Columbia University, where he served as department chair from 2013 to 2019. He was President of the Institute of Mathematical Statistics (IMS) in 2016 and Editor-in-Chief of Bernoulli (2010–2012). He is also a fellow of the American Statistical Association. His research interests lie primarily in the areas of applied probability, time series, and stochastic processes, much of which is strongly influenced by extreme value theory.

Event Organizers

Nicholas Rios

David Kepplinger

Depth Functions and Their Applications to Classification and Clustering

Giacomo Francisci

Postdoc

Department of Statistics

George Mason University

Date: Friday, September 29, 2023

Abstract

Depth functions provide a center-outward ordering similar to that of the real line and are used to define medians and quantiles of multivariate distributions. As they do not require any assumption on the underlying distribution, they are widely used in nonparametric statistics and robust methods, for instance in outlier detection and classification. In the classification setting, a common issue is that points of zero depth with respect to all classes arise in practice, leading to classification challenges. In the first part of the presentation, we address this issue using an extended notion of depth function. We use this idea to study classification for tree-indexed random variables. In the final part of the presentation, we introduce a concept of local depth function and use it to study modal clustering. We also discuss consistency properties of the clustering algorithm.
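As a small illustration of the center-outward idea, here is the classical Mahalanobis depth, D(x) = 1/(1 + d²(x)) with d the Mahalanobis distance to the sample mean. This is a simple parametric depth that never vanishes; the zero-depth issue the talk addresses concerns depths such as halfspace depth, which is zero outside the convex hull of the data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))  # sample from a bivariate standard normal

def mahalanobis_depth(x, data):
    """Mahalanobis depth: 1 / (1 + squared Mahalanobis distance to the mean)."""
    mu = data.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d2 = (x - mu) @ S_inv @ (x - mu)
    return 1.0 / (1.0 + d2)

center_depth = mahalanobis_depth(X.mean(axis=0), X)  # deepest point has depth 1
far_depth = mahalanobis_depth(np.array([5.0, 5.0]), X)  # outlying point: low depth
print(center_depth, far_depth)
```

The depth is maximal (equal to 1) at the sample mean and decreases monotonically outward, which is the ordering that depth-based medians, quantiles, and classifiers exploit.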

About the Speaker

Giacomo Francisci is a Postdoctoral Research Fellow in the Department of Statistics at GMU, where he works under the supervision of Prof. Vidyashankar. Prior to that, he obtained a double-degree M.Sc. in Mathematics from the University of Trento (Italy) and the University of Tübingen (Germany), and a cotutelle Ph.D. in Mathematics and Statistics from the University of Trento (Italy) and the University of Cantabria (Spain). His research interests are in depth functions and their applications to machine learning, empirical processes, branching processes, and branching random walks.

Event Organizers

Nicholas Rios

David Kepplinger

Mining Your Fantasy: From Professor to Fugitive

Dennis Peng

Journalist, True Voice of Taiwan

Date: Friday, September 22, 2023

Abstract

The Statistics and Systems Engineering and Operations Research departments invite you to a seminar given by Dennis Peng, who will discuss his academic and career journey: from becoming a professor, to TV talk show host, and more.*

About the Speaker

Dennis Peng was an Assistant Professor and Director of the Graduate School of Journalism at National Taiwan University from 1995 to 2015. He was the CEO of Hakka TV from 2004 to 2005, and he hosted programs on Next TV and FTV for several years.

Event Organizers

Nicholas Rios

David Kepplinger

*Note that the contents of this talk solely reflect the views of the speaker and do not reflect the views of George Mason University or its faculty.

Exact Conditional Independence Testing and Conformal Inference with Adaptively Collected Data

Lucas Janson

Associate Professor

Department of Statistics

Harvard University

Date: Friday, September 15, 2023

Abstract

Randomization testing is a fundamental method in statistics, enabling inferential tasks such as testing for (conditional) independence of random variables, constructing confidence intervals in semiparametric location models, and constructing (by inverting a permutation test) model-free prediction intervals via conformal inference. Randomization tests are exactly valid for any sample size, but their use is generally confined to exchangeable data. Yet in many applications, data is routinely collected adaptively via, e.g., (contextual) bandit and reinforcement learning algorithms or adaptive experimental designs. In this paper we present a general framework for randomization testing on adaptively collected data (despite its non-exchangeability) that uses a weighted randomization test, for which we also present computationally tractable resampling algorithms for various popular adaptive assignment algorithms, data-generating environments, and types of inferential tasks. Finally, we demonstrate via a range of simulations the efficacy of our framework for both testing and confidence/prediction interval construction. The relevant paper is https://arxiv.org/abs/2301.05365.
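For readers unfamiliar with the baseline, here is a minimal sketch of a plain (exchangeable) two-sample randomization test on synthetic data; the weighted test in the paper generalizes this idea to adaptively collected, non-exchangeable data by reweighting the resamples according to the assignment mechanism. All data and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100

# Two samples; under H0 they share a distribution, so group labels are exchangeable.
x = rng.normal(0.8, 1, n)  # shifted sample
y = rng.normal(0.0, 1, n)
obs = x.mean() - y.mean()  # observed test statistic

pooled = np.concatenate([x, y])
perm_stats = np.empty(2000)
for b in range(2000):
    perm = rng.permutation(pooled)  # a valid resample under exchangeability
    perm_stats[b] = perm[:n].mean() - perm[n:].mean()

# Randomization p-value with the +1 correction for exact finite-sample validity.
p_value = (1 + np.sum(np.abs(perm_stats) >= abs(obs))) / (2000 + 1)
print(f"randomization p-value = {p_value:.4f}")
```

The exactness for any sample size comes from comparing the observed statistic against statistics computed on resamples that are equally likely under the null; the paper's contribution is making such resampling valid when the data were gathered by a bandit or adaptive design.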

About the Speaker

Lucas Janson is an Associate Professor of Statistics and Affiliate in Computer Science at Harvard University, where he studies high-dimensional inference and statistical machine learning.

Event Organizers

David Kepplinger

Nicholas Rios

Data Science at the Singularity

David Donoho

Professor

Department of Statistics

Stanford University

Date: Friday, September 1, 2023

Abstract

A purported “AI Singularity” has been much in the public eye recently, especially since the release of ChatGPT last November, spawning social media “AI Breakthrough” threads promoting Large Language Model (LLM) achievements. Alongside this, mass media and national political attention have focused on “AI Doom” hawked by social media influencers, with Twitter personalities invited to tell congresspersons about the coming “End Times.”

In my opinion, “AI Singularity” is the wrong narrative; it drains time and energy with pointless speculation. We do not yet have general intelligence, we have not yet crossed the AI singularity, and the remarkable public reactions signal something else entirely. 

Something fundamental to science really has changed in the last ten years. In certain fields which practice Data Science according to three principles I will describe, progress is simply dramatically more rapid than in those fields that don’t yet make use of it.

Researchers in the adhering fields are living through a period of very profound transformation, as they make a transition to frictionless reproducibility. This transition markedly changes the rate of spread of ideas and practices, and marks a kind of singularity, because it affects mindsets and paradigms and erases memories of much that came before.  Many phenomena driven by this transition are misidentified as signs of an AI singularity. Data Scientists should understand what's really happening and their driving role in these developments.

About the Speaker

David Donoho is a Professor of Statistics at Stanford University. Among his many honors, he received the COPSS Presidents’ Award (1994), the John von Neumann Prize (2001, Society for Industrial and Applied Mathematics), the Norbert Wiener Prize in Applied Mathematics (2010, from SIAM and the AMS), the Shaw Prize for Mathematics (2013), the Gauss Prize from the IMU (2018), and, most recently, the IEEE Jack S. Kilby Signal Processing Medal (2022). His research interests include large-scale covariance matrix estimation, large-scale matrix denoising, detection of rare signals, compressed sensing, and empirical deep learning.

Event Organizers

Nicholas Rios

Real-time Discriminant Analysis in the Presence of Label and Measurement Noise

Mia Hubert

Professor

Department of Mathematics, section of Statistics and Data Science

KU Leuven

Date: Friday, August 25, 2023

Abstract

Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label and measurement noise, which often impairs the model's predictive ability. Robust estimators of location and scatter are resistant to this type of contamination, but they have a prohibitive computational cost for large-scale industrial experiments. First, we present a novel QDA method based on a real-time robust algorithm. Second, we introduce the classmap, a graphical display to visualize aspects of the classification results and to identify label and measurement noise.
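A minimal sketch of the classical QDA rule the abstract starts from, on synthetic two-class data: the empirical mean and covariance used below are exactly the noise-sensitive estimators that the talk's robust, real-time variant would replace with robust counterparts.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two Gaussian classes with different centers and shapes (covariances).
X0 = rng.normal(0, 1, (200, 2))
X1 = rng.normal(3, 2, (200, 2))

def qda_score(x, data, prior):
    """Classical QDA discriminant: log prior plus the Gaussian log-density
    built from the empirical mean and covariance of the class."""
    mu = data.mean(axis=0)
    S = np.cov(data, rowvar=False)
    diff = x - mu
    return (np.log(prior)
            - 0.5 * np.log(np.linalg.det(S))
            - 0.5 * diff @ np.linalg.inv(S) @ diff)

# Assign an unseen observation to the class with the larger discriminant score.
x_new = np.array([0.5, 0.2])
label = int(qda_score(x_new, X1, 0.5) > qda_score(x_new, X0, 0.5))
print("assigned class:", label)
```

A single grossly mislabeled or corrupted training point can distort the empirical mean and covariance and hence the quadratic decision boundary, which is the motivation for plugging robust location and scatter estimates into the same rule.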

About the Speaker

Mia Hubert is a professor at KU Leuven in the Department of Mathematics, Section of Statistics and Data Science. Her research focuses on robust statistics, outlier detection, data visualization, depth functions, and the development of statistical software. She is an elected fellow of the ISI and has served as associate editor for several journals, including JCGS, CSDA, and Technometrics. She is co-founder and organizer of the Rousseeuw Prize for Statistics, a biennial prize that awards pioneering work in statistical methodology.

Event Organizers

Nicholas Rios