2023 Seminars

Spring 2023 Seminars

Sufficient Cause Urn Analysis of Principal Stratification Methods

Jaffer Zaidi

Assistant Professor

Department of Global and Community Health

George Mason University

 

Date: Friday, April 28, 2023

Abstract

The analysis of causal effects when the outcome of interest is possibly truncated by death has a long history in statistics. The survivor average causal effect is commonly identified with more assumptions than those guaranteed by the design of a randomized clinical trial, or by using sensitivity analysis. This paper demonstrates that individual-level causal effects in the 'always survivor' principal stratum can be identified with no stronger identification assumptions than randomization. We further develop Rothman's sufficient cause model to derive additional results, providing a unified framework for sensitivity analysis of different identification strategies for principal stratification causal effects.

About the Speaker

Dr. Jaffer Zaidi is an assistant professor in the Department of Global and Community Health. Jaffer's research interests are primarily within causal inference, including but not limited to sufficient cause methods, sensitivity analysis, principal stratification, and interaction analysis. Before coming to Mason, Zaidi was a postdoctoral research fellow at the University of North Carolina at Chapel Hill, funded through SAMSI (Statistical and Applied Mathematical Sciences Institute), and has conducted research in Somkhele, South Africa at the Africa Health Research Institute (AHRI).

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Causal Inference with Interference

Michael Hudgens

Professor and Associate Chair

Department of Biostatistics

University of North Carolina

Date: Friday, April 21, 2023

Abstract

A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals. However, in many settings, this assumption obviously does not hold. For example, in infectious diseases, whether one person becomes infected may depend on who else in the population is vaccinated. In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference.

About the Speaker

Dr. Michael Hudgens is a Professor and Associate Chair of the Department of Biostatistics at the University of North Carolina. He also serves as the Co-Director of the UNC Causal Inference Research Lab. Professor Hudgens has co-authored approximately 300 peer-reviewed papers in statistical journals such as Biometrics, Biometrika, JASA, and JRSS-B, as well as biomedical journals such as the Lancet, Nature, and New England Journal of Medicine. He currently serves as an associate editor for Biometrics. He is an elected fellow of the American Statistical Association and has taught graduate-level biostatistics courses at UNC for over 15 years.

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Generating Space-Filling Designs for Computer Experiments

Lin Wang

Assistant Professor

Department of Statistics

Purdue University 

Date: Friday, April 14, 2023

Abstract

Space-filling designs are commonly used in controlled experiments for investigating complex simulation systems. The Latin hypercube design is a popular type of space-filling design because it studies as many levels as the design size for each variable and therefore achieves one-dimensional uniformity. In this talk, I will introduce a series of new methods for generating large and high-dimensional Latin hypercube designs. The generated designs are shown to be optimal under the maximin distance criterion and have small pairwise correlations between variables. When that many levels in a Latin hypercube design are not needed to learn the simulation system, the proposed methods can also be used to generate space-filling designs with fewer and balanced levels.
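As a toy illustration of the one-dimensional uniformity property described above (a basic random Latin hypercube generator, not the speaker's proposed construction), consider the following Python sketch:

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    """Generate an n-point random Latin hypercube design in [0, 1)^d.

    Each column is a random permutation of the n strata (plus jitter
    within each stratum), so every variable is studied at n distinct
    levels -- the one-dimensional uniformity property."""
    rng = np.random.default_rng(seed)
    perms = np.array([rng.permutation(n) for _ in range(d)]).T
    jitter = rng.random((n, d))
    return (perms + jitter) / n

X = latin_hypercube(8, 3, seed=0)
# one-dimensional uniformity: each column has exactly one point per stratum
for j in range(3):
    strata = np.floor(X[:, j] * 8).astype(int)
    assert sorted(strata) == list(range(8))
```

A purely random construction like this guarantees one-dimensional uniformity but not the maximin-distance optimality or small pairwise correlations that the talk's methods target.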

About the Speaker

Lin Wang is an Assistant Professor of Statistics at Purdue University. Prior to joining Purdue, she was an Assistant Professor of Statistics at George Washington University from 2019 to 2022. She obtained her PhD in Statistics in 2019 from the University of California, Los Angeles. Her research interests include sampling, subsampling, experimental design, and causal inference.

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Power and Sample Size Calculations for Rerandomized Experiments

Zach Branson

Assistant Teaching Professor

Statistics and Data Science

Carnegie Mellon University

Date: Friday, April 7, 2023

Abstract

Power analyses are an important aspect of experimental design, because they help determine how experiments are implemented in practice. It is common to specify a desired level of power and compute the sample size necessary to obtain that power. Such calculations are well-known for completely randomized experiments, but there can be many benefits to using other experimental designs. For example, it has recently been established that rerandomization, where subjects are randomized until covariate balance is obtained, increases the precision of causal effect estimators. This work establishes the power of rerandomized treatment-control experiments, thereby allowing for sample size calculations. We find the surprising result that, while power is often greater under rerandomization than under complete randomization, the opposite can occur for very small treatment effects. The reason is that inference under rerandomization can be relatively more conservative, in the sense that it can have a lower type I error rate at the same nominal significance level, and this additional conservativeness adversely affects power. This surprising result is due to treatment effect heterogeneity, a quantity often ignored in power analyses. We find that heterogeneity increases power for large effect sizes but decreases power for small effect sizes.
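For context, the well-known power calculation under complete randomization, the baseline that the talk extends to rerandomization, can be sketched as follows (a textbook two-sample z-test formula, not the paper's method):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(delta, sigma, n_per_arm, alpha=0.05):
    """Power of a two-sided z-test for a treatment-control mean
    difference delta, outcome sd sigma, under complete randomization."""
    z = NormalDist()
    se = sigma * sqrt(2.0 / n_per_arm)        # standard error of the difference
    z_alpha = z.inv_cdf(1 - alpha / 2)        # two-sided critical value
    shift = abs(delta) / se
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)

# e.g. detecting a half-sd effect with 64 subjects per arm
p = two_sample_power(delta=0.5, sigma=1.0, n_per_arm=64)
```

Inverting this formula for `n_per_arm` at a target power gives the usual sample size calculation; the talk's contribution is the analogue of this calculation when treatment is assigned by rerandomization.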

About the Speaker

Zach Branson is an Assistant Teaching Professor in Statistics and Data Science at Carnegie Mellon University. His main research interests are experimental design and causal inference, where the goal is to assess whether treatments (randomized or not) cause a change in outcomes. In addition to theoretical and methodological work, he works on applying causal inference methods in criminology, medicine, mental health, and text analysis. Beyond research, his main teaching interests are in statistical communication, e.g., training PhD students to write papers and undergraduates to give statistical presentations.

Event Organizers

Abolfazl Safikhani

Nicholas Rios

A Multiple Imputation Procedure for Record Linkage and Causal Inference to Estimate the Effects of Home-Delivered Meals

Roee Gutman

Associate Professor

Department of Biostatistics

Brown University

Date: Friday, March 31, 2023

Abstract

Causal analysis of observational studies requires data that comprise a set of covariates, a treatment assignment indicator, and the observed outcomes. However, data confidentiality restrictions or the nature of data collection may distribute these variables across two or more datasets. In the absence of unique identifiers to link records across files, probabilistic record linkage algorithms can be leveraged to merge the datasets. Current applications of record linkage are concerned with the estimation of associations between variables that are exclusive to one file, not causal relationships. We propose a Bayesian framework for record linkage and causal inference where one file comprises all the covariate and observed outcome information, and the second file consists of a list of all individuals who receive the active treatment. Under certain ignorability assumptions, the procedure properly propagates the error in the record linkage process, resulting in valid statistical inferences. To estimate the causal effects, we devise a two-stage procedure. The first stage performs Bayesian record linkage to multiply impute the treatment assignment for all individuals in the first file, while adjustments for covariate imbalance and imputation of missing potential outcomes are performed in the second stage. This procedure is used to evaluate the effect of Meals on Wheels services on mortality and healthcare utilization among homebound older adults in Rhode Island. In addition, an interpretable sensitivity analysis is developed to assess potential violations of the ignorability assumptions.

About the Speaker

Dr. Roee Gutman is an Associate Professor in the Department of Biostatistics at Brown University, where he also serves as the director for the Undergraduate Statistics Concentration. His areas of expertise are causal inference, file linkage, missing data, Bayesian data analysis, and their application to big data sources in health services research. He has been involved in many comparative effectiveness studies where he contributed both in terms of the statistical theory and its implementation.

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Bayesian Modeling with Spatial Curvature Processes

Aritra Halder

Assistant Professor

Department of Biostatistics

Drexel University

Date: Friday, March 10, 2023

Abstract

Spatial process models are widely used for modeling point-referenced variables arising from diverse scientific domains. Analyzing the resulting random surface provides deeper insights into the nature of latent dependence within the studied response. We develop Bayesian modeling and inference for rapid changes on the response surface to assess directional curvature along a given trajectory. Such trajectories or curves of rapid change, often referred to as wombling boundaries, occur in geographic space in the form of rivers in a flood plain, roads, mountains, plateaus, or other topographic features leading to high gradients on the response surface. We demonstrate fully model-based Bayesian inference on directional curvature processes to analyze differential behavior in responses along wombling boundaries. We illustrate our methodology with a number of simulated experiments followed by multiple applications featuring the Meuse river data, temperature data from the Northeastern United States, and the Boston housing data.

About the Speaker

Aritra Halder is an Assistant Professor in the Department of Biostatistics at Drexel University's Dornsife School of Public Health. He completed his Ph.D. in Statistics at the University of Connecticut in July 2020. His research interests are Bayesian modeling, spatial and spatio-temporal statistics, and statistical computation.

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Prognostic Digital Twins: Current and Future Applications

Arman Sabbaghi

Head of Biostatistics Research

Unlearn.AI

Date: Friday, February 10, 2023

Abstract

Clinical trials are established as the gold standard for evaluating the causal effects of new medical treatments, interventions, or therapies. However, modern clinical trials are becoming increasingly difficult to conduct due to enrollment challenges, long trial durations, and significant costs. Existing methods based on external controls can help to address some of these difficulties, but they are typically unreliable. We shall present Unlearn's TwinRCT technology, a novel trial design that combines historical data, machine learning, and randomization to deliver smaller, faster clinical trials and yield results that are more reliable than external controls. The core technology underlying the TwinRCT is a Digital Twin Generator (DTG), which is developed from historical data and then applied to baseline data for new clinical trial participants to create Prognostic Digital Twins. Prognostic scores for each clinical outcome of interest are derived from the Prognostic Digital Twins and used to improve the efficiency of the analysis. We shall demonstrate the relative advantages of the TwinRCT technology with respect to studies in which external controls or supplemental controls are considered options, and describe new Bayesian extensions of the TwinRCT based on Prognostic Digital Twins. Ultimately, as described by the European Medicines Agency (EMA) qualification of this approach, the TwinRCT technology can yield unbiased treatment effect estimation in the primary analysis of pivotal studies, and preserve strict Type I error control even in circumstances where the historical data have known differences versus the clinical trial patients enrolled in the TwinRCT.

About the Speaker

Prior to becoming the Head of Biostatistics Research at Unlearn.AI, Arman Sabbaghi was an Associate Professor of Statistics at Purdue University. Arman Sabbaghi obtained his Ph.D. in Statistics from Harvard University in 2014.  At Purdue, his research focused on the development of new causal inference methodology for the analysis of observational data and clinical trials, the creation of statistical tools for assessing experimental designs, and the development of ML algorithms for quality control. 

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Garbage In, Einstein Out: A Mathematical Study of "Einstein from Noise"

I-Ping Tu

Research Fellow

The Institute of Statistical Science

Academia Sinica, Taipei

Date: Friday, February 24, 2023

Abstract

A cryo-EM 3D structure is solved from many noisy 2D projections of individual molecules. Two keys that make this 3D reconstruction a challenging computational task are its high level of noise and the unknown pose parameters of each individual molecule. Oftentimes, a reference is used to initiate the search for orientation, which incurs the risk of coalescing images with low or no signal onto the reference, known as the 'Einstein from noise' problem. Here, we investigate this phenomenon from a model-bias viewpoint in terms of image dimensionality and sample size. Using mathematical modeling, we derive a surprisingly simple formula that accurately predicts the correlation between the Einstein face and the spurious image arising from averaging the sorted top images among purely Gaussian noise images. This theoretical value increases with n (the number of images) and m (the number of images sorted for averaging) but decreases with p (the dimensionality of the image). To avoid the 'Einstein from noise' pitfall, we propose a denoising method as a data pre-processing tool to increase the SNR. We observe that this tool yields significant improvements in either computation time or cluster average quality in the 2D clustering of various cryo-EM analysis packages.
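The phenomenon is easy to reproduce numerically. The sketch below (an illustrative simulation, not the speaker's code or formula) selects and averages the m noise images best correlated with a reference and recovers a spurious copy of it:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, m = 256, 2000, 200           # pixels, noise images, top images averaged

ref = rng.standard_normal(p)        # stand-in "reference" (e.g. an Einstein face)
ref /= np.linalg.norm(ref)

noise = rng.standard_normal((n, p))     # pure Gaussian noise images, no signal
corr = noise @ ref                      # correlation of each image with the reference
top = noise[np.argsort(corr)[-m:]]      # keep the m best-matching images
avg = top.mean(axis=0)                  # "class average" of selected noise

bias = float(avg @ ref / np.linalg.norm(avg))   # correlation of average with reference
```

Despite the inputs containing no signal at all, `bias` comes out large, which is exactly the model bias the talk quantifies as a function of n, m, and p.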

About the Speaker

Dr. I-Ping Tu received her Ph.D. in Statistics from Stanford University in 1997. She was a senior statistician at the Stanford Functional Genomics Facility until 2003, and later became a Research Fellow at the Institute of Statistical Science, Academia Sinica. Her research has mainly focused on developing statistical methods to analyze cryo-electron microscopy (cryo-EM) image data. In recent years, technical breakthroughs have transformed cryo-EM into a main tool for determining molecular structure at atomic resolution without crystals or in solution. However, structural determination from single-particle cryo-EM images remains very challenging because it involves processing extremely noisy images of unknown orientation. She has developed a 2D classification package called RE2DC and a processing platform ASCEP, which integrates RE2DC with other packages to execute a pipeline for 3D structure determination from cryo-EM data. She will continue developing efficient and robust statistical methods to improve the analysis.

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Efficient and Targeted COVID-19 Border Testing via Reinforcement Learning

Hamsa Bastani

Assistant Professor of Operations, Information, and Decisions

University of Pennsylvania

Date: Friday, February 17, 2023

Abstract

Throughout the COVID-19 pandemic, countries relied on a variety of ad-hoc border control protocols to allow for non-essential travel while safeguarding public health: from quarantining all travellers to restricting entry from select nations based on population-level epidemiological metrics such as cases, deaths or testing positivity rates. Here we report the design and performance of a reinforcement learning system, nicknamed 'Eva'. In the summer of 2020, Eva was deployed across all Greek borders to limit the influx of asymptomatic travellers infected with SARS-CoV-2, and to inform border policies through real-time estimates of COVID-19 prevalence. In contrast to country-wide protocols, Eva allocated Greece's limited testing resources based upon incoming travellers' demographic information and testing results from previous travellers. By comparing Eva's performance against modelled counterfactual scenarios, we show that Eva identified 1.85 times as many asymptomatic, infected travellers as random surveillance testing, with up to 2-4 times as many during peak travel, and 1.25-1.45 times as many asymptomatic, infected travellers as testing policies that only utilize epidemiological metrics. We demonstrate that this latter benefit arises, at least partially, because population-level epidemiological metrics had limited predictive value for the actual prevalence of SARS-CoV-2 among asymptomatic travellers and exhibited strong country-specific idiosyncrasies in the summer of 2020. Our results raise serious concerns about the effectiveness of country-agnostic, internationally proposed border control policies that are based on population-level epidemiological metrics. Instead, our work represents a successful example of the potential of reinforcement learning and real-time data for safeguarding public health.

Paper Link: https://www.nature.com/articles/s41586-021-04014-z 

About the Speaker

Hamsa Bastani is an Assistant Professor of Operations, Information, and Decisions at the Wharton School, University of Pennsylvania. Her research focuses on developing novel machine learning algorithms for data-driven decision-making, with applications to healthcare operations and social good. Her work has received several recognitions, including the Wagner Prize for Excellence in Practice (2021), the Pierskalla Award for the best paper in healthcare (2016, 2019, 2021), the Behavioral OM Best Paper Award (2021), as well as first place in the George Nicholson and MSOM student paper competitions (2016). 

Event Organizers

Abolfazl Safikhani

Nicholas Rios

Fall 2023 Seminars

Challenges and Opportunities in Statistics and Data Science: Ten Research Areas

Xihong Lin

Professor

Department of Biostatistics and Department of Statistics

Harvard University

Date: Friday, November 10, 2023

Abstract

As a data-driven discipline that deals with many aspects of data, statistics is a critical pillar in the rapidly evolving landscape of data science. The increasingly vital role of data, especially big data, in many disciplines, presents the field of statistics with unparalleled challenges and exciting opportunities. Statistics plays a pivotal role in data science by assisting with the use of data and decision making in the face of uncertainty. In this article, we present ten research areas that could make statistics and data science more impactful on science and society. These areas will help better transform data into knowledge, actionable insights and deliverables, and promote more collaboration with computer and other quantitative scientists and domain scientists.

About the Speaker

Xihong Lin is a Professor and Former Chair of the Department of Biostatistics, Coordinating Director of the Program in Quantitative Genomics at the Harvard T. H. Chan School of Public Health, Professor in the Department of Statistics in the Faculty of Arts and Sciences of Harvard University, and Associate Member of the Broad Institute of MIT and Harvard. Dr. Lin's research interests lie in the development and application of scalable statistical and machine learning methods for the analysis of massive and complex genetic and genomic, epidemiological and health data. Some examples of her current research include analytic methods and applications for large-scale whole genome sequencing studies, biobanks and electronic health records, techniques and tools for whole genome variant functional annotation, analysis of the interplay of genes and environment, multiple phenotype analysis, polygenic risk prediction, and heritability estimation. Dr. Lin was elected to the National Academy of Medicine in 2018 and the National Academy of Sciences in 2023. She received the 2002 Mortimer Spiegelman Award from the American Public Health Association, the 2006 Committee of Presidents of Statistical Societies (COPSS) Presidents' Award, the 2017 COPSS FN David Award, the 2008 Janet L. Norwood Award for Outstanding Achievement of a Woman in Statistics, the 2022 National Institute of Statistical Sciences Jerome Sacks Award for Outstanding Cross-Disciplinary Research, and the 2022 Marvin Zelen Leadership in Statistical Science Award. She is an elected fellow of the American Statistical Association (ASA), the Institute of Mathematical Statistics, and the International Statistical Institute.

Event Organizers

David Kepplinger

Nicholas Rios

How to Design the Best Biomarker-Guided Clinical Trial?

Anastasia Ivanova

Professor

Department of Biostatistics

University of North Carolina at Chapel Hill

Date: Friday, November 3, 2023

Abstract

In a clinical trial with a predefined subgroup, it is assumed that the biomarker positive subgroup has the same or higher treatment effect compared to its complement, the biomarker negative subgroup. In these trials the treatment effect is usually evaluated in the biomarker positive subgroup and in the whole population. Statistical testing of the treatment effect in the biomarker negative subgroup is usually not done since it requires a larger sample size. As a result, the new intervention can be shown effective in the overall population even though it is only effective in the biomarker positive group. What can we do to improve decision making in such trials?  

About the Speaker

Anastasia Ivanova is a Professor of Biostatistics at the University of North Carolina at Chapel Hill. Anastasia's research area is complex designs for clinical trials, including adaptive designs, designs with re-randomization and enrichment, and dose-finding designs. Anastasia has published more than 170 peer-refereed papers. She is an ASA Fellow and an Associate Editor of Statistics in Medicine.

Event Organizers

Nicholas Rios

David Kepplinger

A Novel Extreme Value Autoencoder Framework for Probabilistic Model Emulation and Calibration

Likun Zhang

Assistant Professor

Department of Statistics

University of Missouri

Date: Friday, October 27, 2023

Abstract

Large physics-based simulation models are crucial for understanding complex problems related to energy and the environment. These models are typically quite computationally expensive and there are numerous computational and uncertainty quantification (UQ) challenges when using these models in the context of calibration, inverse problems, UQ for forward simulations, and model parameterization. Surrogate model emulators have proven to be useful in recent years to facilitate UQ in these contexts, particularly when combined with Bayesian inference. However, traditional methods for model emulation such as Gaussian processes, polynomial chaos expansions, and more recently, neural networks and generative models do not naturally accommodate extreme values, which are increasingly relevant for many complex processes such as environmental impacts due to climate change and anomaly detection. Many statistical methods have been developed to flexibly model the simultaneous occurrences of extremal events, but most of them assume that the dependence structure of concurrent extremes is time invariant, which is unrealistic for physical processes that exhibit diffusive dynamics at short-time scales. We propose to develop a novel probabilistic statistical framework to explicitly accommodate concurrent and dependent extremes within a conditional variational autoencoder (CVAE) engine for enabling fast and efficient uncertainty quantification in model calibration, inverse modeling, ensemble prediction, and parameter estimation contexts. We also propose a new validation framework that is tailored to assess skill in fitting extreme behavior in model outputs. Our approach addresses, for the first time, the need to have efficient surrogate emulators of expensive simulation models that can accurately characterize, in a rigorous probabilistic manner, extreme values that are dependent in space and time and across processes.

About the Speaker

Dr. Zhang is currently an assistant professor in the Department of Statistics at the University of Missouri. He received his Ph.D. in Statistics from Penn State in 2020, after which he worked with climate scientists at Lawrence Berkeley National Laboratory for two years. His research focuses on extreme value theory and flexible spatial extremes modeling, which has been used to study a variety of weather processes and to detect changes in their long-term climatology. He also incorporates deep learning techniques into spatial extremes modeling so domain scientists can study dependent extremes on datasets with a massive number of locations.

Event Organizers

David Kepplinger

Nicholas Rios

Functional Data Analysis Through the Lens of Deep Neural Networks

Guanqun Cao

Associate Professor

Department of Computational Mathematics, Science, and Engineering

Michigan State University

Date: Friday, October 20, 2023

Abstract

Functional data refer to curves or functions, i.e., the data for each variable are viewed as smooth curves, surfaces, or hypersurfaces evaluated at a finite subset of some interval in 1D, 2D, or 3D. Advancements in modern technology have enabled the collection of sophisticated, ultra-high-dimensional datasets, thus boosting the investigation of functional data. In this talk, I will first introduce a deep neural networks-based robust method to perform nonparametric regression for multi-dimensional functional data. The proposed estimators are based on sparsely connected deep neural networks with the rectified linear unit (ReLU) activation function. Meanwhile, the estimators are less susceptible to outlying observations and model misspecification. For any multi-dimensional functional data, we provide uniform convergence rates for the proposed robust deep neural network estimators. Then, I will present a new approach, called functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained on the principal components of the training data, which is then used to predict the class label of a future data function. Unlike popular functional discriminant analysis approaches, which only work for one-dimensional functional data, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data.
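A stripped-down sketch of the FDNN pipeline, principal component scores of the training curves feeding a classifier, might look like the following. For brevity, the data are synthetic and a logistic readout stands in for the deep network:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy functional data: curves observed on a common grid of 100 points,
# with a class-dependent amplitude plus noise
t = np.linspace(0, 1, 100)
n = 200
labels = rng.integers(0, 2, n)
curves = (np.sin(2 * np.pi * t) * (1 + labels[:, None])
          + 0.3 * rng.standard_normal((n, 100)))

# step 1: principal component scores of the centered curves (via SVD)
centered = curves - curves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[:4].T            # first 4 PC scores per curve

# step 2: a classifier on the scores -- here a logistic readout trained by
# plain gradient descent stands in for the deep network
w = np.zeros(4); b = 0.0
for _ in range(500):
    pred = 1 / (1 + np.exp(-(scores @ w + b)))
    grad = pred - labels
    w -= 0.01 * scores.T @ grad / n
    b -= 0.01 * grad.mean()

pred = 1 / (1 + np.exp(-(scores @ w + b)))
accuracy = float(np.mean((pred > 0.5) == labels))
```

The point of the two-step structure is that the classifier never sees the raw high-dimensional curves, only a low-dimensional vector of projection scores, which is what lets the approach scale to multi-dimensional functional data.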

About the Speaker

Dr. Guanqun Cao is an Associate Professor in the Department of Statistics and Probability and the Department of Computational Mathematics, Science and Engineering at Michigan State University. Prior to joining Michigan State University in 2023, she was a faculty member at Auburn University. She obtained her Ph.D. in Statistics from the Department of Statistics and Probability at Michigan State University. Working at the interface of statistics, mathematics, and computer science, Dr. Cao is interested in developing cutting-edge statistical methods for solving issues related to data science and big data analytics. The methods she has developed have wide applications in engineering, neuroimaging, environmental studies, and biomedical science. Dr. Cao is an Elected Member of the International Statistical Institute.

Event Organizers

Nicholas Rios

David Kepplinger

From Algorithms for Anomaly Detection to Spatial and Temporal Modeling and Bayesian Ultra-High Dimensional Variable Selection

Hsin-Hsiung Huang

Associate Professor

Department of Statistics and Data Science

University of Central Florida

Date: Friday, October 13, 2023

Abstract

Inspired by our investigation of spatiotemporal data analysis for the NSF ATD challenges, we have investigated Bayesian clustering, variable selection for mixed-type multivariate responses, and Gaussian process priors for spatiotemporal data. The proposed Bayesian approaches effectively and efficiently fit high-dimensional data with spatial and temporal features. We further propose a two-stage Gibbs sampler, which leads to a consistent estimator with a much faster posterior contraction rate than a one-step Gibbs sampler. For Bayesian ultrahigh-dimensional variable selection, we have developed Bayesian sparse multivariate regression for mixed responses (BS-MRMR), a shrinkage-prior model for mixed-type-response generalized linear models. We consider a latent multivariate linear regression model associated with the observable mixed-type response vector through its link function. Under our proposed BS-MRMR model, multiple responses belonging to the exponential family are simultaneously modeled, and mixed-type responses are allowed. We show that the MBSP-GLM model achieves posterior consistency and quantify the posterior contraction rate. Additionally, we incorporate Gaussian processes into zero-inflated negative binomial regression. To overcome the computational bottleneck that GPs suffer when the sample size is large, we adopt the nearest-neighbor GP approach, which approximates the covariance matrix using local experts. We provide simulation studies and real-world gene data examples.

About the Speaker

Dr. Hsin-Hsiung Bill Huang is an Associate Professor in the Department of Statistics and Data Science at the University of Central Florida (UCF). Dr. Huang received his Ph.D. in Statistics from the University of Illinois at Chicago and two MS degrees from the Georgia Institute of Technology and National Taiwan University as well as the BA in Economics and BS in Mathematics degrees from National Taiwan University. His scholarly interests and expertise include Bayesian ultrahigh dimensional variable selection, regularized low-rank matrix-variate regression, clustering, classification, and dimension reduction.

His research addresses challenges in analyzing big data, spans interdisciplinary collaborations, and develops new statistical methods for real-data problems. He has developed new statistical methods for computed tomography (CT), including a statistical reconstruction algorithm for positronium lifetime imaging using time-of-flight positron emission tomography (PET), and pursues interdisciplinary work on algorithms for threat detection and large spatiotemporal data modeling. He was awarded the UCF Research Incentive Award (RIA) in 2021. His current research is partially sponsored by a National Science Foundation (NSF) Algorithms for Threat Detection (ATD) grant awarded in 2019, on which he is the principal investigator (PI), with a supplement grant and a new ATD grant in 2023, as well as by an NIH R01 grant, on which he has been a co-investigator since 2019. His team, named UCF, won top places in both the 2021 and 2022 ATD challenge competitions.

Event Organizers

Nicholas Rios

David Kepplinger

Another Look at Assessing Goodness-of-Fit of Time Series Using Fitted Residuals

Richard Davis

Professor of Statistics

Department of Statistics

Columbia University

Date: Friday, October 6, 2023

Abstract

A fundamental and often final step in time series modeling is to assess the quality of fit of a proposed model to the data. Since the underlying distribution of the innovations that generate a model is typically not prescribed, goodness-of-fit tests usually take the form of testing the fitted residuals for serial independence. However, these fitted residuals are inherently dependent since they are based on parameter estimates. Thus, standard tests of serial independence, such as those based on the autocorrelation function (ACF) or distance correlation function (DCF) of the fitted residuals, need to be adjusted. The sample splitting procedure in Pfister et al. (2018) is one such fix for the case of models for independent data, but fails to work in the dependent case.

In this paper, sample splitting is leveraged in the time series setting to perform tests of serial dependence of fitted residuals using the ACF and DCF. Here the first a_n of the data points are used to estimate the parameters of the model, and then, using these parameter estimates, the last s_n of the data points are used to compute the estimated residuals. Tests for serial independence are then based on these s_n residuals. As long as the overlap between the a_n and s_n data splits is asymptotically ½, the ACF- and DCF-based tests of serial independence often have the same limit distributions as though the underlying residuals were indeed iid. This procedure obviates the need to adjust the construction of confidence bounds for both the ACF and DCF in goodness-of-fit testing. (This is joint work with Leon Fernandes.)
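As an illustration of the splitting idea described above (a minimal sketch, not code from the paper), the following Python snippet fits an AR(1) model on the first half of a simulated series, computes residuals on the second half using the estimated coefficient, and compares the lag-1 residual autocorrelation against the usual iid confidence band. The AR(1) choice, the half-and-half split, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) series: x_t = phi * x_{t-1} + e_t
phi_true, n = 0.5, 2000
e = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + e[t]

# Split: estimate the AR(1) coefficient on the first a_n points...
a_n = n // 2
phi_hat = np.dot(x[1:a_n], x[:a_n - 1]) / np.dot(x[:a_n - 1], x[:a_n - 1])

# ...then compute residuals on the last s_n points with that estimate.
s = x[a_n:]
resid = s[1:] - phi_hat * s[:-1]

# Lag-1 sample autocorrelation of the held-out residuals; under the null,
# it behaves asymptotically like that of iid noise, so the usual
# +/- 1.96 / sqrt(s_n) band applies without adjustment.
r = resid - resid.mean()
acf1 = np.dot(r[1:], r[:-1]) / np.dot(r, r)
band = 1.96 / np.sqrt(len(resid))
print(acf1, band)
```

The point of the construction is the last step: because the residuals were computed on data not used for estimation, the iid band needs no correction.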

About the Speaker

Richard Davis is a Howard Levene Professor of Statistics at Columbia University, where he served as chair from 2013 to 2019. He was the President of the Institute of Mathematical Statistics (IMS) in 2016, as well as the Editor-in-Chief of Bernoulli (2010-2012). He is also a fellow of the American Statistical Association. His research interests lie primarily in the areas of applied probability, time series, and stochastic processes - much of which is strongly influenced by extreme value theory. 

Event Organizers

Nicholas Rios

David Kepplinger

Depth Functions and Their Applications to Classification and Clustering

Giacomo Francisci

Postdoc

Department of Statistics

George Mason University

Date: Friday, September 29, 2023

Abstract

Depth functions provide a center-outward order similar to that of the real line and are used to specify medians and quantiles of multivariate distributions. As they do not require any assumption on the underlying distribution, they are widely used in non-parametric statistics and robust methods, for instance, in outlier detection and classification. In the setting of classification, a common issue is that points of zero depth with respect to all classes arise in practice, leading to classification challenges. In the first part of the presentation, we address this issue using an extended notion of depth function. We use this idea to study classification for tree-indexed random variables. In the final part of the presentation, we introduce a concept of local depth function and use it to study modal clustering. We also discuss consistency properties of the clustering algorithm.
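As a simplified illustration of depth-based classification (not taken from the talk), the sketch below uses Mahalanobis depth, one of the simplest depth functions, to implement a max-depth classification rule on two Gaussian classes. The function names and the two-class setup are hypothetical.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Depth of each row of `points` w.r.t. `sample`:
    D(x) = 1 / (1 + (x - mu)^T S^{-1} (x - mu))."""
    mu = sample.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    d = points - mu
    md2 = np.einsum("ij,jk,ik->i", d, S_inv, d)
    return 1.0 / (1.0 + md2)

rng = np.random.default_rng(1)
class0 = rng.normal(loc=0.0, size=(200, 2))
class1 = rng.normal(loc=3.0, size=(200, 2))

# Max-depth rule: assign a point to the class in which it sits deepest.
query = np.array([[0.2, -0.1], [2.9, 3.2]])
d0 = mahalanobis_depth(query, class0)
d1 = mahalanobis_depth(query, class1)
labels = (d1 > d0).astype(int)
print(labels)
```

Note that Mahalanobis depth is strictly positive everywhere, so it sidesteps the zero-depth issue; depths such as halfspace depth vanish outside the convex hull of the sample, which is precisely the problem the extended notion in the talk addresses.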

About the Speaker

Giacomo Francisci is a Postdoctoral Research Fellow in the Department of Statistics at GMU, where he works under the supervision of Prof. Vidyashankar. Prior to that, he obtained a Double Degree MSc. in Mathematics at the University of Trento (Italy) and the University of Tübingen (Germany). He obtained a Cotutelle PhD. in Mathematics and Statistics at the University of Trento (Italy) and the University of Cantabria (Spain). His research interests are in depth functions and their applications to machine learning, empirical processes, branching processes, and branching random walks.

Event Organizers

Nicholas Rios

David Kepplinger

Mining Your Fantasy: From Professor to Fugitive

Dennis Peng

Journalist, True Voice of Taiwan

Date: Friday, September 22, 2023

Abstract

The Statistics and Systems Engineering and Operations Research departments invite you to a seminar given by Dennis Peng. Dennis Peng will be discussing his academic and career journey: from becoming a professor, to a TV talk show host, and more.*

About the Speaker

Dennis Peng was an Assistant Professor and Director of the Graduate School of Journalism at Taiwan University from 1995 to 2015. He was the CEO of Hakka TV from 2004 to 2005. He also hosted programs on Next TV and FTV for several years.

Event Organizers

Nicholas Rios

David Kepplinger

*Note that the contents of this talk solely reflect the views of the speaker and do not reflect the views of George Mason University or its faculty.

Exact Conditional Independence Testing and Conformal Inference with Adaptively Collected Data

Lucas Janson

Associate Professor

Department of Statistics

Harvard University

Date: Friday, September 15, 2023

Abstract

Randomization testing is a fundamental method in statistics, enabling inferential tasks such as testing for (conditional) independence of random variables, constructing confidence intervals in semiparametric location models, and constructing (by inverting a permutation test) model-free prediction intervals via conformal inference. Randomization tests are exactly valid for any sample size, but their use is generally confined to exchangeable data. Yet in many applications, data is routinely collected adaptively via, e.g., (contextual) bandit and reinforcement learning algorithms or adaptive experimental designs. In this paper we present a general framework for randomization testing on adaptively collected data (despite its non-exchangeability) that uses a weighted randomization test, for which we also present computationally tractable resampling algorithms for various popular adaptive assignment algorithms, data-generating environments, and types of inferential tasks. Finally, we demonstrate via a range of simulations the efficacy of our framework for both testing and confidence/prediction interval construction. The relevant paper is https://arxiv.org/abs/2301.05365.
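For readers unfamiliar with the exchangeable baseline that the talk generalizes, here is a minimal sketch of an ordinary (unweighted) permutation test of independence; the weighted randomization test in the paper extends this idea to adaptively collected data. The statistic choice and all names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def permutation_independence_test(x, y, n_perm=2000, seed=0):
    """Permutation test of independence between x and y using the
    absolute sample correlation as the test statistic. Exactly valid
    when the data are exchangeable."""
    rng = np.random.default_rng(seed)
    stat = abs(np.corrcoef(x, y)[0, 1])
    perm_stats = np.array([
        abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        for _ in range(n_perm)
    ])
    # Add 1 to numerator and denominator for finite-sample validity.
    return (1 + np.sum(perm_stats >= stat)) / (1 + n_perm)

rng = np.random.default_rng(42)
x = rng.standard_normal(300)
y_dep = x + 0.5 * rng.standard_normal(300)   # dependent on x
y_ind = rng.standard_normal(300)             # independent of x

print(permutation_independence_test(x, y_dep))  # small p-value
print(permutation_independence_test(x, y_ind))
```

Under adaptive collection, uniform permutations are no longer equally likely conditional on the observed data, which is why the weighted variant is needed.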

About the Speaker

Lucas Janson is an Associate Professor of Statistics and Affiliate in Computer Science at Harvard University, where he studies high-dimensional inference and statistical machine learning.

Event Organizers

David Kepplinger

Nicholas Rios

Data Science at the Singularity

David Donoho

Professor

Department of Statistics

Stanford University

Date: Friday, September 1, 2023

Abstract

A purported “AI Singularity” has been much in the public eye recently, especially since the release of ChatGPT last November, spawning social media “AI Breakthrough” threads promoting Large Language Model (LLM) achievements. Alongside this, mass media and national political attention have focused on “AI Doom” hawked by social media influencers, with Twitter personalities invited to tell congresspersons about the coming "End Times."

In my opinion, “AI Singularity” is the wrong narrative; it drains time and energy with pointless speculation. We do not yet have general intelligence, we have not yet crossed the AI singularity, and the remarkable public reactions signal something else entirely. 

Something fundamental to science really has changed in the last ten years. In certain fields which practice Data Science according to three principles I will describe, progress is simply dramatically more rapid than in those fields that don’t yet make use of it.

Researchers in the adhering fields are living through a period of very profound transformation, as they make a transition to frictionless reproducibility. This transition markedly changes the rate of spread of ideas and practices, and marks a kind of singularity, because it affects mindsets and paradigms and erases memories of much that came before.  Many phenomena driven by this transition are misidentified as signs of an AI singularity. Data Scientists should understand what's really happening and their driving role in these developments.

About the Speaker

David Donoho is a Professor of Statistics at Stanford University. Among his many accomplishments, he received the COPSS Presidents' Award (1994), the John von Neumann Prize (2001, Society for Industrial and Applied Mathematics), the Norbert Wiener Prize in Applied Mathematics (2010, from SIAM and the AMS), the Shaw Prize in Mathematics (2013), the Gauss Prize from the IMU (2018), and, most recently, the IEEE Jack S. Kilby Signal Processing Medal (2022). His research interests include large-scale covariance matrix estimation, large-scale matrix denoising, detection of rare signals, compressed sensing, and empirical deep learning.

Event Organizers

Nicholas Rios

Real-time Discriminant Analysis in the Presence of Label and Measurement Noise

Mia Hubert

Professor

Department of Mathematics, section of Statistics and Data Science

KU Leuven

Date: Friday, August 25, 2023

Abstract

Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label and measurement noise, which often impairs the model's predictive ability. Robust estimators of location and scatter are resistant to this type of contamination; however, they have a prohibitive computational cost for large-scale industrial experiments. First, we present a novel QDA method based on a real-time robust algorithm. Second, we introduce the classmap, a graphical display to visualize aspects of the classification results and to identify label and measurement noise.
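As a simplified illustration of the classical QDA rule that the talk robustifies, the sketch below fits per-class empirical means and covariances and classifies by the quadratic discriminant score. This is the non-robust baseline; the talk's method would substitute real-time robust location and scatter estimates. All names are illustrative.

```python
import numpy as np

def qda_fit(X, y):
    """Fit QDA: per-class mean, covariance, and prior
    (empirical estimates; not robust to contamination)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False),
                     len(Xc) / len(X))
    return params

def qda_predict(X, params):
    """Assign each row to the class maximizing the quadratic
    discriminant score (Gaussian log-density plus log-prior)."""
    classes = sorted(params)
    scores = []
    for c in classes:
        mu, S, prior = params[c]
        d = X - mu
        maha = np.einsum("ij,jk,ik->i", d, np.linalg.inv(S), d)
        _, logdet = np.linalg.slogdet(S)
        scores.append(-0.5 * (maha + logdet) + np.log(prior))
    return np.array(classes)[np.argmax(scores, axis=0)]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(3, 0.5, (150, 2))])
y = np.repeat([0, 1], 150)
params = qda_fit(X, y)
print(qda_predict(np.array([[0.1, 0.2], [3.0, 2.8]]), params))
```

A single mislabeled or grossly mismeasured training point can shift the empirical mean and inflate the covariance here, distorting the decision boundary; that sensitivity is the motivation for the robust variant in the talk.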

About the Speaker

Mia Hubert is a professor at KU Leuven, Department of Mathematics, Section of Statistics and Data Science. Her research focuses on robust statistics, outlier detection, data visualization, depth functions, and the development of statistical software. She is an elected fellow of the ISI and has served as associate editor for several journals, such as JCGS, CSDA, and Technometrics. She is co-founder and organizer of the Rousseeuw Prize for Statistics, a new biennial prize that awards pioneering work in statistical methodology.

Event Organizers

Nicholas Rios