List of Seminars
Randomization Tests: The Forgotten Component of the Randomized Clinical Trial
William F. Rosenberger
University Professor of Statistics, George Mason University
Date: Friday, December 3, 2021
Abstract
“…The customary test for an observed difference… is based on an enumeration of the probabilities, on the initial hypothesis that two treatments do not differ in their effects,… of all the various results which would occur if the trial were repeated indefinitely with different random samples of the same size as those actually used.”
Peter Armitage, 1954
Randomization has been the hallmark of the clinical trial since Sir Bradford Hill introduced it in the 1948 streptomycin trial. An exploration of the early literature yields three rationales: (1) the incorporation of randomization provides unpredictability in treatment assignments, thereby mitigating selection bias; (2) randomization tends to ensure comparability in the treatment groups on known and unknown confounders (at least asymptotically); and (3) the act of randomization itself provides a basis for inference when random sampling is not conducted from a population model. Of these three, rationale (3) is often forgotten, ignored, or left untaught.
Today, randomization is a rote exercise, scarcely considered in protocols or medical journal articles. “Randomization was done by Excel” is a standard sentence that serves to check the box that investigators specify how they conducted the randomization. Yet the literature of the last century is rich with statistical articles on randomization methods and their consequences, authored by some of the greats of the biostatistics and statistics world. In this talk, we review some of this literature and describe very simple methods to rectify some of this oversight. We describe how randomization-based inference can be used for virtually any outcome of interest in a clinical trial.
We conclude that randomization matters!
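As an illustration of rationale (3), the following is a minimal sketch of a Monte Carlo re-randomization test. It is not taken from the talk. It assumes a two-arm trial with arm labels "A" and "B", a difference in means as the test statistic, and a random allocation rule with fixed group sizes, so that shuffling the assignment vector regenerates the design. In practice the re-randomization step should mimic whatever procedure the trial actually used (for example, permuted blocks). The function name and parameters are hypothetical.

```python
import random
from statistics import mean


def randomization_test(outcomes, assignments, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo randomization test for a difference in means.

    Assumption: treatments were assigned by a random allocation rule with
    fixed group sizes, so reshuffling the labels reproduces the design.
    """
    rng = random.Random(seed)

    def diff_in_means(labels):
        arm_a = [y for y, t in zip(outcomes, labels) if t == "A"]
        arm_b = [y for y, t in zip(outcomes, labels) if t == "B"]
        return mean(arm_a) - mean(arm_b)

    observed = diff_in_means(assignments)
    shuffled = list(assignments)
    extreme = 0
    for _ in range(n_resamples):
        # Outcomes stay fixed under the null; only the labels are re-drawn.
        rng.shuffle(shuffled)
        if abs(diff_in_means(shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    y = [10, 11, 12, 13, 14, 15, 1, 2, 3, 4, 5, 6]
    t = ["A"] * 6 + ["B"] * 6
    print(randomization_test(y, t))
```

Because the reference distribution comes from the assignment mechanism rather than from a population model, the same loop works for any outcome type: only the test statistic inside `diff_in_means` needs to change.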
Optimal Relevant Subset Designs
Adam Lane
Assistant Professor, University of Cincinnati Department of Pediatrics
Date: Friday, November 19, 2021
Abstract
Fisher (1934, Proc. of the Roy. Soc. 144:285-307) argued that certain ancillary statistics form a relevant subset, a subset of the sample space to which inference should be restricted, and showed that conditioning on their observed values reduces the dimension of the data without a loss of information. The use of ancillary statistics in post-data inference has received significant attention; however, their role in the design of experiments has not been well characterized. Ancillary statistics are unknown prior to data collection and as a result cannot be incorporated into the design a priori. Conversely, in sequential experiments the ancillary statistics based on the data from the preceding observations are known and can be used to determine the design assignment of the current observation. The main results of this work describe the benefits of incorporating ancillary statistics, specifically, the ancillary statistic that constitutes a relevant subset, into adaptive designs.
Copula-Based Bivariate for Poisson Time Series Models
Norou Diawara
Professor, Department of Mathematics & Statistics, Old Dominion University
Date: Friday, November 5, 2021
Abstract
The class of bivariate integer-valued time series models is gaining rapid popularity. However, its efficiency and adaptability are being challenged because of zero-inflation of count time series (ZITS) and algorithm techniques. In this presentation, the bivariate copula is presented with ZITS. The computational algorithm is proposed via copula theory. Each series follows a Markov chain, with the serial dependence captured using copula-based transition probabilities with Poisson and zero-inflated Poisson margins. Copula theory is also used to capture bivariate ZITS, where the dependence between the two series is modeled using bivariate Gaussian or t-copula functions. Likelihood-based inference is used to estimate the model parameters for simulated and real data, with the bivariate integrals of the Gaussian or t-copula functions being evaluated using standard randomized Monte Carlo methods.
About the speaker
Dr. Norou Diawara is Professor in the Mathematics and Statistics Department at Old Dominion University. His current research interests are in Multivariate and Functional Data Analysis, Modeling, Probability Theory and its Applications in Biostatistics and Time Series. His work is applied to Discrete Choice modeling, Spatio-temporal models, and Statistical Pattern recognition using copulas. He has been collaborating with researchers in Engineering, Health Sciences, Oceanography, and Psychology. He provides support in statistical design and methodology studies and in size/power calculations for research activities, and has served as a collaborative investigator on applied grants and study implementation.
Accounting for Overdiagnosis in Estimating Components of Survival Time in Randomized Cancer Screening Trials
Karen Kafadar
Commonwealth Professor and Chair, Department of Statistics, University of Virginia
Date: Friday, October 29, 2021
Abstract
Cancer screening is assumed to be beneficial, in terms of reduced mortality and extended survival. Survival is often measured as the time between clinical detection of disease and endpoint (cure or death). When the disease is screen-detected, survival has two additional components: lead time (time by which the screening test advances the time of clinical diagnosis) and benefit time (extended survival time if the screen detection is beneficial). All three components are affected by two effects: length-biased sampling (slow-growing cases are more likely to be screen-detected than fast-growing ones) and overdiagnosis (cases that are screen-detected but would never have surfaced clinically in the absence of screening). We quantify both effects in this talk and illustrate their non-trivial impacts on the results from actual randomized cancer screening trials.
(This work is performed in collaboration with Dr. Philip C. Prorok, former Chief of the Biometry Research Group, National Cancer Institute.)
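The length-biased sampling effect described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not from the talk: each case has a preclinical detectable period (called the sojourn time here) that is exponentially distributed, onset times are uniform over a long window, and a single perfectly sensitive screen occurs at a fixed time. All names and parameter values are hypothetical.

```python
import random
from statistics import mean


def simulate_length_bias(n_cases=200_000, mean_sojourn=2.0,
                         window=100.0, screen_time=50.0, seed=1):
    """Compare mean sojourn time of all cases with that of screen-detected cases.

    Assumptions (illustrative only): exponential sojourn times, uniform onset
    over [0, window], and one screen at screen_time that detects every case
    currently in its preclinical period.
    """
    rng = random.Random(seed)
    all_sojourns = []
    detected_sojourns = []
    for _ in range(n_cases):
        onset = rng.uniform(0.0, window)
        sojourn = rng.expovariate(1.0 / mean_sojourn)
        all_sojourns.append(sojourn)
        # A case is screen-detected only if the screen falls inside its
        # preclinical period, which favors long (slow-growing) cases.
        if onset <= screen_time < onset + sojourn:
            detected_sojourns.append(sojourn)
    return mean(all_sojourns), mean(detected_sojourns)


if __name__ == "__main__":
    overall, detected = simulate_length_bias()
    print(f"mean sojourn, all cases:       {overall:.2f}")
    print(f"mean sojourn, screen-detected: {detected:.2f}")
```

Under these assumptions the chance of detection is roughly proportional to the sojourn time, so the screen-detected cases are systematically slower-growing than the population of cases as a whole, even though the screen confers no benefit in the simulation.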
Tensor Quantile Regression for Neuroimage Study of Human Intelligence
Heping Zhang
Susan Dwight Bliss Professor of Biostatistics, Yale University School of Public Health
Date: Friday, October 15, 2021
Abstract
Human intelligence is usually measured by well-established psychometric tests through a series of problem-solving tasks. The recorded cognitive scores are continuous but usually heavy-tailed with potential outliers, violating the normality assumption. Meanwhile, magnetic resonance imaging provides an unparalleled opportunity to study brain structures and cognitive ability. Motivated by association studies between MRI images and human intelligence, we propose a tensor quantile regression model, which is a general and robust alternative to the commonly used scalar-on-image linear regression. Moreover, we take into account rich spatial information of brain structures, incorporating low-rankness and piece-wise smoothness of imaging coefficients into a regularized regression framework. We formulate the optimization problem as a sequence of penalized quantile regressions with a generalized Lasso penalty based on tensor decomposition, and develop a computationally efficient alternating direction method of multipliers algorithm to estimate the model components. Extensive numerical studies are conducted to examine the empirical performance of the proposed method and its competitors. Finally, we apply the proposed method to an important large-scale dataset, the Human Connectome Project. We find that the tensor quantile regression can serve as a prognostic tool to assess future risk of cognitive impairment progression. More importantly, with the proposed method, we are able to identify the most activated brain subregions associated with quantiles of human intelligence. The prefrontal and anterior cingulate cortex are found to be mostly associated with lower and upper quantiles of fluid intelligence. The insular cortex, associated with the median of fluid intelligence, is a rarely reported region.
This is joint work with Cai Li, currently a postdoctoral associate at the Department of Biostatistics, Yale University School of Public Health.
About the speaker
Dr. Zhang has published over 340 research articles and monographs in theory and applications of statistical methods and in several areas of biomedical research including epidemiology, genetics, child and women's health, mental health, substance use, and reproductive medicine. He directed a training program in mental health research that was funded by the NIMH. He directs the Collaborative Center for Statistics in Science that coordinates the Reproductive Medicine Network to evaluate treatment effectiveness for infertility. He is a fellow of the American Statistical Association and a fellow of the Institute of Mathematical Statistics. He was named the 2008 Myrto Lefkopoulou distinguished lecturer by Harvard School of Public Health and a Medallion Lecturer by the Institute of Mathematical Statistics. In 2011, he received the Royan International Award on Reproductive Health. Dr. Zhang was the president of the International Chinese Statistical Association in 2019. He serves as the editor of the Journal of the American Statistical Association - Applications and Case Studies. He was selected to deliver the 2022 Neyman Lecture by the Institute of Mathematical Statistics.
Learning from a large number of chi-squared tests
Inchi Hu
George Mason University
Date: Friday, October 1, 2021
Abstract
Efron (2011) investigated the merits and limitations of an empirical Bayes method to correct selection bias based on Tweedie's formula, first reported in Robbins (1956). The exceptional virtue of Tweedie's formula for the normal distribution lies in its representation of selection bias as a simple function of the derivative of the log marginal likelihood. Since the marginal likelihood and its derivative can be estimated from the data directly without specifying the prior distribution, bias correction can be carried out conveniently. We propose a Bayesian hierarchical model for chi-squared data such that the resulting Tweedie's formula has the same virtue as that of the normal distribution. Because the family of noncentral chi-squared distributions, the common alternative distributions for chi-squared tests, does not constitute an exponential family, our results cannot be obtained by extending existing results. Furthermore, the corresponding Tweedie's formula manifests new phenomena quite different from those of the normal data and suggests new ways to analyze chi-squared data. Two real-data examples are discussed: gene expression difference among ethnic groups and higher-order interaction of gene expression in breast cancer metastasis. This is joint work with Lilun Du.
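For reference, the normal-case Tweedie's formula that the abstract builds on can be written as follows. The notation is an assumption introduced here for illustration and does not come from the abstract: z is an observed statistic, mu its unknown mean, g an unspecified prior, and m the marginal density of z.

```latex
% Assumed notation (not from the abstract): \varphi is the standard normal density.
\[
  z \mid \mu \sim N(\mu, \sigma^2), \qquad \mu \sim g, \qquad
  m(z) = \int \frac{1}{\sigma}\,
         \varphi\!\left(\frac{z-\mu}{\sigma}\right) g(\mu)\, d\mu
\]
\[
  E[\mu \mid z] \;=\; z + \sigma^2 \, \frac{d}{dz} \log m(z)
\]
```

The second term is the bias correction. It depends on the prior only through the marginal density m, which is why it can be estimated directly from a large collection of observed z-values.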
Real-time sufficient dimension reduction through principal least squares support vector machines
Yuexiao Dong
Associate Professor of Statistical Science, Fox School of Business, Temple University
Date: Friday, September 24, 2021
Abstract
We propose a real-time approach for sufficient dimension reduction. Compared with popular sufficient dimension reduction methods including sliced inverse regression and principal support vector machines, the proposed principal least squares support vector machines approach enjoys better estimation of the central subspace. Furthermore, this new proposal can be used in the presence of streamed data for quick real-time updates. It is demonstrated through simulations and real data applications that our proposal performs better and faster than existing algorithms in the literature.
About the speaker
Yuexiao Dong is Associate Professor and Gilliland Research Fellow in the Department of Statistical Science, Fox School of Business, Temple University. Dr. Dong obtained his Ph.D. from the Pennsylvania State University in 2009. Dr. Dong’s primary research focus is on sufficient dimension reduction and high-dimensional data analysis. His research has been published in statistical journals such as The Annals of Statistics, Journal of the American Statistical Association, and Biometrika. His proposal “New Developments in Sufficient Dimension Reduction” has been funded by the National Science Foundation.
Dr. Dong's other research interests include machine learning and business analytics. His collaborative work has been published in Journal of Machine Learning Research, IEEE Transactions on Information Theory, Pattern Recognition, and Journal of Product Innovation Management. Dr. Dong has served as an Associate Editor for the Journal of Systems Science and Complexity since 2015.
Supervised Network Centrality Estimation and Prediction
Linda Zhao
Professor of Statistics and Data Science, University of Pennsylvania
Date: Friday, September 17, 2021
Abstract
Directed networks play a ubiquitous and crucial role in our lives and have implications for individuals' behavior and outcomes. A node's position in the network, usually captured by its centrality, is an important intermediary of network effects, and is often incorporated in regression models to elucidate the effect of the network on an outcome variable of interest. In empirical studies, researchers often adopt a two-stage procedure to evaluate the network effect: first estimate the centrality from the observed network, and then employ the estimated centrality in regression. Despite the prevalent adoption of this two-stage procedure, it fails to incorporate the observational errors from the observed network and lacks valid inference. We first propose a unified inferential framework that combines the network error model and the regression on centrality model, under which we prove the shortcoming of the two-stage procedure in estimating the centrality and demonstrate the consequent undesirable effect in the outcome regression. We then propose a novel supervised network centrality estimation and prediction (SuperCENT) methodology that simultaneously combines the information from the two essential models. The proposed method always provides superior estimates of the centrality and the true underlying network over the two-stage procedure, and produces better network effect estimation and more accurate outcome prediction when the observational error of the network is severe. We further derive the distribution of the centrality and network effect for both SuperCENT and the two-stage procedure, which can be used to construct valid confidence intervals. Our model is applied to predict the currency risk premium based on the centrality of the global trade network. We show that a trading strategy based on centralities estimated by SuperCENT yields a return three times as high as that of the two-stage method.
Joint work with Cai, J., Yang, D., Zhu, W. and Shen, H.
About the speaker
Linda Zhao is a full professor of statistics in the Wharton School. She received her Ph.D. from Cornell in 1993 and joined the University of Pennsylvania in 1994. A fellow of the IMS, Linda has been actively engaged in her academic career. Her specialty falls in modern machine learning methods, replicability in science, network and high dimensional data, housing price prediction, and Bayesian methods. Current projects include equity ownership networks and their relationship to firm performance and innovation activities; identifying signals from noisy data using a non-parametric Bayesian scheme; and model-free data analysis. Her work has won NSF support for over 20 years. For the past five years, she has been developing and teaching a modern data mining course to undergraduate, MBA, Master, and Ph.D. students throughout the entire Penn campus. Students comment that her data mining course is one of the most fun and useful courses offered at Penn. She is also an avid ballroom dancer and she loves to travel around the world.
Algorithmic Robustness in Classification
Jie Shen
Assistant Professor, Department of Computer Science, Stevens Institute of Technology
Date: Friday, September 3, 2021
Abstract
Learning linear classifiers (i.e., halfspaces) is one of the fundamental problems in machine learning, dating back to the 1950s. In the presence of benign label noise such as random classification noise, the problem is well understood. However, when the data are corrupted by more realistic noise, even establishing polynomial-time learnability can be nontrivial. In this talk, I will introduce our recent work on learning with Massart noise and with malicious noise that significantly advances the state of the art. In particular, for the Massart noise where each label is flipped with an unknown probability across the domain, we present the first polynomial-time algorithm that is robust to any noise rate <1/2. For the malicious noise where an adversary may inspect the learning algorithm and inject malicious data, we present the first sample-optimal learning algorithm that achieves information-theoretic noise tolerance. In both works, the developed algorithms are active in nature, and are nearly label-optimal. Finally, I will discuss some important directions such as list-decodable classification, where the majority of the data are contaminated.
About the speaker
Dr. Jie Shen is an Assistant Professor in the Computer Science Department at Stevens Institute of Technology, and is also a faculty member of the Stevens AI Institute. The goal of his research is to understand fundamental limits of learning under real-world constraints such as limited availability of labeled data and the presence of high level noise, and to design efficient algorithms with provable guarantees. His recent works investigate interactive learning from untrusted data, where learning algorithms are involved in data acquisition for optimal data efficiency and robustness. Over the past few years, he has published around 15 papers in top machine learning conferences such as ICML, NeurIPS, and ALT, and has served as senior program committee member for IJCAI, program committee member/journal reviewer for ICML, NeurIPS, COLT, ICLR, AISTATS, AAAI, JMLR, ML, TIT, TPAMI, TSP, PR etc. He obtained his BS degree in Mathematics at Shanghai Jiao Tong University, and completed his Ph.D. in Computer Science at Rutgers University in 2018. He was a visiting scholar at National University of Singapore and Duke University. He received the NSF CRII award in 2020.