Current Search: Research Repository » doctoral thesis » Department of Statistics
Search results
 Title
 Bayesian Analysis of Survival Data with Missing Censoring Indicators and Simulation of Interval Censored Data.
 Creator

Bunn, Veronica, Sinha, Debajyoti, Brownstein, Naomi Chana, Slate, Elizabeth H., Linero, Antonio Ricardo, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In some large clinical studies, it may be impractical to give physical examinations to every subject at his or her last monitoring time in order to diagnose the occurrence of an event of interest. This challenge creates survival data with missing censoring indicators, where the probability of missingness may depend on the time of last monitoring. We present a fully Bayesian semiparametric method for such survival data to estimate regression parameters of Cox's proportional hazards model [Cox, 1972]. Simulation studies show that our method performs better than competing methods. We apply the proposed method to data from the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study. Clinical studies often include interval-censored data. We present a method for the simulation of interval-censored data based on Poisson processes. We show that our method gives simulated data that fulfill the assumption of independent interval censoring, and is more computationally efficient than other methods used for simulating interval-censored data.
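The Poisson-process simulation idea in the abstract above can be sketched as follows. The exponential event time, the monitoring rate, and the study horizon are illustrative assumptions, not details from the thesis; only the idea of drawing monitoring times from a Poisson process independent of the event time comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_interval_censored(n, event_rate=1.0, monitor_rate=2.0, tau=5.0):
    """Draw n interval-censored observations. Monitoring times form a
    homogeneous Poisson process on [0, tau], independent of the event time,
    so the resulting intervals satisfy independent interval censoring."""
    data = []
    for _ in range(n):
        t = rng.exponential(1.0 / event_rate)             # latent event time
        m = rng.poisson(monitor_rate * tau)               # number of visits
        visits = np.sort(rng.uniform(0.0, tau, size=m))   # Poisson-process points
        left = np.max(visits[visits < t], initial=0.0)    # last visit before event
        after = visits[visits >= t]
        right = after.min() if after.size else np.inf     # first visit after event
        data.append((left, right))                        # t lies in (left, right]
    return data

intervals = simulate_interval_censored(500)
```

Each pair brackets the latent event time; an infinite right endpoint corresponds to right censoring at the last visit.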
 Date Issued
 2018
 Identifier
 2018_Su_Bunn_fsu_0071E_14742
 Format
 Thesis
 Title
 Bayesian Modeling and Variable Selection for Complex Data.
 Creator

Li, Hanning, Pati, Debdeep, Huffer, Fred W. (Fred William), Kercheval, Alec N., Sinha, Debajyoti, Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

As high-throughput datasets are routinely encountered in complex biological and environmental research, developing novel models and methods for variable selection has received widespread attention. In this dissertation, we address a few key challenges in Bayesian modeling and variable selection for high-dimensional data with complex spatial structures. a) Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the noise. However, such priors encounter computational issues in high dimensions. This has motivated continuous shrinkage priors, which resemble the two-component priors while facilitating computation and interpretability. While such priors are widely used for estimating high-dimensional sparse vectors, selecting a subset of variables remains a daunting task. b) Spatial and spatio-temporal datasets with complex structures are nowadays commonly encountered in scientific fields ranging from atmospheric science and forestry to environmental, biological, and social science. Selecting important spatial variables that have significant influence on the occurrence of events is necessary and essential for providing insights to researchers. Self-excitation, the feature that the occurrence of an event increases the likelihood of more occurrences of the same type of event nearby in time and space, can be found in many natural and social events. Research on modeling data with self-excitation features has drawn increasing interest recently. However, the existing literature on self-exciting models that include high-dimensional spatial covariates is still underdeveloped. c) The Gaussian process is among the most powerful modeling frameworks for spatial data. Its major bottleneck is computational complexity, which stems from the inversion of the dense matrices associated with a Gaussian process covariance.
Hierarchical divide-and-conquer Gaussian process models have been investigated for ultra-large datasets. However, scaling the distributed computing algorithm to handle a large number of subgroups poses a serious bottleneck. In Chapter 2 of this dissertation, we propose a general approach for variable selection with shrinkage priors. The presence of very few tuning parameters makes our method attractive in comparison to ad hoc thresholding approaches. The applicability of the approach is not limited to continuous shrinkage priors; it can be used along with any shrinkage prior. Theoretical properties for near-collinear design matrices are investigated, and the method is shown to have good performance in a wide range of synthetic data examples and in a real data example on selecting genes affecting survival due to lymphoma. In Chapter 3, we propose a new self-exciting model that allows the inclusion of spatial covariates. We develop algorithms that are effective in obtaining accurate estimation and variable selection results in a variety of synthetic data examples. Our proposed model is applied to Chicago crime data, where the influence of various spatial features is investigated. In Chapter 4, we focus on a hierarchical Gaussian process regression model for ultra-high-dimensional spatial datasets. By evaluating the latent Gaussian process on a regular grid, we propose an efficient computational algorithm through circulant embedding. The latent Gaussian process borrows information across multiple subgroups, thereby obtaining more accurate predictions. The hierarchical model and our proposed algorithm are studied through simulation examples.
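The circulant-embedding idea mentioned for Chapter 4 can be illustrated in one dimension. The squared-exponential kernel, grid, and lengthscale below are assumptions for illustration, not the dissertation's setup; the point is that the covariance of a stationary process on a regular grid embeds in a circulant matrix whose eigenvalues come from a single FFT, so sampling costs O(n log n) instead of the O(n^3) of a dense Cholesky factorization.

```python
import numpy as np

rng = np.random.default_rng(0)

def circulant_gp_sample(n, lengthscale=0.1):
    """Sample a stationary Gaussian process on a regular 1D grid via
    circulant embedding (Dietrich-Newsam style). The covariance is embedded
    in a 2n-point circulant matrix whose eigenvalues are the FFT of its
    first row."""
    grid = np.linspace(0.0, 1.0, n, endpoint=False)
    # first row of the circulant embedding of a squared-exponential kernel
    idx = np.arange(2 * n)
    dist = np.minimum(idx, 2 * n - idx) / n        # wrap-around distances
    row = np.exp(-0.5 * (dist / lengthscale) ** 2)
    lam = np.fft.fft(row).real                     # circulant eigenvalues
    lam = np.clip(lam, 0.0, None)                  # guard tiny round-off
    z = rng.standard_normal(2 * n) + 1j * rng.standard_normal(2 * n)
    path = np.fft.fft(np.sqrt(lam / (2 * n)) * z)
    return grid, path.real[:n]                     # keep the first n points

x, f = circulant_gp_sample(256)
```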
 Date Issued
 2017
 Identifier
 FSU_FALL2017_Li_fsu_0071E_14159
 Format
 Thesis
 Title
 Bayesian Models for Capturing Heterogeneity in Discrete Data.
 Creator

Geng, Junxian, Slate, Elizabeth H., Pati, Debdeep, Schmertmann, Carl P., Zhang, Xin, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Population heterogeneity exists frequently in discrete data. Many Bayesian models perform reasonably well in capturing this subpopulation structure. Typically, the Dirichlet process mixture model (DPMM) and a variable dimensional alternative that we refer to as the mixture of finite mixtures (MFM) model are used, as they both have natural byproducts of clustering derived from Polya urn schemes. The first part of this dissertation focuses on a model for the association between a binary response and binary predictors. The model incorporates Boolean combinations of predictors, called logic trees, as parameters arising from a DPMM or MFM. Joint modeling is proposed to solve the identifiability issue that arises when using a mixture model for a binary response. Different MCMC algorithms are introduced and compared for fitting these models. The second part of this dissertation is the application of the mixture of finite mixtures model to community detection problems. Here, the communities are analogous to the clusters in the earlier work. A probabilistic framework that allows simultaneous estimation of the number of clusters and the cluster configuration is proposed. We prove clustering consistency in this setting. We also illustrate the performance of these methods with simulation studies and discuss applications.
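The "natural byproduct of clustering" from a Polya urn scheme that the abstract mentions can be sketched as the Chinese restaurant process underlying the DPMM; the concentration parameter and sample size here are illustrative choices, not values from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_assignments(n, alpha=1.0):
    """Polya-urn (Chinese restaurant process) cluster assignments:
    observation i joins existing cluster k with probability n_k / (i + alpha)
    and starts a new cluster with probability alpha / (i + alpha)."""
    labels = [0]
    for i in range(1, n):
        counts = np.bincount(labels)                     # current cluster sizes
        probs = np.append(counts, alpha) / (i + alpha)   # existing + new cluster
        labels.append(int(rng.choice(len(probs), p=probs)))
    return labels

z = crp_assignments(100)
```

The number of occupied clusters grows with the data, which is what lets the DPMM (and, with a prior on the number of components, the MFM) infer the cluster count rather than fix it.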
 Date Issued
 2017
 Identifier
 FSU_2017SP_Geng_fsu_0071E_13791
 Format
 Thesis
 Title
 A Bayesian Wavelet Based Analysis of Longitudinally Observed Skewed Heteroscedastic Responses.
 Creator

Baker, Danisha S. (Danisha Sharice), Chicken, Eric, Sinha, Debajyoti, Harper, Kristine, Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Unlike many current statistical models focusing on highly skewed longitudinal data, we present a novel model accommodating a skewed error distribution, a partial linear median regression function, a nonparametric wavelet expansion, and serial observations on the same unit. Parameters are estimated via a semiparametric Bayesian procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We use a hierarchical mixture model as the prior for the wavelet coefficients. For the "vanishing" coefficients, the model includes a level-dependent prior probability mass at zero. This practice implements wavelet coefficient thresholding as a Bayesian rule. Practical advantages of our method are illustrated through a simulation study and via analysis of a cardiotoxicity study of children of HIV-infected mothers.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Baker_fsu_0071E_14036
 Format
 Thesis
 Title
 Building a Model Performance Measure for Examining Clinical Relevance Using Net Benefit Curves.
 Creator

Mukherjee, Anwesha, McGee, Daniel, Hurt, Myra M., Slate, Elizabeth H., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

ROC curves are often used to evaluate the predictive accuracy of statistical prediction models. This thesis studies other measures, which incorporate not only the statistical but also the clinical consequences of using a particular prediction model. Depending on the disease and population under study, the misclassification costs of false positives and false negatives vary. Decision Curve Analysis (DCA) takes this cost into account by using the threshold probability (the probability above which a patient opts for treatment). Using the DCA technique, a net benefit curve is built by plotting "net benefit", a function of the expected benefit and expected harm of using a model, against the threshold probability. Only the threshold probability range that is relevant to the disease and the population under study is used to plot the net benefit curve, so as to obtain optimal results from a particular statistical model. This thesis concentrates on constructing a summary measure to find which predictive model yields the highest net benefit. The most intuitive approach is to calculate the area under the net benefit curve. We examined whether using weights, such as the estimated empirical distribution of the threshold probability, to compute a weighted area under the curve creates a better summary measure. Real data from multiple cardiovascular research studies, the Diverse Population Collaboration (DPC) datasets, are used to compute the summary measures: area under the ROC curve (AUROC), area under the net benefit curve (ANBC), and weighted area under the net benefit curve (WANBC). The results of the analysis are used to compare these measures, to examine whether they agree with each other, and to determine which would be best to use in specified clinical scenarios. For different models, the summary measures and their standard errors (SE) were calculated to study the variability in the measures.
Meta-analysis is used to summarize the estimated summary measures and reveal whether there is significant variability among the studies.
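The quantities described above can be sketched with the standard decision-curve formula from the DCA literature (net benefit = TP/n − FP/n · pt/(1 − pt)); the toy outcomes, risk scores, and threshold grid below are invented for illustration and are not the thesis's data or code.

```python
import numpy as np

def net_benefit(y, p, pt):
    """Net benefit at threshold probability pt (standard DCA formula):
    true-positive rate minus false-positive rate weighted by the odds pt/(1-pt)."""
    treat = p >= pt                      # patients whose risk exceeds the threshold
    n = len(y)
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)

def anbc(y, p, thresholds):
    """Area under the net benefit curve over a clinically relevant
    threshold range, via the trapezoidal rule."""
    nb = np.array([net_benefit(y, p, t) for t in thresholds])
    return float(np.sum((nb[1:] + nb[:-1]) / 2 * np.diff(thresholds)))

# toy outcomes and predicted risks, for illustration only
y = np.array([1, 1, 0, 0, 1, 0])
p = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.1])
area = anbc(y, p, np.linspace(0.05, 0.5, 10))
```

A weighted variant (the WANBC of the abstract) would multiply the integrand by an estimated density of the threshold probability before integrating.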
 Date Issued
 2018
 Identifier
 2018_Sp_Mukherjee_fsu_0071E_14350
 Format
 Thesis
 Title
 Comparative mRNA Expression Analysis Leveraging Known Biochemical Interactions.
 Creator

Steppi, Albert Joseph, Zhang, Jinfeng, Sang, Qing-Xiang, Wu, Wei, Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

We present two studies incorporating existing biological knowledge into differential gene expression analysis, attempting to place the results within a broader biological context. The studies investigate breast cancer health disparities between ethnic groups by comparing gene expression levels in tumor samples from patients from different ethnic populations. We incorporate existing knowledge by making comparisons not just between individual genes, but between sets of related genes and networks of interacting genes. In the first study, a comparison is made between mRNA expression patterns in Asian and Caucasian American breast cancer samples in an attempt to better understand why there are significantly lower breast cancer incidence and mortality rates in Asian Americans compared to Caucasian Americans. In the second study, the expression levels of genes related to drug and xenobiotic metabolizing enzymes (DXME) are compared between African, Asian, and Caucasian American breast cancer patients. The expression of genes related to these enzymes has been found to significantly affect drug clearance and the onset of drug resistance. Both studies found differentially expressed genes and pathways that may be associated with health disparities between the three ethnic populations. A thorough investigation of the literature was made in order to understand the context in which these differences in gene expression could affect the development and progression of breast tumors, and to identify genes and pathways that may be differentially expressed between the ethnic groups in general but not associated with breast cancer. Many of the relevant differences in gene expression were found to be linked to factors such as diet and differences in body composition. The process of finding relevant pathways and sets of interacting genes to inform comparative mRNA expression analysis can be laborious and time-consuming.
The literature is expanding at an exponential rate, and there is little hope for research groups to keep up with all of the latest research. It is becoming more common for journals to require authors to make their results available in public databases, but many results concerning biochemical interactions are only accessible in unstructured text. Extracting relationships and interactions from the biological literature using techniques from machine learning and natural language processing is an important and growing field of research. To gain a better understanding of this field, we participated in the BioCreative VI Track 4 challenge, which involved classifying PubMed abstracts that contain examples of protein-protein interactions affected by a mutation. We discuss the model we developed and the lessons learned while participating in the competition. The problem of acquiring sufficient quantities of quality labeled data is a great obstacle to improving performance. We present a web application we are developing to streamline the annotation of entity-entity interactions in text. It makes use of a database of known interactions to locate passages that are likely to be relevant, and offers a simple and concise user interface to minimize the cognitive burden on the annotator.
 Date Issued
 2018
 Identifier
 2018_Sp_Steppi_fsu_0071E_14522
 Format
 Thesis
 Title
 Elastic Functional Principal Component Analysis for Modeling and Testing of Functional Data.
 Creator

Duncan, Megan, Srivastava, Anuj, Klassen, E., Huffer, Fred W., Wu, Wei, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Statistical analysis of functional data requires tools for comparing, summarizing, and modeling observed functions as elements of a function space. A key issue in Functional Data Analysis (FDA) is the presence of phase variability in the observed data. A successful statistical model of functional data has to account for the presence of phase variability; otherwise the ensuing inferences can be inferior. Recent methods for FDA include steps for phase separation or functional alignment. For example, Elastic Functional Principal Component Analysis (Elastic FPCA) uses the strengths of Functional Principal Component Analysis (FPCA), along with tools from Elastic FDA, to perform joint phase-amplitude separation and modeling. A related problem in FDA is to quantify and test for the amount of phase variability in given data. We develop two types of hypothesis tests for the significance of phase variability: a metric-based approach and a model-based approach. The metric-based approach treats phase and amplitude as independent components and uses their respective metrics to apply the Friedman-Rafsky test, Schilling's nearest neighbors test, and the energy test to the differences between functions and their amplitudes. In the model-based test, we use concordance correlation coefficients as a tool to quantify the agreement between functions and their reconstructions using FPCA and Elastic FPCA. We demonstrate this framework on a number of simulated and real datasets, including weather, Tecator, and growth data.
 Date Issued
 2018
 Identifier
 2018_Sp_Duncan_fsu_0071E_14470
 Format
 Thesis
 Title
 Elastic Functional Regression Model.
 Creator

Ahn, Kyungmin, Srivastava, Anuj, Klassen, E., Wu, Wei, Huffer, Fred W., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Functional variables serve important roles as predictors in a variety of pattern recognition and vision applications. Focusing on a specific subproblem, termed scalar-on-function regression, most current approaches adopt the standard L2 inner product to form a link between functional predictors and scalar responses. These methods may perform poorly when predictor functions contain nuisance phase variability, i.e., when predictors are temporally misaligned due to noise. While a simple solution could be to pre-align predictors as a preprocessing step before applying a regression model, this alignment is seldom optimal from the perspective of regression. In this dissertation, we propose a new approach, termed elastic functional regression, where alignment is included in the regression model itself and is performed in conjunction with the estimation of other model parameters. This model is based on a norm-preserving warping of predictors, not the standard time warping of functions, and provides better prediction in situations where the shape or the amplitude of the predictor is more useful than its phase. We demonstrate the effectiveness of this framework using simulated and real data.
 Date Issued
 2018
 Identifier
 2018_Sp_Ahn_fsu_0071E_14452
 Format
 Thesis
 Title
 An Examination of the Relationship between Alcohol and Dementia in a Longitudinal Study.
 Creator

Hu, Tingting, McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The high mortality rate and huge expenditure caused by dementia make it a pressing concern for public health researchers. Among the potential risk factors in diet and nutrition, the relation between alcohol usage and dementia has been investigated in many studies, but no clear picture has emerged: the association has been reported as protective, neurotoxic, U-shaped, and insignificant in different sources. An individual's alcohol usage is dynamic and can change over time; however, to our knowledge, only one study has taken this time-varying nature into account when assessing the association between alcohol intake and cognition. Using Framingham Heart Study (FHS) data, our work fills an important gap in that both alcohol use and dementia status were included in the analysis longitudinally. Furthermore, we incorporated a gender-specific categorization of alcohol consumption. In this study, we examined three aspects of the association: (1) concurrent alcohol usage and dementia, longitudinally; (2) past alcohol usage and later dementia; (3) cumulative alcohol usage and dementia. The data consisted of 2,192 FHS participants who took Exams 17-23 during 1981-1996, which included dementia assessment, and who had complete data on alcohol use (mean follow-up = 40 years) and key covariates. Cognitive status was determined using information from the Mini-Mental State Examinations (MMSE) and the examiner's assessment. Alcohol consumption was determined in oz/week and also categorized as none, moderate, and heavy. We investigated both total alcohol consumption and consumption by type of alcoholic beverage. Results showed that the association between alcohol and dementia may differ by gender and by beverage type.
 Date Issued
 2018
 Identifier
 2018_Su_Hu_fsu_0071E_14330
 Format
 Thesis
 Title
 Examining the Effect of Treatment on the Distribution of Blood Pressure in the Population Using Observational Data.
 Creator

Kucukemiroglu, Saryet Alexa, McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Since the introduction of antihypertensive medications in the mid-1950s, there has been increasing use of blood pressure medications in the US. The growing use of antihypertensive treatment has affected the distribution of blood pressure in the population over time, so observational data no longer reflect natural blood pressure levels. Our goal is to examine the effect of antihypertensive drugs on distributions of blood pressure using several well-known observational studies. The statistical concept of censoring is used to estimate the distribution of blood pressure in populations if no treatment were available. The treated and estimated untreated distributions are then compared to determine the general effect of these medications in the population. Our analyses show that these drugs have an increasing impact on controlling blood pressure distributions in populations that are heavily treated.
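One way to read the censoring idea in the abstract: a treated person's measured blood pressure is a lower bound on what it would have been without treatment, so treated measurements can be handled as right-censored and the untreated distribution estimated Kaplan-Meier style. The data, variable names, and exact estimator below are invented for illustration; the dissertation's own censoring approach may differ.

```python
import numpy as np

def km_survival(values, observed):
    """Kaplan-Meier sketch: `observed=True` marks an exact (untreated)
    measurement; `observed=False` marks a treated subject, whose reading is
    treated as a right-censored lower bound on the untreated value.
    Returns (value, estimated P(untreated BP > value)) pairs."""
    values = np.asarray(values, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    order = np.argsort(values)
    values, observed = values[order], observed[order]
    surv, curve = 1.0, []
    at_risk = len(values)
    for v, obs in zip(values, observed):
        if obs:
            surv *= 1.0 - 1.0 / at_risk   # drop only at exact observations
        curve.append((v, surv))
        at_risk -= 1                      # censored subjects leave the risk set
    return curve

# toy systolic readings; False = on antihypertensive treatment
sbp       = [118, 126, 131, 142, 150, 160]
untreated = [True, True, False, True, False, True]
est = km_survival(sbp, untreated)
```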
 Date Issued
 2017
 Identifier
 FSU_FALL2017_Kucukemiroglu_fsu_0071E_14275
 Format
 Thesis
 Title
 Fused Lasso and Tensor Covariance Learning with Robust Estimation.
 Creator

Kunz, Matthew Ross, She, Yiyuan, Stiegman, Albert E., Mai, Qing, Chicken, Eric, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Advances in computation and data storage have enabled the collection of vast amounts of information from scientific measurement devices. With this increase in data and the variety of domain applications, however, statistical methodology must be tailored to specific problems. This dissertation focuses on analyzing chemical information with an underlying structure. Robust fused lasso leverages information about the neighboring regression coefficient structure to create blocks of coefficients. Robust modifications are made to the mean to account for gross outliers in the data. This method is applied to near-infrared spectral measurements for predicting an aqueous analyte concentration and is shown to improve prediction accuracy. The robust estimation and structure analysis are extended by examining graph structures within a clustered tensor. The tensor is subjected to wavelet smoothing and robust sparse precision matrix estimation for a detailed look into the covariance structure. This methodology is applied to catalytic kinetics data, where the graph structure estimates the elementary steps within the reaction mechanism.
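The "blocks of coefficients" mentioned above come from the standard fused lasso objective, sketched below; this is the classical (non-robust) form with made-up data, not the robust variant developed in the thesis. The fusion penalty on neighboring differences is what favors piecewise-constant coefficient blocks.

```python
import numpy as np

def fused_lasso_objective(beta, X, y, lam1, lam2):
    """Classical fused lasso objective: squared loss plus a sparsity penalty
    on the coefficients plus a fusion penalty on neighboring differences."""
    loss = 0.5 * np.sum((y - X @ beta) ** 2)
    sparsity = lam1 * np.sum(np.abs(beta))           # lasso penalty
    fusion = lam2 * np.sum(np.abs(np.diff(beta)))    # penalizes jumps
    return loss + sparsity + fusion

# toy design: blocky coefficients pay a smaller fusion penalty
X = np.eye(4)
y = np.array([1.0, 1.0, 0.0, 0.0])
flat  = np.array([1.0, 1.0, 0.0, 0.0])   # one jump
rough = np.array([1.0, 0.0, 1.0, 0.0])   # same sparsity, three jumps
```

With lam1 = lam2 = 0.1, the blocky `flat` fits `y` exactly and incurs one jump, while `rough` pays both a larger loss and a larger fusion penalty.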
 Date Issued
 2018
 Identifier
 2018_Fall_Kunz_fsu_0071E_14844
 Format
 Thesis
 Title
 Generalized Mahalanobis Depth in Point Process and Its Application in Neural Coding and SemiSupervised Learning in Bioinformatics.
 Creator

Liu, Shuyi, Wu, Wei, Wang, Xiaoqiang, Zhang, Jinfeng, Mai, Qing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In the first project, we propose to generalize the notion of depth to temporal point process observations. The new depth is defined as a weighted product of two probability terms: 1) the number of events in each process, and 2) the center-outward ranking on the event times conditioned on the number of events. In this study, we adopt the Poisson distribution for the first term and the Mahalanobis depth for the second term. We propose an efficient bootstrapping approach to estimate the parameters in the defined depth. In the case of a Poisson process, the observed events are order statistics, and the parameters can be estimated robustly with respect to sample size. We demonstrate the use of the new depth by ranking realizations from a Poisson process. We also test the new method in classification problems using simulations as well as real neural spike train data. The new framework is found to provide more accurate and robust classifications than commonly used likelihood methods. In the second project, we demonstrate the value of semi-supervised dimension reduction in a clinical setting. The advantage is easy to understand: semi-supervised dimension reduction uses unlabeled data to perform dimension reduction and can help build a more precise prediction model than common supervised dimension reduction techniques. After thorough comparison with embedding methods that use labeled data only, we show the improvement from adding unlabeled data in a breast cancer chemotherapy application. In our semi-supervised dimension reduction method, we explore not only adding unlabeled data to linear dimension reduction, such as PCA, but also semi-supervised nonlinear dimension reduction, such as semi-supervised LLE and semi-supervised Isomap.
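A minimal sketch of the two-term depth described above, with the weight and the exact combination chosen for illustration rather than taken from the dissertation: a Poisson pmf for the event count, times the Mahalanobis depth of the event times given that count, using the known mean and covariance of uniform order statistics for a homogeneous Poisson process on a fixed window.

```python
import math
import numpy as np

def process_depth(times, rate, horizon, w=0.5):
    """Depth of one point-process realization on [0, horizon]: a weighted
    product of (1) the Poisson probability of the event count and (2) the
    Mahalanobis depth of the event times conditioned on that count."""
    times = np.sort(np.asarray(times, dtype=float))
    n = len(times)
    lam = rate * horizon
    count_term = math.exp(-lam) * lam**n / math.factorial(n)
    # Given n events, the times behave as n ordered uniforms on [0, horizon]:
    # E[T_(k)] = horizon*k/(n+1); Cov from the order-statistics formula.
    k = np.arange(1, n + 1)
    mu = horizon * k / (n + 1)
    cov = horizon**2 * np.minimum.outer(k, k) * (n + 1 - np.maximum.outer(k, k)) \
          / ((n + 1) ** 2 * (n + 2))
    d2 = (times - mu) @ np.linalg.solve(cov, times - mu)
    time_term = 1.0 / (1.0 + d2)          # classical Mahalanobis depth
    return count_term**w * time_term**(1 - w)

d = process_depth([0.2, 0.5, 0.9], rate=3.0, horizon=1.0)
```

Realizations whose event times sit at the conditional means get the largest time term, so central realizations rank deeper than extreme ones.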
 Date Issued
 2018
 Identifier
 2018_Sp_Liu_fsu_0071E_14367
 Format
 Thesis
 Title
 High Level Image Analysis on Manifolds via Projective Shapes and 3D Reflection Shapes.
 Creator

Lester, David T. (David Thomas), Patrangenaru, Victor, Liu, Xiuwen, Barbu, Adrian G. (Adrian Gheorghe), Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Shape analysis is a widely studied topic in modern statistics with important applications in areas such as medical imaging. Here we focus on two-sample hypothesis testing for both finite and infinite extrinsic mean shapes of configurations. First, we present a test for equality of mean projective shapes of 2D contours based on rotations. Second, we present a test for mean 3D reflection shapes based on the Schoenberg mean. We apply these tests to footprint data (contours), clamshells (3D reflection shapes) and human facial configurations extracted from digital camera images. We also present the method of MANOVA on manifolds and apply it to face data extracted from digital camera images. Finally, we present a new statistical tool called anti-regression.
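The extrinsic mean underlying tests like those above can be illustrated on the simplest manifold, the unit sphere: embed the sample in Euclidean space, average, and project back onto the manifold. This is a generic sketch of an extrinsic mean, not the Schoenberg-mean computation used in the dissertation.

```python
import numpy as np

def extrinsic_mean_sphere(points):
    # points: (n, d) array of unit vectors on the sphere S^{d-1}.
    # Extrinsic mean: the Euclidean average projected back onto the sphere
    # (well defined whenever the average is nonzero).
    avg = np.asarray(points, dtype=float).mean(axis=0)
    norm = np.linalg.norm(avg)
    if norm == 0:
        raise ValueError("extrinsic mean undefined: Euclidean mean is zero")
    return avg / norm
```

For a tight cluster of directions, the extrinsic mean lies near the cluster's center; for antipodal data the Euclidean average can vanish and the mean is undefined, which is the usual caveat for extrinsic means.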
 Date Issued
 2017
 Identifier
 FSU_2017SP_Lester_fsu_0071E_13856
 Format
 Thesis
 Title
 Influence Measures for Bayesian Data Analysis.
 Creator

De Oliveira, Melaine C. (Melaine Cristina), Sinha, Debajyoti, Panton, Lynn B., Bradley, Jonathan R., Linero, Antonio Ricardo, Lipsitz, Stuart, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Identifying influential observations in the data is desired to ensure proper inference and statistical analysis. Modern methods to identify influential cases use cross-validation diagnostics based on the effect that deleting the ith observation has on inference. A popular approach uses the Kullback-Leibler divergence between the posterior distribution of the parameter of interest given the full data and the posterior distribution given the cross-validated data, where the cross-validated data has the ith observation removed. Although, in Bayesian inference, the posterior distribution contains all the relevant information about a parameter of interest, when the goal is prediction the predictive distribution should perhaps be used to identify influential observations. We therefore extend our method to the comparison of the posterior predictive distributions given the full data and the cross-validated data. We generalize and extend existing popular Bayesian cross-validated influence diagnostics using a Bregman divergence based measure (BD). We derive useful properties of these BD measures of the influence of each observation on the posterior distribution, and we show that they extend to the predictive distribution. We show that these BD measures allow interpretable calibration and that they can be computed via Markov chain Monte Carlo (MCMC) samples from a single posterior based on the full data. We illustrate how our new measures of influence have more useful practical roles in data analysis than popular Bayesian residual analysis tools (CPO) in an example of meta-analysis with binary response and in other cases of interval-censored data.
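The case-deletion divergence idea can be made concrete in a conjugate model, where both the full-data and leave-one-out posteriors are Gaussian and the Kullback-Leibler divergence (the special Bregman case mentioned above) is closed-form. This is a toy illustration, not the BD methodology of the dissertation; the normal-normal model, known variance, and prior settings below are assumptions.

```python
import numpy as np

def posterior_normal(y, sigma2, m0, s2_0):
    # Conjugate posterior for mu: prior N(m0, s2_0), data y_i ~ N(mu, sigma2).
    n = len(y)
    s2_n = 1.0 / (1.0 / s2_0 + n / sigma2)
    m_n = s2_n * (m0 / s2_0 + np.sum(y) / sigma2)
    return m_n, s2_n

def kl_normal(m1, v1, m2, v2):
    # KL( N(m1, v1) || N(m2, v2) ) in closed form.
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def deletion_influence(y, sigma2=1.0, m0=0.0, s2_0=100.0):
    # KL divergence between the full-data posterior and each
    # leave-one-out posterior: large values flag influential cases.
    m_full, v_full = posterior_normal(y, sigma2, m0, s2_0)
    out = []
    for i in range(len(y)):
        m_loo, v_loo = posterior_normal(np.delete(y, i), sigma2, m0, s2_0)
        out.append(kl_normal(m_full, v_full, m_loo, v_loo))
    return np.array(out)
```

An outlying observation shifts the posterior mean substantially when deleted, so its divergence dwarfs the others.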
 Date Issued
 2018
 Identifier
 2018_Su_DeOliveira_fsu_0071E_14712
 Format
 Thesis
 Title
 A Matched-Sample-Based Normalization Method: Cross-Platform Microarray and NGS Data Integration.
 Creator

Zhang, Se Rin, Zhang, Jinfeng, Sang, QingXiang, Wu, Wei, Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Utilizing high-throughput gene expression data stored in public archives not only saves research time and cost but also enhances statistical power. However, gene expression profiling data can be obtained from many different technical platforms, and the same gene's expression quantified by different platforms has different distributional properties, which makes data integration across multiple platforms challenging. Several cross-platform normalization methods have been developed to remove the differences caused by platform discrepancy, but they remove important biological signal as well. Zhang and Jiang (2015) introduced a new method that focuses on eliminating the platform effect among systematic effects by employing matched samples, measured on different platforms, to obtain a benchmark model. Since the matched samples have no biological difference, their approach robustly removes only the platform effect. They showed that the new method performs better than the Distance Weighted Discrimination (DWD) method. This paper is a follow-up study of their work, and we attempt to improve the method by incorporating a Fast Linear Mixed Regression (FLMER) model. The results indicate that the FLMER model works better than the originally proposed OLS (Ordinary Least Squares) model in after-normalization concordance comparisons and Differential Expression (DE) analysis. We also compare our methods to other existing cross-platform normalization methods, not only DWD but also the Empirical Bayes, XPN, and GQ methods. The results show that the proposed method performs much better than other cross-platform normalization methods at removing platform differences while keeping the biological information.
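The matched-sample idea can be sketched in its simplest OLS form (an assumption-laden toy version, not the FLMER model of the dissertation): fit a gene-wise regression of platform-A measurements on platform-B measurements using only the matched samples, where any difference is pure platform effect, then apply the fitted map to bring new platform-B samples onto platform A's scale. The noiseless linear platform relationship simulated in the usage example is made up for illustration.

```python
import numpy as np

def matched_sample_normalize(a_matched, b_matched, b_new):
    # Per-gene OLS fit a = alpha + beta * b on matched samples (no biological
    # difference), then map new platform-B data onto platform A's scale.
    # All arrays are (samples, genes).
    b_mean = b_matched.mean(axis=0)
    a_mean = a_matched.mean(axis=0)
    beta = ((b_matched - b_mean) * (a_matched - a_mean)).sum(axis=0) / \
           ((b_matched - b_mean) ** 2).sum(axis=0)
    alpha = a_mean - beta * b_mean
    return alpha + beta * b_new
```

When platform B is an exact affine distortion of platform A, the benchmark regression recovers A-scale values for unmatched B samples exactly.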
 Date Issued
 2018
 Identifier
 2018_Fall_Zhang_fsu_0071E_14868
 Format
 Thesis
 Title
 Multivariate Binary Longitudinal Data Analysis.
 Creator

Alzahrani, Hissah, Slate, Elizabeth H., Wetherby, Amy M., McGee, Daniel, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Longitudinal data analysis plays an important role in many applications today. Longitudinal data arise when measurements are obtained over many occasions, and these measurements have a complicated correlation structure because they are obtained from the same subjects over time. In multivariate longitudinal data there is an additional source of correlation, the outcomes: the data are obtained over time for several outcomes on the same subjects. This arises in many medical, financial and psychological studies; for example, patients' measurements on several variables are recorded over several occasions in order to study the mean changes for these patients. How to generate and analyze this type of data, for both complete and incomplete cases, is the main goal of this dissertation, which consists of three main studies on the analysis of multivariate binary longitudinal data. The first study is a method to generate correlated binary data for a multivariate longitudinal model with a specified correlation structure; this structure allows the correlation to be induced over the outcomes or the occasions. The second study is a comparison of three methods for analyzing multivariate binary longitudinal data, each of which can be beneficial for particular aims; we also investigate the differences among the parameter estimates of the three methods. The third study is an investigation, via simulation, of missing data analysis with GEE models, controlling the correlation over the occasions and outcomes; several methods for handling missing data are used to reduce the bias of the parameter estimates for the incomplete data. These three studies are presented in separate chapters of this dissertation.
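One standard way to generate correlated binary outcomes with a specified correlation structure, in the spirit of the first study, is the dichotomized-Gaussian construction: draw latent multivariate normal vectors with the desired latent correlation (e.g. exchangeable over occasions) and threshold each coordinate to hit its target marginal probability. This is a generic sketch, not necessarily the exact construction in the dissertation; the exchangeable latent correlation and marginals below are illustrative.

```python
import numpy as np
from statistics import NormalDist

def correlated_binary(n, p, latent_corr, seed=0):
    # Dichotomized-Gaussian generator: threshold latent MVN draws so that
    # coordinate j is Bernoulli(p[j]); latent_corr induces the dependence.
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    L = np.linalg.cholesky(latent_corr)        # unit-variance latent marginals
    z = rng.standard_normal((n, len(p))) @ L.T
    thresh = np.array([NormalDist().inv_cdf(1.0 - pj) for pj in p])
    return (z > thresh).astype(int)
```

Note that the resulting binary correlation is attenuated relative to the latent correlation, so in practice the latent matrix is back-solved from the target binary correlations.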
 Date Issued
 2016
 Identifier
 FSU_2017SP_Alzahrani_fsu_0071E_13609
 Format
 Thesis
 Title
 New Approaches of Differential Gene Expression Analysis and Cancer Immune Evasion Mechanism Identification.
 Creator

Liu, Yuhang, Zhang, Jinfeng, Sang, QingXiang, Mai, Qing, She, Yiyuan, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Background: Genomic and epigenomic data analyses have been a popular research area in the 21st century. Common research problems include detecting differentially expressed genes between groups, clustering and classification using genomic data in order to study the heterogeneity of a disease, and dividing a sequence of measurements along a genome into segments to identify different functional regions of the genome. This study gives a comprehensive investigation of the aforementioned tasks, with emphasis on developing new computational methodologies. Normalization is an important data preparation step in gene expression analyses, used to remove systematic noise and thereby reduce sample variance and increase the power of subsequent statistical analyses. On the other hand, variance reduction is made possible by borrowing information across all genes, including differentially expressed genes (DEGs) and outliers, which inevitably introduces bias into the data. A question of interest is how to avoid the inflation of type I error rate and the loss of statistical power incurred by this bias. Breast cancer (BRCA) can escape immune surveillance using 6 known evasion mechanisms, yet the complexity of the combinations of these mechanisms used by subsets of human BRCA patients is not fully understood. In the era of immunotherapy and personalized medication, there is an urgent need to advance knowledge of immune evasion clusters (IEC) in BRCA and to identify reliable biomarkers, which is essential for better understanding patients' response to immunotherapies and for rational clinical trial design of combination immunotherapies. Identification of functionally enriched regions of a genome often requires dividing a sequence of measurements along the genome into segments where adjacent segments have different properties (e.g. mean values). Despite the dozens of algorithms developed to address this issue, accuracy and computational efficiency still need to be improved to tackle both existing and emerging segmentation problems in genomic and epigenomic research.
Results: In chapter 1 of this study we propose a new differential gene expression analysis pipeline, superdelta, that pairs a modified t-test derived from large sample theory with a robust multivariate extension of global normalization designed to minimize the bias introduced by DEGs. In simulation studies, superdelta was compared to four commonly used normalization methods: global, median-IQR, quantile, and cyclic loess normalization, and shown to have better statistical power with tighter type I error control. We then applied all methods to a microarray gene expression dataset on BRCA patients who received neoadjuvant chemotherapy. Superdelta identified marginally more DEGs than its competitors, in addition to the substantial overlap of DEGs identified by all of them. Appropriate adaptations are under active development to extend this framework to RNA-Seq data and more general between-group comparison problems. In chapter 2, we developed a sequential biclustering (SBiC) method based on an existing biclustering approach using the plaid model and applied it to the log2-normalized RNA-Seq data of immune-related genes of BRCA patients from The Cancer Genome Atlas (TCGA). We identified seven clusters covering 81% of the studied samples. We found that 78.8% of these samples evade through TGFβ immunosuppression, 57.75% through DcR3 counterattack, 48% through CTLA4, and 27.8% through PD1. Interestingly, the combination of TGFβ and DcR3 was pronounced in 57.75% of patients, and evasion through DcR3 was exclusive to the lobular invasive subgroup. In addition, triple negative breast cancer (TNBC) patients split equally into 2 clusters: one with impaired antigen presentation and another with high leukocyte recruitment but a combination of 4 evasion mechanisms. We also identified biomarkers that play important roles in distinguishing immune evasion mechanisms. These findings provide a better understanding of patients' response to immunotherapies and shed light on the rational design of novel combination immunotherapies. In chapter 3, we designed an efficient algorithm called iSeg for segmentation of genomic and epigenomic profiles. It first uses dynamic programming to identify candidate significant segments, then uses a novel data structure based on coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during the searching and refinement stages. Merging of significant segments is performed at the end to generate the final set of segments. The algorithm can serve as a general computational framework that works with different model assumptions on the data. As a general procedure, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, and (differential) nuclease sensitivity. We evaluated iSeg using both simulated and experimental datasets and showed that it performs satisfactorily compared with several popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is very computationally efficient and well suited to long sequences and large numbers of input data profiles.
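The baseline pipeline that superdelta modifies can be sketched as follows (an illustrative simplification, not the superdelta procedure itself): globally normalize each sample so that array-level offsets cancel, then run per-gene two-sample Welch tests between groups. The simulated data, shift size, and seed are made up; median shifting stands in for the paper's global normalization.

```python
import numpy as np

def global_median_normalize(X):
    # Shift each sample (column) so its median expression is zero,
    # removing array-level systematic offsets.
    return X - np.median(X, axis=0, keepdims=True)

def welch_t(X, groups):
    # Per-gene Welch t statistics between two groups after normalization.
    # X is (genes, samples); groups is a 0/1 label per sample.
    Xn = global_median_normalize(np.asarray(X, dtype=float))
    g = np.asarray(groups)
    a, b = Xn[:, g == 0], Xn[:, g == 1]
    num = a.mean(axis=1) - b.mean(axis=1)
    den = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                  b.var(axis=1, ddof=1) / b.shape[1])
    return num / den
```

Because the normalization borrows a location estimate across all genes, a strongly differentially expressed gene slightly biases it, which is exactly the bias superdelta is designed to minimize.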
 Date Issued
 2018
 Identifier
 2018_Su_Liu_fsu_0071E_14757_Comp
 Format
 Set of related objects
 Title
 Non-Parametric and Semi-Parametric Estimation and Inference with Applications to Finance and Bioinformatics.
 Creator

Tran, Hoang Trong, She, Yiyuan, Ökten, Giray, Chicken, Eric, Niu, Xufeng, Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this dissertation, we develop tools from nonparametric and semiparametric statistics to perform estimation and inference. In the first chapter, we propose a new method called Non-Parametric Outlier Identification and Smoothing (NOIS), which robustly smooths stock prices, automatically detects outliers and constructs pointwise confidence bands around the resulting curves. In real-world examples of high-frequency data, NOIS successfully detects erroneous prices as outliers and uncovers borderline cases for further study. NOIS can also highlight notable features and reveal new insights in inter-day chart patterns. In the second chapter, we focus on a method for nonparametric inference called empirical likelihood (EL). Computation of EL in the case of a fixed parameter vector is a convex optimization problem easily solved by Lagrange multipliers. In the case of a composite empirical likelihood (CEL) test, where certain components of the parameter vector are free to vary, the optimization problem becomes nonconvex and much more difficult. We propose a new algorithm for the CEL problem named the BI-Linear Algorithm for Composite EmPirical Likelihood (BICEP). We extend the BICEP framework by introducing a new method called Robust Empirical Likelihood (REL) that detects outliers and greatly improves the inference in comparison to the non-robust EL. The REL method is combined with CEL by the TRI-Linear Algorithm for Composite EmPirical Likelihood (TRICEP). We demonstrate the efficacy of the proposed methods on simulated and real-world datasets. In the final chapter we present a novel semiparametric method for variable selection with interesting biological applications. In bioinformatics datasets the experimental units often have structured relationships that are nonlinear and hierarchical; for example, in microbiome data the individual taxonomic units are connected to each other through a phylogenetic tree. Conventional techniques for selecting relevant taxa either do not account for the pairwise dependencies between taxa, or assume linear relationships. In this work we propose a new framework for variable selection called Semi-Parametric Affinity Based Selection (SPAS), which has the flexibility to utilize structured and nonparametric relationships between variables. In synthetic data experiments SPAS outperforms existing methods, and on real-world microbiome datasets it selects taxa according to their phylogenetic similarities.
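The fixed-parameter EL computation mentioned above reduces, for a scalar mean, to solving one monotone score equation in the Lagrange multiplier. The sketch below is the textbook one-dimensional case, not the BICEP/TRICEP machinery; bisection on the multiplier and the tolerance are implementation choices made here.

```python
import numpy as np

def el_log_ratio(x, mu, tol=1e-12):
    # -2 log empirical likelihood ratio for the mean of 1-D data.
    # EL weights are w_i proportional to 1 / (1 + lam*(x_i - mu)), where lam
    # solves the score equation sum_i (x_i - mu) / (1 + lam*(x_i - mu)) = 0.
    z = np.asarray(x, dtype=float) - mu
    if not (z.min() < 0 < z.max()):
        return np.inf  # mu outside the convex hull of the data
    # The score is strictly decreasing on the interval keeping all weights
    # positive, so bisection finds the unique root.
    lo = -1.0 / z.max() + tol
    hi = -1.0 / z.min() - tol
    score = lambda lam: np.sum(z / (1.0 + lam * z))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * np.sum(np.log1p(lam * z))
```

The statistic is zero at the sample mean (lam = 0), grows as mu moves away, and is infinite outside the data's convex hull, matching the convexity claim in the abstract.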
 Date Issued
 2018
 Identifier
 2018_Sp_Tran_fsu_0071E_14477
 Format
 Thesis
 Title
 Nonparametric Change Point Detection Methods for Profile Variability.
 Creator

Geneus, Vladimir J. (Vladimir Jacques), Chicken, Eric, Liu, Guosheng (Professor of Earth, Ocean and Atmospheric Science), Sinha, Debajyoti, Zhang, Xin (Professor of Engineering), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Because detecting profile changes in devices such as medical apparatus is important, measuring the change point in the variability of a sequence of functions matters. In a sequence of functional observations (each of the same length), we wish to determine as quickly as possible when a change in the observations has occurred. Wavelet-based change point methods are proposed that determine when the variability of the noise in a sequence of functional profiles (e.g. the precision profile of medical devices) goes out of control, from either a known, fixed value or an estimated in-control value. Various methods have been proposed which focus on changes in the form of the function. One method, NEWMA, based on EWMA, focuses on changes in both; its drawback is that the form of the in-control function must be known. Other methods, including χ² charts for Phase I and Phase II, make assumptions about the function. Our interest, however, is in detecting changes in the variance from one function to the next. In particular, we are interested not in differences from one profile to another (variance between), but rather in differences in variance within profiles (variance within). The functional portion of the profiles is allowed to come from a large class of functions and may vary from profile to profile. The estimator is evaluated under a variety of conditions, including allowing the wavelet noise subspace to be substantially contaminated by the profile's functional structure, and is compared to two competing noise monitoring methods. Nikoo and Noorossana (2013) propose a nonparametric wavelet regression method that uses two control chart techniques to monitor the variance: a nonparametric control chart, via the mean of m median control charts, and a parametric control chart, via the χ² distribution. We propose improvements to their method by incorporating prior data and making use of likelihood ratios. Our methods make use of the orthogonal properties of wavelet projections to accurately and efficiently monitor the level of noise from one profile to the next and to detect changes in noise in a Phase II setting. We show through simulation that our proposed methods have better power and are more robust against the confounding effect between variance estimation and function estimation. The proposed methods are shown, through an extensive simulation study, to be very efficient at detecting when the variability has changed. Extensions are considered that explore the use of windowing and of estimated in-control values for the MAD method, and the effect of using the exact distribution under normality rather than the asymptotic distribution. These developments are implemented in the parametric, nonparametric scale, and completely nonparametric settings. The proposed methodologies are tested through simulation, are applicable to various biometric and health related topics, and have the potential to improve computational efficiency and reduce the number of assumptions required.
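The wavelet device underlying this kind of noise monitoring can be illustrated simply (a standard trick, not the dissertation's full Phase II procedure): finest-scale detail coefficients of a smooth profile are dominated by the noise, so a robust scale estimate such as the MAD applied to them recovers the noise standard deviation even when the functional form is unknown and varies between profiles. The Haar filter, MAD constant, and simulated sine profile below are illustrative choices.

```python
import numpy as np

def haar_noise_sigma(profile):
    # Finest-scale Haar detail coefficients d_i = (x_{2i} - x_{2i+1}) / sqrt(2).
    # For a smooth underlying function these are approximately iid N(0, sigma^2),
    # so MAD(d) / 0.6745 is a robust estimate of the noise sigma.
    x = np.asarray(profile, dtype=float)
    m = len(x) // 2
    d = (x[:2 * m:2] - x[1:2 * m:2]) / np.sqrt(2.0)
    return np.median(np.abs(d - np.median(d))) / 0.6745
```

Comparing this estimate from one profile to the next gives a monitoring statistic for the within-profile variance that is largely insensitive to the functional signal itself.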
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Geneus_fsu_0071E_13862
 Format
 Thesis
 Title
 Nonparametric Detection of Arbitrary Changes to Distributions and Methods of Regularization of Piecewise Constant Functional Data.
 Creator

Orndorff, Mark Adam, Chicken, Eric, Liu, Guosheng, Pati, Debdeep, Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Nonparametric statistical methods can refer to a wide variety of techniques. In this dissertation, we focus on two problems in statistics which are common applications of nonparametric statistics. The main body of the dissertation focuses on distribution-free process control for detection of arbitrary changes to the distribution of an underlying random variable. A secondary problem, also under the broad umbrella of nonparametric statistics, is the proper approximation of a function. Statistical process control minimizes disruptions to a properly controlled process and quickly terminates out-of-control processes. Although rarely satisfied in practice, strict distributional assumptions are often needed to monitor these processes, and previous models have often focused exclusively on monitoring changes in the mean or variance of the underlying process. The proposed model establishes a monitoring method requiring few distributional assumptions while monitoring all changes in the underlying distribution generating the data. No assumptions on the form of the in-control distribution are made other than independence within and between observed samples. Windowing is employed to reduce the computational complexity of the algorithm as well as to ensure fast detection of changes. Results indicate quicker detection of large jumps than in many previously established methods. It is now common to analyze large quantities of data generated by sensors over time, yet traditional analysis techniques do not incorporate the inherent functional structure often present in this type of data. The second focus of this dissertation is the development of an analysis method for functional data where the range of the function has a discrete, ordinal structure. Use is made of spline-based methods using a piecewise constant function approximation. After a large amount of data reduction is achieved, generalized linear mixed model methodology is employed to model the data.
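Windowed, distribution-free monitoring of arbitrary distributional change can be sketched with a two-sample Kolmogorov-Smirnov statistic (a generic illustration, not the dissertation's exact chart): compare a sliding window of recent observations against an in-control reference sample and signal when the statistic crosses a threshold. The window size and threshold below are illustrative; in practice the threshold would be calibrated to a target in-control run length.

```python
import numpy as np

def ks_stat(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: sup_x |F_a(x) - F_b(x)|,
    # evaluated over the pooled sample points via the empirical CDFs.
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    fa = np.searchsorted(a, grid, side="right") / len(a)
    fb = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(fa - fb))

def monitor(stream, reference, window=50, threshold=0.5):
    # Slide a window over the stream; flag the first index at which the
    # windowed KS statistic against the reference crosses the threshold.
    for end in range(window, len(stream) + 1):
        if ks_stat(stream[end - window:end], reference) > threshold:
            return end - 1
    return None
```

Because the KS statistic responds to any difference between distributions, this flags shifts in shape or spread as well as in the mean, which is the point of the distribution-free formulation above.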
 Date Issued
 2017
 Identifier
 FSU_2017SP_Orndorff_fsu_0071E_13820
 Format
 Thesis
 Title
 On the Statistical Modeling of Count Data in High Dimensions.
 Creator

Tang, Shao, She, Yiyuan, Ökten, Giray, McGee, Daniel, Niu, Xufeng, Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Count data are ubiquitous in modern statistical applications, and how to model such data remains a challenging task in machine learning. In this study, we consider various aspects of statistical modeling of Poisson count data. Concerned with the computational burden of maximum likelihood estimation of the mean, we revisit classical iterative proportional scaling and propose a set of methods that achieve computational scalability in high-dimensional applications, with regularized extensions for feature selection. In order to capture association effects in multivariate count data, we utilize the tools of non-Gaussian graph learning, and we perform comprehensive empirical studies on synthetic and real-world data to demonstrate their power. Based on the concept of data depth, we investigate a nonparametric approach for modeling multivariate data, utilizing modern optimization techniques to provide scalable algorithms for high-dimensional depth and depth median computations. Real-world examples are given to show the effectiveness of the proposed methods.
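Classical iterative proportional scaling, which the first part revisits, can be sketched in its simplest two-way form: alternately rescale the rows and columns of a positive seed table until both sets of target margins are matched. This is the textbook algorithm, not the scalable variants proposed in the dissertation; the iteration cap and tolerance are implementation choices.

```python
import numpy as np

def ipf(seed, row_margins, col_margins, iters=100, tol=1e-10):
    # Classical iterative proportional fitting: alternately rescale the rows
    # and columns of a positive seed table to match the target margins
    # (which must share the same grand total).
    X = np.asarray(seed, dtype=float).copy()
    for _ in range(iters):
        X *= (row_margins / X.sum(axis=1))[:, None]   # match row sums
        X *= col_margins / X.sum(axis=0)              # match column sums
        if np.allclose(X.sum(axis=1), row_margins, atol=tol):
            break
    return X
```

Each sweep is a simple elementwise rescaling, which is why the procedure is attractive as a starting point for the high-dimensional, regularized extensions described above.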
 Date Issued
 2018
 Identifier
 2018_Su_Tang_fsu_0071E_14680
 Format
 Thesis
 Title
 Regression Methods for Skewed and Heteroscedastic Response with High-Dimensional Covariates.
 Creator

Wang, Libo, Sinha, Debajyoti, Taylor, Miles G., Pati, Debdeep, She, Yiyuan, Yang, Yun (Professor of Statistics), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The rise of studies with high-dimensional potential covariates has invited a renewed interest in dimension reduction that promotes more parsimonious models, ease of interpretation and computational tractability. However, current variable selection methods for continuous response often assume a Gaussian response for methodological as well as theoretical developments. In this thesis, we consider regression models that induce sparsity, gain prediction power, and accommodate response distributions beyond the Gaussian with common variance. The first part of this thesis is a transform-both-sides Bayesian variable selection model (TBS) which allows skewness, heteroscedasticity and extremely heavy-tailed responses. Our method develops a framework that facilitates computationally feasible inference in spite of inducing non-local priors on the original regression coefficients. Even though the transformed conditional mean is no longer linear in the covariates, we still prove the consistency of our Bayesian TBS estimators. Simulation studies and real data analysis demonstrate the advantages of our methods. Another main part of this thesis addresses the above challenges from a frequentist standpoint. This model incorporates a penalized likelihood to accommodate skewed response arising from an epsilon-skew-normal (ESN) distribution. With suitable optimization techniques to handle this two-piece penalized likelihood, our method demonstrates substantial gains in sensitivity and specificity even in high-dimensional settings. We conclude this thesis with a novel Bayesian semiparametric modal regression method, along with its implementation and simulation studies.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Wang_fsu_0071E_13950
 Format
 Thesis
 Title
 Robust Function Registration Using Depth on the Phase Variability.
 Creator

Cleveland, Jason D., Wu, Wei, Klassen, E. (Eric), Srivastava, Anuj, Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In the field of functional data analysis, registration remains a fundamental problem. Registration must also take into consideration which underlying template is chosen as the "center" for alignment purposes and the hurdles that come with each choice. In this dissertation we cover the registration of temporal observations with both compositional and additive noises present under a mean template, and the registration of temporal observations under a median template. Our first project covers an adaptation of the Fisher-Rao framework that can yield a consistent estimator when both noises are present. Our second project covers a similar registration but uses data depth. The adapted Fisher-Rao method gives a mean template, and the data depth method gives a median. As in standard statistics, we wish to explore outlier removal, appropriate template usage, and pattern classification. Various frameworks have been developed over the past two decades in which registrations are conducted based on optimal time warping between functions. Comparison of functions solely based on time warping, however, may have limited application, in particular when certain constraints are desired in the registration. In the first project, we study registration with a norm-preserving constraint. A closely related problem is signal estimation, where the goal is to estimate the ground-truth template given random observations with both compositional and additive noises. We propose to adopt the Fisher-Rao framework to compute the underlying template, and mathematically prove that this framework leads to a consistent estimator. We then illustrate the constrained Fisher-Rao registration using simulations as well as two real data sets. We find that the constrained method is robust with respect to additive noise and has superior alignment and classification performance compared to conventional, unconstrained registration methods.
Recently, statistical methodologies have been developed and extended to deal with functional observations. The notion of depth has advanced the ways in which we can view these functional observations, providing an intuitive center-outward ordering scheme and median signal template estimation. However, functional observations often carry noise. Thus, we propose a semiparametric model and an algorithm that yields a consistent estimator for the underlying median template. We also propose a new band-based boxplot methodology for outlier detection and removal. We illustrate the robustness of this depth-based registration using simulations as well as two real data sets.
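The center-outward ordering that depth provides can be made concrete with a small sketch. Below is a minimal modified-band-depth computation in the spirit of the functional-depth literature; it illustrates the general idea only, not the dissertation's estimator, and the function name and toy curves are assumptions:

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """Modified band depth of each curve in an (n, T) sample of curves.

    For every pair (i, j), score each curve by the fraction of time points
    at which it lies inside the pointwise band of curves i and j, then
    average over pairs. Deeper curves are more central; the deepest is a
    natural median-template candidate.
    """
    n = curves.shape[0]
    depth = np.zeros(n)
    for i, j in combinations(range(n), 2):
        lo = np.minimum(curves[i], curves[j])
        hi = np.maximum(curves[i], curves[j])
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)

t = np.linspace(0.0, 1.0, 50)
# Four nearby curves plus one gross outlier (vertical shifts of a sinusoid).
curves = np.stack([np.sin(2 * np.pi * t) + c for c in (-0.2, -0.1, 0.0, 0.1, 5.0)])
depth = modified_band_depth(curves)
print(depth.argmax())  # 2 -- the central curve is the deepest
```

A band-based boxplot then flags curves whose depth falls below a cutoff; here the shifted outlier is far less deep than the central curve.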
 Date Issued
 2017
 Identifier
 FSU_2017SP_Cleveland_fsu_0071E_13838
 Format
 Thesis
 Title
 Scalable and Structured High Dimensional Covariance Matrix Estimation.
 Creator

Sabnis, Gautam, Pati, Debdeep, Kercheval, Alec N., Sinha, Debajyoti, Chicken, Eric, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

With rapid advances in data acquisition and storage techniques, modern scientific investigations in epidemiology, genomics, imaging, and networks are increasingly producing challenging data structures in the form of high-dimensional vectors, matrices, and multiway arrays (tensors), rendering traditional statistical and computational tools inappropriate. One hope for meaningful inferences in such situations is to discover an inherent lower-dimensional structure that explains the physical or biological process generating the data. The structural assumptions impose constraints that force the objects of interest to lie in lower-dimensional spaces, thereby facilitating their estimation and interpretation while, at the same time, reducing the computational burden. The assumption of an inherent structure, motivated by various scientific applications, is often adopted as the guiding light in the analysis and is fast becoming a standard tool for parsimonious modeling of such high-dimensional data structures. This thesis is specifically directed towards the methodological development of statistical tools, with attractive computational properties, for drawing meaningful inferences through such structures. The third chapter of this thesis proposes a distributed computing framework, based on a divide-and-conquer strategy and hierarchical modeling, to accelerate posterior inference for high-dimensional Bayesian factor models. Our approach distributes the task of high-dimensional covariance matrix estimation to multiple cores, solves each subproblem separately via a latent factor model, and then combines these estimates to produce a global estimate of the covariance matrix. Existing divide-and-conquer methods focus exclusively on dividing the total number of observations n into subsamples while keeping the dimension p fixed.
Our approach is novel in this regard: it includes all n samples in each subproblem and, instead, splits the dimension p into smaller subsets for each subproblem. The subproblems themselves can be challenging to solve when p is large due to the dependencies across dimensions. To circumvent this issue, a novel hierarchical structure is specified on the latent factors that allows for flexible dependencies across dimensions while still maintaining computational efficiency. Our approach is readily parallelizable and is shown to yield computational gains of several orders of magnitude in comparison to fitting a full factor model. The fourth chapter of this thesis proposes a novel way of estimating a covariance matrix that can be represented as a sum of a low-rank matrix and a diagonal matrix. The proposed method compresses high-dimensional data, computes the sample covariance in the compressed space, and lifts it back to the ambient space via a decompression operation. A salient feature of our approach, relative to the existing literature on combining sparsity and low-rank structures in covariance matrix estimation, is that we do not require the low-rank component to be sparse. A principled framework for estimating the compressed dimension using Stein's Unbiased Risk Estimation theory is also developed. In the final chapter of this thesis, we tackle the problem of variable selection in high dimensions. Consistent model selection in high dimensions has received substantial interest in recent years and is an extremely challenging problem for Bayesians. The literature on model selection with continuous shrinkage priors is even less developed due to the unavailability of exact zeros in the posterior samples of the parameters of interest. Heuristic methods based on thresholding the posterior mean, which lack theoretical justification, are often used in practice, and inference is highly sensitive to the choice of the threshold.
We aim to address the problem of selecting variables through a novel method of post-processing the posterior samples.
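The compress / estimate / decompress recipe described above can be sketched in a few lines. This is an illustrative reconstruction under assumed choices (a Gaussian random compression matrix, a fixed compressed dimension k, and lifting via the same map); the thesis additionally estimates k via Stein's Unbiased Risk Estimation, which is not shown:

```python
import numpy as np

def compressed_covariance(X, k, seed=0):
    """Sketch of the compress -> estimate -> decompress idea.

    X is an (n, p) data matrix and k < p a compressed dimension. The data
    are compressed with a random Gaussian map, the sample covariance is
    computed in the k-dimensional space, and the estimate is lifted back
    to the ambient p x p space with the same map.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Phi = rng.standard_normal((p, k)) / np.sqrt(k)  # random compression map
    Z = X @ Phi                                     # compressed data, (n, k)
    S_k = np.cov(Z, rowvar=False)                   # covariance in compressed space
    return Phi @ S_k @ Phi.T                        # lift back to p x p

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
S_hat = compressed_covariance(X, k=10)
print(S_hat.shape)  # (50, 50)
```

The lifted estimate is symmetric by construction; the point of the sketch is only the order of operations, not the thesis's actual decompression operator.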
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Sabnis_fsu_0071E_14043
 Format
 Thesis
 Title
 Scalable Nonconvex Optimization Algorithms: Theory and Applications.
 Creator

Wang, Zhifeng, She, Yiyuan, Beaumont, Paul M., Niu, Xufeng, Zhang, Jinfeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Modern statistical problems often involve minimizing objective functions that are not necessarily convex or smooth. In this study, we are devoted to developing scalable algorithms for nonconvex optimization with statistical guarantees. We first investigate a broad surrogate framework defined by generalized Bregman divergence functions for developing scalable algorithms. Local linear approximation, mirror descent, iterative thresholding, and DC programming can all be viewed as particular instances. The Bregman re-characterization enables us to choose suitable measures of computational error to establish global convergence rate results even for nonconvex problems in high-dimensional settings. Moreover, under some regularity conditions, the sequence of iterates in Bregman surrogate optimization can be shown to approach the statistical truth within the desired accuracy geometrically fast. The algorithms can be accelerated with a careful control of relaxation and step-size parameters. Simulation studies are performed to support the theoretical results. An important application of nonconvex optimization is robust estimation. Outliers widely occur in big-data and high-dimensional applications and may severely affect statistical estimation and inference. A framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming but explicitly includes outlyingness parameters for all samples, which greatly facilitates computation, theory, and parameter tuning. To address the issues of nonconvexity and nonsmoothness, we develop scalable algorithms with ease of implementation and guaranteed fast convergence. In particular, we introduce a new means to alleviate the requirement on the initial value, such that on regular datasets the number of data resamplings can be substantially reduced.
Moreover, based on combined statistical and computational treatments, we are able to develop new tools for nonasymptotic robust analysis for a general loss. The obtained estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low and high dimensions. Experiments in regression, classification, and neural networks show the excellent performance of the proposed methodology in robust parameter estimation and variable selection.
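Iterative thresholding, named above as one instance of the surrogate framework, can be illustrated with the classical iterative soft-thresholding algorithm (ISTA) for the lasso. This is a textbook special case used only to make the idea concrete, not the thesis's Bregman-surrogate algorithm; the step size and penalty level below are assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal map of the l1 penalty (the thresholding step)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, step, n_iter=1000):
    """Iterative soft-thresholding for the lasso:
    minimize (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n  # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
beta_hat = ista(X, y, lam=0.1, step=0.05)
print(np.flatnonzero(np.abs(beta_hat) > 0.5))  # should recover the first 3 coordinates
```

Each iteration minimizes a simple surrogate (quadratic plus penalty) around the current iterate, which is exactly the pattern the Bregman framework generalizes.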
 Date Issued
 2018
 Identifier
 2018_Su_Wang_fsu_0071E_14775
 Format
 Thesis
 Title
 Semi-Parametric Generalized Estimating Equations with Kernel Smoother: A Longitudinal Study in Financial Data Analysis.
 Creator

Yang, Liu, Niu, Xufeng, Cheng, Yingmei, Huffer, Fred W. (Fred William), Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Longitudinal studies are widely used in various fields, such as public health, clinical trials, and financial data analysis. A major challenge in longitudinal studies is repeated measurements from each subject, which cause time-dependent correlation within subjects. Generalized Estimating Equations can deal with correlated outcomes in longitudinal data through marginal effects. Our model is based on Generalized Estimating Equations with a semiparametric approach, providing a flexible structure for regression models: coefficients for parametric covariates are estimated, and nuisance covariates are fitted with kernel smoothers for the nonparametric part. The profile kernel estimator and the seemingly unrelated kernel estimator (SUR) are used to deliver consistent and efficient semiparametric estimators compared to parametric models. We provide simulation results for estimating semiparametric models with one or multiple nonparametric terms. In the application part, we focus on the financial market: credit card loan data with payment information for each customer across 6 months are used to investigate whether gender, income, age, or other factors significantly influence payment status. Furthermore, we propose model comparisons to evaluate whether our model should be fitted separately for different levels of factors, such as male and female, or based on different types of estimating methods, such as parametric or semiparametric estimation.
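The kernel-smoothing component referred to above can be illustrated with a minimal Nadaraya-Watson estimator (a standard kernel smoother, not the profile or SUR estimators themselves; the Gaussian kernel and bandwidth are assumptions):

```python
import numpy as np

def nw_smoother(x_train, y_train, x_eval, bandwidth):
    """Nadaraya-Watson estimate of E[y | x] with a Gaussian kernel:
    a weighted average of the y's, with weights decaying in |x_eval - x|."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 300)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(300)
fit = nw_smoother(x, y, np.array([0.25, 0.75]), bandwidth=0.05)
print(fit)  # close to sin(2*pi*x) at the two points, i.e. roughly [1, -1]
```

In the semiparametric setting, a smoother of this kind absorbs the nuisance covariates while the parametric coefficients are estimated by the estimating equations.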
 Date Issued
 2017
 Identifier
 FSU_FALL2017_YANG_fsu_0071E_14219
 Format
 Thesis
 Title
 Semiparametric Bayesian Regression Models for Skewed Responses.
 Creator

Bhingare, Apurva Chandrashekhar, Sinha, Debajyoti, Shanbhag, Sachin, Linero, Antonio Ricardo, Bradley, Jonathan R., Pati, Debdeep, Lipsitz, Stuart, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

It is common to encounter skewed response data in medicine, epidemiology, and health care studies. Methodology needs to be devised to overcome the natural difficulties that occur in analyzing such data, particularly when it is multivariate. Existing Bayesian statistical methods for dealing with skewed data are mostly fully parametric. We propose novel semiparametric Bayesian methods to model and analyze such data. These methods make minimal assumptions about the true form of the distribution and structure of the observed data. Through examples from real-life studies, we demonstrate the practical advantages of our semiparametric Bayesian methods over the existing methods. For many real-life studies with skewed multivariate responses, the level of skewness and the association structure assumptions are essential for evaluating the covariate effects on the response and its predictive distribution. First, we present a novel semiparametric multivariate model class leading to a theoretically justifiable semiparametric Bayesian analysis of multivariate skewed responses. Like the multivariate Gaussian densities, this multivariate model is closed under marginalization, allows a wide class of multivariate associations, and has meaningful physical interpretations of skewness levels and covariate effects on the marginal density. Compared to existing models, our model enjoys several desirable practical properties, including Bayesian computing via available software and assurance of consistent Bayesian estimates of parameters and the nonparametric error density under a set of plausible prior assumptions. We introduce a particular parametric version of the model as an alternative to various parametric skew-symmetric models available in the literature. We illustrate the practical advantages of our methods over existing parametric alternatives via application to a clinical study to assess periodontal disease and through a simulation study.
Unlike most of the models existing in the literature, this class of models advocates a latent-variable approach, making implementation under the Bayesian paradigm straightforward via standard software for MCMC computation such as WinBUGS/JAGS. Although JAGS and WinBUGS are flexible MCMC engines, they tend to be rather slow for complex model structures. We offer an alternative tool to implement the aforementioned parametric version of the models using PROC MCMC in SAS. Our goal is to facilitate and encourage more extensive implementation of these models. To achieve this goal, we illustrate the implementation using PROC MCMC in SAS via examples from real life and provide fully annotated SAS code. In large-scale national surveys, we often come across skewed data as well as semicontinuous data, that is, data characterized by a point mass at zero (degenerate) and a right-skewed continuous distribution on positive support. For example, in the Medical Expenditure Panel Survey (MEPS), the variable total health care expenditure (i.e., the response) for non-users of health care services is zero, whereas for users it has a continuous distribution typically skewed to the right. We provide an overview of the existing models and methods to analyze such data.
 Date Issued
 2018
 Identifier
 2018_Sp_Bhingare_fsu_0071E_14468
 Format
 Thesis
 Title
 Shape Constrained Single Index Models for Biomedical Studies.
 Creator

Dhara, Kumaresh, Sinha, Debajyoti, Pati, Debdeep, Proudfit, Greg Hajcak, Slate, Elizabeth H., Chicken, Eric, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

For many biomedical, environmental, and economic studies with an unknown nonlinear relationship between the response and its multiple predictors, a single index model provides practical dimension reduction and good physical interpretation. However, widespread use of existing Bayesian analyses for such models is lacking in biostatistics due to some major impediments, including slow mixing of the Markov chain Monte Carlo (MCMC), inability to deal with missing covariates, and a lack of theoretical justification of the rate of convergence. We present a new Bayesian single index model with an associated MCMC algorithm that incorporates an efficient Metropolis-Hastings (MH) step for the conditional distribution of the index vector. Our method leads to a model with good biological interpretation and prediction, implementable Bayesian inference, fast convergence of the MCMC, and a first-time extension to accommodate missing covariates. We also obtain, for the first time, a set of sufficient conditions for the optimal rate of convergence of the overall regression function. We illustrate the practical advantages of our method and computational tool via reanalysis of an environmental study. We also propose frequentist and Bayesian methods for monotone single-index models, using the Bernstein polynomial basis to represent the link function. The monotonicity of the unknown link function creates a clinically interpretable index, along with the relative importance of the covariates on the index. We develop a computationally simple, iterative, profile-likelihood-based method for the frequentist analysis. To ease the computational complexity of the Bayesian analysis, we also develop a novel and efficient Metropolis-Hastings step to sample from the conditional posterior distribution of the index parameters. These methodologies and their advantages over existing methods are illustrated via simulation studies.
These methods are also used to analyze depression-based measures among adolescent girls.
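The Bernstein-basis device for a monotone link can be sketched directly: with nondecreasing basis coefficients, the resulting link function is nondecreasing. The snippet below illustrates that property only (the coefficient values are arbitrary assumptions, and estimation of the coefficients is not shown):

```python
import numpy as np
from math import comb

def bernstein_link(u, coefs):
    """Evaluate g(u) = sum_k coefs[k] * B_{k,m}(u) on [0, 1], where B_{k,m}
    are Bernstein basis polynomials. Nondecreasing coefficients yield a
    nondecreasing (monotone) link -- the device used for interpretability."""
    m = len(coefs) - 1
    u = np.asarray(u, dtype=float)
    basis = np.stack([comb(m, k) * u**k * (1.0 - u)**(m - k) for k in range(m + 1)])
    return coefs @ basis

u = np.linspace(0.0, 1.0, 50)
g = bernstein_link(u, np.array([0.0, 0.2, 0.9, 1.0]))  # nondecreasing coefficients
print(bool(np.all(np.diff(g) >= 0)))  # True: the link is monotone
```

Constraining the coefficient vector to be ordered is what turns an unconstrained basis expansion into a monotone link.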
 Date Issued
 2018
 Identifier
 2018_Su_Dhara_fsu_0071E_14739
 Format
 Thesis
 Title
 Small Area Estimation with Random Effects Selection.
 Creator

Lee, Jiwon, She, Yiyuan, Ökten, Giray, McGee, Daniel, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this study, we propose a robust method with selective shrinkage power for small area estimation with automatic random-effects selection, referred to as SARS. In our proposed model, both fixed effects and random effects are treated as a joint target. In this case, maximizing the joint likelihood of fixed effects and random effects makes more sense than maximizing the marginal likelihood. In practice, the variance of the sampling error and the variance of the modeling error (random effects) are unknown. SARS does not require any prior information about either variance component or the dimensionality of the data. Furthermore, area-specific random effects, which account for additional area variation, are not always necessary in a small area estimation model. From this observation, we can impose sparsity on the random effects by setting them to zero for large areas. This sparsity brings heavy tails, which means that the normality assumption on the random effects is no longer retained. SARS, with its selective and predictive power, employs penalized regression using a nonconvex penalty. To solve the nonconvex problem of SARS, we employ iterative algorithms via a quantile thresholding procedure. The algorithms make use of the iterative selection-estimation paradigm with a variety of techniques, such as progressive screening when tuning parameters, and a multi-start strategy with subsampling and feature-subset methods to generate more effective initial points, enhancing computational efficiency and efficacy. To achieve optimal prediction error under the dimensional relaxation, we propose a new theoretical predictive information criterion for SARS (SARS-PIC), derived from nonasymptotic oracle inequalities using the minimax rate of the ideal predictive risk. Experiments with simulations and real poverty data on school-age (5-17) children demonstrate the efficiency of SARS.
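The quantile thresholding step mentioned above can be illustrated as a hard-thresholding operator that retains a fixed number of the largest random effects. This is a minimal sketch of the generic operator (the function name and the choice of q are assumptions, not the thesis's tuned procedure):

```python
import numpy as np

def quantile_threshold(v, q):
    """Keep the q largest-magnitude entries of v and zero out the rest,
    i.e. a hard threshold placed at an order statistic (quantile) of |v|."""
    out = np.zeros_like(v)
    if q > 0:
        keep = np.argsort(np.abs(v))[-q:]  # indices of the q largest |v_i|
        out[keep] = v[keep]
    return out

v = np.array([0.1, -3.0, 0.4, 2.2, -0.05])
print(quantile_threshold(v, 2))  # keeps only -3.0 and 2.2
```

Applied iteratively to the random-effects vector, an operator of this kind produces exactly the sparsity pattern the abstract describes: zero effects for areas that do not need them.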
 Date Issued
 2017
 Identifier
 FSU_2017SP_Lee_fsu_0071E_13675
 Format
 Thesis
 Title
 Spatial Statistics and Its Applications in Biostatistics and Environmental Statistics.
 Creator

Hu, Guanyu, Huffer, Fred W. (Fred William), Paek, Insu, Sinha, Debajyoti, Slate, Elizabeth H., Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

This dissertation presents some topics in spatial statistics and their applications in biostatistics and environmental statistics. Spatial statistics is an active area of statistics. In Chapter 2 and Chapter 3, the goal is to build subregion models under the assumption that the responses or the parameters are spatially correlated. For regression models, considering spatially varying coefficients is a reasonable way to build subregion models. There are two different techniques for exploring spatially varying coefficients. One is geographically weighted regression (Brunsdon et al. 1998). The other is a spatially varying coefficients model which assumes a stationary Gaussian process for the regression coefficients (Gelfand et al. 2003). Based on the ideas of these two techniques, we introduce techniques for exploring subregion models in survival analysis, an important area of biostatistics. In Chapter 2, we introduce modified versions of the Kaplan-Meier and Nelson-Aalen estimators which incorporate geographical weighting. We use ideas from counting process theory to obtain these modified estimators, to derive variance estimates, and to develop associated hypothesis tests. In Chapter 3, we introduce a Bayesian parametric accelerated failure time model with spatially varying coefficients. These two techniques can explore subregion models in survival analysis using both nonparametric and parametric approaches. In Chapter 4, we introduce Bayesian parametric covariance regression analysis for a response vector. The proposed method defines a regression model between the covariance matrix of a p-dimensional response vector and auxiliary variables. We propose a constrained Metropolis-Hastings algorithm to obtain the estimates. Simulation results are presented to show the performance of both the regression and covariance matrix estimates. Furthermore, a more realistic simulation experiment shows that our Bayesian approach performs better than the MLE.
Finally, we illustrate the usefulness of our model by applying it to the Google Flu data. In Chapter 5, we give a brief summary of future work.
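The geographically weighted Kaplan-Meier idea from Chapter 2 can be sketched as an ordinary product-limit estimator in which each subject contributes a weight to the risk set and event count. The form below is an assumed illustration (with unit weights it reduces to the standard estimator; the dissertation's kernel weights, variance estimates, and tests are not shown):

```python
import numpy as np

def weighted_km(times, events, weights):
    """Product-limit (Kaplan-Meier) survival estimate in which every subject
    carries a weight in the risk set and in the event count. Returns the
    distinct event times and the survival estimate just after each."""
    order = np.argsort(times)
    times, events, weights = times[order], events[order], weights[order]
    surv, out_t, out_s = 1.0, [], []
    for s in np.unique(times[events == 1]):
        at_risk = weights[times >= s].sum()              # weighted risk set
        d = weights[(times == s) & (events == 1)].sum()  # weighted events at s
        surv *= 1.0 - d / at_risk
        out_t.append(s)
        out_s.append(surv)
    return np.array(out_t), np.array(out_s)

t = np.array([2.0, 3.0, 3.0, 5.0, 8.0])
e = np.array([1, 1, 0, 1, 0])           # 1 = event, 0 = censored
tt, ss = weighted_km(t, e, np.ones(5))  # unit weights: ordinary Kaplan-Meier
print(ss)  # survival 0.8, 0.6, 0.3
```

Replacing the unit weights with kernel weights that decay with distance from a location of interest localizes the survival curve to that location, which is the geographical-weighting idea.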
 Date Issued
 2017
 Identifier
 FSU_FALL2017_Hu_fsu_0071E_14205
 Format
 Thesis
 Title
 Statistical Shape Analysis of Neuronal Tree Structures.
 Creator

Duncan, Adam, Srivastava, Anuj, Klassen, E., Wu, Wei, Huffer, Fred W., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Neuron morphology plays a central role in characterizing the cognitive health and functionality of brain structures. The problem of quantifying neuron shapes, and of capturing the statistical variability of shapes, is difficult because axons and dendrites have tree structures that differ in both geometry and topology. In this work, we restrict attention to trees that consist of: (1) a main branch viewed as a parameterized curve in ℝ³, and (2) some number of secondary branches, also parameterized curves in ℝ³, which emanate from the main branch at arbitrary points. We present two shape-analytic frameworks which each give a metric structure to the set of such tree shapes. Both frameworks are based on an elastic metric on the space of curves, with certain shape-preserving nuisance variables modded out. In the first framework, the side branches are treated as a continuum of curve-valued annotations to the main branch. In the second framework, the side branches are treated as discrete entities and are matched to each other by permutation. We show geodesic deformations between tree shapes in both frameworks, and we show Fréchet means and modes of variability, as well as cross-validated classification between different experimental groups using the second framework. We conclude with a smaller project which extends some of these ideas to more general weighted attributed graphs.
 Date Issued
 2018
 Identifier
 2018_Sp_Duncan_fsu_0071E_14500
 Format
 Thesis
 Title
 A Study of Some Issues of Goodness-of-Fit Tests for Logistic Regression.
 Creator

Ma, Wei, McGee, Daniel, Mai, Qing, Levenson, Cathy W., Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Goodness-of-fit tests are important for assessing how well a model fits a set of observations. The Hosmer-Lemeshow (HL) test is a popular and commonly used method for assessing goodness-of-fit in logistic regression. However, there are two issues with using the HL test. One is that we must specify the number of partition groups, and different groupings often suggest different decisions. In this study, we therefore propose several grouping tests that combine multiple HL tests with varying numbers of groups to make the decision, instead of using one arbitrary grouping or searching for an optimal one; this is because the best group selection is data-dependent and not easy to find. The other drawback of the HL test is that it has little power to detect missing interactions between continuous and dichotomous covariates. Therefore, we propose global and interaction tests in order to capture such violations. Simulation studies are carried out to assess the Type I errors and powers of all the proposed tests. The tests are illustrated with bone mineral density data from NHANES III.
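The ingredients of the grouping proposal can be made concrete with a minimal Hosmer-Lemeshow statistic computed over several choices of the group count g (each statistic is conventionally referred to a chi-square distribution with g - 2 degrees of freedom). This is an illustrative sketch of the standard HL statistic, not the thesis's combination rule; the quantile-style grouping and the simulated data are assumptions:

```python
import numpy as np

def hl_statistic(y, p_hat, g):
    """Hosmer-Lemeshow chi-square statistic with g groups formed from
    the sorted fitted probabilities (quantile-style grouping)."""
    order = np.argsort(p_hat)
    stat = 0.0
    for grp in np.array_split(order, g):   # g near-equal-size groups
        obs = y[grp].sum()                 # observed events in the group
        exp = p_hat[grp].sum()             # expected events in the group
        stat += (obs - exp) ** 2 / (exp * (1.0 - exp / len(grp)))
    return stat                            # reference: chi-square, g - 2 df

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
p = 1.0 / (1.0 + np.exp(-x))               # true-model probabilities
y = (rng.uniform(size=500) < p).astype(float)
stats = [hl_statistic(y, p, g) for g in (6, 8, 10)]  # vary g instead of fixing it
print([round(s, 1) for s in stats])
```

Computing the statistic across several values of g, as above, exposes exactly the instability the abstract describes: each g can lead to a different accept/reject decision.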
 Date Issued
 2018
 Identifier
 2018_Su_Ma_fsu_0071E_14681
 Format
 Thesis
 Title
 Testing for the Equality of Two Distributions on High Dimensional Object Spaces and Nonparametric Inference for Location Parameters.
 Creator

Guo, Ruite, Patrangenaru, Victor, Mio, Washington, Barbu, Adrian G. (Adrian Gheorghe), Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Our view is that while some of the basic principles of data analysis are going to remain unchanged, others are to be gradually replaced with Geometry and Topology methods. Linear methods are still making sense for functional data analysis, or in the context of tangent bundles of object spaces. Complex nonstandard data is represented on object spaces. An object space admitting a manifold stratification may be embedded in an Euclidean space. One defines the extrinsic energy distance associated...
Show moreOur view is that while some of the basic principles of data analysis are going to remain unchanged, others are to be gradually replaced with Geometry and Topology methods. Linear methods are still making sense for functional data analysis, or in the context of tangent bundles of object spaces. Complex nonstandard data is represented on object spaces. An object space admitting a manifold stratification may be embedded in an Euclidean space. One defines the extrinsic energy distance associated with two probability measures on an arbitrary object space embedded in a numerical space, and one introduces an extrinsic energy statistic to test for homogeneity of distributions of two random objects (r.o.'s) on such an object space. This test is validated via a simulation example on the Kendall space of planar kads with a VeroneseWhitney (VW) embedding. One considers an application to medical imaging, to test for the homogeneity of the distributions of Kendall shapes of the midsections of the Corpus Callosum in a clinically normal population vs a population of ADHD diagnosed individuals. Surprisingly, due to the high dimensionality, these distributions are not significantly different, although they are known to have highly significant VWmeans. New spread and location parameters are to be added to reflect the nontrivial topology of certain object spaces. TDA is going to be adapted to object spaces, and hypothesis testing for distributions is going to be based on extrinsic energy methods. For a random point on an object space embedded in an Euclidean space, the mean vector cannot be represented as a point on that space, except for the case when the embedded space is convex. 
To address this shortcoming, since the mean vector is the minimizer of the expected squared distance, following Fréchet (1948), on an embedded compact object space one may consider both minimizers and maximizers of the expected squared distance to a given point on the embedded object space as the mean, respectively the antimean, of the random point. Of all distances on an object space, one considers here the chord distance associated with the embedding of the object space, since for such distances one can give a necessary and sufficient condition for the existence of a unique Fréchet mean (respectively, Fréchet antimean). For such distributions these location parameters are called the extrinsic mean (respectively, the extrinsic antimean), and the corresponding sample statistics are consistent estimators of their population counterparts. Moreover, around an extrinsic mean (antimean) located at a smooth point, one derives the limit distribution of such estimators.
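As a minimal illustration of the extrinsic mean and antimean described above (a hedged sketch, not the dissertation's code): for a sample on the unit circle embedded in R^2, the point minimizing the mean squared chord distance is the projection of the Euclidean mean vector onto the circle, and the maximizer is its antipode. The sample below is simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample n points on the unit circle S^1 embedded in R^2,
# concentrated around angle 0 via a wrapped normal.
theta = rng.normal(loc=0.0, scale=0.4, size=500)
x = np.column_stack([np.cos(theta), np.sin(theta)])

# Euclidean mean vector -- generally NOT a point on the circle.
mu = x.mean(axis=0)

# Extrinsic sample mean: the point on S^1 minimizing the mean squared
# chord distance, i.e. the projection of mu onto the circle.
extrinsic_mean = mu / np.linalg.norm(mu)

# Extrinsic sample antimean: the point on S^1 maximizing that quantity,
# i.e. the antipode of the projection.
extrinsic_antimean = -extrinsic_mean

print(extrinsic_mean, extrinsic_antimean)
```

Uniqueness holds here exactly when the mean vector is nonzero, matching the necessary-and-sufficient-condition flavor of the abstract.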
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Guo_fsu_0071E_13977
 Format
 Thesis
 Title
 Tests and Classifications in Adaptive Designs with Applications.
 Creator

Chen, Qiusheng, Niu, Xufeng, McGee, Daniel, Slate, Elizabeth H., Zhang, Jinfeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Statistical tests for biomarker identification and classification methods for patient grouping are two important topics in adaptive designs of clinical trials. In this work, we evaluate four test methods for biomarker identification: a model-based identification method, the popular t-test, the nonparametric Wilcoxon rank-sum test, and the Least Absolute Shrinkage and Selection Operator (Lasso) method. For selecting the best classification methods in Stage 2 of an adaptive design, we examine classification methods including recently developed machine learning approaches such as Random Forest, Lasso and Elastic-Net Regularized Generalized Linear Models (Glmnet), Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost). Statistical simulations are carried out in our study to assess the performance of the biomarker identification methods and the classification methods. The best identification method and classification technique are selected based on the True Positive Rate (TPR, also called sensitivity) and the True Negative Rate (TNR, also called specificity). The optimal test method for gene identification and classification method for patient grouping are then applied to the Adaptive Signature Design (ASD) in order to evaluate the performance of ASD in different situations, including simulated data and a real data set for breast cancer patients.
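The selection-accuracy criteria above can be made concrete with a small simulation (an illustrative sketch, not the dissertation's code): screen genes with a two-sample Welch t statistic and score the selection against the known truth via TPR (sensitivity) and TNR (specificity). The gene counts, effect size, and cutoff are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes, n_per_group = 200, 50
truth = np.zeros(n_genes, dtype=bool)
truth[:20] = True                      # first 20 genes are true biomarkers

# Expression matrices: group 1 gets a mean shift on the true biomarkers.
grp0 = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
grp1 = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
grp1[:, truth] += 1.0

# Welch two-sample t statistic, gene by gene.
m0, m1 = grp0.mean(0), grp1.mean(0)
v0, v1 = grp0.var(0, ddof=1), grp1.var(0, ddof=1)
t = (m1 - m0) / np.sqrt(v0 / n_per_group + v1 / n_per_group)

selected = np.abs(t) > 3.0             # crude fixed cutoff, for illustration

tpr = (selected & truth).sum() / truth.sum()          # sensitivity
tnr = (~selected & ~truth).sum() / (~truth).sum()     # specificity
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")
```

The same TPR/TNR scoring applies unchanged when the selection rule is replaced by the Wilcoxon test, Lasso, or a model-based screen.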
 Date Issued
 2018
 Identifier
 2018_Sp_Chen_fsu_0071E_14309
 Format
 Thesis
 Title
 Volatility Matrix Estimation for HighFrequency Financial Data.
 Creator

Xue, Yang, Tao, Minjing, Cheng, Yingmei, Fendler, Rachel Loveitt, Huffer, Fred W., Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Volatility is usually employed to measure the dispersion of asset returns, and it is widely used in risk analysis and asset management. The first chapter studies a kernel-based spot volatility matrix estimator with a pre-averaging approach for high-frequency data contaminated by market microstructure noise. When the sample size goes to infinity and the bandwidth vanishes, we show that our estimator is consistent, and its asymptotic normality is established at an optimal convergence rate. We also construct a consistent pairwise spot co-volatility estimator with the Hayashi-Yoshida method for non-synchronous high-frequency data with noise contamination. The simulation studies demonstrate that the proposed estimators work well under different noise levels, and their estimation performance improves with increasing sampling frequency. In empirical applications, we implement the estimators on the intraday prices of four component stocks of the Dow Jones Industrial Average. The second chapter presents a factor-based vast volatility matrix estimation method for high-frequency financial data with market microstructure noise, finite large jumps, and infinite-activity small jumps. We construct the sample volatility matrix estimator based on the approximate factor model, and use the pre-averaging and thresholding (PATH) estimation method to digest the noise and jumps. After using principal component analysis (PCA) to decompose the sample volatility matrix estimator, our proposed volatility matrix estimator is finally obtained by imposing block-diagonal regularization on the residual covariance matrix, sorting the assets by their Global Industry Classification Standard (GICS) codes. The Monte Carlo simulation shows that our proposed volatility matrix estimator removes the majority of the effects of noise and jumps, and its estimation performance improves quickly as the sampling frequency increases.
Finally, the PCA-based estimators are employed to perform volatility matrix estimation and asset allocation for S&P 500 stocks. To compare with the PCA-based estimators, we also include exchange-traded fund (ETF) data to construct observable factors, such as the Fama-French factors, for volatility estimation.
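The block-diagonal regularization step described above can be sketched in a few lines (an illustrative toy, not the thesis implementation): an entry of the residual covariance matrix is kept only when the two assets share a sector code. The asset count, sector labels (stand-ins for GICS codes), and covariance matrix here are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical residual covariance for 6 assets, with toy sector codes:
# two sectors of three assets each.
p = 6
sectors = np.array([10, 10, 10, 40, 40, 40])
A = rng.normal(size=(p, p))
resid_cov = A @ A.T / p                # symmetric positive semi-definite

# Block-diagonal regularization: keep an entry only when both assets
# share a sector code, zero it otherwise.
same_sector = sectors[:, None] == sectors[None, :]
resid_cov_reg = np.where(same_sector, resid_cov, 0.0)

print(resid_cov_reg)
```

Sorting assets by sector code first (as the abstract describes) makes the surviving entries form contiguous diagonal blocks, which is what makes the regularized matrix cheap to store and invert.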
 Date Issued
 2018
 Identifier
 2018_Sp_Xue_fsu_0071E_14471
 Format
 Thesis
 Title
 WaveletBased Bayesian Approaches to Sequential Profile Monitoring.
 Creator

Varbanov, Roumen, Chicken, Eric, Linero, Antonio Ricardo, Huffenberger, Kevin M., Yang, Yanyun, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

We consider change-point detection and estimation in sequences of functional observations. This setting often arises when the quality of a process is characterized by such observations, termed profiles, and monitoring profiles for changes in structure can be used to ensure the stability of the process over time. While interest in profile monitoring has grown, few methods approach the problem from a Bayesian perspective. In this dissertation, we propose three wavelet-based Bayesian approaches to profile monitoring, the last of which can be extended to a general process monitoring setting. First, we develop a general framework for the problem of interest in which we base inference on the posterior distribution of the change point without placing restrictive assumptions on the form of the profiles. The proposed method uses an analytic form of the posterior distribution in order to run online without relying on Markov chain Monte Carlo (MCMC) simulation. Wavelets, an effective tool for estimating nonlinear signals from noise-contaminated observations, enable the method to flexibly distinguish between sustained changes in profiles and the inherent variability of the process. Second, we modify the initial framework into a posterior approximation algorithm designed to utilize past information in a computationally efficient manner. We show that the approximation can detect changes of smaller magnitude better than traditional alternatives for curbing computational cost. Third, we introduce a monitoring scheme that allows an unchanged process to run infinitely long without a false alarm, while maintaining the ability to detect a change with probability one. We include theoretical results regarding these properties and illustrate the implementation of the scheme in the previously established framework. We demonstrate the efficacy of the proposed methods on simulated data, where they significantly outperform a relevant frequentist competitor.
 Date Issued
 2018
 Identifier
 2018_Sp_Varbanov_fsu_0071E_14513
 Format
 Thesis