Current Search: Research Repository (x) » Statistics (x) » The Florida State University (x)
Search results
Pages
 Title
 2D Affine and Projective Shape Analysis, and Bayesian Elastic Active Contours.
 Creator

Bryner, Darshan W., Srivastava, Anuj, Klassen, Eric, Gallivan, Kyle, Huffer, Fred, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

An object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but often times in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian...
Show moreAn object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but often times in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian framework for shape analysis where metrics and related quantities are invariant to larger groups, the affine and projective groups, that approximate such transformations that arise from perspective skews. Highlighting two possibilities for representing object boundaries  ordered points (or landmarks) and parametrized curves  we study different combinations of these representations (points and curves) and transformations (affine and projective). Specifically, we provide solutions to three out of four situations and develop algorithms for computing geodesics and intrinsic sample statistics, leading up to Gaussiantype statistical models, and classifying test shapes using such models learned from training data. In the case of parametrized curves, an added issue is to obtain invariance to the reparameterization group. The geodesics are constructed by particularizing the pathstraightening algorithm to geometries of current manifolds and are used, in turn, to compute shape statistics and Gaussiantype shape models. We demonstrate these ideas using a number of examples from shape and activity recognition. After developing such Gaussiantype shape models, we present a variational framework for naturally incorporating these shape models as prior knowledge in guidance of active contours for boundary extraction in images. This socalled Bayesian active contour framework is especially suitable for images where boundary estimation is difficult due to low contrast, low resolution, and presence of noise and clutter. In traditional active contour models curves are driven towards minimum of an energy composed of image and smoothing terms. We introduce an additional shape term based on shape models of prior known relevant shape classes. The minimization of this total energy, using iterated gradientbased updates of curves, leads to an improved segmentation of object boundaries. We demonstrate this Bayesian approach to segmentation using a number of shape classes in many imaging scenarios including the synthetic imaging modalities of SAS (synthetic aperture sonar) and SAR (synthetic aperture radar), which are notoriously difficult to obtain accurate boundary extractions. In practice, the training shapes used for priorshape models may be collected from viewing angles different from those for the test images and thus may exhibit a shape variability brought about by perspective effects. Therefore, by allowing for a prior shape model to be invariant to, say, affine transformations of curves, we propose an active contour algorithm where the resulting segmentation is robust to perspective skews.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd8534
 Format
 Thesis
 Title
 Adaptive Series Estimators for Copula Densities.
 Creator

Gui, Wenhao, Wegkamp, Marten, Van Engelen, Robert A., Niu, Xufeng, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

In this thesis, based on an orthonormal series expansion, we propose a new nonparametric method to estimate copula density functions. Since the basis coefficients turn out to be expectations, empirical averages are used to estimate these coefficients. We propose estimators of the variance of the estimated basis coefficients and establish their consistency. We derive the asymptotic distribution of the estimated coefficients under mild conditions. We derive a simple oracle inequality for the...
Show moreIn this thesis, based on an orthonormal series expansion, we propose a new nonparametric method to estimate copula density functions. Since the basis coefficients turn out to be expectations, empirical averages are used to estimate these coefficients. We propose estimators of the variance of the estimated basis coefficients and establish their consistency. We derive the asymptotic distribution of the estimated coefficients under mild conditions. We derive a simple oracle inequality for the copula density estimator based on a finite series using the estimated coefficients. We propose a stopping rule for selecting the number of coefficients used in the series and we prove that this rule minimizes the mean integrated squared error. In addition, we consider hard and soft thresholding techniques for sparse representations. We obtain oracle inequalities that hold with prescribed probability for various norms of the difference between the copula density and our threshold series density estimator. Uniform confidence bands are derived as well. The oracle inequalities clearly reveal that our estimator adapts to the unknown degree of sparsity of the series representation of the copula density. A simulation study indicates that our method is extremely easy to implement and works very well, and it compares favorably to the popular kernel based copula density estimator, especially around the boundary points, in terms of mean squared error. Finally, we have applied our method to an insurance dataset. After comparing our method with the previous data analyses, we reach the same conclusion as the parametric methods in the literature and as such we provide additional justification for the use of the developed parametric model.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd3929
 Format
 Thesis
 Title
 Age Effects in the Extinction of Planktonic Foraminifera: A New Look at Van Valen's Red Queen Hypothesis.
 Creator

Wiltshire, Jelani, Huﬀer, Fred, Parker, William, Chicken, Eric, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Van Valen's Red Queen hypothesis states that within a homogeneous taxonomic group the age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis being addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic...
Show moreVan Valen's Red Queen hypothesis states that within a homogeneous taxonomic group the age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis being addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic Foraminifera (Foram) extinction data include Weibull and Exponential modeling (Parker and Arnold, 1997), and Cox proportional hazards modeling (Doran et al. 2004,2006). I propose a general class of test statistics that can be used to test for the effect of age on extinction. These test statistics allow for a varying background rate of extinction and attempt to remove the effects of other covariates when assessing the effect of age on extinction. No model is assumed for the covariate effects. Instead I control for covariate effects by pairing or grouping together similar species. I use simulated data sets to compare the power of the statistics. In applying the test statistics to the Foram data, I have found age to have a positive effect on extinction.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd0952
 Format
 Thesis
 Title
 Algorithmic Lung Nodule Analysis in Chest Tomography Images: Lung Nodule Malignancy Likelihood Prediction and a Statistical Extension of the Level Set Image Segmentation Method.
 Creator

Hancock, Matthew C. (Matthew Charles), Magnan, Jeronimo Francisco, Duke, D. W., Hurdal, Monica K., Mio, Washington, Florida State University, College of Arts and Sciences,...
Show moreHancock, Matthew C. (Matthew Charles), Magnan, Jeronimo Francisco, Duke, D. W., Hurdal, Monica K., Mio, Washington, Florida State University, College of Arts and Sciences, Department of Mathematics
Show less  Abstract/Description

Lung cancer has the highest mortality rate of all cancers in both men and women in the United States. The algorithmic detection, characterization, and diagnosis of abnormalities found in chest CT scan images can aid radiologists by providing additional medicallyrelevant information to consider in their assessment of medical images. Such algorithms, if robustly validated in clinical settings, carry the potential to improve the health of the general population. In this thesis, we first give an...
Show moreLung cancer has the highest mortality rate of all cancers in both men and women in the United States. The algorithmic detection, characterization, and diagnosis of abnormalities found in chest CT scan images can aid radiologists by providing additional medicallyrelevant information to consider in their assessment of medical images. Such algorithms, if robustly validated in clinical settings, carry the potential to improve the health of the general population. In this thesis, we first give an analysis of publicly available chest CT scan annotation data, in which we determine upper bounds on expected classification accuracy when certain radiological features are used as inputs to statistical learning algorithms for the purpose of inferring the likelihood of a lung nodule as being either malignant or benign. Second, a statistical extension of the level set method for image segmentation is introduced and applied to both syntheticallygenerated and real threedimensional image volumes of lung nodules in chest CT scans, obtaining results comparable to the current stateoftheart on the latter.
Show less  Date Issued
 2018
 Identifier
 2018_Sp_Hancock_fsu_0071E_14427
 Format
 Thesis
 Title
 Analysis of crossclassified data using negative binomial models.
 Creator

Ramakrishnan, Viswanathan., Florida State University
 Abstract/Description

Several procedures are available for analyzing crossclassified data under the Poisson model. When data suggest the presence of "nonPoisson" variation an alternative model is desirable. Often a negative binomial model is useful as an alternative. In this dissertation methodology for analyzing data under a twoparameter negative binomial model is provided. A conditional likelihood approach is suggested to simplify estimation and inference procedures. Large sample properties of the conditional...
Show moreSeveral procedures are available for analyzing crossclassified data under the Poisson model. When data suggest the presence of "nonPoisson" variation an alternative model is desirable. Often a negative binomial model is useful as an alternative. In this dissertation methodology for analyzing data under a twoparameter negative binomial model is provided. A conditional likelihood approach is suggested to simplify estimation and inference procedures. Large sample properties of the conditional likelihood approach are derived. Based on simulations these properties are examined for small samples. The suggested methodology is applied to two sets of data from ecological research studies.
Show less  Date Issued
 1989, 1989
 Identifier
 AAI9016503, 3161994, FSDT3161994, fsu:78193
 Format
 Document (PDF)
 Title
 Analysis of Multivariate Data with Random Cluster Size.
 Creator

Li, Xiaoyun, Sinha, Debajyoti, Zhou, Yi, McGee, Dan, Lipsitz, Stuart, Department of Statistics, Florida State University
 Abstract/Description

In this dissertation, we examine binary correlated data with present/absent component or missing data that are related to binary responses of interest. Depending on the data structure, correlated binary data can be referred as emph{clustered data} if sampling unit is a cluster of subjects, or it can be referred as emph{longitudinal data} when it involves repeated measurement of same subject over time. We propose our novel models in these two data structures and illustrate the model with real...
Show moreIn this dissertation, we examine binary correlated data with present/absent component or missing data that are related to binary responses of interest. Depending on the data structure, correlated binary data can be referred as emph{clustered data} if sampling unit is a cluster of subjects, or it can be referred as emph{longitudinal data} when it involves repeated measurement of same subject over time. We propose our novel models in these two data structures and illustrate the model with real data applications. In biomedical studies involving clustered binary responses, the cluster size can vary because some components of the cluster can be absent. When both the presence of a cluster component as well as the binary disease status of a present component are treated as responses of interest, we propose a novel twostage random effects logistic regression framework. For the ease of interpretation of regression effects, both the marginal probability of presence/absence of a component as well as the conditional probability of disease status of a present component, preserve the approximate logistic regression forms. We present a maximum likelihood method of estimation implementable using standard statistical software. We compare our models and the physical interpretation of regression effects with competing methods from literature. We also present a simulation study to assess the robustness of our procedure to wrong specification of the random effects distribution and to compare finite sample performances of estimates with existing methods. The methodology is illustrated via analyzing a study of the periodontal health status in a diabetic Gullah population. We extend this model in longitudinal studies with binary longitudinal response and informative missing data. In longitudinal studies, when treating each subject as a cluster, cluster size is the total number of observations for each subject. When data is informatively missing, cluster size of each subject can vary and is related to the binary response of interest and we are also interested in the missing mechanism. This is a modified situation of the cluster binary data with present components. We modify and adopt our proposed twostage random effects logistic regression model so that both the marginal probability of binary response and missing indicator as well as the conditional probability of binary response and missing indicator preserve logistic regression forms. We present a Bayesian framework of this model and illustrate our proposed model on an AIDS data example.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd1425
 Format
 Thesis
 Title
 An analysis of test reliability.
 Creator

Isaacson, Fenton R., Florida State University
 Abstract/Description

"The need for efficient means of testing has long been recognized. To obtain efficiency in testing requires the study of four attributes of the testing instrumentnamely: reliability, validity, interpretability and administrability. It is the purpose of this paper to examine in some detail the first of these attributes, reliability. In particular, this is an attempt to analyse the reliability of Mathematics 101 Test D which was administered at Florida State University in the fall of 1948"...
Show more"The need for efficient means of testing has long been recognized. To obtain efficiency in testing requires the study of four attributes of the testing instrumentnamely: reliability, validity, interpretability and administrability. It is the purpose of this paper to examine in some detail the first of these attributes, reliability. In particular, this is an attempt to analyse the reliability of Mathematics 101 Test D which was administered at Florida State University in the fall of 1948"Introduction.
Show less  Date Issued
 1949
 Identifier
 FSU_historic_AKP4870
 Format
 Thesis
 Title
 Association Models for Clustered Data with Binary and Continuous Responses.
 Creator

Lin, Lanjia, Sinha, Debajyoti, Hurt, Myra, Lipsitz, Stuart R., McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

This dissertation develops novel single random effect models as well as bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses. For the ease of interpretation of the regression effects, random effect of the binary response has bridge distribution so that the marginal model of mean of the binary response after integrating out the random effect preserves logistic form. And...
Show moreThis dissertation develops novel single random effect models as well as bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses. For the ease of interpretation of the regression effects, random effect of the binary response has bridge distribution so that the marginal model of mean of the binary response after integrating out the random effect preserves logistic form. And the marginal regression function of the continuous response preserves linear form. Withincluster and withinsubject associations could be measured by our proposed models. For the bivariate correlated random effects model, we illustrate how different levels of the association between two random effects induce different Kendall's tau values for association between the binary and continuous responses from the same cluster. Fully parametric and semiparametric Bayesian methods as well as maximum likelihood method are illustrated for model analysis. In the semiparametric Bayesian model, normality assumption of the regression error for the continuous response is relaxed by using a nonparametric Dirichlet Process prior. Robustness of the bivariate correlated random effects model using ML method to misspecifications of regression function as well as random effect distribution is investigated by simulation studies. The Bayesian and likelihood methods are applied to a developmental toxicity study of ethylene glycol in mice.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd1330
 Format
 Thesis
 Title
 A Bayesian Approach to MetaRegression: The Relationship Between Body Mass Index and AllCause Mortality.
 Creator

Marker, Mahtab, McGee, Dan, Hurt, Myra, Niu, Xiufeng, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

This thesis presents a Bayesian approach to MetaRegression and Individual Patient Data (IPD) Metaanalysis. The focus of the research is on establishing the relationship between Body Mass Index (BMI) and allcause mortality. This has been an area of continuing interest in the medical and public health communities and no concensus has been reached on what the optimal weight for individuals is. Standards are usually speci ed in terms of body mass index (BMI = wt(kg) over height(m)2 ) which is...
Show moreThis thesis presents a Bayesian approach to MetaRegression and Individual Patient Data (IPD) Metaanalysis. The focus of the research is on establishing the relationship between Body Mass Index (BMI) and allcause mortality. This has been an area of continuing interest in the medical and public health communities and no concensus has been reached on what the optimal weight for individuals is. Standards are usually speci ed in terms of body mass index (BMI = wt(kg) over height(m)2 ) which is associated with body fat percentage. Many studies in the literature have modelled the relationship between BMI and mortality and reported a variety of relationships including Ushaped, Jshaped and linear curves. The aim of my research was to use statistical methods to determine whether we can combine these diverse results an obtain single estimated relationship, using which one can nd the point of minimum mortality and establish reasonable ranges for optimal BMI or how we can best examine the reasons for the heterogeneity of results. Commonly used techniques of Metaanalysis and Metaregression are explored and a problem with the estimation procedure in the multivariate setting is presented. A Bayesian approach using Hierarchical Generalized Linear Mixed Model is suggested and implemented to overcome this drawback of standard estimation techniques. Another area which is explored briefly is that of Individual Patient Data metaanalysis. A Frailty model or Random Effects Proportional Hazards Survival model approach is proposed to carry out IPD metaregression and come up with a single estimated relationship between BMI and mortality, adjusting for the variation between studies.
Show less  Date Issued
 2007
 Identifier
 FSU_migr_etd2736
 Format
 Thesis
 Title
 Bayesian Dynamic Survival Models for Longitudinal Aging Data.
 Creator

He, Jianghua, McGee, Daniel L., Niu, Xufeng, Johnson, Suzanne B., Huﬀer, Fred W., Department of Statistics, Florida State University
 Abstract/Description

In this study, we will examine the Bayesian Dynamic Survival Models, timevarying coefficients models from a Bayesian perspective, and their applications in the aging setting. The specific questions we are interested in are: Do the relative importance of characteristics measured at a particular age, such as blood pressure, smoking, and body weight, with respect to heart diseases or death change as people age? If they do, how can we model the change? And, how does the change affect the...
Show moreIn this study, we will examine the Bayesian Dynamic Survival Models, timevarying coefficients models from a Bayesian perspective, and their applications in the aging setting. The specific questions we are interested in are: Do the relative importance of characteristics measured at a particular age, such as blood pressure, smoking, and body weight, with respect to heart diseases or death change as people age? If they do, how can we model the change? And, how does the change affect the analysis results if fixedeffect models are applied? In the epidemiological and statistical literature, the relationship between a risk factor and the risk of an event is often described in terms of the numerical contribution of the risk factor to the total risk within a followup period, using methods such as contingency tables and logistic regression models. With the development of survival analysis, another method named the Proportional Hazards Model becomes more popular. This model describes the relationship between a covariate and risk within a followup period as a process, under the assumption that the hazard ratio of the covariate is fixed during the followup period. Neither previous methods nor the Proportional Hazards Model allows the effect of a covariates to change flexibly with time. In these study, we intend to investigate some classic epidemiological relationships using appropriate methods that allow coefficients to change with time, and compare our results with those found in the literature. After describing what has been done in previous work based on multiple logistic regression or discriminant function analysis, we summarize different methods for estimating the time varying coefficient survival models that are developed specifically for the situations under which the proportional hazards assumption is violated. We will focus on the Bayesian Dynamic Survival Model because its flexibility and Bayesian structure fits our study goals. There are two estimation methods for the Bayesian Dynamic Survival Models, the Linear Bayesian Estimation (LBE) method and the Markov Chain Monte Carlo (MCMC) sampling method. The LBE method is simpler, faster, and more flexible to calculate, but it requires specifications of some parameters that usually are unknown. The MCMC method gets around the difficulty of specifying parameters, but is much more computationally intensive. We will use a simulation study to investigate the performances of these two methods, and provide suggestions on how to use them effectively in application. The Bayesian Dynamic Survival Model is applied to the Framingham Heart Study to investigate the timevarying effects of covariates such as gender, age, smoking, and SBP (Systolic Blood Pressure) with respect to death. We also examined the changing relationship between BMI (Body Mass Index) and allcause mortality, and suggested that some of the heterogeneity observed in the results found in the literature is likely to be a consequence of using fixed effect models to describe a timevarying relationship.
Show less  Date Issued
 2007
 Identifier
 FSU_migr_etd4174
 Format
 Thesis
 Title
 Bayesian Generalized Polychotomous Response Models and Applications.
 Creator

Yang, Fang, Niu, XuFeng, Johnson, Suzanne B., McGee, Dan, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Polychotomous quantal response models are widely used in medical and econometric studies to analyze categorical or ordinal data. In this study, we apply the Bayesian methodology through a mixedeffects polychotomous quantal response model. For the Bayesian polychotomous quantal response model, we assume uniform improper priors for the regression coeffcients and explore the suffcient conditions for a proper joint posterior distribution of the parameters in the models. Simulation results from...
Show morePolychotomous quantal response models are widely used in medical and econometric studies to analyze categorical or ordinal data. In this study, we apply the Bayesian methodology through a mixedeffects polychotomous quantal response model. For the Bayesian polychotomous quantal response model, we assume uniform improper priors for the regression coeffcients and explore the suffcient conditions for a proper joint posterior distribution of the parameters in the models. Simulation results from Gibbs sampling estimates will be compared to traditional maximum likelihood estimates to show the strength that using the uniform improper priors for the regression coeffcients. Motivated by investigating of relationship between BMI categories and several risk factors, we carry out the application studies to examine the impact of risk factors on BMI categories, especially for categories of "Overweight" and "Obesities". By applying the mixedeffects Bayesian polychotomous response model with uniform improper priors, we would get similar interpretations of the association between risk factors and BMI, comparing to literature findings.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd1092
 Format
 Thesis
 Title
 Bayesian Inference and Novel Models for Survival Data with Cured Fraction.
 Creator

Gupta, Cherry Chunqi Huang, Sinha, Debajyoti, Glueckauf, Robert L., Slate, Elizabeth H., Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of...
Show moreGupta, Cherry Chunqi Huang, Sinha, Debajyoti, Glueckauf, Robert L., Slate, Elizabeth H., Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

Existing curerate survival models are generally not convenient for modeling and estimating the survival quantiles of a patient with specified covariate values. They also do not allow inference on the change in the number of clonogens over time. This dissertation proposes two novel classes of curerate model, the transformbothsides curerate model (TBSCRM) and the clonogen proliferation curerate model (CPCRM). Both can be used to make inference about both the curerate and the survival...
Show moreExisting curerate survival models are generally not convenient for modeling and estimating the survival quantiles of a patient with specified covariate values. They also do not allow inference on the change in the number of clonogens over time. This dissertation proposes two novel classes of curerate model, the transformbothsides curerate model (TBSCRM) and the clonogen proliferation curerate model (CPCRM). Both can be used to make inference about both the curerate and the survival probabilities over time. The TBSCRM can also produce estimates of a patient's quantiles of survival time, and the CPCRM can produce estimates of a patient's expected number of clonogens at each time. We develop methods of Bayesian inference about the covariate effects on relevant quantities such as the curerate, methods which use Markov Chain Monte Carlo (MCMC) tools. We also show that the TBSCRMbased and CPCRMbased Bayesian methods perform well in simulation studies and outperform existing curerate models in application to the breast cancer survival data from the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) database.
Show less  Date Issued
 2016
 Identifier
 FSU_2016SU_Gupta_fsu_0071E_13423
 Format
 Thesis
 Title
 Bayesian Methods for Skewed Response Including Longitudinal and Heteroscedastic Data.
 Creator

Tang, Yuanyuan, Sinha, Debajyoti, Pati, Debdeep, Flynn, Heather, She, Yiyuan, Lipsitz, Stuart, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

Skewed response data are very popular in practice, especially in biomedical area. We begin our work from the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study the heterocedasticity. We extend the transformbothsides model to the bayesian variable selection area to handle the univariate skewed response, where the variance of response is a function of the median. At last, we proposed a novel model to handle...
Show moreSkewed response data are very popular in practice, especially in biomedical area. We begin our work from the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study the heterocedasticity. We extend the transformbothsides model to the bayesian variable selection area to handle the univariate skewed response, where the variance of response is a function of the median. At last, we proposed a novel model to handle the skewed univariate response with a flexible heteroscedasticity. For longitudinal studies with heavily skewed continuous response, statistical model and methods focusing on mean response are not appropriate. In this paper, we present a partial linear model of median regression function of skewed longitudinal response. We develop a semiparametric Bayesian estimation procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We provide justifications for using our methods including theoretical investigation of the support of the prior, asymptotic properties of the posterior and also simulation studies of finite sample properties. Ease of implementation and advantages of our model and method compared to existing methods are illustrated via analysis of a cardiotoxicity study of children of HIV infected mother. Our second aim is to develop a Bayesian simultaneous variable selection and estimation of median regression for skewed response variable. Our hierarchical Bayesian model can incorporate advantages of $l_0$ penalty for skewed and heteroscedastic error. Some preliminary simulation studies have been conducted to compare the performance of proposed model and existing frequentist median lasso regression model. Considering the estimation bias and total square error, our proposed model performs as good as, or better than competing frequentist estimators. In biomedical studies, the covariates often affect the location, scale as well as the shape of the skewed response distribution. Existing biostatistical literature mainly focuses on the mean regression with a symmetric error distribution. While such modeling assumptions and methods are often deemed as restrictive and inappropriate for skewed response, the completely nonparametric methods may lack a physical interpretation of the covariate effects. Existing nonparametric methods also miss any easily implementable computational tool. For a skewed response, we develop a novel model accommodating a nonparametric error density that depends on the covariates. The advantages of our semiparametric associated Bayes method include the ease of prior elicitation/determination, an easily implementable posterior computation, theoretically sound properties of the selection of priors and accommodation of possible outliers. The practical advantages of the method are illustrated via a simulation study and an analysis of a reallife epidemiological study on the serum response to DDT exposure during gestation period.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd7622
 Format
 Thesis
 Title
 Bayesian Modeling and Variable Selection for Complex Data.
 Creator

Li, Hanning, Pati, Debdeep, Huffer, Fred W. (Fred William), Kercheval, Alec N., Sinha, Debajyoti, Bradley, Jonathan R., Florida State University, College of Arts and Sciences,...
Show moreLi, Hanning, Pati, Debdeep, Huffer, Fred W. (Fred William), Kercheval, Alec N., Sinha, Debajyoti, Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

As we routinely encounter highthroughput datasets in complex biological and environment research, developing novel models and methods for variable selection has received widespread attention. In this dissertation, we addressed a few key challenges in Bayesian modeling and variable selection for highdimensional data with complex spatial structures. a) Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the...
Show moreAs we routinely encounter highthroughput datasets in complex biological and environment research, developing novel models and methods for variable selection has received widespread attention. In this dissertation, we addressed a few key challenges in Bayesian modeling and variable selection for highdimensional data with complex spatial structures. a) Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the noise. However, such priors encounter computational issues in high dimensions. This has motivated continuous shrinkage priors, resembling the twocomponent priors facilitating computation and interpretability. While such priors are widely used for estimating highdimensional sparse vectors, selecting a subset of variables remains a daunting task. b) Spatial/spatialtemporal data sets with complex structures are nowadays commonly encountered in various scientific research fields ranging from atmospheric sciences, forestry, environmental science, biological science, and social science. Selecting important spatial variables that have significant influences on occurrences of events is undoubtedly necessary and essential for providing insights to researchers. Selfexcitation, which is a feature that occurrence of an event increases the likelihood of more occurrences of the same type of events nearby in time and space, can be found in many natural/social events. Research on modeling data with selfexcitation feature has increasingly drawn interests recently. However, existing literature on selfexciting models with inclusion of highdimensional spatial covariates is still underdeveloped. c) Gaussian Process is among the most powerful model frames for spatial data. Its major bottleneck is the computational complexity which stems from inversion of dense matrices associated with a Gaussian process covariance. Hierarchical divideconquer Gaussian Process models have been investigated for ultra large data sets. However, computation associated with scaling the distributing computing algorithm to handle a large number of subgroups poses a serious bottleneck. In chapter 2 of this dissertation, we propose a general approach for variable selection with shrinkage priors. The presence of very few tuning parameters makes our method attractive in comparison to ad hoc thresholding approaches. The applicability of the approach is not limited to continuous shrinkage priors, but can be used along with any shrinkage prior. Theoretical properties for nearcollinear design matrices are investigated and the method is shown to have good performance in a wide range of synthetic data examples and in a real data example on selecting genes affecting survival due to lymphoma. In Chapter 3 of this dissertation, we propose a new selfexciting model that allows the inclusion of spatial covariates. We develop algorithms which are effective in obtaining accurate estimation and variable selection results in a variety of synthetic data examples. Our proposed model is applied on Chicago crime data where the influence of various spatial features is investigated. In Chapter 4, we focus on a hierarchical Gaussian Process regression model for ultrahigh dimensional spatial datasets. By evaluating the latent Gaussian process on a regular grid, we propose an efficient computational algorithm through circulant embedding. The latent Gaussian process borrows information across multiple subgroups, thereby obtaining a more accurate prediction. The hierarchical model and our proposed algorithm are studied through simulation examples.
Show less  Date Issued
 2017
 Identifier
 FSU_FALL2017_Li_fsu_0071E_14159
 Format
 Thesis
 Title
 Bayesian Models for Capturing Heterogeneity in Discrete Data.
 Creator

Geng, Junxian, Slate, Elizabeth H., Pati, Debdeep, Schmertmann, Carl P., Zhang, Xin, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Population heterogeneity exists frequently in discrete data. Many Bayesian models perform reasonably well in capturing this subpopulation structure. Typically, the Dirichlet process mixture model (DPMM) and a variable dimensional alternative that we refer to as the mixture of finite mixtures (MFM) model are used, as they both have natural byproducts of clustering derived from Polya urn schemes. The first part of this dissertation focuses on a model for the association between a binary...
Show morePopulation heterogeneity exists frequently in discrete data. Many Bayesian models perform reasonably well in capturing this subpopulation structure. Typically, the Dirichlet process mixture model (DPMM) and a variable dimensional alternative that we refer to as the mixture of finite mixtures (MFM) model are used, as they both have natural byproducts of clustering derived from Polya urn schemes. The first part of this dissertation focuses on a model for the association between a binary response and binary predictors. The model incorporates Boolean combinations of predictors, called logic trees, as parameters arising from a DPMM or MFM. Joint modeling is proposed to solve the identifiability issue that arises when using a mixture model for a binary response. Different MCMC algorithms are introduced and compared for fitting these models. The second part of this dissertation is the application of the mixture of finite mixtures model to community detection problems. Here, the communities are analogous to the clusters in the earlier work. A probabilistic framework that allows simultaneous estimation of the number of clusters and the cluster configuration is proposed. We prove clustering consistency in this setting. We also illustrate the performance of these methods with simulation studies and discuss applications.
Show less  Date Issued
 2017
 Identifier
 FSU_2017SP_Geng_fsu_0071E_13791
 Format
 Thesis
 Title
 A Bayesian MRF Framework for Labeling Terrain Using Hyperspectral Imaging.
 Creator

Neher, Robert E., Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Wegkamp, Marten, Department of Statistics, Florida State University
 Abstract/Description

We explore the nonGaussianity of hyperspectral data and present probability models that capture variability of hyperspectral images. In particular, we present a nonparametric probability distribution that models the distribution of the hyperspectral data after reducing the dimension of the data via either principal components or Fisher's discriminant analysis. We also explore the directional differences in observed images and present two parametric distributions, the generalized Laplacian...
Show moreWe explore the nonGaussianity of hyperspectral data and present probability models that capture variability of hyperspectral images. In particular, we present a nonparametric probability distribution that models the distribution of the hyperspectral data after reducing the dimension of the data via either principal components or Fisher's discriminant analysis. We also explore the directional differences in observed images and present two parametric distributions, the generalized Laplacian and the Bessel K form, that well model the nonGaussian behavior of the directional differences. We then propose a model that labels each spatial site, using Bayesian inference and Markov random fields, that incorporates the information of the nonparametric distribution of the data, and the parametric distributions of the directional differences, along with a prior distribution that favors smooth labeling. We then test our model on actual hyperspectral data and present the results of our model, using the Washington D.C. Mall and Indian Springs rural area data sets.
Show less  Date Issued
 2004
 Identifier
 FSU_migr_etd2691
 Format
 Thesis
 Title
 Bayesian nonparametric estimation via Gibbs sampling for coherent systems with redundancy.
 Creator

Lawson, Kevin Lee., Florida State University
 Abstract/Description

We consider a coherent system S consisting of m independent components for which we do not know the distributions of the components' lifelengths. If we know the structure function of the system, then we can estimate the distribution of the system lifelength by estimating the distributions of the lifelengths of the individual components. Suppose that we can collect data under the 'autopsy model', wherein a system is run until a failure occurs and then the status (functioning or dead) of each...
Show moreWe consider a coherent system S consisting of m independent components for which we do not know the distributions of the components' lifelengths. If we know the structure function of the system, then we can estimate the distribution of the system lifelength by estimating the distributions of the lifelengths of the individual components. Suppose that we can collect data under the 'autopsy model', wherein a system is run until a failure occurs and then the status (functioning or dead) of each component is obtained. This test is repeated n times. The autopsy statistics consist of the age of the system at the time of breakdown and the set of parts that are dead by the time of breakdown. Using the structure function and the recorded status of the components, we then classify the failure time of each component. We develop a nonparametric Bayesian estimate of the distributions of the component lifelengths and then use this to obtain an estimate of the distribution of the lifelength of the system. The procedure is applicable to machinetest settings wherein the machines have redundant designs. A parametric procedure is also given.
Show less  Date Issued
 1994, 1994
 Identifier
 AAI9502812, 3088467, FSDT3088467, fsu:77272
 Format
 Document (PDF)
 Title
 Bayesian Portfolio Optimization with TimeVarying Factor Models.
 Creator

Zhao, Feng, Niu, Xufeng, Cheng, Yingmei, Huﬀer, Fred W., Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

We develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the riskfree rate ("risk premia"). Both firmlevel characteristics and macroeconomic variables are used to predict stocks' timevarying alphas and betas, and macroeconomic variables are used to predict the risk...
Show moreWe develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the riskfree rate ("risk premia"). Both firmlevel characteristics and macroeconomic variables are used to predict stocks' timevarying alphas and betas, and macroeconomic variables are used to predict the risk premia. All of the models are specified in a Bayesian framework to account for estimation risk, and informative prior distributions on both stock returns and model parameters are adopted to reduce estimation error. To gauge the economic signicance of the predictability, we apply the models to the U.S. stock market and construct optimal portfolios based on model predictions. Outofsample performance of the portfolios is evaluated to compare the models. The empirical results confirm predictabiltiy from all of the sources considered in our model: (1) The equity risk premium is timevarying and predictable using macroeconomic variables; (2) Stocks' alphas and betas differ crosssectionally and are predictable using firmlevel characteristics; and (3) Stocks' alphas and betas are also timevarying and predictable using macroeconomic variables. Comparison of different subperiods shows that the predictability of stocks' betas is persistent over time, but the predictability of stocks' alphas and the risk premium has diminished to some extent. The empirical results also suggest that Bayesian statistical techinques, especially the use of informative prior distributions, help reduce model estimation error and result in portfolios that outperform the passive indexing strategy. The findings are robust in the presence of transaction costs.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd0526
 Format
 Thesis
 Title
 A Bayesian Semiparametric Joint Model for Longitudinal and Survival Data.
 Creator

Wang, Pengpeng, Slate, Elizabeth H., Bradley, Jonathan R., Wetherby, Amy M., Lin, Lifeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Many biomedical studies monitor both a longitudinal marker and a survival time on each subject under study. Modeling these two endpoints as joint responses has potential to improve the inference for both. We consider the approach of Brown and Ibrahim (2003) that proposes a Bayesian hierarchical semiparametric joint model. The model links the longitudinal and survival outcomes by incorporating the mean longitudinal trajectory as a predictor for the survival time. The usual parametric mixed...
Show moreMany biomedical studies monitor both a longitudinal marker and a survival time on each subject under study. Modeling these two endpoints as joint responses has potential to improve the inference for both. We consider the approach of Brown and Ibrahim (2003) that proposes a Bayesian hierarchical semiparametric joint model. The model links the longitudinal and survival outcomes by incorporating the mean longitudinal trajectory as a predictor for the survival time. The usual parametric mixed effects model for the longitudinal trajectory is relaxed by using a Dirichlet process prior on the coefficients. A Cox proportional hazards model is then used for the survival time. The complicated joint likelihood increases the computational complexity. We develop a computationally efficient method by using a multivariate loggamma distribution instead of Gaussian distribution to model the data. We use Gibbs sampling combined with Neal's algorithm (2000) and the MetropolisHastings method for inference. Simulation studies illustrate the procedure and compare this loggamma joint model with the Gaussian joint models. We apply this joint modeling method to a human immunodeciency virus (HIV) data and a prostatespecific antigen (PSA) data.
Show less  Date Issued
 2019
 Identifier
 2019_Spring_Wang_fsu_0071E_15120
 Format
 Thesis
 Title
 BAYESIAN SOLUTIONS TO SOME CLASSICAL PROBLEMS OF STATISTICS.
 Creator

PEREIRA, CARLOS ALBERTO DE BRAGANCA., Florida State University
 Abstract/Description

Three of the basic questions of Statistics may be stated as follows: (A) Which portion of the data X is actually informative about the parameter of interest (theta)? (B) How can all the relevant information about (theta) provided by the data X be extracted? (C) What kind of information about (theta) do the data X possess?, The perspective of this dissertation is that of a Bayesian., Chapter I is essentially concerned with question A. The theory of conditional independence is explained and the...
Show moreThree of the basic questions of Statistics may be stated as follows: (A) Which portion of the data X is actually informative about the parameter of interest (theta)? (B) How can all the relevant information about (theta) provided by the data X be extracted? (C) What kind of information about (theta) do the data X possess?, The perspective of this dissertation is that of a Bayesian., Chapter I is essentially concerned with question A. The theory of conditional independence is explained and the relations between ancillarity, sufficiency, and statistical independence are discussed in depth. Some related concepts like specific sufficiency, bounded completeness, and splitting sets are also studied in some details. The language of conditional independence is used in the remaining Chapters., Chapter II deals with question B for the particular problem of analysing categorical data with missing entries. It is demonstrated how a suitably chosen prior for the frequency parameters can streamline the analysis in the presence of missing entries due to nonresponse or other causes. The two cases where the data follow the Multinomial or the Multivariate Hypergeometric model are treated separately. In the first case it is adequate to restrict the prior (for the cell probabilities) to the class of Dirichlet distributions. In the Hypergeometric case it is convenient to select a prior (for the cell population frequencies) from the class of DirichletMultinomial (DM) distributions. The DM distributions are studied in detail., Chapter III is directly related to question C. Conditions on the likelihood function and on the prior distribution are presented in order to assess the effect of the sample on the posterior distribution. More specifically, it is shown that under certain conditions, the larger the observations obtained, the larger (stochastically in terms of the posterior distribution) is the appropriate parameter., Finally, Chapter IV deals with the characterization of distributions in terms of Blackwell comparison of experiments. It is shown that a result (for the Hypergeometric model) obtained in Chapter II is actually a consequence of a property of complete families of distributions.
Show less  Date Issued
 1980, 1980
 Identifier
 AAI8108380, 3084857, FSDT3084857, fsu:74358
 Format
 Document (PDF)
 Title
 Bayesian Tractography Using Geometric Shape Priors.
 Creator

Dong, Xiaoming, Srivastava, Anuj, Klassen, E. (Eric), Wu, Wei, Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Diffusionweighted image(DWI) and tractography have been developed for decades and are key elements in recent, largescale efforts for mapping the human brain. The two techniques together provide us a unique possibility to access the macroscopic structure and connectivity of the human brain noninvasively and in vivo. The information obtained not only can help visualize brain connectivity and help segment the brain into different functional areas but also provides tools for understanding some...
Show moreDiffusionweighted image(DWI) and tractography have been developed for decades and are key elements in recent, largescale efforts for mapping the human brain. The two techniques together provide us a unique possibility to access the macroscopic structure and connectivity of the human brain noninvasively and in vivo. The information obtained not only can help visualize brain connectivity and help segment the brain into different functional areas but also provides tools for understanding some major cognitive diseases such as multiple sclerosis, schizophrenia, epilepsy, etc. There are lots of efforts have been put into this area. On the one hand, a vast spectrum of tractography algorithms have been developed in recent years, ranging from deterministic approaches through probabilistic methods to global tractography; On the other hand, various mathematical models, such as diffusion tensor, multitensor model, spherical deconvolution, Qball modeling, have been developed to better exploit the acquisition dependent signal of Diffusionweighted image(DWI). Despite considerable progress in this area, current methods still face many challenges, such as sensitive to noise, lots of false positive/negative fibers, incapable of handling complex fiber geometry and expensive computation cost. More importantly, recent researches have shown that, even with highquality data, the results using current tractography methods may not be improved, suggesting that it is unlikely to obtain an anatomically accurate map of the human brain solely based on the diffusion profile. Motivated by these issues, this dissertation develops a global approach that incorporates anatomical validated geometric shape prior when reconstructing neuron fibers. The fiber tracts between regions of interest are initialized and updated via deformations based on gradients of the posterior energy defined in this paper. This energy has contributions from diffusion data, shape prior information, and roughness penalty. The dissertation first describes and demonstrates the proposed method on the 2D dataset and then extends it to 3D Phantom data and the real brain data. The results show that the proposed method is relatively immune to issues such as noise, complicated fiber structure like fiber crossings and kissing, false positive fibers, and achieve more explainable tractography results.
Show less  Date Issued
 2019
 Identifier
 2019_Spring_DONG_fsu_0071E_15144
 Format
 Thesis
 Title
 Building a Model Performance Measure for Examining Clinical Relevance Using Net Benefit Curves.
 Creator

Mukherjee, Anwesha, McGee, Daniel, Hurt, Myra M., Slate, Elizabeth H., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

ROC curves are often used to evaluate predictive accuracy of statistical prediction models. This thesis studies other measures which not only incorporate the statistical but also the clinical consequences of using a particular prediction model. Depending on the disease and population under study, the misclassification costs of false positives and false negatives vary. The concept of Decision Curve Analysis (DCA) takes this cost into account, by using the threshold probability (the...
Show moreROC curves are often used to evaluate predictive accuracy of statistical prediction models. This thesis studies other measures which not only incorporate the statistical but also the clinical consequences of using a particular prediction model. Depending on the disease and population under study, the misclassification costs of false positives and false negatives vary. The concept of Decision Curve Analysis (DCA) takes this cost into account, by using the threshold probability (the probability above which a patient opts for treatment). Using the DCA technique, a Net Benefit Curve is built by plotting "Net Benefit", a function of the expected benefit and expected harm of using a model, by the threshold probability. Only the threshold probability range that is relevant to the disease and the population under study is used to plot the net benefit curve to obtain the optimum results using a particular statistical model. This thesis concentrates on the process of construction of a summary measure to find which predictive model yields highest net benefit. The most intuitive approach is to calculate the area under the net benefit curve. We examined whether the use of weights such as, the estimated empirical distribution of the threshold probability to compute the weighted area under the curve, creates a better summary measure. Real data from multiple cardiovascular research studies The Diverse Population Collaboration (DPC) datasets, is used to compute the summary measures: area under the ROC curve (AUROC), area under the net benefit curve (ANBC) and weighted area under the net benefit curve (WANBC). The results from the analysis are used to compare these measures to examine whether these measures are in agreement with each other and which would be the best to use in specified clinical scenarios. For different models the summary measures and its standard errors (SE) were calculated to study the variability in the measure. The method of metaanalysis is used to summarize these estimated summary measures to reveal if there is significant variability among these studies.
Show less  Date Issued
 2018
 Identifier
 2018_Sp_Mukherjee_fsu_0071E_14350
 Format
 Thesis
 Title
 A Class of MixedDistribution Models with Applications in Financial Data Analysis.
 Creator

Tang, Anqi, Niu, Xufeng, Cheng, Yingmei, Wu, Wei, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Statisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zeroinflated longitudinal data where the response variable has a large portion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a twopart mixed distribution model to model zeroinflated longitudinal data. The first part of the model is a logistic regression model that models the...
Show moreStatisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zeroinflated longitudinal data where the response variable has a large portion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a twopart mixed distribution model to model zeroinflated longitudinal data. The first part of the model is a logistic regression model that models the probability of nonzero response; the other part is a linear model that models the mean response given that the outcomes are not zeros. Random effects with AR(1) covariance structure are introduced into both parts of the model to allow serial correlation and subject specific effect. Estimating the twopart model is challenging because of high dimensional integration necessary to obtain the maximum likelihood estimates. We propose a Monte Carlo EM algorithm for estimating the maximum likelihood estimates of parameters. Through simulation study, we demonstrate the good performance of the MCEM method in parameter and standard error estimation. To illustrate, we apply the twopart model with correlated random effects and the model with autoregressive random effects to executive compensation data to investigate potential determinants of CEO stock option grants.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd1710
 Format
 Thesis
 Title
 A Class of Semiparametric Volatility Models with Applications to Financial Time Series.
 Creator

Chung, Steve S., Niu, XuFeng, Gallivan, Kyle, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

The autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models take the dependency of the conditional second moments. The idea behind ARCH/GARCH model is quite intuitive. For ARCH models, past squared innovations describes the present squared volatility. For GARCH models, both squared innovations and the past squared volatilities define the present volatility. Since their introduction, they have been extensively studied...
Show moreThe autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models take the dependency of the conditional second moments. The idea behind ARCH/GARCH model is quite intuitive. For ARCH models, past squared innovations describes the present squared volatility. For GARCH models, both squared innovations and the past squared volatilities define the present volatility. Since their introduction, they have been extensively studied and well documented in financial and econometric literature and many variants of ARCH/GARCH models have been proposed. To list a few, these include exponential GARCH(EGARCH), GJRGARHCH(or threshold GARCH), integrated GARCH(IGARCH), quadratic GARCH(QGARCH), and fractionally integrated GARCH(FIGARCH). The ARCH/GARCH models and their variant models have gained a lot of attention and they are still popular choice for modeling volatility. Despite their popularity, they suffer from model flexibility. Volatility is a latent variable and hence, putting a specific model structure violates this latency assumption. Recently, several attempts have been made in order to ease the strict structural assumptions on volatility. Both nonparametric and semiparametric volatility models have been proposed in the literature. We review and discuss these modeling techniques in detail. In this dissertation, we propose a class of semiparametric multiplicative volatility models. We define the volatility as a product of parametric and nonparametric parts. Due to the positivity restriction, we take the log and square transformations on the volatility. We assume that the parametric part is GARCH(1,1) and it serves as a initial guess to the volatility. We estimate GARCH(1,1) parameters by using conditional likelihood method. The nonparametric part assumes an additive structure. There may exist some loss of interpretability by assuming an additive structure but we gain flexibility. Each additive part is constructed from a sieve of Bernstein basis polynomials. The nonparametric component acts as an improvement for the parametric component. The model is estimated from an iterative algorithm based on boosting. We modified the boosting algorithm (one that is given in Friedman 2001) such that it uses a penalized least squares method. As a penalty function, we tried three different penalty functions: LASSO, ridge, and elastic net penalties. We found that, in our simulations and application, ridge penalty worked the best. Our semiparametric multiplicative volatility model is evaluated using simulations and applied to the six major exchange rates and SP 500 index. The results show that the proposed model outperforms the existing volatility models in both insample estimation and outofsample prediction.
Show less  Date Issued
 2014
 Identifier
 FSU_migr_etd8756
 Format
 Thesis
 Title
 Comparative mRNA Expression Analysis Leveraging Known Biochemical Interactions.
 Creator

Steppi, Albert Joseph, Zhang, Jinfeng, Sang, QingXiang, Wu, Wei, Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

We present two studies incorporating existing biological knowledge into differential gene expression analysis that attempt to place the results within a broader biological context. The studies investigate breast cancer health disparity between differing ethnic groups by comparing gene expression levels in tumor samples from patients from different ethnic populations. We incorporate existing knowledge by making comparisons not just between individual genes, but between sets of related genes...
Show moreWe present two studies incorporating existing biological knowledge into differential gene expression analysis that attempt to place the results within a broader biological context. The studies investigate breast cancer health disparity between differing ethnic groups by comparing gene expression levels in tumor samples from patients from different ethnic populations. We incorporate existing knowledge by making comparisons not just between individual genes, but between sets of related genes and networks of interacting genes. In the first study, a comparison is made between mRNA expression patterns in Asian and Caucasian American breast cancer samples in an attempt to better understand why there are significantly lower breast cancer incidence and mortality rates in Asian Americans compared to Caucasian Americans. In the second study, the expression levels of genes related to drug and xenobiotic metabolizing enzymes (DXME) are compared between African, Asian, and Caucasian American breast cancer patients. The expression of genes related to these enzymes has been found to significantly affect drug clearance and the onset of drug resistance. Both studies found differentially expressed genes and pathways that may be associated with health disparities between the three ethnic populations. A thorough investigation of the literature was made in order to understand the context in which these differences in gene expression could affect the development and progression of breast tumors, and to identify genes and pathways that may be differentially expressed between the ethnic groups in general but not associated with breast cancer. Many of the relevant differences in gene expression were found to be linked to factors such as diet and differences in body composition. The process of finding relevant pathways and sets of interacting genes to inform comparative mRNA expression analysis can be laborious and time consuming. The literature is expanding at an exponential rate, and there is little hope for research groups to be able to keep up with all of the latest research. It is becoming more common for journals to require authors to make their results available in public databases, but many results concerning biochemical interactions are only accessible in unstructured text. Extracting relationships and interactions from the biological literature using techniques from machine learning and natural language processing is an important and growing field of research. To gain a better understanding of this field, we participated in the BioCreative VI Track 4 challenge, which involved classifying PubMed abstracts that contain examples of proteinprotein interactions that are affected by a mutation. We discuss the model we developed and the lessons learned while participating in the competition. The problem of acquiring sufficient quantities of quality labeled data is a great obstacle preventing the improvement of performance. We present a web application we are developing to streamline the annotation of entityentity interactions in text. It makes use of a database of known interactions to locate passages that are likely to be relevant and offers a simple and concise user interface to minimize the cognitive burden on the annotator.
Show less  Date Issued
 2018
 Identifier
 2018_Sp_Steppi_fsu_0071E_14522
 Format
 Thesis
 Title
 A Comparison of Estimators in Hierarchical Linear Modeling: Restricted Maximum Likelihood versus Bootstrap via Minimum Norm Quadratic Unbiased Estimators.
 Creator

Delpish, Ayesha Nneka, Niu, XuFeng, Tate, Richard L., Huﬀer, Fred W., Zahn, Douglas, Department of Statistics, Florida State University
 Abstract/Description

The purpose of the study was to investigate the relative performance of two estimation procedures, the restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a twolevel hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations,...
Show moreThe purpose of the study was to investigate the relative performance of two estimation procedures, the restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a twolevel hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations, the importance of this assumption for the accuracy of multilevel parameter estimates and their standard errors was assessed using the accuracy index of relative bias and by observing the coverage percentages of 95% confidence intervals constructed for both estimation procedures. The study systematically varied the number of groups at level2 (30 versus 100), the size of the intraclass correlation (0.01 versus 0.20) and the distribution of the observations (normal versus chisquared with 1 degree of freedom). The number of groups and intraclass correlation factors produced effects consistent with those previously reported—as the number of groups increased, the bias in the parameter estimates decreased, with a more significant effect observed for those estimates obtained via REML. High levels of the intraclass correlation also led to a decrease in the efficiency of parameter estimation under both methods. Study results show that while both the restricted maximum likelihood and the bootstrap via MINQUE estimates of the fixed effects were accurate, the efficiency of the estimates was affected by the distribution of errors with the bootstrap via MINQUE procedure outperforming the REML. Both procedures produced less efficient estimators under the chisquared distribution, particularly for the variancecovariance component estimates.
Show less  Date Issued
 2006
 Identifier
 FSU_migr_etd0771
 Format
 Thesis
 Title
 A comparison of robust and least squares regression models using actual and simulated data.
 Creator

Gilbert, Scott Alan., Florida State University
 Abstract/Description

The purpose of this study was to compare several robust regression techniques to ordinary least squares (OLS) regression when analyzing bivariate and multivariate data. The bivariate analysis compared of the performance of alternative robust procedures in regard to the detection of outliers versus the standard OLS regression techniques. The bivariate analysis demonstrated the weaknesses of OLS regression and the standard OLS outlier diagnostic techniques when multiple outliers are present. In...
Show moreThe purpose of this study was to compare several robust regression techniques to ordinary least squares (OLS) regression when analyzing bivariate and multivariate data. The bivariate analysis compared of the performance of alternative robust procedures in regard to the detection of outliers versus the standard OLS regression techniques. The bivariate analysis demonstrated the weaknesses of OLS regression and the standard OLS outlier diagnostic techniques when multiple outliers are present. In addition, this research assessed the empirical performance of alpha and power under three nonnormal probability density functions using a Monte Carlo simulation., The first analysis focused on several bivariate data sets. Each data set was plotted and each of the regression models used to analyze the data. The usual results (e.g., R$\sp2$, regression coefficients, standard errors, and regression diagnostics) were examined to give a visual as well as empirical analysis of the models' performance in the presence of multiple outliers., The second component of this study entailed a Monte Carlo simulation of five robust regression models and OLS regression under four probability density functions. The variables included in the study were placed in one 2$\sp1$3$\sp2$ and two 3$\sp2$ factorial design repeated over four probability density functions, resulting in a total of 90 experimental runs of the Monte Carlo simulation. Random samples were generated and then transformed to fit desired distributional moment characteristics. The incremental null hypothesis was used as the basis to calculate empirical alpha and power values calculated., The analysis demonstrated the inadequacies of the standard OLS based outlier detection methods and explained how regression analysis could be improved if a robust regression method is used in parallel with OLS regression. The multivariate analysis demonstrated the robustness of the OLS regression model to three nonnormal populations. It further demonstrated a moderate inflation of alpha for the Mclass of robust regression model and a lack of power stability with the rank transform regression method., Based on the results of this study, recommendations were made for using robust regression methods and suggestions for future research offered.
Show less  Date Issued
 1992, 1992
 Identifier
 AAI9222385, 3087822, FSDT3087822, fsu:76632
 Format
 Document (PDF)
 Title
 THE COMPARISON OF SENSITIVITIES OF EXPERIMENTS (MAXIMUM LIKELIHOOD, RANDOM, FIXED, ANALYSIS OF VARIANCE).
 Creator

YOUNG, BARBARA NELSON., Florida State University
 Abstract/Description

The sensitivity of a measurement technique is defined to be its ability to detect differences among the treatments in a fixed effects design, or the presence of a between treatments component of variance in a random effects design. Consider an experiment, consisting of two identical subexperiments, designed specifically for the purpose of comparing two measurement techniques. It is assumed that the techniques of analysis of variance are applicable in analyzing the data obtained from the two...
Show moreThe sensitivity of a measurement technique is defined to be its ability to detect differences among the treatments in a fixed effects design, or the presence of a between treatments component of variance in a random effects design. Consider an experiment, consisting of two identical subexperiments, designed specifically for the purpose of comparing two measurement techniques. It is assumed that the techniques of analysis of variance are applicable in analyzing the data obtained from the two measurement techniques. The subexperiments may have either fixed or random treatment effects in either oneway or general block designs. It is assumed that the experiment yields bivariate observations from the two measurement methods which may or may not be independent. Likelihood ratio tests are used in the various settings of this dissertation to both extend current techniques and provide alternative methods for comparing the sensitivities of experiments.
Show less  Date Issued
 1985, 1985
 Identifier
 AAI8524629, 3086182, FSDT3086182, fsu:75665
 Format
 Document (PDF)
 Title
 A Comparison of Three Approaches to Confidence Interval Estimation for Coefficient Omega.
 Creator

Xu, Jie, Yang, Yanyun, Becker, Betsy Jane, Almond, Russell G., Florida State University, College of Education, Department of Educational Psychology and Learning Systems
 Abstract/Description

Coefficient Omega was introduced by McDonald (1978) as a reliability coefficient of composite scores for the congeneric model. Interval estimation (Neyman, 1937) on coefficient Omega provides a range of plausible values which is likely to capture the population reliability of composite scores. The Wald method, likelihood method, and biascorrected and accelerated bootstrap method are three methods to construct confidence interval for coefficient Omega (e.g., Cheung, 2009b; Kelley & Cheng,...
Show moreCoefficient Omega was introduced by McDonald (1978) as a reliability coefficient of composite scores for the congeneric model. Interval estimation (Neyman, 1937) on coefficient Omega provides a range of plausible values which is likely to capture the population reliability of composite scores. The Wald method, likelihood method, and biascorrected and accelerated bootstrap method are three methods to construct confidence interval for coefficient Omega (e.g., Cheung, 2009b; Kelley & Cheng, 2012; Raykov, 2002, 2004, 2009; Raykov & Marcoulides, 2004; Padilla & Divers, 2013). Very limited number of studies on the evaluation of these three methods can be found in the literature (e.g., Cheung, 2007, 2009a, 2009b; Kelley & Cheng, 2012; Padilla & Divers, 2013). No simulation study has been conducted to evaluate the performance of these three methods for interval construction on coefficient Omega. In the current simulation study, I assessed these three methods by comparing their empirical performance on interval estimation for coefficient Omega. Four factors were included in the simulation design: sample size, number of items, factor loading, and degree of nonnormality. Two thousands datasets were generated in R 2.15.0 (R Core Team, 2012) for each condition. For each generated dataset, three approaches (i.e., the Wald method, likelihood method, and biascorrected and accelerated bootstrap method) were used to construct 95% confidence interval of coefficient Omega in R 2.15.0. The results showed that when the data were multivariate normally distributed, three methods performed equally well and coverage probabilities were very close to the prespecified .95 confidence level. When the data were multivariate nonnormally distributed, coverage probabilities decreased and interval widths became wider for all three methods as the degree of nonnormality increased. In general, when the data departed from the multivariate normality, the BCa bootstrap method performed better than the other two methods, with relatively higher coverage probabilities, while the Wald and likelihood methods were comparable and yielded narrower interval width than the BCa bootstrap method.
Show less  Date Issued
 2014
 Identifier
 FSU_migr_etd9269
 Format
 Thesis
 Title
 A comparison of two methods of bootstrapping in a reliability model.
 Creator

Chiang, YuangChin., Florida State University
 Abstract/Description

We consider bootstrapping in the following reliability model which was considered by Doss, Freitag, and Proschan (1987). Available for testing is a sample of iid systems each having the same structure of m independent components. Each system is continuously observed until it fails. For every component in each system, either a failure time or a censoring time is recorded. A failure time is recorded if the component fails before or at the time of system failure; otherwise a censoring time is...
Show moreWe consider bootstrapping in the following reliability model which was considered by Doss, Freitag, and Proschan (1987). Available for testing is a sample of iid systems each having the same structure of m independent components. Each system is continuously observed until it fails. For every component in each system, either a failure time or a censoring time is recorded. A failure time is recorded if the component fails before or at the time of system failure; otherwise a censoring time is recorded. To estimate the distribution of the component lifelengths F$\sb1,\...$,F$\sb{\rm m}$, one can formally compute the KaplanMeier estimates F$\sb1,\...$,F$\sb{\rm m}$. Various quantities of interest, such as the probability that a new system will survive time t$\sb0$, may then be estimated by combining F$\sb1,\...$,F$\sb{\rm m}$ in a suitable way. In this model, bootstrapping can be carried out in two different ways. One can resample n systems at random from the original n systems. Alternatively, one can construct artificial systems by generating independent random lifelengths from the KaplanMeier estimates F$\sb{\rm j}$, and from those form artificial data. The two methods are distinct. We show that asymptotically, bootstrapping by either method yields correct answers. We also compare the two methods via simulation studies.
Show less  Date Issued
 1988, 1988
 Identifier
 AAI8906216, 3161719, FSDT3161719, fsu:77918
 Format
 Document (PDF)
 Title
 The computation of probabilities which involve spacings, with applications to the scan statistic.
 Creator

Lin, ChienTai., Florida State University
 Abstract/Description

We develop a methodology for evaluating probabilities which involve linear combinations of spacings and then present some applications of this methodology. The basic idea underlying our method was given by Huffer (1988): A recursion is used to break up the joint distribution of several linear combinations of spacings into a sum of simpler components. The same recursion is then applied to each of these components and so on. The process is continued until we obtain components which are simple...
Show moreWe develop a methodology for evaluating probabilities which involve linear combinations of spacings and then present some applications of this methodology. The basic idea underlying our method was given by Huffer (1988): A recursion is used to break up the joint distribution of several linear combinations of spacings into a sum of simpler components. The same recursion is then applied to each of these components and so on. The process is continued until we obtain components which are simple and easily expressed in closed form. We describe algorithms and a computer program (written in C) which implement this approach. Our approach has two advantages. First, it is fairly general and can be used to solve a variety of problems involving linear combinations of spacings. Secondly, because the output of our procedure is a polynomial whose coefficients are computed exactly, we can supply numerical answers which are accurate to any required degree of precision. We apply our program to compute the distribution of the scan statistic for small sample sizes. We also use the recursion and computer program to calculate the lower order moments of the number of clumps in randomly distributed points. We can use these moments to obtain bounds and approximations for the distribution of the scan statistic. Our approximations are based on fitting a compound Poisson distribution to the moments of the number of clumps.
Show less  Date Issued
 1993, 1993
 Identifier
 AAI9416150, 3088291, FSDT3088291, fsu:77095
 Format
 Document (PDF)
 Title
 Conditional bootstrap methods for censored data.
 Creator

Kim, JiHyun., Florida State University
 Abstract/Description

We first consider the random censorship model of survival analysis. The pairs of positive random variables ($X\sb{i},Y\sb{i}$), i = 1,$\...$,n, are independent and identically distributed, with distribution functions F(t) = P($X\sb{i} \leq\ t$) and G(t) = P($Y\sb{i} \leq\ t$) and the Y's are independent of the X's. We observe only ($T\sb{i},\delta\sb{i}$), i = 1,$\...$,n, where $T\sb{i}$ = min($X\sb{i},Y\sb{i}$) and $\delta\sb{i}$ = I($X\sb{i} \leq\ Y\sb{i}$). The X's represent survival times...
Show moreWe first consider the random censorship model of survival analysis. The pairs of positive random variables ($X\sb{i},Y\sb{i}$), i = 1,$\...$,n, are independent and identically distributed, with distribution functions F(t) = P($X\sb{i} \leq\ t$) and G(t) = P($Y\sb{i} \leq\ t$) and the Y's are independent of the X's. We observe only ($T\sb{i},\delta\sb{i}$), i = 1,$\...$,n, where $T\sb{i}$ = min($X\sb{i},Y\sb{i}$) and $\delta\sb{i}$ = I($X\sb{i} \leq\ Y\sb{i}$). The X's represent survival times, the Y's represent censoring times. Efron (1981) proposed two bootstrap methods for the random censorship model and showed that they are distributionally the same. Akritas (1986) established the weak convergence of the bootstrapped KaplanMeier estimator of F when bootstrapping is done by this method. Let us now consider bootstrapping more closely. Suppose that we wish to estimate the variance of F(t). If we knew the Y's then we would condition on them by the ancillarity principle, since the distribution of the Y's does not depend on F. That is, we would want to estimate Var$\{$F(t)$\vert Y\sb1,\...,Y\sb{n}\}$. Unfortunately, in the random censorship model we do not see all the Y's. If $\delta\sb{i}$ = 0 we see the exact value of $Y\sb{i}$, but if $\delta\sb{i}$ = 1 we know only that $Y\sb{i} > T\sb{i}$. Let us denote this information on the Y's by ${\cal C}$. Thus, what we want to estimate is Var$\{$F(t)$\vert{\cal C}\}$. Efron's scheme is appropriate for estimating the unconditional variance. We propose a new bootstrap method which provides an estimate of Var$\{$F(t)$\vert{\cal C}\}$., In this research we show that the KaplanMeier estimator of F formed by the new bootstrap method has the same limiting distribution as the one by Efron's approach. The results of simulation studies assessing the small sample performance of the two bootstrap methods are reported. We also consider the model in which the $X\sb{i}$'s are censored by the $Y\sb{i}$'s and also by known fixed constants, and propose an appropriate bootstrap method for that model. This bootstrap method is a readily modified version of the new bootstrap method above.
Show less  Date Issued
 1990, 1990
 Identifier
 AAI9113938, 3162201, FSDT3162201, fsu:78399
 Format
 Document (PDF)
 Title
 Contributions to the theory of arrangement increasing functions.
 Creator

Proschan, Michael Arthur., Florida State University
 Abstract/Description

A function $f(\underline{x})$ which increases each time we transpose an out of order pair of coordinates, $x\sb{j} > x\sb{k}$ for some $j x\sb{k}$ by transposing the two x coordinates. The theory of AI functions is tailor made for ranking and selection problems, in which case we assume that the density $f(\underline{\theta}$,$\underline{x})$ of observations with respective parameters $\theta\sb1, \..., \theta\sb{n}$ is AI, and the goal is to determine the largest or smallest parameters., In...
Show moreA function $f(\underline{x})$ which increases each time we transpose an out of order pair of coordinates, $x\sb{j} > x\sb{k}$ for some $j x\sb{k}$ by transposing the two x coordinates. The theory of AI functions is tailor made for ranking and selection problems, in which case we assume that the density $f(\underline{\theta}$,$\underline{x})$ of observations with respective parameters $\theta\sb1, \..., \theta\sb{n}$ is AI, and the goal is to determine the largest or smallest parameters., In this dissertation we present new applications of AI functions in such areas as biology and reliability, and we generalize the notion of AI functions. We consider multivector extensions, some with and one without respect to parameter vectors, and we connect these. Another generalization (TEGO) is motivated by the connection between total positivity (TP) and AI. TEGO results are shown to imply AI and TP results. We also define and develop a partial ordering on densities of rank vectors. The theory, which involves finding the extreme points of the convex set of AI rank densities, is then used to establish some power results of rank tests.
Show less  Date Issued
 1989, 1989
 Identifier
 AAI9002934, 3161869, FSDT3161869, fsu:78068
 Format
 Document (PDF)
 Title
 Covariance on Manifolds.
 Creator

Balov, Nikolay H. (Nikolay Hristov), Srivastava, Anuj, Klassen, Eric, Patrangenaru, Victor, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

With ever increasing complexity of observational and theoretical data models, the sufficiency of the classical statistical techniques, designed to be applied only on vector quantities, is being challenged. Nonlinear statistical analysis has become an area of intensive research in recent years. Despite the impressive progress in this direction, a unified and consistent framework has not been reached. In this regard, the following work is an attempt to improve our understanding of random...
Show moreWith ever increasing complexity of observational and theoretical data models, the sufficiency of the classical statistical techniques, designed to be applied only on vector quantities, is being challenged. Nonlinear statistical analysis has become an area of intensive research in recent years. Despite the impressive progress in this direction, a unified and consistent framework has not been reached. In this regard, the following work is an attempt to improve our understanding of random phenomena on nonEuclidean spaces. More specifically, the motivating goal of the present dissertation is to generalize the notion of distribution covariance, which in standard settings is defined only in Euclidean spaces, on arbitrary manifolds with metric. We introduce a tensor field structure, named covariance field, that is consistent with the heterogeneous nature of manifolds. It not only describes the variability imposed by a probability distribution but also provides alternative distribution representations. The covariance field combines the distribution density with geometric characteristics of its domain and thus fills the gap between these two.We present some of the properties of the covariance fields and argue that they can be successfully applied to various statistical problems. In particular, we provide a systematic approach for defining parametric families of probability distributions on manifolds, parameter estimation for regression analysis, nonparametric statistical tests for comparing probability distributions and interpolation between such distributions. We then present several application areas where this new theory may have potential impact. One of them is the branch of directional statistics, with domain of influence ranging from geosciences to medical image analysis. The fundamental level at which the covariance based structures are introduced, also opens a new area for future research.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd1045
 Format
 Thesis
 Title
 Critical Issues in Survey MetaAnalysis.
 Creator

Gozutok, Ahmet Serhat, Becker, Betsy Jane, Huffer, Fred W., Yang, Yanyun, Paek, Insu, Florida State University, College of Education, Department of Educational Psychology and...
Show moreGozutok, Ahmet Serhat, Becker, Betsy Jane, Huffer, Fred W., Yang, Yanyun, Paek, Insu, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
Show less  Abstract/Description

In research synthesis, researchers may aim at summarizing peoples' attitudes and perceptions of phenomena that have been assessed using different measures. Selfreport rating scales are among the most commonly used measurement tools to quantify such latent constructs in education and psychology. However, selfreport ratingscale questions measuring the same construct may differ from each other in many ways. Scale format, number of response options, wording of questions, and labeling of...
Show moreIn research synthesis, researchers may aim at summarizing peoples' attitudes and perceptions of phenomena that have been assessed using different measures. Selfreport rating scales are among the most commonly used measurement tools to quantify such latent constructs in education and psychology. However, selfreport ratingscale questions measuring the same construct may differ from each other in many ways. Scale format, number of response options, wording of questions, and labeling of response option categories may vary across questions. Consequently, variations across the measures of the same construct bring about the issue of comparability of the results across the studies in metaanalytic investigations. In this study, I examine the complexities of summarizing the results of different survey questions about the same construct in the metaanalytic fashion. More specifically, this study focuses on the practical problems that arise when combining survey items that differ from one another in the wording of question stems, numbers of response option categories, scale direction (i.e., unipolar and bipolar scales), response scale labeling (i.e., fullylabeled scales and endpointslabeled scales), and responseoption labeling (e.g., "extremely happy"  "completely happy"  "most happy", "pretty happy", "quite happy" "moderately happy", and "not at all happy"  "least happy"  "most unhappy"). In addition, I propose practical solutions to handle the issues that arise due to such variations when conducting a metaanalysis. I discuss the implications of the proposed solutions from the perspective of metaanalysis. Examples are obtained from the collection of studies in the World Happiness Database (Veenhoven, 2006), which includes various singleitem happiness measures.
Show less  Date Issued
 2018
 Identifier
 2018_Fall_Gozutok_fsu_0071E_14866
 Format
 Thesis
 Title
 Cumulative regression function methods in survival analysis and time series.
 Creator

Zhang, MeiJie., Florida State University
 Abstract/Description

One may estimate a conditional hazard function from grouped (and possibly censored) survival data by the time and covariate specific occurrence/exposure rate. Asymptotic results for cumulative versions of this estimator are developed, utilizing the general framework of counting processes. In particular, a grouped data based goodnessoffit test for Cox's proportional hazard model is given. Various constraints on the asymptotic behavior of the widths of the calendar periods and covariate...
Show moreOne may estimate a conditional hazard function from grouped (and possibly censored) survival data by the time and covariate specific occurrence/exposure rate. Asymptotic results for cumulative versions of this estimator are developed, utilizing the general framework of counting processes. In particular, a grouped data based goodnessoffit test for Cox's proportional hazard model is given. Various constraints on the asymptotic behavior of the widths of the calendar periods and covariate strata employed in grouping the data are needed to prove the results. Actual performance of the estimators and test statistics is evaluated by Monte Carlo methods., We also consider the problem of identifying the class of time series model to which a series belongs based on observation of part of the series. Techniques of nonparametric estimation have been applied to this problem by Auestad and Tjostheim (Biometrika 77(1990):669687) who used kernel estimates of the onestep lagged conditional mean and variance functions. We study cumulative versions of such estimates. These are more stable than the kernel estimates and can be used to construct confidence bands for the underlying cumulative mean and variance functions. Goodnessoffit tests for specific parametric models are also developed.
Show less  Date Issued
 1991, 1991
 Identifier
 AAI9202323, 3087663, FSDT3087663, fsu:76478
 Format
 Document (PDF)
 Title
 DETERMINING A SUFFICIENT LEVEL OF INTERRATER RELIABILITY (POWER ANALYSIS, MISCLASSIFICATION, SAMPLE SIZE).
 Creator

RASP, JOHN M., Florida State University
 Abstract/Description

The reliability of a test or measurement procedure is, generally speaking, an index of the consistency of its results. Interrater reliability assesses the consistency of judgements among a set of raters. We model the observation taken on a subject by an unreliable procedure as the sum of a true score with mean (mu) and variance (sigma)(,T)('2) and an error term with mean 0 and variance (sigma)(,E)('2). The reliability coefficient then is (rho) = (sigma)(,T)('2)/((sigma)(,T)('2) + (sigma)(,E)...
Show moreThe reliability of a test or measurement procedure is, generally speaking, an index of the consistency of its results. Interrater reliability assesses the consistency of judgements among a set of raters. We model the observation taken on a subject by an unreliable procedure as the sum of a true score with mean (mu) and variance (sigma)(,T)('2) and an error term with mean 0 and variance (sigma)(,E)('2). The reliability coefficient then is (rho) = (sigma)(,T)('2)/((sigma)(,T)('2) + (sigma)(,E)('2))., The reliability of an instrument or rating procedure is generally evaluated in an initial experiment (or series of experiments) known as a "reliability study." Once an instrument is established as having some degree of reliability, it is then used as a measurement tool in subsequent research, known as "decision studies.", An unreliable procedure measures imperfectly. The impact of the error in measurement is investigated as it relates to three broad areas of statistical procedures: estimation, hypothesis testing, and decisionmaking., An unreliable measurement decreases the precision of estimates. The effect of an unreliable measurement on the width of a confidence interval for the population mean is examined. Also, an expression is developed to facilitate estimation of the reliability of a test or measurement in a decision study when the populations of interest may differ from those in the reliability study., An unreliable instrument weakens hypothesis tests. The extent to which lack of reliability attenuates the power of the twosample ttest, the Ftest in the analysis of variance, and the ttest for statistically significant correlation between two variables is investigated., An unreliable measurement engenders false classifications. A dichotomous decision is considered, and expressions for the probability of misclassifying a subject by a rating procedure with a given reliability are developed. Overall as well as directional misclassification rates are found under the model of true scores and errors distributed as independent normals. Effects of departures from this model, by heavytailed and skewed true score and error distributions, and by errors whose variance is a function of the true score, are considered. A general expression for this misclassification probability is found. A confidence interval for the misclassification probability is developed., These results provide tools for a researcher better to make decisions concerning the design of an experiment. They permit the costs of increased reliability to be more knowledgeably compared with the consequences of using an unreliable measurement procedure in a given situation.
Show less  Date Issued
 1984, 1984
 Identifier
 AAI8416723, 3085837, FSDT3085837, fsu:75324
 Format
 Document (PDF)
 Title
 Developing SRSF Shape Analysis Techniques for Applications in Neuroscience and Genomics.
 Creator

Wesolowski, Sergiusz, Wu, Wei, Bertram, R. (Richard), Srivastava, Anuj, Beerli, Peter, Mio, Washington, Florida State University, College of Arts and Sciences, Department of...
Show moreWesolowski, Sergiusz, Wu, Wei, Bertram, R. (Richard), Srivastava, Anuj, Beerli, Peter, Mio, Washington, Florida State University, College of Arts and Sciences, Department of Mathematics
Show less  Abstract/Description

Dissertation focuses on exploring the capabilities of the SRSF statistical shape analysis framework through various applications. Each application gives rise to a specific mathematical shape analysis model. The theoretical investigation of the models, driven by real data problems, give rise to new tools and theorems necessary to conduct a sound inference in the space of shapes. From theoretical standpoint the robustness results are provided for the model parameters estimation and an ANOVA...
Show moreDissertation focuses on exploring the capabilities of the SRSF statistical shape analysis framework through various applications. Each application gives rise to a specific mathematical shape analysis model. The theoretical investigation of the models, driven by real data problems, give rise to new tools and theorems necessary to conduct a sound inference in the space of shapes. From theoretical standpoint the robustness results are provided for the model parameters estimation and an ANOVAlike statistical testing procedure is discussed. The projects were a result of the collaboration between theoretical and applicationfocused research groups: the Shape Analysis Group at the Department of Statistics at Florida State University, the Center of Genomics and Personalized Medicine at FSU and the FSU's Department of Neuroscience. As a consequence each of the projects consists of two aspects—the theoretical investigation of the mathematical model and the application driven by a real life problem. The applications components, are similar from the data modeling standpoint. In each case the problem is set in an infinite dimensional space, elements of which are experimental data points that can be viewed as shapes. The three projects are: ``A new framework for Euclidean summary statistics in the neural spike train space''. The project provides a statistical framework for analyzing the spike train data and a new noise removal procedure for neural spike trains. The framework adapts the SRSF elastic metric in the space of point patterns to provides a new notion of the distance. ``SRSF shape analysis for sequencing data reveal new differentiating patterns''. This project uses the shape interpretation of the Next Generation Sequencing data to provide a new point of view of the exon level gene activity. The novel approach reveals a new differential gene behavior, that can't be captured by the stateofthe art techniques. Code is available online on github repository. ``How changes in shape of nucleosomal DNA near TSS influence changes of gene expression''. The result of this work is the novel shape analysis model explaining the relation between the change of the DNA arrangement on nucleosomes and the change in the differential gene expression.
Show less  Date Issued
 2017
 Identifier
 FSU_FALL2017_Wesolowski_fsu_0071E_14177
 Format
 Thesis
 Title
 Discrimination and Calibration of Prognostic Survival Models.
 Creator

Simino, Jeannette M., Hollander, Myles, McGee, Daniel, Hurt, Myra, Niu, XuFeng, Department of Statistics, Florida State University
 Abstract/Description

Clinicians employ prognostic survival models for diseases such as coronary heart disease and cancer to inform patients about risks, treatments, and clinical decisions (Altman and Royston 2000). These prognostic models are not useful unless they are valid in the population to which they are applied. There are no generally accepted algorithms for assessing the validity of an external survival model in a new population. Researchers often invoke measures of predictive accuracy, the degree to...
Show moreClinicians employ prognostic survival models for diseases such as coronary heart disease and cancer to inform patients about risks, treatments, and clinical decisions (Altman and Royston 2000). These prognostic models are not useful unless they are valid in the population to which they are applied. There are no generally accepted algorithms for assessing the validity of an external survival model in a new population. Researchers often invoke measures of predictive accuracy, the degree to which predicted outcomes match observed outcomes (Justice et al. 1999). One component of predictive accuracy is discrimination, the ability of the model to correctly rank the individuals in the sample by risk. A common measure of discrimination for prognostic survival models is the concordance index, also called the cstatistic. We utilize the concordance index to determine the discrimination of Framinghambased Cox and Loglogistic models of coronary heart disease (CHD) death in cohorts from the Diverse Populations Collaboration, a collection of studies that encompasses many ethnic, geographic, and socioeconomic groups. Pencina and D'Agostino presented a confidence interval for the concordance index when assessing the discrimination of an external prognostic model. We perform simulations to determine the robustness of their confidence interval when measuring discrimination during internal validation. The Pencina and D'Agostino confidence interval is not valid in the internal validation setting because their assumption of mutually independent observations is violated. We compare the Pencina and D'Agostino confidence interval to a bootstrap confidence interval that we propose that is valid for the internal validation. We specifically discern the performance of the interval when the same sample is used to both fit and determine the validity of a prognostic model. The framework for our simulations is a Weibull proportional hazards model of CHD death fit to the Framingham exam 4 data. We then focus on the second component of accuracy, calibration, which measures the agreement between the observed and predicted event rates for groups of patients (Altman and Royston 2000). In 2000, van Houwelingen introduced a method called validation by calibration to allow a clinician to assess the validity of a wellaccepted published survival model on his/her own patient population and adjust the published model to fit that population. Van Houwelingen embeds the published model into a new model with only 3 parameters which helps combat the overfitting that occurs when models with many covariates are fit on data sets with a small number of events. We explore validation by calibration as a tool to adjust models when an external model over or underestimates risk. Van Houwelingen discusses the general method and then focusses on the proportional hazards model. There are situations where proportional hazards may not hold, thus we extend the methodology to the Loglogistic accelerated failure time model. We perform validation by calibration of Framinghambased Cox and Loglogistic models of CHD death to cohorts from the Diverse Populations Collaboration. Lastly, we conduct simulations that investigate the power of the global Wald validation by calibration test. We study its power to reject an invalid proportional hazards or Loglogistic accelerated failure time model under various scale and/or shape misspecifications.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd0328
 Format
 Thesis
 Title
 Dynamic and Stochastic Transition of Traffic Conditions and Its Application in Urban Traffic Mobility.
 Creator

Kidando, Emmanuel, Moses, Ren, Duncan, Michael Douglas, Ozguven, Eren Erman, Sobanjo, John Olusegun, Sando, Thobias M., Florida State University, FAMUFSU College of Engineering...
Show moreKidando, Emmanuel, Moses, Ren, Duncan, Michael Douglas, Ozguven, Eren Erman, Sobanjo, John Olusegun, Sando, Thobias M., Florida State University, FAMUFSU College of Engineering, Department of Civil and Environmental Engineering
Show less  Abstract/Description

Analytical models developed using field data can provide useful information with acceptable confidence to evaluate and predict the operational characteristics of a highway. As such, this study presents statistical models that can be used to estimate the travel time or speed distribution, cluster different traffic conditions, to model the dynamic transition of traffic regimes (DTR), and quantify the disparityeffects on the DTR associated with different lateral lane positions (i.e., lane near...
Show moreAnalytical models developed using field data can provide useful information with acceptable confidence to evaluate and predict the operational characteristics of a highway. As such, this study presents statistical models that can be used to estimate the travel time or speed distribution, cluster different traffic conditions, to model the dynamic transition of traffic regimes (DTR), and quantify the disparityeffects on the DTR associated with different lateral lane positions (i.e., lane near shoulder, middle lane(s) and lane near a median) as well as different days of the week. In the analysis, this study uses Bayesian frameworks to estimate the model parameters. These frameworks reduce the impact of model overfitting and also incorporate uncertainty in the estimates. Data from a freeway corridor along I295 located in Jacksonville, Florida were selected for analysis. It includes data from individual microwave vehicle sensors, segment level aggregated traffic data and data aggregated at a corridor level. The proposed probabilistic frameworks developed by this study can be a useful resource in detecting and evaluating different traffic conditions, which can facilitate the planning action to implement congestionrelated countermeasures in urban areas. In addition, findings from the hierarchical regression model presented by the current study can be used in the application of intelligent transportation systems, mainly in the dynamic lanemanagement strategy.
Show less  Date Issued
 2019
 Identifier
 2019_Spring_Kidando_fsu_0071E_15049
 Format
 Thesis
 Title
 The Effect of Risk Factors on Coronary Heart Disease: An AgeRelevant Multivariate Meta Analysis.
 Creator

Li, Yan, McGee, Dan, She, Yiyuan, Eberstein, Ike, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The importance of major risk factors, such as hypertension, total cholesterol, body mass index, diabetes, smoking, for predicting incidence and mortality of Coronary Heart Disease (CHD) is well known. In light of the fact that age is also a major risk factor for CHD death, a natural question is whether the risk effects on CHD change with age. This thesis focuses on examining the interaction between age and risk factors using data from multiple studies containing differing age ranges. The aim...
Show moreThe importance of major risk factors, such as hypertension, total cholesterol, body mass index, diabetes, smoking, for predicting incidence and mortality of Coronary Heart Disease (CHD) is well known. In light of the fact that age is also a major risk factor for CHD death, a natural question is whether the risk effects on CHD change with age. This thesis focuses on examining the interaction between age and risk factors using data from multiple studies containing differing age ranges. The aim of my research is to use statistical methods to determine whether we can combine these diverse results to obtain an overall summary, using which one can find how the risk effects on CHD death change with age. One intuitive approach is to use classical meta analysis based on generalized linear models. More specifically, one can fit a logistic model with CHD death as response and age, a risk factor and their interaction as covariates for each of the studies, and conduct meta analysis on every set of three coefficients in the multivariate setting to obtain 'synthesized' coefficients. Another aspect of the thesis is a new method, meta analysis with respect to curves that goes beyond linear models. The basic idea is that one can choose the same spline with the same knots on covariates, say age and systolic blood pressure (SBP), for all the studies to ensure common basis functions. The knotbased tensor product basis coefficients obtained from penalized logistic regression can be used for multivariate meta analysis. Using the common basis functions and the 'synthesized' knotbased basis coefficients from meta analysis, a twodimensional smooth surface on the ageSBP domain is estimated. By cutting through the smooth surface along two axes, the resulting slices show how the risk effect on CHD death change at an arbitrary age as well as how the age effect on CHD death change at an arbitrary SBP value. The application to multiple studies will be presented.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd1428
 Format
 Thesis
 Title
 An Effective and Efficient Approach for Clusterability Evaluation.
 Creator

Adolfsson, Andreas, Ackerman, Margareta, Brownstein, Naomi Chana, Haiduc, Sonia, Tyson, Gary Scott, Florida State University, College of Arts and Sciences, Department of...
Show moreAdolfsson, Andreas, Ackerman, Margareta, Brownstein, Naomi Chana, Haiduc, Sonia, Tyson, Gary Scott, Florida State University, College of Arts and Sciences, Department of Computer Science
Show less  Abstract/Description

Clustering is an essential data mining tool that aims to discover inherent cluster structure in data. As such, the study of clusterability, which evaluates whether data possesses such structure, is an integral part of cluster analysis. Yet, despite their central role in the theory and application of clustering, current notions of clusterability fall short in two crucial aspects that render them impractical; most are computationally infeasible and others fail to classify the structure of real...
Show moreClustering is an essential data mining tool that aims to discover inherent cluster structure in data. As such, the study of clusterability, which evaluates whether data possesses such structure, is an integral part of cluster analysis. Yet, despite their central role in the theory and application of clustering, current notions of clusterability fall short in two crucial aspects that render them impractical; most are computationally infeasible and others fail to classify the structure of real datasets. In this thesis, we propose a novel approach to clusterability evaluation that is both computationally efficient and successfully captures the structure in real data. Our method applies multimodality tests to the (onedimensional) set of pairwise distances based on the original, potentially highdimensional data. We present extensive analyses of our approach for both the Dip and Silverman multimodality tests on real data as well as 17,000 simulations, demonstrating the success of our approach as the first practical notion of clusterability.
Show less  Date Issued
 2016
 Identifier
 FSU_SUMMER2017_Adolfsson_fsu_0071N_13478
 Format
 Thesis
 Title
 Effects of inspection error on optimal inspection policies and software fault detection models.
 Creator

Herge, Donna Carol., Florida State University
 Abstract/Description

Inspection policies are essential for many types of systems for which the status (functioning or failed) can be determined only by actual inspection. Two types of inspection error may occur. A functioning system may be incorrectly assessed as having failed or a failed system may be incorrectly assessed as functioning. These errors are designated as Types I and II respectively, and their impact on optimal inspection policies and software fault detection models is analyzed. For a periodic...
Show moreInspection policies are essential for many types of systems for which the status (functioning or failed) can be determined only by actual inspection. Two types of inspection error may occur. A functioning system may be incorrectly assessed as having failed or a failed system may be incorrectly assessed as functioning. These errors are designated as Types I and II respectively, and their impact on optimal inspection policies and software fault detection models is analyzed. For a periodic inspection model with Type I error, an optimal replacement age is obtained, then monotonicity and asymptotic properties of the longrun expected cost per unit of time are presented. Type I error is incorporated into a cumulative damage model. When the failure density is reverse rule of order 2, an algorithm to compute an optimal inspection sequence is derived, and it is proven that optimal intervals are increasing. Extending the optimal inspection sequence model of Barlow, Hunter, and Proschan to include Type II inspection error, it is proven that optimal intervals are nonincreasing for a PF$\sb2$ density, and an algorithm to compute optimal intervals is derived. Additionally, monotonicity and majorization results are obtained for an optimal inspection sequence with Type II error. The impact of faultdetection error on a software optimal release time model is shown. The effect of fault diversity on the JelinskiMoranda model and how this relates to imperfect fault detection is demonstrated.
Show less  Date Issued
 1992, 1992
 Identifier
 AAI9306034, 3087973, FSDT3087973, fsu:76780
 Format
 Document (PDF)
 Title
 Elastic Functional Principal Component Analysis for Modeling and Testing of Functional Data.
 Creator

Duncan, Megan, Srivastava, Anuj, Klassen, E., Huffer, Fred W., Wu, Wei, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Statistical analysis of functional data requires tools for comparing, summarizing and modeling observed functions as elements of a function space. A key issue in Functional Data Analysis (FDA) is the presence of the phase variability in the observed data. A successful statistical model of functional data has to account for the presence of phase variability. Otherwise the ensuing inferences can be inferior. Recent methods for FDA include steps for phase separation or functional alignment. For...
Show moreStatistical analysis of functional data requires tools for comparing, summarizing and modeling observed functions as elements of a function space. A key issue in Functional Data Analysis (FDA) is the presence of the phase variability in the observed data. A successful statistical model of functional data has to account for the presence of phase variability. Otherwise the ensuing inferences can be inferior. Recent methods for FDA include steps for phase separation or functional alignment. For example, Elastic Functional Principal Component Analysis (Elastic FPCA) uses the strengths of Functional Principal Component Analysis (FPCA), along with the tools from Elastic FDA, to perform joint phaseamplitude separation and modeling. A related problem in FDA is to quantify and test for the amount of phase in a given data. We develop two types of hypothesis tests for testing the significance of phase variability: a metricbased approach and a modelbased approach. The metricbased approach treats phase and amplitude as independent components and uses their respective metrics to apply the FriedmanRafsky Test, Schilling's Nearest Neighbors, and Energy Test to test the differences between functions and their amplitudes. In the modelbased test, we use Concordance Correlation Coefficients as a tool to quantify the agreement between functions and their reconstructions using FPCA and Elastic FPCA. We demonstrate this framework using a number of simulated and real data, including weather, tecator, and growth data.
Show less  Date Issued
 2018
 Identifier
 2018_Sp_Duncan_fsu_0071E_14470
 Format
 Thesis
 Title
 Elastic Functional Regression Model.
 Creator

Ahn, Kyungmin, Srivastava, Anuj, Klassen, E., Wu, Wei, Huffer, Fred W., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Functional variables serve important roles as predictors in a variety of pattern recognition and vision applications. Focusing on a specific subproblem, termed scalaronfunction regression, most current approaches adopt the standard L2 inner product to form a link between functional predictors and scalar responses. These methods may perform poorly when predictor functions contain nuisance phase variability, i.e., predictors are temporally misaligned due to noise. While a simple solution...
Show moreFunctional variables serve important roles as predictors in a variety of pattern recognition and vision applications. Focusing on a specific subproblem, termed scalaronfunction regression, most current approaches adopt the standard L2 inner product to form a link between functional predictors and scalar responses. These methods may perform poorly when predictor functions contain nuisance phase variability, i.e., predictors are temporally misaligned due to noise. While a simple solution could be to prealign predictors as a preprocessing step, before applying a regression model, this alignment is seldom optimal from the perspective of regression. In this dissertation, we propose a new approach, termed elastic functional regression, where alignment is included in the regression model itself, and is performed in conjunction with the estimation of other model parameters. This model is based on a normpreserving warping of predictors, not the standard time warping of functions, and provides better prediction in situations where the shape or the amplitude of the predictor is more useful than its phase. We demonstrate the effectiveness of this framework using simulated and real data.
Show less  Date Issued
 2018
 Identifier
 2018_Sp_Ahn_fsu_0071E_14452
 Format
 Thesis
 Title
 Elastic Shape Analysis of RNAs and Proteins.
 Creator

Laborde, Jose M., Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

Proteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of Structural Bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and...
Show moreProteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of Structural Bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and classifications. This work applies Elastic Shape Analysis (ESA), a method recently developed in computer vision, to construct rigorous mathematical and statistical frameworks for the comparison, clustering and classification of proteins and RNAs. ESA treats bio molecular structures as 3D parameterized curves, which are represented with a special map called the square root velocity function (SRVF). In the resulting shape space of elastic curves, one can perform statistical analysis of curves as if they were random variables. One can compare, match and deform one curve into another, or as well as compute averages and covariances of curve populations, and perform hypothesis testing and classification of curves according to their shapes. We have successfully applied ESA to the comparison and classification of protein and RNA structures. We further extend the ESA framework to incorporate additional nongeometric information that tags the shape of the molecules (namely, the sequence of nucleotide/aminoacid letters for RNAs/proteins and, in the latter case, also the labels for the socalled secondary structure). The biological representation is chosen such that the ESA framework continues to be mathematically formal. We have achieved superior classification of RNA functions compared to stateoftheart methods on benchmark RNA datasets which has led to the publication of this work in the journal, Nucleic Acids Research (NAR). Based on the ESA distances, we have also developed a fast method to classify protein domains by using a representative set of protein structures generated by a clusteringbased technique we call Multiple Centroid Class Partitioning (MCCP). Comparison with other standard approaches showed that MCCP significantly improves the accuracy while keeping the representative set smaller than the other methods. The current schemes for the classification and organization of proteins (such as SCOP and CATH) assume a discrete space of their structures, where a protein is classified into one and only one class in a hierarchical tree structure. Our recent study, and studies by other researchers, showed that the protein structure space is more continuous than discrete. To capture the complex but quantifiable continuous nature of protein structures, we propose to organize these molecules using a network model, where individual proteins are mapped to possibly multiple nodes of classes, each associated with a probability. Structural classes will then be connected to form a network based on overlaps of corresponding probability distributions in the structural space.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd8586
 Format
 Thesis
 Title
 An Ensemble Approach to Predicting Health Outcomes.
 Creator

Nilles, Ester Kim, McGee, Dan, Zhang, Jinfeng, Eberstein, Isaac, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Heart disease and premature birth continue to be the leading cause of mortality and neonatal mortality in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth is highly valuable information for early interventions. The past few decades, identification of patients at high health risk...
Show moreHeart disease and premature birth continue to be the leading cause of mortality and neonatal mortality in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth is highly valuable information for early interventions. The past few decades, identification of patients at high health risk have been based on logistic regression or Cox proportional hazards models. In more recent years, machine learning models have grown in popularity within the medical field for their superior predictive and classification performances over the classical statistical models. However, their performances in heart disease and premature birth predictions have been comparable and inconclusive, leaving the question of which model most accurately reflects the data difficult to resolve. Our aim is to incorporate information learned by different models into one final model that will generate superior predictive performances. We first compare the widely used machine learning models  the multilayer perceptron network, knearest neighbor and support vector machine  to the statistical models logistic regression and Cox proportional hazards. Then the individual models are combined into one in an ensemble approach, also referred to as ensemble modeling. The proposed approaches include SSEweighted, AUCweighted, logistic and flexible naive Bayes. The individual models are unique and capture different aspects of the data, but as expected, no individual one outperforms any other. The ensemble approach is an easily computed method that eliminates the need to select one model, integrates the strengths of different models, and generates optimal performances. Particularly in cases where the risk factors associated to an outcome are elusive, such as in premature birth, the ensemble models significantly improve their prediction.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd7530
 Format
 Thesis
 Title
 Envelopes, Subspace Learning and Applications.
 Creator

Wang, Wenjing, Zhang, Xin, Tao, Minjing, Li, Wen, Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Envelope model is a nascent dimension reduction technique. We focus on extending the envelope methodology to broader applications. In the first part of this thesis we propose a common reducing subspace model that can simultaneously estimating covariance, precision matrices and their differences across multiple populations. This model leads to substantial dimension reduction and efficient parameter estimation. We explicitly quantify the efficiency gain through an asymptotic analysis. In the...
Show moreEnvelope model is a nascent dimension reduction technique. We focus on extending the envelope methodology to broader applications. In the first part of this thesis we propose a common reducing subspace model that can simultaneously estimating covariance, precision matrices and their differences across multiple populations. This model leads to substantial dimension reduction and efficient parameter estimation. We explicitly quantify the efficiency gain through an asymptotic analysis. In the second part, we propose a set of new mixture models called CLEMM (Clustering with Envelope Mixture Models) that is based on the widely used Gaussian mixture model assumptions. The proposed CLEMM framework and the associated envelopeEM algorithms provides the foundations for envelope methodology in unsupervised and semisupervised learning problems. We also illustrate the performance of these models with simulation studies and empirical applications. Also, we have extended the envelope discriminant analysis from vector data to tensor data in the third part of this thesis. Another study on copulabased models for forecasting realized volatility matrix is included, which is an important financial application of estimating covariance matrices. We consider multivariatet, Clayton, and bivariate t, Gumbel, Clayton copulas to model and forecast oneday ahead realized volatility matrices. Empirical results show that copula based models can achieve significant performance both in terms of statistical precision and economical efficiency.
Show less  Date Issued
 2019
 Identifier
 2019_Spring_Wang_fsu_0071E_15085
 Format
 Thesis
 Title
 ESTIMATING JOINTLY SYSTEM AND COMPONENT RELIABILITIES USING A MUTUAL CENSORSHIP APPROACH (SURVIVAL ANALYSIS, COUNTING PROCESSES, MARTINGALES, KAPLANMEIER, RELIABILITY FUNCTION).
 Creator

FREITAG, STEVEN ARTHUR., Florida State University
 Abstract/Description

Let F denote the life distribution of a coherent structure of independent components. Suppose that we have a sample of independent systems, each having the structure (phi). Each system is continuously observed until it fails. For every component in each system, either a failure time or a censoring time is recorded. A failure time is recorded if the component fails before or at the time of system failure; otherwise a censoring time is recorded. We introduce a method for finding estimates for F...
Show moreLet F denote the life distribution of a coherent structure of independent components. Suppose that we have a sample of independent systems, each having the structure (phi). Each system is continuously observed until it fails. For every component in each system, either a failure time or a censoring time is recorded. A failure time is recorded if the component fails before or at the time of system failure; otherwise a censoring time is recorded. We introduce a method for finding estimates for F(t), quantiles, and other functionals of F, based on the censorship of the component lives by system failure. We present limit theorems that enable the construction of confidence intervals for large samples.
Show less  Date Issued
 1986, 1986
 Identifier
 AAI8609671, 3086298, FSDT3086298, fsu:75781
 Format
 Document (PDF)
 Title
 ESTIMATING MULTIDIMENSIONAL TABLES FROM SURVEY DATA: PREDICTING MAGAZINE AUDIENCES.
 Creator

DANAHER, PETER JOSEPH., Florida State University
 Abstract/Description

Suppose an advertiser constructs an advertising campaign by placing k advertisements in a magazine. He now estimates the proportion of the population which sees none, one, or up to all k advertisements (called the exposure distribution). Several criteria for evaluating the effectiveness of the campaign can be obtained directly from the exposure distribution. Two of them are reach, the proportion of the population which is exposed to at least one of the advertisements and effective reach, the...
Show moreSuppose an advertiser constructs an advertising campaign by placing k advertisements in a magazine. He now estimates the proportion of the population which sees none, one, or up to all k advertisements (called the exposure distribution). Several criteria for evaluating the effectiveness of the campaign can be obtained directly from the exposure distribution. Two of them are reach, the proportion of the population which is exposed to at least one of the advertisements and effective reach, the mean of the exposure distribution., We develop three exposure distribution models for the cases where advertising campaigns are comprised of one, two, or three or more magazines. The models build on each other in that the model for one magazine is used to improve the fit of the model for two magazines and the model for two magazines is used to estimate the parameters of the model for three or more magazines., A thorough empirical test, using the AGB:McNair "National Media Survey", shows that each of our models outperforms the best currentlyavailable models. In addition, the three models are proved to have optimal asymptotic properties., The models are used to select a media schedule which maximizes either reach or effective reach subject to a budget constraint. A monotonicity property of reach and effective reach yields an algorithm for optimizing both reach and effective reach that greatly reduces computation time over conventional methods used to solve integer programming problems., It is more useful to estimate the proportion of the population which sees the advertisements in a magazine rather than the proportion which sees the magazine. Often, however, no advertisement recall data is available so we are forced to estimate the proportion which is exposed to just the magazines. If advertisement recall data is available we give a natural and simple adjustment of the original magazine exposure data to get advertisement exposure data. Our models also give an excellent fit to these adjusted exposure data.
Show less  Date Issued
 1987, 1987
 Identifier
 AAI8721837, 3086665, FSDT3086665, fsu:76140
 Format
 Document (PDF)