Current Search: Research Repository (x) » Thesis (x) » Biological Abstracts, Inc (x) » English language (x) » Illustrations (x) » Department of Statistics (x)
Search results
Pages
 Title
 2D Affine and Projective Shape Analysis, and Bayesian Elastic Active Contours.
 Creator

Bryner, Darshan W., Srivastava, Anuj, Klassen, Eric, Gallivan, Kyle, Huffer, Fred, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

An object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but often times in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian...
Show moreAn object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but often times in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian framework for shape analysis where metrics and related quantities are invariant to larger groups, the affine and projective groups, that approximate such transformations that arise from perspective skews. Highlighting two possibilities for representing object boundaries  ordered points (or landmarks) and parametrized curves  we study different combinations of these representations (points and curves) and transformations (affine and projective). Specifically, we provide solutions to three out of four situations and develop algorithms for computing geodesics and intrinsic sample statistics, leading up to Gaussiantype statistical models, and classifying test shapes using such models learned from training data. In the case of parametrized curves, an added issue is to obtain invariance to the reparameterization group. The geodesics are constructed by particularizing the pathstraightening algorithm to geometries of current manifolds and are used, in turn, to compute shape statistics and Gaussiantype shape models. We demonstrate these ideas using a number of examples from shape and activity recognition. After developing such Gaussiantype shape models, we present a variational framework for naturally incorporating these shape models as prior knowledge in guidance of active contours for boundary extraction in images. This socalled Bayesian active contour framework is especially suitable for images where boundary estimation is difficult due to low contrast, low resolution, and presence of noise and clutter. In traditional active contour models curves are driven towards minimum of an energy composed of image and smoothing terms. We introduce an additional shape term based on shape models of prior known relevant shape classes. The minimization of this total energy, using iterated gradientbased updates of curves, leads to an improved segmentation of object boundaries. We demonstrate this Bayesian approach to segmentation using a number of shape classes in many imaging scenarios including the synthetic imaging modalities of SAS (synthetic aperture sonar) and SAR (synthetic aperture radar), which are notoriously difficult to obtain accurate boundary extractions. In practice, the training shapes used for priorshape models may be collected from viewing angles different from those for the test images and thus may exhibit a shape variability brought about by perspective effects. Therefore, by allowing for a prior shape model to be invariant to, say, affine transformations of curves, we propose an active contour algorithm where the resulting segmentation is robust to perspective skews.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd8534
 Format
 Thesis
 Title
 Adaptive Series Estimators for Copula Densities.
 Creator

Gui, Wenhao, Wegkamp, Marten, Van Engelen, Robert A., Niu, Xufeng, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

In this thesis, based on an orthonormal series expansion, we propose a new nonparametric method to estimate copula density functions. Since the basis coefficients turn out to be expectations, empirical averages are used to estimate these coefficients. We propose estimators of the variance of the estimated basis coefficients and establish their consistency. We derive the asymptotic distribution of the estimated coefficients under mild conditions. We derive a simple oracle inequality for the...
Show moreIn this thesis, based on an orthonormal series expansion, we propose a new nonparametric method to estimate copula density functions. Since the basis coefficients turn out to be expectations, empirical averages are used to estimate these coefficients. We propose estimators of the variance of the estimated basis coefficients and establish their consistency. We derive the asymptotic distribution of the estimated coefficients under mild conditions. We derive a simple oracle inequality for the copula density estimator based on a finite series using the estimated coefficients. We propose a stopping rule for selecting the number of coefficients used in the series and we prove that this rule minimizes the mean integrated squared error. In addition, we consider hard and soft thresholding techniques for sparse representations. We obtain oracle inequalities that hold with prescribed probability for various norms of the difference between the copula density and our threshold series density estimator. Uniform confidence bands are derived as well. The oracle inequalities clearly reveal that our estimator adapts to the unknown degree of sparsity of the series representation of the copula density. A simulation study indicates that our method is extremely easy to implement and works very well, and it compares favorably to the popular kernel based copula density estimator, especially around the boundary points, in terms of mean squared error. Finally, we have applied our method to an insurance dataset. After comparing our method with the previous data analyses, we reach the same conclusion as the parametric methods in the literature and as such we provide additional justification for the use of the developed parametric model.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd3929
 Format
 Thesis
 Title
 Age Effects in the Extinction of Planktonic Foraminifera: A New Look at Van Valen's Red Queen Hypothesis.
 Creator

Wiltshire, Jelani, Huﬀer, Fred, Parker, William, Chicken, Eric, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Van Valen's Red Queen hypothesis states that within a homogeneous taxonomic group the age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis being addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic...
Show moreVan Valen's Red Queen hypothesis states that within a homogeneous taxonomic group the age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis being addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic Foraminifera (Foram) extinction data include Weibull and Exponential modeling (Parker and Arnold, 1997), and Cox proportional hazards modeling (Doran et al. 2004,2006). I propose a general class of test statistics that can be used to test for the effect of age on extinction. These test statistics allow for a varying background rate of extinction and attempt to remove the effects of other covariates when assessing the effect of age on extinction. No model is assumed for the covariate effects. Instead I control for covariate effects by pairing or grouping together similar species. I use simulated data sets to compare the power of the statistics. In applying the test statistics to the Foram data, I have found age to have a positive effect on extinction.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd0952
 Format
 Thesis
 Title
 Analysis of Multivariate Data with Random Cluster Size.
 Creator

Li, Xiaoyun, Sinha, Debajyoti, Zhou, Yi, McGee, Dan, Lipsitz, Stuart, Department of Statistics, Florida State University
 Abstract/Description

In this dissertation, we examine binary correlated data with present/absent component or missing data that are related to binary responses of interest. Depending on the data structure, correlated binary data can be referred as emph{clustered data} if sampling unit is a cluster of subjects, or it can be referred as emph{longitudinal data} when it involves repeated measurement of same subject over time. We propose our novel models in these two data structures and illustrate the model with real...
Show moreIn this dissertation, we examine binary correlated data with present/absent component or missing data that are related to binary responses of interest. Depending on the data structure, correlated binary data can be referred as emph{clustered data} if sampling unit is a cluster of subjects, or it can be referred as emph{longitudinal data} when it involves repeated measurement of same subject over time. We propose our novel models in these two data structures and illustrate the model with real data applications. In biomedical studies involving clustered binary responses, the cluster size can vary because some components of the cluster can be absent. When both the presence of a cluster component as well as the binary disease status of a present component are treated as responses of interest, we propose a novel twostage random effects logistic regression framework. For the ease of interpretation of regression effects, both the marginal probability of presence/absence of a component as well as the conditional probability of disease status of a present component, preserve the approximate logistic regression forms. We present a maximum likelihood method of estimation implementable using standard statistical software. We compare our models and the physical interpretation of regression effects with competing methods from literature. We also present a simulation study to assess the robustness of our procedure to wrong specification of the random effects distribution and to compare finite sample performances of estimates with existing methods. The methodology is illustrated via analyzing a study of the periodontal health status in a diabetic Gullah population. We extend this model in longitudinal studies with binary longitudinal response and informative missing data. In longitudinal studies, when treating each subject as a cluster, cluster size is the total number of observations for each subject. When data is informatively missing, cluster size of each subject can vary and is related to the binary response of interest and we are also interested in the missing mechanism. This is a modified situation of the cluster binary data with present components. We modify and adopt our proposed twostage random effects logistic regression model so that both the marginal probability of binary response and missing indicator as well as the conditional probability of binary response and missing indicator preserve logistic regression forms. We present a Bayesian framework of this model and illustrate our proposed model on an AIDS data example.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd1425
 Format
 Thesis
 Title
 Analysis of the Wealth Distribution at Equilibrium in a Heterogeneous Agent Economy.
 Creator

Badshah, Muffasir H., Srivastava, Anuj, Beaumont, Paul, Wu, Wei, Kercheval, Alec, Department of Statistics, Florida State University
 Abstract/Description

This paper aims at analyzing a macro economy with a continuum of infinitelylived households that make rational decisions about consumption and wealth savings in the face of employment and aggregate productivity shocks. The heterogeneous population structure arises when households differ in wealth and employment status against which they cannot insure. In this framework, the household wealth evolution is modeled as a mixture Markov process. The stationary wealth distributions are obtained...
Show moreThis paper aims at analyzing a macro economy with a continuum of infinitelylived households that make rational decisions about consumption and wealth savings in the face of employment and aggregate productivity shocks. The heterogeneous population structure arises when households differ in wealth and employment status against which they cannot insure. In this framework, the household wealth evolution is modeled as a mixture Markov process. The stationary wealth distributions are obtained using eigen structures of transition matrices under the PerronFrobenius theorem. This step is utilized repeatedly to find the equilibrium state of the system, and it leads to an efficient framework for studying the dynamic general equilibrium. A systematic evaluation of the equilibrium state under different initial conditions is further presented and analyzed.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd0844
 Format
 Thesis
 Title
 AP Student Visual Preferences for Problem Solving.
 Creator

Swoyer, Liesl, Department of Statistics
 Abstract/Description

The purpose of this study is to explore the mathematical preference of high school AP Calculus students by examining their tendencies for using differing methods of thought. A student's preferred mode of thinking was measured on a scale ranging from a preference for analytical thought to a preference for visual thought as they completed derivative and antiderivative tasks presented both algebraically and graphically. This relates to previous studies by continuing to analyze the factors that...
Show moreThe purpose of this study is to explore the mathematical preference of high school AP Calculus students by examining their tendencies for using differing methods of thought. A student's preferred mode of thinking was measured on a scale ranging from a preference for analytical thought to a preference for visual thought as they completed derivative and antiderivative tasks presented both algebraically and graphically. This relates to previous studies by continuing to analyze the factors that have been found to mediate the students' performance and preference in regards to a variety of calculus tasks. Data was collected by Dr. Erhan Haciomeroglu at the University of Central Florida. Students' preferences were not affected by gender. Students were found to approach graphical and algebraic tasks similarly, without any significant change with regards to derivative or antiderivative nature of the tasks. Highly analytic and highly visual students revealed the same proportion of change in visuality as harmonic students when more difficult calculus tasks were encountered. Thus, a strong preference for visual thinking when completing algebraic tasks was not the determining factor of their preferred method of thinking when approaching graphical tasks.
Show less  Date Issued
 2012
 Identifier
 FSU_migr_uhm0052
 Format
 Thesis
 Title
 Association Models for Clustered Data with Binary and Continuous Responses.
 Creator

Lin, Lanjia, Sinha, Debajyoti, Hurt, Myra, Lipsitz, Stuart R., McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

This dissertation develops novel single random effect models as well as bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses. For the ease of interpretation of the regression effects, random effect of the binary response has bridge distribution so that the marginal model of mean of the binary response after integrating out the random effect preserves logistic form. And...
Show moreThis dissertation develops novel single random effect models as well as bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses. For the ease of interpretation of the regression effects, random effect of the binary response has bridge distribution so that the marginal model of mean of the binary response after integrating out the random effect preserves logistic form. And the marginal regression function of the continuous response preserves linear form. Withincluster and withinsubject associations could be measured by our proposed models. For the bivariate correlated random effects model, we illustrate how different levels of the association between two random effects induce different Kendall's tau values for association between the binary and continuous responses from the same cluster. Fully parametric and semiparametric Bayesian methods as well as maximum likelihood method are illustrated for model analysis. In the semiparametric Bayesian model, normality assumption of the regression error for the continuous response is relaxed by using a nonparametric Dirichlet Process prior. Robustness of the bivariate correlated random effects model using ML method to misspecifications of regression function as well as random effect distribution is investigated by simulation studies. The Bayesian and likelihood methods are applied to a developmental toxicity study of ethylene glycol in mice.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd1330
 Format
 Thesis
 Title
 A Bayesian Approach to MetaRegression: The Relationship Between Body Mass Index and AllCause Mortality.
 Creator

Marker, Mahtab, McGee, Dan, Hurt, Myra, Niu, Xiufeng, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

This thesis presents a Bayesian approach to MetaRegression and Individual Patient Data (IPD) Metaanalysis. The focus of the research is on establishing the relationship between Body Mass Index (BMI) and allcause mortality. This has been an area of continuing interest in the medical and public health communities and no concensus has been reached on what the optimal weight for individuals is. Standards are usually speci ed in terms of body mass index (BMI = wt(kg) over height(m)2 ) which is...
Show moreThis thesis presents a Bayesian approach to MetaRegression and Individual Patient Data (IPD) Metaanalysis. The focus of the research is on establishing the relationship between Body Mass Index (BMI) and allcause mortality. This has been an area of continuing interest in the medical and public health communities and no concensus has been reached on what the optimal weight for individuals is. Standards are usually speci ed in terms of body mass index (BMI = wt(kg) over height(m)2 ) which is associated with body fat percentage. Many studies in the literature have modelled the relationship between BMI and mortality and reported a variety of relationships including Ushaped, Jshaped and linear curves. The aim of my research was to use statistical methods to determine whether we can combine these diverse results an obtain single estimated relationship, using which one can nd the point of minimum mortality and establish reasonable ranges for optimal BMI or how we can best examine the reasons for the heterogeneity of results. Commonly used techniques of Metaanalysis and Metaregression are explored and a problem with the estimation procedure in the multivariate setting is presented. A Bayesian approach using Hierarchical Generalized Linear Mixed Model is suggested and implemented to overcome this drawback of standard estimation techniques. Another area which is explored briefly is that of Individual Patient Data metaanalysis. A Frailty model or Random Effects Proportional Hazards Survival model approach is proposed to carry out IPD metaregression and come up with a single estimated relationship between BMI and mortality, adjusting for the variation between studies.
Show less  Date Issued
 2007
 Identifier
 FSU_migr_etd2736
 Format
 Thesis
 Title
 Bayesian Dynamic Survival Models for Longitudinal Aging Data.
 Creator

He, Jianghua, McGee, Daniel L., Niu, Xufeng, Johnson, Suzanne B., Huﬀer, Fred W., Department of Statistics, Florida State University
 Abstract/Description

In this study, we will examine the Bayesian Dynamic Survival Models, timevarying coefficients models from a Bayesian perspective, and their applications in the aging setting. The specific questions we are interested in are: Do the relative importance of characteristics measured at a particular age, such as blood pressure, smoking, and body weight, with respect to heart diseases or death change as people age? If they do, how can we model the change? And, how does the change affect the...
Show moreIn this study, we will examine the Bayesian Dynamic Survival Models, timevarying coefficients models from a Bayesian perspective, and their applications in the aging setting. The specific questions we are interested in are: Do the relative importance of characteristics measured at a particular age, such as blood pressure, smoking, and body weight, with respect to heart diseases or death change as people age? If they do, how can we model the change? And, how does the change affect the analysis results if fixedeffect models are applied? In the epidemiological and statistical literature, the relationship between a risk factor and the risk of an event is often described in terms of the numerical contribution of the risk factor to the total risk within a followup period, using methods such as contingency tables and logistic regression models. With the development of survival analysis, another method named the Proportional Hazards Model becomes more popular. This model describes the relationship between a covariate and risk within a followup period as a process, under the assumption that the hazard ratio of the covariate is fixed during the followup period. Neither previous methods nor the Proportional Hazards Model allows the effect of a covariates to change flexibly with time. In these study, we intend to investigate some classic epidemiological relationships using appropriate methods that allow coefficients to change with time, and compare our results with those found in the literature. After describing what has been done in previous work based on multiple logistic regression or discriminant function analysis, we summarize different methods for estimating the time varying coefficient survival models that are developed specifically for the situations under which the proportional hazards assumption is violated. We will focus on the Bayesian Dynamic Survival Model because its flexibility and Bayesian structure fits our study goals. There are two estimation methods for the Bayesian Dynamic Survival Models, the Linear Bayesian Estimation (LBE) method and the Markov Chain Monte Carlo (MCMC) sampling method. The LBE method is simpler, faster, and more flexible to calculate, but it requires specifications of some parameters that usually are unknown. The MCMC method gets around the difficulty of specifying parameters, but is much more computationally intensive. We will use a simulation study to investigate the performances of these two methods, and provide suggestions on how to use them effectively in application. The Bayesian Dynamic Survival Model is applied to the Framingham Heart Study to investigate the timevarying effects of covariates such as gender, age, smoking, and SBP (Systolic Blood Pressure) with respect to death. We also examined the changing relationship between BMI (Body Mass Index) and allcause mortality, and suggested that some of the heterogeneity observed in the results found in the literature is likely to be a consequence of using fixed effect models to describe a timevarying relationship.
Show less  Date Issued
 2007
 Identifier
 FSU_migr_etd4174
 Format
 Thesis
 Title
 Bayesian Generalized Polychotomous Response Models and Applications.
 Creator

Yang, Fang, Niu, XuFeng, Johnson, Suzanne B., McGee, Dan, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Polychotomous quantal response models are widely used in medical and econometric studies to analyze categorical or ordinal data. In this study, we apply the Bayesian methodology through a mixedeffects polychotomous quantal response model. For the Bayesian polychotomous quantal response model, we assume uniform improper priors for the regression coeffcients and explore the suffcient conditions for a proper joint posterior distribution of the parameters in the models. Simulation results from...
Show morePolychotomous quantal response models are widely used in medical and econometric studies to analyze categorical or ordinal data. In this study, we apply the Bayesian methodology through a mixedeffects polychotomous quantal response model. For the Bayesian polychotomous quantal response model, we assume uniform improper priors for the regression coeffcients and explore the suffcient conditions for a proper joint posterior distribution of the parameters in the models. Simulation results from Gibbs sampling estimates will be compared to traditional maximum likelihood estimates to show the strength that using the uniform improper priors for the regression coeffcients. Motivated by investigating of relationship between BMI categories and several risk factors, we carry out the application studies to examine the impact of risk factors on BMI categories, especially for categories of "Overweight" and "Obesities". By applying the mixedeffects Bayesian polychotomous response model with uniform improper priors, we would get similar interpretations of the association between risk factors and BMI, comparing to literature findings.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd1092
 Format
 Thesis
 Title
 Bayesian Inference and Novel Models for Survival Data with Cured Fraction.
 Creator

Gupta, Cherry Chunqi Huang, Sinha, Debajyoti, Glueckauf, Robert L., Slate, Elizabeth H., Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of...
Show moreGupta, Cherry Chunqi Huang, Sinha, Debajyoti, Glueckauf, Robert L., Slate, Elizabeth H., Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

Existing curerate survival models are generally not convenient for modeling and estimating the survival quantiles of a patient with specified covariate values. They also do not allow inference on the change in the number of clonogens over time. This dissertation proposes two novel classes of curerate model, the transformbothsides curerate model (TBSCRM) and the clonogen proliferation curerate model (CPCRM). Both can be used to make inference about both the curerate and the survival...
Show moreExisting curerate survival models are generally not convenient for modeling and estimating the survival quantiles of a patient with specified covariate values. They also do not allow inference on the change in the number of clonogens over time. This dissertation proposes two novel classes of curerate model, the transformbothsides curerate model (TBSCRM) and the clonogen proliferation curerate model (CPCRM). Both can be used to make inference about both the curerate and the survival probabilities over time. The TBSCRM can also produce estimates of a patient's quantiles of survival time, and the CPCRM can produce estimates of a patient's expected number of clonogens at each time. We develop methods of Bayesian inference about the covariate effects on relevant quantities such as the curerate, methods which use Markov Chain Monte Carlo (MCMC) tools. We also show that the TBSCRMbased and CPCRMbased Bayesian methods perform well in simulation studies and outperform existing curerate models in application to the breast cancer survival data from the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) database.
Show less  Date Issued
 2016
 Identifier
 FSU_2016SU_Gupta_fsu_0071E_13423
 Format
 Thesis
 Title
 Bayesian Methods for Skewed Response Including Longitudinal and Heteroscedastic Data.
 Creator

Tang, Yuanyuan, Sinha, Debajyoti, Pati, Debdeep, Flynn, Heather, She, Yiyuan, Lipsitz, Stuart, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

Skewed response data are very popular in practice, especially in biomedical area. We begin our work from the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study the heterocedasticity. We extend the transformbothsides model to the bayesian variable selection area to handle the univariate skewed response, where the variance of response is a function of the median. At last, we proposed a novel model to handle...
Show moreSkewed response data are very popular in practice, especially in biomedical area. We begin our work from the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study the heterocedasticity. We extend the transformbothsides model to the bayesian variable selection area to handle the univariate skewed response, where the variance of response is a function of the median. At last, we proposed a novel model to handle the skewed univariate response with a flexible heteroscedasticity. For longitudinal studies with heavily skewed continuous response, statistical model and methods focusing on mean response are not appropriate. In this paper, we present a partial linear model of median regression function of skewed longitudinal response. We develop a semiparametric Bayesian estimation procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We provide justifications for using our methods including theoretical investigation of the support of the prior, asymptotic properties of the posterior and also simulation studies of finite sample properties. Ease of implementation and advantages of our model and method compared to existing methods are illustrated via analysis of a cardiotoxicity study of children of HIV infected mother. Our second aim is to develop a Bayesian simultaneous variable selection and estimation of median regression for skewed response variable. Our hierarchical Bayesian model can incorporate advantages of $l_0$ penalty for skewed and heteroscedastic error. Some preliminary simulation studies have been conducted to compare the performance of proposed model and existing frequentist median lasso regression model. Considering the estimation bias and total square error, our proposed model performs as good as, or better than competing frequentist estimators. In biomedical studies, the covariates often affect the location, scale as well as the shape of the skewed response distribution. Existing biostatistical literature mainly focuses on the mean regression with a symmetric error distribution. While such modeling assumptions and methods are often deemed as restrictive and inappropriate for skewed response, the completely nonparametric methods may lack a physical interpretation of the covariate effects. Existing nonparametric methods also miss any easily implementable computational tool. For a skewed response, we develop a novel model accommodating a nonparametric error density that depends on the covariates. The advantages of our semiparametric associated Bayes method include the ease of prior elicitation/determination, an easily implementable posterior computation, theoretically sound properties of the selection of priors and accommodation of possible outliers. The practical advantages of the method are illustrated via a simulation study and an analysis of a reallife epidemiological study on the serum response to DDT exposure during gestation period.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd7622
 Format
 Thesis
 Title
 Bayesian Modeling and Variable Selection for Complex Data.
 Creator

Li, Hanning, Pati, Debdeep, Huffer, Fred W. (Fred William), Kercheval, Alec N., Sinha, Debajyoti, Bradley, Jonathan R., Florida State University, College of Arts and Sciences,...
Show moreLi, Hanning, Pati, Debdeep, Huffer, Fred W. (Fred William), Kercheval, Alec N., Sinha, Debajyoti, Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

As we routinely encounter highthroughput datasets in complex biological and environment research, developing novel models and methods for variable selection has received widespread attention. In this dissertation, we addressed a few key challenges in Bayesian modeling and variable selection for highdimensional data with complex spatial structures. a) Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the...
Show moreAs we routinely encounter highthroughput datasets in complex biological and environment research, developing novel models and methods for variable selection has received widespread attention. In this dissertation, we addressed a few key challenges in Bayesian modeling and variable selection for highdimensional data with complex spatial structures. a) Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the noise. However, such priors encounter computational issues in high dimensions. This has motivated continuous shrinkage priors, resembling the twocomponent priors facilitating computation and interpretability. While such priors are widely used for estimating highdimensional sparse vectors, selecting a subset of variables remains a daunting task. b) Spatial/spatialtemporal data sets with complex structures are nowadays commonly encountered in various scientific research fields ranging from atmospheric sciences, forestry, environmental science, biological science, and social science. Selecting important spatial variables that have significant influences on occurrences of events is undoubtedly necessary and essential for providing insights to researchers. Selfexcitation, which is a feature that occurrence of an event increases the likelihood of more occurrences of the same type of events nearby in time and space, can be found in many natural/social events. Research on modeling data with selfexcitation feature has increasingly drawn interests recently. However, existing literature on selfexciting models with inclusion of highdimensional spatial covariates is still underdeveloped. c) Gaussian Process is among the most powerful model frames for spatial data. Its major bottleneck is the computational complexity which stems from inversion of dense matrices associated with a Gaussian process covariance. Hierarchical divideconquer Gaussian Process models have been investigated for ultra large data sets. However, computation associated with scaling the distributing computing algorithm to handle a large number of subgroups poses a serious bottleneck. In chapter 2 of this dissertation, we propose a general approach for variable selection with shrinkage priors. The presence of very few tuning parameters makes our method attractive in comparison to ad hoc thresholding approaches. The applicability of the approach is not limited to continuous shrinkage priors, but can be used along with any shrinkage prior. Theoretical properties for nearcollinear design matrices are investigated and the method is shown to have good performance in a wide range of synthetic data examples and in a real data example on selecting genes affecting survival due to lymphoma. In Chapter 3 of this dissertation, we propose a new selfexciting model that allows the inclusion of spatial covariates. We develop algorithms which are effective in obtaining accurate estimation and variable selection results in a variety of synthetic data examples. Our proposed model is applied on Chicago crime data where the influence of various spatial features is investigated. In Chapter 4, we focus on a hierarchical Gaussian Process regression model for ultrahigh dimensional spatial datasets. By evaluating the latent Gaussian process on a regular grid, we propose an efficient computational algorithm through circulant embedding. The latent Gaussian process borrows information across multiple subgroups, thereby obtaining a more accurate prediction. The hierarchical model and our proposed algorithm are studied through simulation examples.
Show less  Date Issued
 2017
 Identifier
 FSU_FALL2017_Li_fsu_0071E_14159
 Format
 Thesis
 Title
 Bayesian Models for Capturing Heterogeneity in Discrete Data.
 Creator

Geng, Junxian, Slate, Elizabeth H., Pati, Debdeep, Schmertmann, Carl P., Zhang, Xin, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Population heterogeneity exists frequently in discrete data. Many Bayesian models perform reasonably well in capturing this subpopulation structure. Typically, the Dirichlet process mixture model (DPMM) and a variable dimensional alternative that we refer to as the mixture of finite mixtures (MFM) model are used, as they both have natural byproducts of clustering derived from Polya urn schemes. The first part of this dissertation focuses on a model for the association between a binary...
Show morePopulation heterogeneity exists frequently in discrete data. Many Bayesian models perform reasonably well in capturing this subpopulation structure. Typically, the Dirichlet process mixture model (DPMM) and a variable dimensional alternative that we refer to as the mixture of finite mixtures (MFM) model are used, as they both have natural byproducts of clustering derived from Polya urn schemes. The first part of this dissertation focuses on a model for the association between a binary response and binary predictors. The model incorporates Boolean combinations of predictors, called logic trees, as parameters arising from a DPMM or MFM. Joint modeling is proposed to solve the identifiability issue that arises when using a mixture model for a binary response. Different MCMC algorithms are introduced and compared for fitting these models. The second part of this dissertation is the application of the mixture of finite mixtures model to community detection problems. Here, the communities are analogous to the clusters in the earlier work. A probabilistic framework that allows simultaneous estimation of the number of clusters and the cluster configuration is proposed. We prove clustering consistency in this setting. We also illustrate the performance of these methods with simulation studies and discuss applications.
Show less  Date Issued
 2017
 Identifier
 FSU_2017SP_Geng_fsu_0071E_13791
 Format
 Thesis
 Title
 A Bayesian MRF Framework for Labeling Terrain Using Hyperspectral Imaging.
 Creator

Neher, Robert E., Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Wegkamp, Marten, Department of Statistics, Florida State University
 Abstract/Description

We explore the nonGaussianity of hyperspectral data and present probability models that capture variability of hyperspectral images. In particular, we present a nonparametric probability distribution that models the distribution of the hyperspectral data after reducing the dimension of the data via either principal components or Fisher's discriminant analysis. We also explore the directional differences in observed images and present two parametric distributions, the generalized Laplacian...
Show moreWe explore the nonGaussianity of hyperspectral data and present probability models that capture variability of hyperspectral images. In particular, we present a nonparametric probability distribution that models the distribution of the hyperspectral data after reducing the dimension of the data via either principal components or Fisher's discriminant analysis. We also explore the directional differences in observed images and present two parametric distributions, the generalized Laplacian and the Bessel K form, that well model the nonGaussian behavior of the directional differences. We then propose a model that labels each spatial site, using Bayesian inference and Markov random fields, that incorporates the information of the nonparametric distribution of the data, and the parametric distributions of the directional differences, along with a prior distribution that favors smooth labeling. We then test our model on actual hyperspectral data and present the results of our model, using the Washington D.C. Mall and Indian Springs rural area data sets.
Show less  Date Issued
 2004
 Identifier
 FSU_migr_etd2691
 Format
 Thesis
 Title
 Bayesian Portfolio Optimization with TimeVarying Factor Models.
 Creator

Zhao, Feng, Niu, Xufeng, Cheng, Yingmei, Huﬀer, Fred W., Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

We develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the riskfree rate ("risk premia"). Both firmlevel characteristics and macroeconomic variables are used to predict stocks' timevarying alphas and betas, and macroeconomic variables are used to predict the risk...
Show moreWe develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the riskfree rate ("risk premia"). Both firmlevel characteristics and macroeconomic variables are used to predict stocks' timevarying alphas and betas, and macroeconomic variables are used to predict the risk premia. All of the models are specified in a Bayesian framework to account for estimation risk, and informative prior distributions on both stock returns and model parameters are adopted to reduce estimation error. To gauge the economic signicance of the predictability, we apply the models to the U.S. stock market and construct optimal portfolios based on model predictions. Outofsample performance of the portfolios is evaluated to compare the models. The empirical results confirm predictabiltiy from all of the sources considered in our model: (1) The equity risk premium is timevarying and predictable using macroeconomic variables; (2) Stocks' alphas and betas differ crosssectionally and are predictable using firmlevel characteristics; and (3) Stocks' alphas and betas are also timevarying and predictable using macroeconomic variables. Comparison of different subperiods shows that the predictability of stocks' betas is persistent over time, but the predictability of stocks' alphas and the risk premium has diminished to some extent. The empirical results also suggest that Bayesian statistical techinques, especially the use of informative prior distributions, help reduce model estimation error and result in portfolios that outperform the passive indexing strategy. The findings are robust in the presence of transaction costs.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd0526
 Format
 Thesis
 Title
 A Bayesian Wavelet Based Analysis of Longitudinally Observed Skewed Heteroscedastic Responses.
 Creator

Baker, Danisha S. (Danisha Sharice), Chicken, Eric, Sinha, Debajyoti, Harper, Kristine, Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of...
Show moreBaker, Danisha S. (Danisha Sharice), Chicken, Eric, Sinha, Debajyoti, Harper, Kristine, Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

Unlike many of the current statistical models focusing on highly skewed longitudinal data, we present a novel model accommodating a skewed error distribution, partial linear median regression function, nonparametric wavelet expansion, and serial observations on the same unit. Parameters are estimated via a semiparametric Bayesian procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We use a hierarchical mixture model as the prior for the wavelet...
Show moreUnlike many of the current statistical models focusing on highly skewed longitudinal data, we present a novel model accommodating a skewed error distribution, partial linear median regression function, nonparametric wavelet expansion, and serial observations on the same unit. Parameters are estimated via a semiparametric Bayesian procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We use a hierarchical mixture model as the prior for the wavelet coefficients. For the "vanishing" coefficients, the model includes a level dependent prior probability mass at zero. This practice implements wavelet coefficient thresholding as a Bayesian Rule. Practical advantages of our method are illustrated through a simulation study and via analysis of a cardiotoxicity study of children of HIV infected mother.
Show less  Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Baker_fsu_0071E_14036
 Format
 Thesis
 Title
 A Class of MixedDistribution Models with Applications in Financial Data Analysis.
 Creator

Tang, Anqi, Niu, Xufeng, Cheng, Yingmei, Wu, Wei, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Statisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zeroinflated longitudinal data where the response variable has a large portion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a twopart mixed distribution model to model zeroinflated longitudinal data. The first part of the model is a logistic regression model that models the...
Show moreStatisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zeroinflated longitudinal data where the response variable has a large portion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a twopart mixed distribution model to model zeroinflated longitudinal data. The first part of the model is a logistic regression model that models the probability of nonzero response; the other part is a linear model that models the mean response given that the outcomes are not zeros. Random effects with AR(1) covariance structure are introduced into both parts of the model to allow serial correlation and subject specific effect. Estimating the twopart model is challenging because of high dimensional integration necessary to obtain the maximum likelihood estimates. We propose a Monte Carlo EM algorithm for estimating the maximum likelihood estimates of parameters. Through simulation study, we demonstrate the good performance of the MCEM method in parameter and standard error estimation. To illustrate, we apply the twopart model with correlated random effects and the model with autoregressive random effects to executive compensation data to investigate potential determinants of CEO stock option grants.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd1710
 Format
 Thesis
 Title
 A Class of Semiparametric Volatility Models with Applications to Financial Time Series.
 Creator

Chung, Steve S., Niu, XuFeng, Gallivan, Kyle, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

The autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models take the dependency of the conditional second moments. The idea behind ARCH/GARCH model is quite intuitive. For ARCH models, past squared innovations describes the present squared volatility. For GARCH models, both squared innovations and the past squared volatilities define the present volatility. Since their introduction, they have been extensively studied...
Show moreThe autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models take the dependency of the conditional second moments. The idea behind ARCH/GARCH model is quite intuitive. For ARCH models, past squared innovations describes the present squared volatility. For GARCH models, both squared innovations and the past squared volatilities define the present volatility. Since their introduction, they have been extensively studied and well documented in financial and econometric literature and many variants of ARCH/GARCH models have been proposed. To list a few, these include exponential GARCH(EGARCH), GJRGARHCH(or threshold GARCH), integrated GARCH(IGARCH), quadratic GARCH(QGARCH), and fractionally integrated GARCH(FIGARCH). The ARCH/GARCH models and their variant models have gained a lot of attention and they are still popular choice for modeling volatility. Despite their popularity, they suffer from model flexibility. Volatility is a latent variable and hence, putting a specific model structure violates this latency assumption. Recently, several attempts have been made in order to ease the strict structural assumptions on volatility. Both nonparametric and semiparametric volatility models have been proposed in the literature. We review and discuss these modeling techniques in detail. In this dissertation, we propose a class of semiparametric multiplicative volatility models. We define the volatility as a product of parametric and nonparametric parts. Due to the positivity restriction, we take the log and square transformations on the volatility. We assume that the parametric part is GARCH(1,1) and it serves as a initial guess to the volatility. We estimate GARCH(1,1) parameters by using conditional likelihood method. The nonparametric part assumes an additive structure. There may exist some loss of interpretability by assuming an additive structure but we gain flexibility. Each additive part is constructed from a sieve of Bernstein basis polynomials. The nonparametric component acts as an improvement for the parametric component. The model is estimated from an iterative algorithm based on boosting. We modified the boosting algorithm (one that is given in Friedman 2001) such that it uses a penalized least squares method. As a penalty function, we tried three different penalty functions: LASSO, ridge, and elastic net penalties. We found that, in our simulations and application, ridge penalty worked the best. Our semiparametric multiplicative volatility model is evaluated using simulations and applied to the six major exchange rates and SP 500 index. The results show that the proposed model outperforms the existing volatility models in both insample estimation and outofsample prediction.
Show less  Date Issued
 2014
 Identifier
 FSU_migr_etd8756
 Format
 Thesis
 Title
 A Comparison of Estimators in Hierarchical Linear Modeling: Restricted Maximum Likelihood versus Bootstrap via Minimum Norm Quadratic Unbiased Estimators.
 Creator

Delpish, Ayesha Nneka, Niu, XuFeng, Tate, Richard L., Huﬀer, Fred W., Zahn, Douglas, Department of Statistics, Florida State University
 Abstract/Description

The purpose of the study was to investigate the relative performance of two estimation procedures, the restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a twolevel hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations,...
Show moreThe purpose of the study was to investigate the relative performance of two estimation procedures, the restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a twolevel hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations, the importance of this assumption for the accuracy of multilevel parameter estimates and their standard errors was assessed using the accuracy index of relative bias and by observing the coverage percentages of 95% confidence intervals constructed for both estimation procedures. The study systematically varied the number of groups at level2 (30 versus 100), the size of the intraclass correlation (0.01 versus 0.20) and the distribution of the observations (normal versus chisquared with 1 degree of freedom). The number of groups and intraclass correlation factors produced effects consistent with those previously reported—as the number of groups increased, the bias in the parameter estimates decreased, with a more significant effect observed for those estimates obtained via REML. High levels of the intraclass correlation also led to a decrease in the efficiency of parameter estimation under both methods. Study results show that while both the restricted maximum likelihood and the bootstrap via MINQUE estimates of the fixed effects were accurate, the efficiency of the estimates was affected by the distribution of errors with the bootstrap via MINQUE procedure outperforming the REML. Both procedures produced less efficient estimators under the chisquared distribution, particularly for the variancecovariance component estimates.
Show less  Date Issued
 2006
 Identifier
 FSU_migr_etd0771
 Format
 Thesis
 Title
 Covariance on Manifolds.
 Creator

Balov, Nikolay H. (Nikolay Hristov), Srivastava, Anuj, Klassen, Eric, Patrangenaru, Victor, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

With ever increasing complexity of observational and theoretical data models, the sufficiency of the classical statistical techniques, designed to be applied only on vector quantities, is being challenged. Nonlinear statistical analysis has become an area of intensive research in recent years. Despite the impressive progress in this direction, a unified and consistent framework has not been reached. In this regard, the following work is an attempt to improve our understanding of random...
Show moreWith ever increasing complexity of observational and theoretical data models, the sufficiency of the classical statistical techniques, designed to be applied only on vector quantities, is being challenged. Nonlinear statistical analysis has become an area of intensive research in recent years. Despite the impressive progress in this direction, a unified and consistent framework has not been reached. In this regard, the following work is an attempt to improve our understanding of random phenomena on nonEuclidean spaces. More specifically, the motivating goal of the present dissertation is to generalize the notion of distribution covariance, which in standard settings is defined only in Euclidean spaces, on arbitrary manifolds with metric. We introduce a tensor field structure, named covariance field, that is consistent with the heterogeneous nature of manifolds. It not only describes the variability imposed by a probability distribution but also provides alternative distribution representations. The covariance field combines the distribution density with geometric characteristics of its domain and thus fills the gap between these two.We present some of the properties of the covariance fields and argue that they can be successfully applied to various statistical problems. In particular, we provide a systematic approach for defining parametric families of probability distributions on manifolds, parameter estimation for regression analysis, nonparametric statistical tests for comparing probability distributions and interpolation between such distributions. We then present several application areas where this new theory may have potential impact. One of them is the branch of directional statistics, with domain of influence ranging from geosciences to medical image analysis. The fundamental level at which the covariance based structures are introduced, also opens a new area for future research.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd1045
 Format
 Thesis
 Title
 Discrimination and Calibration of Prognostic Survival Models.
 Creator

Simino, Jeannette M., Hollander, Myles, McGee, Daniel, Hurt, Myra, Niu, XuFeng, Department of Statistics, Florida State University
 Abstract/Description

Clinicians employ prognostic survival models for diseases such as coronary heart disease and cancer to inform patients about risks, treatments, and clinical decisions (Altman and Royston 2000). These prognostic models are not useful unless they are valid in the population to which they are applied. There are no generally accepted algorithms for assessing the validity of an external survival model in a new population. Researchers often invoke measures of predictive accuracy, the degree to...
Show moreClinicians employ prognostic survival models for diseases such as coronary heart disease and cancer to inform patients about risks, treatments, and clinical decisions (Altman and Royston 2000). These prognostic models are not useful unless they are valid in the population to which they are applied. There are no generally accepted algorithms for assessing the validity of an external survival model in a new population. Researchers often invoke measures of predictive accuracy, the degree to which predicted outcomes match observed outcomes (Justice et al. 1999). One component of predictive accuracy is discrimination, the ability of the model to correctly rank the individuals in the sample by risk. A common measure of discrimination for prognostic survival models is the concordance index, also called the cstatistic. We utilize the concordance index to determine the discrimination of Framinghambased Cox and Loglogistic models of coronary heart disease (CHD) death in cohorts from the Diverse Populations Collaboration, a collection of studies that encompasses many ethnic, geographic, and socioeconomic groups. Pencina and D'Agostino presented a confidence interval for the concordance index when assessing the discrimination of an external prognostic model. We perform simulations to determine the robustness of their confidence interval when measuring discrimination during internal validation. The Pencina and D'Agostino confidence interval is not valid in the internal validation setting because their assumption of mutually independent observations is violated. We compare the Pencina and D'Agostino confidence interval to a bootstrap confidence interval that we propose that is valid for the internal validation. We specifically discern the performance of the interval when the same sample is used to both fit and determine the validity of a prognostic model. The framework for our simulations is a Weibull proportional hazards model of CHD death fit to the Framingham exam 4 data. We then focus on the second component of accuracy, calibration, which measures the agreement between the observed and predicted event rates for groups of patients (Altman and Royston 2000). In 2000, van Houwelingen introduced a method called validation by calibration to allow a clinician to assess the validity of a wellaccepted published survival model on his/her own patient population and adjust the published model to fit that population. Van Houwelingen embeds the published model into a new model with only 3 parameters which helps combat the overfitting that occurs when models with many covariates are fit on data sets with a small number of events. We explore validation by calibration as a tool to adjust models when an external model over or underestimates risk. Van Houwelingen discusses the general method and then focusses on the proportional hazards model. There are situations where proportional hazards may not hold, thus we extend the methodology to the Loglogistic accelerated failure time model. We perform validation by calibration of Framinghambased Cox and Loglogistic models of CHD death to cohorts from the Diverse Populations Collaboration. Lastly, we conduct simulations that investigate the power of the global Wald validation by calibration test. We study its power to reject an invalid proportional hazards or Loglogistic accelerated failure time model under various scale and/or shape misspecifications.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd0328
 Format
 Thesis
 Title
 The Effect of Risk Factors on Coronary Heart Disease: An AgeRelevant Multivariate Meta Analysis.
 Creator

Li, Yan, McGee, Dan, She, Yiyuan, Eberstein, Ike, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The importance of major risk factors, such as hypertension, total cholesterol, body mass index, diabetes, smoking, for predicting incidence and mortality of Coronary Heart Disease (CHD) is well known. In light of the fact that age is also a major risk factor for CHD death, a natural question is whether the risk effects on CHD change with age. This thesis focuses on examining the interaction between age and risk factors using data from multiple studies containing differing age ranges. The aim...
Show moreThe importance of major risk factors, such as hypertension, total cholesterol, body mass index, diabetes, smoking, for predicting incidence and mortality of Coronary Heart Disease (CHD) is well known. In light of the fact that age is also a major risk factor for CHD death, a natural question is whether the risk effects on CHD change with age. This thesis focuses on examining the interaction between age and risk factors using data from multiple studies containing differing age ranges. The aim of my research is to use statistical methods to determine whether we can combine these diverse results to obtain an overall summary, using which one can find how the risk effects on CHD death change with age. One intuitive approach is to use classical meta analysis based on generalized linear models. More specifically, one can fit a logistic model with CHD death as response and age, a risk factor and their interaction as covariates for each of the studies, and conduct meta analysis on every set of three coefficients in the multivariate setting to obtain 'synthesized' coefficients. Another aspect of the thesis is a new method, meta analysis with respect to curves that goes beyond linear models. The basic idea is that one can choose the same spline with the same knots on covariates, say age and systolic blood pressure (SBP), for all the studies to ensure common basis functions. The knotbased tensor product basis coefficients obtained from penalized logistic regression can be used for multivariate meta analysis. Using the common basis functions and the 'synthesized' knotbased basis coefficients from meta analysis, a twodimensional smooth surface on the ageSBP domain is estimated. By cutting through the smooth surface along two axes, the resulting slices show how the risk effect on CHD death change at an arbitrary age as well as how the age effect on CHD death change at an arbitrary SBP value. The application to multiple studies will be presented.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd1428
 Format
 Thesis
 Title
 Elastic Shape Analysis of RNAs and Proteins.
 Creator

Laborde, Jose M., Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

Proteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of Structural Bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and...
Show moreProteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of Structural Bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and classifications. This work applies Elastic Shape Analysis (ESA), a method recently developed in computer vision, to construct rigorous mathematical and statistical frameworks for the comparison, clustering and classification of proteins and RNAs. ESA treats bio molecular structures as 3D parameterized curves, which are represented with a special map called the square root velocity function (SRVF). In the resulting shape space of elastic curves, one can perform statistical analysis of curves as if they were random variables. One can compare, match and deform one curve into another, or as well as compute averages and covariances of curve populations, and perform hypothesis testing and classification of curves according to their shapes. We have successfully applied ESA to the comparison and classification of protein and RNA structures. We further extend the ESA framework to incorporate additional nongeometric information that tags the shape of the molecules (namely, the sequence of nucleotide/aminoacid letters for RNAs/proteins and, in the latter case, also the labels for the socalled secondary structure). The biological representation is chosen such that the ESA framework continues to be mathematically formal. We have achieved superior classification of RNA functions compared to stateoftheart methods on benchmark RNA datasets which has led to the publication of this work in the journal, Nucleic Acids Research (NAR). Based on the ESA distances, we have also developed a fast method to classify protein domains by using a representative set of protein structures generated by a clusteringbased technique we call Multiple Centroid Class Partitioning (MCCP). Comparison with other standard approaches showed that MCCP significantly improves the accuracy while keeping the representative set smaller than the other methods. The current schemes for the classification and organization of proteins (such as SCOP and CATH) assume a discrete space of their structures, where a protein is classified into one and only one class in a hierarchical tree structure. Our recent study, and studies by other researchers, showed that the protein structure space is more continuous than discrete. To capture the complex but quantifiable continuous nature of protein structures, we propose to organize these molecules using a network model, where individual proteins are mapped to possibly multiple nodes of classes, each associated with a probability. Structural classes will then be connected to form a network based on overlaps of corresponding probability distributions in the structural space.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd8586
 Format
 Thesis
 Title
 An Ensemble Approach to Predicting Health Outcomes.
 Creator

Nilles, Ester Kim, McGee, Dan, Zhang, Jinfeng, Eberstein, Isaac, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Heart disease and premature birth continue to be the leading cause of mortality and neonatal mortality in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth is highly valuable information for early interventions. The past few decades, identification of patients at high health risk...
Show moreHeart disease and premature birth continue to be the leading cause of mortality and neonatal mortality in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth is highly valuable information for early interventions. The past few decades, identification of patients at high health risk have been based on logistic regression or Cox proportional hazards models. In more recent years, machine learning models have grown in popularity within the medical field for their superior predictive and classification performances over the classical statistical models. However, their performances in heart disease and premature birth predictions have been comparable and inconclusive, leaving the question of which model most accurately reflects the data difficult to resolve. Our aim is to incorporate information learned by different models into one final model that will generate superior predictive performances. We first compare the widely used machine learning models  the multilayer perceptron network, knearest neighbor and support vector machine  to the statistical models logistic regression and Cox proportional hazards. Then the individual models are combined into one in an ensemble approach, also referred to as ensemble modeling. The proposed approaches include SSEweighted, AUCweighted, logistic and flexible naive Bayes. The individual models are unique and capture different aspects of the data, but as expected, no individual one outperforms any other. The ensemble approach is an easily computed method that eliminates the need to select one model, integrates the strengths of different models, and generates optimal performances. Particularly in cases where the risk factors associated to an outcome are elusive, such as in premature birth, the ensemble models significantly improve their prediction.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd7530
 Format
 Thesis
 Title
 Estimating the Probability of Cardiovascular Disease: A Comparison of Methods.
 Creator

Fan, Li, McGee, Daniel, Hurt, Myra, Niu, XuFeng, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Risk prediction plays an important role in clinical medicine. It not only helps in educating patients to improve life style and in targeting individuals at high risk, but also guides treatment decisions. So far, various instruments have been used for different risk assessment in different countries and the risk predictions based from these different models are not consistent. In public use, a reliable risk prediction is necessary. This thesis discusses the models that have been developed for...
Show moreRisk prediction plays an important role in clinical medicine. It not only helps in educating patients to improve life style and in targeting individuals at high risk, but also guides treatment decisions. So far, various instruments have been used for different risk assessment in different countries and the risk predictions based from these different models are not consistent. In public use, a reliable risk prediction is necessary. This thesis discusses the models that have been developed for risk assessment and evaluates the performance of prediction at two levels, including the overall level and the individual level. At the overall level, cross validation and simulation are used to assess the risk prediction, while at the individual level, the "Parametric Bootstrap" and the delta method are used to evaluate the uncertainty of the individual risk prediction. Further exploration of the reasons producing different performance among the models is ongoing.
Show less  Date Issued
 2009
 Identifier
 FSU_migr_etd4508
 Format
 Thesis
 Title
 Estimation and Sequential Monitoring of Nonlinear Functional Responses Using Wavelet Shrinkage.
 Creator

Cuevas, Jordan, Chicken, Eric, Sobanjo, John, Niu, Xufeng, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

Statistical process control (SPC) is widely used in industrial settings to monitor processes for shifts in their distributions. SPC is generally thought of in two distinct phases: Phase I, in which historical data is analyzed in order to establish an incontrol process, and Phase II, in which new data is monitored for deviations from the incontrol form. Traditionally, SPC had been used to monitor univariate (multivariate) processes for changes in a particular parameter (parameter vector)....
Show moreStatistical process control (SPC) is widely used in industrial settings to monitor processes for shifts in their distributions. SPC is generally thought of in two distinct phases: Phase I, in which historical data is analyzed in order to establish an incontrol process, and Phase II, in which new data is monitored for deviations from the incontrol form. Traditionally, SPC had been used to monitor univariate (multivariate) processes for changes in a particular parameter (parameter vector). Recently however, technological advances have resulted in processes in which each observation is actually an ndimensional functional response (referred to as a profile), where n can be quite large. Additionally, these profiles are often unable to be adequately represented parametrically, making traditional SPC techniques inapplicable. This dissertation starts out by addressing the problem of nonparametric function estimation, which would be used to analyze process data in a PhaseI setting. The translation invariant wavelet estimator (TI) is often used to estimate irregular functions, despite the drawback that it tends to oversmooth jumps. A trimmed translation invariant estimator (TTI) is proposed, of which the TI estimator is a special case. By reducing the point by point variability of the TI estimator, TTI is shown to retain the desirable qualities of TI while improving reconstructions of functions with jumps. Attention is then turned to the PhaseII problem of monitoring sequences of profiles for deviations from incontrol. Two profile monitoring schemes are proposed; the first monitors for changes in the noise variance using a likelihood ratio test based on the highest detail level of wavelet coefficients of the observed profile. The second offers a semiparametric test to monitor for changes in both the functional form and noise variance. Both methods make use of wavelet shrinkage in order to distinguish relevant functional information from noise contamination. Different forms of each of these test statistics are proposed and results are compared via Monte Carlo simulation.
Show less  Date Issued
 2012
 Identifier
 FSU_migr_etd4788
 Format
 Thesis
 Title
 Estimation from Data Representing a Sample of Curves.
 Creator

Auguste, Anna L., Bunea, Florentina, Mason, Patrick, Hollander, Myles, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

This dissertation introduces and assesses an algorithm to generate confidence bands for a regression function or a main effect when multiple data sets are available. In particular it proposes to construct confidence bands for different trajectories and then aggregate these to produce an overall confidence band for a mean function. An estimator of the regression function or main effect is also examined. First, nonparametric estimators and confidence bands are formed on each data set separately...
Show moreThis dissertation introduces and assesses an algorithm to generate confidence bands for a regression function or a main effect when multiple data sets are available. In particular it proposes to construct confidence bands for different trajectories and then aggregate these to produce an overall confidence band for a mean function. An estimator of the regression function or main effect is also examined. First, nonparametric estimators and confidence bands are formed on each data set separately. Then each data set is in turn treated as a testing set for aggregating the preliminary results from the remaining data sets. The criterion used for this aggregation is either the least squares (LS) criterion or a BIC type penalized LS criterion. The proposed estimator is the average over data sets of these aggregates. It is thus a weighted sum of the preliminary estimators. The proposed confidence band is the minimum L1 band of all the M aggregate bands when we only have a main effect. In the case where there is some random effect we suggest an adjustment to the confidence band. In this case, the proposed confidence band is the minimum L1 band of all the M adjusted aggregate bands. Desirable asymptotic properties are shown to hold. A simulation study examines the performance of each technique relative to several alternate methods and theoretical benchmarks. An application to seismic data is conducted.
Show less  Date Issued
 2006
 Identifier
 FSU_migr_etd0286
 Format
 Thesis
 Title
 An Examination of the Concept of Frailty in the Elderly.
 Creator

Griffin, Felicia R., McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Frailty has been defined as a state of increased vulnerability to adverse outcomes. The concept of frailty has been centered around counting the number of deficits in health, which can be diseases, disabilities, or symptoms. However, there is no consensus on how it should be quantified. Frailty has been considered synonymous with functional status and comorbidity, but these may be distinct concepts requiring different management. We compared two methods of defining a frailty phenotype, a...
Show moreFrailty has been defined as a state of increased vulnerability to adverse outcomes. The concept of frailty has been centered around counting the number of deficits in health, which can be diseases, disabilities, or symptoms. However, there is no consensus on how it should be quantified. Frailty has been considered synonymous with functional status and comorbidity, but these may be distinct concepts requiring different management. We compared two methods of defining a frailty phenotype, a count of deficits and a weighted score of health deficits incorporating the strength of association between each deficit and mortality. The strength of association was estimated using proportional hazards coefficients. The study uses data from the third National Health and Nutrition Examination Survey. We compared the two methodologies: frailty was associated with age, gender, ethnicity, and having comorbid chronic diseases. The predictive association of frail status with the incidence of death over 12 years was significant for the weighted phenotype, with hazard ratio 3.46, 95% confidence interval (CI) (2.78, 4.30) unadjusted and hazard ratio 1.89, 95% confidence interval (CI) (1.57, 2.30) adjusted. The unweighted predictive association of frail status with the incidence of death was also significant, with a lower hazard ratio of 3.13, 95% CI (2.53, 3.87) unadjusted and hazard ratio of 1.40 95% CI (1.20, 1.65) adjusted. When examining the association of frailty and cause specific death, frailty was associated with a higher risk of death due to CHD, Stroke, CVD, and Other causes for both male and female (unadjusted). However, after adjusting for various covariates death due to CHD, CVD, and Others causes remain significant for both males and females. When comparing the definition of osteoporosis or low bone mass to the model of frailty, femoral neck Tscore declined significantly with increasing levels of frailty. There was overlap and uniqueness in the definitions of frailty, functional status, and comorbidity that require further research. Understanding the causal interrelationship could help explain why these three conditions are likely to cooccur. In addition, there is an association between frailty and dietary quality based on the Mediterranean diet. This study provides a more valuable understanding of the complex concept of frailty and the role latent variables in this concept. This study also introduces a weighted score for defining a frailty phenotype that is more strongly predictive of mortality, and hence has potential to improve targeting and care of today's elderly.
Show less  Date Issued
 2015
 Identifier
 FSU_migr_etd9342
 Format
 Thesis
 Title
 Examining the Effect of Treatment on the Distribution of Blood Pressure in the Population Using Observational Data.
 Creator

Kucukemiroglu, Saryet Alexa, McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences,...
Show moreKucukemiroglu, Saryet Alexa, McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

Since the introduction of antihypertensive medications in the mid1950s, there has been an increased use of blood pressure medications in the US. The growing use of antihypertensive treatment has affected the distribution of blood pressure in the population over time. Now observational data no longer reflect natural blood pressure levels. Our goal is to examine the effect of antihypertensive drugs on distributions of blood pressure using several wellknown observational studies. The...
Show moreSince the introduction of antihypertensive medications in the mid1950s, there has been an increased use of blood pressure medications in the US. The growing use of antihypertensive treatment has affected the distribution of blood pressure in the population over time. Now observational data no longer reflect natural blood pressure levels. Our goal is to examine the effect of antihypertensive drugs on distributions of blood pressure using several wellknown observational studies. The statistical concept of censoring is used to estimate the distribution of blood pressure in populations if no treatment were available. The treated and estimated untreated distributions are then compared to determine the general effect of these medications in the population. Our analyses show that these drugs have an increasing impact on controlling blood pressure distributions in populations that are heavily treated.
Show less  Date Issued
 2017
 Identifier
 FSU_FALL2017_Kucukemiroglu_fsu_0071E_14275
 Format
 Thesis
 Title
 Examining the Relationship of Dietary Component Intakes to Each Other and to Mortality.
 Creator

Alrajhi, Sharifah, McGee, Daniel, Levenson, Cathy W., Niu, Xufeng, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this essay we present analysis examining the basic dietary structure and its relationship to mortality in the first National Health and Nutrition Examination Survey (NHANES I) conducted between 1971 and 1975. We used results from 24hour recalls on 10,483 individuals in this study. All of the indivduals in the analytic sample were followed through 1992 for vital status. The mean followup period for the participants was 16 years. During followup 2,042 (48%) males and 1,754 (27%) females...
Show moreIn this essay we present analysis examining the basic dietary structure and its relationship to mortality in the first National Health and Nutrition Examination Survey (NHANES I) conducted between 1971 and 1975. We used results from 24hour recalls on 10,483 individuals in this study. All of the indivduals in the analytic sample were followed through 1992 for vital status. The mean followup period for the participants was 16 years. During followup 2,042 (48%) males and 1,754 (27%) females died. We first attempted to capture the inherent structure of the dietary data using principal components analyses (PCA). We performed this estimation separately for each race (white and black) and gender (male and female) and compared the estimated principal components among these four strata. We found that the principal components were similar (but not identical) in the four strata. we also related our estimated principal components to mortality using Cox Proportional Hazards (CPH) models and related dietary component to mortality using forward variable selection.
Show less  Date Issued
 2015
 Identifier
 FSU_2015fall_Alrajhi_fsu_0071E_12802
 Format
 Thesis
 Title
 Failure Time Regression Models for Thinned Point Processes.
 Creator

Holden, Robert T., Huffer, Fred G., Nichols, Warren, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

In survival analysis, data on the time until a specific criterion event (or "endpoint") occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the...
Show moreIn survival analysis, data on the time until a specific criterion event (or "endpoint") occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the predictors on that distribution. Suppose that the criterion event isn't a terminal event that can only occur once, but is a repeatable event. The sequence of events forms a stochastic {it point process}. Further suppose that only some of the events are detected (observed); the detected events form a thinned point process. Any failure time model based on the data will be based not on the time until the first occurrence, but on the time until the first detected occurrence of the event. The implications of estimating survival regression models from such incomplete data will be analyzed. It will be shown that the effect of thinning on regression parameters depends on the combination of the type of regression model, the type of point process that generates the events, and the thinning mechanism. For some combinations, the effect of a predictor will be the same for time to the first event and the time to the first detected event. For other combinations, the regression effect will be changed as a result of the incomplete detection.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd8568
 Format
 Thesis
 Title
 First Steps towards Image Denoising under LowLight Conditions.
 Creator

Anaya, Josue Samuel, MeyerBaese, Anke, Linero, Antonio, Zhang, Jinfeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The application of noise reduction or performing denoising on an image is a very important topic in the field of computer vision and computational photography. Many popular state of the art denoising algorithms are trained and evaluated using images with artificial noise. These trained algorithms and their evaluations on synthetic data may lead to incorrect conclusions about their performances. In this paper we will first introduce a benchmark dataset of uncompressed color images corrupted by...
Show moreThe application of noise reduction or performing denoising on an image is a very important topic in the field of computer vision and computational photography. Many popular state of the art denoising algorithms are trained and evaluated using images with artificial noise. These trained algorithms and their evaluations on synthetic data may lead to incorrect conclusions about their performances. In this paper we will first introduce a benchmark dataset of uncompressed color images corrupted by natural noise due to lowlight conditions, together with spatially and intensityaligned low noise images of the same scenes. The dataset contains over 100 scenes and more than 500 images, including both RAW formatted images and 8 bit BMP pixel and intensity aligned images. We will also introduce a method for estimating the true noise level in each of our images, since even the low noise images contain a small amount of noise. Through this noise estimation method we will construct a convolutional neural network model for automatic noise estimation in single noisy images. Finally, we improve upon a stateoftheart denoising algorithm Block Matching through 3D filtering (BM3D) by learning a specialized denoising parameter using another developed convolutional neural network.
Show less  Date Issued
 2016
 Identifier
 FSU_FA2016_Anaya_fsu_0071E_13600
 Format
 Thesis
 Title
 Flexible Additive Risk Models Using Piecewise Constant Hazard Functions.
 Creator

Uhm, Daiho, Huﬀer, Fred W., Kercheval, Alec, McGee, Dan, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

We study a weighted least squares (WLS) estimator for Aalen's additive risk model which allows for a very flexible handling of covariates. We divide the followup period into intervals and assume a constant hazard rate in each interval. The model is motivated as a piecewise approximation of a hazard function composed of three parts: arbitrary nonparametric functions for some covariate effects, smoothly varying functions for others, and known (or constant) functions for yet others. The...
Show moreWe study a weighted least squares (WLS) estimator for Aalen's additive risk model which allows for a very flexible handling of covariates. We divide the followup period into intervals and assume a constant hazard rate in each interval. The model is motivated as a piecewise approximation of a hazard function composed of three parts: arbitrary nonparametric functions for some covariate effects, smoothly varying functions for others, and known (or constant) functions for yet others. The proposed estimator is an extension of the grouped data version of the HufferMcKeague estimator (1991). Our estimator may also be regarded as a piecewise constant analog of the semiparametric estimates of McKeague & Sasieni (1994), and Lin & Ying (1994). By using a fairly large number of intervals, we should get an essentially semiparametric model similar to the McKeagueSasieni and LinYing approaches. For our model, since the number of parameters is finite (although large), conventional approaches (such as maximum likelihood) are easy to formulate and implement. The approach is illustrated by simulations, and is applied to data from the Framingham heart study.
Show less  Date Issued
 2007
 Identifier
 FSU_migr_etd1464
 Format
 Thesis
 Title
 A Framework for Comparing Shape Distributions.
 Creator

Henning, Wade, Srivastava, Anuj, Alamo, Ruﬁna G., Huﬀer, Fred W. (Fred William), Wu, Wei, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The problem of comparisons of shape populations is present in many branches of science, including nanomanufacturing, medical imaging, particle analysis, fisheries, seed science, and computer vision. Researchers in these fields have traditionally characterized the profiles in these sets using combinations of scalar valued descriptor features, like aspect ratio or roughness, whose distributions are easy to compare using classical statistics. However, there is a desire in this community for a...
Show moreThe problem of comparisons of shape populations is present in many branches of science, including nanomanufacturing, medical imaging, particle analysis, fisheries, seed science, and computer vision. Researchers in these fields have traditionally characterized the profiles in these sets using combinations of scalar valued descriptor features, like aspect ratio or roughness, whose distributions are easy to compare using classical statistics. However, there is a desire in this community for a single comprehensive feature that uniquely defines these profiles. The shape of the profile itself is such a feature. Shape features have traditionally been studied as individuals, and comparing distributions underlying sets of shapes is challenging. Since the data comes in the form of samples from shape populations, we use kernel methods to estimate underlying shape densities. We then take a metric approach to define a proper distance, termed the FisherRao distance, to quantify differences between any two densities. This distance can be used for clustering, classification and other types of statistical modeling; however, this dissertation focuses on comparing shape populations as a classical twosample hypothesis test with populations characterized by respective probability densities on shape space. Since we are interested in the shapes of planar closed curves and the space of such curves is infinite dimensional, there are some theoretical issues in defining and estimating densities on this space. We therefore use a spherical multidimensional scaling algorithm to project shape distributions to the unit twosphere, and this allows us to use a von MisesFisher kernel for density estimation. The estimated densities are then compared using the FisherRao distance, which, in turn, is estimated using Monte Carlo methods. This distance estimate is used as a test statistic for the twosample hypothesis test mentioned above. We use a bootstrap approach to perform the test and to evaluate population classification performance. We demonstrate these ideas using applications from industrial and chemical engineering.
Show less  Date Issued
 2014
 Identifier
 FSU_migr_etd9185
 Format
 Thesis
 Title
 The Frequentist Performance of Some Bayesian Confidence Intervals for the Survival Function.
 Creator

Tao, Yingfeng, Huﬀer, Fred, Okten, Giray, Sinha, Debajyoti, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Estimation of a survival function is a very important topic in survival analysis with contributions from many authors. This dissertation considers estimation of confidence intervals for the survival function based on right censored or intervalcensored survival data. Most of the methods for estimating pointwise confidence intervals and simultaneous confidence bands of the survival function are reviewed in this dissertation. In the rightcensored case, almost all confidence intervals are based...
Show moreEstimation of a survival function is a very important topic in survival analysis with contributions from many authors. This dissertation considers estimation of confidence intervals for the survival function based on right censored or intervalcensored survival data. Most of the methods for estimating pointwise confidence intervals and simultaneous confidence bands of the survival function are reviewed in this dissertation. In the rightcensored case, almost all confidence intervals are based in some way on the KaplanMeier estimator first proposed by Kaplan and Meier (1958) and widely used as the nonparametric estimator in the presence of rightcensored data. For intervalcensored data, the Turnbull estimator (Turnbull (1974)) plays a similar role. For a class of Bayesian models involving Dirichlet priors, Doss and Huffer (2003) suggested several simulation techniques to approximate the posterior distribution of the survival function by using Markov chain Monte Carlo or sequential importance sampling. These techniques lead to probability intervals for the survival function (at arbitrary time points) and its quantiles for both the rightcensored and intervalcensored cases. This dissertation will examine the frequentist properties and general performance of these probability intervals when the prior is noninformative. Simulation studies will be used to compare these probability intervals with other published approaches. Extensions of the DossHuffer approach are given for constructing simultaneous confidence bands for the survival function and for computing approximate confidence intervals for the survival function based on Edgeworth expansions using posterior moments. The performance of these extensions is studied by simulation.
Show less  Date Issued
 2013
 Identifier
 FSU_migr_etd7624
 Format
 Thesis
 Title
 Functional Component Analysis and Regression Using Elastic Methods.
 Creator

Tucker, J. Derek, Srivastava, Anuj, Wu, Wei, Klassen, Eric, Huﬀer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Constructing generative models for functional observations is an important task in statistical function analysis. In general, functional data contains both phase (or x or horizontal) and amplitude (or y or vertical) variability. Traditional methods often ignore the phase variability and focus solely on the amplitude variation, using crosssectional techniques such as functional principal component analysis for dimensional reduction and regression for data modeling. Ignoring phase variability...
Show moreConstructing generative models for functional observations is an important task in statistical function analysis. In general, functional data contains both phase (or x or horizontal) and amplitude (or y or vertical) variability. Traditional methods often ignore the phase variability and focus solely on the amplitude variation, using crosssectional techniques such as functional principal component analysis for dimensional reduction and regression for data modeling. Ignoring phase variability leads to a loss of structure in the data, and inefficiency in data models. Moreover, most methods use a "preprocessing'' alignment step to remove the phasevariability; without considering a more natural joint solution. This dissertation presents three approaches to this problem. The first relies on separating the phase (xaxis) and amplitude (yaxis), then modeling these components using joint distributions. This separation in turn, is performed using a technique called elastic alignment of functions that involves a new mathematical representation of functional data. Then, using individual principal components, one for each phase and amplitude components, it imposes joint probability models on principal coefficients of these components while respecting the nonlinear geometry of the phase representation space. The second combines the phasevariability into the objective function for two component analysis methods, functional principal component analysis and functional principal least squares. This creates a more complete solution, as the phasevariability is removed while simultaneously extracting the components. The third approach combines the phasevariability into the functional linear regression model and then extends the model to logistic and multinomial logistic regression. Through incorporating the phasevariability a more parsimonious regression model is obtained and therefore, more accurate prediction of observations is achieved. These models then are easily extended from functional data to curves (which are essentially functions in R2) to perform regression with curves as predictors. These ideas are demonstrated using random sampling for models estimated from simulated and real datasets, and show their superiority over models that ignore phaseamplitude separation. Furthermore, the models are applied to classification of functional data and achieve high performance in applications involving SONAR signals of underwater objects, handwritten signatures, periodic body movements recorded by smart phones, and physiological data.
Show less  Date Issued
 2014
 Identifier
 FSU_migr_etd9106
 Format
 Thesis
 Title
 Geometric Approaches for Analysis of Images, Densities and Trajectories on Manifolds.
 Creator

Zhang, Zhengwu, Srivastava, Anuj, Klassen, E. (Eric), Wu, Wei, Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this dissertation, we focus on the problem of analyzing highdimensional functional data using geometric approaches. The term functional data refers to images, densities and trajectories on manifolds. The nature of these data imposes difficulties on statistical analysis. First, the objects are functional type of data which are infinite dimensional. One needs to explore the possible representations of each type such that the representations can facilitate the future statistical analysis....
Show moreIn this dissertation, we focus on the problem of analyzing highdimensional functional data using geometric approaches. The term functional data refers to images, densities and trajectories on manifolds. The nature of these data imposes difficulties on statistical analysis. First, the objects are functional type of data which are infinite dimensional. One needs to explore the possible representations of each type such that the representations can facilitate the future statistical analysis. Second, the representation spaces are often nonlinear manifolds. Thus, proper Riemannian structures are necessary to compare objects. Third, the analysis and comparison of objects need be invariant to certain nuisance variables. For example, comparison between two images should be invariant to their blur levels, and comparison between timeindexed trajectories on manifolds should be invariant to their temporal evaluation rates. We start by introducing frameworks for representing, comparing and analyzing functions in Euclidean space including signals, images and densities, and the comparisons are invariant to the Gaussian blur existed in these objects. Applications in blur levels matching, blurred image recognition, image classification and twosample hypothesis test are discussed. Next, we present frameworks for analyzing longitudinal trajectories on a manifold M, while the analysis is invariant to the reparameterization action (temporal variation). Particularly, we are interested in analyzing trajectories in two manifolds: the twosphere and the set of symmetric positivedefinite matrices. Applications such as bird migration and hurricane tracks analysis, visual speech recognition and hand gesture recognition are used to demonstrate the advantages of the proposed frameworks. In the end, a Bayesian framework for clustering of shapes of curves is presented, and examples of clustering cell shapes and protein structures are discussed.
Show less  Date Issued
 2015
 Identifier
 FSU_migr_etd9503
 Format
 Thesis
 Title
 GoodnessofTests for Logistic Regression.
 Creator

Wu, Sutan, McGee, Dan L., Zhang, Jinfeng, Hurt, Myra, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

The generalized linear model and particularly the logistic model are widely used in public health, medicine, and epidemiology. Goodnessoffit tests for these models are popularly used to describe how well a proposed model fits a set of observations. These different goodnessoffit tests all have individual advantages and disadvantages. In this thesis, we mainly consider the performance of the "HosmerLemeshow" test, the Pearson's chisquare test, the unweighted sum of squares test and the...
Show moreThe generalized linear model and particularly the logistic model are widely used in public health, medicine, and epidemiology. Goodnessoffit tests for these models are popularly used to describe how well a proposed model fits a set of observations. These different goodnessoffit tests all have individual advantages and disadvantages. In this thesis, we mainly consider the performance of the "HosmerLemeshow" test, the Pearson's chisquare test, the unweighted sum of squares test and the cumulative residual test. We compare their performance in a series of empirical studies as well as particular simulation scenarios. We conclude that the unweighted sum of squares test and the cumulative sums of residuals test give better overall performance than the other two. We also conclude that the commonly suggested practice of assuming that a pvalue less than 0.15 is an indication of lack of fit at the initial steps of model diagnostics should be adopted. Additionally, D'Agostino et al. presented the relationship of the stacked logistic regression and the Cox regression model in the Framingham Heart Study. So in our future study, we will examine the possibility and feasibility of the adaption these goodnessoffit tests to the Cox proportional hazards model using the stacked logistic regression.
Show less  Date Issued
 2010
 Identifier
 FSU_migr_etd0693
 Format
 Thesis
 Title
 High Level Image Analysis on Manifolds via Projective Shapes and 3D Reflection Shapes.
 Creator

Lester, David T. (David Thomas), Patrangenaru, Victor, Liu, Xiuwen, Barbu, Adrian G. (Adrian Gheorghe), Tao, Minjing, Florida State University, College of Arts and Sciences,...
Show moreLester, David T. (David Thomas), Patrangenaru, Victor, Liu, Xiuwen, Barbu, Adrian G. (Adrian Gheorghe), Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

Shape analysis is a widely studied topic in modern Statistics with important applications in areas such as medical imaging. Here we focus on twosample hypothesis testing for both finite and infinite extrinsic mean shapes of configurations. First, we present a test for equality of mean projective shapes of 2D contours based on rotations. Secondly, we present a test for mean 3D reflection shapes based on the Schoenberg mean. We apply these tests to footprint data (contours), clamshells (3D...
Show moreShape analysis is a widely studied topic in modern Statistics with important applications in areas such as medical imaging. Here we focus on twosample hypothesis testing for both finite and infinite extrinsic mean shapes of configurations. First, we present a test for equality of mean projective shapes of 2D contours based on rotations. Secondly, we present a test for mean 3D reflection shapes based on the Schoenberg mean. We apply these tests to footprint data (contours), clamshells (3D reflection shape) and human facial configurations extracted from digital camera images. We also present the method of MANOVA on manifolds, and apply it to face data extracted from digital camera images. Finally, we present a new statistical tool called antiregression.
Show less  Date Issued
 2017
 Identifier
 FSU_2017SP_Lester_fsu_0071E_13856
 Format
 Thesis
 Title
 Impact of Missing Data on Building Prognostic Models and Summarizing Models Across Studies.
 Creator

Munshi, Mahtab R., McGee, Daniel, Eberstein, Isaac, Hollander, Myles, Niu, Xufeng, Chattopadhyay, Somesh, Department of Statistics, Florida State University
 Abstract/Description

We examine the impact of missing data in two settings, the development of prognostic models and the addition of new risk factors to existing risk functions. Most statistical software presently available perform complete case analysis, wherein only participants with known values for all of the characteristics being analyzed are included in model development. Missing data also impacts the summarization of evidence amongst multiple studies using metaanalytic techniques. As we progress in...
Show moreWe examine the impact of missing data in two settings, the development of prognostic models and the addition of new risk factors to existing risk functions. Most statistical software presently available perform complete case analysis, wherein only participants with known values for all of the characteristics being analyzed are included in model development. Missing data also impacts the summarization of evidence amongst multiple studies using metaanalytic techniques. As we progress in medical research, new covariates become available for studying various outcomes. While we want to investigate the influence of new factors on the outcome, we also do not want to discard the historical datasets that do not have information about these markers. Our research plan is to investigate different methods to estimate parameters for a model when some of the covariates are missing. These methods include likelihood based inference for the studylevel coefficients and likelihood based inference for the logistic model on the personlevel data. We compare the results from our methods to the corresponding results from complete case analysis. We focus our empirical investigation on a historical example, the addition of high density lipoproteins to existing equations for predicting death due to coronary heart disease. We verify our methods through simulation studies on this example.
Show less  Date Issued
 2005
 Identifier
 FSU_migr_etd2191
 Format
 Thesis
 Title
 Individual PatientLevel Data MetaAnalysis: A Comparison of Methods for the Diverse Populations Collaboration Data Set.
 Creator

Dutton, Matthew Thomas, McGee, Daniel, Becker, Betsy, Niu, Xufeng, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

DerSimonian and Laird define metaanalysis as "the statistical analysis of a collection of analytic results for the purpose of integrating their findings. One alternative to classical metaanalytic approaches in known as Individual PatientLevel Data, or IPD, metaanalysis. Rather than depending on summary statistics calculated for individual studies, IPD metaanalysis analyzes the complete data from all included studies. Two potential approaches to incorporating IPD data into the meta...
Show moreDerSimonian and Laird define metaanalysis as "the statistical analysis of a collection of analytic results for the purpose of integrating their findings. One alternative to classical metaanalytic approaches in known as Individual PatientLevel Data, or IPD, metaanalysis. Rather than depending on summary statistics calculated for individual studies, IPD metaanalysis analyzes the complete data from all included studies. Two potential approaches to incorporating IPD data into the metaanalytic framework are investigated. A twostage analysis is first conducted, in which individual models are fit for each study and summarized using classical metaanalysis procedures. Secondly, a onestage approach that singularly models the data and summarizes the information across studies is investigated. Data from the Diverse Populations Collaboration data set are used to investigate the differences between these two methods in a specific example. The bootstrap procedure is used to determine if the two methods produce statistically different results in the DPC example. Finally, a simulation study is conducted to investigate the accuracy of each method in given scenarios.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd0620
 Format
 Thesis
 Title
 Inference for Semiparametric TimeVarying Covariate Effect Relative Risk Regression Models.
 Creator

Ye, Gang, McKeague, Ian W., Wang, Xiaoming, Huffer, Fred W., Song, KaiSheng, Department of Statistics, Florida State University
 Abstract/Description

A major interest of survival analysis is to assess covariate effects on survival via appropriate conditional hazard function regression models. The Cox proportional hazards model, which assumes an exponential form for the relative risk, has been a popular choice. However, other regression forms such as Aalen's additive risk model may be more appropriate in some applications. In addition, covariate effects may depend on time, which can not be reflected by a Cox proportional hazards model. In...
Show moreA major interest of survival analysis is to assess covariate effects on survival via appropriate conditional hazard function regression models. The Cox proportional hazards model, which assumes an exponential form for the relative risk, has been a popular choice. However, other regression forms such as Aalen's additive risk model may be more appropriate in some applications. In addition, covariate effects may depend on time, which can not be reflected by a Cox proportional hazards model. In this dissertation, we study a class of timevarying covariate effect regression models in which the link function (relative risk function) is a twice continuously differentiable and prespecified, but otherwise general given function. This is a natural extension of the PrenticeSelf model, in which the link function is general but covariate effects are modelled to be time invariant. In the first part of the dissertation, we focus on estimating the cumulative or integrated covariate effects. The standard martingale approach based on counting processes is utilized to derive a likelihoodbased iterating equation. An estimator for the cumulative covariate effect that is generated from the iterating equation is shown to be ¡Ìnconsistent. Asymptotic normality of the estimator is also demonstrated. Another aspect of the dissertation is to investigate a new test for the above timevarying covariate effect regression model and study consistency of the test based on martingale residuals. For Aalen's additive risk model, we introduce a test statistic based on the HufferMcKeague weightedleastsquares estimator and show its consistency against some alternatives. An alternative way to construct a test statistic based on Bayesian Bootstrap simulation is introduced. An application to real lifetime data will be presented.
Show less  Date Issued
 2005
 Identifier
 FSU_migr_etd0949
 Format
 Thesis
 Title
 Intensity Estimation in Poisson Processes with Phase Variability.
 Creator

Gordon, Glenna, Wu, Wei, Whyte, James, Srivastava, Anuj, Chicken, Eric, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Intensity estimation for Poisson processes is a classical problem and has been extensively studied over the past few decades. However, current methods of intensity estimation assume phase variability or compositional noise, i.e. a nonlinear shift along the time axis, is nonexistent in the data which is an unreasonable assumption for practical observations. The key challenge is that these observations are not "aligned'', and registration procedures are required for successful estimation. As a...
Show moreIntensity estimation for Poisson processes is a classical problem and has been extensively studied over the past few decades. However, current methods of intensity estimation assume phase variability or compositional noise, i.e. a nonlinear shift along the time axis, is nonexistent in the data which is an unreasonable assumption for practical observations. The key challenge is that these observations are not "aligned'', and registration procedures are required for successful estimation. As a result, these estimation methods can yield estimators that are inefficient or that underperform in simulations and applications. This dissertation summarizes two key projects which examine estimation of the intensity of a Poisson process in the presence of phase variability. The first project proposes an alignmentbased framework for intensity estimation. First, it is shown that the intensity function is areapreserved with respect to compositional noise. Such a property implies that the time warping is only encoded in the density, or normalized intensity, function. Then, the intensity function can be decomposed into the product of the estimated total intensity (a scalar value) and the estimated density function. The estimation of the density relies on a metric which measures the phase difference between two density functions. An asymptotic study shows that the proposed estimation algorithm provides a consistent estimator for the normalized intensity. The success of the proposed estimation algorithm is illustrated using two simulations and the new framework is applied in a real data set of neural spike trains, showing that the proposed estimation method yields improved classification accuracy over previous methods. The second project utilizes 2014 Florida data from the Healthcare Cost and Utilization Project's State Inpatient Database and State Emergency Department Database (provided to the U.S. Department of Health and Human Services, Agency for Healthcare Research and Quality by the Florida Agency for Health Care Administration) to examine heart failure emergency department arrival times. Current estimation methods for examining emergency department arrival data ignore the functional nature of the data and implement naive analysis methods. In this dissertation, the arrivals are treated as a Poisson process and the intensity of the process is estimated using existing density estimation and function registration methods. The results of these analyses show the importance of considering the functional nature of emergency department arrival data and the critical role that function registration plays in the intensity estimation of the arrival process.
Show less  Date Issued
 2016
 Identifier
 FSU_FA2016_Gordon_fsu_0071E_13511
 Format
 Thesis
 Title
 Interrelating of Longitudinal Processes: An Empirical Example.
 Creator

RoyalThomas, Tamika Y. N., McGee, Daniel, Levenson, Cathy, Sinha, Debajyoti, Osmond, Clive, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The Barker Hypothesis states that maternal and `in utero' attributes during pregnancy affects a child's cardiovascular health throughout life. We present an analysis of a unique longitudinal dataset from Jamaica that consists of three longitudinal processes: (i) Maternal longitudinal process Blood pressure and anthropometric measurements at seven timepoints on the mother during pregnancy. (ii) In Utero measurements  Ultrasound measurements of the fetus taken at six timepoints during...
Show moreThe Barker Hypothesis states that maternal and `in utero' attributes during pregnancy affects a child's cardiovascular health throughout life. We present an analysis of a unique longitudinal dataset from Jamaica that consists of three longitudinal processes: (i) Maternal longitudinal process Blood pressure and anthropometric measurements at seven timepoints on the mother during pregnancy. (ii) In Utero measurements  Ultrasound measurements of the fetus taken at six timepoints during pregnancy. (iii) Birth to present process  Children's anthropometric and blood pressure measurements at 24 timepoints from birth to 14 years. A comprehensive analysis of the interrelationship of these three longitudinal processes is presented using joint modeling for multivariate longitudinal profiles. We propose a new methodology of examining child's cardiovascular risk by extending a current view of likelihood estimation. Joint modeling of multivariate longitudinal profiles is done and the extension of the traditional likelihood method is utilized in this paper and compared to the maximum likelihood estimates. Our main goal is to examine whether the process in mothers predicts fetal development which in turn predicts the future cardiovascular health of the children. One of the difficulties with `in utero' and early childhood data is that certain variables are highly correlated and so using dimension reduction techniques are quite applicable in this scenario. Principal component analysis (PCA) is utilized in creating a smaller dimension of uncorrelated data which is then utilized in a longitudinal analysis setting. These principal components are then utilized in an optimal linear mixed model for longitudinal data which indicates that in utero and early childhood attributes predicts the future cardiovascular health of the children. This dissertation has added a body of knowledge to developmental origins of adult diseases and has supplied some significant results while utilizing a rich diversity of statistical methodologies.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd1792
 Format
 Thesis
 Title
 Investigating the Categories for Cholesterol and Blood Pressure for Risk Assessment of Death Due to Coronary Heart Disease.
 Creator

Franks, Billy J., McGee, Daniel, Hurt, Myra, Huﬀer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Many characteristics for predicting death due to coronary heart disease are measured on a continuous scale. These characteristics, however, are often categorized for clinical use and to aid in treatment decisions. We would like to derive a systematic approach to determine the best categorizations of systolic blood pressure and cholesterol level for use in identifying individuals who are at high risk for death due to coronary heart disease and to compare these data derived categories to those...
Show moreMany characteristics for predicting death due to coronary heart disease are measured on a continuous scale. These characteristics, however, are often categorized for clinical use and to aid in treatment decisions. We would like to derive a systematic approach to determine the best categorizations of systolic blood pressure and cholesterol level for use in identifying individuals who are at high risk for death due to coronary heart disease and to compare these data derived categories to those in common usage. Whatever categories are chosen, they should allow physicians to accurately estimate the probability of survival from coronary heart disease until some time t. The best categories will be those that provide the most accurate prediction for an individual's risk of dying by t. The approach that will be used to determine these categories will be a version of Classification And Regression Trees that can be applied to censored survival data. The major goals of this dissertation are to obtain dataderived categories for risk assessment, compare these categories to the ones already recommended in the medical community, and to assess the performance of these categories in predicting survival probabilities.
Show less  Date Issued
 2005
 Identifier
 FSU_migr_etd4402
 Format
 Thesis
 Title
 Investigating the Use of Mortality Data as a Surrogate for Morbidity Data.
 Creator

Miller, Gregory, Hollander, Myles, McGee, Daniel, Hurt, Myra, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

We are interested in differences between risk models based on Coronary Heart Disease (CHD) incidence, or morbidity, compared to risk models based on CHD death. Risk models based on morbidity have been developed based on the Framingham Heart Study, while the European SCORE project developed a risk model for CHD death. Our goal is to determine whether these two developed models differ in treatment decisions concerning patient heart health. We begin by reviewing recent metrics in surrogate...
Show moreWe are interested in differences between risk models based on Coronary Heart Disease (CHD) incidence, or morbidity, compared to risk models based on CHD death. Risk models based on morbidity have been developed based on the Framingham Heart Study, while the European SCORE project developed a risk model for CHD death. Our goal is to determine whether these two developed models differ in treatment decisions concerning patient heart health. We begin by reviewing recent metrics in surrogate variables and prognostic model performance. We then conduct bootstrap hypotheses tests between two Cox proportional hazards models using Framingham data, one with incidence as a response, and one with death as a response, and find that the coefficients differ for the age covariate, but find no significant differences for the other risk factors. To understand how surrogacy can be applied to our case, where the surrogate variable is nested within the true variable of interest, we examine models based on a composite event compared to models based on singleton events. We also conduct a simulation, simulating times to a CHD incidence and time from CHD incidence to CHD death, censoring at 25 years to represent the end of a study. We compare a Cox model with death response with a Cox model based on incidence using bootstrapped confidence intervals, and find that age and systolic blood pressure have differences with their covariates. We continue the simulation by using Net Reclassification Index (NRI) to evaluate the treatment decision performance of the two models, and find that the two models do not perform significantly different in correctly classifying events, if the decisions are based on the risk ranks of the individuals. As long as the relative order of patients' risks is preserved across different risk models, treatment decisions based on classifying an upper specified percent as high risk will not be significantly different. We conclude the dissertation with statements about future methods for approaching our question.
Show less  Date Issued
 2011
 Identifier
 FSU_migr_etd2408
 Format
 Thesis
 Title
 Logistic Regression, Measures of Explained Variation, and the Base Rate Problem.
 Creator

Sharma, Dinesh R., McGee, Daniel L., Hurt, Myra, Niu, XuFeng, Chicken, Eric, Department of Statistics, Florida State University
 Abstract/Description

One of the desirable properties of the coefficient of determinant (R2 measure) is that its values for different models should be comparable whether the models differ in one or more predictors, or in the dependent variable, or whether the models are specified as being different for different subsets of a dataset. This allows researchers to compare adequacy of models across subgroups of the population or models with different but related dependent variables. However, the various analogs of the...
Show moreOne of the desirable properties of the coefficient of determinant (R2 measure) is that its values for different models should be comparable whether the models differ in one or more predictors, or in the dependent variable, or whether the models are specified as being different for different subsets of a dataset. This allows researchers to compare adequacy of models across subgroups of the population or models with different but related dependent variables. However, the various analogs of the R2 measure used for logistic regression analysis are highly sensitive to the base rate (proportion of successes in the sample) and thus do not possess this property. An R2 measure sensitive to the base rate is not suitable to comparison for the same or different model on different datasets, different subsets of a dataset or different but related dependent variables. We evaluated 14 R2 measures that have been suggested or might be useful to measure the explained variation in the logistic regression models based on three criteria 1) intuitively reasonable interpret ability; 2) numerical consistency with the Rho2 of underlying model, and 3) the base rate sensitivity. We carried out a Monte Carlo Simulation study to examine the numerical consistency and the base rate dependency of the various R2 measures for logistic regression analysis. We found all of the parametric R2 measures to be substantially sensitive to the base rate. The magnitude of the base rate sensitivity of these measures tends to be further influenced by the rho2 of the underlying model. None of the measures considered in our study are found to perform equally well in all of the three evaluation criteria used. While R2L stands out for its intuitively reasonable interpretability as a measures of explained variation as well as its independence from the base rate, it appears to severely underestimate the underlying rho2. We found R2CS to be numerically most consistent with the underlying Rho2, with R2N its nearest competitor. In addition, the base rate sensitivity of these two measures appears to be very close to that of the R2L, the most base rate invariant parametric R2 measure. Therefore, we suggest to use R2CS and R2N for logistic regression modeling, specially when it is reasonable to believe that a underlying latent variable exists. However, when the latent variable does not exit, comparability with theunderlying rho2 is not an issue and R2L might be a better choice over all the R2 measures.
Show less  Date Issued
 2006
 Identifier
 FSU_migr_etd1789
 Format
 Thesis
 Title
 Matched Sample Based Approach for CrossPlatform Normalization on Gene Expression Data.
 Creator

Shao, Jiang, Zhang, Jinfeng, Sang, QingXiang Amy, Wu, Wei, Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Geneexpression data profile are widely used in all kinds of biomedical studies especially in cancer research. This dissertation work focus on solving the problem of how to combine datasets arising from different studies. Of particular interest is how to remove platform effect alone. The matched sample based crossplatform normalization method we developed are designed to tackle data merging problem in two scenarios: The first is affyagilent crossplatform normalization which are belong to...
Show moreGeneexpression data profile are widely used in all kinds of biomedical studies especially in cancer research. This dissertation work focus on solving the problem of how to combine datasets arising from different studies. Of particular interest is how to remove platform effect alone. The matched sample based crossplatform normalization method we developed are designed to tackle data merging problem in two scenarios: The first is affyagilent crossplatform normalization which are belong to classic microarray gene expression profile. The second is the integration of microarray data with Next Generation Sequencing genome data. We use several general validation measures to assess and compare with the popular Distanceweighted discrimination method. With the public webbased tool NCI60 CellMiner and The Cancer Genome Atlas data portal supported, our proposed method outperformed DWD in both crossplatform scenarios. It can be further assessed by the ability of exploring biological features in the studies of cancer type discrimination. We applied our method onto two classification problem: One is Breast cancer tumor/normal status classification on microarray and next generation sequencing datasets; The other is Breast cancer patients chemotherapy response classification on GPL96 and GPL570 microarray datasets. Both problems show the classification power are increased after our matched sample based crossplatform normalization method.
Show less  Date Issued
 2015
 Identifier
 FSU_2015fall_Shao_fsu_0071E_12833
 Format
 Thesis
 Title
 Median Regression for Complex Survey Data.
 Creator

Fraser, Raphael André, Sinha, Debajyoti, Lipsitz, Stuart, Carlson, Elwood, Slate, Elizabeth H., Huffer, Fred W. (Fred William), Florida State University, College of Arts and...
Show moreFraser, Raphael André, Sinha, Debajyoti, Lipsitz, Stuart, Carlson, Elwood, Slate, Elizabeth H., Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
Show less  Abstract/Description

The ready availability of publicuse data from various large national complex surveys has immense potential for the assessment of population characteristicsmeans, proportions, totals, etcetera. Using a modelbased approach, complex surveys can be used to evaluate the effectiveness of treatments and to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or utilizing resampling methods are often not valid with survey data...
Show moreThe ready availability of publicuse data from various large national complex surveys has immense potential for the assessment of population characteristicsmeans, proportions, totals, etcetera. Using a modelbased approach, complex surveys can be used to evaluate the effectiveness of treatments and to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or utilizing resampling methods are often not valid with survey data due to design features such as stratification, multistage sampling and unequal selection probabilities. In this paper, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a doubletransformbothsides based estimating equations approach to estimate the median regression parameters of the highly skewed response; the doubletransformbothsides method applies the same transformation twice to both the response and regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudolikelihood based on minimizing absolute deviations. Furthermore, the doubletransformbothsides estimator is relatively robust to the true underlying distribution, and has much smaller mean square error than the least absolute deviations estimator. The method is motivated by an analysis of laboratory data on urinary iodine concentration from the National Health and Nutrition Examination Survey.
Show less  Date Issued
 2015
 Identifier
 FSU_2015fall_Fraser_fsu_0071E_12825
 Format
 Thesis