 Association Models for Clustered Data with Binary and Continuous Responses.
 Creator

Lin, Lanjia, Sinha, Debajyoti, Hurt, Myra, Lipsitz, Stuart R., McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

This dissertation develops novel single random effect models as well as a bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses, respectively. For ease of interpretation of the regression effects, the random effect of the binary response has a bridge distribution, so that the marginal model of the mean of the binary response, after integrating out the random effect, preserves the logistic form, and the marginal regression function of the continuous response preserves the linear form. Within-cluster and within-subject associations can be measured by our proposed models. For the bivariate correlated random effects model, we illustrate how different levels of association between the two random effects induce different Kendall's tau values for the association between the binary and continuous responses from the same cluster. Fully parametric and semiparametric Bayesian methods as well as the maximum likelihood method are illustrated for model analysis. In the semiparametric Bayesian model, the normality assumption on the regression error for the continuous response is relaxed by using a nonparametric Dirichlet process prior. Robustness of the bivariate correlated random effects model, fitted by maximum likelihood, to misspecification of the regression function and of the random effect distribution is investigated through simulation studies. The Bayesian and likelihood methods are applied to a developmental toxicity study of ethylene glycol in mice.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1330
 Format
 Thesis
 Title
 Building a Model Performance Measure for Examining Clinical Relevance Using Net Benefit Curves.
 Creator

Mukherjee, Anwesha, McGee, Daniel, Hurt, Myra M., Slate, Elizabeth H., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

ROC curves are often used to evaluate the predictive accuracy of statistical prediction models. This thesis studies other measures that incorporate not only the statistical but also the clinical consequences of using a particular prediction model. Depending on the disease and population under study, the misclassification costs of false positives and false negatives vary. Decision Curve Analysis (DCA) takes these costs into account by using the threshold probability (the probability above which a patient opts for treatment). Using the DCA technique, a net benefit curve is built by plotting "net benefit", a function of the expected benefit and expected harm of using a model, against the threshold probability. Only the threshold probability range that is relevant to the disease and the population under study is used to plot the net benefit curve, so as to obtain the optimum results from a particular statistical model. This thesis concentrates on constructing a summary measure to find which predictive model yields the highest net benefit. The most intuitive approach is to calculate the area under the net benefit curve. We examined whether using weights, such as the estimated empirical distribution of the threshold probability, to compute a weighted area under the curve creates a better summary measure. Real data from multiple cardiovascular research studies, the Diverse Populations Collaboration (DPC) datasets, are used to compute the summary measures: area under the ROC curve (AUROC), area under the net benefit curve (ANBC), and weighted area under the net benefit curve (WANBC). The results of the analysis are used to compare these measures, to examine whether they agree with each other, and to determine which would be best to use in specified clinical scenarios. For different models, the summary measures and their standard errors (SE) were calculated to study the variability in each measure.
Meta-analysis is used to summarize these estimated summary measures and to reveal whether there is significant variability among the studies.
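As an illustration of the net benefit curve and its area described in this abstract, the following is a minimal sketch. It assumes the standard decision-curve definition of net benefit (true positives per patient minus false positives per patient weighted by the odds of the threshold) and a trapezoidal rule for the area; the function names and the optional `weights` argument are hypothetical and are not taken from the thesis.

```python
def net_benefit(y_true, risk, pt):
    """Net benefit at threshold probability pt (0 < pt < 1).

    Assumes the usual decision-curve form:
        NB = TP/n - (FP/n) * pt / (1 - pt)
    where a patient is treated when the predicted risk is >= pt.
    """
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= pt and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= pt and y == 0)
    return tp / n - (fp / n) * (pt / (1.0 - pt))


def area_under_net_benefit(y_true, risk, thresholds, weights=None):
    """Trapezoidal area under the net benefit curve over sorted thresholds.

    weights (hypothetical) holds one non-negative weight per threshold,
    for example an estimated density of the threshold probability.
    """
    if weights is None:
        weights = [1.0] * len(thresholds)
    heights = [w * net_benefit(y_true, risk, t)
               for t, w in zip(thresholds, weights)]
    area = 0.0
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * (heights[i] + heights[i - 1]) * width
    return area


# Toy data: two events, two non-events.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4]
nb_half = net_benefit(y, p, 0.5)
anbc = area_under_net_benefit(y, p, [0.3, 0.5])
```

Restricting `thresholds` to a clinically relevant range mirrors the abstract's point that only the relevant part of the curve should enter the summary.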
 Date Issued
 2018
 Identifier
 2018_Sp_Mukherjee_fsu_0071E_14350
 Format
 Thesis
 Title
 Covariance on Manifolds.
 Creator

Balov, Nikolay H. (Nikolay Hristov), Srivastava, Anuj, Klassen, Eric, Patrangenaru, Victor, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

With the ever-increasing complexity of observational and theoretical data models, the sufficiency of classical statistical techniques, designed to be applied only to vector quantities, is being challenged. Nonlinear statistical analysis has become an area of intensive research in recent years. Despite the impressive progress in this direction, a unified and consistent framework has not been reached. In this regard, the following work is an attempt to improve our understanding of random phenomena on non-Euclidean spaces. More specifically, the motivating goal of the present dissertation is to generalize the notion of distribution covariance, which in standard settings is defined only in Euclidean spaces, to arbitrary manifolds with a metric. We introduce a tensor field structure, named the covariance field, that is consistent with the heterogeneous nature of manifolds. It not only describes the variability imposed by a probability distribution but also provides alternative distribution representations. The covariance field combines the distribution density with geometric characteristics of its domain and thus fills the gap between the two. We present some of the properties of covariance fields and argue that they can be successfully applied to various statistical problems. In particular, we provide a systematic approach for defining parametric families of probability distributions on manifolds, parameter estimation for regression analysis, nonparametric statistical tests for comparing probability distributions, and interpolation between such distributions. We then present several application areas where this new theory may have potential impact. One of them is the branch of directional statistics, with a domain of influence ranging from geosciences to medical image analysis. The fundamental level at which the covariance-based structures are introduced also opens a new area for future research.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1045
 Format
 Thesis
 Title
 Discrimination and Calibration of Prognostic Survival Models.
 Creator

Simino, Jeannette M., Hollander, Myles, McGee, Daniel, Hurt, Myra, Niu, XuFeng, Department of Statistics, Florida State University
 Abstract/Description

Clinicians employ prognostic survival models for diseases such as coronary heart disease and cancer to inform patients about risks, treatments, and clinical decisions (Altman and Royston 2000). These prognostic models are not useful unless they are valid in the population to which they are applied. There are no generally accepted algorithms for assessing the validity of an external survival model in a new population. Researchers often invoke measures of predictive accuracy, the degree to which predicted outcomes match observed outcomes (Justice et al. 1999). One component of predictive accuracy is discrimination, the ability of the model to correctly rank the individuals in the sample by risk. A common measure of discrimination for prognostic survival models is the concordance index, also called the c-statistic. We use the concordance index to determine the discrimination of Framingham-based Cox and Log-logistic models of coronary heart disease (CHD) death in cohorts from the Diverse Populations Collaboration, a collection of studies that encompasses many ethnic, geographic, and socioeconomic groups. Pencina and D'Agostino presented a confidence interval for the concordance index when assessing the discrimination of an external prognostic model. We perform simulations to determine the robustness of their confidence interval when measuring discrimination during internal validation. The Pencina and D'Agostino confidence interval is not valid in the internal validation setting because its assumption of mutually independent observations is violated. We compare the Pencina and D'Agostino confidence interval with a bootstrap confidence interval that we propose and that is valid for internal validation. We specifically examine the performance of the interval when the same sample is used both to fit a prognostic model and to determine its validity. The framework for our simulations is a Weibull proportional hazards model of CHD death fit to the Framingham exam 4 data.
We then focus on the second component of accuracy, calibration, which measures the agreement between the observed and predicted event rates for groups of patients (Altman and Royston 2000). In 2000, van Houwelingen introduced a method called validation by calibration to allow a clinician to assess the validity of a well-accepted published survival model on his or her own patient population and to adjust the published model to fit that population. Van Houwelingen embeds the published model into a new model with only 3 parameters, which helps combat the overfitting that occurs when models with many covariates are fit to data sets with a small number of events. We explore validation by calibration as a tool to adjust models when an external model over- or underestimates risk. Van Houwelingen discusses the general method and then focuses on the proportional hazards model. There are situations where proportional hazards may not hold, so we extend the methodology to the Log-logistic accelerated failure time model. We perform validation by calibration of Framingham-based Cox and Log-logistic models of CHD death on cohorts from the Diverse Populations Collaboration. Lastly, we conduct simulations that investigate the power of the global Wald validation by calibration test. We study its power to reject an invalid proportional hazards or Log-logistic accelerated failure time model under various scale and/or shape misspecifications.
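The concordance index mentioned in this abstract can be sketched for right-censored data as follows. This is a minimal illustration that assumes Harrell's pairwise form of the c-statistic (a pair is usable when the subject with the shorter time had an event, and ties in predicted risk count as one half); it is not necessarily the estimator or the confidence interval studied in the dissertation, and the function name is hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell-type concordance index for right-censored survival data.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risk   : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a usable pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
c_poor = concordance_index(times, events, [0.1, 0.7, 0.5, 0.9])
```

A value near 1 indicates that higher predicted risk goes with earlier events, while 0.5 corresponds to a ranking no better than chance.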
 Date Issued
 2009
 Identifier
 FSU_migr_etd0328
 Format
 Thesis
 Title
 Elastic Shape Analysis of RNAs and Proteins.
 Creator

Laborde, Jose M., Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

Proteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of structural bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and classification. This work applies Elastic Shape Analysis (ESA), a method recently developed in computer vision, to construct rigorous mathematical and statistical frameworks for the comparison, clustering, and classification of proteins and RNAs. ESA treats biomolecular structures as 3D parameterized curves, which are represented with a special map called the square root velocity function (SRVF). In the resulting shape space of elastic curves, one can perform statistical analysis of curves as if they were random variables. One can compare, match, and deform one curve into another, as well as compute averages and covariances of curve populations and perform hypothesis testing and classification of curves according to their shapes. We have successfully applied ESA to the comparison and classification of protein and RNA structures. We further extend the ESA framework to incorporate additional non-geometric information that tags the shape of the molecules (namely, the sequence of nucleotide/amino-acid letters for RNAs/proteins and, in the latter case, also the labels for the so-called secondary structure). The biological representation is chosen such that the ESA framework continues to be mathematically formal. We have achieved superior classification of RNA functions compared to state-of-the-art methods on benchmark RNA datasets, which has led to the publication of this work in the journal Nucleic Acids Research (NAR).
Based on the ESA distances, we have also developed a fast method to classify protein domains by using a representative set of protein structures generated by a clustering-based technique we call Multiple Centroid Class Partitioning (MCCP). Comparison with other standard approaches showed that MCCP significantly improves accuracy while keeping the representative set smaller than those of the other methods. The current schemes for the classification and organization of proteins (such as SCOP and CATH) assume a discrete space of structures, where a protein is classified into one and only one class in a hierarchical tree structure. Our recent study, and studies by other researchers, showed that the protein structure space is more continuous than discrete. To capture the complex but quantifiable continuous nature of protein structures, we propose to organize these molecules using a network model, where individual proteins are mapped to possibly multiple nodes of classes, each associated with a probability. Structural classes will then be connected to form a network based on overlaps of the corresponding probability distributions in the structural space.
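To make the square root velocity function (SRVF) representation concrete, here is a minimal sketch. It assumes the usual definition q(t) = f'(t) / sqrt(|f'(t)|) for a curve f parameterized on [0, 1], approximated with forward differences on sampled 3D points; the function names are hypothetical, and the sketch omits the alignment and reparameterization steps that a full elastic analysis would need.

```python
import math


def srvf(points):
    """Discrete SRVF of a curve given as a list of 3D points.

    The curve is assumed to be sampled uniformly in the parameter t on [0, 1].
    Returns one SRVF vector per segment.
    """
    n = len(points)
    dt = 1.0 / (n - 1)
    q = []
    for k in range(n - 1):
        # Forward-difference velocity on segment k.
        v = [(b - a) / dt for a, b in zip(points[k], points[k + 1])]
        speed = math.sqrt(sum(c * c for c in v))
        if speed > 0.0:
            q.append([c / math.sqrt(speed) for c in v])
        else:
            q.append([0.0 for _ in v])  # stationary segment maps to zero
    return q


def squared_l2_norm(q):
    """Integral of |q(t)|^2 over [0, 1]; this equals the curve length."""
    dt = 1.0 / len(q)
    return sum(sum(c * c for c in vec) for vec in q) * dt


# Straight segment from the origin to (3, 4, 0), length 5.
line = [(0.75 * k, 1.0 * k, 0.0) for k in range(5)]
length_from_srvf = squared_l2_norm(srvf(line))
```

Because the squared norm of the SRVF recovers the curve length, rescaling q to unit norm is one common way to remove scale before comparing shapes.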
 Date Issued
 2013
 Identifier
 FSU_migr_etd8586
 Format
 Thesis
 Title
 Estimating the Probability of Cardiovascular Disease: A Comparison of Methods.
 Creator

Fan, Li, McGee, Daniel, Hurt, Myra, Niu, XuFeng, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Risk prediction plays an important role in clinical medicine. It not only helps in educating patients to improve their lifestyle and in targeting individuals at high risk, but also guides treatment decisions. So far, various instruments have been used for risk assessment in different countries, and the risk predictions from these different models are not consistent. For public use, reliable risk prediction is necessary. This thesis discusses the models that have been developed for risk assessment and evaluates the performance of prediction at two levels: the overall level and the individual level. At the overall level, cross-validation and simulation are used to assess the risk prediction, while at the individual level, the "Parametric Bootstrap" and the delta method are used to evaluate the uncertainty of the individual risk prediction. Further exploration of the reasons for the differing performance among the models is ongoing.
 Date Issued
 2009
 Identifier
 FSU_migr_etd4508
 Format
 Thesis
 Title
 An Examination of the Concept of Frailty in the Elderly.
 Creator

Griffin, Felicia R., McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Frailty has been defined as a state of increased vulnerability to adverse outcomes. The concept of frailty has been centered around counting the number of deficits in health, which can be diseases, disabilities, or symptoms. However, there is no consensus on how it should be quantified. Frailty has been considered synonymous with functional status and comorbidity, but these may be distinct concepts requiring different management. We compared two methods of defining a frailty phenotype, a...
 Date Issued
 2015
 Identifier
 FSU_migr_etd9342
 Format
 Thesis
 Title
 An Examination of the Relationship between Alcohol and Dementia in a Longitudinal Study.
 Creator

Hu, Tingting, McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The high mortality rate and huge expenditure caused by dementia make it a pressing concern for public health researchers. Among the potential risk factors in diet and nutrition, the relation between alcohol usage and dementia has been investigated in many studies, but no clear picture has emerged. This association has been reported as protective, neurotoxic, U-shaped, and insignificant in different sources. An individual’s alcohol usage is dynamic and can change over time; however, to our knowledge, only one study took this time-varying nature into account when assessing the association between alcohol intake and cognition. Using Framingham Heart Study (FHS) data, our work fills an important gap in that both alcohol use and dementia status were included in the analysis longitudinally. Furthermore, we incorporated a gender-specific categorization of alcohol consumption. In this study, we examined three aspects of the association: (1) concurrent alcohol usage and dementia, longitudinally; (2) past alcohol usage and later dementia; and (3) cumulative alcohol usage and dementia. The data consisted of 2,192 FHS participants who took Exams 17-23 during 1981-1996, which included dementia assessment, and who had complete data on alcohol use (mean follow-up = 40 years) and key covariates. Cognitive status was determined using information from the Mini-Mental State Examinations (MMSE) and the examiner’s assessment. Alcohol consumption was measured in oz/week and also categorized as none, moderate, or heavy. We investigated both total alcohol consumption and consumption by type of alcoholic beverage. Results showed that the association between alcohol and dementia may differ by gender and by type of alcoholic beverage.
 Date Issued
 2018
 Identifier
 2018_Su_Hu_fsu_0071E_14330
 Format
 Thesis
 Title
 Examining the Relationship of Dietary Component Intakes to Each Other and to Mortality.
 Creator

Alrajhi, Sharifah, McGee, Daniel, Levenson, Cathy W., Niu, Xufeng, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this essay we present an analysis examining the basic dietary structure and its relationship to mortality in the first National Health and Nutrition Examination Survey (NHANES I), conducted between 1971 and 1975. We used results from 24-hour recalls on 10,483 individuals in this study. All of the individuals in the analytic sample were followed through 1992 for vital status. The mean follow-up period for the participants was 16 years. During follow-up, 2,042 (48%) males and 1,754 (27%) females died. We first attempted to capture the inherent structure of the dietary data using principal components analysis (PCA). We performed this estimation separately for each race (white and black) and gender (male and female) and compared the estimated principal components among these four strata. We found that the principal components were similar (but not identical) in the four strata. We also related our estimated principal components to mortality using Cox Proportional Hazards (CPH) models and related dietary components to mortality using forward variable selection.
 Date Issued
 2015
 Identifier
 FSU_2015fall_Alrajhi_fsu_0071E_12802
 Format
 Thesis
 Title
 Impact of Missing Data on Building Prognostic Models and Summarizing Models Across Studies.
 Creator

Munshi, Mahtab R., McGee, Daniel, Eberstein, Isaac, Hollander, Myles, Niu, Xufeng, Chattopadhyay, Somesh, Department of Statistics, Florida State University
 Abstract/Description

We examine the impact of missing data in two settings: the development of prognostic models and the addition of new risk factors to existing risk functions. Most statistical software presently available performs complete case analysis, wherein only participants with known values for all of the characteristics being analyzed are included in model development. Missing data also affect the summarization of evidence across multiple studies using meta-analytic techniques. As medical research progresses, new covariates become available for studying various outcomes. While we want to investigate the influence of new factors on the outcome, we do not want to discard the historical datasets that lack information about these markers. Our research plan is to investigate different methods for estimating the parameters of a model when some of the covariates are missing. These methods include likelihood-based inference for the study-level coefficients and likelihood-based inference for the logistic model on the person-level data. We compare the results from our methods with the corresponding results from complete case analysis. We focus our empirical investigation on a historical example, the addition of high-density lipoproteins to existing equations for predicting death due to coronary heart disease. We verify our methods through simulation studies on this example.
 Date Issued
 2005
 Identifier
 FSU_migr_etd2191
 Format
 Thesis
 Title
 Individual Patient-Level Data Meta-Analysis: A Comparison of Methods for the Diverse Populations Collaboration Data Set.
 Creator

Dutton, Matthew Thomas, McGee, Daniel, Becker, Betsy, Niu, Xufeng, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

DerSimonian and Laird define meta-analysis as "the statistical analysis of a collection of analytic results for the purpose of integrating their findings." One alternative to classical meta-analytic approaches is known as Individual Patient-Level Data, or IPD, meta-analysis. Rather than depending on summary statistics calculated for individual studies, IPD meta-analysis analyzes the complete data from all included studies. Two potential approaches to incorporating IPD into the meta-analytic framework are investigated. A two-stage analysis is first conducted, in which individual models are fit for each study and summarized using classical meta-analysis procedures. Second, a one-stage approach that models all of the data in a single model and summarizes the information across studies is investigated. Data from the Diverse Populations Collaboration (DPC) data set are used to investigate the differences between these two methods in a specific example. The bootstrap procedure is used to determine whether the two methods produce statistically different results in the DPC example. Finally, a simulation study is conducted to investigate the accuracy of each method in given scenarios.
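The second stage of a two-stage analysis, in which per-study estimates are pooled, can be sketched as follows. The abstract does not say which classical procedure was used, so this example assumes the DerSimonian and Laird random-effects estimator; the function name and the input layout (one effect estimate and one variance per study) are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effect estimates with a random-effects model.

    effects   : per-study estimates (for example, log hazard ratios)
    variances : their within-study variances
    Returns (pooled estimate, standard error, between-study variance tau^2).
    """
    k = len(effects)
    if k < 2:
        raise ValueError("need at least two studies")
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                    # method-of-moments
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


pooled, se, tau2 = dersimonian_laird([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
```

In a first stage, each study's model would supply one entry of `effects` and `variances`; a one-stage analysis would instead fit a single model to the stacked patient-level data.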
 Date Issued
 2011
 Identifier
 FSU_migr_etd0620
 Format
 Thesis
 Title
 Interrelating of Longitudinal Processes: An Empirical Example.
 Creator

Royal-Thomas, Tamika Y. N., McGee, Daniel, Levenson, Cathy, Sinha, Debajyoti, Osmond, Clive, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The Barker Hypothesis states that maternal and 'in utero' attributes during pregnancy affect a child's cardiovascular health throughout life. We present an analysis of a unique longitudinal dataset from Jamaica that consists of three longitudinal processes: (i) Maternal longitudinal process: blood pressure and anthropometric measurements at seven timepoints on the mother during pregnancy. (ii) In utero measurements: ultrasound measurements of the fetus taken at six timepoints during pregnancy. (iii) Birth to present process: children's anthropometric and blood pressure measurements at 24 timepoints from birth to 14 years. A comprehensive analysis of the interrelationship of these three longitudinal processes is presented using joint modeling for multivariate longitudinal profiles. We propose a new methodology for examining a child's cardiovascular risk by extending a current view of likelihood estimation. Joint modeling of the multivariate longitudinal profiles is carried out, and the extension of the traditional likelihood method is applied and compared with the maximum likelihood estimates. Our main goal is to examine whether the process in mothers predicts fetal development, which in turn predicts the future cardiovascular health of the children. One of the difficulties with 'in utero' and early childhood data is that certain variables are highly correlated, so dimension reduction techniques are well suited to this scenario. Principal component analysis (PCA) is used to create a lower-dimensional set of uncorrelated variables, which is then used in a longitudinal analysis setting. These principal components are then used in an optimal linear mixed model for longitudinal data, which indicates that in utero and early childhood attributes predict the future cardiovascular health of the children.
This dissertation adds to the body of knowledge on the developmental origins of adult diseases and supplies some significant results while drawing on a rich diversity of statistical methodologies.
 Date Issued
 2011
 Identifier
 FSU_migr_etd1792
 Format
 Thesis
 Title
 Investigating the Categories for Cholesterol and Blood Pressure for Risk Assessment of Death Due to Coronary Heart Disease.
 Creator

Franks, Billy J., McGee, Daniel, Hurt, Myra, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Many characteristics for predicting death due to coronary heart disease are measured on a continuous scale. These characteristics, however, are often categorized for clinical use and to aid in treatment decisions. We would like to derive a systematic approach to determine the best categorizations of systolic blood pressure and cholesterol level for use in identifying individuals who are at high risk of death due to coronary heart disease, and to compare these data-derived categories to those in common usage. Whatever categories are chosen, they should allow physicians to accurately estimate the probability of survival from coronary heart disease until some time t. The best categories will be those that provide the most accurate prediction of an individual's risk of dying by t. The approach used to determine these categories will be a version of Classification And Regression Trees that can be applied to censored survival data. The major goals of this dissertation are to obtain data-derived categories for risk assessment, to compare these categories to the ones already recommended in the medical community, and to assess the performance of these categories in predicting survival probabilities.
 Date Issued
 2005
 Identifier
 FSU_migr_etd4402
 Format
 Thesis
 Title
 Investigating the Use of Mortality Data as a Surrogate for Morbidity Data.
 Creator

Miller, Gregory, Hollander, Myles, McGee, Daniel, Hurt, Myra, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

We are interested in differences between risk models based on Coronary Heart Disease (CHD) incidence, or morbidity, compared to risk models based on CHD death. Risk models based on morbidity have been developed based on the Framingham Heart Study, while the European SCORE project developed a risk model for CHD death. Our goal is to determine whether these two developed models differ in treatment decisions concerning patient heart health. We begin by reviewing recent metrics in surrogate variables and prognostic model performance. We then conduct bootstrap hypothesis tests between two Cox proportional hazards models using Framingham data, one with incidence as a response and one with death as a response, and find that the coefficients differ for the age covariate but find no significant differences for the other risk factors. To understand how surrogacy can be applied to our case, where the surrogate variable is nested within the true variable of interest, we examine models based on a composite event compared to models based on singleton events. We also conduct a simulation, simulating times to CHD incidence and times from CHD incidence to CHD death, censoring at 25 years to represent the end of a study. We compare a Cox model with death as the response with a Cox model based on incidence using bootstrapped confidence intervals, and find that the coefficients for age and systolic blood pressure differ. We continue the simulation by using the Net Reclassification Index (NRI) to evaluate the treatment decision performance of the two models, and find that the two models do not perform significantly differently in correctly classifying events if the decisions are based on the risk ranks of the individuals. As long as the relative order of patients' risks is preserved across different risk models, treatment decisions based on classifying an upper specified percent as high risk will not be significantly different.
We conclude the dissertation with statements about future methods for approaching our question.
 Date Issued
 2011
 Identifier
 FSU_migr_etd2408
 Format
 Thesis
 Title
 Meta Analysis and Meta Regression of a Measure of Discrimination Used in Prognostic Modeling.
 Creator

Rivera, Gretchen L., McGee, Daniel, Hurt, Myra, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

In this paper we are interested in predicting death with the underlying cause of coronary heart disease (CHD). There are two prognostic modeling methods used to predict CHD: the logistic model and the proportional hazards model. For this paper we consider the logistic model. The dataset used is the Diverse Populations Collaboration (DPC) dataset, which includes 28 studies. The DPC dataset has epidemiological results from investigations conducted in different populations around the world. For our analysis we include those individuals who are 17 years old or older. The predictors are: age, diabetes, total serum cholesterol (mg/dl), high density lipoprotein (mg/dl), systolic blood pressure (mmHg) and whether the participant is a current cigarette smoker. There is a natural grouping within the studies by gender, rural or urban area, and race. Based on these strata we have 84 cohort groups. Our main interest is to evaluate how well the prognostic model discriminates. For this, we used the area under the Receiver Operating Characteristic (ROC) curve. The main idea of the ROC curve is that each of a set of subjects is known to belong to one of two classes (signal or noise group). An assignment procedure then assigns each subject to a class on the basis of the information observed. The assignment procedure is not perfect: sometimes a subject is misclassified. To evaluate the quality of performance of this procedure, we used the area under the ROC curve (AUROC). The AUROC varies from 0.5 (no apparent accuracy) to 1.0 (perfect accuracy). For each logistic model we found the AUROC and its standard error (SE). We used meta-analysis to summarize the estimated AUROCs and to evaluate whether there is heterogeneity in our estimates. To evaluate the existence of significant heterogeneity we used the Q statistic. Since heterogeneity was found in our study, we compare seven different methods for estimating τ² (the between-study variance).
We conclude by examining whether differences in study characteristics explained the heterogeneity in the values of the AUROC.
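As a rough illustration of the heterogeneity step described in this abstract, the sketch below computes a Q statistic and one moment-based estimate of τ² from a list of AUROC estimates and their standard errors. It assumes the Q statistic is Cochran's Q with inverse-variance weights and uses the DerSimonian-Laird estimator; the abstract does not say which seven estimators were compared, so this choice and the function name `cochran_q_and_dl_tau2` are assumptions made for illustration only.

```python
def cochran_q_and_dl_tau2(estimates, std_errors):
    """Return (pooled estimate, Cochran's Q, DerSimonian-Laird tau^2).

    Assumption: inverse-variance weights w_i = 1 / SE_i^2; this is a
    generic textbook sketch, not the dissertation's own procedure.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    # Fixed-effect (inverse-variance weighted) pooled estimate.
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    k = len(estimates)
    # Scaling constant of the DerSimonian-Laird moment estimator.
    c = sum_w - sum(w * w for w in weights) / sum_w
    # Truncate at zero so the variance estimate is never negative.
    tau2 = max(0.0, (q - (k - 1)) / c)
    return pooled, q, tau2


# Two hypothetical cohort AUROCs with equal standard errors.
pooled, q, tau2 = cochran_q_and_dl_tau2([0.6, 0.8], [0.05, 0.05])
```

Under the null hypothesis of homogeneity, Q is usually compared with a chi-squared distribution on k - 1 degrees of freedom.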
 Date Issued
 2013
 Identifier
 FSU_migr_etd7580
 Format
 Thesis
 Title
 A Method for Finding the Nadir of Non-Monotonic Relationships.
 Creator

Tan, Fei, McGee, Daniel, Lloyd, Donald, Huffer, Fred, Niu, Xufeng, Dutton, Gareth, Department of Statistics, Florida State University
 Abstract/Description

Different methods have been proposed to model the J-shaped or U-shaped relationship between a risk factor and mortality so that the optimal risk-factor value (nadir) associated with the lowest mortality can be estimated. The basic model considered is the Cox Proportional Hazards model. Current methods include a quadratic method, a method with transformation, fractional polynomials, a change point method and fixed-knot spline regression. A quadratic method contains both the linear and the quadratic term of the risk factor; it is simple but often generates unrealistic nadir estimates. The transformation method converts the original risk factor so that after transformation it has a Normal distribution, but this may not work when there is no good transformation to normality. Fractional polynomials are an extended class of regular polynomials that applies negative and fractional powers to the risk factor. Compared with the quadratic method or the transformation method, this approach does not always have a good model interpretation, and inferences about it do not incorporate the uncertainty coming from preselection of powers and degree. A change point method models the prognostic index using two pieces of upward quadratic functions that meet at their common nadir. This method assumes the knot and the nadir are the same, which is not always true. Fixed-knot spline regression has also been used to model nonlinear prognostic indices, but its inference does not account for variation arising from knot selection. Here we consider spline regressions with free knots, a natural generalization of the quadratic, the change point and the fixed-knot spline methods. They can be applied to risk factors that do not have a good transformation to normality while keeping intuitive model interpretations. Asymptotic normality and consistency of the maximum partial likelihood estimators are established under a certain condition.
When the condition is not satisfied, simulations are used to explore asymptotic properties. The new method is motivated by and applied to the nadir estimation in nonmonotonic relationships between BMI (body mass index) and all-cause mortality. Its performance is compared with that of existing methods, adopting criteria of nadir estimation ability and goodness of fit.
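For the quadratic method mentioned in this abstract, the nadir has a closed form: if the prognostic index is b1*x + b2*x^2 with b2 > 0, its minimum is at x = -b1 / (2*b2). The short sketch below shows only that arithmetic; the function name and the example coefficients are hypothetical and are not taken from the dissertation.

```python
def quadratic_nadir(beta_linear, beta_quadratic):
    """Risk-factor value minimizing beta_linear*x + beta_quadratic*x**2.

    Hypothetical helper: the coefficients would come from a fitted Cox
    model with linear and quadratic terms, which is not fitted here.
    """
    if beta_quadratic <= 0:
        # With a non-positive quadratic term the curve has no minimum,
        # so no nadir exists under this model.
        raise ValueError("quadratic coefficient must be positive")
    return -beta_linear / (2.0 * beta_quadratic)


# Illustrative coefficients only: log-hazard = -0.5*BMI + 0.01*BMI**2.
nadir_bmi = quadratic_nadir(-0.5, 0.01)  # about 25
```

Because the estimate is a ratio of two fitted coefficients, a small or imprecisely estimated quadratic term can move it far outside the observed range, which is consistent with the abstract's remark about unrealistic nadir estimates.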
 Date Issued
 2007
 Identifier
 FSU_migr_etd1719
 Format
 Thesis
 Title
 Mixed-Effects Models for Count Data with Applications to Educational Research.
 Creator

Shin, Jihyung, Niu, Xufeng, Hu, Shouping, Al Otaiba, Stephanie Dent, McGee, Daniel, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

This research is motivated by an analysis of reading research data. We are interested in modeling the test outcome of the ability of kindergarten children aged between 5 and 7 to fluently recode letters into sounds. The data showed excessive zero scores (more than 30% of children) on the test. In this dissertation, we carefully examine models dealing with excessive zeros, which are based on a mixture of distributions: a distribution with zeros and a standard probability distribution with nonnegative values. In such cases, a log-normal variable (for semicontinuous data) or a Poisson random variable (for count data) is observed with some probability. The previously proposed models, mixed-effects and mixed-distribution (MEMD) models by Tooze et al. (2002) for semicontinuous data and zero-inflated Poisson (ZIP) regression models by Lambert (1992) for count data, are reviewed. We apply zero-inflated Poisson models to zero-inflated repeated measures data by introducing a pair of possibly correlated random effects to the zero-inflated Poisson model to accommodate within-subject correlation and between-subject heterogeneity. The model describes the effect of predictor variables on the probability of nonzero responses (occurrence) and the mean of nonzero responses (intensity) separately. The likelihood function is approximated by adaptive Gaussian quadrature and maximized using dual quasi-Newton optimization. The maximum likelihood estimates are obtained using a standard statistical software package. A simulation study is conducted using different model parameters, numbers of subjects, and numbers of measurements per subject, and the results are presented. The dissertation ends with the application of the model to reading research data and future research. We examine the number of correct letter sounds counted for children, collected over the 2008-2009 academic year.
We find that age, gender and socioeconomic status are significantly related to the letter sound fluency of children in both parts of the model. The model provides a better explanation of the data structure and easier interpretation of parameter values, as they are the same as in standard logistic models and Poisson regression models. The model can be extended to accommodate serial correlation, which can be observed in longitudinal data. One may also consider a multilevel zero-inflated Poisson model. Although the multilevel model was proposed previously, parameter estimation by penalized quasi-likelihood methods is questionable, and further examination is needed.
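To make the zero-inflated Poisson idea concrete, here is a minimal sketch of the ZIP probability mass function in its basic form, without covariates or the correlated random effects the dissertation adds. The parameter names `pi_zero` (probability of a structural zero) and `lam` (Poisson mean) are illustrative choices, not notation from the source.

```python
import math


def zip_pmf(y, pi_zero, lam):
    """P(Y = y) under a basic zero-inflated Poisson model.

    Simplifying assumption: a single fixed mixing probability and a
    single Poisson mean, with no regression terms or random effects.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros come from two sources: structural zeros and Poisson zeros.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


# Probability of a zero score when 30% of zeros are structural and the
# Poisson mean is 2 (hypothetical values).
p_zero = zip_pmf(0, 0.3, 2.0)
```

In a regression setting, the mixing probability would typically be tied to predictors through a logit link and the Poisson mean through a log link.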
 Date Issued
 2012
 Identifier
 FSU_migr_etd5181
 Format
 Thesis
 Title
 Modeling Differential Item Functioning (DIF) Using Multilevel Logistic Regression Models: A Bayesian Perspective.
 Creator

Chaimongkol, Saengla, Huffer, Fred W., Kamata, Akihito, Tate, Richard, Niu, XuFeng, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

A multilevel logistic regression approach provides an attractive and practical alternative for the study of Differential Item Functioning (DIF). It is useful not only for identifying items with DIF but also for explaining the presence of DIF. Kamata and Binici (2003) first attempted to identify group unit characteristic variables explaining the variation of DIF by using hierarchical generalized linear models. Their models were implemented by the HLM5 software, which uses the penalized or predictive quasi-likelihood (PQL) method. They found that the variance estimates produced by HLM5 for the level 3 parameters are substantially negatively biased. This study extends their work by using a Bayesian approach to obtain more accurate parameter estimates. Two different approaches to modeling the DIF will be presented. These are referred to as the relative and mixture distribution approaches, respectively. The relative approach measures the DIF of a particular item relative to the mean overall DIF for all items in the test. The mixture distribution approach treats the DIF as independent values drawn from a distribution which is a mixture of a normal distribution and a discrete distribution concentrated at zero. A simulation study is presented to assess the adequacy of the proposed models. This work also describes and studies models which allow the DIF to vary at level 3 (from school to school). In an example using real data, it is shown how the models can be applied to the identification of items with DIF and the explanation of the source of the DIF.
 Date Issued
 2005
 Identifier
 FSU_migr_etd3939
 Format
 Thesis
 Title
 Multivariate Binary Longitudinal Data Analysis.
 Creator

Alzahrani, Hissah, Slate, Elizabeth H., Wetherby, Amy M., McGee, Daniel, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Longitudinal data analysis plays an important role in many applications today. Longitudinal data consist of many measurements obtained on many occasions. These measurements have a complicated correlation structure because they are obtained from the same subjects over time. In multivariate longitudinal data, there is an additional source of correlation, the "outcomes": the data are obtained over time for many outcomes on the same subjects. Such data arise in many medical, financial and psychological studies. For example, patients' measurements on several variables are taken on several occasions in order to study the mean changes of these patients. How to generate and analyze this type of data for complete and incomplete cases is the main goal of this dissertation. It consists of three main studies about the analysis of multivariate binary longitudinal data. The first study is a method to generate correlated binary data for a multivariate longitudinal model with a specified correlation structure. This specified structure allows the correlation to be induced over the outcomes or occasions. The second study is a comparison of three methods for analyzing multivariate binary longitudinal data; each one can be beneficial for particular aims. We also investigate the differences among the parameter estimates of the three methods. The third study is an investigation of missing data analysis via GEE models, controlling the correlation over the occasions and outcomes via a simulation study. Several methods for handling missing data are used to reduce the bias of the parameter estimates for the incomplete data. These three studies are presented in separate chapters of this dissertation.
 Date Issued
 2016
 Identifier
 FSU_2017SP_Alzahrani_fsu_0071E_13609
 Format
 Thesis
 Title
 On the Statistical Modeling of Count Data in High Dimensions.
 Creator

Tang, Shao, She, Yiyuan, Ökten, Giray, McGee, Daniel, Niu, Xufeng, Tao, Minjing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Count data are ubiquitous in modern statistical applications. How to model such data remains a challenging task in machine learning. In this study, we consider various aspects of statistical modeling on Poisson count data. Concerned with computational burdens for maximum likelihood estimation of the mean, we revisit the classical iterative proportional scaling and propose a set of methods that achieve computational scalability in high dimensional applications with regularized extensions for feature selection. In order to capture association effects given multivariate count data, we utilize the tool of non-Gaussian graph learning. We perform comprehensive empirical studies on synthetic data and real-world data to demonstrate its power. Based on the concept of data depth, we investigate a nonparametric approach for modeling multivariate data. We utilize modern optimization techniques to provide scalable algorithms in high dimensional depth and depth median computations. Real-world examples are given to show the effectiveness of the proposed methods.
 Date Issued
 2018
 Identifier
 2018_Su_Tang_fsu_0071E_14680
 Format
 Thesis
 Title
 Practical Methods for Equivalence and Non-Inferiority Studies with Survival Response.
 Creator

Martinez, Elvis Englebert, Sinha, Debajyoti, Levenson, Cathy W., Chicken, Eric, Lipsitz, Stuart, McGee, Daniel, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Determining the equivalence or noninferiority of a new drug (test drug) with an existing treatment (reference drug) is an important topic of statistical interest. Wellek (1993) pioneered the way for log-rank-based equivalence and noninferiority testing by formulating a testing procedure using the proportional hazards model (PHM) of Cox (1972). In many equivalence and noninferiority trials, the two hazard functions may converge rather than remain proportional at all time points. In this case, the proportional odds survival model (POSM) of Bennett (1983) will be more appropriate than Cox's PHM assumption. We show that in both cases, when the wrong modeling assumption is made and Cox's PH assumption is violated, the popular procedure of Wellek (1993) has an inflated type I error. In contrast, our proposed POS model based equivalence and noninferiority tests maintain the practitioner's desired 5% level of significance regardless of the underlying modeling assumption (e.g. Cox, 1972; Wellek, 1993). Furthermore, for noninferiority trials, we introduce a method to determine the optimal sample size required when a desired power and type I error are specified and the data follow the POSM of Bennett (1983). For both of the above trials, we present simulation studies showing the finite approximation of powers and type I error rates, both when the underlying modeling assumptions are correctly specified and when they are misspecified.
 Date Issued
 2014
 Identifier
 FSU_migr_etd9214
 Format
 Thesis
 Title
 Predictive Accuracy Measures for Binary Outcomes: Impact of Incidence Rate and Optimization Techniques.
 Creator

Scolnik, Ryan, McGee, Daniel, Slate, Elizabeth H., Eberstein, Isaac W., Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Evaluating the performance of models predicting a binary outcome can be done using a variety of measures. While some measures intend to describe the model's overall fit, others more accurately describe the model's ability to discriminate between the two outcomes. If a model fits well but doesn't discriminate well, what does that tell us? Given two models, if one discriminates well but has poor fit while the other fits well but discriminates poorly, which of the two should we choose? The measures of interest for our research include the area under the ROC curve, Brier Score, discrimination slope, Log-Loss, R-squared and F-score. To examine the underlying relationships among all of the measures, real data and simulation studies are used. The real data come from multiple cardiovascular research studies, and the simulation studies are run under general conditions and also for incidence rates ranging from 2% to 50%. The results of these analyses provide insight into the relationships among the measures and raise concern for scenarios when the measures may yield different conclusions. The impact of incidence rate on the relationships provides a basis for exploring alternative maximization routines to logistic regression. While most of the measures are easily optimized using the Newton-Raphson algorithm, the maximization of the area under the ROC curve requires optimization of a nonlinear, nondifferentiable function. Usage of the Nelder-Mead simplex algorithm and close connections to economics research yield unique parameter estimates and general asymptotic conditions. Using real and simulated data to compare optimizing the area under the ROC curve to logistic regression further reveals the impact of incidence rate on the relationships, significant increases in achievable areas under the ROC curve, and differences in conclusions about including a variable in a model.
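Two of the measures named in this abstract can be written down in a few lines, which helps show why they may disagree: the Brier score rewards well-calibrated probabilities, while the area under the ROC curve depends only on how cases and non-cases are ranked. The sketch below uses the standard definitions (mean squared error of predicted probabilities, and the proportion of concordant case/non-case pairs with ties counted as one half); the function names and toy data are hypothetical.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    n = len(outcomes)
    return sum((p - y) ** 2 for y, p in zip(outcomes, probabilities)) / n


def auc_by_pairs(outcomes, probabilities):
    """Area under the ROC curve as the share of concordant pairs.

    Each (case, non-case) pair scores 1 if the case has the higher
    predicted probability, 0.5 for a tie, and 0 otherwise.
    """
    cases = [p for y, p in zip(outcomes, probabilities) if y == 1]
    controls = [p for y, p in zip(outcomes, probabilities) if y == 0]
    score = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                score += 1.0
            elif pc == pn:
                score += 0.5
    return score / (len(cases) * len(controls))


# Toy data: two non-events followed by two events.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
brier = brier_score(y, p)
auc = auc_by_pairs(y, p)
```

Applying any strictly increasing transformation to the predicted probabilities leaves the pairwise AUC unchanged but generally changes the Brier score, which is one way the two kinds of measure can lead to different conclusions.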
 Date Issued
 2016
 Identifier
 FSU_2016SP_Scolnik_fsu_0071E_13146
 Format
 Thesis
 Title
 The Relationship Between Body Mass and Blood Pressure in Diverse Populations.
 Creator

Abayomi, Emilola J., McGee, Daniel, Lackland, Daniel, Hurt, Myra, Chicken, Eric, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

High blood pressure is a major determinant of risk for Coronary Heart Disease (CHD) and stroke, leading causes of death in the industrialized world. A myriad of pharmacological treatments for elevated blood pressure, defined as a blood pressure greater than 140/90 mmHg, are available and have at least partially resulted in large reductions in the incidence of CHD and stroke in the U.S. over the last 50 years. The factors that may increase blood pressure levels are not well understood, but body mass is thought to be a major determinant of blood pressure level. Obesity is measured through various methods (skinfolds, waist-to-hip ratio, bioelectrical impedance analysis (BIA), etc.), but the most commonly used measure is body mass index, BMI = Weight (kg) / Height (m)².
 Date Issued
 2012
 Identifier
 FSU_migr_etd5308
 Format
 Thesis
 Title
 The Relationship of Diabetes to Coronary Heart Disease Mortality: A Meta-Analysis Based on Person-Level Data.
 Creator

Williams, Felicia Gray, McGee, Daniel, Hurt, Myra, Pati, Debdeep, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Studies have suggested that diabetes is a stronger risk factor for coronary heart disease (CHD) in women than in men. We present a meta-analysis of person-level data from 42 cohort studies in which diabetes, CHD mortality and potential confounders were available and a minimum of 75 CHD deaths occurred. These studies followed up 77,863 men and 84,671 women aged 42 to 73 years on average from the US, Denmark, Iceland, Norway and the UK. Individual study prevalence rates of self-reported diabetes mellitus at baseline ranged between less than 1% in the youngest cohort and 15.7% (males) and 11.1% (females) in the NHLBI CHS study of the elderly. CHD death rates varied between 2% and 20%. A meta-analysis was performed in order to calculate overall hazard ratios (HR) of CHD mortality among diabetics compared to nondiabetics using Cox Proportional Hazard models. The random-effects HR associated with baseline diabetes and adjusted for age was significantly higher for females, 2.65 (95% CI: 2.34, 2.96), than for males, 2.33 (95% CI: 2.07, 2.58) (p=0.004). These estimates were similar to the random-effects estimates adjusted additionally for serum cholesterol, systolic blood pressure, and current smoking status: females 2.69 (95% CI: 2.35, 3.03) and males 2.32 (95% CI: 2.05, 2.59). They also agree closely with estimates (odds ratios of 2.9 for females and 2.3 for males) obtained in a recent meta-analysis of 50 studies of both fatal and nonfatal CHD but not based on person-level data. This evidence suggests that diabetes diminishes the female advantage. An additional analysis was performed on race. Only 14 cohorts were analyzed in this meta-analysis. This analysis showed no significant difference between the black and white cohorts before (p=0.68) or after adjustment for the major CHD risk factors (p=0.88). The limited number of studies used may lack the power to detect any differences.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7662
 Format
 Thesis
 Title
 The Risk of Lipids on Coronary Heart Disease: Prognostic Models and Meta-Analysis.
 Creator

Almansour, Aseel, McGee, Daniel, Flynn, Heather, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Prognostic models are widely used in medicine to estimate particular patients' risk of developing disease. Numerous prognostic models have been developed for predicting cardiovascular disease, including those by Wilson et al. using the Framingham Study[17], by Assmann et al. using the Procam study[22] and by Conroy et al.[33] using a pool of European cohorts. The prognostic models developed by these researchers differed in their approach to estimating risk, but all included one or more of the lipid determinations: Total Cholesterol (TC), Low Density Lipoproteins (LDL), High Density Lipoproteins (HDL), or the ratios TC/HDL and LDL/HDL. None of these researchers included both LDL and TC in the same model due to the high correlation between these measurements. In this thesis we will examine some questions about the inclusion of lipid determinations in prognostic models: Can the effect of LDL and TC on the risk of dying from CHD be differentiated? If one measure is demonstrably stronger than the other, then a single model using that variable would be considered advantageous. Is it possible to derive a single measure from TC and LDL that is a stronger predictor than either measure? If so, then a new summarization of the lipid measurements should be used in prognostic modeling. Does the addition of HDL to a prognostic model improve the predictive accuracy of the model? If it does, then this determination, which is almost universally measured, should be used when developing prognostic models. We use data from nine independent studies to examine these issues. The studies were chosen because they include longitudinal follow-up of participants and included lipid determinations in the baseline examination of participants. There are many methodologies available for developing prognostic models, including logistic regression and the proportional hazards model.
We used the proportional hazards model since we have follow-up times and times to death from CHD on all of the participants in the included studies. We summarized our results using a meta-analytic approach. Using the meta-analytic approach, we addressed the additional question of whether the results vary significantly among the different studies and also whether adding additional characteristics to the prognostic models changes the estimated effect of the lipid determinations. All of our results are presented stratified by gender and, when appropriate, by race. Finally, because our studies were not selected randomly, we also examined whether there is evidence of bias in our meta-analyses. For this examination we used funnel plots with related methodology for testing whether there is evidence of bias in the results.
 Date Issued
 2014
 Identifier
 FSU_migr_etd8724
 Format
 Thesis
 Title
 Small Area Estimation with Random Effects Selection.
 Creator

Lee, Jiwon, She, Yiyuan, Ökten, Giray, McGee, Daniel, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this study, we propose a robust method with selective shrinkage power for small area estimation with automatic random effects selection, referred to as SARS. In our proposed model, both fixed effects and random effects are treated as a joint target. In this case, maximizing the joint likelihood of fixed effects and random effects makes more sense than maximizing the marginal likelihood. In practice, the variance of the sampling error and the variance of the modeling error (random effects) are unknown. SARS does not require any prior information about either the variance components or the dimensionality of the data. Furthermore, area-specific random effects, accounting for additional area variation, are not always necessary in a small area estimation model. From this observation, we can impose sparsity on the random effects by assigning zero for the large areas. This sparsity brings heavy tails, which means that the normality assumption on the random effects no longer holds. To achieve both selective and predictive power, SARS employs penalized regression using a nonconvex penalty. For solving the nonconvex problem of SARS, we employ iterative algorithms via a quantile thresholding procedure. The algorithms make use of the iterative selection-estimation paradigm with a variety of techniques, such as progressive screening when tuning parameters, a multi-start strategy with a subsampling method, and a feature subset method to generate more efficient initial points, for enhancing computational efficiency and efficacy. To achieve optimal prediction error under the dimensional relaxation, we propose a new theoretical predictive information criterion for SARS (SARS-PIC), which is derived based upon nonasymptotic oracle inequalities using the minimax rate of ideal predictive risk. Experiments with simulation and real poverty data of school-age (5-17) children demonstrate the efficiency of SARS.
 Date Issued
 2017
 Identifier
 FSU_2017SP_Lee_fsu_0071E_13675
 Format
 Thesis
 Title
 Spatiotemporal Bayesian Hierarchical Models, with Application to Birth Outcomes.
 Creator

Norton, Jonathan D. (Jonathan David), Niu, Xufeng, Eberstein, Isaac, Huffer, Fred, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

A class of hierarchical Bayesian models is introduced for adverse birth outcomes such as preterm birth, which are assumed to follow a conditional binomial distribution. The log-odds of an adverse outcome in a particular county, logit(p(i)), follows a linear model which includes observed covariates and normally distributed random effects. Spatial dependence between neighboring regions is allowed for by including an intrinsic autoregressive (IAR) prior or an IAR convolution prior in the linear predictor. Temporal dependence is incorporated by including a temporal IAR term also. It is shown that the variance parameters underlying these random effects (IAR, convolution, convolution plus temporal IAR) are identifiable. The same results are also shown to hold when the IAR is replaced by a conditional autoregressive (CAR) model. Furthermore, properties of the CAR parameter ρ are explored. The Deviance Information Criterion (DIC) is considered as a way to compare spatial hierarchical models. Simulations are performed to test whether the DIC can identify whether binomial outcomes come from an IAR, an IAR convolution, or independent normal deviates. Having established the theoretical foundations of the class of models and validated the DIC as a means of comparing models, we examine preterm birth and low birth weight counts in the state of Arkansas from 1994 to 2005. We find that preterm birth and low birth weight have different spatial patterns of risk, and that rates of low birth weight can be fit with a strikingly simple model that includes a constant spatial effect for all periods, a linear trend, and three covariates. It is also found that the risks of each outcome are increasing over time, even with adjustment for covariates.
 Date Issued
 2008
 Identifier
 FSU_migr_etd2523
 Format
 Thesis
 Title
 A Study of Some Issues of Goodness-of-Fit Tests for Logistic Regression.
 Creator

Ma, Wei, McGee, Daniel, Mai, Qing, Levenson, Cathy W., Niu, Xufeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Goodness-of-fit tests are important for assessing how well a model fits a set of observations. The Hosmer-Lemeshow (HL) test is a popular and commonly used method for assessing goodness-of-fit in logistic regression. However, there are two issues with using the HL test. One is that the number of partition groups must be specified, and different numbers of groups often lead to different decisions. So in this study, we propose several grouping tests that combine multiple HL tests with varying numbers of groups to make the decision, instead of using one arbitrary number of groups or searching for the optimal number. This is because the best choice of groups is data-dependent and not easy to find. The other drawback of the HL test is that it has low power to detect missing interactions between continuous and dichotomous covariates. Therefore, we propose global and interaction tests in order to capture such violations. Simulation studies are carried out to assess the Type I errors and powers of all the proposed tests. These tests are illustrated using the bone mineral density data from NHANES III.
 Date Issued
 2018
 Identifier
 2018_Su_Ma_fsu_0071E_14681
 Format
 Thesis
 Title
 Tests and Classifications in Adaptive Designs with Applications.
 Creator

Chen, Qiusheng, Niu, Xufeng, McGee, Daniel, Slate, Elizabeth H., Zhang, Jinfeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Statistical tests for biomarker identification and classification methods for patient grouping are two important topics in adaptive designs of clinical trials. In this article, we evaluate four test methods for biomarker identification: a model-based identification method, the popular t-test, the nonparametric Wilcoxon Rank Sum test, and the Least Absolute Shrinkage and Selection Operator (Lasso) method. For selecting the best classification methods in Stage 2 of an adaptive design, we examine classification methods including recently developed machine learning approaches such as Random Forest, Lasso and Elastic-Net Regularized Generalized Linear Models (Glmnet), Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost). Statistical simulations are carried out in our study to assess the performance of the biomarker identification methods and the classification methods. The best identification method and classification technique are selected based on the True Positive Rate (TPR, also called Sensitivity) and the True Negative Rate (TNR, also called Specificity). The optimal test method for gene identification and the optimal classification method for patient grouping are then applied to the Adaptive Signature Design (ASD) to evaluate the performance of ASD in different situations, including simulated data and a real data set for breast cancer patients.
 Date Issued
 2018
 Identifier
 2018_Sp_Chen_fsu_0071E_14309
 Format
 Thesis
 Title
 Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis.
 Creator

Thompson, Warren R. (Warren Robert), McGee, Daniel, Eberstein, Isaac, Huffer, Fred, Sinha, Debajyoti, She, Yiyuan, Department of Statistics, Florida State University
 Abstract/Description

Variable selection is an important aspect of modeling. Its aim is to distinguish between the authentic variables, which are important in predicting the outcome, and the noise variables, which possess little to no predictive value. In other words, the goal is to find the variables that (collectively) best explain and predict changes in the outcome variable. The variable selection problem is exacerbated when correlated variables are included in the covariate set. This dissertation examines the variable selection problem in the context of logistic regression. Specifically, we investigated the merits of the bootstrap, ridge regression, the lasso and Bayesian model averaging (BMA) as variable selection techniques when highly correlated predictors and a dichotomous outcome are considered. This dissertation also contributes to the literature on the diet-heart hypothesis. The diet-heart hypothesis has been around since the early twentieth century. Since then, researchers have attempted to isolate the nutrients in diet that promote coronary heart disease (CHD). After a century of research, there is still no consensus. In our current research, we used some of the more recent statistical methodologies (mentioned above) to investigate the effect of twenty dietary variables on the incidence of coronary heart disease. Logistic regression models were generated for the data from the Honolulu Heart Program, a study of CHD incidence in men of Japanese descent. Our results were largely method-specific. However, regardless of the method considered, there was strong evidence to suggest that alcohol consumption has a strong protective effect on the risk of coronary heart disease. Of the variables considered, dietary cholesterol and caffeine were the only variables that, at best, exhibited a moderately strong harmful association with CHD incidence. Further investigation that includes a broader array of food groups is recommended.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1360
 Format
 Thesis