Current Search: Research Repository » Thesis » East Asia » Education » Statistics
Search results
 Title
 An analysis of test reliability.
 Creator

Isaacson, Fenton R., Florida State University
 Abstract/Description

"The need for efficient means of testing has long been recognized. To obtain efficiency in testing requires the study of four attributes of the testing instrument, namely: reliability, validity, interpretability and administrability. It is the purpose of this paper to examine in some detail the first of these attributes, reliability. In particular, this is an attempt to analyse the reliability of Mathematics 101 Test D which was administered at Florida State University in the fall of 1948" (Introduction).
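Reliability in that era was commonly estimated by split-half methods. As a rough illustration only (the examinee count, test length, and noise level below are invented, not taken from the thesis), a split-half reliability with the Spearman-Brown step-up can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 200 examinees on a 20-item test: each examinee has a latent
# ability; item scores are ability plus independent noise (hypothetical data).
n_examinees, n_items = 200, 20
ability = rng.normal(0, 1, size=(n_examinees, 1))
items = ability + rng.normal(0, 1, size=(n_examinees, n_items))

# Split-half reliability: correlate odd-item and even-item half-test scores,
# then step up to full-test length with the Spearman-Brown formula.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown prophecy
print(f"half-test r = {r_half:.3f}, stepped-up reliability = {r_full:.3f}")
```

With equally reliable halves, the stepped-up value approximates the reliability of the full-length test.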
 Date Issued
 1949
 Identifier
 FSU_historic_AKP4870
 Format
 Thesis
 Title
 A Bayesian MRF Framework for Labeling Terrain Using Hyperspectral Imaging.
 Creator

Neher, Robert E., Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Wegkamp, Marten, Department of Statistics, Florida State University
 Abstract/Description

We explore the non-Gaussianity of hyperspectral data and present probability models that capture the variability of hyperspectral images. In particular, we present a nonparametric probability distribution that models the distribution of the hyperspectral data after reducing the dimension of the data via either principal components or Fisher's discriminant analysis. We also explore the directional differences in observed images and present two parametric distributions, the generalized Laplacian and the Bessel K form, that model the non-Gaussian behavior of the directional differences well. We then propose a model that labels each spatial site using Bayesian inference and Markov random fields, incorporating the information of the nonparametric distribution of the data and the parametric distributions of the directional differences, along with a prior distribution that favors smooth labeling. We then test our model on actual hyperspectral data and present the results, using the Washington D.C. Mall and Indian Springs rural area data sets.
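A toy sketch of the smoothness-favoring labeling idea: Iterated Conditional Modes (ICM) greedily maximizes a posterior combining a Gaussian likelihood with a Potts-style prior. Everything here (image size, class means, noise level, beta) is invented for illustration; the thesis works with real hyperspectral data and richer likelihoods.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground truth: two terrain classes split down the middle (a stand-in for
# hyperspectral class labels; means and noise level are made up).
H = W = 16
truth = np.zeros((H, W), dtype=int)
truth[:, W // 2:] = 1
means = np.array([0.0, 1.0])
y = means[truth] + rng.normal(0, 0.6, size=(H, W))

# ICM: sweep pixels, assigning each the label that minimizes a data cost
# (Gaussian likelihood) plus a Potts smoothness cost (disagreeing neighbors).
beta = 1.0                      # strength of the smoothness prior
labels = (y > 0.5).astype(int)  # initialize from the likelihood alone
for _ in range(5):
    for i in range(H):
        for j in range(W):
            costs = []
            for lab in (0, 1):
                data_cost = (y[i, j] - means[lab]) ** 2 / (2 * 0.6 ** 2)
                nbrs = [labels[a, b] for a, b in
                        ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= a < H and 0 <= b < W]
                smooth_cost = beta * sum(n != lab for n in nbrs)
                costs.append(data_cost + smooth_cost)
            labels[i, j] = int(np.argmin(costs))

accuracy = (labels == truth).mean()
print(f"ICM labeling accuracy: {accuracy:.2%}")
```

The prior term pulls isolated misclassified pixels toward their neighbors, which is the mechanism behind the "smooth labeling" language in the abstract.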
 Date Issued
 2004
 Identifier
 FSU_migr_etd2691
 Format
 Thesis
 Title
 Modeling Differential Item Functioning (DIF) Using Multilevel Logistic Regression Models: A Bayesian Perspective.
 Creator

Chaimongkol, Saengla, Huffer, Fred W., Kamata, Akihito, Tate, Richard, Niu, XuFeng, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

A multilevel logistic regression approach provides an attractive and practical alternative for the study of Differential Item Functioning (DIF). It is useful not only for identifying items with DIF but also for explaining the presence of DIF. Kamata and Binici (2003) first attempted to identify group unit characteristic variables explaining the variation of DIF by using hierarchical generalized linear models. Their models were implemented in the HLM5 software, which uses the penalized or predictive quasi-likelihood (PQL) method. They found that the variance estimates produced by HLM5 for the level-3 parameters are substantially negatively biased. This study extends their work by using a Bayesian approach to obtain more accurate parameter estimates. Two different approaches to modeling the DIF will be presented, referred to as the relative and the mixture distribution approach, respectively. The relative approach measures the DIF of a particular item relative to the mean overall DIF for all items in the test. The mixture distribution approach treats the DIF as independent values drawn from a distribution which is a mixture of a normal distribution and a discrete distribution concentrated at zero. A simulation study is presented to assess the adequacy of the proposed models. This work also describes and studies models which allow the DIF to vary at level 3 (from school to school). In an example using real data, it is shown how the models can be applied to the identification of items with DIF and the explanation of the source of the DIF.
 Date Issued
 2005
 Identifier
 FSU_migr_etd3939
 Format
 Thesis
 Title
 Statistical Models on Human Shapes with Application to Bayesian Image Segmentation and Gait Recognition.
 Creator

Kaziska, David M., Srivastava, Anuj, Mio, Washington, Chicken, Eric, Wegkamp, Marten, Department of Statistics, Florida State University
 Abstract/Description

In this dissertation we develop probability models for human shapes and apply those models to the problems of image segmentation and human identification by gait recognition. To build probability models on human shapes, we consider human shapes to be realizations of random variables on a space of simple closed curves and a space of elastic curves. Both of these spaces are quotient spaces of infinite-dimensional manifolds. Our probability models arise through Tangent Principal Component Analysis, a method of studying probability models on manifolds by projecting them onto a tangent plane to the manifold. Since we put the tangent plane at the Karcher mean of sample shapes, we begin our study by examining statistical properties of Karcher means on manifolds. We derive theoretical results for the location of Karcher means on certain manifolds, and perform a simulation study of properties of Karcher means on our shape space. Turning to the specific problem of distributions on human shapes, we examine alternatives for probability models and find that kernel density estimators perform well. We use this model to sample shapes and to perform shape testing. The first application we consider is human detection in infrared images. We pursue this application using Bayesian image segmentation, in which our proposed human in an image is a maximum likelihood estimate, obtained using a prior distribution on human shapes and a likelihood arising from a divergence measure on the pixels in the image. We then consider human identification by gait recognition. We examine human gait as a cyclostationary process on the space of elastic curves and develop a metric on processes based on the geodesic distance between sequences on that space. We develop and demonstrate a framework for gait recognition based on this metric, which includes the following elements: automatic detection of gait cycles, interpolation to register gait cycles, computation of a mean gait cycle, and identification by matching a test cycle to the nearest member of a training set. We perform the matching both by an exhaustive search of the training set and through an expedited method using cluster-based trees and boosting.
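The Karcher mean iteration mentioned above can be sketched on the simplest curved manifold, the unit sphere: average the log-maps of the data at the current estimate, then shoot back along the geodesic with the exp-map. The point cloud below is invented; the dissertation's shape spaces are far larger, but the iteration has the same form.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_map(m, x):
    """Tangent vector at m pointing to x along the sphere's geodesic."""
    c = np.clip(x @ m, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(m)
    return theta / np.sin(theta) * (x - c * m)

def exp_map(m, v):
    """Follow the geodesic from m in tangent direction v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return m
    return np.cos(nv) * m + np.sin(nv) * v / nv

# Toy data: points scattered around the north pole of the unit sphere.
pts = rng.normal(0, 0.2, size=(50, 3)) + np.array([0.0, 0.0, 1.0])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

# Karcher mean: repeatedly average the log-maps and shoot back with exp.
mean = pts[0]
for _ in range(20):
    step = np.mean([log_map(mean, p) for p in pts], axis=0)
    mean = exp_map(mean, step)

print("Karcher mean:", np.round(mean, 3))
```

At convergence the averaged log-map vanishes, which is exactly the first-order condition for minimizing the sum of squared geodesic distances.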
 Date Issued
 2005
 Identifier
 FSU_migr_etd3275
 Format
 Thesis
 Title
 Inference for Semiparametric Time-Varying Covariate Effect Relative Risk Regression Models.
 Creator

Ye, Gang, McKeague, Ian W., Wang, Xiaoming, Huffer, Fred W., Song, Kai-Sheng, Department of Statistics, Florida State University
 Abstract/Description

A major interest of survival analysis is to assess covariate effects on survival via appropriate conditional hazard function regression models. The Cox proportional hazards model, which assumes an exponential form for the relative risk, has been a popular choice. However, other regression forms such as Aalen's additive risk model may be more appropriate in some applications. In addition, covariate effects may depend on time, which cannot be reflected by a Cox proportional hazards model. In this dissertation, we study a class of time-varying covariate effect regression models in which the link function (relative risk function) is a twice continuously differentiable and prespecified, but otherwise general, function. This is a natural extension of the Prentice-Self model, in which the link function is general but covariate effects are modelled as time-invariant. In the first part of the dissertation, we focus on estimating the cumulative or integrated covariate effects. The standard martingale approach based on counting processes is utilized to derive a likelihood-based iterating equation. An estimator for the cumulative covariate effect that is generated from the iterating equation is shown to be √n-consistent. Asymptotic normality of the estimator is also demonstrated. Another aspect of the dissertation is to investigate a new test for the above time-varying covariate effect regression model and to study consistency of the test based on martingale residuals. For Aalen's additive risk model, we introduce a test statistic based on the Huffer-McKeague weighted-least-squares estimator and show its consistency against some alternatives. An alternative way to construct a test statistic based on Bayesian bootstrap simulation is introduced. An application to real lifetime data will be presented.
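Aalen's least-squares estimator of cumulative covariate effects, mentioned in the abstract, is short enough to sketch: at each event time, regress the counting-process jump on the at-risk design matrix and accumulate. The data below are simulated with a constant baseline hazard of 0.5 and a covariate with no true effect; all settings are illustrative, not the dissertation's.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated right-censored data: constant baseline hazard 0.5 and one
# centered covariate with no true effect.
n = 2000
x = rng.normal(0, 1, n)
t_event = rng.exponential(1 / 0.5, n)
t_cens = rng.exponential(1 / 0.2, n)
time = np.minimum(t_event, t_cens)
died = t_event <= t_cens

# Aalen's additive-risk estimator of the cumulative regression functions
# B(t) up to tau: increment by (X'X)^{-1} X' dN at each event time.
tau = 1.0
B = np.zeros(2)   # cumulative (baseline, covariate-effect) functions at tau
for i in np.argsort(time):
    if time[i] > tau:
        break
    if not died[i]:
        continue
    at_risk = np.flatnonzero(time >= time[i])
    X = np.column_stack([np.ones(at_risk.size), x[at_risk]])
    dN = (at_risk == i).astype(float)        # the subject who died now
    B += np.linalg.solve(X.T @ X, X.T @ dN)

print(f"B0({tau}) ≈ {B[0]:.3f} (truth 0.5), B1({tau}) ≈ {B[1]:.3f} (truth 0)")
```

The baseline component estimates the integrated hazard 0.5·t, while the covariate component hovers near zero, as it should under no effect.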
 Date Issued
 2005
 Identifier
 FSU_migr_etd0949
 Format
 Thesis
 Title
 Investigating the Categories for Cholesterol and Blood Pressure for Risk Assessment of Death Due to Coronary Heart Disease.
 Creator

Franks, Billy J., McGee, Daniel, Hurt, Myra, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Many characteristics for predicting death due to coronary heart disease are measured on a continuous scale. These characteristics, however, are often categorized for clinical use and to aid in treatment decisions. We would like to derive a systematic approach to determine the best categorizations of systolic blood pressure and cholesterol level for use in identifying individuals who are at high risk of death due to coronary heart disease, and to compare these data-derived categories to those in common usage. Whatever categories are chosen, they should allow physicians to accurately estimate the probability of survival from coronary heart disease until some time t. The best categories will be those that provide the most accurate prediction of an individual's risk of dying by t. The approach that will be used to determine these categories is a version of Classification And Regression Trees that can be applied to censored survival data. The major goals of this dissertation are to obtain data-derived categories for risk assessment, to compare these categories to the ones already recommended in the medical community, and to assess the performance of these categories in predicting survival probabilities.
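The core move in a survival tree is choosing a cutpoint of a continuous risk factor that best separates survival experience; a common device is to maximize a two-sample log-rank statistic over candidate cuts. A minimal sketch on simulated data (hazard jumping from 0.2 to 0.8 at a made-up threshold of x = 1, no censoring, distinct event times):

```python
import numpy as np

rng = np.random.default_rng(5)

def logrank_z(time, group):
    """Two-sample log-rank z-statistic (distinct times, no censoring)."""
    order = np.argsort(time)
    g = group[order].astype(float)
    n_risk = np.arange(len(g), 0, -1)       # total at risk at each death
    n1_risk = np.cumsum(g[::-1])[::-1]      # at risk in group 1 at each death
    observed = g.sum()                      # deaths in group 1
    expected = np.sum(n1_risk / n_risk)
    var = np.sum((n1_risk / n_risk) * (1 - n1_risk / n_risk))
    return (observed - expected) / np.sqrt(var)

# CART-style split search: scan cutpoints, keep the largest separation.
n = 1000
x = rng.uniform(0, 2, n)
time = rng.exponential(1 / np.where(x > 1.0, 0.8, 0.2))
cuts = np.linspace(0.2, 1.8, 33)
z = [abs(logrank_z(time, x > c)) for c in cuts]
best = cuts[int(np.argmax(z))]
print(f"best split: x > {best:.2f} (truth: 1.0)")
```

Real survival-CART implementations repeat this search recursively and handle censoring in the log-rank computation; the sketch only shows the single-split idea.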
 Date Issued
 2005
 Identifier
 FSU_migr_etd4402
 Format
 Thesis
 Title
 Impact of Missing Data on Building Prognostic Models and Summarizing Models Across Studies.
 Creator

Munshi, Mahtab R., McGee, Daniel, Eberstein, Isaac, Hollander, Myles, Niu, Xufeng, Chattopadhyay, Somesh, Department of Statistics, Florida State University
 Abstract/Description

We examine the impact of missing data in two settings: the development of prognostic models and the addition of new risk factors to existing risk functions. Most statistical software presently available performs complete-case analysis, wherein only participants with known values for all of the characteristics being analyzed are included in model development. Missing data also impact the summarization of evidence among multiple studies using meta-analytic techniques. As we progress in medical research, new covariates become available for studying various outcomes. While we want to investigate the influence of new factors on the outcome, we also do not want to discard the historical datasets that do not have information about these markers. Our research plan is to investigate different methods to estimate parameters for a model when some of the covariates are missing. These methods include likelihood-based inference for the study-level coefficients and likelihood-based inference for the logistic model on the person-level data. We compare the results from our methods to the corresponding results from complete-case analysis. We focus our empirical investigation on a historical example, the addition of high density lipoproteins to existing equations for predicting death due to coronary heart disease. We verify our methods through simulation studies on this example.
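The cost of complete-case analysis compounds quickly across covariates, which is part of why the abstract treats it as the baseline to beat. A tiny illustration with made-up numbers (10 covariates, each independently missing for 5% of subjects):

```python
import numpy as np

rng = np.random.default_rng(6)

# 10 covariates, each independently missing (MCAR) for 5% of subjects.
# Complete-case analysis keeps only rows with no missing value at all,
# so the usable sample shrinks roughly like 0.95**10 ≈ 60%.
n, p, miss = 10_000, 10, 0.05
observed = rng.random((n, p)) >= miss   # True where the value is recorded
complete = observed.all(axis=1)
print(f"complete cases: {complete.mean():.1%} of {n} subjects")
```

Under MCAR the loss is "only" efficiency; under other missingness mechanisms complete-case estimates can also be biased, which motivates the likelihood-based alternatives studied here.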
 Date Issued
 2005
 Identifier
 FSU_migr_etd2191
 Format
 Thesis
 Title
 Quasi-3D Statistical Inversion of Oceanographic Tracer Data.
 Creator

Herbei, Radu, Speer, Kevin, Wegkamp, Marten, Laurent, Louis St., Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

We perform a quasi-3D Bayesian inversion of oceanographic tracer data from the South Atlantic Ocean. Initially we consider one active neutral density layer with an upper and lower boundary. The available hydrographic data are linked to model parameters (water velocities, diffusion coefficients) via a 3D advection-diffusion equation. A robust solution to the inverse problem considered can be attained by introducing prior information about parameters and modeling the observation error. This approach estimates both horizontal and vertical flow as well as diffusion coefficients. We find a system of alternating zonal jets at the depths of the North Atlantic Deep Water, consistent with direct measurements of flow and concentration maps. A uniqueness analysis of our model is performed in terms of the oxygen consumption rate. The vertical mixing coefficient bears some relation to the bottom topography even though we do not incorporate that into our model. We extend the method to a multi-layer model, using thermal wind relations weakly in a local fashion (as opposed to integrating the entire water column) to connect layers vertically. Results suggest that the estimated deep zonal jets extend vertically, with a clear depth-dependent structure. The vertical structure of the flow field is modified by the tracer fields over that set a priori by thermal wind. Our estimates are consistent with observed flow at the depths of the Antarctic Intermediate Water; at still shallower depths, above the layers considered here, the subtropical gyre is a significant feature of the horizontal flow.
 Date Issued
 2006
 Identifier
 FSU_migr_etd4101
 Format
 Thesis
 Title
 Minimax Tests for Nonparametric Alternatives with Applications to High Frequency Data.
 Creator

Yu, Han, Song, Kai-Sheng, Professor, Jack Quine, Professor, Fred Huffer, Professor, Dan McGee, Department of Statistics, Florida State University
 Abstract/Description

We present a general methodology for developing asymptotically distribution-free, asymptotically minimax tests. The tests are constructed via a nonparametric density-quantile function and the limiting distribution is derived by a martingale approach. The procedure can be viewed as a novel nonparametric extension of the classical parametric likelihood ratio test. The proposed tests are shown to be omnibus within an extremely large class of nonparametric global alternatives characterized by simple conditions. Furthermore, we establish that the proposed tests provide better minimax distinguishability. The tests have much greater power for detecting high-frequency nonparametric alternatives than existing classical tests such as the Kolmogorov-Smirnov and Cramér-von Mises tests. The good performance of the proposed tests is demonstrated by Monte Carlo simulations and applications in High Energy Physics.
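The classical benchmarks named above are easy to run, and a high-frequency alternative shows why they struggle: the sketch below samples from the made-up density 1 + 0.9·sin(2π·10·x) on [0, 1], whose CDF stays within about 0.014 of the uniform CDF, and applies SciPy's Kolmogorov-Smirnov and Cramér-von Mises tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Acceptance-rejection sampler for the oscillating density
# f(x) = 1 + 0.9*sin(2*pi*10*x) on [0, 1] (an illustrative alternative).
def sample_high_freq(n):
    out = []
    while len(out) < n:
        u = rng.uniform(0, 1, 4 * n)
        v = rng.uniform(0, 1, 4 * n)
        keep = u[v < (1 + 0.9 * np.sin(2 * np.pi * 10 * u)) / 1.9]
        out.extend(keep[: n - len(out)])
    return np.array(out)

x = sample_high_freq(1000)
ks = stats.kstest(x, "uniform")
cvm = stats.cramervonmises(x, "uniform")
print(f"KS:  stat={ks.statistic:.3f}, p={ks.pvalue:.3f}")
print(f"CvM: stat={cvm.statistic:.3f}, p={cvm.pvalue:.3f}")
```

Because the CDF deviation is tiny even when the density deviation is large, CDF-based omnibus statistics stay small here, which is the power gap the abstract's tests aim to close.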
 Date Issued
 2006
 Identifier
 FSU_migr_etd0796
 Format
 Thesis
 Title
 Logistic Regression, Measures of Explained Variation, and the Base Rate Problem.
 Creator

Sharma, Dinesh R., McGee, Daniel L., Hurt, Myra, Niu, XuFeng, Chicken, Eric, Department of Statistics, Florida State University
 Abstract/Description

One of the desirable properties of the coefficient of determination (the R2 measure) is that its values for different models should be comparable whether the models differ in one or more predictors, or in the dependent variable, or whether the models are specified as being different for different subsets of a dataset. This allows researchers to compare the adequacy of models across subgroups of the population, or models with different but related dependent variables. However, the various analogs of the R2 measure used for logistic regression analysis are highly sensitive to the base rate (the proportion of successes in the sample) and thus do not possess this property. An R2 measure sensitive to the base rate is not suitable for comparisons of the same or different models on different datasets, different subsets of a dataset, or different but related dependent variables. We evaluated 14 R2 measures that have been suggested, or might be useful, for measuring the explained variation in logistic regression models, based on three criteria: 1) intuitively reasonable interpretability; 2) numerical consistency with the rho2 of the underlying model; and 3) base rate sensitivity. We carried out a Monte Carlo simulation study to examine the numerical consistency and the base rate dependency of the various R2 measures for logistic regression analysis. We found all of the parametric R2 measures to be substantially sensitive to the base rate. The magnitude of the base rate sensitivity of these measures tends to be further influenced by the rho2 of the underlying model. None of the measures considered in our study performs equally well on all three evaluation criteria. While R2L stands out for its intuitively reasonable interpretability as a measure of explained variation, as well as its independence from the base rate, it appears to severely underestimate the underlying rho2. We found R2CS to be numerically most consistent with the underlying rho2, with R2N its nearest competitor. In addition, the base rate sensitivity of these two measures appears to be very close to that of R2L, the most base-rate-invariant parametric R2 measure. We therefore suggest using R2CS and R2N for logistic regression modeling, especially when it is reasonable to believe that an underlying latent variable exists. However, when the latent variable does not exist, comparability with the underlying rho2 is not an issue and R2L might be a better choice over all the R2 measures.
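Three of the commonly cited R2 analogs are direct functions of the null and fitted log-likelihoods, so they can be computed in a few lines. The sketch below fits a logistic model by direct likelihood maximization on simulated data with roughly a 30% base rate (all settings illustrative); R2L here denotes the likelihood-ratio (McFadden-type) measure, R2CS the Cox-Snell measure, and R2N the Nagelkerke rescaling.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)

# Simulated binary outcome with one predictor (illustrative settings).
n = 1000
x = rng.normal(0, 1, n)
y = rng.random(n) < 1 / (1 + np.exp(-(-1.0 + 1.2 * x)))

def nll(beta, X, y):
    """Negative log-likelihood of the logistic model."""
    eta = X @ beta
    return np.sum(np.logaddexp(0, eta) - y * eta)

X = np.column_stack([np.ones(n), x])
fit = minimize(nll, np.zeros(2), args=(X, y))
ll_model = -fit.fun
p0 = y.mean()
ll_null = n * (p0 * np.log(p0) + (1 - p0) * np.log(1 - p0))

r2_L = 1 - ll_model / ll_null                      # likelihood-ratio measure
r2_CS = 1 - np.exp(2 * (ll_null - ll_model) / n)   # Cox-Snell
r2_N = r2_CS / (1 - np.exp(2 * ll_null / n))       # Nagelkerke
print(f"R2_L={r2_L:.3f}  R2_CS={r2_CS:.3f}  R2_N={r2_N:.3f}")
```

Because ll_null enters each formula through p0, re-running this with a different base rate shifts the measures by different amounts, which is the sensitivity the study quantifies.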
 Date Issued
 2006
 Identifier
 FSU_migr_etd1789
 Format
 Thesis
 Title
 A Comparison of Estimators in Hierarchical Linear Modeling: Restricted Maximum Likelihood versus Bootstrap via Minimum Norm Quadratic Unbiased Estimators.
 Creator

Delpish, Ayesha Nneka, Niu, XuFeng, Tate, Richard L., Huffer, Fred W., Zahn, Douglas, Department of Statistics, Florida State University
 Abstract/Description

The purpose of the study was to investigate the relative performance of two estimation procedures, restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a two-level hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations, the importance of this assumption for the accuracy of multilevel parameter estimates and their standard errors was assessed using the accuracy index of relative bias and by observing the coverage percentages of 95% confidence intervals constructed for both estimation procedures. The study systematically varied the number of groups at level 2 (30 versus 100), the size of the intraclass correlation (0.01 versus 0.20) and the distribution of the observations (normal versus chi-squared with 1 degree of freedom). The number-of-groups and intraclass correlation factors produced effects consistent with those previously reported: as the number of groups increased, the bias in the parameter estimates decreased, with a more significant effect observed for those estimates obtained via REML. High levels of the intraclass correlation also led to a decrease in the efficiency of parameter estimation under both methods. Study results show that while both the restricted maximum likelihood and the bootstrap via MINQUE estimates of the fixed effects were accurate, the efficiency of the estimates was affected by the distribution of errors, with the bootstrap via MINQUE procedure outperforming REML. Both procedures produced less efficient estimators under the chi-squared distribution, particularly for the variance-covariance component estimates.
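A key detail when bootstrapping multilevel data is resampling whole groups so the two-level dependence survives. A minimal sketch (100 groups of 10, ICC = 0.2, all settings illustrative and echoing the study's design, not its code), checking the group-level bootstrap standard error of the fixed intercept against its analytic value:

```python
import numpy as np

rng = np.random.default_rng(9)

# Two-level data: 100 groups of 10, ICC = tau2 / (tau2 + sigma2) = 0.2.
J, m, tau2, sigma2 = 100, 10, 0.2, 0.8
u = rng.normal(0, np.sqrt(tau2), J)                    # group random effects
y = u[:, None] + rng.normal(0, np.sqrt(sigma2), (J, m))

# Nonparametric bootstrap at the group level: resample whole groups so the
# within-group dependence is preserved, then re-estimate the grand mean.
boot = np.array([y[rng.integers(0, J, J)].mean() for _ in range(500)])

analytic_se = np.sqrt((tau2 + sigma2 / m) / J)
print(f"bootstrap SE = {boot.std(ddof=1):.4f}, analytic SE = {analytic_se:.4f}")
```

Resampling individual observations instead of groups would ignore the intraclass correlation and understate the standard error.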
 Date Issued
 2006
 Identifier
 FSU_migr_etd0771
 Format
 Thesis
 Title
 Estimation from Data Representing a Sample of Curves.
 Creator

Auguste, Anna L., Bunea, Florentina, Mason, Patrick, Hollander, Myles, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

This dissertation introduces and assesses an algorithm to generate confidence bands for a regression function or a main effect when multiple data sets are available. In particular, it proposes to construct confidence bands for different trajectories and then aggregate these to produce an overall confidence band for a mean function. An estimator of the regression function or main effect is also examined. First, nonparametric estimators and confidence bands are formed on each data set separately. Then each data set is in turn treated as a testing set for aggregating the preliminary results from the remaining data sets. The criterion used for this aggregation is either the least squares (LS) criterion or a BIC-type penalized LS criterion. The proposed estimator is the average over data sets of these aggregates. It is thus a weighted sum of the preliminary estimators. The proposed confidence band is the minimum L1 band of all the M aggregate bands when we only have a main effect. In the case where there is some random effect, we suggest an adjustment to the confidence band; the proposed confidence band is then the minimum L1 band of all the M adjusted aggregate bands. Desirable asymptotic properties are shown to hold. A simulation study examines the performance of each technique relative to several alternate methods and theoretical benchmarks. An application to seismic data is conducted.
 Date Issued
 2006
 Identifier
 FSU_migr_etd0286
 Format
 Thesis
 Title
 Optimal Linear Representations of Images under Diverse Criteria.
 Creator

Rubinshtein, Evgenia, Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Chicken, Eric, Department of Statistics, Florida State University
 Abstract/Description

Image analysis often requires dimension reduction before statistical analysis, in order to apply sophisticated procedures. Motivated by eventual applications, a variety of criteria have been proposed: reconstruction error, class separation, non-Gaussianity using kurtosis, sparseness, mutual information, recognition of objects, and their combinations. Although some criteria have analytical solutions, the remaining ones require numerical approaches. We present geometric tools for finding linear projections that optimize a given criterion for a given data set. The main idea is to formulate a problem of optimization on a Grassmann or a Stiefel manifold, and to use the differential geometry of the underlying space to construct optimization algorithms. Purely deterministic updates lead to local solutions, and the addition of random components allows for stochastic gradient searches that eventually lead to global solutions. We demonstrate these results using several image datasets, including natural images and facial images.
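The manifold-optimization idea can be sketched with the one criterion whose answer is known in closed form: variance captured, tr(U'SU), whose optimum is the top principal subspace. The toy below (5 dimensions, 2-D projection, QR retraction, fixed step size, all invented) runs projected gradient ascent on the Stiefel manifold and checks the result against an eigendecomposition; the dissertation's algorithms are more general, including stochastic components.

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy data with two dominant directions, so the optimal 2-D projection
# under the variance criterion is the top-2 principal subspace.
A = rng.normal(0, 1, (200, 5))
A[:, :2] *= 5.0
S = np.cov(A, rowvar=False)

def retract(M):
    """QR retraction back onto the Stiefel manifold of orthonormal frames."""
    Q, _ = np.linalg.qr(M)
    return Q

U = retract(rng.normal(0, 1, (5, 2)))
for _ in range(500):
    G = 2 * S @ U                        # Euclidean gradient of tr(U' S U)
    G -= U @ (U.T @ G)                   # project onto the tangent space at U
    U = retract(U + 0.01 * G)            # ascent step followed by retraction

evals, evecs = np.linalg.eigh(S)
V = evecs[:, -2:]                        # top-2 principal directions
overlap = np.linalg.norm(U.T @ V) ** 2   # equals 2 when subspaces coincide
print(f"subspace overlap with PCA solution: {overlap:.4f} (max 2)")
```

For criteria without closed-form optima (kurtosis, mutual information, and so on), the same project-step-retract loop applies with the corresponding gradient.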
 Date Issued
 2006
 Identifier
 FSU_migr_etd1926
 Format
 Thesis
 Title
 A Statistical Approach to an Ocean Circulation Inverse Problem.
 Creator

Choi, Seoeun, Huffer, Fred W., Speer, Kevin G., Nolder, Craig, Niu, Xufeng, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

This dissertation presents, applies, and evaluates a statistical approach to an ocean circulation problem. The objective is to produce a map of ocean velocity in the North Atlantic from sparse measurements along ship tracks, using a Bayesian approach with a physical model. The physical model is the Stommel Gulf Stream model, which relates the wind stress curl to the transport stream function. A Gibbs sampler is used to extract features from the posterior velocity field. To specify the prior, the equation of the Stommel Gulf Stream model on a two-dimensional grid is used. Comparisons with earlier approaches used by oceanographers are also presented.
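A Gibbs sampler works by drawing each parameter from its full conditional in turn. The smallest useful illustration is a bivariate normal target (a toy stand-in for the high-dimensional posterior over gridded stream-function values; the correlation and run lengths below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)

# Gibbs sampling for a standard bivariate normal with correlation rho:
# each full conditional is a univariate normal, sampled alternately.
rho, n_iter, burn = 0.8, 6000, 1000
x = y = 0.0
draws = np.empty((n_iter, 2))
for t in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y | x
    draws[t] = x, y

kept = draws[burn:]
print("sample means:", np.round(kept.mean(axis=0), 3))
print("sample corr: ", np.round(np.corrcoef(kept.T)[0, 1], 3))
```

In the dissertation's setting, "x" and "y" become blocks of stream-function values whose conditionals are shaped by the discretized Stommel equation in the prior.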
 Date Issued
 2007
 Identifier
 FSU_migr_etd3758
 Format
 Thesis
 Title
 A Bayesian Approach to Meta-Regression: The Relationship Between Body Mass Index and All-Cause Mortality.
 Creator

Marker, Mahtab, McGee, Dan, Hurt, Myra, Niu, Xiufeng, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

This thesis presents a Bayesian approach to Meta-Regression and Individual Patient Data (IPD) Meta-analysis. The focus of the research is on establishing the relationship between Body Mass Index (BMI) and all-cause mortality. This has been an area of continuing interest in the medical and public health communities, and no consensus has been reached on what the optimal weight for individuals is. Standards are usually specified in terms of body mass index (BMI = weight(kg)/height(m)^2), which is associated with body fat percentage. Many studies in the literature have modelled the relationship between BMI and mortality and reported a variety of relationships, including U-shaped, J-shaped, and linear curves. The aim of my research was to use statistical methods to determine whether we can combine these diverse results to obtain a single estimated relationship, from which one can find the point of minimum mortality and establish reasonable ranges for optimal BMI, or how we can best examine the reasons for the heterogeneity of results. Commonly used techniques of Meta-analysis and Meta-regression are explored, and a problem with the estimation procedure in the multivariate setting is presented. A Bayesian approach using a Hierarchical Generalized Linear Mixed Model is suggested and implemented to overcome this drawback of standard estimation techniques. Another area which is explored briefly is that of Individual Patient Data meta-analysis. A Frailty model, or Random Effects Proportional Hazards Survival model, approach is proposed to carry out IPD meta-regression and come up with a single estimated relationship between BMI and mortality, adjusting for the variation between studies.
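The BMI standard quoted in the abstract is a simple formula; as a minimal illustration (the function name and sample values are mine, not from the thesis):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

print(round(bmi(70, 1.75), 1))  # 22.9
```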
 Date Issued
 2007
 Identifier
 FSU_migr_etd2736
 Format
 Thesis
 Title
 Flexible Additive Risk Models Using Piecewise Constant Hazard Functions.
 Creator

Uhm, Daiho, Huffer, Fred W., Kercheval, Alec, McGee, Dan, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

We study a weighted least squares (WLS) estimator for Aalen's additive risk model which allows for a very flexible handling of covariates. We divide the follow-up period into intervals and assume a constant hazard rate in each interval. The model is motivated as a piecewise approximation of a hazard function composed of three parts: arbitrary nonparametric functions for some covariate effects, smoothly varying functions for others, and known (or constant) functions for yet others. The proposed estimator is an extension of the grouped data version of the Huffer-McKeague estimator (1991). Our estimator may also be regarded as a piecewise constant analog of the semiparametric estimates of McKeague & Sasieni (1994) and Lin & Ying (1994). By using a fairly large number of intervals, we should get an essentially semiparametric model similar to the McKeague-Sasieni and Lin-Ying approaches. For our model, since the number of parameters is finite (although large), conventional approaches (such as maximum likelihood) are easy to formulate and implement. The approach is illustrated by simulations and is applied to data from the Framingham heart study.
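To illustrate the piecewise constant hazard assumption described in this abstract: under such a model, survival at time t is the exponential of minus the hazard accumulated across the intervals. A minimal sketch (the cut points and rates are invented for illustration, not taken from the thesis):

```python
import math

def survival(t: float, cuts: list, rates: list) -> float:
    """S(t) = exp(-cumulative hazard) under a piecewise constant hazard:
    rates[j] applies on [cuts[j], cuts[j+1]); cuts[0] is time zero."""
    cum = 0.0
    for j, rate in enumerate(rates):
        start = cuts[j]
        end = cuts[j + 1] if j + 1 < len(cuts) else float("inf")
        if t <= start:
            break
        cum += rate * (min(t, end) - start)  # hazard accrued in this interval
    return math.exp(-cum)

# rate 0.5 on [0, 1), rate 0.2 on [1, inf): S(2) = exp(-(0.5 + 0.2))
print(survival(2.0, [0.0, 1.0], [0.5, 0.2]))
```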
 Date Issued
 2007
 Identifier
 FSU_migr_etd1464
 Format
 Thesis
 Title
 Bayesian Dynamic Survival Models for Longitudinal Aging Data.
 Creator

He, Jianghua, McGee, Daniel L., Niu, Xufeng, Johnson, Suzanne B., Huffer, Fred W., Department of Statistics, Florida State University
 Abstract/Description

In this study, we will examine Bayesian Dynamic Survival Models, time-varying coefficient models from a Bayesian perspective, and their applications in the aging setting. The specific questions we are interested in are: Does the relative importance of characteristics measured at a particular age, such as blood pressure, smoking, and body weight, with respect to heart disease or death change as people age? If it does, how can we model the change? And how does the change affect the analysis results if fixed-effect models are applied? In the epidemiological and statistical literature, the relationship between a risk factor and the risk of an event is often described in terms of the numerical contribution of the risk factor to the total risk within a follow-up period, using methods such as contingency tables and logistic regression models. With the development of survival analysis, another method, the Proportional Hazards Model, has become more popular. This model describes the relationship between a covariate and risk within a follow-up period as a process, under the assumption that the hazard ratio of the covariate is fixed during the follow-up period. Neither the earlier methods nor the Proportional Hazards Model allows the effect of a covariate to change flexibly with time. In this study, we intend to investigate some classic epidemiological relationships using appropriate methods that allow coefficients to change with time, and compare our results with those found in the literature. After describing what has been done in previous work based on multiple logistic regression or discriminant function analysis, we summarize different methods for estimating time-varying coefficient survival models that are developed specifically for situations in which the proportional hazards assumption is violated. We will focus on the Bayesian Dynamic Survival Model because its flexibility and Bayesian structure fit our study goals.
There are two estimation methods for the Bayesian Dynamic Survival Models: the Linear Bayesian Estimation (LBE) method and the Markov Chain Monte Carlo (MCMC) sampling method. The LBE method is simpler, faster, and more flexible, but it requires specification of some parameters that are usually unknown. The MCMC method gets around the difficulty of specifying parameters, but is much more computationally intensive. We will use a simulation study to investigate the performance of these two methods, and provide suggestions on how to use them effectively in applications. The Bayesian Dynamic Survival Model is applied to the Framingham Heart Study to investigate the time-varying effects of covariates such as gender, age, smoking, and SBP (Systolic Blood Pressure) with respect to death. We also examine the changing relationship between BMI (Body Mass Index) and all-cause mortality, and suggest that some of the heterogeneity observed in the results found in the literature is likely to be a consequence of using fixed-effect models to describe a time-varying relationship.
 Date Issued
 2007
 Identifier
 FSU_migr_etd4174
 Format
 Thesis
 Title
 A Method for Finding the Nadir of Non-Monotonic Relationships.
 Creator

Tan, Fei, McGee, Daniel, Lloyd, Donald, Huffer, Fred, Niu, Xufeng, Dutton, Gareth, Department of Statistics, Florida State University
 Abstract/Description

Different methods have been proposed to model the J-shaped or U-shaped relationship between a risk factor and mortality so that the optimal risk-factor value (nadir) associated with the lowest mortality can be estimated. The basic model considered is the Cox Proportional Hazards model. Current methods include a quadratic method, a method with transformation, fractional polynomials, a change point method, and fixed-knot spline regression. The quadratic method contains both the linear and the quadratic term of the risk factor; it is simple, but it often generates unrealistic nadir estimates. The transformation method converts the original risk factor so that after transformation it has a Normal distribution, but this may not work when there is no good transformation to normality. Fractional polynomials are an extended class of regular polynomials that applies negative and fractional powers to the risk factor. Compared with the quadratic method or the transformation method, it does not always have a good model interpretation, and inferences about it do not incorporate the uncertainty coming from preselection of powers and degree. The change point method models the prognostic index using two pieces of upward quadratic functions that meet at their common nadir. This method assumes the knot and the nadir are the same, which is not always true. Fixed-knot spline regression has also been used to model nonlinear prognostic indices, but its inference does not account for variation arising from knot selection. Here we consider spline regressions with free knots, a natural generalization of the quadratic, change point, and fixed-knot spline methods. They can be applied to risk factors that do not have a good transformation to normality while keeping intuitive model interpretations. Asymptotic normality and consistency of the maximum partial likelihood estimators are established under a certain condition.
When the condition is not satisfied, simulations are used to explore asymptotic properties. The new method is motivated by and applied to nadir estimation in non-monotonic relationships between BMI (body mass index) and all-cause mortality. Its performance is compared with that of existing methods, using criteria of nadir estimation ability and goodness of fit.
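As a small illustration of the quadratic method discussed in this abstract: for a prognostic index b1*x + b2*x^2 that opens upward, setting the derivative b1 + 2*b2*x to zero gives the nadir x = -b1/(2*b2). A minimal sketch with made-up coefficients (not estimates from the thesis):

```python
def quadratic_nadir(b1: float, b2: float) -> float:
    """Nadir of the quadratic prognostic index b1*x + b2*x^2:
    the derivative b1 + 2*b2*x vanishes at x = -b1 / (2*b2)."""
    if b2 <= 0:
        raise ValueError("need an upward-opening quadratic (b2 > 0)")
    return -b1 / (2 * b2)

print(quadratic_nadir(-3.0, 0.5))  # 3.0
```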
 Date Issued
 2007
 Identifier
 FSU_migr_etd1719
 Format
 Thesis
 Title
 Spatiotemporal Bayesian Hierarchical Models, with Application to Birth Outcomes.
 Creator

Norton, Jonathan D. (Jonathan David), Niu, Xufeng, Eberstein, Isaac, Huffer, Fred, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

A class of hierarchical Bayesian models is introduced for adverse birth outcomes such as preterm birth, which are assumed to follow a conditional binomial distribution. The log-odds of an adverse outcome in a particular county, logit(p(i)), follows a linear model which includes observed covariates and normally distributed random effects. Spatial dependence between neighboring regions is allowed for by including an intrinsic autoregressive (IAR) prior or an IAR convolution prior in the linear predictor. Temporal dependence is incorporated by also including a temporal IAR term. It is shown that the variance parameters underlying these random effects (IAR, convolution, convolution plus temporal IAR) are identifiable. The same results are also shown to hold when the IAR is replaced by a conditional autoregressive (CAR) model. Furthermore, properties of the CAR parameter ρ are explored. The Deviance Information Criterion (DIC) is considered as a way to compare spatial hierarchical models. Simulations are performed to test whether the DIC can identify whether binomial outcomes come from an IAR, an IAR convolution, or independent normal deviates. Having established the theoretical foundations of the class of models and validated the DIC as a means of comparing models, we examine preterm birth and low birth weight counts in the state of Arkansas from 1994 to 2005. We find that preterm birth and low birth weight have different spatial patterns of risk, and that rates of low birth weight can be fit with a strikingly simple model that includes a constant spatial effect for all periods, a linear trend, and three covariates. It is also found that the risks of each outcome are increasing over time, even with adjustment for covariates.
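The log-odds link used in the linear predictor above is logit(p) = log(p / (1 - p)), with the logistic function as its inverse. A minimal sketch (function names are mine):

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x: float) -> float:
    """Inverse logit (logistic function), mapping log-odds back to (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(round(inv_logit(logit(0.1)), 6))  # 0.1
```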
 Date Issued
 2008
 Identifier
 FSU_migr_etd2523
 Format
 Thesis
 Title
 Revealing Sparse Signals in Functional Data.
 Creator

Ivanescu, Andrada E. (Andrada Eugenia), Bunea, Florentina, Wegkamp, Marten, Gert, Joshua, Niu, Xufeng, Hollander, Myles, Department of Statistics, Florida State University
 Abstract/Description

My dissertation presents a novel statistical method to estimate a sparse signal in functional data and to construct confidence bands for the signal. Existing methods for inference for the mean function in this framework include smoothing splines and kernel estimates. Our methodology involves thresholding a least squares estimator, with the threshold level depending on the sources of variability that exist in this type of data. The proposed estimation method and the confidence bands successfully adapt to the sparsity of the signal. We present supporting evidence through simulations and applications to real datasets.
 Date Issued
 2008
 Identifier
 FSU_migr_etd3852
 Format
 Thesis
 Title
 Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis.
 Creator

Thompson, Warren R. (Warren Robert), McGee, Daniel, Eberstein, Isaac, Huffer, Fred, Sinha, Debajyoti, She, Yiyuan, Department of Statistics, Florida State University
 Abstract/Description

Variable selection is an important aspect of modeling. Its aim is to distinguish between the authentic variables, which are important in predicting outcome, and the noise variables, which possess little to no predictive value. In other words, the goal is to find the variables that (collectively) best explain and predict changes in the outcome variable. The variable selection problem is exacerbated when correlated variables are included in the covariate set. This dissertation examines the variable selection problem in the context of logistic regression. Specifically, we investigated the merits of the bootstrap, ridge regression, the lasso, and Bayesian model averaging (BMA) as variable selection techniques when highly correlated predictors and a dichotomous outcome are considered. This dissertation also contributes to the literature on the diet-heart hypothesis. The diet-heart hypothesis has been around since the early twentieth century. Since then, researchers have attempted to isolate the nutrients in diet that promote coronary heart disease (CHD). After a century of research, there is still no consensus. In our current research, we used some of the more recent statistical methodologies (mentioned above) to investigate the effect of twenty dietary variables on the incidence of coronary heart disease. Logistic regression models were generated for the data from the Honolulu Heart Program, a study of CHD incidence in men of Japanese descent. Our results were largely method-specific. However, regardless of the method considered, there was strong evidence to suggest that alcohol consumption has a strong protective effect on the risk of coronary heart disease. Of the variables considered, dietary cholesterol and caffeine were the only variables that, at best, exhibited a moderately strong harmful association with CHD incidence. Further investigation that includes a broader array of food groups is recommended.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1360
 Format
 Thesis
 Title
 Covariance on Manifolds.
 Creator

Balov, Nikolay H. (Nikolay Hristov), Srivastava, Anuj, Klassen, Eric, Patrangenaru, Victor, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

With the ever-increasing complexity of observational and theoretical data models, the sufficiency of classical statistical techniques, designed to be applied only to vector quantities, is being challenged. Nonlinear statistical analysis has become an area of intensive research in recent years. Despite the impressive progress in this direction, a unified and consistent framework has not been reached. In this regard, the following work is an attempt to improve our understanding of random phenomena on non-Euclidean spaces. More specifically, the motivating goal of the present dissertation is to generalize the notion of distribution covariance, which in standard settings is defined only in Euclidean spaces, to arbitrary manifolds with a metric. We introduce a tensor field structure, named the covariance field, that is consistent with the heterogeneous nature of manifolds. It not only describes the variability imposed by a probability distribution but also provides alternative distribution representations. The covariance field combines the distribution density with geometric characteristics of its domain and thus fills the gap between these two. We present some of the properties of covariance fields and argue that they can be successfully applied to various statistical problems. In particular, we provide a systematic approach for defining parametric families of probability distributions on manifolds, parameter estimation for regression analysis, nonparametric statistical tests for comparing probability distributions, and interpolation between such distributions. We then present several application areas where this new theory may have potential impact. One of them is the branch of directional statistics, with a domain of influence ranging from geosciences to medical image analysis. The fundamental level at which the covariance-based structures are introduced also opens a new area for future research.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1045
 Format
 Thesis
 Title
 A Study of the Asymptotic Properties of Lasso Estimates for Correlated Data.
 Creator

Gupta, Shuva, Bunea, Florentina, Gert, Joshua, Hollander, Myles, Wegkamp, Marten, Department of Statistics, Florida State University
 Abstract/Description

In this thesis we investigate post-model-selection properties of L1-penalized weighted least squares estimators in regression models with a large number of variables M and correlated errors. We focus on correct subset selection and on the asymptotic distribution of the penalized estimators. In the simple case of AR(1) errors we give conditions under which correct subset selection can be achieved via our procedure. We then provide a detailed generalization of this result to models with errors that have a weak-dependence structure (Doukhan 1996). In all cases, the number M of regression variables is allowed to exceed the sample size n. We further investigate the asymptotic distribution of our estimates when M < n, and show that under appropriate choices of the tuning parameters the limiting distribution is multivariate normal. This generalizes to the case of correlated errors the result of Knight and Fu (2000), obtained for regression models with independent errors.
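The L1 penalty behind the estimators in this abstract produces exact zeros through the soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0), the building block of coordinate-descent lasso updates. A generic sketch, not code from the thesis:

```python
def soft_threshold(z: float, lam: float) -> float:
    """Soft-thresholding operator: shrinks z toward zero by lam and
    sets it exactly to zero when |z| <= lam (how lasso drops variables)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print([soft_threshold(z, 1.0) for z in (-3, 0.5, 2)])  # [-2, 0.0, 1]
```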
 Date Issued
 2009
 Identifier
 FSU_migr_etd3896
 Format
 Thesis
 Title
 Adaptive Series Estimators for Copula Densities.
 Creator

Gui, Wenhao, Wegkamp, Marten, Van Engelen, Robert A., Niu, Xufeng, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

In this thesis, based on an orthonormal series expansion, we propose a new nonparametric method to estimate copula density functions. Since the basis coefficients turn out to be expectations, empirical averages are used to estimate these coefficients. We propose estimators of the variance of the estimated basis coefficients and establish their consistency. We derive the asymptotic distribution of the estimated coefficients under mild conditions. We derive a simple oracle inequality for the copula density estimator based on a finite series using the estimated coefficients. We propose a stopping rule for selecting the number of coefficients used in the series, and we prove that this rule minimizes the mean integrated squared error. In addition, we consider hard and soft thresholding techniques for sparse representations. We obtain oracle inequalities that hold with prescribed probability for various norms of the difference between the copula density and our thresholded series density estimator. Uniform confidence bands are derived as well. The oracle inequalities clearly reveal that our estimator adapts to the unknown degree of sparsity of the series representation of the copula density. A simulation study indicates that our method is extremely easy to implement and works very well, and that it compares favorably to the popular kernel-based copula density estimator, especially around the boundary points, in terms of mean squared error. Finally, we have applied our method to an insurance dataset. After comparing our method with previous data analyses, we reach the same conclusion as the parametric methods in the literature, and as such we provide additional justification for the use of the developed parametric model.
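The key observation in this abstract, that basis coefficients are expectations estimable by sample averages, can be sketched as follows. The cosine tensor basis here is my own choice for illustration; the thesis does not specify this particular basis:

```python
import math

def coef_hat(u, v, j, k):
    """Empirical estimate of the (j, k) coefficient of a copula density in a
    cosine tensor basis on [0,1]^2: since the coefficient is an expectation
    E[phi_j(U) * phi_k(V)], estimate it by the sample average."""
    def phi(m, x):
        return 1.0 if m == 0 else math.sqrt(2) * math.cos(m * math.pi * x)
    n = len(u)
    return sum(phi(j, u[i]) * phi(k, v[i]) for i in range(n)) / n

# the (0, 0) coefficient is exactly 1 for any sample (phi_0 is constant)
print(coef_hat([0.2, 0.5, 0.9], [0.1, 0.6, 0.4], 0, 0))  # 1.0
```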
 Date Issued
 2009
 Identifier
 FSU_migr_etd3929
 Format
 Thesis
 Title
 Estimating the Probability of Cardiovascular Disease: A Comparison of Methods.
 Creator

Fan, Li, McGee, Daniel, Hurt, Myra, Niu, Xufeng, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Risk prediction plays an important role in clinical medicine. It not only helps in educating patients to improve their lifestyle and in targeting individuals at high risk, but also guides treatment decisions. So far, various instruments have been used for risk assessment in different countries, and the risk predictions based on these different models are not consistent. For public use, a reliable risk prediction is necessary. This thesis discusses the models that have been developed for risk assessment and evaluates the performance of prediction at two levels: the overall level and the individual level. At the overall level, cross validation and simulation are used to assess the risk prediction, while at the individual level, the "Parametric Bootstrap" and the delta method are used to evaluate the uncertainty of the individual risk prediction. Further exploration of the reasons producing different performance among the models is ongoing.
 Date Issued
 2009
 Identifier
 FSU_migr_etd4508
 Format
 Thesis
 Title
 Time Scales in Epidemiological Analysis.
 Creator

Chalise, Prabhakar, McGee, Daniel L., Chicken, Eric, Carlson, Elwood, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

The Cox proportional hazards model is routinely used to determine the time until an event of interest. Two time scales are used in practice: follow-up time and chronological age. The former is the most frequently used time scale in both clinical studies and longitudinal observational studies. However, there is no general consensus about which time scale is best. In recent years, papers have appeared arguing for using chronological age as the time scale, either with or without adjusting for entry age. It has also been asserted that if the cumulative baseline hazard is exponential, or if the age at entry is independent of covariates, the two models are equivalent. Our studies do not satisfy these two conditions in general. We found that the true factor that makes the models perform significantly differently is the variability in the age at entry. If there is no variability in the entry age, the time scales do not matter and both models estimate exactly the same coefficients. As the variability increases, the models disagree with each other. We also computed the optimum time scale proposed by Oakes and utilized it in the Cox model. Both our empirical and simulation studies show that the follow-up time scale model using age at entry as a covariate is better than the chronological age and Oakes time scale models. This finding is illustrated with two examples with data from the Diverse Populations Collaboration. Based on our findings, we recommend using follow-up time as the time scale for epidemiological analysis.
 Date Issued
 2009
 Identifier
 FSU_migr_etd3933
 Format
 Thesis
 Title
 Discrimination and Calibration of Prognostic Survival Models.
 Creator

Simino, Jeannette M., Hollander, Myles, McGee, Daniel, Hurt, Myra, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Clinicians employ prognostic survival models for diseases such as coronary heart disease and cancer to inform patients about risks, treatments, and clinical decisions (Altman and Royston 2000). These prognostic models are not useful unless they are valid in the population to which they are applied. There are no generally accepted algorithms for assessing the validity of an external survival model in a new population. Researchers often invoke measures of predictive accuracy, the degree to which predicted outcomes match observed outcomes (Justice et al. 1999). One component of predictive accuracy is discrimination, the ability of the model to correctly rank the individuals in the sample by risk. A common measure of discrimination for prognostic survival models is the concordance index, also called the c-statistic. We utilize the concordance index to determine the discrimination of Framingham-based Cox and Log-logistic models of coronary heart disease (CHD) death in cohorts from the Diverse Populations Collaboration, a collection of studies that encompasses many ethnic, geographic, and socioeconomic groups. Pencina and D'Agostino presented a confidence interval for the concordance index when assessing the discrimination of an external prognostic model. We perform simulations to determine the robustness of their confidence interval when measuring discrimination during internal validation. The Pencina and D'Agostino confidence interval is not valid in the internal validation setting because their assumption of mutually independent observations is violated. We compare the Pencina and D'Agostino confidence interval to a bootstrap confidence interval that we propose, which is valid for internal validation. We specifically discern the performance of the interval when the same sample is used both to fit a prognostic model and to determine its validity.
The framework for our simulations is a Weibull proportional hazards model of CHD death fit to the Framingham exam 4 data. We then focus on the second component of accuracy, calibration, which measures the agreement between the observed and predicted event rates for groups of patients (Altman and Royston 2000). In 2000, van Houwelingen introduced a method called validation by calibration to allow clinicians to assess the validity of a well-accepted published survival model on their own patient population and to adjust the published model to fit that population. Van Houwelingen embeds the published model into a new model with only three parameters, which helps combat the overfitting that occurs when models with many covariates are fit on data sets with a small number of events. We explore validation by calibration as a tool to adjust models when an external model over- or underestimates risk. Van Houwelingen discusses the general method and then focuses on the proportional hazards model. There are situations where proportional hazards may not hold, so we extend the methodology to the Log-logistic accelerated failure time model. We perform validation by calibration of Framingham-based Cox and Log-logistic models of CHD death on cohorts from the Diverse Populations Collaboration. Lastly, we conduct simulations that investigate the power of the global Wald validation by calibration test. We study its power to reject an invalid proportional hazards or Log-logistic accelerated failure time model under various scale and/or shape misspecifications.
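As a generic illustration of the concordance index (c-statistic) discussed in this abstract, here is a minimal sketch of Harrell's c for right-censored data; the function and the toy data are illustrative, not taken from the dissertation:

```python
from itertools import combinations

def concordance_index(times, events, scores):
    """Harrell's c-statistic: among usable pairs (the earlier time is an
    observed event, no tie in time), the fraction where the subject who
    failed earlier has the higher risk score; score ties count 1/2."""
    num = den = 0.0
    for i, j in combinations(range(len(times)), 2):
        a, b = (i, j) if times[i] < times[j] else (j, i)  # a fails first
        if times[a] == times[b] or not events[a]:
            continue  # pair unusable: tied times, or earlier time censored
        den += 1
        if scores[a] > scores[b]:
            num += 1
        elif scores[a] == scores[b]:
            num += 0.5
    if den == 0:
        raise ValueError("no usable pairs")
    return num / den

# risk scores perfectly ordered against survival times: c = 1
print(concordance_index([2, 5, 9], [1, 1, 1], [0.9, 0.4, 0.1]))  # 1.0
```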
 Date Issued
 2009
 Identifier
 FSU_migr_etd0328
 Format
 Thesis
 Title
 Stochastic Models and Inferences for Commodity Futures Pricing.
 Creator

Ncube, Moeti M., Srivastava, Anuj, Doran, James, Mason, Patrick, Niu, Xufeng, Huffer, Fred, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

The stochastic modeling of financial assets is essential to the valuation of financial products and to investment decisions. These models are governed by parameters that are estimated through a process known as calibration. Current procedures typically perform a grid-search optimization of a given objective function over a specified parameter space. These methods can be computationally intensive and require restrictions on the parameter space to achieve timely convergence. In this thesis, we propose an alternative Kalman Smoother Expectation Maximization procedure (KSEM) that can jointly estimate all the parameters and produces a better model fit than alternative estimation procedures. Further, we consider the additional complexity of modeling jumps or spikes that may occur in a time series. For this calibration we develop a Particle Smoother Expectation Maximization procedure (PSEM) for the optimization of nonlinear systems. This is an entirely new estimation approach, and we provide several examples of its application.
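The Kalman-smoother machinery that a KSEM-style calibration iterates over can be sketched for a scalar state-space model. This is a generic textbook filter/smoother, not the thesis's procedure; the model x_t = a·x_{t−1} + w_t, y_t = x_t + v_t and all names are illustrative.

```python
def kalman_smooth(y, a, q, r, x0=0.0, p0=1.0):
    """Forward Kalman filter plus Rauch-Tung-Striebel backward smoother
    for the scalar model x_t = a*x_{t-1} + w_t,  y_t = x_t + v_t,
    with Var(w_t) = q and Var(v_t) = r. Returns smoothed means/variances."""
    n = len(y)
    xf, pf, xp, pp = [], [], [], []  # filtered / predicted means, variances
    x, p = x0, p0
    for t in range(n):
        x_pred, p_pred = a * x, a * a * p + q      # predict step
        k = p_pred / (p_pred + r)                  # Kalman gain
        x = x_pred + k * (y[t] - x_pred)           # update with y[t]
        p = (1.0 - k) * p_pred
        xp.append(x_pred); pp.append(p_pred)
        xf.append(x); pf.append(p)
    xs, ps = xf[:], pf[:]                          # RTS backward pass
    for t in range(n - 2, -1, -1):
        g = pf[t] * a / pp[t + 1]
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
        ps[t] = pf[t] + g * g * (ps[t + 1] - pp[t + 1])
    return xs, ps
```

In an EM calibration, the smoothed moments from a pass like this would feed the M-step updates of (a, q, r); that outer loop is omitted here.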
 Date Issued
 2009
 Identifier
 FSU_migr_etd2707
 Format
 Thesis
 Title
 Transformation Models for Survival Data Analysis and Applications.
 Creator

Liu, Yang, Niu, XuFeng, Lloyd, Donald, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

It is often assumed in standard survival models that all uncensored subjects will eventually experience the event of interest. However, in some situations, when the event considered is not death, it will never occur for a proportion of subjects. Survival models with a cure fraction are becoming popular for analyzing this type of study. We propose a generalized transformation model motivated by Zeng et al.'s (2006) transformed proportional time cure model. In our proposed model, fractional polynomials are used instead of a simple linear combination of the covariates. The proposed models give us more flexibility without losing any of the good properties of the original model, such as asymptotic consistency and asymptotic normality of the regression coefficients. The proposed model will better fit data in which the relationship between the response variable and the covariates is nonlinear. We also provide a power selection procedure based on the likelihood function. A simulation study is carried out to show the accuracy of the proposed power selection procedure. The proposed models are applied to coronary heart disease and cancer-related medical data from both observational cohort studies and clinical trials.
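For readers unfamiliar with fractional polynomials, the basic transformation can be sketched in a few lines of Python. This simplified version covers only distinct powers and omits the repeated-power (log-multiplied) terms of the full Royston-Altman formulation; the function name is illustrative.

```python
import math

def fp_basis(x, powers):
    """Fractional polynomial terms for a covariate x > 0. By the usual
    convention, a power of 0 denotes log(x); otherwise the term is x**p.
    Powers are typically drawn from {-2, -1, -0.5, 0, 0.5, 1, 2, 3}."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    return [math.log(x) if p == 0 else x ** p for p in powers]
```

For example, powers (0, 0.5, 2) turn a single covariate into the three regressors log(x), sqrt(x), and x², which then enter the model linearly.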
 Date Issued
 2009
 Identifier
 FSU_migr_etd1155
 Format
 Thesis
 Title
 Association Models for Clustered Data with Binary and Continuous Responses.
 Creator

Lin, Lanjia, Sinha, Debajyoti, Hurt, Myra, Lipsitz, Stuart R., McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

This dissertation develops novel single random effect models as well as a bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses, respectively. For ease of interpretation of the regression effects, the random effect of the binary response has a bridge distribution, so that the marginal mean model of the binary response, after integrating out the random effect, preserves the logistic form, and the marginal regression function of the continuous response preserves the linear form. Within-cluster and within-subject associations can be measured by our proposed models. For the bivariate correlated random effects model, we illustrate how different levels of association between the two random effects induce different Kendall's tau values for the association between the binary and continuous responses from the same cluster. Fully parametric and semiparametric Bayesian methods as well as the maximum likelihood method are illustrated for model analysis. In the semiparametric Bayesian model, the normality assumption on the regression error for the continuous response is relaxed by using a nonparametric Dirichlet Process prior. The robustness of the bivariate correlated random effects model under the ML method to misspecification of the regression function as well as of the random effect distribution is investigated through simulation studies. The Bayesian and likelihood methods are applied to a developmental toxicity study of ethylene glycol in mice.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1330
 Format
 Thesis
 Title
 Some New Methods for Design and Analysis of Survival Data.
 Creator

Wang, Wenting, Sinha, Debajyoti, Arjmandi, Bahram H., McGee, Dan, Niu, Xufeng, Yu, Kai, Department of Statistics, Florida State University
 Abstract/Description

For survival outcomes, statistical equivalence tests intended to show that a new treatment is therapeutically equivalent to a standard treatment are usually based on the Cox (1972) proportional hazards assumption. We present an alternative method based on the linear transformation model (LTM) for two treatment arms, and show the advantages of using this equivalence test instead of tests based on Cox's model. The LTM is a very general class of models that includes, for example, the proportional odds survival model (POSM). We present a sufficient condition for checking whether log-rank based tests have inflated Type I error rates. We show that the POSM and some other commonly used survival models within the LTM class all satisfy this condition. Simulation studies show that using our test in place of log-rank based tests is a safer statistical practice. Our second goal is to develop a practical Bayesian model for survival data with a high-dimensional covariate vector. We develop the Information Matrix (IM) and Information Matrix Ridge (IMR) priors for commonly used survival models, including Cox's model and the cure rate model proposed by Chen et al. (1999), and examine many desirable theoretical properties, including sufficient conditions for the existence of the moment generating functions of these priors and of the corresponding posterior distributions. The performance of these priors in practice is compared with some competing priors via a Bayesian analysis of a study that investigates the relationship between lung cancer survival time and a large number of genetic markers.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1248
 Format
 Thesis
 Title
 Bayesian Generalized Polychotomous Response Models and Applications.
 Creator

Yang, Fang, Niu, XuFeng, Johnson, Suzanne B., McGee, Dan, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Polychotomous quantal response models are widely used in medical and econometric studies to analyze categorical or ordinal data. In this study, we apply Bayesian methodology through a mixed-effects polychotomous quantal response model. For the Bayesian polychotomous quantal response model, we assume uniform improper priors for the regression coefficients and explore sufficient conditions for a proper joint posterior distribution of the parameters in the models. Simulation results from Gibbs sampling estimates are compared to traditional maximum likelihood estimates to show the strength of using uniform improper priors for the regression coefficients. Motivated by an investigation of the relationship between BMI categories and several risk factors, we carry out application studies to examine the impact of risk factors on BMI categories, especially the categories of "Overweight" and "Obese". By applying the mixed-effects Bayesian polychotomous response model with uniform improper priors, we obtain interpretations of the association between risk factors and BMI that are similar to findings in the literature.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1092
 Format
 Thesis
 Title
 Nonparametric Estimation of Three Dimensional Projective Shapes with Applications in Medical Imaging and in Pattern Recognition.
 Creator

Crane, Michael, Patrangenaru, Victor, Liu, Xiuwen, Huffer, Fred W., Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

This dissertation is on the analysis of invariants of a 3D configuration from 2D images in pictures of this configuration, without requiring any restriction on the camera positioning relative to the scene pictured. We briefly review some of the main results found in the literature. The methodology used is nonparametric and manifold-based, combined with standard computer vision reconstruction techniques. More specifically, we use asymptotic results for the extrinsic sample mean and the extrinsic sample covariance to construct bootstrap confidence regions for mean projective shapes of 3D configurations. Chapters 4, 5 and 6 contain new results. In Chapter 4, we develop tests for coplanarity. Chapter 5 is on the reconstruction of 3D polyhedral scenes, including texture, from arbitrary partial views. In Chapter 6, we develop a nonparametric methodology for estimating the mean change for matched samples on a Lie group. We then notice that for k ≥ 4, a manifold of projective shapes of k-ads in general position in 3D has the structure of a (3k − 15)-dimensional Lie group (PQuaternions) that is equivariantly embedded in a Euclidean space; therefore testing for mean 3D projective shape change amounts to a one-sample test for extrinsic mean PQuaternion objects. The Lie group technique leads to a large-sample and nonparametric bootstrap test for one population extrinsic mean on a projective shape space, as recently developed by Patrangenaru, Liu and Sughatadasa. On the other hand, in the absence of occlusions, the 3D projective shape of a spatial configuration can be recovered from a stereo pair of images, thus allowing one to test for mean glaucomatous 3D projective shape change from standard stereo pairs of eye images.
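The extrinsic mean used above can be illustrated in its simplest setting, the unit sphere: average the sample in the ambient Euclidean space, then project back onto the manifold. The dissertation works on projective shape spaces rather than S²; this sketch only conveys the idea, and the function name is assumed.

```python
def extrinsic_mean_sphere(points):
    """Extrinsic mean of points on the unit sphere S^2 in R^3: take the
    ordinary Euclidean centroid, then project it radially back onto the
    sphere. Undefined when the centroid is the origin."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    norm = sum(c * c for c in centroid) ** 0.5
    if norm == 0.0:
        raise ValueError("centroid at origin: extrinsic mean undefined")
    return [c / norm for c in centroid]
```

Bootstrap confidence regions of the kind described in the abstract resample the data and recompute this projected mean on each resample.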
 Date Issued
 2010
 Identifier
 FSU_migr_etd4607
 Format
 Thesis
 Title
 A Probabilistic and Graphical Analysis of Evidence in O.J. Simpson's Murder Case Using Bayesian Networks.
 Creator

Olumide, Kunle, Huffer, Fred, Shute, Valerie, Sinha, Debajyoti, Niu, Xufeng, Logan, Wayne, Department of Statistics, Florida State University
 Abstract/Description

This research is an attempt to illustrate the versatility and wide applicability of statistical science. Specifically, it involves the application of statistics in the field of law. The application focuses on the subfields of evidence and criminal law, using one of the most celebrated cases in the history of American jurisprudence: the 1994 O.J. Simpson murder case in California. Our task is to conduct a probabilistic and graphical analysis of the body of evidence in this case using Bayesian networks. We begin the analysis by constructing our main hypothesis regarding the guilt or non-guilt of the accused; this main hypothesis is supplemented by a series of ancillary hypotheses. Using graphs and probability concepts, we evaluate the probative force, or strength, of the evidence and how well the body of evidence at hand supports our main hypothesis. We employ Bayes' rule, likelihoods, and likelihood ratios to carry out this evaluation. Sensitivity analyses are carried out by varying the degree of our prior beliefs, or probabilities, and evaluating the effect of such variations on the conclusions regarding our main hypothesis.
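The Bayes-rule updating described above can be sketched as a toy odds-form calculation: posterior odds equal prior odds times the product of likelihood ratios. The function name is illustrative, and the numbers in the example have nothing to do with the actual case evidence.

```python
def posterior_probability(prior, likelihood_ratios):
    """Chain Bayes' rule in odds form. Each likelihood ratio is
    P(evidence | H) / P(evidence | not H); the evidence items are
    treated as conditionally independent given the hypothesis H."""
    odds = prior / (1.0 - prior)          # prior odds of H
    for lr in likelihood_ratios:
        odds *= lr                        # one Bayes update per item
    return odds / (1.0 + odds)            # back to a probability
```

For example, a prior of 0.5 combined with a single likelihood ratio of 3 yields a posterior of 0.75; varying the prior, as in the sensitivity analyses mentioned above, shifts the posterior accordingly.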
 Date Issued
 2010
 Identifier
 FSU_migr_etd2287
 Format
 Thesis
 Title
 Goodness-of-Fit Tests for Logistic Regression.
 Creator

Wu, Sutan, McGee, Dan L., Zhang, Jinfeng, Hurt, Myra, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

The generalized linear model, and particularly the logistic model, is widely used in public health, medicine, and epidemiology. Goodness-of-fit tests for these models are commonly used to describe how well a proposed model fits a set of observations. These goodness-of-fit tests each have individual advantages and disadvantages. In this thesis, we mainly consider the performance of the Hosmer-Lemeshow test, Pearson's chi-square test, the unweighted sum of squares test, and the cumulative residual test. We compare their performance in a series of empirical studies as well as in particular simulation scenarios. We conclude that the unweighted sum of squares test and the cumulative sums of residuals test give better overall performance than the other two. We also conclude that the commonly suggested practice of treating a p-value less than 0.15 as an indication of lack of fit at the initial steps of model diagnostics should be adopted. Additionally, D'Agostino et al. presented the relationship between stacked logistic regression and the Cox regression model in the Framingham Heart Study. In future work, we will therefore examine the possibility and feasibility of adapting these goodness-of-fit tests to the Cox proportional hazards model using stacked logistic regression.
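A minimal sketch of the Hosmer-Lemeshow idea (group subjects by predicted risk, then compare observed and expected event counts with a chi-square-type statistic) is given below. The reference chi-square distribution with g − 2 degrees of freedom, and hence the p-value, is omitted to keep the sketch dependency-free; names are illustrative.

```python
def hosmer_lemeshow_stat(y, p, groups=10):
    """Hosmer-Lemeshow-style statistic: sort by predicted probability,
    split into `groups` roughly equal bins, and sum
    (observed - expected)^2 / (m * pbar * (1 - pbar)) over bins."""
    pairs = sorted(zip(p, y))            # order subjects by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        obs = sum(yi for _, yi in chunk)   # observed events in the bin
        exp = sum(pi for pi, _ in chunk)   # expected events in the bin
        m = len(chunk)
        pbar = exp / m                     # mean predicted risk in the bin
        if 0.0 < pbar < 1.0:
            stat += (obs - exp) ** 2 / (m * pbar * (1.0 - pbar))
    return stat
```

A well-calibrated model gives a statistic near zero; gross miscalibration inflates it, which is what the empirical comparisons above exploit.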
 Date Issued
 2010
 Identifier
 FSU_migr_etd0693
 Format
 Thesis
 Title
 Nonparametric Estimation of Three Dimensional Projective Shapes with Applications in Medical Imaging and in Pattern Recognition.
 Creator

Crane, Michael, Patrangenaru, Victor, Liu, Xiuwen, Huffer, Fred W., Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

This dissertation is on the analysis of invariants of a 3D configuration from 2D images in pictures of this configuration, without requiring any restriction on the camera positioning relative to the scene pictured. We briefly review some of the main results found in the literature. The methodology used is nonparametric and manifold-based, combined with standard computer vision reconstruction techniques. More specifically, we use asymptotic results for the extrinsic sample mean and the extrinsic sample covariance to construct bootstrap confidence regions for mean projective shapes of 3D configurations. Chapters 4, 5 and 6 contain new results. In Chapter 4, we develop tests for coplanarity. Chapter 5 is on the reconstruction of 3D polyhedral scenes, including texture, from arbitrary partial views. In Chapter 6, we develop a nonparametric methodology for estimating the mean change for matched samples on a Lie group. We then notice that for k ≥ 4, a manifold of projective shapes of k-ads in general position in 3D has the structure of a (3k − 15)-dimensional Lie group (PQuaternions) that is equivariantly embedded in a Euclidean space; therefore testing for mean 3D projective shape change amounts to a one-sample test for extrinsic mean PQuaternion objects. The Lie group technique leads to a large-sample and nonparametric bootstrap test for one population extrinsic mean on a projective shape space, as recently developed by Patrangenaru, Liu and Sughatadasa. On the other hand, in the absence of occlusions, the 3D projective shape of a spatial configuration can be recovered from a stereo pair of images, thus allowing one to test for mean glaucomatous 3D projective shape change from standard stereo pairs of eye images.
 Date Issued
 2010
 Identifier
 FSU_migr_etd7118
 Format
 Thesis
 Title
 Age Effects in the Extinction of Planktonic Foraminifera: A New Look at Van Valen's Red Queen Hypothesis.
 Creator

Wiltshire, Jelani, Huffer, Fred, Parker, William, Chicken, Eric, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Van Valen's Red Queen hypothesis states that, within a homogeneous taxonomic group, age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic Foraminifera (Foram) extinction data include Weibull and Exponential modeling (Parker and Arnold, 1997) and Cox proportional hazards modeling (Doran et al. 2004, 2006). I propose a general class of test statistics that can be used to test for the effect of age on extinction. These test statistics allow for a varying background rate of extinction and attempt to remove the effects of other covariates when assessing the effect of age on extinction. No model is assumed for the covariate effects; instead, I control for covariate effects by pairing or grouping together similar species. I use simulated data sets to compare the power of the statistics. In applying the test statistics to the Foram data, I have found age to have a positive effect on extinction.
 Date Issued
 2010
 Identifier
 FSU_migr_etd0952
 Format
 Thesis
 Title
 Time-Varying Coefficient Models with ARMA-GARCH Structures for Longitudinal Data Analysis.
 Creator

Zhao, Haiyan, Niu, Xufeng, Huffer, Fred, Nolder, Craig, McGee, Dan, Department of Statistics, Florida State University
 Abstract/Description

The motivation for my research comes from the analysis of the Framingham Heart Study (FHS) data. The FHS is a long-term prospective study of cardiovascular disease in the community of Framingham, Massachusetts. The study began in 1948, and 5,209 subjects were initially enrolled. Examinations were given biennially to the study participants, and their status with respect to the occurrence of disease was recorded. In this dissertation, the event of interest is the incidence of coronary heart disease (CHD). Covariates considered include sex, age, cigarettes smoked per day (CSM), serum cholesterol (SCL), systolic blood pressure (SBP), and body mass index (BMI, weight in kilograms divided by height in meters squared). A review of the statistical literature indicates that the effects of these covariates on cardiovascular disease, or on death from all causes, in the Framingham study change over time; for example, the effect of SCL on cardiovascular disease decreases linearly over time. In this study, I examine the time-varying effects of the risk factors on CHD incidence. Time-varying coefficient models with an ARMA-GARCH structure are developed in this research. The maximum likelihood and marginal likelihood methods are used to estimate the parameters in the proposed models. Since high-dimensional integrals are involved in the calculation of the marginal likelihood, the Laplace approximation is employed. Simulation studies are conducted to evaluate the performance of these two estimation methods for our proposed models. The Kullback-Leibler (KL) divergence and the root mean square error are employed in the simulation studies to compare the results obtained from the different methods. Simulation results show that the marginal likelihood approach gives more accurate parameter estimates but is more computationally intensive.
Following the simulation study, our proposed models are applied to the Framingham Heart Study to investigate the time-varying effects of the covariates on CHD incidence. To specify the time-series structures of the effects of the risk factors, the Bayesian Information Criterion (BIC) is used for model selection. Our study shows that the relationship between CHD and the risk factors changes over time. For males, there is a clearly decreasing linear trend in the age effect, which implies that the age effect on CHD is weaker for older patients than for younger patients. The effect of CSM stays almost the same for the first 30 years and decreases thereafter. There are slightly decreasing linear trends in the effects of both SBP and BMI. Furthermore, the coefficients of SBP are mostly positive over time, i.e., patients with higher SBP are, as expected, more likely to develop CHD. For females, there is also a clearly decreasing linear trend in the age effect, while the effects of SBP and BMI on CHD are mostly positive and do not change much over time.
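The GARCH component of an ARMA-GARCH coefficient structure is, at its core, the familiar conditional-variance recursion h_t = ω + α·ε²_{t−1} + β·h_{t−1}. A small simulation sketch follows; the parameter names (omega, alpha, beta) are the conventional ones, and the values in the example are illustrative, not estimates from the Framingham data.

```python
import random

def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate n steps of a GARCH(1,1) process with Gaussian innovations.
    Requires alpha + beta < 1 so the unconditional variance
    omega / (1 - alpha - beta) exists; we start h there."""
    rng = random.Random(seed)
    h = omega / (1.0 - alpha - beta)     # unconditional variance
    eps, variances = [], []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0) * h ** 0.5   # shock with variance h
        eps.append(e)
        variances.append(h)
        h = omega + alpha * e * e + beta * h  # GARCH(1,1) recursion
    return eps, variances
```

Large shocks raise next-period variance, producing the volatility clustering that motivates attaching a GARCH structure to time-varying coefficients.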
 Date Issued
 2010
 Identifier
 FSU_migr_etd0527
 Format
 Thesis
 Title
 Ultrafast Lattice Dynamics in Metal Thin Films and Nanoparticles.
 Creator

Wang, Xuan, Cao, Jim, Yang, Wei, Bonesteel, Nicholas, Riley, Mark, Xiong, Peng, Department of Physics, Florida State University
 Abstract/Description

This thesis presents the development of the 3rd-generation femtosecond diffractometer (FED) in Professor Jim Cao's group and its application to the study of ultrafast structural dynamics in solid state materials. The 3rd-generation FED surpasses its predecessor and other similar FED instruments through a DC electron gun that can generate much higher-energy electron pulses and a more efficient imaging system. This combination, together with miscellaneous improvements, significantly boosts the signal-to-noise ratio and thus enables us to study more complex solid state materials. Two main thrusts are discussed in detail in this thesis. The first is the dynamics of coherent phonon generation by ultrafast heating in gold thin films and nanoparticles, which emphasizes the electronic thermal stress. The other is the ultrafast dynamics in nickel, which shows that the mutual interactions among the lattice, spin, and electron subsystems can significantly alter the ultrafast lattice dynamics. In these studies, we exploit the FED instrument as an ideal tool that can directly and simultaneously monitor the coherent and random motions of the lattice.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1247
 Format
 Thesis
 Title
 Multistate Intensity Model with AR-GARCH Random Effect for Corporate Credit Rating Transition Analysis.
 Creator

Li, Zhi, Niu, Xufeng, Huffer, Fred, Kercheval, Alec, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

This thesis presents a stochastic process and time series study of corporate credit rating and market implied rating transitions. By extending an existing model, it incorporates generalized autoregressive conditional heteroscedastic (GARCH) random effects to capture volatility changes in the instantaneous transition rates. The GARCH model is a crucial part of financial research, since its ability to model volatility changes gives market practitioners the flexibility to build more accurate models of high-frequency financial data. Corporate rating transition modeling historically dealt with low-frequency data, for which there was no need to specify the volatility. However, the newly published Moody's market implied ratings exhibit much higher transition frequencies, so we feel it is necessary to capture the volatility component and extend existing models to reflect this fact. The theoretical model specification and estimation details are discussed thoroughly in this dissertation. The performance of our models is studied on several simulated data sets and compared to the original model. Finally, the models are applied to both Moody's issuer rating and market implied rating transition data.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1426
 Format
 Thesis
 Title
 The Effect of Risk Factors on Coronary Heart Disease: An Age-Relevant Multivariate Meta-Analysis.
 Creator

Li, Yan, McGee, Dan, She, Yiyuan, Eberstein, Ike, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The importance of major risk factors, such as hypertension, total cholesterol, body mass index, diabetes, and smoking, for predicting the incidence and mortality of Coronary Heart Disease (CHD) is well known. In light of the fact that age is also a major risk factor for CHD death, a natural question is whether the effects of the other risk factors on CHD change with age. This thesis focuses on examining the interaction between age and risk factors using data from multiple studies covering differing age ranges. The aim of my research is to use statistical methods to determine whether we can combine these diverse results into an overall summary from which one can determine how risk effects on CHD death change with age. One intuitive approach is classical meta-analysis based on generalized linear models. More specifically, one can fit a logistic model with CHD death as the response and age, a risk factor, and their interaction as covariates for each of the studies, and then conduct meta-analysis on each set of three coefficients in the multivariate setting to obtain 'synthesized' coefficients. Another aspect of the thesis is a new method, meta-analysis with respect to curves, that goes beyond linear models. The basic idea is that one can choose the same spline, with the same knots, on the covariates, say age and systolic blood pressure (SBP), for all the studies to ensure common basis functions. The knot-based tensor-product basis coefficients obtained from penalized logistic regression can then be used for multivariate meta-analysis. Using the common basis functions and the 'synthesized' knot-based basis coefficients from the meta-analysis, a two-dimensional smooth surface on the age-SBP domain is estimated. Cutting through the smooth surface along the two axes yields slices that show how the risk effect on CHD death changes at an arbitrary age, as well as how the age effect on CHD death changes at an arbitrary SBP value. The application to multiple studies is presented.
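The simplest version of the coefficient-combining step described above is a univariate inverse-variance fixed-effect meta-analysis; the thesis combines whole coefficient vectors, but the scalar sketch below conveys the weighting. The function name is illustrative.

```python
def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance fixed-effect pooling: each study's coefficient is
    weighted by 1/SE^2, so more precise studies count for more. Returns
    the pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = (1.0 / total) ** 0.5
    return pooled, pooled_se
```

With equal standard errors the pooled value is just the average; a study with twice the standard error receives a quarter of the weight, which is the behavior the multivariate version generalizes to vectors of spline coefficients.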
 Date Issued
 2010
 Identifier
 FSU_migr_etd1428
 Format
 Thesis
 Title
 Analysis of Multivariate Data with Random Cluster Size.
 Creator

Li, Xiaoyun, Sinha, Debajyoti, Zhou, Yi, McGee, Dan, Lipsitz, Stuart, Department of Statistics, Florida State University
 Abstract/Description

In this dissertation, we examine correlated binary data with present/absent components or with missing data that are related to the binary responses of interest. Depending on the data structure, correlated binary data can be referred to as clustered data, when the sampling unit is a cluster of subjects, or as longitudinal data, when they involve repeated measurements of the same subject over time. We propose novel models for these two data structures and illustrate them with real data applications. In biomedical studies involving clustered binary responses, the cluster size can vary because some components of a cluster can be absent. When both the presence of a cluster component and the binary disease status of a present component are treated as responses of interest, we propose a novel two-stage random-effects logistic regression framework. For ease of interpretation of the regression effects, both the marginal probability of presence/absence of a component and the conditional probability of the disease status of a present component preserve approximate logistic regression forms. We present a maximum likelihood method of estimation implementable using standard statistical software. We compare our models, and the physical interpretation of their regression effects, with competing methods from the literature. We also present a simulation study to assess the robustness of our procedure to misspecification of the random-effects distribution and to compare the finite-sample performance of the estimates with existing methods. The methodology is illustrated by analyzing a study of periodontal health status in a diabetic Gullah population. We extend this model to longitudinal studies with binary longitudinal responses and informative missing data. In longitudinal studies, when each subject is treated as a cluster, the cluster size is the total number of observations for that subject.
When data are informatively missing, the cluster size of each subject can vary and is related to the binary response of interest; the missingness mechanism is then itself of interest. This is a modified version of the clustered binary data setting with present/absent components. We adapt our proposed two-stage random-effects logistic regression model so that both the marginal probabilities of the binary response and the missingness indicator, as well as the corresponding conditional probabilities, preserve logistic regression forms. We present a Bayesian framework for this model and illustrate it on an AIDS data example.
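The data structure this abstract describes, clusters whose components may be absent and whose present components carry a binary disease status, can be simulated as below. The coefficients and cluster size are illustrative assumptions, not quantities from the dissertation, and the two logistic stages share a cluster-level random effect only as a sketch of the general idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_cluster(n_components, x, b_presence, b_disease, sigma_b):
    """Simulate one cluster under a two-stage random-effects logistic model.

    Stage 1: each component is present with a logistic probability sharing a
    cluster-level random effect b. Stage 2: present components receive a
    binary disease status from a second logistic model with the same random
    effect; absent components get NaN. All coefficients are illustrative.
    """
    b = rng.normal(0.0, sigma_b)  # cluster-level random effect
    present = rng.random(n_components) < expit(b_presence * x + b)
    disease = np.where(
        present, rng.random(n_components) < expit(b_disease * x + b), np.nan
    )
    return present, disease

# One hypothetical cluster of 28 components (e.g. tooth sites) with covariate x
present, disease = simulate_cluster(28, x=1.0, b_presence=0.5,
                                    b_disease=-0.3, sigma_b=1.0)
print(present.sum(), np.nansum(disease))
```

The varying number of present components across simulated clusters is exactly the random cluster size the dissertation models.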
 Date Issued
 2011
 Identifier
 FSU_migr_etd1425
 Format
 Thesis
 Title
 A Statistical Approach for Information Extraction of Biological Relationships.
 Creator

Bell, Lindsey R., Zhang, Jinfeng, Niu, Xufeng, Tyson, Gary, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Vast amounts of biomedical information are stored in the scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but this is time- and resource-consuming. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); then relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature. Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets, where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian networks, support vector machines, and a mixture of logistic models defined by interaction word. The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus introduced in this study. The evaluation includes cross-validation and cross-corpus validation to replicate an application scenario. The three classifiers are distinct, and we find that the performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
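One simple ensemble rule over the three base classifiers is majority voting, sketched below; the dissertation's actual combination strategy may differ, and the labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier labels for one triplet by majority vote.

    `predictions` holds one label per base classifier (e.g. Bayesian
    network, SVM, logistic mixture). Ties are broken in favor of the label
    seen first, an arbitrary but deterministic rule.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Hypothetical votes on whether an (entity, interaction, entity) triplet
# describes a true biological relationship
print(majority_vote(["true", "false", "true"]))
```

With three classifiers a majority always exists for binary labels, which is one practical reason ensembles of odd size are convenient.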
 Date Issued
 2011
 Identifier
 FSU_migr_etd1314
 Format
 Thesis
 Title
 Investigating the Use of Mortality Data as a Surrogate for Morbidity Data.
 Creator

Miller, Gregory, Hollander, Myles, McGee, Daniel, Hurt, Myra, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

We are interested in differences between risk models based on Coronary Heart Disease (CHD) incidence, or morbidity, and risk models based on CHD death. Risk models based on morbidity have been developed from the Framingham Heart Study, while the European SCORE project developed a risk model for CHD death. Our goal is to determine whether these two models differ in the treatment decisions they imply for patient heart health. We begin by reviewing recent metrics for surrogate variables and prognostic model performance. We then conduct bootstrap hypothesis tests between two Cox proportional hazards models fit to Framingham data, one with incidence as the response and one with death as the response, and find that the coefficients differ for the age covariate but show no significant differences for the other risk factors. To understand how surrogacy applies to our case, where the surrogate variable is nested within the true variable of interest, we examine models based on a composite event compared with models based on singleton events. We also conduct a simulation, generating times to CHD incidence and times from CHD incidence to CHD death, censoring at 25 years to represent the end of a study. We compare a Cox model with death as the response to a Cox model based on incidence using bootstrapped confidence intervals, and find differences in the coefficients for age and systolic blood pressure. We continue the simulation by using the Net Reclassification Index (NRI) to evaluate the treatment-decision performance of the two models, and find that they do not perform significantly differently in correctly classifying events when decisions are based on the risk ranks of the individuals. As long as the relative order of patients' risks is preserved across risk models, treatment decisions based on classifying a specified upper percentage as high risk will not differ significantly. We conclude the dissertation with remarks on future methods for approaching our question.
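The Net Reclassification Index used in this abstract has a simple closed form: among events, the proportion moved to a higher risk category minus the proportion moved lower, plus the reverse among non-events. A minimal sketch with hypothetical risk categories:

```python
def net_reclassification_index(old_cat, new_cat, event):
    """Net Reclassification Index comparing two risk models.

    old_cat/new_cat: risk categories (higher = riskier) assigned by each
    model; event: 1 if the subject had the event, else 0. Moving to a
    higher category counts as correct for events and incorrect for
    non-events, and vice versa for downward moves.
    """
    up_e = down_e = up_n = down_n = n_e = n_n = 0
    for o, n, e in zip(old_cat, new_cat, event):
        if e:
            n_e += 1
            up_e += n > o
            down_e += n < o
        else:
            n_n += 1
            up_n += n > o
            down_n += n < o
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n

# Hypothetical risk categories for six subjects under the two models
old = [0, 0, 1, 1, 0, 1]
new = [1, 0, 1, 0, 0, 0]
evt = [1, 1, 0, 0, 0, 1]
print(round(net_reclassification_index(old, new, evt), 3))
```

An NRI of zero is consistent with the finding above that the two models do not classify events significantly differently.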
 Date Issued
 2011
 Identifier
 FSU_migr_etd2408
 Format
 Thesis
 Title
 New Semiparametric Methods for Recurrent Events Data.
 Creator

Gu, Yu, Sinha, Debajyoti, Eberstein, Isaac W., McGee, Dan, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Recurrent events data arise in all areas of biomedical research. We present a model for recurrent events data with the same link for the intensity and mean functions. Simple interpretations of the covariate effects on both the intensity and mean functions lead to a better understanding of the covariate effects on the recurrent events process. We use partial likelihood and empirical Bayes methods for inference and provide theoretical justifications for, as well as relationships between, these methods. We also establish the asymptotic properties of the empirical Bayes estimators. We illustrate the computational convenience and implementation of our methods with the analysis of a heart transplant study. We also propose an additive regression model and an associated empirical Bayes method for the risk of a new event given the history of the recurrent events. Both the cumulative mean and rate functions have closed-form expressions under our model. Our inference method for the semiparametric model is based on maximizing a finite-dimensional integrated likelihood obtained by integrating over the nonparametric cumulative baseline hazard function. Our method can accommodate time-varying covariates and is computationally easier to implement than full Bayes methods based on iterative algorithms. The asymptotic properties of our estimates give large-sample justifications from a frequentist standpoint. We apply our method to a study of heart transplant patients to illustrate the computational convenience and other advantages of our method.
 Date Issued
 2011
 Identifier
 FSU_migr_etd3941
 Format
 Thesis
 Title
 Statistical Modelling and Applications of Neural Spike Trains.
 Creator

Lawhern, Vernon, Wu, Wei, Contreras, Robert J., Srivastava, Anuj, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

In this thesis we investigate statistical modelling of neural activity in the brain. We first develop a framework extending the state-space Generalized Linear Model (GLM) of Eden and colleagues [20] to include the effects of hidden states. These states collectively represent variables that are not observed (or even observable) in the modeling process but can nonetheless affect the neural activity. We then develop a framework that allows us to input a priori target information into the model. We examine both modelling frameworks on motor cortex data recorded from monkeys performing different target-driven hand and arm movement tasks. Finally, we perform temporal coding analysis of sensory stimulation using principled statistical models and show the efficacy of our approach.
 Date Issued
 2011
 Identifier
 FSU_migr_etd3251
 Format
 Thesis
 Title
 Bayesian Portfolio Optimization with Time-Varying Factor Models.
 Creator

Zhao, Feng, Niu, Xufeng, Cheng, Yingmei, Huffer, Fred W., Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

We develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the risk-free rate ("risk premia"). Both firm-level characteristics and macroeconomic variables are used to predict stocks' time-varying alphas and betas, and macroeconomic variables are used to predict the risk premia. All of the models are specified in a Bayesian framework to account for estimation risk, and informative prior distributions on both stock returns and model parameters are adopted to reduce estimation error. To gauge the economic significance of the predictability, we apply the models to the U.S. stock market and construct optimal portfolios based on model predictions. Out-of-sample performance of the portfolios is evaluated to compare the models. The empirical results confirm predictability from all of the sources considered in our model: (1) the equity risk premium is time-varying and predictable using macroeconomic variables; (2) stocks' alphas and betas differ cross-sectionally and are predictable using firm-level characteristics; and (3) stocks' alphas and betas are also time-varying and predictable using macroeconomic variables. Comparison of different subperiods shows that the predictability of stocks' betas is persistent over time, but the predictability of stocks' alphas and of the risk premium has diminished to some extent. The empirical results also suggest that Bayesian statistical techniques, especially the use of informative prior distributions, help reduce model estimation error and yield portfolios that outperform a passive indexing strategy. The findings are robust in the presence of transaction costs.
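The role an informative prior plays in reducing estimation error can be sketched with a conjugate normal-normal model for a single asset's expected return; the model and all numbers below are illustrative assumptions, not the dissertation's specification.

```python
def posterior_mean_return(sample_mean, sample_var, n, prior_mean, prior_var):
    """Posterior mean of an expected return under a normal-normal model.

    An informative prior N(prior_mean, prior_var) shrinks the noisy sample
    mean toward the prior; the posterior mean is a precision-weighted
    average of the two. Numbers used here are purely illustrative.
    """
    data_precision = n / sample_var       # precision of the sample mean
    prior_precision = 1.0 / prior_var
    post_var = 1.0 / (data_precision + prior_precision)
    return post_var * (data_precision * sample_mean
                       + prior_precision * prior_mean)

# Monthly returns: noisy sample mean of 1.5%, prior centered at 0.8%
mu = posterior_mean_return(sample_mean=0.015, sample_var=0.05**2, n=60,
                           prior_mean=0.008, prior_var=0.005**2)
print(round(mu, 6))
```

The posterior mean lands between the sample mean and the prior mean, which is the shrinkage that stabilizes portfolio weights built from noisy return estimates.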
 Date Issued
 2011
 Identifier
 FSU_migr_etd0526
 Format
 Thesis
 Title
 Individual Patient-Level Data Meta-Analysis: A Comparison of Methods for the Diverse Populations Collaboration Data Set.
 Creator

Dutton, Matthew Thomas, McGee, Daniel, Becker, Betsy, Niu, Xufeng, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

DerSimonian and Laird define meta-analysis as "the statistical analysis of a collection of analytic results for the purpose of integrating their findings." One alternative to classical meta-analytic approaches is known as Individual Patient-Level Data (IPD) meta-analysis. Rather than depending on summary statistics calculated for individual studies, IPD meta-analysis analyzes the complete data from all included studies. Two potential approaches to incorporating IPD into the meta-analytic framework are investigated. A two-stage analysis is conducted first, in which individual models are fit for each study and summarized using classical meta-analysis procedures. Second, a one-stage approach that models all of the data in a single model and summarizes the information across studies is investigated. Data from the Diverse Populations Collaboration (DPC) data set are used to investigate the differences between these two methods in a specific example. The bootstrap procedure is used to determine whether the two methods produce statistically different results in the DPC example. Finally, a simulation study is conducted to investigate the accuracy of each method under given scenarios.
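The contrast between the two approaches can be sketched on a toy example. For simplicity the sketch pools a continuous outcome's mean rather than fitting the logistic models used with the DPC data, and all numbers are hypothetical.

```python
import numpy as np

# Hypothetical individual patient data from three studies
studies = [np.array([1.0, 2.0, 3.0]),
           np.array([2.0, 2.5]),
           np.array([0.5, 1.5, 2.5, 3.5])]

# Two-stage: summarize each study first, then inverse-variance pool
means = [s.mean() for s in studies]
variances = [s.var(ddof=1) / len(s) for s in studies]  # var of each mean
w = [1.0 / v for v in variances]
two_stage = sum(wi * mi for wi, mi in zip(w, means)) / sum(w)

# One-stage: model all individual observations at once (here a pooled
# mean; a real analysis would fit one model with study effects)
one_stage = np.concatenate(studies).mean()

print(round(two_stage, 4), round(one_stage, 4))
```

The two estimates generally differ because the two-stage analysis weights studies by the precision of their summaries, while the one-stage analysis weights every patient equally unless study effects are modeled explicitly.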
 Date Issued
 2011
 Identifier
 FSU_migr_etd0620
 Format
 Thesis
 Title
 A Riemannian Framework for Annotated Curves Analysis.
 Creator

Liu, Wei, Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric P., Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

We propose a Riemannian framework for shape analysis of annotated curves: curves that have certain attributes defined along them in addition to their geometries. These attributes may take the form of vector-valued functions, discrete landmarks, or symbolic labels, and they provide auxiliary information along the curves. The resulting shape analysis (comparing, matching, and deforming) is naturally influenced by the auxiliary functions. Our idea is to construct curves in higher dimensions using both geometric and auxiliary coordinates, and to analyze the shapes of these curves. The difficulty comes from the need to remove different transformation groups from different components: the shape is invariant to rigid motion, global scale, and reparameterization, while the auxiliary component is usually invariant only to reparameterization. Thus, the removal of some transformations (rigid motion and global scale) is restricted to the geometric coordinates, while the reparameterization group is removed from all coordinates. We demonstrate this framework using a number of experiments.
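The idea of removing a transformation group from only some coordinates can be sketched with rigid alignment: an orthogonal Procrustes rotation is applied to the geometric columns of an annotated curve while the auxiliary columns pass through untouched. This sketch assumes 2-D geometry plus one auxiliary column and ignores scale and reparameterization, which the full framework also handles.

```python
import numpy as np

def align_annotated(curve_a, curve_b, geom_dims=2):
    """Rigidly align curve_b's geometric coordinates to curve_a's.

    Each curve is an (n, d) array whose first `geom_dims` columns are x,y
    positions; remaining columns are auxiliary attributes. The rotation
    (orthogonal Procrustes via SVD) acts only on the geometric block, so
    auxiliary values are preserved exactly.
    """
    ga = curve_a[:, :geom_dims] - curve_a[:, :geom_dims].mean(axis=0)
    gb = curve_b[:, :geom_dims] - curve_b[:, :geom_dims].mean(axis=0)
    u, _, vt = np.linalg.svd(gb.T @ ga)   # orthogonal Procrustes solution
    rot = u @ vt                          # may be a reflection in degenerate cases
    aligned = curve_b.copy()
    aligned[:, :geom_dims] = gb @ rot
    return aligned

# curve_b is curve_a's geometry rotated by 0.7 rad; auxiliary column is shared
theta = 0.7
q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
geom = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
aux = np.array([[0.1], [0.2], [0.3], [0.4]])
curve_a = np.hstack([geom, aux])
curve_b = np.hstack([geom @ q, aux])
aligned = align_annotated(curve_a, curve_b)
```

After alignment the geometric block matches curve_a's centered geometry while the auxiliary column is identical to curve_b's, mirroring the asymmetric invariances described above.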
 Date Issued
 2011
 Identifier
 FSU_migr_etd4997
 Format
 Thesis
 Title
 A Novel Riemannian Metric for Analyzing Spherical Functions with Applications to HARDI Data.
 Creator

Ncube, Sentibaleng, Srivastava, Anuj, Klassen, Eric, Wu, Wei, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

We propose a novel Riemannian framework for analyzing orientation distribution functions (ODFs), or their probability density functions (PDFs), in HARDI data sets, for use in comparing, interpolating, averaging, and denoising PDFs. This is accomplished by separating the shape and orientation features of PDFs and then analyzing them separately under their own Riemannian metrics. We formulate the action of the rotation group on the space of PDFs and define the shape space as the quotient space of PDFs modulo rotations. In other words, any two PDFs are compared: (1) in shape, by rotationally aligning one PDF to the other and using the Fisher-Rao distance on the aligned PDFs, and (2) in orientation, by comparing their rotation matrices. This idea improves upon results obtained by applying the Fisher-Rao metric to PDFs directly, a technique that is being used increasingly, and it leads to geodesic interpolations that are biologically feasible. The framework yields definitions and efficient computations for the Karcher mean, providing tools for improved interpolation and denoising. We demonstrate these ideas using an experimental setup involving several PDFs.
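The Fisher-Rao distance used in this abstract has a convenient form for discrete PDFs: under the square-root map, densities lie on the unit sphere and the distance (up to a constant factor) is the arc length between their square roots. A minimal sketch, leaving out the rotational alignment step that precedes it in the paper:

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete PDFs.

    Under the square-root map psi = sqrt(p), PDFs sit on the unit sphere,
    and the Fisher-Rao distance (up to a constant factor) is the angle
    between the two square-root densities.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    inner = np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)  # guard rounding
    return float(np.arccos(inner))

# Two hypothetical discrete PDFs on three bins
p = [0.2, 0.3, 0.5]
q = [0.25, 0.25, 0.5]
print(round(fisher_rao_distance(p, q), 4))
```

Identical PDFs are at distance zero, and the distance grows as the distributions diverge, which is what makes it usable for comparing aligned PDFs and computing Karcher means.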
 Date Issued
 2011
 Identifier
 FSU_migr_etd5064
 Format
 Thesis