 Title
 Tests for equivalence of two survival functions: Alternative to the tests under proportional hazards.
 Creator

Martinez, Elvis E, Sinha, Debajyoti, Wang, Wenting, Lipsitz, Stuart R, Chappell, Richard J
 Abstract/Description

For either the equivalence trial or the noninferiority trial with survivor outcomes from two treatment groups, the most popular testing procedure is the extension (e.g., Wellek, A log-rank test for equivalence of two survivor functions, Biometrics, 1993; 49: 877-881) of the log-rank based test under the proportional hazards model. We show that the actual type I error rate for the popular procedure of Wellek is higher than the intended nominal rate when survival responses from two treatment arms satisfy the proportional odds survival model. When the true model is the proportional odds survival model, we show that the hypothesis of equivalence of two survival functions can be formulated as a statistical hypothesis involving only the survival odds ratio parameter. We further show that our new equivalence test, formulation, and related procedures are applicable even in the presence of additional covariates beyond treatment arms, and the associated equivalence test procedures have correct type I error rates under the proportional hazards model as well as the proportional odds survival model. These results show that use of our test will be a safer statistical practice for equivalence trials of survival responses than the commonly used log-rank based tests.
Date Issued
 2017-02-01
 Identifier
 FSU_pmch_24925887, 10.1177/0962280214539282, PMC5557049, 24925887, 24925887, 0962280214539282
 Format
 Citation
 Title
 Approximate median regression for complex survey data with skewed response.
 Creator

Fraser, Raphael André, Lipsitz, Stuart R, Sinha, Debajyoti, Fitzmaurice, Garrett M, Pan, Yi
 Abstract/Description

The ready availability of public-use data from various large national complex surveys has immense potential for the assessment of population characteristics using regression models. Complex surveys can be used to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or utilizing resampling methods are often not valid with survey data due to complex survey design features, that is, stratification, multistage sampling, and weighting. In this article, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a double-transform-both-sides (DTBS) based estimating equations approach to estimate the median regression parameters of the highly skewed response; the DTBS approach applies the same Box-Cox type transformation twice to both the outcome and regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudolikelihood based on minimizing absolute deviations (MAD). Furthermore, the approach is relatively robust to the true underlying distribution, and has much smaller mean square error than a MAD approach. The method is motivated by an analysis of laboratory data on urinary iodine (UI) concentration from the National Health and Nutrition Examination Survey.
Date Issued
 2016-12-01
 Identifier
 FSU_pmch_27062562, 10.1111/biom.12517, PMC5055849, 27062562, 27062562
 Format
 Citation
 Title
 Exact Bayesian p-values for a test of independence in a 2 × 2 contingency table with missing data.
 Creator

Lin, Yan, Lipsitz, Stuart R, Sinha, Debajyoti, Fitzmaurice, Garrett, Lipshultz, Steven
 Abstract/Description

Altham (Altham PME. Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher's "exact" significance test. J R Stat Soc B 1969; 31: 261-269) showed that a one-sided p-value from Fisher's exact test of independence in a 2 × 2 contingency table is equal to the posterior probability of negative association in the 2 × 2 contingency table under a Bayesian analysis using an improper prior. We derive an extension of Fisher's exact test p-value in the presence of missing data, assuming the missing data mechanism is ignorable (i.e., missing at random or completely at random). Further, we propose Bayesian p-values for a test of independence in a 2 × 2 contingency table with missing data using alternative priors; we also present results from a simulation study exploring the Type I error rate and power of the proposed exact test p-values. An example, using data on the association between blood pressure and a cardiac enzyme, is presented to illustrate the methods.
Date Issued
 2018-11-01
 Identifier
 FSU_pmch_28633606, 10.1177/0962280217702538, PMC5799034, 28633606, 28633606
 Format
 Citation
 Title
 One-Step Generalized Estimating Equations with Large Cluster Sizes.
 Creator

Lipsitz, Stuart, Fitzmaurice, Garrett, Sinha, Debajyoti, Hevelone, Nathanael, Hu, Jim, Nguyen, Louis L
 Abstract/Description

Medical studies increasingly involve a large sample of independent clusters, where the cluster sizes are also large. Our motivating example from the 2010 Nationwide Inpatient Sample (NIS) has 8,001,068 patients and 1049 clusters, with an average cluster size of 7627. Consistent parameter estimates can be obtained by naively assuming independence, but these are inefficient when the intracluster correlation (ICC) is high. Efficient generalized estimating equations (GEE) incorporate the ICC and sum all pairs of observations within a cluster when estimating the ICC. For the 2010 NIS, there are 92.6 billion pairs of observations, making summation of pairs computationally prohibitive. We propose a one-step GEE estimator that 1) matches the asymptotic efficiency of the fully iterated GEE; 2) uses a simpler formula to estimate the ICC that avoids summing over all pairs; and 3) completely avoids matrix multiplications and inversions. These three features make the proposed estimator much less computationally intensive, especially with large cluster sizes. A unique contribution of this paper is that it expresses the GEE estimating equations incorporating the ICC as a simple sum of vectors and scalars.
Date Issued
 2017-01-01
 Identifier
 FSU_pmch_29422762, 10.1080/10618600.2017.1321552, PMC5800532, 29422762, 29422762
 Format
 Citation
 Title
 Bias-corrected estimates for logistic regression models for complex surveys with application to the United States' Nationwide Inpatient Sample.
 Creator

Rader, Kevin A, Lipsitz, Stuart R, Fitzmaurice, Garrett M, Harrington, David P, Parzen, Michael, Sinha, Debajyoti
 Abstract/Description

For complex surveys with a binary outcome, logistic regression is widely used to model the outcome as a function of covariates. Complex survey sampling designs are typically stratified cluster samples, but consistent and asymptotically unbiased estimates of the logistic regression parameters can be obtained using weighted estimating equations (WEEs) under the naive assumption that subjects within a cluster are independent. Despite the relatively large samples typical of many complex surveys, with rare outcomes, many interaction terms, or analysis of subgroups, the logistic regression parameter estimates from WEEs can be markedly biased, just as with independent samples. In this paper, we propose bias-corrected WEEs for complex survey data. The proposed method is motivated by a study of postoperative complications in laparoscopic cystectomy, using data from the 2009 United States' Nationwide Inpatient Sample complex survey of hospitals.
Date Issued
 2017-10-01
 Identifier
 FSU_pmch_26265769, 10.1177/0962280215596550, PMC5799008, 26265769, 26265769, 0962280215596550
 Format
 Citation
 Title
 Efficient Computation of Reduced Regression Models.
 Creator

Lipsitz, Stuart R, Fitzmaurice, Garrett M, Sinha, Debajyoti, Hevelone, Nathanael, Giovannucci, Edward, Trinh, Quoc-Dien, Hu, Jim C
 Abstract/Description

We consider settings where it is of interest to fit and assess regression submodels that arise as various explanatory variables are excluded from a larger regression model. The larger model is referred to as the full model; the submodels are the reduced models. We show that a computationally efficient approximation to the regression estimates under any reduced model can be obtained from a simple weighted least squares (WLS) approach based on the estimated regression parameters and covariance matrix from the full model. This WLS approach can be considered an extension to unbiased estimating equations of a first-order Taylor series approach proposed by Lawless and Singhal. Using data from the 2010 Nationwide Inpatient Sample (NIS), a 20% weighted, stratified, cluster sample of approximately 8 million hospital stays from approximately 1000 hospitals, we illustrate the WLS approach when fitting interval censored regression models to estimate the effect of type of surgery (robotic versus nonrobotic surgery) on hospital length-of-stay while adjusting for three sets of covariates: patient-level characteristics, hospital characteristics, and zip-code level characteristics. Ordinarily, standard fitting of the reduced models to the NIS data takes approximately 10 hours; using the proposed WLS approach, the reduced models take seconds to fit.
Date Issued
 2017-01-01
 Identifier
 FSU_pmch_29104296, 10.1080/00031305.2017.1296375, PMC5664962, 29104296, 29104296
 Format
 Citation
 Title
 Age Effects in the Extinction of Planktonic Foraminifera: A New Look at Van Valen's Red Queen Hypothesis.
 Creator

Wiltshire, Jelani, Huffer, Fred, Parker, William, Chicken, Eric, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Van Valen's Red Queen hypothesis states that within a homogeneous taxonomic group the age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis being addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic Foraminifera (Foram) extinction data include Weibull and Exponential modeling (Parker and Arnold, 1997), and Cox proportional hazards modeling (Doran et al. 2004, 2006). I propose a general class of test statistics that can be used to test for the effect of age on extinction. These test statistics allow for a varying background rate of extinction and attempt to remove the effects of other covariates when assessing the effect of age on extinction. No model is assumed for the covariate effects. Instead I control for covariate effects by pairing or grouping together similar species. I use simulated data sets to compare the power of the statistics. In applying the test statistics to the Foram data, I have found age to have a positive effect on extinction.
Date Issued
 2010
 Identifier
 FSU_migr_etd0952
 Format
 Thesis
 Title
 Association Models for Clustered Data with Binary and Continuous Responses.
 Creator

Lin, Lanjia, Sinha, Debajyoti, Hurt, Myra, Lipsitz, Stuart R., McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

This dissertation develops novel single random effect models as well as a bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses. For ease of interpretation of the regression effects, the random effect of the binary response has a bridge distribution, so that the marginal model of the mean of the binary response after integrating out the random effect preserves the logistic form, and the marginal regression function of the continuous response preserves the linear form. Within-cluster and within-subject associations can be measured by our proposed models. For the bivariate correlated random effects model, we illustrate how different levels of the association between two random effects induce different Kendall's tau values for association between the binary and continuous responses from the same cluster. Fully parametric and semiparametric Bayesian methods as well as the maximum likelihood method are illustrated for model analysis. In the semiparametric Bayesian model, the normality assumption of the regression error for the continuous response is relaxed by using a nonparametric Dirichlet Process prior. Robustness of the bivariate correlated random effects model using the ML method to misspecifications of the regression function as well as the random effect distribution is investigated by simulation studies. The Bayesian and likelihood methods are applied to a developmental toxicity study of ethylene glycol in mice.
Date Issued
 2009
 Identifier
 FSU_migr_etd1330
 Format
 Thesis
 Title
 Time Scales in Epidemiological Analysis.
 Creator

Chalise, Prabhakar, McGee, Daniel L., Chicken, Eric, Carlson, Elwood, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

The Cox proportional hazards model is routinely used to determine the time until an event of interest. Two time scales are used in practice: follow-up time and chronological age. The former is the most frequently used time scale both in clinical studies and longitudinal observational studies. However, there is no general consensus about which time scale is the best. In recent years, papers have appeared arguing for using chronological age as the time scale either with or without adjusting for the entry age. Also, it has been asserted that if the cumulative baseline hazard is exponential or if the age-at-entry is independent of the covariate, the two models are equivalent. Our studies do not satisfy these two conditions in general. We found that the true factor that makes the models perform significantly differently is the variability in the age-at-entry. If there is no variability in the entry age, time scales do not matter and both models estimate exactly the same coefficients. As the variability increases, the models disagree with each other. We also computed the optimum time scale proposed by Oakes and utilized it for the Cox model. Both our empirical and simulation studies show that the follow-up time scale model using age at entry as a covariate is better than the chronological age and Oakes time scale models. This finding is illustrated with two examples with data from the Diverse Population Collaboration. Based on our findings, we recommend using follow-up time as a time scale for epidemiological analysis.
Date Issued
 2009
 Identifier
 FSU_migr_etd3933
 Format
 Thesis
 Title
 New Semiparametric Methods for Recurrent Events Data.
 Creator

Gu, Yu, Sinha, Debajyoti, Eberstein, Isaac W., McGee, Dan, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Recurrent events data are rising in all areas of biomedical research. We present a model for recurrent events data with the same link for the intensity and mean functions. Simple interpretations of the covariate effects on both the intensity and mean functions lead to a better understanding of the covariate effects on the recurrent events process. We use partial likelihood and empirical Bayes methods for inference and provide theoretical justifications as well as relationships between these methods. We also show the asymptotic properties of the empirical Bayes estimators. We illustrate the computational convenience and implementation of our methods with the analysis of a heart transplant study. We also propose an additive regression model and associated empirical Bayes method for the risk of a new event given the history of the recurrent events. Both the cumulative mean and rate functions have closed form expressions for our model. Our inference method for the semiparametric model is based on maximizing a finite dimensional integrated likelihood obtained by integrating over the nonparametric cumulative baseline hazard function. Our method can accommodate time-varying covariates and is easier to implement computationally than iterative algorithm based full Bayes methods. The asymptotic properties of our estimates give the large-sample justifications from a frequentist standpoint. We apply our method to a study of heart transplant patients to illustrate the computational convenience and other advantages of our method.
Date Issued
 2011
 Identifier
 FSU_migr_etd3941
 Format
 Thesis
 Title
 The Relationship of Diabetes to Coronary Heart Disease Mortality: A Meta-Analysis Based on Person-Level Data.
 Creator

Williams, Felicia Gray, McGee, Daniel, Hurt, Myra, Pati, Debdeep, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Studies have suggested that diabetes is a stronger risk factor for coronary heart disease (CHD) in women than in men. We present a meta-analysis of person-level data from 42 cohort studies in which diabetes, CHD mortality and potential confounders were available and a minimum of 75 CHD deaths occurred. These studies followed up 77,863 men and 84,671 women aged 42 to 73 years on average from the US, Denmark, Iceland, Norway and the UK. Individual study prevalence rates of self-reported diabetes mellitus at baseline ranged between less than 1% in the youngest cohort and 15.7% (males) and 11.1% (females) in the NHLBI CHS study of the elderly. CHD death rates varied between 2% and 20%. A meta-analysis was performed in order to calculate overall hazard ratios (HR) of CHD mortality among diabetics compared to nondiabetics using Cox Proportional Hazard models. The random-effects HR associated with baseline diabetes and adjusted for age was significantly higher for females 2.65 (95% CI: 2.34, 2.96) than for males 2.33 (95% CI: 2.07, 2.58) (p=0.004). These estimates were similar to the random-effects estimates adjusted additionally for serum cholesterol, systolic blood pressure, and current smoking status: females 2.69 (95% CI: 2.35, 3.03) and males 2.32 (95% CI: 2.05, 2.59). They also agree closely with estimates (odds ratios of 2.9 for females and 2.3 for males) obtained in a recent meta-analysis of 50 studies of both fatal and nonfatal CHD but not based on person-level data. This evidence suggests that diabetes diminishes the female advantage. An additional analysis was performed on race. Only 14 cohorts were analyzed in this meta-analysis. This analysis showed no significant difference between the black and white cohorts before (p=0.68) or after adjustment for the major CHD risk factors (p=0.88). The limited number of studies used may lack the power to detect any differences.
Date Issued
 2013
 Identifier
 FSU_migr_etd7662
 Format
 Thesis
 Title
 Artificial Prediction Markets for Classification, Regression and Density Estimation.
 Creator

Lay, Nathan, Barbu, Adrian, Meyer-Baese, Anke, Sinha, Debajyoti, Ming, Ye, Wang, Xiaoqiang, Department of Scientific Computing, Florida State University
 Abstract/Description

Prediction markets are forums of trade where contracts on the future outcomes of events are bought and sold. These contracts reward buyers based on correct predictions and thus give incentive to make accurate predictions. Prediction markets have successfully predicted the outcomes of sporting events, elections, scientific hypotheses, foreign affairs, etc., and have repeatedly demonstrated themselves to be more accurate than individual experts or polling [2]. Since prediction markets are aggregation mechanisms, they have garnered interest in the machine learning community. Artificial prediction markets have been successfully used to solve classification problems [34, 33]. This dissertation explores the underlying optimization problem in the classification market, as presented in [34, 33], proves that it is related to maximum log likelihood, relates the classification market to existing machine learning methods and further extends the idea to regression and density estimation. In addition, the results of empirical experiments are presented on a variety of UCI [25], LIAAD [49] and synthetic data to demonstrate the probability accuracy, prediction accuracy as compared to Random Forest [9] and Implicit Online Learning [32], and the loss function.
Date Issued
 2013
 Identifier
 FSU_migr_etd7461
 Format
 Thesis
 Title
 Nonparametric Nonstationary Density Estimation Including Upper Control Limit Methods for Detecting Change Points.
 Creator

Becvarik, Rachel A., Chicken, Eric, Liu, Guosheng, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

Nonstationary nonparametric densities occur naturally in applications such as monitoring the amount of toxins in the air and monitoring internet streaming data. Progress has been made in estimating these densities, but there is little current work on monitoring them for changes. A new statistic is proposed which effectively monitors these nonstationary nonparametric densities through the use of transformed wavelet coefficients of the quantiles. This method is completely nonparametric and relies on no particular distributional assumptions, thus making it effective in a variety of conditions. Existing methods for monitoring sequential data typically focus on using a single-value upper control limit (UCL) based on a specified in-control average run length (ARL) to detect changes in these nonstationary statistics. However, such a UCL is not designed to take into consideration the false alarm rate, the power associated with the test or the underlying distribution of the ARL. Additionally, if the monitoring statistic is known to be monotonic over time (which is typical in methods using maxima in their statistics, for example) the flat UCL does not adjust to this property. We propose several methods for creating UCLs that provide improved power and simultaneously adjust the false alarm rate to user-specified values. Our methods are constructive in nature, making no use of assumed distribution properties of the underlying monitoring statistic. We evaluate the different proposed UCLs through simulations to illustrate the improvements over current UCLs. The proposed method is evaluated with respect to profile monitoring scenarios and the proposed density statistic. The method is applicable for monitoring any monotonically nondecreasing nonstationary statistics.
Date Issued
 2013
 Identifier
 FSU_migr_etd7292
 Format
 Thesis
 Title
 Theories on Group Variable Selection in Multivariate Regression Models.
 Creator

Ha, Seung-Yeon, She, Yiyuan, Okten, Giray, Huffer, Fred, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

We study group variable selection in the multivariate regression model. Group variable selection is equivalent to selecting the nonzero rows of the coefficient matrix: since there are multiple response variables, if one predictor is irrelevant to estimation then the corresponding row must be zero. In the high dimensional setup, shrinkage estimation methods are applicable and guarantee smaller MSE than OLS according to the James-Stein phenomenon (1961). As one class of shrinkage methods, we study penalized least squares estimation for group variable selection. Among these, we study L0 regularization and L0 + L2 regularization with the purpose of obtaining accurate prediction and consistent feature selection, and use the corresponding computational procedures Hard TISP and Hard-Ridge TISP (She, 2009) to solve the numerical difficulties. These regularization methods show better performance both on prediction and selection than Lasso (L1 regularization), which is one of the popular penalized least squares methods. L0 achieves the same optimal rate of prediction loss and estimation loss as Lasso, but it requires no restriction on the design matrix or sparsity for controlling the prediction error and a more relaxed condition than Lasso for controlling the estimation error. Also, for selection consistency, it requires a much more relaxed incoherence condition, which concerns the correlation between the relevant subset and the irrelevant subset of predictors. Therefore L0 can work better than Lasso both on prediction and sparsity recovery in practical cases where correlation is high or sparsity is not low. We study another method, L0 + L2 regularization, which uses the combined penalty of L0 and L2. For the corresponding procedure Hard-Ridge TISP, two parameters work independently for selection and shrinkage (to enhance prediction) respectively, and therefore it gives better performance in some cases (such as low signal strength) than L0 regularization.
For L0 regularization, λ works for selection but it is tuned in terms of prediction accuracy. L0 + L2 regularization gives the optimal rate of prediction and estimation errors without any restriction, when the coefficient of the L2 penalty is appropriately assigned. Furthermore, it can achieve a better rate of estimation error with an ideal choice of blockwise weight for the L2 penalty.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7404
 Format
 Thesis
 Title
 A Class of Semiparametric Volatility Models with Applications to Financial Time Series.
 Creator

Chung, Steve S., Niu, XuFeng, Gallivan, Kyle, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

The autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models account for the dependency of the conditional second moments. The idea behind the ARCH/GARCH model is quite intuitive. For ARCH models, past squared innovations describe the present squared volatility. For GARCH models, both the past squared innovations and the past squared volatilities define the present volatility. Since their introduction, they have been extensively studied and well documented in the financial and econometric literature, and many variants of ARCH/GARCH models have been proposed. To list a few, these include exponential GARCH (EGARCH), GJR-GARCH (or threshold GARCH), integrated GARCH (IGARCH), quadratic GARCH (QGARCH), and fractionally integrated GARCH (FIGARCH). The ARCH/GARCH models and their variants have gained a lot of attention and are still a popular choice for modeling volatility. Despite their popularity, they suffer from a lack of model flexibility. Volatility is a latent variable and hence, putting a specific model structure violates this latency assumption. Recently, several attempts have been made to ease the strict structural assumptions on volatility. Both nonparametric and semiparametric volatility models have been proposed in the literature. We review and discuss these modeling techniques in detail. In this dissertation, we propose a class of semiparametric multiplicative volatility models. We define the volatility as a product of parametric and nonparametric parts. Due to the positivity restriction, we take the log and square transformations on the volatility. We assume that the parametric part is GARCH(1,1), and it serves as an initial guess for the volatility. We estimate the GARCH(1,1) parameters by the conditional likelihood method. The nonparametric part assumes an additive structure. There may be some loss of interpretability from assuming an additive structure, but we gain flexibility. 
Each additive part is constructed from a sieve of Bernstein basis polynomials. The nonparametric component acts as an improvement for the parametric component. The model is estimated by an iterative algorithm based on boosting. We modified the boosting algorithm (the one given in Friedman 2001) so that it uses a penalized least squares method. We tried three different penalty functions: LASSO, ridge, and elastic net. We found that, in our simulations and application, the ridge penalty worked best. Our semiparametric multiplicative volatility model is evaluated using simulations and applied to six major exchange rates and the S&P 500 index. The results show that the proposed model outperforms the existing volatility models in both in-sample estimation and out-of-sample prediction.
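As a rough illustration of the GARCH(1,1) component that the abstract uses as an initial guess for the volatility, the sketch below computes the conditional variance recursion. The function name, the parameter names (omega, alpha, beta) and the choice of starting value are assumptions made for illustration and are not taken from the dissertation; the nonparametric Bernstein polynomial and boosting part is not shown.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances under a GARCH(1,1) recursion.

    sigma2[t] = omega + alpha * returns[t-1]**2 + beta * sigma2[t-1]

    All names here are hypothetical; this is a sketch, not the
    dissertation's estimation code.
    """
    if not returns:
        return []
    # Assumed starting value: the unconditional variance when the
    # process is covariance stationary, otherwise omega.
    if alpha + beta < 1:
        sigma2 = [omega / (1.0 - alpha - beta)]
    else:
        sigma2 = [omega]
    # Each new variance depends on the previous squared innovation
    # and the previous variance.
    for r_prev in returns[:-1]:
        sigma2.append(omega + alpha * r_prev ** 2 + beta * sigma2[-1])
    return sigma2
```

With omega = 0.1, alpha = 0.1 and beta = 0.8, a large innovation raises the next period's variance, and the effect then decays at rate beta.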
 Date Issued
 2014
 Identifier
 FSU_migr_etd8756
 Format
 Thesis
 Title
 The Risk of Lipids on Coronary Heart Disease: Prognostic Models and Meta-Analysis.
 Creator

Almansour, Aseel, McGee, Daniel, Flynn, Heather, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Prognostic models are widely used in medicine to estimate particular patients' risk of developing disease. Numerous prognostic models have been developed for predicting cardiovascular disease, including those by Wilson et al. using the Framingham Study[17], by Assmann et al. using the Procam study[22] and by Conroy et al.[33] using a pool of European cohorts. The prognostic models developed by these researchers differed in their approach to estimating risk but all included one or more of the lipid determinations: total cholesterol (TC), low density lipoproteins (LDL), high density lipoproteins (HDL), or the ratios TC/HDL and LDL/HDL. None of these researchers included both LDL and TC in the same model due to the high correlation between these measurements. In this thesis we examine some questions about the inclusion of lipid determinations in prognostic models: Can the effect of LDL and TC on the risk of dying from CHD be differentiated? If one measure is demonstrably stronger than the other, then a single model using that variable would be considered advantageous. Is it possible to derive a single measure from TC and LDL that is a stronger predictor than either measure? If so, then a new summarization of the lipid measurements should be used in prognostic modeling. Does the addition of HDL to a prognostic model improve the predictive accuracy of the model? If it does, then this determination, which is almost universally measured, should be used when developing prognostic models. We use data from nine independent studies to examine these issues. The studies were chosen because they include longitudinal follow-up of participants and included lipid determinations in the baseline examination of participants. There are many methodologies available for developing prognostic models, including logistic regression and the proportional hazards model. 
We used the proportional hazards model since we have follow-up times and times to death from CHD on all of the participants in the included studies. We summarized our results using a meta-analytic approach. Using the meta-analytic approach, we addressed the additional question of whether the results vary significantly among the different studies and also whether adding additional characteristics to the prognostic models changes the estimated effect of the lipid determinations. All of our results are presented stratified by gender and, when appropriate, by race. Finally, because our studies were not selected randomly, we also examined whether there is evidence of bias in our meta-analyses. For this examination we used funnel plots with related methodology for testing whether there is evidence of bias in the results.
 Date Issued
 2014
 Identifier
 FSU_migr_etd8724
 Format
 Thesis
 Title
 Failure Time Regression Models for Thinned Point Processes.
 Creator

Holden, Robert T., Huffer, Fred G., Nichols, Warren, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

In survival analysis, data on the time until a specific criterion event (or "endpoint") occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the predictors on that distribution. Suppose that the criterion event isn't a terminal event that can only occur once, but is a repeatable event. The sequence of events forms a stochastic point process. Further suppose that only some of the events are detected (observed); the detected events form a thinned point process. Any failure time model based on the data will be based not on the time until the first occurrence, but on the time until the first detected occurrence of the event. The implications of estimating survival regression models from such incomplete data will be analyzed. It will be shown that the effect of thinning on regression parameters depends on the combination of the type of regression model, the type of point process that generates the events, and the thinning mechanism. For some combinations, the effect of a predictor will be the same for time to the first event and the time to the first detected event. For other combinations, the regression effect will be changed as a result of the incomplete detection.
 Date Issued
 2013
 Identifier
 FSU_migr_etd8568
 Format
 Thesis
 Title
 Meta Analysis and Meta Regression of a Measure of Discrimination Used in Prognostic Modeling.
 Creator

Rivera, Gretchen L., McGee, Daniel, Hurt, Myra, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

In this paper we are interested in predicting death with the underlying cause of coronary heart disease (CHD). There are two prognostic modeling methods used to predict CHD: the logistic model and the proportional hazards model. For this paper we consider the logistic model. The dataset used is the Diverse Populations Collaboration (DPC) dataset, which includes 28 studies. The DPC dataset has epidemiological results from investigations conducted in different populations around the world. For our analysis we include those individuals who are 17 years old or older. The predictors are: age, diabetes, total serum cholesterol (mg/dl), high density lipoprotein (mg/dl), systolic blood pressure (mmHg) and whether the participant is a current cigarette smoker. There is a natural grouping within the studies, such as gender, rural or urban area and race. Based on these strata we have 84 cohort groups. Our main interest is to evaluate how well the prognostic model discriminates. For this, we used the area under the Receiver Operating Characteristic (ROC) curve. The main idea of the ROC curve is that each of a set of subjects is known to belong to one of two classes (signal or noise group). An assignment procedure then assigns each subject to a class on the basis of the information observed. The assignment procedure is not perfect: sometimes a subject is misclassified. To evaluate the quality of performance of this procedure, we used the area under the ROC curve (AUROC). The AUROC varies from 0.5 (no apparent accuracy) to 1.0 (perfect accuracy). For each logistic model we found the AUROC and its standard error (SE). We used meta-analysis to summarize the estimated AUROCs and to evaluate whether there is heterogeneity in our estimates. To evaluate the existence of significant heterogeneity we used the Q statistic. Since heterogeneity was found in our study, we compare seven different methods for estimating τ² (the between-study variance). 
We conclude by examining whether differences in study characteristics explained the heterogeneity in the values of the AUROC.
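The heterogeneity step described above can be sketched as follows: given AUROC estimates and their standard errors from several cohorts, compute the inverse-variance pooled estimate, Cochran's Q statistic, and one estimate of the between-study variance τ². The DerSimonian-Laird moment estimator is used here only as a familiar example; the abstract does not say which seven estimators were compared, and all function and variable names are hypothetical.

```python
def cochran_q_and_dl_tau2(estimates, std_errors):
    """Fixed-effect pooled estimate, Cochran's Q, and DerSimonian-Laird tau^2.

    estimates  -- per-cohort estimates (e.g. AUROC values)
    std_errors -- their standard errors
    This is an illustrative sketch, not the thesis's own code.
    """
    # Inverse-variance weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    w_sum = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / w_sum
    # Q is the weighted sum of squared deviations from the pooled value.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    k = len(estimates)
    # DerSimonian-Laird moment estimator, truncated at zero.
    c = w_sum - sum(w ** 2 for w in weights) / w_sum
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    return pooled, q, tau2
```

Under homogeneity, Q is compared with a chi-square distribution on k − 1 degrees of freedom; a Q well above k − 1 yields a positive τ².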
 Date Issued
 2013
 Identifier
 FSU_migr_etd7580
 Format
 Thesis
 Title
 The Frequentist Performance of Some Bayesian Confidence Intervals for the Survival Function.
 Creator

Tao, Yingfeng, Huffer, Fred, Okten, Giray, Sinha, Debajyoti, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Estimation of a survival function is a very important topic in survival analysis with contributions from many authors. This dissertation considers estimation of confidence intervals for the survival function based on right-censored or interval-censored survival data. Most of the methods for estimating pointwise confidence intervals and simultaneous confidence bands of the survival function are reviewed in this dissertation. In the right-censored case, almost all confidence intervals are based in some way on the Kaplan-Meier estimator, first proposed by Kaplan and Meier (1958) and widely used as the nonparametric estimator in the presence of right-censored data. For interval-censored data, the Turnbull estimator (Turnbull (1974)) plays a similar role. For a class of Bayesian models involving Dirichlet priors, Doss and Huffer (2003) suggested several simulation techniques to approximate the posterior distribution of the survival function by using Markov chain Monte Carlo or sequential importance sampling. These techniques lead to probability intervals for the survival function (at arbitrary time points) and its quantiles for both the right-censored and interval-censored cases. This dissertation will examine the frequentist properties and general performance of these probability intervals when the prior is noninformative. Simulation studies will be used to compare these probability intervals with other published approaches. Extensions of the Doss-Huffer approach are given for constructing simultaneous confidence bands for the survival function and for computing approximate confidence intervals for the survival function based on Edgeworth expansions using posterior moments. The performance of these extensions is studied by simulation.
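For readers unfamiliar with the Kaplan-Meier estimator mentioned above, a minimal sketch for right-censored data is given here. It shows only the point estimate of the survival function; the confidence intervals and Bayesian probability intervals studied in the dissertation are not reproduced. The function name and the input encoding (event indicator 1 for an observed failure, 0 for a censored time) are assumptions for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct failure time.

    times  -- observed times (failure or censoring)
    events -- 1 if the failure was observed, 0 if the time is censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional survival fraction at t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Failures and censored subjects both leave the risk set.
        n_at_risk -= removed
    return curve
```

Censored observations do not produce a step in the curve, but they reduce the number at risk for later failure times.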
 Date Issued
 2013
 Identifier
 FSU_migr_etd7624
 Format
 Thesis
 Title
 Bayesian Methods for Skewed Response Including Longitudinal and Heteroscedastic Data.
 Creator

Tang, Yuanyuan, Sinha, Debajyoti, Pati, Debdeep, Flynn, Heather, She, Yiyuan, Lipsitz, Stuart, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

Skewed response data are very common in practice, especially in the biomedical area. We begin our work with the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study heteroscedasticity. We extend the transform-both-sides model to the Bayesian variable selection area to handle the univariate skewed response, where the variance of the response is a function of the median. Finally, we propose a novel model to handle the skewed univariate response with flexible heteroscedasticity. For longitudinal studies with heavily skewed continuous response, statistical models and methods focusing on the mean response are not appropriate. In this paper, we present a partial linear model of the median regression function of a skewed longitudinal response. We develop a semiparametric Bayesian estimation procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We provide justifications for using our methods, including theoretical investigation of the support of the prior, asymptotic properties of the posterior and also simulation studies of finite sample properties. Ease of implementation and advantages of our model and method compared to existing methods are illustrated via analysis of a cardiotoxicity study of children of HIV-infected mothers. Our second aim is to develop a Bayesian method for simultaneous variable selection and estimation in median regression for a skewed response variable. Our hierarchical Bayesian model can incorporate the advantages of the $l_0$ penalty for skewed and heteroscedastic error. Some preliminary simulation studies have been conducted to compare the performance of the proposed model and the existing frequentist median lasso regression model. Considering the estimation bias and total square error, our proposed model performs as well as, or better than, competing frequentist estimators. 
In biomedical studies, the covariates often affect the location, scale as well as the shape of the skewed response distribution. Existing biostatistical literature mainly focuses on mean regression with a symmetric error distribution. While such modeling assumptions and methods are often deemed restrictive and inappropriate for skewed response, the completely nonparametric methods may lack a physical interpretation of the covariate effects. Existing nonparametric methods also lack an easily implementable computational tool. For a skewed response, we develop a novel model accommodating a nonparametric error density that depends on the covariates. The advantages of our associated semiparametric Bayes method include the ease of prior elicitation/determination, an easily implementable posterior computation, theoretically sound properties of the selection of priors and accommodation of possible outliers. The practical advantages of the method are illustrated via a simulation study and an analysis of a real-life epidemiological study on the serum response to DDT exposure during the gestation period.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7622
 Format
 Thesis
 Title
 An Ensemble Approach to Predicting Health Outcomes.
 Creator

Nilles, Ester Kim, McGee, Dan, Zhang, Jinfeng, Eberstein, Isaac, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Heart disease and premature birth continue to be the leading causes of mortality and neonatal mortality in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth is highly valuable information for early interventions. Over the past few decades, identification of patients at high health risk has been based on logistic regression or Cox proportional hazards models. In more recent years, machine learning models have grown in popularity within the medical field for their superior predictive and classification performance over the classical statistical models. However, their performances in heart disease and premature birth predictions have been comparable and inconclusive, leaving the question of which model most accurately reflects the data difficult to resolve. Our aim is to incorporate information learned by different models into one final model that will generate superior predictive performance. We first compare the widely used machine learning models (the multilayer perceptron network, k-nearest neighbor and support vector machine) to the statistical models logistic regression and Cox proportional hazards. Then the individual models are combined into one in an ensemble approach, also referred to as ensemble modeling. The proposed approaches include SSE-weighted, AUC-weighted, logistic and flexible naive Bayes. The individual models are unique and capture different aspects of the data, but as expected, no individual one outperforms any other. The ensemble approach is an easily computed method that eliminates the need to select one model, integrates the strengths of different models, and generates optimal performances. 
Particularly in cases where the risk factors associated with an outcome are elusive, such as in premature birth, the ensemble models significantly improve prediction.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7530
 Format
 Thesis
 Title
 A Probabilistic and Graphical Analysis of Evidence in O.J. Simpson's Murder Case Using Bayesian Networks.
 Creator

Olumide, Kunle, Huffer, Fred, Shute, Valerie, Sinha, Debajyoti, Niu, Xufeng, Logan, Wayne, Department of Statistics, Florida State University
 Abstract/Description

This research work is an attempt to illustrate the versatility and wide applications of the field of statistical science. Specifically, the research work involves the application of statistics in the field of law. The application will focus on the subfields of Evidence and Criminal law using one of the most celebrated cases in the history of American jurisprudence: the 1994 O.J. Simpson murder case in California. Our task here is to do a probabilistic and graphical analysis of the body of evidence in this case using Bayesian Networks. We will begin the analysis by first constructing our main hypothesis regarding the guilt or non-guilt of the accused; this main hypothesis will be supplemented by a series of ancillary hypotheses. Using graphs and probability concepts, we will evaluate the probative force or strength of the evidence and how well the body of evidence at hand proves our main hypothesis. We will employ Bayes rule, likelihoods and likelihood ratios to carry out such an evaluation. Some sensitivity analyses will be carried out by varying the degree of our prior beliefs or probabilities, and evaluating the effect of such variations on the likelihood ratios regarding our main hypothesis.
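The role of likelihood ratios in updating a prior belief can be sketched with the odds form of Bayes rule. The example below assumes, purely for illustration, that the items of evidence are conditionally independent given the hypothesis, so that their likelihood ratios multiply; a Bayesian network such as the one built in the thesis exists precisely to handle dependencies that this shortcut ignores. The function name and the numbers in the usage note are hypothetical and are not drawn from the case.

```python
def posterior_probability(prior, likelihood_ratios):
    """Posterior probability of a hypothesis via the odds form of Bayes rule.

    prior             -- prior probability of the hypothesis (0 < prior < 1)
    likelihood_ratios -- P(evidence | H) / P(evidence | not H) for each item,
                         assumed conditionally independent given H
    """
    # Convert the prior probability to prior odds.
    odds = prior / (1.0 - prior)
    # Each likelihood ratio scales the odds.
    for lr in likelihood_ratios:
        odds *= lr
    # Convert posterior odds back to a probability.
    return odds / (1.0 + odds)
```

A simple sensitivity analysis in this setting evaluates the function over a range of priors, for example 0.1, 0.5 and 0.9, with the same likelihood ratios, and compares the resulting posteriors.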
 Date Issued
 2010
 Identifier
 FSU_migr_etd2287
 Format
 Thesis
 Title
 Nonparametric Estimation of Three Dimensional Projective Shapes with Applications in Medical Imaging and in Pattern Recognition.
 Creator

Crane, Michael, Patrangenaru, Victor, Liu, Xiuwen, Huffer, Fred W., Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

This dissertation is on analysis of invariants of a 3D configuration from its 2D images in pictures of this configuration, without requiring any restriction on the camera positioning relative to the scene pictured. We briefly review some of the main results found in the literature. The methodology used is nonparametric and manifold based, combined with standard computer vision reconstruction techniques. More specifically, we use asymptotic results for the extrinsic sample mean and the extrinsic sample covariance to construct bootstrap confidence regions for mean projective shapes of 3D configurations. Chapters 4, 5 and 6 contain new results. In chapter 4, we develop tests for coplanarity. Chapter 5 is on reconstruction of 3D polyhedral scenes, including texture, from arbitrary partial views. In chapter 6, we develop a nonparametric methodology for estimating the mean change for matched samples on a Lie group. We then notice that for k ≥ 4, a manifold of projective shapes of k-ads in general position in 3D has the structure of a (3k − 15)-dimensional Lie group (PQuaternions) that is equivariantly embedded in a Euclidean space; therefore testing for mean 3D projective shape change amounts to a one-sample test for extrinsic mean PQuaternion Objects. The Lie group technique leads to a large sample and nonparametric bootstrap test for one population extrinsic mean on a projective shape space, as recently developed by Patrangenaru, Liu and Sughatadasa [1]. On the other hand, in the absence of occlusions, the 3D projective shape of a spatial configuration can be recovered from a stereo pair of images, thus allowing a test for mean glaucomatous 3D projective shape change from standard stereo pairs of eye images.
 Date Issued
 2010
 Identifier
 FSU_migr_etd7118
 Format
 Thesis
 Title
 Nonparametric Estimation of Three Dimensional Projective Shapes with Applications in Medical Imaging and in Pattern Recognition.
 Creator

Crane, Michael, Patrangenaru, Victor, Liu, Xiuwen, Huffer, Fred W., Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

This dissertation is on analysis of invariants of a 3D configuration from its 2D images in pictures of this configuration, without requiring any restriction on the camera positioning relative to the scene pictured. We briefly review some of the main results found in the literature. The methodology used is nonparametric and manifold based, combined with standard computer vision reconstruction techniques. More specifically, we use asymptotic results for the extrinsic sample mean and the extrinsic sample covariance to construct bootstrap confidence regions for mean projective shapes of 3D configurations. Chapters 4, 5 and 6 contain new results. In chapter 4, we develop tests for coplanarity. Chapter 5 is on reconstruction of 3D polyhedral scenes, including texture, from arbitrary partial views. In chapter 6, we develop a nonparametric methodology for estimating the mean change for matched samples on a Lie group. We then notice that for k ≥ 4, a manifold of projective shapes of k-ads in general position in 3D has the structure of a (3k − 15)-dimensional Lie group (PQuaternions) that is equivariantly embedded in a Euclidean space; therefore testing for mean 3D projective shape change amounts to a one-sample test for extrinsic mean PQuaternion Objects. The Lie group technique leads to a large sample and nonparametric bootstrap test for one population extrinsic mean on a projective shape space, as recently developed by Patrangenaru, Liu and Sughatadasa. On the other hand, in the absence of occlusions, the 3D projective shape of a spatial configuration can be recovered from a stereo pair of images, thus allowing a test for mean glaucomatous 3D projective shape change from standard stereo pairs of eye images.
 Date Issued
 2010
 Identifier
 FSU_migr_etd4607
 Format
 Thesis
 Title
 Analysis of Multivariate Data with Random Cluster Size.
 Creator

Li, Xiaoyun, Sinha, Debajyoti, Zhou, Yi, McGee, Dan, Lipsitz, Stuart, Department of Statistics, Florida State University
 Abstract/Description

In this dissertation, we examine binary correlated data with present/absent components or missing data that are related to the binary responses of interest. Depending on the data structure, correlated binary data can be referred to as clustered data if the sampling unit is a cluster of subjects, or as longitudinal data when they involve repeated measurement of the same subject over time. We propose our novel models for these two data structures and illustrate the models with real data applications. In biomedical studies involving clustered binary responses, the cluster size can vary because some components of the cluster can be absent. When both the presence of a cluster component and the binary disease status of a present component are treated as responses of interest, we propose a novel two-stage random effects logistic regression framework. For ease of interpretation of regression effects, both the marginal probability of presence/absence of a component and the conditional probability of disease status of a present component preserve the approximate logistic regression forms. We present a maximum likelihood method of estimation implementable using standard statistical software. We compare our models and the physical interpretation of regression effects with competing methods from the literature. We also present a simulation study to assess the robustness of our procedure to misspecification of the random effects distribution and to compare finite sample performances of estimates with existing methods. The methodology is illustrated by analyzing a study of the periodontal health status in a diabetic Gullah population. We extend this model to longitudinal studies with binary longitudinal response and informative missing data. In longitudinal studies, when treating each subject as a cluster, the cluster size is the total number of observations for each subject. 
When data are informatively missing, the cluster size of each subject can vary and is related to the binary response of interest, and we are also interested in the missing mechanism. This is a modified version of the clustered binary data setting with present components. We modify and adapt our proposed two-stage random effects logistic regression model so that both the marginal probability of binary response and missing indicator and the conditional probability of binary response and missing indicator preserve logistic regression forms. We present a Bayesian framework for this model and illustrate our proposed model on an AIDS data example.
 Date Issued
 2011
 Identifier
 FSU_migr_etd1425
 Format
 Thesis
 Title
 Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis.
 Creator

Thompson, Warren R. (Warren Robert), McGee, Daniel, Eberstein, Isaac, Huffer, Fred, Sinha, Debajyoti, She, Yiyuan, Department of Statistics, Florida State University
 Abstract/Description

Variable selection is an important aspect of modeling. Its aim is to distinguish between the authentic variables, which are important in predicting the outcome, and the noise variables, which possess little to no predictive value. In other words, the goal is to find the variables that (collectively) best explain and predict changes in the outcome variable. The variable selection problem is exacerbated when correlated variables are included in the covariate set. This dissertation examines the variable selection problem in the context of logistic regression. Specifically, we investigated the merits of the bootstrap, ridge regression, the lasso and Bayesian model averaging (BMA) as variable selection techniques when highly correlated predictors and a dichotomous outcome are considered. This dissertation also contributes to the literature on the diet-heart hypothesis. The diet-heart hypothesis has been around since the early twentieth century. Since then, researchers have attempted to isolate the nutrients in diet that promote coronary heart disease (CHD). After a century of research, there is still no consensus. In our current research, we used some of the more recent statistical methodologies (mentioned above) to investigate the effect of twenty dietary variables on the incidence of coronary heart disease. Logistic regression models were generated for the data from the Honolulu Heart Program, a study of CHD incidence in men of Japanese descent. Our results were largely method-specific. However, regardless of the method considered, there was strong evidence to suggest that alcohol consumption has a strong protective effect on the risk of coronary heart disease. Of the variables considered, dietary cholesterol and caffeine were the only variables that, at best, exhibited a moderately strong harmful association with CHD incidence. Further investigation that includes a broader array of food groups is recommended.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1360
 Format
 Thesis
 Title
 Bayesian Modeling and Variable Selection for Complex Data.
 Creator

Li, Hanning, Pati, Debdeep, Huffer, Fred W. (Fred William), Kercheval, Alec N., Sinha, Debajyoti, Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

As we routinely encounter high-throughput datasets in complex biological and environmental research, developing novel models and methods for variable selection has received widespread attention. In this dissertation, we address a few key challenges in Bayesian modeling and variable selection for high-dimensional data with complex spatial structures. a) Most Bayesian variable selection methods are restricted to mixture priors having separate components for characterizing the signal and the noise. However, such priors encounter computational issues in high dimensions. This has motivated continuous shrinkage priors, which resemble the two-component priors while facilitating computation and interpretability. While such priors are widely used for estimating high-dimensional sparse vectors, selecting a subset of variables remains a daunting task. b) Spatial/spatial-temporal data sets with complex structures are now commonly encountered in scientific research fields including atmospheric sciences, forestry, environmental science, biological science, and social science. Selecting important spatial variables that have significant influence on the occurrence of events is necessary for providing insights to researchers. Self-excitation, a feature whereby the occurrence of an event increases the likelihood of more occurrences of the same type of event nearby in time and space, can be found in many natural and social events. Research on modeling data with the self-excitation feature has drawn increasing interest recently. However, the existing literature on self-exciting models that include high-dimensional spatial covariates is still underdeveloped. c) The Gaussian process is among the most powerful modeling frameworks for spatial data. Its major bottleneck is the computational complexity that stems from inversion of the dense matrices associated with a Gaussian process covariance. Hierarchical divide-and-conquer Gaussian process models have been investigated for ultra-large data sets. However, the computation associated with scaling the distributed computing algorithm to handle a large number of subgroups poses a serious bottleneck. In Chapter 2 of this dissertation, we propose a general approach for variable selection with shrinkage priors. The presence of very few tuning parameters makes our method attractive in comparison to ad hoc thresholding approaches. The applicability of the approach is not limited to continuous shrinkage priors; it can be used along with any shrinkage prior. Theoretical properties for near-collinear design matrices are investigated, and the method is shown to have good performance in a wide range of synthetic data examples and in a real data example on selecting genes affecting survival due to lymphoma. In Chapter 3 of this dissertation, we propose a new self-exciting model that allows the inclusion of spatial covariates. We develop algorithms that are effective in obtaining accurate estimation and variable selection results in a variety of synthetic data examples. Our proposed model is applied to Chicago crime data, where the influence of various spatial features is investigated. In Chapter 4, we focus on a hierarchical Gaussian process regression model for ultra-high-dimensional spatial datasets. By evaluating the latent Gaussian process on a regular grid, we propose an efficient computational algorithm through circulant embedding. The latent Gaussian process borrows information across multiple subgroups, thereby obtaining a more accurate prediction. The hierarchical model and our proposed algorithm are studied through simulation examples.
 Date Issued
 2017
 Identifier
 FSU_FALL2017_Li_fsu_0071E_14159
 Format
 Thesis
 Title
 Spatial Statistics and Its Applications in Biostatistics and Environmental Statistics.
 Creator

Hu, Guanyu, Huffer, Fred W. (Fred William), Paek, Insu, Sinha, Debajyoti, Slate, Elizabeth H., Bradley, Jonathan R., Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

This dissertation presents some topics in spatial statistics and their application in biostatistics and environmental statistics. The field of spatial statistics is an energetic area in statistics. In Chapter 2 and Chapter 3, the goal is to build subregion models under the assumption that the responses or the parameters are spatially correlated. For regression models, considering spatially varying coefficients is a reasonable way to build subregion models. There are two different techniques for exploring spatially varying coefficients. One is geographically weighted regression (Brunsdon et al. 1998). The other is a spatially varying coefficients model which assumes a stationary Gaussian process for the regression coefficients (Gelfand et al. 2003). Based on the ideas of these two techniques, we introduce techniques for exploring subregion models in survival analysis, which is an important area of biostatistics. In Chapter 2, we introduce modified versions of the Kaplan-Meier and Nelson-Aalen estimators which incorporate geographical weighting. We use ideas from counting process theory to obtain these modified estimators, to derive variance estimates, and to develop associated hypothesis tests. In Chapter 3, we introduce a Bayesian parametric accelerated failure time model with spatially varying coefficients. These two techniques can explore subregion models in survival analysis using both nonparametric and parametric approaches. In Chapter 4, we introduce Bayesian parametric covariance regression analysis for a response vector. The proposed method defines a regression model between the covariance matrix of a p-dimensional response vector and auxiliary variables. We propose a constrained Metropolis-Hastings algorithm to obtain the estimates. Simulation results are presented to show the performance of both the regression and covariance matrix estimates. Furthermore, we present a more realistic simulation experiment in which our Bayesian approach has better performance than the MLE. Finally, we illustrate the usefulness of our model by applying it to the Google Flu data. In Chapter 5, we give a brief summary of future work.
 Date Issued
 2017
 Identifier
 FSU_FALL2017_Hu_fsu_0071E_14205
 Format
 Thesis
 Title
 Examining the Relationship of Dietary Component Intakes to Each Other and to Mortality.
 Creator

Alrajhi, Sharifah, McGee, Daniel, Levenson, Cathy W., Niu, Xufeng, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In this essay we present an analysis examining the basic dietary structure and its relationship to mortality in the first National Health and Nutrition Examination Survey (NHANES I), conducted between 1971 and 1975. We used results from 24-hour recalls on 10,483 individuals in this study. All of the individuals in the analytic sample were followed through 1992 for vital status. The mean follow-up period for the participants was 16 years. During follow-up, 2,042 (48%) males and 1,754 (27%) females died. We first attempted to capture the inherent structure of the dietary data using principal components analysis (PCA). We performed this estimation separately for each race (white and black) and gender (male and female) and compared the estimated principal components among these four strata. We found that the principal components were similar (but not identical) in the four strata. We also related our estimated principal components to mortality using Cox Proportional Hazards (CPH) models and related dietary components to mortality using forward variable selection.
 Date Issued
 2015
 Identifier
 FSU_2015fall_Alrajhi_fsu_0071E_12802
 Format
 Thesis
 Title
 Median Regression for Complex Survey Data.
 Creator

Fraser, Raphael André, Sinha, Debajyoti, Lipsitz, Stuart, Carlson, Elwood, Slate, Elizabeth H., Huffer, Fred W. (Fred William), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The ready availability of public-use data from various large national complex surveys has immense potential for the assessment of population characteristics (means, proportions, totals, etcetera). Using a model-based approach, complex surveys can be used to evaluate the effectiveness of treatments and to identify risk factors for important diseases such as cancer. Existing statistical methods based on estimating equations and/or utilizing resampling methods are often not valid with survey data due to design features such as stratification, multistage sampling and unequal selection probabilities. In this paper, we accommodate these design features in the analysis of highly skewed response variables arising from large complex surveys. Specifically, we propose a double-transform-both-sides based estimating equations approach to estimate the median regression parameters of the highly skewed response; the double-transform-both-sides method applies the same transformation twice to both the response and the regression function. The usual sandwich variance estimate can be used in our approach, whereas a resampling approach would be needed for a pseudo-likelihood based on minimizing absolute deviations. Furthermore, the double-transform-both-sides estimator is relatively robust to the true underlying distribution and has much smaller mean square error than the least absolute deviations estimator. The method is motivated by an analysis of laboratory data on urinary iodine concentration from the National Health and Nutrition Examination Survey.
 Date Issued
 2015
 Identifier
 FSU_2015fall_Fraser_fsu_0071E_12825
 Format
 Thesis
 Title
 A Bayesian Wavelet Based Analysis of Longitudinally Observed Skewed Heteroscedastic Responses.
 Creator

Baker, Danisha S. (Danisha Sharice), Chicken, Eric, Sinha, Debajyoti, Harper, Kristine, Pati, Debdeep, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Unlike many of the current statistical models focusing on highly skewed longitudinal data, we present a novel model accommodating a skewed error distribution, a partial linear median regression function, a nonparametric wavelet expansion, and serial observations on the same unit. Parameters are estimated via a semiparametric Bayesian procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We use a hierarchical mixture model as the prior for the wavelet coefficients. For the "vanishing" coefficients, the model includes a level-dependent prior probability mass at zero. This practice implements wavelet coefficient thresholding as a Bayesian rule. Practical advantages of our method are illustrated through a simulation study and via analysis of a cardiotoxicity study of children of HIV-infected mothers.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Baker_fsu_0071E_14036
 Format
 Thesis
 Title
 Regression Methods for Skewed and Heteroscedastic Response with High-Dimensional Covariates.
 Creator

Wang, Libo, Sinha, Debajyoti, Taylor, Miles G., Pati, Debdeep, She, Yiyuan, Yang, Yun (Professor of Statistics), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The rise of studies with high-dimensional potential covariates has invited a renewed interest in dimension reduction that promotes more parsimonious models, ease of interpretation and computational tractability. However, current variable selection methods restricted to continuous responses often assume a Gaussian response for methodological as well as theoretical developments. In this thesis, we consider regression models that induce sparsity, gain prediction power, and accommodate response distributions beyond the Gaussian with common variance. The first part of this thesis is a transform-both-sides Bayesian variable selection model (TBS) which allows skewness, heteroscedasticity and extremely heavy-tailed responses. Our method develops a framework which facilitates computationally feasible inference in spite of inducing non-local priors on the original regression coefficients. Even if the transformed conditional mean is no longer linear with respect to the covariates, we still prove the consistency of our Bayesian TBS estimators. Simulation studies and real data analysis demonstrate the advantages of our methods. Another main part of this thesis deals with the above challenges from a frequentist standpoint. This model incorporates a penalized likelihood to accommodate a skewed response arising from an epsilon-skew-normal (ESN) distribution. With suitable optimization techniques to handle this two-piece penalized likelihood, our method demonstrates substantial gains in sensitivity and specificity even under high-dimensional settings. We conclude this thesis with a novel Bayesian semiparametric modal regression method along with its implementation and simulation studies.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Wang_fsu_0071E_13950
 Format
 Thesis
 Title
 Nonparametric Change Point Detection Methods for Profile Variability.
 Creator

Geneus, Vladimir J. (Vladimir Jacques), Chicken, Eric, Liu, Guosheng (Professor of Earth, Ocean and Atmospheric Science), Sinha, Debajyoti, Zhang, Xin (Professor of Engineering), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Due to the importance of seeing profile changes in devices such as medical apparatus, measuring the change point in the variability of different functions is important. In a sequence of functional observations (each of the same length), we wish to determine as quickly as possible when a change in the observations has occurred. Wavelet-based change point methods are proposed that determine when the variability of the noise in a sequence of functional profiles (i.e. the precision profile of medical devices) goes out of control from a known, fixed value or an estimated in-control value. Various methods have been proposed which focus on changes in the form of the function. One method, the NEWMA, based on EWMA, focuses on changes in both. However, its drawback is that the form of the in-control function must be known. Other methods, including the χ² for Phase I and Phase II, make some assumption about the function. Our interest, however, is in detecting changes in the variance from one function to the next. In particular, we are interested not in differences from one profile to another (variance between) but rather in differences in variance (variance within). The functional portion of the profiles is allowed to come from a large class of functions and may vary from profile to profile. The estimator is evaluated under a variety of conditions, including allowing the wavelet noise subspace to be substantially contaminated by the profile's functional structure, and is compared to two competing noise monitoring methods. Nikoo and Noorossana (2013) propose a nonparametric wavelet regression method that uses two change point techniques to monitor the variance: a nonparametric control chart, via the mean of m median control charts, and a parametric control chart, via the χ² distribution. We propose improvements to their method by incorporating prior data and making use of likelihood ratios. Our methods make use of the orthogonal properties of wavelet projections to accurately and efficiently monitor the level of noise from one profile to the next and to detect changes in noise in a Phase II setting. We show through simulation results that our proposed methods have better power and are more robust against the confounding effect between variance estimation and function estimation. The proposed methods are shown to be very efficient at detecting when the variability has changed through an extensive simulation study. Extensions are considered that explore the usage of windowing and estimated in-control values for the MAD method, and the effect of the exact distribution under normality rather than the asymptotic distribution. These developments are implemented in the parametric, nonparametric scale, and completely nonparametric settings. The proposed methodologies are tested through simulation and are applicable to various biometric and health-related topics; they have the potential to improve computational efficiency and to reduce the number of assumptions required.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Geneus_fsu_0071E_13862
 Format
 Thesis
 Title
 Scalable and Structured High Dimensional Covariance Matrix Estimation.
 Creator

Sabnis, Gautam, Pati, Debdeep, Kercheval, Alec N., Sinha, Debajyoti, Chicken, Eric, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

With rapid advances in data acquisition and storage techniques, modern scientific investigations in epidemiology, genomics, imaging and networks are increasingly producing challenging data structures in the form of high-dimensional vectors, matrices and multiway arrays (tensors), rendering traditional statistical and computational tools inappropriate. One hope for meaningful inferences in such situations is to discover an inherent lower-dimensional structure that explains the physical or biological process generating the data. The structural assumptions impose constraints that force the objects of interest to lie in lower-dimensional spaces, thereby facilitating their estimation and interpretation and, at the same time, reducing the computational burden. The assumption of an inherent structure, motivated by various scientific applications, is often adopted as the guiding light in the analysis and is fast becoming a standard tool for parsimonious modeling of such high-dimensional data structures. The content of this thesis is specifically directed towards methodological development of statistical tools, with attractive computational properties, for drawing meaningful inferences through such structures. The third chapter of this thesis proposes a distributed computing framework, based on a divide-and-conquer strategy and hierarchical modeling, to accelerate posterior inference for high-dimensional Bayesian factor models. Our approach distributes the task of high-dimensional covariance matrix estimation to multiple cores, solves each subproblem separately via a latent factor model, and then combines these estimates to produce a global estimate of the covariance matrix. Existing divide-and-conquer methods focus exclusively on dividing the total number of observations n into subsamples while keeping the dimension p fixed. The approach is novel in this regard: it includes all of the n samples in each subproblem and, instead, splits the dimension p into smaller subsets for each subproblem. The subproblems themselves can be challenging to solve when p is large due to the dependencies across dimensions. To circumvent this issue, a novel hierarchical structure is specified on the latent factors that allows for flexible dependencies across dimensions while still maintaining computational efficiency. Our approach is readily parallelizable and is shown to be several orders of magnitude more computationally efficient than fitting a full factor model. The fourth chapter of this thesis proposes a novel way of estimating a covariance matrix that can be represented as the sum of a low-rank matrix and a diagonal matrix. The proposed method compresses high-dimensional data, computes the sample covariance in the compressed space, and lifts it back to the ambient space via a decompression operation. A salient feature of our approach relative to the existing literature on combining sparsity and low-rank structures in covariance matrix estimation is that we do not require the low-rank component to be sparse. A principled framework for estimating the compressed dimension using Stein's Unbiased Risk Estimation theory is demonstrated. In the final chapter of this thesis, we tackle the problem of variable selection in high dimensions. Consistent model selection in high dimensions has received substantial interest in recent years and is an extremely challenging problem for Bayesians. The literature on model selection with continuous shrinkage priors is even less developed due to the unavailability of exact zeros in the posterior samples of the parameter of interest. Heuristic methods based on thresholding the posterior mean are often used in practice, but they lack theoretical justification, and inference is highly sensitive to the choice of the threshold. We aim to address the problem of selecting variables through a novel method of post-processing the posterior samples.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Sabnis_fsu_0071E_14043
 Format
 Thesis
 Title
 Shape Constrained Single Index Models for Biomedical Studies.
 Creator

Dhara, Kumaresh, Sinha, Debajyoti, Pati, Debdeep, Proudfit, Greg Hajcak, Slate, Elizabeth H., Chicken, Eric, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

For many biomedical, environmental and economic studies with an unknown nonlinear relationship between the response and its multiple predictors, a single index model provides practical dimension reduction and good physical interpretation. However, widespread use of existing Bayesian analysis for such models is lacking in biostatistics due to some major impediments, including slow mixing of the Markov chain Monte Carlo (MCMC), inability to deal with missing covariates and a lack of theoretical justification of the rate of convergence. We present a new Bayesian single index model with an associated MCMC algorithm that incorporates an efficient Metropolis-Hastings (MH) step for the conditional distribution of the index vector. Our method leads to a model with good biological interpretation and prediction, implementable Bayesian inference, fast convergence of the MCMC, and a first-time extension to accommodate missing covariates. We also obtain, for the first time, a set of sufficient conditions for obtaining the optimal rate of convergence of the overall regression function. We illustrate the practical advantages of our method and computational tool via reanalysis of an environmental study. We also propose frequentist and Bayesian methods for monotone single-index models, using the Bernstein polynomial basis to represent the link function. The monotonicity of the unknown link function creates a clinically interpretable index, along with the relative importance of the covariates on the index. We develop a computationally simple, iterative, profile likelihood-based method for the frequentist analysis. To ease the computational complexity of the Bayesian analysis, we also develop a novel and efficient Metropolis-Hastings step to sample from the conditional posterior distribution of the index parameters. These methodologies and their advantages over existing methods are illustrated via simulation studies. These methods are also used to analyze depression-based measures among adolescent girls.
 Date Issued
 2018
 Identifier
 2018_Su_Dhara_fsu_0071E_14739
 Format
 Thesis
 Title
 Influence Measures for Bayesian Data Analysis.
 Creator

De Oliveira, Melaine C. (Melaine Cristina), Sinha, Debajyoti, Panton, Lynn B., Bradley, Jonathan R., Linero, Antonio Ricardo, Lipsitz, Stuart, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Identifying influential observations in the data is desirable to ensure proper inference and statistical analysis. Modern methods for identifying influential cases use cross-validation diagnostics based on the effect of deleting the ith observation on inference. A popular method to identify influential observations is to use the Kullback-Leibler divergence between the posterior distribution of the parameter of interest given the full data and the posterior distribution given the cross-validated data, where the cross-validated data has the ith observation removed. Although in Bayesian inference the posterior distribution contains all the relevant information about a parameter of interest, when the goal is prediction, perhaps the predictive distribution should be used to identify influential observations. We therefore extend our method to the comparison of the posterior predictive distributions given the full data and the cross-validated data. We generalize and extend existing popular Bayesian cross-validated influence diagnostics using Bregman divergence (BD) based measures. We derive useful properties of these BD measures based on the influence of each observation on the posterior distribution, and we show that they can be extended to the predictive distribution. We show that these BD-based measures allow interpretable calibration and that they can be computed via Markov Chain Monte Carlo (MCMC) samples from a single posterior based on the full data. We illustrate how our new measures of the influence of observations have more useful practical roles for data analysis than popular Bayesian residual analysis tools (CPO) in an example of meta-analysis with binary response and in other cases of interval-censored data.
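As a rough illustration of the Kullback-Leibler special case mentioned in this abstract (not the thesis's Bregman divergence measures themselves), the sketch below uses the standard identity KL_i = log E[1/f(y_i | θ)] + E[log f(y_i | θ)], with both expectations taken over draws from the full-data posterior. The function name `kl_influence`, the toy normal model, and the planted outlier are all assumptions made for the example.

```python
import math
import random

def kl_influence(loglik):
    """loglik[s][i] = log f(y_i | theta_s), theta_s drawn from the full-data posterior.

    Returns (cpo, kl): the conditional predictive ordinate and the KL divergence
    between the full-data and case-deleted posteriors, one value per observation.
    """
    n_draws, n_obs = len(loglik), len(loglik[0])
    cpo, kl = [], []
    for i in range(n_obs):
        col = [row[i] for row in loglik]
        # log of the posterior mean of 1 / f(y_i | theta), via log-sum-exp
        m = max(-v for v in col)
        log_inv_mean = m + math.log(sum(math.exp(-v - m) for v in col) / n_draws)
        cpo.append(math.exp(-log_inv_mean))           # harmonic-mean estimate of CPO_i
        kl.append(log_inv_mean + sum(col) / n_draws)  # non-negative by Jensen's inequality
    return cpo, kl

# Toy model (assumed): y_i ~ N(theta, 1) with a flat prior, so theta | y ~ N(mean(y), 1/n).
y = [0.1, -0.3, 0.2, 0.0, 5.0]  # the last point is a planted outlier
rng = random.Random(0)
ybar = sum(y) / len(y)
draws = [rng.gauss(ybar, (1 / len(y)) ** 0.5) for _ in range(4000)]
loglik = [[-0.5 * math.log(2 * math.pi) - 0.5 * (yi - t) ** 2 for yi in y] for t in draws]

cpo, kl = kl_influence(loglik)
most_influential = max(range(len(y)), key=lambda i: kl[i])
```

Only one set of posterior draws is needed; no model is refitted with an observation removed, which is the computational point the abstract makes.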
 Date Issued
 2018
 Identifier
 2018_Su_DeOliveira_fsu_0071E_14712
 Format
 Thesis
 Title
 Building a Model Performance Measure for Examining Clinical Relevance Using Net Benefit Curves.
 Creator

Mukherjee, Anwesha, McGee, Daniel, Hurt, Myra M., Slate, Elizabeth H., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

ROC curves are often used to evaluate the predictive accuracy of statistical prediction models. This thesis studies other measures which incorporate not only the statistical but also the clinical consequences of using a particular prediction model. Depending on the disease and population under study, the misclassification costs of false positives and false negatives vary. The concept of Decision Curve Analysis (DCA) takes this cost into account by using the threshold probability (the probability above which a patient opts for treatment). Using the DCA technique, a Net Benefit Curve is built by plotting "Net Benefit", a function of the expected benefit and expected harm of using a model, against the threshold probability. Only the threshold probability range that is relevant to the disease and the population under study is used to plot the net benefit curve, to obtain the optimum results using a particular statistical model. This thesis concentrates on the construction of a summary measure to find which predictive model yields the highest net benefit. The most intuitive approach is to calculate the area under the net benefit curve. We examined whether the use of weights, such as the estimated empirical distribution of the threshold probability, to compute a weighted area under the curve creates a better summary measure. Real data from multiple cardiovascular research studies, the Diverse Population Collaboration (DPC) datasets, are used to compute the summary measures: area under the ROC curve (AUROC), area under the net benefit curve (ANBC) and weighted area under the net benefit curve (WANBC). The results from the analysis are used to compare these measures, to examine whether they are in agreement with each other and which would be the best to use in specified clinical scenarios. For different models, the summary measures and their standard errors (SE) were calculated to study the variability in each measure. The method of meta-analysis is used to summarize these estimated summary measures to reveal whether there is significant variability among these studies.
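A minimal sketch of the quantities named in this abstract, assuming the usual decision-curve definition of net benefit at threshold probability p_t: TP/n − (FP/n) · p_t/(1 − p_t). The function names, the toy data, and the simple per-threshold weighting are hypothetical; the thesis's exact weighting scheme for WANBC is not given in the abstract.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def area_under_nb(y_true, risk, thresholds, weights=None):
    """Trapezoidal area under the net benefit curve over the given thresholds.

    `weights` (one per threshold) is an assumed, simplified stand-in for
    weighting by an estimated distribution of threshold probabilities.
    """
    nb = [net_benefit(y_true, risk, t) for t in thresholds]
    if weights is not None:
        nb = [w * v for w, v in zip(weights, nb)]
    return sum((thresholds[k + 1] - thresholds[k]) * (nb[k] + nb[k + 1]) / 2
               for k in range(len(thresholds) - 1))

# Toy data (assumed): observed outcomes and model-predicted risks for 8 patients.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
risk = [0.9, 0.7, 0.6, 0.2, 0.4, 0.1, 0.3, 0.05]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]  # clinically relevant range, chosen arbitrarily here

nb_at_half = net_benefit(y_true, risk, 0.5)
anbc = area_under_nb(y_true, risk, thresholds)
wanbc = area_under_nb(y_true, risk, thresholds, weights=[1.0] * len(thresholds))
```

With unit weights the weighted area reduces to the unweighted one, which gives a quick sanity check before plugging in an estimated threshold distribution.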
 Date Issued
 2018
 Identifier
 2018_Sp_Mukherjee_fsu_0071E_14350
 Format
 Thesis
 Title
 Bayesian Analysis of Survival Data with Missing Censoring Indicators and Simulation of Interval Censored Data.
 Creator

Bunn, Veronica, Sinha, Debajyoti, Brownstein, Naomi Chana, Slate, Elizabeth H., Linero, Antonio Ricardo, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In some large clinical studies, it may be impractical to give physical examinations to every subject at his/her last monitoring time in order to diagnose the occurrence of an event of interest. This challenge creates survival data with missing censoring indicators, where the probability of missingness may depend on the time of last monitoring. We present a fully Bayesian semiparametric method for such survival data to estimate regression parameters of Cox's proportional hazards model [Cox, 1972]. Simulation studies show that our method performs better than competing methods. We apply the proposed method to data from the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study. Clinical studies often include interval censored data. We present a method for the simulation of interval censored data based on Poisson processes. We show that our method gives simulated data that fulfill the assumption of independent interval censoring, and is more computationally efficient than other methods used for simulating interval censored data.
 Date Issued
 2018
 Identifier
 2018_Su_Bunn_fsu_0071E_14742
 Format
 Thesis
 Title
 Survival Analysis Using Bayesian Joint Models.
 Creator

Xu, Zhixing, Sinha, Debajyoti, Schatschneider, Christopher, Bradley, Jonathan R., Chicken, Eric, Lin, Lifeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In many clinical studies, each patient is at risk of recurrent events as well as the terminating event. In Chapter 2, we present a novel latent-class-based semiparametric joint model that offers a clinically meaningful and estimable association between the recurrence profile and the risk of termination. Unlike previous shared-frailty-based joint models, this model has a coherent interpretation of the covariate effects on all relevant functions and model quantities that are either conditional or unconditional on event history. We offer a fully Bayesian method for estimation and prediction using a complete specification of the prior process of the baseline functions. When there is a lack of prior information about the baseline functions, we derive a practical and theoretically justifiable partial-likelihood-based semiparametric Bayesian approach. Our Markov Chain Monte Carlo tools for both Bayesian methods are implementable via publicly available software. Practical advantages of our methods are illustrated via a simulation study and the analysis of a transplant study with recurrent Non-Fatal Graft Rejections (NFGR) and the termination event of death due to total graft rejection. In Chapter 3, we are motivated by the important problem of estimating Daily Fine Particulate Matter (PM2.5) over the US. Tracking and estimating PM2.5 is very important, as it has been shown that PM2.5 is directly related to mortality associated with the lungs, cardiovascular system, and stroke. That is, high values of PM2.5 constitute a public health problem in the US, and it is important that we precisely estimate PM2.5 to aid in public policy decisions. Thus, we propose a Bayesian hierarchical model for high-dimensional "multi-type" responses. By "multi-type" responses we mean a collection of correlated responses that have different distributional assumptions (e.g., continuous skewed observations and count-valued observations). The Centers for Disease Control and Prevention (CDC) database provides counts of mortalities related to PM2.5 and daily averaged PM2.5, which are treated as responses in our analysis. Our model capitalizes on the shared conjugate structure between the Weibull (to model PM2.5), Poisson (to model disease mortalities), and multivariate log-gamma distributions, and uses dimension reduction to aid with computation. Our model can also be used to improve the precision of estimates and to estimate at undisclosed/missing counties. We provide a simulation study to illustrate the performance of the model and give an in-depth analysis of the CDC dataset.
 Date Issued
 2019
 Identifier
 2019_Spring_Xu_fsu_0071E_15078
 Format
 Thesis
 Title
 Evaluating the Effectiveness of the Expectation-Maximization (EM) Algorithm for Bayesian Network Calibration.
 Creator

Tingir, Seyfullah, Almond, Russell G., Sinha, Debajyoti, Becker, Betsy Jane, Yang, Yanyun, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
 Abstract/Description

Educators use various statistical techniques to explain relationships between latent and observable variables. One way to model these relationships is to use Bayesian networks as a scoring model. However, adjusting the conditional probability tables (CPT parameters) to fit a set of observations is still a challenge when using Bayesian networks. A CPT provides the conditional probabilities of a single discrete variable with respect to other discrete variables. In general Bayesian networks, the CPTs that link the proficiency variable and observable outcomes are not necessarily monotonic, but they are often constrained to be monotonic in educational applications. The monotonicity constraint states that if an examinee shows an improvement on a proficiency variable (parent variable), the individual's performance on an observable (child variable) should improve. For example, if a student has a higher writing skill, then this student is likely to score better on an essay task. For educational research, building parametric models (i.e., DiBello models) with the Expectation-Maximization algorithm provides monotonic conditional probability tables (CPTs). This dissertation explored the effectiveness of the EM algorithm within the DiBello parameterization under different sample sizes, test forms, and item structures. The data generation model specifies two skill variables with a different number of items depending on the test forms. The outcome measures were the relative bias of the parameters to assess parameter recovery, Kullback-Leibler (KL) distance to evaluate the distance between CPTs, and Cohen's κ to assess classification agreement between data generation and estimation models. The simulation study results showed that a minimum sample size of 400 was sufficient to produce acceptable parameter bias and KL distance. A balanced distribution of simple and integrated type items produced less bias compared to an unbalanced item distribution. The parameterized EM algorithm stabilized the estimates for cells with small sizes in the CPTs, providing minimal KL distance values. However, the classification agreement between generated and estimated models was low.
 Date Issued
 2019
 Identifier
 2019_Summer_Tingir_fsu_0071E_15106
 Format
 Thesis
 Title
 Semiparametric Bayesian Regression Models for Skewed Responses.
 Creator

Bhingare, Apurva Chandrashekhar, Sinha, Debajyoti, Shanbhag, Sachin, Linero, Antonio Ricardo, Bradley, Jonathan R., Pati, Debdeep, Lipsitz, Stuart, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

It is common to encounter skewed response data in medicine, epidemiology and health care studies. Methodology needs to be devised to overcome the natural difficulties that occur in analyzing such data, particularly when it is multivariate. Existing Bayesian statistical methods to deal with skewed data are mostly fully parametric. We propose novel semiparametric Bayesian methods to model and analyze such data. These methods make minimal assumptions about the true form of the distribution and structure of the observed data. Through examples from real-life studies, we demonstrate practical advantages of our semiparametric Bayesian methods over the existing methods. For many real-life studies with skewed multivariate responses, the level of skewness and association structure assumptions are essential for evaluating the covariate effects on the response and its predictive distribution. First, we present a novel semiparametric multivariate model class leading to a theoretically justifiable semiparametric Bayesian analysis of multivariate skewed responses. Like the multivariate Gaussian densities, this multivariate model is closed under marginalization, allows a wide class of multivariate associations, and has meaningful physical interpretations of skewness levels and covariate effects on the marginal density. Compared to existing models, our model enjoys several desirable practical properties, including Bayesian computing via available software, and assurance of consistent Bayesian estimates of parameters and the nonparametric error density under a set of plausible prior assumptions. We introduce a particular parametric version of the model as an alternative to various parametric skew-symmetric models available in the literature. We illustrate the practical advantages of our methods over existing parametric alternatives via application to a clinical study to assess periodontal disease and through a simulation study. Unlike most of the models existing in the literature, this class of models advocates a latent variable approach, making implementation under the Bayesian paradigm straightforward via standard software for MCMC computation such as WinBUGS/JAGS. Although JAGS and WinBUGS are flexible MCMC engines, they tend to be rather slow for complex model structures. We offer an alternative tool to implement the aforementioned parametric version of the models using PROC MCMC in SAS. Our goal is to facilitate and encourage more extensive implementation of these models. To achieve this goal, we illustrate the implementation using PROC MCMC in SAS via examples from real life and provide fully annotated SAS code. In large-scale national surveys, we often come across skewed data as well as semicontinuous data, that is, data characterized by a point mass at zero (degenerate) and a right-skewed continuous distribution on positive support. For example, in the Medical Expenditure Panel Survey (MEPS), the variable total health care expenditure (i.e., the response) for nonusers of the health care services is zero, whereas for the users it has a continuous distribution typically skewed towards the right. We provide an overview of the existing models and methods to analyze such data.
 Date Issued
 2018
 Identifier
 2018_Sp_Bhingare_fsu_0071E_14468
 Format
 Thesis
 Title
 Transformation Models for Survival Data Analysis and Applications.
 Creator

Liu, Yang, Niu, XuFeng, Lloyd, Donald, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

It is often assumed that all uncensored subjects will eventually experience the event of interest in standard survival models. However, in some situations when the event considered is not death, it will never occur for a proportion of subjects. Survival models with a cure fraction are becoming popular in analyzing this type of study. We propose a generalized transformation model motivated by Zeng et al.'s (2006) transformed proportional time cure model. In our proposed model, fractional polynomials are used instead of the simple linear combination of the covariates. The proposed models give us more flexibility without losing any good properties of the original model, such as asymptotic consistency and asymptotic normality of the regression coefficients. The proposed model will better fit data where the relationship between a response variable and covariates is nonlinear. We also provide a power selection procedure based on the likelihood function. A simulation study is carried out to show the accuracy of the proposed power selection procedure. The proposed models are applied to coronary heart disease and cancer related medical data from both observational cohort studies and clinical trials.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1155
 Format
 Thesis
 Title
 Interrelating of Longitudinal Processes: An Empirical Example.
 Creator

RoyalThomas, Tamika Y. N., McGee, Daniel, Levenson, Cathy, Sinha, Debajyoti, Osmond, Clive, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The Barker Hypothesis states that maternal and 'in utero' attributes during pregnancy affect a child's cardiovascular health throughout life. We present an analysis of a unique longitudinal dataset from Jamaica that consists of three longitudinal processes: (i) Maternal longitudinal process: blood pressure and anthropometric measurements at seven time points on the mother during pregnancy. (ii) In utero measurements: ultrasound measurements of the fetus taken at six time points during pregnancy. (iii) Birth to present process: children's anthropometric and blood pressure measurements at 24 time points from birth to 14 years. A comprehensive analysis of the interrelationship of these three longitudinal processes is presented using joint modeling for multivariate longitudinal profiles. We propose a new methodology for examining a child's cardiovascular risk by extending a current view of likelihood estimation. Joint modeling of multivariate longitudinal profiles is carried out, and the extension of the traditional likelihood method is applied and compared to the maximum likelihood estimates. Our main goal is to examine whether the process in mothers predicts fetal development, which in turn predicts the future cardiovascular health of the children. One of the difficulties with 'in utero' and early childhood data is that certain variables are highly correlated, so dimension reduction techniques are quite applicable in this scenario. Principal component analysis (PCA) is used to create a lower-dimensional set of uncorrelated variables, which is then used in a longitudinal analysis setting. These principal components are then used in an optimal linear mixed model for longitudinal data, which indicates that in utero and early childhood attributes predict the future cardiovascular health of the children. This dissertation has added a body of knowledge to developmental origins of adult diseases and has supplied some significant results while utilizing a rich diversity of statistical methodologies.
 Date Issued
 2011
 Identifier
 FSU_migr_etd1792
 Format
 Thesis
 Title
 Semiparametric Survival Analysis Using Models with Log-Linear Median.
 Creator

Lin, Jianchang, Sinha, Debajyoti, Zhou, Yi, Lipsitz, Stuart, McGee, Dan, Niu, XuFeng, She, Yiyuan, Department of Statistics, Florida State University
 Abstract/Description

First, we present two novel semiparametric survival models with log-linear median regression functions for right censored survival data. These models are useful alternatives to the popular Cox (1972) model and linear transformation models (Cheng et al., 1995). Compared to existing semiparametric models, our models have many important practical advantages, including interpretation of the regression parameters via the median and the ability to address heteroscedasticity. We demonstrate that our modeling techniques facilitate the ease of prior elicitation and computation for both parametric and semiparametric Bayesian analysis of survival data. We illustrate the advantages of our modeling, as well as model diagnostics, via reanalysis of a small-cell lung cancer study. Results of our simulation study provide further guidance regarding appropriate modeling in practice. Our second goal is to develop the methods of analysis and associated theoretical properties for interval censored and current status survival data. These new regression models use a log-linear regression function for the median. We present frequentist and Bayesian procedures for estimation of the regression parameters. Our model is a useful and practical alternative to the popular semiparametric models which focus on modeling the hazard function. We illustrate the advantages and properties of our proposed methods via reanalyzing a breast cancer study. Our other aim is to develop a model which is able to account for the heteroscedasticity of the response, together with robust parameter estimation and outlier detection using sparsity penalization. Some preliminary simulation studies have been conducted to compare the performance of the proposed model and the existing median lasso regression model. Considering the estimation bias, mean squared error and other identification benchmark measures, our proposed model performs better than the competing frequentist estimator.
 Date Issued
 2012
 Identifier
 FSU_migr_etd4992
 Format
 Thesis
 Title
 Practical Methods for Equivalence and Non-Inferiority Studies with Survival Response.
 Creator

Martinez, Elvis Englebert, Sinha, Debajyoti, Levenson, Cathy W., Chicken, Eric, Lipsitz, Stuart, McGee, Daniel, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Determining the equivalence or noninferiority of a new drug (test drug) with an existing treatment (reference drug) is an important topic of statistical interest. Wellek (1993) pioneered the way for logrank based equivalence and noninferiority testing by formulating a testing procedure using the proportional hazards model (PHM) of Cox (1972). In many equivalence and noninferiority trials, the ratio of the two hazard functions may converge to one rather than remaining constant over all time points. In this case, the proportional odds survival model (POSM) of Bennett (1983) will be more appropriate than Cox's PHM assumption. We show that, in both kinds of trial, when the wrong modeling assumption is made and Cox's PH assumption is violated, the popular procedure of Wellek (1993) has an inflated type I error. In contrast, our proposed POS model based equivalence and noninferiority tests maintain the practitioner's desired 5% level of significance regardless of the underlying modeling assumption (e.g., Cox, 1972; Wellek, 1993). Furthermore, for noninferiority trials, we introduce a method to determine the optimal sample size required when a desired power and type I error are specified and the data follow the POSM of Bennett (1983). For both kinds of trial, we present simulation studies showing the finite-sample approximations of power and type I error rates, both when the underlying modeling assumptions are correctly specified and when they are misspecified.
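The remark about converging hazards can be made concrete with a small sketch. It assumes one common parameterization of the proportional odds survival model, in which the odds of failure in the test arm are θ times those in the reference arm: (1 − S1(t))/S1(t) = θ (1 − S0(t))/S0(t). Under that assumption the hazard ratio is h1(t)/h0(t) = θ / (S0(t) + θ (1 − S0(t))), which starts at θ and tends to one as S0(t) falls to zero. The function name, the exponential reference survival curve, and θ = 2 are illustrative choices, not taken from the thesis.

```python
import math

def hazard_ratio_under_po(s0, theta):
    """Hazard ratio h1(t)/h0(t) implied by proportional odds of failure.

    s0 is the reference-arm survival probability S0(t); theta is the
    (assumed constant) ratio of failure odds, test arm versus reference arm.
    """
    return theta / (s0 + theta * (1.0 - s0))

theta = 2.0
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
# Hypothetical reference arm with unit-rate exponential survival, S0(t) = exp(-t).
ratios = [hazard_ratio_under_po(math.exp(-t), theta) for t in times]

for t, r in zip(times, ratios):
    print(f"t = {t:4.1f}  hazard ratio = {r:.4f}")
```

Because the ratio is not constant in t, a test calibrated under proportional hazards is working with a misspecified model here, which is the setting the abstract describes.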
 Date Issued
 2014
 Identifier
 FSU_migr_etd9214
 Format
 Thesis
 Title
 Methods of Block Thresholding Across Multiple Resolution Levels in Adaptive Wavelet Estimation.
 Creator

Schleeter, Tiffany M., Chicken, Eric, Clark, Kathleen M., Pati, Debdeep, Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Blocking methods of thresholding have demonstrated many advantages over term-by-term methods in adaptive wavelet estimation. These blocking methods are resolution-level specific, meaning the coefficients are grouped together only within the same resolution level. Techniques have not yet been proposed for blocking across multiple resolution levels, and existing techniques do not take into consideration varying shapes of blocks for wavelet coefficients. Here, several methods of block thresholding across multiple resolution levels are described. Various simulation studies analyze the use of these methods among nonparametric functions, including comparisons to other blocking and nonblocking wavelet thresholding methods. The introduction of this new technique raises the question of when it will be advantageous over resolution-level specific methods. Another simulation study demonstrates a method of statistically selecting when blocking across resolution levels is beneficial over traditional techniques. Additional analysis will determine how effective the automated selection method is, both in simulation and in practice.
 Date Issued
 2015
 Identifier
 FSU_migr_etd9677
 Format
 Thesis
 Title
 The Table of the Transient World: Long-Term Historical Process and the Culture of Mass Consumption in Ancient Rome and Italy, 200 BCE-20 CE.
 Creator

CollinsElliott, Stephen A., De Grummond, Nancy T., Levenson, David, Marincola, John, Pullen, Daniel, Sinha, Debajyoti, Stone, David, Department of Classics, Florida State...
Collins-Elliott, Stephen A., De Grummond, Nancy T., Levenson, David, Marincola, John, Pullen, Daniel, Sinha, Debajyoti, Stone, David, Department of Classics, Florida State University
 Abstract/Description

This dissertation questions the dominant paradigm of a 'cultural revolution' in ancient Rome and Italy as a product of the Augustan age. It also reconsiders the notions that aristocratic elites were cultural trendsetters during the last two centuries BCE and that the majority of ancient Italians were largely passive as the sweeping changes of the period unfolded. Breaking new ground with sophisticated quantitative analyses, the dissertation conducts a long-term comparative study of food consumption among the mass society throughout Italy to see whether popular cultural habits come toward any point of homogeneity in the Augustan age. It illustrates how macroregional groups (Etruria, Apulia, and Latium) reveal a distinct tendency toward Italian homogeneity that transpires slowly over time, starting around the mid-second century BCE. Apulian sites, moreover, begin to diverge from this trend starting in the first century CE, showing that the maximum point of cultural unification occurred under Augustus but that it was not permanent. These results thus not only complicate the narrative of Italian unification and illustrate the different levels into which culture can be particularized, but they also provide a context for the agency of Augustus and the members of his regime, in terms of their ability to exact or perpetrate cultural change: leaders and the elites of a social order are granted their authority, to a degree, through their own making, but the maintenance of that power depends upon a concession of power on the part of the rest of society. The way in which the proliferation of the symbols of power found common purchase within Italy corresponds with an era of a shared culture reflected in the habits of mass consumption. The success of the Augustan age, therefore, and its proliferation of symbols of power, should be considered in light of this preexisting long-term sociohistorical trend.
 Date Issued
 2014
 Identifier
 FSU_migr_etd8760
 Format
 Thesis
 Title
 Goodness-of-Fit Tests for Logistic Regression.
 Creator

Wu, Sutan, McGee, Dan L., Zhang, Jinfeng, Hurt, Myra, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

The generalized linear model, and particularly the logistic model, is widely used in public health, medicine, and epidemiology. Goodness-of-fit tests for these models are commonly used to describe how well a proposed model fits a set of observations. These different goodness-of-fit tests all have individual advantages and disadvantages. In this thesis, we mainly consider the performance of the Hosmer-Lemeshow test, Pearson's chi-square test, the unweighted sum of squares test, and the cumulative residual test. We compare their performance in a series of empirical studies as well as particular simulation scenarios. We conclude that the unweighted sum of squares test and the cumulative sums of residuals test give better overall performance than the other two. We also conclude that the commonly suggested practice of assuming that a p-value less than 0.15 is an indication of lack of fit at the initial steps of model diagnostics should be adopted. Additionally, D'Agostino et al. presented the relationship between stacked logistic regression and the Cox regression model in the Framingham Heart Study. In our future study, we will therefore examine the possibility and feasibility of adapting these goodness-of-fit tests to the Cox proportional hazards model using stacked logistic regression.
 Date Issued
 2010
 Identifier
 FSU_migr_etd0693
 Format
 Thesis
 Title
 An Examination of the Concept of Frailty in the Elderly.
 Creator

Griffin, Felicia R., McGee, Daniel, Slate, Elizabeth H., Hurt, Myra M., Sinha, Debajyoti, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Frailty has been defined as a state of increased vulnerability to adverse outcomes. The concept of frailty has been centered around counting the number of deficits in health, which can be diseases, disabilities, or symptoms. However, there is no consensus on how it should be quantified. Frailty has been considered synonymous with functional status and comorbidity, but these may be distinct concepts requiring different management. We compared two methods of defining a frailty phenotype: a count of deficits and a weighted score of health deficits incorporating the strength of association between each deficit and mortality. The strength of association was estimated using proportional hazards coefficients. The study uses data from the third National Health and Nutrition Examination Survey. We compared the two methodologies: frailty was associated with age, gender, ethnicity, and having comorbid chronic diseases. The predictive association of frail status with the incidence of death over 12 years was significant for the weighted phenotype, with hazard ratio 3.46, 95% confidence interval (CI) (2.78, 4.30), unadjusted and hazard ratio 1.89, 95% CI (1.57, 2.30), adjusted. The predictive association of frail status with the incidence of death was also significant for the unweighted phenotype, with a lower hazard ratio of 3.13, 95% CI (2.53, 3.87), unadjusted and hazard ratio of 1.40, 95% CI (1.20, 1.65), adjusted. When examining the association of frailty and cause-specific death, frailty was associated with a higher risk of death due to CHD, stroke, CVD, and other causes for both males and females (unadjusted). However, after adjusting for various covariates, only death due to CHD, CVD, and other causes remained significant for both males and females. When comparing the definition of osteoporosis or low bone mass to the model of frailty, femoral neck T-score declined significantly with increasing levels of frailty.
There was overlap and uniqueness in the definitions of frailty, functional status, and comorbidity that require further research. Understanding the causal interrelationship could help explain why these three conditions are likely to co-occur. In addition, there is an association between frailty and dietary quality based on the Mediterranean diet. This study provides a more valuable understanding of the complex concept of frailty and the role of latent variables in this concept. This study also introduces a weighted score for defining a frailty phenotype that is more strongly predictive of mortality, and hence has potential to improve targeting and care of today's elderly.
 Date Issued
 2015
 Identifier
 FSU_migr_etd9342
 Format
 Thesis
 Title
 Some New Methods for Design and Analysis of Survival Data.
 Creator

Wang, Wenting, Sinha, Debajyoti, Arjmandi, Bahram H., McGee, Dan, Niu, Xufeng, Yu, Kai, Department of Statistics, Florida State University
 Abstract/Description

For survival outcomes, statistical equivalence tests to show that a new treatment is therapeutically equivalent to a standard treatment are usually based on the Cox (1972) proportional hazards assumption. We present an alternative method based on the linear transformation model (LTM) for two treatment arms, and show the advantages of using this equivalence test instead of tests based on Cox's model. LTM is a very general class of models that includes the proportional odds survival model (POSM). We present a sufficient condition to check whether logrank-based tests have inflated Type I error rates. We show that POSM and some other commonly used survival models within the LTM class all satisfy this condition. Simulation studies show that repeated use of our test instead of logrank-based tests will be a safer statistical practice. Our second goal is to develop a practical Bayesian model for survival data with a high-dimensional covariate vector. We develop the Information Matrix (IM) and Information Matrix Ridge (IMR) priors for commonly used survival models, including Cox's model and the cure rate model proposed by Chen et al. (1999), and examine many desirable theoretical properties, including sufficient conditions for the existence of the moment generating functions for these priors and the corresponding posterior distributions. The performance of these priors in practice is compared with some competing priors via the Bayesian analysis of a study that investigates the relationship between lung cancer survival time and a large number of genetic markers.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1248
 Format
 Thesis