Search results: Department of Statistics
 Title
 AP Student Visual Preferences for Problem Solving.
 Creator

Swoyer, Liesl, Department of Statistics
 Abstract/Description

The purpose of this study is to explore the mathematical preference of high school AP Calculus students by examining their tendencies for using differing methods of thought. A student's preferred mode of thinking was measured on a scale ranging from a preference for analytical thought to a preference for visual thought as they completed derivative and antiderivative tasks presented both algebraically and graphically. This relates to previous studies by continuing to analyze the factors that have been found to mediate the students' performance and preference with regard to a variety of calculus tasks. Data were collected by Dr. Erhan Haciomeroglu at the University of Central Florida. Students' preferences were not affected by gender. Students were found to approach graphical and algebraic tasks similarly, without any significant change with regard to the derivative or antiderivative nature of the tasks. Highly analytic and highly visual students revealed the same proportion of change in visuality as harmonic students when more difficult calculus tasks were encountered. Thus, a strong preference for visual thinking when completing algebraic tasks was not the determining factor of their preferred method of thinking when approaching graphical tasks.
 Date Issued
 2012
 Identifier
 FSU_migr_uhm0052
 Format
 Thesis
 Title
 Generalized Mahalanobis Depth In Point Process And Its Application In Neural Coding.
 Creator

Liu, Shuyi, Wu, Wei
 Abstract/Description

In this paper, we propose to generalize the notion of depth in temporal point process observations. The new depth is defined as a weighted product of two probability terms: (1) the number of events in each process, and (2) the center-outward ranking on the event times conditioned on the number of events. In this study, we adopt the Poisson distribution for the first term and the Mahalanobis depth for the second term. We propose an efficient bootstrapping approach to estimate parameters in the defined depth. In the case of Poisson process, the observed events are order statistics where the parameters can be estimated robustly with respect to sample size. We demonstrate the use of the new depth by ranking realizations from a Poisson process. We also test the new method in classification problems using simulations as well as real neural spike train data. It is found that the new framework provides more accurate and robust classifications as compared to commonly used likelihood methods.
 Date Issued
 2017-06
 Identifier
 FSU_libsubv1_wos_000408732000021, 10.1214/17-AOAS1030
 Format
 Citation
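The depth construction summarized in the abstract above (a Poisson term for the event count times a Mahalanobis depth for the event times given the count) can be illustrated with a minimal sketch. It assumes a homogeneous Poisson process on [0, T] with known rate, so that the k event times are distributed as uniform order statistics with known mean and covariance. The function names, the exponent `r` used as the weight, and the use of the raw Poisson probability are assumptions made for illustration, not the authors' exact definition; the bootstrap parameter estimation mentioned in the abstract is not shown.

```python
import math
import numpy as np

def poisson_pmf(k, mu):
    """Poisson probability of k events when the expected count is mu > 0."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def uniform_order_stat_moments(k, T):
    """Mean and covariance of k ordered uniform event times on [0, T]."""
    i = np.arange(1, k + 1)
    mean = T * i / (k + 1)
    lo = np.minimum.outer(i, i)  # min(i, j)
    hi = np.maximum.outer(i, i)  # max(i, j)
    cov = T**2 * lo * (k + 1 - hi) / ((k + 1) ** 2 * (k + 2))
    return mean, cov

def generalized_depth(times, rate, T, r=1.0):
    """Hypothetical depth: (Poisson term)^r * Mahalanobis depth of the times.

    The weighting by the exponent r is an illustrative assumption.
    """
    times = np.sort(np.asarray(times, dtype=float))
    k = times.size
    count_term = poisson_pmf(k, rate * T) ** r
    if k == 0:
        # No event times to rank, so only the count term remains.
        return count_term
    mean, cov = uniform_order_stat_moments(k, T)
    diff = times - mean
    d2 = float(diff @ np.linalg.solve(cov, diff))  # squared Mahalanobis distance
    return count_term / (1.0 + d2)

# Evenly spread event times sit at the center; clustered times rank lower.
central = generalized_depth([0.2, 0.4, 0.6, 0.8], rate=5.0, T=1.0)
clustered = generalized_depth([0.01, 0.02, 0.03, 0.04], rate=5.0, T=1.0)
print(central > clustered)
```

Ranking a set of realizations then amounts to sorting them by this value, with larger values treated as more central.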
 Title
 Are screening methods useful in feature selection?: An empirical study.
 Creator

Wang, Mingyuan, Barbu, Adrian
 Abstract/Description

Filter or screening methods are often used as a preprocessing step for reducing the number of variables used by a learning algorithm in obtaining a classification or regression model. While there are many such filter methods, there is a need for an objective evaluation of these methods. Such an evaluation is needed to compare them with each other and also to answer whether they are at all useful, or whether a learning algorithm could do a better job without them. For this purpose, many popular screening methods are partnered in this paper with three regression learners and five classification learners and evaluated on ten real datasets to obtain accuracy criteria such as R-square and area under the ROC curve (AUC). The obtained results are compared through curve plots and comparison tables in order to find out whether screening methods help improve the performance of learning algorithms and how they fare against each other. Our findings revealed that the screening methods were useful in improving the prediction of the best learner on two regression and two classification datasets out of the ten datasets evaluated.
 Date Issued
 2019-09-11
 Identifier
 FSU_libsubv1_scholarship_submission_1568294804_edd95dc1_Comp, 10.1371/journal.pone.0220842
 Format
 Set of related objects
 Title
 Power Of Two.
 Creator

Piekarewicz, J., Linero, A. R., Giuliani, P., Chicken, E.
 Abstract/Description

Background: Besides its intrinsic value as a fundamental nuclear-structure observable, the weak-charge density of Pb-208, a quantity that is closely related to its neutron distribution, is of fundamental importance in constraining the equation of state of neutron-rich matter. Purpose: To assess the impact that a second electroweak measurement of the weak-charge form factor of Pb-208 may have on the determination of its overall weak-charge density. Methods: Using the two putative experimental values of the form factor, together with a simple implementation of Bayes' theorem, we calibrate a theoretically sound, yet surprisingly little known, symmetrized Fermi function that is characterized by a density and form factor that are both known exactly in closed form. Results: Using the charge form factor of Pb-208 as a proxy for its weak-charge form factor, we demonstrate that using only two experimental points to calibrate the symmetrized Fermi function is sufficient to accurately reproduce the experimental charge form factor over a significant range of momentum transfers. Conclusions: It is demonstrated that a second measurement of the weak-charge form factor of Pb-208, supplemented by a robust theoretical input in the form of the symmetrized Fermi function, would place significant constraints on the neutron distribution of Pb-208. In turn, such constraints will become vital in the interpretation of hadronic experiments that will probe the neutron-rich skin of exotic nuclei at future radioactive beam facilities.
 Date Issued
 2016-09-15
 Identifier
 FSU_libsubv1_wos_000383149400001, 10.1103/PhysRevC.94.034316
 Format
 Citation
 Title
 Randomized Sketches For Kernels: Fast And Optimal Nonparametric Regression.
 Creator

Yang, Yun, Pilanci, Mert, Wainwright, Martin J.
 Abstract/Description

Kernel ridge regression (KRR) is a standard method for performing nonparametric regression over reproducing kernel Hilbert spaces. Given n samples, the time and space complexity of computing the KRR estimate scale as O(n^3) and O(n^2), respectively, and so are prohibitive in many cases. We propose approximations of KRR based on m-dimensional randomized sketches of the kernel matrix, and study how small the projection dimension m can be chosen while still preserving minimax optimality of the approximate KRR estimate. For various classes of randomized sketches, including those based on Gaussian and randomized Hadamard matrices, we prove that it suffices to choose the sketch dimension m proportional to the statistical dimension (modulo logarithmic factors). Thus, we obtain fast and minimax optimal approximations to the KRR estimate for nonparametric regression. In doing so, we prove a novel lower bound on the minimax risk of kernel regression in terms of the localized Rademacher complexity.
 Date Issued
 2017-06
 Identifier
 FSU_libsubv1_wos_000404395900003, 10.1214/16-AOS1472
 Format
 Citation
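The sketching idea in the abstract above can be illustrated with a minimal NumPy example: instead of solving for n kernel weights, the weights are restricted to the form S^T alpha for a random m x n Gaussian sketch S, so only an m x m system is solved. The objective scaling used here, (1/2n)||y - K w||^2 + lam * w^T K w, the RBF kernel, and all function names are assumptions chosen for illustration, not the paper's exact formulation. Note that this sketch still forms the full kernel matrix, so it shows the reduced solve rather than the full computational savings.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel matrix between 1-D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-gamma * d2)

def sketched_krr_fit(K, y, m, lam, rng):
    """Approximate KRR weights w = S^T alpha using an m x n Gaussian sketch S.

    Minimizes (1/2n)||y - K S^T alpha||^2 + lam * alpha^T S K S^T alpha,
    a parametrization assumed here for illustration.
    """
    n = K.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SK = S @ K                                   # m x n
    A = SK @ SK.T / n + 2.0 * lam * (SK @ S.T)   # m x m normal equations
    b = SK @ y / n
    # Least squares is used because A can be numerically rank-deficient.
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return S.T @ alpha                           # weights on the n training points

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
K = rbf_kernel(x, x, gamma=10.0)
omega = sketched_krr_fit(K, y, m=30, lam=1e-3, rng=rng)
fitted = K @ omega
mse = float(np.mean((fitted - y) ** 2))
print(mse)
```

The sketch dimension m = 30 is an arbitrary choice for this toy problem; the abstract's result concerns how small m can be taken relative to the statistical dimension.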
 Title
 Why Deep Learning Works.
 Creator

Brahma, Pratik Prabhanjan, Wu, Dapeng, She, Yiyuan
 Abstract/Description

Deep hierarchical representations of the data have been found to provide more informative features for several machine learning applications. In addition, multilayer neural networks surprisingly tend to achieve better performance when they are subject to unsupervised pretraining. The boom in deep learning motivates researchers to identify the factors that contribute to its success. One possible reason identified is the flattening of manifold-shaped data in higher layers of neural networks. However, it is not clear how to measure the flattening of such manifold-shaped data and what amount of flattening a deep neural network can achieve. For the first time, this paper provides quantitative evidence to validate the flattening hypothesis. To achieve this, we propose a few quantities for measuring manifold entanglement under certain assumptions and conduct experiments with both synthetic and real-world data. Our experimental results validate the proposition and lead to new insights on deep learning.
 Date Issued
 2016-10
 Identifier
 FSU_libsubv1_wos_000384644000001, 10.1109/TNNLS.2015.2496947
 Format
 Citation
 Title
 Regression Methods for Skewed and Heteroscedastic Response with High-Dimensional Covariates.
 Creator

Wang, Libo, Sinha, Debajyoti, Taylor, Miles G., Pati, Debdeep, She, Yiyuan, Yang, Yun (Professor of Statistics), Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The rise of studies with high-dimensional potential covariates has invited a renewed interest in dimension reduction that promotes more parsimonious models, ease of interpretation and computational tractability. However, current variable selection methods restricted to continuous response often assume Gaussian response for methodological as well as theoretical developments. In this thesis, we consider regression models that induce sparsity, gain prediction power, and accommodate response distributions beyond Gaussian with common variance. The first part of this thesis is a transform-both-side Bayesian variable selection model (TBS) which allows skewness, heteroscedasticity and extremely heavy-tailed responses. Our method develops a framework which facilitates computationally feasible inference in spite of inducing nonlocal priors on the original regression coefficients. Even if the transformed conditional mean is no longer linear with respect to covariates, we still prove the consistency of our Bayesian TBS estimators. Simulation studies and real data analysis demonstrate the advantages of our methods. Another main part of this thesis deals with the above challenges from a frequentist standpoint. This model incorporates a penalized likelihood to accommodate skewed response, arising from an epsilon-skew-normal (ESN) distribution. With suitable optimization techniques to handle this two-piece penalized likelihood, our method demonstrates substantial gains in sensitivity and specificity even under high-dimensional settings. We conclude this thesis with a novel Bayesian semiparametric modal regression method along with its implementation and simulation studies.
 Date Issued
 2017
 Identifier
 FSU_SUMMER2017_Wang_fsu_0071E_13950
 Format
 Thesis
 Title
 First Steps towards Image Denoising under Low-Light Conditions.
 Creator

Anaya, Josue Samuel, Meyer-Baese, Anke, Linero, Antonio, Zhang, Jinfeng, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

The application of noise reduction, or denoising, to an image is a very important topic in the field of computer vision and computational photography. Many popular state-of-the-art denoising algorithms are trained and evaluated using images with artificial noise. These trained algorithms and their evaluations on synthetic data may lead to incorrect conclusions about their performances. In this paper we will first introduce a benchmark dataset of uncompressed color images corrupted by natural noise due to low-light conditions, together with spatially and intensity-aligned low noise images of the same scenes. The dataset contains over 100 scenes and more than 500 images, including both RAW formatted images and 8-bit BMP pixel and intensity aligned images. We will also introduce a method for estimating the true noise level in each of our images, since even the low noise images contain a small amount of noise. Through this noise estimation method we will construct a convolutional neural network model for automatic noise estimation in single noisy images. Finally, we improve upon a state-of-the-art denoising algorithm, Block Matching through 3D filtering (BM3D), by learning a specialized denoising parameter using another developed convolutional neural network.
 Date Issued
 2016
 Identifier
 FSU_FA2016_Anaya_fsu_0071E_13600
 Format
 Thesis
 Title
 Generalized Mahalanobis Depth in Point Process and Its Application in Neural Coding and Semi-Supervised Learning in Bioinformatics.
 Creator

Liu, Shuyi, Wu, Wei, Wang, Xiaoqiang, Zhang, Jinfeng, Mai, Qing, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

In the first project, we propose to generalize the notion of depth in temporal point process observations. The new depth is defined as a weighted product of two probability terms: 1) the number of events in each process, and 2) the center-outward ranking on the event times conditioned on the number of events. In this study, we adopt the Poisson distribution for the first term and the Mahalanobis depth for the second term. We propose an efficient bootstrapping approach to estimate parameters in the defined depth. In the case of Poisson process, the observed events are order statistics where the parameters can be estimated robustly with respect to sample size. We demonstrate the use of the new depth by ranking realizations from a Poisson process. We also test the new method in classification problems using simulations as well as real neural spike train data. It is found that the new framework provides more accurate and robust classifications as compared to commonly used likelihood methods. In the second project, we demonstrate the value of semi-supervised dimension reduction in the clinical area. The advantage of semi-supervised dimension reduction is very easy to understand. The semi-supervised dimension reduction method adopts the unlabeled data information to perform dimension reduction, and it can be applied to help build a more precise prediction model compared with common supervised dimension reduction techniques. After a thorough comparison with dimension embedding methods that use labeled data only, we show the improvement of semi-supervised dimension reduction with unlabeled data in the breast cancer chemotherapy clinical area. In our semi-supervised dimension reduction method, we not only explore adding unlabeled data to linear dimension reduction such as PCA, but also explore semi-supervised nonlinear dimension reduction, such as semi-supervised LLE and semi-supervised Isomap.
 Date Issued
 2018
 Identifier
 2018_Sp_Liu_fsu_0071E_14367
 Format
 Thesis
 Title
 Embracing the Generalized Propensity Score Method: Measuring the Effect of Library Usage on First-Time-In-College Student Academic Success.
 Creator

Mao, Jingying, Kinsley, Kirsten
 Abstract/Description

This research focuses on First-Time-in-College (FTIC) student library usage during the first academic year, measured as number of visits (frequency) and length of stay (duration), and how that might affect first-term grade point average (GPA) and first-year retention, using the generalized propensity score (GPS). We also want to demonstrate that GPS is a proper tool that researchers in libraries can use to make causal inferences about the effects of library usage on student academic success outcomes in observational studies.
 Date Issued
 2017-11-09
 Identifier
 FSU_libsubv1_scholarship_submission_1514926919_ab4848cb, 10.18438/B8BH35
 Format
 Citation
 Title
 Two Studies on the Application of Machine Learning for Biomedical Big Data.
 Creator

Lung, Pei-Yau, Zhang, Jinfeng, Liu, Xiuwen, Barbu, Adrian G., Wu, Wei, Florida State University, College of Arts and Sciences, Department of Statistics
 Abstract/Description

Large volumes of genomic data are being generated, and new scientific discoveries in biomedical research are being made, every day by laboratories in both academia and industry. However, two issues severely affect the usability of so-called biomedical big data: 1) the majority of the public genomic data do not contain enough clinical information, and 2) scientific discoveries are stored in text as unstructured data. This dissertation presents two studies, which address each issue using machine learning methods, in order to maximize the usability of biomedical big data. In the first study, we infer missing clinical information using multiple gene expression data sets and a wide variety of machine learning methods. We proposed a new performance measure, Proportion of Positives which can be predicted with High accuracy (PPH), to evaluate models in terms of their effectiveness in recovering data with missing clinical information. PPH estimates the percentage of data that can be recovered given a desired level of accuracy. The experiment results demonstrate the effectiveness of the predicted clinical information in downstream inference tasks. In the second study, we propose a three-stage computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI-pairs and CPI-triplets from sentences, where a CPI-pair consists of a chemical compound and a protein name, and a CPI-triplet consists of a CPI-pair along with an interaction word describing their relationship. We extract a diverse set of features from sentences, which are used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. Our method performed the best among systems which use non-deep-learning methods, and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods including deep learning.
 Date Issued
 2019
 Identifier
 2019_Summer_Lung_fsu_0071E_15134
 Format
 Thesis
 Title
 Personalized Chemotherapy Selection For Breast Cancer Using Gene Expression Profiles.
 Creator

Yu, Kaixian, Sang, Qing-Xiang Amy, Lung, Pei-Yau, Tan, Winston, Lively, Ty, Sheffield, Cedric, Bou-Dargham, Mayassa J., Liu, Jun S., Zhang, Jinfeng
 Abstract/Description

Choosing the optimal chemotherapy regimen is still an unmet medical need for breast cancer patients. In this study, we reanalyzed data from seven independent data sets with a total of 1079 breast cancer patients. The patients were treated with three different types of commonly used neoadjuvant chemotherapies: anthracycline alone, anthracycline plus paclitaxel, and anthracycline plus docetaxel. We developed random forest models with variable selection using both genetic and clinical variables to predict the response of a patient using pCR (pathological complete response) as the measure of response. The models were then used to reassign an optimal regimen to each patient to maximize the chance of pCR. An independent validation was performed where each independent study was left out during model building and later used for validation. The expected pCR rates of our method are significantly higher than the rates of the best treatments for all the seven independent studies. A validation study on 21 breast cancer cell lines showed that our prediction agrees with their drug-sensitivity profiles. In conclusion, the new strategy, called PRES (Personalized REgimen Selection), may significantly increase response rates for breast cancer patients, especially those with HER2- and ER-negative tumors, who will receive one of the widely accepted chemotherapy regimens.
 Date Issued
 2017-03-03
 Identifier
 FSU_libsubv1_wos_000395286700001, 10.1038/srep43294
 Format
 Citation
 Title
 Automatic stage identification of Drosophila egg chamber based on DAPI images.
 Creator

Jia, Dongyu, Xu, Qiuping, Xie, Qian, Mio, Washington, Deng, Wu-Min
 Abstract/Description

The Drosophila egg chamber, whose development is divided into 14 stages, is a well-established model for developmental biology. However, visual stage determination can be a tedious, subjective and time-consuming task prone to errors. Our study presents an objective, reliable and repeatable automated method for quantifying cell features and classifying egg chamber stages based on DAPI images. The proposed approach is composed of two steps: 1) a feature extraction step and 2) a statistical modeling step. The egg chamber features used are egg chamber size, oocyte size, egg chamber ratio and distribution of follicle cells. Methods for determining the onset of the polytene stage and centripetal migration are also discussed. The statistical model uses linear and ordinal regression to explore the stage-feature relationships and classify egg chamber stages. Combined with machine learning, our method has great potential to enable discovery of hidden developmental mechanisms.
 Date Issued
 2016-01-06
 Identifier
 FSU_libsubv1_wos_000368658200001, 10.1038/srep18850
 Format
 Citation
 Title
 Prediction and Testing for Non-Parametric Random Function Signals in a Complex System.
 Creator

Hill, Paul C., Chicken, Eric, Klassen, Eric, Niu, Xufeng, Barbu, Adrian, Department of Statistics, Florida State University
 Abstract/Description

Methods employed in the construction of prediction bands for continuous curves require a different approach to those used for a data point. In many cases, the underlying function is unknown and thus a distribution-free approach which preserves sufficient coverage for the entire signal is necessary in the signal analysis. This paper discusses three methods for the formation of (1-α)100% bootstrap prediction bands, and their performances are compared through the coverage probabilities obtained for each technique. Bootstrap samples are first obtained for the signal and then three different criteria are provided for the removal of 100α% of the curves, resulting in the (1-α)100% prediction band. The first method uses the L1 distance between the upper and lower curves as a gauge to extract the widest bands in the dataset of signals. Also investigated are extractions using the Hausdorff distance between the bounds as well as an adaptation of the bootstrap intervals discussed in Lenhoff et al. (1999). The bootstrap prediction bands each have good coverage probabilities for the continuous signals in the dataset. For a 95% prediction band, the coverages obtained were 90.59%, 93.72% and 95% for the L1 Distance, Hausdorff Distance and the adjusted Bootstrap methods respectively. The methods discussed in this paper have been applied successfully to constructing prediction bands for spring discharge, giving good coverage in each case. Spring discharge measured over time can be considered as a continuous signal, and the ability to predict the future signals of spring discharge is useful for monitoring flow and other issues related to the spring. While in some cases rainfall has been fitted with the gamma distribution, the discharge of the spring, represented as continuous curves, is better approached without assuming any specific distribution. The Bootstrap aspect occurs not in sampling the output discharge curves but rather in simulating the input recharge that enters the spring. Bootstrapping the rainfall as described in this paper allows for adequately creating new samples over different periods of time as well as specific rain events such as hurricanes or drought. The Bootstrap prediction methods put forth in this paper provide an approach that supplies adequate coverage for prediction bands for signals represented as continuous curves. The pathway outlined by the flow of the discharge through the springshed is described as a tree. A nonparametric pairwise test, motivated by the idea of K-means clustering, is proposed to decipher whether there is equality between two trees in terms of their discharges. A large sample approximation is devised for this lower-tail significance test, and test statistics for different numbers of input signals are compared to a generated table of critical values.
 Date Issued
 2012
 Identifier
 FSU_migr_etd4910
 Format
 Thesis
 Title
 Estimation and Sequential Monitoring of Nonlinear Functional Responses Using Wavelet Shrinkage.
 Creator

Cuevas, Jordan, Chicken, Eric, Sobanjo, John, Niu, Xufeng, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

Statistical process control (SPC) is widely used in industrial settings to monitor processes for shifts in their distributions. SPC is generally thought of in two distinct phases: Phase I, in which historical data is analyzed in order to establish an incontrol process, and Phase II, in which new data is monitored for deviations from the incontrol form. Traditionally, SPC had been used to monitor univariate (multivariate) processes for changes in a particular parameter (parameter vector)....
Statistical process control (SPC) is widely used in industrial settings to monitor processes for shifts in their distributions. SPC is generally thought of in two distinct phases: Phase I, in which historical data is analyzed in order to establish an in-control process, and Phase II, in which new data is monitored for deviations from the in-control form. Traditionally, SPC has been used to monitor univariate (multivariate) processes for changes in a particular parameter (parameter vector). Recently, however, technological advances have resulted in processes in which each observation is actually an n-dimensional functional response (referred to as a profile), where n can be quite large. Additionally, these profiles often cannot be adequately represented parametrically, making traditional SPC techniques inapplicable. This dissertation starts by addressing the problem of nonparametric function estimation, which would be used to analyze process data in a Phase I setting. The translation invariant wavelet estimator (TI) is often used to estimate irregular functions, despite the drawback that it tends to oversmooth jumps. A trimmed translation invariant estimator (TTI) is proposed, of which the TI estimator is a special case. By reducing the point-by-point variability of the TI estimator, TTI is shown to retain the desirable qualities of TI while improving reconstructions of functions with jumps. Attention is then turned to the Phase II problem of monitoring sequences of profiles for deviations from in-control. Two profile monitoring schemes are proposed. The first monitors for changes in the noise variance using a likelihood ratio test based on the highest detail level of wavelet coefficients of the observed profile. The second offers a semiparametric test to monitor for changes in both the functional form and noise variance. Both methods make use of wavelet shrinkage in order to distinguish relevant functional information from noise contamination. Different forms of each of these test statistics are proposed and results are compared via Monte Carlo simulation.
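As a rough illustration of the wavelet shrinkage idea named in this abstract, the sketch below applies a one-level Haar transform, soft-thresholds the detail coefficients, and inverts the transform. The function names, the choice of the Haar wavelet, and the fixed threshold are assumptions made for illustration only; the dissertation's TI and TTI estimators and its monitoring statistics are not reproduced here.

```python
import math


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def denoise(signal, threshold):
    """Hypothetical helper: shrink only the detail coefficients."""
    approx, detail = haar_forward(signal)
    shrunk = [soft_threshold(d, threshold) for d in detail]
    return haar_inverse(approx, shrunk)
```

With a threshold larger than the small detail coefficients, neighbouring samples that differ only slightly are replaced by their pairwise average, while large jumps between pairs are carried by the approximation coefficients and survive.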
 Date Issued
 2012
 Identifier
 FSU_migr_etd4788
 Format
 Thesis
 Title
 Weighted Adaptive Methods for Multivariate Response Models with an HIV/Neurocognitive Application.
 Creator

Geis, Jennifer Ann, She, Yiyuan, Meyer-Baese, Anke, Barbu, Adrian, Bunea, Florentina, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Multivariate response models are increasingly used in almost all fields, with the necessary employment of inferential methods such as Canonical Correlation Analysis (CCA). This requires estimating the number of uncorrelated canonical relationships between the two sets or, equivalently, determining the rank of the coefficient estimator in the multivariate response model. One way to do this is the Rank Selection Criterion (RSC) of Bunea et al., which assumes that the error matrix has independent, constant-variance entries. While this assumption is necessary to show their strong theoretical results, some flexibility is required in practical application; that is, such an assumption cannot always be safely made. What is developed here is the theory that parallels Bunea et al.'s work with the addition of a "decorrelator" weight matrix. One choice for the weight matrix is the residual covariance, but this introduces many issues in practice. A computationally more convenient weight matrix is the sample response covariance. When such a weight matrix is chosen, CCA is directly accessible through this weighted version of RSC, giving rise to an Adaptive CCA (ACCA) with principal proofs for the large-sample setting. However, particular considerations are required for the high-dimensional setting, where similar theory does not hold. What is offered instead are extensive empirical simulations revealing that using the sample response covariance still provides good rank recovery and estimation of the coefficient matrix, and hence also provides good estimation of the number of canonical relationships and variates. It is argued precisely why other versions of the residual covariance, including a regularized version, are poor choices in the high-dimensional setting. Another approach to avoiding these issues is to employ some type of variable selection methodology before applying ACCA. Indeed, any group selection method may be applied prior to ACCA, since variable selection in the multivariate response model is the same as group selection in the univariate response model and thus completely eliminates these high-dimensional concerns. To offer a practical application of these ideas, ACCA is applied to a "large sample" neurocognitive dataset. Then a high-dimensional dataset is generated, to which Group LASSO is first applied before ACCA. This provides a unique perspective on the relationships between cognitive deficiencies in HIV-positive patients and the extensive available neuroimaging measures.
 Date Issued
 2012
 Identifier
 FSU_migr_etd4861
 Format
 Thesis
 Title
 Nonparametric Wavelet Thresholding and Profile Monitoring for Non-Gaussian Errors.
 Creator

McGinnity, Kelly, Chicken, Eric, Hoeflich, Peter, Niu, Xufeng, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

Recent advancements in data collection allow scientists and researchers to obtain massive amounts of information in short periods of time. Often this data is functional and quite complex. Wavelet transforms are popular, particularly in the engineering and manufacturing fields, for handling these types of complicated signals. A common application of wavelets is in statistical process control (SPC), in which one tries to determine as quickly as possible if and when a sequence of profiles has gone out-of-control. However, few wavelet methods have been proposed that do not rely in some capacity on the assumption that the observational errors are normally distributed. This dissertation aims to fill this void by proposing a simple, nonparametric, distribution-free method of monitoring profiles and estimating changepoints. Using only the magnitudes and location maps of thresholded wavelet coefficients, our method uses the spatial adaptivity property of wavelets to accurately detect profile changes when the signal is obscured by a variety of non-Gaussian errors. Wavelets are also widely used for the purpose of dimension reduction. Applying a thresholding rule to a set of wavelet coefficients results in a "denoised" version of the original function. Once again, existing thresholding procedures generally assume independent, identically distributed normal errors. Thus, the second main focus of this dissertation is a nonparametric method of thresholding that does not assume Gaussian errors, or even that the form of the error distribution is known. We improve upon an existing even-odd cross-validation method by employing block thresholding and level dependence, and show that the proposed method works well on both skewed and heavy-tailed distributions. Such thresholding techniques are essential to the SPC procedure developed above.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7502
 Format
 Thesis
 Title
 The Frequentist Performance of Some Bayesian Confidence Intervals for the Survival Function.
 Creator

Tao, Yingfeng, Huffer, Fred, Okten, Giray, Sinha, Debajyoti, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Estimation of a survival function is a very important topic in survival analysis, with contributions from many authors. This dissertation considers estimation of confidence intervals for the survival function based on right-censored or interval-censored survival data. Most of the methods for estimating pointwise confidence intervals and simultaneous confidence bands of the survival function are reviewed in this dissertation. In the right-censored case, almost all confidence intervals are based in some way on the Kaplan-Meier estimator, first proposed by Kaplan and Meier (1958) and widely used as the nonparametric estimator in the presence of right-censored data. For interval-censored data, the Turnbull estimator (Turnbull (1974)) plays a similar role. For a class of Bayesian models involving Dirichlet priors, Doss and Huffer (2003) suggested several simulation techniques to approximate the posterior distribution of the survival function by using Markov chain Monte Carlo or sequential importance sampling. These techniques lead to probability intervals for the survival function (at arbitrary time points) and its quantiles for both the right-censored and interval-censored cases. This dissertation examines the frequentist properties and general performance of these probability intervals when the prior is noninformative. Simulation studies are used to compare these probability intervals with other published approaches. Extensions of the Doss-Huffer approach are given for constructing simultaneous confidence bands for the survival function and for computing approximate confidence intervals for the survival function based on Edgeworth expansions using posterior moments. The performance of these extensions is studied by simulation.
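For readers unfamiliar with the Kaplan-Meier estimator mentioned in this abstract, the following minimal sketch computes the product-limit estimate for right-censored data. The function name and the input convention (1 for an observed event, 0 for a censored observation) are assumptions for illustration; the sketch does not cover the Turnbull estimator or the Bayesian intervals studied in the dissertation.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times:  observed follow-up times
    events: 1 if the event was observed at that time, 0 if right-censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone who leaves the risk set at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve
```

Censored observations lower the number at risk for later times without producing a step in the curve, which is what distinguishes this estimate from the plain empirical survival function.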
 Date Issued
 2013
 Identifier
 FSU_migr_etd7624
 Format
 Thesis
 Title
 Bayesian Methods for Skewed Response Including Longitudinal and Heteroscedastic Data.
 Creator

Tang, Yuanyuan, Sinha, Debajyoti, Pati, Debdeep, Flynn, Heather, She, Yiyuan, Lipsitz, Stuart, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

Skewed response data are very common in practice, especially in the biomedical area. We begin our work with the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study heteroscedasticity. We extend the transform-both-sides model to the Bayesian variable selection area to handle the univariate skewed response, where the variance of the response is a function of the median. Finally, we propose a novel model to handle the skewed univariate response with flexible heteroscedasticity. For longitudinal studies with a heavily skewed continuous response, statistical models and methods focusing on the mean response are not appropriate. In this paper, we present a partial linear model for the median regression function of a skewed longitudinal response. We develop a semiparametric Bayesian estimation procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We provide justifications for using our methods, including theoretical investigation of the support of the prior, asymptotic properties of the posterior and simulation studies of finite-sample properties. Ease of implementation and advantages of our model and method compared to existing methods are illustrated via analysis of a cardiotoxicity study of children of HIV-infected mothers. Our second aim is to develop a Bayesian simultaneous variable selection and estimation procedure for median regression with a skewed response variable. Our hierarchical Bayesian model can incorporate the advantages of the $l_0$ penalty for skewed and heteroscedastic errors. Some preliminary simulation studies have been conducted to compare the performance of the proposed model and the existing frequentist median lasso regression model. Considering the estimation bias and total square error, our proposed model performs as well as, or better than, competing frequentist estimators. In biomedical studies, the covariates often affect the location, scale and shape of the skewed response distribution. Existing biostatistical literature mainly focuses on mean regression with a symmetric error distribution. While such modeling assumptions and methods are often deemed restrictive and inappropriate for a skewed response, completely nonparametric methods may lack a physical interpretation of the covariate effects. Existing nonparametric methods also lack an easily implementable computational tool. For a skewed response, we develop a novel model accommodating a nonparametric error density that depends on the covariates. The advantages of our semiparametric associated Bayes method include the ease of prior elicitation/determination, an easily implementable posterior computation, theoretically sound properties of the selection of priors and accommodation of possible outliers. The practical advantages of the method are illustrated via a simulation study and an analysis of a real-life epidemiological study on the serum response to DDT exposure during the gestation period.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7622
 Format
 Thesis
 Title
 Statistical Analysis of Trajectories on Riemannian Manifolds.
 Creator

Su, Jingyong, Srivastava, Anuj, Klassen, Erik, Huffer, Fred, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

This thesis consists of two distinct topics. First, we present a framework for estimation and analysis of trajectories on Riemannian manifolds. Second, we propose a framework for detecting, classifying, and estimating shapes in point cloud data. This thesis mainly focuses on statistical analysis of trajectories that take values on nonlinear manifolds. There are many difficulties in analyzing temporal trajectories on nonlinear manifolds. First, the observed data are always noisy and discrete at unsynchronized times. Second, trajectories are observed under arbitrary temporal evolutions. In this work, we first address the problem of estimating full smooth trajectories on nonlinear manifolds using only a set of time-indexed points, for use in interpolation, smoothing, and prediction of dynamic systems. Furthermore, we study statistical analysis of trajectories that take values on nonlinear Riemannian manifolds and are observed under arbitrary temporal evolutions. The problem of analyzing such temporal trajectories, including registration, comparison, modeling and evaluation, arises in many applications. We introduce a quantity that provides both a cost function for temporal registration and a proper distance for comparison of trajectories. This distance, in turn, is used to define statistical summaries, such as the sample means and covariances, of given trajectories and Gaussian-type models to capture their variability. Both theoretical proofs and experimental results are provided to validate our work. The problems of detecting, classifying, and estimating shapes in point cloud data are important due to their general applicability in image analysis, computer vision, and graphics. They are challenging because the data is typically noisy, cluttered, and unordered. We study these problems using a fully statistical model in which the data is modeled using a Poisson process on the object's boundary (curves or surfaces), corrupted by additive noise and a clutter process. Using likelihood functions dictated by the model, we develop a generalized likelihood ratio test for detecting a shape in a point cloud. Additionally, we develop a procedure for estimating the most likely shapes in observed point clouds under given shape hypotheses. We demonstrate this framework using examples of 2D and 3D shape detection and estimation in both real and simulated data, and a use of this framework in shape retrieval from a 3D shape database.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7619
 Format
 Thesis
 Title
 Meta Analysis and Meta Regression of a Measure of Discrimination Used in Prognostic Modeling.
 Creator

Rivera, Gretchen L., McGee, Daniel, Hurt, Myra, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

In this paper we are interested in predicting death with the underlying cause of coronary heart disease (CHD). There are two prognostic modeling methods used to predict CHD: the logistic model and the proportional hazards model. For this paper we consider the logistic model. The dataset used is the Diverse Populations Collaboration (DPC) dataset, which includes 28 studies. The DPC dataset has epidemiological results from investigations conducted in different populations around the world. For our analysis we include those individuals who are 17 years old or older. The predictors are age, diabetes, total serum cholesterol (mg/dl), high density lipoprotein (mg/dl), systolic blood pressure (mmHg) and whether the participant is a current cigarette smoker. There is a natural grouping within the studies, such as gender, rural or urban area, and race. Based on these strata we have 84 cohort groups. Our main interest is to evaluate how well the prognostic model discriminates. For this, we used the area under the Receiver Operating Characteristic (ROC) curve. The main idea of the ROC curve is that each of a set of subjects is known to belong to one of two classes (signal or noise group). An assignment procedure then assigns each subject to a class on the basis of the information observed. The assignment procedure is not perfect: sometimes a subject is misclassified. To evaluate the quality of performance of this procedure, we used the area under the ROC curve (AUROC). The AUROC varies from 0.5 (no apparent accuracy) to 1.0 (perfect accuracy). For each logistic model we found the AUROC and its standard error (SE). We used meta-analysis to summarize the estimated AUROCs and to evaluate whether there is heterogeneity in our estimates. To evaluate the existence of significant heterogeneity we used the Q statistic. Since heterogeneity was found in our study, we compare seven different methods for estimating τ² (the between-study variance). We conclude by examining whether differences in study characteristics explained the heterogeneity in the values of the AUROC.
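To make the heterogeneity step concrete, the sketch below computes Cochran's Q statistic and one widely used moment estimator of τ², the DerSimonian-Laird estimator, from a list of per-study estimates and their standard errors. The abstract does not say which seven estimators the thesis compares, so treating DerSimonian-Laird as one of them is an assumption; the function name and the input values in the usage note are likewise illustrative, not data from the study.

```python
def dersimonian_laird(estimates, std_errors):
    """Return (Q, tau2) for per-study estimates and their standard errors."""
    # Inverse-variance (fixed-effect) weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    k = len(estimates)
    c = total_w - sum(w ** 2 for w in weights) / total_w
    # Moment estimator of the between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    return q, tau2
```

For example, three hypothetical AUROC values of 0.70, 0.75 and 0.80, each with standard error 0.02, give Q = 12.5 on 2 degrees of freedom and τ² = 0.0021.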
 Date Issued
 2013
 Identifier
 FSU_migr_etd7580
 Format
 Thesis
 Title
 An Ensemble Approach to Predicting Health Outcomes.
 Creator

Nilles, Ester Kim, McGee, Dan, Zhang, Jinfeng, Eberstein, Isaac, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Heart disease and premature birth continue to be the leading causes of mortality and neonatal mortality, respectively, in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth provides highly valuable information for early interventions. Over the past few decades, identification of patients at high health risk has been based on logistic regression or Cox proportional hazards models. In more recent years, machine learning models have grown in popularity within the medical field for their superior predictive and classification performance over the classical statistical models. However, their performance in heart disease and premature birth prediction has been comparable and inconclusive, leaving the question of which model most accurately reflects the data difficult to resolve. Our aim is to incorporate information learned by different models into one final model that will generate superior predictive performance. We first compare the widely used machine learning models (the multilayer perceptron network, k-nearest neighbor and support vector machine) to the statistical models, logistic regression and Cox proportional hazards. Then the individual models are combined into one in an ensemble approach, also referred to as ensemble modeling. The proposed approaches include SSE-weighted, AUC-weighted, logistic and flexible naive Bayes. The individual models are unique and capture different aspects of the data but, as expected, no individual one outperforms any other. The ensemble approach is an easily computed method that eliminates the need to select one model, integrates the strengths of different models, and generates optimal performance. Particularly in cases where the risk factors associated with an outcome are elusive, such as in premature birth, the ensemble models significantly improve prediction.
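One plausible reading of an "AUC-weighted" ensemble is sketched below: each model's predicted probabilities are averaged with weights proportional to that model's AUC. The abstract does not give the exact weighting scheme, so the normalization used here, the function names, and the use of the same data to compute both weights and combined scores are all assumptions for illustration; in practice the weights would normally come from held-out data.

```python
def auc(labels, scores):
    """AUC as the proportion of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def auc_weighted_ensemble(labels, model_scores):
    """Combine several models' scores with weights proportional to AUC."""
    aucs = [auc(labels, scores) for scores in model_scores]
    total = sum(aucs)
    weights = [a / total for a in aucs]
    combined = [
        sum(w * scores[i] for w, scores in zip(weights, model_scores))
        for i in range(len(labels))
    ]
    return weights, combined
```

The combined score is a convex combination of the individual scores, so it stays within the range of the inputs and remains interpretable as a risk score.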
 Date Issued
 2013
 Identifier
 FSU_migr_etd7530
 Format
 Thesis
 Title
 Mixed-Effects Models for Count Data with Applications to Educational Research.
 Creator

Shin, Jihyung, Niu, Xufeng, Hu, Shouping, Al Otaiba, Stephanie Dent, McGee, Daniel, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

This research is motivated by an analysis of reading research data. We are interested in modeling the test outcome of the ability of kindergarten children aged between 5 and 7 to fluently recode letters into sounds. The data showed excessive zero scores (more than 30% of children) on the test. In this dissertation, we carefully examine the models dealing with excessive zeros, which are based on a mixture of distributions: a distribution with zeros and a standard probability distribution with nonnegative values. In such cases, a log-normal variable or a Poisson random variable is often observed with probability from semicontinuous data or count data. The previously proposed models, mixed-effects and mixed-distribution (MEMD) models by Tooze et al. (2002) for semicontinuous data and zero-inflated Poisson (ZIP) regression models by Lambert (1992) for count data, are reviewed. We apply zero-inflated Poisson models to repeated-measures zero-inflated data by introducing a pair of possibly correlated random effects to the zero-inflated Poisson model to accommodate within-subject correlation and between-subject heterogeneity. The model describes the effect of predictor variables on the probability of nonzero responses (occurrence) and the mean of nonzero responses (intensity) separately. The likelihood function, approximated by adaptive Gaussian quadrature, is maximized using dual quasi-Newton optimization. The maximum likelihood estimates are obtained through a standard statistical software package. A simulation study is conducted using different model parameters, numbers of subjects, and numbers of measurements per subject, and the results are presented. The dissertation ends with the application of the model to reading research data and a discussion of future research. We examine the number of correct letter sounds counted for children, collected over the 2008-2009 academic year. We find that age, gender and socioeconomic status are significantly related to the letter sound fluency of children in both parts of the model. The model provides a better explanation of the data structure and easier interpretation of parameter values, as they are the same as in standard logistic models and Poisson regression models. The model can be extended to accommodate serial correlation, which can be observed in longitudinal data. One may also consider a multilevel zero-inflated Poisson model. Although the multilevel model was proposed previously, parameter estimation by penalized quasi-likelihood methods is questionable, and further examination is needed.
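The basic zero-inflated Poisson distribution underlying this abstract can be written down in a few lines. The sketch below gives the probability mass function and mean of a ZIP variable with no covariates and no random effects; the parameter names (p_zero for the probability of a structural zero, lam for the Poisson mean) are assumptions for illustration, and the dissertation's mixed-effects extension is not reproduced.

```python
import math


def zip_pmf(k, p_zero, lam):
    """P(Y = k) under a zero-inflated Poisson distribution.

    p_zero: probability of a structural (excess) zero
    lam:    mean of the Poisson component
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero arises either structurally or from the Poisson part.
        return p_zero + (1.0 - p_zero) * poisson
    return (1.0 - p_zero) * poisson


def zip_mean(p_zero, lam):
    """Mean of the zero-inflated Poisson distribution."""
    return (1.0 - p_zero) * lam
```

Compared with an ordinary Poisson distribution with the same lam, the mass at zero is inflated and every positive count is scaled down by the factor (1 - p_zero), which is how the model accommodates a test on which a large share of children score zero.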
 Date Issued
 2012
 Identifier
 FSU_migr_etd5181
 Format
 Thesis
 Title
 A Riemannian Framework for Annotated Curves Analysis.
 Creator

Liu, Wei, Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric P., Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

We propose a Riemannian framework for shape analysis of annotated curves, that is, curves that have certain attributes defined along them in addition to their geometries. These attributes may be in the form of vector-valued functions, discrete landmarks, or symbolic labels, and provide auxiliary information along the curves. The resulting shape analysis (comparing, matching, and deforming) is naturally influenced by the auxiliary functions. Our idea is to construct curves in higher dimensions using both geometric and auxiliary coordinates, and to analyze the shapes of these curves. The difficulty comes from the need to remove different groups from different components: the shape is invariant to rigid motion, global scale and reparameterization, while the auxiliary component is usually invariant only to reparameterization. Thus, the removal of some transformations (rigid motion and global scale) is restricted to the geometric coordinates, while the reparameterization group is removed for all coordinates. We demonstrate this framework using a number of experiments.
 Date Issued
 2011
 Identifier
 FSU_migr_etd4997
 Format
 Thesis
 Title
 Nonparametric Data Analysis on Manifolds with Applications in Medical Imaging.
 Creator

Osborne, Daniel Eugene, Patrangenaru, Victor, Liu, Xiuwen, Barbu, Adrian, Chicken, Eric, Department of Statistics, Florida State University
 Abstract/Description

Over the past twenty years, there has been rapid development in nonparametric statistical analysis on manifolds applied to medical imaging problems. In this body of work, we focus on two different medical imaging problems. The first problem corresponds to analyzing CT scan data. In this context, we perform nonparametric analysis on the 3D data retrieved from CT scans of healthy young adults, on the Size-and-Reflection Shape Space of k-ads in general position in 3D. This work is part of a larger project on planning reconstructive surgery in severe skull injuries, which includes preprocessing and postprocessing steps of CT images. The next problem corresponds to analyzing MR diffusion tensor imaging data. Here, we develop a two-sample procedure for testing the equality of the generalized Frobenius means of two independent populations on the space of symmetric positive matrices. These new methods naturally lead to an analysis based on Cholesky decompositions of covariance matrices, which helps to decrease computational time and does not increase dimensionality. The resulting nonparametric matrix-valued statistics are used for testing whether there is a difference on average between corresponding signals in Diffusion Tensor Images (DTI) in young children with dyslexia when compared to their clinically normal peers. The results presented here are based on data previously analyzed in the literature with parametric methods, which also showed a significant difference.
 Date Issued
 2012
 Identifier
 FSU_migr_etd5085
 Format
 Thesis
 Title
 Riemannian Shape Analysis of Curves and Surfaces.
 Creator

Kurtek, Sebastian, Srivastava, Anuj, Klassen, Eric, Wu, Wei, Huffer, Fred, Dryden, Ian, Department of Statistics, Florida State University
 Abstract/Description

Show moreShape analysis of curves and surfaces is a very important tool in many applications ranging from computer vision to bioinformatics and medical imaging. There are many difficulties when analyzing shapes of parameterized curves and surfaces. Firstly, it is important to develop representations and metrics such that the analysis is invariant to parameterization in addition to the standard transformations (rigid motion and scaling). Furthermore, under the chosen representations and metrics, the analysis must be performed on infinitedimensional and sometimes nonlinear spaces, which poses an additional difficulty. In this work, we develop and apply methods which address these issues. We begin by defining a framework for shape analysis of parameterized open curves and extend these ideas to shape analysis of surfaces. We utilize the presented frameworks in various classification experiments spanning multiple application areas. In the case of curves, we consider the problem of clustering DTMRI brain fibers, classification of protein backbones, modeling and segmentation of signatures and statistical analysis of biosignals. In the case of surfaces, we perform disease classification using 3D anatomical structures in the brain, classification of handwritten digits by viewing images as quadrilateral surfaces, and finally classification of cropped facial surfaces. We provide two additional extensions of the general shape analysis frameworks that are the focus of this dissertation. The first one considers shape analysis of marked spherical surfaces where in addition to the surface information we are given a set of manually or automatically generated landmarks. This requires additional constraints on the definition of the reparameterization group and is applicable in many domains, especially medical imaging and graphics. Second, we consider reflection symmetry analysis of planar closed curves and spherical surfaces. 
Here, we also provide an example of disease detection based on brain asymmetry measures. We close with a brief summary and a discussion of open problems, which we plan on exploring in the future.
 Date Issued
 2012
 Identifier
 FSU_migr_etd4963
 Format
 Thesis
 Title
 A Novel Riemannian Metric for Analyzing Spherical Functions with Applications to HARDI Data.
 Creator

Ncube, Sentibaleng, Srivastava, Anuj, Klassen, Eric, Wu, Wei, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

We propose a novel Riemannian framework for analyzing orientation distribution functions (ODFs), or their probability density functions (PDFs), in HARDI data sets for use in comparing, interpolating, averaging, and denoising PDFs. This is accomplished by separating shape and orientation features of PDFs, and then analyzing them separately under their own Riemannian metrics. We formulate the action of the rotation group on the space of PDFs, and define the shape space as the quotient space of PDFs modulo the rotations. In other words, any two PDFs are compared in: (1) shape, by rotationally aligning one PDF to another and using the Fisher-Rao distance on the aligned PDFs, and (2) orientation, by comparing their rotation matrices. This idea improves upon the results from using the Fisher-Rao metric in analyzing PDFs directly, a technique that is being used increasingly, and leads to geodesic interpolations that are biologically feasible. This framework leads to definitions and efficient computations for the Karcher mean that provide tools for improved interpolation and denoising. We demonstrate these ideas using an experimental setup involving several PDFs.
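As a rough illustration of the Fisher-Rao distance mentioned in this abstract, the sketch below computes it for two discretized PDFs. It assumes the common square-root representation, under which PDFs map to the unit sphere and the distance is the arc length between them (some authors scale this by a factor of 2). The function name `fisher_rao_distance` is hypothetical, and the rotational alignment step described in the abstract is not included.

```python
import math


def fisher_rao_distance(p, q):
    """Arc-length distance between two discrete PDFs under the square-root map.

    Assumption for illustration: p and q are probability vectors on the same
    grid (non-negative, each summing to 1). Not taken from the thesis itself.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Inner product of sqrt(p) and sqrt(q) (the Bhattacharyya coefficient).
    inner = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard against floating-point rounding before acos.
    inner = min(1.0, max(0.0, inner))
    return math.acos(inner)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    peaked = [0.7, 0.1, 0.1, 0.1]
    print(fisher_rao_distance(uniform, peaked))
```

Identical PDFs are at distance 0, and PDFs with disjoint support are at the maximum distance of pi/2 under this convention.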
 Date Issued
 2011
 Identifier
 FSU_migr_etd5064
 Format
 Thesis
 Title
 Theories on Group Variable Selection in Multivariate Regression Models.
 Creator

Ha, SeungYeon, She, Yiyuan, Okten, Giray, Huffer, Fred, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

We study group variable selection in multivariate regression models. Group variable selection is equivalent to selecting the nonzero rows of the coefficient matrix: since there are multiple response variables, if one predictor is irrelevant to estimation then the corresponding row must be zero. In the high-dimensional setup, shrinkage estimation methods are applicable and guarantee smaller MSE than OLS according to the James-Stein phenomenon (1961). As one class of shrinkage methods, we study penalized least squares estimation for group variable selection. Among them, we study L0 regularization and L0 + L2 regularization with the purpose of obtaining accurate prediction and consistent feature selection, and use the corresponding computational procedures, Hard TISP and Hard-Ridge TISP (She, 2009), to solve the numerical difficulties. These regularization methods show better performance in both prediction and selection than the Lasso (L1 regularization), which is one of the popular penalized least squares methods. L0 achieves the same optimal rate of prediction loss and estimation loss as the Lasso, but it requires no restriction on the design matrix or sparsity for controlling the prediction error, and a more relaxed condition than the Lasso for controlling the estimation error. Also, for selection consistency, it requires a much more relaxed incoherence condition, which concerns the correlation between the relevant and irrelevant subsets of predictors. Therefore L0 can work better than the Lasso in both prediction and sparsity recovery in practical cases where correlation is high or sparsity is not low. We study another method, L0 + L2 regularization, which uses the combined penalty of L0 and L2. For the corresponding procedure, Hard-Ridge TISP, two parameters work independently for selection and shrinkage (to enhance prediction), respectively, and therefore it gives better performance in some cases (such as low signal strength) than L0 regularization. For L0 regularization, λ works for selection but it is tuned in terms of prediction accuracy. L0 + L2 regularization gives the optimal rate of prediction and estimation errors without any restriction, when the coefficient of the L2 penalty is appropriately assigned. Furthermore, it can achieve a better rate of estimation error with an ideal choice of blockwise weight for the L2 penalty.
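The division of labor between the two tuning parameters can be sketched with row-wise thresholding rules. The forms below are an assumption made for illustration (a hard rule that keeps or zeroes a whole coefficient row by its Euclidean norm, and a hard-ridge rule that additionally shrinks kept rows by 1/(1 + eta)); they are not quoted from the thesis, and the function names are hypothetical. The full iterative TISP procedure is not shown.

```python
import math


def _row_norm(row):
    """Euclidean norm of one coefficient row."""
    return math.sqrt(sum(x * x for x in row))


def hard_threshold_row(row, lam):
    """Keep the whole row if its norm exceeds lam, otherwise zero it out.

    Assumed group version of hard thresholding: lam controls selection only.
    """
    if _row_norm(row) > lam:
        return list(row)
    return [0.0] * len(row)


def hard_ridge_threshold_row(row, lam, eta):
    """Hard selection by lam, then ridge-type shrinkage of kept rows by eta.

    Assumed form: lam decides which rows survive, eta >= 0 shrinks survivors.
    """
    if eta < 0:
        raise ValueError("eta must be non-negative")
    if _row_norm(row) > lam:
        return [x / (1.0 + eta) for x in row]
    return [0.0] * len(row)


if __name__ == "__main__":
    coef_rows = [[3.0, 4.0], [0.3, 0.4]]
    print([hard_threshold_row(r, lam=1.0) for r in coef_rows])
    print([hard_ridge_threshold_row(r, lam=1.0, eta=1.0) for r in coef_rows])
```

In this sketch the selected set depends only on `lam`, while `eta` changes only the size of the surviving rows, which mirrors the independent roles of the two parameters described above.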
 Date Issued
 2013
 Identifier
 FSU_migr_etd7404
 Format
 Thesis
 Title
 Monte Carlo Likelihood Estimation for Conditional Autoregressive Models with Application to Sparse Spatiotemporal Data.
 Creator

Bain, Rommel, Huffer, Fred, Becker, Betsy, Niu, Xufeng, Srivastava, Anuj, Department of Statistics, Florida State University
 Abstract/Description

Spatiotemporal modeling is increasingly used in a diverse array of fields, such as ecology, epidemiology, health care research, transportation, economics, and other areas where data arise from a spatiotemporal process. Spatiotemporal models describe the relationship between observations collected from different spatiotemporal sites. The modeling of spatiotemporal interactions arising from spatiotemporal data is done by incorporating the space-time dependence into the covariance structure. A main goal of spatiotemporal modeling is the estimation and prediction of the underlying process that generates the observations under study and the parameters that govern the process. Furthermore, analysis of the spatiotemporal correlation of variables can be used for estimating values at sites where no measurements exist. In this work, we develop a framework for estimating quantities that are functions of complete spatiotemporal data when the spatiotemporal data are incomplete. We present two classes of conditional autoregressive (CAR) models, the homogeneous CAR (HCAR) model and the weighted CAR (WCAR) model, for the analysis of sparse spatiotemporal data (the log of monthly mean zooplankton biomass) collected on a spatiotemporal lattice by the California Cooperative Oceanic Fisheries Investigations (CalCOFI). These models allow for spatiotemporal dependencies between nearest-neighbor sites on the spatiotemporal lattice. Typically, CAR model likelihood inference is quite complicated because of the intractability of the CAR model's normalizing constant. Sparse spatiotemporal data further complicate likelihood inference. We implement Monte Carlo likelihood (MCL) estimation methods for parameter estimation of our HCAR and WCAR models. Monte Carlo likelihood estimation provides an approximation for intractable likelihood functions. We demonstrate our framework by giving estimates for several different quantities that are functions of the complete CalCOFI time series data.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7283
 Format
 Thesis
 Title
 Nonparametric Nonstationary Density Estimation Including Upper Control Limit Methods for Detecting Change Points.
 Creator

Becvarik, Rachel A., Chicken, Eric, Liu, Guosheng, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

Nonstationary nonparametric densities occur naturally in applications such as monitoring the amount of toxins in the air and monitoring internet streaming data. Progress has been made in estimating these densities, but there is little current work on monitoring them for changes. A new statistic is proposed which effectively monitors these nonstationary nonparametric densities through the use of transformed wavelet coefficients of the quantiles. This method is completely nonparametric and relies on no particular distributional assumptions, thus making it effective in a variety of conditions. Existing methods for monitoring sequential data typically focus on using a single-value upper control limit (UCL) based on a specified in-control average run length (ARL) to detect changes in these nonstationary statistics. However, such a UCL is not designed to take into consideration the false alarm rate, the power associated with the test, or the underlying distribution of the ARL. Additionally, if the monitoring statistic is known to be monotonic over time (which is typical in methods using maxima in their statistics, for example), the flat UCL does not adjust to this property. We propose several methods for creating UCLs that provide improved power and simultaneously adjust the false alarm rate to user-specified values. Our methods are constructive in nature, making no use of assumed distribution properties of the underlying monitoring statistic. We evaluate the different proposed UCLs through simulations to illustrate the improvements over current UCLs. The proposed method is evaluated with respect to profile monitoring scenarios and the proposed density statistic. The method is applicable for monitoring any monotonically nondecreasing nonstationary statistic.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7292
 Format
 Thesis
 Title
 Elastic Shape Analysis of RNAs and Proteins.
 Creator

Laborde, Jose M., Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric, McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

Proteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of structural bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and classification. This work applies Elastic Shape Analysis (ESA), a method recently developed in computer vision, to construct rigorous mathematical and statistical frameworks for the comparison, clustering and classification of proteins and RNAs. ESA treats biomolecular structures as 3D parameterized curves, which are represented with a special map called the square root velocity function (SRVF). In the resulting shape space of elastic curves, one can perform statistical analysis of curves as if they were random variables. One can compare, match and deform one curve into another, as well as compute averages and covariances of curve populations, and perform hypothesis testing and classification of curves according to their shapes. We have successfully applied ESA to the comparison and classification of protein and RNA structures. We further extend the ESA framework to incorporate additional non-geometric information that tags the shape of the molecules (namely, the sequence of nucleotide/amino acid letters for RNAs/proteins and, in the latter case, also the labels for the so-called secondary structure). The biological representation is chosen such that the ESA framework continues to be mathematically formal. We have achieved superior classification of RNA functions compared to state-of-the-art methods on benchmark RNA datasets, which has led to the publication of this work in the journal Nucleic Acids Research (NAR). Based on the ESA distances, we have also developed a fast method to classify protein domains by using a representative set of protein structures generated by a clustering-based technique we call Multiple Centroid Class Partitioning (MCCP). Comparison with other standard approaches showed that MCCP significantly improves the accuracy while keeping the representative set smaller than the other methods. The current schemes for the classification and organization of proteins (such as SCOP and CATH) assume a discrete space of their structures, where a protein is classified into one and only one class in a hierarchical tree structure. Our recent study, and studies by other researchers, showed that the protein structure space is more continuous than discrete. To capture the complex but quantifiable continuous nature of protein structures, we propose to organize these molecules using a network model, where individual proteins are mapped to possibly multiple nodes of classes, each associated with a probability. Structural classes will then be connected to form a network based on overlaps of corresponding probability distributions in the structural space.
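To make the square root velocity function concrete, here is a minimal sketch for a curve sampled at discrete points. It assumes the usual definition q(t) = f'(t) / sqrt(|f'(t)|), a uniform parameter grid on [0, 1], and forward differences for the derivative; the function name `srvf` and the discretization are illustrative choices, not the thesis implementation.

```python
import math


def srvf(points):
    """Discrete SRVF of a curve given as a list of coordinate tuples.

    Assumptions for illustration: samples are equally spaced in the parameter
    t on [0, 1], and the velocity is estimated by forward differences, so the
    result has one entry per segment.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two sample points")
    dt = 1.0 / (n - 1)
    q = []
    for a, b in zip(points, points[1:]):
        velocity = [(y - x) / dt for x, y in zip(a, b)]
        speed = math.sqrt(sum(c * c for c in velocity))
        if speed > 0.0:
            q.append([c / math.sqrt(speed) for c in velocity])
        else:
            # Zero velocity maps to the zero vector.
            q.append([0.0] * len(velocity))
    return q


if __name__ == "__main__":
    # Straight segment of length 2 in 3D, sampled at 5 points.
    line = [(0.5 * i, 0.0, 0.0) for i in range(5)]
    q = srvf(line)
    dt = 1.0 / (len(line) - 1)
    # The squared L2 norm of the SRVF equals the curve length.
    print(sum(sum(c * c for c in seg) for seg in q) * dt)
```

A useful sanity check is that the squared L2 norm of the SRVF recovers the length of the curve, which is what the usage example prints for a straight segment.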
 Date Issued
 2013
 Identifier
 FSU_migr_etd8586
 Format
 Thesis
 Title
 Failure Time Regression Models for Thinned Point Processes.
 Creator

Holden, Robert T., Huffer, Fred G., Nichols, Warren, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

In survival analysis, data on the time until a specific criterion event (or "endpoint") occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the predictors on that distribution. Suppose that the criterion event isn't a terminal event that can only occur once, but is a repeatable event. The sequence of events forms a stochastic point process. Further suppose that only some of the events are detected (observed); the detected events form a thinned point process. Any failure time model based on the data will be based not on the time until the first occurrence, but on the time until the first detected occurrence of the event. The implications of estimating survival regression models from such incomplete data will be analyzed. It will be shown that the effect of thinning on regression parameters depends on the combination of the type of regression model, the type of point process that generates the events, and the thinning mechanism. For some combinations, the effect of a predictor will be the same for time to the first event and the time to the first detected event. For other combinations, the regression effect will be changed as a result of the incomplete detection.
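The idea of a thinned point process can be illustrated with the simplest textbook case, which is chosen here as an assumption and is not necessarily one of the combinations studied in the thesis: a homogeneous Poisson process with rate `rate`, where each event is detected independently with probability `p_detect`. In that case the detected events again form a Poisson process with rate `rate * p_detect`, so the mean time to the first detected event is 1 / (rate * p_detect). The function name below is hypothetical.

```python
import random


def first_detected_time(rate, p_detect, rng):
    """Simulate the time of the first detected event.

    Assumptions for illustration: events follow a homogeneous Poisson process
    with the given rate, and each event is detected independently with
    probability p_detect (0 < p_detect <= 1).
    """
    if rate <= 0 or not (0.0 < p_detect <= 1.0):
        raise ValueError("need rate > 0 and 0 < p_detect <= 1")
    t = 0.0
    while True:
        # Waiting time to the next event of the underlying process.
        t += rng.expovariate(rate)
        # Independent thinning: keep the event with probability p_detect.
        if rng.random() < p_detect:
            return t


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [first_detected_time(2.0, 0.25, rng) for _ in range(20000)]
    # Theory for this special case: mean = 1 / (2.0 * 0.25) = 2.0
    print(sum(draws) / len(draws))
```

In this special case thinning only rescales the event rate by a constant, which gives some intuition for why a predictor's effect can be unchanged under certain model and thinning combinations and altered under others.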
 Date Issued
 2013
 Identifier
 FSU_migr_etd8568
 Format
 Thesis
 Title
 2D Affine and Projective Shape Analysis, and Bayesian Elastic Active Contours.
 Creator

Bryner, Darshan W., Srivastava, Anuj, Klassen, Eric, Gallivan, Kyle, Huffer, Fred, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
 Abstract/Description

An object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but oftentimes in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian framework for shape analysis where metrics and related quantities are invariant to larger groups, the affine and projective groups, that approximate such transformations that arise from perspective skews. Highlighting two possibilities for representing object boundaries, ordered points (or landmarks) and parametrized curves, we study different combinations of these representations (points and curves) and transformations (affine and projective). Specifically, we provide solutions to three out of four situations and develop algorithms for computing geodesics and intrinsic sample statistics, leading up to Gaussian-type statistical models, and classifying test shapes using such models learned from training data. In the case of parametrized curves, an added issue is to obtain invariance to the reparameterization group. The geodesics are constructed by particularizing the path-straightening algorithm to geometries of current manifolds and are used, in turn, to compute shape statistics and Gaussian-type shape models. We demonstrate these ideas using a number of examples from shape and activity recognition. After developing such Gaussian-type shape models, we present a variational framework for naturally incorporating these shape models as prior knowledge in guidance of active contours for boundary extraction in images. This so-called Bayesian active contour framework is especially suitable for images where boundary estimation is difficult due to low contrast, low resolution, and presence of noise and clutter. In traditional active contour models, curves are driven towards the minimum of an energy composed of image and smoothing terms. We introduce an additional shape term based on shape models of prior known relevant shape classes. The minimization of this total energy, using iterated gradient-based updates of curves, leads to an improved segmentation of object boundaries. We demonstrate this Bayesian approach to segmentation using a number of shape classes in many imaging scenarios including the synthetic imaging modalities of SAS (synthetic aperture sonar) and SAR (synthetic aperture radar), for which accurate boundary extractions are notoriously difficult to obtain. In practice, the training shapes used for prior-shape models may be collected from viewing angles different from those for the test images and thus may exhibit a shape variability brought about by perspective effects. Therefore, by allowing for a prior shape model to be invariant to, say, affine transformations of curves, we propose an active contour algorithm where the resulting segmentation is robust to perspective skews.
 Date Issued
 2013
 Identifier
 FSU_migr_etd8534
 Format
 Thesis
 Title
 The Relationship Between Body Mass and Blood Pressure in Diverse Populations.
 Creator

Abayomi, Emilola J., McGee, Daniel, Lackland, Daniel, Hurt, Myra, Chicken, Eric, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

High blood pressure is a major determinant of risk for Coronary Heart Disease (CHD) and stroke, leading causes of death in the industrialized world. A myriad of pharmacological treatments for elevated blood pressure, defined as a blood pressure greater than 140/90 mmHg, are available and have at least partially resulted in large reductions in the incidence of CHD and stroke in the U.S. over the last 50 years. The factors that may increase blood pressure levels are not well understood, but body mass is thought to be a major determinant of blood pressure level. Obesity is measured through various methods (skinfolds, waist-to-hip ratio, bioelectrical impedance analysis (BIA), etc.), but the most commonly used measure is body mass index, BMI = weight (kg) / height (m)^2.
 Date Issued
 2012
 Identifier
 FSU_migr_etd5308
 Format
 Thesis
 Title
 The Relationship of Diabetes to Coronary Heart Disease Mortality: A Meta-Analysis Based on Person-Level Data.
 Creator

Williams, Felicia Gray, McGee, Daniel, Hurt, Myra, Pati, Debdeep, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Studies have suggested that diabetes is a stronger risk factor for coronary heart disease (CHD) in women than in men. We present a meta-analysis of person-level data from 42 cohort studies in which diabetes, CHD mortality and potential confounders were available and a minimum of 75 CHD deaths occurred. These studies followed up 77,863 men and 84,671 women aged 42 to 73 years on average from the US, Denmark, Iceland, Norway and the UK. Individual study prevalence rates of self-reported diabetes mellitus at baseline ranged between less than 1% in the youngest cohort and 15.7% (males) and 11.1% (females) in the NHLBI CHS study of the elderly. CHD death rates varied between 2% and 20%. A meta-analysis was performed in order to calculate overall hazard ratios (HR) of CHD mortality among diabetics compared to non-diabetics using Cox proportional hazards models. The random-effects HR associated with baseline diabetes and adjusted for age was significantly higher for females, 2.65 (95% CI: 2.34, 2.96), than for males, 2.33 (95% CI: 2.07, 2.58) (p=0.004). These estimates were similar to the random-effects estimates adjusted additionally for serum cholesterol, systolic blood pressure, and current smoking status: females 2.69 (95% CI: 2.35, 3.03) and males 2.32 (95% CI: 2.05, 2.59). They also agree closely with estimates (odds ratios of 2.9 for females and 2.3 for males) obtained in a recent meta-analysis of 50 studies of both fatal and nonfatal CHD but not based on person-level data. This evidence suggests that diabetes diminishes the female advantage. An additional analysis was performed on race. Only 14 cohorts were analyzed in this meta-analysis. This analysis showed no significant difference between the black and white cohorts before (p=0.68) or after adjustment for the major CHD risk factors (p=0.88). The limited number of studies used may lack the power to detect any differences.
 Date Issued
 2013
 Identifier
 FSU_migr_etd7662
 Format
 Thesis
 Title
 The Risk of Lipids on Coronary Heart Disease: Prognostic Models and Meta-Analysis.
 Creator

Almansour, Aseel, McGee, Daniel, Flynn, Heather, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
 Abstract/Description

Prognostic models are widely used in medicine to estimate particular patients' risk of developing disease. For cardiovascular disease risk, numerous prognostic models have been developed, including those by Wilson et al. using the Framingham Study[17], by Assmann et al. using the Procam study[22] and by Conroy et al.[33] using a pool of European cohorts. The prognostic models developed by these researchers differed in their approach to estimating risk but all included one or more of the lipid determinations: Total Cholesterol (TC), Low Density Lipoproteins (LDL), High Density Lipoproteins (HDL), or the ratios TC/HDL and LDL/HDL. None of these researchers included both LDL and TC in the same model due to the high correlation between these measurements. In this thesis we examine some questions about the inclusion of lipid determinations in prognostic models. Can the effect of LDL and TC on the risk of dying from CHD be differentiated? If one measure is demonstrably stronger than the other, then a single model using that variable would be considered advantageous. Is it possible to derive a single measure from TC and LDL that is a stronger predictor than either measure? If so, then a new summarization of the lipid measurements should be used in prognostic modeling. Does the addition of HDL to a prognostic model improve the predictive accuracy of the model? If it does, then this determination, which is almost universally made, should be used when developing prognostic models. We use data from nine independent studies to examine these issues. The studies were chosen because they include longitudinal follow-up of participants and included lipid determinations in the baseline examination of participants. There are many methodologies available for developing prognostic models, including logistic regression and the proportional hazards model. We used the proportional hazards model since we have follow-up times and times to death from CHD on all of the participants in the included studies. We summarized our results using a meta-analytic approach. Using the meta-analytic approach, we addressed the additional question of whether the results vary significantly among the different studies and also whether adding additional characteristics to the prognostic models changes the estimated effect of the lipid determinations. All of our results are presented stratified by gender and, when appropriate, by race. Finally, because our studies were not selected randomly, we also examined whether there is evidence of bias in our meta-analyses. For this examination we used funnel plots with related methodology for testing whether there is evidence of bias in the results.
 Date Issued
 2014
 Identifier
 FSU_migr_etd8724
 Format
 Thesis
 Title
 A Class of Semiparametric Volatility Models with Applications to Financial Time Series.
 Creator

Chung, Steve S., Niu, XuFeng, Gallivan, Kyle, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

The autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models capture the dependence in the conditional second moments. The idea behind ARCH/GARCH models is intuitive. For ARCH models, past squared innovations describe the present squared volatility. For GARCH models, both past squared innovations and past squared volatilities define the present volatility. Since their introduction, they have been extensively studied and well documented in the financial and econometric literature, and many variants of ARCH/GARCH models have been proposed. To list a few, these include exponential GARCH (EGARCH), GJR-GARCH (or threshold GARCH), integrated GARCH (IGARCH), quadratic GARCH (QGARCH), and fractionally integrated GARCH (FIGARCH). The ARCH/GARCH models and their variants have gained a lot of attention and are still a popular choice for modeling volatility. Despite their popularity, they suffer from a lack of model flexibility. Volatility is a latent variable and hence, putting a specific model structure violates this latency assumption. Recently, several attempts have been made to ease the strict structural assumptions on volatility. Both nonparametric and semiparametric volatility models have been proposed in the literature. We review and discuss these modeling techniques in detail. In this dissertation, we propose a class of semiparametric multiplicative volatility models. We define the volatility as a product of parametric and nonparametric parts. Due to the positivity restriction, we take the log and square transformations on the volatility. We assume that the parametric part is GARCH(1,1), and it serves as an initial guess for the volatility. We estimate the GARCH(1,1) parameters using the conditional likelihood method. The nonparametric part assumes an additive structure. There may be some loss of interpretability from assuming an additive structure, but we gain flexibility. Each additive part is constructed from a sieve of Bernstein basis polynomials. The nonparametric component acts as an improvement on the parametric component. The model is estimated with an iterative algorithm based on boosting. We modified the boosting algorithm (the one given in Friedman 2001) so that it uses a penalized least squares method. We tried three penalty functions: LASSO, ridge, and elastic net. We found that, in our simulations and application, the ridge penalty worked best. Our semiparametric multiplicative volatility model is evaluated using simulations and applied to the six major exchange rates and the S&P 500 index. The results show that the proposed model outperforms the existing volatility models in both in-sample estimation and out-of-sample prediction.
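As a rough illustration of the GARCH(1,1) recursion that the abstract above uses as its parametric starting point, the sketch below computes conditional variances from a return series. It is not the dissertation's code: the function name `garch11_variance`, the parameter names `omega`, `alpha`, `beta`, and the choice to initialize at the unconditional variance are all assumptions made for this example.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Return conditional variances h_t for a GARCH(1,1) model.

    Recursion (standard textbook form, assumed here):
        h_t = omega + alpha * r_{t-1}**2 + beta * h_{t-1}
    The first value is set to the unconditional variance
    omega / (1 - alpha - beta), which requires alpha + beta < 1.
    """
    if not returns:
        return []
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be < 1 for this initialization")
    h = [omega / (1.0 - alpha - beta)]
    # Each new variance depends on the previous squared return
    # and the previous conditional variance.
    for r in returns[:-1]:
        h.append(omega + alpha * r * r + beta * h[-1])
    return h


if __name__ == "__main__":
    # Hypothetical returns and parameters, for illustration only.
    print(garch11_variance([0.5, -1.2, 0.3, 0.0], omega=0.1, alpha=0.1, beta=0.8))
```

In an ARCH model the `beta` term would be dropped, so only past squared innovations drive the variance; the semiparametric model in the abstract multiplies a fit of this kind by a separately estimated nonparametric factor.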
 Date Issued
 2014
 Identifier
 FSU_migr_etd8756
 Format
 Thesis
 Title
 Time-Varying Coefficient Models with ARMA-GARCH Structures for Longitudinal Data Analysis.
 Creator

Zhao, Haiyan, Niu, Xufeng, Huffer, Fred, Nolder, Craig, McGee, Dan, Department of Statistics, Florida State University
 Abstract/Description

The motivation of my research comes from the analysis of the Framingham Heart Study (FHS) data. The FHS is a long-term prospective study of cardiovascular disease in the community of Framingham, Massachusetts. The study began in 1948, and 5,209 subjects were initially enrolled. Examinations were given biennially to the study participants, and their status associated with the occurrence of disease was recorded. In this dissertation, the event we are interested in is the incidence of coronary heart disease (CHD). Covariates considered include sex, age, cigarettes per day (CSM), serum cholesterol (SCL), systolic blood pressure (SBP) and body mass index (BMI, weight in kilograms/height in meters squared). A review of the statistical literature indicates that the effects of the covariates on cardiovascular disease or death caused by all possible diseases in the Framingham study change over time. For example, the effect of SCL on cardiovascular disease decreases linearly over time. In this study, I examine the time-varying effects of the risk factors on CHD incidence. Time-varying coefficient models with ARMA-GARCH structure are developed in this research. The maximum likelihood and the marginal likelihood methods are used to estimate the parameters in the proposed models. Since high-dimensional integrals are involved in the calculations of the marginal likelihood, the Laplace approximation is employed in this study. Simulation studies are conducted to evaluate the performance of these two estimation methods based on our proposed models. The Kullback-Leibler (KL) divergence and the root mean square error are employed in the simulation studies to compare the results obtained from the different methods. Simulation results show that the marginal likelihood approach gives more accurate parameter estimates, but is more computationally intensive. Following the simulation study, our proposed models are applied to the Framingham Heart Study to investigate the time-varying effects of covariates with respect to CHD incidence. To specify the time-series structures of the effects of risk factors, the Bayesian Information Criterion (BIC) is used for model selection. Our study shows that the relationship between CHD and risk factors changes over time. For males, there is a clearly decreasing linear trend for the age effect, which implies that the age effect on CHD is less significant for older patients than for younger patients. The effect of CSM stays almost the same in the first 30 years and decreases thereafter. There are slightly decreasing linear trends for the effects of both SBP and BMI. Furthermore, the coefficients of SBP are mostly positive over time, i.e., patients with higher SBP are more likely to develop CHD, as expected. For females, there is also a clearly decreasing linear trend for the age effect, while the effects of SBP and BMI on CHD are mostly positive and do not change much over time.
 Date Issued
 2010
 Identifier
 FSU_migr_etd0527
 Format
 Thesis
 Title
 A Comparison of Estimators in Hierarchical Linear Modeling: Restricted Maximum Likelihood versus Bootstrap via Minimum Norm Quadratic Unbiased Estimators.
 Creator

Delpish, Ayesha Nneka, Niu, XuFeng, Tate, Richard L., Huffer, Fred W., Zahn, Douglas, Department of Statistics, Florida State University
 Abstract/Description

The purpose of the study was to investigate the relative performance of two estimation procedures, the restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a two-level hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations, the importance of this assumption for the accuracy of multilevel parameter estimates and their standard errors was assessed using the accuracy index of relative bias and by observing the coverage percentages of 95% confidence intervals constructed for both estimation procedures. The study systematically varied the number of groups at level 2 (30 versus 100), the size of the intraclass correlation (0.01 versus 0.20) and the distribution of the observations (normal versus chi-squared with 1 degree of freedom). The number of groups and intraclass correlation factors produced effects consistent with those previously reported: as the number of groups increased, the bias in the parameter estimates decreased, with a more significant effect observed for those estimates obtained via REML. High levels of the intraclass correlation also led to a decrease in the efficiency of parameter estimation under both methods. Study results show that while both the restricted maximum likelihood and the bootstrap via MINQUE estimates of the fixed effects were accurate, the efficiency of the estimates was affected by the distribution of errors, with the bootstrap via MINQUE procedure outperforming REML. Both procedures produced less efficient estimators under the chi-squared distribution, particularly for the variance-covariance component estimates.
 Date Issued
 2006
 Identifier
 FSU_migr_etd0771
 Format
 Thesis
 Title
 Estimation from Data Representing a Sample of Curves.
 Creator

Auguste, Anna L., Bunea, Florentina, Mason, Patrick, Hollander, Myles, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

This dissertation introduces and assesses an algorithm to generate confidence bands for a regression function or a main effect when multiple data sets are available. In particular, it proposes to construct confidence bands for different trajectories and then aggregate these to produce an overall confidence band for a mean function. An estimator of the regression function or main effect is also examined. First, nonparametric estimators and confidence bands are formed on each data set separately. Then each data set is in turn treated as a testing set for aggregating the preliminary results from the remaining data sets. The criterion used for this aggregation is either the least squares (LS) criterion or a BIC-type penalized LS criterion. The proposed estimator is the average over data sets of these aggregates. It is thus a weighted sum of the preliminary estimators. When there is only a main effect, the proposed confidence band is the minimum L1 band of all the M aggregate bands. When there is some random effect, we suggest an adjustment to the confidence band; in this case, the proposed confidence band is the minimum L1 band of all the M adjusted aggregate bands. Desirable asymptotic properties are shown to hold. A simulation study examines the performance of each technique relative to several alternative methods and theoretical benchmarks. An application to seismic data is conducted.
 Date Issued
 2006
 Identifier
 FSU_migr_etd0286
 Format
 Thesis
 Title
 Association Models for Clustered Data with Binary and Continuous Responses.
 Creator

Lin, Lanjia, Sinha, Debajyoti, Hurt, Myra, Lipsitz, Stuart R., McGee, Daniel, Department of Statistics, Florida State University
 Abstract/Description

This dissertation develops novel single random effect models as well as a bivariate correlated random effects model for clustered data with bivariate mixed responses. Logit and identity link functions are used for the binary and continuous responses. For ease of interpretation of the regression effects, the random effect of the binary response has a bridge distribution, so that the marginal model of the mean of the binary response, after integrating out the random effect, preserves the logistic form, and the marginal regression function of the continuous response preserves the linear form. Within-cluster and within-subject associations can be measured by our proposed models. For the bivariate correlated random effects model, we illustrate how different levels of association between the two random effects induce different Kendall's tau values for the association between the binary and continuous responses from the same cluster. Fully parametric and semiparametric Bayesian methods as well as the maximum likelihood method are illustrated for model analysis. In the semiparametric Bayesian model, the normality assumption on the regression error for the continuous response is relaxed by using a nonparametric Dirichlet Process prior. Robustness of the bivariate correlated random effects model using the ML method to misspecification of the regression function as well as the random effect distribution is investigated through simulation studies. The Bayesian and likelihood methods are applied to a developmental toxicity study of ethylene glycol in mice.
 Date Issued
 2009
 Identifier
 FSU_migr_etd1330
 Format
 Thesis
 Title
 Investigating the Categories for Cholesterol and Blood Pressure for Risk Assessment of Death Due to Coronary Heart Disease.
 Creator

Franks, Billy J., McGee, Daniel, Hurt, Myra, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

Many characteristics for predicting death due to coronary heart disease are measured on a continuous scale. These characteristics, however, are often categorized for clinical use and to aid in treatment decisions. We would like to derive a systematic approach to determine the best categorizations of systolic blood pressure and cholesterol level for use in identifying individuals who are at high risk of death due to coronary heart disease, and to compare these data-derived categories to those in common usage. Whatever categories are chosen, they should allow physicians to accurately estimate the probability of survival from coronary heart disease until some time t. The best categories will be those that provide the most accurate prediction of an individual's risk of dying by t. The approach used to determine these categories will be a version of Classification And Regression Trees that can be applied to censored survival data. The major goals of this dissertation are to obtain data-derived categories for risk assessment, to compare these categories to the ones already recommended in the medical community, and to assess the performance of these categories in predicting survival probabilities.
 Date Issued
 2005
 Identifier
 FSU_migr_etd4402
 Format
 Thesis
 Title
 Statistical Shape Analysis on Manifolds with Applications to Planar Contours and Structural Proteomics.
 Creator

Ellingson, Leif A., Patrangenaru, Vic, Mio, Washington, Zhang, Jinfeng, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The technological advances in recent years have produced a wealth of intricate digital imaging data that can be analyzed effectively using the principles of shape analysis. Such data often lie on either high-dimensional or infinite-dimensional manifolds. With computing power now strong enough to handle these data, it is necessary to develop theoretically sound methodology to perform the analysis in a computationally efficient manner. In this dissertation, we propose approaches for doing so for planar contours and the three-dimensional atomic structures of protein binding sites. First, we adapt Kendall's definition of direct similarity shapes of finite planar configurations to shapes of planar contours under certain regularity conditions and utilize Ziezold's nonparametric view of Fréchet mean shapes. The space of direct similarity shapes of regular planar contours is embedded in a space of Hilbert-Schmidt operators in order to obtain the Veronese-Whitney extrinsic mean shape. For computations, it is necessary to use discrete approximations of both the contours and the embedding. For cases when landmarks are not provided, we propose an automated, randomized landmark selection procedure that is useful for contour matching within a population and is consistent with the underlying asymptotic theory. For inference on the extrinsic mean direct similarity shape, we consider a one-sample neighborhood hypothesis test and the use of the nonparametric bootstrap to approximate confidence regions. Bandulasiri et al. (2008) suggested using extrinsic reflection size-and-shape analysis to study the relationship between the structure and function of protein binding sites. In order to obtain meaningful results with this approach, it is necessary to identify the atoms common to a group of binding sites with similar functions and to obtain proper correspondences for these atoms. We explore this problem in depth and propose an algorithm, based on the Iterative Closest Point algorithm, for simultaneously finding the common atoms and their respective correspondences. For a benchmark data set, our classification results compare favorably with those of leading established methods. Finally, we discuss current directions in the field of statistics on manifolds, including a computational comparison of intrinsic and extrinsic analysis for various applications and a brief introduction to sample spaces with manifold stratification.
 Date Issued
 2011
 Identifier
 FSU_migr_etd0053
 Format
 Thesis
 Title
 Multistate Intensity Model with AR-GARCH Random Effect for Corporate Credit Rating Transition Analysis.
 Creator

Li, Zhi, Niu, Xufeng, Huffer, Fred, Kercheval, Alec, Wu, Wei, Department of Statistics, Florida State University
 Abstract/Description

This thesis presents a stochastic process and time series study on corporate credit rating and market implied rating transitions. By extending an existing model, it incorporates generalized autoregressive conditional heteroscedastic (GARCH) random effects to capture volatility changes in the instantaneous transition rates. The GARCH model is a crucial part of financial research because its ability to model volatility changes gives market practitioners the flexibility to build more accurate models on high-frequency financial data. Corporate rating transition modeling has historically dealt with low-frequency data, for which there was no need to specify the volatility. However, the newly published Moody's market implied ratings exhibit much higher transition frequencies. Therefore, we feel that it is necessary to capture the volatility component and extend existing models to reflect this fact. The theoretical model specification and estimation details are discussed thoroughly in this dissertation. The performance of our models is studied on several simulated data sets and compared to the original model. Finally, the models are applied to both Moody's issuer rating and market implied rating transition data.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1426
 Format
 Thesis
 Title
 The Effect of Risk Factors on Coronary Heart Disease: An Age-Relevant Multivariate Meta Analysis.
 Creator

Li, Yan, McGee, Dan, She, Yiyuan, Eberstein, Ike, Niu, Xufeng, Department of Statistics, Florida State University
 Abstract/Description

The importance of major risk factors, such as hypertension, total cholesterol, body mass index, diabetes, and smoking, for predicting incidence and mortality of Coronary Heart Disease (CHD) is well known. In light of the fact that age is also a major risk factor for CHD death, a natural question is whether the risk effects on CHD change with age. This thesis focuses on examining the interaction between age and risk factors using data from multiple studies containing differing age ranges. The aim of my research is to use statistical methods to determine whether we can combine these diverse results to obtain an overall summary, from which one can determine how the risk effects on CHD death change with age. One intuitive approach is to use classical meta analysis based on generalized linear models. More specifically, one can fit a logistic model with CHD death as the response and age, a risk factor and their interaction as covariates for each of the studies, and conduct meta analysis on every set of three coefficients in the multivariate setting to obtain 'synthesized' coefficients. Another aspect of the thesis is a new method, meta analysis with respect to curves, that goes beyond linear models. The basic idea is that one can choose the same spline with the same knots on covariates, say age and systolic blood pressure (SBP), for all the studies to ensure common basis functions. The knot-based tensor product basis coefficients obtained from penalized logistic regression can be used for multivariate meta analysis. Using the common basis functions and the 'synthesized' knot-based basis coefficients from meta analysis, a two-dimensional smooth surface on the age-SBP domain is estimated. By cutting through the smooth surface along the two axes, the resulting slices show how the risk effect on CHD death changes at an arbitrary age as well as how the age effect on CHD death changes at an arbitrary SBP value. The application to multiple studies will be presented.
 Date Issued
 2010
 Identifier
 FSU_migr_etd1428
 Format
 Thesis
 Title
 Optimal Linear Representations of Images under Diverse Criteria.
 Creator

Rubinshtein, Evgenia, Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Chicken, Eric, Department of Statistics, Florida State University
 Abstract/Description

Image analysis often requires dimension reduction before statistical analysis, in order to apply sophisticated procedures. Motivated by eventual applications, a variety of criteria have been proposed: reconstruction error, class separation, non-Gaussianity using kurtosis, sparseness, mutual information, recognition of objects, and their combinations. Although some criteria have analytical solutions, the remaining ones require numerical approaches. We present geometric tools for finding linear projections that optimize a given criterion for a given data set. The main idea is to formulate a problem of optimization on a Grassmann or a Stiefel manifold, and to use the differential geometry of the underlying space to construct optimization algorithms. Purely deterministic updates lead to local solutions, and the addition of random components allows for stochastic gradient searches that eventually lead to global solutions. We demonstrate these results using several image datasets, including natural images and facial images.
 Date Issued
 2006
 Identifier
 FSU_migr_etd1926
 Format
 Thesis
 Title
 A Class of Mixed-Distribution Models with Applications in Financial Data Analysis.
 Creator

Tang, Anqi, Niu, Xufeng, Cheng, Yingmei, Wu, Wei, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Statisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zero-inflated longitudinal data, where the response variable has a large proportion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a two-part mixed distribution model for zero-inflated longitudinal data. The first part of the model is a logistic regression model for the probability of a nonzero response; the other part is a linear model for the mean response given that the outcome is not zero. Random effects with AR(1) covariance structure are introduced into both parts of the model to allow for serial correlation and subject-specific effects. Estimating the two-part model is challenging because of the high-dimensional integration necessary to obtain the maximum likelihood estimates. We propose a Monte Carlo EM algorithm for computing the maximum likelihood estimates of the parameters. Through a simulation study, we demonstrate the good performance of the MCEM method in parameter and standard error estimation. To illustrate, we apply the two-part model with correlated random effects and the model with autoregressive random effects to executive compensation data to investigate potential determinants of CEO stock option grants.
 Date Issued
 2011
 Identifier
 FSU_migr_etd1710
 Format
 Thesis
 Title
 Impact of Missing Data on Building Prognostic Models and Summarizing Models Across Studies.
 Creator

Munshi, Mahtab R., McGee, Daniel, Eberstein, Isaac, Hollander, Myles, Niu, Xufeng, Chattopadhyay, Somesh, Department of Statistics, Florida State University
 Abstract/Description

We examine the impact of missing data in two settings: the development of prognostic models and the addition of new risk factors to existing risk functions. Most statistical software presently available performs complete case analysis, wherein only participants with known values for all of the characteristics being analyzed are included in model development. Missing data also affects the summarization of evidence across multiple studies using meta-analytic techniques. As we progress in medical research, new covariates become available for studying various outcomes. While we want to investigate the influence of new factors on the outcome, we also do not want to discard the historical datasets that do not have information about these markers. Our research plan is to investigate different methods for estimating the parameters of a model when some of the covariates are missing. These methods include likelihood-based inference for the study-level coefficients and likelihood-based inference for the logistic model on the person-level data. We compare the results from our methods to the corresponding results from complete case analysis. We focus our empirical investigation on a historical example, the addition of high density lipoproteins to existing equations for predicting death due to coronary heart disease. We verify our methods through simulation studies on this example.
 Date Issued
 2005
 Identifier
 FSU_migr_etd2191
 Format
 Thesis
 Title
 Analysis of Multivariate Data with Random Cluster Size.
 Creator

Li, Xiaoyun, Sinha, Debajyoti, Zhou, Yi, McGee, Dan, Lipsitz, Stuart, Department of Statistics, Florida State University
 Abstract/Description

In this dissertation, we examine binary correlated data with a present/absent component or missing data that are related to the binary responses of interest. Depending on the data structure, correlated binary data can be referred to as clustered data if the sampling unit is a cluster of subjects, or as longitudinal data when it involves repeated measurement of the same subject over time. We propose novel models for these two data structures and illustrate them with real data applications. In biomedical studies involving clustered binary responses, the cluster size can vary because some components of the cluster can be absent. When both the presence of a cluster component and the binary disease status of a present component are treated as responses of interest, we propose a novel two-stage random effects logistic regression framework. For ease of interpretation of the regression effects, both the marginal probability of presence/absence of a component and the conditional probability of disease status of a present component preserve the approximate logistic regression forms. We present a maximum likelihood method of estimation implementable using standard statistical software. We compare our models and the physical interpretation of regression effects with competing methods from the literature. We also present a simulation study to assess the robustness of our procedure to misspecification of the random effects distribution and to compare the finite sample performance of the estimates with existing methods. The methodology is illustrated by analyzing a study of the periodontal health status in a diabetic Gullah population. We extend this model to longitudinal studies with a binary longitudinal response and informative missing data. In longitudinal studies, when treating each subject as a cluster, the cluster size is the total number of observations for each subject. When data are informatively missing, the cluster size of each subject can vary and is related to the binary response of interest, and we are also interested in the missing mechanism. This is a modification of the clustered binary data setting with present components. We modify and adapt our proposed two-stage random effects logistic regression model so that both the marginal probability of the binary response and missing indicator and the conditional probability of the binary response and missing indicator preserve logistic regression forms. We present a Bayesian framework for this model and illustrate our proposed model on an AIDS data example.
 Date Issued
 2011
 Identifier
 FSU_migr_etd1425
 Format
 Thesis
 Title
 A Statistical Approach for Information Extraction of Biological Relationships.
 Creator

Bell, Lindsey R., Zhang, Jinfeng, Niu, Xufeng, Tyson, Gary, Huffer, Fred, Department of Statistics, Florida State University
 Abstract/Description

Vast amounts of biomedical information are stored in scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but this is time- and resource-consuming. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); and then relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature. Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets, where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian Networks, Support Vector Machines, and a mixture of logistic models defined by interaction word. The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross-validation and cross-corpus validation to replicate an application scenario. The three classifiers are distinct, and we find that the performance of individual classifiers varies depending on the corpus. Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
 Date Issued
 2011
 Identifier
 FSU_migr_etd1314
 Format
 Thesis