 Improvement of Quality Prediction in Inter-Connected Manufacturing System by Integrating Multi-Source Data.
Ren, Jie, Wang, Hui, Vanli, Omer Arda, Park, Chiwoo, Huffer, Fred W. (Fred William), Florida State University, FAMU-FSU College of Engineering, Department of Industrial and Manufacturing Engineering
With the development of advanced sensing and network technology such as wireless data transmission and data storage and analytics under cloud platforms, the manufacturing plant is going through a new revolution, by which different production units/components can communicate with each other, leading to interconnected manufacturing. The interconnection enables the close coordination of process control actions among machines to improve product quality. Traditional quality prediction methods that focus on the data from one single source are not sufficient to deal with the variation modeling and quality prediction problems involved in interconnected manufacturing. Instead, new quality prediction methods that can integrate the data from multiple sources are necessary. This research addresses the fundamental challenges in improving quality prediction by data fusion for interconnected manufacturing, including knowledge sharing and transfer among different machines and collaboration error monitoring. The methodology is demonstrated through surface machining and additive manufacturing processes. The first study is on the surface quality prediction for one machining process by fusing multi-resolution spatial data measured from multiple surfaces or different surface machining processes. The surface variation is decomposed into a global trend part that characterizes the spatially varying relationship of selected process variables and surface height and a zero-mean spatial Gaussian process part. Three models, including two varying coefficient-based spatial models and an inference rule-based spatial model, are proposed and compared. Also, a transfer learning technique is used to help train the model by transferring useful information from a data-rich surface to a data-lacking surface, which demonstrates the advantage of interconnected manufacturing.
The second study deals with the surface mating errors caused by the surface variations from two interconnected surface machining processes. A model aggregating data from two surfaces is proposed to predict the leak areas for surface assembly. By using the measurements of leak areas and the profiles of the mated surfaces as training data along with the Hagen–Poiseuille law, this study develops a novel diagnostic method to predict potential leak areas (leakage paths). The effectiveness and robustness of the proposed method are verified by an experiment and a simulation study. The approach provides practical guidance for the subsequent assembly process as well as troubleshooting in manufacturing processes. The last study focuses on the learning of a quality prediction model in interconnected additive manufacturing systems, in which the different 3D printing processes involved are driven by similar printing mechanisms and can exchange quality data via a network. A quality prediction model that estimates the printing widths along the printing paths for material-extrusion-based additive manufacturing (a.k.a. fused filament fabrication or fused deposition modeling) is established by leveraging the between-printer quality data. The established mathematical model quantifies the printing line width along the printing paths based on the kinematic parameters, e.g., printing speed and acceleration, while considering data from multiple printers that contain between-machine similarity. The method allows for between-printer knowledge sharing to improve the quality prediction so that a printing process with limited historical data can quickly learn an effective quality model without intensive retraining, thus improving the system responsiveness to product variety. In the long run, the outcome of this research can contribute to the development of highly efficient Internet-of-Things manufacturing services for personalized products.
 2019
 2019_Spring_Ren_fsu_0071E_15160
 Thesis
 Dynamic and Stochastic Transition of Traffic Conditions and Its Application in Urban Traffic Mobility.
Kidando, Emmanuel, Moses, Ren, Duncan, Michael Douglas, Ozguven, Eren Erman, Sobanjo, John Olusegun, Sando, Thobias M., Florida State University, FAMU-FSU College of Engineering, Department of Civil and Environmental Engineering
Analytical models developed using field data can provide useful information with acceptable confidence to evaluate and predict the operational characteristics of a highway. As such, this study presents statistical models that can be used to estimate the travel time or speed distribution, cluster different traffic conditions, model the dynamic transition of traffic regimes (DTR), and quantify the disparity effects on the DTR associated with different lateral lane positions (i.e., lane near shoulder, middle lane(s), and lane near a median) as well as different days of the week. In the analysis, this study uses Bayesian frameworks to estimate the model parameters. These frameworks reduce the impact of model overfitting and also incorporate uncertainty in the estimates. Data from a freeway corridor along I-295 located in Jacksonville, Florida were selected for analysis. The data set includes data from individual microwave vehicle sensors, segment-level aggregated traffic data, and data aggregated at a corridor level. The probabilistic frameworks developed by this study can be a useful resource in detecting and evaluating different traffic conditions, which can facilitate planning actions to implement congestion-related countermeasures in urban areas. In addition, findings from the hierarchical regression model presented by the current study can be used in the application of intelligent transportation systems, mainly in dynamic lane-management strategies.
 2019
 2019_Spring_Kidando_fsu_0071E_15049
 Thesis
 Impact of Violations of Measurement Invariance in Longitudinal Mediation Modeling.
Xu, Jie, Yang, Yanyun, Zhang, Qian, Huffer, Fred W. (Fred William), Becker, Betsy J., Florida State University, College of Education, Department of Educational Psychology and Learning Systems
Research has shown that cross-sectional mediation analysis cannot accurately reflect a true longitudinal mediated effect. To investigate longitudinal mediated effects, different longitudinal mediation models have been proposed, and these models focus on different research questions related to longitudinal mediation. When fitting mediation models to longitudinal data, the assumption of longitudinal measurement invariance is usually made. However, the consequences of violating this assumption have not been thoroughly studied in mediation analysis. No studies have examined issues of measurement noninvariance in a latent cross-lagged panel mediation (LCPM) model with three or more measurement occasions. The goal of the current study is to investigate the impact of violations of measurement invariance on longitudinal mediation analysis. The focal model in the study is the LCPM model suggested by Cole and Maxwell (2003). This model can be used to examine mediated effects among the latent predictor, mediator, and outcome variables across time. In addition, it can account for measurement error and allow for the evaluation of longitudinal measurement invariance. Simulation methods were used, and the investigation was performed using population covariance matrices and sample data generated under various conditions. Eight design factors were considered for data generation: sample size, proportion of noninvariant items, position of latent factors with noninvariant items, type of noninvariant parameters, magnitude of noninvariance, pattern of noninvariance, size of the direct effect, and size of the mediated effect. Results from the population investigation were evaluated based on overall model fit and the calculated direct and mediated effects; results from the finite sample analysis were evaluated in terms of convergence and inadmissible solutions, overall model fit, bias/relative bias, coverage rates, and statistical power/type I error rates.
In general, results obtained from the finite sample analysis were consistent with those from the population investigation, with respect to both model fit and parameter estimation. The type I error rate of the mediated effects was inflated under the noninvariant conditions with small sample size (200); power of the direct and mediated effects was excellent (1.0 or close to 1.0) across all investigated conditions. Type I error rates based on the chi-square statistic test were seriously inflated under the invariant conditions, especially when the sample size was relatively small. Power for detecting model misspecifications due to longitudinal noninvariance was excellent across all investigated conditions. Fit indices (CFI, TLI, RMSEA, and SRMR) were not sensitive in detecting misspecifications caused by violations of measurement invariance in the investigated LCPM model. Study results also showed that as the magnitude of noninvariance, the proportion of noninvariant items, and the number of positions of latent variables with noninvariant items increased, estimation of the direct and mediated effects tended to be less accurate. The decreasing pattern of change in item parameters over measurement occasions resulted in the least accurate estimates of the direct and mediated effects. Parameter estimates were fairly accurate under the conditions of the decreasing and then increasing pattern and the mixed pattern of change in item parameters. Findings from this study can help empirical researchers better understand the potential impact of violating measurement invariance on longitudinal mediation analysis using the LCPM model.
 2019
 2019_Spring_Xu_fsu_0071E_14994
 Thesis
 Learning Political Will in Organizations: A Social Learning Theory Perspective.
Maher, Liam Patrick, Ferris, Gerald R., Schatschneider, Christopher, Hochwarter, Wayne A., Van Iddekinge, Chad H., Wang, Gang, Florida State University, College of Business, Department of Management
The past several decades have seen great advances in the field of organizational politics. At the individual level, political skill has garnered the majority of the scholarly focus, whereas its motivational counterpart, political will, has gone relatively unexamined. Political will represents the motivation to engage in political behavior, which, regardless of the skill with which it is executed, potentially has tremendous effects on myriad organizational outcomes. Thus, it is critical for scholars to understand how political will spreads through work units. This dissertation synthesizes theories of political will, political skill, social identity, social learning, and relationship quality to explain the process of how followers learn political will from their leaders and environments. Specifically, I plan to show that when leaders possess political will, they engage in political behavior. Followers will learn the virtues and drawbacks of political behavior from their leaders, both vicariously and through direct mentoring, and thus their political will should be a function of their leader’s political will. Because leaders and their many followers differ in their levels of leader-member relationship quality, political skill, and self-concept congruence, it is proposed that these differences will drive the level of learning that occurs. The proposed model is tested using data from 406 government workers and their 78 direct supervisors. The primary analyses supported only the hypothesis that leader political will predicts leader political behavior. Exploratory analyses that employed follower-rated measures of leader political behavior provided evidence that follower political will is a function of follower perceptions of their leader’s political behavior and their own histories with organizational politics. Strengths, limitations, and opportunities for future research are discussed.
 2018
 2018_Sp_Maher_fsu_0071E_14422
 Thesis
 Traits, Species, and Communities: Integrative Bayesian Approaches to Ecological Biogeography across Geographic, Environmental, Phylogenetic, and Morphological Space.
Humphreys, John M., Elsner, James B., Steppan, Scott J., Mesev, Victor, Pau, Stephanie, Florida State University, College of Social Sciences and Public Policy, Department of Geography
Assuming a methodological perspective, this dissertation proceeds through a series of studies that cover levels of biological organization ranging from the morphological traits of individual specimens to community assemblages. The presented research explores geographic extents ranging from local to global scales, examines both plants and animals, and explores relationships among species with common ancestry. The research appraises and then proposes solutions to a variety of yet unresolved issues in species distribution modeling, including preferential sampling, spatial dependency, multi-scaled spatial processes, niche equilibrium assumptions, data structure arising from shared evolutionary history, and correlations between predictor variables. Approaching the geographic distribution of wetlands as an applied concern, the study presented in Chapter 2 emphasizes that the identification and inventory of wetlands are essential components of water resource management. To be effective in these endeavors, it is critical that the process used to detect and document wetlands be time efficient, accurate, and repeatable as new environmental information becomes available. Approaches dependent on aerial photographic interpretation of land cover by individual human analysts necessitate hours of assessment, introduce human error, and fail to include the best available soils and hydrologic data. The goal of Chapter 2 is to apply hierarchical modeling and Bayesian inference to predict the probability of wetland presence as a continuous gradient with the explicit consideration of spatial structure. The presented spatial statistical model can evaluate 100 km² at a 50 × 50 meter resolution in approximately 50 minutes while simultaneously incorporating ancillary data and accounting for latent spatial processes.
Model results demonstrate an ability to consistently capture wetlands identified through aerial interpretation with greater than 90% accuracy (scaled Brier Score) and to identify wetland extents, ecotones, and hydrologic connections not identified through use of other modeling and mapping techniques. The provided model is reasonably robust to changes in resolution, areal extents between 100 km² and 300 km², and region-specific physical conditions. As with modeling wetland occurrence, species distribution modeling aimed at forecasting the spread of invasive species under projected global warming also offers land managers an important tool for assessing future ecological risk and for prioritizing management actions. Chapter 3 applies Bayesian inference and newly available geostatistical tools to forecast global range expansion for the ecosystem-altering invasive climbing fern Lygodium microphyllum. The presented modeling framework emphasizes the need to account for spatial processes at both the individual and aggregate levels, the necessity of modeling nonlinear responses to environmental gradients, and the explanatory power of biotic covariates. Results indicate that Lygodium microphyllum will undergo global range expansion in concert with anthropogenic global warming and that the species is likely temperature and dispersal limited. Predictions are presented for current and future climate conditions assuming both limited and unlimited dispersal scenarios. Finally, Chapter 4 provides a novel framework to combine multispecies joint modeling techniques with spatially explicit phylogenetic regression to simultaneously predict the probability of species occurrence and the geographic distribution of interspecific continuous morphological traits. Choosing the South American leaf-eared mice (genus: Phyllotis) as an empirical example, a three-tiered phylogenetic coregionalization trait biogeography model (PhyCoRTBio) is constructed.
The conditionally dependent structure of the PhyCoRTBio model enables information from multiple species and from multiple specimen-specific trait metrics to be leveraged towards estimation of a focal species distribution. I hypothesize that, relative to other commonly used species distribution modeling methods, the PhyCoRTBio approach will exhibit improved performance in predicting occurrence for species within the genus Phyllotis. After describing its statistical implementation, this hypothesis is assessed by constructing PhyCoRTBio models for six different Phyllotis species and then comparing results to those derived using maximum entropy methods, random forest clustering, Gaussian random field species distribution models, and Hierarchical Bayesian species distribution models. To judge the relative performance of each modeling approach, model sensitivity (proportion of correctly predicted presences), specificity (proportion of correctly predicted absences), the area under the receiver operating characteristic curve (AUC), and the True Skill Statistic (TSS) are calculated. Findings indicate that trait-based covariates improve model performance and highlight the need to consider spatial processes and phylogenetic information during multispecies distribution modeling.
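The performance measures named in this abstract can be illustrated with a minimal sketch. It assumes binary presence/absence labels (1 = presence, 0 = absence) and uses the standard definition TSS = sensitivity + specificity - 1; the function name `skill_metrics` and the sample data are hypothetical and are not taken from the dissertation.

```python
def skill_metrics(observed, predicted):
    """Return (sensitivity, specificity, TSS) for binary presence/absence data.

    observed, predicted: equal-length sequences of 0/1 labels (assumed format).
    """
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # correctly predicted presences
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # missed presences
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # correctly predicted absences
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false presences

    sensitivity = tp / (tp + fn)  # proportion of correctly predicted presences
    specificity = tn / (tn + fp)  # proportion of correctly predicted absences
    tss = sensitivity + specificity - 1.0  # True Skill Statistic
    return sensitivity, specificity, tss


if __name__ == "__main__":
    # Hypothetical survey outcomes and model predictions, for illustration only.
    observed = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(skill_metrics(observed, predicted))
```

TSS ranges from -1 to 1, with values near 0 indicating performance no better than chance. The sketch does not guard against a data set that contains no presences or no absences.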
 2018
 2018_Sp_Humphreys_fsu_0071E_14298
 Thesis
 Critical Issues in Survey Meta-Analysis.
Gozutok, Ahmet Serhat, Becker, Betsy Jane, Huffer, Fred W., Yang, Yanyun, Paek, Insu, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
In research synthesis, researchers may aim at summarizing people's attitudes and perceptions of phenomena that have been assessed using different measures. Self-report rating scales are among the most commonly used measurement tools to quantify such latent constructs in education and psychology. However, self-report rating-scale questions measuring the same construct may differ from each other in many ways. Scale format, number of response options, wording of questions, and labeling of response option categories may vary across questions. Consequently, variations across the measures of the same construct raise the issue of comparability of results across studies in meta-analytic investigations. In this study, I examine the complexities of summarizing the results of different survey questions about the same construct in a meta-analytic fashion. More specifically, this study focuses on the practical problems that arise when combining survey items that differ from one another in the wording of question stems, numbers of response option categories, scale direction (i.e., unipolar and bipolar scales), response scale labeling (i.e., fully labeled scales and endpoints-labeled scales), and response-option labeling (e.g., "extremely happy"  "completely happy"  "most happy", "pretty happy", "quite happy" "moderately happy", and "not at all happy"  "least happy"  "most unhappy"). In addition, I propose practical solutions to handle the issues that arise due to such variations when conducting a meta-analysis. I discuss the implications of the proposed solutions from the perspective of meta-analysis. Examples are obtained from the collection of studies in the World Happiness Database (Veenhoven, 2006), which includes various single-item happiness measures.
 2018
 2018_Fall_Gozutok_fsu_0071E_14866
 Thesis
 The Impact of Rater Variability on Relationships among Different Effect-Size Indices for Inter-Rater Agreement between Human and Automated Essay Scoring.
Yun, Jiyeo, Becker, Betsy Jane, Huffer, Fred W. (Fred William), Paek, Insu, Zhang, Qian, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
Since researchers began investigating automatic scoring systems in writing assessments, they have dealt with relationships between human and machine scoring and have suggested evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used to assess the relatedness of human and automated essay scoring, and to examine impacts of rater variability on inter-rater agreement. To implement the investigations, my study consists of two parts: empirical and simulation studies. Based on the results from the empirical study, the overall effects for inter-rater agreement were .63 and .99 for exact and adjacent proportions of agreement, .48 for kappas, and between .75 and .78 for correlations. Additionally, significant differences existed between 6-point scales and the other scales (i.e., 3-, 4-, and 5-point scales) for correlations, kappas, and proportions of agreement. Moreover, based on the results of the simulated data, the highest agreements and lowest discrepancies were achieved in the matched rater distribution pairs. Specifically, the means of exact and adjacent proportions of agreement, kappa and weighted kappa values, and correlations were .58, .95, .42, .78, and .78, respectively. Meanwhile, the average standardized mean difference was .0005 in the matched rater distribution pairs. Acceptable values for inter-rater agreement as evaluation criteria for automated essay scoring, impacts of rater variability on inter-rater agreement, and relationships among inter-rater agreement indices are discussed.
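Three of the agreement indices mentioned in this abstract can be sketched as follows. This is an illustrative example only: it assumes integer scores, treats "adjacent" agreement as scores within one point of each other (exact matches included), and computes unweighted Cohen's kappa. The function names and sample scores are hypothetical and do not come from the study.

```python
from collections import Counter


def exact_agreement(human, machine):
    """Proportion of essays on which the two raters give identical scores."""
    return sum(1 for h, m in zip(human, machine) if h == m) / len(human)


def adjacent_agreement(human, machine):
    """Proportion of essays scored within one point (assumed definition)."""
    return sum(1 for h, m in zip(human, machine) if abs(h - m) <= 1) / len(human)


def cohens_kappa(human, machine):
    """Unweighted Cohen's kappa: chance-corrected exact agreement."""
    n = len(human)
    p_observed = exact_agreement(human, machine)
    human_counts, machine_counts = Counter(human), Counter(machine)
    # Expected agreement if the raters scored independently with these marginals.
    p_expected = sum(human_counts[k] * machine_counts[k] for k in human_counts) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical scores on a 4-point scale, for illustration only.
    human = [1, 2, 3, 4, 2, 3]
    machine = [1, 2, 4, 4, 3, 1]
    print(exact_agreement(human, machine))
    print(adjacent_agreement(human, machine))
    print(cohens_kappa(human, machine))
```

Kappa is typically lower than the exact proportion of agreement because it subtracts the agreement expected by chance, which is consistent with the ordering of the values reported in the abstract.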
 2017
 FSU_FALL2017_Yun_fsu_0071E_14144
 Thesis
 Regressing over Linear-Circular Data Using a Mixture of Linear-Linear Regression Models.
EsmaieeliSikaroudi, Ali, Park, Chiwoo, Vanli, Omer Arda, Shanbhag, Sachin, Florida State University, College of Engineering, Department of Industrial and Manufacturing Engineering
Regression over circular response data requires special methods due to the periodic nature of this data type. In previous works, researchers tried to use the concept of projecting real-line distributions on unit circles or using transformation methods to transform circular responses to the real line and vice versa; however, their methods only work for simple data, and in some cases they are complicated and slow. In this research, circular responses are treated as the output of the modulo operation on unobserved linear responses. A mixture of multiple linear-linear regression models is used to implement this idea. We used a Gaussian mixture method to model the data and Gibbs sampling to tune the parameters. The idea itself offers a new way to look at the linear-circular regression problem and can be used as the foundation for other methods to be developed in the future.
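The central idea of this abstract, that a circular response is the modulo of an unobserved linear response, can be illustrated with a short sketch. It assumes angles in radians wrapped to [0, 2π) and a single known regression line; every name here is hypothetical. It is not the thesis's Gaussian mixture or Gibbs sampling procedure, only an illustration of why each wrap count corresponds to its own linear-linear component.

```python
import math

TWO_PI = 2.0 * math.pi


def wrap(linear_response):
    """Observed circular response: the unobserved linear response modulo 2*pi."""
    return linear_response % TWO_PI


def component_line(x, intercept, slope, k):
    """Linear-linear component k: the same line shifted down by k full turns."""
    return intercept + slope * x - TWO_PI * k


def assign_component(x, theta, intercept, slope, max_k):
    """Pick the wrap count k whose shifted line lies closest to the observed angle."""
    return min(
        range(max_k + 1),
        key=lambda k: abs(theta - component_line(x, intercept, slope, k)),
    )


if __name__ == "__main__":
    intercept, slope = 0.5, 1.0  # hypothetical regression parameters
    for x in (1.0, 7.0, 13.0):
        theta = wrap(intercept + slope * x)  # what would actually be observed
        k = assign_component(x, theta, intercept, slope, max_k=3)
        print(x, round(theta, 4), k)
```

In a mixture formulation the wrap count k would be a latent label to be inferred together with the regression parameters, rather than computed from a known line as in this sketch.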
 2017
 FSU_2017SP_EsmaieeliSikaroudi_fsu_0071N_13833
 Thesis
 A Weakly-Informative Group-Specific Prior Distribution for Meta-Analysis.
Thompson, Christopher, Becker, Betsy Jane, Clark, Kathleen M., Almond, Russell G., Aloe, Ariel M., Yang, Yanyun, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
While Bayesian meta-analysis has flourished both in methodological and substantive work, group-specific Bayesian modeling remains scarce. Common practice for choosing prior distributions entails using typical noninformative priors. Currently, there is a push to use more informative prior distributions. In this dissertation I propose a group-specific weakly informative prior distribution. The new prior distribution uses a frequentist estimate of between-studies heterogeneity as the noncentrality parameter in a folded noncentral t distribution. This new distribution is then modeled individually for groups based on some categorical factor. An extensive simulation study was performed to assess the performance of the new group-specific prior distribution relative to several noninformative prior distributions in a variety of meta-analytic scenarios. An application using data from a previously published meta-analysis on dynamic geometry software is also provided.
 2016
 FSU_2016SP_Thompson_fsu_0071E_13051
 Thesis
 Structural Health Monitoring with Lamb-Wave Sensors: Problems in Damage Monitoring, Prognostics and Multisensory Decision Fusion.
Mishra, Spandan, Vanli, Omer Arda, Okoli, Okenwa, Jung, Sungmoon, Park, Chiwoo, Florida State University, FAMU-FSU College of Engineering, Department of Industrial and Manufacturing Engineering
Carbon fiber reinforced composites (CFRC) have several desirable traits that can be exploited in the design of advanced structures and systems. Applications requiring a high strength-to-weight ratio and a high stiffness-to-weight ratio, such as airplane fuselages, wind turbine blades, and waterboats, have made extensive use of CFRC. Furthermore, low density, good vibration damping ability, easy manufacturability, carbon fiber’s electrical conductivity, high thermal conductivity, and smooth surface finish provide additional benefits to users. Applications of CFRC are relevant to aerospace, military, wind turbines, robotics, sports equipment, etc. However, among the many advantages of CFRC there are a few disadvantages: CFRC undergo completely different failure patterns compared to metals. Once the yield strength is exceeded, CFRC will fail suddenly and catastrophically. The inherent anisotropic nature of CFRC makes it very difficult for traditional condition monitoring methods to assess the condition of the structure. The complex failure patterns, including delamination, microcracks, and matrix cracks, require specialized sensing and monitoring schemes for composite structures. This Ph.D. research focuses on developing an integrated structural health monitoring methodology for damage monitoring, remaining useful life (RUL) estimation, and decision fusion using Lamb-wave data. The main objective of this research is to develop an integrated damage detection method that utilizes Lamb-wave sensor data to infer the state of the damage condition and make an accurate prognosis of the structure. Slow fatigue loading results in unique failure patterns in CFRC structures: fatigue damage first manifests itself as fiber breakage, then slowly progresses to matrix cracks, and ultimately leads to delamination damage.
This type of failure process is very diﬃcult to monitor using the traditionally used damage monitoring methods such as Xray evaluation, ultrasonic evaluation, infrared evaluation etc. For this research, we have used principal component (PC) based multivariate cumulative sum (MCUSUM) to monitor the structure. MCUSUM chart is very useful when monitoring structures undergoing slow and gradual change. For remainingusefullife (RUL) estimation, we have proposed to use the Wiener process model coupled with principal component regression (PCR). For damage detection/classiﬁcation we studied discriminant analysis, inspite of the popular use in image analysis and in the gene data classiﬁcation problem, has not been widely used for damage classiﬁcation. In this research, we showed that discriminant analysis is a useful detecting known damage modes, while dealing with the high dimensionality of Lambwave data. We modiﬁed the standard Gaussian discriminant analysis by introducing regularization parameters to directly process raw Lambwave data without requiring an intermediate feature extraction step.
 2016
 FSU_2016SU_Mishra_fsu_0071E_13346
 Thesis
 Investigating the Chi-Square-Based Model-Fit Indexes for WLSMV and ULSMV Estimators.
Xia, Yan, Yang, Yanyun, Huffer, Fred W. (Fred William), Almond, Russell G., Becker, Betsy Jane, Paek, Insu, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
In structural equation modeling (SEM), researchers use the model chi-square statistic and model-fit indexes to evaluate model-data fit. Root mean square error of approximation (RMSEA), comparative fit index (CFI), and Tucker-Lewis index (TLI) are widely applied model-fit indexes. When data are ordered and categorical, the most popular estimator is the diagonally weighted least squares (DWLS) estimator. Robust corrections have been proposed to adjust the uncorrected chi-square statistic from DWLS so that its first- and second-order moments are in alignment with the target central chi-square distribution under correctly specified models. DWLS with such a correction is called the mean- and variance-adjusted weighted least squares (WLSMV) estimator. An alternative to WLSMV is the mean- and variance-adjusted unweighted least squares (ULSMV) estimator, which has been shown to perform as well as, or slightly better than, WLSMV. Because the chi-square statistic is corrected, the chi-square-based RMSEA, CFI, and TLI are also corrected by replacing the uncorrected chi-square statistic with the robust chi-square statistic. The robust model fit indexes calculated in this way are called the population-corrected robust (PR) model fit indexes, following Brosseau-Liard, Savalei, and Li (2012). The PR model fit indexes are currently reported in almost every application when WLSMV or ULSMV is used. Nevertheless, previous studies have found that the PR model fit indexes from WLSMV are sensitive to several factors such as sample size, model size, and thresholds for categorization. The first focus of this dissertation is on the dependency of model fit indexes on the thresholds for ordered categorical data. Because the weight matrix in the WLSMV fit function and the correction factors for both WLSMV and ULSMV include the asymptotic variances of thresholds and polychoric correlations, the model fit indexes are very likely to depend on the thresholds.
The dependency of model fit indexes on the thresholds is not a desirable property, because when the misspecification lies in the factor structures (e.g., cross-loadings are ignored or two factors are considered as a single factor), model fit indexes should reflect such misspecification rather than the threshold values. As alternatives to the PR model fit indexes, Brosseau-Liard et al. (2012), Brosseau-Liard and Savalei (2014), and Li and Bentler (2006) proposed the sample-corrected robust (SR) model fit indexes. The PR fit indexes are found to converge to distorted asymptotic values, but the SR fit indexes converge to their definitions asymptotically. However, the SR model fit indexes were proposed for continuous data, and have been neither investigated nor implemented in SEM software when WLSMV and ULSMV are applied. This dissertation thus investigates the PR and SR model fit indexes for WLSMV and ULSMV. The first part of the simulation study examines the dependency of the model fit indexes on the thresholds when the model misspecification results from omitting cross-loadings or collapsing factors in confirmatory factor analysis. The study is conducted on extremely large computer-generated datasets in order to approximate the asymptotic values of model fit indexes. The results show that only the SR fit indexes from ULSMV are independent of the population threshold values, given the other design factors. The PR fit indexes from ULSMV, and the PR and SR fit indexes from WLSMV, are influenced by thresholds, especially when data are binary and the hypothesized model is greatly misspecified. The second part of the simulation varies the sample sizes from 100 to 1000 to investigate whether the SR fit indexes under finite samples are more accurate estimates of the defined values of RMSEA, CFI, and TLI, compared with the uncorrected model fit indexes without robust correction and the PR fit indexes. Results show that the SR fit indexes are more accurate in general.
However, when the thresholds are different across items, data are binary, and sample size is less than 500, all versions of these indexes can be very inaccurate. In such situations, larger sample sizes are needed. In addition, the conventional cutoffs developed from continuous data with maximum likelihood (e.g., RMSEA < .06, CFI > .95, and TLI > .95; Hu & Bentler, 1999) have been applied to WLSMV and ULSMV regardless of the arguments against such a practice (e.g., Marsh, Hau, & Wen, 2004). For comparison purposes, this dissertation reports the RMSEA, CFI, and TLI based on continuous data using maximum likelihood before the variables are categorized to create ordered categorical data. Results show that the model fit indexes from maximum likelihood are very different from those from WLSMV and ULSMV, suggesting that the conventional rules should not be applied to WLSMV and ULSMV.
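For readers unfamiliar with how RMSEA, CFI, and TLI are derived from chi-square statistics, the following is a minimal sketch using the common textbook definitions. It is not taken from the dissertation: the function name `fit_indexes` and its arguments are hypothetical, and the RMSEA uses N - 1 in the denominator, whereas some software uses N. Passing a robust chi-square in place of the uncorrected one corresponds to the substitution the abstract describes for the PR indexes; the SR indexes are not implemented here.

```python
import math


def fit_indexes(chi2_model, df_model, chi2_base, df_base, n):
    """Return (RMSEA, CFI, TLI) from model and baseline chi-square values.

    Hypothetical helper for illustration; uses common textbook formulas.
    """
    # Estimated noncentrality of each model, truncated at zero.
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)

    # RMSEA: noncentrality per degree of freedom per observation.
    rmsea = math.sqrt(d_model / (df_model * (n - 1)))

    # CFI: relative reduction in noncentrality versus the baseline model.
    denom = max(d_model, d_base)
    cfi = 1.0 if denom == 0 else 1.0 - d_model / denom

    # TLI: compares chi-square/df ratios; undefined if the baseline ratio is 1.
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)

    return rmsea, cfi, tli


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not results from the study.
    print(fit_indexes(100.0, 50, 1000.0, 66, 501))
```

Because all three indexes are functions of the chi-square statistic, any dependence of that statistic on the thresholds carries over to the indexes, which is the concern the abstract raises.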
 2016
 FSU_2016SU_Xia_fsu_0071E_13379
 Thesis
 New Methods in Tornado Risk and Vulnerability Assessments.
Widen, Holly Marie, Elsner, James B., Hart, Robert E. (Robert Edward), Uejio, Christopher K., Pau, Stephanie, Medders, Lori A., Florida State University, College of Social Sciences and Public Policy, Department of Geography
This dissertation includes a series of studies that present innovative methodologies to improve tornado risk and vulnerability assessments. Limitations of the historical tornado dataset are well known and relate to inconsistencies in data collection procedures, rating assessments, updates in technology, and public awareness. The limitations make it difficult to accurately evaluate tornado risk and vulnerability. Thus, the research presented in this dissertation aims to 1) improve tornado risk assessments using the historical dataset by accounting for known non-meteorological factors and 2) enhance tornado vulnerability assessments by utilizing a new dataset containing more precise damage survey data. This work includes three individual studies, two focused on risk and one on vulnerability, using different geographic scales. Tornado occurrence rates computed from the available reports are biased low relative to the unknown true rates. A method to estimate the annual statewide probability of getting hit by a tornado reduces this low bias by using the average report density as a function of distance from the nearest city center. The method is demonstrated on Kansas and then applied to 15 other tornado-prone states from Nebraska to Tennessee over the period 1950-2011. The adjusted rates are significantly higher than the raw rates, and thus the return periods are shorter than previously thought (closer to 1000 years). The expected annual number of people exposed to tornadoes has also increased for every state. The evaluation of tornado occurrences is improved using a statistical model that produces a smoothed regional-scale climatology. The model is applied to data aggregated at the county level, including annual population, annual tornado counts, and an index of terrain roughness.
The model has a term to capture the smoothed frequency relative to the state average and is used to examine additional hypotheses concerning relationships of tornado activity with terrain roughness and County Warning Area. Tornado reports are found to increase by 13% for a twofold increase in population across Kansas after accounting for improvements in rating procedures. The pattern of spatially correlated errors also shows Kansas tornado activity to be consistent with the dryline climatology. The model is significantly improved by adding terrain roughness, which has a negative relationship with tornado activity, and its flexibility is demonstrated by fitting it to data from Illinois, Mississippi, South Dakota, and Ohio. Advancements in technology have improved the collection of tornado damage survey data, which can be used to enhance vulnerability assessments. The National Weather Service (NWS) Damage Assessment Toolkit (DAT) contains the most extensive GIS-based damage survey data available to the public, which provides more precise damage path areas. These data are used with socioeconomic data in two statistical models. The models are developed to determine which factors are significant predictors of the incidence and magnitude of casualties while accounting for maximum EF Scale rating, total path area, and population density at the storm level. Percent unemployment is a significant predictor and produces the best model for the incidence of at least one tornado casualty. Although percent elderly generates the best model for predicting the magnitude of casualties, it is only marginally significant and its relationship is negative. The Southeast has the highest averages of the sensitivity factors considering all of the tornado events. These results highlight the need for heightened tornado awareness and preparedness as our exposure to these events increases due to our population continuing to expand.
As demonstrated in this work, these methods can be used to enhance regional/local tornado forecasts, insurance risk estimates, public policy, urban planning, and emergency management and mitigation with the detection of spatiotemporal patterns in tornado activity (due to variations in climate) and vulnerability (due to changes in population demographics and urban sprawl). They can be employed to examine other geographic locations on multiple scales. They can also be adapted to study the patterns and relationships of other spatial and temporal phenomena.
 2016
 FSU_2016SP_Widen_fsu_0071E_13208
 Thesis
 The Use of a Meta-Analysis Technique in Equating and Its Comparison with Several Small Sample Equating Methods.
Caglak, Serdar, Paek, Insu, Patrangenaru, Victor, Almond, Russell G., Roehrig, Alysia D., Florida State University, College of Education, Department of Educational Psychology and Learning Systems
The main objective of this study was to investigate the improvement of the accuracy of small sample equating, which typically occurs in teacher certification/licensure examinations due to a low volume of test takers per test administration, under the Non-Equivalent Groups with Anchor Test (NEAT) design by combining previous and current equating outcomes using a meta-analysis technique. The proposed meta-analytic score transformation procedure was called "meta-equating" throughout this study. To conduct meta-equating, the previous and current equating outcomes obtained from the chosen equating methods (Identity Equating (ID), Circle-Arc (CA), and Nominal Weights Mean (NW)) and synthetic functions (SFs) of these methods (CAS and NWS) were used, and then empirical Bayesian (EB) and meta-equating (META) procedures were implemented to estimate the equating relationship between test forms at the population level. The SFs were created by giving equal weight to each of the chosen equating methods and the identity (ID) equating. Finally, the chosen equating methods, the SFs of each method (e.g., CAS, NWS, etc.), and also the META and EB versions (e.g., NWEB, CAMETA, NWSMETA, etc.) were investigated and compared under varying testing conditions. These steps involved manipulating some of the factors that influence the accuracy of test score equating. In particular, the effects of test form difficulty levels, the group-mean ability differences, the number of previous equatings, and the sample size on the accuracy of the equating outcomes were investigated. The Chained Equipercentile (CE) equating with 6-univariate and 2-bivariate moments log-linear presmoothing was used as the criterion equating function to establish the equating relationship between the new form and the base (reference) form with 50,000 examinees per test form.
To compare the performance of the equating methods, small samples of examinees were randomly drawn from examinee populations with different ability levels in each simulation replication. Each pair of new and base test forms was randomly and independently selected from all available condition-specific test form pairs. Those test forms were then used to obtain previous equating outcomes. However, purposeful selections of the examinee ability and test form difficulty distributions were made to obtain the current equating outcomes in each simulation replication. The previous equating outcomes were later used for the implementation of both the META and EB score transformation procedures. The effects of study factors and their possible interactions on each of the accuracy measures were investigated along the entire score range and the cut (reduced) score range using a series of mixed-factorial ANOVA (MFA) procedures. The performances of the equating methods were also compared based on post-hoc tests. Results show that the behaviors of the equating methods vary based on each level of the group ability difference, test form difficulty difference, and new group examinee sample size. Also, the use of both META and EB procedures improved the accuracy of equating results on average. The META and EB versions of the chosen equating methods therefore might be a solution for equating test forms that are similar in their psychometric characteristics and taken by new-form examinee samples smaller than 50. However, since there are many factors affecting the equating results in reality, one should always expect that equating methods and score transformation procedures, or in more general terms, estimation procedures, may function differently, to some degree, depending on the conditions in which they are implemented.
Therefore, one should consider the recommendations for the use of the proposed equating methods in this study as a piece of information, not an absolute guideline, and as a rule of thumb for practicing small sample test equating in teacher certification/licensure examinations.
 2015
 FSU_2015fall_Caglak_fsu_0071E_12863
 Thesis
 Four Methods for Combining Dependent Effects from Studies Reporting Regression Analysis.
Gunter, Tracey Danielle, Becker, Betsy Jane, Huffer, Fred W. (Fred William), Almond, Russell G., Paek, Insu, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
Over the years, a variety of indices have been proposed to summarize regression analyses. Unfortunately, the proposed indices are only appropriate when meta-analysts want to understand the role of a single predictor variable in predicting the outcome variable. However, sometimes meta-analysts want to understand the effect of a set of variables on an outcome variable. In this paper, four methods are presented for obtaining a composite effect for two focal predictor variables from a single regression model. The indices are the average of the standardized regression coefficients (ASC), the average of the standardized regression coefficients using Hedges and Olkin's (1985) approach (AHO), the sheaf coefficient (SC), and the squared multiple semipartial correlation coefficient (MSP). A simulation study was conducted to examine the behavior of the indices and their variances when the number of predictor variables in the model, the sample size, the correlations between the focal predictor variables in the model, and the correlations between the focal and nonfocal predictor variables in the model were manipulated. The results of the study show that the average bias values of the ASC and AHO estimates are small even when the sample size is small. Furthermore, the ASC and AHO estimates and their estimated variances are more precise than the other indices under all conditions examined. Therefore, when meta-analysts are interested in estimating the effect of a set of predictor variables on an outcome variable from a single regression model, the ASC or AHO procedures are preferred.
 2015
 FSU_2015fall_Gunter_fsu_0071E_12829
 Thesis
 Multistage Process Monitoring Using Group Exponential Weighted Moving Average Control Chart.
Symum, Hasan, Vanli, Omer Arda, Awoniyi, Samuel A. (Samuel Ayodele), Wang, Hui, Florida State University, FAMU-FSU College of Engineering, Department of Industrial and Manufacturing Engineering
This thesis proposes a new variation propagation model and group EWMA control chart method for quality improvement in multistage processes, which aims to detect and isolate the largest variation propagation and the faulty stages in a multistage process. Since it is computationally difficult to estimate the variation propagation, this model can provide important estimates for quantifying the variation in each stage. It is also crucial to develop a control chart that can detect a faulty product quickly and isolate the faulty stage effectively in a sequential process. A test statistic for the control chart is proposed. Two different case studies are used to illustrate the proposed approach. Errors are estimated using linear and logistic regression equations, and EWMA parameters are calculated from an in-control dataset. The results of the case studies from automotive hood assembly and healthcare performance monitoring show that the group EWMA chart can achieve quicker detection than the traditional EWMA chart in later stages of a multistage process.
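As background for the comparison above, the following is a minimal sketch of the traditional single-stream EWMA control chart, not the group EWMA statistic proposed in the thesis, which the abstract does not specify. The function name `ewma_chart`, the smoothing weight `lam = 0.2`, and the limit width `L = 3.0` are illustrative assumptions; the in-control mean `mu0` and standard deviation `sigma` are assumed known.

```python
import math


def ewma_chart(xs, mu0, sigma, lam=0.2, L=3.0):
    """Standard EWMA chart with time-varying control limits.

    Returns a list of (z, lcl, ucl, signal) tuples, one per observation.
    Hypothetical helper for illustration only.
    """
    z = mu0  # the EWMA statistic starts at the in-control mean
    points = []
    for t, x in enumerate(xs, start=1):
        # Exponentially weighted update: recent observations weigh more.
        z = lam * x + (1.0 - lam) * z
        # Exact variance factor of z at time t for independent observations.
        var_factor = lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        half_width = L * sigma * math.sqrt(var_factor)
        signal = abs(z - mu0) > half_width
        points.append((z, mu0 - half_width, mu0 + half_width, signal))
    return points


if __name__ == "__main__":
    # Made-up residuals with a shift in the last three points.
    for row in ewma_chart([0.1, -0.2, 0.0, 1.5, 1.8, 2.1], mu0=0.0, sigma=1.0):
        print(row)
```

In a regression-based setting like the one described, `xs` would typically be the residuals of a stage's quality measure after adjusting for upstream stages, though that choice is an assumption here.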
 2015
 FSU_migr_etd9467
 Thesis
 Meta-Analysis of Factor Analyses: Comparison of Univariate and Multivariate Approaches Using Correlation Matrices and Factor Loadings.
Cho, Kyunghwa, Becker, Betsy Jane, Huffer, Fred W. (Fred William), Paek, Insu, Yang, Yanyun, Florida State University, College of Education, Department of Educational Psychology and Learning Systems
Currently, more sophisticated techniques such as factor analyses are frequently applied in primary research and thus may need to be meta-analyzed. This topic has been given little attention in the past due to its complexity. Because factor analysis is becoming more popular in research in many areas including education, social work, social science, and so on, the study of methods for the meta-analysis of factor analyses is also becoming more important. The first main purpose of this dissertation is to compare the results of seven different approaches to doing meta-analysis of confirmatory factor analyses. Specifically, five approaches are based on univariate meta-analysis methods. The other two approaches use multivariate meta-analysis to obtain the results of factor loadings and the standard errors of factor loadings. The results from each approach are compared. Given the fact that factor analyses are commonly used in many areas, the second purpose of this dissertation is to explore the appropriate approach or approaches to use for the meta-analysis of factor analyses, especially Confirmatory Factor Analysis (CFA). When the average sample size was small, the IRD, WMC, WMFL, and GLSMFL approaches showed better performance than the UMC, MFL, and GLSMC approaches in estimating parameters. With large average sample sizes (larger than 150), the performance in estimating the parameters seemed to be similar across all seven approaches in this dissertation. Based on my simulation results, researchers who want to conduct meta-analytic confirmatory factor analysis can apply any of these approaches to synthesize the results from primary studies if their studies have n > 150.
 2015
 FSU_migr_etd9570
 Thesis
 A Class of Semiparametric Volatility Models with Applications to Financial Time Series.
Chung, Steve S., Niu, XuFeng, Gallivan, Kyle, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
The autoregressive conditional heteroskedasticity (ARCH) and generalized autoregressive conditional heteroskedasticity (GARCH) models capture the dependence of the conditional second moments. The idea behind the ARCH/GARCH model is quite intuitive. For ARCH models, past squared innovations describe the present squared volatility. For GARCH models, both past squared innovations and past squared volatilities define the present volatility. Since their introduction, they have been extensively studied and well documented in the financial and econometric literature, and many variants of ARCH/GARCH models have been proposed. To list a few, these include exponential GARCH (EGARCH), GJR-GARCH (or threshold GARCH), integrated GARCH (IGARCH), quadratic GARCH (QGARCH), and fractionally integrated GARCH (FIGARCH). The ARCH/GARCH models and their variants have gained a lot of attention and are still a popular choice for modeling volatility. Despite their popularity, they suffer from limited model flexibility. Volatility is a latent variable, and hence imposing a specific model structure is at odds with this latency assumption. Recently, several attempts have been made to ease the strict structural assumptions on volatility. Both nonparametric and semiparametric volatility models have been proposed in the literature. We review and discuss these modeling techniques in detail. In this dissertation, we propose a class of semiparametric multiplicative volatility models. We define the volatility as a product of parametric and nonparametric parts. Due to the positivity restriction, we take the log and square transformations on the volatility. We assume that the parametric part is GARCH(1,1), and it serves as an initial guess for the volatility. We estimate the GARCH(1,1) parameters using the conditional likelihood method. The nonparametric part assumes an additive structure. There may be some loss of interpretability from assuming an additive structure, but we gain flexibility.
Each additive part is constructed from a sieve of Bernstein basis polynomials. The nonparametric component acts as an improvement for the parametric component. The model is estimated with an iterative algorithm based on boosting. We modified the boosting algorithm (the one given in Friedman 2001) so that it uses a penalized least squares method. We tried three different penalty functions: LASSO, ridge, and elastic net. We found that, in our simulations and application, the ridge penalty worked best. Our semiparametric multiplicative volatility model is evaluated using simulations and applied to six major exchange rates and the S&P 500 index. The results show that the proposed model outperforms the existing volatility models in both in-sample estimation and out-of-sample prediction.
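To make the GARCH(1,1) description concrete, the following is a minimal sketch of the standard conditional variance recursion, sigma2[t] = omega + alpha * eps[t-1]^2 + beta * sigma2[t-1]. It illustrates only the parametric building block, not the semiparametric model or the boosting procedure of the dissertation. The function name `garch11_variance` and the choice to start the recursion at the unconditional variance omega / (1 - alpha - beta) are assumptions made for illustration.

```python
def garch11_variance(eps, omega, alpha, beta):
    """Conditional variances of a GARCH(1,1) model for innovations eps.

    Hypothetical helper: sigma2[0] is set to the unconditional variance,
    and sigma2[t] depends on eps[t-1] and sigma2[t-1] for t >= 1.
    """
    if omega <= 0 or alpha < 0 or beta < 0:
        raise ValueError("need omega > 0 and alpha, beta >= 0")
    if alpha + beta >= 1:
        raise ValueError("need alpha + beta < 1 for a finite variance")
    if not eps:
        return []

    sigma2 = [omega / (1.0 - alpha - beta)]
    for e in eps[:-1]:
        # Past squared innovation and past variance define the next variance.
        sigma2.append(omega + alpha * e * e + beta * sigma2[-1])
    return sigma2


if __name__ == "__main__":
    # Made-up innovations and parameters purely to show the call.
    print(garch11_variance([1.0, 2.0, 0.0], omega=0.1, alpha=0.1, beta=0.8))
```

Setting `beta = 0` reduces the recursion to ARCH(1), in which only the past squared innovation drives the present variance.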
 2014
 FSU_migr_etd8756
 Thesis
 The Risk of Lipids on Coronary Heart Disease: Prognostic Models and Meta-Analysis.
Almansour, Aseel, McGee, Daniel, Flynn, Heather, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
Prognostic models are widely used in medicine to estimate particular patients' risk of developing disease. Numerous prognostic models have been developed for predicting cardiovascular disease, including those by Wilson et al. using the Framingham Study[17], by Assmann et al. using the Procam study[22], and by Conroy et al.[33] using a pool of European cohorts. The prognostic models developed by these researchers differed in their approach to estimating risk, but all included one or more of the lipid determinations: total cholesterol (TC), low density lipoproteins (LDL), high density lipoproteins (HDL), or the ratios TC/HDL and LDL/HDL. None of these researchers included both LDL and TC in the same model due to the high correlation between these measurements. In this thesis we examine some questions about the inclusion of lipid determinations in prognostic models: Can the effect of LDL and TC on the risk of dying from CHD be differentiated? If one measure is demonstrably stronger than the other, then a single model using that variable would be considered advantageous. Is it possible to derive a single measure from TC and LDL that is a stronger predictor than either measure? If so, then a new summarization of the lipid measurements should be used in prognostic modeling. Does the addition of HDL to a prognostic model improve the predictive accuracy of the model? If it does, then this determination, which is almost universally measured, should be used when developing prognostic models. We use data from nine independent studies to examine these issues. The studies were chosen because they include longitudinal follow-up of participants and included lipid determinations in the baseline examination of participants. There are many methodologies available for developing prognostic models, including logistic regression and the proportional hazards model.
We used the proportional hazards model since we have follow-up times and times to death from CHD on all of the participants in the included studies. We summarized our results using a meta-analytic approach. Using the meta-analytic approach, we addressed the additional question of whether the results vary significantly among the different studies and also whether adding additional characteristics to the prognostic models changes the estimated effect of the lipid determinations. All of our results are presented stratified by gender and, when appropriate, by race. Finally, because our studies were not selected randomly, we also examined whether there is evidence of bias in our meta-analyses. For this examination we used funnel plots with related methodology for testing whether there is evidence of bias in the results.
 2014
 FSU_migr_etd8724
 Thesis
 Nonlinear Multivariate Tests for High-Dimensional Data Using Wavelets with Applications in Genomics and Engineering.
Girimurugan, Senthil Balaji, Chicken, Eric, Zhang, Jinfeng, Ahlquist, Jon, Tao, Minjing, Department of Statistics, Florida State University
Gaussian processes are not uncommon in various fields of science such as engineering, genomics, quantitative finance and astronomy, to name a few. In fact, such processes are special cases in a broader class of data known as functional data. When the underlying mean response of a process is a function, the resulting data from these processes are functional responses and specialized statistical tools are required in their analysis. The methodology discussed in this work offers nonparametric tests that can detect differences in such data with greater power and good control of Type I error over existing methods. The incorporation of Wavelet Transforms makes the test an efficient approach due to its decorrelation properties. These tests are designed primarily to handle functional responses from multiple treatments simultaneously and generally are extensible to high dimensional data. The sparseness introduced by Wavelet Transforms is another advantage of this test when compared to traditional tests. In addition to offering a theoretical framework, several applications of such tests in the fields of engineering, genomics and quantitative finance are also discussed.
 2014
 FSU_migr_etd8789
 Thesis
 A Comparison of Three Approaches to Confidence Interval Estimation for Coefficient Omega.
Xu, Jie, Yang, Yanyun, Becker, Betsy Jane, Almond, Russell G., Florida State University, College of Education, Department of Educational Psychology and Learning Systems
Coefficient Omega was introduced by McDonald (1978) as a reliability coefficient of composite scores for the congeneric model. Interval estimation (Neyman, 1937) on coefficient Omega provides a range of plausible values which is likely to capture the population reliability of composite scores. The Wald method, likelihood method, and bias-corrected and accelerated (BCa) bootstrap method are three methods to construct confidence intervals for coefficient Omega (e.g., Cheung, 2009b; Kelley & Cheng, 2012; Raykov, 2002, 2004, 2009; Raykov & Marcoulides, 2004; Padilla & Divers, 2013). A very limited number of studies on the evaluation of these three methods can be found in the literature (e.g., Cheung, 2007, 2009a, 2009b; Kelley & Cheng, 2012; Padilla & Divers, 2013). No simulation study has been conducted to evaluate the performance of these three methods for interval construction on coefficient Omega. In the current simulation study, I assessed these three methods by comparing their empirical performance on interval estimation for coefficient Omega. Four factors were included in the simulation design: sample size, number of items, factor loading, and degree of nonnormality. Two thousand datasets were generated in R 2.15.0 (R Core Team, 2012) for each condition. For each generated dataset, three approaches (i.e., the Wald method, likelihood method, and bias-corrected and accelerated bootstrap method) were used to construct 95% confidence intervals for coefficient Omega in R 2.15.0. The results showed that when the data were multivariate normally distributed, the three methods performed equally well and coverage probabilities were very close to the prespecified .95 confidence level. When the data were multivariate nonnormally distributed, coverage probabilities decreased and interval widths became wider for all three methods as the degree of nonnormality increased. 
In general, when the data departed from multivariate normality, the BCa bootstrap method performed better than the other two methods, with relatively higher coverage probabilities, while the Wald and likelihood methods were comparable and yielded narrower interval widths than the BCa bootstrap method.
 2014
 FSU_migr_etd9269
 Thesis
 Sparse Factor AutoRegression for Forecasting Macroeconomic Time Series with Very Many Predictors.
Galvis, Oliver Kurt, She, Yiyuan, Okten, Giray, Beaumont, Paul, Huffer, Fred, Tao, Minjing, Department of Statistics, Florida State University
Forecasting a univariate target time series in high dimensions with very many predictors poses challenges in statistical learning and modeling. First, many nuisance time series exist and need to be removed. Second, from economic theories, a macroeconomic target series is typically driven by few latent factors constructed from some macroeconomic indices. Consequently, a high dimensional problem arises where simultaneously deleting junk time series and constructing predictive factors is meaningful and advantageous for the accuracy of the forecasting task. In macroeconomics, multiple categories are available with the target series belonging to one of them. With all series available we advocate constructing category level factors to enhance the performance of the forecasting task. We introduce a novel methodology, the Sparse Factor AutoRegression (SFAR) methodology, to construct predictive factors from a reduced set of relevant time series. SFAR attains dimension reduction via joint variable selection and rank reduction in high dimensional time series data. A multivariate setting is used to achieve simultaneous low rank and cardinality control on the matrix of coefficients, where the $\ell_{0}$ constraint regulates the number of useful series and the rank constraint sets the upper bound on the number of constructed factors. The doubly constrained matrix estimation is a nonconvex mathematical problem optimized via an efficient iterative algorithm with a theoretical guarantee of convergence. SFAR fits factors using a sparse low rank matrix in response to a target category series. Forecasting is then performed using lagged observations and shrinkage methods. We generate finite-sample data to verify our theoretical findings via a comparative study of the SFAR. We also analyze real-world macroeconomic time series data to demonstrate the usage of the SFAR in practice.
 2014
 FSU_migr_etd8990
 Thesis
 Functional Component Analysis and Regression Using Elastic Methods.
Tucker, J. Derek, Srivastava, Anuj, Wu, Wei, Klassen, Eric, Huffer, Fred, Department of Statistics, Florida State University
Constructing generative models for functional observations is an important task in statistical function analysis. In general, functional data contains both phase (or x or horizontal) and amplitude (or y or vertical) variability. Traditional methods often ignore the phase variability and focus solely on the amplitude variation, using cross-sectional techniques such as functional principal component analysis for dimensional reduction and regression for data modeling. Ignoring phase variability leads to a loss of structure in the data, and inefficiency in data models. Moreover, most methods use a "pre-processing" alignment step to remove the phase variability, without considering a more natural joint solution. This dissertation presents three approaches to this problem. The first relies on separating the phase (x-axis) and amplitude (y-axis), then modeling these components using joint distributions. This separation, in turn, is performed using a technique called elastic alignment of functions that involves a new mathematical representation of functional data. Then, using individual principal components, one for each of the phase and amplitude components, it imposes joint probability models on principal coefficients of these components while respecting the nonlinear geometry of the phase representation space. The second combines the phase variability into the objective function for two component analysis methods, functional principal component analysis and functional principal least squares. This creates a more complete solution, as the phase variability is removed while simultaneously extracting the components. The third approach combines the phase variability into the functional linear regression model and then extends the model to logistic and multinomial logistic regression. Through incorporating the phase variability, a more parsimonious regression model is obtained and, therefore, more accurate prediction of observations is achieved. 
These models are then easily extended from functional data to curves (which are essentially functions in $\mathbb{R}^2$) to perform regression with curves as predictors. These ideas are demonstrated using random sampling for models estimated from simulated and real datasets, and the models are shown to be superior to models that ignore phase-amplitude separation. Furthermore, the models are applied to classification of functional data and achieve high performance in applications involving SONAR signals of underwater objects, handwritten signatures, periodic body movements recorded by smart phones, and physiological data.
 2014
 FSU_migr_etd9106
 Thesis
 Parametric and Nonparametric Spherical Regression with Diffeomorphisms.
Rosenthal, Michael, Srivastava, Anuj, Wu, Wei, Klassen, Eric, Pati, Debdeep, Department of Statistics, Florida State University
Spherical regression explores relationships between pairs of variables on spherical domains. Spherical data has become more prevalent in biological, gaming, geographical, and meteorological investigations, creating a need for tools that analyze such data. Previous works on spherical regression have focused on rigid parametric models or nonparametric kernel smoothing methods. This leaves a huge gap in the available tools with no intermediate options currently available. This work will develop two such intermediate models, one parametric using projective linear transformation and one nonparametric model using diffeomorphic maps from a sphere to itself. The models are estimated in a maximum-likelihood framework using gradient-based optimizations. For the parametric model, an efficient Newton-Raphson algorithm is derived and asymptotic analysis is developed. A first-order roughness penalty is specified for the nonparametric model using the Jacobian of diffeomorphisms. The prediction performance of the proposed models is compared with state-of-the-art methods using simulated and real data involving plate tectonics, cloud deformations, wind, accelerometer, bird migration, and vectorcardiogram data.
 2014
 FSU_migr_etd9082
 Thesis
 The Relationship of Diabetes to Coronary Heart Disease Mortality: A Meta-Analysis Based on Person-Level Data.
Williams, Felicia Gray, McGee, Daniel, Hurt, Myra, Pati, Debdeep, Sinha, Debajyoti, Department of Statistics, Florida State University
Studies have suggested that diabetes is a stronger risk factor for coronary heart disease (CHD) in women than in men. We present a meta-analysis of person-level data from 42 cohort studies in which diabetes, CHD mortality and potential confounders were available and a minimum of 75 CHD deaths occurred. These studies followed up 77,863 men and 84,671 women aged 42 to 73 years on average from the US, Denmark, Iceland, Norway and the UK. Individual study prevalence rates of self-reported diabetes mellitus at baseline ranged between less than 1% in the youngest cohort and 15.7% (males) and 11.1% (females) in the NHLBI CHS study of the elderly. CHD death rates varied between 2% and 20%. A meta-analysis was performed in order to calculate overall hazard ratios (HR) of CHD mortality among diabetics compared to nondiabetics using Cox Proportional Hazard models. The random-effects HR associated with baseline diabetes and adjusted for age was significantly higher for females, 2.65 (95% CI: 2.34, 2.96), than for males, 2.33 (95% CI: 2.07, 2.58) (p=0.004). These estimates were similar to the random-effects estimates adjusted additionally for serum cholesterol, systolic blood pressure, and current smoking status: females 2.69 (95% CI: 2.35, 3.03) and males 2.32 (95% CI: 2.05, 2.59). They also agree closely with estimates (odds ratios of 2.9 for females and 2.3 for males) obtained in a recent meta-analysis of 50 studies of both fatal and nonfatal CHD but not based on person-level data. This evidence suggests that diabetes diminishes the female advantage. An additional analysis was performed on race. Only 14 cohorts were analyzed in the meta-analysis. This analysis showed no significant difference between the black and white cohorts before (p=0.68) or after adjustment for the major CHD risk factors (p=0.88). The limited number of studies used may lack the power to detect any differences.
 2013
 FSU_migr_etd7662
 Thesis
 Nonparametric Wavelet Thresholding and Profile Monitoring for Non-Gaussian Errors.
McGinnity, Kelly, Chicken, Eric, Hoeflich, Peter, Niu, Xufeng, Zhang, Jinfeng, Department of Statistics, Florida State University
Recent advancements in data collection allow scientists and researchers to obtain massive amounts of information in short periods of time. Often this data is functional and quite complex. Wavelet transforms are popular, particularly in the engineering and manufacturing fields, for handling these types of complicated signals. A common application of wavelets is in statistical process control (SPC), in which one tries to determine as quickly as possible if and when a sequence of profiles has gone out of control. However, few wavelet methods have been proposed that don't rely in some capacity on the assumption that the observational errors are normally distributed. This dissertation aims to fill this void by proposing a simple, nonparametric, distribution-free method of monitoring profiles and estimating change points. Using only the magnitudes and location maps of thresholded wavelet coefficients, our method uses the spatial adaptivity property of wavelets to accurately detect profile changes when the signal is obscured with a variety of non-Gaussian errors. Wavelets are also widely used for the purpose of dimension reduction. Applying a thresholding rule to a set of wavelet coefficients results in a "denoised" version of the original function. Once again, existing thresholding procedures generally assume independent, identically distributed normal errors. Thus, the second main focus of this dissertation is a nonparametric method of thresholding that does not assume Gaussian errors, or even that the form of the error distribution is known. We improve upon an existing even-odd cross-validation method by employing block thresholding and level dependence, and show that the proposed method works well on both skewed and heavy-tailed distributions. Such thresholding techniques are essential to the SPC procedure developed above.
 2013
 FSU_migr_etd7502
 Thesis
 Nonparametric Nonstationary Density Estimation Including Upper Control Limit Methods for Detecting Change Points.
Becvarik, Rachel A., Chicken, Eric, Liu, Guosheng, Sinha, Debajyoti, Wu, Wei, Department of Statistics, Florida State University
Nonstationary nonparametric densities occur naturally in applications such as monitoring the amount of toxins in the air and monitoring internet streaming data. Progress has been made in estimating these densities, but there is little current work on monitoring them for changes. A new statistic is proposed which effectively monitors these nonstationary nonparametric densities through the use of transformed wavelet coefficients of the quantiles. This method is completely nonparametric and relies on no particular distributional assumptions, which makes it effective in a variety of conditions. Existing methods for monitoring sequential data typically focus on using a single-value upper control limit (UCL) based on a specified in-control average run length (ARL) to detect changes in these nonstationary statistics. However, such a UCL is not designed to take into consideration the false alarm rate, the power associated with the test or the underlying distribution of the ARL. Additionally, if the monitoring statistic is known to be monotonic over time (which is typical in methods using maxima in their statistics, for example), the flat UCL does not adjust to this property. We propose several methods for creating UCLs that provide improved power and simultaneously adjust the false alarm rate to user-specified values. Our methods are constructive in nature, making no use of assumed distribution properties of the underlying monitoring statistic. We evaluate the different proposed UCLs through simulations to illustrate the improvements over current UCLs. The proposed method is evaluated with respect to profile monitoring scenarios and the proposed density statistic. The method is applicable for monitoring any monotonically nondecreasing nonstationary statistics.
 2013
 FSU_migr_etd7292
 Thesis
 Monte Carlo Likelihood Estimation for Conditional Autoregressive Models with Application to Sparse Spatiotemporal Data.
Bain, Rommel, Huffer, Fred, Becker, Betsy, Niu, Xufeng, Srivastava, Anuj, Department of Statistics, Florida State University
Spatiotemporal modeling is increasingly used in a diverse array of fields, such as ecology, epidemiology, health care research, transportation, economics, and other areas where data arise from a spatiotemporal process. Spatiotemporal models describe the relationship between observations collected from different spatiotemporal sites. The modeling of spatiotemporal interactions arising from spatiotemporal data is done by incorporating the space-time dependence into the covariance structure. A main goal of spatiotemporal modeling is the estimation and prediction of the underlying process that generates the observations under study and the parameters that govern the process. Furthermore, analysis of the spatiotemporal correlation of variables can be used for estimating values at sites where no measurements exist. In this work, we develop a framework for estimating quantities that are functions of complete spatiotemporal data when the spatiotemporal data is incomplete. We present two classes of conditional autoregressive (CAR) models (the homogeneous CAR (HCAR) model and the weighted CAR (WCAR) model) for the analysis of sparse spatiotemporal data (the log of monthly mean zooplankton biomass) collected on a spatiotemporal lattice by the California Cooperative Oceanic Fisheries Investigations (CalCOFI). These models allow for spatiotemporal dependencies between nearest neighbor sites on the spatiotemporal lattice. Typically, CAR model likelihood inference is quite complicated because of the intractability of the CAR model's normalizing constant. Sparse spatiotemporal data further complicates likelihood inference. We implement Monte Carlo likelihood (MCL) estimation methods for parameter estimation of our HCAR and WCAR models. Monte Carlo likelihood estimation provides an approximation for intractable likelihood functions. We demonstrate our framework by giving estimates for several different quantities that are functions of the complete CalCOFI time series data.
 2013
 FSU_migr_etd7283
 Thesis
 Theories on Group Variable Selection in Multivariate Regression Models.
Ha, SeungYeon, She, Yiyuan, Okten, Giray, Huffer, Fred, Sinha, Debajyoti, Department of Statistics, Florida State University
We study group variable selection in the multivariate regression model. Group variable selection is equivalent to selecting the nonzero rows of the coefficient matrix, since there are multiple response variables and thus if one predictor is irrelevant to estimation then the corresponding row must be zero. In a high dimensional setup, shrinkage estimation methods are applicable and guarantee smaller MSE than OLS according to the James-Stein phenomenon (1961). As one of the shrinkage methods, we study penalized least squares estimation for group variable selection. Among them, we study L0 regularization and L0 + L2 regularization with the purpose of obtaining accurate prediction and consistent feature selection, and use the corresponding computational procedures Hard TISP and Hard-Ridge TISP (She, 2009) to solve the numerical difficulties. These regularization methods show better performance both on prediction and selection than Lasso (L1 regularization), which is one of the popular penalized least squares methods. L0 achieves the same optimal rate of prediction loss and estimation loss as Lasso, but it requires no restriction on the design matrix or sparsity for controlling the prediction error and a more relaxed condition than Lasso for controlling the estimation error. Also, for selection consistency, it requires a much more relaxed incoherence condition, which concerns the correlation between the relevant subset and irrelevant subset of predictors. Therefore L0 can work better than Lasso both on prediction and sparsity recovery, in practical cases where correlation is high or sparsity is not low. We study another method, L0 + L2 regularization, which uses the combined penalty of L0 and L2. For the corresponding procedure Hard-Ridge TISP, two parameters work independently for selection and shrinkage (to enhance prediction) respectively, and therefore it gives better performance in some cases (such as low signal strength) than L0 regularization. 
For L0 regularization, λ works for selection but it is tuned in terms of prediction accuracy. L0 + L2 regularization gives the optimal rate of prediction and estimation errors without any restriction, when the coefficient of the L2 penalty is appropriately assigned. Furthermore, it can achieve a better rate of estimation error with an ideal choice of blockwise weight on the L2 penalty.
 2013
 FSU_migr_etd7404
 Thesis
 2D Affine and Projective Shape Analysis, and Bayesian Elastic Active Contours.
Bryner, Darshan W., Srivastava, Anuj, Klassen, Eric, Gallivan, Kyle, Huffer, Fred, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
An object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but oftentimes in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian framework for shape analysis where metrics and related quantities are invariant to larger groups, the affine and projective groups, that approximate the transformations that arise from perspective skews. Highlighting two possibilities for representing object boundaries, ordered points (or landmarks) and parametrized curves, we study different combinations of these representations (points and curves) and transformations (affine and projective). Specifically, we provide solutions to three out of four situations and develop algorithms for computing geodesics and intrinsic sample statistics, leading up to Gaussian-type statistical models, and classifying test shapes using such models learned from training data. In the case of parametrized curves, an added issue is to obtain invariance to the reparameterization group. The geodesics are constructed by particularizing the path-straightening algorithm to geometries of current manifolds and are used, in turn, to compute shape statistics and Gaussian-type shape models. We demonstrate these ideas using a number of examples from shape and activity recognition. After developing such Gaussian-type shape models, we present a variational framework for naturally incorporating these shape models as prior knowledge in guidance of active contours for boundary extraction in images. 
This so-called Bayesian active contour framework is especially suitable for images where boundary estimation is difficult due to low contrast, low resolution, and presence of noise and clutter. In traditional active contour models, curves are driven towards the minimum of an energy composed of image and smoothing terms. We introduce an additional shape term based on shape models of prior known relevant shape classes. The minimization of this total energy, using iterated gradient-based updates of curves, leads to an improved segmentation of object boundaries. We demonstrate this Bayesian approach to segmentation using a number of shape classes in many imaging scenarios including the synthetic imaging modalities of SAS (synthetic aperture sonar) and SAR (synthetic aperture radar), for which accurate boundary extraction is notoriously difficult. In practice, the training shapes used for prior shape models may be collected from viewing angles different from those for the test images and thus may exhibit a shape variability brought about by perspective effects. Therefore, by allowing for a prior shape model to be invariant to, say, affine transformations of curves, we propose an active contour algorithm where the resulting segmentation is robust to perspective skews.
 2013
 FSU_migr_etd8534
 Thesis
 Elastic Shape Analysis of RNAs and Proteins.
Laborde, Jose M., Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric, McGee, Daniel, Department of Statistics, Florida State University
Proteins and RNAs are molecular machines performing biological functions in the cells of all organisms. Automatic comparison and classification of these biomolecules are fundamental yet open problems in the field of Structural Bioinformatics. An outstanding unsolved issue is the definition and efficient computation of a formal distance between any two biomolecules. Current methods use alignment scores, which are not proper distances, to derive statistical tests for comparison and classification. This work applies Elastic Shape Analysis (ESA), a method recently developed in computer vision, to construct rigorous mathematical and statistical frameworks for the comparison, clustering and classification of proteins and RNAs. ESA treats biomolecular structures as 3D parameterized curves, which are represented with a special map called the square root velocity function (SRVF). In the resulting shape space of elastic curves, one can perform statistical analysis of curves as if they were random variables. One can compare, match and deform one curve into another, compute averages and covariances of curve populations, and perform hypothesis testing and classification of curves according to their shapes. We have successfully applied ESA to the comparison and classification of protein and RNA structures. We further extend the ESA framework to incorporate additional non-geometric information that tags the shape of the molecules (namely, the sequence of nucleotide/amino-acid letters for RNAs/proteins and, in the latter case, also the labels for the so-called secondary structure). The biological representation is chosen such that the ESA framework continues to be mathematically formal. We have achieved superior classification of RNA functions compared to state-of-the-art methods on benchmark RNA datasets, which has led to the publication of this work in the journal Nucleic Acids Research (NAR).
Based on the ESA distances, we have also developed a fast method to classify protein domains by using a representative set of protein structures generated by a clustering-based technique we call Multiple Centroid Class Partitioning (MCCP). Comparison with other standard approaches showed that MCCP significantly improves the accuracy while keeping the representative set smaller than the other methods. The current schemes for the classification and organization of proteins (such as SCOP and CATH) assume a discrete space of their structures, where a protein is classified into one and only one class in a hierarchical tree structure. Our recent study, and studies by other researchers, showed that the protein structure space is more continuous than discrete. To capture the complex but quantifiable continuous nature of protein structures, we propose to organize these molecules using a network model, where individual proteins are mapped to possibly multiple nodes of classes, each associated with a probability. Structural classes will then be connected to form a network based on overlaps of corresponding probability distributions in the structural space.
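As a rough illustration of the square root velocity function mentioned in the abstract above, the following sketch computes q(t) = f'(t) / sqrt(|f'(t)|) for a discretely sampled curve and an L2 distance between two such representations. The function names `srvf` and `l2_distance` are hypothetical, and the sketch assumes the curves are already sampled on a common time grid. It deliberately omits the optimization over rotations and reparameterizations that a full elastic distance would require, so it is not the thesis's method.

```python
import numpy as np

def srvf(curve, t):
    """Square root velocity function of a sampled curve.

    curve: (n, d) array of points; t: (n,) array of parameter values.
    """
    velocity = np.gradient(curve, t, axis=0)      # finite-difference derivative
    speed = np.linalg.norm(velocity, axis=1)
    safe_speed = np.where(speed > 1e-12, speed, 1.0)
    q = velocity / np.sqrt(safe_speed)[:, None]
    q[speed <= 1e-12] = 0.0                       # define q = 0 where the curve is stationary
    return q

def l2_distance(q1, q2, t):
    """L2 distance between two SRVFs on the same grid (trapezoid rule)."""
    sq = np.sum((q1 - q2) ** 2, axis=1)
    integral = np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(t))
    return float(np.sqrt(integral))

# Usage: a straight segment and a translated copy have identical SRVFs,
# because the SRVF depends only on the derivative of the curve.
t = np.linspace(0.0, 1.0, 101)
line = np.outer(t, [3.0, 0.0, 0.0])
print(l2_distance(srvf(line, t), srvf(line + 5.0, t), t))
```

Working with derivatives is what makes the representation translation invariant; the remaining invariances have to be handled by optimization, which this sketch does not attempt.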
 2013
 FSU_migr_etd8586
 Thesis
 Failure Time Regression Models for Thinned Point Processes.
Holden, Robert T., Huffer, Fred G., Nichols, Warren, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
In survival analysis, data on the time until a specific criterion event (or "endpoint") occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the predictors on that distribution. Suppose that the criterion event is not a terminal event that can only occur once, but is a repeatable event. The sequence of events forms a stochastic point process. Further suppose that only some of the events are detected (observed); the detected events form a thinned point process. Any failure time model based on the data will be based not on the time until the first occurrence, but on the time until the first detected occurrence of the event. The implications of estimating survival regression models from such incomplete data will be analyzed. It will be shown that the effect of thinning on regression parameters depends on the combination of the type of regression model, the type of point process that generates the events, and the thinning mechanism. For some combinations, the effect of a predictor will be the same for time to the first event and the time to the first detected event. For other combinations, the regression effect will be changed as a result of the incomplete detection.
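One simple combination can be simulated to make the idea concrete. The sketch below assumes a homogeneous Poisson process with a constant rate and independent thinning with a constant detection probability; these are illustrative assumptions, not the specific models studied in the thesis. Under them, the time to the first detected event is exponential with rate equal to the detection probability times the event rate, so thinning rescales the baseline without changing a log-linear predictor effect on the rate. The function name `first_detected_times` is hypothetical.

```python
import numpy as np

def first_detected_times(rate, detect_prob, n_subjects, seed=0):
    """Simulate the time to the first detected event for each subject."""
    rng = np.random.default_rng(seed)
    # Index (1, 2, ...) of the first event that happens to be detected.
    first_idx = rng.geometric(detect_prob, size=n_subjects)
    # The k-th event time of a rate-`rate` Poisson process is Gamma(k, 1/rate).
    return rng.gamma(shape=first_idx, scale=1.0 / rate)

times = first_detected_times(rate=2.0, detect_prob=0.5, n_subjects=20000)
# The sample mean should be close to 1 / (0.5 * 2.0) = 1.0.
print(times.mean())
```

Other combinations of point process and thinning mechanism need not preserve predictor effects in this way, which is the case the abstract goes on to analyze.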
 2013
 FSU_migr_etd8568
 Thesis
 Meta Analysis and Meta Regression of a Measure of Discrimination Used in Prognostic Modeling.
Rivera, Gretchen L., McGee, Daniel, Hurt, Myra, Niu, Xufeng, Sinha, Debajyoti, Department of Statistics, Florida State University
In this paper we are interested in predicting death with the underlying cause of coronary heart disease (CHD). There are two prognostic modeling methods used to predict CHD: the logistic model and the proportional hazard model. For this paper we consider the logistic model. The dataset used is the Diverse Populations Collaboration (DPC) dataset, which includes 28 studies. The DPC dataset has epidemiological results from investigations conducted in different populations around the world. For our analysis we include those individuals who are 17 years old or older. The predictors are: age, diabetes, total serum cholesterol (mg/dl), high density lipoprotein (mg/dl), systolic blood pressure (mmHg) and whether the participant is a current cigarette smoker. There is a natural grouping within the studies, such as gender, rural or urban area and race. Based on these strata we have 84 cohort groups. Our main interest is to evaluate how well the prognostic model discriminates. For this, we used the area under the Receiver Operating Characteristic (ROC) curve. The main idea of the ROC curve is that each subject in a set is known to belong to one of two classes (signal or noise group). An assignment procedure then assigns each subject to a class on the basis of the information observed. The assignment procedure is not perfect: sometimes a subject is misclassified. To evaluate the quality of performance of this procedure, we used the area under the ROC curve (AUROC). The AUROC varies from 0.5 (no apparent accuracy) to 1.0 (perfect accuracy). For each logistic model we found the AUROC and its standard error (SE). We used meta-analysis to summarize the estimated AUROCs and to evaluate whether there is heterogeneity in our estimates. To evaluate the existence of significant heterogeneity we used the Q statistic. Since heterogeneity was found in our study, we compare seven different methods for estimating $\tau^2$ (the between-study variance).
We conclude by examining whether differences in study characteristics explained the heterogeneity in the values of the AUROC.
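The heterogeneity step can be sketched as follows. The code computes Cochran's Q from per-cohort estimates and standard errors, then one common moment-based estimate of the between-study variance (DerSimonian-Laird). The abstract does not name its seven estimators, so using DerSimonian-Laird here is an assumption made only for illustration, and the function name and the input values in the usage line are hypothetical.

```python
import numpy as np

def cochran_q_and_dl_tau2(estimates, std_errors):
    """Return (Q, tau^2) for a set of study estimates and their standard errors."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2
    w = 1.0 / v                                   # inverse-variance weights
    pooled = np.sum(w * y) / np.sum(w)            # fixed-effect pooled estimate
    q = float(np.sum(w * (y - pooled) ** 2))      # Cochran's Q
    k = len(y)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)            # truncated at zero
    return q, float(tau2)

# Usage with two made-up AUROC estimates, each with standard error 0.1.
q, tau2 = cochran_q_and_dl_tau2([0.6, 0.8], [0.1, 0.1])
print(q, tau2)
```

Under homogeneity, Q is referred to a chi-squared distribution with k - 1 degrees of freedom, which is why k - 1 is subtracted before scaling.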
 2013
 FSU_migr_etd7580
 Thesis
 The Frequentist Performance of Some Bayesian Confidence Intervals for the Survival Function.
Tao, Yingfeng, Huffer, Fred, Okten, Giray, Sinha, Debajyoti, Niu, Xufeng, Department of Statistics, Florida State University
Estimation of a survival function is a very important topic in survival analysis with contributions from many authors. This dissertation considers estimation of confidence intervals for the survival function based on right-censored or interval-censored survival data. Most of the methods for estimating pointwise confidence intervals and simultaneous confidence bands of the survival function are reviewed in this dissertation. In the right-censored case, almost all confidence intervals are based in some way on the Kaplan-Meier estimator, first proposed by Kaplan and Meier (1958) and widely used as the nonparametric estimator in the presence of right-censored data. For interval-censored data, the Turnbull estimator (Turnbull (1974)) plays a similar role. For a class of Bayesian models involving Dirichlet priors, Doss and Huffer (2003) suggested several simulation techniques to approximate the posterior distribution of the survival function by using Markov chain Monte Carlo or sequential importance sampling. These techniques lead to probability intervals for the survival function (at arbitrary time points) and its quantiles for both the right-censored and interval-censored cases. This dissertation will examine the frequentist properties and general performance of these probability intervals when the prior is noninformative. Simulation studies will be used to compare these probability intervals with other published approaches. Extensions of the Doss-Huffer approach are given for constructing simultaneous confidence bands for the survival function and for computing approximate confidence intervals for the survival function based on Edgeworth expansions using posterior moments. The performance of these extensions is studied by simulation.
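Since most of the right-censored intervals discussed above build on the Kaplan-Meier estimator, a minimal sketch of the product-limit computation may help. It covers only the point estimate, not any of the confidence interval or Bayesian methods in the dissertation. The function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: observed times; events: 1 if the event occurred, 0 if right-censored.
    Returns a list of (time, survival) pairs at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed      # events and censored subjects both leave the risk set
    return curve

# Usage: three subjects, the second one censored at time 2.
print(kaplan_meier([1, 2, 3], [1, 0, 1]))
```

Censored subjects contribute to the risk set up to their censoring time but never trigger a drop in the curve, which is the feature that distinguishes this estimate from the empirical survival function.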
 2013
 FSU_migr_etd7624
 Thesis
 Statistical Analysis of Trajectories on Riemannian Manifolds.
Su, Jingyong, Srivastava, Anuj, Klassen, Erik, Huffer, Fred, Zhang, Jinfeng, Department of Statistics, Florida State University
This thesis consists of two distinct topics. First, we present a framework for estimation and analysis of trajectories on Riemannian manifolds. Second, we propose a framework for detecting, classifying, and estimating shapes in point cloud data. This thesis mainly focuses on statistical analysis of trajectories that take values on nonlinear manifolds. There are many difficulties when analyzing temporal trajectories on nonlinear manifolds. First, the observed data are always noisy and discrete at unsynchronized times. Second, trajectories are observed under arbitrary temporal evolutions. In this work, we first address the problem of estimating full smooth trajectories on nonlinear manifolds using only a set of time-indexed points, for use in interpolation, smoothing, and prediction of dynamic systems. Furthermore, we study statistical analysis of trajectories that take values on nonlinear Riemannian manifolds and are observed under arbitrary temporal evolutions. The problem of analyzing such temporal trajectories, including registration, comparison, modeling and evaluation, arises in many applications. We introduce a quantity that provides both a cost function for temporal registration and a proper distance for comparison of trajectories. This distance, in turn, is used to define statistical summaries, such as the sample means and covariances, of given trajectories and Gaussian-type models to capture their variability. Both theoretical proofs and experimental results are provided to validate our work. The problems of detecting, classifying, and estimating shapes in point cloud data are important due to their general applicability in image analysis, computer vision, and graphics. They are challenging because the data is typically noisy, cluttered, and unordered.
We study these problems using a fully statistical model where the data is modeled using a Poisson process on the object's boundary (curves or surfaces), corrupted by additive noise and a clutter process. Using likelihood functions dictated by the model, we develop a generalized likelihood ratio test for detecting a shape in a point cloud. Additionally, we develop a procedure for estimating most likely shapes in observed point clouds under given shape hypotheses. We demonstrate this framework using examples of 2D and 3D shape detection and estimation in both real and simulated data, and a usage of this framework in shape retrieval from a 3D shape database.
 2013
 FSU_migr_etd7619
 Thesis
 Bayesian Methods for Skewed Response Including Longitudinal and Heteroscedastic Data.
Tang, Yuanyuan, Sinha, Debajyoti, Pati, Debdeep, Flynn, Heather, She, Yiyuan, Lipsitz, Stuart, Zhang, Jinfeng, Department of Statistics, Florida State University
Skewed response data are very common in practice, especially in the biomedical area. We begin our work with the skewed longitudinal response without heteroscedasticity. We extend the skewed error density to the multivariate response. Then we study heteroscedasticity. We extend the transform-both-sides model to the Bayesian variable selection area to handle the univariate skewed response, where the variance of the response is a function of the median. Finally, we propose a novel model to handle the skewed univariate response with a flexible heteroscedasticity. For longitudinal studies with heavily skewed continuous response, statistical models and methods focusing on the mean response are not appropriate. In this paper, we present a partial linear model of the median regression function of a skewed longitudinal response. We develop a semiparametric Bayesian estimation procedure using an appropriate Dirichlet process mixture prior for the skewed error distribution. We provide justifications for using our methods, including theoretical investigation of the support of the prior, asymptotic properties of the posterior and also simulation studies of finite sample properties. Ease of implementation and advantages of our model and method compared to existing methods are illustrated via analysis of a cardiotoxicity study of children of HIV-infected mothers. Our second aim is to develop a Bayesian simultaneous variable selection and estimation of median regression for a skewed response variable. Our hierarchical Bayesian model can incorporate advantages of the $l_0$ penalty for skewed and heteroscedastic error. Some preliminary simulation studies have been conducted to compare the performance of the proposed model and the existing frequentist median lasso regression model. Considering the estimation bias and total squared error, our proposed model performs as well as, or better than, competing frequentist estimators.
In biomedical studies, the covariates often affect the location, scale as well as the shape of the skewed response distribution. Existing biostatistical literature mainly focuses on the mean regression with a symmetric error distribution. While such modeling assumptions and methods are often deemed restrictive and inappropriate for skewed response, the completely nonparametric methods may lack a physical interpretation of the covariate effects. Existing nonparametric methods also lack any easily implementable computational tool. For a skewed response, we develop a novel model accommodating a nonparametric error density that depends on the covariates. The advantages of our semiparametric associated Bayes method include the ease of prior elicitation/determination, an easily implementable posterior computation, theoretically sound properties of the selection of priors and accommodation of possible outliers. The practical advantages of the method are illustrated via a simulation study and an analysis of a real-life epidemiological study on the serum response to DDT exposure during the gestation period.
 2013
 FSU_migr_etd7622
 Thesis
 An Ensemble Approach to Predicting Health Outcomes.
Nilles, Ester Kim, McGee, Dan, Zhang, Jinfeng, Eberstein, Isaac, Sinha, Debajyoti, Department of Statistics, Florida State University
Heart disease and premature birth continue to be the leading causes of mortality and neonatal mortality in large parts of the world. They are also estimated to have the highest medical expenditures in the United States. Early detection of heart disease incidence plays a critical role in preserving heart health, and identifying pregnancies at high risk of premature birth is highly valuable information for early interventions. Over the past few decades, identification of patients at high health risk has been based on logistic regression or Cox proportional hazards models. In more recent years, machine learning models have grown in popularity within the medical field for their superior predictive and classification performances over the classical statistical models. However, their performances in heart disease and premature birth predictions have been comparable and inconclusive, leaving the question of which model most accurately reflects the data difficult to resolve. Our aim is to incorporate information learned by different models into one final model that will generate superior predictive performances. We first compare the widely used machine learning models (the multilayer perceptron network, k-nearest neighbor and support vector machine) to the statistical models logistic regression and Cox proportional hazards. Then the individual models are combined into one in an ensemble approach, also referred to as ensemble modeling. The proposed approaches include SSE-weighted, AUC-weighted, logistic and flexible naive Bayes. The individual models are unique and capture different aspects of the data, but as expected, no individual one outperforms any other. The ensemble approach is an easily computed method that eliminates the need to select one model, integrates the strengths of different models, and generates optimal performances.
Particularly in cases where the risk factors associated with an outcome are elusive, such as in premature birth, the ensemble models significantly improve prediction.
 2013
 FSU_migr_etd7530
 Thesis
 AP Student Visual Preferences for Problem Solving.
Swoyer, Liesl, Department of Statistics
The purpose of this study is to explore the mathematical preference of high school AP Calculus students by examining their tendencies for using differing methods of thought. A student's preferred mode of thinking was measured on a scale ranging from a preference for analytical thought to a preference for visual thought as they completed derivative and antiderivative tasks presented both algebraically and graphically. This relates to previous studies by continuing to analyze the factors that have been found to mediate the students' performance and preference with regard to a variety of calculus tasks. Data was collected by Dr. Erhan Haciomeroglu at the University of Central Florida. Students' preferences were not affected by gender. Students were found to approach graphical and algebraic tasks similarly, without any significant change with regard to the derivative or antiderivative nature of the tasks. Highly analytic and highly visual students revealed the same proportion of change in visuality as harmonic students when more difficult calculus tasks were encountered. Thus, a strong preference for visual thinking when completing algebraic tasks was not the determining factor of their preferred method of thinking when approaching graphical tasks.
 2012
 FSU_migr_uhm0052
 Thesis
 The Relationship Between Body Mass and Blood Pressure in Diverse Populations.
Abayomi, Emilola J., McGee, Daniel, Lackland, Daniel, Hurt, Myra, Chicken, Eric, Niu, Xufeng, Department of Statistics, Florida State University
High blood pressure is a major determinant of risk for Coronary Heart Disease (CHD) and stroke, leading causes of death in the industrialized world. A myriad of pharmacological treatments for elevated blood pressure, defined as a blood pressure greater than 140/90 mmHg, are available and have at least partially resulted in large reductions in the incidence of CHD and stroke in the U.S. over the last 50 years. The factors that may increase blood pressure levels are not well understood, but body mass is thought to be a major determinant of blood pressure level. Obesity is measured through various methods (skinfolds, waist-to-hip ratio, bioelectrical impedance analysis (BIA), etc.), but the most commonly used measure is body mass index, $\mathrm{BMI} = \text{weight (kg)} / \text{height (m)}^2$.
 2012
 FSU_migr_etd5308
 Thesis
 Nonparametric Data Analysis on Manifolds with Applications in Medical Imaging.
Osborne, Daniel Eugene, Patrangenaru, Victor, Liu, Xiuwen, Barbu, Adrian, Chicken, Eric, Department of Statistics, Florida State University
Over the past twenty years, there has been a rapid development in Nonparametric Statistical Analysis on Manifolds applied to Medical Imaging problems. In this body of work, we focus on two different medical imaging problems. The first problem corresponds to analyzing CT scan data. In this context, we perform nonparametric analysis on the 3D data retrieved from CT scans of healthy young adults, on the Size-and-Reflection Shape Space of k-ads in general position in 3D. This work is part of a larger project on planning reconstructive surgery in severe skull injuries, which includes preprocessing and postprocessing steps of CT images. The next problem corresponds to analyzing MR diffusion tensor imaging data. Here, we develop a two-sample procedure for testing the equality of the generalized Frobenius means of two independent populations on the space of symmetric positive matrices. These new methods naturally lead to an analysis based on Cholesky decompositions of covariance matrices, which helps to decrease computational time and does not increase dimensionality. The resulting nonparametric matrix-valued statistics are used for testing if there is a difference on average between corresponding signals in Diffusion Tensor Images (DTI) in young children with dyslexia when compared to their clinically normal peers. The results presented here correspond to data that was previously analyzed in the literature using parametric methods, which also showed a significant difference.
 2012
 FSU_migr_etd5085
 Thesis
 Mixed-Effects Models for Count Data with Applications to Educational Research.
Shin, Jihyung, Niu, Xufeng, Hu, Shouping, Al Otaiba, Stephanie Dent, McGee, Daniel, Wu, Wei, Department of Statistics, Florida State University
This research is motivated by an analysis of reading research data. We are interested in modeling the test outcome of the ability to fluently recode letters into sounds for kindergarten children aged between 5 and 7. The data showed excessive zero scores (more than 30% of children) on the test. In this dissertation, we carefully examine the models dealing with excessive zeros, which are based on the mixture of distributions, a distribution with zeros and a standard probability distribution with nonnegative values. In such cases, a log-normal variable or a Poisson random variable is often observed with probability from semicontinuous data or count data. The previously proposed models, mixed-effects and mixed-distribution models (MEMD) by Tooze et al. (2002) for semicontinuous data and zero-inflated Poisson (ZIP) regression models by Lambert (1992) for count data, are reviewed. We apply zero-inflated Poisson models to repeated measures of zero-inflated data by introducing a pair of possibly correlated random effects to the zero-inflated Poisson model to accommodate within-subject correlation and between-subject heterogeneity. The model describes the effect of predictor variables on the probability of nonzero responses (occurrence) and the mean of nonzero responses (intensity) separately. The likelihood function, approximated by adaptive Gaussian quadrature, is maximized using dual quasi-Newton optimization. The maximum likelihood estimates are obtained through a standard statistical software package. A simulation study is conducted using different model parameters, numbers of subjects, and numbers of measurements per subject, and the results are presented. The dissertation ends with the application of the model to reading research data and future research. We examine the number of correct letter sounds counted for children, collected over the 2008-2009 academic year.
We find that age, gender and socioeconomic status are significantly related to the letter sound fluency of children in both parts of the model. The model provides a better explanation of the data structure and easier interpretation of parameter values, as they are the same as in standard logistic models and Poisson regression models. The model can be extended to accommodate serial correlation, which can be observed in longitudinal data. Also, one may consider a multilevel zero-inflated Poisson model. Although the multilevel model was proposed previously, parameter estimation by penalized quasi-likelihood methods is questionable, and further examination is needed.
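For readers unfamiliar with the zero-inflated Poisson distribution that this model builds on, the sketch below evaluates its probability mass function and a log-likelihood for independent counts. It is the plain ZIP distribution with a fixed zero-inflation probability and Poisson mean; the random effects and covariates described in the abstract are not included. The names `zip_pmf` and `zip_loglik` and the sample counts are hypothetical.

```python
import math

def zip_pmf(y, zero_prob, mean):
    """P(Y = y) for a zero-inflated Poisson with extra-zero probability zero_prob."""
    poisson = math.exp(-mean) * mean ** y / math.factorial(y)
    if y == 0:
        # A zero arises either from the point mass or from the Poisson part.
        return zero_prob + (1.0 - zero_prob) * poisson
    return (1.0 - zero_prob) * poisson

def zip_loglik(counts, zero_prob, mean):
    """Log-likelihood of independent counts under the plain ZIP distribution."""
    return sum(math.log(zip_pmf(y, zero_prob, mean)) for y in counts)

# Usage with made-up counts containing many zeros.
counts = [0, 0, 0, 2, 1, 0, 4, 3, 0, 2]
print(zip_loglik(counts, zero_prob=0.3, mean=2.0))
```

In a regression setting, the zero-inflation probability would typically be linked to covariates through a logit link and the Poisson mean through a log link, which is why the abstract compares parameter interpretation to standard logistic and Poisson regression.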
 2012
 FSU_migr_etd5181
 Thesis
 Estimation and Sequential Monitoring of Nonlinear Functional Responses Using Wavelet Shrinkage.
Cuevas, Jordan, Chicken, Eric, Sobanjo, John, Niu, Xufeng, Wu, Wei, Department of Statistics, Florida State University
Statistical process control (SPC) is widely used in industrial settings to monitor processes for shifts in their distributions. SPC is generally thought of in two distinct phases: Phase I, in which historical data is analyzed in order to establish an in-control process, and Phase II, in which new data is monitored for deviations from the in-control form. Traditionally, SPC has been used to monitor univariate (multivariate) processes for changes in a particular parameter (parameter vector). Recently, however, technological advances have resulted in processes in which each observation is actually an n-dimensional functional response (referred to as a profile), where n can be quite large. Additionally, these profiles often cannot be adequately represented parametrically, making traditional SPC techniques inapplicable. This dissertation starts out by addressing the problem of nonparametric function estimation, which would be used to analyze process data in a Phase I setting. The translation invariant wavelet estimator (TI) is often used to estimate irregular functions, despite the drawback that it tends to oversmooth jumps. A trimmed translation invariant estimator (TTI) is proposed, of which the TI estimator is a special case. By reducing the point-by-point variability of the TI estimator, TTI is shown to retain the desirable qualities of TI while improving reconstructions of functions with jumps. Attention is then turned to the Phase II problem of monitoring sequences of profiles for deviations from in-control. Two profile monitoring schemes are proposed; the first monitors for changes in the noise variance using a likelihood ratio test based on the highest detail level of wavelet coefficients of the observed profile. The second offers a semiparametric test to monitor for changes in both the functional form and noise variance. Both methods make use of wavelet shrinkage in order to distinguish relevant functional information from noise contamination.
Different forms of each of these test statistics are proposed and results are compared via Monte Carlo simulation.
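To give a sense of what wavelet shrinkage does, the sketch below applies soft thresholding to Haar wavelet coefficients, with the noise level estimated from the finest detail coefficients and a universal threshold. This is a basic, non-translation-invariant scheme chosen for brevity; it is not the TI or TTI estimator proposed in the dissertation. All function names are hypothetical, and the input length is assumed to be a power of two.

```python
import numpy as np

def haar_forward(x):
    """Full Haar decomposition; returns (approximation, details), finest level first."""
    approx = np.asarray(x, dtype=float)
    if len(approx) == 0 or len(approx) & (len(approx) - 1):
        raise ValueError("length must be a power of two")
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward."""
    for detail in reversed(details):
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + detail) / np.sqrt(2.0)
        out[1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx

def soft_threshold(coeffs, threshold):
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - threshold, 0.0)

def shrink(profile):
    """Denoise a profile by soft-thresholding its Haar detail coefficients."""
    approx, details = haar_forward(profile)
    # Robust noise estimate from the finest detail level (median absolute deviation).
    sigma = np.median(np.abs(details[0])) / 0.6745
    threshold = sigma * np.sqrt(2.0 * np.log(len(profile)))
    return haar_inverse(approx, [soft_threshold(d, threshold) for d in details])

# Usage: a noisy step profile of length 64.
rng = np.random.default_rng(1)
profile = np.repeat([0.0, 1.0], 32) + 0.1 * rng.standard_normal(64)
print(shrink(profile)[:4])
```

The finest detail level is used for the noise estimate because, for a reasonably smooth signal, those coefficients are dominated by noise; the first monitoring scheme in the abstract relies on the same level for its variance test.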
 2012
 FSU_migr_etd4788
 Thesis
 Riemannian Shape Analysis of Curves and Surfaces.
Kurtek, Sebastian, Srivastava, Anuj, Klassen, Eric, Wu, Wei, Huffer, Fred, Dryden, Ian, Department of Statistics, Florida State University
Shape analysis of curves and surfaces is a very important tool in many applications ranging from computer vision to bioinformatics and medical imaging. There are many difficulties when analyzing shapes of parameterized curves and surfaces. Firstly, it is important to develop representations and metrics such that the analysis is invariant to parameterization in addition to the standard transformations (rigid motion and scaling). Furthermore, under the chosen representations and metrics, the analysis must be performed on infinite-dimensional and sometimes nonlinear spaces, which poses an additional difficulty. In this work, we develop and apply methods which address these issues. We begin by defining a framework for shape analysis of parameterized open curves and extend these ideas to shape analysis of surfaces. We utilize the presented frameworks in various classification experiments spanning multiple application areas. In the case of curves, we consider the problem of clustering DT-MRI brain fibers, classification of protein backbones, modeling and segmentation of signatures and statistical analysis of biosignals. In the case of surfaces, we perform disease classification using 3D anatomical structures in the brain, classification of handwritten digits by viewing images as quadrilateral surfaces, and finally classification of cropped facial surfaces. We provide two additional extensions of the general shape analysis frameworks that are the focus of this dissertation. The first one considers shape analysis of marked spherical surfaces where, in addition to the surface information, we are given a set of manually or automatically generated landmarks. This requires additional constraints on the definition of the reparameterization group and is applicable in many domains, especially medical imaging and graphics. Second, we consider reflection symmetry analysis of planar closed curves and spherical surfaces.
Here, we also provide an example of disease detection based on brain asymmetry measures. We close with a brief summary and a discussion of open problems, which we plan on exploring in the future.
 2012
 FSU_migr_etd4963
 Thesis
 Semiparametric Survival Analysis Using Models with Log-Linear Median.
Lin, Jianchang, Sinha, Debajyoti, Zhou, Yi, Lipsitz, Stuart, McGee, Dan, Niu, XuFeng, She, Yiyuan, Department of Statistics, Florida State University
First, we present two novel semiparametric survival models with log-linear median regression functions for right-censored survival data. These models are useful alternatives to the popular Cox (1972) model and linear transformation models (Cheng et al., 1995). Compared to existing semiparametric models, our models have many important practical advantages, including interpretation of the regression parameters via the median and the ability to address heteroscedasticity. We demonstrate that our modeling techniques ease prior elicitation and computation for both parametric and semiparametric Bayesian analysis of survival data. We illustrate the advantages of our modeling, as well as model diagnostics, via reanalysis of a small-cell lung cancer study. Results of our simulation study provide further guidance regarding appropriate modelling in practice. Our second goal is to develop the methods of analysis and associated theoretical properties for interval-censored and current status survival data. These new regression models use a log-linear regression function for the median. We present frequentist and Bayesian procedures for estimation of the regression parameters. Our model is a useful and practical alternative to the popular semiparametric models which focus on modeling the hazard function. We illustrate the advantages and properties of our proposed methods via reanalysis of a breast cancer study. Our other aim is to develop a model which is able to account for the heteroscedasticity of the response, together with robust parameter estimation and outlier detection using sparsity penalization. Some preliminary simulation studies have been conducted to compare the performance of the proposed model and the existing median lasso regression model. Considering the estimation bias, mean squared error, and other identification benchmark measures, our proposed model performs better than the competing frequentist estimator.
 2012
 FSU_migr_etd4992
 Thesis
 Prediction and Testing for Non-Parametric Random Function Signals in a Complex System.
Hill, Paul C., Chicken, Eric, Klassen, Eric, Niu, Xufeng, Barbu, Adrian, Department of Statistics, Florida State University
Methods employed in the construction of prediction bands for continuous curves require a different approach to those used for a data point. In many cases, the underlying function is unknown and thus a distribution-free approach which preserves sufficient coverage for the entire signal is necessary in the signal analysis. This paper discusses three methods for the formation of (1 - α)100% bootstrap prediction bands, and their performances are compared through the coverage probabilities obtained for each technique. Bootstrap samples are first obtained for the signal and then three different criteria are provided for the removal of 100α% of the curves, resulting in the (1 - α)100% prediction band. The first method uses the L1 distance between the upper and lower curves as a gauge to extract the widest bands in the dataset of signals. Also investigated are extractions using the Hausdorff distance between the bounds as well as an adaptation of the bootstrap intervals discussed in Lenhoff et al. (1999). The bootstrap prediction bands each have good coverage probabilities for the continuous signals in the dataset. For a 95% prediction band, the coverages obtained were 90.59%, 93.72% and 95% for the L1 distance, Hausdorff distance and adjusted bootstrap methods, respectively. The methods discussed in this paper have been applied successfully to constructing prediction bands for spring discharge, giving good coverage in each case. Spring discharge measured over time can be considered as a continuous signal, and the ability to predict the future signals of spring discharge is useful for monitoring flow and other issues related to the spring. While in some cases rainfall has been fitted with the gamma distribution, the discharge of the spring, represented as continuous curves, is better approached without assuming any specific distribution. The bootstrap aspect occurs not in sampling the output discharge curves but rather in simulating the input recharge that enters the spring. Bootstrapping the rainfall as described in this paper allows for adequately creating new samples over different periods of time as well as specific rain events such as hurricanes or drought. The bootstrap prediction methods put forth in this paper provide an approach that supplies adequate coverage for prediction bands for signals represented as continuous curves. The pathway outlined by the flow of the discharge through the springshed is described as a tree. A nonparametric pairwise test, motivated by the idea of K-means clustering, is proposed to decide whether two trees are equal in terms of their discharges. A large-sample approximation is devised for this lower-tail significance test, and test statistics for different numbers of input signals are compared to a generated table of critical values.
 2012
 FSU_migr_etd4910
 Thesis
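The curve-removal idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's procedure: it assumes whole curves are resampled with replacement (the thesis bootstraps the rainfall input instead), and it ranks curves by their L1 distance from the pointwise median curve, which is a simplification of the criterion described. The function names `l1_distance`, `prediction_band` and `resample_curves` are hypothetical.

```python
import random
import statistics


def l1_distance(curve_a, curve_b):
    """Sum of absolute pointwise differences between two equally sampled curves."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b))


def prediction_band(curves, alpha=0.05):
    """Drop the alpha fraction of curves farthest (in L1) from the pointwise
    median curve, then return the pointwise envelope of the remaining curves.
    Assumption: this ranking rule is illustrative, not the thesis's criterion."""
    n_points = len(curves[0])
    center = [statistics.median(c[t] for c in curves) for t in range(n_points)]
    ranked = sorted(curves, key=lambda c: l1_distance(c, center))
    n_keep = max(1, round((1 - alpha) * len(curves)))
    kept = ranked[:n_keep]
    lower = [min(c[t] for c in kept) for t in range(n_points)]
    upper = [max(c[t] for c in kept) for t in range(n_points)]
    return lower, upper


def resample_curves(curves, n_boot, seed=0):
    """Bootstrap whole curves with replacement (an assumed resampling scheme)."""
    rng = random.Random(seed)
    return [rng.choice(curves) for _ in range(n_boot)]


if __name__ == "__main__":
    # Synthetic signals: a linear trend plus Gaussian noise, for illustration only.
    rng = random.Random(1)
    observed = [[0.1 * t + rng.gauss(0, 1) for t in range(20)] for _ in range(50)]
    boot = resample_curves(observed, n_boot=500)
    lower, upper = prediction_band(boot, alpha=0.05)
    print(lower[0], upper[0])
```

Swapping `l1_distance` for another curve distance (for example a Hausdorff-type distance) changes only the ranking step, which is roughly how the alternative criteria mentioned in the abstract would slot in.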
 Weighted Adaptive Methods for Multivariate Response Models with an HIV/Neurocognitive Application.
Geis, Jennifer Ann, She, Yiyuan, Meyer-Baese, Anke, Barbu, Adrian, Bunea, Florentina, Niu, Xufeng, Department of Statistics, Florida State University
Multivariate response models are increasingly used in almost all fields, together with inferential methods such as Canonical Correlation Analysis (CCA). This requires the estimation of the number of uncorrelated canonical relationships between the two sets, or, equivalently, determining the rank of the coefficient estimator in the multivariate response model. One way to do this is by the Rank Selection Criterion (RSC) of Bunea et al., under the assumption that the error matrix has independent, constant-variance entries. While this assumption is necessary to show their strong theoretical results, some flexibility is required in practical application. That is, such an assumption cannot always be safely made. What is developed here is the theory that parallels Bunea et al.'s work with the addition of a "decorrelator" weight matrix. One choice for the weight matrix is the residual covariance, but this introduces many issues in practice. A computationally more convenient weight matrix is the sample response covariance. When such a weight matrix is chosen, CCA is directly accessible by this weighted version of RSC, giving rise to an Adaptive CCA (ACCA) with principal proofs for the large-sample setting. However, particular considerations are required for the high-dimensional setting, where similar theory does not hold. What is offered instead are extensive empirical simulations that reveal that using the sample response covariance still provides good rank recovery and estimation of the coefficient matrix, and hence also provides good estimation of the number of canonical relationships and variates. It is argued precisely why other versions of the residual covariance, including a regularized version, are poor choices in the high-dimensional setting. Another approach to avoid these issues is to employ some type of variable selection methodology before applying ACCA. Indeed, any group selection method may be applied prior to ACCA, as variable selection in the multivariate response model is the same as group selection in the univariate response model and thus completely eliminates these high-dimensional concerns. To offer a practical application of these ideas, ACCA is applied to a "large sample" neurocognitive dataset. Then, a high-dimensional dataset is generated to which Group LASSO is applied first, before ACCA. This provides a unique perspective into the relationships between cognitive deficiencies in HIV-positive patients and the extensive, available neuroimaging measures.
 2012
 FSU_migr_etd4861
 Thesis
 Statistical Shape Analysis on Manifolds with Applications to Planar Contours and Structural Proteomics.
Ellingson, Leif A., Patrangenaru, Vic, Mio, Washington, Zhang, Jinfeng, Niu, Xufeng, Department of Statistics, Florida State University
The technological advances in recent years have produced a wealth of intricate digital imaging data that is analyzed effectively using the principles of shape analysis. Such data often lies on either high-dimensional or infinite-dimensional manifolds. With computing power also now strong enough to handle this data, it is necessary to develop theoretically sound methodology to perform the analysis in a computationally efficient manner. In this dissertation, we propose approaches of doing so for planar contours and the three-dimensional atomic structures of protein binding sites. First, we adapt Kendall's definition of direct similarity shapes of finite planar configurations to shapes of planar contours under certain regularity conditions and utilize Ziezold's nonparametric view of Fréchet mean shapes. The space of direct similarity shapes of regular planar contours is embedded in a space of Hilbert-Schmidt operators in order to obtain the Veronese-Whitney extrinsic mean shape. For computations, it is necessary to use discrete approximations of both the contours and the embedding. For cases when landmarks are not provided, we propose an automated, randomized landmark selection procedure that is useful for contour matching within a population and is consistent with the underlying asymptotic theory. For inference on the extrinsic mean direct similarity shape, we consider a one-sample neighborhood hypothesis test and the use of nonparametric bootstrap to approximate confidence regions. Bandulasiri et al. (2008) suggested using extrinsic reflection size-and-shape analysis to study the relationship between the structure and function of protein binding sites. In order to obtain meaningful results for this approach, it is necessary to identify the atoms common to a group of binding sites with similar functions and obtain proper correspondences for these atoms. We explore this problem in depth and propose an algorithm for simultaneously finding the common atoms and their respective correspondences based upon the Iterative Closest Point algorithm. For a benchmark data set, our classification results compare favorably with those of leading established methods. Finally, we discuss current directions in the field of statistics on manifolds, including a computational comparison of intrinsic and extrinsic analysis for various applications and a brief introduction of sample spaces with manifold stratification.
 2011
 FSU_migr_etd0053
 Thesis
 Individual Patient-Level Data Meta-Analysis: A Comparison of Methods for the Diverse Populations Collaboration Data Set.
Dutton, Matthew Thomas, McGee, Daniel, Becker, Betsy, Niu, Xufeng, Zhang, Jinfeng, Department of Statistics, Florida State University
DerSimonian and Laird define meta-analysis as "the statistical analysis of a collection of analytic results for the purpose of integrating their findings." One alternative to classical meta-analytic approaches is known as Individual Patient-Level Data, or IPD, meta-analysis. Rather than depending on summary statistics calculated for individual studies, IPD meta-analysis analyzes the complete data from all included studies. Two potential approaches to incorporating IPD data into the meta-analytic framework are investigated. A two-stage analysis is first conducted, in which individual models are fit for each study and summarized using classical meta-analysis procedures. Secondly, a one-stage approach that models all of the data in a single analysis and summarizes the information across studies is investigated. Data from the Diverse Populations Collaboration (DPC) data set are used to investigate the differences between these two methods in a specific example. The bootstrap procedure is used to determine whether the two methods produce statistically different results in the DPC example. Finally, a simulation study is conducted to investigate the accuracy of each method in given scenarios.
 2011
 FSU_migr_etd0620
 Thesis
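The two-stage approach described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis carried out in the thesis: the per-study "model" is just a sample mean with its sampling variance, and the second stage is a fixed-effect inverse-variance pooling. The function names `stage_one` and `stage_two` and the example numbers are hypothetical.

```python
import statistics


def stage_one(study_values):
    """Stage 1: summarize one study by an effect estimate and its variance.
    Assumption: the effect is the sample mean; a real analysis would fit a
    regression or survival model per study instead."""
    n = len(study_values)
    estimate = statistics.mean(study_values)
    variance = statistics.variance(study_values) / n
    return estimate, variance


def stage_two(summaries):
    """Stage 2: fixed-effect inverse-variance pooling of the study summaries."""
    weights = [1.0 / variance for _, variance in summaries]
    pooled = sum(w * est for w, (est, _) in zip(weights, summaries)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance


if __name__ == "__main__":
    # Made-up patient-level outcomes for two studies, for illustration only.
    studies = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
    summaries = [stage_one(values) for values in studies]
    print(stage_two(summaries))
```

A one-stage analysis would instead fit a single model to the stacked patient-level data with a study-level term, so the contrast between the two approaches lies in where the between-study summarization happens.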
 A Class of Mixed-Distribution Models with Applications in Financial Data Analysis.
Tang, Anqi, Niu, Xufeng, Cheng, Yingmei, Wu, Wei, Huffer, Fred, Department of Statistics, Florida State University
Statisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zero-inflated longitudinal data, where the response variable has a large portion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a two-part mixed distribution model for zero-inflated longitudinal data. The first part of the model is a logistic regression model for the probability of a nonzero response; the other part is a linear model for the mean response given that the outcomes are not zeros. Random effects with AR(1) covariance structure are introduced into both parts of the model to allow for serial correlation and subject-specific effects. Estimating the two-part model is challenging because of the high-dimensional integration necessary to obtain the maximum likelihood estimates. We propose a Monte Carlo EM (MCEM) algorithm for computing the maximum likelihood estimates of the parameters. Through a simulation study, we demonstrate the good performance of the MCEM method in parameter and standard error estimation. To illustrate, we apply the two-part model with correlated random effects and the model with autoregressive random effects to executive compensation data to investigate potential determinants of CEO stock option grants.
 2011
 FSU_migr_etd1710
 Thesis
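The two-part structure described in the abstract above can be made concrete with a small log-likelihood sketch. It deliberately omits the AR(1) random effects and the Monte Carlo EM step that the thesis develops, and it assumes a single covariate and Gaussian errors for the nonzero part. The function name `two_part_loglik` and all parameter names are hypothetical.

```python
import math


def two_part_loglik(y, x, beta0, beta1, gamma0, gamma1, sigma):
    """Log-likelihood of a simplified two-part model without random effects.

    Part 1 (logistic): P(y != 0 | x) = 1 / (1 + exp(-(beta0 + beta1 * x))).
    Part 2 (linear):   y | y != 0, x ~ Normal(gamma0 + gamma1 * x, sigma^2).
    Assumption: independent observations, which the thesis's model relaxes.
    """
    total = 0.0
    for yi, xi in zip(y, x):
        p_nonzero = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * xi)))
        if yi == 0:
            total += math.log(1.0 - p_nonzero)
        else:
            mean = gamma0 + gamma1 * xi
            total += (
                math.log(p_nonzero)
                - 0.5 * math.log(2.0 * math.pi * sigma ** 2)
                - (yi - mean) ** 2 / (2.0 * sigma ** 2)
            )
    return total


if __name__ == "__main__":
    # Made-up responses with several zeros, for illustration only.
    y = [0.0, 0.0, 1.2, 0.0, 2.5]
    x = [0.1, 0.3, 0.8, 0.2, 1.5]
    print(two_part_loglik(y, x, beta0=-1.0, beta1=2.0, gamma0=0.5, gamma1=1.0, sigma=1.0))
```

With subject-level random effects added to both linear predictors, this likelihood would have to be integrated over those effects, which is the high-dimensional integration the abstract refers to.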
 Interrelating of Longitudinal Processes: An Empirical Example.
Royal-Thomas, Tamika Y. N., McGee, Daniel, Levenson, Cathy, Sinha, Debajyoti, Osmond, Clive, Niu, Xufeng, Department of Statistics, Florida State University
The Barker Hypothesis states that maternal and 'in utero' attributes during pregnancy affect a child's cardiovascular health throughout life. We present an analysis of a unique longitudinal dataset from Jamaica that consists of three longitudinal processes: (i) Maternal longitudinal process: blood pressure and anthropometric measurements at seven timepoints on the mother during pregnancy. (ii) In utero measurements: ultrasound measurements of the fetus taken at six timepoints during pregnancy. (iii) Birth to present process: children's anthropometric and blood pressure measurements at 24 timepoints from birth to 14 years. A comprehensive analysis of the interrelationship of these three longitudinal processes is presented using joint modeling for multivariate longitudinal profiles. We propose a new methodology for examining a child's cardiovascular risk by extending a current view of likelihood estimation. Joint modeling of multivariate longitudinal profiles is carried out, and the extension of the traditional likelihood method is compared to the maximum likelihood estimates. Our main goal is to examine whether the process in mothers predicts fetal development, which in turn predicts the future cardiovascular health of the children. One of the difficulties with 'in utero' and early childhood data is that certain variables are highly correlated, so dimension reduction techniques are quite applicable in this scenario. Principal component analysis (PCA) is used to create a smaller set of uncorrelated variables, which is then used in a longitudinal analysis setting. These principal components are then used in an optimal linear mixed model for longitudinal data, which indicates that in utero and early childhood attributes predict the future cardiovascular health of the children. This dissertation has added to the body of knowledge on developmental origins of adult diseases and has supplied some significant results while utilizing a rich diversity of statistical methodologies.
 2011
 FSU_migr_etd1792
 Thesis
 Statistical Modelling and Applications of Neural Spike Trains.
Lawhern, Vernon, Wu, Wei, Contreras, Robert J., Srivastava, Anuj, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
In this thesis we investigate statistical modelling of neural activity in the brain. We first develop a framework which is an extension of the state-space Generalized Linear Model (GLM) by Eden and colleagues [20] to include the effects of hidden states. These states, collectively, represent variables which are not observed (or even observable) in the modeling process but nonetheless can have an impact on the neural activity. We then develop a framework that allows us to input a priori target information into the model. We examine both of these modelling frameworks on motor cortex data recorded from monkeys performing different target-driven hand and arm movement tasks. Finally, we perform temporal coding analysis of sensory stimulation using principled statistical models and show the efficacy of our approach.
 2011
 FSU_migr_etd3251
 Thesis