 GINOM: A statistical framework for assessing interval overlap of multiple genomic features.
Bryner, Darshan, Criscione, Stephen, Leith, Andrew, Huynh, Quyen, Huffer, Fred, Neretti, Nicola
A common problem in genomics is to test for associations between two or more genomic features, typically represented as intervals interspersed across the genome. Existing methodologies can test for significant pairwise associations between two genomic intervals; however, they cannot test for associations involving multiple sets of intervals. This limits our ability to uncover more complex, yet biologically important associations between multiple sets of genomic features. We introduce GINOM (Genomic INterval Overlap Model), a new method that enables testing of significant associations between multiple genomic features. We demonstrate GINOM's ability to identify higher-order associations with both simulated and real data. In particular, we used GINOM to explore L1 retrotransposable element insertion bias in lung cancer and found a significant pairwise association between L1 insertions and heterochromatic marks. Unlike other methods, GINOM also detected an association between L1 insertions and gene bodies marked by a facultative heterochromatic mark, which could explain the observed bias for L1 insertions towards cancer-associated genes.
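As a rough illustration of the quantity such tests are built on, the sketch below computes the number of base pairs shared by several interval sets at once. It is not GINOM's statistical model, which the abstract does not spell out; the function names (`merge`, `intersect`, `joint_overlap_length`) and the half-open `(start, end)` convention are assumptions made for this example only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def joint_overlap_length(*sets: List[Interval]) -> int:
    """Total length covered simultaneously by every interval set given."""
    if not sets:
        raise ValueError("need at least one interval set")
    common = merge(sets[0])
    for other in sets[1:]:
        common = intersect(common, other)
    return sum(end - start for start, end in common)
```

A significance test would compare this observed overlap with its distribution under some null model of interval placement; that step is deliberately left out here because the abstract does not describe it.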
 2017-06-15
 FSU_pmch_28617797, 10.1371/journal.pcbi.1005586, PMC5491313, 28617797, 28617797, PCOMPBIOLD1601322
 The Frequentist Performance of Some Bayesian Confidence Intervals for the Survival Function.
Tao, Yingfeng, Huffer, Fred, Okten, Giray, Sinha, Debajyoti, Niu, Xufeng, Department of Statistics, Florida State University
Estimation of a survival function is a very important topic in survival analysis with contributions from many authors. This dissertation considers estimation of confidence intervals for the survival function based on right-censored or interval-censored survival data. Most of the methods for estimating pointwise confidence intervals and simultaneous confidence bands of the survival function are reviewed in this dissertation. In the right-censored case, almost all confidence intervals are based in some way on the Kaplan-Meier estimator first proposed by Kaplan and Meier (1958) and widely used as the nonparametric estimator in the presence of right-censored data. For interval-censored data, the Turnbull estimator (Turnbull, 1974) plays a similar role. For a class of Bayesian models involving Dirichlet priors, Doss and Huffer (2003) suggested several simulation techniques to approximate the posterior distribution of the survival function by using Markov chain Monte Carlo or sequential importance sampling. These techniques lead to probability intervals for the survival function (at arbitrary time points) and its quantiles for both the right-censored and interval-censored cases. This dissertation will examine the frequentist properties and general performance of these probability intervals when the prior is noninformative. Simulation studies will be used to compare these probability intervals with other published approaches. Extensions of the Doss-Huffer approach are given for constructing simultaneous confidence bands for the survival function and for computing approximate confidence intervals for the survival function based on Edgeworth expansions using posterior moments. The performance of these extensions is studied by simulation.
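For readers unfamiliar with the Kaplan-Meier estimator referred to above, here is a minimal sketch of the standard product-limit formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number still at risk. It illustrates the textbook estimator only, not the Bayesian intervals studied in the dissertation; the function name `kaplan_meier` and the 1/0 event coding are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the survival function.

    times  : observed times (event time or right-censoring time)
    events : 1 if the event was observed, 0 if the time is right-censored
    Returns (time, S(time)) pairs at each distinct observed event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everything recorded at the same time t (ties).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set after t.
        n_at_risk -= removed
    return curve
```

Subjects censored at time t are counted as still at risk at t, which is the usual convention when censoring and events are tied.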
 2013
 FSU_migr_etd7624
 Statistical Analysis of Trajectories on Riemannian Manifolds.
Su, Jingyong, Srivastava, Anuj, Klassen, Erik, Huffer, Fred, Zhang, Jinfeng, Department of Statistics, Florida State University
This thesis consists of two distinct topics. First, we present a framework for estimation and analysis of trajectories on Riemannian manifolds. Second, we propose a framework for detecting, classifying, and estimating shapes in point cloud data. This thesis mainly focuses on statistical analysis of trajectories that take values on nonlinear manifolds. There are many difficulties when analyzing temporal trajectories on nonlinear manifolds. First, the observed data are always noisy and discrete at unsynchronized times. Second, trajectories are observed under arbitrary temporal evolutions. In this work, we first address the problem of estimating full smooth trajectories on nonlinear manifolds using only a set of time-indexed points, for use in interpolation, smoothing, and prediction of dynamic systems. Furthermore, we study statistical analysis of trajectories that take values on nonlinear Riemannian manifolds and are observed under arbitrary temporal evolutions. The problem of analyzing such temporal trajectories, including registration, comparison, modeling and evaluation, arises in many applications. We introduce a quantity that provides both a cost function for temporal registration and a proper distance for comparison of trajectories. This distance, in turn, is used to define statistical summaries, such as the sample means and covariances, of given trajectories and Gaussian-type models to capture their variability. Both theoretical proofs and experimental results are provided to validate our work. The problems of detecting, classifying, and estimating shapes in point cloud data are important due to their general applicability in image analysis, computer vision, and graphics. They are challenging because the data is typically noisy, cluttered, and unordered.
We study these problems using a fully statistical model where the data is modeled using a Poisson process on the object's boundary (curves or surfaces), corrupted by additive noise and a clutter process. Using likelihood functions dictated by the model, we develop a generalized likelihood ratio test for detecting a shape in a point cloud. Additionally, we develop a procedure for estimating most likely shapes in observed point clouds under given shape hypotheses. We demonstrate this framework using examples of 2D and 3D shape detection and estimation in both real and simulated data, and a usage of this framework in shape retrieval from a 3D shape database.
 2013
 FSU_migr_etd7619
 Discrete Frenet Frame with Application to Structural Biology and Kinematics.
Lu, Yuanting, Quine, John R., Huffer, Fred W., Bertram, Richard, Cross, Timothy A., Cogan, Nick, Department of Mathematics, Florida State University
The classical Frenet frame is a moving frame on a smooth curve. Connecting a sequence of points in space by line segments makes a discrete curve. The reference frame consisting of tangent, normal and binormal vectors at each point is defined as the discrete Frenet frame (DFF). The DFF is useful in studying shapes of long molecules such as proteins. In this dissertation, we provide a solid mathematical foundation for the DFF by showing that the limit of the Frenet formula for the DFF is the classical Frenet formula. As part of a survey of various ways to compute rigid body motion, we show that the Denavit-Hartenberg (DH) conventions in robotics are a special case of the DFF. Finally, we apply the DFF to solve the kink angle problem in protein alpha-helical structure using data from NMR experiments.
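To make the definition concrete, the sketch below builds a tangent, normal and binormal at each interior point of a discrete curve. It follows one common convention (tangent along the outgoing segment, binormal along the cross product of consecutive tangents, normal = binormal x tangent); the dissertation may index or orient the frame differently, and the function name `discrete_frenet_frames` is an assumption made for this example.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def sub(u: Vec, v: Vec) -> Vec:
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def cross(u: Vec, v: Vec) -> Vec:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u: Vec) -> Vec:
    norm = math.sqrt(sum(x * x for x in u))
    if norm == 0.0:
        # Happens for repeated points or collinear consecutive segments,
        # where the frame is not defined under this convention.
        raise ValueError("cannot normalise a zero vector")
    return (u[0] / norm, u[1] / norm, u[2] / norm)


def discrete_frenet_frames(points: Sequence[Vec]) -> List[Tuple[Vec, Vec, Vec]]:
    """Return (tangent, normal, binormal) at each interior point p_1..p_{n-1}."""
    tangents = [unit(sub(points[k + 1], points[k]))
                for k in range(len(points) - 1)]
    frames = []
    for k in range(1, len(tangents)):
        b = unit(cross(tangents[k - 1], tangents[k]))  # binormal
        n = cross(b, tangents[k])                      # normal, already unit length
        frames.append((tangents[k], n, b))
    return frames
```

With this convention the triple (tangent, normal, binormal) is right-handed and orthonormal, and the normal points toward the inside of the turn.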
 2013
 FSU_migr_etd7477
 A Riemannian Framework for Annotated Curves Analysis.
Liu, Wei, Srivastava, Anuj, Zhang, Jinfeng, Klassen, Eric P., Huffer, Fred, Department of Statistics, Florida State University
We propose a Riemannian framework for shape analysis of annotated curves, curves that have certain attributes defined along them, in addition to their geometries. These attributes may be in the form of vector-valued functions, discrete landmarks, or symbolic labels, and provide auxiliary information along the curves. The resulting shape analysis, that is comparing, matching, and deforming, is naturally influenced by the auxiliary functions. Our idea is to construct curves in higher dimensions using both geometric and auxiliary coordinates, and analyze shapes of these curves. The difficulty comes from the need to remove different groups from different components: the shape is invariant to rigid motion, global scale and reparameterization, while the auxiliary component is usually invariant only to reparameterization. Thus, the removal of some transformations (rigid motion and global scale) is restricted to the geometric coordinates, while the reparameterization group is removed for all coordinates. We demonstrate this framework using a number of experiments.
 2011
 FSU_migr_etd4997
 Riemannian Shape Analysis of Curves and Surfaces.
Kurtek, Sebastian, Srivastava, Anuj, Klassen, Eric, Wu, Wei, Huffer, Fred, Dryden, Ian, Department of Statistics, Florida State University
Shape analysis of curves and surfaces is a very important tool in many applications ranging from computer vision to bioinformatics and medical imaging. There are many difficulties when analyzing shapes of parameterized curves and surfaces. Firstly, it is important to develop representations and metrics such that the analysis is invariant to parameterization in addition to the standard transformations (rigid motion and scaling). Furthermore, under the chosen representations and metrics, the analysis must be performed on infinite-dimensional and sometimes nonlinear spaces, which poses an additional difficulty. In this work, we develop and apply methods which address these issues. We begin by defining a framework for shape analysis of parameterized open curves and extend these ideas to shape analysis of surfaces. We utilize the presented frameworks in various classification experiments spanning multiple application areas. In the case of curves, we consider the problem of clustering DT-MRI brain fibers, classification of protein backbones, modeling and segmentation of signatures and statistical analysis of biosignals. In the case of surfaces, we perform disease classification using 3D anatomical structures in the brain, classification of handwritten digits by viewing images as quadrilateral surfaces, and finally classification of cropped facial surfaces. We provide two additional extensions of the general shape analysis frameworks that are the focus of this dissertation. The first one considers shape analysis of marked spherical surfaces where in addition to the surface information we are given a set of manually or automatically generated landmarks. This requires additional constraints on the definition of the reparameterization group and is applicable in many domains, especially medical imaging and graphics. Second, we consider reflection symmetry analysis of planar closed curves and spherical surfaces.
Here, we also provide an example of disease detection based on brain asymmetry measures. We close with a brief summary and a discussion of open problems, which we plan on exploring in the future.
 2012
 FSU_migr_etd4963
 Theories on Group Variable Selection in Multivariate Regression Models.
Ha, SeungYeon, She, Yiyuan, Okten, Giray, Huffer, Fred, Sinha, Debajyoti, Department of Statistics, Florida State University
We study group variable selection in the multivariate regression model. Group variable selection is equivalent to selecting the nonzero rows of the coefficient matrix: since there are multiple response variables, if a predictor is irrelevant to estimation then the corresponding row must be zero. In the high-dimensional setup, shrinkage estimation methods are applicable and guarantee smaller MSE than OLS according to the James-Stein phenomenon (1961). As one class of shrinkage methods, we study penalized least squares estimation for group variable selection. Among these, we study L0 regularization and L0 + L2 regularization with the purpose of obtaining accurate prediction and consistent feature selection, and use the corresponding computational procedures Hard TISP and Hard-Ridge TISP (She, 2009) to solve the numerical difficulties. These regularization methods show better performance in both prediction and selection than Lasso (L1 regularization), which is one of the popular penalized least squares methods. L0 achieves the same optimal rate of prediction loss and estimation loss as Lasso, but it requires no restriction on the design matrix or sparsity for controlling the prediction error and a more relaxed condition than Lasso for controlling the estimation error. Also, for selection consistency, it requires a much more relaxed incoherence condition, which concerns the correlation between the relevant subset and the irrelevant subset of predictors. Therefore L0 can work better than Lasso in both prediction and sparsity recovery in practical cases where correlation is high or sparsity is not low. We study another method, L0 + L2 regularization, which uses the combined penalty of L0 and L2. For the corresponding procedure Hard-Ridge TISP, two parameters work independently for selection and shrinkage (to enhance prediction) respectively, and therefore it gives better performance in some cases (such as low signal strength) than L0 regularization.
For L0 regularization, λ works for selection but it is tuned in terms of prediction accuracy. L0 + L2 regularization gives the optimal rate of prediction and estimation errors without any restriction, when the coefficient of the L2 penalty is appropriately assigned. Furthermore, it can achieve a better rate of estimation error with an ideal choice of blockwise weight for the L2 penalty.
 2013
 FSU_migr_etd7404
 Monte Carlo Likelihood Estimation for Conditional Autoregressive Models with Application to Sparse Spatiotemporal Data.
Bain, Rommel, Huffer, Fred, Becker, Betsy, Niu, Xufeng, Srivastava, Anuj, Department of Statistics, Florida State University
Spatiotemporal modeling is increasingly used in a diverse array of fields, such as ecology, epidemiology, health care research, transportation, economics, and other areas where data arise from a spatiotemporal process. Spatiotemporal models describe the relationship between observations collected from different spatiotemporal sites. The modeling of spatiotemporal interactions arising from spatiotemporal data is done by incorporating the space-time dependence into the covariance structure. A main goal of spatiotemporal modeling is the estimation and prediction of the underlying process that generates the observations under study and the parameters that govern the process. Furthermore, analysis of the spatiotemporal correlation of variables can be used for estimating values at sites where no measurements exist. In this work, we develop a framework for estimating quantities that are functions of complete spatiotemporal data when the spatiotemporal data is incomplete. We present two classes of conditional autoregressive (CAR) models (the homogeneous CAR (HCAR) model and the weighted CAR (WCAR) model) for the analysis of sparse spatiotemporal data (the log of monthly mean zooplankton biomass) collected on a spatiotemporal lattice by the California Cooperative Oceanic Fisheries Investigations (CalCOFI). These models allow for spatiotemporal dependencies between nearest neighbor sites on the spatiotemporal lattice. Typically, CAR model likelihood inference is quite complicated because of the intractability of the CAR model's normalizing constant. Sparse spatiotemporal data further complicates likelihood inference. We implement Monte Carlo likelihood (MCL) estimation methods for parameter estimation of our HCAR and WCAR models. Monte Carlo likelihood estimation provides an approximation for intractable likelihood functions. We demonstrate our framework by giving estimates for several different quantities that are functions of the complete CalCOFI time series data.
 2013
 FSU_migr_etd7283
 Failure Time Regression Models for Thinned Point Processes.
Holden, Robert T., Huffer, Fred G., Nichols, Warren, McGee, Dan, Sinha, Debajyoti, Department of Statistics, Florida State University
In survival analysis, data on the time until a specific criterion event (or "endpoint") occurs are analyzed, often with regard to the effects of various predictors. In the classic applications, the criterion event is in some sense a terminal event, e.g., death of a person or failure of a machine or machine component. In these situations, the analysis requires assumptions only about the distribution of waiting times until the criterion event occurs and the nature of the effects of the predictors on that distribution. Suppose that the criterion event isn't a terminal event that can only occur once, but is a repeatable event. The sequence of events forms a stochastic point process. Further suppose that only some of the events are detected (observed); the detected events form a thinned point process. Any failure time model based on the data will be based not on the time until the first occurrence, but on the time until the first detected occurrence of the event. The implications of estimating survival regression models from such incomplete data will be analyzed. It will be shown that the effect of thinning on regression parameters depends on the combination of the type of regression model, the type of point process that generates the events, and the thinning mechanism. For some combinations, the effect of a predictor will be the same for time to the first event and the time to the first detected event. For other combinations, the regression effect will be changed as a result of the incomplete detection.
 2013
 FSU_migr_etd8568
 2D Affine and Projective Shape Analysis, and Bayesian Elastic Active Contours.
Bryner, Darshan W., Srivastava, Anuj, Klassen, Eric, Gallivan, Kyle, Huffer, Fred, Wu, Wei, Zhang, Jinfeng, Department of Statistics, Florida State University
An object of interest in an image can be characterized to some extent by the shape of its external boundary. Current techniques for shape analysis consider the notion of shape to be invariant to the similarity transformations (rotation, translation and scale), but oftentimes in 2D images of 3D scenes, perspective effects can transform shapes of objects in a more complicated manner than what can be modeled by the similarity transformations alone. Therefore, we develop a general Riemannian framework for shape analysis where metrics and related quantities are invariant to larger groups, the affine and projective groups, that approximate the transformations arising from perspective skews. Highlighting two possibilities for representing object boundaries, ordered points (or landmarks) and parametrized curves, we study different combinations of these representations (points and curves) and transformations (affine and projective). Specifically, we provide solutions to three out of four situations and develop algorithms for computing geodesics and intrinsic sample statistics, leading up to Gaussian-type statistical models, and classifying test shapes using such models learned from training data. In the case of parametrized curves, an added issue is to obtain invariance to the reparameterization group. The geodesics are constructed by particularizing the path-straightening algorithm to geometries of current manifolds and are used, in turn, to compute shape statistics and Gaussian-type shape models. We demonstrate these ideas using a number of examples from shape and activity recognition. After developing such Gaussian-type shape models, we present a variational framework for naturally incorporating these shape models as prior knowledge in guidance of active contours for boundary extraction in images.
This so-called Bayesian active contour framework is especially suitable for images where boundary estimation is difficult due to low contrast, low resolution, and presence of noise and clutter. In traditional active contour models, curves are driven towards a minimum of an energy composed of image and smoothing terms. We introduce an additional shape term based on shape models of prior known relevant shape classes. The minimization of this total energy, using iterated gradient-based updates of curves, leads to an improved segmentation of object boundaries. We demonstrate this Bayesian approach to segmentation using a number of shape classes in many imaging scenarios including the synthetic imaging modalities of SAS (synthetic aperture sonar) and SAR (synthetic aperture radar), for which accurate boundary extractions are notoriously difficult to obtain. In practice, the training shapes used for prior shape models may be collected from viewing angles different from those for the test images and thus may exhibit a shape variability brought about by perspective effects. Therefore, by allowing for a prior shape model to be invariant to, say, affine transformations of curves, we propose an active contour algorithm where the resulting segmentation is robust to perspective skews.
 2013
 FSU_migr_etd8534
 Partial Differential Equation Methods to Price Options in the Energy Market.
Yan, Jinhua, Kopriva, David, Huffer, Fred, Case, Bettye Anne, Nolder, Craig, Wang, Xiaoming, Department of Mathematics, Florida State University
We develop partial differential equation methods with well-posed boundary conditions to price average strike options and swing options in the energy market. We use the energy method to develop boundary conditions that make a two-space-variable model of Asian options well-posed on a finite domain. To test the performance of well-posed boundary conditions, we price an average strike call. We also derive new boundary conditions for the average strike option from the put-call parity. Numerical results show that well-posed boundary conditions are working appropriately and solutions with new boundary conditions match the similarity solution significantly better than those provided in the existing literature. To price swing options, we develop a finite element penalty method on a one-factor mean-reverting diffusion model. We use the energy method to find well-posed boundary conditions on a finite domain, derive formulas to estimate the size of the numerical domain, and develop a priori error estimates for both Dirichlet boundary conditions and Neumann boundary conditions. We verify the results through numerical experiments. Since the optimal exercise price is unknown in advance, which makes the swing option valuation challenging, we use a penalty method to resolve the difficulty caused by the early exercise feature. Numerical results show that the finite element penalty method is thousands of times faster than the binomial tree method at the same level of accuracy. Furthermore, we price a multiple right swing option with different strike prices. We find that a jump discontinuity can occur in the initial condition of a swing right since the exercise of another swing right may force its optimal exercise region to shrink. We develop an algorithm to identify the optimal exercise boundary at each time level, which allows us to record the optimal exercise time. Numerical results are accurate to one cent compared with the benchmark solutions computed by a binomial tree method.
We extend applications to multiple right swing options with a waiting period restriction. A waiting period exists between two swing rights to be exercised successively, so we cannot exercise the latter right when we see an optimal exercise opportunity within the waiting period, but have to wait for the first optimal exercise opportunity after the waiting period. Therefore, we keep track of the optimal exercise time when pricing each swing right. We also verify an extreme case numerically. When the waiting time decreases, the price of the M-right swing option increases to M times the price of an American option, as expected.
 2013
 FSU_migr_etd7673
 Time-Varying Coefficient Models with ARMA-GARCH Structures for Longitudinal Data Analysis.
Zhao, Haiyan, Niu, Xufeng, Huffer, Fred, Nolder, Craig, McGee, Dan, Department of Statistics, Florida State University
The motivation of my research comes from the analysis of the Framingham Heart Study (FHS) data. The FHS is a long-term prospective study of cardiovascular disease in the community of Framingham, Massachusetts. The study began in 1948 and 5,209 subjects were initially enrolled. Examinations were given biennially to the study participants and their status associated with the occurrence of disease was recorded. In this dissertation, the event we are interested in is the incidence of coronary heart disease (CHD). Covariates considered include sex, age, cigarettes per day (CSM), serum cholesterol (SCL), systolic blood pressure (SBP) and body mass index (BMI, weight in kilograms/height in meters squared). A review of the statistical literature indicates that the effects of the covariates on cardiovascular disease or death caused by all possible diseases in the Framingham study change over time. For example, the effect of SCL on cardiovascular disease decreases linearly over time. In this study, I would like to examine the time-varying effects of the risk factors on CHD incidence. Time-varying coefficient models with ARMA-GARCH structure are developed in this research. The maximum likelihood and the marginal likelihood methods are used to estimate the parameters in the proposed models. Since high-dimensional integrals are involved in the calculations of the marginal likelihood, the Laplace approximation is employed in this study. Simulation studies are conducted to evaluate the performance of these two estimation methods based on our proposed models. The Kullback-Leibler (KL) divergence and the root mean square error are employed in the simulation studies to compare the results obtained from different methods. Simulation results show that the marginal likelihood approach gives more accurate parameter estimates, but is more computationally intensive.
Following the simulation study, our proposed models are applied to the Framingham Heart Study to investigate the time-varying effects of covariates with respect to CHD incidence. To specify the time-series structures of the effects of risk factors, the Bayesian Information Criterion (BIC) is used for model selection. Our study shows that the relationship between CHD and risk factors changes over time. For males, there is an obviously decreasing linear trend for the age effect, which implies that the age effect on CHD is less significant for older patients than for younger patients. The effect of CSM stays almost the same in the first 30 years and decreases thereafter. There are slightly decreasing linear trends for the effects of both SBP and BMI. Furthermore, the coefficients of SBP are mostly positive over time, i.e., patients with higher SBP are more likely to develop CHD, as expected. For females, there is also an obviously decreasing linear trend for the age effect, while the effects of SBP and BMI on CHD are mostly positive and do not change much over time.
 2010
 FSU_migr_etd0527
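The abstract above names three standard quantities: the root mean square error and the Kullback-Leibler divergence for comparing estimation methods, and the BIC for model selection. A minimal Python sketch of their textbook definitions follows. The function names are illustrative, and the discrete form of the KL divergence is an assumption; the dissertation's own computations are not described in this record.

```python
import math


def rmse(estimates, true_value):
    """Root mean square error of repeated estimates around a known true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```

In a simulation study of this kind, `rmse` would typically be applied per parameter across replications, and `bic` would be compared across candidate time-series structures fitted to the same data.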
 A Comparison of Estimators in Hierarchical Linear Modeling: Restricted Maximum Likelihood versus Bootstrap via Minimum Norm Quadratic Unbiased Estimators.
Delpish, Ayesha Nneka, Niu, XuFeng, Tate, Richard L., Huffer, Fred W., Zahn, Douglas, Department of Statistics, Florida State University
The purpose of the study was to investigate the relative performance of two estimation procedures, the restricted maximum likelihood (REML) and the bootstrap via MINQUE, for a two-level hierarchical linear model under a variety of conditions. Specific focus lay on observing whether the bootstrap via MINQUE procedure offered improved accuracy in the estimation of the model parameters and their standard errors in situations where normality may not be guaranteed. Through Monte Carlo simulations, the importance of this assumption for the accuracy of multilevel parameter estimates and their standard errors was assessed using the accuracy index of relative bias and by observing the coverage percentages of 95% confidence intervals constructed for both estimation procedures. The study systematically varied the number of groups at level 2 (30 versus 100), the size of the intraclass correlation (0.01 versus 0.20) and the distribution of the observations (normal versus chi-squared with 1 degree of freedom). The number of groups and intraclass correlation factors produced effects consistent with those previously reported: as the number of groups increased, the bias in the parameter estimates decreased, with a more significant effect observed for those estimates obtained via REML. High levels of the intraclass correlation also led to a decrease in the efficiency of parameter estimation under both methods. Study results show that while both the restricted maximum likelihood and the bootstrap via MINQUE estimates of the fixed effects were accurate, the efficiency of the estimates was affected by the distribution of errors, with the bootstrap via MINQUE procedure outperforming REML. Both procedures produced less efficient estimators under the chi-squared distribution, particularly for the variance-covariance component estimates.
 2006
 FSU_migr_etd0771
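The two accuracy measures named in the abstract above, relative bias and the coverage percentage of confidence intervals, have simple standard definitions. The Python sketch below assumes a Monte Carlo setting in which the true parameter value is known; the function names and the example numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def relative_bias(estimates, true_value):
    """(mean of the estimates - true value) / true value; assumes true_value != 0."""
    return (mean(estimates) - true_value) / true_value


def coverage(intervals, true_value):
    """Fraction of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


# Hypothetical replications for a parameter whose true value is 1.0.
estimates = [1.1, 0.9, 1.3]
intervals = [(0.0, 2.0), (1.5, 3.0), (0.5, 1.0)]
print(relative_bias(estimates, 1.0))
print(coverage(intervals, 1.0))
```

For nominal 95% intervals, a coverage value well below 0.95 across replications would indicate that the standard errors are underestimated.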
 Estimation from Data Representing a Sample of Curves.
Auguste, Anna L., Bunea, Florentina, Mason, Patrick, Hollander, Myles, Huffer, Fred, Department of Statistics, Florida State University
This dissertation introduces and assesses an algorithm to generate confidence bands for a regression function or a main effect when multiple data sets are available. In particular, it proposes to construct confidence bands for different trajectories and then aggregate these to produce an overall confidence band for a mean function. An estimator of the regression function or main effect is also examined. First, nonparametric estimators and confidence bands are formed on each data set separately. Then each data set is in turn treated as a testing set for aggregating the preliminary results from the remaining data sets. The criterion used for this aggregation is either the least squares (LS) criterion or a BIC-type penalized LS criterion. The proposed estimator is the average over data sets of these aggregates. It is thus a weighted sum of the preliminary estimators. When there is only a main effect, the proposed confidence band is the minimum L1 band of all the M aggregate bands. When there is some random effect, we suggest an adjustment to the confidence band; in this case, the proposed confidence band is the minimum L1 band of all the M adjusted aggregate bands. Desirable asymptotic properties are shown to hold. A simulation study examines the performance of each technique relative to several alternative methods and theoretical benchmarks. An application to seismic data is conducted.
 2006
 FSU_migr_etd0286
 Same Author and Same Data Dependence in Meta-Analysis.
Shin, InSoo, Becker, Betsy Jane, Huffer, Fred, Kamata, Akihito, Yang, Yanyun, Department of Educational Psychology and Learning Systems, Florida State University
When conducting a meta-analysis, reviewers gather extensive sets of primary studies. When we have two or more primary studies by the same author, or two or more studies using the same data set, we have what we call the 'same author' and 'same data' issues in meta-analysis. A researcher conducting a meta-analysis first confronts the 'same author' and 'same data' issues in the data-gathering stage. These issues lead to between-studies dependence in meta-analysis. In this dissertation, methods of showing dependence are investigated, and the impact of 'same author' studies and 'same data' studies is investigated. The prevalence of these phenomena is outlined, and how meta-analysts have treated this issue until now is summarized. Journal editors' criteria are also reviewed. To show dependence of 'same author' studies and 'same data' studies, fixed-effects categorical analysis, homogeneity tests, and intraclass correlations are used. To measure the impact of 'same author' and 'same data' studies, sensitivity analysis and HLM analyses are conducted. Two example analyses are conducted using data sets from a class-size meta-analysis and an ESL (English as a Second Language) meta-analysis. The former is an example of the 'same data' problem, and the latter is an example of the 'same author' problem. Finally, simulation studies are conducted to assess how each analysis technique works.
 2009
 FSU_migr_etd0319
 Variance Gamma Pricing of American Futures Options.
Yoo, Eunjoo, Nolder, Craig A., Huffer, Fred, Case, Bettye Anne, Kercheval, Alec N., Quine, Jack, Department of Mathematics, Florida State University
In financial markets under uncertainty, the classical Black-Scholes model cannot explain empirical facts such as the fat tails observed in the probability density. To overcome this drawback, Lévy process and stochastic volatility models were introduced to financial modeling during the last decade. Today crude oil futures markets are highly volatile. It is the purpose of this dissertation to develop a mathematical framework in which American options on crude oil futures contracts are priced more effectively than by current methods. In this work, we use the Variance Gamma process to model the futures price process. To generate the underlying process, we use a random tree method so that we evaluate the option prices at each tree node. The value averaged over fifty replications of a random tree is taken as the true option price. Pricing performance using this method is assessed using American options on crude oil commodity contracts from December 2003 to November 2004. For comparison with the Variance Gamma model, we price using the Black-Scholes model as well. Over the entire sample period, a positive skewness and high kurtosis, especially in the short-term options, are observed. In terms of pricing errors, the Variance Gamma process performs better than the Black-Scholes model for the American options on crude oil commodities.
 2008
 FSU_migr_etd0691
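The Variance Gamma process mentioned in the abstract above has a well-known representation as a Brownian motion with drift evaluated at a gamma-distributed random time. The Python sketch below simulates increments under that representation only. It does not reproduce the dissertation's random tree method or its option pricing, and the parameter names (`theta` for drift, `sigma` for volatility, `nu` for the variance rate of the gamma clock) are conventional choices assumed here.

```python
import math
import random


def simulate_vg_path(n_steps, dt, theta, sigma, nu, rng):
    """Simulate X(t) = theta * G(t) + sigma * W(G(t)) on a grid of step dt.

    G is a gamma process whose increment over dt has mean dt and variance nu * dt.
    Returns the list of n_steps + 1 path values, starting at 0.0.
    """
    x = 0.0
    path = [x]
    for _ in range(n_steps):
        # Gamma time increment: shape dt / nu, scale nu.
        g = rng.gammavariate(dt / nu, nu)
        # Conditional on g, the increment is normal with mean theta*g, variance sigma^2*g.
        z = rng.gauss(0.0, 1.0)
        x += theta * g + sigma * math.sqrt(g) * z
        path.append(x)
    return path


# Hypothetical parameter values, chosen only for illustration.
path = simulate_vg_path(n_steps=50, dt=0.1, theta=-0.1, sigma=0.3, nu=0.2,
                        rng=random.Random(42))
print(len(path), path[0])
```

Passing an explicit `random.Random` instance keeps runs reproducible, which matters when averaging prices over a fixed number of replications.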
 Numerical Methods for Portfolio Risk Estimation.
Zhang, Jianke, Kercheval, Alec, Huffer, Fred, Gallivan, Kyle, Beaumont, Paul, Nichols, Warren, Department of Mathematics, Florida State University
In portfolio risk management, a global covariance matrix forecast often needs to be adjusted by changing diagonal blocks corresponding to specific submarkets. Unless certain constraints are obeyed, this can result in the loss of positive definiteness of the global matrix. Imposing the proper constraints while minimizing the disturbance of off-diagonal blocks leads to a nonconvex optimization problem in numerical linear algebra called the Weighted Orthogonal Procrustes Problem. We analyze and compare two local minimizing algorithms and offer an algorithm for global minimization. Our methods are faster and more effective than current numerical methods for covariance matrix revision.
 2007
 FSU_migr_etd0542
 Anova for Parameter Dependent Nonlinear PDEs and Numerical Methods for the Stochastic Stokes Equations.
Chen, Zheng, Gunzburger, Max, Huffer, Fred, Peterson, Janet, Wang, Xiaoqiang, Department of Mathematics, Florida State University
This dissertation includes the application of analysis-of-variance (ANOVA) expansions to analyze solutions of parameter-dependent partial differential equations, and the analysis and finite element approximation of the Stokes equations with stochastic forcing terms. In the first part of the dissertation, the impact of parameter-dependent boundary conditions on the solutions of a class of nonlinear PDEs is considered. Based on the ANOVA expansions of functionals of the solutions, the effects of different parameter sampling methods on the accuracy of surrogate optimization approaches to PDE-constrained optimization are considered. The effects of the smoothness of the functional and the nonlinearity in the PDE on the decay of the higher-order ANOVA terms are studied. The concept of effective dimensions is used to determine the accuracy of the ANOVA expansions. Demonstrations are given to show that whenever truncated ANOVA expansions of functionals provide accurate approximations, optimizers found through a simple surrogate optimization strategy are also relatively accurate. The effects of several parameter sampling strategies on the accuracy of the surrogate optimization method are also considered; it is found that for this sparse sampling application, the Latin hypercube sampling method has advantages over other well-known sampling methods. Although most of the results are presented and discussed in the context of surrogate optimization problems, they also apply to other settings such as stochastic ensemble methods and reduced-order modeling for nonlinear PDEs. In the second part of the dissertation, we study the numerical analysis of the Stokes equations driven by a stochastic process. The random processes we use are white noise, colored noise and the homogeneous Gaussian process. When the process is white noise, we deal with the singularity of matrix Green's functions in the form of mild solutions with the aid of the theory of distributions.
We develop finite element methods to solve the stochastic Stokes equations. In the 2D and 3D cases, we derive error estimates for the approximate solutions. The results of numerical experiments are provided in the 2D case that demonstrate the algorithm and convergence rates. On the other hand, the singularity of the matrix Green's functions necessitates the use of the homogeneous Gaussian process. In the framework of the theory of abstract Wiener spaces, the stochastic integrals with respect to the homogeneous Gaussian process can be defined on a larger space than L2. With some conditions on the density function in the definition of the homogeneous Gaussian process, the matrix Green's functions have well-defined integrals. We have studied the probability properties of this kind of integral and simulated discretized colored noise.
 2007
 FSU_migr_etd3851
 Investigating the Categories for Cholesterol and Blood Pressure for Risk Assessment of Death Due to Coronary Heart Disease.
Franks, Billy J., McGee, Daniel, Hurt, Myra, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
Many characteristics for predicting death due to coronary heart disease are measured on a continuous scale. These characteristics, however, are often categorized for clinical use and to aid in treatment decisions. We would like to derive a systematic approach to determine the best categorizations of systolic blood pressure and cholesterol level for use in identifying individuals who are at high risk for death due to coronary heart disease, and to compare these data-derived categories to those in common usage. Whatever categories are chosen, they should allow physicians to accurately estimate the probability of survival from coronary heart disease until some time t. The best categories will be those that provide the most accurate prediction of an individual's risk of dying by t. The approach used to determine these categories will be a version of Classification And Regression Trees that can be applied to censored survival data. The major goals of this dissertation are to obtain data-derived categories for risk assessment, to compare these categories to the ones already recommended in the medical community, and to assess the performance of these categories in predicting survival probabilities.
 2005
 FSU_migr_etd4402
 Assessing the Shelf Life of Retail Shrimp Using Real-Time Microrespirometer.
Alderees, Fahad, Hsieh, YunHwa Peggy, Arjmandi, Bahram, Huffer, Fred W., Department of Nutrition, Food, and Exercise Science, Florida State University
Shrimp is the most consumed seafood item in the United States (U.S.). Currently 90% of the shrimp consumed in the U.S. is imported from a few Asian countries. When imported shrimp arrives at its destination, it probably carries a load of microbial contamination, because postharvest processing steps such as transportation, handling, preparation, beheading, peeling, deveining, packaging and storage could add further bacterial contamination. Most U.S. import refusals involve seafood shipments, due to the detection of bacterial contamination and filthy appearance. Upon shipment arrival, testing for microbial activity in seafood requires a two-day incubation period when using the traditional Aerobic Plate Count (APC) method; however, a novel non-instrumental microrespirometer developed by Hsieh and Hsieh (2000) can determine the microbial activity of the sample in real time by measuring the CO2 evolution rate (CER). CO2 is a byproduct of microbial respiration that can be used as a direct indicator of biological activity. The unique characteristic of this method is that it is a simple device that can determine the microbial activity in food in less than one hour, is highly sensitive in determining the CER, and is simple to operate. Using the microrespirometer instead of the APC to test imported seafood shipments will save a great deal of time and lower the cost for both importers and exporters by lowering the testing cost and reducing the costly waiting time at the ports.
The specific objectives of this study are: 1) to validate the real-time microrespirometer method by correlating the rapid CER results with the traditional cultural APC method, 2) to establish a shrimp spoilage cutoff value of CER using the microrespirometer method by comparing the results with sensory analysis, 3) to examine the effect of chloramphenicol on shrimp shelf life using the non-instrumental microrespirometer, the APC method and sensory analysis, and 4) to compare the shelf life of farm-raised imported shrimp with domestic wild-caught shrimp using the non-instrumental microrespirometer, APC, pH and sensory analysis. Frozen domestic wild-caught shrimp (Penaeus duorarum) and imported farm-raised shrimp (Penaeus vannamei) were purchased locally. Domestic shrimp were treated with chloramphenicol at 10 and 30 ppm and stored at 4°C along with the untreated domestic and imported shrimp. Samples were tested daily using the microrespirometer, APC, pH and olfactory sensory analysis. The p values and correlations between CER, APC and sensory analysis were determined using SPSS Statistics software and Microsoft Excel 2007. The microrespirometer and pH determinations were done in triplicate; the APC was performed in duplicate and the experiments were repeated twice. The CER method was found to be highly correlated with the APC (R²=0.812 to 0.929) for all samples stored at 4°C. When the samples' spoilage odor became noticeable, the average CER value of all samples was 27.23 µl/h/g. In order to allow for a small safety margin, a CER value of 25 µl/h/g was identified as a safe cutoff value for raw shrimp stored at 4°C. Samples treated with chloramphenicol had significant (P The difference in microbial quality and shelf life of various (source of origin and drug treatment) shrimp samples were able to be determined rapidly and accurately when using the real-time CER method.
 2010
 FSU_migr_etd0160
 Multistate Intensity Model with AR-GARCH Random Effect for Corporate Credit Rating Transition Analysis.
Li, Zhi, Niu, Xufeng, Huffer, Fred, Kercheval, Alec, Wu, Wei, Department of Statistics, Florida State University
This thesis presents a stochastic process and time series study of corporate credit rating and market-implied rating transitions. By extending an existing model, this work incorporates generalized autoregressive conditional heteroscedastic (GARCH) random effects to capture volatility changes in the instantaneous transition rates. The GARCH model is a crucial part of financial research because its ability to model volatility changes gives market practitioners the flexibility to build more accurate models on high-frequency financial data. Corporate rating transition modeling has historically dealt with low-frequency data, for which there was no need to specify the volatility. However, the newly published Moody's market-implied ratings exhibit much higher transition frequencies. Therefore, we feel that it is necessary to capture the volatility component and to extend existing models to reflect this fact. The theoretical model specification and estimation details are discussed thoroughly in this dissertation. The performance of our models is studied on several simulated data sets and compared to the original model. Finally, the models are applied to both Moody's issuer rating and market-implied rating transition data.
 2010
 FSU_migr_etd1426
 Optimal Linear Representations of Images under Diverse Criteria.
Rubinshtein, Evgenia, Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Chicken, Eric, Department of Statistics, Florida State University
Image analysis often requires dimension reduction before statistical analysis, in order to apply sophisticated procedures. Motivated by eventual applications, a variety of criteria have been proposed: reconstruction error, class separation, non-Gaussianity using kurtosis, sparseness, mutual information, recognition of objects, and their combinations. Although some criteria have analytical solutions, the remaining ones require numerical approaches. We present geometric tools for finding linear projections that optimize a given criterion for a given data set. The main idea is to formulate a problem of optimization on a Grassmann or a Stiefel manifold, and to use differential geometry of the underlying space to construct optimization algorithms. Purely deterministic updates lead to local solutions, and addition of random components allows for stochastic gradient searches that eventually lead to global solutions. We demonstrate these results using several image datasets, including natural images and facial images.
 2006
 FSU_migr_etd1926
 A Class of Mixed-Distribution Models with Applications in Financial Data Analysis.
Tang, Anqi, Niu, Xufeng, Cheng, Yingmei, Wu, Wei, Huffer, Fred, Department of Statistics, Florida State University
Statisticians often encounter data in the form of a combination of discrete and continuous outcomes. A special case is zero-inflated longitudinal data, where the response variable has a large proportion of zeros. These data exhibit correlation because observations are obtained on the same subjects over time. In this dissertation, we propose a two-part mixed distribution model for zero-inflated longitudinal data. The first part of the model is a logistic regression model for the probability of a nonzero response; the other part is a linear model for the mean response given that the outcome is not zero. Random effects with AR(1) covariance structure are introduced into both parts of the model to allow for serial correlation and subject-specific effects. Estimating the two-part model is challenging because of the high-dimensional integration necessary to obtain the maximum likelihood estimates. We propose a Monte Carlo EM algorithm for computing the maximum likelihood estimates of the parameters. Through a simulation study, we demonstrate the good performance of the MCEM method in parameter and standard error estimation. To illustrate, we apply the two-part model with correlated random effects and the model with autoregressive random effects to executive compensation data to investigate potential determinants of CEO stock option grants.
 2011
 FSU_migr_etd1710
 Monoclonal Antibody-Based Sandwich Enzyme-Linked Immunosorbent Assay for the Detection of Mammalian Meat in Meat and Feed Products.
Rao, Qinchun, Hsieh, YunHwa Peggy, Huffer, Fred W., Sathe, Shridhar K., Department of Nutrition, Food, and Exercise Science, Florida State University
Detection of mammalian tissue in non-mammalian meat or feed products is important for enforcement of food-labeling laws and prevention of the spread of transmissible spongiform encephalopathies (TSEs). This study was conducted to develop a monoclonal antibody-based sandwich enzyme-linked immunosorbent assay (ELISA) for rapid detection of raw, cooked (100°C, 30 min) and autoclaved (121°C/1.2 bar, 30 min) mammalian meats (beef, deer, elk, horse, lamb and pork) adulterated in non-mammalian meat (chicken, duck and turkey) and soy-based feed products, and to assess the performance of the assay. This assay utilized a pair of MAbs against a thermal-stable skeletal muscle protein, troponin I (sTnI). MAb 6G1, specific to mammalian and poultry sTnIs, was used as the capture antibody and horseradish peroxidase (HRP) conjugated MAb 8F10, specific to mammalian sTnI, was used as the detection antibody. The assay conditions that were optimized include: the dilutions of the capture antibody and the detection antibody, the selection of the antibody buffer, the incubation time for antigen-antibody binding, and the dilutions of the adulterated meat and feed samples. The optimized assay achieved a detection limit of 0.05% (w/w) for raw, 0.50% (w/w) for cooked and 1.00% (w/w) for autoclaved beef in turkey (P ≤ 0.05); 0.50% (w/w) for pork in chicken mixtures (raw, cooked and autoclaved) (P ≤ 0.05); and 0.50% (w/w) for bovine meat meal in soy-based feed mixtures (P ≤ 0.05). The fat content (0-30%, w/w) of the meat samples did not significantly affect the assay signals (P ≥ 0.05). As the temperature and time of the heat treatment of the meat samples increased, the reactivity of this assay decreased slightly. However, the assay was still adequate to analyze samples subjected to the most severe heat treatment (132°C/2.0 bar, 120 min).
This MAb-based sandwich ELISA is the first assay suitable for rapid, sensitive and reliable detection of undeclared mammalian proteins in meat and feed products, regardless of the extent of heat processing.
 2004
 FSU_migr_etd2122
 A Statistical Approach for Information Extraction of Biological Relationships.
Bell, Lindsey R., Zhang, Jinfeng, Niu, Xufeng, Tyson, Gary, Huffer, Fred, Department of Statistics, Florida State University
Vast amounts of biomedical information are stored in scientific literature, easily accessed through publicly available databases. Relationships among biomedical terms constitute a major part of our biological knowledge. Acquiring such structured information from unstructured literature can be done through human annotation, but this is time- and resource-consuming. As this content continues to grow rapidly, the popularity and importance of text mining for obtaining information from unstructured text become increasingly evident. Text mining has four major components. First, relevant articles are identified through information retrieval (IR); next, important concepts and terms are flagged using entity recognition (ER); and then relationships between these entities are extracted from the literature in a process called information extraction (IE). Finally, text mining takes these elements and seeks to synthesize new information from the literature. Our goal is information extraction from unstructured literature concerning biological entities. To do this, we use the structure of triplets, where each triplet contains two biological entities and one interaction word. The biological entities may include terms such as protein names, disease names, genes, and small molecules. Interaction words describe the relationship between the biological terms. Under this framework we aim to combine the strengths of three classifiers in an ensemble approach. The three classifiers we consider are Bayesian Networks, Support Vector Machines, and a mixture of logistic models defined by interaction word. The three classifiers and the ensemble approach are evaluated on three benchmark corpora and one corpus that is introduced in this study. The evaluation includes cross-validation and cross-corpus validation to replicate an application scenario. The three classifiers are unique and we find that the performance of individual classifiers varies depending on the corpus.
Therefore, an ensemble of classifiers removes the need to choose one classifier and provides optimal performance.
 2011
 FSU_migr_etd1314
 Variable Selection of Correlated Predictors in Logistic Regression: Investigating the Diet-Heart Hypothesis.
Thompson, Warren R. (Warren Robert), McGee, Daniel, Eberstein, Isaac, Huffer, Fred, Sinha, Debajyoti, She, Yiyuan, Department of Statistics, Florida State University
Show moreVariable selection is an important aspect of modeling. Its aim is to distinguish between the authentic variables which are important in predicting outcome, and the noise variables which possess little to no predictive value. In other words, the goal is to find the variables that (collectively) best explains and predicts changes in the outcome variable. The variable selection problem is exacerbated when correlated variables are included in the covariate set. This dissertation examines the variable selection problem in the context of logistic regression. Specifically, we investigated the merits of the bootstrap, ridge regression, the lasso and Bayesian model averaging (BMA) as variable selection techniques when highly correlated predictors and a dichotomous outcome are considered. This dissertation also contributes to the literature on the dietheart hypothesis. The dietheart hypothesis has been around since the early twentieth century. Since then, researchers have attempted to isolate the nutrients in diet that promote coronary heart disease (CHD). After a century of research, there is still no consensus. In our current research, we used some of the more recent statistical methodologies (mentioned above) to investigate the effect of twenty dietary variables on the incidence of coronary heart disease. Logistic regression models were generated for the data from the Honolulu Heart Program  a study of CHD incidence in men of Japanese descent. Our results were largely methodspecific. However, regardless of method considered, there was strong evidence to suggest that alcohol consumption has a strong protective effect on the risk of coronary heart disease. Of the variables considered, dietary cholesterol and caffeine were the only variables that, at best, exhibited a moderately strong harmful association with CHD incidence. Further investigation that includes a broader array of food groups is recommended.
 2009
 FSU_migr_etd1360
 Bayesian Generalized Polychotomous Response Models and Applications.
Yang, Fang, Niu, XuFeng, Johnson, Suzanne B., McGee, Dan, Huffer, Fred, Department of Statistics, Florida State University
Polychotomous quantal response models are widely used in medical and econometric studies to analyze categorical or ordinal data. In this study, we apply the Bayesian methodology through a mixed-effects polychotomous quantal response model. For the Bayesian polychotomous quantal response model, we assume uniform improper priors for the regression coefficients and explore the sufficient conditions for a proper joint posterior distribution of the parameters in the models. Simulation results from Gibbs sampling estimates will be compared to traditional maximum likelihood estimates to show the strength of using uniform improper priors for the regression coefficients. Motivated by investigating the relationship between BMI categories and several risk factors, we carry out application studies to examine the impact of risk factors on BMI categories, especially the categories "Overweight" and "Obesities". By applying the mixed-effects Bayesian polychotomous response model with uniform improper priors, we would obtain interpretations of the association between risk factors and BMI similar to findings in the literature.
 2010
 FSU_migr_etd1092
 Nonparametric Estimation of Three Dimensional Projective Shapes with Applications in Medical Imaging and in Pattern Recognition.
Crane, Michael, Patrangenaru, Victor, Liu, Xiuwen, Huffer, Fred W., Sinha, Debajyoti, Department of Statistics, Florida State University
This dissertation is on the analysis of invariants of a 3D configuration from its 2D images in pictures of this configuration, without requiring any restriction on the camera positioning relative to the scene pictured. We briefly review some of the main results found in the literature. The methodology used is nonparametric and manifold-based, combined with standard computer vision reconstruction techniques. More specifically, we use asymptotic results for the extrinsic sample mean and the extrinsic sample covariance to construct bootstrap confidence regions for mean projective shapes of 3D configurations. Chapters 4, 5 and 6 contain new results. In chapter 4, we develop tests for coplanarity. Chapter 5 is on the reconstruction of 3D polyhedral scenes, including texture, from arbitrary partial views. In chapter 6, we develop a nonparametric methodology for estimating the mean change for matched samples on a Lie group. We then notice that for k ≥ 4, a manifold of projective shapes of k-ads in general position in 3D has the structure of a (3k − 15)-dimensional Lie group (P-Quaternions) that is equivariantly embedded in a Euclidean space; therefore, testing for mean 3D projective shape change amounts to a one-sample test for extrinsic mean P-Quaternion objects. The Lie group technique leads to a large-sample and nonparametric bootstrap test for a one-population extrinsic mean on a projective shape space, as recently developed by Patrangenaru, Liu and Sughatadasa. On the other hand, in the absence of occlusions, the 3D projective shape of a spatial configuration can be recovered from a stereo pair of images, thus allowing us to test for mean glaucomatous 3D projective shape change from standard stereo pairs of eye images.
 2010
 FSU_migr_etd4607
 A Probabilistic and Graphical Analysis of Evidence in O.J. Simpson's Murder Case Using Bayesian Networks.
Olumide, Kunle, Huffer, Fred, Shute, Valerie, Sinha, Debajyoti, Niu, Xufeng, Logan, Wayne, Department of Statistics, Florida State University
This research work is an attempt to illustrate the versatility and wide applications of the field of statistical science. Specifically, the research work involves the application of statistics in the field of law. The application will focus on the subfields of Evidence and Criminal law using one of the most celebrated cases in the history of American jurisprudence: the 1994 O.J. Simpson murder case in California. Our task here is to do a probabilistic and graphical analysis of the body of evidence in this case using Bayesian Networks. We will begin the analysis by first constructing our main hypothesis regarding the guilt or non-guilt of the accused; this main hypothesis will be supplemented by a series of ancillary hypotheses. Using graphs and probability concepts, we will evaluate the probative force or strength of the evidence and how well the body of evidence at hand proves our main hypothesis. We will employ Bayes' rule, likelihoods and likelihood ratios to carry out such an evaluation. Some sensitivity analyses will be carried out by varying the degree of our prior beliefs or probabilities, and evaluating the effect of such variations on the likelihood ratios regarding our main hypothesis.
 2010
 FSU_migr_etd2287
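The evaluation described in the abstract above (Bayes' rule, likelihood ratios, and sensitivity to the prior) can be sketched in a few lines. This is only a minimal illustration, not the dissertation's Bayesian network: it assumes the items of evidence are conditionally independent given the hypothesis, and the function name, the priors, and the likelihood ratios are all hypothetical values chosen for illustration, not figures from the case.

```python
def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability with a list of likelihood ratios.

    Uses the odds form of Bayes' rule:
        posterior odds = prior odds * LR_1 * LR_2 * ...
    Assumption (not from the source): the evidence items are
    conditionally independent given the hypothesis.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios for three pieces of evidence.
hypothetical_lrs = [3.0, 0.5, 10.0]

# Simple sensitivity analysis: vary the prior and watch the posterior.
for prior in (0.1, 0.3, 0.5, 0.7):
    post = posterior_probability(prior, hypothetical_lrs)
    print(f"prior={prior:.1f} -> posterior={post:.3f}")
```

A full Bayesian network would replace the independence assumption with explicit conditional probability tables linking the main and ancillary hypotheses.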
 Sparse Factor AutoRegression for Forecasting Macroeconomic Time Series with Very Many Predictors.
Galvis, Oliver Kurt, She, Yiyuan, Okten, Giray, Beaumont, Paul, Huffer, Fred, Tao, Minjing, Department of Statistics, Florida State University
Forecasting a univariate target time series in high dimensions with very many predictors poses challenges in statistical learning and modeling. First, many nuisance time series exist and need to be removed. Second, according to economic theory, a macroeconomic target series is typically driven by a few latent factors constructed from some macroeconomic indices. Consequently, a high-dimensional problem arises in which deleting junk time series and constructing predictive factors simultaneously are meaningful and advantageous for the accuracy of the forecasting task. In macroeconomics, multiple categories are available, with the target series belonging to one of them. With all series available, we advocate constructing category-level factors to enhance the performance of the forecasting task. We introduce a novel methodology, the Sparse Factor Auto-Regression (SFAR) methodology, to construct predictive factors from a reduced set of relevant time series. SFAR attains dimension reduction via joint variable selection and rank reduction in high-dimensional time series data. A multivariate setting is used to achieve simultaneous low rank and cardinality control on the matrix of coefficients, where the $\ell_{0}$-constraint regulates the number of useful series and the rank constraint sets the upper bound on the number of constructed factors. The doubly-constrained matrix problem is a nonconvex mathematical problem, optimized via an efficient iterative algorithm with a theoretical guarantee of convergence. SFAR fits factors using a sparse low-rank matrix in response to a target category series. Forecasting is then performed using lagged observations and shrinkage methods. We generate finite-sample data to verify our theoretical findings via a comparative study of the SFAR. We also analyze real-world macroeconomic time series data to demonstrate the usage of the SFAR in practice.
 2014
 FSU_migr_etd8990
 Adaptive Series Estimators for Copula Densities.
Gui, Wenhao, Wegkamp, Marten, Van Engelen, Robert A., Niu, Xufeng, Huffer, Fred, Department of Statistics, Florida State University
In this thesis, based on an orthonormal series expansion, we propose a new nonparametric method to estimate copula density functions. Since the basis coefficients turn out to be expectations, empirical averages are used to estimate these coefficients. We propose estimators of the variance of the estimated basis coefficients and establish their consistency. We derive the asymptotic distribution of the estimated coefficients under mild conditions. We derive a simple oracle inequality for the copula density estimator based on a finite series using the estimated coefficients. We propose a stopping rule for selecting the number of coefficients used in the series and we prove that this rule minimizes the mean integrated squared error. In addition, we consider hard and soft thresholding techniques for sparse representations. We obtain oracle inequalities that hold with prescribed probability for various norms of the difference between the copula density and our threshold series density estimator. Uniform confidence bands are derived as well. The oracle inequalities clearly reveal that our estimator adapts to the unknown degree of sparsity of the series representation of the copula density. A simulation study indicates that our method is extremely easy to implement and works very well, and it compares favorably to the popular kernel-based copula density estimator, especially around the boundary points, in terms of mean squared error. Finally, we have applied our method to an insurance dataset. After comparing our method with the previous data analyses, we reach the same conclusion as the parametric methods in the literature, and as such we provide additional justification for the use of the developed parametric model.
 2009
 FSU_migr_etd3929
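The hard and soft thresholding rules mentioned in the abstract above are standard operations on estimated series coefficients and can be sketched generically. The function names, coefficient values, and threshold below are illustrative assumptions; the thesis's actual choice of basis and threshold level is not given in this record.

```python
def hard_threshold(coeffs, t):
    """Keep a coefficient unchanged if |c| > t; otherwise set it to 0."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Shrink every coefficient toward 0 by t; zero those with |c| <= t."""
    shrunk = []
    for c in coeffs:
        magnitude = abs(c) - t
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if c > 0 else -magnitude)
    return shrunk


# Hypothetical estimated basis coefficients and threshold.
estimated = [1.0, -0.125, 0.5, -0.75]
threshold = 0.25

print(hard_threshold(estimated, threshold))
print(soft_threshold(estimated, threshold))
```

Both rules produce a sparse coefficient vector; hard thresholding leaves the surviving coefficients untouched, while soft thresholding also shrinks them by the threshold.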
 Estimating the Probability of Cardiovascular Disease: A Comparison of Methods.
Fan, Li, McGee, Daniel, Hurt, Myra, Niu, XuFeng, Huffer, Fred, Department of Statistics, Florida State University
Risk prediction plays an important role in clinical medicine. It not only helps in educating patients to improve their lifestyle and in targeting individuals at high risk, but also guides treatment decisions. So far, various instruments have been used for risk assessment in different countries, and the risk predictions based on these different models are not consistent. For public use, a reliable risk prediction is necessary. This thesis discusses the models that have been developed for risk assessment and evaluates the performance of prediction at two levels: the overall level and the individual level. At the overall level, cross-validation and simulation are used to assess the risk prediction, while at the individual level, the "Parametric Bootstrap" and the delta method are used to evaluate the uncertainty of the individual risk prediction. Further exploration of the reasons for the different performance among the models is ongoing.
 2009
 FSU_migr_etd4508
 The Estimation and Specification Search of Structural Equation Modeling Using Frequentist and Bayesian Methods.
Liang, Xinya, Yang, Yanyun, Huffer, Fred, Becker, Betsy Jane, Paek, Insu, Department of Educational Psychology and Learning Systems, Florida State University
Structural equation modeling (SEM) refers to statistical analyses of the relationships among observed and latent variables based on hypothesized models. In reality, proposed models are rarely perfect, and a specification search is conducted to correct specification errors between the proposed and population models. Both frequentist and Bayesian methods have strengths and limitations in the estimation and specification search of SEM models. Estimation problems that arise from the violation of distributional and/or structural assumptions have not been thoroughly studied. The performance of specification search methods based on different theoretical frameworks has rarely been compared. The two purposes of this study were: (1) to investigate robust maximum likelihood (RML) and three Bayesian methods for estimating confirmatory factor analysis models under imperfect conditions, and (2) to compare the modification index (MI) and Bayesian structural equation modeling (BSEM) in the search for cross-loadings in factor analysis models. Two Monte Carlo studies were designed for model estimation (Study 1) and specification search (Study 2), respectively. Both studies replicated 2000 datasets for each condition. Design factors included sample size, factor structure, loading size, and item distribution. Study 1 analyzed both correctly specified and misspecified models. Results were evaluated based on model fit, parameter estimates, and standard errors. Study 2 searched for 1, 2, and 4 omitted cross-loadings in data generation models. The evaluation of results focused on the success of specification search and model evaluation. Results showed that the frequentist chi-square test was more powerful than the Bayesian posterior predictive p-value test. Bayesian methods specified with appropriate priors provided accurate parameter estimates similar to RML even under moderate violation of SEM assumptions. In practice, however, the selection of Bayesian priors for hypothesized models needs to be exceptionally cautious, because the priors are likely to interact with sample size, data distribution, and degree of model misspecification. In specification search, MI generally provided higher model recovery rates than BSEM under the designed conditions. BSEM led to considerable false positive solutions as sample size increased if informative priors were not properly selected. However, MI is not always preferable. The study recommended that practical selections of Bayesian priors may be based on 95% parameter coverage. Future research will investigate the sensitivity of various Bayesian priors in specification search.
 2014
 FSU_migr_etd9031
 Quasi-3D Statistical Inversion of Oceanographic Tracer Data.
Herbei, Radu, Speer, Kevin, Wegkamp, Marten, Laurent, Louis St., Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
We perform a quasi-3D Bayesian inversion of oceanographic tracer data from the South Atlantic Ocean. Initially we consider one active neutral density layer with an upper and lower boundary. The available hydrographic data is linked to model parameters (water velocities, diffusion coefficients) via a 3D advection-diffusion equation. A robust solution to the inverse problem considered can be attained by introducing prior information about parameters and modeling the observation error. This approach estimates both horizontal and vertical flow as well as diffusion coefficients. We find a system of alternating zonal jets at the depths of the North Atlantic Deep Water, consistent with direct measurements of flow and concentration maps. A uniqueness analysis of our model is performed in terms of the oxygen consumption rate. The vertical mixing coefficient bears some relation to the bottom topography even though we do not incorporate that into our model. We extend the method to a multi-layer model, using thermal wind relations weakly in a local fashion (as opposed to integrating the entire water column) to connect layers vertically. Results suggest that the estimated deep zonal jets extend vertically, with a clear depth-dependent structure. The vertical structure of the flow field is modified by the tracer fields over that set a priori by thermal wind. Our estimates are consistent with observed flow at the depths of the Antarctic Intermediate Water; at still shallower depths, above the layers considered here, the subtropical gyre is a significant feature of the horizontal flow.
 2006
 FSU_migr_etd4101
 Mixture Item Response Theory-MIMIC Model: Simultaneous Estimation of Differential Item Functioning for Manifest Groups and Latent Classes.
Bilir, Mustafa Kuzey, Kamata, Akihito, Huffer, Fred, Becker, Betsy J., Yang, Yanyun, Department of Educational Psychology and Learning Systems, Florida State University
This study uses a new psychometric model (the mixture item response theory-MIMIC model) that simultaneously estimates differential item functioning (DIF) across manifest groups and latent classes. Current DIF detection methods investigate DIF either across manifest groups (e.g., gender, ethnicity, etc.), or across latent classes (e.g., solution strategies, speededness, etc.). Alternatively, one of these aspects is considered as the real source of DIF and the other aspect is considered as a proxy for the same source. This can only be true when manifest and latent classifications provide perfect or very high overlap. A combination of a Rasch-type model for manifest group-DIF (GDIF) detection and a mixture Rasch model for latent class-DIF (CDIF) detection is applied as the mixture IRT-MIMIC model (MixIRT-MIMIC). A Markov chain Monte Carlo method called the Gibbs sampler is applied for Bayesian estimation of parameters for the MixIRT-MIMIC model as well as the Rasch model and the mixture Rasch model. This study shows that in detection of DIF, when the group-class overlap is between 50% and 70%, manifest group approaches and latent class approaches can provide biased DIF and item difficulty estimates for some test items that show GDIF and CDIF simultaneously. However, for the same conditions MixIRT-MIMIC provides unbiased estimates for latent class-DIF (CDIF) and item difficulty parameters, while the confounding is reflected as bias in GDIF parameter estimates. The main factors of importance are the group-class overlap and the overlap between DIF items. MixIRT-MIMIC contributes by: (1) estimating the unbiased magnitudes of GDIF and CDIF, (2) providing unbiased estimates of item difficulties when other approaches have biased estimates, (3) determining the overlap ratio (confounding) between groups and classes, which is unknown a priori, and (4) identifying the true source(s) of DIF. Researchers, test developers, and state testing programs that are interested in detecting true sources of differences (e.g., cognitive, gender, ethnic) across individuals are potential users of MixIRT-MIMIC. It is important to note that this study is an initial step to detect both types of DIF simultaneously, and is limited to binary data and a special case of 2 groups by 2 classes, which can be applied to most DIF detection purposes. Its performance and extensions will be investigated for other possible situations.
 2009
 FSU_migr_etd3761
 Calibration of Multivariate Generalized Hyperbolic Distributions Using the EM Algorithm, with Applications in Risk Management, Portfolio Optimization and Portfolio Credit Risk.
Hu, Wenbo, Kercheval, Alec, Huffer, Fred, Case, Bettye, Nichols, Warren, Nolder, Craig, Department of Mathematics, Florida State University
The distributions of many financial quantities are well-known to have heavy tails, exhibit skewness, and have other non-Gaussian characteristics. In this dissertation we study an especially promising family: the multivariate generalized hyperbolic distributions (GH). This family includes and generalizes the familiar Gaussian and Student t distributions, and the so-called skewed t distributions, among many others. The primary obstacle to the applications of such distributions is the numerical difficulty of calibrating the distributional parameters to the data. In this dissertation we describe a way to stably calibrate GH distributions for a wider range of parameters than has previously been reported. In particular, we develop a version of the EM algorithm for calibrating GH distributions. This is a modification of methods proposed in McNeil, Frey, and Embrechts (2005), and generalizes the algorithm of Protassov (2004). Our algorithm extends the stability of the calibration procedure to a wide range of parameters, now including parameter values that maximize log-likelihood for our real market data sets. This allows, for the first time, certain GH distributions to be used in modeling contexts where previously they have been numerically intractable. Our algorithm enables us to make new uses of GH distributions in three financial applications. First, we forecast univariate Value-at-Risk (VaR) for stock index returns, and we show in out-of-sample backtesting that the GH distributions outperform the Gaussian distribution. Second, we calculate an efficient frontier for equity portfolio optimization under the skewed t distribution, using Expected Shortfall as the risk measure. Here, we show that the Gaussian efficient frontier is actually unreachable if returns are skewed t distributed. Third, we build an intensity-based model to price Basket Credit Default Swaps by calibrating the skewed t distribution directly, without the need to separately calibrate the skewed t copula. To our knowledge this is the first use of the skewed t distribution in portfolio optimization and in portfolio credit risk.
 2005
 FSU_migr_etd3694
 Functional Component Analysis and Regression Using Elastic Methods.
Tucker, J. Derek, Srivastava, Anuj, Wu, Wei, Klassen, Eric, Huffer, Fred, Department of Statistics, Florida State University
Constructing generative models for functional observations is an important task in statistical function analysis. In general, functional data contains both phase (or x or horizontal) and amplitude (or y or vertical) variability. Traditional methods often ignore the phase variability and focus solely on the amplitude variation, using cross-sectional techniques such as functional principal component analysis for dimension reduction and regression for data modeling. Ignoring phase variability leads to a loss of structure in the data and inefficiency in data models. Moreover, most methods use a "pre-processing" alignment step to remove the phase variability, without considering a more natural joint solution. This dissertation presents three approaches to this problem. The first relies on separating the phase (x-axis) and amplitude (y-axis), then modeling these components using joint distributions. This separation, in turn, is performed using a technique called elastic alignment of functions that involves a new mathematical representation of functional data. Then, using individual principal components, one each for the phase and amplitude components, it imposes joint probability models on the principal coefficients of these components while respecting the nonlinear geometry of the phase representation space. The second combines the phase variability into the objective function for two component analysis methods, functional principal component analysis and functional principal least squares. This creates a more complete solution, as the phase variability is removed while simultaneously extracting the components. The third approach combines the phase variability into the functional linear regression model and then extends the model to logistic and multinomial logistic regression. By incorporating the phase variability, a more parsimonious regression model is obtained, and therefore more accurate prediction of observations is achieved. These models are then easily extended from functional data to curves (which are essentially functions in R^2) to perform regression with curves as predictors. These ideas are demonstrated using random sampling for models estimated from simulated and real datasets, and show their superiority over models that ignore phase-amplitude separation. Furthermore, the models are applied to classification of functional data and achieve high performance in applications involving SONAR signals of underwater objects, handwritten signatures, periodic body movements recorded by smart phones, and physiological data.
 2014
 FSU_migr_etd9106
 Statistical Modelling and Applications of Neural Spike Trains.
Lawhern, Vernon, Wu, Wei, Contreras, Robert J., Srivastava, Anuj, Huffer, Fred, Niu, Xufeng, Department of Statistics, Florida State University
In this thesis we investigate statistical modelling of neural activity in the brain. We first develop a framework which is an extension of the state-space Generalized Linear Model (GLM) by Eden and colleagues [20] to include the effects of hidden states. These states, collectively, represent variables which are not observed (or even observable) in the modeling process but nonetheless can have an impact on the neural activity. We then develop a framework that allows us to input a priori target information into the model. We examine both of these modelling frameworks on motor cortex data recorded from monkeys performing different target-driven hand and arm movement tasks. Finally, we perform temporal coding analysis of sensory stimulation using principled statistical models and show the efficacy of our approach.
 2011
 FSU_migr_etd3251
 Bayesian Portfolio Optimization with Time-Varying Factor Models.
Zhao, Feng, Niu, Xufeng, Cheng, Yingmei, Huffer, Fred W., Zhang, Jinfeng, Department of Statistics, Florida State University
We develop a modeling framework to simultaneously evaluate various types of predictability in stock returns, including stocks' sensitivity ("betas") to systematic risk factors, stocks' abnormal returns unexplained by risk factors ("alphas"), and returns of risk factors in excess of the risk-free rate ("risk premia"). Both firm-level characteristics and macroeconomic variables are used to predict stocks' time-varying alphas and betas, and macroeconomic variables are used to predict the risk premia. All of the models are specified in a Bayesian framework to account for estimation risk, and informative prior distributions on both stock returns and model parameters are adopted to reduce estimation error. To gauge the economic significance of the predictability, we apply the models to the U.S. stock market and construct optimal portfolios based on model predictions. Out-of-sample performance of the portfolios is evaluated to compare the models. The empirical results confirm predictability from all of the sources considered in our model: (1) the equity risk premium is time-varying and predictable using macroeconomic variables; (2) stocks' alphas and betas differ cross-sectionally and are predictable using firm-level characteristics; and (3) stocks' alphas and betas are also time-varying and predictable using macroeconomic variables. Comparison of different subperiods shows that the predictability of stocks' betas is persistent over time, but the predictability of stocks' alphas and the risk premium has diminished to some extent. The empirical results also suggest that Bayesian statistical techniques, especially the use of informative prior distributions, help reduce model estimation error and result in portfolios that outperform the passive indexing strategy. The findings are robust in the presence of transaction costs.
 2011
 FSU_migr_etd0526
 A Bayesian Approach to Meta-Regression: The Relationship Between Body Mass Index and All-Cause Mortality.
Marker, Mahtab, McGee, Dan, Hurt, Myra, Niu, Xiufeng, Huffer, Fred, Department of Statistics, Florida State University
This thesis presents a Bayesian approach to Meta-Regression and Individual Patient Data (IPD) Meta-analysis. The focus of the research is on establishing the relationship between Body Mass Index (BMI) and all-cause mortality. This has been an area of continuing interest in the medical and public health communities, and no consensus has been reached on what the optimal weight for individuals is. Standards are usually specified in terms of body mass index (BMI = weight (kg) / height (m)^2), which is associated with body fat percentage. Many studies in the literature have modelled the relationship between BMI and mortality and reported a variety of relationships, including U-shaped, J-shaped and linear curves. The aim of my research was to use statistical methods to determine whether we can combine these diverse results and obtain a single estimated relationship, from which one can find the point of minimum mortality and establish reasonable ranges for optimal BMI, or how we can best examine the reasons for the heterogeneity of results. Commonly used techniques of Meta-analysis and Meta-regression are explored, and a problem with the estimation procedure in the multivariate setting is presented. A Bayesian approach using a Hierarchical Generalized Linear Mixed Model is suggested and implemented to overcome this drawback of standard estimation techniques. Another area that is explored briefly is Individual Patient Data meta-analysis. A Frailty model, or Random Effects Proportional Hazards Survival model, approach is proposed to carry out IPD meta-regression and arrive at a single estimated relationship between BMI and mortality, adjusting for the variation between studies.
 2007
 FSU_migr_etd2736
 Standardized Regression Coefficients as Indices of Effect Sizes in Meta-Analysis.
Kim, Rae Seon, Becker, Betsy Jane, Huffer, Fred, Yang, Yanyun, Paek, Insu, Department of Educational Psychology and Learning Systems, Florida State University
When conducting a meta-analysis, it is common to find many collected studies that report regression analyses, because multiple regression analysis is widely used in many fields. Meta-analysis uses effect sizes drawn from individual studies as a means of synthesizing a collection of results. However, indices of effect size from regression analyses have not been studied extensively. Standardized regression coefficients from multiple regression analysis are scale-free estimates of the effect of a predictor on a single outcome. Thus these coefficients can be used as effect-size indices for combining studies of the effect of a focal predictor on a target outcome. I begin with a discussion of the statistical properties of standardized regression coefficients when used as measures of effect size in meta-analysis. The main purpose of this dissertation is the presentation of methods for obtaining standardized regression coefficients and their standard errors from reported regression results. An example of this method is demonstrated using selected studies from a published meta-analysis on teacher verbal ability and school outcomes (Aloe & Becker, 2009). Last, a simulation is conducted to examine the effect of multicollinearity (intercorrelation among predictors), as well as the number of predictors, on the distributions of the estimated standardized regression slopes and their variance estimates. This is followed by an examination of the empirical distribution of estimated standardized regression slopes and their variances from simulated data for different conditions. The simulation study shows that the estimated standardized regression slopes have larger variance and move closer to zero when predictors are highly correlated.
 2011
 FSU_migr_etd3109
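As a rough illustration of the effect-size index described in the record above, the sketch below rescales a reported unstandardized slope to the standardized scale using the textbook relation beta = b * (s_x / s_y). It is not the dissertation's own procedure: the function name is hypothetical, and rescaling the standard error by the same factor assumes the sample standard deviations are treated as fixed constants, which is only an approximation.

```python
def standardize_slope(b, se_b, sd_x, sd_y):
    """Rescale an unstandardized slope and its standard error.

    Assumption (illustrative only): sd_x and sd_y are treated as known
    constants, so the standard error is multiplied by the same factor
    as the slope.
    """
    if sd_x <= 0 or sd_y <= 0:
        raise ValueError("standard deviations must be positive")
    factor = sd_x / sd_y
    return b * factor, se_b * factor


# Hypothetical reported values: slope 2.0 (SE 0.5), SD of predictor 3.0,
# SD of outcome 12.0.
beta_std, se_std = standardize_slope(2.0, 0.5, 3.0, 12.0)
```

With these made-up inputs the scaling factor is 3.0 / 12.0 = 0.25, so the slope and its standard error are each multiplied by 0.25.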
 Nonparametric Estimation of Three Dimensional Projective Shapes with Applications in Medical Imaging and in Pattern Recognition.
Crane, Michael, Patrangenaru, Victor, Liu, Xiuwen, Huffer, Fred W., Sinha, Debajyoti, Department of Statistics, Florida State University
This dissertation is on the analysis of invariants of a 3D configuration from its 2D images in pictures of this configuration, without requiring any restriction on the camera positioning relative to the scene pictured. We briefly review some of the main results found in the literature. The methodology used is nonparametric and manifold-based, combined with standard computer vision reconstruction techniques. More specifically, we use asymptotic results for the extrinsic sample mean and the extrinsic sample covariance to construct bootstrap confidence regions for mean projective shapes of 3D configurations. Chapters 4, 5 and 6 contain new results. In chapter 4, we develop tests for coplanarity. Chapter 5 is on the reconstruction of 3D polyhedral scenes, including texture, from arbitrary partial views. In chapter 6, we develop a nonparametric methodology for estimating the mean change for matched samples on a Lie group. We then notice that for k > 4, a manifold of projective shapes of k-ads in general position in 3D has the structure of a (3k − 15)-dimensional Lie group (P-Quaternions) that is equivariantly embedded in a Euclidean space; therefore testing for mean 3D projective shape change amounts to a one-sample test for extrinsic mean P-Quaternion objects. The Lie group technique leads to a large-sample and nonparametric bootstrap test for one-population extrinsic mean on a projective shape space, as recently developed by Patrangenaru, Liu and Sughatadasa [1]. On the other hand, in the absence of occlusions, the 3D projective shape of a spatial configuration can be recovered from a stereo pair of images, which allows testing for mean glaucomatous 3D projective shape change from standard stereo pairs of eye images.
 2010
 FSU_migr_etd7118
 A Bayesian MRF Framework for Labeling Terrain Using Hyperspectral Imaging.
Neher, Robert E., Srivastava, Anuj, Liu, Xiuwen, Huffer, Fred, Wegkamp, Marten, Department of Statistics, Florida State University
We explore the non-Gaussianity of hyperspectral data and present probability models that capture variability of hyperspectral images. In particular, we present a nonparametric probability distribution that models the distribution of the hyperspectral data after reducing the dimension of the data via either principal components or Fisher's discriminant analysis. We also explore the directional differences in observed images and present two parametric distributions, the generalized Laplacian and the Bessel K form, that model the non-Gaussian behavior of the directional differences well. We then propose a model that labels each spatial site using Bayesian inference and Markov random fields, and that incorporates the nonparametric distribution of the data and the parametric distributions of the directional differences, along with a prior distribution that favors smooth labeling. We then test our model on actual hyperspectral data and present the results, using the Washington D.C. Mall and Indian Springs rural area data sets.
 2004
 FSU_migr_etd2691
 Inference for Semiparametric Time-Varying Covariate Effect Relative Risk Regression Models.
Ye, Gang, McKeague, Ian W., Wang, Xiaoming, Huffer, Fred W., Song, Kai-Sheng, Department of Statistics, Florida State University
A major interest of survival analysis is to assess covariate effects on survival via appropriate conditional hazard function regression models. The Cox proportional hazards model, which assumes an exponential form for the relative risk, has been a popular choice. However, other regression forms such as Aalen's additive risk model may be more appropriate in some applications. In addition, covariate effects may depend on time, which cannot be reflected by a Cox proportional hazards model. In this dissertation, we study a class of time-varying covariate effect regression models in which the link function (relative risk function) is twice continuously differentiable and prespecified, but otherwise general. This is a natural extension of the Prentice-Self model, in which the link function is general but covariate effects are modelled to be time invariant. In the first part of the dissertation, we focus on estimating the cumulative or integrated covariate effects. The standard martingale approach based on counting processes is utilized to derive a likelihood-based iterating equation. An estimator for the cumulative covariate effect that is generated from the iterating equation is shown to be √n-consistent. Asymptotic normality of the estimator is also demonstrated. Another aspect of the dissertation is to investigate a new test for the above time-varying covariate effect regression model and study consistency of the test based on martingale residuals. For Aalen's additive risk model, we introduce a test statistic based on the Huffer-McKeague weighted-least-squares estimator and show its consistency against some alternatives. An alternative way to construct a test statistic based on Bayesian Bootstrap simulation is introduced. An application to real lifetime data will be presented.
 2005
 FSU_migr_etd0949
 Age Effects in the Extinction of Planktonic Foraminifera: A New Look at Van Valen's Red Queen Hypothesis.
Wiltshire, Jelani, Huffer, Fred, Parker, William, Chicken, Eric, Sinha, Debajyoti, Department of Statistics, Florida State University
Van Valen's Red Queen hypothesis states that within a homogeneous taxonomic group the age is statistically independent of the rate of extinction. The case of the Red Queen hypothesis being addressed here is when the homogeneous taxonomic group is a group of similar species. Since Van Valen's work, various statistical approaches have been used to address the relationship between taxon duration (age) and the rate of extinction. Some of the more recent approaches to this problem using Planktonic Foraminifera (Foram) extinction data include Weibull and Exponential modeling (Parker and Arnold, 1997), and Cox proportional hazards modeling (Doran et al. 2004, 2006). I propose a general class of test statistics that can be used to test for the effect of age on extinction. These test statistics allow for a varying background rate of extinction and attempt to remove the effects of other covariates when assessing the effect of age on extinction. No model is assumed for the covariate effects. Instead I control for covariate effects by pairing or grouping together similar species. I use simulated data sets to compare the power of the statistics. In applying the test statistics to the Foram data, I have found age to have a positive effect on extinction.
 2010
 FSU_migr_etd0952
 Essays on the Role of Trade Frictions in International Economics.
Yoshimine, Koichi, Norrbin, Stefan C., Huffer, Fred W., Beaumont, Paul M., Garriga, Carlos, Department of Economics, Florida State University
This dissertation consists of three essays. The first essay examines the effects of tax differentials on the trade balance across countries. Given that intra-firm trade accounts for a sizable share of the world's international trade, it is expected that income-shifting activities of multinational firms can bias the trade balance in many countries. Specifically, an increase in the relative tax liability in one country is expected to decrease the trade balance of that country. Using proxies for the effective tax liability of 19 OECD countries, the cointegrating regressions show significantly negative relationships between tax differentials and the trade balance among relatively small industrial countries. The second essay asks whether the empirically observed home biases in international trade are accounted for by a theoretical model. It has been pointed out that trade among individual Canadian provinces is much larger than the trade between individual Canadian provinces and individual U.S. states. There is a similar tendency in the trade among the OECD member countries. Obstfeld and Rogoff (2000) claim that such a bias can be explained if one takes into account the interaction between transaction costs and the elasticity of substitution. This study tests their claim using a dynamic general equilibrium model where agents pay proportional transaction costs. The simulation results show that the bias levels generated by the plausible values for transaction cost and elasticity are not particularly inconsistent with the observed levels in the US-Canada relationship. The third essay tests a version of the international real business cycle model aimed at examining the effect on exchange-rate volatility of market segmentation generated by a trade friction across countries. Obstfeld and Rogoff (2000) argue that segmentation in the international goods market can explain the empirically observed real exchange-rate volatility.
In this study, a trade cost in the goods market combined with income heterogeneity of consumers endogenously generates market segmentation by preventing a fraction of consumers from participating in international trade. Under such a circumstance, the volatility of the exchange rate actually rises, but the volatility is still below the observed reality, suggesting that trade cost alone cannot explain the anomalous exchange-rate behaviors.
 2004
 FSU_migr_etd0867
 Minimax Tests for Nonparametric Alternatives with Applications to High Frequency Data.
Yu, Han, Song, Kai-Sheng, Quine, Jack, Huffer, Fred, McGee, Dan, Department of Statistics, Florida State University
We present a general methodology for developing asymptotically distribution-free, asymptotically minimax tests. The tests are constructed via a nonparametric density-quantile function and the limiting distribution is derived by a martingale approach. The procedure can be viewed as a novel parametric extension of the classical parametric likelihood ratio test. The proposed tests are shown to be omnibus within an extremely large class of nonparametric global alternatives characterized by simple conditions. Furthermore, we establish that the proposed tests provide better minimax distinguishability. The tests have much greater power for detecting high-frequency nonparametric alternatives than existing classical tests such as the Kolmogorov-Smirnov and Cramér-von Mises tests. The good performance of the proposed tests is demonstrated by Monte Carlo simulations and applications in High Energy Physics.
 2006
 FSU_migr_etd0796
 Flexible Additive Risk Models Using Piecewise Constant Hazard Functions.
Uhm, Daiho, Huffer, Fred W., Kercheval, Alec, McGee, Dan, Niu, Xufeng, Department of Statistics, Florida State University
We study a weighted least squares (WLS) estimator for Aalen's additive risk model which allows for a very flexible handling of covariates. We divide the follow-up period into intervals and assume a constant hazard rate in each interval. The model is motivated as a piecewise approximation of a hazard function composed of three parts: arbitrary nonparametric functions for some covariate effects, smoothly varying functions for others, and known (or constant) functions for yet others. The proposed estimator is an extension of the grouped data version of the Huffer-McKeague estimator (1991). Our estimator may also be regarded as a piecewise constant analog of the semiparametric estimates of McKeague & Sasieni (1994), and Lin & Ying (1994). By using a fairly large number of intervals, we should get an essentially semiparametric model similar to the McKeague-Sasieni and Lin-Ying approaches. For our model, since the number of parameters is finite (although large), conventional approaches (such as maximum likelihood) are easy to formulate and implement. The approach is illustrated by simulations, and is applied to data from the Framingham Heart Study.
 2007
 FSU_migr_etd1464
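To make the piecewise constant idea in the record above concrete, the sketch below evaluates a survival function under an additive risk model whose coefficients are constant within each follow-up interval. It only illustrates the model form; it does not implement the WLS estimator studied in the dissertation, and the function name, cut points, coefficient values and covariate vector are all made up for the example.

```python
import math


def additive_survival(t, cuts, betas, x):
    """S(t | x) = exp(-cumulative hazard) under a piecewise constant
    additive risk model (illustrative sketch).

    cuts  : interval start times, e.g. [0, 1, 3]; the last interval is
            assumed to extend to infinity.
    betas : one coefficient vector per interval; the hazard on interval j
            is sum_k betas[j][k] * x[k].
    x     : covariate vector (x[0] = 1 gives a baseline term).
    """
    cum_hazard = 0.0
    for j, beta_j in enumerate(betas):
        start = cuts[j]
        end = cuts[j + 1] if j + 1 < len(cuts) else math.inf
        if t <= start:
            break
        hazard = sum(b * v for b, v in zip(beta_j, x))
        # Add hazard times the length of time spent in this interval.
        cum_hazard += hazard * (min(t, end) - start)
    return math.exp(-cum_hazard)


# Hypothetical example: three intervals, baseline plus one covariate.
cuts = [0.0, 1.0, 3.0]
betas = [[0.1, 0.0], [0.1, 0.05], [0.2, 0.1]]
x = [1.0, 2.0]
s2 = additive_survival(2.0, cuts, betas, x)
```

Here the interval hazards work out to 0.1, 0.2 and 0.4, so the cumulative hazard at t = 2 is 0.1 * 1 + 0.2 * 1 = 0.3 and `s2` equals exp(-0.3). Because the number of coefficients is finite, such a model can be fitted with conventional methods, as the abstract notes.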
 Sequential Experimentation Schemes for Resolution III, Robust and Mixed-level Designs.
Rios, Armando, Simpson, James R., Huffer, Fred, Pignatiello, Joseph J., Perry, Marcus, Department of Industrial and Manufacturing Engineering, Florida State University
General augmentation techniques such as foldover and semifold have been a common practice in industrial experimentation for many years. Even though these techniques are extremely effective in maintaining balance and orthogonality, they possess serious disadvantages such as the inability to decouple specific terms and a high level of inefficiency. This dissertation aims for a sequential experimentation approach capable of overcoming the drawbacks of the general methods while maintaining some of their benefits. Chapter 3 begins by proposing an algorithm for sequential augmentation of resolution III fractional factorial designs. The proposed algorithm is compared with its competitors, semifold and foldover, using simulated data under 3 noise level conditions. Advantages, limitations, and potential benefits of the new method are provided. Chapter 4 explores new possibilities for augmentation of efficient mixed-level designs (EAs). Current augmentation methods for mixed-level designs include only the optimal foldover plans developed by Guo (2006). Semifold plans for several mixed-level designs are developed by selecting half of the treatment combinations of the foldover fraction using the general balance metric criterion and an exhaustive search approach. Chapter 5 complements this research by providing a methodology for sequential augmentation of mixed resolution robust designs. The work presented here extends the current limits of sequential experimentation for resolution III, mixed-level and robust designs and provides a viable alternative for the experimenter in situations in which financial restrictions do not allow the implementation of a general method.
 2008
 FSU_migr_etd1852
