Authors
^{1} Department of Animal Science, Faculty of Agriculture, Shahid Bahonar University of Kerman, Kerman, Iran
^{2} Department of Animal Science, Faculty of Mathematics and Computer Science, Shahid Bahonar University of Kerman, Kerman, Iran
Abstract
Keywords
INTRODUCTION
Today, genome-wide marker data and whole genome single nucleotide polymorphism (SNP) chips are easy available and widely used for evaluation a wide range of economically important traits in plant and animal breeding programs (Desta and Ortiz, 2015; Do et al. 2015). Animal and plant breeders, are mainly interested in estimating genomic breeding values for these traits with including whole genome wide marker information that call as genomic selection (GS); (Daetwyler et al. 2013; Meuwissen et al. 2016). The genomic selection goal is that maximum capturing variance that can be explained by the markers (Su et al. 2012). To achieve this, any statistical method implemented for the predictions must be able handle the large numbers of markers and evaluate marker effects across the entire genome (Gianola and Rosa, 2015). In hence, the models poses two source of challenges, one is curse of dimensionality and the other is unknown genetic architecture of the quantitative traits (Daetwyler et al. 2010; de Los Campos et al. 2013). Genetic architecture is a description of the structure of the genotype-phenotype relationship that includes the nature of the loci contributing to phenotypic variation (e.g., number of loci and their genomic location) and a description of the alleles at those loci; number of alleles, magnitude of effects, patterns of pleiotropy, additivity, dominance, epistasis, epigenetic effects (Holland, 2007; Tiezzi and Maltecca, 2015; Yang et al. 2007). These methods behave different manner to overcome the difﬁculties of the second challenge and providing robust estimations from simple to complex genetic architecture of quantitative traits (Daetwyler et al. 2010; Fernández et al. 2016). However, the performance of prediction methods can be affected by genetic architecture and in particular patterns of pleiotropy, additivity, dominance and epistasis (Lidan Sun; 2015). The aims need to assess the accuracy and sensitivity of whole genome prediction statistical methods that are linked conceptually by genetic architecture. However, selection of the best genomic prediction model with considering a specific genetic architecture of a complex traits remains one of the main goals for animal breeders and evolutionary geneticists. In other hand, the genetic architecture of most economic traits remains unknown for animal breeders and evolutionary geneticists (Stranger et al. 2011). Then, unravelling the performance of GS methods toward the genomic architecture of a polygenic trait is fundamental to study ability of methods for predictions. The objective of this study was to compare different genomic prediction models includes parametric and non-parametric, and assess their abilities toward a specific trait with only additive architecture. We hypothesize that with further exploration of predictive ability of GS models toward the purely genomic architectures, we will clearly find the true accuracy of genomic prediction models under a simple genetic architecture.
MATERIALS AND METHODS
In order to evaluate the impact of purely additive genetic architectures of complex trait on the accuracy and predictive ability of GS models, we simulated a quantitative trait which influenced by only narrow sense heritability (h^{2}=0.3) and assessed predictive ability of the 14 genomic prediction models.
Population structure
A population was simulated for 2000 historical generations at an effective size of 100 (N_{e}=100). After 55 generations random mating, during the whole process, all individuals were generated with one gamete from a random father and one from a random mother. In each generation, 20 males mated with 400 females, 20 half-sib families. Therefore, the data set for the estimation of the marker effects consisted of the 4800 individuals from the last five generations and used to estimate predictive ability and sensitivity of statistical method. The genome was assumed to consist of 5 chromosomes each 100 cM long and 2000 loci/chromosome (i.e. a total of 10000 SNP plus 300 QTL) were located at random map positions (as shown in Figure 1). Both SNP and QTL were biallelic. Mutations were generated at a rate of 2.5 × 10^{-5} per locus per generation at the marker loci and at the QTL loci. Similar to Meuwissen and Goddard (2001), a standard gamma distribution with shape parameter α= 0.42 and scale parameter β= 2.619 was used to drawn allele substitution effects (α_{j}). The sign of an allele substitution effect was drawn at random with equal chance.
True genomic estimated breeding values
The true breeding value (TBV) for each animal was calculated as the expected genotypic value of a certain QTL genotype that carried by animal i:
Where:
X_{ij}: covariate indicator of the genotype of the j^{th} QTL of the i^{th} individual that has the values 2, 1, 0 for genotypes AA, Aa or aa, respectively.
p_{j} and q_{j}: allelic frequencies (A or a) for the j^{th} marker in the training population.
α: average effect of substitution for the j^{th} marker calculated as:
α_{j}= а_{j} + d_{j}(q_{j}-p_{j}) with d_{j}= 0.
Statistical methods
To predict marker effects and performance of GS models, a five-folds cross validation scheme were used and repeated 20 times per run. We divided the data into training and testing sets and the training sets were used to ﬁt the models, and the testing sets were used to determine the performance of the particular method. The evaluated methods include parametric methods; genomic best linear unbiased prediction (GBLUP), ridge regression BLUP (RR-BLUP), least absolute shrinkage and selection operator (LASSO), elastic net (EN), Bayesian ridge regression (BRR), Bayesian LASSO (BL), Bayes A, Bayes B, Bayes C and non-parametric methods, includes; reproducing kernel hilbert space (RKHS), support vector machine (SVM), relevance vector machine (RVM) and gaussian processors (GP). The statistical software R (R Core Team, 2015) was used to run the parametric and nonparametric methods. With respect to that the evaluations were based on the 20 replicates for each cross validated scenario, the average of the results was reported.
Figure 1 Distribution of randomly SNP coverage across the five simulated chromosomes
For each scenario, the sensitivity and predictive ability of the genomic prediction models was measured by four statistic criteria including: predictive correlation as the person’s correlation between the true phenotypic values and the predicted estimated genomic breeding values (γ_{y,GEBV}), mean square error (MSE), empirical accuracies of genomic predictions as the correlation between GEBVs and the true breeding values (γ_{GEBV,TBV}) and the unbiasedness was assessed by regression of the simulated phenotypes on the GEBVs (b_{y,GEBV}). Significant differences between methods in terms of predictive ability were assessed by means of paired t-tests (α=1%), adjusted by bonferroni correction.
RESULTS AND DISCUSSION
The comparison of cross-validated results for different models allowed estimation of the similarities and dissimilarities of them. Averages and standard errors (SE) were computed for each statistic by considering the results of the 20 replicates available in each situation. Table 1 shows the mean and SE (sampling variabilities) of afore mentioned statistics for the nine parametric (five Bayesian and four frequentist) and five non-parametric implemented models. The predictive correlation of models ranges from 0.676 for LASSO to 0.568 for random forest regression and the range of mean square error and bias for tested models were 23.15-30.34 and 0.93-1.63, respectively. The predictive correlation (γ_{y,GEBV}), MSE and bias of the Bayesian methods were quite similar. The predictive correlation of GBLUP, LASSO and EN models and Bayesian models was higher than RKHS and machine learning models. Comparison of described models for MSE and bias also demonstrated high similarity between Bayesian and frequentist parametric method and a slight their differences with non-parametric and machine learning method. The bias between Bayesian and parametric method was very close to 1 for all traits except elastic net (γ_{y,GEBV}=0.93). Unbiased models are expected to have a slope coefficient of 1, whereas values greater than 1 indicate a biased overestimation in the GEBV prediction and values smaller than 1 indicate a biased underestimation of the GEBV. In our study, in terms of empirical accuracy, the five Bayesian methods and other parametric performed similarly with correlations over 0.90 and non-parametric and machine learnings had slightly downward correlations. Bayes B, Bayes C and LASSO were slightly more accurate than others, while random forest regression yielded the smallest bias (the worst), the lowest accuracy and the highest MSE. Bayesian and parametric (RR-BLUP, GBLUP, LASSO and EN) models reached a very similar predictive correlation (γ_{y,GEBV}) for the given trait. However, non-parametric methods tended to outperform than other models for MSE, bias and empirical accuracies.
Table 1 Prediction accuracies criteria means across different genomic prediction models for a purely additive trait (h^{2}=0.3) with five folds cross-validation
MSE: mean squared error.
The means within the same column with at least one common letter, do not have significant difference (P>0.001).
Three machine learning method (Support Vector Machine, Relevance Vector Machine and Gaussian Process) and random forest performed poorly on these datasets, even though the models parameters was optimized well, but the methods significantly different from all the other for the criteria (P<0.01). The others pairs of non-parametric methods significantly different from each other (P<0.01), with comparing means, in terms of predictive correlation, MSE and empirical accuracy, RVM had higher values but RKHS had a best bias, very close to 1 (i.e. γ_{TBV,GEBV}=1.03). The boxplots (Figures 1 and 2), show the distribution of the prediction accuracy values for 100 runs and the relative performance of the methods. Figure 1 contains 28 boxplots of bias (A) and predictive correlation of predicted breeding values (B) and Figure 2 shows MSE and empirical accuracy boxplots for the 14 different methods. In each ﬁgure, the ﬁrst five boxplots are for the Bayesian methods, the next four box plots represent the frequentist methods and the later five plots are for non-parametric methods. These summary plots clearly show the ability and inability of using any kind of methods when only additive architecture is present.
Bias of the methods
The coefﬁcient of regression (slope) of simulated phenotype on GEBV was calculated as a measurement of the bias of each method. Ideally a value of γ_{y,GEBV} equal to one indicates no bias in the prediction. Figure 2(A) shows the slopes of regressed simulated phenotypes on estimated breeding values for all models. All Bayesian method had very similar bias and very close to one (red vertical line in Figure 2(A)) and they were not signiﬁcantly different than one, indicating no signiﬁcant bias in the prediction. Across the frequentist method RR-BLUP and GBLUP had a bias similar to Bayesian methods and very close to one but a slight upward and downward estimation found for LASSO and elastic net, respectively. In addition, more variation and signiﬁcant differences among the non parametric methods were detected. The value of γ_{y,GEBV} derived from RKHS was slightly better across the non-parametric models. The values of SVM, GP and RF were higher than one (over estimated) and the RVM was lower (under estimated).
Predictive ability of the methods
Figure 2(B) shows the boxplot of correlation between the simulated phenotype (y) and the predicted genomic breeding value (GEBV), since each validation group had GEBV estimated from a different prediction equation and might have a different mean. This correlation represent the predictive ability (γ_{y,GEBV}) of GS to predict phenotypes (Resende et al. 2012). Overall, the ability to predict phenotype ranged from 0.56 for Random Forest to 0.67 for LASSO (Table 1). In the Figure the red line shows the total mean accuracy of all methods (0.65), although the parametric methods differ in a priori assumptions about marker effects, but their predictive ability was similar and all of them had the accuracy higher than the total mean (red line). In contrast, nonparametric methods, particularly RF, RKHS, and SVM, provided predictions that are reasonably less accurate for a traits with purely additive architecture.
Figure 2 The boxplots of Bias and Accuracy of prediction for the trait with purely additive genetic architecture (narrow sense heritability of 0.30) The ﬁrst nine boxplots (five Bayesian and four frequentist) correspond to the parametric methods, and the last five (dark blue) boxplots correspond to the non-parametric (machine learning) methods
Mean square error and empirical accuracy
Mean square error (MSE) of the estimator used for calculating as prediction error and a risk function to quantify differences between the estimator and the true value. In this study, the mean squared error of prediction between simulated phenotype and GEBV of animals in the testing set was used as the loss function to be minimized and a goodness of prediction criteria for the tested models. A lower MSE is associated with a better overall fit and also, larger estimates of γ_{TBV,GEBV} is a criteria for more reliable predictions. The results from this study show that in terms of MSE, parametric methods performed better and MSE value downward of total MSE mean except for RR-BLUP method (Figure 3C). The five non-parametric method, showed higher MSE than the total mean. The overall fit of the models to the simulated purely additive trait, judged by the mean squared prediction error, favored parametric methods over the non-parametric regression methods. However, for empirical accuracy (i.e. γ_{GEBV,TBV}), lower estimates of the correlation were obtained for non-parametrics and all parametric performed better (Figure 3D).
Hierarchical clustering tree
Figure 4 presents the hierarchical clustering tree obtained through the averaging of the distance matrix across all methods. The clustering results showed the similarities in terms of four estimated criteria (predictive correlation, MSE, Bias and empirical accuracy) between the different models. The clustering chart clearly shows that all parametric except RR-BLUP methods placed in one cluster. The RR-BLUP and RKHS had close similarity and fallen into a separate cluster. As well as, the machine learning methods (SVM, RVM and GP) has close similarities and were placed in the same category. The Random Forest Regression was quite different, with deep divisions in the clustering with other method. This analysis also showed strong similarity between two by two methods and conclude that Bayes B with Bayes C, LASSO with GBLUP, Bayes A with Bayes L, Bayes RR with EN, RR-BLUP with RKHS, RVN with GP have more similarities. It is also interesting to note that the elastic net clustered with Bayes RR despite being a combination of lasso and ridge regression penalty. The idea of genomic selection was initially raised more than 15 years ago, however, it was not practically used until the coming of high capacity genotyping platforms (Meuwissen et al. 2016). Beginning reports on advantages and disadvantages of various statistical methodologies for genomic selection have been conducted largely on simulated data sets (Daetwyler Hans et al. 2013). Simulation is potentially informative way to assess predict ability of genomic prediction models, especially when underlying genetic nature of complex trait and biological mechanisms are unknown. This paper describes the performance of 14 statistical approaches for the prediction of genomic breeding values using a simulated data set. We compared nine parametric (five Bayesian and four frequentist) and five non-parametric statistical GS methods. Comparisons were based on a simulated phenotype where genotypic variability was responsible for only 30% of the phenotypic variability (h^{2}=0.3). The underlying genetic architectures responsible for the genotypic variability, consisted of 300 independently segregating biallelic QTL loci that contributed equally either in an additive manner to a quantitative trait (purely additive architecture). Our study showed that using a parametric method in analyzing a simple additive architecture quantitative trait could provide better goodness of fit than using non-parametric method.
Figure 3 The boxplots of MSE and correlation of TBV and GEBV of prediction for the trait with purely additive genetic architecture (heritability of 0.30)
The ﬁrst nine boxplots (five Bayesian and four frequentist) correspond to the parametric methods and the last five (dark blue) boxplots correspond to the nonparametric (Machine Learning) methods
Figure 4 Hierarchical clustering of genomic selection (GS) models based on four estimated criteria, the height on the y axis refers to the value of the criterion associated with a particular agglomeration of models. Parametric (frequentist): RR-BLUP, GBLUP, LASSO, EN, parametric (Bayesian shrinkage regression): Bayes A, Bayes B, Bayes C, Bayesian LASSO, non-parametric (Machine Learnings): reproducing kernel hilbert space (RKHS), relevance vector machine (RVM), support vector machine (SVM) and gaussian processors (GP)
Our result also showed that all parametric methods, except RR-BLUP, could explain most of phenotypic variation because they showed lower MSE, higher accuracy, the regression coefficient close to one and higher γ_{TBV,GEBV}. It seems that accuracy of statistical genomic prediction models is dependent on the genetic architecture of complex traits, the size of the training population, the number of independent chromosome segments, the heritability of complex trait and the marker density panels. With respect to the relative performance of the prediction methods, Daetwyler et al. (2010) suggested that the accuracy of GBLUP is invariant to number of quantitative trait loci (QTL) affecting the trait, while the accuracy of statistical strategies taking into account the variable selection, is expected to be greater than that of GBLUP when number of independent chromosome segments are more than number of QTLs. In the present study with a highly additive quantitative trait, without involving any non-additive effects, the predictive abilities were considerably greater in the case of parametric predictions when compared to the non-parametric prediction models. Despite this, this superiority of predictive ability for parametric method, predictions varied markedly across Bayesian and frequentists methods. Another hypothesis for the differences in the superiority of parametric predictions in compare with non-parametric method, also mentioned by (Lee et al. 2008). Our results are in agreement with the findings by Gianola et al. (2006), who state that the non-parametric methods should be able to better predict phenotypes that are based on genetic architectures consisting of epistatic interactions. Also, Howard et al. (2014) state that if the underlying genetic architecture is additive, then parametric GS methods are slightly better than the non-parametric methods for levels of heritability. Both the accuracy of prediction and the MSE results suggest the same about the models in terms of predictive performance. When underlying architecture of a trait is purely additive, the RR-BLUP perform the worst among the parametric methods, and the Random Forest performs the worst among the non-parametric methods (shown in Figure 2 and Figure 3). The LASSO and GBLUP had the highest prediction accuracy and the lowest MSE values among the parametric methods when only the additive is present. The other parametric methods like the EN are more power to shrink the QTL effects. Among the all methods, the Random Forest showed poor predictive ability. We can conclude that when the underlying architecture is only additive, in fact, we assume that the SNPs on genome are independent that is same as assumptions about parametric approaches. The simulation of a purely additive architecture done with the assumption that simulate additive effects for QTLS are independent. In this case, we satisfy the parametric model assumption of having independent explanatory variables, so the parametric models have a larger predictive power than the nonparametric models (Howard et al. 2014). Hayes et al. (2010) reported that the accuracy of GEBV using parametric methods for overall type trait, which considering a normal distribution of the effects is better and conversely for fat % and proportion of back spot when predictions is based on leptokurtic distribution of the effect is better. In other study, Ober et al. (2012) reported that, parametric GS methods were unable to predict chill coma recovery, a quantitatively measured adaptive trait in Drosophila. Two-dimensional scans of the whole genome had previously revealed that the genetic architecture of this trait is composed primarily of interactions involving many loci. Whole genome prediction is affected not only by the underlying genetic architecture but also by additional types of unpredictable genetic contributions including intra-locus dominance and genotype by environment interactions. Here in, we have demonstrated the superior ability of the parametric method in Bayesian and frequentist context to accurately predict phenotypes for a highly additive architecture complex trait.
CONCLUSION
Fourteen regression methods proposed to calculate genomic breeding values using a complex trait which is under a highly additive genetic architecture for 10000 genome-wide SNP markers. From our evaluations, for traits controlled by large number of QTLs with only additive effects, parametric methods reached very similar predictive abilities and a clear ranking of statistical methods was observed in function of the trait analyzed. Knowledge of traits’ genetic architectures can be integrated into practices of genomic prediction, which will help the approaches perform better toward different architectures.