Phenotypes are often correlated because of pleiotropy or linkage disequilibrium among QTLs in the complex traits system. A multi-trait mixed model is prevalent in quantiative genetics because it can account for genetic or environmental covariances among phenotypes. Nevertheless, there is a concern of computational efficiency when the number of phenotypes is large, particularly when phenotypes are collected using high-throughput phenotyping platforms.
One alternative approach is to use a factor analytic (FA) model. A FA model assumes that observed phenotypes are generated from underlying unobserved latent variables or factors. In this tutorial, we introduce two types of FA models: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). We select ten phenotypes related to midparent heterosis or MPH (Area_Day21, Area_Growth, Area_Day29_MPH, Area_Day21_MPH, Area_Day29, Area_Growth_MPH, LTF, LTF_MPH, DTF, and DTF_MPH) from a public Arabidopsis data (Seymour et al. 2016) to illustrate how FA models can be fitted. The Arabidopsis data set used in this tutorial can be found in easyGWAS (Arabidopsis thaliana -> F1 Hybrids Seymour et. al. 2016, PNAS). Alternatively, you can download the data set directly from here using
wget or DownGit.
Load R packages
# install.packages("blavaan") library(blavaan) # install.packages("pheatmap") library(pheatmap) # install.packages("psych") library(psych) # install.packages("dplyr") library(dplyr) # install.packages("tidyr") library(tidyr) # install.packages("knitr") library(knitr)
# Import 10 Arabidopsis phenotypes <- read.table('Arabidopsis.txt', header = TRUE, stringsAsFactors = FALSE) pheno dim(pheno) # 372 10 head(pheno)
Exploratory Factor Analysis (EFA)
When we have no biological prior knowledge regarding the number of factors, EFA can be used to search the underlying factors and infer the latent structure. The EFA is consisted of three steps: 1) identify the number of factors underlying observed phenotypes; 2) fit the EFA model based on the number of factors estimated in 1); and 3) establish the latent structure between factors and phenotypes from factor loading coefficients derived from 2).
Step 1: Estimate the number of factors
# phenotypic correlation among traits <- cor(pheno, use = "complete.obs") COR # parallel analysis to identify the number of underlying factors fa.parallel(COR, n.obs = 123, fa = "fa", n.iter = 100, error.bars = FALSE, se.bars = FALSE, ylabel = 'Eigen values of factors', fm = 'mle')
The parallel analysis infers the number of factors by comparing the eigenvalues of observed data and the eigenvalues simulated from random data. One criterion to set the number of factors is to count how many eigenvalues from observed data are larger than the eigenvalues from the random data. From the above figure, this number equals two in our example.
Step 2: Fit EFA model
# fit a maximum likelihood-based EFA model by assuming two factors <- fa(COR, nfactors = 2, fm = "ml", max.iter = 100, fit.mle rotate = "varimax") # extract factor loading coefficients <- round(fit.mle$loadings, digits = 2) loading # display the factor loadings using a heatmap pheatmap(loading, display_numbers = TRUE, cluster_cols = FALSE, angle_col = 0, main = 'Factor loading coefficients', fontsize_number = 10)
The factor loading heatmap shows the pattern between the observed phenotypes and factors. We can see that the first factor (ML1) has large factor loading coefficients to the area associated traits, while the second factor (ML2) presents large loadings to the reproductive transition phase (RTP) related traits (i.e., LTF, LTF_MPH, DTF, and DTF_MPH,).
Step 3: Establish the latent structure
According to the factor loading heatmap, it is possible to eliminate cross-loadings (i.e., phenotype receives factor loadings from more than one factor) by highlighting the highest loading. The inferred latent structure between the observed phenotypes and factors is shown below. We can assign a biological interpretation to those factors: Area and RTP.
Confirmatory factor analysis (CFA)
A CFA model assumes that a latent structure is known. We can fit a CFA model based on the latent structure inferred from EFA.
# CFA model specification <- ' CFA.Model # 6 Area =~ Area_Day21 + Area_Growth + Area_Day29_MPH + Area_Day21_MPH + Area_Day29 + Area_Growth_MPH # 4 RTP =~ LTF + LTF_MPH + DTF_MPH + DTF ' # fit the CFA model <- bcfa(CFA.Model, data = pheno, CFA.fit burnin = 200, sample = 200, target = "stan", save.lvs = T, n.chains = 2)
Note that we used a small number of burn-in and Gibbs sampling for an illustrative purpose. In practice, sufficient MCMC samples should be considered and model convergence should be checked.
Estimate factor loadings
parameterEstimates(CFA.fit, standardized=TRUE) %>% filter(op == "=~") %>% select('Latent Variable' = lhs, 'Observed Phenotypes' = rhs, 'Estimate' = std.all) %>% kable(digits = 3, format='html', caption ='Standardized Factor Loadings')
|Latent Variable||Observed Phenotypes’||Estimate|
Estimate factor scores
<- blavInspect(CFA.fit, 'lvmeans') bfs_all colnames(bfs_all) <- c('Area', 'RTP') dim(bfs_all) head(bfs_all) Area RTP1,] -1.328089 0.001999496 [2,] -10.467880 0.028819467 [3,] 4.738378 -0.009752476 [4,] 3.049506 -0.019339391 [5,] -7.811509 0.023633191 [6,] -3.666438 0.010334994[
The estimated factor scores for Area and RTP can be used as new phenotypes for subsequent genetic analyses.