SNP-BLUP
SNP-BLUP
- Overview
- Load R objects
- Statistical model
- BLUP of SNP marker effects
- Incidence matrix X, I, and phenotype y
- Quality control of genotype matrix
- Variance components estimation using the regress package
- Inverse of V
- SNP effects
- Mixed model equations
- The rrBLUP package
- Relationship between GBLUP and SNP-BLUP
- The regress package
Overview
We will learn how to estimate BLUP of SNP marker effects (SNP-BLUP) and the equivalency between GBLUP and SNP-BLUP.
Load R objects
Use the function load() to load the phenotype object dat_day17.Rda, the OLS fit object fit_day17.Rda, the numerator relationship matrix A2.Rda, the genotype matrix W.Rda, the results from pedigree-based BLUP day18.Rda, and the results from genomic BLUP day20.Rda we created in class.
load(file.choose()) # dat_day17.Rda
dim(dat)
load(file.choose()) # fit_day17.Rda
summary(fit)
load(file.choose()) # A2.Rda
dim(A2)
load(file.choose()) # W.Rda
dim(W)
load(file.choose()) # day18.Rda
load(file.choose()) # day20.Rda
ls()Statistical model
The statistical model we will fit is given by \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{a} + \mathbf{e} \\ \mathbf{a} & \sim MVN(0, \mathbf{I}\sigma^2_a) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \] where \(\mathbf{y}\) is the vector of birth weights, \(\mathbf{X}\) is the incident matrix for fixed effects, \(\mathbf{W}\) is the genotype matrix, \(\mathbf{b}\) is the vector of regression coefficients for fixed effects, \(\mathbf{a}\) is the vector of regression coefficients for random SNP effects, and \(\mathbf{e}\) is the vector of residuals.
BLUP of SNP marker effects
Recall that BLUP of \(\mathbf{a}\) is given by Henderson (1975) (doi: 10.2307/2529430) \[ \begin{align*} BLUP(\mathbf{a}) &= E(\mathbf{a} | \mathbf{y}) \\ &= Cov(\mathbf{a}, \mathbf{y}') Var(\mathbf{y})^{-1} [\mathbf{y} - E(\mathbf{y})] \\ &= \mathbf{I}\sigma^2_{a} \mathbf{W}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}}), \end{align*} \]
where \[ \begin{align*} \mathbf{V} &= Var(\mathbf{y}) \\ &= \mathbf{W}\mathbf{I}\sigma^2_a\mathbf{W}' + \mathbf{I}\sigma^2_e \end{align*} \]
We predict SNP marker effects \(\hat{\mathbf{a}}\) in the following two steps.
- fit OLS to estimate fixed effects (\(\hat{\mathbf{b}}\))
- fit BLUP to obtain marker effects (\(\hat{\mathbf{a}}\)) condition on the estimated \(\hat{\mathbf{b}}\)
Later we will discuss how to obtain \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{a}}\) simultaneously using MME.
Incidence matrix X, I, and phenotype y
We will now contruct each component one by one. First we will create the incidence matrices X and \(n\) by \(n\) diagonal matrix I, and the vector of phenotypes y.
dat2 <- dat[, c("AgeDam.mon.", "SEX")]
X <- model.matrix(~dat2$AgeDam.mon. + dat2$SEX)
Inxn <- diag(nrow(W))
y <- dat$BWTQuality control of genotype matrix
We will use a minor allele frequency threshold of 0.05 and create a new centered genotype matrix W2. We will also create a \(m\) by \(m\) diagonal matrix.
p <- colMeans(W)/2
maf <- pmin(p, 1 - p)
index <- which(maf < 0.05)
W2 <- W[, -index]
W2 <- scale(W2, center = TRUE, scale = FALSE)
p2 <- p[-index]
Imxm <- diag(ncol(W2))Exercise 1
How many SNPs are left after the quality control?
Variance components estimation using the regress package
The regress() function in the regress package fits a liner mixed model and estimates variance components (\(\sigma^2_a\) and \(\sigma^2_e\)) through a restricted maximum likelihood (REML). Note that the variance-covariance structure of \(\mathbf{a}\) (argument = varcov) is given by \(\mathbf{W}\mathbf{W'}\) because \(\mathbf{a} \sim MVN(0, \mathbf{I}\sigma^2_a)\), hence \(\mathbf{Wa} \sim MVN(0, \mathbf{W}\mathbf{W}' \sigma^2_a)\).
library(regress)
WtW <- W2 %*% t(W2)
varcompSNP <- regress(y ~ -1 + X, ~WtW)
sigma2a <- varcompSNP$sigma[1] # additive SNP variance -> sigma^2_a
sigma2e <- varcompSNP$sigma[2] # residual variance -> sigma^2_e
lambda <- sigma2e/sigma2aInverse of V
Here we compute the inverse of \(V\) matrix. We will use 1) the multiplication operator %*% and 2) the function solve() to obtain the inverse of matrix.
V <- W2 %*% Imxm %*% t(W2) * sigma2a + Inxn * sigma2e
dim(V)
Vinv <- solve(V)
dim(Vinv)SNP effects
Now we compute the BLUP of SNP effects (\(\hat{\mathbf{a}}\)).
ahat <- sigma2a * Imxm %*% t(W2) %*% Vinv %*% (matrix(y) - matrix(predict(fit)))
ahat2 <- sigma2a * Imxm %*% t(W2) %*% Vinv %*% (matrix(y) - matrix(X %*% fit$coefficients)) # alternative
table(ahat == ahat2)
head(ahat)
tail(ahat)Exercise 2
Which SNP has the largest effect? What is the name of SNP?
Mixed model equations
The corresponding MME is givin by simply replacing \(\mathbf{Z}\) and \(\mathbf{K}\) in the relationship-based MME with \(\mathbf{W}\) and \(I_{m \times m}\), respectively. \[ \begin{align} \begin{bmatrix} \mathbf{X'X} & \mathbf{X'W} \\ \mathbf{W'X} & \mathbf{W'W} + \mathbf{I}^{-1} \lambda \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{X'y} \\ \mathbf{W'y} \end{bmatrix}. \end{align} \]
where \(\lambda\) is the ratio of variance components \(\frac{\sigma^2_e}{\sigma^2_a}\).
Henderson later showed that \(\hat{\mathbf{b}}\) from the above equation is a solution for generalized least square estimate and linear combination of \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{a}}\) is BLUP. Thus, we can obtain BLUE of \(\hat{\mathbf{b}}\) and BLUP of \(\hat{\mathbf{a}}\) simultaneously by solveing MME. \[ \begin{align} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{X'X} & \mathbf{X'W} \\ \mathbf{W'X} & \mathbf{W'W} + \mathbf{I}^{-1} \lambda \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X'y} \\ \mathbf{W'y} \end{bmatrix} \end{align} \]
The rrBLUP package
The mixed.solve() function in the rrBLUP package estimates fixed effects and variance components it and predicts BLUP of SNP effects.
# install.packages('rrBLUP')
library(rrBLUP)
rr <- mixed.solve(y, X = X, Z = W2)
names(rr)
rr$beta # fixed effect
rr$Vu # additive SNP variance
rr$Ve # residual variance
head(rr$u) # BLUP of SNP marker effects
tail(rr$u)Exercise 3
Compare the estimates of fixed effects obtained from the following: the regress package and the rrBLUP package. Use the cor() function to calculate correlations.
Exercise 4
Compare the estimates of variance components (\(\sigma^2_a\) and \(\sigma^2_e\)) obtained from the following: the regress package and the rrBLUP package.
Exercise 5
Compare the estimates of BLUP of SNP marker effects obtained from the following: the SNP-BLUP (ahat) and the rrBLUP package. Use the cor() function to calculate correlations.
Relationship between GBLUP and SNP-BLUP
Recall that the statistical model of GBLUP is \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e} \\ \mathbf{u} & \sim MVN(0, \mathbf{G}\sigma^2_G) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \] The SNP-BLUP model is given by \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{a} + \mathbf{e} \\ \mathbf{a} & \sim MVN(0, \mathbf{I}\sigma^2_a) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \]
The difference is that we approximate the true genetic signal \(\mathbf{g}\) with \(\mathbf{Zu}\) in GBLUP (\(\mathbf{g = Zu}\)), whereas we approximate the genetic signal with \(\mathbf{Wa}\) in SNP-BLUP (\(\mathbf{g = Wa}\)). This suggests that the vector of estimated genomic breeding values \(\hat{\mathbf{u}}\) is equivalent to the genotype matrix multiplied by the BLUP of SNP marker effects \(\mathbf{W}\hat{\mathbf{a}}\). \[ \hat{\mathbf{u}} = \mathbf{W}\hat{\mathbf{a}} \]
Exercise 6
Compare estimated genomic breeding values obtained from the following: the GBLUP (uhatG), the SNP-BLUP (ahat), and the rrBLUP package. Use the cor() function to calculate correlations.
The regress package
The regress() function in the regress package can also predict genomic breeding values.
regress.SNP <- BLUP(varcompSNP)
names(regress.SNP)
# GEBV
head(regress.SNP$Mean) # regress
head(c(W2 %*% ahat)) # SNP-BLUP
head(c(W2 %*% rr$u)) # rrBLUP
head(c(uhatG)) # GBLUP