SNP-BLUP
SNP-BLUP
- Overview
- Load R objects
- Statistical model
- BLUP of SNP marker effects
- Incidence matrix X, I, and phenotype y
- Quality control of genotype matrix
- Variance components estimation using the regress package
- Inverse of V
- SNP effects
- Mixed model equations
- The rrBLUP package
- Relationship between GBLUP and SNP-BLUP
- The regress package
Overview
We will learn how to estimate BLUP of SNP marker effects (SNP-BLUP) and the equivalency between GBLUP and SNP-BLUP.
Load R objects
Use the function load()
to load the phenotype object dat_day17.Rda, the OLS fit object fit_day17.Rda, the numerator relationship matrix A2.Rda, the genotype matrix W.Rda, the results from pedigree-based BLUP day18.Rda, and the results from genomic BLUP day20.Rda we created in class.
load(file.choose()) # dat_day17.Rda
dim(dat)
load(file.choose()) # fit_day17.Rda
summary(fit)
load(file.choose()) # A2.Rda
dim(A2)
load(file.choose()) # W.Rda
dim(W)
load(file.choose()) # day18.Rda
load(file.choose()) # day20.Rda
ls()
Statistical model
The statistical model we will fit is given by \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{a} + \mathbf{e} \\ \mathbf{a} & \sim MVN(0, \mathbf{I}\sigma^2_a) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \] where \(\mathbf{y}\) is the vector of birth weights, \(\mathbf{X}\) is the incident matrix for fixed effects, \(\mathbf{W}\) is the genotype matrix, \(\mathbf{b}\) is the vector of regression coefficients for fixed effects, \(\mathbf{a}\) is the vector of regression coefficients for random SNP effects, and \(\mathbf{e}\) is the vector of residuals.
BLUP of SNP marker effects
Recall that BLUP of \(\mathbf{a}\) is given by Henderson (1975) (doi: 10.2307/2529430) \[ \begin{align*} BLUP(\mathbf{a}) &= E(\mathbf{a} | \mathbf{y}) \\ &= Cov(\mathbf{a}, \mathbf{y}') Var(\mathbf{y})^{-1} [\mathbf{y} - E(\mathbf{y})] \\ &= \mathbf{I}\sigma^2_{a} \mathbf{W}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}}), \end{align*} \]
where \[ \begin{align*} \mathbf{V} &= Var(\mathbf{y}) \\ &= \mathbf{W}\mathbf{I}\sigma^2_a\mathbf{W}' + \mathbf{I}\sigma^2_e \end{align*} \]
We predict SNP marker effects \(\hat{\mathbf{a}}\) in the following two steps.
- fit OLS to estimate fixed effects (\(\hat{\mathbf{b}}\))
- fit BLUP to obtain marker effects (\(\hat{\mathbf{a}}\)) condition on the estimated \(\hat{\mathbf{b}}\)
Later we will discuss how to obtain \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{a}}\) simultaneously using MME.
Incidence matrix X, I, and phenotype y
We will now contruct each component one by one. First we will create the incidence matrices X
and \(n\) by \(n\) diagonal matrix I
, and the vector of phenotypes y
.
dat2 <- dat[, c("AgeDam.mon.", "SEX")]
X <- model.matrix(~dat2$AgeDam.mon. + dat2$SEX)
Inxn <- diag(nrow(W))
y <- dat$BWT
Quality control of genotype matrix
We will use a minor allele frequency threshold of 0.05 and create a new centered genotype matrix W2
. We will also create a \(m\) by \(m\) diagonal matrix.
p <- colMeans(W)/2
maf <- pmin(p, 1 - p)
index <- which(maf < 0.05)
W2 <- W[, -index]
W2 <- scale(W2, center = TRUE, scale = FALSE)
p2 <- p[-index]
Imxm <- diag(ncol(W2))
Exercise 1
How many SNPs are left after the quality control?
Variance components estimation using the regress package
The regress()
function in the regress package fits a liner mixed model and estimates variance components (\(\sigma^2_a\) and \(\sigma^2_e\)) through a restricted maximum likelihood (REML). Note that the variance-covariance structure of \(\mathbf{a}\) (argument = varcov
) is given by \(\mathbf{W}\mathbf{W'}\) because \(\mathbf{a} \sim MVN(0, \mathbf{I}\sigma^2_a)\), hence \(\mathbf{Wa} \sim MVN(0, \mathbf{W}\mathbf{W}' \sigma^2_a)\).
library(regress)
WtW <- W2 %*% t(W2)
varcompSNP <- regress(y ~ -1 + X, ~WtW)
sigma2a <- varcompSNP$sigma[1] # additive SNP variance -> sigma^2_a
sigma2e <- varcompSNP$sigma[2] # residual variance -> sigma^2_e
lambda <- sigma2e/sigma2a
Inverse of V
Here we compute the inverse of \(V\) matrix. We will use 1) the multiplication operator %*%
and 2) the function solve()
to obtain the inverse of matrix.
V <- W2 %*% Imxm %*% t(W2) * sigma2a + Inxn * sigma2e
dim(V)
Vinv <- solve(V)
dim(Vinv)
SNP effects
Now we compute the BLUP of SNP effects (\(\hat{\mathbf{a}}\)).
ahat <- sigma2a * Imxm %*% t(W2) %*% Vinv %*% (matrix(y) - matrix(predict(fit)))
ahat2 <- sigma2a * Imxm %*% t(W2) %*% Vinv %*% (matrix(y) - matrix(X %*% fit$coefficients)) # alternative
table(ahat == ahat2)
head(ahat)
tail(ahat)
Exercise 2
Which SNP has the largest effect? What is the name of SNP?
Mixed model equations
The corresponding MME is givin by simply replacing \(\mathbf{Z}\) and \(\mathbf{K}\) in the relationship-based MME with \(\mathbf{W}\) and \(I_{m \times m}\), respectively. \[ \begin{align} \begin{bmatrix} \mathbf{X'X} & \mathbf{X'W} \\ \mathbf{W'X} & \mathbf{W'W} + \mathbf{I}^{-1} \lambda \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{X'y} \\ \mathbf{W'y} \end{bmatrix}. \end{align} \]
where \(\lambda\) is the ratio of variance components \(\frac{\sigma^2_e}{\sigma^2_a}\).
Henderson later showed that \(\hat{\mathbf{b}}\) from the above equation is a solution for generalized least square estimate and linear combination of \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{a}}\) is BLUP. Thus, we can obtain BLUE of \(\hat{\mathbf{b}}\) and BLUP of \(\hat{\mathbf{a}}\) simultaneously by solveing MME. \[ \begin{align} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{X'X} & \mathbf{X'W} \\ \mathbf{W'X} & \mathbf{W'W} + \mathbf{I}^{-1} \lambda \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X'y} \\ \mathbf{W'y} \end{bmatrix} \end{align} \]
The rrBLUP package
The mixed.solve()
function in the rrBLUP package estimates fixed effects and variance components it and predicts BLUP of SNP effects.
# install.packages('rrBLUP')
library(rrBLUP)
rr <- mixed.solve(y, X = X, Z = W2)
names(rr)
rr$beta # fixed effect
rr$Vu # additive SNP variance
rr$Ve # residual variance
head(rr$u) # BLUP of SNP marker effects
tail(rr$u)
Exercise 3
Compare the estimates of fixed effects obtained from the following: the regress package and the rrBLUP package. Use the cor()
function to calculate correlations.
Exercise 4
Compare the estimates of variance components (\(\sigma^2_a\) and \(\sigma^2_e\)) obtained from the following: the regress package and the rrBLUP package.
Exercise 5
Compare the estimates of BLUP of SNP marker effects obtained from the following: the SNP-BLUP (ahat
) and the rrBLUP package. Use the cor()
function to calculate correlations.
Relationship between GBLUP and SNP-BLUP
Recall that the statistical model of GBLUP is \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e} \\ \mathbf{u} & \sim MVN(0, \mathbf{G}\sigma^2_G) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \] The SNP-BLUP model is given by \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{a} + \mathbf{e} \\ \mathbf{a} & \sim MVN(0, \mathbf{I}\sigma^2_a) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \]
The difference is that we approximate the true genetic signal \(\mathbf{g}\) with \(\mathbf{Zu}\) in GBLUP (\(\mathbf{g = Zu}\)), whereas we approximate the genetic signal with \(\mathbf{Wa}\) in SNP-BLUP (\(\mathbf{g = Wa}\)). This suggests that the vector of estimated genomic breeding values \(\hat{\mathbf{u}}\) is equivalent to the genotype matrix multiplied by the BLUP of SNP marker effects \(\mathbf{W}\hat{\mathbf{a}}\). \[ \hat{\mathbf{u}} = \mathbf{W}\hat{\mathbf{a}} \]
Exercise 6
Compare estimated genomic breeding values obtained from the following: the GBLUP (uhatG
), the SNP-BLUP (ahat
), and the rrBLUP package. Use the cor()
function to calculate correlations.
The regress package
The regress()
function in the regress package can also predict genomic breeding values.
regress.SNP <- BLUP(varcompSNP)
names(regress.SNP)
# GEBV
head(regress.SNP$Mean) # regress
head(c(W2 %*% ahat)) # SNP-BLUP
head(c(W2 %*% rr$u)) # rrBLUP
head(c(uhatG)) # GBLUP