SNP-BLUP
SNP-BLUP
- Overview
- Load R objects
- Statistical model
- BLUP of SNP marker effects
- Incidence matrix X, I, and phenotype y
- Quality control of genotype matrix
- Variance components estimation using the regress package
- Inverse of V
- Ordinary least squares
- SNP effects
- Mixed model equations
- The rrBLUP package
- Relationship between GBLUP and SNP-BLUP
- The regress package
Overview
We will learn how to estimate BLUP of SNP marker effects (SNP-BLUP) and the equivalency between GBLUP and SNP-BLUP.
Load R objects
Use the function read_excel
in the readxl
package to read the pedigree file (Simdata.xlsx) in a data frame format. Use the function load()
to load the SNP matrix (W.Rda), the numerator relationship matrix (A.Rda), the genomic relationship matrix (G.Rda) the results from pedigree-based BLUP (day17.Rda), and the results from genome-based BLUP (day18.Rda) we created in class.
library(readxl)
dat <- read_excel(file.choose())
dim(dat)
head(dat)
load(file.choose()) # A.Rda
dim(A)
load(file.choose()) # G.Rda
dim(G)
load(file.choose()) # W.Rda
dim(W)
load(file.choose()) # day17.Rda
load(file.choose()) # day18.Rda
ls()
Statistical model
The statistical model we will fit is given by \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{a} + \mathbf{e} \\ \mathbf{a} & \sim MVN(0, \mathbf{I}\sigma^2_a) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \] where \(\mathbf{y}\) is the vector of birth weights, \(\mathbf{X}\) is the incident matrix for fixed effects, \(\mathbf{W}\) is the genotype matrix, \(\mathbf{b}\) is the vector of regression coefficients for fixed effects, \(\mathbf{a}\) is the vector of regression coefficients for random SNP effects, and \(\mathbf{e}\) is the vector of residuals.
BLUP of SNP marker effects
Recall that BLUP of \(\mathbf{a}\) is given by Henderson (1975) (doi: 10.2307/2529430) \[ \begin{align*} BLUP(\mathbf{a}) &= E(\mathbf{a} | \mathbf{y}) \\ &= Cov(\mathbf{a}, \mathbf{y}') Var(\mathbf{y})^{-1} [\mathbf{y} - E(\mathbf{y})] \\ &= \mathbf{I}\sigma^2_{a} \mathbf{W}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}}), \end{align*} \]
where \[ \begin{align*} \mathbf{V} &= Var(\mathbf{y}) \\ &= \mathbf{W}\mathbf{I}\sigma^2_a\mathbf{W}' + \mathbf{I}\sigma^2_e \end{align*} \]
We predict SNP marker effects \(\hat{\mathbf{a}}\) in the following two steps.
- fit OLS to estimate fixed effects (\(\hat{\mathbf{b}}\))
- fit BLUP to obtain marker effects (\(\hat{\mathbf{a}}\)) condition on the estimated \(\hat{\mathbf{b}}\)
Later we will discuss how to obtain \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{a}}\) simultaneously using MME.
Incidence matrix X, I, and phenotype y
We will now contruct each component one by one. First we will create the incidence matrices X
and \(n\) by \(n\) diagonal matrix I
, and the vector of phenotypes y
.
dat2 <- dat[, c("AgeDam", "Sex")]
X <- model.matrix(~dat2$AgeDam + dat2$Sex)
Inxn <- diag(nrow(W))
y <- dat$BW_205d
Quality control of genotype matrix
We will use a minor allele frequency threshold of 0.05 and create a new centered genotype matrix W2
. We will also create a \(m\) by \(m\) diagonal matrix.
p <- colMeans(W)/2
maf <- pmin(p, 1 - p)
index <- which(maf < 0.05)
W2 <- W[, -index]
W2 <- scale(W2, center = TRUE, scale = FALSE)
p2 <- p[-index]
Imxm <- diag(ncol(W2))
Exercise 1
How many SNPs are left after the quality control?
Variance components estimation using the regress package
The regress()
function in the regress package fits a liner mixed model and estimates variance components (\(\sigma^2_a\) and \(\sigma^2_e\)) through a restricted maximum likelihood (REML). Note that the variance-covariance structure of \(\mathbf{a}\) (argument = varcov
) is given by \(\mathbf{W}\mathbf{W'}\) because \(\mathbf{a} \sim MVN(0, \mathbf{I}\sigma^2_a)\), hence \(\mathbf{Wa} \sim MVN(0, \mathbf{W}\mathbf{W}' \sigma^2_a)\).
library(regress)
WtW <- W2 %*% t(W2)
varcompSNP <- regress(y ~ -1 + X, ~WtW)
sigma2a <- varcompSNP$sigma[1] # additive SNP variance -> sigma^2_a
sigma2e <- varcompSNP$sigma[2] # residual variance -> sigma^2_e
lambda <- sigma2e/sigma2a
Inverse of V
Here we compute the inverse of \(V\) matrix. We will use 1) the multiplication operator %*%
and 2) the function solve()
to obtain the inverse of matrix.
V <- W2 %*% Imxm %*% t(W2) * sigma2a + Inxn * sigma2e
dim(V)
Vinv <- solve(V)
dim(Vinv)
Ordinary least squares
We will use the lm()
function to fit OLS and estimate fixed effect coefficients. This function offers an intuitive formula syntax. The summary()
function summarizes a model fit.
fit <- lm(BW_205d ~ AgeDam + factor(Sex), data = dat)
summary(fit)
residuals(fit)
SNP effects
Now we compute the BLUP of SNP effects (\(\hat{\mathbf{a}}\)).
ahat <- sigma2a * Imxm %*% t(W2) %*% Vinv %*% (matrix(y) - matrix(predict(fit)))
ahat2 <- sigma2a * Imxm %*% t(W2) %*% Vinv %*% (matrix(y) - matrix(X %*% fit$coefficients)) # alternative
table(ahat == ahat2)
head(ahat)
tail(ahat)
Exercise 2
Which SNP has the largest effect? What is the name of SNP?
Mixed model equations
The corresponding MME is givin by simply replacing \(\mathbf{Z}\) and \(\mathbf{K}\) in the relationship-based MME with \(\mathbf{W}\) and \(I_{m \times m}\), respectively. \[ \begin{align} \begin{bmatrix} \mathbf{X'X} & \mathbf{X'W} \\ \mathbf{W'X} & \mathbf{W'W} + \mathbf{I}^{-1} \lambda \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{X'y} \\ \mathbf{W'y} \end{bmatrix}. \end{align} \]
where \(\lambda\) is the ratio of variance components \(\frac{\sigma^2_e}{\sigma^2_a}\).
Henderson later showed that \(\hat{\mathbf{b}}\) from the above equation is a solution for generalized least square estimate and linear combination of \(\hat{\mathbf{b}}\) and \(\hat{\mathbf{a}}\) is BLUP. Thus, we can obtain BLUE of \(\hat{\mathbf{b}}\) and BLUP of \(\hat{\mathbf{a}}\) simultaneously by solveing MME. \[ \begin{align} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{a}} \end{bmatrix} = \begin{bmatrix} \mathbf{X'X} & \mathbf{X'W} \\ \mathbf{W'X} & \mathbf{W'W} + \mathbf{I}^{-1} \lambda \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{X'y} \\ \mathbf{W'y} \end{bmatrix} \end{align} \]
The rrBLUP package
The mixed.solve()
function in the rrBLUP package estimates fixed effects and variance components it and predicts BLUP of SNP effects.
# install.packages('rrBLUP')
library(rrBLUP)
rr <- mixed.solve(y, X = X, Z = W2)
names(rr)
rr$beta # fixed effect
rr$Vu # additive SNP variance
rr$Ve # residual variance
head(rr$u) # BLUP of SNP marker effects
tail(rr$u)
Exercise 3
Compare the estimates of fixed effects obtained from the following: the regress package and the rrBLUP package. Use the cor()
function to calculate correlations.
Exercise 4
Compare the estimates of variance components (\(\sigma^2_a\) and \(\sigma^2_e\)) obtained from the following: the regress package and the rrBLUP package.
Exercise 5
Compare the estimates of BLUP of SNP marker effects obtained from the following: the SNP-BLUP (ahat
) and the rrBLUP package. Use the cor()
function to calculate correlations.
Relationship between GBLUP and SNP-BLUP
Recall that the statistical model of GBLUP is \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e} \\ \mathbf{u} & \sim MVN(0, \mathbf{G}\sigma^2_G) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \] The SNP-BLUP model is given by \[ \begin{align*} \mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{a} + \mathbf{e} \\ \mathbf{a} & \sim MVN(0, \mathbf{I}\sigma^2_a) \\ \mathbf{e} & \sim MVN(0, \mathbf{I}\sigma^2_e) \\ \end{align*} \]
The difference is that we approximate the true genetic signal \(\mathbf{g}\) with \(\mathbf{Zu}\) in GBLUP (\(\mathbf{g = Zu}\)), whereas we approximate the genetic signal with \(\mathbf{Wa}\) in SNP-BLUP (\(\mathbf{g = Wa}\)). This suggests that the vector of estimated genomic breeding values \(\hat{\mathbf{u}}\) is equivalent to the genotype matrix multiplied by the BLUP of SNP marker effects \(\mathbf{W}\hat{\mathbf{a}}\). \[ \hat{\mathbf{u}} = \mathbf{W}\hat{\mathbf{a}} \]
Exercise 6
Compare estimated genomic breeding values obtained from the following: the GBLUP (uhatG
), the SNP-BLUP (ahat
), and the rrBLUP package. Use the cor()
function to calculate correlations.
The regress package
The regress()
function in the regress package can also predict genomic breeding values.
regress.SNP <- BLUP(varcompSNP)
names(regress.SNP)
# GEBV
head(regress.SNP$Mean) # regress
head(c(W2 %*% ahat)) # SNP-BLUP
head(c(W2 %*% rr$u)) # rrBLUP
head(c(uhatG)) # GBLUP