Data 1

Simulate data

Randomly sample genotype data.

set.seed(40)

lambda1 <- 1
X <- matrix(sample(c(-1,0,1), 15, replace=TRUE), nrow=3, ncol=5)
X
y <- c(10, 5, 8)
y

Marker effects

Compute the individual marker effects.

# fit OLS
beta <- solve(t(X)%*%X) %*% t(X)%*% y 

# fit RR 
beta <- solve(t(X)%*%X + diag(lambda1,5)) %*% t(X)%*% y
beta

# marker 1
beta[1,]
# marker 2
beta[2,]
# marker 1
(X[,1]  %*% (y - X[,2:5] %*% matrix(beta[-1,]))) / (sum(X[,1]^2) + diag(lambda1,1))
# marker 2
(X[,2]  %*% (y - X[,-2] %*% matrix(beta[-2,]))) / (sum(X[,2]^2) + diag(lambda1,1))
# marker 3
??

Amounts of shrinkage

Compute the amounts of shrinkage for each marker

# marker 1
sum(X[,1]^2) / (sum(X[,1]^2) + diag(lambda1,1)) 
# marker 2
sum(X[,2]^2) / (sum(X[,2]^2) + diag(lambda1,1)) 
# marker 3
sum(X[,3]^2) / (sum(X[,3]^2) + diag(lambda1,1)) 
# marker 4
sum(X[,4]^2) / (sum(X[,4]^2) + diag(lambda1,1)) 
# marker 5
sum(X[,5]^2) / (sum(X[,5]^2) + diag(lambda1,1)) 

Data 2

We are going to use the cattle data included in the synbreedData package. Learn more about Synbreed project and synbreed R packages.

if(!require(synbreed) | !require(synbreedData)){
    install.packages("synbreed", "synbreedData")
}
data(cattle)
y <- as.matrix(cattle$pheno[,1,1])
#pheno <- scale(pheno)
dim(cattle$geno)
cattleC <- codeGeno(cattle) # 500 x 7250
W <- cattleC$geno[,which(cattleC$map==1)] # 500 x 250

Genotype imputation

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object W.

for (j in 1:ncol(W)){
  W[,j] <- ifelse(is.na(W[,j]), mean(W[,j], na.rm=TRUE), W[,j])
}

Quality control

Perform a quality control analysis by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X.

# MAF < 0.05
p <- colMeans(W) / 2
maf <- ifelse(p > 0.5, 1-p, p)
maf.index <- which(maf < 0.05)
W <- W[, -maf.index] # 500 x 230

Centering of genotype matrix

Center the genotype matrix to have a mean of zero. Save this matrix as W.

W <- scale(W, center = TRUE, scale = FALSE)

Question 2.1

Fit a standard ridge regression and a SVD-based ridge regression. Use \(\lambda = 5\). Verify that predicted genetic values are equal.

Question 2.2

Compute marker effects using SVD-based ridge regression and SVD-based OLS. Recall that the effect of ridge is to replace \(D^{-1}\) in the OLS expression with \((D^2 + \lambda I)^{-1}D\). Show that ridge estimates are shrunk toward zero because these amounts are indeed smaller in the ridge regression. Use \(\lambda = 5\).

Data 3

Zhang et al. (doi: 10.1534/g3.114.016261) analyzed dairy cattle data including 5024 individuals genotyped with 50K chips. Download FileS1.zip and FileS2.txt from the journal website. The genotype file is cattle_genotypes.txt.

X <- read.table(file="cattle_genotypes.txt", header=TRUE, stringsAsFactors=FALSE)
X <- X[,-1]
X <- as.matrix(X)
dim(X) # 5024 x 42551

# Phenotype
pheno <- read.table(file="FileS2.txt", header=TRUE, stringsAsFactors=FALSE)
y <- pheno[,2]

Quality control

Perform a quality control analysis by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X.

# MAF < 0.05
p <- colMeans(X) / 2
maf <- ifelse(p > 0.5, 1-p, p)
maf.index <- which(maf < 0.05)
X <- X[, -maf.index] # 5024 x 39117

Question 3.1

Recall that ridge estimats can be expressed as \(\hat{\beta}^{ridge}_j = \frac{\sum^n_{i=1} x^2_{ij} }{\sum^n_{i=1} x^2_{ij} + \lambda} \times \hat{\beta}^{ols}_{j} \approx \frac{2p_j(1-p_j)}{2p_j(1-p_j) + \frac{\lambda}{n}} \times \hat{\beta}^{ols}_{j}\). Create a centred genotype matrix Xc and calculate the amounts of shrinkage for all markers. Show that the greatest shrinkage is achieved for the marker with the most extreme minor allele frequency. Also, create a scatter plot for amount of shrinkage vs. MAF. Use \(\lambda = 300\).

Question 3.2

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that genetic variance is given by \(Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})\). We can equate this \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij})\) to the product of our prior expectation about the expected proportion of variance that is explained by the regression times an estimate of the phenotypic variance \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y\). This results in \(\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}\). Then equating this to the prior mode of the variance parameter of scaled-inverse chi square density gives \(\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}\), where \(S_b\) and \(df_b\) are the prior scale and degrees of freedom of marker effects, respectively. Then solving for the prior scale parameter \(S_b\) yields \(S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)\). Compute the prior scale \(S_b\) for the wheat data set according to the above derivation. Use \(df_b = 5\) and \(R^2 = 0.5\).

Question 3.3

Alternatively, we can use \(Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}\), where \(n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2\) is the average sum of squares of the genotypes. Recompute the prior scale \(S_b\) based on this parameterization. Again, use \(df_b = 5\) and \(R^2 = 0.5\).

Question 3.4

Report a prior scale \(S_b\) used for Bayesian ridge regression with default setting in the BGLR function. Which of the above rule-based priors is used in the BGLR? Hint: set nIter = 1, burnIn = 1 so that you can see the output.