This homework assignment involves the analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers and phenotyped at 9 locations. The available phenotypes are grain yield, grain volume weight, plant height, and flowering date. Both the phenotypic and genotypic data can be downloaded from DRYAD.
# phenotypes
pheno <- read.csv("http://datadryad.org/bitstream/handle/10255/dryad.40880/phenotype.csv?sequence=1", header = TRUE, stringsAsFactors = FALSE)
pheno1 <- subset(pheno, Location ==1) # use location 1
# DArT markers
library(readxl)
geno1 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=2, skip=3, na="-")
geno2 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=3, skip=3, na="-")
geno3 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=4, skip=3, na="-")
table(geno1[,1:7] == geno2[,1:7])
##
## TRUE
## 13867
table(geno1[,1:7] == geno3[,1:7])
##
## TRUE
## 13867
table(geno2[,1:7] == geno3[,1:7])
##
## TRUE
## 13867
geno4 <- cbind(geno1, geno2[,-c(1:7)], geno3[, -c(1:7)])
# quality control
qc.callrate <- which(geno4$CallRate < 95)
qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index) # 1234
## [1] 1234
geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[,-c(1:7)])
Check if the phenotype and genotype files have the same accessions.
# phenotype -> genotype
table(pheno1[,2] %in% rownames(geno6))
##
## FALSE TRUE
## 4 276
na1a.index <- which(!pheno1[,2] %in% rownames(geno6))
pheno1[c(na1a.index),2] # "NE10522" "NW10568" "NE10570" "NE10583" -> found in the phenotype file but not in the genotype file
## [1] "NE10522" "NW10568" "NE10570" "NE10583"
# genotype -> phenotype
table(rownames(geno6) %in% pheno1[,2])
##
## FALSE TRUE
## 2 276
na1b.index <- which(!rownames(geno6) %in% pheno1[,2])
rownames(geno6)[c(na1b.index)] # "Goodstreak" "Camelot" -> found in the genotype file but not in the phenotype file
## [1] "Goodstreak" "Camelot"
pheno1a <- pheno1[match(rownames(geno6), pheno1[,2]), ]
pheno1b <- pheno1a[-na1b.index, ] # drop the two accessions without phenotypes (rows 277 and 278)
geno7 <- geno6[-na1b.index, ]
table(pheno1b[,2] == rownames(geno7), useNA = "always")
##
## TRUE <NA>
## 276 0
table(pheno1b[,2] %in% rownames(geno7))
##
## TRUE
## 276
table(rownames(geno7) %in% pheno1b[,2])
##
## TRUE
## 276
# final phenotype object
y <- pheno1b$yield # use grain yield
# final genotype object
geno <- geno7 # 276 x 747
Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object X.
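A minimal sketch of marker-wise mean imputation, run on a small made-up matrix rather than the wheat data. The names geno_toy and X_toy are hypothetical, and the sketch assumes the genotype object is a numeric matrix with accessions in rows and markers in columns, as geno is above.

```r
# Toy marker matrix (values are illustrative only): 4 accessions x 3 markers
geno_toy <- matrix(c(1, 0, NA, 1,
                     0, 0, 1, NA,
                     1, NA, 1, 1),
                   nrow = 4,
                   dimnames = list(paste0("acc", 1:4), paste0("m", 1:3)))

# Replace each missing genotype with the mean of the observed values of that marker
X_toy <- geno_toy
for (j in seq_len(ncol(X_toy))) {
  miss <- is.na(X_toy[, j])
  X_toy[miss, j] <- mean(X_toy[, j], na.rm = TRUE)
}

anyNA(X_toy)  # no missing values should remain
X_toy
```

Imputing with the marker mean leaves each column mean unchanged, so allele frequencies computed afterwards are not shifted by the imputation.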
Perform a quality control by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X2.
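The filter can be sketched on toy data as follows. This assumes the markers are coded 0/1 (presence/absence), so that the allele frequency of a marker is simply its column mean; if the coding were 0/1/2, the frequency would be the column mean divided by 2. The object names X_toy and X2_toy are hypothetical.

```r
# Toy 0/1 marker matrix: 40 accessions x 4 markers (illustrative values)
X_toy <- cbind(common = rep(c(0, 1), each = 20),    # p = 0.50
               rare   = c(rep(0, 39), 1),           # p = 0.025
               fixed  = rep(1, 40),                 # p = 1.00, so MAF = 0
               skewed = c(rep(1, 34), rep(0, 6)))   # p = 0.85, so MAF = 0.15

p    <- colMeans(X_toy)     # allele frequency under the assumed 0/1 coding
maf  <- pmin(p, 1 - p)      # minor allele frequency
keep <- maf >= 0.05         # markers with MAF < 0.05 are removed

X2_toy <- X_toy[, keep, drop = FALSE]
sum(!keep)                  # number of markers removed
dim(X2_toy)
```

Using drop = FALSE keeps the result a matrix even if only one marker passes the filter.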
Standardize the genotype matrix from Question 2 so that each marker has a mean of zero and a variance of one. Save this matrix as Xs.
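One way to standardize is base R's scale(), sketched here on simulated 0/1 genotypes (X2_toy and Xs_toy are hypothetical names). Note that scale() divides by the sample standard deviation (denominator n - 1), and that monomorphic markers would produce NaN, which is one reason to apply the MAF filter first.

```r
set.seed(2)
# Simulated stand-in for the filtered genotype matrix: 50 accessions x 4 markers
X2_toy <- matrix(rbinom(50 * 4, 1, 0.4), nrow = 50)

# Center each column to mean 0 and scale it to variance 1
Xs_toy <- scale(X2_toy, center = TRUE, scale = TRUE)

round(colMeans(Xs_toy), 10)   # column means, numerically zero
apply(Xs_toy, 2, var)         # column variances, equal to one
```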
We will determine a prior for marker effects in Bayesian ridge regression. Read Pérez and de los Campos (2014) (DOI: 10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that the genetic variance is given by \(Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})\). We can equate this to the product of our prior expectation of the proportion of phenotypic variance explained by the regression, \(R^2\), and an estimate of the phenotypic variance, \(V_y\): \(\sigma^2_b \sum^m_{j=1} Var(x_{ij}) = R^2 V_y\). This results in \(\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}\). Equating this to the prior mode of the variance parameter under a scaled inverse chi-square density gives \(\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}\), where \(S_b\) and \(df_b\) are the prior scale and degrees of freedom of the marker effects, respectively. Solving for the prior scale parameter yields \(S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)\). Compute the prior scale \(S_b\) for the wheat data set according to the above derivation. Use \(df_b = 5\) and \(R^2 = 0.5\).
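The formula for \(S_b\) translates almost directly into R. The sketch below uses simulated genotypes and phenotypes (X_toy and y_toy are hypothetical stand-ins for the genotype matrix and the yield vector), so the numeric result is illustrative only.

```r
set.seed(3)
n <- 100; m <- 10
X_toy <- matrix(rbinom(n * m, 1, 0.3), nrow = n)   # simulated 0/1 markers
y_toy <- rnorm(n, mean = 50, sd = 5)               # simulated phenotype

R2   <- 0.5   # prior proportion of variance explained by the regression
df_b <- 5     # prior degrees of freedom

Vy        <- var(y_toy)                  # estimate of the phenotypic variance
sum_var_x <- sum(apply(X_toy, 2, var))   # sum over markers of Var(x_ij)

# S_b = R2 * Vy / sum_j Var(x_ij) * (df_b + 2)
S_b <- R2 * Vy / sum_var_x * (df_b + 2)
S_b
```

If the genotype matrix has been standardized so that every marker has variance one, the denominator reduces to the number of markers m.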
Alternatively, we can use \(Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}\), where \(n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2\) is the average over accessions of the sum of squared genotypes. Recompute the prior scale \(S_b\) based on this parameterization. Again, use \(df_b = 5\) and \(R^2 = 0.5\).
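A sketch of the second parameterization on the same kind of simulated data (X_toy, y_toy, and S_b_ss are hypothetical names). It also checks how the two denominators relate: for column-centered genotypes the average sum of squares equals (n - 1)/n times the sum of the marker variances, whereas for uncentered genotypes it is larger because the squared column means are included.

```r
set.seed(3)
n <- 100; m <- 10
X_toy <- matrix(rbinom(n * m, 1, 0.3), nrow = n)   # simulated 0/1 markers
y_toy <- rnorm(n, mean = 50, sd = 5)               # simulated phenotype

R2 <- 0.5; df_b <- 5
Vy <- var(y_toy)

# Average (over accessions) sum of squared genotypes
mean_ss <- sum(X_toy^2) / n
S_b_ss  <- R2 * Vy / mean_ss * (df_b + 2)
S_b_ss

# Relation to the variance-based denominator when columns are centered
Xc        <- scale(X_toy, center = TRUE, scale = FALSE)
mean_ss_c <- sum(Xc^2) / n
sum_var   <- sum(apply(X_toy, 2, var))
all.equal(mean_ss_c, (n - 1) / n * sum_var)
```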
Report the prior scale \(S_b\) used for Bayesian ridge regression under the default settings of the BGLR function. Which of the above rule-based priors does BGLR use? Hint: set nIter = 1 and burnIn = 1 so that you can see the output.
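A hedged sketch of how such a run might look on simulated data. It assumes the BGLR package is installed (the call is skipped otherwise) and that the fitted object stores the prior hyperparameters of the first linear-predictor term as df0 and S0; treat those field names as an assumption and confirm them with str(fm$ETA[[1]]). X_toy, y_toy, and the saveAt prefix are hypothetical.

```r
set.seed(6)
n <- 50; m <- 20
X_toy <- matrix(rbinom(n * m, 1, 0.4), nrow = n)   # simulated 0/1 markers
y_toy <- rnorm(n, mean = 50, sd = 5)               # simulated phenotype

# Bayesian ridge regression term with no user-supplied df0 or S0,
# so the package falls back on its defaults
ETA_toy <- list(list(X = X_toy, model = "BRR"))

if (requireNamespace("BGLR", quietly = TRUE)) {
  fm <- BGLR::BGLR(y = y_toy, ETA = ETA_toy, nIter = 1, burnIn = 1,
                   saveAt = file.path(tempdir(), "brr_"), verbose = TRUE)
  # Assumed field names for the prior hyperparameters of the first term
  print(fm$ETA[[1]]$df0)
  print(fm$ETA[[1]]$S0)
}
```

The reported scale can then be compared by hand with the two rule-based values computed from the same X and y.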
Jiang and Reif (2015) (DOI: 10.1534/genetics.115.177907) showed that a Gaussian kernel matrix can be decomposed as \(GK = \Lambda \tilde{H} \Lambda\), where \(\Lambda = diag[\exp (- \theta \sum^m_{k=1} x^2_{1k} ) , \cdots, \exp (- \theta\sum^m_{k=1} x^2_{nk})]\), \(\tilde{H} = 1_{n \times n} + \sum^{\infty}_{k=1} \frac{(2 \theta m)^k}{k!} G^{\#k}\), \(1_{n \times n}\) is an \(n \times n\) matrix of ones, \(G^{\#k}\) is the \(k\)-th Hadamard (element-wise) power of \(G\), and \(G\) is the second genomic relationship matrix of VanRaden (2008). Confirm numerically that this holds. Use the bandwidth parameter \(\theta = 0.0001\).
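The identity follows from \(\exp(-\theta \lVert x_i - x_j \rVert^2) = \exp(-\theta \lVert x_i \rVert^2) \exp(-\theta \lVert x_j \rVert^2) \exp(2 \theta m G_{ij})\) with \(G = X_s X_s' / m\), and can be checked numerically. The sketch below uses simulated markers standardized with scale() as a stand-in for Xs, and truncates the infinite sum at 20 terms (an arbitrary choice that is ample for small \(\theta\)). All object names are hypothetical.

```r
set.seed(4)
n <- 40; m <- 30
Xs_toy <- scale(matrix(rbinom(n * m, 1, 0.4), nrow = n))  # standardized toy markers
theta  <- 1e-4

# Gaussian kernel computed directly from squared Euclidean distances
D2 <- as.matrix(dist(Xs_toy))^2
GK <- exp(-theta * D2)

# Pieces of the decomposition
G      <- tcrossprod(Xs_toy) / m                 # G = Xs Xs' / m
Lambda <- diag(exp(-theta * rowSums(Xs_toy^2)))
H <- matrix(1, n, n)
for (k in 1:20) {
  # G^k is the element-wise (Hadamard) power in R
  H <- H + (2 * theta * m)^k / factorial(k) * G^k
}

GK_rec <- Lambda %*% H %*% Lambda
max(abs(GK - GK_rec))   # should be at the level of rounding error
```

Because the series is the Taylor expansion of the exponential, H can equivalently be computed as exp(2 * theta * m * G), element-wise.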
Evaluate the predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Call set.seed(0403) at the beginning of the code so that your analysis is reproducible.
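The resampling scheme can be sketched independently of the model. To keep the example self-contained, the block below swaps in a closed-form ridge regression with a fixed, hypothetical penalty (lambda) as a stand-in for Bayesian ridge regression; in the actual analysis that step would be replaced by a BGLR fit in which the test-set phenotypes are set to NA and predictions are read from the fitted values. The data are simulated and all names are hypothetical.

```r
set.seed(0403)   # R reads 0403 as the number 403
n <- 60; m <- 20
X_toy <- scale(matrix(rnorm(n * m), nrow = n))
y_toy <- as.vector(X_toy %*% rnorm(m, sd = 0.5) + rnorm(n))

n_rep <- 5; n_fold <- 3
lambda <- 10     # hypothetical ridge penalty for the stand-in model
cor_cv <- matrix(NA_real_, n_rep, n_fold)

for (r in seq_len(n_rep)) {
  # Random, nearly equal-sized fold labels, redrawn in every replicate
  fold <- sample(rep(seq_len(n_fold), length.out = n))
  for (f in seq_len(n_fold)) {
    tst <- which(fold == f)
    Xtr <- X_toy[-tst, , drop = FALSE]
    ytr <- y_toy[-tst]
    mu  <- mean(ytr)
    # Stand-in model: ridge solution (X'X + lambda I)^{-1} X'(y - mu)
    b    <- solve(crossprod(Xtr) + diag(lambda, m), crossprod(Xtr, ytr - mu))
    yhat <- mu + as.vector(X_toy[tst, , drop = FALSE] %*% b)
    cor_cv[r, f] <- cor(yhat, y_toy[tst])
  }
}

rowMeans(cor_cv)   # predictive correlation per replicate
mean(cor_cv)       # overall average
```

Drawing the fold labels inside the replicate loop is what makes the 5 repetitions differ from one another.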
Create two Gaussian kernel matrices (GK1 and GK2) whose lower-triangular elements have means of about 0.85 and 0.25, respectively.
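One possible approach is to tune the bandwidth \(\theta\) numerically: the mean of the off-diagonal kernel entries decreases monotonically as \(\theta\) grows, so a one-dimensional root finder such as uniroot() can locate the value that hits a target mean. The sketch uses simulated standardized markers and the unscaled kernel \(\exp(-\theta d^2_{ij})\); the helper functions and the search interval are hypothetical choices.

```r
set.seed(5)
n <- 40; m <- 50
Xs_toy <- scale(matrix(rnorm(n * m), nrow = n))   # simulated standardized markers
D2 <- as.matrix(dist(Xs_toy))^2                   # squared Euclidean distances

# Mean of the lower-triangular kernel entries for a given bandwidth
mean_lower <- function(theta) {
  K <- exp(-theta * D2)
  mean(K[lower.tri(K)])
}

# Bandwidth at which the lower-triangular mean equals the target
find_theta <- function(target) {
  uniroot(function(th) mean_lower(th) - target,
          interval = c(1e-8, 10), tol = 1e-10)$root
}

theta1 <- find_theta(0.85)
theta2 <- find_theta(0.25)
GK1_toy <- exp(-theta1 * D2)
GK2_toy <- exp(-theta2 * D2)

c(theta1 = theta1, theta2 = theta2)
c(mean_lower(theta1), mean_lower(theta2))
```

A simple grid search over \(\theta\) would serve equally well, since the target means only need to be matched approximately.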
Evaluate the predictive performance of the reproducing kernel Hilbert spaces (RKHS) regression model by repeating 3-fold cross-validation 5 times. Fit a multiple-kernel model by using GK1 and GK2 simultaneously. Again, call set.seed(0403) at the beginning of the code so that your analysis is reproducible.
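A sketch of a single train/test split with a two-kernel RKHS model, assuming the BGLR package is installed (the fit is skipped otherwise). The data are simulated, the two bandwidths are arbitrary placeholders rather than values tuned to the 0.85 and 0.25 targets, and the chain length is kept short for illustration. In a full analysis this fit would sit inside the repeated 3-fold loop.

```r
set.seed(0403)   # R reads 0403 as the number 403
n <- 30; m <- 40
Xs_toy <- scale(matrix(rnorm(n * m), nrow = n))   # simulated standardized markers
y_toy  <- rnorm(n)                                # simulated phenotype

D2 <- as.matrix(dist(Xs_toy))^2
GK1_toy <- exp(-0.002 * D2)   # placeholder bandwidth (flatter kernel)
GK2_toy <- exp(-0.02 * D2)    # placeholder bandwidth (sharper kernel)

# One RKHS term per kernel; both enter the linear predictor simultaneously
ETA_toy <- list(list(K = GK1_toy, model = "RKHS"),
                list(K = GK2_toy, model = "RKHS"))

# Mask the test-set phenotypes so that they are predicted rather than fitted
tst <- sample(n, 10)
yNA <- y_toy
yNA[tst] <- NA

if (requireNamespace("BGLR", quietly = TRUE)) {
  fm <- BGLR::BGLR(y = yNA, ETA = ETA_toy, nIter = 1200, burnIn = 200,
                   saveAt = file.path(tempdir(), "mk_"), verbose = FALSE)
  print(cor(fm$yHat[tst], y_toy[tst]))   # predictive correlation in the test set
}
```

Because the toy phenotype is pure noise, the printed correlation carries no meaning; only the structure of the ETA list and the NA masking are the point here.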