# 1 Data

This homework assignment involves data analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers at 9 locations. Phenotypes available are grain yield, grain volume weight, plant height, and flowering date. Both phenotypic and genotypic data are downloadable from DRYAD.

# phenotypes
pheno1 <- subset(pheno, Location ==1) # use location 1

# DArt markers
geno1 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=2, skip=3, na="-")
geno2 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=3, skip=3, na="-")
geno3 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=4, skip=3, na="-")
table(geno1[,1:7] == geno2[,1:7]) 
##
##  TRUE
## 13867
table(geno1[,1:7] == geno3[,1:7])
##
##  TRUE
## 13867
table(geno2[,1:7] == geno3[,1:7])
##
##  TRUE
## 13867
geno4 <- cbind(geno1, geno2[,-c(1:7)], geno3[, -c(1:7)])

## 1.1 Quality control

# quality control
qc.callrate <- which(geno4$CallRate < 95) qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index) # 1234
## [1] 1234
geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[,-c(1:7)])

## 1.2 Data cleaning

Check if the phenotype and genotype files have the same accessions.

# phenotype -> genotype
table(pheno1[,2] %in% rownames(geno6))
##
## FALSE  TRUE
##     4   276
na1a.index <- which(!pheno1[,2] %in% rownames(geno6))
pheno1[c(na1a.index),2] # "NE10522" "NW10568" "NE10570" "NE10583" -> found in the phenotype file but in the genotype file 
## [1] "NE10522" "NW10568" "NE10570" "NE10583"
# genotype -> phenotype
table(rownames(geno6) %in% pheno1[,2])
##
## FALSE  TRUE
##     2   276
na1b.index <- which(!rownames(geno6) %in% pheno1[,2])
rownames(geno6)[c(na1b.index)] # "Goodstreak" "Camelot" -> found in the genotype file but in the phenotype file 
## [1] "Goodstreak" "Camelot"
pheno1a <- pheno1[match(rownames(geno6), pheno1[,2]), ]
pheno1b <- pheno1a[-c(277:278), ]
geno7 <- geno6[-c(277:278), ]
table(pheno1b[,2] == rownames(geno7), useNA = "always")
##
## TRUE <NA>
##  276    0
table(pheno1b[,2] %in% rownames(geno7))
##
## TRUE
##  276
table(rownames(geno7) %in% pheno1b[,2])
##
## TRUE
##  276
# final phenotype object
y <- pheno1b\$yield # use grain yield

# final genotype object
geno <- geno7 # 276 x 747

## 1.3 Question 1

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object X.

## 1.4 Question 2

Perform a quality control by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X2.

## 1.5 Question 3

Standardize the genotype matrix from Question 2 to have a mean of zero and variance of one. Save this matrix as Xs.

## 1.6 Question 4

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that genetic variance is given by $$Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})$$. We can equate this $$\sigma^2_b \times \sum^m_{j=1} Var(x_{ij})$$ to the product of our prior expectation about the expected proportion of variance that is explained by the regression times an estimate of the phenotypic variance $$\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y$$. This results in $$\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}$$. Then equating this to the prior mode of the variance parameter of scaled-inverse chi square density gives $$\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}$$, where $$S_b$$ and $$df_b$$ are the prior scale and degrees of freedom of marker effects, respectively. Then solving for the prior scale parameter $$S_b$$ yields $$S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)$$. Compute the prior scale $$S_b$$ for the wheat data set according to the above derivation. Use $$df_b = 5$$ and $$R^2 = 0.5$$.

## 1.7 Question 5

Alternatively, we can use $$Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}$$, where $$n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2$$ is the average sum of squares of the genotypes. Recompute the prior scale $$S_b$$ based on this parameterization. Again, use $$df_b = 5$$ and $$R^2 = 0.5$$.

## 1.8 Question 6

Report a prior scale $$S_b$$ used for Bayesian ridge regression with default setting in the BGLR function. Which of the above rule-based priors is used in the BGLR? Hint: set nIter = 1, burnIn = 1 so that you can see the output.

## 1.9 Question 7

Jiang and Reif (2015) (DOI: 10.1534/genetics.115.177907) showed that a Gaussian kerenel matrix can be decomposed into $$GK = \Lambda \tilde{H} \Lambda$$, where $$\Lambda = diag[\exp (- \theta \sum^m_{k=1} x^2_{1k} ) , \cdots, \exp (- \theta\sum^m_{k=1} x^2_{nk})]$$, $$\tilde{H} = 1_{n \times n} + \sum^{\infty}_{k=1} \frac{(2 \theta m)^k}{k!} G^{\#k}$$, and $$G$$ is the second genomic relationship matrix of VanRaden (2008). Confirm that this is indeed true. Use the bandwidth parameter $$\theta = 0.0001$$.

## 1.10 Question 8

Evaluate predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the begining of the code so that your analysis is reporducible.

## 1.11 Question 9

Create two Gaussin kernel matrices (GK1 and GK2) where means of lower triangler matrix are about 0.85 and 0.25, respectively.

## 1.12 Question 10

Evaluate predictive performance of the reproducing Hilbert spaces regression model by repeating three-fold cross-validation 5 times. Fit a multiple kernel method by using GK1 and GK2 simultaneously. Again, type set.seed(0403) at the begining of the code so that your analysis is reporducible.