This homework assignment involves data analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers at 9 locations. Phenotypes available are grain yield, grain volume weight, plant height, and flowering date. Both phenotypic and genotypic data are downloadable from DRYAD.

```
# phenotypes
pheno <- read.csv("http://datadryad.org/bitstream/handle/10255/dryad.40880/phenotype.csv?sequence=1", header = TRUE, stringsAsFactors = FALSE)
pheno1 <- subset(pheno, Location ==1) # use location 1
# DArt markers
library(readxl)
geno1 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=2, skip=3, na="-")
geno2 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=3, skip=3, na="-")
geno3 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=4, skip=3, na="-")
table(geno1[,1:7] == geno2[,1:7])
```

```
##
## TRUE
## 13867
```

`table(geno1[,1:7] == geno3[,1:7])`

```
##
## TRUE
## 13867
```

`table(geno2[,1:7] == geno3[,1:7])`

```
##
## TRUE
## 13867
```

`geno4 <- cbind(geno1, geno2[,-c(1:7)], geno3[, -c(1:7)])`

```
# quality control
qc.callrate <- which(geno4$CallRate < 95)
qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index) # 1234
```

`## [1] 1234`

```
geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[,-c(1:7)])
```

Check if the phenotype and genotype files have the same accessions.

```
# phenotype -> genotype
table(pheno1[,2] %in% rownames(geno6))
```

```
##
## FALSE TRUE
## 4 276
```

```
na1a.index <- which(!pheno1[,2] %in% rownames(geno6))
pheno1[c(na1a.index),2] # "NE10522" "NW10568" "NE10570" "NE10583" -> found in the phenotype file but in the genotype file
```

`## [1] "NE10522" "NW10568" "NE10570" "NE10583"`

```
# genotype -> phenotype
table(rownames(geno6) %in% pheno1[,2])
```

```
##
## FALSE TRUE
## 2 276
```

```
na1b.index <- which(!rownames(geno6) %in% pheno1[,2])
rownames(geno6)[c(na1b.index)] # "Goodstreak" "Camelot" -> found in the genotype file but in the phenotype file
```

`## [1] "Goodstreak" "Camelot"`

```
pheno1a <- pheno1[match(rownames(geno6), pheno1[,2]), ]
pheno1b <- pheno1a[-c(277:278), ]
geno7 <- geno6[-c(277:278), ]
table(pheno1b[,2] == rownames(geno7), useNA = "always")
```

```
##
## TRUE <NA>
## 276 0
```

`table(pheno1b[,2] %in% rownames(geno7))`

```
##
## TRUE
## 276
```

`table(rownames(geno7) %in% pheno1b[,2])`

```
##
## TRUE
## 276
```

```
# final phenotype object
y <- pheno1b$yield # use grain yield
# final genotype object
geno <- geno7 # 276 x 747
```

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object `X`

.

Perform a quality control by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in `X2`

.

Standardize the genotype matrix from Question 2 to have a mean of zero and variance of one. Save this matrix as `Xs`

.

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that genetic variance is given by \(Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})\). We can equate this \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij})\) to the product of our prior expectation about the expected proportion of variance that is explained by the regression times an estimate of the phenotypic variance \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y\). This results in \(\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}\). Then equating this to the prior mode of the variance parameter of scaled-inverse chi square density gives \(\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}\), where \(S_b\) and \(df_b\) are the prior scale and degrees of freedom of marker effects, respectively. Then solving for the prior scale parameter \(S_b\) yields \(S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)\). Compute the prior scale \(S_b\) for the wheat data set according to the above derivation. Use \(df_b = 5\) and \(R^2 = 0.5\).

Alternatively, we can use \(Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}\), where \(n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2\) is the average sum of squares of the genotypes. Recompute the prior scale \(S_b\) based on this parameterization. Again, use \(df_b = 5\) and \(R^2 = 0.5\).

Report a prior scale \(S_b\) used for Bayesian ridge regression with default setting in the BGLR function. Which of the above rule-based priors is used in the BGLR? Hint: set `nIter`

= 1, `burnIn`

= 1 so that you can see the output.

Jiang and Reif (2015) (DOI: 10.1534/genetics.115.177907) showed that a Gaussian kerenel matrix can be decomposed into \(GK = \Lambda \tilde{H} \Lambda\), where \(\Lambda = diag[\exp (- \theta \sum^m_{k=1} x^2_{1k} ) , \cdots, \exp (- \theta\sum^m_{k=1} x^2_{nk})]\), \(\tilde{H} = 1_{n \times n} + \sum^{\infty}_{k=1} \frac{(2 \theta m)^k}{k!} G^{\#k}\), and \(G\) is the second genomic relationship matrix of VanRaden (2008). Confirm that this is indeed true. Use the bandwidth parameter \(\theta = 0.0001\).

Evaluate predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Use `set.seed(0403)`

at the begining of the code so that your analysis is reporducible.

Create two Gaussin kernel matrices (`GK1`

and `GK2`

) where means of lower triangler matrix are about 0.85 and 0.25, respectively.

Evaluate predictive performance of the reproducing Hilbert spaces regression model by repeating three-fold cross-validation 5 times. Fit a multiple kernel method by using `GK1`

and `GK2`

simultaneously. Again, type `set.seed(0403)`

at the begining of the code so that your analysis is reporducible.