1 Data

This homework assignment involves data analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers at 9 locations. Phenotypes available are grain yield, grain volume weight, plant height, and flowering date. Both phenotypic and genotypic data are downloadable from DRYAD.

# phenotypes 
pheno <- read.csv("http://datadryad.org/bitstream/handle/10255/dryad.40880/phenotype.csv?sequence=1", header = TRUE, stringsAsFactors = FALSE)
pheno1 <- subset(pheno, Location ==1) # use location 1

# DArt markers 
library(readxl)
geno1 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=2, skip=3, na="-")
geno2 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=3, skip=3, na="-")
geno3 <- read_excel("DartGenot.xls", col_names=TRUE, sheet=4, skip=3, na="-")
table(geno1[,1:7] == geno2[,1:7])

## 
##  TRUE 
## 13867

table(geno1[,1:7] == geno3[,1:7])

## 
##  TRUE 
## 13867

table(geno2[,1:7] == geno3[,1:7])

## 
##  TRUE 
## 13867

geno4 <- cbind(geno1, geno2[,-c(1:7)], geno3[, -c(1:7)])

1.1 Quality control

# quality control
qc.callrate <- which(geno4$CallRate < 95)
qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index) # 1234

## [1] 1234

geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[,-c(1:7)])

1.2 Data cleaning

Check if the phenotype and genotype files have the same accessions.

# phenotype -> genotype
table(pheno1[,2] %in% rownames(geno6))

## 
## FALSE  TRUE 
##     4   276

na1a.index <- which(!pheno1[,2] %in% rownames(geno6))
pheno1[c(na1a.index),2] # "NE10522" "NW10568" "NE10570" "NE10583" -> found in the phenotype file but in the genotype file

## [1] "NE10522" "NW10568" "NE10570" "NE10583"

# genotype -> phenotype 
table(rownames(geno6) %in% pheno1[,2])

## 
## FALSE  TRUE 
##     2   276

na1b.index <- which(!rownames(geno6) %in% pheno1[,2])
rownames(geno6)[c(na1b.index)] # "Goodstreak" "Camelot" -> found in the genotype file but in the phenotype file

## [1] "Goodstreak" "Camelot"

pheno1a <- pheno1[match(rownames(geno6), pheno1[,2]), ]
pheno1b <- pheno1a[-c(277:278), ]
geno7 <- geno6[-c(277:278), ]
table(pheno1b[,2] == rownames(geno7), useNA = "always")

## 
## TRUE <NA> 
##  276    0

table(pheno1b[,2] %in% rownames(geno7))

## 
## TRUE 
##  276

table(rownames(geno7) %in% pheno1b[,2])

## 
## TRUE 
##  276

# final phenotype object 
y <- pheno1b$yield # use grain yield 

# final genotype object 
geno <- geno7 # 276 x 747

1.3 Question 1

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object X.

1.4 Question 2

Perform a quality control by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X2.

1.5 Question 3

Standardize the genotype matrix from Question 2 to have a mean of zero and variance of one. Save this matrix as Xs.

1.6 Question 4

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that genetic variance is given by \(Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})\). We can equate this \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij})\) to the product of our prior expectation about the expected proportion of variance that is explained by the regression times an estimate of the phenotypic variance \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y\). This results in \(\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}\). Then equating this to the prior mode of the variance parameter of scaled-inverse chi square density gives \(\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}\), where \(S_b\) and \(df_b\) are the prior scale and degrees of freedom of marker effects, respectively. Then solving for the prior scale parameter \(S_b\) yields \(S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)\). Compute the prior scale \(S_b\) for the wheat data set according to the above derivation. Use \(df_b = 5\) and \(R^2 = 0.5\).

1.7 Question 5

Alternatively, we can use \(Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}\), where \(n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2\) is the average sum of squares of the genotypes. Recompute the prior scale \(S_b\) based on this parameterization. Again, use \(df_b = 5\) and \(R^2 = 0.5\).

1.8 Question 6

Report a prior scale \(S_b\) used for Bayesian ridge regression with default setting in the BGLR function. Which of the above rule-based priors is used in the BGLR? Hint: set nIter = 1, burnIn = 1 so that you can see the output.

1.9 Question 7

Jiang and Reif (2015) (DOI: 10.1534/genetics.115.177907) showed that a Gaussian kerenel matrix can be decomposed into \(GK = \Lambda \tilde{H} \Lambda\), where \(\Lambda = diag[\exp (- \theta \sum^m_{k=1} x^2_{1k} ) , \cdots, \exp (- \theta\sum^m_{k=1} x^2_{nk})]\), \(\tilde{H} = 1_{n \times n} + \sum^{\infty}_{k=1} \frac{(2 \theta m)^k}{k!} G^{\#k}\), and \(G\) is the second genomic relationship matrix of VanRaden (2008). Confirm that this is indeed true. Use the bandwidth parameter \(\theta = 0.0001\).

1.10 Question 8

Evaluate predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the begining of the code so that your analysis is reporducible.

1.11 Question 9

Create two Gaussin kernel matrices (GK1 and GK2) where means of lower triangler matrix are about 0.85 and 0.25, respectively.

1.12 Question 10

Evaluate predictive performance of the reproducing Hilbert spaces regression model by repeating three-fold cross-validation 5 times. Fit a multiple kernel method by using GK1 and GK2 simultaneously. Again, type set.seed(0403) at the begining of the code so that your analysis is reporducible.

ASCI 896 Statistical Genomics

Contents

0.0.1 Homework assignment 5

0.0.2 Due date: Friday, April 29, 5pm