## ASCI 896 Statistical Genomics

### Due date

Tuesday, April 18, 5pm

## Data

This homework assignment involves the analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers and evaluated at 9 locations. The available phenotypes are grain yield, grain volume weight, plant height, and flowering date. Both the phenotypic and genotypic data can be downloaded from DRYAD.

# phenotypes
header = TRUE, stringsAsFactors = FALSE)
pheno1 <- subset(pheno, Location == 1)  # use location 1

# DArT markers (read_excel() comes from the readxl package)
library(readxl)
geno1 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 2, skip = 3,
                    na = "-")
geno2 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 3, skip = 3,
                    na = "-")
geno3 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 4, skip = 3,
                    na = "-")
# check that the first 7 columns agree across the three sheets
table(geno1[, 1:7] == geno2[, 1:7])
table(geno1[, 1:7] == geno3[, 1:7])
table(geno2[, 1:7] == geno3[, 1:7])
# combine the sheets, keeping the first 7 columns only once
geno4 <- cbind(geno1, geno2[, -c(1:7)], geno3[, -c(1:7)])

## Quality control

# quality control
qc.callrate <- which(geno4$CallRate < 95)
qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index)  # 1234
geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[, -c(1:7)])

## Data cleaning

Check if the phenotype and genotype files have the same accessions.

# phenotype -> genotype
table(pheno1[, 2] %in% rownames(geno6))
na1a.index <- which(!pheno1[, 2] %in% rownames(geno6))
pheno1[c(na1a.index), 2]  # 'NE10522' 'NW10568' 'NE10570' 'NE10583' -> found in the phenotype file but not in the genotype file

# genotype -> phenotype
table(rownames(geno6) %in% pheno1[, 2])
na1b.index <- which(!rownames(geno6) %in% pheno1[, 2])
rownames(geno6)[c(na1b.index)]  # 'Goodstreak' 'Camelot' -> found in the genotype file but not in the phenotype file

pheno1a <- pheno1[match(rownames(geno6), pheno1[, 2]), ]
pheno1b <- pheno1a[-c(277:278), ]
geno7 <- geno6[-c(277:278), ]
table(pheno1b[, 2] == rownames(geno7), useNA = "always")
table(pheno1b[, 2] %in% rownames(geno7))
table(rownames(geno7) %in% pheno1b[, 2])

# final phenotype object
y <- pheno1b$yield  # use grain yield

# final genotype object
geno <- geno7  # 276 x 747

## Question 1

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object X.
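
As an illustration of mean imputation, here is a minimal sketch on a toy matrix rather than the wheat data. The names `geno_toy`, `impute_mean`, and `X_toy` are hypothetical and introduced only for this example, and the 0/1 coding is an assumption.

```r
# Toy genotype matrix: 5 individuals x 3 markers, coded 0/1 (assumed coding),
# with one missing value per marker. matrix() fills column by column.
geno_toy <- matrix(c(0, 1, NA, 1, 1,
                     1, NA, 1, 0, 0,
                     0, 0, 0, NA, 1),
                   nrow = 5, ncol = 3,
                   dimnames = list(paste0("id", 1:5), paste0("m", 1:3)))

# Hypothetical helper: replace each missing entry with the mean of the
# observed values in the same column (marker).
impute_mean <- function(M) {
  M <- as.matrix(M)
  col_means <- colMeans(M, na.rm = TRUE)
  na_pos <- which(is.na(M), arr.ind = TRUE)  # (row, col) index of each NA
  M[na_pos] <- col_means[na_pos[, "col"]]
  M
}

X_toy <- impute_mean(geno_toy)
X_toy
```

The same helper could in principle be applied to the `geno` object, provided it is stored as a numeric matrix.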

## Question 2

Perform quality control by removing markers with a minor allele frequency (MAF) below 0.05. How many markers are removed? Save the filtered genotype matrix in X2.
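
A MAF filter can be sketched as follows on a toy matrix. This assumes markers are coded 0/1, so that the column mean is the frequency of the allele coded 1; with 0/1/2 coding the mean would need to be divided by 2. The object names (`X_toy`, `maf`, `keep`, `X2_toy`) are hypothetical.

```r
# Toy marker matrix: 50 individuals x 4 markers, coded 0/1 (assumed coding).
n <- 50
X_toy <- cbind(
  m1 = rep(c(0, 1), times = n / 2),           # frequency 0.50
  m2 = c(1, rep(0, n - 1)),                   # frequency 0.02 -> MAF 0.02
  m3 = rep(c(0, 0, 1, 0, 0), times = n / 5),  # frequency 0.20
  m4 = c(rep(1, n - 2), 0, 0)                 # frequency 0.96 -> MAF 0.04
)

p <- colMeans(X_toy)      # frequency of the allele coded 1
maf <- pmin(p, 1 - p)     # minor allele frequency per marker
keep <- maf >= 0.05       # markers that pass the filter

X2_toy <- X_toy[, keep, drop = FALSE]
sum(!keep)                # number of markers removed
```

Using `drop = FALSE` keeps the result a matrix even if only one marker survives the filter.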

## Question 3

Standardize the genotype matrix from Question 2 to have a mean of zero and variance of one. Save this matrix as Xs.
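
One way to standardize columns in base R is `scale()`, which centers each column and divides it by its sample standard deviation (denominator n - 1). The sketch below uses a hypothetical toy matrix `X2_toy`, not the wheat data.

```r
# Toy filtered marker matrix: 6 individuals x 3 polymorphic markers.
X2_toy <- cbind(m1 = c(0, 1, 1, 0, 1, 0),
                m2 = c(1, 1, 0, 1, 1, 1),
                m3 = c(0, 0, 1, 1, 0, 1))

# Center each column to mean zero and scale it to unit sample variance.
Xs_toy <- scale(X2_toy, center = TRUE, scale = TRUE)

round(colMeans(Xs_toy), 10)  # column means, numerically zero
apply(Xs_toy, 2, var)        # column variances, all one
```

A monomorphic marker has zero variance and would produce `NaN` after scaling, which is one reason to filter on MAF first.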

## Question 4

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package.

Recall that the genetic variance is given by $$Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})$$. We can equate this quantity to the product of our prior expectation of the proportion of variance explained by the regression, $$R^2$$, and an estimate of the phenotypic variance, $$V_y$$: $$\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y$$. This results in $$\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}$$. Equating this to the prior mode of the variance parameter under a scaled-inverse chi-square density gives $$\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}$$, where $$S_b$$ and $$df_b$$ are the prior scale and degrees of freedom of marker effects, respectively. Solving for the prior scale parameter $$S_b$$ yields $$S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)$$.

Compute the prior scale $$S_b$$ for the wheat data set according to the above derivation. Use $$df_b = 5$$ and $$R^2 = 0.5$$.
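
The formula for $$S_b$$ can be written as a small function. The sketch below uses simulated data, and the names `prior_scale_var`, `Xs_toy`, and `y_toy` are hypothetical. Because the toy markers are standardized, each has variance one, so the sum of marker variances equals the number of markers.

```r
# Hypothetical helper implementing
#   S_b = R2 * V_y / sum_j Var(x_j) * (df_b + 2)
prior_scale_var <- function(y, X, R2 = 0.5, df_b = 5) {
  Vy <- var(y)                          # estimate of the phenotypic variance
  sum_var_x <- sum(apply(X, 2, var))    # sum of marker variances
  R2 * Vy / sum_var_x * (df_b + 2)
}

# Simulated toy data: 40 individuals, 10 standardized 0/1 markers.
set.seed(42)
n <- 40
m <- 10
Xs_toy <- scale(matrix(rbinom(n * m, 1, 0.5), nrow = n, ncol = m))
y_toy <- rnorm(n)

S_b_toy <- prior_scale_var(y_toy, Xs_toy, R2 = 0.5, df_b = 5)
S_b_toy
```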

## Question 5

Alternatively, we can use $$Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}$$, where $$n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2$$ is the average sum of squares of the genotypes. Recompute the prior scale $$S_b$$ based on this parameterization. Again, use $$df_b = 5$$ and $$R^2 = 0.5$$.
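
The sum-of-squares version differs from the variance-based rule only in the denominator. A minimal sketch on simulated data follows; `prior_scale_ss`, `Xs_toy`, and `y_toy` are hypothetical names. For standardized markers the average sum of squares is m(n - 1)/n, so the two rules give nearly the same value. If the columns are not centered, the sum of squares also includes the squared column means and the two rules can differ more.

```r
# Hypothetical helper implementing
#   S_b = R2 * V_y / (n^-1 * sum_i sum_j x_ij^2) * (df_b + 2)
prior_scale_ss <- function(y, X, R2 = 0.5, df_b = 5) {
  MSx <- sum(X^2) / nrow(X)   # average sum of squares of the genotypes
  R2 * var(y) / MSx * (df_b + 2)
}

# Simulated toy data: 40 individuals, 10 standardized 0/1 markers.
set.seed(42)
n <- 40
m <- 10
Xs_toy <- scale(matrix(rbinom(n * m, 1, 0.5), nrow = n, ncol = m))
y_toy <- rnorm(n)

MSx_toy <- sum(Xs_toy^2) / n
S_b_ss_toy <- prior_scale_ss(y_toy, Xs_toy, R2 = 0.5, df_b = 5)
c(MSx = MSx_toy, S_b = S_b_ss_toy)
```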

## Question 6

Report the prior scale $$S_b$$ used for Bayesian ridge regression under the default settings of the BGLR function. Which of the above rule-based priors does BGLR use? Hint: set nIter = 1 and burnIn = 1 so that you can see the output.

## Question 7

Evaluate the predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the beginning of the code so that your analysis is reproducible.
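
The mechanics of repeated k-fold cross-validation can be sketched independently of the model. In the sketch below, `make_folds`, `repeated_cv`, `ridge_predict`, and `brr_predict` are hypothetical helpers. To keep the example free of dependencies, it is run with a classical ridge regression stand-in (fixed, arbitrary penalty), which is not Bayesian ridge regression. `brr_predict` shows how a BGLR fit might be plugged in; it is defined but not called here, and its `nIter` and `burnIn` values are illustrative assumptions.

```r
# Randomly assign n individuals to k folds of (nearly) equal size.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Repeat k-fold CV `reps` times. `fit_predict(y_masked, X)` must return
# predictions for every row of X, using only the non-NA entries of y_masked.
repeated_cv <- function(y, X, fit_predict, k = 3, reps = 5) {
  acc <- matrix(NA_real_, nrow = reps, ncol = k,
                dimnames = list(paste0("rep", seq_len(reps)),
                                paste0("fold", seq_len(k))))
  for (r in seq_len(reps)) {
    folds <- make_folds(length(y), k)
    for (f in seq_len(k)) {
      test <- which(folds == f)
      y_masked <- y
      y_masked[test] <- NA                        # hide the test phenotypes
      y_hat <- fit_predict(y_masked, X)
      acc[r, f] <- cor(y[test], y_hat[test])      # predictive correlation
    }
  }
  acc
}

# Stand-in learner (NOT Bayesian): ridge regression with arbitrary lambda.
ridge_predict <- function(y_masked, X, lambda = 10) {
  train <- !is.na(y_masked)
  mu <- mean(y_masked[train])
  Xt <- X[train, , drop = FALSE]
  b <- solve(crossprod(Xt) + lambda * diag(ncol(X)),
             crossprod(Xt, y_masked[train] - mu))
  as.vector(mu + X %*% b)
}

# BGLR-based learner (requires the BGLR package; not called in this sketch).
brr_predict <- function(y_masked, X) {
  fm <- BGLR::BGLR(y = y_masked, ETA = list(list(X = X, model = "BRR")),
                   nIter = 6000, burnIn = 1000, verbose = FALSE,
                   saveAt = file.path(tempdir(), "cv_"))
  fm$yHat
}

# Simulated toy data, then 5 repeats of 3-fold CV with the stand-in learner.
set.seed(0403)
n <- 60
m <- 20
X_toy <- scale(matrix(rbinom(n * m, 1, 0.5), nrow = n, ncol = m))
y_toy <- as.vector(X_toy %*% rnorm(m, sd = 0.5) + rnorm(n))
acc <- repeated_cv(y_toy, X_toy, ridge_predict, k = 3, reps = 5)
mean(acc)
```

Masking the test phenotypes with `NA` matches how BGLR handles missing responses: they are predicted from the model and returned in `yHat`.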

## Question 8

Evaluate the predictive performance of the Bayesian LASSO regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the beginning of the code so that your analysis is reproducible. Compare the predictive accuracies of Bayesian ridge regression and the Bayesian LASSO.
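
When comparing two models, it helps if both are evaluated on identical training and testing partitions. MCMC samplers draw random numbers, so folds drawn inside the fitting loop may differ between models even with the same seed. One option, sketched below under that assumption, is to generate all fold assignments up front. The helper `make_folds` and the object names are hypothetical, and 276 is the number of accessions in the final genotype object.

```r
# Randomly assign n individuals to k folds of (nearly) equal size.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

n_acc <- 276   # accessions in the final data set
k <- 3         # folds
reps <- 5      # repetitions

# Draw all fold assignments before any model is fitted:
# one column per repetition, one row per accession.
set.seed(0403)
folds_model_a <- replicate(reps, make_folds(n_acc, k))

# Resetting the seed reproduces exactly the same partitions for a second model.
set.seed(0403)
folds_model_b <- replicate(reps, make_folds(n_acc, k))

identical(folds_model_a, folds_model_b)
apply(folds_model_a, 2, table)   # fold sizes in each repetition
```

With shared partitions, accuracies from the two models can be compared fold by fold rather than only through their overall means.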

April 6, 2017