ASCI 896 Statistical Genomics

Data
Quality control
Data cleaning
Question 1
Question 2
Question 3
Question 4
Question 5
Question 6
Question 7
Question 8

Due date

Tuesday, April 18, 5pm

Data

This homework assignment involves data analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers at 9 locations. Phenotypes available are grain yield, grain volume weight, plant height, and flowering date. Both phenotypic and genotypic data are downloadable from DRYAD.

# phenotypes
pheno <- read.csv("http://datadryad.org/bitstream/handle/10255/dryad.40880/phenotype.csv?sequence=1", 
    header = TRUE, stringsAsFactors = FALSE)
pheno1 <- subset(pheno, Location == 1)  # use location 1

# DArt markers
library(readxl)
geno1 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 2, skip = 3, 
    na = "-")
geno2 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 3, skip = 3, 
    na = "-")
geno3 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 4, skip = 3, 
    na = "-")
table(geno1[, 1:7] == geno2[, 1:7])
table(geno1[, 1:7] == geno3[, 1:7])
table(geno2[, 1:7] == geno3[, 1:7])
geno4 <- cbind(geno1, geno2[, -c(1:7)], geno3[, -c(1:7)])

Quality control

# quality control
qc.callrate <- which(geno4$CallRate < 95)
qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index)  # 1234
geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[, -c(1:7)])

Data cleaning

Check if the phenotype and genotype files have the same accessions.

# phenotype -> genotype
table(pheno1[, 2] %in% rownames(geno6))
na1a.index <- which(!pheno1[, 2] %in% rownames(geno6))
pheno1[c(na1a.index), 2]  # 'NE10522' 'NW10568' 'NE10570' 'NE10583' -> found in the phenotype file but in the genotype file 

# genotype -> phenotype
table(rownames(geno6) %in% pheno1[, 2])
na1b.index <- which(!rownames(geno6) %in% pheno1[, 2])
rownames(geno6)[c(na1b.index)]  # 'Goodstreak' 'Camelot' -> found in the genotype file but in the phenotype file 

pheno1a <- pheno1[match(rownames(geno6), pheno1[, 2]), ]
pheno1b <- pheno1a[-c(277:278), ]
geno7 <- geno6[-c(277:278), ]
table(pheno1b[, 2] == rownames(geno7), useNA = "always")
table(pheno1b[, 2] %in% rownames(geno7))
table(rownames(geno7) %in% pheno1b[, 2])

# final phenotype object
y <- pheno1b$yield  # use grain yield 

# final genotype object
geno <- geno7  # 276 x 747

Question 1

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object X.

Question 2

Perform a quality control by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X2.

Question 3

Standardize the genotype matrix from Question 2 to have a mean of zero and variance of one. Save this matrix as Xs.

Question 4

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that genetic variance is given by \(Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})\). We can equate this \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij})\) to the product of our prior expectation about the expected proportion of variance that is explained by the regression times an estimate of the phenotypic variance \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y\). This results in \(\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}\). Then equating this to the prior mode of the variance parameter of scaled-inverse chi square density gives \(\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}\), where \(S_b\) and \(df_b\) are the prior scale and degrees of freedom of marker effects, respectively. Then solving for the prior scale parameter \(S_b\) yields \(S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)\). Compute the prior scale \(S_b\) for the wheat data set according to the above derivation. Use \(df_b = 5\) and \(R^2 = 0.5\).

Question 5

Alternatively, we can use \(Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}\), where \(n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2\) is the average sum of squares of the genotypes. Recompute the prior scale \(S_b\) based on this parameterization. Again, use \(df_b = 5\) and \(R^2 = 0.5\).

Question 6

Report a prior scale \(S_b\) used for Bayesian ridge regression with default setting in the BGLR function. Which of the above rule-based priors is used in the BGLR? Hint: set nIter = 1, burnIn = 1 so that you can see the output.

Question 7

Evaluate predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the begining of the code so that your analysis is reporducible.

Question 8

Evaluate predictive performance of the Bayesian LASSO regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the begining of the code so that your analysis is reporducible. Compare predictive accuracies between the Bayesian ridge regression and Bayesian LASSO.

ASCI 896 Statistical Genomics

Homework assignment 6