ASCI 896 Statistical Genomics

Due date

Tuesday, April 18, 5pm


This homework assignment involves data analysis of 280 winter wheat accessions genotyped with 1083 Diversity Array Technology (DArT) markers at 9 locations. Phenotypes available are grain yield, grain volume weight, plant height, and flowering date. Both phenotypic and genotypic data are downloadable from DRYAD.

# phenotypes
pheno <- read.csv("", 
    header = TRUE, stringsAsFactors = FALSE)
pheno1 <- subset(pheno, Location == 1)  # use location 1

# DArt markers
geno1 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 2, skip = 3, 
    na = "-")
geno2 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 3, skip = 3, 
    na = "-")
geno3 <- read_excel("DartGenot.xls", col_names = TRUE, sheet = 4, skip = 3, 
    na = "-")
table(geno1[, 1:7] == geno2[, 1:7])
table(geno1[, 1:7] == geno3[, 1:7])
table(geno2[, 1:7] == geno3[, 1:7])
geno4 <- cbind(geno1, geno2[, -c(1:7)], geno3[, -c(1:7)])

Quality control

# quality control
qc.callrate <- which(geno4$CallRate < 95)
qc.p <- which(geno4$P < 80)
qc.index <- unique(c(qc.callrate, qc.p))
length(qc.index)  # 1234
geno5 <- geno4[-qc.index, ]
geno6 <- t(geno5[, -c(1:7)])

Data cleaning

Check if the phenotype and genotype files have the same accessions.

# phenotype -> genotype
table(pheno1[, 2] %in% rownames(geno6))
na1a.index <- which(!pheno1[, 2] %in% rownames(geno6))
pheno1[c(na1a.index), 2]  # 'NE10522' 'NW10568' 'NE10570' 'NE10583' -> found in the phenotype file but in the genotype file 

# genotype -> phenotype
table(rownames(geno6) %in% pheno1[, 2])
na1b.index <- which(!rownames(geno6) %in% pheno1[, 2])
rownames(geno6)[c(na1b.index)]  # 'Goodstreak' 'Camelot' -> found in the genotype file but in the phenotype file 

pheno1a <- pheno1[match(rownames(geno6), pheno1[, 2]), ]
pheno1b <- pheno1a[-c(277:278), ]
geno7 <- geno6[-c(277:278), ]
table(pheno1b[, 2] == rownames(geno7), useNA = "always")
table(pheno1b[, 2] %in% rownames(geno7))
table(rownames(geno7) %in% pheno1b[, 2])

# final phenotype object
y <- pheno1b$yield  # use grain yield 

# final genotype object
geno <- geno7  # 276 x 747

Question 1

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object X.

Question 2

Perform a quality control by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in X2.

Question 3

Standardize the genotype matrix from Question 2 to have a mean of zero and variance of one. Save this matrix as Xs.

Question 4

We will determine a prior for marker effects in Bayesian ridge regression. Read Perez and de los Campos (2014) (10.1534/genetics.114.164442) to learn more about the BGLR R package. Recall that genetic variance is given by \(Var(g_i) = \sigma^2_b \sum^m_{j=1} Var(x_{ij})\). We can equate this \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij})\) to the product of our prior expectation about the expected proportion of variance that is explained by the regression times an estimate of the phenotypic variance \(\sigma^2_b \times \sum^m_{j=1} Var(x_{ij}) = R^2 V_y\). This results in \(\sigma^2_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})}\). Then equating this to the prior mode of the variance parameter of scaled-inverse chi square density gives \(\frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} = \frac{S_b}{df_b + 2}\), where \(S_b\) and \(df_b\) are the prior scale and degrees of freedom of marker effects, respectively. Then solving for the prior scale parameter \(S_b\) yields \(S_b = \frac{R^2 V_y}{\sum^m_{j=1} Var(x_{ij})} (df_b + 2)\). Compute the prior scale \(S_b\) for the wheat data set according to the above derivation. Use \(df_b = 5\) and \(R^2 = 0.5\).

Question 5

Alternatively, we can use \(Var(g_i) = \sigma^2_b \times n^{-1} \sum^n_{i=1} \sum^m_{j=1}x^2_{ij}\), where \(n^{-1} \sum^n_{i=1} \sum^m_{j=1}x_{ij}^2\) is the average sum of squares of the genotypes. Recompute the prior scale \(S_b\) based on this parameterization. Again, use \(df_b = 5\) and \(R^2 = 0.5\).

Question 6

Report a prior scale \(S_b\) used for Bayesian ridge regression with default setting in the BGLR function. Which of the above rule-based priors is used in the BGLR? Hint: set nIter = 1, burnIn = 1 so that you can see the output.

Question 7

Evaluate predictive performance of the Bayesian ridge regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the begining of the code so that your analysis is reporducible.

Question 8

Evaluate predictive performance of the Bayesian LASSO regression model by repeating 3-fold cross-validation 5 times. Use set.seed(0403) at the begining of the code so that your analysis is reporducible. Compare predictive accuracies between the Bayesian ridge regression and Bayesian LASSO.

Gota Morota

April 6, 2017