Due date
Friday, May 8, 5pm
Maize data
Download the maize data set available at https://github.com/morota/apsc5984-2020/raw/gh-pages/final/dat.Rda. This dataset contains 264 maize phenotypes and 1,135 SNP markers.
rm(list=ls())
load("dat.Rda")
ls()
dim(W)
Wcs <- scale(W, center = TRUE, scale = TRUE) # centered and scaled marker matrix
dim(Wcs)
length(y)
Question 1
Report the mean of minor allele frequency.
Question 2
Compute the second genomic relationship matrix of VanRaden (2008) G
using the entire markers (Wcs
). Fit GBLUP by using the mixed.solve
function in the rrBLUP R package. Report the estimates of intercept and genomic heritability.
Question 3
Write down the statistical model you used in Question 2. Explain what each term is.
Question 4
Fit RR-GBLUP by using the mixed.solve
function in the rrBLUP R package. Use the Wcs
marker matrix. Report the estimate of genomic heritability. The estimates of genomic heritability from GBLUP and RR-BLUP should be the same. Explain why.
Question 5
Write down the statistical model you used in Question 4. Explain what each term is.
Question 6
Fit Bayesian ridge regression (BRR) by using the BGLR R package. Take 5,000 samples with 1,000 burin-in. Set set.seed(100)
. Use the Wcs
marker matrix. Report the estimate of genomic heritability. The estimate of genomic heritability from BRR should be similar to those of GBLUP and RR-BLUP. Explain why.
Cross-validation
We will first randomly partition the data into the 200 lines of reference set and the 64 lines of validation set.
set.seed(100)
trn.index <- sample(1:nrow(Wcs), 200, replace=FALSE)
tst.index <- setdiff(1:nrow(Wcs), trn.index)
y.trn <- y[trn.index] # training phenotype data
y.tst <- y[tst.index] # testing phenotype data
Wcs.trn <- Wcs[trn.index, ] # training marker matrix data
Wcs.tst <- Wcs[tst.index, ] # testing marker matrix data
Question 7
Perform single-marker mixed linear model GWAS using the rrBLUP package. Use the reference phenotype y.trn
and the reference marker matrix Wcs.trn
. Create a Manhattan plot and find out which markers are significant under the Bonferroni correction. Add the genome-wide significance level to a Manhattan plot by setting genomewideline
argument in the qqman R package. Additionaly, set suggestiveline = FALSE
. Use the nominal statistical significance threshold of \(\alpha = 0.05\). Assume that all markers are from chromosome 1 and their base pair positions are located sequentially.
Question 8
Write down the statistical model you used in Question 7. Explain what each term is.
Question 9
Continue Question 7, but identify significant markers using the multiple testing correction of Li and Ji (2005). Add the genome-wide significance level to a Manhattan plot by setting genomewideline
argument in the qqman R package. Use the nominal statistical significance threshold of \(\alpha = 0.05\). If the number of significant markers is different between the Bonferroni and Li and Ji corrections, explain why.
Question 10
Select the top 100 biggest effect GWAS markers from Questions 7-9. Perform RR-BLUP genomic prediction using only those top 100 markers. Evalute cross-validation based predictive performance by training the model using the reference set and predict the validation set. Report the estimate of predictive correlation.
Question 11
Repeat Question 10, but this time perform RR-BLUP genomic prediction by fitting the whole markers simultaneously regardless of their statistical significance. Evalute prediction performance using cross-validation. Report the estimate of predictive correlation. Explain why the predictive correlation from Question 11 decreased or increased compared to that of Question 10.