Homework assignment 2

Due date: Friday, February 26, 5pm

Data 1

We will analyze Arabidopsis recombinant inbred lines available at the Bay-0 x Shahdara project. Download phenotypic and genotypic data. The target trait in this homework assignment is Flowering time in Long Days measured as the number of days between germination and bolting (the beginning of the inflorescence growth). You can learn more about the Bay-0 x Shahdara population in Loudet et al., (2002).

library(readxl)
# phenotypes 
bayxshaY <- read_excel("BayxSha_PublishedPheno.xls", col_names=FALSE, sheet=1, skip=4)
y0 <- bayxshaY[,6]
y0 <- y0[-c((length(bayxshaY[,6])-1):(length(bayxshaY[,6])))]
y1 <- y0[-which(is.na(y0))]
y1 <- scale(y1)
  
# genotypes 
bayxshaG <- read_excel("BayxSha_2_Genotypes.xls", col_names=FALSE, skip=5)
bayxshaG.id <- bayxshaG[,1]
bayxshaG <- bayxshaG[,-1]
bayxshaG[bayxshaG == "A"] <- 2
bayxshaG[bayxshaG == "C"] <- 1
bayxshaG[bayxshaG == "B"] <- 0
bayxshaG[bayxshaG == "D"] <- NA
bayxshaG <- as.matrix(bayxshaG)
mode(bayxshaG) <- "numeric"
bayxshaG <- bayxshaG[-which(is.na(y0)), ]

Question 1.1

What are the dimensions of y1 and bayxshaG?

Question 1.2

Replace missing marker genotypes with mean values. Then store marker genotypes in a matrix object W1.

Question 1.3

Apply a singular value decomposition (SVD) to W1 and fit a principal component regression using OLS. Show that the regression of phenotypes on marker covariates is equivalent to the regression of phenotypes on principal components (eigenvectors) computed from marker genotypes. Hint: One way to answer this question is to show that predicted genetic values \(\hat{g}\) are equal.

Question 1.4

Fit a standard ridge regression and a SVD-based ridge regression. Use \(\lambda = 5\). Verify that predicted genetic values are equal.

Question 1.5

Compute marker effects using SVD-based ridge regression and SVD-based OLS. Recall that the effect of ridge is to replace \(D^{-1}\) in the OLS expression with \((D^2 + \lambda I)^{-1}D\). Show that ridge estimates are shrunk toward zero because these amounts are indeed smaller in the ridge regression.

Data 2

Download barley data from the Triticeae Toolbox (T3). This web portal contains barley data generated by the Barley CAP project. YouTube tutorials are available if this is your first time to access T3. (Youtube1 and Youtube2).

pheno <- read.table(file="download_HVXF/traits.txt", header=TRUE, stringsAsFactors = FALSE, sep='\t')
geno <- read.table(file="download_HVXF/snpfile.txt", header=TRUE, stringsAsFactors = FALSE, check.names = FALSE)
pheno2 <- pheno[match(rownames(geno), pheno[,1]),]
table(pheno2[,1] %in% rownames(geno))
## 
## TRUE 
##  399
table(pheno2[,1] == rownames(geno))
## 
## TRUE 
##  399
y2 <- pheno2[,3]
y2 <- scale(y2)

Question 2.1

What are the dimensions of y2 and geno?

Question 2.2

Replace missing marker genotypes with mean values. Then store marker genotypes in a matrix W2.

Question 2.3

Recall that ridge estimates can be expressed as \(\hat{\beta}^{ridge}_j = \frac{\sum^n_{i=1} x^2_{ij} }{\sum^n_{i=1} x^2_{ij} + \lambda} \times \hat{\beta}^{ols}_{j}\). Create a centered genotype matrix W2c and calculate the amounts of shrinkage for all markers. Show that the greatest shrinkage is achieved for the marker with the most extreme minor allele frequency. Use \(\lambda = 1000\).

Question 2.4

Show that we cannot fit OLS using the solve() function when \(n < m\) but the ridge regression can circumvent this problem. Create a scatter plot of observed y2 vs. predicted y2. Use \(\lambda = 1000\).