## ASCI 896 Statistical Genomics

### Due date

Thursday, March 2, 5pm

# Arabidopsis data

We will analyze Arabidopsis recombinant inbred lines available at the Bay-0 x Shahdara project. Download the phenotypic and genotypic data. The target trait in this homework assignment is Flowering time in Long Days measured as the number of days between germination and bolting (the beginning of the inflorescence growth). You can learn more about the Bay-0 x Shahdara population in Loudet et al., (2002).

library(readxl)
# phenotypes
bayxshaY <- read_excel("BayxSha_PublishedPheno.xls", col_names = FALSE, sheet = 1,
skip = 4)
y0 <- bayxshaY[, 6]
y0 <- y0[-c((length(bayxshaY[, 6]) - 1):(length(bayxshaY[, 6])))]
y1 <- y0[-which(is.na(y0))]
y1 <- scale(y1)

# genotypes
bayxshaG <- read_excel("BayxSha_2_Genotypes.xls", col_names = FALSE, skip = 5)
bayxshaG.id <- bayxshaG[, 1]
bayxshaG <- bayxshaG[, -1]
bayxshaG[bayxshaG == "A"] <- 2
bayxshaG[bayxshaG == "C"] <- 1
bayxshaG[bayxshaG == "B"] <- 0
bayxshaG[bayxshaG == "D"] <- NA
bayxshaG <- as.matrix(bayxshaG)
mode(bayxshaG) <- "numeric"
bayxshaG <- bayxshaG[-which(is.na(y0)), ]

## Question 1

What are the dimensions of y1 and bayxshaG?

## Question 2

How many markers contain missing genotypes (NA) ? Use the function is.na().

## Question 3

Replace missing marker genotypes with mean values in a column-wise fashion (i.e., compute the mean of each column and replace NAs in that column). Then store marker genotypes in a matrix variable W1. Show that the variable W1 does not include any NA.

## Question 4

Apply a singular value decomposition (SVD) to the genotype matrix W1 (i.e., $$\mathbf{W}_1 = \mathbf{UDV'}$$). Show that 1) $$\mathbf{U'U}= \mathbf{I}$$, 2) $$\mathbf{V'V} = \mathbf{I}$$, 3) $$\mathbf{VV}'= \mathbf{I}$$, 4) $$\mathbf{V^{-1}} = \mathbf{V'}$$, and 5) $$\mathbf{V'}^{-1} = \mathbf{V}$$. Use the function svd().

## Question 5

Fit a standard OLS and a principal component-based OLS. Show that the regression of phenotypes on marker covariates is equivalent to the regression of phenotypes on principal components (eigenvectors) computed from marker genotypes. Hint: One way to answer this question is to show that their predicted genetic values are equal.

## Question 6

In general, the ridge regression has a higher predictive ability than OLS because the vector of ridge estimators yields a smaller variance than that of the vector of OLS estimators. Fit a standard ridge regression. Use $$\lambda = 5$$. Show that the variance of marker effect estimates is smaller in the ridge regression than that of OLS.

## Question 7

Fit a standard ridge regression and a SVD-based ridge regression. Use $$\lambda = 5$$. Show that their predicted genetic values are equal.

## Question 8

Recall that the effect of ridge is to replace $$D^{-1}$$ in the OLS expression with $$(D^2 + \lambda I)^{-1}D$$. Show that ridge estimates are shrunk toward zero because these amounts are indeed smaller in the ridge regression than those in OLS. Use $$\lambda = 5$$.

# Wheat data

We will analyze the wheat data from Pérez-Rodríguez et al., (2012). (doi:10.1534/g3.112.003665). Download the supporting data File S1. The R object YLD_1.RData includes 306 elite wheat lines genotyped with 1,717 diversity array technology (DArT) markers. Note that the DArT marker is coded either 0 or 1 indicating the presence or absence of allele at a given locus. The marker information is in the Marker variable.

rm(list = ls())
W2 <- Markers
dim(W2)

## Question 9

Recall that ridge estimates can be expressed as $$\hat{\beta}^{ridge}_j = \frac{\sum^n_{i=1} x^2_{ij} }{\sum^n_{i=1} x^2_{ij} + \lambda} \times \hat{\beta}^{ols}_{j}$$. Create a centered genotype matrix W2c. Calculate the minor allele frequencies (MAF) and the amounts of shrinkage for all markers. Plot MAF on the X axis and the amounts of shrinkage on the Y axis. Interpret the figure you obtained. Use $$\lambda = 100$$.

# Mouse data

We will analyze the mouse data from Valdar et al., (2006) (doi:10.1038/ng1840). This dataset is available in the BGLR R package. The mice.X variable contains 1,814 individuals and 10,346 SNPs.

library(BGLR)
data(mice)
?(mice)
W3 <- mice.X
dim(W3)

## Question 10

Calculate MAF and the amounts of shrinkage for all markers. Create a centered genotype matrix W3c. Plot MAF on the X axis and the amounts of shrinkage on the Y axis. Interpret the figure you obtained. Does the pattern you observe here agree with the wheat dataset from Question 9? If not, explain why. Use $$\lambda = 100$$.

# Maize data

Download the maize data File S1 from Crossa et al., (2010) available at 10.1534/genetics.110.118521. The dataCorn_SS.RData contains 264 corn yield phenotypes evaluated under severe drought stress conditions. The variable X includes 1,135 SNP markers.

rm(list = ls())
dim(W4)
Calculate MAF and the amounts of shrinkage for all markers. Create a centered genotype matrix W4c. Plot MAF on the X axis and the amounts of shrinkage on the Y axis. Interpret the figure you obtained. Does the pattern you observe agree with the wheat dataset from Question 9? If not, explain why. Use $$\lambda = 100$$.
Continue analyzing the maize dataset. The vector of phenotypes is included in the y variable. Recall that the dimension of predictors is 264 by 1,135. Show that we cannot fit OLS using the solve() function when $$n < m$$ but the ridge regression can circumvent this problem. Create a scatter plot of observed y vs. predicted y. Use the variables y and W4. Set $$\lambda = 100$$.