APSC 5984 Complex Trait Genomics

Due date

Friday, February 28, 5pm

Data

For this assignment, we are going to use the cattle data included in the synbreedData package. Learn more about the Synbreed project and synbreed R packages.

library(synbreed)
library(synbreedData)
help(package = "synbreedData")
data(cattle)
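?cattle                                    # help page for the cattle data
pheno <- as.matrix(cattle$pheno[, 1, 1])   # first trait, first replicate of the phenotype array
pheno <- scale(pheno)                      # center and scale the phenotype
dim(cattle$geno)                           # individuals x markers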
set.seed(100)
cattleG <- codeGeno(cattle, impute = TRUE, impute.type = "random", reference.allele = "minor")  # genotype imputation
W <- cattleG$geno  # marker genotype matrix 

Question 1

Compute the allele frequency of each SNP marker. Recall that the expectation of a marker genotype, \(E(W)\), is \(2p\), where \(p\) is the frequency of the reference allele. Verify that \(2p\) equals the mean of each marker genotype obtained from the colMeans() function. Use the all.equal() function.
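
The identity \(E(W) = 2p\) can be illustrated on simulated data before turning to the cattle genotypes. The sketch below is only an illustration: the object names (W_toy, p_true, p_hat) and the simulated frequencies are hypothetical, and genotypes are assumed to be coded as 0, 1, or 2 copies of the reference allele.

```r
# Toy example on simulated genotypes (not the cattle data).
set.seed(1)
n <- 200                                  # individuals
m <- 5                                    # markers
p_true <- runif(m, 0.1, 0.5)              # assumed reference allele frequencies

# Each entry is the count (0, 1, or 2) of the reference allele.
W_toy <- sapply(p_true, function(p) rbinom(n, size = 2, prob = p))

# Allele frequency by counting alleles: total copies / (2 * individuals).
p_hat <- colSums(W_toy) / (2 * n)

# The column mean of the genotype matrix equals 2p.
all.equal(2 * p_hat, colMeans(W_toy))
```

Here all.equal() compares the two vectors up to a small numerical tolerance, which is why it is preferred over == for floating-point results.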

Question 2

Recall that the variance of a marker genotype, \(Var(W)\), is \(2p(1-p)\) under Hardy-Weinberg equilibrium. Verify that \(2p(1-p)\) is close to the variance of each marker genotype obtained from the var() function.
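
The word "close" matters here. The expression \(2p(1-p)\) assumes Hardy-Weinberg proportions, while var() reports the sample variance with an \(n-1\) denominator, so the two agree only approximately. A minimal sketch on one simulated marker, with hypothetical names (w_toy, p_hat) and an assumed frequency of 0.3:

```r
# Toy example: one simulated marker in Hardy-Weinberg proportions.
set.seed(2)
n <- 5000
w_toy <- rbinom(n, size = 2, prob = 0.3)   # genotype = reference allele count

p_hat <- mean(w_toy) / 2                    # estimated allele frequency
expected_var <- 2 * p_hat * (1 - p_hat)     # variance implied by HWE
observed_var <- var(w_toy)                  # sample variance (n - 1 denominator)

c(expected = expected_var, observed = observed_var)
```

With real genotypes, departures from Hardy-Weinberg proportions and the effect of imputation can widen the gap between the two values.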

Question 3

Create a new marker matrix X from W by recoding the markers so that the three genotypes \(AA\), \(Aa\), and \(aa\) are coded as 1, 0, and -1, respectively.

Recall that the expectation of the genotype, \(E(X)\), is \(2p-1\), where \(p\) is the frequency of the reference allele. Verify that \(2p-1\) equals the mean of each genotype obtained from the colMeans() function.

Question 4

Recall that the variance of the genotype, \(Var(X)\), is unchanged by the recoding and is still \(2p(1-p)\). Verify that \(2p(1-p)\) is close to the variance of each genotype obtained from the var() function.

Question 5

Verify that the centered marker codes, \(W - E(W)\) and \(X - E(X)\), are identical regardless of how the markers are coded.
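
The recoding and centering steps can be sketched on simulated data. The example assumes that a 0/1/2 matrix counts the reference allele \(A\), so that \(AA = 2\) and subtracting 1 gives the 1/0/-1 coding. All object names (W_toy, X_toy, Wc, Xc) are hypothetical.

```r
# Toy example: shifting the coding does not change the centered genotypes.
set.seed(3)
n <- 100
W_toy <- sapply(c(0.2, 0.3, 0.4, 0.5),
                function(p) rbinom(n, size = 2, prob = p))

X_toy <- W_toy - 1                       # AA = 1, Aa = 0, aa = -1
p_hat <- colMeans(W_toy) / 2             # reference allele frequency

# Mean of the recoded genotypes equals 2p - 1.
all.equal(colMeans(X_toy), 2 * p_hat - 1)

# Subtract the column expectations (sweep() subtracts by default).
Wc <- sweep(W_toy, 2, 2 * p_hat)         # W - E(W)
Xc <- sweep(X_toy, 2, 2 * p_hat - 1)     # X - E(X)
all.equal(Wc, Xc)
```

Because the two codings differ only by a constant shift, the shift cancels once each column is centered.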

Question 6

Recode the SNP genotypes so that the major allele is now treated as the reference allele. Store the new coding in the W2 variable.

# Recode so that AA -> 0, Aa -> 1, and aa -> 2. 
W2 <- W
W2[W2==0] <- 3
W2[W2==2] <- 0
W2[W2==3] <- 2

Compute the allele frequency of each SNP marker using W2. Compare your result with the allele frequency obtained from W.
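
Swapping the codes 0 and 2 is the same operation as computing 2 minus the genotype, which gives a compact way to check the recoding on a small example. The sketch below uses simulated genotypes and hypothetical names (W_toy, W2_toy, p1, p2); it is an illustration of the recoding, not the answer for the cattle data.

```r
# Toy example: switching the reference allele on simulated genotypes.
set.seed(4)
W_toy <- sapply(c(0.10, 0.25, 0.40),
                function(p) rbinom(150, size = 2, prob = p))

# Three-step swap, as in the assignment code.
W2_swap <- W_toy
W2_swap[W2_swap == 0] <- 3
W2_swap[W2_swap == 2] <- 0
W2_swap[W2_swap == 3] <- 2

# One-line equivalent.
W2_toy <- 2 - W_toy
identical(W2_swap, W2_toy)

p1 <- colMeans(W_toy) / 2                # frequency of the original reference allele
p2 <- colMeans(W2_toy) / 2               # frequency of the new reference allele
cbind(p1, p2, total = p1 + p2)
```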

Question 7

Create a new variable W3 by subsetting the first 10 markers of the W matrix. The dimension of W3 is equal to \(500 \times 10\). Verify that the covariance between allelic counts is \(Cov(W3[,i], W3[,j]) \approx 2D\), where \(D\) is the estimate of linkage disequilibrium. Use the W3 object and the LD() function from the genetics package to obtain \(D\).
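
The relationship \(Cov(W_i, W_j) \approx 2D\) can be seen on simulated data where the haplotype phase is known. The sketch below is an illustration under stated assumptions: the haplotype frequencies are hypothetical, mating is random, and \(D\) is computed directly from the simulated haplotypes rather than estimated from unphased genotypes as the LD() function does.

```r
# Toy example: two loci simulated from known haplotype frequencies.
set.seed(5)
n <- 2000
hap_freq <- c(AB = 0.4, Ab = 0.1, aB = 0.1, ab = 0.4)   # assumed values
haps <- rbind(AB = c(1, 1), Ab = c(1, 0), aB = c(0, 1), ab = c(0, 0))

# Each individual receives two independently sampled haplotypes.
h1 <- haps[sample(4, n, replace = TRUE, prob = hap_freq), ]
h2 <- haps[sample(4, n, replace = TRUE, prob = hap_freq), ]
W_toy <- h1 + h2                         # allele counts at the two loci

# D = freq(AB) - freq(A) * freq(B), computed from all 2n haplotypes.
all_h <- rbind(h1, h2)
D_hat <- mean(all_h[, 1] * all_h[, 2]) - mean(all_h[, 1]) * mean(all_h[, 2])

c(cov = cov(W_toy[, 1], W_toy[, 2]), twoD = 2 * D_hat)
```

The two numbers differ slightly because of sampling noise and because cov() uses an \(n-1\) denominator.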

Question 8

Recall that the \(r^2\) of Hill and Robertson (1968) and the squared correlation (\(r^2\)) computed directly from the SNP marker matrix (allelic counts) are theoretically equivalent. Check whether the two agree using the W3 object. Create a scatter plot of \(r^2\) (Hill and Robertson) vs. \(r^2\) (squared correlation of the SNP matrix). Use the LD() and the cor() functions. How good is the agreement? If they do not agree, explain why.
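
The theoretical equivalence follows from the two results used in the earlier questions. As a sketch, assume random mating and Hardy-Weinberg proportions, so that \(Cov(W_i, W_j) = 2D\) and \(Var(W_i) = 2p_i(1-p_i)\), where \(p_i\) and \(p_j\) are the reference allele frequencies at the two loci. Then

\[
cor(W_i, W_j)^2
= \frac{Cov(W_i, W_j)^2}{Var(W_i)\,Var(W_j)}
= \frac{(2D)^2}{2p_i(1-p_i) \cdot 2p_j(1-p_j)}
= \frac{D^2}{p_i(1-p_i)\,p_j(1-p_j)},
\]

which is the Hill and Robertson \(r^2\). The equality holds for the population quantities under these assumptions; sample estimates need not match exactly.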

Question 9

Select SNP markers only on chromosome 1 and store them as a W4 object.

W4 <- cattleG$geno[, which(cattleG$map$chr == 1)]  # compare only the chr column of the map
dim(W4)

Perform GWAS using single marker ordinary least squares (OLS) and estimate SNP marker effects. Use the objects W4 and pheno, and the function summary(lm()). Save the vector of marker effects into a.
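
Single-marker OLS fits one regression per marker and keeps the slope as the marker effect. A minimal sketch on simulated data, where the names (W_toy, y_toy, a_toy) and the simulated effect of 0.5 for marker 2 are hypothetical:

```r
# Toy example: single-marker regression on simulated genotypes.
set.seed(6)
n <- 300
m <- 5
W_toy <- sapply(runif(m, 0.2, 0.5),
                function(p) rbinom(n, size = 2, prob = p))
y_toy <- 0.5 * W_toy[, 2] + rnorm(n)     # only marker 2 has an effect (assumed)

a_toy <- numeric(m)                       # marker effect estimates
for (j in seq_len(m)) {
  fit <- summary(lm(y_toy ~ W_toy[, j]))
  # Row 1 is the intercept, row 2 is the slope for marker j.
  a_toy[j] <- fit$coefficients[2, "Estimate"]
}
a_toy
```

If a marker is monomorphic, lm() cannot estimate its slope and the coefficient table from summary() has only the intercept row, so such markers need to be handled separately.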

Question 10

Recode the SNP genotypes (W4) so that the major allele is now treated as the reference allele. Store the new coding in the W5 variable. Perform single-marker GWAS using OLS and estimate the SNP marker effects. Use the objects W5 and pheno, and the function lm(). Save the vector of marker effects into a2. What is the difference between a and a2? Interpret the results.

Question 11

Compute the frequency of the reference allele for each SNP marker, as in Question 1. Report the estimate of the multi-locus additive genetic variance under the linkage equilibrium (LE) assumption. Use the objects W4 and a.

Question 12

Compute the multi-locus additive genetic variance that accounts for linkage disequilibrium (LD). Apply the expression based on the correlation between genotypes by using the cor() function. Report the estimate of the additive genetic variance. Use the objects W4 and a.
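
One way to write the two variances, offered as a sketch rather than as the course's exact expression, is \(\sum_j 2p_j(1-p_j)a_j^2\) under LE and \(\sum_j \sum_k a_j a_k r_{jk} \sqrt{2p_j(1-p_j)} \sqrt{2p_k(1-p_k)}\) under LD, where \(r_{jk}\) is the correlation between genotypes at markers \(j\) and \(k\). The example below uses simulated genotypes and hypothetical marker effects (a_toy); none of the values come from the cattle data.

```r
# Toy example: additive genetic variance under LE and under LD.
set.seed(7)
n <- 1000
m1 <- rbinom(n, size = 2, prob = 0.4)
# Marker 2 copies marker 1 for about 70% of individuals, creating LD.
m2 <- ifelse(runif(n) < 0.7, m1, rbinom(n, size = 2, prob = 0.4))
m3 <- rbinom(n, size = 2, prob = 0.3)    # independent marker
W_toy <- cbind(m1, m2, m3)

a_toy <- c(0.5, 0.3, -0.2)               # hypothetical marker effects
p_hat <- colMeans(W_toy) / 2
v <- 2 * p_hat * (1 - p_hat)             # per-marker genotype variance under HWE

var_LE <- sum(v * a_toy^2)               # diagonal terms only

R <- cor(W_toy)                          # genotype correlation matrix
s <- a_toy * sqrt(v)
var_LD <- drop(t(s) %*% R %*% s)         # diagonal plus off-diagonal terms

c(LE = var_LE, LD = var_LD, ratio = var_LE / var_LD)
```

The LE value is the special case in which R is the identity matrix, so the difference between the two estimates comes entirely from the off-diagonal correlations.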

Question 13

What is the proportion of the genetic variance under LD that is explained by the genetic variance under LE?

Question 14

Read Sved and Hill (2018) and summarize the paper in 300 to 500 words.

Gota Morota

February 17, 2020