APSC 5984 Complex Trait Genomics
Homework assignment 2
Due date
Friday, February 28, 5pm
Data
For this assignment, we are going to use the cattle data included in the synbreedData package. Learn more about the Synbreed project and synbreed R packages.
library(synbreed)
library(synbreedData)
help(package = "synbreedData")
data(cattle)
`?`(cattle)
pheno <- as.matrix(cattle$pheno[, 1, 1])
pheno <- scale(pheno)
dim(cattle$geno)
set.seed(100)
cattleG <- codeGeno(cattle, impute = TRUE, impute.type = "random", reference.allele = "minor") # genotype imputation
W <- cattleG$geno # marker genotype matrix
Question 1
Compute the allele frequency of SNP markers. Recall that the expectation of marker genotype, \(E(W)\), is given by \(2p\), where \(p\) is the frequency of the reference allele. Verify that \(2p\) is equal to the mean of each marker genotype obtained from the colMeans() function. Use the all.equal() function.
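As a hedged illustration of the identity above, the sketch below uses a small simulated genotype matrix instead of the cattle data; the names `W_toy` and `p_toy` are hypothetical. With allelic counts coded 0/1/2, the reference allele frequency is the number of reference alleles divided by twice the number of individuals, so \(2p\) matches the column mean by construction.

```r
set.seed(1)
n <- 200; m <- 5
# Toy allelic counts (0, 1, 2); simulated data, not the cattle genotypes
W_toy <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)

# Reference allele frequency: allele count divided by 2n
p_toy <- colSums(W_toy) / (2 * n)

# Compare 2p with the column means
all.equal(2 * p_toy, colMeans(W_toy))
```

The same two lines should carry over to W once it has been imputed, since colMeans() requires a matrix without missing values.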
Question 2
Recall that the variance of marker genotype, \(Var(W)\), is given by \(2p(1-p)\). Verify that \(2p(1-p)\) is close to the variance of each genotype obtained from the var() function.
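The word "close" matters here: \(2p(1-p)\) assumes Hardy-Weinberg proportions, while var() returns a sample variance with divisor \(n - 1\). A minimal sketch on simulated data (all object names are hypothetical, and the genotypes are drawn under Hardy-Weinberg proportions by assumption) shows the kind of side-by-side comparison that can be made:

```r
set.seed(2)
n <- 5000; m <- 5
# Toy allelic counts drawn under Hardy-Weinberg proportions (assumption)
W_toy <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)

p_toy <- colMeans(W_toy) / 2
expected_var <- 2 * p_toy * (1 - p_toy)   # theoretical variance under HWE
observed_var <- apply(W_toy, 2, var)      # sample variance, divisor n - 1

# Side-by-side comparison, one row per marker
round(cbind(expected_var, observed_var), 3)
```

In real data the two columns can differ more than in this toy example, for instance when genotype frequencies depart from Hardy-Weinberg proportions.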
Question 3
Create a new marker matrix X from W and recode markers so that three genotypes \(AA\), \(Aa\), and \(aa\) are coded as 1, 0, and -1, respectively.
Recall that the expectation of genotype, \(E(X)\), is given by \(2p-1\), where \(p\) is the frequency of the reference allele. Verify that \(2p-1\) is equal to the mean of each genotype obtained from the colMeans() function.
Question 4
Recall that the variance of genotype, \(Var(X)\), remains the same and is given by \(2p(1-p)\). Verify that \(2p(1-p)\) is close to the variance of each genotype obtained from the var() function.
Question 5
Verify that no matter how we code markers, centered marker codes, \(W - E(W)\) and \(X - E(X)\), remain the same.
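The three statements about the 1/0/-1 coding follow from \(X = W - 1\): subtracting a constant shifts the mean by that constant, leaves the variance unchanged, and cancels out after centering. A hedged sketch on simulated data (names such as `X_toy`, `Wc`, and `Xc` are hypothetical) checks each of these in turn:

```r
set.seed(3)
n <- 200; m <- 5
W_toy <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)

# Recode 2/1/0 to 1/0/-1 by subtracting 1
X_toy <- W_toy - 1
p_toy <- colMeans(W_toy) / 2

# Mean of X equals 2p - 1
all.equal(colMeans(X_toy), 2 * p_toy - 1)

# Variance is unaffected by the shift
all.equal(apply(W_toy, 2, var), apply(X_toy, 2, var))

# Column-centered matrices coincide
Wc <- sweep(W_toy, 2, colMeans(W_toy))
Xc <- sweep(X_toy, 2, colMeans(X_toy))
all.equal(Wc, Xc)
```

sweep() is used for centering here rather than scale() because scale() attaches the column means as an attribute, which would make all.equal() report a difference in attributes even when the centered values agree.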
Question 6
We will recode the SNP genotypes so that the major allele is now treated as the reference allele. Store the new coding in the W2 variable.
# Recode so that AA -> 0, Aa -> 1, and aa -> 2.
W2 <- W
W2[W2==0] <- 3
W2[W2==2] <- 0
W2[W2==3] <- 2
Compute the allele frequency of SNP markers using W2. Compare your result with the allele frequency obtained from W.
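Swapping the codes 0 and 2 while leaving 1 untouched is the same as computing \(2 - W\), so the frequency of the new reference allele should be \(1 - p\). The sketch below applies the same three-step swap as above to simulated data; `W_toy`, `W2_toy`, `p_toy`, and `p2_toy` are hypothetical names.

```r
set.seed(4)
n <- 200; m <- 5
W_toy <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)

# Three-step swap of 0 and 2, using 3 as a temporary placeholder
W2_toy <- W_toy
W2_toy[W2_toy == 0] <- 3
W2_toy[W2_toy == 2] <- 0
W2_toy[W2_toy == 3] <- 2

# The swap is equivalent to 2 - W
all.equal(W2_toy, 2 - W_toy)

# Allele frequencies under the two codings sum to one
p_toy  <- colMeans(W_toy) / 2
p2_toy <- colMeans(W2_toy) / 2
all.equal(p2_toy, 1 - p_toy)
```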
Question 7
Create a new variable W3 by subsetting the first 10 markers of the W matrix. The dimension of W3 is equal to \(500 \times 10\). Verify that the covariance between allelic counts is \(Cov(W3[,i], W3[,j]) \approx 2D\), where \(D\) is the estimate of linkage disequilibrium. Use the W3 object and the LD() function from the genetics package to obtain \(D\).
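The factor of 2 can be motivated with a short derivation, sketched here under the assumption of random union of gametes. Write each allelic count as the sum of two gametic indicators, \(W_i = g_i + g_i'\), where \(g_i\) and \(g_i'\) equal 1 when the maternal or paternal gamete carries the reference allele at locus \(i\). The symbols \(g_i\), \(g_i'\), and \(p_{ij}\) (the frequency of haplotypes carrying the reference allele at both loci) are notation introduced for this sketch.

\[
\begin{aligned}
Cov(W_i, W_j) &= Cov(g_i + g_i',\; g_j + g_j') \\
&= Cov(g_i, g_j) + Cov(g_i', g_j') + Cov(g_i, g_j') + Cov(g_i', g_j) \\
&= (p_{ij} - p_i p_j) + (p_{ij} - p_i p_j) + 0 + 0 \\
&= 2D
\end{aligned}
\]

The two cross terms vanish because the maternal and paternal gametes are independent under random mating. In a finite sample, and when \(D\) is estimated from unphased genotypes, the equality holds only approximately.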
Question 8
Recall that \(r^2\) of Hill and Robertson (1968) and \(r^2\) (correlation squared) applied directly to the SNP marker matrix (allelic counts) are theoretically equivalent. Check whether these two are the same using the W3 object. Create a scatter plot of \(r^2\) (Hill and Robertson) vs. \(r^2\) (correlation squared of SNP matrix). Use the LD() and the cor() functions. How good is the agreement? If they do not agree, explain why.
Question 9
Select SNP markers only on chromosome 1 and store them as a W4 object.
W4 <- cattleG$geno[, which(cattleG$map$chr == 1)] # compare the chr column only, not the whole map data frame
dim(W4)
Perform GWAS using single marker ordinary least squares (OLS) and estimate SNP marker effects. Use the objects W4 and pheno, and the function summary(lm()). Save the vector of marker effects into a.
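A minimal sketch of the single marker loop is given below on simulated data, assuming every marker is polymorphic so that each regression returns a slope. The objects `W_toy`, `y_toy`, and `a_toy` are hypothetical stand-ins for W4, pheno, and a.

```r
set.seed(9)
n <- 300; m <- 4
W_toy <- matrix(rbinom(n * m, size = 2, prob = 0.4), nrow = n, ncol = m)

# Toy phenotype: marker 1 has a true effect, the rest do not; then standardize
y_toy <- as.numeric(scale(0.5 * W_toy[, 1] + rnorm(n)))

# Fit one regression per marker and keep the slope estimate
a_toy <- numeric(m)
for (j in seq_len(m)) {
  fit <- summary(lm(y_toy ~ W_toy[, j]))
  a_toy[j] <- fit$coefficients[2, "Estimate"]  # row 1 is the intercept
}
a_toy
```

Each slope is the covariance between the marker and the phenotype divided by the marker variance, which offers a quick way to check the loop without refitting the models.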
Question 10
Recode the SNP genotypes (W4) so that the major allele is now treated as the reference allele. Store the new coding in the W5 variable. Perform single marker GWAS using OLS and estimate SNP marker effects. Use the objects W5 and pheno, and the function lm(). Save the vector of marker effects into a2. What is the difference between a and a2? Interpret the results.
Question 11
Compute the frequency of the reference allele for each SNP marker (as in Question 1). Report the estimate of multi-locus additive genetic variance under the linkage equilibrium (LE) assumption. Use the objects W4 and a.
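Under LE the markers are treated as uncorrelated, so the multi-locus variance reduces to a sum of single-locus terms, \(\sum_j 2p_j(1-p_j)a_j^2\). The sketch below evaluates this sum on simulated genotypes with made-up marker effects; `W_toy`, `a_toy`, `p_toy`, and `VA_LE` are hypothetical names, and the effect values are arbitrary.

```r
set.seed(11)
n <- 500; m <- 4
W_toy <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n, ncol = m)
a_toy <- c(0.2, -0.1, 0.05, 0.3)   # hypothetical marker effects

p_toy <- colMeans(W_toy) / 2

# Sum of per-marker contributions 2p(1 - p)a^2
VA_LE <- sum(2 * p_toy * (1 - p_toy) * a_toy^2)
VA_LE
```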
Question 12
Compute multi-locus additive genetic variance that accounts for linkage disequilibrium (LD). Apply the expression based on the correlation between genotypes by using the cor() function. Report the estimate of additive genetic variance. Use the objects W4 and a.
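One way to write the correlation-based expression is \(\sum_j 2p_jq_ja_j^2 + \sum_{j \neq k} a_j a_k r_{jk}\sqrt{2p_jq_j \cdot 2p_kq_k}\), with \(q_j = 1 - p_j\) and \(r_{jk}\) the correlation between genotypes at markers \(j\) and \(k\). This form is an assumption about which expression is intended. The sketch below evaluates it as a quadratic form on simulated genotypes in which the first two markers are in LD; all object names and effect values are hypothetical.

```r
set.seed(12)
n <- 500
# Toy haplotypes: marker 2 copies marker 1 with probability 0.8, creating LD
h1 <- rbinom(2 * n, 1, 0.4)
h2 <- ifelse(runif(2 * n) < 0.8, h1, rbinom(2 * n, 1, 0.4))
h3 <- rbinom(2 * n, 1, 0.3)
H <- cbind(h1, h2, h3)

# Pair haplotypes into diploid genotypes coded 0/1/2
W_toy <- H[1:n, ] + H[(n + 1):(2 * n), ]
a_toy <- c(0.2, 0.15, -0.1)               # hypothetical marker effects

p_toy <- colMeans(W_toy) / 2
s_toy <- sqrt(2 * p_toy * (1 - p_toy))    # genotype SD under HWE
R_toy <- cor(W_toy)                       # correlation between genotypes

# LE keeps only the diagonal of R; LD uses the full matrix
VA_LE <- sum((a_toy * s_toy)^2)
VA_LD <- as.numeric(t(a_toy * s_toy) %*% R_toy %*% (a_toy * s_toy))
c(LE = VA_LE, LD = VA_LD)
```

Because the diagonal of the correlation matrix is 1, the LE estimate is recovered as the diagonal part of the quadratic form, and the difference between the two values is the contribution of the off-diagonal LD terms.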
Question 13
What is the proportion of the genetic variance under LD that is explained by the genetic variance under LE?
Question 14
Read Sved and Hill (2018) and summarize the paper in 300-500 words.