Population stratification

Rice data

Download the Rice Diversity Panel data RiceDiversity.44K.MSU6.Genotypes_PLINK.zip and RiceDiversity.44K.germplasm.csv from http://ricediversity.org/data/sets/44kgwas/.

Read files

The genotype data set is provided in the PLINK format. We will use the read_ped function from the BGLR package to read the PLINK format files into the R environment.

library(BGLR)
out<- read_ped("RiceDiversity_44K_Genotypes_PLINK/sativas413.ped")
p=out$p
n=out$n
out=out$x
#Recode snp to 0,1,2 format using allele 1
# 0 --> 0
# 1 --> 1
# 2 --> NA
# 3 --> 2
out[out==2]=NA
out[out==3]=2
W <- matrix(out,nrow=p,ncol=n,byrow=TRUE)
W <- t(W) # 413 x 36901

# Use the first 3000 SNPs
W <- W[, 1:3000] # or load("W.Rda")

The accession IDs (individual IDs) are included in .fam file.

# accession ID
fam <-read.table("RiceDiversity_44K_Genotypes_PLINK/sativas413.fam", header = FALSE, stringsAsFactors = FALSE)  
head(fam)
rownames(W) <- fam$V2 # 413 x 36901

We will use the read.table function to read the phenotype file RiceDiversity_44K_Phenotypes_34traits_PLINK.txt from the direct link http://www.ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt.

# phenotypes
rice.pheno <- read.table("http://www.ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE, stringsAsFactors = FALSE, sep = "\t")
rice.pheno <- read.table("RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", header=TRUE, stringsAsFactors = FALSE, sep = "\t")
table(rownames(W) == rice.pheno$NSFTVID)
y <- matrix(rice.pheno$Flowering.time.at.Arkansas) # # use the first trait 
rownames(y) <- rice.pheno$NSFTVID
index <- !is.na(y)
y <- y[index, 1, drop=FALSE] # 374
W <- W[index, ] # 374 x 36901
table(rownames(W) == rownames(y))

The extent of population structure in rice

This data set is known to exhibit a subpopulation structure as the panel contains indica, aus, temperate japonica, tropical japonica, aromatic, and highly admixed accessions. We will create a PC plot to visualize the rice substructure.

# PC plots
gp <-read.csv("RiceDiversity.44K.germplasm.csv", header = TRUE, skip = 1,  stringsAsFactors = FALSE)   # 431 x 12
gp2 <- gp[match(rownames(y), gp$NSFTV.ID), ]
table(rownames(y) == gp2$NSFTV.ID)

plot(gp2$PC1, gp2$PC2, xlab="PC1", ylab="PC2", col=c(1:6)[factor(gp2$Sub.population)])
legend(x="topleft", legend = levels(factor(gp2$Sub.population)), col=c(1:6), pch=1, cex=0.27)

Genotype imputation

Replace missing marker genotypes with mean values. Then store the marker genotypes in a matrix object W.

for (j in 1:ncol(W)){
  W[,j] <- ifelse(is.na(W[,j]), mean(W[,j], na.rm=TRUE), W[,j])
}

Quality control

Perform a quality control analysis by removing markers with MAF < 0.05. How many markers are removed? Save the filtered genotype matrix in W2.

p <- colSums(W) / (2*nrow(W)) # or colMean(X) / 2
maf <- ifelse(p > 0.5, 1-p, p)
maf.index <- which(maf < 0.05)
W2 <- W[, -maf.index]  # 374 x 33558 or 374 x 2730

Standardization of genotype matrix

Standardize the genotype matrix to have a mean of zero and variance of one. Save this matrix as Ws.

Ws <- scale(W2, center = TRUE, scale = TRUE)

G matrix

Compute the second genomic relationship matrix, G, of VanRaden (2008) using the entire markers. Then add a very small positive constant (e.g., 0.001) to the diagonal elements so that G matrix is invertible.

# dimensions 
n <- nrow(Ws)
m <- ncol(Ws)

G <- tcrossprod(Ws) / ncol(Ws)
G <- G + diag(n)*0.001

Question 1 - Mixed model association

Set up mixed model equations (MME) by fitting a single marker-based linear mixed model \(\mathbf{y = Xb + Zu + e}\), where \(\mathbf{X}\) is the incident matrix of fixed effects (intercept and single marker code), \(\mathbf{b}\) is the regression coefficients for the intercept and marker effect, \(\mathbf{Z}\) is the incident matrix of individuals, \(\mathbf{u} \sim N(0, \mathbf{G}\sigma^2_u)\) is the genetic effects , and \(\mathbf{e} \sim N(0, \mathbf{I}\sigma^2_e)\) is the residual. Directly take the inverse of LHS to obtain the solutions for MME. Report the estimates of additive marker effects and their associated -log10 of \(p\)-values. Use \(\lambda = 42.08053 / 173.0836 = 0.2431225\).

Question 2 - Mixed model association

Reparametrize the mixed model and fit weighted least squares. In this parameterization, the random effects of individuals \(\mathbf{u} \sim N(0, \mathbf{G}\sigma^2_u)\) are included in the error term. Do not use the lm() function. Report the estimates of additive marker effects and their associated -log10 of \(p\)-values. Use \(\lambda = 42.08053 / 173.0836 = 0.2431225\).

Question 3 - Mixed model association

Repeat Question 2 but fit weighted least squares by using the lm() function. Report the estimates of additive marker effects and their associated -log10 of \(p\)-values.

Question 4 - Mixed model association

Repeat Question 1 and fit a single-marker-based linear mixed model by using the GWAS function in the rrBLUP R package. Report the -log10 of p-values for additive marker effects.