Linear mixed model for GWAS

class: center, middle, inverse, title-slide

# Linear mixed model for GWAS
## LIFE 891-002: Integrating Quantitative and Computational Biology into Life Sciences Research
### Gota Morota
### 2018/02/19

---

class: inverse, center, middle

# Quantitative Genetics

Analysis of complex or multifactorial traits

All genes affect all traits - the question is by how much?

Infinitesimal model

Oligogenic model

---

# What is quantitative genetics?

Population genetics
- **Mathematics** is language of population genetics, **population genetics** is language of **evolution**.

Quantitative genetics 
- **Statistics** is language of quantitative genetics, **quantitative genetics** is language of **complex trait genetics**.

**Phenotypes** first in quantitative genetics

In the era of genomics, phenotype is **king**

---

# Regression model
Galton (1886). Regression towards mediocrity in hereditary stature
"<img src="galton1886.png" width=600 height=380>

---
# Prediction vs. Inference 
Complex traits are controlled by large number of genes with small effects, and  influenced by both genetics and environments

- Inference (location)
    - average effects of allele substitution

- Inference (variability)
   - variance component estimation
   - genomic heritability
  
Combination of above two (e.g., estimate proportion of additive genetic variance explained by QTLs)

- Prediction 
    - genomic selection
    - prediction of yet-to-be observed phenotypes

---
# Prediction vs. Inference 
<div align="center">
<img src="Lo2015PNAS.png" width=900 height=400>
</div>

* [http://www.pnas.org/content/112/45/13892.abstract
](http://www.pnas.org/content/112/45/13892.abstract
)

---
# How to parameterize  response variable y?

- Prediction of additive genetic effects 
     - `\(\mathbf{ y =  E + a + \boldsymbol{\epsilon}}\)` 
     
- Prediction of total genetic effects **parametrically**
     - `\(\mathbf{ y =  \mathbf{E} + \underbrace{\mathbf{ a + d + a*a + a*d + d*d}}_{g} + } \boldsymbol{\epsilon}\)` 
     
- Prediction of total genetic effects **non-parametrically**
     - `\(\mathbf{ y =  \mathbf{E} + \mathbf{g} + \boldsymbol{\epsilon}}\)`

---
# Phenotypes
![](plant_01.png)
&nbsp;
&nbsp;
![](plant_02.jpg)

.center[Image data]

---
# Genomic information (e.g., SNPs)
![](SNPs.png)

.center[Repeat of numbers 0, 1, and 2]

---
# Quantitative genetics
Connecting image data with genomic information 
<center>
<div>
<img src="plant_01.png" width=100 height=100>
=
<img src="SNPs.png" width=100 height=100> + error 
</div>
</center>
This is equivalent to 
`\begin{align*}
\mathbf{y} &= \mathbf{W}\mathbf{a} + \boldsymbol{\epsilon} \\
\underbrace{\begin{bmatrix}    
y_1\\
y_2\\
\vdots \\
y_n\end{bmatrix}}_{n \times 1}  
&= \underbrace{\begin{bmatrix}
   w_{11} & w_{12} & \cdots & w_{1m} \\
    w_{21} & w_{22} & \cdots & w_{2m} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{n1} & w_{n2} & \cdots & w_{nm}
\end{bmatrix}}_{n \times m} \quad
\underbrace{\begin{bmatrix}
a_1\\
a_2\\
\vdots \\
a_m\end{bmatrix}}_{m \times 1} 
+\underbrace{\begin{bmatrix}
\epsilon_1\\
\epsilon_2\\
\vdots \\
\epsilon_m\end{bmatrix}}_{n \times 1} 
\end{align*}`

where `\(n\)` is the number of individuals (e.g., accessions) and `\(m\)` is the number of SNPs.

---
# Genetic values
Quantitative genetic model:

`\begin{align*}
\mathbf{y} &= \mathbf{g} + \boldsymbol{\epsilon} \\
\end{align*}`
where `\(\mathbf{y}\)` is the vector of observed phenotypes, `\(\mathbf{g}\)` is the vector of genetic values, and `\(\boldsymbol{\epsilon}\)` is the vector of residuals.

Example:

| Plant ID   | y           | g  |   e  |
| ------------- |:-------------:| -----:|------|
| 1      | 10 | ? |  ?     |
| 2      |  7     |   ? |  ?    |
| 3 | 12     |    ? |    ? |

---
# Genetic values
Quantitative genetic model:

Example:

| Plant ID    | y           | g  |   e  |
| ------------- |:-------------:| -----:|------|
| 1      | 10 | 5 |  5     |
| 2      |  7     |   6 |  1    |
| 3 | 12     |    2 |    10 |

We approximate unknown `\(\mathbf{g}\)` with `\(\mathbf{Wa}\)`.

`\begin{align*}
\mathbf{y} &= \mathbf{g} + \boldsymbol{\epsilon} \\
 &\approx \mathbf{W}\mathbf{a} + \boldsymbol{\epsilon} 
\end{align*}`

---
# Expectation and variance 
Define the random variable `\(W\)` which counts the number of reference allele `\(A\)`.
`\begin{align*}
  W &=
  \begin{cases}
    2 & \text{if } AA  \text{ with frequency } p^2 \\
    1 & \text{if } Aa \text{ with frequency } 2p(1-p) \\
    0 & \text{if } aa \text{ with frequency } (1-p)^2
  \end{cases} \\
\end{align*}`
where `\(p\)` is the allele frequency of `\(A\)`.

Then, 
`\begin{align*}
E[W] &= 0 \times (1 - p_j)^2 + 1 \times [2p(1-p)] + 2 \times p^2 \\
&= 2p \\
E[W^2] &= 0^2 \times (1 - p_j)^2 + 1^2 \times [2p(1-p)] + 2^2 \times p^2 \\
&= 2p(1-p) + 4p^2 \\
\end{align*}`
Thus, the variance of allelic counts is 
`\begin{align*}
Var(W) &= E[W^2] - E[W]^2 \\
&= 2p(1-p) + 4p^2  - 4p^2\\
&= 2p(1-p)
\end{align*}`

---
# Alternative coding 
Define the random variable `\(W\)` which counts the number of reference allele `\(A\)`.
`\begin{align*}
  W &=
  \begin{cases}
    1 & \text{if } AA  \text{ with frequency } p^2 \\
    0 & \text{if } Aa \text{ with frequency } 2p(1-p) \\
    -1 & \text{if } aa \text{ with frequency } (1-p)^2
  \end{cases} \\
\end{align*}`
where `\(p\)` is the allele frequency of `\(A\)`.

Then, 
`\begin{align*}
E[W] &= -1 \times (1 - p_j)^2 + 0 \times [2p(1-p)] + 1 \times p^2 \\
&= −(1 − 2p + p^2) + p^2 = 2p-1 \\
E[W^2] &= (-1)^2 \times (1 - p_j)^2 + 0^2 \times [2p(1-p)] + 1^2 \times p^2 \\
&= 1 − 2p + p^2 +p^2  = 2p^2 − 2p + 1 \\
\end{align*}`
Thus, the variance of allelic counts is 
`\begin{align*}
Var(W) &= E[W^2] - E[W]^2 \\
&= 2p^2 − 2p + 1 − (4p^2 − 4p + 1)\\
&= -2p^2 + 2p = 2p(1-p)
\end{align*}`

---
# Centered marker codes

`\begin{align*}
  W - E(W) &=
  \begin{cases}
    2 -2p & \text{if } AA  \text{ with frequency } p^2 \\
    1 - 2p & \text{if } Aa \text{ with frequency } 2p(1-p) \\
    0 - 2p & \text{if } aa \text{ with frequency } (1-p)^2
  \end{cases} \\
\end{align*}`

`\begin{align*}
  W - E(W) &=
  \begin{cases}
    1 - (2p-1) = 2 -2p& \text{if } AA  \text{ with frequency } p^2 \\
    0 - (2p-1)  = 1 - 2p & \text{if } Aa \text{ with frequency } 2p(1-p) \\
    -1 - (2p-1) = 0 - 2p & \text{if } aa \text{ with frequency } (1-p)^2
  \end{cases} \\
\end{align*}`
where `\(p\)` is the allele frequency of `\(A\)`.

Therefore, the variance and the centered codes are the same.

---
class: inverse, center, middle

# Population structure

---
# Principal components (PC)
PC captures population structure

- Menozzi et al. (1978).  [doi:10.1126/science.356262](http://doi.org/10.1126/science.356262)
- Cavalli-Sforza et al. (1996). [ISBN-13: 978-0691029054](http://www.amazon.com/gp/product/0691029059/)

Example in rice diversity panel data
![](riceDiversityPC1-2.png)

---
# Population stratification 
Population structure as a confounder
<div align="center">
<img src="myBalding2006.png" width=700 height=260>
</div>

- Knowler et al. (1988) Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet. 
- inflation of SNP effect sizes
- inflation of genomic heritability
- overestimation of prediction accuracy

---
# What are principal components?
`\(n \times m\)` matrix of SNPs ( `\(\mathbf{W}\)` )

- singular value decomposition of `\(\mathbf{W} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}'\)`

`\(n \times n\)` genomic relationship matrix ( `\(\mathbf{G}\)` )

- eigen decomposition of `\(\mathbf{G} = \mathbf{U}\mathbf{D}\mathbf{U}'\)`

Principal components

1. `\(\text{PC} = \mathbf{U}\)`
2. `\(\text{PC} = \sqrt{\mathbf{D}}\mathbf{U}\)`

---
# Genomic relationship matrix (1)
Recall that
`\begin{align*}
\mathbf{y} &= \mathbf{g} + \boldsymbol{\epsilon} = \mathbf{W}_c\mathbf{a} + \boldsymbol{\epsilon}
\end{align*}`

Assume genetic value is parameterized as `\(g_{i} = \sum w_{ij} a_j\)`
where both `\(w\)` and `\(a\)` are treated as random and independent. 
Assuming linkage equilibrium of markers (all loci are mutually independent) 
`\begin{align*}
\sigma^2_g &= \sum_j 2 p_j(1-p_j) \cdot \sigma^2_{a_j}. \notag \\
\end{align*}`
Under the homogeneous marker variance assumption
`\begin{align}
\sigma^2_{a} &= \frac{\sigma^2_g}{2 \sum_j p_j(1-p_j) }.
\end{align}`
Then, variance of genetic values is 
`\begin{align*}
Var(\mathbf{g}) &= Var(\mathbf{W}_c\mathbf{a}) = \mathbf{W_cW'_c}\sigma^2_{a} \\
&= \frac{\mathbf{W_cW'_c}}{2 \sum_j p_j(1-p_j)} \sigma^2_g = \mathbf{G}\sigma^2_g
\end{align*}`

---
# Genomic relationship matrix (2)
Similarly, 
`\begin{align*}
\sigma^2_g &=   \sum^m_{j=1} 2p_{j}(1 - p_j)\sigma^2_{a}  \\
&= m \sigma^2_{a}
\end{align*}`

- homogeneous marker variance assumption

- if assumed that all markers have variance 1 (following standardizing marker genotypes)

- the marked genetic variance is given by the sum of individual marker variances

`\begin{align*}
\sigma^2_{a} = \sigma^2_g / m
\end{align*}`

Then, variance of genetic values is 
`\begin{align*}
Var(\mathbf{g}) &= Var(\mathbf{W}_{cs}\mathbf{a}) = \mathbf{W_{cs}W'_{cs}}\sigma^2_{a} \\
&= \frac{\mathbf{W_{cs}W'_{cs}}}{m} \sigma^2_g = \mathbf{G}\sigma^2_g
\end{align*}`

---
class: inverse, center, middle

# Genome-wide association studies (GWAS)

---
# Ordinary least squares (OLS) 
Quantitative genetic model: `\(\mathbf{y} = \mathbf{Wa} + \boldsymbol{\epsilon}\)`

How to find the SNP effects ( `\(\mathbf{a}\)` )?

We minimize the residual sum of squares
`\begin{align*}
\boldsymbol{\epsilon}' \boldsymbol{\epsilon} &= (\mathbf{y-Wa})'(\mathbf{y-Wa}) \\
&= \mathbf{y}'\mathbf{y}  - \mathbf{y}'\mathbf{W} \mathbf{a}- \mathbf{a}'\mathbf{W}'\mathbf{y} + \mathbf{a}'\mathbf{W}'\mathbf{W}\mathbf{a}  \\
&= \mathbf{y}'\mathbf{y} - 2\mathbf{a}'\mathbf{W}'\mathbf{y} + \mathbf{a}'\mathbf{W}'\mathbf{W}\mathbf{a}
\end{align*}`

We take a partial derivative with respect to `\(\mathbf{a}\)`
`\begin{align*}
  \frac{\partial \boldsymbol{\epsilon\epsilon}'}{\partial \boldsymbol{a}} &= 2 \mathbf{W}'\mathbf{W} \mathbf{a} - 2\mathbf{X}'\mathbf{y}
\end{align*}`

By setting the equation equal to zero, we obtain a least square estimator of `\(\mathbf{a}\)`. 
`\begin{align*}
\mathbf{W}'\mathbf{W} \mathbf{a} &= \mathbf{W}' \mathbf{y} \\
\hat{\mathbf{a}} &= (\mathbf{W}'\mathbf{W})^{-1} \mathbf{W}' \mathbf{y}
\end{align*}`

---
# Ordinary least squares (OLS) 
- `\(\hat{\mathbf{a}}\)` is the vector of regression coefficient for markers, i.e., effect size of SNPs

- if the Gauss-Markov theorem is met, `\(E[\hat{\mathbf{a}}] = \mathbf{a} \rightarrow\)` BLUE 
    - `\(E[\boldsymbol{\epsilon}] = 0\)`,  `\(Var[\boldsymbol{\epsilon}] = \mathbf{I}\sigma^2_{\epsilon}\)`

What if  number of SNPs ( `\(m\)` ) `\(>>\)`  number of individuals ( `\(n\)` ) ???

- `\((\mathbf{W}'\mathbf{W})^{-1}\)` does not exist

- Effective degrees of freedom

---
# OLS: Single marker regression
Test each marker for the presence of QTLs and select those with significant effects

Problems: marker effect sizes are exaggerated

Suppose the true model is given by two causal SNPs 
`\begin{align*}
  \mathbf{y} & = \mathbf{w}_1a_1 + \mathbf{w}_2a_2 + \boldsymbol{\epsilon} \\
  \mathbf{y} & = \begin{smallmatrix} \underbrace{\mathbf{W}}_{n \times 2}\underbrace{\mathbf{a}}_{2 \times 1} \end{smallmatrix} + \boldsymbol{\epsilon}
\end{align*}`

The OLS estimator for the full mdoel is 
`\begin{align*}
  \begin{bmatrix}
    \hat{a}_1 \\
    \hat{a}_2 
  \end{bmatrix} &=
    \begin{bmatrix}
      \mathbf{w}'_1\mathbf{w}_1 & \mathbf{w}'_1\mathbf{w}_2 \\
      \mathbf{w}'_2\mathbf{w}_1 & \mathbf{w}'_2\mathbf{w}_2 
    \end{bmatrix}^{-1}
    \begin{bmatrix}
      \mathbf{w}'_1\mathbf{y} \\
      \mathbf{w}'_2\mathbf{y} \\
    \end{bmatrix} \\
    \hat{\mathbf{a}} &= (\mathbf{W'W})^{-1}\mathbf{W}'\mathbf{y}
\end{align*}`

---
# OLS: Single marker regression
The expectation of `\(\hat{\mathbf{a}}\)` is
`\begin{align*}
  E(\hat{\mathbf{a}} | \mathbf{W}) = (\mathbf{W'W})^{-1}\mathbf{W'}E(\mathbf{y} ) = (\mathbf{W'W})^{-1}\mathbf{W'Wa} = \mathbf{a}
\end{align*}`
which is a nice property of BLUE.

Now, what if we fit a single SNP model `\(\mathbf{y} = \mathbf{w}_1a_1 + \boldsymbol{\epsilon}\)`?

The OLS estimate is `\(\hat{a}_1 = (\mathbf{w'_1w_1})^{-1}\mathbf{w}'_1\mathbf{y}\)`

The expectation of `\(\hat{a}_1\)` is
`\begin{align*}
  E(\hat{a}_1 | \mathbf{w}_1) &= (\mathbf{w'_1w_1})^{-1}\mathbf{w}'_1E(\mathbf{y}) \\
  &= (\mathbf{w'_1w_1})^{-1}\mathbf{w'_1}[\mathbf{w_1a_1 + w_2a_2}] \\
  &= \mathbf{(w'_1w_1)}^{-1}\mathbf{w'_1w_1}a_1 + (\mathbf{w'_1w_1})^{-1}\mathbf{w'_1w_2}a_2 \\
  &= a_1 + (\mathbf{w'_1w_1})^{-1}\mathbf{w'_1w_2}a_2 
\end{align*}`

- OLS is biased if full model holds but fit a misspecified model
- this bias is proportional to `\((\mathbf{w'_1w_1})^{-1}\mathbf{w'_1w_2}a_2\)`
- the same applies when there are more than two SNPs

---
class: inverse, center, middle

# Linear mixed model

---
# Linear mixed model for GWAS 
Single marker-based mixed model association (MMA)
`\begin{align*}
\mathbf{y} &= \mu + \mathbf{w_ja_j} + \mathbf{Zg} + \boldsymbol{\epsilon} \\
\mathbf{g} &\sim N(0, \mathbf{G}\sigma^2_{g})
\end{align*}`

`\(\mathbf{G}\)` captures population structure and polygenic effects

Double counting?

Alternatively, 
`\begin{align*}
\mathbf{y} &= \mu + \mathbf{w_ja_j} + \mathbf{Zg} + \boldsymbol{\epsilon} \\
\mathbf{g} &\sim N(0, \mathbf{G}_{-k}\sigma^2_{g_{-k}})
\end{align*}`
where `\(-k\)` denotes the `\(k\)`th chromosome removed

---
# How to solve the linear mixed model?

1:  Mixed model equations (MME)
`\begin{align*}
\mathbf{y} &= \mu + \mathbf{w_ja_j} + \mathbf{Zg} + \boldsymbol{\epsilon} \\
\end{align*}`

The mixed model equations of [Henderson (1949)](http://morotalab.org/literature/pdf/henderson1949.pdf) are given by

2:  Weighted least squares
`\begin{align*}
  \hat{\mathbf{a}} &= (\mathbf{W'U T U'W})^{-1}\mathbf{W'U} \mathbf{T} \mathbf{U'y} 
\end{align*}`
where
`\begin{align*}
\mathbf{T} = [\mathbf{D}  + \lambda \mathbf{I}  ]^{-1}
\end{align*}`

---
# Important literature
Animal

* [Kennedy et al. 1992.](https://doi.org/10.2527/1992.7072000x) Estimation of effects of single genes on quantitative traits. J Anim Sci. 70:2000-2012.

Plant

* [Yu et al. 2006.](https://dx.doi.org/10.1038/ng1702) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 38:203-208.